Experimental scGPT human-pancreas annotation#

Who this notebook is for#

This notebook is for researchers who want to see the wrapper-first adapt_annotation(...) path on a non-PBMC human dataset.

Prerequisites#

  • Install scdlkit[foundation,tutorials]

  • Be comfortable working with AnnData

  • Expect this workflow to stay experimental in v0.1.7

What you will learn#

  • how to load the cached OpenProblems human-pancreas quickstart subset

  • how to inspect dataset compatibility before adaptation

  • how to run the easiest public annotation path on a beyond-PBMC human dataset

  • how to compare frozen probe and head-only tuning by default, with LoRA kept as an opt-in heavier path

  • how to save and reload the best fitted runner

Out of scope#

  • full-backbone fine-tuning

  • non-human data

  • checkpoints other than whole-human

  • raw OpenProblems preprocessing details

  • perturbation, spatial, or multimodal workflows

Outline#

  1. load the cached pancreas quickstart subset

  2. inspect the dataset card and compatibility report

  3. run the wrapper-first adaptation workflow

  4. annotate AnnData with the best strategy

  5. save and reload the fitted runner

  6. verify prediction consistency and review the output bundle

Expected artifacts#

  • artifacts/scgpt_human_pancreas_annotation/report.md

  • artifacts/scgpt_human_pancreas_annotation/report.csv

  • artifacts/scgpt_human_pancreas_annotation/strategy_metrics.csv

  • artifacts/scgpt_human_pancreas_annotation/best_strategy_confusion_matrix.png

  • artifacts/scgpt_human_pancreas_annotation/frozen_embedding_umap.png

  • artifacts/scgpt_human_pancreas_annotation/best_strategy_embedding_umap.png

  • artifacts/scgpt_human_pancreas_annotation/best_model/manifest.json

  • artifacts/scgpt_human_pancreas_annotation/best_model/model_state.pt

Next steps#

  • deeper experimental comparison: docs/guides/annotation-benchmarks.md

  • lower-level control: examples/scgpt_cell_type_annotation.ipynb

  • API pages: docs/api/annotation.md and docs/api/foundation.md

Published tutorial status#

This page is a static notebook copy published for documentation review. It is meant to show the exact workflow and outputs from the last recorded run.

  • Last run date: 2026-03-27 10:18 UTC

  • Publication mode: static executed tutorial

  • Execution profile: published

  • Artifact check in this sync: passed

  • Source notebook: examples/scgpt_human_pancreas_annotation.ipynb

import json
from pathlib import Path

import numpy as np
import pandas as pd
import torch

from scdlkit import (
    AnnotationRunner,
    adapt_annotation,
    inspect_annotation_data,
)
from scdlkit._datasets.openproblems import (
    load_openproblems_pancreas_annotation_dataset,
)

PROFILE = "quickstart"
CONFIGS = {
    "quickstart": {
        "seed": 42,
        "batch_size": 32,
        "strategies": ("frozen_probe", "head"),
    },
    "full": {
        "seed": 42,
        "batch_size": 64,
        "strategies": ("frozen_probe", "head", "lora"),
    },
}
config = CONFIGS[PROFILE]
output_dir = Path("artifacts/scgpt_human_pancreas_annotation")
output_dir.mkdir(parents=True, exist_ok=True)
print(f"Using device: {'cuda' if torch.cuda.is_available() else 'cpu'}")
np.random.seed(config["seed"])
torch.manual_seed(config["seed"])
Using device: cpu

Load the cached quickstart pancreas subset#

This tutorial stays on the public wrapper path but loads a repo-managed cached subset, because the raw OpenProblems pancreas file is large. The executed docs notebook uses the deterministic quickstart subset so the walkthrough stays CPU-practical.

This is still the first beyond-PBMC human benchmark in scDLKit, not a claim of broad stable foundation-model support.

adata = load_openproblems_pancreas_annotation_dataset(profile=PROFILE)

dataset_card = {
    "profile": PROFILE,
    "cells": int(adata.n_obs),
    "genes": int(adata.n_vars),
    "label_key": "cell_type",
    "batch_key": "batch",
    "selected_cell_types": sorted(adata.obs["cell_type"].astype(str).unique().tolist()),
    "num_batches": int(adata.obs["batch"].astype(str).nunique()),
}
dataset_card
{'profile': 'quickstart',
 'cells': 512,
 'genes': 1024,
 'label_key': 'cell_type',
 'batch_key': 'batch',
 'selected_cell_types': ['acinar',
  'activated_stellate',
  'alpha',
  'beta',
  'delta',
  'ductal',
  'endothelial',
  'gamma'],
 'num_batches': 9}

Inspect the dataset before adaptation#

This is the wrapper-first preflight step. It answers the practical question: does this deterministic pancreas subset look compatible with the current experimental scGPT annotation path before we spend time training anything?

report = inspect_annotation_data(
    adata,
    label_key="cell_type",
    checkpoint="whole-human",
    min_gene_overlap=min(500, adata.n_vars),
)
report_frame = pd.DataFrame(
    {
        "field": [
            "checkpoint_id",
            "num_cells",
            "num_input_genes",
            "num_genes_matched",
            "gene_overlap_ratio",
            "min_class_count",
            "stratify_possible",
            "label_categories",
            "warnings",
        ],
        "value": [
            report.checkpoint_id,
            report.num_cells,
            report.num_input_genes,
            report.num_genes_matched,
            round(report.gene_overlap_ratio, 4),
            report.min_class_count,
            report.stratify_possible,
            report.label_categories,
            report.warnings,
        ],
    }
)
report_frame
field value
0 checkpoint_id whole-human
1 num_cells 512
2 num_input_genes 1024
3 num_genes_matched 996
4 gene_overlap_ratio 0.9727
5 min_class_count 10
6 stratify_possible True
7 label_categories (acinar, activated_stellate, alpha, beta, delt...
8 warnings ()
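The `min_class_count` and `stratify_possible` fields summarize label balance, and it can be worth eyeballing the per-batch label counts yourself before trusting a stratified split. A minimal sketch on a toy `obs` table (in the notebook you would pass `adata.obs` directly; the toy values here are illustrative):

```python
import pandas as pd

# Toy stand-in for adata.obs with the same label and batch columns.
obs = pd.DataFrame({
    "cell_type": ["alpha", "alpha", "beta", "beta", "gamma", "gamma"],
    "batch": ["celseq", "celseq2", "celseq", "celseq2", "celseq", "celseq"],
})

# Cells per label per batch; very small per-cell counts make stratified
# train/validation/test splits fragile.
counts = pd.crosstab(obs["cell_type"], obs["batch"])
min_class_count = int(obs["cell_type"].value_counts().min())
print(counts)
print("min_class_count:", min_class_count)
```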

Run the one-shot wrapper#

adapt_annotation(...) remains the easiest public adaptation path in this release line. In the quickstart profile it compares only frozen-probe and head-only tuning; LoRA stays available by explicit opt-in and is evaluated in the heavier evidence workflow rather than in the default docs notebook path.

runner = adapt_annotation(
    adata,
    label_key="cell_type",
    strategies=config["strategies"],
    batch_size=config["batch_size"],
    device="auto",
    output_dir=output_dir,
)
runner.summary_.strategy_metrics
strategy validation_accuracy validation_macro_f1 validation_balanced_accuracy validation_auroc_ovr test_accuracy test_macro_f1 test_balanced_accuracy test_auroc_ovr runtime_sec trainable_parameters
0 head 0.805195 0.524471 0.536458 NaN 0.779221 0.584684 0.565625 NaN 2339.427733 5128
1 frozen_probe 0.779221 0.418743 0.453125 0.968261 0.714286 0.395245 0.415625 0.923283 170.017979 0
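The wrapper selects the best strategy internally. For intuition, a comparable selection by validation macro-F1 can be sketched on a toy frame mirroring the columns above (the metric choice and the runtime tiebreak are illustrative, not the library's documented selection rule):

```python
import pandas as pd

# Toy frame mirroring two columns of the strategy_metrics table above.
metrics = pd.DataFrame({
    "strategy": ["head", "frozen_probe"],
    "validation_macro_f1": [0.524471, 0.418743],
    "runtime_sec": [2339.43, 170.02],
})

# Rank by validation macro-F1 (robust to the class imbalance visible in
# this subset), breaking ties toward the cheaper strategy.
ranked = metrics.sort_values(
    ["validation_macro_f1", "runtime_sec"], ascending=[False, True]
)
best_strategy = ranked.iloc[0]["strategy"]
print(best_strategy)
```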

Annotate the dataset with the best strategy#

The wrapper writes predicted labels, confidence scores, and the best latent embedding back into the same AnnData object so the downstream Scanpy handoff stays simple.

runner.annotate_adata(
    adata,
    obs_key="scgpt_label",
    embedding_key="X_scgpt_best",
)
adata.obs[["cell_type", "batch", "scgpt_label", "scgpt_label_confidence"]].head()
cell_type batch scgpt_label scgpt_label_confidence
D17All1_72 gamma celseq alpha 0.522375
D28-1_74 gamma celseq2 alpha 0.622307
D29-6_88 gamma celseq2 alpha 0.599043
D31-7_15 gamma celseq2 alpha 0.577890
9th-C66_S33 gamma fluidigmc1 alpha 0.408202
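All five preview cells above are predicted `alpha` with moderate confidence, so a common follow-up is flagging low-confidence calls for manual review. A sketch with the confidence values shown above (the 0.5 cutoff is an illustrative choice, not a scdlkit default; in the notebook this would read `adata.obs["scgpt_label_confidence"]`):

```python
import pandas as pd

# Confidence values copied from the preview rows above.
conf = pd.Series(
    [0.522375, 0.622307, 0.599043, 0.577890, 0.408202],
    index=["D17All1_72", "D28-1_74", "D29-6_88", "D31-7_15", "9th-C66_S33"],
    name="scgpt_label_confidence",
)

# Flag cells below an illustrative review threshold.
needs_review = conf[conf < 0.5]
print(needs_review.index.tolist())
```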

Save and reload the fitted runner#

This is the reusability step researchers usually care about most: compare strategies once, keep the best fitted runner, and reload it later without rerunning the whole adaptation loop.

save_dir = runner.save(output_dir / "best_model")
reloaded = AnnotationRunner.load(save_dir, device="auto")

original_predictions = runner.predict(adata)
reloaded_predictions = reloaded.predict(adata)

np.testing.assert_array_equal(
    original_predictions["label_codes"], reloaded_predictions["label_codes"]
)
np.testing.assert_array_equal(original_predictions["labels"], reloaded_predictions["labels"])
np.testing.assert_allclose(
    original_predictions["probabilities"],
    reloaded_predictions["probabilities"],
    atol=1e-6,
)
np.testing.assert_allclose(
    original_predictions["latent"],
    reloaded_predictions["latent"],
    atol=1e-6,
)

{"best_strategy": runner.best_strategy_, "save_dir": str(save_dir)}
{'best_strategy': 'head',
 'save_dir': 'artifacts/scgpt_human_pancreas_annotation/best_model'}

What to inspect#

  • This is still an experimental tutorial, but it now answers the beyond-PBMC question with a real human pancreas dataset.

  • The quickstart default should show whether frozen scGPT is enough or whether head-only tuning is the better tradeoff on this dataset.

  • The saved confusion matrix, UMAPs, and runner files are there to keep the result auditable instead of turning the workflow into a single scalar metric.

summary_payload = {
    "best_strategy": runner.summary_.best_strategy,
    "label_categories": list(runner.summary_.label_categories),
    "strategy_metrics_columns": runner.summary_.strategy_metrics.columns.tolist(),
    "dataset_card": dataset_card,
}
(output_dir / "summary.json").write_text(json.dumps(summary_payload, indent=2), encoding="utf-8")
summary_payload
{'best_strategy': 'head',
 'label_categories': ['acinar',
  'activated_stellate',
  'alpha',
  'beta',
  'delta',
  'ductal',
  'endothelial',
  'gamma'],
 'strategy_metrics_columns': ['strategy',
  'validation_accuracy',
  'validation_macro_f1',
  'validation_balanced_accuracy',
  'validation_auroc_ovr',
  'test_accuracy',
  'test_macro_f1',
  'test_balanced_accuracy',
  'test_auroc_ovr',
  'runtime_sec',
  'trainable_parameters'],
 'dataset_card': {'profile': 'quickstart',
  'cells': 512,
  'genes': 1024,
  'label_key': 'cell_type',
  'batch_key': 'batch',
  'selected_cell_types': ['acinar',
   'activated_stellate',
   'alpha',
   'beta',
   'delta',
   'ductal',
   'endothelial',
   'gamma'],
  'num_batches': 9}}

Stable output-path contract#

The notebook ends with a single dictionary that points to the files the docs and tutorial validation suite expect.

output_paths = {
    "report_md": output_dir / "report.md",
    "report_csv": output_dir / "report.csv",
    "strategy_metrics_csv": output_dir / "strategy_metrics.csv",
    "best_strategy_confusion_matrix": output_dir / "best_strategy_confusion_matrix.png",
    "frozen_embedding_umap": output_dir / "frozen_embedding_umap.png",
    "best_strategy_embedding_umap": output_dir / "best_strategy_embedding_umap.png",
    "saved_model_manifest": output_dir / "best_model" / "manifest.json",
    "saved_model_state": output_dir / "best_model" / "model_state.pt",
}
output_paths
{'report_md': PosixPath('artifacts/scgpt_human_pancreas_annotation/report.md'),
 'report_csv': PosixPath('artifacts/scgpt_human_pancreas_annotation/report.csv'),
 'strategy_metrics_csv': PosixPath('artifacts/scgpt_human_pancreas_annotation/strategy_metrics.csv'),
 'best_strategy_confusion_matrix': PosixPath('artifacts/scgpt_human_pancreas_annotation/best_strategy_confusion_matrix.png'),
 'frozen_embedding_umap': PosixPath('artifacts/scgpt_human_pancreas_annotation/frozen_embedding_umap.png'),
 'best_strategy_embedding_umap': PosixPath('artifacts/scgpt_human_pancreas_annotation/best_strategy_embedding_umap.png'),
 'saved_model_manifest': PosixPath('artifacts/scgpt_human_pancreas_annotation/best_model/manifest.json'),
 'saved_model_state': PosixPath('artifacts/scgpt_human_pancreas_annotation/best_model/model_state.pt')}
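A validation suite can turn this path contract into a fail-fast check. A minimal sketch (the `missing_artifacts` helper is hypothetical, not a scdlkit API; in the notebook you would pass `output_paths`):

```python
from pathlib import Path

def missing_artifacts(paths):
    """Return the keys of expected artifacts that do not exist on disk."""
    return sorted(name for name, p in paths.items() if not Path(p).exists())

# Hypothetical example path; a real docs sync would fail when the
# returned list is non-empty.
example = {"report_md": Path("artifacts/demo-that-does-not-exist/report.md")}
print(missing_artifacts(example))
```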