Experimental scGPT human-pancreas annotation#

Who this notebook is for#

This notebook is for researchers who want to see the wrapper-first adapt_annotation(...) path on a non-PBMC human dataset.

Prerequisites#

  • Install scdlkit[foundation,tutorials]

  • Be comfortable working with AnnData

  • Expect this workflow to stay experimental in v0.1.7

What you will learn#

  • how to load the cached OpenProblems human-pancreas quickstart subset

  • how to inspect dataset compatibility before adaptation

  • how to run the easiest public annotation path on a beyond-PBMC human dataset

  • how to compare frozen probe and head-only tuning by default, with LoRA kept as an opt-in heavier path

  • how to save and reload the best fitted runner

Out of scope#

  • full-backbone fine-tuning

  • non-human data

  • checkpoints other than whole-human

  • raw OpenProblems preprocessing details

  • perturbation, spatial, or multimodal workflows

Outline#

  1. load the cached pancreas quickstart subset

  2. inspect the dataset card and compatibility report

  3. run the wrapper-first adaptation workflow

  4. annotate AnnData with the best strategy

  5. save and reload the fitted runner

  6. verify prediction consistency and review the output bundle

Expected artifacts#

  • artifacts/scgpt_human_pancreas_annotation/report.md

  • artifacts/scgpt_human_pancreas_annotation/report.csv

  • artifacts/scgpt_human_pancreas_annotation/strategy_metrics.csv

  • artifacts/scgpt_human_pancreas_annotation/best_strategy_confusion_matrix.png

  • artifacts/scgpt_human_pancreas_annotation/frozen_embedding_umap.png

  • artifacts/scgpt_human_pancreas_annotation/best_strategy_embedding_umap.png

  • artifacts/scgpt_human_pancreas_annotation/best_model/manifest.json

  • artifacts/scgpt_human_pancreas_annotation/best_model/model_state.pt

Next steps#

  • deeper experimental comparison: docs/guides/annotation-benchmarks.md

  • lower-level control: examples/scgpt_cell_type_annotation.ipynb

  • API pages: docs/api/annotation.md and docs/api/foundation.md

Published tutorial status#

This page is a static notebook copy published for documentation review. It is meant to show the exact workflow and outputs from the last recorded run.

  • Last run date: 2026-03-27 10:18 UTC

  • Publication mode: static executed tutorial

  • Execution profile: published

  • Artifact check in this sync: passed

  • Source notebook: examples/scgpt_human_pancreas_annotation.ipynb

import json
from pathlib import Path

import numpy as np
import pandas as pd
import torch

from scdlkit import (
    AnnotationRunner,
    adapt_annotation,
    inspect_annotation_data,
)
from scdlkit._datasets.openproblems import (
    load_openproblems_pancreas_annotation_dataset,
)

PROFILE = "quickstart"
CONFIGS = {
    "quickstart": {
        "seed": 42,
        "batch_size": 32,
        "strategies": ("frozen_probe", "head"),
    },
    "full": {
        "seed": 42,
        "batch_size": 64,
        "strategies": ("frozen_probe", "head", "lora"),
    },
}
config = CONFIGS[PROFILE]
output_dir = Path("artifacts/scgpt_human_pancreas_annotation")
output_dir.mkdir(parents=True, exist_ok=True)
print(f"Using device: {'cuda' if torch.cuda.is_available() else 'cpu'}")
np.random.seed(config["seed"])
torch.manual_seed(config["seed"])
Using device: cpu

Load the cached quickstart pancreas subset#

This tutorial stays on the public wrapper path but loads a repo-managed cached subset, because the raw OpenProblems pancreas file is large. The executed docs notebook uses the deterministic quickstart subset so the walkthrough stays CPU-practical.

This is still the first beyond-PBMC human benchmark in scDLKit, not a claim of broad stable foundation-model support.

adata = load_openproblems_pancreas_annotation_dataset(profile=PROFILE)

dataset_card = {
    "profile": PROFILE,
    "cells": int(adata.n_obs),
    "genes": int(adata.n_vars),
    "label_key": "cell_type",
    "batch_key": "batch",
    "selected_cell_types": sorted(adata.obs["cell_type"].astype(str).unique().tolist()),
    "num_batches": int(adata.obs["batch"].astype(str).nunique()),
}
dataset_card
{'profile': 'quickstart',
 'cells': 512,
 'genes': 1024,
 'label_key': 'cell_type',
 'batch_key': 'batch',
 'selected_cell_types': ['acinar',
  'activated_stellate',
  'alpha',
  'beta',
  'delta',
  'ductal',
  'endothelial',
  'gamma'],
 'num_batches': 9}

Inspect the dataset before adaptation#

This is the wrapper-first preflight step. It answers the practical question: does this deterministic pancreas subset look compatible with the current experimental scGPT annotation path before we spend time training anything?

report = inspect_annotation_data(
    adata,
    label_key="cell_type",
    checkpoint="whole-human",
    min_gene_overlap=min(500, adata.n_vars),
)
report_frame = pd.DataFrame(
    {
        "field": [
            "checkpoint_id",
            "num_cells",
            "num_input_genes",
            "num_genes_matched",
            "gene_overlap_ratio",
            "min_class_count",
            "stratify_possible",
            "label_categories",
            "warnings",
        ],
        "value": [
            report.checkpoint_id,
            report.num_cells,
            report.num_input_genes,
            report.num_genes_matched,
            round(report.gene_overlap_ratio, 4),
            report.min_class_count,
            report.stratify_possible,
            report.label_categories,
            report.warnings,
        ],
    }
)
report_frame
field value
0 checkpoint_id whole-human
1 num_cells 512
2 num_input_genes 1024
3 num_genes_matched 996
4 gene_overlap_ratio 0.9727
5 min_class_count 10
6 stratify_possible True
7 label_categories (acinar, activated_stellate, alpha, beta, delt...
8 warnings ()
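The `min_class_count` and `stratify_possible` fields summarize label balance, and it can be worth eyeballing the per-batch label counts yourself before trusting a stratified split. A minimal sketch on a toy `obs` table (in the notebook you would pass `adata.obs` directly; the toy values here are illustrative):

```python
import pandas as pd

# Toy stand-in for adata.obs with the same label and batch columns.
obs = pd.DataFrame({
    "cell_type": ["alpha", "alpha", "beta", "beta", "gamma", "gamma"],
    "batch": ["celseq", "celseq2", "celseq", "celseq2", "celseq", "celseq"],
})

# Cells per label per batch; very small per-cell counts make stratified
# train/validation/test splits fragile.
counts = pd.crosstab(obs["cell_type"], obs["batch"])
min_class_count = int(obs["cell_type"].value_counts().min())
print(counts)
print("min_class_count:", min_class_count)
```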

Run the one-shot wrapper#

adapt_annotation(...) remains the easiest public adaptation path in this release line. In the quickstart profile it compares only frozen-probe and head-only tuning; LoRA stays available by explicit opt-in and is evaluated in the heavier evidence workflow rather than in the default docs notebook path.

runner = adapt_annotation(
    adata,
    label_key="cell_type",
    strategies=config["strategies"],
    batch_size=config["batch_size"],
    device="auto",
    output_dir=output_dir,
)
runner.summary_.strategy_metrics
strategy validation_accuracy validation_macro_f1 validation_balanced_accuracy validation_auroc_ovr test_accuracy test_macro_f1 test_balanced_accuracy test_auroc_ovr runtime_sec trainable_parameters
0 head 0.805195 0.524471 0.536458 NaN 0.779221 0.584684 0.565625 NaN 2339.427733 5128
1 frozen_probe 0.779221 0.418743 0.453125 0.968261 0.714286 0.395245 0.415625 0.923283 170.017979 0
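The wrapper selects the best strategy internally. For intuition, a comparable selection by validation macro-F1 can be sketched on a toy frame mirroring the columns above (the metric choice and the runtime tiebreak are illustrative, not the library's documented selection rule):

```python
import pandas as pd

# Toy frame mirroring two columns of the strategy_metrics table above.
metrics = pd.DataFrame({
    "strategy": ["head", "frozen_probe"],
    "validation_macro_f1": [0.524471, 0.418743],
    "runtime_sec": [2339.43, 170.02],
})

# Rank by validation macro-F1 (robust to the class imbalance visible in
# this subset), breaking ties toward the cheaper strategy.
ranked = metrics.sort_values(
    ["validation_macro_f1", "runtime_sec"], ascending=[False, True]
)
best_strategy = ranked.iloc[0]["strategy"]
print(best_strategy)
```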

Annotate the dataset with the best strategy#

The wrapper writes predicted labels, confidence scores, and the best latent embedding back into the same AnnData object so the downstream Scanpy handoff stays simple.

runner.annotate_adata(
    adata,
    obs_key="scgpt_label",
    embedding_key="X_scgpt_best",
)
adata.obs[["cell_type", "batch", "scgpt_label", "scgpt_label_confidence"]].head()
cell_type batch scgpt_label scgpt_label_confidence
D17All1_72 gamma celseq alpha 0.522375
D28-1_74 gamma celseq2 alpha 0.622307
D29-6_88 gamma celseq2 alpha 0.599043
D31-7_15 gamma celseq2 alpha 0.577890
9th-C66_S33 gamma fluidigmc1 alpha 0.408202
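All five preview cells above are predicted `alpha` with moderate confidence, so a common follow-up is flagging low-confidence calls for manual review. A sketch with the confidence values shown above (the 0.5 cutoff is an illustrative choice, not a scdlkit default; in the notebook this would read `adata.obs["scgpt_label_confidence"]`):

```python
import pandas as pd

# Confidence values copied from the preview rows above.
conf = pd.Series(
    [0.522375, 0.622307, 0.599043, 0.577890, 0.408202],
    index=["D17All1_72", "D28-1_74", "D29-6_88", "D31-7_15", "9th-C66_S33"],
    name="scgpt_label_confidence",
)

# Flag cells below an illustrative review threshold.
needs_review = conf[conf < 0.5]
print(needs_review.index.tolist())
```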

Save and reload the fitted runner#

This is the reusability step researchers usually care about most: compare strategies once, keep the best fitted runner, and reload it later without rerunning the whole adaptation loop.

save_dir = runner.save(output_dir / "best_model")
reloaded = AnnotationRunner.load(save_dir, device="auto")

original_predictions = runner.predict(adata)
reloaded_predictions = reloaded.predict(adata)

np.testing.assert_array_equal(
    original_predictions["label_codes"], reloaded_predictions["label_codes"]
)
np.testing.assert_array_equal(original_predictions["labels"], reloaded_predictions["labels"])
np.testing.assert_allclose(
    original_predictions["probabilities"],
    reloaded_predictions["probabilities"],
    atol=1e-6,
)
np.testing.assert_allclose(
    original_predictions["latent"],
    reloaded_predictions["latent"],
    atol=1e-6,
)

{"best_strategy": runner.best_strategy_, "save_dir": str(save_dir)}
{'best_strategy': 'head',
 'save_dir': 'artifacts/scgpt_human_pancreas_annotation/best_model'}

What to inspect#

  • This is still an experimental tutorial, but it now answers the beyond-PBMC question with a real human pancreas dataset.

  • The quickstart default should show whether frozen scGPT is enough or whether head-only tuning is the better tradeoff on this dataset.

  • The saved confusion matrix, UMAPs, and runner files are there to keep the result auditable instead of turning the workflow into a single scalar metric.

summary_payload = {
    "best_strategy": runner.summary_.best_strategy,
    "label_categories": list(runner.summary_.label_categories),
    "strategy_metrics_columns": runner.summary_.strategy_metrics.columns.tolist(),
    "dataset_card": dataset_card,
}
(output_dir / "summary.json").write_text(json.dumps(summary_payload, indent=2), encoding="utf-8")
summary_payload
{'best_strategy': 'head',
 'label_categories': ['acinar',
  'activated_stellate',
  'alpha',
  'beta',
  'delta',
  'ductal',
  'endothelial',
  'gamma'],
 'strategy_metrics_columns': ['strategy',
  'validation_accuracy',
  'validation_macro_f1',
  'validation_balanced_accuracy',
  'validation_auroc_ovr',
  'test_accuracy',
  'test_macro_f1',
  'test_balanced_accuracy',
  'test_auroc_ovr',
  'runtime_sec',
  'trainable_parameters'],
 'dataset_card': {'profile': 'quickstart',
  'cells': 512,
  'genes': 1024,
  'label_key': 'cell_type',
  'batch_key': 'batch',
  'selected_cell_types': ['acinar',
   'activated_stellate',
   'alpha',
   'beta',
   'delta',
   'ductal',
   'endothelial',
   'gamma'],
  'num_batches': 9}}

Stable output-path contract#

The notebook ends with a single dictionary that points to the files the docs and tutorial validation suite expect.

output_paths = {
    "report_md": output_dir / "report.md",
    "report_csv": output_dir / "report.csv",
    "strategy_metrics_csv": output_dir / "strategy_metrics.csv",
    "best_strategy_confusion_matrix": output_dir / "best_strategy_confusion_matrix.png",
    "frozen_embedding_umap": output_dir / "frozen_embedding_umap.png",
    "best_strategy_embedding_umap": output_dir / "best_strategy_embedding_umap.png",
    "saved_model_manifest": output_dir / "best_model" / "manifest.json",
    "saved_model_state": output_dir / "best_model" / "model_state.pt",
}
output_paths
{'report_md': PosixPath('artifacts/scgpt_human_pancreas_annotation/report.md'),
 'report_csv': PosixPath('artifacts/scgpt_human_pancreas_annotation/report.csv'),
 'strategy_metrics_csv': PosixPath('artifacts/scgpt_human_pancreas_annotation/strategy_metrics.csv'),
 'best_strategy_confusion_matrix': PosixPath('artifacts/scgpt_human_pancreas_annotation/best_strategy_confusion_matrix.png'),
 'frozen_embedding_umap': PosixPath('artifacts/scgpt_human_pancreas_annotation/frozen_embedding_umap.png'),
 'best_strategy_embedding_umap': PosixPath('artifacts/scgpt_human_pancreas_annotation/best_strategy_embedding_umap.png'),
 'saved_model_manifest': PosixPath('artifacts/scgpt_human_pancreas_annotation/best_model/manifest.json'),
 'saved_model_state': PosixPath('artifacts/scgpt_human_pancreas_annotation/best_model/model_state.pt')}
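A validation suite can turn this path contract into a fail-fast check. A minimal sketch (the `missing_artifacts` helper is hypothetical, not a scdlkit API; in the notebook you would pass `output_paths`):

```python
from pathlib import Path

def missing_artifacts(paths):
    """Return the keys of expected artifacts that do not exist on disk."""
    return sorted(name for name, p in paths.items() if not Path(p).exists())

# Hypothetical example path; a real docs sync would fail when the
# returned list is non-empty.
example = {"report_md": Path("artifacts/demo-that-does-not-exist/report.md")}
print(missing_artifacts(example))
```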