Experimental scGPT human-pancreas annotation#
Who this notebook is for#
This notebook is for researchers who want to see the wrapper-first adapt_annotation(...) path on a non-PBMC human dataset.
Prerequisites#
- Install `scdlkit[foundation,tutorials]`
- Be comfortable working with `AnnData`
- Expect this workflow to stay experimental in v0.1.7
What you will learn#
- how to load the cached OpenProblems human-pancreas quickstart subset
- how to inspect dataset compatibility before adaptation
- how to run the easiest public annotation path on a beyond-PBMC human dataset
- how to compare frozen probe and head-only tuning by default, with LoRA kept as an opt-in heavier path
- how to save and reload the best fitted runner
Out of scope#
- full-backbone fine-tuning
- non-human data
- checkpoints other than `whole-human`
- raw OpenProblems preprocessing details
- perturbation, spatial, or multimodal workflows
Outline#
- load the cached pancreas quickstart subset
- inspect the dataset card and compatibility report
- run the wrapper-first adaptation workflow
- annotate `AnnData` with the best strategy
- save and reload the fitted runner
- verify prediction consistency and review the output bundle
Expected artifacts#
- `artifacts/scgpt_human_pancreas_annotation/report.md`
- `artifacts/scgpt_human_pancreas_annotation/report.csv`
- `artifacts/scgpt_human_pancreas_annotation/strategy_metrics.csv`
- `artifacts/scgpt_human_pancreas_annotation/best_strategy_confusion_matrix.png`
- `artifacts/scgpt_human_pancreas_annotation/frozen_embedding_umap.png`
- `artifacts/scgpt_human_pancreas_annotation/best_strategy_embedding_umap.png`
- `artifacts/scgpt_human_pancreas_annotation/best_model/manifest.json`
- `artifacts/scgpt_human_pancreas_annotation/best_model/model_state.pt`
Next step#
- deeper experimental comparison: `docs/guides/annotation-benchmarks.md`
- lower-level control: `examples/scgpt_cell_type_annotation.ipynb`
- API pages: `docs/api/annotation.md` and `docs/api/foundation.md`
Published tutorial status
This page is a static notebook copy published for documentation review. It is meant to show the exact workflow and outputs from the last recorded run.
- Last run date (UTC): 2026-03-27 10:18 UTC
- Publication mode: static executed tutorial
- Execution profile: published
- Artifact check in this sync: passed
- Source notebook: `examples/scgpt_human_pancreas_annotation.ipynb`
import json
from pathlib import Path
import numpy as np
import pandas as pd
import torch
from scdlkit import (
AnnotationRunner,
adapt_annotation,
inspect_annotation_data,
)
from scdlkit._datasets.openproblems import (
load_openproblems_pancreas_annotation_dataset,
)
PROFILE = "quickstart"
CONFIGS = {
"quickstart": {
"seed": 42,
"batch_size": 32,
"strategies": ("frozen_probe", "head"),
},
"full": {
"seed": 42,
"batch_size": 64,
"strategies": ("frozen_probe", "head", "lora"),
},
}
config = CONFIGS[PROFILE]
output_dir = Path("artifacts/scgpt_human_pancreas_annotation")
output_dir.mkdir(parents=True, exist_ok=True)
print(f"Using device: {'cuda' if torch.cuda.is_available() else 'cpu'}")
np.random.seed(config["seed"])
Using device: cpu
Load the cached quickstart pancreas subset#
This tutorial stays on the public wrapper path, but it uses a repo-managed cached subset loader because the raw OpenProblems pancreas file is large. The executed docs notebook uses the deterministic quickstart subset so the public walkthrough stays CPU-practical.
This is still the first beyond-PBMC human benchmark in scDLKit, not a claim of broad stable foundation-model support.
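The loader itself is internal to scDLKit, but deterministic subsetting is easy to sketch: a seeded NumPy generator picks the same cell indices on every run. The function below is an illustration under that assumption, not the actual loader implementation; the sizes mirror this tutorial's 512-cell subset.

```python
import numpy as np

def quickstart_indices(n_cells: int, subset_size: int, seed: int = 42) -> np.ndarray:
    """Pick a reproducible subset of cell indices with a seeded generator."""
    rng = np.random.default_rng(seed)
    return np.sort(rng.choice(n_cells, size=subset_size, replace=False))

# The same seed always yields the same subset, which is what keeps the
# published walkthrough deterministic across reruns.
first = quickstart_indices(15_000, 512)
second = quickstart_indices(15_000, 512)
assert np.array_equal(first, second)
```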
adata = load_openproblems_pancreas_annotation_dataset(profile=PROFILE)
dataset_card = {
"profile": PROFILE,
"cells": int(adata.n_obs),
"genes": int(adata.n_vars),
"label_key": "cell_type",
"batch_key": "batch",
"selected_cell_types": sorted(adata.obs["cell_type"].astype(str).unique().tolist()),
"num_batches": int(adata.obs["batch"].astype(str).nunique()),
}
dataset_card
{'profile': 'quickstart',
'cells': 512,
'genes': 1024,
'label_key': 'cell_type',
'batch_key': 'batch',
'selected_cell_types': ['acinar',
'activated_stellate',
'alpha',
'beta',
'delta',
'ductal',
'endothelial',
'gamma'],
'num_batches': 9}
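Before relying on `min_class_count` downstream, it is worth eyeballing per-class cell counts; `adata.obs["cell_type"]` is a pandas Series, so the usual `value_counts` idiom applies. The counts below are synthetic stand-ins for illustration, not the real subset's distribution.

```python
import pandas as pd

# Synthetic stand-in for adata.obs["cell_type"] on a 512-cell subset.
labels = pd.Series(
    ["alpha"] * 200 + ["beta"] * 150 + ["ductal"] * 80
    + ["acinar"] * 40 + ["delta"] * 22 + ["gamma"] * 10
    + ["endothelial"] * 6 + ["activated_stellate"] * 4,
    name="cell_type",
)
counts = labels.value_counts()
print(counts)
# The smallest class count is what decides whether stratified splits are feasible.
print("min_class_count:", int(counts.min()))
```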
Inspect the dataset before adaptation#
This is the wrapper-first preflight step. It answers the practical question: does this deterministic pancreas subset look compatible with the current experimental scGPT annotation path before we spend time training anything?
report = inspect_annotation_data(
adata,
label_key="cell_type",
checkpoint="whole-human",
min_gene_overlap=min(500, adata.n_vars),
)
report_frame = pd.DataFrame(
{
"field": [
"checkpoint_id",
"num_cells",
"num_input_genes",
"num_genes_matched",
"gene_overlap_ratio",
"min_class_count",
"stratify_possible",
"label_categories",
"warnings",
],
"value": [
report.checkpoint_id,
report.num_cells,
report.num_input_genes,
report.num_genes_matched,
round(report.gene_overlap_ratio, 4),
report.min_class_count,
report.stratify_possible,
report.label_categories,
report.warnings,
],
}
)
report_frame
|   | field | value |
|---|---|---|
| 0 | checkpoint_id | whole-human |
| 1 | num_cells | 512 |
| 2 | num_input_genes | 1024 |
| 3 | num_genes_matched | 996 |
| 4 | gene_overlap_ratio | 0.9727 |
| 5 | min_class_count | 10 |
| 6 | stratify_possible | True |
| 7 | label_categories | (acinar, activated_stellate, alpha, beta, delt... |
| 8 | warnings | () |
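The `gene_overlap_ratio` is simply matched genes over input genes; for this subset that is 996 of 1024, which rounds to the 0.9727 shown above. The set-intersection helper below is a sketch that assumes the checkpoint exposes a plain gene vocabulary, which is an assumption about the internals.

```python
def gene_overlap_ratio(input_genes, checkpoint_vocab):
    """Fraction of the dataset's genes that appear in the checkpoint vocabulary."""
    matched = len(set(input_genes) & set(checkpoint_vocab))
    return matched / len(input_genes)

# Reproduce the reported figure directly from the counts in the table above.
print(round(996 / 1024, 4))  # 0.9727

# Tiny worked example of the set-based computation.
print(gene_overlap_ratio(["a", "b", "c", "d"], ["a", "b", "c"]))  # 0.75
```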
Run the one-shot wrapper#
adapt_annotation(...) remains the easiest public adaptation path in this release line. In the quickstart profile it compares only the frozen probe and head-only tuning strategies. LoRA remains available by explicit opt-in and is evaluated in the heavier evidence workflow rather than in the default docs notebook path.
runner = adapt_annotation(
adata,
label_key="cell_type",
strategies=config["strategies"],
batch_size=config["batch_size"],
device="auto",
output_dir=output_dir,
)
runner.summary_.strategy_metrics
|   | strategy | validation_accuracy | validation_macro_f1 | validation_balanced_accuracy | validation_auroc_ovr | test_accuracy | test_macro_f1 | test_balanced_accuracy | test_auroc_ovr | runtime_sec | trainable_parameters |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | head | 0.805195 | 0.524471 | 0.536458 | NaN | 0.779221 | 0.584684 | 0.565625 | NaN | 2339.427733 | 5128 |
| 1 | frozen_probe | 0.779221 | 0.418743 | 0.453125 | 0.968261 | 0.714286 | 0.395245 | 0.415625 | 0.923283 | 170.017979 | 0 |
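How the wrapper picks `best_strategy_` is internal to scdlkit; a plausible sketch is ranking strategies by validation macro-F1, which on the numbers above also selects head. The metric choice and any tie-breaking are assumptions here, not documented behavior.

```python
import pandas as pd

# Validation metrics copied from the table above.
metrics = pd.DataFrame(
    {
        "strategy": ["head", "frozen_probe"],
        "validation_macro_f1": [0.524471, 0.418743],
    }
)
# Select the row with the highest validation macro-F1.
best = metrics.loc[metrics["validation_macro_f1"].idxmax(), "strategy"]
print(best)  # head
```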
Annotate the dataset with the best strategy#
The wrapper writes predicted labels, confidence scores, and the best latent embedding back into the same AnnData object so the downstream Scanpy handoff stays simple.
runner.annotate_adata(
adata,
obs_key="scgpt_label",
embedding_key="X_scgpt_best",
)
adata.obs[["cell_type", "batch", "scgpt_label", "scgpt_label_confidence"]].head()
|   | cell_type | batch | scgpt_label | scgpt_label_confidence |
|---|---|---|---|---|
| D17All1_72 | gamma | celseq | alpha | 0.522375 |
| D28-1_74 | gamma | celseq2 | alpha | 0.622307 |
| D29-6_88 | gamma | celseq2 | alpha | 0.599043 |
| D31-7_15 | gamma | celseq2 | alpha | 0.577890 |
| 9th-C66_S33 | gamma | fluidigmc1 | alpha | 0.408202 |
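The confidence column makes it easy to triage predictions for manual review. The 0.5 threshold below is an arbitrary illustration applied to the five confidences shown in the `head()` output, not a value the wrapper prescribes.

```python
import pandas as pd

# Confidences copied from the head() output above.
conf = pd.Series(
    [0.522375, 0.622307, 0.599043, 0.577890, 0.408202],
    name="scgpt_label_confidence",
)
# Flag cells whose top-class probability falls below an arbitrary review threshold.
needs_review = conf < 0.5
print(int(needs_review.sum()))  # 1 of the 5 shown cells falls below 0.5
```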
Save and reload the fitted runner#
This is the reusability step researchers usually care about most: compare strategies once, keep the best fitted runner, and reload it later without rerunning the whole adaptation loop.
save_dir = runner.save(output_dir / "best_model")
reloaded = AnnotationRunner.load(save_dir, device="auto")
original_predictions = runner.predict(adata)
reloaded_predictions = reloaded.predict(adata)
np.testing.assert_array_equal(
original_predictions["label_codes"], reloaded_predictions["label_codes"]
)
np.testing.assert_array_equal(original_predictions["labels"], reloaded_predictions["labels"])
np.testing.assert_allclose(
original_predictions["probabilities"],
reloaded_predictions["probabilities"],
atol=1e-6,
)
np.testing.assert_allclose(
original_predictions["latent"],
reloaded_predictions["latent"],
atol=1e-6,
)
{"best_strategy": runner.best_strategy_, "save_dir": str(save_dir)}
{'best_strategy': 'head',
'save_dir': 'artifacts/scgpt_human_pancreas_annotation/best_model'}
What to inspect#
This is still an experimental tutorial, but it now answers the beyond-PBMC question with a real human pancreas dataset.
The quickstart default should show whether frozen scGPT is enough or whether head-only tuning is the better tradeoff on this dataset.
The saved confusion matrix, UMAPs, and runner files are there to keep the result auditable instead of turning the workflow into a single scalar metric.
summary_payload = {
"best_strategy": runner.summary_.best_strategy,
"label_categories": list(runner.summary_.label_categories),
"strategy_metrics_columns": runner.summary_.strategy_metrics.columns.tolist(),
"dataset_card": dataset_card,
}
(output_dir / "summary.json").write_text(json.dumps(summary_payload, indent=2), encoding="utf-8")
summary_payload
{'best_strategy': 'head',
'label_categories': ['acinar',
'activated_stellate',
'alpha',
'beta',
'delta',
'ductal',
'endothelial',
'gamma'],
'strategy_metrics_columns': ['strategy',
'validation_accuracy',
'validation_macro_f1',
'validation_balanced_accuracy',
'validation_auroc_ovr',
'test_accuracy',
'test_macro_f1',
'test_balanced_accuracy',
'test_auroc_ovr',
'runtime_sec',
'trainable_parameters'],
'dataset_card': {'profile': 'quickstart',
'cells': 512,
'genes': 1024,
'label_key': 'cell_type',
'batch_key': 'batch',
'selected_cell_types': ['acinar',
'activated_stellate',
'alpha',
'beta',
'delta',
'ductal',
'endothelial',
'gamma'],
'num_batches': 9}}
Stable output-path contract#
The notebook ends with a single dictionary that points to the files the docs and tutorial validation suite expect.
output_paths = {
"report_md": output_dir / "report.md",
"report_csv": output_dir / "report.csv",
"strategy_metrics_csv": output_dir / "strategy_metrics.csv",
"best_strategy_confusion_matrix": output_dir / "best_strategy_confusion_matrix.png",
"frozen_embedding_umap": output_dir / "frozen_embedding_umap.png",
"best_strategy_embedding_umap": output_dir / "best_strategy_embedding_umap.png",
"saved_model_manifest": output_dir / "best_model" / "manifest.json",
"saved_model_state": output_dir / "best_model" / "model_state.pt",
}
output_paths
{'report_md': PosixPath('artifacts/scgpt_human_pancreas_annotation/report.md'),
'report_csv': PosixPath('artifacts/scgpt_human_pancreas_annotation/report.csv'),
'strategy_metrics_csv': PosixPath('artifacts/scgpt_human_pancreas_annotation/strategy_metrics.csv'),
'best_strategy_confusion_matrix': PosixPath('artifacts/scgpt_human_pancreas_annotation/best_strategy_confusion_matrix.png'),
'frozen_embedding_umap': PosixPath('artifacts/scgpt_human_pancreas_annotation/frozen_embedding_umap.png'),
'best_strategy_embedding_umap': PosixPath('artifacts/scgpt_human_pancreas_annotation/best_strategy_embedding_umap.png'),
'saved_model_manifest': PosixPath('artifacts/scgpt_human_pancreas_annotation/best_model/manifest.json'),
'saved_model_state': PosixPath('artifacts/scgpt_human_pancreas_annotation/best_model/model_state.pt')}
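A small helper can turn that contract into an actual check, returning the keys whose files are missing from disk. This helper is a sketch for validation scripts, not part of scdlkit.

```python
from pathlib import Path

def missing_artifacts(output_paths: dict[str, Path]) -> list[str]:
    """Return the keys in the output-path contract whose files do not exist yet."""
    return [key for key, path in output_paths.items() if not path.exists()]

# Example against a paths dict shaped like the one above; this path is
# deliberately nonexistent, so its key is reported as missing.
example = {"report_md": Path("artifacts/does_not_exist/report.md")}
print(missing_artifacts(example))  # ['report_md']
```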