Experimental scGPT dataset-specific annotation#
Who this notebook is for#
This notebook is for users who already understand the experimental scGPT PBMC annotation tutorial and want an easier, wrapper-first way to adapt scGPT to a second labeled human dataset.
Prerequisites#
Install scdlkit[foundation,tutorials]
Be comfortable working with AnnData
Expect this workflow to stay experimental in v0.1.7
What you will learn#
how to inspect a labeled dataset before adaptation
how to run the wrapper-first adapt_annotation(...) workflow
how to compare frozen probe and head-only tuning in the quickstart path, with LoRA available in the full profile
how to write predictions and embeddings back into AnnData
how to save and reload the fitted runner
Out of scope#
full-backbone fine-tuning
non-human data
checkpoints other than whole-human
perturbation, spatial, or multimodal workflows
Outline#
load a reproducible PBMC dataset
take a CPU-friendly stratified subset for the tutorial profile
inspect dataset compatibility with scGPT
run the one-shot adaptation wrapper
annotate the dataset with the best strategy
save and reload the fitted runner
verify the reload path and review the saved artifacts
Expected artifacts#
artifacts/scgpt_dataset_specific_annotation/report.md
artifacts/scgpt_dataset_specific_annotation/report.csv
artifacts/scgpt_dataset_specific_annotation/strategy_metrics.csv
artifacts/scgpt_dataset_specific_annotation/best_strategy_confusion_matrix.png
artifacts/scgpt_dataset_specific_annotation/frozen_embedding_umap.png
artifacts/scgpt_dataset_specific_annotation/best_strategy_embedding_umap.png
artifacts/scgpt_dataset_specific_annotation/best_model/manifest.json
artifacts/scgpt_dataset_specific_annotation/best_model/model_state.pt
Next tutorial#
After this notebook, return to the lower-level scGPT annotation tutorial whenever you need tighter control over the training surface and want to inspect the raw Trainer path directly.
Next step#
lower-level scGPT route: examples/scgpt_cell_type_annotation.ipynb
API pages: docs/api/annotation.md and docs/api/foundation.md
Published tutorial status
This page is a static notebook copy published for documentation review. It is meant to show the exact workflow and outputs from the last recorded run.
Last run date (UTC): 2026-03-27 09:22 UTC
Publication mode: static executed tutorial
Execution profile: published
Artifact check in this sync: passed
Source notebook: examples/scgpt_dataset_specific_annotation.ipynb
import json
from pathlib import Path
import numpy as np
import pandas as pd
import scanpy as sc
from scipy import sparse
from sklearn.model_selection import train_test_split
from scdlkit import (
AnnotationRunner,
adapt_annotation,
inspect_annotation_data,
)
PROFILE = "quickstart"
CONFIGS = {
"quickstart": {
"seed": 42,
"batch_size": 32,
"max_cells": 64,
"max_genes": 48,
"strategies": ("frozen_probe", "head"),
},
"full": {
"seed": 42,
"batch_size": 64,
"max_cells": 128,
"max_genes": 96,
"strategies": ("frozen_probe", "head", "lora"),
},
}
config = CONFIGS[PROFILE]
output_dir = Path("artifacts/scgpt_dataset_specific_annotation")
output_dir.mkdir(parents=True, exist_ok=True)
np.random.seed(config["seed"])
Load a reproducible labeled dataset#
The public API is meant for arbitrary labeled human AnnData, but the tutorial stays reproducible by using scanpy.datasets.pbmc68k_reduced() with bulk_labels as the coarse annotation target.
The quickstart profile keeps the run CPU-friendly by taking a stratified subset of cells, a high-variance subset of genes, and the lighter frozen-probe plus head-only strategy ladder. This is a tutorial runtime choice, not a limit of the wrapper API itself. The full profile adds LoRA back into the comparison.
pbmc68k_reduced() stores centered (and therefore partly negative) expression values. Because the current public scGPT path requires non-negative inputs, the tutorial shifts the subset to a non-negative matrix before adaptation. That compatibility step is explicit here instead of being hidden inside the library.
adata = sc.datasets.pbmc68k_reduced()
adata.obs["bulk_labels"] = adata.obs["bulk_labels"].astype(str)
if adata.n_obs > config["max_cells"]:
indices = np.arange(adata.n_obs)
keep_indices, _ = train_test_split(
indices,
train_size=config["max_cells"],
random_state=config["seed"],
stratify=adata.obs["bulk_labels"],
)
adata = adata[np.sort(keep_indices)].copy()
if adata.n_vars > config["max_genes"]:
dense = adata.X.toarray() if sparse.issparse(adata.X) else np.asarray(adata.X)
keep_genes = np.argsort(np.var(dense, axis=0))[-config["max_genes"] :]
adata = adata[:, np.sort(keep_genes)].copy()
dense_matrix = (
adata.X.toarray() if sparse.issparse(adata.X) else np.asarray(adata.X, dtype="float32")
)
minimum_value = float(dense_matrix.min())
if minimum_value < 0.0:
dense_matrix = dense_matrix - minimum_value
adata.X = dense_matrix.astype("float32", copy=False)
adata.raw = adata.copy()
print(
{
"profile": PROFILE,
"cells": int(adata.n_obs),
"genes": int(adata.n_vars),
"label_key": "bulk_labels",
"min_value_after_shift": float(np.min(adata.X)),
"labels": sorted(adata.obs["bulk_labels"].astype(str).unique().tolist()),
}
)
{'profile': 'quickstart', 'cells': 64, 'genes': 48, 'label_key': 'bulk_labels', 'min_value_after_shift': 0.0, 'labels': ['CD14+ Monocyte', 'CD19+ B', 'CD34+', 'CD4+/CD25 T Reg', 'CD4+/CD45RA+/CD25- Naive T', 'CD4+/CD45RO+ Memory', 'CD56+ NK', 'CD8+ Cytotoxic T', 'CD8+/CD45RA+ Naive Cytotoxic', 'Dendritic']}
Inspect the dataset before adaptation#
This preflight report is the wrapper-first way to answer a practical question: is the dataset a reasonable candidate for scGPT annotation adaptation, or are there obvious overlap or class-balance problems to resolve first?
report = inspect_annotation_data(
adata,
label_key="bulk_labels",
checkpoint="whole-human",
min_gene_overlap=min(500, adata.n_vars),
)
report_frame = pd.DataFrame(
{
"field": [
"checkpoint_id",
"num_cells",
"num_input_genes",
"num_genes_matched",
"gene_overlap_ratio",
"min_class_count",
"stratify_possible",
"label_categories",
"warnings",
],
"value": [
report.checkpoint_id,
report.num_cells,
report.num_input_genes,
report.num_genes_matched,
round(report.gene_overlap_ratio, 4),
report.min_class_count,
report.stratify_possible,
report.label_categories,
report.warnings,
],
}
)
report_frame
| | field | value |
|---|---|---|
| 0 | checkpoint_id | whole-human |
| 1 | num_cells | 64 |
| 2 | num_input_genes | 48 |
| 3 | num_genes_matched | 45 |
| 4 | gene_overlap_ratio | 0.9375 |
| 5 | min_class_count | 1 |
| 6 | stratify_possible | False |
| 7 | label_categories | (CD14+ Monocyte, CD19+ B, CD34+, CD4+/CD25 T R... |
| 8 | warnings | (Gene overlap with the scGPT checkpoint vocabu... |
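One way to act on this report before spending compute on adaptation is a small guard over the fields above. The helper and its thresholds (50% gene overlap, at least 2 cells per class) are tutorial assumptions for illustration, not defaults enforced by scdlkit:

```python
# Hypothetical preflight guard over plain values from the inspection report.
# The thresholds are illustrative assumptions, not scdlkit defaults.
def preflight_ok(
    gene_overlap_ratio: float,
    min_class_count: int,
    min_overlap: float = 0.5,
    min_count: int = 2,
) -> list[str]:
    """Return a list of human-readable problems; an empty list means proceed."""
    problems = []
    if gene_overlap_ratio < min_overlap:
        problems.append(
            f"gene overlap {gene_overlap_ratio:.2%} is below {min_overlap:.0%}"
        )
    if min_class_count < min_count:
        problems.append(
            f"smallest class has {min_class_count} cell(s); stratified "
            "splitting and per-class metrics will be unreliable"
        )
    return problems

# Values taken from the inspection table above.
for issue in preflight_ok(gene_overlap_ratio=0.9375, min_class_count=1):
    print("WARNING:", issue)
```

For this subset the guard flags only the singleton class, which matches the `stratify_possible = False` row in the report.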
Run the one-shot wrapper#
adapt_annotation(...) is the simplest public path in this release. In the quickstart profile it compares frozen probe and head-only tuning so the executed notebook stays CPU-practical. The full profile adds LoRA back into the comparison. In both cases the wrapper saves the standard report artifacts and keeps the best fitted strategy in memory for prediction and annotation.
runner = adapt_annotation(
adata,
label_key="bulk_labels",
strategies=config["strategies"],
batch_size=config["batch_size"],
device="auto",
output_dir=output_dir,
)
runner.summary_.strategy_metrics
/opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/sklearn/metrics/_classification.py:2924: UserWarning: y_pred contains classes not in y_true
warnings.warn("y_pred contains classes not in y_true")
/opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/sklearn/metrics/_classification.py:2924: UserWarning: y_pred contains classes not in y_true
warnings.warn("y_pred contains classes not in y_true")
| | strategy | validation_accuracy | validation_macro_f1 | validation_balanced_accuracy | validation_auroc_ovr | test_accuracy | test_macro_f1 | test_balanced_accuracy | test_auroc_ovr | runtime_sec | trainable_parameters |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | frozen_probe | 0.0 | 0.0 | 0.0 | NaN | 0.5 | 0.166667 | 0.25 | NaN | 0.911852 | 0 |
| 1 | head | 0.0 | 0.0 | 0.0 | NaN | 0.5 | 0.166667 | 0.25 | NaN | 4.600107 | 6154 |
<Figure size 500x400 with 0 Axes>
Annotate the dataset with the best strategy#
The wrapper writes both predicted labels and the best latent embedding back into the AnnData object, which keeps the downstream Scanpy handoff straightforward.
runner.annotate_adata(
adata,
obs_key="scgpt_label",
embedding_key="X_scgpt_best",
)
adata.obs[["bulk_labels", "scgpt_label", "scgpt_label_confidence"]].head()
| index | bulk_labels | scgpt_label | scgpt_label_confidence |
|---|---|---|---|
| GAGCGCACAGAGGC-1 | CD8+ Cytotoxic T | Dendritic | 0.399711 |
| GGACGCACATCGTG-1 | CD19+ B | Dendritic | 0.381634 |
| GTGACCCTTAGAAG-1 | Dendritic | Dendritic | 0.390948 |
| GTTCATACGAACTC-1 | CD8+/CD45RA+ Naive Cytotoxic | Dendritic | 0.397679 |
| ACGGAACTGTAGCT-2 | CD19+ B | Dendritic | 0.369715 |
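The confidence column makes predictions worth reviewing easy to flag. A minimal sketch on a toy stand-in for `adata.obs` (in the notebook you would operate on `adata.obs` directly; the 0.5 cutoff and the `scgpt_label_reviewed` column name are tutorial assumptions, not library conventions):

```python
import pandas as pd

# Toy stand-in for adata.obs after annotate_adata(...).
obs = pd.DataFrame(
    {
        "scgpt_label": ["Dendritic", "Dendritic", "CD19+ B"],
        "scgpt_label_confidence": [0.40, 0.38, 0.72],
    }
)

# Keep each prediction but mark low-confidence calls for manual review.
threshold = 0.5  # assumption, not a library default
obs["scgpt_label_reviewed"] = obs["scgpt_label"].where(
    obs["scgpt_label_confidence"] >= threshold, other="unassigned"
)
print(obs["scgpt_label_reviewed"].tolist())
# → ['unassigned', 'unassigned', 'CD19+ B']
```

At the confidences recorded in this quickstart run (around 0.37 to 0.40), such a cutoff would mark every cell for review, which is itself a useful signal about the tiny tutorial subset.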
Save and reload the fitted runner#
This is the practical dataset-specific adaptation step in v0.1.7: after comparing strategies, save the best fitted wrapper state and reload it later without re-running the full benchmark.
save_dir = runner.save(output_dir / "best_model")
reloaded = AnnotationRunner.load(save_dir, device="auto")
original_predictions = runner.predict(adata)
reloaded_predictions = reloaded.predict(adata)
np.testing.assert_array_equal(
original_predictions["label_codes"], reloaded_predictions["label_codes"]
)
np.testing.assert_array_equal(original_predictions["labels"], reloaded_predictions["labels"])
np.testing.assert_allclose(
original_predictions["probabilities"],
reloaded_predictions["probabilities"],
atol=1e-6,
)
np.testing.assert_allclose(
original_predictions["latent"],
reloaded_predictions["latent"],
atol=1e-6,
)
{"best_strategy": runner.best_strategy_, "save_dir": str(save_dir)}
{'best_strategy': 'frozen_probe',
'save_dir': 'artifacts/scgpt_dataset_specific_annotation/best_model'}
What to inspect#
The strategy table shows which rung of the configured adaptation ladder wins: frozen probe, head-only tuning, or, in the full profile, LoRA.
The saved confusion matrix and UMAPs are there to make the comparison auditable instead of reducing the workflow to a single scalar score.
This remains an experimental wrapper. The point is to make adaptation easier to run and easier to inspect, not to claim that scGPT universally wins on every dataset.
summary_payload = {
"best_strategy": runner.summary_.best_strategy,
"label_categories": list(runner.summary_.label_categories),
"strategy_metrics_columns": runner.summary_.strategy_metrics.columns.tolist(),
}
(output_dir / "summary.json").write_text(json.dumps(summary_payload, indent=2), encoding="utf-8")
summary_payload
{'best_strategy': 'frozen_probe',
'label_categories': ['CD14+ Monocyte',
'CD19+ B',
'CD34+',
'CD4+/CD25 T Reg',
'CD4+/CD45RA+/CD25- Naive T',
'CD4+/CD45RO+ Memory',
'CD56+ NK',
'CD8+ Cytotoxic T',
'CD8+/CD45RA+ Naive Cytotoxic',
'Dendritic'],
'strategy_metrics_columns': ['strategy',
'validation_accuracy',
'validation_macro_f1',
'validation_balanced_accuracy',
'validation_auroc_ovr',
'test_accuracy',
'test_macro_f1',
'test_balanced_accuracy',
'test_auroc_ovr',
'runtime_sec',
'trainable_parameters']}
Stable output-path contract#
The notebook ends with a single dictionary that points to the files the docs and tutorial validation suite expect.
output_paths = {
"report_md": output_dir / "report.md",
"report_csv": output_dir / "report.csv",
"strategy_metrics_csv": output_dir / "strategy_metrics.csv",
"best_strategy_confusion_matrix": output_dir / "best_strategy_confusion_matrix.png",
"frozen_embedding_umap": output_dir / "frozen_embedding_umap.png",
"best_strategy_embedding_umap": output_dir / "best_strategy_embedding_umap.png",
"saved_model_manifest": output_dir / "best_model" / "manifest.json",
"saved_model_state": output_dir / "best_model" / "model_state.pt",
}
output_paths
{'report_md': PosixPath('artifacts/scgpt_dataset_specific_annotation/report.md'),
'report_csv': PosixPath('artifacts/scgpt_dataset_specific_annotation/report.csv'),
'strategy_metrics_csv': PosixPath('artifacts/scgpt_dataset_specific_annotation/strategy_metrics.csv'),
'best_strategy_confusion_matrix': PosixPath('artifacts/scgpt_dataset_specific_annotation/best_strategy_confusion_matrix.png'),
'frozen_embedding_umap': PosixPath('artifacts/scgpt_dataset_specific_annotation/frozen_embedding_umap.png'),
'best_strategy_embedding_umap': PosixPath('artifacts/scgpt_dataset_specific_annotation/best_strategy_embedding_umap.png'),
'saved_model_manifest': PosixPath('artifacts/scgpt_dataset_specific_annotation/best_model/manifest.json'),
'saved_model_state': PosixPath('artifacts/scgpt_dataset_specific_annotation/best_model/model_state.pt')}
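A quick way to use this contract in CI or a docs sync is to assert that every expected file exists on disk. A minimal sketch over an arbitrary name-to-path mapping; the `missing_artifacts` helper is ours, not part of scdlkit:

```python
import tempfile
from pathlib import Path

def missing_artifacts(output_paths: dict[str, Path]) -> list[str]:
    """Return the keys whose files do not exist on disk."""
    return [name for name, path in output_paths.items() if not path.exists()]

# Example with a temporary directory standing in for the real artifact dir.
with tempfile.TemporaryDirectory() as tmp:
    paths = {
        "report_md": Path(tmp) / "report.md",
        "strategy_metrics_csv": Path(tmp) / "strategy_metrics.csv",
    }
    paths["report_md"].write_text("# report\n", encoding="utf-8")
    print(missing_artifacts(paths))
    # → ['strategy_metrics_csv']
```

Against the real `output_paths` dictionary above, an empty return value corresponds to the "Artifact check in this sync: passed" status recorded in the publication header.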