Experimental scGPT dataset-specific annotation#

Who this notebook is for#

This notebook is for users who already understand the experimental scGPT PBMC annotation tutorial and want an easier, wrapper-first way to adapt scGPT to a second labeled human dataset.

Prerequisites#

  • Install scdlkit[foundation,tutorials]

  • Be comfortable working with AnnData

  • Expect this workflow to stay experimental in v0.1.7
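The install from the first prerequisite is a single pip command. Quoting the extras spec keeps some shells (notably zsh) from treating the square brackets as a glob pattern:

```shell
pip install "scdlkit[foundation,tutorials]"
```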

What you will learn#

  • how to inspect a labeled dataset before adaptation

  • how to run the wrapper-first adapt_annotation(...) workflow

  • how to compare frozen probe and head-only tuning in the quickstart path, with LoRA available in the full profile

  • how to write predictions and embeddings back into AnnData

  • how to save and reload the fitted runner

Out of scope#

  • full-backbone fine-tuning

  • non-human data

  • checkpoints other than whole-human

  • perturbation, spatial, or multimodal workflows

Outline#

  1. load a reproducible PBMC dataset

  2. take a CPU-friendly stratified subset for the tutorial profile

  3. inspect dataset compatibility with scGPT

  4. run the one-shot adaptation wrapper

  5. annotate the dataset with the best strategy

  6. save and reload the fitted runner

  7. verify the reload path and review the saved artifacts

Expected artifacts#

  • artifacts/scgpt_dataset_specific_annotation/report.md

  • artifacts/scgpt_dataset_specific_annotation/report.csv

  • artifacts/scgpt_dataset_specific_annotation/strategy_metrics.csv

  • artifacts/scgpt_dataset_specific_annotation/best_strategy_confusion_matrix.png

  • artifacts/scgpt_dataset_specific_annotation/frozen_embedding_umap.png

  • artifacts/scgpt_dataset_specific_annotation/best_strategy_embedding_umap.png

  • artifacts/scgpt_dataset_specific_annotation/best_model/manifest.json

  • artifacts/scgpt_dataset_specific_annotation/best_model/model_state.pt
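A quick way to confirm a run produced this contract is a small path check over the same relative names. This is a minimal sketch using only the standard library; the `missing_artifacts` helper is illustrative, not part of the scdlkit API:

```python
from pathlib import Path

# Relative artifact paths from this tutorial's output contract.
EXPECTED = [
    "report.md",
    "report.csv",
    "strategy_metrics.csv",
    "best_strategy_confusion_matrix.png",
    "frozen_embedding_umap.png",
    "best_strategy_embedding_umap.png",
    "best_model/manifest.json",
    "best_model/model_state.pt",
]


def missing_artifacts(output_dir):
    """Return the expected relative paths that do not exist under output_dir."""
    root = Path(output_dir)
    return [rel for rel in EXPECTED if not (root / rel).exists()]
```

Running `missing_artifacts("artifacts/scgpt_dataset_specific_annotation")` after the notebook should return an empty list.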

Next tutorial#

After this notebook, return to the lower-level scGPT annotation tutorial when you need tighter control over the training surface or want to inspect the raw Trainer path directly.

Next step#

  • lower-level scGPT route: examples/scgpt_cell_type_annotation.ipynb

  • API pages: docs/api/annotation.md and docs/api/foundation.md

Published tutorial status#

This page is a static notebook copy published for documentation review. It is meant to show the exact workflow and outputs from the last recorded run.

  • Last run date (UTC): 2026-03-27 09:22

  • Publication mode: static executed tutorial

  • Execution profile: published

  • Artifact check in this sync: passed

  • Source notebook: examples/scgpt_dataset_specific_annotation.ipynb

import json
from pathlib import Path

import numpy as np
import pandas as pd
import scanpy as sc
from scipy import sparse
from sklearn.model_selection import train_test_split

from scdlkit import (
    AnnotationRunner,
    adapt_annotation,
    inspect_annotation_data,
)

PROFILE = "quickstart"
CONFIGS = {
    "quickstart": {
        "seed": 42,
        "batch_size": 32,
        "max_cells": 64,
        "max_genes": 48,
        "strategies": ("frozen_probe", "head"),
    },
    "full": {
        "seed": 42,
        "batch_size": 64,
        "max_cells": 128,
        "max_genes": 96,
        "strategies": ("frozen_probe", "head", "lora"),
    },
}
config = CONFIGS[PROFILE]
output_dir = Path("artifacts/scgpt_dataset_specific_annotation")
output_dir.mkdir(parents=True, exist_ok=True)
np.random.seed(config["seed"])

Load a reproducible labeled dataset#

The public API is meant for arbitrary labeled human AnnData, but the tutorial stays reproducible by using scanpy.datasets.pbmc68k_reduced() with bulk_labels as the coarse annotation target.

The quickstart profile keeps the run CPU-friendly by taking a stratified subset of cells, a high-variance subset of genes, and the lighter frozen-probe plus head-only strategy ladder. This is a tutorial runtime choice, not a limit of the wrapper API itself. The full profile adds LoRA back into the comparison.

pbmc68k_reduced() also contains centered expression values. Because the current public scGPT path requires non-negative inputs, the tutorial shifts the subset to a non-negative matrix before adaptation. That compatibility step is explicit here instead of being hidden inside the library.

adata = sc.datasets.pbmc68k_reduced()
adata.obs["bulk_labels"] = adata.obs["bulk_labels"].astype(str)

if adata.n_obs > config["max_cells"]:
    indices = np.arange(adata.n_obs)
    keep_indices, _ = train_test_split(
        indices,
        train_size=config["max_cells"],
        random_state=config["seed"],
        stratify=adata.obs["bulk_labels"],
    )
    adata = adata[np.sort(keep_indices)].copy()

if adata.n_vars > config["max_genes"]:
    dense = adata.X.toarray() if sparse.issparse(adata.X) else np.asarray(adata.X)
    keep_genes = np.argsort(np.var(dense, axis=0))[-config["max_genes"] :]
    adata = adata[:, np.sort(keep_genes)].copy()

dense_matrix = (
    adata.X.toarray() if sparse.issparse(adata.X) else np.asarray(adata.X, dtype="float32")
)
minimum_value = float(dense_matrix.min())
if minimum_value < 0.0:
    dense_matrix = dense_matrix - minimum_value
adata.X = dense_matrix.astype("float32", copy=False)
adata.raw = adata.copy()

print(
    {
        "profile": PROFILE,
        "cells": int(adata.n_obs),
        "genes": int(adata.n_vars),
        "label_key": "bulk_labels",
        "min_value_after_shift": float(np.min(adata.X)),
        "labels": sorted(adata.obs["bulk_labels"].astype(str).unique().tolist()),
    }
)
{'profile': 'quickstart', 'cells': 64, 'genes': 48, 'label_key': 'bulk_labels', 'min_value_after_shift': 0.0, 'labels': ['CD14+ Monocyte', 'CD19+ B', 'CD34+', 'CD4+/CD25 T Reg', 'CD4+/CD45RA+/CD25- Naive T', 'CD4+/CD45RO+ Memory', 'CD56+ NK', 'CD8+ Cytotoxic T', 'CD8+/CD45RA+ Naive Cytotoxic', 'Dendritic']}

Inspect the dataset before adaptation#

This preflight report is the wrapper-first way to answer a practical question: is the dataset a reasonable candidate for scGPT annotation adaptation, or are there obvious overlap or class-balance problems to resolve first?

report = inspect_annotation_data(
    adata,
    label_key="bulk_labels",
    checkpoint="whole-human",
    min_gene_overlap=min(500, adata.n_vars),
)
report_frame = pd.DataFrame(
    {
        "field": [
            "checkpoint_id",
            "num_cells",
            "num_input_genes",
            "num_genes_matched",
            "gene_overlap_ratio",
            "min_class_count",
            "stratify_possible",
            "label_categories",
            "warnings",
        ],
        "value": [
            report.checkpoint_id,
            report.num_cells,
            report.num_input_genes,
            report.num_genes_matched,
            round(report.gene_overlap_ratio, 4),
            report.min_class_count,
            report.stratify_possible,
            report.label_categories,
            report.warnings,
        ],
    }
)
report_frame
field value
0 checkpoint_id whole-human
1 num_cells 64
2 num_input_genes 48
3 num_genes_matched 45
4 gene_overlap_ratio 0.9375
5 min_class_count 1
6 stratify_possible False
7 label_categories (CD14+ Monocyte, CD19+ B, CD34+, CD4+/CD25 T R...
8 warnings (Gene overlap with the scGPT checkpoint vocabu...

Run the one-shot wrapper#

adapt_annotation(...) is the simplest public path in this release. In the quickstart profile it compares frozen probe and head-only tuning so the executed notebook stays CPU-practical. The full profile adds LoRA back into the comparison. In both cases the wrapper saves the standard report artifacts and keeps the best fitted strategy in memory for prediction and annotation.

runner = adapt_annotation(
    adata,
    label_key="bulk_labels",
    strategies=config["strategies"],
    batch_size=config["batch_size"],
    device="auto",
    output_dir=output_dir,
)
runner.summary_.strategy_metrics
/opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/sklearn/metrics/_classification.py:2924: UserWarning: y_pred contains classes not in y_true
  warnings.warn("y_pred contains classes not in y_true")
/opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/sklearn/metrics/_classification.py:2924: UserWarning: y_pred contains classes not in y_true
  warnings.warn("y_pred contains classes not in y_true")
strategy validation_accuracy validation_macro_f1 validation_balanced_accuracy validation_auroc_ovr test_accuracy test_macro_f1 test_balanced_accuracy test_auroc_ovr runtime_sec trainable_parameters
0 frozen_probe 0.0 0.0 0.0 NaN 0.5 0.166667 0.25 NaN 0.911852 0
1 head 0.0 0.0 0.0 NaN 0.5 0.166667 0.25 NaN 4.600107 6154

Annotate the dataset with the best strategy#

The wrapper writes both predicted labels and the best latent embedding back into the AnnData object, which keeps the downstream Scanpy handoff straightforward.

runner.annotate_adata(
    adata,
    obs_key="scgpt_label",
    embedding_key="X_scgpt_best",
)
adata.obs[["bulk_labels", "scgpt_label", "scgpt_label_confidence"]].head()
bulk_labels scgpt_label scgpt_label_confidence
index
GAGCGCACAGAGGC-1 CD8+ Cytotoxic T Dendritic 0.399711
GGACGCACATCGTG-1 CD19+ B Dendritic 0.381634
GTGACCCTTAGAAG-1 Dendritic Dendritic 0.390948
GTTCATACGAACTC-1 CD8+/CD45RA+ Naive Cytotoxic Dendritic 0.397679
ACGGAACTGTAGCT-2 CD19+ B Dendritic 0.369715
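Once the predictions and confidences live in `adata.obs`, ordinary pandas operations cover the first sanity checks. The sketch below rebuilds the `head()` rows shown above as a plain DataFrame (so it runs without AnnData) and computes per-predicted-label mean confidence plus raw agreement with the reference labels:

```python
import pandas as pd

# The obs columns written back by annotate_adata, reproduced from the
# head() output above as a standalone DataFrame.
obs = pd.DataFrame(
    {
        "bulk_labels": [
            "CD8+ Cytotoxic T",
            "CD19+ B",
            "Dendritic",
            "CD8+/CD45RA+ Naive Cytotoxic",
            "CD19+ B",
        ],
        "scgpt_label": ["Dendritic"] * 5,
        "scgpt_label_confidence": [0.399711, 0.381634, 0.390948, 0.397679, 0.369715],
    }
)

# Mean confidence per predicted label, and agreement with the reference labels.
mean_conf = obs.groupby("scgpt_label")["scgpt_label_confidence"].mean()
agreement = (obs["bulk_labels"] == obs["scgpt_label"]).mean()
```

On the full `adata.obs` the same two lines give a per-class confidence profile and an overall agreement rate against `bulk_labels`.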

Save and reload the fitted runner#

This is the practical dataset-specific adaptation step in v0.1.7: after comparing strategies, save the best fitted wrapper state and reload it later without re-running the full benchmark.

save_dir = runner.save(output_dir / "best_model")
reloaded = AnnotationRunner.load(save_dir, device="auto")

original_predictions = runner.predict(adata)
reloaded_predictions = reloaded.predict(adata)

np.testing.assert_array_equal(
    original_predictions["label_codes"], reloaded_predictions["label_codes"]
)
np.testing.assert_array_equal(original_predictions["labels"], reloaded_predictions["labels"])
np.testing.assert_allclose(
    original_predictions["probabilities"],
    reloaded_predictions["probabilities"],
    atol=1e-6,
)
np.testing.assert_allclose(
    original_predictions["latent"],
    reloaded_predictions["latent"],
    atol=1e-6,
)

{"best_strategy": runner.best_strategy_, "save_dir": str(save_dir)}
{'best_strategy': 'frozen_probe',
 'save_dir': 'artifacts/scgpt_dataset_specific_annotation/best_model'}

What to inspect#

  • The strategy table should show which rung of the configured adaptation ladder wins: frozen probe, head-only tuning, or, in the full profile, LoRA.

  • The saved confusion matrix and UMAPs are there to make the comparison auditable instead of reducing the workflow to a single scalar score.

  • This remains an experimental wrapper. The point is to make adaptation easier to run and easier to inspect, not to claim that scGPT universally wins on every dataset.
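Reading the strategy table programmatically is straightforward. The sketch below rebuilds the test-split columns from the run above and picks a winner; the runtime tie-break is an assumption for illustration (the metrics tie here), not necessarily the library's own selection rule, though it does reproduce the recorded `frozen_probe` choice:

```python
import pandas as pd

# Test-split metrics as reported in the strategy table above.
metrics = pd.DataFrame(
    {
        "strategy": ["frozen_probe", "head"],
        "test_accuracy": [0.5, 0.5],
        "test_macro_f1": [0.166667, 0.166667],
        "runtime_sec": [0.911852, 4.600107],
    }
)

# Highest macro F1 wins; on a tie, prefer the cheaper strategy.
best = (
    metrics.sort_values(["test_macro_f1", "runtime_sec"], ascending=[False, True])
    .iloc[0]["strategy"]
)
```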

summary_payload = {
    "best_strategy": runner.summary_.best_strategy,
    "label_categories": list(runner.summary_.label_categories),
    "strategy_metrics_columns": runner.summary_.strategy_metrics.columns.tolist(),
}
(output_dir / "summary.json").write_text(json.dumps(summary_payload, indent=2), encoding="utf-8")
summary_payload
{'best_strategy': 'frozen_probe',
 'label_categories': ['CD14+ Monocyte',
  'CD19+ B',
  'CD34+',
  'CD4+/CD25 T Reg',
  'CD4+/CD45RA+/CD25- Naive T',
  'CD4+/CD45RO+ Memory',
  'CD56+ NK',
  'CD8+ Cytotoxic T',
  'CD8+/CD45RA+ Naive Cytotoxic',
  'Dendritic'],
 'strategy_metrics_columns': ['strategy',
  'validation_accuracy',
  'validation_macro_f1',
  'validation_balanced_accuracy',
  'validation_auroc_ovr',
  'test_accuracy',
  'test_macro_f1',
  'test_balanced_accuracy',
  'test_auroc_ovr',
  'runtime_sec',
  'trainable_parameters']}

Stable output-path contract#

The notebook ends with a single dictionary that points to the files the docs and tutorial validation suite expect.

output_paths = {
    "report_md": output_dir / "report.md",
    "report_csv": output_dir / "report.csv",
    "strategy_metrics_csv": output_dir / "strategy_metrics.csv",
    "best_strategy_confusion_matrix": output_dir / "best_strategy_confusion_matrix.png",
    "frozen_embedding_umap": output_dir / "frozen_embedding_umap.png",
    "best_strategy_embedding_umap": output_dir / "best_strategy_embedding_umap.png",
    "saved_model_manifest": output_dir / "best_model" / "manifest.json",
    "saved_model_state": output_dir / "best_model" / "model_state.pt",
}
output_paths
{'report_md': PosixPath('artifacts/scgpt_dataset_specific_annotation/report.md'),
 'report_csv': PosixPath('artifacts/scgpt_dataset_specific_annotation/report.csv'),
 'strategy_metrics_csv': PosixPath('artifacts/scgpt_dataset_specific_annotation/strategy_metrics.csv'),
 'best_strategy_confusion_matrix': PosixPath('artifacts/scgpt_dataset_specific_annotation/best_strategy_confusion_matrix.png'),
 'frozen_embedding_umap': PosixPath('artifacts/scgpt_dataset_specific_annotation/frozen_embedding_umap.png'),
 'best_strategy_embedding_umap': PosixPath('artifacts/scgpt_dataset_specific_annotation/best_strategy_embedding_umap.png'),
 'saved_model_manifest': PosixPath('artifacts/scgpt_dataset_specific_annotation/best_model/manifest.json'),
 'saved_model_state': PosixPath('artifacts/scgpt_dataset_specific_annotation/best_model/model_state.pt')}