Experimental scGPT dataset-specific annotation#

Who this notebook is for#

This notebook is for users who already understand the experimental scGPT PBMC annotation tutorial and want an easier, wrapper-first way to adapt scGPT to a second labeled human dataset.

Prerequisites#

  • Install scdlkit[foundation,tutorials]

  • Be comfortable working with AnnData

  • Expect this workflow to stay experimental in v0.1.7
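The install from the first prerequisite is a single pip command. Quoting the extras spec keeps some shells (notably zsh) from treating the square brackets as a glob pattern:

```shell
pip install "scdlkit[foundation,tutorials]"
```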

What you will learn#

  • how to inspect a labeled dataset before adaptation

  • how to run the wrapper-first adapt_annotation(...) workflow

  • how to compare frozen probe and head-only tuning in the quickstart path, with LoRA available in the full profile

  • how to write predictions and embeddings back into AnnData

  • how to save and reload the fitted runner

Out of scope#

  • full-backbone fine-tuning

  • non-human data

  • checkpoints other than whole-human

  • perturbation, spatial, or multimodal workflows

Outline#

  1. load a reproducible PBMC dataset

  2. take a CPU-friendly stratified subset for the tutorial profile

  3. inspect dataset compatibility with scGPT

  4. run the one-shot adaptation wrapper

  5. annotate the dataset with the best strategy

  6. save and reload the fitted runner

  7. verify the reload path and review the saved artifacts

Expected artifacts#

  • artifacts/scgpt_dataset_specific_annotation/report.md

  • artifacts/scgpt_dataset_specific_annotation/report.csv

  • artifacts/scgpt_dataset_specific_annotation/strategy_metrics.csv

  • artifacts/scgpt_dataset_specific_annotation/best_strategy_confusion_matrix.png

  • artifacts/scgpt_dataset_specific_annotation/frozen_embedding_umap.png

  • artifacts/scgpt_dataset_specific_annotation/best_strategy_embedding_umap.png

  • artifacts/scgpt_dataset_specific_annotation/best_model/manifest.json

  • artifacts/scgpt_dataset_specific_annotation/best_model/model_state.pt
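A quick way to confirm a run produced this contract is a small path check over the same relative names. This is a minimal sketch using only the standard library; the `missing_artifacts` helper is illustrative, not part of the scdlkit API:

```python
from pathlib import Path

# Relative artifact paths from this tutorial's output contract.
EXPECTED = [
    "report.md",
    "report.csv",
    "strategy_metrics.csv",
    "best_strategy_confusion_matrix.png",
    "frozen_embedding_umap.png",
    "best_strategy_embedding_umap.png",
    "best_model/manifest.json",
    "best_model/model_state.pt",
]


def missing_artifacts(output_dir):
    """Return the expected relative paths that do not exist under output_dir."""
    root = Path(output_dir)
    return [rel for rel in EXPECTED if not (root / rel).exists()]
```

Running `missing_artifacts("artifacts/scgpt_dataset_specific_annotation")` after the notebook should return an empty list.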

Next tutorial#

After this notebook, return to the lower-level scGPT annotation tutorial when you need tighter control over the training surface or want to inspect the raw Trainer path directly.

Next step#

  • lower-level scGPT route: examples/scgpt_cell_type_annotation.ipynb

  • API pages: docs/api/annotation.md and docs/api/foundation.md

Published tutorial status#

This page is a static notebook copy published for documentation review. It is meant to show the exact workflow and outputs from the last recorded run.

  • Last run date (UTC): 2026-03-27 09:22

  • Publication mode: static executed tutorial

  • Execution profile: published

  • Artifact check in this sync: passed

  • Source notebook: examples/scgpt_dataset_specific_annotation.ipynb

import json
from pathlib import Path

import numpy as np
import pandas as pd
import scanpy as sc
from scipy import sparse
from sklearn.model_selection import train_test_split

from scdlkit import (
    AnnotationRunner,
    adapt_annotation,
    inspect_annotation_data,
)

PROFILE = "quickstart"
CONFIGS = {
    "quickstart": {
        "seed": 42,
        "batch_size": 32,
        "max_cells": 64,
        "max_genes": 48,
        "strategies": ("frozen_probe", "head"),
    },
    "full": {
        "seed": 42,
        "batch_size": 64,
        "max_cells": 128,
        "max_genes": 96,
        "strategies": ("frozen_probe", "head", "lora"),
    },
}
config = CONFIGS[PROFILE]
output_dir = Path("artifacts/scgpt_dataset_specific_annotation")
output_dir.mkdir(parents=True, exist_ok=True)
np.random.seed(config["seed"])

Load a reproducible labeled dataset#

The public API is meant for arbitrary labeled human AnnData, but the tutorial stays reproducible by using scanpy.datasets.pbmc68k_reduced() with bulk_labels as the coarse annotation target.

The quickstart profile keeps the run CPU-friendly by taking a stratified subset of cells, a high-variance subset of genes, and the lighter frozen-probe plus head-only strategy ladder. This is a tutorial runtime choice, not a limit of the wrapper API itself. The full profile adds LoRA back into the comparison.

pbmc68k_reduced() also contains centered expression values. Because the current public scGPT path requires non-negative inputs, the tutorial shifts the subset to a non-negative matrix before adaptation. That compatibility step is explicit here instead of being hidden inside the library.

adata = sc.datasets.pbmc68k_reduced()
adata.obs["bulk_labels"] = adata.obs["bulk_labels"].astype(str)

if adata.n_obs > config["max_cells"]:
    indices = np.arange(adata.n_obs)
    keep_indices, _ = train_test_split(
        indices,
        train_size=config["max_cells"],
        random_state=config["seed"],
        stratify=adata.obs["bulk_labels"],
    )
    adata = adata[np.sort(keep_indices)].copy()

if adata.n_vars > config["max_genes"]:
    dense = adata.X.toarray() if sparse.issparse(adata.X) else np.asarray(adata.X)
    keep_genes = np.argsort(np.var(dense, axis=0))[-config["max_genes"] :]
    adata = adata[:, np.sort(keep_genes)].copy()

dense_matrix = (
    adata.X.toarray() if sparse.issparse(adata.X) else np.asarray(adata.X, dtype="float32")
)
minimum_value = float(dense_matrix.min())
if minimum_value < 0.0:
    dense_matrix = dense_matrix - minimum_value
adata.X = dense_matrix.astype("float32", copy=False)
adata.raw = adata.copy()

print(
    {
        "profile": PROFILE,
        "cells": int(adata.n_obs),
        "genes": int(adata.n_vars),
        "label_key": "bulk_labels",
        "min_value_after_shift": float(np.min(adata.X)),
        "labels": sorted(adata.obs["bulk_labels"].astype(str).unique().tolist()),
    }
)
{'profile': 'quickstart', 'cells': 64, 'genes': 48, 'label_key': 'bulk_labels', 'min_value_after_shift': 0.0, 'labels': ['CD14+ Monocyte', 'CD19+ B', 'CD34+', 'CD4+/CD25 T Reg', 'CD4+/CD45RA+/CD25- Naive T', 'CD4+/CD45RO+ Memory', 'CD56+ NK', 'CD8+ Cytotoxic T', 'CD8+/CD45RA+ Naive Cytotoxic', 'Dendritic']}

Inspect the dataset before adaptation#

This preflight report is the wrapper-first way to answer a practical question: is the dataset a reasonable candidate for scGPT annotation adaptation, or are there obvious overlap or class-balance problems to resolve first?

report = inspect_annotation_data(
    adata,
    label_key="bulk_labels",
    checkpoint="whole-human",
    min_gene_overlap=min(500, adata.n_vars),
)
report_frame = pd.DataFrame(
    {
        "field": [
            "checkpoint_id",
            "num_cells",
            "num_input_genes",
            "num_genes_matched",
            "gene_overlap_ratio",
            "min_class_count",
            "stratify_possible",
            "label_categories",
            "warnings",
        ],
        "value": [
            report.checkpoint_id,
            report.num_cells,
            report.num_input_genes,
            report.num_genes_matched,
            round(report.gene_overlap_ratio, 4),
            report.min_class_count,
            report.stratify_possible,
            report.label_categories,
            report.warnings,
        ],
    }
)
report_frame
field value
0 checkpoint_id whole-human
1 num_cells 64
2 num_input_genes 48
3 num_genes_matched 45
4 gene_overlap_ratio 0.9375
5 min_class_count 1
6 stratify_possible False
7 label_categories (CD14+ Monocyte, CD19+ B, CD34+, CD4+/CD25 T R...
8 warnings (Gene overlap with the scGPT checkpoint vocabu...

Run the one-shot wrapper#

adapt_annotation(...) is the simplest public path in this release. In the quickstart profile it compares frozen probe and head-only tuning so the executed notebook stays CPU-practical. The full profile adds LoRA back into the comparison. In both cases the wrapper saves the standard report artifacts and keeps the best fitted strategy in memory for prediction and annotation.

runner = adapt_annotation(
    adata,
    label_key="bulk_labels",
    strategies=config["strategies"],
    batch_size=config["batch_size"],
    device="auto",
    output_dir=output_dir,
)
runner.summary_.strategy_metrics
/opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/sklearn/metrics/_classification.py:2924: UserWarning: y_pred contains classes not in y_true
  warnings.warn("y_pred contains classes not in y_true")
/opt/hostedtoolcache/Python/3.11.15/x64/lib/python3.11/site-packages/sklearn/metrics/_classification.py:2924: UserWarning: y_pred contains classes not in y_true
  warnings.warn("y_pred contains classes not in y_true")
strategy validation_accuracy validation_macro_f1 validation_balanced_accuracy validation_auroc_ovr test_accuracy test_macro_f1 test_balanced_accuracy test_auroc_ovr runtime_sec trainable_parameters
0 frozen_probe 0.0 0.0 0.0 NaN 0.5 0.166667 0.25 NaN 0.911852 0
1 head 0.0 0.0 0.0 NaN 0.5 0.166667 0.25 NaN 4.600107 6154

Annotate the dataset with the best strategy#

The wrapper writes both predicted labels and the best latent embedding back into the AnnData object, which keeps the downstream Scanpy handoff straightforward.

runner.annotate_adata(
    adata,
    obs_key="scgpt_label",
    embedding_key="X_scgpt_best",
)
adata.obs[["bulk_labels", "scgpt_label", "scgpt_label_confidence"]].head()
bulk_labels scgpt_label scgpt_label_confidence
index
GAGCGCACAGAGGC-1 CD8+ Cytotoxic T Dendritic 0.399711
GGACGCACATCGTG-1 CD19+ B Dendritic 0.381634
GTGACCCTTAGAAG-1 Dendritic Dendritic 0.390948
GTTCATACGAACTC-1 CD8+/CD45RA+ Naive Cytotoxic Dendritic 0.397679
ACGGAACTGTAGCT-2 CD19+ B Dendritic 0.369715
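Once the predictions and confidences live in `adata.obs`, ordinary pandas operations cover the first sanity checks. The sketch below rebuilds the `head()` rows shown above as a plain DataFrame (so it runs without AnnData) and computes per-predicted-label mean confidence plus raw agreement with the reference labels:

```python
import pandas as pd

# The obs columns written back by annotate_adata, reproduced from the
# head() output above as a standalone DataFrame.
obs = pd.DataFrame(
    {
        "bulk_labels": [
            "CD8+ Cytotoxic T",
            "CD19+ B",
            "Dendritic",
            "CD8+/CD45RA+ Naive Cytotoxic",
            "CD19+ B",
        ],
        "scgpt_label": ["Dendritic"] * 5,
        "scgpt_label_confidence": [0.399711, 0.381634, 0.390948, 0.397679, 0.369715],
    }
)

# Mean confidence per predicted label, and agreement with the reference labels.
mean_conf = obs.groupby("scgpt_label")["scgpt_label_confidence"].mean()
agreement = (obs["bulk_labels"] == obs["scgpt_label"]).mean()
```

On the full `adata.obs` the same two lines give a per-class confidence profile and an overall agreement rate against `bulk_labels`.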

Save and reload the fitted runner#

This is the practical dataset-specific adaptation step in v0.1.7: after comparing strategies, save the best fitted wrapper state and reload it later without re-running the full benchmark.

save_dir = runner.save(output_dir / "best_model")
reloaded = AnnotationRunner.load(save_dir, device="auto")

original_predictions = runner.predict(adata)
reloaded_predictions = reloaded.predict(adata)

np.testing.assert_array_equal(
    original_predictions["label_codes"], reloaded_predictions["label_codes"]
)
np.testing.assert_array_equal(original_predictions["labels"], reloaded_predictions["labels"])
np.testing.assert_allclose(
    original_predictions["probabilities"],
    reloaded_predictions["probabilities"],
    atol=1e-6,
)
np.testing.assert_allclose(
    original_predictions["latent"],
    reloaded_predictions["latent"],
    atol=1e-6,
)

{"best_strategy": runner.best_strategy_, "save_dir": str(save_dir)}
{'best_strategy': 'frozen_probe',
 'save_dir': 'artifacts/scgpt_dataset_specific_annotation/best_model'}

What to inspect#

  • The strategy table should show which rung of the configured adaptation ladder wins: frozen probe, head-only tuning, or, in the full profile, LoRA.

  • The saved confusion matrix and UMAPs are there to make the comparison auditable instead of reducing the workflow to a single scalar score.

  • This remains an experimental wrapper. The point is to make adaptation easier to run and easier to inspect, not to claim that scGPT universally wins on every dataset.
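Reading the strategy table programmatically is straightforward. The sketch below rebuilds the test-split columns from the run above and picks a winner; the runtime tie-break is an assumption for illustration (the metrics tie here), not necessarily the library's own selection rule, though it does reproduce the recorded `frozen_probe` choice:

```python
import pandas as pd

# Test-split metrics as reported in the strategy table above.
metrics = pd.DataFrame(
    {
        "strategy": ["frozen_probe", "head"],
        "test_accuracy": [0.5, 0.5],
        "test_macro_f1": [0.166667, 0.166667],
        "runtime_sec": [0.911852, 4.600107],
    }
)

# Highest macro F1 wins; on a tie, prefer the cheaper strategy.
best = (
    metrics.sort_values(["test_macro_f1", "runtime_sec"], ascending=[False, True])
    .iloc[0]["strategy"]
)
```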

summary_payload = {
    "best_strategy": runner.summary_.best_strategy,
    "label_categories": list(runner.summary_.label_categories),
    "strategy_metrics_columns": runner.summary_.strategy_metrics.columns.tolist(),
}
(output_dir / "summary.json").write_text(json.dumps(summary_payload, indent=2), encoding="utf-8")
summary_payload
{'best_strategy': 'frozen_probe',
 'label_categories': ['CD14+ Monocyte',
  'CD19+ B',
  'CD34+',
  'CD4+/CD25 T Reg',
  'CD4+/CD45RA+/CD25- Naive T',
  'CD4+/CD45RO+ Memory',
  'CD56+ NK',
  'CD8+ Cytotoxic T',
  'CD8+/CD45RA+ Naive Cytotoxic',
  'Dendritic'],
 'strategy_metrics_columns': ['strategy',
  'validation_accuracy',
  'validation_macro_f1',
  'validation_balanced_accuracy',
  'validation_auroc_ovr',
  'test_accuracy',
  'test_macro_f1',
  'test_balanced_accuracy',
  'test_auroc_ovr',
  'runtime_sec',
  'trainable_parameters']}

Stable output-path contract#

The notebook ends with a single dictionary that points to the files the docs and tutorial validation suite expect.

output_paths = {
    "report_md": output_dir / "report.md",
    "report_csv": output_dir / "report.csv",
    "strategy_metrics_csv": output_dir / "strategy_metrics.csv",
    "best_strategy_confusion_matrix": output_dir / "best_strategy_confusion_matrix.png",
    "frozen_embedding_umap": output_dir / "frozen_embedding_umap.png",
    "best_strategy_embedding_umap": output_dir / "best_strategy_embedding_umap.png",
    "saved_model_manifest": output_dir / "best_model" / "manifest.json",
    "saved_model_state": output_dir / "best_model" / "model_state.pt",
}
output_paths
{'report_md': PosixPath('artifacts/scgpt_dataset_specific_annotation/report.md'),
 'report_csv': PosixPath('artifacts/scgpt_dataset_specific_annotation/report.csv'),
 'strategy_metrics_csv': PosixPath('artifacts/scgpt_dataset_specific_annotation/strategy_metrics.csv'),
 'best_strategy_confusion_matrix': PosixPath('artifacts/scgpt_dataset_specific_annotation/best_strategy_confusion_matrix.png'),
 'frozen_embedding_umap': PosixPath('artifacts/scgpt_dataset_specific_annotation/frozen_embedding_umap.png'),
 'best_strategy_embedding_umap': PosixPath('artifacts/scgpt_dataset_specific_annotation/best_strategy_embedding_umap.png'),
 'saved_model_manifest': PosixPath('artifacts/scgpt_dataset_specific_annotation/best_model/manifest.json'),
 'saved_model_state': PosixPath('artifacts/scgpt_dataset_specific_annotation/best_model/model_state.pt')}