Experimental scGPT PBMC embeddings#

This notebook shows the first experimental foundation-model workflow in scDLKit.

Audience:

  • researchers and analysts who already use Scanpy and want to evaluate a pretrained single-cell foundation model inside the same downstream workflow

Prerequisites:

  • pip install "scdlkit[foundation,tutorials]"

  • familiarity with AnnData, adata.obsm, neighbors, and UMAP in Scanpy

Learning goals:

  • prepare PBMC data for the official whole-human scGPT checkpoint

  • extract frozen cell embeddings through Trainer.predict_dataset(...)

  • write those embeddings back into adata.obsm

  • train a simple linear probe on the frozen embeddings and qualitatively assess embedding quality

Experimental scope:

  • embeddings only

  • whole-human checkpoint only

  • human scRNA-seq only

  • no fine-tuning in this release

What correctness means in this notebook:

  • the checkpoint loads successfully

  • embeddings are produced and can be handed back to Scanpy

  • the embedding's qualitative structure can be inspected against the known PBMC labels

  • this notebook does not claim that frozen scGPT is already the best baseline for every PBMC workflow
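The correctness bar above can be expressed as a small, library-agnostic sanity check before anything is handed back to Scanpy. The function name and array shapes here are illustrative, not part of scDLKit:

```python
import numpy as np

def check_embedding(emb, n_cells):
    """Minimal correctness checks before writing an embedding into adata.obsm."""
    assert emb.ndim == 2, "expected a 2-D (cells x dims) matrix"
    assert emb.shape[0] == n_cells, "row count must match adata.n_obs"
    assert np.isfinite(emb).all(), "embedding must contain no NaN or inf values"

# synthetic stand-in for predictions["latent"]
emb = np.random.default_rng(0).normal(size=(128, 512))
check_embedding(emb, n_cells=128)
print("embedding checks passed")
```

A check like this is cheap to run in the notebook itself and catches the most common failure modes (wrong orientation, NaNs from a broken checkpoint load) before neighbors/UMAP silently propagate them.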

Published tutorial status#

This page is a static notebook copy published for documentation review. It is meant to show the exact workflow and outputs from the last recorded run.

  • Last run date: 2026-03-27 09:23 UTC

  • Publication mode: static executed tutorial

  • Execution profile: published

  • Artifact check in this sync: passed

  • Source notebook: examples/scgpt_pbmc_embeddings.ipynb

from __future__ import annotations

import json
from pathlib import Path
from time import perf_counter

import numpy as np
import pandas as pd
import scanpy as sc
from matplotlib import pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

from scdlkit import Trainer
from scdlkit.evaluation import evaluate_predictions, save_markdown_report, save_metrics_table
from scdlkit.evaluation.metrics import classification_metrics
from scdlkit.foundation import load_scgpt_model, prepare_scgpt_data
from scdlkit.visualization.classification import plot_confusion_matrix

SEED = 42
TUTORIAL_PROFILE = "quickstart"  # "quickstart" or "full"
PROFILE = {
    "quickstart": {"max_cells": 128, "batch_size": 64},
    "full": {"max_cells": None, "batch_size": 64},
}[TUTORIAL_PROFILE]

OUTPUT_DIR = Path("artifacts/scgpt_pbmc_embeddings")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

OUTPUT_DIR
PosixPath('artifacts/scgpt_pbmc_embeddings')

Load PBMC data#

The quickstart profile uses a deterministic subset so the docs build stays CPU-friendly. The full profile keeps all cells from pbmc3k_processed().

adata = sc.datasets.pbmc3k_processed()

if PROFILE["max_cells"] is not None and adata.n_obs > PROFILE["max_cells"]:
    rng = np.random.default_rng(SEED)
    subset_indices = np.sort(rng.choice(adata.n_obs, size=PROFILE["max_cells"], replace=False))
    adata = adata[subset_indices].copy()
else:
    adata = adata.copy()

prepared = prepare_scgpt_data(
    adata,
    checkpoint="whole-human",
    label_key="louvain",
    batch_size=PROFILE["batch_size"],
    use_raw=True,
)

{
    "cells": int(adata.n_obs),
    "genes": int(adata.n_vars),
    "matched_genes": prepared.num_genes_matched,
    "checkpoint": prepared.checkpoint_id,
}
{'cells': 128,
 'genes': 1838,
 'matched_genes': 12300,
 'checkpoint': 'whole-human'}
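Note that matched_genes (12,300) exceeds adata.n_vars (1,838): with use_raw=True, gene matching plausibly runs against the full raw gene set rather than the HVG-filtered matrix. The matching itself amounts to a vocabulary intersection, sketched here with made-up gene lists (not the real scGPT vocabulary):

```python
# toy sketch of checkpoint gene matching (gene lists are illustrative only)
raw_genes = ["CD3D", "MS4A1", "NKG7", "GNLY", "FCGR3A", "FAKE1"]
checkpoint_vocab = {"CD3D", "MS4A1", "NKG7", "LYZ", "PPBP", "FCGR3A"}

# keep only genes the pretrained model has a token for
matched = [g for g in raw_genes if g in checkpoint_vocab]
print(len(matched), matched)  # 4 ['CD3D', 'MS4A1', 'NKG7', 'FCGR3A']
```

If matched_genes came back surprisingly low for your own data, the usual culprits are non-HGNC gene symbols or Ensembl IDs in var_names.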

Load the frozen checkpoint and extract embeddings#

This uses the official whole-human checkpoint in frozen inference mode. The supported surface in the current release line is Trainer.predict_dataset(...), not TaskRunner and not Trainer.fit(...).

model = load_scgpt_model("whole-human", device="auto")
trainer = Trainer(
    model=model,
    task="representation",
    batch_size=prepared.batch_size,
    device="auto",
    epochs=1,
)

started_at = perf_counter()
predictions = trainer.predict_dataset(prepared.dataset)
embedding_runtime_sec = perf_counter() - started_at

representation_metrics = evaluate_predictions("representation", predictions)
representation_metrics
{'silhouette': 0.2256391942501068,
 'knn_label_consistency': 0.9296875,
 'ari': 0.5768869998943564,
 'nmi': 0.7609582588732033}
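These representation metrics come from evaluate_predictions, but the silhouette score, for example, can be cross-checked directly with scikit-learn on the latent matrix. A minimal sketch on synthetic data (standing in for predictions["latent"] and its labels):

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# two well-separated synthetic clusters stand in for predictions["latent"]
latent = np.vstack([rng.normal(0.0, 0.1, (50, 8)), rng.normal(3.0, 0.1, (50, 8))])
labels = np.array([0] * 50 + [1] * 50)

score = silhouette_score(latent, labels)
print(round(float(score), 3))  # near 1.0 for well-separated clusters
```

The modest silhouette above (≈0.23) alongside high kNN label consistency (≈0.93) is a common pattern for frozen foundation-model embeddings: clusters are locally coherent but not globally compact.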

Return to the normal Scanpy downstream path#

The foundation-model step ends once embeddings are available. After that, the workflow is the familiar Scanpy path: store the embedding in adata.obsm, build neighbors, compute UMAP, and inspect structure against the known PBMC labels.

Recommended next tutorials:

  • PBMC model comparison for classical and deep-learning baselines

  • downstream Scanpy after scDLKit when you want clustering and marker interpretation on a learned embedding

adata.obsm["X_scgpt_whole_human"] = predictions["latent"]
sc.pp.neighbors(adata, use_rep="X_scgpt_whole_human")
sc.tl.umap(adata, random_state=SEED)

umap_fig = sc.pl.umap(adata, color="louvain", return_fig=True, frameon=False)
umap_fig.savefig(OUTPUT_DIR / "latent_umap.png", dpi=150, bbox_inches="tight")
plt.close(umap_fig)

labels = predictions["y"]
label_categories = list(pd.Categorical(adata.obs["louvain"].astype(str)).categories)
_, label_counts = np.unique(labels, return_counts=True)
probe_stratify = labels if int(label_counts.min()) >= 2 else None
train_x, test_x, train_y, test_y = train_test_split(
    predictions["latent"],
    labels,
    test_size=0.2,
    random_state=SEED,
    stratify=probe_stratify,
)
probe = LogisticRegression(max_iter=1000, random_state=SEED)
probe.fit(train_x, train_y)
probe_logits = probe.predict_proba(test_x)
probe_metrics = classification_metrics(test_y, probe_logits)
confusion_fig, _ = plot_confusion_matrix(
    probe_metrics["confusion_matrix"],
    class_names=label_categories,
)
confusion_fig.savefig(
    OUTPUT_DIR / "linear_probe_confusion_matrix.png",
    dpi=150,
    bbox_inches="tight",
)
plt.close(confusion_fig)
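The probe_stratify guard above exists because scikit-learn's train_test_split raises a ValueError when any stratification class has fewer than two members, which can easily happen with a 128-cell quickstart subset. A minimal demonstration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape(-1, 1)
y = np.array([0] * 9 + [1])  # class 1 has only one member

try:
    train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
    failed = False
except ValueError:
    failed = True  # stratified splitting needs >= 2 members per class

print("stratify raised ValueError:", failed)
```

Falling back to an unstratified split (stratify=None) keeps the cell runnable, at the cost of possibly dropping rare labels from the test fold.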

summary_metrics = {
    **representation_metrics,
    "probe_accuracy": float(probe_metrics["accuracy"]),
    "probe_macro_f1": float(probe_metrics["macro_f1"]),
    "num_genes_matched": int(prepared.num_genes_matched),
    "embedding_runtime_sec": float(embedding_runtime_sec),
    "cells": int(adata.n_obs),
}

save_markdown_report(
    summary_metrics,
    path=OUTPUT_DIR / "report.md",
    title="Experimental scGPT PBMC embedding report",
    extra_sections=[
        "## Notes",
        "",
        "- Experimental feature: embeddings only.",
        "- Checkpoint: `whole-human`.",
        "- Fine-tuning is intentionally deferred.",
    ],
)
save_metrics_table(summary_metrics, OUTPUT_DIR / "report.csv")
(OUTPUT_DIR / "embedding_summary.json").write_text(
    json.dumps(summary_metrics, indent=2),
    encoding="utf-8",
)

summary_metrics
{'silhouette': 0.2256391942501068,
 'knn_label_consistency': 0.9296875,
 'ari': 0.5768869998943564,
 'nmi': 0.7609582588732033,
 'probe_accuracy': 0.6538461538461539,
 'probe_macro_f1': 0.2724867724867725,
 'num_genes_matched': 12300,
 'embedding_runtime_sec': 75.44226392799999,
 'cells': 128}
output_paths = {
    "report_markdown": str(OUTPUT_DIR / "report.md"),
    "report_csv": str(OUTPUT_DIR / "report.csv"),
    "latent_umap": str(OUTPUT_DIR / "latent_umap.png"),
    "linear_probe_confusion_matrix": str(OUTPUT_DIR / "linear_probe_confusion_matrix.png"),
    "embedding_summary": str(OUTPUT_DIR / "embedding_summary.json"),
}
output_paths
{'report_markdown': 'artifacts/scgpt_pbmc_embeddings/report.md',
 'report_csv': 'artifacts/scgpt_pbmc_embeddings/report.csv',
 'latent_umap': 'artifacts/scgpt_pbmc_embeddings/latent_umap.png',
 'linear_probe_confusion_matrix': 'artifacts/scgpt_pbmc_embeddings/linear_probe_confusion_matrix.png',
 'embedding_summary': 'artifacts/scgpt_pbmc_embeddings/embedding_summary.json'}