Experimental scGPT PBMC embeddings#

This notebook shows the first experimental foundation-model workflow in scDLKit.

Audience:

  • researchers and analysts who already use Scanpy and want to evaluate a pretrained single-cell foundation model inside the same downstream workflow

Prerequisites:

  • pip install "scdlkit[foundation,tutorials]"

  • familiarity with AnnData, adata.obsm, neighbors, and UMAP in Scanpy

Learning goals:

  • prepare PBMC data for the official whole-human scGPT checkpoint

  • extract frozen cell embeddings through Trainer.predict_dataset(...)

  • write those embeddings back into adata.obsm

  • train a simple linear probe on the frozen embeddings and qualitatively assess embedding quality

Experimental scope:

  • embeddings only

  • whole-human checkpoint only

  • human scRNA-seq only

  • no fine-tuning in this release

What correctness means in this notebook:

  • the checkpoint loads successfully

  • embeddings are produced and can be handed back to Scanpy

  • the embedding's qualitative structure can be inspected against the known PBMC labels

  • this notebook does not claim that frozen scGPT is already the best baseline for every PBMC workflow
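The correctness bar above can be expressed as a small, library-agnostic sanity check before anything is handed back to Scanpy. The function name and array shapes here are illustrative, not part of scDLKit:

```python
import numpy as np

def check_embedding(emb, n_cells):
    """Minimal correctness checks before writing an embedding into adata.obsm."""
    assert emb.ndim == 2, "expected a 2-D (cells x dims) matrix"
    assert emb.shape[0] == n_cells, "row count must match adata.n_obs"
    assert np.isfinite(emb).all(), "embedding must contain no NaN or inf values"

# synthetic stand-in for predictions["latent"]
emb = np.random.default_rng(0).normal(size=(128, 512))
check_embedding(emb, n_cells=128)
print("embedding checks passed")
```

A check like this is cheap to run in the notebook itself and catches the most common failure modes (wrong orientation, NaNs from a broken checkpoint load) before neighbors/UMAP silently propagate them.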

Published tutorial status#

This page is a static notebook copy published for documentation review. It is meant to show the exact workflow and outputs from the last recorded run.

  • Last run date: 2026-03-27 09:23 UTC

  • Publication mode: static executed tutorial

  • Execution profile: published

  • Artifact check in this sync: passed

  • Source notebook: examples/scgpt_pbmc_embeddings.ipynb

from __future__ import annotations

import json
from pathlib import Path
from time import perf_counter

import numpy as np
import pandas as pd
import scanpy as sc
from matplotlib import pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

from scdlkit import Trainer
from scdlkit.evaluation import evaluate_predictions, save_markdown_report, save_metrics_table
from scdlkit.evaluation.metrics import classification_metrics
from scdlkit.foundation import load_scgpt_model, prepare_scgpt_data
from scdlkit.visualization.classification import plot_confusion_matrix

SEED = 42
TUTORIAL_PROFILE = "quickstart"  # "quickstart" or "full"
PROFILE = {
    "quickstart": {"max_cells": 128, "batch_size": 64},
    "full": {"max_cells": None, "batch_size": 64},
}[TUTORIAL_PROFILE]

OUTPUT_DIR = Path("artifacts/scgpt_pbmc_embeddings")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

OUTPUT_DIR
PosixPath('artifacts/scgpt_pbmc_embeddings')

Load PBMC data#

The quickstart profile uses a deterministic subset so the docs build stays CPU-friendly. The full profile keeps all cells from pbmc3k_processed().

adata = sc.datasets.pbmc3k_processed()

if PROFILE["max_cells"] is not None and adata.n_obs > PROFILE["max_cells"]:
    rng = np.random.default_rng(SEED)
    subset_indices = np.sort(rng.choice(adata.n_obs, size=PROFILE["max_cells"], replace=False))
    adata = adata[subset_indices].copy()
else:
    adata = adata.copy()

prepared = prepare_scgpt_data(
    adata,
    checkpoint="whole-human",
    label_key="louvain",
    batch_size=PROFILE["batch_size"],
    use_raw=True,
)

{
    "cells": int(adata.n_obs),
    "genes": int(adata.n_vars),
    "matched_genes": prepared.num_genes_matched,
    "checkpoint": prepared.checkpoint_id,
}
{'cells': 128,
 'genes': 1838,
 'matched_genes': 12300,
 'checkpoint': 'whole-human'}
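Note that matched_genes (12,300) exceeds adata.n_vars (1,838): with use_raw=True, gene matching plausibly runs against the full raw gene set rather than the HVG-filtered matrix. The matching itself amounts to a vocabulary intersection, sketched here with made-up gene lists (not the real scGPT vocabulary):

```python
# toy sketch of checkpoint gene matching (gene lists are illustrative only)
raw_genes = ["CD3D", "MS4A1", "NKG7", "GNLY", "FCGR3A", "FAKE1"]
checkpoint_vocab = {"CD3D", "MS4A1", "NKG7", "LYZ", "PPBP", "FCGR3A"}

# keep only genes the pretrained model has a token for
matched = [g for g in raw_genes if g in checkpoint_vocab]
print(len(matched), matched)  # 4 ['CD3D', 'MS4A1', 'NKG7', 'FCGR3A']
```

If matched_genes came back surprisingly low for your own data, the usual culprits are non-HGNC gene symbols or Ensembl IDs in var_names.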

Load the frozen checkpoint and extract embeddings#

This uses the official whole-human checkpoint in frozen inference mode. The supported surface in the current release line is Trainer.predict_dataset(...), not TaskRunner and not Trainer.fit(...).

model = load_scgpt_model("whole-human", device="auto")
trainer = Trainer(
    model=model,
    task="representation",
    batch_size=prepared.batch_size,
    device="auto",
    epochs=1,
)

started_at = perf_counter()
predictions = trainer.predict_dataset(prepared.dataset)
embedding_runtime_sec = perf_counter() - started_at

representation_metrics = evaluate_predictions("representation", predictions)
representation_metrics
{'silhouette': 0.2256391942501068,
 'knn_label_consistency': 0.9296875,
 'ari': 0.5768869998943564,
 'nmi': 0.7609582588732033}
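These representation metrics come from evaluate_predictions, but the silhouette score, for example, can be cross-checked directly with scikit-learn on the latent matrix. A minimal sketch on synthetic data (standing in for predictions["latent"] and its labels):

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# two well-separated synthetic clusters stand in for predictions["latent"]
latent = np.vstack([rng.normal(0.0, 0.1, (50, 8)), rng.normal(3.0, 0.1, (50, 8))])
labels = np.array([0] * 50 + [1] * 50)

score = silhouette_score(latent, labels)
print(round(float(score), 3))  # near 1.0 for well-separated clusters
```

The modest silhouette above (≈0.23) alongside high kNN label consistency (≈0.93) is a common pattern for frozen foundation-model embeddings: clusters are locally coherent but not globally compact.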

Return to the normal Scanpy downstream path#

The foundation-model step ends once embeddings are available. After that, the workflow is the familiar Scanpy path: store the embedding in adata.obsm, build neighbors, compute UMAP, and inspect structure against the known PBMC labels.

Recommended next tutorials:

  • PBMC model comparison for classical and deep-learning baselines

  • downstream Scanpy after scDLKit when you want clustering and marker interpretation on a learned embedding

adata.obsm["X_scgpt_whole_human"] = predictions["latent"]
sc.pp.neighbors(adata, use_rep="X_scgpt_whole_human")
sc.tl.umap(adata, random_state=SEED)

umap_fig = sc.pl.umap(adata, color="louvain", return_fig=True, frameon=False)
umap_fig.savefig(OUTPUT_DIR / "latent_umap.png", dpi=150, bbox_inches="tight")
plt.close(umap_fig)

labels = predictions["y"]
label_categories = list(pd.Categorical(adata.obs["louvain"].astype(str)).categories)
_, label_counts = np.unique(labels, return_counts=True)
probe_stratify = labels if int(label_counts.min()) >= 2 else None
train_x, test_x, train_y, test_y = train_test_split(
    predictions["latent"],
    labels,
    test_size=0.2,
    random_state=SEED,
    stratify=probe_stratify,
)
probe = LogisticRegression(max_iter=1000, random_state=SEED)
probe.fit(train_x, train_y)
probe_logits = probe.predict_proba(test_x)
probe_metrics = classification_metrics(test_y, probe_logits)
confusion_fig, _ = plot_confusion_matrix(
    probe_metrics["confusion_matrix"],
    class_names=label_categories,
)
confusion_fig.savefig(
    OUTPUT_DIR / "linear_probe_confusion_matrix.png",
    dpi=150,
    bbox_inches="tight",
)
plt.close(confusion_fig)
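The probe_stratify guard above exists because scikit-learn's train_test_split raises a ValueError when any stratification class has fewer than two members, which can easily happen with a 128-cell quickstart subset. A minimal demonstration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape(-1, 1)
y = np.array([0] * 9 + [1])  # class 1 has only one member

try:
    train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
    failed = False
except ValueError:
    failed = True  # stratified splitting needs >= 2 members per class

print("stratify raised ValueError:", failed)
```

Falling back to an unstratified split (stratify=None) keeps the cell runnable, at the cost of possibly dropping rare labels from the test fold.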

summary_metrics = {
    **representation_metrics,
    "probe_accuracy": float(probe_metrics["accuracy"]),
    "probe_macro_f1": float(probe_metrics["macro_f1"]),
    "num_genes_matched": int(prepared.num_genes_matched),
    "embedding_runtime_sec": float(embedding_runtime_sec),
    "cells": int(adata.n_obs),
}

save_markdown_report(
    summary_metrics,
    path=OUTPUT_DIR / "report.md",
    title="Experimental scGPT PBMC embedding report",
    extra_sections=[
        "## Notes",
        "",
        "- Experimental feature: embeddings only.",
        "- Checkpoint: `whole-human`.",
        "- Fine-tuning is intentionally deferred.",
    ],
)
save_metrics_table(summary_metrics, OUTPUT_DIR / "report.csv")
(OUTPUT_DIR / "embedding_summary.json").write_text(
    json.dumps(summary_metrics, indent=2),
    encoding="utf-8",
)

summary_metrics
{'silhouette': 0.2256391942501068,
 'knn_label_consistency': 0.9296875,
 'ari': 0.5768869998943564,
 'nmi': 0.7609582588732033,
 'probe_accuracy': 0.6538461538461539,
 'probe_macro_f1': 0.2724867724867725,
 'num_genes_matched': 12300,
 'embedding_runtime_sec': 75.44226392799999,
 'cells': 128}
output_paths = {
    "report_markdown": str(OUTPUT_DIR / "report.md"),
    "report_csv": str(OUTPUT_DIR / "report.csv"),
    "latent_umap": str(OUTPUT_DIR / "latent_umap.png"),
    "linear_probe_confusion_matrix": str(OUTPUT_DIR / "linear_probe_confusion_matrix.png"),
    "embedding_summary": str(OUTPUT_DIR / "embedding_summary.json"),
}
output_paths
{'report_markdown': 'artifacts/scgpt_pbmc_embeddings/report.md',
 'report_csv': 'artifacts/scgpt_pbmc_embeddings/report.csv',
 'latent_umap': 'artifacts/scgpt_pbmc_embeddings/latent_umap.png',
 'linear_probe_confusion_matrix': 'artifacts/scgpt_pbmc_embeddings/linear_probe_confusion_matrix.png',
 'embedding_summary': 'artifacts/scgpt_pbmc_embeddings/embedding_summary.json'}