Experimental scGPT PBMC embeddings#
This notebook shows the first experimental foundation-model workflow in scDLKit.
Audience:

- researchers and analysts who already use Scanpy and want to evaluate a pretrained single-cell foundation model inside the same downstream workflow

Prerequisites:

- pip install "scdlkit[foundation,tutorials]"
- familiarity with AnnData, adata.obsm, neighbors, and UMAP in Scanpy
Learning goals:

- prepare PBMC data for the official whole-human scGPT checkpoint
- extract frozen cell embeddings through Trainer.predict_dataset(...)
- write those embeddings back into adata.obsm
- run a simple frozen linear probe and compare the embedding quality qualitatively
Experimental scope:

- embeddings only
- whole-human checkpoint only
- human scRNA-seq only
- no fine-tuning in this release
What correctness means in this notebook:

- the checkpoint loads successfully
- embeddings are produced and can be handed back to Scanpy
- the qualitative structure is inspectable against the PBMC labels
- this notebook does not claim that frozen scGPT is already the best baseline for every PBMC workflow
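The first two correctness criteria can be expressed as lightweight assertions on the embedding matrix. Below is a minimal sketch run on a synthetic stand-in; `check_embeddings` is a hypothetical helper written for illustration, not part of scDLKit.

```python
import numpy as np

def check_embeddings(latent: np.ndarray, n_cells: int) -> None:
    """Minimal sanity checks for a frozen-embedding run (hypothetical helper)."""
    assert latent.ndim == 2, "expected a 2-D (cells x dims) embedding matrix"
    assert latent.shape[0] == n_cells, "one embedding row per cell"
    assert np.isfinite(latent).all(), "no NaN/inf values in the embedding"
    assert latent.std() > 0, "embedding should not be constant"

# Demonstrate on a synthetic stand-in for predictions["latent"]
rng = np.random.default_rng(0)
fake_latent = rng.normal(size=(128, 512))
check_embeddings(fake_latent, n_cells=128)
```

In the real run, the same checks would be applied to `predictions["latent"]` against `adata.n_obs` before handing the matrix back to Scanpy.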
Published tutorial status#
This page is a static notebook copy published for documentation review. It is meant to show the exact workflow and outputs from the last recorded run.
- Last run date (UTC): 2026-03-27 09:23 UTC
- Publication mode: static executed tutorial
- Execution profile: published
- Artifact check in this sync: passed
- Source notebook: examples/scgpt_pbmc_embeddings.ipynb
from __future__ import annotations
import json
from pathlib import Path
from time import perf_counter
import numpy as np
import pandas as pd
import scanpy as sc
from matplotlib import pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from scdlkit import Trainer
from scdlkit.evaluation import evaluate_predictions, save_markdown_report, save_metrics_table
from scdlkit.evaluation.metrics import classification_metrics
from scdlkit.foundation import load_scgpt_model, prepare_scgpt_data
from scdlkit.visualization.classification import plot_confusion_matrix
SEED = 42
TUTORIAL_PROFILE = "quickstart" # "quickstart" or "full"
PROFILE = {
"quickstart": {"max_cells": 128, "batch_size": 64},
"full": {"max_cells": None, "batch_size": 64},
}[TUTORIAL_PROFILE]
OUTPUT_DIR = Path("artifacts/scgpt_pbmc_embeddings")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
OUTPUT_DIR
PosixPath('artifacts/scgpt_pbmc_embeddings')
Load PBMC data#
The quickstart profile uses a deterministic subset so the docs build stays CPU-friendly. The full profile keeps all cells from pbmc3k_processed().
adata = sc.datasets.pbmc3k_processed()
if PROFILE["max_cells"] is not None and adata.n_obs > PROFILE["max_cells"]:
rng = np.random.default_rng(SEED)
subset_indices = np.sort(rng.choice(adata.n_obs, size=PROFILE["max_cells"], replace=False))
adata = adata[subset_indices].copy()
else:
adata = adata.copy()
prepared = prepare_scgpt_data(
adata,
checkpoint="whole-human",
label_key="louvain",
batch_size=PROFILE["batch_size"],
use_raw=True,
)
{
"cells": int(adata.n_obs),
"genes": int(adata.n_vars),
"matched_genes": prepared.num_genes_matched,
"checkpoint": prepared.checkpoint_id,
}
{'cells': 128,
'genes': 1838,
'matched_genes': 12300,
'checkpoint': 'whole-human'}
Load the frozen checkpoint and extract embeddings#
This uses the official whole-human checkpoint in frozen inference mode. The supported surface in the current release line is Trainer.predict_dataset(...), not TaskRunner and not Trainer.fit(...).
model = load_scgpt_model("whole-human", device="auto")
trainer = Trainer(
model=model,
task="representation",
batch_size=prepared.batch_size,
device="auto",
epochs=1,
)
started_at = perf_counter()
predictions = trainer.predict_dataset(prepared.dataset)
embedding_runtime_sec = perf_counter() - started_at
representation_metrics = evaluate_predictions("representation", predictions)
representation_metrics
{'silhouette': 0.2256391942501068,
'knn_label_consistency': 0.9296875,
'ari': 0.5768869998943564,
'nmi': 0.7609582588732033}
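The keys in this dict correspond to standard clustering-quality measures. As a rough illustration of what they capture, here is a minimal sketch computing the same three scores with scikit-learn on a toy two-blob embedding; the exact recipe inside evaluate_predictions (neighbor counts, clustering choice, and the knn_label_consistency definition) may differ.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (
    adjusted_rand_score,
    normalized_mutual_info_score,
    silhouette_score,
)

rng = np.random.default_rng(0)
# Toy "embedding": two well-separated Gaussian blobs with known labels.
latent = np.vstack([rng.normal(0, 0.3, (50, 8)), rng.normal(3, 0.3, (50, 8))])
labels = np.array([0] * 50 + [1] * 50)

# Silhouette: how well the known labels separate in the embedding space.
sil = silhouette_score(latent, labels)
# ARI / NMI: agreement between an unsupervised clustering and the labels.
pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(latent)
ari = adjusted_rand_score(labels, pred)
nmi = normalized_mutual_info_score(labels, pred)
```

On clean blobs like these the clustering recovers the labels exactly (ARI and NMI near 1.0), whereas the real PBMC embedding above sits in the more typical partial-agreement regime.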
Return to the normal Scanpy downstream path#
The foundation-model step ends once embeddings are available. After that, the workflow is the familiar Scanpy path: store the embedding in adata.obsm, build neighbors, compute UMAP, and inspect structure against the known PBMC labels.
Recommended next tutorials:

- PBMC model comparison for classical and deep-learning baselines
- downstream Scanpy after scDLKit when you want clustering and marker interpretation on a learned embedding
adata.obsm["X_scgpt_whole_human"] = predictions["latent"]
sc.pp.neighbors(adata, use_rep="X_scgpt_whole_human")
sc.tl.umap(adata, random_state=SEED)
umap_fig = sc.pl.umap(adata, color="louvain", return_fig=True, frameon=False)
umap_fig.savefig(OUTPUT_DIR / "latent_umap.png", dpi=150, bbox_inches="tight")
plt.close(umap_fig)
labels = predictions["y"]
label_categories = list(pd.Categorical(adata.obs["louvain"].astype(str)).categories)
# Stratify the split only when every label has at least two members;
# otherwise fall back to an unstratified split.
_, label_counts = np.unique(labels, return_counts=True)
probe_stratify = labels if int(label_counts.min()) >= 2 else None
train_x, test_x, train_y, test_y = train_test_split(
predictions["latent"],
labels,
test_size=0.2,
random_state=SEED,
stratify=probe_stratify,
)
probe = LogisticRegression(max_iter=1000, random_state=SEED)
probe.fit(train_x, train_y)
probe_probs = probe.predict_proba(test_x)  # class probabilities, not logits
probe_metrics = classification_metrics(test_y, probe_probs)
confusion_fig, _ = plot_confusion_matrix(
probe_metrics["confusion_matrix"],
class_names=label_categories,
)
confusion_fig.savefig(
OUTPUT_DIR / "linear_probe_confusion_matrix.png",
dpi=150,
bbox_inches="tight",
)
plt.close(confusion_fig)
summary_metrics = {
**representation_metrics,
"probe_accuracy": float(probe_metrics["accuracy"]),
"probe_macro_f1": float(probe_metrics["macro_f1"]),
"num_genes_matched": int(prepared.num_genes_matched),
"embedding_runtime_sec": float(embedding_runtime_sec),
"cells": int(adata.n_obs),
}
save_markdown_report(
summary_metrics,
path=OUTPUT_DIR / "report.md",
title="Experimental scGPT PBMC embedding report",
extra_sections=[
"## Notes",
"",
"- Experimental feature: embeddings only.",
"- Checkpoint: `whole-human`.",
"- Fine-tuning is intentionally deferred.",
],
)
save_metrics_table(summary_metrics, OUTPUT_DIR / "report.csv")
(OUTPUT_DIR / "embedding_summary.json").write_text(
json.dumps(summary_metrics, indent=2),
encoding="utf-8",
)
summary_metrics
{'silhouette': 0.2256391942501068,
'knn_label_consistency': 0.9296875,
'ari': 0.5768869998943564,
'nmi': 0.7609582588732033,
'probe_accuracy': 0.6538461538461539,
'probe_macro_f1': 0.2724867724867725,
'num_genes_matched': 12300,
'embedding_runtime_sec': 75.44226392799999,
'cells': 128}
output_paths = {
"report_markdown": str(OUTPUT_DIR / "report.md"),
"report_csv": str(OUTPUT_DIR / "report.csv"),
"latent_umap": str(OUTPUT_DIR / "latent_umap.png"),
"linear_probe_confusion_matrix": str(OUTPUT_DIR / "linear_probe_confusion_matrix.png"),
"embedding_summary": str(OUTPUT_DIR / "embedding_summary.json"),
}
output_paths
{'report_markdown': 'artifacts/scgpt_pbmc_embeddings/report.md',
'report_csv': 'artifacts/scgpt_pbmc_embeddings/report.csv',
'latent_umap': 'artifacts/scgpt_pbmc_embeddings/latent_umap.png',
'linear_probe_confusion_matrix': 'artifacts/scgpt_pbmc_embeddings/linear_probe_confusion_matrix.png',
'embedding_summary': 'artifacts/scgpt_pbmc_embeddings/embedding_summary.json'}
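The saved embedding_summary.json makes it easy to compare runs (for example quickstart vs. full profiles) without re-executing the notebook. A minimal sketch of round-tripping such a summary; `load_summary` is an illustrative helper, not a scDLKit API, and the demo writes a small stand-in file rather than assuming the artifact directory exists.

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

def load_summary(path: Path) -> dict:
    """Reload a metrics summary written with json.dumps (illustrative helper)."""
    return json.loads(path.read_text(encoding="utf-8"))

# Demonstrate round-tripping a summary shaped like the one saved above.
with TemporaryDirectory() as tmp:
    p = Path(tmp) / "embedding_summary.json"
    p.write_text(json.dumps({"probe_accuracy": 0.65, "cells": 128}, indent=2),
                 encoding="utf-8")
    summary = load_summary(p)

print(summary["cells"])  # → 128
```

In practice the path would be artifacts/scgpt_pbmc_embeddings/embedding_summary.json from the output_paths dict above.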