Scanpy PBMC quickstart#

Audience:

  • Single-cell researchers who already use Scanpy and want a model-focused baseline workflow.

Prerequisites:

  • Install scdlkit[tutorials].

  • Familiarity with AnnData and basic Scanpy concepts.

Learning goals:

  • Load PBMC data with Scanpy.

  • Train a VAE baseline with TaskRunner.

  • Store latent embeddings in adata.obsm.

  • Continue with Scanpy neighbors and UMAP on the learned representation.

Install:

python -m pip install "scdlkit[tutorials]"

Outline#

  1. Load PBMC data with Scanpy.

  2. Inspect the dataset and confirm the label field.

  3. Detect the runtime device.

  4. Choose the notebook profile.

  5. Train a VAE with device="auto".

  6. Evaluate metrics and save artifacts.

  7. Push the latent embedding into adata.obsm.

  8. Run Scanpy neighbors and UMAP on the latent space.

from __future__ import annotations

from pathlib import Path

import scanpy as sc
import torch
from IPython.display import display

from scdlkit import TaskRunner

DATA_PATH = Path("examples/data/pbmc3k_processed.h5ad")
OUTPUT_DIR = Path("artifacts/pbmc_vae_quickstart")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

device_name = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device_name}")
Using device: cpu
TUTORIAL_PROFILE = "quickstart"  # change to "full" for a longer run

PROFILE = {
    "quickstart": {"epochs": 20, "batch_size": 128, "kl_weight": 1e-3},
    "full": {"epochs": 50, "batch_size": 128, "kl_weight": 1e-3},
}[TUTORIAL_PROFILE]

print(f"Tutorial profile: {TUTORIAL_PROFILE}")
print(PROFILE)
Tutorial profile: quickstart
{'epochs': 20, 'batch_size': 128, 'kl_weight': 0.001}

Load PBMC data#

The tutorial prefers the repository copy of pbmc3k_processed.h5ad when available, then falls back to scanpy.datasets.pbmc3k_processed().

adata = sc.read_h5ad(DATA_PATH) if DATA_PATH.exists() else sc.datasets.pbmc3k_processed()
print(adata)
print(f"Shape: {adata.shape}")
print("obs columns:", list(adata.obs.columns))
print("Label field used for evaluation:", "louvain")
AnnData object with n_obs × n_vars = 2638 × 1838
    obs: 'n_genes', 'percent_mito', 'n_counts', 'louvain'
    var: 'n_cells'
    uns: 'draw_graph', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups'
    obsm: 'X_pca', 'X_tsne', 'X_umap', 'X_draw_graph_fr'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'
Shape: (2638, 1838)
obs columns: ['n_genes', 'percent_mito', 'n_counts', 'louvain']
Label field used for evaluation: louvain

Train the VAE baseline#

This is the main scDLKit step. The code is the same on CPU and GPU because the runner uses device="auto".

For this PBMC quickstart, the VAE uses a light KL term so PBMC populations stay visibly separated in the latent space. A healthy quickstart result should show broad islands for major PBMC groups rather than a single mixed circular cloud.

Use the default quickstart profile for a CPU-friendly docs run. Switch to full when you want a longer fit with stronger qualitative separation before interpreting the embedding.
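scDLKit's internal loss is not shown in this tutorial, but the standard VAE objective that a `kl_weight` of `1e-3` would scale down looks like the following sketch. This is my own plain-PyTorch illustration of the technique, not the library's code:

```python
import torch
import torch.nn.functional as F


def vae_loss(recon: torch.Tensor, x: torch.Tensor,
             mu: torch.Tensor, logvar: torch.Tensor,
             kl_weight: float = 1e-3) -> torch.Tensor:
    # Reconstruction term: mean squared error over all matrix entries.
    recon_loss = F.mse_loss(recon, x, reduction="mean")
    # KL divergence between N(mu, sigma^2) and the standard normal prior.
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    # A small kl_weight keeps the latent space only loosely regularized,
    # so distinct cell populations are not pulled into one Gaussian blob.
    return recon_loss + kl_weight * kl
```

With `kl_weight` near zero the model behaves almost like a plain autoencoder; raising it trades reconstruction fidelity for a smoother, more Gaussian latent space.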

runner = TaskRunner(
    model="vae",
    task="representation",
    epochs=PROFILE["epochs"],
    batch_size=PROFILE["batch_size"],
    label_key="louvain",
    device="auto",
    model_kwargs={"kl_weight": PROFILE["kl_weight"]},
    output_dir=str(OUTPUT_DIR),
)
runner.fit(adata)
metrics = runner.evaluate()
metrics
{'mse': 0.8210347294807434,
 'mae': 0.40033289790153503,
 'pearson': 0.23027241230010986,
 'spearman': 0.12482234953591331,
 'silhouette': 0.16635675728321075,
 'knn_label_consistency': 0.898989898989899,
 'ari': 0.5762919225290116,
 'nmi': 0.7341654912608157}
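The `ari` and `nmi` scores compare clusters found in the latent space against the stored `louvain` labels; both are invariant to label permutation and reach 1.0 for a perfect match. A toy scikit-learn illustration of how these two metrics behave (the labels here are illustrative, not tutorial output):

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

true_labels = ["T", "T", "B", "B", "NK", "NK"]
pred_clusters = [0, 0, 1, 1, 2, 0]  # one NK cell lands in the T cluster

# Both scores are 1.0 for a perfect (arbitrarily relabeled) match and
# drop as cells are assigned to the wrong cluster.
ari = adjusted_rand_score(true_labels, pred_clusters)
nmi = normalized_mutual_info_score(true_labels, pred_clusters)
print(f"ARI={ari:.3f}  NMI={nmi:.3f}")
```

Because cluster IDs are arbitrary, swapping all 0s and 1s in `pred_clusters` would leave both scores unchanged.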

Save a report and inspect the training curve#

The notebook writes a Markdown report and a training-loss plot to artifacts/pbmc_vae_quickstart/.

runner.save_report(OUTPUT_DIR / "report.md")
loss_fig, _ = runner.plot_losses()
loss_fig.savefig(OUTPUT_DIR / "loss_curve.png", dpi=150, bbox_inches="tight")
display(loss_fig)
[Figure: training-loss curve for the quickstart VAE run]

Push the latent space back into Scanpy#

scDLKit stays model-focused. Once the latent representation is available, continue the downstream neighborhood and visualization workflow with Scanpy.

A healthy result should show broad louvain groups separating into distinct islands rather than a single mixed circular cloud.

adata.obsm["X_scdlkit_vae"] = runner.encode(adata)
sc.pp.neighbors(adata, use_rep="X_scdlkit_vae")
sc.tl.umap(adata, random_state=42)
umap_fig = sc.pl.umap(adata, color="louvain", return_fig=True, frameon=False)
umap_fig.savefig(OUTPUT_DIR / "latent_umap.png", dpi=150, bbox_inches="tight")
display(umap_fig)
[Figure: UMAP of the VAE latent space, colored by louvain clusters]

Expected outputs#

After running this notebook you should have:

  • metrics from runner.evaluate()

  • artifacts/pbmc_vae_quickstart/report.md

  • artifacts/pbmc_vae_quickstart/report.csv

  • artifacts/pbmc_vae_quickstart/loss_curve.png

  • artifacts/pbmc_vae_quickstart/latent_umap.png

  • a latent UMAP with broad separation between PBMC populations such as T cells, B cells, monocytes, and NK cells

Recommended next steps:

  • re-run the first config cell with TUTORIAL_PROFILE = "full" when you want a stronger qualitative result

  • open the PBMC model-comparison tutorial to compare PCA, autoencoder, vae, and transformer_ae on the same dataset
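The comparison idea from the last bullet can be sketched as a shared config builder, so each run differs only in the model name. The model names come from the tutorial text, but the assumption that TaskRunner accepts identical keyword arguments for every model is mine:

```python
# Hypothetical helper: build one TaskRunner config per model so a
# comparison run varies only the "model" field.
MODEL_NAMES = ["pca", "autoencoder", "vae", "transformer_ae"]


def comparison_config(model_name: str, profile: dict) -> dict:
    return {
        "model": model_name,
        "task": "representation",
        "epochs": profile["epochs"],
        "batch_size": profile["batch_size"],
        "label_key": "louvain",
        "device": "auto",
        "output_dir": f"artifacts/pbmc_compare/{model_name}",
    }


configs = [comparison_config(name, {"epochs": 20, "batch_size": 128})
           for name in MODEL_NAMES]
print([c["model"] for c in configs])
```

Each config could then be unpacked into `TaskRunner(**config)` as in the training cell above, with one `fit`/`evaluate` pass per model.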

output_paths = {
    "report_md": str(OUTPUT_DIR / "report.md"),
    "report_csv": str(OUTPUT_DIR / "report.csv"),
    "loss_curve_png": str(OUTPUT_DIR / "loss_curve.png"),
    "latent_umap_png": str(OUTPUT_DIR / "latent_umap.png"),
}
output_paths
{'report_md': 'artifacts/pbmc_vae_quickstart/report.md',
 'report_csv': 'artifacts/pbmc_vae_quickstart/report.csv',
 'loss_curve_png': 'artifacts/pbmc_vae_quickstart/loss_curve.png',
 'latent_umap_png': 'artifacts/pbmc_vae_quickstart/latent_umap.png'}