Scanpy PBMC quickstart#

Audience:

  • Single-cell researchers and analysts who already use Scanpy and want the shortest baseline path from AnnData to a learned embedding.

Prerequisites:

  • Install scdlkit[tutorials].

  • Familiarity with AnnData, neighbors, and UMAP in Scanpy.

Learning goals:

  • Load PBMC data with Scanpy.

  • Train a VAE baseline with TaskRunner.

  • Store latent embeddings in adata.obsm.

  • Continue with Scanpy neighbors and UMAP on the learned representation.

Out of scope:

  • raw-count QC and preprocessing

  • marker-gene interpretation

  • gene-level reconstruction inspection

Why this notebook starts from processed PBMC:

  • scDLKit focuses here on the model layer rather than reproducing the full raw-preprocessing tutorial that Scanpy already teaches well.

  • The raw-count preprocessing path stays in the official Scanpy tutorials; this notebook starts at the point where model training begins.

Install:

python -m pip install "scdlkit[tutorials]"

Links:

Related APIs:

  • TaskRunner: stable beginner workflow

  • prepare_data(...): lower-level preprocessing and split control

Next steps:

  • Tutorial: downstream_scanpy_after_scdlkit.ipynb

  • API: docs/api/taskrunner.md

Published tutorial status#

This page is a static notebook copy published for documentation review. It is meant to show the exact workflow and outputs from the last recorded run.

  • Last run date (UTC): 2026-03-27 09:22 UTC

  • Publication mode: static executed tutorial

  • Execution profile: published

  • Artifact check in this sync: passed

  • Source notebook: examples/train_vae_pbmc.ipynb

Outline#

  1. Load PBMC data with Scanpy.

  2. Inspect the dataset and confirm the label field.

  3. Detect the runtime device.

  4. Choose the notebook profile.

  5. Train a VAE with device="auto".

  6. Evaluate metrics and save artifacts.

  7. Push the latent embedding into adata.obsm.

  8. Run Scanpy neighbors and UMAP on the latent space.

from __future__ import annotations

from pathlib import Path

import scanpy as sc
import torch
from IPython.display import display

from scdlkit import TaskRunner

sc.set_figure_params(dpi=100, dpi_save=180, frameon=False, fontsize=12)

DATA_PATH = Path("examples/data/pbmc3k_processed.h5ad")
OUTPUT_DIR = Path("artifacts/pbmc_vae_quickstart")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

device_name = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device_name}")
Using device: cpu
TUTORIAL_PROFILE = "quickstart"  # change to "full" for a longer run

PROFILE = {
    "quickstart": {"epochs": 20, "batch_size": 128, "kl_weight": 1e-3},
    "full": {"epochs": 50, "batch_size": 128, "kl_weight": 1e-3},
}[TUTORIAL_PROFILE]

print(f"Tutorial profile: {TUTORIAL_PROFILE}")
print(PROFILE)
Tutorial profile: quickstart
{'epochs': 20, 'batch_size': 128, 'kl_weight': 0.001}
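The dictionary lookup above raises a bare `KeyError` if `TUTORIAL_PROFILE` is misspelled. A small guard can give a readable error instead; this is a sketch, and `select_profile`/`PROFILES` are illustrative names, not scdlkit API:

```python
# Hypothetical guard for the profile lookup: validate the chosen profile
# name before indexing, so a typo fails with a message listing the options.
PROFILES = {
    "quickstart": {"epochs": 20, "batch_size": 128, "kl_weight": 1e-3},
    "full": {"epochs": 50, "batch_size": 128, "kl_weight": 1e-3},
}

def select_profile(name: str) -> dict:
    """Return the training profile for `name`, or raise with the valid options."""
    if name not in PROFILES:
        raise ValueError(f"Unknown profile {name!r}; choose one of {sorted(PROFILES)}")
    return PROFILES[name]

profile = select_profile("quickstart")
print(profile["epochs"])  # 20
```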

Load PBMC data#

The tutorial prefers the repository copy of pbmc3k_processed.h5ad when available, then falls back to scanpy.datasets.pbmc3k_processed().

adata = sc.read_h5ad(DATA_PATH) if DATA_PATH.exists() else sc.datasets.pbmc3k_processed()
print(adata)
print(f"Shape: {adata.shape}")
print("obs columns:", list(adata.obs.columns))
print("Label field used for evaluation:", "louvain")
AnnData object with n_obs × n_vars = 2638 × 1838
    obs: 'n_genes', 'percent_mito', 'n_counts', 'louvain'
    var: 'n_cells'
    uns: 'draw_graph', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups'
    obsm: 'X_pca', 'X_tsne', 'X_umap', 'X_draw_graph_fr'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'
Shape: (2638, 1838)
obs columns: ['n_genes', 'percent_mito', 'n_counts', 'louvain']
Label field used for evaluation: louvain
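Before training, it can help to sanity-check the label field that evaluation will use: it should exist and contain more than one class. The sketch below runs on a plain list standing in for the values of `adata.obs["louvain"]`; `summarize_labels` is an illustrative helper, not part of scdlkit or Scanpy:

```python
from collections import Counter

# Toy stand-in for adata.obs["louvain"]; real cluster names will differ.
labels = ["CD4 T", "CD4 T", "CD4 T", "B", "B", "NK", "NK", "CD14 Monocytes"]

def summarize_labels(labels):
    """Return (number of classes, size of the smallest class); require >= 2 classes."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("evaluation needs at least two label classes")
    return len(counts), min(counts.values())

n_classes, smallest = summarize_labels(labels)
print(n_classes, smallest)  # 4 1
```

A very small smallest class is worth noticing, since per-cluster metrics on it will be noisy.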

Train the VAE baseline#

This is the main scDLKit step. The code is the same on CPU and GPU because the runner uses device="auto".

For this PBMC quickstart, the VAE uses a light KL term so PBMC populations stay visibly separated in the latent space. A healthy quickstart result should show broad islands for major PBMC groups rather than a single mixed circular cloud.

Use the default quickstart profile for a CPU-friendly docs run. Switch to full when you want a longer fit with stronger qualitative separation before interpreting the embedding.

runner = TaskRunner(
    model="vae",
    task="representation",
    epochs=PROFILE["epochs"],
    batch_size=PROFILE["batch_size"],
    label_key="louvain",
    device="auto",
    model_kwargs={"kl_weight": PROFILE["kl_weight"]},
    output_dir=str(OUTPUT_DIR),
)
runner.fit(adata)
metrics = runner.evaluate()
metrics
{'mse': 0.8214080929756165,
 'mae': 0.40160080790519714,
 'pearson': 0.22945579886436462,
 'spearman': 0.12458946432617383,
 'silhouette': 0.17217771708965302,
 'knn_label_consistency': 0.8914141414141414,
 'ari': 0.5968597655832685,
 'nmi': 0.762229436266658}
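A quick triage of the metrics dict above can flag values that look too low before you move on. The thresholds in this sketch are rough rules of thumb for this quickstart, not values defined by scdlkit, and `flag_low_metrics` is an illustrative helper:

```python
# Illustrative thresholds only; tune them for your own data and goals.
THRESHOLDS = {"silhouette": 0.1, "knn_label_consistency": 0.8, "ari": 0.4, "nmi": 0.6}

def flag_low_metrics(metrics: dict, thresholds: dict) -> list:
    """Return the metric names that fall below their rough threshold."""
    return [k for k, t in thresholds.items() if metrics.get(k, 0.0) < t]

metrics = {
    "silhouette": 0.172,
    "knn_label_consistency": 0.891,
    "ari": 0.597,
    "nmi": 0.762,
}
print(flag_low_metrics(metrics, THRESHOLDS))  # [] -> nothing flagged
```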

What to inspect#

Before moving on, review the following:

  1. The training curve should fall without obvious instability.

  2. The latent UMAP should separate broad PBMC families rather than collapsing into a single mixed blob.

  3. Remember that this notebook covers only the embedding step; marker-gene interpretation and reconstruction inspection live in separate tutorials.
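The "falls without obvious instability" check can be automated. The sketch below runs on a synthetic loss history; `loss_trend_ok` is an illustrative helper, and the 5% spike tolerance is a rough rule of thumb, not a scdlkit default:

```python
# Compare the mean of the last few epochs to the first few, and count
# upward jumps larger than 5% between consecutive epochs.
def loss_trend_ok(losses, window=3, max_spikes=2):
    """True if the tail mean is below the head mean and large jumps are rare."""
    head = sum(losses[:window]) / window
    tail = sum(losses[-window:]) / window
    spikes = sum(1 for a, b in zip(losses, losses[1:]) if b > a * 1.05)
    return tail < head and spikes <= max_spikes

history = [1.90, 1.42, 1.11, 0.97, 0.90, 0.88, 0.86, 0.85]
print(loss_trend_ok(history))  # True
```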

The notebook writes a Markdown report and a training-loss plot to artifacts/pbmc_vae_quickstart/.

runner.save_report(OUTPUT_DIR / "report.md")
loss_fig, _ = runner.plot_losses()
loss_fig.savefig(OUTPUT_DIR / "loss_curve.png", dpi=150, bbox_inches="tight")
display(loss_fig)

Push the latent space back into Scanpy#

scDLKit stays model-focused. Once the latent representation is available, continue the downstream neighborhood and visualization workflow with Scanpy.

A healthy result should show broad louvain groups separating into distinct islands rather than a single mixed circular cloud.

adata.obsm["X_scdlkit_vae"] = runner.encode(adata)
sc.pp.neighbors(adata, use_rep="X_scdlkit_vae")
sc.tl.umap(adata, random_state=42)
umap_fig = sc.pl.umap(
    adata,
    color="louvain",
    legend_loc="on data",
    legend_fontsize=10,
    legend_fontoutline=2,
    title="",
    return_fig=True,
    frameon=False,
)
umap_fig.savefig(OUTPUT_DIR / "latent_umap.png", dpi=150, bbox_inches="tight")
display(umap_fig)
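Beyond eyeballing the UMAP, "distinct islands" can be quantified crudely: compare the distance between cluster centroids to the average within-cluster spread. The sketch below uses stdlib-only toy blobs in place of the real embedding; `separation_ratio` is an illustrative helper, not a library function:

```python
import math
import random
import statistics

# Two synthetic 2-D clusters standing in for latent-UMAP islands.
random.seed(0)
cluster_a = [(random.gauss(0, 0.3), random.gauss(0, 0.3)) for _ in range(50)]
cluster_b = [(random.gauss(4, 0.3), random.gauss(4, 0.3)) for _ in range(50)]

def centroid(points):
    xs, ys = zip(*points)
    return (statistics.mean(xs), statistics.mean(ys))

def spread(points):
    """Mean distance of the points from their centroid."""
    cx, cy = centroid(points)
    return statistics.mean(math.hypot(x - cx, y - cy) for x, y in points)

def separation_ratio(a, b):
    """Between-centroid distance over average within-cluster spread."""
    (ax, ay), (bx, by) = centroid(a), centroid(b)
    between = math.hypot(ax - bx, ay - by)
    within = (spread(a) + spread(b)) / 2
    return between / within

print(separation_ratio(cluster_a, cluster_b) > 1.0)  # True for separated blobs
```

A ratio well above 1 suggests visible islands; near or below 1 suggests the mixed-cloud failure mode described above.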

Expected outputs#

After running this notebook you should have:

  • metrics from runner.evaluate()

  • artifacts/pbmc_vae_quickstart/report.md

  • artifacts/pbmc_vae_quickstart/report.csv

  • artifacts/pbmc_vae_quickstart/loss_curve.png

  • artifacts/pbmc_vae_quickstart/latent_umap.png

  • a latent UMAP with broad separation between PBMC populations such as T cells, B cells, monocytes, and NK cells

Recommended next steps:

  • re-run the first config cell with TUTORIAL_PROFILE = "full" when you want a stronger qualitative result

  • open the downstream Scanpy tutorial for clustering, markers, and broad annotation after the embedding step

  • open the PBMC model-comparison tutorial to compare PCA, autoencoder, vae, and transformer_ae on the same dataset

  • open the reconstruction sanity-check tutorial if you want to inspect predicted or reconstructed gene-expression values

output_paths = {
    "report_md": str(OUTPUT_DIR / "report.md"),
    "report_csv": str(OUTPUT_DIR / "report.csv"),
    "loss_curve_png": str(OUTPUT_DIR / "loss_curve.png"),
    "latent_umap_png": str(OUTPUT_DIR / "latent_umap.png"),
}
output_paths
{'report_md': 'artifacts/pbmc_vae_quickstart/report.md',
 'report_csv': 'artifacts/pbmc_vae_quickstart/report.csv',
 'loss_curve_png': 'artifacts/pbmc_vae_quickstart/loss_curve.png',
 'latent_umap_png': 'artifacts/pbmc_vae_quickstart/latent_umap.png'}
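As a final sanity check, the `output_paths` mapping above can be verified against the filesystem. This is a hypothetical post-run helper, not part of scdlkit; it simply reports which expected artifacts are missing:

```python
from pathlib import Path

# Subset of the tutorial's output paths; on a fresh checkout these will all
# be reported as missing until the notebook has actually been run.
output_paths = {
    "report_md": "artifacts/pbmc_vae_quickstart/report.md",
    "loss_curve_png": "artifacts/pbmc_vae_quickstart/loss_curve.png",
}

def missing_artifacts(paths: dict) -> list:
    """Return the keys whose files are not present on disk."""
    return [name for name, p in paths.items() if not Path(p).exists()]

print(missing_artifacts(output_paths))  # [] after a successful full run
```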