Reconstruction sanity check on PBMC#

Audience:

Researchers who want to inspect predicted or reconstructed gene-expression outputs from a reconstruction-capable baseline.

Prerequisites:

Install scdlkit[tutorials].
Know the Scanpy PBMC quickstart notebook first.

Learning goals:

train a simple reconstruction baseline on processed PBMC
retrieve reconstructed expression with runner.reconstruct(...)
inspect one-gene reconstruction scatter and a small marker panel summary
understand what reconstructed expression is useful for and what it is not

Out of scope:

claiming biological truth from reconstructed expression
full denoising benchmarks
heavy gene-level exploratory analysis

This notebook uses a plain autoencoder so the reconstruction story stays easy to inspect.

Published tutorial status

This page is a static notebook copy published for documentation review. It is meant to show the exact workflow and outputs from the last recorded run.

Last run date (UTC): 2026-03-27 09:24 UTC
Publication mode: static executed tutorial
Execution profile: published
Artifact check in this sync: passed
Source notebook: examples/reconstruction_sanity_pbmc.ipynb

Outline#

Load processed PBMC data.
Train an autoencoder reconstruction baseline.
Save a short report and loss curve.
Retrieve reconstructed expression with runner.reconstruct(...).
Inspect one-gene scatter and a small marker panel summary.
Save output artifact paths.

from __future__ import annotations

from pathlib import Path

import numpy as np
import pandas as pd
import scanpy as sc
import torch
from IPython.display import display
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split

from scdlkit import TaskRunner

OUTPUT_DIR = Path("artifacts/reconstruction_sanity_pbmc")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

TUTORIAL_PROFILE = "quickstart"  # change to "full" for a longer run
PROFILE = {
    "quickstart": {"max_cells": 1500, "epochs": 15, "batch_size": 128},
    "full": {"max_cells": None, "epochs": 35, "batch_size": 128},
}[TUTORIAL_PROFILE]

print(f"Using device: {'cuda' if torch.cuda.is_available() else 'cpu'}")
print(f"Tutorial profile: {TUTORIAL_PROFILE}")
print(PROFILE)

Using device: cpu
Tutorial profile: quickstart
{'max_cells': 1500, 'epochs': 15, 'batch_size': 128}

Load PBMC data#

The same processed PBMC dataset keeps this tutorial aligned with the main quickstart while staying focused on reconstruction behavior.

def stratified_subset(adata, label_key: str, max_cells: int | None, seed: int = 42):
    if max_cells is None or adata.n_obs <= max_cells:
        return adata.copy()
    indices = np.arange(adata.n_obs)
    labels = adata.obs[label_key].astype(str).to_numpy()
    keep_indices, _ = train_test_split(
        indices,
        train_size=max_cells,
        random_state=seed,
        stratify=labels,
    )
    return adata[np.sort(keep_indices)].copy()

adata = sc.datasets.pbmc3k_processed()
adata = stratified_subset(adata, "louvain", PROFILE["max_cells"])
print(adata)

AnnData object with n_obs × n_vars = 1500 × 1838
    obs: 'n_genes', 'percent_mito', 'n_counts', 'louvain'
    var: 'n_cells'
    uns: 'draw_graph', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups'
    obsm: 'X_pca', 'X_tsne', 'X_umap', 'X_draw_graph_fr'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'

Train the reconstruction baseline#

Use the reconstruction task explicitly here. The point is to inspect the predicted expression output, not to optimize for latent separation.

runner = TaskRunner(
    model="autoencoder",
    task="reconstruction",
    epochs=PROFILE["epochs"],
    batch_size=PROFILE["batch_size"],
    device="auto",
    output_dir=str(OUTPUT_DIR),
)
runner.fit(adata)
metrics = runner.evaluate()
metrics

{'mse': 0.8139280080795288,
 'mae': 0.3975978493690491,
 'pearson': 0.20109763741493225,
 'spearman': 0.11078534881440727}

Save the report and inspect the loss curve#

This notebook should leave behind a compact report and a loss curve before any gene-level inspection starts.

runner.save_report(OUTPUT_DIR / "report.md")
loss_fig, _ = runner.plot_losses()
loss_fig.savefig(OUTPUT_DIR / "loss_curve.png", dpi=150, bbox_inches="tight")
display(loss_fig)
plt.close(loss_fig)

../_images/45cb7792cbd785f7af45a98323bc7116df93a03ea9b7fb3e85b1692049bb3817.png

Retrieve reconstructed expression#

runner.reconstruct(...) is the clear public path for reconstruction-capable models. This is the output you should inspect when checking whether a feature is wired correctly.

reconstructed = runner.reconstruct(adata)
reconstructed.shape

(1500, 1838)

One-gene reconstruction scatter#

Start with one interpretable marker so the relationship between observed and reconstructed values is easy to see.

candidate_genes = [gene for gene in ["MS4A1", "LYZ", "NKG7", "IL7R"] if gene in adata.var_names]
feature_gene = candidate_genes[0] if candidate_genes else adata.var_names[0]

scatter_fig, _ = runner.plot_reconstruction(feature=feature_gene)
scatter_fig.savefig(OUTPUT_DIR / "reconstruction_scatter.png", dpi=150, bbox_inches="tight")
display(scatter_fig)
plt.close(scatter_fig)
print(f"Using feature gene: {feature_gene}")

../_images/653c0618b780e900610ec60e55109511d77114a57638ca1ecf82d1d9128ec26f.png

Using feature gene: MS4A1

Marker-panel reconstruction summary#

This is a coarse sanity check only. We compare mean observed vs reconstructed expression across PBMC groups for a small marker panel.

marker_candidates = ["IL7R", "LTB", "MS4A1", "CD79A", "LYZ", "NKG7"]
marker_panel = [gene for gene in marker_candidates if gene in adata.var_names]
observed_matrix = adata[:, marker_panel].X
if hasattr(observed_matrix, "toarray"):
    observed_matrix = observed_matrix.toarray()
observed = pd.DataFrame(observed_matrix, columns=marker_panel, index=adata.obs_names).astype(float)
reconstructed_df = pd.DataFrame(
    reconstructed[:, [adata.var_names.get_loc(gene) for gene in marker_panel]],
    columns=marker_panel,
    index=adata.obs_names,
).astype(float)

observed["louvain"] = adata.obs["louvain"].astype(str).to_numpy()
reconstructed_df["louvain"] = adata.obs["louvain"].astype(str).to_numpy()

observed_means = observed.groupby("louvain").mean(numeric_only=True)
reconstructed_means = reconstructed_df.groupby("louvain").mean(numeric_only=True)

fig, axes = plt.subplots(1, 2, figsize=(12, 4), constrained_layout=True)
vmax = max(float(observed_means.to_numpy().max()), float(reconstructed_means.to_numpy().max()))
axes[0].imshow(observed_means.to_numpy(), aspect="auto", cmap="viridis", vmin=0, vmax=vmax)
axes[0].set_title("Observed mean expression")
axes[1].imshow(reconstructed_means.to_numpy(), aspect="auto", cmap="viridis", vmin=0, vmax=vmax)
axes[1].set_title("Reconstructed mean expression")
for axis in axes:
    axis.set_xticks(range(len(marker_panel)))
    axis.set_xticklabels(marker_panel, rotation=45, ha="right")
    axis.set_yticks(range(len(observed_means.index)))
    axis.set_yticklabels(list(observed_means.index))
fig.savefig(OUTPUT_DIR / "gene_panel_reconstruction.png", dpi=150, bbox_inches="tight")
display(fig)
plt.close(fig)
observed_means.head()

../_images/62c8bc2d984ff96e52c78562db61fa073fdcea499de503f347ba2f147a4940d0.png

	LTB	MS4A1	CD79A	NKG7
louvain
B cells	0.737258	2.069788	2.330160	-0.526241
CD14+ Monocytes	-0.972925	-0.336502	-0.391754	-0.415092
CD4 T cells	0.518906	-0.310609	-0.334568	-0.387462
CD8 T cells	-0.119601	-0.344340	-0.421676	1.409332
Dendritic cells	-0.946825	-0.056058	0.054287	0.019020

Expected outputs and interpretation#

After running this notebook, check:

artifacts/reconstruction_sanity_pbmc/report.md
artifacts/reconstruction_sanity_pbmc/report.csv
artifacts/reconstruction_sanity_pbmc/loss_curve.png
artifacts/reconstruction_sanity_pbmc/reconstruction_scatter.png
artifacts/reconstruction_sanity_pbmc/gene_panel_reconstruction.png

Interpretation checklist:

The loss curve should decrease without looking obviously unstable.
The reconstruction scatter should show broad correlation, not a random cloud.
The marker-panel summary should preserve broad PBMC trends, not exact cell-by-cell truth.
Treat reconstructed expression as model output, not corrected biological ground truth.