Evaluation#

Evaluation is built into the workflow rather than left to ad hoc notebook code.

The release process also uses an internal quality suite so the toolkit is evaluated against itself on small Scanpy built-ins before public tutorial defaults are changed. The primary release-gate datasets are pbmc3k_processed and paul15, with PCA kept as the classical reference baseline in comparison work. The experimental foundation-model pilot adds pbmc68k_reduced and compares frozen scGPT embeddings against PCA rather than treating the foundation path as automatically better.

Core metrics#

Representation and reconstruction workflows can report:

mse
mae
pearson
spearman
silhouette
knn_label_consistency
ari
nmi
runtime_sec in comparison and benchmark summaries
probe_accuracy for frozen linear probes on embedding benchmarks
probe_macro_f1 for frozen linear probes on embedding benchmarks

Classification workflows can report:

accuracy
macro_f1
confusion_matrix

Example#

metrics = runner.evaluate()
metrics

Reports#

You can export a Markdown report and scalar metrics table:

runner.save_report("artifacts/report.md")

For reconstruction-capable models, evaluation often goes together with direct inspection of reconstructed outputs:

reconstructed = runner.reconstruct(adata)

That output is now covered in the dedicated reconstruction sanity-check tutorial rather than being overloaded into the main embedding quickstart.

For benchmark work, treat PCA as the classical reference baseline rather than comparing deep-learning models only against each other.