Evaluation
Evaluation is built into the workflow rather than left to ad hoc notebook code.
The release process also runs an internal quality suite, so the toolkit is evaluated
against itself on small Scanpy built-in datasets before public tutorial defaults are changed.
The primary release-gate datasets are pbmc3k_processed and paul15, with PCA
kept as the classical reference baseline in comparison work. The experimental
foundation-model pilot adds pbmc68k_reduced and compares frozen scGPT embeddings
against PCA rather than treating the foundation path as automatically better.
Core metrics
Representation and reconstruction workflows can report:
- mse
- mae
- pearson
- spearman
- silhouette
- knn_label_consistency
- ari
- nmi
- runtime_sec in comparison and benchmark summaries
- probe_accuracy for frozen linear probes on embedding benchmarks
- probe_macro_f1 for frozen linear probes on embedding benchmarks
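To make the reconstruction-oriented entries concrete, here is a minimal sketch of the standard definitions behind mse, mae, pearson, and spearman, using only the Python standard library. It is not the toolkit's implementation, which this page does not show; the function names and the toy vectors are assumptions chosen for illustration.

```python
import math
from statistics import mean


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def pearson(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy values standing in for one gene's observed and reconstructed expression.
observed = [0.0, 1.0, 2.0, 4.0]
reconstructed = [0.5, 1.0, 1.5, 5.0]

scores = {
    "mse": mse(observed, reconstructed),
    "mae": mae(observed, reconstructed),
    "pearson": pearson(observed, reconstructed),
    "spearman": spearman(observed, reconstructed),
}
print(scores)
```

Because the toy reconstruction preserves the ordering of the observed values, spearman is 1 even though mse and mae are non-zero, which is one reason to report rank-based and error-based metrics side by side.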
Classification workflows can report:
- accuracy
- macro_f1
- confusion_matrix
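The classification metrics can be illustrated the same way. The sketch below derives accuracy and macro_f1 from a confusion matrix in plain Python; it follows the usual textbook definitions and is not taken from the toolkit's source. The helper names and the toy labels are assumptions.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def macro_f1(matrix):
    """Unweighted mean of per-class F1, so rare classes count equally."""
    n = len(matrix)
    f1_scores = []
    for i in range(n):
        tp = matrix[i][i]
        fp = sum(matrix[r][i] for r in range(n)) - tp
        fn = sum(matrix[i]) - tp
        denom = 2 * tp + fp + fn
        # A class with no true and no predicted samples contributes 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / n


# Toy cell-type labels, for illustration only.
labels = ["B", "NK", "T"]
y_true = ["B", "B", "T", "T", "NK", "NK"]
y_pred = ["B", "T", "T", "T", "NK", "B"]

cm = confusion_matrix(y_true, y_pred, labels)
acc = accuracy(cm)
f1 = macro_f1(cm)
print(cm, acc, f1)
```

Macro averaging weights every class equally, which matters for single-cell data where some cell types are much rarer than others.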
Example
metrics = runner.evaluate()
metrics
Reports
You can export a Markdown report and a scalar metrics table:
runner.save_report("artifacts/report.md")
For reconstruction-capable models, evaluation often goes together with direct inspection of reconstructed outputs:
reconstructed = runner.reconstruct(adata)
That output is now covered in the dedicated reconstruction sanity-check tutorial rather than being overloaded into the main embedding quickstart.
For benchmark work, treat PCA as the classical reference baseline rather than comparing deep-learning models only against each other.
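One way to keep the PCA baseline in view is to report every other model's metrics as differences from the PCA row. The sketch below assumes each model's metrics are available as a plain dictionary keyed by metric name; that layout, the helper name `compare_to_baseline`, the model name `candidate`, and the numbers are all hypothetical placeholders, not toolkit output or benchmark results.

```python
def compare_to_baseline(results, baseline="pca", metrics=("silhouette", "ari", "nmi")):
    """Return per-model metric deltas relative to the baseline row.

    Positive deltas mean the model scored higher than the baseline
    on a metric where higher is better.
    """
    base = results[baseline]
    deltas = {}
    for name, scores in results.items():
        if name == baseline:
            continue
        deltas[name] = {
            metric: scores[metric] - base[metric]
            for metric in metrics
            if metric in scores and metric in base
        }
    return deltas


# Placeholder numbers chosen only to exercise the function.
results = {
    "pca": {"ari": 0.50, "nmi": 0.60},
    "candidate": {"ari": 0.55, "nmi": 0.58},
}

deltas = compare_to_baseline(results)
for model, row in deltas.items():
    for metric, delta in row.items():
        print(f"{model:10s} {metric:4s} {delta:+.3f}")
```

Reporting signed deltas makes it obvious when a deep-learning model fails to beat the classical baseline on a given metric, instead of hiding that behind comparisons among deep models only.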