# Evaluation
Evaluation is built into the workflow rather than left to ad hoc notebook code.
The release process also runs an internal quality suite, so the toolkit is evaluated
against itself on small Scanpy built-in datasets before public tutorial defaults change.
The primary release-gate datasets are `pbmc3k_processed` and `paul15`, with PCA
kept as the classical reference baseline in comparison work.
## Core metrics
Representation and reconstruction workflows can report the following metrics in comparison and benchmark summaries:

- `mse`
- `mae`
- `pearson`
- `spearman`
- `silhouette`
- `knn_label_consistency`
- `ari`
- `nmi`
- `runtime_sec`
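As a rough illustration of what these metric names typically correspond to, the sketch below computes them with scikit-learn and SciPy on synthetic data. The variable names and the k-NN cross-validation used for `knn_label_consistency` are assumptions for illustration, not the toolkit's actual implementation.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import (
    mean_squared_error,
    mean_absolute_error,
    silhouette_score,
    adjusted_rand_score,
    normalized_mutual_info_score,
)
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
labels = np.repeat([0, 1, 2], 20)                    # ground-truth labels
X = rng.normal(size=(60, 5)) + 3.0 * labels[:, None]  # separated clusters
X_hat = X + rng.normal(scale=0.1, size=X.shape)       # noisy "reconstruction"
clusters = labels.copy()                              # predicted clusters

metrics = {
    # reconstruction quality: original values vs. reconstructed values
    "mse": mean_squared_error(X.ravel(), X_hat.ravel()),
    "mae": mean_absolute_error(X.ravel(), X_hat.ravel()),
    "pearson": pearsonr(X.ravel(), X_hat.ravel())[0],
    "spearman": spearmanr(X.ravel(), X_hat.ravel())[0],
    # representation quality: how well the embedding separates labels
    "silhouette": silhouette_score(X, labels),
    "ari": adjusted_rand_score(labels, clusters),
    "nmi": normalized_mutual_info_score(labels, clusters),
    # label consistency of local neighborhoods, via k-NN cross-validation
    "knn_label_consistency": cross_val_score(
        KNeighborsClassifier(n_neighbors=5), X, labels, cv=3
    ).mean(),
}
```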
Classification workflows can report:

- `accuracy`
- `macro_f1`
- `confusion_matrix`
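These three classification metrics map directly onto standard scikit-learn calls; a minimal sketch (with made-up label vectors, not toolkit output):

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 2, 2, 2])  # one class-1 sample mislabeled as 2

report = {
    "accuracy": accuracy_score(y_true, y_pred),
    # macro F1 averages per-class F1 scores, weighting all classes equally
    "macro_f1": f1_score(y_true, y_pred, average="macro"),
    # rows = true classes, columns = predicted classes
    "confusion_matrix": confusion_matrix(y_true, y_pred),
}
```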
## Example
```python
metrics = runner.evaluate()
metrics
```
## Reports
You can export a Markdown report and scalar metrics table:
```python
runner.save_report("artifacts/report.md")
```
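For intuition, a report of this kind pairs each scalar metric with its value in a Markdown table. The stdlib-only sketch below shows one plausible shape for such a writer; `save_markdown_report` is a hypothetical helper, not the toolkit's `save_report` implementation.

```python
from pathlib import Path


def save_markdown_report(metrics: dict, path: str) -> None:
    """Write scalar metrics as a two-column Markdown table (illustrative only)."""
    lines = ["# Evaluation report", "", "| metric | value |", "| --- | --- |"]
    for name, value in metrics.items():
        lines.append(f"| {name} | {value:.4f} |")
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text("\n".join(lines) + "\n")


save_markdown_report({"mse": 0.0123, "ari": 0.9100}, "artifacts/report.md")
```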
For benchmark work, treat PCA as the classical reference baseline rather than comparing deep-learning models only against each other.
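Setting up that baseline is a one-liner with scikit-learn; a minimal sketch on synthetic data (the shapes and component count are arbitrary assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))  # cells x features stand-in

# The classical reference embedding: evaluate deep-learning models
# against this with the same metrics, not only against each other.
pca = PCA(n_components=10, random_state=0)
Z = pca.fit_transform(X)

explained = pca.explained_variance_ratio_.sum()
```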