Evaluation#

Evaluation is built into the workflow rather than left to ad hoc notebook code.

The release process also uses an internal quality suite so the toolkit is evaluated against itself on small Scanpy built-ins before public tutorial defaults are changed. The primary release-gate datasets are pbmc3k_processed and paul15, with PCA kept as the classical reference baseline in comparison work.

Core metrics#

Representation and reconstruction workflows can report:

  • mse

  • mae

  • pearson

  • spearman

  • silhouette

  • knn_label_consistency

  • ari

  • nmi

  • runtime_sec in comparison and benchmark summaries

Classification workflows can report:

  • accuracy

  • macro_f1

  • confusion_matrix

Example#

metrics = runner.evaluate()
metrics

Reports#

You can export a Markdown report and scalar metrics table:

runner.save_report("artifacts/report.md")

For benchmark work, treat PCA as the classical reference baseline rather than comparing deep-learning models only against each other.