Annotation Benchmarks#

Status: experimental.

This guide summarizes how scDLKit currently frames annotation adaptation evidence.

The goal is not to claim that a foundation model always wins. The goal is to give researchers a reproducible way to compare:

PCA + logistic regression
frozen scGPT probe
head-only tuning
full fine-tuning
LoRA tuning
adapter tuning
prefix tuning
IA3 tuning

on more than one labeled human dataset while keeping the outputs Scanpy-friendly.

Datasets in the current evidence story#

pbmc68k_reduced
OpenProblems human pancreas

These are both human scRNA-seq datasets, which matches the current experimental scGPT whole-human scope.

Regimes in the annotation pillar#

full-label benchmark on pbmc68k_reduced and human pancreas
low-label benchmark on both datasets at 1%, 5%, and 10%
cross-study benchmark on human pancreas through held-out technology families

The main public tutorial stays on the quickstart pancreas subset and compares only frozen_probe and head. The heavier regime matrix belongs to the dedicated benchmark workflow, not the default docs notebook path.

What the current comparisons are meant to answer#

Is the frozen checkpoint already good enough for the label space?
Does a small trainable head recover the gap without a heavier adaptation step?
Does full fine-tuning materially improve on lighter strategies?
Which PEFT method is the best cost-performance tradeoff on this dataset?
Can the best adapted runner be saved and reused later?

What to conclude from the benchmark outputs#

Sometimes the frozen probe is already strong enough.
Sometimes head-only tuning is the best cost-versus-performance tradeoff.
Full fine-tuning is the unconstrained reference point, not the default recommendation.
LoRA, adapters, prefix tuning, and IA3 are available when more expressive adaptation is worth the added runtime and configuration complexity.
The main value of scDLKit is not a universal leaderboard claim. It is the ability to make this comparison reproducible and inspectable on AnnData.

Why scDLKit is useful here#

it keeps the workflow AnnData-native
it writes embeddings and predictions back into Scanpy-compatible structures
it saves standard comparison artifacts instead of hiding the result behind one score
it can save and reload the best adapted runner for later reuse
it now exposes a dedicated annotation benchmark runner and workflow for the heavier matrix, separate from the beginner tutorial path

Artifact and ranking rules#

The annotation pillar ranks strategies by:

macro_f1
balanced_accuracy
accuracy
lower trainable_parameters
lower runtime_sec

The benchmark artifact bundle is shaped for paper figures:

full-label summary across PBMC and pancreas
low-label curves for 1%, 5%, and 10%
pancreas cross-study held-out-batch results
efficiency-performance Pareto plots
qualitative UMAP and confusion-matrix outputs from the main tutorial path

Use scripts/run_annotation_benchmark.py for the full matrix and .github/workflows/annotation-benchmark.yml for the scheduled or manual heavy run.

Notes and caveats#

This is still an experimental feature line.
The current public checkpoint scope is still only scGPT whole-human, so the annotation pillar is implemented on scGPT only.
This guide does not claim non-human, multimodal, perturbation, or spatial support.
The beyond-PBMC evidence path is intentionally heavier than the standard PR CI path.