Experimental Annotation Quickstart API#

What it is#

Status: experimental.

This page documents the easiest public path for labeled annotation adaptation:

  • adapt_annotation(...) for the one-call workflow

  • inspect_annotation_data(...) for preflight checks

  • AnnotationRunner for the explicit inspect-fit-predict-annotate-save-load flow

The current implementation routes only to the experimental scGPT whole-human annotation path for human scRNA-seq data.

When to use it#

Use this page instead of TaskRunner when:

  • you already have labels in adata.obs

  • your goal is annotation adaptation, not just a baseline embedding

  • you want predictions and embeddings written back into AnnData

  • you want to compare frozen and tuned strategies with minimal code

Use Experimental foundation helpers when you want the lower-level scGPT-specific route underneath this alias layer.

Minimal example#

from scdlkit import adapt_annotation

runner = adapt_annotation(
    adata,
    label_key="cell_type",
    output_dir="artifacts/scgpt_annotation",
)

runner.annotate_adata(adata, obs_key="scgpt_label", embedding_key="X_scgpt_best")
runner.save("artifacts/scgpt_annotation/best_model")

Parameters#

  • label_key: required adata.obs column containing the target annotation labels.

  • checkpoint: currently fixed to the experimental scGPT whole-human checkpoint.

  • strategies: default quickstart is ("frozen_probe", "head"). Additional opt-in strategy names are "full_finetune", "lora", "adapter", "prefix_tuning", and "ia3".

  • strategy_configs: optional per-strategy config mapping for the heavier PEFT comparison surface exposed under scdlkit.foundation.

  • lora_config: backward-compatible alias for strategy_configs={"lora": LoRAConfig(...)} in the 0.1.x line.

  • batch_size, val_size, test_size, random_state, device: wrapper training and split defaults.

  • output_dir: optional artifact directory for reports, plots, and saved runner state.

Input expectations#

  • input must be human scRNA-seq stored in anndata.AnnData.

  • label_key must exist in adata.obs and contain at least two label categories for training.

  • the expression matrix must be non-negative for the scGPT tokenization path.

  • gene overlap with the whole-human vocabulary must be sufficient; inspect_annotation_data(...) exposes that check before fitting.

  • optional batch or study metadata can stay in adata.obs and will be carried through for downstream reporting when present.

Returns / outputs#

  • inspect_annotation_data(...) returns a ScGPTAnnotationDataReport.

  • adapt_annotation(...) returns a fitted AnnotationRunner.

  • AnnotationRunner.predict(...) returns label_codes, labels, probabilities, and latent.

  • AnnotationRunner.annotate_adata(...) writes labels to adata.obs and embeddings to adata.obsm.

  • AnnotationRunner.save(...) writes a directory with manifest.json and model_state.pt.

  • strategy comparison artifacts include per-strategy metrics with macro_f1, accuracy, balanced_accuracy, and multiclass auroc_ovr when probability outputs make it valid.

Failure modes / raises#

  • ImportError if the package was installed without the foundation extra.

  • ValueError if labels are missing, class counts are too small, or gene overlap is insufficient.

  • ValueError if an unsupported strategy name is requested or a strategy config does not match the selected strategy.

  • ValueError if both strategy_configs and lora_config are supplied.

  • RuntimeError if you try to predict, annotate, or save before fitting or loading a runner.

  • ValueError if the saved runner manifest is incomplete or incompatible.

Notes / caveats#

  • This surface is experimental even though the aliases live at scdlkit.

  • The beginner default is intentionally CPU-friendly: frozen probe plus head-only tuning.

  • The heavier annotation benchmark surface extends to:

    • full fine-tuning

    • LoRA

    • adapters

    • prefix tuning

    • IA3

  • The current public model implementation is still scGPT only.

  • TaskRunner is not extended for this path in the current release line.