Experimental Annotation On Your Data#

Use this guide when you have a labeled human AnnData object and want the easiest public path for experimental scGPT adaptation.

Related APIs:

The current scope is still narrow:

  • human scRNA-seq only

  • official scGPT whole-human checkpoint only

  • annotation only

  • default quickstart strategies:

    • frozen probe

    • head-only tuning

  • heavier benchmark-only strategies:

    • full fine-tuning

    • LoRA tuning

    • adapter tuning

    • prefix tuning

    • IA3 tuning

Required input shape#

Your dataset should provide:

  • cells in adata.obs

  • genes in adata.var_names

  • non-negative expression values in adata.X or adata.raw.X

  • a categorical label column in adata.obs

For the current experimental wrapper, the most important fields are:

  • adata.var_names

  • adata.obs["your_label_key"]

Start with inspection#

Run the inspection step before training:

from scdlkit import inspect_annotation_data

report = inspect_annotation_data(
    adata,
    label_key="cell_type",
    checkpoint="whole-human",
)

Inspect these fields first:

  • num_genes_matched

  • gene_overlap_ratio

  • class_counts

  • min_class_count

  • warnings

If the report shows low overlap or very small classes, the wrapper may still run, but the result should be treated with more caution.

Fastest adaptation path#

from scdlkit import adapt_annotation

runner = adapt_annotation(
    adata,
    label_key="cell_type",
    output_dir="artifacts/scgpt_annotation",
)

This one call:

  • inspects the dataset

  • prepares and splits the tokenized data

  • compares frozen probe and head-only tuning by default

  • keeps the best fitted strategy

  • writes the standard artifact bundle

The heavier annotation benchmark matrix extends to:

  • full_finetune

  • lora

  • adapter

  • prefix_tuning

  • ia3

Those strategies are opt-in and intentionally live outside the default quickstart tutorial path.

Write results back into AnnData#

runner.annotate_adata(
    adata,
    obs_key="scgpt_label",
    embedding_key="X_scgpt_best",
)

This writes:

  • predicted labels to adata.obs["scgpt_label"]

  • label codes to adata.obs["scgpt_label_code"]

  • max confidence to adata.obs["scgpt_label_confidence"]

  • latent embedding to adata.obsm["X_scgpt_best"]

That keeps the downstream Scanpy handoff simple.

Save and reload the best fitted runner#

save_dir = runner.save("artifacts/scgpt_annotation/best_model")

from scdlkit import AnnotationRunner

reloaded = AnnotationRunner.load(save_dir, device="auto")

The saved directory contains:

  • manifest.json

  • model_state.pt

The base whole-human checkpoint is not vendored into the saved artifact. Reloading resolves it from the local cache.

When to drop down to the lower-level API#

Use the wrapper first when you want:

  • minimal code

  • sensible defaults

  • built-in strategy comparison

  • straightforward AnnData write-back

Drop down to the Trainer path when you want:

  • tighter control over epochs and optimizer settings

  • custom evaluation code

  • notebook-level debugging of the low-level training loop

What remains experimental#

  • no non-human support

  • no checkpoints beyond whole-human

  • no perturbation, spatial, or multimodal workflows

  • no claim that scGPT always beats classical baselines

  • current model implementation is still scGPT only

The main product value of this path is not universal superiority. It is the ability to compare adaptation strategies on your own labeled dataset with a reproducible, Scanpy-compatible workflow.

Under the hood, this top-level beginner path is still backed by the experimental scGPT whole-human workflow in scdlkit.foundation.

If you want to see the same workflow on a non-PBMC human dataset before trying your own data, start with: