# Experimental Annotation On Your Data
Use this guide when you have a labeled human AnnData object and want the easiest public path for experimental scGPT adaptation.
The current scope is still narrow:

- human scRNA-seq only
- official scGPT `whole-human` checkpoint only
- annotation only

Default quickstart strategies:

- frozen probe
- head-only tuning

Heavier benchmark-only strategies:

- full fine-tuning
- LoRA tuning
- adapter tuning
- prefix tuning
- IA3 tuning
## Required input shape

Your dataset should provide:

- cells in `adata.obs`
- genes in `adata.var_names`
- non-negative expression values in `adata.X` or `adata.raw.X`
- a categorical label column in `adata.obs`

For the current experimental wrapper, the most important fields are:

- `adata.var_names`
- `adata.obs["your_label_key"]`
## Start with inspection

Run the inspection step before training:

```python
from scdlkit import inspect_annotation_data

report = inspect_annotation_data(
    adata,
    label_key="cell_type",
    checkpoint="whole-human",
)
```

Inspect these fields first:

- `num_genes_matched`
- `gene_overlap_ratio`
- `class_counts`
- `min_class_count`
- `warnings`
If the report shows low gene overlap or very small classes, the wrapper may still run, but treat its results with extra caution.
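If you want to reason about those thresholds yourself, the same checks can be sketched in plain Python. The helper names below mirror the report fields but are illustrative only, not part of the scdlkit API:

```python
# Illustrative sketch of the checks behind the inspection report
# (plain Python, independent of scdlkit).

def overlap_ratio(dataset_genes, checkpoint_genes):
    """Fraction of dataset genes covered by the checkpoint vocabulary."""
    matched = set(dataset_genes) & set(checkpoint_genes)
    return len(matched) / len(dataset_genes)

def min_class_count(labels):
    """Smallest per-class cell count; tiny classes make splits unreliable."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return min(counts.values())

genes = ["CD3D", "CD8A", "MS4A1", "NKG7"]
vocab = ["CD3D", "CD8A", "MS4A1"]
print(overlap_ratio(genes, vocab))        # 0.75
print(min_class_count(["T", "T", "B"]))   # 1
```

In practice, a low overlap ratio or a minimum class count of only a handful of cells is the usual reason a run technically succeeds but generalizes poorly.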
## Fastest adaptation path

```python
from scdlkit import adapt_annotation

runner = adapt_annotation(
    adata,
    label_key="cell_type",
    output_dir="artifacts/scgpt_annotation",
)
```
This one call:

- inspects the dataset
- prepares and splits the tokenized data
- compares frozen probe and head-only tuning by default
- keeps the best fitted strategy
- writes the standard artifact bundle

The heavier annotation benchmark matrix extends to:

- `full_finetune`
- `lora`
- `adapter`
- `prefix_tuning`
- `ia3`
Those strategies are opt-in and intentionally live outside the default quickstart tutorial path.
## Write results back into AnnData

```python
runner.annotate_adata(
    adata,
    obs_key="scgpt_label",
    embedding_key="X_scgpt_best",
)
```
This writes:

- predicted labels to `adata.obs["scgpt_label"]`
- label codes to `adata.obs["scgpt_label_code"]`
- max confidence to `adata.obs["scgpt_label_confidence"]`
- latent embedding to `adata.obsm["X_scgpt_best"]`
That keeps the downstream Scanpy handoff simple.
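For example, once those columns exist in `adata.obs`, ordinary pandas operations apply. The DataFrame below is a hypothetical stand-in for `adata.obs` after write-back, not real output:

```python
import pandas as pd

# Hypothetical stand-in for adata.obs after annotate_adata() has run.
obs = pd.DataFrame({
    "scgpt_label": ["T cell", "T cell", "B cell"],
    "scgpt_label_confidence": [0.91, 0.62, 0.88],
})

# Flag low-confidence calls for manual review before downstream analysis.
low_conf = obs["scgpt_label_confidence"] < 0.7
print(obs.loc[low_conf, "scgpt_label"].tolist())  # ['T cell']
```

The same boolean mask works directly on the real `adata.obs`, e.g. for subsetting cells before plotting.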
## Save and reload the best fitted runner

```python
save_dir = runner.save("artifacts/scgpt_annotation/best_model")

from scdlkit import AnnotationRunner

reloaded = AnnotationRunner.load(save_dir, device="auto")
```

The saved directory contains:

- `manifest.json`
- `model_state.pt`
The base whole-human checkpoint is not vendored into the saved artifact. Reloading resolves it from the local cache.
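As a sketch of what that separation implies: the bundle carries lightweight metadata plus fitted weights, so reading the manifest is an ordinary JSON round-trip. The field names below are assumptions for illustration, not the actual manifest schema:

```python
import json
import os
import tempfile

# Assumed manifest fields for illustration; the real schema is defined by scdlkit.
manifest = {"base_checkpoint": "whole-human", "strategy": "head_only"}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "manifest.json")
    with open(path, "w") as f:
        json.dump(manifest, f)
    with open(path) as f:
        loaded = json.load(f)

# The base checkpoint is referenced by name, not embedded in the bundle.
print(loaded["base_checkpoint"])  # whole-human
```

Because only a name is stored, the same saved runner reloads on any machine that already has the base checkpoint cached.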
## When to drop down to the lower-level API

Use the wrapper first when you want:

- minimal code
- sensible defaults
- built-in strategy comparison
- straightforward `AnnData` write-back

Drop down to the Trainer path when you want:

- tighter control over epochs and optimizer settings
- custom evaluation code
- notebook-level debugging of the low-level training loop
## What remains experimental

- no non-human support
- no checkpoints beyond `whole-human`
- no perturbation, spatial, or multimodal workflows
- no claim that scGPT always beats classical baselines
- current model implementation is still scGPT only
The main product value of this path is not universal superiority. It is the ability to compare adaptation strategies on your own labeled dataset with a reproducible, Scanpy-compatible workflow.
Under the hood, this top-level beginner path is still backed by the experimental scGPT whole-human workflow in scdlkit.foundation.
If you want to see the same workflow on a non-PBMC human dataset before trying your own data, start with: