Experimental Foundation Helpers#

What it is#

Status: experimental.

This page documents the explicit lower-level scGPT path underneath scdlkit.adapt_annotation(...). It is the place to go when you want direct control over frozen scGPT embeddings, tokenized datasets, split-aware annotation training, or the underlying wrapper objects.

When to use it#

Use this page when you want to:

  • extract frozen scGPT embeddings directly

  • prepare tokenized scGPT data for your own workflow

  • split tokenized data for annotation fine-tuning

  • load a Trainer-compatible scGPT annotation model explicitly

  • drop below the top-level beginner alias and inspect the scGPT-specific objects

Minimal example#

from scdlkit.foundation import (
    AdapterConfig,
    load_scgpt_annotation_model,
    prepare_scgpt_data,
    split_scgpt_data,
)
from scdlkit import Trainer

# adata: human scRNA-seq AnnData with labels in adata.obs["cell_type"]
prepared = prepare_scgpt_data(adata, label_key="cell_type")
split = split_scgpt_data(prepared)
model = load_scgpt_annotation_model(
    num_classes=len(prepared.label_categories or ()),
    label_categories=prepared.label_categories,
    tuning_strategy="adapter",
    strategy_config=AdapterConfig(bottleneck_dim=64, dropout=0.05),
)
trainer = Trainer(model=model, task="classification", batch_size=prepared.batch_size)
trainer.fit(split.train, split.val)

Parameters#

  • load_scgpt_model(...) loads the official whole-human checkpoint for frozen embeddings.

  • prepare_scgpt_data(...) tokenizes compatible human AnnData and optionally encodes labels.

  • split_scgpt_data(...) creates train, validation, and test subsets without re-tokenizing.

  • load_scgpt_annotation_model(...) builds an scGPT classifier for Trainer using one of the head, full_finetune, lora, adapter, prefix_tuning, or ia3 tuning strategies.

  • Generic PEFT configs are exposed under scdlkit.foundation as:

    • PEFTConfig

    • LoRAConfig

    • AdapterConfig

    • PrefixTuningConfig

    • IA3Config

  • ScGPTLoRAConfig remains available as a compatibility alias in the 0.1.x release line.

  • ScGPTAnnotationRunner and adapt_scgpt_annotation(...) expose the explicit wrapper-first foundation path.
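The lora strategy follows the standard low-rank adaptation idea: the checkpoint weights stay frozen and only a small low-rank update is trained, so the effective weight is W + (alpha / r) · B @ A. A dependency-free sketch of that arithmetic with plain-Python matrices (a conceptual illustration, not scdlkit internals):

```python
# Conceptual sketch (not scdlkit internals): LoRA keeps W frozen and learns
# a low-rank update, giving an effective weight W + (alpha / r) * B @ A.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def lora_effective_weight(W, A, B, alpha):
    """Frozen W plus the scaled low-rank update B @ A (rank r = len(A))."""
    r = len(A)              # A is r x d_in, B is d_out x r
    BA = matmul(B, A)
    scale = alpha / r
    return [[W[i][j] + scale * BA[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

# Rank-1 update of a 2x2 frozen weight.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]            # r = 1, d_in = 2
B = [[0.5], [0.25]]         # d_out = 2, r = 1
print(lora_effective_weight(W, A, B, alpha=1.0))
# -> [[1.5, 1.0], [0.25, 1.5]]
```

Because only A and B are trained, the number of tunable parameters scales with the rank r rather than with the full weight matrix.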

Input expectations#

  • input must be human scRNA-seq data provided as an AnnData object.

  • the checkpoint scope is currently limited to scGPT whole-human.

  • expression values must be non-negative.

  • annotation tuning requires a valid label_key with at least two label categories.

  • sufficient gene overlap with the checkpoint vocabulary is required; otherwise preparation raises a clear error.
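The checks above can be sketched as one small validation routine. The function name and the min_overlap threshold are illustrative assumptions for this sketch, not scdlkit's actual helpers:

```python
# Illustrative validation sketch (not scdlkit's real preparation code):
# raise ValueError when inputs violate the documented expectations.

def validate_annotation_inputs(values, labels, vocab, genes, min_overlap=0.5):
    """Check non-negativity, label count, and vocabulary overlap."""
    if any(v < 0 for v in values):
        raise ValueError("expression values must be non-negative")
    if len(set(labels)) < 2:
        raise ValueError("label_key must provide at least two label categories")
    overlap = len(set(genes) & set(vocab)) / len(genes)
    if overlap < min_overlap:  # threshold is an assumption for this sketch
        raise ValueError(
            f"only {overlap:.0%} of genes overlap the checkpoint vocabulary"
        )
    return overlap

# A passing call: non-negative counts, two classes, full vocabulary overlap.
validate_annotation_inputs(
    values=[0.0, 3.0, 1.0],
    labels=["T cell", "B cell"],
    vocab={"CD3D", "MS4A1"},
    genes=["CD3D", "MS4A1"],
)
```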

Returns / outputs#

  • ScGPTPreparedData stores tokenized tensors plus checkpoint and label metadata.

  • ScGPTSplitData stores split-aware token datasets for training and evaluation.

  • load_scgpt_model(...) returns an embedding model for frozen inference.

  • load_scgpt_annotation_model(...) returns a classification model ready for Trainer(..., task="classification").

  • ScGPTAnnotationRunner and adapt_scgpt_annotation(...) can emit reports, plots, predictions, and saved runner state.

  • saved runner manifests now include strategy metadata and serialized strategy-config values so trainable strategies can be reloaded cleanly.
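To illustrate what "serialized strategy-config values" means, here is a hypothetical manifest entry built from a dataclass config. The real manifest layout is scdlkit-internal; the dataclass below merely mirrors the documented AdapterConfig fields from the minimal example:

```python
# Hypothetical manifest-entry sketch (the real layout is scdlkit-internal):
# serialize a strategy config to plain values so the trainable strategy can
# be rebuilt cleanly on reload.
from dataclasses import asdict, dataclass

@dataclass
class AdapterConfig:  # mirrors the documented fields; assumed defaults
    bottleneck_dim: int = 64
    dropout: float = 0.05

def manifest_entry(strategy, config):
    """Pair the strategy name with its serialized config values."""
    return {"tuning_strategy": strategy, "strategy_config": asdict(config)}

entry = manifest_entry("adapter", AdapterConfig(bottleneck_dim=64, dropout=0.05))
# Reload side: rebuild the config object from the stored plain values.
restored = AdapterConfig(**entry["strategy_config"])
```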

Failure modes / raises#

  • ImportError if the package was installed without scdlkit[foundation].

  • ValueError if labels are missing, the tuning strategy is unsupported, or the checkpoint vocabulary overlap is too small.

  • ValueError if expression values are negative.

  • RuntimeError if wrapper prediction or save/load methods are called in the wrong lifecycle stage.
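The lifecycle RuntimeError can be pictured with a minimal guard object, sketched here for prediction only (illustrative; the wrapper's real implementation differs):

```python
# Conceptual lifecycle guard (not the wrapper's real code): calling predict()
# before fit() raises RuntimeError, matching the failure mode above.
class LifecycleGuard:
    def __init__(self):
        self.fitted = False

    def fit(self):
        self.fitted = True

    def predict(self):
        if not self.fitted:
            raise RuntimeError("predict() called before fit()")
        return "ok"
```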

Notes / caveats#

  • The recommended beginner route is still the Experimental annotation quickstart API (scdlkit.adapt_annotation(...)).

  • This page documents the lower-level implementation and is intentionally narrower than a general foundation-model framework.

  • Supported scope remains:

    • human scRNA-seq only

    • scGPT whole-human only

    • annotation tuning only

    • a single model implementation (scGPT) only

  • The scGPT annotation tuning-strategy matrix now includes:

    • head

    • full_finetune

    • lora

    • adapter

    • prefix_tuning

    • ia3

  • Cross-model support for scFoundation, CellFM, and Nicheformer remains future work.