Experimental Foundation Helpers#

What it is#

Status: experimental.

This page documents the explicit lower-level scGPT path underneath scdlkit.adapt_annotation(...). It is the place to go when you want direct control over frozen scGPT embeddings, tokenized datasets, split-aware annotation training, or the underlying wrapper objects.

When to use it#

Use this page when you want to:

  • extract frozen scGPT embeddings directly

  • prepare tokenized scGPT data for your own workflow

  • split tokenized data for annotation fine-tuning

  • load a Trainer-compatible scGPT annotation model explicitly

  • drop below the top-level beginner alias and inspect the scGPT-specific objects

Minimal example#

from scdlkit.foundation import (
    AdapterConfig,
    load_scgpt_annotation_model,
    prepare_scgpt_data,
    split_scgpt_data,
)
from scdlkit import Trainer

# adata: human scRNA-seq AnnData with labels in adata.obs["cell_type"]
prepared = prepare_scgpt_data(adata, label_key="cell_type")
split = split_scgpt_data(prepared)
model = load_scgpt_annotation_model(
    num_classes=len(prepared.label_categories or ()),
    label_categories=prepared.label_categories,
    tuning_strategy="adapter",
    strategy_config=AdapterConfig(bottleneck_dim=64, dropout=0.05),
)
trainer = Trainer(model=model, task="classification", batch_size=prepared.batch_size)
trainer.fit(split.train, split.val)

Parameters#

  • load_scgpt_model(...) loads the official whole-human checkpoint for frozen embeddings.

  • prepare_scgpt_data(...) tokenizes compatible human AnnData and optionally encodes labels.

  • split_scgpt_data(...) creates train, validation, and test subsets without re-tokenizing.

  • load_scgpt_annotation_model(...) builds an scGPT classifier for Trainer using one of the head, full_finetune, lora, adapter, prefix_tuning, or ia3 tuning strategies.

  • Generic PEFT configs are exposed under scdlkit.foundation as:

    • PEFTConfig

    • LoRAConfig

    • AdapterConfig

    • PrefixTuningConfig

    • IA3Config

  • ScGPTLoRAConfig remains available as a compatibility alias in the 0.1.x release line.

  • ScGPTAnnotationRunner and adapt_scgpt_annotation(...) expose the explicit wrapper-first foundation path.
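The lora strategy follows the standard low-rank adaptation idea: the checkpoint weights stay frozen and only a small low-rank update is trained, so the effective weight is W + (alpha / r) · B @ A. A dependency-free sketch of that arithmetic with plain-Python matrices (a conceptual illustration, not scdlkit internals):

```python
# Conceptual sketch (not scdlkit internals): LoRA keeps W frozen and learns
# a low-rank update, giving an effective weight W + (alpha / r) * B @ A.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def lora_effective_weight(W, A, B, alpha):
    """Frozen W plus the scaled low-rank update B @ A (rank r = len(A))."""
    r = len(A)              # A is r x d_in, B is d_out x r
    BA = matmul(B, A)
    scale = alpha / r
    return [[W[i][j] + scale * BA[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

# Rank-1 update of a 2x2 frozen weight.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]            # r = 1, d_in = 2
B = [[0.5], [0.25]]         # d_out = 2, r = 1
print(lora_effective_weight(W, A, B, alpha=1.0))
# -> [[1.5, 1.0], [0.25, 1.5]]
```

Because only A and B are trained, the number of tunable parameters scales with the rank r rather than with the full weight matrix.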

Input expectations#

  • input must be human scRNA-seq data provided as an AnnData object.

  • the checkpoint scope is currently limited to scGPT whole-human.

  • expression values must be non-negative.

  • annotation tuning requires a valid label_key with at least two label categories.

  • sufficient gene overlap with the checkpoint vocabulary is required; otherwise preparation raises a clear error.
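The checks above can be sketched as one small validation routine. The function name and the min_overlap threshold are illustrative assumptions for this sketch, not scdlkit's actual helpers:

```python
# Illustrative validation sketch (not scdlkit's real preparation code):
# raise ValueError when inputs violate the documented expectations.

def validate_annotation_inputs(values, labels, vocab, genes, min_overlap=0.5):
    """Check non-negativity, label count, and vocabulary overlap."""
    if any(v < 0 for v in values):
        raise ValueError("expression values must be non-negative")
    if len(set(labels)) < 2:
        raise ValueError("label_key must provide at least two label categories")
    overlap = len(set(genes) & set(vocab)) / len(genes)
    if overlap < min_overlap:  # threshold is an assumption for this sketch
        raise ValueError(
            f"only {overlap:.0%} of genes overlap the checkpoint vocabulary"
        )
    return overlap

# A passing call: non-negative counts, two classes, full vocabulary overlap.
validate_annotation_inputs(
    values=[0.0, 3.0, 1.0],
    labels=["T cell", "B cell"],
    vocab={"CD3D", "MS4A1"},
    genes=["CD3D", "MS4A1"],
)
```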

Returns / outputs#

  • ScGPTPreparedData stores tokenized tensors plus checkpoint and label metadata.

  • ScGPTSplitData stores split-aware token datasets for training and evaluation.

  • load_scgpt_model(...) returns an embedding model for frozen inference.

  • load_scgpt_annotation_model(...) returns a classification model ready for Trainer(..., task="classification").

  • ScGPTAnnotationRunner and adapt_scgpt_annotation(...) can emit reports, plots, predictions, and saved runner state.

  • saved runner manifests now include strategy metadata and serialized strategy-config values so trainable strategies can be reloaded cleanly.
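To illustrate what "serialized strategy-config values" means, here is a hypothetical manifest entry built from a dataclass config. The real manifest layout is scdlkit-internal; the dataclass below merely mirrors the documented AdapterConfig fields from the minimal example:

```python
# Hypothetical manifest-entry sketch (the real layout is scdlkit-internal):
# serialize a strategy config to plain values so the trainable strategy can
# be rebuilt cleanly on reload.
from dataclasses import asdict, dataclass

@dataclass
class AdapterConfig:  # mirrors the documented fields; assumed defaults
    bottleneck_dim: int = 64
    dropout: float = 0.05

def manifest_entry(strategy, config):
    """Pair the strategy name with its serialized config values."""
    return {"tuning_strategy": strategy, "strategy_config": asdict(config)}

entry = manifest_entry("adapter", AdapterConfig(bottleneck_dim=64, dropout=0.05))
# Reload side: rebuild the config object from the stored plain values.
restored = AdapterConfig(**entry["strategy_config"])
```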

Failure modes / raises#

  • ImportError if the package was installed without scdlkit[foundation].

  • ValueError if labels are missing, the tuning strategy is unsupported, or the checkpoint vocabulary overlap is too small.

  • ValueError if expression values are negative.

  • RuntimeError if wrapper prediction or save/load methods are called in the wrong lifecycle stage.
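The lifecycle RuntimeError can be pictured with a minimal guard object, sketched here for prediction only (illustrative; the wrapper's real implementation differs):

```python
# Conceptual lifecycle guard (not the wrapper's real code): calling predict()
# before fit() raises RuntimeError, matching the failure mode above.
class LifecycleGuard:
    def __init__(self):
        self.fitted = False

    def fit(self):
        self.fitted = True

    def predict(self):
        if not self.fitted:
            raise RuntimeError("predict() called before fit()")
        return "ok"
```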

Notes / caveats#

  • The recommended beginner route is still the Experimental annotation quickstart API (scdlkit.adapt_annotation(...)).

  • This page documents the lower-level implementation and is intentionally narrower than a general foundation-model framework.

  • Supported scope remains:

    • human scRNA-seq only

    • scGPT whole-human only

    • annotation tuning only

    • a single model implementation (scGPT) only

  • The scGPT annotation tuning-strategy matrix now includes:

    • head

    • full_finetune

    • lora

    • adapter

    • prefix_tuning

    • ia3

  • Cross-model support for scFoundation, CellFM, and Nicheformer remains future work.