Foundation Models

scDLKit now includes an experimental scGPT path for human scRNA-seq workflows.

The public scope is still deliberately narrow:

- official whole-human checkpoint only
- human single-cell RNA only
- wrapper-first helpers for beginners, Trainer plus scdlkit.foundation helpers underneath
- frozen embeddings remain supported
- experimental cell-type annotation fine-tuning is now supported through:
  - frozen linear probe
  - head-only tuning
  - full fine-tuning
  - LoRA tuning
  - adapter tuning
  - prefix tuning
  - IA3 tuning
- no TaskRunner support yet

Install

python -m pip install "scdlkit[foundation,tutorials]"
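
After installing, a quick check can confirm that the package is visible in the active environment. The sketch below uses only the standard library; the helper name `is_importable` is illustrative and not part of scDLKit.

```python
from importlib.util import find_spec


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return find_spec(module_name) is not None


# "scdlkit" is the import name used throughout this page.
status = "found" if is_importable("scdlkit") else "missing"
print(f"scdlkit: {status}")
```

The check is limited to top-level module names because `find_spec` imports parent packages when given a dotted name, which raises an error if the parent is absent.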

What this path is for

Use the experimental foundation path when you want to:

- extract frozen cell embeddings from an official scGPT checkpoint
- compare those embeddings against PCA and scDLKit baselines
- fine-tune scGPT for a labeled cell-type annotation task
- decide whether your dataset needs:
  - only frozen embeddings
  - a trainable classification head
  - a full-backbone baseline
  - one of the available PEFT methods

This is the bridge between the baseline toolkit and later foundation-model adaptation work.

Easiest wrapper-first path

If you want the smallest amount of code, start with the top-level experimental alias:

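from scdlkit import adapt_annotation

# adata: an AnnData object with cell-type labels in adata.obs["cell_type"]
runner = adapt_annotation(
    adata,
    label_key="cell_type",
    output_dir="artifacts/scgpt_annotation",
)

# Annotate the AnnData with the best fitted strategy, then save the runner
runner.annotate_adata(adata)
runner.save("artifacts/scgpt_annotation/best_model")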

This wrapper:

- inspects the labeled dataset
- compares frozen probe and head-only tuning by default
- keeps the best fitted strategy in memory
- writes standard report artifacts
- makes it easy to annotate AnnData and save the best fitted runner
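
The compare-and-keep-best behaviour can be pictured as a small selection loop. The sketch below is not scDLKit's implementation: the strategy names `frozen_probe` and `head`, the `select_best` helper, and the hard-coded scores are hypothetical placeholders for whatever validation metric the wrapper actually reports.

```python
from typing import Callable, Dict, Tuple


def select_best(
    strategies: Dict[str, Callable[[], float]],
) -> Tuple[str, Dict[str, float]]:
    """Run every strategy once and return the best name plus all scores.

    Each callable stands in for "fit this strategy and return a validation
    score where higher is better".
    """
    if not strategies:
        raise ValueError("at least one strategy is required")
    scores = {name: fit() for name, fit in strategies.items()}
    # On a tie, max() keeps the strategy that was listed first.
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical scores; a real run would train and evaluate each strategy.
best_name, all_scores = select_best(
    {
        "frozen_probe": lambda: 0.81,
        "head": lambda: 0.86,
    }
)
print(best_name, all_scores)
```

Listing the cheaper strategy first means a tie resolves toward it, which is usually the preferable default when two strategies score the same.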

The heavier annotation benchmark matrix extends to:

- full_finetune
- lora
- adapter
- prefix_tuning
- ia3

Those heavier strategies are intentionally not part of the default docs quickstart.

Inspect before training

For user-supplied datasets, inspect first through the top-level beginner alias:

from scdlkit import inspect_annotation_data

report = inspect_annotation_data(
    adata,
    label_key="cell_type",
    checkpoint="whole-human",
)

This is the recommended preflight step when you want to know whether gene overlap or class balance is likely to make the adaptation path brittle.
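
To make the two preflight concerns concrete, here is a minimal standard-library sketch of a gene-overlap fraction and a class-balance summary. It is not what `inspect_annotation_data` does internally; the helper names, the toy gene list, and the toy vocabulary are assumptions for illustration only.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence


def gene_overlap_fraction(dataset_genes: Iterable[str], vocab_genes: Iterable[str]) -> float:
    """Fraction of dataset genes that also appear in the checkpoint vocabulary."""
    dataset = set(dataset_genes)
    if not dataset:
        return 0.0
    return len(dataset & set(vocab_genes)) / len(dataset)


def class_balance(labels: Sequence[str]) -> Dict[str, float]:
    """Share of cells per label, largest class first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.most_common()}


# Toy inputs; with AnnData these would come from adata.var_names
# and adata.obs[label_key].
genes = ["CD3E", "CD19", "MS4A1", "NKG7"]
vocab = ["CD3E", "MS4A1", "NKG7", "LYZ"]
labels = ["T", "T", "T", "B"]

print(gene_overlap_fraction(genes, vocab))  # 0.75
print(class_balance(labels))  # {'T': 0.75, 'B': 0.25}
```

A low overlap fraction or a label that covers only a small share of cells is the kind of signal that suggests the adaptation path may be brittle.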

Frozen embedding API

from scdlkit import Trainer
from scdlkit.foundation import load_scgpt_model, prepare_scgpt_data

# Tokenize the AnnData for the whole-human checkpoint
prepared = prepare_scgpt_data(
    adata,
    checkpoint="whole-human",
    label_key="louvain",
    batch_size=64,
)

# Load the checkpoint and wrap it in a Trainer; no training happens here
model = load_scgpt_model("whole-human", device="auto")
trainer = Trainer(
    model=model,
    task="representation",
    batch_size=prepared.batch_size,
    device="auto",
)

# Embed every cell and store the result next to other representations in obsm
predictions = trainer.predict_dataset(prepared.dataset)
adata.obsm["X_scgpt_whole_human"] = predictions["latent"]
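
Once the embedding is stored in `adata.obsm`, it can be compared with a PCA baseline using ordinary Scanpy calls. The sketch below assumes Scanpy is installed and that `adata.X` is already normalized and log-transformed; the helper names `umap_for_representation` and `compare_with_pca` are hypothetical and not part of scDLKit.

```python
def umap_for_representation(adata, use_rep: str, tag: str) -> None:
    """Build a neighbor graph and a UMAP from one embedding in adata.obsm.

    Results are stored under tag-specific keys so several representations
    can sit side by side in the same AnnData object.
    """
    import scanpy as sc  # imported lazily so the helper can be defined without Scanpy

    sc.pp.neighbors(adata, use_rep=use_rep, key_added=tag)
    sc.tl.umap(adata, neighbors_key=tag)
    # By default sc.tl.umap writes to obsm["X_umap"]; copy it to a tagged key.
    adata.obsm[f"X_umap_{tag}"] = adata.obsm["X_umap"].copy()


def compare_with_pca(adata, scgpt_key: str = "X_scgpt_whole_human") -> None:
    """Compute UMAPs for a PCA baseline and for the frozen scGPT embedding."""
    import scanpy as sc

    if "X_pca" not in adata.obsm:
        # Assumes adata.X is already normalized and log-transformed.
        sc.pp.pca(adata)
    umap_for_representation(adata, use_rep="X_pca", tag="pca")
    umap_for_representation(adata, use_rep=scgpt_key, tag="scgpt")
```

After `compare_with_pca(adata)`, the two layouts are available as `adata.obsm["X_umap_pca"]` and `adata.obsm["X_umap_scgpt"]` and can be plotted with the same label coloring.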

Annotation fine-tuning API

from scdlkit import Trainer
from scdlkit.foundation import (
    load_scgpt_annotation_model,
    prepare_scgpt_data,
    split_scgpt_data,
)

# Tokenize the labeled AnnData, then hold out validation and test splits
prepared = prepare_scgpt_data(
    adata,
    checkpoint="whole-human",
    label_key="louvain",
    batch_size=64,
)
split = split_scgpt_data(prepared, val_size=0.15, test_size=0.15, random_state=42)

# Attach a classification head and select the tuning strategy
model = load_scgpt_annotation_model(
    num_classes=len(prepared.label_categories or ()),
    checkpoint="whole-human",
    tuning_strategy="lora",
    label_categories=prepared.label_categories,
    device="auto",
)

trainer = Trainer(
    model=model,
    task="classification",
    batch_size=prepared.batch_size,
    device="auto",
    epochs=8,
)
trainer.fit(split.train, split.val)

# Predictions on the held-out test split, for evaluation
test_predictions = trainer.predict_dataset(split.test)

# Embed the full dataset so the rows line up with adata.n_obs
predictions = trainer.predict_dataset(prepared.dataset)
adata.obsm["X_scgpt_lora"] = predictions["latent"]
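
The `val_size`, `test_size`, and `random_state` arguments describe a reproducible three-way partition of the cells. The sketch below shows one plain way such a partition can work on cell indices; it is not the implementation of `split_scgpt_data`, which may differ (for example by stratifying on labels), and `three_way_split` is a hypothetical helper.

```python
import random
from typing import List, Tuple


def three_way_split(
    n_cells: int,
    val_size: float = 0.15,
    test_size: float = 0.15,
    random_state: int = 42,
) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle cell indices reproducibly and cut them into train/val/test."""
    if not 0 <= val_size + test_size < 1:
        raise ValueError("val_size + test_size must be in [0, 1)")
    indices = list(range(n_cells))
    # A dedicated Random instance keeps the shuffle independent of global state.
    random.Random(random_state).shuffle(indices)
    n_val = int(round(n_cells * val_size))
    n_test = int(round(n_cells * test_size))
    val = indices[:n_val]
    test = indices[n_val:n_val + n_test]
    train = indices[n_val + n_test:]
    return train, val, test


train, val, test = three_way_split(1000)
print(len(train), len(val), len(test))  # 700 150 150
```

With these sizes the test split covers only 15% of the cells, so anything predicted on it has fewer rows than the full AnnData object.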

Wrapper class for advanced convenience

If you want the easy path but still need explicit control over the fitted object, use the top-level runner alias directly:

from scdlkit import AnnotationRunner

runner = AnnotationRunner(label_key="cell_type", output_dir="artifacts/scgpt_annotation")
runner.inspect(adata)         # preflight check on the labeled dataset
runner.fit_compare(adata)     # fit and compare the candidate strategies
runner.annotate_adata(adata)  # annotate with the best fitted strategy
runner.save("artifacts/scgpt_annotation/best_model")

The lower-level Trainer path and scdlkit.foundation helpers remain the advanced public surface underneath the wrapper.

When to use each strategy

- frozen linear probe:
  - best first question
  - tells you whether the checkpoint already separates your labels
- head-only tuning:
  - cheapest trainable path
  - use when the frozen probe is useful but not good enough
- full fine-tuning:
  - unconstrained reference baseline
  - use when you want to measure the cost of training the whole backbone
- LoRA tuning:
  - good first PEFT baseline
  - use when you want more flexibility than a frozen backbone plus head
- adapter tuning:
  - parameter-efficient residual bottleneck path
  - use when you want a lightweight trainable module without low-rank updates
- prefix tuning:
  - prompt-like trainable prefix path
  - use when you want to bias the transformer layers without unfreezing the backbone
- IA3 tuning:
  - multiplicative activation-scaling path
  - use when you want a very small trainable parameter footprint
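
The difference in trainable footprint between these strategies can be made concrete with the textbook parameter counts for a single layer. The sketch below uses the standard formulas for each method; the hidden size, LoRA rank, adapter bottleneck, and prefix length are illustrative assumptions, not scGPT's or scDLKit's actual configuration, and the way scDLKit wires each method into the backbone is not described on this page.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Weights plus bias of one dense layer trained in full."""
    return d_in * d_out + d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Two low-rank factors: (d_in x rank) and (rank x d_out)."""
    return rank * (d_in + d_out)


def adapter_params(d_model: int, bottleneck: int) -> int:
    """Down-projection and up-projection of a bottleneck adapter, with biases."""
    return 2 * d_model * bottleneck + bottleneck + d_model


def prefix_params(d_model: int, prefix_len: int) -> int:
    """Trainable key and value prefix vectors for one attention layer."""
    return 2 * prefix_len * d_model


def ia3_params(d_out: int) -> int:
    """One learned scaling vector applied to a layer's activations."""
    return d_out


# Illustrative sizes only (assumed, not taken from any checkpoint).
d = 512
counts = {
    "full": full_params(d, d),
    "lora (rank 8)": lora_params(d, d, rank=8),
    "adapter (bottleneck 64)": adapter_params(d, bottleneck=64),
    "prefix (16 tokens)": prefix_params(d, prefix_len=16),
    "ia3": ia3_params(d),
}
for name, n in counts.items():
    print(f"{name:<26}{n:>8}")
```

Under these assumed sizes the counts span roughly three orders of magnitude, from 512 for IA3 to 262,656 for the fully trained layer, which is why the cheaper methods are worth trying before a full fine-tune.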

Current limitations

- input preparation is a separate tokenization pipeline, not prepare_data(...)
- only the official whole-human checkpoint is supported
- user datasets must have labels for annotation fine-tuning
- gene overlap with the checkpoint vocabulary still gates compatibility
- scGPT is the only model currently implemented
- this release does not claim perturbation, spatial, or multimodal support

Tutorials

- frozen embeddings: Experimental scGPT PBMC embeddings
- annotation tuning: Experimental scGPT cell-type annotation
- dataset-specific wrapper workflow: Experimental scGPT dataset-specific annotation
- beyond-PBMC wrapper workflow: Main annotation tutorial: human-pancreas wrapper workflow
- benchmark framing: Annotation benchmarks
- user-data guide: Experimental annotation on your data

Experimental scope

This feature should be treated as experimental.

The goal is not to claim that foundation models always beat classical baselines. The goal is to give users a reproducible, Scanpy-compatible workflow to compare:

- PCA + logistic regression
- frozen scGPT linear probe
- head-only tuning
- full fine-tuning
- LoRA
- adapters
- prefix tuning
- IA3

That comparison story is the main product value of the current foundation release line.

The current beyond-PBMC evidence phase uses OpenProblems human pancreas to show the same wrapper-first workflow on a second labeled human dataset.

Treat scdlkit.foundation as the explicit lower-level experimental namespace that sits underneath the easier top-level beginner aliases.