# Foundation Models
scDLKit now includes an experimental scGPT path for human scRNA-seq workflows.
The public scope is still deliberately narrow:

- official `whole-human` checkpoint only
- human single-cell RNA only
- wrapper-first helpers for beginners, with `Trainer` plus `scdlkit.foundation` helpers underneath
- frozen embeddings remain supported
- experimental cell-type annotation fine-tuning is now supported through:
  - frozen linear probe
  - head-only tuning
  - full fine-tuning
  - LoRA tuning
  - adapter tuning
  - prefix tuning
  - IA3 tuning
- no `TaskRunner` support yet
## Install
```bash
python -m pip install "scdlkit[foundation,tutorials]"
```
## What this path is for
Use the experimental foundation path when you want to:

- extract frozen cell embeddings from an official scGPT checkpoint
- compare those embeddings against PCA and scDLKit baselines
- fine-tune scGPT for a labeled cell-type annotation task
- decide whether your dataset needs:
  - only frozen embeddings
  - a trainable classification head
  - a full-backbone baseline
  - one of the available PEFT methods
This is the bridge between the baseline toolkit and later foundation-model adaptation work.
## Easiest wrapper-first path
If you want the smallest amount of code, start with the top-level experimental alias:
```python
from scdlkit import adapt_annotation

runner = adapt_annotation(
    adata,
    label_key="cell_type",
    output_dir="artifacts/scgpt_annotation",
)
runner.annotate_adata(adata)
runner.save("artifacts/scgpt_annotation/best_model")
```
This wrapper:

- inspects the labeled dataset
- compares frozen probe and head-only tuning by default
- keeps the best fitted strategy in memory
- writes standard report artifacts
- makes it easy to annotate `AnnData` and save the best fitted runner
The heavier annotation benchmark matrix extends to `full_finetune`, `lora`, `adapter`, `prefix_tuning`, and `ia3`.
Those heavier strategies are intentionally not part of the default docs quickstart.
## Inspect before training
For user-supplied datasets, inspect first through the top-level beginner alias:
```python
from scdlkit import inspect_annotation_data

report = inspect_annotation_data(
    adata,
    label_key="cell_type",
    checkpoint="whole-human",
)
```
This is the recommended preflight step when you want to know whether gene overlap or class balance is likely to make the adaptation path brittle.
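The class-balance half of that preflight question can be sketched in plain Python. The label values below are made-up stand-ins for something like `adata.obs["cell_type"]`, not tied to any scDLKit API:

```python
# Illustrative class-balance check of the kind the preflight report answers.
from collections import Counter

labels = ["T cell", "T cell", "T cell", "B cell", "B cell", "NK cell"]
counts = Counter(labels)
total = sum(counts.values())
for cell_type, n in counts.most_common():
    print(f"{cell_type}: {n} cells ({n / total:.0%})")
# Very small classes are what make probes and head-only tuning brittle.
print("smallest class size:", min(counts.values()))
```

A badly skewed split like this is exactly the situation the inspection report is meant to flag before you spend compute on tuning.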
## Frozen embedding API
```python
from scdlkit import Trainer
from scdlkit.foundation import load_scgpt_model, prepare_scgpt_data

prepared = prepare_scgpt_data(
    adata,
    checkpoint="whole-human",
    label_key="louvain",
    batch_size=64,
)
model = load_scgpt_model("whole-human", device="auto")
trainer = Trainer(
    model=model,
    task="representation",
    batch_size=prepared.batch_size,
    device="auto",
)
predictions = trainer.predict_dataset(prepared.dataset)
adata.obsm["X_scgpt_whole_human"] = predictions["latent"]
```
## Annotation fine-tuning API
```python
from scdlkit import Trainer
from scdlkit.foundation import (
    load_scgpt_annotation_model,
    prepare_scgpt_data,
    split_scgpt_data,
)

prepared = prepare_scgpt_data(
    adata,
    checkpoint="whole-human",
    label_key="louvain",
    batch_size=64,
)
split = split_scgpt_data(prepared, val_size=0.15, test_size=0.15, random_state=42)
model = load_scgpt_annotation_model(
    num_classes=len(prepared.label_categories or ()),
    checkpoint="whole-human",
    tuning_strategy="lora",
    label_categories=prepared.label_categories,
    device="auto",
)
trainer = Trainer(
    model=model,
    task="classification",
    batch_size=prepared.batch_size,
    device="auto",
    epochs=8,
)
trainer.fit(split.train, split.val)
predictions = trainer.predict_dataset(split.test)
adata.obsm["X_scgpt_lora"] = predictions["latent"]
```
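To turn integer class predictions back into readable labels, you can index into the categories captured at preparation time. The arrays below are illustrative, and the exact key holding class indices in the `Trainer`'s predictions dict may differ from this sketch:

```python
# Hedged sketch: mapping integer class predictions back to category names.
import numpy as np

label_categories = ["B cell", "NK cell", "T cell"]  # e.g. prepared.label_categories
class_indices = np.array([2, 0, 2, 1])              # e.g. argmax over class logits
predicted_names = np.asarray(label_categories)[class_indices]
print(predicted_names.tolist())  # ['T cell', 'B cell', 'T cell', 'NK cell']
```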
## Wrapper class for advanced convenience
If you want the easy path but still need explicit control over the fitted object, use the top-level runner alias directly:
```python
from scdlkit import AnnotationRunner

runner = AnnotationRunner(label_key="cell_type", output_dir="artifacts/scgpt_annotation")
runner.inspect(adata)
runner.fit_compare(adata)
runner.annotate_adata(adata)
runner.save("artifacts/scgpt_annotation/best_model")
```
The lower-level `Trainer` path and `scdlkit.foundation` helpers remain the advanced public surface underneath the wrapper.
## When to use each strategy
- frozen linear probe:
  - best first question
  - tells you whether the checkpoint already separates your labels
- head-only tuning:
  - cheapest trainable path
  - use when the frozen probe is useful but not good enough
- full fine-tuning:
  - unconstrained reference baseline
  - use when you want to measure the cost of training the whole backbone
- LoRA tuning:
  - good first PEFT baseline
  - use when you want more flexibility than a frozen backbone plus head
- adapter tuning:
  - parameter-efficient residual bottleneck path
  - use when you want a lightweight trainable module without low-rank updates
- prefix tuning:
  - prompt-like trainable prefix path
  - use when you want to bias the transformer layers without unfreezing the backbone
- IA3 tuning:
  - multiplicative activation-scaling path
  - use when you want a very small trainable parameter footprint
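The parameter-efficiency argument behind LoRA and similar methods can be made concrete with a back-of-envelope count. The hidden size below is illustrative, not an scDLKit or scGPT constant:

```python
# Why low-rank updates are cheap: for one d_in x d_out weight matrix, LoRA
# trains two factors B (d_in x r) and A (r x d_out) instead of the full matrix.
def full_params(d_in, d_out):
    return d_in * d_out

def lora_params(d_in, d_out, rank):
    return rank * (d_in + d_out)

d = 512  # illustrative transformer hidden size
print(full_params(d, d))     # 262144 trainable weights for full fine-tuning
print(lora_params(d, d, 8))  # 8192 trainable weights at rank 8
```

Per weight matrix, rank-8 LoRA here trains roughly 3% of the parameters that full fine-tuning touches, which is why it makes a good first PEFT baseline.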
## Current limitations
- input preparation is a separate tokenized pipeline, not `prepare_data(...)`
- only the official `whole-human` checkpoint is supported
- user datasets must have labels for annotation fine-tuning
- gene overlap with the checkpoint vocabulary still gates compatibility
- the current model implementation is still scGPT only
- this release does not claim perturbation, spatial, or multimodal support
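The gene-overlap gate can be checked ahead of time with a few lines of plain Python. The gene symbols and vocabulary below are made up for illustration; in practice you would use `adata.var_names` and the checkpoint's vocabulary:

```python
# Hedged sketch: fraction of dataset genes present in a checkpoint vocabulary.
def gene_overlap_fraction(dataset_genes, checkpoint_vocab):
    dataset = set(dataset_genes)
    if not dataset:
        return 0.0
    return len(dataset & set(checkpoint_vocab)) / len(dataset)

overlap = gene_overlap_fraction(
    ["CD3D", "CD8A", "MS4A1", "FAKE-GENE"],
    {"CD3D", "CD8A", "MS4A1", "NKG7"},
)
print(f"{overlap:.0%} of dataset genes found in vocabulary")  # 75%
```

A low overlap fraction is an early warning that the adaptation path may be brittle for your dataset.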
## Tutorials
- frozen embeddings: Experimental scGPT PBMC embeddings
- annotation tuning: Experimental scGPT cell-type annotation
- dataset-specific wrapper workflow: Experimental scGPT dataset-specific annotation
- beyond-PBMC wrapper workflow: Main annotation tutorial: human-pancreas wrapper workflow
- benchmark framing: Annotation benchmarks
- user-data guide: Experimental annotation on your data
## Experimental scope
This feature should be treated as experimental.
The goal is not to claim that foundation models always beat classical baselines. The goal is to give users a reproducible, Scanpy-compatible workflow to compare:

- PCA + logistic regression
- frozen scGPT linear probe
- head-only tuning
- full fine-tuning
- LoRA
- adapters
- prefix tuning
- IA3
That comparison story is the main product value of the current foundation release line.
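As a sketch of the classical end of that comparison, here is a minimal PCA + logistic regression baseline on synthetic data standing in for a small cells-by-genes expression matrix. It assumes only scikit-learn and NumPy; nothing here is scDLKit API:

```python
# Minimal classical baseline: PCA then logistic regression.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))     # 200 synthetic "cells" x 50 "genes"
X[:, 0] *= 4.0                     # one dominant axis of variation
y = (X[:, 0] > 0).astype(int)      # synthetic binary "cell type" labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
baseline = make_pipeline(PCA(n_components=10), LogisticRegression(max_iter=1000))
baseline.fit(X_tr, y_tr)
accuracy = baseline.score(X_te, y_te)
print(f"baseline accuracy: {accuracy:.2f}")
```

If a simple pipeline like this already matches the foundation-model strategies on your labels, the frozen probe result is telling you something important about your dataset.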
The current beyond-PBMC evidence phase uses OpenProblems human pancreas to show the same wrapper-first workflow on a second labeled human dataset.
Treat `scdlkit.foundation` as the explicit lower-level experimental namespace that sits underneath the easier top-level beginner aliases.