Data preparation#

What it is#

Status: stable.

This page documents the lower-level preprocessing helpers that turn AnnData into split objects for Trainer and related workflows:

prepare_data(...)
transform_adata(...)

When to use it#

Use these helpers when:

you are building on Trainer directly
you need explicit control over labels, batches, or split fractions
you want to transform a second AnnData with the same fitted preprocessing metadata

Minimal example#

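from scdlkit import prepare_data
from scdlkit.data import transform_adata

# adata and other_adata are AnnData objects loaded elsewhere.
# Fit the preprocessing on adata and split it into train, validation, and test.
prepared = prepare_data(adata, label_key="louvain", normalize=True, log1p=True)

# Apply the fitted preprocessing metadata and encoders to a second AnnData.
held_out = transform_adata(
    other_adata,
    prepared.preprocessing,
    label_encoder=prepared.label_encoder,
    batch_encoder=prepared.batch_encoder,
)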

Parameters#

prepare_data(...) controls matrix selection and preprocessing through layer, use_hvg, normalize, log1p, and scale.
label_key and batch_key define optional encoded supervision and batch metadata from adata.obs.
val_size, test_size, batch_aware_split, and random_state define split behavior.
transform_adata(...) expects the stored preprocessing metadata plus optional fitted label and batch encoders.
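
To make the split parameters concrete, here is a minimal sketch of how fractional val_size and test_size values, together with random_state, can yield reproducible train, validation, and test index sets. It is an illustration only, not scdlkit's implementation: split_indices is a hypothetical helper, the rounding rule is an assumption, and batch_aware_split is not modeled.

```python
import random


def split_indices(n_obs, val_size, test_size, random_state):
    """Hypothetical helper: shuffle 0..n_obs-1 and cut it into three index lists."""
    if val_size < 0 or test_size < 0 or val_size + test_size >= 1:
        raise ValueError("val_size and test_size must be non-negative and sum to less than 1")
    indices = list(range(n_obs))
    # A seeded generator makes the shuffle, and therefore the split, reproducible.
    random.Random(random_state).shuffle(indices)
    n_val = int(round(n_obs * val_size))
    n_test = int(round(n_obs * test_size))
    val = indices[:n_val]
    test = indices[n_val:n_val + n_test]
    train = indices[n_val + n_test:]
    return train, val, test


train, val, test = split_indices(100, val_size=0.1, test_size=0.2, random_state=42)
print(len(train), len(val), len(test))  # 70 10 20
```

Calling the helper again with the same random_state returns the same three lists, which is the property a fixed seed is meant to provide.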

Input expectations#

Input must be an anndata.AnnData object with features in var_names.
If label_key or batch_key is provided, the column must exist in adata.obs.
The transformed dataset must contain the same feature names as the fitted preprocessing metadata.
Scanpy-backed operations require the scanpy extra.
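
The expectations above can be checked before calling either helper. The sketch below is a hypothetical pre-flight check written against plain Python sequences, so it runs without anndata installed. find_input_problems is not part of scdlkit, and the messages it produces are illustrative rather than the library's own error text.

```python
def find_input_problems(obs_columns, var_names, fitted_var_names=None,
                        label_key=None, batch_key=None):
    """Hypothetical helper: list problems instead of raising on the first one."""
    problems = []
    # Each provided key must name an existing column of adata.obs.
    for name, key in (("label_key", label_key), ("batch_key", batch_key)):
        if key is not None and key not in obs_columns:
            problems.append(f"{name}={key!r} is not a column in adata.obs")
    # Every fitted feature name must be present in the dataset being transformed.
    if fitted_var_names is not None:
        available = set(var_names)
        missing = [name for name in fitted_var_names if name not in available]
        if missing:
            problems.append(f"missing features: {missing}")
    return problems


problems = find_input_problems(
    obs_columns=["louvain"],
    var_names=["GeneA", "GeneB"],
    fitted_var_names=["GeneA", "GeneC"],
    label_key="louvain",
    batch_key="batch",
)
for problem in problems:
    print(problem)
```

With a real AnnData object, the first two arguments would be adata.obs.columns and adata.var_names.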

Returns / outputs#

prepare_data(...) returns PreparedData with train, validation, and test SplitData plus preprocessing metadata.
transform_adata(...) returns a transformed SplitData that can be passed to Trainer.predict_dataset(...).

Failure modes / raises#

ValueError if label_key, batch_key, or the selected layer is missing.
ValueError if transformed data is missing required features or contains unseen labels.
ImportError if Scanpy-backed preprocessing is requested without scdlkit[scanpy].
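
The unseen-label failure is easier to reason about with a toy encoder. The following sketch assumes a label encoder that maps each label seen at fit time to an integer code. fit_label_encoder and encode_labels are hypothetical names, and scdlkit's encoder objects may be structured differently.

```python
def fit_label_encoder(labels):
    """Hypothetical helper: map each distinct label to an integer code."""
    return {label: code for code, label in enumerate(sorted(set(labels)))}


def encode_labels(labels, encoder):
    """Encode labels, rejecting any label the encoder was not fitted on."""
    unseen = sorted(set(labels) - set(encoder))
    if unseen:
        raise ValueError(f"unseen labels: {unseen}")
    return [encoder[label] for label in labels]


encoder = fit_label_encoder(["B cell", "T cell", "B cell"])
print(encode_labels(["T cell", "B cell"], encoder))  # [1, 0]

try:
    encode_labels(["NK cell"], encoder)
except ValueError as err:
    print(err)  # unseen labels: ['NK cell']
```

Under this assumption, a held-out dataset that introduces a new cell type cannot be encoded with the fitted mapping, so failing early is safer than silently assigning a code.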

Notes / caveats#

These helpers are the stable lower-level path behind TaskRunner.
They are not the scGPT tokenization entrypoint; use Experimental foundation helpers for that path.
transform_adata(...) applies the previously stored feature order and optional scaler before inference.
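
As a rough illustration of the last note, the sketch below reorders the columns of a small matrix to a stored feature order and then applies an optional per-feature scaler. It assumes the scaler is a standard (x - mean) / std transform with non-zero std. align_and_scale and its arguments are hypothetical, and the stored metadata in scdlkit may hold different fields.

```python
def align_and_scale(rows, var_names, fitted_var_names, mean=None, std=None):
    """Hypothetical helper: reorder columns to the fitted order, then scale."""
    position = {name: i for i, name in enumerate(var_names)}
    # Raises KeyError if a fitted feature is absent from var_names.
    order = [position[name] for name in fitted_var_names]
    aligned = [[row[i] for i in order] for row in rows]
    if mean is None or std is None:
        return aligned
    # Assumed scaler: per-feature standardization with the fitted statistics.
    return [[(x - m) / s for x, m, s in zip(row, mean, std)] for row in aligned]


rows = [[1.0, 4.0], [3.0, 8.0]]  # columns arrive as GeneB, GeneA
scaled = align_and_scale(
    rows,
    var_names=["GeneB", "GeneA"],
    fitted_var_names=["GeneA", "GeneB"],
    mean=[6.0, 2.0],
    std=[2.0, 1.0],
)
print(scaled)  # [[-1.0, -1.0], [1.0, 1.0]]
```

Reordering before scaling matters because the fitted statistics are positional: applying them to columns in a different order would scale each gene with another gene's mean and standard deviation.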
