Data preparation#
What it is#
Status: stable.
This page documents the lower-level preprocessing helpers that turn AnnData
into split objects for Trainer and related workflows:
- prepare_data(...)
- transform_adata(...)
When to use it#
Use these helpers when:
- you are building on Trainer directly
- you need explicit control over labels, batches, or split fractions
- you want to transform a second AnnData with the same fitted preprocessing metadata
Minimal example#
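from scdlkit import prepare_data
from scdlkit.data import transform_adata

# adata and other_adata are existing anndata.AnnData objects.
# Fit preprocessing on the first dataset and build the train/validation/test splits.
prepared = prepare_data(adata, label_key="louvain", normalize=True, log1p=True)

# Reuse the fitted preprocessing metadata and encoders on the second dataset.
held_out = transform_adata(
    other_adata,
    prepared.preprocessing,
    label_encoder=prepared.label_encoder,
    batch_encoder=prepared.batch_encoder,
)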
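The normalize and log1p flags used in the call above are not defined further on this page. The sketch below shows what such flags conventionally mean in single-cell workflows: per-cell library-size normalization followed by log(1 + x). It is an illustration under that assumption, not scdlkit's implementation; the function name normalize_log1p and the target_sum parameter are hypothetical.

```python
import math


def normalize_log1p(counts, target_sum=1e4):
    """Scale each row (cell) to `target_sum` total counts, then apply log(1 + x).

    Assumption: this mirrors the common library-size normalization recipe;
    the page does not state which target sum scdlkit uses.
    """
    transformed = []
    for row in counts:
        total = sum(row)
        # Cells with no counts are left as all zeros instead of dividing by zero.
        scale = target_sum / total if total > 0 else 0.0
        transformed.append([math.log1p(value * scale) for value in row])
    return transformed


# Three toy cells with two genes each.
counts = [[1, 3], [0, 0], [5, 5]]
transformed = normalize_log1p(counts, target_sum=4)
```

After normalization, the two genes of the third cell have equal values, and the empty second cell stays at zero.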
Parameters#
- prepare_data(...) controls matrix selection and preprocessing through layer, use_hvg, normalize, log1p, and scale.
- label_key and batch_key define optional encoded supervision and batch metadata from adata.obs.
- val_size, test_size, batch_aware_split, and random_state define split behavior.
- transform_adata(...) expects the stored preprocessing metadata plus optional fitted label and batch encoders.
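To make the split parameters concrete, here is a minimal sketch of how fractions such as val_size and test_size, together with a random_state seed, can be turned into reproducible row indices. It assumes both sizes are fractions of the total number of observations; the helper split_indices is hypothetical, and batch_aware_split is deliberately not modelled because its exact behavior is not described on this page.

```python
import random


def split_indices(n_obs, val_size=0.1, test_size=0.1, random_state=0):
    """Shuffle row indices and cut them into train/validation/test lists.

    Hypothetical helper: it assumes val_size and test_size are fractions
    of n_obs, which the page does not state explicitly.
    """
    if val_size < 0 or test_size < 0 or val_size + test_size >= 1:
        raise ValueError("val_size and test_size must be non-negative and sum to less than 1")
    indices = list(range(n_obs))
    # A seeded generator makes the shuffle, and therefore the split, reproducible.
    random.Random(random_state).shuffle(indices)
    n_val = int(round(n_obs * val_size))
    n_test = int(round(n_obs * test_size))
    val = indices[:n_val]
    test = indices[n_val:n_val + n_test]
    train = indices[n_val + n_test:]
    return train, val, test


train_idx, val_idx, test_idx = split_indices(100, val_size=0.15, test_size=0.15, random_state=42)
```

Calling the helper again with the same random_state returns the same three lists, which is the property a fixed seed is meant to provide.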
Input expectations#
- Input must be an anndata.AnnData object with features in var_names.
- If label_key or batch_key is provided, the column must exist in adata.obs.
- The transformed dataset must contain the same feature names as the fitted preprocessing metadata.
- Scanpy-backed operations require the scanpy extra.
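These expectations can be checked before calling the helpers. The sketch below works on plain lists so that it runs without any dependencies. The function check_inputs and its arguments are hypothetical and are not part of scdlkit.

```python
def check_inputs(obs_columns, var_names, label_key=None, batch_key=None, fitted_features=None):
    """Return a list of problems; an empty list means the input looks usable.

    Hypothetical pre-flight check that mirrors the expectations listed above.
    """
    problems = []
    # Any key that is provided must name an existing obs column.
    for role, key in (("label_key", label_key), ("batch_key", batch_key)):
        if key is not None and key not in obs_columns:
            problems.append(f"{role}={key!r} is not a column in obs")
    # A dataset to be transformed must carry every fitted feature name.
    if fitted_features is not None:
        available = set(var_names)
        missing = [name for name in fitted_features if name not in available]
        if missing:
            problems.append(f"missing fitted features: {missing}")
    return problems


problems = check_inputs(
    obs_columns=["louvain", "n_genes"],
    var_names=["g1", "g2", "g3"],
    label_key="louvain",
    batch_key="batch",
    fitted_features=["g1", "g4"],
)
```

With a real AnnData object, the first two arguments would be list(adata.obs.columns) and list(adata.var_names).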
Returns / outputs#
- prepare_data(...) returns PreparedData with train, validation, and test SplitData plus preprocessing metadata.
- transform_adata(...) returns a transformed SplitData that can be passed to Trainer.predict_dataset(...).
Failure modes / raises#
- ValueError if label_key, batch_key, or the selected layer is missing.
- ValueError if transformed data is missing required features or contains unseen labels.
- ImportError if Scanpy-backed preprocessing is requested without scdlkit[scanpy].
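The unseen-label failure is easiest to see with a toy encoder. The class below is a stand-in written for illustration, not the encoder that scdlkit stores on PreparedData. It assumes that labels are mapped to integer codes at fit time and that any label outside that mapping is rejected.

```python
class SimpleLabelEncoder:
    """Illustrative stand-in for a fitted label encoder (not scdlkit's class)."""

    def fit(self, labels):
        # Sort so that the label-to-code mapping is deterministic.
        self.classes_ = sorted(set(labels))
        self._index = {label: code for code, label in enumerate(self.classes_)}
        return self

    def transform(self, labels):
        unseen = sorted(set(labels) - set(self._index))
        if unseen:
            # Mirrors the documented behavior: unseen labels are an error, not a new class.
            raise ValueError(f"unseen labels: {unseen}")
        return [self._index[label] for label in labels]


encoder = SimpleLabelEncoder().fit(["B cell", "T cell", "NK"])
codes = encoder.transform(["T cell", "NK"])

try:
    encoder.transform(["T cell", "Monocyte"])
except ValueError as err:
    message = str(err)
```

In practice, this means a held-out dataset should be filtered or relabelled to the fitted label set before it is transformed.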
Notes / caveats#
- These helpers are the stable lower-level path behind TaskRunner.
- They are not the scGPT tokenization entrypoint; use Experimental foundation helpers for that path.
- transform_adata(...) applies the previously stored feature order and optional scaler before inference.
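Applying a stored feature order and an optional scaler can be sketched as follows. The function align_and_scale is hypothetical, and the sketch makes two assumptions the page does not confirm: the scaler is a per-feature mean and standard deviation, and extra features in the new dataset are dropped.

```python
def align_and_scale(rows, var_names, fitted_features, means=None, stds=None):
    """Reorder columns to the fitted feature order, then optionally standardize.

    Hypothetical sketch; assumes a mean/std scaler and drops extra features.
    """
    column_of = {name: j for j, name in enumerate(var_names)}
    missing = [name for name in fitted_features if name not in column_of]
    if missing:
        raise ValueError(f"missing required features: {missing}")
    # Select columns in exactly the order seen at fit time.
    order = [column_of[name] for name in fitted_features]
    aligned = [[row[j] for j in order] for row in rows]
    if means is None or stds is None:
        return aligned
    # Standardize each feature; constant features (std of 0) map to 0.0.
    return [
        [(value - m) / s if s > 0 else 0.0 for value, m, s in zip(row, means, stds)]
        for row in aligned
    ]


rows = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
scaled = align_and_scale(
    rows,
    var_names=["g3", "g1", "g2"],
    fitted_features=["g1", "g2"],
    means=[2.0, 3.0],
    stds=[1.0, 3.0],
)
```

Reusing the fit-time order and statistics keeps each column of the new dataset in the position and scale the trained model expects.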