Consequences of training data composition for deep learning models in single-cell biology.

Abstract

Foundation models for single-cell transcriptomics have the potential to augment (or replace) purpose-built tools for a variety of common analyses, especially when data are sparse. Recent work with large language models has shown that training data composition greatly shapes performance; however, to date, single-cell foundation models have ignored this aspect, opting instead to train on the largest possible corpus. We systematically investigate the consequences of training dataset composition on the behavior of deep learning models of single-cell transcriptomics, focusing on human hematopoiesis as a tractable model system and including cells from adult and developing tissues, disease states, and perturbation atlases. We find that (1) these models generalize poorly to unseen cell types, (2) adding malignant cells to a healthy cell training corpus does not necessarily improve modeling of unseen malignant cells, and (3) including an embryonic stem cell differentiation atlas during training improves performance on out-of-distribution tasks. Our results emphasize the importance of diverse training data and suggest strategies to optimize future single-cell foundation models.

Year of Publication
2025
Journal
bioRxiv : the preprint server for biology
Date Published
02/2025
ISSN
2692-8205
DOI
10.1101/2025.02.19.639127
PubMed ID
40060416