Learning the "parts" of cells using topic models

Abhishek Sarkar
Computational Biology and Data Science group,
Vesalius Therapeutics
Primer: Clarifying confusion in scRNA-seq analysis

The high proportion of zeros in typical scRNA-seq datasets has led to widespread but inconsistent use of terminology such as "dropout" and "missing data". Here, we argue that much of this terminology is unhelpful and confusing, and outline simple ideas to help reduce confusion. These include: (1) observed scRNA-seq counts reflect both true gene expression levels and measurement error, and carefully distinguishing these contributions helps clarify thinking; and (2) method development should start with a Poisson measurement model, rather than more complex models, because it is simple and generally consistent with existing data. We outline how several existing methods can be viewed within this framework and highlight how these methods differ in their assumptions about expression variation. We also illustrate how our perspective helps address questions of biological interest, such as whether mRNA expression levels are multimodal among cells.
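
As a rough illustration of idea (2), the sketch below simulates counts for a single gene under a Poisson measurement model, with true expression varying among cells according to a gamma distribution. The function names, the gamma assumption, and all parameter values are hypothetical choices made for this example, not details from the talk. The point is that true expression is strictly positive in every cell, yet many observed counts are zero purely because of sampling, and the zero fraction matches the closed-form value implied by the model.

```python
import numpy as np

def simulate_counts(n_cells=2000, mean_expr=1e-4, dispersion=0.5,
                    depth=5000, seed=0):
    """Simulate one gene under a Poisson measurement model.

    Assumed (hypothetical) setup: true relative expression lam_i is gamma
    distributed with mean `mean_expr` and variance dispersion * mean_expr**2;
    the observed count is x_i ~ Poisson(depth * lam_i).
    """
    rng = np.random.default_rng(seed)
    shape = 1.0 / dispersion
    # Expression variation: strictly positive true expression levels.
    lam = rng.gamma(shape, scale=mean_expr / shape, size=n_cells)
    # Measurement error: Poisson sampling of molecules at a fixed depth.
    x = rng.poisson(depth * lam)
    return lam, x

def expected_zero_fraction(mean_expr, dispersion, depth):
    """P(x = 0) for the gamma-Poisson (negative binomial) mixture above."""
    mu = depth * mean_expr
    return (1.0 + dispersion * mu) ** (-1.0 / dispersion)

if __name__ == "__main__":
    lam, x = simulate_counts()
    print("cells with zero true expression:", int((lam == 0).sum()))
    print("observed zero fraction:", (x == 0).mean())
    print("model-implied zero fraction:",
          expected_zero_fraction(1e-4, 0.5, 5000))
```

Keeping the expression model (here, gamma) separate from the measurement model (Poisson) is what allows the zeros to be explained without a separate "dropout" mechanism in this toy setting.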

Peter Carbonetto
Dept. of Human Genetics, and the Research Computing Center,
University of Chicago
Meeting: Learning the "parts" of cells using topic models

Methods for learning reduced representations of data, such as PCA and t-SNE, have become essential to single-cell genomics studies. Beyond producing evocative visualizations of cell population structure, some of these methods have the potential to recover interpretable "parts" of cellular transcriptomes or epigenomes. Less well appreciated is the fact that the topic model, originally developed to analyze collections of text documents, is also well suited to single-cell genomics data.

Here, we reconsider some basic principles behind the topic model, motivated by aims that differ from those of early topic modeling papers. Specifically, we would like to make accurate inferences from large data sets, and to extract biological insights from these inferences. We show that connecting the topic model to other models makes it better able to meet these aims. In particular, we borrow ideas from non-negative matrix factorization, the Structure model used in population genetics, and differential expression analysis. These borrowed ideas lead to faster and more accurate algorithms for fitting topic models, produce effective visualizations of complex cell structure from topic model fits, and suggest a new way to interpret topics.
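
One way to see the connection to non-negative matrix factorization is the sketch below, which fits a Poisson NMF to a count matrix and then rescales the factors into topic model parameters: per-cell topic proportions and per-topic gene frequencies. It uses plain multiplicative updates for the KL divergence, chosen here only for brevity; this is not the faster algorithm referred to in the abstract, and the function names `poisson_nmf` and `to_topic_model` are hypothetical.

```python
import numpy as np

def poisson_nmf(X, k, n_iter=200, seed=0, eps=1e-12):
    """Fit X ~ Poisson(L @ F.T) with multiplicative (KL divergence) updates.

    X is an n_cells x n_genes count matrix; L is n_cells x k and
    F is n_genes x k, both non-negative.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    L = rng.uniform(0.1, 1.0, size=(n, k))
    F = rng.uniform(0.1, 1.0, size=(m, k))
    for _ in range(n_iter):
        R = X / (L @ F.T + eps)              # ratio of data to fitted rates
        L *= (R @ F) / (F.sum(axis=0) + eps)
        R = X / (L @ F.T + eps)
        F *= (R.T @ L) / (L.sum(axis=0) + eps)
    return L, F

def to_topic_model(L, F, eps=1e-12):
    """Rescale Poisson NMF factors into topic model parameters."""
    u = F.sum(axis=0)                        # scale of each topic
    freqs = F / (u + eps)                    # gene frequencies; columns sum to 1
    L_scaled = L * u                         # move the scale into the loadings
    props = L_scaled / (L_scaled.sum(axis=1, keepdims=True) + eps)
    return props, freqs                      # rows of props sum to 1

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.poisson(5.0, size=(30, 20)).astype(float)  # toy count matrix
    L, F = poisson_nmf(X, k=3)
    props, freqs = to_topic_model(L, F)
    print(props[:3].round(3))
```

Each row of `props` gives one cell's membership across topics, and each column of `freqs` gives the relative expression of genes within a topic, which is the sense in which topics can be read as "parts".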
