Robust nonlinear manifold learning for single cell RNA-seq data/Experimental design for maximizing cell type discovery


Engelhardt Lab, Princeton CS and Quantitative and Comp. Bio.
Experimental design for maximizing cell type discovery in single-cell RNA-seq data

Abstract:  Bandit algorithms are often the tool of choice for recommendation engines and have recently been applied to health care data. Here, inspired by bandit ideas, we show a novel application to iterative experimental design for multi-tissue single-cell RNA-seq (scRNA-seq) data. We present two algorithms: a Good-Toulmin-like estimator via Thompson sampling (joint work with Karen Feng and Barbara Engelhardt) and an extension involving a Pitman-Yor prior (joint work with Federico Ferrari and Stefano Favaro). Given a budget, both algorithms model cell type information across tissues and estimate how many cells to sample from each tissue, with the goal of maximizing cell type discovery across samples from multiple iterations. On both real and simulated data, we demonstrate the advantages these algorithms provide in data collection planning compared to a random sampling strategy that uses no experimental design.
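
To make the bandit framing concrete, the following is a minimal sketch of Thompson sampling for tissue selection. It is not the Good-Toulmin-like estimator or the Pitman-Yor extension described in the abstract: it assumes a much simpler Beta-Bernoulli model in which each tissue has an unknown probability that the next sampled cell belongs to a previously unseen cell type. The function name `thompson_discovery`, the toy tissue compositions, and the one-cell-per-round budget are all hypothetical choices made for illustration.

```python
import numpy as np

def thompson_discovery(tissue_probs, budget, seed=0):
    """Spend `budget` single-cell draws across tissues via Thompson sampling.

    tissue_probs[k][j] is the (hypothetical, normally unknown) probability
    that a cell drawn from tissue k has global cell type j.
    Returns the set of discovered cell types and the per-tissue draw counts.
    """
    rng = np.random.default_rng(seed)
    n_tissues = len(tissue_probs)
    # Assumption: Beta(1, 1) prior on "the next cell from tissue k is a new type".
    successes = np.ones(n_tissues)
    failures = np.ones(n_tissues)
    seen = set()
    counts = np.zeros(n_tissues, dtype=int)
    for _ in range(budget):
        # Draw a plausible discovery rate per tissue, then pick the best draw.
        theta = rng.beta(successes, failures)
        k = int(np.argmax(theta))
        cell_type = int(rng.choice(len(tissue_probs[k]), p=tissue_probs[k]))
        counts[k] += 1
        if cell_type in seen:
            failures[k] += 1   # no discovery: tissue k looks less promising
        else:
            seen.add(cell_type)
            successes[k] += 1  # discovery: tissue k looks more promising
    return seen, counts

# Hypothetical toy setup: 3 tissues over 12 global cell types.
probs = [np.zeros(12) for _ in range(3)]
probs[0][:2] = 0.5       # tissue 0 contains two types
probs[1][2:6] = 0.25     # tissue 1 contains four types
probs[2][4:12] = 0.125   # tissue 2 contains eight types, two shared with tissue 1
seen, counts = thompson_discovery(probs, budget=60)
print(len(seen), counts)
```

A real discovery rate falls as a tissue is exhausted, which this stationary Beta-Bernoulli model ignores; an unseen-species estimator of the kind named in the abstract is one way to account for that.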

Archit Verma
Engelhardt Lab, Princeton CS and Quantitative and Comp. Bio.
Primer: Robust nonlinear manifold learning for single cell RNA-seq data

Abstract:  Analysis of single-cell RNA sequencing (scRNA-seq) experiments requires dimension reduction for regularization and efficiency. We present a nonlinear latent variable model with robust, heavy-tailed error modeling and adaptive kernel learning to capture low-dimensional nonlinear structure in scRNA-seq data. Gene expression is modeled as a noisy high-dimensional draw from a Gaussian process over latent positions, a formulation known as the Gaussian Process Latent Variable Model (GPLVM). We model residual errors with a heavy-tailed Student's t-distribution to control for observed technical and biological noise. We compare our approach to common dimension reduction tools to highlight our model's ability to enable important downstream tasks, including clustering and inferring cell developmental trajectories, on available experimental data. We show that our robust nonlinear manifold is well suited to raw, unfiltered gene counts from high-throughput sequencing technologies for visualization and exploration of cell states.
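
The generative story in the abstract can be sketched as follows. This is only an illustration under stated assumptions, not the model from the talk: it uses a fixed RBF kernel rather than adaptive kernel learning, draws continuous values rather than counts, and performs no inference over the latent positions. The function names and all parameter values (`lengthscale`, `df`, `noise_scale`, the jitter) are hypothetical.

```python
import numpy as np

def rbf_kernel(X, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel between all pairs of rows of X."""
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return variance * np.exp(-0.5 * np.maximum(sq_dists, 0.0) / lengthscale**2)

def sample_robust_gplvm(n_cells=50, n_genes=20, latent_dim=2,
                        df=3.0, noise_scale=0.1, seed=0):
    """Draw synthetic expression Y (cells x genes) from a GPLVM-style model."""
    rng = np.random.default_rng(seed)
    # Low-dimensional latent position for each cell.
    X = rng.standard_normal((n_cells, latent_dim))
    # GP covariance across cells; small jitter keeps the Cholesky factor stable.
    K = rbf_kernel(X) + 1e-6 * np.eye(n_cells)
    L = np.linalg.cholesky(K)
    # Each gene is an independent GP draw over the latent positions.
    F = L @ rng.standard_normal((n_cells, n_genes))
    # Heavy-tailed residuals: Student's t noise instead of Gaussian noise.
    noise = noise_scale * rng.standard_t(df, size=(n_cells, n_genes))
    return X, F + noise

X, Y = sample_robust_gplvm()
print(X.shape, Y.shape)
```

With a small `df`, occasional large residuals appear in `Y`; a Student's t likelihood lets such outliers be absorbed by the noise term rather than distorting the latent positions.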