Identifying gene expression programs with scRNA-Seq/NMF
Sabeti Lab (Harvard Systems Biology), HMS Identifying gene expression programs of cell-type identity and cellular activity with single-cell RNA-Seq
Abstract: While matrix factorizations such as PCA or ICA are commonly used for dimensionality reduction of single-cell RNA-Seq data, the dimensions they infer do not necessarily align with biologically meaningful gene expression programs and are frequently ignored in practice. Here, I will discuss analysis of real and simulated single-cell data showing that matrix factorization can yield components corresponding to cell types and to cellular activities such as life-cycle processes or responses to environmental stimuli. However, one limitation of many matrix factorizations is that their stochastic optimization algorithms can yield different solutions when run multiple times on the same dataset, which reduces the interpretability of the result. To address this limitation, we developed a meta-analysis approach, which we call consensus matrix factorization, that averages over multiple replicates to increase the robustness of the solution. We show with simulated data that the consensus implementation of NMF (cNMF), in particular, outperforms several other factorizations at inferring cell-type and activity programs, including the relative contribution of each program in each cell. Applied to published brain organoid and visual cortex single-cell RNA-Seq datasets, cNMF refines the hierarchy of cell types and identifies both expected (e.g., cell-cycle and hypoxia) and intriguing novel activity programs. We make cNMF available to the community and illustrate how this approach can provide key insights into gene expression variation within and between cell types.
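The consensus idea described in the abstract can be illustrated with a minimal sketch. This is not the released cNMF implementation: the function names (`nmf`, `align_to_reference`, `consensus_nmf`), the greedy cosine-similarity alignment, and the use of a median as the "average" are all assumptions made here for illustration. The sketch runs a basic NMF from several random seeds, matches each replicate's programs to those of the first replicate, combines the matched programs, and then refits per-cell usages with the programs held fixed.

```python
import numpy as np

def nmf(X, k, seed, n_iter=300, eps=1e-10):
    """Basic NMF (multiplicative updates, squared error): X ~ W @ H, W, H >= 0."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], k)) + eps   # cells x programs (usages)
    H = rng.random((k, X.shape[1])) + eps   # programs x genes
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

def unit_rows(H, eps=1e-10):
    """Scale each program (row) to unit Euclidean norm."""
    return H / (np.linalg.norm(H, axis=1, keepdims=True) + eps)

def align_to_reference(H, H_ref):
    """Greedily reorder rows of H to match rows of H_ref by cosine similarity."""
    sim = unit_rows(H_ref) @ unit_rows(H).T   # sim[i, j]: ref program i vs program j
    order = np.full(H.shape[0], -1)
    for _ in range(H.shape[0]):
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        order[i] = j
        sim[i, :] = -np.inf   # reference program i is now matched
        sim[:, j] = -np.inf   # program j is now used
    return H[order]

def consensus_nmf(X, k, n_replicates=10, n_iter=300, eps=1e-10):
    """Hypothetical consensus: median of aligned programs over replicates."""
    runs = [unit_rows(nmf(X, k, seed=s, n_iter=n_iter)[1])
            for s in range(n_replicates)]
    aligned = np.stack([align_to_reference(H, runs[0]) for H in runs])
    H_cons = np.median(aligned, axis=0)
    # Refit usages with the consensus programs held fixed.
    rng = np.random.default_rng(0)
    W = rng.random((X.shape[0], k)) + eps
    for _ in range(n_iter):
        W *= (X @ H_cons.T) / (W @ H_cons @ H_cons.T + eps)
    return W, H_cons
```

Calling `consensus_nmf(X, k)` on a non-negative cells-by-genes matrix `X` returns per-cell usages and consensus programs. Aligning to the first replicate keeps the sketch short but makes the result depend on that replicate; clustering the pooled programs across all replicates would remove this dependence at the cost of more code.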
Aleksandrina Goeva
Macosko Lab Primer: Intro to non-negative matrix factorization
Abstract: Dimensionality reduction is essential for extracting generalizable knowledge from noisy, high-dimensional data. While singular value decomposition (SVD, PCA) is optimal with respect to minimizing data movement, the resulting features are often not interpretable or robust across experiments. Non-negative matrix factorization (NMF) is a powerful alternative that may be applied when the data is non-negative (e.g., counts or concentrations of biological molecules!). In this primer, we will formulate an NMF objective function and optimization algorithm, paying special attention to practical challenges that Dylan will explore in the main talk. We will discuss biomedical applications of NMF, including to spatially-resolved RNA-seq data. Time permitting, we will survey familiar probabilistic models built on NMF, such as the topic models from last week.
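As a concrete illustration of an NMF objective and optimization algorithm, here is a minimal sketch assuming one common formulation: minimize the squared Frobenius error ||X - WH||^2 subject to W >= 0 and H >= 0, using multiplicative updates. The primer itself may use a different objective or algorithm; the function name `nmf_multiplicative` and its parameters are illustrative choices, not taken from the talk.

```python
import numpy as np

def nmf_multiplicative(X, k, n_iter=200, seed=0, eps=1e-10):
    """Factor non-negative X (n x m) as W @ H with W (n x k) >= 0, H (k x m) >= 0.

    Returns W, H, and the squared Frobenius error after each iteration.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    # Random non-negative initialization; different seeds can reach different solutions.
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    losses = []
    for _ in range(n_iter):
        # Elementwise multiplicative updates keep W and H non-negative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
        losses.append(np.linalg.norm(X - W @ H) ** 2)
    return W, H, losses
```

Because the updates only rescale entries by non-negative ratios, the factors never leave the non-negative orthant. The dependence on the random seed is one example of the kind of practical challenge the abstract alludes to.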