Quantifying inter-sample variation in single-cell data

Laurie Rumker
Raychaudhuri Lab, Bioinformatics and Integrative Genomics Program, Harvard Medical School

Yakir Reshef
Raychaudhuri Lab, Harvard Medical School; Brigham and Women’s Hospital
Meeting: Quantifying axes of inter-sample variability among transcriptional neighborhoods in single-cell datasets

As single-cell datasets grow in sample size, there are increasing efforts to characterize cell states that vary across samples and associate with sample attributes such as disease status. However, it is not yet clear how best to summarize single-cell data at the per-sample level to enable comparison across samples. Prevailing approaches typically assume a transcriptional structure, such as a clustering of cells, and represent samples through that lens, e.g., by measuring the abundance of each cluster in each sample. This can be limiting because it assumes that the chosen structure matches the underlying biology. We will present co-varying neighborhood analysis (CNA), an alternative approach with greater granularity and flexibility. CNA characterizes dominant axes of variation across samples by identifying groups of small regions in transcriptional space, termed neighborhoods, that co-vary in abundance across samples, suggesting shared function or regulation. CNA can then test statistically for associations between any sample-level attribute and the abundances of these co-varying neighborhood groups. We will discuss simulation evidence that CNA identifies disease-associated cell states more sensitively and accurately than a cluster-based approach. We will then present three applications of CNA, which reveal a Notch activation signature in rheumatoid arthritis, heterogeneity among monocyte populations expanded in sepsis, and a novel T cell population associated with progression to active tuberculosis.
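For intuition, below is a minimal conceptual sketch in Python (numpy, scipy, scikit-learn) of the core idea: diffuse each sample's cell-membership indicator over a KNN graph to obtain a samples-by-neighborhoods abundance matrix, then relate its leading principal components to a sample attribute. This is an illustrative stand-in, not the authors' published implementation; the function names, diffusion depth, and parameters are all assumptions made for the sketch.

import numpy as np
import scipy.sparse as sp
from sklearn.neighbors import kneighbors_graph
from sklearn.decomposition import PCA
from scipy.stats import pearsonr

def neighborhood_abundance_matrix(X, sample_ids, k=15, n_steps=3):
    """Sketch of a neighborhood abundance matrix: entry (s, c) reflects
    how heavily sample s loads on the transcriptional neighborhood
    around cell c, after diffusing sample indicators over the kNN graph.
    (Illustrative; k and n_steps are arbitrary choices.)"""
    A = kneighbors_graph(X, n_neighbors=k, include_self=True)
    # Row-normalized adjacency = one random-walk (diffusion) step.
    W = sp.diags(1.0 / np.asarray(A.sum(axis=1)).ravel()) @ A
    samples = np.unique(sample_ids)
    # Each row starts as that sample's normalized cell-membership indicator.
    M = np.array([(sample_ids == s).astype(float) for s in samples])
    M /= M.sum(axis=1, keepdims=True)
    for _ in range(n_steps):
        M = (W @ M.T).T  # spread each sample's mass to transcriptional neighbors
    return samples, M

def associate(nam, y, n_pcs=10):
    """Correlate the abundance matrix's leading principal components
    (candidate axes of inter-sample variation) with a per-sample attribute y."""
    pcs = PCA(n_components=n_pcs).fit_transform(nam)
    return [pearsonr(pcs[:, i], y) for i in range(n_pcs)]

# Toy usage: 2000 cells in a 30-dim embedding, 40 samples, binary attribute.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 30))
sample_ids = rng.integers(0, 40, size=2000)
status = rng.integers(0, 2, size=40).astype(float)
samples, nam = neighborhood_abundance_matrix(X, sample_ids)
print(associate(nam, status)[:3])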

Dylan Kotliar
Harvard Medical School
Primer: Capturing structure in high-dimensional data using K nearest neighbor graphs

K-nearest-neighbor (KNN) graphs are ubiquitous in high-dimensional data analysis and play a key role in diverse techniques including classification, clustering, non-linear dimensionality reduction, data integration, and trajectory inference. But what makes KNN graphs so powerful and broadly useful? In this primer, I will provide a general overview of KNN graphs, emphasizing how they help overcome the curse of dimensionality to efficiently represent meaningful structure in high-dimensional data. I will use the MNIST database of handwritten digits as an easy-to-visualize example to illustrate the mechanics of constructing a KNN graph and diffusing a signal along it; a toy version of this procedure is sketched below. This will set the stage for the main talk on co-varying neighborhood analysis, which employs KNN graphs to identify axes of sample-level variability, with fine granularity, in single-cell genomics datasets.
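As a concrete warm-up, the Python sketch below builds a KNN graph and diffuses a sparse label signal along it, using scikit-learn's bundled 8x8 digits set as a small stand-in for MNIST. The seeding-and-clamping scheme and all parameters are illustrative choices for this sketch, not necessarily those used in the talk.

import numpy as np
import scipy.sparse as sp
from sklearn.datasets import load_digits
from sklearn.neighbors import kneighbors_graph

# Small stand-in for MNIST: scikit-learn's bundled 8x8 digit images.
X, y = load_digits(return_X_y=True)

# Step 1: build the kNN graph. Each image is connected to its
# k most similar images in pixel space (Euclidean distance here).
k = 10
A = kneighbors_graph(X, n_neighbors=k, mode='connectivity')
A = A.maximum(A.T)  # symmetrize: keep an edge if either point names the other

# Step 2: diffuse a sparse label signal along the graph. Seed a few
# labeled points, then repeatedly average each node's value over its
# neighbors, so the signal spreads through the graph's local structure.
rng = np.random.default_rng(0)
seeds = rng.choice(len(y), size=50, replace=False)
signal = np.zeros(len(y))
signal[seeds] = (y[seeds] == 3).astype(float)  # seed the question "is this a 3?"

deg = np.asarray(A.sum(axis=1)).ravel()
W = sp.diags(1.0 / deg) @ A  # row-normalized adjacency = one diffusion step
for _ in range(20):
    signal = W @ signal
    signal[seeds] = (y[seeds] == 3).astype(float)  # clamp the labeled seeds

# High diffused scores should concentrate on other 3s, even though
# most points were never labeled.
print("mean score on 3s:    ", signal[y == 3].mean())
print("mean score on non-3s:", signal[y != 3].mean())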