MIA: Noor Pratap Singh, Uncertainty-aware analysis of RNA-Seq data using a tree-based framework; Primer: Rob Patro

Primer: Counting is not easy: Assessing and quantifying uncertainty in abundance inferences from high-throughput sequencing data
Rob Patro

University of Maryland, College Park -- Department of Computer Science and Center for Bioinformatics and Computational Biology

From gene-level read counts in RNA-seq analysis through species-level read counts in metagenomic analysis, count data are often treated as direct observations to be statistically modeled for downstream analyses (like differential testing). Yet, due to fundamental read-to-target ambiguity in the underlying data, direct counts often cannot be observed. To help overcome this difficulty, methods have been developed that posit generative models in which the abundances of interest are key parameters, directly related to latent variables encoding read-to-target assignments. Much effort has been expended to make these models accurate and efficient for inference. Nonetheless, they often return point estimates (usually maximum likelihood or maximum a posteriori), even though the degree of uncertainty can vary widely between different parameters and the posterior distributions of these parameters can be correlated in complex ways. In this background talk, I will discuss the challenges posed by read-to-target ambiguity, generative models for abundance estimation developed to address these challenges, methods for statistical inference in these models, and methods for estimating and propagating quantification uncertainty in these models.
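
As a rough illustration of the kind of generative model described above, the following is a minimal sketch of expectation-maximization (EM) for abundance estimation when some reads are compatible with more than one target. It is a deliberately simplified toy, not the implementation of any particular quantification tool: the function name `em_abundances`, the compatibility-list input format, and the example reads are all assumptions made for illustration, and effects such as target length, fragment bias, and alignment scores are ignored.

```python
def em_abundances(compat, n_targets, n_iter=200):
    """Toy EM for relative abundances under read-to-target ambiguity.

    compat[i] lists the target indices that read i is compatible with
    (a hypothetical input format chosen for this sketch).
    """
    theta = [1.0 / n_targets] * n_targets  # start from uniform abundances
    n_reads = float(len(compat))
    for _ in range(n_iter):
        expected = [0.0] * n_targets
        # E-step: split each read across its compatible targets in
        # proportion to the current abundance estimates.
        for targets in compat:
            total = sum(theta[t] for t in targets)
            for t in targets:
                expected[t] += theta[t] / total
        # M-step: abundances are the normalized expected read counts.
        theta = [c / n_reads for c in expected]
    return theta


# Made-up example: 30 reads unique to target 0, 10 unique to target 1,
# and 60 reads that are compatible with both targets.
reads = [[0]] * 30 + [[1]] * 10 + [[0, 1]] * 60
theta = em_abundances(reads, n_targets=2)
print([round(x, 3) for x in theta])
```

In this toy setting the ambiguous reads end up allocated in the same 3:1 ratio as the uniquely mapping ones. The result is a single point estimate: it carries no information about how uncertain each abundance is, which is the gap the talk addresses.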

Meeting: Uncertainty-aware analysis of RNA-Seq data using a tree-based framework
Noor Pratap Singh

University of Maryland -- Department of Computer Science and Center for Bioinformatics and Computational Biology

The length of a short read is typically much smaller than that of a spliced transcript, making it difficult to determine the true locus of origin in eukaryotic transcriptomes, especially since transcripts can share overlapping sequences. This ambiguity introduces uncertainty in the abundance estimation of certain transcripts, which in turn affects downstream analyses such as differential expression testing. To address these challenges, we introduce a data-driven tree-based framework that incorporates uncertainty into RNA-seq data analysis.

In the first part of the talk, I will discuss existing approaches for handling uncertainty and their limitations in RNA-seq data analysis before introducing TreeTerminus. TreeTerminus constructs a hierarchical, tree-like structure from a given set of RNA-seq samples, where leaf nodes represent individual transcripts and internal nodes correspond to aggregated transcript groups. As one ascends the tree, uncertainty decreases, providing a flexible framework for analyzing data at different levels of resolution, depending on the analysis of interest.
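
The idea that uncertainty decreases toward the root can be illustrated with a small sketch. It assumes each transcript has a set of inferential replicates (for example, bootstrap or posterior samples of its estimated count) and uses a simple variance-to-mean ratio as the uncertainty score. The replicate values, the `rel_var` helper, and the choice of score are all hypothetical and chosen for illustration; they are not taken from TreeTerminus itself.

```python
from statistics import mean, pvariance


def rel_var(samples):
    """Variance-to-mean ratio of replicate counts (an illustrative score)."""
    m = mean(samples)
    return pvariance(samples) / m if m > 0 else 0.0


# Made-up replicate counts for two transcripts that share sequence:
# ambiguous reads shift back and forth between them across replicates.
tx_a = [40.0, 55.0, 62.0, 35.0, 58.0]
tx_b = [60.0, 46.0, 37.0, 66.0, 41.0]

# An internal node aggregates its children replicate by replicate.
group = [a + b for a, b in zip(tx_a, tx_b)]

for name, samples in [("tx_a", tx_a), ("tx_b", tx_b), ("group", group)]:
    print(name, round(rel_var(samples), 3))
```

Because the two leaves are strongly anti-correlated across replicates, their sum is far more stable than either leaf alone. This is the intuition behind analyzing the data at an internal node when the leaf-level estimates are too uncertain.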

In the second part of the talk, I will introduce mehenDi, a tree-based differential testing method designed to operate on the tree structures generated by TreeTerminus. mehenDi maximizes the signal that can be extracted from RNA-seq data while explicitly controlling for uncertainty, enabling the discovery of novel features that would be missed by existing gene or transcript-level differential testing methods.
