Counting is not easy: Assessing and quantifying uncertainty in abundance inferences from high-throughput sequencing data
University of Maryland - College Park -- Department of Computer Science and Center for Bioinformatics and Computational Biology
From gene-level read counts in RNA-seq analysis to species-level read counts in metagenomic analysis, count data are often treated as direct observations to be statistically modeled in downstream analyses (such as differential testing). Yet, because of fundamental read-to-target ambiguity in the underlying data, such counts often cannot be observed directly. To help overcome this difficulty, methods have been developed that posit generative models in which the abundances of interest are key parameters, directly related to latent variables encoding read-to-target assignments. Much effort has been expended to make these models accurate and efficient for inference. Nonetheless, these methods often return only point estimates (usually maximum likelihood or maximum a posteriori), even though the degree of uncertainty can vary widely between parameters and the posterior distributions of these parameters can be correlated in complex ways. In this background talk, I will discuss the challenges posed by read-to-target ambiguity, generative models for abundance estimation developed to address these challenges, methods for statistical inference in these models, and methods for estimating and propagating quantification uncertainty in these models.
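
As a rough illustration of the kind of model described above, the following minimal sketch runs expectation-maximization (EM) on a toy mixture model in which each read is compatible with one or more targets and the unknown read-to-target assignment is a latent variable. It is not taken from the talk or from any particular tool: the function name `em_abundance`, the 0/1 compatibility matrix, and the omission of target lengths, fragment models, and priors are all simplifying assumptions made for illustration.

```python
def em_abundance(compat, n_iter=1000, tol=1e-12):
    """Maximum likelihood abundances for a toy read-to-target mixture model.

    compat is a hypothetical input: a list of rows, one per read, where
    compat[i][j] is 1 if read i is compatible with target j and 0 otherwise.
    Every read is assumed to be compatible with at least one target.
    """
    n_reads = len(compat)
    n_targets = len(compat[0])
    theta = [1.0 / n_targets] * n_targets  # uniform starting abundances

    for _ in range(n_iter):
        expected = [0.0] * n_targets
        # E-step: split each read across its compatible targets in
        # proportion to the current abundance estimates.
        for row in compat:
            weights = [c * t for c, t in zip(row, theta)]
            total = sum(weights)
            for j, w in enumerate(weights):
                expected[j] += w / total
        # M-step: new abundances are the normalized expected counts.
        new_theta = [e / n_reads for e in expected]
        delta = max(abs(a - b) for a, b in zip(new_theta, theta))
        theta = new_theta
        if delta < tol:
            break
    return theta


# Two reads unique to target 0, one unique to target 1, one ambiguous.
reads = [[1, 0], [1, 0], [0, 1], [1, 1]]
theta = em_abundance(reads)
print(theta)
```

For this toy input the estimate converges to abundances of about 2/3 and 1/3, with the ambiguous read split accordingly. If instead every read is compatible with both targets (for example `[[1, 1], [1, 1]]`), the likelihood is the same for every abundance vector and the sketch simply returns its uniform starting point. A point estimate alone does not distinguish that situation from a well-determined one.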