SPLASH: Ultraefficient statistics-first analysis of raw sequencing data
Julia Salzman
Stanford University
Meeting: SPLASH unifies genomic analysis and discovery through a paradigm shift to statistics-first
Myriad mechanisms diversify the sequence content of DNA and of RNA transcripts. Currently, these events are detected using tools that first require alignment to a necessarily incomplete reference genome; this incompleteness is especially prominent in human genetic diseases such as cancer, in the microbial world, and in non-model organisms, where it severely limits the speed and scope of discovery. Second, the next step in analysis requires a custom choice of bioinformatic procedure: for example, one to detect splicing, RNA editing, or V(D)J recombination, among many others. I will present the theory for why SPLASH, a new statistics-first analytic approach, captures myriad forms of genome regulation, without a reference or sample metadata, by performing statistical inference directly on raw sequencing reads. By design, SPLASH as an algorithm is highly efficient. Thanks to joint work with Professor Sebastian Deorowicz’s group, SPLASH is now implemented so that it is efficient and simple to run. A snapshot of its findings includes new insights into RNA splicing, cancer transcriptomes, single-cell RNA editing, and mobile genetic elements, as well as the discovery of new genes in non-model organisms.
Primer: Dissecting cell identity via network inference and in silico gene perturbation
Tavor Baharav
Postdoctoral fellow, Eric and Wendy Schmidt Center
Primer: Statistical and algorithmic challenges in reference-free analysis
Today’s genomics workflows typically begin by aligning sequencing data to a reference. In addition to being slow, this has many statistical drawbacks. Even in the intensely studied human genome, understudied populations have been found to have large amounts of sequence missing from the current reference; such blind spots may exacerbate health disparities. Reference-based methods are additionally limited in their detection of novel biology: reads from unannotated isoforms may be mismapped or discarded completely. In recent work, we introduce a unifying paradigm, SPLASH, which directly analyzes raw sequencing data, using a statistical test to detect a signature of regulation: sample-specific sequence variation. SPLASH detects many types of variation and can be run efficiently at scale, providing a unifying statistical approach to genomic analysis that enables expansive discovery without metadata or references. In this primer, I’ll discuss some of the challenges of reference-free analysis and provide the algorithmic and statistical background for our proposed solution, SPLASH.