Willard Ford

Willard is a senior at the University of California, San Diego, double majoring in Bioinformatics and Probability & Statistics. He is interested in applying genomic algorithms and statistical modeling to problems in population genetics with potential for medical impact, particularly in addressing health care inequities. Willard is also excited about how advances in deep learning will allow personalized data to improve patient outcomes.

“BSRP was one of the best experiences of my career. In only a few months, I learned how to ask my own scientific questions and, with incredible support from my lab and my mentors, built the algorithms and tested my hypothesis. I felt like a scientist for the first time. And beyond the incredible science, my fellow interns formed an incredibly supportive and motivating community that I’m sure will last much longer than a summer.”

Traditional methods for detecting human genetic variation explain a large portion of heritable diseases, but not all. Structural variants (SVs), large edits that affect at least 50 consecutive bases of the genome, are enriched for disease association in historically difficult-to-sequence regions and could explain some of the remaining heritability gap. Long-read sequencing allows us to directly observe these “dark regions”, but it has added costs; thus, to provide statistical power for disease association studies, consortia plan to sequence tens of thousands of samples at reduced coverage.

These large SV sets contain significant noise and redundancy that confound disease association. Our solution determines an optimal set of sequences that sufficiently explains the data in the cohort while preserving true variation. However, solving this problem requires a pairwise alignment between every read and every candidate SV, which is slow on cohorts with thousands of samples and tens of thousands of reads per location of the genome.

We show that fast filters based on hashing and sketching of k-mers reduce the number of alignments by 62% and thus can notably speed up the optimization algorithm in a large cohort, while preserving 95% of the accuracy of the solution. We measure the performance of several filtering strategies, including length filtering, min-hashing, sketching, and Euclidean distance approximations, and we highlight specific combinations of filters that are more effective than any single filter in isolation. Our tools enable researchers to study larger long-read cohorts with SV calls, an essential requirement to uncover novel and population-specific disease associations with SVs.
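To make the filtering idea concrete, here is a minimal Python sketch of how a length filter and a min-hash filter could be combined to decide which read/candidate pairs are worth aligning. It is an illustration under stated assumptions, not the project's implementation: the function names, the k-mer size, the number of hash functions, and both thresholds are hypothetical choices made for this example.

```python
import hashlib


def kmers(seq, k=5):
    """Return the set of all k-mers in seq (empty if seq is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer, seed):
    """Deterministic 64-bit hash of a k-mer under a given seed."""
    data = f"{seed}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_sketch(seq, k=5, num_hashes=64):
    """Min-hash sketch: the minimum hash value per seed over all k-mers."""
    ks = kmers(seq, k)
    if not ks:
        return None
    return [min(_hash(km, seed) for km in ks) for seed in range(num_hashes)]


def estimated_jaccard(sketch_a, sketch_b):
    """Fraction of matching sketch positions, an estimate of k-mer Jaccard similarity."""
    return sum(a == b for a, b in zip(sketch_a, sketch_b)) / len(sketch_a)


def pairs_to_align(reads, candidates, max_len_ratio=1.5, min_jaccard=0.2):
    """Return (read_index, candidate_index) pairs that pass both filters.

    Thresholds are hypothetical; in practice they would be tuned on real data.
    """
    read_sketches = [minhash_sketch(r) for r in reads]
    cand_sketches = [minhash_sketch(c) for c in candidates]
    kept = []
    for i, read in enumerate(reads):
        for j, cand in enumerate(candidates):
            shorter, longer = sorted((len(read), len(cand)))
            # Length filter: very different lengths are unlikely to align well.
            if shorter == 0 or longer / shorter > max_len_ratio:
                continue
            # Min-hash filter: skip pairs whose k-mer content barely overlaps.
            if read_sketches[i] is None or cand_sketches[j] is None:
                continue
            if estimated_jaccard(read_sketches[i], cand_sketches[j]) >= min_jaccard:
                kept.append((i, j))
    return kept
```

Only the pairs returned by `pairs_to_align` would then be passed to the expensive alignment step. Applying the cheap length check before comparing sketches is one simple way of combining filters, in the spirit of the combinations described above.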

Project: Fast structural variant merging at a population scale

Mentors: Fabio Cunial & Ryan Lorig-Roach, Data Science Platform