Making 100,000 whole genome sequences available

A Broad Institute data scientist discusses his team’s efforts to build new tools for researchers to access and study the genomic data recently released by NIH’s All of Us Research Program.

Lee Lichtenstein, associate director of computational methods at the Broad Institute's Data Sciences Platform

The National Institutes of Health launched the All of Us Research Program in 2018 to partner with one million participants in the United States and create one of the largest and most diverse genetic and medical datasets in the world. The aim is to accelerate research on health and disease and tackle health disparities. 

Last week, the project released genomic data, including whole genome sequences from nearly 100,000 participants. The Genomics Platform at the Broad Institute of MIT and Harvard is one of several sequencing centers for the project. Broad Institute researchers also helped develop a way for genetic results to be returned to participants.

A Data Sciences Platform team at the Broad Institute processed the All of Us data and made it available for researchers. The Broad Institute also worked with colleagues at Vanderbilt University Medical Center and Verily to build and manage a new cloud-based platform, Researcher Workbench, to allow researchers to access the data. We spoke with Lee Lichtenstein, associate director of computational methods in the Broad Institute’s Data Sciences Platform, about how he and his team pulled off this massive task.

What tools did the Data Sciences Platform create for the All of Us data?

We developed tools to bring researchers to the data in a cost-efficient and secure way. One main tool that we used is Terra, an open, cloud-based platform for accessing data, performing analyses, and collaborating. On our end, we first used Terra to process the genomic data, run quality control, and generate the final genomic data products that we release to researchers.

At the same time, we built a product called the Researcher Workbench, where researchers can analyze the All of Us genomic data alongside surveys, electronic health record data, and physical measurement data by creating and sharing workspaces. We also supply Jupyter notebooks in both R and Python so that researchers with coding expertise can perform their own queries. Researchers can also create cohorts and datasets to organize and filter the data. Plus, the workbench offers tutorials and full support for all users.

What do you mean by bringing researchers to the data?

Historically, when large datasets were generated, researchers would download the data to their home institution and work on dedicated compute clusters there. This approach has several problems. One is that we have to rely on each institution to secure its copy of the data. Another is that compute and storage infrastructure is expensive to maintain, so smaller research institutions are priced out.

In All of Us, we bring the researchers to the data, which means that they use a cloud-based infrastructure to analyze data. This is more secure, since all researchers are analyzing one copy of the data and we no longer have to rely on each institution to properly secure the data. This is also more cost-effective than a dedicated compute cluster, since researchers are not paying to maintain compute clusters when idle, and the cloud compute can scale to meet the need, even if only for a short period of time. Additionally, smaller institutions do not need to create the infrastructure to do these analyses.

What was the quality-control process for the data?

While processing the data, we paid a lot of attention to quality control (QC), especially consistency. When generating this much data in a short timeframe, you can't rely on just one sequencing center. So we took great pains to make sure that the sequencing centers were consistent with each other. The sequencing centers use the same protocols for sample prep, the same analytical processes, and the exact same software. That way, we could reduce or eliminate batch effects. The data has also passed through multiple layers of QC, both single-sample QC and QC across samples. This allows us to find artifact variants and samples that might be either contaminated or noisy.

How can scientists access the All of Us data?

Researchers can apply for tiered access to use the Researcher Workbench. Controlled tier access is required to use genomic data. We have researchers apply for appropriate levels of access because we do not want researchers to have more access than they need. We also have a public-facing website where researchers can browse available data before they apply for access or decide whether to register. There, anyone can see how much data is available for the population they’re interested in studying, including all of the survey questions, physical measurements, and electronic health record domains collected. For example, if you study colon cancer, you can see on the public website how many All of Us participants have had colon cancer before you request access to the Researcher Workbench.