A conversation about the legacy of ENCODE and what comes next
Epigenetics researcher Charles Epstein discusses the impact of the NHGRI ENCODE consortium on our understanding of genome regulation, and how it is paving the way for new efforts.
By Natalia Mesa
In April 2003, scientists completed the monumental task of sequencing the human genome. The sequence revealed that only about one percent of the human genome codes for proteins, which left the nascent genomics community with the challenge of discovering the function, if any, of the remaining 99 percent. For the next 18 years, the Encyclopedia of DNA Elements (ENCODE) project sought to interpret this trove of genomic data by comprehensively characterizing all of the functional elements in DNA, including both genes and the non-coding elements that regulate gene activity.
The ENCODE consortium, funded by the National Human Genome Research Institute (NHGRI), is made up of 31 institutions, including the Broad Institute of MIT and Harvard, and has revealed much about how genes are regulated in the human and mouse genomes. In sharing its data openly, ENCODE has helped researchers make countless discoveries. The consortium produced the Encyclopedia of DNA Elements, a public data repository, which revealed that at least 80 percent of the human genome has regulatory activity. In addition, researchers have used ENCODE data to pinpoint regulatory elements that contribute to cardiovascular disease, Alzheimer’s disease, Crohn’s disease, bipolar disorder, and many other conditions.
At the Broad Institute, the project is headed by institute member Bradley Bernstein, director of the Gene Regulation Observatory and Epigenomics Program and chair of the Department of Cancer Biology at Dana-Farber Cancer Institute, along with Charles Epstein and Noam Shoresh, both associate directors in the Epigenomics Program. For the past 13 years, they’ve led a team of scientists and project managers that has contributed hundreds of experiments and thousands of datasets to this collective effort. Data from Epstein’s group has been used in thousands of publications, contributing to advances in genomic science and health.
Now ENCODE, in its fourth and final stage of work, is winding down, and with that comes a new initiative that will build on the legacy of ENCODE. The Impact of Genomic Variation on Function (IGVF) will dive deeper into the function of DNA elements and how they operate in different cell types and states.
We sat down with Epstein to talk about ENCODE’s legacy, impact, and what comes next.
What are the key goals of ENCODE?
ENCODE was and is a project devoted to understanding how the genome functions.
It used an ensemble of technologies that were developed after the advent of genome sequencing. The goal was to develop and apply these new methods to understand the functional significance of every section of the genome. With these technologies, we developed a huge dataset that’s been a foundational resource for the genomics community.
We needed to go beyond gene sequencing to map the functional elements involved in gene regulation. To accomplish this, one of the methods we developed and applied extensively over the years is called chromatin immunoprecipitation followed by DNA sequencing, best known as ChIP-seq. This method allows us to find out which regions of the genome regulate gene expression in which cell types, and which regions are actively repressed. We applied ChIP-seq in an enormous number of cell types and tissues and built up what ENCODE is now, which is an encyclopedia that can be accessed by anyone. Robbyn Issner, in particular, among many other group members, has contributed an enormous number of ChIP-seq maps. If you have some bit of the genome you want to understand, you can look it up in the encyclopedia now and see things like: this bit of the genome is very active in liver cells but is completely inactive in pancreas cells. It’s an enormous resource.
What do you think are the biggest impacts of ENCODE?
The most profound impact of this project has been the interface that it has enabled between the functional characterization community and the genetics community, which are two parallel efforts. As ENCODE was launching, the genetics community pioneered a method called the genome-wide association study (GWAS) to discover places in the genome where there are gene variants that are statistically associated with disease risk. The variants usually turned out to be distant from what we classically think of as genes — they were between the genes in the non-coding regions.
So, the majority of the gene variants that may predispose people to any number of heritable diseases, like Alzheimer’s, cardiovascular disease, and diabetes, coincide with regulatory regions that were discovered through projects like ENCODE. These sites are known as enhancers: distal regulatory elements that affect the expression of genes, which can be far away in the genome. The knowledge we generated was useful because of the tremendous diversity of cell and tissue types that the ENCODE project has profiled. ENCODE deepened the insight gained from discovery of disease-associated variants by suggesting the cell or tissue type in which the variant may have a pathological effect. The intersection between the work of the genetics community and the functional characterization community is one of the real legacies of the ENCODE project.
The other exciting class of insights relates to what we’ve learned about the three-dimensional structures of chromosomes and the pathological states that can result when that structure is disrupted. DNA is highly compacted to fit into the nucleus, and there are so-called topologically associating domains (TADs) that are critical to how gene regulation is achieved. Enhancers that promote gene expression principally work inside the scope of a particular domain. When that 3D structure is disrupted, enhancers can start modulating genes they would not normally control, which can contribute to disease.
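For readers who want to see what intersecting disease-associated variants with tissue-specific regulatory regions can look like in practice, here is a minimal Python sketch. It is an editorial illustration only, not ENCODE's pipeline or anything described by Epstein: the tissue names, coordinates, variant names, and the function `tissues_for_variants` are all hypothetical, and real analyses use published annotation files and dedicated interval tools.

```python
# Hypothetical enhancer annotations: tissue -> list of (chromosome, start, end).
# Intervals are half-open [start, end); all coordinates are invented.
ENHANCERS = {
    "liver": [("chr1", 1000, 2000), ("chr2", 500, 900)],
    "pancreas": [("chr1", 5000, 6000)],
    "brain": [("chr1", 1500, 2500)],
}

# Hypothetical disease-associated variants: name -> (chromosome, position).
VARIANTS = {
    "variant_a": ("chr1", 1750),
    "variant_b": ("chr1", 5500),
    "variant_c": ("chr2", 100),
}


def tissues_for_variants(variants, enhancers):
    """Map each variant to the sorted list of tissues whose enhancers contain it."""
    hits = {}
    for name, (chrom, pos) in variants.items():
        matched = []
        for tissue, intervals in enhancers.items():
            # A variant "hits" a tissue if it falls inside any enhancer interval
            # on the same chromosome in that tissue.
            if any(c == chrom and start <= pos < end for c, start, end in intervals):
                matched.append(tissue)
        hits[name] = sorted(matched)
    return hits


if __name__ == "__main__":
    for name, tissues in tissues_for_variants(VARIANTS, ENHANCERS).items():
        print(name, tissues if tissues else "no overlapping enhancer")
```

In this toy data, a variant that lands in an enhancer active only in one tissue points to that tissue as a place to look for its effect, which is the kind of hint the interview describes. The linear scan is chosen for readability; a large dataset would call for an interval index instead.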
What comes next?
The Broad Institute is interested in genomic medicine because we want to cure and prevent disease, so we really need to get deep into the details of how things work. We’re working to launch Broad’s part in a new NHGRI-funded consortium called the Impact of Genomic Variation on Function (IGVF). This new effort further develops functional annotation as a source of insight for genetics, but takes it to the next level by using single-cell methodology, rather than merging the ensemble of cell types that make up a tissue. We’re entering the brave new world of single-cell multiomics, where we will characterize gene expression, functional states in the genome, and enriched regulatory motifs, all from the same single cells.