We build technology to help researchers connect to the patients, datasets and tools they need to do life-changing biomedical research.
The life sciences are in the midst of a data revolution. Cheap and accurate genome sequencing is a reality, advanced imaging is routine, and clinical data is increasingly stored in electronic formats. These innovations — and the massive data sets they produce — have brought us to the threshold of a new era in medicine, one where the data sciences hold the potential to propel our understanding and treatment of human disease.
The Broad Institute Data Sciences Platform (DSP) is a methods development and software engineering group dedicated to maximizing the impact of the data sciences on the life sciences. DSP engineers, analysts, and designers build applications and capabilities to serve the Broad Institute and beyond.
The DSP is organized around four principal components:
Workbench: A suite of web services that provides foundational computational capabilities for genomic and biomedical science, including infrastructure for storing, sharing, and analyzing genomic and clinical data at unlimited scale. Workbench supports , an open cloud-based platform for accessing data, performing analyses, and collaborating securely in the cloud, which powers projects like , , and the used by the All of Us Research Program.
Analytical tools: Open-source applications and approaches such as GATK that provide best practice pipelines for extracting all available information from read-level data, available for download as well as via portals and other software-as-a-service mechanisms.
User interfaces: Web-based portals and other interfaces that make data and analytical methods available and engage researchers, clinicians, and patients. In particular, we develop software to support a number of direct-to-patient studies.
Production data processing: Tools and applications designed and scaled to process massive volumes of raw genome sequence data into forms scientists can use to create new knowledge. As part of this effort, we partner with the Broad Institute Genomics Platform to process all of the data it produces into a form that researchers can use.
Flagship DSP software products and services
The DSP develops software products and operates services that are widely used across the biomedical ecosystem, such as:
: an open cloud-based platform for accessing data, performing analyses and collaborating securely in the cloud, developed in collaboration with Microsoft and Verily Life Sciences.
: the leading open-source variant discovery package for analysis of high-throughput sequencing data.
: a popular set of open-source command line tools for processing high-throughput sequencing data.
: An execution engine that allows users to run reproducible workflows written in either the Workflow Description Language (WDL, pronounced "widdle") or the Common Workflow Language (CWL), portable across local machines, computer clusters, and cloud platforms (e.g., AWS, Microsoft Azure, Google Cloud Platform).
The Data Donation Platform (DDP): A software stack that enables direct participant engagement, including consent and recontact, via intuitive web and mobile interfaces. DDP provides the underlying infrastructure for disease-specific registries such as the , the , and the .
The : A suite of interfaces for managing interactions between data access committees and researchers seeking to access sensitive genomic datasets.
Flagship scientific projects and portals
The DSP plays pivotal roles in several national and international scientific initiatives, including:
: A National Institutes of Health (NIH)-funded initiative that will recruit 1 million or more U.S. citizens and collect their genomic and clinical data. The Broad Institute, in collaboration with Vanderbilt and Verily, is building a Workbench-based platform to store, share, and analyze all data generated as part of the program.
The and browsers: The DSP supports members of the Broad Institute Program in Medical and Population Genetics in processing and analyzing these large collections of genome and exome data generated by collaborators around the world.
The : A resource for genotype and tissue-specific gene expression correlation data.
The Human Cell Atlas (HCA): An international effort to comprehensively characterize cell types and cell states in health and disease. The Broad Institute, in collaboration with the European Bioinformatics Institute, the University of California at Santa Cruz (UCSC), and the Chan Zuckerberg Initiative, is building the HCA Data Coordination Platform, which will serve as the effort’s central point for data collection, quality control, processing, and sharing.
The : A visualization and data exploration portal for single cell RNA sequencing data.
Cancer Genome Commons: A National Cancer Institute-funded initiative to provide a cloud-based ecosystem for storing, sharing, and analyzing key cancer datasets, including those of The Cancer Genome Atlas (TCGA) and Therapeutically Applicable Research To Generate Effective Treatments (TARGET) initiatives.
NIH Data Commons: An NIH-funded initiative to create a data commons for hosting key datasets, including , GTEx, and model organism datasets. The Broad Institute, UCSC, and the University of Chicago are collaborating to create a software platform for storing, sharing, and analyzing data deposited in the commons.
Flagship DSP partnerships
To bring the tools of machine learning and cloud computing to bear on problems of fundamental importance to biomedicine, the DSP collaborates with world-leading technology corporations, philanthropic organizations, and pharmaceutical companies such as: