Data Sciences

Generating insights that will lead to breakthroughs requires that data and the tools we build to study them be stored, curated, analyzed, updated, and shared rapidly, efficiently, openly, accurately, and broadly, all with privacy, security, and informed patient consent remaining top-of-mind.

The computer scientists, software engineers, informaticians, mathematicians, and others who make up the Broad’s data science community share three core principles that form a foundation for addressing the growing computational needs of large-scale genomic and biomedical research. We believe in:

  1. The value of vast and diverse types of data. Biomedical research today requires platforms that allow secure yet easy storage, access, analysis, and processing of sequence data, medical records, and other complementary forms of information at very large scale, while protecting patient privacy.
  2. Development of open source tools and resources. The Broad’s Data Sciences Platform has committed to making all of the software products it develops open source. (Learn more in our blog post, “Open source: Foundation for the future,” and our explainer, “Creating tools to generate data insights.”)
  3. Widespread sharing of ideas and data within the scientific and computational community. Since before the launch of the Human Genome Project, the Broad’s research community has been committed to making data and tools available to researchers worldwide.

Members of the data sciences community are woven tightly into the fabric of the Broad. They play prominent roles in the Institute’s programs, platforms, and initiatives. A few examples:

  • Cancer Program: The Broad Cancer Program’s many data scientists form the backbone of several large teams, including Cancer Genome Computational Analysis. These teams develop, build, and maintain tools and resources for analyzing a wide variety of high-throughput screening results and cancer genome data. Many of these tools are available on the Broad’s Data, Software and Tools page.
  • Data Sciences Platform (DSP): The DSP is a team of software engineers, computational biologists, and other technical contributors who are developing open-source software products for the analysis of genomic and clinical data at large scale, including numerous direct-to-patient portals.
  • Epigenomics Program: The Broad Epigenomics Program includes robust computational and software engineering efforts responsible for developing tools and generating data for understanding how the genome is regulated.
  • Imaging Platform: The Broad Imaging Platform develops open-source software tools for analyzing and mining image-based data, and helps biologists apply them to important questions in biomedicine.
  • LIMS and Analytics: The LIMS and Analytics group develops and maintains information and reporting systems that support the Broad Genomics Platform’s daily activities.
  • Program in Medical and Population Genetics (MPG): Members of MPG have played key roles in developing a range of portals and computational tools, including a variant browser and a variant analysis and exploration framework.

In addition, Broad data scientists have created two unique activities that support collaboration and provide opportunities for ongoing professional development:

  • Models, Inference, and Algorithms (MIA): MIA is an initiative that supports learning and collaboration at the interface of biology with mathematics, statistics, machine learning, and computer science.
  • Software Engineering (SoftEng) Affinity Group: This internal group supports software engineers at the Broad and their professional growth through an ongoing speaker series, career development opportunities, and occasions for community building.