How Broad Institute and Verily are creating tools together to enable data sharing and analysis

Collaboration will create and disseminate open-source tools for storing, sharing, and analyzing life sciences data

By Kristian Cibulskis, Director, Platform Engineering

December 7, 2017

Credit: Susanna M. Hamilton, Broad Institute

Today, we are writing to share how the collaboration between the Broad Institute and Verily is expanding to create and disseminate open-source tools for storing, sharing, and analyzing life sciences data.

We have been working with Verily on several efforts for more than a year.

First, both of our organizations are among a group of National Institutes of Health (NIH)-funded collaborators (including Vanderbilt University, University of Michigan, Columbia, and others) developing open-source computational tools for the All of Us Research Program, NIH’s effort to gather data from one million or more people living in the United States to accelerate research and improve health.

Second, we have worked with Verily to migrate our production data processing environment for genome sequencing to the cloud. This is based on earlier work with Google Genomics, whereby we optimized its overall framework for running pipelines to shift away from an environment that was heavy on local computing and storage, to one operating in cloud environments. This was done by creating two new components (the and the Pipelines API) that are now used to process all genomic data that the Broad Institute generates. Now, we are helping Verily adopt these same analytical tools, as well as incorporate the best practices of the Broad Institute’s sequencing center, for its own genome sequencing operations.

In the course of the collaboration, we experienced firsthand that impactful science increasingly involves collaborations across institutions and countries, and it no longer makes sense for every group to develop analytical tools and data environments in isolation. We both believe that tools should be interoperable and openly available—to avoid needless duplication, and to maximize the opportunities for data sharing and rapid adoption of tools, as well as to avoid needless sending, copying, and storing of vast amounts of information.

A key consideration is ensuring that the software is open source. The Broad Institute Data Sciences Platform is already committed to making all of its software open source. Verily is also committed to contributing to the open-source community and has agreed to fund some software engineers at the Broad Institute to create open-source tools.

Both organizations have contributed to genomic variant callers that we share freely. The Broad Institute has long made its Genome Analysis Toolkit (GATK) widely available, and recently changed the licensing so that it is now open source. Verily and the Google Brain Team recently made DeepVariant, a variant caller that uses deep neural networks, available as open source on GitHub, and Verily will provide additional opportunities to support open-source software development for the life sciences through efforts like .

Our collaboration follows the Data Biosphere framework, a set of principles recently described by authors in academia and industry. These principles promote open-source sharing of tools and standards-based interoperability in computational biology.

In addition to always ensuring that the data itself is secure, the Data Biosphere framework includes a stewardship principle: biomedical data environments should act as data custodians, not data owners. That is, software services should not presume to have the right to use, sell, or control access to third-party data. Compared with data in consumer technology products, medical data entails greater responsibilities to patients and participants, including protecting patient privacy and ensuring that only appropriate secondary use of data is permitted.

We recently released the first two components created from our collaboration — a , and a . We plan to add new components in the coming months and years.

Importantly, this collaboration is just one component of an ecosystem of collaborations. We are part of a network of research groups that is working together, in various combinations, on a number of flagship scientific projects, including , the , the , and the . All of these groups embraced the of open-source licensing, modularity, standardization, and community engagement.

We hope researchers find these open-source components useful.
