A Data Scientist’s View into the Cancer Moonshot Project: Part 1, Data Infrastructure

September 9, 2016 Katie Malone

Working at Civis, I’m able to apply my PhD both to data science research and development and to real-world problems. Recently, I teamed up with my colleagues, including our CEO, Dan Wagner, to provide data recommendations for Vice President Biden’s Cancer Moonshot initiative (you can read Dan’s thoughts here). Our efforts resulted in a full report that was presented at the Cancer Moonshot Summit in June (you can download the report here), and this post is the first in a three-part series where we’ll summarize our findings.

To start, we needed to understand the landscape. That meant wrapping our heads around the large and complex set of interacting pieces (hospitals, regulations, research programs, scientific context, historical background, etc.) of the fight against cancer. To this end, we conducted over 40 interviews with groups and individuals in relevant fields including doctors, researchers, data experts, legal experts, patients, hospital administrators, and many others.

Our objective was to understand how data was being used in the field, find where it was falling short of its potential, and form a set of policy recommendations to make cancer data science better. The final result is a report with recommendations in three main areas:

  • What computational infrastructure is necessary to deal with cancer data
  • How that data can be liberated from silos and shared effectively
  • How to train the current and next generations of cancer data scientists to use the data effectively

This post focuses on computational infrastructure.

When we set out to describe the current computational infrastructure, we expected that our primary task would be to learn how it worked and how it could be improved. The reality was quite different: our interviews quickly made clear that “the cancer data infrastructure” was more like a very loosely confederated set of systems, each with its own peculiarities and nuances, and that a major barrier to progress was the lack of coherence and interoperability among them. There are no technical interfaces between the systems; each one has a different user experience (interfaces, software requirements, technical assumptions about the users); they’re expensive to build and maintain; and the people who build and maintain them don’t always do a great job of communicating or coordinating their work. The data had simply grown too big and complex, too quickly, for any one system to keep up with the deluge.

Instead of recommending small changes to an existing system, we decided that a major focus of our recommendations for the Moonshot effort should be building and maintaining the computational infrastructure for large-scale cancer data science. Our interviews made clear that brilliant researchers are using a huge diversity of data to do great work, but that the data generally isn’t as useful as it could be because of the problems outlined above. The government is in a unique position to coordinate this work and to take the lead in building that infrastructure as a kind of public good for cancer research, in the same way that it built the interstate highway system as a public good. The question then becomes: what should the system look like, and how might we build it?

Let’s start with a summary of the data, the driving factor for many of our recommendations about infrastructure. There are far too many datasets to summarize comprehensively; broadly speaking, though, we found that most of the data could be described as belonging to public or government organizations, academic research groups, or individual patients/hospitals (most commonly in the form of electronic medical records, or EMRs).

  • Public and government data is often very large (many cases) and very rich (many pieces of information about each case). It is also collected in a standardized way, which makes it the most amenable to unification efforts.
  • Academic research data is more spread out and heterogeneous, but it still tends to be high quality and there are ways that the government can incentivize researchers to share it.
  • Patient data is the most challenging to harness, in part because it is stored in the widest variety of places and formats (and also because of important privacy protections–more on this in another post), but it also holds huge potential for helping researchers unlock answers to some of the thorniest questions.

With all that in mind, we set to work thinking about what features of an infrastructure would be most important for making all this data as useful as possible. When we talk about “infrastructure”, we think about it in two ways: first is the storage capacity to hold all the data, and the tools for getting data into and out of the system; second is the computational capacity to run intensive algorithms and analytics on that data to get insights out of it. Our experience working as data scientists has taught us a few general things about building infrastructure for working with big and complex data.

  • The infrastructure should be flexible: a system that is too rigid in the types of data it can handle, or in what it can do with them, will be harder to maintain and will become obsolete faster.
  • The system should be easy to use: it should have both a graphical user interface (GUI) that lets less-technical users interact with it easily and an application programming interface (API) that lets developers build tools on top of it.
  • The system needs to support data harmonization [1]: this is critical for data to be useful.

One very bright spot in our work was learning about the NCI Genomic Data Commons (GDC), a project that’s been under construction for some time and officially launched in June. As it currently stands, the GDC hosts several large public datasets (TCGA and TARGET) and will be the future home of datasets from publicly funded academic research studies. This is a large collection of datasets, and while it’s not comprehensive by any means, we think it holds great potential as a central gathering place for cancer-associated genomic data. The GDC team has also put an impressive amount of work into its GUI and API, and into harmonizing its data, making it appealing and useful from a user perspective. We know from building our own data science platform that building a database is one thing, but building a tool that is intuitive, useful, and powerful is considerably harder. As a result of all this, many of our infrastructure recommendations build upon the foundation of the GDC, and explore what additional upgrades or capabilities should be considered.
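To give a flavor of what programmatic access through an API like this can look like, here is a minimal Python sketch of listing a few projects. It is not taken from our report: the base URL, the `projects` endpoint, the `fields` and `size` query parameters, and the `data.hits` response shape are assumptions based on the GDC's public API documentation and may have changed, and the helper names are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL for the GDC REST API; check the current GDC docs.
GDC_API_BASE = "https://api.gdc.cancer.gov"


def build_projects_url(fields, size=5):
    """Build a URL asking the (assumed) projects endpoint for a few fields."""
    query = urlencode({"fields": ",".join(fields), "size": size})
    return f"{GDC_API_BASE}/projects?{query}"


def extract_hits(payload):
    """Pull the list of records out of a response shaped like
    {"data": {"hits": [...]}} (an assumed shape); return [] if absent."""
    return payload.get("data", {}).get("hits", [])


def fetch_projects(fields, size=5, timeout=30):
    """Fetch and decode a page of project records (needs network access)."""
    with urlopen(build_projects_url(fields, size), timeout=timeout) as response:
        return extract_hits(json.load(response))


# Example usage (commented out because it makes a live network request):
# for project in fetch_projects(["project_id", "name"]):
#     print(project)
```

Keeping URL construction and response parsing separate from the network call makes the pieces easy to test offline, which matters when the service on the other end is still evolving.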

If the GDC is a great start for a publicly-funded cancer data storage solution, its counterpart for cancer data analysis would probably be the NCI Cloud Pilot programs. These three sibling programs–run out of the Broad Institute, Seven Bridges Genomics and the Institute for Systems Biology–provide a set of cloud-based analytics tools for cancer data. Each of these pilot programs works side-by-side with the GDC, offering researchers access to its datasets and computational resources. The Cloud Pilots are still growing, but we see potential for them to develop into widely-used tools for large-scale analysis of genomic data. Regardless of whether future data analysis tools arise out of these programs or elsewhere, we recommended that the government continue to look for ways to lead in data analysis infrastructure, so that it grows alongside the data storage infrastructure.

All this infrastructure work will take some time. Even if we started today, it would take years to start seeing returns. That said, we’ve seen the payoff of these investments in the work that we do at Civis, where we’ve spent a lot of time and effort building a solid computational infrastructure that enables us to use data to solve really hard and important problems.

Infrastructure only gets us part of the way there, though. In our next blog posts we’ll talk about the second and third pieces of the puzzle: how to think about data sharing and technical skills within the context of cancer data science.

This post was co-authored by Angelo Mancini, Ola Topczewska, and Todd Harris.


  1. Data harmonization is critical because data from different sources often show so-called “batch effects”, which are systematic differences in the data that arise from differences in how it was collected and processed. As an example, suppose one doctor tends to write a patient’s age as “forty years old” where another might say “40 years of age”—the same information is present in both statements, but a simple algorithm looking for something in the first format will miss information recorded in the second format. Another example would be genomic data from two different labs: those labs might have slight differences in their sequencing methodologies, or their data collection procedures, that mean that two hypothetical samples that originate in the same patient could still come out looking very different. When you want to combine data from different sources, these so-called batch effects can easily overwhelm any possible signal in the data and make it hard to disambiguate a real effect from an artifact of the data.
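To make the two examples in this note concrete, here is a small Python sketch. It is an illustration only, not the method used by the GDC or described in our report. The first part maps the two age phrasings onto a single integer; the second subtracts each lab's mean from its measurements, which is a deliberately simple stand-in for real batch correction and can erase genuine differences if the labs studied different kinds of patients. All function names and sample values are hypothetical.

```python
import re
from statistics import mean

_UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10, "eleven": 11,
    "twelve": 12, "thirteen": 13, "fourteen": 14, "fifteen": 15,
    "sixteen": 16, "seventeen": 17, "eighteen": 18, "nineteen": 19,
}
_TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}
# Matches "<number> year(s) old" or "<number> year(s) of age".
_AGE_PATTERN = re.compile(
    r"^\s*(?P<number>.+?)\s+years?\s+(?:old|of\s+age)\s*$", re.IGNORECASE
)


def words_to_int(text):
    """Sum the number words in text (e.g. 'forty-two' -> 42).

    Only handles 0-99 and does not validate word order; returns None
    if any token is not a recognized number word.
    """
    total = 0
    for token in re.split(r"[\s-]+", text.lower().strip()):
        if token in _UNITS:
            total += _UNITS[token]
        elif token in _TENS:
            total += _TENS[token]
        else:
            return None
    return total


def parse_age(note):
    """Return the age in years from a free-text note, or None if unparseable."""
    match = _AGE_PATTERN.match(note)
    if not match:
        return None
    raw = match.group("number")
    return int(raw) if raw.isdigit() else words_to_int(raw)


def center_by_batch(measurements):
    """Subtract each batch's mean from its values.

    measurements is a list of (batch_label, value) pairs; the result
    keeps the same order with centered values.
    """
    by_batch = {}
    for batch, value in measurements:
        by_batch.setdefault(batch, []).append(value)
    batch_means = {batch: mean(values) for batch, values in by_batch.items()}
    return [(batch, value - batch_means[batch]) for batch, value in measurements]
```

With this sketch, `parse_age("forty years old")` and `parse_age("40 years of age")` both return 40, so a downstream algorithm sees the same value regardless of which doctor wrote the note.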

The post A Data Scientist’s View into the Cancer Moonshot Project: Part 1, Data Infrastructure appeared first on Civis Analytics.
