A Data Scientist’s View into the Cancer Moonshot Project: Part 2, Data Sharing

September 13, 2016 Katie Malone

Over the past few months, we worked on a project that’s a little different from our usual work: researching and writing a report for Vice President Biden’s Cancer Moonshot Initiative. In this report, we analyzed how cancer research might benefit from better use of data and analytics. Our recommendations were organized around three major themes:

  • What kind of computational infrastructure is needed for better cancer data science
  • How cancer data can be shared more effectively
  • How to create the next generation of cancer researcher data scientists with the skills to analyze that data

You can read more about the infrastructure recommendations here, read a note from our CEO, or download the full report here. Later this week, we’ll share a deep dive on the people and skills recommendations.

In this post, I’ll summarize some of the challenges we identified surrounding data sharing.

So why focus so much on data sharing?

  • Effective data sharing means that data collected by one researcher can be used again by another researcher.
  • Data sharing is critical to compiling datasets that are big enough to find small or rare effects; if you’re trying to study a rare cancer or genomic variation, it’s much easier to combine data from many patients (coming from all their various doctors, hospitals, clinical trials, etc.) than to ask everyone to come to the same doctor, hospital, or clinical trial.
  • While there is a huge amount of medical data being collected, that data is effectively unavailable to researchers for their use in studying new treatments because of the current lack of effective sharing infrastructure.
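
To make the second point concrete, here is a minimal sketch of the arithmetic behind pooling. Every number in it is a made-up assumption for illustration (a variant carried by 1 in 1,000 patients, five hypothetical sites, and an arbitrary threshold of five carriers for a usable sample); none of it comes from the report.

```python
from math import comb

# Hypothetical inputs, chosen only to illustrate the arithmetic.
PREVALENCE = 0.001                          # assumed: 1 in 1,000 patients carries the variant
SITE_COHORTS = [400, 650, 1200, 900, 1850]  # assumed patient counts at five sites
MIN_CARRIERS = 5                            # assumed minimum number of carriers needed


def expected_carriers(cohort_size: int, prevalence: float) -> float:
    """Expected number of carriers in a cohort of the given size."""
    return cohort_size * prevalence


def prob_at_least(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), computed as 1 - P(X <= k - 1)."""
    below = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return 1.0 - below


largest_site = max(SITE_COHORTS)
pooled = sum(SITE_COHORTS)

single_expected = expected_carriers(largest_site, PREVALENCE)
pooled_expected = expected_carriers(pooled, PREVALENCE)

single_chance = prob_at_least(MIN_CARRIERS, largest_site, PREVALENCE)
pooled_chance = prob_at_least(MIN_CARRIERS, pooled, PREVALENCE)

print(f"largest site: {single_expected:.2f} expected carriers, "
      f"P(at least {MIN_CARRIERS}) = {single_chance:.3f}")
print(f"pooled sites: {pooled_expected:.2f} expected carriers, "
      f"P(at least {MIN_CARRIERS}) = {pooled_chance:.3f}")
```

Under these assumptions, even the largest single site expects fewer than two carriers, while the pooled cohort expects about five. The binomial model is itself a simplification, since it treats patients as independent and equally likely to carry the variant.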

Data sharing is very tricky because there are good reasons to constrain it. Laws governing sharing of sensitive health information, like HIPAA (Health Insurance Portability and Accountability Act) and other related laws and regulations at the state and federal level, are in place to protect patients. These laws spell out how personal health information must be kept private and secure by researchers, hospitals, and doctors. Even the strongest advocates of data sharing among us can agree that confidentiality of medical data needs to be reasonably and responsibly protected.

However, in our interviews with many hospital administrators, doctors, and researchers, we heard again and again that privacy laws are vague, difficult to interpret, and don’t always keep the data as secure as one might think. HIPAA was passed by Congress in 1996, before data science and affordable genomic sequencing were widespread. As the volume of digitized medical data has increased, data confidentiality rules haven’t always kept up. The Cancer Moonshot provides a great opportunity to modernize privacy regulations for better data science in all kinds of medical fields.

One of the most compelling arguments for better data sharing came to us from patients and doctors themselves. In our interviews, we heard from patients both directly and indirectly (through their doctors) that they want to have the option to voluntarily contribute certain pieces of their own medical data for research. Currently, a patient legally has the right to open up their (anonymized) medical record to accredited researchers for study, but in practice there’s no way to get their data to the researcher. A significant part of our recommendation is to clarify and start building a process for patient data donation.

A scalable process would require several components: software that can format and export the data, a database that could keep it both secure and accessible, and a clear legal framework surrounding ownership, liability, and rights for the data. We recommend several initiatives targeted at solving specific sub-problems related to patient data sharing, and we recommend giving those initiatives the time and space they need to identify the best policy changes to make.

Data sharing isn’t all about privacy regulations and laws, though. A second barrier to data sharing is that there isn’t an agreed-upon set of conventions about how to format cancer data, particularly electronic medical record (EMR) data. Currently, there are many different ways that data gets recorded and formatted, which causes huge headaches every time that data needs to be aggregated or moved from one system to another. There is work underway right now to define a shared data format; we recommended that the government continue to encourage that work and enforce the standards once they are finalized.
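
To illustrate why inconsistent formats are such a headache, here is a minimal sketch of what aggregation looks like without a shared standard. The two source layouts, the field names, and the `CommonRecord` schema are all hypothetical, invented for this example; they do not represent any real EMR system or the shared format under development.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class CommonRecord:
    """Hypothetical shared schema: the fields every source gets mapped into."""
    patient_id: str
    diagnosis_code: str
    diagnosis_date: date


def from_hospital_a(row: dict) -> CommonRecord:
    # Hypothetical layout A: flat keys, lowercase codes, US-style dates.
    return CommonRecord(
        patient_id=row["pid"],
        diagnosis_code=row["dx"].upper(),
        diagnosis_date=datetime.strptime(row["dx_date"], "%m/%d/%Y").date(),
    )


def from_hospital_b(row: dict) -> CommonRecord:
    # Hypothetical layout B: nested keys, ISO 8601 dates.
    return CommonRecord(
        patient_id=row["patient"]["id"],
        diagnosis_code=row["diagnosis"]["code"].upper(),
        diagnosis_date=date.fromisoformat(row["diagnosis"]["date"]),
    )


# Made-up sample rows; the diagnosis code is an ICD-10-style string.
hospital_a_rows = [{"pid": "A-001", "dx": "c50.9", "dx_date": "03/14/2015"}]
hospital_b_rows = [
    {"patient": {"id": "B-017"},
     "diagnosis": {"code": "C50.9", "date": "2015-03-14"}},
]

# Each source needs its own converter before the rows can be combined.
combined = (
    [from_hospital_a(r) for r in hospital_a_rows]
    + [from_hospital_b(r) for r in hospital_b_rows]
)

for record in combined:
    print(record)
```

Every new source means another hand-written converter like the two above, and every converter is a place for subtle errors, such as a misread date format. A shared format would make most of this code unnecessary.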

Third, there is a lot of great data being generated in research studies all over the country, from genomic sequencing to basic research to clinical trials. You can see where I’m going with this: that research data isn’t always shared effectively either. To be fair to the researchers generating and using that data, it takes a lot of work to share data and there are not many incentives to nudge them in that direction. Researchers invest a lot of time, effort, and resources into developing their datasets, so it makes sense that they wouldn’t want to release them before they have been able to use them fully for publishing new findings. However, since much of the research in the United States is funded by the government, changes in government policy about data sharing could be attached to research grants, effectively saying “the government will pay for your research but the data must be made public after the work is done.” (There are some policies like this already in place, so our recommendation is to expand their scope and enforcement, and to make it easier to access the datasets that are already being shared by consolidating them in a centralized location.) When this is coupled with infrastructure advances like the NCI’s Genomic Data Commons for storing, finding, and distributing datasets, the incentives around academic data sharing will start to point in the right direction.

As I’m sure you now appreciate, it will take a lot of work to build the legal, regulatory, and incentive structure that we need for better data sharing. But it’s worth it: our work with rich and diverse datasets at Civis Analytics has taught us that the degree to which we are able to make accurate, actionable predictions about the world depends on the quality of the data that goes into those predictions. Sharing data effectively is crucial for making the best use of all our resources in the fight against cancer, and we look forward to the advances that our brilliant researchers will be able to make once we start liberating cancer data.

This post was co-authored by Angelo Mancini, Ola Topczewska, and Todd Harris.

The post A Data Scientist’s View into the Cancer Moonshot Project: Part 2, Data Sharing appeared first on Civis Analytics.
