Jan 20, 2016

Cancer Moonshot in the Cloud

I've been reading a bit about the "Cancer Moonshot" discussion at the Davos economics conference.


Naturally I'm interested in the possible increase in funding for genomics and bioinformatics research, but the discussion of 'big data' and the sharing of genomics data also touches on issues that I bump into all the time. It is almost impossible to overstate the number of hoops an ordinary scientist has to jump through to obtain access to, or to share, human genomic data that has already been published. There is an entire system of "authorized access" that requires not only that scientists swear to handle genomic data securely and make no attempt to connect genomic data back to patient identities, but also that the University (or research institute) where they work monitor and enforce these rules. I have had to deal with this system to upload human microbiome data (DNA sequences from bacteria found in or on the human body) that are contaminated with some human DNA. [But not with the coffee beetle genome!] Then I had to apply again for authorization to view my own data to make sure it had been loaded properly.

Why is cancer genomic data protected? Unfortunately, some annoyingly clever people such as Yaniv Erlich have shown that it is possible (fairly easy in his hands) to identify people by name and address just from some of their genome sequence. Patients who agree to participate in research are supposed to be guaranteed privacy - they wanted to share information with scientists about the genetic nature of their tumors, not to share their health care records with nosy neighbors, privacy hackers and identity thieves.

Why do we need thousands of cancer genomes? One key goal of cancer research is personalized medicine - matching up people with customized treatments based on the genetics of their cancer. Current technology is pretty good for DNA sequencing of tumors - for a single cancer patient we can come up with a list of somatic mutations (found only in the tumor) for a few thousand dollars' worth of sequencing effort (and a poorly measured amount of bioinformatician and oncologist time). One of the biggest challenges right now is sorting through the list of mutations to figure out which ones are important drivers of cancer growth and disease severity - and should therefore be targeted by drugs or other therapy. Some mutations are well known to be bad actors, others are new mutations in genes that have been found to be mutated in other cancers, and others are complete unknowns. Data is needed from (hundreds of) thousands of tumors, together with records of treatment response and other medical outcomes, in order to build strong predictive models that will reliably advise the doctor about the medical importance of each observed mutation. Another challenge is the heterogeneity of cells within a single person's cancer. As DNA sequencing technology improves, investigators have started to sequence small bits of tumors, or even single cells. They observe different mutations in different cells or sub-clones. Now a key question is whether the common resistance to drug treatment is a result of new mutations that occur during (or after) treatment, or whether the resistant cells already exist in the tumor and are selected for growth by drug treatment. Overall, this means that precision cancer treatment may require a large number of different genome sequences from each patient, both during diagnosis and to monitor the course of treatment and post-treatment.
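To make the three-way sorting of mutations described above a bit more concrete (well-known bad actors, new mutations in genes already seen mutated in other cancers, and complete unknowns), here is a minimal Python sketch. Everything in it is an assumption for illustration: the two tiny lookup sets, the `Mutation` record, and the `triage` function are hypothetical names, and a real pipeline would draw on curated databases and outcome data from many thousands of tumors rather than a hard-coded list.

```python
from dataclasses import dataclass

# Illustrative stand-ins only: a real analysis would load curated
# driver catalogs, not hard-code a handful of entries.
KNOWN_DRIVER_MUTATIONS = {("BRAF", "V600E"), ("KRAS", "G12D")}
GENES_MUTATED_IN_OTHER_CANCERS = {"BRAF", "KRAS", "TP53"}


@dataclass(frozen=True)
class Mutation:
    gene: str            # gene symbol, e.g. "BRAF"
    protein_change: str  # amino-acid change, e.g. "V600E"


def triage(mutation: Mutation) -> str:
    """Sort one somatic mutation into one of three rough categories."""
    if (mutation.gene, mutation.protein_change) in KNOWN_DRIVER_MUTATIONS:
        return "known bad actor"
    if mutation.gene in GENES_MUTATED_IN_OTHER_CANCERS:
        return "new mutation in a known cancer gene"
    return "complete unknown"


def triage_all(mutations):
    """Group a patient's mutation list by category."""
    groups = {}
    for m in mutations:
        groups.setdefault(triage(m), []).append(m)
    return groups
```

The "complete unknown" bucket is the one that only shrinks as more tumors, with treatment and outcome records attached, become available for comparison.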

So cancer genomic research requires thousands of genomes (deeply sequenced for accuracy and control of artifacts), which means that each authorized investigator must download terabytes of data, and then come up with the data storage and compute power to run his or her clever analysis. In addition to the strictly administrative hurdles of applying for and maintaining authorized access to cancer genomic data, there are the problems of data transfer, data storage, and big computing power. So the NIH (or other funding agency) has to pay once to generate the cancer genomic data, then again to store it and provide a high-bandwidth web or FTP data sharing system, then again to administer the authorized access system, then again for each interested scientist to build a local computing system powerful enough to download, store, and analyze the data (and for University administrators to triple check that they are doing it properly, and again for the NIH administrators to check up on the University administrators to ensure they are doing their checking properly). This is an impressive amount of redundancy and wasted effort, even for the US Government.
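A back-of-the-envelope calculation shows how quickly the duplicated copies add up. The numbers below are assumptions picked only for illustration (they are not measurements from any real archive), and `total_storage_tb` is a hypothetical helper name.

```python
def total_storage_tb(n_genomes: int, gb_per_genome: float, n_copies: int) -> float:
    """Total storage in terabytes when the full data set is kept n_copies times."""
    return n_genomes * gb_per_genome * n_copies / 1000.0


# Assumed, illustrative inputs:
N_GENOMES = 10_000      # tumor genomes in the shared collection
GB_PER_GENOME = 200.0   # storage per deeply sequenced genome
N_LABS = 50             # authorized labs that each download everything

# One central archive plus one full local copy per lab.
everyone_downloads = total_storage_tb(N_GENOMES, GB_PER_GENOME, 1 + N_LABS)
# One central copy that every lab analyzes in place.
central_only = total_storage_tb(N_GENOMES, GB_PER_GENOME, 1)

if __name__ == "__main__":
    print(f"Every lab keeps a copy: {everyone_downloads:,.0f} TB")
    print(f"Single shared copy:     {central_only:,.0f} TB")
```

Under these assumed inputs, storage grows in direct proportion to the number of labs that download the data, before counting the transfer bandwidth and the local compute each lab has to buy.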

There is an obvious solution to this problem: 'Use the Cloud, Luke'. A single Cloud computing system can store all cancer genomic data in a central location, together with a sufficiently massive amount of compute power so that authorized investigators can log in and run their analysis remotely. This technology already exists; Google, Amazon, Microsoft, IBM, Verizon, and at least a dozen other companies already have data centers large enough to handle the necessary data storage and compute tasks. It would be handy to build a whiz-bang compute system with all kinds of custom software designed for cancer genomics, but that would take time (and government contractors). A better, faster, simpler system would just stick the genomic data in a central location and let researchers launch virtual machines with whatever software they want (or design for themselves). Amazon EC2 has this infrastructure already in place. It could be merged with the NIH authorized access system in a week-long hackathon. Cancer research funding agencies could award Cloud compute credits (or just let people budget for Cloud computing in the standard grant application).

Cancer Moonshot:

Use the Force, Luke - use the Cloud, Luke
