Making Genomic Data Analysis Faster and More Accurate

July 20, 2012
Matei Zaharia and Kristal Curtis | UC Berkeley

The Human Genome Project was completed almost a decade ago. One of its most promising potential outcomes is the ability to use genomic data for personalized medicine, especially in genetically driven diseases such as cancer. Despite the excitement around this approach, personalized medicine is still in its infancy. One hurdle is producing large amounts of low-cost but accurate genomic data, which is needed to better understand the connections between genotype and health. Fortunately, technological advances in the past ten years have driven the cost of DNA sequencing down rapidly, faster than Moore’s Law. The drawback of the new DNA sequencers, however, is that the data they produce is hard to postprocess. These sequencers produce large numbers of short (100-character) “reads” from the genome, which must then be assembled, much like a puzzle, into a full sequence. Because of its high computational cost, this data analysis problem will soon dominate the cost of reconstructing a genome.

Our group, a team of researchers from the UC Berkeley AMP Lab, Microsoft, and UCSF, is working on a holistic system to quickly and accurately process short-read DNA data. This is in contrast to previous work, where the processing was broken up into discrete stages that were optimized separately, resulting in inefficient resource usage and information loss. Our most mature contribution so far is a new algorithm for the first step in this process, alignment, where each read is matched to the location in the genome from which it most likely came. Alignment has traditionally been highly compute-intensive, taking days for one genome. Our new aligner, the Scalable Nucleotide Alignment Program (SNAP), reduces this cost by 10-100x while simultaneously improving accuracy. It accomplishes this through a combination of algorithmic innovation and judicious use of modern hardware. We are also applying the insights from SNAP to further steps of the sequencing process, which use alignment results from multiple reads to determine the individual’s true genotype.
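
To make the alignment step concrete, here is a minimal Python sketch of matching one short read to its most likely location in a reference sequence. It is only an illustration of the general problem: the abstract does not describe SNAP's internals, so this uses a generic seed-lookup-then-verify approach that should not be read as SNAP's algorithm. The function names, the seed length, and the mismatch threshold are all hypothetical choices made for the example.

```python
from collections import defaultdict


def build_seed_index(genome, seed_len):
    """Map every seed_len-long substring of the genome to its start positions."""
    index = defaultdict(list)
    for pos in range(len(genome) - seed_len + 1):
        index[genome[pos:pos + seed_len]].append(pos)
    return index


def hamming(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def align_read(read, genome, index, seed_len, max_mismatches):
    """Return (start, mismatches) for the best candidate location, or None.

    Illustrative only: substitutions are tolerated, insertions and
    deletions are not handled.
    """
    best = None
    # Take non-overlapping seeds from the read and look each one up.
    for offset in range(0, len(read) - seed_len + 1, seed_len):
        seed = read[offset:offset + seed_len]
        for hit in index.get(seed, ()):
            start = hit - offset  # where the whole read would begin
            if start < 0 or start + len(read) > len(genome):
                continue
            # Verify the candidate by comparing the full read.
            d = hamming(read, genome[start:start + len(read)])
            if d <= max_mismatches and (best is None or d < best[1]):
                best = (start, d)
    return best


# Toy reference and a 12-character read copied from position 8
# with one substituted base (hypothetical data).
genome = "ACGTTGCATGCCGATAGCTAGGCTA"
read = genome[8:13] + "C" + genome[14:20]
index = build_seed_index(genome, seed_len=4)
print(align_read(read, genome, index, seed_len=4, max_mismatches=2))  # (8, 1)
```

The index is built once per reference and reused for every read, so the per-read work is a few lookups plus a handful of full comparisons. A real aligner also has to handle insertions, deletions, reverse-complement strands, and reads that match several locations equally well.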

Speaker Details

Matei Zaharia is a fifth-year PhD student at UC Berkeley, working with Scott Shenker and Ion Stoica on topics in computer systems, networks and cloud computing. He is also a committer on Apache Hadoop and Apache Mesos. Matei is funded by a Google PhD Fellowship.

Kristal Curtis is a fifth-year PhD student in the AMP Lab at UC Berkeley, advised by David Patterson and Armando Fox. Her research has focused on performance modeling for storage systems and fast and accurate analysis of genomics data. She has been supported by an NSF Graduate Research Fellowship and a UC Berkeley Chancellor’s Fellowship.