Making Genomic Data Analysis Faster and More Accurate
- Matei Zaharia and Kristal Curtis | UC Berkeley
The Human Genome Project was completed almost a decade ago. One of its most promising potential outcomes is the ability to use genomic data for personalized medicine, especially in genetically driven diseases such as cancer. Despite the excitement around this approach, personalized medicine is still in its infancy. One hurdle is producing large amounts of low-cost but accurate genomic data, which is needed to better understand the connections between genotype and health. Fortunately, technological advances in the past ten years have driven the cost of DNA sequencing down faster than Moore’s Law. The disadvantage of the new DNA sequencers, however, is that postprocessing the data they produce is hard. These sequencers produce large numbers of short (100-character) “reads” from the genome, which must then be assembled, much like a puzzle, into a full sequence. Because of its high computational cost, this data analysis step will soon dominate the cost of reconstructing a genome.
Our group, a team of researchers from the UC Berkeley AMP Lab, Microsoft, and UCSF, is working on a holistic system to quickly and accurately process short-read DNA data. This is in contrast to previous work, where the processing was broken up into discrete stages that were optimized separately, resulting in inefficient resource usage and information loss. Our most mature contribution so far is a new algorithm for the first step in this process, alignment, where each read is matched to the location in the genome from which it most likely came. Alignment has traditionally been highly compute-intensive, taking days for one genome. Our new aligner, the Scalable Nucleotide Alignment Program (SNAP), reduces this cost by 10-100x while also improving accuracy. It accomplishes this through a combination of algorithmic innovation and judicious use of modern hardware. We are also applying the insights from SNAP to later steps of the sequencing process, which use alignment results from multiple reads to determine the individual’s true genotype.
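To make the alignment step concrete, here is a minimal sketch of a generic seed-and-verify aligner in Python. It is not SNAP's algorithm, which the abstract does not describe in detail. It only illustrates what "matching a read to its most likely location" means. The function names (`build_index`, `align_read`), the seed length, and the use of Hamming distance (mismatches only, no insertions or deletions) are all assumptions chosen for illustration.

```python
from collections import defaultdict


def build_index(genome, seed_len):
    """Map every substring of length seed_len to the positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(genome) - seed_len + 1):
        index[genome[pos:pos + seed_len]].append(pos)
    return index


def hamming(a, b):
    """Count mismatching characters between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def align_read(read, genome, index, seed_len, max_mismatches):
    """Return (position, mismatches) of the best candidate location, or None.

    Hypothetical helper: looks up non-overlapping seeds from the read in the
    index, then verifies each candidate location by comparing the whole read.
    """
    best = None
    seen = set()
    for offset in range(0, len(read) - seed_len + 1, seed_len):
        seed = read[offset:offset + seed_len]
        for hit in index.get(seed, ()):
            start = hit - offset  # where the read would begin in the genome
            if start < 0 or start + len(read) > len(genome) or start in seen:
                continue
            seen.add(start)
            dist = hamming(read, genome[start:start + len(read)])
            if dist <= max_mismatches and (best is None or dist < best[1]):
                best = (start, dist)
    return best


# Toy usage: a 12-character read taken from position 10 of a made-up genome.
genome = "ACGTTGCAAGGCTTAACCGGATCGATTACG"
index = build_index(genome, seed_len=4)
print(align_read(genome[10:22], genome, index, seed_len=4, max_mismatches=2))
```

A real aligner also has to handle insertions and deletions, reads from either DNA strand, and repetitive regions where many locations match about equally well. Those concerns are left out of this sketch.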
Speaker Details
Matei Zaharia is a fifth-year PhD student at UC Berkeley, working with Scott Shenker and Ion Stoica on topics in computer systems, networks and cloud computing. He is also a committer on Apache Hadoop and Apache Mesos. Matei is funded by a Google PhD Fellowship.
Kristal Curtis is a fifth-year PhD student in the AMP Lab at UC Berkeley, advised by David Patterson and Armando Fox. Her research has focused on performance modeling for storage systems and fast and accurate analysis of genomics data. She has been supported by an NSF Graduate Research Fellowship and a UC Berkeley Chancellor’s Fellowship.
Jeff Running
Series: Microsoft Research Talks