IBM Software
Information Management
Case Study

University at Buffalo, State University of New York
IBM Netezza powers the study of exponentially large gene interactions associated with multiple sclerosis

Overview

• The need: Supercomputer-level processing against millions of single nucleotide polymorphisms (SNPs) to handle the combinatorial explosion involved in breakthrough data mining methods for gene interaction in disease discovery.
• The solution: The IBM Netezza data warehouse appliance provides faster performance than high-performance computer (HPC) platforms for efficient execution of the AMBIENCE algorithm. AMBIENCE uses the “K-Way Interaction Index” to identify series of SNPs that correlate with the presence of a phenotype, such as multiple sclerosis.
• The benefit: Processes exponentially complex gene-environment interactions more than 6,000 times as fast as large symmetric multiprocessing (SMP) systems.

The State University of New York (SUNY) at Buffalo is home to one of the leading multiple sclerosis (MS) research centers in the world. MS is a devastating, chronic neurological disease that affects nearly one million people worldwide. The disease causes physical and cognitive disabilities and is characterized by inflammation and neurodegeneration of the brain and spinal cord. From the beginning, the genetics of MS were known to be complex, and it was apparent that no single gene was likely to cause the disease.

Since 2007, the SUNY team has been examining data from scanned genomes of MS patients to identify genes whose variations could contribute to the risk of developing MS. New technologies now enable hundreds of thousands of genetic variations, called single nucleotide polymorphisms (SNPs), to be obtained from a single sample.

According to research lead Dr. Murali Ramanathan, a critical fact in the study of MS is that “gene products work by interacting with both other gene products and environmental factors.” Because of this, researchers have postulated that multiple SNPs, combined with environmental variables, would better explain the risk of developing MS. Finding these combinations is analogous to the proverbial search for a needle in a haystack, says Dr. Ramanathan.

Identifying candidate environmental factors that could be used to prevent the disease from progressing in patients was of great importance. Examples include sun exposure and vitamin D levels, Epstein-Barr virus infections and smoking.

The researchers needed to create algorithms that would efficiently identify interactions and aim for ‘parsimony’, an extreme efficiency: the team was looking for the smallest combination of SNPs, environmental variables and phenotypic variables that would help explain the presence of MS.

“Developing knowledge of pertinent environmental factors opens the way for new therapeutic and prevention strategies and can enable individualized treatment and management of MS patients.”
(Dr. Murali Ramanathan, Research Lead, The State University of New York (SUNY) at Buffalo)

1   A T A T C G C C G A T T A G C C G A C T G
2   A T A T C T A C G A T T A G C C G A C T G

Figure 1: A single nucleotide polymorphism (SNP) is a DNA sequence variation occurring when a single nucleotide (A, C, G or T) in the shared genome differs in an individual. If many individuals with a disease (or ‘phenotype’) share polymorphisms that healthy individuals lack, this may be a genetic clue to what causes the disease (Wikipedia).

The researchers at the State University of New York developed an approach they called AMBIENCE, which is distinctive in its use of new information theoretic methods along with its versatility and scalability.
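To make the phrase “information theoretic” concrete, here is a minimal sketch of plain empirical mutual information between discrete variables. It is not the K-Way Interaction Index and not AMBIENCE code, neither of which this case study specifies. The function name `mutual_information` and the toy genotype and phenotype lists are assumptions made purely for illustration.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y) in bits for two discrete sequences."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    n = len(xs)
    if n == 0:
        raise ValueError("sequences must not be empty")
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    total = 0.0
    for (x, y), c in count_xy.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts
        total += (c / n) * log2(c * n / (count_x[x] * count_y[y]))
    return total


# Hypothetical toy data: the phenotype is the XOR of two SNP genotypes.
snp_a = [0, 0, 1, 1]
snp_b = [0, 1, 0, 1]
phenotype = [0, 1, 1, 0]

mi_a = mutual_information(snp_a, phenotype)                      # SNP A alone
mi_b = mutual_information(snp_b, phenotype)                      # SNP B alone
mi_pair = mutual_information(list(zip(snp_a, snp_b)), phenotype)  # A and B jointly

print(mi_a, mi_b, mi_pair)
```

In this toy data neither SNP on its own carries any information about the phenotype, yet the pair taken jointly determines it completely. That is the kind of non-linear interaction that a one-variable-at-a-time analysis would miss.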
The theoretical underpinnings of AMBIENCE enable the detection of both linear and non-linear dependencies in the data. The AMBIENCE algorithm can search the large combinatorial space efficiently because of the nature of its information theoretic metrics, which allow a greedy search to identify the most promising combinations.

The need

The computational challenge in gene-environment interaction analyses is due to a phenomenon called ‘combinatorial explosion.’ With thousands of SNPs, the number of SNP combinations that must be assessed to uncover potential interactions becomes astronomical.

Solution components
Software
• IBM® Netezza® 1000-24

“The IBM Netezza 1000 data warehouse appliance not only allowed us to greatly speed up our run time, it also gave us the added ability to accomplish more within that time.”
(Dr. Murali Ramanathan)

The sheer number of SNPs, combined with environmental variables and phenotype values, means that the number of computations necessary for data mining could run into the quintillions (18 zeros). To identify how many interactions could exist for a given number of SNPs, the SUNY researchers graphed a mathematical function. If they were to mine through one million SNPs for combinations containing four SNPs, they would be searching through sextillions of possible interactions (a sextillion is 10 to the 21st power).

The researchers attempted to run the algorithms on commodity hardware and found that a simple run of the algorithm took almost a week. Extrapolating from this performance, they realized it would take many weeks to run the algorithm against the volume of data they wanted. Meanwhile, they knew that the algorithm results would lead to additional questions, algorithm adjustments, data changes and more, so run times of that length would be untenable.
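The combination counts discussed above can be checked with a few lines of arithmetic. The sketch below assumes the quantity of interest is the number of unordered k-SNP subsets (a binomial coefficient) and ignores environmental and phenotype variables. The function name `count_combinations` is hypothetical and not part of any SUNY or IBM code.

```python
from math import comb


def count_combinations(num_snps: int, k: int) -> int:
    """Number of distinct unordered k-SNP subsets among num_snps SNPs."""
    return comb(num_snps, k)


# How fast the search space grows for one million SNPs.
for k in (2, 3, 4):
    total = count_combinations(1_000_000, k)
    print(f"{k}-SNP combinations among 1,000,000 SNPs: {total:.2e}")
```

Under these assumptions, four-SNP combinations drawn from one million SNPs number roughly 4 × 10^22, which is tens of sextillions. Each extra SNP in the combination multiplies the count by several orders of magnitude, which is why exhaustive search quickly becomes impractical.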
They needed processing power that would speed this up by 100 times in order to make meaningful discoveries and publish them.

Figure 2: A heuristic example of the SUNY data table

The solution

The researchers knew they needed the kind of processing only available in a high-performance computer (HPC), previously known as a supercomputer. In addition, the research team needed capabilities found in relational databases, which are not often included in HPC platforms. HPCs typically ran their processing only in parallel, breaking up calculations to run simultaneously across many processors. They also often used field-programmable gate array (FPGA) architectures for additional speed.

“With the IBM Netezza research analytics infrastructure in place, running an algorithm became a matter of minutes instead of days.”
(Dr. Murali Ramanathan)

Unfortunately, HPCs can cost tens of millions of dollars and do not include storage for large volumes of data. They are designed for data to be read ‘into memory’, which made them impractical for the ‘big data’ computations the SUNY team needed.

That is when the researchers discovered IBM® Netezza®. Testing the IBM Netezza data warehouse appliance confirmed that it delivered all the performance they wanted and much more. IBM Netezza’s architecture is designed to accommodate ‘big data’ storage and processing, combined with the ‘in memory’ performance of an HPC. IBM Netezza, a pioneer of data-intensive HPC appliances that are easy to administer at a fraction of the cost of today’s HPC platforms, emerged as the comprehensive solution the researchers at SUNY were seeking.

The benefit

After deploying the IBM Netezza data warehouse appliance as their research analytics infrastructure, the SUNY researchers were able to:

• Use the new algorithms and begin analyses that were previously nearly impossible to conduct.
• Reduce the time required to conduct an analysis from 27.2 hours without Netezza to 11.7 minutes with it. “We went from days to minutes,” said Dr. Ramanathan.
• Carry out their research with little to no database administration. (Unlike other HPC platforms or databases available, Netezza was designed to require an absolute minimum of maintenance.)
• Publish multiple articles in scientific journals, with more in progress.
• Proceed with studies based on ‘vector phenotypes’, a more complex variable that will further push the IBM Netezza platform.

Figure 3: One of the articles published by SUNY researchers

“The IBM Netezza appliance was very easy to install and required little to no database administration. It performed as they said it would…low maintenance and with the high speed analytics our computationally intense work demanded.”
(Dr. Murali Ramanathan)

A call to action

One of the most debilitating hurdles to productive scientific research is the latency between being ready to start (having a solid question, the right data and a robust calculation) and getting an answer. With the aggressive price/performance ratio of IBM® Netezza®, researchers can take advantage of modern, secure compute and storage power. One of the chief benefits is that researchers can ask and answer many more questions in the same time period than ever before.

For more information

To learn more about IBM Netezza data warehouse appliances, please contact your IBM or Netezza sales representative or IBM Business Partner, or visit the following website: www.netezza.com.

For more information on SUNY Buffalo, visit www.buffalo.edu/. For more on SUNY’s research, consult the following sample publications:

• Chanda P, Sucheston L, Zhang A, Brazeau D, Freudenheim JL, Ambrosone C, Ramanathan M. AMBIENCE: a novel approach and efficient algorithm for identifying informative genetic and environmental associations with complex phenotypes. Genetics. 2008 Oct; 180(2): 1191-210. Epub 2008 Sep 9. PMID: 18780753.
• Chanda P, Sucheston L, Liu S, Zhang A, Ramanathan M. Information-theoretic gene-gene and gene-environment interaction analysis of quantitative traits. BMC Genomics. 2009 Nov 4; 10:509. PMID: 19889230.
• Sucheston L, Chanda P, Zhang A, Tritchler D, Ramanathan M. Comparison of information-theoretic to statistical methods for gene-gene interactions in the presence of genetic heterogeneity. BMC Genomics. 2010 Sep 3; 11:487. PMID: 20815886.
• Chanda P, Sucheston L, Zhang A, Ramanathan M. The interaction index, a novel information-theoretic metric for prioritizing interacting genetic variations and environmental factors. Eur J Hum Genet. 2009 Oct; 17(10): 1274-86. Epub 2009 Mar 18. PMID: 19293841.

© Copyright IBM Corporation 2011

IBM Corporation
Software Group
Route 100
Somers, NY 10589
U.S.A.

Produced in the United States of America
April 2011
All Rights Reserved

IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at ibm.com/legal/copytrade.shtml

Netezza is a registered trademark of Netezza Corporation, an IBM Company.

Other company, product and service names may be trademarks or service marks of others.

References in this publication to IBM products or services do not imply that IBM intends to make them available in all countries in which IBM operates.

Please Recycle

IMC14645-USEN-00