Improved VariantSpark breaks "the curse of dimensionality" for machine learning on genomic data

Aidan O'Brien(a), Piotr Szul(b), Robert Dunne(c), Oscar J. Luo(a), Denis C. Bauer(a)

a CSIRO, Health and Biosecurity, North Ryde, Sydney
b CSIRO, Data 61, Dutton Park, Brisbane
c CSIRO, Data 61, North Ryde, Sydney

Genetic variation identification from high-throughput genomic data is becoming increasingly prevalent in medical research, with ever larger cohorts of samples. However, traditional computing hardware and software are inadequate for large-cohort analysis, with the finite resource of computer memory becoming the bottleneck. Thus, MapReduce and MapReduce-like systems, such as Apache Hadoop and Spark, have been utilised to overcome these obstacles. Leveraging the power of Spark and its machine learning libraries (Spark ML), we built VariantSpark, a flexible framework for analysing genomic data. We previously demonstrated VariantSpark's ability on the 1000 Genomes Project (phase 3) data by clustering 2,500 individuals, with 80 million genomic variants each, into their super-population groups, achieving an ARI of 0.82 (where 1 indicates perfect and -1 random clustering). Aiming to further improve this performance and distinguish between the American and European populations, we sought to apply a supervised machine learning approach using Spark ML. However, Spark ML suffers from "the curse of dimensionality". That is, Spark ML scales well with the number of samples (n) but not with the number of features (p): datasets with many features can easily exceed memory limits. This is because Spark ML was originally developed for web-analytics data, which has different properties from genomic data. Hence, performing sophisticated supervised machine learning tasks, such as random forest, was not possible on the 80 million variants of the 1000 Genomes data. We have recently extended VariantSpark to include CursedForest, an alternative random forest algorithm capable of scaling not only with samples but also with features.
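The ARI reported above can be computed from a contingency table between the predicted and true population labels. As a minimal, illustrative sketch (pure Python rather than the Spark ML pipeline used in the work), the index is:

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index: 1 = perfect agreement with the true
    grouping, values near 0 = no better than random labelling."""
    n = len(labels_true)
    # Contingency counts between the two labellings.
    pair_counts = Counter(zip(labels_true, labels_pred))
    a = Counter(labels_true)   # cluster sizes in the true labelling
    b = Counter(labels_pred)   # cluster sizes in the predicted labelling

    sum_nij = sum(comb(c, 2) for c in pair_counts.values())
    sum_a = sum(comb(c, 2) for c in a.values())
    sum_b = sum(comb(c, 2) for c in b.values())
    total = comb(n, 2)

    expected = sum_a * sum_b / total      # chance-level agreement
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:             # degenerate case (e.g. one cluster)
        return 1.0
    return (sum_nij - expected) / (max_index - expected)
```

A relabelling of the same partition still scores 1.0 (ARI is invariant to cluster names), which is why it suits evaluating unsupervised clustering against known super-population groups.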
We successfully trained a random forest on the 1000 Genomes Project data and achieved a cross-validated accuracy of ARI=0.96. Because random forests are built on decision trees, it is relatively trivial to inspect the role individual features play in building the model. Furthermore, we can rank important features based on attributes such as their Gini coefficient or their propensity to appear in individual trees. However, to take full advantage of feature selection, it is important that we can build a model on the entire dataset. Applying feature-reduction methods to the training data would instead yield somewhat arbitrary features (e.g. with PCA) or potentially discard informative data. CursedForest allows us to do this by parallelising the problem beyond the tree level, down to individual nodes. Through this approach we can trivially build random forest models on samples containing millions of features.

Keywords: Hadoop, Spark, random forests, VariantSpark, machine learning, curse of dimensionality
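The Gini-based feature ranking mentioned above rests on the impurity decrease each split achieves: summing that decrease over every node where a feature is used gives its importance in a tree. A minimal sketch of the underlying quantities (an illustration of the standard technique, not the CursedForest implementation):

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a set of class labels: 0 = pure node,
    larger values = more mixed classes."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_decrease(parent, left, right):
    """Weighted impurity decrease achieved by splitting `parent`
    into `left` and `right`. Summed over all nodes that split on a
    given feature, this yields that feature's Gini importance."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted
```

For example, a split that separates a perfectly mixed two-class node into two pure children achieves the maximum decrease of 0.5, so a variant producing such splits would rank highly.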