Improved VariantSpark breaks “the curse of
dimensionality” for machine learning on genomic data
Aidan O’Brien (a), Piotr Szul (b), Robert Dunne (c), Oscar J. Luo (a), Denis C. Bauer (a)
(a) CSIRO, Health and Biosecurity, North Ryde, Sydney
(b) CSIRO, Data61, Dutton Park, Brisbane
(c) CSIRO, Data61, North Ryde, Sydney
Genetic variation identification from high-throughput genomic data is becoming increasingly prevalent
in the field of medical research with ever larger cohorts of samples. However, traditional computing
hardware and software are inadequate to cater for large-cohort analysis, with the finite resource of
computer memory becoming the bottleneck. Thus, MapReduce and MapReduce-like systems, such as
Apache Hadoop and Spark, have been utilised to overcome these obstacles. Leveraging the power of
Spark and its machine learning libraries (Spark ML), we built VariantSpark: a flexible framework for
analysing genomic data. We previously demonstrated VariantSpark’s ability on the 1000 Genomes
Project (phase 3) data by clustering 2,500 individuals with 80 million genomic variants each into their
super-population groups, achieving an ARI of 0.82 (where 1 indicates perfect agreement and 0 the
expectation for a random clustering). Aiming to improve this performance further and to distinguish
between the American and European populations, we sought to apply a supervised machine learning
approach using Spark ML.
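As a minimal sketch of this unsupervised step (illustrative only; it is not VariantSpark's actual pipeline, and the input path and column names are assumptions), the clustering could be expressed with Spark ML as:

    // Minimal sketch of the unsupervised step (illustrative only, not
    // VariantSpark's actual implementation). Assumes a DataFrame with a
    // "features" column holding per-sample genotype vectors (0/1/2 allele
    // counts per variant); the Parquet path is hypothetical.
    import org.apache.spark.ml.clustering.KMeans
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("cluster-sketch").getOrCreate()
    val genotypes = spark.read.parquet("genotypes.parquet")

    // Five clusters, one per 1000 Genomes super-population.
    val kmeans = new KMeans().setK(5).setFeaturesCol("features").setSeed(42L)
    val model = kmeans.fit(genotypes)
    val assignments = model.transform(genotypes) // adds a "prediction" column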
However, Spark ML suffers from “the curse of dimensionality”: it scales well with the number of
samples (n) but not with the number of features (p), so a feature set of this size can easily exceed
memory limits. This is because Spark ML was originally developed for web-analytics data, which has
very different properties from genomic data. Hence, performing sophisticated supervised machine
learning tasks, such as random forest, was not possible on the 80 million variants of the 1000
Genomes Project.
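To make the memory constraint concrete, a back-of-envelope calculation (our own illustration, not a statement about Spark ML's internal data layout) shows why a naive dense representation of this cohort is infeasible:

    // 2,500 samples x 80 million variants, one 8-byte double per genotype
    // in a naive dense feature matrix.
    val samples  = 2500L
    val variants = 80000000L
    val bytes    = samples * variants * 8L
    println(f"${bytes / 1e12}%.1f TB") // ~1.6 TB, far beyond a single node's memory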
We have recently extended VariantSpark to include CursedForest, an alternative random forest
algorithm capable of scaling not only with samples but also with features. We successfully trained a
random forest on the 1000 Genomes Project data and achieved a cross-validated ARI of 0.96.
Because random forests are built on decision trees, it is relatively straightforward to inspect the role
individual features play in building the model. Furthermore, we can rank important features by
attributes such as their Gini importance or their propensity to appear across individual trees. However,
to take full advantage
of feature selection, it is important that we can build a model on the entire dataset. Applying
feature-reduction methods to the training data instead would yield abstract composite features (e.g.
principal components from PCA) or risk discarding informative variants. CursedForest allows us to do
this by parallelising the problem beyond the tree level, down to individual nodes. It is through this
approach that we can build random forest models on samples containing millions of features.
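CursedForest's own API is not spelled out in this abstract; as a sketch of the kind of importance inspection described above, stock Spark ML already exposes Gini-based importances on a trained forest (the DataFrame name and columns are assumptions):

    import org.apache.spark.ml.classification.RandomForestClassifier

    // Assumes "labelled" is a DataFrame with a numeric "label" column
    // (super-population) and a "features" column of genotype vectors.
    val rf = new RandomForestClassifier()
      .setNumTrees(500)
      .setLabelCol("label")
      .setFeaturesCol("features")
    val rfModel = rf.fit(labelled)

    // featureImportances holds the normalised mean decrease in Gini
    // impurity per feature; the largest entries flag informative variants.
    rfModel.featureImportances.toArray.zipWithIndex
      .sortBy(-_._1)
      .take(10)
      .foreach { case (imp, idx) => println(s"variant $idx: $imp") }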
Keywords
Hadoop, Spark, random forests, VariantSpark, machine learning, curse of dimensionality