Extremely Large Data Challenges
What R can and can't do
Susan Holmes
http://www-stat.stanford.edu/~susan/
Bio-X and Statistics, Stanford University
NIH-R01GM086884
A roadmap
xkcd
Some Advantages of R
- Reproducibility: Sweave, history, ...
- Prototyping: availability of > 3,500 packages; all statistical/machine learning methods available.
- Data input and cleaning: multiple formats.
- Visualization and presentation: high-quality graphics (ggplot2).
- Open source: community of users, documentation.
Some Difficulties in R
- Main drawback: by default, R works with data in RAM.
- Another problem: heterogeneity of data, packages, documentation.
- Open source: community of users, documentation.
Best Design Platform for Designing a Statistical Analysis
- Comparing sophisticated methods. Almost 4,000 packages available.
- Compare 20 methods on a new type of data in a week.
- High-quality visualization.
- Good profiling tools.
- Highly flexible connections available: Java, Python, as well as C++, Fortran.
Prototyping and Teaching
- Examples: choosing a statistical learning algorithm for segmentation.
- Teaching Data Mining.
- Workflow design.
- Visualization, pattern searching.
R Plays (well) with others
- Data connections (> 80% of a statistician's time is spent preparing the data).
- Program connections (C, Fortran, Java, JavaScript, Matlab, Perl (FEST), PHP (amap), Python or Tcl).
- Chambers and coworkers: Omegahat Project
  1. RFirefox
  2. RGoogleDocs
  3. ...
  4. R2GoogleMaps
  5. RJavaScript
  6. RAmazonS3
  7. RRuby
  8. Rcrypt
  9. RGraphicsDevice ...
Seamless Interaction with Standardized Databases
- Through the Bioconductor package AnnotationDbi we have access to GenBank, the Gene Ontology Consortium, Entrez genes, UniGene, the UCSC Human Genome Project, KEGG.
- metlin, ChemmineR, rcdklibs, rpubchem.
Slicing and Dicing Data
- plyr (Wickham).
- Subsampling, thinning (using random number generation, very high quality).
- Missing data imputation, filtering...
  - mitools provides tools for multiple imputation
  - mice (chained equations)
  - mvnmle (ML estimation)
  - mix provides multiple imputation for mixed categorical and continuous data
  - pan for missing panel data
  - VIM for visualisation as well as imputation
  - ....
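The subsampling and thinning idea above can be sketched in a language-neutral way. The Python below is an illustration only, not the API of plyr or of any R package: `reservoir_sample` and `thin` are hypothetical helper names, and the seed and sizes are arbitrary. It draws a fixed-size uniform subsample from a stream whose length need not be known in advance, and keeps every k-th row for thinning.

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Return k items drawn uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)        # seeded generator, so the subsample is reproducible
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)      # fill the reservoir with the first k items
        else:
            j = rng.randint(0, i)    # uniform integer in [0, i], both ends inclusive
            if j < k:
                sample[j] = item     # keep the new item with probability k / (i + 1)
    return sample

def thin(stream, step):
    """Systematic thinning: keep every step-th item, starting with the first."""
    return [item for i, item in enumerate(stream) if i % step == 0]

# Hypothetical data: row indices standing in for rows of a large table.
subsample = reservoir_sample(range(100_000), k=5, seed=42)
thinned = thin(range(20), step=5)
```

Only the reservoir of `k` items is held in memory, which is why this style of subsampling suits data that are streamed from disk.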
Packages Tailored to Specific Formats
- Images.
- Maps.
- Trees, networks and graphs.
- Microarrays, mass spectroscopy.
- DNA, AA, sequencing data.
- Texts: Snowball, RKEA.
Some standard formats and their packages
R Import and Export
- DBI, foreign, gdata
- hdf5
- RNetCDF, ncdf, ncdf4
- RMySQL, RODBC, ROracle
- RPostgreSQL, RSQLite
- WriteXLS, XLConnect, xlsReadWrite, dataframes2xls
Image Formats and Analysis
- Formats for medical images: DICOM: packages oro.dicom, fmri.
- ANALYZE format: AnalyzeFMRI.
- Microscopic imaging: EBImage.
- MRI images (any format): TractoR.
Microarray Data Analysis
- Bioconductor packages: marray, limma, affy.
- Renormalization: vsn.
- Plotting: heatmap.
[Figure: heatmap of 20 arrays, GSM98548.CEL to GSM98567.CEL; color key: row Z-score]
Mass Spectroscopy Data
See the CRAN task view for chemistry and physics. Packages:
- BioC: MSnbase (mass spectrometry-based proteomics data handling, plotting, processing and quantification).
- MALDIquant for MALDI-TOF mass spectrometry data.
- OrgMassSpecR for organic/biological mass spectrometry.
- FTICRMS for analyzing Fourier Transform-Ion Cyclotron Resonance mass spectrometry data.
- msProcess for protein mass spectra processing.
- titan: GUI to analyze mass spectrometric data on the relative abundance of two substances from a titration series.
- Bioconductor packages: MassSpecWavelet, PROcess, xcms.
Plotting
High-powered graphics packages with layered functionalities:
- ade4.
- Philosophy: Leland Wilkinson's Grammar of Graphics.
- Implementation: ggplot2 (see Hadley Wickham's YouTube video).
Very Large Problems: HPC
High-performance packages (a page full)
Parallel computing: explicit parallelism
- Rmpi can be used with LAM/MPI, MPICH / MPICH2, Open MPI, Deino MPI.
- nws (NetWorkSpaces) packages from REvolution Computing.
- snow (Simple Network of Workstations) package (update: snowfall).
- Rdsm package provides a threads-like parallel computing environment, both on multicore machines and across a network.
Implicit parallelism
- pnmath package by Tierney uses OpenMP parallel directives.
- pnmath0 package uses Pthreads.
- multicore package by Urbanek.
- mapReduce package by Brown follows Google's MapReduce approach.
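For readers unfamiliar with the MapReduce approach mentioned in the last item, here is a minimal single-process sketch of its three stages (map, group by key, reduce) as a word count. It is written in Python for illustration and does not reproduce the interface of the mapReduce R package or of Hadoop; the function names and the sample records are assumptions made for this example.

```python
from collections import defaultdict
from functools import reduce

def map_phase(record):
    """Emit (key, value) pairs for one input record: here, (word, 1)."""
    return [(word.lower(), 1) for word in record.split()]

def shuffle(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key: here, a sum."""
    return key, reduce(lambda a, b: a + b, values)

# Hypothetical input records; in a real system each could live on a different node.
records = ["R plays well with others", "R and Hadoop", "Hadoop and R"]
pairs = [pair for record in records for pair in map_phase(record)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Because `map_phase` looks at one record at a time and `reduce_phase` at one key at a time, both stages can be spread over many workers, which is the point of the design.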
Parallel computing: grid computing
- GridR package provides a grid computing environment via a web service.
- biocep-distrib offers a Java-based framework for local, Grid, or Cloud computing.
- RHIPE package by Guha provides an interface between R and Hadoop for Map/Reduce.
- xgrid package to use Apple Xgrid clusters from within R.
Data that don't fit in RAM
Some recent packages for large data
- segue (example using segue)
- multicore
- bigmemory
- ff
- RHIPE
Example of using ff
Limits R's RAM consumption through chunked processing.
See Oehlschlägel (2010), Managing large datasets in R, for examples.
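The idea of chunked processing can be sketched independently of ff. The Python below is an illustration under stated assumptions, not ff's interface: `read_chunks` and `chunked_mean` are hypothetical helpers, and an in-memory string stands in for a large file on disk. It computes a column mean while holding only one chunk of rows in memory at a time.

```python
import csv
import io

def read_chunks(handle, chunk_size):
    """Yield lists of at most chunk_size parsed rows from a CSV file handle."""
    reader = csv.DictReader(handle)
    chunk = []
    for row in reader:
        chunk.append(row)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:                      # last, possibly shorter, chunk
        yield chunk

def chunked_mean(handle, column, chunk_size=1000):
    """Mean of one numeric column, keeping only one chunk in memory at a time."""
    total, count = 0.0, 0
    for chunk in read_chunks(handle, chunk_size):
        total += sum(float(row[column]) for row in chunk)
        count += len(chunk)
    return total / count if count else float("nan")

# Stand-in for a large file on disk (hypothetical data).
data = io.StringIO("id,value\n" + "\n".join(f"{i},{i * 0.5}" for i in range(10)))
mean_value = chunked_mean(data, "value", chunk_size=3)
```

Peak memory is governed by `chunk_size` rather than by the size of the file, which is the trade-off chunked processing makes: more passes over disk in exchange for bounded RAM.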
Large memory and out-of-memory data solutions
- biglm (Lumley et al.).
- ff package by Adler et al. offers file-based access to data sets.
- bigmemory package by Kane and Emerson. Advantage: several R processes on the same computer can also share big memory objects.
- HadoopStreaming package: map/reduce scripts for use in Hadoop Streaming.
- speedglm fits (generalised) linear models to large data. Also has fast updating.
- biglars package by Seligman et al. uses ff for least-angle regression, lasso and stepwise regression.
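To illustrate how a linear model can be fitted to data that never sit in RAM all at once, here is a minimal Python sketch of a straight-line fit that is updated chunk by chunk. It is not the algorithm of biglm or speedglm: those packages use more careful numerical methods, while this sketch swaps in plain running sums for readability. The class name `RunningLinearFit` and the data are hypothetical.

```python
class RunningLinearFit:
    """Least-squares line y = a + b*x, updated one chunk at a time."""

    def __init__(self):
        self.n = 0
        self.sx = self.sy = self.sxx = self.sxy = 0.0

    def update(self, xs, ys):
        """Fold one chunk of (x, y) values into the running sums."""
        for x, y in zip(xs, ys):
            self.n += 1
            self.sx += x
            self.sy += y
            self.sxx += x * x
            self.sxy += x * y

    def coefficients(self):
        """Return (intercept, slope) computed from the accumulated sums."""
        denom = self.n * self.sxx - self.sx ** 2
        if denom == 0:
            raise ValueError("need at least two distinct x values")
        slope = (self.n * self.sxy - self.sx * self.sy) / denom
        intercept = (self.sy - slope * self.sx) / self.n
        return intercept, slope

fit = RunningLinearFit()
fit.update([0, 1, 2], [1.0, 3.0, 5.0])   # first chunk
fit.update([3, 4], [7.0, 9.0])           # second chunk, e.g. read later from disk
intercept, slope = fit.coefficients()
```

Only five numbers are kept between chunks, so the memory cost does not grow with the number of rows; the price is that plain sums can lose precision on badly scaled data.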
R and Hadoop
Some current projects:
RevoScaleR, a proprietary package:
* a new binary 'Big Data' file format, XDF, with an interface to the R language that provides high-speed access to arbitrary rows, blocks, and columns of data
* data reading and transformation tools to prepare large data sets for analysis
(not available on Macs)
R and Hadoop
Ricardo: Integrating R and Hadoop. A project from IBM: Das, Sismanis, Beyer, Gemulla, Haas, McPherson.
Decompose as much as possible into:
- Large part: Join/Merge/Sort, handled by Jaql/Hadoop
- → small part: Vectors/Matrices, handled by R
Hard Problems
These require access to all the data to be solved.
- Distances between all objects.
- Finding nearest neighbors with complex distances.
- Extreme values.
- Data integration.
- Still weak in the dynamic/interactive graphic realm.
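A quick calculation shows why "distances between all objects" belongs on this list: the number of pairs grows quadratically with the number of objects. The Python sketch below is illustrative only; it assumes each distance is stored once as an 8-byte double, and the sample sizes are arbitrary.

```python
def n_pairs(n):
    """Number of unordered pairs among n objects: n(n-1)/2."""
    return n * (n - 1) // 2

def distance_matrix_bytes(n, bytes_per_value=8):
    """Bytes needed to store each pairwise distance once (assumed 8-byte doubles)."""
    return n_pairs(n) * bytes_per_value

# Print how storage grows for a few hypothetical data set sizes.
for n in (1_000, 100_000, 1_000_000):
    gib = distance_matrix_bytes(n) / 2**30
    print(f"n = {n:>9,}: {n_pairs(n):>15,} pairs, about {gib:,.1f} GiB")
```

Under these assumptions a million objects already need roughly 4 terabytes for the distances alone, far beyond what fits in RAM.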
Current Difficulties with Many Tools
Heterogeneity
- Heterogeneity in package quality.
- Levels of maintenance (we get what we pay for).
- Documentation.
- Heterogeneity of the data we need to deal with.
- Package quality and maintenance: commercial versions: Revolution.
- Documentation and quality control: Bioconductor model.
- Heterogeneity of data structures... still largely open...
Data Heterogeneity
- Status: response/explanatory.
- Hidden (latent)/measured.
- Type:
  - Continuous
  - Binary, categorical
  - Graphs/Trees
  - Images
  - Maps/Spatial information
  - Rankings
- Amounts of dependency: independent/time series/spatial.
Goals in Modern Biology: Systems Approach
Look at the data/ all the data: data integration
[Figure: data integration example; the only recoverable label is "Tumor Cells"]
Thanks to...
The organizers of this XLDB conference for inviting me.
NIH for funding.
My PhD student, Nelson Ray, for useful conversations about large data mining projects (and about the Data Mining class Stats202).