Extremely Large Data Challenges: What R can and can't do
Susan Holmes
http://www-stat.stanford.edu/~susan/
Bio-X and Statistics, Stanford University
NIH-R01GM086884

A roadmap
[Figure: xkcd]

Some Advantages of R
- Reproducibility: Sweave, history, ...
- Prototyping: availability of > 3500 packages, all statistical/machine learning methods available.
- Data input and cleaning: multiple formats.
- Visualization and presentation: high quality graphics: ggplot2.
- Open source: community of users, documentation.

Some Difficulties in R
- Main drawback: by default works with data in RAM.
- Another problem: heterogeneity: data, packages, documentation.
- Open source: community of users, documentation.

Best Design Platform for Designing a Statistical Analysis
- Comparing sophisticated methods. Almost 4,000 packages available.
- Compare 20 methods on a new type of data in a week.
- High quality visualization.
- Good profiling tools.
- Highly flexible connections available: Java, Python, as well as C++, Fortran.

Prototyping and Teaching
- Examples: choosing a statistical learning algorithm for segmentation.
- Teaching Data Mining.
- Workflow design.
- Visualization, pattern searching.

R Plays (well) with others
- Data connections (> 80% of a statistician's time is spent preparing the data).
- Program connections (C, Fortran, Java, JavaScript, Matlab, Perl (FEST), php (amap), Python or Tcl).
- Chambers and coworkers: Omegahat Project
  1. RFirefox
  2. RGoogleDocs
  3. ...
  4. R2GoogleMaps
  5. RJavaScript
  6. RAmazonS3
  7. RRuby
  8. Rcrypt
  9. RGraphicsDevice
  ...

Seamless Interaction with standardized Databases
- Through the Bioconductor package AnnotationDbi we have access to GenBank, the Gene Ontology Consortium, Entrez genes, UniGene, the UCSC Human Genome Project, KEGG.
- metlin, ChemmineR, rcdklibs, rpubchem

Slicing and Dicing Data
- plyr (Wickham)
- Subsampling, thinning (using random number generation, very high quality).
- Missing data imputation, filtering...
  - mitools provides tools for multiple imputation
  - mice (chained equations)
  - mvnmle (ML estimation)
  - mix provides multiple imputation for mixed categorical and continuous data.
  - pan for missing panel data.
  - VIM for visualisation as well as imputation
  - ....

Packages Tailored to specific formats
- Images.
- Maps.
- Trees, Networks and Graphs
- Microarrays, Mass Spectroscopy
- DNA, AA, Sequencing Data
- Texts: Snowball, RKEA

Some standard formats and their packages
R Import and Export
- DBI, foreign, gdata
- hdf5
- RNetCDF, ncdf, ncdf4
- RMySQL, RODBC, ROracle
- RPostgreSQL, RSQLite
- WriteXLS, XLConnect, xlsReadWrite, dataframes2xls

Image Formats and Analysis
- Formats for medical images, DICOM: packages oro.dicom, fmri
- ANALYZE format: AnalyzeFMRI
- Microscopic imaging: EBImage.
- MRI images (any format): TractoR

Microarray Data Analysis
- Bioconductor packages: marray, limma, affy.
- Renormalization: vsn.
- Plotting: heatmap.
[Figure: heatmap of 20 samples, GSM98548.CEL to GSM98567.CEL, with a color key for the row Z-score]

Mass Spectroscopy Data
See cran view Chemical and Physics Tasks. Packages:
- BioC: MSnbase (mass spectrometry-based proteomics data handling, plotting, processing and quantification).
- MALDIquant for MALDI-TOF mass spect data
- OrgMassSpecR: organic/biological mass spectrometry
- FTICRMS for analyzing Fourier Transform-Ion Cyclotron Resonance Mass Spectrometry data.
- msProcess for protein mass spectra processing.
- titan: GUI to analyze mass spectrometric data on the relative abundance of two substances from a titration series.
- Bioconductor packages: MassSpecWavelet, BioC:PROcess, BioC:xcms

Plotting
High powered graphics packages with layered functionalities:
- ade4.
- Philosophy: Leland Wilkinson's Grammar of Graphics.
- Implementation: ggplot2 (Hadley Wickham youtube video)

Very Large Problems: HPC
High performance packages (a page full)
Parallel computing: Explicit parallelism
- Rmpi can be used with LAM/MPI, MPICH / MPICH2, Open MPI, Deino MPI.
- nws (NetWorkSpaces) packages from REvolution Computing.
- snow (Simple Network of Workstations) package (update: snowfall)
- Rdsm package provides a threads-like parallel computing environment, both on multicore and network.

Implicit parallelism
- pnmath package by Tierney uses OpenMP parallel directives
- pnmath0 package uses Pthreads.
- multicore package by Urbanek.
- mapReduce package by Brown follows the Google mapReduce approach.

Parallel computing: Grid computing
- GridR package: grid computing environment via a web service
- biocep-distrib offers a Java-based framework for local, Grid, or Cloud computing.
- RHIPE package by Guha provides an interface between R and Hadoop for Map/Reduce.
- xgrid package to use Apple Xgrid clusters from within R.

Data that don't fit in RAM

Some recent packages for large data
- segue (Example using segue)
- multicore
- bigmemory
- ff
- RHIPE

Example of using ff
Limits R's RAM consumption through chunked processing.
See Oehlschlägel (2010), Managing large datasets in R, for examples.

Large memory and out-of-memory data solutions
- biglm (Lumley et al.)
- ff package by Adler et al. offers file-based access to data sets.
- bigmemory package by Kane and Emerson. Advantage: several R processes on the same computer can also share big memory objects.
- HadoopStreaming package: map/reduce scripts for use in Hadoop Streaming
- speedglm fits (generalised) linear models to large data. Also has fast updating.
- biglars package by Seligman et al. uses ff for least-angle regression, lasso and stepwise regression.

R and Hadoop
Some current projects:
RevoScaleR proprietary package:
- a new binary 'Big Data' file format, XDF, with an interface to the R language that provides high-speed access to arbitrary rows, blocks, and columns of data
- data reading and transformation tools to prepare large data sets for analysis
(not available on Macs)

R and Hadoop
Ricardo: Integrating R and Hadoop. A project from IBM: Das, Sismanis, Beyer, Gemulla, Haas, McPherson
Decompose as much as possible into:
  Large part         Small part
  Join/Merge/Sort →  Vectors/Matrices
  Jaql/Hadoop        R

Hard Problems
These require access to all the data to be solved.
- Distances between all objects.
- Finding nearest neighbors with complex distances.
- Extreme values.
- Data integration.
- Still weak in the dynamic/interactive graphic realm.

Current Difficulties with Many Tools
Heterogeneity
- Heterogeneity in package quality
- Levels of maintenance (we get what we pay for).
- Documentation.
- Heterogeneity of the data we need to deal with.

- Package quality and maintenance: commercial versions: Revolution.
- Documentation and quality control: Bioconductor model.
- Heterogeneity of data structures... still largely open...

Data Heterogeneity
- Status: response/explanatory.
- Hidden (latent)/measured.
- Type:
  - Continuous
  - Binary, categorical
  - Graphs/Trees
  - Images
  - Maps/Spatial information
  - Rankings
- Amounts of dependency: independent/time series/spatial.

Goals in Modern Biology: Systems Approach
Look at the data/all the data: data integration
[Figure: Tumor Cells]
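
The chunked processing mentioned for ff, and the large part / small part split mentioned for Ricardo, both rest on reducing each block of data to a small summary that can be merged later. Below is a minimal sketch of that idea, written in Python rather than R. It is not the API of ff, biglm, RHIPE, or any other package named in these slides. All function names (`read_chunks`, `summarize`, `merge`, `chunked_mean_var`) are hypothetical, and the in-memory `demo` stream stands in for a file too large for RAM.

```python
import csv
import io
from itertools import islice


def read_chunks(handle, chunk_size):
    """Yield lists of at most chunk_size floats from a one-column text stream."""
    reader = csv.reader(handle)
    while True:
        chunk = [float(row[0]) for row in islice(reader, chunk_size)]
        if not chunk:
            return
        yield chunk


def summarize(chunk):
    """Map step: reduce one chunk to (count, mean, sum of squared deviations)."""
    n = len(chunk)
    mean = sum(chunk) / n
    m2 = sum((x - mean) ** 2 for x in chunk)
    return n, mean, m2


def merge(a, b):
    """Reduce step: combine two partial summaries into one."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta ** 2 * n_a * n_b / n
    return n, mean, m2


def chunked_mean_var(handle, chunk_size=1000):
    """Return (n, mean, sample variance), holding only one chunk in memory."""
    total = None
    for chunk in read_chunks(handle, chunk_size):
        part = summarize(chunk)
        total = part if total is None else merge(total, part)
    if total is None:
        raise ValueError("no data")
    n, mean, m2 = total
    variance = m2 / (n - 1) if n > 1 else float("nan")
    return n, mean, variance


# Hypothetical stand-in for a large file: ten values, read three at a time.
demo = io.StringIO("\n".join(str(v) for v in range(1, 11)))
n, mean, variance = chunked_mean_var(demo, chunk_size=3)
```

Because `merge` only needs two small summaries, the per-chunk work could in principle run on separate workers and be combined afterwards. Problems such as all-pairs distances do not decompose this way, which is why they appear under Hard Problems.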

Thanks to...
- The organizers of this XLDB conference for inviting me.
- NIH for funding.
- My PhD student, Nelson Ray, for useful conversations about large data mining projects (and about the Data Mining class Stats202).