Visualizing and Clustering Life Science Applications in Parallel
HiCOMB 2015 14th IEEE International Workshop on High Performance Computational Biology at IPDPS 2015
Hyderabad, India
May 25 2015
Geoffrey Fox
[email protected]
http://www.infomall.org
School of Informatics and Computing
Digital Science Center
Indiana University Bloomington
5/25/2015 1

Deterministic Annealing Algorithms

Some Motivation
• Big Data requires high performance; achieve with parallel computing
• Big Data sometimes requires robust algorithms as there is more opportunity to make mistakes
• Deterministic annealing (DA) is one of the better approaches to robust optimization and is broadly applicable
  – Started as “Elastic Net” by Durbin for Travelling Salesman Problem TSP
  – Tends to remove local optima
  – Addresses overfitting
  – Much faster than simulated annealing
• Physics systems find true lowest energy state if you anneal, i.e. you equilibrate at each temperature as you cool
• Uses mean field approximation, which is also used in “Variational Bayes” and “Variational inference”

(Deterministic) Annealing
• Find minimum at high temperature when trivial
• Small change avoiding local minima as temperature is lowered
• Typically gets better answers than standard libraries (R and Mahout)
• And can be parallelized and put on GPUs etc.

General Features of DA
• In many problems, decreasing temperature is classic multiscale: finer resolution (√T is “just” distance scale)
• In clustering √T is distance in space of points (and centroids); for MDS it is the scale in mapped Euclidean space
• At T = ∞, all points are in same place: the center of universe
• For MDS all Euclidean points are at center and distances are zero.
For clustering, there is one cluster
• As temperature is lowered there are phase transitions in clustering cases where clusters split
  – Algorithm determines whether a split is needed as second derivative matrix becomes singular
• Note DA has similar features to hierarchical methods and you do not have to specify a number of clusters; you need to specify a final distance scale

Math of Deterministic Annealing
• $H(\phi)$ is the objective function to be minimized as a function of parameters $\phi$ (as in the Stress formula for MDS)
• Gibbs Distribution at Temperature T: $P(\phi) = \exp(-H(\phi)/T) \,/ \int d\phi \, \exp(-H(\phi)/T)$
• Or $P(\phi) = \exp(-H(\phi)/T + F/T)$
• Use the Free Energy combining Objective Function and Entropy: $F = \langle H - T S(P) \rangle = \int d\phi \, \{ P(\phi) H(\phi) + T P(\phi) \ln P(\phi) \}$
• Simulated annealing performs these integrals by Monte Carlo
• Deterministic annealing corresponds to doing the integrals analytically (by mean field approximation) and is much, much faster
• Need to introduce a modified Hamiltonian for some cases so that integrals are tractable.
Introduce extra parameters to be varied so that modified Hamiltonian matches original
• In each case temperature is lowered slowly, say by a factor 0.95 to 0.9999 at each iteration

Some Uses of Deterministic Annealing DA
• Clustering improved K-means
  – Vectors: Rose (Gurewitz and Fox) 1990; 486 citations encouraged me to revisit
  – Clusters with fixed sizes and no tails (Proteomics team at Broad)
  – No Vectors: Hofmann and Buhmann (just use pairwise distances)
• Dimension Reduction for visualization and analysis
  – Vectors: GTM Generative Topographic Mapping
  – No vectors: SMACOF Multidimensional Scaling (MDS) (just use pairwise distances)
• Can apply to HMM & general mixture models (less study)
  – Gaussian Mixture Models
  – Probabilistic Latent Semantic Analysis with Deterministic Annealing DA-PLSA as alternative to Latent Dirichlet Allocation for finding “hidden factors”
• Have scalable parallel versions of much of above, mainly Java
• Many clustering methods; not clear what is best although DA is pretty good and improves K-means at increased computing cost, which is not always useful
• DA clearly improves MDS, which is the ~“only reliable” dimension reduction?

Clusters v. Regions
[Figure panels: Lymphocytes 4D; Pathology 54D]
• In Lymphocytes clusters are distinct; DA useful
• In Pathology, clusters divide space into regions and sophisticated methods like deterministic annealing are probably unnecessary

Some Problems we are working on
• Analysis of Mass Spectrometry data to find peptides by clustering peaks (Broad Institute/Hyderabad)
  – ~0.5 million points in 2 dimensions (one collection) -- ~50,000 clusters summed over charges
• Metagenomics: 0.5 million (increasing rapidly) points NOT in a vector space; hundreds of clusters per sample
  – Apply MDS to Phylogenetic trees
• Pathology Images >50 Dimensions
• Social image analysis is in a highish dimension vector space
  – 10-50 million images; 1000 features per image; million clusters
• Finding communities from network graphs coming from Social media contacts etc.

MDS and Clustering on ~60K EMR
[Figure]

[Figure] Colored by sample, not clustered. Distances from vectors in word space. This uses 128 4-mers for ~200K sequences

Examples of current DA algorithms: LCMS

Background on LC-MS
• Remarks of collaborators (Broad Institute/Hyderabad)
• Abundance of peaks in “label-free” LC-MS enables large-scale comparison of peptides among groups of samples.
• In fact when a group of samples in a cohort is analyzed together, not only is it possible to “align” robustly or cluster the corresponding peaks across samples, but it is also possible to search for patterns or fingerprints of disease states which may not be detectable in individual samples.
• This property of the data lends itself naturally to big data analytics for biomarker discovery and is especially useful for population-level studies with large cohorts, as in the case of infectious diseases and epidemics.
• With increasingly large-scale studies, the need for fast yet precise cohort-wide clustering of large numbers of peaks assumes technical importance.
• In particular, a scalable parallel implementation of a cohort-wide peak clustering algorithm for LC-MS-based proteomic data can prove to be a critically important tool in clinical pipelines for responding to global epidemics of infectious diseases like tuberculosis, influenza, etc.

Proteomics 2D DA Clustering T= 25000 with 60 Clusters (will be 30,000 at T=0.025)
[Figure]

[Figure] The brownish triangles are “sponge” (soaks up trash) peaks outside any cluster. The colored hexagons are peaks inside clusters with the white hexagons being determined cluster center
Fragment of 30,000 Clusters, 241605 Points

Continuous Clustering
• This is a very useful subtlety introduced by Ken Rose but not widely known although it greatly improves algorithm
• Take a cluster k to be split into 2 with centers Y(k)A and Y(k)B with initial values Y(k)A = Y(k)B at original center Y(k)
• Then typically if you make this change and perturb the Y(k)A and Y(k)B, they will return to starting position as F is at a stable minimum (positive eigenvalue)
• But instability (the negative eigenvalue) can develop and one finds
  [Figure: Free Energy F plotted against Y(k)A + Y(k)B and against Y(k)A - Y(k)B]
• Implement by adding arbitrary number p(k) of centers for each cluster: $Z_i = \sum_{k=1}^{K} p(k) \exp(-\epsilon_i(k)/T)$ and M step gives $p(k) = C(k)/N$
• Halve p(k) at splits; can’t split easily in standard case p(k) = 1
• Shows weighting in sums like $Z_i$ is now equipoint not equicluster, as p(k) is proportional to the number of points C(k) in cluster

Trimmed Clustering
• Clustering with position-specific constraints on variance: Applying redescending M-estimators to label-free LC-MS data analysis (Rudolf Frühwirth, D R Mani and Saumyadipta Pyne) BMC Bioinformatics 2011, 12:358
• $H_{TCC} = \sum_{k=0}^{K} \sum_{i=1}^{N} M_i(k) \, f(i,k)$
  – $f(i,k) = (X(i) - Y(k))^2 / 2\sigma(k)^2$ for k > 0
  – $f(i,0) = c^2 / 2$ for k = 0
• The 0’th cluster captures (at zero temperature) all points outside clusters (background)
• Clusters are trimmed: $(X(i) - Y(k))^2 / 2\sigma(k)^2 < c^2 / 2$
• Relevant when well defined errors
[Figure: curves for T~0, T=1 and T=5 against distance from cluster center]

Cluster Count v. Temperature for 2 Runs
[Figure: cluster count (0 to 60000) against temperature (1.00E+06 down to 1.00E-03) for DAVS(2) and DA2D; annotations: Start Sponge DAVS(2), Sponge Reaches final value, Add Close Cluster Check]
• All start with one cluster at far left
• T=1 special as measurement errors divided out
• DA2D counts clusters with 1 member as clusters. DAVS(2) does not

Simple Parallelism as in k-means
• Decompose points i over processors
• Equations either pleasingly parallel “maps” over i
• Or “All-Reductions” summing over i for each cluster
• Parallel Algorithm:
  – Each process holds all clusters and calculates contributions to clusters from points in node
  – e.g. $Y(k) = \sum_{i=1}^{N} \langle M_i(k) \rangle X_i / C(k)$
• Runs well in MPI or MapReduce
  – See all the MapReduce k-means papers

Better Parallelism
• The previous model is correct at start but each point does not really contribute to each cluster as it is damped exponentially by $\exp(-(X_i - Y(k))^2 / T)$
• For Proteomics problem, on average only 6.45 clusters needed per point if require $(X_i - Y(k))^2 / T \le {\sim}40$ (as exp(-40) is small)
• So only need to keep nearby clusters for each point
• As average number of Clusters ~ 20,000, this gives a factor of ~3000 improvement
• Further, communication is no longer all global; it has nearest neighbor components and is calculated by parallelism over clusters, which can be done in parallel if separated

[Figure] Speedups for several runs on Tempest from 8-way through 384 way MPI parallelism with one thread per process. We look at different choices for MPI processes which are either inside nodes or on separate nodes

[Figure] Parallelism within a Single Node of Madrid Cluster. A set of runs on 241605 peak data with a single node with 16 cores with either threads or MPI giving parallelism.
Parallelism is either number of threads or number of MPI processes.
[Figure axis: Parallelism (#threads or #processes)]

METAGENOMICS -- SEQUENCE CLUSTERING
Non-metric Spaces O(N2) Algorithms: Illustrate Phase Transitions

• Start at T = “∞” with 1 Cluster
• Decrease T, Clusters emerge at instabilities
[Figures]

[Figure] 446K sequences ~100 clusters

METAGENOMICS -- SEQUENCE CLUSTERING
Non-metric Spaces O(N2) Algorithms: Compare Other Methods

“Divergent” Data Sample, 23 True Clusters
[Figure panels: DA-PWC; UClust; CDhit]

Divergent Data Set: DAPWC compared with UClust (Cuts 0.65 to 0.95)
Measure | DAPWC | UClust 0.65 | UClust 0.75 | UClust 0.85 | UClust 0.95
Total # of clusters | 23 | 4 | 10 | 36 | 91
Total # of clusters uniquely identified (i.e. one original cluster goes to 1 uclust cluster) | 23 | 0 | 0 | 13 | 16
Further rows (cell values follow the labels in extraction order):
  – Total # of shared clusters with significant sharing (one uclust cluster goes to > 1 real cluster)
  – Total # of uclust clusters that are just part of a real cluster (numbers in brackets only have one member)
  – Total # of real clusters that are 1 uclust cluster but uclust cluster is spread over multiple real clusters
  – Total # of real clusters that have significant contribution from > 1 uclust cluster
  0 4 10 5 0 4 10 0 14 9 5 0 9 14 5 0 17(11) 72(62) 0 7

PROTEOMICS
No clear clusters

[Figure] Protein Universe Browser for COG Sequences with a few illustrative biologically identified clusters

[Figure] Heatmap of biology distance (Needleman-Wunsch) vs 3D Euclidean Distances
If d is a distance, so is f(d) for any monotonic f. Optimize choice of f

O(N2) ALGORITHMS?

Algorithm Challenges
• See NRC Massive Data Analysis report
• O(N) algorithms for O(N2) problems
• Parallelizing Stochastic Gradient Descent
• Streaming data algorithms: balance and interplay between batch methods (most time consuming) and interpolative streaming methods
• Graph algorithms: need shared memory?
• Machine Learning Community uses parameter servers; Parallel Computing (MPI) would not recommend this?
  – Is classic distributed model for “parameter service” better?
• Apply best of parallel computing (communication and load balancing) to Giraph/Hadoop/Spark
• Are data analytics sparse? Many cases are full matrices
• BTW Need Java Grande: some C++ but Java most popular in ABDS, with Python, Erlang, Go, Scala (compiles to JVM) …..

[Figure] “clean” sample of 446K
• O(N2) green-green and purple-purple interactions have value but green-purple are “wasted”
• O(N2) interactions between green and purple clusters should be representable by centroids as in Barnes-Hut.
• Hard as no Gauss theorem; no multipole expansion and points really in 1000 dimension space as clustered before 3D projection

[Figure] Use Barnes Hut OctTree, originally developed to make O(N2) astrophysics O(NlogN), to give similar speedups in machine learning

[Figure] OctTree for 100K sample of Fungi
We use OctTree for logarithmic interpolation (streaming data)

Fungi Analysis
• Multiple Species from multiple places
• Several sources of sequences starting with 446K and eventually boiled down to ~10K curated sequences with 61 species
• Original sample: clustering and MDS
• Final sample: MDS and other clustering methods
• Note MSA and SWG give similar results
• Some species are clearly split
• Some species are diffuse; others compact, making a fixed distance cut unreliable
  – Easy for humans!
• MDS very clear on structure and clustering artifacts
• Why not do “high-value” clustering as interactive iteration driven by MDS?
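
The Barnes-Hut idea raised a few slides back, that interactions between two well-separated clusters could be represented by their centroids, can be illustrated in the 3D mapped space with a small NumPy sketch. The cluster positions and sizes and the helper names `exact_cross_sum` and `centroid_cross_sum` are assumptions made for illustration only; the slides point out that the real difficulty lies in the original high-dimensional, non-vector data, which this toy example does not address.

```python
import numpy as np

def exact_cross_sum(A, B):
    """Sum of Euclidean distances over all pairs (a in A, b in B): O(|A| * |B|) work."""
    diff = A[:, None, :] - B[None, :, :]
    return float(np.sqrt((diff ** 2).sum(axis=-1)).sum())

def centroid_cross_sum(A, B):
    """Barnes-Hut style estimate: every cross pair uses the centroid-to-centroid distance."""
    gap = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))
    return float(len(A) * len(B) * gap)

# Two compact, well-separated clusters in 3D (hypothetical "green" and "purple").
rng = np.random.default_rng(0)
green = rng.normal(loc=[0.0, 0.0, 0.0], scale=1.0, size=(300, 3))
purple = rng.normal(loc=[20.0, 0.0, 0.0], scale=1.0, size=(400, 3))

exact = exact_cross_sum(green, purple)        # 120,000 pair distances
approx = centroid_cross_sum(green, purple)    # one distance, scaled by the pair count
relative_error = abs(exact - approx) / exact
```

The centroid estimate can only undershoot the exact sum (the mean of the pair distances is never smaller than the distance between the means), and the gap shrinks as the clusters become more compact relative to their separation.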
Fungi -- 4 Classic Clustering Methods
[Figures]

Same Species
[Figure]

Same Species Different Locations
[Figure]

Parallel Data Mining

Parallel Data Analytics
• Streaming algorithms have interesting differences but
• “Batch” Data analytics is “just classic parallel computing” with usual features such as SPMD and BSP
• Expect similar systematics to simulations where
• Static Regular problems are straightforward but
• Dynamic Irregular Problems are technically hard and high level approaches fail (see High Performance Fortran HPF)
  – Regular meshes worked well but
  – Adaptive dynamic meshes did not although “real people with MPI” could parallelize
• However using libraries is successful at either
  – Lowest: communication level
  – Higher: “core analytics” level
• Data analytics does not yet have “good regular parallel libraries”
  – Graph analytics has most attention

Remarks on Parallelism I
• Maximum Likelihood or $\chi^2$ both lead to objective functions like
• Minimize $\sum_{i=1}^{N}$ (Positive nonlinear function of unknown parameters for item i)
• Typically decompose items i and parallelize over both i and parameters to be determined
• Solve iteratively with (clever) first or second order approximation to shift in objective function
  – Sometimes steepest descent direction; sometimes Newton
  – Have classic Expectation Maximization structure
  – Steepest descent shift is sum over shift calculated from each point
• Classic method: take all (millions) of items in data set and move full distance
  – Stochastic Gradient Descent SGD: take randomly a few hundred of items in data set and calculate shifts over these and move a tiny distance
  – SGD cannot parallelize over items

Remarks on Parallelism II
• Need to cover non vector semimetric and vector spaces for clustering and dimension reduction (N points in space)
• Semimetric spaces just have pairwise distances $\delta(i, j)$ defined between points in space
• MDS Minimizes Stress and illustrates this: $\sigma(X) = \sum_{i<j=1}^{N} \mathrm{weight}(i,j) \, (\delta(i, j) - d(X_i, X_j))^2$
• Vector spaces have Euclidean distance and scalar products
  – Algorithms can be O(N) and these are best for clustering, but for MDS O(N) methods may not be best as the obvious objective function is O(N2)
  – Important new algorithms needed to define O(N) versions of current O(N2); “must” work intuitively and shown in principle
• Note matrix solvers often use conjugate gradient, which converges in 5-100 iterations: a big gain for a matrix with a million rows. This removes a factor of N in time complexity
• Ratio of #clusters to #points important; new clustering ideas if ratio >~ 0.1

Problem Structure
• Note learning networks have huge number of parameters (11 billion in Stanford work) so that it is inconceivable to look at second derivative
• Clustering and MDS have lots of parameters but it can be practical to look at second derivative and use Newton’s method to minimize
• Parameters are determined in distributed fashion but are typically needed globally
  – MPI: use broadcast and “AllCollectives”, implying Map-Collective is a useful programming model
  – AI community: use parameter server and access as needed. Non-optimal?
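
As a concrete reading of the Stress objective on the Remarks on Parallelism II slide, the following minimal NumPy sketch evaluates the weighted Stress for a candidate embedding. The function name `stress`, the dense-matrix layout and the toy data are illustrative assumptions, not part of the original slides; this is the plain O(N2) evaluation of the objective, not SMACOF or WDA-SMACOF.

```python
import numpy as np

def stress(X, delta, weight=None):
    """Weighted MDS Stress: sum over i<j of weight(i,j) * (delta(i,j) - d(X_i, X_j))**2.

    X      : (N, n) array of mapped points (hypothetical layout)
    delta  : (N, N) symmetric array of input dissimilarities
    weight : (N, N) symmetric array of weights, or None for uniform weight 1
    """
    X = np.asarray(X, dtype=float)
    delta = np.asarray(delta, dtype=float)
    n_points = X.shape[0]
    if weight is None:
        weight = np.ones_like(delta)
    # Euclidean distances d(X_i, X_j) between all mapped points: O(N^2) work.
    diff = X[:, None, :] - X[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    # Keep each unordered pair once (i < j).
    i, j = np.triu_indices(n_points, k=1)
    return float((weight[i, j] * (delta[i, j] - d[i, j]) ** 2).sum())

# Toy data: three points on a line; delta holds their exact pairwise distances.
X_true = np.array([[0.0], [1.0], [3.0]])
delta = np.abs(X_true - X_true.T)

perfect = stress(X_true, delta)          # embedding reproduces delta exactly
squashed = stress(0.5 * X_true, delta)   # every mapped distance is halved

# Sammon-style weights weight(i,j) = 1/delta(i,j); a missing distance would get weight 0.
sammon_w = np.zeros_like(delta)
upper = np.triu_indices(3, k=1)
sammon_w[upper] = 1.0 / delta[upper]
squashed_sammon = stress(0.5 * X_true, delta, sammon_w)
```

Setting a weight to zero drops that pair from the sum, which is how the slides describe handling missing distances.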
[Figure: six programming models: (1) Map Only; (2) Classic MapReduce; (3) Iterative Map Reduce or Map-Collective; (4) Point to Point or Map-Communication; (5) Map-Streaming; (6) Shared memory Map Communicates. Diagram labels: Input, map, reduce, Iterations, brokers, Events, Graph, Shared Memory, Map & Communicate, Local, Output]

MDS in more detail

WDA-SMACOF “Best” MDS
• Semimetric spaces just have pairwise distances $\delta(i, j)$ defined between points in space
• MDS Minimizes Stress with pairwise distances $\delta(i, j)$: $\sigma(X) = \sum_{i<j=1}^{N} \mathrm{weight}(i,j) \, (\delta(i, j) - d(X_i, X_j))^2$
• SMACOF, a clever Expectation Maximization method, chooses good steepest descent
• Improved by Deterministic Annealing reducing distance scale; DA does not impact compute time much and gives DA-SMACOF
• Classic SMACOF is O(N2) for uniform weight and O(N3) for non trivial weights but get nonuniform weight from
  – The preferred Sammon method $\mathrm{weight}(i,j) = 1/\delta(i, j)$ or
  – Missing distances put in as weight(i,j) = 0
• Use conjugate gradient, which converges in 5-100 iterations: a big gain for a matrix with a million rows. This removes a factor of N in time complexity and gives WDA-SMACOF

Timing of WDA SMACOF
• 20k to 100k AM Fungal sequences on 600 cores
[Figure: Time Cost Comparison between WDA-SMACOF with Equal Weights and Sammon's Mapping; seconds (log scale, 1 to 100000) against data size (20k to 100k)]

WDA-SMACOF Timing
• Input Data: 100k to 400k AM Fungal sequences
• Environment: 32 nodes (1024 cores) to 128 nodes (4096 cores) on BigRed2.
• Using Harp plug in for Hadoop (MPI Performance)
[Figure: Time Cost of WDA-SMACOF over Increasing Data Size; seconds (0 to 4000) against data size (100k to 400k) for 512, 1024, 2048 and 4096 processors]
[Figure: Parallel Efficiency of WDA-SMACOF over Increasing Number of Processors; parallel efficiency (0 to 1.2) against number of processors (512 to 4096); series WDA-SMACOF (Harp)]

Spherical Phylogram
• Take a set of sequences mapped to nD with MDS (WDA-SMACOF) (n=3 or ~20)
  – N=20 captures ~all features of dataset?
• Consider a phylogenetic tree and use neighbor joining formulae to calculate distances of nodes to sequences (or later other nodes) starting at bottom of tree
• Do a new MDS fixing mapping of sequences, noting that sequences + nodes have defined distances
• Use RAxML or Neighbor Joining (N=20?) to find tree
• Random note: do not need Multiple Sequence Alignment; pairwise tools are easier to use and give reliably good results

Spherical Phylograms
[Figure panels: MSA; SWG. RAxML result visualized in FigTree. Spherical Phylogram visualized in PlotViz for MSA or SWG distances]

Quality of 3D Phylogenetic Tree
• EM-SMACOF is basic SMACOF
• LMA was previous best method using Levenberg-Marquardt nonlinear $\chi^2$ solver
• WDA-SMACOF finds best result
• 3 different distance measures
[Figure: Sum of Branches on 599nts Data and Sum of Branches on 999nts Data; sum of branches (0 to 30) for WDA-SMACOF, LMA and EM-SMACOF under MSA, SWG and NW distances. Caption: Sum of branch lengths of the Spherical Phylogram generated in 3D space on two datasets]

Summary
• Always run MDS. Gives insight into data and performance of machine learning
  – Leads to a data browser as GIS gives for spatial data
  – 3D better than 2D
  – ~20D better than MSA?
• Clustering Observations
  – Do you care about quality or are you just cutting up space into parts
  – Deterministic Clustering always makes more robust
  – Continuous clustering enables hierarchy
  – Trimmed Clustering cuts off tails
  – Distinct O(N) and O(N2) algorithms
• Use Conjugate Gradient

Java Grande

Java Grande
• We once tried to encourage use of Java in HPC with Java Grande Forum but Fortran, C and C++ remain central HPC languages.
  – Not helped by .com and Sun collapse in 2000-2005
• The pure Java CartaBlanca, a 2005 R&D100 award-winning project, was an early successful example of HPC use of Java in a simulation tool for non-linear physics on unstructured grids.
• Of course Java is a major language in ABDS and as data analysis and simulation are naturally linked, should consider broader use of Java
• Using Habanero Java (from Rice University) for Threads and mpiJava or FastMPJ for MPI, gathering collection of high performance parallel Java analytics
  – Converted from C# and sequential Java faster than sequential C#
• So will have either Hadoop+Harp or classic Threads/MPI versions in Java Grande version of Mahout

Performance of MPI Kernel Operations
[Figure: Performance of MPI send and receive operations; average time (us) against message size (bytes); series MPI.NET C# in Tempest, FastMPJ Java in FG, OMPI-nightly Java FG, OMPI-trunk Java FG, OMPI-trunk C FG]
[Figure: Performance of MPI allreduce operation; average time (us) against message size (bytes); same series]
[Figure: Performance of MPI send and receive on Infiniband and Ethernet; average time (us) against message size (bytes); series OMPI-trunk C Madrid, OMPI-trunk Java Madrid, OMPI-trunk C FG, OMPI-trunk Java FG]
[Figure: Performance of MPI allreduce on Infiniband and Ethernet; average time (us) against message size (bytes); same series]
Pure Java as in FastMPJ slower than Java interfacing to C version of MPI

Java Grande and C# on 40K point DAPWC Clustering
Very sensitive to threads v MPI
[Figure: C# and Java at 64 Way, 128 Way and 256 Way parallel; axis labels TXP, Nodes, Total; annotation: C# Hardware 0.7 performance Java Hardware]

Java and C# on 12.6K point DAPWC Clustering
[Figure: time (hours) for Java and C# against #Threads x #Processes per node (1x1, 1x2, 1x4, 1x8, 2x1, 2x2, 2x4, 4x1, 4x2, 8x1), # Nodes and Total Parallelism; annotation: C# Hardware 0.7 performance Java Hardware]

Data Analytics in SPIDAL

Analytics and the DIKW Pipeline
• Data goes through a pipeline: Raw data → Data → Information → Knowledge → Wisdom → Decisions
  [Figure labels: Data, Analytics, Information, More Analytics, Knowledge]
• Each link enabled by a filter which is “business logic” or “analytics”
• We are interested in filters that involve “sophisticated analytics” which require non trivial parallel algorithms
  – Improve state of art in both algorithm quality and (parallel) performance
• Design and Build SPIDAL (Scalable Parallel Interoperable Data Analytics Library)

Strategy to Build SPIDAL
• Analyze Big Data applications to identify analytics needed and generate benchmark applications
• Analyze existing analytics libraries (in practice limit to some application domains); catalog library members available and performance
  – Mahout low performance, R largely sequential and missing key algorithms, MLlib just starting
• Identify big data computer architectures
• Identify software model to allow interoperability and performance
• Design or identify new or existing algorithm including parallel implementation
• Collaborate with application scientists, computer systems and statistics/algorithms communities

Machine Learning in Network Science, Imaging in
Computer Vision, Pathology, Polar Science, Biomolecular Simulations

Algorithm | Applications | Status | Parallelism
Graph Analytics (Features column, merged cells: Graph; Graph, Non-metric, static)
Community detection | Social networks, webgraph | P-DM | GML-GrC
Subgraph/motif finding | Webgraph, biological/social networks | P-DM | GML-GrB
Finding diameter | Social networks, webgraph | P-DM | GML-GrB
Clustering coefficient | Social networks | P-DM | GML-GrC
Page rank | Webgraph | P-DM | GML-GrC
Maximal cliques | Social networks, webgraph | P-DM | GML-GrB
Connected component | Social networks, webgraph | P-DM | GML-GrB
Betweenness centrality | Social networks | P-Shm | GML-GRA
Shortest path | Social networks, webgraph | P-Shm |
Spatial Queries and Analytics (Applications, merged cell: GIS/social networks/pathology informatics; Features, merged cell: Geometric)
Spatial relationship based queries | | P-DM | PP
Distance based queries | | P-DM | PP
Spatial clustering | | Seq | GML
Spatial modeling | | Seq | PP
Legend: GML = Global (parallel) ML; GrA = Static; GrB = Runtime partitioning

Some specialized data analytics in SPIDAL

Algorithm | Status | Parallelism
Core Image Processing (Applications, merged cell: Computer vision/pathology informatics; Features, merged cells: Metric Space Point Sets, Neighborhood sets & Image features; Geometric)
Image preprocessing | P-DM | PP
Object detection & segmentation | P-DM | PP
Image/object feature computation | P-DM | PP
3D image registration | Seq | PP
Object matching | Todo | PP
3D feature extraction | Todo | PP
Deep Learning (Applications: Image Understanding, Language Translation, Voice Recognition, Car driving; Features: Connections in artificial neural net)
Learning Network, Stochastic Gradient Descent | P-DM | GML
Legend: PP = Pleasingly Parallel (Local ML); Seq = Sequential Available; GRA = Good distributed algorithm needed; Todo = No prototype Available; P-DM = Distributed memory Available; P-Shm = Shared memory Available

Some Core Machine Learning Building Blocks

Algorithm | Applications | Features | Status | //ism
DA Vector Clustering | Accurate Clusters | Vectors | P-DM | GML
DA Non metric Clustering | Accurate Clusters, Biology, Web | Non metric, O(N2) | P-DM | GML
Kmeans; Basic, Fuzzy and Elkan | Fast Clustering | Vectors | P-DM | GML
Levenberg-Marquardt Optimization | Non-linear Gauss-Newton, use in MDS | Least Squares | P-DM | GML
SMACOF Dimension Reduction | DA-MDS with general weights | Least Squares, O(N2) | P-DM | GML
Vector Dimension Reduction | DA-GTM and Others | Vectors | P-DM | GML
TFIDF Search | Find nearest neighbors in document corpus | Bag of “words” (image features) | P-DM | PP
All-pairs similarity search | Find pairs of documents with TFIDF distance below a threshold | | Todo | GML
Support Vector Machine SVM | Learn and Classify | Vectors | Seq | GML
Random Forest | Learn and Classify | Vectors | P-DM | PP
Gibbs sampling (MCMC) | Solve global inference problems | Graph | Todo | GML
Latent Dirichlet Allocation LDA with Gibbs sampling or Var. Bayes | Topic models (Latent factors) | Bag of “words” | P-DM | GML
Singular Value Decomposition SVD | Dimension Reduction and PCA | Vectors | Seq | GML
Hidden Markov Models (HMM) | Global inference on sequence models | Vectors | Seq | PP & GML

Some Futures
• Always run MDS. Gives insight into data
  – Leads to a data browser as GIS gives for spatial data
• Claim is algorithm change gave as much performance increase as hardware change in simulations. Will this happen in analytics?
  – Today is like parallel computing 30 years ago with regular meshes. We will learn how to adapt methods automatically to give “multigrid” and “fast multipole” like algorithms
• Need to start developing the libraries that support Big Data
  – Understand architecture issues
  – Have coupled batch and streaming versions
  – Develop much better algorithms
• Please join SPIDAL (Scalable Parallel Interoperable Data Analytics Library) community
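
As a compact recap of the annealing loop described in the General Features of DA and Math of Deterministic Annealing slides, here is a minimal NumPy sketch of deterministic annealing for vector clustering. It is a simplification under stated assumptions: the number of centers is fixed in advance and coincident centers are nudged apart by a tiny random jitter, instead of using the second-derivative split test or the continuous-clustering weights p(k) from the slides. The function name `da_cluster`, its parameters and the two-blob test data are hypothetical.

```python
import numpy as np

def da_cluster(X, n_clusters, t_final, cooling=0.9, n_inner=20, seed=0):
    """Deterministic annealing clustering of vectors (simplified sketch).

    Soft assignments follow the Gibbs form M_i(k) ~ exp(-|X_i - Y(k)|^2 / T);
    centers are the M-weighted means, and T is multiplied by `cooling` each step.
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Above T = 2 * (largest covariance eigenvalue) a single cluster at the data
    # center is stable for this Gibbs form, so start at twice that temperature.
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    T = 4.0 * np.linalg.eigvalsh(cov).max()
    Y = np.tile(X.mean(axis=0), (n_clusters, 1))   # all centers start at the center
    M = np.full((X.shape[0], n_clusters), 1.0 / n_clusters)
    jitter = 1e-4 * X.std()
    while T > t_final:
        # Tiny perturbation so coincident centers can separate once unstable.
        Y = Y + jitter * rng.standard_normal(Y.shape)
        for _ in range(n_inner):
            d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
            logits = -d2 / T
            logits -= logits.max(axis=1, keepdims=True)   # numerical stability
            M = np.exp(logits)
            M /= M.sum(axis=1, keepdims=True)             # soft assignments M_i(k)
            C = np.maximum(M.sum(axis=0), 1e-300)         # soft cluster counts C(k)
            Y = (M.T @ X) / C[:, None]                    # Y(k) = sum_i M_i(k) X_i / C(k)
        T *= cooling
    return Y, M

# Hypothetical test data: two well-separated 2D blobs.
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=[-5.0, 0.0], scale=0.5, size=(100, 2))
blob_b = rng.normal(loc=[5.0, 0.0], scale=0.5, size=(100, 2))
points = np.vstack([blob_a, blob_b])

centers, memberships = da_cluster(points, n_clusters=2, t_final=0.1)
centers = centers[np.argsort(centers[:, 0])]   # order by x for inspection
```

The `t_final` argument plays the role of the final distance scale mentioned in the slides: stopping at a larger temperature leaves nearby groups merged, while a smaller one lets centers resolve finer structure.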