Visualizing and Clustering Life Science
Applications in Parallel
HiCOMB 2015 14th IEEE International Workshop on
High Performance Computational Biology at IPDPS 2015
Hyderabad, India
May 25 2015
Geoffrey Fox
[email protected]
http://www.infomall.org
School of Informatics and Computing
Digital Science Center
Indiana University Bloomington
Deterministic Annealing
Algorithms
Some Motivation
• Big Data requires high performance – achieve with parallel
computing
• Big Data sometimes requires robust algorithms as more
opportunity to make mistakes
• Deterministic annealing (DA) is one of the better approaches to
robust optimization and is broadly applicable
– Started as “Elastic Net” by Durbin for Travelling Salesman
Problem TSP
– Tends to remove local optima
– Addresses overfitting
– Much Faster than simulated annealing
• Physics systems find the true lowest energy state if
you anneal, i.e. you equilibrate at each
temperature as you cool
• Uses the mean field approximation, which is also
used in “Variational Bayes” and “Variational
inference”
(Deterministic) Annealing
• Find minimum at high temperature when trivial
• Small changes avoid local minima as temperature is lowered
• Typically gets better answers than standard libraries – R and Mahout
• And can be parallelized and put on GPUs etc.
General Features of DA
• In many problems, decreasing temperature is classic
multiscale – finer resolution (√T is “just” distance scale)
• In clustering √T is distance in space of points (and
centroids), for MDS scale in mapped Euclidean space
• T = ∞, all points are in same place – the center of universe
• For MDS all Euclidean points are at center and distances
are zero. For clustering, there is one cluster
• As Temperature lowered there are phase transitions in
clustering cases where clusters split
– Algorithm determines whether split needed as second derivative
matrix singular
• Note DA has similar features to hierarchical methods and
you do not have to specify a number of clusters; you need
to specify a final distance scale
Math of Deterministic Annealing
• H() is objective function to be minimized as a function of
parameters  (as in Stress formula given earlier for MDS)
• Gibbs Distribution at Temperature T
P() = exp( - H()/T) /  d exp( - H()/T)
• Or P() = exp( - H()/T + F/T )
• Use the Free Energy combining Objective Function and Entropy
F = < H - T S(P) > =  d {P()H + T P() lnP()}
• Simulated annealing performs these integrals by Monte Carlo
• Deterministic annealing corresponds to doing integrals analytically
(by mean field approximation) and is much much faster
• Need to introduce a modified Hamiltonian for some cases so that
integrals are tractable. Introduce extra parameters to be varied so
that modified Hamiltonian matches original
• In each case temperature is lowered slowly – say by a factor 0.95 to
0.9999 at each iteration
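As a concrete illustration of the loop just described – mean-field (EM-like) updates at each temperature, then slow cooling – here is a minimal pure-Python sketch for one-dimensional clustering. The function name, data, and inner-iteration count are illustrative, not from the talk:

```python
import math

def da_cluster(points, centers, t_start, t_final, cooling=0.95):
    """Deterministic-annealing clustering sketch in 1-D.

    At each temperature T, iterate the mean-field update: soft
    membership M_i(k) = exp(-(x_i - y_k)^2 / T) / Z_i, then move each
    center to its membership-weighted mean.  T is then lowered slowly
    (factor `cooling` per step), as on the slide."""
    T = t_start
    y = list(centers)
    while T > t_final:
        for _ in range(10):  # equilibrate at this temperature
            memberships = []
            for x in points:
                w = [math.exp(-(x - yk) ** 2 / T) for yk in y]
                z = sum(w)  # the partition function Z_i
                memberships.append([wk / z for wk in w])
            for k in range(len(y)):
                num = sum(m[k] * x for m, x in zip(memberships, points))
                den = sum(m[k] for m in memberships)
                y[k] = num / den
        T *= cooling  # slow cooling
    return y
```

At high T both centers sit near the global mean; as T drops they separate at the phase transition and finish at the two group means.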
Some Uses of Deterministic Annealing DA
• Clustering: improved K-means
– Vectors: Rose (Gurewitz and Fox 1990 – 486 citations encouraged me to revisit)
– Clusters with fixed sizes and no tails (Proteomics team at Broad)
– No Vectors: Hofmann and Buhmann (just use pairwise distances)
• Dimension Reduction for visualization and analysis
– Vectors: GTM Generative Topographic Mapping
– No Vectors: SMACOF Multidimensional Scaling (MDS) (just use pairwise
distances)
• Can apply to HMM & general mixture models (less studied)
– Gaussian Mixture Models
– Probabilistic Latent Semantic Analysis with Deterministic Annealing DA-PLSA
as alternative to Latent Dirichlet Allocation for finding “hidden factors”
• Have scalable parallel versions of much of the above – mainly Java
• Many clustering methods – not clear what is best, although DA is pretty good and
improves K-means at increased computing cost, which is not always useful
• DA clearly improves MDS, which is perhaps the “only reliable” dimension reduction?
Clusters v. Regions
Lymphocytes 4D
Pathology 54D
• In Lymphocytes clusters are distinct; DA useful
• In Pathology, clusters divide space into regions and
sophisticated methods like deterministic annealing are
probably unnecessary
Some Problems we are working on
• Analysis of Mass Spectrometry data to find peptides by
clustering peaks (Broad Institute/Hyderabad)
– ~0.5 million points in 2 dimensions (one collection) -- ~ 50,000
clusters summed over charges
• Metagenomics – 0.5 million (increasing rapidly) points NOT
in a vector space – hundreds of clusters per sample
– Apply MDS to Phylogenetic trees
• Pathology Images >50 Dimensions
• Social image analysis is in a highish dimension vector space
– 10-50 million images; 1000 features per image; million clusters
• Finding communities from network graphs coming from
Social media contacts etc.
MDS and
Clustering on
~60K EMR
Colored by Sample – not clustered.
Distances from vectors in word space. This uses 128 4-mers for ~200K sequences
Examples of current DA
algorithms: LC-MS
Background on LC-MS
• Remarks of collaborators – Broad Institute/Hyderabad
• Abundance of peaks in “label-free” LC-MS enables large-scale comparison of
peptides among groups of samples.
• In fact when a group of samples in a cohort is analyzed together, not only is it
possible to “align” robustly or cluster the corresponding peaks across samples,
but it is also possible to search for patterns or fingerprints of disease states
which may not be detectable in individual samples.
• This property of the data lends itself naturally to big data analytics for
biomarker discovery and is especially useful for population-level studies with
large cohorts, as in the case of infectious diseases and epidemics.
• With increasingly large-scale studies, the need for fast yet precise cohort-wide
clustering of large numbers of peaks assumes technical importance.
• In particular, a scalable parallel implementation of a cohort-wide peak
clustering algorithm for LC-MS-based proteomic data can prove to be a critically
important tool in clinical pipelines for responding to global epidemics of
infectious diseases like tuberculosis, influenza, etc.
Proteomics 2D DA Clustering T= 25000
with 60 Clusters (will be 30,000 at T=0.025)
The brownish triangles are “sponge” (soaks up trash) peaks outside
any cluster.
The colored hexagons are peaks inside clusters, with the white
hexagons being the determined cluster centers.
Fragment of 30,000 Clusters
241,605 Points
Continuous Clustering
• This is a very useful subtlety introduced by Ken Rose but not widely known,
although it greatly improves the algorithm
• Take a cluster k to be split into 2 with centers Y(k)A and Y(k)B with initial
values Y(k)A = Y(k)B at original center Y(k)
• Then typically if you make this change and perturb Y(k)A and Y(k)B,
they will return to their starting positions, as the free energy F is at a
stable minimum (positive eigenvalue) in the Y(k)A - Y(k)B direction
• But an instability (negative eigenvalue) can develop: F then decreases
as Y(k)A - Y(k)B grows, while the minimum in Y(k)A + Y(k)B remains stable
• Implement by adding an arbitrary number p(k) of centers for each cluster:
Z_i = Σ_{k=1}^{K} p(k) exp(-ε_i(k)/T), and the M step gives p(k) = C(k)/N
• Halve p(k) at splits; can’t split easily in the standard case p(k) = 1
• Shows weighting in sums like Z_i is now equipoint, not equicluster, as p(k) is
proportional to the number of points C(k) in the cluster
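The split test just described can be sketched directly: replace one cluster by two copies at center ± ε, iterate the mean-field update at fixed T, and see whether the perturbation decays (stable minimum) or grows (instability, so a real split). A toy pure-Python version with illustrative names and one-dimensional data:

```python
import math

def split_grows(points, center, T, eps=0.01, iters=30):
    """Duplicate a center at center +/- eps and iterate the mean-field
    update at fixed temperature T.  Return True if the separation grows
    beyond its initial value (negative eigenvalue => real split)."""
    ya, yb = center - eps, center + eps
    for _ in range(iters):
        na = nb = da = db = 0.0
        for x in points:
            wa = math.exp(-(x - ya) ** 2 / T)
            wb = math.exp(-(x - yb) ** 2 / T)
            s = wa + wb
            na += (wa / s) * x
            da += wa / s
            nb += (wb / s) * x
            db += wb / s
        ya, yb = na / da, nb / db  # membership-weighted centroids
    return abs(yb - ya) > 2 * eps
```

For two well-separated groups, the perturbation decays at high T (no split) and grows at low T (the cluster splits), which is exactly the phase-transition behavior on the slide.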
Trimmed Clustering
• Clustering with position-specific constraints on variance: Applying
redescending M-estimators to label-free LC-MS data analysis (Rudolf
Frühwirth , D R Mani and Saumyadipta Pyne) BMC
Bioinformatics 2011, 12:358
• H_TCC = Σ_{k=0}^{K} Σ_{i=1}^{N} M_i(k) f(i,k)
– f(i,k) = (X(i) - Y(k))²/2σ(k)², k > 0
– f(i,0) = c²/2, k = 0
• The 0’th cluster captures (at zero temperature) all points outside
clusters (background)
• Clusters are trimmed: (X(i) - Y(k))²/2σ(k)² < c²/2
• Relevant when errors are well defined
[Figure: soft assignment versus distance from cluster center,
sharpening as T goes from 5 to 1 to ~0]
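A minimal sketch of the trimmed assignment above, assuming one-dimensional points and a common σ: the sponge cluster (index 0) has constant cost c²/2 while real clusters cost (X(i) - Y(k))²/2σ(k)², so at low T any point farther than cσ from every center falls into the sponge. Names and data are illustrative:

```python
import math

def assign_with_sponge(x, centers, sigma, c, T):
    """Soft assignment including the 'sponge' cluster at index 0.

    Costs: c^2/2 for the sponge, (x - y)^2 / (2 sigma^2) for real
    clusters; memberships are exp(-cost/T), normalized.  As T -> 0 the
    minimum-cost cluster wins, trimming points outside every cluster."""
    costs = [c * c / 2.0]
    costs += [(x - y) ** 2 / (2.0 * sigma ** 2) for y in centers]
    w = [math.exp(-f / T) for f in costs]
    s = sum(w)
    return [wk / s for wk in w]  # [sponge, cluster 1, cluster 2, ...]
```

At T = 0.01 a point 5 units from the nearest center (cost 12.5 vs the sponge's 2.0 for c = 2, σ = 1) goes almost entirely to the sponge, while a point near a center stays in it.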
Cluster Count v. Temperature for 2 Runs
[Chart: cluster count (0–60,000) versus temperature (10⁶ down to 10⁻³,
log scale) for DAVS(2) and DA2D; annotations mark “Start Sponge
DAVS(2)”, “Sponge Reaches final value”, and “Add Close Cluster Check”]
• All start with one cluster at far left
• T = 1 special as measurement errors divided out
• DA2D counts clusters with 1 member as clusters; DAVS(2) does not
Simple Parallelism as in k-means
• Decompose points i over processors
• Equations either pleasingly parallel “maps” over i
• Or “All-Reductions” summing over i for each cluster
• Parallel Algorithm:
– Each process holds all clusters and calculates
contributions to clusters from points in node
– e.g. Y(k) = Σ_{i=1}^{N} <M_i(k)> X_i / C(k)
• Runs well in MPI or MapReduce
– See all the MapReduce k-means papers
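The decomposition can be sketched in plain Python, with each partition standing in for one MPI process or map task; the names are illustrative, not from the talk:

```python
def centroid_allreduce(partitions, membership, K):
    """Sketch of the decomposition above: every "process" (one entry of
    `partitions`) computes partial sums for all K clusters from its own
    points; combining the partials is the All-Reduction, giving
    Y(k) = sum_i <M_i(k)> X_i / C(k) with C(k) = sum_i <M_i(k)>."""
    num = [0.0] * K   # global numerators  sum <M_i(k)> X_i
    cnt = [0.0] * K   # global counts      C(k) = sum <M_i(k)>
    for part in partitions:           # "map" over processes
        part_num = [0.0] * K          # process-local partial sums
        part_cnt = [0.0] * K
        for x in part:
            m = membership(x)         # <M_i(k)> for this point
            for k in range(K):
                part_num[k] += m[k] * x
                part_cnt[k] += m[k]
        for k in range(K):            # the all-reduce combine step
            num[k] += part_num[k]
            cnt[k] += part_cnt[k]
    return [num[k] / cnt[k] for k in range(K)]
```

In MPI the combine step would be an Allreduce over the partial arrays; in MapReduce it is the reduce phase.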
Better Parallelism
• The previous model is correct at the start, but each point does
not really contribute to each cluster, as contributions are damped
exponentially by exp(-(X_i - Y(k))²/T)
• For the Proteomics problem, on average only 6.45 clusters are
needed per point if we require (X_i - Y(k))²/T ≤ ~40 (as exp(-40) is
small)
• So only need to keep nearby clusters for each point
• As the average number of clusters is ~20,000, this gives a factor
of ~3000 improvement
• Further, communication is no longer all global; it has
nearest-neighbor components and is calculated by
parallelism over clusters, which can be done in parallel if
separated
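The pruning rule itself is a one-liner; here is a sketch using the cutoff of ~40 from the slide (function name and data are illustrative):

```python
CUT = 40.0  # exp(-40) ~ 4e-18, numerically negligible

def nearby_clusters(x, centers, T):
    """Keep only clusters whose exponent (x - Y(k))^2 / T is below the
    cutoff; all other clusters contribute at most exp(-40) to the sums
    and can be skipped - the source of the ~3000x saving above."""
    return [k for k, y in enumerate(centers) if (x - y) ** 2 / T <= CUT]
```

Note the list of kept clusters shrinks as T is lowered, so the saving grows exactly when the cluster count is largest.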
Speedups for several runs on Tempest from 8-way through 384-way MPI parallelism with
one thread per process. We look at different choices for MPI processes, which are either
inside nodes or on separate nodes
Parallelism within a Single Node of Madrid
Cluster. A set of runs on 241605 peak data
with a single node with 16 cores with either
threads or MPI giving parallelism.
Parallelism is either number of threads or
number of MPI processes.
METAGENOMICS -- SEQUENCE
CLUSTERING
Non-metric Spaces
O(N²) Algorithms – Illustrate Phase Transitions
• Start at T = ∞ with 1
cluster
• Decrease T; clusters
emerge at instabilities
446K sequences
~100 clusters
METAGENOMICS -- SEQUENCE
CLUSTERING
Non-metric Spaces
O(N²) Algorithms – Compare Other Methods
DA-PWC on “Divergent” Data Sample with 23 True Clusters,
compared with UClust and CDhit
[Figures: clustering of the divergent data set by DA-PWC, UClust, and CDhit]

Divergent Data Set: DAPWC versus UClust (cuts 0.65 to 0.95)

                                                           DAPWC  0.65  0.75    0.85    0.95
Total # of clusters                                           23     4    10      36      91
Total # of clusters uniquely identified
  (i.e. one original cluster goes to 1 uclust cluster)        23     0     0      13      16
Total # of shared clusters with significant sharing
  (one uclust cluster goes to > 1 real cluster)                0     4    10       5       0
Total # of uclust clusters that are just part of a real
  cluster (numbers in brackets only have one member)           0     0     0  17(11)  72(62)
Total # of real clusters that are 1 uclust cluster
  but uclust cluster is spread over multiple real clusters     0    14     9       5       0
Total # of real clusters that have significant
  contribution from > 1 uclust cluster                         0     9    14       5       7
PROTEOMICS
No clear clusters
Protein Universe Browser for COG Sequences with a
few illustrative biologically identified clusters
Heatmap of biology distance (Needleman-Wunsch) vs 3D Euclidean Distances
If d is a distance, so is f(d) for any monotonic f. Optimize the choice of f
O(N²) ALGORITHMS?
Algorithm Challenges
• See NRC Massive Data Analysis report
• O(N) algorithms for O(N²) problems
• Parallelizing Stochastic Gradient Descent
• Streaming data algorithms – balance and interplay between
batch methods (most time consuming) and interpolative
streaming methods
• Graph algorithms – need shared memory?
• Machine Learning Community uses parameter servers;
Parallel Computing (MPI) would not recommend this?
– Is the classic distributed model for “parameter service” better?
• Apply best of parallel computing – communication and load
balancing – to Giraph/Hadoop/Spark
• Are data analytics sparse? Many cases are full matrices
• BTW need Java Grande – some C++ but Java most popular in
ABDS, with Python, Erlang, Go, Scala (compiles to JVM) …
“clean” sample of 446K
O(N²) green-green and purple-purple interactions have value,
but green-purple interactions are “wasted”
O(N²) interactions between green and purple clusters
should be representable by centroids, as in Barnes-Hut.
Hard as there is no Gauss theorem and no multipole
expansion, and the points are really in 1000-dimensional
space, as they were clustered before the 3D projection
Use the Barnes-Hut OctTree,
originally developed to
make O(N²) astrophysics
O(N log N), to give similar
speedups in machine
learning
OctTree for 100K
sample of Fungi
We use OctTree for logarithmic
interpolation (streaming data)
Fungi Analysis
Fungi Analysis
• Multiple Species from multiple places
• Several sources of sequences starting with 446K and
eventually boiled down to ~10K curated sequences with 61
species
• Original sample – clustering and MDS
• Final sample – MDS and other clustering methods
• Note MSA and SWG give similar results
• Some species are clearly split
• Some species are diffuse; others compact making a fixed
distance cut unreliable
– Easy for humans!
• MDS very clear on structure and clustering artifacts
• Why not do “high-value” clustering as interactive iteration
driven by MDS?
Fungi -- 4 Classic Clustering Methods
Same Species
Same Species Different Locations
Parallel Data Mining
Parallel Data Analytics
• Streaming algorithms have interesting differences but
• “Batch” Data analytics is “just classic parallel computing” with
usual features such as SPMD and BSP
• Expect similar systematics to simulations where
• Static Regular problems are straightforward but
• Dynamic Irregular Problems are technically hard and high level
approaches fail (see High Performance Fortran HPF)
– Regular meshes worked well but
– Adaptive dynamic meshes did not although “real people with MPI”
could parallelize
• However using libraries is successful at either
– Lowest: communication level
– Higher: “core analytics” level
• Data analytics does not yet have “good regular parallel libraries”
– Graph analytics has most attention
Remarks on Parallelism I
• Maximum Likelihood or 2 both lead to objective functions like
• Minimize sum items=1N (Positive nonlinear function of
unknown parameters for item i)
• Typically decompose items i and parallelize over both i and
parameters to be determined
• Solve iteratively with (clever) first or second order
approximation to shift in objective function
– Sometimes steepest descent direction; sometimes Newton
– Have classic Expectation Maximization structure
– Steepest descent shift is sum over shift calculated from each point
• Classic method – take all (millions) of items in data set and
move full distance
– Stochastic Gradient Descent SGD – take randomly a few hundred of
items in data set and calculate shifts over these and move a tiny
distance
–
SGD cannot parallelize over items
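The contrast between the classic full-distance step and the SGD step can be sketched for a scalar parameter θ; this is a hypothetical toy example, not code from the talk:

```python
import random

def full_batch_step(items, grad, theta, step):
    """Classic method: sum the gradient over every item and move the
    full distance."""
    g = sum(grad(theta, x) for x in items)
    return theta - step * g

def sgd_step(items, grad, theta, step, batch, rng):
    """SGD: estimate the gradient from a small random sample of items
    and move only a tiny distance."""
    sample = [rng.choice(items) for _ in range(batch)]
    g = sum(grad(theta, x) for x in sample)
    return theta - step * g
```

With the objective Σ(θ - x)²/2, the full-batch step with step size 1/N jumps straight to the minimizer (the mean), while SGD takes many small noisy steps toward it; the SGD gradient is computed from only a handful of items, which is why it parallelizes over parameters rather than items.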
Remarks on Parallelism II
• Need to cover non-vector semimetric and vector spaces for
clustering and dimension reduction (N points in space)
• Semimetric spaces just have pairwise distances defined between
points in the space: δ(i, j)
• MDS minimizes Stress and illustrates this:
σ(X) = Σ_{i<j≤N} weight(i,j) (δ(i,j) - d(X_i, X_j))²
• Vector spaces have Euclidean distances and scalar products
– Algorithms can be O(N), and these are best for clustering; but for MDS, O(N)
methods may not be best, as the obvious objective function is O(N²)
– Important new algorithms are needed to define O(N) versions of current O(N²)
ones – “must” work intuitively and shown in principle
• Note matrix solvers often use conjugate gradient – converges in 5-100
iterations – a big gain for a matrix with a million rows. This
removes a factor of N in time complexity
• Ratio of #clusters to #points important; new clustering ideas needed if ratio
>~ 0.1
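The Stress formula can be evaluated directly as written; here is a toy O(N²) sketch (names are illustrative):

```python
def stress(delta, X, weight=None):
    """Stress sigma(X) = sum_{i<j} weight(i,j) (delta(i,j) - d(X_i, X_j))^2.

    `delta` is the matrix of given dissimilarities delta(i,j); `X` holds
    the mapped points as coordinate tuples; `weight` defaults to
    uniform (all 1)."""
    n = len(X)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            d = sum((a - b) ** 2 for a, b in zip(X[i], X[j])) ** 0.5
            w = 1.0 if weight is None else weight[i][j]
            total += w * (delta[i][j] - d) ** 2
    return total
```

A perfect embedding gives zero stress; each misfit pair contributes its weighted squared residual, and the double loop is the O(N²) cost referred to above.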
Problem Structure
• Note learning networks have a huge number of parameters (11 billion
in the Stanford work) so that it is inconceivable to look at the second derivative
• Clustering and MDS have lots of parameters, but it can be practical to
look at the second derivative and use Newton’s method to minimize
• Parameters are determined in distributed fashion but are typically
needed globally
– MPI uses broadcast and “AllCollectives”, implying Map-Collective is a useful
programming model
– AI community: use a parameter server and access as needed. Non-optimal?
[Diagram: six problem architectures]
(1) Map Only: input → maps → local output (pleasingly parallel)
(2) Classic MapReduce: input → maps → reduce
(3) Iterative Map Reduce or Map-Collective: iterations of map + collective
(4) Point to Point or Map-Communication: map & communicate over a graph
(5) Map-Streaming: events → brokers → streaming maps
(6) Shared memory: map communicates through shared memory
MDS in more detail
WDA-SMACOF “Best” MDS
• Semimetric spaces just have pairwise distances defined between
points in the space: δ(i, j)
• MDS minimizes Stress with pairwise distances δ(i, j):
σ(X) = Σ_{i<j≤N} weight(i,j) (δ(i,j) - d(X_i, X_j))²
• SMACOF is a clever Expectation Maximization method that chooses a good
steepest descent
• Improved by Deterministic Annealing reducing the distance scale; DA
does not impact compute time much and gives DA-SMACOF
• Classic SMACOF is O(N²) for uniform weights and O(N³) for nontrivial
weights, but one gets nonuniform weights from
– The preferred Sammon method: weight(i,j) = 1/δ(i, j), or
– Missing distances put in as weight(i,j) = 0
• Use conjugate gradient – converges in 5-100 iterations – a big gain
for a matrix with a million rows. This removes a factor of N in time
complexity and gives WDA-SMACOF
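A minimal dense conjugate gradient sketch illustrating the “few iterations” point; in WDA-SMACOF the matrix-vector product would be the expensive O(N²) step, and this is illustrative code, not the library implementation:

```python
def conjugate_gradient(A, b, iters=100, tol=1e-10):
    """Plain conjugate gradient for A x = b, with A symmetric positive
    definite and given as a dense list of rows.  Converges in at most n
    steps exactly, and typically in far fewer to good accuracy."""
    n = len(b)
    x = [0.0] * n
    r = b[:]                 # residual b - A x for x = 0
    p = r[:]
    rs = sum(ri * ri for ri in r)
    for _ in range(iters):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol:     # residual small enough: done
            break
        p = [r[i] + (rs_new / rs) * p[i] for i in range(n)]
        rs = rs_new
    return x
```

Replacing a direct solve (O(N³)) by 5-100 such iterations, each dominated by one matrix-vector product, is exactly the factor-of-N saving described above.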
Timing of WDA-SMACOF
• 20k to 100k AM Fungal sequences on 600 cores
[Chart: time cost comparison between WDA-SMACOF with equal weights
and Sammon’s mapping; seconds (log scale, 1–100,000) versus data size
(20k, 40k, 60k, 80k, 100k)]
WDA-SMACOF Timing
• Input Data: 100k to 400k AM Fungal sequences
• Environment: 32 nodes (1024 cores) to 128 nodes (4096 cores) on
BigRed2
• Using Harp plug-in for Hadoop (MPI performance)
[Charts: time cost of WDA-SMACOF (Harp) versus data size (100k–400k)
at 512, 1024, 2048, and 4096 cores; parallel efficiency of
WDA-SMACOF versus number of processors (512–4096)]
Spherical Phylogram
• Take a set of sequences mapped to nD with MDS (WDA-SMACOF) (n = 3 or ~20)
– n = 20 captures ~all features of the dataset?
• Consider a phylogenetic tree and use neighbor-joining
formulae to calculate distances of nodes to sequences (or
later other nodes), starting at the bottom of the tree
• Do a new MDS fixing the mapping of the sequences, noting that
sequences + nodes have defined distances
• Use RAxML or Neighbor Joining (n = 20?) to find the tree
• Random note: do not need Multiple Sequence Alignment;
pairwise tools are easier to use and give reliably good results
Spherical Phylograms
[Figures: RAxML result visualized in FigTree; Spherical Phylograms
visualized in PlotViz for MSA and SWG distances]
Quality of 3D Phylogenetic Tree
• EM-SMACOF is basic SMACOF
• LMA was the previous best method, using a Levenberg-Marquardt
nonlinear χ² solver
• WDA-SMACOF finds the best result
• 3 different distance measures (MSA, SWG, NW)
[Charts: sum of branch lengths of the Spherical Phylogram generated
in 3D space on the 999nts and 599nts datasets, comparing WDA-SMACOF,
LMA, and EM-SMACOF for MSA, SWG, and NW distances]
Summary
• Always run MDS. Gives insight into data and performance
of machine learning
– Leads to a data browser as GIS gives for spatial data
– 3D better than 2D
– ~20D better than MSA?
• Clustering Observations
– Do you care about quality, or are you just cutting up
space into parts?
– Deterministic annealing always makes clustering more robust
– Continuous clustering enables hierarchy
– Trimmed clustering cuts off tails
– Distinct O(N) and O(N²) algorithms
• Use Conjugate Gradient
Java Grande
Java Grande
• We once tried to encourage use of Java in HPC with Java Grande
Forum but Fortran, C and C++ remain central HPC languages.
– Not helped by .com and Sun collapse in 2000-2005
• The pure Java CartaBlanca, a 2005 R&D100 award-winning
project, was an early successful example of HPC use of Java in a
simulation tool for non-linear physics on unstructured grids.
• Of course Java is a major language in ABDS, and as data analysis
and simulation are naturally linked, we should consider broader use
of Java
• Using Habanero Java (from Rice University) for Threads and
mpiJava or FastMPJ for MPI, gathering collection of high
performance parallel Java analytics
– Converted from C# and sequential Java faster than sequential C#
• So will have either Hadoop+Harp or classic Threads/MPI versions
in a Java Grande version of Mahout
Performance of MPI Kernel Operations
[Charts: average time (µs) versus message size (0B–512KB and 4B–4MB)
for MPI send/receive and allreduce operations on Infiniband and
Ethernet, comparing MPI.NET C# on Tempest, FastMPJ Java on FG,
OMPI-nightly Java on FG, OMPI-trunk Java, and OMPI-trunk C on the FG
and Madrid clusters]
Pure Java as in FastMPJ is slower than Java interfacing to the C
version of MPI
Java Grande and C# on 40K point DAPWC Clustering
• Very sensitive to threads v MPI
• C# hardware has ~0.7 the performance of the Java hardware
[Chart: run times for C# and Java at 64-, 128-, and 256-way
parallelism versus TXP nodes and total parallelism]
Java and C# on 12.6K point DAPWC Clustering
• C# hardware has ~0.7 the performance of the Java hardware
[Charts: time in hours for Java and C# versus #threads x #processes
per node (1x1, 1x2, 1x4, 1x8, 2x1, 2x2, 2x4, 4x1, 4x2, 8x1), #nodes,
and total parallelism]
Data Analytics in SPIDAL
Analytics and the DIKW Pipeline
• Data goes through a pipeline:
Raw data → Data → Information → Knowledge → Wisdom → Decisions
[Diagram: Data → (Analytics) → Information → (More Analytics) →
Knowledge]
• Each link enabled by a filter which is “business logic” or “analytics”
• We are interested in filters that involve “sophisticated analytics”
which require non trivial parallel algorithms
– Improve state of art in both algorithm quality and (parallel) performance
• Design and Build SPIDAL (Scalable Parallel Interoperable Data
Analytics Library)
Strategy to Build SPIDAL
• Analyze Big Data applications to identify analytics
needed and generate benchmark applications
• Analyze existing analytics libraries (in practice limit to
some application domains) – catalog library members
available and performance
– Mahout low performance, R largely sequential and missing
key algorithms, MLlib just starting
• Identify big data computer architectures
• Identify software model to allow interoperability and
performance
• Design or identify new or existing algorithm including
parallel implementation
• Collaborate with application scientists and the computer systems
and statistics/algorithms communities
Machine Learning in Network Science, Imaging in Computer
Vision, Pathology, Polar Science, Biomolecular Simulations

Graph Analytics (features: graph, static)
  Community detection     | Social networks, webgraph            | P-DM  | GML-GrC
  Subgraph/motif finding  | Webgraph, biological/social networks | P-DM  | GML-GrB
  Finding diameter        | Social networks, webgraph            | P-DM  | GML-GrB
  Clustering coefficient  | Social networks                      | P-DM  | GML-GrC
  Page rank               | Webgraph                             | P-DM  | GML-GrC
  Maximal cliques         | Social networks, webgraph            | P-DM  | GML-GrB
  Connected component     | Social networks, webgraph            | P-DM  | GML-GrB
  Betweenness centrality  | Social networks (non-metric)         | P-Shm | GML-GrA
  Shortest path           | Social networks, webgraph            | P-Shm |

Spatial Queries and Analytics (GIS/social networks/pathology
informatics; features: geometric)
  Spatial relationship based queries | P-DM | PP
  Distance based queries             | P-DM | PP
  Spatial clustering                 | Seq  | GML
  Spatial modeling                   | Seq  | PP

GML Global (parallel) ML; GrA Static; GrB Runtime partitioning
Some specialized data analytics in
SPIDAL

Core Image Processing (applications: computer vision/pathology
informatics; features: metric space point sets, neighborhood sets &
image features)
  Image preprocessing               | P-DM | PP
  Object detection & segmentation   | P-DM | PP
  Image/object feature computation  | P-DM | PP
  3D image registration             | Seq  | PP
  Object matching (geometric)       | Todo | PP
  3D feature extraction (geometric) | Todo | PP

Deep Learning (applications: image understanding, language
translation, voice recognition, car driving; features: learning
network, stochastic gradient descent, connections in artificial
neural net)                         | P-DM | GML

PP Pleasingly Parallel (Local ML); Seq Sequential available;
GRA Good distributed algorithm needed; Todo No prototype available;
P-DM Distributed memory available; P-Shm Shared memory available
Some Core Machine Learning Building Blocks

Algorithm | Applications | Features | Status | //ism
DA Vector Clustering | Accurate Clusters | Vectors | P-DM | GML
DA Non metric Clustering | Accurate Clusters, Biology, Web | Non metric, O(N²) | P-DM | GML
Kmeans; Basic, Fuzzy and Elkan | Fast Clustering | Vectors | P-DM | GML
Levenberg-Marquardt Optimization | Non-linear Gauss-Newton, use in MDS | Least Squares | P-DM | GML
SMACOF Dimension Reduction | DA-MDS with general weights | Least Squares, O(N²) | P-DM | GML
Vector Dimension Reduction | DA-GTM and Others | Vectors | P-DM | GML
TFIDF Search | Find nearest neighbors in document corpus | Bag of “words” (image features) | P-DM | PP
All-pairs similarity search | Find pairs of documents with TFIDF distance below a threshold | Bag of “words” | Todo | GML
Support Vector Machine SVM | Learn and Classify | Vectors | Seq | GML
Random Forest | Learn and Classify | Vectors | P-DM | PP
Gibbs sampling (MCMC) | Solve global inference problems | Graph | Todo | GML
Latent Dirichlet Allocation LDA with Gibbs sampling or Var. Bayes | Topic models (Latent factors) | Bag of “words” | P-DM | GML
Singular Value Decomposition SVD | Dimension Reduction and PCA | Vectors | Seq | GML
Hidden Markov Models (HMM) | Global inference on sequence models | Vectors | Seq | PP & GML
Some Futures
• Always run MDS. Gives insight into data
– Leads to a data browser as GIS gives for spatial data
• Claim is algorithm change gave as much performance
increase as hardware change in simulations. Will this
happen in analytics?
– Today is like parallel computing 30 years ago with regular meshes.
We will learn how to adapt methods automatically to give
“multigrid” and “fast multipole” like algorithms
• Need to start developing the libraries that support Big Data
– Understand architectures issues
– Have coupled batch and streaming versions
– Develop much better algorithms
• Please join the SPIDAL (Scalable Parallel Interoperable Data
Analytics Library) community