Data mining with sparse grids
Jochen Garcke and Michael Griebel
Institut für Angewandte Mathematik
Universität Bonn
Data mining with sparse grids – p.1/40
Overview
What is data mining?
Regularization networks
Sparse grids
Numerical examples
Conclusions
Data mining with sparse grids – p.2/40
What is data mining?
»Data mining is the process of exploration and analysis,
by automatic or semi-automatic means, of large quantities
of data in order to discover meaningful patterns and rules.«
[Berry and Linoff, Mastering Data Mining]
Example: mail-order merchant (who gets a catalog?)
Merchant aims to increase revenue per catalog mailed
Based on available customer data a response model is built
Available information includes, e.g.:
Number of quarters with at least one order placed
Number of catalogs purchased from
Number of days since last order
Amount of money spent per quarter going back some years
Data mining with sparse grids – p.3/40
Data mining activities
Directed or supervised data mining
Classification, classifying risk of credit applicants
Estimation, estimating the value of a piece of real estate
Prediction, predicting which customers will leave
Undirected or unsupervised data mining
Affinity grouping / association rules, shopping cart
Clustering, cluster of symptoms indicates particular disease
Description and visualization
Data mining with sparse grids – p.4/40
Data mining in the knowledge discovery process
Identifying the problem
Data preparation
Data mining
Post-processing of the discovered knowledge
Putting the results of knowledge discovery into use
Data mining with sparse grids – p.5/40
The classification problem
Given is a training data set $S = \{(x_i, y_i)\}_{i=1}^{M}$, $x_i \in \mathbb{R}^d$, $y_i \in \{-1, +1\}$
$M$ can consist of up to millions or billions of data points
$d$ can be large; we will consider moderately high $d$
We want to compute a function, the classifier $f$, which
approximates the given training data set
but also gives 'good' results on unseen data
For that a compromise has to be found between
the correctness of the approximation, i.e. the size of the data error, and
the generalization qualities of the classifier for new, i.e. before unseen, data
Data mining with sparse grids – p.6/40
Approximation with data centered ansatz functions
Error is zero at the data points, but the result is overfitting
Assume smoothness properties of $f$
Data mining with sparse grids – p.7/40
Regularization networks
To get a well-posed, uniquely solvable problem we have to assume further knowledge of $f$
Regularization theory imposes smoothness constraints
The regularization network approach considers the variational problem (see the sketch below)
$\min_f R(f)$ with $R(f) = \frac{1}{M}\sum_{i=1}^{M} C\bigl(f(x_i), y_i\bigr) + \lambda\,\Phi(f)$
$C(f(x_i), y_i)$: error of the classifier on the given data
$\Phi(f)$: assumed smoothness properties
$\lambda$: regularization parameter
Data mining with sparse grids – p.8/40
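A minimal sketch (not from the talk) of evaluating this regularization network functional, using the squared loss as cost function $C$; the `smoothness_penalty` argument is a hypothetical stand-in for $\Phi$:

```python
# Minimal sketch (not from the talk): the regularization network functional
# R(f) = 1/M * sum_i C(f(x_i), y_i) + lambda * Phi(f) with the squared loss
# as cost function C. `smoothness_penalty` is a hypothetical stand-in for Phi.
import numpy as np

def regularized_risk(f, X, y, lam, smoothness_penalty):
    """X has shape (M, d), y has shape (M,); f maps X to predictions."""
    data_error = np.mean((f(X) - y) ** 2)          # 1/M * sum_i (f(x_i) - y_i)^2
    return data_error + lam * smoothness_penalty(f)

# Toy usage: a linear model f(x) = x . w with a ridge-type penalty on w
w = np.array([0.5, -0.2])
f = lambda X: X @ w
X = np.random.rand(100, 2)
y = np.sign(X[:, 0] - X[:, 1])
print(regularized_risk(f, X, y, lam=1e-3, smoothness_penalty=lambda g: w @ w))
```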
Exact solution with kernels
With $\{\varphi_n\}$ a basis of the function space we have
$\Phi(f) = \sum_{n} \frac{\langle f, \varphi_n \rangle^2}{\lambda_n}$
In the case of a regularization term of this type, where $\{\lambda_n\}$ is a decreasing positive sequence, the solution of the variational problem always has the form
$f(x) = \sum_{i=1}^{M} c_i\, K(x, x_i)$ with the kernel $K(x, y) = \sum_{n} \lambda_n \varphi_n(x)\varphi_n(y)$
Data mining with sparse grids – p.9/40
Reproducing Kernel Hilbert Space
$K(x, y)$ is a symmetric kernel function
$K$ can be interpreted as the kernel of a Reproducing Kernel Hilbert Space (RKHS)
In other words: if certain functions $K(x, x_i)$, centered in the locations of the data points, are used in an approximation scheme, then the approximate solution is a finite series and involves 'only' $M$ terms
But in general a full $M \times M$ system has to be solved (see the sketch below)
Data mining with sparse grids – p.10/40
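A minimal sketch of this kernel route (my illustration, assuming a Gaussian kernel and the squared loss): the expansion coefficients $c_i$ come from a dense $M \times M$ solve, which is what becomes expensive for large $M$:

```python
# Minimal sketch (my illustration, assuming a Gaussian kernel and squared
# loss): the solution has the form f(x) = sum_i c_i K(x, x_i), and the
# coefficients come from a dense M x M solve, which is what gets expensive
# for large data sets.
import numpy as np

def gaussian_kernel(X, Z, sigma=0.5):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fit_kernel_classifier(X, y, lam=1e-3):
    M = len(X)
    K = gaussian_kernel(X, X)
    c = np.linalg.solve(K + lam * M * np.eye(M), y)   # full M x M system
    return lambda Xnew: gaussian_kernel(Xnew, X) @ c

X = np.random.rand(200, 2)
y = np.sign(np.sin(4 * X[:, 0]) - X[:, 1] + 0.5)
f = fit_kernel_classifier(X, y)
print((np.sign(f(X)) == y).mean())                    # training correctness
```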
Approximation schemes in regularization network context
For radially symmetric kernels we end up with radial basis
function approximation schemes
Many other approximation schemes like
additive models
hyper-basis functions
ridge approximation models and
several types of neural networks
can be derived by a specific choice of the regularization
operator
The support vector machine (SVM) approach can also be
expressed in the form of a regularization network
All scale in general non-linearly in $M$, the number of data points
Data mining with sparse grids – p.11/40
Discretization
Different approach: we explicitly restrict the problem to a finite dimensional subspace $V_N \subset V$, with $f_N(x) = \sum_{j=1}^{N} \alpha_j \varphi_j(x)$
The ansatz functions $\{\varphi_j\}_{j=1}^{N}$ should span $V_N$ and preferably should form a basis for $V_N$
Cost function: $C(f_N(x_i), y_i) = (f_N(x_i) - y_i)^2$
Regularization operator: $\Phi(f_N) = \|\nabla f_N\|_{L_2}^2$
i.e. $R(f_N) = \frac{1}{M}\sum_{i=1}^{M} (f_N(x_i) - y_i)^2 + \lambda \|\nabla f_N\|_{L_2}^2$ is to be minimized in $V_N$
Data mining with sparse grids – p.12/40
Derivative of the functional
Plug-in of $f_N = \sum_{j=1}^{N} \alpha_j \varphi_j$ and differentiation with respect to $\alpha_k$:
$0 = \frac{\partial R(f_N)}{\partial \alpha_k} = \frac{2}{M}\sum_{i=1}^{M} \Bigl(\sum_{j=1}^{N} \alpha_j \varphi_j(x_i) - y_i\Bigr)\varphi_k(x_i) + 2\lambda \sum_{j=1}^{N} \alpha_j \,(\nabla\varphi_j, \nabla\varphi_k)_{L_2}$
Or equivalently ($k = 1, \ldots, N$):
$M\lambda \sum_{j=1}^{N} \alpha_j \,(\nabla\varphi_j, \nabla\varphi_k)_{L_2} + \sum_{j=1}^{N} \alpha_j \sum_{i=1}^{M} \varphi_j(x_i)\varphi_k(x_i) = \sum_{i=1}^{M} y_i\, \varphi_k(x_i)$
Data mining with sparse grids – p.13/40
Problem to solve
We get the linear equation system
$(\lambda M C + B \cdot B^{T})\,\alpha = B y$ (see the sketch below)
With
$C$ is an $N \times N$ matrix with $C_{jk} = (\nabla\varphi_j, \nabla\varphi_k)_{L_2}$
$B$ is an $N \times M$ matrix with $B_{ji} = \varphi_j(x_i)$
$\alpha$ is the vector of the unknowns and has length $N$
$y$ is the vector of length $M$ of the data classes $y_i$
Data mining with sparse grids – p.14/40
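A minimal 1D sketch (my own toy version, not the authors' code) of assembling and solving this system for hat functions on a uniform grid; grid size and data size are toy values:

```python
# Minimal 1D sketch (my own toy version, not the authors' code): assemble and
# solve (lambda*M*C + B*B^T) alpha = B*y for hat functions on a uniform grid
# with N nodes on [0, 1]; C is the stiffness matrix, B holds phi_j(x_i).
import numpy as np

def hat_values(x, N):
    """B[j, i] = phi_j(x_i) for hat functions on a uniform grid of N nodes."""
    h = 1.0 / (N - 1)
    nodes = np.linspace(0.0, 1.0, N)
    return np.maximum(0.0, 1.0 - np.abs(x[None, :] - nodes[:, None]) / h)

def stiffness_1d(N):
    """C[j, k] = integral of phi_j' * phi_k' (tridiagonal for hat functions)."""
    h = 1.0 / (N - 1)
    C = np.zeros((N, N))
    np.fill_diagonal(C, 2.0 / h)
    C[0, 0] = C[-1, -1] = 1.0 / h
    i = np.arange(N - 1)
    C[i, i + 1] = C[i + 1, i] = -1.0 / h
    return C

def fit_grid_classifier(x, y, N=17, lam=1e-4):
    M = len(x)
    B = hat_values(x, N)
    alpha = np.linalg.solve(lam * M * stiffness_1d(N) + B @ B.T, B @ y)
    return lambda xnew: hat_values(xnew, N).T @ alpha

x = np.random.rand(1000)
y = np.sign(np.sin(6 * x))
f = fit_grid_classifier(x, y)
print((np.sign(f(x)) == y).mean())
```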
Approximation with grid-based ansatz functions
In this picture only discrete values are used on the grid points; in general continuous values are used
Data mining with sparse grids – p.15/40
Which function space $V_N$ to take?
Again, widely used are methods with global data-centered basis functions, which scale with the number of data points
We use a grid to discretize the data space and local basis functions on the grid points
A naive grid has $O(h_n^{-d}) = O(2^{nd})$ grid points, where $h_n = 2^{-n}$ gives the mesh size; with a reasonable size of $n$ one encounters the curse of dimensionality
To overcome this we use sparse grids, which have only $O(h_n^{-1} \cdot (\log h_n^{-1})^{d-1}) = O(2^{n} \cdot n^{d-1})$ grid points
Data mining with sparse grids – p.16/40
Interpolation with the hierarchical basis
Interpolation: $f_n(x) = \sum_{l,i} \alpha_{l,i}\, \varphi_{l,i}(x)$
Hierarchical basis: hat functions $\varphi_{l,i}(x) = \varphi(2^{l} x - i)$ with $\varphi(x) = \max(1 - |x|, 0)$
The 1-D case is generalized by means of a tensor product approach (see the 1-D sketch below)
Hierarchical values (surpluses) of the $d$-dimensional basis functions are bounded through the size of their supports
Data mining with sparse grids – p.17/40
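A minimal 1-D sketch (my illustration) of this hierarchization: each surplus is the difference between the value at a node and the linear interpolant of its two hierarchical neighbours, and for smooth functions the surpluses shrink with the support size:

```python
# Minimal 1D sketch (my illustration): hierarchical surpluses are the
# difference between the value at a node and the linear interpolant of its
# two hierarchical neighbours; the interpolant is the surplus-weighted sum
# of hierarchical hat functions.
import numpy as np

def hierarchical_surpluses(f, n):
    """Surpluses of f on [0, 1] up to level n (boundary handled at level 0)."""
    s = {(0, 0): f(0.0), (0, 1): f(1.0)}               # boundary values
    for l in range(1, n + 1):
        h = 2.0 ** (-l)
        for i in range(1, 2 ** l, 2):                  # odd indices only
            x = i * h
            s[(l, i)] = f(x) - 0.5 * (f(x - h) + f(x + h))
    return s

def interpolate(s, x):
    """Evaluate the hierarchical interpolant at a point x in [0, 1]."""
    val = s[(0, 0)] * (1 - x) + s[(0, 1)] * x
    for (l, i), surplus in s.items():
        if l > 0:
            val += surplus * max(0.0, 1.0 - abs(x * 2 ** l - i))   # phi_{l,i}(x)
    return val

s = hierarchical_surpluses(np.sin, 6)
print(interpolate(s, 0.3), np.sin(0.3))                # should be close
```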
Supports of the hierarchical basis functions $\varphi_{l,i}$ (figure)
Data mining with sparse grids – p.18/40
Sparse grids
Difference spaces $W_{\mathbf{l}} = \mathrm{span}\{\varphi_{\mathbf{l},\mathbf{i}}\}$ of piece-wise $d$-linear functions
Space $V_n$ of level $n$ can be split accordingly: $V_n = \bigoplus_{|\mathbf{l}|_\infty \le n} W_{\mathbf{l}}$
Function $f_n \in V_n$: $f_n(x) = \sum_{\mathbf{l},\mathbf{i}} \alpha_{\mathbf{l},\mathbf{i}}\, \varphi_{\mathbf{l},\mathbf{i}}(x)$
Sparse grid space: $V_n^{s} = \bigoplus_{|\mathbf{l}|_1 \le n + d - 1} W_{\mathbf{l}}$ (for levels $l_t \ge 1$)
Data mining with sparse grids – p.19/40
Properties of sparse grids
(Figure: sparse grid in 2D and 3D)
                          full grid                     sparse grid
number of points          $O(2^{nd})$                   $O(2^{n} \cdot n^{d-1})$
approximation properties  $O(h_n^{2})$                  $O(h_n^{2} \cdot n^{d-1})$
smoothness properties     bounded second derivatives    bounded mixed second derivatives
(see the counting sketch below)
Data mining with sparse grids – p.20/40
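A minimal sketch (my illustration, interior points only, so the counts will not match the boundary-including 6D example on the next slide) that tallies sparse versus full grid points by enumerating the hierarchical subspaces:

```python
# Minimal sketch (interior points only, so the numbers will not match the
# boundary-including 6D example on the next slide): count the points of a
# regular sparse grid of level n in d dimensions versus a full grid by
# enumerating the hierarchical subspaces W_l.
from itertools import product

def full_grid_points(n, d):
    return (2 ** n - 1) ** d                       # O(2^{n d}) interior points

def sparse_grid_points(n, d):
    total = 0
    for l in product(range(1, n + 1), repeat=d):   # hierarchical levels l >= 1
        if sum(l) <= n + d - 1:                    # sparse grid index set
            total += 2 ** (sum(l) - d)             # |W_l| = prod 2^{l_t - 1}
    return total                                   # O(2^n * n^{d-1})

for n in (4, 6, 8):
    print(n, full_grid_points(n, 6), sparse_grid_points(n, 6))
```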
Sparse grids
Example in six dimensions with level 6:
full grid: 75 418 890 625 points
sparse grid: 483 201 points
Now use sparse grids to solve the minimization problem
Linear equation system with $O(2^{n} \cdot n^{d-1})$ points
Matrix is more densely populated than corresponding full grid matrices, which would add further terms to the complexity
Explicit assembly of the matrix should be avoided
Difficult to implement only the action of the matrices
Action of the data matrix would scale with # of data points
Therefore use the combination technique variant of sparse grids
Data mining with sparse grids – p.21/40
(Figure: combination technique of level 4 in 2D; the sparse grid solution is combined from solutions on coarser full grids with coefficients +1 and −1)
Data mining with sparse grids – p.22/40
Sparse grids with the combination technique
Solve the problem on the sequence of full grids $\Omega_{\mathbf{l}}$ with $|\mathbf{l}|_1 = n - q$, $q = 0, \ldots, d-1$; each grid has only $\dim(V_{\mathbf{l}}) = O(2^{n})$ points
With the results $f_{\mathbf{l}}$ combine the solution on the sparse grid (see the sketch below):
$f_n^{c}(x) = \sum_{q=0}^{d-1} (-1)^{q} \binom{d-1}{q} \sum_{|\mathbf{l}|_1 = n - q} f_{\mathbf{l}}(x)$
Example in two dimensions: $f_n^{c}(x) = \sum_{l_1 + l_2 = n} f_{l_1, l_2}(x) - \sum_{l_1 + l_2 = n - 1} f_{l_1, l_2}(x)$
Data mining with sparse grids – p.23/40
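A minimal sketch of this combination (my illustration; the convention of grid levels starting at 0 is my assumption, chosen so that level 1 in 10 dimensions gives the 11 grids mentioned later in the talk):

```python
# Minimal sketch (conventions are my assumption: grid levels l_t >= 0 and
# |l|_1 = n - q): enumerate the grids of the combination technique with
# their coefficients (-1)^q * binom(d-1, q).
from itertools import product
from math import comb

def combination_grids(n, d):
    """Yield (level vector l, combination coefficient) pairs."""
    for q in range(d):
        coeff = (-1) ** q * comb(d - 1, q)
        for l in product(range(n + 1), repeat=d):
            if sum(l) == n - q:
                yield l, coeff

# The grids of the 'level 4 in 2D' picture above, with their +/- coefficients
for l, c in combination_grids(4, 2):
    print(l, c)

# Combining: f_c(x) = sum over grids of coeff * f_l(x), where f_l is the
# solution computed on the full grid Omega_l (per-grid solver not shown).
```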
Sequence of problems to solve
Discretize and solve the minimization problem on each grid $\Omega_{\mathbf{l}}$ of the sequence
Number of grids # is only of order $O(d \cdot n^{d-1})$
$\dim(V_{\mathbf{l}}) = O(2^{n})$, i.e. small enough, concerning the grid, for the main memory of a workstation
The resulting linear equation system is solved by a diagonally preconditioned conjugate gradient algorithm (see the sketch below)
Data mining with sparse grids – p.24/40
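A minimal sketch of a diagonally (Jacobi) preconditioned conjugate gradient solver (my illustration, not the authors' implementation); passing the matrix as a matrix-vector product routine is what allows avoiding explicit assembly:

```python
# Minimal sketch of a diagonally (Jacobi) preconditioned conjugate gradient
# solver (my illustration, not the authors' implementation). Passing A as a
# matrix-vector product routine `matvec` allows avoiding explicit assembly.
import numpy as np

def pcg(matvec, b, diag, tol=1e-8, maxiter=1000):
    """Solve A x = b for symmetric positive definite A with diagonal `diag`."""
    x = np.zeros_like(b)
    r = b - matvec(x)
    z = r / diag                                   # apply diagonal preconditioner
    p = z.copy()
    rz = r @ z
    for _ in range(maxiter):
        Ap = matvec(p)
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            break
        z = r / diag
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(pcg(lambda v: A @ v, b, np.diag(A)))         # compare with np.linalg.solve
```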
Complexities of the computation
To solve on each grid in the sequence of grids: storage, assembly and mv-multiplication costs depend on the number of grid points and the number of data points
$N$ is the number of grid points
$M$ is the number of data points
Scales linearly with $M$, the number of data points
Data mining with sparse grids – p.25/40
Numerical Examples
We test our method with
Benchmark data sets from the UCI Repository
Synthetically generated massive data sets
The best $\lambda$ is found in an outer loop over several $\lambda$s (see the sketch below)
Evaluation and comparison with other methods through either
Correctness rates on a test data set, which was not used
during the computation,
10-fold cross validation, or
Leave-one-out cross validation
Data mining with sparse grids – p.26/40
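A minimal sketch of that outer loop combined with 10-fold cross validation (my illustration; `train` and `accuracy` are hypothetical placeholders for the sparse grid training routine and the evaluation on the held-out fold):

```python
# Minimal sketch of the outer loop over lambda with 10-fold cross validation
# (my illustration; `train` and `accuracy` are hypothetical placeholders for
# the sparse grid training routine and the evaluation on the held-out fold).
import numpy as np

def best_lambda_by_cv(X, y, lambdas, train, accuracy, k=10):
    folds = np.array_split(np.random.permutation(len(X)), k)
    best_lam, best_acc = None, -np.inf
    for lam in lambdas:                            # outer loop over lambdas
        accs = []
        for i in range(k):
            test_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
            model = train(X[train_idx], y[train_idx], lam)
            accs.append(accuracy(model, X[test_idx], y[test_idx]))
        if np.mean(accs) > best_acc:
            best_lam, best_acc = lam, np.mean(accs)
    return best_lam, best_acc
```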
Checkerboard data set / Ripley data set
Checkerboard with level 10: 10-fold correctness rate 96.20 %
Ripley data set with level 5 (correctness rate of 90.9 %)
Ripley data set with level 8 (correctness rate of 89.7 %)
Ripley data set with neural networks 91.1 %
Best possible rate for Ripley is 92.0%, since 8 % error is
introduced
Data mining with sparse grids – p.27/40
Spiral data set
level   $\lambda$   training correctness   testing correctness
4       0.00001     95.31 %                87.63 %
5       0.001       94.36 %                87.11 %
6       0.00075     100.00 %               89.69 %
7       0.00075     100.00 %               88.14 %
Leave-one-out cross-validation results, levels 4 to 7 are shown
77.20 % with neural networks reported [Singh, 1998]
Data mining with sparse grids – p.28/40
BUPA Liver Disorders data set (6D)
                    SSVM    SVM     sparse grid combination method
                                    level 1   level 2   level 3   level 4
10-fold train. %    70.37   70.57   76.00     77.49     84.28     90.27
10-fold test. %     70.33   69.86   67.87     67.84     70.34     70.92
Results for the BUPA Liver Disorders data set (345 data points) from the UCI Repository in comparison to support vector machines [Lee and Mangasarian, 2001]
Data mining with sparse grids – p.29/40
PIMA Indians Diabetes data set (8D)
                    SSVM    SVM     sparse grid combination method
                                    level 1   level 2   level 3
10-fold train. %    78.11   77.92   83.94     88.51     93.29
10-fold test. %     78.12   77.07   77.47     75.01     72.93
Results for the PIMA Indians Diabetes data set (768 data points) from the UCI Repository in comparison to support vector machines [Lee and Mangasarian, 2001]
Data mining with sparse grids – p.30/40
Synthetic massive 6D data set
          # of data   training correctness   testing correctness   total time (sec)   data matrix time (sec)
level 1   50 000      90.8 %                 90.8 %                158                152
          500 000     90.7 %                 90.8 %                1570               1528
          5 million   90.7 %                 90.7 %                15933              15514
level 2   50 000      91.9 %                 91.5 %                1155               1126
          500 000     91.5 %                 91.6 %                11219              11022
          5 million   91.4 %                 91.5 %                112656             110772
Data mining with sparse grids – p.31/40
Using simplicial basis functions
On the grids of the combination technique linear basis
functions based on a simplicial discretization are also possible
So-called Kuhn’s triangulation for each rectangular block
(Figure: unit cube with corners (0,0,0) and (1,1,1) split into simplices; see the interpolation sketch below)
Theoretical properties of this variant of the sparse grid technique still have to be investigated in more detail
Since the overlap of supports is greatly reduced due to the use
of a simplicial discretization, the complexities scale significantly
better
Data mining with sparse grids – p.32/40
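A minimal sketch (my illustration) of linear interpolation on the Kuhn triangulation: the simplex containing a point is found by sorting its coordinates, and only d+1 of the 2^d cube corners contribute, which is the reason for the reduced support overlap:

```python
# Minimal sketch (my illustration): linear interpolation on the Kuhn
# triangulation of the unit cube. The simplex containing a point is found by
# sorting its coordinates, and only d+1 of the 2^d cube corners contribute,
# which is why the support overlap is much smaller than for d-linear bases.
import numpy as np

def kuhn_interpolate(x, vertex_values):
    """x in [0,1]^d; vertex_values maps corner tuples like (0,1,0) to values."""
    d = len(x)
    order = np.argsort(-np.asarray(x))             # coordinates in descending order
    corner = [0] * d
    val = vertex_values[tuple(corner)]             # start at corner (0,...,0)
    for k in order:                                # walk the edges of the simplex
        previous = vertex_values[tuple(corner)]
        corner[k] = 1
        val += x[k] * (vertex_values[tuple(corner)] - previous)
    return val

# 3D toy example with vertex values v(corner) = sum(corner)
vals = {c: float(sum(c)) for c in np.ndindex(2, 2, 2)}
print(kuhn_interpolate([0.2, 0.7, 0.4], vals))     # reproduces 0.2 + 0.7 + 0.4
```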
Complexities for both discretization variants
Storage, assembly and mv-multiplication costs compared for $d$-linear basis functions and for linear basis functions on simplices
Reduced $d$-dependence in the complexities with linear basis functions on simplices
$N$ is the number of grid points, $M$ the number of data points
Both variants scale linearly with $M$
Data mining with sparse grids – p.33/40
Ripley data set / Spiral data set with linear basis functions
Ripley data set with level 4 (correctness rate of 91.4 %)
Compare with 90.9 % with level 5, $d$-linear
and 91.1 % with neural networks
Spiral data set with level 7, 88.66 % leave-one-out correctness
Spiral data set with level 8, 89.18 % leave-one-out correctness
Compare with 89.69 % with level 6, $d$-linear
Data mining with sparse grids – p.34/40
BUPA Liver Disorders data set (6D)
                          linear (simplicial)        $d$-linear
                          $\lambda$    %             $\lambda$    %
level 1   10-fold train.  0.012        76.00         0.020        76.00
          10-fold test.                69.00                      67.87
level 2   10-fold train.  0.040        76.13         0.10         77.49
          10-fold test.                66.01                      67.84
level 3   10-fold train.  0.165        78.71         0.007        84.28
          10-fold test.                66.41                      70.34
level 4   10-fold train.  0.075        92.01         0.0004       90.27
          10-fold test.                69.60                      70.92
Data mining with sparse grids – p.35/40
Synthetic massive 6D data set
(linear basis functions on simplices)
          # of data   training correctness   testing correctness   total time (sec)   data matrix time (sec)
level 1   500 000     90.5 %                 90.5 %                25                 8
          5 million   90.5 %                 90.6 %                242                77
level 2   500 000     91.2 %                 91.1 %                110                55
          5 million   91.1 %                 91.2 %                1086               546
level 3   500 000     91.7 %                 91.7 %                417                226
          5 million   91.6 %                 91.7 %                4087               2239
$d$-linear basis functions:
level 2   5 million   91.4 %                 91.5 %                42690              41596
Data mining with sparse grids – p.36/40
Synthetic massive 10D data set
          # of data   training correct.   testing correct.   total time (sec)   data matrix time (sec)
level 1   50 000      98.8 %              97.2 %             19                 4
          500 000     97.6 %              97.4 %             104                49
          5 million   97.4 %              97.4 %             811                452
level 2   50 000      99.8 %              96.3 %             265                45
          500 000     98.6 %              97.8 %             1126               541
          5 million   97.9 %              97.9 %             7764               5330
Data mining with sparse grids – p.37/40
Parallelization
Combination technique parallel on a coarse grain level
Classifiers on the grids of the sequence can be computed independently of each other
Only short setup and gather phases are necessary (see the sketch below)
Simple but effective static load balancing strategy
Fine grain level parallelization with threads on SMP-machines
To compute the data dependent part of the matrices, the array of the training set can be separated into (# processors) parts
Some overhead is introduced to avoid memory conflicts
In the iterative solver a vector can be split into parts and each processor then computes the action of the matrix on a vector of that reduced size
Data mining with sparse grids – p.38/40
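A minimal sketch of the coarse grain level (my illustration, not the authors' code): the independent per-grid problems are handed to worker processes and combined in a short gather phase; `solve_on_grid` is a hypothetical placeholder for the per-grid training routine:

```python
# Minimal sketch of the coarse grain level (my illustration, not the authors'
# code): the per-grid problems of the combination technique are independent,
# so they can be handed to worker processes and combined in a short gather
# phase. `solve_on_grid` is a hypothetical placeholder for per-grid training.
from concurrent.futures import ProcessPoolExecutor

def train_combination_parallel(grids_and_coeffs, solve_on_grid, X, y, lam):
    with ProcessPoolExecutor() as pool:            # one task per grid (coarse grain)
        futures = [(coeff, pool.submit(solve_on_grid, l, X, y, lam))
                   for l, coeff in grids_and_coeffs]
        # gather phase: collect the weighted per-grid solutions
        return [(coeff, fut.result()) for coeff, fut in futures]
```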
Synthetic massive 10D data set in parallel
Coarse grain level parallelization of the combination technique
Speed-up of 10.1 with an efficiency of 0.92 on 11 nodes
Since only 11 grids have to be calculated no more than 11
nodes are needed
Threads for each partial problem in the sequence of grids
We achieve acceptable speed-ups from 1.6 for two
processors up to 3.7 for eight processors
As one would expect the efficiency decreases with the
number of processors
Both parallelization strategies are used simultaneously
Each node is a shared memory dual-processor system
On 11 nodes a speed-up of 17.9 with an efficiency of 0.81
Data mining with sparse grids – p.39/40
Conclusions and outlook
Our method is well suited for huge data sets
Memory requirements still grow exponentially in $d$
Lumping
Reduce number of points on the boundary
Moderately high number of dimensions
Enough for a lot of practical applications after the reduction
to the essential dimensions
Dimension reduction (e.g. SVD) has to be applied
Fast solvers for the partial problems in the sequence of grids
Multi-grid with partial semi-coarsening
Data mining with sparse grids – p.40/40