Data mining with sparse grids
Jochen Garcke and Michael Griebel
Institut für Angewandte Mathematik, Universität Bonn

Overview
- What is data mining?
- Regularization networks
- Sparse grids
- Numerical examples
- Conclusions

What is data mining?
"Data mining is the process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules." [Berry and Linoff, Mastering Data Mining]
Example: mail-order merchant (who gets a catalog?)
- The merchant aims to increase the revenue per catalog mailed.
- Based on the available customer data a response model is built.
- Available information includes, e.g., the number of quarters with at least one order placed, the number of catalogs purchased from, the number of days since the last order, and the amount of money spent per quarter going back some years.

Data mining activities
- Directed or supervised data mining:
  - Classification, e.g. classifying the risk of credit applicants
  - Estimation, e.g. estimating the value of a piece of real estate
  - Prediction, e.g. predicting which customers will leave
- Undirected or unsupervised data mining:
  - Affinity grouping / association rules, e.g. shopping-cart analysis
  - Clustering, e.g. a cluster of symptoms indicating a particular disease
- Description and visualization

Data mining in the knowledge discovery process
- Identifying the problem
- Data preparation
- Data mining
- Post-processing of the discovered knowledge
- Putting the results of knowledge discovery to use

The classification problem
- We want to compute a function, the classifier, which approximates the given training data set but also gives 'good' results on unseen data.
- The dimension d of the data can be large; we will consider moderately high d.
- The training data set can consist of up to millions or billions of data points.
- A compromise has to be found between the correctness of the approximation, i.e. the size of the data error, and the generalization qualities of the classifier for new, i.e. before unseen, data.
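The formulas of this slide did not survive the extraction. As a sketch in standard notation (the symbols S, x_i, y_i, d and M are assumptions, chosen to be consistent with the regularization-network slides that follow), the setup reads:

    % Classification setup in standard notation (not verbatim from the slide):
    % given a training data set
    \[
      S = \{(x_i, y_i)\}_{i=1}^{M} \subset [0,1]^d \times \{-1, +1\},
    \]
    % find a classifier f : [0,1]^d \to \mathbb{R} whose data error
    \[
      \frac{1}{M}\sum_{i=1}^{M}\bigl(f(x_i) - y_i\bigr)^2
    \]
    % is small while f still generalizes to new, before unseen, data.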
Approximation with data-centered ansatz functions
- The error is zero at the data points, but the result is overfitting.
- Remedy: assume smoothness properties of the classifier.

Regularization networks
- To get a well-posed, uniquely solvable problem we have to assume additional knowledge about the solution.
- Regularization theory imposes smoothness constraints.
- The regularization network approach considers the variational problem
    min_f R(f),   R(f) = (1/M) sum_{i=1}^M (f(x_i) - y_i)^2 + lambda * Phi(f),
  with
  - the error of the classifier on the given data (first term),
  - the assumed smoothness properties, expressed by the smoothing functional Phi(f),
  - the regularization parameter lambda.

Exact solution with kernels
- In the case of a regularization term of the type
    Phi(f) = sum_n <f, phi_n>^2 / lambda_n,
  where {phi_n} is a basis of the function space and (lambda_n) is a decreasing positive sequence, the solution of the variational problem always has the form
    f(x) = sum_{i=1}^M alpha_i K(x, x_i).

Reproducing kernel Hilbert spaces
- K is a symmetric kernel function and can be interpreted as the kernel of a reproducing kernel Hilbert space (RKHS).
- In other words: if certain functions are used in an approximation scheme which are centered in the locations of the data points, then the approximate solution is a finite series and involves 'only' M terms.
- But in general a full M x M system has to be solved.

Approximation schemes in the regularization network context
- For radially symmetric kernels we end up with radial basis function approximation schemes.
- Many other approximation schemes, like additive models, hyper-basis functions, ridge approximation models and several types of neural networks, can be derived by a specific choice of the regularization operator.
- The support vector machine (SVM) approach can also be expressed in the form of a regularization network.
- All of them in general scale non-linearly in M, the number of data points.

Discretization
- Different approach: we explicitly restrict the problem to a finite dimensional subspace V_N with N ansatz functions {phi_j}.
- The ansatz functions should form a basis of V_N; they should span V_N and preferably, in the limit, the whole function space V.
- The cost function and the regularization operator stay the same, i.e. R(f) is now to be minimized in V_N.

Derivation of the linear system
- Plug f_N = sum_{j=1}^N alpha_j phi_j into R(f).
- Differentiation with respect to the coefficients alpha_k, i.e. setting the derivative of the functional to zero, gives N equations; written out, or equivalently in matrix form, this is a linear system for the unknowns alpha_j.

Problem to solve
- We get the linear equation system
    (lambda * C + B * B^T) alpha = B * y,
  where
  - C is an N x N matrix with C_{jk} = M * (P phi_j, P phi_k)_{L2},
  - B is an N x M matrix with B_{ji} = phi_j(x_i),
  - y is the vector of the data classes and has length M,
  - alpha is the vector of the unknowns and has length N.
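A minimal sketch of the discrete problem above, not the authors' code: a 1-D toy example with nodal hat functions on a uniform grid and gradient regularization (P = grad), assembling and solving (lambda*C + B*B^T) alpha = B*y with NumPy. Function names and the grid size are illustrative assumptions; it anticipates the grid-based ansatz functions of the following slides.

    # 1-D regularization network sketch: hat function basis on a uniform grid,
    # solve (lambda*C + B*B^T) alpha = B*y as on the "Problem to solve" slide.
    import numpy as np

    def hat(j, h, x):
        """Piecewise linear hat function centered at grid point j*h."""
        return np.maximum(0.0, 1.0 - np.abs(x - j * h) / h)

    def fit(x, y, n_grid=17, lam=1e-3):
        h = 1.0 / (n_grid - 1)
        M = len(x)
        # B[j, i] = phi_j(x_i), an (N x M) matrix
        B = np.array([hat(j, h, x) for j in range(n_grid)])
        # C[j, k] = M * integral of phi_j' phi_k' -> scaled 1-D stiffness matrix
        C = M / h * (2 * np.eye(n_grid) - np.eye(n_grid, k=1) - np.eye(n_grid, k=-1))
        C[0, 0] = C[-1, -1] = M / h
        alpha = np.linalg.solve(lam * C + B @ B.T, B @ y)
        # return the classifier f_N(t) = sum_j alpha_j phi_j(t)
        return lambda t: np.array([hat(j, h, t) for j in range(n_grid)]).T @ alpha

    # toy usage: two-class labels in {-1, +1}
    rng = np.random.default_rng(0)
    x = rng.random(200)
    y = np.where(x > 0.5, 1.0, -1.0)
    f = fit(x, y)
    print(np.mean(np.sign(f(x)) == y))   # training correctness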
Approximation with grid-based ansatz functions
- (Figure.) In this picture only discrete values are used on the grid points; in general continuous values are used.

Which function space to take?
- Again, widely used are methods with global, data-centered basis functions, which scale with the number of data points.
- We instead use a grid to discretize the data space and local basis functions on the grid points.
- A naive full grid has O(h_n^{-d}) = O(2^{n d}) grid points, where h_n = 2^{-n} gives the mesh size; for a reasonable mesh size one encounters the curse of dimensionality.
- To overcome this we use sparse grids, which have only O(h_n^{-1} (log h_n^{-1})^{d-1}) = O(2^n n^{d-1}) grid points.

Interpolation with the hierarchical basis
- (Figures: interpolation and the hierarchical basis in 1D.)
- The 1-d case is generalized by means of a tensor product approach.
- The hierarchical values of the d-dimensional basis functions are bounded through the size of their supports.

(Figure: supports of the hierarchical basis functions.)

Sparse grids
- Difference spaces W_l: the span of the piecewise d-linear hierarchical basis functions of level l.
- The space V_n of piecewise d-linear functions can be split accordingly: V_n is the sum of the W_l with |l|_infty <= n.
- A function of level n is represented by its hierarchical coefficients on these subspaces.
- The sparse grid space keeps only the difference spaces with |l|_1 <= n + d - 1.

Properties of sparse grids
- (Figure: sparse grids in 2D and 3D with level n.)
- Smoothness properties: bounded mixed second derivatives are required.
- Approximation properties: accuracy O(h_n^2 (log h_n^{-1})^{d-1}) instead of O(h_n^2) on the full grid.
- Number of points: O(2^n n^{d-1}) for the sparse grid instead of O(2^{n d}) for the full grid.
- Example in six dimensions with level 6, i.e. 2^6 + 1 = 65 points per coordinate direction on the full grid:
  full grid: 75 418 890 625 points; sparse grid: 483 201 points.

Sparse grids for the minimization problem
- Now use sparse grids to solve the minimization problem: a linear equation system with O(2^n n^{d-1}) unknowns.
- The matrices are more densely populated than the corresponding full grid matrices, and the data part adds further terms to the complexity.
- Explicit assembly of the matrix should be avoided, but it is difficult to implement only the action of the matrices; the action of the data matrix would scale with the number of data points.
- Therefore we use the combination technique variant of sparse grids.

(Figure: the combination technique of level 4 in 2D, the combined solution written as a sum and difference of solutions on a sequence of anisotropic full grids.)

Sparse grids with the combination technique
- Solve the problem on a sequence of anisotropic full grids Omega_l, l = (l_1, ..., l_d).
- Combine the partial results f_l into a function on the sparse grid.
- Example in two dimensions, level n:
    f_n^c(x) = sum_{l_1 + l_2 = n+1} f_{l_1,l_2}(x) - sum_{l_1 + l_2 = n} f_{l_1,l_2}(x).

Sequence of problems to solve
- Discretize and solve the minimization problem on each grid Omega_l of the sequence; the number of grids is of the order n^{d-1} for fixed d.
- Each partial grid has only O(2^n) points for fixed d, i.e. it is small enough for the main memory of a workstation.
- The resulting linear equation systems are solved by a diagonally preconditioned conjugate gradient algorithm.

Complexities of the computation
- To solve on each grid in the sequence of grids: storage, assembly and matrix-vector multiplication all scale linearly in the number of grid points N and in the number of data points M, with constants that depend on the dimension d.
- N is the number of grid points, M is the number of data points; the overall cost scales linearly with M.
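The sequence of grids and the combination coefficients described above can be enumerated with a few lines of bookkeeping. This is a sketch of the standard combination formula, not code from the talk; levels are assumed to start at 1, so the grids of level n satisfy n <= |l|_1 <= n + d - 1.

    # Combination technique bookkeeping:
    # f_n^c = sum_{q=0}^{d-1} (-1)^q * C(d-1, q) * sum_{|l|_1 = n+d-1-q} f_l
    # (brute-force enumeration, fine for small n and d).
    from itertools import product
    from math import comb

    def combination_grids(n, d):
        """Return (level vector l, combination coefficient) pairs for level n in d dimensions."""
        grids = []
        for q in range(d):                       # q = 0, ..., d-1
            target = n + d - 1 - q               # required |l|_1
            coeff = (-1) ** q * comb(d - 1, q)
            for l in product(range(1, n + 1), repeat=d):
                if sum(l) == target:
                    grids.append((l, coeff))
        return grids

    # Example: level 4 in 2D, as in the combination-technique figure above:
    # under this convention four anisotropic grids enter with coefficient +1
    # and three with coefficient -1.
    grids = combination_grids(4, 2)
    for l, c in grids:
        print(l, c)
    print(len(grids), "grids in total")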
Numerical examples
- We test our method with benchmark data sets from the UCI Repository and with synthetically generated massive data sets.
- The best regularization parameter lambda is found in an outer loop over several lambdas.
- Evaluation and comparison with other methods is done through either correctness rates on a test data set which was not used during the computation, 10-fold cross-validation, or leave-one-out cross-validation.

Checkerboard data set / Ripley data set
- Checkerboard data set with level 10: 10-fold correctness rate of 96.20 %.
- Ripley data set with level 5: correctness rate of 90.9 %.
- Ripley data set with level 8: correctness rate of 89.7 %.
- Ripley data set with neural networks: 91.1 %.
- The best possible rate for Ripley is 92.0 %, since 8 % error is introduced by construction.

Spiral data set (leave-one-out cross-validation; the classifiers of levels 4 to 6 are shown in the figures)
  level   lambda     training correctness   testing correctness
  4       0.00001    95.31 %                87.63 %
  5       0.001      94.36 %                87.11 %
  6       0.00075    100.00 %               89.69 %
  7       0.00075    100.00 %               88.14 %
- For comparison, 77.20 % is reported with neural networks [Singh, 1998].

BUPA Liver Disorders data set (6D)
Results for the BUPA Liver Disorders data set (345 data points) from the UCI Repository in comparison to support vector machines [Lee and Mangasarian, 2001]:
  method                                 10-fold train. %   10-fold test. %
  sparse grid combination, level 1       76.00              67.87
  sparse grid combination, level 2       77.49              67.84
  sparse grid combination, level 3       84.28              70.34
  sparse grid combination, level 4       90.27              70.92
  SSVM                                   70.37              70.33
  SVM                                    70.57              69.86

PIMA Indians Diabetes data set (8D)
Results for the PIMA Indians Diabetes data set (768 data points) from the UCI Repository in comparison to support vector machines [Lee and Mangasarian, 2001]:
  method                                 10-fold train. %   10-fold test. %
  sparse grid combination, level 1       83.94              77.47
  sparse grid combination, level 2       88.51              75.01
  sparse grid combination, level 3       93.29              72.93
  SSVM                                   78.11              78.12
  SVM                                    77.92              77.07

Synthetic massive 6D data set
  level   # of data   training corr.   testing corr.   total time (sec)   data matrix time (sec)
  1       50 000      90.8 %           90.8 %          158                152
  1       500 000     90.7 %           90.8 %          1570               1528
  1       5 million   90.7 %           90.7 %          15933              15514
  2       50 000      91.9 %           91.5 %          1155               1126
  2       500 000     91.5 %           91.6 %          11219              11022
  2       5 million   91.4 %           91.5 %          112656             110772

Using simplicial basis functions
- On the grids of the combination technique, linear basis functions based on a simplicial discretization are also possible.
- So-called Kuhn triangulation for each rectangular block (figure: cube with corners (0,0,0) and (1,1,1)).
- The theoretical properties of this variant of the sparse grid technique still have to be investigated in more detail.
- Since the overlap of supports is greatly reduced by the simplicial discretization, the complexities scale significantly better; a small sketch of the evaluation on a Kuhn simplex follows below.

Complexities for both discretization variants
- Storage, assembly and matrix-vector multiplication again scale linearly in the number of grid points N and in the number of data points M.
- With linear basis functions on simplices the d-dependence of the constants in these complexities is reduced compared to d-linear basis functions.
- N is the number of grid points, M the number of data points; the cost scales linearly with M.
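To make the reduced overlap of supports concrete, here is a small sketch (a hypothetical helper, not the authors' implementation): inside one grid cell of Kuhn's triangulation, sorting the local coordinates of a data point identifies the simplex containing it, and only the d+1 linear basis functions attached to that simplex's vertices are nonzero there.

    # Locate a point in the Kuhn triangulation of the unit cell and compute the
    # d+1 linear basis values (barycentric weights) at that point.
    import numpy as np

    def simplex_weights(c):
        """c: local coordinates of a point inside the unit cell, values in [0,1]^d.
        Returns the d+1 vertices of the Kuhn simplex containing the point and
        the barycentric weights of the point with respect to them."""
        d = len(c)
        order = np.argsort(-np.asarray(c))       # coordinate indices, decreasing value
        verts = [np.zeros(d, dtype=int)]
        for k in order:                          # walk from (0,...,0) to (1,...,1)
            v = verts[-1].copy()
            v[k] = 1
            verts.append(v)
        cs = np.sort(np.asarray(c))[::-1]        # c_(1) >= ... >= c_(d)
        weights = np.concatenate(([1.0 - cs[0]], cs[:-1] - cs[1:], [cs[-1]]))
        return np.array(verts), weights

    # usage: the weights are the values of the d+1 local linear basis functions;
    # they are nonnegative, sum to 1, and reproduce the point:
    v, w = simplex_weights([0.3, 0.7, 0.1])
    print(w.sum(), w @ v)                        # approx. 1.0 and [0.3 0.7 0.1]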
Ripley data set / Spiral data set with linear basis functions
- Ripley data set with level 4: correctness rate of 91.4 %.
  Compare with 90.9 % at level 5 with d-linear basis functions and 91.1 % with neural networks.
- Spiral data set with level 7: 88.66 % leave-one-out correctness.
- Spiral data set with level 8: 89.18 % leave-one-out correctness.
  Compare with 89.69 % at level 6 with d-linear basis functions.

BUPA Liver Disorders data set (6D), linear vs. d-linear basis functions
  level   lambda (linear)   10-fold train. %   10-fold test. %   lambda (d-linear)   10-fold train. %   10-fold test. %
  1       0.012             76.00              69.00             0.020               76.00              67.87
  2       0.040             76.13              66.01             0.10                77.49              67.84
  3       0.165             78.71              66.41             0.007               84.28              70.34
  4       0.075             92.01              69.60             0.0004              90.27              70.92

Synthetic massive 6D data set, linear basis functions
  level   # of data   training corr.   testing corr.   total time (sec)   data matrix time (sec)
  1       500 000     90.5 %           90.5 %          25                 8
  1       5 million   90.5 %           90.6 %          242                77
  2       500 000     91.2 %           91.1 %          110                55
  2       5 million   91.1 %           91.2 %          1086               546
  3       500 000     91.7 %           91.7 %          417                226
  3       5 million   91.6 %           91.7 %          4087               2239
For comparison, d-linear basis functions, level 2, 5 million data points: 91.4 % / 91.5 %, 42690 sec total, 41596 sec for the data matrix.

Synthetic massive 10D data set
  level   # of data   training corr.   testing corr.   total time (sec)   data matrix time (sec)
  1       50 000      98.8 %           97.2 %          19                 4
  1       500 000     97.6 %           97.4 %          104                49
  1       5 million   97.4 %           97.4 %          811                452
  2       50 000      99.8 %           96.3 %          265                45
  2       500 000     98.6 %           97.8 %          1126               541
  2       5 million   97.9 %           97.9 %          7764               5330

Parallelization
- The combination technique is parallel on a coarse grain level: the classifiers on the grids of the sequence can be computed independently of each other, only short setup and gather phases are necessary, and a simple but effective static load balancing strategy can be used (a toy sketch of this coarse-grain distribution is given after the conclusions).
- Fine grain parallelization with threads on SMP machines: to compute the data-dependent part of the matrix, the array with the training set can be separated into as many parts as there are processors; some overhead is introduced to avoid memory conflicts. In the iterative solver a vector can be split into parts as well, and each processor then computes the action of the matrix on its part of the vector.

Synthetic massive 10D data set in parallel
- Coarse grain parallelization of the combination technique: speed-up of 10.1 with an efficiency of 0.92 on 11 nodes. Since only 11 grids have to be calculated, no more than 11 nodes are needed.
- Threads for each partial problem in the sequence of grids: we achieve acceptable speed-ups from 1.6 for two processors up to 3.7 for eight processors; as one would expect, the efficiency decreases with the number of processors.
- Both parallelization strategies used simultaneously (each node is a shared-memory dual-processor system): on 11 nodes a speed-up of 17.9 with an efficiency of 0.81.

Conclusions and outlook
- Our method is well suited for huge data sets.
- Memory requirements still grow exponentially in d: use lumping and reduce the number of points on the boundary.
- Moderately high number of dimensions: enough for a lot of practical applications after the reduction to the essential dimensions; dimension reduction (e.g. SVD) has to be applied first.
- Fast solvers for the partial problems in the sequence of grids: multi-grid with partial semi-coarsening.
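As referenced on the parallelization slide, a toy sketch of the coarse-grain strategy; this is not the authors' MPI/thread implementation, and solve_on_grid is a hypothetical stand-in for the per-grid solver.

    # Coarse-grain parallelization sketch: the partial problems of the combination
    # technique are independent, so they can be farmed out to a pool of workers.
    from concurrent.futures import ProcessPoolExecutor

    def solve_on_grid(level_vector):
        # placeholder for the real work on one anisotropic grid:
        # assemble lambda*C + B*B^T on Omega_l and run preconditioned CG
        return level_vector, "partial classifier"

    def solve_all(grids, n_workers):
        # simple static load balancing: hand out the most expensive grids first
        # (a grid with level vector l has about 2^{|l|_1} points)
        grids = sorted(grids, key=lambda l: -sum(l))
        with ProcessPoolExecutor(max_workers=n_workers) as pool:
            return list(pool.map(solve_on_grid, grids))

    # For example, a level-2 combination technique in 10 dimensions consists of
    # exactly 11 grids, matching the 11 nodes above; the reported combined
    # speed-up of 17.9 on 11 dual-processor nodes corresponds to an efficiency
    # of 17.9 / 22, i.e. about 0.81.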