Introduction to Data Mining
Jie Yang
Department of Mathematics, Statistics, and Computer Science, University of Illinois at Chicago
February 3, 2014

Outline
1 Fundamentals of Data Mining
  Extracting useful information from large data sets
  Components of data mining algorithms
2 Typical Data Mining Tasks
  I. Exploratory data analysis
  II. Descriptive modeling
  III. Predictive modeling
  IV. Discovering patterns and rules
  V. Retrieval by content
3 Data Mining Using R
  R resources

What is Data Mining?
The science of extracting useful information from large data sets or databases.
The analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.
An intersection of statistics, machine learning, data management and databases, pattern recognition, artificial intelligence, and other areas.
(Hand, Mannila, and Smyth, Principles of Data Mining, 2001)

Components of data mining algorithms
Model or pattern structure: determining the underlying structure or functional forms that we seek from the data.
Score function: judging the quality of a fitted model.
Optimization and search method: optimizing the score function and searching over different model and pattern structures.
Data management strategy: handling data access efficiently during the search/optimization.

I. Exploratory data analysis
Explore the data without any clear idea of what we are looking for.
Typical techniques are interactive and visual.
Projection techniques (such as principal components analysis) can be very useful for high-dimensional data.
Small-proportion or lower-resolution samples can be displayed or summarized for large numbers of cases.
(An R sketch of these steps follows Example 2 below.)

Example 1: Prostate cancer (Stamey et al., 1989)
[Figure: scatterplot matrix of the prostate cancer variables lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45, and lpsa.]

Example 1: Prostate cancer (continued)
Correlation matrix:

          lcavol  lweight   age   lbph    svi    lcp  gleason  pgg45   lpsa
lcavol      1.00     0.28  0.22   0.03   0.54   0.68     0.43   0.43   0.73
lweight     0.28     1.00  0.35   0.44   0.16   0.16     0.06   0.11   0.43
age         0.22     0.35  1.00   0.35   0.12   0.13     0.27   0.28   0.17
lbph        0.03     0.44  0.35   1.00  -0.09  -0.01     0.08   0.08   0.18
svi         0.54     0.16  0.12  -0.09   1.00   0.67     0.32   0.46   0.57
lcp         0.68     0.16  0.13  -0.01   0.67   1.00     0.51   0.63   0.55
gleason     0.43     0.06  0.27   0.08   0.32   0.51     1.00   0.75   0.37
pgg45       0.43     0.11  0.28   0.08   0.46   0.63     0.75   1.00   0.42
lpsa        0.73     0.43  0.17   0.18   0.57   0.55     0.37   0.42   1.00

Example 2: Leukemia Data (Golub et al., 1999)
[Figure: display of the leukemia gene expression data.]

Example 2: Leukemia Data (continued)
[Figure: "Two-Dim Display Based on Training Data", plotting the first principal component against the mean difference; ALL and AML training samples and the testing units 54, 57, 60, 66, and 67 are marked.]
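A minimal R sketch of the exploratory steps above (scatterplot matrix, correlation matrix, and a principal-components projection), assuming the ElemStatLearn package (listed in the references) and its prostate data frame are installed; this is an illustration, not the code behind the slides' figures.

## Exploratory data analysis of the prostate cancer data (Example 1)
library(ElemStatLearn)                 # assumed: provides the `prostate` data frame
data(prostate)
vars <- prostate[, c("lcavol", "lweight", "age", "lbph", "svi",
                     "lcp", "gleason", "pgg45", "lpsa")]

pairs(vars)                            # scatterplot matrix
round(cor(vars), 2)                    # correlation matrix shown above

## Projection for high-dimensional data: first two principal components
pc <- prcomp(vars, scale. = TRUE)
plot(pc$x[, 1], pc$x[, 2], xlab = "PC1", ylab = "PC2")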
II. Descriptive modeling
Describe all of the data or the process generating the data.
Density estimation: for the overall probability distribution.
Cluster analysis and segmentation: partition samples into groups.
Dependency modeling: describe the relationship between variables.

Example 3: South African Heart Disease (Rousseauw et al., 1983)
CHD: coronary heart disease.
[Figure: prevalence of CHD plotted against systolic blood pressure and against obesity (Hastie et al., 2009, Section 6.5, "Local Likelihood and Other Models").]

Simulated example: K-means (Hastie et al., 2009)
[Figure: K-means on simulated two-dimensional data, showing the initial centroids, the initial partition, and the cluster assignments after iterations 2 and 20.]
(An R sketch of K-means follows Example 4 below.)

Example 4: Human Tumour Microarray Data (Hastie et al., 2009)
[Figure 1.3 of Hastie et al. (2009): DNA microarray data, an expression matrix of 6830 genes (rows) and 64 samples (columns) for the human tumor data. Only a random sample of 100 rows is shown. The display is a heat map, ranging from bright green (negative, under-expressed) to bright red (positive, over-expressed).]

Example 4: Human Tumour Microarray Data (continued)
[Figure 14.8 of Hastie et al. (2009): total within-cluster sum of squares for K-means clustering applied to the human tumor microarray data, plotted against the number of clusters K.]
Table 14.2 of Hastie et al. (2009): human tumor data, number of cancer cases of each type in each of the three clusters from K-means clustering.

Cluster   Breast  CNS  Colon  K562  Leukemia  MCF7
1            3     5     0     0       0       0
2            2     0     0     2       6       2
3            2     0     7     0       0       0

Cluster   Melanoma  NSCLC  Ovarian  Prostate  Renal  Unknown
1             1       7       6        2        9      1
2             7       2       0        0        0      0
3             0       0       0        0        0      0
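A minimal R sketch of K-means in the spirit of the simulated example and Figure 14.8 above; the three-cluster data below are made up for illustration and are not the exact simulation from the figure.

## K-means on simulated two-dimensional data
set.seed(1)
x <- rbind(cbind(rnorm(50, 0),   rnorm(50, 0)),
           cbind(rnorm(50, 3),   rnorm(50, 0)),
           cbind(rnorm(50, 1.5), rnorm(50, 3)))

km <- kmeans(x, centers = 3, nstart = 20)     # nstart restarts guard against poor local optima
plot(x, col = km$cluster, pch = 19)
points(km$centers, pch = 4, cex = 2, lwd = 3)

## Total within-cluster sum of squares versus K (as in Figure 14.8)
wss <- sapply(1:10, function(k) kmeans(x, centers = k, nstart = 20)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of Clusters K", ylab = "Sum of Squares")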
III. Predictive modeling: classification and regression
Predict the value of one variable from the known values of other variables.
Classification: the predicted variable is categorical.
Regression: the predicted variable is quantitative.
Subset selection and shrinkage methods: for cases with too many variables.
(R sketches of classification and of shrinkage methods follow the tables below.)

Examples 2 & 4 (continued, Dudoit et al., 2002)
[Table 1 of Dudoit et al. (2002): test set error, i.e., the median and upper quartile, over 200 learning set/test set (LS/TS) runs, of the number of misclassified tumor samples for 9 discrimination methods applied to 3 datasets: leukemia with two and with three classes, lymphoma with three classes, and NCI60 with eight classes. For a given dataset, the error numbers for the best predictor are shown in bold.]
Datasets: leukemia from Golub et al. (1999), test set size nTS = 24, p = 40 genes; lymphoma from Alizadeh et al. (2000), nTS = 27, p = 50 genes; NCI60 from Ross et al. (2000), nTS = 21, p = 30 genes.
Methods: FLDA (Fisher linear discriminant analysis); DLDA (diagonal linear discriminant analysis); Golub (the weighted gene voting scheme of Golub et al., 1999); DQDA (diagonal quadratic discriminant analysis); CV (a single CART tree with pruning by 10-fold cross-validation); Bag (B = 50 bagged exploratory trees); Boost (B = 50 boosted exploratory trees); CPD (B = 50 bagged exploratory trees with CPD, d = .75); and nearest neighbors.
Notes from the paper: the best value of the smoothing parameter d turned out to be between .5 and 1, suggesting that the performance of CPD was not very sensitive to d; the value d = .75 was used in Table 1. Classifiers that assume a common covariance matrix for the different classes yielded lower error rates than quadratic classifiers (i.e., DQDA) that allow for different class covariance matrices.

Example 2: Leukemia Data (continued, Yang et al., 2012)
[Figure: average number of classification errors versus the number of genes used (40 to 7129), comparing DLDA, k-NN, SVM, K2, and K1.]

Example 1: Prostate cancer (continued, Hastie et al., 2009)
[Table 3.1 of Hastie et al. (2009): correlations of predictors in the prostate cancer data.]
Table 3.2 of Hastie et al. (2009): linear model fit to the prostate cancer data. The Z score is the coefficient divided by its standard error; roughly, a Z score larger than two in absolute value is significantly nonzero at the p = 0.05 level.

Term        Coefficient  Std. Error  Z Score
Intercept       2.46        0.09      27.60
lcavol          0.68        0.13       5.37
lweight         0.26        0.10       2.75
age            -0.14        0.10      -1.40
lbph            0.21        0.10       2.06
svi             0.31        0.12       2.47
lcp            -0.29        0.15      -1.87
gleason        -0.02        0.15      -0.15
pgg45           0.27        0.15       1.74

Example 1: Prostate cancer (continued, Hastie et al., 2009)
Table 3.3 of Hastie et al. (2009): estimated coefficients and test error results for different subset and shrinkage methods applied to the prostate data. Blank entries correspond to variables omitted.

Term          LS     Best Subset   Ridge    Lasso     PCR      PLS
Intercept    2.465      2.477      2.452    2.468    2.497    2.452
lcavol       0.680      0.740      0.420    0.533    0.543    0.419
lweight      0.263      0.316      0.238    0.169    0.289    0.344
age         -0.141                -0.046            -0.152   -0.026
lbph         0.210                 0.162    0.002    0.214    0.220
svi          0.305                 0.227    0.094    0.315    0.243
lcp         -0.288                 0.000            -0.051    0.079
gleason     -0.021                 0.040             0.232    0.011
pgg45        0.267                 0.133            -0.056    0.084
Test Error   0.521      0.492      0.492    0.479    0.449    0.528
Std Error    0.179      0.143      0.165    0.164    0.105    0.152
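A minimal classification sketch in R for the task described above. It uses the built-in iris data as a stand-in for the tumor data sets and the standard lda() (package MASS) and knn() (package class) functions; it is not the code behind Table 1 of Dudoit et al. (2002).

## Linear discriminant analysis and k-nearest neighbors on a small data set
library(MASS)    # lda(): (Fisher) linear discriminant analysis
library(class)   # knn(): nearest-neighbor classification

set.seed(1)
idx <- sample(nrow(iris), 100)                     # training indices

fit.lda  <- lda(Species ~ ., data = iris[idx, ])
pred.lda <- predict(fit.lda, iris[-idx, ])$class

pred.knn <- knn(train = iris[idx, 1:4], test = iris[-idx, 1:4],
                cl = iris$Species[idx], k = 3)

## Test-set misclassification counts (the errors in Table 1 are counts of this kind)
sum(pred.lda != iris$Species[-idx])
sum(pred.knn != iris$Species[-idx])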
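A sketch in the spirit of Tables 3.2 and 3.3, assuming the ElemStatLearn and glmnet packages are installed; glmnet is one common choice for ridge and lasso, and the output will not reproduce the book's numbers exactly (the book standardizes the predictors and reports test error for its designated train/test split).

## Least squares, ridge, and lasso on the prostate cancer data
library(ElemStatLearn)   # assumed: provides `prostate`, with a `train` indicator column
library(glmnet)          # assumed: ridge (alpha = 0) and lasso (alpha = 1)

data(prostate)
tr <- subset(prostate, train == TRUE,  select = -train)
te <- subset(prostate, train == FALSE, select = -train)

## Least squares fit (compare Table 3.2): coefficients, standard errors, t values
ls.fit <- lm(lpsa ~ ., data = tr)
summary(ls.fit)$coefficients

## Ridge and lasso with cross-validated tuning parameter
x <- as.matrix(tr[, setdiff(names(tr), "lpsa")])
y <- tr$lpsa
ridge <- cv.glmnet(x, y, alpha = 0)
lasso <- cv.glmnet(x, y, alpha = 1)
coef(ridge, s = "lambda.min")
coef(lasso, s = "lambda.min")

## Mean squared prediction error on the test set (compare the "Test Error" row of Table 3.3)
xte <- as.matrix(te[, setdiff(names(te), "lpsa")])
mean((predict(lasso, xte, s = "lambda.min") - te$lpsa)^2)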
IV. Discovering patterns and rules
Pattern detection: examples include spotting fraudulent behavior and detecting unusual stars or galaxies.
Association rules: for example, finding combinations of items that occur frequently in transaction databases (e.g., grocery products that are often purchased together). (See the R sketch after the next slide.)

V. Retrieval by content
Find similar patterns in the data set.
For text (e.g., Web pages), the pattern may be a set of keywords.
For images, the user may have a sample image, a sketch of an image, or a description of an image, and wish to find similar images from a large set of images.
The definition of similarity is critical, but so are the details of the search strategy.
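A small association-rules sketch, assuming the arules package and its bundled Groceries transaction data are installed; this data set is not one of the lecture's examples and is used only to illustrate mining frequent item combinations.

## Association rules: items frequently purchased together
library(arules)                       # assumed: provides apriori() and the Groceries data
data(Groceries)

rules <- apriori(Groceries,
                 parameter = list(support = 0.01, confidence = 0.5))
inspect(head(sort(rules, by = "lift"), 5))   # the five rules with the highest lift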
R resources
Learning R in 15 minutes:
http://homepages.math.uic.edu/~jyang06/stat486/handouts/handou
R web resources:
http://cran.r-project.org/ – official R website
http://cran.r-project.org/other-docs.html – R reference books
http://www.bioconductor.org/ – R resources (data sets, packages) for bioinformatics
http://www.rstudio.com/ – RStudio, a convenient R editor
http://accc.uic.edu/service/argo-cluster – UIC high-performance computing resource
R packages

References
Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edition, Springer, 2009.
  Websites: http://statweb.stanford.edu/~tibs/ElemStatLearn/ and http://cran.r-project.org/web/packages/ElemStatLearn/index.html
Hand, Mannila, and Smyth, Principles of Data Mining, MIT Press, 2001.
Torgo, Data Mining with R: Learning with Case Studies, Chapman & Hall/CRC, 2011.
Dudoit, S., Fridlyand, J., and Speed, T. P. (2002). Comparison of discrimination methods for the classification of tumors using gene expression data, JASA, 97, 77-87.
Golub, T. R., et al. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science, 286, 531-537.
Yang, J., Miescke, K., and McCullagh, P. (2012). Classification based on a permanental process with cyclic approximation. Under revision for publication in Biometrika.