Introduction to Data Mining
Jie Yang
Department of Mathematics, Statistics, and Computer Science
University of Illinois at Chicago
February 3, 2014
Outline

1 Fundamentals of Data Mining
  Extracting useful information from large data sets
  Components of data mining algorithms

2 Typical Data Mining Tasks
  I. Exploratory data analysis
  II. Descriptive modeling
  III. Predictive modeling
  IV. Discovering patterns and rules
  V. Retrieval by content

3 Data Mining Using R
  R resources
What is Data Mining?
Science of extracting useful information from large data sets or databases.

Analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.

Intersection of statistics, machine learning, data management and databases, pattern recognition, artificial intelligence, and other areas.

(Hand, Mannila, and Smyth, Principles of Data Mining, 2001)
Components of data mining algorithms
Model or Pattern Structure: determining the underlying structure or functional forms that we seek from the data.

Score Function: judging the quality of a fitted model.

Optimization and Search Method: optimizing the score function and searching over different model and pattern structures.

Data Management Strategy: handling data access efficiently during the search/optimization.
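To make the four components concrete, here is a small illustrative sketch in R (a toy example, not taken from the slides): the model structure is a straight line, the score function is the residual sum of squares, the search method is a general-purpose optimizer, and the data management strategy is trivial because the toy data fit in memory.

    ## Toy data; everything fits in memory, which is the (trivial) data management strategy
    set.seed(1)
    x <- rnorm(100)
    y <- 2 + 3 * x + rnorm(100)

    ## Model structure: y = beta0 + beta1 * x
    ## Score function: residual sum of squares of a candidate (beta0, beta1)
    rss <- function(beta) sum((y - beta[1] - beta[2] * x)^2)

    ## Optimization and search method: minimize the score over the parameters
    fit <- optim(c(0, 0), rss)
    fit$par   # should be close to (2, 3)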
I. Exploratory data analysis
Explore the data without any clear idea of what we are looking for.

Typical techniques are interactive and visual.

Projection techniques (such as principal components analysis) can be very useful for high-dimensional data.

For large numbers of cases, only a small proportion of the data, or a lower-resolution summary, may be displayed.
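As a minimal sketch of a projection technique, the following R code (illustrative only; it uses R's built-in iris data rather than a data set from the slides) computes principal components and displays the first two.

    ## Principal components of the four iris measurements, scaled to unit variance
    pc <- prcomp(iris[, 1:4], scale. = TRUE)
    summary(pc)                              # proportion of variance explained
    plot(pc$x[, 1], pc$x[, 2],
         col = as.integer(iris$Species),     # color points by species
         xlab = "1st principal component",
         ylab = "2nd principal component")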
Example 1: Prostate cancer (Stamey et al., 1989)
[Figure: scatterplot matrix of the prostate cancer variables lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45, and lpsa]
Example 1: Prostate cancer (continued)
Correlation Matrix

          lcavol  lweight   age   lbph    svi    lcp  gleason  pgg45   lpsa
lcavol      1.00     0.28  0.22   0.03   0.54   0.68     0.43   0.43   0.73
lweight     0.28     1.00  0.35   0.44   0.16   0.16     0.06   0.11   0.43
age         0.22     0.35  1.00   0.35   0.12   0.13     0.27   0.28   0.17
lbph        0.03     0.44  0.35   1.00  -0.09  -0.01     0.08   0.08   0.18
svi         0.54     0.16  0.12  -0.09   1.00   0.67     0.32   0.46   0.57
lcp         0.68     0.16  0.13  -0.01   0.67   1.00     0.51   0.63   0.55
gleason     0.43     0.06  0.27   0.08   0.32   0.51     1.00   0.75   0.37
pgg45       0.43     0.11  0.28   0.08   0.46   0.63     0.75   1.00   0.42
lpsa        0.73     0.43  0.17   0.18   0.57   0.55     0.37   0.42   1.00
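A hedged sketch of how the scatterplot matrix and correlation matrix could be reproduced in R, assuming the ElemStatLearn package (listed in the references; it has at times been archived on CRAN) is installed and still provides the prostate data frame. Values may differ slightly from the slide depending on rounding and on whether only the training subset is used.

    ## install.packages("ElemStatLearn")   # assumed available from CRAN or its archive
    library(ElemStatLearn)
    data(prostate)
    vars <- prostate[, c("lcavol", "lweight", "age", "lbph", "svi",
                         "lcp", "gleason", "pgg45", "lpsa")]
    pairs(vars)               # scatterplot matrix, as in the previous slide
    round(cor(vars), 2)       # correlation matrix, as shown above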
Example 2: Leukemia Data (Golub et al., 1999)
Example 2: Leukemia Data (continued)
[Figure: two-dimensional display based on the training data, plotting the first principal component against the mean difference; ALL and AML samples are shown, with testing units 54, 57, 60, 66, and 67 marked]
II. Descriptive modeling
Describe all of the data or the process generating the data.
Density estimation: for the overall probability distribution.

Cluster analysis and segmentation: partition samples into groups.

Dependency modeling: describe the relationship between variables.
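As an illustrative sketch (toy data, not from the slides), the following R code carries out the first two tasks: a kernel density estimate of an overall distribution and a K-means partition of samples into groups.

    set.seed(2)
    x <- c(rnorm(100, mean = 0), rnorm(100, mean = 4))   # two-component toy sample
    plot(density(x), main = "Kernel density estimate")   # density estimation

    xy <- cbind(x, rnorm(200))                           # toy two-dimensional data
    km <- kmeans(xy, centers = 2, nstart = 10)           # cluster analysis / segmentation
    table(km$cluster)                                    # sizes of the two groups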
Example 3: South African Heart Disease (Rousseauw et al., 1983)

CHD: coronary heart disease
[Figure (Hastie et al., 2009, Section 6.5, Local Likelihood and Other Models): prevalence of CHD plotted against systolic blood pressure and against obesity]
Simulated example: K-means (Hastie et al., 2009)
[Figure (Hastie et al., 2009): K-means applied to simulated two-dimensional data, showing the initial centroids, the initial partition, and the partitions after iteration 2 and iteration 20]
Example 4: Human Tumour Microarray Data (Hastie et al., 2009)

[Figure 1.3 (Hastie et al., 2009): DNA microarray data: expression matrix of 6830 genes (rows) and 64 samples (columns) for the human tumor data. Only a random sample of 100 rows is shown. The display is a heat map, ranging from bright green (negative, under-expressed) to bright red (positive, over-expressed).]

[Figure 14.8 (Hastie et al., 2009): total within-cluster sum of squares for K-means clustering applied to the human tumor microarray data, plotted against the number of clusters K.]

TABLE 14.2 (Hastie et al., 2009). Human tumor data: number of cancer cases of each type, in each of the three clusters from K-means clustering.

Cluster  Breast  CNS  Colon  K562  Leukemia  MCF7
1             3    5      0     0         0     0
2             2    0      0     2         6     2
3             2    0      7     0         0     0

Cluster  Melanoma  NSCLC  Ovarian  Prostate  Renal  Unknown
1               1      7        6         2      9        1
2               7      2        0         0      0        0
3               0      0        0         0      0        0
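A hedged sketch of the kind of computation behind Figure 14.8: the total within-cluster sum of squares as a function of K, computed here on simulated data because the microarray matrix itself is not included in this transcript.

    set.seed(3)
    X <- matrix(rnorm(200 * 5), ncol = 5)      # stand-in for a samples-by-features matrix
    wss <- sapply(1:10, function(K)
      kmeans(X, centers = K, nstart = 20)$tot.withinss)
    plot(1:10, wss, type = "b",
         xlab = "Number of Clusters K",
         ylab = "Total Within-Cluster Sum of Squares")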
III. Predictive modeling: classification and regression
Predict the value of one variable from the known values of other variables.

Classification: the predicted variable is categorical.

Regression: the predicted variable is quantitative.

Subset selection and shrinkage methods: for cases with too many variables.
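A minimal sketch of both flavors on R's built-in mtcars data (illustrative only, not from the slides): a regression fit with lm() and a classification fit with logistic regression via glm().

    ## Regression: predict a quantitative variable
    fit_reg <- lm(mpg ~ wt + hp, data = mtcars)
    summary(fit_reg)

    ## Classification: predict a categorical (here binary) variable
    fit_cls <- glm(am ~ wt + hp, data = mtcars, family = binomial)
    predict(fit_cls, type = "response")[1:5]   # estimated class probabilities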
Examples 2 & 4 (continued, Dudoit et al., 2002)

[Excerpt from Dudoit et al. (2002): for the datasets considered, simple classifiers such as DLDA, which assume a common covariance matrix for the different classes, yielded lower error rates than quadratic classifiers (i.e., DQDA) that allow for different class covariance matrices. The performance of CPD was not very sensitive to the parameter d controlling the degree of smoothing; the best value turned out to be between .5 and 1, and d = .75 was used in Table 1.]

Table 1 (Dudoit et al., 2002). Test set error: median and upper quartile, over 200 learning set/test set runs, of the number of misclassified tumor samples for 9 discrimination methods applied to 3 datasets (leukemia, two classes and three classes; lymphoma, three classes; NCI60, eight classes). For a given dataset, the error numbers for the best predictor are in bold. [The individual table entries are not cleanly recoverable from this transcript and are not reproduced here.]

Datasets: leukemia data from Golub et al. (1999), test set size nTS = 24, p = 40 genes; lymphoma data from Alizadeh et al. (2000), test set size nTS = 27, p = 50 genes; NCI60 data from Ross et al. (2000), test set size nTS = 21, p = 30 genes.

Methods: FLDA: Fisher linear discriminant analysis; DLDA: diagonal linear discriminant analysis; Golub: weighted gene voting scheme of Golub et al. (1999); DQDA: diagonal quadratic discriminant analysis; CV: single CART tree with pruning by 10-fold cross-validation; Bag: B = 50 bagged exploratory trees; Boost: B = 50 boosted exploratory trees; CPD: B = 50 bagged exploratory trees with CPD, d = .75.
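DLDA is simple enough to sketch directly. The following illustrative R code (not the authors' implementation) estimates class means and a common diagonal covariance from training data, then assigns each test observation to the class minimizing the sum of squared, variance-standardized distances to the class mean.

    dlda <- function(Xtrain, ytrain, Xtest) {
      ytrain  <- factor(ytrain)
      classes <- levels(ytrain)
      ## class means and pooled (common) per-feature variances
      mu <- sapply(classes, function(k) colMeans(Xtrain[ytrain == k, , drop = FALSE]))
      s2 <- apply(Xtrain, 2, function(v)
        sum(tapply(v, ytrain, function(z) sum((z - mean(z))^2))) /
          (length(v) - length(classes)))
      ## discriminant score: sum over features of (x - mu_k)^2 / s2, minimized over classes
      scores <- apply(Xtest, 1, function(x)
        apply(mu, 2, function(m) sum((x - m)^2 / s2)))
      classes[apply(scores, 2, which.min)]
    }

    ## Toy usage on the iris data (not a microarray data set)
    set.seed(4)
    idx  <- sample(nrow(iris), 100)
    pred <- dlda(as.matrix(iris[idx, 1:4]), iris$Species[idx],
                 as.matrix(iris[-idx, 1:4]))
    mean(pred == iris$Species[-idx])   # test-set accuracy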
Example 2: Leukemia Data (continued, Yang et al., 2012)
[Figure (Yang et al., 2012): average number of test errors versus number of genes used (from 40 up to all 7129) for DLDA, k-NN, SVM, K1, and K2]
Example 1: Prostate cancer (continued, Hastie et al., 2009)

[The slide also reproduces Table 3.1 of Hastie et al. (2009), correlations of predictors in the prostate cancer data; see the correlation matrix shown earlier.]

TABLE 3.2 (Hastie et al., 2009). Linear model fit to the prostate cancer data. The Z score is the coefficient divided by its standard error. Roughly, a Z score larger than two in absolute value is significantly nonzero at the p = 0.05 level.

Term       Coefficient  Std. Error  Z Score
Intercept         2.46        0.09    27.60
lcavol            0.68        0.13     5.37
lweight           0.26        0.10     2.75
age              -0.14        0.10    -1.40
lbph              0.21        0.10     2.06
svi               0.31        0.12     2.47
lcp              -0.29        0.15    -1.87
gleason          -0.02        0.15    -0.15
pgg45             0.27        0.15     1.74
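The table can be reproduced, up to rounding, with an ordinary least-squares fit in R. This hedged sketch again assumes the ElemStatLearn prostate data; the exact numbers depend on the standardization convention (Hastie et al. fit on the training subset with standardized predictors).

    library(ElemStatLearn)                       # assumed available, as in the earlier sketch
    data(prostate)
    train <- subset(prostate, train)             # training subset
    train[, 1:8] <- scale(train[, 1:8])          # standardize the eight predictors
    fit <- lm(lpsa ~ lcavol + lweight + age + lbph + svi + lcp + gleason + pgg45,
              data = train)
    round(summary(fit)$coefficients, 2)          # coefficient, std. error, t (Z) value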
Example 1: Prostate cancer (continued, Hastie et al., 2009)
TABLE 3.3 (Hastie et al., 2009). Estimated coefficients and test error results, for different subset and shrinkage methods applied to the prostate data. The blank entries correspond to variables omitted.

Term            LS   Best Subset   Ridge   Lasso     PCR     PLS
Intercept    2.465         2.477   2.452   2.468   2.497   2.452
lcavol       0.680         0.740   0.420   0.533   0.543   0.419
lweight      0.263         0.316   0.238   0.169   0.289   0.344
age         -0.141                -0.046          -0.152  -0.026
lbph         0.210                 0.162   0.002   0.214   0.220
svi          0.305                 0.227   0.094   0.315   0.243
lcp         -0.288                 0.000          -0.051   0.079
gleason     -0.021                 0.040           0.232   0.011
pgg45        0.267                 0.133          -0.056   0.084
Test Error   0.521         0.492   0.492   0.479   0.449   0.528
Std Error    0.179         0.143   0.165   0.164   0.105   0.152
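The ridge and lasso columns of this table come from penalized least squares. A hedged sketch using the glmnet package (an assumption; the slides do not name a specific package), with the penalty parameter chosen by cross-validation, so the coefficients will not match the book's values exactly:

    ## install.packages("glmnet")              # assumed available
    library(glmnet)
    library(ElemStatLearn)
    data(prostate)
    train <- subset(prostate, train)
    X <- scale(as.matrix(train[, 1:8]))        # standardized predictors
    y <- train$lpsa

    ridge <- cv.glmnet(X, y, alpha = 0)        # alpha = 0: ridge penalty
    lasso <- cv.glmnet(X, y, alpha = 1)        # alpha = 1: lasso penalty
    round(data.frame(
      ridge = as.numeric(coef(ridge, s = "lambda.min")),
      lasso = as.numeric(coef(lasso, s = "lambda.min")),
      row.names = rownames(coef(ridge))), 3)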
IV. Discovering patterns and rules
Pattern detection: examples include spotting fraudulent behavior and detecting unusual stars or galaxies.

Association rules: for example, finding combinations of items that occur frequently in transaction databases (e.g., grocery products that are often purchased together).
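For association rules, a hedged sketch using the arules package (an assumption; the slides do not name a package) and its bundled Groceries transaction data:

    ## install.packages("arules")               # assumed available
    library(arules)
    data(Groceries)                             # grocery transactions shipped with arules
    rules <- apriori(Groceries,
                     parameter = list(supp = 0.01, conf = 0.3))
    inspect(head(sort(rules, by = "lift"), 3))  # a few high-lift item combinations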
V. Retrieval by content
Find similar patterns in the data set.

For text (e.g., Web pages), the pattern may be a set of keywords.

For images, the user may have a sample image, a sketch of an image, or a description of an image, and wish to find similar images from a large set of images.

The definition of similarity is critical, but so are the details of the search strategy.
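A minimal illustrative sketch of keyword-based retrieval (toy documents, not from the slides): represent each document by word counts and rank documents by cosine similarity to a query.

    docs  <- c("data mining extracts patterns from large data sets",
               "image retrieval finds similar images",
               "association rules find items purchased together")
    query <- "find patterns in large data"

    vocab  <- unique(unlist(strsplit(c(docs, query), " ")))
    counts <- function(txt) table(factor(strsplit(txt, " ")[[1]], levels = vocab))
    M <- t(sapply(docs, counts))               # document-term count matrix
    q <- as.numeric(counts(query))             # query vector

    cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
    sims <- apply(M, 1, cosine, b = q)
    docs[order(sims, decreasing = TRUE)]       # documents ranked by similarity to the query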
R resources
Learning R in 15 minutes:
http://homepages.math.uic.edu/~jyang06/stat486/handouts/handou

R web resources:
http://cran.r-project.org/ – Official R website
http://cran.r-project.org/other-docs.html – R reference books
http://www.bioconductor.org/ – R resources (datasets, packages) for bioinformatics
http://www.rstudio.com/ – RStudio, a convenient R editor
http://accc.uic.edu/service/argo-cluster – UIC high performance computing resource
R packages
References

Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edition, Springer, 2009.
Websites: http://statweb.stanford.edu/~tibs/ElemStatLearn/
http://cran.r-project.org/web/packages/ElemStatLearn/index.html

Hand, Mannila, and Smyth, Principles of Data Mining, MIT Press, 2001.

Torgo, Data Mining with R: Learning with Case Studies, Chapman & Hall/CRC, 2011.

Dudoit, S., Fridlyand, J., and Speed, T.P. (2002). Comparison of discrimination methods for the classification of tumors using gene expression data, JASA, 97, 77-87.

Golub, T.R. et al. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science, 286, 531-537.

Yang, J., Miescke, K., and McCullagh, P. (2012). Classification based on a permanental process with cyclic approximation. Under revision for publication in Biometrika.