Data Mining: An Overview
David Madigan
[email protected]
http://stat.rutgers.edu/~madigan
Overview
• Brief Introduction to Data Mining
• Data Mining Algorithms
• Specific Examples
– Algorithms: Disease Clusters
– Algorithms: Model-Based Clustering
– Algorithms: Frequent Items and Association Rules
• Future Directions, etc.
Of “Laws”, Monsters, and Giants…
• Moore’s law: processing “capacity” doubles every 18 months: CPU, cache, memory
• Its more aggressive cousin:
– Disk storage “capacity” doubles every 9 months
[Chart: Disk TB shipped per year, 1988–2000, on a log scale from 1E+3 to 1E+7 TB (1E+6 TB = 1 ExaByte). Disk TB growth: 112%/year; Moore’s Law: 58.7%/year. Source: 1998 Disk Trend (Jim Porter), http://www.disktrend.com/pdf/portrpkg.pdf]
What do the two “laws” combined produce?
A rapidly growing gap between our ability to generate data and our ability to make use of it.
What is Data Mining?
Finding interesting structure in data
• Structure: refers to statistical patterns, predictive
models, hidden relationships
• Examples of tasks addressed by Data Mining
– Predictive Modeling (classification, regression)
– Segmentation (Data Clustering)
– Summarization
– Visualization
Ronny Kohavi, ICML 1998
Stories: Online Retailing
Chapter 4: Data Analysis and Uncertainty
• Elementary statistical concepts: random variables, distributions, densities, independence, point and interval estimation, bias & variance, MLE
• Model (global; represents prominent structure) vs. pattern (local; idiosyncratic deviations)
• Frequentist vs. Bayesian
• Sampling methods
Bayesian Estimation
e.g. beta-binomial model:

$p(\theta \mid D) \propto p(D \mid \theta)\, p(\theta) \propto \theta^{r} (1-\theta)^{n-r} \cdot \theta^{\alpha-1} (1-\theta)^{\beta-1} = \theta^{r+\alpha-1} (1-\theta)^{n-r+\beta-1}$

Predictive distribution:

$p(x(n+1) \mid D) = \int p(x(n+1), \theta \mid D)\, d\theta = \int p(x(n+1) \mid \theta)\, p(\theta \mid D)\, d\theta$
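As a small illustration of the beta-binomial update above, here is a sketch (my own, not from the slides; function names are illustrative) that computes the posterior for $\theta$ and the posterior predictive probability of a success on the next trial, assuming a Beta(α, β) prior and r successes in n trials:

```python
from scipy import stats

def beta_binomial_posterior(r, n, alpha=1.0, beta=1.0):
    """Posterior Beta(alpha + r, beta + n - r) for the Bernoulli parameter theta."""
    return stats.beta(alpha + r, beta + n - r)

def predictive_prob_success(r, n, alpha=1.0, beta=1.0):
    """p(x(n+1) = 1 | D) = E[theta | D] under the beta-binomial model."""
    return (alpha + r) / (alpha + beta + n)

# Example: r = 7 successes in n = 10 trials, uniform Beta(1, 1) prior
posterior = beta_binomial_posterior(7, 10)
print(posterior.mean())                 # posterior mean of theta
print(predictive_prob_success(7, 10))   # probability the next observation is a success
```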
Issues to do with p-values
• Using thresholds of 0.05 or 0.01 regardless of sample size
• Multiple testing (e.g. Freedman (1983) selecting highly significant regressors from noise)
• Subtle interpretation. Jeffreys (1980): “I have always considered the arguments for the use of P absurd. They amount to saying that a hypothesis that may or may not be true is rejected because a greater departure from the trial value was improbable; that is, that it has not predicted something that has not happened.”
p-value as measure of evidence
Schervish (1996): “if hypothesis H implies hypothesis H', then
there should be at least as much support for H' as for H.”
- not satisfied by p-values
Grimmet and Ridenhour (1996): “one might expect an
outlying data point to lend support to the alternative
hypothesis in, for instance, a one-way analysis of variance.”
- the value of the outlying data point that minimizes the
significance level can lie within the range of the data
Chapter 5: Data Mining Algorithms
“A data mining algorithm is a well-defined procedure that takes data as input and produces output in the form of models or patterns”
— Hand, Mannila, and Smyth
“well-defined”: can be encoded in software
“algorithm”: must terminate after some finite number of steps
Algorithm Components
1. The task the algorithm is used to address (e.g.
classification, clustering, etc.)
2. The structure of the model or pattern we are fitting to the
data (e.g. a linear regression model)
3. The score function used to judge the quality of the fitted
models or patterns (e.g. accuracy, BIC, etc.)
4. The search or optimization method used to search over
parameters and/or structures (e.g. steepest descent,
MCMC, etc.)
5. The data management technique used for storing, indexing,
and retrieving data (critical when data too large to reside in
memory)
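To make the decomposition concrete, here is a minimal sketch (my own illustration, not from the text; all names are hypothetical) that expresses a simple least-squares regression miner in terms of the five components:

```python
import numpy as np

class LinearRegressionMiner:
    """Illustrative mapping of the five algorithm components to ordinary least squares."""

    task = "prediction (regression)"                  # 1. the task addressed
    model_structure = "y = w0 + w1 * x (linear model)"  # 2. model/pattern structure

    def score(self, y, y_hat):                        # 3. score function (SSE)
        return float(np.sum((y - y_hat) ** 2))

    def search(self, X, y):                           # 4. search/optimization (closed form here)
        X1 = np.column_stack([np.ones(len(X)), X])
        self.w, *_ = np.linalg.lstsq(X1, y, rcond=None)
        return self.w

    def data_access(self, path):                      # 5. data management (load from disk)
        return np.loadtxt(path, delimiter=",")
```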
Backpropagation data mining algorithm
[Network diagram: four inputs $x_1, \dots, x_4$, two hidden units $h_1, h_2$, one output $y$]

$s_1 = \sum_{i=1}^{4} \alpha_i x_i, \qquad s_2 = \sum_{i=1}^{4} \beta_i x_i$

$h(s_j) = \frac{1}{1 + e^{-s_j}}$

$y = \sum_{j=1}^{2} w_j h_j$

• vector of p input values multiplied by a p × d1 weight matrix
• resulting d1 values individually transformed by a non-linear function
• resulting d1 values multiplied by a d1 × d2 weight matrix
Backpropagation (cont.)
Parameters: $\alpha_1, \dots, \alpha_4,\ \beta_1, \dots, \beta_4,\ w_1, w_2$

Score: $S_{SSE} = \sum_{i=1}^{n} \bigl(y(i) - \hat{y}(i)\bigr)^2$

Search: steepest descent; search for structure?
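A minimal sketch (my own, with illustrative names and placeholder data) of steepest descent on the SSE score for the 4-2-1 network above:

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def fit_4_2_1_network(X, y, lr=0.1, epochs=1000, seed=0):
    """Steepest descent on SSE for a 4-input, 2-hidden-unit, 1-output network."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.1, size=(4, 2))   # columns hold (alpha_i) and (beta_i)
    w2 = rng.normal(scale=0.1, size=2)        # output weights (w1, w2)
    for _ in range(epochs):
        S = X @ W1                  # n x 2 pre-activations (s1, s2)
        H = sigmoid(S)              # n x 2 hidden-unit values
        y_hat = H @ w2              # n predictions
        err = y_hat - y
        grad_w2 = 2 * H.T @ err                                   # dSSE/dw2
        grad_W1 = 2 * X.T @ ((err[:, None] * w2) * H * (1 - H))   # dSSE/dW1 via chain rule
        w2 -= lr * grad_w2 / len(y)
        W1 -= lr * grad_W1 / len(y)
    return W1, w2

# Example with synthetic data
X = np.random.default_rng(1).normal(size=(100, 4))
y = sigmoid(X @ np.array([1.0, -1.0, 0.5, 0.0]))
W1, w2 = fit_4_2_1_network(X, y)
```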
Models and Patterns

Models

Prediction:
• Linear regression
• Piecewise linear
• Nonparametric regression
• Classification: logistic regression, naïve Bayes/TAN/Bayesian networks, neural networks, support vector machines, trees, etc.

Probability distributions:
• Parametric models
• Mixtures of parametric models
• Graphical Markov models (categorical, continuous, mixed)

Structured data:
• Time series
• Markov models
• Mixture Transition Distribution models
• Hidden Markov models
• Spatial models
Markov Models
First-order:

$p(y_1, \dots, y_T) = p_1(y_1) \prod_{t=2}^{T} p_t(y_t \mid y_{t-1})$

e.g.:

$p(y_t \mid y_{t-1}) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{1}{2\sigma^2}\bigl(y_t - g(y_{t-1})\bigr)^2\right)$

g linear ⇒ standard first-order autoregressive model:

$y_t = \alpha_0 + \alpha_1 y_{t-1} + e, \qquad e \sim N(0, \sigma^2)$

[Graph: chain y1 → y2 → y3 → … → yT]
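A short sketch (illustrative, not from the slides) that simulates the first-order autoregressive model above and recovers $\alpha_0$ and $\alpha_1$ by least squares:

```python
import numpy as np

def simulate_ar1(alpha0, alpha1, sigma, T, seed=0):
    """Simulate y_t = alpha0 + alpha1 * y_{t-1} + e,  e ~ N(0, sigma^2)."""
    rng = np.random.default_rng(seed)
    y = np.zeros(T)
    for t in range(1, T):
        y[t] = alpha0 + alpha1 * y[t - 1] + rng.normal(scale=sigma)
    return y

y = simulate_ar1(alpha0=0.5, alpha1=0.8, sigma=1.0, T=2000)

# Least-squares regression of y_t on (1, y_{t-1}) recovers the parameters
X = np.column_stack([np.ones(len(y) - 1), y[:-1]])
(alpha0_hat, alpha1_hat), *_ = np.linalg.lstsq(X, y[1:], rcond=None)
print(alpha0_hat, alpha1_hat)   # close to 0.5 and 0.8
```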
First-Order HMM/Kalman Filter
[Graph: hidden states x1 → x2 → x3 → … → xT, each emitting the corresponding observation y1, y2, y3, …, yT]

$p(y_1, \dots, y_T, x_1, \dots, x_T) = p_1(x_1)\, p_1(y_1 \mid x_1) \prod_{t=2}^{T} p(y_t \mid x_t)\, p(x_t \mid x_{t-1})$

Note: to compute $p(y_1, \dots, y_T)$ we need to sum/integrate over all possible state sequences...
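The sum over state sequences can be computed efficiently with the forward recursion; here is a minimal sketch for a discrete-state, discrete-observation HMM (my own illustration; the matrices below are placeholders):

```python
import numpy as np

def hmm_likelihood(pi, A, B, obs):
    """Forward algorithm: p(y_1,...,y_T) for a discrete HMM.

    pi[k]   = p(x_1 = k)
    A[j, k] = p(x_t = k | x_{t-1} = j)
    B[k, v] = p(y_t = v | x_t = k)
    obs     = observed symbols y_1,...,y_T
    """
    alpha = pi * B[:, obs[0]]             # joint p(y_1, x_1)
    for y in obs[1:]:
        alpha = (alpha @ A) * B[:, y]     # sum over the previous state, then emit
    return alpha.sum()                    # marginalize the final state

# Tiny two-state example
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.2, 0.8]])
B = np.array([[0.9, 0.1],    # state 0 mostly emits symbol 0
              [0.2, 0.8]])   # state 1 mostly emits symbol 1
print(hmm_likelihood(pi, A, B, obs=[0, 1, 1, 0]))
```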
Bias-Variance Tradeoff
High Bias - Low Variance
Score function should
embody the compromise
Low Bias - High Variance
“overfitting” - modeling the
random component
The Curse of Dimensionality
$X \sim \mathrm{MVN}_p(0, I)$
• Gaussian kernel density estimation
• Bandwidth chosen to minimize MSE at the mean
• Suppose we want the relative MSE at the origin to satisfy:

$\frac{E[(\hat{p}(x) - p(x))^2]}{p(x)^2} \le 0.1 \quad \text{at } x = 0$

Dimension | # data points required
1  | 4
2  | 19
3  | 67
6  | 2,790
10 | 842,000
Patterns

Global:
• Clustering via partitioning
• Hierarchical clustering
• Mixture models

Local:
• Outlier detection
• Changepoint detection
• Bump hunting
• Scan statistics
• Association rules
Scan Statistics via Permutation Tests
[Figure: accidents marked as “x” along a curve]
• The curve represents a road
• Each “x” marks an accident
• Red “x” denotes an injury accident
• Black “x” means no injury
Is there a stretch of road where there is an unusually large fraction of injury accidents?
Scan with Fixed Window
• If we know the length of the “stretch of road” that we seek, we could slide a window of that length along the road and find the most “unusual” window location
[Figure: the same accident data with a fixed-length window highlighted]
How Unusual is a Window?
• Let pW and p¬W denote the true probability of being
red inside and outside the window respectively. Let
(xW ,nW) and (x¬W ,n¬W) denote the corresponding
counts
• Use the GLRT for comparing H0: pW = p¬W versus
H1: pW ≠ p¬W
$\lambda = \frac{\left(\frac{x_W + x_{\neg W}}{n_W + n_{\neg W}}\right)^{x_W + x_{\neg W}} \left(1 - \frac{x_W + x_{\neg W}}{n_W + n_{\neg W}}\right)^{(n_W + n_{\neg W}) - (x_W + x_{\neg W})}}{\left(\frac{x_W}{n_W}\right)^{x_W} \left(1 - \frac{x_W}{n_W}\right)^{n_W - x_W} \left(\frac{x_{\neg W}}{n_{\neg W}}\right)^{x_{\neg W}} \left(1 - \frac{x_{\neg W}}{n_{\neg W}}\right)^{n_{\neg W} - x_{\neg W}}}$

• λ measures how unusual a window is
• −2 log λ has an asymptotic chi-squared distribution with 1 df
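A minimal sketch of this likelihood ratio for a single window, written with illustrative names (not from the slides) and working on the log scale to avoid underflow:

```python
import numpy as np

def log_lambda(x_w, n_w, x_out, n_out):
    """Log of the GLRT statistic comparing p_W = p_notW against p_W != p_notW."""
    def log_lik(x, n, p):
        # Bernoulli log-likelihood, treating 0*log(0) as 0
        total = 0.0
        if x > 0:
            total += x * np.log(p)
        if n - x > 0:
            total += (n - x) * np.log(1 - p)
        return total

    p_pooled = (x_w + x_out) / (n_w + n_out)                 # MLE under H0
    log_null = log_lik(x_w + x_out, n_w + n_out, p_pooled)
    log_alt = log_lik(x_w, n_w, x_w / n_w) + log_lik(x_out, n_out, x_out / n_out)
    return log_null - log_alt                                # log(lambda) <= 0

# Example: 8 of 10 injury accidents inside the window, 12 of 60 outside
print(np.exp(log_lambda(8, 10, 12, 60)))
```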
Permutation Test
• Since we look at the smallest λ over all window locations, we need to find the distribution of the smallest λ under the null hypothesis that there are no clusters
• Look at the distribution of the smallest λ over, say, 999 random relabellings of the colors of the x’s

[Figure: four random relabellings of the accident colors, with smallest-λ values 0.376, 0.233, 0.412, 0.222, …]

• Look at the position of the observed smallest λ in this distribution to get the scan statistic p-value (e.g., if the observed smallest λ is 5th smallest, the p-value is 0.005)
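The following is a self-contained sketch of the whole procedure (fixed-length window, GLRT statistic, permutation p-value); it is my own illustration with made-up data, not the slides’ implementation:

```python
import numpy as np

def glrt_lambda(x_w, n_w, x_out, n_out):
    """GLRT statistic lambda for one window (smaller = more unusual)."""
    def loglik(x, n, p):
        out = 0.0
        if x > 0:
            out += x * np.log(p)
        if n - x > 0:
            out += (n - x) * np.log(1 - p)
        return out
    p0 = (x_w + x_out) / (n_w + n_out)
    return np.exp(loglik(x_w + x_out, n_w + n_out, p0)
                  - loglik(x_w, n_w, x_w / n_w)
                  - loglik(x_out, n_out, x_out / n_out))

def smallest_lambda(labels, window):
    """Slide a fixed-length window along the accident labels (1 = injury)."""
    n, total = len(labels), int(np.sum(labels))
    best = 1.0
    for start in range(n - window + 1):
        x_w = int(np.sum(labels[start:start + window]))
        best = min(best, glrt_lambda(x_w, window, total - x_w, n - window))
    return best

def scan_pvalue(labels, window, n_perm=999, seed=0):
    """Permutation p-value for the observed smallest lambda."""
    rng = np.random.default_rng(seed)
    observed = smallest_lambda(labels, window)
    perm_stats = [smallest_lambda(rng.permutation(labels), window) for _ in range(n_perm)]
    more_extreme = sum(s <= observed for s in perm_stats)
    return (1 + more_extreme) / (1 + n_perm)

# Example: 50 accidents along a road, with injuries concentrated in one stretch
labels = np.zeros(50, dtype=int)
labels[20:30] = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]   # a stretch with many injury accidents
print(scan_pvalue(labels, window=10))
```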
Variable Length Window
• No need to use fixed-length window. Examine all
possible windows up to say half the length of the
entire road
[Figure: accidents along the road; markers distinguish fatal from non-fatal accidents]
Spatial Scan Statistics
• Spatial scan statistic uses, e.g., circles instead of line
segments
Spatial-Temporal Scan Statistics
• The spatial-temporal scan statistic uses cylinders, where the height of the cylinder represents a time window
Other Issues
• Poisson model also common (instead of the
bernoulli model)
• Covariate adjustment
• Andrew Moore’s group at CMU: efficient
algorithms for scan statistics
Software: SaTScan + others
http://www.satscan.org
http://www.phrl.org
http://www.terraseer.com
Association Rules: Support and Confidence

[Venn diagram: customers who buy beer, customers who buy diapers, and customers who buy both]

• Find all the rules Y ⇒ Z with minimum confidence and support
– support, s: probability that a transaction contains {Y & Z}
– confidence, c: conditional probability that a transaction containing Y also contains Z

Transaction ID | Items Bought
2000 | A, B, C
1000 | A, C
4000 | A, D
5000 | B, E, F

Let minimum support be 50% and minimum confidence 50%; then we have:
– A ⇒ C (50%, 66.6%)
– C ⇒ A (50%, 100%)
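A small sketch (with illustrative names) that computes support and confidence over the four transactions above:

```python
transactions = {
    2000: {"A", "B", "C"},
    1000: {"A", "C"},
    4000: {"A", "D"},
    5000: {"B", "E", "F"},
}

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions.values()) / len(transactions)

def confidence(lhs, rhs):
    """support(lhs union rhs) / support(lhs)."""
    return support(set(lhs) | set(rhs)) / support(lhs)

print(support({"A", "C"}))          # 0.5
print(confidence({"A"}, {"C"}))     # 0.666...  -> rule A => C (50%, 66.6%)
print(confidence({"C"}, {"A"}))     # 1.0       -> rule C => A (50%, 100%)
```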
Mining Association Rules — An Example

Transaction ID | Items Bought
2000 | A, B, C
1000 | A, C
4000 | A, D
5000 | B, E, F

Min. support 50%; min. confidence 50%

Frequent Itemset | Support
{A}    | 75%
{B}    | 50%
{C}    | 50%
{A, C} | 50%

For rule A ⇒ C:
support = support({A & C}) = 50%
confidence = support({A & C}) / support({A}) = 66.6%

The Apriori principle: any subset of a frequent itemset must be frequent
Mining Frequent Itemsets: the Key Step
• Find the frequent itemsets: the sets of items that have minimum support
– A subset of a frequent itemset must also be a frequent itemset
• i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets
– Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)
• Use the frequent itemsets to generate association rules.
The Apriori Algorithm
• Join Step: Ck is generated by joining Lk-1 with itself
• Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
• Pseudo-code:
    Ck: candidate itemsets of size k
    Lk: frequent itemsets of size k
    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in database do
            increment the count of all candidates in Ck+1 that are contained in t
        Lk+1 = candidates in Ck+1 with min_support
    end
    return ∪k Lk;
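Below is a compact, runnable Python sketch of the same procedure (my own illustration; names are not from the slides), applied to the four-transaction database used in the worked example that follows:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    n_required = min_support * len(transactions)
    # L1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= n_required}
    all_frequent = dict(frequent)
    k = 1
    while frequent:
        # Join step: build size-(k+1) candidates from items appearing in frequent k-itemsets
        items = sorted({i for s in frequent for i in s})
        candidates = [frozenset(c) for c in combinations(items, k + 1)]
        # Prune step: every k-subset of a candidate must already be frequent
        candidates = [c for c in candidates
                      if all(frozenset(sub) in frequent for sub in combinations(c, k))]
        # Scan the database to count candidates contained in each transaction
        counts = {c: sum(c <= set(t) for t in transactions) for c in candidates}
        frequent = {c: cnt for c, cnt in counts.items() if cnt >= n_required}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

# The worked example below: TIDs 100, 200, 300, 400
db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
print(apriori(db, min_support=0.5))   # includes frozenset({2, 3, 5}) with count 2
```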
The Apriori Algorithm — Example

Minimum support: 2 transactions (50%)

Database D:
TID | Items
100 | 1 3 4
200 | 2 3 5
300 | 1 2 3 5
400 | 2 5

Scan D → C1 (candidate 1-itemsets with counts):
{1}: 2, {2}: 3, {3}: 3, {4}: 1, {5}: 3

L1 (frequent 1-itemsets):
{1}: 2, {2}: 3, {3}: 3, {5}: 3

C2 (candidates from L1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

Scan D → C2 counts:
{1 2}: 1, {1 3}: 2, {1 5}: 1, {2 3}: 2, {2 5}: 3, {3 5}: 2

L2 (frequent 2-itemsets):
{1 3}: 2, {2 3}: 2, {2 5}: 3, {3 5}: 2

C3 (candidates from L2): {2 3 5}

Scan D → L3 (frequent 3-itemsets):
{2 3 5}: 2
Association Rule Mining: A Road Map
• Boolean vs. quantitative associations (based on the types of values handled)
– buys(x, “SQLServer”) ^ buys(x, “DMBook”) ⇒ buys(x, “DBMiner”) [0.2%, 60%]
– age(x, “30..39”) ^ income(x, “42..48K”) ⇒ buys(x, “PC”) [1%, 75%]
• Single dimension vs. multiple dimensional associations (see examples above)
• Single level vs. multiple-level analysis
– What brands of beers are associated with what brands of diapers?
• Various extensions (thousands!)
Model-based Clustering
$f(x) = \sum_{k=1}^{K} \pi_k f_k(x; \theta_k)$
[Figures (Padhraic Smyth, UCI): scatter plots of Red Blood Cell Volume vs. Red Blood Cell Hemoglobin Concentration for anemia patients and controls, followed by the fitted Gaussian mixture at EM iterations 1, 3, 5, 10, 15, and 25, showing the mixture components converging to the two groups.]
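A minimal sketch of EM for a one-dimensional Gaussian mixture (my own illustration with synthetic data; the anemia data in the figures above are not reproduced here, and in practice a library such as scikit-learn’s GaussianMixture would typically be used):

```python
import numpy as np

def em_gaussian_mixture(x, K=2, n_iter=25, seed=0):
    """EM for a 1-D K-component Gaussian mixture; returns (weights, means, stds)."""
    rng = np.random.default_rng(seed)
    w = np.full(K, 1.0 / K)
    mu = rng.choice(x, size=K, replace=False)
    sigma = np.full(K, np.std(x))
    for _ in range(n_iter):
        # E-step: responsibilities r[i, k] = p(component k | x_i)
        dens = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        r = w * dens
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixing weights, means, and standard deviations
        Nk = r.sum(axis=0)
        w = Nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / Nk
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)
    return w, mu, sigma

# Synthetic example: two overlapping groups
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(3.5, 0.1, 300), rng.normal(3.9, 0.1, 200)])
print(em_gaussian_mixture(x, K=2))
```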
Mixtures of {Sequences, Curves, …}
$p(D_i) = \sum_{k=1}^{K} p(D_i \mid c_k)\, \pi_k$
Generative Model
- select a component ck for individual i
- generate data according to p(Di | ck)
- p(Di | ck) can be very general
- e.g., sets of sequences, spatial patterns, etc
[Note: given p(Di | ck), we can define an EM algorithm]
Example: Mixtures of SFSMs
Simple model for traversal on a Web site
(equivalent to first-order Markov with end-state)
Generative model for large sets of Web users
- different behaviors <=> mixture of SFSMs
EM algorithm is quite simple: weighted counts
WebCanvas: Cadez, Heckerman, et al, KDD 2000
Discussion
• What is data mining? Hard to pin down – who cares?
• Textbook statistical ideas with a new focus on
algorithms
• Lots of new ideas too
Privacy and Data Mining
Ronny Kohavi, ICML 1998
Analyzing Hospital Discharge Data
David Madigan
Rutgers University
Comparing Outcomes Across Providers
•Florence Nightingale wrote in 1863:
“In attempting to arrive at the truth, I have
applied everywhere for information, but in
scarcely an instance have I been able to obtain
hospital records fit for any purposes of
comparison…I am fain to sum up with an urgent
appeal for adopting some uniform system of
publishing the statistical records of hospitals.”
Data
• Data of various kinds are now available; e.g., data concerning all Medicare/Medicaid hospital admissions in the standard UB-92 format, covering >95% of all admissions nationally
• Considerable interest in using these data to compare
providers (hospitals, physician groups, physicians,
etc.)
• In Pennsylvania, large corporations such as Westinghouse and Hershey Foods are a motivating force and use the data to select providers.
Fields in each discharge record include: SYSID, YEAR, QUARTER, MDCHC4, HREGION, MAID, PTSEX, ETHNIC, RACE, PSEUDOID, AGE, AGECAT, PRIVZIP, MKTSHARE, COUNTY, STATE, ADTYPE, ADSOURCE, ADHOUR, ADMDX, ADDOW, DCSTATUS, LOS, DCHOUR, PAF, DCDOW, ECODE, PDX, SDX1–SDX8, PPX, SPX1–SPX5, PPXDOW, SPX1DOW–SPX5DOW, OCCUR1, OCCUR2, BILLTYPE, DRGHOSP, PCMU, DRGHC4, CANCER1, CANCER2, MQSEV, MQNRSP, MQGCLUST, MQGCELL, PROFCHG, TOTALCHG, NONCVCHG, ROOMCHG, ANCLRCHG, DRUGCHG, EQUIPCHG, SPECLCHG, MISCCHG, REFID, ATTID, OPERID, PAYTYPE1–PAYTYPE3, ESTPAYER, NAIC, APRMDC, APRDRG, APRSOI, APRROM.

Pennsylvania Healthcare Cost Containment Council, 2000-1, n=800,000
Risk Adjustment
• Discharge data like these allow for comparisons of,
e.g., mortality rates for CABG procedure across
hospitals.
• Some hospitals accept riskier patients than others; a
fair comparison must account for such differences.
• PHC4 (and many other organizations) use “indirect
standardization”
• http://www.phc4.org
Hospital Responses
p-value computation
n = 463; suppose the actual number of deaths = 40; e = 29.56

p-value $= \sum_{j=40}^{463} \binom{463}{j} \left(\frac{29.56}{463}\right)^{j} \left(1 - \frac{29.56}{463}\right)^{463-j} \approx 0.02$

p-value < 0.05
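A quick way to evaluate this binomial upper tail (a sketch using scipy; variable names are illustrative):

```python
from scipy import stats

n, e, deaths = 463, 29.56, 40
# P(X >= deaths) where X ~ Binomial(n, e/n); sf(k) gives P(X > k), so use deaths - 1
p_value = stats.binom.sf(deaths - 1, n, e / n)
print(p_value)
```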
Concerns
• Ad-hoc groupings of strata
• Adequate risk adjustment for outcomes other
than mortality? Sensitivity analysis? Hopeless?
• Statistical testing versus estimation
• Simpson’s paradox
Hospital A:
Risk Cat. | N   | Rate | Actual Number | Expected Number
Low       | 800 | 1%   | 8             | 8 (1%)
High      | 200 | 8%   | 16            | 10 (5%)
SMR = 24/18 = 1.33; p-value = 0.07

Hospital B:
Risk Cat. | N   | Rate | Actual Number | Expected Number
Low       | 200 | 1%   | 2             | 2 (1%)
High      | 800 | 8%   | 64            | 40 (5%)
SMR = 66/42 = 1.57; p-value = 0.0002
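As a check on the indirect standardization arithmetic above, a small sketch (my own; the p-values here use a Poisson approximation, which need not match the method behind the slide’s figures):

```python
from scipy import stats

def smr_and_pvalue(n_by_stratum, observed_by_stratum, reference_rates):
    """Indirectly standardized mortality ratio and a Poisson upper-tail p-value."""
    expected = sum(n * r for n, r in zip(n_by_stratum, reference_rates))
    observed = sum(observed_by_stratum)
    p_value = stats.poisson.sf(observed - 1, expected)   # P(X >= observed)
    return observed / expected, p_value

# Hospital A: strata (low, high) with reference rates 1% and 5%
print(smr_and_pvalue([800, 200], [8, 16], [0.01, 0.05]))   # SMR = 24/18 = 1.33
# Hospital B
print(smr_and_pvalue([200, 800], [2, 64], [0.01, 0.05]))   # SMR = 66/42 = 1.57
```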
Hierarchical Model
• Patients -> physicians -> hospitals
• Build a model using data at each level and
estimate quantities of interest
Bayesian Hierarchical Model
MCMC via WinBUGS
Goldstein and Spiegelhalter, 1996
Discussion
• Markov chain Monte Carlo + compute power
enable hierarchical modeling
• Software is a significant barrier to the
widespread application of better methodology
• Are these data useful for the study of disease?