CSE5230/DMS/2002/4
Data Mining - CSE5230
Data Mining and Statistics
Clustering Techniques
CSE5230 - Data Mining, 2002
Lecture 4.1
Lecture Outline
 Data Mining and Statistics
 A taxonomy of Data Mining Approaches
» Verification-driven techniques
» Discovery-driven techniques
   Predictive
   Informative
 Regression
 Exploratory Data Analysis
 Automatic Cluster Detection
 The K-Means Technique
 Similarity, Association, Distance
» Types of Variables, Measures of Similarity, Weighting and Scaling
 Agglomerative Techniques
Lecture Objectives
 By the end of this lecture, you should:
 understand the link between the type of pattern being sought and the DM approach chosen
 be able to give examples of verification- and discovery-driven DM techniques, and explain the difference between them
 be able to explain the difference between supervised and unsupervised DM techniques
 give an example of the use of regression
 explain what is meant by cluster detection, and give an example of clusters in data
 understand how the K-means clustering technique works, and use it to do a simple example by hand
 be able to explain the importance of similarity measures for clustering, and why the Euclidean distance between raw data values is often not good enough
The Link between Pattern and Approach
 Data mining aims to reveal knowledge about the data under consideration
 This knowledge takes the form of patterns within the data which embody our understanding of the data
 Patterns are also referred to as structures, models and relationships
 The approach chosen is inherently linked to the pattern revealed
A Taxonomy of Approaches to Data Mining - 1
 It is not expected that all the approaches will work equally well with all data sets
 Visualization of data sets can be combined with, or used prior to, modeling; it assists in selecting an approach and indicating what patterns might be present
A Taxonomy of Approaches to Data Mining - 2
 Verification-driven
» Query and reporting
» Statistical analysis
 Discovery-driven
» Predictive (Supervised): Regression, Classification
» Informative (Unsupervised): Clustering, Association, Deviation detection (outliers)
Verification-driven Data Mining Techniques - 1
 Verification data mining techniques require the user to postulate some hypothesis
 Simple query and reporting, or statistical analysis techniques, then confirm this hypothesis
 Statistics has been neglected to a degree in data mining in comparison to less traditional techniques such as:
» neural networks, genetic algorithms and rule-based approaches to classification
 Many of these “less traditional” techniques also have a statistical interpretation
Verification-driven Data Mining Techniques - 2
 The reasons for this are various:
 Statistical techniques are most useful for well-structured problems
 Many data mining problems are not well-structured:
» the statistical techniques break down or require large amounts of time and effort to be effective
Problems with Statistical Approaches - 1
 Traditional statistical models (e.g. correlation) often highlight linear relationships but not complex non-linear relationships
 Exploring all possible higher-dimensional relationships often (usually) takes an unacceptably long time
 Non-linear statistical methods require knowledge about:
» the type of non-linearity
» the ways in which the variables interact
 This knowledge is often not available in complex multi-dimensional data mining problems
Problems with Statistical Approaches - 2
 Statisticians have traditionally focused on model estimation, rather than model selection
 For these reasons less traditional, more exploratory, techniques are often chosen for modern data mining
 The current high level of interest in data mining centres on many of the newer techniques, which may be termed discovery-driven
 Lessons from statistics should not be forgotten: estimation of uncertainty and checking of assumptions are as important as ever!
Discovery-driven Data Mining Techniques - 1
 Discovery-driven data mining techniques can also be broken down into two broad areas:
 those techniques which are considered predictive, sometimes termed supervised techniques
 those techniques which are considered informative, sometimes termed unsupervised techniques
 Predictive techniques build patterns by making a prediction of some unknown attribute given the values of other known attributes
Discovery-driven Data Mining Techniques - 2
 Informative techniques do not present a solution to a known problem
 they present interesting patterns for consideration by some expert in the domain
 the patterns may be termed “informative patterns”
 The main predictive and informative patterns are:
» Regression
» Classification
» Clustering
» Association
Regression
 Regression is a predictive technique which discovers relationships between input and output patterns, where the values are continuous or real-valued
 Many traditional statistical regression models are linear
 Neural networks, though biologically inspired, are in fact non-linear regression models
 Non-linear relationships occur in many multi-dimensional data mining applications
An Example of a Regression Model - 1
 Consider a mortgage provider that is concerned with retaining mortgages once taken out
 They may also be interested in how profit on individual loans is related to customers paying off their loans at an accelerated rate
 For example, a customer may pay an additional amount each month and thus pay off their loan in 15 years instead of 25 years
 A graph of the relationship between profit and the elapsed time between when a loan is actually paid off and when it was originally contracted to be paid off appears on the next slide
An Example of a Regression Model - 2
[Figure: profit plotted against years early the loan is paid off (0 to 7); a straight line (linear model) and a curved line (non-linear model) are fitted to the data]
An Example of a Regression Model - 3
 The linear regression model (linear in the variables) does not match the real pattern of the data
 The curved line represents what might be produced by a non-linear model (perhaps a neural network, or linear regression on a known non-linear function, which is linear in the parameters)
 This curved line fits the data much better. It could be used as the basis on which to predict profitability
 Decisions on exit fees and penalties for certain behaviors may be based on this kind of analysis
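The linear-versus-curved contrast above can be made concrete with a small sketch. The profit figures below are invented for illustration (the lecture's actual data are not reproduced); the code fits an ordinary least-squares line in plain Python and shows, via the residuals, that a straight line cannot capture a rise-then-fall pattern.

```python
# Illustrative (made-up) data: profit vs. years early the loan is paid off
years_early = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
profit      = [0.0, 5.0, 8.0, 9.0, 8.0, 5.0, 1.0, -4.0]  # rises, then falls

# Ordinary least-squares fit of a straight line y = intercept + slope * x
n = len(years_early)
mean_x = sum(years_early) / n
mean_y = sum(profit) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(years_early, profit))
         / sum((x - mean_x) ** 2 for x in years_early))
intercept = mean_y - slope * mean_x

# Residuals of the linear fit: large at both ends, because the true
# relationship is curved, not straight
residuals = [y - (intercept + slope * x) for x, y in zip(years_early, profit)]
```

A non-linear model (e.g. a quadratic in `years_early`, still linear in its parameters) would drive these residuals down substantially.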
Exploratory Data Analysis (EDA)
 Classical statistics has a dogma that the data may not be viewed prior to modeling [ElP1996]
 the aim is to avoid choosing biased hypotheses
 During the 1970s the term Exploratory Data Analysis (EDA) was used to express the notion that both the choice of model and hints as to appropriate approaches could be data-driven
 Elder and Pregibon describe the dichotomy thus:
“On the one side the argument was that hypotheses and the like must not be biased by choosing them on the basis of what the data seemed to be indicating. On the other side was the belief that pictures and numerical summaries of data are necessary in order to understand how rich a model the data can support.”
EDA and the Domain Expert - 1
 It is a very hard problem to include “common sense” based on some knowledge of the domain in automated modeling systems
 chance discoveries occur when exploring data that may not have occurred otherwise
 these can also change the approach to the subsequent modeling
EDA and the Domain Expert - 2
 The obstacles to entirely automating the process are:
 It is hard to quantify a procedure to capture “the unexpected” in plots
 Even if this could be accomplished, one would need to describe how this maps into the next analysis step in the automated procedure
 What is needed is a way to represent meta-knowledge about the problem at hand and the procedures commonly used
An Interactive Approach to DM
 A domain expert is someone who has meta-knowledge about the problem
 An interactive exploration and a querying and/or visualization system guided by a domain expert goes beyond current statistical methods
 Current thinking on statistical theory recognizes such an approach as being potentially able to provide a more effective way of discovering knowledge about a data set
Automatic Cluster Detection
 If there are many competing patterns, a data set can appear to contain just noise
 Subdividing a data set into clusters where patterns can be more easily discerned can overcome this
 When we have no idea how to define the clusters, automatic cluster detection methods can be useful
 Finding clusters is an unsupervised learning task
Example: The Hertzsprung-Russell diagram
[Figure: luminosity (Sun = 1) plotted against temperature (from 40,000 down to 2,500 degrees Kelvin), showing three clusters: Red Giants, the Main Sequence, and White Dwarves]
Automatic Cluster Detection example
 The Hertzsprung-Russell diagram, which graphs a star's luminosity against its temperature, reveals three clusters
 It is interesting to note that each of the clusters has a different relationship between luminosity and temperature
 In most data mining situations the variables to consider and the clusters that may be formed are not so easily determined
The K-Means Technique
 K, the number of clusters that are to be formed, must be decided before beginning
 Step 1
» Select K data points to act as the seeds (or initial centroids)
 Step 2
» Each record is assigned to the centroid which is nearest, thus forming a cluster
 Step 3
» The centroids of the new clusters are then calculated. Go back to Step 2
 This is continued until the clusters stop changing
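The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the seeds are simply the first K points (rather than being chosen at random), and the function and variable names are invented for this sketch.

```python
import math

def kmeans(points, k, max_iter=100):
    # Step 1: take the first k points as the seeds (initial centroids)
    centroids = [points[i] for i in range(k)]
    for _ in range(max_iter):
        # Step 2: assign each record to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [math.dist(p, c) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Step 3: recompute each centroid as the mean of its cluster;
        # stop when the centroids (and hence the clusters) no longer change
        new_centroids = [
            tuple(sum(x) / len(c) for x in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Two obvious groups: three points near the origin, two near (8.5, 8.75)
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.0, 0.5)]
centroids, clusters = kmeans(points, k=2)
```

Running this by hand, as the lecture objectives suggest, is instructive: after the first assignment the centroids move, the second assignment reshuffles one point, and the third pass changes nothing, so the loop stops.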
Assign Each Record to the Nearest Centroid
[Figure: records plotted against X1 and X2, each assigned to its nearest centroid]
Calculate the New Centroids
[Figure: new centroid positions computed from the assigned records, axes X1 and X2]
Determine the New Cluster Boundaries
[Figure: updated cluster boundaries after recomputing the centroids, axes X1 and X2]
Similarity, Association and Distance
 The method just described assumes that each record can be described as a point in a metric space
 This is not easily done for many data sets (e.g. categorical and some numeric variables)
 The records in a cluster should have a natural association, so a measure of similarity is required
 Euclidean distance is often used, but it is not always suitable
 Euclidean distance treats changes in each dimension equally, but in databases changes in one field may be more important than changes in another (Mahalanobis distance is often a big improvement)
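The point about unequal dimensions can be shown numerically. The sketch below (invented column names and values) contrasts raw Euclidean distance with a simple per-dimension rescaling by standard deviation, which is the diagonal-covariance special case of the Mahalanobis distance; the full Mahalanobis distance would also account for correlations between dimensions.

```python
import math
import statistics

# Two fields with very different spreads: raw Euclidean distance is
# dominated almost entirely by the income axis
income  = [20_000.0, 35_000.0, 50_000.0, 80_000.0]
age     = [25.0, 30.0, 45.0, 60.0]
records = list(zip(income, age))

def euclidean(p, q):
    return math.dist(p, q)

# Divide each dimension by its standard deviation before measuring,
# so a "typical" change counts the same in every field
stds = [statistics.stdev(dim) for dim in zip(*records)]

def scaled_distance(p, q):
    return math.sqrt(sum(((a - b) / s) ** 2 for a, b, s in zip(p, q, stds)))

a, b = records[0], records[1]
```

Here `euclidean(a, b)` is roughly 15,000 (essentially the income gap alone), while `scaled_distance(a, b)` is below 1, with both fields contributing comparably.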
Types of Variables
 Categories
» e.g. Food Group: Grain, Dairy, Meat, etc.
 Ranks
» e.g. Food Quality: Premium, High Grade, Medium, Low
 Intervals
» e.g. The distance between temperatures
 True Measures
» The measures have a meaningful zero point, so ratios have meaning as well as distances
Measures of Similarity
 Euclidean distance
 Angle between two vectors (from origin to data point)
 The number of features in common
 Mahalanobis distance
 and many more...
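The second measure in the list, the angle between two vectors, is usually computed as the cosine of that angle. A minimal sketch (with made-up points) shows why it measures something different from Euclidean distance: two records with the same mix of values at different magnitudes point in the same direction from the origin.

```python
import math

def cosine_similarity(p, q):
    # cos(angle) = (p . q) / (|p| |q|); 1.0 means the same direction
    dot = sum(a * b for a, b in zip(p, q))
    return dot / (math.hypot(*p) * math.hypot(*q))

# q is p scaled by 10: identical direction, so cosine similarity is 1.0,
# yet the two points are far apart in Euclidean terms
p, q = (1.0, 2.0), (10.0, 20.0)
```

Which behaviour is wanted depends on the domain: for, say, spending profiles, direction (the mix) may matter more than magnitude (the total).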
Weighting and Scaling
 Weighting allows some variables to assume greater importance than others
» The domain expert must decide if certain variables deserve a greater weighting
» Statistical weighting techniques also exist
 Scaling attempts to apply a common range to variables so that differences are comparable between variables
» This can also be statistically based
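Two statistically based scaling schemes are common enough to sketch here (the salary figures are invented): min-max scaling, which maps a variable into [0, 1], and z-score scaling, which gives it zero mean and unit standard deviation.

```python
import statistics

salary = [20_000.0, 30_000.0, 50_000.0, 100_000.0]

# Min-max scaling: smallest value -> 0, largest -> 1
lo, hi = min(salary), max(salary)
minmax = [(v - lo) / (hi - lo) for v in salary]

# Z-score scaling: subtract the mean, divide by the standard deviation
mu, sigma = statistics.mean(salary), statistics.pstdev(salary)
zscores = [(v - mu) / sigma for v in salary]
```

After either transformation, a difference of 0.5 in salary is directly comparable to a difference of 0.5 in any other scaled variable, which is exactly what a distance-based clustering method needs.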
Variants of the K-Means Technique
 There are problems with the simple K-means method:
» It does not deal well with overlapping clusters
» The clusters can be pulled off centre by outliers
» Records are either in or out of a cluster, so there is no notion of the likelihood of being in a particular cluster or not
 A Gaussian Mixture Model varies the approach already outlined by attaching a weighting, based on a probability distribution, to records which are close to or distant from the centroids initially chosen. There is then less chance that outliers will distort the situation. Each record contributes to some degree to each of the centroids
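The soft-assignment idea can be illustrated in one dimension. In the sketch below (illustrative centroids and spread, not from the lecture), each record receives a weight for every centroid, proportional to a Gaussian density centred there; the weights sum to 1, so every record contributes to every centroid rather than belonging exclusively to one.

```python
import math

def gaussian(x, mean, std):
    # Density of a 1-D Gaussian at x
    return (math.exp(-((x - mean) ** 2) / (2 * std ** 2))
            / (std * math.sqrt(2 * math.pi)))

centroids = [0.0, 10.0]  # two cluster centres, illustrative values
std = 3.0                # a shared, fixed spread for this sketch

def responsibilities(x):
    # The weight of record x toward each centroid, normalized to sum to 1
    dens = [gaussian(x, c, std) for c in centroids]
    total = sum(dens)
    return [d / total for d in dens]

r_mid = responsibilities(5.0)   # midway: shared roughly 50/50
r_near = responsibilities(1.0)  # near one centre: weighted mostly to it
```

In a full Gaussian Mixture Model the means, spreads, and mixing proportions would themselves be re-estimated from these weights on each pass (the EM algorithm); this sketch shows only the weighting step.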
Agglomerative Techniques - 1
 A true unsupervised technique would not pre-determine the number of clusters
 A hierarchical technique would offer a hierarchy of clusters from large to small. This can be achieved in a number of ways
 An agglomerative technique starts out by considering each record as a cluster and gradually builds larger clusters by merging the records which are near each other
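The merging process can be sketched directly. This minimal single-linkage version (one of the "number of ways" mentioned above; the data and names are invented) starts with each record as its own cluster and repeatedly merges the closest pair, recording the merge distances, which together describe the cluster tree.

```python
import math

points = [(0.0,), (1.0,), (5.0,), (6.0,), (20.0,)]
clusters = [[p] for p in points]  # every record starts as its own cluster
history = []                      # distance at which each merge happened

def linkage(a, b):
    # Single linkage: the distance between the closest pair of members
    return min(math.dist(p, q) for p in a for q in b)

while len(clusters) > 1:
    # Find the two closest clusters and merge them
    i, j = min(
        ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
        key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
    )
    history.append(linkage(clusters[i], clusters[j]))
    merged = clusters[i] + clusters[j]
    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
```

The rising merge distances in `history` are exactly what a dendrogram plots: cutting the tree where the distances jump sharply yields a natural number of clusters without fixing it in advance.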
Agglomerative Techniques - 2
 An example of an agglomerative cluster tree:
Evaluating Clusters
 We desire clusters whose members are close to each other, and we also want the clusters themselves to be widely spaced
 Variance measures are often used. Ideally, we want to minimize within-cluster variance and maximize between-cluster variance
 But variance is not the only important factor; for example, it will favor not merging clusters in a hierarchical technique
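The two variance criteria above can be computed directly for a given clustering. This sketch (invented 1-D data and assignment) uses sums of squared deviations, the usual within/between decomposition; minimizing the first while maximizing the second is the goal stated above.

```python
def mean(xs):
    return sum(xs) / len(xs)

# Two tight, well-separated clusters, fixed by hand for illustration
clusters = [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]
overall = mean([x for c in clusters for x in c])

# Within-cluster scatter: squared distance of each record to its own centroid
within = sum((x - mean(c)) ** 2 for c in clusters for x in c)

# Between-cluster scatter: squared distance of each centroid to the overall
# mean, weighted by cluster size
between = sum(len(c) * (mean(c) - overall) ** 2 for c in clusters)
```

For these data `within` is small and `between` is large, which is what a good clustering should show; note, as the slide warns, that never merging anything trivially drives `within` to zero, so variance alone cannot decide when to stop.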
Strengths and Weaknesses of Automatic Cluster Detection
 Strengths
» it is an undirected knowledge discovery technique
» it works well with many types of data
» it is relatively simple to carry out
 Weaknesses
» it can be difficult to choose the distance measures and weightings
» it can be sensitive to initial parameter choices
» the clusters found can be difficult to interpret
References
 [ElP1996] Elder, John F. IV and Pregibon, Daryl, "A Statistical Perspective on KDD", in Advances in Knowledge Discovery and Data Mining, U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Eds., AAAI/MIT Press, Cambridge, Mass., 1996.
 [Han1999] Hand, D.J., "Statistics and Data Mining: Intersecting Disciplines", SIGKDD Explorations, Vol. 1, Issue 1, pp. 16-19, 1999.
 [GMP1997] Glymour, Clark, Madigan, David, Pregibon, Daryl and Smyth, Padhraic, "Statistical Themes and Lessons for Data Mining", Data Mining and Knowledge Discovery, Vol. 1, Num. 1, pp. 11-28, 1997.
 [JMF1999] Jain, A. K., Murty, M. N. and Flynn, P. J., "Data clustering: a review", ACM Computing Surveys, Vol. 31, Issue 3, pp. 264-323, 1999.
 [BeL1997] Berry, Michael J. A. and Linoff, Gordon, "Automatic Cluster Detection", Ch. 10 in Data Mining Techniques: For Marketing, Sales, and Customer Support, John Wiley & Sons, 1997.