Introduction to unsupervised data mining

Data Mining Strategies
Scales of Measurement
 Stevens, S.S. (1946). On the theory of
scales of measurement. Science, 103,
677-680
 Four Scales
Categorical (nominal)
Ordinal (only order matters)
Interval (difference between two values is
meaningful)
Ratio (a value of 0.0 means none of the
quantity is present; Kelvin is a ratio scale,
but Celsius and Fahrenheit are not)
What to Know about the
Scales
 The measurement principle involved
for each scale
 Examples of the measurement scales
 Permissible arithmetic operations for
each scale
Categorical Scale Data
 The values of the
scale have no
numeric meaning
 Examples
 Gender
 Ethnicity
 Marital Status
 Hair Color
 Operations
 Counting (only)
Ordinal Scale Data
 The categories can
be ordered
 But the intervals
between adjacent
scale values are
indeterminate
 Examples
 Movie ratings (0, 1 or 2
thumbs up)
 U.S.D.A. beef (good,
choice, prime)
 The rank order of
anything
 Operations
 Counting
 Greater than or less
than operations
Interval Scale Data
 Intervals between
adjacent scale
values are equal
 Examples
 Degrees Fahrenheit
 Most personality
measures
 IQ score
 Operations
 Counting
 Greater than or less
than operations
 Addition and
subtraction of scale
values.
Ratio Scale Data
 There is a rational
zero point for the
scale
 An absolute zero
 Examples
 Degrees Kelvin
 Annual income in
dollars
 Length, distance, size
(cm, kB, inches, km)
 Operations
 All of the above, plus
 Multiplication and
division of scale values.
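The permissible operations for the four scales can be illustrated with a short Python sketch. The sample values and variable names below are made up for illustration; only the pairing of scale and operation comes from the slides.

```python
from collections import Counter

# Categorical (nominal): counting is the only meaningful operation.
hair_colors = ["brown", "black", "brown", "red", "brown"]
color_counts = Counter(hair_colors)  # how many observations fall in each category

# Ordinal: counting plus greater-than / less-than comparisons.
beef_grades = ["good", "choice", "prime"]  # listed from lowest to highest
grade_rank = {grade: rank for rank, grade in enumerate(beef_grades)}
prime_beats_choice = grade_rank["prime"] > grade_rank["choice"]

# Interval: differences are meaningful, ratios are not.
temp_morning_f, temp_noon_f = 50.0, 75.0
warming_f = temp_noon_f - temp_morning_f  # noon is 25 degrees warmer

# Ratio: an absolute zero makes multiplication and division meaningful.
income_a, income_b = 40_000, 80_000
income_ratio = income_b / income_a  # b earns twice as much as a
```

Note that dividing the two Fahrenheit readings would run without error, but the result would have no meaning, because 0 °F is not an absolute zero.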
Variables
 Independent
 Dependent
 Input
 Output
[Figure: input x, function f(x) = 3 + 2x^2, output f(x)]
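A small sketch of the input/output relationship, assuming the flattened expression on the slide reads f(x) = 3 + 2x^2 (the exponent is a reconstruction). The sample inputs are arbitrary.

```python
def f(x: float) -> float:
    """Dependent (output) value computed from the independent (input) value x."""
    return 3 + 2 * x ** 2

inputs = [0, 1, 2, 3]              # independent variable: chosen freely
outputs = [f(x) for x in inputs]   # dependent variable: determined by x
```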
Data Mining Strategies
 Unsupervised
(No dependent
variables used)
 Clustering
 Market Basket
Analysis
 Information
Visualization
 Supervised
(At least one
dependent variable
used for training)
 Classification
 Estimation
 Prediction
Clustering
 Cluster analysis divides data into groups
(clusters) that are meaningful, useful or
both
 Clusters capture the natural structure of
the data
 Clustering allows us to think about the
data at a new level of abstraction
 Cluster analysis is often the first step in a
data mining project
Cluster of Stars
Water Clusters
Cellular Clusters
Cluster Analysis
 Uses information found in the data that
describes objects and their relationships
 Goal: objects within a group should be
similar to one another and different from
objects in other groups
 The greater the similarity within groups
and the greater the difference between
groups, the better the clustering
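One rough way to make "similarity within groups" and "difference between groups" concrete is to compare average distances. The sketch below uses Euclidean distance and two hypothetical 2-D clusters; both the data and the choice of measure are assumptions, not something the slides specify.

```python
from itertools import combinations
from math import dist
from statistics import mean

# Two hypothetical clusters of 2-D points (illustrative data only).
cluster_a = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0)]
cluster_b = [(8.0, 8.0), (9.0, 8.5), (8.5, 9.5)]

def mean_within_distance(cluster):
    """Average distance between all pairs of points inside one cluster."""
    return mean(dist(p, q) for p, q in combinations(cluster, 2))

def mean_between_distance(first, second):
    """Average distance between points drawn from two different clusters."""
    return mean(dist(p, q) for p in first for q in second)

within = mean(mean_within_distance(c) for c in (cluster_a, cluster_b))
between = mean_between_distance(cluster_a, cluster_b)
```

A small `within` value together with a large `between` value suggests a better clustering in the sense described above.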
How Many Clusters?
Three Clusters Identified
Six Clusters Identified
Types of Clustering
 Partitional clustering
 Hierarchical clustering
 Exclusive clustering
 Overlapping clustering
 Fuzzy clustering
 Complete clustering
 Partial clustering
Partitional Clustering
 A division of a set
of data into non-overlapping clusters
 Each data point is
in exactly one
cluster
 Example of
Partitional
Clustering
Hierarchical Clustering
 Permits subclusters
(nested clusters
within clusters)
 Example of
Hierarchical
Clustering
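Nested clusters can be sketched with a tiny bottom-up (agglomerative) procedure on 1-D points. This is only one way to build a hierarchy; the merge rule used here (smallest gap between any two members, often called single linkage) and the sample points are assumptions for illustration.

```python
def agglomerate(points):
    """Repeatedly merge the two closest clusters; return the nested result.

    Clusters are nested tuples, so subclusters stay visible inside their
    parent cluster. The distance between two clusters is the smallest gap
    between any two of their members.
    """
    # Each entry is (nested_structure, flat_list_of_members).
    clusters = [(p, [p]) for p in points]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                gap = min(abs(a - b) for a in clusters[i][1] for b in clusters[j][1])
                if best is None or gap < best[0]:
                    best = (gap, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        # Drop the two merged clusters and append their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]

tree = agglomerate([1.0, 1.2, 5.0, 5.5, 9.0])
```

The resulting `tree` nests the tight pairs (1.0, 1.2) and (5.0, 5.5) inside larger clusters, which is the subcluster structure a hierarchical clustering permits.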
Exclusive Clustering
 Each object is
assigned to a single
cluster
Overlapping Clustering
 Non-exclusive
 A data point can
belong to two or
more clusters
simultaneously
Fuzzy Clustering
 Every data point
belongs to every
cluster with a
membership weight.
 Membership ranges
from 0 (absolutely
does not belong) to 1
(absolutely belongs)
 The sum of the
membership weights
for each point is 1
[Figure: points between clusters C1 and C2, labeled with membership weights C1 75% / C2 25%, C1 40% / C2 60%, and C1 1% / C2 99%]
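Membership weights that lie between 0 and 1 and sum to 1 can be sketched as follows. Making the weights proportional to inverse distance from each cluster center is an illustrative assumption, not a specific published fuzzy clustering algorithm, and the centers are hypothetical.

```python
from math import dist

def membership_weights(point, centers):
    """Return one weight per cluster center; the weights sum to 1.

    Weights are proportional to inverse distance, an illustrative choice.
    """
    distances = [dist(point, c) for c in centers]
    # A point sitting exactly on a center belongs fully to that cluster.
    if 0.0 in distances:
        return [1.0 if d == 0.0 else 0.0 for d in distances]
    inverse = [1.0 / d for d in distances]
    total = sum(inverse)
    return [w / total for w in inverse]

centers = [(0.0, 0.0), (4.0, 0.0)]  # hypothetical centers for C1 and C2
weights = membership_weights((1.0, 0.0), centers)
```

With these made-up centers, a point three times closer to C1 than to C2 gets weights of roughly 0.75 for C1 and 0.25 for C2.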
Complete Clustering
 Assigns every data
point to a cluster
 No data point is
left out of a cluster
Partial Clustering
 Does not assign
every data point to
a cluster
 Some data points
may not belong to
any cluster
 Noise
 Outliers
 Uninteresting
background
 Example: classifying
newspaper stories
 Many fall into common topics
 Global warming
 Terrorism
 Some stories are
unique
 Cable Tie just
graduated from the
CofC in CS
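A partial clustering can be sketched by leaving a point unassigned when it is too far from every cluster center. The distance threshold, the centers, and the function name are hypothetical choices made for illustration.

```python
from math import dist

def assign_or_skip(point, centers, max_distance):
    """Return the index of the nearest center, or None for noise and outliers."""
    nearest = min(range(len(centers)), key=lambda i: dist(point, centers[i]))
    return nearest if dist(point, centers[nearest]) <= max_distance else None

centers = [(0.0, 0.0), (10.0, 10.0)]             # hypothetical cluster centers
points = [(0.5, 0.2), (9.6, 10.3), (5.0, 5.0)]   # the last point is far from both
labels = [assign_or_skip(p, centers, max_distance=2.0) for p in points]
```

Here the third point receives `None`, playing the role of the unique story that fits no cluster.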
Chris Starr:
A centroid is the
center of a
cluster
K-Means
1. Select K points as initial centroids
2. Repeat
   1. Form K clusters by assigning each point
      to its closest centroid.
   2. Recompute the centroid of each
      cluster.
3. Until the centroids do not change
The centroids are repositioned until stable in the K-means algorithm.
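The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions: Euclidean distance, initial centroids passed in by the caller (here taken from the data so the run is repeatable), an iteration cap, and an empty cluster keeping its old centroid. None of these details are specified on the slide.

```python
from math import dist

def k_means(points, initial_centroids, max_iterations=100):
    """Run the basic K-means loop; return (centroids, labels)."""
    centroids = [tuple(c) for c in initial_centroids]
    labels = []
    for _ in range(max_iterations):
        # Assign each point to its closest centroid.
        labels = [min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
                  for p in points]
        # Recompute each centroid as the mean of the points assigned to it.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(old)  # empty cluster: keep its centroid in place
        # Stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, labels = k_means(points, initial_centroids=[points[0], points[3]])
```

On this made-up data the loop settles after one update, with the first three points in one cluster and the last three in the other. Different initial centroids can lead to different final clusters, so in practice the choice in step 1 matters.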
Observe Your Environment
 Start looking for clusters around you
 Think about how the clusters are
formed
Are they hierarchical?
Are they fuzzy clusters?
Are they complete clusters?