The news: INTELLIGENCE
Prof. Dr. Herwig Unger
1
Striving for Intelligence..
“Data is raw and unadorned. Information is data endowed with
some degree of business context and meaning. Intelligence
elevates information to a higher level within an organization.”
-- Bernard Liautaud, e-Business Intelligence
Prof. Dr. Herwig Unger
2
The data pyramid
Wisdom = Knowledge + experience
Knowledge = Information + rules
Information = Data + context
Data
Prof. Dr. Herwig Unger
3
What is "data mining"?
Data mining (also known as knowledge discovery in databases, KDD, or business intelligence):
the extraction of interesting (non-trivial, implicit, previously unknown and potentially useful)
information from data in large databases.
• “Tell me something interesting about the data.”
• “Describe the data.”
Prof. Dr. Herwig Unger
4
What is "data mining"? (2)
What is not data mining?
– Looking up a phone number in a phone directory
– Querying a Web search engine for information about "Amazon"
What is data mining?
– Discovering that certain names are more prevalent in certain US locations
(O'Brien, O'Rourke, O'Reilly, ... in the Boston area)
– Grouping together similar documents returned by a search engine according to
their context (e.g., Amazon rainforest vs. Amazon.com)
Prof. Dr. Herwig Unger
5
Tasks of Data Mining
Prediction Methods
Use some variables to predict unknown or future values of
other variables.
Description Methods
Find human-interpretable patterns that describe the data.
Prof. Dr. Herwig Unger
6
The data gap
There is often information “hidden” in the data that is
not readily evident
Human analysts may take weeks to discover useful information
Much of the data is never analyzed at all
[Chart: "The Data Gap" - total new disk storage (TB) since 1995 grows toward 4,000,000 TB between 1995 and 1999, while the number of analysts stays nearly flat.]
Application in Business
Database analysis and decision support
Market analysis and management
Risk analysis and management
Fraud detection and management
Text analysis - Text Mining
Web analysis - Web Mining
Intelligent query answering
Prof. Dr. Herwig Unger
8
Market analysis and management data
Where are the data sources for analysis?
Credit card transactions, loyalty cards, discount coupons, customer
complaint calls, plus (public) lifestyle studies.
Target marketing:
Find clusters of “model” customers who share the same characteristics:
interest, income level, spending habits, etc.
Determine customer purchasing patterns over time:
Conversion of single to a joint bank account: marriage, etc.
Prof. Dr. Herwig Unger
9
Analysis and risk management
Finance planning and asset evaluation:
cash flow analysis and prediction
time series analysis (trend analysis, etc.)
Resource planning:
summarize and compare the resources and spending
Competition:
Monitor competitors and market directions
Set pricing strategy in a highly competitive market
Prof. Dr. Herwig Unger
10
Fraud detection
Use historical data to build models of fraudulent behavior
and use data mining to help identify similar instances
Example applications:
Auto Insurance: detect a group of people who stage accidents to
collect on insurance
Money Laundering: detect suspicious money transactions
Detecting telephone fraud: detecting suspicious patterns (generate
call model - destination, time, duration)
Prof. Dr. Herwig Unger
11
Others
Sports
Analysis of games in the NBA (e.g., detect the opponent's strategy)
Astronomy
discovery and classification of new objects
Internet
analysis of Web access logs, discovery of user behavior
patterns, analyzing effectiveness of Web marketing, improving
Web site organization
Text
news analysis, medical record analysis, automatic email
sorting and filtering, automatic document categorization
Prof. Dr. Herwig Unger
12
Data mining is interdisciplinary!
Database systems, data warehouse and OLAP
Statistics
Machine learning
Visualization
Information science
High performance computing
Other disciplines:
Neural networks, mathematical modeling, information retrieval,
pattern recognition, ...
Prof. Dr. Herwig Unger
13
From data to knowledge...
Data mining is the core of the knowledge discovery process:
Databases → Data Cleaning / Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation
Main steps of KDD
Learning the application domain:
relevant prior knowledge and goals of application
Data cleaning and preprocessing (may take 60% of the effort!):
creating a target data set (data selection)
finding useful features, generating new features, mapping feature values, discretizing values
Choosing data mining tools/algorithms
summarization, classification, regression, association, clustering.
Data mining: search for patterns of interest
Interpretation: analysis of results.
visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge.
Prof. Dr. Herwig Unger
15
Finally: Data mining and BI
Increasing potential to support business decisions (from bottom to top):
Making Decisions - End User
Data Presentation: Visualization Techniques - Business Analyst
Data Mining: Information Discovery - Data Analyst
Data Exploration: Statistical Analysis, Querying and Reporting - Data Analyst
Data Warehouses / Data Marts: OLAP, MDA - DBA
Data Sources: Paper, Files, Information Providers, Database Systems, OLTP - DBA
Decision support systems (DSS)
Components: Data Base, Model Base, Knowledge Base, and a User Interface.
The Contents of Analytic Applications
Analytic applications typically have no limits; analysts
can see everything
Analytic applications can view and analyze all of an
organization’s data in a number of ways
Analytic applications are powerful, but not as easy to
use as other mechanisms
Prof. Dr. Herwig Unger
18
Analytic Applications
Prof. Dr. Herwig Unger
19
The Purpose of Analytic Applications
Analytic applications free analysts from building
complex models and writing complex queries
Analysts are free to focus on the data and discover
relationships and drivers behind numbers
Rich visualizations allow much easier understanding
of trends and relationships
Prof. Dr. Herwig Unger
20
Benefits of Analytic Applications
Data is significantly easier to analyze
Analysts can focus on analyzing the data and not
writing complex queries
Reports created with analytic applications can be
pushed out to the organization
Graphical tools provide users throughout the
organization with powerful reports and analytic
capabilities
Prof. Dr. Herwig Unger
21
The decision environment
UNCERTAINTY: facts not known → gather information → fact finding / analysis → DATA-BASED support
COMPLEXITY: too many facts → generate information → simulation / synthesis → MODEL-BASED support
EQUIVOCALITY: facts not clear → interpret information → application of expertise → KNOWLEDGE-BASED support
Examples
Standard query:
List all customers whose peak-hour usage revenues have decreased by 20
percent or more
Multidimensional analysis
Slice these customers above by the southwest, west and northwest regions.
Drill down to the largest city in the southwest region.
Modeling and segmentation
What are the demographic characteristics of these customers, and how can
we use that to predict the revenue patterns of new customers ?
Knowledge discovery
A large fraction of these customers have recently responded to the
”Pampers” ads on our web-site.
Prof. Dr. Herwig Unger
23
Some DM applications
Finance Industry (credit cards, insurance, mortgages,
etc.)
Telecommunications
Utilities
Medicine
Search Engines (text data mining)
Law Enforcement
Prof. Dr. Herwig Unger
24
Overview: Data Mining Methods
Classification [Predictive]
Clustering [Descriptive]
Association Rule Discovery [Descriptive]
Sequential Pattern Discovery [Descriptive]
Regression [Predictive]
Deviation Detection [Predictive]
Prof. Dr. Herwig Unger
25
Classification: Definition
Given a collection of records (training set):
each record contains a set of attributes; one of the attributes is the
class.
Find a model for class attribute as a function of the values of
other attributes.
Goal: previously unseen records should be assigned a class
as accurately as possible.
A test set is used to determine the accuracy of the model. Usually, the
given data set is divided into training and test sets, with training set
used to build the model and test set used to validate it.
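To make the train/test workflow concrete, here is a minimal sketch using scikit-learn; the tiny feature matrix (refund flag, marital-status code, taxable income) and its labels are invented for illustration and are not taken from the slides.

# Hedged sketch: toy data, a scikit-learn classifier, and a train/test split.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X = [[1, 0, 125], [0, 1, 100], [0, 0, 70], [1, 1, 120], [0, 2, 95],
     [0, 1, 60], [1, 2, 220], [0, 0, 85], [0, 1, 75], [0, 0, 90]]
y = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

# Split into a training set (to build the model) and a test set (to validate it).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)   # learn the classifier from the training set
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))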
Prof. Dr. Herwig Unger
26
Examples of Classification Task
Predicting tumor cells as benign or malignant
Classifying credit card transactions
as legitimate or fraudulent
Classifying secondary structures of protein
as alpha-helix, beta-sheet, or random
coil
Categorizing news stories as finance,
weather, entertainment, sports, etc
Prof. Dr. Herwig Unger
27
Classification Techniques
Decision Tree based Methods
Rule-based Methods
Memory based reasoning
Neural Networks
Naïve Bayes and Bayesian Belief Networks
Support Vector Machines
Prof. Dr. Herwig Unger
28
Classification Example
(Attribute types: Refund - categorical, Marital Status - categorical, Taxable Income - continuous, Cheat - class label)

Training Set:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set (class unknown):
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

The training set is used to learn a classifier (the model), which is then applied to the test set.
Classification: Application 1
Direct Marketing
Goal: Reduce cost of mailing by targeting a set of consumers
likely to buy a new cell-phone product.
Approach:
• Use the data for a similar product introduced before.
• We know which customers decided to buy and which decided
otherwise. This {buy, don’t buy} decision forms the class attribute.
• Collect various demographic, lifestyle, and company-interaction
related information about all such customers.
– Type of business, where they stay, how much they earn, etc.
• Use this information as input attributes to learn a classifier model.
From [Berry & Linoff] Data Mining Techniques, 1997
Prof. Dr. Herwig Unger
30
Classification: Application 2
Fraud Detection
Goal: Predict fraudulent cases in credit card transactions.
Approach:
• Use credit card transactions and the information on its account-holder
as attributes.
– When does a customer buy, what does he buy, how often he
pays on time, etc
• Label past transactions as fraud or fair transactions. This forms the
class attribute.
• Learn a model for the class of the transactions.
• Use this model to detect fraud by observing credit card transactions
on an account.
Prof. Dr. Herwig Unger
31
Classification: Application 3
Customer Attrition/Churn:
Goal: To predict whether a customer is likely to be lost to a
competitor.
Approach:
• Use detailed records of transactions with each of the past and
present customers to find attributes.
– How often the customer calls, where he calls, what time of the day he calls most, his financial status, marital status,
etc.
• Label the customers as loyal or disloyal.
• Find a model for loyalty.
From [Berry & Linoff] Data Mining Techniques, 1997
Prof. Dr. Herwig Unger
32
Classification: Application 4
Sky Survey Cataloging
Goal: To predict class (star or galaxy) of sky objects,
especially visually faint ones, based on the telescopic survey
images (from Palomar Observatory).
– 3000 images with 23,040 x 23,040 pixels per image.
Approach:
• Segment the image.
• Measure image attributes (features) - 40 of them per object.
• Model the class based on these features.
• Success Story: Could find 16 new high red-shift quasars, some
of the farthest objects that are difficult to find!
From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996
Prof. Dr. Herwig Unger
33
Classifying Galaxies
Courtesy: http://aps.umn.edu
Class: stage of formation (Early, Intermediate, Late)
Attributes: image features, characteristics of the light waves received, etc.
Data size: 72 million stars, 20 million galaxies; Object Catalog: 9 GB; Image Database: 150 GB
Another Example of Decision Tree
Training set (the same Refund / Marital Status / Taxable Income / Cheat data as before):
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

A decision tree that fits these data, splitting on MarSt first:
MarSt = Married → NO
MarSt = Single, Divorced → Refund?
  Refund = Yes → NO
  Refund = No → TaxInc?
    TaxInc < 80K → NO
    TaxInc > 80K → YES

There could be more than one tree that fits the same data!
Decision Tree Classification Task
Training set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Learn Model: induce a decision tree from the training set. Apply Model: use the tree to classify the test set.
Apply Model to Test Data
Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Start from the root of the tree and follow the branches that match the record:
Refund?  Yes → NO;  No → MarSt?
MarSt?   Single, Divorced → TaxInc? (< 80K → NO, > 80K → YES);  Married → NO

For this record: Refund = No → MarSt = Married → leaf NO. Assign Cheat to "No".
Decision Tree Induction
Many Algorithms:
Hunt’s Algorithm (one of the earliest)
CART
ID3, C4.5
SLIQ, SPRINT
Prof. Dr. Herwig Unger
44
General Structure of Hunt’s Algorithm
Let Dt be the set of training
records that reach a node t
General Procedure:
If Dt contains records that all belong to the same class yt, then t
is a leaf node labeled as yt
If Dt is an empty set, then t is a
leaf node labeled by the default
class, yd
If Dt contains records that
belong to more than one class,
use an attribute test to split the
data into smaller subsets.
Recursively apply the
procedure to each subset.
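As a rough illustration of the recursive procedure above, here is a minimal Python sketch for categorical attributes; the record format, the helper hunt(), and the tiny example data are hypothetical, and the attribute test is chosen naively (no impurity measure), unlike a full implementation.

# Minimal sketch of Hunt's recursive procedure for categorical attributes.
# Records are (attribute-dict, class-label) pairs; names are illustrative only.
from collections import Counter

def hunt(records, attributes, default_class):
    if not records:                                   # Dt is empty -> leaf labeled with the default class yd
        return default_class
    classes = {label for _, label in records}
    if len(classes) == 1:                             # all records share one class yt -> leaf labeled yt
        return classes.pop()
    if not attributes:                                # no attribute test left -> majority class
        return Counter(label for _, label in records).most_common(1)[0][0]
    attr = attributes[0]                              # pick an attribute test (naive choice here)
    majority = Counter(label for _, label in records).most_common(1)[0][0]
    tree = {"attr": attr, "children": {}}
    for value in {rec[attr] for rec, _ in records}:   # split Dt into smaller subsets
        subset = [(rec, label) for rec, label in records if rec[attr] == value]
        tree["children"][value] = hunt(subset, attributes[1:], majority)
    return tree

records = [({"Refund": "Yes", "MarSt": "Single"}, "No"),
           ({"Refund": "No", "MarSt": "Married"}, "No"),
           ({"Refund": "No", "MarSt": "Single"}, "Yes")]
print(hunt(records, ["Refund", "MarSt"], "No"))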
[Illustration: the Refund / Marital Status / Taxable Income / Cheat training set from the earlier slides, with Dt the subset of records that reach node t.]
Hunt’s Algorithm
Applied to the Refund / Marital Status / Taxable Income / Cheat training set, the tree grows in steps:

Step 1: a single leaf predicting the default class - Don't Cheat.
Step 2: split on Refund: Yes → Don't Cheat; No → Don't Cheat (still impure).
Step 3: under Refund = No, split on Marital Status: Single, Divorced → Cheat; Married → Don't Cheat.
Step 4: under Single, Divorced, split on Taxable Income: < 80K → Don't Cheat; >= 80K → Cheat.
Tree Induction
Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
Issues
Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
Determine when to stop splitting
Prof. Dr. Herwig Unger
47
How to Specify Test Condition?
Depends on attribute types
Nominal
Ordinal
Continuous
Depends on number of ways to split
2-way split
Multi-way split
Prof. Dr. Herwig Unger
48
Splitting Based on Nominal Attributes
Multi-way split: Use as many partitions as distinct values.
CarType → Family | Sports | Luxury
Binary split: Divides values into two subsets.
Need to find optimal partitioning.
CarType → {Sports, Luxury} | {Family}   OR   CarType → {Family, Luxury} | {Sports}
49
Splitting Based on Ordinal Attributes
Multi-way split: Use as many partitions as distinct values.
Size → Small | Medium | Large
Binary split: Divides values into two subsets.
Need to find optimal partitioning.
Size → {Small, Medium} | {Large}   OR   Size → {Medium, Large} | {Small}
What about this split?  Size → {Small, Large} | {Medium}
50
Splitting Based on Continuous Attributes
Different ways of handling
Discretization to form an ordinal categorical attribute
• Static – discretize once at the beginning
• Dynamic – ranges can be found by equal interval
bucketing, equal frequency bucketing
(percentiles), or clustering.
Binary Decision: (A < v) or (A ≥ v)
• consider all possible splits and find the best cut
• can be more compute-intensive
Prof. Dr. Herwig Unger
51
How to determine the Best Split
Before Splitting: 10 records of class 0,
10 records of class 1
Which test condition is the best?
Prof. Dr. Herwig Unger
54
How to determine the Best Split
Greedy approach:
Nodes with homogeneous class distribution are preferred
Need a measure of node impurity:
Non-homogeneous → high degree of impurity; Homogeneous → low degree of impurity
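One widely used impurity measure is the Gini index, 1 − Σ p_i², computed from the class frequencies p_i in a node; the following small sketch is illustrative and not taken from the slides.

# Gini index of a node: 0 for a pure (homogeneous) node, larger for impure nodes.
from collections import Counter

def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini(["C0"] * 10 + ["C1"] * 10))  # non-homogeneous node: 0.5 (maximum for two classes)
print(gini(["C0"] * 10))                # homogeneous node: 0.0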
Prof. Dr. Herwig Unger
55
Clustering Definition
Given a set of data points, each having a set of attributes, and a
similarity measure among them, find clusters such that
Data points in one cluster are more similar to one another.
Data points in separate clusters are less similar to one another.
Similarity Measures:
Euclidean Distance if attributes are continuous.
Other Problem-specific Measures.
Prof. Dr. Herwig Unger
56
Illustrating Clustering
Euclidean-distance-based clustering in 3-D space: intracluster distances are minimized, intercluster distances are maximized.
57
Clustering: Application 1
Market Segmentation:
Goal: subdivide a market into distinct subsets of customers where
any subset may conceivably be selected as a market target to be
reached with a distinct marketing mix.
Approach:
• Collect different attributes of customers based on their geographical
and lifestyle related information.
• Find clusters of similar customers.
• Measure the clustering quality by observing buying patterns of
customers in same cluster vs. those from different clusters.
Prof. Dr. Herwig Unger
58
Clustering: Application 2
Document Clustering:
Goal: To find groups of documents that are similar to each
other based on the important terms appearing in them.
Approach: To identify frequently occurring terms in each
document. Form a similarity measure based on the
frequencies of different terms. Use it to cluster.
Gain: Information Retrieval can utilize the clusters to relate a
new document or search term to clustered documents.
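A minimal sketch of this approach, assuming scikit-learn and a handful of invented toy documents: term frequencies (here TF-IDF weights) give the document vectors, and a clustering algorithm groups similar documents.

# Hedged sketch: toy documents, TF-IDF vectors, K-means clustering.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["amazon rainforest rain trees jungle",
        "amazon online shop order books",
        "jungle trees rain wildlife",
        "shop order delivery books"]

X = TfidfVectorizer().fit_transform(docs)        # frequency-based document vectors
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)                                    # rainforest documents vs. shopping documents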
Prof. Dr. Herwig Unger
59
Illustrating Document Clustering
Clustering Points: 3204 Articles of Los Angeles Times.
Similarity Measure: How many words are common in these documents
(after some word filtering).
Category        Total Articles   Correctly Placed
Financial       555              364
Foreign         341              260
National        273              36
Metro           943              746
Sports          738              573
Entertainment   354              278
Clustering of S&P Stock Data
Observe Stock Movements every day.
Clustering points: Stock-{UP/DOWN}
Similarity Measure: Two points are more similar if the events
described by them frequently happen together on the same day.
We used association rules to quantify a similarity measure.
Discovered clusters (and their industry group):
1. Applied-Matl-DOWN, Bay-Network-DOWN, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-DOWN, Tellabs-Inc-DOWN, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN → Technology1-DOWN
2. Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN → Technology2-DOWN
3. Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN → Financial-DOWN
4. Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP → Oil-UP
61
What is NOT clustering analysis?
Supervised classification
Have class label information
Simple segmentation
Dividing students into different registration groups alphabetically, by
last name
Results of a query
Groupings are a result of an external specification
Graph partitioning
Some mutual relevance and synergy, but areas are not identical
Prof. Dr. Herwig Unger
62
The Notion of a Cluster can be Ambiguous
How many clusters? The same set of points can plausibly be grouped into two, four, or six clusters.
Prof. Dr. Herwig Unger
63
Types of clustering
A clustering is a set of clusters
Important distinction between hierarchical and
partitional sets of clusters
Partitional Clustering
A division of data objects into non-overlapping subsets (clusters)
such that each data object is in exactly one subset
Hierarchical clustering
A set of nested clusters organized as a hierarchical tree
Prof. Dr. Herwig Unger
64
Partitional clustering
Original Points
A Partitional Clustering
Prof. Dr. Herwig Unger
65
Hierarchical clustering
[Figures: a traditional hierarchical clustering of points p1-p4 with its dendrogram, and a non-traditional hierarchical clustering with its dendrogram.]
Prof. Dr. Herwig Unger
66
Other Distinctions Between Sets of Clusters
Exclusive versus non-exclusive
In non-exclusive clusterings, points may belong to multiple
clusters.
Can represent multiple classes or ‘border’ points
Fuzzy versus non-fuzzy
In fuzzy clustering, a point belongs to every cluster with some
weight between 0 and 1
Weights must sum to 1
Probabilistic clustering has similar characteristics
Partial versus complete
In some cases, we only want to cluster some of the data
Heterogeneous versus homogeneous
Clusters of widely different sizes, shapes, and densities
Prof. Dr. Herwig Unger
67
Types of Clusters
Well-separated clusters
Center-based clusters
Contiguous clusters
Density-based clusters
Property or Conceptual
Described by an Objective Function
Prof. Dr. Herwig Unger
68
Types of Clusters: Well-Separated
Well-Separated Clusters:
A cluster is a set of points such that any point in a cluster is
closer (or more similar) to every other point in the cluster than to
any point not in the cluster.
3 well-separated clusters
Prof. Dr. Herwig Unger
69
Types of Clusters: Center-Based
Center-based
A cluster is a set of objects such that an object in a cluster is
closer (more similar) to the “center” of a cluster, than to the
center of any other cluster
The center of a cluster is often a centroid, the average of all the
points in the cluster, or a medoid, the most “representative” point
of a cluster
4 center-based clusters
Prof. Dr. Herwig Unger
70
Types of Clusters: Contiguity-Based
Contiguous Cluster (Nearest neighbor or
Transitive)
A cluster is a set of points such that a point in a cluster is closer
(or more similar) to one or more other points in the cluster than
to any point not in the cluster.
8 contiguous clusters
Prof. Dr. Herwig Unger
71
Types of Clusters: Density-Based
Density-based
A cluster is a dense region of points that is separated from other regions of high density by low-density regions.
Used when the clusters are irregular or intertwined, and when
noise and outliers are present.
6 density-based clusters
Prof. Dr. Herwig Unger
72
Types of Clusters: Conceptual Clusters
Shared Property or Conceptual Clusters
Finds clusters that share some common property or represent a particular concept (e.g., 2 overlapping circles).
Prof. Dr. Herwig Unger
73
Types of Clusters: Objective Function
Clusters Defined by an Objective Function
Finds clusters that minimize or maximize an objective function.
Enumerate all possible ways of dividing the points into clusters and
evaluate the 'goodness' of each potential set of clusters by using the
given objective function (NP-hard).
Can have global or local objectives.
• Hierarchical clustering algorithms typically have local objectives
• Partitional algorithms typically have global objectives
A variation of the global objective function approach is to fit the data to a
parameterized model.
• Parameters for the model are determined from the data.
• Mixture models assume that the data is a ‘mixture' of a number of statistical
distributions.
Prof. Dr. Herwig Unger
74
Types of Clusters: Objective Function …
Map the clustering problem to a different domain and
solve a related problem in that domain
Proximity matrix defines a weighted graph, where the nodes
are the points being clustered, and the weighted edges
represent the proximities between points
Clustering is equivalent to breaking the graph into
connected components, one for each cluster.
Want to minimize the edge weight between clusters and
maximize the edge weight within clusters
Prof. Dr. Herwig Unger
75
Characteristics of the Input Data Are Important
Type of proximity or density measure
This is a derived measure, but central to clustering
Sparseness
Dictates type of similarity
Adds to efficiency
Attribute type
Dictates type of similarity
Type of Data
Dictates type of similarity
Other characteristics, e.g., autocorrelation
Dimensionality
Noise and Outliers
Type of Distribution
Prof. Dr. Herwig Unger
76
Clustering Algorithms
K-means and its variants
Hierarchical clustering
Density-based clustering
Prof. Dr. Herwig Unger
77
K-means clustering
Partitional clustering approach
Each cluster is associated with a centroid (center point)
Each point is assigned to the cluster with the closest centroid
Number of clusters, K, must be specified
The basic algorithm is very simple
Prof. Dr. Herwig Unger
78
k-means clustering: details
Initial centroids are often chosen randomly.
Clusters produced vary from one run to another.
The centroid is (typically) the mean of the points in the
cluster.
‘Closeness’ is measured by Euclidean distance, cosine
similarity, correlation, etc.
K-means will converge for common similarity measures
mentioned above.
Most of the convergence happens in the first few iterations.
Often the stopping condition is changed to ‘Until relatively few
points change clusters’
Complexity is O( n * K * I * d )
n = number of points, K = number of clusters,
I = number of iterations, d = number of attributes
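The basic algorithm can be written down in a few lines; the following NumPy sketch (the function name and toy points are illustrative) assigns each point to the closest centroid and recomputes centroids until they stop moving.

# Bare-bones K-means: random initial centroids, assign, recompute, repeat.
import numpy as np

def kmeans(points, k, iterations=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(iterations):
        # assign each point to the cluster with the closest centroid (Euclidean distance)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        assignment = dists.argmin(axis=1)
        # recompute each centroid as the mean of the points assigned to it
        new_centroids = np.array([points[assignment == j].mean(axis=0) if np.any(assignment == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):   # stop when the centroids no longer move
            break
        centroids = new_centroids
    return centroids, assignment

pts = np.array([[0.0, 0.0], [0.1, 0.2], [2.0, 2.0], [2.1, 1.9]])
centroids, labels = kmeans(pts, k=2)
print(centroids, labels)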
Prof. Dr. Herwig Unger
79
Two different K-means Clusterings
[Plots: the same set of original points clustered two ways by K-means - one run finds the optimal clustering, another converges to a sub-optimal clustering.]
Prof. Dr. Herwig Unger
80
Importance of Choosing Initial Centroids
[Plot: the clusters found by K-means after 6 iterations for one choice of initial centroids.]
Importance of Choosing Initial Centroids
[Plots: snapshots of the K-means clusters at iterations 1 through 6, showing how the final result depends on the initial centroids.]
Association Rule Discovery: Definition
Given a set of records, each of which contains some number of items from a given collection,
produce dependency rules that will predict the occurrence of an item based on occurrences of other items.
TID   Items
1     Bread, Coke, Milk
2     Beer, Bread
3     Beer, Coke, Diaper, Milk
4     Beer, Bread, Diaper, Milk
5     Coke, Diaper, Milk

Rules discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
83
Association Rule Discovery: Application 1
Marketing and Sales Promotion:
Let the rule discovered be
{Bagels, … } --> {Potato Chips}
Potato Chips as consequent => Can be used to determine what
should be done to boost its sales.
Bagels in the antecedent => Can be used to see which products
would be affected if the store discontinues selling bagels.
Bagels in antecedent and Potato chips in consequent => Can be
used to see what products should be sold with Bagels to
promote sale of Potato chips!
Prof. Dr. Herwig Unger
84
Association Rule Discovery: Application 2
Supermarket shelf management.
Goal: To identify items that are bought together by
sufficiently many customers.
Approach: Process the point-of-sale data collected with
barcode scanners to find dependencies among items.
A classic rule: if a customer buys diapers and milk, then he is very likely to buy
beer.
• So, don’t be surprised if you find six-packs stacked next to
diapers!
Prof. Dr. Herwig Unger
85
Association Rule Discovery: Application 3
Inventory Management:
Goal: A consumer appliance repair company wants to anticipate
the nature of repairs on its consumer products and keep the
service vehicles equipped with the right parts to reduce the number of
visits to consumer households.
Approach: Process the data on tools and parts required in previous
repairs at different consumer locations and discover the co-occurrence patterns.
Prof. Dr. Herwig Unger
86
Definition: Frequent Itemset
Itemset
A collection of one or more items, e.g., {Milk, Bread, Diaper}
k-itemset: an itemset that contains k items
Support count (σ)
Frequency of occurrence of an itemset, e.g., σ({Milk, Bread, Diaper}) = 2
Support (s)
Fraction of transactions that contain an itemset, e.g., s({Milk, Bread, Diaper}) = 2/5
Frequent Itemset
An itemset whose support is greater than or equal to a minsup threshold

Example transactions:
TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke
87
Definition: Association Rule
Association Rule
An implication expression of the form X → Y, where X and Y are itemsets
Example: {Milk, Diaper} → {Beer}

Rule Evaluation Metrics (for the transactions listed above)
Support (s): fraction of transactions that contain both X and Y
Confidence (c): measures how often items in Y appear in transactions that contain X

Example: {Milk, Diaper} ⇒ {Beer}
s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 = 0.67
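A direct translation of these two formulas into Python over the example transactions; the helper name support_count is illustrative.

# Support and confidence of {Milk, Diaper} -> {Beer} on the toy transactions above.
transactions = [{"Bread", "Milk"},
                {"Bread", "Diaper", "Beer", "Eggs"},
                {"Milk", "Diaper", "Beer", "Coke"},
                {"Bread", "Milk", "Diaper", "Beer"},
                {"Bread", "Milk", "Diaper", "Coke"}]

def support_count(itemset):
    return sum(1 for t in transactions if itemset <= t)      # sigma(itemset)

X, Y = {"Milk", "Diaper"}, {"Beer"}
s = support_count(X | Y) / len(transactions)                 # 2/5 = 0.4
c = support_count(X | Y) / support_count(X)                  # 2/3 ≈ 0.67
print(s, round(c, 2))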
Association Rule Mining Task
Given a set of transactions T, the goal of association
rule mining is to find all rules having
support ≥ minsup threshold
confidence ≥ minconf threshold
Brute-force approach:
List all possible association rules
Compute the support and confidence for each rule
Prune rules that fail the minsup and minconf thresholds
⇒ Computationally prohibitive!
Prof. Dr. Herwig Unger
89
Mining Association Rules
TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke
Example of Rules:
{Milk,Diaper} → {Beer} (s=0.4, c=0.67)
{Milk,Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper,Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk,Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk,Beer} (s=0.4, c=0.5)
{Milk} → {Diaper,Beer} (s=0.4, c=0.5)
Observations:
• All the above rules are binary partitions of the same itemset:
{Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but
can have different confidence
• Thus, we may decouple the support and confidence requirements
Prof. Dr. Herwig Unger
90
Mining Association Rules
Two-step approach:
1. Frequent Itemset Generation: generate all itemsets whose support ≥ minsup
2. Rule Generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
Frequent itemset generation is still computationally expensive
Prof. Dr. Herwig Unger
91
Frequent Itemset Generation
[Figure: the itemset lattice over items A-E, from the null itemset at the top down to ABCDE.]
Given d items, there are 2^d possible candidate itemsets.
92
Frequent Itemset Generation
Brute-force approach:
Each itemset in the lattice is a candidate frequent itemset
Count the support of each candidate by scanning the database
Transactions:
TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke

Match each transaction against every candidate.
Complexity ~ O(N M w) ⇒ expensive, since M = 2^d !!!
Prof. Dr. Herwig Unger
93
Computational Complexity
Given d unique items:
Total number of itemsets = 2^d
Total number of possible association rules:

R = \sum_{k=1}^{d-1} \left[ \binom{d}{k} \times \sum_{j=1}^{d-k} \binom{d-k}{j} \right] = 3^d - 2^{d+1} + 1

If d = 6, R = 602 rules
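The closed form can be sanity-checked by brute-force enumeration; the small script below (illustrative only) counts every rule X → Y with non-empty, disjoint X and Y over d items and compares it with 3^d − 2^(d+1) + 1.

# Enumerate all association rules over d items and compare with the closed form.
from itertools import combinations

def count_rules(d):
    items = range(d)
    total = 0
    for k in range(1, d):                      # size of the antecedent X
        for X in combinations(items, k):
            rest = [i for i in items if i not in X]
            for j in range(1, len(rest) + 1):  # size of the consequent Y
                total += len(list(combinations(rest, j)))
    return total

d = 6
print(count_rules(d), 3**d - 2**(d + 1) + 1)   # both print 602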
Frequent Itemset Generation Strategies
Reduce the number of candidates (M)
Complete search: M = 2^d
Use pruning techniques to reduce M
Reduce the number of transactions (N)
Reduce size of N as the size of itemset increases
Used by DHP and vertical-based mining algorithms
Reduce the number of comparisons (NM)
Use efficient data structures to store the candidates or
transactions
No need to match every candidate against every transaction
Prof. Dr. Herwig Unger
95
Reducing Number of Candidates
Apriori principle:
If an itemset is frequent, then all of its subsets must also be
frequent
Apriori principle holds due to the following property of the
support measure:
∀X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)
Support of an itemset never exceeds the support of its subsets
This is known as the anti-monotone property of support
Prof. Dr. Herwig Unger
96
Illustrating Apriori Principle
[Figure: the itemset lattice over A-E; once an itemset such as AB is found to be infrequent, all of its supersets are pruned from the search.]
97
Illustrating Apriori Principle
Items (1-itemsets), minimum support count = 3:
Bread 4, Coke 2, Milk 4, Beer 3, Diaper 4, Eggs 1

Pairs (2-itemsets) - no need to generate candidates involving Coke or Eggs:
{Bread, Milk} 3, {Bread, Beer} 2, {Bread, Diaper} 3, {Milk, Beer} 2, {Milk, Diaper} 3, {Beer, Diaper} 3

Triplets (3-itemsets):
{Bread, Milk, Diaper} 3

If every subset is considered: 6C1 + 6C2 + 6C3 = 41 candidates.
With support-based pruning: 6 + 6 + 1 = 13 candidates.
Apriori Algorithm
Method:
Let k=1
Generate frequent itemsets of length 1
Repeat until no new frequent itemsets are identified
• Generate length (k+1) candidate itemsets from length k
frequent itemsets
• Prune candidate itemsets containing subsets of length k that
are infrequent
• Count the support of each candidate by scanning the DB
• Eliminate candidates that are infrequent, leaving only those
that are frequent
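A compact, unoptimized sketch of this loop over the small transaction database used in these slides; the support counting is a plain database scan, and the helper names are illustrative.

# Hedged Apriori sketch: generate, prune, count, eliminate, repeat.
from itertools import combinations

transactions = [{"Bread", "Milk"},
                {"Bread", "Diaper", "Beer", "Eggs"},
                {"Milk", "Diaper", "Beer", "Coke"},
                {"Bread", "Milk", "Diaper", "Beer"},
                {"Bread", "Milk", "Diaper", "Coke"}]
minsup = 3   # minimum support count

def support(itemset):
    return sum(1 for t in transactions if itemset <= t)

# k = 1: frequent 1-itemsets
items = {i for t in transactions for i in t}
frequent = [{frozenset([i]) for i in items if support(frozenset([i])) >= minsup}]

k = 1
while frequent[-1]:
    # generate (k+1)-candidates from frequent k-itemsets, prune those with an infrequent k-subset
    candidates = {a | b for a in frequent[-1] for b in frequent[-1] if len(a | b) == k + 1}
    candidates = {c for c in candidates
                  if all(frozenset(s) in frequent[-1] for s in combinations(c, k))}
    # count support by scanning the database and keep only the frequent candidates
    frequent.append({c for c in candidates if support(c) >= minsup})
    k += 1

for level, sets in enumerate(frequent, start=1):
    for s in sets:
        print(level, sorted(s), support(s))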
Reducing Number of Comparisons
Candidate counting:
Scan the database of transactions to determine the support of
each candidate itemset
To reduce the number of comparisons, store the candidates in a
hash structure
• Instead of matching each transaction against every candidate, match
it against candidates contained in the hashed buckets
Transactions:
TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke
Generate Hash Tree
Suppose you have 15 candidate itemsets of length 3:
{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5},
{3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
You need:
• Hash function
• Max leaf size: max number of itemsets stored in a leaf node (if number of
candidate itemsets exceeds max leaf size, split the node)
[Figure: the resulting candidate hash tree - the hash function sends items 1, 4, 7 to the left branch, 2, 5, 8 to the middle, and 3, 6, 9 to the right; the 15 candidate 3-itemsets end up in the leaves.]
Association Rule Discovery: Hash tree
[Figures: traversing the candidate hash tree - at each level an item is hashed on 1, 4 or 7 (left branch), on 2, 5 or 8 (middle branch), or on 3, 6 or 9 (right branch), and the traversal continues recursively until a leaf of candidate itemsets is reached.]
Subset Operation
Given a transaction t, what
are the possible subsets of
size 3?
Prof. Dr. Herwig Unger
105
Subset Operation Using Hash Tree
Given transaction t = {1 2 3 5 6}, its size-3 subsets are enumerated level by level: 1+ {2 3 5 6}, 2+ {3 5 6}, 3+ {5 6}; then 12+ {3 5 6}, 13+ {5 6}, 15+ {6}, and so on.

[Figures: each prefix is hashed into the candidate hash tree, so the transaction is matched against only 11 of the 15 candidates.]
Factors Affecting Complexity
Choice of minimum support threshold
lowering support threshold results in more frequent itemsets
this may increase number of candidates and max length of frequent
itemsets
Dimensionality (number of items) of the data set
more space is needed to store support count of each item
if number of frequent items also increases, both computation and I/O
costs may also increase
Size of database
since Apriori makes multiple passes, run time of algorithm may increase
with number of transactions
Average transaction width
transaction width increases with denser data sets
This may increase max length of frequent itemsets and traversals of hash
tree (number of subsets in a transaction increases with its width)
Prof. Dr. Herwig Unger
109
Rule Generation
Given a frequent itemset L, find all non-empty subsets
f ⊂ L such that f → L – f satisfies the minimum
confidence requirement
If {A,B,C,D} is a frequent itemset, candidate rules:
ABC → D,  ABD → C,  ACD → B,  BCD → A,
A → BCD,  B → ACD,  C → ABD,  D → ABC,
AB → CD,  AC → BD,  AD → BC,  BC → AD,  BD → AC,  CD → AB
If |L| = k, then there are 2^k - 2 candidate association rules (ignoring L → ∅ and ∅ → L)
Prof. Dr. Herwig Unger
110
Rule Generation
How to efficiently generate rules from frequent
itemsets?
In general, confidence does not have an anti-monotone
property
c(ABC →D) can be larger or smaller than c(AB →D)
But confidence of rules generated from the same itemset
has an anti-monotone property
e.g., L = {A,B,C,D}:
c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
• Confidence is anti-monotone w.r.t. number of items on the RHS
of the rule
Prof. Dr. Herwig Unger
111
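As a sketch of rule generation, the snippet below enumerates the binary partitions f → L − f of one frequent itemset over the toy transactions used earlier and keeps only the rules meeting a minimum confidence; the names and thresholds are illustrative.

# Generate rules f -> L - f from one frequent itemset L and filter by confidence.
from itertools import combinations

transactions = [{"Bread", "Milk"},
                {"Bread", "Diaper", "Beer", "Eggs"},
                {"Milk", "Diaper", "Beer", "Coke"},
                {"Bread", "Milk", "Diaper", "Beer"},
                {"Bread", "Milk", "Diaper", "Coke"}]

def support_count(itemset):
    return sum(1 for t in transactions if itemset <= t)

def rules_from(L, minconf):
    L = frozenset(L)
    for size in range(1, len(L)):                        # every non-empty proper subset f of L
        for f in combinations(L, size):
            f = frozenset(f)
            conf = support_count(L) / support_count(f)   # c(f -> L - f)
            if conf >= minconf:
                yield sorted(f), sorted(L - f), round(conf, 2)

for antecedent, consequent, conf in rules_from({"Milk", "Diaper", "Beer"}, minconf=0.6):
    print(antecedent, "->", consequent, "confidence", conf)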
Rule Generation for Apriori Algorithm
Lattice of rules
[Figure: a low-confidence rule in the lattice and the rules pruned below it.]
Prof. Dr. Herwig Unger
112
Rule Generation for Apriori Algorithm
Candidate rule is generated by merging two rules that share the same prefix in the rule consequent:
join(CD => AB, BD => AC) would produce the candidate rule D => ABC
Prune rule D => ABC if its subset AD => BC does not have high confidence
Prof. Dr. Herwig Unger
113
Effect of minsup
How to set the appropriate minsup threshold?
If minsup is set too high, we could miss itemsets involving
interesting rare items (e.g., expensive products)
If minsup is set too low, it is computationally expensive and
the number of itemsets is very large
Using a single minimum support threshold may not be
effective
Prof. Dr. Herwig Unger
114
Multiple Minimum Support
How to apply multiple minimum supports?
MS(i): minimum support for item i
e.g.:
MS(Milk)=5%,
MS(Coke) = 3%,
MS(Broccoli)=0.1%,
MS(Salmon)=0.5%
MS({Milk, Broccoli}) = min (MS(Milk), MS(Broccoli))
= 0.1%
Challenge: Support is no longer anti-monotone
• Suppose:
Support(Milk, Coke) = 1.5% and
Support(Milk, Coke, Broccoli) = 0.5%
• {Milk,Coke} is infrequent but {Milk,Coke,Broccoli} is frequent
Prof. Dr. Herwig Unger
115
Multiple Minimum Support
Item   MS(I)    Sup(I)
A      0.10%    0.25%
B      0.20%    0.26%
C      0.30%    0.29%
D      0.50%    0.05%
E      3%       4.20%

[Figure: the lattice of candidate itemsets (AB, AC, ..., CDE) built from items A-E under these item-specific minimum supports.]
Multiple Minimum Support (Liu 1999)
Order the items according to their minimum support
(in ascending order)
e.g.:
MS(Milk)=5%,
MS(Coke) = 3%,
MS(Broccoli)=0.1%, MS(Salmon)=0.5%
Ordering: Broccoli, Salmon, Coke, Milk
Need to modify Apriori such that:
L1 : set of frequent items
F1 : set of items whose support is ≥ MS(1), where MS(1) = min_i MS(i)
C2 : candidate itemsets of size 2 are generated from F1 instead of L1
Prof. Dr. Herwig Unger
118
Multiple Minimum Support (Liu 1999)
Modifications to Apriori:
In traditional Apriori,
• A candidate (k+1)-itemset is generated by merging two
frequent itemsets of size k
• The candidate is pruned if it contains any infrequent subsets
of size k
Pruning step has to be modified:
• Prune only if subset contains the first item
• e.g.: Candidate={Broccoli, Coke, Milk} (ordered according to
minimum support)
• {Broccoli, Coke} and {Broccoli, Milk} are frequent but
{Coke, Milk} is infrequent
– Candidate is not pruned because {Coke,Milk} does not contain
the first item, i.e., Broccoli.
Prof. Dr. Herwig Unger
119
Pattern Evaluation
Association rule algorithms tend to produce too many rules
many of them are uninteresting or redundant
Redundant if, e.g., {A,B,C} → {D} and {A,B} → {D}
have the same support & confidence
Interestingness measures can be used to prune/rank the derived
patterns
In the original formulation of association rules, support &
confidence are the only measures used
Prof. Dr. Herwig Unger
120
Subjective Interestingness Measure
Objective measure:
Rank patterns based on statistics computed from data
e.g., 21 measures of association (support, confidence,
Laplace, Gini, mutual information, Jaccard, etc).
Subjective measure:
Rank patterns according to user’s interpretation
• A pattern is subjectively interesting if it contradicts the
expectation of a user (Silberschatz & Tuzhilin)
• A pattern is subjectively interesting if it is actionable
(Silberschatz & Tuzhilin)
Prof. Dr. Herwig Unger
121
Interestingness via Unexpectedness
Need to model expectation of users (domain knowledge)
Legend: + pattern expected to be frequent; - pattern expected to be infrequent.
Patterns found to be frequent or infrequent are compared against these expectations: patterns matching the expectation are expected patterns, patterns contradicting it are unexpected patterns.
Need to combine expectation of users with evidence from data (i.e.,
extracted patterns)
Prof. Dr. Herwig Unger
122
Interestingness via Unexpectedness
Web Data (Cooley et al 2001)
Domain knowledge in the form of site structure
Given an itemset F = {X1, X2, …, Xk} (Xi : Web pages)
• L: number of links connecting the pages
• lfactor = L / (k × (k-1))
• cfactor = 1 (if the graph is connected), 0 (disconnected graph)
Structure evidence = cfactor × lfactor
Usage evidence = P(X1 ∩ X2 ∩ ... ∩ Xk) / P(X1 ∪ X2 ∪ ... ∪ Xk)
Use Dempster-Shafer theory to combine domain knowledge
and evidence from data
Prof. Dr. Herwig Unger
123
Sequential Pattern Discovery: Definition
Given is a set of objects, with each object associated with its own timeline of
events, find rules that predict strong sequential dependencies among
different events.
Example sequence: (A B) → (C) → (D E)

Rules are formed by first discovering patterns. Event occurrences in the patterns are governed by timing constraints, e.g. a maximum gap (xg), a minimum gap (ng), a window size (ws), and a maximum span (ms):
(A B) → (C) → (D E)  with  gaps ≤ xg, gaps > ng, events within a window ≤ ws, total span ≤ ms
Prof. Dr. Herwig Unger
124
Sequential Pattern Discovery: Examples
In telecommunications alarm logs,
(Inverter_Problem Excessive_Line_Current)
(Rectifier_Alarm) --> (Fire_Alarm)
In point-of-sale transaction sequences,
Computer Bookstore:
(Intro_To_Visual_C) (C++_Primer) -->
(Perl_for_dummies,Tcl_Tk)
Athletic Apparel Store:
(Shoes) (Racket, Racketball) --> (Sports_Jacket)
Prof. Dr. Herwig Unger
125
Regression
Predict a value of a given continuous valued variable based on the
values of other variables, assuming a linear or nonlinear model of
dependency.
Greatly studied in statistics, neural network fields.
Examples:
Predicting sales amounts of a new product based on advertising
expenditure.
Predicting wind velocities as a function of temperature, humidity, air
pressure, etc.
Time series prediction of stock market indices.
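A minimal least-squares sketch of the first example (predicting sales from advertising expenditure); the numbers are invented for illustration and any linear-regression routine would do.

# Fit sales ≈ a * advertising + b by ordinary least squares with NumPy.
import numpy as np

advertising = np.array([1.0, 2.0, 3.0, 4.0, 5.0])       # advertising expenditure (toy values)
sales = np.array([2.1, 3.9, 6.2, 8.1, 9.8])             # observed sales amounts (toy values)

A = np.column_stack([advertising, np.ones_like(advertising)])
(a, b), *_ = np.linalg.lstsq(A, sales, rcond=None)

print(f"model: sales ≈ {a:.2f} * advertising + {b:.2f}")
print("prediction for expenditure 6.0:", a * 6.0 + b)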
Prof. Dr. Herwig Unger
126
Deviation/Anomaly detection
Detect significant deviations from normal behavior
Applications:
Credit Card Fraud Detection
Network Intrusion Detection (typical network traffic at the university level may reach over 100 million connections per day)
Prof. Dr. Herwig Unger
127
Anomaly/Outlier Detection cont.
What are anomalies/outliers?
The set of data points that are considerably different from the remainder of
the data
Variants of Anomaly/Outlier Detection Problems
Given a database D, find all the data points x ∈ D with anomaly scores
greater than some threshold t
Given a database D, find all the data points x ∈ D having the top-n largest
anomaly scores f(x)
Given a database D, containing mostly normal (but unlabeled) data points,
and a test point x, compute the anomaly score of x with respect to D
Applications:
Credit card fraud detection, telecommunication fraud detection, network
intrusion detection, fault detection
Prof. Dr. Herwig Unger
128
Importance of Anomaly Detection
Ozone Depletion History
In 1985 three researchers (Farman,
Gardiner and Shanklin) were puzzled
by data gathered by the British
Antarctic Survey showing that ozone
levels for Antarctica had dropped
10% below normal levels
Why did the Nimbus 7 satellite, which
had instruments aboard for recording
ozone levels, not record similarly low
ozone concentrations?
The ozone concentrations recorded
by the satellite were so low they were
being treated as outliers by a
computer program and discarded!
Sources:
http://exploringdata.cqu.edu.au/ozone.html
http://www.epa.gov/ozone/science/hole/size.html
Prof. Dr. Herwig Unger
129
Anomaly Detection
Challenges
How many outliers are there in the data?
Method is unsupervised
• Validation can be quite challenging (just like for clustering)
Finding needle in a haystack
Working assumption:
There are considerably more “normal” observations than
“abnormal” observations (outliers/anomalies) in the data
Prof. Dr. Herwig Unger
130
Anomaly Detection Schemes
General Steps
Build a profile of the “normal” behavior
• Profile can be patterns or summary statistics for the overall population
Use the “normal” profile to detect anomalies
• Anomalies are observations whose characteristics
differ significantly from the normal profile
Types of anomaly detection
schemes
Graphical & Statistical-based
Distance-based
Model-based
Prof. Dr. Herwig Unger
131
Graphical Approaches
Boxplot (1-D), Scatter plot (2-D), Spin plot (3-D)
Limitations
Time consuming
Subjective
Prof. Dr. Herwig Unger
132
Convex Hull Method
Extreme points are assumed to be outliers
Use convex hull method to detect extreme values
What if the outlier occurs in the middle of the data?
Prof. Dr. Herwig Unger
133
Statistical Approaches
Assume a parametric model describing the distribution of the
data (e.g., normal distribution)
Apply a statistical test that depends on
Data distribution
Parameter of distribution (e.g., mean, variance)
Number of expected outliers (confidence limit)
Prof. Dr. Herwig Unger
134
Statistical-based – Likelihood Approach
Data distribution, D = (1 – λ) M + λ A
M is a probability distribution estimated from data
Can be based on any modeling method (naïve Bayes, maximum
entropy, etc)
A is initially assumed to be uniform distribution
Likelihood at time t:
L_t(D) = \prod_{i=1}^{N} P_D(x_i) = \left( (1-\lambda)^{|M_t|} \prod_{x_i \in M_t} P_{M_t}(x_i) \right) \left( \lambda^{|A_t|} \prod_{x_i \in A_t} P_{A_t}(x_i) \right)

LL_t(D) = |M_t| \log(1-\lambda) + \sum_{x_i \in M_t} \log P_{M_t}(x_i) + |A_t| \log\lambda + \sum_{x_i \in A_t} \log P_{A_t}(x_i)
135
Statistical-based – Likelihood Approach
Assume the data set D contains samples from a
mixture of two probability distributions:
M (majority distribution)
A (anomalous distribution)
General Approach:
Initially, assume all the data points belong to M
Let Lt(D) be the log likelihood of D at time t
For each point xt that belongs to M, move it to A
• Let Lt+1 (D) be the new log likelihood.
• Compute the difference, Δ = Lt(D) – Lt+1 (D)
• If Δ > c (some threshold), then xt is declared as an anomaly and
moved permanently from M to A
Prof. Dr. Herwig Unger
136
Limitations of Statistical Approaches
Most of the tests are for a single attribute
In many cases, data distribution may not be known
For high dimensional data, it may be difficult to
estimate the true distribution
Prof. Dr. Herwig Unger
137
Distance-based Approaches
Data is represented as a vector of features
Three major approaches
Nearest-neighbor based
Density based
Clustering based
Prof. Dr. Herwig Unger
138
Nearest-Neighbor Based Approach
Approach:
Compute the distance between every pair of data points
There are various ways to define outliers:
• Data points for which there are fewer than p neighboring points
within a distance D
• The top n data points whose distance to the kth nearest
neighbor is greatest
• The top n data points whose average distance to the k nearest
neighbors is greatest
Prof. Dr. Herwig Unger
139
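A small NumPy sketch of the second definition above (score each point by the distance to its k-th nearest neighbour and report the top-n); the data set with one planted outlier is invented for illustration.

# Distance-based outlier scores: distance to the k-th nearest neighbour.
import numpy as np

def knn_outlier_scores(points, k):
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    dists.sort(axis=1)                 # column 0 is each point's distance to itself (0)
    return dists[:, k]                 # distance to the k-th nearest neighbour

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 0.5, size=(50, 2)),   # a dense "normal" cluster
                  np.array([[5.0, 5.0]])])            # one planted outlier

scores = knn_outlier_scores(data, k=3)
top_n = np.argsort(scores)[::-1][:2]                  # indices of the 2 highest scores
print(top_n, scores[top_n])                           # the planted outlier (index 50) should rank first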
Outliers in Lower Dimensional Projection
Divide each attribute into φ equal-depth intervals
Each interval contains a fraction f = 1/φ of the records
Consider a k-dimensional cube created by picking
grid ranges from k different dimensions
If attributes are independent, we expect the region to contain a fraction f^k of the records
If there are N points, we can measure the sparsity of a cube D as
S(D) = (n(D) - N·f^k) / sqrt(N·f^k·(1 - f^k)), where n(D) is the number of points actually falling in D
Negative sparsity indicates that the cube contains fewer points than expected
Prof. Dr. Herwig Unger
141
Example
N = 100, φ = 5, f = 1/5 = 0.2, N × f^2 = 4
Prof. Dr. Herwig Unger
142
Density-based: LOF approach
For each point, compute the density of its local neighborhood
Compute local outlier factor (LOF) of a sample p as the average of the
ratios of the density of sample p and the density of its nearest neighbors
Outliers are points with largest LOF value
In the nearest-neighbor approach, p2 is not considered an outlier, while the LOF approach finds both p1 and p2 as outliers.
[Figure: points p1 and p2 relative to clusters of different density.]
Prof. Dr. Herwig Unger
143
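scikit-learn ships a LocalOutlierFactor estimator implementing this idea; the short sketch below uses invented data with a dense cluster, a sparse cluster, and one in-between point, and the neighbourhood size is an illustrative choice.

# LOF-based outlier detection on toy 2-D data.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 0.3, size=(40, 2)),    # dense cluster
                  rng.normal(4, 1.5, size=(40, 2)),    # sparse cluster
                  np.array([[2.0, 2.0]])])             # point lying between the clusters

lof = LocalOutlierFactor(n_neighbors=10)
labels = lof.fit_predict(data)                          # -1 marks points flagged as outliers
print(np.where(labels == -1)[0])                        # indices of the detected outliers
print(lof.negative_outlier_factor_[-1])                 # LOF-based score of the in-between point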
Clustering-Based
Basic idea:
Cluster the data into groups
of different density
Choose points in small
cluster as candidate outliers
Compute the distance
between candidate points
and non-candidate clusters.
• If candidate points are far
from all other non-candidate
points, they are outliers
Prof. Dr. Herwig Unger
144