Topic 4: Data Mining I
Dr. N. Mamoulis, Advanced Database Technologies

Why Data Mining?
- The data explosion problem: automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories. We are drowning in data, but starving for knowledge!
- Data mining (knowledge discovery in databases): extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases.

Data Mining Applications
- Database analysis and decision support
  - Market analysis and management: target marketing, customer relation management, market basket analysis, cross selling, market segmentation
  - Risk analysis and management: forecasting, customer retention, improved underwriting, quality control, competitive analysis
  - Fraud detection and management
- Other applications
  - Text mining (news groups, email, documents) and Web analysis
  - Intelligent query answering (do you want to know more? do you want to see similar results? are you also interested in this?)

Market Analysis and Management (1)
- Where are the data sources for analysis? Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies.
- Target marketing: find clusters of "model" customers who share the same characteristics: interest, income level, spending habits, etc.
- Determine customer purchasing patterns over time, e.g., the conversion of a single to a joint bank account: marriage, etc.
- Cross-market analysis
  - Associations/correlations between product sales
  - Prediction based on the association information

Market Analysis and Management (2)
- Customer profiling: data mining can tell you what types of customers buy what products (clustering or classification)
- Identifying customer requirements
  - Identify the best products for different customers
  - Use prediction to find what factors will attract new customers
- Provides summary information
  - Various multidimensional summary reports
  - Statistical summary information (data central tendency and variation)

Corporate Analysis and Risk Management
- Finance planning and asset evaluation
  - Cash flow analysis and prediction
  - Contingent claim analysis to evaluate assets
  - Cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
- Resource planning: summarize and compare the resources and spending
- Competition
  - Monitor competitors and market directions
  - Group customers into classes and use a class-based pricing procedure
  - Set pricing strategy in a highly competitive market
Fraud Detection and Management (1)
- Applications: widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.
- Approach: use historical data to build models of fraudulent behavior, then use data mining to help identify similar instances.
- Examples
  - Auto insurance: detect a group of people who stage accidents to collect on insurance
  - Money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)
  - Medical insurance: detect professional patients and rings of doctors and rings of references

Fraud Detection and Management (2)
- Detecting inappropriate medical treatment: the Australian Health Insurance Commission identified that in many cases blanket screening tests were requested (saving Australian $1m/yr).
- Detecting telephone fraud
  - Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm.
  - British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud.
- Retail: analysts estimate that 38% of retail shrink is due to dishonest employees.

Other Applications
- Sports: IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for the New York Knicks and the Miami Heat.
- Astronomy: JPL and the Palomar Observatory discovered 22 quasars with the help of data mining.
- Internet Web Surf-Aid: IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior, analyze the effectiveness of Web marketing, improve Web site organization, etc.

Data Mining: A KDD Process
[Figure: data mining is the core of the knowledge discovery process. Databases are cleaned (Data Cleaning) and integrated (Data Integration) into a Data Warehouse; Selection produces the Task-relevant Data, on which Data Mining is performed; the resulting patterns pass through Pattern Evaluation.]

Steps of a KDD Process
- Learning the application domain: relevant prior knowledge and goals of the application
- Creating a target data set: data selection
- Data cleaning and preprocessing (may take 60% of the effort!)
- Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation
- Choosing the functions of data mining: summarization, classification, regression, association, clustering
- Choosing the mining algorithm(s)
- Data mining: search for patterns of interest
- Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.
- Use of discovered knowledge

Why Data Preprocessing?
- Data in the real world is dirty
  - Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  - Noisy: containing errors or outliers
  - Inconsistent: containing discrepancies in codes or names
- No quality data, no quality mining results!
  - Quality decisions must be based on quality data
  - A data warehouse needs consistent integration of quality data
Major Tasks in Data Preprocessing
- Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
- Data integration: integration of multiple databases, data cubes, or files
- Data transformation: normalization and aggregation
- Data reduction: obtains a reduced representation, smaller in volume but producing the same or similar analytical results
- Data discretization: part of data reduction, but of particular importance, especially for numerical data
Data Cleaning
- Data cleaning tasks
  - Fill in missing values
  - Identify outliers and smooth out noisy data
  - Correct inconsistent data
- Example of a table with missing values:

  Age | Income | Religion  | Gender
  23  | 24,200 | Muslim    | M
  39  | ?      | Christian | F
  45  | 45,390 | ?         | F

- Fill missing values using aggregate functions (e.g., the average) or probabilistic estimates on the global value distribution.

Data Integration
- Data integration: combines data from multiple sources into a coherent store
- Schema integration: integrate metadata from different sources
  - Entity identification problem: identify real world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#
- Detecting and resolving data value conflicts
  - For the same real world entity, attribute values from different sources are different
  - Possible reasons: different representations, different scales, e.g., metric vs. British units
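As a concrete sketch of the aggregate-based fill above (plain Python; the row layout and the use of None to mark a missing value are my own assumptions, not from the slides):

    # Fill the missing Income with the average of the known incomes.
    rows = [
        {"Age": 23, "Income": 24200, "Religion": "Muslim",    "Gender": "M"},
        {"Age": 39, "Income": None,  "Religion": "Christian", "Gender": "F"},
        {"Age": 45, "Income": 45390, "Religion": None,        "Gender": "F"},
    ]
    known = [r["Income"] for r in rows if r["Income"] is not None]
    mean_income = sum(known) / len(known)        # (24200 + 45390) / 2 = 34795
    for r in rows:
        if r["Income"] is None:
            r["Income"] = mean_income            # aggregate-based fill
    # A categorical gap (Religion) would instead take the mode, or a value
    # drawn from the global value distribution (the probabilistic variant).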
Data Integration Example

  Data source 1:                      Data source 2:
  Emp-id | Age | Income | Gender      ekey  | age | salary | gender
  10002  | 23  | 24,200 | M           30004 | 43  | 60,234 | M
  10003  | 39  | 32,000 | F           30020 | 23  | 47,000 | F
  10004  | 45  | 40,000 | F           30025 | 56  | 93,030 | M

  Steps: 1. Integrate the attribute names. 2. Convert data types (e.g., incomes in GBP to USD at 1 GBP = 1.5 USD).

  Integrated Data:
  Emp-id | Age | Income | Gender
  10002  | 23  | 36,300 | M
  10003  | 39  | 48,000 | F
  10004  | 45  | 60,000 | F
  30004  | 43  | 60,234 | M
  30020  | 23  | 47,000 | F
  30025  | 56  | 93,030 | M

Data Transformation
- Smoothing: remove noise from data
- Aggregation: summarization, data cube construction
- Generalization: concept hierarchy climbing
- Normalization: scale values to fall within a small, specified range
  - min-max normalization
  - z-score normalization
  - normalization by decimal scaling
- Attribute/feature construction: new attributes constructed from the given ones
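The three normalization schemes listed above can be sketched as follows (minimal Python; the function names are mine, and the incomes are those of the integrated table):

    def min_max(values, new_min=0.0, new_max=1.0):
        # min-max normalization: rescale linearly into [new_min, new_max]
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

    def z_score(values):
        # z-score normalization: subtract the mean, divide by the std deviation
        n = len(values)
        mean = sum(values) / n
        std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
        return [(v - mean) / std for v in values]

    def decimal_scaling(values):
        # divide by the smallest power of 10 that brings all |values| below 1
        j = len(str(int(max(abs(v) for v in values))))
        return [v / 10 ** j for v in values]

    incomes = [36300, 48000, 60000, 60234, 47000, 93030]
    print(min_max(incomes))          # 36,300 -> 0.0, ..., 93,030 -> 1.0
    print(decimal_scaling(incomes))  # all divided by 10^5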
Data Reduction Strategies
- A warehouse may store terabytes of data: complex data analysis/mining may take a very long time to run on the complete data set.
- Data reduction obtains a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results.
- Data reduction strategies
  - Data cube aggregation (reduce to summary data)
  - Dimensionality reduction (as in multimedia data)
  - Numerosity reduction (model data using functions or summarize them using histograms)
Discretization and Concept Hierarchy
- Discretization: reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values. E.g., ages 0-10 → 1, ages 11-20 → 2, etc.
- Concept hierarchies: reduce the data by collecting and replacing low level concepts (such as numeric values for the attribute age) by higher level concepts (such as young, middle-aged, or senior). E.g., ages 0-12 → child, ages 13-20 → teenager, etc.

Summary of the KDD process
- All steps of the KDD process are important and crucial for the effectiveness of discovery.
- Data mining is the most interesting step, due to the challenge of the automatic discovery of interesting information.
- Next, we will present some data mining techniques, focusing on the discovery of association rules and the discovery of frequent patterns in long sequences.

Association Rule: Basic Concepts
- Association rule mining: finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
- Given: (1) a database of transactions, where (2) each transaction is a list of items (purchased by a customer in a visit).
- Find: all rules that correlate the presence of one set of items with that of another set of items. E.g., 98% of people who purchase tires and auto accessories also get automotive services done.
- Applications: basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
- Rule form: "Body → Head [support, confidence]"
  - buys(x, "diapers") → buys(x, "beers") [0.5%, 60%]
  - major(x, "CS") ∧ takes(x, "DB") → grade(x, "A") [1%, 75%]

Applications: Examples
- * ⇒ Maintenance Agreement (what should the store do to boost Maintenance Agreement sales?)
- Home Electronics ⇒ * (what other products should the store stock up on?)
- Attached mailing in direct marketing
- Detecting "ping-pong"ing of patients, faulty "collisions"

Formal Definition
Given:
- a (potentially unknown) set of literals I = {i1, i2, …, in}, called items,
- a set of transactions D, where each transaction T ∈ D is a subset of I, i.e., T ⊆ I,
- a support threshold s, 0 < s ≤ 1, and
- a confidence threshold c, 0 < c ≤ 1.
Find:
- all rules of the form X → Y, where X ⊂ I, Y ⊂ I, X ∩ Y = ∅, such that (i) at least s% of the transactions in D contain X ∪ Y, and (ii) at least c% of the transactions in D which contain X also contain Y.

General Procedure for Mining Association Rules
- Find all frequent itemsets F, where for each Z ∈ F at least s% of the transactions in D contain Z.
- Generate association rules from F as follows. Let X ∪ Y ∈ F and X ∈ F. Then conf(X→Y) = supp(X ∪ Y)/supp(X). If conf(X→Y) ≥ c then the rule X→Y is generated, because:
  - supp(X→Y) = supp(X ∪ Y) ≥ s, and
  - conf(X→Y) ≥ c.
- Therefore mining association rules reduces to mining frequent itemsets and correlations between them.
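A minimal sketch of this rule-generation step (Python; the dict-of-supports representation and the names are mine; the supports below are those of the four-transaction Apriori example that follows):

    from itertools import combinations

    def generate_rules(supports, c):
        # supports: frozenset itemset -> support (fraction of transactions)
        rules = []
        for z, supp_z in supports.items():
            for r in range(1, len(z)):
                for x in map(frozenset, combinations(z, r)):
                    conf = supp_z / supports[x]   # X is frequent since X ⊂ Z
                    if conf >= c:
                        rules.append((x, z - x, supp_z, conf))
        return rules

    supports = {frozenset({1}): 0.50, frozenset({2}): 0.75, frozenset({3}): 0.75,
                frozenset({5}): 0.75, frozenset({1, 3}): 0.50, frozenset({2, 3}): 0.50,
                frozenset({2, 5}): 0.75, frozenset({3, 5}): 0.50,
                frozenset({2, 3, 5}): 0.50}
    for x, y, s_xy, conf in generate_rules(supports, 0.7):
        print(sorted(x), "->", sorted(y), f"[sup {s_xy:.2f}, conf {conf:.2f}]")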
Finding Frequent Itemsets – The Apriori Algorithm
- The most important part of the mining process is to find all itemsets with minimum support s in the database D.
- Apriori is an influential algorithm for this task.
- It is based on the following observation: an itemset l = {i1, i2, …, in} can be frequent only if all subsets of l are frequent. In other words, supp(l) ≥ s only if supp(l') ≥ s for all l' ⊂ l.
- Thus the frequent itemsets are found in a level-wise manner: first the frequent itemsets L1 with only one item are found; then the candidate itemsets C2 of size 2 are generated from them and verified, and so on.

The Apriori Algorithm
- Join step: Ck is generated by joining Lk-1 with itself.
- Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.
- Pseudo-code (Ck: candidate itemsets of size k; Lk: frequent itemsets of size k):

  L1 = {frequent items};
  for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
      increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
  end
  return ∪k Lk;
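A compact executable rendering of this level-wise loop (a Python sketch under my own naming; itemsets are frozensets and s is an absolute support count rather than a percentage):

    from itertools import combinations

    def apriori(transactions, s):
        transactions = [frozenset(t) for t in transactions]
        items = {i for t in transactions for i in t}
        L = [{frozenset({i}) for i in items
              if sum(i in t for t in transactions) >= s}]          # L1
        while L[-1]:
            # join step: unite pairs of frequent k-itemsets sharing k-1 items
            Ck = {a | b for a in L[-1] for b in L[-1] if len(a | b) == len(a) + 1}
            # prune step: drop candidates having an infrequent k-subset
            Ck = {c for c in Ck if all(frozenset(sub) in L[-1]
                                       for sub in combinations(c, len(c) - 1))}
            # scan D: a candidate's count is the number of transactions containing it
            L.append({c for c in Ck if sum(c <= t for t in transactions) >= s})
        return set().union(*L)

    # The example on the next slide: four transactions, s = 2
    D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
    print(apriori(D, 2))   # singletons 1,2,3,5 plus {1,3},{2,3},{2,5},{3,5},{2,3,5}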
The Apriori Algorithm — Example (s = 2)

Database D:
  TID | Items
  100 | 1 3 4
  200 | 2 3 5
  300 | 1 2 3 5
  400 | 2 5

Scan D → C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
L1: {1}:2, {2}:3, {3}:3, {5}:3

C2 (from L1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D → C2 counts: {1 2}:1, {1 3}:2, {1 5}:1, {2 3}:2, {2 5}:3, {3 5}:2
L2: {1 3}:2, {2 3}:2, {2 5}:3, {3 5}:2

C3 (from L2): {2 3 5}
Scan D → C3 counts: {2 3 5}:2
L3: {2 3 5}:2
How to Generate Candidates?
- Suppose the items in Lk-1 are listed in an order.
- Step 1: self-joining Lk-1

  insert into Ck
  select p.item1, p.item2, …, p.itemk-1, q.itemk-1
  from Lk-1 p, Lk-1 q
  where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

- Step 2: pruning

  forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
      if (s is not in Lk-1) then delete c from Ck

Example of Generating Candidates
- L3 = {abc, abd, acd, ace, bcd}
- Self-joining: L3*L3
  - abcd from abc and abd
  - acde from acd and ace
- Pruning: acde is removed because its subset ade is not in L3 (neither is cde); abcd survives, since abc, abd, acd and bcd are all in L3.
- C4 = {abcd}
How to Count Supports of Candidates?
- Why is counting the supports of candidates a problem?
  - The total number of candidates can be very huge
  - One transaction may contain many candidates
- Method:
  - Candidate itemsets are stored in a hash-tree
  - A leaf node of the hash-tree contains a list of itemsets and counts
  - An interior node contains a hash table
  - Subset function: finds all the candidates contained in a transaction
Example of the hash-tree for C3
[Figure, three slides: a hash-tree over the candidate 3-itemsets 145, 124, 457, 125, 458, 234, 567, 356, 689, 345, 367, 368, 159. The hash function is mod 3, so each interior hash table has the branches 1,4,.. / 2,5,.. / 3,6,..; the root hashes on the 1st item, the next level on the 2nd item and the next on the 3rd item, with the leaves holding the candidate lists. To find the candidates contained in transaction 12345, the subset function descends recursively: look for subsets starting with 1, 2 and 3 (1XX, 2XX, 3XX); within 1XX look for 12X, 13X (null) and 14X; and so on until the matching leaves are reached.]
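The hash-tree exists to implement the subset function efficiently. As a minimal stand-in (my simplification, not the hash-tree itself), a dict keyed by sorted candidate tuples yields the same counts; the tree's advantage is that it narrows the set of candidates each transaction must be checked against:

    from itertools import combinations

    def count_candidates(transactions, Ck, k):
        counts = {tuple(sorted(c)): 0 for c in Ck}
        for t in transactions:
            for sub in combinations(sorted(t), k):   # every k-subset of t
                if sub in counts:
                    counts[sub] += 1
        return counts

    D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
    print(count_candidates(D, [(2, 3, 5)], 3))       # {(2, 3, 5): 2}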
AprioriTid: Use D only for the first pass
- The database is not used after the 1st pass. Instead, a set Ck' is used at each step, Ck' = <TID, {Xk}>, where each Xk is a potentially frequent k-itemset in the transaction with id = TID.
- At each step, Ck' is generated from Ck-1' at the pruning step of constructing Ck, and is used to compute Lk.
- For small values of k, Ck' could be larger than the database!

AprioriTid Example (s = 2)

Database D:
  TID | Items
  100 | 1 3 4
  200 | 2 3 5
  300 | 1 2 3 5
  400 | 2 5

C1' (sets of itemsets):
  100 | {{1}, {3}, {4}}
  200 | {{2}, {3}, {5}}
  300 | {{1}, {2}, {3}, {5}}
  400 | {{2}, {5}}

L1: {1}:2, {2}:3, {3}:3, {5}:3

C2' (sets of itemsets):
  100 | {{1 3}}
  200 | {{2 3}, {2 5}, {3 5}}
  300 | {{1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}}
  400 | {{2 5}}

L2: {1 3}:2, {2 3}:2, {2 5}:3, {3 5}:2

C3' (sets of itemsets):
  200 | {{2 3 5}}
  300 | {{2 3 5}}

L3: {2 3 5}:2

Experiments with Apriori, AprioriTid
- Experiments show that Apriori is faster than AprioriTid, in general.
- This is because the Ck' generated during the early phases of AprioriTid are very large.
- An AprioriHybrid method is proposed that uses Apriori in the early phases and AprioriTid in the later phases.
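One AprioriTid step can be sketched as follows (my own rendering of the description above, not the paper's pseudo-code; itemsets are sorted tuples, and Ck_prime_prev maps each TID to the (k-1)-candidates contained in that transaction):

    def apriori_tid_step(Ck_prime_prev, Ck):
        counts = {c: 0 for c in Ck}
        Ck_prime = {}
        for tid, prev in Ck_prime_prev.items():
            present = set()
            for c in Ck:
                # c was generated by joining c-without-its-last-item with
                # c-without-its-second-to-last item; c occurs in the
                # transaction iff both generators occurred in it
                g1, g2 = c[:-1], c[:-2] + c[-1:]
                if g1 in prev and g2 in prev:
                    present.add(c)
                    counts[c] += 1
            if present:
                Ck_prime[tid] = present
        return Ck_prime, counts

    # C1' of the example (singletons as 1-tuples) and the candidates C2:
    C1_prime = {100: {(1,), (3,), (4,)}, 200: {(2,), (3,), (5,)},
                300: {(1,), (2,), (3,), (5,)}, 400: {(2,), (5,)}}
    C2 = [(1, 2), (1, 3), (1, 5), (2, 3), (2, 5), (3, 5)]
    C2_prime, counts = apriori_tid_step(C1_prime, C2)
    # counts: {(1,3): 2, (2,3): 2, (2,5): 3, (3,5): 2, (1,2): 1, (1,5): 1}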
Is Apriori Fast Enough? — Performance Bottlenecks
- The core of the Apriori algorithm:
  - Use frequent (k-1)-itemsets to generate candidate frequent k-itemsets
  - Use database scans and pattern matching to collect counts for the candidate itemsets
- The bottleneck of Apriori: candidate generation
  - Huge candidate sets: 10^4 frequent 1-itemsets will generate 10^7 candidate 2-itemsets; to discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs to generate 2^100 ≈ 10^30 candidates.
  - Multiple scans of the database: (n+1) scans are needed, where n is the length of the longest pattern.

Methods to Improve Apriori's Efficiency
- Hash-based itemset counting: a k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent
- Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans
- Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
- Sampling: mine a subset of the given data with a lower support threshold, plus a method to determine the completeness
- Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent

Mining Frequent Patterns Without Candidate Generation
- Compress a large database into a compact, Frequent-Pattern tree (FP-tree) structure
  - highly condensed, but complete for frequent pattern mining
  - avoids costly database scans
- Develop an efficient, FP-tree-based frequent pattern mining method
  - A divide-and-conquer methodology: decompose mining tasks into smaller ones
  - Avoid candidate generation: sub-database tests only!

FP-tree Construction (min_support = 3)

  TID | Items bought             | (ordered) frequent items
  100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p}
  200 | {a, b, c, f, l, m, o}    | {f, c, a, b, m}
  300 | {b, f, h, j, o}          | {f, b}
  400 | {b, c, k, s, p}          | {c, b, p}
  500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p}

Item frequencies: f:4, c:4, a:3, b:3, m:3, p:3

Steps:
1. Scan the DB once and find the frequent 1-itemsets (single item patterns)
2. Order the frequent items in frequency descending order
3. Scan the DB again and construct the FP-tree
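The three construction steps translate into a short sketch (Python; the class and field names are mine, not the paper's):

    from collections import Counter

    class Node:
        def __init__(self, item, parent):
            self.item, self.parent = item, parent
            self.count, self.children = 0, {}

    def build_fp_tree(transactions, min_support):
        # Step 1: one scan to count single items
        counts = Counter(i for t in transactions for i in t)
        # Step 2: frequency-descending order of the frequent items (ties are
        # broken alphabetically here; the slides break the f/c tie as f, c,
        # which only permutes the top of the tree)
        order = sorted((i for i, c in counts.items() if c >= min_support),
                       key=lambda i: (-counts[i], i))
        rank = {i: r for r, i in enumerate(order)}
        root = Node(None, None)
        header = {i: [] for i in order}              # item -> node-links
        # Step 3: second scan; insert each transaction's frequent items,
        # in order, along a shared path from the root
        for t in transactions:
            node = root
            for i in sorted((i for i in t if i in rank), key=rank.get):
                child = node.children.get(i)
                if child is None:
                    child = node.children[i] = Node(i, node)
                    header[i].append(child)
                child.count += 1
                node = child
        return root, header

    # The table above with min_support = 3:
    D = [set("facdgimp"), set("abcflmo"), set("bfhjo"),
         set("bcksp"), set("afcelpmn")]
    root, header = build_fp_tree(D, 3)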
FP-tree Construction (cont'd)
[Figure, three slides: the FP-tree grows as the (ordered) transactions are inserted under the root. After TID 100 there is a single path f:1-c:1-a:1-m:1-p:1; TID 200 raises the shared prefix to f:2-c:2-a:2 and adds the branch b:1-m:1 under a:2; TID 300 adds b:1 directly under f:3; TID 400 starts the separate path c:1-b:1-p:1 under the root; TID 500 raises the first path to f:4-c:3-a:3-m:2-p:2. The complete tree is:

  root
  +- f:4
  |  +- c:3 - a:3 -+- m:2 - p:2
  |  |            +- b:1 - m:1
  |  +- b:1
  +- c:1 - b:1 - p:1

A header table (f:4, c:4, a:3, b:3, m:3, p:3) keeps, for each item, node-links chaining all nodes carrying that item.]
Benefits of the FP-tree Structure
- Completeness
  - never breaks a long pattern of any transaction
  - preserves complete information for frequent pattern mining
- Compactness
  - reduces irrelevant information: infrequent items are gone
  - frequency descending ordering: more frequent items are more likely to be shared
  - the tree is never larger than the original database (not counting node-links and counts)
  - Example: for the Connect-4 DB, the compression ratio can be over 100
Mining Frequent Patterns Using the FP-tree
- General idea (divide-and-conquer): recursively grow frequent patterns using the FP-tree.
- Method:
  - For each item, construct its conditional pattern-base, and then its conditional FP-tree
  - Repeat the process on each newly created conditional FP-tree
  - Until the resulting FP-tree is empty, or it contains only one path (a single path generates all the combinations of its sub-paths, each of which is a frequent pattern)

Mining Frequent Patterns Using the FP-tree (cont'd)
- Start with the last item in the order (i.e., p). Follow its node pointers and traverse only the paths containing p. Accumulate all the transformed prefix paths of that item to form its conditional pattern base.
- Conditional pattern base for p: fcam:2, cb:1
- Constructing a new FP-tree based on this pattern base leads to only one branch, c:3. Thus we derive only one frequent pattern containing p: the pattern cp.

Mining Frequent Patterns Using the FP-tree (cont'd)
- Move to the next least frequent item in the order, i.e., m. Follow its node pointers and traverse only the paths containing m. Accumulate all the transformed prefix paths of that item to form its conditional pattern base.
- Conditional pattern base for m: fca:2, fcab:1
- Constructing a new FP-tree based on this pattern base leads to the single path fca:3. From this we derive the frequent patterns fcam, fcm, fam, cam, fm, cm, am.
Properties of the FP-tree for Conditional Pattern Base Construction
- Node-link property: for any frequent item ai, all the possible frequent patterns that contain ai can be obtained by following ai's node-links, starting from ai's head in the FP-tree header.
- Prefix path property: to calculate the frequent patterns for a node ai in a path P, only the prefix sub-path of ai in P needs to be accumulated, and its frequency count should carry the same count as node ai.

Conditional Pattern-Bases for the example

  Item | Conditional pattern-base | Conditional FP-tree
  p    | {(fcam:2), (cb:1)}       | {(c:3)}|p
  m    | {(fca:2), (fcab:1)}      | {(f:3, c:3, a:3)}|m
  b    | {(fca:1), (f:1), (c:1)}  | Empty
  a    | {(fc:3)}                 | {(f:3, c:3)}|a
  c    | {(f:3)}                  | {(f:3)}|c
  f    | Empty                    | Empty
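When a conditional FP-tree collapses to a single path, every combination of its sub-paths is frequent, so the patterns for that item can be enumerated directly (a sketch; the hard-coded path is m's conditional FP-tree from the table above):

    from itertools import combinations

    path, suffix = ["f", "c", "a"], "m"   # conditional FP-tree {(f:3, c:3, a:3)}|m
    patterns = ["".join(combo) + suffix
                for r in range(1, len(path) + 1)
                for combo in combinations(path, r)]
    print(patterns)   # ['fm', 'cm', 'am', 'fcm', 'fam', 'cam', 'fcam']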
Why Is Frequent Pattern Growth Fast?
- Performance studies show that FP-growth is an order of magnitude faster than Apriori, and is also faster than tree-projection.
- Reasoning:
  - No candidate generation, no candidate tests
  - Uses a compact data structure
  - Eliminates repeated database scans
  - The basic operations are counting and FP-tree building

FP-growth vs. Apriori: Scalability with the Support Threshold
[Figure: run time (sec., 0-100) against support threshold (%, 0-3) on data set T25I20D10K, comparing D1 FP-growth runtime with D1 Apriori runtime; Apriori's runtime grows much faster as the threshold decreases.]

FP-growth vs. Tree-Projection: Scalability with the Support Threshold
[Figure: runtime (sec., 0-140) against support threshold (%, 0-2) on data set T25I20D100K, comparing D2 FP-growth with D2 TreeProjection.]

Additional Issues on the FP-tree
- How to use this method for large databases?
  - Partition the database cleverly and build one FP-tree for each partition.
  - Make the FP-tree disk resident (like the B+-tree). However, node-link traversal and accessing of paths may cause many random I/Os.
- Materializing the FP-tree: no need to build it for each query. However, the construction is based on a predefined minimum support threshold.
- Incremental updates on the tree: useful for incremental updates of association rules. However, this is only possible when the minimum support is 1. Another solution is to adjust the minimum support with the updates of the database.
Finding Frequent Episodes in Event Sequences
- Problem definition: given a sequence of events in time and a time window win, find all episodes that appear frequently in all windows of size win in the sequence.
- An episode is defined as a serial or parallel co-occurrence of event types.
- The frequency of an episode α in a sequence S is defined as the number of windows where α appears, divided by the total number of time windows:

  fr(α, S, win) = |{w ∈ W(S, win) | α occurs in w}| / |W(S, win)|

- The problem is similar to mining frequent itemsets in a database of transactions, but the different type of data makes it special.

Finding Frequent Episodes in Event Sequences: An example
[Figure: the event sequence A C E D B C F B A A C D E A A C D C E A C F D A B laid out on a time axis, with sliding windows w1, w2, w3, w4 marked. For example, the episode "C appears after A" is very frequent in this sequence.]

Types of episodes
- Serial episode: F appears after E
- Parallel episode: A appears with B, in any order
- Hybrid serial-parallel episode: there is no total order in the appearance of the elements, but there is a partial order (in the slide's diagram, A and B in any order, combined with F)

Main Algorithm
Like Apriori:
- First find the frequent events (e.g., A, B, C).
- Generate candidate episodes for the next level (e.g., AB, AC, BC). Prune candidates for which a sub-episode is not frequent in the previous level.
- Count the frequencies of the candidates and find the frequent episodes.
- Apply iteratively for next-level episodes.

Counting frequencies of candidates
- Observation: consecutive windows contain similar events, so we do not have to count the episodes in each window from scratch.
[Figure: the example sequence with the overlapping windows w1-w4.]

Counting frequencies of candidates (cont'd)
- While sliding the window, we update the frequencies of episodes considering only the new events which enter the window and the events which are no longer in the window.
- For example, when moving from w1 to w2 we update only the frequencies of episodes containing B (the incoming event).
- This is the main difference; the rest is like Apriori in transactional data.
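For parallel episodes the window counting can be sketched directly (Python; my own simplification in which an episode of two event types occurs in a window iff both types appear in it, and only full windows are counted; WINEPI's actual optimization touches only the episodes affected by the events entering and leaving the window, as described above):

    from collections import Counter
    from itertools import combinations

    def parallel_pair_frequencies(events, win):
        n_windows = len(events) - win + 1
        window = Counter(events[:win])       # event types in the current window
        freq = Counter()
        for t in range(n_windows):
            if t > 0:                        # slide: one event leaves, one enters
                window[events[t - 1]] -= 1
                window[events[t + win - 1]] += 1
            present = sorted(e for e, c in window.items() if c > 0)
            for pair in combinations(present, 2):
                freq[pair] += 1
        return {pair: n / n_windows for pair, n in freq.items()}

    seq = list("ACEDBCFBAACDEAACDCEACFDAB")
    print(parallel_pair_frequencies(seq, win=4))   # e.g., the frequency of {A, B}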
WINEPI
- This method is called WINEPI (mining frequent episodes using a sliding time window).
- If we count parallel episodes (the order of events does not matter), the computation is simple.
- If we count serial episodes (the events are ordered in the episode), the computation is based on automata.
- Hybrid episodes (with partial event orders) are counted hierarchically: they are broken into serial or parallel sub-episodes, the frequencies of these are counted, and the frequency of the hybrid episode is computed by checking the co-existence of the parts.

MINEPI
- Another method (MINEPI) follows a different strategy (not counting frequencies in sliding windows).
- The approach is based on counting minimal occurrences of episodes.
- A minimal occurrence of an episode α is a time interval w = [ts, te) such that (i) α occurs in w and (ii) no subwindow of w contains α.
MINEPI (cont'd)
- The algorithm is again level-wise: the frequent episodes with k events are computed from the frequent episodes with k-1 events.
- At level k the algorithm computes the minimal occurrences of all episodes with k events.
- The level k+1 episodes are found by performing a temporal join of the minimal occurrences of their sub-episodes of size k.
- Example: to find the frequency and minimal occurrences of A→B→C, we join the minimal occurrences of A→B with the minimal occurrences of B→C.
[Figure: the example sequence with the minimal occurrences of A→B and a resulting minimal occurrence of A→B→C marked on the time axis.]
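Minimal occurrences can also be computed directly from the sequence, which is handy for checking the join (a sketch with my own direct method; MINEPI itself derives the level k+1 occurrences by temporally joining those of the sub-episodes):

    def minimal_occurrences(events, episode):
        # candidate = earliest completion of the serial episode from each start
        cands = []
        for ts, e in enumerate(events):
            if e != episode[0]:
                continue
            t = ts
            for sym in episode[1:]:
                t += 1
                while t < len(events) and events[t] != sym:
                    t += 1
            if t < len(events):
                cands.append((ts, t + 1))            # half-open interval [ts, te)
        # [ts, te) is minimal iff no later start achieves the same (or earlier) end
        return [(ts, te) for i, (ts, te) in enumerate(cands)
                if all(te2 > te for _, te2 in cands[i + 1:])]

    print(minimal_occurrences(list("ACBDAB"), "AB"))   # [(0, 3), (4, 6)]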
WINEPI vs. MINEPI
- MINEPI is very efficient in finding frequent episodes, but it may suffer from the excessive space needed to store the minimal occurrences. These may not fit in memory for large problems; especially the early phases of MINEPI can be very expensive.
- However, the temporal join is much more efficient than the counting of WINEPI. Moreover, MINEPI is independent of the window size.
- In general, the two algorithms may produce different results because of their different definitions of frequency.
- WINEPI and MINEPI can be combined to derive frequent episodes efficiently.
References
- Jiawei Han and Micheline Kamber: "Data Mining: Concepts and Techniques", Morgan Kaufmann, 2001.
- Rakesh Agrawal and Ramakrishnan Srikant: "Fast Algorithms for Mining Association Rules in Large Databases", VLDB 1994.
- Jiawei Han, Jian Pei and Yiwen Yin: "Mining Frequent Patterns without Candidate Generation", ACM SIGMOD 2000.
- Heikki Mannila, Hannu Toivonen and A. Inkeri Verkamo: "Discovery of Frequent Episodes in Event Sequences", Data Mining and Knowledge Discovery 1(3): 259-289, November 1997.
- V. Kumar and M. Joshi: "Tutorial on High Performance Data Mining", http://www-users.cs.umn.edu/~mjoshi/hpdmtut/