Frequent Pattern Mining
Toon Calders
Bart Goethals
ADReM research group
Outline
• What is data mining?
- Definition
- Local patterns vs global models
- Supervised vs unsupervised
- What do we do?
• Frequent set mining
• More complex data types
What is data mining?
“the use of sophisticated data analysis tools to
discover previously unknown, valid patterns
and relationships in large data sets.”
[Figure: Data → Information]
Supervised vs Unsupervised
• Supervised:
- data has been annotated
- well-defined task: learn to
annotate new data
E.g.: examples of good/bad customers
• Unsupervised:
- only data has been given
- no annotation
- « find knowledge »
Local vs Global
• Local pattern:
- tells something about a small subset of the data
E.g. « 90% of the customers that purchase beer
also buy chips »
• Global model:
- fits a global model to the data, a summary
E.g.: there is a linear relationship between the amount
spent and the income of the customers
What do we do?
• Pattern mining
- Local
- Unsupervised
• Useful for
- large datasets
- exploration: « what is this data like? »
• Less suitable for
- well-studied and understood problem domains
Outline
• What is data mining?
• Frequent set mining
- Market basket analysis
- Association rules
- Interestingness measures
- Numerical attributes
• More complex data types
Market Basket Analysis
• Data: a collection of customer transactions
• Goal: find sets of products that frequently occur
together
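As a minimal sketch of this goal, the snippet below counts how often each pair of products occurs together and keeps the pairs that reach a threshold. The transactions, the product names, and the threshold `min_support` are all hypothetical values chosen for illustration; the slides do not give a concrete data set.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy transactions; the product names are illustrative only.
transactions = [
    {"beer", "chips", "nuts"},
    {"beer", "chips"},
    {"beer", "diapers"},
    {"chips", "nuts"},
]


def pair_counts(transactions):
    """Count in how many transactions each pair of products occurs together."""
    counts = Counter()
    for basket in transactions:
        # Sorting gives every pair a canonical (alphabetical) order.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


min_support = 2  # assumed threshold: a pair must occur in at least 2 baskets
counts = pair_counts(transactions)
frequent_pairs = {pair: n for pair, n in counts.items() if n >= min_support}
print(frequent_pairs)
```

Only pairs are counted here to keep the example short; larger sets need a search strategy, because the number of candidate sets grows quickly.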
Applications
• Supermarket
- product placement
- special promotions
• Websearch
- which keywords often occur together in
webpages?
• Health care
- frequent sets of symptoms for a disease
Applications
• Basically works for all data that can be
represented as a set of examples/objects
having certain properties
- patient / symptoms
- movies / ratings
- web pages / keywords
- basket / products
- …
Algorithms
• Computationally a very hard problem
- with n products, there are 2^n possible sets of products
• Hundreds of algorithms have been proposed
- for sparse/dense data
- many rows/columns
- data fits/does not fit in memory
- …
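Because the 2^n candidate sets cannot all be checked, most algorithms prune the search. The sketch below shows one well-known strategy, a level-wise search in the style of Apriori. It relies on the fact that a set can only be frequent if all of its subsets are frequent. It is not necessarily the approach the speakers have in mind, and the function name `frequent_itemsets` and the toy baskets are assumptions made for illustration.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise search: return {frozenset: support count} for all frequent sets."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    result = {}
    k = 1
    while candidates:
        # Count the support of every candidate of size k.
        level = {}
        for c in candidates:
            support = sum(1 for t in transactions if c <= t)
            if support >= min_support:
                level[c] = support
        result.update(level)
        # Join frequent k-sets into (k+1)-candidates and keep only those
        # whose k-subsets are all frequent (support is monotone).
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return result


# Hypothetical baskets, for illustration only.
baskets = [
    {"beer", "chips", "nuts"},
    {"beer", "chips"},
    {"beer", "chips", "nuts"},
    {"chips", "nuts"},
]
found = frequent_itemsets(baskets, min_support=2)
for itemset, support in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```

The sketch rescans all transactions for every candidate, which is fine for a toy example but is exactly the cost that the many specialised algorithms try to reduce.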
Association Rules
• Conditional probabilities
X → Y (c%): if X is in the transaction, then there is a
probability of c% that Y is in it as well.
• Based on the frequent sets, associations can be
computed easily:
{ Beer, Chips } → { Snack nuts } (75%)
{ adrem.html, cnts.html } → { islab.html } (80%)
{ rain } → { overcast } (100%)
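The percentage attached to a rule X → Y can be estimated as the support of X ∪ Y divided by the support of X. The sketch below does this directly on a list of transactions. The baskets are hypothetical and were chosen so that the rule comes out at 75%; they are not the data behind the slide.

```python
def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) as support(X ∪ Y) / support(X)."""
    x = frozenset(antecedent)
    xy = x | frozenset(consequent)
    support_x = sum(1 for t in transactions if x <= t)
    support_xy = sum(1 for t in transactions if xy <= t)
    if support_x == 0:
        raise ValueError("antecedent never occurs in the data")
    return support_xy / support_x


# Hypothetical baskets, chosen so that the rule below has confidence 75%.
baskets = [
    {"beer", "chips", "snack nuts"},
    {"beer", "chips", "snack nuts"},
    {"beer", "chips", "snack nuts", "cola"},
    {"beer", "chips"},
    {"chips", "cola"},
]
c = confidence(baskets, {"beer", "chips"}, {"snack nuts"})
print(f"{{beer, chips}} -> {{snack nuts}}: {c:.0%}")
```

If the supports of all frequent sets are already known, the two counts can be looked up instead of recomputed, which is why rules are cheap to derive once the frequent sets are available.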
Interestingness Measures
• Not all association rules are interesting
- Domain knowledge
pregnant → female, rain → overcast
- Redundancy
if A → B (100%), then also AC → B, AD → B, …
- Independence
70% of customers buy product A: X → A (70%), Y → A (70%)
• Too many rules
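The independence case can be made concrete by comparing a rule's confidence with the base rate of its consequent. One common measure for this is lift, the ratio of the two; the slides do not name it, so treat it as an assumed choice. The data below is hypothetical and built so that A occurs in 70% of all baskets, with or without X.

```python
def lift(transactions, antecedent, consequent):
    """Confidence of X -> Y divided by the base rate of Y; 1.0 suggests independence."""
    x = frozenset(antecedent)
    y = frozenset(consequent)
    n = len(transactions)
    support_x = sum(1 for t in transactions if x <= t)
    support_y = sum(1 for t in transactions if y <= t)
    support_xy = sum(1 for t in transactions if (x | y) <= t)
    rule_confidence = support_xy / support_x
    base_rate = support_y / n
    return rule_confidence / base_rate


# Hypothetical data: A is bought in 70% of the baskets, regardless of X.
baskets = [{"X", "A"}] * 7 + [{"X"}] * 3 + [{"A"}] * 7 + [set()] * 3
print(lift(baskets, {"X"}, {"A"}))
```

A rule X → A with 70% confidence looks strong in isolation, but a lift close to 1 shows that X tells us nothing new about A.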
Interestingness Measures
• Incorporating background knowledge
- e.g., via Bayesian network
- only produce rules that deviate from background
knowledge
• Redundancies
- Condensed representations: produce only a nonredundant subset of patterns
Interestingness Measures
• Independence
- statistical significance tests
• χ² (chi-square)
• Careful with conclusions!
1000 tests at significance level 0.05 …
(Bonferroni correction)
• Too many rules
- Constraints
- Top-k mining
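To illustrate the warning about many tests: if 1000 independent rules are each tested at level 0.05 and none of them is real, about 1000 × 0.05 = 50 are still expected to look significant. The Bonferroni correction divides the level by the number of tests. The sketch below computes a Pearson χ² statistic for one rule X → Y from a 2×2 contingency table and compares its p-value with both thresholds. The counts are hypothetical, and the helper names are assumptions made for this example.

```python
import math


def chi_square_2x2(n_xy, n_x, n_y, n):
    """Pearson chi-square statistic for independence of X and Y (1 degree of freedom)."""
    # Rows: X present / absent. Columns: Y present / absent.
    observed = [
        [n_xy, n_x - n_xy],
        [n_y - n_xy, n - n_x - n_y + n_xy],
    ]
    row_totals = [n_x, n - n_x]
    col_totals = [n_y, n - n_y]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat


def p_value_1dof(stat):
    """Upper tail of the chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))


alpha = 0.05
n_tests = 1000
bonferroni_alpha = alpha / n_tests  # per-test level after correction

# Hypothetical counts: 1000 transactions, X in 400, Y in 700, both in 310.
stat = chi_square_2x2(n_xy=310, n_x=400, n_y=700, n=1000)
p = p_value_1dof(stat)
print(f"chi2 = {stat:.2f}, p = {p:.2e}")
print("significant at 0.05:", p < alpha)
print("significant after Bonferroni:", p < bonferroni_alpha)
```

The sketch assumes every expected cell count is positive; a table with an empty row or column would need separate handling.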
Numerical Attributes
• Association rule mining is also possible for
numerical attributes
- discretization: make continuous attributes ordinal
• information loss
• not appropriate if the order between the values is important
- other methods:
• a recent method based on rank correlation measures
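As a sketch of the discretization step, the snippet below maps a numerical attribute onto a few equal-width bins and turns each bin into an item that a frequent set miner could use. Equal-width binning is only one possible scheme and is an assumption here, as are the income values and the item naming.

```python
def equal_width_bins(values, n_bins):
    """Map each value to a bin index 0..n_bins-1 using equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]  # all values identical: a single bin
    # The maximum value would fall just outside the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical incomes, for illustration only.
incomes = [12_000, 18_500, 25_000, 40_000, 41_000, 90_000]
bins = equal_width_bins(incomes, n_bins=3)
items = [f"income_bin={b}" for b in bins]
print(bins)
print(items)
```

The information loss mentioned above is visible in the result: incomes of 12,000 and 25,000 become the same item, and the miner no longer sees that bin 1 lies between bin 0 and bin 2.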
Complex Patterns
• Sets
• Sequences
• Graphs
• Relational Structures
• Generating and counting such patterns becomes much
more complex too!
Sequences
CGATGGGCCAGTCGATACGTCGATGCCGATGTCACGA
Patterns in Sequences
• Substrings
• Regular expressions, e.g. (bb|[^b]{2})
• Partial orders
• Directed acyclic graphs
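For the first two pattern types, counting can be sketched with Python's standard `re` module. The first part counts a substring in the DNA sequence from the Sequences slide; the substring `CGAT` is an arbitrary choice. The second part counts in how many sequences the regular expression from the list occurs; those sequences are hypothetical.

```python
import re

# The DNA sequence shown on the Sequences slide.
dna = "CGATGGGCCAGTCGATACGTCGATGCCGATGTCACGA"

# Substring pattern: number of non-overlapping occurrences.
substring_count = dna.count("CGAT")
print(substring_count)


def sequence_support(pattern, sequences):
    """Number of sequences in which the regular expression matches somewhere."""
    regex = re.compile(pattern)
    return sum(1 for s in sequences if regex.search(s))


# Hypothetical sequences; the pattern is the one from the slide.
sequences = ["abba", "abab", "b", "cabc"]
support = sequence_support(r"(bb|[^b]{2})", sequences)
print(support)
```

Here the support of a pattern is the number of sequences containing it at least once, which mirrors how support is counted for sets in transactions.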
Graphs
Patterns in Graphs
Rules
[Figure: graph patterns annotated with frequencies (f: 4, 5, 7, 8) and rule confidences (0.5, 0.8, 0.57)]
Relational Databases
23
Patterns in RDBs
• Queries
• Query 1:
Select L.drinker, V.bar
From Likes L, Visits V
Where V.drinker = L.drinker
And L.beer = 'Duvel'
Patterns in RDBs
• Query 2:
Select L.drinker, V.bar
From Likes L, Visits V, Serves S
Where V.drinker = L.drinker
And L.beer = 'Duvel'
And S.bar = V.bar
And S.beer = 'Duvel'
Patterns in RDBs
• Association Rule:
Query 1 => Query 2
If a person who likes Duvel visits a bar, then
that bar serves Duvel
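By analogy with association rules on sets, the strength of Query 1 => Query 2 can be estimated as the number of answers to Query 2 divided by the number of answers to Query 1. The sketch below does this with an in-memory SQLite database. The table contents are hypothetical, the column layout is inferred from the queries, and counting distinct (drinker, bar) answers is an assumption about how support is defined.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and rows, for illustration only.
conn.executescript("""
    CREATE TABLE Likes (drinker TEXT, beer TEXT);
    CREATE TABLE Visits (drinker TEXT, bar TEXT);
    CREATE TABLE Serves (bar TEXT, beer TEXT);
    INSERT INTO Likes VALUES ('ann', 'Duvel'), ('bob', 'Duvel'), ('cid', 'Orval');
    INSERT INTO Visits VALUES ('ann', 'bar1'), ('ann', 'bar2'),
                              ('bob', 'bar1'), ('bob', 'bar3'), ('cid', 'bar3');
    INSERT INTO Serves VALUES ('bar1', 'Duvel'), ('bar2', 'Duvel'), ('bar3', 'Orval');
""")

query1 = """
    SELECT DISTINCT L.drinker, V.bar
    FROM Likes L, Visits V
    WHERE V.drinker = L.drinker AND L.beer = 'Duvel'
"""
query2 = """
    SELECT DISTINCT L.drinker, V.bar
    FROM Likes L, Visits V, Serves S
    WHERE V.drinker = L.drinker AND L.beer = 'Duvel'
      AND S.bar = V.bar AND S.beer = 'Duvel'
"""

answers1 = conn.execute(query1).fetchall()
answers2 = conn.execute(query2).fetchall()
rule_confidence = len(answers2) / len(answers1)
print(f"Query 1 => Query 2: {rule_confidence:.0%}")
conn.close()
```

Query 2 only adds conditions to Query 1, so its answers are a subset of those of Query 1 and the ratio always lies between 0 and 1.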