Requirements/Challenges in Data Mining (con't)
• Data source issues: diversity of data types
  – Handling complex types of data
  – Mining information from heterogeneous databases and global information systems
  – Is it possible to expect a DM system to perform well on all kinds of data? (distinct algorithms for distinct data sources)
• Other issues
  – Integration of the discovered knowledge with existing knowledge: a knowledge fusion problem
  – Data glut
    • Are we collecting the right data in the right amount?
    • Distinguishing between the data that is important and the data that is not
© Dr. Osmar R. Zaïane, 2001-2003, Database Management Systems, University of Alberta
Data Mining
• Needing More than just Information Retrieval
• Elementary Concepts
• Patterns and Rules to be Discovered
• Requirements and Challenges
• Association Rule Mining
• Classification
• Clustering

Basic Concepts
A transaction is a set of items: T = {i_a, i_b, …, i_t}.
T ⊂ I, where I is the set of all possible items {i_1, i_2, …, i_n}.
D, the task-relevant data, is a set of transactions.
An association rule is of the form P ⇒ Q, where P ⊂ I, Q ⊂ I, and P ∩ Q = ∅.
Basic Concepts (con't)
P ⇒ Q holds in D with support s, and P ⇒ Q has a confidence c in the transaction set D.
Support(P ⇒ Q) = Probability(P ∪ Q)
Confidence(P ⇒ Q) = Probability(Q | P)

Itemsets
A set of items is referred to as an itemset.
An itemset containing k items is called a k-itemset.
An itemset can also be seen as a conjunction of items (or a predicate).
Rule Measures: Support and Confidence
• Support of a rule P → Q = support of (P ∪ Q) in D
  – s_D(P → Q) = s_D(P ∪ Q): percentage of transactions in D containing P and Q (number of transactions containing P and Q divided by the cardinality of D).
• Confidence of a rule P → Q
  – c_D(P → Q) = s_D(P ∪ Q) / s_D(P): percentage of transactions that contain both P and Q among the transactions that already contain P.
[Figure: Venn diagram of customers who buy bread, customers who buy milk, and customers who buy both]

Strong Rules
• Thresholds:
  – minimum support: minsup
  – minimum confidence: minconf
• Frequent itemset P
  – support of P larger than the minimum support.
• Strong rule P → Q (c%)
  – (P ∪ Q) frequent,
  – c is larger than the minimum confidence.
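The two measures are easy to compute directly. The following is a small sketch (not from the slides; the toy database and item names are invented) of support as the fraction of transactions containing an itemset, and confidence as its conditional version:

```python
# Sketch of the two rule measures; the toy database D is invented.

def support(D, itemset):
    """s_D(itemset): fraction of transactions in D containing the itemset."""
    itemset = frozenset(itemset)
    return sum(1 for t in D if itemset <= t) / len(D)

def confidence(D, P, Q):
    """c_D(P -> Q) = s_D(P ∪ Q) / s_D(P)."""
    return support(D, set(P) | set(Q)) / support(D, P)

D = [frozenset({"bread", "milk"}),
     frozenset({"bread"}),
     frozenset({"milk"}),
     frozenset({"bread", "milk"})]

print(support(D, {"bread", "milk"}))       # 0.5 (2 of 4 transactions)
print(confidence(D, {"bread"}, {"milk"}))  # 0.666... (0.5 / 0.75)
```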
How do we Mine Association Rules?
• Input
  – A database of transactions
  – Each transaction is a list of items (e.g., purchased by a customer in a visit)
• Find all strong rules that associate the presence of one set of items with that of another set of items.
  – Example: 98% of people who purchase tires and auto accessories also get automotive services done.
  – There are no restrictions on the number of items in the head or body of the rule.

Mining Association Rules
Transaction ID | Items Bought
2000           | A, B, C
1000           | A, C
4000           | A, D
5000           | B, E, F
Min. support 50%, min. confidence 50%.

Frequent Itemset | Support
{A}              | 75%
{B}              | 50%
{C}              | 50%
{A, C}           | 50%

For rule {A} ⇒ {C}:
  support = support({A, C}) = 50%
  confidence = support({A, C}) / support({A}) = 66.6%
For rule {C} ⇒ {A}:
  support = support({A, C}) = 50%
  confidence = support({A, C}) / support({C}) = 100.0%
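The numbers on this slide can be checked with a brute-force scan. This sketch (not the course's code) enumerates all itemsets over the four transactions above and keeps those meeting the 50% support threshold:

```python
# Brute-force check of the example: four transactions, minsup = 50%.
from itertools import combinations

D = [frozenset("ABC"), frozenset("AC"), frozenset("AD"), frozenset("BEF")]

def support(itemset):
    return sum(1 for t in D if frozenset(itemset) <= t) / len(D)

items = sorted(set().union(*D))
frequent = {frozenset(c): support(c)
            for k in (1, 2, 3)
            for c in combinations(items, k)
            if support(c) >= 0.5}          # min. support 50%

print(frequent)  # {A}: 0.75, {B}: 0.5, {C}: 0.5, {A,C}: 0.5

print(support("AC") / support("A"))  # confidence of {A} => {C}: 0.666...
print(support("AC") / support("C"))  # confidence of {C} => {A}: 1.0
```

Brute-force enumeration is exponential in the number of items, which is exactly the problem the Apriori algorithm on the following slides addresses.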
Mining Frequent Itemsets: the Key Step
Iteratively find the frequent itemsets, i.e., the sets of items that have minimum support, with cardinality from 1 to k (k-itemsets).
Based on the Apriori principle: any subset of a frequent itemset must also be frequent. E.g., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets.
Use the frequent itemsets to generate association rules.

The Apriori Algorithm
Ck: candidate itemset of size k
Lk: frequent itemset of size k

L1 = {frequent items};
for (k = 1; Lk ≠ ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
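The pseudocode above can be rendered compactly in Python. This is a sketch, not the course's implementation: candidates Ck+1 are generated by joining Lk with itself and pruning any candidate with an infrequent k-subset (the Apriori principle):

```python
# Compact Apriori sketch following the pseudocode above.
from itertools import combinations

def apriori(D, minsup):
    """Return {itemset: support_count} for all frequent itemsets in D."""
    n_min = minsup * len(D)
    # L1 = {frequent items}
    counts = {}
    for t in D:
        for i in t:
            counts[frozenset([i])] = counts.get(frozenset([i]), 0) + 1
    Lk = {s: c for s, c in counts.items() if c >= n_min}
    frequent = dict(Lk)
    k = 1
    while Lk:
        # Candidate generation: join Lk with itself, keep (k+1)-sets
        prev = set(Lk)
        Ck1 = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # Prune: every k-subset of a candidate must itself be frequent
        Ck1 = {c for c in Ck1
               if all(frozenset(s) in prev for s in combinations(c, k))}
        # Count the candidates contained in each transaction
        counts = {c: sum(1 for t in D if c <= t) for c in Ck1}
        Lk = {c: n for c, n in counts.items() if n >= n_min}
        frequent.update(Lk)
        k += 1
    return frequent
```

Running it on the earlier example (transactions A,B,C / A,C / A,D / B,E,F at 50% support) reproduces the frequent itemsets {A}, {B}, {C}, and {A, C}.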
The Apriori Algorithm -- Example
Database D (min. support = 2 transactions):
TID | Items
100 | 1 3 4
200 | 2 3 5
300 | 1 2 3 5
400 | 2 5

Scan D → C1:
{1}: 2, {2}: 3, {3}: 3, {4}: 1, {5}: 3

L1:
{1}: 2, {2}: 3, {3}: 3, {5}: 3

C2 (from L1), scan D:
{1 2}: 1, {1 3}: 2, {1 5}: 1, {2 3}: 2, {2 5}: 3, {3 5}: 2

L2:
{1 3}: 2, {2 3}: 2, {2 5}: 3, {3 5}: 2

C3 (from L2): {2 3 5}; scan D →
L3: {2 3 5}: 2

Note: {1,2,3}, {1,2,5} and {1,3,5} are not in C3 (each contains an infrequent 2-subset).

Generating Association Rules from Frequent Itemsets
• Only strong association rules are generated.
• Frequent itemsets satisfy the minimum support threshold.
• Strong association rules satisfy the minimum confidence threshold.
• Confidence(P ⇒ Q) = Prob(Q | P) = Support(P ∪ Q) / Support(P)

For each frequent itemset f, generate all non-empty subsets of f.
For every non-empty proper subset s of f do
    output the rule s ⇒ (f − s) if support(f) / support(s) ≥ min_confidence
end
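The rule-generation step above can be sketched as follows (not the course's code). Supports are looked up in the frequent-itemset counts, which is always possible because every subset of a frequent itemset is itself frequent:

```python
# Sketch of rule generation from frequent itemsets.
from itertools import combinations

def generate_rules(frequent, min_conf):
    """frequent: {frozenset: support_count}. Returns (s, f-s, confidence)."""
    rules = []
    for f, f_count in frequent.items():
        if len(f) < 2:
            continue
        for r in range(1, len(f)):               # non-empty proper subsets
            for s in map(frozenset, combinations(f, r)):
                conf = f_count / frequent[s]
                if conf >= min_conf:
                    rules.append((s, f - s, conf))
    return rules

# Frequent itemsets of the example database D (counts out of 4 transactions)
frequent = {
    frozenset({1}): 2, frozenset({2}): 3, frozenset({3}): 3, frozenset({5}): 3,
    frozenset({1, 3}): 2, frozenset({2, 3}): 2,
    frozenset({2, 5}): 3, frozenset({3, 5}): 2,
    frozenset({2, 3, 5}): 2,
}
for s, q, conf in generate_rules(frequent, min_conf=1.0):
    print(sorted(s), "=>", sorted(q), conf)
```

At 100% confidence this yields five rules from the example: {1} ⇒ {3}, {2} ⇒ {5}, {5} ⇒ {2}, {2, 3} ⇒ {5}, and {3, 5} ⇒ {2}.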
Data Mining
• Needing More than just Information Retrieval
• Elementary Concepts
• Patterns and Rules to be Discovered
• Requirements and Challenges
• Association Rule Mining
• Classification
• Clustering

What is Classification?
The goal of data classification is to organize and categorize data in distinct classes.
A model is first created based on the data distribution.
The model is then used to classify new data.
Given the model, a class can be predicted for new data.
[Figure: new data of unknown class ("?") being assigned to one of the classes 1, 2, 3, 4, …, n]