COMP527: Data Mining
M. Sulaiman Khan
([email protected])
Dept. of Computer Science
University of Liverpool
2009
ARM: Advanced Techniques
March 11, 2009
Course Outline
Introduction to the Course
Introduction to Data Mining
Introduction to Text Mining
General Data Mining Issues
Data Warehousing
Classification: Challenges, Basics
Classification: Rules
Classification: Trees
Classification: Trees 2
Classification: Bayes
Classification: Neural Networks
Classification: SVM
Classification: Evaluation
Classification: Evaluation 2
Regression, Prediction
Input Preprocessing
Attribute Selection
Association Rule Mining
ARM: Apriori and Data Structures
ARM: Improvements
ARM: Advanced Techniques
Clustering: Challenges, Basics
Clustering: Improvements
Clustering: Advanced Algorithms
Hybrid Approaches
Graph Mining, Web Mining
Text Mining: Challenges, Basics
Text Mining: Text-as-Data
Text Mining: Text-as-Language
Revision for Exam
Today's Topics
Parallelization
Constraints
Multi-Level Rule Mining
Other Issues
Parallelization
Task-based vs. data-based distribution of the processing.
Data parallelism divides the database into partitions, one for each node.
Task parallelism has each node count a different candidate set (e.g. node 1 counts the 1-itemsets, node 2 counts the 2-itemsets, and so on).
Main advantage: by using multiple machines we can avoid database scans, as there is more memory to use -- the total size of all candidates is more likely to fit into the combined memory of N machines.
Data Parallelism
The database is divided into N partitions. Each partition can have a different number of records, depending on the capabilities of its node.
Each node counts the candidates against its own partition, then broadcasts its counts to all the other nodes.
As the counts are received, each node adds them into the global support counts, which it then uses to determine the candidates for the next level (e.g. from 2-itemsets to 3-itemsets).
This is the Count Distribution Algorithm:
Count Distribution
Very rough pseudo-code for the CDA approach...

At each processor p:
    while potential frequent itemsets remain:
        count supports of the candidates in partition Dp of database D
        broadcast the counts to all other processors
        on receive(counts):
            globalCounts += counts
        determine the candidates for level k+1
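To make that concrete, here is a minimal single-process sketch in Python that simulates count distribution. The partitioned toy data, the minimum support value and the loop standing in for real broadcasts are all illustrative assumptions; an actual implementation would use MPI or similar.

def count_distribution(partitions, min_support):
    # partitions: one list of transactions per simulated node
    total = sum(len(p) for p in partitions)
    # level-1 candidates: every item seen in any partition
    candidates = {frozenset([i]) for p in partitions for t in p for i in t}
    frequent, k = [], 1
    while candidates:
        # each "node" counts the candidate set against its own partition...
        local = [{c: sum(c <= set(t) for t in p) for c in candidates}
                 for p in partitions]
        # ...then the broadcast counts are summed into global counts
        global_counts = {c: sum(lc[c] for lc in local) for c in candidates}
        level = {c for c, n in global_counts.items() if n / total >= min_support}
        frequent.extend(level)
        # Apriori-style join to generate the next level's candidates
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
    return frequent

parts = [[("milk", "bread"), ("milk",)],   # node 1's partition
         [("milk", "bread"), ("bread",)]]  # node 2's partition
print(count_distribution(parts, min_support=0.5))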
Task Parallelism
The candidates, as well as the database, are distributed amongst the processors.
Each processor counts the candidates given to it, using the database subset given to it.
Each processor then broadcasts its database partition to the other processors so that they can complete the global counts for their own candidates; the resulting frequent itemsets are broadcast again so that every processor knows the globally frequent itemsets.
The candidates for the next level are then shared out amongst the available processors.
Yes, that's a lot of broadcasting, which is a lot of network traffic, which is a lot of SLOW!
(Not going to go through the algorithm for this.)
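Still, for contrast with count distribution, here is a hypothetical sketch of just the task-parallel counting step: each simulated node owns a slice of the candidates rather than a slice of the counts, and every partition is shipped to it in turn. The round-robin candidate slicing is an assumption made for illustration.

def task_parallel_counts(partitions, candidates, n_nodes):
    cand = sorted(candidates, key=sorted)            # stable order to slice
    owned = [cand[i::n_nodes] for i in range(n_nodes)]
    global_counts = {}
    for node_cands in owned:                         # one pass per "node"
        for c in node_cands:
            # every partition is broadcast to this node in turn
            global_counts[c] = sum(c <= set(t) for p in partitions for t in p)
    return global_counts

parts = [[("milk", "bread"), ("milk",)], [("milk", "bread"), ("bread",)]]
cands = {frozenset(["milk"]), frozenset(["bread"]), frozenset(["milk", "bread"])}
print(task_parallel_counts(parts, cands, n_nodes=2))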
Constraints
Constrained association rule mining simply involves setting more rules up front about what counts as an interesting rule.
For example:
- Statistics: support, confidence, lift, correlation
- Data: specify the task-relevant data to include in transactions
- Dimensions: the dimensions of hierarchical data to be used (next time)
- Meta-rules: the form of the useful rules to be found
Meta-Rules
Examples:
- Rule templates
- Max/min number of predicates in the antecedent/consequent
- Types of relationship among attributes or attribute values
E.g. interested only in pairs of attributes for a customer that buys a certain type of item:
P(x,a) AND Q(x,b) => R(x,c)
e.g. age(x, 20..30) AND income(x, 20k..30k) => buys(x, computer)
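As a sketch of how such a meta-rule might be applied after mining, assuming rules are held as (antecedent, consequent) sets of predicate strings -- a representation invented here for illustration:

# keep only rules matching the template P(x,a) AND Q(x,b) => R(x,c):
# exactly two predicates in the antecedent and one in the consequent
rules = [
    ({"age(20..30)", "income(20k..30k)"}, {"buys(computer)"}),
    ({"buys(milk)"}, {"buys(bread)"}),
]
matches = [r for r in rules if len(r[0]) == 2 and len(r[1]) == 1]
print(matches)  # only the age/income => buys(computer) rule survives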
Item Level Thresholds
Also known as the Rare Item problem.
If an item is very rare, its support may well fall below the minimum support required for an interesting rule.
For example, 48" plasma TVs are sold very infrequently.
But rules involving them could be interesting, especially if they meant it was more likely for someone to buy a big TV.
Solution: multiple minimum support thresholds.
Simply give rare items a lower threshold than the rest of the dataset.
This could be extended out to one threshold per item...
MISAPriori
Minimum Item Support A-Priori.
The minimum support required for an itemset is the lowest minimum item support (MIS) of any item in the itemset.
This breaks our lovely A-Priori downward closure principle :(
E.g. minimum supports: {A 20%, B 3%, C 4%}
Actual supports: {A 18%, B 4%, C 3%}
A is infrequent, but AB can be frequent, because the threshold for AB is 3% and both A and B meet that threshold.
Solution: sort items by ascending MIS value, then have candidate generation only look at items which come after the current one in this list.
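A minimal sketch of the idea, using the slide's numbers (the dict-based representation is an assumption):

MIS = {"A": 0.20, "B": 0.03, "C": 0.04}      # per-item minimum supports
support = {"A": 0.18, "B": 0.04, "C": 0.03}  # observed supports

def itemset_threshold(itemset):
    # an itemset's threshold is the lowest MIS among its members
    return min(MIS[i] for i in itemset)

order = sorted(MIS, key=MIS.get)             # ['B', 'C', 'A'], ascending MIS
print(itemset_threshold({"A", "B"}))         # 0.03, so AB may be frequent...
print(support["A"] >= MIS["A"])              # ...while A alone (False) is not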
Multi-level Rule Mining
Our examples have been supermarket baskets. But you don't buy 'bread': you buy a certain brand of bread, with a certain flavour and thickness, e.g. white Warburton's toast bread, or a 2 litre bottle of Tesco's semi-skimmed milk rather than just 'milk'.
We could compact all of the 'milks' and 'breads' together before data mining, but what if buying 'white bread' and 'semi-skimmed milk' together is an interesting rule, as compared to 'skim milk' and 'whole grain bread'? Or Tesco's milk and Tesco's bread? Or ...
We need a hierarchy of products to capture these different levels.
Multi-level Rule Mining
We could have a large tree of how the products inter-relate:

All Products
  Bread
    White
      White/Toast
    Brown
      Brown/Tesco
  Milk
    Whole
      Whole/2Litre
    Semi-skim
Multi-level Rule Mining
We can count support for the items at the bottom level and propagate the counts upwards, or count each level for frequency in a top-down approach.
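A minimal sketch of the bottom-up option: extend each transaction with the ancestors of its items, so a single counting pass covers every level. The taxonomy dict mirrors the toy tree above.

PARENT = {"White/Toast": "White", "Brown/Tesco": "Brown",
          "Whole/2Litre": "Whole", "White": "Bread", "Brown": "Bread",
          "Whole": "Milk", "Semi-skim": "Milk",
          "Bread": "All Products", "Milk": "All Products"}

def extend(transaction):
    # add every ancestor of every item to the transaction
    items = set(transaction)
    for item in transaction:
        while item in PARENT:
            item = PARENT[item]
            items.add(item)
    return items

print(extend(["White/Toast", "Semi-skim"]))
# contains White/Toast, White, Bread, Semi-skim, Milk, All Products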
Note that what we really need is some sort of clever cluster system with different axes: bread has colour, size, brand, thickness... milk, on the other hand, has size, brand, skimmed-ness... beer has a totally different set of properties.
But maybe those axes share values... Tesco has a milk range and a bread range, but not a beer range...
Let's leave that alone :)
Multi-level Rule Mining
To avoid the rare item problem, each level in the tree can have a reduced minimum support threshold.
E.g. level 1 could be 8%, level 2 (more specific) gets a lower threshold of 5%, then 3%, 2%, etc.
(And in a graph rather than a tree, it would be path distance rather than tree level.)
We also need search strategies for crawling the tree against the transaction database.
Multi-level Rule Mining
Level-by-level independent: full breadth search.
  May examine a lot of infrequent items!
Cross-filtering by itemset: a k-itemset at level i is examined only if the corresponding k-itemset at level i-1 is frequent.
  Might filter out valuable patterns (e.g. the 20%/3% issue from the MISAPriori slide).
Cross-filtering by item: an item at level i is examined only if its parent node at level i-1 is frequent.
  A compromise between the previous two.
Multi-level Rule Mining
Controlled cross-filtering by single item: two thresholds at each level, one for frequency at that level and one called a level passage threshold, which controls which items can pass down to the next level. If an item doesn't meet the passage threshold, it doesn't pass down. This threshold is typically set between the two levels' support thresholds.
None of these address cross-level association rules, i.e. rules that link buying items at one level with items at a different level.
Multi-level Rule Mining
Many similar rules can be generated between different levels.
E.g. white bread -> skim milk is similar to bread -> milk, and to white toast bread -> 2l skim milk, and ...
If we allow cross-level rules, the numbers become astronomical.
If we allow cross-level rules, we can also get totally redundant rules:
milk -> bread
skim milk -> bread
Tesco milk -> bread
Multi-dimensional Rule Mining
We could mine dimensions other than 'buys', assuming that we have some knowledge about the buyer.
For example:
age(20..29) & buys(milk) => buys(bread)
occupation(student) & buys(laptop) => buys(blank DVDs)
This isn't necessarily any more difficult: it just involves putting these items into the transaction to be mined.
It can be usefully combined with meta-rules or constraints.
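A hypothetical encoding that does exactly that: buyer attributes simply become items in the transaction (the attribute prefixes are an illustrative choice):

def encode(age_range, occupation, basket):
    # customer attributes and purchases all become plain "items"
    transaction = {f"age:{age_range}", f"occupation:{occupation}"}
    transaction |= {f"buys:{item}" for item in basket}
    return transaction

print(encode("20..29", "student", ["milk", "bread"]))
# the set {'age:20..29', 'occupation:student', 'buys:milk', 'buys:bread'}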
Discretization
We have the same 'range' problem we had with numeric data before, but in spades: we don't want to classify by the attribute, we want to find arbitrary rules using arbitrary ranges.
For example, we might want age() somehow linked to buying, but we don't know how to discretize it.
Equally, we might want some sort of distance-based association rule, where the distance between data points is important: either physical (item A is spatially close to item B) or similarity (item A is similar to item B).
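For instance, a minimal equal-width binning sketch for age -- where the bin width of 10 is an arbitrary assumption, which is exactly the problem:

def discretize(value, width=10):
    # map a numeric value to an equal-width range label
    low = (value // width) * width
    return f"{low}..{low + width - 1}"

print(discretize(24))  # '20..29'
print(discretize(30))  # '30..39': age 30 lands in a different bin than 29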
Quantity
Not only could we discretize single numeric attributes, we can also have a number attached to each item:
I might buy 10 cans of cat food, 2 bottles of coke, 3 packets of chicken pieces...
We could then look for rules that use this quantity (orthogonally to all of the other dimensions we've looked at), e.g.:
buys(cat food, 5+) -> buys(cat litter, 1)
buys(soda, 2) -> buys(potato chips, 2+)
(I feel sympathy for your encroaching headaches!)
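One hypothetical way in: fold each quantity into the item name as a bucket, so that ordinary itemset mining sees quantities as items (the bucket boundary of 5 is an assumption):

def quantity_item(item, qty):
    # bucket the quantity into the item label
    return f"{item}:5+" if qty >= 5 else f"{item}:{qty}"

basket = {"cat food": 10, "coke": 2, "chicken pieces": 3}
print({quantity_item(i, q) for i, q in basket.items()})
# {'cat food:5+', 'coke:2', 'chicken pieces:3'}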
Time
(But not that much sympathy!)
You could use association rule mining techniques to find episodic rules: for example, that I buy cheese every 3 weeks, milk and bread every week, and DVDs apparently at random. The metric could be the number of transactions rather than calendar days or weeks.
If the items were a sequence of events, then the order within the transaction is important, and that could be mined for rules.
Trend rules examine the same attribute over time, e.g. trends in the stock market; this could be applied to many attributes concurrently.
Classification ARM
A final note: once association rules have been discovered, they can be used to form a classifier, for example by adding a constraint that the consequent must be one of the attributes specified as a class.
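A minimal sketch of that constraint, assuming mined rules are (antecedent, consequent) pairs and class labels carry a 'class:' prefix -- both representation choices are invented here:

rules = [
    ({"buys:milk"}, "buys:bread"),
    ({"age:20..29", "buys:laptop"}, "class:student"),
]
# keep only rules whose consequent is a class attribute
classifier_rules = [r for r in rules if r[1].startswith("class:")]
print(classifier_rules)  # only the rule predicting class:student remains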
Further Reading
The rest of Zhang!
Berry and Browne, Chapters 15, 16
Han 5.3, 5.5
Dunham 6.4, 6.7