Defining a Data Mining Task
CSE3212 Data Mining
• To define a data mining task, one needs to answer the following questions:
  1. What data set do I want to mine?
  2. What kind of knowledge do I want to mine?
  3. What background knowledge could be useful?
  4. How do I measure if the results are interesting?
  5. How do I display what I have discovered?

Data Mining Approaches

Task-relevant Data
• Generally we wish to mine only a subset of a database, not the whole database. It may be that we only want to study something specific, e.g. trends in postgraduate students:
  - the countries they come from;
  - the degree program they are doing;
  - their age;
  - the time (duration) they take to finish the degree; and
  - whether they have been awarded a scholarship.
• Building the database subset may be a subtask before data mining can be done.

What to be mined? Or the Approaches

• What kind of knowledge are we after?
  - Classification
  - Estimation
  - Prediction
  - Clustering
  - Description
  - Affinity Grouping
  - Outliers
  - ...

Classification
• Classification involves considering the features of some object and then assigning it to some pre-defined class, for example:
  - spotting fraudulent insurance claims;
  - deciding which phone numbers are fax numbers;
  - deciding which customers are high-value.
• The features that are considered are known as the independent attributes or variables, while the attribute that constitutes the pre-defined classes is called the dependent attribute or variable.
• First build a model based on the known data, then use the model to classify other data for which the class label is not known → this is known as supervised learning.
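
As a minimal sketch of supervised learning (not part of the original slides), the example below builds a model on labelled records and applies it to a record whose class label is unknown. The feature values and the use of scikit-learn's DecisionTreeClassifier are illustrative assumptions only.

    # Hypothetical example: classify customers as high-value (1) or not (0).
    from sklearn.tree import DecisionTreeClassifier

    # Independent attributes: [annual_spend, years_as_customer]
    X_known = [[1200, 1], [300, 2], [5000, 6], [4500, 4], [150, 1]]
    # Dependent attribute: the pre-defined class label
    y_known = [0, 0, 1, 1, 0]

    model = DecisionTreeClassifier().fit(X_known, y_known)  # build model on known data
    print(model.predict([[4000, 3]]))                       # classify an unlabelled record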

Estimation

• Estimation deals with numerically valued outcomes rather than the discrete categories that occur in classification, for example:
  - estimating the number of children in a family;
  - estimating family income.
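
A corresponding sketch for estimation, where the target is a numeric value rather than a class. The data and the use of scikit-learn's LinearRegression are illustrative assumptions.

    # Hypothetical example: estimate family income (a numeric outcome).
    from sklearn.linear_model import LinearRegression

    # Independent attributes: [years_of_education, hours_worked_per_week]
    X_known = [[12, 38], [16, 40], [18, 45], [10, 20], [14, 40]]
    y_known = [42000, 65000, 90000, 25000, 55000]  # known family incomes

    model = LinearRegression().fit(X_known, y_known)
    print(model.predict([[15, 42]]))  # estimated income for a new family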

Prediction
• Essentially the same as classification and estimation, but it involves future behaviour.
• Historical data is used to build a model explaining behaviour (outputs) for known inputs.
• The model developed is then applied to current inputs to predict future outputs, for example:
  - predicting which customers will respond to a promotion;
  - classifying loan applications.

Clustering

• Clustering is also sometimes referred to as segmentation (though this term has other meanings in other fields).
• In clustering there are no pre-defined classes. Self-similarity is used to group records, and the user must attach meaning to the clusters formed.
• Clustering often precedes some other data mining task, for example:
  - once customers are separated into clusters, a promotion might be carried out based on market basket analysis of the resulting cluster.
• Clustering is known as unsupervised learning.
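
A minimal unsupervised-learning sketch: no class labels are supplied, and the analyst must interpret the groups that come out. The records and the use of scikit-learn's KMeans are assumptions for illustration.

    # Hypothetical example: group customer records without pre-defined classes.
    from sklearn.cluster import KMeans

    # Records described by [age, annual_spend]
    X = [[25, 300], [27, 350], [45, 4800], [50, 5100], [23, 280], [48, 4900]]

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(kmeans.labels_)  # cluster membership; the user must attach meaning to each cluster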

Description
• A good description of data can provide understanding of behaviour.
• The description of the behaviour can suggest an explanation for it as well.
• Statistical measures can be useful in describing data, as can techniques that generate rules.

Deviation Detection

• Records whose attributes deviate from the norm by significant amounts are also called outliers.
• Application areas include:
  - fraud detection;
  - quality control;
  - tracing defects.
• Visualization techniques and statistical techniques are useful in finding outliers.
• A cluster which contains only a few records may in fact represent outliers.
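
As an illustration of flagging records that deviate from the norm, the sketch below applies a simple z-score style rule to one attribute. The claim amounts and the 2-standard-deviation threshold are assumptions, not part of the lecture.

    # Hypothetical example: flag values that deviate strongly from the norm.
    from statistics import mean, stdev

    claims = [120, 135, 110, 125, 130, 118, 950, 122]  # made-up claim amounts
    mu, sigma = mean(claims), stdev(claims)

    outliers = [x for x in claims if abs(x - mu) > 2 * sigma]
    print(outliers)  # values more than 2 standard deviations from the mean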

Affinity Grouping
• Affinity grouping is also referred to as Market Basket Analysis.
• A common example is the discovery of which items are frequently sold together at a supermarket. If this is known, decisions can be made about:
  - arranging items on shelves;
  - which items should be promoted together;
  - which items should not simultaneously be discounted.

Market Basket Analysis

• Example rule: "When a customer buys a shirt, in 70% of cases, he or she will also buy a tie! We find this happens in 13.5% of all purchases."
  - Rule body: the customer buys a shirt; rule head: he or she also buys a tie.
  - Confidence: 70%; support: 13.5% of all purchases.

The Usefulness of Market Basket Analysis
• Some rules are useful: unknown, unexpected, and indicative of some action to take.
• Some rules are trivial: known by anyone familiar with the business.
• Some rules are inexplicable: they seem to have no explanation and do not suggest a course of action.

"The key to success in business is to know something that nobody else knows"
- Aristotle Onassis

Co-Occurrence Table

Customer   Items
1          orange juice (OJ), cola
2          milk, orange juice, window cleaner
3          orange juice, detergent
4          orange juice, detergent, cola
5          window cleaner, cola

Co-occurrence counts (number of transactions containing both items):

            OJ   Cleaner   Milk   Cola   Detergent
OJ           4      1        1      2        2
Cleaner      1      2        1      1        0
Milk         1      1        1      0        0
Cola         2      1        0      3        1
Detergent    2      0        0      1        2
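
The counts above can be reproduced mechanically. The short sketch below is an illustration only (not part of the slides); it builds the co-occurrence matrix from the five transactions.

    # Build the co-occurrence matrix for the five example transactions.
    from itertools import product

    transactions = [
        {"OJ", "Cola"},
        {"Milk", "OJ", "Cleaner"},
        {"OJ", "Detergent"},
        {"OJ", "Detergent", "Cola"},
        {"Cleaner", "Cola"},
    ]
    items = ["OJ", "Cleaner", "Milk", "Cola", "Detergent"]

    # cooc[a][b] = number of transactions containing both a and b
    cooc = {a: {b: 0 for b in items} for a in items}
    for t in transactions:
        for a, b in product(t, t):
            cooc[a][b] += 1

    for a in items:
        print(a, [cooc[a][b] for b in items])  # rows match the table above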

From the Co-Occurrence Table
• We can say that people who buy orange juice will also buy cola (or detergent):

  orange juice ⇒ cola

• This association rule is satisfied by 2 out of 5 customers (1 and 4), hence its support is 2/5 = 40%.
• However, four customers (1, 2, 3 and 4) have purchased orange juice, hence the confidence of the above rule is only 2/4 = 50%.
• Question: Are support and confidence measures good enough?
• This rule has one item (or attribute) on the left hand side and one on the right hand side. How do you find rules which have more than one item on the left hand side (multi-attribute rules)?

Support and Confidence

• Support:
  - The percentage of transactions from a transaction database that the given rule satisfies.
  - This can be taken as the probability P(X ∪ Y), where X ∪ Y indicates that a transaction contains both X and Y, that is, the union of item sets X and Y.
• Confidence:
  - Assesses the degree of certainty of the detected association.
  - This can be taken as the conditional probability P(Y|X), that is, the probability that a transaction containing X also contains Y.
• More formally:
  - Support (X ⇒ Y) = P (X ∪ Y)
  - Confidence (X ⇒ Y) = P (Y|X)
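
A small sketch (illustrative only) that computes support and confidence for the rule orange juice ⇒ cola over the five example transactions, matching the 40% and 50% figures above.

    # Support and confidence of the rule {OJ} => {Cola}.
    transactions = [
        {"OJ", "Cola"},
        {"Milk", "OJ", "Cleaner"},
        {"OJ", "Detergent"},
        {"OJ", "Detergent", "Cola"},
        {"Cleaner", "Cola"},
    ]
    X, Y = {"OJ"}, {"Cola"}

    both = sum(1 for t in transactions if X <= t and Y <= t)  # transactions containing X and Y
    only_x = sum(1 for t in transactions if X <= t)           # transactions containing X

    support = both / len(transactions)  # P(X u Y) = 2/5 = 0.40
    confidence = both / only_x          # P(Y|X)   = 2/4 = 0.50
    print(support, confidence)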

What is a Rule?

  If condition then result

• Note:

  if nappies and Thursday then beer

  is usually better than (in the sense that it is more actionable)

  if Thursday then nappies and beer

  because it has just one item in the result.
• If a 3-way combination is the most common, then consider rules with just 1 item in the result, e.g.

  if A and B, then C
  if A and C, then B

Is the Rule a Useful Predictor? - 1

• Confidence is the ratio of the number of transactions with all the items in the rule to the number of transactions with just the items in the condition.
• Consider the rule: if B and C then A. If this rule has a confidence of 0.33, it means that when B and C occur in a transaction, there is a 33% chance that A also occurs.

Is the Rule a Useful Predictor? - 2

• Consider the following table of probabilities of items and their combinations:

  Combination      Probability
  A                   0.45
  B                   0.42
  C                   0.40
  A and B             0.25
  A and C             0.20
  B and C             0.15
  A and B and C       0.05

Is the Rule a Useful Predictor? - 3

• Now consider the following rules:

  Rule                 p(condition)   p(condition and result)   confidence
  If A and B then C        0.25                0.05                0.20
  If A and C then B        0.20                0.05                0.25
  If B and C then A        0.15                0.05                0.33

• It is tempting to choose "If B and C then A", because it is the most confident (33%) - but there is a problem.

Is the Rule a Useful Predictor? - 4

• This rule is actually worse than just saying that A randomly occurs in the transaction - which happens 45% of the time.
• A measure called improvement indicates whether the rule predicts the result better than just assuming the result in the first place:

  improvement = p(condition and result) / (p(condition) × p(result))

Is the Rule a Useful Predictor? - 5

• Improvement measures how much better a rule is at predicting a result than just assuming the result in the first place.
• When improvement > 1, the rule is better at predicting the result than random chance.
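
The sketch below (illustrative, not from the slides) applies the improvement formula to the probabilities in the earlier table; up to rounding, it reproduces the figures shown on the next slide.

    # improvement = p(condition and result) / (p(condition) * p(result))
    p = {
        frozenset("A"): 0.45, frozenset("B"): 0.42, frozenset("C"): 0.40,
        frozenset("AB"): 0.25, frozenset("AC"): 0.20, frozenset("BC"): 0.15,
        frozenset("ABC"): 0.05,
    }

    rules = [("AB", "C"), ("AC", "B"), ("BC", "A"), ("A", "B")]
    for condition, result in rules:
        p_cond = p[frozenset(condition)]
        p_res = p[frozenset(result)]
        p_both = p[frozenset(condition + result)]
        confidence = p_both / p_cond
        improvement = p_both / (p_cond * p_res)
        print(f"if {condition} then {result}: "
              f"confidence={confidence:.2f}, improvement={improvement:.2f}")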

Is the Rule a Useful Predictor? - 6

• Consider the improvement for our rules:

  Rule                 support   confidence   improvement
  If A and B then C      0.05       0.20          0.50
  If A and C then B      0.05       0.25          0.59
  If B and C then A      0.05       0.33          0.74
  If A then B            0.25       0.56          1.32

• None of the rules with three items shows any improvement - the best rule in the data actually has only two items: "If A then B". A predicts the occurrence of B 1.32 times better than chance.

Is the Rule a Useful Predictor? - 7

• When improvement < 1, negating the result produces a better rule. For example,

  if B and C then not A

  has a confidence of 0.67 and thus an improvement of 0.67/0.55 = 1.22.
• Negated rules may not be as useful as the original association rules when it comes to acting on the results.

Choosing the Right Set of Items

• Choosing the right level of detail (the creation of classes and a taxonomy).
• Virtual items may be added to take advantage of information that goes beyond the taxonomy.
• Anonymous versus signed transactions.

Multi-attribute Rule

• A rule with 2 items on the left hand side and one item on the right hand side (e.g. if A and B then C) would require the co-occurrence matrix to be 3-dimensional.
• How do you visualise a three-dimensional co-occurrence matrix? What happens for higher dimensions?

The Process for Market Basket Analysis

• A co-occurrence cube would show associations in three dimensions - hard to visualize more.
• We must:
  - choose the right set of items;
  - generate rules by deciphering the counts in the co-occurrence matrix;
  - overcome the practical limits imposed by many items in large numbers of transactions.

An Example

Consider the following database:

  Student(sid, name1, dob, country, degree, startsem, address1, telephone, address2, email, scholarship, ..)
  Enrolment(sid, subject-id, mark, tutegroup, tutor, ..)
  Subject(sub-id, name, school-id, whenstarted, lecturer, ..)
  School(name, id, ..)

Not all of this data is needed for decision making. Let us extract some data from this database.

Example

  yob    country     degree     startsem   numsubjects   scholarship
  1965   Thailand    MIT        991        5             25%
  1970   Canada      BIT        992        4             0
  1967   Australia   LLB        993        3             30%
  1966   Australia   LLB        983        4             40%
  1972   Australia   Bcom       973        5             10%
  1972   India       BIT/Bcom   991        5             10%
  1982   Sweden      MSc(IT)    991        3             10%

Is this information useful for decision making? Not really!

Data Cube

We could look at the information as

  yob X country X degree X startsem X numsubjects X scholarship

In fact it is natural to think of enterprise data as multidimensional.

Example

The university management may be interested in retrieving information like:

• How many students are doing BIT? How many students are from Thailand? How many students started in 1998? (queries involving only one variable)
• How many students doing BIT are from Thailand? How many MIT students started in 981? How many students from Thailand started in 993? (queries involving two variables)
• How many students doing MIT from Thailand started in 981? (query involving three variables)

A special type of database system, called a data cube system, is often used for answering such queries.

Data Cube

The example queries discussed earlier may be represented by a three-dimensional data cube, with each edge representing one of the variables, viz. startsem, country, and degree.

A point inside the cube is an intersection of the coordinates defined by the edges of the cube. The coordinates of the point define the meaning of the data at that point.

Let us look at a simple two-dimensional situation:

  country X degree

For decision making this may be useful information. If we had a 2-dimensional matrix then we could find out the number of students for any country (x) and any degree (y).

Data Cube

But in the two-dimensional situation, we don't just want to find out the number of students for any country (x) and any degree (y). We may have many other queries, e.g.

1. How many students are doing MIT?
2. How many students are from Thailand?
3. How many Asian students are doing Law degrees?

Thus there is a kind of hierarchy that we wish to use, for example: the world, the continents, the regions, the countries, etc. For degrees, we may want a hierarchy of university, schools, UG and PG, and individual degrees.

Consider a slightly more complex situation in which we have three dimensions:

  country X degree X startsem

for any country (x), any degree (y) and any start semester (z). We may now look at this information as a 3-dimensional cube, as shown on the following slide.

A Sample Data Cube

[Figure: a three-dimensional data cube of total enrolments, with degree (LLB, BComp, MIT) on one axis, semester (991, 992, 993, 001) on another, and country (U.S.A, Malaysia, Australia) on the third, plus "sum" cells aggregating along each dimension.]

Data Cube

• The cube shows the number of students as a function of country, degree and semester.
• Dimensions: country, degree, sem.
• Hierarchical summarization paths:
  - continent > region > country
  - school > ug/pg > degree
  - year > semester

Data Cube
Each edge of the cube is called a dimension. A user
normally has a number of different dimensions from
which the given data may be analyzed. A user therefore
has a multidimensional conceptual view of the data
which is represented by the cube.
The points inside a cube provide aggregations. For
example, a point may provide the number of students
from Malaysia admitted to BComp in year 2006.
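
As a rough illustration of this kind of aggregation (an assumption, not part of the slides: any OLAP tool or SQL GROUP BY would do equally well), the sketch below uses pandas on a few made-up student records to answer one-, two-, and three-variable queries.

    # Hypothetical example: data-cube style aggregation over student records.
    import pandas as pd

    students = pd.DataFrame({
        "country":  ["Thailand", "Thailand", "Malaysia", "Australia", "Malaysia"],
        "degree":   ["BIT", "MIT", "BComp", "LLB", "BComp"],
        "startsem": [981, 981, 991, 993, 991],
    })

    # One variable: how many students are doing BIT?
    print((students["degree"] == "BIT").sum())

    # Two variables: a count for every (country, degree) point.
    print(pd.crosstab(students["country"], students["degree"]))

    # Three variables: a count for every (country, degree, startsem) point.
    print(students.groupby(["country", "degree", "startsem"]).size())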

Outlier Analysis

• Outlier analysis identifies data objects that do not comply with the general behaviour or model of the data. Outliers are often ignored, but in applications like fraud detection the outliers are the objects of interest.

Strengths and Weaknesses

• Strengths:
  - Clear, understandable results.
  - Supports undirected data mining.
  - Works on variable-length data.
  - Is simple to understand.
• Weaknesses:
  - Requires exponentially more computational effort as the problem size grows.
  - Suits items in transactions, but not all problems fit this description.
  - It can be difficult to determine the right set of items to analyse.
  - It does not handle rare items well; simply considering the level of support will exclude these items.
• We need an algorithm to find the association rules.
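
To make the last point concrete, here is a deliberately naive brute-force sketch (an illustration only; practical systems use smarter algorithms such as Apriori, which is not shown here) that enumerates itemsets and emits rules meeting minimum support and confidence.

    # Naive association-rule finder: exponential in the number of items,
    # so suitable only as an illustration of the idea.
    from itertools import combinations

    transactions = [
        {"OJ", "Cola"},
        {"Milk", "OJ", "Cleaner"},
        {"OJ", "Detergent"},
        {"OJ", "Detergent", "Cola"},
        {"Cleaner", "Cola"},
    ]
    MIN_SUPPORT, MIN_CONFIDENCE = 0.4, 0.5

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / len(transactions)

    items = sorted(set().union(*transactions))
    for size in range(2, len(items) + 1):
        for itemset in map(frozenset, combinations(items, size)):
            if support(itemset) < MIN_SUPPORT:
                continue
            for result in itemset:                 # rules with one item in the result
                condition = itemset - {result}
                confidence = support(itemset) / support(condition)
                if confidence >= MIN_CONFIDENCE:
                    print(f"if {sorted(condition)} then {result} "
                          f"(support={support(itemset):.2f}, confidence={confidence:.2f})")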