Lecture 16: Data mining and knowledge discovery
• Introduction, or what is data mining?
• Data warehouse and query tools
• Decision trees
• Case study: Profiling people with high blood pressure
• Summary
What is data mining?
• Data is what we collect and store, and knowledge is what helps us to make informed decisions.
• The extraction of knowledge from data is called data mining.
• Data mining can also be defined as the exploration and analysis of large quantities of data in order to discover meaningful patterns and rules.
• The ultimate goal of data mining is to discover knowledge.
Data warehouse
• Modern organisations must respond quickly to any change in the market. This requires rapid access to current data normally stored in operational databases.
• However, an organisation must also determine which trends are relevant. This task is accomplished with access to historical data that are stored in large databases called data warehouses.
• The main characteristic of a data warehouse is its capacity. A data warehouse is really big – it includes millions, even billions, of data records.
• The data stored in a data warehouse is
    – time dependent – linked together by the times of recording – and
    – integrated – all relevant information from the operational databases is combined and structured in the warehouse.
Query tools
• A data warehouse is designed to support decision-making in the organisation. The information needed can be obtained with query tools.
• Query tools are assumption-based – a user must ask the right questions.
How is data mining applied in practice?
• Many companies use data mining today, but refuse to talk about it.
• In direct marketing, data mining is used for targeting people who are most likely to buy certain products and services.
• In trend analysis, it is used to determine trends in the marketplace, for example, to model the stock market.
• In fraud detection, data mining is used to identify insurance claims, cellular phone calls and credit card purchases that are most likely to be fraudulent.
Data mining tools
Data mining is based on intelligent technologies already discussed in this book. It often applies such tools as neural networks and neuro-fuzzy systems.
However, the most popular tool used for data mining is a decision tree.
Decision trees
A decision tree can be defined as a map of the reasoning process. It describes a data set by a tree-like structure.
Decision trees are particularly good at solving classification problems.
An example of a decision tree

Household (root node): responded 112, not responded 888, total 1000
    Homeownership = Yes: responded 9, not responded 334, total 343
    Homeownership = No: responded 103, not responded 554, total 657
        Household Income ≤ $20,700: responded 14, not responded 158, total 172
        Household Income > $20,700: responded 89, not responded 396, total 485
            Savings Account = Yes: responded 86, not responded 188, total 274
            Savings Account = No: responded 3, not responded 208, total 211
• A decision tree consists of nodes, branches and leaves.
• The top node is called the root node. The tree always starts from the root node and grows down by splitting the data at each level into new nodes. The root node contains the entire data set (all data records), and child nodes hold respective subsets of that set.
• All nodes are connected by branches.
• Nodes that are at the end of branches are called terminal nodes, or leaves (a minimal node structure is sketched below).
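To make this structure concrete, here is a minimal Python sketch of a tree node; the class and field names (TreeNode, records, split_predictor, children) are illustrative assumptions rather than anything from the slides:

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TreeNode:
    records: list                                  # subset of the data set held by this node
    split_predictor: Optional[str] = None          # attribute used to split this node (None for a leaf)
    children: dict = field(default_factory=dict)   # branch value -> child TreeNode

    def is_leaf(self) -> bool:
        # A terminal node (leaf) has no outgoing branches.
        return not self.children

# The root node holds the entire data set; each child node holds a subset of it.
root = TreeNode(records=[{"homeowner": "no", "responded": True},
                         {"homeowner": "yes", "responded": False}])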
How does a decision tree select splits?
• A split in a decision tree corresponds to the predictor with the maximum separating power. The best split does the best job in creating nodes where a single class dominates.
• One of the best known methods of calculating the predictor’s power to separate data is based on the Gini coefficient of inequality (a scoring sketch follows below).
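In practice, tree-building tools such as CART score candidate splits with the closely related Gini impurity: a node's impurity falls as a single class comes to dominate it, and the split whose children have the lowest weighted impurity is chosen. The Python sketch below illustrates that idea under these assumptions; it is not the exact formula used in the book:

from collections import Counter

def gini_impurity(labels):
    # Gini impurity of a node: 1 - sum of squared class proportions.
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_score(records, predictor, target):
    # Weighted impurity of the child nodes produced by splitting on `predictor`.
    # Lower is better: the best split creates nodes where a single class dominates.
    groups = {}
    for record in records:
        groups.setdefault(record[predictor], []).append(record[target])
    n = len(records)
    return sum(len(g) / n * gini_impurity(g) for g in groups.values())

# Hypothetical records; the predictor with the lowest score is chosen for the split.
data = [{"homeowner": "yes", "responded": "no"},
        {"homeowner": "no",  "responded": "yes"},
        {"homeowner": "no",  "responded": "no"}]
best = min(["homeowner"], key=lambda p: split_score(data, p, "responded"))
print(best, split_score(data, best, "responded"))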
The Gini coefficient
The Gini coefficient is a measure of how well the predictor separates the classes contained in the parent node.
Gini, an Italian economist, introduced a rough measure of the amount of inequality in the income distribution in a country.
Computation of the Gini coefficient
[Lorenz curve: cumulative % Income plotted against cumulative % Population, together with the diagonal line of perfect equality]

The Gini coefficient is calculated as the area between the curve and the diagonal divided by the area below the diagonal. For a perfectly equal wealth distribution, the Gini coefficient is equal to zero.
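As a numerical illustration of that definition, the sketch below (an assumption, not code from the book) builds the Lorenz curve from a list of incomes and divides the area between the curve and the diagonal by the area below the diagonal (0.5):

import numpy as np

def gini_coefficient(incomes):
    # Gini coefficient = (area between the Lorenz curve and the diagonal) / (area below the diagonal).
    x = np.sort(np.asarray(incomes, dtype=float))
    n = x.size
    pop = np.arange(0, n + 1) / n                          # cumulative share of population (x-axis)
    inc = np.concatenate(([0.0], np.cumsum(x))) / x.sum()  # cumulative share of income (y-axis)
    lorenz_area = np.sum((inc[1:] + inc[:-1]) / 2 * np.diff(pop))  # area under the Lorenz curve
    return (0.5 - lorenz_area) / 0.5

print(gini_coefficient([10, 10, 10, 10]))  # 0.0  - perfectly equal distribution
print(gini_coefficient([0, 0, 0, 100]))    # 0.75 - most of the income held by one person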
Selecting an optimal decision tree:
(a) Splits selected by Gini

Root: Class A 100, Class B 50, total 150
    Predictor 1 = yes: Class A 63, Class B 38, total 101
        Predictor 2 = yes: Class A 4, Class B 37, total 41
            Predictor 3 = yes: Class A 0, Class B 36, total 36
            Predictor 3 = no: Class A 4, Class B 1, total 5
        Predictor 2 = no: Class A 59, Class B 1, total 60
    Predictor 1 = no: Class A 37, Class B 12, total 49
        Predictor 4 = yes: Class A 25, Class B 4, total 29
            Predictor 5 = yes: Class A 2, Class B 1, total 3
            Predictor 5 = no: Class A 23, Class B 3, total 26
        Predictor 4 = no: Class A 12, Class B 8, total 20
            Predictor 6 = yes: Class A 1, Class B 8, total 9
            Predictor 6 = no: Class A 11, Class B 0, total 11
Selecting an optimal decision tree:
(b) Splits selected by guesswork

Root: Class A 100, Class B 50, total 150
    Predictor 5 = yes: Class A 19, Class B 14, total 33
        Predictor 2 = yes: Class A 17, Class B 6, total 23
        Predictor 2 = no: Class A 2, Class B 8, total 10
    Predictor 5 = no: Class A 81, Class B 36, total 117
        Predictor 3 = yes: Class A 46, Class B 21, total 67
            Predictor 1 = yes: Class A 37, Class B 14, total 51
                Predictor 4 = yes: Class A 8, Class B 9, total 17
                Predictor 4 = no: Class A 29, Class B 5, total 34
            Predictor 1 = no: Class A 9, Class B 7, total 16
        Predictor 3 = no: Class A 35, Class B 15, total 50
            Predictor 6 = yes: Class A 23, Class B 9, total 32
            Predictor 6 = no: Class A 12, Class B 6, total 18
Gain chart of Class A

[Gain chart: cumulative % of Class A captured plotted against cumulative % of the population, comparing the Gini splits with manual split selection]
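One common way to build such a gain chart, sketched below as an assumption rather than the book's procedure, is to rank the terminal nodes by their proportion of Class A and then accumulate the percentage of the population and of Class A as each leaf is added; the leaf counts are taken from the reconstructed Gini tree above:

def gain_chart_points(leaves, total_a, total_records):
    # leaves: list of (class_a_count, total_count) for each terminal node.
    ranked = sorted(leaves, key=lambda leaf: leaf[0] / leaf[1], reverse=True)
    points, cum_a, cum_n = [(0.0, 0.0)], 0, 0
    for a, n in ranked:
        cum_a += a
        cum_n += n
        points.append((100 * cum_n / total_records, 100 * cum_a / total_a))
    return points

# (Class A count, total count) for the seven leaves of the Gini tree.
gini_leaves = [(0, 36), (4, 5), (59, 60), (2, 3), (23, 26), (1, 9), (11, 11)]
for pop_pct, class_a_pct in gain_chart_points(gini_leaves, total_a=100, total_records=150):
    print(f"{pop_pct:5.1f}% of population -> {class_a_pct:5.1f}% of Class A")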
Can we extract rules from a decision tree?
The path from the root node to a bottom leaf reveals a decision rule.
For example, a rule associated with the right bottom leaf in the figure that represents the Gini splits can be represented as follows:

if   (Predictor 1 = no)
and  (Predictor 4 = no)
and  (Predictor 6 = no)
then class = Class A
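To show how such rules can be read off mechanically, here is a small self-contained Python sketch (the nested-dictionary tree representation and function name are assumptions) that turns every root-to-leaf path into an if-then rule:

def extract_rules(node, conditions=()):
    # node is either a leaf label (str) or {"predictor": name, "branches": {value: child}}.
    if not isinstance(node, dict):                      # reached a leaf
        return [(list(conditions), node)]
    rules = []
    for value, child in node["branches"].items():
        rules += extract_rules(child, conditions + ((node["predictor"], value),))
    return rules

# The right-bottom branch of the Gini tree above, ending in a Class A leaf.
tree = {"predictor": "Predictor 1", "branches": {
        "no": {"predictor": "Predictor 4", "branches": {
               "no": {"predictor": "Predictor 6", "branches": {
                      "no": "Class A"}}}}}}

for conds, label in extract_rules(tree):
    body = " and ".join(f"({p} = {v})" for p, v in conds)
    print(f"if {body} then class = {label}")
# -> if (Predictor 1 = no) and (Predictor 4 = no) and (Predictor 6 = no) then class = Class A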
Case study 10: Profiling people with high blood pressure

A typical task for decision trees is to determine conditions that may lead to certain outcomes.
Blood pressure can be categorised as optimal, normal or high. Optimal pressure is below 120/80, normal is between 120/80 and 130/85, and hypertension is diagnosed when blood pressure is over 140/90.
A data set for a hypertension study

Community Health Survey: Hypertension Study (California, U.S.A.)

Gender:            ✓ Male | Female
Age:               18–34 years | 35–50 years | ✓ 51–64 years | 65 or more years
Race:              ✓ Caucasian | African American | Hispanic | Asian or Pacific Islander
Marital Status:    Married | Separated | ✓ Divorced | Widowed | Never Married
Household Income:  Less than $20,700 | $20,701–$45,000 | ✓ $45,001–$75,000 | $75,001 and over
A data set for a hypertension study (continued)

Community Health Survey: Hypertension Study (California, U.S.A.)

Alcohol Consumption:  Abstain from alcohol | Occasional (a few drinks per month) | ✓ Regular (one or two drinks per day) | Heavy (three or more drinks per day)
Smoking:              Nonsmoker | 1–10 cigarettes per day | ✓ 11–20 cigarettes per day | More than one pack per day
Caffeine Intake:      Abstain from coffee | ✓ One or two cups per day | Three or more cups per day
Salt Intake:          Low-salt diet | ✓ Moderate-salt diet | High-salt diet
Physical Activities:  None | ✓ One or two times per week | Three or more times per week
Weight:               9 3 kg
Height:               1 7 cm
Blood Pressure:       Optimal | Normal | ✓ High
Data cleaning
Decision trees are only as good as the data they represent. Unlike neural networks and fuzzy systems, decision trees do not tolerate noisy and polluted data. Therefore, the data must be cleaned before we can start data mining.
We might find that such fields as Alcohol Consumption or Smoking have been left blank or contain incorrect information.
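A minimal sketch of this kind of cleaning, assuming pandas and made-up column names and category values, is to keep only the records whose Alcohol Consumption and Smoking fields hold one of the valid survey answers:

import pandas as pd

valid_alcohol = {"Abstain", "Occasional", "Regular", "Heavy"}
valid_smoking = {"Nonsmoker", "1-10 per day", "11-20 per day", "More than one pack per day"}

survey = pd.DataFrame({
    "Alcohol Consumption": ["Regular", None, "sometimes"],        # blank and invalid entries
    "Smoking":             ["11-20 per day", "Nonsmoker", "Nonsmoker"],
    "Blood Pressure":      ["High", "Normal", "Optimal"],
})

# Keep only records whose fields contain one of the valid survey answers.
cleaned = survey[
    survey["Alcohol Consumption"].isin(valid_alcohol)
    & survey["Smoking"].isin(valid_smoking)
]
print(cleaned)   # only the first record survives the cleaning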
Data enriching
From such variables as weight and height we can easily derive a new variable, obesity. This variable is calculated with a body-mass index (BMI), that is, the weight in kilograms divided by the square of the height in metres. Men with BMIs of 27.8 or higher and women with BMIs of 27.3 or higher are classified as obese.
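A small sketch of that derivation (the function name and the example record are assumptions):

def is_obese(weight_kg: float, height_m: float, gender: str) -> bool:
    bmi = weight_kg / height_m ** 2                 # BMI = weight / height^2
    threshold = 27.8 if gender == "Male" else 27.3  # thresholds quoted above
    return bmi >= threshold

# Hypothetical record: a 93 kg, 1.77 m man has a BMI of about 29.7, so he is classified as obese.
print(is_obese(93, 1.77, "Male"))   # True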
A data set for a hypertension study (continued)

Community Health Survey: Hypertension Study (California, U.S.A.)

Obesity:           ✓ Obese | Not Obese
Growing a decision tree

Blood Pressure (root node): optimal 319 (32%), normal 528 (53%), high 153 (15%), total 1000
    Age = 18–34 years: optimal 88 (56%), normal 64 (41%), high 5 (3%), total 157
    Age = 35–50 years: optimal 208 (35%), normal 340 (57%), high 48 (8%), total 596
    Age = 51–64 years: optimal 21 (12%), normal 90 (52%), high 62 (36%), total 173
    Age = 65 or more years: optimal 2 (3%), normal 34 (46%), high 38 (51%), total 74
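For comparison, a similar tree can be grown with an off-the-shelf library; the sketch below uses scikit-learn on a tiny made-up data frame (an assumption for illustration, not the tool used in the book):

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

survey = pd.DataFrame({
    "Age":           ["18-34", "35-50", "51-64", "65+", "51-64", "18-34"],
    "Obesity":       ["Not Obese", "Obese", "Obese", "Not Obese", "Not Obese", "Obese"],
    "BloodPressure": ["optimal", "normal", "high", "high", "normal", "optimal"],
})

X = pd.get_dummies(survey[["Age", "Obesity"]])   # one-hot encode the categorical predictors
y = survey["BloodPressure"]

tree = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))   # textual view of the grown tree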
Growing a decision tree (continued)

Age = 51–64 years: optimal 21 (12%), normal 90 (52%), high 62 (36%), total 173
    Obesity = Obese: optimal 3 (3%), normal 53 (49%), high 51 (48%), total 107
    Obesity = Not Obese: optimal 18 (27%), normal 37 (56%), high 11 (17%), total 66
Growing a decision tree (continued)

Obesity = Obese: optimal 3 (3%), normal 53 (49%), high 51 (48%), total 107
    Race = Caucasian: optimal 2 (5%), normal 24 (55%), high 17 (40%), total 43
    Race = African American: optimal 0 (0%), normal 13 (35%), high 24 (65%), total 37
    Race = Hispanic: optimal 0 (0%), normal 11 (58%), high 8 (42%), total 19
    Race = Asian: optimal 1 (12%), normal 5 (63%), high 2 (25%), total 8
Solution space of the hypertension study
The solution space is first divided into four rectangles by age, then the age group 51–64 is further divided into those who are overweight and those who are not. Finally, the group of obese people is divided by race.
[Diagram: the solution space partitioned into rectangles, labelled with the group sizes 157, 596, 74, 66, 43, 37, 19 and 8]
Hypertension study: forcing a split

Blood Pressure (root node): optimal 319 (32%), normal 528 (53%), high 153 (15%), total 1000
    Age = 18–34 years: optimal 88 (56%), normal 64 (41%), high 5 (3%), total 157
    Age = 35–50 years: optimal 208 (35%), normal 340 (57%), high 48 (8%), total 596
        Gender = Male: optimal 111 (36%), normal 168 (55%), high 28 (9%), total 307
        Gender = Female: optimal 97 (34%), normal 172 (59%), high 20 (7%), total 289
    Age = 51–64 years: optimal 21 (12%), normal 90 (52%), high 62 (36%), total 173
        Gender = Male: optimal 11 (13%), normal 48 (56%), high 27 (31%), total 86
        Gender = Female: optimal 10 (12%), normal 42 (48%), high 35 (40%), total 87
    Age = 65 or more years: optimal 2 (3%), normal 34 (46%), high 38 (51%), total 74
Advantages of decision trees
• The main advantage of the decision-tree approach to data mining is that it visualises the solution; it is easy to follow any path through the tree.
• Relationships discovered by a decision tree can be expressed as a set of rules, which can then be used in developing an expert system.
Drawbacks of decision trees
• Continuous data, such as age or income, have to be grouped into ranges, which can unwittingly hide important patterns.
• Handling of missing and inconsistent data – decision trees can produce reliable outcomes only when they deal with “clean” data.
• Inability to examine more than one variable at a time. This confines trees to only the problems that can be solved by dividing the solution space into several successive rectangles.
In spite of all these limitations, decision trees have become the most successful technology used for data mining.
An ability to produce clear sets of rules makes decision trees particularly attractive to business professionals.