Lecture 16
Data mining and knowledge discovery
• Introduction, or what is data mining?
• Data warehouse and query tools
• Decision trees
• Case study: Profiling people with high blood pressure
• Summary
What is data mining?
• Data is what we collect and store, and knowledge is what helps us to make informed decisions.
• The extraction of knowledge from data is called data mining.
• Data mining can also be defined as the exploration and analysis of large quantities of data in order to discover meaningful patterns and rules.
• The ultimate goal of data mining is to discover knowledge.
Data warehouse
• Modern organisations must respond quickly to any change in the market. This requires rapid access to current data normally stored in operational databases.
• However, an organisation must also determine which trends are relevant. This task is accomplished with access to historical data that are stored in large databases called data warehouses.
• The main characteristic of a data warehouse is its capacity. A data warehouse is really big – it includes millions, even billions, of data records.
• The data stored in a data warehouse is
  – time dependent – linked together by the times of recording – and
  – integrated – all relevant information from the operational databases is combined and structured in the warehouse.
Query tools
• A data warehouse is designed to support decision-making in the organisation. The information needed can be obtained with query tools.
• Query tools are assumption-based – a user must ask the right questions.
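A query tool answers only the question it is given. As a minimal sketch, assume a hypothetical warehouse table sales(region, quarter, revenue) held in SQLite; the schema, figures and threshold are invented for illustration:

```python
import sqlite3

# Toy stand-in for a warehouse table; schema and figures are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, quarter TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("North", "Q1", 120.0), ("North", "Q2", 95.0),
     ("South", "Q1", 80.0), ("South", "Q2", 140.0)],
)

# The analyst must already suspect that high-revenue quarters matter:
# the query verifies a stated assumption, it does not discover patterns.
for row in conn.execute(
    "SELECT region, quarter, revenue FROM sales "
    "WHERE revenue > 100 ORDER BY revenue DESC"
):
    print(row)
```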
How is data mining applied in practice?
• Many companies use data mining today, but refuse to talk about it.
• In direct marketing, data mining is used for targeting people who are most likely to buy certain products and services.
• In trend analysis, it is used to determine trends in the marketplace, for example, to model the stock market.
• In fraud detection, data mining is used to identify insurance claims, cellular phone calls and credit card purchases that are most likely to be fraudulent.
Data mining tools
Data mining is based on intelligent technologies
already discussed in this book. It often applies
such tools as neural networks and neuro-fuzzy systems.
However, the most popular tool used for data
mining is a decision tree.
Decision trees
A decision tree can be defined as a map of the
reasoning process. It describes a data set by a
tree-like structure.
Decision trees are particularly good at solving
classification problems.
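To make the classification use concrete, here is a minimal sketch; scikit-learn and the toy records below are assumptions for illustration, not part of the lecture:

```python
# Minimal decision-tree classification sketch (assumes scikit-learn is
# installed; the records and labels below are invented for illustration).
from sklearn.tree import DecisionTreeClassifier, export_text

# Each record: [age, salt intake level 0-2]; label: 0 = normal, 1 = high.
X = [[25, 0], [30, 1], [45, 2], [55, 2], [62, 1], [68, 2]]
y = [0, 0, 0, 1, 1, 1]

tree = DecisionTreeClassifier(criterion="gini", max_depth=2)
tree.fit(X, y)

print(export_text(tree, feature_names=["age", "salt_intake"]))
print(tree.predict([[58, 2]]))  # classify a previously unseen record
```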
An example of a decision tree
[Figure: an example of a decision tree]
• A decision tree consists of nodes, branches and leaves.
• The top node is called the root node. The tree always starts from the root node and grows down by splitting the data at each level into new nodes. The root node contains the entire data set (all data records), and child nodes hold respective subsets of that set.
• All nodes are connected by branches.
• Nodes that are at the end of branches are called terminal nodes, or leaves.
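This anatomy maps directly onto a simple recursive data structure. A minimal sketch (the class and field names are my own, not from the lecture):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    """One node of a decision tree; a leaf has no predictor or branches."""
    records: list                    # the subset of data records held here
    predictor: Optional[str] = None  # the attribute tested at this node
    branches: dict = field(default_factory=dict)  # answer -> child Node

    def is_leaf(self) -> bool:
        return not self.branches

# The root node holds the entire data set; splitting creates child nodes.
root = Node(records=["r1", "r2", "r3", "r4"], predictor="Predictor 1")
root.branches["yes"] = Node(records=["r1", "r2"])  # terminal node (leaf)
root.branches["no"] = Node(records=["r3", "r4"])   # terminal node (leaf)
print(root.is_leaf(), root.branches["yes"].is_leaf())  # False True
```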
How does a decision tree select splits?
• A split in a decision tree corresponds to the predictor with the maximum separating power. The best split does the best job in creating nodes where a single class dominates.
• One of the best-known methods of calculating the predictor's power to separate data is based on the Gini coefficient of inequality.
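Decision-tree software usually scores candidate splits with the closely related Gini impurity rather than the economic Gini coefficient; the sketch below uses that formulation, and the toy class counts are invented:

```python
def gini_impurity(labels):
    """Probability that two records drawn at random from a node belong
    to different classes; 0 means a single class fully dominates."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def split_score(parent, children):
    """Weighted impurity of the child nodes; the best split minimises it."""
    n = len(parent)
    return sum(len(child) / n * gini_impurity(child) for child in children)

# Two candidate splits of a 150-record node: the first creates children
# where a single class dominates, the second separates nothing.
parent = ["A"] * 100 + ["B"] * 50
sharp = [["A"] * 90 + ["B"] * 5, ["A"] * 10 + ["B"] * 45]
weak = [["A"] * 50 + ["B"] * 25, ["A"] * 50 + ["B"] * 25]
print(split_score(parent, sharp) < split_score(parent, weak))  # True
```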
The Gini coefficient
The Gini coefficient is a measure of how well the predictor separates the classes contained in the parent node.
Corrado Gini, an Italian economist, introduced a rough measure of the amount of inequality in the income distribution in a country.
Computation of the Gini coefficient
[Figure: a Lorenz curve – cumulative income share plotted against % Population, both axes running 0–100]
The Gini coefficient is calculated as the area between
the curve and the diagonal divided by the area below the
diagonal. For a perfectly equal wealth distribution, the
Gini coefficient is equal to zero.
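A minimal sketch of that computation from a list of incomes (the incomes are invented; the trapezoid rule approximates the areas):

```python
def gini(incomes):
    """Area between the Lorenz curve and the diagonal, divided by the
    area below the diagonal (which is 1/2)."""
    xs = sorted(incomes)
    n, total = len(xs), sum(xs)
    # Lorenz curve: cumulative income share at each population share.
    lorenz, running = [0.0], 0.0
    for x in xs:
        running += x
        lorenz.append(running / total)
    # Area under the Lorenz curve (trapezoids of width 1/n).
    under = sum((lorenz[i] + lorenz[i + 1]) / (2 * n) for i in range(n))
    return (0.5 - under) / 0.5

print(gini([10, 10, 10, 10]))  # 0.0 -- perfectly equal distribution
print(gini([0, 0, 0, 100]))    # 0.75 -- extreme inequality (-> 1 as n grows)
```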
Selecting an optimal decision tree:
(a) Splits selected by Gini

Class A: 100, Class B: 50, Total: 150 (root node)
split on Predictor 1:
├ yes: Class A: 63, Class B: 38, Total: 101
│   split on Predictor 2:
│   ├ yes: Class A: 4, Class B: 37, Total: 41
│   │   split on Predictor 3:
│   │   ├ yes: Class A: 0, Class B: 36, Total: 36
│   │   └ no:  Class A: 4, Class B: 1, Total: 5
│   └ no:  Class A: 59, Class B: 1, Total: 60
└ no:  Class A: 37, Class B: 12, Total: 49
    split on Predictor 4:
    ├ yes: Class A: 25, Class B: 4, Total: 29
    │   split on Predictor 5:
    │   ├ yes: Class A: 2, Class B: 1, Total: 3
    │   └ no:  Class A: 23, Class B: 3, Total: 26
    └ no:  Class A: 12, Class B: 8, Total: 20
        split on Predictor 6:
        ├ yes: Class A: 1, Class B: 8, Total: 9
        └ no:  Class A: 11, Class B: 0, Total: 11
Selecting an optimal decision tree:
(b) Splits selected by guesswork
[Figure: a tree grown from manually chosen splits on the same data set (Total: 150); apart from node totals such as 150 and 67, the figure did not survive extraction]
Gain chart of Class A
[Figure: gain chart – cumulative % of Class A captured versus % Population (both axes 0–100), comparing the Gini splits against manual split selection]
Can we extract rules from a decision tree?
The path from the root node to a bottom leaf reveals a decision rule.
For example, a rule associated with the right bottom leaf in the figure showing the Gini splits can be represented as follows:
if (Predictor 1 = no)
and (Predictor 4 = no)
and (Predictor 6 = no)
then class = Class A
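Rule extraction is a walk from the root to each leaf, collecting the test made on every branch. A minimal sketch over the Gini tree above, with the subtrees off the Predictor 1 = no path collapsed to their majority class for brevity:

```python
# Internal node: (predictor, {answer: subtree}); leaf: a class label.
tree = ("Predictor 1", {
    "yes": "Class A",  # subtree collapsed to its majority class
    "no": ("Predictor 4", {
        "yes": "Class A",  # likewise collapsed
        "no": ("Predictor 6", {
            "yes": "Class B",
            "no": "Class A",
        }),
    }),
})

def extract_rules(node, conditions=()):
    """Depth-first walk; each complete root-to-leaf path is one rule."""
    if isinstance(node, str):  # reached a leaf: emit the finished rule
        yield conditions, node
        return
    predictor, branches = node
    for answer, child in branches.items():
        yield from extract_rules(child, conditions + ((predictor, answer),))

for conds, label in extract_rules(tree):
    tests = " and ".join(f"({p} = {a})" for p, a in conds)
    print(f"if {tests} then class = {label}")
```

The last rule printed is exactly the one above: if (Predictor 1 = no) and (Predictor 4 = no) and (Predictor 6 = no) then class = Class A.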
Case study 10:
Profiling people with high blood pressure
A typical task for decision trees is to determine conditions that may lead to certain outcomes.
Blood pressure can be categorised as optimal, normal or high. Optimal pressure is below 120/80, normal is between 120/80 and 130/85, and hypertension is diagnosed when blood pressure is over 140/90.
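These thresholds translate directly into a classification rule. A minimal sketch; the lecture leaves readings between 130/85 and 140/90 unassigned, so the sketch does too:

```python
def categorise(systolic, diastolic):
    """Blood pressure category from the thresholds quoted above."""
    if systolic < 120 and diastolic < 80:
        return "optimal"
    if systolic <= 130 and diastolic <= 85:
        return "normal"
    if systolic > 140 or diastolic > 90:
        return "high"
    return "unclassified"  # the 130/85 - 140/90 band is not assigned

print(categorise(118, 75))  # optimal
print(categorise(125, 82))  # normal
print(categorise(150, 95))  # high
```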
A data set for a hypertension study
Community Health Survey: Hypertension Study (California, U.S.A.)
• Gender: Male | Female
• Age: 18–34 years | 35–50 years | 51–64 years | 65 or more years
• Race: Caucasian | African American | Hispanic | Asian or Pacific Islander
• Marital Status: Married | Separated | Divorced | Widowed | Never Married
• Household Income: Less than $20,700 | $20,701–$45,000 | $45,001–$75,000 | $75,001 and over
A data set for a hypertension study (continued)
Community Health Survey: Hypertension Study (California, U.S.A.)
• Alcohol Consumption: Abstain from alcohol | Occasional (a few drinks per month) | Regular (one or two drinks per day) | Heavy (three or more drinks per day)
• Smoking: Nonsmoker | 1–10 cigarettes per day | 11–20 cigarettes per day | More than one pack per day
• Caffeine Intake: Abstain from coffee | One or two cups per day | Three or more cups per day
• Salt Intake: Low-salt diet | Moderate-salt diet | High-salt diet
• Physical Activities: None | One or two times per week | Three or more times per week
• Weight: ___ kg
• Height: ___ cm
• Blood Pressure: Optimal | Normal | High
Data cleaning
Decision trees are only as good as the data they represent. Unlike neural networks and fuzzy systems, decision trees do not tolerate noisy and polluted data. Therefore, the data must be cleaned before we can start data mining.
We might find that such fields as Alcohol Consumption or Smoking have been left blank or contain incorrect information.
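A minimal cleaning sketch for such records; the record layout and the validity checks are assumptions for illustration:

```python
# Hypothetical survey records; None marks a field left blank.
records = [
    {"age": 45, "smoking": "Nonsmoker", "alcohol": "Occasional"},
    {"age": 52, "smoking": None, "alcohol": "Regular"},        # blank field
    {"age": 460, "smoking": "Nonsmoker", "alcohol": "Heavy"},  # mistyped age
]

VALID_ALCOHOL = {"Abstain", "Occasional", "Regular", "Heavy"}

def is_clean(record):
    """Reject records with blank fields or out-of-range values."""
    if any(value is None for value in record.values()):
        return False
    if not 18 <= record["age"] <= 120:
        return False
    return record["alcohol"] in VALID_ALCOHOL

cleaned = [r for r in records if is_clean(r)]
print(len(cleaned))  # 1 -- only the first record survives cleaning
```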
Data enriching
From such variables as weight and height we
can easily derive a new variable, obesity. This
variable is calculated with a body-mass index
(BMI), that is, the weight in kilograms divided
by the square of the height in metres. Men with
BMIs of 27.8 or higher and women with BMIs
of 27.3 or higher are classified as obese.
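The derived variable is a one-line computation. A sketch using the thresholds quoted above (the sample weights and heights are invented):

```python
def obesity(weight_kg, height_m, gender):
    """Derive the obesity flag from weight and height via the
    body-mass index (BMI), using the lecture's thresholds."""
    bmi = weight_kg / height_m ** 2
    threshold = 27.8 if gender == "Male" else 27.3
    return "Obese" if bmi >= threshold else "Not Obese"

print(obesity(93, 1.76, "Male"))    # BMI ~ 30.0 -> Obese
print(obesity(61, 1.68, "Female"))  # BMI ~ 21.6 -> Not Obese
```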
A data set for a hypertension study (continued)
Community Health Survey: Hypertension Study (California, U.S.A.)
• Obesity: Obese | Not Obese
Growing a decision tree

Blood Pressure (root node): optimal: 319 (32%), normal: 528 (53%), high: 153 (15%), Total: 1000
Split on Age:
• 18–34 years: optimal: 88 (56%), normal: 64 (41%), high: 5 (3%), Total: 157
• 35–50 years: optimal: 208 (35%), normal: 340 (57%), high: 48 (8%), Total: 596
• 51–64 years: optimal: 21 (12%), normal: 90 (52%), high: 62 (36%), Total: 173
• 65 or more years: optimal: 2 (3%), normal: 34 (46%), high: 38 (51%), Total: 74
Growing a decision tree (continued)
[Figure: the 51–64 years node is split on Obesity into Obese (Total: 107) and Not Obese (Total: 66); the class breakdowns for this slide did not survive extraction]
Growing a decision tree (continued)

Obese: optimal: 3 (3%), normal: 53 (49%), high: 51 (48%), Total: 107
Split on Race:
• Caucasian: optimal: 2 (5%), normal: 24 (55%), high: 17 (40%), Total: 43
• African American: optimal: 0 (0%), normal: 13 (35%), high: 24 (65%), Total: 37
• Hispanic: optimal: 0 (0%), normal: 11 (58%), high: 8 (42%), Total: 19
• Asian: optimal: 1 (12%), normal: 5 (63%), high: 2 (25%), Total: 8
Solution space of the hypertension study
The solution space is first divided into four rectangles by age. The 51–64 age group is then further divided into those who are obese and those who are not, and finally the group of obese people is divided by race.
Solution space of the hypertension study
[Figure: the solution space drawn as a rectangle partition; region sizes 157, 596, 74 and 66 for the age-based splits, and 43, 37, 19 and 8 for the race split within the obese 51–64 group]
Hypertension study: forcing a split

Blood Pressure (root node): optimal: 319 (32%), normal: 528 (53%), high: 153 (15%), Total: 1000
Split on Age:
• 18–34 years: optimal: 88 (56%), normal: 64 (41%), high: 5 (3%), Total: 157
• 35–50 years: optimal: 208 (35%), normal: 340 (57%), high: 48 (8%), Total: 596
  Forced split on Gender:
  – Male: optimal: 111 (36%), normal: 168 (55%), high: 28 (9%), Total: 307
  – Female: optimal: 97 (34%), normal: 172 (59%), high: 20 (7%), Total: 289
• 51–64 years: optimal: 21 (12%), normal: 90 (52%), high: 62 (36%), Total: 173
  Forced split on Gender:
  – Male: optimal: 11 (13%), normal: 48 (56%), high: 27 (31%), Total: 86
  – Female: optimal: 10 (12%), normal: 42 (48%), high: 35 (40%), Total: 87
• 65 or more years: optimal: 2 (3%), normal: 34 (46%), high: 38 (51%), Total: 74
Advantages of decision trees
• The main advantage of the decision-tree approach to data mining is that it visualises the solution; it is easy to follow any path through the tree.
• Relationships discovered by a decision tree can be expressed as a set of rules, which can then be used in developing an expert system.
Drawbacks of decision trees
• Continuous data, such as age or income, have to be grouped into ranges, which can unwittingly hide important patterns.
• Handling of missing and inconsistent data – decision trees can produce reliable outcomes only when they deal with "clean" data.
• Inability to examine more than one variable at a time. This confines trees to problems that can be solved by dividing the solution space into several successive rectangles.

In spite of all these limitations, decision trees have become the most successful technology used for data mining. The ability to produce clear sets of rules makes decision trees particularly attractive to business professionals.