Knowledge Discovery via
Data mining
Enrico Tronci
Dipartimento di Informatica, Università di Roma “La Sapienza”, Via Salaria 113,
00198 Roma, Italy, [email protected], http://www.dsi.uniroma1.it/~tronci
Workshop ENEA: I Sistemi di Supporto alle Decisioni
Centro Ricerche ENEA Casaccia, Roma, October 28, 2003
Data Mining
Data mining is the extraction of implicit, previously
unknown, and potentially useful information from
data.
A data miner is a computer program that sifts through
data seeking regularities or patterns.
Obstructions: noise and computational complexity.
Some Applications
• Decisions involving judgments, e.g. loans.
• Screening images. Example: detection of oil slicks from satellite images, giving early warning of ecological disasters and illegal dumping.
• Load forecasting in the electricity supply industry.
• Diagnosis, e.g. for preventive maintenance of electromechanical devices.
• Marketing and sales. … On Thursday customers often purchase beer and diapers together …
• Stock market analysis.
• Anomaly detection.
Data
Each row of the table is an instance; the columns are the attributes, and the last column (recommended lens) is the goal.

Age             Spectacle prescription   Astigmatism   Tear production rate   Recommended lens (goal)
young           myope                    no            reduced                none
young           myope                    no            normal                 soft
young           myope                    yes           reduced                none
young           myope                    yes           normal                 hard
young           hypermetrope             no            reduced                none
young           hypermetrope             no            normal                 soft
young           hypermetrope             yes           reduced                none
young           hypermetrope             yes           normal                 hard
pre-presbyopic  myope                    no            reduced                none
pre-presbyopic  myope                    no            normal                 soft
...             ...                      ...           ...                    ...
Classification
Assume instances have n attributes A1, …, An-1, An.
Let attribute An be our goal. A classifier is a function f from (A1 x … x An-1) to An.
That is, f looks at the values of the first (n - 1) attributes and returns the (estimated) value of the goal. In other words, f classifies each instance w.r.t. the goal attribute.
The problem of computing a classifier from a set of instances is called the classification problem.
Note that in a classification problem the set of classes (i.e. the possible goal values) is known in advance.
Note that a classifier works on any possible instance, that is, also on instances that were not present in our data set. This is why classification is a form of machine learning.
Clustering
Assume instances have n attributes A1, …, An.
A clustering function is a function f from the set (A1 x … x An) to some small subset of the natural numbers. That is, f splits the set of instances into a small number of classes.
The problem of computing a clustering function from our data set is called the clustering problem.
Note that, unlike in a classification problem, in a clustering problem the set of classes is not known in advance.
Note that a clustering function works on any possible instance, that is, also on instances that were not present in our data set. This is why clustering is a form of machine learning.
In the following we will focus on classification.
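The notion of a clustering function can be sketched in a few lines of code. This is a toy illustration, not taken from the slides: a minimal one-dimensional k-means that returns a function mapping any value to a cluster id; the data and names are invented.

```python
def kmeans_1d(values, k, iterations=10):
    """Return a clustering function mapping a value to a cluster id in 0..k-1."""
    # Initialize the centroids with the first k distinct values.
    centroids = sorted(set(values))[:k]
    for _ in range(iterations):
        # Assign each value to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    # The returned function works on any possible value, not only on
    # values that were present in the data set.
    return lambda v: min(range(k), key=lambda i: abs(v - centroids[i]))

ages = [22, 25, 24, 61, 59, 64]
f = kmeans_1d(ages, k=2)
print(f(23), f(60))  # the two ages fall in different clusters
```

Note that the number of classes k is chosen by us, not read from the data, which is exactly why the set of classes is not known in advance.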
Rules for Contact Lens Data
(An example of classification)
if (<tear production rate> = <reduced>) then <recommendation> = <none>;
if (<age> = <young> and <astigmatic> = <no> and <tear production rate> = <normal>) then <recommendation> = <soft>;
if (<age> = <pre-presbyopic> and <astigmatic> = <no> and <tear production rate> = <normal>) then <recommendation> = <soft>;
....
Attribute recommendation is the attribute we would like to predict. Such an attribute is usually called the goal and is typically written in the last column.
A possible way of defining a classifier is by using a set of rules as above.
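Rules like these translate directly into a classifier function f. A minimal sketch in Python; spectacle prescription is omitted from the signature because these rules do not test it, and the catch-all return value standing in for the rules elided by "...." is an assumption.

```python
def recommend_lens(age, astigmatism, tear_production_rate):
    """Classify an instance by its attribute values, returning the goal value."""
    if tear_production_rate == "reduced":
        return "none"
    if age == "young" and astigmatism == "no" and tear_production_rate == "normal":
        return "soft"
    if age == "pre-presbyopic" and astigmatism == "no" and tear_production_rate == "normal":
        return "soft"
    return "unknown"  # remaining rules elided in the slide ("....")

print(recommend_lens("young", "yes", "reduced"))  # none
print(recommend_lens("young", "no", "normal"))    # soft
```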
Labor Negotiations Data
Attribute                  Type                     1      2      3      ...   40
Duration                   years                    1      2      3      ...   2
Wage increase first year   percentage               2%     4%     4.3%   ...   4.5%
Wage increase second year  percentage               ?      ?      ?      ...   ?
Working hours per week     number of hours          28     35     38     ...   40
Pension                    {none, r, c}             none   ?      ?      ...   ?
Education allowance        {yes, no}                yes    ?      ?      ...   ?
Statutory holidays         number of days           11     15     12     ...   12
Vacation                   {below-avg, avg, gen}    avg    gen    gen    ...   avg
...                        ...                      ...    ...    ...    ...   ...
Acceptability of contract  {good, bad}              bad    good   good   ...   good
Classification using Decision Trees
(The Labor Negotiations Data Example (1))
Wage increase first year
├─ <= 2.5: bad
└─ > 2.5: Statutory holidays
   ├─ <= 10: Wage increase first year
   │  ├─ <= 4: bad
   │  └─ > 4: good
   └─ > 10: good
Classification using Decision Trees
(The Labor Negotiations Data Example (2))
Wage increase first year
├─ <= 2.5: Working hours per week
│  ├─ <= 36: bad
│  └─ > 36: Health plan contribution
│     ├─ none: bad
│     ├─ half: good
│     └─ full: bad
└─ > 2.5: Statutory holidays
   ├─ <= 10: Wage increase first year
   │  ├─ <= 4: bad
   │  └─ > 4: good
   └─ > 10: good
Which Classifier is Good for Me?
From the same data set we may get many classifiers with different properties. Here are some of the properties usually considered for a classifier. Note that, depending on the problem under consideration, some properties may or may not be relevant.
• Success rate, that is, the percentage of instances classified correctly.
• Ease of computation.
• Readability. There are cases in which the definition of the classifier must be read by a human being. In such cases the readability of the classifier definition is an important parameter for judging the goodness of a classifier.
Finally, we should note that starting from the same data set, different classification algorithms may return different classifiers. Usually, deciding which one to use requires running some testing experiments.
A Classification Algorithm
Decision Trees
Decision trees are among the most widely used and most effective classifiers.
We will illustrate the decision tree classification algorithm with an example: the weather data.
Weather Data
Outlook    Temperature  Humidity  Windy  Play
sunny      hot          high      false  no
sunny      hot          high      true   no
overcast   hot          high      false  yes
rainy      mild         high      false  yes
rainy      cool         normal    false  yes
rainy      cool         normal    true   no
overcast   cool         normal    true   yes
sunny      mild         high      false  no
sunny      cool         normal    false  yes
rainy      mild         normal    false  yes
sunny      mild         normal    true   yes
overcast   mild         high      true   yes
overcast   hot          normal    false  yes
rainy      mild         high      true   no
Constructing a decision tree for the weather data (1)
[Figure: the four candidate splits at the root. Splitting on outlook yields branches sunny (2 yes, 3 no; H = 0.971), overcast (4 yes, 0 no; H = 0.0) and rainy (3 yes, 2 no; H = 0.971). Gains: outlook 0.247, temperature 0.029, humidity 0.152, windy 0.048.]

H(p1, …, pn) = -p1 log p1 - … - pn log pn
H(p, q, r) = H(p, q + r) + (q + r) * H(q/(q + r), r/(q + r))
H([2, 3]) = -(2/5) * log(2/5) - (3/5) * log(3/5) = 0.971 bits
H([4, 0]) = 0 bits
H([3, 2]) = 0.971 bits
H([2, 3], [4, 0], [3, 2]) = (5/14) * H([2, 3]) + (4/14) * H([4, 0]) + (5/14) * H([3, 2]) = 0.693 bits
Info before any decision tree was created (9 yes, 5 no): H([9, 5]) = 0.940 bits.
Gain(outlook) = H([9, 5]) - H([2, 3], [4, 0], [3, 2]) = 0.247 bits
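The entropy and gain computations above can be checked with a few lines of Python (logs are base 2; the counts come from the weather data).

```python
from math import log2

def H(counts):
    """Entropy in bits of a list of class counts, e.g. [9, 5]."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def gain(parent, branches):
    """Information gain of a split: parent entropy minus the
    weighted average entropy of the branches."""
    total = sum(parent)
    return H(parent) - sum(sum(b) / total * H(b) for b in branches)

# Outlook splits [9 yes, 5 no] into sunny [2, 3], overcast [4, 0], rainy [3, 2].
print(round(H([9, 5]), 3))                               # 0.94
print(round(gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))  # 0.247
```

The decision tree algorithm picks the attribute with the largest gain, here outlook, and recurses on each branch.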
Constructing a decision tree for the weather data (2)
Outlook
[Figure: expanding the sunny branch of the outlook split (2 yes, 3 no; H = 0.971). Gains for the candidate second-level splits: humidity 0.971, windy 0.020, temperature 0.571. Humidity gives the highest gain and is chosen.]
Constructing a decision tree for the weather data (3)
Outlook
├─ sunny: Humidity
│  ├─ high: no
│  └─ normal: yes
├─ overcast: yes
└─ rainy: Windy
   ├─ false: yes
   └─ true: no

Computational cost of decision tree construction for a data set with m attributes and n instances:
O(m n log n) + O(n (log n)^2)
Naive Bayes
Counts and relative frequencies from the weather data:

Outlook        yes  no  |  yes  no
  sunny         2    3  |  2/9  3/5
  overcast      4    0  |  4/9  0/5
  rainy         3    2  |  3/9  2/5

Temperature    yes  no  |  yes  no
  hot           2    2  |  2/9  2/5
  mild          4    2  |  4/9  2/5
  cool          3    1  |  3/9  1/5

Humidity       yes  no  |  yes  no
  high          3    4  |  3/9  4/5
  normal        6    1  |  6/9  1/5

Windy          yes  no  |  yes  no
  false         6    2  |  6/9  2/5
  true          3    3  |  3/9  3/5

Play           yes  no  |  yes   no
                9    5  |  9/14  5/14
Naive Bayes (2)
A new day:

Outlook  Temperature  Humidity  Windy  Play
sunny    cool         high      true   ?

E = (sunny and cool and high and true)
Bayes: P(yes | E) = (P(E | yes) P(yes)) / P(E).
Assuming the attributes are statistically independent:
P(yes | E) = (P(sunny | yes) * P(cool | yes) * P(high | yes) * P(true | yes) * P(yes)) / P(E) = (2/9) * (3/9) * (3/9) * (3/9) * (9/14) / P(E) = 0.0053 / P(E).
P(no | E) = (3/5) * (1/5) * (4/5) * (3/5) * (5/14) / P(E) = 0.0206 / P(E).
Since P(yes | E) + P(no | E) = 1 we have that P(E) = 0.0053 + 0.0206 = 0.0259.
Thus: P(yes | E) = 0.205, P(no | E) = 0.795.
Thus we answer: NO.
Obstruction: usually the attributes are not statistically independent. However, naive Bayes works quite well in practice.
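The naive Bayes computation above can be sketched directly in Python; the counts are read off the frequency tables on the previous slide.

```python
# Counts for E = (sunny, cool, high, true) under each class.
counts_yes = {"sunny": 2, "cool": 3, "high": 3, "true": 3}
counts_no = {"sunny": 3, "cool": 1, "high": 4, "true": 3}
n_yes, n_no, n_total = 9, 5, 14

def likelihood(counts, n_class, n_total):
    """P(E | class) * P(class), with attributes assumed independent."""
    p = n_class / n_total          # class prior, e.g. P(yes) = 9/14
    for c in counts.values():
        p *= c / n_class           # P(attribute value | class)
    return p

l_yes = likelihood(counts_yes, n_yes, n_total)   # ~0.0053
l_no = likelihood(counts_no, n_no, n_total)      # ~0.0206
p_yes = l_yes / (l_yes + l_no)   # normalize: P(E) = l_yes + l_no
print(round(p_yes, 3), round(1 - p_yes, 3))  # 0.205 0.795
```

Since P(no | E) > P(yes | E), the classifier answers no.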
Performance Evaluation
Split the data set into two parts: a training set and a test set.
Use the training set to compute the classifier.
Use the test set to evaluate the classifier. Note: the test set data have not been used in the training process.
This allows us to compute the following quantities (on the test set).
For the sake of simplicity we refer to a two-class prediction.

              Predicted yes          Predicted no
Actual yes    TP (true positive)     FN (false negative)
Actual no     FP (false positive)    TN (true negative)
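The quantities derived from a two-class confusion matrix can be computed directly. A small sketch; the counts below are made-up test-set numbers, not from the slides.

```python
def rates(TP, FN, FP, TN):
    """Quantities derived from a two-class confusion matrix."""
    total = TP + FN + FP + TN
    return {
        "success rate": (TP + TN) / total,             # fraction classified correctly
        "TP rate": TP / (TP + FN),                     # y-axis of a ROC curve
        "FP rate": FP / (FP + TN),                     # x-axis of a ROC curve
        "predicted positive size": (TP + FP) / total,  # x-axis of a lift chart
    }

r = rates(TP=40, FN=10, FP=5, TN=45)
print(r["success rate"], r["TP rate"], r["FP rate"])  # 0.85 0.8 0.1
```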
Lift Chart
[Figure: a lift chart plots the number of true positives, TP, on the y-axis against the predicted positive subset size, (TP + FP)/(TP + FP + TN + FN), on the x-axis (from 0 to 100%).]
Lift charts are typically used in marketing applications.
Receiver Operating Characteristic (ROC) Curve
[Figure: a ROC curve plots the TP rate, TP/(TP + FN), on the y-axis against the FP rate, FP/(FP + TN), on the x-axis (both from 0 to 100%).]
ROC curves are typically used in communication applications.
A glimpse of data mining in
Safeguard
We outline our use of data mining techniques in the Safeguard project.
On-line schema
[Diagram: TCP packets on port 2506 are captured with tcpdump and preprocessed into the TCP payload. Format filters extract four views of the payload: the sequence of payload bytes, the distribution of payload bytes, the conditional probabilities of chars and words in the payload, and statistics info (avg, var, dev) on the payload bytes. These feed Classifier 1 (hash table based), Classifier 2 (Hidden Markov Models) and a cluster analyzer; a supervisor combines their outputs into an alarm level.]
Training schema
[Diagram: TCP packets on port 2506 are captured with tcpdump into a preprocessed TCP payload log. Format filters extract the same four views of the payload: the sequence of payload bytes, the distribution of payload bytes, the conditional probabilities of chars and words in the payload, and statistics info (avg, var, dev) on the payload bytes. An HT classifier synthesizer builds Classifier 1 (hash table based), an HMM synthesizer builds Classifier 2 (Hidden Markov Models), and WEKA (a data mining tool) builds the cluster analyzer.]