Lecture 1: Historical Framework — Transcript
Technische Universität Dresden, Faculty of Civil Engineering
Institute of Construction Informatics, Prof. Dr.-Ing. Scherer

Information Mining
Prof. Dr.-Ing. Raimar J. Scherer
Dresden, 04.05.2005
Quality of the Data

Data (values) can be of different quality:
1) semantical
   - nominal (not ranked)
   - ordinal (ranked)
2) numerical (ranked)
   - discrete: interval, ratio
   - continuous (analog): interval, ratio
Where possible, try to transfer semantical data into numerical data.
Quality of Attributes

Quality of attributes = semantical.
The importance (weight) of an attribute is usually not given; it is implicitly assumed that each attribute is equally important (e.g. weighting factor = 1.0). It is better to make the weights explicit by a transfer into numeric values.
Example: project aim (cost, duration, reputation)
Implicit: project aim = 1.0 × cost + 1.0 × duration + 1.0 × reputation
Explicit: e.g. project aim = 2.0 × cost + 1.0 × duration + 1.5 × reputation
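The implicit and explicit weighting above can be sketched in a few lines of Python. The attribute names and weights follow the slide's example; the numeric attribute scores are invented purely for illustration:

```python
# Hypothetical attribute scores (assumed values, not from the lecture):
scores = {"cost": 0.7, "duration": 0.5, "reputation": 0.9}

# Implicit weighting: every attribute counts with factor 1.0
implicit = sum(1.0 * v for v in scores.values())

# Explicit weighting, as in the slide's example
weights = {"cost": 2.0, "duration": 1.0, "reputation": 1.5}
explicit = sum(weights[k] * scores[k] for k in scores)

print(round(implicit, 2))  # 1.0*0.7 + 1.0*0.5 + 1.0*0.9
print(round(explicit, 2))  # 2.0*0.7 + 1.0*0.5 + 1.5*0.9
```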
Data Mining = Procedure of machine learning methods

1) Identification of patterns (principles)
2) Deduction of structures (rules, models)
3) Forecasting of behaviour (application of the model)

Data
Information ≙ Pattern
Knowledge ≙ Structures / Models
Wisdom ≙ Forecasting
Data Mining = Procedure of machine learning methods
Example of a pattern and a structure:

Data = observation, measuring
Information ≙ Pattern = description and recognition of the measurements
Knowledge ≙ Structures / Models = the generalised theory by which the observations are explained
Wisdom ≙ Forecasting = using the theory, the information, and the data to simulate and forecast not-observed scenarios
Terminology

Data = recorded facts
Information = set of patterns or expectations
Knowledge = accumulation of sets of expectations
Wisdom = usefulness, related to the knowledge
Plato's Cave Analogy

The problem: we can never see (record) the whole reality, but only an incomplete mapping.
[Figure: dancing people and the shadow of the dancing people on the cave wall. The observer can only see the shadows and has to interpret what the original "thing" / "meaning" is.]
Data Structure for Formalization of Information and Knowledge 1

Object = thing with a certain meaning (given by its name) and a certain appearance (given by its attributes and by the data of the attributes).

The thing can be
- a real object, e.g. a window
- a behaviour, e.g.
  - opened, closed
  - transparent, clear
  - aging
- a behaviour due to the interaction of several things, e.g.
  - a window is opening and closing due to the wind
  - a window is aging due to rain, wind, sun, and (good/bad) operation by humans
What can we observe?

1) Object: geometric form, colour, material, positions
2) Relationship: location (in the wall), topology (to the ground)
3) Behaviour: stress distribution, deflection, vibration, aging, and so on ...

Each is described by one or more attributes. Each attribute is expressed by a datum (value) from a set of data (values). Some or each attribute can be modelled as a (sub-)object.
Closed World

If we know that we are describing / observing windows, we can evaluate the attributes of the schema (concept) window and determine which kind of window the particular one is, i.e. we classify the particular window into one of the several classes represented by the values of the attributes.
This means:
1) We already know what a window is, and we evaluate the observed data according to windows.
2) We already know the (possible) sets of the attributes.
3) We already know the (possible) classes constituted by the values of the attributes.
Hence we have a closed (predetermined) world, and therefore we can do straightforward classification.
Open World

If we do not know what we observe (e.g. image analysis) but have recorded a lot of data (taken many photos, each consisting of many pixels), we can nevertheless identify windows – but also doors, gates, etc. instead of windows (!) – when we extend our procedure by two steps, namely:
1) Analyse the sets of data to find similarities / dissimilarities between the sets, by partitioning each set of data into subsets and comparing the subsets. This is called identification / analysis of patterns.
2) Generalise the patterns and find an objective structure (theory) which explains the patterns, i.e. synthesize the result of the patterns. This is called building a concept.
   A concept can be the schema of an object with its attributes and the value range of each attribute (in the ideal case). A concept is a schema of an object and hence a class structure.
3) Classify further observations (as explained in the beginning) in order to
   a) identify the particular object, if the "thing" in question is an object;
   b) forecast the object behaviour, if the "thing" in question is a behaviour;
   c) identify the relationships between the objects, if there is more than one.
Hierarchy of Methods

- Knowledge Management
- Information Mining
- Data Mining
- Machine Learning
- Data Analysis
- Signal Processing
- Statistics
- Data Collection
- Sensors (systems)
- Design of observation
Data Collection => Fact Table

1) Fact Table (or records)
Example: Relation (behaviour) weather–play

Weather data:

outlook   temperature  humidity  windy  play
sunny     hot          high      false  no
sunny     hot          high      true   no
overcast  hot          high      false  yes
rainy     mild         high      false  yes
rainy     cool         normal    false  yes
rainy     cool         normal    true   no
overcast  cool         normal    true   yes
sunny     mild         high      false  no
sunny     cool         normal    false  yes
rainy     mild         normal    false  yes
sunny     mild         normal    true   yes
overcast  mild         high      true   yes
overcast  hot          normal    false  yes
rainy     mild         high      true   no
Knowledge Representation

Knowledge is usually represented by rules. A rule has the form
- premise (if)
- conclusion (then)

The four main forms to represent (the rules which contain the) knowledge are:
1) Decision Tables
2) Decision Trees
3) Classification Rules
4) Association Rules
Knowledge Representation - Decision Tables
1) Decision Tables (look-up tables)
Looks like a fact table.
The only difference is that:
- Each row is interpreted as one rule
- Each attribute is combined with an AND
Weather data
15
outlook
temperature humidity
windy
play
sunny
sunny
overcast
rainy
rainy
rainy
overcast
sunny
sunny
rainy
sunny
overcast
overcast
rainy
hot
hot
hot
mild
cool
cool
cool
mild
cool
mild
mild
mild
hot
mild
false
true
false
false
false
true
true
false
false
false
true
true
false
true
no
no
yes
yes
yes
no
yes
no
yes
yes
yes
yes
yes
no
InfoMining
high
high
high
high
normal
normal
normal
high
normal
normal
normal
high
normal
high
Institute of Construction Informatics, Prof. Dr.-Ing. Scherer
Decision Tables

In decision tables, all possible combinations of values of all attributes have to be explicitly represented (the ideal case):

n_combi = ∏_{i=1}^{m} n_{a_i}

where m is the number of attributes and n_{a_i} is the number of values of attribute a_i.
This means for the given example of the relation "weather–play", which has m = 4 attributes (outlook, temperature, humidity, windy), that there exist

3 × 3 × 2 × 2 = 36 combinations.

For a new set of attribute values we only have to look it up in the table, i.e. find the row which shows a 100% match with the given set, and read off the result, namely play = "yes/no".
This is the ideal case.
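The look-up idea can be sketched in Python. This is a minimal illustration: the 14 rows are the observed weather–play fact table from the earlier slide, the ideal table would hold all 36 combinations, and a miss marks a knowledge gap:

```python
from itertools import product

# The 14 observed rows of the weather-play fact table:
# (outlook, temperature, humidity, windy, play)
ROWS = [
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
]

# The ideal decision table would contain all 3 x 3 x 2 x 2 = 36 combinations:
ALL_COMBINATIONS = list(product(
    ["sunny", "overcast", "rainy"],
    ["hot", "mild", "cool"],
    ["high", "normal"],
    [False, True],
))

TABLE = {row[:4]: row[4] for row in ROWS}  # premise -> conclusion (play)

def look_up(outlook, temperature, humidity, windy):
    """100% match against the table; None marks a knowledge gap."""
    return TABLE.get((outlook, temperature, humidity, windy))

print(len(ALL_COMBINATIONS))                      # 36
print(look_up("sunny", "hot", "high", False))     # observed row
print(look_up("overcast", "cool", "high", True))  # not observed: gap
```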
Objectives of Decision Making

Usually we do not know all combinations. For real problems there can be several thousand! Therefore we reduce the possible number of combinations to the most important ones.
If we do this only by deleting rows in the decision table, we end up with information / knowledge gaps, and we would have partitioned our world into a decidable part and an undecidable part. The latter would be called "stupid". This is not what we want to have.
In addition, we are usually never able to observe all possible cases, and hence we would have natural gaps.
Our objective is always to end up with a decision, whether correct or false, but never with abstention (if not explicitly allowed). Of course, we want to avoid or at least minimize false decisions.
Generalisation

Therefore we have to generalize the remaining rows in such a way that they cover all the decisions of the deleted and unknown (not observed) rows, with as few wrong decisions as possible.
If we make the generalisation without allowing wrong decisions for any observed case, we would have an overdetermined problem, which may also contain some attributes or attribute combinations which are dependent, i.e. there are identical rules in the rule base.
However:
1) It is hard to find all or enough dependent combinations.
2) To find the dependent combinations we would first have to set up the full decision table.
3) Usually we want to reduce the ideal decision table much more than only by the dependent combinations.
4) Usually we can never observe all possible cases, i.e. we always have natural gaps.
Institute of Construction Informatics, Prof. Dr.-Ing. Scherer
Technische
Universität
Dresden
Shortcomings of Generalization
Therefore we have to merge several rows to one row which is possible.
The most simple way is to neglect (the values of) one or more attributes.
This is the most simple way of generalization (remark: it is the only way of
generalization in relational data banks).
Say we keep only outlook, the decision table reduces to
outlook
sunny
rainy
overcast
play
no
yes
yes
As a consequence, we make some wrong decision.
But we fulfil the first and main objective, namely we are able to make
always a decision.
For our example, this would lead for the 14 given combinations (i.e. our
known world) to 2+2+0=4 wrong decisions
19
InfoMining
Institute of Construction Informatics, Prof. Dr.-Ing. Scherer
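The 2 + 2 + 0 = 4 count can be checked with a short sketch. Only the outlook and play columns of the 14 fact-table rows are needed, since the other attributes are neglected by the generalised rule:

```python
# (outlook, play) pairs of the 14 observed rows, in fact-table order;
# the neglected attributes (temperature, humidity, windy) are dropped.
ROWS = [
    ("sunny", "no"), ("sunny", "no"), ("overcast", "yes"),
    ("rainy", "yes"), ("rainy", "yes"), ("rainy", "no"),
    ("overcast", "yes"), ("sunny", "no"), ("sunny", "yes"),
    ("rainy", "yes"), ("sunny", "yes"), ("overcast", "yes"),
    ("overcast", "yes"), ("rainy", "no"),
]

# The generalised outlook-only rule from the slide:
RULE = {"sunny": "no", "rainy": "yes", "overcast": "yes"}

wrong = {"sunny": 0, "rainy": 0, "overcast": 0}
for outlook, play in ROWS:
    if RULE[outlook] != play:
        wrong[outlook] += 1

print(wrong)                # wrong decisions per outlook value
print(sum(wrong.values()))  # total: 2 + 2 + 0 = 4
```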
Shortcomings of Generalization

In the given example we reduced the 36 possible combinations (rows), each expressed by one of 36 rules like

if outlook = sunny
and temperature = hot
and humidity = high
and windy = false
then play = no

to 3 simple rules:

if outlook = sunny then play = no
if outlook = rainy then play = yes
if outlook = overcast then play = yes
Range of Wrong Decisions

We know that for 4 out of 36 possible cases we would make a wrong decision, i.e. for about 11%.
However, we do not know how many further wrong decisions we will make – 0 or 22 or something in between – because we only know that we have an observation gap of 22 cases.
This statement is based on the assumption that we have described our problem (universe of discourse, UoD) completely by 4 attributes. However, if we take into consideration that the UoD may be biased – say it is governed by 5 attributes, i.e. 1 additional attribute we do not know – then we would have an unknown range of 36 × the number of values the unknown attribute can take.
Remark: a hint for an unknown attribute is given if there are two rows in the decision table with identical values but two different decisions (play = yes / play = no).
Liability of Knowledge

We can now apply naive statistics in order to estimate the number of further wrong decisions. Namely, when we assume that our known world, represented by 14 rules,
a) is a representative part of the whole world, i.e. the sample is representative for the universe of discourse (UoD),
b) the rules are unbiased, i.e. all known rules are error free,
c) all attributes are known, i.e. the UoD is unbiased,
then we can estimate that 10 out of 36 decisions would be wrong.
What we did can be explained by statistical theory: we evaluated the mean rate of wrong decisions in our known world (4 of 14), assumed that this mean value is the true value of the total world (UoD), and forecast the number of wrong decisions using this mean value.
Note: we do not consider any uncertainty here.
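The naive estimate above is one line of arithmetic: the error rate observed on the 14 known cases is taken as the true rate for all 36 cases of the UoD:

```python
# Naive statistical estimate: observed error rate extrapolated to the UoD.
known_cases, known_wrong, all_cases = 14, 4, 36

estimated_wrong = all_cases * known_wrong / known_cases
print(round(estimated_wrong, 1))  # about 10 of 36 decisions wrong
```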
Decision Trees

We have seen from decision tables that each value set of the attributes, i.e. each row in the table, can be expressed as one rule with a simple semantic, namely

if {all attributes are true, i.e. show a certain value} then {classify b}
if {∧_i (a_i = v_j)} then b_l = v_k

This means we have used a sequential system for our rule system. However, it is well known that parallel systems are also possible. There the status of only one attribute is evaluated, i.e. checked against all possible values, before the next attribute is considered in a separate step.
Applying this to a rule system, we come up with nested rules. The graphical representation of nested rules is a tree structure, and we call this new representation a decision tree.
General Structure of Decision Trees

In general terms a decision tree can be expressed as

if {state a1} :=
    {a1 = v1} then if {state a2} ...
    ...
    {a1 = vm} then if {state a2} ...
end if

and in each branch vj this is repeated for the next attribute ai, and so on for all i = 1..N.
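The nested-rule structure can be sketched as a recursive data structure: an inner node tests one attribute against its values, a leaf carries the decision. The tree below is the one commonly learned from the weather data (assumed here for illustration; this slide does not give it):

```python
# Inner node: (attribute, {value: subtree}); leaf: the decision itself.
TREE = ("outlook", {
    "sunny": ("humidity", {"high": "no", "normal": "yes"}),
    "overcast": "yes",
    "rainy": ("windy", {False: "yes", True: "no"}),
})

def classify(tree, instance):
    """Walk the nested rules: test one attribute per layer until a leaf."""
    if not isinstance(tree, tuple):   # leaf reached: return the decision
        return tree
    attribute, branches = tree
    return classify(branches[instance[attribute]], instance)

print(classify(TREE, {"outlook": "sunny", "humidity": "normal"}))
print(classify(TREE, {"outlook": "rainy", "windy": True}))
```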
Ranking in a Decision Tree

If we know all combinations of the UoD and want to express them all (our ideal goal, in order to avoid wrong decisions), as we did for the decision table, the ranking of attributes – i.e. which is operated on first, second, ... – does not matter at all. We can then apply the general formula straightforwardly, using any arbitrary order. For convenience we can choose i = 1, 2, 3, ..., N, and we end up with a tree that branches on a1 at the first layer, a2 at the second, and so on.
[Figure: a multi-branching tree with one layer per attribute a1, a2, a3, a4, ..., an, and Y/N decisions at the leaves.]
Normalisation to a Binary Decision Tree

For several conveniences (memory size, processing time, search time, etc.) the multi-branching tree is transformed into a binary tree, or is already built up as a binary tree from the beginning.
This means that we apply the following transformation rule to every non-binary branch (M > 2 values):

if (state ai = vj) then {yes branch}
else {no branch: all other values}

which means that we divide the value range at each layer into two halfspaces, namely the currently considered value and the other, not yet considered values.
This results in a tree, illustrated on the next slide only for a1 and a2.
Explosion of Layers through Binary Tree Representation

[Figure: binary tree testing a1 = v1, a1 = v2, a1 = v3 with yes/no branches at the top layers; in each branch the tests a2 = v1, a2 = v2, then a3 = v1, a3 = v2, a4 = v1, ... are repeated, so the subtrees are replicated for every excluded value.]
Shortcomings of Simple Explicit Binary Decision Trees

A property of binary decision trees is the replication of subtrees: every attribute value generates a new replication, namely

replications of subtrees = number of values − 1

and this repeats for each attribute in each subtree again and again.
As long as we want to (and can) express all combinations there is no shortcoming, but
1) we do not want to consider all combinations, only the important ones, i.e. we generalize our explicit knowledge space;
2) we usually do not know all combinations, which can be interpreted as an uncontrolled generalization.
Both lead to the result that
1) the ranking of the attributes is important;
2) attributes are no longer sorted in layers but mixed, in order to obtain an optimal structure of the tree.
Generalised Decision Tree

In decision trees the generalisation process is much more visible and hence controllable.
Generalisation means, e.g., deleting a subtree and substituting it with only one decision.
[Figure: a multi-branching tree over a1, a2, a3, a4, ..., an in which some subtrees have been cut off and replaced by direct decisions.]
Classification (Rules)

Representation of knowledge by classification rules means

if {ai and/or/not/etc. aj} then {bk},  i, j = 1..N, k = 1..M

We have already used this representation when we explained the meaning of decision tables and decision trees. Hence decision tables and decision trees are only another ("visual") representation of a set of rules.
Classification (Rules)

This is true with the exception that it is straightforward to transform decision trees and tables into classification rules, but when transforming classification rules into decision tables, we have to expand all ORs into ANDs, because decision tables are look-up tables and therefore use only ANDs.
Classification (Rules)

For decision trees a straightforward transformation is formally possible and correct, but the advantages of decision trees get lost, namely an optimised arrangement in which either
- the size of the tree is minimised, or
- readability is maximised (e.g. one layer for each attribute), or
- a combination of both, e.g. optimised for human understanding.
Institute of Construction Informatics, Prof. Dr.-Ing. Scherer
Technische
Universität
Dresden
Ranking Dependence
As long as we have all cases represented ranking of
•
rows in decision tables
•
attributes and values in decision trees
•
classification rules in rule bases
are not influencing the result.
However we
a) do not have all cases (observations)
b) want to reduce rows, branches, rules by generalisation
This results in a ranking dependency problem.
This holds also for classification rules – which may be overseen, because at a
first glance a rule maybe seen as selfstanding, independent of knowledge,
which is definitely not the case. Each rule is always embedded in its
context, represented by other rules and expressed by the ranging. This
means the solution is always path dependent!
So ranking is already a part of the representation of the knowledge.
33
InfoMining
Institute of Construction Informatics, Prof. Dr.-Ing. Scherer
Association Rules 1

Association rules express the relationship between arbitrary attribute states:

if {ai = state1} then {aj = state2}

for all i ≠ j and i = 1..N, j = 1..M, where ai, aj ∈ {A, B}.
If we restrict all aj to be elements of B only, i.e.

if {ai = state1} then {bj = state2},

then we have a classification rule. Hence, classification rules are a subset of association rules.
Association Rules 2

With association rules we can combine any attribute state (attribute value) with any other attribute state or any grouping of attribute states. There is no limitation whatsoever.
As a consequence, we allow dependencies (or redundancies) between the rules.
It would not be wise to express all or even many association rules, because we would produce an uncontrollable sub-space of the inherent knowledge with many redundant rules, i.e.
- some information is not expressed at all,
- some information is expressed once,
- some information is expressed several times.
Thereby we would lose our basic weighting criteria, namely
- that each rule is equally important,
- that the importance of an attribute or an attribute value is the frequency of its appearance in the rules,
- both of which may be generalized by adding a verifiable, arbitrary weighting factor.
Objectives of Association Rules

Association rules should only be applied as a shortcut, in addition to a clearly specified minimal rule set without redundancies. Such shortcuts are used
- for important relationships,
- for frequently appearing relationships,
- for simplified solutions,
in order to considerably reduce the search time.
Dresden
Examples of Association Rules
Examples:
If temperature = low
Then humidity = normal
If windy = false and play = no
Then outlook = sunny and humidity = high
If humidity = high and windy = false and play = no
Then outlook = sunny
All are correct expressions (correct "knowledge" expressed in a
rule).
37
InfoMining
Institute of Construction Informatics, Prof. Dr.-Ing. Scherer
Technische
Universität
Dresden
Coverage and Accuracy
Coverage (or strength or support) is the number of instances for
which a rule predicts correctly
Accuracy (or confidence) is the ratio of instances the rule
predicts correctly (consequences) related to all instances it
applies for (premise).
correct consequences (coverage)
accuracy =
correct premises (applications)
38
InfoMining
Institute of Construction Informatics, Prof. Dr.-Ing. Scherer
Coverage and Accuracy

Examples:

if temperature = cool then humidity = normal
    applies = 4 (temp = cool)
    coverage = 4 (humidity = normal | temp = cool)
    accuracy = 1.0 (100%)

if outlook = sunny then play = yes
    applies = 5
    coverage = 2
    accuracy = 40%

if outlook = sunny and temperature = mild then play = yes
    applies = 2
    coverage = 1
    accuracy = 50%
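These numbers can be reproduced with a small sketch over the 14-row weather fact table (columns: outlook, temperature, humidity, windy, play):

```python
# The 14 observed rows of the weather-play fact table.
ROWS = [
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
]

def rule_stats(premise, conclusion):
    """Return (applications, coverage, accuracy) of a rule on ROWS."""
    applies = [r for r in ROWS if premise(r)]
    covered = [r for r in applies if conclusion(r)]
    return len(applies), len(covered), len(covered) / len(applies)

# if temperature = cool then humidity = normal
print(rule_stats(lambda r: r[1] == "cool", lambda r: r[2] == "normal"))
# if outlook = sunny then play = yes
print(rule_stats(lambda r: r[0] == "sunny", lambda r: r[4] == "yes"))
```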
Rules with Exceptions

With the possibility to formulate exceptions like

but not for {ai = state}

we are able to refine generally applicable rules in a very efficient way: we divide the value range of the attribute ai into two halfspaces, namely
1) the value = state, which is excluded (= false), and
2) all the remaining values (true).
By specifying only one value, we keep the coverage and reduce the applications as little as possible, hence we maximise accuracy.
This means that we can start with a simple and very general rule and sharpen it by adding exceptions.
Rules with Exceptions

Example:

if outlook = rainy then play = yes
    applications = 5, coverage = 3, accuracy = 60%

when we add "and windy = not true":
    applications = 3, coverage = 3, accuracy = 100%

when we would instead add "and temperature = not cool":
    applications = 3, coverage = 2, accuracy = 66%
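The effect of the two candidate exceptions can be checked against the 14-row fact table; the sketch below reproduces the numbers on the slide:

```python
# The 14 observed rows: (outlook, temperature, humidity, windy, play).
ROWS = [
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("rainy", "mild", "high", True, "no"),
]

def stats(premise):
    """(applications, coverage, accuracy) of 'premise -> play = yes'."""
    applies = [r for r in ROWS if premise(r)]
    covered = [r for r in applies if r[4] == "yes"]
    return len(applies), len(covered), round(len(covered) / len(applies), 2)

print(stats(lambda r: r[0] == "rainy"))                      # base rule
print(stats(lambda r: r[0] == "rainy" and not r[3]))         # + not windy
print(stats(lambda r: r[0] == "rainy" and r[1] != "cool"))   # + not cool
```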
Rules with Relations

Propositional rules: up to now, we evaluated each attribute separately, i.e. we compared the value of the attribute with a given value set. Such rules are called propositional rules, and they have the same power as the propositional calculus of logical reasoning.
Relational rules: sometimes it is convenient to compare two attributes, as in

if {ai} is in some relation to {aj} then {bi}

This implies that ai and aj have the same unit or can be transformed into one and the same unit. The comparison can be any Boolean operation.
Example: if (height > length) then object = column
Rules for Numerical Values

All numerically valued attributes can be dealt with in the same manner as nominal values, applying the halfspace principle, namely:
- divide the range of values into two halfspaces;
- the two halfspaces need not be symmetric or contain equal numbers of values;
- repeat this recursively until sufficiently small intervals remain;
- the number of recursions is independent between branches.
This leads straightforwardly to a binary decision tree.
Rules for Numerical Values

Equivalently, we can pre-divide the range of values into equally (a2) or arbitrarily (a1) sized intervals and directly show into which interval the observed attribute value falls. This is equivalent to a multi-branching tree.
The test of equality (=) is possible but not feasible, because it is, e.g. for R, an arbitrarily rare event.
Semantically and ordinally ranked data can be dealt with in the same way. There, the test of equivalence may be feasible because of the very limited number of values.
Instance-based Representation

In contrast to rule-based representation, where we test each observation for equality, namely

if {ai = state}, then ...

in instance-based representation we test the attribute value set t against the distance to a given state with n sets and evaluate the minimal distance.
Each attribute value set is a vector with as many components as attributes, e.g. one row of the fact table:

s1 = [sunny, hot, high, false]ᵀ
Instance-based Representation

The state is now a given set of attribute value sets, e.g. 10 rows of the fact table:

state = [s1 ... sn]

So we test a vector t against a vector set [s1 ... sn]:

if {distance(t, si) = min} then {bt = bsi}

and we use the consequence of the closest vector for the prognosis (decision).
Remark: nominal values are usually transformed into numbers, e.g. true = 0, false = 1.
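A minimal sketch of this nearest-instance decision: here a simple overlap distance is assumed (0 per matching component, 1 per mismatch) instead of a numeric transformation, and ties are broken by the first closest instance. The three stored instances are rows of the weather fact table:

```python
# Stored instances: (attribute vector, decision).
STATES = [
    (("sunny", "hot", "high", False), "no"),
    (("overcast", "hot", "high", False), "yes"),
    (("rainy", "cool", "normal", True), "no"),
]

def distance(t, s):
    """Overlap distance: count of mismatching components."""
    return sum(int(a != b) for a, b in zip(t, s))

def classify(t):
    """Use the consequence of the closest stored vector (1-nearest-neighbour)."""
    return min(STATES, key=lambda pair: distance(t, pair[0]))[1]

# One mismatch to the first instance, two to the second, three to the third:
print(classify(("sunny", "hot", "normal", False)))
```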
Comparison of Instance-based to Rule-based Representation

Compared to the rule-based representation of numerically valued problems, where we deal with fixed intervals, here we deal with non-fixed intervals – but only at first glance. In fact, we also have fixed boundaries, namely boundaries defined halfway between the state vectors s1, s2, s3, ...
The only difference is that the intervals are arbitrary in size and that we do not explicitly define the intervals, but define the centers (lines) of the intervals (classes).
Another advantage over the explicitly expressed interval procedure of the rule-based representation is that we can easily add an additional instance for better representing the knowledge space, i.e. for refinement of the space.
Requirements for Instance-based Representation

There are three requirements:
1) We need a metric.
2) All attributes have to be representable in one and the same metric.
3) We need a distance measure, also called a norm.
Norm
A norm is defined as a mapping
 : RP  R
which fullfil the 3 requirements
x  0  x  0,,0
T
ax  a  x
a  R, x  R P
xy  x  y
x, y  R P
The most well-known norm is the Euklid Norm
(=geometric distance in the Euklid space)
d 
N

di
i 1
49
InfoMining
2
with d=a-b, di=ai-bi
Institute of Construction Informatics, Prof. Dr.-Ing. Scherer
Norm

In general terms the Euclidean norm can be written as

‖d‖ = ( dᵀ A d )^(1/2)   with A = I (ones on the diagonal, zeros elsewhere)

This can be generalised to the diagonal norm with

A = diag(a1, a2, a3, ...)

where the ai are arbitrary values, which can be interpreted as weighting factors (one for each component of the instance vector).
We can now also imagine off-diagonal values: by setting the off-diagonal values to non-zero, we include dependencies between attributes in our distance measure, i.e. we evaluate relationships between attributes (relational rules).
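The diagonal norm is a one-liner; the sketch below shows how identity weights recover the Euclidean norm, and how arbitrary diagonal weights change the distance (the weight values are assumptions for illustration):

```python
import math

def diagonal_norm(d, weights):
    """||d|| = sqrt(d^T A d) with A = diag(weights)."""
    return math.sqrt(sum(w * x * x for w, x in zip(weights, d)))

d = [3.0, 4.0]
print(diagonal_norm(d, [1.0, 1.0]))  # identity weights: Euclidean norm
print(diagonal_norm(d, [2.0, 0.5]))  # weighted: sqrt(2*9 + 0.5*16)
```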
Minkowski Norm

The Minkowski norm is defined as

‖d‖_q = ( Σ_{i=1}^{N} |d_i|^q )^(1/q)

Choosing q = 2, we obtain the well-known Euclidean norm.
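The whole Minkowski family fits in one function; q = 1 gives the Manhattan distance and q = 2 the Euclidean norm of the following slides, and a large q already approaches the max norm:

```python
def minkowski(d, q):
    """Minkowski norm ||d||_q = (sum |d_i|^q)^(1/q)."""
    return sum(abs(x) ** q for x in d) ** (1.0 / q)

d = [3.0, -4.0]
print(minkowski(d, 1))    # Manhattan distance
print(minkowski(d, 2))    # Euclidean norm
print(minkowski(d, 100))  # approaches max(|d_i|) = 4
```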
Minkowski Norm: Manhattan Distance

For convenience, often the well-known Manhattan (or city-block) distance is used, which is obtained for q = 1:

‖d‖_1 = Σ_{i=1}^{N} |d_i|

This means that instead of computing the bird's-flight distance (Euclidean norm) we walk around each block in Manhattan, i.e. we sum up Δx + Δy.
The deviation (error) from the geometric distance is immediately seen: in two dimensions

‖d‖_2 ≤ ‖d‖_1 ≤ √2 · ‖d‖_2,   i.e. ‖d‖_1 = (1 ... √2) · ‖d‖_2
Minkowski Norm: Max Norm

For lim q → ∞ we obtain the supremum or max norm:

lim_{q→∞} ( Σ_{i=1}^{N} |d_i|^q )^(1/q) = max_{i=1..N} |d_i|

The qualitative difference between the norms is the importance we give to large distances between components compared to small ones. As we increase q, we give more importance to large distances, and for q → ∞ we give our attention only to the largest distance.
The natural norm is the Euclidean norm.
Hyperbolic Norm

A very often used norm is the hyperbolic norm:

‖d‖_H = ∏_{i=1}^{N} |d_i|

However, this norm does not fulfil the mathematical definition of a norm (none of the three requirements).
If we add the N-th root, it is the well-known geometric mean:

d_g = ( ∏_{i=1}^{N} |d_i| )^(1/N)
Relative Norm

All these norms depend on the number of elements, which is correct for a distance. However, sometimes we want to have only the quality of the distance. Then we simply divide by the number of elements and arrive at a generalised mean value:

( (1/N) Σ_{i=1}^{N} |d_i|^q )^(1/q)
Technische
Universität
Dresden
Explicit Portioning of the Instance Space
(a)
The classification of attribute sets according to the shortest-distance criterion leads to the portioning of the information (or instance) space as given in Figure a: the boundaries of a class are obtained as a polygon (hyperplanes), where each boundary line is perpendicular to, and passes through the midpoint of, the shortest distance between the class vectors. This leads to very-hard-to-express boundaries for each class.
Explicit Portioning of the Instance Space
(b)
(c)
A more convenient simplification is to describe each class by rectangular boundaries
as shown in Figure b. As a consequence, boundaries can easily be described leading
to simple rules, which the human understanding can conceive, i.e. we can rationalize
upon the boundaries and hence the classes.
The rectangular box means that each attribute value of each class has a well defined
upper and lower limit, hence we have defined an explicit interval for each attribute.
The difference between instance-based and rule-based representation is that the
intervals for instance-based ones are not of equal size for each class, whereas for
rule-based representation they are.
Remark: If we generalize rules this may result in non-equal intervals.
Generalization of instance-based representation can end up in nested portions like the ones shown in Figure c. This is the typical case for rule-based representation using exceptions, namely the rule for the outer box and the exception for the inner box.
Machine learning
All machine learning methods generate rules, i.e. they extract
rules from the observed data, the fact data.
Each machine learning method expresses another relationship between the data, i.e. expresses another system. If we choose a machine learning method which does not fit the inherent system of the data, we will receive a rule set which is
(1) complex and
(2) often makes false predictions
– but we will always receive a rule set!
Therefore we need a measure for the quality of the learned rule set in order to decide on the best or most appropriate learning method.
Machine learning
If we knew the underlying system in advance, we would model it either in mathematical or in logical expressions.
If we know nothing, we have to use Information Mining and Machine Learning methods, of course.
If we know something about the system, we should first model the system with the appropriate expressions and then transform the data by the system before we again apply Information Mining methods.
This is called a hybrid method.
Example: Limited observation range
Observation range 0 – 30 m
(with attributes: x=0, x=10, x=20, x=30)
Prediction range 0 – 100 m

[Figure: y [m] over x [m]]
Example: System roughly known
Draft System: y = ax+bx²
Assumption of polynomial of 2nd or higher order
[Figure: y [m] over x [m]]
Example
Real system: diagonal throw

  y = x·tan φ − ( g / (2 v² cos² φ) ) · x²

with φ = φ̄, σ_φ = 10% and v̄ = 50 m/s, σ_v = 10%
Example
Reduction of one order of the polynomial results in a
straight line. This indicates, that the assumption of a
polynomial of 2nd order seems to be correct.
[Figure: y/x [−] over x [m]]
Example
Now we will also consider a similar
example with y(x=0) ≠ 0
[Figure: y [m] over x [m]]
Example
If y(x=0) ≠ 0, reducing the polynomial by one order does not lead to an improvement as in the case shown before, but to a very bad result.
This illustrates that in a hierarchical approach the first chosen approaches (or learning methods) have an important impact on the overall result!
[Figure: y/x [−] over x [m]]
Example
Taking into consideration the knowledge that y(x=0) = 20, we can confirm the assumption that y = a + bx + cx².

[Figure: (y−20)/x [−] over x [m]]
Example
The following table shows the data-based average of Y for each attribute X, the model-based estimate YM and the ratio YM/YD:

X (Attribute) | YD (data-based average of Y) | YM (model-based estimate of Y) | YM/YD
0,00          | 20,00                        | 20,00                          | 1,00
10,00         | 37,68                        | 37,68                          | 1,00
20,00         | 53,49                        | 53,47                          | 1,00
30,00         | 67,43                        | 67,39                          | 1,00
40,00         | 79,50                        | 79,42                          | 1,00
50,00         | 89,69                        | 89,58                          | 1,00
60,00         | 98,02                        | 97,85                          | 1,00
70,00         | 104,48                       | 104,25                         | 1,00
80,00         | 109,06                       | 108,76                         | 1,00
90,00         | 111,78                       | 111,40                         | 1,00
100,00        | 112,62                       | 112,15                         | 1,00

In case of the realistic assumption of a 2nd-order polynomial we get the parameters a=20, b=1.8615 and c=-0.0094.
If we compute the function y(x) = a + bx + cx² with these parameters and divide the result by the average of the Y-data, we get a measure for the accuracy of our assumption.
In this example we get YM/YD = 1 for all attributes. This is the case of exact fitting of the assumed function y(x). The error increases with increasing deviation from 1.
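The table check above can be reproduced in a few lines; the parameter values are taken from the slide, the rounding is mine:

```python
xs = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
# YD: data-based averages of Y from the slide's table
yd = [20.00, 37.68, 53.49, 67.43, 79.50, 89.69, 98.02, 104.48, 109.06, 111.78, 112.62]
a, b, c = 20.0, 1.8615, -0.0094          # fitted 2nd-order parameters from the slide

# YM: model-based estimate y(x) = a + bx + cx^2
ym = [a + b * x + c * x * x for x in xs]
ratio = [m / d for m, d in zip(ym, yd)]
print([round(r, 2) for r in ratio])       # all ratios ≈ 1.00
```

All ratios are close to 1, confirming the 2nd-order assumption fits the averaged data almost exactly.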
Example
If we had supposed a linear dependence between x and y, namely y = a + bx, the parameters of the curve would be a=34.029 and b=0.9262.
In this case YM/YD shows a much greater deviation from 1 and hence is a rather bad estimate.
Comparison:

X      | YD (data-based average of Y) | YM (model-based estimate of Y) | YM/YD
0,00   | 20,00                        | 34,03                          | 1,70
10,00  | 37,68                        | 43,29                          | 1,15
20,00  | 53,49                        | 52,55                          | 0,98
30,00  | 67,43                        | 61,82                          | 0,92
40,00  | 79,50                        | 71,08                          | 0,89
50,00  | 89,69                        | 80,34                          | 0,90
60,00  | 98,02                        | 89,60                          | 0,91
70,00  | 104,48                       | 98,86                          | 0,95
80,00  | 109,06                       | 108,13                         | 0,99
90,00  | 111,78                       | 117,39                         | 1,05
100,00 | 112,62                       | 126,65                         | 1,12

[Figure: YM/YD over X, marking the range of bad estimation (far from 1) and good estimation (near 1)]
One arbitrarily sampled sample
If we assumed independence between the observation points, we would receive a valid sample such as the one given below. However, this is a very unrealistic curve for a throw.
Therefore the dependence between observation points, leading to an observation set, is very important knowledge about the system. Ignoring it is a typical mistake made in stochastic applications.

[Figure: y [m] over x [m]]
Training & Test Set
The more information we have – i.e. not only data but also so-called pre-information – and the more of it we model appropriately, the better the information mining method will work.
Nevertheless, we need an objective verification of the quality of the model (the pre-modelled system + the added learning system), i.e. we need a measure and a data set on which to apply the measure. Therefore we have to divide the observed data set into two parts:
- training set
- test set
Usually the test set is chosen as 50% of the training set, i.e. the observed data are divided 2:1.
Training & Test Set
Therefore our world is divided into 3 parts:
- unknown part
- training set
- test set (training and test set together form the observed part)
where the training and the test set should be representative sets of the UoD = Universe of Discourse (the whole world we consider).
The requirement "representative" is often not verifiable, because we do not know the unknown world.
Distribution of Observations
(1) Extrapolation problem: we have no observations at all about a great part of the world. Machine learning methods are always interpolation methods; hence our predictions are strongly biased if the system behaves differently in the observed and the unobserved world.
(2) We have more observations in some parts and fewer in others; hence we may give more importance to the system behaviour in those parts where we have more observations.
(3) The distribution of observations in the test set is not similar to the distribution in the training set. Hence our quality measure is biased too, as a consequence of (2).

[Figure: number of observations over the whole 1D-world – reality (unevenly distributed training and test sets) vs. ideal (matching training and test set distributions)]
Distribution of Observations
If the system behaviour is similar over the whole world, e.g. y = a + bx or y = a + bx + cx²:

[Figure: y (=response) over x (=input), whole 1D-world]

But if the system behaviour is not monotonic:

[Figure: y (=response) over x (=input), whole 1D-world]

we would like to have the more (and denser) observations, the more non-monotonic the behaviour of the system is.
Strategies for Machine Learning
(1) Apply different learning methods for different domains of the world
  → need a measure of quality to decide on the best learning method
(2) Apply different learning methods in a hierarchical order
  → principle: the simplest system (rule) is the best-fitting one
  → need a measure of quality to decide on the best combination of learning methods
(3) Divide the observed data set several times into different pairs of training and test sets, in order to optimize the probability that domains of non-monotonic system behaviour are well matched by training and test sets
(4) Use each additional observed datum to update the learned system – i.e. always call your learned system into question
Analogy: Curve fitting
For a 1D-world we have the following analogy:

1 attribute → constant model: y = c
2 attributes → linear model: y = a + bx
3 attributes → parabolic model: y = a + bx + cx²
n attributes → n-polynomial model: y = Σ_{i=0..n−1} a_i x^i

We can fit an n-polynomial when we have m observations with m ≥ n. So as a maximum we can fit an m-polynomial. This is known as overfitting.
Structures in random data
Random data can have different structures. Most data sets can be assigned to one of the following structures:
1. One attribute contributes much; most other attributes are redundant or about meaningless
2. All attributes contribute almost equally and independently
3. Few attributes contribute, but in a dependent way, which can be expressed and represented by a decision tree (numerical: correlation function)
4. Few rules structure the data domain into distinctive classes
5. Subsets of attributes show interdependencies
6. (Non)linearly dependent numerical attributes, where the weighted sum of the attributes describes the data structure
7. Non-equal distances between the classes describe the data set best
For each type another learning algorithm fits best, which can only be found by trial and (error) test procedures.
Learning Method: 1R
1R means "1-rule".
1-Rule generates a one-level decision tree: the simplest way is to make a rule which tests only 1 attribute.
(Remark: very attractive is to make a rule in which many attributes are tested, using the exception-rule methods.)
However, the question is which attribute should be tested. Hence we need a measure.
The simplest measure is to use the attribute which maximises right and consequently minimises wrong decisions
= COVERAGE measure.
This must also be done for each value set of each attribute.
Learning Method: 1R
Procedure:
Generate all possible rules with all attributes and values from the training set:

  n_rules = Σ_{i=1..n_a} n_{v,i} ≈ n_a · n_v    (n_a = number of attributes, n_{v,i} = number of values of attribute i)

and select the rule set which has the maximum coverage.
Usually
1) the "coverage" is used to identify the best-fitting value,
2) "(applies − coverage) = min failure" is used to identify the strongest attribute.
For the weather-play example the best attribute is outlook, with a rule set for its values (3 simple rules of PPT 1/20):

Rule                                   | Coverage | Error
if outlook = sunny    then play = no   | 3        | 2
if outlook = rainy    then play = yes  | 3        | 2
if outlook = overcast then play = yes  | 4        | 0
∑                                      | 10       | 4
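A minimal 1R sketch follows; the 14-instance weather data set referenced in the slides (PPT 1/20) is reproduced here from the frequency counts, so treat the listing as an assumption:

```python
from collections import Counter, defaultdict

# Weather data: (outlook, temperature, humidity, windy) -> play
data = [
    ("sunny","hot","high",False,"no"), ("sunny","hot","high",True,"no"),
    ("overcast","hot","high",False,"yes"), ("rainy","mild","high",False,"yes"),
    ("rainy","cool","normal",False,"yes"), ("rainy","cool","normal",True,"no"),
    ("overcast","cool","normal",True,"yes"), ("sunny","mild","high",False,"no"),
    ("sunny","cool","normal",False,"yes"), ("rainy","mild","normal",False,"yes"),
    ("sunny","mild","normal",True,"yes"), ("overcast","mild","high",True,"yes"),
    ("overcast","hot","normal",False,"yes"), ("rainy","mild","high",True,"no"),
]
attrs = ["outlook", "temperature", "humidity", "windy"]

def one_rule(data, a):
    """Build the 1R rule set for attribute index a and count its errors."""
    counts = defaultdict(Counter)
    for row in data:
        counts[row[a]][row[-1]] += 1
    rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
    errors = sum(n for v, c in counts.items()
                 for cls, n in c.items() if cls != rules[v])
    return rules, errors

best = min(range(len(attrs)), key=lambda a: one_rule(data, a)[1])
rules, errors = one_rule(data, best)
print(attrs[best], rules, errors)   # outlook wins with 4 errors
```

Outlook yields the minimum of 4 errors out of 14, matching the coverage/error table above.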
Missing Values
Missing values are treated in the same way as every other value of an attribute.
Hence, if the weather data have missing values for the attribute outlook, the specified rule set will contain 4 possible class values: one each for sunny, overcast, rainy, and one for missing.
However, the problem is: what is the result (consequence) of missing?
The assumption is that missing gets the same result as the most frequently occurring result among the values of the attribute.
Statistical Modelling
Statistical modelling enables the consideration of more than one attribute, and therefore decision making based on all attributes, if these attributes are of equal importance and independent of each other.
Of course, this requirement is not realistic
– real data sets are interesting precisely because the attributes are not of equal importance and are dependent –
but this simplification leads to a simple method which works surprisingly well in practice.
Statistical Modelling
The table below shows a summary of the weather data: for every value of each attribute, the number of occurrences for play=yes and play=no was counted (upper part of the table). The lower part shows the same information as ratios of observed probabilities, i.e. conditional probabilities.
Example: on 2 of the 9 days with play=yes the outlook is sunny.
For play these rates are the days of yes/no divided by all observed days.
Weather data with frequencies and probabilities:

outlook   yes  no  | temperature  yes  no  | humidity  yes  no  | windy  yes  no  | play  yes  no
sunny      2    3  | hot           2    2  | high       3    4  | false   6    2  |        9    5
overcast   4    0  | mild          4    2  | normal     6    1  | true    3    3  |
rainy      3    2  | cool          3    1  |                    |                 |

sunny     2/9  3/5 | hot          2/9  2/5 | high      3/9  4/5 | false  6/9  2/5 |      9/14  5/14
overcast  4/9  0/5 | mild         4/9  2/5 | normal    6/9  1/5 | true   3/9  3/5 |
rainy     3/9  2/5 | cool         3/9  1/5 |                    |                 |
Example
If we want to predict play given that

outlook | temp | humidity | windy | play
sunny   | cool | high     | true  | ?

we can compute (assuming independence) the probabilities:

P[play=yes, E] = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
P[play=no, E]  = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206
P[E] = P[play=yes, E] + P[play=no, E] = 0.0259

Therefore we can compute the probability of play=yes or play=no given the above attribute values (and no other values):

P[play=yes|E] = 0.0053 / 0.0259 = 20.5%
P[play=no|E]  = 0.0206 / 0.0259 = 79.5%
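The hand computation above can be checked with a few lines of arithmetic; the factors are the conditional probabilities read off the frequency table:

```python
# E = (outlook=sunny, temp=cool, humidity=high, windy=true)
p_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # joint likelihood for play=yes
p_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # joint likelihood for play=no
total = p_yes + p_no

print(round(p_yes, 4), round(p_no, 4))   # 0.0053 0.0206
print(round(100 * p_yes / total, 1))     # 20.5
print(round(100 * p_no / total, 1))      # 79.5
```

Dividing by the sum normalises the two joint likelihoods into posterior probabilities.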
Statistical Model – Naive Bayes
This intuitive and simple method is based on Bayesian statistics, which says:
If we have a reliable hypothesis H with statistics from the training set, and we have new observations, we can update the statistics, assuming that the new observation E (even if it is only 1 observation!) is of the same importance as all historical observations:

  P(H|E) = P(E|H) · P(H) / P(E)

with

  P(E|H) = ∏_{i=1..n} P(e_i|H)    assuming INDEPENDENCE between attributes

Now we can compute for the above example E = (outlook=sunny, temp=cool, ...) the probability of occurrence of play=yes:

  P(play=yes|E) = (2/9 × 3/9 × 3/9 × 3/9 × 9/14) / 0.0259 = 20.5%
Statistical Modelling – Naive Bayes
This procedure is based on Bayes' rule, which says: if you have a hypothesis h_i and data D which bear on the hypothesis, then:

  P(h_i|D) = P(D|h_i) · P(h_i) / Σ_{j=1..n} P(D|h_j) · P(h_j)

P(h): probability of h
P(D|h): conditional probability of D given h
P(h|D): conditional probability of h given D
i = hypothesis i
n = number of hypotheses
Statistical Modelling - Example
Example: for a new day we want to forecast play.
outlook = sunny, temperature = cool, humidity = high, windy = true, play = ?
The data are identified as follows:
D1: outlook = sunny
D2: temperature = cool
D3: humidity = high
D4: windy = true
We can take two possible hypotheses:
h1: play = yes
h2: play = no
Statistical Modelling - Example
Therewith the conditional probability of the hypothesis h1: play = yes, given the data set D = {D1,D2,D3,D4}, can be determined as follows:

  P(h1|D) = P(D1|h1)·P(D2|h1)·P(D3|h1)·P(D4|h1)·P(h1) /
            [ P(D1|h1)·P(D2|h1)·P(D3|h1)·P(D4|h1)·P(h1) + P(D1|h2)·P(D2|h2)·P(D3|h2)·P(D4|h2)·P(h2) ]

          = (2/9 · 3/9 · 3/9 · 3/9 · 9/14) / (2/9 · 3/9 · 3/9 · 3/9 · 9/14 + 3/5 · 1/5 · 4/5 · 3/5 · 5/14)
          = 0.205 = 20.5%
(Weather data table with frequencies and probabilities as on the earlier "Statistical Modelling" slide.)
Statistical Modelling - Example
The conditional probability of the hypothesis h2: play = no, given the data set D = {D1,D2,D3,D4}, is determined correspondingly:

  P(h2|D) = P(D1|h2)·P(D2|h2)·P(D3|h2)·P(D4|h2)·P(h2) /
            [ P(D1|h1)·P(D2|h1)·P(D3|h1)·P(D4|h1)·P(h1) + P(D1|h2)·P(D2|h2)·P(D3|h2)·P(D4|h2)·P(h2) ]

          = (3/5 · 1/5 · 4/5 · 3/5 · 5/14) / (2/9 · 3/9 · 3/9 · 3/9 · 9/14 + 3/5 · 1/5 · 4/5 · 3/5 · 5/14)
          = 0.795 = 79.5%
(Weather data table with frequencies and probabilities as on the earlier "Statistical Modelling" slide.)
Problems of Naive Bayes
Conditional probabilities can be estimated directly as relative frequencies:

  P(a_i|b_j) = n_c / n

where n is the total number of training instances of class b_j, and n_c is the number of instances with attribute value a_i and class b_j.
Problem: this provides a poor estimate if n_c is very small (low confidence).
Extreme case: if n_c = 0, the probability of a hypothesis for the concerned attribute is calculated to be zero, and hence the whole probability, determined by multiplying the probabilities of all attributes, wrongly becomes zero.
This problem can be handled by the Laplace estimator:

  P(a_i|b_j) = (n_c + μ·p) / (n + μ)

p = 1/k : a-priori probability (= prior estimate of the probability)
k: number of values that the attribute a_i can take
μ: weighting factor, defines the influence of the a-priori probability
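A sketch of the Laplace estimator; the weight μ = 3 is a made-up choice for illustration, not prescribed by the slides:

```python
def laplace_estimate(nc, n, k, mu=3.0):
    """Laplace-smoothed P(a_i|b_j) = (nc + mu*p) / (n + mu), with p = 1/k a-priori."""
    p = 1.0 / k
    return (nc + mu * p) / (n + mu)

# nc = 0 no longer forces the whole product of probabilities to zero:
print(laplace_estimate(0, 5, 3))   # 0.125 instead of 0.0
print(laplace_estimate(4, 5, 3))   # 0.625 instead of 0.8
```

The smoothed estimate pulls extreme frequencies towards the prior 1/k, the more so the smaller n is.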
Learning Method: 1R
(Continued here.)
Redundant attributes
Assume 5 attributes a1...a5 dependent on the attribute "outlook". Redundant means 100% dependency.
Then, if we assumed independence, the probabilities for "play" would be:

play = yes: conditional: 2/9 × (2/9)^5 × 3/9 × 3/9 × 3/9 × 9/14 = 2,8763e-6
            absolute: 2,8763e-6 / 1,6025e-3 = 0,00179
play = no:  conditional: 3/5 × (3/5)^5 × 1/5 × 4/5 × 3/5 × 5/14 = 1,5996e-3
            absolute: 1,5996e-3 / 1,6025e-3 = 0,99821
Sum of the conditionals: 1,6025e-3; the absolute values sum to 1,00.

This is wrong. The right result would be obtained by multiplying with (1)^5 instead of (2/9)^5 resp. (3/5)^5, because P(X|A) of fully dependent events is 1, i.e. the original result is not changed.
Dependent attributes
If we used the attribute "outlook" only, we would receive:

play = yes: conditional: 2/9 × 9/14 = 0,143; absolute: 0,143 / 0,357 = 0,4
play = no:  conditional: 3/5 × 5/14 = 0,214; absolute: 0,214 / 0,357 = 0,6
Sum of the conditionals: 0,357; the absolute values sum to 1,00.

Résumé: we made the assumption that all the neglected attributes are fully dependent on the remaining attribute.
Numerical Attributes
Numerical values are usually assumed to follow a Normal (or Gaussian) distribution, which is described by the mean and standard deviation.
The table shows an overview of the weather data with numerical attributes.

Weather data with numerical values and summarising statistical values:

outlook   yes  no  | temperature  yes    no   | humidity   yes    no   | windy  yes  no  | play  yes  no
sunny      2    3  |               83    85   |             86    85   | false   6    2  |        9    5
overcast   4    0  |               70    80   |             96    90   | true    3    3  |
rainy      3    2  |               68    65   |             80    70   |                 |
                   |               64    72   |             65    95   |                 |
                   |               69    71   |             70    91   |                 |
                   |               75         |             80         |                 |
                   |               75         |             70         |                 |
                   |               72         |             90         |                 |
                   |               81         |             75         |                 |

sunny     2/9  3/5 | mean          73   74,6  | mean      79,1  86,2   | false  6/9  2/5 |      9/14  5/14
overcast  4/9  0/5 | std dev      6,2    7,9  | std dev   10,2   9,7   | true   3/9  3/5 |
rainy     3/9  2/5 |                          |                        |                 |
Numerical Attributes
The values of the nominal attributes are represented through occurrence rates.
The numerical attributes are represented through the stochastic moments of the distribution, i.e. the mean and the standard deviation. From these values the related occurrence rates can be calculated as the integral of the probability density around the value.
This requires that an integration interval e be assumed.
The occurrence rate, or probability of occurrence, of a continuously distributed attribute is (see lectures by Prof. Herz):

  P[E: x ≤ X ≤ x+e] = ∫_x^{x+e} f(x) dx

Simplified, for small e we can write:

  P[E] ≈ f(x)·e
Numerical Attributes
If we have only continuously distributed numerical attributes, e does not influence the result, because it appears in every term of the ratio and cancels out (e/e = 1).
However, when we have mixed continuous and discrete attributes, e does influence the result and we have to choose an appropriate value. A reasonable value is the mean interval length of the discrete attribute. If no information at all is available, e = 1 is assumed, which is a very arbitrary assumption.
Example:
The values for temperature and humidity in the aforementioned table are assumed to be normally distributed.
The probability density for the event (temperature=66 | yes) is

  f(x) = 1/(√(2π)·σ) · exp( −(x−μ)² / (2σ²) )

  f(temperature=66 | yes) = 1/(√(2π)·6.2) · exp( −(66−73)² / (2·6.2²) ) = 0.0340

  P[E] ≈ f(x)·e = 0.0340 · 1.0 [degree]
Numerical Attributes
In the same way we also get the probability densities
f(humidity=90 | yes) = 0.0221
f(temperature=66 | no) = 0.0291
f(humidity=90 | no) = 0.0380
Again we have assumed e = 1.0 for each continuous value.
Therewith we get the conditional probabilities of the hypotheses h1 and h2:

  P(h1|D) = (2/9 · 0.0340 · 0.0221 · 3/9 · 9/14) /
            (2/9 · 0.0340 · 0.0221 · 3/9 · 9/14 + 3/5 · 0.0291 · 0.0380 · 3/5 · 5/14) = 0.209 = 20.9%

  P(h2|D) = (3/5 · 0.0291 · 0.0380 · 3/5 · 5/14) /
            (2/9 · 0.0340 · 0.0221 · 3/9 · 9/14 + 3/5 · 0.0291 · 0.0380 · 3/5 · 5/14) = 0.791 = 79.1%

Remember: for each discrete value we have implicitly assumed an interval e.
Numerical Attributes
The following observations are given for temperature:

temperature  64  65  68  69  70  71  72  72  75  75  80  81  83  85
play         yes no  yes yes yes no  no  yes yes yes no  yes yes no

Problem
- Using every different value as a new class would be overkill
- There can be observations with the same value but two different consequences (72)

Solution
- Define classes for those intervals that lead to the same consequence. The corresponding interval limits are 64.5, 66.5, 70.5, 72, 77.5, 80.5, 84.
- Subsume neighbouring classes into a superclass and take the majority as the consequence. This may be further subsumed into only 2 classes, namely:
  if temperature ≤ 77.5 then play = yes
  if temperature > 77.5 then play = no
- If no knowledge at all is available about the system of the data, equidistant intervals are also a good and justified approach.
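The final two-class rule can be checked against the observations; the error count below follows from applying the 77.5 split to the table above:

```python
temps  = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
labels = ["yes","no","yes","yes","yes","no","no","yes","yes","yes","no","yes","yes","no"]

def rule(t, split=77.5):
    """Two-class discretisation from the slide: play = yes below the split."""
    return "yes" if t <= split else "no"

errors = sum(rule(t) != l for t, l in zip(temps, labels))
print(errors)   # 5 misclassified observations under the 77.5 split
```

Five of the fourteen observations contradict the rule – the price paid for subsuming many small intervals into two classes.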
How to Build Decision Trees
We have 1 start tree and 4 partial trees, one for each attribute. In order to decide which attribute is the most important one, we need a measure for the decision.

[Figure: the start tree (play: 9 yes / 5 no, info = 0.940 bits) and four partial trees, one each for outlook, temperature, humidity and windy. For outlook: sunny (2 yes / 3 no, info = 0.971 bits), overcast (4 yes / 0 no, info = 0.0 bits), rainy (3 yes / 2 no, info = 0.971 bits); weighted info = 0.693 bits.]
Measure for the Information Value
The decision measure should express the information value (worthiness).
Requirements:
1. info value = 0 if one of the consequence values (yes, no) has frequency zero
2. info value = max if the frequencies of all consequence values are equal
3. if more than 2 classes are present, an arbitrary sequential calculation should be possible, namely:
   info([a,b,c]) = info([a, b+c]) + (b+c)/(a+b+c) · info([b,c])
Basic Measure: Info / Entropy
These requirements are fulfilled by one measure, the entropy, which is defined as

  entropy(p_1, p_2, ..., p_n) = − Σ_{i=1..n} p_i · log(p_i)    with the condition Σ_{i=1..n} p_i = 1

  entropy(p, q, r) = entropy(p, q+r) + (q+r) · entropy( q/(q+r), r/(q+r) )

and

  info([a,b,c]) = entropy(a/S, b/S, c/S)    with S = sum(a,b,c)

The units of info are [bits] (logarithm to base 2).
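The entropy measure can be sketched in a few lines; the test values reproduce the weather-data figures used on the following slides:

```python
import math

def entropy(counts):
    """info([a,b,...]) in bits: -sum p_i * log2(p_i) over the class frequencies."""
    s = sum(counts)
    return -sum(c / s * math.log2(c / s) for c in counts if c > 0)

print(round(entropy([9, 5]), 3))   # ≈ 0.940 bits
print(round(entropy([4, 0]), 3))   # 0.0 bits (requirement 1)
print(round(entropy([2, 3]), 3))   # ≈ 0.971 bits
```

Zero-frequency classes are skipped, which corresponds to the convention 0·log 0 = 0.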
Example
Direct calculation:

  info([2,3,4]) = entropy(2/9, 3/9, 4/9)
               = −2/9·log(2/9) − 3/9·log(3/9) − 4/9·log(4/9)
               = (−2·log 2 − 3·log 3 − 4·log 4 + 9·log 9) / 9

Sequential calculation:

  entropy(p, q, r) = entropy(p, q+r) + (q+r) · entropy( q/(q+r), r/(q+r) )
  info([a,b,c]) = info([a, b+c]) + (b+c)/(a+b+c) · info([b,c])
  info([2,3,4]) = info([2,7]) + 7/9 · info([3,4])
               = entropy(2/9, 7/9) + 7/9 · entropy(3/7, 4/7)
  (base unit = 9) + transformation factor × (base unit = 7)
Further Measures: Info Gain and Info Value sum
gain = info value before − info value after distribution into classes

info sum:

  info([a,b],[c,d],[e,f]) = (a+b)/S · info([a,b]) + (c+d)/S · info([c,d]) + (e+f)/S · info([e,f]),  S = a+b+c+d+e+f

Example:

  gain(outlook) = info(outlook) − info([2,3],[4,0],[3,2])
  info(outlook) = info([9,5]) = 0,940 bits   (9 yes, 5 no)
  info(sunny)    = info([2,3]) = 0,971 bits
  info(overcast) = info([4,0]) = 0,0 bits
  info(rainy)    = info([3,2]) = 0,971 bits
  info([2,3],[4,0],[3,2]) = 5/14·0,971 + 4/14·0,0 + 5/14·0,971 = 0,693 bits
  gain(outlook) = 0,940 − 0,693 = 0,247 bits
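The gain computation above can be reproduced in a few lines:

```python
import math

def info(counts):
    """Entropy of a class-frequency list, in bits."""
    s = sum(counts)
    return -sum(c / s * math.log2(c / s) for c in counts if c > 0)

def gain(before, partition):
    """Info gain = info before split - weighted info of the partition."""
    n = sum(before)
    after = sum(sum(p) / n * info(p) for p in partition)
    return info(before) - after

g = gain([9, 5], [[2, 3], [4, 0], [3, 2]])   # outlook: sunny, overcast, rainy
print(round(g, 3))                           # 0.247
```

The same call with the partitions for temperature, humidity and windy would confirm that outlook has the highest gain and therefore comes first.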
Procedure for Building Decision Trees
With the decision measures
- info value
- info value sum
- info gain
we are able to build an optimal decision tree.
Optimal: the attribute with the highest gain comes first.
Cost: we have to evaluate the gain for each attribute.
First level: choose the attribute with the highest gain.
Second and consecutive levels: for each value of the attribute on the first level the same procedure must be repeated, because
- each value has another consequence range
- each value may receive another second attribute
Repeat this for all consecutive levels until all values show only 1 consequence or all attributes are used.
Decision measures give us a criterion to order the tree under the assumption of independence of attributes.
Optimized Decision Tree for the Weather Data
outlook
├─ sunny → humidity
│   ├─ high → no
│   └─ normal → yes
├─ overcast → yes
└─ rainy → windy
    ├─ false → yes
    └─ true → no
Pruning
Pruning is the method for deciding whether a branch in a tree is worthwhile or not.
- Pre-pruning: decide whether to refine the actual node by a sub-tree or to stop refinement (applied on the training set)
- Post-pruning or backward pruning: reduce an already established sub-tree to its root node and hence simplify the tree – or refine it further (should be applied on the evaluation set, but often the test set is used)
Post-pruning methods are more powerful than pre-pruning methods.
Pruning should be done on an independent data set, i.e. neither on the training nor on the test set. Hence we should have a 3rd data set, namely the pruning set; more generally it is named the evaluation set. It is used to optimize the rules obtained from the training set. From the test set only the quality of the final rules should be determined.
If pruning is done on the training set, a statistical bias will be the result.
Post Pruning
Example: "start phase"

[Figure: sub-tree replacement – starting from a tree that splits on a1 (≤2.5 / >2.5), then a2 (≤36 / >36) and a3 (≤10 / >10), with a4 (none / half / full) and a5 (≤4 / >4) below: step 1 deletes a4 (its sub-tree is replaced by a leaf N), step 2 deletes a2.
Sub-tree raising – step 1 deletes a2 and raises a4 (none → V1, half → V2, full → V3) with a5 (≤4 → N, >4 → Y) below.
Remark: the consequences V1, V2, V3 are not the same as in the 'start phase'. V1, V2, V3 now represent the consequences when a4 comes before a2.]
Measure for pruning decision: Confidence
The confidence of an estimated (calculated) value can be expressed by a confidence interval.
An estimated value, like the error rate or the success rate, can deviate towards + or − values. Hence for both rates this tends to a symmetrical pdf, namely the well-known Normal distribution.
The estimated rate is the expected rate and hence the mean value of the random variable 'rate'.
What is the related standard deviation of the rate?
The basic pdf is the Bernoulli distribution, because we have a binary value, namely success or failure (see lectures of Prof. Herz).
The standard deviation of the Bernoulli-distributed rate is

σ_B = √( q(1 − q) / N )
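As a minimal sketch of this formula (the function name is assumed):

```python
import math

def rate_std(q, N):
    """Standard deviation of a rate estimated from N Bernoulli trials:
    sigma_B = sqrt(q * (1 - q) / N)."""
    return math.sqrt(q * (1 - q) / N)

print(rate_std(0.25, 100))  # ~0.0433
```

The spread shrinks with 1/√N, which is the basis of the confidence argument on the next slide.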
Confidence
From the standard deviation we see that the number N of the training/test
set shapes the confidence. If N → ∞ then σ → 0, and hence the
estimated rate is equal to the true rate.
For finite N we use the Normal distribution to express the possible fluctuation of
the estimated rate, and the related confidence range is limited on both sides.
Hence the two-sided confidence interval expresses the confidence of the estimated rate:

Pr[−z ≤ X ≤ z] = c

Tables of the confidence values are given for the standardized Normal distribution N(0, 1). Therefore we can express the confidence by normalisation through

Pr[ −z ≤ (f − q) / √( q(1 − q) / N ) ≤ z ] = c

N = number of observations
E = number of wrong classifications
f = observed error rate = E/N
q = true (unknown) error rate
c = confidence in %
z = standardized upper confidence limit
Confidence
The two-sided confidence interval can be computed from the one-sided confidence interval.
The one-sided confidence interval is bounded by the limit z for which
Pr[X ≤ −z] = c_left or Pr[X ≥ z] = c_right
X = random variable
c = confidence in %
z = standardized upper/lower confidence limit
For symmetric distributions with μ = 0: Pr[X ≥ z] = Pr[X ≤ −z]
For symmetrically distributed random variables with mean value μ = 0, the probability that the realisation of a random variable X lies inside the two-sided confidence interval is
Pr[−z ≤ X ≤ z] = 1 − 2c
Confidence Intervals for the Standard Normal Distribution

Pr[X ≥ z]     z    |  Pr[X ≥ z]     z
  0.00%       ∞    |    1.00%     2.33
  0.05%     3.29   |    2.00%     2.05
  0.10%     3.09   |    3.00%     1.88
  0.15%     2.97   |    4.00%     1.75
  0.20%     2.88   |    5.00%     1.64
  0.25%     2.81   |    6.00%     1.55
  0.30%     2.75   |    7.00%     1.48
  0.35%     2.70   |    8.00%     1.41
  0.40%     2.65   |    9.00%     1.34
  0.45%     2.61   |   10.00%     1.28
  0.50%     2.58   |   12.00%     1.17
  0.55%     2.54   |   14.00%     1.08
  0.60%     2.51   |   16.00%     0.99
  0.65%     2.48   |   18.00%     0.92
  0.70%     2.46   |   20.00%     0.84
  0.75%     2.43   |   25.00%     0.67
  0.80%     2.41   |   30.00%     0.52
  0.85%     2.39   |   35.00%     0.39
  0.90%     2.37   |   40.00%     0.25
  0.95%     2.35   |   45.00%     0.13
                   |   50.00%     0.00

[Figure: plot of z versus Pr[X ≥ z] from 0% to 50%.]
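The tabulated limits can be reproduced with the inverse normal CDF from the Python standard library (a sketch; `statistics.NormalDist` exists since Python 3.8):

```python
from statistics import NormalDist

def z_upper(tail_prob):
    """Standardized limit z with Pr[X >= z] = tail_prob for N(0, 1)."""
    return NormalDist().inv_cdf(1 - tail_prob)

for p in (0.05, 0.10, 0.25, 0.50):
    print(f"Pr[X>=z] = {p:.0%}  ->  z = {z_upper(p):.2f}")
# 5% -> 1.64, 10% -> 1.28, 25% -> 0.67, 50% -> 0.00, matching the table
```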
Measure for Pruning
If the measure for pruning is based on the training set, the one-sided confidence
interval is chosen in order to respect the statistical bias. It is an empirical approach,
but the results have proven its validity.

N = number of observations
E = number of wrong classifications
f = observed error rate = E/N
q = true (but unknown) error rate (e)
c = (given/chosen) confidence in %
z = standardized upper confidence limit

Pr[ (f − q) / √( q(1 − q) / N ) ≤ z ] = c
Estimated error rate e (= q):

e = [ f + z²/(2N) + z · √( f/N − f²/N + z²/(4N²) ) ] / [ 1 + z²/N ]
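The estimate can be sketched directly from this formula (function name assumed; z = 0.69 is the lecture's default for c = 25%):

```python
import math

def estimated_error(f, N, z=0.69):
    """Upper-confidence error estimate e from the observed error rate f = E/N."""
    num = f + z * z / (2 * N) + z * math.sqrt(f / N - f * f / N + z * z / (4 * N * N))
    return num / (1 + z * z / N)

print(round(estimated_error(2 / 6, 6), 2))  # 0.47, as in the example slide
print(round(estimated_error(1 / 2, 2), 2))  # 0.72
```

Note that e is always larger than f: the fewer observations a node has, the more pessimistic the estimate becomes.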
Usually a confidence c = 25% is assumed, to which
the standardized upper confidence limit z = 0.69
belongs.
Remark:
The numbers are taken from the training set;
hence the more conservative one-sided confidence limit
has to be chosen, and not the two-sided one,
which would otherwise be statistically correct.
Measure for Pruning
The measure for pruning at a node with more than one value is the mean of the
individual measures, weighted by the number of observations:

error estimate(e₁, …, e_k) = (1/N) · Σ_{i=1..k} n_i · e_i

n_i = number of observations of value i from the training set
k = number of values at the node
N = Σ n_i = total number of observations at the node
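A sketch of this weighted mean (names assumed), checked against the numbers of the following example:

```python
def combined_error(errors, counts):
    """Weighted mean error estimate over the k values at a node:
    (1/N) * sum(n_i * e_i), with N = sum(n_i)."""
    return sum(n * e for n, e in zip(counts, errors)) / sum(counts)

print(round(combined_error([0.47, 0.72, 0.47], [6, 2, 6]), 2))  # 0.51
```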
Example

[Figure: node a2 with branches ≤36 (1N, 1Y) and >36 leading to node a4 with branches none (4N, 2Y), half (1N, 1Y) and full (4N, 2Y).]

f_none = 2/6 = 0.33  →  e_none = 0.47
f_half = 1/2 = 0.50  →  e_half = 0.72
f_full = 2/6 = 0.33  →  e_full = 0.47

e_{none,half,full} = 1/(6 + 2 + 6) · (6 · 0.47 + 2 · 0.72 + 6 · 0.47) = 0.51

f_a4 = 5/14 = 0.36  →  e_a4 = 0.46

Result: the error estimate of the 3 values (0.51) is higher than the error estimate of node a4 (0.46).
Decision: the subtree of a4 will be replaced; the value of a4 is 'N' with the numbers (9N, 5Y).

Repeat the procedure for a2:

e_{≤36} = 0.72, e_{>36} = 0.46  →  e_{≤36,>36} = 1/(2 + 14) · (2 · 0.72 + 14 · 0.46) = 0.49

f_a2 = 6/16 = 0.375  →  e_a2 = 0.48  →  subtree of a2 is replaced by value N
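The replacement decision for a4 can be replayed end to end. This sketch recombines the error-estimate and weighted-mean formulas from the preceding slides with the counts of the example (z = 0.69 assumed):

```python
import math

def estimated_error(f, N, z=0.69):
    # upper-confidence error estimate, as defined on the earlier slide
    num = f + z * z / (2 * N) + z * math.sqrt(f / N - f * f / N + z * z / (4 * N * N))
    return num / (1 + z * z / N)

# counts (wrong, total) at the three values of a4: none, half, full
leaves = [(2, 6), (1, 2), (2, 6)]
N = sum(n for _, n in leaves)
subtree_e = sum(n * estimated_error(e / n, n) for e, n in leaves) / N
node_e = estimated_error(5 / 14, 14)  # a4 collapsed to one leaf: 5 wrong of 14

# the subtree estimate (~0.51) exceeds the node estimate, so a4 is replaced
print(round(subtree_e, 2), subtree_e > node_e)  # 0.51 True
```

The unrounded node estimate comes out slightly below the slide's 0.46, which is presumably due to intermediate rounding in the lecture; the decision is the same.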