Technische Universität Dresden, Faculty of Civil Engineering
Institute of Construction Informatics, Prof. Dr.-Ing. Raimar J. Scherer

Information Mining
Dresden, 04.05.2005

Quality of the Data
Data (values) are
- semantical: nominal, not ranked
- ordinal: ranked
- numerical (ranked): discrete (interval, ratio) or continuous, i.e. analog (interval, ratio)
Try to transfer between these types.

Quality of Attributes
Quality of attributes = semantical importance (weight)
- usually not given
- implicitly assumed: each attribute is equally important (e.g. weighting factor = 1.0)
- better: explicit transfer into numeric weights
Example: project aim (cost, duration, reputation)
Implicit: project aim = 1.0 x cost + 1.0 x duration + 1.0 x reputation
Explicit: e.g. project aim = 2.0 x cost + 1.0 x duration + 1.5 x reputation

Data Mining
Data Mining = procedure of machine learning methods
1) Identification of patterns (principles)
2) Deduction of structures (rules, models)
3) Forecasting of behaviour (application of the model)

[Figure: example of a pattern and of a structure]

Data = observation, measuring
Information ≙ Pattern = description and recognition of the measurements
Knowledge ≙ Structures / Models = generalised theory by which the observations are explained
Wisdom ≙ Forecasting = using the theory, the information and the data to simulate and forecast scenarios that were not observed

Terminology
Data = recorded facts
Information = set of patterns or expectations
Knowledge = accumulation of sets of expectations
Wisdom = usefulness, related to the knowledge

Plato's Cave Analogy
The problem: we can never see (record) the whole of reality, only an incomplete mapping of it.
[Figure: dancing people and the shadow of the dancing people]
The observer can only see shadows and has to interpret what the original "thing" / "meaning" is.

Data Structure for Formalization of Information and Knowledge 1
Object = thing with a certain meaning (given by its name) and a certain appearance (given by its attributes, i.e. by the data of the attributes)
A thing can be
- a real object, e.g. a window
- a behaviour, e.g. opened / closed, transparent / clear, aging
- a behaviour due to the interaction of several things, e.g. the window is opening and closing due to the wind, or the window is aging due to rain, wind, sun and (good/bad) operation by humans

What can we observe?
1) Object: geometric form, colour, material, positions
2) Relationship: location (in the wall), topology (to the ground)
3) Behaviour: stress distribution, deflection, vibration, aging, and so on
Each is described by one or more attributes.
Each attribute is expressed by a datum (value) from a set of data (values).
Some or each attribute can be modelled as a (sub-)object.

Closed World
If we know that we are describing / observing windows, we can evaluate the attributes of the schema (concept) window and determine which kind of window the particular one is, i.e.
we classify the particular window into one of the several classes represented by the values of the attributes.
This means
1) We already know what a window is and we are evaluating the observed data with respect to windows
2) We already know the (possible) sets of the attributes
3) We already know the (possible) classes constituted by the values of the attributes
Hence we have a closed (predetermined) world and therefore we can do straightforward classification.

Open World
If we do not know what we observe (e.g. image analysis) but we have recorded a lot of data (taken a lot of photos, where each photo consists of many pixels), we can nevertheless identify windows, but also doors, gates, etc. instead of windows (!), when we extend our procedure by two steps, namely
1) Analyse the sets of data to find similarities / dissimilarities between the sets, by partitioning each set of data into subsets and comparing the subsets. This is called identification / analysis of patterns.
2) Generalise the patterns and find an objective structure (theory) which explains the patterns, i.e. synthesize the result of the patterns. This is called building a concept. A concept can be the schema of an object with its attributes and with the value range of each attribute (in an ideal way). A concept is a schema of an object and hence a class structure.
3) Classify further observations (as explained in the beginning) in order to
a) identify the particular object, if the "thing" in question is an object
b) forecast the object behaviour, if the "thing" in question is the behaviour
c) identify the relationship between the objects, if there is more than one

Hierarchy of Methods
- Knowledge Management
- Information Mining
- Data Mining
- Machine Learning
- Data Analysis
- Signal Processing
- Statistics
- Data Collection
- Sensors (-systems)
- Design of observation

Data Collection => Fact Table
1) Fact table (or records)
Example: relation (behaviour) weather-play

Weather data
outlook   temperature  humidity  windy  play
sunny     hot          high      false  no
sunny     hot          high      true   no
overcast  hot          high      false  yes
rainy     mild         high      false  yes
rainy     cool         normal    false  yes
rainy     cool         normal    true   no
overcast  cool         normal    true   yes
sunny     mild         high      false  no
sunny     cool         normal    false  yes
rainy     mild         normal    false  yes
sunny     mild         normal    true   yes
overcast  mild         high      true   yes
overcast  hot          normal    false  yes
rainy     mild         high      true   no

Knowledge Representation
Knowledge is usually represented by rules. A rule has the form
• Premises (if)
• Conclusion (then)
The 4 main forms to represent (the rules which contain) the knowledge are:
1) Decision Tables
2) Decision Trees
3) Classification Rules
4) Association Rules

Knowledge Representation - Decision Tables
1) Decision Tables (look-up tables)
A decision table looks like a fact table.
The only difference is that:
- Each row is interpreted as one rule
- The attributes of a row are combined with AND
[Table: weather data, identical to the 14-row fact table of the Data Collection slide]

Decision Tables
In decision tables all possible combinations of values of all attributes have to be represented explicitly (ideal case):
$$ n_{\mathrm{combi}} = \prod_{i=1}^{m} n_{a_i} $$
where $m$ is the number of attributes and $n_{a_i}$ is the number of values of attribute $a_i$.
For the given example of the relation "weather-play", which has m = 4 attributes (outlook, temperature, humidity, windy) plus the decision play, this means that there exist 3 x 3 x 2 x 2 = 36 combinations.
For a new set of attribute values we only have to look it up in the table, i.e. find the row which shows a 100% match with the given set, and we can read off the result, namely play = "yes/no".
This is the ideal case.

Objectives of Decision Making
Usually we do not know all combinations. For real problems there can be several 1000!
Therefore we reduce the possible number of combinations to the most important ones.
If we do this only by deleting rows in the decision table, we end up with information / knowledge gaps, and we would have partitioned our world into a decidable part and an undecidable part. The latter would be called "stupid". This is not what we want to have.
In addition we are usually never able to observe all possible cases, and hence we would have natural gaps.
Our objective is always to end up with a decision, whether correct or false, and never to abstain (unless explicitly allowed). Of course, we want to avoid or at least minimize false decisions.

Generalisation
Therefore we have to generalize the remaining rows in such a way that they cover all the decisions of the deleted and unknown (not observed) rows, while ending up with as few wrong decisions as possible.
If we make the generalisation without allowing wrong decisions for any observed case, we would have an overdetermined problem, which may also contain some attributes or attribute combinations which are dependent, i.e. there are identical rules in the rule base. However
1) It is hard to find all or enough dependent combinations
2) To find the dependent combinations we would first have to set up the full decision table
3) Usually we want to reduce the ideal decision table much more than only by the dependent combinations
4) Usually we can never observe all possible cases, i.e. we always have natural gaps

Shortcomings of Generalization
Therefore we have to merge several rows into one row, which is possible. The simplest way is to neglect (the values of) one or more attributes. This is the simplest way of generalization (remark: it is the only way of generalization in relational databases).
Say we keep only outlook; the decision table reduces to

outlook   play
sunny     no
rainy     yes
overcast  yes

As a consequence, we make some wrong decisions. But we fulfil the first and main objective, namely that we are always able to make a decision.
For our example, this would lead, for the 14 given combinations (i.e. our known world), to 2 + 2 + 0 = 4 wrong decisions.
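The reduction to the single attribute outlook can be checked with a short sketch. It uses the 14 rows of the weather fact table; the names `FACTS`, `rules`, `wrong` and `gap` are illustrative, and choosing the majority decision per outlook value is an assumption that happens to reproduce the three rules given on the slide.

```python
from collections import Counter, defaultdict
from math import prod

# Weather fact table from the slides: outlook, temperature, humidity, windy, play
ROWS = """
sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no
"""
FACTS = [tuple(line.split()) for line in ROWS.strip().splitlines()]

# Size of the ideal decision table: product of the number of values per attribute
n_values = [len({row[i] for row in FACTS}) for i in range(4)]
n_combi = prod(n_values)

# Observation gap: combinations that never occur in the fact table
gap = n_combi - len({row[:4] for row in FACTS})

# Generalise by keeping only outlook (assumption: majority decision per value)
votes = defaultdict(Counter)
for outlook, _, _, _, play in FACTS:
    votes[outlook][play] += 1
rules = {outlook: c.most_common(1)[0][0] for outlook, c in votes.items()}

# Wrong decisions of the reduced table on the known world
wrong = sum(rules[row[0]] != row[4] for row in FACTS)

print(n_combi, gap, rules, wrong)
```

Neglecting an attribute always yields a decision, at the price of the wrong decisions counted in `wrong`.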

Shortcomings of Generalization
In the given example we reduced the 36 possible combinations (rows), expressed by 36 rules like
if outlook = sunny and temperature = hot and humidity = high and windy = false then play = no
to 3 simple rules
if outlook = sunny then play = no
if outlook = rainy then play = yes
if outlook = overcast then play = yes

Range of Wrong Decisions
We know that for 4 out of 36 possible cases we would make a wrong decision, i.e. for about 11%.
However, we do not know how many further wrong decisions we will make, namely 0 or 22 or something in between, because we only know that we have an observation gap of 22 cases.
This statement is based on the assumption that we described our problem (UoD) completely by 4 attributes. However, if we take into consideration that the UoD may be biased, and say it is governed by 5 attributes, i.e. by 1 additional attribute we do not know, then we would have an unknown range of 36 x (number of values the unknown attribute can take).
Remark: A hint for an unknown attribute is given if there are two rows in the decision table with identical values but two different decisions (play = yes / play = no).

Reliability of Knowledge
We can now apply naive statistics in order to estimate the number of further wrong decisions. Namely, when we assume that our known world, represented by 14 rules,
a) is a representative part of the whole world, i.e. the sample is representative for the Universe of Discourse (UoD)
b) the rules are unbiased, i.e. all known rules are error free
c) all attributes are known, i.e. the UoD is unbiased
then we can estimate that about 10 out of 36 decisions would be wrong.
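The arithmetic behind this estimate can be written out as follows. The symbols $\hat{p}_{\mathrm{wrong}}$ and $\hat{n}_{\mathrm{wrong}}$ are introduced here only for illustration; the numbers 4, 14 and 36 are those of the weather example.

```latex
\[
\hat{p}_{\mathrm{wrong}} = \frac{4}{14} \approx 0.29,
\qquad
\hat{n}_{\mathrm{wrong}} = \hat{p}_{\mathrm{wrong}} \cdot 36 = \frac{4 \cdot 36}{14} \approx 10.3 \approx 10 .
\]
```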
What we did can be explained by statistical theory, namely we evaluated the mean value of wrong decisions in our known world, assumed that this mean value is the true value of the total world (UoD), and forecast the number of wrong decisions using the mean value. Note: We do not consider any uncertainty here.

Decision Trees
We have seen from the decision tables that each value set of the attributes, i.e. each row in the table, can be expressed as one rule with simple semantics, namely
if {all attributes are true, i.e. show a certain value} then {classify b}
$$ \text{if } \Big\{ \bigwedge_i \, a_i = v_j \Big\} \text{ then } b_l = v_k $$
This means we have used a sequential system for our rule system. However, it is well known that parallel systems may also be possible. There the status of only one attribute is evaluated, i.e. is checked against all possible values, before the next attribute is considered in a separate step. Applying this to a rule system we come up with nested rules. The graphical representation of nested rules is a tree structure, and we call this new representation a decision tree.

General Structure of Decision Trees
In general terms a decision tree can be expressed as
if {state a1} := {a1 = v1} then if {state a2} ...
                 ...
                 {a1 = vm} then if {state a2} ...
end if
In each branch vj this has to be repeated for the next attribute ai, and so on for all i = 1, N.

Ranking in a Decision Tree
If we know all combinations of the UoD and we want to express them all (which is our ideal goal in order to avoid wrong decisions), as we did for the decision table, the ranking of the attributes, i.e. which one is evaluated first, second, ..., does not matter at all.
Then we can apply the general formula in a straightforward way, using any arbitrary order. For convenience we can choose i = 1, 2, 3, ..., N and we will end up with a tree.
[Figure: complete decision tree with one layer per attribute a1, a2, a3, a4, ..., an and Y/N decisions at the leaves]

Normalisation to Binary Decision Tree
For reasons of convenience (memory, processing time, search time, etc.) the multi-branching tree is transformed into a binary tree, or is built up as a binary tree from the beginning.
This means that we have applied the following transformation rule to all non-binary branches:
if M > 2 then: if (state ai = vj) then (yes branch) else (no branch) ...
which means that at each layer we divide the value range into two halfspaces, namely the value currently considered and the other values not considered so far.
This results in a tree, shown explicitly only for a1 and a2.

Explosion of Layers through Binary Tree Representation
[Figure: binary tree that tests a1 = v1, a1 = v2, a1 = v3 in sequence (the last "no" branch is not possible), followed in every branch by a2 = v1, a2 = v2, then a3 = v1, a3 = v2, ..., a4 = v1, each with a yes and a no branch]

Shortcomings of Simple Explicit Binary Decision Trees
A property of binary decision trees is the replication of subtrees, which generates a new replication for every attribute value, namely
replication of subtrees = number of values - 1
and this repeats for each attribute in each subtree again and again.
As long as we want to (and can) express all combinations there would be no shortcoming, but
1) we do not want to consider all combinations but only the important ones, i.e. we generalize our explicit knowledge space.
2) we usually do not know all combinations, which can be interpreted as an uncontrolled generalization.
Both lead to the result that
1) the ranking of the attributes is important.
2) attributes are no longer sorted in layers but mixed to obtain an optimal structure of the tree.

Generalised Decision Tree
In decision trees the generalisation process is much more visible and hence controllable. Generalisation means e.g. deleting a subtree and substituting it with only one decision.
[Figure: decision tree over a1, a2, a3, a4, ..., an in which some subtrees have been replaced by a single decision]

Classification (Rules)
Representation of knowledge by classification rules means
if {ai and/or/not/etc. aj} then {bk}, with i, j = 1, N and k = 1, M
We have already used this representation when we explained the meaning of decision tables and decision trees. Hence decision tables and decision trees are only another ("visual") representation of a set of rules.
This is true with one exception: it is straightforward to transform decision trees and tables into classification rules, but when transforming classification rules into decision tables we have to expand all ORs into ANDs, because decision tables are look-up tables and therefore use only ANDs.
For decision trees a straightforward transformation is formally possible and correct, but the advantage of decision trees gets lost, which is an optimised arrangement, namely either
• the size of the tree is minimised or
• readability is maximised (e.g. one layer for each attribute) or
• a combination of both, e.g.
optimised for human understanding

Ranking Dependence
As long as we have all cases represented, the ranking of
• rows in decision tables
• attributes and values in decision trees
• classification rules in rule bases
does not influence the result. However, we
a) do not have all cases (observations)
b) want to reduce rows, branches and rules by generalisation
This results in a ranking dependency problem.
This also holds for classification rules, which may be overlooked, because at first glance a rule may be seen as self-standing and independent of the other knowledge, which is definitely not the case. Each rule is always embedded in its context, represented by the other rules and expressed by the ranking.
This means the solution is always path dependent! So ranking is already a part of the representation of the knowledge.

Association Rules 1
Association rules express the relationship between arbitrary attribute states
if {ai = state1} then {aj = state2}
for all i ≠ j and i = 1, N, j = 1, M, where ai, aj ∈ {A, B}
If we restricted all aj to be only elements of B, i.e.
if {ai = state1} then {bj = state2}
then we would have a classification rule.
Hence, classification rules are a subset of association rules.

Association Rules 2
With association rules we can combine any attribute state (attribute value) with any other attribute state or any grouping of attribute states. There is no limitation.
As a consequence, we allow dependencies (or redundancies) between the rules.
It would not be wise to express all or even many association rules, because we would produce an uncontrollable sub-space of the inherent knowledge with many redundant rules, i.e.
- some information is not expressed at all
- some information is expressed once
- some information is expressed several times.
Therefore we would lose our basic weighting criteria, namely
- that each rule is equally important
- that the importance of an attribute or an attribute value is the frequency of its appearance in the rules
- both may be generalized by adding a verifiable arbitrary weighting factor

Objectives of Association Rules
Association rules should only be applied as a shortcut in addition to a clearly specified minimum rule set without redundancies.
Such shortcuts are used for
- important relationships
- frequently appearing relationships
- simplified solutions
in order to considerably reduce the search time.

Examples of Association Rules
If temperature = cool
Then humidity = normal

If windy = false and play = no
Then outlook = sunny and humidity = high

If humidity = high and windy = false and play = no
Then outlook = sunny

All are correct expressions (correct "knowledge" expressed in a rule).

Coverage and Accuracy
Coverage (or strength or support) is the number of instances for which a rule predicts correctly.
Accuracy (or confidence) is the ratio of the instances the rule predicts correctly (consequences) to all instances it applies to (premise).
$$ \text{accuracy} = \frac{\text{correct consequences (coverage)}}{\text{correct premises (applications)}} $$
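The two definitions can be sketched in a few lines, evaluated on the weather fact table. The function name `coverage_accuracy` and the dictionary-based rule encoding are assumptions made for this illustration only.

```python
# Weather fact table from the slides
COLUMNS = ("outlook", "temperature", "humidity", "windy", "play")
ROWS = """
sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no
"""
FACTS = [dict(zip(COLUMNS, line.split())) for line in ROWS.strip().splitlines()]


def coverage_accuracy(premise, conclusion, facts):
    """Return (applications, coverage, accuracy) of the rule 'if premise then conclusion'."""
    # Instances the rule applies to: all premise conditions hold
    applies = [r for r in facts if all(r[k] == v for k, v in premise.items())]
    # Instances predicted correctly: the conclusion holds as well
    correct = [r for r in applies if all(r[k] == v for k, v in conclusion.items())]
    accuracy = len(correct) / len(applies) if applies else None
    return len(applies), len(correct), accuracy


print(coverage_accuracy({"temperature": "cool"}, {"humidity": "normal"}, FACTS))
print(coverage_accuracy({"outlook": "sunny"}, {"play": "yes"}, FACTS))
print(coverage_accuracy({"outlook": "sunny", "temperature": "mild"}, {"play": "yes"}, FACTS))
```

Because premise and conclusion are both plain attribute-value sets, the same function works for classification rules and for association rules.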

Coverage and Accuracy: Examples
if temperature = cool then humidity = normal
  applies = 4 (temperature = cool), coverage = 4 (humidity = normal | temperature = cool), accuracy = 1.0 (100%)
if outlook = sunny then play = yes
  applies = 5, coverage = 2, accuracy = 40%
if outlook = sunny and temperature = mild then play = yes
  applies = 2, coverage = 1, accuracy = 50%

Rules with Exceptions
With the possibility to formulate exceptions like
but not for {ai = state}
we are able to refine generally applicable rules in a very efficient way: we divide the value range of the attribute ai into two halfspaces, namely
1) the value = state (true) and
2) all the remaining values (false), which are excluded by specifying only one value.
This means we increase the coverage relative to the applications and reduce the applications as little as possible, hence we maximise accuracy.
This means that we can start with a simple and very general rule and sharpen it by adding exceptions.

Rules with Exceptions: Example
if outlook = rainy then play = yes
  applications = 5, coverage = 3, accuracy = 60%
when we add "and windy = not true":
  applications = 3, coverage = 3, accuracy = 100%
when we would instead add "and temperature = not cool":
  applications = 3, coverage = 2, accuracy = 67%

Rules with Relations
Propositional Rules
Up to now, we only evaluated each attribute separately, i.e. we compared the value of the attribute with a given value set. Such rules are called propositional rules, and they have the same power as the propositional calculus of logical reasoning.
Relational Rules
Sometimes it is convenient to compare two attributes, like
If {ai} is in some relation to {aj} then {bi}
This implies that ai and aj have the same unit or can be transformed into one and the same unit. The comparison can be any Boolean operation.
Example
If (height > length) then object = column

Rules for Numerical Values
All numerically valued attributes can be dealt with in the same manner as nominal values, by applying the halfspace principle, namely
• divide the range of values into two halfspaces
• the two halfspaces need not be symmetric or contain an equal number of values
• repeat this recursively until sufficiently small intervals remain
• the number of recursions is independent between branches
[Figure: recursive halving of the value range of a1]
This leads straightforwardly to a binary decision tree.

Equivalently, we can pre-divide the range of values into equally (a2) or arbitrarily (a1) sized intervals and directly show in which interval the observed attribute value fits. This is equivalent to a multi-branching tree.
[Figure: value ranges of a2 (equal intervals) and a1 (arbitrary intervals)]
The test of equality (=) is possible but not feasible, because, e.g. for real numbers (R), equality is an arbitrarily rare event.
Semantically and ordinally ranked data can be dealt with in the same way. There, the test of equivalence may be feasible because of the very limited number of values.

Instance-based Representation
In contrast to rule-based representation, where we test each observation for equality, namely
If {ai = state}, then ...
for instance-based representation we test the distance of the attribute value set t to a given state with n sets and evaluate the minimal distance.
Each attribute value set is a vector with as many components as attributes, e.g. one row of the fact table
$$ s_1 = [\text{sunny}, \text{hot}, \text{high}, \text{false}]^T $$

The state is now a given set of attribute value sets, e.g. 10 rows of the fact table
$$ \text{state} = [s_1 \dots s_n] $$
So we test a vector t against a vector set [s1...sn]
if {distance(t, si) = min} then {bt = bsi}
and we use the consequence of the closest vector for the prognosis (decision).
Remark: nominal values are usually transformed into numbers, e.g. true = 0, false = 1.

Comparison of Instance-based to Rule-based
Compared to the rule-based representation of numerically valued data problems, where we deal with fixed intervals, here we deal with non-fixed intervals, but only at first glance. In fact, we also have fixed boundaries, namely boundaries defined halfway between the state vectors.
[Figure: state vectors s1, s2, s3 with the boundaries halfway between them]
The only difference is that the intervals are arbitrary in size and that we do not explicitly define the intervals, but we define the centre (lines) of the intervals (classes).
Another advantage over the explicitly expressed interval procedure of the rule-based representation is that we can easily add an additional instance for better representing the knowledge space, i.e. for refinement of the space.

Requirements for Instance-based Representation
There are 3 requirements
1) We need a metric.
2) All attributes have to be representable in one and the same metric.
3) We need a distance metric, also called norm.
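The test of a vector t against the vector set [s1...sn] can be sketched as a nearest-instance search. The state consists of the first 10 rows of the weather fact table, following the example on the slide. The distance used here (each nominal attribute contributes 0 if equal and 1 if different) is an assumption for illustration, as are the names `distance` and `classify`; ties are resolved in favour of the first closest instance.

```python
# First 10 rows of the weather fact table: (attribute vector, decision)
STATE = [
    (("sunny", "hot", "high", "false"), "no"),
    (("sunny", "hot", "high", "true"), "no"),
    (("overcast", "hot", "high", "false"), "yes"),
    (("rainy", "mild", "high", "false"), "yes"),
    (("rainy", "cool", "normal", "false"), "yes"),
    (("rainy", "cool", "normal", "true"), "no"),
    (("overcast", "cool", "normal", "true"), "yes"),
    (("sunny", "mild", "high", "false"), "no"),
    (("sunny", "cool", "normal", "false"), "yes"),
    (("rainy", "mild", "normal", "false"), "yes"),
]


def distance(t, s):
    # Assumed metric for nominal values: count the attributes that differ
    return sum(a != b for a, b in zip(t, s))


def classify(t, state):
    # if {distance(t, si) = min} then {bt = bsi}
    _, decision = min(state, key=lambda instance: distance(t, instance[0]))
    return decision


print(classify(("sunny", "hot", "high", "false"), STATE))       # identical to s1
print(classify(("overcast", "cool", "normal", "false"), STATE))  # not in the state
```

Adding a further instance to `STATE` refines the partitioning without touching any existing rule, which is the advantage named in the comparison with the rule-based representation.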

Norm
A norm is defined as a mapping $\|\cdot\| : \mathbb{R}^P \to \mathbb{R}$ which fulfils the 3 requirements
$$ \|x\| = 0 \iff x = [0, \dots, 0]^T $$
$$ \|a x\| = |a| \, \|x\| \quad \forall a \in \mathbb{R},\ x \in \mathbb{R}^P $$
$$ \|x + y\| \le \|x\| + \|y\| \quad \forall x, y \in \mathbb{R}^P $$
The best-known norm is the Euclidean norm (= geometric distance in Euclidean space)
$$ \|d\| = \sqrt{\sum_{i=1}^{N} d_i^2} \quad \text{with } d = a - b,\ d_i = a_i - b_i $$

In general terms the Euclidean norm can be written
$$ \|d\| = \sqrt{d^T A d} \quad \text{with } A = \begin{bmatrix} 1 & 0 & 0 & \cdots \\ 0 & 1 & 0 & \cdots \\ 0 & 0 & 1 & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{bmatrix} $$
This can be generalised to the diagonal norm with
$$ A = \begin{bmatrix} a_1 & 0 & 0 & \cdots \\ 0 & a_2 & 0 & \cdots \\ 0 & 0 & a_3 & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{bmatrix} $$
where the $a_i$ are arbitrary values, which can be interpreted as weighting factors (one for each component of the instance vector).
We can now also consider off-diagonal values: we can include dependencies between attributes in our distance measure when we set the off-diagonal values to non-zero, i.e. we evaluate relationships between attributes (relational rules).

Minkowski Norm
The Minkowski norm is defined as
$$ \|d\|_q = \sqrt[q]{\sum_{i=1}^{N} |d_i|^q} $$
Choosing q = 2 we obtain the well-known Euclidean norm.

Minkowski Norm: Manhattan Distance
For convenience the well-known Manhattan distance or city block distance is often used, which is obtained for q = 1
$$ \|d\|_1 = \sum_{i=1}^{N} |d_i| $$
[Figure: path from point A to point B along the city blocks]
This means that instead of computing the straight-line distance (Euclidean norm) we are walking around each block in Manhattan, i.e. we are summing up Δx + Δy.
The deviation (error) from the geometrical distance can be seen immediately by comparing
$$ q = 2:\ \|d\|_2 = \sqrt{\sum_i d_i^2}, \qquad q = 1:\ \|d\|_1 = \sum_i |d_i| $$
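A minimal sketch of the Minkowski norm of the difference vector d = a - b, for arbitrary q. The function name `minkowski` and the sample points A and B are assumptions for illustration; the branch for infinite q returns the largest component, which is the limit of the formula for q → ∞.

```python
import math


def minkowski(a, b, q=2.0):
    """Minkowski norm of d = a - b; q = 1 is Manhattan, q = 2 is Euclidean."""
    d = [abs(x - y) for x, y in zip(a, b)]
    if math.isinf(q):
        # Limiting case q -> infinity: only the largest component counts
        return max(d)
    return sum(di ** q for di in d) ** (1.0 / q)


A = (0.0, 0.0)
B = (3.0, 4.0)
print(minkowski(A, B, 1))         # walking around the blocks: 3 + 4
print(minkowski(A, B, 2))         # straight-line distance
print(minkowski(A, B, math.inf))  # largest component only
```

For the sample points the Manhattan distance exceeds the Euclidean distance, which illustrates the deviation discussed on the slide.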

Minkowski Norm: max-Norm
For q → ∞ we obtain the supremum or max-norm
$$ \lim_{q \to \infty} \sqrt[q]{\sum_{i=1}^{N} |d_i|^q} = \max_{i=1,N} |d_i| $$
The qualitative difference between the norms is that we give different importance to large distances between components compared to short distances.
As we increase q we give more importance to large distances, and for q → ∞ we pay attention only to the largest distance.
The natural norm is the Euclidean norm.

Hyperbolic Norm
A very frequently used norm is the hyperbolic norm
$$ \|d\|_H = \prod_{i=1}^{N} |d_i| $$
However, this norm does not fulfil the mathematical definition of a norm (none of the 3 requirements).
If we add the N-th root it is the well-known geometric mean
$$ d_g = \sqrt[N]{\prod_{i=1}^{N} |d_i|} $$

Relative Norm
All these norms depend on the number of elements, which is correct for a distance. However, sometimes we want to have only the quality of the distance. Then we simply divide by the number of elements and come up with a generalised mean value
$$ \bar{d}_q = \sqrt[q]{\frac{1}{N} \sum_{i=1}^{N} |d_i|^q} $$

Explicit Partitioning of the Instance Space
[Figure a: partitioning of the instance space by the shortest-distance criterion]
The classification of attribute sets according to the shortest-distance criterion leads to the partitioning of the information (or instance) space given in Figure a: the boundaries of a class form a polygon (hyperplanes), where each polygon line is perpendicular to the shortest connection between two class vectors and passes through its midpoint.
This leads to boundaries that are very hard to express for each class.
Explicit Partitioning of the Instance Space (b) (c)

A more convenient simplification is to describe each class by rectangular boundaries, as shown in Figure b. The boundaries can then be described easily, which leads to simple rules that humans can grasp, i.e. we can reason about the boundaries and hence about the classes.

The rectangular box means that each attribute value of each class has a well-defined upper and lower limit; hence we have defined an explicit interval for each attribute.

The difference between instance-based and rule-based representation is that the intervals of instance-based representations are not of equal size for each class, whereas those of rule-based representations are. Remark: if we generalize rules, this may result in non-equal intervals.

Generalizing an instance-based representation can end up in nested partitions like the ones shown in Figure c. This is the typical case for a rule-based representation using exceptions, namely the rule for the outer box and the exception for the inner box.

Machine learning

All machine learning methods generate rules, i.e. they extract rules from the observed data (the fact data). Each machine learning method expresses a different relationship between the data, i.e. a different system.

If we choose a machine learning method that does not fit the inherent system of the data, we will receive a rule set which
(1) is complex
(2) often makes false predictions
but we will always receive a rule set!

Therefore we need a measure for the quality of the learned rule set in order to decide on the best, or most appropriate, learning method.
Machine learning

If we knew the underlying system in advance, we would model it either in mathematical or in logical expressions. If we know nothing about it, we have to use information mining and machine learning methods. If we know something about the system, we should first model the system with the appropriate expressions and then transform the data by the system before we apply information mining methods again. These are called hybrid methods.

Example: Limited observation range

Observation range: 0 to 30 m (with attributes x = 0, x = 10, x = 20, x = 30)
Prediction range: 0 to 100 m
[Figure: y [m] over x [m]]

Example: System roughly known

Draft system: y = ax + bx² (assumption of a polynomial of 2nd or higher order)
[Figure: y [m] over x [m]]

Example

Real system: oblique throw

$$y = x \tan\alpha - \frac{g}{2 v^2 \cos^2\alpha}\, x^2$$

with launch angle α (scatter 10 %) and launch velocity v = 50 m/s (scatter 10 %).

Example

Reducing the polynomial by one order (dividing y by x) results in a straight line. This indicates that the assumption of a 2nd-order polynomial seems to be correct.
[Figure: y/x [-] over x [m]]

Example

Now we also consider a similar example with y(x=0) ≠ 0.
[Figure: y [m] over x [m]]

Example

If y(x=0) ≠ 0, reducing the polynomial by one order does not lead to an improvement, as in the case shown before, but to a very bad result.
This illustrates that, in a hierarchical approach, the approaches (or learning methods) chosen first have an important impact on the overall result!
[Figure: y/x [-] over x [m]]

Example

Taking into account the knowledge that y(x=0) = 20, we can confirm the assumption that y = a + bx + cx².
[Figure: (y-20)/x [-] over x [m]]

Example

The following table shows the average of Y for each attribute X:

  X (attribute)   Average Y
       0.00         20.00
      10.00         37.68
      20.00         53.49
      30.00         67.43
      40.00         79.50
      50.00         89.69
      60.00         98.02
      70.00        104.48
      80.00        109.06
      90.00        111.78
     100.00        112.62

With the realistic assumption of a 2nd-order polynomial we obtain the parameters a = 20, b = 1.8615 and c = -0.0094.

If we compute the function y(x) = a + bx + cx² with these parameters and divide the result by the average of the Y data, we get a measure for the accuracy of our assumption.

  Attribute X   Data-based average YD   Model-based estimate YM   YM/YD
       0.00            20.00                    20.00             1.00
      10.00            37.68                    37.68             1.00
      20.00            53.49                    53.47             1.00
      30.00            67.43                    67.39             1.00
      40.00            79.50                    79.42             1.00
      50.00            89.69                    89.58             1.00
      60.00            98.02                    97.85             1.00
      70.00           104.48                   104.25             1.00
      80.00           109.06                   108.76             1.00
      90.00           111.78                   111.40             1.00
     100.00           112.62                   112.15             1.00

In this example we get YM/YD = 1 for all attributes. This is the case of an exact fit of the assumed function y(x). The error grows as YM/YD deviates further from 1.

Example

If we had supposed a linear dependence between x and y, namely y = a + bx, the parameters of the curve would be a = 34.029 and b = 0.9262. In this case YM/YD deviates considerably more from 1, and hence it is a rather bad estimate.
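The YM/YD check for both model assumptions can be reproduced directly from the tabulated averages and the stated parameters. The following is a small sketch; the function and variable names are chosen here for illustration only.

```python
# Tabulated attribute values X and data-based averages YD from the slides.
xs = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
yd = [20.00, 37.68, 53.49, 67.43, 79.50, 89.69,
      98.02, 104.48, 109.06, 111.78, 112.62]

def quadratic(x):
    """Assumed 2nd-order model y = a + b*x + c*x^2 with the stated parameters."""
    return 20.0 + 1.8615 * x - 0.0094 * x ** 2

def linear(x):
    """Assumed linear model y = a + b*x with the stated parameters."""
    return 34.029 + 0.9262 * x

def ratios(model):
    """Return YM/YD for every tabulated attribute value."""
    return [model(x) / y for x, y in zip(xs, yd)]

quad_ratios = ratios(quadratic)
lin_ratios = ratios(linear)

# Largest deviation of YM/YD from 1 for each model.
worst_quad = max(abs(r - 1.0) for r in quad_ratios)
worst_lin = max(abs(r - 1.0) for r in lin_ratios)
```

The quadratic model stays within about half a percent of the data over the whole range, whereas the linear model is off by roughly 70 % at x = 0.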
Comparison

  Attribute X   Data-based average YD   Model-based estimate YM   YM/YD
       0.00            20.00                    34.03             1.70
      10.00            37.68                    43.29             1.15
      20.00            53.49                    52.55             0.98
      30.00            67.43                    61.82             0.92
      40.00            79.50                    71.08             0.89
      50.00            89.69                    80.34             0.90
      60.00            98.02                    89.60             0.91
      70.00           104.48                    98.86             0.95
      80.00           109.06                   108.13             0.99
      90.00           111.78                   117.39             1.05
     100.00           112.62                   126.65             1.12

[Figure: YM/YD over X, with the regions "good estimation" (close to 1) and "bad estimation" marked]

One arbitrarily drawn sample

If we assumed independence between the observation points, we would receive a valid sample such as the one shown below. However, this is a very unrealistic curve for a throw. Therefore the dependence between observation points, which turns them into one observation set, is very important knowledge about the system. Neglecting it is a typical mistake in stochastic applications.
[Figure: y [m] over x [m]]

Training & Test Set

The more information we have (i.e. not only data but also so-called pre-information) and the more of it we model appropriately, the better the information mining method will work.

Nevertheless, we need an objective verification of the quality of the model (the pre-modelled system + the added learning system), i.e. we need a measure and a data set to which the measure is applied.

Therefore we have to divide the observed data set into two parts:
- training set
- test set

Usually the test set is chosen to be 50 % of the size of the training set, i.e. the observed data are divided 2:1.
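A 2:1 division of the observed data into training and test set can be sketched as follows. The random shuffling, the fixed seed and the placeholder observations are assumptions made for this illustration; the slides do not prescribe how the split is drawn.

```python
import random

def split_train_test(observations, test_fraction=1 / 3, seed=0):
    """Shuffle the observations and split them into (training set, test set)."""
    shuffled = list(observations)
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical observed data set of 30 observations.
observations = list(range(30))
train, test = split_train_test(observations)
```

With 30 observations this yields 20 training and 10 test observations, i.e. the test set is half the size of the training set.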
Training & Test Set

Our world is therefore divided into 3 parts:
- unknown part
- training set (observed part)
- test set (observed part)

The training and the test set should each be a representative set of the UoD = Universe of Discourse (the whole world we consider). Whether the requirement "representative" is fulfilled often cannot be verified, because we do not know the unknown world.

Distribution of Observations

(1) Extrapolation problem: we have no observations at all for a large part of the world. Machine learning methods are always interpolation methods, hence our predictions are strongly biased if the system behaves differently in the observed and the unobserved world.
(2) We have more observations in some parts and fewer in others, hence we may give more importance to the system behaviour in those parts where we have more observations.
(3) The distribution of observations in the test set is not similar to the distribution in the training set. Hence our quality measure is biased too, as a consequence of (2).

[Figure: number of observations over the whole 1D-world for training set and test set; panels "Reality" and "Ideal"]

Distribution of Observations

Case 1: the system behaviour is similar over the whole world, e.g. y = a + bx or y = a + bx + cx².
[Figure: y (= response) over x (= input), whole 1D-world]

Case 2: the system behaviour is not monotonic.
[Figure: y (= response) over x (= input), whole 1D-world]

The less monotonic the behaviour of the system is, the more (and denser) observations we would like to have.
Strategies for Machine Learning

(1) Apply different learning methods for different domains of the world.
    We need a measure of quality to decide on the best learning methods.
(2) Apply different learning methods in a hierarchical order.
    Principle: the simplest system (rule) is the best-fitting one.
    We need a measure of quality to decide on the best combination of learning methods.
(3) Divide the observed data set several times into different pairs of training and test sets, in order to optimize the probability that domains with non-monotonic system behaviour are well matched by the training and test sets.
(4) Use each additionally observed datum to update the learned system, i.e. always call your learned system into question.

Analogy: Curve fitting

For a 1D-world we have the following analogy:

  1 attribute:    constant model     y = c
  2 attributes:   linear model       y = a + bx
  3 attributes:   parabolic model    y = a + bx + cx²
  n attributes:   polynomial model   $y = \sum_{i=0}^{n-1} a_i x^i$

We can fit a polynomial with n coefficients when we have m observations with m ≥ n. At most, therefore, we can fit a polynomial with m coefficients. This is known as overfitting.

Structures in random data

Random data can have different structures. Most data sets can be assigned to one of the following structures:
1. One attribute contributes much; most other attributes are redundant or nearly meaningless.
2. All attributes contribute almost equally and independently.
3. Few attributes contribute, but in a dependent way, which can be expressed and represented by a decision tree (numerical: correlation function).
4. Few rules structure the data domain into distinct classes.
5. Subsets of attributes show interdependencies.
6. (Non)linearly dependent numerical attributes, where the weighted sum of the attributes describes the data structure.
7. Non-equal distances between the classes describe the data set best.

For each type a different learning algorithm fits best, and it can only be found by trial-and-error test procedures.

Learning Method: 1R

1R means "1-Rule". 1-Rule generates a one-level decision tree.

The simplest approach is to make a rule that tests only 1 attribute. (Remark: it is very attractive to make a rule in which many attributes are tested by using exception-rule methods.)

The question, however, is which attribute should be tested. Hence we need a measure. The simplest measure is to use the attribute that maximises right and consequently minimises wrong decisions = COVERAGE measure. This must also be done for each value of each attribute.

Learning Method: 1R

Procedure: generate all possible rules for all attributes and values from the training set,

$$n_{rules} = \sum_{i=1}^{n_a} n_{v,i} \sim n_a \cdot n_v$$

with $n_a$ = number of attributes and $n_{v,i}$ = number of values of attribute i, and select the rule set which has the maximum coverage.

Usually
1) "coverage" is used to identify the best-fitting value
2) "(applies - coverage) = min failure" is used to identify the strongest attribute

For the weather-play example the best attribute is outlook, with one rule per value (3 simple rules of PPT 1/20):

  Rule                                       Coverage   Error
  if outlook = sunny    then play = no           3         2
  if outlook = rainy    then play = yes          3         2
  if outlook = overcast then play = yes          4         0
  Sum                                           10         4

Missing Values

Missing is treated in the same way as every other attribute value. Hence, if the weather data have missing values for the attribute outlook, the resulting rule set will contain 4 possible values: one each for sunny, overcast and rainy, and one for missing.
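The 1R procedure can be sketched on the class counts of the weather data (the yes/no counts per attribute value as tabulated in these slides). The data layout and the function name `one_r` are assumptions for this illustration; a "missing" value would simply be one more key in the dictionary of an attribute.

```python
# Class counts per attribute value in the weather data: value -> (yes, no).
counts = {
    "outlook": {"sunny": (2, 3), "overcast": (4, 0), "rainy": (3, 2)},
    "temperature": {"hot": (2, 2), "mild": (4, 2), "cool": (3, 1)},
    "humidity": {"high": (3, 4), "normal": (6, 1)},
    "windy": {"false": (6, 2), "true": (3, 3)},
}

def one_r(counts):
    """Return (attribute, rules, coverage, errors) of the best one-attribute rule set."""
    best = None
    for attribute, values in counts.items():
        rules, coverage, errors = {}, 0, 0
        for value, (yes, no) in values.items():
            # Each value predicts its majority class (ties go to "yes" here).
            rules[value] = "yes" if yes >= no else "no"
            coverage += max(yes, no)
            errors += min(yes, no)
        # Keep the first attribute with the fewest errors.
        if best is None or errors < best[3]:
            best = (attribute, rules, coverage, errors)
    return best

attribute, rules, coverage, errors = one_r(counts)
```

On these counts outlook and humidity both produce 4 errors; the sketch keeps the first one found, which is outlook with coverage 10.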
The problem, however, is: what is the result (consequence) of missing? The assumption is that missing gets the same result as the most frequent result among the values of the attribute.

Statistical Modelling

Statistical modelling allows more than one attribute to be considered, and therefore decision making based on all attributes, if these attributes are of equal importance and independent of each other.

Of course this requirement is not realistic (real data sets are interesting precisely because the attributes are not of equal importance and are dependent), but this simplification leads to a simple method which works surprisingly well in practice.

Statistical Modelling

The table below summarises the weather data: for every value of each attribute, the number of occurrences for play=yes and play=no was counted. The same information is also given in the form of ratios of observed probabilities, i.e. conditional probabilities. Example: on 2 of the 9 days for which play=yes, outlook=sunny. For play, these rates are the days of yes/no divided by all observed days.

Weather data with frequencies and probabilities

  attribute     value      yes   no   P(value|yes)   P(value|no)
  outlook       sunny       2     3       2/9            3/5
  outlook       overcast    4     0       4/9            0/5
  outlook       rainy       3     2       3/9            2/5
  temperature   hot         2     2       2/9            2/5
  temperature   mild        4     2       4/9            2/5
  temperature   cool        3     1       3/9            1/5
  humidity      high        3     4       3/9            4/5
  humidity      normal      6     1       6/9            1/5
  windy         false       6     2       6/9            2/5
  windy         true        3     3       3/9            3/5
  play          (all)       9     5       9/14           5/14

Example

We want to predict play given that

  outlook   temp   humidity   windy   play
  sunny     cool   high       true    ?
We can compute (assuming independence) the probabilities

  P[E, play=yes] = 2/9 x 3/9 x 3/9 x 3/9 x 9/14 = 0.0053
  P[E, play=no]  = 3/5 x 1/5 x 4/5 x 3/5 x 5/14 = 0.0206
  P[E] = P[E, play=yes] + P[E, play=no] = 0.0259

From these we obtain the probability of play=yes or play=no given exactly the attribute values above (and no other values):

  P[play=yes|E] = 0.0053 / 0.0259 = 20.5 %
  P[play=no|E]  = 0.0206 / 0.0259 = 79.5 %

Statistical Model: Naive Bayes

This intuitive and simple method is based on Bayesian statistics, which says: if we have a reliable hypothesis H and a statistical analysis (= training set), and we obtain new observations, we can update the statistics, assuming that the new observations (even if there is only 1 observation = E!) are of the same importance as all historical observations.

$$P[H|E] = \frac{P[E|H]\, P[H]}{P[E]} \quad \text{with} \quad P[E|H] = \prod_{i=1}^{n} P[e_i|H]$$

assuming INDEPENDENCE between attributes.

For the above example E = (outlook=sunny, temp=cool, ...) we can now compute the probability of occurrence of play=yes:

$$P[\text{play=yes}|E] = \frac{2/9 \times 3/9 \times 3/9 \times 3/9 \times 9/14}{0.0259} = 20.5\,\%$$

Statistical Modelling: Naive Bayes

This procedure is based on Bayes' rule, which says: if you have a hypothesis $h_i$ and data D which bear on the hypothesis, then

$$P(h_i|D) = \frac{P(D|h_i)\, P(h_i)}{\sum_{j=1}^{n} P(D|h_j)\, P(h_j)}$$

P(h): probability of h
P(D|h): conditional probability of D given h
P(h|D): conditional probability of h given D
i = hypothesis i
n = number of hypotheses

Statistical Modelling - Example

Example: for a new day we want to forecast play.

  outlook = sunny
  temperature = cool
  humidity = high
  windy = true
  play = ?
The data are identified as follows:

  D1: outlook = sunny
  D2: temperature = cool
  D3: humidity = high
  D4: windy = true

We can consider two possible hypotheses:

  h1: play = yes
  h2: play = no

Statistical Modelling - Example

The conditional probability of the hypothesis h1: play = yes, given the data set D = {D1, D2, D3, D4}, can then be determined as follows (values taken from the weather data table with frequencies and probabilities):

$$P(h_1|D) = \frac{P(h_1) \prod_{k=1}^{4} P(D_k|h_1)}{P(h_1) \prod_{k=1}^{4} P(D_k|h_1) + P(h_2) \prod_{k=1}^{4} P(D_k|h_2)} = \frac{\frac{2}{9} \cdot \frac{3}{9} \cdot \frac{3}{9} \cdot \frac{3}{9} \cdot \frac{9}{14}}{\frac{2}{9} \cdot \frac{3}{9} \cdot \frac{3}{9} \cdot \frac{3}{9} \cdot \frac{9}{14} + \frac{3}{5} \cdot \frac{1}{5} \cdot \frac{4}{5} \cdot \frac{3}{5} \cdot \frac{5}{14}} = 0.205 = 20.5\,\%$$

Statistical Modelling - Example

The conditional probability of the hypothesis h2: play = no, given the data set D = {D1, D2, D3, D4}, is determined correspondingly:

$$P(h_2|D) = \frac{P(h_2) \prod_{k=1}^{4} P(D_k|h_2)}{P(h_1) \prod_{k=1}^{4} P(D_k|h_1) + P(h_2) \prod_{k=1}^{4} P(D_k|h_2)} = \frac{\frac{3}{5} \cdot \frac{1}{5} \cdot \frac{4}{5} \cdot \frac{3}{5} \cdot \frac{5}{14}}{\frac{2}{9} \cdot \frac{3}{9} \cdot \frac{3}{9} \cdot \frac{3}{9} \cdot \frac{9}{14} + \frac{3}{5} \cdot \frac{1}{5} \cdot \frac{4}{5} \cdot \frac{3}{5} \cdot \frac{5}{14}} = 0.795 = 79.5\,\%$$
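The two posterior probabilities can be reproduced with exact fractions. This is a minimal sketch: the dictionary layout and the function name `posterior` are illustrative assumptions; the numbers are the relative frequencies from the weather data.

```python
from fractions import Fraction as F

# Conditional probabilities P(D_k | h) read from the weather data table.
likelihood = {
    "yes": {"outlook=sunny": F(2, 9), "temperature=cool": F(3, 9),
            "humidity=high": F(3, 9), "windy=true": F(3, 9)},
    "no": {"outlook=sunny": F(3, 5), "temperature=cool": F(1, 5),
           "humidity=high": F(4, 5), "windy=true": F(3, 5)},
}
prior = {"yes": F(9, 14), "no": F(5, 14)}

def posterior(evidence):
    """Return P(h | D) for every hypothesis h, assuming independent attributes."""
    joint = {}
    for hypothesis, p in prior.items():
        for item in evidence:
            p *= likelihood[hypothesis][item]
        joint[hypothesis] = p
    total = sum(joint.values())          # denominator of Bayes' rule
    return {h: joint[h] / total for h in joint}

evidence = ["outlook=sunny", "temperature=cool", "humidity=high", "windy=true"]
post = posterior(evidence)
p_yes = float(post["yes"])
p_no = float(post["no"])
```

Rounded to three decimals, `p_yes` and `p_no` give the 0.205 and 0.795 of the worked example.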
Problems of Naive Bayes

Conditional probabilities can be estimated directly as relative frequencies:

$$P(a_i|b_j) = \frac{n_c}{n}$$

where n is the total number of training instances of class $b_j$, and $n_c$ is the number of instances with attribute value $a_i$ and class $b_j$.

Problem: this provides a poor estimate if $n_c$ is very small (low confidence).

Extreme case: if $n_c = 0$, the probability of a hypothesis is calculated to be zero for the attribute concerned, and hence the overall probability, obtained by multiplying the probabilities of all attributes, wrongly becomes zero as well.

This problem can be handled by Laplace estimation:

$$P(a_i|b_j) = \frac{n_c + \mu\, p}{n + \mu}$$

p = 1/k: a-priori probability (= prior estimate of the probability)
k: number of values that the attribute $a_i$ can take
μ: weighting factor, defines the influence of the a-priori probability

Redundant attributes

Assume 5 attributes a1...a5 that depend on the attribute "outlook". Redundant means 100 % dependency. If we nevertheless assumed independence, the probabilities for "play" would be:

  play = yes (outlook, a1...a5, temp, hum, windy, play)
    conditional: 2/9 x (2/9)^5 x 3/9 x 3/9 x 3/9 x 9/14 = 2.8673e-6
    absolute:    2.8673e-6 / 1.6025e-3 = 0.00179

  play = no (outlook, a1...a5, temp, hum, windy, play)
    conditional: 3/5 x (3/5)^5 x 1/5 x 4/5 x 3/5 x 5/14 = 1.5996e-3
    absolute:    1.5996e-3 / 1.6025e-3 = 0.99821

  Sum: conditional 1.6025e-3, absolute 1.00

This is wrong. The right result would be obtained by multiplying by (1)^5 instead of (2/9)^5 or (3/5)^5, because P(X|A) of fully dependent events is 1, i.e. the original result is not changed.
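The distortion caused by redundant attributes is easy to check numerically. In the sketch below the function name and the parameter `copies` (the number of redundant duplicates of outlook) are assumptions made for illustration.

```python
def posterior_yes(copies):
    """P(play=yes | E) when outlook=sunny is counted 1 + copies times."""
    yes = (2 / 9) ** (1 + copies) * (3 / 9) * (3 / 9) * (3 / 9) * (9 / 14)
    no = (3 / 5) ** (1 + copies) * (1 / 5) * (4 / 5) * (3 / 5) * (5 / 14)
    return yes / (yes + no)

without_copies = posterior_yes(0)     # the original naive Bayes result
with_five_copies = posterior_yes(5)   # five redundant copies of outlook
```

With no copies the result is the familiar value of about 0.205; with five fully dependent copies treated as independent it collapses to about 0.0018.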
Dependent attributes

If we considered the attribute "outlook" only, we would receive:

  play = yes (outlook, play)
    conditional: 2/9 x 9/14 = 0.143
    absolute:    0.143 / 0.357 = 0.4

  play = no (outlook, play)
    conditional: 3/5 x 5/14 = 0.214
    absolute:    0.214 / 0.357 = 0.6

  Sum: conditional 0.357, absolute 1.00

Summary: in doing so we made the assumption that all the neglected attributes are fully dependent on the remaining attribute.

Numerical Attributes

Numerical values are usually assumed to follow a normal (or Gaussian) distribution, which is described by the mean and the standard deviation. The tables show an overview of the weather data with numerical attributes.

Weather data with numerical values and summarising statistical values

Nominal attributes:

  attribute   value      yes   no   P(value|yes)   P(value|no)
  outlook     sunny       2     3       2/9            3/5
  outlook     overcast    4     0       4/9            0/5
  outlook     rainy       3     2       3/9            2/5
  windy       false       6     2       6/9            2/5
  windy       true        3     3       3/9            3/5
  play        (all)       9     5       9/14           5/14

Numerical attributes:

  attribute     play   observed values                        mean   std dev
  temperature   yes    83, 70, 68, 64, 69, 75, 75, 72, 81     73      6.2
  temperature   no     85, 80, 65, 72, 71                     74.6    7.9
  humidity      yes    86, 96, 80, 65, 70, 80, 70, 90, 75     79.1   10.2
  humidity      no     85, 90, 70, 95, 91                     86.2    9.7

Numerical Attributes

The values of the nominal attributes are represented through occurrence rates. The numerical attributes are represented through the stochastic moments of the distribution, i.e. the mean and the standard deviation. From these values the related occurrence rates can be calculated as the integral of the probability density around the value. This requires an integration interval e to be assumed.

The occurrence rate, or probability of occurrence, of a continuously distributed attribute is (see lectures by Prof. Herz):

$$P[E: x - \tfrac{e}{2} \le X \le x + \tfrac{e}{2}] := \int_{x - e/2}^{x + e/2} f(x)\, dx$$

For small e we can write in simplified form:

$$P[E] := f(x) \cdot e$$
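The approximation P[E] ≈ f(x)·e can be compared with the exact integral using the standard library. This is a sketch under assumptions: the query value 66 is chosen only for illustration, the interval is taken as e = 1 centred on x, and mean and standard deviation are those of temperature given play = yes.

```python
import math
from statistics import NormalDist

def normal_density(x, mean, std_dev):
    """Probability density of the normal distribution at x."""
    exponent = -((x - mean) ** 2) / (2 * std_dev ** 2)
    return math.exp(exponent) / (math.sqrt(2 * math.pi) * std_dev)

def occurrence_probability(x, mean, std_dev, e=1.0):
    """Approximate P[x - e/2 <= X <= x + e/2] by f(x) * e (valid for small e)."""
    return normal_density(x, mean, std_dev) * e

# Temperature given play = yes: mean 73, standard deviation 6.2.
approx = occurrence_probability(66, 73, 6.2, e=1.0)

# Exact integral over the same interval for comparison.
dist = NormalDist(mu=73, sigma=6.2)
exact = dist.cdf(66.5) - dist.cdf(65.5)
```

For an interval of width 1 the approximation and the exact integral agree to about four decimal places (both close to 0.034).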
Numerical Attributes

If we have only continuously distributed numerical attributes, e does not influence the result, because it appears as e/e = 1.

However, when we have mixed continuous and discrete attributes, e does influence the result and we have to choose an appropriate value. A reasonable value is the mean interval length of the discrete attribute. If no information at all is available, e = 1 is assumed, which is a very arbitrary assumption.

Example: the values for temperature and humidity in the aforementioned table are assumed to be normally distributed. The probability density for the event (temperature=66|yes) is

$$f(\text{temperature}=66|\text{yes}) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right) = \frac{1}{\sqrt{2\pi} \cdot 6.2} \exp\left( -\frac{(66-73)^2}{2 \cdot 6.2^2} \right) = 0.0340$$

$$P[E] := f(x) \cdot e = 0.0340 \cdot 1.0\,[\text{degree}]$$

Numerical Attributes

In the same way we also get the probability densities

  f(humidity=90|yes) = 0.0221
  f(temperature=66|no) = 0.0291
  f(humidity=90|no) = 0.0380

Again we have assumed e = 1.0 for each continuous value. With these values we get the conditional probabilities of the hypotheses h1 and h2:

$$P(h_1|D) = \frac{\frac{2}{9} \cdot 0.0340 \cdot 0.0221 \cdot \frac{3}{9} \cdot \frac{9}{14}}{\frac{2}{9} \cdot 0.0340 \cdot 0.0221 \cdot \frac{3}{9} \cdot \frac{9}{14} + \frac{3}{5} \cdot 0.0291 \cdot 0.0380 \cdot \frac{3}{5} \cdot \frac{5}{14}} = 0.209 = 20.9\,\%$$

$$P(h_2|D) = \frac{\frac{3}{5} \cdot 0.0291 \cdot 0.0380 \cdot \frac{3}{5} \cdot \frac{5}{14}}{\frac{2}{9} \cdot 0.0340 \cdot 0.0221 \cdot \frac{3}{9} \cdot \frac{9}{14} + \frac{3}{5} \cdot 0.0291 \cdot 0.0380 \cdot \frac{3}{5} \cdot \frac{5}{14}} = 0.791 = 79.1\,\%$$

Remember: for each discrete value we have implicitly assumed an interval e.

Numerical Attributes

The following observations are given for temperature:

  temperature   64  65  68  69  70  71  72  72  75  75  80  81  83  85
  play          Y   N   Y   Y   Y   N   N   Y   Y   Y   N   Y   Y   N

Problem
- Using every distinct value as a new class would be overkill.
- There can be observations with two different consequences (72).
Solution
- Define classes for those intervals that lead to the same consequence. They are shown as vertical bars; the corresponding interval limits are 64.5, 66.5, 70.5, 72, 77.5, 80.5, 84.
- Subsume neighbouring classes into a superclass and take the majority as its consequence. They are shown as horizontal bars above.
- This may be further subsumed into only 2 classes, namely
    if temperature ≤ 77.5 then play = yes
    if temperature > 77.5 then play = no
- If no knowledge at all is available about the system of the data, equidistant intervals are also a good and justified approach.

How to Build Decision Trees

We have 1 start tree and 4 partial trees, one for each attribute. In order to decide which attribute is the most important one, we need a decision measure.

[Figure: start tree for play (9 yes, 5 no; info = 0.940 bits) and one partial tree each for outlook, temperature, humidity and windy, showing the yes/no instances per value. For outlook: sunny info = 0.971 bits, overcast info = 0.0 bits, rainy info = 0.971 bits, weighted sum info = 0.693 bits.]

Measure for the Information Value

The decision measure should express the information value (worthiness).

Requirements:
1. info value = 0 if one of the consequence values (yes, no) has frequency zero
2. info value = max if the frequencies of all consequence values are equal
3. if more than 2 classes are present, an arbitrary sequential calculation should be possible, namely:

$$\text{info}([a,b,c]) = \text{info}([a, b+c]) + \frac{b+c}{a+b+c} \cdot \text{info}([b,c])$$
Basic Measure: Info / Entropy

These requirements are fulfilled by one measure, the entropy, which is defined as

$$\text{entropy}(p_1, p_2, \ldots, p_n) = -\sum_{i=1}^{n} p_i \log(p_i) \quad \text{with the condition} \quad \sum_{i=1}^{n} p_i = 1$$

It satisfies

$$\text{entropy}(p, q, r) = \text{entropy}(p, q+r) + (q+r) \cdot \text{entropy}\left( \frac{q}{q+r}, \frac{r}{q+r} \right)$$

and

  info(a, b, c) = entropy(a/S, b/S, c/S) with S = a + b + c

The unit of info is [bits], i.e. the logarithm is taken to base 2.

Example

Direct calculation:

  info([2,3,4]) = entropy(2/9, 3/9, 4/9)
                = -2/9 * log(2/9) - 3/9 * log(3/9) - 4/9 * log(4/9)
                = (-2 log 2 - 3 log 3 - 4 log 4 + 9 log 9) / 9

Sequential calculation:

$$\text{info}([a,b,c]) = \text{info}([a, b+c]) + \frac{b+c}{a+b+c} \cdot \text{info}([b,c])$$

  info([2,3,4]) = info([2,7]) + 7/9 * info([3,4])
                = entropy(2/9, 7/9) + 7/9 * entropy(3/7, 4/7)
                  (base unit = 9)   + transfer factor * (base unit = 7)

Further Measures: Info Gain and Info Value Sum

gain := info value before - info value after distribution into classes

info sum:

$$\text{info}([a,b],[c,d],[e,f]) = \frac{a+b}{a+\ldots+f}\,\text{info}([a,b]) + \frac{c+d}{a+\ldots+f}\,\text{info}([c,d]) + \frac{e+f}{a+\ldots+f}\,\text{info}([e,f])$$

or in general, for n counts grouped in pairs,

$$\text{info}([a_1,a_2],\ldots,[a_{n-1},a_n]) = \frac{1}{\sum_{i=1}^{n} a_i} \sum_{i=1,3,\ldots}^{n-1} (a_i + a_{i+1}) \cdot \text{info}([a_i, a_{i+1}])$$

Example: gain(outlook) = info(outlook) - info(sunny, overcast, rainy)

  info(outlook)  = info([9,5]) = 0.940 bits   (9 yes, 5 no)
  info(sunny)    = info([2,3]) = 0.971 bits
  info(overcast) = info([4,0]) = 0.0 bits
  info(rainy)    = info([3,2]) = 0.971 bits
  info([2,3],[4,0],[3,2]) = 5/14 * 0.971 + 4/14 * 0.0 + 5/14 * 0.971 = 0.693 bits
  gain(outlook) = 0.940 - 0.693 = 0.247 bits
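The info value, the info value sum and the gain can be sketched in a few lines, which also confirms that the direct and the sequential calculation agree. The function names are chosen for this illustration; the counts are those of the weather data.

```python
import math

def info(counts):
    """Entropy in bits of a list of class counts, e.g. info([9, 5])."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def info_after_split(branches):
    """Weighted info value sum over the branches of a split."""
    total = sum(sum(branch) for branch in branches)
    return sum(sum(branch) / total * info(branch) for branch in branches)

def gain(before, branches):
    """Info value before the split minus info value sum after it."""
    return info(before) - info_after_split(branches)

# Outlook splits the 9 yes / 5 no instances into sunny, overcast, rainy.
outlook = [[2, 3], [4, 0], [3, 2]]
gain_outlook = gain([9, 5], outlook)

# Direct versus sequential calculation of info([2, 3, 4]).
direct = info([2, 3, 4])
sequential = info([2, 7]) + 7 / 9 * info([3, 4])
```

Evaluating `gain` for each attribute and picking the largest value is the attribute-selection step used when building the tree.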
Procedure for Building Decision Trees

With the decision measures
- info value
- info value sum
- info gain
we are able to build an optimal decision tree.

optimal: the attribute with the highest gain comes first
cost: we have to evaluate the gain for each attribute

First level: evaluate the attribute with the highest gain.

Second and subsequent levels: for each value of the attribute in the first level the same procedure must be repeated, because
- each value has a different consequence range
- each value may receive a different second attribute

Repeat this for all subsequent levels until all values show only 1 consequence or all attributes are used.

Decision measures give us a criterion to order the tree under the assumption of independence of the attributes.

Optimized Decision Tree for the Weather Data

  outlook
    sunny    -> humidity
                  high   -> no
                  normal -> yes
    overcast -> yes
    rainy    -> windy
                  false  -> yes
                  true   -> no

Pruning

Pruning is the method used to decide whether a branch in a tree is worthwhile or not.

Pre-pruning: decide whether to refine the current node by a sub-tree or to stop refinement (applied on the training set).

Post-pruning or backward pruning: reduce an already established sub-tree to its root node and hence simplify the tree, or refine it further (should be applied on the evaluation set, but often the test set is used).

Post-pruning methods are more powerful than pre-pruning methods.

Pruning should be done on an independent data set, i.e. neither on the training set nor on the test set. Hence we should have a 3rd data set, namely the pruning set. More generally it is called the evaluation set. It is used to optimize the rules obtained from the training set. The test set should only be used to determine the quality of the final rules.
If pruning is done on the training set, a statistical bias will be the result.

Post Pruning

[Figure: post-pruning example with a tree over the attributes a1 to a5 ("start phase"). Sub-tree replacement: 1. delete a4, 2. delete a2. Sub-tree raising: 1. delete a2, so that a4 (none, half, full) moves up with the new consequences V1, V2, V3.]

Remark: the consequences V1, V2, V3 are not the same as in the start phase. V1, V2, V3 now represent the consequences when a4 comes before a2.

Measure for pruning decision: Confidence

The confidence of an estimated (calculated) value can be expressed by a confidence interval.

An estimated value, such as the error rate or the success rate, can deviate in the + or - direction. Hence for both rates this leads to a symmetrical pdf, namely the well-known normal distribution.

The estimated rate is the expected rate and hence the mean value of the random variable rate.

What is the related standard deviation of the rate? The basic pdf is the Bernoulli distribution, because we have a binary value, namely success or failure (see lectures by Prof. Herz).

The standard deviation of the rate obtained from N Bernoulli trials is

$$\sigma_B = \sqrt{\frac{q(1-q)}{N}}$$

Confidence

From the standard deviation we see that the number N of instances in the training/test set shapes the confidence. If N → ∞ then σ → 0, and hence the estimated rate is equal to the true rate. For finite N we have the normal distribution to express the possible fluctuation of the estimated rate, and the related confidence range is limited on both sides.
The two-sided confidence interval expresses the confidence of the estimated rate:

Pr[-z ≤ X ≤ z] = c

Tables of the confidence values are given for the standardized Normal distribution N(0,1). Therefore, by normalisation, we can express the confidence through

$$\Pr\left[-z \le \frac{f-q}{\sqrt{q\,(1-q)/N}} \le z\right] = c$$

N = number of observations
E = number of wrong classifications
f = observed error rate = E/N
q = true (unknown) error rate
c = confidence in %
z = standardized upper confidence limit

Confidence

The two-sided confidence interval can be computed from the one-sided confidence interval.
The one-sided confidence limit is the border z beyond which the random number X lies with probability c:

Pr[X ≤ -z] = c_left or Pr[X ≥ z] = c_right

X = random number
c = confidence in %
z = standardized upper/lower confidence limit

For symmetric distributions with μ = 0:

Pr[X ≥ z] = Pr[X ≤ -z]

For symmetrically distributed random numbers with a mean value μ = 0 and a one-sided probability c = Pr[X ≥ z], the probability that the realisation of a random number X is inside the two-sided confidence interval [-z, z] is

Pr[-z ≤ X ≤ z] = 1 - 2c

Confidence Intervals for Standard Normal Distribution

Pr[X≥z]    z       Pr[X≥z]    z
0.00%      ∞       1.00%      2.33
0.05%      3.29    2.00%      2.05
0.10%      3.09    3.00%      1.88
0.15%      2.97    4.00%      1.75
0.20%      2.88    5.00%      1.64
0.25%      2.81    6.00%      1.55
0.30%      2.75    7.00%      1.48
0.35%      2.70    8.00%      1.41
0.40%      2.65    9.00%      1.34
0.45%      2.61    10.00%     1.28
0.50%      2.58    12.00%     1.17
0.55%      2.54    14.00%     1.08
0.60%      2.51    16.00%     0.99
0.65%      2.48    18.00%     0.92
0.70%      2.46    20.00%     0.84
0.75%      2.43    25.00%     0.67
0.80%      2.41    30.00%     0.52
0.85%      2.39    35.00%     0.39
0.90%      2.37    40.00%     0.25
0.95%      2.35    45.00%     0.13
                   50.00%     0.00

[Figure: z plotted against Pr[X ≥ z] for the standard Normal distribution, z from 0 to 3.5, Pr[X ≥ z] from 0% to 50%]
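The tabulated limits can be reproduced with the Python standard library. The sketch below is an illustration and not part of the lecture; the helper names `z_for_tail` and `two_sided_coverage` are hypothetical. It computes z for a given one-sided probability Pr[X ≥ z] and the two-sided probability Pr[-z ≤ X ≤ z] that follows from the symmetry of N(0,1).

```python
from statistics import NormalDist

def z_for_tail(tail_prob):
    """Return z with Pr[X >= z] = tail_prob for X ~ N(0, 1)."""
    return NormalDist().inv_cdf(1.0 - tail_prob)

def two_sided_coverage(z):
    """Return Pr[-z <= X <= z] for X ~ N(0, 1)."""
    std_normal = NormalDist()
    return std_normal.cdf(z) - std_normal.cdf(-z)

for tail in (0.005, 0.05, 0.10, 0.25):
    z = z_for_tail(tail)
    print(f"Pr[X>=z] = {tail:.3f}   z = {z:.2f}   Pr[-z<=X<=z] = {two_sided_coverage(z):.3f}")
```

For each row the two-sided probability equals one minus twice the one-sided probability, which is how a two-sided interval is read off a one-sided table.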
Measure for Pruning

If the measure for pruning is based on the training set, the one-sided confidence interval is chosen in order to account for the statistical bias.
It is an empirical approach, but the results have proven its validity.

N = number of observations
E = number of wrong classifications
f = observed error rate = E/N
q = true (but unknown) error rate (e)
c = (given/chosen) confidence in %
z = standardized upper confidence limit

$$\Pr\left[\frac{f-q}{\sqrt{q\,(1-q)/N}} > z\right] = c$$

Estimated error rate e (= q):

$$e = \frac{f + \dfrac{z^2}{2N} + z\sqrt{\dfrac{f}{N} - \dfrac{f^2}{N} + \dfrac{z^2}{4N^2}}}{1 + \dfrac{z^2}{N}}$$

Usually a confidence of c = 25% is assumed, to which the standardized upper confidence limit z = 0.69 belongs.

Remark: the numbers are taken from the training set; hence the more conservative one-sided confidence limit has to be chosen and not the two-sided one, which would be statistically correct.

Measure for Pruning

The measure for pruning for more than one value is the weighted mean of the individual measures:

$$e_{1 \ldots k} = \frac{1}{N}\sum_{i=1}^{k} n_i\, e_i, \qquad N = \sum_{i=1}^{k} n_i$$

n_i = number of observations of value i from the training set
k = number of values at a node

Example

a2
  ≤36: 1N, 1Y
  >36: a4
    none: 4N, 2Y
    half: 1N, 1Y
    full: 4N, 2Y

f_none = 2/6 = 0.33    e_none = 0.47
f_half = 1/2 = 0.50    e_half = 0.72
f_full = 2/6 = 0.33    e_full = 0.47

$$e_{none,half,full} = \frac{1}{6+2+6}\,(6 \cdot 0.47 + 2 \cdot 0.72 + 6 \cdot 0.47) = 0.51$$

f_a4 = 5/14 = 0.36    e_a4 = 0.46

Result: the combined error estimate of the 3 values is higher than the error estimate of node a4.
Decision: subtree a4 is replaced; the value of a4 is 'N' with numbers (9N, 5Y).

Repeat the procedure for a2:

e_≤36 = 0.72    e_>36 = 0.46

$$e_{\le 36,\,>36} = \frac{1}{2+14}\,(2 \cdot 0.72 + 14 \cdot 0.46) = 0.49$$

f_a2 = 6/16 = 0.375    e_a2 = 0.48

Result: subtree a2 is replaced by the value N.
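The pruning decision for node a4 can be retraced with a short Python sketch. It is an illustration under stated assumptions, not the lecture's own implementation: the function names `error_estimate` and `combined_estimate` are hypothetical, z = 0.69 is taken from the slide, and each leaf is described by the pair (wrong classifications, observations) read from the example tree.

```python
import math

def error_estimate(f, n, z=0.69):
    """Upper confidence limit e of the true error rate for observed rate f on n cases."""
    z2 = z * z
    root = math.sqrt(f / n - f * f / n + z2 / (4.0 * n * n))
    return (f + z2 / (2.0 * n) + z * root) / (1.0 + z2 / n)

def combined_estimate(leaves, z=0.69):
    """Weighted mean of the leaf estimates; leaves is a list of (errors, cases) pairs."""
    total = sum(n for _, n in leaves)
    return sum(n * error_estimate(errors / n, n, z) for errors, n in leaves) / total

# Leaves below a4: none (4N, 2Y), half (1N, 1Y), full (4N, 2Y); the minority class counts as error.
leaves_a4 = [(2, 6), (1, 2), (2, 6)]
children = combined_estimate(leaves_a4)
# a4 collapsed to the single value 'N' with (9N, 5Y): 5 errors in 14 cases.
parent = error_estimate(5 / 14, 14)

print(round(children, 2), round(parent, 2))
print("replace subtree a4" if parent < children else "keep subtree a4")
```

The leaf estimates reproduce the 0.47, 0.72 and 0.51 of the example. With z = 0.69 the estimate for the collapsed node comes out marginally lower than the 0.46 quoted in the example, which does not change the decision to replace the subtree.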