Data Mining – Output: Knowledge Representation
Chapter 3

Representing Structural Patterns
• There are many different ways of representing patterns
• Two were covered in Chapter 1 – decision trees and classification rules
• The learned pattern is a form of “knowledge representation” (even if the knowledge does not seem very impressive)

Decision Trees
• Make decisions by following branches down the tree until a leaf is reached
• Classification is based on the contents of the leaf
• Non-leaf nodes usually involve testing a single attribute
– Usually for the different values of a nominal attribute, or for a range of a numeric attribute (most commonly a two-way split: greater than some value vs. less than or equal to it)
– Less commonly, compare two attribute values, or some function of multiple attributes
• It is common for an attribute, once used, not to be tested again at a lower level of the same branch

Decision Trees
• Missing Values
– May be treated as another possible value of a nominal attribute – if the fact that data is missing may itself mean something
– May follow the most popular branch when data is missing from a test instance
– More complicated approach – rather than going all-or-nothing, can ‘split’ the test instance in proportion to the popularity of the branches in the training data – recombination at the end uses a vote based on the weights

Classification Rules
• Popular alternative to decision trees
• LHS / antecedent / precondition – tests to determine if the rule is applicable
– Tests are usually ANDed together
– Could be a general logical condition (AND/OR/NOT), but learning such rules is MUCH less constrained
• RHS / consequent / conclusion – the answer – usually the class (but could be a probability distribution)
• Rules with the same conclusion essentially represent an OR
• Rules may be an ordered set, or independent
• If independent, a policy may need to be established for when more than one rule matches (conflict resolution strategy) or when no rule matches

Rules / Trees
• Rules can easily be created from a tree – but the result is not the simplest possible set of rules
• Transforming rules into a tree is not straightforward (see the “replicated subtree” problem – next two slides)
• In many cases rules are more compact than trees – particularly if a default rule is possible
• Rules may appear to be independent nuggets of knowledge (and hence less complicated than trees) – but if the rules are an ordered set, then they are much more complicated than they appear

If a and b then x
If c and d then x
Figure 3.1 Decision tree for a simple disjunction.

If x=1 and y=1 then class = a
If z=1 and w=1 then class = a
Otherwise class = b
Each gray triangle actually contains the whole gray subtree below
Figure 3.3 Decision tree with a replicated subtree.

Association Rules
• Association rules are not intended to be used together as a set – in fact the value is in the knowledge itself – probably no automatic use of the rules
• Large numbers of possible rules

Association Rule Evaluation
• Coverage – the number of instances for which the rule predicts correctly – also called support
• Accuracy – the proportion of instances to which the rule applies that it predicts correctly – also called confidence
• Coverage is sometimes expressed as a percentage of the total # of instances
• Usually methods or users specify a minimum coverage and accuracy for the rules to be generated
• Some possible rules imply others – present only the strongest supported
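To make the two measures concrete, here is a minimal Python sketch (not from the slides; the tiny nominal dataset and the rule are made up purely for illustration) that computes coverage/support and accuracy/confidence for one candidate rule.

```python
# Minimal sketch (illustrative only): coverage (support) and accuracy
# (confidence) of one candidate association rule over a tiny, made-up
# weather-style nominal dataset.

instances = [
    {"outlook": "rainy", "temperature": "cool", "humidity": "normal", "play": "no"},
    {"outlook": "rainy", "temperature": "mild", "humidity": "high",   "play": "no"},
    {"outlook": "sunny", "temperature": "hot",  "humidity": "high",   "play": "yes"},
    {"outlook": "sunny", "temperature": "cool", "humidity": "normal", "play": "yes"},
]

def rule_stats(instances, antecedent, consequent):
    """Return (coverage, accuracy) for the rule antecedent ==> consequent."""
    applies = [x for x in instances
               if all(x[a] == v for a, v in antecedent.items())]
    correct = [x for x in applies
               if all(x[c] == v for c, v in consequent.items())]
    coverage = len(correct)                                      # a.k.a. support
    accuracy = len(correct) / len(applies) if applies else 0.0   # a.k.a. confidence
    return coverage, accuracy

# Example rule: outlook=rainy ==> play=no
cov, acc = rule_stats(instances, {"outlook": "rainy"}, {"play": "no"})
print(f"coverage (support) = {cov}, accuracy (confidence) = {acc:.2f}")
```

A rule miner keeps only rules whose support and confidence meet the user-specified minimums, as in the Weka Apriori run on the next slide.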
Example – My Weather – Apriori Algorithm
Apriori
Minimum support: 0.15
Minimum metric <confidence>: 0.9
Number of cycles performed: 17
Best rules found:
1. outlook=rainy 5 ==> play=no 5    conf:(1)
2. temperature=cool 4 ==> humidity=normal 4    conf:(1)
3. temperature=hot windy=FALSE 3 ==> play=no 3    conf:(1)
4. temperature=hot play=no 3 ==> windy=FALSE 3    conf:(1)
5. outlook=rainy windy=FALSE 3 ==> play=no 3    conf:(1)
6. outlook=rainy humidity=normal 3 ==> play=no 3    conf:(1)
7. outlook=rainy temperature=mild 3 ==> play=no 3    conf:(1)
8. temperature=mild play=no 3 ==> outlook=rainy 3    conf:(1)
9. temperature=hot humidity=high windy=FALSE 2 ==> play=no 2    conf:(1)
10. temperature=hot humidity=high play=no 2 ==> windy=FALSE 2    conf:(1)

Rules with Exceptions
• Skip

Rules involving Relations
• More than the values of attributes may be important
• See book example on next slide

Shaded: standing    Unshaded: lying
Figure 3.6 The shapes problem.

More Complicated – Winston’s Blocks World
• House – a 3-sided block & a 4-sided block AND the 3-sided block is on top of the 4-sided block
• Solutions frequently involve learning rules that include variables/parameters
– E.g. 3sided(block1) & 4sided(block2) & ontopof(block1,block2) → house

Easier and Sometimes Useful
• Introduce new attributes during data preparation
• The new attribute represents the relationship
– E.g. for the standing/lying task, could introduce a new boolean attribute, widthgreater?, which would be filled in for each instance during data prep
– E.g. in numeric weather, could introduce “WindChill” based on calculations from temperature and wind speed (if numeric), or “Heat Index” based on temperature and humidity

Numeric Prediction
• The standard for comparison for numeric prediction is the statistical technique of regression
• E.g. for the CPU performance data, the regression equation below was derived
PRP = -56.1 + 0.049 MYCT + 0.015 MMIN + 0.006 MMAX + 0.630 CACH - 0.270 CHMIN + 1.46 CHMAX

Trees for Numeric Prediction
• Tree branches as in a decision tree (may be based on ranges of attributes)
• Regression Tree – leaf nodes contain the average of the training set values that the leaf applies to
• Model Tree – leaf nodes contain regression equations for the instances that the leaf applies to

Figure 3.7(b) Models for the CPU performance data: regression tree.
Figure 3.7(c) Models for the CPU performance data: model tree.

Instance Based Representation
• The concept is not really represented (except via examples)
• Real-world example – some radio stations don’t define what they play by words; they play promos basically saying “WXXX music is:” <songs>
• Training examples are merely stored (kind of like “rote learning”)
• Answers are given by finding the most similar training example(s) to the test instance at testing time
• Has been called “lazy learning” – no work until an answer is needed

Instance Based – Finding Most Similar Example
• Nearest Neighbor – each new instance is compared to all stored instances, with a “distance” calculated attribute by attribute for each comparison
• The class of the nearest-neighbor instance is used as the prediction <see next slide and come back>
• OR the K nearest neighbors vote, or cast a weighted vote
• Combination of the per-attribute distances – city block or Euclidean (as the crow flies)

Nearest Neighbor
[Figure: scatter of stored instances labeled x, y, and z, with a test instance T to be classified from its nearest neighbor(s).]
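Below is a minimal sketch (not from the slides) of the idea just described: Euclidean and city-block distances over numeric attributes, with a majority vote among the k closest stored instances. The tiny training set is made up, and the attribute values are assumed to be already normalized to comparable scales; nominal matching and attribute weighting (discussed on the next slide) are omitted for brevity.

```python
# Minimal k-nearest-neighbor sketch (illustrative only). Numeric attributes
# are assumed pre-normalized; the tiny training set below is made up.
from collections import Counter
from math import sqrt

train = [  # (attribute vector, class)
    ((0.1, 0.2), "x"), ((0.2, 0.1), "x"),
    ((0.8, 0.9), "y"), ((0.9, 0.8), "y"),
    ((0.5, 0.9), "z"),
]

def euclidean(a, b):   # "as the crow flies"
    return sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def city_block(a, b):  # Manhattan distance
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

def knn_predict(test, k=3, distance=euclidean):
    # "Lazy learning": all work happens here, at prediction time.
    neighbors = sorted(train, key=lambda tc: distance(test, tc[0]))[:k]
    # Majority vote among the k nearest stored instances.
    return Counter(cls for _, cls in neighbors).most_common(1)[0][0]

print(knn_predict((0.15, 0.15), k=1))                       # nearest neighbor -> "x"
print(knn_predict((0.70, 0.85), k=3, distance=city_block))  # vote of 3 -> "y"
```

A weighted vote or per-attribute weights would replace the plain Counter vote and the unweighted difference terms, respectively.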
Additional Details
• The distance/similarity function must deal with binaries/nominals – usually by an all-or-nothing match – but mild should be a better match to hot than cool is!
• The distance/similarity function is simpler if the data is normalized in advance – e.g. a $10 difference in household income is not significant, while a 1.0 distance in GPA is big
• The distance/similarity function should weight different attributes differently – a key task is determining those weights

Further Wrinkles
• May not need to save all instances
– Very normal instances may not all need to be saved
– Some approaches actually do some generalization

But …
• Not really a structural pattern that can be pointed to
• However, many people in many tasks/domains will respect arguments based on “previous cases” (diagnosis and law among them)
• The book points out that the instances + the distance metric combine to form class boundaries
– With 2 attributes, these can actually be envisioned <see next slide>

Figure 3.8 Different ways of partitioning the instance space (panels a–d).

Clustering
• Clusters may be able to be represented graphically
• If dimensionality is high, the best representation may only be tabular – showing which instances are in which clusters
• Show Weka – do njcrimenominal with EM and then do visualization of results
• Some algorithms associate instances with clusters probabilistically – for every instance, list the probability of membership in each of the clusters
• Some algorithms produce a hierarchy of clusters, and these can be visualized using a tree diagram
• After clustering, the clusters may be used as the class for classification

Figure 3.9 Different ways of representing clusters – panels (a)–(d): groupings of instances a–k, including a table of per-instance membership probabilities for clusters 1–3 and a tree/hierarchy view.

End Chapter 3