Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Classification and Prediction Monash University Semester 1, March 2006 www.monash.edu.au Patterns • Data mining aims to reveal knowledge about the data under consideration • Knowledge is viewed as Patterns within the data • Patterns are also referred to as structures, models and relationships • Learning approach is inherently linked to the patterns • Which approach for what data ? – eternal question! www.monash.edu.au 2 Taxonomy of Learning Approaches • Verification Driven – Query and reporting Statistical Analysis • Discovery Driven – Supervised Learning > Predictive > Typical Learning Techniques: Classification and Regression – Unsupervised Learning > Informative > Typical Learning Techniques: Clustering, Association Analysis, Deviation/Outlier Detections www.monash.edu.au 3 Problems and Uses of Verification Driven Techniques • Statistics – Assumption – you have a well-defined model > In reality – you often don’t – Focus on model estimation rather than model selection – Don’t cope well with large amounts of data > Time and Effort required is very high – Highlight linear relationships – Non-linear stats requires knowledge of the non-linearity > The ways in which variables interact > In real multi-dimensional data -> Not often known or specificiable – Important lessons from Statistics > Estimation of uncertainty and checking of assumptions is absolutely essential www.monash.edu.au 4 Discovery Driven Data Mining • Predictive/Supervised Learning – Build patterns by making a prediction of some unknown attribute given the values of other known attributes – Principal Techniques: Classification and Regression – Supervision: Training data (a set of observations/measurements) is accompanied by labels indicating the class of the observations – New data is classified based on the training data. www.monash.edu.au 5 Discovery Driven Data Mining • Informative/Unsupervised Learning – Do not present a solution to a known problem – They present “interesting” patterns for consideration by a domain expert – Principal Techniques: Clustering and Association Analysis – Labels of training data unknown – Given a set of observations/measurements (data), need to establish existence of classes or clusters in the data. www.monash.edu.au 6 Classification • Difficult to define • Usually based on a set of classes that are pre-defined and a set of training samples that are used for building class models • Each new object is then assigned to one of the classes already defined in the class model • Typical Apps: Credit Approval, Target Marketing, Medical Diagnosis… www.monash.edu.au 7 Classification Vs. Regression • Classification and Regression are two major types of prediction techniques • Classification – Predict nominal or discrete values • Regression – Predict continuous or ordered values www.monash.edu.au 8 Classification – A two stage process • Model Construction – Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute – Set of tuples used for model construction: Training Set – The model can be represented as a set of IFTHEN rules, Decision Trees of Mathematical Formulae www.monash.edu.au 9 Classification Process (1): Model Construction Training Data NAME Mike Mary Bill Jim Dave Anne RANK YEARS TENURED Assistant Prof 3 no Assistant Prof 7 yes Professor 2 yes Associate Prof 7 yes Assistant Prof 6 no Associate Prof 3 no Classification Algorithms Classifier (Model) IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’ www.monash.edu.au 10 Classification – A two stage process • Model Usage – For classifying future or unknown objects – Need to estimate accuracy of the model – The known label of the test sample is compared with the classified result from the model – Accuracy rate is the percentage of test set samples that are correctly classified by the model – Test set is independent of the training set www.monash.edu.au 11 Classification Process (2): Use the Model in Prediction Classifier Testing Data Unseen Data NAME Tom Merlisa George Joseph RANK YEARS TENURED Assistant Prof 2 no Associate Prof 7 no Professor 5 yes Assistant Prof 7 yes (Jeff, Professor, 4) Tenured? www.monash.edu.au 12 Evaluating Classification Methods • Predictive accuracy • Speed and scalability – time to construct the model – time to use the model • Robustness – handling noise and missing values • Scalability – efficiency in disk-resident databases • Interpretability: – understanding and insight provided by the model • Goodness of rules – decision tree size – compactness of classification rules www.monash.edu.au 13 Decision Trees • Assumption: – Data consists of records that have a number of independent attributes and a dependent attribute – A flow-chart-like tree structure – A structure to support a game of 20 questions: > Each internal node represents a test on an attribute > Each branch represents the outcome of the test > Each leaf represents a class www.monash.edu.au 14 Classification by Decision Tree Induction Decision tree generation consists of two phases – Tree construction > At start, all the training examples are at the root > Partition examples recursively based on selected attributes – Tree pruning > Identify and remove branches that reflect noise or outliers • Use of decision tree: Classifying an unknown sample – Test the attribute values of the sample against the decision tree www.monash.edu.au 15 Training Dataset This follows an example from Quinlan’s ID3 age <=30 <=30 31…40 >40 >40 >40 31…40 <=30 <=30 >40 <=30 31…40 31…40 >40 income high high high medium low low low medium low medium medium medium high medium student no no no no yes yes yes no yes yes yes no yes no credit_rating fair excellent fair fair fair excellent excellent fair fair fair excellent excellent fair excellent www.monash.edu.au 16 Output: A Decision Tree for “buys_computer” age? <=30 student? overcast 30..40 yes >40 credit rating? no yes excellent fair no yes no yes www.monash.edu.au 17 Decision Trees • Classify objects based on the values of their independent attributes • Classification is based on a tree structure where each node of the tree is a rule that involves a multi-way decision • To classify an object, the attributes are compared with tests in each node starting from the root. The path found leads to a leaf node which is the class to which the object belongs to. www.monash.edu.au 18 Decision Trees • Attractive due to ease of understanding • Rules can be expressed in natural language • Aim: Build a DT consisting of a root node, number of internal nodes and a set of predefined leaf nodes. • Continuous splitting of root node until the process is complete. www.monash.edu.au 19 Algorithm for Decision Tree Induction • Basic algorithm (a greedy algorithm) – Tree is constructed in a top-down recursive divide-and-conquer manner – At start, all the training examples are at the root – Attributes are categorical (if continuous-valued, they are discretized in advance) – Examples are partitioned recursively based on selected attributes – Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain) • Conditions for stopping partitioning – All samples for a given node belong to the same class – There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf – There are no samples left www.monash.edu.au 20 Decision Trees • Attractive due to ease of understanding • Rules can be expressed in natural language • Aim: Build a DT consisting of a root node, number of internal nodes and a set of predefined leaf nodes. • Continuous splitting of root node until the process is complete. www.monash.edu.au 21 Decision Trees • The quality of the tree depends on the quality of the training data • 100% accurate for training data, but training data is only a sample…cannot be accurate for all data. www.monash.edu.au 22 Finding the Split • A variable has to be chosen to split the data • This variable has to be such that it does the best job of discriminating among target classes. • Finding which variable is most influential is nontrivial. • One approach – finding the data’s diversity and choosing a variable that minimises the diversity amongst the children nodes. • Several Techniques have been proposed – Information Gain/Entropy, Gini, Chi-Squared, Rough Sets etc. www.monash.edu.au 23 Attribute Selection Measure • Information gain (ID3/C4.5) – All attributes are assumed to be categorical – Can be modified for continuous-valued attributes • Gini index (IBM IntelligentMiner) – All attributes are assumed continuous-valued – Assume there exist several possible split values for each attribute – May need other tools, such as clustering, to get the possible split values – Can be modified for categorical attributes www.monash.edu.au 24 Information Theory – 101 • Suppose there is a variable s that can take either value a or value b. • If s is always going to be a, then there is no uncertainty and no information • The most common measure of the mount of information (also known as entropy) is: – I = - SUM (pi log (pi)) • Let p(a) = 0.9 and p(b) = 0.1 • Let p(a) = 0.5 and p(b) = 0.5 • When is I maximised ? www.monash.edu.au 25 Information Gain (ID3/C4.5) • Select the attribute with the highest information gain • Assume there are two classes, P and N – Let the set of examples S contain p elements of class P and n elements of class N – The amount of information, needed to decide if an arbitrary example in S belongs to P or N is defined as p p n n I ( p, n) log 2 log 2 pn pn pn pn www.monash.edu.au 26 Information Gain in Decision Tree Induction • Assume that using attribute A a set S will be partitioned into sets {S1, S2 , …, Sv} – If Si contains pi examples of P and ni examples of N, the entropy, or the expected information needed to classify objects in all subtrees Si is pi ni E ( A) I ( pi , ni ) i 1 p n • The encoding information that would be gained by branching on A Gain( A) I ( p, n) E ( A) www.monash.edu.au 27 Attribute Selection by Information Gain Computation Class P: buys_computer = “yes” Class N: buys_computer = “no” I(p, n) = I(9, 5) =0.940 Compute the entropy for age: age <=30 30…40 >40 pi 2 4 3 ni I(pi, ni) 3 0.971 0 0 2 0.971 5 4 I ( 2,3) I ( 4,0) 14 14 5 I (3,2) 0.69 14 E ( age) Hence Gain(age) I ( p, n) E (age) Similarly Gain(income) 0.029 Gain( student ) 0.151 Gain(credit _ rating ) 0.048 www.monash.edu.au 28 Extracting Classification Rules from Trees • Represent the knowledge in the form of IF-THEN rules • One rule is created for each path from the root to a leaf • Each attribute-value pair along a path forms a conjunction • The leaf node holds the class prediction • Rules are easier for humans to understand www.monash.edu.au 29 Extracting Classification Rules from Trees • Example IF age = “<=30” AND student = “no” THEN buys_computer = “no” IF age = “<=30” AND student = “yes” THEN buys_computer = “yes” IF age = “31…40” THEN buys_computer = “yes” IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “yes” IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “no” www.monash.edu.au 30 Avoid Overfitting in Classification • The generated tree may overfit the training data – Too many branches, some may reflect anomalies due to noise or outliers – Result is in poor accuracy for unseen samples www.monash.edu.au 31 Avoid Overfitting in Classification • Two approaches to avoid overfitting – Prepruning: Halt tree construction early—do not split a node if this would result in the goodness measure falling below a threshold > Difficult to choose an appropriate threshold – Postpruning: Remove branches from a “fully grown” tree—get a sequence of progressively pruned trees > Use a set of data different from the training data to decide which is the “best pruned tree” www.monash.edu.au 32 Approaches to Determine the Final Tree Size • Separate training (2/3) and testing (1/3) sets • Use cross validation, e.g., 10-fold cross validation • Use all the data for training – but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve the entire distribution • Use minimum description length (MDL) principle: – halting growth of the tree when the encoding is minimized www.monash.edu.au 33 Decision Trees: Strengths and Weaknesses • Strengths – Ability to generate understandable rules – Classify with minimal computational overhead • Issues – Need to find the right split variable – Visualising Trees is tedious – particularly as data dimensionality increases – Handling both continuous and categorical data www.monash.edu.au 34 Enhancements to basic decision tree induction • Allow for continuous-valued attributes – Dynamically define new discrete-valued attributes that partition the continuous attribute value into a discrete set of intervals • Handle missing attribute values – Assign the most common value of the attribute – Assign probability to each of the possible values • Attribute construction – Create new attributes based on existing ones that are sparsely represented – This reduces fragmentation, repetition, and replication www.monash.edu.au 35 Classification in Large Databases • Classification—a classical problem extensively studied by statisticians and machine learning researchers • Scalability: Classifying data sets with millions of examples and hundreds of attributes with reasonable speed • Why decision tree induction in data mining? – relatively faster learning speed (than other classification methods) – convertible to simple and easy to understand classification rules – can use SQL queries for accessing databases – comparable classification accuracy with other methods www.monash.edu.au 36 Other Classification Methods • • • • Bayesian approaches Neural Networks k-nearest neighbor classifier case-based reasoning • Genetic algorithm • Rough set approach • Fuzzy set approaches www.monash.edu.au 37