For written notes on this lecture, please read Chapter 3 of The Practical Bioinformatician.

CS2220: Introduction to Computational Biology
Unit 1: Essence of Knowledge Discovery
Li Xiaoli

Information about Instructor
• Instructor: Adj AP Li Xiaoli
– Email: [email protected], [email protected]
– Web: http://www1.i2r.a-star.edu.sg/~xlli/
– Office: NUS COM2-03-33; Fusionopolis (One North MRT station), Institute for Infocomm Research, A*Star

Copyright 2015 © Wong Limsoon, Li Xiaoli

Knowledge Discovery / Data Mining
The first two lectures of this course aim to introduce concepts and techniques of knowledge discovery, including
– Association rule mining
– Clustering / unsupervised machine learning techniques
– Classification / supervised machine learning techniques
– Classification performance measures
– Feature selection techniques

Benefits of Learning Knowledge Discovery
At the end of the course, students should
• Have good knowledge of basic data mining concepts
• Be familiar with representative data mining techniques
• Understand how to use a data mining tool, Weka
• Given a biological dataset or task, know which data mining techniques and tools are appropriate for analyzing the data and addressing the task

What is Knowledge Discovery / Data Mining?
Jonathan's blocks vs. Jessica's blocks: whose block is this?
Jonathan's rules: Blue or Circle
Jessica's rules: All the rest
Question: Can you explain how?

The Steps of Data Mining
• Training data gathering
• Feature generation
– k-grams, colour, texture, domain know-how, ...
• Feature selection
– Entropy, χ2, CFS, t-test, domain know-how, ...
• Feature integration and model construction
– SVM, ANN, PCL, CART, C4.5, kNN, ...
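One of the feature generation options listed above, k-grams, can be sketched in a few lines of Python (a minimal illustration; the function name `k_grams` is mine, not from the lecture):

```python
def k_grams(seq, k):
    """Return all overlapping substrings of length k (the k-grams of seq).

    E.g. the 3-grams of a DNA sequence can serve as features for a classifier.
    """
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# Example: 3-grams of a short DNA sequence
grams = k_grams("ATCGA", 3)
```

Each k-gram (or its count) then becomes one column of the feature table fed to a classifier.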
(These are some classifiers / machine learning methods.)

We Are Drowning in Data, but Starving for Knowledge
Data mining turns raw data into knowledge.

What is Knowledge Discovery?
• Knowledge Discovery from Databases (KDD): the overall process of non-trivial extraction of implicit, previously unknown, and potentially useful knowledge from large amounts of data
• Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

Data Mining: The Core Step of a KDD Process
Although "data mining" is just one of the many steps (which begin with training data gathering), the term is usually used to refer to the whole KDD process.

Origins of Data Mining
• Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
• Traditional techniques may be unsuitable due to
– Enormity of data
– High dimensionality of data
– Heterogeneous, distributed nature of data

Data Mining Tools
• Weka
• RapidMiner
• R
• Orange
• IBM SPSS Modeler
• SAS Data Mining
• Oracle Data Mining
• ......

Data Mining Tasks
• Prediction methods
– Use some variables to predict unknown or future values of other variables
• Description methods
– Find human-interpretable patterns that describe the data
(From Fayyad et al., Advances in Knowledge Discovery and Data Mining, 1996)

Data Mining Tasks...
• Association Rule Discovery [Descriptive]
• Clustering [Descriptive]
• Classification [Predictive]
• Sequential Pattern Discovery [Descriptive]
• Regression [Predictive]
• Outlier Detection (Deviation Detection) [Predictive]

Association Rule Mining: "Market Basket Analysis"

1. Association Rules (Descriptive)
• Finding groups of items that tend to occur together
• Also known as "frequent itemset mining" or "market basket analysis"
– Grocery store: beer and diapers
– Amazon.com: people who bought this book also bought these other books
– Finding gene mutations that frequently co-occur in patients

A Real-World Application: Amazon Book Recommendations
Frequent pattern: {data mining, mining the social web, data analysis with open source tools}
Association rule: {data mining} -> {machine learning in action}

An Intuitive Example of Association Rules from Transactional Data
Shopping cart (transaction) | Items bought
1 | {fruits, ..., milk, beef, ...}
2 | {sugar, ..., toothbrush, ice-cream, ...}
... | ...
n | {battery, juice, beef, egg, chicken, ...}
(Other items in these carts, such as eggs, a pacifier, formula, and a blanket, were shown as pictures on the original slide.)

What is Association Rule Mining?
• Given a set of transactions, find rules that will predict the occurrence of items based on the occurrences of other items in the transaction

Market-basket transactions:
TID | Items
1 | Bread, Milk
2 | Bread, Diaper, Beer, Eggs
3 | Milk, Diaper, Beer, Coke
4 | Bread, Milk, Diaper, Beer
5 | Bread, Milk, Diaper, Coke

Example association rules:
{Diaper} -> {Beer}, {Milk, Bread} -> {Coke}, {Bread} -> {Milk}

Explanation and usefulness of {Bread} -> {Milk}: "People who buy bread may also buy milk."
• Stores might want to offer specials on bread to get people to buy more milk.
• Stores might want to put bread and milk close to each other.
Note that the implication means co-occurrence, not causality!

Definition: Frequent Itemset
• Itemset
– A collection of one or more items, e.g. {Milk, Bread, Diaper}
– k-itemset: an itemset that contains k items
• Support count (σ)
– Frequency of occurrence of an itemset
– E.g. σ({Milk, Bread, Diaper}) = 2
• Support (s)
– Fraction of transactions that contain an itemset
– E.g. s({Milk, Bread, Diaper}) = 2/5 = 40%
• Frequent itemset
– An itemset whose support is greater than or equal to a minsup threshold, which is usually defined by a user, possibly interactively

Definition: Association Rule
• Association rule
– An implication expression of the form X -> Y, where X and Y are itemsets and X ∩ Y = ∅
– Example: {Milk, Diaper} -> {Beer}
• Rule evaluation metrics (using the market-basket transactions above, with |T| = 5)
– Support (s): fraction of transactions that contain both X and Y
  s(X -> Y) = σ(X ∪ Y) / |T| = P(X ∪ Y)
  E.g. s({Milk, Diaper} -> {Beer}) = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4
– Confidence (c): measures how often items in Y appear in transactions that contain X
  c(X -> Y) = σ(X ∪ Y) / σ(X) = P(Y | X)
  E.g. c({Milk, Diaper} -> {Beer}) = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 ≈ 0.67
• Support: how extensively can the rule be applied?
• Confidence: how reliable is the rule when we apply it?
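The support and confidence of {Milk, Diaper} -> {Beer} on the five market-basket transactions can be checked with a short Python sketch (the function names are illustrative, not from the lecture):

```python
# Market-basket transactions from the slides
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(X, Y, transactions):
    """conf(X -> Y) = support(X union Y) / support(X)."""
    return support(X | Y, transactions) / support(X, transactions)

s = support({"Milk", "Diaper", "Beer"}, transactions)       # 2/5 = 0.4
c = confidence({"Milk", "Diaper"}, {"Beer"}, transactions)  # 2/3, about 0.67
```

This matches the worked example on the slide: support 0.4 and confidence about 0.67.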
Association Rule Mining Task
• An association rule r is strong if
– support(r) ≥ min_sup
– confidence(r) ≥ min_conf
• Given a set of transactions T, the goal of association rule mining is to find all rules having
– support ≥ the minsup threshold
– confidence ≥ the minconf threshold

2. Clustering (Descriptive)
• Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups
• Automatically learns the structure of the data
• Main principle: intra-cluster distances are minimized; inter-cluster distances are maximized

The Input of Clustering
Objects/instances/points, each described by features f1, f2, ..., fn:
D1 = (V11, V12, ..., V1n)
D2 = (V21, V22, ..., V2n)
......
Dm = (Vm1, Vm2, ..., Vmn)
f1, f2, ..., fn could be various features, such as race, gender, income, education level, medical features (e.g. weight, blood pressure, total cholesterol, LDL (low-density lipoprotein) cholesterol, HDL (high-density lipoprotein) cholesterol, triglycerides (fats carried in the blood), x-ray), and biological features (e.g. gene expression, protein interactions, gene functions).

Unsupervised Learning
Purely learn from the unlabeled data:

id number | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses
ID1 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1
ID2 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1
ID3 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1
ID4 | 8 | 10 | 10 | 8 | 7 | 10 | 9 | 7 | 1

We can learn relationships or structure from the data by clustering, e.g. perhaps ID1 and ID3 should be in one cluster?
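The clustering intuition above (ID1 and ID3 belong together because their feature vectors are close) can be sketched as a nearest-centroid assignment in Python (a minimal illustration; the function names are mine, and full k-means is covered later in the course):

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign_clusters(points, centroids):
    """Assign each point to the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda i: euclidean(p, centroids[i]))
            for p in points]

# Feature rows ID1, ID3, ID4 from the table above
id1 = (5, 1, 1, 1, 2, 1, 3, 1, 1)
id3 = (3, 1, 1, 1, 2, 2, 3, 1, 1)
id4 = (8, 10, 10, 8, 7, 10, 9, 7, 1)

# With centroids at ID1 and ID4, ID3 joins ID1's cluster
labels = assign_clusters([id1, id3, id4], [id1, id4])
```

Minimizing each point's distance to its centroid is exactly the "intra-cluster distances are minimized" principle.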
Clustering Definition
• Unsupervised learning: data without labels, different from supervised learning/classification
• Given a set of data points/objects, each having a set of attributes, and a similarity measure among them, find clusters such that
– Data points in one cluster are more similar to one another
– Data points in separate clusters are less similar to one another
• Similarity measures:
– Euclidean distance
– Cosine similarity

Clustering Techniques and Applications
• We will learn some standard clustering methods, e.g. k-means and hierarchical clustering, in the coming weeks
• We will demonstrate how to apply clustering methods to gene expression data analysis

3. Classification (Predictive)
• Also called supervised learning
• Learn from past experience/labels, and use the learned knowledge to classify new data
• The knowledge is learned by machine learning algorithms
• Examples:
– Clinical diagnosis for patients
– Cell type classification
– Secondary structure classification of a protein as alpha-helix, beta-sheet, or random coil in biology

An Example: Classification
The same attributes as before, plus a class column (2 for benign, 4 for malignant):

id number | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses | Class
ID1 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | 2
ID2 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | 2
ID3 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | 4
ID4 | 8 | 10 | 10 | 8 | 7 | 10 | 9 | 7 | 1 | 4

Find a model for the class attribute as a function of the values of the other attributes:
f(Clump Thickness, Uniformity of Cell Size, ..., Mitoses) = Class

Classification Data
• A classification application involves more than one class of data.
E.g.,
– Normal vs. disease cells for a diagnosis problem
• Training data is a set of instances (samples, points) with known class labels
• Test data is a set of instances whose class labels are to be predicted

Classification: Definition
• Given a collection of records (training set: history)
– Each record contains a set of attributes; one of the key attributes is the class
• Find a model for the class attribute as a function of the values of the other attributes
• Goal: previously unseen records should be assigned a class as accurately as possible (future)
• Prediction based on past examples: let history tell us the future!

Classification: Evaluation
• How good is your prediction? How do you evaluate your results?
• A test set is used to determine the accuracy of the model
• Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it
• Usually, the overlap between the training and test sets is empty

Construction of a Classifier
Training samples -> build classifier -> classifier
Test instance -> apply classifier -> prediction

Estimate Accuracy: the Wrong Way
Training samples -> build classifier -> apply the classifier to the same training samples -> predictions -> estimate accuracy
Exercise: Why is this way of estimating accuracy wrong? We will elaborate on how to estimate classification accuracy in the next lecture.

Typical Notations
• Training data {(x1, y1), (x2, y2), ..., (xm, ym)}, where the xj (1 ≤ j ≤ m) are n-dimensional vectors, e.g. xj = (Vj1, Vj2, ..., Vjn), and the yj are from a discrete space Y. E.g., Y = {normal, disease}, yj = normal.
• Test data {(u1, ?), (u2, ?), ..., (uk, ?)}

Process
Y = f(X): from training data X with class labels Y, learn a classifier, i.e. a mapping, a hypothesis f
f(U): applying the classifier to test data U yields predicted class labels

Relational Representation of Gene Expression Data
n features (order of 1000), m samples:

gene1 | gene2 | gene3 | gene4 | ... | genen | class
x11 | x12 | x13 | x14 | ... | x1n | P
x21 | x22 | x23 | x24 | ... | x2n | N
x31 | x32 | x33 | x34 | ... | x3n | P
... | ... | ... | ... | ... | ... | ...
xm1 | xm2 | xm3 | xm4 | ... | xmn | N

Features (aka Attributes)
• Categorical features: describe qualitative aspects of an object
– color = {red, blue, green}
• Continuous or numerical features: describe quantitative aspects of an object
– gene expression
– age
– blood pressure
• Discretization: transform a continuous attribute into a categorical attribute (e.g. age)

Discretization of a Continuous Attribute
• Transformation of a continuous attribute to a categorical attribute involves two subtasks
– (1) Decide how many categories
• After the values are sorted, they are divided into n intervals by specifying n-1 split points (equal-interval bins, equal-frequency bins, or clustering)
– (2) Determine how to map the values of the continuous attribute to these categories
• All the values in one interval are mapped to the same categorical value

An Example of a Classification Task

Overall Picture of Classification/Supervised Learning
Data from many domains (biomedical, financial, government, scientific, ...) is fed to classifiers ("medical doctors"): decision trees, kNN, SVM, Bayesian classifiers, neural networks.

Requirements of Biomedical Classification
• Accurate
– High accuracy
– High sensitivity
– High specificity
– High precision
• High comprehensibility

Classification Techniques
• Decision tree-based methods
• Rule-based methods
• Memory-based or instance-based methods
• Naïve Bayes classifiers
• Neural networks
• Support vector machines
• Ensemble classification

Illustrating Classification Task

Importance of Rule-Based Methods
• Systematic selection of a small number of features used for decision making increases the comprehensibility of the knowledge patterns
• C4.5 and CART are two commonly used rule induction algorithms, a.k.a. decision tree induction algorithms

Structure of Decision Trees
• A root node (e.g. x1 > a1) and internal nodes (e.g. x2 > a2) test features; leaf nodes assign classes (A or B)
• If x1 > a1 and x2 > a2, then it is class A
• C4.5 and CART are two of the most widely used
• Easy interpretation

Elegance of Decision Trees
Every path from the root to a leaf forms a decision rule.

Example of a Decision Tree
Training data:
Tid | Refund | Marital Status | Taxable Income | Cheat
1 | Yes | Single | 125K | No
2 | No | Married | 100K | No
3 | No | Single | 70K | No
4 | Yes | Married | 120K | No
5 | No | Divorced | 95K | Yes
6 | No | Married | 60K | No
7 | Yes | Divorced | 220K | No
8 | No | Single | 85K | Yes
9 | No | Married | 75K | No
10 | No | Single | 90K | Yes

A tree that fits this data (splitting attributes Refund, MarSt, TaxInc; the leaf nodes give the class):
– Refund = Yes -> NO
– Refund = No, MarSt = Married -> NO
– Refund = No, MarSt = Single or Divorced, TaxInc < 80K -> NO
– Refund = No, MarSt = Single or Divorced, TaxInc > 80K -> YES

Another Example of a Decision Tree
On the same training data:
– MarSt = Married -> NO
– MarSt = Single or Divorced, Refund = Yes -> NO
– MarSt = Single or Divorced, Refund = No, TaxInc < 80K -> NO
– MarSt = Single or Divorced, Refund = No, TaxInc > 80K -> YES
There could be more than one tree that fits the same data!
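The first of the two trees above can be written directly as nested tests in Python; classifying a record is just following one root-to-leaf path (a minimal illustration; the dictionary keys are mine):

```python
def classify(record):
    """Walk the Refund -> MarSt -> TaxInc decision tree from the example."""
    if record["Refund"] == "Yes":
        return "No"                      # Refund = Yes -> NO
    if record["MarSt"] == "Married":
        return "No"                      # Married -> NO
    # Single or Divorced: decide on taxable income (in thousands)
    return "Yes" if record["TaxInc"] > 80 else "No"

# Training record 5 (Refund=No, Divorced, 95K) has Cheat = Yes
prediction = classify({"Refund": "No", "MarSt": "Divorced", "TaxInc": 95})
```

Each `if` corresponds to one internal node, and each `return` to one leaf, which is why every root-to-leaf path reads as a decision rule.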
Decision Tree Classification Task
Training set:
Tid | Attrib1 | Attrib2 | Attrib3 | Class
1 | Yes | Large | 125K | No
2 | No | Medium | 100K | No
3 | No | Small | 70K | No
4 | Yes | Medium | 120K | No
5 | No | Large | 95K | Yes
6 | No | Medium | 60K | No
7 | Yes | Large | 220K | No
8 | No | Small | 85K | Yes
9 | No | Medium | 75K | No
10 | No | Small | 90K | Yes
Induction: a tree induction algorithm learns a model (a decision tree) from the training set.
Test set:
Tid | Attrib1 | Attrib2 | Attrib3 | Class
11 | No | Small | 55K | ?
12 | Yes | Medium | 80K | ?
13 | Yes | Large | 110K | ?
14 | No | Small | 95K | ?
15 | No | Large | 67K | ?
Deduction: apply the model to the test set.

Apply Model to Test Data
Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Start from the root of the tree and follow the branches matched by the test record:
– Refund = No -> go to MarSt
– MarSt = Married -> leaf NO
Assign Cheat = "No".

Decision Tree Classification Task (recap)

Decision Tree Induction
• Many algorithms:
– Hunt's Algorithm (one of the earliest)
– CART
– ID3, C4.5
– SLIQ, SPRINT
• We want to learn a classification tree model from training data, and we want to apply it to predict future test data as accurately as possible.

Most Discriminatory Feature and Partition
• Discriminatory feature
– Every feature can be used to partition the training data
– If the partitions contain a pure class of training instances, then this feature is most discriminatory
• Partition
– Categorical feature: the number of partitions of the training data is equal to the number of values of this feature
– Numerical feature: two partitions

Instance # | Outlook | Temp | Humidity | Windy | Class
1 | Sunny | 75 | 70 | true | Play
2 | Sunny | 80 | 90 | true | Don't
3 | Sunny | 85 | 85 | false | Don't
4 | Sunny | 72 | 95 | true | Don't
5 | Sunny | 69 | 70 | false | Play
6 | Overcast | 72 | 90 | true | Play
7 | Overcast | 83 | 78 | false | Play
8 | Overcast | 64 | 65 | true | Play
9 | Overcast | 81 | 75 | false | Play
10 | Rain | 71 | 80 | true | Don't
11 | Rain | 65 | 70 | true | Don't
12 | Rain | 75 | 80 | false | Play
13 | Rain | 68 | 80 | false | Play
14 | Rain | 70 | 96 | false | Play

Partitioning a Categorical Feature (Outlook)
A categorical feature is partitioned based on its number of possible values. Total 14 training instances:
– Outlook = sunny: instances 1,2,3,4,5 -> P,D,D,D,P
– Outlook = overcast: instances 6,7,8,9 -> P,P,P,P
– Outlook = rain: instances 10,11,12,13,14 -> D,D,P,P,P

Partitioning a Numerical Feature (Temperature)
A numerical feature is generally partitioned by choosing a "cutting point". Total 14 training instances:
– Temperature <= 70: instances 5,8,11,13,14 -> P,P,D,P,P
– Temperature > 70: instances 1,2,3,4,6,7,9,10,12 -> P,D,D,D,P,P,P,D,P

Steps of Decision Tree Construction
• Select
the "best" feature as the root node of the whole tree
• Partition the dataset into subsets using this feature so that the subsets are as "pure" as possible
• After partitioning by this feature, select the best feature (w.r.t. the subset of training data) as the root node of each sub-tree
• Recurse until the partitions become pure or almost pure

How to Determine the Best Split?
Which feature/attribute do you want to use first? Before splitting: 10 records of class 0 (C0) and 10 records of class 1 (C1). Suppose we have three features, i.e. Own Car, Car Type, Student ID:
– Own Car? Yes: C0:6, C1:4; No: C0:4, C1:6
– Car Type? Family: C0:1, C1:3; Sports: C0:8, C1:0; Luxury: C0:1, C1:7
– Student ID? c1, ..., c10: C0:1, C1:0 each; c11, ..., c20: C0:0, C1:1 each
Which test condition is the best? The split should help us distinguish the different classes.

How to Determine the Best Split
• Greedy approach:
– Nodes with a homogeneous class distribution are preferred
• Need a measure of node impurity:
– C0:5, C1:5: non-homogeneous, high degree of impurity
– C0:9, C1:1: homogeneous, low degree of impurity, providing certainty and confidence for future test cases

Measures of Node Impurity
• Gini index
• Entropy (do it yourself)
• Misclassification error (do it yourself)

How to Find the Best Split
Before splitting: class counts (N00, N01) at the parent node, with impurity M0.
– Split on attribute A (Yes/No): children N1 (N10, N11) and N2 (N20, N21), with impurities M1 and M2, combined into the overall impurity M12 of feature A
– Split on attribute B (Yes/No): children N3 (N30, N31) and N4 (N40, N41), with impurities M3 and M4, combined into the overall impurity M34 of feature B
Gain = M0 - M12 vs. M0 - M34
Which attribute (A or B) should we choose?
We choose the one that reduces impurity more, i.e. brings more gain.

Examples of Computing Impurity Using GINI for a Node
GINI(t) = 1 - Σj [p(j|t)]², where p(j|t) is the relative frequency (probability) of class j at node t.
– C1: 0, C2: 6. P(C1) = 0/6 = 0, P(C2) = 6/6 = 1; Gini = 1 - P(C1)² - P(C2)² = 1 - 0 - 1 = 0
– C1: 1, C2: 5. P(C1) = 1/6, P(C2) = 5/6; Gini = 1 - (1/6)² - (5/6)² = 0.278
– C1: 2, C2: 4. P(C1) = 2/6, P(C2) = 4/6; Gini = 1 - (2/6)² - (4/6)² = 0.444
– C1: 3, C2: 3. P(C1) = 3/6, P(C2) = 3/6; Gini = 1 - (3/6)² - (3/6)² = 0.5

Measure of Impurity: GINI for a Node
• Gini index for a given node t: GINI(t) = 1 - Σj [p(j|t)]²
– Maximum (1 - 1/nc) when records are equally distributed among all nc classes, i.e. p(ci|t) = 1/nc
– Minimum (0) when all records belong to one class, implying the most interesting information

Splitting Based on GINI
• Used in CART, SLIQ, SPRINT
• When a node p is split into k partitions (children), the quality of the split is computed as
  GINIsplit = Σi=1..k (ni/n) × GINI(i)
  where ni = number of records at child i and n = number of records at node p.

Binary Attributes: Computing the GINI Index
Splitting into two partitions. Effect of weighing partitions: larger and purer partitions are sought.
Split on B into N1 (C1: 5, C2: 2) and N2 (C1: 1, C2: 4):
Gini(N1) = 1 - (5/7)² - (2/7)² = 0.4082
Gini(N2) = 1 - (1/5)² - (4/5)² = 0.3200
Weighted Gini = 0.3715. What is the gain?
Parent: C1: 6, C2: 6; Gini = 0.500.
After using attribute B: Gini(children) = 7/12 × 0.4082 + 5/12 × 0.3200 = 0.3715, so the gain is 0.500 - 0.3715 = 0.1285.

Categorical Attributes: Computing the Gini Index
• For each distinct value, gather the counts for each class in the dataset
• Use the count matrix to make decisions

Multi-way split (CarType):
| Family | Sports | Luxury
C1 | 1 | 2 | 1
C2 | 4 | 1 | 1
Gini = 0.393

Two-way splits (find the best partition of values):
| {Sports, Luxury} | {Family}
C1 | 3 | 1
C2 | 2 | 4
Gini = 0.400

| {Family, Luxury} | {Sports}
C1 | 2 | 2
C2 | 5 | 1
Gini = 0.419

Which one do you want to choose for the split?

Continuous Attributes: Computing the Gini Index
• Use binary decisions based on one value, e.g. Taxable Income > 97K?
• There are several choices for the splitting value v, and each splitting value has a count matrix associated with it
– Class counts in each of the partitions, A < v and A ≥ v
– v could be any value (e.g. 97)
• Simple method to choose the best v:
– For each v, scan the database to gather the count matrix and compute its Gini index
(Using the Refund / Marital Status / Taxable Income / Cheat training data above.)

Continuous Attributes: Computing the Gini Index...
• For efficient computation: for each attribute,
– Sort the attribute values
– Linearly scan these values, each time updating the count matrix and computing the Gini index
– Choose the split position that has the least Gini index (largest gain)

Sorted Taxable Income values, with Cheat labels:
60, 70, 75, 85, 90, 95, 100, 120, 125, 220
No, No, No, Yes, Yes, Yes, No, No, No, No

Candidate split positions v, with class counts (≤ v, > v) and the Gini of the split:
v | Yes | No | Gini
55 | 0, 3 | 0, 7 | 0.420
65 | 0, 3 | 1, 6 | 0.400
72 | 0, 3 | 2, 5 | 0.375
80 | 0, 3 | 3, 4 | 0.343
87 | 1, 2 | 3, 4 | 0.417
92 | 2, 1 | 3, 4 | 0.400
97 | 3, 0 | 3, 4 | 0.300
110 | 3, 0 | 4, 3 | 0.343
122 | 3, 0 | 5, 2 | 0.375
172 | 3, 0 | 6, 1 | 0.400
230 | 3, 0 | 7, 0 | 0.420
The best split position is v = 97, which has the least Gini index (0.300).

Characteristics of Decision Trees
• Easy to implement
• Single coverage of the training data (elegance)
• Divide-and-conquer splitting strategy
• Easy to explain to medical doctors or biologists

Example Uses of Decision Tree Methods: Proteomics Approaches to Biomarker Discovery
• In prostate and bladder cancers (Adam et al., Proteomics, 2001)
• In serum samples to detect breast cancer (Zhang et al., Clinical Chemistry, 2002)
• In serum samples to detect ovarian cancer (Petricoin et al., Lancet; Li & Rao, PAKDD 2004)

Contact: [email protected] if you have questions
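The Gini computations in this section can be reproduced with a few lines of Python (a minimal sketch; `gini` and `gini_split` are illustrative names, not from the lecture):

```python
def gini(counts):
    """Gini index of a node, 1 - sum_j p(j|t)^2, from its class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(partitions):
    """Quality of a split: weighted average of the children's Gini indices."""
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * gini(p) for p in partitions)

# Node examples from the slides:
# gini([0, 6]) is 0.0 (pure node), gini([3, 3]) is 0.5 (maximally impure)
# Binary split on attribute B with children (5, 2) and (1, 4):
quality = gini_split([[5, 2], [1, 4]])  # 7/12 * 0.4082 + 5/12 * 0.3200, about 0.3715
```

The same `gini_split` applied to the eleven candidate count matrices above reproduces the Gini row of the Taxable Income scan, with the minimum 0.300 at v = 97.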