Classification Part 1
CSE 439 – Data Mining
Assist. Prof. Dr. Derya BİRANT

Outline
◘ What Is Classification?
◘ Classification Examples
◘ Classification Methods
  – Decision Trees
  – Bayesian Classification
  – K-Nearest Neighbor
  – Neural Network
  – Genetic Algorithms
  – Support Vector Machines (SVM)
  – Fuzzy Set Approaches

What Is Classification?
◘ Classification
  – Construction of a model to classify data
  – When constructing the model, use the training set and its class labels
  – After the model is constructed, use it to classify new data

Classification (A Two-Step Process)
1. Model construction
  – Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  – The set of tuples used for model construction is the training set
  – The model is represented as classification rules, trees, or mathematical formulae
2. Model usage (classifying future or unknown objects)
  – Estimate the accuracy rate of the model
    • The accuracy rate is the percentage of test set samples that are correctly classified by the model
  – If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known

[Figure: training data feeds a DM engine to produce a mining model; the mining model plus data to predict feed a DM engine to produce predicted data.]

Classification Example
◘ Process (1): Model construction. A classification algorithm is run on the training data and produces a classifier (model), e.g.

  IF rank = 'Professor' OR years > 6 THEN tenured = 'yes'

Training Data:
  NAME   RANK            YEARS   TENURED
  Mike   Assistant Prof    3     no
  Mary   Assistant Prof    7     yes
  Bill   Professor         2     yes
  Jim    Associate Prof    7     yes
  Dave   Assistant Prof    6     no
  Anne   Associate Prof    3     no

◘ Process (2): Using the model in prediction. The model is evaluated on the testing data, then applied to unseen data.

Testing Data:
  NAME     RANK            YEARS   TENURED
  Tom      Assistant Prof    2     no
  Merlisa  Associate Prof    7     no
  George   Professor         5     yes
  Joseph   Assistant Prof    7     yes

Unseen Data: (Jeff, Professor, 4) → Tenured?

Classification Example
◘ Given old data about customers and their payments, predict a new applicant's loan eligibility.
  – Good Customers
  – Bad Customers

[Figure: previous customers (age, salary, profession, location, customer type) feed a classifier, which produces rules such as "Salary > 5 L" and "Prof. = Exec"; applying the rules to the new applicant's data yields good/bad.]

Classification Techniques
1. Decision Trees
2. Bayesian Classification
   $c_{\max} = \arg\max_{c_j} \frac{p(c_j)\prod_{i=1}^{n} p(a_i \mid c_j)}{p(d)}$
3. K-Nearest Neighbor
4. Neural Network
5. Genetic Algorithms
6. Support Vector Machines (SVM)
7. Fuzzy Set Approaches

1- Decision Trees
◘ A decision tree is a tree where
  – internal nodes are simple decision rules on one or more attributes
  – leaf nodes are predicted class labels
◘ Decision trees are used for deciding between several courses of action.

Example training data (buys_computer):

  age      income   student   credit_rating   buys_computer
  <=30     high     no        fair            no
  <=30     high     no        excellent       no
  31…40    high     no        fair            yes
  >40      medium   no        fair            yes
  >40      low      yes       fair            yes
  >40      low      yes       excellent       no
  31…40    low      yes       excellent       yes
  <=30     medium   no        fair            no
  <=30     low      yes       fair            yes
  >40      medium   yes       fair            yes
  <=30     medium   yes       excellent       yes
  31…40    medium   no        excellent       yes
  31…40    high     yes       fair            yes
  >40      medium   no        excellent       no

The corresponding decision tree (internal nodes test an attribute, branches are attribute values, leaves are classifications):

  age?
    <=30   → student?
               no  → no
               yes → yes
    31..40 → yes
    >40    → credit_rating?
               excellent → no
               fair      → yes

Decision Tree Applications
◘ Decision trees are used extensively in data mining.
◘ They have been applied to:
  – classifying medical patients by disease,
  – classifying equipment malfunctions by cause,
  – classifying loan applicants by likelihood of payment,
  – ...
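Tying together the two-step process and the decision-tree method above, here is a minimal sketch (not from the lecture) that constructs a model on the tenure training data and then classifies the unseen tuple. It assumes scikit-learn is available; the ordinal rank encoding is a hypothetical choice made for illustration.

    # Minimal sketch of the two-step classification process on the tenure data.
    # Assumes scikit-learn; the ordinal rank encoding is a hypothetical choice.
    from sklearn.tree import DecisionTreeClassifier

    rank = {"Assistant Prof": 0, "Associate Prof": 1, "Professor": 2}
    X_train = [[rank["Assistant Prof"], 3],
               [rank["Assistant Prof"], 7],
               [rank["Professor"],      2],
               [rank["Associate Prof"], 7],
               [rank["Assistant Prof"], 6],
               [rank["Associate Prof"], 3]]
    y_train = ["no", "yes", "yes", "yes", "no", "no"]

    # Step 1: model construction on the training set.
    model = DecisionTreeClassifier().fit(X_train, y_train)

    # Step 2: model usage on the unseen tuple (Jeff, Professor, 4 years).
    # Under the lecture's rule (rank = 'Professor'), the expected answer is 'yes'.
    print(model.predict([[rank["Professor"], 4]]))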
[Figure: another example decision tree for a hiring decision, with splits such as "Salary < 1 M", "Job = teacher", and "Age < 30"; leaves are labeled Good/Bad.]

Decision Trees (Different Representation)
◘ A decision tree corresponds to a partition of the attribute space into rectangular regions (DT splits area).

[Figure: the tree "Age (<30 → Car Type: Minivan → YES; Sports, Truck → NO; >=30 → YES)" shown next to the equivalent partition of the plane, with the Age axis marked at 0, 30, 60 and height categories short/medium/tall.]

Decision Tree Advantages / Disadvantages
Positives (+)
  + Reasonable training time
  + Fast application
  + Easy to interpret (can be re-represented as if-then-else rules)
  + Easy to implement
  + Can handle a large number of features
  + Does not require any prior knowledge of the data distribution
Negatives (-)
  - Cannot handle complicated relationships between features
  - Simple decision boundaries
  - Problems with lots of missing data
  - Output attribute must be categorical
  - Limited to one output attribute

Rules Indicated by Decision Trees
◘ Write a rule for each path in the decision tree from the root to a leaf.

Decision Tree Algorithms
◘ ID3
  – Quinlan (1981)
  – Tries to reduce the expected number of comparisons
◘ C4.5
  – Quinlan (1993)
  – An extension of ID3
  – Just starting to be used in data mining applications
  – Also used for rule induction
◘ CART
  – Breiman, Friedman, Olshen, and Stone (1984)
  – Classification and Regression Trees
◘ CHAID
  – Kass (1980)
  – Oldest decision tree algorithm
  – Well established in the database marketing industry
◘ QUEST
  – Loh and Shih (1997)

Decision Tree Construction
◘ Which attribute is the best classifier?
  – Calculate the information gain Gain(S, A) for each attribute A.
  – The basic idea is to select the attribute with the highest information gain.

$\mathrm{Entropy}(S) = -\sum_{i=1}^{m} p_i \log_2 p_i$  (for two classes: $\mathrm{Entropy}(S) = -p_1 \log_2 p_1 - p_2 \log_2 p_2$)

$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{i \in \mathrm{Values}(A)} \frac{|S_i|}{|S|}\,\mathrm{Entropy}(S_i)$

Decision Tree Construction (play-tennis example; translated from the Turkish original, rows numbered R1–R14 for later reference):

       Outlook    Temperature   Humidity   Wind     Play Tennis
  R1   Sunny      Hot           High       Weak     No
  R2   Sunny      Hot           High       Strong   No
  R3   Overcast   Hot           High       Weak     Yes
  R4   Rainy      Mild          High       Weak     Yes
  R5   Rainy      Cool          Normal     Weak     Yes
  R6   Rainy      Cool          Normal     Strong   No
  R7   Overcast   Cool          Normal     Strong   Yes
  R8   Sunny      Mild          High       Weak     No
  R9   Sunny      Cool          Normal     Weak     Yes
  R10  Rainy      Mild          Normal     Weak     Yes
  R11  Sunny      Mild          Normal     Strong   Yes
  R12  Overcast   Mild          High       Strong   Yes
  R13  Overcast   Hot           Normal     Weak     Yes
  R14  Rainy      Mild          High       Strong   No

Which attribute first?

$\mathrm{Entropy}(S) = -(9/14)\log_2(9/14) - (5/14)\log_2(5/14) = 0.940$

$\mathrm{Gain}(S, \mathrm{Wind}) = \mathrm{Entropy}(S) - \frac{|S_{\mathrm{Weak}}|}{|S|}\mathrm{Entropy}(S_{\mathrm{Weak}}) - \frac{|S_{\mathrm{Strong}}|}{|S|}\mathrm{Entropy}(S_{\mathrm{Strong}}) = 0.940 - \frac{8}{14}(0.811) - \frac{6}{14}(1.0) = 0.048$

$\mathrm{Gain}(S, \mathrm{Humidity}) = \mathrm{Entropy}(S) - \frac{|S_{\mathrm{High}}|}{|S|}\mathrm{Entropy}(S_{\mathrm{High}}) - \frac{|S_{\mathrm{Normal}}|}{|S|}\mathrm{Entropy}(S_{\mathrm{Normal}}) = 0.940 - \frac{7}{14}(0.985) - \frac{7}{14}(1.0) = 0.151$

  Gain(S, Outlook) = 0.246
  Gain(S, Temperature) = 0.029
  Gain(S, Humidity) = 0.151
  Gain(S, Wind) = 0.048

Outlook has the highest information gain, so it becomes the root.

Decision Tree Construction
◘ Which attribute is next?

  Outlook?
    Sunny    → ?
    Overcast → Yes
    Rainy    → ?
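The root-level entropy and gain computations above can be reproduced with a short plain-Python sketch (illustrative only, not part of the lecture; the function names are hypothetical):

    # Sketch: entropy and information gain for the play-tennis data above.
    from collections import Counter
    from math import log2

    def entropy(labels):
        """Entropy(S) = -sum of p_i * log2(p_i) over the class proportions."""
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def gain(rows, labels, attr_index):
        """Gain(S, A) = Entropy(S) - sum of |S_v|/|S| * Entropy(S_v)."""
        n = len(labels)
        subsets = {}
        for row, label in zip(rows, labels):
            subsets.setdefault(row[attr_index], []).append(label)
        return entropy(labels) - sum(len(s) / n * entropy(s)
                                     for s in subsets.values())

    # Columns: Outlook, Temperature, Humidity, Wind (rows R1-R14).
    data = [("Sunny","Hot","High","Weak"), ("Sunny","Hot","High","Strong"),
            ("Overcast","Hot","High","Weak"), ("Rainy","Mild","High","Weak"),
            ("Rainy","Cool","Normal","Weak"), ("Rainy","Cool","Normal","Strong"),
            ("Overcast","Cool","Normal","Strong"), ("Sunny","Mild","High","Weak"),
            ("Sunny","Cool","Normal","Weak"), ("Rainy","Mild","Normal","Weak"),
            ("Sunny","Mild","Normal","Strong"), ("Overcast","Mild","High","Strong"),
            ("Overcast","Hot","Normal","Weak"), ("Rainy","Mild","High","Strong")]
    play = ["No","No","Yes","Yes","Yes","No","Yes",
            "No","Yes","Yes","Yes","Yes","Yes","No"]

    for i, name in enumerate(["Outlook", "Temperature", "Humidity", "Wind"]):
        print(name, round(gain(data, play, i), 3))
    # Prints approximately: Outlook 0.246, Temperature 0.029,
    # Humidity 0.151, Wind 0.048 -- matching the slide.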
Gain computations for the Sunny branch ($S_{\mathrm{Sunny}}$ = {R1, R2, R8, R9, R11}, with $\mathrm{Entropy}(S_{\mathrm{Sunny}}) = 0.970$):

$\mathrm{Gain}(S_{\mathrm{Sunny}}, \mathrm{Wind}) = 0.970 - (2/5)(1.0) - (3/5)(0.918) = 0.019$
$\mathrm{Gain}(S_{\mathrm{Sunny}}, \mathrm{Humidity}) = 0.970 - (3/5)(0.0) - (2/5)(0.0) = 0.970$
$\mathrm{Gain}(S_{\mathrm{Sunny}}, \mathrm{Temperature}) = 0.970 - (2/5)(0) - (2/5)(1) - (1/5)(0) = 0.570$

Humidity has the highest gain on the Sunny branch; repeating the procedure on the Rainy branch selects Wind, giving the final tree:

  Outlook?
    Sunny    → Humidity?
                 High   → No  [R1, R2, R8]
                 Normal → Yes [R9, R11]
    Overcast → Yes [R3, R7, R12, R13]
    Rainy    → Wind?
                 Weak   → Yes [R4, R5, R10]
                 Strong → No  [R6, R14]

Another Example
◘ At the weekend you can:
  – go shopping,
  – watch a movie,
  – play tennis, or
  – just stay in.
◘ What you do depends on three things:
  – the weather (windy, rainy or sunny);
  – how much money you have (rich or poor);
  – whether your parents are visiting.

2- Bayesian Classification
◘ A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities.
◘ Foundation: based on Bayes' theorem. Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:

$P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)}$

◘ Using the buys_computer training data above:
  – Classes: C1: buys_computer = 'yes'; C2: buys_computer = 'no'
  – Data sample X = (age <= 30, income = medium, student = yes, credit_rating = fair)

◘ Priors P(Ci):
  P(C1) = P(buys_computer = "yes") = 9/14 = 0.643
  P(C2) = P(buys_computer = "no") = 5/14 = 0.357
◘ Compute P(X|Ci) for each class:
  P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
  P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
  P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
  P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
  P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
  P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
  P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
  P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4
◘ P(X|Ci):
  P(X | buys_computer = "yes") = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
  P(X | buys_computer = "no") = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
◘ P(X|Ci) × P(Ci):
  P(X | buys_computer = "yes") × P(buys_computer = "yes") = 0.028
  P(X | buys_computer = "no") × P(buys_computer = "no") = 0.007
◘ Therefore, X belongs to class buys_computer = "yes".
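The same hand computation can be expressed as a short plain-Python sketch (illustrative only; like the slide, it uses no Laplace smoothing):

    # Sketch: the naive Bayes computation above on the buys_computer data.
    from collections import Counter

    # Columns: age, income, student, credit_rating; the last element is the class.
    data = [("<=30","high","no","fair","no"), ("<=30","high","no","excellent","no"),
            ("31…40","high","no","fair","yes"), (">40","medium","no","fair","yes"),
            (">40","low","yes","fair","yes"), (">40","low","yes","excellent","no"),
            ("31…40","low","yes","excellent","yes"), ("<=30","medium","no","fair","no"),
            ("<=30","low","yes","fair","yes"), (">40","medium","yes","fair","yes"),
            ("<=30","medium","yes","excellent","yes"), ("31…40","medium","no","excellent","yes"),
            ("31…40","high","yes","fair","yes"), (">40","medium","no","excellent","no")]

    x = ("<=30", "medium", "yes", "fair")     # the sample X from the slide
    classes = Counter(row[-1] for row in data)

    for c, count in classes.items():
        rows = [row for row in data if row[-1] == c]
        p = count / len(data)                 # prior P(Ci)
        for i, value in enumerate(x):         # multiply by each P(a_i | Ci)
            p *= sum(1 for r in rows if r[i] == value) / len(rows)
        print(c, round(p, 3))                 # yes: ~0.028, no: ~0.007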
3- K-Nearest Neighbor (k-NN)
◘ An object is classified by a majority vote of its neighbors (the k closest members).
◘ If k = 1, the object is simply assigned to the class of its nearest neighbor.
◘ The Euclidean distance measure is used to calculate how close each neighbor is.

Classification Evaluation (Testing)
◘ A classifier is learned from the training set, then evaluated on a test set whose class labels are to be predicted.

Training Set:
  Tid   Refund   Marital Status   Taxable Income   Cheat
  1     Yes      Single           125K             No
  2     No       Married          100K             No
  3     No       Single           70K              No
  4     Yes      Married          120K             No
  5     No       Divorced         95K              Yes
  6     No       Married          60K              No
  7     Yes      Divorced         220K             No
  8     No       Single           85K              Yes
  9     No       Married          75K              No
  10    No       Single           90K              Yes

Test Set:
  Refund   Marital Status   Taxable Income   Cheat
  No       Single           75K              ?
  Yes      Married          50K              ?
  No       Married          150K             ?
  Yes      Divorced         90K              ?
  No       Single           40K              ?
  No       Married          80K              ?

Classification Accuracy
◘ Which classification model is better? Compare using the confusion matrix:

                     Predicted positive    Predicted negative
  Actual positive    True Positive (TP)    False Negative (FN)
  Actual negative    False Positive (FP)   True Negative (TN)

$\mathrm{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$

$\mathrm{error} = \frac{FP + FN}{TP + TN + FP + FN}$

Validation Techniques
◘ Simple Validation: hold out a single test set and train on the remaining data.
◘ Cross Validation: split the data into a training set and a test set, then swap their roles and average the results.
◘ n-Fold Cross Validation: split the data into n folds; each fold serves once as the test set while the remaining folds form the training set.
◘ Bootstrap Method
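As a closing illustration (not from the lecture), here is a minimal plain-Python sketch of k-NN with Euclidean distance and majority voting, together with the accuracy and error formulas above; the points, labels, and confusion-matrix counts are all made up.

    # Sketch: k-NN classification by majority vote among the k nearest
    # neighbors (Euclidean distance), plus accuracy/error from a confusion
    # matrix. All data below is hypothetical.
    from collections import Counter
    from math import dist   # Euclidean distance (Python 3.8+)

    def knn_predict(train_points, train_labels, query, k=3):
        """Classify `query` by majority vote among its k nearest neighbors."""
        nearest = sorted(zip(train_points, train_labels),
                         key=lambda pl: dist(pl[0], query))[:k]
        return Counter(label for _, label in nearest).most_common(1)[0][0]

    points = [(1, 1), (2, 1), (8, 9), (9, 8), (1, 2), (9, 9)]
    labels = ["A", "A", "B", "B", "A", "B"]
    print(knn_predict(points, labels, (2, 2), k=3))   # -> 'A'

    # Accuracy and error rate from confusion-matrix counts (made-up values):
    tp, fn, fp, tn = 50, 10, 5, 35
    print((tp + tn) / (tp + tn + fp + fn))   # accuracy = 0.85
    print((fp + fn) / (tp + tn + fp + fn))   # error    = 0.15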