Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Classification 分類與預測 Classification (分類) predict categorical class labels 預測分門別類的類別 標籤 categorical class labels: 離散式的或名目式的 (discrete/nominal) 根據訓練組資料的分類屬性的值(即類別標籤), 建構一個模型,並以之將新資料歸類 Prediction (預測) model continuous-valued functions 掌握連續值型態 的函數,預測未知或遺失的值 2 Ming-Yen Lin, IECS, FCU Classification Problem Given a database D={t1,t2,…,tn} and a set of classes C={C1,…,Cm}, the Classification Problem is to define a mapping f:DgC where each ti is assigned to one class. Actually divides D into equivalence classes. Prediction is similar, but may be viewed as having infinite number of classes. 3 Ming-Yen Lin, IECS, FCU Classification Ex: Grading If x >= 90 then grade =A. If 80<=x<90 then grade =B. If 70<=x<80 then grade =C. If 60<=x<70 then grade =D. If x<50 then grade =F. x <90 x <80 x <70 x <50 F Ming-Yen Lin, IECS, FCU >=90 A >=80 B >=70 C >=60 D 4 Classification Ex: Letter Recognition View letters as constructed from 5 components: Letter A Letter B Letter C Letter D Letter E Letter F 5 Ming-Yen Lin, IECS, FCU Classification Examples Teachers classify students’ grades as A, B, C, D, or F. Identify mushrooms as poisonous or edible. Classify a river as flooding or not. Identify an individual as certain credit risk. Speech recognition: Identify the spoken “word” from the classified examples (rules). Pattern recognition: Identify a pattern from the classified ones. 典型應用 信用核發,目標行銷,醫療診斷,處方(treatment)有效性 分析 6 Ming-Yen Lin, IECS, FCU Classification—兩大步驟 建構模型: 描述一組預定的類別(predetermined class) 每筆資料皆已具有預定的類別(由 class label attribute決定) training set :用來建構模型的資料組 模型的呈現:(1) classification rules (2) decision trees (3) mathematical formulae 使用模型: 將未來或未知之objects分類 估計模型之 accuracy 將已知類別的test sample灌入模型,比較模型的答案類別 Accuracy rate: test set samples正確被模型分類的比例 Test set:獨立於 training set, 否則會over-fitting 若accuracy 可接受,用此模型將未知類別標籤的資料項分類 Most common techniques use DTs, NNs, or are based on distances or statistical methods 7 Ming-Yen Lin, IECS, FCU Process 1: 建構模型 Training Data NAME M ike M ary B ill Jim D ave A nne RANK YEARS TENURED A ssistant P rof 3 no A ssistant P rof 7 yes P rofessor 2 yes A ssociate P rof 7 yes A ssistant P rof 6 no A ssociate P rof 3 no Ming-Yen Lin, IECS, FCU Classification Algorithms Classifier (Model) IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’ 8 Process 2:使用模型 Classifier Testing Data Unseen Data (Jeff, Professor, 4) NAME T om M erlisa G eorge Joseph RANK YEARS TENURED A ssistant P rof 2 no A ssociate P rof 7 no P rofessor 5 yes A ssistant P rof 7 yes Ming-Yen Lin, IECS, FCU Tenured? 9 Defining Classes Distance Based Partitioning Based 10 Ming-Yen Lin, IECS, FCU Supervised vs. Unsupervised Learning 監督式學習 (classification) 監督: training data (觀察值或測量值等) 帶有指示此觀察之 標籤 新資料依照training set來分類 非監督式學習(clustering) training data 的類別標籤未知 給予一組觀察值或測量值等,目的是建構資料中存在的 類別或群體(cluster) 11 Ming-Yen Lin, IECS, FCU 議題 (1): Data Preparation Data cleaning Preprocess data in order to reduce noise and handle missing values 相關性分析(Relevance analysis):即屬性的選取 (feature selection) Remove the irrelevant or redundant attributes Data transformation Generalize and/or normalize data Missing Data Ignore Replace with assumed value 12 Ming-Yen Lin, IECS, FCU 議題(2): 評估 Classification Methods 預測準確率(Predictive accuracy) Classification accuracy on test data Confusion matrix OC Curve 速度與可擴充性( scalability) 建構 classifier 的時間多久 何時可以使用classifier 強固性(Robustness) handling noise and missing values 可擴充性(Scalability) disk-resident databases 的效率 可解釋性(Interpretability) 對model的瞭解與model可提供的insight(瞭解深度) rule有多好( goodness measure ) decision tree size compactness of classification rules 13 Ming-Yen Lin, IECS, FCU Height Example Data Name Kristina Jim Maggie Martha Stephanie Bob Kathy Dave Worth Steven Debbie Todd Kim Amy Wynette Gender F M F F F M F M M M F M F F F Height 1.6m 2m 1.9m 1.88m 1.7m 1.85m 1.6m 1.7m 2.2m 2.1m 1.8m 1.95m 1.9m 1.8m 1.75m Output1 Short Tall Medium Medium Short Medium Short Short Tall Tall Medium Medium Medium Medium Medium Output2 Medium Medium Tall Tall Medium Medium Medium Medium Tall Tall Medium Medium Tall Medium Medium accuracy 14 Ming-Yen Lin, IECS, FCU Classification Performance True Positive False Negative False Positive True Negative 15 Ming-Yen Lin, IECS, FCU Confusion Matrix Example Using height data example with Output1 correct and Output2 actual assignment Actual Membership Short Medium Tall Assignment Short Medium 0 4 0 5 0 1 Tall 0 3 2 16 Ming-Yen Lin, IECS, FCU Operating Characteristic Curve 17 Ming-Yen Lin, IECS, FCU Classification Using Decision Trees Partitioning based: Divide search space into rectangular regions. Tuple placed into class based on the region within which it falls. DT approaches differ in how the tree is built: DT Induction Internal nodes associated with attribute and arcs with values for that attribute. Algorithms: ID3, C4.5, CART 18 Ming-Yen Lin, IECS, FCU Decision Tree Given: D = {t1, …, tn} where ti=<ti1, …, tih> Database schema contains {A1, A2, …, Ah} Classes C={C1, …., Cm} Decision or Classification Tree is a tree associated with D such that Each internal node is labeled with attribute, Ai Each arc is labeled with predicate which can be applied to attribute at parent Each leaf node is labeled with a class, Cj 19 Ming-Yen Lin, IECS, FCU DT Induction 20 Ming-Yen Lin, IECS, FCU DT Splits Area Gender M F Height 21 Ming-Yen Lin, IECS, FCU Comparing DTs Balanced Deep 22 Ming-Yen Lin, IECS, FCU Decision Tree Induction Training Dataset age <=30 <=30 31…40 >40 >40 >40 31…40 <=30 <=30 >40 <=30 31…40 31…40 >40 Ming-Yen Lin, IECS, FCU income student credit_rating high no fair high no excellent high no fair medium no fair low yes fair low yes excellent low yes excellent medium no fair low yes fair medium yes fair medium yes excellent medium no excellent high yes fair medium no excellent buys_computer no no yes yes yes no yes no yes yes yes yes yes no 23 Output: A Decision Tree for “buys_computer” age? <=30 student? overcast 30..40 yes >40 credit rating? no yes excellent fair no yes no yes an example from Quinlan’s ID3 24 Ming-Yen Lin, IECS, FCU Decision Tree Induction演算法 基本方法 (a greedy algorithm) top-down recursive divide-and-conquer 式的建構此樹 開始時,所有的example都在樹根 只能處理categorical屬性(若屬性是continuous-valued, 預做 discretization) 遞迴地將examples依據所選屬性 partition 選屬性的根據:以(1) heuristic(2)或statistical measure (如 information gain) 停止partitioning的條件 某一節點的類別均相同 已沒有其他可以用來partition的屬性 – 這個leaf的類別以 「多數決」 已經沒有sample留下 Ming-Yen Lin, IECS, FCU 25 Decision Tree Induction is often based on Information Theory So 26 Ming-Yen Lin, IECS, FCU Information 27 Ming-Yen Lin, IECS, FCU DT Induction When all the marbles in the bowl are mixed up, little information is given. When the marbles in the bowl are all from one class and those in the other two classes are on either side, more information is given. Use this approach with DT Induction ! 28 Ming-Yen Lin, IECS, FCU Information/Entropy Given probabilitites p1, p2, .., ps whose sum is 1, Entropy is defined as: Entropy measures the amount of randomness or surprise or uncertainty. Goal in classification no surprise entropy = 0 29 Ming-Yen Lin, IECS, FCU Entropy log (1/p) H(p,1-p) 30 Ming-Yen Lin, IECS, FCU 選屬性: Information Gain (ID3/C4.5) 選具有最高 information gain之屬性 S contains si tuples of class Ci for i = {1, …, m} information measures info required to classify any arbitrary tuple m si si I( s1,s2,...,sm ) log 2 s i 1 s entropy of attribute A with values {a1,a2,…,av} s1 j ... smj E(A) I (s1 j,..., smj) s j 1 v information gained by branching on attribute A Gain(A) I(s 1, s 2 ,..., sm) E(A) 31 Ming-Yen Lin, IECS, FCU Information Gain 例子 Class P: buys_computer = “yes” Class N: buys_computer = “no” I(p, n) = I(9, 5) =0.940 Compute the entropy for age: age <=30 30…40 >40 age <=30 <=30 31…40 >40 >40 >40 31…40 <=30 <=30 >40 <=30 31…40 31…40 >40 pi 2 4 3 ni I(pi, ni) 3 0.971 0 0 2 0.971 income student credit_rating high no fair high no excellent high no fair medium no fair low yes fair low yes excellent low yes excellent medium no fair low yes fair medium yes fair medium yes excellent medium no excellent high yes fair medium no excellent Ming-Yen Lin, IECS, FCU buys_computer no no yes yes yes no yes no yes yes yes yes yes no 5 4 I (2,3) I ( 4,0) 14 14 5 I (3,2) 0.694 14 E ( age) 5 I (2,3) 表 “age <=30” 14個中 14 有5個,其中 2 ‘yes’、3 ‘ no’ 故 Gain(age) I ( p, n) E (age) 0.246 同理 Gain(income) 0.029 Gain( student ) 0.151 Gain(credit _ rating ) 0.048 32 由decision tree產生規則 將知識以 IF-THEN 規則方式呈現 每個路徑(從root到leaf) 產生一條規則 路徑上每個attribute-value pair是一個條件組合( conjunction) Leaf具有類別預測值 人們對於規則比較容易懂 Example IF age = “<=30” AND student = “no” THEN buys_computer = “no” IF age = “<=30” AND student = “yes” THEN buys_computer = “yes” IF age = “31…40” THEN buys_computer = “yes” IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “yes” IF age = “<=30” AND credit_rating = “fair” THEN buys_computer = “no” 33 Ming-Yen Lin, IECS, FCU 避免 Overfitting Overfitting: induced tree 可能會overfit training data 太多分支:有些可能是noise或outlier的不正常反應 unseen samples的accuracy不好 Two approaches 預先修剪(Prepruning): 建樹時早一點結束 如果會導致goodness measure掉到某個threshold下,別再「分」了 (no more split) 合適的threshold 難定 後修剪(Postpruning): 從長太滿的樹中移除branches 可得到一連串漸進(progressively)修剪的樹 用一組不同於 training 的data 決定那一個「修剪的樹」 最佳 Ming-Yen Lin, IECS, FCU 34 決定 Final Tree Size的方法 分成 training (2/3) 與 testing (1/3) sets used for data set with large number of samples 使用 cross validation, 例如, 10倍 cross validation divide the data set into k subsamples use k-1 subsamples as training data and one sub-sample as test data --- kfold cross-validation for data set with moderate size 使用 所有 data training 但使用 statistical test (如 chi-square) 估計 展開(expand)或修剪(prune) 某個節點可否改善整體分配 用 minimum description length (MDL) 原則 當encoding minimized時,停止長樹 35 Ming-Yen Lin, IECS, FCU Information Gain example YES: 9 NO: 5 Total = 14 P(YES) = 9/14 = 0.64 P(NO) =5/14 = 0.36 I(Y,N) = -( 9/14 * log2(9/14) + 5/14 * log2(5/14) ) = 0.94 ----------------------------------------------------------------------- age buys_computer <=30 no <=30 no 31…40 yes >40 yes >40 yes >40 no 31…40 yes <=30 no <=30 yes >40 yes <=30 yes 31…40 yes 31…40 yes >40 no Age (<= 30): I(Y,N) = -( 2/5 * log2(2/5) + 3/5 * log2(3/5) ) = 0.97 YES 2 NO 3 Age (31..40): I(Y, N) = -( 4/4 * log2(4/4) + 0/4 * log2(0/4) ) = 0 YES 4 NO 0 Age(>40):-( 2/5 * log2(2/5) + 3/5 * log2(3/5) ) = 0.97 YES 3 NO 2 E(age) = 5/14 I(Y,N)age(<=30) + 4/14 I(Y,N)age(31..40)+ 5/14 I(Y,N)age(>40) = 0.694 Gain = 0.94 – 0.694 = 0.246 Ming-Yen Lin, IECS, FCU 36 Example: Error Rate Test set: 84 84 2/3 1/3 56 1/7 6+2 E1 E1 = 28 3/7 1/6 E2 3/7 3/4 1/12 E3 2/7 E4 類別錯誤的個數 進入這個 leaf 的個數 1/4 3/7 E5 = 2/(6+2) Etree = 2/3*(1/7*E1+ 3/7*E2+3/7*E3)+1/3*(3/4*E4+1/4*E5) 37 Ming-Yen Lin, IECS, FCU Example of overfitting “Yes”: 11 “No” : 9 Predict: Yes Entropy = -(11/20log211/20+ 9/20log29/20) = 0.993 “Yes”: 3 “No”: 0 Predict: Yes “Yes”: 8 “No”: 9 Predict: No Average entropy = -17/20(8/17log28/17+ 9/17log29/17) = 0.848 38 Ming-Yen Lin, IECS, FCU 強化(enhance)基本決策樹 允許 continuous-valued 屬性 動態定義將continuous attribute value partition為離散區間的 新離散屬性 處理 missing attribute values 指定共通值 指定各個值的機率 建立Attribute 為現有稀疏的屬性(sparsely represented)建立新attribute 可減少(簡化) fragmentation, repetition, and replication 39 Ming-Yen Lin, IECS, FCU DT Issues Choosing Splitting Attributes Ordering of Splitting Attributes Splits Tree Structure Stopping Criteria Training Data Pruning 40 Ming-Yen Lin, IECS, FCU ID3 Creates tree using information theory concepts and tries to reduce expected number of comparison.. ID3 chooses split attribute with the highest information gain: 41 Ming-Yen Lin, IECS, FCU ID3 Example (Output1) Starting state entropy: 4/15 log(15/4) + 8/15 log(15/8) + 3/15 log(15/3) = 0.4384 Gain using gender: Female: 3/9 log(9/3)+6/9 log(9/6)=0.2764 Male: 1/6 (log 6/1) + 2/6 log(6/2) + 3/6 log(6/3) = 0.4392 Weighted sum: (9/15)(0.2764) + (6/15)(0.4392) = 0.34152 Gain: 0.4384 – 0.34152 = 0.09688 Gain using height: 0.4384 – (2/15)(0.301) = 0.3983 Choose height as first splitting attribute 42 Ming-Yen Lin, IECS, FCU CART, C4.5, CHAID Classification And Regression Tree maximize diversitybefore- diversityafter diversity分散度-僅含單一類別之diversity低 – eg. entropy/information – compute I(2,3,9)? C4.5 Post-pruning CHAID A tree is “pruned” by halting its construction early. (Pre-pruning) CHAID is restricted to categorical variables Continuous variables must be broken into ranges or replaced with classes such as high, low, medium 43 Ming-Yen Lin, IECS, FCU C4.5 ID3 favors attributes with large number of divisions Improved version of ID3: Missing Data Continuous Data Pruning Rules GainRatio: 44 Ming-Yen Lin, IECS, FCU CART Create Binary Tree Uses entropy Formula to choose split point, s, for node t: PL,PR probability that a tuple in the training set will be on the left or right side of the tree. 45 Ming-Yen Lin, IECS, FCU CART Example At the start, there are six choices for split point (right branch on equality): P(Gender)=2(6/15)(9/15)(2/15 + 4/15 + 3/15)=0.224 P(1.6) = 0 P(1.7) = 2(2/15)(13/15)(0 + 8/15 + 3/15) = 0.169 P(1.8) = 2(5/15)(10/15)(4/15 + 6/15 + 3/15) = 0.385 P(1.9) = 2(9/15)(6/15)(4/15 + 2/15 + 3/15) = 0.256 P(2.0) = 2(12/15)(3/15)(4/15 + 8/15 + 3/15) = 0.32 Split at 1.8 46 Ming-Yen Lin, IECS, FCU Classification in Large Databases Classification—a classical problem extensively studied by statisticians and machine learning researchers Scalability: Classifying data sets with millions of examples and hundreds of attributes with reasonable speed Why decision tree induction in data mining? relatively faster learning speed (than other classification methods) convertible to simple and easy to understand classification rules can use SQL queries for accessing databases comparable classification accuracy with other methods Ming-Yen Lin, IECS, FCU 47 Scalable Decision Tree Induction in Data Mining SLIQ (EDBT’96 — Mehta et al.) builds an index for each attribute and only class list and the current attribute list reside in memory SPRINT (VLDB’96 — J. Shafer et al.) constructs an attribute list data structure PUBLIC (VLDB’98 — Rastogi & Shim) integrates tree splitting and tree pruning: stop growing the tree earlier RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti) separates the scalability aspects from the criteria that determine the quality of the tree builds an AVC-list (attribute, value, class label) Ming-Yen Lin, IECS, FCU 48 Presentation of Classification Results 49 Ming-Yen Lin, IECS, FCU Decision Tree in SGI/MineSet 3.0 50 Ming-Yen Lin, IECS, FCU Tool: Decision Tree C4.5, the “classic” decision-tree tool, developed by J. Q. Quinlan http://www.cse.unsw.edu.tw/~quinlan Classification Tree in Excel EC4.5, a more efficient version of C4.5 http://www.-kdd.di.unipi.it/software IND, provides Gini and C4.5 decision trees http://ic.arc.nasa.gov/projects/bayes-group/ind/INDprogram.html 51 Ming-Yen Lin, IECS, FCU Regression Assume data fits a predefined function Determine best values for regression coefficients c0,c1,…,cn. Assume an error: y = c0+c1x1+…+cnxn+e Estimate error using mean squared error for training set: 52 Ming-Yen Lin, IECS, FCU Linear Regression Poor Fit 53 Ming-Yen Lin, IECS, FCU Classification Using Regression Division: Use regression function to divide area into regions. Prediction: Use regression function to predict a class membership function. Input includes desired class. 54 Ming-Yen Lin, IECS, FCU Division 55 Ming-Yen Lin, IECS, FCU Prediction 56 Ming-Yen Lin, IECS, FCU Classification Using Distance Place items in class to which they are “closest”. Must determine distance between an item and a class. Classes represented by Centroid: Central value. Medoid: Representative point. Individual points Algorithm: KNN 57 Ming-Yen Lin, IECS, FCU K Nearest Neighbor (KNN): Training set includes classes. Examine K items near item to be classified. New item placed in class with the most number of close items. O(q) for each tuple to be classified. (Here q is the size of the training set.) 58 Ming-Yen Lin, IECS, FCU KNN 59 Ming-Yen Lin, IECS, FCU KNN Algorithm 60 Ming-Yen Lin, IECS, FCU Classification Using Neural Networks Typical NN structure for classification: One output node per class Output value is class membership function value Supervised learning For each tuple in training set, propagate it through NN. Adjust weights on edges to improve future classification. Algorithms: Propagation, Backpropagation, Gradient Descent 61 Ming-Yen Lin, IECS, FCU NN Issues Number of source nodes Number of hidden layers Training data Number of sinks Interconnections Weights Activation Functions Learning Technique When to stop learning 62 Ming-Yen Lin, IECS, FCU Decision Tree vs. Neural Network 63 Ming-Yen Lin, IECS, FCU Propagation Tuple Input 64 Ming-Yen Lin, IECS, FCU NN Propagation Algorithm 65 Ming-Yen Lin, IECS, FCU Example Propagation 66 Ming-Yen Lin, IECS, FCU NN Learning Adjust weights to perform better with the associated test data. Supervised: Use feedback from knowledge of correct classification. Unsupervised: No knowledge of correct classification needed. 67 Ming-Yen Lin, IECS, FCU NN Supervised Learning 68 Ming-Yen Lin, IECS, FCU Supervised Learning Possible error values assuming output from node i is yi but should be di: Change weights on arcs based on estimated error 69 Ming-Yen Lin, IECS, FCU NN Backpropagation Propagate changes to weights backward from output layer to input layer. Delta Rule: r wij= c xij (dj – yj) Gradient Descent: technique to modify the weights in the graph. 70 Ming-Yen Lin, IECS, FCU Backpropagation 71 Ming-Yen Lin, IECS, FCU Backpropagation Algorithm 72 Ming-Yen Lin, IECS, FCU Gradient Descent 73 Ming-Yen Lin, IECS, FCU Gradient Descent Algorithm 74 Ming-Yen Lin, IECS, FCU Output Layer Learning 75 Ming-Yen Lin, IECS, FCU Hidden Layer Learning 76 Ming-Yen Lin, IECS, FCU Types of NNs Different NN structures used for different problems. Perceptron Self Organizing Feature Map Radial Basis Function Network 77 Ming-Yen Lin, IECS, FCU Perceptron Perceptron is one of the simplest NNs. No hidden layers. 78 Ming-Yen Lin, IECS, FCU Perceptron Example Suppose: Summation: S=3x1+2x2-6 Activation: if S>0 then 1 else 0 79 Ming-Yen Lin, IECS, FCU Self Organizing Feature Map (SOFM) SOM Competitive Unsupervised Learning Observe how neurons work in brain: Firing impacts firing of those near Neurons far apart inhibit each other Neurons have specific nonoverlapping tasks Ex: Kohonen Network 80 Ming-Yen Lin, IECS, FCU Kohonen Network 81 Ming-Yen Lin, IECS, FCU Kohonen Network Competitive Layer – viewed as 2D grid Similarity between competitive nodes and input nodes: Input: X = <x1, …, xh> Weights: <w1i, … , whi> Similarity defined based on dot product Competitive node most similar to input “wins” Winning node weights (as well as surrounding node weights) increased. 82 Ming-Yen Lin, IECS, FCU Radial Basis Function Network RBF function has Gaussian shape RBF Networks Three Layers Hidden layer – Gaussian activation function Output layer – Linear activation function 83 Ming-Yen Lin, IECS, FCU Radial Basis Function Network 84 Ming-Yen Lin, IECS, FCU Classification Using Rules Perform classification using If-Then rules Classification Rule: r = <a,c> Antecedent, Consequent May generate from from other techniques (DT, NN) or generate directly. Algorithms: Gen, RX, 1R, PRISM 85 Ming-Yen Lin, IECS, FCU Generating Rules from DTs 86 Ming-Yen Lin, IECS, FCU Generating Rules Example 87 Ming-Yen Lin, IECS, FCU Generating Rules from NNs 88 Ming-Yen Lin, IECS, FCU 1R Algorithm 89 Ming-Yen Lin, IECS, FCU 1R Example 90 Ming-Yen Lin, IECS, FCU PRISM Algorithm 91 Ming-Yen Lin, IECS, FCU PRISM Example 92 Ming-Yen Lin, IECS, FCU Decision Tree vs. Rules Tree has implied order in which splitting is performed. Tree created based on looking at all classes. Rules have no ordering of predicates. Only need to look at one class to generate its rules. 決策樹方法的優點?缺點? 93 Ming-Yen Lin, IECS, FCU