Reference Books
- Han, J., Kamber, M., Pei, J. (2011). Data Mining: Concepts and Techniques. Third edition. San Francisco: Morgan Kaufmann Publishers.
- Larose, Daniel T. (2005). Discovering Knowledge in Data: An Introduction to Data Mining. New Jersey: John Wiley and Sons Ltd.
- Tan, Pang-Ning, Steinbach, Michael, Kumar, Vipin (2006). Introduction to Data Mining. Addison-Wesley.
- Alpaydın, E. (2010). Introduction to Machine Learning. Second edition. London: MIT Press.

Classification: k-Nearest Neighbors
Dr. Engin YILDIZTEPE

Supervised vs. Unsupervised Learning
- Supervised learning
  - Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.
  - New data is classified based on the training set.
- Unsupervised learning
  - The class labels of the training data are unknown.
  - Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.

Classification: Definition
- In classification there is a target categorical variable (e.g., income bracket) which is partitioned into predetermined classes or categories, such as high income, middle income, and low income.
- Training set: the set of tuples used for model construction. Given a collection of records, each record contains a set of attributes; one of the attributes is the class. Each tuple/record is assumed to belong to a predefined class, as determined by the class attribute.
- Test set: used to determine the accuracy of the model. Usually the given data set is divided into a training set, used to build the model, and a test set, used to validate it. The test set is independent of the training set (otherwise the model overfits).

Illustrating Classification Task
A training set with known class labels is used to learn a model; the model is then applied to a test set whose class labels are unknown.

Training set (Learn Model):
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test set (Apply Model):
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Classification: Definition
- Model construction: find a model for the class attribute as a function of the values of the other attributes.
- The model is represented as classification rules, decision trees, or mathematical formulae.
- Goal: previously unseen or new records should be assigned a class as accurately as possible. The accuracy rate is the percentage of test set samples that are correctly classified by the model. If the accuracy is acceptable, the model is used to classify data tuples whose class labels are not known.

Process (1): Model Construction
Training data:
NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

A classification algorithm produces the classifier (model):
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
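To make the split-then-evaluate workflow concrete, here is a minimal Python sketch (not from the slides; the helper names train_test_split, accuracy, and tenure_rule are illustrative). It partitions the Process (1) records into independent training and test sets, applies the slide's tenure rule as the classifier, and reports the accuracy rate on the held-out test set.

```python
import random

def train_test_split(records, test_fraction=0.3, seed=1):
    """Randomly partition records into an independent training set and test set."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def accuracy(classify, test_set):
    """Accuracy rate: fraction of test records whose predicted class matches the true class."""
    correct = sum(1 for attrs, label in test_set if classify(attrs) == label)
    return correct / len(test_set)

def tenure_rule(attrs):
    """The rule from the slide: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    rank, years = attrs
    return 'yes' if rank == 'Professor' or years > 6 else 'no'

# Training data from Process (1), as (attributes, class-label) pairs.
data = [
    (('Assistant Prof', 3), 'no'),   # Mike
    (('Assistant Prof', 7), 'yes'),  # Mary
    (('Professor', 2), 'yes'),       # Bill
    (('Associate Prof', 7), 'yes'),  # Jim
    (('Assistant Prof', 6), 'no'),   # Dave
    (('Associate Prof', 3), 'no'),   # Anne
]
train, test = train_test_split(data)
print(accuracy(tenure_rule, test))
```

In practice the rule would be learned from the training set alone; evaluating it on records it never saw is what guards against the overfitting mentioned above.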
Process (2): Using the Model in Prediction
The classifier is applied to the testing data to estimate accuracy, and then to unseen data.

Testing data:
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) → Tenured? yes

Prediction Problems: Classification vs. Numeric Prediction
- Classification
  - predicts categorical class labels (discrete or nominal)
  - classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data
- Numeric prediction
  - models continuous-valued functions, i.e., predicts unknown or missing values

Classification Techniques
- Instance-based classifiers
- Decision tree based methods
- Neural networks
- Rule-based methods
- Memory based reasoning
- Bayes classification methods
- Support vector machines

Lazy Learning
- Lazy learning (e.g., instance-based learning): simply stores the training data (or does only minor processing) and waits until it is given a test tuple.
- Lazy: less time in training but more time in predicting.

Instance-Based Methods (Lazy Learner)
- Instance-based learning: store the training examples and delay the processing ("lazy evaluation") until a new instance must be classified.
- Typical approaches:
  - k-nearest neighbor approach: instances represented as points in a Euclidean space
  - Locally weighted regression
  - Case-based reasoning

K-Nearest Neighbor Classifiers
- Basic idea: if it walks like a duck and quacks like a duck, then it's probably a duck.
- Compute the distance from the test record to the training records and choose the k "nearest" records.

A k-nearest neighbor classifier requires three things:
- The set of stored records
- A distance metric to compute the distance between records
- The value of k, the number of nearest neighbors to retrieve

To classify an unknown record:
- Compute its distance to the other training records
- Identify the k nearest neighbors
- Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)

Definition of k-nearest neighbor: the k-nearest neighbors of a record x are the data points that have the k smallest distances to x.
[Figure: the (a) 1-nearest, (b) 2-nearest, and (c) 3-nearest neighbors of an unknown record]

Nearest Neighbor Classification
- Compute the distance between two points, e.g., the Euclidean distance
  $d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}$
- Determine the class from the nearest neighbor list: take the majority vote of class labels among the k nearest neighbors.

Nearest Neighbor Classification...
- Choosing the value of k:
  - If k is too small, the classifier is sensitive to noise points.
  - If k is too large, the neighborhood may include points from other classes.

Nearest Neighbor Classification...
- Scaling issues: attributes may have to be scaled (normalized) to prevent the distance measure from being dominated by one of the attributes.
- Example:
  - the height of a person may vary from 1.5m to 1.8m
  - the weight of a person may vary from 40kg to 100kg
  - the income of a person may vary from $10K to $1M

Nearest Neighbor Classification
- k-NN classifiers are lazy learners: they do not build models explicitly.
- They are not eager learners such as decision tree induction and rule-based systems.
- Classifying unknown records is relatively expensive.

Example
X1  X2  class
5   4   +
4   7   +
1   6   +
2   7   +
2   4   +
2   2   +
1   6   +
4   1   +
6   1   +
4   1   +
10  10  -
5   8   -
10  5   -
8   4   -
8   6   -
5   8   -
4   5   -
7   4   -
6   9   -
7   7   -
5   8   -
10  6   -
10  6   -
10  4   -
8   3   ?
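As a sketch of how the whole k-NN procedure fits together (not from the slides; the function names are illustrative, and the training pairs assume the row pairing of the reconstructed Example table above), the Python below computes Euclidean distances, takes the majority vote among the k nearest records, and classifies the unlabeled query point (8, 3).

```python
from collections import Counter
import math

def euclidean(p, q):
    """Euclidean distance: d(p, q) = sqrt(sum_i (p_i - q_i)^2)."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def min_max_scale(x, lo, hi):
    """Min-max normalization to [0, 1]; applied per attribute before the
    distance computation when attributes have very different ranges
    (e.g., height in metres vs. income in dollars)."""
    return (x - lo) / (hi - lo)

def knn_classify(train, query, k=3):
    """Majority vote of class labels among the k training records nearest to query."""
    nearest = sorted(train, key=lambda rec: euclidean(rec[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# (X1, X2) points and class labels from the Example slide; both attributes
# are already on the same 1-10 scale, so no normalization is needed here.
train = [((5, 4), '+'), ((4, 7), '+'), ((1, 6), '+'), ((2, 7), '+'),
         ((2, 4), '+'), ((2, 2), '+'), ((1, 6), '+'), ((4, 1), '+'),
         ((6, 1), '+'), ((4, 1), '+'), ((10, 10), '-'), ((5, 8), '-'),
         ((10, 5), '-'), ((8, 4), '-'), ((8, 6), '-'), ((5, 8), '-'),
         ((4, 5), '-'), ((7, 4), '-'), ((6, 9), '-'), ((7, 7), '-'),
         ((5, 8), '-'), ((10, 6), '-'), ((10, 6), '-'), ((10, 4), '-')]

print(knn_classify(train, (8, 3), k=3))  # prints '-'
```

With k = 3 the nearest records to (8, 3) are (8, 4), (7, 4), and (10, 4), all labeled '-', so the majority vote assigns class '-'. Note that the whole training set is scanned at prediction time, which is exactly why the slides call k-NN a lazy learner with relatively expensive classification.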