Data Science Workshop
Introduction to Machine Learning

Instructor: Dr Eamonn Keogh
Computer Science & Engineering Department
318 Winston Chung Hall
University of California - Riverside
Riverside, CA 92521
[email protected]

Get the slides now!
www.cs.ucr.edu/~eamonn/public/DSW.pdf
www.cs.ucr.edu/~eamonn/public/DSW.ppt

Some slides adapted from Tan, Steinbach and Kumar, and from Chris Clifton

Machine Learning
Machine learning explores the study and construction of algorithms that can learn from data.
Basic Idea: Instead of trying to create a very complex program to do X, use a (relatively) simple program that can learn to do X.
Example: Instead of trying to program a car to drive (If light(red) && NOT(pedestrian) || speed(X) <= 12 && .. ), create a program that watches humans drive, and learns how to drive*.
*Currently, self-driving cars do a bit of both.

Why Machine Learning I
Why do machine learning instead of just writing an explicit program?
• It is often much cheaper, faster and more accurate.
• It may be possible to teach a computer something that we are not sure how to program. For example:
  • We could explicitly write a program to tell if a person is obese:
    If (weight_kg / (height_m * height_m)) > 30, printf("Obese")
  • We would find it hard to write a program to tell if a person is sad. However, we could easily obtain 1,000 photographs of sad people / not sad people, and ask a machine learning algorithm to learn to tell them apart.

What kind of data do you want to work with?
• Insects
• Stars
• Books
• Mice
• Counties
• Emails
• Historical manuscripts
• People
  – As potential terrorists
  – As potential voters for your candidate
  – As potential heart attack victims
  – As potential tax cheats
  – etc

No matter what kind of data you want to work with, it is best if you can “massage” it into a rectangular flat file.
This may be easy, or…

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

What is Data?
• Collection of objects and their attributes
• An attribute is a property or characteristic of an object
  – Examples: eye color of a person, temperature, etc.
  – An attribute is also known as a variable, field, characteristic, or feature
• A collection of attributes describes an object
  – Objects are also known as records, points, cases, samples, entities, exemplars or instances
• Objects could be a customer, a patient, a car, a country, a novel, a drug, a movie, etc.
In the table above, the columns are the attributes and the rows are the objects.

Data Dimensionality and Numerosity
• The number of attributes is the dimensionality of a dataset.
• The number of objects is the numerosity (or just size) of a dataset.
• Some of the algorithms we want to use may scale badly in the dimensionality, or scale badly in the numerosity (or both).
• As we will see, reducing the dimensionality and/or numerosity of data is a common task in data mining.

The Classification Problem (informal definition)
Given a collection of annotated data.
In this case, five instances of Katydids and five of Grasshoppers, decide what type of insect the unlabeled example is.
Katydid or Grasshopper?

The Classification Problem (informal definition)
Given a collection of annotated data. In this case, 3 instances of Canadian and 3 of American, decide what type of coin the unlabeled example is.
Canadian or American?

For any domain of interest, we can measure features
• Color {Green, Brown, Gray, Other}
• Abdomen Length
• Has Wings?
• Thorax Length
• Antennae Length
• Mandible Size
• Spiracle Diameter
• Leg Length

Sidebar 1
In data mining, we usually don’t have a choice of what features to measure. The data is not usually collected with data mining in mind. The features we really want may not be available. Why?
____________________
____________________
We typically have to use (a subset of) whatever data we are given.

Sidebar 2
In data mining, we can sometimes generate new features. For example:
Feature X = Abdomen Length / Antennae Length

We can store features in a database.
The classification problem can now be expressed as:
• Given a training database (My_Collection), predict the class label of a previously unseen instance

My_Collection
Insect ID  Abdomen Length  Antennae Length  Insect Class
1          2.7             5.5              Grasshopper
2          8.0             9.1              Katydid
3          0.9             4.7              Grasshopper
4          1.1             3.1              Grasshopper
5          5.4             8.5              Katydid
6          2.9             1.9              Grasshopper
7          6.1             6.6              Katydid
8          0.5             1.0              Grasshopper
9          8.3             6.6              Katydid
10         8.1             4.7              Katydid

previously unseen instance =
11         5.1             7.0              ???????

[Figure: scatterplot of Antenna Length (y-axis, 1 to 10) against Abdomen Length (x-axis, 1 to 10), showing Grasshoppers and Katydids.]

We will also use this larger dataset as a motivating example…
[Figure: scatterplot of Antenna Length against Abdomen Length for the larger dataset of Katydids and Grasshoppers.]
Each of these data objects is called an…
• exemplar
• (training) example
• instance
• tuple

We will return to the previous slide in two minutes. In the meantime, we are going to play a quick game.
I am going to show you some classification problems which were shown to pigeons! Let us see if you are as smart as a pigeon!

Pigeon Problem 1
Each example is a pair of bars, written here as (left bar, right bar).
Examples of class A: (3, 4), (1.5, 5), (6, 8), (2.5, 5)
Examples of class B: (5, 2.5), (5, 2), (8, 3), (4.5, 3)
What class is this object: (8, 1.5)?
What about this one, A or B: (4.5, 7)?

Pigeon Problem 1
(8, 1.5) is a B!
Here is the rule. If the left bar is smaller than the right bar, it is an A, otherwise it is a B.

Pigeon Problem 2
Examples of class A: (4, 4), (5, 5), (6, 6), (3, 3)
Examples of class B: (5, 2.5), (2, 5), (5, 3), (2.5, 3)
Oh! This one's hard: (8, 1.5)
Even I know this one: (7, 7)

Pigeon Problem 2
The rule is as follows: if the two bars are equal sizes, it is an A. Otherwise it is a B.
So this one, (7, 7), is an A.

Pigeon Problem 3
Examples of class A: (4, 4), (1, 5), (6, 3), (3, 7)
Examples of class B: (5, 6), (7, 5), (4, 8), (7, 7)
This one is really hard! What is this, A or B: (6, 6)?

Pigeon Problem 3
It is a B!
The rule is as follows: if the square of the sum of the two bars is less than or equal to 100, it is an A. Otherwise it is a B.

Why did we spend so much time with this stupid game?
Because we wanted to show that almost all classification problems have a geometric interpretation. Check out the next 3 slides…

Pigeon Problem 1
[Figure: the class A and class B examples plotted with Left Bar on the y-axis (1 to 10) and Right Bar on the x-axis (1 to 10).]
Here is the rule again. If the left bar is smaller than the right bar, it is an A, otherwise it is a B.

Pigeon Problem 2
[Figure: the class A and class B examples plotted with Left Bar on the y-axis (1 to 10) and Right Bar on the x-axis (1 to 10).]
Let me look it up… here it is..
the rule is, if the two bars are equal sizes, it is an A. Otherwise it is a B.

Pigeon Problem 3
[Figure: the class A and class B examples plotted with Left Bar on the y-axis (10 to 100) and Right Bar on the x-axis (10 to 100).]
The rule again: if the square of the sum of the two bars is less than or equal to 100, it is an A. Otherwise it is a B.

[Figure: scatterplot of Antenna Length against Abdomen Length, showing Grasshoppers and Katydids.]
previously unseen instance = 11   5.1   7.0   ???????
• We can “project” the previously unseen instance into the same space as the database.
• We have now abstracted away the details of our particular problem. It will be much easier to talk about points in space.

Simple Linear Classifier
R.A. Fisher, 1890-1962
If previously unseen instance above the line
then class is Katydid
else class is Grasshopper
[Figure: Katydids and Grasshoppers separated by a straight line.]

Simple Quadratic Classifier
Simple Cubic Classifier
Simple Quartic Classifier
Simple Quintic Classifier
Simple…..
If previously unseen instance above the line
then class is Katydid
else class is Grasshopper
[Figure: Katydids and Grasshoppers separated by a curve.]

The simple linear classifier is defined for higher dimensional spaces…
… we can visualize it as being an n-dimensional hyperplane

It is interesting to think about what would happen in this example if we did not have the 3rd dimension…
We can no longer get perfect accuracy with the simple linear classifier…
We could try to solve this problem by using a simple quadratic classifier or a simple cubic classifier. However, as we will later see, this is probably a bad idea…

Which of the “Pigeon Problems” can be solved by the Simple Linear Classifier?
1) Perfect
2) Useless
3) Pretty Good
Problems that can be solved by a linear classifier are called linearly separable.
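
The three pigeon rules are short enough to write down as code, which makes it easy to check them against the example bars. The sketch below is only an illustration: the names `PROBLEMS`, `RULES` and `rule_accuracy` are hypothetical, and the example pairs are read off the slides as (left bar, right bar). Rule 1 is itself a linear decision boundary (the line left = right), while rule 2 picks out points lying exactly on a line, which no single line can separate from points on both sides of it.

```python
# Pigeon-problem examples, transcribed as (left_bar, right_bar) pairs.
PROBLEMS = {
    1: {"A": [(3, 4), (1.5, 5), (6, 8), (2.5, 5)],
        "B": [(5, 2.5), (5, 2), (8, 3), (4.5, 3)]},
    2: {"A": [(4, 4), (5, 5), (6, 6), (3, 3)],
        "B": [(5, 2.5), (2, 5), (5, 3), (2.5, 3)]},
    3: {"A": [(4, 4), (1, 5), (6, 3), (3, 7)],
        "B": [(5, 6), (7, 5), (4, 8), (7, 7)]},
}

# The rules exactly as stated on the slides.
RULES = {
    1: lambda left, right: "A" if left < right else "B",
    2: lambda left, right: "A" if left == right else "B",
    3: lambda left, right: "A" if (left + right) ** 2 <= 100 else "B",
}


def rule_accuracy(problem):
    """Fraction of a problem's labeled examples that its rule classifies correctly."""
    rule = RULES[problem]
    examples = [(pair, label)
                for label, pairs in PROBLEMS[problem].items()
                for pair in pairs]
    correct = sum(rule(*pair) == label for pair, label in examples)
    return correct / len(examples)


for problem in PROBLEMS:
    print(problem, rule_accuracy(problem))

# The unlabeled objects from the slides.
print(RULES[1](8, 1.5), RULES[2](7, 7), RULES[3](6, 6))
```

Each rule reproduces every labeled example of its own problem, and the three unlabeled objects come out as B, A and B, matching the answers given on the slides.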
[Figure: the three Pigeon Problems plotted as points in the plane.]

Revisiting Sidebar 2
What would happen if we created a new feature Z, where:
Z = abs(X.value - Y.value)
All blue points are perfectly aligned, so we can only see one.

A Famous Problem
R. A. Fisher’s Iris Dataset.
• 3 classes
• 50 of each class
The task is to classify Iris plants into one of 3 varieties using the Petal Length and Petal Width.
[Figure: Iris Setosa, Iris Versicolor and Iris Virginica, with a scatterplot of the three classes.]

We can generalize the piecewise linear classifier to N classes, by fitting N-1 lines. In this case we first learned the line to (perfectly) discriminate between Setosa and Virginica/Versicolor, then we learned to approximately discriminate between Virginica and Versicolor.
If petal width > 3.272 – (0.325 * petal length) then class = Virginica
Elseif petal width…

We have now seen one classification algorithm, and we are about to see more. How should we compare them?
• Predictive accuracy
• Speed and scalability
  – time to construct the model
  – time to use the model
• Robustness
  – handling noise, missing values and irrelevant features, streaming data
• Interpretability
  – understanding and insight provided by the model

Predictive Accuracy I
• How do we estimate the accuracy of our classifier?
We can use Hold Out data.
We divide the dataset into 2 partitions, called train and test. We build our models on train, and see how well we do on test.
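
A hold-out evaluation can be sketched in a few lines. The data below is the My_Collection insect table; everything else is an assumption made for illustration: the first five instances are used as train and the remaining five as test, and the classifier is a stand-in one-feature threshold on abdomen length (placed halfway between the two class means), not the simple linear classifier from the slides. The names `fit_threshold`, `predict` and `holdout_accuracy` are hypothetical.

```python
# (abdomen_length, antennae_length, insect_class) rows from My_Collection.
MY_COLLECTION = [
    (2.7, 5.5, "Grasshopper"),
    (8.0, 9.1, "Katydid"),
    (0.9, 4.7, "Grasshopper"),
    (1.1, 3.1, "Grasshopper"),
    (5.4, 8.5, "Katydid"),
    (2.9, 1.9, "Grasshopper"),
    (6.1, 6.6, "Katydid"),
    (0.5, 1.0, "Grasshopper"),
    (8.3, 6.6, "Katydid"),
    (8.1, 4.7, "Katydid"),
]


def fit_threshold(train):
    """Stand-in model (an assumption): midpoint of the class means of abdomen length.

    Needs at least one training instance of each class.
    """
    by_class = {"Grasshopper": [], "Katydid": []}
    for abdomen, _antennae, label in train:
        by_class[label].append(abdomen)
    means = [sum(values) / len(values) for values in by_class.values()]
    return sum(means) / 2


def predict(threshold, abdomen):
    """Longer abdomens are called Katydid, shorter ones Grasshopper."""
    return "Katydid" if abdomen > threshold else "Grasshopper"


def holdout_accuracy(data, n_train):
    """Build the model on the first n_train rows, then score it on the rest."""
    train, test = data[:n_train], data[n_train:]
    threshold = fit_threshold(train)
    correct = sum(predict(threshold, abdomen) == label
                  for abdomen, _antennae, label in test)
    return correct / len(test)


print(holdout_accuracy(MY_COLLECTION, n_train=5))
```

The important design point is that `fit_threshold` never sees the test rows, so the reported accuracy is an estimate of how the model does on unseen instances.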
Insect ID  Abdomen Length  Antennae Length  Insect Class
1          2.7             5.5              Grasshopper
2          8.0             9.1              Katydid
3          0.9             4.7              Grasshopper
4          1.1             3.1              Grasshopper
5          5.4             8.5              Katydid
6          2.9             1.9              Grasshopper
7          6.1             6.6              Katydid
8          0.5             1.0              Grasshopper
9          8.3             6.6              Katydid
10         8.1             4.7              Katydid

train: instances 1 to 5
test: instances 6 to 10
[Figure: scatterplot of the instances, both axes 1 to 10.]

Predictive Accuracy II
• How do we estimate the accuracy of our classifier?
We can use K-fold cross validation.
We divide the dataset into K equal sized sections. The algorithm is tested K times, each time leaving out one of the K sections from building the classifier, but using it to test the classifier instead.
Accuracy = Number of correct classifications / Number of instances in our database
[Figure: the insect table above and its scatterplot, divided into sections for K = 5.]

The Default Rate
How accurate can we be if we use no features?
The answer is called the Default Rate: the size of the most common class, over the size of the full dataset.
Default Rate = size(most common class) / size(dataset)
Examples:
• I want to predict the sex of some pregnant friend’s unborn baby. The most common class is ‘boy’, so I will always say ‘boy’.
  101 / (101 + 100) ≈ 50.25%
  I do just a tiny bit better than random guessing.
• I want to predict the sex of the nurse that will give me a flu shot next week. The most common class is ‘female’, so I will say ‘female’.
  266634 / (266634 + 45971) ≈ 85.29%

Predictive Accuracy III
• Using K-fold cross validation is a good way to set any parameters we may need to adjust in (any) classifier.
• We can do K-fold cross validation for each possible setting, and choose the model with the highest accuracy. Where there is a tie, we choose the simpler model.
• Actually, we should probably penalize the more complex models, even if they are more accurate, since more complex models are more likely to overfit (discussed later).
[Figure: three panels showing the same data with different decision boundaries, labeled Accuracy = 94%, Accuracy = 99% and Accuracy = 100%.]

Predictive Accuracy III
Accuracy = Number of correct classifications / Number of instances in our database
Accuracy is a single number; we may be better off looking at a confusion matrix. This gives us additional useful information…

                 Classified as a…
True label is…   Cat   Dog   Pig
Cat              100   0     0
Dog              9     90    1
Pig              45    45    10

Speed and Scalability I
We need to consider the time and space requirements for the two distinct phases of classification:
• Time to construct the classifier
  • In the case of the simple linear classifier, this is the time taken to fit the line, which is linear in the number of instances.
• Time to use the model
  • In the case of the simple linear classifier, this is the time taken to test which side of the line the unlabeled instance is on. This can be done in constant time.
As we shall see, some classification algorithms are very efficient in one aspect, and very poor in the other.

Robustness I
We need to consider what happens when we have:
• Noise
  • For example, a person’s age could have been mistyped as 650 instead of 65. How does this affect our classifier? (This is important only for building the classifier; if the instance to be classified is noisy we can do nothing.)
• Missing values
  • For example, suppose we want to classify an insect, but we only know the abdomen length (X-axis), and not the antennae length (Y-axis). Can we still classify the instance?
Robustness II
We need to consider what happens when we have:
• Irrelevant features
For example, suppose we want to classify people as either
• Suitable_Grad_Student
• Unsuitable_Grad_Student
And it happens that scoring more than 5 on a particular test is a perfect indicator for this problem…
If we also use “hair_length” as a feature, how will this affect our classifier?

Robustness III
We need to consider what happens when we have:
• Streaming data
For many real world problems, we don’t have a single fixed dataset. Instead, the data continuously arrives, potentially forever… (stock market, weather data, sensor data etc.)
Can our classifier handle streaming data?

Interpretability
Some classifiers offer a bonus feature. The structure of the learned classifier tells us something about the domain.
As a trivial example, if we try to classify people’s health risks based on just their height and weight, we could gain the following insight (based on the observation that a single linear classifier does not work well, but two linear classifiers do): there are two ways to be unhealthy, being obese and being too skinny.
[Figure: scatterplot of Weight against Height.]

Nearest Neighbor Classifier
Evelyn Fix, 1904-1965
Joe Hodges, 1922-2000
If the nearest instance to the previously unseen instance is a Katydid
class is Katydid
else
class is Grasshopper
[Figure: scatterplot of Antenna Length against Abdomen Length, showing Katydids and Grasshoppers.]

We can visualize the nearest neighbor algorithm in terms of a decision surface…
Note that we don’t actually have to construct these surfaces; they are simply the implicit boundaries that divide the space into regions “belonging” to each instance.
This division of space is called Dirichlet Tessellation (or Voronoi diagram, or Thiessen regions).
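
The nearest neighbor rule above can be sketched directly, without building any decision surface. This is a minimal illustration using the My_Collection table and the previously unseen instance (abdomen 5.1, antennae 7.0); it assumes plain Euclidean distance on the raw feature values, and the function name `nearest_neighbor_label` is hypothetical.

```python
import math

# (abdomen_length, antennae_length, insect_class) rows from My_Collection.
MY_COLLECTION = [
    (2.7, 5.5, "Grasshopper"),
    (8.0, 9.1, "Katydid"),
    (0.9, 4.7, "Grasshopper"),
    (1.1, 3.1, "Grasshopper"),
    (5.4, 8.5, "Katydid"),
    (2.9, 1.9, "Grasshopper"),
    (6.1, 6.6, "Katydid"),
    (0.5, 1.0, "Grasshopper"),
    (8.3, 6.6, "Katydid"),
    (8.1, 4.7, "Katydid"),
]


def nearest_neighbor_label(training, query):
    """Return the class of the training instance closest to `query`.

    `query` is an (abdomen_length, antennae_length) pair. Every training
    instance is scanned, so one classification costs time linear in the
    size of the training set.
    """
    nearest = min(training, key=lambda row: math.dist(row[:2], query))
    return nearest[2]


print(nearest_neighbor_label(MY_COLLECTION, (5.1, 7.0)))  # Katydid
```

Note the trade-off this makes visible: there is no model to construct at all, but using the classifier requires a pass over the whole training set.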
The nearest neighbor algorithm is sensitive to outliers…
The solution is to…

We can generalize the nearest neighbor algorithm to the K-nearest neighbor (KNN) algorithm.
We measure the distance to the nearest K instances, and let them vote. K is typically chosen to be an odd number.
[Figure: the same query classified with K=1 and with K=3.]

The nearest neighbor algorithm is sensitive to irrelevant features…
Suppose the following is true: if an insect’s antenna is longer than 5.5 it is a Katydid, otherwise it is a Grasshopper.
[Figure: training data plotted on a single antenna length axis, 1 to 10.]
Using just the antenna length we get perfect classification!
Suppose however, we add in an irrelevant feature, for example the insect’s mass.
[Figure: the same training data plotted with antenna length on one axis and mass on the other.]
Using both the antenna length and the insect’s mass with the 1-NN algorithm we get the wrong classification!

How do we mitigate the nearest neighbor algorithm’s sensitivity to irrelevant features?
• Use more training instances
• Ask an expert what features are relevant to the task
• Use statistical tests to try to determine which features are useful
• Search over feature subsets (in the next slide we will see why this is hard)

Why searching over feature subsets is hard
Suppose you have the following classification problem, with 100 features, where it happens that Features 1 and 2 (the X and Y below) give perfect classification, but all 98 of the other features are irrelevant…
[Figure: the data plotted using Features 1 and 2 together, using only Feature 1, and using only Feature 2.]
Using all 100 features will give poor results, but so will using only Feature 1, and so will using only Feature 2! Of the 2^100 – 1 possible subsets of the features, only one really works.

[Figure: lattice of all subsets of four features, from the single features 1, 2, 3, 4 through the pairs and triples up to 1,2,3,4.]
• Forward Selection
• Backward Elimination
• Bi-directional Search

The nearest neighbor algorithm is sensitive to the units of measurement
X axis measured in centimeters, Y axis measured in dollars.
The nearest neighbor to the pink unknown instance is red.
X axis measured in millimeters, Y axis measured in dollars.
The nearest neighbor to the pink unknown instance is blue.
One solution is to normalize the units to pure numbers. Typically the features are Z-normalized to have a mean of zero and a standard deviation of one.
X = (X – mean(X)) / std(X)

We can speed up the nearest neighbor algorithm by “throwing away” some data. This is called data editing.
Note that this can sometimes improve accuracy!
One possible approach: delete all instances that are surrounded by members of their own class.
We can also speed up classification with indexing.

Up to now we have assumed that the nearest neighbor algorithm uses the Euclidean Distance, however this need not be the case…
$D(Q,C) = \sqrt{\sum_{i=1}^{n} (q_i - c_i)^2}$
$D(Q,C) = \sqrt[p]{\sum_{i=1}^{n} |q_i - c_i|^p}$
Other options:
• Max (p=inf)
• Manhattan (p=1)
• Weighted Euclidean
• Mahalanobis

…In fact, we can use the nearest neighbor algorithm with any distance/similarity function
For example, is “Faloutsos” Greek or Irish? We could compare the name “Faloutsos” to a database of names using string edit distance…
edit_distance(Faloutsos, Keogh) = 8
edit_distance(Faloutsos, Gunopulos) = 6
Hopefully, the similarity of the name (particularly the suffix) to other Greek names would mean the nearest neighbor is also a Greek name.

ID  Name          Class
1   Gunopulos     Greek
2   Papadopoulos  Greek
3   Kollios       Greek
4   Dardanos      Greek
5   Keogh         Irish
6   Gough         Irish
7   Greenhaugh    Irish
8   Hadleigh      Irish

Specialized distance measures exist for DNA strings, time series, images, graphs, videos, sets, fingerprints etc…

Edit Distance Example
It is possible to transform any string Q into string C, using only Substitution, Insertion and Deletion.
Assume that each of these operators has a cost associated with it.
How similar are the names “Peter” and “Piotr”?
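
One way to compute the cheapest transformation is the standard dynamic programming recurrence for string edit distance. The sketch below is an illustration rather than the method used to produce the slides; it assumes each substitution, insertion and deletion costs 1 unit, and the function name and cost parameters are hypothetical.

```python
def edit_distance(q, c, sub_cost=1, ins_cost=1, del_cost=1):
    """Cost of the cheapest way to turn string q into string c.

    Classic dynamic program: prev[j] holds the cost of transforming the
    first i-1 characters of q into the first j characters of c.
    """
    # Turning the empty prefix of q into c[:j] takes j insertions.
    prev = [j * ins_cost for j in range(len(c) + 1)]
    for i, q_char in enumerate(q, start=1):
        # Turning q[:i] into the empty string takes i deletions.
        curr = [i * del_cost]
        for j, c_char in enumerate(c, start=1):
            curr.append(min(
                prev[j] + del_cost,        # delete q_char
                curr[j - 1] + ins_cost,    # insert c_char
                prev[j - 1] + (0 if q_char == c_char else sub_cost),  # match or substitute
            ))
        prev = curr
    return prev[-1]


print(edit_distance("Peter", "Piotr"))          # 3
print(edit_distance("Faloutsos", "Keogh"))      # 8
print(edit_distance("Faloutsos", "Gunopulos"))  # 6
```

With unit costs this reproduces the two distances quoted for “Faloutsos”, so a 1-NN lookup over the name table would pick a Greek name as the nearest neighbor.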
Assume the following cost function:
Substitution  1 Unit
Insertion     1 Unit
Deletion      1 Unit
D(Peter,Piotr) is 3
The similarity between two strings can be defined as the cost of the cheapest transformation from Q to C.
Note that for now we have ignored the issue of how we can find this cheapest transformation.
Peter
  Substitution (i for e)
Piter
  Insertion (o)
Pioter
  Deletion (e)
Piotr

Setting Parameters and Overfitting
You need to classify widgets, you get a training set..
• You could use a Linear Classifier or Nearest Neighbor … (Model Selection)
• Nearest Neighbor (Parameter Selection, or parameter tuning, tweaking)
  • You could use 1NN, 3NN, 5NN…
  • You could use Euclidean Distance, LP1, LPinf, Mahalanobis…
  • You could do some data editing…
  • You could do some feature weighting…
  • You could ….
• “Linear Classifier”
  • You could use a Constant classifier
  • You could use a Linear Classifier
  • You could use a Quadratic Classifier
  • You could….

Overfitting
Overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship.
Overfitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations.
A model which has been overfit will generally have poor predictive performance, as it can exaggerate minor fluctuations in the data.

Suppose we need to solve a classification problem.
We are not sure if we should use the..
• Simple linear classifier or the
• Simple quadratic classifier
How do we decide which to use?
We do cross validation or leave-one-out and choose the best one.

• Simple linear classifier gets 81% accuracy
• Simple quadratic classifier gets 99% accuracy
[Figure: two scatterplots of the same data, both axes 10 to 100, one with a linear and one with a quadratic decision boundary.]

• Simple linear classifier gets 96% accuracy
• Simple quadratic classifier gets 97% accuracy

This problem is greatly exacerbated by having too little data
• Simple linear classifier gets 90% accuracy
• Simple quadratic classifier gets 95% accuracy

What happens as we have more and more training examples?
The accuracy for all models goes up!
The chance of making a mistake (choosing the wrong model) goes down.
Even if we make a mistake, it will not matter too much (because we would learn a degenerate quadratic that is basically a straight line).
• Simple linear 70% accuracy, simple quadratic 90% accuracy
• Simple linear 90% accuracy, simple quadratic 95% accuracy
• Simple linear 99.999999% accuracy, simple quadratic 99.999999% accuracy

One Solution: Charge Penalty for complex models
• For example, for the simple {polynomial} classifier, we could “charge” 1% for every increase in the degree of the polynomial
• Simple linear classifier gets 90.5% accuracy, minus 0, equals 90.5%
• Simple quadratic classifier gets 97.0% accuracy, minus 1, equals 96.0%
• Simple cubic classifier gets 97.05% accuracy, minus 2, equals 95.05%
[Figure: three panels with decision boundaries of increasing degree, labeled Accuracy = 90.5%, Accuracy = 97.0% and Accuracy = 97.05%.]

One Solution: Charge Penalty for complex models
• For example, for the simple {polynomial} classifier, we could charge 1% for every increase in the degree of the polynomial.
• There are more principled ways to charge penalties
• In particular, there is a technique called Minimum Description Length (MDL)

Appendix

Types of Attributes
• There are different types of attributes
  – Nominal (includes Boolean)
    • Examples: ID numbers, eye color, zip codes, sex
  – Ordinal
    • Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
  – Interval
    • Examples: calendar dates, temperatures in Celsius or Fahrenheit.
  – Ratio
    • Examples: temperature in Kelvin, length, time, counts

Properties of Attribute Values
• The type of an attribute depends on which of the following properties it possesses:
  – Distinctness: = ≠
  – Order: < >
  – Addition: + -
  – Multiplication: * /
  – Nominal attribute: distinctness
  – Ordinal attribute: distinctness & order
  – Interval attribute: distinctness, order & addition
  – Ratio attribute: all 4 properties

Properties of Attribute Values
– Nominal attribute: distinctness
– We can say
  • Jewish = Jewish
  • Catholic ≠ Muslim
– We cannot say
  • Jewish < Buddhist, even though (2 < 3) is allowed
  • (Jewish + Muslim)/2
  • Sqrt(Atheist), even though Sqrt(1) is allowed
Key: Atheist: 1, Jewish: 2, Buddhist: 3

Name  Religion  Ad
Joe   1         12
Sue   2         61
Cat   1         34
Bob   3         65
Tim   1         54
Jin   3         44

Properties of Attribute Values
– Ordinal attribute: distinctness & order
– We can say {newborn, infant, toddler, child, teen, adult}
  • infant = infant
  • newborn < toddler
– We cannot say
  • newborn + child
  • infant / newborn
  • log(child)
Key: newborn: 1, infant: 2, toddler: 3, etc

Name  lifestage  Ad
Joe   1          12
Sue   2          61

Properties of Attribute Values
• There are a handful of tricky cases….
– Ordinal attribute: distinctness & order
– If we have {Sunday, Monday, Tuesday, Wednesday, Thursday, Friday, Saturday}
– Then we can clearly say
  • Sunday = Sunday
  • Sunday != Tuesday
– But can we say Sunday < Tuesday?
– A similar problem occurs with the degrees of an angle…

Properties of Attribute Values
– Interval attribute: distinctness, order & addition
– Suppose it is 10 degrees Celsius
– We can say it is not 11 degrees Celsius
  • 10 ≠ 11
– We can say it is colder than 15 degrees Celsius
  • 10 < 15
– We can say closing a window will make it two degrees hotter
  • NewTemp = 10 + 2
– We cannot say that it is twice as hot as 5 degrees Celsius
  • 10 / 2 = 5 No!

Properties of Attribute Values
• The type of an attribute depends on which of the following properties it possesses:
– Ratio attribute: all 4 properties
– We can do anything!
  • So 10 kelvin really is twice as hot as 5 kelvin
– Of course, distinctness is tricky to define with real numbers.
  • is 3.1415926535897 = 3.141592653589?

Attribute Type: Nominal
  Description: The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠)
  Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
  Operations: mode, entropy, contingency correlation, χ² test
Attribute Type: Ordinal
  Description: The values of an ordinal attribute provide enough information to order objects. (<, >)
  Examples: hardness of minerals, {good, better, best}, grades, street numbers
  Operations: median, percentiles, rank correlation, run tests, sign tests
Attribute Type: Interval
  Description: For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, -)
  Examples: calendar dates, temperature in Celsius or Fahrenheit
  Operations: mean, standard deviation, Pearson's correlation, t and F tests
Attribute Type: Ratio
  Description: For ratio variables, both differences and ratios are meaningful. (*, /)
  Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current
  Operations: geometric mean, harmonic mean, percent variation

Attribute Level: Nominal
  Transformation: Any permutation of values
  Comments: If all employee ID numbers were reassigned, would it make any difference?
Attribute Level: Ordinal
  Transformation: An order preserving change of values, i.e., new_value = f(old_value) where f is a monotonic function.
  Comments: An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}, or by {A, B, C}
Attribute Level: Interval
  Transformation: new_value = a * old_value + b where a and b are constants
  Comments: Thus, the Fahrenheit and Celsius temperature scales differ in terms of where their zero value is and the size of a unit (degree).
Attribute Level: Ratio
  Transformation: new_value = a * old_value
  Comments: Length can be measured in meters or feet.

Discrete and Continuous Attributes
• Discrete Attribute
  – Has only a finite or countably infinite set of values
  – Examples: zip codes, counts, or the set of words in a collection of documents
  – Often represented as integer variables.
  – Note: binary attributes are a special case of discrete attributes
• Continuous Attribute
  – Has real numbers as attribute values
  – Examples: temperature, height, or weight.
  – As a practical matter, real values can only be measured and represented using a finite number of digits.
  – Continuous attributes are typically represented as floating-point variables.

Discrete and Continuous Attributes
• We can convert between Continuous and Discrete variables.
  – For example, below we have converted real-valued heights to ordinal {short, medium, tall}
• Conversions of Discrete to Continuous are less common, but possible.
• Why convert? Sometimes the algorithms we want to use are only defined for a certain type of data. For example, hashing or Bloom filters are best defined for Discrete data.
• Conversion may involve making choices, for example, how many “heights”, and where do we place the cutoffs (equal width, equal bin counts etc.). These choices may affect the performance of the algorithms.

Height   {short, medium, tall} coded as 1, 2, 3
6’3’’    3
5’1’’    1
5’7’’    2
5’3’’    1

Discrete and Continuous Attributes
• We can convert between Discrete and Continuous variables.
– For example, below we have converted discrete words to a real-valued time series In the beginning God created the heaven and the earth. And the earth was without form, and void; and darkness was upon the face of the deep. And the Spirit of God moved upon the face of The waters. And God Said :: :: :: :: :: :: With you all. Amen. There are 783,137 words in the King James Bible There are 12,143 unique words in the King James Bible Local frequency of “God” in King James Bible 0 0 1 2 3 4 5 6 7 8 x 10 5 Even if the data is given to you as continuous, it might really be intrinsically discrete partID size Ad 12323 7.801 12 5324 7.802 61 75654 32.09 34 34523 32.13 65 424 47.94 54 25324 62.07 44 Even if the data is given to you as continuous, it might really be intrinsically discrete 0 10000 stroke Bing Hu, Thanawin Rakthanmanon, Yuan Hao, Scott Evans, Stefano Lonardi, and Eamonn Keogh (2011). Discovering the Intrinsic Cardinality and Dimensionality of Time Series using MDL. ICDM 2011 glide 0 push off 20000 stroke glide 1000 2000 push off glide 3000 4000 Data can be Missing • Data can be missing for many reasons. – – – – Someone may decline-to-state The attribute may be the result of an expensive test Sensor failure etc Handling missing values • Eliminate Data Objects • Estimate Missing Values • Ignore the Missing Value Data can be “Missing”: Special Case • In some case we expect most of the data to be missing. • • • • Consider a dataset containing people’s rankings of movies (or books, or music etc) The dimensionality is very high, there are lots of movies However, most people have only seen a tiny fraction of these So the movie ranking database will be very sparse. • Some platforms/languages explicitly support sparse matrices (including Matlab) • Here, inferring a missing value is equivalent to asking a question like “How much would Joe like the movie MASH?” See “Collaborative filtering” / “ Recommender Systems” Joe Jaws E.T. 
Document Data is also Sparse
• Each document is a `term' vector (vector-space representation)
– each term is a component (attribute) of the vector,
– the value of each component is the number of times the corresponding term occurs in the document.

[Table: document-term matrix. Rows: Doc1 to Doc5. Columns: the, harry, rebel, god, cat, dog, help, near. The “the” column reads 42, 22, 32, 29, 9; the remaining cells are mostly empty.]

Graph Data is also Typically Sparse
• The elements of the matrix indicate whether pairs of vertices are connected or not in the graph.

Not all datasets naturally fit neatly into a rectangular matrix… We may have to deal with such data as special cases.

DNA Data
First 100 base pairs of the chimp’s mitochondrial DNA:
gtttatgtagcttaccccctcaaagcaatacactgaaaatgtttcgacgggtttacatcaccccataaacaaacaggtttggtcctagcctttctattag
First 100 base pairs of the human’s mitochondrial DNA:
gatcacaggtctatcaccctattaaccactcacgggagctctccatgcatttggtattttcgtctggggggtgtgcacgcgatagcattgcgagacgctg

Transaction Data
TID    Items
1      Bread, Coke, Milk
2      Beer, Bread
3      Beer, Coke, Diaper, Milk
4      Beer, Bread, Diaper, Milk
5      Coke, Diaper, Milk

Spatio-Temporal Data

Data Quality
• What kinds of data quality problems?
• How can we detect problems with the data?
• What can we do about these problems?
• Examples of data quality problems:
– Redundancy
– Noise and outliers
– Missing values
– Duplicate data

Redundancy
• Various subsets of the features are often related or correlated in some way. They are partly redundant with each other.
• For problems in signal processing, this redundancy is typically helpful. But for data mining, redundancy almost always hurts.

0.3 0.4 0.4 0.5 0.6 0.8 0.8 0.9 0.8 0.7

     Height F/I    Height Meters    Weight
1    4’10’’        1.47             166
2    6’3’’         1.90             210
3    5’11’’        1.80             215
4    5’4’’         1.62             150

Why Redundancy Hurts
• Some data mining algorithms scale poorly in dimensionality, say O(2^D).
For the problem below, this means we take O(2^3) time, when we really only needed O(2^2) time.
• We can see some data mining algorithms as counting evidence across a row (Nearest Neighbor Classifier, Bayes Classifier etc). If we have redundant features, we will “overcount” evidence.
• It is probable that the redundant features will add errors.
• For example, suppose that person 1 really is exactly 4’10’’. Then they are exactly 1.4732m, but the system recorded them as 1.47m. So we have introduced 0.0032m of error. This is a tiny amount, but if we had 100s of such attributes, we would be introducing a lot of error.
• The curse of dimensionality (discussed later in the quarter)

As we will see in the course, we can try to fix this issue with data aggregation, dimensionality reduction techniques, feature selection, feature generation etc

     Height F/I    Height Meters    Weight
1    4’10’’        1.47             166
2    6’3’’         1.90             210
3    5’11’’        1.80             215
4    5’4’’         1.62             150

Detecting Redundancy
• By creating a scatterplot of “Height F/I” vs. “Height Meters” we can see the redundancy, and measure it with correlation.
• However, if we have 100 features, we clearly cannot visually check 100^2 scatterplots.
• Note that two features can have zero correlation, but still be related/redundant. There are more sophisticated tests of “relatedness”

Noise
• Noise refers to modification of original values
– Examples: distortion of a person’s voice when talking on a poor quality phone.
• The two images are one man’s ECGs, taken about an hour apart. The differences are mostly due to sensor noise (MIT-BIH Atrial Fibrillation Database record afdb/08405)
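Returning to the Detecting Redundancy slide, the correlation check can be sketched as follows. This is a minimal illustration, not a prescribed method: it uses the four rows of the height/weight table, converts feet and inches to total inches (an assumption about how one would make that column numeric), and computes the Pearson correlation with a hypothetical helper named pearson. The last part uses made-up points on a parabola to illustrate how two features can have zero correlation and still be related.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Columns from the slide's table; feet/inches converted to total inches.
height_inches = [4 * 12 + 10, 6 * 12 + 3, 5 * 12 + 11, 5 * 12 + 4]
height_meters = [1.47, 1.90, 1.80, 1.62]
weight = [166, 210, 215, 150]

# The two height columns measure the same thing in different units,
# so their correlation should be very close to 1.
r_redundant = pearson(height_inches, height_meters)

# Height and weight are related, but not duplicates of each other.
r_weight = pearson(height_meters, weight)

# Made-up example: y is fully determined by x, yet the correlation is zero.
xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]
r_parabola = pearson(xs, ys)

print(f"Height F/I vs Height Meters: r = {r_redundant:.5f}")
print(f"Height Meters vs Weight:     r = {r_weight:.5f}")
print(f"x vs x squared:              r = {r_parabola:.5f}")
```

A correlation near 1 flags the two height columns as redundant, while the parabola case shows why correlation alone can miss a relationship.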