UCR CS 235 Data Mining, Winter 2011

Important Note
• All information about grades/homeworks/projects etc. will be given out at the next meeting
• Someone give me a 15-minute warning before the end of this class

What Is Data Mining?
• Data mining (knowledge discovery from data)
– Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
• Alternative names
– Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
• "Data mining" is not
– Databases
– (Deductive) query processing
– Expert systems or small ML/statistical programs

What is Data Mining? An example from yesterday's newspapers.

What is Data Mining? Example from the NBA
• Play-by-play information recorded by teams
– Who is on the court
– Who shoots
– Results
• Coaches want to know what works best
– Plays that work well against a given team
– Good/bad player matchups
• Advanced Scout (from IBM Research) is a data mining tool to answer these questions
[Figure: overall shooting percentage vs. shooting percentage with Starks, Houston, and Ward playing; see http://www.nba.com/news_feat/beyond/0126.html]

What is Data Mining? Example from Keogh/Mueen
• Beet Leafhopper (Circulifer tenellus): a voltage source and an input resistor are wired through conductive glue to the insect and to the soil near the plant, so the voltage reading traces the insect's stylet activity at the plant membrane
[Figure: approximately 14.4 minutes of insect telemetry, with two similar instances of behavior at offsets 3,664 and 9,036]

All these examples show…
• Lots of raw data in
• Some data mining
• Facts, rules, patterns out

Knowledge Discovery in Databases: Process
Selection → Target Data → Preprocessing → Preprocessed Data → Data Mining → Patterns → Interpretation/Evaluation → Knowledge ("There exists a planet at…")
adapted from: U. Fayyad, et al.
(1995), "From Knowledge Discovery to Data Mining: An Overview," Advances in Knowledge Discovery and Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press

Course Outline
1. Introduction: What is data mining?
– What makes it a new and unique discipline?
– Relationship between Data Warehousing, On-line Analytical Processing, and Data Mining
– Data mining tasks: Clustering, Classification, Rule learning, etc.
2. Data mining process
– Task identification
– Data preparation/cleansing
3. Association Rule mining
4. Classification
– Bayesian
– Nearest Neighbor
– Linear Classifier
– Tree-based approaches
5. Prediction
– Regression
– Neural Networks
6. Clustering
– Distance-based approaches
– Density-based approaches
7. Anomaly Detection
– Distance-based
– Density-based
– Model-based
8. Similarity Search
– Problem Description
– Algorithms

Data Mining: Classification Schemes
• General functionality
– Descriptive data mining
– Predictive data mining
• Different views, different classifications
– Kinds of data to be mined
– Kinds of knowledge to be discovered
– Kinds of techniques utilized
– Kinds of applications adapted

Data Mining: History of the Field
• Knowledge Discovery in Databases workshops started '89
– Now a conference under the auspices of ACM SIGKDD
– IEEE conference series started 2001
• Key founders / technology contributors:
– Usama Fayyad, JPL (then Microsoft, then his own company, Digimine, now Yahoo!
Research labs, now CEO at Open Insights)
– Gregory Piatetsky-Shapiro (then GTE, now his own data mining consulting company, Knowledge Stream Partners)
– Rakesh Agrawal (IBM Research)
• The term "data mining" has been around since at least 1983, as a pejorative term in the statistics community

Data Mining: The Big Players

A data mining problem…
Wei Wang – School of Life Science, Fudan University, China
Wei Wang – Nonlinear Systems Laboratory, Department of Mechanical Engineering, MIT
Wei Wang – University of Maryland Baltimore County
Wei Wang – University of Naval Engineering
Wei Wang – ThinkIT Speech Lab, Institute of Acoustics, Chinese Academy of Sciences
Wei Wang – Rutgers University, New Brunswick, NJ, USA
Wei Wang – Purdue University Indianapolis
Wei Wang – INRIA Sophia Antipolis, Sophia Antipolis, France
Wei Wang – Institute of Computational Linguistics, Peking University
Wei Wang – National University of Singapore
Wei Wang – Nanyang Technological University, Singapore
Wei Wang – Computer and Electronics Engineering, University of Nebraska-Lincoln, NE, USA
Wei Wang – The University of New South Wales, Australia
Wei Wang – Language Weaver, Inc.
Wei Wang – The Chinese University of Hong Kong, Mechanical and Automation Engineering
Wei Wang – Center for Engineering and Scientific Computation, Zhejiang University, China
Wei Wang – Fudan University, Shanghai, China
Wei Wang – University of North Carolina at Chapel Hill

What Can Data Mining Do?
• Classify
– Categorical, Regression
• Cluster
• Summarize
– Summary statistics, Summary rules
• Link Analysis / Model Dependencies
– Association rules
• Sequence analysis
– Time-series analysis, Sequential associations
• Detect Deviations

Why is Data Mining Hard?
• Scalability
• High Dimensionality
• Heterogeneous and Complex Data
• Data Ownership and Distribution
• Non-traditional Analysis
• Overfitting
• Privacy issues

Scale of Data
Organization      Scale of Data
Walmart           ~20 million transactions/day
Google            ~8.2 billion Web pages
Yahoo             ~10 GB Web data/hr
NASA satellites   ~1.2 TB/day
NCBI GenBank      ~22 million genetic sequences
France Telecom    29.2 TB
UK Land Registry  18.3 TB
AT&T Corp         26.2 TB

"The great strength of computers is that they can reliably manipulate vast amounts of data very quickly. Their great weakness is that they don't have a clue as to what any of that data actually means" (S. Cass, IEEE Spectrum, Jan 2004)

The Classification Problem (informal definition)
Given a collection of annotated data (in this case, five instances of Katydids and five of Grasshoppers), decide what type of insect the unlabeled example is. Katydid or Grasshopper?

More instances of the classification problem, each given a collection of annotated data:
• Spam or (legitimate) email?
• Spanish or Polish?
• Stinging Nettle or False Nettle?
• Greek or Irish? (Greek: Gunopulos, Papadopoulos, Kollios, Dardanos, Tsotras. Irish: Keogh, Gough, Greenhaugh, Hadleigh.)
Grasshoppers vs. Katydids: for any domain of interest, we can measure features:
• Color {Green, Brown, Gray, Other}
• Has Wings?
• Abdomen Length
• Thorax Length
• Antennae Length
• Mandible Size
• Spiracle Diameter
• Leg Length

We can store features in a database. The classification problem can now be expressed as:
• Given a training database (My_Collection), predict the class label of a previously unseen instance.

My_Collection
Insect ID   Abdomen Length   Antennae Length   Insect Class
1           2.7              5.5               Grasshopper
2           8.0              9.1               Katydid
3           0.9              4.7               Grasshopper
4           1.1              3.1               Grasshopper
5           5.4              8.5               Katydid
6           2.9              1.9               Grasshopper
7           6.1              6.6               Katydid
8           0.5              1.0               Grasshopper
9           8.3              6.6               Katydid
10          8.1              4.7               Katydid
previously unseen instance: 11   5.1   7.0   ???????

[Scatterplot: Antenna Length vs. Abdomen Length for Grasshoppers and Katydids]
We will also use this larger dataset as a motivating example…

Each of these data objects is called an…
• exemplar
• (training) example
• instance
• tuple

We will return to the previous slide in two minutes. In the meantime, we are going to play a quick game. I am going to show you some classification problems which were shown to pigeons! Let us see if you are as smart as a pigeon!

Pigeon Problem 1
Examples of class A: (3, 4), (1.5, 5), (6, 8), (2.5, 5). Examples of class B: (5, 2.5), (5, 2), (8, 3), (4.5, 3). (Each pair gives the heights of the left and right bars.)
What class is this object: (8, 1.5)?

Pigeon Problem 1
This is a B! Here is the rule: if the left bar is smaller than the right bar, it is an A; otherwise it is a B.

Pigeon Problem 2
Examples of class A: (4, 4), (5, 5), (6, 6), (3, 3). Examples of class B: (5, 2.5), (2, 5), (5, 3), (2.5, 3).
Oh! This one's hard!
Even I know this one: what about (7, 7)?

Pigeon Problem 2
Examples of class A: (4, 4), (5, 5), (6, 6), (3, 3). Examples of class B: (5, 2.5), (2, 5), (5, 3), (2.5, 3).
The rule is as follows: if the two bars are equal sizes, it is an A; otherwise it is a B. So (7, 7) is an A.

Pigeon Problem 3
Examples of class A: (4, 4), (1, 5), (6, 3), (3, 7). Examples of class B: (5, 6), (7, 5), (4, 8), (7, 7).
This one is really hard! What is (6, 6), A or B?

Pigeon Problem 3
It is a B! The rule is as follows: if the square of the sum of the two bars is less than or equal to 100, it is an A; otherwise it is a B.

Why did we spend so much time with this game? Because we wanted to show that almost all classification problems have a geometric interpretation; check out the next 3 slides…

Pigeon Problem 1
[Scatterplot of Left Bar vs. Right Bar] Class A: (3, 4), (1.5, 5), (6, 8), (2.5, 5). Class B: (5, 2.5), (5, 2), (8, 3), (4.5, 3).
Here is the rule again: if the left bar is smaller than the right bar, it is an A; otherwise it is a B.

Pigeon Problem 2
[Scatterplot of Left Bar vs. Right Bar] Class A: (4, 4), (5, 5), (6, 6), (3, 3). Class B: (5, 2.5), (2, 5), (5, 3), (2.5, 3).
Let me look it up… here it is: the rule is, if the two bars are equal sizes, it is an A; otherwise it is a B.

Pigeon Problem 3
[Scatterplot of Left Bar vs. Right Bar, axes running to 100] Class A: (4, 4), (1, 5), (6, 3), (3, 7). Class B: (5, 6), (7, 5), (4, 8), (7, 7).
The rule again: if the square of the sum of the two bars is less than or equal to 100, it is an A; otherwise it is a B.

Grasshoppers vs. Katydids
[Scatterplot: Antenna Length vs. Abdomen Length]
previously unseen instance = (ID 11, Abdomen 5.1, Antennae 7.0, Class ???????)
• We can "project" the previously unseen instance into the same space as the database.
• We have now abstracted away the details of our particular problem. It will be much easier to talk about points in space.
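The three pigeon-problem rules above are precise enough to run as code. Here is a minimal sketch (Python; the function names are mine, not from the lecture) that restates each rule and checks it against the query objects:

```python
def rule1(left, right):
    # Pigeon Problem 1: if the left bar is smaller than the right bar,
    # it is an A; otherwise it is a B.
    return "A" if left < right else "B"

def rule2(left, right):
    # Pigeon Problem 2: if the two bars are equal sizes, it is an A;
    # otherwise it is a B.
    return "A" if left == right else "B"

def rule3(left, right):
    # Pigeon Problem 3: if the square of the sum of the two bars is
    # less than or equal to 100, it is an A; otherwise it is a B.
    return "A" if (left + right) ** 2 <= 100 else "B"

print(rule1(8, 1.5))  # query object from Problem 1 -> B
print(rule2(7, 7))    # query object from Problem 2 -> A
print(rule3(6, 6))    # query object from Problem 3 -> B
```

Note that rule1 and rule2 are linear conditions on the two bar heights, while rule3 is not; this is exactly the geometric interpretation sketched above.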
Simple Linear Classifier (R. A. Fisher, 1890–1962)
If the previously unseen instance is above the line, then class is Katydid; else class is Grasshopper.

The simple linear classifier is defined for higher dimensional spaces… we can visualize it as an n-dimensional hyperplane.

It is interesting to think about what would happen in this example if we did not have the 3rd dimension… We can no longer get perfect accuracy with the simple linear classifier… We could try to solve this problem by using a simple quadratic classifier or a simple cubic classifier. However, as we will later see, this is probably a bad idea…

Which of the "Pigeon Problems" can be solved by the Simple Linear Classifier?
1) Perfect
2) Useless
3) Pretty Good
Problems that can be solved by a linear classifier are called linearly separable.

A Famous Problem: R. A. Fisher's Iris Dataset
• 3 classes
• 50 of each class
The task is to classify Iris plants into one of 3 varieties (Iris Setosa, Iris Versicolor, Iris Virginica) using the Petal Length and Petal Width.

We can generalize the piecewise linear classifier to N classes by fitting N−1 lines. In this case we first learned the line to (perfectly) discriminate between Setosa and Virginica/Versicolor, then we learned to approximately discriminate between Virginica and Versicolor.

If petal width > 3.272 − (0.325 * petal length) then class = Virginica
Elseif petal width…

We have now seen one classification algorithm, and we are about to see more. How should we compare them?
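As an aside, the Virginica rule just given is concrete enough to execute. A minimal sketch (Python; since the slide elides the second branch with "Elseif petal width…", the non-Virginica case is left as a single placeholder label of my own):

```python
def classify_iris(petal_length, petal_width):
    # First line of the piecewise linear classifier from the slide:
    # if petal width > 3.272 - (0.325 * petal length) then Virginica.
    if petal_width > 3.272 - 0.325 * petal_length:
        return "Virginica"
    # The slide's remaining "Elseif petal width…" branch (separating
    # Setosa from Versicolor) is not given, so we do not guess it here.
    return "Setosa or Versicolor"

print(classify_iris(6.0, 2.5))  # well above the line -> Virginica
print(classify_iris(1.4, 0.2))  # typical Setosa measurements -> not Virginica
```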
• Predictive accuracy
• Speed and scalability
– time to construct the model
– time to use the model
– efficiency in disk-resident databases
• Robustness
– handling noise, missing values, irrelevant features, and streaming data
• Interpretability
– understanding and insight provided by the model

Predictive Accuracy I
• How do we estimate the accuracy of our classifier? We can use K-fold cross validation.
We divide the dataset into K equal-sized sections. The algorithm is tested K times, each time leaving out one of the K sections from building the classifier, but using it to test the classifier instead. (In the illustration, K = 5, using the My_Collection insect table from before.)

Accuracy = Number of correct classifications / Number of instances in our database

Predictive Accuracy II
• Using K-fold cross validation is a good way to set any parameters we may need to adjust in (any) classifier.
• We can do K-fold cross validation for each possible setting, and choose the model with the highest accuracy. Where there is a tie, we choose the simpler model.
• Actually, we should probably penalize the more complex models, even if they are more accurate, since more complex models are more likely to overfit (discussed later).
[Three candidate models, with accuracies of 94%, 100%, and 100%.]

Predictive Accuracy III
Accuracy = Number of correct classifications / Number of instances in our database
Accuracy is a single number; we may be better off looking at a confusion matrix. This gives us additional useful information…
True label →        Cat   Dog   Pig
Classified as Cat   100     9    45
Classified as Dog     0    90    45
Classified as Pig     0     1    10

Speed and Scalability I
We need to consider the time and space requirements for the two distinct phases of classification:
• Time to construct the classifier
– In the case of the simple linear classifier, this is the time taken to fit the line, which is linear in the number of instances.
• Time to use the model
– In the case of the simple linear classifier, this is the time taken to test which side of the line the unlabeled instance falls on. This can be done in constant time.
As we shall see, some classification algorithms are very efficient in one aspect, and very poor in the other.

Speed and Scalability II
For learning with small datasets, this is the whole picture. However, for data mining with massive datasets, it is not so much the (main memory) time complexity that matters; rather, it is how many times we have to scan the database. This is because for most data mining operations, disk access times completely dominate the CPU times. For data mining, researchers often report the number of times you must scan the database.

Robustness I
We need to consider what happens when we have:
• Noise
– For example, a person's age could have been mistyped as 650 instead of 65. How does this affect our classifier?
(This is important only for building the classifier; if the instance to be classified is noisy, we can do nothing.)
• Missing values
– For example, suppose we want to classify an insect, but we only know the abdomen length (X-axis) and not the antennae length (Y-axis). Can we still classify the instance?

Robustness II
We need to consider what happens when we have:
• Irrelevant features
– For example, suppose we want to classify people as either Suitable_Grad_Student or Unsuitable_Grad_Student, and it happens that scoring more than 5 on a particular test is a perfect indicator for this problem… If we also use "hair_length" as a feature, how will this affect our classifier?

Robustness III
We need to consider what happens when we have:
• Streaming data
– For many real-world problems, we don't have a single fixed dataset. Instead, the data continuously arrives, potentially forever… (stock market, weather data, sensor data, etc.) Can our classifier handle streaming data?

Interpretability
Some classifiers offer a bonus feature: the structure of the learned classifier tells us something about the domain. As a trivial example, if we try to classify people's health risks based on just their height and weight, we could gain the following insight (based on the observation that a single linear classifier does not work well, but two linear classifiers do): there are two ways to be unhealthy, being obese and being too skinny.
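The K-fold cross-validation procedure described a few slides back (Predictive Accuracy I) can be sketched in a few lines. In this sketch (Python; the helper name is mine), train and classify are placeholders for any classifier-building and classifier-applying pair:

```python
def k_fold_accuracy(dataset, train, classify, k=5):
    # dataset: list of (features, label) pairs.
    # Split into k roughly equal sections; hold each one out in turn,
    # build the classifier on the rest, and test on the held-out part.
    folds = [dataset[i::k] for i in range(k)]
    correct = 0
    for i in range(k):
        held_out = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train(training)
        correct += sum(1 for features, label in held_out
                       if classify(model, features) == label)
    # Accuracy = number of correct classifications /
    #            number of instances in our database
    return correct / len(dataset)
```

For example, with a trivial threshold classifier (train ignores its input and returns the cutoff 5, and classify labels values below the cutoff as "A"), a perfectly separable toy dataset scores an accuracy of 1.0.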
Nearest Neighbor Classifier (Evelyn Fix, 1904–1965; Joe Hodges, 1922–2000)
If the nearest instance to the previously unseen instance is a Katydid, then class is Katydid; else class is Grasshopper.

We can visualize the nearest neighbor algorithm in terms of a decision surface… Note that we don't actually have to construct these surfaces; they are simply the implicit boundaries that divide the space into regions "belonging" to each instance. This division of space is called a Dirichlet tessellation (or Voronoi diagram, or Thiessen regions).

The nearest neighbor algorithm is sensitive to outliers… The solution is to generalize the nearest neighbor algorithm to the K-nearest neighbor (KNN) algorithm. We measure the distance to the nearest K instances, and let them vote. K is typically chosen to be an odd number. (Compare K = 1 with K = 3.)

The nearest neighbor algorithm is sensitive to irrelevant features…
Suppose the following is true: if an insect's antenna is longer than 5.5, it is a Katydid; otherwise it is a Grasshopper. Using just the antenna length, we get perfect classification! Suppose, however, we add in an irrelevant feature, for example the insect's mass. Using both the antenna length and the insect's mass with the 1-NN algorithm, we get the wrong classification!

How do we mitigate the nearest neighbor algorithm's sensitivity to irrelevant features?
• Use more training instances
• Ask an expert what features are relevant to the task
• Use statistical tests to try to determine which features are useful
• Search over feature subsets (in the next slide we will see why this is hard)

Why searching over feature subsets is hard
Suppose you have the following classification problem, with 100 features, where it happens that Features 1 and 2 (the X and Y below) give perfect classification, but all 98 of the other features are irrelevant… Using all 100 features will give poor results, but so will using only Feature 1, and so will using only Feature 2! Of the 2^100 − 1 possible subsets of the features, only one really works.
[Lattice of feature subsets: {1}, {2}, {3}, {4}; {1,2}, {1,3}, {2,3}, {1,4}, {2,4}, {3,4}; {1,2,3}, {1,2,4}, {1,3,4}, {2,3,4}; {1,2,3,4}]
• Forward Selection
• Backward Elimination
• Bi-directional Search

The nearest neighbor algorithm is sensitive to the units of measurement
• With the X axis measured in centimeters and the Y axis measured in dollars, the nearest neighbor to the pink unknown instance is red.
• With the X axis measured in millimeters and the Y axis measured in dollars, the nearest neighbor to the pink unknown instance is blue.
One solution is to normalize the units to pure numbers. Typically the features are Z-normalized to have a mean of zero and a standard deviation of one: X = (X − mean(X)) / std(X)

We can speed up the nearest neighbor algorithm by "throwing away" some data. This is called data editing. Note that this can sometimes improve accuracy! One possible approach: delete all instances that are surrounded by members of their own class. We can also speed up classification with indexing.
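Putting the last few slides together, here is a minimal sketch (plain Python; the helper names are mine) of a K-nearest-neighbor classifier that first Z-normalizes each feature with X = (X − mean(X)) / std(X), then takes a majority vote among the K nearest neighbors under Euclidean distance, run on the My_Collection insect table from earlier:

```python
import math

# The My_Collection table: (abdomen length, antennae length, class)
data = [
    (2.7, 5.5, "Grasshopper"), (8.0, 9.1, "Katydid"),
    (0.9, 4.7, "Grasshopper"), (1.1, 3.1, "Grasshopper"),
    (5.4, 8.5, "Katydid"),     (2.9, 1.9, "Grasshopper"),
    (6.1, 6.6, "Katydid"),     (0.5, 1.0, "Grasshopper"),
    (8.3, 6.6, "Katydid"),     (8.1, 4.7, "Katydid"),
]

def znorm_params(column):
    # Mean and (population) standard deviation, for X = (X - mean)/std.
    mean = sum(column) / len(column)
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / len(column))
    return mean, std

def knn(query, k=3):
    # Z-normalize each feature, then vote among the k nearest neighbors.
    mx, sx = znorm_params([row[0] for row in data])
    my, sy = znorm_params([row[1] for row in data])
    qx, qy = (query[0] - mx) / sx, (query[1] - my) / sy
    neighbors = sorted(
        (math.hypot((x - mx) / sx - qx, (y - my) / sy - qy), label)
        for x, y, label in data
    )
    votes = [label for _, label in neighbors[:k]]
    return max(set(votes), key=votes.count)

print(knn((5.1, 7.0)))  # the previously unseen instance 11
```

With k = 1 this reduces to the basic nearest neighbor classifier; choosing an odd k avoids ties in the two-class vote.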
Up to now we have assumed that the nearest neighbor algorithm uses the Euclidean distance; however, this need not be the case…

D(Q, C) = sqrt( Σ_{i=1..n} (q_i − c_i)^2 )        (Euclidean)
D(Q, C) = ( Σ_{i=1..n} |q_i − c_i|^p )^(1/p)      (Minkowski: Manhattan for p = 1; Max for p = ∞)

Other choices include the Weighted Euclidean distance and the Mahalanobis distance.

…In fact, we can use the nearest neighbor algorithm with any distance/similarity function. For example, is "Faloutsos" Greek or Irish? We could compare the name "Faloutsos" to a database of names using string edit distance…
edit_distance(Faloutsos, Keogh) = 8
edit_distance(Faloutsos, Gunopulos) = 6
Hopefully, the similarity of the name (particularly the suffix) to other Greek names would mean the nearest neighbor is also a Greek name.

ID  Name          Class
1   Gunopulos     Greek
2   Papadopoulos  Greek
3   Kollios       Greek
4   Dardanos      Greek
5   Keogh         Irish
6   Gough         Irish
7   Greenhaugh    Irish
8   Hadleigh      Irish

Specialized distance measures exist for DNA strings, time series, images, graphs, videos, sets, fingerprints, etc.

Edit Distance Example
It is possible to transform any string Q into string C using only Substitution, Insertion, and Deletion. Assume that each of these operators has a cost associated with it. The similarity between two strings can be defined as the cost of the cheapest transformation from Q to C.
How similar are the names "Peter" and "Piotr"? Assume the following cost function: Substitution 1 unit, Insertion 1 unit, Deletion 1 unit. Then D(Peter, Piotr) is 3:
Peter → Piter (substitution, i for e) → Pioter (insertion, o) → Piotr (deletion, e)
Note that for now we have ignored the issue of how we can find this cheapest transformation.

An example spam email:
Dear SIR, I am Mr. John Coleman and my sister is Miss Rose Colemen, we are the children of late Chief Paul Colemen from Sierra Leone.
I am writing you in absolute confidence primarily to seek your assistance to transfer our cash of twenty one Million Dollars ($21,000.000.00) now in the custody of a private Security trust firm in Europe the money is in trunk boxes deposited and declared as family valuables by my late father as a matter of fact the company does not know the content as money, although my father made them to under stand that the boxes belongs to his foreign partner. …

This mail is probably spam. The original message has been attached along with this report, so you can recognize or block similar unwanted mail in future. See http://spamassassin.org/tag/ for more details.
Content analysis details: (12.20 points, 5 required)
NIGERIAN_SUBJECT2    (1.4 points)  Subject is indicative of a Nigerian spam
FROM_ENDS_IN_NUMS    (0.7 points)  From: ends in numbers
MIME_BOUND_MANY_HEX  (2.9 points)  Spam tool pattern in MIME boundary
URGENT_BIZ           (2.7 points)  BODY: Contains urgent matter
US_DOLLARS_3         (1.5 points)  BODY: Nigerian scam key phrase ($NN,NNN,NNN.NN)
DEAR_SOMETHING       (1.8 points)  BODY: Contains 'Dear (something)'
BAYES_30             (1.6 points)  BODY: Bayesian classifier says spam probability is 30 to 40% [score: 0.3728]

Acknowledgements
Some of the material used in this lecture is drawn from other sources:
• Chris Clifton
• Jiawei Han
• Dr. Hongjun Lu (Hong Kong Univ. of Science and Technology)
• Graduate students from Simon Fraser Univ., Canada, notably Eugene Belchev, Jian Pei, and Osmar R. Zaiane
• Graduate students from Univ. of Illinois at Urbana-Champaign
• Dr. Bhavani Thuraisingham (MITRE Corp. and UT Dallas)

Making Good Figures
• We are going to see many figures this quarter.
• I personally feel that making good figures is very important to a paper's chance of acceptance.
[Fig. 1. Sequence graph example. / Fig. 1. A sample sequence graph. The line thickness encodes relative entropy.]
What's wrong with this figure?
Let me count the ways… None of the arrows line up with the "circles". The "circles" are all different sizes and aspect ratios. The (normally invisible) white bounding box around the numbers breaks the arrows in many places. The figure caption has almost no information.

On the right is my redrawing of the figure with PowerPoint. It took me 300 seconds. The original figure is an insult to reviewers. It says, "we expect you to spend an unpaid hour to review our paper, but we don't think it worthwhile to spend 5 minutes to make clear figures".

Note that there are figures drawn seven hundred years ago that have much better symmetry and layout. (Peter Damian, Paulus Diaconus, and others, Various saints' lives: Netherlands, S., or France, N. W.; 2nd quarter of the 13th century.)

Let us see some more examples of poor figures, then see some principles that can help.

• This figure wastes 80% of the space it takes up. In any case, it could be replaced by a short English sentence: "We found that for selectivity ranging from 0 to 0.05, the four methods did not differ by more than 5%." And why did they bother with the legend, since you can't tell the four lines apart?

• This figure wastes almost a quarter of a page. The ordering on the X-axis is arbitrary, so the figure could be replaced with the sentence "We found the average performance was 198 with a standard deviation of 11.2." The paper in question had 5 similar plots, wasting an entire page.

• The figure below takes up 1/6 of a page, but it only reports 3 numbers.

• The figure below takes up 1/6 of a page, but it only reports 2 numbers! Actually, it really only reports one number: only the relative times matter, so they could have written "We found that FTW is 1007 times faster than the exact calculation, independent of the sequence length."

Both figures below describe the classification of time series motions… It is not obvious from this figure which algorithm is best.
The caption has almost zero information, and you need to read the text very carefully to understand the figure.

Redesign by Keogh: at a glance we can see that the accuracy is very high. We can also see that DTW tends to win when the… The data is plotted in Figure 5. Note that any correctly classified motions must appear in the upper left (gray) triangle: in that region our algorithm wins, and in the other region DTW wins.
Figure 5. Each of our 100 motions plotted as a point in 2 dimensions. The X value is set to the distance to the nearest neighbor from the same class, and the Y value is set to the distance to the nearest neighbor from any other class.

• This should be a bar chart; the four items are unrelated. (In any case, this should probably be a table, not a figure.) [From "A Database Architecture For Real-Time Motion Retrieval".]

• This pie chart takes up a lot of space to communicate two numbers: people that have heard of Pacman, and people that have not heard of Pacman. (Better as a table, or as simple text.)

Principles to Make Good Figures
• Think about the point you want to make: should it be made with words, a table, or a figure? If a figure, what kind?
• Color helps (but you cannot depend on it)
• Linking helps (sometimes called brushing)
• Direct labeling helps
• Meaningful captions help
• Minimalism helps (omit needless elements)
• Finally, taking great care, taking pride in your work, helps

Direct labeling helps
It removes one level of indirection, and allows the figures to be self-explaining (see Edward Tufte: Visual Explanations, Chapter 4).
Figure 10. Stills from a video sequence; the right hand is tracked, and converted into a time series: A) Hand at rest. B) Hand moving above holster. C) Hand moving down to grasp gun. D) Hand moving to shoulder level. E) Aiming gun.

Linking helps interpretability I
How did we get from here to here? What is linking? Linking is connecting the same data in two views by using the same color (or thickness, etc.).
In the figures below, color links the data in the pie chart with the data in the scatterplot (categories: Fish, Fowl, Neither, Both). It is not clear from the above figure; see the next slide for a suggested fix.

Linking helps interpretability II
In this figure, the color of the arrows inside the fish links to the colors of the arrows on the time series. This tells us exactly how we go from a shape to a time series. Note that there are other links; for example, in II you can tell which fish is which based on color or line-thickness linking. (A nice example of linking © Sinauer.)

Minimalism helps: in this case, the numbers on the X-axis do not mean anything, so they are deleted.

[Plot: Detection Rate vs. False Alarm Rate for EBEL, ABEL, DROP1, and DROP2]
• Don't cover the data with the labels! You are implicitly saying "the results are not that important".
• Do we need all the numbers to annotate the X and Y axes?
• Can we remove the text "With Ranking"?

Direct labeling helps: note that the line thicknesses differ by powers of 2, so even in a B/W printout you can tell the four lines apart.
Minimalism helps: delete the "With Ranking", the X-axis numbers, the grid…
Covering the data with the labels is a common sin.

These two images, which are both used to discuss an anomaly detection algorithm, illustrate many of the points discussed in the previous slides: color helps, direct labeling helps, and meaningful captions help. The images should be as self-contained as possible, to avoid forcing the reader to look back at the text for clarification multiple times. Note that while Figure 6 uses color to highlight the anomaly, it also uses line thickness (hard to see in PowerPoint); thus this figure also works well in B/W printouts.

Assignment (sometime between next week and the end of the quarter): find a poor figure in a data mining paper. Create a fixed version of it. Present one to three slides about it at the beginning of a class. Before you start to work on the poor figure, run it by me.