Introduction to Data Mining
Instructor: Dr. Chris Volinsky
Data Mining - Massey University

Class Structure
• Class structure
  – 9AM - 11AM Lecture
  – 11AM - 12PM Computer Lab
  – 12PM Lunch
  – 1PM - 3PM Lecture
  – 3PM - 4PM Computer Lab
  – 4PM - 5PM Recap and discussion
• Exams / Grades based on
  – 40% data analysis project
  – 30% technical paper presentation
  – 30% pop quizzes and exams
    • given at the beginning of each class
    • questions about broad concepts

Class Schedule
• 24-26 July: Block 1
• 4-6 September: Block 2
• 6 September: Presentation of technical paper and data analysis proposal to class
• 19 September: Data Analysis project due (via email)
• I will be available intermittently from 1-24 August

Course Objectives
• Direct Objectives:
  – To learn data mining techniques
  – To see their use in real-world/research applications
  – To understand limitations of standard statistical techniques in data mining applications
  – To get an understanding of the methodological principles behind data mining
  – To be able to read and understand methodological work in scholarly journal papers
  – To implement & use data mining models using statistical software (specifically R)

Project #1: Data Analysis Project
• The goal of data mining is to find interesting patterns in data. You will be required to:
  – Define a scientific question of interest
  – Collect a data set (probably online)
  – Prepare the data set properly
  – Analyze the data using appropriate models
  – Write a 5-10 page report on your analysis (graphics included)
• Project proposals (1/2 to 1 page) will be due at the beginning of the second block.
• Present proposal (5 minutes) to class on 6 September.
• Finished reports will be due 19 September.

Project #2: Scientific Paper Presentation
• Select a technical paper about a data mining technique (list provided on web site).
  – Read and understand the paper
  – Write a one-page summary of the paper
  – Present the basic ideas of the paper to the class (10-15 minutes)
• Emphasis to be placed on the motivation for a particular statistical methodology within the application context:
  – What is the general objective of the paper?
  – What data are they using?
  – What statistical approach/method is proposed? Why?
  – What has been done in the past?
  – How does the paper accomplish new domain insight using that method?
• Paper presented to class on last day of lecture: 6 September

Class Web Site
• http://www.research.att.com/~volinsky/DataMining
• Lists of papers for presentation
• Announcements
• Links to other data mining course notes, R tutorials, resources
• Email:
  – [email protected]

Resources
• Data mining is a new field and, as such, does not have authoritative texts (yet).
• This class draws from many sources; the best are
  – “Principles of Data Mining” Hand, Mannila and Smyth
  – “Elements of Statistical Learning” Hastie, Tibshirani, and Friedman
  – “Interactive and Dynamic Graphics for Data Analysis” Cook and Swayne
  – Also good class notes available from other classes:
    • David Madigan, Rutgers
    • Di Cook, Iowa State
    • Padhraic Smyth, UC Irvine
    • Jiawei Han, Simon Fraser
  – (see class web site for pointers to these notes, or just Google them!)
• Also many good tutorials and books on R (or S/Splus), both online and in the library.

Course Outline
• 6 days = 12 “units”; each unit is a lecture and a lab
• Units:
  – Intro to Data Mining
  – Data exploration and visualization
  – Data Mining Concepts
  – Regression Topics
  – Classification and Supervised Learning
  – Clustering and Unsupervised Learning
  – Text Mining and Information Retrieval
  – Web Mining and Social Networks 1
  – Web Mining and Social Networks 2
  – Assorted Topics
    • Advanced Classification - Neural networks, ensemble methods
    • Association Rules
    • Telecommunications Fraud
    • Proximity models for social networks
    • Support Vector machines

What is Data Mining?
• Not well defined….
• Hand, Mannila, Smyth:
  – “data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner”
• Isn’t that the same as statistics?

Data Mining Enablers
• Explosion of data
• Fast and cheap computation and storage
  – Moore’s Law: processing doubles every 19 months
  – Disk storage doubles every 9 months
  – Database technology
• Competitive pressure in business
• New, successful models
  – SVM, boosting
• Commercial products
  – SAS, SPSS, Insightful, IBM, Oracle
• Open Source products
  – Weka
  – R
  – Both from NZ!
[Figure: Disk TB shipped per year, 1988-2000, log scale from 1E+3 to 1E+7 (ExaByte). Disk TB growth: 112%/y; Moore's Law: 58.7%/y. Source: 1998 Disk Trend (Jim Porter), http://www.disktrend.com/pdf/portrpkg.pdf]

Data Mining vs. Statistics
• Statistics is known for:
  – well defined hypotheses
  – used to learn about a specifically chosen population
  – studied using carefully collected data
  – providing inferences with well known properties.
• Data mining isn’t that careful. It is:
  – data driven
  – discovery of models and patterns
  – from massive
  – and observational data sets

Two Types of Data
• Experimental Data
  – Hypothesis H
  – design an experiment to test H
  – collect data, infer how likely it is that H is true
  – e.g., clinical trials in medicine
• Observational or Retrospective or Secondary Data
  – massive non-experimental data sets
    • e.g., Web logs, human genome, atmospheric simulations, etc
  – assumptions of experimental design no longer valid
  – how can we use such data to do science?
    • use the data to support model exploration, hypothesis testing

Data-Driven Discovery
• Observational data
  – cheap relative to experimental data
• Examples:
  – Transaction data archives for retail stores, airlines, etc
  – Web logs for Amazon, Google, etc
  – The human/mouse/rat genome
  – Etc., etc
• It makes sense to leverage available data
• Useful (?) information may be hidden in vast archives of data

Data Mining v. Statistics
• Traditional statistics
  – first hypothesize, then collect data, then analyze
  – often model-oriented (strong parametric models)
• Data mining:
  – few if any a priori hypotheses
  – data is usually already collected a priori
  – analysis is typically data-driven not hypothesis-driven
  – Often algorithm-oriented rather than model-oriented
• Different?
  – Yes, in terms of culture, motivation: however…..
  – statistical ideas are very useful in data mining, e.g., in validating whether discovered knowledge is useful
  – Increasing overlap at the boundary of statistics and DM, e.g., exploratory data analysis (based on pioneering work of John Tukey in the 1960’s)

Data Mining: Confluence of Multiple Disciplines
[Diagram: Data Mining at the center, surrounded by Database Technology, Statistics, Machine Learning, Visualization, Information Science, and Other Disciplines]
• Different fields have different views of what data mining is

Data Data Data
• It’s all about the data - where does it come from?
  – www
  – NASA
  – Business processes/transactions
  – Telecommunications and networking
  – Medical imagery
  – Government, census, demographics
  – Sensor networks, RFID tags
  – sports

Flat File or Vector Data
[Figure: a data matrix with n rows and p columns, with rows such as (2.3, -1.5, …, -1.3) and (1.1, 0.1, …, -0.1)]
• Rows = objects
• Columns = measurements on objects
  – Represent each row as a p-dimensional vector, where p is the dimensionality
• In effect, embed our objects in a p-dimensional vector space
• Often useful, but not always appropriate
• Both n and p can be very large in data mining
• Matrix can be quite sparse

Sparse Matrix (Text) Data
[Figure: sparse matrix of Text Documents (0-500) by Word IDs (0-200)]

Sequence (Web) Data
128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -,
128.195.36.195, -, 3/22/00, 10:35:16, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.195, -, 3/22/00, 10:35:17, W3SVC, SRVR1, 128.200.39.181, 30, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.195.36.101, -, 3/22/00, 16:18:50, W3SVC, SRVR1, 128.200.39.181, 60, 425, 72, 304, 0, GET, /top.html, -,
128.195.36.101, -, 3/22/00, 16:18:58, W3SVC, SRVR1, 128.200.39.181, 8322, 527, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.101, -, 3/22/00, 16:18:59, W3SVC, SRVR1, 128.200.39.181, 0, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:54:37, W3SVC, SRVR1, 128.200.39.181, 140, 199, 875, 200, 0, GET, /top.html, -,
128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 17766, 365, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:07, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 1061, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:39, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:56:03, W3SVC, SRVR1, 128.200.39.181, 1081, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:56:04, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:56:33, W3SVC, SRVR1, 128.200.39.181, 0, 262, 72, 304, 0, GET, /top.html, -,
128.200.39.17, -, 3/22/00, 20:56:52, W3SVC, SRVR1, 128.200.39.181, 19598, 382, 414, 200, 0, POST, /spt/main.html, -,
[Table: sessions for User 1 through User 5, …, each coded as a sequence of page-category IDs such as 2 3 7 1 5 3 3 7 5 1 …]

Time Series Data
[Figure: trajectories of centroids of moving hand in video streams; X-position (40-160) vs. time (0-30)]

Image Data

Spatio Temporal Data
• http://senseable.mit.edu/nyte/nyte-globe-encounters.mov

Relational Data
• Algorithms for estimating relative importance in networks, S. White and P. Smyth, ACM SIGKDD, 2003.

Examples of Data Mining Successes
• Market Basket (WalMart)
• Recommender Systems (Amazon.com)
• Fraud Detection in Telecommunications (AT&T)
• Target Marketing / CRM
• Financial Markets
• DNA Microarray analysis
• Biometrics (fingerprinting, handwriting)
• Web Traffic / Blog analysis

Examples of Data Mining Successes
• Google is a company built on data mining
• PageRank mined the web to build better search
• Google as spell checker
• Google as ad placer
• Google as news aggregator
• Google as face recognizer

The Data Mining Process
• Often called KDD - Knowledge Discovery in Databases
• Analysis is just one part of the process
  – Data collection and storage
  – Data cleaning
  – Data sampling
  – Analysis
  – Decision making

Different Data Mining Tasks
• Exploratory Data Analysis
• Descriptive Modeling
• Predictive Modeling
• Discovering Patterns and Rules
• + others….

Exploratory Data Analysis
• Getting an overall sense of the data set
  – Computing summary statistics:
    • Number of distinct values, max, min, mean, median, variance, skewness,..
• Visualization is widely used
  – 1d histograms
  – 2d scatter plots
  – Higher-dimensional methods
• Useful for data checking
  – E.g., finding that a variable is always integer valued or positive
  – Finding that some variables are highly skewed
• Simple exploratory analysis can be extremely valuable
  – You should always “look” at your data before applying any data mining algorithms

Example of Exploratory Data Analysis
[Figure: Pima Indians data, scatter plot matrix]

Descriptive Modeling
• Goal is to build a “descriptive” model
  – e.g., a model that could simulate the data if needed
  – models the underlying process
• Examples:
  – Density estimation:
    • estimate the joint distribution P(x1,……xp)
  – Cluster analysis:
    • Find natural groups in the data
  – Dependency models among the p variables
    • Learning a Bayesian network for the data

Example of Descriptive Modeling
[Figure: Anemia patients and controls; Red Blood Cell Hemoglobin Concentration (3.7-4.4) vs. Red Blood Cell Volume (3.3-4), with the Control Group and Anemia Group labelled]

Example of Descriptive Modeling
[Figure: Anemia patients and controls, same axes, two panels: the labelled Control Group and Anemia Group, and the fit at EM iteration 25]

• WebCanvas algorithm and software - currently in new SQLServer

Different Data Mining Tasks
• Exploratory Data Analysis
• Descriptive Modeling
• Predictive Modeling
• Discovering Patterns and Rules
• + others….
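
The exploratory and descriptive steps above can be sketched in a few lines of R. This is only an illustration under stated assumptions: it uses R's built-in iris data rather than any course data set, the names num, eda and km are made up for the example, and k-means is swapped in as a simple stand-in for cluster analysis. It is not the EM mixture fit shown in the anemia example.

```r
# Assumption: R's built-in iris data stands in for a real project data set.
data(iris)
num <- iris[, 1:4]                      # the four numeric measurements

# Summary statistics per variable, as listed on the EDA slide.
eda <- sapply(num, function(v) c(
  distinct = length(unique(v)),
  min      = min(v),
  max      = max(v),
  mean     = mean(v),
  median   = median(v),
  variance = var(v)
))
print(round(eda, 3))

# Scatter plot matrix, one colour per species.
pairs(num, col = as.integer(iris$Species))

# Cluster analysis as a simple descriptive model.
# k-means (not EM) with 3 groups on standardised variables.
set.seed(1)                             # k-means starts from random centres
km <- kmeans(scale(num), centers = 3, nstart = 20)
print(table(cluster = km$cluster, species = iris$Species))
```

Comparing the cluster labels with the known species is one quick way to judge whether the "natural groups" the algorithm finds mean anything.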

Predictive Modeling
• Predict one variable Y given a set of other variables X
  – Here X could be a p-dimensional vector
  – Classification: Y is categorical
  – Regression: Y is real-valued
• In effect this is function approximation, learning the relationship between Y and X
• Many, many algorithms for predictive modeling in statistics and machine learning
• Often the emphasis is on predictive accuracy, less emphasis on understanding the model

Predictive Modeling: Fraud Detection
• Telecommunications fraud detection
  – Fraud costs telecommunication companies billions of US dollars per year
  – very few transactions are fraudulent, but they are costly
• Approach
  – For each transaction estimate “fraudiness”.
  – Based on known fraud AND known user behavior
  – High probability cases investigated by fraud police
• Example models:
  – anomaly detection
  – guilt by association
• Issues
  – Significant feature engineering/preprocessing
  – false alarm rate vs missed detection: what is the tradeoff?

Predictive Modeling: Other Examples
• Risk Management:
  – Example: Credit card company wants to do risk management.
  – How would you do this?
• Netflix Prize
  – US$1M prize to make better movie recommendations.
  – How would you do this?

Different Data Mining Tasks
• Exploratory Data Analysis
• Descriptive Modeling
• Predictive Modeling
• Discovering Patterns and Rules
• + others….
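
A small R sketch may make the predictive modeling ideas above concrete: regression (Y real-valued), classification (Y categorical), and the trade-off between false alarms and missed detections. It assumes the built-in iris data as a stand-in for real transaction data; the variable is_virginica, the helper rates() and the hold-out rule are hypothetical choices for illustration, not part of any fraud system.

```r
# Assumption: built-in iris data; "virginica" plays the role of the rare class.
data(iris)

# Regression: Y is real-valued (petal width predicted from petal length).
reg <- lm(Petal.Width ~ Petal.Length, data = iris)
print(coef(reg))

# Classification: Y is categorical (virginica or not), X is 2-dimensional.
iris$is_virginica <- as.integer(iris$Species == "virginica")
test_rows <- seq(3, nrow(iris), by = 3)   # hold out every third row
train <- iris[-test_rows, ]
test  <- iris[test_rows, ]
clf <- glm(is_virginica ~ Sepal.Length + Sepal.Width,
           data = train, family = binomial)
score <- predict(clf, newdata = test, type = "response")

# Trade-off: raising the threshold lowers false alarms but misses more cases.
rates <- function(threshold) {
  flagged <- score >= threshold
  c(threshold   = threshold,
    false_alarm = mean(flagged[test$is_virginica == 0]),
    missed      = mean(!flagged[test$is_virginica == 1]))
}
tradeoff <- t(sapply(c(0.2, 0.5, 0.8), rates))
print(tradeoff)
```

Where to set the threshold depends on the relative cost of investigating a false alarm versus missing a real case, which is the question the fraud detection slide raises.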

Pattern Discovery
• Goal is to discover interesting “local” patterns in the data rather than to characterize the data globally
• Given market basket data we might discover that
  – If customers buy wine and bread then they buy cheese with probability 0.9
  – These are known as “association rules”
• Given multivariate data on astronomical objects
  – We might find a small group of previously undiscovered objects that are very self-similar in our feature space, but are very far away in feature space from all other objects

Example of Pattern Discovery
• IBM “Advanced Scout” System
  – Bhandari et al. (1997)
  – Every NBA basketball game is annotated,
    • e.g., time = 6 mins, 32 seconds; event = 3 point basket; player = Michael Jordan
    • This creates a huge untapped database of information
  – IBM algorithms search for rules of the form “If player A is in the game, player B’s scoring rate increases from 3.2 points per quarter to 8.7 points per quarter”

Data Mining Pitfalls
• Is data mining always necessary?
  – Just because you have a terabyte doesn’t mean you need to use it.
• Privacy concerns
  – Differ by country, industry, application, generation
• Meaningfulness of patterns unclear
  – Rhine paradox
  – Terrorism
  – DM has a lot to learn from statistics!

Rhine Paradox
• J. B. Rhine: parapsychologist who studied ESP (he was a believer!)
• He devised an experiment where subjects were asked to guess 10 hidden cards, each red or blue.
• He discovered that almost 1 in 1000 had ESP: they were able to get all 10 right!
• He told these people they had ESP and called them in for another test of the same type.
• Alas, he discovered that almost all of them had lost their ESP.
• What is the conclusion?

Data Mining Pitfalls
• PR Problems: data mining as a four letter word?
  – ...increasingly people’s data is at risk. The old ways ...are still at use like dumpster diving, stealing from mailboxes, physical theft, and credit card receipt copying. New tactics include disparate techniques of phishing, email fraud, data mining, spam, keylogging and an array of other technological processes.
    - Steven D. Domenikos, IdentityTruth, 2008
  – One place oversight is sorely lacking is in the whole matter of data mining. ...What have they contributed? Not a single case comes to mind in which security services apprehended a terrorist following identification by data mining. ...that huge database will be out there, win or lose, for some government agency to divert to its purposes or some hacker to turn to private gain or crime.
    - John Prados, TomPaine.com

Fighting Terrorism in the US
• US Government is widely known to be collecting lots of data on Americans and using data mining to look for patterns consistent with terrorist activity.
• Bruce Schneier, Wired Magazine, “Why Data Mining Won’t Stop Terror”:
• Assume:
  – 1 in 100 false positive (99% accurate)
  – 1 in 1000 false negative
  – 1 trillion events (phone calls, credit card transactions, emails) per year
  – 10 are really terrorist plots
• Then:
  – 1 billion false alarms for every true plot uncovered
  – 27 million leads daily
  – Even at 99.9999% accuracy: 2,750 false alarms per day

Data Mining Software: Introduction to R

Data Mining Software
• What is R?
  – Open source statistical software
  – Grew out of S, S+
  – www.r-project.org
  – http://cran.stat.auckland.ac.nz/
• R Tutorials available online (see website and CRAN)
• Great graphics

R examples
    x=5
    y=rnorm(1000,-1,3.5)
    hist(y)
    ?hist
    hist(y, nclass=20, col="orange")
    > summary(y)
        Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    -12.7400  -3.3080  -0.8247  -0.8101   1.5820  10.5500
    > t.test(y,mu=0)
    data: y
    t = -7.1286, df = 999, p-value = 1.942e-12
    alternative hypothesis: true mean is not equal to 0
    95 percent confidence interval:
     -1.0330474 -0.5870667
    sample estimates:
    mean of x
    -0.810057

    mydata = read.table("iris.dat",sep=" ")
    > summary(mydata)
      Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
     Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
     1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
     Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
     Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
     3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
     Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500

R Examples
• Data stored in “data frames”
  – read.table or read.csv reads data into a data frame
• Check the contents of ‘mydata’
  – summary(mydata)
  – names(mydata) – lists all the variable names
  – mydata[1,] – shows only the first row of data
  – mydata[,1] – shows only the first column
  – mydata[,2:5] – selects columns 2 through 5
  – mydata$Sepal.Length – shows all values of the variable “Sepal.Length”
  – attach(mydata) – allows you to access Sepal.Length and other columns directly
  – mydata[Sepal.Length > 3,] – selects the rows that meet a condition

R Examples
• Modelling requires formula notation
    lm.out = lm(y~x)

Lab #1
• R Tutorial
  – Courtesy of Di Cook
  – Work your way through the R tutorial (intro-R.pdf).
  – Code is available in the file introductory-code.txt
  – Input spam data (collected at Iowa State University)
  – Find summaries and simple manipulations
  – Write R functions using function() and apply()
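
As a warm-up for the lab, here is a hedged sketch of formula notation, function() and apply() on a data frame. It assumes R's built-in iris data in place of the iris.dat file read on the slides; the helper value_range and the cut-off of 7 are made up for illustration.

```r
# Assumption: the built-in iris data frame replaces read.table("iris.dat").
data(iris)
mydata <- iris

# Formula notation: response on the left of ~, predictors on the right.
lm.out <- lm(Petal.Width ~ Petal.Length + Species, data = mydata)
print(coef(lm.out))

# A user-defined function, applied to each numeric column (MARGIN = 2).
value_range <- function(v) max(v) - min(v)
ranges <- apply(mydata[, 1:4], 2, value_range)
print(ranges)

# Conditional row selection without attach(): name the data frame explicitly.
long_sepals <- mydata[mydata$Sepal.Length > 7, ]
print(nrow(long_sepals))
```

Passing data = mydata to lm() and writing mydata$Sepal.Length avoids attach(), which can silently pick up a same-named variable from the workspace.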