Introduction to Data Mining
Instructor: Dr. Chris Volinsky
Data Mining - Massey University
Class Structure
• Class structure
– 9AM - 11AM Lecture
– 11AM - 12PM Computer Lab
– 12PM Lunch
– 1PM - 3PM Lecture
– 3PM - 4PM Computer Lab
– 4PM - 5PM Recap and discussion
• Exams / Grades based on
– 40% data analysis project
– 30% technical paper presentation
– 30% pop quizzes and exams
• given at the beginning of each class
• questions about broad concepts
Class Schedule
• 24-26 July: Block 1
• 4-6 September: Block 2
• 6 September: Presentation of technical paper and
data analysis proposal to class
• 19 September: Data Analysis project due (via email)
• I will be available intermittently from 1-24 August
Course Objectives
• Direct Objectives:
– To learn data mining techniques
– To see their use in real-world/research applications
– To understand limitations of standard statistical techniques in data
mining applications
– To get an understanding of the methodological principles behind
data mining
– To be able to read and understand methodological work in scholarly
journal papers
– To implement & use data mining models using statistical software
(specifically R)
Project #1:
Data Analysis Project
• The goal of data mining is to find interesting patterns in
data. You will be required to:
– Define a scientific question of interest
– Collect a data set (probably online)
– Prepare the data set properly
– Analyze the data using appropriate models
– Write a 5-10 page report on your analysis (graphics included)
• Project proposals (1/2 -1 page) will be due at the beginning
of the second block.
• Present proposal (5 minutes) to class on 6 September.
• Finished reports will be due 19 September.
Project #2:
Scientific Paper Presentation
• Select a technical paper about a data mining technique (list provided on
web site).
– Read and understand the paper
– Write a one-page summary of the paper
– Present the basic ideas of the paper to the class (10-15 minutes)
• Emphasis to be placed on the motivation for a particular statistical
methodology within the application context:
– What is the general objective of the paper?
– What data are they using?
– What statistical approach/method is proposed? Why?
– What has been done in the past?
– How does the paper accomplish new domain insight using that method?
• Paper presented to class on last day of lecture: 6 September
Class Web Site
• http://www.research.att.com/~volinsky/DataMining
• Lists of papers for presentation
• Announcements
• Links to other data mining course notes, R tutorials, resources
• Email:
– [email protected]
Resources
• Data mining is a new field and as such, does not have
authoritative texts (yet).
• This class draws from many sources; the best are
– “Principles of Data Mining” Hand, Mannila and Smyth
– “Elements of Statistical Learning” Hastie, Tibshirani, and Friedman
– “Interactive and Dynamic Graphics for Data Analysis” Cook and
Swayne
– Also good class notes available from other classes:
• David Madigan, Rutgers
• Di Cook, Iowa State
• Padhraic Smyth, UC Irvine
• Jiawei Han, Simon Fraser
– (see class web site for pointers to these notes, or just Google them!)
• Also many good tutorials and books on R (or S/Splus), both
online and in the library.
Course Outline
• 6 days = 12 “units”; each unit is a lecture and a lab
• Units:
– Intro to Data Mining
– Data exploration and visualization
– Data Mining Concepts
– Regression Topics
– Classification and Supervised Learning
– Clustering and Unsupervised Learning
– Text Mining and Information Retrieval
– Web Mining and Social Networks 1
– Web Mining and Social Networks 2
– Assorted Topics
• Advanced Classification - Neural networks, ensemble methods
• Association Rules
• Telecommunications Fraud
• Proximity models for social networks
• Support Vector machines
What is Data Mining?
• Not well defined….
• Hand, Mannila, Smyth:
– “data mining is the analysis of (often large) observational data sets
to find unsuspected relationships and to summarize the data in
novel ways that are both understandable and useful to the data
owner”
• Isn’t that the same as statistics?
Data Mining Enablers
• Explosion of data
• Fast and cheap computation and storage
– Moore’s Law: processing doubles every 18 months
– Disk storage doubles every 9 months
– Database technology
• Competitive pressure in business
• New, successful models
– SVM, boosting
• Commercial products
– SAS, SPSS, Insightful, IBM, Oracle
• Open Source products
– Weka
– R
– Both from NZ!
[Figure: Disk TB Shipped per Year, 1988 to 2000; y-axis: disk TB on a log scale from 1E+3 to 1E+7, with the ExaByte level marked; disk TB growth: 112%/y; Moore's Law: 58.7%/y. Source: 1998 Disk Trend (Jim Porter), http://www.disktrend.com/pdf/portrpkg.pdf.]
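The annual growth rates quoted in the figure can be converted into doubling times with a short calculation. This is a minimal sketch that assumes steady compound growth; `doubling_months` is a helper name introduced here for illustration only.

```r
# Doubling time (in months) implied by a constant annual growth rate.
# Assumes steady compound growth; the helper name is illustrative only.
doubling_months <- function(annual_growth) {
  12 * log(2) / log(1 + annual_growth)
}

moore <- doubling_months(0.587)  # Moore's Law: 58.7% per year
disk  <- doubling_months(1.12)   # disk TB shipped: 112% per year

round(c(moore = moore, disk = disk), 1)
```

Under this assumption, 58.7% per year corresponds to doubling roughly every 18 months, and 112% per year to roughly every 11 months, which is somewhat slower than the 9-month rule of thumb for disk storage.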
Data Mining vs. Statistics
• Statistics is known for:
– well defined hypotheses used to learn about a
– specifically chosen population studied using
– carefully collected data providing inferences with
– well known properties.
• Data mining isn’t that careful. It is:
– data driven discovery of
– models and patterns from
– massive and
– observational data sets
Two Types of Data
• Experimental Data
– Hypothesis H
– design an experiment to test H
– collect data, infer how likely it is that H is true
– e.g., clinical trials in medicine
• Observational or Retrospective or Secondary Data
– massive non-experimental data sets
• e.g., Web logs, human genome, atmospheric simulations, etc
– assumptions of experimental design no longer valid
– how can we use such data to do science?
• use the data to support model exploration, hypothesis testing
Data-Driven Discovery
• Observational data
– cheap relative to experimental data
• Examples:
– Transaction data archives for retail stores, airlines, etc
– Web logs for Amazon, Google, etc
– The human/mouse/rat genome
– Etc., etc
→ makes sense to leverage available data
→ useful (?) information may be hidden in vast archives of data
Data Mining v. Statistics
• Traditional statistics
– first hypothesize, then collect data, then analyze
– often model-oriented (strong parametric models)
• Data mining:
– few if any a priori hypotheses
– data is usually already collected a priori
– analysis is typically data-driven not hypothesis-driven
– often algorithm-oriented rather than model-oriented
• Different?
– Yes, in terms of culture, motivation: however…..
– statistical ideas are very useful in data mining, e.g., in validating whether
discovered knowledge is useful
– Increasing overlap at the boundary of statistics and DM
e.g., exploratory data analysis (based on pioneering work of John Tukey in the
1960’s)
Data Mining: Confluence of Multiple Disciplines
[Diagram: Data Mining at the center, drawing on Database Technology, Machine Learning, Statistics, Information Science, Visualization, and Other Disciplines]
Different fields have different views of what data mining is
Data Data Data
• It’s all about the data - where does it come from?
– WWW
– NASA
– Business processes/transactions
– Telecommunications and networking
– Medical imagery
– Government, census, demographics
– Sensor networks, RFID tags
– Sports
Flat File or Vector Data
[Figure: an n × p data matrix with n rows and p columns, e.g. first row 2.3, -1.5, …, -1.3 and second row 1.1, 0.1, …, -0.1]
• Rows = objects
• Columns = measurements on objects
– Represent each row as a p-dimensional vector, where p is the dimensionality
• In effect, embed our objects in a p-dimensional vector space
• Often useful, but not always appropriate
• Both n and p can be very large in data mining
• Matrix can be quite sparse
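To make the rows-as-vectors picture concrete, here is a minimal sketch of a small data matrix in R. The values, the dimensions, and the rule used to zero out entries are all made up for illustration.

```r
# A small "flat file": n objects (rows) by p measurements (columns).
# All values are made up for illustration.
set.seed(1)
n <- 5
p <- 3
X <- matrix(round(rnorm(n * p), 1), nrow = n, ncol = p,
            dimnames = list(paste0("obj", 1:n), paste0("x", 1:p)))

dim(X)        # n rows, p columns
X["obj2", ]   # one object as a p-dimensional vector

# Sparsity: the fraction of entries that are exactly zero.
S <- X
S[abs(S) < 1] <- 0        # zero out small entries to mimic a sparse matrix
sparsity <- mean(S == 0)
sparsity
```

For matrices where n and p are both very large, a dense `matrix` like this becomes impractical, and a sparse storage format is usually preferred.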
Sparse Matrix (Text) Data
[Figure: sparse document-by-word matrix; y-axis: Text Documents (50 to 500), x-axis: Word IDs (20 to 200)]
Sequence (Web) Data
128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -,
128.195.36.195, -, 3/22/00, 10:35:16, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.195, -, 3/22/00, 10:35:17, W3SVC, SRVR1, 128.200.39.181, 30, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.195.36.101, -, 3/22/00, 16:18:50, W3SVC, SRVR1, 128.200.39.181, 60, 425, 72, 304, 0, GET, /top.html, -,
128.195.36.101, -, 3/22/00, 16:18:58, W3SVC, SRVR1, 128.200.39.181, 8322, 527, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.101, -, 3/22/00, 16:18:59, W3SVC, SRVR1, 128.200.39.181, 0, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:54:37, W3SVC, SRVR1, 128.200.39.181, 140, 199, 875, 200, 0, GET, /top.html, -,
128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 17766, 365, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:07, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 1061, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:39, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:56:03, W3SVC, SRVR1, 128.200.39.181, 1081, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:56:04, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:56:33, W3SVC, SRVR1, 128.200.39.181, 0, 262, 72, 304, 0, GET, /top.html, -,
128.200.39.17, -, 3/22/00, 20:56:52, W3SVC, SRVR1, 128.200.39.181, 19598, 382, 414, 200, 0, POST, /spt/main.html, -,
[Figure: the log reduced to one sequence of integer codes per user (User 1 to User 5, …), e.g. 3 3 1 1 1 3 1 3 3 3 3 / 7 7 7 / 5 1 5 1 1 1 1 1 1]
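One way to get from raw log lines to per-user sequences is sketched below, using three lines copied from the log. It assumes that field 1 is the client address, that field 14 is the requested page, and that one client address corresponds to one user. None of these assumptions is stated on the slide, and the integer codes are assigned arbitrarily in order of first appearance.

```r
# Three log lines copied from the slide; fields are comma-separated.
log_lines <- c(
  "128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -,",
  "128.195.36.195, -, 3/22/00, 10:35:16, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html, -,",
  "128.195.36.101, -, 3/22/00, 16:18:50, W3SVC, SRVR1, 128.200.39.181, 60, 425, 72, 304, 0, GET, /top.html, -,"
)

fields <- strsplit(log_lines, ", *")   # split each line on commas
client <- sapply(fields, `[`, 1)       # assumed: field 1 is the client address
page   <- sapply(fields, `[`, 14)      # assumed: field 14 is the requested page

# Assumption: one client address = one user; code pages as integers.
page_id   <- as.integer(factor(page, levels = unique(page)))
sequences <- split(page_id, client)
sequences
```

Real logs need more care (shared addresses, proxies, session boundaries), so treating an address as a user is only a first approximation.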
Time Series Data
[Figure: TRAJECTORIES OF CENTROIDS OF MOVING HAND IN VIDEO STREAMS; y-axis: X-POSITION (40 to 160), x-axis: TIME (0 to 30)]
Image Data
Spatio Temporal Data
• http://senseable.mit.edu/nyte/nyte-globe-encounters.mov
Relational Data
Algorithms for estimating relative importance in networks
S. White and P. Smyth, ACM SIGKDD, 2003.
Examples of Data Mining Successes
• Market Basket (WalMart)
• Recommender Systems (Amazon.com)
• Fraud Detection in Telecommunications (AT&T)
• Target Marketing / CRM
• Financial Markets
• DNA Microarray analysis
• Biometrics (fingerprinting, handwriting)
• Web Traffic / Blog analysis
Examples of Data Mining Successes
• Google is a company built on data mining
• PageRank mined the web to build better search
• Google as spell checker
• Google as ad placer
• Google as news aggregator
• Google as face recognizer
The Data Mining Process
• Often called KDD - Knowledge Discovery in
Databases
• Analysis is just one part of the process
– Data collection and storage
– Data cleaning
– Data sampling
– Analysis
– Decision making
Different Data Mining Tasks
• Exploratory Data Analysis
• Descriptive Modeling
• Predictive Modeling
• Discovering Patterns and Rules
• + others….
Exploratory Data Analysis
• Getting an overall sense of the data set
– Computing summary statistics:
• Number of distinct values, max, min, mean, median, variance, skewness, ...
• Visualization is widely used
– 1d histograms
– 2d scatter plots
– Higher-dimensional methods
• Useful for data checking
– E.g., finding that a variable is always integer valued or positive
– Finding that some variables are highly skewed
• Simple exploratory analysis can be extremely valuable
– You should always “look” at your data before applying any data
mining algorithms
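The checks listed above can be sketched in a few lines of base R. R's built-in `iris` data is used here only as a stand-in for your own data set, and the skewness formula is a simple moment-based estimate chosen for illustration.

```r
# Exploratory look at R's built-in iris data (a stand-in for your own data).
data(iris)
x <- iris$Sepal.Length

# Summary statistics of the kind named on the slide.
stats <- c(distinct = length(unique(x)),
           min = min(x), max = max(x),
           mean = mean(x), median = median(x),
           variance = var(x),
           skewness = mean((x - mean(x))^3) / sd(x)^3)  # simple moment-based estimate
round(stats, 3)

# Data checks: is the variable always positive? always integer valued?
all(x > 0)
all(x == round(x))

# Visual checks: a 1d histogram, then a scatter plot matrix of the numeric columns.
hist(x, main = "Sepal.Length")
pairs(iris[, 1:4])
```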
Example of Exploratory Data Analysis
(Pima Indians data, scatter plot matrix)
Descriptive Modeling
• Goal is to build a “descriptive” model
– e.g., a model that could simulate the data if needed
– models the underlying process
• Examples:
– Density estimation:
• estimate the joint distribution P(x1, …, xp)
– Cluster analysis:
• Find natural groups in the data
– Dependency models among the p variables
• Learning a Bayesian network for the data
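Two of these examples can be tried directly in base R. The sketch below uses the built-in `iris` data as a stand-in; the choice of variables, the kernel density estimator, and k = 3 clusters are all assumptions made for illustration, not methods prescribed by the slide.

```r
data(iris)
X <- iris[, c("Petal.Length", "Petal.Width")]

# Density estimation: a kernel estimate of the marginal density of one variable.
# (Estimating the full joint distribution P(x1, ..., xp) is much harder.)
d <- density(X$Petal.Length)

# Cluster analysis: k-means with k = 3 (k is chosen by hand here).
set.seed(42)                           # k-means starts from random centers
km <- kmeans(X, centers = 3, nstart = 10)

# Compare the clusters found with the known species labels.
table(cluster = km$cluster, species = iris$Species)
```

The species labels are not used to fit the clusters; the final table only shows how well the "natural groups" line up with them.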
Example of Descriptive Modeling
[Figure: ANEMIA PATIENTS AND CONTROLS. Scatter plot of Red Blood Cell Hemoglobin Concentration (y-axis, 3.7 to 4.4) against Red Blood Cell Volume (x-axis, 3.3 to 4.0), with the Control Group and Anemia Group labeled]
Example of Descriptive Modeling
[Figure: two panels on the same axes (Red Blood Cell Hemoglobin Concentration vs. Red Blood Cell Volume): ANEMIA PATIENTS AND CONTROLS, with the Control Group and Anemia Group labeled, and EM ITERATION 25]
WebCanvas algorithm and software - currently in new SQLServer
Predictive Modeling
• Predict one variable Y given a set of other variables X
– Here X could be a p-dimensional vector
– Classification: Y is categorical
– Regression: Y is real-valued
• In effect this is function approximation, learning the relationship
between Y and X
• Many, many algorithms for predictive modeling in statistics and machine
learning
• Often the emphasis is on predictive accuracy, less emphasis on
understanding the model
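The two flavours of prediction can be sketched with base R model-fitting functions. The built-in `iris` data, the chosen variables, and the use of linear and logistic regression are illustrative assumptions; many other algorithms would fit the same description.

```r
data(iris)

# Regression: Y is real-valued.
reg <- lm(Petal.Width ~ Petal.Length, data = iris)
predict(reg, newdata = data.frame(Petal.Length = 4))

# Classification: Y is categorical (here reduced to a two-class problem).
iris$is_virginica <- as.integer(iris$Species == "virginica")
clf <- glm(is_virginica ~ Petal.Length, data = iris, family = binomial)
p_hat <- predict(clf, type = "response")        # estimated P(Y = 1 | X)
accuracy <- mean((p_hat > 0.5) == iris$is_virginica)
accuracy                                        # training accuracy, so optimistic
```

Accuracy measured on the training data overstates how well the model will predict new cases; a held-out test set gives a fairer estimate.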
Predictive Modeling: Fraud Detection
• Telecommunications fraud detection
– Fraud costs telecommunication companies billions of US dollars per year
– very few transactions are fraudulent, but they are costly
• Approach
– For each transaction estimate “fraudiness”.
– Based on known fraud AND known user behavior
– High probability cases investigated by fraud police
• Example models:
– anomaly detection
– guilt by association
• Issues
– Significant feature engineering/preprocessing
– False alarm rate vs. missed detection: what is the tradeoff?
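The tradeoff between false alarms and missed detections can be seen by sweeping the alarm threshold over a set of scores. The sketch below uses simulated scores; the class sizes, the score distributions, and the thresholds are all made up for illustration and do not describe any real fraud system.

```r
# Simulated "fraudiness" scores; all numbers are made up for illustration.
set.seed(7)
n_legit <- 9900
n_fraud <- 100                                  # fraud is rare
score <- c(rnorm(n_legit, mean = 0, sd = 1),    # legitimate transactions
           rnorm(n_fraud, mean = 2, sd = 1))    # fraudulent ones tend to score higher
is_fraud <- rep(c(FALSE, TRUE), c(n_legit, n_fraud))

# For each alarm threshold, the two error rates pull in opposite directions.
tradeoff <- t(sapply(c(1, 2, 3), function(th) {
  flagged <- score > th
  c(threshold   = th,
    false_alarm = mean(flagged[!is_fraud]),     # legitimate transactions flagged
    missed      = mean(!flagged[is_fraud]))     # fraud that was not flagged
}))
tradeoff
```

Raising the threshold can only lower the false alarm rate and raise the missed detection rate, so the choice depends on the relative cost of each error.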
Predictive Modeling: Other Examples
• Risk Management:
– Example: Credit card company wants to do risk management.
– How would you do this?
• Netflix Prize
– US$1M prize to make better movie recommendations.
– How would you do this?
Pattern Discovery
• Goal is to discover interesting “local” patterns in the
data rather than to characterize the data globally
• Given market basket data we might discover that
– If customers buy wine and bread then they buy cheese with
probability 0.9
– These are known as “association rules”
• Given multivariate data on astronomical objects
– We might find a small group of previously undiscovered objects
that are very self-similar in our feature space, but are very far
away in feature space from all other objects
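The probability attached to an association rule is usually called its confidence. It can be computed by counting baskets, as in this minimal sketch; the six baskets are made-up data and `has_all` is a helper name introduced here.

```r
# Toy market baskets (made-up data, one character vector per transaction).
baskets <- list(
  c("wine", "bread", "cheese"),
  c("wine", "bread", "cheese", "olives"),
  c("wine", "bread"),
  c("bread", "milk"),
  c("wine", "cheese"),
  c("wine", "bread", "cheese")
)

# TRUE for each basket that contains every item in `items`.
has_all <- function(items) sapply(baskets, function(b) all(items %in% b))

lhs  <- c("wine", "bread")     # "if" part of the rule
rule <- c(lhs, "cheese")       # "if" part plus the "then" item

support    <- mean(has_all(rule))                     # share of baskets with all three items
confidence <- sum(has_all(rule)) / sum(has_all(lhs))  # P(cheese | wine and bread)
c(support = support, confidence = confidence)
```

Here four baskets contain wine and bread and three of those also contain cheese, so the rule has confidence 0.75 and support 0.5.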
Example of Pattern Discovery
• IBM “Advanced Scout” System
– Bhandari et al. (1997)
– Every NBA basketball game is annotated,
• e.g., time = 6 mins, 32 seconds
event = 3 point basket
player = Michael Jordan
• This creates a huge untapped database of information
– IBM algorithms search for rules of the form
“If player A is in the game, player B’s scoring rate increases from
3.2 points per quarter to 8.7 points per quarter”
Data Mining Pitfalls
• Is data mining always necessary?
– Just because you have a terabyte doesn’t mean you need
to use it.
• Privacy concerns
– Differ by country, industry, application, generation
• Meaningfulness of patterns unclear
– Rhine paradox
– Terrorism
– DM has a lot to learn from statistics!
Rhine Paradox
• Joseph Rhine: parapsychologist who studied ESP (he was a
believer!)
• He devised an experiment where subjects were asked to guess
10 hidden cards --- red or blue.
• He discovered that almost 1 in 1000 had ESP --- they were
able to get all 10 right!
• He told these people they had ESP and called them in for
another test of the same type.
• Alas, he discovered that almost all of them had lost their ESP.
• What is the conclusion?
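A quick calculation suggests one answer. The sketch below assumes every subject is guessing at random, with each card equally likely to be red or blue; the pool of 1000 subjects is a made-up figure.

```r
# Chance of guessing 10 red/blue cards correctly with no ESP at all.
p_all_right <- 0.5^10
1 / p_all_right                   # 1024, i.e. about 1 in 1000

# Among 1000 pure guessers we expect about one apparent "psychic" ...
expected_psychics <- 1000 * p_all_right
# ... and the chance that such a person repeats the feat is again 1 in 1024.
p_repeat <- p_all_right

c(expected_psychics = expected_psychics, p_repeat = p_repeat)
```

Test enough people and someone will get all 10 right by luck alone; the retest fails because there was never anything to find.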
Data Mining Pitfalls
• PR Problems: data mining as a four letter word?
– ...increasingly people’s data is at risk. The old ways ...are still at use like dumpster
diving, stealing from mailboxes, physical theft, and credit card receipt copying. New
tactics include disparate techniques of phishing, email fraud, data mining, spam, keylogging and an array of other technological processes. - Steven D. Domenikos,
IdentityTruth, 2008
– One place oversight is sorely lacking is in the whole matter of data mining. ...What
have they contributed? Not a single case comes to mind in which security services
apprehended a terrorist following identification by data mining. ...that huge database
will be out there, win or lose, for some government agency to divert to its purposes or
some hacker to turn to private gain or crime. - John Prados, TomPaine.com
Fighting Terrorism in the US
• US Government is widely known to be collecting lots of data on
Americans and using data mining to look for patterns consistent with
terrorist activity.
• Bruce Schneier, Wired Magazine, “Why Data Mining Won’t Stop Terror”:
• Assume:
– 1 in 100 false positive (99% accurate)
– 1 in 1000 false negative
– 1 trillion events (phone calls, credit card transactions, emails) per year
– 10 are really terrorist plots
• Then:
– 1 billion false alarms for every true plot uncovered
– 27 million leads daily
– Even at 99.9999% accuracy: 2,750 false alarms per day
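The figures above follow from simple base-rate arithmetic. The sketch below reads the 1 trillion events as an annual total, which is an assumption, but one that reproduces the daily numbers quoted; the variable names are introduced here for illustration.

```r
# Assumption: the 1 trillion events are an annual total
# (this reading reproduces the daily figures quoted on the slide).
events_per_year <- 1e12
real_plots      <- 10
fp_rate         <- 1 / 100       # 1 in 100 false positive

false_alarms    <- events_per_year * fp_rate    # false alarms per year
alarms_per_plot <- false_alarms / real_plots    # false alarms per real plot
alarms_per_day  <- false_alarms / 365           # leads to investigate daily

# Same calculation with a false positive rate of 1 in a million (99.9999%).
alarms_per_day_best <- events_per_year * 1e-6 / 365

c(per_plot = alarms_per_plot, per_day = alarms_per_day,
  per_day_best = alarms_per_day_best)
```

Because real plots are so rare compared with innocent events, even a tiny false positive rate swamps the handful of true positives.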
Data Mining Software:
Introduction to R
Data Mining Software
• What is R?
– Open source statistical software
– Grew out of S, S+
– www.r-project.org
– http://cran.stat.auckland.ac.nz/
• R Tutorials available online (see website and CRAN)
• Great graphics
R examples
> x = 5
> y = rnorm(1000, -1, 3.5)    # 1000 normal draws with mean -1 and sd 3.5
> hist(y)
> ?hist                       # open the help page for hist
> hist(y, nclass=20, col='orange')
> summary(y)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-12.7400  -3.3080  -0.8247  -0.8101   1.5820  10.5500
> t.test(y, mu=0)
data: y
t = -7.1286, df = 999, p-value = 1.942e-12
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-1.0330474 -0.5870667
sample estimates:
mean of x
-0.810057
> mydata = read.table("iris.dat", sep=" ")    # plain quotes; curly quotes are a syntax error in R
> summary(mydata)
 Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
R Examples
• Data stored in “data frames”
– read.table or read.csv reads data into a data frame
• Check the contents of ‘mydata’
– summary(mydata)
– names(mydata) – lists all the variable names
– mydata[1,] – shows only the first row of data
– mydata[,1] – shows only the first column
– mydata[,2:5] – selects columns 2 through 5
– mydata$Sepal.Length – shows all values of the variable “Sepal.Length”
– attach(mydata) – allows you to access Sepal.Length and other columns directly
– mydata[Sepal.Length > 3,] – selects the rows where Sepal.Length > 3
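These indexing commands can be tried without the `iris.dat` file, since the same data ships with R as `iris`. In this sketch the row filter uses a threshold of 7 (every Sepal.Length in iris exceeds 3) and names the column explicitly instead of relying on `attach()`; both are choices made here for illustration.

```r
# iris ships with R, so this runs without the iris.dat file.
mydata <- iris

names(mydata)                  # variable names
mydata[1, ]                    # first row
head(mydata[, 1])              # first column (first few values)
head(mydata[, 2:5])            # columns 2 through 5
head(mydata$Sepal.Length)      # one variable by name

# Conditional rows without attach(): refer to the column explicitly.
big <- mydata[mydata$Sepal.Length > 7, ]
nrow(big)
```

Writing `mydata$Sepal.Length` avoids the name clashes that `attach()` can cause when several data frames share column names.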
R Examples
• Modelling requires formula notation: lm.out = lm(y~x)
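A minimal sketch of the formula notation in use, with simulated data. The true intercept (2), slope (0.5), and noise level are made-up values chosen so the fit can be checked by eye.

```r
# Simulated data: y depends linearly on x plus noise (made-up coefficients).
set.seed(123)
x <- runif(100, 0, 10)
y <- 2 + 0.5 * x + rnorm(100, sd = 1)

lm.out <- lm(y ~ x)          # formula: response ~ predictor
coef(lm.out)                 # estimated intercept and slope
summary(lm.out)$r.squared    # proportion of variance explained
```

The formula reads "y modelled as a function of x"; more predictors are added with `+`, for example `y ~ x1 + x2`.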
Lab #1
• R Tutorial
– Courtesy of Di Cook
– Work your way through the R tutorial (intro-R.pdf).
– Code is available in the file introductory-code.txt
– Input spam data (collected at Iowa State University)
– Find summaries and simple manipulations
– Write R functions using function() and apply()
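As a warm-up for the last item, here is a minimal sketch of `function()` and `apply()` working together. The built-in `iris` data stands in for the lab's spam data, and `spread` is a helper name made up for this example.

```r
# A user-defined function: the range (max minus min) of a numeric vector.
spread <- function(v) max(v) - min(v)

# apply() runs a function over rows (MARGIN = 1) or columns (MARGIN = 2).
m <- as.matrix(iris[, 1:4])       # iris stands in for the lab's spam data
col_spread <- apply(m, 2, spread)
col_spread
```

Changing the second argument to 1 would instead compute the spread across the four measurements of each flower.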