“Data Science”
IRDS: Data Mining Process
Charles Sutton
University of Edinburgh
• Our working definition: Data science is the study of the computational principles, methods, and systems for extracting knowledge from data.
• A relatively new term. A lot of current hype…
• “If you have to put ‘science’ in the name…”
• Component areas have a long history:
  • machine learning
  • databases
  • statistics
  • optimization
  • natural language processing
  • computer vision
  • speech processing
  • applications to science, business, health…
• Difficult to find another term for this intersection
(many figures used from Murphy. Machine Learning: A Probabilistic Perspective.)
The term “data mining”
Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarise the data in novel ways that are both understandable and useful to the data owner. — Hand, Mannila, Smyth, 2001
• “Observational”: not collected for the purpose of your analysis.
• “Unsuspected”: many “easy” patterns are already known, e.g., the pregnancy example from association rule mining.
• “Understandable and useful”: a tradeoff between predictive performance and human interpretability. Ex: neural networks vs decision trees.
What problem am I trying to solve?
Before I get too far ahead of myself…
Problem Types
• Visualization
• Prediction: learn a map x —> y (supervised learning)
  • Classification: predict a categorical value
  • Regression: predict a real value
  • Others: collaborative filtering, learning to rank, structured prediction
• Description (unsupervised learning)
  • Clustering
  • Dimensionality reduction
  • Density estimation
  • Finding patterns: association rule mining, detecting anomalies / outliers
Classification
• Advertising. Ex: given the text of an online advertisement and a search engine query, predict whether a user will click on the ad.
• Document classification. Ex: spam filtering.
• Object detection. Ex: given an image patch, does it contain a face?
Regression
• Predict the final vote in an election (or referendum) from polls.
• Predict the temperature tomorrow given the previous few days.
Sometimes augmented with other structure / information:
• Structured prediction: spatial data, time series data. Ex: predicting coding regions in DNA.
• Collaborative filtering (Amazon, Netflix)
• Semi-supervised learning
Description Examples
• Clustering: assign data into groups with high intra-group similarity (like classification, except without examples of “correct” group assignments)
  • Ex: cluster users into groups, based on behaviour
  • Social network analysis
  • The Autoclass system (Cheeseman et al. 1988) discovered a new type of star
• Dimensionality reduction
  • Eigenfaces
  • Topic modelling
• Discovering graph structure
  • Ex: transcription networks
  • Ex: JamBayes for Seattle traffic jams
• Association rule mining
  • Market basket data
  • Computer security
Dimensionality reduction
[Figure: “General problem” (a 2-D scatter plot); “Application to Images” (panels labelled “Data” and “Basis”)]
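The slide itself is figure-only. As one concrete instance of the pictured idea (re-expressing data in a small learned basis), here is a minimal PCA-via-SVD sketch in numpy; PCA is my choice of example, since the slide does not name a specific method.

```python
import numpy as np

# Minimal PCA sketch: project d-dimensional data onto its top-k principal
# directions. (Illustrative only: the slide shows the idea as a figure and
# does not prescribe a method.)
def pca(X, k):
    """X: (N, d) data matrix. Returns (N, k) projections and the (k, d) basis."""
    mu = X.mean(axis=0)
    Xc = X - mu                          # centre the data
    # Rows of Vt are orthonormal directions, ordered by variance explained.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    basis = Vt[:k]                       # top-k principal directions
    Z = Xc @ basis.T                     # low-dimensional coordinates
    return Z, basis

# Toy usage: 2-D points lying near a 1-D subspace, reduced to 1 dimension.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1)) @ np.array([[2.0, 1.0]]) + 0.1 * rng.normal(size=(100, 2))
Z, basis = pca(X, k=1)
print(Z.shape, basis.shape)  # (100, 1) (1, 2)
```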
Data Analysis Process
[Flowchart: Phrase as a machine learning problem → Collect data → Prepare, clean data → Select features → Train model → Evaluate model → Performance good enough?
  No → decide how to improve the model (more data → collect data; better features → select features; different learning algorithm → train model) and repeat.
  Yes → Deploy model → Monitor deployed model → Performance still good? If no, return to improving the model.]
Inspired by Wagstaff, 2012, “Machine Learning that Matters”. For another, more industrial process, see CRISP-DM.

Roadmap
In the next few weeks, we’ll talk about:
• Visualization
• Feature extraction
• Evaluation and debugging
But to talk about these, we still need to understand the representation behind the algorithms.
Two Representation Problems
[Figure: three pipelines from Data to Feature Vectors (X) to Models/Patterns —
  Image classification: feature extraction → X → learning → a linear boundary w separating x’s from o’s in (x1, x2) space;
  Document clustering: feature extraction → X → learning → groups of nearby points;
  Association rule mining: transaction database → feature extraction → X → learning → rules such as “if Diapers = 1 then Beer = 1” and “if Ham = 1 then Pineapple = 1”, …]
1. Given the input, what goes in the feature vector?
2. What is the space of possible models?
• In this course, we discuss features (see the sketch below).
• Models —> IAML, PMR, MLPR
• But: to pick features, you must understand the model.
• So: a whirlwind tour of models, leaving out the learning algorithms.
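To make question 1 concrete for the document-clustering pipeline in the figure, here is a hypothetical bag-of-words featurizer; the function name and the simple whitespace tokenisation are my own, not from the lecture.

```python
from collections import Counter
import numpy as np

# Hypothetical bag-of-words feature extraction: each document becomes a
# word-count vector, giving the matrix X handed to the learning step.
def bag_of_words(docs):
    vocab = sorted({w for doc in docs for w in doc.lower().split()})
    index = {w: j for j, w in enumerate(vocab)}
    X = np.zeros((len(docs), len(vocab)))
    for i, doc in enumerate(docs):
        for w, n in Counter(doc.lower().split()).items():
            X[i, index[w]] = n
    return X, vocab

docs = ["the cat sat", "the dog sat", "stocks rose sharply"]
X, vocab = bag_of_words(docs)
print(vocab)   # the feature names
print(X)       # one row of counts per document
```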
Linear regression
Let $x \in \mathbb{R}^d$ denote the feature vector. Trying to predict $y \in \mathbb{R}$.
Simplest choice: a linear function. Define parameters $w \in \mathbb{R}^d$:
$$\hat{y} = f(x, w) = w^\top x = \sum_{j=1}^{d} w_j x_j$$
(to keep notation simple, assume that always $x_d = 1$)
Given a data set $x^{(1)}, \ldots, x^{(N)}, y^{(1)}, \ldots, y^{(N)}$, find the best parameters
$$\min_w \sum_{i=1}^{N} \left( y^{(i)} - w^\top x^{(i)} \right)^2$$
which can be solved easily (but I won’t say how).

Nonlinear regression
What if we want to learn a nonlinear function?
Trick: define new features, e.g., for scalar $x$, define $\phi(x) = (1, x, x^2)^\top$:
$$\hat{y} = f(x, w) = w^\top \phi(x)$$
This is still linear in $w$. To find the parameters, the minimisation problem is now
$$\min_w \sum_{i=1}^{N} \left( y^{(i)} - w^\top \phi(x^{(i)}) \right)^2$$
exactly the same form as before (because $\phi(x)$ is fixed), so still just as easy.
[Figure: a degree-2 polynomial fit to nonlinear data]
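Both minimisation problems above are ordinary least squares, so “can be solved easily” can be made concrete in a few lines of numpy. A minimal sketch using the slide’s degree-2 feature map $\phi(x) = (1, x, x^2)$; the toy data are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D data from a nonlinear function.
x = np.linspace(-2, 2, 50)
y = x**2 - x + 0.3 * rng.normal(size=x.shape)

# Linear regression: features (x, 1), with the constant feature x_d = 1
# appended as on the slide. lstsq solves min_w sum_i (y_i - w^T x_i)^2.
X_lin = np.column_stack([x, np.ones_like(x)])
w_lin, *_ = np.linalg.lstsq(X_lin, y, rcond=None)

# Nonlinear regression via the feature map phi(x) = (1, x, x^2):
# still linear in w, so exactly the same solver applies.
Phi = np.column_stack([np.ones_like(x), x, x**2])
w_phi, *_ = np.linalg.lstsq(Phi, y, rcond=None)

print("linear fit w:", w_lin)
print("degree-2 fit w:", w_phi)   # should be close to (0, -1, 1)
```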
Logistic regression
(a classification method, despite the name)
Linear regression was easy. Can we do linear classification too?
Define a discriminant function
$$f(x, w) = w^\top x$$
Then predict using
$$y = \begin{cases} 1 & \text{if } f(x, w) \ge 0 \\ 0 & \text{otherwise} \end{cases}$$
This yields a linear decision boundary.
[Figure: x’s and o’s in (x1, x2) space, separated by a linear boundary normal to w]
Can get class probabilities from this idea, using logistic regression:
$$p(y = 1 \mid x) = \frac{1}{1 + \exp\{-w^\top x\}}$$
(to show the decision boundaries are the same, compute the log odds $\log \frac{p(y=1 \mid x)}{p(y=0 \mid x)}$)

K-Nearest Neighbour
A simple method for classification or regression.
Define a distance function between feature vectors, $D(x, x')$.
To classify a new feature vector $x$:
1. Look through your training set. Find the $K$ closest points. Call them $N_K(x)$. (This is memory-based learning.)
2. Return the majority vote.
3. If you want a probability, take the proportion:
$$p(y = c \mid x) = \frac{1}{K} \sum_{(y', x') \in N_K(x)} I\{y' = c\}$$
(The running time of this algorithm is terrible. See IAML for better indexing.)
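A minimal numpy sketch of the two prediction rules above: the thresholded discriminant with its sigmoid probability, and K-NN’s majority vote with the class-proportion probability from step 3. Learning $w$ is omitted, as on the slide, and the tiny data set and the hand-picked $w$ are made up.

```python
import numpy as np

# Logistic regression: f(x, w) = w^T x gives the hard decision, and the
# sigmoid turns it into p(y = 1 | x). (Learning w is not shown, matching
# the slide.)
def logistic_predict(X, w):
    p = 1.0 / (1.0 + np.exp(-X @ w))   # p(y = 1 | x)
    return (p >= 0.5).astype(int), p   # f(x, w) >= 0  <=>  p >= 0.5

# K-nearest neighbour with Euclidean distance D(x, x'): find the K closest
# training points N_K(x), return the majority vote and the proportion.
def knn_predict(x, X_train, y_train, K=3):
    d = np.linalg.norm(X_train - x, axis=1)   # D(x, x') for all training points
    nk = np.argsort(d)[:K]                    # indices of N_K(x)
    p1 = y_train[nk].mean()                   # p(y = 1 | x) = (1/K) sum I{y' = 1}
    return int(p1 >= 0.5), p1

# Made-up data: two classes along the first feature, constant feature appended.
X_train = np.array([[-2, 1.0], [-1, 1], [-1.5, 1], [1, 1], [2, 1], [1.5, 1]])
y_train = np.array([0, 0, 0, 1, 1, 1])
w = np.array([1.0, 0.0])                      # a hand-picked discriminant

print(logistic_predict(X_train, w))
print(knn_predict(np.array([0.8, 1.0]), X_train, y_train, K=3))
```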
K-Nearest Neighbour
Decision boundaries can be highly nonlinear. The bigger the $K$, the smoother the boundary.
This is nonparametric: the complexity of the boundary varies depending on the amount of training data.
[Figure: training data, and predicted labels for K=1 and K=5]

K-means clustering
To split the data into $K$ clusters, iterate:
Initialize cluster centroids randomly: $\mu_1, \ldots, \mu_K$
Repeat:
  For each data point $i$, assign it to the closest cluster:
  $$k_i \leftarrow \arg\min_{k \in \{1 \ldots K\}} D(\mu_k, x^{(i)})$$
  Move cluster centroids (average all points currently in the cluster). For all $k$ in $1, 2, \ldots, K$:
  $$\mu_k \leftarrow \frac{\sum_{i=1}^{N} x^{(i)} I\{k_i = k\}}{\sum_{i=1}^{N} I\{k_i = k\}}$$
Until converged.
Results in “convex” clusters.
[Figure: centroids c1, c2, c3 after iterations k = 1 and k = 5]
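A minimal sketch of the loop above, assuming Euclidean distance for $D$ and initialising centroids by sampling $K$ data points (the slide says only “randomly”, so the initialisation scheme is my assumption).

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal K-means: X is (N, d). Returns centroids (K, d) and assignments."""
    rng = np.random.default_rng(seed)
    # Initialize centroids randomly -- here by picking K data points
    # (an assumption; the slide does not specify the scheme).
    mu = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Assignment step: k_i <- arg min_k D(mu_k, x_i), Euclidean distance.
        d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)  # (N, K)
        k_i = d.argmin(axis=1)
        # Update step: mu_k <- average of the points currently in cluster k.
        new_mu = np.array([X[k_i == k].mean(axis=0) if np.any(k_i == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):    # until converged
            break
        mu = new_mu
    return mu, k_i

# Toy usage: three blobs in 2-D.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(30, 2)) for c in [(0, 0), (3, 0), (0, 3)]])
mu, k_i = kmeans(X, K=3)
print(mu)   # should land near (0,0), (3,0), (0,3)
```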
Summary
• Different types of model structures:
  1. Linear boundaries (for classification and regression)
  2. Nonlinear boundaries (but linear in a set of features)
  3. “Wavy” boundaries (nonparametric, piecewise linear)
  4. Convex boundaries (with respect to Euclidean distance)
• This will affect feature construction, soon.