DATA MINING
LECTURE 11
Classification
Naïve Bayes
Graphs And Centrality
NAÏVE BAYES CLASSIFIER
Bayes Classifier
• A probabilistic framework for solving classification
problems
• A, C random variables
• Joint probability: Pr(A=a,C=c)
• Conditional probability: Pr(C=c | A=a)
• Relationship between joint and conditional probability distributions:

  Pr(C, A) = Pr(C | A) · Pr(A) = Pr(A | C) · Pr(C)

• Bayes Theorem:

  P(C | A) = P(A | C) · P(C) / P(A)
Bayesian Classifiers
• Consider each attribute and class label as random variables

Tid  Refund  Marital Status  Taxable Income  Evade
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

(Refund and Marital Status are categorical attributes, Taxable Income is continuous, and Evade is the class.)
Evade → C
  Event space: {Yes, No}
  P(C) = (0.3, 0.7)
Refund → A1
  Event space: {Yes, No}
  P(A1) = (0.3, 0.7)
Marital Status → A2
  Event space: {Single, Married, Divorced}
  P(A2) = (0.4, 0.4, 0.2)
Taxable Income → A3
  Event space: ℝ
  P(A3) ~ Normal(μ, σ²)
Bayesian Classifiers
• Given a record X over attributes (A1, A2,…,An)
• E.g., X = (‘Yes’, ‘Single’, 125K)
• The goal is to predict class C
• Specifically, we want to find the value c of C that maximizes
P(C=c| X)
• Can we estimate P(C| X) directly from data?
• This means that we estimate the probability for all possible
values of the class variable.
Bayesian Classifiers
• Approach:
• compute the posterior probability P(C | A1, A2, …, An) for all
values of C using the Bayes theorem
  P(C | A1, A2, …, An) = P(A1, A2, …, An | C) · P(C) / P(A1, A2, …, An)
• Choose value of C that maximizes
P(C | A1, A2, …, An)
• Equivalent to choosing value of C that maximizes
P(A1, A2, …, An|C) P(C)
• How to estimate P(A1, A2, …, An | C )?
Naïve Bayes Classifier
• Assume independence among attributes Ai when class is
given:
• P(A1, A2, …, An | C) = P(A1 | C) · P(A2 | C) ⋯ P(An | C)
• We can estimate P(Ai| C) for all values of Ai and C.
• New point X is classified to class c if
  P(C = c | X) ∝ P(C = c) · ∏_i P(A_i | c)
  is maximal over all possible values of C.
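For illustration, here is a minimal Python sketch (not from the lecture) of this decision rule. The dictionaries priors and cond are assumed to already hold the estimated probabilities P(C = c) and P(A_i = a | C = c).

def naive_bayes_predict(x, priors, cond):
    """x: dict attribute -> value, priors: {class: P(C=c)},
    cond: {(attribute, value, class): P(A_i = value | C = c)}."""
    best_class, best_score = None, -1.0
    for c, p_c in priors.items():
        score = p_c                                    # start from the class prior P(C = c)
        for attr, value in x.items():
            score *= cond.get((attr, value, c), 0.0)   # multiply by P(A_i = value | C = c)
        if score > best_score:
            best_class, best_score = c, score
    return best_class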
How to Estimate Probabilities from Data?
(Training data: the same 10-record table as before.)
• Class Prior Probability:
  P(C = c) = N_c / N
  e.g., P(C = No) = 7/10, P(C = Yes) = 3/10
• For discrete attributes:
  P(A_i = a | C = c) = N_{a,c} / N_c
  where N_{a,c} is the number of instances having attribute A_i = a and belonging to class c
• Examples:
  P(Status=Married|No) = 4/7
  P(Refund=Yes|Yes) = 0
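As an assumed illustration (not the lecture's code), these counts can be computed in Python from the 10-record table; only Refund and Marital Status are used here.

from collections import Counter

records = [  # (Refund, Marital Status, Evade)
    ("Yes", "Single", "No"), ("No", "Married", "No"), ("No", "Single", "No"),
    ("Yes", "Married", "No"), ("No", "Divorced", "Yes"), ("No", "Married", "No"),
    ("Yes", "Divorced", "No"), ("No", "Single", "Yes"), ("No", "Married", "No"),
    ("No", "Single", "Yes"),
]

N = len(records)
class_counts = Counter(c for _, _, c in records)          # N_c
priors = {c: n / N for c, n in class_counts.items()}      # P(C = c) = N_c / N

status_counts = Counter((status, c) for _, status, c in records)   # N_{a,c}
p_married_given_no = status_counts[("Married", "No")] / class_counts["No"]

print(priors)               # {'No': 0.7, 'Yes': 0.3}
print(p_married_given_no)   # 4/7 ≈ 0.571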
How to Estimate Probabilities from Data?
• For continuous attributes:
• Discretize the range into bins
• one ordinal attribute per bin
• violates independence assumption
• Two-way split: (A < v) or (A > v)
• choose only one of the two splits as new attribute
• Probability density estimation:
• Assume attribute follows a normal distribution
• Use data to estimate parameters of distribution
(e.g., mean μ and standard deviation σ)
• Once probability distribution is known, can use it to estimate the
conditional probability P(Ai|c)
How to Estimate Probabilities from Data?
(Training data: the same 10-record table as before.)
• Normal distribution:

  P(A_i = a | c_j) = 1/√(2π σ_ij²) · exp( −(a − μ_ij)² / (2σ_ij²) )

• One for each (A_i, c_j) pair
• For (Income, Class=No):
  • sample mean = 110
  • sample variance = 2975

  P(Income = 120 | No) = 1/(√(2π) · 54.54) · exp( −(120 − 110)² / (2 · 2975) ) ≈ 0.0072
Example of Naïve Bayes Classifier
Given a Test Record:
  X = (Refund = No, Married, Income = 120K)

Naïve Bayes Classifier:
  P(Refund=Yes|No) = 3/7
  P(Refund=No|No) = 4/7
  P(Refund=Yes|Yes) = 0
  P(Refund=No|Yes) = 1
  P(Marital Status=Single|No) = 2/7
  P(Marital Status=Divorced|No) = 1/7
  P(Marital Status=Married|No) = 4/7
  P(Marital Status=Single|Yes) = 2/7
  P(Marital Status=Divorced|Yes) = 1/7
  P(Marital Status=Married|Yes) = 0

For Taxable Income:
  If Class=No: sample mean = 110, sample variance = 2975
  If Class=Yes: sample mean = 90, sample variance = 25

• P(X|Class=No) = P(Refund=No|Class=No) × P(Married|Class=No) × P(Income=120K|Class=No)
                = 4/7 × 4/7 × 0.0072 = 0.0024

• P(X|Class=Yes) = P(Refund=No|Class=Yes) × P(Married|Class=Yes) × P(Income=120K|Class=Yes)
                 = 1 × 0 × 1.2 × 10⁻⁹ = 0

Since P(X|No)·P(No) > P(X|Yes)·P(Yes),
P(No|X) > P(Yes|X) => Class = No
Naïve Bayes Classifier
• If one of the conditional probabilities is zero, then the entire expression becomes zero
• Probability estimation:

  Original:   P(A_i = a | C = c) = N_ac / N_c

  Laplace:    P(A_i = a | C = c) = (N_ac + 1) / (N_c + N_i)

  m-estimate: P(A_i = a | C = c) = (N_ac + m·p) / (N_c + m)

  N_i: number of values of attribute A_i
  p: prior probability
  m: parameter
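A hedged Python sketch of the two smoothed estimates (the function names are my own, not the lecture's):

def laplace_estimate(N_ac, N_c, N_i):
    """Laplace: (N_ac + 1) / (N_c + N_i), where N_i is the number of values of attribute A_i."""
    return (N_ac + 1) / (N_c + N_i)

def m_estimate(N_ac, N_c, p, m):
    """m-estimate: (N_ac + m*p) / (N_c + m), with prior probability p and parameter m."""
    return (N_ac + m * p) / (N_c + m)

# e.g., P(Refund=Yes | Yes) with Laplace smoothing: N_ac = 0, N_c = 3, 2 attribute values
print(laplace_estimate(0, 3, 2))   # 0.2 instead of 0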
Implementation details
• Computing the conditional probabilities involves
multiplication of many very small numbers
• Numbers get very close to zero, and there is a danger
of numeric instability
• We can deal with this by computing the logarithm
of the conditional probability
  log P(C | A) ∝ log P(A | C) + log P(C) = Σ_i log P(A_i | C) + log P(C)
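A possible log-space scoring function in Python (a sketch, reusing the assumed priors/cond dictionaries from the earlier example):

import math

def log_score(x, priors, cond):
    """Return {class: log P(C=c) + sum_i log P(A_i = x_i | c)}; -inf if any factor is 0."""
    scores = {}
    for c, p_c in priors.items():
        s = math.log(p_c)
        for attr, value in x.items():
            p = cond.get((attr, value, c), 0.0)
            s += math.log(p) if p > 0 else float("-inf")   # sum of logs instead of a product
        scores[c] = s
    return scores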
Naïve Bayes (Summary)
• Robust to isolated noise points
• Handle missing values by ignoring the instance
during probability estimate calculations
• Robust to irrelevant attributes
• Independence assumption may not hold for some
attributes
• Use other techniques such as Bayesian Belief Networks
(BBN)
• Naïve Bayes can produce a probability estimate, but
it is usually a very biased one
• Logistic Regression is better for obtaining probabilities.
Generative vs Discriminative models
• Naïve Bayes is a type of a generative model
• Generative process:
• First pick the category of the record
• Then given the category, generate the attribute values from the
distribution of the category
• Conditional independence of A1, A2, …, An given C
(Figure: graphical model with class node C pointing to attribute nodes A1, A2, …, An.)
• We use the training data to learn the distribution
of the values in a class
Generative vs Discriminative models
• Logistic Regression and SVM are discriminative
models
• The goal is to find the boundary that discriminates
between the two classes from the training data
• In order to classify the language of a document,
you can
• Either learn the two languages and find which is more
likely to have generated the words you see
• Or learn what differentiates the two languages.
SUPERVISED LEARNING
Learning
• Supervised Learning: learn a model from the data
using labeled data.
• Classification and Regression are the prototypical examples of supervised learning tasks. Others are possible (e.g., ranking).
• Unsupervised Learning: learn a model – extract
structure from unlabeled data.
• Clustering and Association Rules are prototypical
examples of unsupervised learning tasks.
• Semi-supervised Learning: learn a model for the
data using both labeled and unlabeled data.
Supervised Learning Steps
• Model the problem
• What is it that you are trying to predict? What kind of optimization function do you need? Do you need classes or probabilities?
• Extract Features
• How do you find the right features that help to discriminate between
the classes?
• Obtain training data
• Obtain a collection of labeled data. Make sure it is large enough,
accurate and representative. Ensure that classes are well
represented.
• Decide on the technique
• What is the right technique for your problem?
• Apply in practice
• Can the model be trained for very large data? How do you test how
you do in practice? How do you improve?
Modeling the problem
• Sometimes it is not obvious. Consider the
following three problems
• Detecting if an email is spam
• Categorizing the queries in a search engine
• Ranking the results of a web search
Feature extraction
• Feature extraction, or feature engineering is the most
tedious but also the most important step
• How do you separate the players of the Greek national team
from those of the Swedish national team?
• One line of thought: throw features to the classifier
and the classifier will figure out which ones are
important
• More features means that you need more training data
• Another line of thought: select carefully the features
using various functions and techniques
• Computationally intensive
Training data
• An overlooked problem: How do you get labeled
data for training your model?
• E.g., how do you get training data for ranking?
• Usually requires a lot of manual effort and domain
expertise and carefully planned labeling
• Results are not always of high quality (lack of expertise)
• And they are not sufficient (low coverage of the space)
• Recent trends:
• Find a source that generates the labeled data for you.
• Crowd-sourcing techniques
Dealing with small amount of labeled data
• Semi-supervised techniques have been developed for this
purpose.
• Self-training: Train a classifier on the data, and then feed
back the high-confidence output of the classifier as input
• Co-training: train two “independent” classifiers and feed
the output of one classifier as input to the other.
• Regularization: Treat learning as an optimization problem
where you define relationships between the objects you
want to classify, and you exploit these relationships
• Example: Image restoration
Technique
• The choice of technique depends on the problem
requirements (do we need a probability
estimate?) and the problem specifics (does
independence assumption hold? Do we think
classes are linearly separable?)
• For many cases finding the right technique may
be trial and error
• For many cases the exact technique does not
matter.
Big Data Trumps Better Algorithms
• If you have enough data then the algorithms are
not so important
• The web has made this
possible.
• Especially for text-related
tasks
• Search engines use the collective human intelligence
http://www.youtube.com/watch?v=nU8DcBF-qo4
Apply-Test
• How do you scale to very large datasets?
• Distributed computing – map-reduce implementations of machine learning algorithms (Mahout, over Hadoop)
• How do you test something that is running
online?
• You cannot get labeled data in this case
• A/B testing
• How do you deal with changes in data?
• Active learning
GRAPHS AND LINK ANALYSIS RANKING
Graphs - Basics
• A graph is a powerful abstraction for modeling
entities and their pairwise relationships.
• G = (V, E)
• Set of nodes V = {v1, …, v5}
• Set of edges E = {(v1, v2), …, (v4, v5)}
• Examples:
  • Social networks
  • Twitter followers
  • Web
  • Collaboration graphs
(Figure: example graph on nodes v1, …, v5.)
Undirected Graphs
• Undirected Graph: The edges are undirected pairs – they can
be traversed in any direction.
• Degree of node: Number of edges incident on the node
• Path: A sequence of edges from one node to another
• We say that the second node is reachable from the first
• Connected Component: A set of nodes such that there is a path between any two nodes in the set
(Figure: example undirected graph on nodes v1, …, v5, with its symmetric 5×5 adjacency matrix A, where A_ij = 1 if {v_i, v_j} is an edge.)
Directed Graphs
• Directed Graph: The edges are ordered pairs – they can be traversed in the
direction from first to second.
• In-degree and Out-degree of a node.
• Path: A sequence of directed edges from one node to another
• We say that the second node is reachable from the first
• Strongly Connected Component: A set of nodes such that there is a directed
path between any two nodes in the set
• Weakly Connected Component: A set of nodes such that there is an undirected path between any two nodes in the set

        0 1 1 0 0
        0 0 0 0 1
  A  =  0 1 0 0 0
        1 1 1 0 0
        1 0 0 1 0

(Adjacency matrix of the example directed graph on v1, …, v5: entry (i, j) is 1 if there is an edge from v_i to v_j.)
Bipartite Graph
• A graph where the vertex set V is partitioned into
two sets V = {L,R}, of size greater than one, such
that there is no edge within each set.
(Figure: bipartite graph with Set L = {v1, v2, v3} and Set R = {v4, v5}.)
Importance problem
• What are the most important nodes in the graph?
• What are the most authoritative pages on the web?
• Who are the important users in Facebook?
• What are the most influential Twitter accounts?
Why is this important?
• When you make a query “microsoft” to Google
why do you get the home page of Microsoft as
the first result?
Link Analysis
• First generation search engines
• view documents as flat text files
• could not cope with size, spamming, user needs
• Second generation search engines
• Ranking becomes critical
• use of Web specific data: Link Analysis
• shift from relevance to authoritativeness
• a success story for network analysis
Link Analysis: Intuition
• A link from page p to page q denotes
endorsement
• page p considers page q an authority on a subject
• use the graph of recommendations
• assign an authority value to every page
Popularity: InDegree algorithm
• Rank pages according to the popularity of
incoming edges
(Example figure: pages with in-degree weights w = 3, 2, 2, 1, 1.)
1. Red Page
2. Yellow Page
3. Blue Page
4. Purple Page
5. Green Page
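A minimal Python sketch of the InDegree idea (the edge list is an invented toy example, not the lecture's figure):

from collections import Counter

edges = [("green", "red"), ("purple", "red"), ("yellow", "red"),
         ("red", "yellow"), ("blue", "yellow"), ("red", "blue"),
         ("yellow", "blue"), ("blue", "purple"), ("purple", "green")]

in_degree = Counter(target for _, target in edges)           # number of incoming edges per page
ranking = sorted(in_degree, key=in_degree.get, reverse=True)
print(ranking)   # ['red', 'yellow', 'blue', 'purple', 'green']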
Popularity
• Can you think of a case where this could be a problem?
• It is not only important how many pages link to you, but also how important the pages that link to you are.
PageRank algorithm [BP98]
• Good authorities should be pointed to by good authorities
• The value of a page is determined by the value of the pages that link to it
• How do we implement that?
• Each page has a value.
• Proceed in iterations,
• in each iteration every page
distributes the value to the neighbors
• Continue until there is
convergence.
  PR(p) = Σ_{q → p} PR(q) / F(q)

  where F(q) is the number of outgoing links of page q.
1. Red Page
2. Purple Page
3. Yellow Page
4. Blue Page
5. Green Page
Random Walks on Graphs
• What we described is equivalent to a random walk on the
graph
• Random walk:
• Pick a node uniformly at random
• Pick one of the outgoing edges uniformly at random
• Repeat.
• Question:
• What is the probability that after N steps you will be at node x? Or, after N steps, what fraction of the time have you visited node x?
• The answer is the same for these two questions
• When 𝑁 → ∞ this number converges to a single value regardless of
the starting point!
PageRank algorithm [BP98]
• Random walk on the web graph
(the Random Surfer model)
• pick a page at random
• with probability 1- α jump to a random
page
• with probability α follow a random
outgoing link
• Rank according to the stationary
distribution
  PR(p) = α Σ_{q → p} PR(q) / F(q) + (1 − α) · (1/n)
1. Red Page
2. Purple Page
3. Yellow Page
4. Blue Page
5. Green Page
Markov chains
• A Markov chain describes a discrete time stochastic
process over a set of states
S = {s1, s2, … sn}
according to a transition probability matrix
P = {Pij}
• Pij = probability of moving to state j when at state i
• ∑jPij = 1 (stochastic matrix)
• Memorylessness property: The next state of the chain depends only on the current state and not on the past of the process (first-order MC)
• higher order MCs are also possible
Random walks
• Random walks on graphs correspond to Markov
Chains
• The set of states S is the set of nodes of the graph G
• The transition probability matrix is the probability that
we follow an edge from one node to another
An example
0
0

A  0

1
1
1 1 0 0
0 0 0 1 
1 0 0 0

1 1 0 0
0 0 0 1 
0 12 12 0 0 
0

0
0
0
1


P 0
1
0 0 0 


1
3
1
3
1
3
0
0


1 2 0
0 0 1 2
v2
v1
v3
v5
v4
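A small Python sketch (assuming numpy is available) of how the transition matrix P is obtained from the adjacency matrix A by dividing each row by its out-degree:

import numpy as np

A = np.array([[0, 1, 1, 0, 0],
              [0, 0, 0, 0, 1],
              [0, 1, 0, 0, 0],
              [1, 1, 1, 0, 0],
              [1, 0, 0, 1, 0]], dtype=float)

out_degree = A.sum(axis=1, keepdims=True)   # out-degree of each node (all nonzero in this example)
P = A / out_degree                          # row i = transition probabilities out of node v_{i+1}
print(P)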
State probability vector
• The vector q^t = (q^t_1, q^t_2, …, q^t_n) stores the probability of being at state i at time t
• q^0_i = the probability of starting from state i
  q^t = q^{t-1} P
An example
0 12 12 0
0
0
0
0

P 0
1
0
0

1 3 1 3 1 3 0
1 2 0
0 12
0
1 
0

0
0
v2
v1
v3
qt+11 = 1/3 qt4 + 1/2 qt5
qt+12 = 1/2 qt1 + qt3 + 1/3 qt4
qt+13 = 1/2 qt1 + 1/3 qt4
qt+14 = 1/2 qt5
qt+15 = qt2
v5
v4
Stationary distribution
• A stationary distribution for a MC with transition matrix P,
is a probability distribution π, such that π = πP
• A MC has a unique stationary distribution if
• it is irreducible
• the underlying graph is strongly connected
• it is aperiodic
• for random walks, the underlying graph is not bipartite
• The probability πi is the fraction of times that we visited
state i as t → ∞
• The stationary distribution is an eigenvector of matrix P
• the principal left eigenvector of P – stochastic matrices have
maximum eigenvalue 1
Computing the stationary distribution
• The Power Method
• Initialize to some distribution q0
• Iteratively compute qt = qt-1P
• After enough iterations qt ≈ π
• Power method because it computes qt = q0Pt
• Why does it converge?
• follows from the fact that any vector can be written as a
linear combination of the eigenvectors
• q0 = c1v1 + c2v2 + … + cnvn
• Rate of convergence
• determined by λ2ᵗ (the second eigenvalue raised to the power t)
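A rough Python sketch of the power method (assuming numpy; convergence requires the irreducibility and aperiodicity conditions above):

import numpy as np

def power_method(P, eps=1e-10, max_iter=10_000):
    n = P.shape[0]
    q = np.full(n, 1.0 / n)            # start from the uniform distribution q^0
    for _ in range(max_iter):
        q_next = q @ P                 # q^t = q^{t-1} P
        if np.abs(q_next - q).sum() < eps:
            break
        q = q_next
    return q_next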
The PageRank random walk
• Vanilla random walk
• make the adjacency matrix stochastic and run a random
walk
0 12 12 0
0
0
0
0

P 0
1
0
0

1 3 1 3 1 3 0
1 2 0
0 12
0
1 
0

0
0
The PageRank random walk
• What about sink nodes?
• what happens when the random walk moves to a node
without any outgoing links?
0 12 12 0
0
0
0
0

P 0
1
0
0

1 3 1 3 1 3 0
1 2 0
0 12
0
0
0

0
0
The PageRank random walk
• Replace these row vectors with a vector v
• typically, the uniform vector
        0   1/2  1/2   0    0
       1/5  1/5  1/5  1/5  1/5
  P' =  0    1    0    0    0
       1/3  1/3  1/3   0    0
       1/2   0    0   1/2   0

P' = P + dv^T, where d_i = 1 if i is a sink node and 0 otherwise
The PageRank random walk
• How do we guarantee irreducibility?
• add a random jump to vector v with prob α
• typically, to a uniform vector
             0   1/2  1/2   0    0              1/5 1/5 1/5 1/5 1/5
            1/5  1/5  1/5  1/5  1/5             1/5 1/5 1/5 1/5 1/5
  P'' = α ×  0    1    0    0    0    + (1−α) × 1/5 1/5 1/5 1/5 1/5
            1/3  1/3  1/3   0    0              1/5 1/5 1/5 1/5 1/5
            1/2   0    0   1/2   0              1/5 1/5 1/5 1/5 1/5

P'' = αP' + (1−α)uv^T, where u is the vector of all 1s
Effects of random jump
• Guarantees irreducibility
• Motivated by the concept of random surfer
• Offers additional flexibility
• personalization
• anti-spam
• Controls the rate of convergence
• the second eigenvalue of matrix P’’ is α
A PageRank algorithm
• Performing vanilla power method is now too
expensive – the matrix is not sparse
  q^0 = v
  t = 1
  repeat
      q^t = (P'')^T q^{t-1}
      δ = ||q^t − q^{t-1}||
      t = t + 1
  until δ < ε

Efficient computation of y = (P'')^T x:

  y = α P'^T x
  β = ||x||_1 − ||y||_1
  y = y + β v
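A hedged Python sketch of this iteration (assuming numpy and a dense, row-stochastic P' for simplicity; in practice P' is kept sparse):

import numpy as np

def pagerank(P_prime, alpha=0.85, eps=1e-10, max_iter=100):
    n = P_prime.shape[0]
    v = np.full(n, 1.0 / n)              # uniform jump vector v
    q = v.copy()                         # q^0 = v
    for _ in range(max_iter):
        y = alpha * (P_prime.T @ q)      # y = alpha * P'^T q^{t-1}
        beta = q.sum() - y.sum()         # mass to redistribute: ||q||_1 - ||y||_1
        q_next = y + beta * v            # y = y + beta * v
        if np.abs(q_next - q).sum() < eps:
            break
        q = q_next
    return q_next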
Random walks on undirected graphs
• In the stationary distribution of a random walk on
an undirected graph, the probability of being at
node i is proportional to the (weighted) degree of
the vertex
• Random walks on undirected graphs are not so
“interesting”