DATA MINING
LECTURE 11
Classification
Naïve Bayes
Graphs And Centrality
NAÏVE BAYES CLASSIFIER
Bayes Classifier
• A probabilistic framework for solving classification
problems
• A, C random variables
• Joint probability: Pr(A=a,C=c)
• Conditional probability: Pr(C=c | A=a)
• Relationship between joint and conditional probability distributions:

  Pr(C, A) = Pr(C | A) · Pr(A) = Pr(A | C) · Pr(C)

• Bayes Theorem:

  P(C | A) = P(A | C) · P(C) / P(A)
Bayesian Classifiers
• Consider each attribute and class label as random variables

Tid  Refund  Marital Status  Taxable Income  Evade
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

(Refund and Marital Status are categorical attributes, Taxable Income is continuous, and Evade is the class.)
Evade → C
  Event space: {Yes, No}
  P(C) = (0.3, 0.7)
Refund → A1
  Event space: {Yes, No}
  P(A1) = (0.3, 0.7)
Marital Status → A2
  Event space: {Single, Married, Divorced}
  P(A2) = (0.4, 0.4, 0.2)
Taxable Income → A3
  Event space: ℝ
  P(A3) ~ Normal(μ, σ²)
Bayesian Classifiers
• Given a record X over attributes (A1, A2,…,An)
• E.g., X = (‘Yes’, ‘Single’, 125K)
• The goal is to predict class C
• Specifically, we want to find the value c of C that maximizes
P(C=c| X)
• Can we estimate P(C| X) directly from data?
• This means that we estimate the probability for all possible
values of the class variable.
Bayesian Classifiers
• Approach:
• compute the posterior probability P(C | A1, A2, …, An) for all
values of C using the Bayes theorem
  P(C | A1, A2, …, An) = P(A1, A2, …, An | C) · P(C) / P(A1, A2, …, An)
• Choose value of C that maximizes
P(C | A1, A2, …, An)
• Equivalent to choosing value of C that maximizes
P(A1, A2, …, An|C) P(C)
• How to estimate P(A1, A2, …, An | C )?
Naïve Bayes Classifier
• Assume independence among attributes Ai when class is
given:
• P(A1, A2, …, An | C) = P(A1 | C) · P(A2 | C) ⋯ P(An | C)
• We can estimate P(Ai| C) for all values of Ai and C.
• New point X is classified to class c if
  P(C = c | X) ∝ P(C = c) · ∏_i P(A_i | c)
  is maximal over all possible values of C.
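For illustration, here is a minimal Python sketch (not from the lecture) of this decision rule. The dictionaries priors and cond are assumed to already hold the estimated probabilities P(C = c) and P(A_i = a | C = c).

def naive_bayes_predict(x, priors, cond):
    """x: dict attribute -> value, priors: {class: P(C=c)},
    cond: {(attribute, value, class): P(A_i = value | C = c)}."""
    best_class, best_score = None, -1.0
    for c, p_c in priors.items():
        score = p_c                                    # start from the class prior P(C = c)
        for attr, value in x.items():
            score *= cond.get((attr, value, c), 0.0)   # multiply by P(A_i = value | C = c)
        if score > best_score:
            best_class, best_score = c, score
    return best_class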
How to Estimate Probabilities from Data?
(Training data: the same 10-record table as before.)
• Class Prior Probability:
  P(C = c) = N_c / N
  e.g., P(C = No) = 7/10, P(C = Yes) = 3/10
• For discrete attributes:
  P(A_i = a | C = c) = N_{a,c} / N_c
  where N_{a,c} is the number of instances having attribute A_i = a and belonging to class c
• Examples:
  P(Status=Married|No) = 4/7
  P(Refund=Yes|Yes) = 0
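As an assumed illustration (not the lecture's code), these counts can be computed in Python from the 10-record table; only Refund and Marital Status are used here.

from collections import Counter

records = [  # (Refund, Marital Status, Evade)
    ("Yes", "Single", "No"), ("No", "Married", "No"), ("No", "Single", "No"),
    ("Yes", "Married", "No"), ("No", "Divorced", "Yes"), ("No", "Married", "No"),
    ("Yes", "Divorced", "No"), ("No", "Single", "Yes"), ("No", "Married", "No"),
    ("No", "Single", "Yes"),
]

N = len(records)
class_counts = Counter(c for _, _, c in records)          # N_c
priors = {c: n / N for c, n in class_counts.items()}      # P(C = c) = N_c / N

status_counts = Counter((status, c) for _, status, c in records)   # N_{a,c}
p_married_given_no = status_counts[("Married", "No")] / class_counts["No"]

print(priors)               # {'No': 0.7, 'Yes': 0.3}
print(p_married_given_no)   # 4/7 ≈ 0.571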
How to Estimate Probabilities from Data?
• For continuous attributes:
• Discretize the range into bins
• one ordinal attribute per bin
• violates independence assumption
• Two-way split: (A < v) or (A > v)
• choose only one of the two splits as new attribute
• Probability density estimation:
• Assume attribute follows a normal distribution
• Use data to estimate parameters of distribution
(e.g., mean μ and standard deviation σ)
• Once probability distribution is known, can use it to estimate the
conditional probability P(Ai|c)
How to Estimate Probabilities from Data?
(Training data: the same 10-record table as before.)
• Normal distribution:

  P(A_i = a | c_j) = 1/√(2π σ_ij²) · exp( −(a − μ_ij)² / (2σ_ij²) )

• One for each (A_i, c_j) pair
• For (Income, Class=No):
  • sample mean = 110
  • sample variance = 2975

  P(Income = 120 | No) = 1/(√(2π) · 54.54) · exp( −(120 − 110)² / (2 · 2975) ) ≈ 0.0072
Example of Naïve Bayes Classifier
Given a Test Record:
  X = (Refund = No, Married, Income = 120K)

Naïve Bayes Classifier:
  P(Refund=Yes|No) = 3/7
  P(Refund=No|No) = 4/7
  P(Refund=Yes|Yes) = 0
  P(Refund=No|Yes) = 1
  P(Marital Status=Single|No) = 2/7
  P(Marital Status=Divorced|No) = 1/7
  P(Marital Status=Married|No) = 4/7
  P(Marital Status=Single|Yes) = 2/7
  P(Marital Status=Divorced|Yes) = 1/7
  P(Marital Status=Married|Yes) = 0

For Taxable Income:
  If Class=No: sample mean = 110, sample variance = 2975
  If Class=Yes: sample mean = 90, sample variance = 25

• P(X|Class=No) = P(Refund=No|Class=No) × P(Married|Class=No) × P(Income=120K|Class=No)
                = 4/7 × 4/7 × 0.0072 = 0.0024

• P(X|Class=Yes) = P(Refund=No|Class=Yes) × P(Married|Class=Yes) × P(Income=120K|Class=Yes)
                 = 1 × 0 × 1.2 × 10⁻⁹ = 0

Since P(X|No)·P(No) > P(X|Yes)·P(Yes),
P(No|X) > P(Yes|X) => Class = No
Naïve Bayes Classifier
• If one of the conditional probabilities is zero, then the entire expression becomes zero
• Probability estimation:

  Original:   P(A_i = a | C = c) = N_ac / N_c

  Laplace:    P(A_i = a | C = c) = (N_ac + 1) / (N_c + N_i)

  m-estimate: P(A_i = a | C = c) = (N_ac + m·p) / (N_c + m)

  N_i: number of values of attribute A_i
  p: prior probability
  m: parameter
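A hedged Python sketch of the two smoothed estimates (the function names are my own, not the lecture's):

def laplace_estimate(N_ac, N_c, N_i):
    """Laplace: (N_ac + 1) / (N_c + N_i), where N_i is the number of values of attribute A_i."""
    return (N_ac + 1) / (N_c + N_i)

def m_estimate(N_ac, N_c, p, m):
    """m-estimate: (N_ac + m*p) / (N_c + m), with prior probability p and parameter m."""
    return (N_ac + m * p) / (N_c + m)

# e.g., P(Refund=Yes | Yes) with Laplace smoothing: N_ac = 0, N_c = 3, 2 attribute values
print(laplace_estimate(0, 3, 2))   # 0.2 instead of 0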
Implementation details
• Computing the conditional probabilities involves
multiplication of many very small numbers
• Numbers get very close to zero, and there is a danger
of numeric instability
• We can deal with this by computing the logarithm
of the conditional probability
  log P(C | A) ∝ log P(A | C) + log P(C) = Σ_i log P(A_i | C) + log P(C)
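A possible log-space scoring function in Python (a sketch, reusing the assumed priors/cond dictionaries from the earlier example):

import math

def log_score(x, priors, cond):
    """Return {class: log P(C=c) + sum_i log P(A_i = x_i | c)}; -inf if any factor is 0."""
    scores = {}
    for c, p_c in priors.items():
        s = math.log(p_c)
        for attr, value in x.items():
            p = cond.get((attr, value, c), 0.0)
            s += math.log(p) if p > 0 else float("-inf")   # sum of logs instead of a product
        scores[c] = s
    return scores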
Naïve Bayes (Summary)
• Robust to isolated noise points
• Handle missing values by ignoring the instance
during probability estimate calculations
• Robust to irrelevant attributes
• Independence assumption may not hold for some
attributes
• Use other techniques such as Bayesian Belief Networks
(BBN)
• Naïve Bayes can produce a probability estimate, but
it is usually a very biased one
• Logistic Regression is better for obtaining probabilities.
Generative vs Discriminative models
• Naïve Bayes is a type of a generative model
• Generative process:
• First pick the category of the record
• Then given the category, generate the attribute values from the
distribution of the category
• Conditional independence of A1, A2, …, An given C
(Figure: graphical model with class node C pointing to attribute nodes A1, A2, …, An.)
• We use the training data to learn the distribution
of the values in a class
Generative vs Discriminative models
• Logistic Regression and SVM are discriminative
models
• The goal is to find the boundary that discriminates
between the two classes from the training data
• In order to classify the language of a document,
you can
• Either learn the two languages and find which is more
likely to have generated the words you see
• Or learn what differentiates the two languages.
SUPERVISED LEARNING
Learning
• Supervised Learning: learn a model from the data
using labeled data.
• Classification and Regression are the prototypical examples of supervised learning tasks. Others are possible (e.g., ranking).
• Unsupervised Learning: learn a model – extract
structure from unlabeled data.
• Clustering and Association Rules are prototypical
examples of unsupervised learning tasks.
• Semi-supervised Learning: learn a model for the
data using both labeled and unlabeled data.
Supervised Learning Steps
• Model the problem
• What is it that you are trying to predict? What kind of optimization function do you need? Do you need classes or probabilities?
• Extract Features
• How do you find the right features that help to discriminate between
the classes?
• Obtain training data
• Obtain a collection of labeled data. Make sure it is large enough,
accurate and representative. Ensure that classes are well
represented.
• Decide on the technique
• What is the right technique for your problem?
• Apply in practice
• Can the model be trained for very large data? How do you test how
you do in practice? How do you improve?
Modeling the problem
• Sometimes it is not obvious. Consider the
following three problems
• Detecting if an email is spam
• Categorizing the queries in a search engine
• Ranking the results of a web search
Feature extraction
• Feature extraction, or feature engineering is the most
tedious but also the most important step
• How do you separate the players of the Greek national team
from those of the Swedish national team?
• One line of thought: throw features to the classifier
and the classifier will figure out which ones are
important
• More features means that you need more training data
• Another line of thought: select carefully the features
using various functions and techniques
• Computationally intensive
Training data
• An overlooked problem: How do you get labeled
data for training your model?
• E.g., how do you get training data for ranking?
• Usually requires a lot of manual effort and domain
expertise and carefully planned labeling
• Results are not always of high quality (lack of expertise)
• And they are not sufficient (low coverage of the space)
• Recent trends:
• Find a source that generates the labeled data for you.
• Crowd-sourcing techniques
Dealing with small amount of labeled data
• Semi-supervised techniques have been developed for this
purpose.
• Self-training: Train a classifier on the data, and then feed
back the high-confidence output of the classifier as input
• Co-training: train two “independent” classifiers and feed
the output of one classifier as input to the other.
• Regularization: Treat learning as an optimization problem
where you define relationships between the objects you
want to classify, and you exploit these relationships
• Example: Image restoration
Technique
• The choice of technique depends on the problem
requirements (do we need a probability
estimate?) and the problem specifics (does
independence assumption hold? Do we think
classes are linearly separable?)
• For many cases finding the right technique may
be trial and error
• For many cases the exact technique does not
matter.
Big Data Trumps Better Algorithms
• If you have enough data then the algorithms are
not so important
• The web has made this
possible.
• Especially for text-related
tasks
• Search engines use the collective human intelligence
http://www.youtube.com/watch?v=nU8DcBF-qo4
Apply-Test
• How do you scale to very large datasets?
• Distributed computing – map-reduce implementations of machine learning algorithms (Mahout, over Hadoop)
• How do you test something that is running
online?
• You cannot get labeled data in this case
• A/B testing
• How do you deal with changes in data?
• Active learning
GRAPHS AND LINK ANALYSIS RANKING
Graphs - Basics
• A graph is a powerful abstraction for modeling
entities and their pairwise relationships.
• G = (V, E)
• Set of nodes V = {v1, …, v5}
• Set of edges E = {(v1, v2), …, (v4, v5)}
• Examples:
  • Social networks
  • Twitter followers
  • Web
  • Collaboration graphs
(Figure: example graph on nodes v1, …, v5.)
Undirected Graphs
• Undirected Graph: The edges are undirected pairs – they can
be traversed in any direction.
• Degree of node: Number of edges incident on the node
• Path: A sequence of edges from one node to another
• We say that the second node is reachable from the first
• Connected Component: A set of nodes such that there is a path between any two nodes in the set
(Figure: example undirected graph on nodes v1, …, v5, with its symmetric 5×5 adjacency matrix A, where A_ij = 1 if {v_i, v_j} is an edge.)
Directed Graphs
• Directed Graph: The edges are ordered pairs – they can be traversed in the
direction from first to second.
• In-degree and Out-degree of a node.
• Path: A sequence of directed edges from one node to another
• We say that the second node is reachable from the first
• Strongly Connected Component: A set of nodes such that there is a directed
path between any two nodes in the set
• Weakly Connected Component: A set of nodes such that there is an undirected path between any two nodes in the set

        0 1 1 0 0
        0 0 0 0 1
  A  =  0 1 0 0 0
        1 1 1 0 0
        1 0 0 1 0

(Adjacency matrix of the example directed graph on v1, …, v5: entry (i, j) is 1 if there is an edge from v_i to v_j.)
Bipartite Graph
• A graph where the vertex set V is partitioned into
two sets V = {L,R}, of size greater than one, such
that there is no edge within each set.
(Figure: bipartite graph with Set L = {v1, v2, v3} and Set R = {v4, v5}.)
Importance problem
• What are the most important nodes in the graph?
• What are the most authoritative pages on the web?
• Who are the important users in Facebook?
• What are the most influential Twitter accounts?
Why is this important?
• When you make a query “microsoft” to Google
why do you get the home page of Microsoft as
the first result?
Link Analysis
• First generation search engines
• view documents as flat text files
• could not cope with size, spamming, user needs
• Second generation search engines
• Ranking becomes critical
• use of Web specific data: Link Analysis
• shift from relevance to authoritativeness
• a success story for network analysis
Link Analysis: Intuition
• A link from page p to page q denotes
endorsement
• page p considers page q an authority on a subject
• use the graph of recommendations
• assign an authority value to every page
Popularity: InDegree algorithm
• Rank pages according to the popularity of
incoming edges
(Example figure: pages with in-degree weights w = 3, 2, 2, 1, 1.)
1. Red Page
2. Yellow Page
3. Blue Page
4. Purple Page
5. Green Page
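A minimal Python sketch of the InDegree idea (the edge list is an invented toy example, not the lecture's figure):

from collections import Counter

edges = [("green", "red"), ("purple", "red"), ("yellow", "red"),
         ("red", "yellow"), ("blue", "yellow"), ("red", "blue"),
         ("yellow", "blue"), ("blue", "purple"), ("purple", "green")]

in_degree = Counter(target for _, target in edges)           # number of incoming edges per page
ranking = sorted(in_degree, key=in_degree.get, reverse=True)
print(ranking)   # ['red', 'yellow', 'blue', 'purple', 'green']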
Popularity
• Can you think of a case where this could be a problem?
• It is not only important how many pages link to you, but also how important the pages that link to you are.
PageRank algorithm [BP98]
• Good authorities should be pointed to by good authorities
• The value of a page is determined by the value of the pages that link to it
• How do we implement that?
• Each page has a value.
• Proceed in iterations,
• in each iteration every page
distributes the value to the neighbors
• Continue until there is
convergence.
  PR(p) = Σ_{q → p} PR(q) / F(q)

  where F(q) is the number of outgoing links of page q.
1. Red Page
2. Purple Page
3. Yellow Page
4. Blue Page
5. Green Page
Random Walks on Graphs
• What we described is equivalent to a random walk on the
graph
• Random walk:
• Pick a node uniformly at random
• Pick one of the outgoing edges uniformly at random
• Repeat.
• Question:
• What is the probability that after N steps you will be at node x? Or, after N steps, what fraction of the time have you visited node x?
• The answer is the same for these two questions
• When 𝑁 → ∞ this number converges to a single value regardless of
the starting point!
PageRank algorithm [BP98]
• Random walk on the web graph
(the Random Surfer model)
• pick a page at random
• with probability 1- α jump to a random
page
• with probability α follow a random
outgoing link
• Rank according to the stationary
distribution
  PR(p) = α Σ_{q → p} PR(q) / F(q) + (1 − α) · (1/n)
1. Red Page
2. Purple Page
3. Yellow Page
4. Blue Page
5. Green Page
Markov chains
• A Markov chain describes a discrete time stochastic
process over a set of states
S = {s1, s2, … sn}
according to a transition probability matrix
P = {Pij}
• Pij = probability of moving to state j when at state i
• ∑jPij = 1 (stochastic matrix)
• Memorylessness property: The next state of the chain depends only on the current state and not on the past of the process (first-order MC)
• higher order MCs are also possible
Random walks
• Random walks on graphs correspond to Markov
Chains
• The set of states S is the set of nodes of the graph G
• The transition probability matrix is the probability that
we follow an edge from one node to another
An example
0
0

A  0

1
1
1 1 0 0
0 0 0 1 
1 0 0 0

1 1 0 0
0 0 0 1 
0 12 12 0 0 
0

0
0
0
1


P 0
1
0 0 0 


1
3
1
3
1
3
0
0


1 2 0
0 0 1 2
v2
v1
v3
v5
v4
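A small Python sketch (assuming numpy is available) of how the transition matrix P is obtained from the adjacency matrix A by dividing each row by its out-degree:

import numpy as np

A = np.array([[0, 1, 1, 0, 0],
              [0, 0, 0, 0, 1],
              [0, 1, 0, 0, 0],
              [1, 1, 1, 0, 0],
              [1, 0, 0, 1, 0]], dtype=float)

out_degree = A.sum(axis=1, keepdims=True)   # out-degree of each node (all nonzero in this example)
P = A / out_degree                          # row i = transition probabilities out of node v_{i+1}
print(P)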
State probability vector
• The vector q^t = (q^t_1, q^t_2, …, q^t_n) stores the probability of being at state i at time t
• q^0_i = the probability of starting from state i
  q^t = q^{t-1} P
An example
0 12 12 0
0
0
0
0

P 0
1
0
0

1 3 1 3 1 3 0
1 2 0
0 12
0
1 
0

0
0
v2
v1
v3
qt+11 = 1/3 qt4 + 1/2 qt5
qt+12 = 1/2 qt1 + qt3 + 1/3 qt4
qt+13 = 1/2 qt1 + 1/3 qt4
qt+14 = 1/2 qt5
qt+15 = qt2
v5
v4
Stationary distribution
• A stationary distribution for a MC with transition matrix P,
is a probability distribution π, such that π = πP
• A MC has a unique stationary distribution if
• it is irreducible
• the underlying graph is strongly connected
• it is aperiodic
• for random walks, the underlying graph is not bipartite
• The probability πi is the fraction of times that we visited
state i as t → ∞
• The stationary distribution is an eigenvector of matrix P
• the principal left eigenvector of P – stochastic matrices have
maximum eigenvalue 1
Computing the stationary distribution
• The Power Method
• Initialize to some distribution q0
• Iteratively compute qt = qt-1P
• After enough iterations qt ≈ π
• Power method because it computes qt = q0Pt
• Why does it converge?
• follows from the fact that any vector can be written as a
linear combination of the eigenvectors
• q0 = c1v1 + c2v2 + … + cnvn
• Rate of convergence
• determined by λ2ᵗ (the second eigenvalue raised to the power t)
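A rough Python sketch of the power method (assuming numpy; convergence requires the irreducibility and aperiodicity conditions above):

import numpy as np

def power_method(P, eps=1e-10, max_iter=10_000):
    n = P.shape[0]
    q = np.full(n, 1.0 / n)            # start from the uniform distribution q^0
    for _ in range(max_iter):
        q_next = q @ P                 # q^t = q^{t-1} P
        if np.abs(q_next - q).sum() < eps:
            break
        q = q_next
    return q_next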
The PageRank random walk
• Vanilla random walk
• make the adjacency matrix stochastic and run a random
walk
0 12 12 0
0
0
0
0

P 0
1
0
0

1 3 1 3 1 3 0
1 2 0
0 12
0
1 
0

0
0
The PageRank random walk
• What about sink nodes?
• what happens when the random walk moves to a node
without any outgoing links?
0 12 12 0
0
0
0
0

P 0
1
0
0

1 3 1 3 1 3 0
1 2 0
0 12
0
0
0

0
0
The PageRank random walk
• Replace these row vectors with a vector v
• typically, the uniform vector
        0   1/2  1/2   0    0
       1/5  1/5  1/5  1/5  1/5
  P' =  0    1    0    0    0
       1/3  1/3  1/3   0    0
       1/2   0    0   1/2   0

P' = P + dv^T, where d_i = 1 if i is a sink node and 0 otherwise
The PageRank random walk
• How do we guarantee irreducibility?
• add a random jump to vector v with prob α
• typically, to a uniform vector
             0   1/2  1/2   0    0              1/5 1/5 1/5 1/5 1/5
            1/5  1/5  1/5  1/5  1/5             1/5 1/5 1/5 1/5 1/5
  P'' = α ×  0    1    0    0    0    + (1−α) × 1/5 1/5 1/5 1/5 1/5
            1/3  1/3  1/3   0    0              1/5 1/5 1/5 1/5 1/5
            1/2   0    0   1/2   0              1/5 1/5 1/5 1/5 1/5

P'' = αP' + (1−α)uv^T, where u is the vector of all 1s
Effects of random jump
• Guarantees irreducibility
• Motivated by the concept of random surfer
• Offers additional flexibility
• personalization
• anti-spam
• Controls the rate of convergence
• the second eigenvalue of matrix P’’ is α
A PageRank algorithm
• Performing vanilla power method is now too
expensive – the matrix is not sparse
  q^0 = v
  t = 1
  repeat
      q^t = (P'')^T q^{t-1}
      δ = ||q^t − q^{t-1}||
      t = t + 1
  until δ < ε

Efficient computation of y = (P'')^T x:

  y = α P'^T x
  β = ||x||_1 − ||y||_1
  y = y + β v
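A hedged Python sketch of this iteration (assuming numpy and a dense, row-stochastic P' for simplicity; in practice P' is kept sparse):

import numpy as np

def pagerank(P_prime, alpha=0.85, eps=1e-10, max_iter=100):
    n = P_prime.shape[0]
    v = np.full(n, 1.0 / n)              # uniform jump vector v
    q = v.copy()                         # q^0 = v
    for _ in range(max_iter):
        y = alpha * (P_prime.T @ q)      # y = alpha * P'^T q^{t-1}
        beta = q.sum() - y.sum()         # mass to redistribute: ||q||_1 - ||y||_1
        q_next = y + beta * v            # y = y + beta * v
        if np.abs(q_next - q).sum() < eps:
            break
        q = q_next
    return q_next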
Random walks on undirected graphs
• In the stationary distribution of a random walk on
an undirected graph, the probability of being at
node i is proportional to the (weighted) degree of
the vertex
• Random walks on undirected graphs are not so
“interesting”