Algorithms for Big Data
Classification
Dr. Jianye Hao
Associate Professor
School of Software
Review
• Decision Tree Classifier
– decision tree induction algorithm
• a greedy approach
• how to specify the attribute splitting condition?
– Binary vs multi-way splitting
• how to select the optimal splitting?
– Impurity measures: Gini index, entropy, classification error
• When to stop splitting?
– Overfitting vs underfitting
Review
• How to evaluate and compare the performance
of a classifier?
– Accuracy, precision, recall…
– Confidence interval
• the accuracy rate of a classifier
• the accuracy difference between two models (statistically
significant?)
• the accuracy difference between two algorithms (statistically
significant?)
Classical classification methods
• Bayes Classifier max(P(wi|x))
• Linear discriminant classifier
• Quadratic discriminant classifier
• Minimum Distance Classifier min(Dj(x))
• Minimum Error Rate Classifier min(ri(x))
• K-Nearest Neighbor Classifier
Joint probability &
conditional probability
• If we have two variables, X and Y, some
probabilities involve both X and Y
• Joint Probability:
p(x,y) is the probability that X takes the value x and Y takes the value y together
• Conditional Probability:
p(x|y) is the probability that X takes the value x given that Y takes the value y.
• The relationship:
p(x,y) = p(x|y)p(y)
Bayes rule
• From the formula relating joint probability with conditional
probability, we know
p(x|y) p(y) = p(x,y) = p(y|x) p(x)
So,
p(y|x) = p(x|y) p(y) / p(x)
This is the Bayes rule. Using it we can express p(y|x) in terms of p(x|y).
An example of Bayes reasoning
We know that rain implies wet ground, but wet ground does not imply rain
(the ground can also be wet because somebody poured water on it).
But usually we say it is raining when we see the ground is wet.
Why do we make this prediction?
An example of Bayes reasoning
• Bayes rule:
P(rain | wet ground) = P(wet ground | rain) × P(rain) / P(wet ground)
where P(wet ground | rain) = 1 (always true)
• In fact
– If P(wet ground)>>P(rain), for instance,
somebody frequently pours water on the ground,
we will not say it is raining when seeing the wet
ground.
– If P(wet ground) ≈ P(rain), we may conclude that
it is raining when seeing the wet ground
– This is exactly what Bayes rule tells us!
Another example of Bayes
reasoning
• Given:
– A doctor knows that meningitis causes stiff neck 50% of the
time
– Prior probability of any patient having meningitis is 1/50,000
– Prior probability of any patient having stiff neck is 1/20
• If a patient has stiff neck, what’s the probability
he/she has meningitis?
P(M | S) = P(S | M) P(M) / P(S) = (0.5 × 1/50000) / (1/20) = 0.0002
Bayes Classifiers
• Consider the attribute and the class label as random variables x
and w
• Given the attribute x
– Goal is to predict class w
– Specifically, we want to find the value of i that
maximizes P(wi|x), say k, because given x, wk is the most
probable class
• Can we estimate P(wi|x) directly from data?
Bayes Classifiers
• Approach:
– Compute the posterior probability P(wi|x) for all values of wi
using the Bayes rule
P(wi | x) = P(x | wi) P(wi) / P(x)
(posterior = likelihood × prior / evidence)
– Choose the value of i that maximizes P(wi|x)
– Equivalent to choosing the value of i that maximizes P(x|wi)P(wi),
because P(x) is a constant
• P(x|wi) and P(wi) are given or can be easily estimated from data
An Illustrative Example
• In US, 0.8% of the people have this form of cancer
• When the disease is present, the test returns a correct POS result 98% of the time
• It returns a correct NEG result 97% of the time when the disease is not present
• Suppose Anne comes to the doctor's office, runs a blood test, and the test result is POS
• What is the likelihood that Anne has this form of cancer or not?
– This looks bad for Anne. After all, the test is 98% accurate
An Illustrative Example
• In US, 0.8% of the people have this form of cancer
• When the disease is present, the test returns a correct POS result 98% of the time
• It returns a correct NEG result 97% of the time when the disease is not present
• Suppose Anne comes to the doctor's office, runs a blood test, and the test result is POS
• What is the likelihood that Anne has this form of cancer or not?
P(C | POS) = P(POS | C) P(C) / P(POS) = (0.98 × 0.008) / P(POS) = 0.0078 / P(POS)
P(NC | POS) = P(POS | NC) P(NC) / P(POS) = (0.03 × 0.992) / P(POS) = 0.0298 / P(POS)
• Since P(NC | POS) > P(C | POS), we classify Anne as not having cancer.
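A minimal sketch in plain Python of the computation above (numbers taken from the slide; the variable names are illustrative):

```python
# Minimal sketch of the cancer-test example (numbers from the slide).
p_cancer = 0.008            # prior P(C)
p_no_cancer = 1 - p_cancer  # prior P(NC) = 0.992
p_pos_given_c = 0.98        # P(POS | C), test sensitivity
p_pos_given_nc = 0.03       # P(POS | NC) = 1 - 0.97

# Unnormalized posteriors (numerators of Bayes rule)
num_c = p_pos_given_c * p_cancer        # 0.0078
num_nc = p_pos_given_nc * p_no_cancer   # 0.0298

# P(POS) is the same denominator for both, so we can compare the numerators
# directly, or normalize to get the actual posteriors.
p_pos = num_c + num_nc
print("P(C | POS) =", num_c / p_pos)    # ~0.21
print("P(NC | POS) =", num_nc / p_pos)  # ~0.79
```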
How to Estimate Probabilities from Data?

(Refund and Marital Status are categorical attributes, Taxable Income is continuous, and Evade is the class label.)

Tid  Refund  Marital Status  Taxable Income  Evade
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
• Class prior: P(C) = Nc/N
– e.g., P(No) = 7/10, P(Yes) = 3/10
• For discrete attributes:
P(Ai | Ck) = |Aik| / Nc
– where |Aik| is the number of instances having attribute value Ai and belonging to class Ck
– Examples:
P(Status=Married | No) = 4/7
P(Refund=Yes | Yes) = 0
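A small sketch in plain Python of how these count-based estimates could be computed from the 10-record table (the record encoding and helper names are illustrative assumptions):

```python
# Each record: (Refund, Marital Status, Taxable Income, Evade)
records = [
    ("Yes", "Single",   125, "No"),  ("No", "Married", 100, "No"),
    ("No",  "Single",    70, "No"),  ("Yes","Married", 120, "No"),
    ("No",  "Divorced",  95, "Yes"), ("No", "Married",  60, "No"),
    ("Yes", "Divorced", 220, "No"),  ("No", "Single",   85, "Yes"),
    ("No",  "Married",   75, "No"),  ("No", "Single",   90, "Yes"),
]

def prior(cls):
    # P(C) = Nc / N
    return sum(1 for r in records if r[3] == cls) / len(records)

def cond(attr_idx, value, cls):
    # P(Ai | Ck) = |Aik| / Nc
    in_class = [r for r in records if r[3] == cls]
    return sum(1 for r in in_class if r[attr_idx] == value) / len(in_class)

print(prior("No"))               # 0.7
print(cond(1, "Married", "No"))  # 4/7
print(cond(0, "Yes", "Yes"))     # 0.0
```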
How to handle continuous
attributes?
• For continuous attributes:
– Discretize the range into bins
• one ordinal attribute per bin
– Two-way split: (A < v) or (A > v)
• choose only one of the two splits as new attribute
– Example:
• Age (<18, 18-22, 23-30,30-40,>40)
• Annual Salary (>200,000, 120,000-200,000, 60,000-120,000,
<60,000)
How to handle continuous
attributes?
• For continuous attributes:
– Probability density estimation:
• Assume attribute follows a normal distribution
• Use data to estimate parameters of distribution
– Mean: sample mean
  x̄ = (1/d) Σi xi
– Standard deviation: sample standard deviation
  sd = sqrt( Σi (xi - x̄)² / (d - 1) )
• Once probability distribution is known, can use it
to estimate the conditional probability P(Ai|c)
How to handle continuous attributes?

(Training data: the same 10-record table as above; Taxable Income is the continuous attribute and Evade is the class label.)
• Normal distribution:
  P(Ai | cj) = 1 / sqrt(2π σij²) · exp( -(Ai - μij)² / (2σij²) )
  – One distribution for each (Ai, cj) pair
• For (Income, Class=No):
  – sample mean = 110
  – sample variance = 2975 (standard deviation = 54.54)
  P(Income=120 | No) = 1 / (sqrt(2π) × 54.54) × exp( -(120 - 110)² / (2 × 2975) ) = 0.0072
Example of Naive Bayes
Classifier
Given a Test Record:
X = (Refund = No, Marital Status = Married, Income = 120K)
naive Bayes Classifier:
P(Refund=Yes|No) = 3/7
P(Refund=No|No) = 4/7
P(Refund=Yes|Yes) = 0
P(Refund=No|Yes) = 1
P(Marital Status=Single|No) = 2/7
P(Marital Status=Divorced|No)=1/7
P(Marital Status=Married|No) = 4/7
P(Marital Status=Single|Yes) = 2/7
P(Marital Status=Divorced|Yes)=1/7
P(Marital Status=Married|Yes) = 0
For taxable income:
If class=No: sample mean = 110, sample variance = 2975
If class=Yes: sample mean = 90, sample variance = 25
P(X|Class=No) = P(Refund=No|Class=No) × P(Married|Class=No) × P(Income=120K|Class=No)
             = 4/7 × 4/7 × 0.0072 = 0.0024
P(X|Class=Yes) = P(Refund=No|Class=Yes) × P(Married|Class=Yes) × P(Income=120K|Class=Yes)
              = 1 × 0 × 1.2×10^-9 = 0
Since P(X|No)P(No) > P(X|Yes)P(Yes),
P(No|X) > P(Yes|X)
=> Class = No
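A compact sketch of the whole naive Bayes decision for this test record, using the probabilities and Gaussian parameters listed above (dictionary names are illustrative):

```python
import math

# Conditional probabilities estimated from the 10-record training table
p_refund_no = {"No": 4/7, "Yes": 1.0}                # P(Refund=No | class)
p_married   = {"No": 4/7, "Yes": 0.0}                # P(Married   | class)
income_stats = {"No": (110, 2975), "Yes": (90, 25)}  # (mean, variance)
prior = {"No": 7/10, "Yes": 3/10}

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def score(cls, income=120):
    # P(X | class) * P(class), assuming conditional independence of attributes
    return (p_refund_no[cls] * p_married[cls]
            * gaussian_pdf(income, *income_stats[cls]) * prior[cls])

print(max(["No", "Yes"], key=score))  # -> "No"
```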
Bayes Classifiers
• Robust to isolated noise points
• Handle missing values by ignoring the instance during probability
estimate calculations
• Robust to irrelevant attributes
• Bayes classification → decision boundaries
Quadratic Discriminant Classifier
• Using the Bayes classifier, assign x to class j:
  j = arg max_i P(wi | x) = arg max_i P(x | wi) P(wi)
• Assume the data points in each class follow the multivariate Gaussian distribution
• Then the Bayes classifier reduces to assigning x to class j:
  j = arg max_i di(x), where
  di(x) = ln P(wi) - (n/2) ln(2π) - (1/2) ln |Ci| - (1/2) (x - mi)^T Ci^{-1} (x - mi)
• di(x) is a quadratic function of x; such functions are called quadratic
discriminant functions (QDA)
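A small numpy sketch of the quadratic discriminant function di(x) above; the class means, covariances and priors here are made-up placeholders, since in practice they would be estimated from training data:

```python
import numpy as np

def quadratic_discriminant(x, mean, cov, prior):
    # d_i(x) = ln P(w_i) - (n/2) ln 2pi - (1/2) ln|C_i| - (1/2)(x-m_i)^T C_i^-1 (x-m_i)
    n = len(x)
    diff = x - mean
    return (np.log(prior)
            - 0.5 * n * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(cov))
            - 0.5 * diff @ np.linalg.inv(cov) @ diff)

# Example with two hypothetical classes in 2D
x = np.array([1.0, 2.0])
classes = [
    (np.array([0.0, 0.0]), np.eye(2), 0.5),           # (mean, covariance, prior)
    (np.array([2.0, 2.0]), np.diag([2.0, 0.5]), 0.5),
]
scores = [quadratic_discriminant(x, m, C, p) for m, C, p in classes]
print("assign x to class", int(np.argmax(scores)))
```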
Demo of decision boundaries
• Consider a simple case
– x is two-dimensional
– Ci is diagonal
– with two classes
• Then the decision boundary is a quadratic equation of x1 and x2,
and the coefficient of x1x2 is zero
Demo of decision boundaries
Linear Discriminant Classifier
• If we assume that the covariance matrix Σ is the same for all classes, then
  δk(x) = -(1/2) log |Σ| - (1/2) (x - μk)^T Σ^{-1} (x - μk) + log πk
  which, after dropping the terms that do not depend on k, reduces to the linear discriminant function
  δk(x) = x^T Σ^{-1} μk - (1/2) μk^T Σ^{-1} μk + log πk
• The decision boundary between any two classes is the set of x where their discriminant functions are equal, which is linear in x.
Linear Discriminant Classifier
• In practice the parameters of the Gaussian distribution
can be estimated using the training data:
  π̂k = Nk / N, where Nk is the number of instances in class k
  μ̂k = Σ_{gi = k} xi / Nk
  Σ̂ = Σ_{k=1}^{K} Σ_{gi = k} (xi - μ̂k)(xi - μ̂k)^T / (N - K)
Minimum Distance Classifier
• Compute the mean vector μi of each class i
• Given a new instance x, calculate the Euclidean distance
between x and each mean vector:
  d = ||x - μi|| = sqrt( (x - μi)^T (x - μi) )
• Assign x to the class with the smallest distance
Minimum Distance Classifier
• Can be considered as a special case of the linear
discriminant classifier when the covariance matrix of every class is Σ = σ²I
and the prior probability of each class is the same.
  linear discriminant function:  δk(x) = x^T Σ^{-1} μk - (1/2) μk^T Σ^{-1} μk + log πk
  MDC decision function:         δk(x) = x^T μk - (1/2) μk^T μk
Minimum error rate classifiers
• Given item x, the probability that item x belongs to class
wi is P(wi|x).
• If item x is wrongly classified as class wj (when it actually
belongs to class wk), then the incurred loss is Lkj
Minimum error rate classifiers
• Assign x to class j which minimizes the conditional average
risk (loss):
  j = arg min_i ri(x) = arg min_i Σ_{k=1}^{M} Lki P(wk | x)
• This choice minimizes the expected risk
• If we use the 0-1 loss (Lki = 1 for k ≠ i and Lii = 0), we have ri(x) = 1 - P(wi|x)
• Then minimizing ri(x) is equivalent to the Bayes classifier (maximizing P(wi|x))
Overview of the classification methods
• Minimum error rate classifiers (with 0-1 loss) → Bayes classifiers
• Bayes classifiers + Gaussian distribution → hyperquadrics as the decision boundaries (quadratic discriminant classifiers)
• The same Ci for all i → linear discriminant classifiers
• Ci = σ²I and P(wi) = P(wj) → minimum distance classifiers
Instance-Based Classifiers
(figure: a set of stored cases with attributes Atr1 … AtrN and class labels A/B/C, and an unseen case with the same attributes)
• Store the training records
• Use the training records to predict the class label of unseen cases
Instance Based Classifiers
• Examples:
– Rote-learner
• Memorizes entire training data and performs
classification only if attributes of record match one
of the training examples exactly
– Nearest neighbor
• Uses k “closest” points (nearest neighbors) for
performing classification
Nearest-Neighbor Classifiers
• Basic idea:
– If it walks like a duck, quacks like a duck, then
it's probably a duck
(figure: for a test record, compute the distance to the training records and choose the k "nearest" ones)
Nearest-Neighbor Classifiers
• Requires three things
– The set of stored records
– A distance metric to compute the distance between records
– The value of k, the number of nearest neighbors to retrieve
• To classify an unknown record:
– Compute the distance to the training records
– Identify the k nearest neighbors
– Use the class labels of the nearest neighbors to determine the
class label of the unknown record (e.g., by taking a majority vote)
Definition of Nearest Neighbor
(figures: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor regions around a test point X)
The k-nearest neighbors of a record x are the data points
that have the k smallest distances to x
(figure: the 1-nearest-neighbor decision regions form a Voronoi diagram)
Nearest Neighbor Classification
• Compute distance between two points:
– Euclidean distance: d(p, q) = sqrt( Σ_{i=1}^{n} (pi - qi)² )
• Determine the class from the nearest neighbor list
– take the majority vote of class labels among the k nearest neighbors
– weigh the vote according to distance
• weight factor: w = 1/d²
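A minimal plain-Python sketch of the k-NN procedure just described (Euclidean distance plus majority vote); the toy training points are illustrative:

```python
import math
from collections import Counter

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_predict(train, x, k=3):
    """train is a list of (feature_vector, class_label) pairs."""
    neighbors = sorted(train, key=lambda rec: euclidean(rec[0], x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Tiny example with hypothetical 2-D points
train = [((1, 1), "A"), ((1, 2), "A"), ((5, 5), "B"), ((6, 5), "B")]
print(knn_predict(train, (2, 1), k=3))  # -> "A"
```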
Nearest Neighbor Classification
• Choosing the value of k:
– If k is too small, sensitive to noise points
– If k is too large, neighborhood may include points from
other classes
Nearest Neighbor Classification
• Scaling issues
– Attributes may have to be scaled to prevent
distance measures from being dominated by one
of the attributes
– Example:
• height of a person may vary from 1.5m to 1.8m
• weight of a person may vary from 90lb to 300lb
• income of a person may vary from $10K to $1M
Nearest Neighbor Classification
• Problem with Euclidean measure:
– High-dimensional data
• curse of dimensionality
– Can produce counter-intuitive results, e.g.
  111111111110 vs 011111111111 → d = 1.4142
  100000000000 vs 000000000001 → d = 1.4142
• Solutions:
– Normalize the vectors to unit length
– Use cosine similarity
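A short sketch contrasting Euclidean distance with cosine similarity on the binary vectors from this slide:

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def cosine_similarity(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm

a = [int(c) for c in "111111111110"]
b = [int(c) for c in "011111111111"]
c = [int(c) for c in "100000000000"]
d = [int(c) for c in "000000000001"]

print(euclidean(a, b), cosine_similarity(a, b))  # 1.4142..., ~0.91 (very similar)
print(euclidean(c, d), cosine_similarity(c, d))  # 1.4142..., 0.0   (nothing in common)
```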
Example: PEBLS
• PEBLS: Parallel Exemplar-Based Learning
System (Cost & Salzberg)
– Works with both continuous and nominal
features
• For nominal features, distance between two
nominal values is computed using modified value
difference metric (MVDM)
– Each record is assigned a weight factor
– Number of nearest neighbor, k = 1
Example: PEBLS

(Training data: the same 10-record table, with Refund and Marital Status as nominal attributes and Cheat as the class label.)

Class counts per attribute value:

Marital Status:
Class   Single  Married  Divorced
Yes     2       0        1
No      2       4        1

Refund:
Class   Yes  No
Yes     0    3
No      3    4

Distance between nominal attribute values (modified value difference metric):

d(V1, V2) = Σi | n1i/n1 - n2i/n2 |

d(Single, Married)   = |2/4 - 0/4| + |2/4 - 4/4| = 1
d(Single, Divorced)  = |2/4 - 1/2| + |2/4 - 1/2| = 0
d(Married, Divorced) = |0/4 - 1/2| + |4/4 - 1/2| = 1
d(Refund=Yes, Refund=No) = |0/3 - 3/7| + |3/3 - 4/7| = 6/7
Example: PEBLS

Two records:
X: Refund=Yes, Marital Status=Single, Taxable Income=125K, Cheat=No
Y: Refund=No, Marital Status=Married, Taxable Income=100K, Cheat=No

Distance between record X and record Y:

D(X, Y) = wX wY Σ_{i=1}^{d} d(Xi, Yi)²

where:
wX = (number of times X is used for prediction) / (number of times X predicts correctly)
wX ≈ 1 if X makes accurate predictions most of the time
wX > 1 if X is not reliable for making predictions
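A small sketch of the MVDM nominal-value distance computed from the same 10-record table (the record encoding is an illustrative assumption):

```python
from collections import Counter

# Each record: (Refund, Marital Status, Cheat)
records = [
    ("Yes", "Single", "No"), ("No", "Married", "No"), ("No", "Single", "No"),
    ("Yes", "Married", "No"), ("No", "Divorced", "Yes"), ("No", "Married", "No"),
    ("Yes", "Divorced", "No"), ("No", "Single", "Yes"), ("No", "Married", "No"),
    ("No", "Single", "Yes"),
]
classes = ["Yes", "No"]

def mvdm(attr_idx, v1, v2):
    """d(V1, V2) = sum over classes of |n1i/n1 - n2i/n2|."""
    n1 = [r for r in records if r[attr_idx] == v1]
    n2 = [r for r in records if r[attr_idx] == v2]
    c1, c2 = Counter(r[-1] for r in n1), Counter(r[-1] for r in n2)
    return sum(abs(c1[c] / len(n1) - c2[c] / len(n2)) for c in classes)

print(mvdm(1, "Single", "Married"))   # 1.0
print(mvdm(1, "Single", "Divorced"))  # 0.0
print(mvdm(0, "Yes", "No"))           # 6/7 = 0.857...
```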
Collaborative Filtering
• Making recommendations based on the nearest neighbor
approach.
• Basic idea: to recommend music to Anne, find the person A
who is most similar to Anne and
recommend to Anne the music that person A is listening to.
• We use the Pearson correlation coefficient to compute the
similarity (the opposite of distance) between them.
Collaborative Filtering
• Suppose Amanda, Eric and Sally rated the band, the
Grey Wardens, and their similarity with Anne is as follows:

Person   Grey Wardens Rating
Amanda   4.5
Eric     5
Sally    3.5

Person   Pearson (similarity with Anne)
Sally    0.8
Eric     0.7
Amanda   0.5
Collaborative Filtering
Person   Grey Wardens Rating   Influence
Amanda   4.5                   25%
Eric     5                     35%
Sally    3.5                   40%

• K = 3: the projected rating for Anne =
(4.5 × 0.25) + (5 × 0.35) + (3.5 × 0.4) = 4.275
• K = 2: the projected rating for Anne =
(3.5 × 0.8/1.5) + (5 × 0.7/1.5) = 4.2
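A minimal sketch of the projected-rating computation above, weighting each of the k most similar neighbors by its Pearson similarity:

```python
ratings = {"Amanda": 4.5, "Eric": 5.0, "Sally": 3.5}
similarity = {"Amanda": 0.5, "Eric": 0.7, "Sally": 0.8}  # Pearson with Anne

def projected_rating(k):
    # Take the k most similar neighbors and weight their ratings
    neighbors = sorted(similarity, key=similarity.get, reverse=True)[:k]
    total_sim = sum(similarity[p] for p in neighbors)
    return sum(ratings[p] * similarity[p] / total_sim for p in neighbors)

print(projected_rating(3))  # 4.275
print(projected_rating(2))  # 4.2
```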
Nearest neighbor Classification
• k-NN classifiers are lazy learners
– They do not build models explicitly
– Unlike eager learners such as decision tree induction
and rule-based systems
– Classifying unknown records is relatively expensive
Classification methods
• The methods are related
– Minimum distance classifier
– Linear discriminant classifier
– Quadratic discriminant classifier
– Bayes classifier
– Minimum error rate classifier
• These methods are based on the statistical properties of the data set
Automated Negotiation
• A negotiation consists of
– A number of agents (agent space)
– A negotiation domain D (outcome space):
D = {I1, I2, … In}, and each issue consists of k values, Ii = {v1, v2, …, vk}
– A number of utility spaces (preference profiles)
• A laptop negotiation domain
– Two negotiation agents: agent A and agent B
– Three issues: brand, hard disk, monitor
– Each issue contains a number of discrete values
– One bidding instance is a Dell laptop with an 80 GB hard disk and a 17-inch
monitor.
Automated Negotiation
• Utility Space
– Specifies the preference of each agent over each outcome
(figure: the agents' utility space with the Pareto optimal bids marked)
• The optimal bid of an agent is the bid that
gives the maximum utility to that agent
• The utilities of negotiation agents are often
contradictory, i.e., one agent's gain is another
agent's pain
Automated Negotiation
• Optimal goal of a negotiation
– Maximizing individual payoff
– Maximizing social welfare (the sum of the payoffs of all
parties involved in the negotiation)
Automated Negotiation
• Negotiation protocol
– Defines the rules to regulate how the negotiation proceeds
between negotiation agents.
– Agents are obliged to follow the protocol, and any deviation from
the protocol will be penalized.
• Negotiation Strategy
– Specifies how an agent should behave during a negotiation under
the regulation of a negotiation protocol.
Negotiation Protocol
• Bilateral Negotiation Protocol (Alternating Offer Protocol)
– Involves two parties: agent A and agent B
– One agent (e.g., agent A) starts the negotiation first
– The agents take turns to negotiate
– At its turn, each agent is allowed to take one of the following
three actions:
• Accept – accept the current proposal from the negotiation partner
• Offer – propose a new offer to the partner
• EndNegotiation – terminate the negotiation without reaching
an agreement
• Reservation Value
– The value that an agent obtains if no agreement is reached by
the end of negotiation or the negotiation terminates earlier
• Time pressure
– The utility decreases with the passing of negotiation time
Overall Structure of a Negotiation
Strategy
"Decoupling Negotiating Agents to Explore the Space of Negotiation Strategies",
Novel Insights in Agent-based Complex Automated Negotiation, 2014
Design Effective Negotiation Strategies
using Mining Techniques
• Benefits of predicting the opponent's behavior
– Cooperative environments -> better coordination with others
– Competitive environments -> opportunity to take exploitative actions to
maximize one's own payoff
• How to predict the opponent's behavior?
– Regression techniques (non-linear regression, Gaussian process regression) –
predict the opponent's concession degree [Using Gaussian processes to
optimise concession in complex negotiations against unknown opponents,
IJCAI, 2011]
– Chebyshev polynomials – predict the opponent's decision function [Modeling
opponent decision in repeated one-shot negotiations. Proceedings of the fourth
international joint conference on Autonomous agents and multiagent systems.
ACM, 2005: 397-403]
– Neural networks [Predicting Opponent's Moves in Electronic
Negotiations Using Neural Networks, Group Decision and Negotiation
Conference, 2006]
Design Effective Negotiation Strategies
using Mining Techniques
• Benefits of predicting the opponent’s preference
– Increase the chance of reaching win-win negotiation outcomes
– Better understanding of the opponent’s behaviors
• How to predict the opponent’s preference?
– Bayesian learning – predict the opponent's issue weights and
evaluation function [Opponent modelling in automated multi-issue
negotiation using Bayesian learning, AAMAS, 2008]
– Bayesian learning – predict the opponent's reservation value
[Sycara K, Zeng D. Benefits of learning in negotiation, AAAI, 1997]
Bayesian Learning in
Negotiation
• A Buyer and a Seller
• Task: learn the reservation value of the seller using Bayesian
learning
• Two hypotheses on the seller’s reservation value: H1 = 100, H2
=130
• Prior knowledge of the buyer: P(H1) = P(H2) = 0.5
• Domain Knowledge of the buyer: a seller usually offers a price which
is above their reservation price by 17%
– P(152|H2) = 0.95, P(152|H1) = 0.75
• Given a new price offered by the seller, apply Bayesian learning
P(H1 | e) = P(e | H1) P(H1) / P(e) = P(e | H1) P(H1) / ( P(e | H1) P(H1) + P(e | H2) P(H2) )
P(H2 | e) = P(e | H2) P(H2) / P(e) = P(e | H2) P(H2) / ( P(e | H1) P(H1) + P(e | H2) P(H2) )
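A tiny sketch of this Bayesian update using the slide's priors and likelihoods:

```python
# H1: reservation value 100, H2: reservation value 130
prior = {"H1": 0.5, "H2": 0.5}
likelihood = {"H1": 0.75, "H2": 0.95}  # P(offer = 152 | hypothesis)

evidence = sum(likelihood[h] * prior[h] for h in prior)                 # P(e)
posterior = {h: likelihood[h] * prior[h] / evidence for h in prior}
print(posterior)  # H2 becomes more probable: {'H1': ~0.44, 'H2': ~0.56}
```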
Select Effective Negotiation
Strategies using Mining Techniques
• Given a new domain, how do we predict which existing
negotiation strategy performs best?
– Artificial Neural Network
– Decision Tree
– Linear/Logistic Regression
Basic Procedures
• Extract the features of a domain
• Build the training set
• Choose a data mining algorithm and train the model
• Apply the model to a testing set and evaluate the performance of the
prediction results.

Discounting  Avg No. of values  Average utility     No. of  Domain  Class
factor       in each issue      over all proposals  issues  size
0.3          0.3                0.1                 3       1000    Strategy 1
0.8          0.45               0.35                10      2888    Strategy 2
0.1          0.32               0.56                21      33      Strategy 1
0.2          0.76               0.8                 4       200     Strategy 3
0.3          0.678              0.95                8       34      Strategy 5
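A minimal sketch, assuming scikit-learn is available, of training a decision tree on this table to predict the best strategy for a new domain (the new-domain feature vector is made up):

```python
from sklearn.tree import DecisionTreeClassifier

# Features: discounting factor, avg no. of values per issue,
# average utility over all proposals, no. of issues, domain size
X = [
    [0.3, 0.3,   0.1,  3,  1000],
    [0.8, 0.45,  0.35, 10, 2888],
    [0.1, 0.32,  0.56, 21, 33],
    [0.2, 0.76,  0.8,  4,  200],
    [0.3, 0.678, 0.95, 8,  34],
]
y = ["Strategy 1", "Strategy 2", "Strategy 1", "Strategy 3", "Strategy 5"]

model = DecisionTreeClassifier().fit(X, y)
# Predict the best strategy for a new (hypothetical) domain
print(model.predict([[0.25, 0.5, 0.4, 5, 500]]))
```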
Project Introduction
Genius Negotiation Platform
• A negotiation environment for heterogeneous negotiation agents
• A set of negotiation problems (domains)
• A set of negotiation agents (strategies)
• A set of analytical tools to evaluate an agent's performance
• Webpage: http://ii.tudelft.nl/genius/
• User manual: https://ii.tudelft.nl/genius/sites/default/files/userguide.pdf
• Resources: http://mmi.tudelft.nl/negotiation/index.php/Automated_Negotiating_Agents_Competition_(ANAC)
A Reminder
• Start early!
• Send the group information to
[email protected] ASAP
Exercise II
• Use the Bayes classifier and k-NN to classify the following
task and compare their performance.
• Pima Indians Diabetes
– the number of times pregnant,
– plasma glucose concentration,
– blood pressure,
– triceps skin fold thickness,
– serum insulin level,
– body mass index,
– diabetes pedigree function,
– age,
– Class label: '1' – diabetes; '0' – not.