Lecture 3, Jan-26

Machine Learning
Mehdi Ghayoumi
MSB rm 132
[email protected]
Office hours: Thu, 11-12 AM
• Overfitting: the model is too “complex” and fits irrelevant characteristics (noise) in the data
  – Low bias and high variance
  – Low training error and high test error
• Models with too many parameters are inaccurate because of a large variance (too much sensitivity to the sample).
Bias-Variance Trade-off
E(MSE) = noise² + bias² + variance
  – noise²: unavoidable error
  – bias²: error due to incorrect assumptions
  – variance: error due to the variance of the training samples
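A minimal numpy sketch of this decomposition (an added illustration, not from the slides; the true function, noise level, training-set size, and degree-3 polynomial model are arbitrary assumptions). Repeatedly refitting on fresh noisy samples lets each term be estimated directly:

import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)          # true function (assumed for the demo)
sigma = 0.2                                   # noise standard deviation -> noise^2 = 0.04
x_test = np.linspace(0, 1, 50)
degree, n_train, n_repeats = 3, 30, 500

preds = np.empty((n_repeats, x_test.size))
for i in range(n_repeats):
    x = rng.uniform(0, 1, n_train)
    y = f(x) + rng.normal(0, sigma, n_train)  # fresh noisy training sample
    coeffs = np.polyfit(x, y, degree)         # fit a degree-3 polynomial
    preds[i] = np.polyval(coeffs, x_test)

bias2 = np.mean((preds.mean(axis=0) - f(x_test)) ** 2)   # (mean prediction - truth)^2
variance = np.mean(preds.var(axis=0))                     # spread of predictions across samples
expected_mse = sigma ** 2 + bias2 + variance              # noise^2 + bias^2 + variance
print(f"noise^2={sigma**2:.3f}  bias^2={bias2:.3f}  variance={variance:.3f}  E(MSE)={expected_mse:.3f}")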
Probabilities
• We write P(A) as “the fraction of possible worlds in which A is true”
[Venn diagram: the event space of all possible worlds has total area 1; the worlds in which A is true form a red rectangle whose area is P(A), and the remaining worlds are those in which A is false.]
Axioms of Probability Theory
1. All probabilities between 0 and 1
0 <= P(A) <= 1
2. True has probability 1, false has probability 0.
P(true) = 1
P(false) = 0
P(not A) = P(~A) = 1-P(A)
3. The probability of disjunction is:
P( A or B) = P(A) + P(B) – P (A and B)
Sometimes it is written as:
P(A ∪ B) = P(A) + P(B) – P(A ∩ B)
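A small enumeration check of axiom 3 (an added illustration; the fair-die sample space and the events A and B below are my own choices, not from the slides):

from fractions import Fraction

worlds = set(range(1, 7))                  # six equally likely worlds: one fair die roll
A = {w for w in worlds if w % 2 == 0}      # A = "roll is even"
B = {w for w in worlds if w >= 4}          # B = "roll is at least 4"

P = lambda event: Fraction(len(event), len(worlds))
print(P(A | B), P(A) + P(B) - P(A & B))    # both are 2/3, as the axiom states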
Interpretation of the Axioms
P(A ∪ B) = P(A) + P(B) – P(A ∩ B)
[Venn diagram: the areas of A, B, A or B, and A and B show the rule as simple addition and subtraction.]
Definition of Conditional Probability
P(A|B) = P(A ^ B) / P(B)
The Chain Rule:
P(A ^ B) = P(A|B) P(B)
Conditional Probability
P(A|B) = Fraction of worlds in which B is true that also have A true
H = “Have a headache”
F = “Coming down with Flu”
P(H) = 1/10
P(F) = 1/40
P(H|F) = ½
P(H|F) = Fraction of flu-inflicted worlds in which you have a Headache
       = #worlds with flu and headache / #worlds with flu
       = Area of “H and F” region / Area of “F” region
       = P(H ^ F) / P(F)
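Plugging the numbers into the chain rule from the previous slide (an added worked step): P(H ^ F) = P(H|F) P(F) = 1/2 × 1/40 = 1/80.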
Probabilistic Inference
H = “Have a headache”
F = “Coming down with Flu”
P(H) = 1/10
P(F) = 1/40
Area-wise we have:
P(H|F) = ½
[Venn diagram: the F and H circles overlap; the regions are labeled A, B, and C.]
P(F) = ?
P(H) = ?
P(H|F) = ?
P(F|H) = ?
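The last prompt can be filled in from the numbers above (an added worked step, not on the slide): P(F|H) = P(H ^ F) / P(H) = (P(H|F) P(F)) / P(H) = (1/80) / (1/10) = 1/8. So although a headache is very likely given flu, only 1 in 8 headache worlds are flu worlds.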
Independence
2 blue and 3 red marbles are in a bag. What are the chances of getting a blue marble?
The chance is 2 in 5. But after taking one out, the chances change!
So the next time?
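Finishing the arithmetic the slide leaves open (an added worked step): the first draw is blue with probability 2/5. Without replacement the second draw depends on the first: if the first was blue, P(blue) = 1/4; if the first was red, P(blue) = 2/4 = 1/2. The two draws are therefore not independent.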
Independence
• A and B are independent iff:
P(A|B) = P(A)
P(B|A) = P(B)
• Therefore, if A and B are independent:
P(A|B) = P(A ∩ B) / P(B) = P(A)
P(A ∩ B) = P(A) P(B)
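A short enumeration testing this definition on the marble example above (an added illustrative sketch; the encoding is my own, not from the slides):

from fractions import Fraction
from itertools import permutations

marbles = ["blue", "blue", "red", "red", "red"]
draws = list(permutations(marbles, 2))            # all equally likely ordered pairs, no replacement

A = [d for d in draws if d[1] == "blue"]          # A = "second draw is blue"
B = [d for d in draws if d[0] == "blue"]          # B = "first draw is blue"
A_and_B = [d for d in A if d in B]

P_A = Fraction(len(A), len(draws))                # 2/5
P_A_given_B = Fraction(len(A_and_B), len(B))      # 1/4
print(P_A, P_A_given_B, P_A == P_A_given_B)       # 2/5 1/4 False -> the draws are not independent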
Example: Ice Cream
70% of your friends like Chocolate, and 35% like Chocolate AND like Strawberry.
What percent of those who like Chocolate also like Strawberry?
P(Strawberry | Chocolate) = P(Chocolate and Strawberry) / P(Chocolate) = 0.35 / 0.7 = 50%
It means: 50% of your friends who like Chocolate also like Strawberry.
• The joint probability distribution for a set of random variables, X1,…,Xn gives the probability of every combination of values (an n-dimensional array with v^n values if all variables are discrete with v values; all v^n values must sum to 1):
P(X1,…,Xn)

positive:
          circle   square
red       0.20     0.02
blue      0.02     0.01

negative:
          circle   square
red       0.05     0.30
blue      0.20     0.20

P(positive | red ∧ circle) = ?
• The probability of all possible conjunctions (assignments of values to some subset of variables) can be calculated by summing the appropriate subset of values from the joint distribution.
P(red ∧ circle) = 0.20 + 0.05 = 0.25
• Therefore, all conditional probabilities can also be calculated.
P(positive | red ∧ circle) = P(positive ∧ red ∧ circle) / P(red ∧ circle) = 0.20 / 0.25 = 0.80
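A short sketch of these two computations, reading marginals and conditionals off the joint table (an added illustration; the dictionary encoding and the prob helper are my own, not from the slides):

joint = {  # P(class, color, shape), copied from the table above
    ("positive", "red",  "circle"): 0.20, ("positive", "red",  "square"): 0.02,
    ("positive", "blue", "circle"): 0.02, ("positive", "blue", "square"): 0.01,
    ("negative", "red",  "circle"): 0.05, ("negative", "red",  "square"): 0.30,
    ("negative", "blue", "circle"): 0.20, ("negative", "blue", "square"): 0.20,
}

def prob(cls=None, color=None, shape=None):
    """Sum the joint over all entries consistent with the given values (None = unconstrained)."""
    return sum(p for (c, col, sh), p in joint.items()
               if (cls is None or c == cls)
               and (color is None or col == color)
               and (shape is None or sh == shape))

p_red_circle = prob(color="red", shape="circle")                       # 0.20 + 0.05 = 0.25
p_pos_given = prob(cls="positive", color="red", shape="circle") / p_red_circle
print(round(p_red_circle, 2), round(p_pos_given, 2))                   # 0.25 0.8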
Bayes Rule
• Thomas Bayes (c. 1701 – 7 April 1761) was an English statistician, philosopher and Presbyterian minister, known for having formulated a specific case of the theorem that bears his name: Bayes' theorem. Bayes never published what would eventually become his most famous accomplishment; his notes were edited and published after his death by Richard Price.
P(A|B) = P(A ∧ B) / P(B) = P(B|A) P(A) / P(B)
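The rule is just the chain rule written in both orders (a one-line derivation added for clarity): P(A ∧ B) = P(A|B) P(B) = P(B|A) P(A); dividing both sides by P(B) gives Bayes' rule.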
Bayesian Learning
P(h|x) = P(x|h) P(h) / P(x)
P(h): prior belief (probability of hypothesis h before seeing any data)
P(x|h): likelihood (probability of the data if the hypothesis h is true)
P(x) = Σ_h P(x|h) P(h): data evidence (marginal probability of the data)
P(h|x): posterior (probability of hypothesis h after having seen the data x)
An Illustrating Example
A patient takes a lab test and the result comes back positive. It is
known that the test returns a correct positive result in only 98% of the
cases and a correct negative result in only 97% of the cases.
Furthermore, only 0.008 of the entire population has this disease.
1. What is the probability that this patient has cancer?
2. What is the probability that he does not have cancer?
3. What is the disease?
An Illustrating Example
• The available data has two possible outcomes:
Positive (+) and Negative (-)
• Various probabilities are
P(cancer) = 0.008
P(~cancer) = 0.992
P(+|cancer) = 0.98
P(-|cancer) = 0.02
P(+|~cancer) = 0.03
P(-|~cancer) = 0.97
• Now a new patient's test result comes back positive. Should we diagnose the patient with cancer or not?
Choosing Hypotheses
• Generally, we want the most probable hypothesis given the observed
data:
– Maximum a posteriori (MAP) hypothesis
– Maximum likelihood (ML) hypothesis
Definition:
Arg max stands for the argument of the maximum, that is to say, the set of points of the domain at which the given function attains its maximum value.
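A tiny illustration of arg max over a finite hypothesis set (an added sketch; the scores below are made-up numbers):

scores = {"h1": 0.2, "h2": 0.5, "h3": 0.3}   # f(h) for each hypothesis h
h_best = max(scores, key=scores.get)          # arg max_h f(h)
print(h_best)                                 # -> h2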
Maximum a posteriori (MAP)
• Maximum a posteriori (MAP) hypothesis:
P(h|x) = P(x|h) P(h) / P(x)
h_MAP = arg max_{h∈H} P(h|x) = arg max_{h∈H} P(x|h) P(h) / P(x) = arg max_{h∈H} P(x|h) P(h)
Note P(x) is independent of h, hence can be ignored.
Does patient have cancer or not?
P(+|cancer)P(cancer) = 0.98*0.008 = 0.0078
P(+|~cancer)P(~cancer) = 0.03*0.992 = 0.0298
MAP:
P(+|cancer)P(cancer) < P(+|~cancer)P(~cancer)
Diagnosis: ~cancer
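A short sketch reproducing the MAP calculation and normalizing to get the actual posterior (an added illustration; the variable names are my own):

p_cancer, p_not_cancer = 0.008, 0.992
p_pos_given_cancer, p_pos_given_not = 0.98, 0.03

score_cancer = p_pos_given_cancer * p_cancer                    # P(+|cancer) P(cancer)   = 0.0078
score_not    = p_pos_given_not * p_not_cancer                   # P(+|~cancer) P(~cancer) = 0.0298

h_map = "cancer" if score_cancer > score_not else "~cancer"     # pick the larger MAP score
posterior_cancer = score_cancer / (score_cancer + score_not)    # P(cancer|+) ≈ 0.21
print(h_map, round(posterior_cancer, 2))                        # ~cancer 0.21

Even with a positive test, the posterior probability of cancer is only about 21%, because the disease is so rare; this is consistent with the ~cancer diagnosis.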
Thank you!