Probability and Statistics
Joyeeta Dutta-Moscato
May 24, 2016

"There are three kinds of lies: lies, damned lies and statistics" - Mark Twain, who attributed it to Disraeli

Terms and concepts

Descriptive statistics:
• Sample vs population
• Central tendency: mean, median, mode
• Variance, standard deviation
• Normal distribution
• Cumulative distribution

Statistical hypothesis testing:
• Hypothesis
• Null hypothesis (H0)
• Alternate hypothesis (HA)
• Significance
• P-value
• Confidence interval

Statistical models:
• Method of least squares
• Euclidean distance
• Overfitting & generalization

Statistics

Central tendency and Spread
• Mean
• Median
• Mode
• Variance, standard deviation
• Normal distribution
http://www.mathsisfun.com/data/standard-normal-distribution.html

But do numbers tell the full story?

Anscombe's Quartet
Good graphics reveal data.
https://en.wikipedia.org/wiki/Anscombe's_quartet

Building a model from data
Fitting the data to a model: y = f(x)
Objective: minimize the mean square error.
Does mean square error = 0 mean this is the best model? What does this mean about the relationship between x and y?

Correlation
When we say that two genes are correlated, we mean that they vary together. But how do we quantify the degree of correlation? Pearson's r measures the extent to which two random variables are linearly related.
• Perfect linear correlation = 1
• No correlation = 0
• Anti-correlation = -1
(Figures: examples of positive and negative correlations.)

What do correlations tell us?
Interesting site: http://www.tylervigen.com/
So how do we make statements of causality? We can ask the question: how likely is event X given an event Y?

Probability: How likely is it?
How likely is a certain observation?
Possible outcomes:
• Coin toss: Head, Tail. P(Head) = ? P(Tail) = ?
• Die roll: 1, 2, 3, 4, 5, 6. P(1) = ? P(2) = ? ... P(6) = ?

Probability of Multiple Events
Toss a coin twice. How likely are you to observe 2 heads?
P(2 Heads) = P(Head) x P(Head)
Key condition: INDEPENDENCE
What is the DISTRIBUTION of outcomes?
P(2 Heads) = 1/4
P(2 Tails) = 1/4
P(1 Head) = P(1 Head, 1 Tail) + P(1 Tail, 1 Head) = 1/4 + 1/4 = 1/2
Key condition: the probabilities must sum to 1.
(Figure: histogram of outcomes of 10 tosses.)
As the number of independent (random) events grows, the distribution approaches a NORMAL or GAUSSIAN distribution.

Cumulative Distribution
The probability distribution shows the probability of the value X. The cumulative distribution shows the probability of a value less than or equal to X.
Wikipedia: http://en.wikipedia.org/wiki/Cumulative_distribution_function
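To make the coin-toss distribution and the cumulative distribution above concrete, here is a minimal Python sketch (not part of the original slides; the number of tosses, number of experiments, and random seed are arbitrary illustration choices):

```python
# Minimal sketch (not from the slides): simulate repeated coin-toss experiments,
# tabulate the distribution of "number of heads", and build its cumulative distribution.
import random
from collections import Counter

random.seed(0)
n_tosses = 10          # tosses per experiment (as in the "10 tosses" histogram)
n_experiments = 100_000

# Number of heads in each experiment; this follows a Binomial(n_tosses, 0.5) distribution.
counts = Counter(sum(random.random() < 0.5 for _ in range(n_tosses))
                 for _ in range(n_experiments))

# Probability distribution P(X = k) and cumulative distribution P(X <= k).
# The P(X = k) values sum to 1, matching the "must sum to 1" condition above.
cumulative = 0.0
for k in range(n_tosses + 1):
    p_k = counts[k] / n_experiments
    cumulative += p_k
    print(f"k={k:2d}  P(X=k)={p_k:.3f}  P(X<=k)={cumulative:.3f}")
```

Increasing n_tosses makes the histogram of head counts increasingly bell-shaped, which is the normal (Gaussian) limit the slide describes.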
Statistical Hypothesis Testing
You are running experiments to test the effect of a drug on subjects. How likely is it that the effect would be observed even if no real relation exists? If that likelihood is sufficiently small (e.g. < 1%), then it can be assumed that a real relation exists. Otherwise, any observed effect may simply be due to chance.
H0: Null hypothesis - no relation exists.
HA: Alternate hypothesis - there is some sort of relation.

The SIGNIFICANCE LEVEL is decided a priori and determines whether H0 is accepted or rejected (e.g. 0.1, 0.05, 0.01). If the P-VALUE < significance level, then H0 is rejected, i.e. the result is considered STATISTICALLY SIGNIFICANT.
Wikipedia: http://en.wikipedia.org/wiki/P-value

Error reporting
How reliable is the measurement? (How reliable is the estimate?)
E.g. 95% CONFIDENCE INTERVAL: we are 95% confident that the true value is within this interval.
The STANDARD ERROR can be used to approximate confidence intervals. Standard error = standard deviation of the sampling distribution.

Back to Probability
0 ≤ P(A) ≤ 1
P(A) = 1 - P(Aᶜ) [Aᶜ = complement of A]
If events A and B are independent (event B has no effect on the probability of event A), then:
P(A, B) = P(A) · P(B)
If they are not independent, then:
P(A, B) = P(A|B) · P(B)
P(A, B) = JOINT PROBABILITY of A and B
P(A|B) = CONDITIONAL PROBABILITY of A given B

Example
We are given 2 urns, each containing a collection of colored balls. Urn 1 contains 2 white and 3 blue balls; Urn 2 contains 3 white and 4 blue balls. A ball is drawn at random from Urn 1 and put into Urn 2, and then a ball is picked at random from Urn 2 and examined. What is the probability that the ball is blue?
Scenario 1: the ball moved from Urn 1 is blue (probability 3/5); Urn 2 then holds 5 blue balls out of 8.
Scenario 2: the ball moved from Urn 1 is white (probability 2/5); Urn 2 then holds 4 blue balls out of 8.
P(blue) = (3/5) · (5/8) + (2/5) · (4/8) = 23/40 = 0.575

Bayes Theorem
P(A|B) = P(B|A) · P(A) / P(B)
How?
P(A, B) = P(A|B) · P(B), and P(A, B) = P(B, A), so
P(A|B) = P(A, B) / P(B), or P(A|B) = P(B|A) · P(A) / P(B)
Also, since P(B) = P(B|A) · P(A) + P(B|Aᶜ) · P(Aᶜ), this is equivalent to:
P(A|B) = P(B|A) · P(A) / [P(B|A) · P(A) + P(B|Aᶜ) · P(Aᶜ)]

Contingency Table (Courtesy: Rich Tsui, PhD)
You have developed a test to detect a certain disease. What are the True Positive Rate (TPR) and True Negative Rate (TNR) of this test?
Sensitivity = TPR = TP / (TP + FN) = P(Test+ | Disease+)
Specificity = TNR = TN / (TN + FP) = P(Test- | Disease-)
What are the Positive Predictive Value (PPV) and Negative Predictive Value (NPV)?
PPV = TP / (TP + FP) = P(Disease+ | Test+)
NPV = TN / (TN + FN) = P(Disease- | Test-)

Sensitivity (TPR): the probability that a person who has the condition is correctly identified as having it.
Specificity (TNR): the probability that a healthy person is correctly identified as not having the condition.
Positive predictive value (PPV): given that you test positive, the probability that you actually have the condition.
Negative predictive value (NPV): given that you test negative, the probability that you actually do not have the condition.
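As a small illustration of the contingency-table quantities defined above, here is a minimal Python sketch (not from the slides; the function name and the example counts are made up for illustration):

```python
# Minimal sketch (not from the slides): compute sensitivity, specificity, PPV and NPV
# from raw contingency-table counts. The example counts below are hypothetical.
def contingency_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    return {
        "sensitivity (TPR)": tp / (tp + fn),  # P(Test+ | Disease+)
        "specificity (TNR)": tn / (tn + fp),  # P(Test- | Disease-)
        "PPV":               tp / (tp + fp),  # P(Disease+ | Test+)
        "NPV":               tn / (tn + fn),  # P(Disease- | Test-)
    }

# Hypothetical counts: 90 true positives, 10 false negatives,
# 90 true negatives, 10 false positives.
print(contingency_metrics(tp=90, fp=10, fn=10, tn=90))
```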
The prevalence of a particular disease is 1/10. A test for this disease provides a correct diagnosis in 90% of cases (i.e. if you have the disease, 90% of the time you will test positive, and if you do not have the disease, 90% of the time you will test negative). Given that you test positive for the disease, what is the probability that you actually have the disease?

Prevalence = prior probability in the population.
Notation: T+ = test positive, T- = test negative, D+ = disease present, D- = disease absent.
Solution:
P(D+) = 0.1
P(T+|D+) = 0.9
P(T-|D-) = 0.9, therefore P(T+|D-) = 1 - 0.9 = 0.1
P(D+|T+) = P(T+|D+) · P(D+) / [P(T+|D+) · P(D+) + P(T+|D-) · P(D-)]
         = (0.9 · 0.1) / (0.9 · 0.1 + 0.1 · 0.9)
         = 0.5
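The same calculation can be written as a short Python sketch (not from the slides; the function name is made up, and the 1% prevalence in the second call is an extra illustration beyond the worked example):

```python
# Minimal sketch (not from the slides): P(D+|T+) via Bayes' theorem.
def p_disease_given_positive(prevalence: float,
                             sensitivity: float,
                             specificity: float) -> float:
    p_pos_given_disease = sensitivity        # P(T+ | D+)
    p_pos_given_healthy = 1.0 - specificity  # P(T+ | D-)
    numerator = p_pos_given_disease * prevalence
    denominator = numerator + p_pos_given_healthy * (1.0 - prevalence)
    return numerator / denominator

# Values from the worked example: prevalence 1/10, 90% correct diagnosis.
print(p_disease_given_positive(prevalence=0.1, sensitivity=0.9, specificity=0.9))   # 0.5
# Hypothetical lower prevalence, same test accuracy.
print(p_disease_given_positive(prevalence=0.01, sensitivity=0.9, specificity=0.9))  # ~0.083
```

Lowering the prevalence shows how strongly the posterior P(D+|T+) depends on the prior probability in the population, which is the point of treating prevalence as the prior.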