Basics of Probability Theory for Ph.D. Students in Education, Social Sciences and Business
Shing On LEUNG and Hui Ping WU (April–May 2015)

This is a series of 3 talks, respectively on:
A. Probability Theory
B. Hypothesis Testing
C. Bayesian Inference

Lecture 1: Probability Theory

A. Probability Theory

Why probability theory?
Model vs. data: a model is not real, but serves as a framework for the analysis and interpretation of real data (e.g. the Normal distribution). A probability model is a model, not reality, and a basic framework for our uncertain future world. For example, you conduct a study, take a sample, and ask how the data collected from your sample can be generalized to a wider population. The basic mechanism behind this is probability theory. Probability theory is the basis for many other topics, e.g. hypothesis testing and Bayesian analysis, and possibly many others.

Axioms of probability
Axiom 1: every probability >= 0.
Axiom 2: the probabilities of all possible events sum to 1. (Axioms 1 and 2 already imply every probability <= 1.)
Axiom 3: Pr(A or B) = Pr(A) + Pr(B) if A and B do not overlap.
These are axioms, not assumptions, i.e. we start from them. They are not assumptions to be challenged; they are the starting point. If we do not agree with them, we cannot move forward.

Independence
Pr(A and B) = Pr(A) * Pr(B): the chance of A does not affect the chance of B. Say, the presence of one student (or subject) does not affect the presence of the others (independent sampling). Or say, a higher IQ score does not lead to a higher GPA (independence). Sometimes we want to assume independence, with the hope that it is rejected, and thereby establish a relation between the two (IQ and GPA in this case). This is the logic of hypothesis testing.

Conditional probability
Pr(A|B): the probability of A given that B has occurred. Pr(A|B) = Pr(A) if A and B are independent; the presence of B (say, one student) does not affect the chance of A (another student) (independence in sampling).
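The definitions of independence and conditional probability above can be checked by simulation. Below is a minimal sketch (not from the talk) using Python's standard library; the two events chosen, involving a pair of fair dice, are purely illustrative.

```python
import random

random.seed(0)
N = 100_000

# Two independent fair dice: A = "first die shows 6", B = "second die is even".
a_count = b_count = both = 0
for _ in range(N):
    d1, d2 = random.randint(1, 6), random.randint(1, 6)
    A, B = (d1 == 6), (d2 % 2 == 0)
    a_count += A
    b_count += B
    both += A and B

p_a, p_b, p_ab = a_count / N, b_count / N, both / N
print(f"Pr(A)        = {p_a:.3f}")        # ~ 1/6
print(f"Pr(B)        = {p_b:.3f}")        # ~ 1/2
print(f"Pr(A)*Pr(B)  = {p_a * p_b:.3f}")  # ~ 1/12
print(f"Pr(A and B)  = {p_ab:.3f}")       # ~ 1/12, matching Pr(A)*Pr(B)
print(f"Pr(A|B)      = {p_ab / p_b:.3f}") # ~ Pr(A): B tells us nothing about A
```

The last two lines show both faces of independence at once: Pr(A and B) factorizes, and conditioning on B leaves Pr(A) unchanged.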
Say, a higher IQ gives a higher chance of a higher GPA, i.e. Pr(A|B) > Pr(A) (dependence). Note that this concerns the chance, not the actual score: it is wrong to say that a higher IQ will give a higher GPA, but correct to say that a higher IQ leads to a higher chance of getting a higher GPA. If Pr(A|B) is not equal to Pr(A), then A and B are dependent. Conditional probability is the basis for Bayesian analysis, among other topics.

Monty Hall problem
You are asked to choose 1 out of 3 balls (say A, B and C). After you have chosen A, you are told that B is not the correct one. Would you (i) keep A, or (ii) change to C?
http://en.wikipedia.org/wiki/Monty_Hall_problem

Random variable
A variable that is random (what do we mean by that?).
Variable: a variable can take different values, e.g. sex is a variable with labels male and female; group is a variable with values experimental or control; IQ score is a variable with actual values, e.g. IQ = 50. Measurement is the measurement of variation.
Random: a random variable takes values according to a specified probability model, or, simply, takes values by some chance. Say, Pr(female) = 0.6 and Pr(male) = 0.4: the chance of getting a boy in UM may be only 0.4, so sex in UM is a random variable. So both words, "random" and "variable", carry their meanings.

Probability distributions (models)
A probability distribution is one that follows the 3 axioms; any distribution following the 3 axioms is a probability distribution. "A" implies it is 1 out of many possible distributions. Discrete vs. continuous: in discrete distributions, probabilities are represented by single points; in continuous distributions, probabilities are represented by areas.
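The Monty Hall problem introduced above can be settled by simulation rather than argument. The sketch below (not part of the original talk, and phrased in terms of doors and a host as in the linked Wikipedia article) plays the game many times under both strategies; switching wins about 2/3 of the time, because the reveal is informative.

```python
import random

random.seed(1)
N = 100_000
stay_wins = switch_wins = 0

for _ in range(N):
    doors = ["A", "B", "C"]
    prize = random.choice(doors)
    pick = random.choice(doors)           # contestant always picks first
    # Host opens a door that is neither the pick nor the prize.
    opened = random.choice([d for d in doors if d != pick and d != prize])
    # Switching means taking the one remaining closed door.
    switched = next(d for d in doors if d not in (pick, opened))
    stay_wins += (pick == prize)
    switch_wins += (switched == prize)

print(f"Pr(win | stay)   ~ {stay_wins / N:.3f}")    # about 1/3
print(f"Pr(win | switch) ~ {switch_wins / N:.3f}")  # about 2/3
```

This is conditional probability at work: given what the host's choice reveals, Pr(prize behind the other door | B opened) is 2/3, not 1/2.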
Probability density function (pdf) and cumulative distribution function (cdf)
pdf: plots the chance (density) of a random variable against its value.
cdf: plots the total chance that the random variable is less than or equal to a given value.
Both pdf and cdf are useful, in different situations.

Examples of probability distributions
Discrete: the discrete uniform (shown with n = 5, where n = b − a + 1); the binomial (n = 10, p = 0.5) and (n = 10, p = 0.1).
Continuous: the continuous uniform; the Normal; the chi-squared (where the parameter k is known as the df, degrees of freedom); the Gamma (where k and θ are the parameters, denoting shape and scale).
[Plots of these distributions are shown on the slides.]

Important note: knowledge of "common" probability distributions is necessary for the "non-common" ones, e.g. the Normal before fat-tailed distributions.

Relationships among probability distributions
http://en.wikipedia.org/wiki/Relationships_among_probability_distributions
Knowledge of one distribution facilitates knowledge of the others.

Expectation
E(X) = ∫ x p(x) dx, the "expected" value of X. "Expected" implies the value we get (i) in the long run and (ii) on average. Or: the value we get by taking all possibilities (as defined by the pdf) into consideration. Or: "automatic averaging" by the pdf. Expectation is needed for moments (described later), which in turn characterize a distribution, so all are linked.
The expectation can be taken of f(X) (a function of X) instead of only X, i.e. E(f(X)) = ∫ f(x) p(x) dx: the expected value of f(X) rather than just X.

Moments
How to describe a distribution? Graphically, by the pdf or cdf; or by some "summary statistics": mean, SD, skewness and kurtosis.
Mean: location. SD or variance: dispersion. Skewness: "dis-symmetry" (degree of departure from symmetry).
Kurtosis: the list below is copied (on 13 Feb. 2015) from http://en.wikipedia.org/wiki/Kurtosis. Each distribution listed has (i) mean = 0, (ii) SD = 1 and (iii) skewness = 0, but a different kurtosis.
D: Laplace distribution, also known as the double exponential distribution, red curve (two straight lines in the log-scale plot), excess kurtosis = 3
S: hyperbolic secant distribution, orange curve, excess kurtosis = 2
L: logistic distribution, green curve, excess kurtosis = 1.2
N: normal distribution, black curve (inverted parabola in the log-scale plot), excess kurtosis = 0
C: raised cosine distribution, cyan curve, excess kurtosis = −0.593762...
W: Wigner semicircle distribution, blue curve, excess kurtosis = −1
U: uniform distribution, magenta curve (shown for clarity as a rectangle in both images), excess kurtosis = −1.2

Important note: all of these have the same SD and variance. The SD alone just cannot tell you about the peak and the tails!

Moments (again)
It so happens that the mean, SD, skewness and kurtosis are based respectively on the first 4 moments of a distribution. The moments are defined as follows:
E(X) = ∫ x p(x) dx, for r = 1
E((X − µ)^r) = ∫ (x − µ)^r p(x) dx, for r > 1
For r > 1 we take (x − µ) instead of x, i.e. moments about the mean µ (= E(X)).

Implications of moments
If we do not know the distribution, but we do know its first 4 moments (from our data, say), we can accurately (though not exactly) calculate the tails (or percentiles, areas, etc.) of that distribution. Or: if two distributions have similar first 4 moments, they are similar; if two have similar first 3 moments, they are "quite" similar.

Linear transformation
y = a + bx. By a linear transformation, we can always make two distributions agree in mean and SD, i.e. in the first 2 moments.
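As a sketch of the ideas above (not from the talk), the code below estimates the first 4 moment-based summaries from a simulated uniform sample, using the standard sample formulas, and then applies a linear transformation y = a + bx to match a chosen target mean and SD. The uniform's theoretical excess kurtosis of −1.2 appears in the list above.

```python
import math
import random

random.seed(2)
xs = [random.uniform(0.0, 1.0) for _ in range(200_000)]  # uniform(0, 1) sample
n = len(xs)

mean = sum(xs) / n
var = sum((x - mean) ** 2 for x in xs) / n               # 2nd central moment
sd = math.sqrt(var)
skew = sum((x - mean) ** 3 for x in xs) / n / sd ** 3    # standardized 3rd moment
kurt = sum((x - mean) ** 4 for x in xs) / n / sd ** 4 - 3  # excess kurtosis

print(f"mean ~ {mean:.3f} (theory 0.5)")
print(f"SD   ~ {sd:.3f} (theory 1/sqrt(12) = 0.289)")
print(f"skewness ~ {skew:.3f} (theory 0)")
print(f"excess kurtosis ~ {kurt:.3f} (theory -1.2)")

# Linear transformation y = a + b*x: match any target mean and SD
# (the first 2 moments); skewness and kurtosis are unchanged by it.
target_mean, target_sd = 0.0, 1.0
b = target_sd / sd
a = target_mean - b * mean
ys = [a + b * x for x in xs]
```

Note what the transformation cannot do: it standardizes location and dispersion, but the shape summaries (skewness, kurtosis) of `ys` stay exactly those of `xs`, which is why the kurtosis comparison in the list above is made at mean 0 and SD 1.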
Log transformation
y = log(x) transforms the range of x, (0, ∞), to the range of y, (−∞, +∞).

Logit transformation
y = logit(x) = log( x / (1 − x) ) transforms x from the range (0, 1) to y in (−∞, +∞).

Univariate and multivariate distributions
Univariate: one dimension. Multivariate: several dimensions.
Multiple regression ("多元", many variables): one y (dependent variable) and many x's (independent variables).
Multivariate regression ("多維", many dimensions): many y's.

Correlation
Correlation: among many variables. Population correlations: within a probability model. Empirical correlations: within a real data set.

Conditional probability and conditional probability distributions
Conditional probability: Pr(A|B) = Pr(A and B) / Pr(B).
Conditional probability distribution: p(y|x) = p(y and x) / p(x), where x and y are now random variables following specified probability distributions.

Estimation
There can be "parameters" within "a class" of models, e.g. the Normal with mean μ and variance σ². To specify a model, each parameter takes a specific value, e.g. μ = 50, σ = 10. Parameter estimation refers to estimating specific values of the parameters within a class of distributions, e.g. estimating μ and σ within the Normal.
Three questions: (i) on what basis? (ii) by what criteria? (iii) by what method?
(i) Based on our data. (ii) Commonly used criteria: maximum likelihood, least squares, etc. (iii) Method: the way to do it (leave that to the statisticians).
MLE (maximum likelihood estimation) refers to finding the parameters that maximize the chance of getting the data, i.e. that maximize the pdf. It is one of the popular methods for finding parameters.
Let X = data and θ = parameters. Then Pr(X|θ) (the pdf of X given θ) = L(θ|X) (the likelihood of θ given X). Mathematically, both sides are the same, but Pr(X|θ) treats X as the variable and θ as fixed (the chance of X given θ), while L(θ|X) treats θ as the variable and X as fixed (what values of θ maximize L, given that we got X, our data).
Some of the above details are quite complicated, but you only need to know them conceptually, not technically.
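The MLE idea above can be made concrete with the Normal case, where the maximizing values have a closed form (the sample mean, and the square root of the mean squared deviation). The sketch below (not from the talk; the data are simulated IQ-like scores, an assumed example) computes the log-likelihood L(θ|X) and checks that the closed-form estimates really do sit at its maximum.

```python
import math
import random

random.seed(3)
# Hypothetical data: 10,000 IQ-like scores drawn from Normal(mu=50, sigma=10).
data = [random.gauss(50.0, 10.0) for _ in range(10_000)]
n = len(data)

def log_likelihood(mu, sigma):
    """log L(mu, sigma | X): same quantity as log Pr(X | mu, sigma),
    but viewed as a function of the parameters with the data fixed."""
    return sum(-math.log(sigma * math.sqrt(2 * math.pi))
               - (x - mu) ** 2 / (2 * sigma ** 2) for x in data)

# Closed-form Normal MLEs.
mu_hat = sum(data) / n
sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in data) / n)
print(f"MLE: mu ~ {mu_hat:.2f}, sigma ~ {sigma_hat:.2f}")

# Sanity check: nudging either parameter away lowers the likelihood.
best = log_likelihood(mu_hat, sigma_hat)
assert best > log_likelihood(mu_hat + 1.0, sigma_hat)
assert best > log_likelihood(mu_hat, sigma_hat + 1.0)
```

In practice a statistical package maximizes such a likelihood numerically; the point here is only the change of viewpoint, treating the same formula as a function of θ with X held fixed.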
Nowadays, computer packages will do this for you, but you need to understand the background behind it for interpretation purposes.

Concluding remarks
A probability model is a model. It is not real, but it has practical implications. These contents are basic for learning, and for understanding, other topics (even if you are doing qualitative research). They are a framework for understanding our uncertain world.

Next lecture: on hypothesis testing (based on this knowledge of probability theory).

Q&A
Shing On LEUNG [email protected]
Hui Ping WU [email protected]