STP 421 - Core Concepts

1. Probability Spaces: Probability spaces are used to model processes with random outcomes and have three components: a sample space Ω, which is the set of all possible outcomes; a collection F of subsets of Ω that are called events; and a function P : F → [0, 1] which assigns a probability P(E) to every event E ∈ F. P is said to be a probability distribution or probability measure on F.

2. Interpretations: The probability of an event E can be interpreted in two very different ways. On the one hand, the frequentist interpretation regards P(E) as the limiting frequency with which the event E occurs in an infinite series of independent, identically-distributed trials. In contrast, Bayesian interpretations regard P(E) as a measure of the plausibility of a conjecture E, either as a matter of subjective belief or in terms of the strength of the evidence in favor of E. Whereas frequentist probabilities can only be assigned to events that can occur in repeated trials, Bayesian probabilities can be assigned to events that are unique.

3. The Laws of Probability: Every probability distribution satisfies the following four identities:
   1. P(∅) = 0
   2. P(Ω) = 1
   3. Sum rule: P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
   4. Product rule: P(A ∩ B) = P(A)P(B|A) = P(B)P(A|B)
It follows from the first three laws that P(Aᶜ) = 1 − P(A) for any event A, where Aᶜ = Ω \ A is the complement of A in the sample space.

4. Countable additivity: If E1, E2, · · · is a countable collection of mutually exclusive events, then the probability of their union is equal to the sum of their probabilities:

   P(∪_{i≥1} Ei) = Σ_{i≥1} P(Ei).

Be aware that additivity does not generally extend to uncountable unions of disjoint sets.

5. Conditional Probabilities: The conditional probability of A given B is denoted P(A|B) and is defined by the formula

   P(A|B) = P(A ∩ B) / P(B),

provided that P(B) > 0. P(A|B) is undefined if P(B) = 0.
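As a quick numerical check of the definition and the product rule, the sketch below enumerates the sample space of two fair dice. The particular events (sum equals 8, first die even) are illustrative choices, not part of the notes.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely outcomes of rolling two fair dice.
omega = list(product(range(1, 7), repeat=2))

def prob(event):
    """P(E) under the uniform distribution on omega."""
    return Fraction(len([w for w in omega if event(w)]), len(omega))

def A(w): return w[0] + w[1] == 8   # event: the dice sum to 8
def B(w): return w[0] % 2 == 0      # event: the first die is even

p_B = prob(B)                                   # 1/2
p_A_and_B = prob(lambda w: A(w) and B(w))       # 1/12

# Conditional probability via the definition P(A|B) = P(A ∩ B) / P(B).
p_A_given_B = p_A_and_B / p_B                   # (1/12) / (1/2) = 1/6

# Product rule: P(A ∩ B) = P(B) P(A|B).
assert p_A_and_B == p_B * p_A_given_B
```

Using exact fractions rather than floats keeps the identity checks free of rounding error.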
P(A|B) can be interpreted as the probability that A is true given that we know or assume that B is true.

6. Independence: Two events A and B are said to be independent if P(A ∩ B) = P(A)P(B). In this case, it can be shown that P(A|B) = P(A) and P(B|A) = P(B) whenever P(A) > 0 and P(B) > 0. In other words, if two events are independent, then knowing that one of them has occurred does not change the likelihood that the other has occurred. Similarly, a collection of n events E1, · · · , En is said to be independent if every finite subcollection, say Ei1, · · · , Eim, satisfies the condition

   P(∩_{k=1}^{m} Eik) = ∏_{k=1}^{m} P(Eik).

For example, three events A, B and C are independent if and only if the following four identities hold:

   P(A ∩ B ∩ C) = P(A)P(B)P(C)
   P(A ∩ B) = P(A)P(B)
   P(A ∩ C) = P(A)P(C)
   P(B ∩ C) = P(B)P(C).

Independence does not follow from pairwise independence; e.g., it is possible to find three events A, B and C such that each pair of events is independent, but the three events taken together are not independent.

7. The Law of Total Probability: Suppose that E and F are events and that P(F) > 0 and P(Fᶜ) > 0. Then

   P(E) = P(F)P(E|F) + P(Fᶜ)P(E|Fᶜ).

In other words, the probability of E is equal to a weighted average of the conditional probabilities of E given F and of E given Fᶜ. More generally, if E, F1, · · · , Fn are events such that F1, · · · , Fn are mutually exclusive and E ⊂ F1 ∪ · · · ∪ Fn, then

   P(E) = Σ_{i=1}^{n} P(Fi)P(E|Fi),

which we can also interpret as a weighted average of the conditional probabilities of E given each of the events F1, · · · , Fn. This result is important because we can sometimes use it to calculate the probability of an event E by conditioning on additional information that makes the probability calculations simpler.

8. Bayes' formula: Suppose that E and F are events such that P(E) > 0 and P(F) > 0.
Then

   P(F|E) = P(F) P(E|F) / P(E),

which shows how the two conditional probabilities P(F|E) and P(E|F) are related to each other. The probability P(E) appearing in the denominator on the right-hand side can often be evaluated with the help of the law of total probability. This is arguably one of the most important identities in all of mathematics because of the central role that it plays in Bayesian statistical inference.

9. Random variables: Let (Ω, F, P) be a probability space and let E be a set, e.g., E could be the real line or the set of 2 × 2 matrices or the collection of English words with 5 letters. An E-valued random variable is simply a function X : Ω → E which assigns a value X(ω) in E to each outcome ω in the sample space. Often we think of X as a measurement or observation that depends on the state of a random process described by the probability space. In particular, because the outcome ω is random, so is the value X(ω) of the random variable. The distribution of an E-valued random variable X is a probability distribution PX defined on the set E by

   PX(A) = P(X ∈ A) = P(X⁻¹(A)),

where A is a subset of E. In words, PX(A) is the probability that the value assumed by X belongs to the set A. Often we will work solely with random variables and their distributions without specifying the underlying probability space.

10. Cumulative distribution function: If X is a real-valued random variable, then the cumulative distribution function of X is the function FX : R → [0, 1] defined by FX(x) = P(X ≤ x). It can be shown that the cumulative distribution function of a random variable uniquely determines its distribution and that this function is non-decreasing with jump discontinuities at those values x where P(X = x) > 0. Such values are called atoms.

11. Discrete random variables: A random variable X is said to be discrete if it takes values in a countable set E = {x1, x2, · · · }.
In this case, the probability mass function of X is the function pX : E → [0, 1] defined by pX(x) = P(X = x). The distribution of a discrete random variable is uniquely determined by its probability mass function. Indeed, if A is a subset of E, then

   P(X ∈ A) = Σ_{x∈A} pX(x).

In particular,

   1 = P(X ∈ E) = Σ_{x∈E} pX(x).

12. Continuous random variables: A real-valued random variable X is said to be continuous if there is a function fX : R → [0, ∞) such that

   P(a ≤ X ≤ b) = ∫_a^b fX(x) dx

for all −∞ ≤ a ≤ b ≤ ∞. The function fX is said to be the probability density function of X and it uniquely determines the distribution of X. In particular, by taking a = −∞ and b = ∞, we have

   1 = P(−∞ < X < ∞) = ∫_{−∞}^{∞} fX(x) dx.

Furthermore, by taking a = x = b, we have

   P(X = x) = ∫_x^x fX(t) dt = 0

for every x ∈ R. Thus continuous random variables do not have atoms. The density and the cumulative distribution function of a continuous random variable are related as follows:

   FX(x) = P(−∞ < X ≤ x) = ∫_{−∞}^x fX(t) dt,

while fX(x) = FX′(x), i.e., the cumulative distribution function is an anti-derivative of the density, while the density is the derivative of the cumulative distribution function.

13. Expectations: If X is a real-valued random variable, then the expected value of X is the quantity

   E[X] = Σ_{xi} xi pX(xi)          if X is discrete with probability mass function pX,
   E[X] = ∫_{−∞}^{∞} x fX(x) dx     if X is continuous with density fX,

provided that the sum or the integral exists. The expected value is also called the expectation or the mean. It is a weighted average of the values that the random variable can assume, with weights equal to the probabilities with which those values occur. Expectations have several important properties. First, they are linear in the sense that

   E[X1 + · · · + Xn] = Σ_{i=1}^{n} E[Xi],

i.e., the expected value of a sum of random variables is equal to the sum of the expected values of each random variable.
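The discrete expectation formula and linearity can be checked exactly for a fair die (an illustrative choice, not from the notes):

```python
from fractions import Fraction
from itertools import product

# p.m.f. of a single fair die: pX(x) = 1/6 for x in {1, ..., 6}.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

# E[X] = sum over the range of x * pX(x).
mean_one_die = sum(x * p for x, p in pmf.items())      # 7/2

# E[X1 + X2] computed directly from the joint p.m.f. of two
# independent dice (each pair has probability 1/36)...
mean_sum = sum((x1 + x2) * pmf[x1] * pmf[x2]
               for x1, x2 in product(pmf, repeat=2))

# ...agrees with linearity: E[X1 + X2] = E[X1] + E[X2] = 7.
assert mean_sum == mean_one_die + mean_one_die
```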
Secondly, given a function φ : R → R, the expected value of φ(X) can be calculated using one of the following two formulas:

   E[φ(X)] = Σ_{xi} φ(xi) pX(xi)          if X is discrete with p.m.f. pX,
   E[φ(X)] = ∫_{−∞}^{∞} φ(x) fX(x) dx     if X is continuous with density fX.

This result is sometimes known as the law of the unconscious statistician. If φ is non-linear, then in general E[φ(X)] ≠ φ(E[X]). However, if φ(x) = ax + c is a linear function, then the previous result can be used to show that E[aX + c] = aE[X] + c.

14. The Law of Large Numbers: Suppose that X1, X2, X3, · · · is a sequence of independent, identically-distributed real-valued random variables with finite mean µ = E[X1] and let Sn = X1 + · · · + Xn be the sum of the first n variables. Dividing through by n, we obtain the sample mean Sn/n, which is just the average of the first n values in this sequence. The Strong Law of Large Numbers asserts that the sequence of sample means is certain to converge to the true mean, i.e.,

   P( lim_{n→∞} Sn/n = µ ) = 1.

In other words, by collecting a sufficiently large number of independent data points, we are guaranteed that the sample mean will approach the true mean arbitrarily closely. This is one of the reasons that independent experimental trials are conducted when trying to estimate an unknown quantity.

15. Moments: The n'th moment of a real-valued random variable X is the expected value of X^n:

   µn = E[X^n] = Σ_{xi} xi^n pX(xi)          if X is discrete with p.m.f. pX,
   µn = E[X^n] = ∫_{−∞}^{∞} x^n fX(x) dx     if X is continuous with density fX.

These are sometimes referred to as non-central moments. In contrast, the n'th central moment of X is the expected value of the quantity (X − µ)^n, where µ = E[X] is the expected value of X:

   µ′n = E[(X − µ)^n] = Σ_{xi} (xi − µ)^n pX(xi)          if X is discrete with p.m.f. pX,
   µ′n = E[(X − µ)^n] = ∫_{−∞}^{∞} (x − µ)^n fX(x) dx     if X is continuous with density fX.

The central moments tell us something about the dispersion of the values of a random variable around its mean.
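For a concrete instance, the moments of a fair die (again an illustrative example) can be computed directly from its p.m.f.:

```python
from fractions import Fraction

# p.m.f. of a fair die; all moments can be read off from it.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def moment(n):
    """n'th non-central moment: E[X^n] = sum of x^n * pX(x)."""
    return sum(x**n * p for x, p in pmf.items())

mu = moment(1)                                   # mean: 7/2

def central_moment(n):
    """n'th central moment: E[(X - mu)^n]."""
    return sum((x - mu)**n * p for x, p in pmf.items())

print(central_moment(1))   # 0: the first central moment vanishes
print(central_moment(2))   # 35/12: dispersion around the mean

# Note that E[(X - mu)^2] = E[X^2] - mu^2 for this example.
assert central_moment(2) == moment(2) - mu**2
```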
The most important central moment is the second, which is also called the variance. The third central moment is sometimes referred to as the skewness and tells us whether the distribution of X is symmetric around the mean or skewed to the right or the left. The fourth central moment is known as the kurtosis and tells us how rapidly the tails of the distribution decay to 0. Notice that the first central moment is always equal to 0, since µ′1 = E[X − µ] = E[X] − µ = 0.

16. Variance: The second central moment of a random variable X is more commonly known as the variance:

   Var(X) = E[(X − µ)²].

In words, the variance is equal to the mean squared distance between the values assumed by X and its expected value. If the variance is close to 0, then X is typically close to its mean. In particular, Var(X) = 0 if and only if X is effectively a constant, i.e., if and only if P(X = µ) = 1. Notice, however, that the variance of any random variable is non-negative. By expanding the quadratic expression inside the expectation, it can be shown that the variance of X is also equal to the following expression:

   Var(X) = E[X²] − µ²,

i.e., the variance is equal to the difference between the second non-central moment and the square of the mean. It is often easier to calculate variances with this second formula than with the definition. Furthermore, if a and b are constants, then

   Var(aX + b) = a² Var(X),

i.e., adding a constant to a random variable does not change its variance, but multiplying the variable by a constant a rescales the variance by a factor of a². The square root of the variance of a random variable X is known as the standard deviation of X; it has the advantage that it has the same units as X.

17. Independence of random variables: Two random variables X and Y are said to be independent if the identity

   P(X ∈ A, Y ∈ B) = P(X ∈ A)P(Y ∈ B)

holds for all subsets A and B where X and Y take values.
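The factorization of the joint probability can be verified exactly for two independent dice; the particular subsets A and B below are arbitrary illustrative choices:

```python
from fractions import Fraction
from itertools import product

# Joint distribution of two fair dice rolled independently:
# every pair (x, y) has probability 1/36.
joint = {(x, y): Fraction(1, 36) for x, y in product(range(1, 7), repeat=2)}

def prob(A, B):
    """P(X in A, Y in B) under the joint distribution."""
    return sum(p for (x, y), p in joint.items() if x in A and y in B)

full = set(range(1, 7))
A, B = {1, 2}, {3, 4, 5}        # two arbitrary subsets of the range

# Marginals: P(X in A) = P(X in A, Y in anything), and similarly for Y.
pA = prob(A, full)              # 1/3
pB = prob(full, B)              # 1/2

# Independence: the joint probability factors into the marginals.
assert prob(A, B) == pA * pB    # 1/3 * 1/2 = 1/6
```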
In words, X and Y are independent if the value assumed by X does not influence the value assumed by Y and vice versa. More generally, the random variables X1, X2, · · · , Xn are independent if the identity

   P(X1 ∈ A1, X2 ∈ A2, · · · , Xn ∈ An) = ∏_{i=1}^{n} P(Xi ∈ Ai)

holds for all subsets A1, · · · , An of the ranges of these random variables. If the random variables are both independent and have the same distribution, then we say that they are independent and identically-distributed, which is customarily abbreviated i.i.d.

18. Variances of sums of random variables: If X1, · · · , Xn are independent real-valued random variables, then the variance of their sum is equal to the sum of their variances:

   Var(X1 + · · · + Xn) = Σ_{i=1}^{n} Var(Xi).

19. Some important discrete distributions: A random variable X is said to have the Bernoulli distribution with parameter p ∈ [0, 1] if X takes values in the set {0, 1} with probabilities p = P(X = 1) and 1 − p = P(X = 0). In this case, E[X] = p and Var(X) = p(1 − p). Bernoulli random variables are often used to represent experiments that have just two possible outcomes, e.g., success or failure, heads or tails, etc. Also, if A is an event in a probability space (Ω, F, P), then the indicator variable of A is the random variable 1A which is defined to be equal to 1 if A occurs and 0 otherwise. It follows that 1A is a Bernoulli random variable with parameter p = P(A).

A random variable X is said to have the Binomial distribution with parameters n ≥ 1 and p ∈ [0, 1] if X takes values in the set {0, 1, · · · , n} with probabilities

   pX(k) = P(X = k) = (n choose k) p^k (1 − p)^(n−k).

Here, (n choose k) = n!/(k!(n − k)!) is a binomial coefficient which counts the number of ways of choosing k objects from a set containing n objects. Binomial random variables arise in the following way. Suppose that we perform a series of n independent trials, each of which can result in a success with probability p.
Then the total number of successes in all n trials is Binomially distributed with parameters n and p. Equivalently, if X1, · · · , Xn are independent Bernoulli random variables with parameter p, then X = X1 + · · · + Xn is Binomially distributed with parameters n and p. This can be used to show that E[X] = np and Var(X) = np(1 − p).

A random variable X is said to have the Poisson distribution with parameter λ > 0 if X takes values in the non-negative integers {0, 1, 2, · · · } with probabilities

   pX(k) = P(X = k) = e^(−λ) λ^k / k!.

It can be shown that the mean and the variance of X are both equal to λ. The Poisson distribution arises as a limiting case of the Binomial distribution when the number of trials n is large and the success probability per trial p is small. Specifically, if Xn is a Binomial random variable with parameters n and pn = λ/n, then according to the Law of Rare Events,

   lim_{n→∞} P(Xn = k) = e^(−λ) λ^k / k!

for each integer k ≥ 0. This helps explain why count data, such as the number of typing errors per page of a book or the number of car accidents at an intersection per day, can often be modeled using a Poisson distribution.

20. Some important continuous distributions: A random variable X is said to be uniformly distributed on the compact interval [a, b] if X takes values in this interval with density

   fX(x) = 1/(b − a).

By convention, the density is defined to be equal to 0 outside of this interval. In the special case where a = 0 and b = 1, we say that X is a standard uniform random variable. In general, a random variable X is uniformly distributed on a region if every point within that region is equally likely to occur as a value of X. For X uniform on [a, b] we have E[X] = (a + b)/2 and Var(X) = (b − a)²/12.

A random variable X is said to be exponentially distributed with rate parameter λ > 0 if X takes values in the non-negative real numbers [0, ∞) with density

   fX(x) = λ e^(−λx).

In this case we have E[X] = 1/λ and Var(X) = 1/λ².
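The Law of Rare Events can be illustrated numerically by comparing the Binomial(n, λ/n) and Poisson(λ) probability mass functions for a large n; the values λ = 2 and n = 10,000 below are illustrative choices:

```python
import math

lam, n = 2.0, 10_000

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p): C(n, k) p^k (1 - p)^(n - k)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam): e^{-lam} lam^k / k!."""
    return math.exp(-lam) * lam**k / math.factorial(k)

for k in range(5):
    b = binom_pmf(k, n, lam / n)
    q = poisson_pmf(k, lam)
    # For large n the two probabilities agree to several decimal places.
    assert abs(b - q) < 1e-3
```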
Exponentially distributed random variables are often used to model random waiting times, such as the time until a mechanical part fails or the time between successive telephone calls. This application is most appropriate when the event in question occurs at a constant rate. This is because the exponential distribution is the unique continuous distribution which satisfies the following memorylessness property: for all t, s > 0,

   P(X > t + s | X > t) = P(X > s).

For example, if we think of X as a survival time or a lifespan, then memorylessness means that the death or failure rate does not change with age, i.e., the conditional probability of surviving for an additional s units of time given that one has already survived to age t is the same as the probability of surviving to age s.

A random variable X is said to be normally distributed with mean µ and variance σ² > 0 if X takes values in the real numbers with density

   fX(x) = (1/(σ√(2π))) exp(−(x − µ)²/(2σ²)).

This is the classic bell curve familiar from statistics. As the description suggests, E[X] = µ and Var(X) = σ². In the special case where µ = 0 and σ² = 1, X is said to have the standard normal distribution. The importance of the normal distribution, which is also known as the Gaussian distribution, is connected with its appearance in the Central Limit Theorem, described below. One useful property of the normal distribution is that it is maintained under linear transformations. Specifically, if X is normally distributed with mean µ and variance σ², and a and b are constants with a ≠ 0, then Y = aX + b is a normally distributed random variable with mean aµ + b and variance a²σ². In particular, if we set Z = (X − µ)/σ, then Z is a standard normal random variable.

21. Central Limit Theorem: Suppose that X1, X2, X3, · · · is a sequence of independent, identically-distributed random variables with finite mean µ = E[X1] and finite positive variance σ² = Var(X1).
If Sn = X1 + · · · + Xn is the sum of the first n variables, then according to the law of large numbers we know that the sample means Sn/n converge to µ. The Central Limit Theorem characterizes the deviations from this limit for large n. Specifically, for each n ≥ 1, let Zn be the normalized difference

   Zn = (√n/σ)(Sn/n − µ)

and let Z be a standard normal random variable. Then, according to the Central Limit Theorem,

   lim_{n→∞} P(Zn ≤ t) = P(Z ≤ t) = (1/√(2π)) ∫_{−∞}^{t} e^(−x²/2) dx

for every t ∈ (−∞, ∞). In other words, when n is large, the distribution of the normalized difference Zn is approximately that of a standard normal random variable. This holds no matter what the distribution of the Xi's is (e.g., whether it is discrete, continuous, bounded, etc.) so long as the mean and the variance are finite. In a sense, by summing over a large number of independent random variables, the properties of the distribution of the individual variables are 'averaged away' and we are left with the normal distribution. It also follows that when n is large, the distribution of the partial sum Sn is approximately normal with mean nµ and variance nσ². In particular, if X is a binomial random variable with parameters n and p, where n is large and p is neither too close to 0 nor too close to 1, then X is approximately normally distributed with mean np and variance np(1 − p). This result, which is known as the de Moivre-Laplace theorem, is a special case of the central limit theorem that follows from the fact that X has the same distribution as a sum of n independent Bernoulli random variables, each having parameter p.
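A Monte Carlo sketch of the Central Limit Theorem (the sample sizes and the choice of uniform summands are illustrative): sum n standard-uniform variables, form Zn, and compare the empirical distribution of Zn with standard normal probabilities.

```python
import math
import random

random.seed(0)                          # fixed seed for reproducibility
n, trials = 200, 5_000
mu, sigma = 0.5, math.sqrt(1 / 12)      # mean and s.d. of Uniform(0, 1)

def z_n():
    """One draw of Zn = (sqrt(n)/sigma) (Sn/n - mu) with uniform summands."""
    s_n = sum(random.random() for _ in range(n))
    return (math.sqrt(n) / sigma) * (s_n / n - mu)

samples = [z_n() for _ in range(trials)]

# For a standard normal, P(Z <= 0) = 0.5 and P(-1 <= Z <= 1) ~ 0.6827;
# the empirical fractions should be close for large n and many trials.
frac_below_zero = sum(z <= 0 for z in samples) / trials
frac_within_one = sum(-1 <= z <= 1 for z in samples) / trials
assert abs(frac_below_zero - 0.5) < 0.03
assert abs(frac_within_one - 0.6827) < 0.03
```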