Basics

We often denote our sample space by S and an event (or subset) within S as A. For any event, P(A) ≥ 0, and we also require

P(S) = 1,

which is equivalent to saying that something must happen. A side effect of this is that

P(A) + P(A^c) = 1,

which, if we push it to arbitrary sets, says that

P(A) = ∑_{k=1}^{n} P(A_k)

if A = A_1 ∪ … ∪ A_n and A_i ∩ A_j = ∅ for i ≠ j.

Many ideas from set theory carry over to probability, such as

if A ⊆ B, then P(A) ≤ P(B),

or

P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Extending this last idea further gives us the inclusion-exclusion principle: P(A_1 ∪ … ∪ A_n) can be calculated by finding all the intersections between those sets, adding the "odd" intersections and subtracting the "even" ones.

Combinatorics

Let E_1, …, E_k be sets with n_1, …, n_k elements respectively. Then there are n_1 n_2 ⋯ n_k ways in which one element from each set may be chosen. A set with n elements has 2^n subsets.

An ordered arrangement of r objects from a set of n distinguishable objects is called a permutation. The number of r-element permutations is

nPr = n! / (n − r)!.

The number of distinguishable permutations of n objects with k different types (n_1 of the first type, …, n_k of the kth type) is

n! / (n_1! n_2! ⋯ n_k!), if n_1 + … + n_k = n.

An unordered arrangement of r objects from a set containing n objects is called a combination. The number of r-element combinations is

(n choose r) = n! / (r! (n − r)!).

The binomial expansion is defined for integer n ≥ 0 as

(x + y)^n = ∑_{i=0}^{n} (n choose i) x^(n−i) y^i.

Conditional Probability

The conditional probability of A given B is

P(A|B) = P(A ∩ B) / P(B), for P(B) > 0.

The standard axioms can be proved as theorems for conditional probability distributions. This formula can be reversed to give

P(A ∩ B) = P(A|B) P(B) = P(B|A) P(A).

This idea allows us to write

P(A) = P(A|B) P(B) + P(A|B^c) P(B^c).

If we define a partition of our sample space S as a set {B_1, …, B_n} whose events are mutually exclusive and satisfy B_1 ∪ … ∪ B_n = S, then the law of total probability is a natural extension of this concept:

P(A) = ∑_{i=1}^{n} P(A|B_i) P(B_i).

Plugging this into the conditional probability definition gives us Bayes' Theorem:

P(B_k|A) = P(A|B_k) P(B_k) / ∑_{i=1}^{n} P(A|B_i) P(B_i).

The event A is considered independent of the event B if P(A|B) = P(A). Equivalently, the two events are independent if

P(A ∩ B) = P(A) P(B).

We know that if A and B are independent, so are A and B^c. To extend independence to more than two events, the probability of the intersection of every sub-collection must factor into the product of the individual probabilities.
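As a concrete illustration of the law of total probability and Bayes' theorem for a two-event partition {B, B^c}, here is a minimal Python sketch; the prior and conditional probabilities in it are made-up values chosen purely for the example.

# Bayes' theorem with a two-event partition {B, B^c}.
# All probabilities below are illustrative, made-up values.
p_B = 0.01           # P(B)
p_A_given_B = 0.95   # P(A | B)
p_A_given_Bc = 0.02  # P(A | B^c)

# Law of total probability: P(A) = P(A|B)P(B) + P(A|B^c)P(B^c)
p_A = p_A_given_B * p_B + p_A_given_Bc * (1 - p_B)

# Bayes' theorem: P(B|A) = P(A|B)P(B) / P(A)
p_B_given_A = p_A_given_B * p_B / p_A

print(f"P(A)   = {p_A:.4f}")          # 0.0293
print(f"P(B|A) = {p_B_given_A:.4f}")  # about 0.3242

With these numbers P(B|A) comes out far smaller than P(A|B), the usual base-rate effect when P(B) itself is small.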
Basic Random Variables

If an experiment has sample space S, a random variable is a function X : S → R such that, for any I ⊆ R, {s : X(s) ∈ I} is an event. The function F_X defined on R is the cumulative distribution function, or CDF, of the random variable X if F_X(t) = P(X ≤ t). Every CDF must be:
• nondecreasing,
• lim_{t→∞} F(t) = 1,
• lim_{t→−∞} F(t) = 0,
• right continuous.

If X can take countably many values, it is a discrete random variable. The probability mass function p of a discrete RV satisfies:
• p(x) = 0 if X cannot take the value x,
• p(x) = P(X = x) if X can take the value x,
• ∑_{i=1}^{∞} p(x_i) = 1 over the values x_i that X may take.

The expected value of a discrete RV X is

E(X) = ∑_{x∈A} x p(x),

where A is the set of values X may take. We also know that

E(h(X)) = ∑_{x∈A} h(x) p(x).

This allows us to compute the variance

Var(X) = E[(X − E(X))^2] = E(X^2) − E(X)^2.

We also define the standard deviation

σ_X = √Var(X).

The term i.i.d. stands for "independent and identically distributed."

Special Discrete Random Variables

A Bernoulli RV X ∼ Bernoulli(p) takes value 1 with probability p and 0 with probability 1 − p. E(X) = p and Var(X) = p(1 − p).

A Binomial RV X ∼ Binom(n, p) is the number of successes in n i.i.d. Bernoulli trials with parameter p. P(X = k) = (n choose k) p^k (1 − p)^(n−k), E(X) = np and Var(X) = np(1 − p).

A Poisson RV X ∼ Poisson(λ) is the number of successes over a given time with rate of success λ. P(X = k) = e^(−λ) λ^k / k!, and E(X) = Var(X) = λ. If Y ∼ Binom(n, p) and n is very large but 0.1 < np < 10, then Poisson(np) is a good approximation to the distribution of Y.

A Geometric RV X ∼ Geom(p) is the number of trials needed to obtain the first success, if p is the probability of success. P(X = k) = (1 − p)^(k−1) p, E(X) = 1/p and Var(X) = (1 − p)/p^2. Geometric RVs satisfy a memoryless property: P(X > n + m | X > m) = P(X > n).

A Negative Binomial RV X ∼ NBinom(p, r) is the number of trials required to obtain r successes, where p is the probability of success. P(X = k) = (k−1 choose r−1) p^r (1 − p)^(k−r), E(X) = r/p and Var(X) = r(1 − p)/p^2.

A Hypergeometric RV X ∼ Hyper(N, D, n) is the number of broken units when n units are chosen, without replacement, from N units of which D are broken. P(X = k) = (D choose k)(N−D choose n−k) / (N choose n), E(X) = nD/N and Var(X) = [nD(N − D)/N^2][(N − n)/(N − 1)].

Continuous Random Variables

A continuous RV X has a CDF F_X which is continuous. By differentiating, f_X = F_X′ gives us the probability density function of X, wherever F_X is differentiable. This allows for

P(a < X < b) = ∫_a^b f_X(t) dt = F(b) − F(a).

Our definition of the CDF therefore requires that

∫_{−∞}^{∞} f_X(t) dt = 1,

that f_X(t) ≥ 0, and that F_X is nondecreasing. Note that P(a < X < b) = P(a ≤ X ≤ b) if X is a continuous RV.

The expected value of a continuous RV X is

E(X) = ∫_{−∞}^{∞} x f_X(x) dx.

As before, we know that

E(g(X)) = ∫_{−∞}^{∞} g(x) f_X(x) dx.

This allows us to compute

Var(X) = E[(X − E(X))^2] = E(X^2) − E(X)^2.

As before, σ_X = √Var(X) is the standard deviation.

Common Continuous Distributions

For a continuous RV X, there are several distributions which occur often enough to warrant names.

The Uniform distribution represents a point being selected randomly from an interval without any preference between points. If X ∼ Unif(a, b), for a < b, then

f_X(t) = 1/(b − a), t ∈ [a, b).

This tells us that E(X) = (a + b)/2 and Var(X) = (b − a)^2 / 12.

The most important distribution is the Normal or Gaussian distribution. If X ∼ N(µ, σ^2), for σ > 0, then

f_X(t) = (1 / (σ√(2π))) e^(−(t−µ)^2 / (2σ^2)), t ∈ R,

with E(X) = µ and Var(X) = σ^2. As a special case, we may use the letter Z ∼ N(0, 1) to denote the Standard Normal distribution. Sometimes we may use Φ(t) = F_Z(t), though I don't use that often. As a symmetric density (about t = 0), we know F_Z(−t) = 1 − F_Z(t). Things such as IQ, height, GPA, or other such quantities may be normally distributed.

The exponential distribution is also important, and allows us to describe the time until some event occurs. If X ∼ Exp(λ), for λ > 0, then

f_X(t) = λ e^(−λt), t ∈ [0, ∞).

We know that E(X) = 1/λ and Var(X) = 1/λ^2. An important feature of the exponential distribution is its memoryless property:

P(X > s + t | X > t) = P(X > s).

If we want to consider the time until r events occur, we will instead need the Gamma distribution. If X ∼ Gamma(r, λ), for r, λ > 0, then

f_X(t) = (λ / Γ(r)) e^(−λt) (λt)^(r−1), t ∈ [0, ∞),

where

Γ(r) = ∫_0^∞ t^(r−1) e^(−t) dt, which equals (r − 1)! for integer r.

We know that E(X) = r/λ and Var(X) = r/λ^2.
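The Poisson approximation to the Binomial mentioned above is easy to check numerically. The sketch below compares the two pmfs for one arbitrarily chosen pair n = 500, p = 0.01, so that np = 5 falls in the 0.1 < np < 10 range.

import math

# Compare Binomial(n, p) with its Poisson(np) approximation.
# n and p are arbitrary illustrative values with 0.1 < np < 10.
n, p = 500, 0.01
lam = n * p  # Poisson rate, np = 5

def binom_pmf(k):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k):
    return math.exp(-lam) * lam**k / math.factorial(k)

for k in range(11):
    print(f"k={k:2d}  binom={binom_pmf(k):.5f}  poisson={poisson_pmf(k):.5f}")

The two columns agree closely, and the agreement improves as n grows with np held fixed.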
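Similarly, the exponential facts above (its mean, variance, and memoryless property) can be checked by simulation. This is a rough sketch using Python's standard library; the rate λ, the times s and t, and the sample size are arbitrary choices.

import random

# Empirical check of E(X), Var(X), and the memoryless property
# for X ~ Exp(lam). All parameter values are illustrative.
lam, s, t = 1.5, 0.4, 0.7
N = 200_000

draws = [random.expovariate(lam) for _ in range(N)]

mean = sum(draws) / N
var = sum((x - mean) ** 2 for x in draws) / N

# Memoryless property: P(X > s + t | X > t) should match P(X > s).
exceed_t = [x for x in draws if x > t]
lhs = sum(1 for x in exceed_t if x > s + t) / len(exceed_t)
rhs = sum(1 for x in draws if x > s) / N

print(f"E(X)   ~ {mean:.3f}  (theory {1/lam:.3f})")
print(f"Var(X) ~ {var:.3f}  (theory {1/lam**2:.3f})")
print(f"P(X > s+t | X > t) ~ {lhs:.3f}   P(X > s) ~ {rhs:.3f}")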
Bivariate Distributions

To study how two discrete RVs interact, we need to consider their joint probability mass function:

p(x, y) = P(X = x, Y = y), with ∑_{x∈A} ∑_{y∈B} p(x, y) = 1,

where A and B are the sets of points at which X and Y may take values.

The marginal mass function p_X(x) is defined as

p_X(x) = ∑_{y∈B} p(x, y),

with p_Y(y) defined analogously. Using these marginal mass functions we can compute

E(X) = ∑_{x∈A} x p_X(x), E(Y) = ∑_{y∈B} y p_Y(y).

We also know that

E(h(X, Y)) = ∑_{x∈A} ∑_{y∈B} h(x, y) p(x, y).

The same ideas are all true in the continuous setting: a joint probability density function f(x, y) is a function such that, for RVs X and Y and a region R ⊆ R^2,

P((X, Y) ∈ R) = ∬_R f(x, y) dx dy.

We know that

∫_{−∞}^{∞} ∫_{−∞}^{∞} f(x, y) dx dy = 1.

We define the marginal density functions

f_X(x) = ∫_{−∞}^{∞} f(x, y) dy, f_Y(y) = ∫_{−∞}^{∞} f(x, y) dx,

and with those we can define E(X) and E(Y).

Two random variables are called independent if, for arbitrary A, B ⊆ R,

P(X ∈ A, Y ∈ B) = P(X ∈ A) P(Y ∈ B).

This in turn implies that

F(x, y) = F_X(x) F_Y(y),

which gives, in the discrete and continuous cases respectively,

p(x, y) = p_X(x) p_Y(y), f(x, y) = f_X(x) f_Y(y).

An important result of this is that, if X and Y are independent RVs,

E[g(X) h(Y)] = E[g(X)] E[h(Y)].

The covariance between RVs X and Y is

Cov(X, Y) = E[(X − E(X))(Y − E(Y))] = E(XY) − E(X) E(Y).

Note that Cov(X, X) = Var(X), and if X and Y are independent then Cov(X, Y) = 0.

Central Limit Theorem

A random sample is a collection of i.i.d. random variables X_1, …, X_n. After observing these variables, X_1 = x_1 and so on, statistics computed from them can be used to study the population of the X_i.

The Central Limit Theorem states that if X_1, X_2, …, X_n is a sequence of i.i.d. RVs with E(X_i) = µ and Var(X_i) = σ^2, then

X̄_n = (1/n)(X_1 + … + X_n)

is a random variable whose distribution approaches N(µ, σ^2/n) for large n.

Using the Central Limit Theorem, we can approximate the population mean E[X_i] = µ using the sample mean X̄_n = x̄. For increasingly large samples, Var(X̄_n) decreases, meaning that the sample mean x̄ is an increasingly good approximation of the true mean µ.
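A short simulation makes the Central Limit Theorem tangible: averaging n draws from a deliberately non-normal Exp(λ) population, the sample means cluster around µ = 1/λ with variance close to σ^2/n = 1/(λ^2 n). The rate, sample size, and number of replications below are arbitrary illustrative choices.

import random

# Empirical check of the CLT with an exponential population.
# lam, n, and reps are arbitrary illustrative values.
lam = 2.0        # population Exp(lam): mu = 1/lam, sigma^2 = 1/lam^2
n = 50           # sample size per replication
reps = 10_000    # number of sample means to generate

means = []
for _ in range(reps):
    sample = [random.expovariate(lam) for _ in range(n)]
    means.append(sum(sample) / n)

grand_mean = sum(means) / reps
var_of_means = sum((m - grand_mean) ** 2 for m in means) / reps

print(f"mean of sample means: {grand_mean:.4f}  (theory {1/lam:.4f})")
print(f"var  of sample means: {var_of_means:.5f} (theory {1/(lam**2 * n):.5f})")

A histogram of the stored means would look close to the N(µ, σ^2/n) bell curve, even though the underlying exponential population is strongly skewed.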