1 Introduction to Probability

1.1 Basic Rules of Probability

Set Theory Digression

A set is defined as any collection of objects, which are called points or elements. The biggest possible collection of points under consideration is called the space, universe, or universal set. In Probability Theory the space is called the sample space and is denoted by Ω.

A set A is called a subset of B (we write A ⊆ B or B ⊇ A) if every element of A is also an element of B. A is called a proper subset of B (we write A ⊂ B or B ⊃ A) if every element of A is also an element of B and there is at least one element of B which does not belong to A. Two sets A and B are called equivalent or equal (we write A = B) if A ⊆ B and B ⊆ A. A set with no points is called the empty or null set and is denoted by ∅.

The complement of a set A with respect to the space Ω, denoted by A^c or Ā, is the set of all points that are in Ω but not in A. The intersection of two sets A and B is the set that consists of the elements common to the two sets; it is denoted by A ∩ B or AB. The union of two sets A and B is the set that consists of all points that are in A or in B or in both (each counted once); it is denoted by A ∪ B. The set difference of two sets A and B is the set of all points in A that are not in B; it is denoted by A − B.

Properties of Set Operations

Commutative: A ∪ B = B ∪ A and A ∩ B = B ∩ A.
Associative: A ∪ (B ∪ C) = (A ∪ B) ∪ C and A ∩ (B ∩ C) = (A ∩ B) ∩ C.
Distributive: A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C) and A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C).
(A^c)^c = A, i.e. the complement of the complement of A is A itself.
If A is a subset of Ω (the space), then: A ∩ Ω = A, A ∪ Ω = Ω, A ∩ ∅ = ∅, A ∪ ∅ = A, A ∩ A^c = ∅, A ∪ A^c = Ω, A ∩ A = A, and A ∪ A = A.
De Morgan's Laws: (A ∪ B)^c = A^c ∩ B^c and (A ∩ B)^c = A^c ∪ B^c.
Disjoint or mutually exclusive sets are sets whose intersection is the empty set, i.e. A and B are mutually exclusive if A ∩ B = ∅. Subsets A_1, A_2, ... are mutually exclusive if A_i ∩ A_j = ∅ for any i ≠ j.

The sample space is the collection or totality of all possible outcomes of a conceptual experiment. An event is a subset of the sample space. The class of all events associated with a given experiment is defined to be the event space.

Classical or a priori probability: if a random experiment can result in N mutually exclusive and equally likely outcomes and if N(A) of these outcomes have an attribute A, then the probability of A is the fraction N(A)/N, i.e. P(A) = N(A)/N, where N = N(A) + N(A^c).

Example: Consider drawing an ace (event A) from a deck of 52 cards. What is P(A)? We have N(A) = 4 and N(A^c) = 48. Then N = N(A) + N(A^c) = 4 + 48 = 52 and P(A) = N(A)/N = 4/52.

Frequency or a posteriori probability: the ratio n_A/n of the number of times n_A that an event A has occurred out of n trials. Example: assume that we flip a coin 1000 times and observe 450 heads. Then the a posteriori probability is P(A) = n_A/n = 450/1000 = 0.45 (this is also the relative frequency). Notice that the a priori probability is in this case 0.5.

Subjective probability: this is based on intuition or judgment.

We shall be concerned with a priori probabilities. These probabilities often involve counting the possible outcomes.

1.1.1 Methods of Counting

We have the following cases:

1. Duplication is permissible and order is important (multiple choice arrangement), i.e. the element AA is permitted and AB is a different element from BA. The number of ways of arranging n objects in x places is M_x^n = n^x.
Example: Find all possible arrangements of two of the letters A, B, C, D when duplication is allowed and order is important. Here n = 4 and x = 2, so the number of arrangements is M_2^4 = 4² = 16. The result can also be obtained with a tree diagram.

2. Duplication is not permissible and order is important (permutation arrangement), i.e. the element AA is not permitted and AB is a different element from BA. The number of ways of permuting n objects in x places is P_x^n = n!/(n − x)!.

Example: Find all possible permutations of two of the letters A, B, C, D when duplication is not allowed and order is important. Here n = 4 and x = 2, so the number of permutations is P_2^4 = 4!/(4 − 2)! = 12.

3. Duplication is not permissible and order is not important (combination arrangement), i.e. the element AA is not permitted and AB is not a different element from BA. The number of combinations of n objects taken x at a time is C_x^n = n!/(x!(n − x)!).

Example: Find all possible combinations of two of the letters A, B, C, D when duplication is not allowed and order is not important. Here n = 4 and x = 2, so the number of combinations is C_2^4 = 4!/(2!(4 − 2)!) = 6.
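The three counting formulas can be checked by direct enumeration. The following is a small illustrative Python sketch (not part of the original notes; math.perm and math.comb assume Python 3.8 or later) reproducing the three examples above.

```python
import itertools
import math

letters = ["A", "B", "C", "D"]
n, x = len(letters), 2

# 1. Duplication allowed, order matters: n^x
multiple_choice = list(itertools.product(letters, repeat=x))
assert len(multiple_choice) == n ** x == 16

# 2. No duplication, order matters: n!/(n-x)!
permutations = list(itertools.permutations(letters, x))
assert len(permutations) == math.perm(n, x) == 12

# 3. No duplication, order does not matter: n!/(x!(n-x)!)
combinations = list(itertools.combinations(letters, x))
assert len(combinations) == math.comb(n, x) == 6
```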
1.1.2 Probability Definition and Properties

To define probability rigorously we need the following definition of an event space, say A, which is a collection of subsets of the sample space Ω. A is an event space if:

i) Ω ∈ A, i.e. the sample space belongs to the event space;
ii) if A ∈ A, then A^c ∈ A; and
iii) if A_1 ∈ A and A_2 ∈ A, then A_1 ∪ A_2 ∈ A.

Under these three conditions A is called an algebra of events, or simply an event space.

P[·], a set function with domain A and counterdomain the closed interval [0, 1], is a probability function, or simply a probability, if it satisfies the following conditions:

i) P[A] ≥ 0 for every A ∈ A;
ii) P[Ω] = 1; and
iii) P[∪_{i=1}^∞ A_i] = Σ_{i=1}^∞ P[A_i] for any sequence of mutually exclusive events A_1, A_2, ... (i.e. A_i ∩ A_j = ∅ for any i ≠ j) with A_1 ∪ A_2 ∪ ... = ∪_{i=1}^∞ A_i ∈ A.

Properties of Probability

1. P[∅] = 0.
2. If A_1, A_2, ..., A_n are mutually exclusive events, then P[∪_{i=1}^n A_i] = Σ_{i=1}^n P[A_i].
3. If A is an event in A, then P[A^c] = 1 − P[A].
4. For every two events A ∈ A and B ∈ A, P[A ∪ B] = P[A] + P[B] − P[AB]. More generally, for events A_1, A_2, ..., A_n ∈ A we have:
P[∪_{i=1}^n A_i] = Σ_i P[A_i] − Σ_{i<j} P[A_i A_j] + Σ_{i<j<k} P[A_i A_j A_k] − ... + (−1)^{n+1} P[A_1 A_2 ... A_n].
For n = 3 the above formula is:
P[A_1 ∪ A_2 ∪ A_3] = P[A_1] + P[A_2] + P[A_3] − P[A_1 A_2] − P[A_1 A_3] − P[A_2 A_3] + P[A_1 A_2 A_3].
Notice that if A and B are mutually exclusive, then P[A ∪ B] = P[A] + P[B].
5. If A ∈ A, B ∈ A, and A ⊂ B, then P[A] ≤ P[B].

With the use of Venn diagrams we can give an intuitive explanation of the above properties. The triplet (Ω, A, P[·]) is called a probability space.

1.1.3 Conditional Probability and Independence

Let A and B be two events in A and P[·] a probability function. The conditional probability of A given the event B, denoted by P[A|B], is defined by
P[A|B] = P[AB]/P[B] if P[B] > 0,
and is left undefined if P[B] = 0. From the above formula it is evident that
P[AB] = P[A|B]P[B] = P[B|A]P[A]
if both P[A] and P[B] are nonzero.

Notice that when speaking of conditional probabilities we are conditioning on some given event B; that is, we are assuming that the experiment has resulted in some outcome in B. B, in effect, becomes our "new" sample space. All probability properties of the previous section apply to conditional probabilities as well. However, there is an additional property (law), called the Law of Total Probability, which states: for a given probability space (Ω, A, P[·]), if B_1, B_2, ..., B_n is a collection of mutually exclusive events in A satisfying ∪_{i=1}^n B_i = Ω and P[B_i] > 0 for i = 1, 2, ..., n, then for every A ∈ A,
P[A] = Σ_{i=1}^n P[A|B_i]P[B_i].

Another important theorem in probability is the so-called Bayes' Theorem, which states: given a probability space (Ω, A, P[·]), if B_1, B_2, ..., B_n is a collection of mutually exclusive events in A satisfying ∪_{i=1}^n B_i = Ω and P[B_i] > 0 for i = 1, 2, ..., n, then for every A ∈ A for which P[A] > 0 we have
P[B_j|A] = P[A|B_j]P[B_j] / Σ_{i=1}^n P[A|B_i]P[B_i].

Notice that for events A and B ∈ A which satisfy P[A] > 0 and P[B] > 0 we have
P[B|A] = P[A|B]P[B] / (P[A|B]P[B] + P[A|B^c]P[B^c]).

Finally, the Multiplication Rule states: given a probability space (Ω, A, P[·]), if A_1, A_2, ..., A_n are events in A for which P[A_1 A_2 ... A_{n−1}] > 0, then
P[A_1 A_2 ... A_n] = P[A_1]P[A_2|A_1]P[A_3|A_1 A_2] ... P[A_n|A_1 A_2 ... A_{n−1}].

Example: There are five boxes, numbered 1 to 5. Each box contains 10 balls. Box i has i defective balls and 10 − i non-defective balls, i = 1, 2, ..., 5. Consider the following random experiment: first a box is selected at random, and then a ball is selected at random from the selected box. 1) What is the probability that a defective ball will be selected? 2) If we have already selected the ball and noted that it is defective, what is the probability that it came from box 5?

Let A denote the event that a defective ball is selected and B_i the event that box i is selected, i = 1, 2, ..., 5. Note that P[B_i] = 1/5 for i = 1, 2, ..., 5, and P[A|B_i] = i/10.

Question 1) asks for P[A]. Using the law of total probability we have
P[A] = Σ_{i=1}^5 P[A|B_i]P[B_i] = Σ_{i=1}^5 (i/10)(1/5) = 3/10.
Notice that the total number of defective balls is 15 out of 50. Hence in this case we can also say that P[A] = 15/50 = 3/10. This is true because the probability of choosing each of the 5 boxes is the same.

Question 2) asks for P[B_5|A]. Since box 5 contains more defective balls than box 4, which contains more defective balls than box 3 and so on, we expect to find that P[B_5|A] > P[B_4|A] > P[B_3|A] > P[B_2|A] > P[B_1|A]. We apply Bayes' theorem:
P[B_5|A] = P[A|B_5]P[B_5] / Σ_{i=1}^5 P[A|B_i]P[B_i] = (1/2)(1/5) / (3/10) = 1/3.
Similarly P[B_j|A] = P[A|B_j]P[B_j] / Σ_{i=1}^5 P[A|B_i]P[B_i] = (j/10)(1/5) / (3/10) = j/15 for j = 1, 2, ..., 5. Notice that unconditionally all the B_i's were equally likely.
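The box example lends itself to a quick numerical check. Below is a small illustrative Python sketch (not part of the original notes) that applies the law of total probability and Bayes' theorem to the five boxes; exact fractions are used so the results 3/10, 1/3 and j/15 can be read off directly.

```python
from fractions import Fraction

# Five boxes, box i holds i defective balls out of 10.
priors = {i: Fraction(1, 5) for i in range(1, 6)}         # P[B_i]
likelihoods = {i: Fraction(i, 10) for i in range(1, 6)}    # P[A | B_i]

# Law of total probability: P[A] = sum_i P[A|B_i] P[B_i]
p_defective = sum(likelihoods[i] * priors[i] for i in priors)
print(p_defective)                                         # 3/10

# Bayes' theorem: P[B_j | A]
posteriors = {i: likelihoods[i] * priors[i] / p_defective for i in priors}
print(posteriors[5])                                       # 1/3
print([posteriors[i] for i in range(1, 6)])                # j/15 for j = 1..5
```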
Let A and B be two events in A and P[·] a probability function. Events A and B are defined to be independent if and only if one of the following conditions is satisfied:
(i) P[AB] = P[A]P[B];
(ii) P[A|B] = P[A] if P[B] > 0; or
(iii) P[B|A] = P[B] if P[A] > 0.

Example: Consider tossing two dice. Let A denote the event of an odd total, B the event of an ace on the first die, and C the event of a total of seven. We ask the following: (i) Are A and B independent? (ii) Are A and C independent? (iii) Are B and C independent?
(i) P[A|B] = 1/2 and P[A] = 1/2, hence P[A|B] = P[A] and consequently A and B are independent.
(ii) P[A|C] = 1 ≠ P[A] = 1/2, hence A and C are not independent.
(iii) P[C|B] = 1/6 = P[C], hence B and C are independent.
Notice that although A and B are independent and C and B are independent, A and C are not independent.

Let us extend the independence of two events to several: for a given probability space (Ω, A, P[·]), let A_1, A_2, ..., A_n be n events in A. Events A_1, A_2, ..., A_n are defined to be independent if and only if
P[A_i A_j] = P[A_i]P[A_j] for i ≠ j,
P[A_i A_j A_k] = P[A_i]P[A_j]P[A_k] for i ≠ j, i ≠ k, k ≠ j,
and so on, up to
P[∩_{i=1}^n A_i] = Π_{i=1}^n P[A_i].

Notice that pairwise independence does not imply independence, as the following example shows.

Example: Consider tossing two dice. Let A_1 denote the event of an odd face on the first die, A_2 the event of an odd face on the second die, and A_3 the event of an odd total. Then we have:
P[A_1]P[A_2] = (1/2)(1/2) = 1/4 = P[A_1 A_2];
P[A_1]P[A_3] = 1/4 = P[A_3|A_1]P[A_1] = P[A_1 A_3]; and
P[A_2 A_3] = 1/4 = P[A_2]P[A_3];
hence A_1, A_2, A_3 are pairwise independent. However, notice that P[A_1 A_2 A_3] = 0 ≠ 1/8 = P[A_1]P[A_2]P[A_3]. Hence A_1, A_2, A_3 are not independent.

Notice that independence of two events A and B and the property that A and B are mutually exclusive are distinct, though related, properties. We know that if A and B are mutually exclusive then P[AB] = 0. Now if these events are also independent then P[AB] = P[A]P[B], and consequently P[A]P[B] = 0, which means that either P[A] = 0 or P[B] = 0. Hence two mutually exclusive events can be independent only if P[A] = 0 or P[B] = 0. On the other hand, if P[A] ≠ 0 and P[B] ≠ 0, then if A and B are independent they cannot be mutually exclusive, and conversely if they are mutually exclusive they cannot be independent.

Example: A plant has two machines. Machine A produces 60% of the total output with a fraction defective of 0.02. Machine B produces the rest of the output with a fraction defective of 0.04. If a single unit of output is observed to be defective, what is the probability that this unit was produced by machine A?
Let A be the event that the unit was produced by machine A, B the event that it was produced by machine B, and D the event that it is defective. We ask for P[A|D] = P[AD]/P[D]. Now P[AD] = P[D|A]P[A] = 0.02 · 0.6 = 0.012. Also P[D] = P[D|A]P[A] + P[D|B]P[B] = 0.012 + 0.04 · 0.4 = 0.028. Consequently P[A|D] = 0.012/0.028 ≈ 0.429, and P[B|D] = 1 − P[A|D] ≈ 0.571. We can also use a tree diagram to evaluate P[AD] and P[BD].

Example: A marketing manager believes the market demand potential of a new product to be high with probability 0.30, average with probability 0.50, or low with probability 0.20. From a sample of 20 employees, 14 indicated a very favorable reception to the new product. In the past such an employee response (14 out of 20 favorable) has occurred with the following probabilities: if the actual demand is high, the probability of a favorable reception is 0.80; if the actual demand is average, the probability of a favorable reception is 0.55; and if the actual demand is low, the probability of a favorable reception is 0.30. Given a favorable reception, what is the probability of actual high demand?
What we ask for is P[H|F] = P[HF]/P[F]. Now P[F] = P[H]P[F|H] + P[A]P[F|A] + P[L]P[F|L] = 0.24 + 0.275 + 0.06 = 0.575. Also P[HF] = P[F|H]P[H] = 0.24. Hence P[H|F] = 0.24/0.575 = 0.4174.
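Pairwise versus joint independence can also be verified by enumerating the 36 equally likely outcomes of the two dice. The sketch below is illustrative only (not part of the original notes); the event names A1, A2, A3 follow the example above.

```python
from fractions import Fraction
from itertools import product

# Enumerate the 36 equally likely outcomes of two dice and check
# pairwise independence versus joint independence of A1, A2, A3.
omega = list(product(range(1, 7), repeat=2))

def prob(event):
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

A1 = lambda w: w[0] % 2 == 1             # odd face on the first die
A2 = lambda w: w[1] % 2 == 1             # odd face on the second die
A3 = lambda w: (w[0] + w[1]) % 2 == 1    # odd total

both = lambda e1, e2: (lambda w: e1(w) and e2(w))

print(prob(both(A1, A2)) == prob(A1) * prob(A2))   # True
print(prob(both(A1, A3)) == prob(A1) * prob(A3))   # True
print(prob(both(A2, A3)) == prob(A2) * prob(A3))   # True

all3 = lambda w: A1(w) and A2(w) and A3(w)
print(prob(all3), prob(A1) * prob(A2) * prob(A3))  # 0 vs 1/8: not jointly independent
```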
1.2 Discrete and Continuous Random Variables

1.2.1 Random Variable and Cumulative Distribution Function

For a given probability space (Ω, A, P[·]), a random variable, denoted by X or X(·), is a function with domain Ω and counterdomain the real line. The function X(·) must be such that the set A_r, defined by A_r = {ω : X(ω) ≤ r}, belongs to A for every real number r.

The important part of the definition is that, in terms of a random experiment, Ω is the totality of outcomes of that random experiment, and the function, or random variable, X(·) with domain Ω makes some real number correspond to each outcome of the experiment. The fact that we also require the collection of ω's for which X(ω) ≤ r to be an event (i.e. an element of A) for each real number r is not much of a restriction, since in our case random variables are used only to describe events.

Example: Consider the experiment of tossing a single coin. Let the random variable X denote the number of heads. In this case Ω = {head, tail}, and X(ω) = 1 if ω = head and X(ω) = 0 if ω = tail. So the random variable X associates a real number with each outcome of the experiment. To show that X satisfies the definition we should show that {ω : X(ω) ≤ r} belongs to A for every real number r, where A = {∅, {head}, {tail}, Ω}. Now if r < 0 then {ω : X(ω) ≤ r} = ∅; if 0 ≤ r < 1 then {ω : X(ω) ≤ r} = {tail}; and if r ≥ 1 then {ω : X(ω) ≤ r} = {head, tail} = Ω. Hence, for each r the set {ω : X(ω) ≤ r} belongs to A and consequently X(·) is a random variable.

In the above example the random variable is described in terms of the random experiment, as opposed to its functional form, which is the usual case.

The cumulative distribution function of a random variable X, denoted by F_X(·), is defined to be the function with domain the real line and counterdomain the interval [0, 1] which satisfies F_X(x) = P[X ≤ x] = P[{ω : X(ω) ≤ x}] for every real number x.

A cumulative distribution function is uniquely defined for each random variable. If it is known, it can be used to find probabilities of events defined in terms of its corresponding random variable. Notice that it is the requirement, built into the definition of a random variable, that {ω : X(ω) ≤ r} belong to A for every real r which makes this possible. Notice also that each of the three words in the expression "cumulative distribution function" is justifiable.

Example: Consider again the experiment of tossing a single coin. Assuming that the coin is fair, let X denote the number of heads. Then:
F_X(x) = 0 if x < 0; F_X(x) = 1/2 if 0 ≤ x < 1; and F_X(x) = 1 if x ≥ 1.

A cumulative distribution function has the following properties:
i) F_X(−∞) = lim_{x→−∞} F_X(x) = 0 and F_X(+∞) = lim_{x→+∞} F_X(x) = 1;
ii) F_X(·) is a monotone, nondecreasing function, i.e. if a < b then F_X(a) ≤ F_X(b);
iii) F_X(·) is continuous from the right, i.e. lim_{h→0+} F_X(x + h) = F_X(x).

Conversely, any function with domain the real line and counterdomain the interval [0, 1] satisfying the above three properties can be called a cumulative distribution function.

Now we can define discrete and continuous random variables.

1.2.2 Discrete Random Variable

A random variable X is defined to be discrete if the range of X is countable. If a random variable X is discrete, then its corresponding cumulative distribution function F_X(·) is defined to be discrete.
By the range of X being countable we mean that there exists a finite or denumerable set of real numbers, say x_1, x_2, ..., x_n, ..., such that X takes on values only in that set. If X is discrete with distinct values x_1, x_2, ..., x_n, ..., then Ω = ∪_n {ω : X(ω) = x_n}, and {X = x_i} ∩ {X = x_j} = ∅ for i ≠ j. Hence 1 = P[Ω] = Σ_n P[X = x_n] by the third axiom of probability.

If X is a discrete random variable with distinct values x_1, x_2, ..., x_n, ..., then the function f_X(·) defined by
f_X(x) = P[X = x] if x = x_j, j = 1, 2, ..., n, ..., and f_X(x) = 0 if x ≠ x_j,
is defined to be the discrete density function of X.

Notice that the discrete density function tells us how likely or probable each value of a discrete random variable is. It also enables one to calculate the probability of events described in terms of the discrete random variable. Also notice that for any discrete random variable X, F_X(·) can be obtained from f_X(·), and vice versa.

Example: Consider the experiment of tossing a single die. Let X denote the number of spots on the upper face. Then X takes any value in the set {1, 2, 3, 4, 5, 6}, so X is a discrete random variable. The density function of X is f_X(x) = P[X = x] = 1/6 for any x ∈ {1, 2, 3, 4, 5, 6} and 0 otherwise. The cumulative distribution function of X is F_X(x) = P[X ≤ x] = Σ_{n=1}^{[x]} P[X = n], where [x] denotes the integer part of x. Notice that x can be any real number; however, the points of interest are the elements of {1, 2, 3, 4, 5, 6}. Notice also that in this case Ω = {1, 2, 3, 4, 5, 6} as well, and we do not need any reference to A.

Example: Consider the experiment of tossing two dice. Let X denote the total of the upturned faces. Then Ω = {(1,1), (1,2), ..., (1,6), (2,1), (2,2), ..., (2,6), (3,1), ..., (6,6)}, a total of (using the multiplication rule) 36 = 6² elements. X takes values in the set {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}. The density function is:
f_X(x) = P[X = x] = 1/36 for x = 2 or x = 12; 2/36 for x = 3 or x = 11; 3/36 for x = 4 or x = 10; 4/36 for x = 5 or x = 9; 5/36 for x = 6 or x = 8; 6/36 for x = 7; and 0 for any other x.
The cumulative distribution function is F_X(x) = P[X ≤ x] = Σ_{n=1}^{[x]} P[X = n], i.e.:
0 for x < 2; 1/36 for 2 ≤ x < 3; 3/36 for 3 ≤ x < 4; 6/36 for 4 ≤ x < 5; 10/36 for 5 ≤ x < 6; ...; 35/36 for 11 ≤ x < 12; and 1 for x ≥ 12.
Notice that, again, we do not need any reference to A.

In fact we can speak of discrete density functions without reference to a random variable at all. Any function f(·) with domain the real line and counterdomain [0, 1] is defined to be a discrete density function if, for some countable set x_1, x_2, ..., x_n, ..., it has the following properties:
i) f(x_j) > 0 for j = 1, 2, ...;
ii) f(x) = 0 for x ≠ x_j, j = 1, 2, ...; and
iii) Σ f(x_j) = 1, where the summation is over the points x_1, x_2, ..., x_n, ....
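The density and cumulative distribution of the two-dice total can be generated by enumeration rather than read from the table above. The following illustrative Python sketch (not part of the original notes) does exactly that.

```python
from collections import Counter
from fractions import Fraction

# Density of X = total of two fair dice, obtained by enumerating the 36 outcomes.
outcomes = [(i, j) for i in range(1, 7) for j in range(1, 7)]
counts = Counter(i + j for i, j in outcomes)
f = {x: Fraction(c, 36) for x, c in counts.items()}        # f_X(x) = P[X = x]
print(f[7])                                                 # 6/36 = 1/6

# Cumulative distribution F_X(x) = P[X <= x] at the points of interest.
F = {x: sum(f[k] for k in f if k <= x) for x in range(2, 13)}
print(F[4])                                                 # 6/36 = 1/6
```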
1.2.3 Continuous Random Variable

A random variable X is called continuous if there exists a function f_X(·) such that F_X(x) = ∫_{−∞}^x f_X(u) du for every real number x. In such a case F_X(x) is the cumulative distribution and the function f_X(·) is the density function. Notice that according to the above definition the density function is not uniquely determined: if the function changes value at a few points, its integral is unchanged. Furthermore, notice that f_X(x) = dF_X(x)/dx.

The notations for discrete and continuous density functions are the same, yet they have different interpretations. We know that for discrete random variables f_X(x) = P[X = x], which is not true for continuous random variables. Furthermore, for discrete random variables f_X(·) is a function with domain the real line and counterdomain the interval [0, 1], whereas for continuous random variables f_X(·) is a function with domain the real line and counterdomain the interval [0, ∞).

Example: Let X be the random variable representing the length of a telephone conversation. One could model this experiment by assuming that the distribution of X is given by F_X(x) = 1 − e^{−λx}, where λ is some positive number and the random variable takes values only in the interval [0, ∞). The density function is f_X(x) = dF_X(x)/dx = λe^{−λx}. If we assume that telephone conversations are measured in minutes,
P[5 < X ≤ 10] = ∫_5^{10} f_X(x) dx = ∫_5^{10} λe^{−λx} dx = e^{−5λ} − e^{−10λ},
and for λ = 1/5 we have P[5 < X ≤ 10] = e^{−1} − e^{−2} ≈ 0.23.

The example above indicates that the density functions of continuous random variables are used to calculate probabilities of events defined in terms of the corresponding continuous random variable X, i.e. P[a < X ≤ b] = ∫_a^b f_X(x) dx. Again we can give the definition of the density function without any reference to a random variable: any function f(·) with domain the real line and counterdomain [0, ∞) is defined to be a probability density function if
(i) f(x) ≥ 0 for all x, and
(ii) ∫_{−∞}^{∞} f(x) dx = 1.

In practice, when we refer to a certain distribution of a random variable, we state its density or cumulative distribution function. Notice, however, that not all random variables are either discrete or continuous.

1.3 Expectations and Moments of Random Variables

An extremely useful concept in problems involving random variables or distributions is that of expectation.

1.3.1 Mean

Let X be a random variable. The mean or expected value of X, denoted by E[X] or μ_X, is defined by:
(i) E[X] = Σ x_j P[X = x_j] = Σ x_j f_X(x_j) if X is a discrete random variable with counterdomain the countable set {x_1, ..., x_j, ...};
(ii) E[X] = ∫_{−∞}^{∞} x f_X(x) dx if X is a continuous random variable with density function f_X(x); and
(iii) E[X] = ∫_0^{∞} [1 − F_X(x)] dx − ∫_{−∞}^0 F_X(x) dx for an arbitrary random variable X.

The first two definitions are used in practice to find the mean of discrete and continuous random variables, respectively. The third one is used for the mean of a random variable that is neither discrete nor continuous.

Notice that in the above definition we assume that the sums and the integrals exist. The summation in (i) runs over the possible values of j, and the j-th term is the value of the random variable multiplied by the probability that the random variable takes this value. Hence E[X] is an average of the values that the random variable takes on, where each value is weighted by the probability that the random variable takes this value; values that are more probable receive more weight. The same is true of the integral form in (ii): there the value x is multiplied by the approximate probability that X equals the value x, i.e. f_X(x) dx, and then integrated over all values.

Notice that in the definition of the mean of a random variable only density functions or cumulative distributions were used. Hence we have really defined the mean for these functions without reference to random variables. We then call the defined mean the mean of the cumulative distribution or of the appropriate density function.
Hence we can speak of the mean of a distribution or density function as well as the mean of a random variable.

Notice that E[X] is the center of gravity (or centroid) of the unit mass that is determined by the density function of X. So the mean of X is a measure of where the values of the random variable are centered or located, i.e. it is a measure of central location or central tendency.

Example: Consider the experiment of tossing two dice, and let X denote the total of the upturned faces. Then E[X] = Σ_{i=2}^{12} i f_X(i) = 7.

Example: Consider an X that can take only two possible values, 1 and −1, each with probability 0.5. Then the mean of X is E[X] = 1 · 0.5 + (−1) · 0.5 = 0. Notice that the mean in this case is not one of the possible values of X.

Example: Consider a continuous random variable X with density function f_X(x) = λe^{−λx} for x ∈ [0, ∞). Then E[X] = ∫_{−∞}^{∞} x f_X(x) dx = ∫_0^{∞} x λe^{−λx} dx = 1/λ.

Example: Consider a continuous random variable X with density function f_X(x) = x^{−2} for x ∈ [1, ∞). Then E[X] = ∫_{−∞}^{∞} x f_X(x) dx = ∫_1^{∞} x · x^{−2} dx = lim_{b→∞} log b = ∞, so we say that the mean does not exist, or that it is infinite.

1.3.2 Variance

Let X be a random variable and let μ_X be E[X]. The variance of X, denoted by σ_X² or var[X], is defined by:
(i) var[X] = Σ (x_j − μ_X)² P[X = x_j] = Σ (x_j − μ_X)² f_X(x_j) if X is a discrete random variable with counterdomain the countable set {x_1, ..., x_j, ...};
(ii) var[X] = ∫_{−∞}^{∞} (x − μ_X)² f_X(x) dx if X is a continuous random variable with density function f_X(x); and
(iii) var[X] = ∫_0^{∞} 2x[1 − F_X(x) + F_X(−x)] dx − μ_X² for an arbitrary random variable X.

The variances are defined only if the series in (i) is convergent or if the integrals in (ii) or (iii) exist. Again, the variance of a random variable is defined in terms of the density function or cumulative distribution function of the random variable, and consequently variance can be defined in terms of these functions without reference to a random variable.

Notice that variance is a measure of spread, since if the values of the random variable X tend to be far from their mean, the variance of X will be larger than the variance of a comparable random variable whose values tend to be near their mean. It is clear from (i), (ii) and (iii) that the variance is a nonnegative number.

If X is a random variable with variance σ_X², then the standard deviation of X, denoted by σ_X, is defined as √var[X]. The standard deviation of a random variable, like the variance, is a measure of spread or dispersion of the values of the random variable. In many applications it is preferable to the variance since it has the same measurement units as the random variable itself. In finance the standard deviation is a measure of risk, although there are other measures as well, e.g. the semi-standard deviation.

Example: Consider the experiment of tossing two dice, and let X denote the total of the upturned faces. Then (with μ_X = 7) var[X] = Σ_{i=2}^{12} (i − μ_X)² f_X(i) = 210/36.

Example: Consider an X that can take only two possible values, 1 and −1, each with probability 0.5. Then the variance of X is (with μ_X = 0) var[X] = 0.5 · 1² + 0.5 · (−1)² = 1.

Example: Consider an X that can take only two possible values, 10 and −10, each with probability 0.5. Then we have μ_X = E[X] = 10 · 0.5 + (−10) · 0.5 = 0, and var[X] = 0.5 · 10² + 0.5 · (−10)² = 100.
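As a rough numerical check of the mean and variance examples above, the following illustrative Python sketch (not part of the original notes) computes the mean and variance of a discrete density given as a dictionary of value-probability pairs.

```python
from fractions import Fraction

def mean_var(density):
    """Mean and variance of a discrete density given as {value: probability}."""
    mu = sum(x * p for x, p in density.items())
    var = sum((x - mu) ** 2 * p for x, p in density.items())
    return mu, var

dice = {x: Fraction(6 - abs(x - 7), 36) for x in range(2, 13)}   # total of two dice
print(mean_var(dice))                                   # mean 7, variance 210/36 = 35/6

print(mean_var({1: Fraction(1, 2), -1: Fraction(1, 2)}))     # mean 0, variance 1
print(mean_var({10: Fraction(1, 2), -10: Fraction(1, 2)}))   # mean 0, variance 100
```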
Notice that in the second and third examples the two random variables have the same mean but different variances, the larger variance belonging to the random variable whose values lie further from the mean.

Example: Consider a continuous random variable X with density function f_X(x) = λe^{−λx} for x ∈ [0, ∞). Then (with μ_X = 1/λ) var[X] = ∫_{−∞}^{∞} (x − μ_X)² f_X(x) dx = ∫_0^{∞} (x − 1/λ)² λe^{−λx} dx = 1/λ².

Example: Consider a continuous random variable X with density function f_X(x) = x^{−2} for x ∈ [1, ∞). We know that the mean of X does not exist; consequently, we cannot define the variance.

1.3.3 Expected Value of a Function of a Random Variable

Let X be a random variable and g(·) a function with domain and counterdomain the real line. The expectation or expected value of the function g(·) of the random variable X, denoted by E[g(X)], is defined by:
(i) E[g(X)] = Σ g(x_j) P[X = x_j] = Σ g(x_j) f_X(x_j) if X is a discrete random variable with counterdomain the countable set {x_1, ..., x_j, ...}; and
(ii) E[g(X)] = ∫_{−∞}^{∞} g(x) f_X(x) dx if X is a continuous random variable with density function f_X(x).

Properties of expected value:
(i) E[c] = c for all constants c.
(ii) E[c g(X)] = c E[g(X)] for a constant c.
(iii) E[c_1 g_1(X) + c_2 g_2(X)] = c_1 E[g_1(X)] + c_2 E[g_2(X)].
(iv) E[g_1(X)] ≤ E[g_2(X)] if g_1(x) ≤ g_2(x) for all x.

From the above properties we can prove two important theorems.

Theorem 1. For any random variable X, var[X] = E[X²] − (E[X])².

For the second theorem we shall need the following definition of a convex function. A continuous function g(·) with domain and counterdomain the real line is called convex if for any x_0 on the real line there exists a line which goes through the point (x_0, g(x_0)) and lies on or under the graph of the function g(·). Also, if g''(x_0) ≥ 0 for all x_0, then g(·) is convex.

Theorem 2 (Jensen's inequality). Let X be a random variable with mean E[X], and let g(·) be a convex function. Then E[g(X)] ≥ g(E[X]).

We can also use these properties to find the expected return and variance (standard deviation) of a portfolio of assets. We shall need the following definitions.

Let X and Y be any two random variables defined on the same probability space. The covariance of X and Y, denoted by cov[X, Y] or σ_{X,Y}, is defined as cov[X, Y] = E[(X − μ_X)(Y − μ_Y)], provided that the indicated expectation exists.

The correlation coefficient, or simply the correlation, denoted by ρ[X, Y] or ρ_{X,Y}, of random variables X and Y is defined to be ρ_{X,Y} = cov[X, Y]/(σ_X σ_Y), provided that cov[X, Y], σ_X, and σ_Y exist, and σ_X > 0 and σ_Y > 0.

Both the covariance and the correlation of random variables X and Y are measures of a linear relationship of X and Y in the following sense: cov[X, Y] will be positive when (X − μ_X) and (Y − μ_Y) tend to have the same sign with high probability, and cov[X, Y] will be negative when (X − μ_X) and (Y − μ_Y) tend to have opposite signs with high probability. The actual magnitude of cov[X, Y] does not say much about how strong the linear relationship between X and Y is, because the variability of X and Y also matters. The correlation coefficient does not have this problem, as we divide the covariance by the product of the standard deviations. Furthermore, the correlation is unitless and −1 ≤ ρ_{X,Y} ≤ 1. We can prove that cov[X, Y] = E[(X − μ_X)(Y − μ_Y)] = E[XY] − μ_X μ_Y.

These properties are very useful for evaluating the expected return and standard deviation of a portfolio.
Assume r_a and r_b are the returns on assets A and B, and their variances are σ_a² and σ_b², respectively. Assume that we form a portfolio of the two assets with weights w_a and w_b, respectively. If the correlation of the returns of these assets is ρ, find the expected return and standard deviation of the portfolio.

If R_p is the return of the portfolio, then R_p = w_a r_a + w_b r_b. The expected portfolio return is E[R_p] = w_a E[r_a] + w_b E[r_b].

The variance of the portfolio is
var[R_p] = var[w_a r_a + w_b r_b] = E[(w_a r_a + w_b r_b)²] − (E[w_a r_a + w_b r_b])²
= w_a² E[r_a²] + w_b² E[r_b²] + 2 w_a w_b E[r_a r_b] − w_a² (E[r_a])² − w_b² (E[r_b])² − 2 w_a w_b E[r_a]E[r_b]
= w_a² {E[r_a²] − (E[r_a])²} + w_b² {E[r_b²] − (E[r_b])²} + 2 w_a w_b {E[r_a r_b] − E[r_a]E[r_b]}
= w_a² var[r_a] + w_b² var[r_b] + 2 w_a w_b cov[r_a, r_b]
= w_a² σ_a² + w_b² σ_b² + 2 w_a w_b ρ σ_a σ_b.

In a vector format we have
E[R_p] = (w_a  w_b) (E[r_a]  E[r_b])'
and
var[R_p] = (w_a  w_b) [ σ_a²      ρσ_aσ_b ]
                      [ ρσ_aσ_b   σ_b²    ] (w_a  w_b)'.

From the above example we can see that
var[aX + bY] = a² var[X] + b² var[Y] + 2ab cov[X, Y]
for random variables X and Y and constants a and b. In fact we can generalize the formula above to several random variables X_1, X_2, ..., X_n and constants a_1, a_2, ..., a_n, i.e.
var[a_1 X_1 + a_2 X_2 + ... + a_n X_n] = Σ_{i=1}^n a_i² var[X_i] + 2 Σ_{i<j} a_i a_j cov[X_i, X_j].
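As a numerical illustration of the portfolio formulas, the sketch below computes E[R_p] and the portfolio standard deviation for two hypothetical assets; all return, risk, correlation and weight figures are made up for the example and are not taken from the notes.

```python
import math

# Illustrative two-asset portfolio (all numbers are made up for the example):
mu_a, mu_b = 0.08, 0.12          # E[r_a], E[r_b]
sd_a, sd_b = 0.15, 0.25          # sigma_a, sigma_b
rho = 0.3                        # correlation of the two returns
w_a, w_b = 0.6, 0.4              # portfolio weights

cov_ab = rho * sd_a * sd_b

expected_return = w_a * mu_a + w_b * mu_b
variance = w_a**2 * sd_a**2 + w_b**2 * sd_b**2 + 2 * w_a * w_b * cov_ab
std_dev = math.sqrt(variance)

print(round(expected_return, 4))   # 0.096
print(round(std_dev, 4))           # portfolio risk, below w_a*sd_a + w_b*sd_b when rho < 1
```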
1.3.4 Moments of a Random Variable

If X is a random variable, the r-th raw moment of X, denoted by μ'_r, is defined as μ'_r = E[X^r], if this expectation exists. Notice that μ'_1 = E[X] = μ_X, the mean of X.

If X is a random variable, the r-th central moment of X about a is defined as E[(X − a)^r]. If a = μ_X, we have the r-th central moment of X about μ_X, denoted by μ_r, which is μ_r = E[(X − μ_X)^r].

We also define measures in terms of quantiles to describe some of the characteristics of random variables or density functions. The q-th quantile of a random variable X, or of its corresponding distribution, is denoted by ξ_q and is defined as the smallest number ξ satisfying F_X(ξ) ≥ q. If X is a continuous random variable, then the q-th quantile of X is given as the smallest number ξ satisfying F_X(ξ) = q.

The median of a random variable X, denoted by med X or med(X), or ξ_0.5, is the 0.5-th quantile. Notice that if X is a continuous random variable the median of X satisfies
∫_{−∞}^{med(X)} f_X(x) dx = 1/2 = ∫_{med(X)}^{∞} f_X(x) dx,
so the median of X is any number that has half the mass of X to its right and the other half to its left. The median and the mean are measures of central location.

The third moment about the mean, μ_3, is called a measure of asymmetry, or skewness. Symmetrical distributions can be shown to have μ_3 = 0. Distributions can be skewed to the left or to the right. However, knowledge of the third moment gives no clue as to the shape of the distribution: it could be the case that μ_3 = 0 but the distribution is far from symmetrical. The ratio μ_3/σ³ is unitless and is called the coefficient of skewness. An alternative measure of skewness is provided by the ratio (mean − median)/(standard deviation).

The fourth moment about the mean is used as a measure of kurtosis, which is a degree of flatness of a density near its center. The coefficient of kurtosis is defined as μ_4/σ⁴ − 3; positive values are sometimes used to indicate that a density function is more peaked around its center than the normal (leptokurtic distributions), while a negative value of the coefficient of kurtosis indicates a distribution which is flatter around its center than the standard normal (platykurtic distributions). This measure suffers from the same failing as the measure of skewness, i.e. it does not always measure what it is supposed to.

While a particular moment or a few of the moments may give little information about a distribution, the entire set of moments will determine the distribution exactly. In applied statistics the first two moments are of great importance, but the third and fourth are also useful.

2 Parametric Families of Univariate Distributions

A parametric family of density functions is a collection of density functions indexed by a quantity called a parameter. For example, let f(x; λ) = λe^{−λx} for x > 0 and some λ > 0. Here λ is the parameter, and as λ ranges over the positive numbers the collection {f(·; λ) : λ > 0} is a parametric family of density functions.

2.1 Discrete Univariate Distributions

Let us start with discrete univariate distributions.

2.1.1 Bernoulli Distribution

A random variable whose outcomes have been classified into two categories, called "success" and "failure" and represented by the letters s and f, respectively, is called a Bernoulli trial. If a random variable X is defined as 1 if a Bernoulli trial results in success and 0 if the same Bernoulli trial results in failure, then X has a Bernoulli distribution with parameter p = P[success]. The definition of this distribution is:

A random variable X has a Bernoulli distribution if the discrete density of X is given by
f_X(x) = f_X(x; p) = p^x (1 − p)^{1−x} for x = 0, 1, and 0 otherwise,
where p = P[X = 1]. For the above defined random variable X we have E[X] = p and var[X] = p(1 − p).

2.1.2 Binomial Distribution

Consider a random experiment consisting of n repeated independent Bernoulli trials with p the probability of success at each individual trial. Let the random variable X represent the number of successes in the n repeated trials. Then X follows a binomial distribution. The definition of this distribution is:

A random variable X has a binomial distribution, X ~ Binomial(n, p), if the discrete density of X is given by
f_X(x) = f_X(x; n, p) = C_x^n p^x (1 − p)^{n−x} for x = 0, 1, ..., n, and 0 otherwise,
where p is the probability of success in each independent Bernoulli trial and n is the total number of trials. For this random variable we have E[X] = np and var[X] = np(1 − p).

Example: Consider a stock with value S = 50. Each period the stock moves up or down, independently, in discrete steps of 5. The probability of going up is p = 0.7 and of going down 1 − p = 0.3. What are the expected value and the variance of the value of the stock after 3 periods?

Call X the random variable that counts successes, where a success is the stock moving up and a failure is the stock moving down. Then P[X = success] = P[X = 1] = 0.7, and X ~ Binomial(3, p). Now X can take the values 0, 1, 2, 3, i.e. no success, 1 success and 2 failures, etc. The value of the stock in each case and the corresponding probabilities are:
S = 35, and f_X(0) = C_0^3 p^0 (1 − p)³ = 1 · 0.3³ = 0.027;
S = 45, and f_X(1) = C_1^3 p¹ (1 − p)² = 3 · 0.7 · 0.3² = 0.189;
S = 55, and f_X(2) = C_2^3 p² (1 − p)¹ = 3 · 0.7² · 0.3 = 0.441;
S = 65, and f_X(3) = C_3^3 p³ (1 − p)^0 = 1 · 0.7³ = 0.343.
Hence the expected stock value is E[S] = 35 · 0.027 + 45 · 0.189 + 55 · 0.441 + 65 · 0.343 = 56, and
var[S] = (35 − 56)² · 0.027 + (45 − 56)² · 0.189 + (55 − 56)² · 0.441 + (65 − 56)² · 0.343 = 63.
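The stock example can be reproduced directly from the binomial density, since the stock value after 3 periods is S = 35 + 10X, where X is the number of up-moves. The sketch below is illustrative only; it recovers E[S] = 56 and var[S] = 63 both by enumeration and through E[X] = np, var[X] = np(1 − p).

```python
from math import comb

p, steps, start, step_size = 0.7, 3, 50, 5

# Binomial density of the number of up-moves X in 3 periods,
# and the implied stock value S = 50 + 5*(X - (3 - X)) = 35 + 10*X.
density = {x: comb(steps, x) * p**x * (1 - p)**(steps - x) for x in range(steps + 1)}
values = {x: start + step_size * (x - (steps - x)) for x in range(steps + 1)}

e_s = sum(values[x] * density[x] for x in density)
var_s = sum((values[x] - e_s) ** 2 * density[x] for x in density)
print(round(e_s, 6), round(var_s, 6))            # 56.0 and 63.0

# Same answers through E[X] = np and var[S] = 10^2 * np(1-p):
print(35 + 10 * steps * p, 100 * steps * p * (1 - p))
```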
2.1.3 Hypergeometric Distribution

Let X denote the number of defective balls in a sample of size n when sampling is done without replacement from a box containing M balls, of which K are defective. Then X has a hypergeometric distribution. The definition of this distribution is:

A random variable X has a hypergeometric distribution if the discrete density of X is given by
f_X(x) = f_X(x; M, K, n) = C_x^K C_{n−x}^{M−K} / C_n^M for x = 0, 1, ..., n, and 0 otherwise,
where M is a positive integer, K is a nonnegative integer that is at most M, and n is a positive integer that is at most M. For this distribution we have
E[X] = n K/M and var[X] = n (K/M) ((M − K)/M) ((M − n)/(M − 1)).

Notice the difference between the binomial and the hypergeometric: for the binomial distribution we have Bernoulli trials, i.e. independent trials with fixed probability of success or failure, whereas in the hypergeometric the probability of success or failure in each trial changes depending on the results of the previous draws.

2.1.4 Poisson Distribution

A random variable X has a Poisson distribution, X ~ Poisson(λ), if the discrete density of X is given by
f_X(x) = f_X(x; λ) = λ^x e^{−λ} / x! for x = 0, 1, ..., and 0 otherwise,
where λ is a parameter satisfying λ > 0. For the Poisson distribution we have E[X] = λ and var[X] = λ.

The Poisson distribution provides a realistic model for many random phenomena. Since the values of a Poisson random variable are nonnegative integers, any random phenomenon for which a count of some sort is of interest is a candidate for modeling by a Poisson distribution. Such a count might be the number of fatal traffic accidents per week in a given place, the number of telephone calls per hour arriving at the switchboard of a company, the number of pieces of information arriving per hour, etc.

Example: It is known that the average number of daily changes in excess of 1%, for a specific stock index, occurring in each six-month period is 5. What is the probability of having one such change within the next 6 months? What is the probability of at least 3 changes within the same period?

We model the number of changes in excess of 1%, X, within the next 6 months as a Poisson random variable. We know that E[X] = λ = 5. Hence f_X(x) = e^{−5} 5^x / x! for x = 0, 1, 2, .... Then P[X = 1] = f_X(1) = e^{−5} 5¹/1! = 0.0337. Also
P[X ≥ 3] = 1 − P[X < 3] = 1 − P[X = 0] − P[X = 1] − P[X = 2] = 1 − e^{−5} 5^0/0! − e^{−5} 5¹/1! − e^{−5} 5²/2! = 0.875.

It is worth noticing that the Binomial(n, p) distribution can be approximated by a Poisson(np). The approximation improves as n → ∞ and p → 0 in such a way that np remains constant.
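The index-change example can be checked numerically from the Poisson density. The following illustrative Python sketch (not part of the original notes) evaluates P[X = 1] and P[X ≥ 3] for λ = 5.

```python
import math

lam = 5.0   # expected number of >1% daily moves per six-month period

def poisson_pmf(x, lam):
    return lam**x * math.exp(-lam) / math.factorial(x)

p_one = poisson_pmf(1, lam)
p_at_least_three = 1 - sum(poisson_pmf(x, lam) for x in range(3))

print(round(p_one, 4))             # about 0.0337
print(round(p_at_least_three, 3))  # about 0.875
```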
2.2 Geometric Distribution

Consider a sequence of independent Bernoulli trials with p equal to the probability of success on an individual trial. Let the random variable X represent the number of trials required before the first success. Then X has a geometric distribution. The definition of this distribution is:

A random variable X has a geometric distribution, X ~ Geometric(p), if the discrete density of X is given by
f_X(x) = f_X(x; p) = p(1 − p)^x for x = 0, 1, 2, ..., and 0 otherwise,
where p is the probability of success in each Bernoulli trial. For this distribution we have E[X] = (1 − p)/p and var[X] = (1 − p)/p².

2.3 Continuous Univariate Distributions

2.3.1 Uniform Distribution

A very simple distribution for a continuous random variable is the uniform distribution. Its density function is
f_X(x) = f_X(x; a, b) = 1/(b − a) for a ≤ x ≤ b,
where −∞ < a < b < ∞. The random variable X is then defined to be uniformly distributed over the interval [a, b]. Now if X is uniformly distributed over [a, b], then
E[X] = (a + b)/2 and var[X] = (b − a)²/12.
Notice that if a random variable is uniformly distributed over one of the intervals [a, b), (a, b], or (a, b), the density function, expected value and variance do not change.

2.3.2 Exponential Distribution

If a random variable X has density function
f_X(x) = f_X(x; λ) = λe^{−λx} for 0 ≤ x < ∞,
where λ > 0, then X is defined to have a (negative) exponential distribution. For this random variable X we have E[X] = 1/λ and var[X] = 1/λ².

2.3.3 Pareto-Levy or Stable Distributions

The stable distributions are a natural generalization of the normal in that, as their name suggests, they are stable under addition, i.e. a sum of stable random variables is also a random variable of the same type. However, nonnormal stable distributions have more probability mass in the tail areas than the normal. In fact, the nonnormal stable distributions are so fat-tailed that their variance and all higher moments are infinite. Closed-form expressions for the density functions of stable random variables are available only for the normal and Cauchy cases.

If a random variable X has density function
f_X(x) = f_X(x; α, β) = (1/π) β / (β² + (x − α)²) for −∞ < x < ∞,
where −∞ < α < ∞ and 0 < β < ∞, then X is defined to have a Cauchy distribution. Notice that for this random variable even the mean is infinite.

2.3.4 Normal Distribution

A random variable X is defined to be normally distributed with parameters μ and σ², denoted by X ~ N(μ, σ²), if its density function is given by
f_X(x) = f_X(x; μ, σ²) = (1/√(2πσ²)) e^{−(x−μ)²/(2σ²)} for −∞ < x < ∞,
where μ and σ² are parameters such that −∞ < μ < ∞ and σ² > 0. Any distribution defined as above is called a normal distribution. Now if X ~ N(μ, σ²), then E[X] = μ and var[X] = σ².

If a normal random variable has mean 0 and variance 1, then this random variable is called a standard normal random variable. In such a case the standard normal density function is denoted by φ(x), i.e.
φ(x) = (1/√(2π)) e^{−x²/2}.
The cumulative distribution of X when it is normally distributed is given by
F_X(x) = F_X(x; μ, σ²) = ∫_{−∞}^x (1/√(2πσ²)) e^{−(u−μ)²/(2σ²)} du.
Notice that the cumulative distribution of the standard normal, denoted by Φ(x), is given by
Φ(x) = ∫_{−∞}^x (1/√(2π)) e^{−u²/2} du.

Notice that if X ~ N(μ, σ²), then the random variable Y = (X − μ)/σ is distributed as standard normal, i.e. Y ~ N(0, 1). This is called the standardization of the random variable X. It is very useful, as there are statistical tables which present areas under the standard normal distribution.

2.3.5 Lognormal Distribution

Let X be a positive random variable, and let a new random variable Y be defined as Y = log X. If Y has a normal distribution, then X is said to have a lognormal distribution. The density function of a lognormal distribution is given by
f_X(x; μ, σ²) = (1/(x√(2πσ²))) e^{−(log x − μ)²/(2σ²)} for 0 < x < ∞,
where μ and σ² are parameters such that −∞ < μ < ∞ and σ² > 0. We have
E[X] = e^{μ + σ²/2} and var[X] = e^{2μ + 2σ²} − e^{2μ + σ²}.
Notice that if X is lognormally distributed, then E[log X] = μ and var[log X] = σ².

Notice that we can approximate the Poisson and binomial distributions by the normal, in the sense that if a random variable X is distributed as Poisson with parameter λ, then (X − λ)/√λ is distributed approximately as standard normal. On the other hand, if Y ~ Binomial(n, p), then (Y − np)/√(np(1 − p)) is approximately N(0, 1).
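Probabilities of a normal random variable are obtained by standardization, P[X ≤ x] = Φ((x − μ)/σ). A minimal sketch, using the identity Φ(z) = (1 + erf(z/√2))/2 and made-up values of μ and σ purely for illustration:

```python
import math

def std_normal_cdf(z):
    """Phi(z) for the standard normal, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Standardization: if X ~ N(mu, sigma^2), then P[X <= x] = Phi((x - mu)/sigma).
mu, sigma = 50.0, 10.0          # illustrative parameters
x = 65.0
print(round(std_normal_cdf((x - mu) / sigma), 4))   # P[X <= 65] = Phi(1.5) ~ 0.9332

# The 95% central area used later for confidence intervals:
print(round(std_normal_cdf(1.96) - std_normal_cdf(-1.96), 4))  # ~ 0.95
```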
The standard normal is an important distribution for another reason as well. Assume that we have a sample of n independent random variables x_1, x_2, ..., x_n coming from the same distribution with mean m and variance s². Then
(1/√n) Σ_{i=1}^n (x_i − m)/s
is approximately distributed as N(0, 1) for large n. This is the well-known Central Limit Theorem for independent observations.

3 Statistical Inference

3.1 Sampling Theory

To proceed we shall need the following definitions. Let X_1, X_2, ..., X_k be k random variables all defined on the same probability space (Ω, A, P[·]). The joint cumulative distribution function of X_1, X_2, ..., X_k, denoted by F_{X_1,...,X_k}(·, ..., ·), is defined as
F_{X_1,...,X_k}(x_1, x_2, ..., x_k) = P[X_1 ≤ x_1; X_2 ≤ x_2; ...; X_k ≤ x_k] for all (x_1, x_2, ..., x_k).

Let X_1, X_2, ..., X_k be k discrete random variables. Then the joint discrete density function of these, denoted by f_{X_1,...,X_k}(·, ..., ·), is defined to be
f_{X_1,...,X_k}(x_1, x_2, ..., x_k) = P[X_1 = x_1; X_2 = x_2; ...; X_k = x_k]
for (x_1, x_2, ..., x_k) a value of (X_1, X_2, ..., X_k), and 0 otherwise.

Let X_1, X_2, ..., X_k be k continuous random variables. Then the joint continuous density function of these, denoted by f_{X_1,...,X_k}(·, ..., ·), is defined to be a function such that
F_{X_1,...,X_k}(x_1, x_2, ..., x_k) = ∫_{−∞}^{x_k} ... ∫_{−∞}^{x_1} f_{X_1,...,X_k}(u_1, u_2, ..., u_k) du_1 ... du_k for all (x_1, x_2, ..., x_k).

The totality of elements which are under discussion and about which information is desired will be called the target population. The statistical problem is to find out something about a certain target population. It is generally impossible or impractical to examine the entire population, but one may examine a part of it (a sample from it) and, on the basis of this limited investigation, make inferences regarding the entire target population.

The problem immediately arises as to how the sample of the population should be selected. Of practical importance is the case of a simple random sample, usually called a random sample, which can be defined as follows:

Let the random variables X_1, X_2, ..., X_n have a joint density f_{X_1,...,X_n}(x_1, x_2, ..., x_n) that factors as follows:
f_{X_1,...,X_n}(x_1, x_2, ..., x_n) = f(x_1) f(x_2) ... f(x_n),
where f(·) is the common density of each X_i. Then X_1, X_2, ..., X_n is defined to be a random sample of size n from a population with density f(·).

A statistic is a function of observable random variables which is itself an observable random variable and which does not contain any unknown parameters.

Let X_1, X_2, ..., X_n be a random sample from the density f(·). Then the r-th sample moment, denoted by M'_r, is defined as
M'_r = (1/n) Σ_{i=1}^n X_i^r.
In particular, if r = 1 we get the sample mean, which is usually denoted by X̄ or X̄_n; that is, X̄_n = (1/n) Σ_{i=1}^n X_i. Also, the r-th sample central moment (about X̄_n), denoted by M_r, is defined as
M_r = (1/n) Σ_{i=1}^n (X_i − X̄_n)^r.

We can prove the following theorem.

Theorem. Let X_1, X_2, ..., X_n be a random sample from the density f(·). The expected value of the r-th sample moment is equal to the r-th population moment, i.e. the r-th sample moment is an unbiased estimator of the r-th population moment. Proof omitted.

Theorem. Let X_1, X_2, ..., X_n be a random sample from a density f(·), and let X̄_n = (1/n) Σ_{i=1}^n X_i be the sample mean. Then
E[X̄_n] = μ and var[X̄_n] = σ²/n,
where μ and σ² are the mean and variance of f(·), respectively. Notice that this is true for any distribution f(·), provided that σ² is not infinite.
Proof. E[X̄_n] = E[(1/n) Σ_{i=1}^n X_i] = (1/n) Σ_{i=1}^n E[X_i] = (1/n) n μ = μ. Also
var[X̄_n] = var[(1/n) Σ_{i=1}^n X_i] = (1/n²) Σ_{i=1}^n var[X_i] = (1/n²) n σ² = σ²/n.

Let X_1, X_2, ..., X_n be a random sample from a density f(·). Then
S_n² = S² = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄_n)² for n > 1
is defined to be the sample variance.

Theorem. Let X_1, X_2, ..., X_n be a random sample from a density f(·), and let S_n² be defined as above. Then
E[S_n²] = σ² and var[S_n²] = (1/n) (μ_4 − ((n − 3)/(n − 1)) σ⁴),
where σ² and μ_4 are the variance and the 4th central moment of f(·), respectively. Notice that this is true for any distribution f(·), provided that μ_4 is not infinite.

Proof. We shall first prove the following identity, which will be used later:
Σ_{i=1}^n (X_i − μ)² = Σ_{i=1}^n (X_i − X̄_n)² + n(X̄_n − μ)².
Indeed,
Σ (X_i − μ)² = Σ [(X_i − X̄_n) + (X̄_n − μ)]²
= Σ [(X_i − X̄_n)² + 2(X_i − X̄_n)(X̄_n − μ) + (X̄_n − μ)²]
= Σ (X_i − X̄_n)² + 2(X̄_n − μ) Σ (X_i − X̄_n) + n(X̄_n − μ)²
= Σ (X_i − X̄_n)² + n(X̄_n − μ)²,
since Σ (X_i − X̄_n) = 0. Using the above identity we obtain:
E[S_n²] = E[(1/(n − 1)) Σ_{i=1}^n (X_i − X̄_n)²] = (1/(n − 1)) E[Σ_{i=1}^n (X_i − μ)² − n(X̄_n − μ)²]
= (1/(n − 1)) [Σ_{i=1}^n E(X_i − μ)² − n E(X̄_n − μ)²] = (1/(n − 1)) [n σ² − n var(X̄_n)]
= (1/(n − 1)) [n σ² − σ²] = σ².
The derivation of the variance of S_n² is omitted.

3.1.1 Sampling from the Normal Distribution

Theorem. Let X̄_n denote the sample mean of a random sample of size n from a normal distribution with mean μ and variance σ². Then X̄_n has a normal distribution with mean μ and variance σ²/n. Proof omitted.

The gamma function is defined as
Γ(t) = ∫_0^∞ x^{t−1} e^{−x} dx for t > 0.
Notice that Γ(t + 1) = tΓ(t), and if t is an integer then Γ(t + 1) = t!. Also, if t is an integer, then Γ(t + 1/2) = [1·3·5···(2t − 1)/2^t] √π. Finally, Γ(1/2) = √π.

If X is a random variable with density
f_X(x) = [1/Γ(k/2)] (1/2)^{k/2} x^{k/2 − 1} e^{−x/2} for 0 < x < ∞,
where Γ(·) is the gamma function, then X is defined to have a chi-square distribution with k degrees of freedom. Notice that if X is distributed as above, then E[X] = k and var[X] = 2k.

We can prove the following theorem.

Theorem. If the random variables X_i, i = 1, 2, ..., k, are normally and independently distributed with means μ_i and variances σ_i², then
U = Σ_{i=1}^k ((X_i − μ_i)/σ_i)²
has a chi-square distribution with k degrees of freedom. Proof omitted.

Theorem. If the random variables X_i, i = 1, 2, ..., n, are normally and independently distributed with mean μ and variance σ², and S² = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄_n)², then
U = (n − 1)S²/σ² ~ χ²_{n−1},
where χ²_{n−1} is the chi-square distribution with n − 1 degrees of freedom. Proof omitted.

If X is a random variable with density
f_X(x) = [Γ((m + n)/2)/(Γ(m/2)Γ(n/2))] (m/n)^{m/2} x^{m/2 − 1} / [1 + (m/n)x]^{(m+n)/2} for 0 < x < ∞,
where Γ(·) is the gamma function, then X is defined to have an F distribution with m and n degrees of freedom. Notice that if X is distributed as above, then
E[X] = n/(n − 2) and var[X] = 2n²(m + n − 2) / (m(n − 2)²(n − 4)).

Theorem. If the random variables U and V are independently distributed as chi-square with m and n degrees of freedom respectively, i.e. U ~ χ²_m and V ~ χ²_n independently, then
(U/m)/(V/n) = X ~ F_{m,n},
where F_{m,n} is the F distribution with m, n degrees of freedom. Proof omitted.

If X is a random variable with density
f_X(x) = [Γ((k + 1)/2)/(Γ(k/2)√(kπ))] · 1/[1 + x²/k]^{(k+1)/2} for −∞ < x < ∞,
where Γ(·) is the gamma function, then X is defined to have a t distribution with k degrees of freedom. Notice that if X is distributed as above, then E[X] = 0 and var[X] = k/(k − 2).

Theorem. If the random variables Z and V are independently distributed as standard normal and chi-square with k degrees of freedom respectively, i.e. Z ~ N(0, 1) and V ~ χ²_k independently, then
Z/√(V/k) = X ~ t_k,
where t_k is the t distribution with k degrees of freedom. Proof omitted.
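The sampling results E[X̄_n] = μ, var[X̄_n] = σ²/n and E[S_n²] = σ² can be illustrated by simulation. The sketch below is illustrative only (the population parameters are made up, and statistics.fmean assumes Python 3.8 or later): it draws repeated normal samples and compares the Monte Carlo averages with the theoretical values.

```python
import random
import statistics

# Monte Carlo check of E[Xbar] = mu, var[Xbar] = sigma^2/n and E[S^2] = sigma^2
# for samples from a normal population (illustrative parameters).
random.seed(0)
mu, sigma, n, reps = 10.0, 3.0, 25, 20_000

means, variances = [], []
for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    means.append(statistics.fmean(sample))
    variances.append(statistics.variance(sample))   # divides by n-1, like S^2

print(round(statistics.fmean(means), 3))        # close to mu = 10
print(round(statistics.pvariance(means), 3))    # close to sigma^2/n = 0.36
print(round(statistics.fmean(variances), 3))    # close to sigma^2 = 9
```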
3.2 Point and Interval Estimation

The problem of estimation is defined as follows. Assume that some characteristic of the elements in a population can be represented by a random variable X whose density is f_X(·; θ) = f(·; θ), where the form of the density is assumed known except that it contains an unknown parameter θ (if θ were known, the density function would be completely specified, and there would be no need to make inferences about it). Further assume that the values x_1, x_2, ..., x_n of a random sample X_1, X_2, ..., X_n from f(·; θ) can be observed. On the basis of the observed sample values x_1, x_2, ..., x_n it is desired to estimate the value of the unknown parameter θ or the value of some function, say τ(θ), of the unknown parameter.

This estimation can be made in two ways. The first, called point estimation, is to let the value of some statistic, say t(X_1, X_2, ..., X_n), represent, or estimate, the unknown τ(θ); such a statistic is called a point estimator. The second, called interval estimation, is to define two statistics, say t_1(X_1, ..., X_n) and t_2(X_1, ..., X_n), where t_1(X_1, ..., X_n) < t_2(X_1, ..., X_n), so that (t_1(X_1, ..., X_n), t_2(X_1, ..., X_n)) constitutes an interval for which the probability that it contains the unknown τ(θ) can be determined.

3.2.1 Parametric Point Estimation

The point estimation problem has two aspects. The first is to devise some means of obtaining a statistic to use as an estimator. The second is to select criteria and techniques to define and find a "best" estimator among many possible estimators.

Methods of Finding Estimators

Any statistic (a known function of observable random variables that is itself a random variable) whose values are used to estimate τ(θ), where τ(·) is some function of the parameter θ, is defined to be an estimator of τ(θ). Notice that for specific values of the realized random sample the estimator takes a specific value, called an estimate.

Method of Moments. Let f(·; θ_1, θ_2, ..., θ_k) be a density of a random variable X which has k parameters θ_1, θ_2, ..., θ_k. As before, let μ'_r denote the r-th raw moment, i.e. μ'_r = E[X^r]. In general μ'_r will be a known function of the k parameters θ_1, ..., θ_k; denote this by writing μ'_r = μ'_r(θ_1, ..., θ_k). Let X_1, X_2, ..., X_n be a random sample from the density f(·; θ_1, ..., θ_k), and, as before, let M'_j be the j-th sample moment, i.e. M'_j = (1/n) Σ_{i=1}^n X_i^j. Then, equating sample moments to population moments, we get k equations in k unknowns, i.e.
M'_j = μ'_j(θ_1, θ_2, ..., θ_k) for j = 1, 2, ..., k.
Let the solution to these equations be θ̂_1, θ̂_2, ..., θ̂_k. We say that these k estimators are the estimators of θ_1, θ_2, ..., θ_k obtained by the method of moments.

Example: Let X_1, X_2, ..., X_n be a random sample from a normal distribution with mean μ and variance σ². Let (θ_1, θ_2) = (μ, σ²). Estimate the parameters by the method of moments. Recall that μ'_1 = μ and μ'_2 = σ² + μ². The method of moments equations become:
(1/n) Σ_{i=1}^n X_i = X̄ = M'_1 = μ'_1(μ, σ²) = μ, and
(1/n) Σ_{i=1}^n X_i² = M'_2 = μ'_2(μ, σ²) = σ² + μ².
Solving the two equations for μ and σ we get
μ̂ = X̄ and σ̂ = √[(1/n) Σ_{i=1}^n (X_i − X̄)²],
which are the method of moments estimators of μ and σ.

Example: Let X_1, X_2, ..., X_n be a random sample from a Poisson distribution with parameter λ. There is only one parameter, hence only one equation, which is
(1/n) Σ_{i=1}^n X_i = X̄ = M'_1 = μ'_1(λ) = λ.
Hence the method of moments estimator of λ is λ̂ = X̄.
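A small illustrative sketch of the method of moments for the normal case: the first two sample moments are computed from simulated data (the "observed" sample and its parameters are made up for the example) and solved for μ̂ and σ̂ exactly as in the equations above.

```python
import math
import random
import statistics

# Method of moments for a normal sample: mu_hat = sample mean,
# sigma_hat = sqrt of the second sample moment minus the squared mean.
random.seed(1)
data = [random.gauss(2.0, 1.5) for _ in range(1_000)]   # illustrative "observed" sample

m1 = statistics.fmean(data)
m2 = statistics.fmean(x * x for x in data)

mu_hat = m1
sigma_hat = math.sqrt(m2 - m1 ** 2)
print(round(mu_hat, 3), round(sigma_hat, 3))   # close to 2.0 and 1.5
```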
Maximum Likelihood. Consider the following estimation problem. Suppose that a box contains a number of black and a number of white balls, and suppose that it is known that the ratio of the numbers is 3:1, but it is not known whether the black or the white balls are more numerous, i.e. the probability of drawing a black ball is either 1/4 or 3/4. If n balls are drawn with replacement from the box, the distribution of X, the number of black balls, is given by the binomial distribution
f(x; p) = C_x^n p^x (1 − p)^{n−x} for x = 0, 1, 2, ..., n,
where p is the probability of drawing a black ball. Here p = 1/4 or p = 3/4. We shall draw a sample of three balls, i.e. n = 3, with replacement and attempt to estimate the unknown parameter p of the distribution. The estimation is simple in this case, as we have to choose only between the two numbers 1/4 = 0.25 and 3/4 = 0.75. The possible outcomes and their probabilities are given below:

outcome x:    0       1       2       3
f(x; 0.75):   1/64    9/64    27/64   27/64
f(x; 0.25):   27/64   27/64   9/64    1/64

In the present example, if we found x = 0 in a sample of 3, the estimate 0.25 for p would be preferred over 0.75 because the probability 27/64 is greater than 1/64. In general we should estimate p by 0.25 when x = 0 or 1 and by 0.75 when x = 2 or 3. The estimator may be defined as
p̂ = p̂(x) = 0.25 for x = 0, 1 and p̂ = p̂(x) = 0.75 for x = 2, 3.
The estimator thus selects for every possible x the value of p, say p̂, such that f(x; p̂) > f(x; p'), where p' is the other value of p.

More generally, if several values of p were possible, we might reasonably proceed in the same manner. Thus if we found x = 2 in a sample of 3 from a binomial population, we should substitute all possible values of p in the expression
f(2; p) = C_2^3 p²(1 − p) for 0 ≤ p ≤ 1
and choose as our estimate the value of p which maximizes f(2; p). The position of the maximum of the function above is found by setting the first derivative with respect to p equal to zero, i.e.
d f(2; p)/dp = 6p − 9p² = 3p(2 − 3p) = 0, giving p = 0 or p = 2/3.
The second derivative is d² f(2; p)/dp² = 6 − 18p. Hence d² f(2; 0)/dp² = 6 > 0 and the value p = 0 represents a minimum, whereas d² f(2; 2/3)/dp² = −6 < 0 and consequently p = 2/3 represents the maximum. Hence p̂ = 2/3 is our estimate, which has the property f(2; p̂) ≥ f(2; p') for any other value p' in the interval 0 ≤ p ≤ 1.
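The two steps of the example, picking the more likely of the two admissible values of p and then maximizing f(2; p) over the whole interval [0, 1], can be mimicked numerically. The grid search below is only an illustration of the idea; analytically the maximum is at p = 2/3.

```python
from math import comb

def binom_pmf(x, n, p):
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Discrete choice between p = 0.25 and p = 0.75 for each possible x (n = 3):
for x in range(4):
    likelihoods = {p: binom_pmf(x, 3, p) for p in (0.25, 0.75)}
    print(x, max(likelihoods, key=likelihoods.get))   # 0.25 for x = 0, 1; 0.75 for x = 2, 3

# Continuous case: maximize f(2; p) over a fine grid of p in [0, 1];
# the maximizer agrees with the analytical answer p = 2/3 up to grid resolution.
grid = [i / 10_000 for i in range(10_001)]
p_hat = max(grid, key=lambda p: binom_pmf(2, 3, p))
print(p_hat)   # about 0.6667
```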
The likelihood function of $n$ random variables $X_1, X_2, \dots, X_n$ is defined to be the joint density of the $n$ random variables, say $f_{X_1,\dots,X_n}(x_1, x_2, \dots, x_n; \theta)$, which is considered to be a function of $\theta$. In particular, if $X_1, X_2, \dots, X_n$ is a random sample from the density $f(x;\theta)$, then the likelihood function is $f(x_1;\theta) f(x_2;\theta) \cdots f(x_n;\theta)$. To think of the likelihood function as a function of $\theta$, we shall use the notation $L(\theta; x_1, x_2, \dots, x_n)$ for the likelihood function in general. The likelihood is the value of a density function; consequently, for discrete random variables it is a probability. Suppose for the moment that $\theta$ is known, denoted by $\theta_0$. The particular values of the random variables which are "most likely to occur" are the values $x'_1, x'_2, \dots, x'_n$ such that $f_{X_1,\dots,X_n}(x'_1, x'_2, \dots, x'_n; \theta_0)$ is a maximum. For example, for simplicity let us assume that $n = 1$ and that $X_1$ has the normal density with mean 0 and variance 1. Then the value of the random variable which is most likely to occur is $X_1 = 0$. By "most likely to occur" we mean the value $x'_1$ of $X_1$ such that $\phi_{0,1}(x'_1) > \phi_{0,1}(x_1)$ for any other value $x_1$.

Now let us suppose that the joint density of $n$ random variables is $f_{X_1,\dots,X_n}(x_1, x_2, \dots, x_n; \theta)$, where $\theta$ is unknown. Let the particular values which are observed be represented by $x'_1, x'_2, \dots, x'_n$. We want to know from which density this particular set of values is most likely to have come; that is, we want to know for which value of $\theta$ the likelihood that the set $x'_1, x'_2, \dots, x'_n$ was obtained is largest. In other words, we want to find the value of $\theta$ in the admissible set, denoted by $\hat{\theta}$, which maximizes the likelihood function $L(\theta; x'_1, x'_2, \dots, x'_n)$. The value $\hat{\theta}$ which maximizes the likelihood function is, in general, a function of $x_1, x_2, \dots, x_n$, say $\hat{\theta} = \hat{\theta}(x_1, x_2, \dots, x_n)$. Hence we have the following definition: Let $L(\theta) = L(\theta; x_1, x_2, \dots, x_n)$ be the likelihood function for the random variables $X_1, X_2, \dots, X_n$. If $\hat{\theta}$ [where $\hat{\theta} = \hat{\theta}(x_1, x_2, \dots, x_n)$ is a function of the observations $x_1, x_2, \dots, x_n$] is the value of $\theta$ in the admissible range which maximizes $L(\theta)$, then $\hat{\theta} = \hat{\theta}(X_1, X_2, \dots, X_n)$ is the maximum likelihood estimator of $\theta$, and $\hat{\theta} = \hat{\theta}(x_1, x_2, \dots, x_n)$ is the maximum likelihood estimate of $\theta$ for the sample $x_1, x_2, \dots, x_n$.

The most important cases which we shall consider are those in which $X_1, X_2, \dots, X_n$ is a random sample from some density function $f(x;\theta)$, so that the likelihood function is
$$L(\theta) = f(x_1;\theta) f(x_2;\theta) \cdots f(x_n;\theta).$$
Many likelihood functions satisfy regularity conditions, so the maximum likelihood estimator is the solution of the equation
$$\frac{dL(\theta)}{d\theta} = 0.$$
Also, $L(\theta)$ and $\log L(\theta)$ have their maxima at the same value of $\theta$, and it is sometimes easier to find the maximum of the logarithm of the likelihood. Notice also that if the likelihood function contains $k$ parameters then we find the estimators from the solution of the $k$ first-order conditions.

Example: Let a random sample of size $n$ be drawn from the Bernoulli distribution
$$f(x;p) = p^x (1-p)^{1-x},$$
where $0 \le p \le 1$. The sample values $x_1, x_2, \dots, x_n$ will be a sequence of 0s and 1s, and the likelihood function is
$$L(p) = \prod_{i=1}^{n} p^{x_i}(1-p)^{1-x_i} = p^{\sum x_i} (1-p)^{n - \sum x_i}.$$
Let $y = \sum x_i$. We obtain
$$\log L(p) = y \log p + (n-y)\log(1-p)$$
and
$$\frac{d \log L(p)}{dp} = \frac{y}{p} - \frac{n-y}{1-p}.$$
Setting this expression equal to zero we get
$$\hat{p} = \frac{y}{n} = \frac{1}{n}\sum x_i = \bar{x},$$
which is intuitively what the estimate for this parameter should be.

Example: Let a random sample of size $n$ be drawn from the normal distribution with density
$$f(x;\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}.$$
The likelihood function is
$$L(\mu,\sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x_i-\mu)^2}{2\sigma^2}} = \left(2\pi\sigma^2\right)^{-n/2} \exp\!\left[-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2\right].$$
The logarithm of the likelihood function is
$$\log L(\mu,\sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log \sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2.$$
To find the maximum with respect to $\mu$ and $\sigma^2$ we compute
$$\frac{\partial \log L}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i-\mu)$$
and
$$\frac{\partial \log L}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i-\mu)^2,$$
and setting these derivatives equal to 0 and solving the resulting equations we find the estimates
$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x} \qquad \text{and} \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2,$$
which turn out to be the sample moments corresponding to $\mu$ and $\sigma^2$.

Properties of Point Estimators

One needs to define criteria so that various estimators can be compared. One of these is unbiasedness. An estimator $T = t(X_1, X_2, \dots, X_n)$ is defined to be an unbiased estimator of $\tau(\theta)$ if and only if
$$E_\theta[T] = E_\theta[t(X_1, X_2, \dots, X_n)] = \tau(\theta)$$
for all $\theta$ in the admissible space. Other criteria are consistency, mean square error, etc.
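To connect the maximum likelihood estimates of the normal example with the unbiasedness criterion just defined, here is a rough simulation sketch (sample size and parameter values chosen arbitrarily for illustration). It suggests that $\bar{X}$ is unbiased for $\mu$, whereas the ML variance $\frac{1}{n}\sum(X_i-\bar{X})^2$ has expectation close to $\frac{n-1}{n}\sigma^2$ rather than $\sigma^2$, i.e. it is biased in small samples.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma2, n, reps = 3.0, 4.0, 10, 50_000  # illustrative values

mean_estimates = np.empty(reps)
ml_var_estimates = np.empty(reps)
for r in range(reps):
    x = rng.normal(mu, np.sqrt(sigma2), size=n)
    mean_estimates[r] = x.mean()
    ml_var_estimates[r] = ((x - x.mean()) ** 2).mean()  # ML estimate, divisor n

print(mean_estimates.mean())    # close to mu = 3.0 (unbiased)
print(ml_var_estimates.mean())  # close to (n - 1) / n * sigma2 = 3.6 (biased downward)
print((n - 1) / n * sigma2)     # theoretical expectation of the ML variance estimator
```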
3.2.2 Interval Estimation

In practice estimates are often given in the form of the estimate plus or minus a certain amount, e.g. the cost per volume of a book could be 83 ± 4.5 per cent, which means that the actual cost will lie somewhere between 78.5% and 87.5% with high probability. Let us consider a particular example. Suppose that a random sample (1.2, 3.4, 0.6, 5.6) of four observations is drawn from a normal population with unknown mean $\mu$ and a known variance of 9. The maximum likelihood estimate of $\mu$ is the sample mean of the observations:
$$\bar{x} = 2.7.$$
We wish to determine upper and lower limits which are rather certain to contain the true unknown parameter value between them. We know that the sample mean $\bar{X}$ is distributed as normal with mean $\mu$ and variance $9/n$, i.e. $\bar{X} \sim N(\mu, \sigma^2/n)$. Hence we have
$$Z = \frac{\bar{X} - \mu}{3/2} \sim N(0,1).$$
Hence $Z$ is standard normal. Consequently we can find the probability that $Z$ will be between two arbitrary values. For example we have that
$$P[-1.96 < Z < 1.96] = \int_{-1.96}^{1.96} \phi(z)\,dz = 0.95.$$
Hence we get that $\mu$ must be in the interval
$$\bar{X} + 1.96 \cdot \tfrac{3}{2} > \mu > \bar{X} - 1.96 \cdot \tfrac{3}{2},$$
and for the specific value of the sample mean we have that $5.64 > \mu > -0.24$, i.e. $P[5.64 > \mu > -0.24] = 0.95$. This leads us to the following definition of the confidence interval.

Let $X_1, X_2, \dots, X_n$ be a random sample from the density $f(\cdot;\theta)$. Let $T_1 = t_1(X_1, \dots, X_n)$ and $T_2 = t_2(X_1, \dots, X_n)$ be two statistics satisfying $T_1 \le T_2$ for which $P[T_1 < \tau(\theta) < T_2] = \gamma$, where $\gamma$ does not depend on $\theta$. Then the random interval $(T_1, T_2)$ is called a $100\gamma$ percent confidence interval for $\tau(\theta)$; $\gamma$ is called the confidence coefficient; and $T_1$ and $T_2$ are called the lower and upper confidence limits, respectively. A value $(t_1, t_2)$ of the random interval $(T_1, T_2)$ is also called a $100\gamma$ percent confidence interval for $\tau(\theta)$.

Let $X_1, X_2, \dots, X_n$ be a random sample from the density $f(\cdot;\theta)$. Let $T_1 = t_1(X_1, \dots, X_n)$ be a statistic for which $P[T_1 < \tau(\theta)] = \gamma$. Then $T_1$ is called a one-sided lower confidence interval for $\tau(\theta)$. Similarly, let $T_2 = t_2(X_1, \dots, X_n)$ be a statistic for which $P[\tau(\theta) < T_2] = \gamma$. Then $T_2$ is called a one-sided upper confidence interval for $\tau(\theta)$.

Example: Let $X_1, X_2, \dots, X_n$ be a random sample from the density $f(x;\theta) = \phi_{\theta,9}(x)$. Set $T_1 = t_1(X_1, \dots, X_n) = \bar{X} - 6/\sqrt{n}$ and $T_2 = t_2(X_1, \dots, X_n) = \bar{X} + 6/\sqrt{n}$. Then $(T_1, T_2)$ constitutes a random interval and is a confidence interval for $\tau(\theta) = \theta$, with confidence coefficient
$$\gamma = P\!\left[\bar{X} - 6/\sqrt{n} < \theta < \bar{X} + 6/\sqrt{n}\right] = P\!\left[-2 < \frac{\bar{X}-\theta}{3/\sqrt{n}} < 2\right] = \Phi(2) - \Phi(-2) = 0.9772 - 0.0228 = 0.9544.$$
Hence if a random sample of 25 observations has a sample mean of, say, 17.5, then the interval $(17.5 - 6/\sqrt{25},\; 17.5 + 6/\sqrt{25})$ is also called a confidence interval for $\theta$.

Sampling from the Normal Distribution

Let $X_1, X_2, \dots, X_n$ be a random sample from the normal distribution with mean $\mu$ and variance $\sigma^2$. If $\sigma^2$ is unknown then $\theta = (\mu, \sigma^2)$ are the unknown parameters, and $\tau(\theta) = \mu$ is the parameter we want to estimate by interval estimation. We know that
$$\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0,1).$$
However, the problem with this statistic is that it involves two unknown parameters; consequently we cannot construct an interval from it. Hence we look for a statistic that involves only the parameter we want to estimate, i.e.
$$\frac{\bar{X} - \mu}{S/\sqrt{n}}, \qquad \text{where } S^2 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}.$$
Notice that
$$\frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n-1}.$$
This statistic involves only the parameter we want to estimate. Hence we have
$$q_1 < \frac{\bar{X} - \mu}{S/\sqrt{n}} < q_2 \;\Longleftrightarrow\; \bar{X} - q_2\,(S/\sqrt{n}) < \mu < \bar{X} - q_1\,(S/\sqrt{n}),$$
where $q_1, q_2$ are such that
$$P\!\left[q_1 < \frac{\bar{X} - \mu}{S/\sqrt{n}} < q_2\right] = \gamma.$$
Hence the interval $\left(\bar{X} - q_2\,(S/\sqrt{n}),\; \bar{X} - q_1\,(S/\sqrt{n})\right)$ is the $100\gamma$ percent confidence interval for $\mu$. It can be proved that if $q_1, q_2$ are symmetric around 0, then the length of the interval is minimized.
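A minimal sketch checking the numerical interval above, assuming, as in the text, a known variance of 9 and the four observations 1.2, 3.4, 0.6, 5.6:

```python
import numpy as np

x = np.array([1.2, 3.4, 0.6, 5.6])
sigma = 3.0                # known standard deviation (variance 9)
n = len(x)
xbar = x.mean()            # 2.7

z = 1.96                   # P(-1.96 < Z < 1.96) = 0.95 for Z ~ N(0, 1)
half_width = z * sigma / np.sqrt(n)

lower, upper = xbar - half_width, xbar + half_width
print(xbar, lower, upper)  # 2.7, about -0.24 and 5.64, as in the text
```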
Alternatively, if we want to find a confidence interval for $\sigma^2$ when $\mu$ is unknown, then we use the statistic
$$\frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{\sigma^2} = \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}.$$
Hence we have
$$q_1 < \frac{(n-1)S^2}{\sigma^2} < q_2 \;\Longleftrightarrow\; \frac{(n-1)S^2}{q_2} < \sigma^2 < \frac{(n-1)S^2}{q_1},$$
where $q_1, q_2$ are such that
$$P\!\left[q_1 < \frac{(n-1)S^2}{\sigma^2} < q_2\right] = \gamma.$$
So the interval $\left(\frac{(n-1)S^2}{q_2},\; \frac{(n-1)S^2}{q_1}\right)$ is a $100\gamma$ percent confidence interval for $\sigma^2$. The $q_1, q_2$ are often selected so that
$$P\!\left[\frac{(n-1)S^2}{\sigma^2} > q_2\right] = P\!\left[\frac{(n-1)S^2}{\sigma^2} < q_1\right] = \frac{1-\gamma}{2}.$$
Such a confidence interval is referred to as an equal-tailed confidence interval for $\sigma^2$.

3.3 Hypothesis testing

A statistical hypothesis is an assertion or conjecture, denoted by $H$, about the distribution of one or more random variables. If the statistical hypothesis completely specifies the distribution it is called simple, otherwise it is composite.

Example: Let $X_1, X_2, \dots, X_n$ be a random sample from $f(x;\theta) = \phi_{\theta,25}(x)$. The statistical hypothesis that the mean of the normal population is less than or equal to 17 is denoted by $H\!: \theta \le 17$. Such a hypothesis is composite, as it does not completely specify the distribution. On the other hand, the hypothesis $H\!: \theta = 17$ is simple since it completely specifies the distribution.

A test of a statistical hypothesis $H$ is a rule or procedure for deciding whether to reject $H$.

Example: Let $X_1, X_2, \dots, X_n$ be a random sample from $f(x;\theta) = \phi_{\theta,25}(x)$. Consider $H\!: \theta \le 17$. One possible test $Y$ is as follows: reject $H$ if and only if $\bar{X} > 17 + 5/\sqrt{n}$.

In many hypothesis-testing problems two hypotheses are discussed. The first, the hypothesis being tested, is called the null hypothesis, denoted by $H_0$, and the second is called the alternative hypothesis, denoted by $H_1$. We say that $H_0$ is tested against, or versus, $H_1$. The thinking is that if the null hypothesis is wrong the alternative hypothesis is true, and vice versa. We can make two types of errors: rejection of $H_0$ when $H_0$ is true is called a Type I error, and acceptance of $H_0$ when $H_0$ is false is called a Type II error. The size of the Type I error is defined to be the probability that a Type I error is made, and similarly the size of the Type II error is defined to be the probability that a Type II error is made.

The significance level or size of a test, denoted by $\alpha$, is the supremum of the probability of rejecting $H_0$ when $H_0$ is correct, i.e. it is the supremum of the size of the Type I error. In general, to perform a test we fix the size to a prespecified value, typically 10%, 5% or 1%.

Example: Let $X_1, X_2, \dots, X_n$ be a random sample from $f(x;\theta) = \phi_{\theta,25}(x)$. Consider $H_0\!: \theta \le 17$ and the test $Y$: reject $H_0$ if and only if $\bar{X} > 17 + 5/\sqrt{n}$. Then the size of the test is
$$\sup_{\theta \le 17} P\!\left[\bar{X} > 17 + 5/\sqrt{n}\right] = \sup_{\theta \le 17} P\!\left[\frac{\bar{X}-\theta}{5/\sqrt{n}} > \frac{17 + 5/\sqrt{n} - \theta}{5/\sqrt{n}}\right] = \sup_{\theta \le 17} \left\{1 - \Phi\!\left(\frac{17 + 5/\sqrt{n} - \theta}{5/\sqrt{n}}\right)\right\} = 1 - \Phi(1) = 0.159.$$

3.3.1 Testing Procedure

Let us establish a test procedure via an example. Assume that the $x_i$'s are iid normal, $n = 64$, $\bar{X} = 9.8$ and $\sigma^2 = 0.04$. We would like to test the hypothesis that $\mu = 10$.

1. Formulate the null hypothesis: $H_0\!: \mu = 10$.
2. Formulate the alternative: $H_1\!: \mu \ne 10$.
3. Select the level of significance, $\alpha = 0.01$, and from tables find the critical value for $Z$, denoted by $c_Z = 2.58$.
4. Establish the rejection limits: reject $H_0$ if $Z < -2.58$ or $Z > 2.58$.
5. Calculate $Z$:
$$Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} = \frac{9.8 - 10}{0.2/\sqrt{64}} = -8.$$
6. Make the decision: since $Z$ is less than $-2.58$, reject $H_0$.
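The six-step procedure above can be mirrored directly in a short calculation. The following sketch (illustrative only; it uses the Python standard-library normal distribution rather than tables) reproduces $Z = -8$ and the rejection decision for the example with $n = 64$, $\bar{X} = 9.8$, $\sigma^2 = 0.04$ and $\alpha = 0.01$:

```python
from math import sqrt
from statistics import NormalDist

# Summary of the example (iid normal data with known variance)
n, xbar, sigma2, mu0, alpha = 64, 9.8, 0.04, 10.0, 0.01

# Two-sided critical value: about 2.58 for alpha = 0.01
c = NormalDist().inv_cdf(1 - alpha / 2)

# Test statistic Z = (xbar - mu0) / (sigma / sqrt(n))
z = (xbar - mu0) / (sqrt(sigma2) / sqrt(n))

print(z, c)        # -8.0 and about 2.58
print(abs(z) > c)  # True: reject H0 at the 1% level
```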
To find the appropriate test for the mean we have to consider the following cases:

1. Normal population and known population variance (or standard deviation). In this case the statistic we use is:
$$Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \sim N(0,1).$$
2. Large samples, so that we can appeal to the central limit theorem. In this case the statistic we use is:
$$Z = \frac{\bar{X} - \mu_0}{S/\sqrt{n}} \sim N(0,1).$$
3. Small samples from a normal population where the population variance (or standard deviation) is unknown. In this case the statistic we use is:
$$t = \frac{\bar{X} - \mu_0}{S/\sqrt{n}} \sim t_{n-1}.$$

3.3.2 Testing Proportions

The null hypothesis will be of the form $H_0\!: \pi = \pi_0$, and the three possible alternatives are: (1) $H_1\!: \pi \ne \pi_0$, a two-sided test; (2) $H_1\!: \pi < \pi_0$, one-sided; (3) $H_1\!: \pi > \pi_0$, one-sided. The appropriate statistic is based on the central limit theorem and is:
$$Z = \frac{\hat{p} - \pi_0}{S/\sqrt{n}} \sim N(0,1), \qquad \text{where } S^2 = \pi_0(1-\pi_0).$$

Example: Mr. X believes that he will get more than 60% of the votes. However, in a sample of 400 voters, 252 indicate that they will vote for X. At a significance level of 5%, test Mr. X's belief. We have $\hat{p} = \frac{252}{400} = 0.63$ and $S^2 = 0.6(1-0.6) = 0.24$. The null is $H_0\!: \pi = 0.6$ and the alternative is $H_1\!: \pi > 0.6$. The critical value is 1.64. Now
$$Z = \frac{\hat{p} - \pi_0}{S/\sqrt{n}} = \frac{0.63 - 0.6}{0.489/\sqrt{400}} = 1.22.$$
Consequently, the null is not rejected, as $Z < 1.64$. Thus the data do not support Mr. X's belief.

In fact we have the following possible outcomes when testing hypotheses:

                        $H_0$ is accepted                $H_1$ is accepted
$H_0$ is correct        Correct decision ($1-\alpha$)    Type I error ($\alpha$)
$H_1$ is correct        Type II error ($\beta$)          Correct decision ($1-\beta$)

An operating characteristic curve presents the probability of accepting the null hypothesis for various values of the population parameter, at a given significance level and using a particular sample size. The power of the test is the complement of the operating characteristic curve, i.e. it is the probability of rejecting the null hypothesis for the various possible values of the population parameter.
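Returning to the voting example in the proportions subsection above, a hedged numerical sketch of the same calculation (again relying on the standard-library normal distribution for the critical value; variable names are mine):

```python
from math import sqrt
from statistics import NormalDist

# Mr. X example: n = 400 voters, 252 in favour; H0: pi = 0.6 versus H1: pi > 0.6
n, in_favour, pi0, alpha = 400, 252, 0.60, 0.05

p_hat = in_favour / n                 # 0.63
s = sqrt(pi0 * (1 - pi0))             # S^2 = pi0 * (1 - pi0) = 0.24 under H0
z = (p_hat - pi0) / (s / sqrt(n))     # about 1.22

c = NormalDist().inv_cdf(1 - alpha)   # one-sided critical value, about 1.64
print(z, c, z > c)                    # 1.22..., 1.64..., False -> do not reject H0
```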
Exercises:

1) The growth of an economy can be high with probability 0.15, normal with probability 0.5 and negative (recession) with probability 0.35. In the high-growth state the return of a stock will be 0.4, in the normal state 0.1, and in the recession it will be -0.1. Evaluate the expected return of this stock.

2) The audit department of a bank, in order to check whether best credit practices are complied with, chooses at random 2 of its 5 branches. It is known that 3 of the 5 branches do not follow best credit practices ("bad" branches), although which 3 is not known.
a. What is the probability that 2 "bad" branches are chosen?
b. What is the probability that 2 "good" branches are chosen?
c. What is the probability that 1 "good" and 1 "bad" branch are chosen?

3) In a market research survey 55% of the participating consumers are women. 60% of those women prefer a specific product, whereas only 38% of the men prefer the same product. We choose at random one participant of the survey. Find:
a) The probability that this person prefers the specific product.
b) The probability that the chosen person is a woman, given that the person prefers the specific product.

4) If $X \sim N(10, 9)$ find a) $P(X \le 15.88)$, b) $P(X \ge 4.12)$, c) $P(X \le 4.12)$, d) $P(4.12 \le X \le 15.88)$, e) $P(X \ge 15.88)$. For the same random variable $X$ find $x_0$ such that $P(X \le x_0) = 5\%$, $P(X \ge x_0) = 5\%$ and $P(-x_0 \le X \le x_0) = 95\%$.

5) The following numbers are the numbers of computers sold per month over the last 19 months by a specific computer company:
25, 26, 32, 21, 29, 31, 27, 23, 34, 29, 32, 34, 35, 31, 36, 37, 41, 44, 46.
Compute the sample mean, the second, third and fourth moments, the sample variance, the ML variance, the coefficient of skewness and the kurtosis.

6) For a large class of students a random sample of 4 grades was drawn: 64, 66, 89, and 77. Calculate a 95% confidence interval for the whole-class mean $\mu$. How would your result change if you knew that the variance $\sigma^2$ were 100? (Notice that the value of a $t_3$ distribution that leaves 2.5% at the right tail is 3.18.)