Probability and Statistical Review
Lecture 1
Manoranjan Majji
Lecture slides and notes available online:
Visit http://dnc.tamu.edu/Class Notes/AERO626/index.php
Probability and Statistical Review
• Probability
  – Motivating Example
  – Definition of Probability
  – Axioms of Probability
  – Conditional Probability
  – Bayes’s Theorem
  – Random Variables (Discrete and Continuous)
  – Expectation of Random Variables
  – Multivariate Density Functions
Basic Probability Concepts
• Probabilities are numbers assigned to events that indicate “how likely” it is that the event will occur when a random experiment is performed.
  – The statement “E has probability P(E)” then means that if we perform the experiment very often, it is practically certain that the relative frequency is approximately equal to P(E).
• What do we mean by relative frequency?
  – The relative frequency is at least equal to 0 and at most equal to 1:
    $0 \le P(E) \le 1$
  – Frequency function: it shows how the values of the samples are distributed:
    $f(x) = \begin{cases} f_j & \text{when } x = x_j \\ 0 & \text{for any value } x \text{ not appearing in the sample} \end{cases}$
  – Sample distribution function:
    $F(x) = \sum_{t \le x} f(t)$
Basic Probability Concepts
• The frequency function characterizes a given sample in detail.
  – We can compute some numbers that characterize certain properties of the sample.
  – Sample mean:
    $\mu = \frac{1}{n}\sum_{j=1}^{n} x_j = \frac{1}{n}\sum_{j=1}^{m} x_j\, n f(x_j) = \sum_{j=1}^{m} x_j f(x_j)$
  – Sample variance:
    $\sigma^2 = \frac{1}{n}\sum_{j=1}^{n} (x_j - \mu)^2 = \sum_{j=1}^{m} (x_j - \mu)^2 f(x_j)$
Useful Definitions
• Random experiment or random observation:
  – It is performed according to a set of rules that determines the performance completely.
  – It can be repeated arbitrarily often.
  – The result of each performance depends on “chance” (that is, on influences which we cannot control) and can therefore not be uniquely predicted.
• The result of a single performance of the experiment is called the outcome of that experiment.
• The set of all possible outcomes of an experiment is called the sample space of the experiment.
• In most practical problems, we are not interested in the individual outcomes of the experiment but in whether an outcome belongs to a certain set of outcomes. Such a set is called an “event”.
Useful Definitions
• Impossible event: an event containing no element, denoted by $\emptyset$.
• Mutually exclusive or disjoint events: $A \cap B = \emptyset$.
• Example: consider the roll of a die.
  Sample space: $S = \{1, 2, 3, 4, 5, 6\}$
  $E$: the event that the die turns up an even number $= \{2, 4, 6\}$
  $O$: the event that the die turns up an odd number $= \{1, 3, 5\}$
  $E \cap O = \emptyset$
  E and O are mutually exclusive events.
Axioms of Probability
Property 1: $0 \le P(E) \le 1$.
Property 2: $P(S) = 1$.
Property 3: $P(E^c) = 1 - P(E)$.
Property 4: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.
Property 5: if $E_1, E_2, \ldots, E_n$ are mutually exclusive events, then
$P(E_1 \cup E_2 \cup \cdots \cup E_n) = P(E_1) + P(E_2) + \cdots + P(E_n)$
Conditional Probability
• The probability of an event B under the condition that an event A occurs is given by
  $P(B \mid A) = \frac{P(A \cap B)}{P(A)}$
  – $P(B \mid A)$ is called the conditional probability of B given A.
  [Figure: Venn diagram of overlapping events A and B with intersection A∩B]
  – In this case, event A serves as a new sample space and event B becomes $A \cap B$.
  – A and B are called independent events if
    $P(B \mid A) = P(B)$
    $P(A \mid B) = P(A)$
    $P(A \cap B) = P(A)\,P(B)$
Theorem of Total Probability
[Figure: sample space partitioned into B1, B2, B3, …, Bn−1, Bn, with event A overlapping the partition]
• Let $B_1, B_2, \ldots, B_n$ be mutually exclusive events such that
  $\bigcup_{i=1}^{n} B_i = S$
• The probability of an event A can be represented as:
  $P(A) = P(A \cap B_1) + P(A \cap B_2) + \cdots + P(A \cap B_n)$
• and, therefore,
  $P(A) = P(A \mid B_1)P(B_1) + \cdots + P(A \mid B_n)P(B_n) = \sum_{i=1}^{n} P(A \mid B_i)P(B_i)$
Bayes’s Theorem
• Let us assume there are m mutually exclusive states of nature (classes) labeled $\omega_j$ ($j = 1, 2, \ldots, m$).
• Let $P(x)$ be the probability that an event assumes the specific value x.
• Definitions:
  – Prior probability: $P(\omega_j)$.
  – Posterior probability: $P(\omega_j \mid x)$ (of class $\omega_j$ given observation x).
  – Likelihood: $P(x \mid \omega_j)$ (conditional probability of observation x given class $\omega_j$).
• Bayes’s theorem gives the relationship between the m prior probabilities $P(\omega_j)$, the m likelihoods $P(x \mid \omega_j)$, and one posterior probability of interest:
  $P(\omega_j \mid x) = \frac{P(\omega_j)\, P(x \mid \omega_j)}{\sum_{k=1}^{m} P(\omega_k)\, P(x \mid \omega_k)}$
Exercise
• Consider a clinical problem where we have to decide if a patient has a particular rare disease on the basis of an imperfect medical test.
  – 1 in 1000 people have the rare disease.
  – The test shows positive 99% of the time when a person has the disease.
  – The test shows positive 2% of the time when a person does not have the disease.
• What is the probability that the person actually has the disease when the test is positive?
  $P(A_1 \mid B) = \frac{P(B \mid A_1) P(A_1)}{P(B \mid A_1) P(A_1) + P(B \mid A_2) P(A_2)} = \frac{0.99 \times 0.001}{(0.99 \times 0.001) + (0.02 \times 0.999)} \approx 0.047$
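A minimal sketch of this computation in Python (variable names are illustrative; only the numbers given on the slide are used):

```python
# Bayes's theorem for the rare-disease test.
p_disease = 0.001           # prior P(A1): 1 in 1000 people have the disease
p_pos_given_disease = 0.99  # likelihood P(B|A1): test sensitivity
p_pos_given_healthy = 0.02  # P(B|A2): false-positive rate

# Total probability of a positive test, P(B)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior P(A1|B)
posterior = p_pos_given_disease * p_disease / p_pos
print(f"P(disease | positive) = {posterior:.3f}")  # ~0.047

# With a 25% prior, as on the next slide, the posterior jumps to ~0.94
prior = 0.25
posterior_25 = (p_pos_given_disease * prior
                / (p_pos_given_disease * prior + p_pos_given_healthy * (1 - prior)))
print(f"P(disease | positive, 25% prior) = {posterior_25:.2f}")
```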
Exercise (continued…)
• P(A1 | B) = 0.047 = 4.7%
  – seems counter-intuitive… WHY?
• More positive tests arise from error than from people actually having the disease.
  – The prior 0.001 is updated to the posterior 0.047.
• The disease is rare and the test is only marginally reliable.
• NOTE: if the disease were not so rare (say, 25% incidence), we would get a good diagnosis:
  – P(A1 | B) = 0.94.
Random Variables
• A random variable X (also called a stochastic variable) is a function whose values are real numbers and depend on “chance”. More precisely, it is a function X with the following properties:
  – X is defined on the sample space S of the experiment, and its values are real numbers.
• The function that assigns a value to each outcome is fixed and deterministic.
  – The randomness is due to the underlying randomness of the argument of the function X.
• Random variables can be discrete or continuous.
Discrete Random Variables
• A random variable X and the corresponding distribution are said to be discrete if the number of values for which X has non-zero probability is finite.
• Probability mass function of X:
  $f(x) = \begin{cases} p_j & \text{when } x = x_j \\ 0 & \text{otherwise} \end{cases}$
• Probability distribution function of X:
  $F(x) = P(X \le x)$
• Properties of the distribution function:
  $0 \le F(x) \le 1$
  $P(a < x \le b) = F(b) - F(a)$
Continuous Random Variables and Distributions
• A random variable X and the corresponding distribution are said to be continuous if the distribution function $F(x) = P(X \le x)$ of X can be represented in integral form:
  $F(x) = \int_{-\infty}^{x} f(y)\, dy$
• The integrand f(y) is called a probability density function, and
  $F'(x) = f(x)$
• Properties:
  $\int_{-\infty}^{\infty} f(x)\, dx = 1$
  $P(a < X \le b) = F(b) - F(a) = \int_{a}^{b} f(x)\, dx$
Statistical Characterization of Random Variables
• Expected value:
  – The expected value of a discrete random variable x is found by multiplying each value of the random variable by its probability and then summing over all values of x:
    $E[x] = \sum_{x} x P(x) = \sum_{x} x f(x)$
  – The expected value of x is the “balancing point” for the probability mass function of x. That is, it is the arithmetic mean.
  – We can take an expectation of any function of a random variable:
    $E[g(x)] = \sum_{x} g(x) f(x)$
  – This balance point is the value expected for g(x) over all possible repetitions of the experiment involving the random variable x.
  – For a continuous density function f(x), the expected value is given by
    $E(x) = \int_{-\infty}^{\infty} x f(x)\, dx$
Illustration of Expectation
A lottery has two schemes; the first scheme has two outcomes (denoted by 1 and 2) and the second has three (denoted by 1, 2, and 3). It is agreed that the participant in the first scheme gets $1 if the outcome is 1 and $2 if the outcome is 2. The participant in the second scheme gets $3 if the outcome is 1, −$2 if the outcome is 2, and $3 if the outcome is 3. The probabilities p(i, j) of each pair of outcomes are listed as follows:
p(1, 1) = 0.1; p(1, 2) = 0.2; p(1, 3) = 0.3
p(2, 1) = 0.2; p(2, 2) = 0.1; p(2, 3) = 0.1
Help the investor decide which scheme to prefer. [Bryson]
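One consistent reading of the table is that p(i, j) is the joint probability of outcome i in the first scheme and outcome j in the second (the six values sum to 1). Under that assumption, a quick sketch of the comparison:

```python
# Expected winnings under each lottery scheme (sketch; assumes p[i, j] is the
# joint probability of outcome i in scheme 1 and outcome j in scheme 2).
p = {(1, 1): 0.1, (1, 2): 0.2, (1, 3): 0.3,
     (2, 1): 0.2, (2, 2): 0.1, (2, 3): 0.1}

payoff_1 = {1: 1.0, 2: 2.0}           # scheme 1: $1 or $2
payoff_2 = {1: 3.0, 2: -2.0, 3: 3.0}  # scheme 2: $3, -$2, or $3

# Marginalize the joint probabilities, then take expectations.
e1 = sum(payoff_1[i] * prob for (i, j), prob in p.items())
e2 = sum(payoff_2[j] * prob for (i, j), prob in p.items())
print(f"E[scheme 1] = ${e1:.2f}, E[scheme 2] = ${e2:.2f}")
# -> E[scheme 1] = $1.40, E[scheme 2] = $1.50: scheme 2 pays slightly more.
```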
Example
• Let us assume that we have agreed to pay $1 for each dot showing when a pair of dice is thrown. We are interested in knowing how much we would lose on average.

  x     Frequency   Probability Function   Probability Distribution Function
  2     1           P(x=2)  = 1/36         P(x≤2)  = 1/36
  3     2           P(x=3)  = 2/36         P(x≤3)  = 3/36
  4     3           P(x=4)  = 3/36         P(x≤4)  = 6/36
  5     4           P(x=5)  = 4/36         P(x≤5)  = 10/36
  6     5           P(x=6)  = 5/36         P(x≤6)  = 15/36
  7     6           P(x=7)  = 6/36         P(x≤7)  = 21/36
  8     5           P(x=8)  = 5/36         P(x≤8)  = 26/36
  9     4           P(x=9)  = 4/36         P(x≤9)  = 30/36
  10    3           P(x=10) = 3/36         P(x≤10) = 33/36
  11    2           P(x=11) = 2/36         P(x≤11) = 35/36
  12    1           P(x=12) = 1/36         P(x≤12) = 1
  Sum   36          1.00

• Average amount we pay = (($2×1) + ($3×2) + …… + ($12×1))/36 = $7
• E(x) = $2(1/36) + $3(2/36) + ………. + $12(1/36) = $7
Example (continued…)
• Let us assume that we had agreed to pay an amount equal to the square of the sum of the dots showing on a throw of the dice.
  – What would be the average loss this time?
• Would it be ($7)² = $49.00?
• Actually, we are now interested in calculating E[x²]:
  – E[x²] = ($2)²(1/36) + ………. + ($12)²(1/36) = $54.83 ≠ $49
  – This result also emphasizes that (E[x])² ≠ E[x²].
Variance of Random Variable
• The variance of a random variable x is defined as
  $V(x) = \sigma^2 = E[(x - \mu)^2]$
  $V(x) = E[x^2 - 2\mu x + \mu^2] = E[x^2] - 2(E[x])^2 + (E[x])^2 = E[x^2] - (E[x])^2$
  This result is also known as the “parallel axis theorem”.
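A minimal pure-Python check of the dice numbers from the last three slides (nothing beyond the slide data is assumed):

```python
# Verify the dice numbers: E[x], E[x^2], and the parallel-axis identity.
from fractions import Fraction

# Frequencies of the sum x of two dice, from the table above.
freq = {2: 1, 3: 2, 4: 3, 5: 4, 6: 5, 7: 6, 8: 5, 9: 4, 10: 3, 11: 2, 12: 1}
pmf = {x: Fraction(n, 36) for x, n in freq.items()}

e_x = sum(x * p for x, p in pmf.items())      # E[x]   = 7
e_x2 = sum(x**2 * p for x, p in pmf.items())  # E[x^2] = 1974/36
var = e_x2 - e_x**2                           # parallel axis theorem

print(f"E[x]   = {float(e_x)}")      # 7.0
print(f"E[x^2] = {float(e_x2):.2f}") # 54.83, not 49
print(f"V(x)   = {float(var):.2f}")  # 5.83
```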
Expectation Rules
• Rule 1: E[k] = k, where k is a constant.
• Rule 2: E[kx] = kE[x].
• Rule 3: E[x ± y] = E[x] ± E[y].
• Rule 4: If x and y are independent, E[xy] = E[x]E[y].
• Rule 5: V[k] = 0, where k is a constant.
• Rule 6: V[kx] = k²V[x].
Propagation of Moments and Density Function through Linear Models
• y = ax + b
  – Given: μ = E[x] and σ² = V[x]
  – To find: E[y] and V[y]
    E[y] = E[ax] + E[b] = aE[x] + b = aμ + b
    V[y] = V[ax] + V[b] = a²V[x] + 0 = a²σ²
• Let us define
  $z = \frac{x - \mu}{\sigma}$
  Here, a = 1/σ and b = −μ/σ.
  Therefore, E[z] = 0 and V[z] = 1, as the sketch below verifies numerically.
  z is generally known as the “standardized variable”.
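A quick numerical sketch (NumPy is assumed to be available; the lognormal source distribution is an arbitrary choice, to show the result does not depend on normality):

```python
# Standardizing a random variable: z = (x - mu)/sigma has mean 0 and variance 1,
# regardless of the shape of the distribution of x.
import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(mean=1.0, sigma=0.5, size=100_000)  # any non-normal x works

mu, sigma = x.mean(), x.std()
z = (x - mu) / sigma  # the linear model y = ax + b with a = 1/sigma, b = -mu/sigma

print(f"E[z] ~ {z.mean():.4f}")  # ~0
print(f"V[z] ~ {z.var():.4f}")   # ~1
```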
Propagation of Moments and Density Function through Non-linear Models
• If x is a random variable with probability density function p(x) and y = f(x) is a one-to-one transformation that is differentiable for all x, then the probability density function of y is given by
  – p(y) = p(x)|J|⁻¹, for all x given by x = f⁻¹(y),
  – where |J| is the determinant of the Jacobian matrix J.
• Example:
  Let $y = ax^2$ and $p(x) = \frac{1}{\sigma_x \sqrt{2\pi}} \exp(-x^2 / 2\sigma_x^2)$
  NOTE: for each value of y there are two values of x.
  $p(y) = \frac{1}{\sigma_x \sqrt{2\pi a y}} \exp(-y / 2a\sigma_x^2), \quad \forall y \ge 0$
  and p(y) = 0 otherwise.
  We can also show that
  $E(y) = a\sigma_x^2 \quad \text{and} \quad V(y) = 2a^2\sigma_x^4$
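A Monte Carlo sketch to confirm the stated moments (NumPy assumed; a and σx are arbitrary illustration values):

```python
# Monte Carlo check of the y = a*x^2 transformation of a zero-mean Gaussian x.
import numpy as np

a, sigma_x = 2.0, 1.5
rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=sigma_x, size=1_000_000)
y = a * x**2

print(f"E[y]: sample {y.mean():.3f}  vs  a*sigma_x^2     = {a * sigma_x**2:.3f}")
print(f"V[y]: sample {y.var():.3f} vs  2*a^2*sigma_x^4 = {2 * a**2 * sigma_x**4:.3f}")
```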
Random Vectors
• Just an extension of the random variable:
  – A vector random variable X is a function that assigns a vector of real numbers to each outcome in the sample space.
• Joint probability functions:
  – Joint probability distribution function:
    $F(X) = P[\{X_1 \le x_1\} \cap \{X_2 \le x_2\} \cap \cdots \cap \{X_n \le x_n\}]$
  – Joint probability density function:
    $f(x) = \frac{\partial^n F(X)}{\partial X_1 \partial X_2 \cdots \partial X_n}$
• Marginal probability functions: a marginal probability function is obtained by summing or integrating out the variables that are of no interest:
  $f(x) = \sum_{y} p(x, y) \quad \text{or} \quad f(x) = \int_{y=-\infty}^{y=\infty} f(x, y)\, dy$
Multivariate Expectations
• Mean vector:
  $E[x] = [E[x_1]\ \ E[x_2]\ \ \cdots\ \ E[x_n]]$
• The expected value of $g(x_1, x_2, \ldots, x_n)$ is given by
  $E[g(x)] = \sum_{x_n} \sum_{x_{n-1}} \cdots \sum_{x_1} g(x) f(x) \quad \text{or} \quad \int_{x_n} \int_{x_{n-1}} \cdots \int_{x_1} g(x) f(x)\, dx$
• Covariance matrix:
  $\operatorname{cov}[x] = P = E[(x - \mu)(x - \mu)^T] = E[xx^T] - \mu\mu^T$
  where $S = E[xx^T]$ is known as the autocorrelation matrix.
  NOTE:
  $P = \begin{bmatrix} \sigma_1 & 0 & \cdots & 0 \\ 0 & \sigma_2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_n \end{bmatrix} \begin{bmatrix} 1 & \rho_{12} & \cdots & \rho_{1n} \\ \rho_{21} & 1 & \cdots & \rho_{2n} \\ \vdots & & \ddots & \vdots \\ \rho_{n1} & \rho_{n2} & \cdots & 1 \end{bmatrix} \begin{bmatrix} \sigma_1 & 0 & \cdots & 0 \\ 0 & \sigma_2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_n \end{bmatrix}$
  The middle factor R is the correlation matrix.
Covariance Matrix
• The covariance matrix indicates the tendency of each pair of dimensions in a random vector to vary together, i.e. to “co-vary”.
• Properties of the covariance matrix:
  – The covariance matrix is square.
  – The covariance matrix is positive semidefinite, i.e. xᵀPx ≥ 0 (and positive definite, xᵀPx > 0, in the nondegenerate case).
  – The covariance matrix is symmetric, i.e. P = Pᵀ.
  – If xᵢ and xⱼ tend to increase together, then Pᵢⱼ > 0.
  – If xᵢ and xⱼ are uncorrelated, then Pᵢⱼ = 0.
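A small NumPy sketch of these properties (the 2-D distribution and its parameters are arbitrary illustration values):

```python
# Estimate a covariance matrix from correlated samples and check its properties.
import numpy as np

rng = np.random.default_rng(2)
true_P = np.array([[2.0, 0.8],
                   [0.8, 1.0]])  # symmetric, positive definite
x = rng.multivariate_normal(mean=[0.0, 0.0], cov=true_P, size=50_000)

P = np.cov(x, rowvar=False)  # sample covariance matrix
print(P)                                      # close to true_P
print("symmetric:", np.allclose(P, P.T))      # True
print("eigenvalues:", np.linalg.eigvalsh(P))  # all > 0 -> positive definite
print("P[0,1] > 0:", P[0, 1] > 0)             # x1 and x2 tend to increase together
```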
Probability Distribution Functions
• There are many situations in statistics that involve the same type of probability function.
  – It is not necessary to derive these results over and over again in each special case with different numbers.
• We can avoid this tedious process by recognizing the similarities between certain types of apparently unique experiments, and then merely matching a given case to a general formula.
• Examples:
  – Toss a coin: Head or Tail
  – Take an exam: Pass or Fail
  – Analyze the stock market: Up or Down
• So all the above processes can be characterized by only two events, “success” and “failure”.
Binomial Distribution
• The binomial distribution plays an important role in experiments involving repeated independent trials, each with just two possible outcomes.
  – Independent trials means the result of one trial cannot influence the results of other trials.
  – Repeated trials means the probability of “success” or “failure” does not change from trial to trial.
• In the binomial distribution, we are interested in the probability of receiving a certain number of successes.
• Let us assume that we have n independent trials, each trial having the same probability of success, say p.
  – Probability of failure: q = 1 − p.
• Say we are interested in determining the probability of x successes in n trials.
  – Find the probability of any one occurrence of this type and then multiply this value by the number of possible occurrences.
Binomial Distribution
• One possible occurrence is:
  $\underbrace{SS \ldots S}_{x \text{ times}}\ \underbrace{FF \ldots F}_{n-x \text{ times}}$
• The joint probability of this particular sequence is $p^x q^{n-x}$.
  NOTE: $p^x q^{n-x}$ represents the probability not only of our one arrangement but of any possible arrangement of x successes and n − x failures.
• How many arrangements of x successes and n − x failures are possible?
  $^nC_x = \frac{n!}{x!(n-x)!}$
  $P(x \text{ successes in } n \text{ trials}) = {}^nC_x\, p^x q^{n-x}$
• The binomial distribution is discrete in nature, as x and n can take only discrete values.
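A short sketch of the formula in pure Python (math.comb computes the nCx above):

```python
# Binomial probability mass function, straight from the slide's formula.
from math import comb

def binom_pmf(x: int, n: int, p: float) -> float:
    """P(x successes in n trials) = nCx * p^x * q^(n-x)."""
    q = 1.0 - p
    return comb(n, x) * p**x * q**(n - x)

# Example: probability of exactly 3 heads in 10 fair coin tosses.
print(f"{binom_pmf(3, 10, 0.5):.4f}")                          # 0.1172
print(f"{sum(binom_pmf(x, 10, 0.5) for x in range(11)):.4f}")  # pmf sums to 1
```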
Mean and Variance of Binomial Distribution
• Mean:
  $E[x_{\text{binomial}}] = \mu = np$
• Variance:
  $V[x_{\text{binomial}}] = \sigma^2 = E[x^2] - (E[x])^2 = npq$
• Example:
  A football executive claims that 90% of viewers watch football over baseball on a concurrent telecast.
  An advertising agency claims that the viewers for each are 50%.
  Who is right?
  We did a survey of 25 households and found that in 10 of them the games were being viewed, with the following breakdown:

  Viewing Football: 7
  Viewing Baseball: 3

  Which of the two reports is correct? (One way to compare the claims is sketched below.)
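The slide leaves the comparison open; one simple approach (an assumption of this sketch, not prescribed by the slide) is to compute the binomial likelihood of observing 7 of 10 viewers watching football under each claim:

```python
# Likelihood of the survey data (7 of 10 watching football) under each claim.
from math import comb

def binom_pmf(x, n, p):
    return comb(n, x) * p**x * (1 - p)**(n - x)

like_exec = binom_pmf(7, 10, 0.90)    # executive's claim: p = 0.9
like_agency = binom_pmf(7, 10, 0.50)  # agency's claim:    p = 0.5

print(f"L(p=0.9) = {like_exec:.4f}")    # ~0.0574
print(f"L(p=0.5) = {like_agency:.4f}")  # ~0.1172
# The data are about twice as likely under the agency's 50% claim.
```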
Hypergeometric Distribution
• The binomial distribution is important in sampling with replacement, but many practical problems involve sampling without replacement.
  – In that case the hypergeometric distribution gives the precise probability:
  $f(x) = \frac{{}^{M}C_x\ {}^{N-M}C_{n-x}}{{}^{N}C_n}$
  $\mu = \frac{nM}{N}$
  $\sigma^2 = \frac{nM(N - M)(N - n)}{N^2(N - 1)}$
  – Example: We want to pick two apples from a box containing 15 apples, 5 of which are rotten. Find the probability function for the number of rotten apples in our sample.

  Without replacement:
  $f(x) = \frac{{}^{5}C_x\ {}^{10}C_{2-x}}{{}^{15}C_2}$

  With replacement:
  $f(x) = {}^{2}C_x \left(\frac{5}{15}\right)^x \left(\frac{10}{15}\right)^{2-x}$
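A sketch of the apple example, computing both probability functions directly from the formulas (pure Python):

```python
# Rotten-apple example: sampling 2 apples from 15 (5 rotten), with and
# without replacement.
from math import comb

N, M, n = 15, 5, 2  # population size, rotten apples, sample size

for x in range(n + 1):
    without = comb(M, x) * comb(N - M, n - x) / comb(N, n)     # hypergeometric
    with_r = comb(n, x) * (M / N)**x * ((N - M) / N)**(n - x)  # binomial
    print(f"x={x}: without replacement {without:.4f}, with replacement {with_r:.4f}")
# x=0: 0.4286 vs 0.4444;  x=1: 0.4762 vs 0.4444;  x=2: 0.0952 vs 0.1111
```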
Poisson Distribution
• The Poisson distribution is one of the most important discrete distributions.
  – It was introduced by the French mathematician S.D. Poisson in 1837 and was famously used to describe the probability of deaths in the Prussian army from the kick of a horse, as well as the number of suicides among women and children.
  – These days it is successfully used in problems involving the number of arrivals/requests for service per unit time at any service facility.
• Assumptions:
  – It must be possible to divide the time interval being used into a large number of small sub-intervals such that the probability of an occurrence in each sub-interval is very small.
  – The probability of an occurrence in each of these sub-intervals must remain constant throughout the time period being considered.
  – The probability of two or more occurrences in each sub-interval must be small enough to be ignored.
  – The occurrences in one time interval are independent of occurrences in any other time interval.
Poisson Distribution
• The probability mass function for the Poisson distribution is given by
  $f(x) = \frac{\lambda^x e^{-\lambda}}{x!}$
• The Poisson distribution has mean μ = λ and variance σ² = λ.
• It can be shown that the Poisson distribution is obtained as a limiting case of the binomial distribution when p → 0 and n → ∞ (with np held fixed at λ).
• Example:
  It is given that on average 60 customers visit a bank between 10 am and 11 am daily, so λ = 1 customer per minute. We may be interested in the probability of exactly 2, or of at most 2, customers visiting the bank in a given one-minute interval:
  $P(2 \text{ arrivals}) = \frac{e^{-1}\, 1^2}{2!} = \frac{1}{2e}$
  $P(\le 2 \text{ arrivals}) = \frac{1}{e} + \frac{1}{e} + \frac{1}{2e} = \frac{5}{2e}$
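A sketch checking the bank example (pure Python):

```python
# Poisson arrivals at the bank: lambda = 1 customer per minute.
from math import exp, factorial

def poisson_pmf(x: int, lam: float) -> float:
    return lam**x * exp(-lam) / factorial(x)

lam = 60 / 60  # 60 customers per 60 minutes -> 1 per minute
p2 = poisson_pmf(2, lam)
p_le2 = sum(poisson_pmf(x, lam) for x in range(3))

print(f"P(2 arrivals)   = {p2:.4f}")     # 1/(2e) ~ 0.1839
print(f"P(<=2 arrivals) = {p_le2:.4f}")  # 5/(2e) ~ 0.9197
```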
Gaussian or Normal Distribution
• The normal distribution is the most widely known and used distribution in the field of statistics.
  – Many natural phenomena can be approximated by the normal distribution.
• Central Limit Theorem:
  – The central limit theorem states that, given a distribution with mean μ and variance σ², the sampling distribution of the mean approaches a normal distribution with mean μ and variance σ²/N as N, the sample size, increases.
• Normal density function:
  $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad -\infty < x < \infty$
  [Figure: bell curve peaking at 0.399/σ at x = μ, with ticks at μ − 2σ, μ − σ, μ, μ + σ, μ + 2σ]
Normal Distribution
• Multivariate Gaussian density function:
  $f(X) = \frac{1}{(2\pi)^{n/2}\, |P|^{1/2}}\, e^{-\frac{1}{2}(X - \mu)^T P^{-1} (X - \mu)}$
• What is the probability that
  $(X - \mu)^T P^{-1} (X - \mu) \le R^2\,?$
  Let $Y = A(X - \mu)$ and $z_i = \frac{Y_i}{\sigma_i}$, where A diagonalizes the covariance,
  $\Lambda = \operatorname{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_n^2) = A P A^T$,
  so that the condition becomes
  $z_1^2 + z_2^2 + \cdots + z_n^2 \le R^2$
  and
  $P\left[\sum z_i^2 \le R^2\right] = \int_V f(z)\, dV$
• Curse of dimensionality: the probability enclosed within a fixed radius R falls as the dimension n grows.

  n\R     1       2       3
  1       0.683   0.955   0.997
  2       0.394   0.865   0.989
  3       0.200   0.739   0.971
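Since the zᵢ are independent standard normal variables, Σzᵢ² follows a chi-square distribution with n degrees of freedom, so the table above can be reproduced numerically (a sketch assuming SciPy is available):

```python
# Reproduce the n\R table: P(z_1^2 + ... + z_n^2 <= R^2) for standard normal z_i
# is the chi-square CDF with n degrees of freedom evaluated at R^2.
from scipy.stats import chi2

print("n\\R    1      2      3")
for n in (1, 2, 3):
    row = [chi2.cdf(r**2, df=n) for r in (1, 2, 3)]
    print(f"{n}    " + "  ".join(f"{p:.3f}" for p in row))
# Matches the slide's table up to rounding.
```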
Summary of Some Probability Mass/Density Functions

Discrete
• Binomial — parameters: 0 ≤ p ≤ 1 and n = 0, 1, 2, …; characteristics: skewed unless p = 0.5; probability function: ${}^nC_x\, p^x q^{n-x}$; mean: np; variance: npq.
• Hypergeometric — parameters: M = 0…N; N = 0, 1, 2, …; n = 0…N; characteristics: skewed; probability function: $\frac{{}^MC_x\ {}^{N-M}C_{n-x}}{{}^NC_n}$; mean: $\frac{nM}{N}$; variance: $\frac{nM(N-M)(N-n)}{N^2(N-1)}$.
• Poisson — parameters: λ > 0; characteristics: skewed positively; probability function: $\frac{\lambda^x e^{-\lambda}}{x!}$; mean: λ; variance: λ.

Continuous
• Normal — parameters: −∞ < μ < ∞ and σ > 0; characteristics: symmetric about μ; probability function: $\frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$; mean: μ; variance: σ².
• Standardized Normal — characteristics: symmetric about zero; probability function: $\frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}$; mean: 0; variance: 1.
• Exponential — parameters: λ > 0; characteristics: skewed positively; probability function: $\lambda e^{-\lambda x}$; mean: 1/λ; variance: 1/λ².

A distribution is skewed if it has most of its values either to the right or to the left of its mean. A measure of this variability in density is given by the third central moment of the distribution, E[(x − μ)³], called the “skewness”.