Random variable distributions

Probability distribution functions
• Normal distribution
• Lognormal distribution
• Mean, median and mode
• Tails
• Extreme value distributions
Normal (Gaussian) distribution
• Probability density function (PDF)
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right]$$
• What does the figure tell about the cumulative distribution function (CDF)?
$$F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt$$
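The relation between the PDF and the CDF can be checked numerically. The following is a minimal sketch using only the Python standard library. The function names and the choice of starting the integration at 10 standard deviations below the mean are illustrative assumptions, not part of the slides.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for X ~ N(mu, sigma^2), written with the error function."""
    z = (x - mu) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def cdf_by_integration(x, mu=0.0, sigma=1.0, steps=20000):
    """Trapezoidal integral of the PDF from mu - 10*sigma up to x.

    The lower limit stands in for minus infinity (an assumption that is
    harmless because the density is negligible that far out).
    """
    a = mu - 10.0 * sigma
    h = (x - a) / steps
    total = 0.5 * (normal_pdf(a, mu, sigma) + normal_pdf(x, mu, sigma))
    for i in range(1, steps):
        total += normal_pdf(a + i * h, mu, sigma)
    return total * h

if __name__ == "__main__":
    for x in (-1.0, 0.0, 1.0, 2.0):
        print(x, normal_pdf(x), normal_cdf(x), cdf_by_integration(x))
```

The two CDF columns should agree closely, which illustrates that the CDF is the running area under the PDF.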
More on the normal distribution
• The normal distribution is denoted $N(\mu, \sigma^2)$, with the
square giving the variance.
• If X is normal, Y=aX+b is also normal. What would be
the mean and standard deviation of Y?
• Similarly, if X and Y are normal variables, any linear
combination, aX+bY is also normal.
• A function of a normal random variable can often be treated as
approximately normal by using a linear Taylor expansion.
• Example: $X = N(10, 0.5^2)$ and $Y = X^2$. Then $X^2 \approx 100 + 20(X - 10)$, so $Y$ is approximately $N(100, 10^2)$.
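The linear Taylor approximation in the example can be compared with a simulation. This is a sketch under stated assumptions: the expansion is taken about the mean of X, and the function names, sample size and seed are hypothetical choices made for illustration.

```python
import random
import statistics

def linearized_square(mu, sigma):
    """First-order Taylor approximation of Y = X^2 about the mean of X.

    X^2 ~ mu^2 + 2*mu*(X - mu), so Y is approximately normal with
    mean mu^2 and standard deviation |2*mu|*sigma.
    """
    return mu * mu, abs(2.0 * mu) * sigma

def simulated_square(mu, sigma, n=200_000, seed=0):
    """Monte Carlo estimate of the mean and standard deviation of X^2."""
    rng = random.Random(seed)
    ys = [rng.gauss(mu, sigma) ** 2 for _ in range(n)]
    return statistics.fmean(ys), statistics.stdev(ys)

if __name__ == "__main__":
    print("linearized:", linearized_square(10.0, 0.5))
    print("simulated: ", simulated_square(10.0, 0.5))
```

The simulated mean sits slightly above the linearized one because the exact mean of X² is μ² + σ², a second-order term that the linear expansion drops.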
Estimating mean and standard deviation
• Given a sample from a normally distributed variable, the
sample mean is the best linear unbiased estimator
(BLUE) of the true mean.
• For the variance the equation gives the best unbiased
estimator, but the square root is not an unbiased
estimate of the standard deviation
$$\hat{\sigma}^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2, \qquad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
• For example, for a sample of 5 from a standard normal
distribution, the standard deviation will be estimated on
average as 0.94 (with standard deviation of 0.34)
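The figures quoted for a sample of 5 can be reproduced in two ways. The sketch below uses the known closed form for the expected sample standard deviation of normal data (often written c4) and a small simulation. The function names, trial count and seed are illustrative assumptions.

```python
import math
import random

def c4(n):
    """E[s] / sigma for the sample standard deviation s of n normal values."""
    return math.sqrt(2.0 / (n - 1)) * math.gamma(n / 2.0) / math.gamma((n - 1) / 2.0)

def sample_std(xs):
    """Square root of the unbiased variance estimate (divides by n - 1)."""
    n = len(xs)
    mean = sum(xs) / n
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))

def simulate_sample_std(n, trials=50_000, seed=1):
    """Mean and standard deviation of s over many standard normal samples."""
    rng = random.Random(seed)
    stds = [sample_std([rng.gauss(0.0, 1.0) for _ in range(n)]) for _ in range(trials)]
    mean = sum(stds) / trials
    spread = math.sqrt(sum((s - mean) ** 2 for s in stds) / (trials - 1))
    return mean, spread

if __name__ == "__main__":
    print("closed form E[s]:", c4(5))
    print("closed form std of s:", math.sqrt(1.0 - c4(5) ** 2))
    print("simulated (mean, std):", simulate_sample_std(5))
```

Since E[s²] = σ² exactly, the standard deviation of s follows as σ·sqrt(1 − c4²), which gives the 0.34 quoted above.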
Lognormal distribution
• If ln(X) has a normal distribution, X has a
lognormal distribution. That is, if X is normally
distributed, exp(X) is lognormally distributed.
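• Notation: $\ln N(\mu, \sigma^2)$
• PDF
$$f(x) = \frac{1}{x\sigma\sqrt{2\pi}} \exp\left[-\frac{(\ln x - \mu)^2}{2\sigma^2}\right]$$
• Mean and variance
$$\mu_X = \exp\left(\mu + \sigma^2/2\right), \qquad \sigma_X^2 = \mathrm{Var}(X) = \left(e^{\sigma^2} - 1\right) e^{2\mu + \sigma^2}$$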
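The mean and variance formulas can be checked against samples drawn with the standard library's lognormal generator. This is a minimal sketch; the parameter values (μ = 0, σ = 0.5), the sample size and the seed are arbitrary illustrative choices.

```python
import math
import random

def lognormal_mean_std(mu, sigma):
    """Mean and standard deviation of X when ln(X) ~ N(mu, sigma^2)."""
    mean = math.exp(mu + sigma ** 2 / 2.0)
    var = (math.exp(sigma ** 2) - 1.0) * math.exp(2.0 * mu + sigma ** 2)
    return mean, math.sqrt(var)

def simulated_mean_std(mu, sigma, n=200_000, seed=4):
    """Sample mean and standard deviation of n lognormal draws."""
    rng = random.Random(seed)
    xs = [rng.lognormvariate(mu, sigma) for _ in range(n)]
    mean = sum(xs) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    return mean, std

if __name__ == "__main__":
    print("formula:  ", lognormal_mean_std(0.0, 0.5))
    print("simulated:", simulated_mean_std(0.0, 0.5))
```

Note that `random.lognormvariate` takes the standard deviation σ of ln(X), not the variance σ².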
Question
• Suppose the income of a family of four in the
United States follows a lognormal distribution
with $\mu = \log(20{,}000)$ and $\sigma^2 = 1.0$ ($\mu_X = 32{,}974$,
$\sigma_X = 43{,}224$). See figure.
• What is your estimate of the mode (that is, the most
common income)? The median?
Mean, mode and median
• Mode (highest point) = $\exp(\mu - \sigma^2)$
• Median (50% of samples) = $e^{\mu}$
• Figure for $\mu = 0$.
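Applied to the income question, the formulas above give concrete numbers. The sketch below simply evaluates them for μ = log(20,000) and σ² = 1; the function name is a hypothetical helper introduced here for illustration.

```python
import math

def lognormal_mode_median_mean(mu, sigma2):
    """Mode, median and mean of X when ln(X) ~ N(mu, sigma2)."""
    mode = math.exp(mu - sigma2)
    median = math.exp(mu)
    mean = math.exp(mu + sigma2 / 2.0)
    return mode, median, mean

if __name__ == "__main__":
    mode, median, mean = lognormal_mode_median_mean(math.log(20000.0), 1.0)
    print(f"mode   = {mode:.0f}")
    print(f"median = {median:.0f}")
    print(f"mean   = {mean:.0f}")
```

For any lognormal distribution the ordering is mode < median < mean, which reflects the long right tail.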
Light and heavy tails
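• The normal distribution has a light tail; 4.5 sigma is
equivalent to a failure or defect probability of 3.4e-6.
• The lognormal can have a heavy tail: the probability of exceeding the mean
by 4.5 standard deviations is 7.5e-4 for $\mu = 0$, $\sigma = 0.25$ and
0.0075 for $\mu = 0$, $\sigma = 1$.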
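The tail probabilities can be recomputed directly. The sketch below assumes that the lognormal figures refer to the probability of exceeding the mean by 4.5 standard deviations of the lognormal variable itself. That reading is an assumption, although it does reproduce the quoted values. The function names are illustrative.

```python
import math

def normal_upper_tail(z):
    """P(Z > z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def lognormal_tail_beyond(mu, sigma, k):
    """P(X > mean + k*std) for X with ln(X) ~ N(mu, sigma^2)."""
    mean = math.exp(mu + sigma ** 2 / 2.0)
    std = math.sqrt((math.exp(sigma ** 2) - 1.0) * math.exp(2.0 * mu + sigma ** 2))
    threshold = mean + k * std
    # X > threshold is the same event as ln(X) > ln(threshold).
    return normal_upper_tail((math.log(threshold) - mu) / sigma)

if __name__ == "__main__":
    print("normal, 4.5 sigma:       ", normal_upper_tail(4.5))
    print("lognormal, sigma = 0.25: ", lognormal_tail_beyond(0.0, 0.25, 4.5))
    print("lognormal, sigma = 1:    ", lognormal_tail_beyond(0.0, 1.0, 4.5))
```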
Fitting distribution to data
• Usually fit the CDF to minimize the maximum distance
(Kolmogorov-Smirnov test).
• Generated 20 points from $N(3, 1^2)$.
• Normal fit: $N(3.48, 0.93^2)$
• Lognormal fit: lnN(1.24,0.26). Almost the same mean and
standard deviation.
[Figure: CDF versus x for the experimental data, the lognormal fit and the normal fit.]
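The Kolmogorov-Smirnov distance itself is simple to compute: it is the largest gap between the empirical CDF of the data and the candidate CDF. The sketch below generates 20 points from N(3, 1²) and measures the distance to a normal fit. As a simplifying assumption, the fit here uses the sample mean and standard deviation rather than a minimization of the KS distance, and the seed and function names are illustrative.

```python
import math
import random

def normal_cdf(x, mu, sigma):
    """P(X <= x) for X ~ N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ks_distance(data, cdf):
    """Largest gap between the empirical CDF of data and a model CDF."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, abs(f - i / n), abs((i + 1) / n - f))
    return d

def fit_normal_by_moments(data):
    """Sample mean and (n - 1)-normalized standard deviation."""
    n = len(data)
    mean = sum(data) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    return mean, std

if __name__ == "__main__":
    rng = random.Random(0)
    data = [rng.gauss(3.0, 1.0) for _ in range(20)]
    mean, std = fit_normal_by_moments(data)
    d = ks_distance(data, lambda x: normal_cdf(x, mean, std))
    print(f"fit N({mean:.2f}, {std:.2f}^2), KS distance = {d:.3f}")
```

Passing a different `cdf` callable (for example a lognormal CDF) gives the distance for another candidate distribution on the same data.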
Extreme value distributions
• No matter what distribution you sample from, the
mean of the sample tends to be normally distributed as
sample size increases (what mean and standard
deviation?)
• Similarly, the minimum (or maximum) of a sample tends
to follow one of a small family of other distributions.
• Even though there is an infinite number of distributions,
there are only three extreme value distributions.
– Type I (Gumbel) derived from normal.
– Type II (Frechet) e.g. maximum daily rainfall
– Type III (Weibull) weakest link failure
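The first bullet can be illustrated with a clearly non-normal parent distribution. The sketch below averages n draws from an exponential distribution with rate 1 (mean 1, standard deviation 1) and reports the mean and standard deviation of those averages, which should be close to 1 and 1/sqrt(n). The choice of the exponential parent, the sample size, the trial count and the seed are all illustrative assumptions.

```python
import math
import random

def sample_mean_stats(n, trials=20_000, seed=2):
    """Mean and standard deviation of the average of n Exponential(1) draws."""
    rng = random.Random(seed)
    means = [sum(rng.expovariate(1.0) for _ in range(n)) / n for _ in range(trials)]
    m = sum(means) / trials
    s = math.sqrt(sum((x - m) ** 2 for x in means) / (trials - 1))
    return m, s

if __name__ == "__main__":
    n = 25
    m, s = sample_mean_stats(n)
    print(f"mean of sample means = {m:.3f} (parent mean 1)")
    print(f"std of sample means  = {s:.3f} (1/sqrt(n) = {1.0 / math.sqrt(n):.3f})")
```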
Maximum of normal samples
With a normal distribution, the maximum of a sample is more narrowly distributed
than the original distribution.
• Max of 10 standard normal samples: 1.54 mean, 0.59 standard deviation.
• Max of 100 standard normal samples: 2.50 mean, 0.43 standard deviation.
[Figure: histograms of the maximum of 10 and of 100 standard normal samples.]
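The means and standard deviations quoted for the sample maxima can be reproduced by simulation. This is a minimal sketch; the number of trials and the seed are arbitrary assumptions, so the printed values will match the slide only to within sampling noise.

```python
import math
import random

def max_of_normal_stats(n, trials=20_000, seed=3):
    """Mean and standard deviation of the maximum of n standard normal draws."""
    rng = random.Random(seed)
    maxima = [max(rng.gauss(0.0, 1.0) for _ in range(n)) for _ in range(trials)]
    m = sum(maxima) / trials
    s = math.sqrt(sum((x - m) ** 2 for x in maxima) / (trials - 1))
    return m, s

if __name__ == "__main__":
    for n in (10, 100):
        m, s = max_of_normal_stats(n)
        print(f"max of {n:3d}: mean = {m:.2f}, std = {s:.2f}")
```

Both standard deviations are below 1, the standard deviation of the parent distribution, and the spread shrinks further as the sample size grows.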
Gumbel distribution
• PDF and CDF, with $z = (x - \mu)/\beta$:
$$\text{PDF} = \frac{1}{\beta}\exp\left[-\left(z + e^{-z}\right)\right], \qquad \text{CDF} = \exp\left(-e^{-z}\right)$$
• Mean, median, mode and variance
$$\text{Mean} = \mu + \gamma\beta, \qquad \text{median} = \mu - \beta\ln(\ln 2), \qquad \text{mode} = \mu, \qquad \text{Variance} = \frac{\pi^2\beta^2}{6}$$
Euler-Mascheroni constant $\gamma \approx 0.5772$
[Figure: CDFs of the negated max-of-10 and max-of-100 data, each with a fitted EV1 (Gumbel) distribution.]
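The Gumbel moments can be checked by sampling through the inverse CDF. The sketch below assumes the maximum-type Gumbel form with location μ and scale β, so that CDF = exp(−e^(−z)) with z = (x − μ)/β. The symbol names, parameter values, sample size and seed are assumptions made for illustration.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329

def gumbel_pdf(x, mu, beta):
    """Density of the maximum-type Gumbel distribution."""
    z = (x - mu) / beta
    return math.exp(-(z + math.exp(-z))) / beta

def gumbel_cdf(x, mu, beta):
    """CDF exp(-exp(-z)) with z = (x - mu) / beta."""
    return math.exp(-math.exp(-(x - mu) / beta))

def gumbel_sample(rng, mu, beta):
    """Inverse-CDF sampling: solve u = CDF(x) for x."""
    u = rng.random()
    while u <= 0.0:  # avoid log(0); random() never returns 1.0
        u = rng.random()
    return mu - beta * math.log(-math.log(u))

def gumbel_theory(mu, beta):
    """Mean, median, mode and variance from the closed-form expressions."""
    mean = mu + EULER_GAMMA * beta
    median = mu - beta * math.log(math.log(2.0))
    variance = math.pi ** 2 * beta ** 2 / 6.0
    return mean, median, mu, variance

if __name__ == "__main__":
    mu, beta = 1.5, 0.5
    rng = random.Random(5)
    xs = [gumbel_sample(rng, mu, beta) for _ in range(100_000)]
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    print("theory (mean, median, mode, variance):", gumbel_theory(mu, beta))
    print("sample mean, variance:", m, v)
```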
Weibull distribution
• Probability distribution
$$f(x; \lambda, k) = \frac{k}{\lambda}\left(\frac{x}{\lambda}\right)^{k-1} e^{-(x/\lambda)^k}, \qquad x \ge 0,\ k > 0,\ \lambda > 0$$
• Its logarithm has a Gumbel distribution.
• Used to describe the distribution of strength or fatigue life in brittle materials.
• If it describes time to failure, then
– k<1 indicates that the failure rate decreases with time,
– k=1 indicates a constant rate,
– k>1 indicates an increasing rate.
• Can add a 3rd parameter by replacing x by x-c.
[Figure: CDF of the logarithm of Weibull data ("log weibull") with an EV1 fit.]
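The statement about failure rates can be made concrete through the hazard rate, defined as the PDF divided by the survival probability 1 − CDF. The sketch below evaluates it for three shape values with scale λ = 1. The symbol λ for the scale parameter and the function names are assumptions made for illustration.

```python
import math

def weibull_pdf(x, lam, k):
    """Weibull density with scale lam and shape k, for x > 0."""
    return (k / lam) * (x / lam) ** (k - 1) * math.exp(-((x / lam) ** k))

def weibull_survival(x, lam, k):
    """P(X > x) = exp(-(x/lam)^k)."""
    return math.exp(-((x / lam) ** k))

def weibull_hazard(x, lam, k):
    """Failure rate at x: density divided by the survival probability."""
    return weibull_pdf(x, lam, k) / weibull_survival(x, lam, k)

if __name__ == "__main__":
    times = (0.5, 1.0, 2.0)
    for k in (0.5, 1.0, 2.0):
        rates = [weibull_hazard(t, 1.0, k) for t in times]
        print(f"k = {k}: hazard at t = {times} is", [round(r, 3) for r in rates])
```

The ratio simplifies to (k/λ)(x/λ)^(k−1), which falls with x for k<1, stays constant for k=1 and rises for k>1.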
Exercises
1. Estimate how much rain Gainesville will have in 2014, as well as the aleatory
and epistemic uncertainty in your estimate.
2. Find how many samples of normally distributed numbers you need in order to
estimate the mean with an error that is less than 5% of the true standard
deviation 90% of the time. Use the fact that the mean of a sample of a normal
variable has the same mean and a standard deviation that is reduced by the
square root of the number of samples.
3. Both the lognormal and Weibull distributions are used to model strength. Fit 100
data points generated from a standard lognormal distribution by both lognormal and
Weibull distributions. Repeat with 5 randomly generated samples. In each case
measure the distance using the KS distance, and translate the result to a
sentence of the following format: The maximum difference between the two
CDFs is at x=2, where the true probability of x<2 is 60%, the probability from the
experimental CDF is 61%, the probability from the lognormal fit is 62% and the
probability from the Weibull fit is 64% (these numbers are invented for the
purpose of illustrating the format).
4. Generate a histogram of word lengths in this assignment, including hyphens and
the math (e.g., x=2 is a 3-letter word), but not punctuation marks. Select an
appropriate number of bins for the histogram and explain your selection. Then
fit the distribution of word lengths with five standard distributions including
normal, lognormal, and Weibull using the K-S criterion. What distribution fits
best? Compare the graphs of the CDFs.