Task: Monte Carlo Simulation
Monte Carlo simulation is the process of using simulated data to check a statistical
method. The advantage of using simulated data is that we know the true values of the
population parameters. We seldom know the population parameters for real data (do you
know the average height of US men?).
The goal of this simulation is to show that (i) the sample mean is a random variable, whose
distribution is called the sampling distribution; and (ii) the sample mean is a consistent estimator.
The Stata code used for this simulation is below:
clear
* fix the seed so that the simulated data are reproducible
set seed 1
set obs 1000
* mu is the population mean, which we choose ourselves
sca mu = 5
* y equals mu plus a standard normal draw
gen y = mu + rnormal()
histogram y
dis "try four small samples, each having 10 observations"
sum y in 1/10
sum y in 11/20
sum y in 21/30
sum y in 31/40
dis "try four big samples, each having 250 observations"
sum y in 1/250
sum y in 251/500
sum y in 501/750
sum y in 751/1000
dis "try the whole sample using 1000 observations"
sum y
Discuss
1. What is the population? How large is our sample?
2. µ_y = ( ), σ_y² = ( ). Why do we know them?
3. The sample estimates using the first 10 and the second 10 observations are different. Is
that OK? Why? What is the implication?
4. Using the whole sample, ȳ = ( ), s_y = ( ), s_y² = ( ), where s_y and s_y² denote the
sample standard deviation and sample variance, respectively.
5. Which sample mean is closer to µ_y, the one that uses 10 observations or the one that
uses 1000 observations? Is this result expected?
6. Find ỹ = (y_1 + y_2)/2 using this simulation. Does it get closer to µ_y as n rises? Why?
7. How can we obtain the sampling distribution of ȳ?
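One possible way to approach the last question is to repeat the sampling many times and keep the sample mean from each repetition. The sketch below is only an illustration: the program name simmean, the sample size of 10, and the 1000 replications are assumptions, not part of the original code.

```stata
* Sketch: approximate the sampling distribution of the sample mean.
* simmean, n = 10, and 1000 replications are illustrative choices.
clear all
set seed 1

program define simmean, rclass
    drop _all
    * one sample of 10 observations from the same population, N(5, 1)
    set obs 10
    generate y = 5 + rnormal()
    quietly summarize y
    * hand the sample mean back to simulate
    return scalar ybar = r(mean)
end

* draw 1000 independent samples and keep each sample mean
simulate ybar = r(ybar), reps(1000) nodots: simmean

* the mean of ybar should be near 5 and its sd near 1/sqrt(10)
summarize ybar
histogram ybar
```

The histogram of ybar is an empirical picture of the sampling distribution. Changing the number in set obs shows how the spread of ybar shrinks as the sample size rises.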
Task: normal distribution
Goal: understand the properties of the normal distribution
Reading: appendix B.5 of the textbook
Key points:
1. Basically there are two types of random variables: discrete and continuous. A discrete
random variable can take only a finite (or countably infinite) number of values. For example,
y = 0 if a newborn baby is a girl and y = 1 if a boy is a discrete random variable (it follows
the Bernoulli distribution).
2. A normal random variable is a continuous random variable that can take (in theory) any
value on the real line.
3. The normal distribution is characterized by two parameters, the mean µ and the
variance σ². So if you know these two parameters, you know everything about the normal
distribution.
4. The (probability) density function (PDF) of the normal distribution is
$$ f(y) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(y-\mu)^2}{2\sigma^2}}, \qquad -\infty < y < \infty \tag{1} $$
Symbolically we write
$$ y \sim N(\mu, \sigma^2) \tag{2} $$
Remarks
(a) The density function (1) is symmetric and bell-shaped.
(b) The center of the density function is the mean µ.
(c) The dispersion (spread) of the density function is controlled by σ². The distribution becomes “wider” when σ² rises.
(d) The standard deviation σ is the square root of the variance σ². It also measures the
dispersion.
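The remarks above can be checked visually in Stata. The following is a minimal sketch, assuming illustrative parameter values (mean 0, standard deviations 1 and 2) that are not from the handout. Note that the third argument of normalden() is the standard deviation σ, not the variance.

```stata
* Sketch: compare normal densities with the same mean and different spreads.
* The values 0, 1, and 2 are illustrative, not from the handout.

* height of the N(0,1) density at its center, two ways
display normalden(0, 0, 1)
display 1/sqrt(2*_pi)

* overlay the densities for sigma = 1 and sigma = 2
twoway (function y = normalden(x, 0, 1), range(-6 6)) ///
       (function y = normalden(x, 0, 2), range(-6 6)), ///
       legend(order(1 "sigma = 1" 2 "sigma = 2")) ytitle("f(y)") xtitle("y")
```

Both curves are symmetric around 0, and the curve with the larger σ is lower and wider.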
5. A linear transformation of a normal random variable also follows a normal distribution.
This is a special property of the normal distribution. Consider a particular linear transformation, subtracting the mean and then dividing by the standard deviation, called
standardization:
$$ z_y \equiv \frac{y - \mu}{\sigma} \tag{3} $$
Please show
$$ E(z_y) = \underline{\qquad} \tag{4} $$
$$ \mathrm{var}(z_y) = \underline{\qquad} \tag{5} $$
$$ z_y \sim \underline{\qquad} \tag{6} $$
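Before working through the algebra, it may help to look at standardization numerically. This is a rough sketch, assuming illustrative values µ = 5 and σ = 2 and a simulated sample of 1000 observations. It is a sanity check, not a proof.

```stata
* Sketch: standardize a simulated normal variable and inspect the result.
* mu = 5, sigma = 2, and n = 1000 are illustrative assumptions.
clear
set seed 1
set obs 1000
scalar mu = 5
scalar sigma = 2

* rnormal(m, s) takes the mean and the standard deviation
generate y = rnormal(scalar(mu), scalar(sigma))

* standardization as in equation (3)
generate zy = (y - scalar(mu))/scalar(sigma)

* compare the sample mean and sd of y and zy
summarize y zy
histogram zy
```

The sample mean and standard deviation of zy give a hint about what (4) and (5) should be.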
6. In practice, standardization is commonly used to make a variable unit-free.
7. N(0, 1) is called the standard normal distribution. Usually we use the letter Z to represent
the standard normal random variable:
Z ∼ N (0, 1).
The letter z is the first letter of z-score, another name for the standardized variable.
8. For the standard normal variable, Table G.1 in the textbook reports the probability
P(Z < z), that is, the cumulative distribution function (CDF). For example, from Table
G.1 we know P(Z < −2.00) = 0.0228.
9. The Stata function that computes the probability P(Z < z) is normal(z). For example,
dis normal(-2)
.02275013
The function is very useful for computing the p-value.
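As a hedged illustration of how normal() can be turned into p-values, the sketch below uses a hypothetical test statistic of 2 (the scalar name zstat is an assumption) and computes lower-tail, upper-tail, and two-sided probabilities.

```stata
* Sketch: tail probabilities from normal(); zstat = 2 is an illustrative value.
scalar zstat = 2

* lower tail P(Z < -2)
display normal(-scalar(zstat))

* upper tail P(Z > 2), equal to the lower tail by symmetry
display 1 - normal(scalar(zstat))

* two-sided p-value P(|Z| > 2), approximately 0.0455
display 2*(1 - normal(abs(scalar(zstat))))
```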
10. In general,
$$ P(z_1 < Z < z_2) = P(Z < z_2) - P(Z < z_1) \tag{7} $$
There are two special cases:
(a) When z_1 = −1.96 and z_2 = 1.96, then from Table G.1
$$ P(-1.96 < Z < 1.96) = 0.95 \tag{8} $$
(b) Please find z_1 and z_2 so that P(z_1 < Z < z_2) = 0.90.
11. Those special z_1 and z_2 are called critical values.
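Critical values can also be obtained in Stata instead of from a table. The sketch below checks (8) with normal() and then recovers the 95% critical values with invnormal(), Stata's inverse of normal(). Only the 95% case is shown; other levels work the same way with a different tail area.

```stata
* Sketch: verify equation (8) and recover the 95% critical values.

* P(-1.96 < Z < 1.96) using equation (7), approximately 0.95
display normal(1.96) - normal(-1.96)

* leave 2.5% in each tail: approximately 1.96 and -1.96
display invnormal(0.975)
display invnormal(0.025)
```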
12. For a general normal variable y ∼ N(µ, σ²) we can show
$$ P(\mu - 1.96\sigma < y < \mu + 1.96\sigma) = 0.95 \tag{9} $$
Exercise: prove (9).
13. (µ − 1.96σ, µ + 1.96σ) is the 95% confidence interval for y. With 95% probability the
values taken by y will be inside that interval.
14. In practice, people approximate 1.96 by 2. So with around 95% probability a
normal variable y will take values within 2 standard deviations of the mean.
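A simulation can illustrate (9) and the rule of thumb in item 14. This is a sketch under assumed values (µ = 5, σ = 2, 100000 draws, and the variable names within196 and within2 are all hypothetical); it counts how often y falls within 1.96 and within 2 standard deviations of the mean.

```stata
* Sketch: share of normal draws inside mu +/- 1.96*sigma and mu +/- 2*sigma.
* mu = 5, sigma = 2, and n = 100000 are illustrative assumptions.
clear
set seed 1
set obs 100000
scalar mu = 5
scalar sigma = 2
generate y = rnormal(scalar(mu), scalar(sigma))

* indicator variables: 1 if y is inside the interval, 0 otherwise
generate within196 = abs(y - scalar(mu)) < 1.96*scalar(sigma)
generate within2   = abs(y - scalar(mu)) < 2*scalar(sigma)

* the means of the indicators are the simulated probabilities
summarize within196 within2

* exact probability for the 2-sigma rule, approximately 0.9545
display normal(2) - normal(-2)
```

The mean of within196 should be close to 0.95, and the mean of within2 slightly larger, which is why 2 is only an approximation to 1.96.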
15. Discuss:
(a) What is the 90% confidence interval for y ∼ N(µ, σ²)?
(b) Which interval is wider, the 90% or the 95% interval?
16. A common question is “how big is big, and how small is small?” Equation (9) gives a
hint. From a statistical viewpoint, if a variable follows a normal distribution, then big
values can be defined as those greater than µ + 1.96σ, and small values as those less
than µ − 1.96σ.
17. Normality can be checked in multiple ways:
(a) See whether the histogram of the real data is bell-shaped.
(b) Apply the Jarque-Bera (skewness-kurtosis) test. The null hypothesis is that the distribution is normal (so that skewness = 0 and kurtosis = 3). The Stata commands
are
histogram y
sktest y
You reject normality when the p-value (Prob>chi2) of the sktest is less than 0.05.
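To see how these checks behave, one can try them on simulated data where the answer is known. The sketch below is an illustration only: the variable names ynorm and yskew and the lognormal example are assumptions, not part of the handout. Stata documents sktest as a skewness and kurtosis test for normality.

```stata
* Sketch: normality checks on one normal and one non-normal variable.
* ynorm and yskew are hypothetical names; n = 1000 is illustrative.
clear
set seed 1
set obs 1000

* normal by construction
generate ynorm = 5 + rnormal()

* lognormal: right-skewed, so not normal
generate yskew = exp(rnormal())

histogram ynorm
histogram yskew

* sample skewness and kurtosis appear in the detailed summary
summarize ynorm yskew, detail

* skewness-kurtosis test for each variable
sktest ynorm yskew
```

The histogram of yskew should have a long right tail, and its sktest p-value should be far below 0.05.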
Discuss
1. How would you define a rich family by income, supposing family income follows a
normal distribution?
2. How would you define a rich family by income, supposing family income follows a
non-normal distribution?
3. How can you check whether the distribution of family income in our data is normal?
4. Is it possible that the rating of the eco201 instructor follows a normal distribution? If
not, what distribution is a better candidate?
5. Comment on this: “standard deviation is important because with around 95% probability a random variable will take values within 2 times standard deviation around the
mean.”
6. Suppose the SAT score (y) follows a normal distribution. The average score is 21, so
µ = 21, and the standard deviation is 5, so σ = 5.
• Question 1: Find the probability that a student’s SAT score is greater than 30.
• Question 2: Suppose one student’s score is x. Find x so that 90% of students earn
scores lower than that student’s. This is how an admission office can decide the cutoff
number for the SAT score (a student is accepted if the SAT score is above that
number).
7. How can we figure out probabilities if y is not normal?
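For the SAT exercise in item 6, one way to set up the computation in Stata is to standardize first and then use normal() and invnormal(). This is a sketch of the method only, using the µ = 21 and σ = 5 stated in the question; the scalar names are assumptions.

```stata
* Sketch: normal probabilities and percentiles for y ~ N(21, 25).
scalar mu = 21
scalar sigma = 5

* Question 1: P(y > 30) = 1 - P(Z < (30 - mu)/sigma)
display 1 - normal((30 - scalar(mu))/scalar(sigma))

* Question 2: the score x with 90% of students below it
display scalar(mu) + scalar(sigma)*invnormal(0.90)
```

The second line undoes the standardization: it finds the 90th percentile of Z and converts it back to the scale of y.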