Task: normal distribution

Task: Monte Carlo Simulation

The process of using simulated data to check a statistical method is called Monte Carlo simulation. The advantage of using simulated data is that we know the true values of the population parameters. We seldom know the population parameters for real data (do you know the average height of US men?). The goal of this simulation is to show that (i) the sample mean is a random variable, whose distribution is called the sampling distribution; (ii) the sample mean is a consistent estimator.

The Stata code used for this simulation is below:

    clear
    set seed 1
    * generate 1000 draws of y = mu + standard normal noise
    set obs 1000
    sca mu = 5
    gen y = mu + rnormal()
    histogram y
    dis "try four small samples, each having 10 observations"
    sum y in 1/10
    sum y in 11/20
    sum y in 21/30
    sum y in 31/40
    dis "try four big samples, each having 250 observations"
    sum y in 1/250
    sum y in 251/500
    sum y in 501/750
    sum y in 751/1000
    dis "try the whole sample using 1000 observations"
    sum y

Discuss

1. What is the population? How large is our sample?
2. $\mu_y$ = ( ), $\sigma_y^2$ = ( ). Why do we know them?
3. The sample estimates using the first 10 and the second 10 observations are different. Is that OK? Why? What is the implication?
4. Using the whole sample, $\bar{y}$ = ( ), $s_y$ = ( ), $s_y^2$ = ( ), where $s_y$ and $s_y^2$ denote the sample standard deviation and the sample variance, respectively.
5. Which sample mean is closer to $\mu_y$, the one that uses 10 observations or the one that uses 1000 observations? Is this result expected?
6. Find $\tilde{y} = \frac{y_1 + y_2}{2}$ using this simulation. Does it get closer to $\mu_y$ when $n$ rises? Why?
7. How can we obtain the sampling distribution of $\bar{y}$?

Task: normal distribution

Goal: understand the properties of the normal distribution
Reading: appendix B.5 of the textbook

Key points:

1. Basically there are two types of random variables, discrete and continuous. A discrete random variable can take only a finite (or countable) number of values. For example, the variable with y = 0 if a new baby is a girl and y = 1 if it is a boy is a discrete random variable (it follows the Bernoulli distribution).
2. A normal random variable is a continuous random variable that can take (in theory) any value on the real line.
3. The normal distribution is characterized by two parameters, the mean $\mu$ and the variance $\sigma^2$. So if you know these two parameters, you know everything about the normal distribution.
4. The (probability) density function (PDF) of the normal distribution is

   $$f(y) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(y-\mu)^2}{2\sigma^2}}, \quad (-\infty < y < \infty) \tag{1}$$

   Symbolically we write

   $$y \sim N(\mu, \sigma^2) \tag{2}$$

   Remarks
   (a) The density function (1) is symmetric and bell-shaped.
   (b) The center of the density function is the mean $\mu$.
   (c) The dispersion (spread) of the density function is controlled by $\sigma^2$. The distribution becomes "wider" when $\sigma^2$ rises.
   (d) The standard deviation $\sigma$ is the square root of the variance $\sigma^2$. It also measures the dispersion.
5. A linear transformation of a normal random variable also follows a normal distribution. This is a special property of the normal distribution. Consider a particular linear transformation, subtracting the mean and then dividing by the standard deviation, called standardization:

   $$z_y \equiv \frac{y - \mu}{\sigma} \tag{3}$$

   Please show

   $$E(z_y) = (\quad) \tag{4}$$
   $$\mathrm{var}(z_y) = (\quad) \tag{5}$$
   $$z_y \sim (\quad) \tag{6}$$
6. In practice, standardization is commonly used to make a variable unit-free.
7. $N(0, 1)$ is called the standard normal distribution. Later we usually use $Z$ to represent the standard normal random variable: $Z \sim N(0, 1)$. The letter z is the first letter of z-score, another name for the standardized variable.
8. For the standard normal variable, Table G.1 in the textbook reports the probability $P(Z < z)$, that is, the cumulative distribution function (CDF). For example, from Table G.1 we know $P(Z < -2.00) = 0.0228$.
9. The Stata function that computes the probability $P(Z < z)$ is normal(.). For example,

       dis normal(-2)
       .02275013

   The function is very useful for computing the p-value.
10. In general,

    $$P(z_1 < Z < z_2) = P(Z < z_2) - P(Z < z_1) \tag{7}$$

    There are two special cases:
    (a) when $z_1 = -1.96$ and $z_2 = 1.96$, then from Table G.1

    $$P(-1.96 < Z < 1.96) = 0.95 \tag{8}$$

    (b) Please find $z_1$ and $z_2$ so that $P(z_1 < Z < z_2) = 0.90$.
11. Those special $z_1$ and $z_2$ are called critical values.
12. For a general normal variable $y \sim N(\mu, \sigma^2)$ we can show

    $$P(\mu - 1.96\sigma < y < \mu + 1.96\sigma) = 0.95 \tag{9}$$

    Exercise: prove (9).
13. $(\mu - 1.96\sigma, \mu + 1.96\sigma)$ is the 95% confidence interval for y. With 95% probability the values taken by the variable y will be inside that interval.
14. In practice, people approximate 1.96 by 2. So with around 95% probability the normal variable y will take values within 2 standard deviations of the mean.
15. Discuss:
    (a) What is the 90% confidence interval for $y \sim N(\mu, \sigma^2)$?
    (b) Which interval is wider, the 90% or the 95% interval?
16. A common question is "how big is big, and how small is small?" Equation (9) gives a hint. From a statistical viewpoint, if a variable follows a normal distribution, then big values can be defined as those greater than $\mu + 1.96\sigma$, and small values as those less than $\mu - 1.96\sigma$.
17. Normality can be checked in multiple ways:
    (a) see whether the histogram of the real data is bell-shaped;
    (b) apply the Jarque-Bera (skewness-kurtosis) test. The null hypothesis is that the distribution is normal (so that skewness = 0 and kurtosis = 3). The Stata commands are

        histogram y
        sktest y

    You reject normality when the p-value (Prob>chi2) of the sktest is less than 0.05.

Discuss

1. How would you define a rich family according to income, supposing family income follows a normal distribution?
2. How would you define a rich family according to income, supposing family income follows a non-normal distribution?
3. How can we check whether the distribution of family income in our data is normal or not?
4. Is it possible that the rating of the eco201 instructor follows a normal distribution? If not, what distribution is a better candidate?
5. Comment on this: "standard deviation is important because with around 95% probability a random variable will take values within 2 standard deviations of the mean."
6. Suppose the SAT score (y) follows a normal distribution. The average score is 21, so $\mu = 21$, and the standard deviation is 5, so $\sigma = 5$.
   • Question 1: Find the probability that a student's SAT score is greater than 30.
   • Question 2: Suppose one student's score is x. Find x so that 90% of students earn scores lower than x.
   Now we know how the admission office decides the cutoff number for the SAT score (a student is accepted if the SAT score is above that number).
7. How can we figure out the probability if y is not normal?
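The interval formula (7) and the special cases (8) and (9) can be checked numerically in Stata. The lines below are a minimal sketch, assuming only the built-in function normal() from key point 9 and its inverse invnormal(). The scalar names mu, sigma, lower and upper, and the values 5 and 2, are arbitrary choices for illustration and are not part of the handout.

```stata
* equation (7) with z1 = -1.96, z2 = 1.96: P(Z < z2) - P(Z < z1)
display normal(1.96) - normal(-1.96)

* invnormal() inverts normal(): it returns z such that P(Z < z) = p
display invnormal(0.975)

* the same probability for a general normal variable,
* standardizing both end points as in equation (3)
* (mu = 5 and sigma = 2 are arbitrary illustrative values)
scalar mu = 5
scalar sigma = 2
scalar lower = mu - 1.96*sigma
scalar upper = mu + 1.96*sigma
display normal((upper - mu)/sigma) - normal((lower - mu)/sigma)
```

The first and third lines should both display a value of about 0.95, and the second a value of about 1.96. A numerical check like this does not replace the proof of (9), but it shows that the result does not depend on the particular mean and standard deviation chosen.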
