Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
1 Sampling Distributions 1.1 Statistics and Sampling Distributions When a random sample is selected the numerical descriptive measures calculated from such a sample are called statistics. These statistics vary or change for each different random sample you select; that is they are random variables. Definition: Any quantity computed from values in a sample is called a statistic. The value of a statistic varies from sample to sample this effect is called sample variability. Since the sampling is done randomly, the value of an statistic is random. In conclusion: statistics are random variables. Since statistics are random variables, they have a distribution, that tells which values occur with which probability. Definition: The distribution of a statistic is called a sampling distribution. They provide this information: • What are the values the statistic can assume? • What is the probability of each value to occur? Remember: • The population distribution describes the values of a variable in the population. It gives the possible values of all individuals and their likelihood to occur. • The sampling distribution describes the values of a statistic in all possible samples from the population. It describes how a statistic varies in all possible samples of the same size. Example: The population is this section of Stat 141. Let µ be the population mean of the random variable X = height, therefore µ is the average of the heights of all people in this class. Select a random sample of size 5 and observe x̄, the mean height in this sample. Is x̄ = µ? For every random sample the sample mean x̄ is different, this is called the sample variability. Also is it most unlikely that the sample mean is equal to the population mean µ. Even if we use a random sample and don’t make any measuring mistakes the x̄ will be most likely different from µ. This difference is then called the sampling error: sampling error = x̄ − µ Now we will study the distribution of X̄. Suppose you look at every possible random sample of 5 students from this class and the 1 corresponding sample mean. From these numbers you can create the sampling distribution of X̄. If this class has 50 students then there are ! 50 = 2, 118, 760 5 different samples. You will find that 1. The value of x̄ differs from one random sample to another (sampling variability). 2. Some samples produced x̄ values larger than µ, whereas other produce x̄ smaller than µ. (x̄ − µ sampling error). 3. They can be fairly close to the mean µ, or also quite far off the population mean µ. The sampling distribution of X̄ provides important information about the behaviour of the statistic X̄ and how it relates to the population mean µ. Considering how many different samples of size five there are in this class this process is very cumbersome. Fortunately, there are mathematical theorems that help us to obtain information about the sampling distributions. 1.2 The Sampling Distribution of a Sample Mean x̄ based on a large sample tends to be closer to µ than does x̄ based on a small n. This can be explained by the following theoretical results. Lemma: Suppose x1 , . . . , xn are random variables with the same distribution with mean µ and population standard deviation σ. Now look at the random variable X̄. 1. The population mean of X̄, denoted µX̄ , is equal to µ. 2. The population standard deviation of X̄, denoted σx̄ , is σ σX̄ = √ n This means that the sampling distribution of X̄ is always centered at µ and the second statement gives the rate the spread of the sampling distribution (sampling variability) decreases as n increases. This effect is called the law of large numbers, since it indicates that in average the sample mean will get closer to the population mean as the sample size increases. Beside finding the mean and the standard deviation, do we know anything about the shape of the density curve of the sampling distribution of the mean? 2 1.3 Central Limit Theorem This section explains why the normal distribution is so important in statistics. The result is surprising. It states, that under rather general conditions, means of random samples drawn from one population tend to follow an approximate normal distribution. It does not matter which kind of distribution we find in the population. It even can be discrete or extremely skewed. But if n is large enough the sampling distribution of the mean is approximately normal distributed. Central Limit Theorem If random samples of n observations are drawn from any population with finite mean µ and standard deviation σ, then, when n is large, the sampling distribution √of the mean X̄ is approximate normal distributed, with mean µ and standard deviation σ/ n. Remark: • If the population itself is normal x̄ is normal distributed for all n, so that n does not have to be large. • When the sampled population has a symmetric distribution, the sampling distribution of X̄ becomes quickly normal. Compare the example below for n = 3. • If the distribution is skewed, usually for n = 30 the sampling distribution is already close to a normal distribution. Example: Consider tossing n unbiased dice and recording the average number of the upper faces. The graphs display the sampling distribution for x̄ for n = 1, 2, 3, 4. 3 Looking at only n = 4 dice leads to a distributions that is very close to a normal distribution. Summary: Assume that the measurements in a population follow all the same distribution with finite mean µ and standard deviation σ. Then • The mean of the sampling distribution of the mean X̄ of n observations holds µX̄ = µ • The standard deviation of the sampling distribution of the mean x̄ of n observations holds σ σX̄ = √ n • The sampling distribution of the mean x̄ of n observations is approximately normal distributed, if n is large enough. Example: In 2004 a water taxi in Baltimore capsized because it was overloaded. 5 people drowned. How could this have happened, was the accident predictable? At that time they permitted 20 passengers per load, and knew that the taxi is save up to a load of 3500 lb. Assuming that the mean weight of American men is about normally distributed with a mean µ = 172lb and a standard deviation of σ = 29lb, what was the risk? 1. What is the probability that a randomly chosen man is heavier than 175lb? 2. What is the probability that 20 randomly chosen American men weigh in total more than 3500lb? ad 1. P (X > 175) = = = = 46% of American men are heavier than 175 lb. 4 P Z > 175−172 29 P (Z > 0.103) 1 − 0.5398 0.4602 ad 2. If 20 men weigh in total more than 3500lb, it is the same as if they weigh in average more than 3500/20=175lb. Calculate the probability that the mean weight of 20 men, X̄, exceeds 175lb. The mean of the sample mean, √ X̄, is√µX̄ = µ = 172 and the standard deviation of the sample mean, X̄, is σX̄ = σ/ n = 29/ 20 = 6.68. Then P (X̄ > 175) = = = = P Z > 175−172 6.68 P (Z > 0.4629) 1 − 0.6772 0.3228 The probability that the total weight of 20 randomly chosen men exceeds the weight limit is 0.3228. Example: The duration of Alzheimer’s disease from the onset of symptoms until death ranges from 3 to 20 years. The mean is 8 years and the standard deviation is 4 years. Looking at the average duration for 30 randomly selected Alzheimer patients: what is the probability that the average duration of those 30 patients is less than 7 years? 7 − µX̄ X̄ − µX̄ < P (X̄ < 7) = P σX̄ σ ! X̄ 7−8 = P Z< √ 4/ 30 ! standardize = P (Z < −1.37) = 0.0853 TableA Example 1 Consider a bottle filling machine, that is supposed to fill 500mL in each bottle. This machine will be reset if a sample of 10 bottles shows an average filling amount of more than 505mL. Assume that the machine is in correct adjustment, that is µ = 500mL, but the filling amount varies with a standard deviation of 7mL. What is the probability that the machine will be reset, even though it is correctly adjusted? 505 − µX̄ X̄ − µX̄ > P (X̄ > 505) = P σX̄ σX̄ ! 505 − 500 √ = P Z> 7/ 10 ! standardize = P (Z > 2.26) = 1 − 0.9878 = 0.0122 TableA The machine would be reset in 1.2% of examinations, even if is correctly adjusted. 5 1.4 The Sampling Distribution of a Sample Proportion Suppose we are just interested if one characteristic occurred in the population of interest. For example: • flip a coin and observe, if Tail was tossed. • a persons IQ is above 120 • a student is a ”nonresident” • a person survived at least five years, after a specific cancer treatment • a clinical test is positive Then there exists a probability p that a randomly selected individual or object from the population possesses this characteristic. Traditionally, any individual or object that possesses the property of interest is labelled a Success (S), and one one that does not possess the property is called a Failure (F). Looking at a random sample of the size n, the probability p can be estimated by calculating the sample proportion (relative frequency) of S’s, which is number of S0 s in the sample . p̂ = n Now rewrite the sample proportion in the following way. First let single observations be encoded in the following way. 0 if individual is labelled F 1 if individual is labelled S xi = Then is p̂ = P xi /n = x̄. So that the Central Limit Theorem applies to p̂. Result: • µp̂ , the mean of the sampling distribution of p̂, is p the true probability. • The standard deviation of the sampling distribution of p̂ is s σp̂ = p(1 − p) n (since the standard deviation for a single observation is q p(1 − p).) • The sampling proportion p̂ is for large n approximately normal distributed. (Central Limit Theorem) 6 A conservative rule of thumb states that the Central Limit Theorem can be applied if: np ≥ 5 and n(1 − p) ≥ 5 Example: A study showed, that the proportion of people in the 20 to 34 age group with an IQ (on the Wechsler Intelligence Scale) of over 120 is about 0.35. Calculate the probability for the event that in a sample of 50 there are more than 30 people with an IQ of at least 120. First obtain µp and σp for the distribution of the sample proportion p. • µp̂ = p = 0.35 • σp̂ = q p(1−p) n = q 0.35(1−0.35) 50 = 0.06745. ! P (p̂ > 30 ) 50 p̂ − µp̂ 0.6 − µp̂ = P > standardize σp̂ σp̂ ! 0.6 − 0.35 p̂ − 0.35 > use above results = P 0.06745 0.06745 ! p̂ − 0.35 = 1−P ≤ 3.706 Rule for Compliments 0.06745 = 1 − 0.9999 CLT, np = 17.5 > 5, TableIV = 0.0001 We calculated that the probability that more than 30 out of 50 people (between 20 and 34) have an IQ greater than 120 is 0.0001. It is highly unlikely! Example 2 Assume 80% of Canadians are in favour of protecting the environment through more laws. Given that 1000 randomly selected Canadians are asked if they are in favour or against additional laws for protecting the environment, what is the probability that the poll results in a percentage less than 75 ! P (p̂ < 0.75) = = = = = 0.75 − µp̂ p̂ − µp̂ < P σp̂ σp̂ ! p̂ − 0.8 0.6 − 0.35 P > ?????0.06745 0.06745 ! p̂ − 0.35 1−P ≤ 3.706 0.06745 1 − 0.9999 0.0001 7 standardize use above results Rule for Compliments CLT, np = 17.5 > 5, TableIV