SAMPLING AND SAMPLING DISTRIBUTION

SAMPLES AND POPULATIONS
• It is usually not practical to study an entire population.
  - In a random sample, each member of the population has an equal chance of being chosen.
  - A representative sample might have the same proportion of men and women (and several other characteristics) as the population does.
  - A convenience sample (haphazard sample) is a non-random subset of the population, chosen because its members happen to be available (e.g., all members of this class would be a convenience sample of AAU students).
• Convenience samples may be biased because each member of the population does not have an equal chance of being chosen.

SAMPLES AND POPULATIONS
Population                      Sample
All Americans                   A subset of Americans
All pets in Europe              A subset of pets
All car accidents in Prague     Today's car accidents
All depressed people            The depressed people in Prague

SAMPLES AND POPULATIONS
• Sample
  - A relatively small number of cases (observations) that are studied to make inferences about a larger group from which they were drawn.
• Population
  - The larger group (all cases) from which a sample is drawn.

WHY SAMPLE THE POPULATION?
• The physical impossibility of checking all items in the population.
• The cost of studying all the items in a population is often prohibitive.
• The adequacy of sample results.
• To contact the whole population would often be time-consuming.

DEFINITION: STATISTICS AND PARAMETERS
• A parameter is a characteristic of a population.
  - e.g., the average blood pressure of all Czechs.
• A statistic is a characteristic of a sample (a function of a random variable).
  - e.g., the average blood pressure of a sample of Czechs.
• We use statistics of samples to estimate parameters of populations.
Statistic    estimates    Parameter
$\bar{X}$    estimates    $\mu$ ("mu")
$s$          estimates    $\sigma$ ("sigma")
$s^2$        estimates    $\sigma^2$
$r$          estimates    $\rho$ ("rho")

COMMON SENSE: SAMPLES AND POPULATIONS
• A random sample should represent the population well, so sample statistics from a random sample should provide reasonable estimates of population parameters.
• All sample statistics have some error in estimating population parameters.
• If repeated samples are taken from a population and the same statistic (e.g. the mean) is calculated from each sample, the statistics will vary; that is, they will have a distribution.
• A larger sample provides more information than a smaller sample, so a statistic from a large sample should have less error than a statistic from a small sample.

COMMON SENSE
• Randomness in sampling is good.
• Statistics (= functions of random variables) have error.
• Statistics have distributions.
• A larger sample size (n) is better: less error.
• The probability distribution of a statistic is called its sampling distribution.

SAMPLING DISTRIBUTIONS
A sampling distribution is a distribution of all of the possible values of a statistic for a given size sample selected from a population.
• Sampling distributions of the mean
• Sampling distributions of the proportion

SAMPLING DISTRIBUTION OF MEANS
• A sampling distribution is a frequency distribution (equivalently, a probability distribution) of a sample statistic.
• The distribution of sample means is usually referred to as the sampling distribution of means or the sampling distribution of the mean.

DEVELOPING A SAMPLING DISTRIBUTION
• Assume there is a population of size N = 4.
• The random variable X is the age of the individuals, taking the values X: 18, 20, 22, 24 (years) for individuals A, B, C, D.

Summary measures for the population distribution:
$\mu = \frac{\sum X_i}{N} = \frac{18 + 20 + 22 + 24}{4} = 21$
$\sigma = \sqrt{\frac{\sum (X_i - \mu)^2}{N}} = 2.236$
[Figure: population distribution P(x) for x = 18, 20, 22, 24 (A, B, C, D); uniform distribution]

DEVELOPING A SAMPLING DISTRIBUTION (continued)
Now consider all possible samples of size n = 2.

1st obs.    2nd observation
            18       20       22       24
18          18,18    18,20    18,22    18,24
20          20,18    20,20    20,22    20,24
22          22,18    22,20    22,22    22,24
24          24,18    24,20    24,22    24,24
16 possible samples (sampling with replacement)

16 sample means
1st obs.    2nd observation
            18    20    22    24
18          18    19    20    21
20          19    20    21    22
22          20    21    22    23
24          21    22    23    24

DEVELOPING A SAMPLING DISTRIBUTION (continued)
Sampling distribution of all sample means
[Figure: sample means distribution $P(\bar{X})$ for $\bar{X}$ = 18, 19, ..., 24; no longer uniform]

DEVELOPING A SAMPLING DISTRIBUTION (continued)
Summary measures of this sampling distribution (the sums run over the 16 sample means):
$\mu_{\bar{X}} = \frac{\sum \bar{X}_i}{16} = \frac{18 + 19 + 21 + \cdots + 24}{16} = 21$
$\sigma_{\bar{X}} = \sqrt{\frac{\sum (\bar{X}_i - \mu_{\bar{X}})^2}{16}} = \sqrt{\frac{(18 - 21)^2 + (19 - 21)^2 + \cdots + (24 - 21)^2}{16}} = 1.58$

COMPARING THE POPULATION WITH ITS SAMPLING DISTRIBUTION
Population (N = 4): $\mu = 21$, $\sigma = 2.236$
Sample means distribution (n = 2): $\mu_{\bar{X}} = 21$, $\sigma_{\bar{X}} = 1.58$
[Figure: P(X) for the population (18, 20, 22, 24) next to $P(\bar{X})$ for the sample means (18 to 24)]

SAMPLING DISTRIBUTION OF MEANS (continued)
• Definition
  - A sampling distribution of means is a probability distribution. It is the relative frequency distribution of means obtained from an unlimited series of sampling experiments, each consisting of a sample of size n randomly selected from the population.

CHARACTERISTICS OF THE SAMPLING DISTRIBUTION OF MEANS
• The shape of the distribution of means will be approximately normal if either
  - the sample size is 30 or greater, or
  - the underlying distribution of the population of individuals is normal.
• This means that the sampling distribution of means will be approximately normal even if the distribution being sampled is not, provided the sample size is large enough.
• This is referred to as the CENTRAL LIMIT THEOREM.

CHARACTERISTICS OF THE SAMPLING DISTRIBUTION OF MEANS
• Mean (of the sampling distribution of means): $\mu_{\bar{X}} = \mu$
• Variance (of the sampling distribution of means): $\sigma_{\bar{X}}^2 = \frac{\sigma^2}{n}$
• Standard deviation of the sampling distribution of means (usually called the standard error of the mean, or SEM): $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$
• Shape of the sampling distribution of means: sampling distributions of means tend toward a normal shape as the sample size increases, regardless of the shape of the population distribution from which the samples have been randomly selected.

SAMPLING DISTRIBUTION OF MEANS IF THE POPULATION IS NORMAL
• If a population is normal with mean $\mu$ and standard deviation $\sigma$, the sampling distribution of $\bar{X}$ is also normally distributed with $\mu_{\bar{X}} = \mu$ and $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$.
(This assumes that sampling is with replacement, or that sampling is without replacement from an infinite population.)

STANDARD ERROR OF THE MEAN
• Different samples of the same size from the same population will yield different sample means.
• A measure of the variability in the mean from sample to sample is given by the standard error of the mean: $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$
• Note that the standard error of the mean decreases as the sample size increases.

Z-VALUE FOR SAMPLING DISTRIBUTION OF THE MEAN
• Z-value for the sampling distribution of $\bar{X}$:
$Z = \frac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}} = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$
where: $\bar{X}$ = sample mean, $\mu$ = population mean, $\sigma$ = population standard deviation, n = sample size

EXAMPLE
• Suppose a population has mean $\mu = 8$ and standard deviation $\sigma = 3$. Suppose a random sample of size n = 36 is selected.
• What is the probability that the sample mean is between 7.8 and 8.2?
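The N = 4 age population lends itself to a brute-force check of the formulas $\mu_{\bar{X}} = \mu$ and $\sigma_{\bar{X}} = \sigma / \sqrt{n}$. The sketch below is illustrative only: the helper names `mean` and `population_sd` are not from the slides. It enumerates all 16 ordered samples of size 2 drawn with replacement and summarizes their means.

```python
from itertools import product
from math import sqrt


def mean(values):
    """Arithmetic mean of a sequence of numbers."""
    return sum(values) / len(values)


def population_sd(values):
    """Standard deviation with divisor N (every value is included)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / len(values))


ages = [18, 20, 22, 24]  # the population from the slides (A, B, C, D)
n = 2                    # sample size

# All 4**2 = 16 ordered samples of size n, drawn with replacement.
samples = list(product(ages, repeat=n))
sample_means = [mean(s) for s in samples]

mu = mean(ages)
sigma = population_sd(ages)
mu_xbar = mean(sample_means)
sigma_xbar = population_sd(sample_means)

print(len(samples))                    # 16
print(mu, round(sigma, 3))             # 21.0 2.236
print(mu_xbar, round(sigma_xbar, 3))   # 21.0 1.581
print(round(sigma / sqrt(n), 3))       # 1.581
```

The last two printed values agree, which is the standard-error formula at work for sampling with replacement.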
SAMPLING DISTRIBUTION PROPERTIES
• $\mu_{\bar{X}} = \mu$ (i.e. $\bar{X}$ is unbiased)
[Figure: normal population distribution above a normal sampling distribution (has the same mean)]

EXAMPLE (continued)
Solution, for $\mu = 8$, $\sigma = 3$, n = 36:
• Even if the population is not normally distributed, the central limit theorem can be used (n > 30)
• … so the sampling distribution of $\bar{X}$ is approximately normal
• … with mean $\mu_{\bar{X}} = 8$
• … and standard deviation $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{3}{\sqrt{36}} = 0.5$

EXAMPLE (continued)
Solution (continued):
$P(7.8 < \bar{X} < 8.2) = P\left(\frac{7.8 - 8}{3/\sqrt{36}} < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < \frac{8.2 - 8}{3/\sqrt{36}}\right) = P(-0.5 < Z < 0.5) = 0.1915 + 0.1915 = 0.3830$
[Figure: population distribution, then sampling distribution of $\bar{X}$ (7.8, 8, 8.2), standardized to the standard normal distribution (-0.5, 0, 0.5)]

EXAMPLE: SAMPLE MEAN
If n = 20, compute the probability that $\hat{\mu} = \bar{X}$ lies within $\sigma/4$ of $\mu$.
$P\left(\mu - \frac{\sigma}{4} \le \bar{X} \le \mu + \frac{\sigma}{4}\right) = P\left(\mu - \frac{\sigma}{4} \le N\left(\mu, \frac{\sigma^2}{20}\right) \le \mu + \frac{\sigma}{4}\right) = P\left(-\frac{\sqrt{20}}{4} \le N(0, 1) \le \frac{\sqrt{20}}{4}\right) = \Phi(1.12) - \Phi(-1.12) = 0.7372$
[Figure: probability density function of $\hat{\mu} = \bar{X}$ when n = 20; 74% of the area lies between $\mu - \sigma/4$ and $\mu + \sigma/4$]

USING THE SAMPLING DISTRIBUTION OF MEANS TO DETERMINE PROBABILITIES
• Assume IQ is normally distributed, with $\mu = 100$ and $\sigma = 16$.
  - What is the probability of obtaining a sample mean of 102 or higher if:
    • the sample size is 4?
    • the sample size is 16?
  - What is the probability of obtaining a sample mean of 98 or less if:
    • the sample size is 4?
    • the sample size is 16?

USING THE SAMPLING DISTRIBUTION OF MEANS TO DETERMINE PROBABILITIES
• Assume IQ is normally distributed, with $\mu = 100$ and $\sigma = 16$.
  - What is the probability of obtaining a sample mean that differs from the population mean by 6 points or more (i.e., ±6 points) if:
    • the sample size is 4?
    • the sample size is 9?
• How would your answer change if IQ were not normally distributed?
DISTRIBUTION OF SAMPLE MEAN
If $X_1, \ldots, X_n$ are observations from a population with a mean $\mu$ and a variance $\sigma^2$, then the Central Limit Theorem indicates that the sample mean has the approximate distribution
$\hat{\mu} = \bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$
The standard error of the sample mean is defined as $\mathrm{s.e.}(\bar{X}) = \frac{\sigma}{\sqrt{n}}$, but since $\sigma$ is usually unknown, we may safely replace $\sigma$, when n is large, by the observed value (the sample standard deviation) s:
$\mathrm{s.e.}(\bar{X}) = \frac{s}{\sqrt{n}}$

DISTRIBUTION OF SAMPLE VARIANCE
If $X_1, \ldots, X_n$ are normally distributed with mean $\mu$ and variance $\sigma^2$, then the sample variance $S^2$ has the distribution
$S^2 \sim \sigma^2 \frac{\chi^2_{n-1}}{n - 1}$
and
$\frac{\sqrt{n}(\bar{X} - \mu)}{S} = \frac{\sqrt{n}(\bar{X} - \mu)/\sigma}{S/\sigma} \sim \frac{N(0, 1)}{\sqrt{\chi^2_{n-1}/(n - 1)}} \sim t_{n-1}$

SAMPLING DISTRIBUTIONS OF THE PROPORTION
[Diagram: Sampling Distributions, branching into Sampling Distributions of the Mean and Sampling Distributions of the Proportion]

POPULATION PROPORTIONS, p
p = the proportion of the population having some characteristic
• The sample proportion ($p_s$) provides an estimate of p:
$p_s = \frac{X}{n} = \frac{\text{number of items in the sample having the characteristic of interest}}{\text{sample size}}$
• $0 \le p_s \le 1$
• X has a binomial distribution, so $p_s$ is a rescaled binomial variable (assuming sampling with replacement from a finite population or without replacement from an infinite population).

SAMPLING DISTRIBUTION: SAMPLE PROPORTION P-HAT
• The situation in this section is that we are interested in the proportion of the population that has a certain characteristic.
• This proportion is the population parameter of interest, denoted by the symbol p.
• We estimate this parameter with the statistic p-hat: the number in the sample with the characteristic divided by the sample size n, $\hat{p} = X/n$. ($\hat{p}$ and $p_s$ both denote the sample proportion.)

SAMPLING DISTRIBUTION: SAMPLE PROPORTION P-HAT
If $X \sim B(n, p)$, then the sample proportion $\hat{p} = X/n$ has the approximate distribution
$\hat{p} \sim N\left(p, \frac{p(1 - p)}{n}\right)$
$E(\hat{p}) = p$, $\mathrm{Var}(X) = np(1 - p)$, $\mathrm{Var}(\hat{p}) = \frac{\mathrm{Var}(X)}{n^2} = \frac{p(1 - p)}{n}$

SAMPLING DISTRIBUTION OF P-HAT
• How does p-hat behave? To study the behavior, imagine taking many random samples of size n and computing a p-hat for each of the samples.
• PROPERTIES
  - When sample sizes are fairly large, the shape of the p-hat distribution will be approximately normal.
  - The mean of the distribution is the value of the population parameter p.
  - The standard deviation of this distribution is the square root of p(1-p)/n.

EXAMPLES OF P-HAT COMPUTATION
• Flip a coin n = 10 times, keep track of the number of heads, X, and compute p-hat = X/10.
• Then flip the coin n = 30 times and compute p-hat.
• What would be the shape and behavior properties of the two displays?
• Because the shape of the distribution is normal, we can standardize the variable p-hat to a Z standard normal distribution. Use the Z-transform:
$Z = \frac{\hat{p} - p}{\sqrt{p(1 - p)/n}} = \frac{\hat{p} - E(\hat{p})}{\sqrt{\mathrm{Var}(\hat{p})}}$

STANDARD ERROR OF THE SAMPLE PROPORTION
The standard error of the sample proportion is defined as $\mathrm{s.e.}(\hat{p}) = \sqrt{\frac{p(1 - p)}{n}}$, but since p is usually unknown, we replace p by the observed value $\hat{p} = x/n$ to have
$\mathrm{s.e.}(\hat{p}) = \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} = \frac{1}{n}\sqrt{\frac{x(n - x)}{n}}$

POPULATION PROPORTIONS
• Categorical variable (e.g., gender)
• % of population having a characteristic
• If two outcomes, binomial distribution
  - Possess or don't possess the characteristic
• Sample proportion ($p_s$):
$p_s = \frac{X}{n} = \frac{\text{number of successes}}{\text{sample size}}$

SAMPLING DISTRIBUTION OF P
• Approximated by a normal distribution if $np \ge 5$ and $n(1 - p) \ge 5$, where
$\mu_{p_s} = p$ and $\sigma_{p_s} = \sqrt{\frac{p(1 - p)}{n}}$
(where p = population proportion)
[Figure: sampling distribution $P(p_s)$ for $p_s$ = 0, .2, .4, .6, .8, 1]

STANDARDIZING SAMPLING DISTRIBUTION OF PROPORTION
$Z \approx \frac{p_s - \mu_{p_s}}{\sigma_{p_s}} = \frac{p_s - p}{\sqrt{p(1 - p)/n}}$
[Figure: sampling distribution of $p_s$ (mean $\mu_{p_s}$, standard deviation $\sigma_{p_s}$) mapped to the standardized normal distribution ($\mu = 0$, $\sigma = 1$)]

EXAMPLE
If the true proportion of voters who support Proposition A is p = .4, what is the probability that a sample of size 200 yields a sample proportion between .40 and .45?

EXAMPLE: SAMPLING DISTRIBUTION OF PROPORTION
• Check: $np \ge 5$ and $n(1 - p) \ge 5$ both hold.
• If p = .4 and n = 200, what is $P(.40 \le p_s \le .45)$?
• Find $\sigma_{p_s}$: $\sigma_{p_s} = \sqrt{\frac{p(1 - p)}{n}} = \sqrt{\frac{.4(1 - .4)}{200}} = .03464$
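The coin-flip thought experiment for n = 10 and n = 30 can be made exact rather than simulated, because X is binomial. The sketch below assumes a fair coin (p = 0.5, which the slides do not state) and uses illustrative helper names `phat_distribution` and `mean_and_sd`. It tabulates the exact distribution of p-hat and compares its mean and standard deviation with p and $\sqrt{p(1 - p)/n}$.

```python
from math import comb, sqrt


def phat_distribution(n, p):
    """Exact distribution of p-hat = X/n when X ~ Binomial(n, p).

    Returns a dict mapping each possible value of p-hat to its probability.
    """
    return {x / n: comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(n + 1)}


def mean_and_sd(dist):
    """Mean and standard deviation of a discrete distribution {value: probability}."""
    m = sum(value * prob for value, prob in dist.items())
    var = sum((value - m) ** 2 * prob for value, prob in dist.items())
    return m, sqrt(var)


p = 0.5  # assumption: a fair coin
summary = {}
for n in (10, 30):
    m, sd = mean_and_sd(phat_distribution(n, p))
    theoretical_sd = sqrt(p * (1 - p) / n)
    summary[n] = (m, sd, theoretical_sd)
    print(n, round(m, 4), round(sd, 4), round(theoretical_sd, 4))
```

Both displays are centered at p. The n = 30 one is narrower (standard deviation about 0.091 against about 0.158), and its finer grid of possible values is part of why it looks closer to a normal curve.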
EXAMPLE (continued)
With p = .4, n = 200 and $\sigma_{p_s} = .03464$, convert to the standard normal:
$P(.40 \le p_s \le .45) = P\left(\frac{.40 - .40}{.03464} \le Z \le \frac{.45 - .40}{.03464}\right) = P(0 \le Z \le 1.44)$
Use the standard normal table: $P(0 \le Z \le 1.44) = .4251$
[Figure: sampling distribution of $p_s$ (.40 to .45) standardized to the standard normal distribution (0 to 1.44); shaded area .4251]

The same steps with an upper limit of .43 instead of .45 give
$Z \approx \frac{p_s - p}{\sqrt{p(1 - p)/n}} = \frac{.43 - .40}{\sqrt{.40 \times (1 - .40)/200}} = .87$
$P(.40 \le p_s \le .43) = P(0 \le Z \le 0.87) = 0.3078$
[Figure: sampling distribution ($\mu_{p_s} = .40$, $\sigma_{p_s} = .0346$) standardized to the standard normal distribution ($\mu = 0$, $\sigma = 1$)]

Z-VALUE FOR PROPORTIONS, CORRECTION FOR SAMPLING WITHOUT REPLACEMENT
Always standardize $p_s$ to a Z value with the formula:
$Z = \frac{p_s - p}{\sigma_{p_s}} = \frac{p_s - p}{\sqrt{p(1 - p)/n}}$
• If sampling is without replacement and n is greater than 5% of the population size N, then $\sigma_{p_s}$ must use the finite population correction factor:
$\sigma_{p_s} = \sqrt{\frac{p(1 - p)}{n}} \sqrt{\frac{N - n}{N - 1}}$

EXAMPLE - MOTIVATION FOR TESTING
• The old drug cures 80 percent of the time. A new drug for heartworm disease in dogs cures 90 percent of n = 1000 dogs, so p-hat = .9.
• What is the probability of observing a sample proportion greater than .9 if p = .8 and n = 1000?
• That is, P(p-hat > .9) = P(Z > (.9 - .8)/.0126) = P(Z > 7.905), which is approximately zero.

EXAMPLE - USE OF SAMPLE PROPORTION IN ESTIMATION
• Suppose that you want to know how many animals/bugs/insects/tigers are in a specific region/storage/house/…
• How could you use sampling to get a good estimate?

ANSWER
• Mark; release; resample.
• Catch, say, 50 animals. Mark them. Release them back into the region/storage/house again. Now randomly catch 50 animals again. What percentage are the originals that were captured?
• How could this be used in other estimations? On what does it rely?

ANSWER
• Mark; release; resample.
• Catch 50 tigers. Put a band around their necks. Release them in the jungle again. Now catch 50 tigers again. What percentage are the originals that were captured?
• How could this be used in other estimations? On what does it rely?

IF THE POPULATION IS NOT NORMAL
• We can apply the Central Limit Theorem:
  - Even if the population is not normal,
  - … sample means from the population will be approximately normal as long as the sample size is large enough.
Properties of the sampling distribution:
$\mu_{\bar{X}} = \mu$ and $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$

CENTRAL LIMIT THEOREM
As the sample size gets large enough (n increases), the sampling distribution becomes almost normal regardless of the shape of the population.

IF THE POPULATION IS NOT NORMAL (continued)
Sampling distribution properties:
• Central tendency: $\mu_{\bar{X}} = \mu$
• Variation: $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$ (sampling with replacement)
[Figure: population distribution above the sampling distribution of $\bar{X}$, which becomes normal as n increases; curves for a larger and a smaller sample size]

HOW LARGE IS LARGE ENOUGH?
• For most distributions, n > 30 will give a sampling distribution that is nearly normal.
• For fairly symmetric distributions, n > 15.
• For normal population distributions, the sampling distribution of the mean is always normally distributed.

CENTRAL LIMIT THEOREM
• Even if data are not normally distributed, as long as you take "large enough" samples, the sample averages will at least be approximately normally distributed.
• The mean of the sample averages is still $\mu$.
• The standard error of the sample averages is still $\sigma/\sqrt{n}$.
• In general, "large enough" means more than 30 measurements.

CENTRAL LIMIT THEOREM
• For a population with a mean $\mu$ and a variance $\sigma^2$, the sampling distribution of the means of all possible samples of size n generated from the population will be approximately normally distributed, with the mean of the sampling distribution equal to $\mu$ and the variance equal to $\sigma^2/n$, assuming that the sample size is sufficiently large.
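The Central Limit Theorem statements above can be probed with a small simulation. This is a sketch under stated assumptions, not part of the slides: the population is taken to be Exponential(1), a strongly right-skewed distribution with $\mu = 1$ and $\sigma = 1$, and the function name `sample_mean_stats`, the seed and the replication count are arbitrary choices. It compares sample means for n = 2 and n = 30.

```python
import random
from math import sqrt


def sample_mean_stats(n, reps, rng):
    """Mean, standard deviation and skewness of `reps` simulated sample means.

    Each sample has size n and is drawn from an Exponential(1) population.
    """
    means = [sum(rng.expovariate(1.0) for _ in range(n)) / n for _ in range(reps)]
    m = sum(means) / reps
    sd = sqrt(sum((x - m) ** 2 for x in means) / reps)
    skew = sum((x - m) ** 3 for x in means) / (reps * sd**3)
    return m, sd, skew


rng = random.Random(12345)  # fixed seed so the run is repeatable
results = {n: sample_mean_stats(n, reps=20000, rng=rng) for n in (2, 30)}

for n, (m, sd, skew) in results.items():
    # Columns: n, mean of sample means, their sd, sigma / sqrt(n), skewness
    print(n, round(m, 3), round(sd, 3), round(1 / sqrt(n), 3), round(skew, 2))
```

In both cases the mean of the sample means should sit near $\mu = 1$ and their standard deviation near $\sigma/\sqrt{n}$. The skewness should be much smaller for n = 30 than for n = 2, which is the sense in which the sampling distribution "becomes almost normal".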
SUMMARY: CENTRAL LIMIT THEOREM (CLT)
• The sampling distribution of $\bar{Y}$
  - has mean $\mu_{\bar{Y}} = \mu_Y$
  - and standard error $\sigma_{\bar{Y}} = \sigma_Y / \sqrt{N}$
• As the sample size N gets larger,
  - the standard error gets smaller
  - and the sampling distribution gets closer to "normal."
• So
  - larger samples give closer, more predictable approximations to the population mean.

SUMMARY
• If we take a simple random sample
  - from a well-defined population
• we expect
  - that the sample mean
  - is "probably" "close" to the population mean.
• By "close" we mean "within ~2 standard errors."
• Today, we'll learn that "probably" means in 95% of all samples.

SUMMARY
• Law of Large Samples
  - If we take simple random samples from a well-defined population,
  - we expect that the sample mean is "usually" "close" to the population mean.
• Central Limit Theorem
  - If by "close" we mean "within 2 (1.96) standard errors,"
  - then by "usually" we mean "in 95% of all samples."
  - For other definitions of "close" and "usually," see the "z (standard normal) table" in your course binder.
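The "within 1.96 standard errors in 95% of all samples" statement can be checked by simulation. The sketch below is illustrative only: the population (rolls of a fair six-sided die), the sample size n = 40, the seed and the function name `coverage` are all assumptions chosen for the example, not taken from the slides.

```python
import random
from math import sqrt

faces = [1, 2, 3, 4, 5, 6]  # assumed population: a fair die
mu = sum(faces) / len(faces)                                  # 3.5
sigma = sqrt(sum((f - mu) ** 2 for f in faces) / len(faces))  # about 1.708


def coverage(n, reps, rng, z=1.96):
    """Fraction of simulated samples whose mean lies within z standard errors of mu."""
    se = sigma / sqrt(n)
    hits = 0
    for _ in range(reps):
        sample_mean = sum(rng.choice(faces) for _ in range(n)) / n
        if abs(sample_mean - mu) <= z * se:
            hits += 1
    return hits / reps


rng = random.Random(2024)  # fixed seed so the run is repeatable
fraction = coverage(n=40, reps=10000, rng=rng)
print(round(fraction, 3))
```

The printed fraction should land close to 0.95. It will not be exactly 0.95, both because of simulation noise and because the mean of 40 die rolls is discrete and only approximately normal.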