Tue Jan 27 - Wharton Statistics
Lecture 5 Outline – Tues., Jan. 27
• Miscellanea from Lecture 4
• Case Study 2.1.2
• Chapter 2.2
  – Probability model for random sampling (see also Chapter 1.4.1)
  – Sampling distribution of sample mean
  – t-test
  – Confidence intervals

Miscellanea from Lecture 4
• Definition of medians/quartiles
  – The median is the midpoint of a distribution, the number such that half the observations are smaller and half are larger.
    • Computing the median: Make an ordered list of all observations. If n is odd, the median is the center of the ordered observations; if n is even, the median is the mean of the two center observations.
  – pth percentile of a distribution: the value such that p percent of the observations fall at or below it. Exact computation in JMP: order the observations and count up the required percent of observations from the bottom of the list. If the pth percentile falls between two observations, JMP takes a weighted average, (1 − p) × lower observation + p × higher observation.

Miscellanea from Lecture 4
• Definition of median/quartiles continued:
  – The first quartile is the 25th percentile. The third quartile is the 75th percentile.
• Long-tailed vs. short-tailed distributions: Loosely, a long-tailed distribution has a tail that dies out more slowly than that of the normal distribution. A short-tailed distribution has a tail that dies out faster than that of the normal distribution. See figure at end of notes.

Case Study 2.1.2
• Broad question: Are any physiological indicators associated with schizophrenia? Early studies suggested that certain areas of the brain may be different in persons with schizophrenia than in others, but confounding factors clouded the issue.
• Specific question: Is the left hippocampus region of the brain smaller in people with schizophrenia?
• Research design: Sample pairs of monozygotic twins, where one twin was schizophrenic and the other was not. Comparing monozygotic twins controls for genetic and socioeconomic differences.

Case Study 2.1.2 Cont.
• The mean difference (unaffected − affected) in volume of the left hippocampus region across the 15 pairs is 0.199 cm³. Is this larger than could be explained by “chance”?
• Probability (chance) model: Random sampling (fictitious) from a single population.
• Scope of inference
  – The goal is to make inference about the population mean, but the inference is questionable because we did not take a random sample.
  – No causal inference can be made. In fact, the researchers had no theories about whether the abnormalities preceded the disease or resulted from it.

Probability Model
• The goal is to compare two groups (affecteds and unaffecteds), but we have taken a paired sample. We can think of having one population (pairs of twins) and looking at the mean of one variable, the difference in hippocampus volumes within each pair.
• Probability model: Simple random sample with replacement from the population. When the population size is more than 50 times the sample size, simple random sampling with replacement and simple random sampling without replacement are essentially equivalent.

Review of Terminology
• Population: Collection of all items of interest to the researcher, e.g., heights of members of this class, U.S. adults’ incomes, lifetimes of a new brand of tires.
• Statistic (random variable): Any quantity that can be calculated from the data.
• Probability (sampling) distribution of a statistic: the proportion of times that the statistic will take on each possible value in repeated trials of the data collection process (randomized experiment or random sample).
• Population distribution: The probability distribution of a randomly chosen observation from the population.
• Parameter: Describes a feature of the population distribution (e.g., mean or standard deviation).

Parameters and Statistics
• Population parameters ($\mu$, $\sigma$)
  – $\mu$ = population mean
  – $\sigma^2$ = population variance = average size of $(Y - \mu)^2$ in the population
• Hypotheses: $H_0: \mu = 0$, $H_1: \mu \neq 0$
• Sample statistics ($\bar{Y}$, $s$)
  – Sample: $Y_1, \ldots, Y_n$
  – $\bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i$ = sample mean
  – $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar{Y})^2$ = sample variance

Continuous Distributions
• A continuous random variable can take values with any number of decimals, like 1.2361248912.
• The probability of a continuous r.v. taking on any exact value is 0. But there is a nonzero chance that a continuous r.v. will take on a value in an interval.
• The density function defines probability for a continuous r.v. The probability that a r.v. takes on values between 3.9 and 6.2 is the area under the density function between 3.9 and 6.2. The total area under the density function is 1.
• Example of a continuous r.v.: height.

Normal probability distribution
• The normal probability distributions are a family of density functions for a continuous r.v. that are “bell-shaped.”
• The normal probability distribution has two parameters, mean $\mu$ and standard deviation $\sigma$.
• The probability that a normal r.v. will be within one s.d. of its mean is about 68%. The probability that a normal r.v. will be within two s.d.’s of its mean is about 95%.

Sampling distribution of sample mean
• Consider a random sample of size n from a population with mean $\mu$ and variance $\sigma^2$. Key facts about the sampling distribution of $\bar{Y}$:
  – Center: The mean of the sampling distribution of $\bar{Y}$ is $\mu$.
  – Spread: Sample means are closer to the population mean than single values. The sampling distribution has $SD(\bar{Y}) = \sigma / \sqrt{n}$.
  – Shape: If the population distribution is normal, the sampling distribution of the sample mean will be normal. If the population distribution is not normal, the sampling distribution of the sample mean will be nearly normal for large n (rule of thumb: n > 30), by the Central Limit Theorem.
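The center and spread facts above can be checked by simulation. The following is a minimal sketch, not part of the lecture: it assumes a skewed Exponential(1) population (mean 1, SD 1), and the function name `simulate_sample_means`, the sample size, and the number of repetitions are all illustrative choices.

```python
import math
import random
import statistics


def simulate_sample_means(n, reps, seed=0):
    """Draw `reps` random samples of size `n` from an Exponential(1)
    population (an assumed, non-normal population chosen for illustration)
    and return the list of sample means."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(reps)
    ]


mu, sigma = 1.0, 1.0   # population mean and SD of Exponential(1)
n, reps = 30, 5000     # hypothetical sample size and number of repeated samples

means = simulate_sample_means(n, reps)
center = statistics.fmean(means)    # compare with mu
spread = statistics.stdev(means)    # compare with sigma / sqrt(n)
theory_sd = sigma / math.sqrt(n)

print(f"mean of sample means: {center:.3f} (population mean {mu})")
print(f"SD of sample means:   {spread:.3f} (sigma/sqrt(n) = {theory_sd:.3f})")
```

A histogram of `means` would also look roughly bell-shaped even though the population is skewed, which is the shape fact at work.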
Standard errors
• The standard error of a statistic is an estimate of the standard deviation of its sampling distribution. It is the best guess of the likely size of the difference between a statistic used to estimate a parameter and the parameter itself.
• Associated with every standard error is a measure of the amount of information used to estimate variability, called its degrees of freedom. Degrees of freedom are measured in units of “equivalent numbers of independent observations.”
• Standard error of the sample mean: $SE(\bar{Y}) = s / \sqrt{n}$, d.f. = n − 1

Testing a hypothesis about $\mu$
• $H_0: \mu = 0$, $H_1: \mu \neq 0$
• Could the difference of $\bar{Y}$ from $\mu^*$ (the hypothesized value for $\mu$, = 0 here) be due to chance (in random sampling)?
• Test statistic: $|t| = |\bar{Y} - \mu^*| / SE(\bar{Y})$
• The test statistic will tend to be near 0 when $H_0$ is true and far from 0 when $H_0$ is false.
• Assume the population distribution is normal. If $H_0$ is true, then t has the Student’s t-distribution with n − 1 degrees of freedom.

P-value
• The (2-sided) p-value is the proportion of random samples, drawn when $H_0$ is true, whose t-ratios are at least as large in absolute value as the observed test statistic (|t|).
• Schizophrenia example: t = 3.23

  JMP test-mean output (Sample Size = 15):
  Estim Mean    0.1986666667
  Hypoth Mean   0
  T Ratio       3.2289280811
  P Value       0.0060615436

  [Figure: JMP plot accompanying the test; horizontal axis X from -0.4 to 0.4, vertical axis Y from 0 to 8.]

Schizophrenia Example
• p-value (2-sided, paired t-test) = .006
• So either
  – (i) the null hypothesis is incorrect, OR
  – (ii) the null hypothesis is correct and we happened to get a particularly unusual sample (only 6 out of 1000 are as unusual).
• Strong evidence against $H_0: \mu = 0$
• One-sided test: $H_0: \mu = 0$, $H_1: \mu > 0$
  – Test statistic: $t = (\bar{Y} - 0) / (s / \sqrt{n})$
  – For the schizophrenia example, t = 3.23, p-value (1-sided) = .003

Matched pairs t-test in JMP
• Click Analyze, Matched Pairs, and put the two columns (e.g., affected and unaffected) into Y, Paired Response.
• Can also use a one-sample t-test. Click Analyze, Distribution, and put the difference into Y, Columns. Then click the red triangle under the difference and click Test Mean.
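Outside JMP, the same test can be sketched in a few lines. This is an illustrative sketch using only the Python standard library, not the lecture's method. The helper names (`t_statistic`, `t_density`, `two_sided_p_value`) are hypothetical, and the p-value comes from numerically integrating the Student's t density with Simpson's rule rather than from a statistics package. The t ratio and sample size are the ones reported in the JMP output; the five-value sample at the end is made up purely to exercise `t_statistic`.

```python
import math
import statistics


def t_statistic(sample, mu0=0.0):
    """Return the t-ratio (Ybar - mu0) / (s / sqrt(n)) and its d.f. (n - 1)."""
    n = len(sample)
    ybar = statistics.fmean(sample)
    se = statistics.stdev(sample) / math.sqrt(n)   # stdev uses the n-1 divisor
    return (ybar - mu0) / se, n - 1


def t_density(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p_value(t, df, steps=2000):
    """P(|T| >= |t|), using Simpson's rule on the density over [0, |t|].
    `steps` must be even."""
    b = abs(t)
    h = b / steps
    total = t_density(0.0, df) + t_density(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    area = total * h / 3           # P(0 <= T <= |t|)
    return 1.0 - 2.0 * area


# t ratio and sample size as reported in the JMP output
t_obs, df = 3.2289280811, 15 - 1
p_two = two_sided_p_value(t_obs, df)
p_one = p_two / 2   # upper-tail p-value; appropriate here because t_obs > 0
print(f"two-sided p = {p_two:.4f}, one-sided p = {p_one:.4f}")

# Hypothetical differences, used only to show t_statistic on raw data
t_demo, df_demo = t_statistic([1.0, 2.0, 3.0, 4.0, 5.0], mu0=0.0)
print(f"demo t = {t_demo:.3f} on {df_demo} d.f.")
```

The two-sided value should land close to the .006 reported by JMP, and halving it gives the one-sided p-value of about .003.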
Confidence Intervals
• Point estimate: a single number used as the best estimate of a population parameter, e.g., $\bar{Y}$ for $\mu$.
• Interval estimate (confidence interval): a range of values used as an estimate of a population parameter.
• Uses of a confidence interval:
  – Provides a range of values that is “likely” to contain the true parameter. A confidence interval can be thought of as the range of values for the parameter that are “plausible” given the data.
  – Conveys the precision of the point estimate as an estimate of the population parameter.

Confidence interval construction
• A confidence interval typically takes the form: point estimate ± margin of error
• The margin of error depends on two factors:
  – Standard error of the estimate
  – Degree of “confidence” we want
  – Margin of error = multiplier for degree of confidence × SE of estimate
  – For a 95% confidence interval, the multiplier for degree of confidence is about 2 in most cases.

CI for population mean
• If the population distribution of Y is normal (* we will study the “if” part later), the 95% CI for the mean of a single population is:
  $\bar{Y} \pm t_{n-1}(.975) \times SE(\bar{Y}) = \bar{Y} \pm t_{n-1}(.975) \times s / \sqrt{n}$
• For the schizophrenia data: 0.199 cm³ ± 2.145 × 0.0615 cm³ = 0.067 cm³ to 0.331 cm³

Interpretation of CIs
• A 95% confidence interval will contain the true parameter (e.g., the population mean) 95% of the time if repeated random samples are taken.
• It is impossible to say whether it is successful or not in any particular case, i.e., we know that the CI will usually contain the true mean under random sampling, but we do not know for the schizophrenia data whether the CI (0.067 cm³, 0.331 cm³) contains the true mean difference.

Confidence Intervals in JMP
• For both methods of doing the paired t-test (Analyze, Matched Pairs or Analyze, Distribution), the 95% confidence intervals for the mean are shown on the output.
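As a rough check of the interval arithmetic and of the repeated-sampling interpretation, here is a small sketch using only the Python standard library. It is not part of the lecture. It takes the standard error as 0.0615, i.e. the estimated mean 0.1987 divided by the reported t ratio 3.229. The function names and the normal population used in the coverage simulation (mean 0.2, SD 0.24) are hypothetical choices for illustration.

```python
import math
import random
import statistics


def confidence_interval(estimate, std_error, multiplier):
    """Point estimate +/- multiplier * standard error."""
    margin = multiplier * std_error
    return estimate - margin, estimate + margin


ybar = 0.199     # mean difference in cm^3, from the lecture
t_mult = 2.145   # t_14(.975), as quoted in the lecture
se = 0.0615      # assumed SE: 0.1987 / 3.229 from the JMP output

low, high = confidence_interval(ybar, se, t_mult)
print(f"95% CI: {low:.3f} to {high:.3f} cm^3")


def coverage(mu, sigma, n, multiplier, reps=4000, seed=1):
    """Fraction of repeated normal samples whose CI contains the true mean.
    The population (mu, sigma) is a made-up example, not the twin data."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        sample_se = statistics.stdev(sample) / math.sqrt(n)
        lo, hi = confidence_interval(statistics.fmean(sample), sample_se, multiplier)
        hits += lo <= mu <= hi
    return hits / reps


cover = coverage(mu=0.2, sigma=0.24, n=15, multiplier=t_mult)
print(f"share of intervals containing the true mean: {cover:.3f}")
```

The coverage share should come out near 0.95, while any single interval either contains the true mean or does not, which is the point of the interpretation slide.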