Math 143                                                              Fall 2010
Last updated: November 3, 2010

1  Random Variables

A random variable is a variable whose value is

  1. numerical, and
  2. determined by the outcome of some random process.

The distribution of a random variable describes the possible values of the random variable and the probability of having those values. That is, it answers the two questions:

  • What values?
  • With what frequency?

We use random variables to help us quantify the results of experiments for the purpose of analysis. Random variables are generally denoted using capital letters, like X, Y, Z.

1.1  Some handy "rules"

In the rules below, a and b are numbers and X and Y are random variables.

1. Linear Transformations
   (a) E(aX) = a E(X)
   (b) E(X + b) = E(X) + b
   (c) Var(aX) = a² Var(X)
   (d) Var(X + b) = Var(X)
   (e) If X is normal, then aX and X + b are also normal. (This is a special property of normal distributions and few others.)

2. Sums
   (a) E(X + Y) = E(X) + E(Y) for any random variables X and Y.
   (b) Var(X + Y) = Var(X) + Var(Y) only in the special case that X and Y are independent.
   (c) If X and Y are normal and independent, then X + Y is normal too.

3. Differences (these rules follow from the ones above)
   (a) E(X − Y) = E(X) − E(Y)
   (b) Var(X − Y) = Var(X) + Var(Y) only in the special case that X and Y are independent.
       • This is the one that gets people: the variance of a difference is the sum of the variances.
   (c) If X and Y are normal and independent, then X − Y is normal too.

Examples. Suppose we have random variables with means, standard deviations, and variances given in the following table.

   variable   mean   variance   st dev
   X          45     9          3
   Y          60     16         4

We can use this information to compute the means, variances, and standard deviations of other variables, including the ones in the following table. (The results for variance and standard deviation are only correct if the variables are independent.)
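The rules above can be checked with a few lines of arithmetic. Here is a minimal Python sketch (not part of the original notes), using the X and Y from the table and assuming they are independent:

```python
import math

# X and Y from the table above: E(X) = 45, Var(X) = 9; E(Y) = 60, Var(Y) = 16.
mean_X, var_X = 45, 9
mean_Y, var_Y = 60, 16

# Linear transformation 2X + 5:  E(aX + b) = a E(X) + b,  Var(aX + b) = a² Var(X)
mean_lin = 2 * mean_X + 5
var_lin = 2 ** 2 * var_X

# Sum and difference (the variance rules need independence).
mean_sum, var_sum = mean_X + mean_Y, var_X + var_Y
mean_diff, var_diff = mean_X - mean_Y, var_X + var_Y  # variances still ADD

print(mean_lin, var_lin, math.sqrt(var_lin))     # 95 36 6.0
print(mean_sum, var_sum, math.sqrt(var_sum))     # 105 25 5.0
print(mean_diff, var_diff, math.sqrt(var_diff))  # -15 25 5.0
```

The printed values reproduce the mean, variance, and standard deviation columns of the table that follows.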
   variable   mean              variance      st dev
   2X + 5     2 · 45 + 5 = 95   4 · 9 = 36    √36 = 2 · 3 = 6
   X + Y      45 + 60 = 105     9 + 16 = 25   √(3² + 4²) = √25 = 5
   X − Y      45 − 60 = −15     9 + 16 = 25   √(3² + 4²) = √25 = 5

1.2  Binomial Random Variables

1. The Binom(n, p) situation:
   (a) n trials. A random process is repeated n times (n known in advance). Each repetition is called a trial.
   (b) Two outcomes. Each trial has two outcomes (often called success and failure).
   (c) Constant probability of success. The probability of success is the same for each trial and is denoted p.
   (d) Independent trials. Each trial is independent of the others.
   (e) Count successes. The Binom(n, p) random variable counts the number of successes.

2. When n = 1, we call the binomial random variable a Bernoulli random variable.

3. Formulas for the expected value and variance of a Binom(n, p) random variable:
   • Expected value: np
   • Variance: np(1 − p)
   • Standard deviation: √(np(1 − p))

4. There is a computational formula for determining binomial probabilities, but we will usually use a computer for this. (In StatCrunch: Stat > Calculators > Binomial)

5. Binom(n, p) ≈ Norm(np, √(np(1 − p)))
   (a) The approximation is better when n is larger and when p is closer to 1/2.
   (b) Rule of Thumb: The approximation is good enough for us if np ≥ 10 and n(1 − p) ≥ 10. That is, if we would expect at least 10 successes and at least 10 failures.
   (c) This result follows from a mathematical theorem called the Central Limit Theorem.

6. The binomial distribution is often a good model for sampling for a proportion:
   (a) Each person in the sample is a trial.
   (b) They give one of two responses (yes/no, male/female, smoker/non-smoker, etc.).
   (c) p is the true (unknown) probability in the population.
   (d) We need to be sure that the people in our sample form an independent, representative sample.

2  Sampling Distributions

1. A parameter is a number that describes a population or a process.
   • Example: the percentage of Americans wearing black socks today.
   • Usually the value of a parameter is unknown. (Do you know what percentage of Americans are wearing black socks today?)
   • Favorite parameters: population proportion (p), population mean (µ), population standard deviation (σ).

2. A statistic is a number that describes a sample.
   • Example: the percentage of people in my sample who are wearing black socks today.
   • If you know the data, you know the statistic – it can be calculated from the data.
   • Favorite statistics: sample proportion (p̂), sample mean (x̄), sample standard deviation (s).

3. A sampling distribution is the distribution of a statistic.
   • If we use randomization to collect our data (random sampling, random assignment to treatment groups, etc.), then the statistic is a random variable (why?).
   • The study of sampling distributions allows us to learn what our sample statistic tells us about a population parameter.

4. The sampling distribution for a sample count (the number of observations in a data set with a certain property) is

      ≈ Binom(n, p) ≈ Norm(np, √(np(1 − p)))

   • We can only use this if we have a random sampling method.
   • The binomial approximation is good enough if the population is much larger than the sample.
   • The normal approximation is good enough if, in addition, np ≥ 10 and n(1 − p) ≥ 10.

5. The sampling distribution for a sample proportion (the proportion of observations in a data set with a certain property) is

      "Binom(n, p)"/n ≈ Norm(p, √(p(1 − p)/n))

   • The approximation is good enough under the same conditions as above.

6. The sampling distribution for the sample mean (x̄):
   • expected value = µ (unbiased)
   • standard deviation = σ/√n
   • distribution ≈ Norm(µ, σ/√n) provided the sample size is large enough (Central Limit Theorem).
     ◦ Assumes a random sampling method and a large population (at least 10 times larger than the sample).
     ◦ Exact when the population is normal.
     ◦ Very good approximation even for small samples if the population is unimodal and roughly symmetric.
     ◦ The approximation requires larger samples if the population is skewed. In most cases, n = 30 is plenty big.

3  Inference for Proportions and Means

3.1  The Scenarios

The inference procedures in this section deal with four scenarios. (We'll learn more scenarios later.)

• 1-proportion. In this scenario we use a random sample of a categorical variable to learn about the proportion in the population that has a specific value (generically called success) for this categorical variable.
  Data: A single categorical variable.
  Example: Estimate what proportion of Grand Rapids residents will vote in the upcoming election.

• 2-proportion. In this scenario we use samples from two populations to investigate how different the proportions are in the two populations.
  Data: A categorical response variable plus a categorical variable indicating which population each observation is from.
  Example: Will the voting percentage be higher among women than among men?

• 1-sample t. In this scenario we use a random sample of a quantitative variable to learn about the mean value of that measurement in the population.
  Data: A single quantitative variable.
  Example: Estimate the mean length of lemur tails.

• 2-sample t. In this scenario we use samples from two populations to compare the mean of some quantitative measurement between the two populations.
  Data: A quantitative response variable plus a categorical variable indicating which population each observation is from.
  Example: Are male lemur tails longer than female lemur tails (on average)?

Observational and experimental designs

We can use the 2-proportion and 2-sample t procedures to analyze the results of either observational or experimental designs. In an experiment with two treatments (e.g., drug vs. placebo), the two populations correspond to the two treatment options. In an observational study, we may classify subjects based on two variables without either of them being a treatment.
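Which of the four scenarios applies depends only on the shape of the data: a categorical vs. quantitative response, and one sample vs. two populations. A toy Python sketch of that decision (the function and argument names are mine, not part of the notes):

```python
def choose_scenario(response_type: str, n_groups: int) -> str:
    """Map the data layout described above to an inference scenario.

    response_type: "categorical" or "quantitative"
    n_groups: 1 for a single sample, 2 when a categorical variable
              splits the observations into two populations
    """
    if response_type == "categorical":
        return "1-proportion" if n_groups == 1 else "2-proportion"
    if response_type == "quantitative":
        return "1-sample t" if n_groups == 1 else "2-sample t"
    raise ValueError("response_type must be 'categorical' or 'quantitative'")

# The voting comparison (women vs. men) and the lemur-tail estimate:
print(choose_scenario("categorical", 2))   # 2-proportion
print(choose_scenario("quantitative", 1))  # 1-sample t
```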
Recall that in an experiment, the values of at least one explanatory variable are determined by the researchers (usually using randomness).

3.2  The Big Picture

These four scenarios lead to very similar procedures. Seeing the similarities can help you remember how things go.

3.2.1  Confidence Intervals

All four confidence intervals have the general form

    data value ± (critical value) · SE

The table below shows how to obtain these values for each scenario.

    scenario       data value   ±   critical value   SE (standard error)
    1-proportion   p̂            ±   z*               √(p̂(1 − p̂)/n)
    2-proportion   p̂₁ − p̂₂      ±   z*               √(p̂₁(1 − p̂₁)/n₁ + p̂₂(1 − p̂₂)/n₂)
    1-sample t     x̄            ±   t*               s/√n = √(s²/n)
    2-sample t     x̄₁ − x̄₂      ±   t*               √(s₁²/n₁ + s₂²/n₂)

3.2.2  Hypothesis Testing

All hypothesis tests follow our 4-step outline:

1. State the null and alternative hypotheses.

2. Calculate a test statistic. All the test statistics for these four scenarios have a similar form:

       t or z = (data value − hypothesis value) / (SD or SE)

3. Calculate a p-value. The p-value is the probability of obtaining a test statistic at least as extreme as the one calculated from the data, assuming that the null hypothesis is true.

4. Draw a conclusion. This step is exactly the same for all tests. You just need to understand what a p-value means in the context of your particular test.

    scenario       H₀                    test statistic   data value − hypothesis value   SD or SE
    1-proportion   H₀: p = p₀            z                p̂ − p₀                          SD = √(p₀(1 − p₀)/n)
    2-proportion   H₀: p₁ − p₂ = 0       z                p̂₁ − p̂₂                        SE = √(p̂(1 − p̂)/n₁ + p̂(1 − p̂)/n₂)
    1-sample t     H₀: µ = µ₀            t                x̄ − µ₀                          SE = s/√n = √(s²/n)
    2-sample t     H₀: µ₁ − µ₂ = µdiff   t                (x̄₁ − x̄₂) − µdiff               SE = √(s₁²/n₁ + s₂²/n₂)

Note: For the 2-proportion test, we use a pooled estimate p̂ = (x₁ + x₂)/(n₁ + n₂) in our formula for the standard error, because the null hypothesis tells us the two proportions are the same.
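As a worked instance of the 1-proportion rows above, here is a minimal Python sketch (the notes themselves use StatCrunch; the data here, 560 voters in a sample of n = 1000, are made up for illustration):

```python
import math

# Hypothetical data: 560 of 1000 sampled residents say they will vote.
n, x = 1000, 560
p_hat = x / n                              # sample proportion

# 95% confidence interval: p_hat ± z* · SE, with z* = 1.96.
se = math.sqrt(p_hat * (1 - p_hat) / n)
lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se

# Test H0: p = p0 with p0 = 0.5.  The SD uses the hypothesized p0.
p0 = 0.5
sd = math.sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / sd                      # ≈ 3.79
# Two-sided p-value from the normal approximation.
p_value = math.erfc(abs(z) / math.sqrt(2))

print(f"p-hat = {p_hat:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
print(f"z = {z:.2f}, p-value = {p_value:.5f}")
```

Since the p-value is far below 0.05, this (hypothetical) sample would give strong evidence that more than half of the residents plan to vote.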