A.P. Statistics Exam Study Guide
Lacey Kaplan

Unit 1

Important introductory terms:
Individuals (observational units): The objects described by a set of data. They can be people, but they can also be just about anything.
Variable: Any characteristic of an individual.
Distribution: Describes the different values a variable takes and the frequency (or relative frequency) with which it takes each value.
Categorical variable: Places individuals into categories such as race, gender, etc. Displayed with bar graphs or pie/circle graphs.
Quantitative variable: Takes on numerical values about the individual such as height, weight, IQ score, etc. Displayed with histograms, stemplots, and box plots.

3 key features of a distribution:
Shape: Symmetric (values are balanced) or skewed (one end of the distribution stretches out further than the other).
Center: Describes the middle of the distribution (mean or median).
Spread: Variation in the data (range, standard deviation, or IQR).

Other features of a distribution:
Outlier: A value that is not part of the overall pattern.
Mode: Peak(s) or cluster(s) of a distribution.
Bimodal: A distribution with 2 peaks.
Mean ($\bar{x}$ for a sample, $\mu$ for a population): The average of the values of the variable.
Median: The midpoint of the values of the variable.
First quartile (Q1): The midpoint of the lower half of the data.
Third quartile (Q3): The midpoint of the upper half of the data.
Five-number summary: Minimum, Q1, Median, Q3, Maximum.
Q1 is the middle value of the bottom half of the data. If the median falls between two numbers, start at the lower of those two numbers and count back.
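To check a five-number summary by hand, it can help to see the rule spelled out. The sketch below assumes the "median of each half" rule described above (the median itself is left out of both halves when n is odd). The function name and the `scores` data are made up for illustration, and a calculator may use a slightly different quartile rule.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]                              # bottom half
    upper = data[half + 1:] if n % 2 else data[half:]  # top half (skip median if n is odd)
    return (data[0], median(lower), median(data), median(upper), data[-1])

# Hypothetical data set, not from the guide
scores = [2, 4, 5, 7, 8, 10, 11, 13, 20]
summary = five_number_summary(scores)
print(summary)
```

For the nine made-up scores, the bottom half is 2, 4, 5, 7 and the top half is 10, 11, 13, 20, so Q1 and Q3 are the midpoints of those two groups.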
Range: Distance between the minimum value and the maximum value.
Interquartile range (IQR): Distance between Q1 and Q3 (IQR = Q3 - Q1).
Outlier: Any value outside [Q1 - 1.5 x IQR, Q3 + 1.5 x IQR].
Standard deviation ($s_x$ or $\sigma_x$): Roughly the typical distance of the values of the variable from the mean.
Variance: $s_x^2$ or $\sigma_x^2$ (the square of the standard deviation).
Resistant measure: A value that is relatively unaffected by changing a small proportion of the total number of values.
Density curve: Has an area of exactly 1 underneath it and describes the overall pattern of a distribution.
Skewed right: Mean > Median. Symmetric: Mean = Median. Skewed left: Mean < Median.
Standard z-score: $z = \frac{x - \bar{x}}{s}$. This value tells you how many standard deviations your value is from the mean.

The empirical rule (for a normal distribution):
68% of the data is within plus or minus 1 standard deviation of the mean.
95% of the data is within plus or minus 2 standard deviations of the mean.
99.7% of the data is within plus or minus 3 standard deviations of the mean.

Four important formulas:
$\bar{x} = \frac{\sum x}{n}$
$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
$z = \frac{x - \bar{x}}{s}$

Z-score applications:
1) Assume the distribution is normal.
2) Sketch the normal curve.
3) Use the formula to find the z-score.
4) Use normalcdf to find the %.
5) Answer in the context of the question.

Calculator functions for Unit 1:
To find the % of the normal curve above, below, or between given z-scores: normalcdf(min, max)
To find the z-score for a given area: invNorm(proportion below)
Example: Find the z-score associated with the top 20% of the normal curve. The top 20% starts where 80% of the curve is below, so invNorm(.8) = .8416.

Unit 2:
Bivariate data: Data involving two quantitative variables. The x-axis has the explanatory variable, and the y-axis has the response variable.
Key features:
1) Form: linear or nonlinear
2) Direction: positive or negative
3) Strength: weak, moderate, strong
We describe the association between two variables.

The Correlation Coefficient:
The correlation coefficient, r, is the statistic that measures the strength and direction of a linear association between two variables.
There are several formulas for r, but for now it is sufficient to know that:
1) r is a value from -1 to 1, never outside this range.
2) Positive values of r indicate positive association and negative values of r indicate negative association. If r = 0, there is no linear association.
3) The choice of x and y as the explanatory and response variables does not matter.
4) The scale of the x and y axes does not matter.
5) r is useful only for linear relationships and tells us nothing about nonlinear relationships.
6) r is not a resistant measure and is very sensitive to certain kinds of outliers known as influential points.
7) r can only be found for quantitative variables.

What values of r mean:
-1 to -.8      Strong negative correlation
-.8 to -.5     Moderate negative correlation
-.5 to -.3     Weak negative correlation
-.3 to .3      No correlation
.3 to .5       Weak positive correlation
.5 to .8       Moderate positive correlation
.8 to +1       Strong positive correlation

The formula for the correlation coefficient, r, is:
$r = \frac{1}{n - 1} \sum \left(\frac{x - \bar{x}}{s_x}\right)\left(\frac{y - \bar{y}}{s_y}\right)$
You can also think about it like this:
$r = \frac{\sum z_x z_y}{n - 1}$

Lurking variable: A variable outside the study that affects both variables. Example: temperature explains the fact that as crime goes up, so do ice cream sales. You therefore cannot say that a strong association implies causation.

LSRL (least squares regression line): $\hat{y} = b_0 + b_1 x$
$b_0$ = where you are starting (y-intercept)
$b_1$ = slope (rate of change)

Interpreting the LSRL:
$\widehat{APexam} = 6.97 + .1278(midterm)$
For every additional point you get on the midterm, we would expect your AP exam score to be 0.1278 higher.

Finding r: In standardized form the line is $\hat{z}_y = r z_x$, so slope = r.

Example: Fast food sandwiches
Protein: $\bar{x}$ = 17.2 g, $s_x$ = 14.0 g
Fat: $\bar{y}$ = 23.5 g, $s_y$ = 16.4 g
r = .83
$b_1 = r\left(\frac{s_y}{s_x}\right)$ = .83(16.4/14.0) = .972
The line of best fit goes through the point of averages, so you can plug in the averages to find $b_0$: $b_0$ = 23.5 - .972(17.2) = 6.8
Final equation: $\widehat{Fat} = 6.8 + .972(protein)$

Coefficient of determination, $r^2$: The % of variability in the response variable accounted for by the model.
How to write it: "$r^2$ (as a percent) of the variation in (response variable) is accounted for by this model."

Calculator functions and further notes for Unit 2:
Creating a scatterplot: Enter data into two lists. Press 2nd STAT PLOT.
Turn the desired plot on, choose the 1st type of plot, and enter the list name for the explanatory variable in Xlist and the response variable in Ylist. Choose ZOOM then 9.

Describe scatterplots by characterizing form, direction, and strength.
Form: The general pattern. Can be characterized as linear, non-linear, quadratic, exponential, etc.
Direction: Can be positively associated (above-average values of the explanatory variable correspond to above-average values of the response), negatively associated (above-average values of the explanatory variable correspond to below-average values of the response), or neither.
Strength: Describes how closely the points follow a clear (not necessarily linear) form. Scatterplots that show tight adherence to the general form have strong association; plots where the points have significant spread about the general form have weak association. Strength is difficult to determine by eye because the scaling of the axes can distort it.

Finding the correlation coefficient: Enter data into 2 lists. Press STAT, arrow to CALC, choose 4: LinReg(ax+b). Enter the two list names separated by a comma, then read r (correlation coefficient) and r2 (coefficient of determination) off the screen.

Characteristics of correlation and the correlation coefficient r:
1) The correlation coefficient will be the same regardless of which of the two variables is designated as the explanatory variable.
2) r does not change if the units of measure of x or y are changed, if each x or y value is multiplied by a positive constant, or if a constant is added to each x or y value.
3) Correlation measures only the strength of a linear relationship. Correlation does not describe curved relationships, no matter how strong.
4) Correlation is not a resistant measure.
5) Correlation does not imply causation (lurking variables).

Finding the regression line: Enter data into 2 lists.
Press STAT, arrow to CALC, choose 4: LinReg(ax+b). Enter the two list names separated by a comma, then (if desired) another comma and a function variable for graphing the line: choose VARS, arrow to Y-VARS, then 1: Function, and then choose Y1 or another desired Y variable. The slope and intercept of the regression line will be displayed, along with r and r2. The equation of the line will be inserted into Y1.

The regression line describes how the response variable (y) changes as the explanatory variable (x) changes.
LSRL: The line that minimizes the sum of the squared vertical distances from the data points to the line.
$\hat{y}$ = the predicted, or estimated, value of the response variable obtained from the LSRL.

Characteristics of the regression line:
1) It is important to correctly indicate the explanatory and response variables.
2) The least squares regression line always passes through the point $(\bar{x}, \bar{y})$.
3) The slope of the LSRL is $b_1 = r\frac{s_y}{s_x}$.
4) The y-intercept is $b_0 = \bar{y} - b_1\bar{x}$.
5) $r^2$, the coefficient of determination, indicates the proportion of the variation in the response variable that is explained by the least squares regression line.
6) In the regression line for the standardized points $(z_x, z_y)$, r is the slope and the y-intercept = 0.

Prediction with the regression line: Calculating $\hat{y}$ for a given x using the regression line equation.
Interpolation: Using the regression line to predict values within the range of the explanatory variable. Appropriate and usually accurate.
Extrapolation: Using the regression line to predict values outside the range of the explanatory variable. Dangerous and often inaccurate.

Residuals: The difference between the observed value of the response variable and the expected value, which is predicted by the regression line. It is simply the vertical distance the point is from the regression line. If the point is above the regression line, the residual is positive. If the point is below the regression line, the residual is negative.
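Here is a rough sketch of how the pieces above fit together without the calculator: r from z-scores, then the slope and intercept from r, the standard deviations, and the averages, then each residual as observed minus predicted. The function name and the small data set are made up for illustration.

```python
from statistics import mean, stdev

def lsrl(xs, ys):
    """Return (b0, b1, r) for the least squares regression line y-hat = b0 + b1*x."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    # r is the sum of the z-score products divided by n - 1
    r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
            for x, y in zip(xs, ys)) / (n - 1)
    b1 = r * s_y / s_x          # slope
    b0 = y_bar - b1 * x_bar     # line passes through (x-bar, y-bar)
    return b0, b1, r

# Hypothetical data, not from the guide
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b0, b1, r = lsrl(xs, ys)

# residual = observed y minus predicted y
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
print(b0, b1, r)
print(residuals, sum(residuals))
```

Points above the line come out with positive residuals and points below with negative ones, and the residuals add up to 0 apart from rounding.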
$residual = y - \hat{y}$

Characteristics of residuals:
1) The sum of the residuals is always equal to 0.
2) The mean of the residuals is always equal to 0.

Residual plot: A scatterplot of the residuals, plotted on the y-axis, vs. the explanatory variable on the x-axis. Visualize the regression line being rotated to make it horizontal while each point's distance from the line stays the same. If a residual plot shows no systematic pattern and an even scattering about the line y = 0, then the regression line is a good fit to the data.

Residual plots on the calculator: Whenever a regression is run on the calculator, the residuals are placed in a list called RESID. To view a residual plot, follow the directions for a scatterplot: press 2nd STAT PLOT, turn the desired plot on, choose the 1st type of plot, choose the explanatory variable for Xlist and RESID for Ylist. Choose ZOOM then 9.

Influential points: Individual data points which, if removed, dramatically change the regression line (slope and/or intercept) and the correlation coefficient.
1) An influential data point may have a small residual because it pulls the regression line toward itself.
2) The regression line is not resistant.

Unit 3

Lesson A1: Introduction to Probability
The probability of an event is a value between 0 and 1 and is a measure of how likely the event is to happen.
$Probability = \frac{\text{\# of times the event has occurred}}{\text{total \# of possibilities}}$
We use probability to describe how likely random events are to happen. A random event is one that is unpredictable in the short term, but has a regular distribution of outcomes in a large number of repetitions.
Examples of random events: tossing a thumbtack, playing roulette, tossing a die.
Random events are studied through a long series of trials which must be independent. Events are independent when one outcome has no bearing on future outcomes.
Law of large numbers: If you keep flipping a coin, the observed proportion of heads gets closer and closer to 50%.
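A quick way to see the law of large numbers is to simulate it. This is only a sketch: the function name, the checkpoints, and the seed are arbitrary choices, and a fair coin is assumed.

```python
import random

def running_proportion_heads(num_flips, seed=0):
    """Flip a simulated fair coin and record the proportion of heads at a few checkpoints."""
    rng = random.Random(seed)  # fixed seed so the run can be repeated
    heads = 0
    checkpoints = {}
    for flip in range(1, num_flips + 1):
        heads += rng.random() < 0.5  # True counts as 1 head
        if flip in (10, 100, 1000, 10000, 100000):
            checkpoints[flip] = heads / flip
    return checkpoints

result = running_proportion_heads(100000)
for flips, proportion in result.items():
    print(flips, proportion)
```

The early proportions can be far from .5, but the later ones settle close to it. The probability of heads on any single flip never changes; only the long-run proportion settles down.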
Or: add. And: multiply.

Vocabulary:
Event: A particular outcome or set of outcomes (trials).
Sample space: The set of all possible outcomes.
Mutually exclusive or disjoint events: Two events that have no outcomes in common.

Example of a probability table:
             Rain    No Rain    Total
Saturday     40%     60%        100%
Sunday       70%     30%        100%
Rain Sat. and Sun.: (.40)(.70) = 28%
No rain Sat. and Sun.: (.60)(.30) = 18%

Rules of Basic Probability:
1. $0 \le P(X) \le 1$
2. The sum of the probabilities for all possible outcomes in a sample space must total one.
3. P(not X) = 1 - P(X)
4. If events A and B are disjoint, then P(A or B) = P(A) + P(B).
5. Disjoint events are said to be mutually exclusive. This means that they have no outcomes in common.

Sample Problems for Lesson A1:
Suppose that 40% of cars in your area are manufactured in the United States, 30% in Japan, 10% in Germany, and 20% in other countries. If cars are selected at random, find the probability that:
1) A car is not U.S. made: 1 - .40 = 60%
2) It is made in Japan or Germany: .30 + .10 = 40%
3) You see two cars in a row from Japan: (.30)(.30) = 9%
4) None of three cars come from Germany: (.90)^3 = 72.9%
5) At least one of three cars is U.S. made: 1 - (.60)^3 = 78.4%
6) The first Japanese car is the fourth one you see: (.70)^3(.30) = 10.29%
7) At least one of five cars is from Germany: 1 - (.90)^5 = 40.95%
More on: A.P. Statistics Quiz A- Chapter 14 Worksheet

Lesson A2: Mutually Exclusive and Independent Events
Two events are mutually exclusive if they cannot occur at the same time (i.e. they have no common outcomes).
Ex: If you flip a coin, it is heads or tails, not both.
Two events are independent if the fact that A occurs does not affect the probability of B occurring.
Ex: If you roll 2 dice, the result of one die does not affect the result of the other.

Addition and Multiplication Rules:
Key: $\cup$ = or, $\cap$ = and, B|A = B given that A has occurred
(Addition "Or" Problems): $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
(Multiplication "And" Problems): $P(A \cap B) = P(A) \cdot P(B)$ when the events are independent
When the events are not independent: $P(A \cap B) = P(A) \cdot P(B|A)$
(Given Problems): $P(B|A) = \frac{P(A \cap B)}{P(A)}$

Sample Problems on Lesson A2:
            Boy    Girl    Total
Grades      117    130     247
Popular      50     91     141
Sports       60     30      90
Total       227    251     478
The table above shows the number of randomly chosen 4th, 5th, and 6th graders whose primary goal was to get good grades, to be popular, or to be good at sports.
1) Based on the table, a randomly chosen child has the following probabilities:
P(girl) = 251/478
P(girl and popular) = 91/478
P(sports) = 90/478
Conditional probability takes into account that a given condition has already occurred. For example, the probability of choosing a student whose goal is sports, given that we have selected a girl, is: P(sports|girl) = 30/251
2) Find the probability of choosing a student that desires good grades given a boy is chosen: 117/227
3) Given that a boy is chosen, find the probability of choosing a student that desires to be popular: 50/227

Police report that 78% of drivers stopped on suspicion of drunk driving are given a breath test, 36% a blood test, and 22% both tests. What is the probability that a randomly selected DWI suspect is given:
Make a Venn diagram:
Breath only: 78% - 22% = 56%
Both: 22%
Blood only: 36% - 22% = 14%
Neither: 100% - 56% - 22% - 14% = 8%
More on: SCPM- General Rules of Probability Sheet

Lesson A3: Tree Diagrams
Sample Problems on Lesson A3:
Employment data at a large company revealed that 72% of workers are married, that 44% are college graduates, and that half of the college graduates are married.
MARRIED (.72):      COLLEGE (.30): 22%      NO COLLEGE (.70): 50.4%
UNMARRIED (.28):    COLLEGE (.78): 22%      NO COLLEGE (.22): 6%
More on: AP STAT Tree Diagram Practice Sheet
See "AP STAT- Probability Quiz Review" and "AP STAT- Probability Quiz" for more on Lesson A.

Lesson B1: Expected Value
$E(x) = \sum x \cdot P(x)$
Example 1: Game with a die.
Roll 1: win $1. Roll 2: win $2. Roll 3: win $3. Roll 4: win $4. Roll 5: win $5. Roll 6: win $6.
Win      1      2      3      4      5      6
Prob.    1/6    1/6    1/6    1/6    1/6    1/6
E(x) = 1(1/6) + 2(1/6) + 3(1/6) + 4(1/6) + 5(1/6) + 6(1/6)
E(x) = 21/6
E(x) = 3 1/2 = $3.50
$3.50 is the average winning.

Example 2: Mrs. Smith has a litter of 7 golden puppies: 3 male and 4 female. You randomly choose 2 puppies. Find the expected number of male puppies.
# Males    Prob.
0          (4/7)(3/6) = 12/42
1          (3/7)(4/6) + (4/7)(3/6) = 24/42
2          (3/7)(2/6) = 6/42
*All of the probabilities must add up to 1.
E(x) = 0(12/42) + 1(24/42) + 2(6/42) = 6/7 male puppies

How to find expected value on your calculator:
List 1: Row 1 (the values)
List 2: Row 2 (the probabilities)
STAT, CALC, 1-Var Stats L1, L2
$\bar{x}$ = expected value (A.K.A. average)
$\sigma_x$ = standard deviation

Example 3: A couple plans to have children until they have a girl, but they agree that they will not have more than three children even if all are boys.
1) Find the expected number of children.
# Children    Prob.
1             1/2
2             (1/2)^2 = 1/4
3             (1/2)^3 + (1/2)^3 = 1/4
E(x) = 1(1/2) + 2(1/4) + 3(1/4) = 1.75 children
2) Find the expected number of boys that they will have.
# Boys    Prob.
0         1/2
1         1/4
2         1/8
3         1/8
E(x) = 0(1/2) + 1(1/4) + 2(1/8) + 3(1/8) = .875 boys

Example 4: In a litter of seven kittens, three are female. You pick two kittens at random.
1) What is the expected number of male kittens you would get?
# Males    Prob.
0          (3/7)(2/6) = 6/42
1          (4/7)(3/6) + (3/7)(4/6) = 24/42
2          (4/7)(3/6) = 12/42
E(x) = 0(6/42) + 1(24/42) + 2(12/42) = 8/7 male kittens
See "AP Stat-Probability Models" and "Probability Model Quiz" for more on Lesson B1.

Lesson B2: The Wieland and Liza Problem
Given a random variable x:
E(x + c) = E(x) + c (c is a constant)
E(x + y) = E(x) + E(y)
Standard deviation of a sum (x + y) or a difference (x - y) of independent variables: $\sqrt{\sigma_x^2 + \sigma_y^2}$

Example 1: Wieland and Liza have cereal for breakfast each morning. Wieland has an average of 14 oz of cereal (st. dev. 3 oz) and 8 oz of milk (st. dev. 1.4 oz). Liza has an average of 7.5 oz of cereal (st. dev. 1.3 oz) and 5 oz of milk (st. dev. .8 oz).
1) What is the average size of Wieland's breakfast and its standard deviation?
Average size: (14 + 8) = 22
Standard deviation: $\sqrt{3^2 + 1.4^2}$ = 3.31
2) Wieland goes on a diet and eats only ½ of his usual breakfast.
Find the new average and standard deviation.
New average: 22/2 = 11
New standard deviation: 3.31/2 = 1.655
3) What is Liza's average breakfast for the entire week?
Per day average: 7.5 + 5 = 12.5
Per day st. dev.: $\sqrt{1.3^2 + .8^2}$ = 1.53
Per week average: 12.5 + 12.5 + 12.5 + 12.5 + 12.5 + 12.5 + 12.5 = 12.5 x 7 = 87.5
Per week st. dev.: $\sqrt{(1.53)^2 + (1.53)^2 + (1.53)^2 + (1.53)^2 + (1.53)^2 + (1.53)^2 + (1.53)^2} = \sqrt{7(1.53)^2}$ = 4.05

If each event is independent and there is a normal distribution:
For two variables: $z = \frac{x - \mu}{\sigma}$
z = z-score
x = difference in x and y
$\mu$ = average difference
$\sigma$ = standard deviation of the difference
For one variable: $z = \frac{x - \mu}{\sigma}$
z = z-score
x = the value in question
$\mu$ = average
$\sigma$ = standard deviation

Example of "for one variable": Wieland has an average of 22 oz and a standard deviation of 3.31. Find the probability that Wieland has more than 25 oz.
$z = \frac{25 - 22}{3.31}$ = .906
Because you are looking for the probability that he has MORE than 25 oz: 2nd VARS, normalcdf(.906, 99)
Answer: P(z > .906) = .182. There is an 18.2% chance he has more than 25 oz.

Example of "for two variables": Find the probability that Liza has a bigger breakfast than Wieland.
W: $\mu$ = 22, $\sigma$ = 3.31
L: $\mu$ = 12.5, $\sigma$ = 1.53
Average difference: 22 - 12.5 = 9.5
Standard deviation of the difference: $\sqrt{(3.31)^2 + (1.53)^2}$ = 3.65
$z = \frac{0 - 9.5}{3.65}$ = -2.6
(You make x zero because you want the difference between Wieland and Liza to be less than zero, because that would mean that Liza has MORE than Wieland. You are looking for a negative z-score.)
normalcdf(-99, -2.6) = .00466
P(z < -2.6) = .00466

Lesson B3: Geometric and Binomial Probability
Preliminary Example: 20% of Cheerios boxes have a Dora prize. How many boxes would Wieland have to buy so Liza gets a prize?
P(first win on 5th box) = (.8)^4(.2) = .08192

Geometric and Binomial Restrictions:
1) Each trial is independent.
2) There are only 2 options: "success" or "failure" ("p" or "q").

Basic Vocabulary:
Permutation: Order matters.
Combination: Order doesn't matter.
nCr = number of combinations. On the calculator: n, MATH, PRB, nCr, r.
$_nC_r = \frac{n!}{r!(n - r)!}$
$P = {}_nC_r \cdot p^r (1 - p)^{n - r}$
binompdf(n, p, r): 2nd, VARS, binompdf

Example 1: Picking 5 boxes and winning exactly 3 times.
$P = {}_5C_3 (.2)^3 (.8)^2$
binompdf(5, .2, 3)

Lesson B4: Binomial Probability
Make sure:
1) Each trial is independent.
2) There is only success and failure.
Example 1: 10% of the population is left handed. 4 people walk in the room. Find the probabilities:
# LH Ppl    Prob.
0           1(.9)^4
1           4(.1)^1(.9)^3
2           6(.1)^2(.9)^2
3           4(.1)^3(.9)^1
4           1(.1)^4
How you would use binompdf for these problems: for example, for exactly 2 LH people you would do binompdf(4, .10, 2).

Lesson B5: Binomial Probability and the Normal Model
When the numbers are large, the normal model can be used, if we expect:
$np \ge 10$
$nq \ge 10$
$E(x) = np$
$\sigma = \sqrt{npq}$
Example 1: The American Red Cross needs at least 1850 O-Negative donors. If 6% of donors are O-Negative, find the probability that a group of 35,000 people has at least 1850 O-Negative donors.
n = 35,000, p = .06, q = .94
E(x) = (35,000)(.06) = 2100
$\sigma = \sqrt{(35,000)(.06)(.94)}$ = 44.4
Now go back to the z-score idea from Lesson B2:
$z = \frac{1850 - 2100}{44.4}$ = -5.63
P(z > -5.63) ≈ 1.00

Lesson B6: binomcdf Uses
Example: An archer hits the bull's eye 80% of the time. He shoots 6 arrows.
# Bull's eyes    Prob.
0                6.4 x 10^-5
1                .001536
2                .01536
3                .08192
4                .24576
5                .3932
6                .26214
(Find the probabilities based on what you learned in Lesson B4.)
1) P(6th is first miss) = (.8)^5(.2)
2) P(misses exactly once) = binompdf(6, .8, 5) or binompdf(6, .2, 1)
3) P(more than 3 bull's eyes) = binompdf(6, .8, 4) + binompdf(6, .8, 5) + binompdf(6, .8, 6)
Wait, there must be a quicker way...
binomcdf(n, p, c) adds up everything AT OR BELOW the number you put in for "c". This means for "more than", or anything having to do with going ABOVE that number, you must do 1 - binomcdf.
1 - binomcdf(6, .8, 3)
Visual expression of that: 0 1 2 3 | 4 5 6. The 4 numbers 0, 1, 2, 3 are the ones you do not want, so do 4 - 1 = 3. 3 is put in for c.
4) P(less than 5 bull's eyes): 0 1 2 3 4 | 5 6. The 5 numbers 0 through 4 are the ones you want, so do 5 - 1 = 4. 4 is put in for c.
binomcdf(6, .8, 4)
5) P(at least 4): "At least" has the same idea as "more than".
It is asking you to go above a number, so you have to do 1 - binomcdf. 0 1 2 3 | 4 5 6. The 4 numbers 0, 1, 2, 3 are the ones you do not want, so do 4 - 1 = 3. 3 is put in for c.
1 - binomcdf(6, .8, 3)
6) P(at most 3): 0 1 2 3 | 4 5 6. The 4 numbers 0, 1, 2, 3 are the ones you want, so do 4 - 1 = 3. 3 is put in for c.
binomcdf(6, .8, 3)

Lesson B7: Simulations
Simulation is a way to estimate probabilities when we are either unable to determine probabilities analytically or do not have the time, resources, or money to estimate the probability by observation. Some probabilities associated with gaming (poker, 21) are difficult to calculate by hand because the composition of a deck of cards is constantly changing. High speed computer simulations with many, many trials are used to estimate probabilities like these.
Example 1: Pascack Valley A.P. students have a 70% probability of achieving a score of "4" or "5" on the A.P. Exam in May. What is the probability that out of 5 randomly selected students at least 4 obtain a score of "4" or "5"?
Use the random number table so that 1, 2, 3, 4, 5, 6, 7 = success and 8, 9, 0 = failure. Choices could vary here.
Trial #    Digits    # Successes    At least 4?
1          19223     4              Yes
2          95034     3              No
3          05756     4              Yes
4          28713     4              Yes
5          96409     2              No
6          12531     5              Yes
7          42544     5              Yes
8          82853     3              No
9          73676     5              Yes
10         47150     4              Yes
7/10 trials are "Yes" for at least 4 students scoring a 4 or a 5, so the estimated probability is 70%.

Unit 4:
Methods of Sampling:
Random sampling: Each member of the population has an equal chance of being selected. Computers are often used to generate random telephone numbers.
Stratified sampling: Classify the population into at least two strata, then draw a sample from each.
Systematic sampling: Select every nth member.
Cluster sampling: Divide the population area into sections, randomly select a few of those sections, and then choose all members in them.
Convenience sampling: Use results that are readily available (NOT GOOD!).
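The differences between these sampling methods are easier to see side by side. The sketch below uses a made-up population of 200 students with invented "grade" and "homeroom" fields; the sample sizes and the seed are arbitrary assumptions.

```python
import random

rng = random.Random(1)  # fixed seed so the example can be repeated

# Hypothetical population: 200 students, 4 grades, 10 homerooms
population = [{"id": i, "grade": 9 + i % 4, "homeroom": i % 10} for i in range(200)]

# Random sampling: every student has an equal chance of being chosen
srs = rng.sample(population, 20)

# Stratified sampling: split into strata (grades), then sample from each one
stratified = []
for grade in (9, 10, 11, 12):
    stratum = [s for s in population if s["grade"] == grade]
    stratified.extend(rng.sample(stratum, 5))

# Systematic sampling: random starting point, then every 10th student
start = rng.randrange(10)
systematic = population[start::10]

# Cluster sampling: randomly pick 2 homerooms and take everyone in them
chosen_rooms = rng.sample(range(10), 2)
cluster = [s for s in population if s["homeroom"] in chosen_rooms]

print(len(srs), len(stratified), len(systematic), len(cluster))
```

Notice that the stratified sample is guaranteed to include every grade, while the cluster sample includes every student from only a few homerooms.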
Cautions about sample surveys:
1) Undercoverage: Some groups in the population are left out of the sample selection process.
2) Nonresponse: An individual chosen for a sample won't cooperate or can't be contacted.
3) Bias: A study is biased if it systematically favors certain outcomes.
4) Response bias: A respondent may not be truthful.
5) Interviewer bias: Interviewers may try to obtain certain answers by their attitude.
6) Question wording: Can influence a respondent's answer by leading them towards a particular point of view.
7) Small samples: Are not as accurate as large ones. Large samples decrease the margin of error of a sample.

Important terms:
Census: Survey of the entire population.
Sample: Piece of the entire population. You must write in detail how you take your random sample.
Population parameter: What we're looking at (a number that describes the population you are collecting data about).
Statistic: A summary of the sample data.

Notes on Observational Studies and Experimental Design:
Observational Studies
1) Retrospective study: A look back on past events.
2) Prospective study: The researcher identifies subjects in advance and collects data as events unfold.
3) Observational studies can be used to find trends or identify possible relationships, but not cause and effect.
4) Observational studies do not demonstrate a causal relationship.
Randomized, Comparative Experiments
1) A study that allows us to establish a cause and effect relationship.
2) The researcher identifies at least one explanatory variable (known as a factor) to manipulate and at least one response variable to measure.
3) An experiment:
a. Manipulates factor levels to create treatments
b. Randomly assigns subjects to these treatment levels
c. Compares the responses of the subject groups across treatment levels
d. The individuals that we experiment on are experimental units (subjects/participants if they are human)
e. The values for a factor are called levels
f. A treatment is a combination of specific levels from all the factors that an experimental unit receives
4) Four principles of experimental design:
a. Control: We must control sources of variation other than the factors we are testing.
b. Randomization: Allows us to equalize the effects of unknown or uncontrollable sources of variation.
c. Replication: Repeat the experiment, applying the treatments to a number of subjects. We sometimes replicate on different groups, allowing us to generalize the results.
d. Blocking: Group similar individuals together and then randomize within these blocks (not required).
5) New terms
a. Control treatment: A baseline measurement (control group).
b. Blinding: Not allowing individuals who can influence the results (subjects, treatment administrators, technicians) or those who evaluate the results (judges, treating physicians, etc.) to know which treatment was assigned.
Single-blind: When one of these groups is blinded.
Double-blind: When both groups are blinded.
c. Placebo: A "fake" treatment that looks just like the actual treatment.

Further notes:
Undercoverage is when someone who should be asked is not asked, and nonresponse bias is when people who are asked don't answer for a specific reason.
Census: Everyone in the population.
Population: Who you are trying to generalize to.
The people that answer are your sample. The sample is only the people that stop to answer you.
Simple random sample: Everyone has an equal chance of getting chosen.
Multistage sample: Use of more than one type of sampling method. Example: stratify by grade, break each grade into honors and not honors, then cluster (stratify and cluster).
A study can be observational or experimental.
Factors: What you are looking at.
o Example: Trying to find out if warming up or sleeping makes people run faster.
o Factors: sleep and warm up time
o Levels: 6 hours sleep, 8 hours sleep, 10 hours sleep; 20 minutes warm up, 0 minutes warm up
o Treatments: 6 treatments (levels of factor 1 x levels of factor 2 = 3 x 2)
Experimental units: Who you run the experiment on.
Response variable: Running times (for the above example).
4 principles in more depth:
o Randomization: Must randomly assign treatments.
o Replication: Have to be able to replicate it and run it exactly the same way. Different trials with different people.
o Control: Must have control to avoid confounding variables.
o Blocking: Don't have to block, but if you think you have to, you should.
Statistically significant: The difference is greater than what we would expect to happen randomly. You can find statistical significance in an observational study; you just can't prove causation in an observational study.
Placebo: A method of control and blinding, like a pill with nothing in it.
In an experiment you randomize the treatment, not the sample.
Control group: The group that represents the norm (in an experiment).
Placebo effect: Taking a sugar pill and thinking you are better.
Matched pairs: When we block into specific units to compare to each other. Example: twins, before and after pictures.

Unit 5:
The variability we see from sample to sample is the sampling error, or sampling variability.
Sampling distribution with proportions:
Assumptions and conditions:
Sample values are independent.
Sample size is large enough.
Random condition (for experiments: randomize assignment; for studies: randomize the sample).
You can't just write "assumptions and conditions have been met." You need to be specific about what exactly has been met.
Example: We know that 13% of the population is left handed. A 200 seat auditorium has been built with 15 lefty seats that have the built-in desk on the left rather than the right arm of the chair. In a class of 90 students, what is the probability that there will be enough seats for the left handed students?
Assumptions and Conditions:
Assume each student is independent and that 90 students is a large enough sample size.
Though the 90 students are not a random sample, we proceed as if they are.
90 is < 10% of all students that may end up in this lecture.
(.13)(90) = 11.7 > 10 successes
(.87)(90) = 78.3 > 10 failures
Find $SD(\hat{p})$ and conduct the z-score work. Find the probability that more than 15 students are left handed.
$\hat{p}$ = 15/90 = .167
p = 13/100 = .13
$SD(\hat{p}) = \sqrt{\frac{pq}{n}}$ = .0354
$z = \frac{\hat{p} - p}{SD(\hat{p})}$ = 1.05
P(z > 1.05) ≈ .147, so the probability that there are enough lefty seats is about 1 - .147 = .853.

In general: n < 10% of the population, and we expect at least 10 successes and 10 failures.
Given a population proportion, p:
$SD(\hat{p}) = \sqrt{\frac{pq}{n}}$

Central Limit Theorem (CLT): With averages
As the sample size gets larger, the sampling distribution gets more normal.
Same conditions and assumptions, except you don't need np > 10 and nq > 10.
We use $\bar{x}$ as an estimate of $\mu$ (population mean).
$SD(\bar{x}) = \frac{\sigma}{\sqrt{n}}$
The standard deviation of the sampling distribution is the "standard error". Use $s_x$ to estimate $\sigma$.
Example: If the average GPA is 3.2 with a standard deviation of .8, what is the probability that a picked person has a GPA > 4?
$z = \frac{4 - 3.2}{.8}$ = 1
Example: What is the probability that the average GPA among 25 students is > 3.3?
$z = \frac{3.3 - 3.2}{.8 / \sqrt{25}}$ = .625
*Make sure to look at former tests and quizzes to review.

Unit 6:
The first thing you need to know is that there are two types of data you can be given: Percents and Averages.
Note: Anything written on this sheet in pink pertains to percents, and anything on this sheet written in green pertains to averages.
Example of a problem using percents: Out of 109 people polled, 10 said they were in favor of the new policy. The percent is 9% ($\hat{p}$ represents percents).
Example of a problem using averages: The average grade for fourth period statistics is a 42. The average is a 42 ($\bar{x}$ represents averages).
Okay, there are four types of intervals/tests that can be done with percents, and five types of intervals/tests that can be done with averages.
Everything I can do with percents:
1-proportion z-interval (Confidence Interval)
1-proportion z-test (Hypothesis Test)
2-proportion z-interval (Confidence Interval)
2-proportion z-test (Hypothesis Test)

Everything I can do with averages:
1-sample t-interval (Confidence Interval)
1-sample t-test (Hypothesis Test)
o Matched Pair Test (Hypothesis Test)
2-sample t-interval (Confidence Interval)
2-sample t-test (Hypothesis Test)

Before we explain the specifics of all of the nine intervals/tests, let's review exactly what all of these things are doing: using a sample ($\hat{p}$ or $\bar{x}$, depending on whether it is a percent or average you are given) to make an estimate (confidence interval) or judgment (hypothesis test) about a population parameter (p or $\mu$, depending on whether it is a percent or average you are given).

We are now going to review everything pertaining to percents, or all of your "proportion" z-tests and z-intervals.

1. The 1-proportion z-interval
When do I use it?
When you are asked to perform a confidence interval and you are only given one percent. A confidence interval is an estimate of a population parameter.
The assumptions and conditions:
1. Independence
2. Randomization (random sample or randomized experiment)
3. $n\hat{p} \ge 10$ and $n\hat{q} \ge 10$
4. n < 10% of the population
The Steps:
1) Write out your proportion
$\hat{p}$ = whatever percent you are given
2) Solve for your standard error
$SE(\hat{p}) = \sqrt{\frac{\hat{p}\hat{q}}{n}}$
3) Find your z*
Ones you should know so you don't always have to solve for it:
90% = 1.65
95% = 1.96
98% = 2.33
He will usually not ask you for any other than those, but if you just want to know how...
Example: Finding it for a 90% C.I.
1.00 - .90 = .10
.10/2 = .05
invNorm(.05) = -1.65
invNorm(.95) = 1.65 (where .95 = .90 + .05)
It is 1.65.
4) Calculate your margin of error
$ME = z^* \cdot SE(\hat{p})$
5) Write out your confidence interval
$\hat{p} \pm ME$
6) Write out your conclusion
I am _% confident that the population proportion is between ____ and ____.

2. The 1-proportion z-test
When do I use it?
When I am given one percent and asked if I think the true percent is actually greater than, less than, or simply not that percent.
The assumptions and conditions:
1. Independence
2. Randomization (random sample or randomized experiment)
3. $np \ge 10$ and $nq \ge 10$
4. n < 10% of the population
Steps:
1) Write out the null and alternative hypotheses
$H_0$: p = whatever the problem says the percent is
$H_A$: p <, >, or ≠ whatever the problem says the percent is
2) Choose an alpha level
Use .05 unless the problem says otherwise.
3) Write out knowns
You always know $\hat{p}$ and n.
4) Calculate the standard deviation
$SD(\hat{p}) = \sqrt{\frac{pq}{n}}$
5) Calculate the z-score and find the p-value
$z = \frac{\hat{p} - p}{SD(\hat{p})}$
Then solve for the probability that the z-score is greater than or less than whatever you find it to be. Or, if you are testing whether p is simply not equal to a given value, you do a two-tailed test and do 2 x the probability of a z-score more extreme than the one you found. You use normalcdf for this just as you would for any z-score. Your answer is your p-value.

If your p-value is less than alpha:
In repeated samples, and assuming the null hypothesis is true, we would expect results at least as extreme as these (p-value) percent of the time. Since the p-value is less than alpha, I reject the null hypothesis. There is evidence that the alternative hypothesis is true.
Possibility of a: Type I Error.
If your p-value is greater than alpha:
In repeated samples, and assuming the null hypothesis is true, we would expect results at least as extreme as these (p-value) percent of the time. Since the p-value is greater than alpha, we fail to reject the null hypothesis. There is not enough evidence that the null hypothesis is false.
Possibility of a: Type II Error.

3. The 2-proportion z-interval
You simply do:
$(\hat{p}_1 - \hat{p}_2) \pm z^* \sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}}$
When do I use this?
When I am given more than one percent.
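As a rough check on calculator work, the 2-proportion z-interval can be sketched in a few lines: take the difference in sample proportions, then add and subtract z* times the standard error. The function name and the counts (60 out of 100 vs. 45 out of 100) are made up for illustration, and the conditions still have to be checked separately.

```python
from math import sqrt
from statistics import NormalDist

def two_prop_z_interval(x1, n1, x2, n2, confidence=0.95):
    """Confidence interval for p1 - p2 (unpooled standard error)."""
    p1, p2 = x1 / n1, x2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    # z* leaves (1 - confidence)/2 in each tail, like invNorm on the calculator
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    me = z_star * se  # margin of error
    diff = p1 - p2
    return diff - me, diff + me

# Hypothetical counts, not from the guide
low, high = two_prop_z_interval(60, 100, 45, 100)
print(round(low, 4), round(high, 4))
```

If 0 is not inside the interval, that suggests a real difference between the two population percents at that confidence level.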
Note* The assumptions and conditions are the same as for a 2-proportion z-test; you just use $\hat{p}$ in a confidence interval assumption and $p$ in a hypothesis test assumption.

4. The 2-proportion z-test

Assumptions and Conditions:
1. Both samples must be simple random samples
2. The samples are chosen independently
3. Each sample is less than 10% of its population
4. $n_1p_1 \ge 10$, $n_1q_1 \ge 10$, $n_2p_2 \ge 10$, $n_2q_2 \ge 10$

For 2-proportion z-tests ONLY! You use something called $\hat{p}_{pooled}$, which is the total number of successes divided by the total sample size:
$\hat{p}_{pooled} = \dfrac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2}$

Your pooled standard error is then found by using:
$SE_{pooled}(\hat{p}_1 - \hat{p}_2) = \sqrt{\dfrac{\hat{p}_{pooled}\hat{q}_{pooled}}{n_1} + \dfrac{\hat{p}_{pooled}\hat{q}_{pooled}}{n_2}}$

You then find your z-score using:
$z = \dfrac{(\hat{p}_1 - \hat{p}_2) - 0}{SE_{pooled}}$

*Note, the reason it is minus zero is that in every 2-proportion and 2-sample test you will see here, the null hypothesis is that the difference between the two percents or averages is 0.

You then find your p-value the same way you would for the 1-proportion tests.

Alright, time for t-tests/intervals

1. The 1-sample t-interval

When do I use this?
When you are given one mean.

Assumptions and Conditions:
1. Random sample or random assignment of treatments
2. Independence
3. n < 10% of the population
4. If n<15, distribution must be nearly normal. If 15<n<40, distribution must be unimodal and symmetric. If n>40, any distribution without extreme skewness.
Always draw a histogram of your data to show the distribution!

Finding a confidence interval for a one-sample t-interval is extremely easy. You just do:
$\bar{X} \pm t^* \cdot SE(\bar{X})$
$\bar{X}$ = whatever average you are given
$t^*$ = found with the program inv-t (if Wieland has not programmed this into your calculator you need to get it programmed asap). It will ask for degrees of freedom, which is just n-1.
$SE(\bar{X}) = \dfrac{s_X}{\sqrt{n}}$

2. The 1-sample t-test

$H_0: \mu =$ something
$H_A: \mu >, <, \ne$ something
$t = \dfrac{\bar{X} - \mu}{SE(\bar{X})}$

You then do the same thing you would do with a z-score to find the p-value, except you use the t-distribution with n-1 degrees of freedom instead of the normal curve. The conclusions are written the same way!

Note* Matched pair tests are when you measure the same sample twice and just change something between the two measurements.
It is done exactly like a 1-sample t-test on the differences; you just have:
$H_0: \mu_D = 0$
$H_A: \mu_D >, <, \ne 0$
The average difference = $\bar{D}$
*The assumptions are the same as you would have for a 1-sample t-test.

3. The 2-sample t-interval

The only thing that changes as far as the assumptions and conditions is that you make two histograms instead of one, because there are two samples.
$(\bar{X}_1 - \bar{X}_2) \pm t^* \sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}$

4. The 2-sample t-test

$t = \dfrac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$
Under the null hypothesis, $\mu_1 - \mu_2 = 0$.

Note* You do not pool with t-tests

That’s not that hard, right? Okay, now here is everything you need to know about errors.

A Type I error occurs when you reject a null hypothesis that is actually true.
A Type II error occurs when you fail TO reject a null hypothesis that is actually false. (II, TO, yeah, you get it)

You:                    Prob. of Type I    Prob. of Type II    Power
Increase sample size    No change          Decrease            Increase
Increase alpha          Increase           Decrease            Increase

Power is $1 - \beta$, and $\beta$ is the probability of a Type II error. This makes power the probability of NOT making a Type II error.
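The effect of sample size and alpha on power can be checked numerically. The sketch below is a rough illustration and not part of the original guide: it assumes a one-sided 1-proportion z-test ($H_0: p = p_0$ versus $H_A: p > p_0$), a made-up true proportion, and a hypothetical helper named `power_one_prop`.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal curve

def power_one_prop(p0, p_true, n, alpha):
    """Approximate power for H0: p = p0 vs HA: p > p0. Hypothetical helper."""
    # We reject H0 when p-hat lands above this cutoff (computed assuming H0).
    cutoff = p0 + Z.inv_cdf(1 - alpha) * sqrt(p0 * (1 - p0) / n)
    # Power = chance p-hat lands above the cutoff when the truth is p_true.
    sd_true = sqrt(p_true * (1 - p_true) / n)
    return 1 - Z.cdf((cutoff - p_true) / sd_true)

# Made-up scenario: the null says 50%, the truth is really 60%.
base = power_one_prop(0.5, 0.6, n=100, alpha=0.05)
bigger_n = power_one_prop(0.5, 0.6, n=400, alpha=0.05)
bigger_alpha = power_one_prop(0.5, 0.6, n=100, alpha=0.10)

beta = 1 - base  # probability of a Type II error in the base scenario
```

In this scenario both `bigger_n` and `bigger_alpha` come out larger than `base`, so the probability of a Type II error shrinks in both cases. Only the larger alpha raises the probability of a Type I error, since that probability is alpha itself.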