And Here We Go… Get ready to study for the AP Stats test!
Only 1050 minutes of class time until the big day: Friday, May 10!
How much studying will you do for $521.04 (plus book)?

The Exam Itself
To maximize your score on the AP Statistics Exam, you first need to know how the exam is organized and how it will be scored. The AP Statistics Exam consists of two separate sections:
Section I: 40 multiple-choice questions, 90 minutes, counts 50 percent of exam score
Section II: free-response questions, 90 minutes, counts 50 percent of exam score
Questions are designed to test your statistical reasoning and your communication skills.

SCORING:
• Five open-ended problems @ 13 minutes; each counts 15 percent of free-response score
• One investigative task @ 25 minutes; counts 25 percent of free-response score

Each free-response question is scored on a 0 to 4 scale. General descriptors for each of the scores are:
4 Complete Response: NO statistical errors and clear communication
3 Substantial Response: Minor statistical error/omission or fuzzy communication
2 Developing Response: Important statistical error/omission or lousy communication
1 Minimal Response: A "glimmer" of statistical knowledge related to the problem
0 Inadequate Response: No glimmer; statistically dangerous to himself and others
Your work is graded holistically, meaning that your entire response to a problem is considered before a score is assigned.

Calculator Policy
Each student is expected to bring to the exam a graphing calculator with statistical capabilities. The computational capabilities should include standard statistical univariate and bivariate summaries, through linear regression. The graphical capabilities should include common univariate and bivariate displays such as histograms, boxplots, and scatterplots.
• You can bring two calculators to the exam.
• The calculator memory will not be cleared, but you may only use the memory to store programs, not notes.
• For the exam, you're not allowed to access any information in your graphing calculators or elsewhere if it's not directly related to upgrading the statistical functionality of older graphing calculators to make them comparable to statistical features found on newer models. The only acceptable upgrades are those that improve the computational functionalities and/or graphical functionalities for data you key into the calculator while taking the examination. Unacceptable enhancements include, but aren't limited to, keying or scanning text or response templates into the calculator.
• During the exam, you can't use minicomputers, pocket organizers, electronic writing pads, or calculators with QWERTY (i.e., typewriter) keyboards.

2008-09 List of Graphing Calculators
Graphing calculators having the expected built-in capabilities listed above are indicated with an asterisk (*). However, students may bring any calculator on the list to the exam; any model within each series is acceptable.
Casio: FX-6000 series, FX-6200 series, FX-6300 series, FX-6500 series, FX-7000 series, FX-7300 series, FX-7400 series, FX-7500 series, FX-7700 series, FX-7800 series, FX-8000 series, FX-8500 series, FX-8700 series, FX-8800 series, FX-9700 series *, FX-9750 series *, FX-9860 series *, CFX-9800 series *, CFX-9850 series *, CFX-9950 series *, CFX-9970 series *, FX 1.0 series *, Algebra FX 2.0 series *
Hewlett-Packard: HP-9G, HP-28 series *, HP-38G *, HP-39 series *, HP-40 series *, HP-48 series *, HP-49 series *, HP-50 series *
Radio Shack: EC-4033, EC-4034, EC-4037
Sharp: EL-5200, EL-9200 series *, EL-9300 series *, EL-9600 series * †, EL-9900 series *
Texas Instruments: TI-73, TI-80, TI-81, TI-82 *, TI-83/TI-83 Plus *, TI-83 Plus Silver *, TI-84 Plus *, TI-84 Plus Silver *, TI-85 *, TI-86 *, TI-89 *, TI-89 Titanium *, TI-Nspire *, TI-Nspire CAS *
Other: Datexx DS-883, Micronta, Smart²

1st AP Statistics test: 1997 ~ 7500 students

2008 AP Stat test: ~ 100,000 students
Exam grade           Statistics 2008 (count)   Statistics 2008 (%)   Goins 2008 (count)   Goins 2008 (%)
5                    14,009                    12.8%                 3                    12%
4                    24,528                    22.6%                 7                    28%
3                    25,707                    23.8%                 8                    32%
2                    20,403                    18.8%                 4                    16%
1                    23,637                    21.9%                 3                    12%
Number of students   108,284                                         25
3 or higher / %      64,244                    59.2%                 18                   72%
Mean grade           2.86                                            3.12
Standard deviation   1.34

2009 AP Stat test: 116,876 students
Exam grade           Statistics 2009 (%)   Goins 2009 (count)   Goins 2009 (%)
5                    12.3%                 2                    4.3%
4                    22.3%                 6                    12.8%
3                    24.2%                 17                   36.2%
2                    19.1%                 12                   25.5%
1                    22.2%                 10                   21.3%
Number of students   116,876               47
3 or higher / %      68,679 / 58.8%        25                   53.3%
Mean grade           2.83                  2.56
Standard deviation   1.33

2010 AP Stat test: ~ 109,609 students
Exam grade           Statistics 2010 (%)   Goins 2010 (count)   Goins 2010 (%)
5                    12.8%                 5                    13.9%
4                    22.4%                 10                   27.8%
3                    23.5%                 11                   30.6%
2                    18.2%                 6                    16.7%
1                    23.1%                 4                    11.1%
Number of students   129,899               36
3 or higher / %      58.7%                                      72.3%
Mean grade           2.84                  3.167
Standard deviation   1.35                  1.2

2011 AP Stat test: ~ 137,498 students
Exam grade           Statistics 2011 (%)   Goins 2011 (count)   Goins 2011 (%)
5                    12.1%                 8                    16.0%
4                    21.3%                 18                   36.0%
3                    25.0%                 14                   28.0%
2                    17.8%                 7                    14.0%
1                    23.9%                 3                    6.0%
Number of students   142,910               50
3 or higher / %      58.8%                                      80.0%
Mean grade           2.82                  3.42
Standard deviation   1.34                  1.1

2012 AP Stat test: ~ 143,554 students
Exam grade           Statistics 2012 (%)   Goins 2012 (count)   Goins 2012 (%)
5                    12.5%                 5                    8.2%
4                    21.1%                 13                   21.3%
3                    25.6%                 17                   27.9%
2                    18.0%                 16                   26.2%
1                    22.8%                 10                   16.4%
Number of students   153,859               61
3 or higher / %      59.2%                                      57.4%
Mean grade           2.83                  2.62
Standard deviation   1.33

The AP Statistics Exam covers material in these areas:
I. Exploring data: describing patterns and departures from patterns (20–30%)
Analyze data using graphical and numerical techniques
Emphasis on interpreting info from graphical and numerical displays and summaries
II. Sampling and experimentation: planning and conducting a study (10–15%)
Collecting data with a well-developed plan
Clarifying the question and deciding on a method of data collection and analysis
III.
Anticipating patterns: exploring random phenomena using probability and simulations (20–30%)
Anticipating what the distribution of data should look like under a given model
IV. Statistical inference: estimating population parameters and testing hypotheses (30–40%)
Selecting appropriate models for statistical inferences

So . . . Let's get started!

What do you call data that has only ONE variable?
UNIVARIATE DATA

What are the two types of univariate data sets?
Categorical: qualitative (brand)
• Type of computer you use
• Car you drive
• Area codes
Numerical: quantitative (numerical in nature)
• Height
• Price of textbook
• Amount of cola in can

What are the two types of numerical data?
Discrete: possible values are isolated points on a number line
• Number of AP classes
Continuous: possible values form an interval (measurements are usually continuous)
• Distance a student lives from school

What are appropriate graphical displays for categorical data?
Bar Graphs
• Bars do not touch
• Categorical variable is typically on the horizontal axis
• To describe – comment on which occurred the most often or least often
• May make a double bar graph or segmented bar graph for bivariate categorical data sets
[Figure: bar graph "Subject Preference" with categories History, Math, Science, English, Business, Foreign language]
[Figure: double bar graph "Subject preference by gender" (Male, Female) over the same categories]

What are appropriate graphical displays for categorical data?
Pie Charts
• To make:
– Proportion × 360°
– Using a protractor, mark off each part
• To describe – comment on which occurred the most often or least often
[Figure: pie chart "Subject Preference": Math 44%, Science 27%, English 13%, Foreign language 8%, History 6%, Business 2%]

What are appropriate graphical displays for numerical data?
Dot Plot
• Used with numerical data (either discrete or continuous)
• Made by putting dots (or X's) on a number line
• Can make comparative dotplots by using the same axis for multiple groups
Stem (and leaf) Plot
• Used with univariate, numerical data
• Must have a key so that we know how to read the numbers
• Can split stems when you have a long list of leaves
• Can have a comparative stemplot with two groups (back to back)

What are appropriate graphical displays for numerical data?
Histograms
• Used with numerical data
• Bars touch on histograms
• Two types
– Discrete: bars are centered over discrete values
– Continuous: bars cover a class (interval) of values
• For comparative histograms – use two separate graphs with the same scale on the horizontal axis
• Use no fewer than 5 classes (bars)
• Check to see if the scale is misleading
• Look for symmetry and skewness

What are appropriate graphical displays for numerical data?
Cumulative Relative Frequency Plot (Ogive)
• . . . is used to answer questions about percentiles.
• A percentile is the value at or below which a given percent of the individuals fall.
• Quartiles are located every 25% of the data. The first quartile (Q1) is the 25th percentile, while the third quartile (Q3) is the 75th percentile. What is the special name for Q2?
• Interquartile Range (IQR) is the range of the middle half (50%) of the data. IQR = Q3 – Q1

What are appropriate graphical displays for numerical data?
Boxplot (box-and-whisker plot)
• Used with numerical data (either discrete or continuous)
• Modified boxplots show outliers
• Can make comparative by showing side-by-side on the same scale
• Good for comparing quartiles, medians, and spread

Why use boxplots?
• ease of construction
• convenient handling of outliers
• construction is not subjective (like histograms)
• used with medium or large size data sets (n > 10)
• useful for comparative displays
Why not use boxplots?
• does not retain the individual observations
• should not be used with small data sets (n < 10)

How to construct
• find the five-number summary: Min, Q1, Med, Q3, Max
• draw the box from Q1 to Q3
• draw a line in the box at the median
• extend whiskers to min & max

Modified boxplots
• display outliers
• fences mark off mild & extreme outliers
• whiskers extend to the largest (smallest) data value inside the fence
ALWAYS use modified boxplots in this class!!!

Inner fence: Q1 – 1.5 IQR and Q3 + 1.5 IQR
Interquartile Range (IQR) is the range (length) of the box: Q3 – Q1
Any observation outside this fence is an outlier! Put a dot for the outliers.

Modified Boxplot . . .
Draw the "whisker" from the quartiles to the observation that is within the fence!

Outer fence: Q1 – 3 IQR and Q3 + 3 IQR
Any observation between the fences is considered a mild outlier.
Any observation outside this fence is an extreme outlier!
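To see the fence rules in action, here is a minimal Python sketch (not part of the original slides). It assumes the median-of-halves convention for quartiles described in these notes; the helper names `quartiles` and `classify_outliers` are made up for illustration, and the 17 values are the right-skewed data set that appears later in these notes. Calculators and software may use slightly different quartile rules, so their results can differ a little.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves convention (an assumption)."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]              # observations below the median position
    upper = data[half + (n % 2):]    # skip the middle observation when n is odd
    return median(lower), median(data), median(upper)


def classify_outliers(values):
    """Apply the inner (1.5 IQR) and outer (3 IQR) fences to a data set."""
    q1, med, q3 = quartiles(values)
    iqr = q3 - q1
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_low, outer_high = q1 - 3 * iqr, q3 + 3 * iqr

    # Extreme outliers lie beyond the outer fence.
    extreme = sorted(x for x in values if x < outer_low or x > outer_high)
    # Mild outliers lie beyond the inner fence but not beyond the outer fence.
    mild = sorted(
        x for x in values
        if (outer_low <= x < inner_low) or (inner_high < x <= outer_high)
    )
    # Whiskers stop at the most extreme observations still inside the inner fence.
    inside = [x for x in values if inner_low <= x <= inner_high]
    return {
        "q1": q1, "median": med, "q3": q3, "iqr": iqr,
        "inner_fence": (inner_low, inner_high),
        "outer_fence": (outer_low, outer_high),
        "whiskers": (min(inside), max(inside)),
        "mild": mild, "extreme": extreme,
    }


data = [22, 29, 28, 22, 24, 25, 28, 21, 23, 62, 23, 24, 23, 26, 36, 38, 25]
result = classify_outliers(data)
for name, value in result.items():
    print(name, value)
```

With these values, Q1 = 23 and Q3 = 28.5, so the inner fence runs from 14.75 to 36.75 and the outer fence from 6.5 to 45. That makes 38 a mild outlier and 62 an extreme outlier, and the whiskers stop at 21 and 36.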
[Figure: three boxplots labeled "Symmetrical boxplot", "Approximately symmetrical boxplot", and "Skewed boxplot"]

Variable                                                          Type of variable       Graph
the heights of male students in your school                       Continuous numerical   Histogram
the income of adults in your city                                 Discrete numerical     Stem Plot
the color of M&M candies selected at random from a bag            Categorical            Bar graph
the number of TVs in the homes of AP Stat students                Discrete numerical     Dot Plot
the number of speeding tickets each student in AP Stat received   Discrete numerical     Dot Plot
the birth weights of female babies born at a large hospital       Continuous numerical   Histogram
the favorite movie type of AP Stat students by gender             Categorical            Bar graph – segmented or double
the area code of an individual                                    Categorical            Bar graph
the Math SAT score for students at your school                    Discrete numerical     Histogram
the average number of texts sent per month                        Continuous numerical   Cumulative frequency plot (ogive)

How do you describe univariate data?
Just CUSS and BS!
Center – "the typical value"
• Mean
• Median
Unusual Features
• Outliers
• Gaps
Shape
• single vs. multiple modes (unimodal, bimodal)
• symmetry vs. skewness
[Figure: "Illustrated Distribution Shapes": Unimodal, Bimodal, Multimodal, Symmetric, Skew negatively (left), Skew positively (right)]
Spread – "how tightly values cluster around the center"
• Standard deviation
• IQR
• Range
• 5-number summary
And Be Specific!
Measures of Central Tendency
• Median - the middle of the data; 50th percentile
– Observations must be in numerical order
– Is the middle single value if n is odd
– The average of the middle two values if n is even
NOTE: n denotes the sample size

Measures of Central Tendency
• Mean - the arithmetic average
– Use μ to represent a population mean (parameter)
– Use x̄ to represent a sample mean (statistic)
Formula: $\bar{x} = \frac{\sum x}{n}$
Σ is the capital Greek letter sigma – it means to sum the values that follow

Measures of Central Tendency
• Mode – the observation that occurs the most often
– Can be more than one mode
– If all values occur only once – there is no mode
– Not used as often as mean & median

Suppose we are interested in the number of lollipops that are bought at a certain store. A sample of 5 customers buys the following number of lollipops. Find the median.
2 3 4 8 12
The numbers are in order & n is odd – so find the middle observation.
The median is 4 lollipops!

Suppose we have a sample of 6 customers that buy the following number of lollipops. The median is …
2 3 4 6 8 12
The numbers are in order & n is even – so find the middle two observations. Now, average these two values.
The median is 5 lollipops!

Suppose we have a sample of 6 customers that buy the following number of lollipops. Find the mean.
2 3 4 6 8 12
To find the mean number of lollipops, add the observations and divide by n.
$\bar{x} = \frac{2 + 3 + 4 + 6 + 8 + 12}{6} = 5.833$

What would happen to the median & mean if the 12 lollipops were 20?
2 3 4 6 8 20
The median is . . . 5
The mean is . . . $\frac{2 + 3 + 4 + 6 + 8 + 20}{6} = 7.17$
What happened?

What would happen to the median & mean if the 20 lollipops were 50?
2 3 4 6 8 50
The median is . . . 5
The mean is . . . $\frac{2 + 3 + 4 + 6 + 8 + 50}{6} = 12.17$
What happened?

Resistant
• Statistics that are not affected by outliers
• Is the median resistant? YES
• Is the mean resistant? NO

Look at the following data set. Find the mean.
22 23 24 25 25 26 29 30
$\bar{x} = 25.5$
Now find how each observation deviates from the mean. (This is the deviation from the mean.)
What is the sum of the deviations from the mean?
$\sum (x - \bar{x}) = 0$
Will this sum always equal zero? YES

Look at the following data set. Find the mean & median.
21 27 23 23 24 25 25 27 27 28 30 30 26 26 26 27 30 31 32 32
Create a histogram with the data (use x-scale of 2). Then find the mean and median.
Mean = 27
Median = 27
Look at the placement of the mean and median in this symmetrical distribution.

Look at the following data set. Find the mean & median.
22 29 28 22 24 25 28 21 23 62 23 24 23 26 36 38 25
Create a histogram with the data (use x-scale of 8). Then find the mean and median.
Mean = 28.176
Median = 25
Look at the placement of the mean and median in this right skewed distribution.

Look at the following data set. Find the mean & median.
21 46 54 47 53 60 55 55 56 63 64 58 58 58 58 62 60
Create a histogram with the data. Then find the mean and median.
Mean = 54.588
Median = 58
Look at the placement of the mean and median in this skewed left distribution.

Recap:
• In a symmetrical distribution, the mean and median are equal.
• In a skewed distribution, the mean is pulled in the direction of the skewness.
• In a symmetrical distribution, you should report the mean!
• In a skewed distribution, the median should be reported as the measure of center!

Trimmed mean:
Purpose is to remove outliers from a data set
To calculate a trimmed mean:
• Multiply the % to trim by n
• Truncate that many observations from BOTH ends of the distribution (when listed in order)
• Calculate the mean with the shortened data set

Find a 10% trimmed mean with the following data.
12 14 19 20 22 24 25 26 26 35
10%(10) = 1, so remove one observation from each side!
14 19 20 22 24 25 26 26
$\bar{x}_T = \frac{14 + 19 + 20 + 22 + 24 + 25 + 26 + 26}{8} = 22$

Why is the study of variability important?
• Allows us to distinguish between usual & unusual values
• In some situations, want more/less variability
– scores on standardized tests
– time bombs
– medicine

Range:
• Single number – not an interval
• Sensitive to outliers
• Midrange – average of the max and min values – VERY sensitive to outliers

Interquartile Range (IQR):
IQR = Q3 – Q1
Quartiles:
The first quartile (Q1) is the value below which 25% of the observations fall. It is the median of the first half of the set of observations (the 25th percentile).
The third quartile (Q3) is the value below which 75% of the observations fall. It is the median of the second half of the set of observations (the 75th percentile).
IQR is insensitive to outliers.

The average of the deviations squared is called the variance.
Population (parameter): σ²
Sample (statistic): s²
A standard deviation is a measure of the average deviation from the mean.
Population: σ
Sample: s

Suppose that we have this population:
24 16 34 28 26 21 30 35 37 29
Find the mean (μ).
Find the deviations: $x - \mu$
What is the sum of the deviations from the mean?
Square the deviations: $(x - \mu)^2$
Find the average of the squared deviations:
$\sigma^2 = \frac{\sum (x - \mu)^2}{n}$

Calculation of variance of a sample
$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$, where df = n – 1
Degrees of Freedom (df)
• n deviations contain (n – 1) independent pieces of information about variability

Calculation of standard deviation of a sample
$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$

When to use what??????
Note: Variance and standard deviation are used to measure spread when the mean is used to describe center.
Note: IQR is typically used to describe spread when the median is used to describe center.
Note: When the distribution is approximately symmetric, the mean and standard deviation are generally used to summarize the distribution. If the distribution is skewed, a five-number summary is generally used.
Which measure(s) of variability is/are resistant?
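The population exercise above can be worked step by step in a few lines of Python. This is a minimal sketch, not part of the original slides: the variable names are chosen for illustration, and the sample versions (dividing by n – 1) are shown only for comparison, since the slide treats these ten values as a population. The standard-library functions `pvariance` and `variance` are used as a cross-check.

```python
import math
from statistics import pvariance, variance

# The ten population values from the slide.
population = [24, 16, 34, 28, 26, 21, 30, 35, 37, 29]
n = len(population)

mu = sum(population) / n                      # population mean
deviations = [x - mu for x in population]     # x - mu for each observation
squared = [d ** 2 for d in deviations]        # (x - mu)^2

pop_variance = sum(squared) / n               # sigma^2: divide by n
pop_sd = math.sqrt(pop_variance)              # sigma

# If the same ten values were a sample, divide by n - 1 instead (df = n - 1).
sample_variance = sum(squared) / (n - 1)      # s^2
sample_sd = math.sqrt(sample_variance)        # s

print("mean:", mu)
print("sum of deviations:", sum(deviations))
print("population variance and SD:", pop_variance, round(pop_sd, 3))
print("sample variance and SD:", round(sample_variance, 3), round(sample_sd, 3))

# Cross-check against the standard library.
assert math.isclose(pop_variance, pvariance(population))
assert math.isclose(sample_variance, variance(population))
```

The deviations sum to zero, which is why they have to be squared before averaging. Dividing by n – 1 instead of n always gives the slightly larger value.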
Linear transformation rule
• When adding a constant to a random variable, the mean changes but not the standard deviation.
• When multiplying a random variable by a constant, the mean and the standard deviation both change.

An appliance repair shop charges a $30 service call to go to a home for a repair. It also charges $25 per hour for labor. From past history, the average length of repairs is 1 hour 15 minutes (1.25 hours) with a standard deviation of 20 minutes (1/3 hour). Including the charge for the service call, what are the mean and standard deviation for the charges?
$\mu = 30 + 25(1.25) = \$61.25$
$\sigma = 25\left(\frac{1}{3}\right) = \$8.33$

Rules for Combining two variables
• To find the mean for the sum (or difference), add (or subtract) the two means
• To find the standard deviation of the sum (or difference), ALWAYS add the variances, then take the square root.
• Formulas:
$\mu_{a+b} = \mu_a + \mu_b$
$\mu_{a-b} = \mu_a - \mu_b$
$\sigma_{a \pm b} = \sqrt{\sigma_a^2 + \sigma_b^2}$ (if the variables are independent)

Bicycles arrive at a bike shop in boxes. Before they can be sold, they must be unpacked, assembled, and tuned (lubricated, adjusted, etc.). Based on past experience, the times for each setup phase are independent with the following means & standard deviations (in minutes). What are the mean and standard deviation for the total bicycle setup times?
Phase       Mean   SD
Unpacking   3.5    0.7
Assembly    21.8   2.4
Tuning      12.3   2.7
$\mu_T = 3.5 + 21.8 + 12.3 = 37.6$ minutes
$\sigma_T = \sqrt{0.7^2 + 2.4^2 + 2.7^2} = 3.680$ minutes

Normal Distributions
• Symmetrical bell-shaped (unimodal) density curve
• Above the horizontal axis
• N(μ, σ)
• The transition points occur at μ ± σ
• Probability is calculated by finding the area under the curve (How is this done mathematically?)
• As σ increases, the curve flattens & spreads out
• As σ decreases, the curve gets taller and thinner

Normal distributions occur frequently.
• Length of newborn child
• Height
• Weight
• ACT or SAT scores
• Intelligence
• Number of typing errors
• Chemical processes

[Figure: two normal curves, A and B, both centered at 6]
Do these two normal curves have the same mean? If so, what is it?
YES
Which normal curve has a standard deviation of 3? B
Which normal curve has a standard deviation of 1? A

Empirical Rule
• Approximately 68% of the observations fall within σ of μ
• Approximately 95% of the observations fall within 2σ of μ
• Approximately 99.7% of the observations fall within 3σ of μ

Suppose that the height of male students at SHS is normally distributed with a mean of 71 inches and standard deviation of 2.5 inches. What is the probability that the height of a randomly selected male student is more than 73.5 inches?
73.5 is one standard deviation above the mean of 71, and about 68% of heights fall within one standard deviation.
1 - .68 = .32, and half of that lies in the upper tail:
P(X > 73.5) = 0.16

Standard Normal Density Curves
Always has μ = 0 & σ = 1
To standardize:
$z = \frac{x - \mu}{\sigma}$
Must have this memorized!

Strategies for finding probabilities or proportions in normal distributions
1. State the probability statement
2. Draw a picture
3. Calculate the z-score
4. Look up the probability (proportion) in the table

The lifetime of a certain type of battery is normally distributed with a mean of 200 hours and a standard deviation of 15 hours. What proportion of these batteries can be expected to last less than 220 hours?
Write the probability statement, draw & shade the curve, calculate the z-score, then look up the z-score in the table.
$z = \frac{220 - 200}{15} = 1.33$
P(X < 220) = .9082

The lifetime of a certain type of battery is normally distributed with a mean of 200 hours and a standard deviation of 15 hours. What proportion of these batteries can be expected to last more than 220 hours?
$z = \frac{220 - 200}{15} = 1.33$
P(X > 220) = 1 - .9082 = .0918

The lifetime of a certain type of battery is normally distributed with a mean of 200 hours and a standard deviation of 15 hours. How long must a battery last to be in the top 5%?
P(X > ?) = .05
Look up 0.95 in the table to find the z-score: z = 1.645
$1.645 = \frac{x - 200}{15}$
x = 224.675

The heights of the female students at SHS are normally distributed with a mean of 65 inches. What is the standard deviation of this distribution if 18.5% of the female students are shorter than 63 inches? (What is the z-score for the 63?)
P(X < 63) = .185, so the z-score for 63 is -0.9
$-0.9 = \frac{63 - 65}{\sigma}$
$\sigma = \frac{-2}{-0.9} = 2.22$

The heights of female teachers at SHS are normally distributed with mean of 65.5 inches and standard deviation of 2.25 inches. The heights of male teachers are normally distributed with mean of 70 inches and standard deviation of 2.5 inches.
• Describe the distribution of differences of heights (male – female) teachers.
Normal distribution with μ = 4.5 & σ = 3.3634
• What is the probability that a randomly selected male teacher is shorter than a randomly selected female teacher?
$z = \frac{0 - 4.5}{3.3634} = -1.34$
P(X < 0) = .0901

Will my calculator do any of this normal stuff?
• Normalpdf – use for graphing ONLY
• Normalcdf – will find probability of area from lower bound to upper bound
• Invnorm (inverse normal) – will find the z-score for a probability

Bivariate data
• x-variable: is the independent or explanatory variable
• y-variable: is the dependent or response variable
• Use x to predict y
$\hat{y} = a + bx$
ŷ (y-hat) means the predicted y. Be sure to put the hat on the y.
b is the slope
– it is the approximate amount by which y increases when x increases by 1 unit
a is the y-intercept
– it is the approximate height of the line when x = 0
– in some situations, the y-intercept has no meaning

Least Squares Regression Line (LSRL)
• The line that gives the best fit to the data set
• The line that minimizes the sum of the squares of the deviations from the line

Interpretations
Slope: For each unit increase in x, there is an approximate increase/decrease of b in y.
Correlation coefficient: There is a (direction), (strength), linear association between x and y.

Identify as having a positive association, a negative association, or no association.
1. Heights of mothers & heights of their adult daughters: +
2. Age of a car in years and its current value: −
3. Weight of a person and calories consumed: +
4. Height of a person and the person's birth month: NO
5.
Number of hours spent in safety training and the number of accidents that occur: −

Correlation Coefficient (r)
• A quantitative assessment of the strength & direction of the linear relationship between bivariate, quantitative data
• Pearson's sample correlation is used most
• parameter: ρ (rho)
• statistic: r
$r = \frac{1}{n - 1} \sum \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$

Properties of r (correlation coefficient)
• legitimate values of r lie in [-1, 1]
[Figure: scale of r from -1 to 1 with marks at -.8, -.5, 0, .5, and .8, labeled No Correlation, Weak correlation, Moderate Correlation, and Strong correlation]

Properties of r (correlation coefficient)
• value of r is non-resistant
• value of r does not depend on which of the two variables is labeled x
• value of r is not changed by linear transformations (such as a change of units)
• value of r is a measure of the extent to which x & y are linearly related

The correlation coefficient and the LSRL are both non-resistant measures.

Correlation does not imply causation

• Interpolation (good): Using a regression line for estimating predicted values between known values.
• Extrapolation (bad): It is unknown whether the pattern observed in the scatterplot continues outside this range. The LSRL should not be used to predict y for values of x outside the data set.

Formulas – on chart
$\hat{y} = b_0 + b_1 x$
$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$
$b_0 = \bar{y} - b_1 \bar{x}$
$b_1 = r \frac{s_y}{s_x}$

The following statistics are found for the variables posted speed limit and the average number of accidents.
$\bar{x} = 40$, $s_x = 11.6$, $\bar{y} = 18$, $s_y = 8.4$, $r = .9981$
Find the LSRL & predict the number of accidents for a posted speed limit of 50 mph.
$\hat{y} = .723x - 10.92$
$\hat{y} = 25.23$ accidents

Residuals (error)
• The vertical deviation between the observations & the LSRL
• the sum of the residuals is always zero
• error = observed - expected
residual $= y - \hat{y}$

Residual plot
• A scatterplot of the (x, residual) pairs.
• Residuals can be graphed against other statistics besides x
• Purpose is to tell if a linear association exists between the x & y variables
• If no pattern exists between the points in the residual plot, then the association is linear.
[Figure: two residual plots (residuals vs. x), one labeled "Linear" and one labeled "Not linear"]

Coefficient of determination
• r²
• gives the approximate proportion of variation in y that can be attributed to a linear relationship between x & y
• remains the same no matter which variable is labeled x
Interpretation of r²
Approximately r² (as a percent) of the variation in y can be explained by the LSRL of x & y.

Outlier
• In a regression setting, an outlier is a data point with a large residual
Influential point
• A point that influences where the LSRL is located
• If removed, it will significantly change the slope of the LSRL
[Figure: scatterplot in which (189, 30) could be influential. Remove & recalculate the LSRL. (189, 30) was influential since it moved the LSRL.]

Which of these measures are resistant?
• LSRL
• Correlation coefficient
• Coefficient of determination
NONE – all are affected by outliers

What to do if the data is not linear…
• Calculate the LSRL
• Is the residual plot scattered?
– YES: appropriate model
– NO: transform the data and try again:
x & log y
log x & log y
x & √y
x & 1/y
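The "calculate the LSRL, check the residual plot, transform if needed" loop above can be sketched in Python. This is only an illustration under stated assumptions: the helper names `lsrl` and `residuals` are made up, the six data points are hypothetical (generated from y = 2 · 3^x so that the pattern is easy to see), and only the x & log y transformation is tried. The slope and intercept use the chart formulas from these notes.

```python
import math


def lsrl(xs, ys):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx              # slope
    b0 = y_bar - b1 * x_bar     # intercept
    return b0, b1


def residuals(xs, ys, b0, b1):
    """Residual = observed y minus predicted y."""
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]


# Hypothetical data that grow exponentially (not from the slides).
xs = [0, 1, 2, 3, 4, 5]
ys = [2 * 3 ** x for x in xs]

# Step 1: fit the LSRL to the raw data and inspect the residuals.
raw_b0, raw_b1 = lsrl(xs, ys)
raw_res = residuals(xs, ys, raw_b0, raw_b1)
print("raw residuals:", [round(r, 2) for r in raw_res])

# Step 2: the raw residuals are curved (positive, negative, positive),
# so transform y with log10 and fit again.
log_ys = [math.log10(y) for y in ys]
log_b0, log_b1 = lsrl(xs, log_ys)
log_res = residuals(xs, log_ys, log_b0, log_b1)
print("residuals after x & log y:", [round(r, 6) for r in log_res])

# Step 3: back-transform to predict y: y-hat = 10 ** (b0 + b1 * x).
print("predicted y at x = 6:", round(10 ** (log_b0 + log_b1 * 6), 2))
```

The raw residuals start positive, dip negative, and end positive, which is the curved pattern that signals a linear model is not appropriate. After the log transformation the residuals are essentially zero because the made-up data are exactly exponential; real data would leave some scatter.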