Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
STAT 250 Dr. Kari Lock Morgan Describing Data: One Quantitative Variable SECTIONS 2.2, 2.3 • One quantitative variable (2.2, 2.3) Statistics: Unlocking the Power of Data Lock5 Question of the Day How obese are Americans? www.outsmarthormones.com/2011/04/06/epidemic/ Statistics: Unlocking the Power of Data Lock5 Obesity Trends* Among U.S. Adults BRFSS, 1990, 2000, 2010 (*BMI 30, or about 30 lbs. overweight for 5’4” person) 2000 1990 2010 No Data <10% 10%–14% 15-19% Statistics: Unlocking the Power of Data 20-24% 25-29% ≥30% Lock5 https://www.cdc.gov/obesity/data/prevalence-maps.html Statistics: Unlocking the Power of Data Lock5 Obesity in America Obesity is a HUGE problem in America We’ll explore the topic of obesity in America question with two different types of data, both collected by the CDC: Proportion BMI of adults who are obese in each state for a random sample of Americans Statistics: Unlocking the Power of Data Lock5 Behavioral Risk Factor Surveillance System http://www.cdc.gov/obesity/data/table-adults.html Statistics: Unlocking the Power of Data Lock5 National Health and Nutrition Examination Survey Statistics: Unlocking the Power of Data Lock5 Behavioral Risk Factor Surveillance System http://www.cdc.gov/obesity/data/table-adults.html Statistics: Unlocking the Power of Data Lock5 Dotplot In a dotplot, each case is represented by a dot and dots are stacked. Easy way to see each case ? Statistics: Unlocking the Power of Data ? Lock5 Histogram The height of the each bar corresponds to the number of cases within that range of the variable 5 states with obesity rate between 32.25 and 33.75 32.25 to 33.75 Statistics: Unlocking the Power of Data Lock5 Histogram vs Bar Chart A bar chart is for categorical data, and the x-axis has no numeric scale A histogram is for quantitative data, and the xaxis is numeric For a categorical variable, the number of bars equals the number of categories, and the number in each category is fixed For a quantitative variable, the number of bars in a histogram is up to you (or your software), and the appearance can differ with different number of bars Statistics: Unlocking the Power of Data Lock5 Shape Long right tail Symmetric Right-Skewed Statistics: Unlocking the Power of Data Left-Skewed Lock5 National Health and Nutrition Examination Survey Statistics: Unlocking the Power of Data Lock5 BMI of Americans Statistics: Unlocking the Power of Data Lock5 BMI of Americans The distribution of BMI for American adults is a) Symmetric b) Left-skewed c) Right-skewed Statistics: Unlocking the Power of Data Lock5 Notation The sample size, the number of cases in the sample, is denoted by n We often let x or y stand for any variable, and x1 , x2 , …, xn represent the n values of the variable x x1 = 32.4, x2 = 28.4, x3 = 26.8, … Statistics: Unlocking the Power of Data Lock5 Mean The mean or average of the data values is 𝑚𝑒𝑎𝑛 = 𝑠𝑢𝑚 𝑜𝑓 𝑎𝑙𝑙 𝑑𝑎𝑡𝑎 𝑣𝑎𝑙𝑢𝑒𝑠 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑑𝑎𝑡𝑎 𝑣𝑎𝑙𝑢𝑒𝑠 𝑥1 + 𝑥2 + ⋯ + 𝑥𝑛 𝑥 𝑚𝑒𝑎𝑛 = = 𝑛 𝑛 Sample mean: 𝑥 Population mean: (“mu”) Statistics: Unlocking the Power of Data Lock5 Mean The average obesity rate across the 50 states is µ = 28.606. The average BMI for Americans in this sample is 𝑥 = 24.887. Statistics: Unlocking the Power of Data Lock5 Median The median, m, is the middle value when the data are ordered. If there are an even number of values, the median is the average of the two middle values. The median splits the data in half. Statistics: Unlocking the Power of Data Lock5 Measures of Center For symmetric distributions, the mean and the median will be about the same For skewed distributions, the mean will be more pulled towards the direction of skewness Statistics: Unlocking the Power of Data Lock5 Measures of Center m = 24.163 Mean is “pulled” 𝑥 =24.887 in the direction of skewness Statistics: Unlocking the Power of Data Lock5 Measures of Center https://www.oecd.org/std/household-wealth-inequality-across-OECD-countries-OECDSB21.pdf Statistics: Unlocking the Power of Data Lock5 Skewness and Center A distribution is left-skewed. Which measure of center would you expect to be higher? a) Mean b) Median Statistics: Unlocking the Power of Data Lock5 Outlier An outlier is an observed value that is notably distinct from the other values in a dataset. Statistics: Unlocking the Power of Data Lock5 Outliers More info here Statistics: Unlocking the Power of Data Lock5 Resistance A statistic is resistant if it is relatively unaffected by extreme values. The median is resistant while the mean is not. With the Outlier Without the Outlier Statistics: Unlocking the Power of Data Mean 105.22 102.56 Median 101.0 100.5 Lock5 Outliers When using statistics that are not resistant to outliers, stop and think about whether the outlier is a mistake If not, you have to decide whether the outlier is part of your population of interest or not Usually, for outliers that are not a mistake, it’s best to run the analysis twice, once with the outlier(s) and once without, to see how much the outlier(s) are affecting the results Statistics: Unlocking the Power of Data Lock5 Standard Deviation The standard deviation for a quantitative variable measures the spread of the data 𝑠= 𝑥−𝑥 𝑛−1 2 Sample standard deviation: s Population standard deviation: (“sigma”) Statistics: Unlocking the Power of Data Lock5 Standard Deviation The larger the standard deviation, the more variability there is in the data and the more spread out the data are The standard deviation gives a rough estimate of the typical distance of a data values from the mean Statistics: Unlocking the Power of Data Lock5 150 s 1 0 50 Frequency Standard Deviation -5 0 5 150 -10 10 15 s4 0 50 Frequency -15 -15 -10 -5 0 5 10 15 Both of these distributions are bell-shaped Statistics: Unlocking the Power of Data Lock5 Two Ways of Measuring Obesity States as cases Individual people as cases Differences? Statistics: Unlocking the Power of Data Lock5 95% Rule If a distribution of data is approximately symmetric and bell-shaped, about 95% of the data should fall within two standard deviations of the mean. For a population, 95% of the data will be between µ – 2 and µ + 2 For a sample, 95% of the data will be between 𝑥 − 2𝑠 and 𝑥 − 2𝑠 Statistics: Unlocking the Power of Data Lock5 The 95% Rule Statistics: Unlocking the Power of Data Lock5 150 s 1 0 50 Frequency The 95% Rule -2 -1 0 1 2 3 150 s4 0 50 Frequency -3 -15 -10 -5 0 5 10 15 StatKey Statistics: Unlocking the Power of Data Lock5 95% Rule Give an interval that will likely contain 95% of obesity rates of states. Statistics: Unlocking the Power of Data Lock5 95% Rule Could we use the same method to get an interval that will contain 95% of BMIs of American adults? a) Yes b) No Statistics: Unlocking the Power of Data Lock5 The 95% Rule The standard deviation for hours of sleep per night is closest to a) b) c) d) e) Statistics: Unlocking the Power of Data ½ 1 2 4 I have no idea Lock5 z-score The z-score for a data value, x, is 𝑥−𝑥 𝑧= 𝑠 for sample data, and 𝑥−𝜇 𝑧= 𝜎 for population data. z-score measures the number of standard deviations away from the mean Statistics: Unlocking the Power of Data Lock5 z-score A z-score puts values on a common scale A z-score is the number of standard deviations a value falls from the mean Challenge: For symmetric, bell-shaped distributions, 95% of all z-scores fall between what two values? Statistics: Unlocking the Power of Data Lock5 The 95% Rule: z-scores Statistics: Unlocking the Power of Data Lock5 z-scores Statistics: Unlocking the Power of Data Lock5 z-score Which is better, an ACT score of 28 or a combined SAT score of 2100? ACT: = 21, = 5 SAT: = 1500, = 325 Assume ACT and SAT scores have approximately bell-shaped distributions a) b) c) ACT score of 28 SAT score of 2100 I don’t know Statistics: Unlocking the Power of Data Lock5 Other Measures of Location Maximum = largest data value Minimum = smallest data value Quartiles: Q1 = median of the values below m. Q3 = median of the values above m. Statistics: Unlocking the Power of Data Lock5 Five Number Summary Five Number Summary: Min Q1 25% 25% m Q3 25% Max 25% Minitab: Stat -> Basic Statistics -> Display Descriptive Statistics Statistics: Unlocking the Power of Data Lock5 Five Number Summary > summary(study_hours) Min. 1st Qu. 2.00 10.00 Median 15.00 3rd Qu. 20.00 Max. 69.00 The distribution of number of hours spent studying each week is a) b) c) d) Symmetric Right-skewed Left-skewed Impossible to tell Statistics: Unlocking the Power of Data Lock5 Percentile The Pth percentile is the value which is greater than P% of the data We already used z-scores to determine whether an SAT score of 2100 or an ACT score of 28 is better We could also have used percentiles: ACT score of 28: 91st percentile SAT score of 2100: 97th percentile Statistics: Unlocking the Power of Data Lock5 Five Number Summary Five Number Summary: Min Q1 25% 0th percentile m 25% 25th percentile 25% 50th percentile Statistics: Unlocking the Power of Data Q3 Max 25% 75th percentile 100th percentile Lock5 Measures of Spread Range = Max – Min Interquartile Range (IQR) = Q3 – Q1 Is the range resistant to outliers? a) Yes b) No Is the IQR resistant to outliers? a) Yes b) No Statistics: Unlocking the Power of Data Lock5 Comparing Statistics Measures of Center: Mean (not resistant) Median (resistant) Measures of Spread: Standard deviation (not resistant) IQR (resistant) Range (not resistant) Most often, we use the mean and the standard deviation, because they are calculated based on all the data values, so use all the available information Statistics: Unlocking the Power of Data Lock5 Boxplot Lines (“whiskers”) extend from each quartile to the most extreme value that is not an outlier Middle 50% of data Statistics: Unlocking the Power of Data Q3 Median Q1 Lock5 Boxplot Outlier *For boxplots, outliers are defined as any point more than 1.5 IQRs beyond the quartiles (although you don’t have to know that) Statistics: Unlocking the Power of Data Lock5 Boxplot This boxplot shows a distribution that is a) Symmetric b) Left-skewed c) Right-skewed Statistics: Unlocking the Power of Data Lock5 Summary: One Quantitative Variable Summary Statistics Visualization Center: mean, median Spread: standard deviation, range, IQR 5 number summary Percentiles Dotplot Histogram Boxplot Other concepts Shape: symmetric, skewed, bell-shaped Outliers, resistance z-scores Statistics: Unlocking the Power of Data Lock5 To Do HW 2.2, 2.3 (due Monday, 2/6) Statistics: Unlocking the Power of Data Lock5