Download 2.23 One Quantitative Variable

Document related concepts

Foundations of statistics wikipedia , lookup

History of statistics wikipedia , lookup

Time series wikipedia , lookup

Misuse of statistics wikipedia , lookup

Transcript
STAT 250
Dr. Kari Lock Morgan
Describing Data:
One Quantitative Variable
SECTIONS 2.2, 2.3
• One quantitative variable (2.2, 2.3)
Statistics: Unlocking the Power of Data
Lock5
Question of the Day
How obese are Americans?
www.outsmarthormones.com/2011/04/06/epidemic/
Statistics: Unlocking the Power of Data
Lock5
Obesity Trends* Among U.S. Adults
BRFSS, 1990, 2000, 2010
(*BMI 30, or about 30 lbs. overweight for 5’4” person)
2000
1990
2010
No Data
<10%
10%–14%
15-19%
Statistics: Unlocking the Power of Data
20-24%
25-29%
≥30%
Lock5
https://www.cdc.gov/obesity/data/prevalence-maps.html
Statistics: Unlocking the Power of Data
Lock5
Obesity in America
 Obesity is a HUGE problem in America
 We’ll explore the topic of obesity in America
question with two different types of data, both
collected by the CDC:
 Proportion
 BMI
of adults who are obese in each state
for a random sample of Americans
Statistics: Unlocking the Power of Data
Lock5
Behavioral Risk Factor Surveillance System
http://www.cdc.gov/obesity/data/table-adults.html
Statistics: Unlocking the Power of Data
Lock5
National Health and Nutrition
Examination Survey
Statistics: Unlocking the Power of Data
Lock5
Behavioral Risk Factor Surveillance System
http://www.cdc.gov/obesity/data/table-adults.html
Statistics: Unlocking the Power of Data
Lock5
Dotplot
 In a dotplot, each case is represented by
a dot and dots are stacked.
 Easy way to see each case
?
Statistics: Unlocking the Power of Data
?
Lock5
Histogram
 The height of the each bar corresponds to the
number of cases within that range of the variable
5 states with
obesity rate
between
32.25 and
33.75
32.25 to 33.75
Statistics: Unlocking the Power of Data
Lock5
Histogram vs Bar Chart
A bar chart is for categorical data, and the x-axis
has no numeric scale

A histogram is for quantitative data, and the xaxis is numeric

For a categorical variable, the number of bars
equals the number of categories, and the number in
each category is fixed

For a quantitative variable, the number of bars in
a histogram is up to you (or your software), and the
appearance can differ with different number of bars

Statistics: Unlocking the Power of Data
Lock5
Shape
Long
right
tail
Symmetric
Right-Skewed
Statistics: Unlocking the Power of Data
Left-Skewed
Lock5
National Health and Nutrition
Examination Survey
Statistics: Unlocking the Power of Data
Lock5
BMI of Americans
Statistics: Unlocking the Power of Data
Lock5
BMI of Americans
The distribution of BMI for American adults is
a) Symmetric
b) Left-skewed
c) Right-skewed
Statistics: Unlocking the Power of Data
Lock5
Notation
 The sample size, the number of cases in the
sample, is denoted by n
 We often let x or y stand for any variable, and x1
, x2 , …, xn represent the n values of the variable x
 x1 = 32.4, x2 = 28.4, x3 = 26.8, …
Statistics: Unlocking the Power of Data
Lock5
Mean
The mean or average of the data values is
𝑚𝑒𝑎𝑛 =
𝑠𝑢𝑚 𝑜𝑓 𝑎𝑙𝑙 𝑑𝑎𝑡𝑎 𝑣𝑎𝑙𝑢𝑒𝑠
𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑑𝑎𝑡𝑎 𝑣𝑎𝑙𝑢𝑒𝑠
𝑥1 + 𝑥2 + ⋯ + 𝑥𝑛
𝑥
𝑚𝑒𝑎𝑛 =
=
𝑛
𝑛
 Sample mean: 𝑥
 Population mean:  (“mu”)
Statistics: Unlocking the Power of Data
Lock5
Mean
The average obesity rate across the 50 states is µ = 28.606.
The average BMI for Americans in this sample is 𝑥 =
24.887.
Statistics: Unlocking the Power of Data
Lock5
Median
The median, m, is the middle value when the
data are ordered.
If there are an even number of values, the
median is the average of the two middle values.
 The median splits the data in half.
Statistics: Unlocking the Power of Data
Lock5
Measures of Center
 For symmetric distributions, the mean and the
median will be about the same
 For skewed distributions, the mean will be
more pulled towards the direction of skewness
Statistics: Unlocking the Power of Data
Lock5
Measures of Center
m = 24.163
Mean is “pulled”
𝑥 =24.887 in the direction
of skewness
Statistics: Unlocking the Power of Data
Lock5
Measures of Center
https://www.oecd.org/std/household-wealth-inequality-across-OECD-countries-OECDSB21.pdf
Statistics: Unlocking the Power of Data
Lock5
Skewness and Center
A distribution is left-skewed. Which measure of
center would you expect to be higher?
a) Mean
b) Median
Statistics: Unlocking the Power of Data
Lock5
Outlier
An outlier is an observed value that
is notably distinct from the other
values in a dataset.
Statistics: Unlocking the Power of Data
Lock5
Outliers
More info here
Statistics: Unlocking the Power of Data
Lock5
Resistance
A statistic is resistant if it is
relatively unaffected by extreme
values.
 The median is resistant while the mean is not.
With the Outlier
Without the Outlier
Statistics: Unlocking the Power of Data
Mean
105.22
102.56
Median
101.0
100.5
Lock5
Outliers
 When using statistics that are not resistant to
outliers, stop and think about whether the
outlier is a mistake
 If not, you have to decide whether the outlier
is part of your population of interest or not
 Usually, for outliers that are not a mistake, it’s
best to run the analysis twice, once with the
outlier(s) and once without, to see how much
the outlier(s) are affecting the results
Statistics: Unlocking the Power of Data
Lock5
Standard Deviation
The standard deviation for a
quantitative variable measures the
spread of the data
𝑠=
𝑥−𝑥
𝑛−1
2
 Sample standard deviation: s
 Population standard deviation:  (“sigma”)
Statistics: Unlocking the Power of Data
Lock5
Standard Deviation
 The larger the standard deviation, the more
variability there is in the data and the more
spread out the data are
 The standard deviation gives a rough estimate
of the typical distance of a data values from
the mean
Statistics: Unlocking the Power of Data
Lock5
150
s 1
0 50
Frequency
Standard Deviation
-5
0
5
150
-10
10
15
s4
0 50
Frequency
-15
-15
-10
-5
0
5
10
15
Both of these distributions are bell-shaped
Statistics: Unlocking the Power of Data
Lock5
Two Ways of Measuring Obesity
States as cases
Individual people as cases
Differences?
Statistics: Unlocking the Power of Data
Lock5
95% Rule
If a distribution of data is approximately
symmetric and bell-shaped, about 95%
of the data should fall within two
standard deviations of the mean.
 For a population, 95% of the data will be between
µ – 2 and µ + 2
 For a sample, 95% of the data will be between
𝑥 − 2𝑠 and 𝑥 − 2𝑠
Statistics: Unlocking the Power of Data
Lock5
The 95% Rule
Statistics: Unlocking the Power of Data
Lock5
150
s 1
0 50
Frequency
The 95% Rule
-2
-1
0
1
2
3
150
s4
0 50
Frequency
-3
-15
-10
-5
0
5
10
15
 StatKey
Statistics: Unlocking the Power of Data
Lock5
95% Rule
Give an interval that will likely contain 95% of
obesity rates of states.
Statistics: Unlocking the Power of Data
Lock5
95% Rule
Could we use the same method to get an
interval that will contain 95% of BMIs of
American adults?
a) Yes
b) No
Statistics: Unlocking the Power of Data
Lock5
The 95% Rule
The standard
deviation for hours of
sleep per night is
closest to
a)
b)
c)
d)
e)
Statistics: Unlocking the Power of Data
½
1
2
4
I have no idea
Lock5
z-score
The z-score for a data value, x, is
𝑥−𝑥
𝑧=
𝑠
for sample data, and
𝑥−𝜇
𝑧=
𝜎
for population data.
 z-score measures the number of standard
deviations away from the mean
Statistics: Unlocking the Power of Data
Lock5
z-score
 A z-score puts values on a common scale
 A z-score is the number of standard deviations
a value falls from the mean
 Challenge: For symmetric, bell-shaped
distributions, 95% of all z-scores fall between
what two values?
Statistics: Unlocking the Power of Data
Lock5
The 95% Rule: z-scores
Statistics: Unlocking the Power of Data
Lock5
z-scores
Statistics: Unlocking the Power of Data
Lock5
z-score
Which is better, an ACT score of 28 or a
combined SAT score of 2100?
 ACT:  = 21,  = 5
 SAT:  = 1500,  = 325
 Assume ACT and SAT scores have
approximately bell-shaped distributions
a)
b)
c)
ACT score of 28
SAT score of 2100
I don’t know
Statistics: Unlocking the Power of Data
Lock5
Other Measures of Location
Maximum = largest data value
Minimum = smallest data value
Quartiles:
Q1 = median of the values below m.
Q3 = median of the values above m.
Statistics: Unlocking the Power of Data
Lock5
Five Number Summary
 Five Number Summary:
Min
Q1
25%
25%
m
Q3
25%
Max
25%
Minitab: Stat -> Basic Statistics -> Display Descriptive Statistics
Statistics: Unlocking the Power of Data
Lock5
Five Number Summary
> summary(study_hours)
Min.
1st Qu.
2.00
10.00
Median
15.00
3rd Qu.
20.00
Max.
69.00
The distribution of number of hours spent
studying each week is
a)
b)
c)
d)
Symmetric
Right-skewed
Left-skewed
Impossible to tell
Statistics: Unlocking the Power of Data
Lock5
Percentile
The Pth percentile is the value which
is greater than P% of the data
 We already used z-scores to determine whether an
SAT score of 2100 or an ACT score of 28 is better
 We could also have used percentiles:
 ACT
score of 28: 91st percentile
 SAT score of 2100: 97th percentile
Statistics: Unlocking the Power of Data
Lock5
Five Number Summary
 Five Number Summary:
Min
Q1
25%
0th
percentile
m
25%
25th
percentile
25%
50th
percentile
Statistics: Unlocking the Power of Data
Q3
Max
25%
75th
percentile
100th
percentile
Lock5
Measures of Spread
 Range = Max – Min
 Interquartile Range (IQR) = Q3 – Q1
 Is the range resistant to outliers?
a) Yes
b) No
 Is the IQR resistant to outliers?
a) Yes
b) No
Statistics: Unlocking the Power of Data
Lock5
Comparing Statistics
 Measures of Center:
 Mean (not resistant)
 Median (resistant)
 Measures of Spread:
 Standard deviation (not resistant)
 IQR (resistant)
 Range (not resistant)
 Most often, we use the mean and the standard
deviation, because they are calculated based on
all the data values, so use all the available
information
Statistics: Unlocking the Power of Data
Lock5
Boxplot
Lines (“whiskers”) extend from
each quartile to the most extreme
value that is not an outlier
Middle
50% of
data
Statistics: Unlocking the Power of Data
Q3
Median
Q1
Lock5
Boxplot
Outlier
*For boxplots, outliers are
defined as any point more
than 1.5 IQRs beyond the
quartiles (although you
don’t have to know that)
Statistics: Unlocking the Power of Data
Lock5
Boxplot
This boxplot shows a
distribution that is
a) Symmetric
b) Left-skewed
c) Right-skewed
Statistics: Unlocking the Power of Data
Lock5
Summary: One Quantitative Variable

Summary Statistics





Visualization




Center: mean, median
Spread: standard deviation, range, IQR
5 number summary
Percentiles
Dotplot
Histogram
Boxplot
Other concepts


Shape: symmetric, skewed, bell-shaped
Outliers, resistance
z-scores

Statistics: Unlocking the Power of Data
Lock5
To Do
 HW 2.2, 2.3 (due Monday, 2/6)
Statistics: Unlocking the Power of Data
Lock5