Chapter 1
Exploring Data
1.0 Data Analysis
1.1 Analyzing Categorical Data
1.2 Displaying Quantitative Data with Graphs
1.3 Describing Quantitative Data with Numbers
1.0 Data Analysis
Objectives
SWBAT:
1) Identify the individuals and variables in a set of data.
2) Classify variables as categorical or quantitative.
What’s the difference between categorical and quantitative variables?
• A variable is any characteristic of an individual.
• A quantitative (numerical) variable takes numerical values for which it makes
sense to find an average.
• Examples would include height, weight, speed, age, number of oranges in a bowl,
number of stolen bases, etc...
• A categorical variable places an individual into one of several categories or
groups. (think qualitative)
• Examples would include gender, blood type, ethnicity, outcome of a plate appearance
in baseball, etc…
Do we ever use numbers to describe the values of a categorical variable? Do
we ever divide the distribution of a quantitative variable into categories?
• A word of caution: not every variable that takes number values is
quantitative.
• Take zip code for example. It describes a location in which you live. This is really a
categorical variable. It wouldn’t make sense to find the average zip code.
• Another example would be social security number. You could just as easily use letters
instead of numbers to represent someone’s identity.
• Often, variables like age and weight are divided into categories and
treated as categorical variables.
• An example would be age categories to classify people, such as 0-9, 10-19, etc…
What is a distribution?
• A variable generally takes on many different values. In data analysis, we are
interested in how often a variable takes on each value.
• A distribution of a variable tells us what values the variable takes and how
often it takes these values.
• Note: the values can be words or numbers.
Example
2009 Fuel Economy Guide
 #   MODEL                  MPG
 1   Acura RL               22
 2   Audi A6 Quattro        23
 3   Bentley Arnage         14
 4   BMW 528I               28
 5   Buick Lacrosse         28
 6   Cadillac CTS           25
 7   Chevrolet Malibu       33
 8   Chrysler Sebring       30
 9   Dodge Avenger          30
10   Hyundai Elantra        33
11   Jaguar XF              25
12   Kia Optima             32
13   Lexus GS 350           26
14   Lincoln MKZ            28
15   Mazda 6                29
16   Mercedes-Benz E350     24
17   Mercury Milan          29
18   Mitsubishi Galant      27
19   Nissan Maxima          26
20   Rolls Royce Phantom    18
21   Saturn Aura            33
22   Toyota Camry           31
23   Volkswagen Passat      29
24   Volvo S80              25
Variable of Interest: MPG
[Figure: dotplot of MPG, showing its distribution]
Example: US Census Data, 10 randomly selected US residents from 2000 census.
a) Who are the individuals in this data set?
10 randomly selected US residents who participated in the 2000 US census
b) What variables are measured?
1) state; categorical
2) number of family members; quantitative; units: people
3) age; quantitative; units: years
4) gender; categorical
5) marital status; categorical
6) total income; quantitative; units: dollars
7) travel time to work; quantitative; units: minutes
c) Describe the individual in the first row.
The individual lives in Kentucky, has 2 members in her family, is 61 years old, is female, is
married, makes $21,000 a year, and travels 20 minutes to work.
1.1 Analyzing Categorical Data
Objectives
SWBAT:
1) Display categorical data with a bar graph. Decide if it would be
appropriate to make a pie chart.
2) Identify what makes some graphs of categorical data deceptive.
3) Calculate and display the marginal distribution of a categorical variable
from a two-way table.
4) Calculate and display the conditional distribution of a categorical variable
for a particular value of the other categorical variable in a two-way table.
5) Describe the association between two categorical variables by comparing
appropriate conditional distributions.
What is the difference between a frequency table and a relative frequency table? When is it
better to use relative frequency?
• A frequency table is a table that displays the count (frequency) of observations in each
category or class.
• A relative frequency table is a table that shows the percents (relative frequencies) of
observations in each category or class.
Frequency Table
Format               Count of Stations
Adult Contemporary    1556
Adult Standards       1196
Contemporary Hit       569
Country               2066
News/Talk             2179
Oldies                1060
Religious             2014
Rock                   869
Spanish Language       750
Other Formats         1579
Total                13838
Relative Frequency Table
Format               Percent of Stations
Adult Contemporary   11.2
Adult Standards       8.6
Contemporary Hit      4.1
Country              14.9
News/Talk            15.7
Oldies                7.7
Religious            14.6
Rock                  6.3
Spanish Language      5.4
Other Formats        11.4
Total                99.9
In both tables the variable is Format and its values are the format names. The first table
reports counts and the second reports percents.
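The step from a frequency table to a relative frequency table can be sketched in Python. This is only an illustration using the station counts from the table above; the variable names (`counts`, `percents`) are chosen here and are not part of the original slides.

```python
# Counts of radio stations by format, copied from the frequency table.
counts = {
    "Adult Contemporary": 1556,
    "Adult Standards": 1196,
    "Contemporary Hit": 569,
    "Country": 2066,
    "News/Talk": 2179,
    "Oldies": 1060,
    "Religious": 2014,
    "Rock": 869,
    "Spanish Language": 750,
    "Other Formats": 1579,
}

total = sum(counts.values())  # total number of stations

# Relative frequency (percent) = count / total * 100, rounded to one decimal.
percents = {fmt: round(100 * n / total, 1) for fmt, n in counts.items()}

for fmt, pct in percents.items():
    print(f"{fmt:20s} {counts[fmt]:6d} {pct:5.1f}%")
print(f"{'Total':20s} {total:6d} {sum(percents.values()):5.1f}%")
```

Because each percent is rounded separately, the rounded percents add to 99.9 rather than exactly 100, which matches the total shown in the relative frequency table.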
• When the number of observations is not the same (or close to the same)
between distributions, we should make a relative frequency histogram.
Example: Here are two frequency histograms comparing the number of points
scored for players on the LA Lakers and players not on the Lakers in the 2008-2009 regular season.
• Because there are many more players not on the Lakers, it is hard to compare these
distributions.
• With relative frequency histograms, the comparison is much easier to make.
What is the most important thing to remember
when making pie charts and bar graphs? Why
do statisticians prefer bar graphs?
• The most important thing to remember is to
make sure everything is properly labeled!
• Statisticians prefer bar graphs because 1)
they’re easier to make and read and 2) they
allow for a comparison of quantities that are
measured in the same units.
What are some common ways to make a misleading graph?
• When making any graph, avoid adding embellishments that are potentially
misleading.
• One way to make a graph misleading is to violate the area principle, which says
that the area representing each category in a graph should be proportional to
the number of observations in that category (so all bars should be equally wide).
• Another way is to not start
the frequency axis at 0.
• This graph makes it look as if
LeBron missed almost all of his
shots.
• A third way to make graphs misleading is
by making them 3D.
The 3D design makes the slices closer
to the reader appear larger than
those in the back. The red and purple
slices are both 42%, but the purple
looks much larger.
What is wrong with the following graph?
First, the heights of the bars are not accurate. According to the
graph, the difference between 81 and 95 is much greater than the
difference between 56 and 81. Also, the extra width for the DIRECTV
bar is deceptive since our eyes respond to the area, not just the
height.
What is a two-way table? What is a marginal distribution?
• Two-way Table – describes two categorical variables, organizing counts
according to a row variable and a column variable.
Example, p. 12
Young adults by gender and chance of getting rich
Opinion                          Female   Male   Total
Almost no chance                     96     98     194
Some chance, but probably not       426    286     712
A 50-50 chance                      696    720    1416
A good chance                       663    758    1421
Almost certain                      486    597    1083
Total                              2367   2459    4826
The variables described by this table are gender and opinion about getting rich.
• The Marginal Distribution of one of the categorical variables in a two-way
table of counts is the distribution of values of that variable among all
individuals described by the table.
• Note: Percents are often more informative than counts, especially when
comparing groups of different sizes.
• To examine a marginal distribution,
1)Use the data in the table to calculate the marginal distribution (in percents)
of the row or column totals.
2)Make a graph to display the marginal distribution.
Examine the marginal distribution of chance of getting rich.
Response            Percent
Almost no chance    194/4826 = 4.0%
Some chance         712/4826 = 14.8%
A 50-50 chance      1416/4826 = 29.3%
A good chance       1421/4826 = 29.4%
Almost certain      1083/4826 = 22.4%
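The marginal distribution calculation can be sketched in Python as follows. The counts come from the two-way table; the names `table`, `row_totals`, and `marginal` are illustrative choices, not part of the original slides.

```python
# Two-way table of counts: each opinion maps to (female count, male count).
table = {
    "Almost no chance": (96, 98),
    "Some chance, but probably not": (426, 286),
    "A 50-50 chance": (696, 720),
    "A good chance": (663, 758),
    "Almost certain": (486, 597),
}

# Row totals give the counts behind the marginal distribution of opinion.
row_totals = {opinion: female + male for opinion, (female, male) in table.items()}
grand_total = sum(row_totals.values())

# Convert each row total to a percent of all individuals in the table.
marginal = {opinion: round(100 * n / grand_total, 1) for opinion, n in row_totals.items()}

for opinion, pct in marginal.items():
    print(f"{opinion:32s} {row_totals[opinion]:5d}/{grand_total} = {pct}%")
```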
What is a segmented bar graph? Why are
they good to use?
• A segmented bar graph displays the possible
outcomes of a categorical variable as slices
of a rectangle, with the area of each slice
proportional to how often each
corresponding outcome occurred (each bar
must total 100%).
• It is also known as a “stacked” bar chart.
• Segmented bar graphs are good to use
because they force us to use percents.
• Note that they aren’t the best for
comparison purposes. A better graph would
be a side-by-side bar graph like the one on
page 17.
What does it mean for two variables to have an association? How can you tell
by looking at a graph?
• Two variables have an association if knowing the value of one variable helps
predict the value of the other.
• For example, if knowing that a person is male makes one of the responses more likely,
there is an association between gender and response.
• In the graph to the right, there is an
association between gender and opinion.
Knowing that a young adult is male helps
us predict his opinion: he is more likely
than a female to say “good chance” or
“almost certain”.
Continuing with the same example, if
there was no association between
gender and opinion, then knowing a
young adult is male would NOT help us
predict his opinion. He would be no
more or less likely than a female to say
“good chance” or “almost certain” or
any other response. Males and females
would have the same opinions. In other
words, the bars would be almost equal
in height for the genders.
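One way to check for an association numerically is to compare the conditional distribution of opinion within each gender. The sketch below does this in Python with the counts from the two-way table; the dictionary layout and names are assumptions made for illustration.

```python
# Two-way table of counts: each opinion maps to (female count, male count).
table = {
    "Almost no chance": (96, 98),
    "Some chance, but probably not": (426, 286),
    "A 50-50 chance": (696, 720),
    "A good chance": (663, 758),
    "Almost certain": (486, 597),
}

# Column totals: how many females and how many males were surveyed.
female_total = sum(f for f, m in table.values())
male_total = sum(m for f, m in table.values())

# Conditional distribution of opinion for each gender, in percents.
conditional = {
    "Female": {op: round(100 * f / female_total, 1) for op, (f, m) in table.items()},
    "Male": {op: round(100 * m / male_total, 1) for op, (f, m) in table.items()},
}

for opinion in table:
    print(f"{opinion:32s} F: {conditional['Female'][opinion]:5.1f}%   "
          f"M: {conditional['Male'][opinion]:5.1f}%")
```

If there were no association, the two columns of percents would be nearly equal in every row. Here the male percents are higher for "A good chance" and "Almost certain", which is the pattern described above.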
Example: The Pew Research Center asked a random
sample of 2024 adult cell phone owners from the US
which type of cell phone they own: iPhone, Android, or
other (including non-smart phones). Here are the
results, broken down by age category.
a) Explain what it would mean if there was no association between age and
cell phone type.
No association would mean that knowing someone’s age would not help us
predict what type of phone they would buy.
b) Based on this data, can we conclude there is an
association between age and cell phone type?
Justify.
It’s clear that there is an association between age
and cell phone type. We can predict that 18-34 year
olds would get an Android, 35-54 year olds would get
some other type of phone, and 55+ would get some
other phone.
1.2 Displaying Quantitative Data with Graphs
Objectives
SWBAT:
1) Make and interpret dotplots and stemplots of quantitative data.
2) Describe the overall pattern (shape, center, and spread) of a distribution
and identify any major departures from the pattern (outliers).
3) Identify the shape of a distribution from a graph as roughly symmetric or
skewed.
4) Make and interpret histograms of quantitative data.
5) Compare distributions of quantitative data using dotplots, stemplots,
and histograms.
The dotplots show the daily high temperatures for 7
cities in June, July and August.
1) What is the most important difference between
cities A, B, and C?
Their centers
2) What is the most important difference between
cities C and D?
Their spreads
3) What are two important differences between cities
D and E?
Their spreads (but not range) and unusual values
(outliers)
4) What is the most important difference between
cities C, F, and G?
Their shapes
When describing the distribution of a quantitative variable, what characteristics
should be addressed?
• You want to address patterns and departures from patterns. The acronym to
remember is SOCS: Shape, Outliers, Center, and Spread.
Shape
• When you describe a distribution’s shape, concentrate on the main features. Look
for rough symmetry or clear skewness.
Definitions:
A distribution is roughly symmetric if the right and left sides of the
graph are approximately mirror images of each other.
A distribution is skewed to the right (right-skewed) if the right side of
the graph (containing the half of the observations with larger values) is
much longer than the left side.
It is skewed to the left (left-skewed) if the left side of the graph is
much longer than the right side.
To help remember skewed right and skewed left, think about your feet.
A distribution is skewed to the right
when the right side of the graph is
more spread out than the left side.
Think about your right foot. The
toes are tall on the left side and get
progressively smaller as you move
to the right.
A distribution is skewed to the left
when the left side of the graph is
more spread out than the right side.
Think about your left foot. The toes
are tall on the right side and get
progressively smaller as you move
to the left.
Other terms to describe shape:
Unimodal
A distribution is unimodal
when it shows one distinct
peak.
Bimodal
A distribution is bimodal if
it has two distinct peaks.
Note: we don’t worry
about little bumps. They
have to be distinct.
Uniform
A distribution is uniform when
the heights of the bars are all
about the same.
How would you describe the shapes of these distributions?
Skewed right,
unimodal
Symmetric,
unimodal
• An outlier is an individual value that falls outside the
overall pattern of a distribution.
• For now we’ll use an eye test to determine outliers.
Looking at this distribution, there are two unusually
high values that appear to be outliers, at
approximately 57 and 91.
• The center is the middle value in the distribution
(either the mean or median).
• The spread is the variability of a distribution
(how spread out the data are).
• Common measures of spread are range and IQR.
• Here is an example of Tom Brady’s passer ratings in
the 2001 NFL season.
Describe the spread.
The range is 148.3 - 57.1 = 91.2.
Frozen Pizza Example
Here are the number of calories per serving for 16 brands of
frozen cheese pizza, along with a dotplot of the data.
340 340 310 320 310 360 350 330
260 380 340 320 360 290 320 330
Shape: roughly symmetric and unimodal
Center: median at 330 calories
Spread: the values vary from 260 calories to 380 calories (a range of
120)
Outliers: there appears to be one unusually small value (260 calories)
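The center and spread reported above can be checked with a few lines of Python. This is a minimal sketch using the 16 calorie values listed in the example; the variable names are illustrative.

```python
from statistics import median

# Calories per serving for the 16 brands of frozen cheese pizza.
calories = [340, 340, 310, 320, 310, 360, 350, 330,
            260, 380, 340, 320, 360, 290, 320, 330]

center = median(calories)                 # average of the two middle values
spread = max(calories) - min(calories)    # range = maximum - minimum

print(f"Median: {center} calories")
print(f"Range: {min(calories)} to {max(calories)} ({spread} calories)")
```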
What is the most important thing to remember when you are asked to
compare two distributions?
• You need to actually compare the distributions using explicit comparison
words!
• Examples:
  • The center for distribution A is larger than the center for distribution B.
  • Carucci’s cat meows less than Mr. Fal’s cat.
  • Prestige Worldwide makes the same amount of money as the South Pole Elf
  Corporation.
• Needless to say, this is only applicable to certain parts of SOCS (center and
spread). One shape cannot necessarily be better than another shape.
How do the annual energy costs (in dollars) compare for refrigerators with top freezers and
refrigerators with bottom freezers? The data below is from the May 2010 issue of Consumer
Reports.
• Shape: The distribution for bottom freezers looks skewed right and possibly bimodal (modes near $58
and $70 per year). The distribution for top freezers looks roughly symmetric, with its main peak
centered near $55.
• Outliers: There appear to be two bottom freezers with unusually high energy costs (over $140). There
are no outliers for the top freezers.
• Center: The typical energy cost for bottom freezers is greater than the typical cost for the top freezers
(midpoint of $69 vs midpoint of $56).
• Spread: There is much more variability in the energy costs for bottom freezers.
What is the most important
thing to remember when
making a stemplot?
• Stemplots (aka stem-and-leaf plots) are simple
graphical displays for fairly
small data sets.
• Stemplots give us a quick
picture of the distribution
while including the actual
numerical values.
• Just like with all displays, it
is important to remember
the LABELS (and a key)!!!!
How to Make a Stemplot
1)Separate each observation into a stem (all but the final
digit) and a leaf (the final digit).
2)Write all possible stems from the smallest to the largest in a
vertical column and draw a vertical line to the right of the
column.
3)Write each leaf in the row to the right of its stem.
4)Arrange the leaves in increasing order out from the stem.
5)Provide a key that explains in context what the stems and
leaves represent.
• Stemplots (Stem-and-Leaf Plots)
• These data represent the responses of 20 female AP Statistics students to the
question, “How many pairs of shoes do you have?” Construct a stemplot.
50 26 26 31 57 19 24 22 23 38
13 50 13 34 23 30 49 13 15 51
Stems      Add leaves      Order leaves
1 |        1 | 93335       1 | 33359
2 |        2 | 664233      2 | 233466
3 |        3 | 1840        3 | 0148
4 |        4 | 9           4 | 9
5 |        5 | 0701        5 | 0017
Add a key
Key: 4|9 represents a female student who reported having 49 pairs of shoes.
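The stemplot steps can also be sketched in Python. The function below is a hypothetical helper written for this illustration (it assumes nonnegative whole-number data with the final digit as the leaf); it is not taken from the slides or from any library.

```python
from collections import defaultdict

def stemplot(values):
    """Return the rows of a stemplot as strings such as '4 | 9'."""
    leaves = defaultdict(list)
    for v in sorted(values):              # sorting puts the leaves in increasing order
        stem, leaf = divmod(v, 10)        # stem = all but the final digit, leaf = final digit
        leaves[stem].append(leaf)
    rows = []
    # List every possible stem from smallest to largest, even if it has no leaves.
    for stem in range(min(leaves), max(leaves) + 1):
        rows.append(f"{stem} | {''.join(str(d) for d in leaves[stem])}")
    return rows

# Pairs of shoes reported by the 20 female AP Statistics students.
shoes = [50, 26, 26, 31, 57, 19, 24, 22, 23, 38,
         13, 50, 13, 34, 23, 30, 49, 13, 15, 51]

rows = stemplot(shoes)
print("\n".join(rows))
print("Key: 4|9 represents a student who reported having 49 pairs of shoes.")
```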
Sometimes it may be beneficial to split stems, which is a method for spreading out a stemplot
that has too few stems (the data tend to be bunched up). Each stem is listed twice: leaves 0-4
go on the first copy of the stem and leaves 5-9 go on the second.
Example: Which gender is taller, males or females? A sample of 14-year-olds from the UK was
randomly selected using the CensusAtSchool website. Here are the heights of the students (in
cm). Make a back-to-back stemplot and compare the distributions.
Male: 154, 157, 187, 163, 167, 159, 169, 162, 176, 177, 151, 175, 174, 165, 165, 183, 180
Female: 160, 169, 152, 167, 164, 163, 160, 163, 169, 157, 158, 153, 161, 165, 165, 159, 168,
153, 166, 158, 158, 166
[Figure: back-to-back stemplot of the heights without split stems]
[Figure: back-to-back stemplot of the heights with split stems]
Shape: The female distribution is skewed
left and unimodal. The male distribution is
symmetric and unimodal.
Outliers: Neither distribution appears to
contain outliers.
Centers: The males have a larger center
than the females (median of 167
centimeters vs. median of 162 centimeters;
with 22 females, average the middle two values, 161 and 163).
Spread: The male distribution has greater
variability than the female distribution.
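The center and spread comparisons can be verified with a short Python sketch using the heights listed in the example. The variable names are illustrative, and the range is used here only as a quick check on variability.

```python
from statistics import median

# Heights (cm) of the sampled 14-year-olds, copied from the example.
male = [154, 157, 187, 163, 167, 159, 169, 162, 176, 177,
        151, 175, 174, 165, 165, 183, 180]
female = [160, 169, 152, 167, 164, 163, 160, 163, 169, 157, 158,
          153, 161, 165, 165, 159, 168, 153, 166, 158, 158, 166]

male_median = median(male)        # 17 values: the 9th value in order
female_median = median(female)    # 22 values: average of the 11th and 12th

male_range = max(male) - min(male)
female_range = max(female) - min(female)

print(f"Median height: males {male_median} cm, females {female_median} cm")
print(f"Range: males {male_range} cm, females {female_range} cm")
```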
• Histograms
• Quantitative variables often take many values. A graph of the distribution may be
clearer if nearby values are grouped together.
• The most common graph of the distribution of one quantitative variable is a
histogram.
How to Make a Histogram
1)Divide the range of data into classes of equal width.
2)Find the count (frequency) or percent (relative frequency) of
individuals in each class.
3)Label and scale your axes and draw the histogram. The
height of the bar equals its frequency. Adjacent bars should
touch, unless a class contains no individuals.
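Steps 1 and 2 can be sketched in Python. The example reuses the shoe data from the stemplot section with classes of width 10 starting at 10; the helper `class_counts` is a hypothetical function written for this illustration and assumes every value is at least `start`.

```python
def class_counts(values, start, width):
    """Count the values in each class [start + k*width, start + (k+1)*width)."""
    n_classes = int((max(values) - start) // width) + 1
    counts = [0] * n_classes
    for v in values:
        counts[int((v - start) // width)] += 1
    return counts

# Pairs of shoes reported by the 20 female AP Statistics students.
shoes = [50, 26, 26, 31, 57, 19, 24, 22, 23, 38,
         13, 50, 13, 34, 23, 30, 49, 13, 15, 51]

counts = class_counts(shoes, start=10, width=10)      # frequencies
percents = [100 * c / len(shoes) for c in counts]     # relative frequencies

# A rough text version of the histogram: one star per individual.
for k, (c, p) in enumerate(zip(counts, percents)):
    low = 10 + 10 * k
    print(f"{low}-{low + 9}: {'*' * c}  ({c}, {p:.0f}%)")
```

Using percents instead of counts for the bar heights gives the relative frequency histogram, which is the version to prefer when comparing groups of different sizes.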
• The smallest observation
is 93.2 and the largest is
106.1. We could choose
classes of width 2
starting at 93.
Why would we prefer a relative frequency histogram
to a frequency histogram?
When comparing distributions with different sample
sizes! When the numbers of observations are not
equal, a fair comparison cannot be made using just
the frequencies.
• Follow pages 36-37 to make a histogram on the TI-84!
• Note:
• To change the boundaries, press WINDOW.
• Xmin defines where the first class begins and Xscl defines the class width.
• Xmax, Ymin, and Ymax define how big the window will be.
1.3 Describing Quantitative Data with Numbers
Objectives
SWBAT:
1) Calculate measures of center (mean, median).
2) Calculate and interpret measures of spread (range, IQR, standard
deviation).
3) Choose the most appropriate measure of center and spread in a given
setting.
4) Identify outliers using the 1.5 X IQR rule.
5) Make and interpret boxplots of quantitative data.
6) Use appropriate graphs and numerical summaries to compare
distributions of quantitative variables.
What is a resistant measure? Is the mean a resistant measure of center? Is the
median a resistant measure of center?
• A resistant measure is a measure that can resist the influence of extreme
observations.
• Think about if we were going to calculate the mean salary for students in this
classroom.
• Let’s say Adam Sandler finds out he is one class short of graduating high
school, and that class happens to be AP Statistics. He moves to Lyndhurst and
transfers into this class. What effect would his salary have on the mean?
• What type of effect would it have on the median?
• Because the mean cannot resist the influence of extreme observations, we say
that it is not a resistant measure of center. However, median is a resistant
measure of center.
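The effect of one extreme observation can be seen in a small Python sketch. The salary figures below are invented purely for illustration (they are not data from the slides); only the pattern matters.

```python
from statistics import mean, median

# HYPOTHETICAL annual salaries (dollars) for five students with part-time jobs.
salaries = [0, 0, 2000, 3000, 5000]

mean_before = mean(salaries)
median_before = median(salaries)

# Add one HYPOTHETICAL extreme salary, standing in for a movie star joining the class.
with_star = salaries + [20_000_000]

mean_after = mean(with_star)
median_after = median(with_star)

print(f"Mean:   {mean_before} -> {mean_after}")
print(f"Median: {median_before} -> {median_after}")
```

With these made-up numbers the mean jumps from 2,000 to more than 3 million, while the median only moves from 2,000 to 2,500. That is what it means for the median to be resistant and the mean not to be.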
How does the shape of a distribution affect the relationship between mean and median?
There is a connection between the shape of a distribution and the relationship between the
mean and median of the distribution.
• When a distribution is symmetric, the mean and median will be approximately the same.
• When a distribution is skewed right, the mean will be greater than the median.
• When a distribution is skewed left, the mean will be smaller than the median.
• This distribution of stolen bases is
skewed right, with a median of 5, as
noted on the histogram.
• It does not seem plausible that the
balancing point (mean) is also 5.
Because the distribution is stretched
out to the right, the mean must be
greater than 5. Think of all the
extremely high values that will pull the
mean up.
• Two common measures of spread (variability) are range and IQR.
What is range? Is it a resistant measure of spread?
• The range of a distribution is the distance between the minimum value and
the maximum value.
• Do you think it is a resistant measure of spread? Let’s go back to our Adam
Sandler example.
• Range can be a bit deceptive if there is an unusually high or unusually low
value in a distribution. It is not a resistant measure of spread.
What are quartiles? How do you find them?
• Quartiles are the values that divide a distribution into four groups of roughly
the same size.
• Find the quartiles: 4 6 8 12 20 22 27
How To Calculate The Quartiles And The IQR:
To calculate the quartiles:
1.Arrange the observations in increasing order and locate the median.
2.The first quartile Q1 is the median of the observations located to the
left of the median in the ordered list.
3.The third quartile Q3 is the median of the observations located to the
right of the median in the ordered list.
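The quartile steps above can be sketched in Python. The function `quartiles` is a hypothetical helper that follows this median-of-each-half rule; calculators and software packages may use slightly different rules and so can give slightly different quartiles.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-each-half method."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]                                  # observations left of the median
    upper = data[half + 1:] if n % 2 else data[half:]    # observations right of the median
    return median(lower), median(data), median(upper)

q1, med, q3 = quartiles([4, 6, 8, 12, 20, 22, 27])
iqr = q3 - q1
print(f"Q1 = {q1}, Median = {med}, Q3 = {q3}, IQR = {iqr}")
```

For the seven values in the practice problem, the median is the 4th value, Q1 is the median of the three values below it, and Q3 is the median of the three values above it.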
What is the interquartile range (IQR)? Is the IQR a resistant measure of spread?
• The interquartile range (IQR) is a single number that measures the range of the
middle half of the distribution, ignoring the values in the lowest quarter of the
distribution and the values in the highest quarter of the distribution.
The interquartile range (IQR) is defined as: IQR = Q3 – Q1
• Since IQR essentially discards the lowest and highest 25% of the
distribution, any outliers would have a minimal effect on IQR. As a result,
we can state that IQR is a resistant measure of spread.
Here are data on the amount of fat (in grams) in 9 different McDonald’s
fish and chicken sandwiches. Calculate the median and the IQR.
Median=19
What is an outlier? How do you identify them? Are there any
outliers in the chicken/fish distribution?
In addition to serving as a measure of spread, the interquartile
range (IQR) is used as part of a rule of thumb for identifying
outliers.
The 1.5 x IQR Rule for Outliers
Call an observation an outlier if it falls more than 1.5 x IQR
above the third quartile or below the first quartile.
Since no values fall below the boundary of 0.75 or above the
boundary of 38.75, the distribution contains no outliers.
Are there any outliers in the
beef distribution?
No values fall below 9, so there are no small outliers.
43 falls above 41, so the Double Quarter Pounder with
Cheese is an outlier.
The Five-Number Summary
The minimum and maximum values alone tell us little about the
distribution as a whole. Likewise, the median and quartiles tell us little
about the tails of a distribution.
To get a quick summary of both center and spread, combine all five
numbers.
The five-number summary of a distribution consists of the
smallest observation, the first quartile, the median, the third
quartile, and the largest observation, written in order from
smallest to largest.
Minimum   Q1   Median   Q3   Maximum
Five-number summaries are displayed with box-and-whisker plots.
Boxplots (Box-and-Whisker Plots)
The five-number summary divides the distribution roughly into quarters.
This leads to a new way to display quantitative data, the boxplot.
How To Make A Boxplot:
• A central box is drawn from the first quartile (Q1) to the
third quartile (Q3).
• A line in the box marks the median.
• Lines (called whiskers) extend from the box out to the
smallest and largest observations that are not outliers.
• Outliers are marked with a special symbol such as an
asterisk (*).
Construct a Boxplot
Consider our New York travel time data:
10 30 5 25 40 20 10 15 30 20 15 20 85 15 65 15 60 60 40 45
In increasing order:
5 10 10 15 15 15 15 20 20 20 25 30 30 40 40 45 60 60 65 85
Min = 5   Q1 = 15   Median = 22.5   Q3 = 42.5   Max = 85
Recall, the maximum value of 85 is an outlier by the 1.5 x IQR rule.
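The five-number summary and the outlier check for the travel time data can be sketched in Python. This is an illustration only; the variable names are assumptions, and the quartiles follow the median-of-each-half method described earlier.

```python
from statistics import median

# New York travel times (minutes), copied from the example.
travel_times = [10, 30, 5, 25, 40, 20, 10, 15, 30, 20,
                15, 20, 85, 15, 65, 15, 60, 60, 40, 45]

data = sorted(travel_times)
n = len(data)
half = n // 2

q1 = median(data[:half])              # median of the lower half
med = median(data)
q3 = median(data[half + n % 2:])      # median of the upper half (skips the middle value if n is odd)
iqr = q3 - q1

# 1.5 x IQR rule: anything beyond these fences is called an outlier.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < low_fence or x > high_fence]

# Whiskers extend to the most extreme observations that are not outliers.
non_outliers = [x for x in data if low_fence <= x <= high_fence]
whiskers = (min(non_outliers), max(non_outliers))

print(f"Five-number summary: {data[0]}, {q1}, {med}, {q3}, {data[-1]}")
print(f"Fences: {low_fence} and {high_fence}; outliers: {outliers}")
print(f"Whiskers run from {whiskers[0]} to {whiskers[1]}")
```

The upper fence is 42.5 + 1.5(27.5) = 83.75, so 85 is flagged as an outlier and the upper whisker stops at 65.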
• In the distribution above, how far are the values from the
mean, on average?
• The concept of mean absolute deviation is similar to
standard deviation.
What does the standard deviation measure?
• Standard deviation measures the average distance of observations from
their mean (typical deviation from the mean).
• Note: The formulas for standard deviation and variance are on the formula
sheet. However, the calculations can be done right on the TI-84.
What are some similarities and differences between range, IQR, and standard
deviation?
• Similarity: All three measure variability.
• Differences:
• only IQR is resistant to outliers
• only standard deviation uses all the data
How is the standard deviation calculated? What is the variance?
• For now, just know that variance is standard deviation squared.
• To understand how to calculate standard deviation, we need to understand
what a deviation is first.
• Deviation from the mean: A deviation from the mean, $x - \bar{x}$, is the difference
between the value of $x$ and the mean, $\bar{x}$ (how far an observation is away from
the mean).
• For example, let’s say our data set consists of the values 4, 5, 8, and 11.
• The mean would be 28/4 = 7.
• Our deviations would be -3, -2, 1, and 4.
• A positive deviation indicates a value is larger than the mean.
• A negative deviation indicates a value is smaller than the mean.
• A deviation of 0 indicates the value is equal to the mean.
• The sum of the deviations, $\sum (x - \bar{x})$, is always zero because the deviations of x
values smaller than the mean (which are negative) cancel out those of x values
larger than the mean (which are positive).
• We can remove this neutralizing effect if we do something to make all the
deviations positive. This can be accomplished by squaring each of the
deviations; squared deviations will all be nonnegative (positive or zero) values.
The squared deviations are used to find the variance.
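As a worked check of these points, here are the deviations for the example data set 4, 5, 8, 11 (mean 7), written out in LaTeX notation with $\bar{x}$ denoting the mean. The first line shows the deviations canceling; the second shows that the squared deviations do not.

```latex
\begin{align*}
\sum (x - \bar{x})   &= (-3) + (-2) + 1 + 4 = 0 \\
\sum (x - \bar{x})^2 &= (-3)^2 + (-2)^2 + 1^2 + 4^2 = 9 + 4 + 1 + 16 = 30
\end{align*}
```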
• Variance is the mean of the squared deviations
• Variance of a population is denoted by $\sigma^2$:
$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{n}$$
• Variance of a sample is denoted by $s^2$:
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$
Note: variance is calculated slightly differently for populations vs samples.
Unless specified, we assume we are working with a sample.
Steps to find the variance.
1) Find the sum of your data set.
2) Find the mean.
3) Calculate the deviations.
4) Square the deviations.
5) Sum the squared deviations.
6) Divide the sum of the squared
deviations by n if working with the
population or n-1 if working with the
sample.
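Example: Find the variance for the
sample data: 25, 26, 30.
The mean is 27, so the squared deviations are 4, 1, and 9, which sum to 14.
Variance:
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{14}{3 - 1} = \frac{14}{2} = 7$$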
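The variance steps can be followed one at a time in Python. This is a minimal sketch for the sample 25, 26, 30; the variable names are illustrative, and `statistics.variance` from the standard library is used only as a cross-check.

```python
from statistics import variance

sample = [25, 26, 30]
n = len(sample)

total = sum(sample)                          # 1) sum of the data
mean = total / n                             # 2) mean
deviations = [x - mean for x in sample]      # 3) deviations from the mean
squared = [d ** 2 for d in deviations]       # 4) squared deviations
sum_sq = sum(squared)                        # 5) sum of squared deviations
sample_variance = sum_sq / (n - 1)           # 6) divide by n - 1 for a sample
population_variance = sum_sq / n             #    (divide by n for a population)

print(f"mean = {mean}, deviations = {deviations}, squared = {squared}")
print(f"sample variance = {sample_variance}")
print(f"library check: {variance(sample)}")
```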
• The standard deviation is the positive square root of the variance.
• Standard deviation measures a typical deviation from the mean.
• In the previous example, the variance was 7, so the standard deviation is:
$$s = \sqrt{s^2} = \sqrt{7} \approx 2.65$$
• The standard deviation of a population is:
$$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{n}}$$
• The standard deviation of a sample is:
$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$$
What are some properties of standard deviation?
• SD measures the spread about the mean and should be used only when the
mean is chosen as the center (if median is chosen, use IQR).
• SD is always greater than or equal to 0. [A SD of 0 would mean all
observations are the same value.]
• SD has the same units of measurement as the original observations.
• SD is not resistant to outliers. A few outliers can make SD very large.
A random sample of 5 students was asked how many minutes they spent
doing HW the previous night. Here are their responses (in minutes): 0, 25,
30, 60, 90. Calculate and interpret the standard deviation.
The number of minutes spent doing HW typically varies by about 34.71 minutes
from the mean of 41 minutes.
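The standard deviation in this example can be reproduced with a short Python sketch. The names are illustrative, and `statistics.stdev` (the sample standard deviation in the standard library) is included only as a cross-check.

```python
from math import sqrt
from statistics import stdev

# Minutes spent on homework by the 5 sampled students.
minutes = [0, 25, 30, 60, 90]
n = len(minutes)

mean = sum(minutes) / n                           # mean
sum_sq = sum((x - mean) ** 2 for x in minutes)    # sum of squared deviations
s = sqrt(sum_sq / (n - 1))                        # sample standard deviation

print(f"mean = {mean} minutes")
print(f"s = sqrt({sum_sq} / {n - 1}) = {s:.2f} minutes")
print(f"library check: {stdev(minutes):.2f}")
```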
What factors should you consider when choosing summary statistics?
We now have a choice between two descriptions for center and spread
• Mean and Standard Deviation
• Median and Interquartile Range
Choosing Measures of Center and Spread
•The median and IQR are usually better than the mean and standard
deviation for describing a skewed distribution or a distribution with
outliers.
•Use mean and standard deviation only for reasonably symmetric
distributions that don’t have outliers.
•NOTE: Numerical summaries do not fully describe the shape of a
distribution. ALWAYS PLOT YOUR DATA!