Download ast3e_chapter02

Document related concepts
no text concepts found
Transcript
Chapter 2
Exploring Data with Graphs
and Numerical Summaries
Section 2.1
Different Types of Data
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Variable
A variable is any characteristic observed in a study.
Examples: Marital status, Height, Weight, IQ
A variable can be classified as either
 Categorical (in Categories), or
 Quantitative (Numerical)
3
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Categorical Variable
A variable can be classified as categorical if each
observation belongs to one of a set of categories:
Examples:
 Gender (Male or Female)
 Religious Affiliation (Catholic, Jewish, …)
 Type of Residence (Apartment, Condo, …)
 Belief in Life After Death (Yes or No)
4
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Quantitative Variable
A variable is called quantitative if observations on it take
numerical values that represent different magnitudes of
the variable.
Examples:
 Age
 Number of Siblings
 Annual Income
5
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Main Features of Quantitative and
Categorical Variables
For Quantitative variables: key features are the center and
spread (variability) of the data.
Example: What’s a typical annual amount of precipitation? Is
there much variation from year to year?
For Categorical variables: a key feature is the percentage of
observations in each of the categories.
Example: What percentage of students at a certain college
are Democrats?
6
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Discrete Quantitative Variable
A quantitative variable is discrete if its possible values
form a set of separate numbers, such as 0,1,2,3,….
Discrete variables have a finite number of possible values.
Examples:
 Number of pets in a household
 Number of children in a family
 Number of foreign languages spoken by an
individual
7
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Continuous Quantitative Variable
A quantitative variable is continuous if its possible values
form an interval.
Continuous variables have an infinite number of possible
values.
Examples:
 Height/Weight
 Age
 Blood pressure
8
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Class Problem #1
Identify the variable type as either categorical or
quantitative.




9
Number of siblings in a family
County of residence
Distance (in miles) of commute to school
Marital status
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Class Problem #2
Identify each of the following variables as continuous
or discrete.




10
Length of time to take a test
Number of people waiting in line
Number of speeding tickets received last year
Your dog’s weight
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Proportion & Percentage (Relative
Frequencies)
The proportion of the observations that fall in a certain
category is the frequency (count) of observations in that
category divided by the total number of observations.
Frequency of that class
Sum of all frequencies
The percentage is the proportion multiplied by 100.
Proportions and percentages are also called relative
frequencies.
11
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Frequency, Proportion, & Percentage
Example
There were 268 reported shark attacks in Florida between
2000 and 2010. Table 2.1 classifies 715 shark attacks
reported from 2000 through 2010, so, for Florida,



12
268 is the frequency.
0.375 =268/715 is the proportion and relative
frequency.
37.5 is the percentage 0.375 *100= 37.5%.
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Frequency Table
A frequency table is a listing of possible values for a
variable, together with the number of observations and/or
relative frequencies for each value.
Table 2.1 Frequency of Shark Attacks in Various Regions for 2000–2010
13
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Class Problem #3
A stock broker has been following different stocks
over the last month and has recorded whether a stock is
up, the same, or down in value. The results were:
1.
Performance of stock Up Same Down
Count
21
7
12
1. What is the variable of interest?
2. What type of variable is it?
3. Add proportions to this frequency table.
14
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Chapter 2
Exploring Data with Graphs
and Numerical Summaries
Section 2.2
Graphical Summaries of Data
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Distribution
A graph or frequency table describes a distribution.
A distribution tells us the possible values a variable takes
as well as the occurrence of those values (frequency or
relative frequency).
16
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Graphs for Categorical Variables
The two primary graphical displays for summarizing a
categorical variable are the pie chart and the bar graph.
Pie Chart: A circle having a “slice of pie” for each
category.
Bar Graph: A graph that displays a vertical bar for each
category.
17
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Pie Charts
Pie Charts:
 Used for summarizing a categorical variable.
18

Drawn as a circle where each category is
represented as a “slice of the pie”.

The size of each pie slice is proportional to the
percentage of observations falling in that category.
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Example: Renewable Electricity
Table 2.2 Sources of Electricity in the United States and Canada, 2009
19
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Example: Renewable Electricity
Figure 2.1 Pie Chart of Electricity Sources in the United States. The label for each slice
of the pie gives the category and the percentage of electricity generated from that source.
The slice that represents the percentage generated by coal is 45% of the total area of the
pie. Question: Why is it beneficial to label the pie wedges with the percent?
20
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Bar Graphs
Bar graphs are used for summarizing a categorical variable.
 Bar Graphs display a vertical bar for each category.
 The height of each bar represents either counts
(“frequencies”) or percentages (“relative frequencies”) for
that category.
 It is usually easier to compare categories with a bar
graph rather than with a pie chart.
 Bar Graphs are called Pareto Charts when the
categories are ordered by their frequency, from the
tallest bar to the shortest bar.
21
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Example: Renewable Electricity
Figure 2.2 Bar Graph of Electricity Sources in the United States. The bars are
ordered from largest to smallest based on the percentage use. (Pareto Chart).
22
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Graphs for Quantitative Variables
Dot Plot: shows a dot for each observation placed above
its value on a number line.
Stem-and-Leaf Plot: portrays the individual observations.
Histogram: uses bars to portray the data.
23
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Choosing a Graph Type
How do we decide which to use? Here are some guidelines:
Dot-plot and stem-and-leaf plot
 More useful for small data sets
 Data values are retained
Histogram
 More useful for large data sets
 Most compact display
 More flexibility in defining intervals
24
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Dot Plots
Dot Plots are used for summarizing a quantitative variable.
To construct a dot plot
1. Draw a horizontal line.
2. Label it with the name of the variable.
3. Mark regular values of the variable on it.
4. For each observation, place a dot above its value on
the number line.
25
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Steps for Constructing a Histogram
1. Divide the range of the data into intervals of equal
width.
2. Count the number of observations in each interval,
creating a frequency table.
3. On the horizontal axis, label the values or the
endpoints of the intervals.
4. Draw a bar over each value or interval with height
equal to its frequency (or percentage), values of which
are marked on the vertical axis.
5. Label and title appropriately.
26
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Example: Sodium in Cereals
Table 2.4 Frequency Table for Sodium in 20 Breakfast Cereals. The table
summarizes the sodium values using eight intervals and lists the number of
observations in each, as well as the proportions and percentages.
27
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Histogram for Sodium in Cereals
Figure 2.6 Histogram of Breakfast Cereal Sodium Values. The rectangular bar over
an interval has height equal to the number of observations in the interval.
28
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Interpreting Histograms
Overall pattern consists of center, spread, and shape.
29

Assess where a distribution is centered by finding
the median (50% of data below median 50% of data
above).

Assess the spread of a distribution.

Shape of a distribution: roughly symmetric, skewed
to the right, or skewed to the left.
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Shape
Symmetric Distributions: if both left and
right sides of the histogram are mirror images of
each other.
A distribution is skewed
to the left if the left tail is
longer than the right tail.
30
A distribution is skewed to
the right if the right tail is
longer than the left tail.
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Examples of Skewness
31
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Outlier
An outlier falls far from the rest of the data.
32
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Time Plots
Used for displaying a time series, a data set collected over
time.
 Plots each observation on the vertical scale against the
time it was measured on the horizontal scale. Points
are usually connected.
 Common patterns in the data over time, known as
trends, should be noted.
 To see a trend more clearly, it is beneficial to connect
the data points in their time sequence.
33
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Time Plots Example
The number of people in the United Kingdom between
2006 and 2010 who used the Internet (in millions).
Adults using the Internet everyday
34
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Chapter 2
Exploring Data with Graphs
and Numerical Summaries
Section 2.3
Measuring the Center of Quantitative Data
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Mean
The mean is the sum of the observations divided by the
number of observations.
x 
36

x
n
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Example: Center of the Cereal Sodium Data
We find the mean by adding all the observations and
then dividing this sum by the number of observations,
which is 20:
0, 340, 70, 140, 200, 180, 210, 150, 100, 130,
140, 180, 190, 160, 290, 50, 220, 180, 200, 210
Mean = (0 + 340 + 70 + . . . +210)/20 = 3340/20 =167
37
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Median
The median is the midpoint of the observations when they are
ordered from the smallest to the largest (or from the largest to
smallest).
How to Determine the Median:
Put the n observations in order of their size.
If the number of observations, n, is:
38

odd, then the median is the middle observation.

even, then the median is the average of the two middle
observations.
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Example: CO2 Pollution
CO2 pollution levels in 8 largest nations measured in
metric tons per person:
China 4.9
India 1.4
United States 18.9
Indonesia 1.8
Brazil 1.9
Pakistan 0.9
Russia 10.8
Bangladesh 0.3
The CO2 values have n = 8 observations. The ordered
values are: 0.3, 0.9, 1.4, 1.8, 1.9, 4.9, 10.8, 18.9
39
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Example: CO2 Pollution
Since n is even, two observations are in the middle, the fourth and
fifth ones in the ordered sample. These are 1.8 and 1.9. The median is
their average, 1.85.
The relatively high value of 18.9 falls well above the rest of the
data. It is an outlier.
The size of the outlier affects the calculation of the mean but not
the median.
40
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Comparing the Mean and Median
The shape of a distribution influences whether the mean is
larger or smaller than the median.
 Perfectly symmetric, the mean equals the median.
 Skewed to the right, the mean is larger than the
median.
 Skewed to the left, the mean is smaller than the
median.
41
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Comparing the Mean and Median
In a skewed distribution, the mean is farther out in the long tail
than is the median.

For skewed distributions the median is preferred because
it is better representative of a typical observation.
Figure 2.9 Relationship Between the Mean and Median. Question: For skewed
distributions, what causes the mean and median to differ?
42
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Resistant Measures
A numerical summary measure is resistant if extreme
observations (outliers) have little, if any, influence on its
value.


43
The Median is resistant to outliers.
The Mean is not resistant to outliers.
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
The Mode
Mode
44

Value that occurs most often.

Highest bar in the histogram.

The mode is most often used with categorical data.
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Chapter 2
Exploring Data with Graphs
and Numerical Summaries
Section 2.4
Measuring the Variability of Quantitative Data
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Range
One way to measure the spread is to calculate the range.
The range is the difference between the largest and
smallest values in the data set:
Range = max  min
The range is simple to compute and easy to understand,
but it uses only the extreme values and ignores the other
values. Therefore, it’s affected severely by outliers.
46
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Standard Deviation
The deviation of an observation x from the mean x is
( x  x ), the difference between the observation and the
sample mean.
 Each data value has an associated deviation from the
mean.
 A deviation is positive if the value falls above the mean
and negative if the value falls below the mean.
 The sum of the deviations for all the values in a data
set is always zero.
47
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Standard Deviation
For the cereal sodium values, the mean is = 167. The observation
of 210 for Honeycomb has a deviation of 210 - 167 = 43. The
observation of 50 for Honey Smacks has a deviation of
50 - 167 = -117. Figure 2.11 shows these deviations
Figure 2.9 Dot Plot for Cereal Sodium Data, Showing Deviations for Two
Observations. Question: When is a deviation positive and when is it negative?
48
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
The Standard Deviation s of n
Observations
Gives a measure of variation by summarizing the deviations
of each observation from the mean and calculating an
adjusted average of these deviations.
( x  x )
s
n 1
49
2
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Standard Deviation
1.
2.
3.
4.
5.
Find the mean.
Find the deviation of each value from the mean.
Square the deviations.
Sum the squared deviations.
Divide the sum by n-1 and take the square root of that
value.
n
1
2
s
( xi  x )

( n  1) i 1
50
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Standard Deviation
Metabolic rates of 7 men (cal./24hr):
1792  1666  1362  1614  1460  1867  1439
1792  1666  1362  1614  1460  1867  1439
x
7
11, 200

7
 1600
51
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Standard Deviation
s  35 ,811 . 67  189 . 24 calories
214 ,870
s 
 35,811 .67
7 1
2
52
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Properties of the Standard Deviation
The most basic property of the standard deviation is this:
The larger the standard deviation
the data.

s

s  0 only when all observations have the same value, otherwise
s  0 . As the spread of the data increases, s gets larger.

s has the same units of measurement as the original
2
observations. The variance = s has units that are squared.
s is not resistant. Strong skewness or a few outliers can greatly
increase s .

53
s , the greater the variability of
measures the spread of the data.
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Magnitude of s: The Empirical Rule
If a distribution of data is bell shaped, then approximately:
 68% of the observations fall within 1 standard deviation
of the mean, that is, between the values of x  s and
x  s (denoted x  s).
 95% of the observations fall within 2 standard deviations
of the mean ( x  2 s ).
 All or nearly all observations fall within 3 standard
deviations of the mean ( x  3s ).
54
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Magnitude of s: The Empirical Rule
Figure 2.12 The Empirical Rule. For bell-shaped distributions, this tells us approximately
how much of the data fall within 1, 2, and 3 standard deviations of the mean. Question:
About what percentage would fall more than 2 standard deviations from the mean?
55
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Chapter 2
Exploring Data with Graphs
and Numerical Summaries
Section 2.5
Using Measures of Position to Describe
Variability
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Percentile
The p th percentile is a value such that p percent of the
observations fall below or at that value.
57
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Quartiles
Figure 2.14 The Quartiles Split the Distribution Into Four Parts. 25% is below the first
quartile (Q1), 25% is between the first quartile and the second quartile (the median, Q2), 25%
is between the second quartile and the third quartile (Q3), and 25% is above the third quartile.
Question: Why is the second quartile also the median?
58
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
SUMMARY: Finding Quartiles
 Arrange the data in order.
 Consider the median. This is the second quartile, Q2.
 Consider the lower half of the observations (excluding
the median itself if n is odd). The median of these
observations is the first quartile, Q1.
 Consider the upper half of the observations (excluding
the median itself if n is odd).
 Their median is the third quartile, Q3.
59
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Example: Cereal Sodium Data
Consider the sodium values for the 20 breakfast cereals. What are the
quartiles for the 20 cereal sodium values? From Table 2.3, the sodium values, in
ascending order, are:
The median of the 20 values is the average of the 10th and 11th
observations, 180 and 180, which is Q2 = 180 mg.
The first quartile Q1 is the median of the 10 smallest observations (in the
top row), which is the average of 130 and 140, Q1 = 135mg.
The third quartile Q3 is the median of the 10 largest observations (in the
bottom row), which is the average of 200 and 210, Q3 = 205 mg.
60
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
The Interquartile Range (IQR)
The interquartile range is the distance between the third
quartile and first quartile:
IQR = Q3  Q1
IQR gives spread of middle 50% of the data
In Words: If the interquartile range of U.S. music teacher salaries equals $16,000, this
means that for the middle 50% of the distribution of salaries, $16,000 is the distance
between the largest and smallest salaries.
61
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Detecting Potential Outliers
Examining the data for unusual observations, such as
outliers, is important in any statistical analysis. Is there a
formula for flagging an observation as potentially being an
outlier?
The 1.5 x IQR Criterion for Identifying Potential Outliers
An observation is a potential outlier if it falls more than
1.5 x IQR below the first quartile or more than 1.5 x IQR
above the third quartile.
62
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
5-Number Summary of Positions
The 5-number summary is the basis of a graphical display called
the box plot, and consists of
63

Minimum value

First Quartile

Median

Third Quartile

Maximum value
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
SUMMARY: Constructing a Box Plot
 A box goes from Q1 to Q3.
 A line is drawn inside the box at the median.
 A line goes from the lower end of the box to the
smallest observation that is not a potential outlier and
from the upper end of the box to the largest
observation that is not a potential outlier.
 The potential outliers are shown separately.
64
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Example: Boxplot for Cereal Sodium Data
Figure 2.15 shows a box plot for the sodium values. Labels
are also given for the five-number summary of positions.
Figure 2.15 Box Plot and Five-Number Summary for 20 Breakfast Cereal Sodium Values.
The central box contains the middle 50% of the data. The line in the box marks the median.
Whiskers extend from the box to the smallest and largest observations, which are not identified
as potential outliers. Potential outliers are marked separately. Question: Why is the left whisker
drawn down only to 50 rather than to 0?
65
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Comparing Distributions
A box plot does not portray certain features of a distribution, such as
distinct mounds and possible gaps, as clearly as does a histogram.
Box plots are useful for identifying potential outliers.
Figure 2.16 Box Plots of Male and Female College Student Heights. The box plots use the
same scale for height. Question: What are approximate values of the quartiles for the two
groups?
66
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Z-Score
The z-score also identifies position and potential outliers.
The z-score for an observation is the number of standard
deviations that it falls from the mean. A positive z-score
indicates the observation is above the mean. A negative zscore indicates the observation is below the mean. For sample
data, the z -score is calculated as:
observation - mean
z
standarddeviation
An observation from a bell-shaped distribution is a potential
outlier if its z-score < -3 or > +3 (3 standard deviation
criterion) .
67
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Chapter 2
Exploring Data with Graphs
and Numerical Summaries
Section 2.6
Recognizing and Avoiding Misuses of
Graphical Summaries
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Summary: Guidelines for Constructing
Effective Graphs
 Label both axes and provide proper headings.
 To better compare relative size, the vertical axis should start
at 0.
 Be cautious in using anything other than bars, lines, or points.
 It can be difficult to portray more than one group on a single
graph when the variable values differ greatly. Consider
instead using separate graphs, or plotting relative sizes such
as ratios or percentages.
69
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Example: Heading in Different Directions
Figure 2.18 An Example of a Poor Graph. Question: What’s misleading about the
way the data are presented?
70
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.
Example: Heading in Different Directions
Figure 2.19 A Better Graph for the Data in Figure 2.18. Question: What trends do
you see in the enrollments from 1996 to 2004?
71
Copyright © 2013, 2009, and 2007, Pearson Education, Inc.