Chapter 5: Exploring Data: Distributions
For All Practical Purposes: Mathematical Literacy in Today’s World, 8th ed.
© 2009, W.H. Freeman and Company

Lesson Plan
• Exploring Data
• Displaying Distributions: Histograms
• Interpreting Histograms
• Displaying Distributions: Stemplots
• Describing Center: Mean and Median
• Describing Spread: The Quartiles
• The Five-Number Summary and Boxplots
• Describing Spread: The Standard Deviation
• Normal Distributions
• The 68-95-99.7 Rule

Exploring Data
• Statistics is the science of collecting, organizing, and interpreting data.
• Data
  - Numerical facts that are essential for making decisions in almost every area of life and work.
  - Spreadsheet programs are used to organize data by rows and columns.
• Exploratory data analysis
  1. Examine each variable by itself, and then the relationships among them.
  2. Begin with a graph or graphs, then add numerical summaries of specific aspects of the data.

Individual – The objects described by a set of data. Individuals may be people, but they may also be animals or things.
Variable – Any characteristic of an individual. A variable can take different values for different individuals.

Displaying Distributions: Histograms
• Distribution – The pattern of outcomes of a variable; it tells us what values the variable takes and how often it takes these values.
• Histogram
  - The graph of the distribution of outcomes (often divided into classes) for a single variable.
• Steps in Making a Histogram
  1. Choose the classes by dividing the range of the data into classes of equal width (each individual fits into exactly one class).
  2. Count the individuals in each class (this count is the height of the bar).
  3. Draw the histogram:
     - The horizontal axis is marked off into equal class widths.
     - The vertical axis contains the scale of counts (frequency of occurrences) for each class.

[Figure: Histogram of the percent of Hispanics among the adult residents of the states]
Example
Construct a histogram given the following data.

Value    Count
12       3
14       2
16       5
18       4
20       2
Example
Given the following 18 quiz scores (out of 30 points),
construct a histogram.
12 16 13 9 28 10 22 25 29
20 24 27 28 25 24 26 19 30
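
The three steps for making a histogram can be sketched in Python for these quiz scores. This is only an illustration under stated assumptions, not the slide's own solution: the helper name `class_counts`, the class width of 5, and the starting point of 5 are choices made here so that every score falls into exactly one class.

```python
# Hypothetical helper: count the observations in each equal-width class.
def class_counts(data, start, width):
    """Return {(low, high): count} for classes low <= x < high."""
    counts = {}
    low = start
    # Step 1: choose classes of equal width that cover the data.
    while low <= max(data):
        counts[(low, low + width)] = 0
        low += width
    # Step 2: count the individuals in each class.
    for x in data:
        k = (x - start) // width
        low = start + k * width
        counts[(low, low + width)] += 1
    return counts

scores = [12, 16, 13, 9, 28, 10, 22, 25, 29,
          20, 24, 27, 28, 25, 24, 26, 19, 30]

counts = class_counts(scores, start=5, width=5)

# Step 3: draw a sideways text histogram, one '*' per individual.
for (low, high), count in counts.items():
    print(f"{low:2d}-{high - 1:2d} | {'*' * count}")
```

A different class width would give a different picture of the same data, which is why the choice of classes is listed as the first step.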

Interpreting Histograms
• Examining a Distribution
  - Overall Pattern: What does the histogram look like?
    - Shape
      - Single peak (either a symmetric or a skewed distribution)
        Symmetric – The right and left sides are mirror images.
        Skewed to the right – The right side extends much farther out.
        Skewed to the left – The left side extends much farther out.
      - Irregular – The data may appear clustered and may not show a single peak (often because more than one kind of individual is being graphed).
    - Center – Estimated center or midpoint of the data.
    - Spread – The range of data outcomes (minimum to maximum).
  - Deviation: Are there any striking differences from the pattern?
    - Outlier – An individual value that clearly falls outside the overall pattern; it may be an error or may have some logical explanation.
• Examples of Distribution Patterns and Deviations
  - Regular Single-Peak Distributions
    [Figure: Histogram of Iowa Test of Basic Skills vocabulary scores for 947 seventh-grade students. Single peak, symmetric.]
    [Figure: Histogram of the percent of Hispanics among the adult residents of the states. Single peak, skewed to the right, with an outlier.]
  - Irregular Clustered Distributions
    [Figure: Histogram of the tuition and fees charged by four-year colleges in Massachusetts. Two separate distributions, because two kinds of individuals (state and private schools) are graphed.]
Example
Given the following data regarding exam scores, construct a histogram. Describe its overall shape and identify any outliers.

Class    Count
0–9      1
10–19    0
20–29    2
30–39    3
40–49    4
50–59    6
60–69    7
70–79    7
80–89    2
90–99    1

Solution
The shape appears to be skewed to the left. The score in the class 0–9 could be considered an outlier.

Displaying Distributions: Stemplots
• Stemplot
  - A display of the distribution of a variable that attaches the final digit of each observation as a leaf on a stem made up of all but the final digit. Stemplots are usually used for small sets of data only, and they look like histograms turned on their side.
• How to Make a Stemplot
  1. Separate each observation into a stem (all but the final, rightmost digit) and a leaf (the final, rightmost digit).
  2. Write the stems in a vertical column, smallest at the top, sequentially down to the largest value. Draw a vertical line to the right of this column.
  3. Write each leaf in the row to the right of its stem, in increasing order out from the stem.

[Figure: Stemplot of the percent of Hispanics among the adult residents of the states]
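
The three stemplot steps can be sketched in Python. This is a minimal illustration rather than a prescribed method: the function name `stemplot` is hypothetical, it assumes non-negative whole-number observations, and the data are the 18 raw quiz scores (out of 30) from the earlier histogram example.

```python
from collections import defaultdict

def stemplot(values):
    """Return the rows of a stemplot for non-negative whole numbers."""
    leaves = defaultdict(list)
    for v in sorted(values):
        # Step 1: the stem is all but the final digit, the leaf is the final digit.
        stem, leaf = divmod(v, 10)
        leaves[stem].append(leaf)
    rows = []
    # Step 2: list every stem from smallest to largest, including empty ones.
    for stem in range(min(leaves), max(leaves) + 1):
        # Step 3: leaves are already in increasing order because the data were sorted.
        row = " ".join(str(leaf) for leaf in leaves.get(stem, []))
        rows.append(f"{stem:2d} | {row}".rstrip())
    return rows

scores = [12, 16, 13, 9, 28, 10, 22, 25, 29,
          20, 24, 27, 28, 25, 24, 26, 19, 30]

for row in stemplot(scores):
    print(row)
```

Read sideways, the printed rows have the same shape as a histogram with classes 0–9, 10–19, 20–29, and 30–39.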
Example
Recall the 18 quiz scores (out of 30 points) stated earlier. Each score has been converted to a percentage (rounded to the nearest tenth of a percent). Construct a stemplot.

40.0% 53.3% 43.3% 30.0% 93.3% 33.3% 73.3% 83.3% 96.7%
66.7% 80.0% 90.0% 93.3% 83.3% 80.0% 86.7% 63.3% 100.0%

Solution
Given the format of the converted scores, we first round each one to the nearest whole percent. A stemplot with the tenths digit as the leaf would not be meaningful.

40% 53% 43% 30% 93% 33% 73% 83% 97%
67% 80% 90% 93% 83% 80% 87% 63% 100%

In the stemplot, the ones digit is the leaf.

 3 | 0 3
 4 | 0 3
 5 | 3
 6 | 3 7
 7 | 3
 8 | 0 0 3 3 7
 9 | 0 3 3 7
10 | 0

Describing Center: Mean and Median
• Two Most Common Ways to Describe the Center: Mean and Median
• Mean (“average value”)
  - The ordinary arithmetic average of a set of observations.
  - To find the mean of a set of observations, add their values $x_1, x_2, x_3, \ldots, x_n$ and divide by the number of observations, $n$:
    $\bar{x} = \dfrac{x_1 + x_2 + \cdots + x_n}{n}$
• Median (“middle value”)
  - The midpoint of an ordered list of observations; half fall below the median and half fall above.
  - Arrange the observations in increasing order (smallest to largest).
  - If the number of observations is odd, the median M is the center observation in the ordered list.
  - If the number of observations is even, the median M is the average of the two center observations in the ordered list.
• Finding the Mean and Median
  - Mean (average value), $\bar{x}$ (“x-bar”): $\bar{x} = (x_1 + x_2 + \cdots + x_n)/n$
    The mean city mileage for the 13 cars in Table 5.2:
    $\bar{x} = \dfrac{16 + 15 + 22 + 21 + 24 + 19 + 20 + 20 + 21 + 27 + 18 + 21 + 48}{13} = \dfrac{292}{13} \approx 22.5$ mpg
  - Median (middle value), M
    Arrange the observations in order, then choose the middle value:
    15 16 18 19 20 20 21 21 21 22 24 27 48
    The median city mileage for the 13 cars in Table 5.2:
    For 13 cars (odd): (n + 1)/2 = (13 + 1)/2 = 7
    The 7th observation in the ordered list is 21, the median.
    Note: If the Toyota Prius is removed, there are 12 observations (even): (n + 1)/2 = (12 + 1)/2 = 6.5
    Median = average of the 6th and 7th values = (20 + 21)/2 = 20.5
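
These calculations can be checked with Python's standard `statistics` module. The sketch below uses the 13 city mileages listed above. It assumes, since the table itself is not reproduced here, that the Toyota Prius is the car with the highest city mileage (48 mpg).

```python
from statistics import mean, median

city_mpg = [16, 15, 22, 21, 24, 19, 20, 20, 21, 27, 18, 21, 48]

# Mean: add the values and divide by the number of observations.
print(round(mean(city_mpg), 1))

# Median with n = 13 (odd): the 7th value in the ordered list.
print(median(city_mpg))

# Assumption: the removed car is the one with the highest city mileage.
without_top = sorted(city_mpg)[:-1]

# Median with n = 12 (even): the average of the 6th and 7th values.
print(median(without_top))
```

Dropping the extreme value changes the median only slightly (from 21 to 20.5), while the mean is pulled noticeably by that one observation.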
Example
Calculate the mean of each of the following data sets.
a) 13, 6, 8, 12, 15, 14, 26, 12, 10, 11
b) 20, 61, 3, 2, 4, 5, 10, 7, 2

Solution
a) The mean is 127/10 = 12.7.
b) The mean is 114/9 ≈ 12.7.
Example
Calculate the median of each data set.
a) 13, 6, 8, 12, 15, 14, 26, 12, 10, 11
b) 20, 61, 3, 2, 4, 5, 10, 7, 2

Solution
For each of the data sets, the first step is to place the data in order from smallest to largest.
a) 6, 8, 10, 11, 12, 12, 13, 14, 15, 26
Since there are 10 pieces of data, the median is the mean of the 10/2 = 5th and 6th pieces of data. Thus, the median is (12 + 12)/2 = 12.
Notice that if you use the general formula (n + 1)/2, you would be looking for a value (10 + 1)/2 = 11/2 = 5.5 “observations” from the bottom, that is, halfway between the 5th and 6th observations. Since the 5th and 6th observations are the same, we didn’t really need to compute their average.
b) 2, 2, 3, 4, 5, 7, 10, 20, 61
Since there are 9 pieces of data, the (9 + 1)/2 = 10/2 = 5th piece of data, namely 5, is the median.

Example
Given the following stemplot, determine the median.

11 | 0 2 9
12 | 3 4 7 8
13 | 0 3 4 6 7 9
14 | 0 1 2 3 5 9
15 | 0 1 3 5 9
16 | 0 9
17 | 1
18 | 0

Solution
Since there are 28 pieces of data, the median is the mean of the 28/2 = 14th and 15th pieces of data. Thus, the median is (140 + 141)/2 = 281/2 = 140.5. Notice that if you use the general formula (n + 1)/2, you would be looking for the value (28 + 1)/2 = 29/2 = 14.5 “observations” from the bottom (or top).

Describing Center: Mode
• Mode: the most frequent value
  - Since 21 appears 3 times in the list of city mileages and no other mileage appears more than twice, the mode of the data set is 21.
  - If there is a tie for the most occurrences in a data set, then there are multiple modes.
  - Example: For highway mileage in the table, 27, 29, 30, and 33 all appear twice and no mileage appears more than twice. Hence there are several modes.
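
As a quick check, `statistics.multimode` from the Python standard library returns every value that ties for the most occurrences. The sketch below applies it to the city mileages listed earlier. The highway mileages are not reproduced in these slides, so the second list is a small made-up data set used only to show what a tie looks like.

```python
from statistics import multimode

city_mpg = [16, 15, 22, 21, 24, 19, 20, 20, 21, 27, 18, 21, 48]

# 21 appears three times; no other value appears more than twice.
city_modes = multimode(city_mpg)
print(city_modes)

# Made-up illustration of a tie (not the highway data from the table).
tie_example = [1, 1, 2, 2, 3]
tie_modes = multimode(tie_example)
print(tie_modes)
```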

Describing Spread: The Quartiles
• Include Spread and Center to Better Describe a Distribution
  - Range – Measures the spread of the set of observations: subtract the smallest observation from the largest observation.
  - Quartiles – The median together with the midpoints of the bottom and top halves of the data.
• Calculating the Quartiles
  1. Arrange the observations in increasing order and locate the median M in the ordered list of observations.
     - If n is even, split the list in half and use all the numbers.
     - If n is odd, circle the median and do not use it in finding the quartiles.
  2. The first quartile, Q1, is the median of the observations whose position in the ordered list is to the left of the overall median (midpoint of the lower half).
  3. The third quartile, Q3, is the median of the observations whose position in the ordered list is to the right of the overall median (midpoint of the upper half).
  - The first quartile, Q1, is larger than about 25% of the observations.
  - The second quartile, Q2, is the median, and is larger than about 50% of the observations.
  - The third quartile, Q3, is larger than about 75% of the observations.
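
The quartile rule above can be written as a short Python function. This is a sketch of that specific rule (drop the median from both halves when n is odd). The name `quartiles` is hypothetical, and other software, including `statistics.quantiles`, uses slightly different conventions and can give slightly different answers.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, M, Q3) using the split-at-the-median rule."""
    ordered = sorted(data)           # step 1: arrange in increasing order
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]           # observations to the left of the median
    upper = ordered[half + n % 2:]   # skips the median itself when n is odd
    return median(lower), median(ordered), median(upper)

# City mileages for the 13 cars listed earlier.
city_mpg = [16, 15, 22, 21, 24, 19, 20, 20, 21, 27, 18, 21, 48]

q1, m, q3 = quartiles(city_mpg)
print(q1, m, q3)
```

For the city mileages, n = 13 is odd, so the median (21) is set aside and each half contains six observations.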
Example
Find the range for the following data set:
12, 14, 14, 14, 16, 16, 18.
Solution:
18 − 12 = 6, so the range of this data set is 6.

The Five-Number Summary and Boxplots
• The Five-Number Summary
  - A summary of a distribution that gives the median, the first and third quartiles, and the largest and smallest observations.
  - These five numbers offer a reasonably complete description of center and spread.
  - In symbols, the five-number summary is:
    Minimum   Q1   M   Q3   Maximum
• Examples
  Five-number summary for the fuel economies in Table 5.2:
  For the city mileage:      15   18.5   21   23   48
  For the highway mileage:   23   27     29   32   45
• Boxplots
  - A boxplot is a graph of the five-number summary.
  - Boxplots are often used for side-by-side comparison of two or more distributions (they show less detail than histograms or stemplots).
  - A box spans the quartiles, with an interior line marking the median.
  - Lines extend out from this box to the extreme high and low observations (maximum and minimum).
  - A boxplot may be drawn vertically or horizontally.

[Figure: Boxplots of the highway and city gas mileages for cars classified as midsized by the Environmental Protection Agency]

Example
Draw a boxplot for the following data set: 21, 25, 10, 40, 43, 19, 12

Solution
The first step is to place the data in order from smallest to largest:
10, 12, 19, 21, 25, 40, 43
Since there are 7 pieces of data, the median is the (7 + 1)/2 = 8/2 = 4th piece of data, namely 21.
There are 3 pieces of data below the median, M. Thus, the (3 + 1)/2 = 4/2 = 2nd piece of data is the first quartile, so Q1 = 12. Since there are also 3 pieces of data above M, Q3 is the 2nd piece of data to the right of M. Thus, Q3 = 40. The smallest piece of data is 10 and the largest is 43. Thus, the five-number summary is 10, 12, 21, 40, 43.

So the boxplot marks the following values:
Max = 43
Q3 = 40
Q2 (median) = 21
Q1 = 12
Min = 10
Example
Draw a boxplot for the following data set:
31, 16, 11, 18, 10, 9, 12, 15, 15, 17, 20, 25

Solution
The first step is to place the data in order from smallest to largest:
9, 10, 11, 12, 15, 15, 16, 17, 18, 20, 25, 31
Since there are 12 pieces of data, the median is between the 6th and 7th pieces of data:
9, 10, 11, 12, 15, 15, (median) 16, 17, 18, 20, 25, 31
Thus the median, M, is (15 + 16)/2 = 31/2 = 15.5.
There are 6 pieces of data below M. Since (6 + 1)/2 = 7/2 = 3.5, Q1 is the mean of the 3rd and 4th pieces of data, namely (11 + 12)/2 = 23/2 = 11.5.
Since there are also 6 pieces of data above M, Q3 is the mean of the 3rd and 4th pieces of data to the right of M. Thus, Q3 = (18 + 20)/2 = 38/2 = 19.
9, 10, 11, (Q1 = 11.5), 12, 15, 15, (median = 15.5), 16, 17, 18, (Q3 = 19), 20, 25, 31
The smallest piece of data is 9, and the largest is 31. Thus, the five-number summary is:
9, 11.5, 15.5, 19, 31.

[Figure: boxplot drawn from this five-number summary]

Describing Spread: The Standard Deviation
• Standard Deviation s
  - The “standard” or average amount by which the observed data values deviate from the mean.
  - Calculated by taking the square root of the mean of the squared deviations, except dividing by n − 1 instead of n.
  - The standard deviation of n observations $x_1, x_2, x_3, \ldots, x_n$ is
    $s = \sqrt{\dfrac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + (x_3 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}}$
• Standard deviation example
  - 7 purchase prices for the Radiohead “In Rainbows” download: 3 4 5 7 10 12 15 (in dollars)
  The mean is
    $\bar{x} = \dfrac{3 + 4 + 5 + 7 + 10 + 12 + 15}{7} = \dfrac{56}{7} = 8$ dollars
  The standard deviation is
    $s = \sqrt{\dfrac{(3-8)^2 + (4-8)^2 + (5-8)^2 + (7-8)^2 + (10-8)^2 + (12-8)^2 + (15-8)^2}{7 - 1}}$
    $= \sqrt{\dfrac{(-5)^2 + (-4)^2 + (-3)^2 + (-1)^2 + 2^2 + 4^2 + 7^2}{6}}$
    $= \sqrt{\dfrac{25 + 16 + 9 + 1 + 4 + 16 + 49}{6}} = \sqrt{\dfrac{120}{6}} = \sqrt{20} \approx 4.47$ dollars
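
The same calculation can be followed step by step in Python. The sketch below applies the n − 1 formula directly to the seven prices and then compares the result with `statistics.stdev`, which also divides by n − 1. The variable names are illustrative only.

```python
from math import sqrt
from statistics import stdev

prices = [3, 4, 5, 7, 10, 12, 15]   # dollars

n = len(prices)
x_bar = sum(prices) / n                                # the mean
squared_deviations = [(x - x_bar) ** 2 for x in prices]
s = sqrt(sum(squared_deviations) / (n - 1))            # divide by n - 1, not n

print(x_bar, sum(squared_deviations), round(s, 2))
print(round(stdev(prices), 2))                         # library result for comparison
```

Because s is a square root of squared dollar amounts, it comes out in dollars, the same units as the original observations.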
• Properties of the standard deviation s:
  - s measures spread about the mean.
  - s = 0 only when there is no spread; otherwise s > 0.
  - s has the same units of measurement as the original observations.
  - s is sensitive to extreme observations or outliers.
• Choosing a Summary
  - The five-number summary is usually better than the mean and standard deviation for describing a skewed distribution or a distribution with outliers. Use the mean and standard deviation only for reasonably symmetric distributions with no outliers.
• Many calculators and computer programs can easily calculate the standard deviation.

Normal Distributions
• Normal Distributions
  - When the overall pattern of a large number of observations is very regular, we can describe it with a smooth curve.
  - Normal distributions are a family of distributions that describe how often a variable takes its values by areas under a curve.
  - Normal curves are symmetric, bell-shaped, smoothed-out histograms.
  - The total area under the Normal curve is exactly 1 (specific areas under the curve are proportions of the observations).

[Figure: Histogram of the vocabulary scores of all seventh-grade students. The smooth curve shows the overall shape of the distribution.]
• Standard Deviation of a Normal Curve
  - The shape of a Normal distribution is completely described by two numbers, its mean and its standard deviation.
  - The mean is at the center of symmetry of the Normal curve.
  - The standard deviation is the distance from the center to the change-of-curvature points on either side.
• Calculating Quartiles
  - The first quartile of any Normal distribution is located 0.67 standard deviation below the mean.
    Q1 = Mean − (0.67)(Stand. dev.)
  - The third quartile is 0.67 standard deviation above the mean.
    Q3 = Mean + (0.67)(Stand. dev.)
  Example: Mean = 64.5, Stand. dev. = 2.5
  Q3 = 64.5 + 0.67(2.5) ≈ 64.5 + 1.7 = 66.2
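
The 0.67 multiplier is a rounded value. As a rough check, Python's `statistics.NormalDist` can locate the point below which 75% of a Normal distribution lies. The sketch below does this for the standard Normal curve and for the example with mean 64.5 and standard deviation 2.5. The variable names are illustrative.

```python
from statistics import NormalDist

# Point on the standard Normal curve with 75% of the area to its left.
z = NormalDist().inv_cdf(0.75)
print(round(z, 4))

# Quartiles for the example: mean 64.5, standard deviation 2.5.
dist = NormalDist(mu=64.5, sigma=2.5)
q1 = dist.inv_cdf(0.25)
q3 = dist.inv_cdf(0.75)
print(round(q1, 1), round(q3, 1))
```

Rounded to one decimal place, the third quartile agrees with the 66.2 obtained from the 0.67 rule.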
Example
The scores on a marketing exam were normally distributed with a mean of 68 and a standard deviation of 4.5.
a) Find the first and third quartiles for the exam scores.
b) Find a range containing exactly 50% of the students’ scores.

Solution
a) The quartiles are:
μ + 0.67σ = 68 + 0.67(4.5) ≈ 68 + 3 = 71
μ − 0.67σ = 68 − 0.67(4.5) ≈ 68 − 3 = 65
So Q1 = 65 and Q3 = 71.
b) Since 25% of the data lie below the first quartile and 25% of the data lie above the third quartile, 50% of the data fall between the first and third quartiles. The interval is therefore [65, 71], so 50% of the students scored between 65 and 71 on the marketing exam.

The 68-95-99.7 Rule
• Normal Distributions: the 68-95-99.7 Rule
  - 68% of the observations fall within 1 standard deviation of the mean.
  - 95% of the observations fall within 2 standard deviations of the mean.
  - 99.7% of the observations fall within 3 standard deviations of the mean.
• Example
  - SAT scores are close to a Normal distribution, with mean = 500 and standard deviation = 100.
  - What percent of scores are above 700?
    Answer: A score of 700 is 2 standard deviations above the mean. Since 95% of the data lie between −2 and +2 standard deviations, the remaining 5% is split equally between the two tails, so scores above 700 are in the top 2.5%.

[Figure: SAT scores have a Normal distribution]
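
The percentages in the rule are rounded. A short check with Python's `statistics.NormalDist` shows the areas they approximate, and repeats the SAT question under the same assumptions (mean 500, standard deviation 100). The variable names are illustrative.

```python
from statistics import NormalDist

standard = NormalDist()   # mean 0, standard deviation 1

# Area within k standard deviations of the mean, for k = 1, 2, 3.
within = {k: standard.cdf(k) - standard.cdf(-k) for k in (1, 2, 3)}
for k, area in within.items():
    print(k, round(100 * area, 2))

# SAT example: proportion of scores above 700.
sat = NormalDist(mu=500, sigma=100)
above_700 = 1 - sat.cdf(700)
print(round(100 * above_700, 2))
```

The computed tail area is a little under 2.5% because the area within 2 standard deviations is slightly more than 95%. The rule trades that small difference for numbers that are easy to remember.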
Example
The scores on a marketing exam were normally distributed with a mean of 71.3 and a standard deviation of 5.5.
a) Almost all (99.7%) scores fall within what range?
b) What percent of scores are more than 82?
c) What percent of scores fall in the interval [66, 82]?

Solution
a) Since 99.7% of all scores fall within 3 standard deviations of the mean, we find the following.
μ ± 3σ = 71.3 ± 3(5.5) = 71.3 ± 16.5
71.3 − 16.5 = 54.8 and 71.3 + 16.5 = 87.8
Thus, the range of scores is 54.8 to 87.8. If scores on the exam are understood to be whole numbers, then the range of scores would be the interval [55, 87].
b) A score of 82 is about two σ above μ (μ + 2σ = 82.3), so scores above 82, such as 83 or more, are more than two σ above μ. 95% of scores are within 2σ of μ, so 5% lie farther than 2σ from μ. Half of these, or 2.5%, lie above 82.
c) A score of 66 is about one σ below μ (μ − σ = 65.8). 34% (half of 68%) of the scores fall between 66 and the mean, and 47.5% (half of 95%) fall between the mean and 82. Thus, 34% + 47.5% = 81.5% of scores fall in the interval [66, 82].