Ch2 Descriptive Analysis &
Presentation of Single-Variable Data
Black Bears
[Figure: histogram of black bear lengths; horizontal axis "Length in Inches" (30 to 80), vertical axis "Frequency" (0 to 20)]
Mean: 60.07 inches
Median: 62.50 inches
Range: 42 inches
Variance: 117.681
Standard deviation: 10.85 inches
Minimum: 36 inches
Maximum: 78 inches
First quartile: 51.63 inches
Third quartile: 67.38 inches
Count: 58 bears
Sum: 3438.1 inches
Chapter Goals
• Learn how to present and describe sets of data
• Learn measures of central tendency, measures of
dispersion (spread), measures of position, and types of
distributions
• Learn how to interpret findings so that we know what
the data is telling us about the sampled population
2.1 ~ Graphic Presentation of Data
• Use initial exploratory data-analysis techniques to
produce a pictorial representation of the data
• Resulting displays reveal patterns of behavior of
the variable being studied
• The method used is determined by the type of data and
the idea to be presented
• No single correct answer when constructing a graphic
display
Circle Graphs & Bar Graphs
Graphs that are used to summarize attribute data.
• Circle graphs (pie diagrams) show the amount of
data that belongs to each category as a proportional
part of a circle
• Bar graphs show the amount of data that belongs to
each category as proportionally sized rectangular
areas
Example
The table below lists the number of automobiles sold last
week by day for a local dealership. Describe the data
using a circle graph and a bar graph:
Day          Number Sold
Monday       15
Tuesday      23
Wednesday    35
Thursday     11
Friday       12
Saturday     42
Circle Graph Solution
[Figure: circle graph titled "Automobiles Sold Last Week", with one sector per day sized in proportion to the number sold]
Bar Graph Solution
[Figure: bar graph titled "Automobiles Sold Last Week"; horizontal axis "Day", vertical axis "Frequency"; bar heights Mon. 15, Tue. 23, Wed. 35, Thu. 11, Fri. 12, Sat. 42]
Pareto Diagram
• Pareto Diagram: A bar graph with the bars arranged
from the most numerous category to the least numerous
category. It includes a line graph displaying the
cumulative percentages and counts for the bars.
Notes:
• The Pareto diagram is often used in quality control
applications
• Used to identify the number and type of defects that
happen within a product or service
Example
The final daily inspection defect report for a cabinet
manufacturer is given in the table below:
Defect     Number
Dent       5
Stain      12
Blemish    43
Chip       25
Scratch    40
Others     10
1) Construct a Pareto diagram for this defect report. Management
has given the cabinet production line the goal of reducing their
defects by 50%.
2) What two defects should they give special attention to in working
toward this goal?
Solutions
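1) Daily Defect Inspection Report
[Figure: Pareto diagram; bars ordered Blemish, Scratch, Chip, Stain, Others, Dent; left vertical axis "Count" (0 to 140), right vertical axis "Percent" (0 to 100), with a line showing the cumulative percentage]
Defect     Count   Percent   Cum%
Blemish    43      31.9      31.9
Scratch    40      29.6      61.5
Chip       25      18.5      80.0
Stain      12      8.9       88.9
Others     10      7.4       96.3
Dent       5       3.7       100.0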
2) The production line should try to eliminate blemishes and
scratches. This would cut defects by more than 50%.
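The percentages behind a Pareto diagram can be checked with a short script. The following is a minimal sketch in Python, using the defect counts from the report above; the function name `pareto_table` is only an illustrative choice.

```python
# Defect counts from the daily inspection report.
defects = {"Dent": 5, "Stain": 12, "Blemish": 43,
           "Chip": 25, "Scratch": 40, "Others": 10}

def pareto_table(counts):
    """Return rows (category, count, percent, cumulative percent),
    ordered from the most numerous category to the least numerous."""
    total = sum(counts.values())
    rows = []
    running = 0
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    for name, count in ordered:
        running += count
        rows.append((name, count, 100 * count / total, 100 * running / total))
    return rows

for name, count, percent, cumulative in pareto_table(defects):
    print(f"{name:<8} {count:>3} {percent:5.1f} {cumulative:6.1f}")
```

The cumulative column shows directly that the two largest categories together account for more than half of all defects.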
Key Definitions
Quantitative Data: One reason for constructing a graph of
quantitative data is to examine the distribution: is the data
compact, spread out, skewed, symmetric, etc.?
Distribution: The pattern of variability displayed by
the data of a variable. The distribution displays the frequency
of each value of the variable.
Dotplot Display: Displays the data of a sample by representing
each piece of data with a dot positioned along a scale. This
scale can be either horizontal or vertical. The frequency of the
values is represented along the other scale.
Example
A random sample of the lifetime (in years) of 50 home
washing machines is given below:
2.5   16.9  4.5   0.9   1.5   17.8  8.5   8.9   2.5   6.4
14.5  0.7   7.3   1.4   12.2  3.5   2.9   4.0   3.7   6.8
7.4   4.1   0.4   3.3   0.9   4.2   3.3   4.7   18.1  2.6
4.4   7.2   6.9   7.0   0.7   1.6   2.2   9.2   5.2   15.3
4.0   10.4  12.2  4.0   4.1   1.8   21.8  18.3  3.6
The figure below is a dotplot for the 50 lifetimes:
[Figure: dotplot of the lifetimes; horizontal axis from 0.0 to 20.0 years in steps of 4.0]
Note: Notice how the data is “bunched” near the lower extreme and more
“spread out” near the higher extreme
Stem & Leaf Display
• Background:
– The stem-and-leaf display has become very
popular for summarizing numerical data
– It is a combination of graphing and sorting
– The actual data is part of the graph
– Well-suited for computers
Stem-and-Leaf Display: Pictures the data of a sample using the
actual digits that make up the data values. Each numerical data
value is divided into two parts: the leading digit(s) become the stem,
and the trailing digit(s) become the leaf. The stems are located
along the main axis, and a leaf for each piece of data is located
so as to display the distribution of the data.
Example
A city police officer, using radar, checked the speed of cars as
they were traveling down the main street in town. Construct
a stem-and-leaf plot for this data:
41 31 33 35 36 37 39 49
33 19 26 27 24 32 40
39 16 55 38 36
Solution:
All the speeds are in the 10s, 20s, 30s, 40s, and 50s. Use the first
digit of each speed as the stem and the second digit as the leaf.
Draw a vertical line and list the stems, in order to the left of the line.
Place each leaf on its stem: place the trailing digit on the right side
of the vertical line opposite its corresponding leading digit.
Example
41 31 33 35 36 37 39 49 33 19 26 27 24 32 40 39 16 55 38 36
20 Speeds
---------------------------------------
1 |
2 |
3 |
4 |
5 |
---------------------------------------
• The speeds are centered around the 30s
Note: The display could be constructed so that only five
possible values (instead of ten) could fall in each stem. What
would the stems look like? Would there be a difference in
appearance?
Example
20 Speeds
---------------------------------------
1 | 6 9
2 | 4 6 7
3 | 1 2 3 3 5 6 6 7 8 9 9
4 | 0 1 9
5 | 5
---------------------------------------
20 Speeds
---------------------------------------
(10-14) 1 |
(15-19) 1 |
(20-24) 2 |
(25-29) 2 |
(30-34) 3 |
(35-39) 3 |
(40-44) 4 |
(45-49) 4 |
(50-54) 5 |
(55-59) 5 |
---------------------------------------
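Because the display is built by sorting and splitting digits, it is easy to generate by program. The sketch below, in Python, assumes two-digit whole-number data such as the 20 speeds; `stem_and_leaf` and its `split` option are illustrative names, with `split=True` giving the five-values-per-stem layout asked about in the note above.

```python
from collections import defaultdict

speeds = [41, 31, 33, 35, 36, 37, 39, 49, 33, 19,
          26, 27, 24, 32, 40, 39, 16, 55, 38, 36]

def stem_and_leaf(values, split=False):
    """Return the display lines for two-digit integers.

    With split=True each stem appears twice: leaves 0-4 on the first
    line and leaves 5-9 on the second.
    """
    rows = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)      # leading digit, trailing digit
        half = 1 if (split and leaf >= 5) else 0
        rows[(stem, half)].append(leaf)
    halves = (0, 1) if split else (0,)
    lines = []
    for stem in range(min(values) // 10, max(values) // 10 + 1):
        for half in halves:
            leaves = " ".join(str(leaf) for leaf in rows[(stem, half)])
            lines.append(f"{stem} | {leaves}".rstrip())
    return lines

print("\n".join(stem_and_leaf(speeds)))
print()
print("\n".join(stem_and_leaf(speeds, split=True)))
```

With split stems the same data spread over ten lines, which makes the concentration in the 30s easier to see in finer detail.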
Remember!
1. It is fairly typical of many variables to display a distribution
that is concentrated (mounded) about a central value and
then in some manner be dispersed in both directions.
2. A display that indicates two “mounds” may really be two
overlapping distributions
3. A back-to-back stem-and-leaf display makes it possible to
compare two distributions graphically
4. A side-by-side dotplot is also useful for comparing two
distributions
Example
Weight of 50 College Students (lb), Back-to-Back
---------------------------------------
       Female |    | Male
---------------------------------------
            8 | 9  |
        1 8 8 | 10 |
0 2 5 5 6 8 8 | 11 |
    0 0 0 8 9 | 12 |
        2 5 7 | 13 |
            2 | 14 | 3 5 8
              | 15 | 0 4 4 5 7 8
              | 16 | 1 2 2 5 7 8
              | 17 | 0 0 6 6 7
              | 18 | 3 4 6 8
              | 19 | 0 1 5 5
              | 20 | 5
              | 21 | 5
---------------------------------------
Weight of 50 College Students (lb)
---------------------------------------
 9 | 8
10 | 1 8 8
11 | 0 2 5 5 6 8 8
12 | 0 0 0 8 9
13 | 2 5 7
14 | 2 3 5 8
15 | 0 4 4 5 7 8
16 | 1 2 2 5 7 8
17 | 0 0 6 6 7
18 | 3 4 6 8
19 | 0 1 5 5
20 | 5
21 | 5
---------------------------------------
Side-by-Side
[Figure: side-by-side dotplots of weight for Female and Male on a common horizontal axis "weight" from 100 to 225 lb]
2.2 ~ Frequency Distributions & Histograms
• Stem-and-leaf plots often present adequate
summaries, but they can get very big
• Need other techniques for summarizing data
• Frequency distributions and histograms are used to
summarize large data sets
Frequency Distributions
Frequency Distribution: A listing, often expressed in chart
form, that pairs each value of a variable with its frequency
Ungrouped Frequency Distribution: Each value of x in the
distribution stands alone
Grouped Frequency Distribution: Group the values into a set of
classes
1. A table that summarizes data by classes, or class intervals
2. In a typical grouped frequency distribution, there are usually 5-12 classes
of equal width
3. The table may contain columns for class number, class interval, tally (if
constructing by hand), frequency, relative frequency, cumulative relative
frequency, and class midpoint
4. In an ungrouped frequency distribution each class consists of a single value
Frequency Distributions
Guidelines for constructing a frequency distribution:
1. All classes should be of the same width
2. Classes should be set up so that they do not overlap and
so that each piece of data belongs to exactly one class
3. For problems in the text, 5-12 classes are most desirable.
The square root of n is a reasonable guideline for the
number of classes if n is less than 150 (e.g., for n = 100,
use about 10 classes).
Frequency Distributions
Procedure for constructing a frequency distribution:
1. Identify the high (H) and low (L) scores. Find the range.
Range = H - L
2. Select a number of classes and a class width so that the
product is a bit larger than the range
3. Pick a starting point a little smaller than L. Count from the
starting point by the width to obtain the class boundaries.
Observations that fall on class boundaries are placed into the
class interval to the right.
Example
The hemoglobin test, a blood test given to diabetics
during their periodic checkups, indicates the level of control
of blood sugar during the past two to three months. The data in the
table below was obtained for 40 different diabetics at a university
clinic that treats diabetic patients:
6.5  6.4  5.0  7.9  5.0  6.0  8.0  6.0  5.6  5.6
6.5  5.6  7.6  6.0  6.1  6.0  4.8  5.7  6.4  6.2
8.0  9.2  6.6  7.7  7.5  8.1  7.2  6.7  7.9  8.0
5.9  7.7  8.0  6.5  4.0  8.2  9.2  6.6  5.7  9.0
1) Construct a grouped frequency distribution using the classes
3.7 ~ <4.7, 4.7 ~ <5.7, 5.7 ~ <6.7, etc.
2) Which class has the highest frequency?
Solutions
1)
Class          Frequency   Relative    Cumulative       Class
Boundaries     f           Frequency   Rel. Frequency   Midpoint, x
---------------------------------------------------------------------
3.7 ~ <4.7     1           0.025       0.025            4.2
4.7 ~ <5.7     6           0.150       0.175            5.2
5.7 ~ <6.7     16          0.400       0.575            6.2
6.7 ~ <7.7     4           0.100       0.675            7.2
7.7 ~ <8.7     10          0.250       0.925            8.2
8.7 ~ <9.7     3           0.075       1.000            9.2
2) The class 5.7 ~ <6.7 has the highest frequency. The frequency
is 16 and the relative frequency is 0.40
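The grouped frequency table can be reproduced with a few lines of code. This is a minimal Python sketch under the assumption that the data are recorded to one decimal place; `grouped_frequencies` is an illustrative helper, and it follows the guideline that a value on a class boundary goes into the class to the right.

```python
hemoglobin = [6.5, 6.4, 5.0, 7.9, 5.0, 6.0, 8.0, 6.0, 5.6, 5.6,
              6.5, 5.6, 7.6, 6.0, 6.1, 6.0, 4.8, 5.7, 6.4, 6.2,
              8.0, 9.2, 6.6, 7.7, 7.5, 8.1, 7.2, 6.7, 7.9, 8.0,
              5.9, 7.7, 8.0, 6.5, 4.0, 8.2, 9.2, 6.6, 5.7, 9.0]

def grouped_frequencies(values, start, width, n_classes):
    """Count the values in each class [lower, upper).

    Values are converted to whole tenths so that floating-point rounding
    cannot push a value that sits on a boundary into the wrong class.
    """
    start_tenths = round(start * 10)
    width_tenths = round(width * 10)
    counts = [0] * n_classes
    for value in values:
        index = (round(value * 10) - start_tenths) // width_tenths
        counts[index] += 1
    return counts

counts = grouped_frequencies(hemoglobin, start=3.7, width=1.0, n_classes=6)
n = len(hemoglobin)
cumulative = 0
for i, f in enumerate(counts):
    lower = 3.7 + i
    cumulative += f
    print(f"{lower:.1f} ~ <{lower + 1:.1f}  f={f:2d}  rel={f / n:.3f}  "
          f"cum={cumulative / n:.3f}  mid={lower + 0.5:.1f}")
```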
Histogram
Histogram: A bar graph representing a frequency distribution of a
quantitative variable. A histogram is made up of the following
components:
1. A title, which identifies the population of interest
2. A vertical scale, which identifies the frequencies in the various
classes
3. A horizontal scale, which identifies the variable x. Values for the
class boundaries or class midpoints may be labeled along the
x-axis. Use whichever method of labeling the axis best presents the
variable.
Notes:
• The relative frequency is sometimes used on the vertical scale
• It is possible to create a histogram based on class midpoints
Example
Construct a histogram for the blood test results given in
the previous example.
Solution:
[Figure: histogram titled "The Hemoglobin Test"; horizontal axis "Blood Test" labeled at the class midpoints 4.2 to 9.2, vertical axis "Frequency" (0 to 15)]
Example
A recent survey of Roman Catholic nuns
summarized their ages in the table below. Construct a histogram for
this age data:
Age            Frequency   Class Midpoint
------------------------------------------
20 up to 30    34          25
30 up to 40    58          35
40 up to 50    76          45
50 up to 60    187         55
60 up to 70    254         65
70 up to 80    241         75
80 up to 90    147         85
Solution
[Figure: histogram titled "Roman Catholic Nuns"; horizontal axis "Age" labeled at the class midpoints 25 to 85, vertical axis "Frequency"]
Terms Used to Describe Histograms
Symmetrical: Both sides of the distribution are identical
mirror images. There is a line of symmetry.
Uniform (Rectangular): Every value appears with equal
frequency
Skewed: One tail is stretched out longer than the other. The
direction of skewness is on the side of the longer tail. (Positively
skewed vs. negatively skewed)
J-Shaped: There is no tail on the side of the class with the highest
frequency
Bimodal: The two largest classes are separated by one or more
classes. Often implies two populations are sampled.
Normal: A symmetrical distribution is mounded about the mean and
becomes sparse at the extremes
Important Reminders
• The mode is the value that occurs with greatest frequency
(discussed in Section 2.3)
• The modal class is the class with the greatest frequency
• A bimodal distribution has two high-frequency classes
separated by classes with lower frequencies
• Graphical representations of data should include a
descriptive, meaningful title and proper identification of
the vertical and horizontal scales
Cumulative Frequency Distribution
Cumulative Frequency Distribution: A frequency
distribution that pairs cumulative frequencies with values
of the variable
• The cumulative frequency for any given class is the sum
of the frequency for that class and the frequencies of all
classes of smaller values
• The cumulative relative frequency for any given class is
the sum of the relative frequency for that class and the
relative frequencies of all classes of smaller values
Example
A computer science aptitude test was given to 50
students. The table below summarizes the data:
Class                     Relative    Cumulative   Cumulative
Boundaries    Frequency   Frequency   Frequency    Rel. Frequency
------------------------------------------------------------------
0 up to 4     4           0.08        4            0.08
4 up to 8     8           0.16        12           0.24
8 up to 12    8           0.16        20           0.40
12 up to 16   20          0.40        40           0.80
16 up to 20   6           0.12        46           0.92
20 up to 24   3           0.06        49           0.98
24 up to 28   1           0.02        50           1.00
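The cumulative columns are running totals of the frequency column, so they can be generated mechanically. A minimal Python sketch using the aptitude-test frequencies (variable names are illustrative):

```python
from itertools import accumulate

# Upper class boundaries and class frequencies from the aptitude-test table.
upper_boundaries = [4, 8, 12, 16, 20, 24, 28]
frequencies = [4, 8, 8, 20, 6, 3, 1]

n = sum(frequencies)
cumulative = list(accumulate(frequencies))          # running totals
cumulative_relative = [c / n for c in cumulative]   # running totals as proportions

# Pair each upper class boundary with its cumulative relative frequency,
# starting from zero at the lower boundary of the first class.
points = [(0, 0.0)] + list(zip(upper_boundaries, cumulative_relative))

for boundary, proportion in points:
    print(f"{boundary:>2}  {proportion:.2f}")
```

Each pair gives the proportion of scores below that boundary, for example 0.80 of the scores are below 16.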
Ogive
Ogive: A line graph of a cumulative frequency or cumulative relative
frequency distribution. An ogive has the following components:
1. A title, which identifies the population or sample
2. A vertical scale, which identifies either the cumulative frequencies
or the cumulative relative frequencies
3. A horizontal scale, which identifies the upper class boundaries.
Until the upper boundary of a class has been reached, you cannot
be sure you have accumulated all the data in the class. Therefore,
the horizontal scale for an ogive is always based on the upper class
boundaries.
Note: Every ogive starts on the left with a relative frequency of zero at the lower
class boundary of the first class and ends on the right with a relative frequency
of 100% at the upper class boundary of the last class.
Example
The graph below is an ogive using cumulative relative frequencies
for the computer science aptitude data:
[Figure: ogive titled "Computer Science Aptitude Test"; horizontal axis "Test Score" (0 to 28 in steps of 4), vertical axis "Cumulative Relative Frequency" (0.0 to 1.0)]
2.3 ~ Measures of Central Tendency
• Numerical values used to locate the middle of a
set of data, or where the data is clustered
• The term average is often associated with all
measures of central tendency
Mean
Mean: The type of average with which you are probably
most familiar. The mean is the sum of all the values divided
by the total number of values, n:
$\bar{x} = \frac{1}{n}\sum x_i = \frac{1}{n}(x_1 + x_2 + \cdots + x_n)$
Notes:
• The population mean, $\mu$ (lowercase mu, Greek alphabet), is
the mean of all x values for the entire population
• We usually cannot measure $\mu$ but would like to estimate its value
• A physical representation: the mean is the value that balances
the weights on the number line
Example
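The following data represents the number of accidents in
each of the last 8 years at a dangerous intersection. Find the
mean number of accidents: 8, 9, 3, 5, 2, 6, 4, 5:
Solution:
$\bar{x} = \frac{1}{8}(8 + 9 + 3 + 5 + 2 + 6 + 4 + 5) = \frac{42}{8} = 5.25$
In the data above, change 6 to 26:
Solution:
$\bar{x} = \frac{1}{8}(8 + 9 + 3 + 5 + 2 + 26 + 4 + 5) = \frac{62}{8} = 7.75$
Note: The mean can be greatly influenced by outliers
(for example, an average salary)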
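The effect of a single outlier on the mean can be checked numerically. The sketch below uses Python's standard `statistics` module on the accident counts listed in the example, before and after changing 6 to 26; the median is included only for comparison.

```python
from statistics import mean, median

accidents = [8, 9, 3, 5, 2, 6, 4, 5]
# Same data with the value 6 replaced by the outlier 26.
with_outlier = [26 if x == 6 else x for x in accidents]

print("mean:  ", mean(accidents), "->", mean(with_outlier))
print("median:", median(accidents), "->", median(with_outlier))
```

The mean moves noticeably when one value changes, while the middle of the ranked data stays where it was.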
Median
Median: The value of the data that occupies the
middle position when the data are ranked in order according
to size
Notes:
• Denoted by "x tilde": $\tilde{x}$
• The population median, $M$ (uppercase mu, Greek alphabet), is
the data value in the middle position of the entire population
To find the median:
1. Rank the data
2. Determine the depth of the median: $d(\tilde{x}) = \frac{n + 1}{2}$
3. Determine the value of the median
Example
Find the median for the set of data:
{4, 8, 3, 8, 2, 9, 2, 11, 3}
Solution:
1. Rank the data: 2, 2, 3, 3, 4, 8, 8, 9, 11
2. Find the depth: $d(\tilde{x}) = (9 + 1)/2 = 5$
3. The median is the fifth number from either end in the ranked
data: $\tilde{x} = 4$
Suppose the data set is {4, 8, 3, 8, 2, 9, 2, 11, 3, 15}:
1. Rank the data: 2, 2, 3, 3, 4, 8, 8, 9, 11, 15
2. Find the depth: $d(\tilde{x}) = (10 + 1)/2 = 5.5$
3. The median is halfway between the fifth and sixth
observations: $\tilde{x} = (4 + 8)/2 = 6$
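The rank, depth, value procedure translates directly into code. A minimal Python sketch follows; `median_by_depth` is an illustrative name, and positions are counted from 1 as in the slides.

```python
def median_by_depth(data):
    """Find the median using the depth d = (n + 1) / 2."""
    ranked = sorted(data)                  # step 1: rank the data
    depth = (len(ranked) + 1) / 2          # step 2: depth of the median
    position = int(depth)                  # whole part of the depth
    if depth == position:                  # odd n: a single middle value
        return ranked[position - 1]
    # even n: halfway between the two middle values
    return (ranked[position - 1] + ranked[position]) / 2

print(median_by_depth([4, 8, 3, 8, 2, 9, 2, 11, 3]))
print(median_by_depth([4, 8, 3, 8, 2, 9, 2, 11, 3, 15]))
```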
Mode & Midrange
Mode: The mode is the value of x that occurs most frequently
Note: If two or more values in a sample are tied for the
highest frequency (number of occurrences), there is
no mode
Midrange: The number exactly midway between the lowest data
value L and the highest data value H. It is found by averaging the
low and the high values:
$\text{midrange} = \frac{L + H}{2}$
Example
• Consider the data set {12.7, 27.1, 35.6, 44.2, 18.0}
$\text{midrange} = \frac{12.7 + 44.2}{2} = 28.45$
Notes:
• When rounding off an answer, a common rule of thumb
is to keep one more decimal place in the answer
than was present in the original data
• To avoid round-off buildup, round off only the final answer,
not intermediate steps
2.4 ~ Measures of Dispersion
• Measures of central tendency alone cannot
completely characterize a set of data. Two very
different data sets may have similar measures of
central tendency.
• Measures of dispersion are used to describe
the spread, or variability, of a distribution
• Common measures of dispersion: range, variance,
and standard deviation
Range
Range: The difference in value between the highest-valued (H)
and the lowest-valued (L) pieces of data:
$\text{range} = H - L$
The range is easy to compute, but it ignores how the data are distributed:
• Other measures of dispersion are based on the following quantity
Deviation from the Mean: A deviation from the mean, $x - \bar{x}$,
is the difference between the value of x and the mean $\bar{x}$
Example
• Consider the sample {12, 23, 17, 15, 18}.
Find 1) the range and 2) each deviation from the mean.
Solutions:
1) $\text{range} = H - L = 23 - 12 = 11$
2) $\bar{x} = \frac{1}{5}(12 + 23 + 17 + 15 + 18) = 17$
Data x    Deviation from Mean $x - \bar{x}$
_________________________
12        -5
23        6
17        0
15        -2
18        1
Mean Absolute Deviation
Note: $\sum (x - \bar{x}) = 0$ (Always!)
Mean Absolute Deviation: The mean of the absolute
values of the deviations from the mean:
$\text{Mean absolute deviation} = \frac{1}{n}\sum |x - \bar{x}|$
For the previous example:
$\frac{1}{n}\sum |x - \bar{x}| = \frac{1}{5}(5 + 6 + 0 + 2 + 1) = \frac{14}{5} = 2.8$
Sample Variance & Standard Deviation
Sample Variance: The sample variance, $s^2$, is the mean of the
squared deviations, calculated using $n - 1$ as the divisor:
$s^2 = \frac{1}{n - 1}\sum (x - \bar{x})^2$, where n is the sample size
Note: The numerator for the sample variance is called the sum
of squares for x, denoted SS(x):
$s^2 = \frac{SS(x)}{n - 1}$, where $SS(x) = \sum (x - \bar{x})^2 = \sum x^2 - \frac{1}{n}\left(\sum x\right)^2$
Derivation:
$\sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n}(x_i^2 - 2x_i\bar{x} + \bar{x}^2)$
$= \sum_{i=1}^{n}x_i^2 - 2\bar{x}\sum_{i=1}^{n}x_i + \bar{x}^2\sum_{i=1}^{n}1$
$= \sum_{i=1}^{n}x_i^2 - 2\bar{x}(n\bar{x}) + n\bar{x}^2$
$= \sum_{i=1}^{n}x_i^2 - n\bar{x}^2$
$= \sum_{i=1}^{n}x_i^2 - n\left(\frac{\sum_{i=1}^{n}x_i}{n}\right)^2$
$= \sum_{i=1}^{n}x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n}x_i\right)^2$
Standard Deviation: The standard deviation of a sample, s, is the
positive square root of the variance:
$s = \sqrt{s^2}$
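n vs. n - 1
• The "job" of the sample variance is to estimate the population variance.
• Sometimes the sample variance is larger than $\sigma^2$ (an overestimate), and sometimes it is smaller (an underestimate).
• Only with unusual luck will the estimate be exactly right.
• We want an estimate to be "accurate", which means two things:
1. When all possible sample results are considered together, the
overestimates and underestimates cancel out, so that on average
the estimate is neither too high nor too low. This property is
called unbiasedness.
2. The results obtained from different samples (that is, the sample
variances) should not differ too much from one another.
• The sample variance divides by n - 1 because only with n - 1 as
the divisor is the estimate unbiased.
• For example, consider "the difference between the heaviest and the
lightest high school student in all of Taiwan": a range computed
from a sample is necessarily biased, and the sample variance
would have the same problem if it divided by n.
$\sigma^2 = \frac{1}{n}\sum (x - \mu)^2$ vs. $s^2 = \frac{1}{n - 1}\sum (x - \bar{x})^2$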
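The claim about unbiasedness can be checked exactly on a very small case. The sketch below, in Python, uses a hypothetical four-value population (not from the slides) and assumes samples of size 2 drawn with replacement; it lists every possible ordered sample and averages the two versions of the variance. Exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3, 4]   # hypothetical tiny population, for illustration only
n = 2                       # sample size

def sum_of_squares(values):
    """SS = sum of squared deviations from the mean of `values`."""
    m = Fraction(sum(values), len(values))
    return sum((x - m) ** 2 for x in values)

# Population variance: divide SS by the number of values in the population.
sigma2 = sum_of_squares(population) / len(population)

# Every possible ordered sample of size n, drawn with replacement.
samples = list(product(population, repeat=n))

avg_with_n_minus_1 = sum(sum_of_squares(s) / (n - 1) for s in samples) / len(samples)
avg_with_n = sum(sum_of_squares(s) / n for s in samples) / len(samples)

print("population variance:      ", sigma2)
print("average using n - 1:      ", avg_with_n_minus_1)
print("average using n:          ", avg_with_n)
```

Averaged over all possible samples, the version with divisor n - 1 matches the population variance, while the version with divisor n comes out too small.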
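Example
• Find the 1) variance and 2) standard deviation for the
data {5, 7, 1, 3, 8}:
Solutions:
$\bar{x} = \frac{1}{5}(5 + 7 + 1 + 3 + 8) = 4.8$
1) $s^2 = \frac{1}{4}\left[(0.2)^2 + (2.2)^2 + (-3.8)^2 + (-1.8)^2 + (3.2)^2\right] = \frac{32.8}{4} = 8.2$
2) $s = \sqrt{8.2} \approx 2.86$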
Notes
• The shortcut formula for the sample variance
(no need to compute the mean first):
$s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}$
• The unit of measure for the standard deviation is the
same as the unit of measure for the data
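As a check that the definition and the shortcut formula agree, both can be applied to the data {5, 7, 1, 3, 8} from the example. A minimal Python sketch (variable names are illustrative):

```python
from math import isclose, sqrt

data = [5, 7, 1, 3, 8]
n = len(data)

# Definition: sum of squared deviations from the mean.
mean = sum(data) / n
ss_definition = sum((x - mean) ** 2 for x in data)

# Shortcut: sum of x^2 minus (sum of x)^2 / n, with no mean needed.
ss_shortcut = sum(x * x for x in data) - sum(data) ** 2 / n

variance = ss_shortcut / (n - 1)
std_dev = sqrt(variance)

print(f"SS(x) by definition: {ss_definition:.2f}")
print(f"SS(x) by shortcut:   {ss_shortcut:.2f}")
print(f"s^2 = {variance:.2f}, s = {std_dev:.2f}")
```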
2.5 ~ Mean & Standard Deviation of
Frequency Distribution
• If the data is given in the form of a frequency
distribution, we need to make a few changes to the
formulas for the mean, variance, and standard
deviation
• Complete the extension table in order to find these
summary statistics
To Calculate
• In order to calculate the mean, variance, and standard
deviation for data:
1. In an ungrouped frequency distribution, use the
frequency of occurrence, f, of each observation
2. In a grouped frequency distribution, we use the
frequency of occurrence associated with each class
midpoint:
$\bar{x} = \frac{\sum xf}{\sum f}$
$s^2 = \frac{\sum x^2 f - \frac{(\sum xf)^2}{\sum f}}{\sum f - 1}$
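The extension-table sums can be computed directly from the (x, f) pairs. The sketch below is in Python and uses a hypothetical ungrouped frequency distribution invented purely for illustration (it is not the survey data from the text); it also expands the table back into raw data to confirm that the formulas give the same variance.

```python
from math import isclose, sqrt
from statistics import variance as raw_variance

# Hypothetical frequency distribution (illustration only): value x -> frequency f
distribution = {0: 3, 1: 5, 2: 8, 3: 4}

# Column totals of the extension table.
sum_f = sum(distribution.values())
sum_xf = sum(x * f for x, f in distribution.items())
sum_x2f = sum(x * x * f for x, f in distribution.items())

mean = sum_xf / sum_f
variance = (sum_x2f - sum_xf ** 2 / sum_f) / (sum_f - 1)
std_dev = sqrt(variance)

# Cross-check: list each value f times and use the raw-data variance.
raw = [x for x, f in distribution.items() for _ in range(f)]
assert isclose(variance, raw_variance(raw))

print(f"mean = {mean:.2f}, s^2 = {variance:.3f}, s = {std_dev:.3f}")
```

For a grouped distribution the same code applies with the class midpoints used as the x values.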
Example
• A survey of students in the first grade at a local school
asked for the number of brothers and/or sisters for each
child. The results are summarized in the table below.
Find 1) the mean, 2) the variance, and 3) the standard
deviation:
2.6 ~ Measures of Position
• Measures of position are used to describe the relative
location of an observation
• Quartiles and percentiles
are two of the most popular measures of position
• An additional measure of central tendency, the
midquartile, is defined using quartiles
• Quartiles are part of the 5-number summary
Quartiles
Values of the variable that divide the ranked data into
quarters; each set of data has three quartiles
1. The first quartile, Q1, is a number such that at most 25% of
the data are smaller in value than Q1 and at most 75% are
larger
2. The second quartile, Q2, is the median
3. The third quartile, Q3, is a number such that at most 75% of
the data are smaller in value than Q3 and at most 25% are
larger
Ranked data, increasing order:
L | 25% | Q1 | 25% | Q2 | 25% | Q3 | 25% | H
Percentiles
Values of the variable that divide a set of ranked data into
100 equal subsets; each set of data has 99 percentiles. The
kth percentile, Pk, is a value such that at most k% of the data
is smaller in value than Pk and at most (100 - k)% of the
data is larger.
L | at most k% | Pk | at most (100 - k)% | H
Notes:
• The 1st quartile and the 25th percentile are the same: Q1 = P25
• The median, the 2nd quartile, and the 50th percentile are
all the same: $\tilde{x}$ = Q2 = P50
Finding Pk (and Quartiles)
• Procedure for finding Pk (and quartiles):
1. Rank the n observations, lowest to highest
2. Compute A = (nk)/100 = n(k/100)
3. If A is an integer:
– d(Pk) = A + 0.5 (depth)
– Pk is halfway between the value of the data in the Ath
position and the value of the next data
If A is a fraction:
– d(Pk) = B, the next larger integer
– Pk is the value of the data in the Bth position
Example
• The following data represents the pH levels of a random sample
of swimming pools in a California town. Find: 1) the first
quartile, 2) the third quartile, and 3) the 37th percentile:
5.6  6.0  6.7  7.0
5.6  6.1  6.8  7.3
5.8  6.2  6.8  7.4
5.9  6.3  6.8  7.4
6.0  6.4  6.9  7.5
Solutions:
1) k = 25: (20)(25/100) = 5, depth = 5.5, Q1 = 6
2) k = 75: (20)(75/100) = 15, depth = 15.5, Q3 = 6.95
3) k = 37: (20)(37/100) = 7.4, depth = 8, P37 = 6.2
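The procedure for finding Pk can be written as a short function. This Python sketch follows the steps above for whole-number k from 1 to 99 and applies them to the 20 pH values; `percentile` is an illustrative name, and the rule differs from the interpolation rules used by some software, so results may not match other tools exactly.

```python
from fractions import Fraction

ph = [5.6, 6.0, 6.7, 7.0, 5.6, 6.1, 6.8, 7.3, 5.8, 6.2,
      6.8, 7.4, 5.9, 6.3, 6.8, 7.4, 6.0, 6.4, 6.9, 7.5]

def percentile(data, k):
    """kth percentile (k = 1..99) using the depth procedure."""
    ranked = sorted(data)                    # step 1: rank the observations
    a = Fraction(len(ranked) * k, 100)       # step 2: A = nk / 100, kept exact
    if a.denominator == 1:                   # A is an integer: depth is A + 0.5
        position = int(a)
        return (ranked[position - 1] + ranked[position]) / 2
    b = int(a) + 1                           # A is a fraction: next larger integer
    return ranked[b - 1]

q1 = percentile(ph, 25)
q3 = percentile(ph, 75)
p37 = percentile(ph, 37)
print(f"Q1 = {q1:.2f}, Q3 = {q3:.2f}, P37 = {p37:.2f}")
```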
Midquartile
The numerical value midway between the first and third
quartile.
$\text{midquartile} = \frac{Q_1 + Q_3}{2}$
• Example: Find the midquartile for the 20 pH values in
the previous example:
$\text{midquartile} = \frac{Q_1 + Q_3}{2} = \frac{6 + 6.95}{2} = \frac{12.95}{2} = 6.475$
Note: The mean, median, midrange, and midquartile are all measures
of central tendency. They are not necessarily equal. Can you
think of an example when they would be the same value?
5-Number Summary
The 5-number summary is composed of:
1. L, the smallest value in the data set
2. Q1, the first quartile (also P25)
3. $\tilde{x}$, the median (also P50 and 2nd quartile)
4. Q3, the third quartile (also P75)
5. H, the largest value in the data set
Notes:
• The 5-number summary indicates how much the data is spread
out in each quarter
• The interquartile range is the difference between the first and
third quartiles. It is the range of the middle 50% of the data
Box-and-Whisker Display
A graphic representation of the 5-number summary:
• The five numerical values (smallest, first quartile, median, third
quartile, and largest) are located on a scale, either vertical or
horizontal
• The box is used to depict the middle half of the data that lies
between the two quartiles
• The whiskers are line segments used to depict the other half of the
data
• One line segment represents the quarter of the data that is smaller
in value than the first quartile
• The second line segment represents the quarter of the data that is
larger in value than the third quartile
Example
• A random sample of students in a sixth grade class was
selected. Their weights are given in the table below.
Find the 5-number summary for this data and construct a
boxplot:
63   85   92   99   112
64   86   93   99
76   88   93   99
76   89   93   101
81   90   94   108
83   91   97   109
Solution:
L = 63, Q1 = 85, $\tilde{x}$ = 92, Q3 = 99, H = 112
Boxplot for Weight Data
[Figure: boxplot titled "Weights from Sixth Grade Class"; horizontal axis "Weight" (60 to 110), with L, Q1, $\tilde{x}$, Q3, and H marked]
z-Score
The position a particular value of x has relative to the
mean, measured in standard deviations. The z-score is
found by the formula:
$z = \frac{\text{value} - \text{mean}}{\text{st. dev.}} = \frac{x - \bar{x}}{s}$
Notes:
• Typically, the calculated value of z is rounded to the nearest
hundredth
• The z-score measures the number of standard deviations
above/below, or away from, the mean
• z-scores typically range from -3.00 to +3.00
• z-scores may be used to make comparisons of raw scores
Example
• A certain data set has mean 35.6 and standard deviation
7.1. Find the z-scores for 46 and 33:
Solutions:
$z = \frac{x - \bar{x}}{s} = \frac{46 - 35.6}{7.1} = 1.46$
46 is 1.46 standard deviations above the mean
$z = \frac{x - \bar{x}}{s} = \frac{33 - 35.6}{7.1} = -0.37$
33 is 0.37 standard deviations below the mean
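The z-score formula is a one-line computation. A minimal Python sketch for the two values in the example (the function name `z_score` is illustrative):

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev

for value in (46, 33):
    z = z_score(value, mean=35.6, std_dev=7.1)
    side = "above" if z > 0 else "below"
    print(f"x = {value}: z = {z:.2f} ({abs(z):.2f} standard deviations {side} the mean)")
```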
2.7 ~ Interpreting & Understanding
Standard Deviation
• Standard deviation is a measure of variability, or
spread
• Two rules for describing data rely on the standard
deviation:
– Empirical rule: applies to a variable
that is normally distributed
– Chebyshev's theorem: applies to any distribution
Empirical Rule
If a variable is normally distributed, then:
1. Approximately 68% of the observations lie within 1 standard
deviation of the mean
2. Approximately 95% of the observations lie within 2 standard
deviations of the mean
3. Approximately 99.7% of the observations lie within 3
standard deviations of the mean
Notes:
• The empirical rule is more informative than Chebyshev's theorem since
we know more about the distribution (normally distributed)
• Also applies to populations
• Can be used to determine if a distribution is normally distributed
Illustration of the Empirical Rule
[Figure: normal curve with the intervals $\bar{x} \pm s$ (68%), $\bar{x} \pm 2s$ (95%), and $\bar{x} \pm 3s$ (99.7%) marked on the horizontal axis]
Example
• A random sample of plum tomatoes was selected from a
local grocery store and their weights recorded. The mean
weight was 6.5 ounces with a standard deviation of 0.4
ounces. If the weights are normally distributed:
1) What percentage of weights fall between 5.7 and 7.3?
2) What percentage of weights fall above 7.7?
Solutions:
1) $(\bar{x} - 2s, \bar{x} + 2s) = (6.5 - 2(0.4), 6.5 + 2(0.4)) = (5.7, 7.3)$
Approximately 95% of the weights fall between 5.7 and 7.3
2) $(\bar{x} - 3s, \bar{x} + 3s) = (6.5 - 3(0.4), 6.5 + 3(0.4)) = (5.3, 7.7)$
Approximately 99.7% of the weights fall between 5.3 and 7.7
Approximately 0.3% of the weights fall outside (5.3, 7.7)
Approximately (0.3 / 2) = 0.15% of the weights fall above 7.7
A Note about the Empirical Rule
Note: The empirical rule may be used to determine whether
or not a set of data is approximately normally
distributed
1. Find the mean and standard deviation for the data
2. Compute the actual proportion of data within 1, 2, and
3 standard deviations from the mean
3. Compare these actual proportions with those given by
the empirical rule
4. If the proportions found are reasonably close to those of
the empirical rule, then the data is approximately
normally distributed
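The four steps above can be sketched in code. The Python example below applies them to the 25 sixth-grade weights from the boxplot example in Section 2.6, using the standard `statistics` module; `proportions_within` is an illustrative helper, and how close counts as "reasonably close" is left to the reader's judgment.

```python
from statistics import mean, stdev

# The 25 sixth-grade weights from the 5-number summary example.
weights = [63, 64, 76, 76, 81, 83, 85, 86, 88, 89, 90, 91, 92,
           93, 93, 93, 94, 97, 99, 99, 99, 101, 108, 109, 112]

def proportions_within(data, ks=(1, 2, 3)):
    """Proportion of the data within k sample standard deviations of the mean."""
    m = mean(data)      # step 1: mean ...
    s = stdev(data)     # ... and sample standard deviation (divisor n - 1)
    return {k: sum(m - k * s <= x <= m + k * s for x in data) / len(data)
            for k in ks}

empirical = {1: 0.68, 2: 0.95, 3: 0.997}
actual = proportions_within(weights)            # step 2: actual proportions
for k in (1, 2, 3):                             # step 3: compare with the rule
    print(f"within {k} s: actual {actual[k]:.2f}, empirical rule {empirical[k]:.3f}")
```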
Chebyshev's Theorem
The proportion of any distribution that lies within k standard
deviations of the mean is at least $1 - \frac{1}{k^2}$, where k is any
positive number larger than 1. This theorem applies to all
distributions of data.
Illustration:
[Figure: distribution with at least $1 - \frac{1}{k^2}$ of the data lying between $\bar{x} - ks$ and $\bar{x} + ks$]
Empirical Rule vs. Chebyshev's Theorem
Interval          Chebyshev's        Empirical
[-σ, +σ]          at least 0%        approximately 68%
[-2σ, +2σ]        at least 75%       approximately 95%
[-3σ, +3σ]        at least 88.89%    approximately 99.7%
[-1.5σ, +1.5σ]    at least 55.56%    approximately 86.6%
Important Reminders!
• Chebyshev's theorem is very conservative and holds for any
distribution of data
• Chebyshev's theorem also applies to any population
• The two most common values used to describe a distribution
of data are k = 2, 3
• The table below lists some values for k and $1 - \frac{1}{k^2}$:
k      $1 - \frac{1}{k^2}$
1.7    0.65
2      0.75
2.5    0.84
3      0.89
Example
• At the close of trading, a random sample of 35 technology
stocks was selected. The mean selling price was 67.75 and
the standard deviation was 12.3. Use Chebyshev's theorem
(with k = 2, 3) to describe the distribution.
Solutions:
Using k = 2: At least 75% of the observations lie within 2 standard
deviations of the mean:
$(\bar{x} - 2s, \bar{x} + 2s) = (67.75 - 2(12.3), 67.75 + 2(12.3)) = (43.15, 92.35)$
Using k = 3: At least 89% of the observations lie within 3 standard
deviations of the mean:
$(\bar{x} - 3s, \bar{x} + 3s) = (67.75 - 3(12.3), 67.75 + 3(12.3)) = (30.85, 104.65)$
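Both the bound and the interval in this kind of problem are simple to compute. The following Python sketch uses the stock-price mean and standard deviation from the example; the function names are illustrative.

```python
def chebyshev_bound(k):
    """Minimum proportion of any distribution within k standard deviations."""
    if k < 1:
        raise ValueError("the bound is only meaningful for k of at least 1")
    return 1 - 1 / k ** 2      # equals 0 at k = 1, informative only for k > 1

def interval(mean, std_dev, k):
    """Interval from k standard deviations below to k above the mean."""
    return (mean - k * std_dev, mean + k * std_dev)

for k in (2, 3):
    low, high = interval(67.75, 12.3, k)
    print(f"k = {k}: at least {chebyshev_bound(k):.0%} of prices "
          f"lie in ({low:.2f}, {high:.2f})")
```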
2.8 ~ The Art of Statistical Deception
Good Arithmetic, Bad Statistics
Misleading Graphs
Insufficient Information
Good Arithmetic, Bad Statistics
• The mean can be greatly influenced by outliers
• Example: The mean salary for all NBA players is $15.5 million
Misleading graphs:
1. The frequency scale should start at zero to present a
complete picture. Graphs that do not start at zero are used
to save space.
2. Graphs that start at zero emphasize the size of the
numbers involved
3. Graphs that are chopped off emphasize variation
[Figure: "Flight Cancellations" graph; horizontal axis "Year" (1996 to 2002), vertical axis "Number of Cancellations" starting at zero (0 to 35)]
[Figure: the same "Flight Cancellations" graph with the vertical axis chopped off to run from 27 to 35]
Insufficient Information
• Example: An admissions officer from a state school
explains that the average tuition at a nearby private
university is $13,000 and only $4,500 at his school. This
makes the state school look more attractive.
– If most students pay the full tuition, then the state
school appears to be a better choice
– However, if most students at the private university
receive substantial financial aid, then the actual
tuition cost could be much lower!
Superimpose Misrepresentation
Ithaca Times (Dec. 7, 2000)
Truncated Scale
http://cgi.usatoday.com/snapshot/money/2001-09-05-service-complaints.htm
Simple is Not Always Best
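Excel Commands
• Insert Function:
– Mean: AVERAGE(data range), e.g., AVERAGE(F1:F10)
– Median: MEDIAN(data range), e.g., MEDIAN(F1:F10)
– Mode: MODE(data range), e.g., MODE(F1:F10)
– Midrange: (MAX(data range)+MIN(data range))/2
– Range: MAX(data range)-MIN(data range)
– Sample variance: VAR(data range), e.g., VAR(F1:F10)
– Sample standard deviation: STDEV(data range), e.g., STDEV(F1:F10)
– Population variance: VARP(data range), e.g., VARP(F1:F10)
– Population standard deviation: STDEVP(data range), e.g., STDEVP(F1:F10)
– Quartiles: QUARTILE(A1:A200, Quart), where Quart = 0 to 4
selects the quartile (0 returns the minimum and 4 the maximum)
– Percentiles: PERCENTILE(A1:A200, k), where 0 ≤ k ≤ 1 gives the
100k-th percentile, e.g., k = 0.37 computes the 37th percentile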