CHAPTER 3
Data Description
OUTLINE
• 3-1 Introduction
• 3-2 Measures of Central Tendency
• 3-3 Measures of Variation
• 3-4 Measures of Position
• 3-5 Exploratory Data Analysis
OBJECTIVES
• Summarize data using the measures of
central tendency, such as the mean,
median, mode, and midrange.
• Describe data using the measures of
variation, such as the range, variance, and
standard deviation.
• Identify the position of a data value in a data
set using various measures of position, such
as percentiles, deciles and quartiles.
• Use the techniques of exploratory data
analysis, including stem and leaf plots, box
plots, and five-number summaries to discover
various aspects of data.
3-1 Introduction
• A statistic is a characteristic or measure
obtained by using the data values from a
sample.
• A parameter is a characteristic or
measure obtained by using the data
values from a specific population.
3-2 Measures of Central Tendency
• Mean
• Median
• Mode
• Midrange
3-2 The Mean (Arithmetic Average)
• The mean is defined to be the sum of the
data values divided by the total number of
values.
3-2 The Sample Mean
• The symbol $\bar{X}$ represents the sample mean and is read as “X-bar”.
The Greek symbol Σ is read as “sigma” and means “to sum”.
$$\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n} = \frac{\sum X}{n}$$
The following data represent the annual chocolate
sales (in millions of RM) for a sample of seven
states in Malaysia. Find the mean.
RM2.0, 4.9, 6.5, 2.1, 5.1, 3.2, 16.6
$$\bar{X} = \frac{\sum X}{n} = \frac{2.0 + 4.9 + 6.5 + 2.1 + 5.1 + 3.2 + 16.6}{7} = \frac{40.4}{7} \approx 5.77$$
The mean is RM 5.77 million.
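The arithmetic in this example can be checked with a few lines of Python. This is only a minimal sketch; the function name `sample_mean` is an illustrative choice and does not come from the notes.

```python
# Annual chocolate sales (millions of RM) for the seven sampled states.
sales = [2.0, 4.9, 6.5, 2.1, 5.1, 3.2, 16.6]

def sample_mean(values):
    """Sum of the data values divided by the number of values."""
    return sum(values) / len(values)

x_bar = sample_mean(sales)
print(round(x_bar, 2))  # 5.77
```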
3-2 The Population Mean
• The Greek symbol µ represents the population mean and is read as “mu”.
N is the size of the finite population.
$$\mu = \frac{X_1 + X_2 + \cdots + X_N}{N} = \frac{\sum X}{N}$$
A small company consists of the owner, the manager, the salesperson, and
two technicians. Their salaries are RM 50,000, 20,000, 12,000, 9,000 and
9,000 respectively. Assuming this is the population, find the mean.
$$\mu = \frac{\sum X}{N} = \frac{50{,}000 + 20{,}000 + 12{,}000 + 9{,}000 + 9{,}000}{5} = \frac{100{,}000}{5} = 20{,}000$$
The mean is RM 20,000.
In a random sample of 7 ponds, the number of fish in each pond was
recorded as follows. Find the mean.
23, 37, 56, 45, 36, 28, 33
3-2 The Sample Mean for an Ungrouped
Frequency Distribution
• The mean for an ungrouped frequency distribution is given by
$$\bar{X} = \frac{\sum (f \cdot X)}{n}$$
where
f = frequency of the corresponding value X
n = Σf
The scores of 25 students on a 4-point quiz are
given in the table. Find the mean score.
Score, X    Frequency, f
0           2
1           4
2           12
3           4
4           3

Solution:
Score, X    Frequency, f    f·X
0           2               0
1           4               4
2           12              24
3           4               12
4           3               12
Total       25              52

$$\bar{X} = \frac{\sum (f \cdot X)}{n} = \frac{52}{25} = 2.08$$
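The frequency-weighted mean used in this example can be sketched in Python as follows. The dictionary layout (score mapped to frequency) and the name `frequency_mean` are assumptions made for illustration.

```python
# Quiz scores from the example: score -> frequency.
quiz = {0: 2, 1: 4, 2: 12, 3: 4, 4: 3}

def frequency_mean(freq):
    """Mean of an ungrouped frequency distribution: sum(f * X) / sum(f)."""
    n = sum(freq.values())                       # n = sum of frequencies
    total = sum(f * x for x, f in freq.items())  # sum of f * X
    return total / n

quiz_mean = frequency_mean(quiz)
print(quiz_mean)  # 2.08
```

Note that the numerator multiplies each value by its own frequency before summing, which is not the same as multiplying the two column totals.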
The number of balls in each of 17 bags was counted. Find the mean.
Number of balls    Frequency
5                  5
6                  4
7                  2
8                  6
3-2 The Sample Mean for a Grouped
Frequency Distribution
• The mean for a grouped frequency distribution is given by
$$\bar{X} = \frac{\sum (f \cdot X_m)}{n}$$
where
Xm = class midpoint
   = (UCL + LCL) / 2, with UCL the upper class limit and LCL the lower class limit
The lengths of 40 bean pods are shown in the table. Find the mean.
Length (cm)    Frequency, f
4-8            2
9-13           4
14-18          7
19-23          14
24-28          8
29-33          5

Solution:
Length (cm)    Frequency, f    Midpoint, Xm    f·Xm
4-8            2               6               12
9-13           4               11              44
14-18          7               16              112
19-23          14              21              294
24-28          8               26              208
29-33          5               31              155

$$\sum (f \cdot X_m) = 12 + 44 + 112 + 294 + 208 + 155 = 825$$
$$\bar{X} = \frac{\sum (f \cdot X_m)}{n} = \frac{825}{40} \approx 20.6$$
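The grouped-mean calculation can be reproduced with a short Python sketch. The tuple layout (lower limit, upper limit, frequency) and the name `grouped_mean` are illustrative assumptions.

```python
# Bean-pod data from the example: (lower limit, upper limit, frequency).
bean_pods = [(4, 8, 2), (9, 13, 4), (14, 18, 7),
             (19, 23, 14), (24, 28, 8), (29, 33, 5)]

def grouped_mean(classes):
    """Mean of a grouped frequency distribution using class midpoints."""
    n = sum(f for _, _, f in classes)
    # The midpoint of each class is (lower limit + upper limit) / 2.
    total = sum(f * (lower + upper) / 2 for lower, upper, f in classes)
    return total / n

pod_mean = grouped_mean(bean_pods)
print(pod_mean)  # 20.625
```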
The times (in minutes) needed by a group of students to complete a game
are shown below. Find the mean.
Time (mins)    Frequency
1-3            2
4-6            4
7-9            8
10-12          5
13-15          1
Most common errors:
$$\sum (f \cdot X) \neq \sum f \cdot \sum X$$
$$\sum (f \cdot X_m) \neq \sum f \cdot \sum X_m$$
Example
Score, X    Frequency, f    f·X
0           1               0
1           3               3
2           5               10
3           2               6
4           1               4

ΣX = 10
n = Σf = 12
Σ(f·X) = 23

Correct:
Mean = Σ(f·X) / n
     = 23 / 12
     = 1.92
Incorrect:
Mean = (Σf · ΣX) / n
     = (12 x 10) / 12
     = 10
3-2 The Median
• When a data set is ordered, it is known as a
data array.
• The median is defined to be the midpoint of the
data array.
• The symbol used to denote the median is MD.
The ages of seven preschool children are
1, 3, 4, 2, 3, 5, and 1. Find the median.
1. Arrange the data in order.
2. Select the middle point.
Data array: 1, 1, 2, 3, 3, 4, 5 (the middle value is the 4th value, 3)
The median (MD) age = 3 years.
• In the previous example, there was an odd
number of values in the data set.
• In this case it is easy to select the middle number
in the data array.
• When there is an even number of values in the
data set, the median is obtained by taking the
average of the two middle numbers.
Six customers purchased these numbers of
magazines: 1, 7, 3, 2, 3, 4. Find the median.
1. Arrange the data in order.
2. Select the middle point.
Data array: 1, 2, 3, 3, 4, 7 (the two middle values are 3 and 3)
$$MD = \frac{3 + 3}{2} = 3$$
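Both cases (odd and even number of values) can be handled by one small Python function. This is a sketch of the two-step procedure described here; the name `median` is chosen for illustration.

```python
def median(values):
    """Midpoint of the data array (the data sorted in order)."""
    data = sorted(values)          # Step 1: arrange the data in order
    n = len(data)
    mid = n // 2
    if n % 2 == 1:                 # odd n: a single middle value
        return data[mid]
    # even n: average of the two middle values
    return (data[mid - 1] + data[mid]) / 2

print(median([1, 3, 4, 2, 3, 5, 1]))  # 3
print(median([1, 7, 3, 2, 3, 4]))     # 3.0
```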
3-2 The Median - Ungrouped
Frequency Distribution
• For an ungrouped frequency distribution,
find the median by examining the cumulative
frequencies to locate the middle value.
• If n is the sample size, compute n/2. Locate the
data point where n/2 values fall below and n/2
values fall above.
LRJ Appliance recorded the number of VCRs sold per week over a one-year
period.
No. Sets Sold    Frequency
1                4
2                9
3                6
4                2
5                3
Total            24

• To locate the middle point, divide n by 2; 24/2 = 12.
• Locate the point where 12 values fall below and 12 values fall above.
• Consider the cumulative distribution.
• The 12th and 13th values fall in class 2. Hence MD = 2.

No. Sets Sold    Frequency    Cumulative Frequency
1                4            4
2                9            13    (median class)
3                6            19
4                2            21
5                3            24

Class 2 contains the 5th through the 13th values.
3-2 The Median - Grouped
Frequency Distribution
• For a grouped frequency distribution, find the median by using the
formula shown below:
$$MD = \frac{(n/2) - cf}{f}\,(w) + L_m$$
where
n = sum of frequencies
cf = cumulative frequency of the class immediately preceding the median class
f = frequency of the median class
w = class width of the median class
Lm = lower class boundary of the median class
Find the median by using the following data.
Class      Frequency, f
16 - 20    3
21 - 25    5
26 - 30    4
31 - 35    3
36 - 40    2

Solution:
Class      Frequency, f    Cumulative Frequency
16 - 20    3               3
21 - 25    5               8
26 - 30    4               12
31 - 35    3               15
36 - 40    2               17
• To locate the halfway point, divide n by 2;
17/2 = 8.5 ≈ 9.
• Find the class that contains the 9th value. This
will be the median class.
• Consider the cumulative distribution.
The median class will then be 26-30.
n = 17
cf = 8
f = 4
w = 30.5 – 25.5 = 5
Lm = 25.5
$$MD = \frac{(n/2) - cf}{f}\,(w) + L_m = \frac{(17/2) - 8}{4}\,(5) + 25.5 = 26.125$$
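The grouped-median formula can be sketched in Python as below. Two assumptions are made for illustration: class limits are whole numbers with a gap of 1 between consecutive classes (so boundaries sit 0.5 beyond the limits, as in this example), and the median class is taken as the first class whose cumulative frequency reaches n/2. The name `grouped_median` is hypothetical.

```python
# Classes from the example: (lower limit, upper limit, frequency).
classes = [(16, 20, 3), (21, 25, 5), (26, 30, 4), (31, 35, 3), (36, 40, 2)]

def grouped_median(classes):
    """MD = ((n/2 - cf) / f) * w + Lm for a grouped frequency distribution."""
    n = sum(f for _, _, f in classes)
    half = n / 2
    cf = 0  # cumulative frequency of the classes before the current one
    for lower, upper, f in classes:
        if cf + f >= half:             # this is the median class
            lm = lower - 0.5           # lower class boundary
            w = (upper + 0.5) - lm     # class width from the boundaries
            return (half - cf) / f * w + lm
        cf += f
    raise ValueError("classes must contain at least one observation")

print(grouped_median(classes))  # 26.125
```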
Find the median by using the following data.
Class    Frequency, f
5-14     5
15-24    7
25-34    19
35-44    17
45-54    7
3-2 The Mode
• The mode is defined to be the value that occurs
most often in a data set.
• A data set can have more than one mode.
• A data set is said to have no mode if all values
occur with equal frequency.
Find the mode for the number of children per
family for 10 selected families.
Data set: 2, 3, 5, 2, 2, 1, 6, 4, 7, 3.
Ordered set: 1, 2, 2, 2, 3, 3, 4, 5, 6, 7.
Mode: 2.
• Six strains of bacteria were tested to see how
long they could remain alive outside their normal
environment. The time, in minutes, is given
below. Find the mode.
• Data set: 2, 3, 5, 7, 8, 10.
• There is no mode since each data value occurs
equally with a frequency of one.
• Eleven different automobiles were tested at a
speed of 15 mph for stopping distances. The
distance, in feet, is given below. Find the mode.
• Data set: 15, 18, 18, 18, 20, 22, 24, 24, 24, 26,
26.
• There are two modes (bimodal). The values
are 18 and 24.
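The three situations above (one mode, no mode, two modes) can be reproduced with a short Python sketch built on `collections.Counter`. The function name `modes` and the convention of returning an empty list for "no mode" are illustrative choices.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s); an empty list means no mode."""
    counts = Counter(values)
    top = max(counts.values())
    # No mode when several distinct values all occur equally often.
    if len(counts) > 1 and all(c == top for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == top)

print(modes([2, 3, 5, 2, 2, 1, 6, 4, 7, 3]))                    # [2]
print(modes([2, 3, 5, 7, 8, 10]))                               # []
print(modes([15, 18, 18, 18, 20, 22, 24, 24, 24, 26, 26]))      # [18, 24]
```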
3-2 The Mode – Ungrouped Frequency Distribution
Find the mode by using the following data.
Values    Frequency, f
15        3
20        5
25        8    (highest frequency)
30        3
35        2
Mode = 25
3-2 The Mode – Grouped Frequency Distribution
• The mode for grouped data is the modal class.
• The modal class is the class with the largest frequency.
Find the mode by using the following data.
Values         Frequency, f
15.5 - 20.5    3
20.5 - 25.5    5
25.5 - 30.5    7    (highest frequency)
30.5 - 35.5    3
35.5 - 40.5    2
Modal class = 25.5 - 30.5
3-2 The Midrange
• The midrange is found by adding the lowest and highest values in the
data set and dividing by 2.
• The midrange is a rough estimate of the middle value of the data.
• The symbol that is used to represent the midrange is MR.
• Last winter, the city of New York reported the following number of
water-line breaks per month: 2, 3, 6, 8, 4, 1. Find the midrange.
MR = (1 + 8)/2 = 4.5.
• Note: Extreme values influence the midrange, so it may not be a
typical description of the middle.
• Suppose the number of water-line breaks was instead as follows:
2, 3, 6, 16, 4, and 1. Find the midrange.
MR = (1 + 16)/2 = 8.5.
• 8.5 is not typical of the average monthly number of breaks, since an
excessively high number of breaks, 16, occurred in one month.
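The sensitivity of the midrange to a single extreme value is easy to see in code. The sketch below uses the two water-line data sets; the name `midrange` is an illustrative choice.

```python
def midrange(values):
    """(lowest value + highest value) / 2."""
    return (min(values) + max(values)) / 2

print(midrange([2, 3, 6, 8, 4, 1]))   # 4.5
print(midrange([2, 3, 6, 16, 4, 1]))  # 8.5
```

Changing one value from 8 to 16 moves the midrange from 4.5 to 8.5, even though the other five months are unchanged.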
3-2 Distribution Shapes
• Frequency distributions can assume many shapes.
• The three most important shapes are:
i) Positively skewed
ii) Symmetrical
iii) Negatively skewed
[Figure: positively skewed distribution] Mode < Median < Mean
[Figure: symmetrical distribution] Mean = Median = Mode
[Figure: negatively skewed distribution] Mean < Median < Mode
3-3 Measures of Variation
• Measure the variation and dispersion of the data.
• Range
• Variance
• Standard Deviation
• Coefficient of Variation
3-3 Range
• The range is defined to be the highest value minus the lowest value.
The symbol R is used for the range.
• R = highest value – lowest value.
• Extremely large or extremely small data values can drastically affect
the range.
3-3 Population Variance
General Formula
$$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$$
• The symbol for population variance is σ².
• X = Individual value
• µ = Population mean
• N = Population size
3-3 Population Standard Deviation
General Formula
$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (X - \mu)^2}{N}}$$
• The population standard deviation is the square root of the
population variance.
• Consider the following data to constitute
the population: 10, 60, 50, 30, 40, 20.
Find the mean, variance, and standard
deviation
Mean µ = (10 + 60 + 50 + 30 + 40 + 20)/6
= 210/6
= 35.
X     X – µ    (X – µ)²
10    –25      625
60    +25      625
50    +15      225
30    –5       25
40    +5       25
20    –15      225
ΣX = 210, Σ(X – µ)² = 1750
The variance σ² = 1750/6 = 291.67.
The standard deviation σ = √291.67 = 17.08
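The population variance and standard deviation from this example can be verified with a minimal Python sketch. The function name `population_variance` is illustrative; it follows the general formula with N in the denominator.

```python
import math

population = [10, 60, 50, 30, 40, 20]

def population_variance(values):
    """Sum of squared deviations from the mean, divided by N."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n

sigma_sq = population_variance(population)
sigma = math.sqrt(sigma_sq)
print(round(sigma_sq, 2), round(sigma, 2))  # 291.67 17.08
```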
A class of 6 children sat a test; the resulting marks, scored out of
10, were as follows:
4, 5, 6, 8, 4, 9
Calculate the mean, variance and standard deviation of this population.
3-3 Sample Variance
General Formula
$$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$$
• The symbol for sample variance is s².
• $\bar{X}$ = Sample mean
• n = Sample size
3-3 Sample Standard Deviation
General Formula
$$s = \sqrt{s^2} = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$$
• The sample standard deviation is the square root of the sample
variance.
Example: treat the values 35, 45, 30, 35, 40, 25 as a sample.
Mean, $\bar{X}$ = (35+45+30+35+40+25)/6
= 210/6
= 35
X     X – X̄    (X – X̄)²
35    0        0
45    +10      100
30    –5       25
35    0        0
40    +5       25
25    –10      100
ΣX = 210, Σ(X – X̄)² = 250
The variance s² = 250/(6-1) = 50
The standard deviation s = √50 = 7.07
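For comparison with the population case, the sample version divides by n - 1 instead of N. The sketch below recomputes this example; the name `sample_variance` is an illustrative choice.

```python
import math

sample = [35, 45, 30, 35, 40, 25]

def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

s_sq = sample_variance(sample)
s = math.sqrt(s_sq)
print(s_sq, round(s, 2))  # 50.0 7.07
```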
How to find the population variance and standard deviation for data
with frequencies provided?
• Ungrouped frequency distribution
$$\sigma^2 = \frac{\sum f (X - \mu)^2}{N}$$
• Grouped frequency distribution
$$\sigma^2 = \frac{\sum f (X_m - \mu)^2}{N}$$
• For ungrouped data, use the actual observed X value in the different
classes.
• For grouped data, use the same formula with the X value replaced by
the class midpoint, Xm.
How to find the sample variance and standard deviation for data with
frequencies provided?
• Ungrouped frequency distribution
$$s^2 = \frac{\sum f (X - \bar{X})^2}{n - 1}$$
• Grouped frequency distribution
$$s^2 = \frac{\sum f (X_m - \bar{X})^2}{n - 1}$$
Example
Class    f    Xm      f·Xm     Xm – µ    (Xm – µ)²    f(Xm – µ)²
1-10     4    5.5     22       –10.59    112.1481     448.5924
11-20    8    15.5    124      –0.59     0.3481       2.7848
21-30    5    25.5    127.5    9.41      88.5481      442.7405

N = 17
ΣfXm = 273.5
Σf(Xm – µ)² = 894.1177

Mean, µ = ΣfXm / N
= 273.5/17
= 16.09
Variance, σ² = 894.1177 / 17
= 52.60
Standard deviation, σ = √σ²
= √52.60
= 7.25
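The grouped population variance in this example can be checked with a short Python sketch. The tuple layout and the name `grouped_population_stats` are assumptions for illustration. The code keeps the mean unrounded, so intermediate figures differ slightly from the table (which rounds µ to 16.09), but the final results agree to two decimal places.

```python
import math

# Classes from the example: (lower limit, upper limit, frequency).
grouped = [(1, 10, 4), (11, 20, 8), (21, 30, 5)]

def grouped_population_stats(classes):
    """Return (mean, variance, standard deviation) using class midpoints."""
    mids = [((lower + upper) / 2, f) for lower, upper, f in classes]
    n = sum(f for _, f in mids)
    mu = sum(f * xm for xm, f in mids) / n
    var = sum(f * (xm - mu) ** 2 for xm, f in mids) / n
    return mu, var, math.sqrt(var)

mu, var, sd = grouped_population_stats(grouped)
print(round(mu, 2), round(sd, 2))  # 16.09 7.25
```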
Question
a) The scores in a statistics test for 60 candidates are shown in the
table. Find the mean, variance, and standard deviation for this
population.
Score    Frequency
0-19     8
20-39    13
40-59    24
60-79    11
80-99    4
b) The following table shows the number of cars that passed UM
hospital on each day in a random sample of days. Compute the sample
mean, variance and standard deviation.
Number of cars    Number of days
101-120           2
121-130           4
131-140           15
141-150           10
151-160           7
3-3 Coefficient of Variation
• The coefficient of variation is defined to be the standard deviation
divided by the mean. The result is expressed as a percentage.
• It is used to compare the variability of data sets measured in
different units.
$$CVar = \frac{\sigma}{\mu} \cdot 100\% \quad \text{or} \quad CVar = \frac{s}{\bar{X}} \cdot 100\%$$
i) Number of sales of cars over 3 months
Mean = 87; Standard deviation = 5
CVar = 5/87 x 100% = 5.7% (sales)
ii) Sales commissions
Mean = $5225; Standard deviation = $773
CVar = 773/5225 x 100% = 14.8% (commissions)
Commissions are more variable than the sales.
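The comparison of sales and commissions can be reproduced with a one-line function. This is a minimal sketch; the name `coefficient_of_variation` is an illustrative choice.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation divided by the mean, expressed as a percentage."""
    return std_dev / mean * 100

cv_sales = coefficient_of_variation(5, 87)
cv_commissions = coefficient_of_variation(773, 5225)
print(round(cv_sales, 1), round(cv_commissions, 1))  # 5.7 14.8
```

Because the result is a unitless percentage, a count of cars and an amount in dollars can be compared directly.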
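3-3 Chebyshev’s Theorem
• For any distribution, at least 1 – 1/k² of the values will lie within
k standard deviations of the mean, where k is a number greater than 1.
• At least 75% of the values will lie within 2 standard deviations of
the mean.
• At least 88.89% (approximately 89%) will lie within 3 standard
deviations.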
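The 75% and 89% figures come from the general Chebyshev bound 1 - 1/k², which holds for any distribution when k > 1. A minimal Python sketch (the function name `chebyshev_bound` is illustrative):

```python
def chebyshev_bound(k):
    """Minimum proportion of values within k standard deviations (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem requires k > 1")
    return 1 - 1 / k ** 2

print(chebyshev_bound(2))            # 0.75
print(round(chebyshev_bound(3), 4))  # 0.8889
```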
3-3 Empirical (Normal) Rule
• For any bell-shaped distribution:
  - Approximately 68% of the data values will fall within one standard
    deviation of the mean.
  - Approximately 95% will fall within two standard deviations of the
    mean.
  - Approximately 99.7% will fall within three standard deviations of
    the mean.
[Figure: bell curve with the horizontal axis marked from µ – 3σ to
µ + 3σ and the central 95% region indicated.]
Question
The scores on a national achievement exam have
a mean of 480 and standard deviation of 90. If
the scores are normally distributed, find the
scores for approximately 68% of the data
values, 95% of the data values, and 99.7% of
the data values.
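One way to check an answer to this question is to compute the intervals mean ± k standard deviations for k = 1, 2, 3. The sketch below assumes the empirical rule applies as stated; the function name `empirical_intervals` is hypothetical.

```python
def empirical_intervals(mean, std_dev):
    """Intervals mean ± k·sd for k = 1, 2, 3, keyed by approximate coverage."""
    labels = {1: "68%", 2: "95%", 3: "99.7%"}
    return {labels[k]: (mean - k * std_dev, mean + k * std_dev)
            for k in (1, 2, 3)}

for coverage, (low, high) in empirical_intervals(480, 90).items():
    print(coverage, low, high)
# 68% 390 570
# 95% 300 660
# 99.7% 210 750
```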
3-4 Measures of Position
• Measure the position of a particular data value in a data set.
• Z-score
• Percentile
• Decile
• Quartile
3-4 Z-Score
• The standard score or z score for a value is obtained by subtracting
the mean from the value and dividing the result by the standard
deviation.
• The symbol z is used for the z score.
z = (value - mean) / standard deviation
• The z score represents the number of standard deviations a data
value falls above or below the mean.
For samples:
$$z = \frac{X - \bar{X}}{s}$$
For populations:
$$z = \frac{X - \mu}{\sigma}$$
• A student scored 65 on a statistics exam that had a
mean of 50 and a standard deviation of 10.
Compute the z-score.
• z = (65 – 50)/10 = 1.5.
• That is, the score of 65 is 1.5 standard deviations
above the mean.
• Above - since the z-score is positive.
• How about if the z-score shows a negative value?
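The z-score calculation, including the negative case, can be sketched in Python. The first call uses the exam example above; the score of 40 in the second call is a hypothetical value added only to illustrate a negative result.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations a value lies above (+) or below (-) the mean."""
    return (value - mean) / std_dev

print(z_score(65, 50, 10))  # 1.5  -> 1.5 standard deviations above the mean
print(z_score(40, 50, 10))  # -1.0 -> 1 standard deviation below the mean (hypothetical score)
```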
3-4 Percentile
• Percentiles divide the distribution into 100
equal groups.
P1, P2, P3, P4, …, P98, P99, P100
How to Find the Value Corresponding to a Given Percentile
• Step 1: Arrange the data in order.
• Step 2: Compute c = (n•p)/100, where
c = position value of the required percentile
p = percentile
n = sample size
• Step 3: If c is not a whole number, round up to the next whole number
and take the value at that position. If c is a whole number, use the
value halfway between the cth and (c+1)th values.
Find the value of the 25th percentile for
the following data set: 18, 12, 3, 5, 15, 8,
10, 2, 6, 20.
• Is the data arranged in order? No, so arrange it first:
2, 3, 5, 6, 8, 10, 12, 15, 18, 20.
• n = 10, p = 25, so c = (10 × 25)/100 = 2.5.
Since 2.5 is not a whole number, round up to c = 3.
• Thus, the value of the 25th percentile is the 3rd value, X = 5.
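The three-step percentile procedure can be sketched in Python as follows. The function name `percentile_value` is hypothetical, and the sketch assumes 0 < p < 100. Note that this textbook rule is only one of several percentile conventions, so library functions may return slightly different values.

```python
def percentile_value(values, p):
    """Value at the p-th percentile using c = (n * p) / 100."""
    if not 0 < p < 100:
        raise ValueError("this sketch assumes 0 < p < 100")
    data = sorted(values)                 # Step 1: arrange the data in order
    q, r = divmod(len(data) * p, 100)     # Step 2: c = q + r/100
    if r:                                 # c is not whole: round up to position q + 1
        return data[q]                    # (position q + 1 is index q)
    # c is whole: halfway between the c-th and (c+1)-th values
    return (data[q - 1] + data[q]) / 2

data = [18, 12, 3, 5, 15, 8, 10, 2, 6, 20]
print(percentile_value(data, 25))  # 5
```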
3-4 Decile
• Deciles divide the data set into 10 groups.
D1, D2, D3, D4, D5, D6, D7, D8, D9, D10
What is the relationship between percentiles and deciles?
D1 corresponds to P10
D2 corresponds to P20
…
D9 corresponds to P90
D10 corresponds to P100
3-4 Quartile
• Quartiles divide the data set into 4
groups.
Q1, Q2, Q3, Q4
What is the relationship between percentiles and quartiles?
Q1 corresponds to P25
Q2 corresponds to P50 (the median)
Q3 corresponds to P75
Q4 corresponds to P100
3-4 Outliers and the
Interquartile Range (IQR)
• An outlier is an extremely high or an
extremely low data value when
compared with the rest of the data
values.
• The Interquartile Range, IQR = Q3 –
Q1.
Procedures to identify outliers
• To determine whether a data value can be
considered as an outlier:
• Step 1: Arrange the data in order.
• Step 2: Compute Q1 and Q3.
• Step 3: Find the IQR = Q3 – Q1.
• Step 4: Compute (1.5)(IQR).
• Step 5: Compute Q1 – (1.5)(IQR) and
Q3 + (1.5)(IQR).
• Step 6: Compare the data value (say X) with
Q1 – (1.5)(IQR) and Q3 + (1.5)(IQR).
• If X < Q1 – (1.5)(IQR) or X > Q3 + (1.5)(IQR),
then X is considered an outlier.
• Given the data set 5, 6, 12, 13, 15, 18, 22, 50, find the outlier.
• Find Q1. Q1 corresponds to P25:
c = (n•p)/100
= (8•25)/100
= 2
• Since 2 is a whole number, take the value halfway between the 2nd and
3rd values.
• Q1 = (6+12)/2 = 9
• Find Q3. Q3 corresponds to P75:
c = (n•p)/100
= (8•75)/100
= 6
• Since 6 is a whole number, take the value halfway between the 6th and
7th values.
• Q3 = (18+22)/2 = 20
• Find the IQR: IQR = 20 – 9 = 11.
• (1.5)(IQR) = (1.5)(11) = 16.5.
• 9 – 16.5 = –7.5 and 20 + 16.5 = 36.5.
• The value 50 is outside the range –7.5 to 36.5, hence 50 is an
outlier.
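The outlier procedure can be sketched end to end in Python. The helper `percentile_value` implements the c = (n•p)/100 rule from these notes, and both function names are hypothetical. Other quartile conventions may give different fences.

```python
def percentile_value(values, p):
    """Percentile by the c = (n * p) / 100 rule (assumes 0 < p < 100)."""
    data = sorted(values)
    q, r = divmod(len(data) * p, 100)
    if r:                                  # c not whole: round up
        return data[q]
    return (data[q - 1] + data[q]) / 2     # c whole: average c-th and (c+1)-th

def iqr_outliers(values):
    """Return (lower fence, upper fence, list of outliers)."""
    q1 = percentile_value(values, 25)
    q3 = percentile_value(values, 75)
    step = 1.5 * (q3 - q1)                 # (1.5)(IQR)
    low, high = q1 - step, q3 + step
    return low, high, [x for x in values if x < low or x > high]

print(iqr_outliers([5, 6, 12, 13, 15, 18, 22, 50]))  # (-7.5, 36.5, [50])
```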
3-5 Exploratory Data Analysis: Stem and Leaf Plot
• A stem and leaf plot is a data plot that uses part of a data value as
the stem and part of the data value as the leaf to form groups or
classes.
Example of Stem and Leaf Plot
Stem | Leaves
4    | 2 5 2
5    | 2 5 2
Stem = leading digit
Leaf = trailing digit
At an outpatient testing center, a sample of 20
days showed the following number of
cardiograms done each day: 25, 31, 20, 32,
13, 14, 43, 02, 57, 23, 36, 32, 33, 32, 44, 32,
52, 44, 51, 45. Construct a stem and leaf plot
for the data.
Leading Digit (Stem) | Trailing Digit (Leaf)
0                    | 2
1                    | 3 4
2                    | 0 3 5
3                    | 1 2 2 2 2 3 6
4                    | 3 4 4 5
5                    | 1 2 7
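A stem and leaf plot for two-digit data can be built by splitting each value into its tens digit (stem) and units digit (leaf). The sketch below uses the cardiogram data; the function name `stem_and_leaf` is illustrative, and the value written as 02 in the text is entered as 2.

```python
from collections import defaultdict

cardiograms = [25, 31, 20, 32, 13, 14, 43, 2, 57, 23,
               36, 32, 33, 32, 44, 32, 52, 44, 51, 45]

def stem_and_leaf(values):
    """Map each leading digit (stem) to its sorted trailing digits (leaves)."""
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        rows[stem].append(leaf)
    return {stem: rows[stem] for stem in sorted(rows)}

plot = stem_and_leaf(cardiograms)
for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```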
3-5 Exploratory Data Analysis: Box Plot
• When the data set contains a small number of
values, a box plot is used to graphically represent
the data set.
• These plots involve five values:
1. Minimum value
2. Q1
3. Median
4. Q3
5. Maximum value
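The five values behind a box plot can be computed with a short Python sketch. It reuses the c = (n•p)/100 percentile rule from section 3-4 and the data set from the outlier example; the function names are hypothetical, and other quartile conventions may give slightly different results.

```python
def percentile_value(values, p):
    """Percentile by the c = (n * p) / 100 rule (assumes 0 < p < 100)."""
    data = sorted(values)
    q, r = divmod(len(data) * p, 100)
    if r:                                  # c not whole: round up
        return data[q]
    return (data[q - 1] + data[q]) / 2     # c whole: average c-th and (c+1)-th

def five_number_summary(values):
    """(minimum, Q1, median, Q3, maximum)."""
    return (min(values),
            percentile_value(values, 25),
            percentile_value(values, 50),
            percentile_value(values, 75),
            max(values))

print(five_number_summary([5, 6, 12, 13, 15, 18, 22, 50]))
# (5, 9.0, 14.0, 20.0, 50)
```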
Example of Box Plot
[Figure: box plot with the minimum value, Q1, Q2 (median), Q3, and
maximum value marked.]
Information Obtained from a
Box Plot
• If the median is near the center of the box, the
distribution is approximately symmetric.
• If the median falls to the left of the center of the
box, the distribution is positively skewed.
• If the median falls to the right of the center of the
box, the distribution is negatively skewed.
• If the lines are about the same length, the
distribution is approximately symmetric.
• If the right line is longer than the left line,
the distribution is positively skewed.
• If the left line is longer than the right line,
the distribution is negatively skewed.
21 girls estimated the length of a line, in mm. The
results were
51 45 31 43 97 16 18 23 34 35 35
85 62 20 22 51 57 49 22 18 27
Draw the box plot and use it to identify any outliers.
Comment on the shape of the distribution of length.