STA 2023
Chapter 2 – Methods for Describing Sets of Data
Describing Qualitative Data (2.1)
o Class – a category into which qualitative data can be sorted
G, PG, PG-13, R and NC-17 are classes for film ratings
o Class Frequency – number of observations in a class
o Class Relative Frequency – class frequency divided by the total number of
observations
Table 2.1
Rating    Frequency    Relative Frequency
G         4            .08
PG        11           .22
PG-13     19           .38
R         15           .30
NC-17     1            .02
TOTAL     50           1.00
[Pie chart, "2002 Film Ratings": G 8%, PG 22%, PG-13 38%, R 30%, NC-17 2%]
Figure 2.1: A pie chart representing the data in Table 2.1. (Excel)
[Bar chart, "2002 Film Ratings": Number of Films (vertical axis, 0 to 20) by Ratings. G 4, PG 11, PG-13 19, R 15, NC-17 1]
Figure 2.2: A frequency histogram representing the data in Table 2.1. (Excel)
[Bar chart, "2002 Film Ratings": Proportion of Films (vertical axis, 0 to 0.4) by Ratings. G 0.08, PG 0.22, PG-13 0.38, R 0.3, NC-17 0.02]
Figure 2.3: A relative frequency histogram representing the data in Table 2.1. (Excel)
Note that the frequency histogram and the relative frequency histogram have the same
shape. This is always true for the same set of data, because each relative frequency is
the class frequency divided by the same total number of observations.
Describing Quantitative Data (2.2)
o Graphical Methods
Dot Plot
Histogram
Stem-and-Leaf Display
Table 2.2 - Number of kills per match for a volleyball team (sorted)
17 18 19 20 21 21 24 25 25 26 26 27
27 27 28 28 32 32 33 33 33 35 35
[Dot plot; horizontal axis: Number of Kills, 17 to 35]
Figure 2.4: A dot plot representing the data in Table 2.2. (Excel)
[Histogram, "Number of Kills per Match": Number of Matches (vertical axis, 0 to 10) by Number of Kills (classes: Less than 21, 21-24, 25-28, 29-32, More than 32)]
Figure 2.5: A frequency histogram representing the data in Table 2.2. (Excel)
Stem = Tens
1 | 789
2 | 0114556677788
3 | 2233355
Figure 2.6: A stem-and-leaf display representing the data in Table 2.2. (Word)
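The stem-and-leaf display for Table 2.2 can be rebuilt with a few lines of code. This is a minimal sketch that assumes non-negative whole numbers with the tens digit as the stem; the function name `stem_and_leaf` is made up for this illustration.

```python
from collections import defaultdict

# Kills per match, copied from Table 2.2.
kills = [17, 18, 19, 20, 21, 21, 24, 25, 25, 26, 26, 27,
         27, 27, 28, 28, 32, 32, 33, 33, 33, 35, 35]


def stem_and_leaf(values):
    """Group each value's ones digit (leaf) under its tens digit (stem)."""
    stems = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        stems[stem].append(str(leaf))
    return {stem: "".join(leaves) for stem, leaves in sorted(stems.items())}


display = stem_and_leaf(kills)
for stem, leaves in display.items():
    print(f"{stem} | {leaves}")
```

Each leaf is one observation, so the length of a row shows how many matches fall in that group of ten.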
Summation Notation (2.3)
o Example – Exercise 2.34 using the dataset {3, 8, 4, 5, 3, 4, 6}
a. $\sum x = 3 + 8 + 4 + 5 + 3 + 4 + 6 = 33$
b. $\sum x^2 = 3^2 + 8^2 + 4^2 + 5^2 + 3^2 + 4^2 + 6^2 = 9 + 64 + 16 + 25 + 9 + 16 + 36 = 175$
c. $\sum (x-5)^2 = (-2)^2 + (3)^2 + (-1)^2 + (0)^2 + (-2)^2 + (-1)^2 + (1)^2 = 4 + 9 + 1 + 0 + 4 + 1 + 1 = 20$
d. $\sum (x-2)^2 = \sum (x^2 - 4x + 4) = \sum x^2 - \sum 4x + \sum 4 = \sum x^2 - 4\sum x + 4\sum 1 = 175 - 4(33) + 4(7) = 175 - 132 + 28 = 71$
e. $(\sum x)^2 = (33)^2 = 1,089$
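The sums in Exercise 2.34 can be double-checked with a short script. This is a sketch for verification only, and the variable names are chosen here for illustration. Part d is computed two ways to show that expanding the square gives the same result as summing the squared differences directly.

```python
# Dataset from Exercise 2.34.
x = [3, 8, 4, 5, 3, 4, 6]
n = len(x)

sum_x = sum(x)                                  # a. sum of x
sum_x_sq = sum(v ** 2 for v in x)               # b. sum of x squared
sum_dev5_sq = sum((v - 5) ** 2 for v in x)      # c. sum of (x - 5) squared
sum_dev2_sq = sum((v - 2) ** 2 for v in x)      # d. computed directly
sum_dev2_expanded = sum_x_sq - 4 * sum_x + 4 * n  # d. using the expansion
square_of_sum = sum_x ** 2                      # e. (sum of x) squared

print(sum_x, sum_x_sq, sum_dev5_sq, sum_dev2_sq, sum_dev2_expanded, square_of_sum)
```

Note that parts b and e differ: the sum of the squares (175) is not the square of the sum (1,089).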
Numerical Measures of Central Tendency (2.4)
o Mean – the sum of the observations divided by the number of observations (also called the average or expected value)
Formula to calculate sample mean: $\bar{x} = \frac{\sum x}{n}$
$\bar{x}$ (pronounced “x bar”) represents the sample mean
$\mu$ (pronounced “myoo” – spelled mu) represents the population mean
o Sample Size – number of observations in the sample
n represents sample size
o Population Size – number of observations in the entire population
N represents population size
o Median – the value of a dataset which splits the data into halves (middle value)
Usually M represents the median of a sample or population
To find the median of a dataset, do the following:
1. Order all data in ascending order
2. Count the number of observations (call this n)
a. If n is odd, then M is the kth value of the dataset, where $k = \frac{n+1}{2}$
b. If n is even, then M is the average of the kth and (k+1)th values of the dataset, where $k = \frac{n}{2}$
o Mode – most frequently occurring value in the dataset
No symbol usually associated with mode
o Modal Class – class with the greatest frequency
No symbol usually associated with modal class
o Shape
When the mean and median of a distribution are equal, the distribution is
called symmetric
When the mean is less than the median, the distribution is skewed left
When the mean is greater than the median, the distribution is skewed
right
NOTE: You should be comfortable identifying the shape of a distribution
by comparing the mean and median and by visual identification
[Figure: three sketched distributions labeled Symmetric, Skewed Left, and Skewed Right]
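The median steps and the mean-versus-median comparison for shape can be sketched in code. This is only an illustration under stated assumptions: the function names are made up, the two skewed datasets at the bottom are hypothetical, and the exact equality test for "symmetric" works for tidy classroom numbers but would need a tolerance for real data.

```python
def mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)


def median(values):
    """Follow the notes: order the data, count n, then pick the middle."""
    ordered = sorted(values)              # step 1: ascending order
    n = len(ordered)                      # step 2: number of observations
    if n % 2 == 1:                        # n odd: the kth value, k = (n + 1) / 2
        k = (n + 1) // 2
        return ordered[k - 1]             # lists are 0-based, so subtract 1
    k = n // 2                            # n even: average of kth and (k+1)th
    return (ordered[k - 1] + ordered[k]) / 2


def shape(values):
    """Classify the shape by comparing the mean with the median."""
    m, med = mean(values), median(values)
    if m < med:
        return "skewed left"
    if m > med:
        return "skewed right"
    return "symmetric"


print(shape([1, 2, 3, 4, 5]))     # mean 3, median 3
print(shape([1, 2, 3, 4, 50]))    # hypothetical data with one large value
print(shape([-40, 2, 3, 4, 5]))   # hypothetical data with one small value
```

A single extreme value pulls the mean toward it while the median barely moves, which is why the comparison reveals the direction of the skew.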
o Example – Exercise 2.42 calculating the mean, median, and mode
a. {7, -2, 3, 3, 0, 4}
Order the data: {-2, 0, 3, 3, 4, 7}
Mean: $\bar{x} = \frac{\sum x}{n} = \frac{-2 + 0 + 3 + 3 + 4 + 7}{6} = \frac{15}{6} = 2.5$
Median: n = 6 so n is even, then $k = \frac{n}{2} = \frac{6}{2} = 3$, and k+1 = 4, so we
should take the average of the 3rd and 4th values of the dataset,
which are 3 and 3, respectively, and the average of those numbers
is 3
Mode: the value in the dataset that occurs most frequently is 3
b. {2, 3, 5, 3, 2, 3, 4, 3, 5, 1, 2, 3, 4}
Order the data: {1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 5, 5}
Mean: $\bar{x} = \frac{\sum x}{n} = \frac{1 + 2 + 2 + 2 + 3 + 3 + 3 + 3 + 3 + 4 + 4 + 5 + 5}{13} = \frac{40}{13} \approx 3.08$
Median: n = 13 so n is odd, then $k = \frac{n+1}{2} = \frac{14}{2} = 7$, so the median is the
7th value of the dataset, which is 3
Mode: the value in the dataset that occurs most frequently is 3
c. {51, 50, 47, 50, 48, 41, 59, 68, 45, 37}
Order the data: {37, 41, 45, 47, 48, 50, 50, 51, 59, 68}
Mean: $\bar{x} = \frac{\sum x}{n} = \frac{37 + 41 + 45 + 47 + 48 + 50 + 50 + 51 + 59 + 68}{10} = \frac{496}{10} = 49.6$
Median: n = 10 so n is even, then $k = \frac{n}{2} = \frac{10}{2} = 5$, and k+1 = 6, so we
should take the average of the 5th and 6th values of the dataset,
which are 48 and 50, respectively, and the average of those
numbers is 49
Mode: the value in the dataset that occurs most frequently is 50
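The answers to Exercise 2.42 can be confirmed with Python's standard `statistics` module. This is a sketch for checking hand calculations rather than part of the exercise; the dictionary layout and names are chosen here for illustration.

```python
import statistics

# Datasets from Exercise 2.42, parts a through c.
datasets = {
    "a": [7, -2, 3, 3, 0, 4],
    "b": [2, 3, 5, 3, 2, 3, 4, 3, 5, 1, 2, 3, 4],
    "c": [51, 50, 47, 50, 48, 41, 59, 68, 45, 37],
}

results = {}
for label, data in datasets.items():
    results[label] = (
        statistics.mean(data),
        statistics.median(data),
        statistics.mode(data),
    )
    mean_value, median_value, mode_value = results[label]
    print(f"{label}. mean = {mean_value:.2f}, median = {median_value}, mode = {mode_value}")
```

Part b has 13 observations, so its mean is 40/13, which rounds to 3.08.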
Numerical Measures of Variability (2.5)
o Range – maximum minus minimum
Usually R represents the range in a sample or population
o Sample Variance – a common measurement for the variability in a dataset
$s^2$ represents the sample variance
$\sigma^2$ (sigma-squared) represents the population variance
Formula to calculate sample variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
Alternate (easier) formula: $s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}$
o Sample Standard Deviation – spread of the data
Standard deviation is preferable to variance, since the units of standard
deviation are the original units of the variable (the units of variance are squared)
$s$ represents the sample standard deviation
$\sigma$ (sigma) represents the population standard deviation
Formula to calculate sample standard deviation: $s = \sqrt{s^2}$
If an estimate for s cannot be obtained, dividing the range by 4 for small
samples and by 6 for large samples will provide a conservative estimate
o Example – Exercise 2.62 using Sample 1: {10, 0, 1, 9, 10, 0, 8, 1, 1, 9} and
Sample 2: {0, 5, 10, 5, 5, 5, 6, 5, 6, 5}
a. Examine both samples and identify the one that you believe has the greater
variability. Sample 1 appears to have the greater variability.
b. Calculate the range for each sample. Does the result agree with your
answer to part a.? Explain. Sample 1 Range = Maximum – Minimum =
10 – 0 = 10. Sample 2 Range = Maximum – Minimum = 10 – 0 = 10.
These results do not agree with our answer in part a., since the two ranges are equal.
c. Calculate the standard deviation for each sample. Does the result agree
with your answer to part a.? Explain.
Sample 1 Variance: $s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1} = \frac{(10^2 + 0^2 + 1^2 + 9^2 + 10^2 + 0^2 + 8^2 + 1^2 + 1^2 + 9^2) - \frac{(10 + 0 + 1 + 9 + 10 + 0 + 8 + 1 + 1 + 9)^2}{10}}{9} = \frac{429 - \frac{49^2}{10}}{9} = 20.99$,
so Sample 1 Standard Deviation: $s = \sqrt{s^2} = \sqrt{20.99} = 4.58$.
Sample 2 Variance: $s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1} = \frac{(0^2 + 5^2 + 10^2 + 5^2 + 5^2 + 5^2 + 6^2 + 5^2 + 6^2 + 5^2) - \frac{(0 + 5 + 10 + 5 + 5 + 5 + 6 + 5 + 6 + 5)^2}{10}}{9} = \frac{322 - \frac{52^2}{10}}{9} = 5.73$,
so Sample 2 Standard Deviation: $s = \sqrt{s^2} = \sqrt{5.73} = 2.39$.
These results agree with our answer to part a.
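d. Which of the two, the range or the standard deviation, provides a better
measure of variability? The standard deviation provides a better measure
of variability, because it uses every observation in the sample, while the range
uses only the two most extreme values.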
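The calculations in Exercise 2.62 can be reproduced with the alternate (shortcut) variance formula. This is a minimal sketch: `sample_variance` is a name made up for this illustration, and `statistics.variance` from the standard library is used only as a cross-check.

```python
import math
import statistics


def sample_variance(values):
    """Shortcut formula: (sum of x^2 - (sum of x)^2 / n) / (n - 1)."""
    n = len(values)
    sum_x = sum(values)
    sum_x_sq = sum(v ** 2 for v in values)
    return (sum_x_sq - sum_x ** 2 / n) / (n - 1)


# Samples from Exercise 2.62.
sample_1 = [10, 0, 1, 9, 10, 0, 8, 1, 1, 9]
sample_2 = [0, 5, 10, 5, 5, 5, 6, 5, 6, 5]

for name, sample in (("Sample 1", sample_1), ("Sample 2", sample_2)):
    data_range = max(sample) - min(sample)
    s_squared = sample_variance(sample)
    s = math.sqrt(s_squared)
    # Cross-check against the standard library's sample variance.
    assert math.isclose(s_squared, statistics.variance(sample))
    print(f"{name}: range = {data_range}, variance = {s_squared:.2f}, std dev = {s:.2f}")
```

Both samples have a range of 10, but their standard deviations differ by almost a factor of two, which is the point of parts b through d.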
Interpreting the Standard Deviation (2.6)
o Chebyshev’s Rule – gives a bound for the proportion of data that falls within a
specified number of standard deviations from the mean and also gives a bound for
the proportion of data that falls outside of a specified number of standard
deviations from the mean
Proportion of data within k standard deviations of the mean: at least $1 - \frac{1}{k^2}$
Proportion of data outside k standard deviations of the mean: $< \frac{1}{k^2}$
Chebyshev’s Rule applies to ANY distribution, regardless of its shape
Table 2.3 – Table of common values for Chebyshev’s Rule
k    Proportion within k (at least)    Proportion outside k
1    0 (useless)                       < 1 (useless)
2    .75                               < .25
3    .888                              < .111
o Example – Assume we have a distribution with $\mu = 9$ and $\sigma = 2$
a. What proportion of the data lies between 5 and 13? Here, k=2, so
referring to Table 2.3, we know that at least 75% of the distribution lies
between 5 and 13.
b. What proportion of the data lies outside of (2, 16)? Here, k=3.5, so we
use the formula $\frac{1}{k^2}$, with k=3.5, which tells us that less than .0816 or less
than 8.16% of the distribution lies outside of (2, 16).
c. What proportion of the data lies below 4? Here, k=2.5, so we use the
formula $\frac{1}{k^2}$, with k=2.5, which tells us that less than .16 or less than 16%
of the data lies outside of (4, 14), so less than 16% of the data lies below 4.
It would be incorrect to split the .16 in half on the grounds that we are
looking at only one side of the distribution: we know nothing about the
shape of the distribution, so we cannot assume that it is symmetric.
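The Chebyshev bounds in Table 2.3 and in the example can be computed for any k. This is a sketch under stated assumptions: the function name `chebyshev_bounds` is made up, the interval is assumed to be centered on the mean, and k is found as the distance from the mean to an endpoint divided by the standard deviation.

```python
def chebyshev_bounds(k):
    """Return (lower bound on proportion within k std devs,
    upper bound on proportion outside k std devs)."""
    if k <= 0:
        raise ValueError("k must be positive")
    outside = min(1.0, 1 / k ** 2)   # for k <= 1 the bound says nothing useful
    within = 1 - outside
    return within, outside


mu, sigma = 9, 2  # values from the example

# Upper endpoints of the intervals (5, 13), (2, 16), and (4, 14).
for upper in (13, 16, 14):
    k = (upper - mu) / sigma
    within, outside = chebyshev_bounds(k)
    print(f"k = {k}: at least {within:.4f} within, less than {outside:.4f} outside")
```

These are bounds, not estimates: the true proportion within k standard deviations may be much larger than the lower bound.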
o Empirical Rule – gives an approximation for the proportion of data that falls
within or outside of one, two, or three standard deviations
Approximately 68% of the distribution will fall within 1 standard
deviation of the mean
Approximately 95% of the distribution will fall within 2 standard
deviations of the mean
Approximately 99.7% of the distribution will fall within 3 standard
deviations of the mean
Empirical Rule applies only to distributions that are both SYMMETRIC
and MOUND-SHAPED
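[Figure: mound-shaped curve; moving outward from the mean on each side, the one-standard-deviation bands contain 34%, 13.5%, 2.35%, and 0.15% of the distribution]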
o Example – Assume we have a symmetric, mound-shaped distribution with $\mu = 9$
and $\sigma = 2$
a. What proportion of the data lies within (7, 11)? 68%
b. What proportion of the data lies within (5, 15)? 97.35%
c. What proportion of the data lies outside of (5, 13)? 5%
d. What proportion of the data lies above 11? 16%
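The Empirical Rule answers above come from adding up the one-standard-deviation bands (34%, 13.5%, 2.35%, 0.15% on each side of the mean). A minimal sketch of that bookkeeping follows. The names `EDGES`, `BAND_AREAS`, and `percent_between` are made up for this illustration, and the function only accepts endpoints that fall a whole number of standard deviations (0 to 3) from the mean.

```python
# Band boundaries in standard deviations from the mean.
EDGES = [float("-inf"), -3, -2, -1, 0, 1, 2, 3, float("inf")]

# Approximate area of each band, in hundredths of a percent so the sums stay exact.
# BAND_AREAS[i] covers the band from EDGES[i] to EDGES[i + 1].
BAND_AREAS = [15, 235, 1350, 3400, 3400, 1350, 235, 15]


def percent_between(mu, sigma, low=float("-inf"), high=float("inf")):
    """Approximate percent of a symmetric, mound-shaped distribution in (low, high)."""
    z_low = (low - mu) / sigma
    z_high = (high - mu) / sigma
    if z_low not in EDGES or z_high not in EDGES:
        raise ValueError("endpoints must be 0 to 3 whole standard deviations from the mean")
    start, stop = EDGES.index(z_low), EDGES.index(z_high)
    return sum(BAND_AREAS[start:stop]) / 100


mu, sigma = 9, 2  # values from the example
print(percent_between(mu, sigma, 7, 11))          # a. within (7, 11)
print(percent_between(mu, sigma, 5, 15))          # b. within (5, 15)
print(100 - percent_between(mu, sigma, 5, 13))    # c. outside of (5, 13)
print(percent_between(mu, sigma, low=11))         # d. above 11
```

Part b is not symmetric about the mean: 5 is two standard deviations below 9 while 15 is three above, so the bands on each side are added separately.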
Numerical Measures of Relative Standing (2.7)
o Percentile – an observation that is the pth percentile will have p% of the
distribution below it and (100-p)% of the distribution above it
o Z-score – measure of relative standing associated with the standard normal
distribution (Chapter 5)
Formula for sample z-score: $z = \frac{x - \bar{x}}{s}$
Formula for population z-score: $z = \frac{x - \mu}{\sigma}$
o Example – Exercise 2.82, computing z-scores
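a. x = 40, s = 5, $\bar{x}$ = 30: $z = \frac{x - \bar{x}}{s} = \frac{40 - 30}{5} = 2$
b. x = 90, $\mu$ = 89, $\sigma$ = 2: $z = \frac{x - \mu}{\sigma} = \frac{90 - 89}{2} = .5$
c. $\mu$ = 50, $\sigma$ = 5, x = 50: $z = \frac{x - \mu}{\sigma} = \frac{50 - 50}{5} = 0$
d. s = 4, x = 20, $\bar{x}$ = 30: $z = \frac{x - \bar{x}}{s} = \frac{20 - 30}{4} = -2.5$
e. Sample or population: a. sample b. population c. population d. sample
f. Position relative to the mean, in standard deviations: a. above the mean by 2 b. above the mean by .5 c. at the mean d. below the mean by 2.5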
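The z-scores in Exercise 2.82 use the same arithmetic whether the mean and standard deviation come from a sample or a population; only the symbols differ. A minimal sketch, with the function name `z_score` and the parameter names `center` and `spread` made up for this illustration:

```python
def z_score(x, center, spread):
    """Number of standard deviations that x lies above (+) or below (-) the mean.

    center is the mean (x-bar or mu); spread is the standard deviation (s or sigma).
    """
    return (x - center) / spread


# (label, x, mean, standard deviation) for parts a through d.
cases = [("a", 40, 30, 5), ("b", 90, 89, 2), ("c", 50, 50, 5), ("d", 20, 30, 4)]

for label, x, center, spread in cases:
    print(f"{label}. z = {z_score(x, center, spread)}")
```

The sign of z gives the side of the mean and the magnitude gives the distance in standard deviations, which is how part f is read off.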
Methods for Detecting Outliers (2.8) – SKIP
Graphing Bivariate Relationships (2.9) – SKIP
Distorting the Truth with Descriptive Techniques (2.10)
o Statistics can lie