STA 2023
Chapter 2 – Methods for Describing Sets of Data
Describing Qualitative Data (2.1)
o Class – one of the categories into which qualitative data can be sorted
 G, PG, PG-13, R and NC-17 are classes for film ratings
o Class Frequency – number of observations in a class
o Class Relative Frequency – class frequency divided by the total number of
observations
Table 2.1
Rating   Frequency   Relative Frequency
G        4           .08
PG       11          .22
PG-13    19          .38
R        15          .30
NC-17    1           .02
TOTAL    50          1.00
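As a rough check on Table 2.1, the relative frequencies can be recomputed from the class frequencies. This is a minimal sketch; the variable names are illustrative and not part of the course notes.

```python
# Class frequencies copied from Table 2.1.
frequencies = {"G": 4, "PG": 11, "PG-13": 19, "R": 15, "NC-17": 1}

# Total number of observations.
total = sum(frequencies.values())

# Class relative frequency = class frequency / total number of observations.
relative_frequencies = {
    rating: count / total for rating, count in frequencies.items()
}

for rating, count in frequencies.items():
    print(f"{rating:<6} {count:>3} {relative_frequencies[rating]:.2f}")
print(f"{'TOTAL':<6} {total:>3} {sum(relative_frequencies.values()):.2f}")
```

The relative frequencies should sum to 1 (up to rounding), which is a quick way to catch a miscounted class.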
[Pie chart: 2002 Film Ratings]
Figure 2.1: A pie chart representing the data in Table 2.1. (Excel)
[Bar chart: 2002 Film Ratings; x-axis: Ratings, y-axis: Number of Films]
Figure 2.2: A frequency histogram representing the data in Table 2.1. (Excel)
[Bar chart: 2002 Film Ratings; x-axis: Ratings, y-axis: Proportion of Films]
Figure 2.3: A relative frequency histogram representing the data in Table 2.1. (Excel)
Note that the frequency histogram and the relative frequency histogram have the same
shape. This is always true for the same set of data, because each relative frequency is
the class frequency divided by the same total.

Describing Quantitative Data (2.2)
o Graphical Methods
 Dot Plot
 Histogram
 Stem-and-Leaf Display
Table 2.2 - Number of kills per match for a volleyball team (sorted)
17 18 19 20 21 21 24 25
25 26 26 27 27 27 28 28
32 32 33 33 33 35 35
[Dot plot; x-axis: Number of Kills, 17 to 35]
Figure 2.4: A dot plot representing the data in Table 2.2. (Excel)
[Histogram: Number of Kills per Match; x-axis: Number of Kills (Less than 21, 21-24, 25-28, 29-32, More than 32), y-axis: Number of Matches]
Figure 2.5: A frequency histogram representing the data in Table 2.2. (Excel)
Stem | Leaves
1 | 789
2 | 0114556677788
3 | 2233355
Stem = Tens
Figure 2.6: A stem-and-leaf display representing the data in Table 2.2. (Word)
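The stem-and-leaf display in Figure 2.6 can be rebuilt from the Table 2.2 data by splitting each value into its tens digit (stem) and ones digit (leaf). The sketch below assumes two-digit, non-negative values; the function name `stem_and_leaf` is hypothetical.

```python
from collections import defaultdict

# Data copied from Table 2.2 (kills per match, sorted).
kills = [17, 18, 19, 20, 21, 21, 24, 25, 25, 26, 26, 27,
         27, 27, 28, 28, 32, 32, 33, 33, 33, 35, 35]


def stem_and_leaf(values):
    """Group values by tens digit (stem) and collect the ones digits (leaves)."""
    display = defaultdict(str)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        display[stem] += str(leaf)
    return dict(display)


for stem, leaves in stem_and_leaf(kills).items():
    print(f"{stem} | {leaves}")
```

Counting the leaves in each row also gives the class frequencies for a histogram with classes 10-19, 20-29, and 30-39.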

Summation Notation (2.3)
o Example – Exercise 2.34 using the dataset {3, 8, 4, 5, 3, 4, 6}
a. $\sum x = 3 + 8 + 4 + 5 + 3 + 4 + 6 = 33$
b. $\sum x^2 = 3^2 + 8^2 + 4^2 + 5^2 + 3^2 + 4^2 + 6^2 = 9 + 64 + 16 + 25 + 9 + 16 + 36 = 175$
c. $\sum (x-5)^2 = (-2)^2 + (3)^2 + (-1)^2 + (0)^2 + (-2)^2 + (-1)^2 + (1)^2 = 4 + 9 + 1 + 0 + 4 + 1 + 1 = 20$
d. $\sum (x-2)^2 = \sum (x^2 - 4x + 4) = \sum x^2 - 4\sum x + \sum 4 = \sum x^2 - 4\sum x + 4\sum 1 = 175 - 4(33) + 4(7) = 175 - 132 + 28 = 71$
e. $(\sum x)^2 = (33)^2 = 1{,}089$
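The five sums in this exercise can be checked with a few lines of Python. This is only a sketch for verification; the variable names are illustrative.

```python
data = [3, 8, 4, 5, 3, 4, 6]
n = len(data)

sum_x = sum(data)                                  # a. sum of x
sum_x_squared = sum(x ** 2 for x in data)          # b. sum of x^2
sum_dev5_squared = sum((x - 5) ** 2 for x in data) # c. sum of (x - 5)^2
sum_dev2_squared = sum((x - 2) ** 2 for x in data) # d. sum of (x - 2)^2, computed directly
expanded_d = sum_x_squared - 4 * sum_x + 4 * n     # d. same sum via the expansion
square_of_sum = sum_x ** 2                         # e. (sum of x)^2

print(sum_x, sum_x_squared, sum_dev5_squared,
      sum_dev2_squared, expanded_d, square_of_sum)
```

Parts b and e differ only in the order of squaring and summing, which is why the sum of squares (175) and the square of the sum (1,089) are not the same.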

Numerical Measures of Central Tendency (2.4)
o Mean – average value of a dataset (also called average or expected value)
 Formula to calculate sample mean: $\bar{x} = \frac{\sum x}{n}$
 $\bar{x}$ (pronounced “x bar”) represents the sample mean
 $\mu$ (pronounced “myoo” – spelled mu) represents the population mean
o Sample Size – number of observations in the sample
 n represents sample size
o Population Size – number of observations in the entire population
 N represents population size
o Median – the value of a dataset which splits the data into halves (middle value)
 Usually M represents the median of a sample or population
 To find the median of a dataset, do the following:
1. Order all data in ascending order
2. Calculate the number of observations (assume n)
a. If n is odd, then M is the kth value of the dataset where $k = \frac{n+1}{2}$
b. If n is even, then M is the average of the kth and (k+1)th values of the dataset where $k = \frac{n}{2}$
o Mode – most frequently occurring value in the dataset
 No symbol usually associated with mode
o Modal Class – class with the greatest frequency
 No symbol usually associated with modal class
o Shape
 When the mean and median of a distribution are equal, the distribution is
called symmetric
 When the mean is less than the median, the distribution is skewed left
 When the mean is greater than the median, the distribution is skewed
right
 NOTE: You should be comfortable identifying the shape of a distribution
by comparing the mean and median and by visual identification
[Figure: three example distributions labeled Symmetric, Skewed Left, and Skewed Right]
o Example – Exercise 2.42 calculating the mean, median, and mode
a. {7, -2, 3, 3, 0, 4}
 Order the data: {-2, 0, 3, 3, 4, 7}
 Mean: $\bar{x} = \frac{\sum x}{n} = \frac{-2 + 0 + 3 + 3 + 4 + 7}{6} = \frac{15}{6} = 2.5$
 Median: n = 6 so n is even, then $k = \frac{6}{2} = 3$, and k+1 = 4, so we should take the average of the 3rd and 4th values of the dataset which are 3 and 3, respectively, and the average of those numbers is (obviously) 3
 Mode: the value in the dataset that occurs most frequently is 3
b. {2, 3, 5, 3, 2, 3, 4, 3, 5, 1, 2, 3, 4}
 Order the data: {1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 5, 5}
 Mean: $\bar{x} = \frac{\sum x}{n} = \frac{1 + 2 + 2 + 2 + 3 + 3 + 3 + 3 + 3 + 4 + 4 + 5 + 5}{13} = \frac{40}{13} \approx 3.08$
 Median: n = 13 so n is odd, then $k = \frac{14}{2} = 7$ so the median is the 7th value of the dataset which is 3
 Mode: the value in the dataset that occurs most frequently is 3
c. {51, 50, 47, 50, 48, 41, 59, 68, 45, 37}
 Order the data: {37, 41, 45, 47, 48, 50, 50, 51, 59, 68}
 Mean: $\bar{x} = \frac{\sum x}{n} = \frac{37 + 41 + 45 + 47 + 48 + 50 + 50 + 51 + 59 + 68}{10} = \frac{496}{10} = 49.6$
 Median: n = 10 so n is even, then $k = \frac{10}{2} = 5$, and k+1 = 6, so we should take the average of the 5th and 6th values of the dataset which are 48 and 50, respectively, and the average of those numbers is 49
 Mode: the value in the dataset that occurs most frequently is 50
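The three parts of this exercise can be reproduced in Python. The sketch below implements the median by the ordered-position rule from the notes (k counted from 1) and uses the standard library's `statistics.mean` and `statistics.mode`; the helper name `median_by_position` is hypothetical.

```python
from statistics import mean, mode


def median_by_position(values):
    """Median via the position rule: k = (n + 1) / 2 for odd n, k = n / 2 for even n."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        k = (n + 1) // 2
        return ordered[k - 1]          # kth value (lists are 0-indexed)
    k = n // 2
    return (ordered[k - 1] + ordered[k]) / 2  # average of kth and (k+1)th values


datasets = {
    "a": [7, -2, 3, 3, 0, 4],
    "b": [2, 3, 5, 3, 2, 3, 4, 3, 5, 1, 2, 3, 4],
    "c": [51, 50, 47, 50, 48, 41, 59, 68, 45, 37],
}

for label, values in datasets.items():
    print(label,
          "mean:", round(mean(values), 2),
          "median:", median_by_position(values),
          "mode:", mode(values))
```

Each of these datasets has a single most frequent value, so `mode` is unambiguous here; a dataset with ties would need separate handling.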

Numerical Measures of Variability (2.5)
o Range – maximum minus minimum
 Usually R represents the range in a sample or population
o Sample Variance – a common measurement for the variability in a dataset
 $s^2$ represents the sample variance
 $\sigma^2$ (sigma-squared) represents the population variance
 Formula to calculate sample variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$
 Alternate (easier) formula: $s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n-1}$
o Sample Standard Deviation – a measure of the spread of the data
 Standard deviation is preferable to variance, since standard deviation is expressed in the original units of the variable (variance is not)
 $s$ represents the sample standard deviation
 $\sigma$ (sigma) represents the population standard deviation
 Formula to calculate sample standard deviation: $s = \sqrt{s^2}$
 If an estimate for s cannot be obtained, dividing the range by 4 for small samples and by 6 for large samples will provide a conservative estimate
o Example – Exercise 2.62 using Sample 1: {10, 0, 1, 9, 10, 0, 8, 1, 1, 9} and
Sample 2: {0, 5, 10, 5, 5, 5, 6, 5, 6, 5}
a. Examine both samples and identify the one that you believe has the greater
variability. Sample 1 appears to have the greater variability.
b. Calculate the range for each sample. Does the result agree with your
answer to part a.? Explain. Sample 1 Range = Maximum – Minimum =
10 – 0 = 10. Sample 2 Range = Maximum – Minimum = 10 – 0 = 10.
These results do not agree with our answer in part a.
c. Calculate the standard deviation for each sample. Does the result agree
with your answer to part a.? Explain.
Sample 1 Variance = $s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n-1} = \frac{(10^2 + 0^2 + 1^2 + 9^2 + 10^2 + 0^2 + 8^2 + 1^2 + 1^2 + 9^2) - \frac{(10 + 0 + 1 + 9 + 10 + 0 + 8 + 1 + 1 + 9)^2}{10}}{9} = \frac{429 - \frac{49^2}{10}}{9} = 20.99$, so Sample 1 Standard Deviation = $s = \sqrt{s^2} = \sqrt{20.99} = 4.58$.
Sample 2 Variance = $s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n-1} = \frac{(0^2 + 5^2 + 10^2 + 5^2 + 5^2 + 5^2 + 6^2 + 5^2 + 6^2 + 5^2) - \frac{(0 + 5 + 10 + 5 + 5 + 5 + 6 + 5 + 6 + 5)^2}{10}}{9} = \frac{322 - \frac{52^2}{10}}{9} = 5.73$, so Sample 2 Standard Deviation = $s = \sqrt{s^2} = \sqrt{5.73} = 2.39$.
These results agree with our answer to part a.
d. Which of the two, the range or the standard deviation, provides a better
measure of variability? The standard deviation provides a better measure
of variability.
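The calculations in this exercise can be checked with a short script. This is a sketch that uses the alternate (shortcut) variance formula from the notes; the function name `sample_variance` is illustrative.

```python
from math import sqrt


def sample_variance(values):
    """Shortcut formula: (sum of x^2 - (sum of x)^2 / n) / (n - 1)."""
    n = len(values)
    sum_x = sum(values)
    sum_x_squared = sum(x ** 2 for x in values)
    return (sum_x_squared - sum_x ** 2 / n) / (n - 1)


sample_1 = [10, 0, 1, 9, 10, 0, 8, 1, 1, 9]
sample_2 = [0, 5, 10, 5, 5, 5, 6, 5, 6, 5]

for name, sample in (("Sample 1", sample_1), ("Sample 2", sample_2)):
    s_squared = sample_variance(sample)
    print(name,
          "range:", max(sample) - min(sample),
          "variance:", round(s_squared, 2),
          "std dev:", round(sqrt(s_squared), 2))
```

Both samples have the same range, yet their standard deviations differ, because the range uses only the two extreme values while the standard deviation uses every observation.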

Interpreting the Standard Deviation (2.6)
o Chebyshev’s Rule – gives a bound for the proportion of data that falls within a
specified number of standard deviations from the mean and also gives a bound for
the proportion of data that falls outside of a specified number of standard
deviations from the mean
 Proportion of data within k standard deviations of the mean: $\geq 1 - \frac{1}{k^2}$
 Proportion of data outside k standard deviations of the mean: $< \frac{1}{k^2}$
 Chebyshev’s Rule applies to ANY distribution, regardless of its shape
Table 2.3 – Table of common values for Chebyshev’s Rule
k   Proportion within $\mu \pm k\sigma$   Proportion outside $\mu \pm k\sigma$
1   $\geq 0$ (useless)                    $< 1$ (useless)
2   $\geq .75$                            $< .25$
3   $\geq .888$                           $< .111$
o Example – Assume we have a distribution with $\mu = 9$ and $\sigma = 2$
a. What proportion of the data lies between 5 and 13? Here, k=2, so
referring to Table 2.3, we know that at least 75% of the distribution lies
between 5 and 13.
b. What proportion of the data lies outside of (2, 16)? Here, k=3.5, so we
use the formula $\frac{1}{k^2}$, with k=3.5, which tells us that less than .0816 or less
than 8.16% of the distribution lies outside of (2, 16).
c. What proportion of the data lies below 4? Here, k=2.5, so we use the
formula $\frac{1}{k^2}$, with k=2.5, which tells us that less than .16 or less than 16%
of the data lies outside of (4, 14). However, at this point, it would be
incorrect to further split the .16 in half, because we are looking at one side
of the distribution; we know nothing about the distribution, hence, it
would be incorrect to assume that the distribution is symmetric. The most
we can say is that less than 16% of the data lies below 4.
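The bounds used in this example can be computed directly from k. The following is a minimal sketch; the function name `chebyshev_bounds` is hypothetical, and the intervals are assumed to be symmetric about the mean, as they are in the example.

```python
def chebyshev_bounds(k):
    """Return (lower bound on proportion within, upper bound on proportion outside)
    k standard deviations of the mean. Only informative when k > 1."""
    if k <= 0:
        raise ValueError("k must be positive")
    outside = min(1.0, 1 / k ** 2)
    return 1 - outside, outside


mu, sigma = 9, 2  # mean and standard deviation from the example

for low, high in [(5, 13), (2, 16), (4, 14)]:
    k = (high - mu) / sigma  # interval is symmetric about mu, so one side suffices
    within, outside = chebyshev_bounds(k)
    print(f"({low}, {high}): k = {k}, at least {within:.4f} within, "
          f"less than {outside:.4f} outside")
```

The function returns bounds only for both tails combined; as part c. notes, the rule alone does not justify splitting the outside proportion between the two tails.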
o Empirical Rule – gives an approximation for the proportion of data that falls
within or outside of one, two, or three standard deviations
 Approximately 68% of the distribution will fall within 1 standard
deviation of the mean
 Approximately 95% of the distribution will fall within 2 standard
deviations of the mean
 Approximately 99.7% of the distribution will fall within 3 standard
deviations of the mean
 Empirical Rule applies only to distributions that are both SYMMETRIC
and MOUND-SHAPED
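[Figure: mound-shaped curve marked at 1, 2, and 3 standard deviations on each side of the mean; moving outward from the mean, each side contains 34%, 13.5%, 2.35%, and 0.15% of the distribution]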
o Example – Assume we have a symmetric, mound-shaped distribution with $\mu = 9$
and $\sigma = 2$
a. What proportion of the data lies within (7, 11)? 68%
b. What proportion of the data lies within (5, 15)? 97.35%
c. What proportion of the data lies outside of (5, 13)? 5%
d. What proportion of the data lies above 11? 16%
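The four answers come from adding up the approximate band percentages of the Empirical Rule (34%, 13.5%, 2.35%, and 0.15% on each side of the mean). The sketch below encodes those bands in a lookup table; the names `BANDS`, `TAIL`, `percent_between`, and `to_z` are illustrative, and the endpoints are assumed to fall a whole number of standard deviations from the mean.

```python
# Approximate percentage of the distribution in each one-standard-deviation band,
# keyed by the band's lower edge measured in standard deviations from the mean.
BANDS = {-3: 2.35, -2: 13.5, -1: 34.0, 0: 34.0, 1: 13.5, 2: 2.35}
TAIL = 0.15  # approximate percentage beyond 3 standard deviations, on each side


def percent_between(z_low=None, z_high=None):
    """Approximate percent between two whole-number z-scores in [-3, 3].

    Pass None for an open-ended side (everything below or everything above)."""
    low = -3 if z_low is None else z_low
    high = 3 if z_high is None else z_high
    total = sum(BANDS[z] for z in range(low, high))
    if z_low is None:
        total += TAIL
    if z_high is None:
        total += TAIL
    return total


mu, sigma = 9, 2  # mean and standard deviation from the example


def to_z(x):
    # Assumes x lies a whole number of standard deviations from the mean.
    return (x - mu) // sigma


print("a.", round(percent_between(to_z(7), to_z(11)), 2))        # within (7, 11)
print("b.", round(percent_between(to_z(5), to_z(15)), 2))        # within (5, 15)
print("c.", round(100 - percent_between(to_z(5), to_z(13)), 2))  # outside (5, 13)
print("d.", round(percent_between(to_z(11), None), 2))           # above 11
```

Unlike the Chebyshev case, splitting by side is legitimate here because the distribution is assumed to be symmetric.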

Numerical Measures of Relative Standing (2.7)
o Percentile – an observation that is the pth percentile will have p% of the
distribution below it and (100-p)% of the distribution above it
o Z-score – measure of relative standing associated with the standard normal
distribution (Chapter 5)
xx
 Formula for sample z-score: z 
s
x
 Formula for population z-score: z 

o Example – Exercise 2.82, computing z-scores
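a. x = 40, s = 5, $\bar{x}$ = 30: $z = \frac{x - \bar{x}}{s} = \frac{40 - 30}{5} = 2$
b. x = 90, $\mu$ = 89, $\sigma$ = 2: $z = \frac{x - \mu}{\sigma} = \frac{90 - 89}{2} = .5$
c. $\mu$ = 50, $\sigma$ = 5, x = 50: $z = \frac{x - \mu}{\sigma} = \frac{50 - 50}{5} = 0$
d. s = 4, x = 20, $\bar{x}$ = 30: $z = \frac{x - \bar{x}}{s} = \frac{20 - 30}{4} = -2.5$
e. a. sample b. population c. population d. sample
f. a. 2 standard deviations above the mean b. .5 standard deviations above the mean
c. at the mean
d. 2.5 standard deviations below the mean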
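The z-scores in this exercise can be verified with a small helper. The same arithmetic applies to a sample (using the sample mean and s) and to a population (using mu and sigma); the function and parameter names below are illustrative.

```python
def z_score(x, center, spread):
    """z = (x - center) / spread, where center is the mean and spread the standard deviation."""
    return (x - center) / spread


# (x, mean, standard deviation) for parts a. through d.
cases = {
    "a": (40, 30, 5),
    "b": (90, 89, 2),
    "c": (50, 50, 5),
    "d": (20, 30, 4),
}

for label, (x, center, spread) in cases.items():
    z = z_score(x, center, spread)
    if z == 0:
        position = "at the mean"
    elif z > 0:
        position = f"{z} standard deviations above the mean"
    else:
        position = f"{-z} standard deviations below the mean"
    print(f"{label}. z = {z}: {position}")
```

The sign of z gives the side of the mean, and its magnitude gives the distance in standard deviations.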

Methods for Detecting Outliers (2.8) – SKIP

Graphing Bivariate Relationships (2.9) – SKIP

Distorting the Truth with Descriptive Techniques (2.10)
o Statistics can lie