Download PPT

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

History of statistics wikipedia , lookup

Bootstrapping (statistics) wikipedia , lookup

Time series wikipedia , lookup

Transcript
Chapter 2 ~ Descriptive Analysis &
Presentation of Single-Variable Data
Black Bears
Mean: 60.07 inches
Median: 62.50 inches
Range: 42 inches
20
Variance: 117.681
Standard deviation: 10.85 inches
Minimum: 36 inches
Frequency
Maximum: 78 inches
10
First quartile: 51.63 inches
Third quartile: 67.38 inches
Count: 58 bears
Sum: 3438.1 inches
0
30
40
50
60
70
80
Length in Inches
1
Chapter Goals
• Learn how to present and describe sets of data
• Learn measures of central tendency, measures of
dispersion (spread), measures of position, and types of
distributions
• Learn how to interpret findings so that we know what
the data is telling us about the sampled population
2
Measures of Central Tendency
• Numerical values used to locate the middle of a
set of data, or where the data is clustered
• The term average is often associated with all
measures of central tendency
3
Mean
Mean: The type of average with which you are probably most familiar. The mean
is the sum of all the values divided by the total number of values, n:
1
1
x =  xi = ( x1 + x2 + . . . + xn )
n
n
Notes:



The population mean, , (lowercase mu, Greek alphabet), is the
mean of all x values for the entire population
We usually cannot measure  but would like to estimate its value
A physical representation: the mean is the value that balances the
weights on the number line
4
Example
 Example: The following data represents the number of accidents
in each of the last 6 years at a dangerous intersection.
Find the mean number of accidents: 8, 9, 3, 5, 2, 6, 4, 5:
Solution:

1
x = (8 + 9 + 3 + 5 + 2 + 6 + 4 + 5) = 5.25
8
In the data above, change 6 to 26:
Solution:
1
x = (8 + 9 + 3 + 5 + 2 + 26 + 4 + 5) = 7.75
8
Note: The mean can be greatly influenced by outliers
5
Median
Median: The value of the data that occupies the middle position when
the data are ranked in order according to size
Notes:
~
 Denoted by “x tilde”: x

The population median,  (uppercase mu, Greek alphabet), is
the data value in the middle position of the entire population
To find the median:
1. Rank the data
x ) = n +1
2. Determine the depth of the median: d ( ~
2
3. Determine the value of the median
6
Example
 Example: Find the median for the set of data:
{4, 8, 3, 8, 2, 9, 2, 11, 3}
Solution:
1. Rank the data: 2, 2, 3, 3, 4, 8, 8, 9, 11
x ) = (9 +1)/ 2 = 5
2. Find the depth: d (~
3. The median is the fifth number from either end in the ranked
x =4
data: ~
Suppose the data set is {4, 8, 3, 8, 2, 9, 2, 11, 3, 15}:
1. Rank the data: 2, 2, 3, 3, 4, 8, 8, 9, 11, 15
2. Find the depth: d ( ~x ) = (10 + 1) / 2 = 5.5
3. The median is halfway between the fifth and sixth
observations: ~
x = (4 +8)/ 2 = 6
7
Mode & Midrange
Mode: The mode is the value of x that occurs most frequently
Note: If two or more values in a sample are tied for the
highest frequency (number of occurrences), there
is no mode
Midrange: The number exactly midway between a lowest value
data L and a highest value data H. It is found by averaging the
low and the high values:
midrange=
L+ H
2
8
Example
 Example: Consider the data set {12.7, 27.1, 35.6, 44.2, 18.0}
+ H 12.7 + 44.2
L
=
= 2845
Midrange =
.
2
2
Notes:

When rounding off an answer, a common rule-of-thumb is
to keep one more decimal place in the answer than was
present in the original data

To avoid round-off buildup, round off only the final
answer, not intermediate steps
9
Measures of Dispersion
• Measures of central tendency alone cannot
completely characterize a set of data. Two very
different data sets may have similar measures of
central tendency.
• Measures of dispersion are used to describe the
spread, or variability, of a distribution
• Common measures of dispersion: range, variance,
and standard deviation
10
Range
Range: The difference in value between the highest-valued (H) and
the lowest-valued (L) pieces of data:
range= H  L
• Other measures of dispersion are based on the following quantity
Deviation from the Mean: A deviation from the mean, x  x ,
is the difference between the value of x and the mean x
11
Example
 Example: Consider the sample {12, 23, 17, 15, 18}.
Find 1) the range and 2) each deviation from the mean.
Solutions:
1) range= H  L = 2312 =11
2) x = 1(12 + 23+17 +15+18) =17
5
Data Deviation from Mean
x x
x
_________________________
12
-5
23
6
17
0
15
-2
18
1
12
Sample Variance & Standard Deviation
Sample Variance: The sample variance, s2, is the mean of the
squared deviations, calculated using n  1 as the divisor:
s2 =
1
( x  x) 2

n 1
where n is the sample size
Note: The numerator for the sample variance is called the sum of
squares for x, denoted SS(x):
s2 = SS( x)
n 1
where
SS( x ) =  ( x  x ) 2 =  x 2 
1
n
( x )
2
Standard Deviation: The standard deviation of a sample, s, is the
positive square root of the variance:
s = s2
13
Example
 Example: Find the 1) variance and 2) standard deviation for the
data {5, 7, 1, 3, 8}:
Solutions:
First: x = 1(5+ 7 +1+ 3+ 8) = 48
.
5
Sum:
1) s 2 =
x
x x
( x  x)2
5
7
1
3
8
24
0.2
2.2
-3.8
-1.8
3.2
0
0.04
4.84
14.44
3.24
10.24
32.08
1
( 32 . 8 ) = 8 . 2
4
2) s =
8 . 2 = 2 . 86
14
Notes

The shortcut formula for the sample variance:
( x )
x  n
2
2
s2 =

n 1
The unit of measure for the standard deviation is the
same as the unit of measure for the data
15
Measures of Position
• Measures of position are used to describe the
relative location of an observation
• Quartiles and percentiles are two of the most
popular measures of position
• An additional measure of central tendency, the
midquartile, is defined using quartiles
• Quartiles are part of the 5-number summary
16
Quartiles
Quartiles: Values of the variable that divide the ranked data into
quarters; each set of data has three quartiles
1. The first quartile, Q1, is a number such that at most 25% of
the data are smaller in value than Q1 and at most 75% are
larger
2. The second quartile, Q2, is the median
3. The third quartile, Q3, is a number such that at most 75%
of the data are smaller in value than Q3 and at most 25%
are larger
Ranked data, increasing order
25%
L
25%
Q1
25%
Q2
25%
Q3
H
17
Percentiles
Percentiles: Values of the variable that divide a set of ranked
data into 100 equal subsets; each set of data has 99 percentiles.
The kth percentile, Pk, is a value such that at most k% of the data
is smaller in value than Pk and at most (100  k)% of the data is
larger.
at most k %
L
at most (100 - k )%
Pk
H
Notes:

The 1st quartile and the 25th percentile are the same: Q1 = P25

The median, the 2nd quartile, and the 50th percentile are
x = Q2 = P50
all the same: ~
18
Finding Pk (and Quartiles)
• Procedure for finding Pk (and quartiles):
1. Rank the n observations, lowest to highest
2. Compute A = (nk)/100
3. If A is an integer:
– d(Pk) = A.5 (depth)
– Pk is halfway between the value of the data in the Ath
position and the value of the next data
If A is a fraction:
– d(Pk) = B, the next larger integer
– Pk is the value of the data in the Bth position
19
Example
 Example: The following data represents the pH levels of a
random sample of swimming pools in a California
town. Find: 1) the first quartile, 2) the third quartile,
and 3) the 37th percentile:
5.6
6.0
6.7
7.0
5.6
6.1
6.8
7.3
5.8
6.2
6.8
7.4
5.9
6.3
6.8
7.4
6.0
6.4
6.9
7.5
Solutions:
1) k = 25: (20) (25) / 100 = 5,
depth = 5.5,
Q1 = 6
2) k = 75: (20) (75) / 100 = 15, depth = 15.5, Q3 = 6.95
3) k = 37: (20) (37) / 100 = 7.4,
depth = 8,
P37 = 6.2
20
5-Number Summary
5-Number Summary: The 5-number summary is composed of:
1.
2.
3.
4.
5.
L, the smallest value in the data set
Q1, the first quartile (also P25)
~
x , the median (also P50 and 2nd quartile)
Q3, the third quartile (also P75)
H, the largest value in the data set
Notes:

The 5-number summary indicates how much the data is
spread out in each quarter

The interquartile range is the difference between the first
and third quartiles. It is the range of the middle 50% of the
data
21
Box-and-Whisker Display
Box-and-Whisker Display: A graphic representation of the
5-number summary:
• The five numerical values (smallest, first quartile, median, third
quartile, and largest) are located on a scale, either vertical or
horizontal
• The box is used to depict the middle half of the data that lies
between the two quartiles
• The whiskers are line segments used to depict the other half of the
data
• One line segment represents the quarter of the data that is smaller
in value than the first quartile
• The second line segment represents the quarter of the data that is
larger in value that the third quartile
22
Example
 Example: A random sample of students in a sixth grade class
was selected. Their weights are given in the table
below. Find the 5-number summary for this data and
construct a boxplot:
Solution:
63
85
92
99
112
63
L
64
86
93
99
85
Q1
76
88
93
99
92
~
x
76
89
93
101
99
Q3
81
90
94
108
83
91
97
109
112
H
23
Boxplot for Weight Data
Weights from Sixth Grade Class
60
70
80
90
100
110
Weight
L
Q1
~
x
Q3
H
24
z-Score
z-Score: The position a particular value of x has relative to the mean,
measured in standard deviations. The z-score is found by the
formula:
value  mean x  x
z=
=
st.dev.
s
Notes:
 Typically, the calculated value of z is rounded to the nearest
hundredth
 The z-score measures the number of standard deviations
above/below, or away from, the mean
 z-scores typically range from -3.00 to +3.00
 z-scores may be used to make comparisons of raw scores
25
Example
 Example: A certain data set has mean 35.6 and standard
deviation 7.1. Find the z-scores for 46 and 33:
Solutions:
x  x 46  35.6
z=
=
= 176
.
s
7.1
46 is 1.46 standard deviations above the mean
x  x = 33  35.6 = 
=
z
0.37
s
7.1
33 is -0.37 below standard deviations below the mean.
26
Interpreting & Understanding
Standard Deviation
• Standard deviation is a measure of variability, or
spread
• Two rules for describing data rely on the standard
deviation:
– Empirical rule: applies to a variable that is
normally distributed
– Chebyshev’s theorem: applies to any distribution
27
Empirical Rule
Empirical Rule: If a variable is normally distributed, then:
1. Approximately 68% of the observations lie within 1 standard
deviation of the mean
2. Approximately 95% of the observations lie within 2 standard
deviations of the mean
3. Approximately 99.7% of the observations lie within 3 standard
deviations of the mean
Notes:

The empirical rule is more informative than Chebyshev’s theorem since
we know more about the distribution (normally distributed)

Also applies to populations

Can be used to determine if a distribution is normally distributed
28
Illustration of the Empirical Rule
99.7%
95%
68%
x 3s
x 2s
xs
x
x+s
x +2s
x +3s
29
Example
 Example: A random sample of plum tomatoes was selected
from a local grocery store and their weights recorded.
The mean weight was 6.5 ounces with a standard
deviation of 0.4 ounces. If the weights are normally
distributed:
1) What percentage of weights fall between 5.7 and 7.3?
2) What percentage of weights fall above 7.7?
Solutions:
1) ( x  2s, x + 2s) = (65
.  2(0.4), 65
. + 2(0.4)) = (57
. , 7.3)
Approximately 95% of the weights fall between 5.7 and 7.3
2) ( x  3s, x + 3s) = (65
.  3(0.4), 65
. + 3(0.4)) = (53
. , 7.7)
Approximately 99.7% of the weights fall between 5.3 and 7.7
Approximately 0.3% of the weights fall outside (5.3, 7.7)
Approximately (0.3/2)=0.15% of the weights fall above 7.7
30
A Note about the Empirical Rule
Note: The empirical rule may be used to determine whether or
not a set of data is approximately normally distributed
1. Find the mean and standard deviation for the data
2. Compute the actual proportion of data within 1, 2, and 3
standard deviations from the mean
3. Compare these actual proportions with those given by the
empirical rule
4. If the proportions found are reasonably close to those of the
empirical rule, then the data is approximately normally
distributed
31
Chebyshev’s Theorem
Chebyshev’s Theorem: The proportion of any distribution that lies
within k standard deviations of the mean is at least 1  (1/k2), where
k is any positive number larger than 1. This theorem applies to all
distributions of data.
Illustration:
at least
1 12
k
x  ks
x
x + ks
32
Important Reminders!

Chebyshev’s theorem is very conservative and holds for
any distribution of data

Chebyshev’s theorem also applies to any population

The two most common values used to describe a
distribution of data are k = 2, 3

The table below lists some values for k and 1 - (1/k2):
k
1(1/ k 2)
1.7
0.65
2
0.75
2.5
0.84
3
0.89
33
Example
 Example: At the close of trading, a random sample of 35
technology stocks was selected. The mean selling
price was 67.75 and the standard deviation was 12.3.
Use Chebyshev’s theorem (with k = 2, 3) to describe
the distribution.
Solutions:
Using k=2: At least 75% of the observations lie within 2 standard
deviations of the mean:
( x  2s, x + 2s ) = (67.75  2(12.3), 67.75 + 2(12.3) = (43.15, 92.35)
Using k=3: At least 89% of the observations lie within 3
standard deviations of the mean:
( x  3s, x + 3s ) = (67.75  3(12.3), 67.75 + 3(12.3) = (30.85, 104.65)
34