Lecture 2
Instruction: Measures of Central Tendency
This lecture discusses numerical descriptive measures, beginning with three statistics of
numerical data sets called measures of central tendency.
A measure of central tendency is a statistic that assigns a numerical
value as representative of an entire data set.
One measure of central tendency is the arithmetic mean. The symbol X-bar, $\bar{X}$, denotes
the arithmetic mean of a sample set. The arithmetic mean, defined below, can be thought of as
the average.
For a given numerical set of data $S = \{X_1, X_2, \ldots, X_n\}$ with n elements, the
arithmetic mean of the set is given by the formula:
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}.$$
The arithmetic mean has three significant characteristics. First, changing the value of any score,
or adding to the data set a new score not equal to the mean, will change the mean of the data set.
Second, if some constant value c is added to each value in the data set, the mean changes to
$\bar{X} + c$. Third, if each value in the data set is multiplied by some constant value c, the mean
changes to $c \cdot \bar{X}$.
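The three characteristics can be checked numerically. The sketch below is only an illustration: the sample, the constant c, and the function name `arithmetic_mean` are assumptions chosen for the example, not part of the lecture.

```python
def arithmetic_mean(scores):
    """Sum of the scores divided by the number of scores."""
    return sum(scores) / len(scores)

sample = [2, 4, 9]   # hypothetical sample with mean 5
c = 3                # arbitrary constant

base = arithmetic_mean(sample)
with_new_score = arithmetic_mean(sample + [13])      # new score not equal to the mean
shifted = arithmetic_mean([x + c for x in sample])   # c added to every score
scaled = arithmetic_mean([c * x for x in sample])    # every score multiplied by c

print(base, with_new_score, shifted, scaled)
```

The shifted mean equals the original mean plus c, and the scaled mean equals c times the original mean.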
A second measure of central tendency is the median. The median, defined below, is the
midpoint of the distribution of the data set.
For data set S arranged in ascending order, the median is the value that divides the
data set exactly in half, so that 50% of the data is equal to or less than the
median. If (n + 1)/2 is an integer, it equals the position of the median. If (n + 1)/2
is not an integer, the median is the midpoint between the score in
the n/2 position and the score in the (n + 2)/2 position.
If n ( S ) is odd for a sample S of non-rounded data arranged in ascending order, the median is the
middle number in S. If n ( S ) is even for a sample S of non-rounded data arranged in ascending
order, the median is the mean of the two middle numbers.
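The position rule can be written as a short function. This is a minimal sketch, assuming a non-empty list of numbers; the name `median_by_position` is hypothetical.

```python
def median_by_position(scores):
    """Median found from the position rule: (n + 1)/2 when n is odd,
    otherwise the midpoint of positions n/2 and (n + 2)/2."""
    if not scores:
        raise ValueError("the data set must contain at least one score")
    ordered = sorted(scores)          # ascending order
    n = len(ordered)
    if n % 2 == 1:
        position = (n + 1) // 2       # 1-based position of the middle score
        return ordered[position - 1]
    lower = ordered[n // 2 - 1]       # score in position n/2
    upper = ordered[(n + 2) // 2 - 1] # score in position (n + 2)/2
    return (lower + upper) / 2

print(median_by_position([3, 1, 2]))      # odd n: the middle score
print(median_by_position([4, 1, 3, 2]))   # even n: mean of the two middle scores
```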
The third measure of central tendency is the mode. The mode, defined below, is the most
common number in a numerical data set.
For data set S with some frequency $f_k$ greater than any other frequency $f_j$, the
mode is the value with the greatest frequency.
According to the definition above, there is no mode in a numerical data set that contains data
values such that the frequencies of all the data values are equal. If, however, there exists any one
or more frequencies greater than one or more other frequencies, the data set has a mode, and the
mode equals the data value (or values) with the greatest frequency. Data sets with multiple
modes are said to be multimodal. Data sets with two modes are said to be bimodal.
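The convention described above (no mode when all frequencies are equal, possibly several modes otherwise) might be sketched as follows. The function name `modes` is an assumption made for illustration.

```python
from collections import Counter

def modes(scores):
    """Return the list of modes, or an empty list when every value
    occurs equally often (the 'no mode' case described above)."""
    counts = Counter(scores)
    highest = max(counts.values())
    if all(f == highest for f in counts.values()):
        return []                                  # all frequencies equal
    return sorted(v for v, f in counts.items() if f == highest)

print(modes([1, 2, 3]))        # all frequencies equal
print(modes([1, 1, 2]))        # a single mode
print(modes([1, 1, 2, 2, 3]))  # bimodal
```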
Consider a sample V = {6, 5, 2, 12, 1, 3, 2, 4, 0, 4, 13, 6, 6, 7, 1, 6} . To find the three
measures of central tendency, we must find the arithmetic mean, the median, and the mode.
Arranging the data set in ascending order will help identify frequencies and the median:
V = {0, 1, 1, 2, 2, 3, 4, 4, 5, 6, 6, 6, 6, 7, 12, 13} .
The data point 6 appears in the data set the most (has the greatest frequency), so the mode equals
6. The arithmetic mean equals the ratio of the sum of the data points to the number of data
points as computed below.
$$\bar{X} = \frac{0 + 1 + 1 + 2 + 2 + 3 + 4 + 4 + 5 + 6 + 6 + 6 + 6 + 7 + 12 + 13}{16} = \frac{78}{16} = 4.875$$
Since n (V ) is even, the median equals the mean of the two middle numbers as computed below.
$$\text{median} = \frac{4 + 5}{2} = \frac{9}{2} = 4.5$$
In summary, for the given data set V, we have the three measures of central tendency:
mean = 4.875, median = 4.5, & mode = 6.
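One way to double-check these three values is with Python's standard `statistics` module. This is offered only as a cross-check of the hand calculation, not as the lecture's method.

```python
import statistics

# Sample V from the worked example above.
V = [6, 5, 2, 12, 1, 3, 2, 4, 0, 4, 13, 6, 6, 7, 1, 6]

mean_v = statistics.mean(V)        # sum of the scores divided by n
median_v = statistics.median(V)    # mean of the two middle scores, since n is even
modes_v = statistics.multimode(V)  # list of the most frequent value(s)

print(mean_v, median_v, modes_v)
```

Note that `statistics.multimode` returns every value when all frequencies are equal, which differs from the "no mode" convention used in this lecture.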
Consider a larger set of data S displayed by the frequency distribution below.
X   22  23  24  25  26  27  28  29  30  31  32  33
f    5   3   7   1   1   2   4  10   4   1   1   1
Since the frequency distribution organizes the data set, finding the three measures of central
tendency is not much more difficult for S than it was for V, even though
n(S) > n(V). Note that n(S) = ∑f = 5 + 3 + 7 + 1 + 1 + 2 + 4 + 10 + 4 + 1 + 1 + 1 = 40. To find the
mode, select the data value with the greatest frequency, which is 29. To find the median, start by
calculating its position: (40 + 1)/2 = 20.5. Since the position of the median is 20.5, the median
equals the average of the 20th and 21st values in the data set arranged in ascending order:
(28 + 28)/2 = 28. To find the arithmetic mean, calculate the ratio of the sum of the data points
to the number of data points as below.
$$\bar{X} = \frac{\sum_{i=1}^{n} [f \cdot X_i]}{\sum f} = \frac{5 \cdot 22 + 3 \cdot 23 + 7 \cdot 24 + 25 + 26 + 2 \cdot 27 + 4 \cdot 28 + 10 \cdot 29 + 4 \cdot 30 + 31 + 32 + 33}{40} = 26.75$$
In summary, for the given data set S, we have the three measures of central tendency:
mean = 26.75, median = 28, & mode = 29.
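The same three measures can be computed directly from a frequency table. The sketch below assumes the values are listed in ascending order and that the total count is even, as it is for S. All variable names are illustrative.

```python
# Frequency distribution for data set S (values in ascending order).
values = [22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33]
freqs = [5, 3, 7, 1, 1, 2, 4, 10, 4, 1, 1, 1]

n = sum(freqs)                                        # total number of scores
mean_s = sum(f * x for x, f in zip(values, freqs)) / n
mode_s = values[freqs.index(max(freqs))]              # value with the greatest frequency

# Expand the table into the full ordered data set to read off the middle scores.
expanded = [x for x, f in zip(values, freqs) for _ in range(f)]
median_s = (expanded[n // 2 - 1] + expanded[n // 2]) / 2   # 20th and 21st scores

print(n, mean_s, median_s, mode_s)
```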
Instruction: Measures of Dispersion
This portion of the lecture discusses three statistics of numerical data sets called
measures of dispersion. Consider the two samples below each with the same mean and median.
A = {47, 50, 53}
B = {0, 50, 100}
For both sets, $\bar{X} = 50$. For sample A, the mean is a good estimate for any score found in the set,
but the mean is not a good estimate for any score found in sample B. The scores in sample B are
spread further apart than those in sample A. Sample B is said to have greater variability.
Statistics that measure the magnitude of variability are called measures of dispersion.
A measure of dispersion is a statistic that assigns a numerical value
to describe the variability of a data set. Variability refers to the
spread of a data set. A measure of dispersion measures how spread
out or how widely dispersed a set of data is.
One particular measure of dispersion is the range. The range, defined below, is the
distance between the largest and smallest values in a sample.
The range is the difference of the largest and smallest values in a sample.
The range of set A above equals six because 53 − 47 = 6. The range of set B above equals 100
because 100 − 0 = 100.
A second measure of dispersion is the sample variance. To discuss variance, we must
first discuss a deviation and the squares of deviations.
Deviation equals distance from the mean. A deviation score equals $X_i - \bar{X}$.
According to the definition above, the deviations of scores below the mean are negative, and the
deviations of scores above the mean are positive. The table below shows the deviations for set
A.
X     $X_i - \bar{X}$
47    –3
50    0
53    3
Scores below the mean have negative deviations. Scores above the mean have positive
deviations. Scores equal to the mean have zero deviations. While deviations can be positive or
negative depending on the position of the respective score, the squares of deviations, $(X_i - \bar{X})^2$,
are never negative.
To calculate sample variance, we must calculate the deviation of each score in the sample
as above as well as the square of each deviation as below.
X     $X_i - \bar{X}$    $(X_i - \bar{X})^2$
47    –3                 9
50    0                  0
53    3                  9
The population variance equals the mean of the squares of the deviations.
The sample variance equals an estimate of the population variance given by the formula in the
box below.
The sample variance, denoted var or $S^2$, equals the ratio:
$$\text{var} = S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}.$$
The sample variance for sample A = {47, 50, 53} is calculated below.
$$\text{var} = S^2 = \frac{9 + 0 + 9}{3 - 1} = \frac{18}{2} = 9$$
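The deviation, squared-deviation, and variance steps can be traced in a few lines. This is a sketch under the sample formula above (denominator n − 1); the function name `sample_variance` is hypothetical.

```python
def sample_variance(scores):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(scores)
    mean = sum(scores) / n
    deviations = [x - mean for x in scores]      # X_i minus the mean
    squares = [d ** 2 for d in deviations]       # squared deviations
    return sum(squares) / (n - 1)

A = [47, 50, 53]
print(sample_variance(A))
```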
The third measure of dispersion is the standard deviation, which equals the square root of the
variance.
The sample standard deviation, denoted S, is a distance from the mean that
equals the square root of the variance:
$$S = \sqrt{S^2} = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}.$$
The sample standard deviation measures the typical or standard distance of
scores in the sample from the mean.
According to the definition above, widely dispersed samples have large standard deviations.
Indeed, the larger the sample's standard deviation, the more widely dispersed are the elements in
the sample. The standard deviation of sample A = {47, 50, 53} is given here: $S = \sqrt{9} = 3$.
The standard deviation has two key characteristics. First, adding a constant to each score
in a sample will not change the standard deviation. Thus, if A* = {46, 49, 52} , then S = 3.
Second, multiplying each score by a constant causes the standard deviation to be multiplied by
the same constant. Thus, if A* = {94, 100, 106} , then S = 6.
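Both characteristics can be verified for sample A. The sketch below is illustrative only; `sample_std` is a hypothetical helper that follows the sample formula with denominator n − 1.

```python
import math

def sample_std(scores):
    """Square root of the sample variance (denominator n - 1)."""
    n = len(scores)
    mean = sum(scores) / n
    return math.sqrt(sum((x - mean) ** 2 for x in scores) / (n - 1))

A = [47, 50, 53]
shifted = [x - 1 for x in A]   # {46, 49, 52}: a constant added to each score
doubled = [2 * x for x in A]   # {94, 100, 106}: each score multiplied by 2

print(sample_std(A), sample_std(shifted), sample_std(doubled))
```

Shifting leaves the spread unchanged, while doubling every score doubles the standard deviation.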
A fourth measure of dispersion is the coefficient of variation, a relative measure always
expressed as a percentage. The coefficient of variation measures the scatter in the data relative
to the mean.
The coefficient of variation, denoted CV, equals the quotient of the
standard deviation and the mean expressed as a percent:
$$CV = \left(\frac{S}{\bar{X}}\right) \cdot 100\%.$$
Reconsider our sample from above, A = {47, 50, 53}, for which S = 3 and $\bar{X} = 50$. The
coefficient of variation for sample A is 6% as calculated below.
$$CV = \left(\frac{3}{50}\right) = 0.06 = 6\%$$
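A minimal sketch of the same calculation, assuming the standard library's `statistics.stdev` (which uses the n − 1 denominator) stands in for S. The function name is hypothetical.

```python
import statistics

def coefficient_of_variation(scores):
    """Sample standard deviation relative to the mean, as a percent."""
    return 100 * statistics.stdev(scores) / statistics.mean(scores)

A = [47, 50, 53]
B = [0, 50, 100]
print(coefficient_of_variation(A))   # small scatter relative to the mean
print(coefficient_of_variation(B))   # much larger scatter, same mean
```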
Instruction: Distributions
This lecture discusses types of distributions plus an interesting use of the standard
deviation. Lecture 4.2 discussed frequency distribution graphs. This lecture discusses some
general types of shapes of frequency distributions.
One general shape of frequency distribution graphs includes symmetrical distributions.
With symmetrical distributions, a vertical line can be drawn through the middle in such a way
that one side of the distribution is an exact mirror image of the other as shown below in Figures
A and B. Figure B demonstrates a bimodal symmetrical distribution.
[Figure A: symmetrical distribution; vertical axis f (frequency)]
[Figure B: bimodal symmetrical distribution; vertical axis f (frequency)]
Here the term bimodal means that the two most frequent data points (or classes) have the same frequency,
which is how we will use the term bimodal in this course. Bimodal can also refer to non-symmetrical
distributions with two distinct peaks on either side of the center of the distribution.
Another general shape of frequency distribution graphs includes skewed distributions.
Skewed distributions tend to form graphs that rise up toward one end of the range of scores.
These distributions often taper off gradually at the opposite end. The tapering end is called the
tail. Figure C below demonstrates a positively skewed distribution. The modifier "positively"
derives from the fact that the tail points in the positive direction. Figure D below demonstrates a
negatively skewed distribution.
[Figure C: positively skewed distribution; vertical axis f (frequency)]
[Figure D: negatively skewed distribution; vertical axis f (frequency)]
Finally, a frequency distribution graph can be uniform (or rectangular). Uniform
distributions form a rectangle because all the objects (or classes) have equal frequencies. Figure
E below demonstrates a uniform distribution.
[Figure E: uniform distribution; vertical axis f (frequency)]
It is easy to imagine how challenging it would be to construct a frequency distribution
graph for a population because populations tend to be large data sets and recording
measurements and frequencies for the entire group would be cumbersome. It is sometimes
easier, however, to construct relative frequency graphs for populations. Using statistical
procedures applied to samples, researchers can sometimes infer information about the relative
frequencies of populations. In such cases, the distributions are outlined with smooth curves.
Figure P below displays a symmetrical relative frequency distribution for a population.
[Figure P: symmetrical relative frequency distribution for a population]
Instruction: Chebyshev's Theorem
The Russian statistician Pafnuti Chebyshev discovered a useful fact given in the box
below that applies to all distributions regardless of their shape.
Chebyshev's Theorem states that the fraction of any data set lying within k
standard deviations of the mean, where k > 1, is at least:
$$\frac{k^2 - 1}{k^2}.$$
This theorem tells us that at least 75% of the scores in a data set lie within two standard
deviations of the mean as calculated below.
$$\frac{2^2 - 1}{2^2} = \frac{3}{4} = 75\%$$
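The bound is easy to tabulate for several values of k. The function name `chebyshev_lower_bound` below is an assumption made for this sketch.

```python
def chebyshev_lower_bound(k):
    """Minimum fraction of any data set within k standard deviations
    of the mean, valid for k > 1."""
    if k <= 1:
        raise ValueError("Chebyshev's Theorem requires k > 1")
    return (k ** 2 - 1) / k ** 2

for k in (2, 3, 4):
    print(k, chebyshev_lower_bound(k))
```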
Instruction: Measures of Position
This lecture discusses measures of position. The median discussed in Lecture 4.3 is an
example of a measure of position. Since the median is the middle number (or average of the two
middle numbers), the median is a measure of position that reveals which score occupies the
center of the distribution.
A measure of position is a statistic that reveals a score's position in the
distribution of a data set.
Percentiles are measures of position that reveal what percent of the scores fall
below a given score. Percentiles divide the distribution of the data set into one hundred parts.
A percentile is a numerical value assigned to a given score that indicates what
percent of the scores in the data set fall below the given score.
According to the definition above, a percentile is a position. The bottom percentile is zero. The
top percentile is 99. The score with zero percentile is the lowest score. The score with the 99th
percentile is the greatest score. To calculate the percentile of $X_i$, divide the number of scores
less than $X_i$ by n, where n is the total number of scores in the data set, and convert to a percent.
To find the position of a score that occupies a given percentile, multiply the decimal equivalent
of the percentile by n and move up to the next integer.
Consider data set A = {22, 21, 14, 20, 19, 27, 17, 22, 26, 24}. Arranging the data set in
ascending order reveals that eight scores fall below 26:
A = {14, 17, 19, 20, 21, 22, 22, 24, | 26, 27}
Accordingly, a score of 26 corresponds to the 80th percentile as calculated below.
$$\frac{8}{10} = 0.8 = 80\%$$
To find the position of the score in the 40th percentile for set A, we multiply 40% by the number
of scores in the set to get 4, then move to the next integer, 5.
0.4 × 10 = 4 ⇒ 5
Hence, the fifth score in the ordered data set occupies the 40th percentile, so 21 corresponds to
the 40th percentile.
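Both percentile calculations can be sketched in code. The function names are hypothetical, and the position rule assumes a whole-number percentile from 0 to 99 so that integer arithmetic can be used.

```python
def percentile_of(score, scores):
    """Percent of scores in the data set that are less than the given score."""
    below = sum(1 for x in scores if x < score)
    return 100 * below / len(scores)

def position_at_percentile(p, n):
    """1-based position for percentile p: p% of n, moved up to the next integer."""
    return (p * n) // 100 + 1

A = [22, 21, 14, 20, 19, 27, 17, 22, 26, 24]
ordered = sorted(A)

print(percentile_of(26, A))                  # eight of ten scores fall below 26
position = position_at_percentile(40, len(A))
print(position, ordered[position - 1])       # position and score at the 40th percentile
```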
Another statistical measure of position is the quartile defined in the box below.
A quartile divides the distribution into quarters. The quartiles, denoted
Q1 , Q2 , and Q3 , are the three numbers that occupy the 25th, 50th, and 75th
percentiles respectively.
The second quartile, Q2 , equals the median. The first quartile, Q1 , equals the median of the
scores that fall below Q2 . The third quartile, Q3 , equals the median of the scores that fall above
Q2 .
For a given sample, the three quartiles together with the lowest value (zero percentile)
and the greatest value (99th percentile) act as a set of five numbers called the five number
summary of a data set. These five numbers are used to create a box plot (or box-and-whisker
plot) defined below.
A box plot is a graphical display that uses a rectangle and two line
segments to summarize a data set. The entire display hovers over a
number line. The rectangle extends from the first quartile to the third
quartile and is divided into two parts by a vertical line segment drawn over
the median (second quartile). From the left and right sides of the rectangle,
two line segments called whiskers extend to the least and greatest scores
respectively.
Consider the data set B = {1, 2, 3, 6, 6, 7, 8, 8, 8, 9, 9, 11, 11, 12, 17}. The display below is a
box plot representing data set B.
[Box plot of data set B with Q1, Q2, and Q3 marked]
The box plot conveys the central tendency, the location of the middle half of the data, the
dispersion, and the skewness. The location of the median shows the central tendency. The
rectangle reveals the middle half of the data. The reach of the whiskers exposes the range, and
the symmetry or non-symmetry of the box and whiskers displays the skewness.
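The five numbers behind a box plot can be computed as sketched below. One assumption is made explicit here: "the scores that fall below Q2" is read as the lower half of the ordered data by position (leaving out the middle score when n is odd), and likewise for the upper half. Other textbooks and software use slightly different quartile rules, so results may differ.

```python
import statistics

def five_number_summary(scores):
    """Return (minimum, Q1, Q2, Q3, maximum) using the median-of-halves rule."""
    ordered = sorted(scores)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]        # scores positioned below the median
    upper = ordered[n - half:]    # scores positioned above the median
    return (
        ordered[0],
        statistics.median(lower),
        statistics.median(ordered),
        statistics.median(upper),
        ordered[-1],
    )

B = [1, 2, 3, 6, 6, 7, 8, 8, 8, 9, 9, 11, 11, 12, 17]
print(five_number_summary(B))
```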
The most important measure of position is the Z-score defined below.
A Z-score is a numerical value assigned to a raw score that measures the
distance between the raw score and the mean in standard deviations. For a
given sample A with mean $\bar{X}$ and standard deviation S, the Z-score of
some raw score $X_i$ in A is given by
$$Z_i = \frac{X_i - \bar{X}}{S}.$$
Since a Z-score is a ratio of a raw score's deviation from the mean to the standard deviation,
Z-scores assigned to raw scores below the mean will be negative while those assigned to raw scores
above the mean will be positive.
If we recall Chebyshev's Theorem, we see the significance of a Z-score. Chebyshev's
theorem stated that at least $(k^2 - 1)/k^2$ of the data of any distribution falls within k standard
deviations of the mean. Since a Z-score equals a number of standard deviations, a raw score's Z-score can be
substituted for k. Using Chebyshev's Theorem with k = 4, we know that at least 15/16 = 93.75% of the data falls
between the data point with a Z-score of –4 and the data point with a Z-score of +4. As a
consequence, we note that any data point with a Z-score smaller than –4 or greater than +4 is
fairly atypical of the data set. Imagine a doctor examining a child of a certain age whose weight
has a Z-score of –4.2. The doctor knows immediately that most children in a comparable
population or sample have a greater weight. Accordingly, the doctor has statistical evidence to
warrant expensive medical tests to see if there is some underlying medical cause for the child's
low weight. Chebyshev's Theorem applies to any distribution of data. A future lecture discusses
a particular type of distribution that imbues Z-scores with even more significance.
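The Z-score formula can be applied to every score in a sample at once. This sketch uses the small sample {47, 50, 53} from the dispersion section purely as an illustration and relies on `statistics.stdev` for the sample standard deviation; the function name `z_scores` is hypothetical.

```python
import statistics

def z_scores(scores):
    """Deviation of each score from the mean, in units of the
    sample standard deviation."""
    mean = statistics.mean(scores)
    s = statistics.stdev(scores)      # sample standard deviation (n - 1 denominator)
    return [(x - mean) / s for x in scores]

sample = [47, 50, 53]
print(z_scores(sample))   # negative below the mean, zero at it, positive above
```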
Instruction: Measuring the Strength of Linear Relationships
A scatter plot like the one shown below visually examines the relationship between two
numerical variables.
[Scatter plot: Blood Alcohol Concentration (mg%) on the vertical axis, from 0 to 90, against t (hours) on the horizontal axis, from 0 to 4]
In the scatter plot above, there appears to be a linear relationship between hours elapsed and the
alcohol concentration in a patient's bloodstream. The sample covariance measures the strength
of such linear relationships.
The sample covariance, denoted cov ( X , Y ) , measures the strength of the
linear relationship between variables X and Y and is given by
$$\text{cov}(X, Y) = \frac{\sum_{i=1}^{n} [(X_i - \bar{X})(Y_i - \bar{Y})]}{n - 1}.$$
The covariance is limited in its usefulness because it does not provide a relative strength of the
relationship it purports to measure as does the coefficient of correlation.
The coefficient of correlation, denoted r, measures the relative strength of a linear
relationship between two numerical variables. The values of the coefficient of correlation range
from negative one to positive one. If r = −1 , the relationship is said to be a perfect negative
correlation and the points in the scatter diagram will all fall in a straight line with a negative
slope. If r = 1 , the relationship is said to be a perfect positive correlation and the points in the
scatter diagram will all fall in a straight line with a positive slope. If r = 0, there is no linear
relationship between the two variables. In general, the smaller the difference between one and |r|, the
stronger the linear relationship between the two variables.
The sample coefficient of correlation, denoted r, measures the relative
strength of the linear relationship between variables X and Y and is given
by
$$r = \frac{\text{cov}(X, Y)}{S_X S_Y}$$
where
$$\text{cov}(X, Y) = \frac{\sum_{i=1}^{n} [(X_i - \bar{X})(Y_i - \bar{Y})]}{n - 1},$$
$$S_X = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}, \text{ and } S_Y = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n - 1}}.$$
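The covariance and correlation formulas translate directly into code. The sketch below uses made-up, perfectly linear data to show the two extreme cases; all names and data values are assumptions for illustration.

```python
import math

def sample_covariance(xs, ys):
    """Sum of products of paired deviations, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def sample_std(values):
    """Sample standard deviation: square root of a variable's covariance with itself."""
    return math.sqrt(sample_covariance(values, values))

def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    return sample_covariance(xs, ys) / (sample_std(xs) * sample_std(ys))

hours = [1, 2, 3, 4]      # hypothetical X values
rising = [2, 4, 6, 8]     # points on a line with positive slope
falling = [8, 6, 4, 2]    # points on a line with negative slope

print(correlation(hours, rising))    # approximately 1
print(correlation(hours, falling))   # approximately -1
```

Floating-point rounding may leave the results a hair away from exactly 1 and −1.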
Assignment 2
Problems
#1
Find three measures of central tendency and three measures of dispersion for the data displayed
in the frequency distribution below. Assume the data belongs to a sample.
Value    Frequency
2        5
4        1
6        8
8        4
#2
Two friends, Frick and Frack, who take different history classes, took their midterm exams on
the same day. Frick’s score was 86 while Frack’s score was 78. Use a complete sentence to
identify which student did relatively better, given the class data shown below.
         Class mean    Class standard deviation
Frick    73            8
Frack    69            5
#3
Consider a distribution where the mean is seventy and the standard deviation is eight. At least
what fraction of the values are between 54 and 86?
#4
Construct a box plot for the data set below.
46  59  63  66  69  75  51  59  63  66
69  79  62  52  59  64  67  70  79  55
61  64  67  70  83  66  56  62  65  68
56  61  64  67  71  88  71  58  68  72
#5
An efficiency expert has developed the JSI, a test measuring job satisfaction of civil service
clerks. The following information reflects data collected from a random sample of ten civil
service clerks. Use a complete sentence and a relevant statistic to discuss the effectiveness of
the JSI.
Job Satisfaction Index (JSI)      92  32  56  20  72  16  56  76  80  48
Absences for Year JSI was taken    8  14  10  14   6  17   8  12   7  15