Sociology 541
Measures of Central Tendency and Dispersion
Describing the Center of a Data Set
Although a tabular or graphical summary of the data is useful, any further analysis of the data
requires the properties of the data to be summarized numerically. One feature of the distribution
of our data that we would like to describe is its center. We would like a single value to represent
this.
We will use the following data sets to illustrate measures of center.
DATA SET 1 The following are test scores from a class of 20 students:
96 95 93 89 83 83 81 77 77 77 71 71 70 68 68 65 57 55 48 42
DATA SET 2 The same 20 test scores are arranged in a grouped frequency distribution. To
illustrate how to obtain measures of central tendency for grouped data, we'll assume that all we
know about the data is what is presented in the table. We no longer know the original scores.
The frequency distribution for these test scores is:
Class    Mid-point   Frequency   Cumulative frequency
90-99    94.5        3           20
80-89    84.5        4           17
70-79    74.5        6           13
60-69    64.5        3           7
50-59    54.5        2           4
40-49    44.5        2           2
Notation:
Σ    Summation sign
n    Total number of sample observations
X    A variable or a measured characteristic
Xi   Any one value for a variable
MODE
The mode of the sample is the value of the variable having the greatest frequency.
Example: Obtain the mode for Data Set 1
77
For a grouped frequency distribution with equal class intervals, the modal class is the class
having the greatest frequency, and the mode is taken to be the midpoint of the modal class.
Example: Obtain the mode for Data Set 2
Modal interval is 70-79. For a more exact number, we say that the mode is the midpoint of the
interval or 74.5
Properties of Mode
1. Appropriate for all types of data.
2. Not necessarily unique.
3. Note that the mode doesn't have to be near the center of the distribution.
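The two mode calculations above can be sketched in Python. This is a minimal illustration rather than part of the original notes: the function names `modes` and `grouped_mode` are made up for the example, and the grouped version assumes equal class widths and a single modal class.

```python
from collections import Counter

# Data Set 1: the 20 raw test scores from the notes.
scores = [96, 95, 93, 89, 83, 83, 81, 77, 77, 77,
          71, 71, 70, 68, 68, 65, 57, 55, 48, 42]

def modes(values):
    """Return every value tied for the highest frequency (the mode need not be unique)."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

# Data Set 2: (lower limit, upper limit, frequency) for each class interval.
classes = [(90, 99, 3), (80, 89, 4), (70, 79, 6),
           (60, 69, 3), (50, 59, 2), (40, 49, 2)]

def grouped_mode(table):
    """Midpoint of the modal class (hypothetical helper; assumes equal class widths)."""
    low, high, _ = max(table, key=lambda row: row[2])
    return (low + high) / 2

print(modes(scores))          # [77]
print(grouped_mode(classes))  # 74.5
```

Returning a list from `modes` reflects the second property above: a data set such as 1, 1, 2, 2 has two modes.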
MEDIAN
The median of a set of observations is the value of the variable such that half the values are less
than the median and half are greater than the median. When observations are ordered from lowest
to highest, the median is the number that divides the sample so that an equal number of cases fall
above and below it.
For discrete observations, first order the observations from smallest to largest. If the
number of observations, n, is odd, the median is the middle observation, at position
(n+1)/2. If n is even, the median is the average of the observations at positions
n/2 and (n/2)+1 in the ordered arrangement.
Example: Obtain the median for Data Set 1
10th and 11th scores are 77 and 71. Average of these two scores is 74.
For a grouped frequency distribution, the median lies within the class (the median class)
containing the value with a cumulative frequency of (n+1)/2. It can be determined from a
cumulative frequency polygon or numerically by interpolation within the median class.
Example: Obtain the median for Data Set 2
We're looking for the interval that contains the middle observation. Looking at cumulative
frequency numbers, that would be interval 70-79 (critical interval).
Whereabouts in the interval is the median or that score in the 50th percentile?
Md = li + i × [((n+1)/2 – fm) / fi]
Md = median
li = exact lower limit of interval i
i = width of interval i
(n+1)/2 = position of the median, which identifies the interval in which the median is located
fm = number of observations in intervals below interval i
fi = number of observations in interval i
Median: 69.5 + [(10.5 – 7)/6]*10 = 75.33
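As a rough check, both median calculations can be reproduced in Python. The sketch below follows the interpolation formula exactly as given in these notes, with (n+1)/2 as the median position; `grouped_median` and the layout of `table` are assumptions made for the example.

```python
import statistics

# Data Set 1: raw scores. statistics.median averages the two middle values when n is even.
scores = [96, 95, 93, 89, 83, 83, 81, 77, 77, 77,
          71, 71, 70, 68, 68, 65, 57, 55, 48, 42]
print(statistics.median(scores))  # 74.0

# Data Set 2: (exact lower limit, frequency) per class, ordered from the lowest class up.
table = [(39.5, 2), (49.5, 2), (59.5, 3), (69.5, 6), (79.5, 4), (89.5, 3)]

def grouped_median(table, width):
    """Interpolate the median inside the median class: li + i * (((n+1)/2 - fm) / fi)."""
    n = sum(freq for _, freq in table)
    target = (n + 1) / 2   # position of the median, as in the notes
    below = 0              # fm: observations in the classes below the current one
    for lower, freq in table:
        if below + freq >= target:
            return lower + width * (target - below) / freq
        below += freq
    raise ValueError("table has no observations")

print(round(grouped_median(table, width=10), 2))  # 75.33
```

The grouped result (75.33) differs from the raw-data median (74) because interpolation assumes scores are spread evenly within the 70-79 class.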
Properties of Median
1. Appropriate for interval level data and ordinal data.
2. Median is not affected by extreme scores.
3. Median is insensitive to distances of measurements from the middle.
4. The median is problematic with binary data.
Quartiles and other Percentiles
Percentile: pth percentile is a number such that p% of the scores fall below it and (100-p)% fall
above it.
Lower quartile: 25th percentile - median for observations that fall below median
Upper quartile: 75th percentile - median for observations that fall above median
MEAN
The sample mean of a sample of n observations x1, x2, …, xn, denoted by x̄, is the sum of the n
observations divided by n if the data are not grouped.
x̄ = (sum of all observations in the sample) / (number of observations in the sample)
  = (x1 + x2 + … + xn) / n = (Σ x) / n
where n is the number of observations, x1 is the first sample observation, x2 is the second sample
observation, …, xn is the n-th (last) sample observation.
The population mean, denoted by µ, is the average of all x values in the entire population.
Example: Obtain the sample mean for Data Set 1
Sum up all scores and divide by 20: 1466/20 = 73.3
Weighted Mean
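The weighted mean combines the means of two groups, weighting each group mean by its group size:
x̄ = (n1 x̄1 + n2 x̄2) / (n1 + n2)
Example:
Class 1: n1 = 20; x̄1 = 73.3
Class 2: n2 = 15; x̄2 = 81
x̄ = [20(73.3) + 15(81)] / [20 + 15] = 76.6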
Mean for Grouped Frequency Distribution
Example: Obtain the mean for Data Set 2
Weight midpoints of class intervals by the frequency of observations in each interval.
[3(94.5)+4(84.5)+6(74.5)+3(64.5)+2(54.5)+2(44.5)] / 20 = 73
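The three mean calculations in this section share one pattern: a sum of values, each multiplied by a weight, divided by the total weight. A minimal Python sketch of that idea follows; `weighted_mean` is a hypothetical helper name, not something from the notes.

```python
# Data Set 1: ungrouped mean is the plain sum divided by n.
scores = [96, 95, 93, 89, 83, 83, 81, 77, 77, 77,
          71, 71, 70, 68, 68, 65, 57, 55, 48, 42]
mean = sum(scores) / len(scores)  # 1466 / 20

def weighted_mean(values, weights):
    """Sum of value * weight divided by the total weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Weighted mean of two class means, weighted by class size.
combined = weighted_mean([73.3, 81], [20, 15])

# Data Set 2: class midpoints weighted by class frequencies.
midpoints = [94.5, 84.5, 74.5, 64.5, 54.5, 44.5]
frequencies = [3, 4, 6, 3, 2, 2]
grouped = weighted_mean(midpoints, frequencies)

print(round(mean, 1), round(combined, 1), grouped)
```

The grouped mean (73) is close to, but not the same as, the raw mean (73.3), because every score in a class is treated as if it sat at the class midpoint.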
Properties of Mean
1. Appropriate only for interval level data and above.
2. Mean is very sensitive to extreme scores, called outliers. An additional measurement at outer
points would pull up or down the mean. So mean may not be representative of the
measurements in the sample (particularly with small samples).
3. Mean is the center of gravity or point of balance for frequency distribution.
4. The sum of the deviations from the mean equals 0.
The deviation indicates the distance and direction of any raw score from the mean.
Deviation = X - x̄
Sum of the deviations = Σ (X - x̄) = 0
Example
1, 3, 4, 4, 9, 15
x̄ = 6
Deviations: X - x̄
(1-6) = -5
(3-6) = -3
(4-6) = -2
(4-6) = -2
(9-6) = 3
(15-6) = 9
Σ (X - x̄) = (-5) + (-3) + (-2) + (-2) + 3 + 9 = 0
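The same deviation example can be run as a short Python check. This is only an illustration of the property; with non-integer data the sum may come out as a tiny floating-point value rather than exactly 0.

```python
data = [1, 3, 4, 4, 9, 15]
mean = sum(data) / len(data)            # 36 / 6 = 6.0

# Signed distance of each raw score from the mean.
deviations = [x - mean for x in data]

print(deviations)       # [-5.0, -3.0, -2.0, -2.0, 3.0, 9.0]
print(sum(deviations))  # 0.0
```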
Shape of Distribution
Mean, mode and median are identical for a unimodal, symmetric distribution, such as a bell-shaped
distribution.
The mean and median are identical whenever the histogram is symmetric, even if it is not
unimodal (e.g., a symmetric bimodal distribution).
If the histogram is unimodal with a long right hand tail (positively skewed) the mean lies above
the median.
1, 2, 3, 4, 100
Mean = 22; Median = 3
If the histogram is unimodal with a long left hand tail (negatively skewed), then the mean is
smaller than the median.
1, 97, 98, 99, 100
Mean = 79 Median = 98
The mean is influenced by extreme scores, so you should be cautious about using the mean with a
highly skewed distribution. The median is not affected much, if at all, by changes in extreme scores.
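A small Python sketch makes the contrast concrete using the two skewed examples above. The third data set, where the extreme score 100 is replaced by 1000, is an added illustration and not from the notes.

```python
import statistics

right_skewed = [1, 2, 3, 4, 100]
left_skewed = [1, 97, 98, 99, 100]

# Long right tail: mean above median. Long left tail: mean below median.
print(statistics.mean(right_skewed), statistics.median(right_skewed))
print(statistics.mean(left_skewed), statistics.median(left_skewed))

# Making the extreme score more extreme moves the mean but leaves the median alone.
more_extreme = [1, 2, 3, 4, 1000]
print(statistics.mean(more_extreme), statistics.median(more_extreme))
```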
With a bi-modal distribution, it's best to characterize the distribution by both modes. Using
median or mean would obscure important features of the distribution.
In summary, measures of center (also known as measures of location) describe the center of a
data set, or the location of a ‘typical value’, using a single value. There is no ‘best’ measure of
location, and the different measures of location do not measure the same thing. Each measure of
location summarizes only one aspect of a data set and they should not be looked at in isolation.
Describing Variability in a Data Set
Besides wanting to know the center of the data, we are interested in how far individual values in
the sample are from this center.
A measure of variability (or measure of dispersion) is used to describe how the data are spread
about its center. A small value indicates that the data are grouped closely together while a large
value indicates that the data are spread out. Measures of variability can be used to compare
several distributions.
To illustrate measures of dispersion, we’ll use the following two data sets. The information in
these data sets represents test scores for individuals in each of the two classes.
CLASS 1
10 8 7 6 6 5
Mode=6; Median=6.5; Mean=7
CLASS 2
20 18 14 13 9 4
Median=13.5; Mean=13
We'll consider three different measures of dispersion:
Range (R)
Variance (Sample = s²; Population = σ²)
Standard Deviation (Sample = s; Population = σ)
Range (R)
Provides a quick but rough measure of variability.
Difference between the minimum (low L) score and maximum (high H) score.
R=H-L
Example:
Class 1: 10 – 5 = 5
Class 2: 20 – 4 = 16
Properties of range
1. Sensitive to sample size
2. Unstable
Interquartile Range
A measure of variation based on the quartiles of a distribution.
IQR = Q3 – Q1.
Q3 refers to the upper quartile or the score in the 75th percentile.
Q1 refers to the lower quartile or the score in the 25th percentile.
Semi-interquartile range = (Q3 – Q1) / 2 = half the difference between the third and first
quartiles.
Properties of IQR
1. Not sensitive to extreme outlying observations, unlike the ordinary range.
2. IQR increases as variability increases.
3. IQR can mask real differences in a distribution.
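The range and interquartile range for the two classes can be sketched in Python. The quartile rule used here is the one stated earlier in these notes (the lower and upper quartiles are the medians of the observations below and above the median). That choice is an assumption for the example: statistical packages often interpolate between observations and may report different quartiles for the same data. The function names are hypothetical.

```python
import statistics

class1 = [10, 8, 7, 6, 6, 5]
class2 = [20, 18, 14, 13, 9, 4]

def value_range(values):
    """R = H - L."""
    return max(values) - min(values)

def quartiles(values):
    """Q1 and Q3 as medians of the lower and upper halves (middle value excluded if n is odd)."""
    ordered = sorted(values)
    n = len(ordered)
    lower_half = ordered[: n // 2]
    upper_half = ordered[(n + 1) // 2:]
    return statistics.median(lower_half), statistics.median(upper_half)

def iqr(values):
    q1, q3 = quartiles(values)
    return q3 - q1

print(value_range(class1), value_range(class2))  # 5 16
print(quartiles(class1), iqr(class1))            # (6, 8) 2
print(quartiles(class2), iqr(class2))            # (9, 18) 9
```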
Other measures of variation are based on the deviations of the data from a measure of central
tendency, usually their mean.
Review of summation notation
Suppose c and d are constants.
1. Σ cxi = cx1 + cx2 + ... + cxn = c(x1 + x2 + ... + xn) = cΣ xi
2. Σ (xi + yi) = Σ xi + Σ yi
3. Σ c = c + c + c + ... + c = nc
4. Σ (c + dxi) = Σ c + Σ dxi = nc + dΣ xi
5. (Σ xi) + c ≠ Σ (xi + c) in general, since Σ (xi + c) = Σ xi + nc
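These summation rules can be checked numerically. The values of x, y, c and d below are arbitrary numbers chosen for illustration only.

```python
x = [2, 5, 7]
y = [1, 4, 6]
c, d = 3, 2
n = len(x)

rule1 = sum(c * xi for xi in x) == c * sum(x)
rule2 = sum(xi + yi for xi, yi in zip(x, y)) == sum(x) + sum(y)
rule3 = sum(c for _ in x) == n * c
rule4 = sum(c + d * xi for xi in x) == n * c + d * sum(x)

# Rule 5: adding c once after summing differs from adding c to every term.
outside = sum(x) + c               # 14 + 3
inside = sum(xi + c for xi in x)   # 14 + 3 * 3

print(rule1, rule2, rule3, rule4)  # True True True True
print(outside, inside)             # 17 23
```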
Recall that the mean is considered to be the center of gravity for a set of observations.
Σ (X - x̄) = 0
The deviation of the ith observation xi from the sample mean x̄ is (xi - x̄), the difference between them.
The deviation is positive when an observation is greater than the sample mean.
The deviation is negative when an observation is less than the sample mean.
To obtain the mean deviation, we would sum up all the individual deviations from the mean and
divide by n, the sample size.
1/n Σ (xi - x̄) = 0
Example
Class 1            Class 2
10-7 =  3          20-13 =  7
 8-7 =  1          18-13 =  5
 7-7 =  0          14-13 =  1
 6-7 = -1          13-13 =  0
 6-7 = -1           9-13 = -4
 5-7 = -2           4-13 = -9
Because mean is center of gravity, the sum of the positive deviations equals the sum of the
negative deviations and sum of all deviations about mean is 0.
There are several ways we can avoid the sum of deviations equaling 0.
One is to take the absolute value of the deviations.
Mean Absolute Deviation (MAD)
MAD = 1/n Σ |xi - x̄|
Example:
Class 1
(3+1+0+1+1+2)/6 = 8/6 = 1.33
Class 2
(7+5+1+0+4+9)/6 = 26/6 = 4.33
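The mean absolute deviation is the sum of the absolute deviations divided by n. A minimal Python sketch follows, with `mean_absolute_deviation` as a hypothetical helper name.

```python
class1 = [10, 8, 7, 6, 6, 5]
class2 = [20, 18, 14, 13, 9, 4]

def mean_absolute_deviation(values):
    """Average of |x - mean| over all observations."""
    n = len(values)
    mean = sum(values) / n
    return sum(abs(x - mean) for x in values) / n

print(round(mean_absolute_deviation(class1), 2))  # 1.33  (8 / 6)
print(round(mean_absolute_deviation(class2), 2))  # 4.33  (26 / 6)
```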
Alternatively, we can take the square of the deviations from the mean to avoid the problem of the
sum of the deviations equaling zero.
Two measures of dispersion incorporate squares.
Variance
The sum of squared deviations is Σ (xi - x̄)².
To control for the number of scores involved, we divide this sum by the number of
observations.
Population Variance: σ² = 1/N Σ (xi - µ)², where N is the population size and µ is the population mean
Sample Variance: s² = 1/(n-1) Σ (xi - x̄)²
Note that for the sample variance we divide by (n-1) rather than n. This has advantages when we go
from a sample to the population. Samples are less likely to contain extreme values and therefore may
underestimate the amount of dispersion or variability in a set of measurements.
When we have information on the entire population, replace (n-1) with N, the actual population size,
and x̄ with µ.
Example
Class 1
s² = (9+1+0+1+1+4)/(6-1) = 16/5 = 3.2
Class 2
s² = (49+25+1+0+16+81)/(6-1) = 172/5 = 34.4
Properties of Variance
1. Approximates the average of the squared distances from the mean.
2. Units of measurement for the variance are the squares of those for the original data.
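The definitional formula for the variance can be written out in Python. The sketch below treats each class as a sample (dividing by n-1) and shows the divide-by-n version alongside it for comparison. The function names are made up for the example.

```python
class1 = [10, 8, 7, 6, 6, 5]
class2 = [20, 18, 14, 13, 9, 4]

def sum_of_squared_deviations(values):
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values)

def sample_variance(values):
    """s^2: squared deviations divided by n - 1."""
    return sum_of_squared_deviations(values) / (len(values) - 1)

def population_variance(values):
    """sigma^2: squared deviations divided by n, treating the data as the whole population."""
    return sum_of_squared_deviations(values) / len(values)

print(sample_variance(class1))                 # 3.2
print(sample_variance(class2))                 # 34.4
print(round(population_variance(class1), 2))   # 2.67
```

Dividing by n always gives the smaller figure, which is why it tends to understate variability when the data are only a sample.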
Standard Deviation
The square root of the variance is called the standard deviation.
Population Standard Deviation: σ = square root [1/N Σ (xi - µ)²], where N is the population size and µ is the population mean
Sample Standard Deviation: s = square root [1/(n-1) Σ (xi - x̄)²]
Example
Class 1:
s = √3.2 = 1.789
Class 2:
s = √34.4 = 5.865
Interpretation:
As a measure of variability, standard deviation tells us how much each score, in a set of scores,
on the average, varies from the mean. The greater the variability around the mean of a
distribution, the larger the standard deviation.
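Alternative way to calculate the variance and standard deviation using sum of squares
$$\text{Variance} = s^2 = \frac{\sum (X - \bar{X})^2}{n-1} = \frac{1}{n-1}\left[\sum X^2 - \frac{1}{n}\Big(\sum X\Big)^2\right]$$
$$\text{SD} = s = \sqrt{\frac{\sum (X - \bar{X})^2}{n-1}} = \sqrt{\frac{1}{n-1}\left[\sum X^2 - \frac{1}{n}\Big(\sum X\Big)^2\right]}$$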
Example:
Class 1
Variance
Obs      X      X²
1        10     100
2        8      64
3        7      49
4        6      36
5        6      36
6        5      25
Total    42     310
Variance = s² = (1/5) * [310 - (42)² / 6] = (1/5) * [310 - 294] = 3.2
Class 2
ΣX = 78; ΣX² = 1186
Variance = s² = (1/5) * [1186 - (78)² / 6] = (1/5) * [1186 - 1014] = 34.4
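The sum-of-squares shortcut can be checked in Python for both classes. The sketch needs only ΣX, ΣX² and n, which is the point of the computational formula. `variance_from_sums` is a hypothetical helper name.

```python
import math

class1 = [10, 8, 7, 6, 6, 5]
class2 = [20, 18, 14, 13, 9, 4]

def variance_from_sums(values):
    """s^2 = [sum(X^2) - (sum X)^2 / n] / (n - 1)."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)

for name, data in (("Class 1", class1), ("Class 2", class2)):
    var = variance_from_sums(data)
    print(name, sum(data), sum(x * x for x in data), round(var, 4), round(math.sqrt(var), 4))
# Class 1 42 310 3.2 1.7889
# Class 2 78 1186 34.4 5.8652
```

The shortcut is convenient by hand, but on a computer it can lose precision when the mean is large relative to the spread, so the deviation-based formula is usually the safer choice in code.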
Properties of Standard Deviation
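1. s ≥ 0.
2. The greater the variation, the larger the value of s.
3. s is generally less than the largest absolute deviation, max |xi - x̄|. (With the n-1 divisor, s can
slightly exceed it when all deviations are nearly equal in size, e.g., for the two scores 0 and 2.)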
4. Just as with mean, s can be affected by outliers, particularly for small data sets.
Obtaining this information from SPSS:
Analyze – Descriptive Statistics – Frequencies – Select Variable(s) – Statistics
Statistics
                        CLASS1     CLASS2
N      Valid            6          6
       Missing          0          0
Mean                    7.0000     13.0000
Median                  6.5000     13.5000
Mode                    6.00       4.00a
Std. Deviation          1.7889     5.8652
Variance                3.2000     34.4000
Range                   5.00       16.00
Minimum                 5.00       4.00
Maximum                 10.00      20.00
Percentiles    25       5.7500     7.7500
               50       6.5000     13.5000
               75       8.5000     18.5000
a. Multiple modes exist. The smallest value is shown
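For readers without SPSS, the same summary can be approximated with Python's standard `statistics` module. This is a sketch, not an SPSS equivalent: `describe` is a hypothetical helper. For these two data sets, Python's default "exclusive" quantile method happens to reproduce the percentile values in the output above; that is an observation about this example, not a claim about how SPSS computes percentiles in general.

```python
import statistics

class1 = [10, 8, 7, 6, 6, 5]
class2 = [20, 18, 14, 13, 9, 4]

def describe(values):
    """Collect the descriptive statistics shown in the SPSS table."""
    q1, q2, q3 = statistics.quantiles(values, n=4)  # default method="exclusive"
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        # When several values tie, report the smallest mode, as in the SPSS footnote.
        "mode": min(statistics.multimode(values)),
        "std_dev": round(statistics.stdev(values), 4),
        "variance": round(statistics.variance(values), 4),
        "range": max(values) - min(values),
        "minimum": min(values),
        "maximum": max(values),
        "percentiles": (q1, q2, q3),
    }

for name, data in (("CLASS1", class1), ("CLASS2", class2)):
    print(name, describe(data))
```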
Common graphical displays of dispersion within a data set.
Box Plot
SPSS: Graphs – Boxplot – Select summaries of separate variables and simple boxplot – define –
select variable you want to plot – Hit okay.
Graphical summary of both central tendency and variation of a data set.
Box contains central 50% of distribution, from lower quartile to upper quartile.
Median marked by line within box.
Lines from box are called whiskers. Extend to maximum and minimum values, unless there are
outliers.
[Figure: box plots of CLASS1 (N = 6) and CLASS2 (N = 6); vertical axis runs from 0 to 30]
Z SCORES
Sometimes we want to make comparisons between observations from different data sets. For
example, in our hypothetical classes, we may want to determine whether the student who received
a score of 6 in class 1 did better or worse than the student who received a score of 9 in class 2.
We are comparing two data sets with different means and standard deviations.
To do this, we convert the raw scores into “standard scores” (also known as z scores), using the
mean and standard deviation of each data set.
z = (X - x̄) / s
where X = particular observation
x̄ = mean of observations
s = standard deviation of observations
The z score describes how many standard deviations from the mean a score is located.
Example:
Class 1: X = 6; x̄ = 7; s = 1.79
z = (6-7)/1.79 = -0.559
Class 2: X = 9; x̄ = 13; s = 5.87
z = (9-13)/5.87 = -0.681
We conclude that the student in class 1 did slightly better than the student in class 2 based on
these standard scores.
You can transform all the scores in the classes into standard form or z scores.
CLASS 1 (x̄ = 7; s = 1.79)
Obs    X     z
1      10     3/1.79 =  1.68
2      8      1/1.79 =  0.56
3      7      0/1.79 =  0
4      6     -1/1.79 = -0.56
5      6     -1/1.79 = -0.56
6      5     -2/1.79 = -1.12
CLASS 2 (x̄ = 13; s = 5.87)
Obs    X     z
1      20     7/5.87 =  1.19
2      18     5/5.87 =  0.85
3      14     1/5.87 =  0.17
4      13     0/5.87 =  0
5      9     -4/5.87 = -0.68
6      4     -9/5.87 = -1.53
The scores for the two classes are now expressed in the same scale, and each class has a mean of
0 and a standard deviation of 1.
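The standardization can be sketched in Python, which also lets us confirm that each set of z scores has mean 0 and standard deviation 1. `z_scores` is a hypothetical helper. It uses the unrounded sample standard deviation, so the third decimal place can differ slightly from hand calculations based on s = 1.79 or s = 5.87.

```python
import statistics

class1 = [10, 8, 7, 6, 6, 5]
class2 = [20, 18, 14, 13, 9, 4]

def z_scores(values):
    """z = (X - mean) / s for every observation, with s the sample standard deviation."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    return [(x - mean) / s for x in values]

z1 = z_scores(class1)
z2 = z_scores(class2)

# The score of 6 in class 1 (index 3) versus the score of 9 in class 2 (index 4).
print(round(z1[3], 3), round(z2[4], 3))  # -0.559 -0.682

# After standardizing, each class has mean 0 and standard deviation 1 (up to rounding error).
print(round(statistics.mean(z1), 6), round(statistics.stdev(z1), 6))
print(round(statistics.mean(z2), 6), round(statistics.stdev(z2), 6))
```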