Central Tendency and Dispersion
Univariate Data: Data involving one variable
Mean

$$\bar{x} = \frac{\sum x}{n}$$
Where $\bar{x}$ is the mean, Σx is the sum of all the scores and n is the number of scores.
The mean gives the average or ‘typical score’ for a set of values. Add up all the values (Σx) and divide
by the number of values that you have (n). If the values are written in a frequency table:
E.g.
Value   Frequency
5       2
6       4
7       3
8       1
then this means the same as
5, 5, 6, 6, 6, 6, 7, 7, 7, 8
(i.e. 10 numbers)
$$\bar{x} = \frac{(2 \times 5) + (4 \times 6) + (3 \times 7) + (1 \times 8)}{10} = 6.3$$
An OUTLIER can greatly affect your results. E.g. If you add a score of 20 to the above example then the new mean becomes
$$\frac{(2 \times 5) + (4 \times 6) + (3 \times 7) + (1 \times 8) + (1 \times 20)}{11} = 7.5 \text{ (1 d.p.)}$$
This is noticeably higher than 6.3 and no longer describes a typical score, so it is misleading. Outliers are therefore often identified and removed first.
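The frequency-table calculation and the effect of the outlier can be checked with a short Python sketch. The function name `mean_from_table` is made up for this illustration; the numbers are the ones from the example above.

```python
# Frequency table from the example: value -> frequency
freq = {5: 2, 6: 4, 7: 3, 8: 1}


def mean_from_table(table):
    """Mean of a frequency table: sum of (value x frequency) divided by n."""
    total = sum(value * count for value, count in table.items())
    n = sum(table.values())
    return total / n


mean_without_outlier = mean_from_table(freq)

# Add one outlying score of 20 and recompute the mean
freq_with_outlier = dict(freq)
freq_with_outlier[20] = 1
mean_with_outlier = mean_from_table(freq_with_outlier)

print(mean_without_outlier)          # 6.3
print(round(mean_with_outlier, 1))   # 7.5
```

A single extra score moves the mean by more than a whole unit, which is why outliers are usually checked before the mean is reported.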
Median
First place your scores in order from smallest to biggest. The median is the middle score. If two middle scores exist (i.e. when n is an even number) then find the average of these two.
If n is a large number, the easiest way to find the position of the median is to add 1 to the number of scores (n) and then divide by 2. If you get a whole number then the score in that position is the median. If you get a decimal, e.g. 14.5, then the median is the average of the 14th and 15th scores.
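The (n + 1) ÷ 2 position rule can be sketched in Python as follows. The function name `median_by_position` is an assumption made for this illustration; positions are counted from 1, as in the text.

```python
def median_by_position(scores):
    """Median found with the (n + 1) / 2 position rule."""
    ordered = sorted(scores)
    n = len(ordered)
    if n == 0:
        raise ValueError("no scores given")
    position = (n + 1) / 2                 # 1-based position of the median
    if position == int(position):
        # Whole number: the score in that position is the median
        return ordered[int(position) - 1]
    # Decimal such as 14.5: average the 14th and 15th scores
    lower = ordered[int(position) - 1]
    upper = ordered[int(position)]
    return (lower + upper) / 2


# Ten scores from the frequency table: position 5.5, so average the 5th and 6th
print(median_by_position([5, 5, 6, 6, 6, 6, 7, 7, 7, 8]))   # 6.0
```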
Mode
The most frequent value. In the example above the mode is 6. It is possible to have a bimodal
situation (i.e. two values both occurring the most often) but if there are more than two we usually
say that there is no distinct mode.
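As a rough sketch, the mode rule described here (one mode, a bimodal pair, or no distinct mode) could be written in Python like this. The function name `modes` and the choice to return an empty list for "no distinct mode" are assumptions made for this illustration.

```python
from collections import Counter


def modes(scores):
    """Return the mode(s); an empty list means no distinct mode."""
    counts = Counter(scores)
    highest = max(counts.values())
    most_frequent = sorted(v for v, c in counts.items() if c == highest)
    # One or two values sharing the top frequency count as mode / bimodal;
    # more than two is treated as no distinct mode.
    return most_frequent if len(most_frequent) <= 2 else []


print(modes([5, 5, 6, 6, 6, 6, 7, 7, 7, 8]))   # [6]
```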
Measures of Central Tendency
The mean, median and mode are called measures of central tendency. They describe the average,
central or typical members of a group.
Stem and Leaf Diagrams
139, 168, 142, 133, 152, 155, 131, 140, 163, 142 can be rewritten as:
Stem | Leaves
13 | 9 3 1
14 | 2 0 2
15 | 2 5
16 | 8 3
You can rewrite them in order if you want to – it makes it easier to see the median:
13 | 1 3 9
14 | 0 2 2
15 | 2 5
16 | 3 8
Confirm these results:
Mean = 146.5
Mode = 142
Median = 142
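One way to confirm these results is a short Python sketch that builds the ordered stem and leaf diagram and then uses the standard `statistics` module. The function name `stem_and_leaf` is an assumption made for this illustration, and it assumes the last digit is the leaf.

```python
from statistics import mean, median, mode

scores = [139, 168, 142, 133, 152, 155, 131, 140, 163, 142]


def stem_and_leaf(values):
    """Group sorted values by stem (all digits but the last) and leaf (last digit)."""
    plot = {}
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        plot.setdefault(stem, []).append(leaf)
    return plot


plot = stem_and_leaf(scores)
for stem, leaves in plot.items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))

print(mean(scores), mode(scores), median(scores))
```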
Dot Frequency Graphs
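[Dot frequency graph: x-axis from 1 to 6, y-axis from 1 to 3]
$$\text{Mean} = \frac{(1 \times 1) + (2 \times 2) + (1 \times 3) + (3 \times 4) + (2 \times 5) + (1 \times 6)}{10} = 3.6$$
Median = 4
Mode = 4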
Also see Sadler 3A page 127 and Exercise 6A
Check that you know how to use your Classpad calculators here
Range = Largest score - Smallest score
In the Dot Frequency graph above the range is 6 – 1 = 5
Quartiles
First put the scores into order and identify the median score. Then take the scores on the lower side of the median and split them in half by finding their median. Do the same for the scores above the original median. This will result in the scores being split into quarters.
E.g.
4   6   6   8   9   10   12   12   14   16   19
Lower quartile Q1 = 6 (median of bottom half of scores: 4, 6, 6, 8, 9)
Median Q2 = 10
Upper quartile Q3 = 14 (median of upper half of scores: 12, 12, 14, 16, 19)
Interquartile Range = Upper Quartile - Lower Quartile = Q3 - Q1
In the example above IQR = 14 – 6 = 8
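The median-of-halves method can be sketched in Python as below. The function name `quartiles` is made up for this illustration. For an odd number of scores the median itself is left out of both halves, as in the example; for an even number of scores the sketch simply splits the list in two, which is an assumption (other quartile conventions exist and calculators may differ slightly).

```python
from statistics import median


def quartiles(scores):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ordered = sorted(scores)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    # Skip the middle score when n is odd so it is in neither half
    upper_half = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower_half), median(ordered), median(upper_half)


q1, q2, q3 = quartiles([4, 6, 6, 8, 9, 10, 12, 12, 14, 16, 19])
iqr = q3 - q1
print(q1, q2, q3, iqr)   # 6 10 14 8
```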
Box and Whisker Plot
These show the quartiles and range clearly. A rectangular box with end points denoting the upper
and lower quartiles has a line through it at the median. Tails on either side show the range. For the
example in the quartiles section above, the box and whisker plot looks like:
[Box and whisker plot on a number line from 1 to 20: whiskers from 4 to 19, box from 6 to 14, median line at 10]
Also see Sadler 3A Exercise 6E Question 5
Mean Deviation
Mean brings an average to mind. Deviation refers to a difference between two values, more specifically how far away from a set point the score is. The mean deviation is the average distance of the scores from the mean, so it shows how concentrated the scores are around the mean.
E.g.
25   21   24   22   25   29   29   25
Mean = 25
Score   Deviation from the mean
25      25 – 25 = 0
21      25 – 21 = 4
24      25 – 24 = 1
22      25 – 22 = 3
25      25 – 25 = 0
29      29 – 25 = 4
29      29 – 25 = 4
25      25 – 25 = 0
Notice that it doesn’t matter which side of the mean the score is. We just look at how far away it is from the mean (i.e. no negative values).
$$\text{Mean deviation} = \frac{0 + 4 + 1 + 3 + 0 + 4 + 4 + 0}{8} = 2$$
Variance
Square the deviations first and then find the average:
$$\text{Variance} = \sigma^2 = \frac{0^2 + 4^2 + 1^2 + 3^2 + 0^2 + 4^2 + 4^2 + 0^2}{8} = 7.25$$
Note: Σ is capital sigma
σ is lower case sigma
Standard Deviation
Take the square root of the variance.
SD = σ = √7.25 ≈ 2.7 (1 d.p.)
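The mean deviation, variance and standard deviation for the eight scores above can be checked with a minimal Python sketch. It divides by n throughout, as the notes do (this is the population form; some calculators also offer a version that divides by n - 1).

```python
from math import sqrt

scores = [25, 21, 24, 22, 25, 29, 29, 25]
n = len(scores)
mean = sum(scores) / n

# Average distance from the mean, ignoring the sign of each deviation
mean_deviation = sum(abs(x - mean) for x in scores) / n

# Average of the squared deviations
variance = sum((x - mean) ** 2 for x in scores) / n

# Square root of the variance
sd = sqrt(variance)

print(mean, mean_deviation, variance, round(sd, 1))   # 25.0 2.0 7.25 2.7
```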
Measures of Dispersion
The range, interquartile range and the standard deviation are all known as measures of dispersion.
The standard deviation is the most useful measure of dispersion in data analysis. This measure of spread is significant because most (probably all) of the scores lie within three standard deviations either side of the mean, i.e.
$$\bar{x} - 3\sigma \le \text{most (probably all) scores} \le \bar{x} + 3\sigma$$
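As a quick illustration, the three-standard-deviation interval can be checked for the eight scores used in the mean deviation example. This is only a sketch on one small data set, not a proof of the rule; the variable names are chosen for this illustration.

```python
from statistics import mean, pstdev

scores = [25, 21, 24, 22, 25, 29, 29, 25]
centre = mean(scores)
sigma = pstdev(scores)     # population standard deviation (divides by n)

lower = centre - 3 * sigma
upper = centre + 3 * sigma

# Fraction of the scores that fall inside the interval
within = [x for x in scores if lower <= x <= upper]
share_within = len(within) / len(scores)

print(round(lower, 1), round(upper, 1), share_within)   # 16.9 33.1 1.0
```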
There are rules that you can follow if scores have been increased or decreased or multiplied by a
constant amount:
Original mean   Original SD   Alteration to scores                          New mean      New SD
5               1.5           Multiply all scores by 3                      15            4.5
g               h             Multiply all scores by a                      ag            |a|h
10              2             Add 8 to each score                           18            2
g               h             Add b to each score                           g + b         h
15              3             Double each and add 1 to each score           31            6
g               h             Multiply each score by a and then add b       ag + b        |a|h
12              2.5           Add 2 to each and then multiply each by 3     14×3 = 42     2.5×3 = 7.5
g               h             Add b to each and then multiply each by a     a(g + b)      |a|h
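The general rules in the table can be checked numerically with a small Python sketch. It reuses the eight scores from the mean deviation example; the constants a = -3 and b = 8 are hypothetical values picked for this illustration, with a negative a chosen to show why the new SD uses |a|.

```python
from math import isclose
from statistics import mean, pstdev

scores = [25, 21, 24, 22, 25, 29, 29, 25]
g = mean(scores)       # original mean
h = pstdev(scores)     # original SD (population form, divides by n)

a, b = -3, 8           # hypothetical constants for this check

multiply_then_add = [a * x + b for x in scores]
add_then_multiply = [a * (x + b) for x in scores]

# New mean follows the same operations as the scores
print(mean(multiply_then_add), a * g + b)
print(mean(add_then_multiply), a * (g + b))

# New SD is |a| times the original; adding b has no effect on the spread
print(isclose(pstdev(multiply_then_add), abs(a) * h))
print(isclose(pstdev(add_then_multiply), abs(a) * h))
```

Adding a constant shifts every score by the same amount, so the distances from the mean do not change; multiplying stretches those distances by the size of the multiplier.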
Also see Sadler 3A pages 131 – 138 Exercise 6B and “Statistics Worksheet”
Grouped Data
Sometimes data is grouped into sections, ‘class intervals’ or ‘bins’.
E.g.
Score       20-29   30-39   40-49   50-59   60-69   70-79   80-89
Frequency   3       11      18      28      24      10      6
This means, for example, that there are 10 scores between 70 and 79 but the problem is that we
don’t know the exact value of each of the ten scores. The advantage is that the overall distribution is
clear.
If asked to find the mean or standard deviation of grouped data we need to assume that all of the scores in an interval are at the midpoint of the interval. E.g. The interval 70-79 has a midpoint of 74.5.
Hence the mean of the above is
$$\text{Mean} = \frac{24.5 \times 3 + 34.5 \times 11 + 44.5 \times 18 + 54.5 \times 28 + 64.5 \times 24 + 74.5 \times 10 + 84.5 \times 6}{3 + 11 + 18 + 28 + 24 + 10 + 6} = 55.8$$
Also, the SD = 14.3 (1 d.p.)
The mode for this type of example is referred to as the modal class, i.e. the class interval with the most scores, which here is 50-59.
The range can only be approximated: 89 – 20 = 69.
We could argue that it should be 89.5 – 19.5 = 70 but, as it is only an estimate, either 69 or 70 would be reasonable.
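The grouped-data estimates above can be reproduced with a minimal Python sketch that treats every score as sitting at its class midpoint. The variable names are chosen for this illustration, and the SD divides by n, matching the earlier sections.

```python
from math import sqrt

# (class interval, frequency) pairs from the table
classes = [((20, 29), 3), ((30, 39), 11), ((40, 49), 18), ((50, 59), 28),
           ((60, 69), 24), ((70, 79), 10), ((80, 89), 6)]

midpoints = [(low + high) / 2 for (low, high), _ in classes]
freqs = [f for _, f in classes]
n = sum(freqs)

# Every score in a class is assumed to equal the class midpoint
grouped_mean = sum(m * f for m, f in zip(midpoints, freqs)) / n
grouped_variance = sum(f * (m - grouped_mean) ** 2
                       for m, f in zip(midpoints, freqs)) / n
grouped_sd = sqrt(grouped_variance)

# Modal class: the interval with the highest frequency
modal_class = max(classes, key=lambda item: item[1])[0]

print(round(grouped_mean, 1), round(grouped_sd, 1), modal_class)   # 55.8 14.3 (50, 59)
```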
See also Sadler 3A page 139
Weighted Means
Instead of each of the data points contributing equally to the final average, some data points contribute more than others.
E.g. Two school classes, one with 20 students and one with 30 students, both take a test.
Morning Class: 62, 67, 71, 74, 76, 77, 78, 79, 79, 80,
80, 81, 81, 82, 83, 84, 86, 89, 93, 98
Mean = 80
Afternoon Class: 81, 82, 83, 84, 85, 86, 87, 87, 88, 88,
89, 89, 89, 90, 90, 90, 90, 91, 91, 91
92, 92, 93, 93, 94, 95, 96, 97, 98, 99
Mean = 90
The straight average of 80 and 90 is (80 + 90)/2 = 85. However, this does not accurately account for
the difference in numbers of students in each class, and the value of 85 does not reflect the average
student grade (independent of class).
The weighted mean of the two classes is
$$\frac{(20 \times 80) + (30 \times 90)}{20 + 30} = 86$$
Notice that the weighted mean of 86 is closer to the afternoon class mean of 90 than to the morning class mean of 80. The fact that there are more students in the afternoon class “weights” the overall mean towards the afternoon class more than the morning class.
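A short Python sketch can confirm the class means and show that the weighted mean equals the mean of all 50 scores taken together. The function name `weighted_mean` is an assumption made for this illustration.

```python
morning = [62, 67, 71, 74, 76, 77, 78, 79, 79, 80,
           80, 81, 81, 82, 83, 84, 86, 89, 93, 98]
afternoon = [81, 82, 83, 84, 85, 86, 87, 87, 88, 88,
             89, 89, 89, 90, 90, 90, 90, 91, 91, 91,
             92, 92, 93, 93, 94, 95, 96, 97, 98, 99]


def weighted_mean(values, weights):
    """Sum of value x weight divided by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


class_means = [sum(morning) / len(morning), sum(afternoon) / len(afternoon)]
class_sizes = [len(morning), len(afternoon)]

combined = weighted_mean(class_means, class_sizes)

# Same answer as averaging all 50 scores directly
pooled = sum(morning + afternoon) / len(morning + afternoon)

print(class_means, combined, pooled)   # [80.0, 90.0] 86.0 86.0
```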
See also Sadler 3A page 140
Cropped Data
Sometimes outliers affect the mean and make it look unrealistic so the data can be “cropped” to
eliminate them. E.g. In some sporting events the highest and lowest scores are ignored and the rest
averaged.
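A minimal sketch of cropping in Python is shown below. The function name `cropped_mean`, the `drop` parameter and the six judges' scores are all made up for this illustration; they are not taken from the notes or from any particular sport.

```python
def cropped_mean(scores, drop=1):
    """Mean after ignoring the `drop` lowest and `drop` highest scores."""
    ordered = sorted(scores)
    if len(ordered) <= 2 * drop:
        raise ValueError("not enough scores to crop")
    kept = ordered[drop:len(ordered) - drop]
    return sum(kept) / len(kept)


# Hypothetical judges' scores with one unusually low score
judges = [3, 8, 8, 9, 9, 10]

full_mean = sum(judges) / len(judges)
trimmed = cropped_mean(judges)

print(round(full_mean, 2), trimmed)   # 7.83 8.5
```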
See also Sadler 3A page 141
Standard Scores
$$\text{Standardised Score} = \frac{\text{Raw Score} - \text{Mean}}{\text{Standard Deviation}}$$
Standardisation is a method of converting various types or units of measurement into a common scale in order to make comparisons.
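As an illustration, the formula can be used to compare marks from two different tests. The function name `standard_score` and the test marks, means and standard deviations below are hypothetical values invented for this sketch.

```python
def standard_score(raw, mean, sd):
    """Number of standard deviations a raw score lies above (+) or below (-) the mean."""
    return (raw - mean) / sd


# Hypothetical results: 70 in a test with mean 60 and SD 5,
# and 75 in a different test with mean 70 and SD 10
test_one = standard_score(70, mean=60, sd=5)
test_two = standard_score(75, mean=70, sd=10)

print(test_one, test_two)   # 2.0 0.5
```

Although 75 is the higher raw mark, the mark of 70 is further above its own test's mean once both are put on the common scale.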
See also Sadler 3A page 142 and Exercise 6D