HOSP 1207 (Business Stats)
Learning Centre
Describing Data
This worksheet focuses on describing data through measuring its central tendency and
variability. These measurements will give us an idea of what our data set looks like.
CENTRAL TENDENCY
There are three measurements of central tendency: mean, median, and mode.
Mean: Mean is another word for average. The mean represents the average or typical
value of a data set.
To find the mean, take the sum of all numbers in the data set and divide by how
many data points there are. The symbol for taking the sum of a set of numbers is the
capital Greek letter sigma, Σ, so “Σx” tells you to “take the sum of all values of x”. n is
the number of observations in the data set. You will see the notation for mean
represented two ways: x̄ (pronounced “x bar”) is used for the mean of a sample of data,
and μ (pronounced “mew”, the Greek letter mu) represents the mean of a population of
data. The population is all the members of the group of interest (e.g. all ducks) while a
sample is a smaller group, or subset, of the population (e.g. 250 ducks at Trout Lake).
Example 1: Find the mean of the following sample: {3, 5, 4, 9, 8, 5, 7, 8, 9, 12}
Solution: We first take the sum of all numbers in the sample:
Σx = 3 + 5 + 4 + 9 + 8 + 5 + 7 + 8 + 9 + 12 = 70
and then divide by the number of values in the data set, n, which equals 10:
$$\bar{x} = \frac{\sum x}{n} = \frac{70}{10} = 7$$
The mean of our data set is 7.
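The calculation in Example 1 can be checked with a short Python sketch. The function name `mean` and the variable names are illustrative choices, not part of the worksheet.

```python
# Sample from Example 1 of the worksheet
sample = [3, 5, 4, 9, 8, 5, 7, 8, 9, 12]


def mean(values):
    """Return the sum of the values divided by how many values there are."""
    return sum(values) / len(values)


total = sum(sample)    # Σx, the sum of all values
n = len(sample)        # n, the number of observations
x_bar = mean(sample)   # x̄ = Σx / n

print(total, n, x_bar)  # 70 10 7.0
```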
Median: The median is the middle value of an ordered data set. This is a more useful
measure of central tendency if the data is significantly skewed. “Skewness” means the
data favours high numbers over low numbers, or vice versa. In graph form, a skewed
curve appears asymmetrical, rather than as a symmetrical bell shape, with a longer tail
leading off to one side.
We find the position of the central observation in the ordered data set using the formula:
$$\text{position number} = \frac{n + 1}{2}$$
Example 2: Find the median of the data set in Example 1.
Solution: The first step is to put the data set in order from smallest to largest:
{3, 4, 5, 5, 7, 8, 8, 9, 9, 12}
© 2013 Vancouver Community College Learning Centre.
Student review only. May not be reproduced for classes.
Authored by Emily Simpson
Since we have an even number of observations (n = 10), the median is the average of
two values. Using the position formula,
$$\frac{n + 1}{2} = \frac{10 + 1}{2} = 5.5,$$
which means the median is halfway between the 5th and 6th values of the ordered set.
We take the average of 7 and 8 and get 7.5, so the median of our data set is 7.5.
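As a rough sketch, the same steps (sort, find position (n + 1) / 2, then read off or average) might look like this in Python. The function name `median` and its error handling are assumptions made for illustration.

```python
def median(values):
    """Median via the worksheet's position rule: position = (n + 1) / 2."""
    if not values:
        raise ValueError("median of an empty data set is undefined")
    ordered = sorted(values)        # step 1: order from smallest to largest
    n = len(ordered)
    position = (n + 1) / 2          # 1-based position of the central observation
    lower = ordered[int(position) - 1]
    if position == int(position):   # odd n: the position is a whole number
        return lower
    upper = ordered[int(position)]  # even n: average the two middle values
    return (lower + upper) / 2


sample = [3, 5, 4, 9, 8, 5, 7, 8, 9, 12]
print(median(sample))  # 7.5
```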
Mode: The mode is the data value that occurs most frequently. It is possible to have
more than one mode in a data set. In the data set {3, 4, 5, 5, 7}, the number 5 occurs
twice so it is the mode. In the data set {2, 4, 2, 6, 7, 7, 7, 8, 2}, both the numbers 2 and
7 occur three times each. This would be a bimodal data set.
Example 3: Identify the mode(s) in the data set from Example 1, if any exist.
Solution: There are three modes in this data set: 5, 8, and 9 (each value occurs twice).
This is called a multimodal data set.
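Counting how often each value occurs can be sketched in Python as follows. The function name `modes` is illustrative, and returning an empty list when no value repeats is an assumption about what "no mode" should mean; the worksheet does not state that convention.

```python
from collections import Counter


def modes(values):
    """Return every value that shares the highest frequency, in ascending order."""
    counts = Counter(values)            # value -> number of occurrences
    highest = max(counts.values())
    if highest == 1:
        return []                       # assumption: no repeated value means no mode
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([3, 4, 5, 5, 7]))                  # [5]
print(modes([2, 4, 2, 6, 7, 7, 7, 8, 2]))      # [2, 7]
print(modes([3, 5, 4, 9, 8, 5, 7, 8, 9, 12]))  # [5, 8, 9]
```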
VARIABILITY
Range: The range is the difference between the highest and lowest values in the data
set. Because it depends only on those two extreme values, it is usually not the most
useful measure of variability of a data set.
Standard deviation: This is the most commonly used measure of variability. It reflects
the deviations (or differences) of all values in the data set from the mean. A larger
standard deviation indicates greater variability for a data set.
If you calculated the mean mark on a class midterm to be 65, that only tells you the
average mark. Did the marks in the class look like {66, 64, 67, 66, 62, 70…} or like {48,
97, 83, 57, 62, 81, …}? The first set of marks has low standard deviation - most of the
marks are quite close to the mean. The second set has a higher standard deviation as
there is a greater spread of values from the mean. The notation for standard deviation
of a population is σ (“sigma” - lower case Greek letter). The notation for standard
deviation of a sample is s.
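To calculate standard deviation, use the following formulas:
Population standard deviation (N is the number of values in the population):
$$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$$
Sample standard deviation:
$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}}$$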
The rightmost formula for sample standard deviation is the easiest one to use for
calculating s by hand.
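For readers wondering why the shortcut form gives the same answer as the definition, the following derivation is one way to see it. It uses only the fact that x̄ = Σx / n and expands the squared deviations.

```latex
\begin{aligned}
\sum (x - \bar{x})^2
  &= \sum x^2 - 2\bar{x}\sum x + n\bar{x}^2 \\
  &= \sum x^2 - 2\,\frac{(\sum x)^2}{n} + \frac{(\sum x)^2}{n} \\
  &= \sum x^2 - \frac{(\sum x)^2}{n}
\end{aligned}
```

Dividing both sides by n - 1 and taking the square root gives s in either form, so only Σx, Σx² and n are needed for a hand calculation.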
Variance is another related measure of variability that is simply the square of the
standard deviation (σ² or s²). If the variance is calculated first (or given), take the square
root of the variance to get the standard deviation.
Example 4: Calculate the standard deviation of the data set from Example 1.
Solution: We know n = 10 from Example 1. We also know Σx = 70 from Example 1.
The only term left to figure out is Σx². Σx² is the sum of the squares of all data values:
Σx² = 3² + 5² + 4² + 9² + 8² + 5² + 7² + 8² + 9² + 12² = 558
Now we plug into the formula:
$$s = \sqrt{\frac{558 - \frac{70^2}{10}}{10 - 1}} = \sqrt{7.55556} \approx 2.749$$
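The hand calculation in Example 4 can be reproduced with a small Python sketch. The function name `sample_std_dev` is an illustrative choice. The comparison against `statistics.stdev` from the Python standard library is only a cross-check and is not part of the worksheet's method.

```python
import math
import statistics


def sample_std_dev(values):
    """Sample standard deviation from n, Σx and Σx² (the by-hand shortcut)."""
    n = len(values)
    sum_x = sum(values)                    # Σx
    sum_x2 = sum(x * x for x in values)    # Σx²
    variance = (sum_x2 - sum_x ** 2 / n) / (n - 1)   # s²
    return math.sqrt(variance)


sample = [3, 5, 4, 9, 8, 5, 7, 8, 9, 12]
s = sample_std_dev(sample)
print(round(s, 3))  # 2.749

# Cross-check against the standard library's sample standard deviation
print(math.isclose(s, statistics.stdev(sample)))  # True
```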
Quartiles: One other way to measure variability is by using quartiles and the
interquartile range. This is a more accurate description of the data than using standard
deviation if a data set has strong outliers (values that lie FAR away from the rest of the
data) or is strongly skewed.
The first quartile (Q1) is the data point that lies above ¼ (25%) of all the points of the
data set and the third quartile (Q3) is the point that lies above ¾ (75%) of all the data
points. The second quartile lies above ½ (50%) of all data points (it’s the median).
The idea of quartiles, which cut a data set into quarters, can be extended to percentiles,
which cut a data set into hundredths. The pth percentile of a data set is the data point
above p% of all the data points in the set. For example, the 90th percentile is the value
above 90% of all the data points.
To calculate a percentile (or quartile):
(1) Find the position of the percentile.
Take the percentile number (e.g. for Q1, 25%) divided by 100 and multiply by the
number of observations (n) to get the position in the ordered set.
(a) If you get a whole number for the position, add 0.5
(b) If you get a decimal number for the position, round UP to the next whole
number
(2) Find the data point at that position in the ordered data set.
If the position is a whole number, use the value at that position in the data set as
the answer. If the position is a decimal value, use the average of the two values
spanning that position in the data set.
The interquartile range (IQR) is the difference between the 3rd quartile and 1st quartile:
Q3 – Q1. This range will include the middle 50% of the values of the data set.
Example 5: For the data set in Example 1, determine the 1st and 3rd quartile.
Solution:
1st quartile = 25th percentile: position = 25/100 × 10 = 2.5, rounded up to the 3rd position
3rd quartile = 75th percentile: position = 75/100 × 10 = 7.5, rounded up to the 8th position
Take the ordered data set and find the values in the 3rd and 8th positions:
{3, 4, 5, 5, 7, 8, 8, 9, 9, 12}
The 3rd value is 5 and the 8th value is 9, so Q1 = 5 and Q3 = 9.
For the case above, IQR = Q3 – Q1 = 4.
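The percentile procedure used in this worksheet can be sketched in Python as below. The function name `percentile` is illustrative, and restricting p to whole numbers strictly between 0 and 100 is an assumption made to keep the arithmetic exact. Statistical software often uses other interpolation rules, so its quartiles may differ slightly from the ones this rule gives.

```python
def percentile(values, p):
    """p-th percentile using the worksheet's rule (p is a whole number, 0 < p < 100)."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)
    n = len(ordered)
    # position = p/100 * n, split into whole part and remainder to avoid rounding error
    whole, remainder = divmod(p * n, 100)
    if remainder == 0:
        # whole-number position k becomes k + 0.5: average the k-th and (k+1)-th values
        return (ordered[whole - 1] + ordered[whole]) / 2
    # decimal position: round UP and take the value at that position
    return ordered[whole]   # index `whole` is the 1-based position whole + 1


sample = [3, 5, 4, 9, 8, 5, 7, 8, 9, 12]
q1 = percentile(sample, 25)
q3 = percentile(sample, 75)
print(q1, q3, q3 - q1)  # 5 9 4
```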
EXERCISES
For the following sets of data, calculate (a) sample mean, (b) median, (c) mode, (d)
range, (e) variance, (f) standard deviation, (g) 1st quartile, (h) 3rd quartile, (i) interquartile
range, (j) 10th percentile, and (k) 90th percentile.
1. { 8, 24, 9, 6, 10, 18, 7, 14, 16, 21, 13, 24}
2. { 3, 6, 5, 4, 6, 5, 9, 10, 11, 7, 9}
3. { 41, 39, 38, 42, 43, 39, 40, 43, 26, 42, 42, 41, 41, 42, 27, 55, 60}
4. Name the data set (1, 2, or 3 – according to the numbered exercises above) with
the greatest variability based on (i) standard deviation (ii) range and (iii) IQR.
5. Explain why the answers to 4(i) and 4(ii) are different from 4(iii).
SOLUTIONS
1. (a) 14.17
(b) 13.5
(c) 24
(d) 18
(e) 41.7879
(f) 6.4644
(g) Q1 = 8.5 (position = 3.5, take the average of the 3rd and 4th values in the
ordered set)
(h) Q3 = 19.5 (position = 9.5, take the average of the 9th and 10th values in the
ordered set)
(i) IQR = 19.5 – 8.5 = 11
(j) 7 (2nd position)
(k) 24 (11th position)
2. (a) 6.82
(b) 6
(c) 5, 6, 9
(d) 8
(e) 6.7636
(f) 2.6007
(g) Q1 = 5 (position = 3, take the 3rd value in the ordered set)
(h) Q3 = 9 (position = 9, take the 9th value in the ordered set)
(i) IQR = 9 – 5 = 4
(j) 4 (2nd position)
(k) 10 (10th position)
3. (a) 41.24
(b) 41
(c) 42
(d) 34
(e) 62.9412
(f) 7.9335
(g) 39 (5th position)
(h) 42 (13th position)
(i) IQR = 42 – 39 = 3
(j) 27 (2nd position)
(k) 55 (16th position)
4. (i) Data set 3 has the greatest standard deviation
(ii) Data set 3 has the greatest range
(iii) Data set 1 has the largest IQR
5. Data set 3 has strong outliers above and below the central data points. Because
of this, Data set 3 has a high standard deviation and range. However, the IQR is
less sensitive to outliers and should be used as the measure of variability for data
sets with high skewness or strong outliers. For this reason, the IQR of Data set 3
is much lower than the IQR of Data set 1.