Download Statistics hand out 22.24KB 2017-03-29 12:41:19

Measures of Central Tendency
Measures of central tendency, or averages, are values that describe a set of data by identifying the
central position within that set, e.g. the average life span of a human being. There are 3 different types:
MEAN


Use the mean to describe the middle of a set of data that does not have an outlier.
Calculated by adding all the values together and dividing the total by the number of values in the
data set, i.e.
mean = (sum of all values) ÷ (number of values)

E.g. A darts player throws the following scores during a match: 80, 100, 62, 180, 21, 55
Her average score per throw = (80 + 100 + 62 + 180 + 21 + 55) ÷ 6 = 83
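The arithmetic above can be checked with a short Python sketch. The variable names are illustrative only and are not part of the handout; `statistics.mean` comes from the Python standard library.

```python
from statistics import mean

# Darts scores from the example above
scores = [80, 100, 62, 180, 21, 55]

# Mean by hand: add all the values, then divide by how many there are
mean_by_hand = sum(scores) / len(scores)

# The same calculation using the standard library
mean_from_library = mean(scores)

print(mean_by_hand)  # 83.0
print(mean_from_library == mean_by_hand)  # True
```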
Advantages
- Most popular measure in fields such as
business, engineering and computer science.
- It is unique - there is only one answer.
- Useful when comparing sets of data.
- Can be used with discrete or continuous data
- The calculation includes every value in the
data set.
Disadvantages
- Affected by extreme values (outliers)
e.g. GDP per capita is often not a good
measure of the average person's wealth
because it can be distorted by extremely rich
people.
- Should not be used if the frequency
distribution of the data set is skewed
MEDIAN





Use the median to describe the middle of a set of data that does have an outlier.
The middle number when the values are listed in order of magnitude
E.g. The median of {1, 2, 4, 4, 5, 8, 9} is 4
If you have an even number of values, take the mean of the middle two to get the median
E.g. The median of {1, 2, 4, 5, 7, 9, 12, 20} is (5+7) ÷ 2 = 6.
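The two cases above (odd and even number of values) can be sketched in Python as follows. The function name `median_by_hand` is a made-up illustration, not something from the handout; the standard library's `statistics.median` is used only as a cross-check.

```python
from statistics import median

def median_by_hand(values):
    """Middle value of the sorted data; mean of the middle two if the count is even."""
    ordered = sorted(values)          # values must be in order of magnitude first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd count: a single middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: mean of the middle two

odd_set = [1, 2, 4, 4, 5, 8, 9]
even_set = [1, 2, 4, 5, 7, 9, 12, 20]

print(median_by_hand(odd_set))   # 4
print(median_by_hand(even_set))  # 6.0

# Cross-check against the standard library
print(median(odd_set) == median_by_hand(odd_set))    # True
print(median(even_set) == median_by_hand(even_set))  # True
```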
Advantages
- Extreme values (outliers) do not affect the
median as strongly as they do the mean.
- Can be used with skewed data
- Useful when comparing sets of data.
- It is unique - there is only one answer.
Disadvantages
- Need to put all values in order first, which can
be time consuming
- Uses only one value in the data set, so may
not be truly representative of the whole data set.
If this value is changed, so is the median, even if
all other values stay the same.
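To illustrate how an outlier affects the mean far more than the median, here is a small Python sketch. The second data set is a hypothetical variant made up for this illustration: it is the example set above with the largest value, 9, replaced by 90.

```python
from statistics import mean, median

original = [1, 2, 4, 4, 5, 8, 9]
# Hypothetical variant for illustration: 9 replaced by an outlier, 90
with_outlier = [1, 2, 4, 4, 5, 8, 90]

# The mean more than triples, from about 4.71 to about 16.29
print(round(mean(original), 2), round(mean(with_outlier), 2))

# The median is 4 in both cases
print(median(original), median(with_outlier))
```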
MODE




Use the mode when the data is categorical or when asked to choose the most popular item.
The most commonly occurring value
On a bar chart or histogram, the highest bar marks the mode (for grouped data, the modal class)
E.g. The mode of {1, 2, 4, 4, 5, 8, 9} is 4 as it occurs the most.
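Finding the mode amounts to counting how often each value occurs. A minimal Python sketch, using `collections.Counter` from the standard library (the variable names are illustrative):

```python
from collections import Counter

values = [1, 2, 4, 4, 5, 8, 9]

# Count how many times each value occurs
counts = Counter(values)

# most_common(1) returns a list holding the (value, frequency) pair with the highest count
value, frequency = counts.most_common(1)[0]

print(value, frequency)  # 4 2
```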
Advantages
- Extreme values (outliers)
do not affect the mode.
- Can be used with skewed
data.
- More useful for some types
of data e.g. the average
number of children per
woman
Disadvantages
- Not necessarily unique - may be more than one answer
- When there is more than one mode, it is difficult to interpret and/or
compare.
- When no values repeat in the data set, every value is a mode, so the
mode is useless.
- May be unrepresentative of the data set e.g. 1, 1, 1, 6, 5, 7, 11, 7, 9,
10, 8 – A mode of 1 isn’t very useful!
- Only really useful for categorical data.
- Can be changed significantly if only one number in the data set is
changed.
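Several of the drawbacks above can be seen with `statistics.multimode` (Python 3.8 or later), which returns every most-common value. This is only a sketch: the handout supplies the first and last data sets, while the two in between are hypothetical ones made up to show the edge cases.

```python
from statistics import multimode

# One clear mode (data set from the example above)
print(multimode([1, 2, 4, 4, 5, 8, 9]))  # [4]

# Hypothetical data with two modes: not unique, so harder to interpret
print(multimode([1, 1, 2, 3, 3]))  # [1, 3]

# Hypothetical data with no repeats: every value is returned
print(multimode([3, 7, 9]))  # [3, 7, 9]

# Unrepresentative mode: most of the values lie between 5 and 11
print(multimode([1, 1, 1, 6, 5, 7, 11, 7, 9, 10, 8]))  # [1]
```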
Measures of Dispersion
Measures of dispersion, or spread, are used to describe the variability in a sample or population. For example: the
range of different heights of children in a class; or the differences in amount of rainfall throughout the year. These
measures can be used along with averages to help you analyse data in more depth and make more detailed
comparisons between data sets. There are 3 different types:
RANGE
- Use the range to describe the spread of a data set that does not have an outlier.
- This is the difference between the highest and the lowest figures in your data set
- E.g. A darts player throws the following scores during a match: 80, 100, 62, 180, 21, 55
- The range of her scores = 180 - 21 = 159
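The range calculation for the darts scores can be sketched in Python as below. The second half is an added illustration, not part of the handout: it shows how strongly the range depends on a single extreme value.

```python
scores = [80, 100, 62, 180, 21, 55]

# Range: highest value minus lowest value
score_range = max(scores) - min(scores)
print(score_range)  # 159

# Illustration: without the single 180, the highest score is 100
# and the range drops to 100 - 21 = 79
without_top_score = [s for s in scores if s != 180]
reduced_range = max(without_top_score) - min(without_top_score)
print(reduced_range)  # 79
```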
Advantages
- Simple and quick to calculate
- Useful when comparing sets of data.
- Can be used with discrete or continuous data
Disadvantages
- Affected by extreme values (outliers)
- Only gives a very basic indication of spread
- Only uses two values in the data set
INTERQUARTILE RANGE
- This measure of dispersion can be used with data that does have outliers
- Looks at the spread of the middle 50% of the data
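- List numbers in order of magnitude, then find the Lower Quartile (LQ) at position (n+1)/4 and the Upper Quartile (UQ)
at position ((n+1)/4) x 3. (n = number of figures in data set). If a position falls halfway between two values, take their mean.
- IQR = UQ – LQ
- E.g. 1, 11, 15, 19, 20, 24, 28, 34, 37, 47, 50, 57, 68 (n = 13, so the LQ is at position 3.5 and the UQ at position 10.5)
- LQ = (15+19) ÷ 2 = 17; UQ = (47+50) ÷ 2 = 48.5; IQR = (48.5-17) = 31.5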
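The (n+1)/4 position rule can be sketched in Python as below. This is one convention among several: textbooks and software differ in how they place the quartiles, so other methods can give slightly different values for the same data. The helper names are invented for this illustration, and the sketch assumes at least three values and interpolates linearly when a position falls between two values.

```python
def value_at_position(ordered, position):
    """Value at a 1-based, possibly fractional position in a sorted list,
    interpolating linearly between the two neighbouring values."""
    whole = int(position)
    fraction = position - whole
    lower = ordered[whole - 1]            # convert the 1-based position to a 0-based index
    if fraction == 0:
        return lower
    upper = ordered[whole]
    return lower + fraction * (upper - lower)

def quartiles_and_iqr(values):
    """Return (LQ, UQ, IQR) using positions (n+1)/4 and 3(n+1)/4. Assumes n >= 3."""
    ordered = sorted(values)
    n = len(ordered)
    lq = value_at_position(ordered, (n + 1) / 4)
    uq = value_at_position(ordered, 3 * (n + 1) / 4)
    return lq, uq, uq - lq

data = [1, 11, 15, 19, 20, 24, 28, 34, 37, 47, 50, 57, 68]
lq, uq, iqr = quartiles_and_iqr(data)
print(lq, uq, iqr)  # 17.0 48.5 31.5
```

For this data set, the standard library call `statistics.quantiles(data, n=4)` with its default "exclusive" method gives the same two quartiles.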
Advantages
- Not affected by outliers
- Allows a fairer comparison of data sets than the
range.
- Good measure to use in combination with the
median
Disadvantages
- Need to put all values in order first, which can be
time consuming
- More difficult to calculate than the range
STANDARD DEVIATION
- A more advanced measure of dispersion
- Indicates the extent to which the data is clustered around the mean
- A smaller SD suggests that the mean is more representative of the data
- First calculate the variance: the average squared deviation of each number from the
mean of the data set. E.g. for the numbers 1, 2 and 3 the mean is 2 and the variance is
[(1 - 2)² + (2 - 2)² + (3 - 2)²] ÷ 3 = 2 ÷ 3 ≈ 0.667
- Then take the square root of the variance to get the standard deviation, e.g. √(2 ÷ 3) ≈ 0.816
- The more widely spread the values are, the larger the standard deviation is
- For normally distributed data, about 68% of the values lie within 1 SD of the mean, about 95% within 2 SDs and
about 99.7% within 3 SDs. Anything that is 2 or more SDs away from the mean can be considered anomalous
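The steps above can be sketched in Python as follows. The function names are invented for this illustration. The calculation divides by n, as in the worked example (the "population" standard deviation, which the standard library provides as `statistics.pstdev`); `statistics.stdev` divides by n - 1 instead and gives a slightly larger value. Taking the square root of the unrounded variance gives 0.816 to three decimal places. Applying the "2 or more SDs" rule to the darts scores from earlier is an added illustration, not part of the handout.

```python
from math import sqrt
from statistics import pstdev

def population_sd(values):
    """Square root of the average squared deviation from the mean (divides by n)."""
    centre = sum(values) / len(values)
    variance = sum((x - centre) ** 2 for x in values) / len(values)
    return sqrt(variance)

def beyond_k_sd(values, k=2):
    """Values lying k or more population SDs away from the mean."""
    centre = sum(values) / len(values)
    sd = population_sd(values)
    return [x for x in values if abs(x - centre) >= k * sd]

small = [1, 2, 3]
print(round(population_sd(small) ** 2, 3))  # 0.667 (the variance)
print(round(population_sd(small), 3))       # 0.816
print(round(pstdev(small), 3))              # 0.816 (standard library cross-check)

# Darts scores: mean 83, SD about 49.6, so 2 SDs is about 99.3.
# The 180 is 97 above the mean, just inside 2 SDs, so nothing is flagged.
scores = [80, 100, 62, 180, 21, 55]
print(round(population_sd(scores), 1))  # 49.6
print(beyond_k_sd(scores))              # []
```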
Advantages
- Uses all values in a data set
- Provides a statistical measure with which to
identify outliers
- Useful for comparing two sets of data that have
similar means
Disadvantages
- Should only be used for data sets that have a normal
distribution
- More difficult and time consuming to calculate than the
range
- Like the mean, can be distorted by outliers