Descriptive statistics
Ana Jerončić
Telephone bills of 200 participants [EUR]
About half of the bills, (71 + 37) / 200 = 108 / 200 = 54%, are "small", i.e. less than 30 EUR.
Nearly a third of the bills, (18 + 28 + 14) / 200 = 60 / 200 = 30%, are greater than 75 EUR.
There are only a few telephone bills in the middle range.
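The percentages above are relative frequencies: bin counts divided by the total number of observations. A minimal sketch of that arithmetic, using only the counts quoted on the slide; the variable names are illustrative, and the counts of the individual middle bins are not given, so only their combined total is derived.

```python
# Counts quoted in the text for the histogram of 200 telephone bills.
total = 200
small_bins = [71, 37]        # bills under 30 EUR
large_bins = [18, 28, 14]    # bills over 75 EUR

small_share = sum(small_bins) / total
large_share = sum(large_bins) / total

# Whatever is neither "small" nor "large" falls in the middle range.
middle_count = total - sum(small_bins) - sum(large_bins)

print(f"small: {small_share:.0%}, large: {large_share:.0%}, middle: {middle_count} bills")
```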
Symmetry
A histogram is said to be symmetric if, when we draw a vertical line down the center of the histogram, the two sides are identical in shape and size.
A special type of symmetric unimodal histogram is one that is bell shaped.
Many statistical techniques require that the population be bell shaped.
Drawing the histogram helps verify the shape of the distribution in question.
[Figure: bell-shaped histogram; x-axis: Variable, y-axis: Frequency]
Skewness (asymmetry)
A skewed histogram is one with a long tail extending to either the right (positively skewed) or the left (negatively skewed).
[Figure: two histograms, one positively skewed and one negatively skewed; x-axis: Variable, y-axis: Frequency]
(left): Serum albumin values in 248 adults. FIG 2 (right): Normal distribution with the same mean and standard deviation as the serum albumin values.
Altman DG, Bland JM. BMJ 1995;310:298
©1995 by British Medical Journal Publishing Group
•Center of distribution
•Variability
•Shape


Parameters of central tendency
Statistics that show how different units seem similar:
Mean
Median
Mode


Parameters of statistical variability
Statistics that show how different units differ:
Standard deviation
Range
Percentiles


Mean
The arithmetic average of a set of numbers.
Calculated by adding all the data values together and then dividing the sum by the number of observations (sometimes referred to as n, or the sample size).
Observations: 3, 4, 5, 6, 7
Total sum: 3 + 4 + 5 + 6 + 7 = 25
Number of observations = 5
Mean = 25 / 5 = 5
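The same calculation can be written as a short function. This is a minimal sketch; the function name and the empty-input check are illustrative choices, not part of the slides.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    if not values:
        raise ValueError("mean of an empty data set is undefined")
    return sum(values) / len(values)


observations = [3, 4, 5, 6, 7]
print(mean(observations))  # 25 / 5, prints 5.0
```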

Calculate the mean of the following data:

A: 1, 2, 3, 3, 4, 5
Mean = (1 + 2 + 3 + 3 + 4 + 5) / 6 = 18 / 6 = 3

B: 1, 1, 1, 1, 2, 12
Mean = (1 + 1 + 1 + 1 + 2 + 12) / 6 = 18 / 6 = 3


The mean is the most commonly used measure of central tendency.
It is the central point around which the standard deviation is calculated.
Mean A = 3, Mean B = 3
- The mean is not a good descriptor of dataset B
- Outliers (e.g. the number 12) have a large influence, especially in small samples
The mean is not a good descriptor of data when the distribution is asymmetrical.
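One way to see the influence of an outlier is to recompute the mean after dropping the largest observation. The sketch below does this for datasets A and B from the exercise; the helper names are hypothetical and chosen only for illustration.

```python
def mean(values):
    return sum(values) / len(values)


def mean_without_largest(values):
    """Mean after dropping the single largest observation."""
    trimmed = sorted(values)[:-1]
    return sum(trimmed) / len(trimmed)


dataset_a = [1, 2, 3, 3, 4, 5]
dataset_b = [1, 1, 1, 1, 2, 12]   # 12 is the outlier named in the text

for name, data in (("A", dataset_a), ("B", dataset_b)):
    print(name, mean(data), mean_without_largest(data))
```

Both datasets have a mean of 3, but removing one value moves the mean of A only to 2.6, while the mean of B drops to 1.2.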
Median is in the Middle
Median: the middle number in a set of ordered numbers.
1, 3, 7, 10, 13
Median = 7
Step 1: Arrange the numbers in order from least to greatest.
21, 18, 24, 19, 27 becomes 18, 19, 21, 24, 27
Step 2: Find the middle number.
18, 19, 21, 24, 27
Median = 21
The median is the number that separates the lower-value half from the higher-value half of a sample or a population.
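The two steps can be sketched in code as follows. The slides only show an odd number of observations; averaging the two middle values when the count is even is the usual convention and is an assumption here.

```python
def median(values):
    """Middle value of the ordered data (mean of the two middle values if n is even)."""
    if not values:
        raise ValueError("median of an empty data set is undefined")
    ordered = sorted(values)      # Step 1: arrange from least to greatest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                # Step 2: pick the middle number
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([21, 18, 24, 19, 27]))  # prints 21
print(median([1, 3, 7, 10, 13]))     # prints 7
```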


Centre of the distribution
The numbers simply need to be put in order and the middle one is chosen.
Advantages:
1. More robust to outliers, and a better representative of a group in small samples
2. Suitable for skewed distributions
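Robustness to outliers can be illustrated by making the outlier more extreme and watching what each statistic does. The sketch below uses Python's standard `statistics` module; the outlier values 120 and 1200 are hypothetical and added only for illustration.

```python
from statistics import mean, median

base = [1, 1, 1, 1, 2]

# Append an increasingly extreme outlier and recompute both statistics.
for outlier in (12, 120, 1200):
    data = base + [outlier]
    print(f"outlier={outlier}: mean={mean(data):.1f}, median={median(data):.1f}")
```

The mean grows with the outlier (3, 21, 201), while the median stays at 1.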



Mode
The value that has the largest number of observations. In a bell curve distribution, the mode is at the peak.
Example: 2, 2, 2, 4, 5, 6, 7, 7, 7, 7, 8
7 is the most frequent observation (4 times)
Mode is 7
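Finding the mode amounts to counting how often each value occurs. A minimal sketch using `collections.Counter`; returning a list of modes, so that ties are reported rather than hidden, is a design choice made here and not something the slides specify.

```python
from collections import Counter

observations = [2, 2, 2, 4, 5, 6, 7, 7, 7, 7, 8]

counts = Counter(observations)          # value -> number of occurrences
highest = max(counts.values())
modes = [value for value, count in counts.items() if count == highest]

print(modes, highest)  # prints [7] 4
```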





It is not influenced by the sample size or by the magnitude of individual observations.
However, it may lie far from the mean or median.
Most useful in grouped or categorical data.
Rarely used.



Standard deviation
Range
Quantiles

Range
The smallest interval which contains all the data values.
Calculated by subtracting the smallest observation from the greatest.
Sensitive to outliers, because it depends on only two observations; represents quantitative data well when the sample size is large.
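As a quick illustration, the range of the two datasets from the mean exercise can be computed like this (the function name is illustrative):

```python
def value_range(values):
    """Greatest observation minus smallest observation."""
    return max(values) - min(values)


print(value_range([1, 2, 3, 3, 4, 5]))    # prints 4
print(value_range([1, 1, 1, 1, 2, 12]))   # prints 11
```

The single value 12 almost triples the range, which shows how strongly it depends on the extremes.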



The interquartile range (IQR) is the range of the middle 50% of the data in a distribution.
It is computed as follows:
IQR = 75th percentile - 25th percentile
The data are put in numerical order, and the lower and upper quarters of the data are discarded; the IQR is the range of what remains.
Advantage: reduces the risk of misrepresenting the data distribution due to outliers
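A sketch of the IQR calculation using `statistics.quantiles` from the Python standard library. Several conventions exist for computing percentiles; the `"inclusive"` method used here is one of them and is an assumption, as are the example data.

```python
from statistics import quantiles


def iqr(values):
    """75th percentile minus 25th percentile (inclusive quantile method)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 90]   # hypothetical outlier

for data in (clean, with_outlier):
    print("range:", max(data) - min(data), "IQR:", iqr(data))
```

Replacing 9 with 90 changes the range from 8 to 89, but the IQR stays the same because the extreme value lies outside the middle 50% of the data.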


Standard deviation
The most commonly used measure of data variability.
A measure of the typical distance of the data values from the mean.
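The slides do not give a formula, so the sketch below assumes the usual sample standard deviation: the square root of the sum of squared deviations from the mean, divided by n - 1. Dividing by n instead gives the population version.

```python
import math


def sample_sd(values):
    """Sample standard deviation (n - 1 in the denominator)."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are required")
    m = sum(values) / n
    squared_deviations = sum((x - m) ** 2 for x in values)
    return math.sqrt(squared_deviations / (n - 1))


# Observations from the mean example: deviations are -2, -1, 0, 1, 2.
print(round(sample_sd([3, 4, 5, 6, 7]), 3))  # sqrt(10 / 4), prints 1.581
```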

The standard deviation is an especially useful measure of data variability when the distribution is normal or approximately normal, because the proportion of the distribution within a given number of standard deviations from the mean can be calculated.
[Figure: normal curve with the mean marked and 68% of the data within 1 standard deviation of it]
In a normal distribution, approximately 68% of the distribution is within 1 standard deviation of the mean, and approximately 95% is within 2 standard deviations of the mean.
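These proportions can be checked numerically with `statistics.NormalDist` (Python 3.8+). The helper name below is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


print(f"{share_within(1):.1%}")  # prints 68.3%
print(f"{share_within(2):.1%}")  # prints 95.4%
```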



Example
If you observe a normal distribution of your variable with a mean of 50 and a standard deviation of 10, then about 68% of the distribution would be between 50 - 10 = 40 and 50 + 10 = 60.
Similarly, about 95% of the distribution would be between 50 - 2 x 10 = 30 and 50 + 2 x 10 = 70.




The figure shows two normal distributions.
Both distributions have means of 50.
The blue distribution has a standard deviation of 5; the red distribution has a standard deviation of 10.
For the blue distribution, 68% of the distribution is between 45 and 55; for the red distribution, 68% is between 40 and 60.
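A short sketch that reproduces these intervals with `statistics.NormalDist`. The names `blue` and `red` simply mirror the colours in the figure.

```python
from statistics import NormalDist

blue = NormalDist(mu=50, sigma=5)
red = NormalDist(mu=50, sigma=10)

for name, dist in (("blue", blue), ("red", red)):
    # Interval of one standard deviation either side of the mean.
    low, high = dist.mean - dist.stdev, dist.mean + dist.stdev
    share = dist.cdf(high) - dist.cdf(low)
    print(f"{name}: {low:g} to {high:g} holds {share:.0%}")
```

The interval is twice as wide for the red distribution, but in both cases it holds the same share of the distribution, about 68%.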