Descriptive Statistics
Prof. David Ramsey
e-mail: [email protected]
Home page: www.ioz.pwr.edu.pl/pracownicy/ramsey
February 27, 2017
Brief Course Outline
1. Data Collection and Descriptive Statistics.
2. Probability Theory
3. Statistical Methods of Estimation
1 Data Collection and Descriptive Statistics
Populations of objects and individuals show variation with respect
to various traits (e.g. height, political preferences, the working life
of a light bulb, investment risk).
It is impractical (or impossible) to observe all the members of the
population. In order to describe the distribution of a trait in the
population, we select a sample.
On the basis of the sample we gain information on the population
as a whole.
1.1 Types of Variables
Qualitative variables: These are normally categorical
(non-numerical) variables. We distinguish between two types of
qualitative variables:
a) nominal: these variables are not naturally ordered in any way
(e.g. i. department - mechanical engineering, mathematics,
economics ii. industrial sector).
b) ordinal: there is a natural order for such categorisations e.g.
with respect to smoking, people may be categorised as 1:
non-smokers, 2: light smokers and 3: heavy smokers.
It can be seen that the higher the category number, the more an
individual smokes.
Exam grades are ordinal variables.
Quantitative Variables
These are variables which naturally take numerical values (e.g.
age, height, number of children). Such variables can be measured
or counted.
As before, we distinguish between two types of quantitative
variables.
a) Discrete variables: these are variables that take values from a
set that can be listed (most commonly integer values, i.e. they are
variables that can be counted). For example, number of children,
the results of die rolls.
b) Continuous variables
These are variables that can take values in a given range to any
number of decimal places (such variables are measured, normally
according to some unit e.g. height, age, weight).
It should be noted that such variables are only measured to a given
accuracy (e.g. height is measured to the nearest centimetre, age is
normally given to the nearest year).
If a discrete variable takes many values (e.g. the
population of a town), then for practical purposes it is treated as a
continuous variable.
1.2 Collection of data
Since it is impractical to survey all the individuals in a population,
we need to base our analysis on a sample.
A population is the entire collection of individuals or objects we
wish to describe.
A sample is the subset of the population chosen for data
collection.
A sampling frame is a list that is used to choose a sample (e.g.
electoral register, telephone book, list of addresses).
A unit is a member of the population.
A variable is any trait that varies from unit to unit.
The sample size is the number of individuals in a sample and is
denoted by n.
1.2.1 Parameters and Statistics
A parameter is an unknown number describing a population. For
example, it may be that 37% of the population of eligible voters
(the electorate) wish to vote for PiS (we do not, however, observe
this population proportion).
A statistic is a number describing a sample. For example, 35% of
a sample may wish to vote for PiS. This is the sample proportion.
Statistics may be used to describe a population, but they only
estimate the real parameters of the population.
Parameters and Statistics - Precision of Statistics
Naturally, the statistics from a sample will show some variation
around the appropriate parameters
e.g. 37% of the population wish to vote for PiS, but only 35% in
the sample.
The greater the sample size, the more precise the results: if we
take a large number of samples of size n, then the larger n is, the
less the sample proportion varies from sample to sample, i.e. the
more replicable the results.
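The effect of sample size on precision can be illustrated by simulation. The following is a minimal sketch, assuming a population proportion of 0.37 (the figure used in the example above). The function name, the number of repetitions and the seed are illustrative choices, not part of the course material.

```python
import random
import statistics


def spread_of_sample_proportions(p, n, repetitions=500, seed=1):
    """Standard deviation of the sample proportion over many samples of size n.

    p is the (normally unobserved) population proportion. The values of
    repetitions and seed are arbitrary choices made for this illustration.
    """
    rng = random.Random(seed)
    proportions = []
    for _ in range(repetitions):
        # Each sampled unit supports the party with probability p.
        supporters = sum(rng.random() < p for _ in range(n))
        proportions.append(supporters / n)
    return statistics.stdev(proportions)


spread_small = spread_of_sample_proportions(0.37, n=100)
spread_large = spread_of_sample_proportions(0.37, n=1000)
print(spread_small, spread_large)
```

The sample proportions from samples of size 1000 should be noticeably less variable than those from samples of size 100.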
Parameters and Statistics - Bias
However, there may be intrinsic bias from two possible sources:
a) Sampling bias - when a sample is chosen in a way
such that some members of the population are more
likely to be chosen than others. e.g. The support for
PO is stronger in urban centres and in the west of
Poland. Hence, if we carry out a survey in Wroclaw
and use this to estimate the support for PO in
Poland as a whole, we will tend to overestimate
support for PO.
Parameters and Statistics - Bias
b) Non-sampling bias - this results from mistakes in
data entry and/or how interviewees react to being
questioned. For example, it has been noted that
support of the government is often underestimated
by opinion polls. This may well be due to the fact
that supporters of the government are more likely to
hide their preferences.
An Example of Sampling Bias - Estimation of the
Population Mean
Sampling bias can be eliminated by choosing the sample in a more
appropriate way. Non-sampling bias cannot be eliminated in this way.
e.g. Suppose the population of interest is the Polish population as
a whole and the variable of interest is height.
Suppose I base my estimate of the mean height of the population
on the mean height of a sample of students (i.e. the sampling
frame or means of selecting a sample is inappropriate). Since
students tend to be on average taller than the population as a
whole, I will systematically overestimate the mean height in the
population.
That is to say, if I consider many samples of students of say size
100, a large majority of such samples would give me an
overestimate of the mean height of the population as a whole.
Non-Sampling Bias
The following may be sources of non-sampling bias:
1. Lack of anonymity.
2. The wording of a question.
3. The desire to give an answer that would please the
interviewer. For example, surveys may systematically
overestimate the willingness of individuals to pay
extra for environmentally friendly goods, as stating
that you are prepared to pay more is seen to be the
"politically correct" answer.
Precision and Bias
It should be noted that bias is a characteristic of the way in which
data are collected, not a single sample.
Increasing the sample size will improve the precision of an
estimate, but will not affect the bias.
Returning to the example of height: as the sample size increases,
the sample mean becomes more replicable. However, if we are
estimating the mean height of the entire population based on
samples of students, there will always be a tendency to
overestimate the mean height of the population.
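The difference between precision and bias can be sketched by simulation. All the numbers below (the mean heights of 170 cm and 175 cm, the standard deviation of 10 cm, the normal model for height) are hypothetical values chosen only for illustration. They are not data from the course.

```python
import random
import statistics

# Hypothetical values chosen only for illustration (not real data):
POPULATION_MEAN = 170.0   # assumed mean height of the whole population (cm)
STUDENT_MEAN = 175.0      # assumed mean height of students (cm)
HEIGHT_SD = 10.0          # assumed standard deviation of height (cm)


def student_sample_means(n, repetitions=200, seed=2):
    """Means of repeated samples of n students (a biased sampling frame)."""
    rng = random.Random(seed)
    means = []
    for _ in range(repetitions):
        sample = [rng.gauss(STUDENT_MEAN, HEIGHT_SD) for _ in range(n)]
        means.append(statistics.mean(sample))
    return means


results = {}
for n in (25, 400):
    means = student_sample_means(n)
    spread = statistics.stdev(means)                   # precision of the estimate
    bias = statistics.mean(means) - POPULATION_MEAN    # systematic error
    results[n] = (spread, bias)
    print(n, spread, bias)
```

Under these assumptions the spread of the sample means shrinks as n grows, while the average overestimate stays at about 5 cm.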
Sampling - Conclusion
In order for a survey to give accurate results, it is not sufficient
that the sample is large. We should ask the following questions:
1. Is the sample representative of the required
population? (sampling bias).
2. Is the proportion of non-respondents large? If so, are
non-respondents likely to differ from respondents and
how? (non-sampling bias).
3. Are the answers to questions reliable? (possibility of
"political correctness", reaction to the questioner, i.e.
non-sampling bias).
4. Who is publishing the results? (they are likely to
stress what looks good to them).
1.3 Descriptive Statistics - 1.3.1 Qualitative (Categorical) Data
We may describe qualitative data using
a) Frequency tables.
b) Bar charts.
c) Pie charts.
Frequency tables
Frequency tables display how many observations fall into each
category (the frequency column), as well as the relative frequency
of each category (the proportion of observations falling into each
category).
Let $n_i$ denote the number of observations in category i. The
relative frequency of category i in percentage terms is $f_i$, where
$$f_i = \frac{100 n_i}{n}.$$
If there are missing data, we may also give the relative frequencies
in terms of the actual number of observations, $n'$, i.e.
$$f_i' = \frac{100 n_i}{n'}.$$
Frequency tables
For example 200 students were asked which of the following
musical acts they preferred: Arctic Monkeys, Alt-J or Adele. The
answers may be presented in the following frequency table:
Band             Frequency   Relative Frequency (%)
Arctic Monkeys   62          62 × 100/200 = 31
Alt-J            66          66 × 100/200 = 33
Adele            72          72 × 100/200 = 36
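The frequency table above can be reproduced with a few lines of Python. This is a minimal sketch; the function name `relative_frequencies` is an illustrative choice.

```python
def relative_frequencies(counts):
    """Map each category to its relative frequency in percent, f_i = 100 n_i / n."""
    n = sum(counts.values())
    return {category: 100 * n_i / n for category, n_i in counts.items()}


counts = {"Arctic Monkeys": 62, "Alt-J": 66, "Adele": 72}
percentages = relative_frequencies(counts)
for band, n_i in counts.items():
    # One row per category: name, frequency, relative frequency (%)
    print(f"{band:15s} {n_i:4d} {percentages[band]:6.1f}")
```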
Bar chart
In a bar chart the height of a bar represents the relative frequency
of a given category (or the number of observations in that
category).
Pie chart
The size of a slice in a pie chart represents the relative frequency
of a category. 100% corresponds to 360 degrees. Hence, the angle
made by the slice representing category i is given (in degrees) by
$\alpha_i$, where
$$\alpha_i = 3.6 f_i = 3.6 \times \frac{100 n_i}{n} = \frac{360 n_i}{n}.$$
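As a quick check of the angle formula, the following sketch computes the slice angles for the musical acts example (62, 66 and 72 votes out of 200). The function name `slice_angles` is an illustrative choice.

```python
def slice_angles(counts):
    """Angle in degrees of each pie slice, alpha_i = 360 n_i / n."""
    n = sum(counts.values())
    return {category: 360 * n_i / n for category, n_i in counts.items()}


counts = {"Arctic Monkeys": 62, "Alt-J": 66, "Adele": 72}
angles = slice_angles(counts)
for band, angle in angles.items():
    print(f"{band:15s} {angle:6.1f} degrees")
```

The angles are approximately 111.6, 118.8 and 129.6 degrees, which sum to 360.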
Pie chart
These graphs were obtained using the SPSS (PASW) package.
1.3.2 Graphical Presentation of Quantitative Data
Discrete data can be presented in the form of frequency tables
and/or bar charts (as above).
The distribution of continuous data can be presented using a
histogram.
The histogram estimates the probability density function of a
continuous random variable (see later).
Histograms for continuous variables
In order to draw a histogram for a continuous variable, we need to
categorise the data into intervals of equal length. The end points
of these intervals should be round numbers.
The number of categories used should be approximately $\sqrt{n}$
(normally between 5 and 20 categories are used).
For example, if we have 30 observations then we should use about
$\sqrt{30} \approx 5.5$ categories. Hence, 5 and 6 are sensible choices for the
number of categories.
Let k be the number of categories.
Histograms
In order to choose the length of each interval, L, we use
$$L \approx \frac{x_{\max} - x_{\min}}{k},$$
where $x_{\max}$ is the smallest "round" number larger than all the
observations and $x_{\min}$ is the largest "round" number smaller than
all the observations.
If necessary, we can "fine tune" by rounding L upwards, so that the
intervals are of "nice" length and the whole range of the data is
covered.
Histograms
The intervals used are
$[x_{\min}, x_{\min} + L],\ (x_{\min} + L, x_{\min} + 2L],\ \ldots,\ (x_{\max} - L, x_{\max}]$.
Apart from the first interval, the lower end-point of an interval is
assumed not to belong to that interval (to avoid a number belonging
to two classes).
Histograms
A histogram is very similar to a bar chart. The height of the block
corresponding to an interval is the relative frequency of
observations in that block.
Thus, the height of a block is the number of observations in that
interval divided by the total number of observations.
Example 1.1
We observe the height of 20 individuals (in cm). The data are
given below
172, 165, 188, 162, 178, 183, 171, 158, 174, 184,
167, 175, 192, 170, 179, 187, 163, 156, 178, 182.
Draw a histogram representing these data.
Example 1.1
First we choose the number of classes and the corresponding
intervals.
$\sqrt{20} \approx 4.5$, thus we should choose 4 or 5 intervals.
Example 1.1
The tallest individual is 192cm tall and the shortest 156cm.
200cm is the smallest round number larger than all the
observations and 150cm is the largest round number smaller than
all the observations.
To calculate the length of the intervals, we use
$$L = \frac{200 - 150}{k}.$$
Taking k = 4, L = 12.5. Taking k = 5, L = 10 (a "nicer"
length).
Hence, it seems reasonable to use 5 intervals of length 10, starting
at 150.
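The choice of intervals in this example can be sketched in code. The notion of a "round" number is informal, so the sketch assumes that round numbers are multiples of a `step` of 10. This parameter and the function name `choose_classes` are assumptions made for illustration.

```python
import math


def choose_classes(data, k, step=10):
    """Return (x_min, x_max, L) for a histogram with k classes.

    step is the spacing of the "round" numbers (an assumption of this sketch).
    x_min is the largest multiple of step smaller than all the observations,
    x_max is the smallest multiple of step larger than all the observations.
    """
    x_min = math.ceil(min(data) / step - 1) * step
    x_max = math.floor(max(data) / step + 1) * step
    return x_min, x_max, (x_max - x_min) / k


heights = [172, 165, 188, 162, 178, 183, 171, 158, 174, 184,
           167, 175, 192, 170, 179, 187, 163, 156, 178, 182]

print(math.sqrt(len(heights)))       # about 4.5, so try k = 4 and k = 5
print(choose_classes(heights, k=4))  # (150, 200, 12.5)
print(choose_classes(heights, k=5))  # (150, 200, 10.0)
```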
Example 1.1
If we assume that the upper endpoint of an interval belongs to
that interval, then we have the intervals [150,160], (160, 170],
(170,180], (180,190], (190,200].
Now we count how many observations fall into each interval and
hence the relative frequency of observations in each interval.
Example 1.1
Height (x)       No. of Observations   Rel. Frequency (%)
150 ≤ x ≤ 160    2                     100 × 2/20 = 10
160 < x ≤ 170    5                     100 × 5/20 = 25
170 < x ≤ 180    7                     100 × 7/20 = 35
180 < x ≤ 190    5                     100 × 5/20 = 25
190 < x ≤ 200    1                     100 × 1/20 = 5
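The counts in this table can be checked with a short sketch that assigns each observation to its interval, with upper end-points belonging to their interval as assumed above. The function name `bin_counts` is an illustrative choice, and the sketch assumes every observation lies between 150 and 200.

```python
import math


def bin_counts(data, x_min, width, k):
    """Count observations in [x_min, x_min+L], (x_min+L, x_min+2L], ... (k classes)."""
    counts = [0] * k
    for x in data:
        # Upper end-points belong to their interval; x_min itself goes
        # into the first class.
        index = max(0, math.ceil((x - x_min) / width) - 1)
        counts[index] += 1
    return counts


heights = [172, 165, 188, 162, 178, 183, 171, 158, 174, 184,
           167, 175, 192, 170, 179, 187, 163, 156, 178, 182]

counts = bin_counts(heights, x_min=150, width=10, k=5)
percentages = [100 * c / len(heights) for c in counts]
print(counts)        # [2, 5, 7, 5, 1]
print(percentages)   # [10.0, 25.0, 35.0, 25.0, 5.0]
```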
Example 1.1
The histogram is given below:
Interpretation of the histogram of a continuous variable
A histogram is an estimator of the density function of a variable
(see the chapter on the distribution of random variables in Section
2).
The distribution of height seems to be reasonably symmetrical
around 175cm.
1.3.3 Symmetry and Skewness of Distributions
From a histogram we may infer whether the distribution of a
random variable is symmetric or not.
The histogram of height shows that the distribution is reasonably
symmetric (even if the distribution of height in the population were
symmetric, we would normally observe some small deviation from
symmetry in the histogram, as we observe only a sample).
Right-Skewed distributions
A distribution is said to be right-skewed if there are observations a
long way to the right of the "centre" of the distribution, but not a
long way to the left.
The distribution of wages is right-skewed, since a small proportion
of individuals will earn several times more than the mean wage.
A right-skewed distribution
Left-skewed distributions
A distribution is said to be left-skewed if there are observations a
long way to the left of the "centre" of the distribution, but not a
long way to the right.
For example, the distribution of the lifetime of individuals in
Western countries is left-skewed.
This is due to the fact that the majority of individuals will live
for between 70 and 100 years.
No-one will live much longer, but a minority of individuals will die
at a young age.
A Left-skewed Distribution
1.4 Numerical Methods of Describing Quantitative Data
We consider two types of measure:
1. Measures of centrality - give information regarding
the location of the centre of the distribution (the
mean, median).
2. Measures of variability (dispersion) - give information
regarding the level of variation (the range, variance,
standard deviation, interquartile range).
1.4.1 Measures of centrality
1. The Sample Mean, $\bar{x}$.
Suppose we have a sample of n observations. The mean is given by
the sum of the observations divided by the number of observations:
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i,$$
where $x_i$ is the value of the i-th observation.
The Population Mean
$\mu$ denotes the population mean. If there are N units in the
population, then
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N},$$
where $x_i$ is the value of the trait for individual i in the population.
$\mu$ is normally unknown. The sample mean $\bar{x}$ (a statistic) is an
estimator of the population mean $\mu$ (a parameter).
2. The sample median $Q_2$
In order to calculate the sample median, we first order the
observations from the smallest to the largest. The order statistic
$x_{(i)}$ is the i-th smallest observation in a sample (i.e. $x_{(1)}$ is the
smallest observation and $x_{(n)}$ is the largest observation).
The notation for the median comes from the fact that the median
is the second quartile (see quartiles in the section on measures of
dispersion).
The sample median $Q_2$
If n is odd, then the median is the observation which appears in
the centre of the ordered list of observations. Hence,
$$Q_2 = x_{(0.5[n+1])}.$$
If n is even, then the median is the average of the two observations
which appear in the centre of the ordered list of observations.
Hence,
$$Q_2 = 0.5[x_{(0.5n)} + x_{(0.5n+1)}].$$
One half of the observations are smaller than the median and one
half are greater.
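The two cases of this definition can be written as a short function. This is a minimal sketch; the function name `sample_median` and the two small data sets are illustrative only.

```python
def sample_median(observations):
    """Sample median Q2 computed from the order statistics x_(1) <= ... <= x_(n)."""
    ordered = sorted(observations)
    n = len(ordered)
    if n % 2 == 1:
        # n odd: Q2 = x_((n+1)/2); lists are 0-indexed, so subtract 1
        return ordered[(n + 1) // 2 - 1]
    # n even: Q2 = 0.5 [x_(n/2) + x_(n/2 + 1)]
    return 0.5 * (ordered[n // 2 - 1] + ordered[n // 2])


print(sample_median([3, 1, 2]))      # 2
print(sample_median([4, 1, 3, 2]))   # 2.5
```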
The sample median
One advantage of the median as a measure of centrality is that it
is less sensitive to extreme observations (which may be errors) than
the mean. When the distribution is skewed, it is preferable to use
the median as a measure of centrality.
e.g. the median wage rather than the average wage should be used
as a measure of what the "average man on the street" earns.
The distribution of wages is right-skewed and the small proportion
of people who earn very high wages will have a significant effect on
the mean. Hence, for right-skewed distributions the mean is greater
than the median.
For left-skewed distributions the mean is less than the median.
1.4.2 Measures of Dispersion - 1. The Range
The range is defined to be the largest observation minus the
smallest observation.
Since the range is only based on 2 observations it conveys little
information and is sensitive to extreme values (errors).
2. The sample variance $s_{n-1}^2$
The sample variance is a measure of the average squared distance
from the mean.
The formula for the sample variance $s_{n-1}^2$ is given by
$$s_{n-1}^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2.$$
$s_{n-1}^2 \geq 0$, and $s_{n-1}^2 = 0$ if and only if all the observations are equal
to each other.
3. The sample standard deviation $s_{n-1}$
The sample standard deviation is given by the square root of the
sample variance.
It (and hence the sample variance) can be calculated on a scientific
calculator by using the $\sigma_{n-1}$ or $s_{n-1}$ function as appropriate.
In simple terms, the standard deviation is a measure of the average
distance of an observation from the mean. It cannot exceed the
maximum deviation from the mean by more than a factor of $\sqrt{n/(n-1)}$.
4. The interquartile range
The i-th quartile, $Q_i$, is taken to be the value such that i quarters
of the observations are less than $Q_i$. Thus, $Q_2$ is the sample
median.
If $\frac{n+1}{4}$ is an integer, then the lower quartile $Q_1$ is given by
$$Q_1 = x_{\left(\frac{n+1}{4}\right)}.$$
Otherwise, if a is the integer part of $\frac{n+1}{4}$ [this is obtained by simply
removing everything after the decimal point], then
$$Q_1 = 0.5[x_{(a)} + x_{(a+1)}].$$
The interquartile range
If $\frac{3n+3}{4}$ is an integer, then the upper quartile $Q_3$ is given by
$$Q_3 = x_{\left(\frac{3n+3}{4}\right)}.$$
Otherwise, if b is the integer part of $\frac{3n+3}{4}$, then
$$Q_3 = 0.5[x_{(b)} + x_{(b+1)}].$$
The interquartile range (IQR) is the difference between the upper
and lower quartiles:
$$\mathrm{IQR} = Q_3 - Q_1.$$
Choice of the measure of dispersion
The units of all the measures used so far (except for the variance)
are the same units as those used for the measurement of
observations. The units of variance are the square of the units of
measurement.
For example, if we observe velocity in metres per second, the
variance is measured in metres squared per second squared. For
this reason the standard deviation is generally preferred to the
variance as a measure of dispersion.
If a distribution is skewed then the interquartile range is a more
reliable measure of the dispersion of a random variable than the
standard deviation.
Comparison of the dispersion of two variables
Sometimes we wish to compare the dispersion of two positive
variables.
In cases where different units are used to measure the two variables
or the means of two variables are very different, it may be useful to
use a measure of dispersion which does not depend on the units in
which it is measured.
The coefficient of variation, C.V., does not depend on the units of
measurement. It is the standard deviation divided by the sample
mean:
$$\mathrm{C.V.} = \frac{s_{n-1}}{\bar{x}}.$$
Example 1.2 - The sample mean
Calculate the measures of centrality and dispersion defined above
for the following data.
6, 9, 12, 9, 8, 10
There are 6 items of data, hence
$$\bar{x} = \frac{\sum_{i=1}^{6} x_i}{6} = \frac{6 + 9 + 12 + 9 + 8 + 10}{6} = 9.$$
Example 1.2 - The sample median
In order to calculate the median, we first order the data. If an
observation occurs k times, then it must appear k times in the list
of ordered data.
The ordered list of data is 6, 8, 9, 9, 10, 12.
Since there is an even number of data (n = 6), the median is the
average of the two observations in the middle of this ordered list.
Hence,
$$Q_2 = 0.5[x_{(n/2)} + x_{(1+n/2)}] = 0.5[x_{(3)} + x_{(4)}] = \frac{9 + 9}{2} = 9.$$
Example 1.2 - The range
The range is the difference between the largest and the smallest
observations
Range = 12 − 6 = 6.
Example 1.2 - The variance and standard deviation
The variance is given by
$$s_{n-1}^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{(6-9)^2 + (9-9)^2 + (12-9)^2 + (9-9)^2 + (8-9)^2 + (10-9)^2}{5} = 4.$$
The standard deviation is given by $s_{n-1} = \sqrt{s_{n-1}^2} = 2$.
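These hand calculations can be checked with a short sketch. The function names `sample_mean` and `sample_variance` are illustrative. The variance uses the divisor n - 1, as in the formula above.

```python
import math


def sample_mean(xs):
    """Sum of the observations divided by the number of observations."""
    return sum(xs) / len(xs)


def sample_variance(xs):
    """Sample variance with divisor n - 1."""
    mean = sample_mean(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)


data = [6, 9, 12, 9, 8, 10]
mean = sample_mean(data)            # 9.0
variance = sample_variance(data)    # 4.0
std_dev = math.sqrt(variance)       # 2.0
print(mean, variance, std_dev)
```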
Example 1.2 - The interquartile range
In order to calculate the interquartile range, we first calculate the
lower and upper quartiles. n = 6, hence $\frac{n+1}{4} = 1.75$. The integer
part of this number is 1. Hence, the lower quartile is
$$Q_1 = 0.5[x_{(1)} + x_{(2)}] = 0.5(6 + 8) = 7.$$
Similarly, $\frac{3n+3}{4} = 5.25$. The integer part of this number is 5.
Hence, the upper quartile is
$$Q_3 = 0.5[x_{(5)} + x_{(6)}] = 0.5(10 + 12) = 11.$$
Hence,
$$\mathrm{IQR} = 11 - 7 = 4.$$
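The quartile rule used in this course can be sketched as follows. Statistical packages use several different quartile conventions, so library functions may give slightly different values. The function names are illustrative, and the sketch assumes a sample of at least 3 observations.

```python
def order_statistic(ordered, i):
    """x_(i): the i-th smallest observation (1-indexed)."""
    return ordered[i - 1]


def quartile_by_position(ordered, position):
    """Apply the rule above at position (n+1)/4 or (3n+3)/4."""
    a = int(position)  # integer part of the position
    if position == a:
        return order_statistic(ordered, a)
    return 0.5 * (order_statistic(ordered, a) + order_statistic(ordered, a + 1))


def quartiles(observations):
    """Return (Q1, Q3) for a sample of at least 3 observations."""
    ordered = sorted(observations)
    n = len(ordered)
    q1 = quartile_by_position(ordered, (n + 1) / 4)
    q3 = quartile_by_position(ordered, (3 * n + 3) / 4)
    return q1, q3


q1, q3 = quartiles([6, 9, 12, 9, 8, 10])
iqr = q3 - q1
print(q1, q3, iqr)   # 7.0 11.0 4.0
```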
Example 1.2 - The coefficient of variation
$$\mathrm{C.V.} = \frac{s_{n-1}}{\bar{x}} = \frac{2}{9} \approx 0.22.$$
A coefficient of variation above 1 is generally regarded as very large
(such variation may occur in the case of wages when wage inequality is
high).
With regard to physical traits in a species, values for the coefficient
of variation of around 0.1 to 0.3 are common (in humans the
coefficient of variation of height is around 0.1, the coefficient of
variation for weight is somewhat bigger).
1.5 Measures of Location and Dispersion for Grouped Data
- a) Discrete Random Variables
A die was rolled 100 times and the following data were obtained
Result                1    2    3    4    5    6
No. of observations   15   18   20   14   15   18
Grouped discrete data
Suppose the possible results are $\{x_1, x_2, \ldots, x_k\}$ and the result $x_i$
occurs $f_i$ times.
The total number of observations is
$$n = \sum_{i=1}^{k} f_i.$$
The sum of the observations is given by
$$\sum_{i=1}^{k} x_i f_i.$$
Grouped discrete data
It follows that the sample mean is given by
$$\bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{n}.$$
The variance of the observations is given by
$$s_{n-1}^2 = \frac{1}{n-1} \sum_{i=1}^{k} f_i (x_i - \bar{x})^2.$$
Grouped discrete data
The following table is useful in calculating the sample mean
x_i    f_i    f_i x_i
1      15     15
2      18     36
3      20     60
4      14     56
5      15     75
6      18     108
Σ      100    350
Hence, the sample mean is $\bar{x} = \frac{350}{100} = 3.5$.
Grouped discrete data
Once the mean has been calculated, we can add two columns for
$(x_i - \bar{x})^2$ and $f_i (x_i - \bar{x})^2$:
x_i    f_i    f_i x_i    (x_i − x̄)^2    f_i (x_i − x̄)^2
1      15     15         2.5^2          15 × 2.5^2 = 93.75
2      18     36         1.5^2          18 × 1.5^2 = 40.5
3      20     60         0.5^2          20 × 0.5^2 = 5
4      14     56         0.5^2          14 × 0.5^2 = 3.5
5      15     75         1.5^2          15 × 1.5^2 = 33.75
6      18     108        2.5^2          18 × 2.5^2 = 112.5
Σ      100    350                       289
$\bar{x} = \frac{350}{100} = 3.5$, $s_{n-1}^2 = \frac{289}{99} \approx 2.92$
Grouped discrete data
The sample variance is given by
$$s_{n-1}^2 = \frac{1}{n-1} \sum_{i=1}^{k} f_i (x_i - \bar{x})^2 = \frac{289}{99} \approx 2.92.$$
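The grouped-data formulas can be checked against the die data with a short sketch. The function name `grouped_mean_and_variance` is an illustrative choice.

```python
def grouped_mean_and_variance(values, frequencies):
    """Return (n, mean, variance) for values x_i observed f_i times each.

    The variance uses the divisor n - 1.
    """
    n = sum(frequencies)
    mean = sum(f * x for x, f in zip(values, frequencies)) / n
    sum_of_squares = sum(f * (x - mean) ** 2 for x, f in zip(values, frequencies))
    return n, mean, sum_of_squares / (n - 1)


results = [1, 2, 3, 4, 5, 6]
observed = [15, 18, 20, 14, 15, 18]
n, mean, variance = grouped_mean_and_variance(results, observed)
print(n, mean, round(variance, 2))   # 100 3.5 2.92
```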
Calculation of the sample median for grouped discrete data
In this case we know the exact values of the observations and hence
we can order the data. In this way we can calculate the median.
Since there are 100 observations, the median is
$$Q_2 = 0.5[x_{(50)} + x_{(51)}].$$
Calculation of the sample median for grouped discrete data
The 15 smallest observations are equal to 1, i.e.
$$x_{(1)} = x_{(2)} = \ldots = x_{(15)} = 1.$$
The next 18 smallest observations are equal to 2, i.e.
$$x_{(16)} = x_{(17)} = \ldots = x_{(33)} = 2.$$
The next 20 smallest observations are all equal to 3, i.e.
$$x_{(34)} = x_{(35)} = \ldots = x_{(53)} = 3.$$
Calculation of the sample median for grouped discrete data
It follows that
$$x_{(50)} = x_{(51)} = 3.$$
Hence,
$$Q_2 = 0.5[x_{(50)} + x_{(51)}] = 3.$$
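The counting argument above amounts to accumulating frequencies until the required position is reached. A minimal sketch follows, with illustrative function names and with the values assumed to be listed in increasing order.

```python
def grouped_order_statistic(values, frequencies, i):
    """x_(i) when the values (in increasing order) occur with the given frequencies."""
    cumulative = 0
    for x, f in zip(values, frequencies):
        cumulative += f
        if i <= cumulative:
            return x
    raise ValueError("i exceeds the number of observations")


def grouped_median(values, frequencies):
    """Sample median Q2 of grouped discrete data."""
    n = sum(frequencies)
    if n % 2 == 1:
        return grouped_order_statistic(values, frequencies, (n + 1) // 2)
    lower = grouped_order_statistic(values, frequencies, n // 2)
    upper = grouped_order_statistic(values, frequencies, n // 2 + 1)
    return 0.5 * (lower + upper)


median = grouped_median([1, 2, 3, 4, 5, 6], [15, 18, 20, 14, 15, 18])
print(median)   # 3.0
```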
1.5 Measures of Location and Dispersion for Grouped Data
- b) Continuous Random Variables
In such cases we have data grouped into intervals. Let $x_i$ be the
centre of the i-th interval and $f_i$ the number of observations in the
i-th interval.
The approach to calculating the sample mean and variance is the
same as in the case of discrete data. In order to carry out the
calculations, we assume that each observation is in the middle of
the appropriate interval.
Example 1.4
Consider the grouped data from Example 1.1
Height (x)       x_i    f_i    f_i x_i
150 ≤ x ≤ 160    155    2      310
160 < x ≤ 170    165    5      825
170 < x ≤ 180    175    7      1225
180 < x ≤ 190    185    5      925
190 < x ≤ 200    195    1      195
Σ                -      20     3480
Thus, the sample mean is $\bar{x} = \frac{3480}{20} = 174$.
Example 1.4
Now we can add the remaining 2 columns of the table.
x_i    f_i    f_i x_i    (x_i − x̄)^2    f_i (x_i − x̄)^2
155    2      310        19^2           2 × 19^2 = 722
165    5      825        9^2            5 × 9^2 = 405
175    7      1225       1^2            7 × 1^2 = 7
185    5      925        11^2           5 × 11^2 = 605
195    1      195        21^2           1 × 21^2 = 441
Σ      20     3480                      2180
$\bar{x} = \frac{3480}{20} = 174$, $s_{n-1}^2 = \frac{2180}{19} \approx 114.74$
The variance is $s_{n-1}^2 = \frac{2180}{19} \approx 114.74$.
The standard deviation is $s_{n-1} \approx \sqrt{114.74} \approx 10.71$.
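The same calculation can be sketched in code, treating every observation as if it lay at the centre of its interval (the assumption stated above). The variable names are illustrative.

```python
import math

# Intervals and frequencies from the grouped height data
intervals = [(150, 160), (160, 170), (170, 180), (180, 190), (190, 200)]
frequencies = [2, 5, 7, 5, 1]

# Each observation is assumed to lie at the centre of its interval.
midpoints = [(lower + upper) / 2 for lower, upper in intervals]

n = sum(frequencies)
mean = sum(f * x for x, f in zip(midpoints, frequencies)) / n
variance = sum(f * (x - mean) ** 2 for x, f in zip(midpoints, frequencies)) / (n - 1)
std_dev = math.sqrt(variance)

print(mean, round(variance, 2), round(std_dev, 2))   # 174.0 114.74 10.71
```

The mean of the 20 original observations is 174.2, so the approximation from the grouped data (174) is close in this case.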