Chapter 7
What to do when you
have the data
We saw in the previous chapters
how to collect data. We will
spend the rest of this course
looking at how to analyse the
data that we have collected.
Stem and Leaf Diagrams
Stem and Leaf Diagrams are
graphical ways to display a group of
integers in a dataset.
Steps for Constructing a Stem and
Leaf Diagram
1. Select one or more of the leading
digits to be the Stem values; the
remaining digits become the Leaves.
2. List possible Stem values in a
vertical column.
3. Record the Leaf for every
observation beside the corresponding
Stem value.
4. Indicate on the display what units
are used for the Stems and Leaves.
Example The following are a
selection of exam marks
71 52 52 75 64 60 48 56
67 29 11 53 25 46 58 46
49 62 66 40 19 54 57 54
60 19 59 43 51 40 21 45
46 62 73 59 36 45 55 46
45 32 55 46 51 46 65 49 61 40
A Stem And Leaf Diagram will look
like this:
 1 199
 2 159
 3 26
 4 0003555666666899
 5 11223445567899
 6 001224567
 7 135
Histogram for Discrete Numerical Data
1. Draw a horizontal X-axis and on it
mark the possible values taken by
the observations.
2. Draw a vertical Y-axis marked
with either relative frequencies or
frequencies.
3. Above each possible value on the
X-axis draw a rectangle centred on
the value with width 1 and height
equal to the relative frequency or
frequency of that value.
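The three steps can be sketched in a few lines of Python. The observations below are hypothetical (they do not come from the course notes), and the "rectangles" are drawn sideways as rows of `#` characters rather than as a real plot.

```python
from collections import Counter

# Hypothetical discrete observations (an assumption, not course data).
observations = [0, 1, 1, 2, 2, 2, 3, 3, 4, 2]

# Frequency of each value, and relative frequency = frequency / n.
frequency = Counter(observations)
n = len(observations)
relative_frequency = {value: count / n for value, count in frequency.items()}

# One text "rectangle" per possible value; its length is the frequency.
for value in range(min(observations), max(observations) + 1):
    bar = "#" * frequency[value]
    print(f"{value} | {bar} ({relative_frequency.get(value, 0.0):.1f})")
```

Whether the heights are frequencies or relative frequencies, the shape of the histogram is the same; only the scale of the Y-axis changes.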
The Shape of Histograms
The general shape of a histogram is
• The number of peaks in the
histogram determines whether a
distribution is classed as Unimodal,
Bimodal or Multimodal.
• In addition to this classification we
can further classify Unimodal
distributions as to whether they are
symmetric or not.
• A Unimodal distribution is defined to
be Symmetric if there is a vertical
line of symmetry through the middle
of the distribution such that the
distribution to the left of this line is
the mirror image of the distribution
to the right of this line.
The right part of a unimodal
distribution is called the Upper Tail
of the distribution while the left part
is called the Lower Tail:
A Unimodal distribution which is not
symmetric is called skewed. There
are two types of skewness.
Positive Skew: If the upper tail of
the distribution stretches out more
than the lower tail then the
distribution is said to be positively
skewed.
Negative Skew: If the lower tail of
the distribution stretches out more
than the upper tail then the
distribution is said to be negatively
skewed.
Symmetric Distributions
• Mean: The Mean of a
quantitative dataset is the sum
of the observations in the
dataset divided by the number
of observations in the dataset.
• Median: The Median (m) of a
quantitative dataset is the
middle number when the
observations are arranged in
ascending order.
• Mode: The Mode of a dataset is
the observation that occurs most
frequently in the dataset.
How to calculate these
Dataset: X1, X2, X3, ..., Xn
• Mean = (X1 + X2 + X3 + ... + Xn)/n
• Median: Arrange the n observations
in order from smallest to largest.
If n is odd, the median (m) is the
middle number.
If n is even, the median is the mean of
the middle two numbers.
• Mode: If given a dataset, the mode
is easily chosen as the value which
appears most often.
Example A: Dataset: 5, 3, 8, 5, 6
Mean = 5.4
Mode = 5
Median: 3, 5, 5, 6, 8 so m = 5
Note: 5.4 is not one of the original
values in the dataset
Example B: 11, 140, 98, 23, 45, 14,
56, 78, 93, 200, 123, 165 n = 12
Mean = 1046/12 ≈ 87.17
Median: 11, 14, 23, 45, 56, 78, 93,
98, 123, 140, 165, 200
m = (78 + 93)/2 = 85.5
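Examples A and B can be checked with Python's standard `statistics` module. This is only a sketch of the calculations described above; the variable names are chosen here for illustration.

```python
import statistics

example_a = [5, 3, 8, 5, 6]
example_b = [11, 140, 98, 23, 45, 14, 56, 78, 93, 200, 123, 165]

mean_a = statistics.mean(example_a)      # sum of the observations / n
median_a = statistics.median(example_a)  # middle value, since n is odd
mode_a = statistics.mode(example_a)      # most frequent value

mean_b = statistics.mean(example_b)
median_b = statistics.median(example_b)  # mean of the two middle values, since n is even

print(mean_a, median_a, mode_a)
print(round(mean_b, 2), median_b)
```

Note that `statistics.median` sorts the data itself, so the observations do not need to be arranged in order beforehand.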
Example C: Generate a dataset
containing 9 numbers using the Day,
Month and Year of your birth and
that of the people sitting to your left
and right, i.e. DD/MM/YY.
Mean vs Median vs Mode: which measures the centre best?
Choosing which of these three
measures to use in practice can
sometimes seem like a difficult task.
However if we understand a little
about the relative merits of each we
should at least be able to make an
informed decision.
If the distribution is symmetric then
Mean = Median
If the distribution is Positively
Skewed (to the right) then
Median < Mean
If the distribution is Negatively
Skewed (to the left) then
Mean < Median
So the difference between the mean
and median can be used to measure
the skewness of a dataset.
Note: The presence of outliers
strongly affects the mean but has
little or no effect on the median.
This can be seen from the diagrams
and from the following example.
Example: Ten statistics graduates
who are now working as statisticians
are surveyed for their annual salary.
The survey produced the following
results:
£60,000 £20,000 £19,000 £22,000
£21,500 £21,000 £18,000 £16,000
£17,500 £20,000
Mode = £20,000
Median = £20,000
Mean = £23,500
Notice that the distribution is
positively skewed. The presence of
the one high earner has affected the
Mean, causing it to be £1,500 higher
than the highest of all the salaries
excluding £60,000. For this dataset
the Mean is therefore not a good
measure of the centre of the dataset.
Notice also that the median would be
unaffected if the £60,000 was
changed to a value like £23,000
which is more in line with the rest of
the data.
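The effect of the single high earner can be checked with a short Python sketch. It uses the ten salaries from the example and then swaps the £60,000 for £23,000, as suggested above; the variable names are illustrative only.

```python
import statistics

salaries = [60_000, 20_000, 19_000, 22_000, 21_500,
            21_000, 18_000, 16_000, 17_500, 20_000]

mean_salary = statistics.mean(salaries)
median_salary = statistics.median(salaries)
mode_salary = statistics.mode(salaries)

# Replace the one high earner with a value more in line with the rest.
adjusted = [23_000 if s == 60_000 else s for s in salaries]
adjusted_mean = statistics.mean(adjusted)
adjusted_median = statistics.median(adjusted)

print(mean_salary, median_salary, mode_salary)
print(adjusted_mean, adjusted_median)
```

The mean moves when the outlier is changed, while the median stays where it was. In the original data the mean exceeds the median, which is the pattern described above for a positively skewed distribution.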
• Would you expect the datasets
described below to be symmetric,
skewed to the right or skewed to the
left?
A. The salaries of people employed
by UCD
B. The grades on an easy exam
C. The grades on a difficult exam
D. The amount of time spent by
students in a difficult 3 hour exam.
E. The amount of time students in
this class studied last week.
F. The age of cars on a used car lot
Example: The median age of the
population in Ireland is now 32 years
old. The median age of the Irish
population in 1986 was 27. Interpret
these values and explain the trend.
What implications does this data
have for Irish society? What are the
consequences for the entertainment
industry in Ireland?
Numerical Measures of Variability
When we want to describe a dataset,
providing a measure of the centre of
that dataset is only part of the story.
Consider the following two
distributions:
Both of these distributions are
symmetric and
meanA = meanB, modeA = modeB
and medianA = medianB. However,
these two distributions are obviously
different: the data in A is quite
spread out compared to the data in
B.
This spread is technically called
variability and we will now examine
how best to measure it.
Revision Tutorials
Range: The Range of a quantitative
dataset is equal to the largest value
minus the smallest value.
Sample Variance: The Sample
Variance is equal to the sum of the
squared distances from the mean
divided by n-1.
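Standard Deviation: The Sample
Standard Deviation, s, is defined as
the positive square root of the
Sample Variance, s^2.
Sample Variance
\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \]
or, equivalently,
\[ s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1} \]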
Which is best?
The meaning of the Range is easily
seen from its definition. It is a very
crude measure of the variability
contained in a dataset as it is only
interested in the largest and smallest
values and does not measure the
variability of the rest of the dataset.
Example: These two datasets have
the same range but do they have the
same variability?
Dataset1: 1, 5, 5, 5, 9
Dataset2: 1, 2, 5, 8, 9
No. Dataset2 is clearly more
spread out than Dataset1, which has
three values clustered at 5.
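The comparison can be made concrete with a small Python sketch that computes the Range, Sample Variance and Sample Standard Deviation for both datasets. The helper `value_range` is a hypothetical name introduced here; `statistics.variance` and `statistics.stdev` use the n-1 divisor from the definitions above.

```python
import statistics

dataset1 = [1, 5, 5, 5, 9]
dataset2 = [1, 2, 5, 8, 9]


def value_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)


for name, data in (("Dataset1", dataset1), ("Dataset2", dataset2)):
    s2 = statistics.variance(data)  # sum of squared distances from the mean / (n - 1)
    s = statistics.stdev(data)      # positive square root of s2
    print(f"{name}: range={value_range(data)}, s^2={s2}, s={s:.3f}")
```

Both datasets have the same Range, but the Sample Variance of Dataset2 is larger, which agrees with the observation that it is more spread out.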
Once upon a time there were two
lecturers, A and B. Each delivered the
same course to two different classes.
When exam time came both classes
had the same average mark of 70%.
The marks for Lecturer A’s class,
however, had a Standard Deviation of
25% whereas the Standard Deviation
for Lecturer B’s class was 5%.
Whose class would you rather be in?
Chapter 8
Normal Curves and
Relative Standing
We have just seen how datasets can
be described by histograms.
For large datasets of continuous
variables the histograms have so
many possible values that it would
be impracticable to draw all of the
really narrow rectangles necessary.
Instead we represent these datasets
by curves (distributions). The curve
can be thought of as joining the
centre points of the tops of all the
rectangles in the histogram.
These distributions, which are like
generalised relative frequency
histograms, can take many different
shapes, some symmetrical, some
skewed.
There is one shape, however, that
crops up all through the natural
world, and that is …
• The Gaussian Distribution
• or The Bell Curve
The Normal Curve
The Normal Distribution is
There are many different Normal
curves: some are fat, some are thin.
Some are centred at 0, some at 1,
some at 5, etc.
Each Normal curve can be uniquely
identified by two parameters:
the Mean and the Standard
Deviation.
Once you know the Mean and the
Standard Deviation for a Normal curve
it is possible to draw the curve.
Normal curves are centred at the
Mean, and the Standard Deviation
describes how spread out they are.
The Normal Curve
The area under a Normal curve to
the left of the mean is 0.5. This
indicates that the probability that
something which is Normally
distributed is less than its mean is 0.5.
The area under the curve to the left
of any point A on the X axis
represents the probability that a
Normal variable is less than A.
X ~ Normal
Probability( X<A) is the area
under the curve to the left of A
There are an infinite number of
different Normal curves, one for
each possible combination of values
of the mean and the standard
deviation.
However there is a relationship
between all Normal curves.
All Normal variables X can be
transformed into a Standard Normal
Variable Z.
Z is Normal with Mean 0 and
Standard Deviation 1.
\[ Z = \frac{X - \mu}{\sigma} \]
where \mu is the Mean and \sigma is the Standard Deviation of X.
We can use tables to look up areas
under the Standard Normal Curve.
Example: Find the Probability that a
Normal variable with Mean 3 and
Standard Deviation 2 is less than 4.
\[
\Pr(X < 4) = \Pr(X - 3 < 4 - 3)
= \Pr\left(\frac{X - 3}{2} < \frac{4 - 3}{2}\right)
= \Pr(Z < 0.5) \approx 0.6915
\]
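Instead of printed tables, the same area can be looked up numerically. The sketch below uses `NormalDist` from Python's standard `statistics` module (available from Python 3.8); it standardises the value first and then, as a cross-check, computes the area directly on the original scale.

```python
from statistics import NormalDist

mean, sd = 3, 2
a = 4

# Standardise: Z = (X - mean) / sd
z = (a - mean) / sd

# Area under the Standard Normal curve to the left of z.
p_from_z = NormalDist(mu=0, sigma=1).cdf(z)

# The same area computed directly for a Normal curve with Mean 3 and SD 2.
p_direct = NormalDist(mu=mean, sigma=sd).cdf(a)

print(z, round(p_from_z, 4), round(p_direct, 4))
```

The two probabilities agree, which illustrates why a single table for the Standard Normal curve is enough for every Normal curve.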
Section Interpreting the Standard
Deviation - the Empirical Rule
We have seen that the Variance and
hence the Standard Deviation of a
dataset provides us with a relative
measure of the variability contained
in a dataset, so that if we are given
two datasets the one with the larger
Standard Deviation will be the
dataset which exhibits the greater
variability.
Is it possible for the Standard
Deviation to give more than a
relative measure of variability?
Can we actually say how spread out
the data is?
The answer is yes. We will see later
how to give detailed answers for
particular distributions. In the
meantime there are two rules which
will provide us with a good deal of
information about some general
The Empirical Rule provides us with
some definite statements about the
proportion of observations in a
specified interval. It only works for
Symmetric Bell-Shaped (mound-shaped)
distributions. Also, this rule
is an approximation: more or less
data than is indicated by the rule
may lie in each interval.
The Empirical Rule
For a Symmetric Bell-Shaped
distribution (Normal or close to
Normal):
Approximately 68% of the
observations are within 1 Standard
Deviation of the Mean.
Approximately 95% of the
observations are within 2 Standard
Deviations of the Mean.
Approximately 99.7% of the
observations are within 3 Standard
Deviations of the Mean.
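For an exactly Normal distribution the three percentages can be checked numerically. The following is a minimal sketch using `NormalDist` from Python's standard `statistics` module; real datasets that are only roughly bell-shaped will deviate somewhat from these figures.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Proportion of a Normal distribution within k Standard Deviations of the Mean:
# the area to the left of +k minus the area to the left of -k.
within = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}

for k, proportion in within.items():
    print(f"within {k} SD: {proportion:.1%}")
```

Working with the Standard Normal curve is sufficient here, because every Normal curve gives the same proportions once distances are measured in Standard Deviations.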
In Tombstone, Arizona Territory,
people used Colt .45 revolvers.
However, people used different
ammunition.
Wyatt Earp knew that his brothers
and Doc Holliday were the only ones
in the territory who used Colt .45s
with Winchester ammunition.
The Earp brothers conducted tests on
many different combinations of
weapons and ammunition. They
found that the dataset of observations
produced by the combination of Colt
.45 with Winchester shells showed a
Mean velocity of 936 feet/second
and a Standard Deviation of 10
feet/second.
The measurements were taken at a
distance of 15 feet from the gun.
When Wyatt examined the body of a
cowboy shot in the back in cold
blood he concluded that he was shot
at a distance of 15 feet and that the
velocity of the bullet at impact was
1,000 feet/second.
The dastardly Ike Clanton claimed
that this cowboy was shot by the
Earp brothers or Doc Holliday. Was
Wyatt able to clear his good name
using the Empirical Rule?
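The distribution of this bullet
velocity data should be
approximately bell-shaped. This
implies that the Empirical Rule should
give a good estimate of the
percentages of the data within each
interval.

k (# of Standard Deviations)   Interval (feet/second)   Percentage from the Empirical Rule
2                              916 to 956               approximately 95%
3                              906 to 966               approximately 99.7%
4                              896 to 976               practically 100%
5                              886 to 986               practically 100%
6                              876 to 996               practically 100%
7                              866 to 1006              practically 100%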
This table quite clearly demonstrates
that the Earps were very unlikely to
be responsible for this shooting. The
bullet velocity in the shooting was
1,000 feet/second, which lies
(1000 - 936)/10 = 6.4 Standard
Deviations away from the mean, that is,
more than 6 Standard Deviations.
This is especially evident from
looking at the column showing
percentages from the Empirical Rule.
Practically 100% of bullet velocities
should be between 896 and 976
feet/second.
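Wyatt's reasoning can be sketched in Python using the Mean and Standard Deviation from the Earp brothers' tests. The variable names are illustrative, and the intervals are simply the Mean plus or minus k Standard Deviations.

```python
mean_velocity = 936  # feet/second
sd_velocity = 10     # feet/second
observed = 1000      # feet/second, velocity of the bullet at impact

# Intervals of the form mean - k*SD to mean + k*SD.
intervals = {k: (mean_velocity - k * sd_velocity, mean_velocity + k * sd_velocity)
             for k in range(1, 8)}

# How many Standard Deviations the observed velocity lies from the Mean.
z = (observed - mean_velocity) / sd_velocity

for k, (low, high) in intervals.items():
    inside = low <= observed <= high
    print(f"k={k}: ({low}, {high}) contains {observed}? {inside}")
print(f"z = {z}")
```

The observed velocity only falls inside the interval once k reaches 7, far beyond the 3 Standard Deviations that already cover approximately 99.7% of a bell-shaped distribution.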
Numerical Measures of Relative
Standing
While it is useful to know how to
measure the centre of a dataset and
the variability of a dataset, many
times we want to be able to compare
one observation with the rest of the
observations in the dataset. Is one
observation larger than many others?
For example, suppose you get 35%
on the exam for this course. You will
probably feel quite bad about your
performance, but what if 90% of the
class actually did worse than you?
Then you might feel a bit better
about your 35%.
So in some cases knowing how one
observation compares with others
can be more useful than just
knowing the value of that
observation.
We will now look at some different
ways of measuring Relative
Standing.
Percentile: For any dataset the Pth
percentile is the observation which is
greater in value than P% of all the
numbers. Consequently this
observation will be smaller than
(100-P)% of the data.
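To illustrate the definition, the sketch below counts what percentage of the exam marks from the Stem and Leaf example lie below a given mark. The helper `percent_below` is a hypothetical name introduced here, and it uses the strict "greater in value than" reading of the definition; statistical packages often use slightly different percentile conventions.

```python
# Exam marks from the Stem and Leaf example.
marks = [
    71, 52, 52, 75, 64, 60, 48, 56,
    67, 29, 11, 53, 25, 46, 58, 46,
    49, 62, 66, 40, 19, 54, 57, 54,
    60, 19, 59, 43, 51, 40, 21, 45,
    46, 62, 73, 59, 36, 45, 55, 46,
    45, 32, 55, 46, 51, 46, 65, 49, 61, 40,
]


def percent_below(values, x):
    """Percentage of the values that are strictly smaller than x."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)


print(percent_below(marks, 60))
```

A mark that is greater than P% of the marks sits at roughly the Pth percentile of this dataset.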
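Z-Score: The Z-Score of an
observation is the distance between
that observation and the mean
expressed in units of standard
deviations. So:
\[ Z = \frac{X - \mu}{\sigma} \]
where \mu is the mean and \sigma is the standard deviation of the dataset.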
The numerical value of the Z-score
reflects the relative standing of the
observation.
A large positive Z-score implies that
the observation is larger than most of
the other observations.
A large negative Z-score indicates
that the observation is smaller than
almost all the other observations.
A Z-score of zero or close to 0 means
that the observation is located close
to the mean of the dataset.
Example: A sample of 120 statistics
students was chosen and their exam
results summarised. The mean and
standard deviation were shown to be:
• mean = 53%
• σ = 7%
Eric and Kenny are two students in
this class. Eric’s exam result was
47%; what was his Z-score?
If Kenny’s Z-Score is 2, what was
his percentage on the exam?
What happens to Kenny then?
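One way to check answers to the two questions is the short Python sketch below. The function names `z_score` and `from_z_score` are hypothetical helpers; the second simply rearranges the Z-score formula to recover the original observation.

```python
mean_mark = 53  # percent
sd_mark = 7     # percent


def z_score(x, mean, sd):
    """Distance of x from the mean, in units of standard deviations."""
    return (x - mean) / sd


def from_z_score(z, mean, sd):
    """Rearrange the Z-score formula to recover the original observation."""
    return mean + z * sd


eric_z = z_score(47, mean_mark, sd_mark)
kenny_mark = from_z_score(2, mean_mark, sd_mark)

print(round(eric_z, 2), kenny_mark)
```

A negative Z-score places Eric below the class mean, while a Z-score of 2 places Kenny two standard deviations above it.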