AMS 5
NUMERICAL DESCRIPTIVE METHODS
Introduction
A histogram provides a graphical
description of the distribution of a sample
of data. If we want to summarize the
properties of such a distribution we can
measure the center and the spread of the
histogram.
[Figure: two histograms, one above the other.]
These two histograms correspond to samples with the same center. The spread of
the sample on top is smaller than that of the sample on the bottom.
The Average
 57   83   92  237   51   82  127   65   87   66
 76   85   70  198  152   95  110  117   83   52
 69  134   74   70   53  165  116   78  161  129
 70   74  156   93   53   66   58   77   88   65
 49   86   64  100   95   80   70  156   81   65
 81   70   94   65   80  222  105   59   83   86
 49   80  168   78  161   63   62   92   69   72
102   92   68   67   72   53   57   55  143   75
110   64   56   53   50   57   79   89   48  107
100   77  103   63  145   82  105   85   65   79
 59   76   57   47  118   91  167   94  154   56
 72   69   55  106   53  145  123   83   65  119
Table: Maximum daily ozone values (in PSI) in L.A. on 120 typical days
in 1989
[Figure: histogram of the ozone data. Horizontal axis: Maximum Daily Ozone (PSI), ticks from 50 to 250. Vertical axis: Frequency, ticks from 0 to 40.]
What was the “typical” ozone level in 1989?
How should we define what we mean by a
typical number? The most frequently used
definition is the simple rule: add the numbers
up and divide by how many there are.
The average (or arithmetic mean) of a list of
numbers equals their sum divided by how many
there are.
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{10{,}729}{120} \approx 89.4 \text{ PSI}$$
The Median
We can easily observe that
only 42 out of the 120
observations are above the
average (i.e. only 35%).
This is because the histogram
is not symmetric.
[Figure: histogram of the ozone data, Frequency against Maximum Daily Ozone (PSI), with the positions of the median and the average marked.]
A histogram balances when supported at the average.
The median is the value with half the area of the histogram to the
left and half to the right.
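The balance statement can be made concrete with a short derivation, offered here as a sketch rather than as part of the original slides. Using the definition of the average, the deviations of the entries from the average add up to zero, so the positive and negative deviations offset each other exactly:

$$\sum_{i=1}^{n} (x_i - \bar{x}) = \sum_{i=1}^{n} x_i - n\bar{x} = n\bar{x} - n\bar{x} = 0$$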
A symmetric histogram will
look like this. In this case
50% of the data are above
the average, i.e. the median
and the average coincide.
Calculating the Median
Sort your data.
If you have an odd number of observations, the median
is exactly the middle number of the ordered ones (e.g.
with 9 ordered numbers, the median is the 5th ordered
observation).
If you have an even number of observations, as in the
ozone example where n = 120, then there are two numbers
that could be the middle: the 60th and the 61st ordered
observations (with values 79 and 80 PSI respectively).
To resolve this conflict we choose as the median the
average of these two numbers, so the median in our
example is (79+80)/2 = 79.5 PSI.
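As a check on the numbers quoted so far, the following Python sketch recomputes the average, the count of observations above it, and the median for the ozone data. The list `ozone` is transcribed from the ozone table (the order of the values does not matter here), the helper names `average` and `median` are illustrative choices, and the short five-number list used for the odd case is hypothetical.

```python
# Maximum daily ozone values (PSI), transcribed from the table: 120 observations.
ozone = [
    57, 83, 92, 237, 51, 82, 127, 65, 87, 66,
    76, 85, 70, 198, 152, 95, 110, 117, 83, 52,
    69, 134, 74, 70, 53, 165, 116, 78, 161, 129,
    70, 74, 156, 93, 53, 66, 58, 77, 88, 65,
    49, 86, 64, 100, 95, 80, 70, 156, 81, 65,
    81, 70, 94, 65, 80, 222, 105, 59, 83, 86,
    49, 80, 168, 78, 161, 63, 62, 92, 69, 72,
    102, 92, 68, 67, 72, 53, 57, 55, 143, 75,
    110, 64, 56, 53, 50, 57, 79, 89, 48, 107,
    100, 77, 103, 63, 145, 82, 105, 85, 65, 79,
    59, 76, 57, 47, 118, 91, 167, 94, 154, 56,
    72, 69, 55, 106, 53, 145, 123, 83, 65, 119,
]


def average(values):
    """Sum of the values divided by how many there are."""
    return sum(values) / len(values)


def median(values):
    """Middle value of the sorted data (mean of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd n: the single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: average the two middle values


avg = average(ozone)
above = sum(1 for x in ozone if x > avg)  # observations strictly above the average

print(round(avg, 1))            # 89.4
print(above, len(ozone))        # 42 120
print(median(ozone))            # 79.5
print(median([7, 1, 9, 3, 5]))  # 5 (hypothetical odd-length list)
```

The median sits below the average here, which matches the observation that only 35% of the values lie above the average.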
Average or Median
The Root-Mean-Square
Back in the ozone example, we saw that the typical
ozone level is either around 80 or 90, depending on how
we define “typical”. But give or take how much? In other
words, what is the typical amount by which the numbers
differ or deviate from the middle one? What we need is
a numerical measure of the spread of the data.
Before defining this measure of spread, let's answer an
even simpler question: how big are our numbers? To
make things easier, let's suppose that we only had the
following 5 numbers:
0, 5, -8, 7, -3
The average is (0+5-8+7-3)/5 = 0.2 but this is a very
poor measure of size, since it allows the positives to be
cancelled with the negatives.
The simplest way around this problem would be to
wipe out the signs by averaging the absolute
values. The average neglecting signs is then
(0+5+8+7+3)/5 = 4.6, and it gives us a measure
of size. Another measure is the root-mean-square
(r.m.s.), computed by reading its name backwards:
1. SQUARE all the entries.
2. Take the MEAN of the squares.
3. Take the square ROOT of the mean.
Therefore:
$$\text{r.m.s. size} = \sqrt{\frac{0^2 + 5^2 + (-8)^2 + 7^2 + (-3)^2}{5}} \approx 5.4$$
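The three measures of size discussed for the five numbers can be compared side by side. This is a minimal Python sketch; the function name `rms_size` is an illustrative choice, not something defined in the original slides.

```python
import math


def rms_size(values):
    """Root-mean-square size: square the entries, take the mean, then the square root."""
    squares = [x * x for x in values]              # SQUARE all the entries
    mean_of_squares = sum(squares) / len(squares)  # take the MEAN of the squares
    return math.sqrt(mean_of_squares)              # take the square ROOT of the mean


numbers = [0, 5, -8, 7, -3]

plain_average = sum(numbers) / len(numbers)                # signs cancel each other
average_abs = sum(abs(x) for x in numbers) / len(numbers)  # average neglecting signs

print(round(plain_average, 1))      # 0.2
print(round(average_abs, 1))        # 4.6
print(round(rms_size(numbers), 1))  # 5.4
```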
The Standard Deviation
Back to our initial question: what is the
typical amount by which the numbers
differ or deviate from the middle one? Taking
the average as the middle number, what we
want is a measure of the size of the deviations
of the entries from the average. We will use the
root-mean-square as that measure of size, and
we will call the resulting measure of spread
the standard deviation (SD).
Therefore
SD = r.m.s. deviation from the average
= r.m.s. size of (entry − average)
To calculate the standard deviation of a sample, follow these steps:
1. Calculate the average.
2. Calculate the list of deviations from the average by taking the
difference between each entry and the average.
3. Calculate the r.m.s. size of the resulting list.
Example: Let's go back to the simple example
with only the following 5 numbers:
0, 5, -8, 7, -3
The average is (0+5-8+7-3)/5 = 0.2
In order to find the deviations from the
average we just have to subtract the average
from each entry:
-0.2, 4.8, -8.2, 6.8, -3.2
Find the r.m.s. size of the deviations:
$$SD = \sqrt{\frac{(-0.2)^2 + 4.8^2 + (-8.2)^2 + 6.8^2 + (-3.2)^2}{5}} \approx 5.42$$
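The three steps (average, deviations, r.m.s. size of the deviations) can be sketched in Python as follows. The function name `standard_deviation` is an illustrative choice. The comparison with `statistics.pstdev` from the Python standard library is included only as a cross-check, since that function also divides by n.

```python
import math
import statistics


def standard_deviation(values):
    """SD as the r.m.s. size of the deviations from the average (divides by n)."""
    avg = sum(values) / len(values)          # step 1: the average
    deviations = [x - avg for x in values]   # step 2: deviations from the average
    mean_square = sum(d * d for d in deviations) / len(deviations)
    return math.sqrt(mean_square)            # step 3: r.m.s. size of the deviations


numbers = [0, 5, -8, 7, -3]
sd = standard_deviation(numbers)

print(round(sd, 2))                                  # 5.42
print(math.isclose(sd, statistics.pstdev(numbers)))  # True
```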
The Standard Deviation Shortcut Formula
$$SD = \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2 - \left(\frac{1}{n}\sum_{i=1}^{n} x_i\right)^2}$$
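For readers who want to see why the shortcut agrees with the definition of the SD as the r.m.s. deviation from the average, here is a brief derivation. It is a sketch added for clarity, writing $\bar{x}$ for the average $\frac{1}{n}\sum_{i=1}^{n} x_i$:

$$\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - 2\bar{x}\cdot\frac{1}{n}\sum_{i=1}^{n} x_i + \bar{x}^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - 2\bar{x}^2 + \bar{x}^2 = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2$$

Taking the square root of both sides gives the shortcut formula.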
For the ozone example:
Observation number (i)     x_i      x_i^2
1                           57       3249
2                           83       6889
.                            .          .
.                            .          .
120                        107      11449
sum                      10729    1121637
$$SD = \sqrt{\frac{1121637}{120} - \left(\frac{10729}{120}\right)^2} \approx 36.8 \text{ PSI}$$
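The shortcut needs only three totals: the number of observations, the sum of the entries, and the sum of their squares. The Python sketch below applies it to the ozone totals from the table and, as a consistency check, to the five-number example. The function name `sd_shortcut` and its parameter names are illustrative assumptions.

```python
import math


def sd_shortcut(n, total, total_of_squares):
    """SD from n, the sum of the entries, and the sum of the squared entries."""
    mean_of_squares = total_of_squares / n
    square_of_mean = (total / n) ** 2
    return math.sqrt(mean_of_squares - square_of_mean)


# Ozone totals taken from the table: n = 120, sum = 10729, sum of squares = 1121637.
print(round(sd_shortcut(120, 10729, 1121637), 2))  # 36.78

# Same formula on the five-number example; agrees with the r.m.s. deviation (5.42).
numbers = [0, 5, -8, 7, -3]
five_sd = sd_shortcut(len(numbers), sum(numbers), sum(x * x for x in numbers))
print(round(five_sd, 2))  # 5.42
```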
The Standard Deviation
The SD comes out in the same units as the data.
Do not confuse the SD of a list with its r.m.s. size. The
SD is the r.m.s. not of the original numbers on the list,
but of their deviations from the average.
In many data sets (especially when the distribution is
symmetric) the following rule of thumb applies:
1. Roughly 68% of the observations are within one SD of the
average.
2. Roughly 95% of the observations are within two SDs of the
average.
3. Roughly 99% of the observations are within three SDs of the
average.
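The rule of thumb can be checked roughly against the ozone data. In this Python sketch the list is transcribed from the ozone table, the helper name `count_within` is an illustrative choice, and the SD is computed as the r.m.s. deviation from the average (dividing by n).

```python
import math

# Maximum daily ozone values (PSI), transcribed from the table: 120 observations.
ozone = [
    57, 83, 92, 237, 51, 82, 127, 65, 87, 66,
    76, 85, 70, 198, 152, 95, 110, 117, 83, 52,
    69, 134, 74, 70, 53, 165, 116, 78, 161, 129,
    70, 74, 156, 93, 53, 66, 58, 77, 88, 65,
    49, 86, 64, 100, 95, 80, 70, 156, 81, 65,
    81, 70, 94, 65, 80, 222, 105, 59, 83, 86,
    49, 80, 168, 78, 161, 63, 62, 92, 69, 72,
    102, 92, 68, 67, 72, 53, 57, 55, 143, 75,
    110, 64, 56, 53, 50, 57, 79, 89, 48, 107,
    100, 77, 103, 63, 145, 82, 105, 85, 65, 79,
    59, 76, 57, 47, 118, 91, 167, 94, 154, 56,
    72, 69, 55, 106, 53, 145, 123, 83, 65, 119,
]

avg = sum(ozone) / len(ozone)
sd = math.sqrt(sum((x - avg) ** 2 for x in ozone) / len(ozone))


def count_within(values, center, spread, k):
    """Number of values lying within k spreads of the center."""
    return sum(1 for x in values if abs(x - center) <= k * spread)


counts = [count_within(ozone, avg, sd, k) for k in (1, 2, 3)]
for k, inside in zip((1, 2, 3), counts):
    print(k, inside, round(100 * inside / len(ozone), 1))
# 1 95 79.2
# 2 114 95.0
# 3 118 98.3
```

For this sample the two-SD and three-SD shares are close to the rule, while the one-SD share (79.2%) is well above 68%. That is consistent with the rule being only a rough guide when the histogram is not symmetric.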
Using a Statistical Calculator