Measure of Location
The term "Average" is vague
Average could mean one of four things: the arithmetic mean, the median, the midrange, or
the mode. For this reason, it is better to specify which average you're talking about.
Mean
When people say “average”, they usually intend the mean.
Population Mean ($\mu$),
$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$$
N = number of observations in the population
Sample Mean ($\bar{x}$),
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
n = number of observations in the sample
For reporting $\bar{x}$, it is recommended that we use decimal accuracy of one digit more than
the accuracy of the $x_i$'s.
For instance, if $x_1 = 58$ and $x_2 = 67$, then $\bar{x} = 62.5$.
Physical interpretation of $\bar{x}$
Suppose there is a horizontal measurement axis, and each sample observation is represented by
a 1-kg weight placed at the corresponding point on the axis. The only point at which a
fulcrum can be placed to balance the system of weights is the point corresponding to the
value of $\bar{x}$.
[Figure: weights on a measurement axis with ticks at 10, 20, 30, and 40, balancing at $\bar{x} = 21.18$]
Median
The word median is synonymous with “middle”.
The sample median is the middle value when the observations are ordered from smallest
to largest.
To find the median, we sort the data in ascending order, with any repeated values
included. Then
if n is odd,
$$\tilde{x} = \text{the single middle value} = \left(\frac{n+1}{2}\right)^{\text{th}} \text{ ordered value}$$
if n is even,
$$\tilde{x} = \text{the average of the two middle values} = \text{average of the } \left(\frac{n}{2}\right)^{\text{th}} \text{ and } \left(\frac{n}{2}+1\right)^{\text{th}} \text{ ordered values}$$
Ex. Find the median of the following data set.
3, 5, 4, 2, 1, 6, 8, 11, 14, 13, 6, 9, 10, 7
Sol. Ordered data (n = 14): 1, 2, 3, 4, 5, 6, 6, 7, 8, 9, 10, 11, 13, 14
Median = (6 + 7)/2 = 6.5
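The odd/even rule for the median can be sketched in a few lines of Python. This is only an illustration; the function name sample_median is an assumption made for the example and does not come from the notes.

```python
def sample_median(values):
    """Return the sample median using the odd/even rule for ordered data."""
    if not values:
        raise ValueError("median of an empty sample is undefined")
    ordered = sorted(values)      # ascending order, repeated values included
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]       # the ((n + 1) / 2)-th ordered value
    # n even: average of the (n / 2)-th and (n / 2 + 1)-th ordered values
    return (ordered[mid - 1] + ordered[mid]) / 2


data = [3, 5, 4, 2, 1, 6, 8, 11, 14, 13, 6, 9, 10, 7]
print(sample_median(data))  # 6.5
```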
BOXPLOT
Ex. We can make a box plot by using medians as follows.
1. Split the ordered numbers into the halves to the left and right of the median:
1, 2, 3, 4, 5, 6, 6 │ 7, 8, 9, 10, 11, 13, 14
2. Find the median of each half:
Left median (Lower Quartile) = 4
Right median (Upper Quartile) = 10
3. Draw a number line from the smallest to the largest number without skipping any
numbers:
1, 2, 3, 4, 5, 6, 6, 7, 8, 9, 10, 11, 13, 14
4. Put circles at the LOWER and UPPER Quartiles, and at the median (6.5).
5. Draw a box connecting the circles at the LOWER and UPPER Quartiles.
6. Draw a line connecting the median to the box.
7. Put circles at the high and low points.
8. Draw lines that connect the high and low points to the box.
[Figure: box plot of the data drawn on a number line from 1 to 14]
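The five values needed to draw the box plot (low point, lower quartile, median, upper quartile, high point) can be computed with a short Python sketch. The function name five_number_summary is hypothetical, and the handling of an odd sample size (leaving the middle value out of both halves) is an assumption, since the example above only covers an even n.

```python
from statistics import median


def five_number_summary(values):
    """Return (low, lower quartile, median, upper quartile, high).

    Quartiles are taken as the medians of the lower and upper halves of the
    ordered data. For odd n the middle value is excluded from both halves;
    that convention is an assumption, not something stated in the notes.
    """
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]


data = [3, 5, 4, 2, 1, 6, 8, 11, 14, 13, 6, 9, 10, 7]
print(five_number_summary(data))  # (1, 4, 6.5, 10, 14)
```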
Mean vs. Median

The population mean and median will not generally be equal to each other.

If the population distribution is either positively or negatively skewed, then $\mu \neq \tilde{\mu}$.
[Figure: three population distributions]
(a) negative skew: $\mu < \tilde{\mu}$
(b) symmetric: $\mu = \tilde{\mu}$
(c) positive skew: $\tilde{\mu} < \mu$
Mode (x*)
The mode is the most frequent data value. There may be no mode if no one value appears
more than any other. There may also be two modes (bimodal), three modes (trimodal), or
more than three modes (multi-modal).
Other measures of location: quartiles, percentiles, and trimmed means
• The median divides the data set into two parts of equal size.
• Quartiles divide the data set into 4 equal parts.
• Percentiles divide the data set into 100 equal parts.
• A 10% trimmed mean is computed by eliminating the smallest 10% and the
largest 10% of the sample, and then averaging what is left over.
• A 15% trimmed mean is computed by eliminating the smallest 15% and the
largest 15% of the sample, and then averaging what is left over.
• The median is obtained by trimming the maximum possible amount from each end.
• The mean is obtained by trimming 0% from each end of the sample.
Example. Consider the following 20 observations, ordered from smallest to largest,
each one representing the lifetime (in hours) of a certain type of incandescent lamp:
612   623   666   744   883   898   964   970   983   1003
1016  1022  1029  1058  1085  1088  1122  1135  1197  1201
The mean of all 20 observations is $\bar{x} = 965.0$, and the median is $\tilde{x} = 1009.5$.
The 10% trimmed mean is obtained by:
• deleting the smallest two observations (612 and 623) and the largest two (1197
and 1201)
• averaging the remaining 16, which gives $\bar{x}_{tr(10)} = 979.1$
The 20% trimmed mean is obtained by averaging the middle 12 values: $\bar{x}_{tr(20)} = 999.9$
Note: The 20% trimmed mean is closer to the median than the 10% trimmed mean.
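The trimming procedure in this example can be sketched in Python as follows. The function name trimmed_mean is an assumption made for illustration, and the sketch only handles the case where the trimming percentage removes a whole number of observations from each end, as it does here. Run on the lamp data, it should reproduce the values 965.0, 979.1, and 999.9 after rounding to one decimal place.

```python
def trimmed_mean(values, percent):
    """Mean after dropping the smallest and largest `percent`% of the sample.

    `percent` is a whole-number percentage. The sketch assumes that
    n * percent / 100 is a whole number; other cases need an interpolation
    convention that the notes do not cover.
    """
    ordered = sorted(values)
    n = len(ordered)
    k, remainder = divmod(n * percent, 100)   # observations dropped per end
    if remainder != 0:
        raise ValueError("percent does not remove a whole number of observations")
    if 2 * k >= n:
        raise ValueError("nothing left to average after trimming")
    kept = ordered[k:n - k]
    return sum(kept) / len(kept)


lifetimes = [612, 623, 666, 744, 883, 898, 964, 970, 983, 1003,
             1016, 1022, 1029, 1058, 1085, 1088, 1122, 1135, 1197, 1201]

for pct in (0, 10, 20):
    print(f"{pct}% trimmed mean: {trimmed_mean(lifetimes, pct):.3f}")
```

Trimming 0% gives the ordinary mean, which matches the statement that the mean is the trimmed mean with nothing removed.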
Summary
The Mean is used in computing other statistics (such as the variance). It is often not
appropriate for skewed distributions such as salary information.
The Median is the middle value and is good for skewed distributions because it is
resistant to extreme values.
The Mode is used to describe the most typical case. The mode can be used with nominal
data whereas the others can't. The mode may or may not exist and there may be more
than one value for the mode.
Property                     Mean   Median   Mode
Uses all data values         Yes    No       No
Affected by extreme values   Yes    No       No
__________________________________________________
Measures of Variation
The following figure shows dotplots of 3 samples with the same mean and median.
[Figure: dotplots of samples 1, 2, and 3 on a common axis with ticks at 10, 20, 30, 40, and 50]
It can be seen that the spread about the center is different for all 3 samples.
Questions
- Which sample has the largest amount of variability?
- Which sample has the smallest amount of variability?
Range
The range is the simplest measure of variation. It is simply the highest value minus the
lowest value.
RANGE = MAXIMUM - MINIMUM
Since the range uses only the largest and smallest values, it is greatly affected by extreme
values; that is, it is not resistant to them.
Variance
"Average Deviation"
The range only involves the smallest and largest numbers, and it would be desirable to
have a statistic which involved all of the data values.
There are many ways to produce a statistic that involves all data values. An easy way is
to use the average deviation from the mean, which is defined as:
$$\text{Average Deviation} = \frac{\sum (x_i - \mu)}{N}$$
The problem is that this summation is always zero. So, the average deviation will always
be zero. That is why the average deviation is never used.
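The reason the sum vanishes follows directly from the definition of the mean: since the population mean is the total divided by N, the total equals N times the mean. A short worked step, written in LaTeX:

```latex
\sum_{i=1}^{N} (x_i - \mu)
  = \sum_{i=1}^{N} x_i - N\mu
  = N\mu - N\mu
  = 0
```

As a small illustrative check (the numbers are made up for this purpose), the values 2, 4, and 9 have mean 5, and their deviations -3, -1, and 4 add up to 0.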
Population Variance
So, to keep the result from being zero, each deviation from the mean is squared, giving the
"squared deviation from the mean". The average of these squared deviations is
called the variance.
Population Variance
$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$$
Sample Variance
One would expect the sample variance to simply be the population variance with the
population mean replaced by the sample mean. However, that formula tends to
underestimate the population variance, so it is a biased estimator of the parameter.
To counteract this, the sum of the squares of the deviations is divided by one less
than the sample size.
Sample variance
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$
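Computational formula for $S^2$:
$$S^2 = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}{n - 1}$$
Proof
Because $\bar{x} = \frac{\sum x_i}{n}$, therefore
$$n\bar{x}^2 = \frac{(\sum x_i)^2}{n} \qquad (1)$$
$$\sum (x_i - \bar{x})^2 = \sum (x_i^2 - 2\bar{x}x_i + \bar{x}^2) = \sum x_i^2 - 2\bar{x}\sum x_i + \sum (\bar{x})^2$$
$$= \sum x_i^2 - 2\bar{x} \cdot n\bar{x} + n(\bar{x})^2$$
$$= \sum x_i^2 - 2n(\bar{x})^2 + n(\bar{x})^2$$
$$= \sum x_i^2 - n(\bar{x})^2$$
From (1),
$$= \sum x_i^2 - \frac{(\sum x_i)^2}{n}$$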
Standard Deviation
There is a problem with variances. Recall that the deviations were squared. That means
that the units were also squared. To get the units back to the same as the original data
values, the square root must be taken.
Population Standard Deviation
$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$$
Sample Standard Deviation
$$S = \sqrt{S^2} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$$
Example (From Devore, 2000)
The amount of light reflectance by leaves has been used for various purposes,
including evaluation of turf color, estimation of nitrogen status, and measurement of
biomass. The following observations are obtained using spectrophotogrammetry, on leaf
reflectance under specified experimental conditions.
Observation   x_i    x_i^2      Observation   x_i    x_i^2
1             15.2   231.04     9             12.7   161.29
2             16.8   282.24     10            15.8   249.64
3             12.6   158.76     11            19.2   368.64
4             13.2   174.24     12            12.7   161.29
5             12.8   163.84     13            15.6   243.36
6             13.8   190.44     14            13.5   182.25
7             16.3   265.69     15            12.9   166.41
8             13.0   169.00
$\sum x_i = 216.1$, $\sum x_i^2 = 3168.13$
Find $S^2$ and $S$.
$$S^2 = \frac{\sum x_i^2 - \frac{(\sum x_i)^2}{n}}{n - 1} = \frac{3168.13 - \frac{(216.1)^2}{15}}{15 - 1} = 3.92$$
$$S = \sqrt{3.92} = 1.98$$
Because the numerator of $S^2$, in its defining form $\sum (x_i - \bar{x})^2$, is a sum of
nonnegative quantities, $S^2$ is guaranteed to be nonnegative.
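As a check on the arithmetic, the following Python sketch computes the sample variance of the reflectance data twice, once from the defining formula and once from the computational formula, and then takes the square root. The variable names are chosen for this illustration only.

```python
import math

# Leaf reflectance observations 1 to 15 from the example
reflectance = [15.2, 16.8, 12.6, 13.2, 12.8, 13.8, 16.3, 13.0,
               12.7, 15.8, 19.2, 12.7, 15.6, 13.5, 12.9]

n = len(reflectance)
sum_x = sum(reflectance)                      # sum of the x_i
sum_x2 = sum(x * x for x in reflectance)      # sum of the x_i squared

# Defining formula: squared deviations from the sample mean, divided by n - 1
mean = sum_x / n
s2_definition = sum((x - mean) ** 2 for x in reflectance) / (n - 1)

# Computational formula: (sum x_i^2 - (sum x_i)^2 / n) / (n - 1)
s2_shortcut = (sum_x2 - sum_x ** 2 / n) / (n - 1)

s = math.sqrt(s2_shortcut)

print(f"sum x = {sum_x:.1f}, sum x^2 = {sum_x2:.2f}")
print(f"S^2 (definition) = {s2_definition:.4f}")
print(f"S^2 (shortcut)   = {s2_shortcut:.4f}")
print(f"S = {s:.4f}")
```

The two variance values should agree up to floating-point rounding. With data whose mean is very large compared with its spread, the computational formula can lose precision, so the defining formula is the safer choice in code.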
Properties of $s^2$
The properties of $s^2$ can sometimes be used to increase computational efficiency.
Let $x_1, x_2, \ldots, x_n$ be a sample and c be any nonzero constant.
1. If $y_1 = x_1 + c$, $y_2 = x_2 + c$, ..., $y_n = x_n + c$, then $s_y^2 = s_x^2$, and
2. If $y_1 = c x_1$, ..., $y_n = c x_n$, then $s_y^2 = c^2 s_x^2$ and $s_y = |c| s_x$
where
$s_x^2$ = sample variance of the x's
$s_y^2$ = sample variance of the y's
In words, Property 1 says that if a constant c is added to (or subtracted from) each data
value, the variance is unchanged.
Property 2 says that multiplying each $x_i$ by c results in $s^2$ being multiplied by a
factor of $c^2$.
These properties can be proved by noting in Property 1 that $\bar{y} = \bar{x} + c$ and in
Property 2 that $\bar{y} = c\bar{x}$.
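Both properties are easy to check numerically. The sketch below uses Python's standard statistics module, whose variance and stdev functions divide by n - 1; the sample values and the constant c are illustrative choices, not data from the notes. A negative c is used on purpose so that the absolute value in the standard deviation rule matters.

```python
import math
from statistics import stdev, variance

x = [15.2, 16.8, 12.6, 13.2, 12.8]   # illustrative sample
c = -3.0                              # hypothetical nonzero constant

shifted = [xi + c for xi in x]        # Property 1: y_i = x_i + c
scaled = [c * xi for xi in x]         # Property 2: y_i = c * x_i

# Property 1: adding a constant leaves the variance unchanged
print(math.isclose(variance(shifted), variance(x)))

# Property 2: scaling multiplies the variance by c^2 and the
# standard deviation by |c|
print(math.isclose(variance(scaled), c ** 2 * variance(x)))
print(math.isclose(stdev(scaled), abs(c) * stdev(x)))
```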