GG 313 Lecture 5
Elementary Statistics
Class Web Site is UP
http://www.soest.hawaii.edu/~fred/GG313/
CRUISE:
Please email me your phone number if you think
you may come. I will need this to let you know of
last-minute changes.
Homework 2
1) What’s the contrast in density?
What’s the uncertainty?
2) Trick is to set it up correctly
All independent.
$$z = z_r + \mathrm{Value1}\cdot\sqrt{t}, \qquad \mathrm{Value1} = \frac{2\,\rho_m T_m}{\rho_m - \rho_w}$$

$$\sigma(\mathrm{Value1}) = \mathrm{Value1}\cdot\left[\frac{\sigma_{\rho_m}^2}{\rho_m^2} + \frac{\sigma_{T_m}^2}{T_m^2} + \frac{\sigma_{\rho_m}^2 + \sigma_{\rho_w}^2}{(\rho_m - \rho_w)^2}\right]^{0.5}$$

$$\sigma(\mathrm{Value3}) = \frac{d(\mathrm{Value3})}{dt}\,\sigma_t = 0.5\,t^{-0.5}\,\sigma_t, \qquad \mathrm{Value3} = \sqrt{t}$$
3rd question?
Statistics
We often want to quantify the characteristics of data sets
using a small number of parameters (statistics), often just a
representative value.
The POPULATION consists of all the possible observations of
a phenomenon. A population may be finite or infinite. Any
subset of a population is called a SAMPLE. For example, 10
coin tosses are a sample of all coin tosses, which is an
infinite population.
Often our aim is to discover characteristics of a population by
sampling the population. Political polls sample a small fraction
of the voter population to determine what the total population
thinks.
A geo-example:
You want to sample the population of dinosaurs living in
the late Mesozoic in Wyoming. You look at existing
fossils and collect fossils yourself to determine the
characteristics of the dinosaur population. Because
some dinosaurs may have lived in habitats where fossil
formation was not likely, the sample is probably not
representative.
We often want to know the central value of a population
and sample, but there are several different “centers”.
The three M’s - Mean, Median, and Mode
The best known estimate of central location is the arithmetic
MEAN. The mean of the sample is:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

The population mean (µ) is identical in form, but all
members of the population are included. More often than
not, we really don't know µ, and are trying to determine it.
Characteristics of the mean:
• It always exists and can be calculated
• It is unique
• It is stable, not fluctuating much from sample-to-sample
• means from different samples can be combined to form
new statistics
• Every value is used to obtain the mean
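As a quick illustration (not from the original slides), a minimal
Python sketch of the sample mean:

```python
def sample_mean(x):
    """Arithmetic mean: (1/n) * sum of the x_i."""
    return sum(x) / len(x)

print(sample_mean([1, 2, 2, 5, 5, 7, 99]))  # 17.285..., a sample used again below
```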
Sample MEDIAN
In cases where there a few wild data points, the sample
mean may yield a poor estimate of the central value. A
statistic that tends to ignore wild data is the sample median:
$$\tilde{x} = \begin{cases} x_{n/2+1}, & n \text{ odd} \\[4pt] \tfrac{1}{2}\left(x_{n/2} + x_{n/2+1}\right), & n \text{ even} \end{cases}$$
The median is the middle value when the data are
arranged in ascending order. It always exists and it is
unique.
Since the median does not depend on the values of the
other points in the sample, it is not sensitive to wild values.
Consider the sample: 1,2,2,5,5,7,99. It has a mean of 17.3
and a median of 5. The great difference between the mean
and median in this case is because of the one wild value
(99).
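A matching sketch (again, not from the slides) that reproduces
the comparison above:

```python
def sample_median(x):
    """Middle value of the sorted data, or the average of the two
    middle values when n is even."""
    s = sorted(x)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else 0.5 * (s[mid - 1] + s[mid])

data = [1, 2, 2, 5, 5, 7, 99]
print(sum(data) / len(data))   # mean: 17.28..., dragged up by the wild value 99
print(sample_median(data))     # median: 5
```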
Mode: The mode of a sample is the most frequent value. In
the sample above, the mode is non-unique - both 2 and 5
occur twice. It may not exist at all if no value occurs more
than once. We can usually make a sample with at least one
mode by grouping the data into bins of almost-identical
values.
The mode is often denoted as x̂.
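A small sketch of the mode (not from the slides); for continuous
data you would bin first, as noted above:

```python
from collections import Counter

def sample_mode(x):
    """All values tied for the highest count (the mode may be non-unique)."""
    counts = Counter(x)
    top = max(counts.values())
    return [v for v, c in counts.items() if c == top]

print(sample_mode([1, 2, 2, 5, 5, 7, 99]))  # [2, 5]: a non-unique mode
```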
In some cases, some elements of a sample are
considered more important than others. Some
measurements may be more precise, or better
documented than others. Large earthquakes may have
better statistics than small ones, for example. We may
want large earthquakes to count more than small ones
when determining the location of an earthquake swarm.
For situations such as this we can use a weighted mean.
$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
Where wi is the weight of the ith measurement.
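A minimal sketch of the weighted mean (not from the slides; the
values and weights here are made up for illustration):

```python
def weighted_mean(x, w):
    """sum(w_i * x_i) / sum(w_i)."""
    return sum(wi * xi for wi, xi in zip(w, x)) / sum(w)

# Hypothetical example: weight each measurement by how well it is documented.
print(weighted_mean([4.1, 4.3, 6.0], w=[3, 2, 10]))
```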
VARIATIONS
While the central value of a sample is important, the
variations around that value are often equally
important.
When we talked about exploratory data analysis we
talked about the largest and smallest values and the
“hinges”, or values bounding the middle half of the
values.
We could look at the deviations of each point from the
mean:
$$\Delta x_i = x_i - \bar{x}$$
And take the average of the deviations:
$$\overline{\Delta x} = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)$$
But this value is always zero, since positive and negative
deviations from the mean cancel exactly.
You could take the absolute value of the deviations before
summing, but this isn't often done.
The most common measure of deviation for a population is
defined by:
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2$$

where σ² is called the variance of the population, and σ is
called the standard deviation of the population:

$$\sigma = \sqrt{\sigma^2} = \left[\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2\right]^{1/2}$$
Usually, we’ll be working with samples of the population,
not the population itself, and the functions of variance and
standard deviation are almost the same: the sum is divided
by n-1, rather than N:
$$s = \sqrt{s^2} = \left[\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2\right]^{1/2}$$
While this change isn’t very important in most cases, if n is
small, the difference between n and n-1 can be significant.
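A short sketch (not from the slides) showing the n versus n-1
divisors side by side:

```python
def variance(x, sample=True):
    """Divide by n - 1 for a sample, by N for a full population."""
    n = len(x)
    m = sum(x) / n
    ss = sum((xi - m) ** 2 for xi in x)
    return ss / (n - 1) if sample else ss / n

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(variance(x, sample=False))  # 4.0    (population form, divide by N)
print(variance(x, sample=True))   # 4.571  (sample form; the gap grows as n shrinks)
```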
Why the change? In going from the population to the
sample, we've lost one degree of freedom. The number of
independent pieces of information that go into the estimate
of a parameter is called the degrees of freedom. The number
of values that can vary freely in the variance is n-1, since
computing the mean uses up one of them. BUT……..?????
Let's look more closely at the equation for the variance of a
sample. What value of x̄ minimizes the value of s²? We
already know the answer, but can we prove it? Let:
$$f(\bar{x}) = s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2$$
Then we want the value of x̄ for which df/dx̄ = 0, ensuring
either a maximum or a minimum of f(x̄). If d²f/dx̄² > 0, that
value must be a minimum. Differentiating:
$$\frac{df}{d\bar{x}} = \frac{-2\sum_{i=1}^{n}\left(x_i - \bar{x}\right)}{n-1} = 0$$

so that

$$\sum_{i=1}^{n}\left(x_i - \bar{x}\right) = 0, \quad \text{or} \quad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
The 2nd derivative of the function is

$$\frac{d^2 f}{d\bar{x}^2} = \frac{2n}{n-1}$$

which is > 0, so the result is a minimum.
So, the value of x̄ that minimizes the variance is the mean
that was defined earlier. This value, the mean, minimizes
the least-squares misfit to the data samples, and is often
called the L2 estimate of the central value.
Similarly, we can show that the mean absolute
deviation is minimized by using the median central
estimate rather than the mean. The median is called
the L1 estimate of central location.
$$\frac{d}{d\tilde{x}}\left[\frac{1}{n}\sum_{i=1}^{n}\left|x_i - \tilde{x}\right|\right] = -\frac{1}{n}\sum_{i=1}^{n}\frac{x_i - \tilde{x}}{\left|x_i - \tilde{x}\right|} = 0$$
Look at the terms in the sum: each can only take on the
values ±1 (or 0/0 where x_i = x̃). For the sum to be zero, the
number of values greater than x̃ must equal the number of
values less than x̃, which defines the median.
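A brute-force numerical check of both claims (a sketch, not from
the slides):

```python
# Trial centers on a fine grid; the L2 winner should be the mean,
# the L1 winner the median.
data = [1, 2, 2, 5, 5, 7, 99]
centers = [c / 100 for c in range(0, 10001)]  # 0.00, 0.01, ..., 100.00

l2_best = min(centers, key=lambda c: sum((x - c) ** 2 for x in data))
l1_best = min(centers, key=lambda c: sum(abs(x - c) for x in data))

print(l2_best)  # ~17.29, the mean of the sample
print(l1_best)  # 5.0, the median of the sample
```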
ROBUST Estimation
Estimates of central values and deviations don't do us
much good if they are sensitive to which sample we take.
We are better off using estimators that are robust, that is,
insensitive to variations in the sample.
We introduce the concept of “breakdown point” - the
smallest fraction of points in a sample that have to be
replaced by outliers to cause the estimator to lie
outside reasonable values.
The mean is sensitive to even one outlier, thus its
breakdown point is 1/n. The median, on the other hand,
will not be thrown off until about half of the data values
have been replaced by outliers, so its breakdown point is
50%.
Example of robustness of mean and median

            original    1 out    2 out    3 out    4 out    5 out    6 out
            -0.001     -0.001   -0.001   -0.001   -0.001   -0.001   -1.000
             0.044      0.044    0.044    0.044    0.044    0.044    0.044
             0.031      0.031    1.000    1.000    1.000    1.000    1.000
             0.047      0.047    0.047    0.047    0.047    0.047    0.047
            -0.020     -0.020   -0.020   -0.020    0.500    0.500    0.500
             0.020      0.020    0.020    0.020    0.020   -1.000   -1.000
            -0.034     -0.034   -0.034   -0.034   -0.034   -0.034   -0.034
            -0.076     -1.000   -1.000   -1.000   -1.000   -1.000   -1.000
            -0.002     -0.002   -0.002   -0.002   -0.002   -0.002   -0.002
             0.011      0.011    0.011    0.011    0.011    0.011    0.011
             0.000      0.000    0.000    1.000    1.000    1.000    1.000
            -0.023     -0.023   -0.023   -0.023   -0.023   -0.023   -0.023
Mean        -0.0003    -0.077    0.003    0.087    0.130    0.045   -0.038
Median      -0.0009    -0.001   -0.001    0.005    0.015    0.005    0.005
Std Dev      0.0350     0.292    0.427    0.515    0.527    0.620    0.690
3 Std Devs   0.1049     0.875    1.281    1.545    1.581    1.861    2.070
The column on the left is the original series of n=12
random numbers. Each successive column to the right
has had one value replaced by an outlier. The mean,
median, and standard deviation are calculated at the
bottom.
[Plot: mean, median, and standard deviation versus the number of outliers (0 to 7).]
This is a plot of the above results. Note that the standard
deviation increases rapidly with the number of outliers, while
the mean shows far more variation than the median. Thus,
the median is more robust than the mean.
From the above graph, we also see that the standard
deviation is not robust, growing steadily as the number
of outliers increases.
We don't need to look far for a robust variation statistic;
recall that the sum of absolute deviations |x_i − x̃| is
minimized when x̃ is the median of the sample. We can thus
define the median absolute deviation (MAD) as:

$$\mathrm{MAD} = 1.482 \cdot \mathrm{median}\left(\left|x_i - \tilde{x}\right|\right)$$
The factor 1.482 is a fudge factor that makes MAD values
equivalent to the standard deviation for normally distributed
data.
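A compact sketch of the MAD (not from the slides), checked
against the original column of the table above:

```python
def _median(x):
    s = sorted(x)
    n = len(s)
    return s[n // 2] if n % 2 else 0.5 * (s[n // 2 - 1] + s[n // 2])

def mad(x):
    """Median absolute deviation, scaled by 1.482 so it matches the
    standard deviation for normally distributed data."""
    med = _median(x)
    return 1.482 * _median([abs(xi - med) for xi in x])

print(mad([-0.001, 0.044, 0.031, 0.047, -0.020, 0.020,
           -0.034, -0.076, -0.002, 0.011, 0.000, -0.023]))  # ~0.03
```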
[Plot: mean, median, standard deviation, and MAD versus the number of outliers (0 to 7).]
The MAD value has been added to the previous graph.
Note that it is FAR more robust than the standard
deviation until the number of outliers reaches n/2.
How do we identify outliers? We don’t want to delete
real data! This is a real problem - as noted earlier.
Great care must be taken when deleting data from
consideration - and a search should be made for why
the data are bad. A data point being statistically “off”
is not sufficient reason to delete it.
We can normalize the MAD value by defining a new
variable:
$$z_i = \frac{x_i - \tilde{x}}{\mathrm{MAD}}$$
This new variable is unitless, and we can arbitrarily cut off
all points where |z_i| > 3. This implies that data points to be
deleted are more than 3 deviation units away from the
median.
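A sketch of the z_i screen (not from the slides), run on the
"1 out" column of the earlier table:

```python
def _median(x):
    s = sorted(x)
    n = len(s)
    return s[n // 2] if n % 2 else 0.5 * (s[n // 2 - 1] + s[n // 2])

def z_scores(x):
    """z_i = (x_i - median) / MAD; |z_i| > 3 flags a probable outlier."""
    med = _median(x)
    mad = 1.482 * _median([abs(xi - med) for xi in x])
    return [(xi - med) / mad for xi in x]

data = [-0.001, 0.044, 0.031, 0.047, -0.020, 0.020,
        -0.034, -1.000, -0.002, 0.011, 0.000, -0.023]  # "1 out" column
print([round(abs(z), 1) for z in z_scores(data)])      # the -1.000 scores ~31
```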
            original    1 out    2 out    3 out    4 out    5 out    6 out
MAD:          0.03       0.03     0.03     0.05     0.05     0.06     0.40
zi:           0.02       0.02     0.02     0.12     0.33     0.10     2.52
              1.43       1.43     1.43     0.81     0.56     0.65     0.10
              1.01       1.01    31.69    20.27    19.10    16.42     2.50
              1.51       1.51     1.51     0.86     0.61     0.70     0.11
              0.61       0.61     0.61     0.50     9.40     8.17     1.24
              0.66       0.66     0.66     0.31     0.09    16.57     2.52
              1.06       1.06     1.06     0.79     0.96     0.64     0.10
              2.36      31.63    31.63    20.46    19.69    16.57     2.52
              0.02       0.02     0.02     0.13     0.33     0.10     0.02
              0.37       0.37     0.37     0.12     0.09     0.10     0.02
              0.02       0.02     0.02    20.27    19.10    16.42     2.50
              0.69       0.69     0.69     0.56     0.74     0.45     0.07
These are the z_i values for the previous example. Note that
each of the outliers shows up as having a value far greater
than 3, indicating that it is somehow "bad" and can be safely
deleted (until the number of outliers reaches n/2, in the last
column, where the MAD itself breaks down).
What should you get from this? You now have a robust method for removal of
outliers from a data set. You must also realize that the mean and standard
deviation are NOT good statistics to use for removal of outliers.
Removal of outliers can be very important in many applications, and while
least-squares statistics, such as the mean and standard deviation, are
extremely useful, they are best applied to data with NO outliers.
An excellent method for data analysis is thus provided:
1) Plot your raw data
2) Calculate the median, MAD, and zi values
3) Remove the outliers
4) Calculate the mean and standard deviation
These are sometimes called the least-trimmed squares estimates
(LTS)
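A minimal sketch of the recipe (not from the slides; plotting,
step 1, is left out, and the cutoff of 3 follows the z_i rule above):

```python
import statistics

def trimmed_stats(x, cutoff=3.0):
    """Median/MAD screen first, then mean and std dev of the survivors."""
    med = statistics.median(x)
    mad = 1.482 * statistics.median([abs(xi - med) for xi in x])
    kept = [xi for xi in x if abs(xi - med) / mad <= cutoff]
    return statistics.mean(kept), statistics.stdev(kept)

data = [-0.001, 0.044, 0.031, 0.047, -0.020, 0.020,
        -0.034, -1.000, -0.002, 0.011, 0.000, -0.023]
print(trimmed_stats(data))  # mean and std dev computed without the -1.000 outlier
```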
Inferences about the mean
We are usually working with samples from some population.
How well do the statistics of our sample compare with the
statistics of the population?
In some cases, like the age of a rock, there is only one
correct answer, and no standard deviation. But how well
does our estimate of the age compare with the “true” age?
In other cases, we would like the distribution of our sample
to reflect the distribution of the population.
An important concept is presented by the Central Limit
Theorem. It states that if n (the sample size) is large, then
the distribution of sample means closely approximates a
normal distribution. The sample mean, x̄, is an unbiased
estimate of the population mean, µ.
It can also be shown that the standard deviation of the
means of many samples, s_x̄, is related to the population
standard deviation, σ, by:
$$s_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \qquad \text{or} \qquad s_{\bar{x}} = \frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}}$$
depending on whether N is infinite or finite. Note that s_x̄
approaches zero as n gets large. The variance s² computed
for many samples has a mean value of σ². The variance of
the estimate of the variance is related to the population
variance by:
$$\sigma_{s^2}^2 = \frac{2\sigma^4}{n-1}$$
Covariance and correlation
Earlier we noted that the sample variance is defined
as:
$$s_x^2 = \frac{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}{n-1} = \frac{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(x_i - \bar{x}\right)}{n-1}$$
We often deal with pairs of properties, such as
temperature vs. depth, nitrogen vs. oxygen, silica vs.
potassium, etc., and we would like to know how these
parameters are related to each other. We can define a
variance for each property, x and y. For y:
$$s_y^2 = \frac{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}{n-1} = \frac{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)\left(y_i - \bar{y}\right)}{n-1}$$
We define the covariance as:
$$s_{xy} = \frac{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)}{n-1}$$
The covariance tells us how x and y vary together. But what
does it mean, particularly if the units of x and y are
different? Note that the covariance can be negative!
We overcome this problem by normalizing and getting rid
of units:

$$r = \frac{s_{xy}}{s_x s_y}$$
This value, r, is the correlation coefficient.
If |r| is 1, x and y are perfectly correlated. If r = 1, x and y
are identical; if r = −1, they are identical but opposite in
phase. If r is close to zero, x and y are uncorrelated.
Consider the correlations on the next slide:
Note the circle in f). The correlation is zero despite the fact
that x and y are highly related to each other. The
correlation coefficient is looking for a linear correlation.
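A sketch of the correlation coefficient (not from the slides),
including a circle like the one described in panel f):

```python
import math

def correlation(x, y):
    """r = s_xy / (s_x * s_y), with (n - 1) divisors throughout."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    sx = (sum((a - mx) ** 2 for a in x) / (n - 1)) ** 0.5
    sy = (sum((b - my) ** 2 for b in y) / (n - 1)) ** 0.5
    return sxy / (sx * sy)

print(correlation([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0: perfect linear relation

t = [2 * math.pi * i / 100 for i in range(100)]
print(correlation([math.cos(v) for v in t],
                  [math.sin(v) for v in t]))    # ~0: related, but not linearly
```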
Moments
The r-th moment of a sample is defined as:

$$m_r = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \mu\right)^r$$
Thus,
1st moment: µ, the mean (taken about zero)
2nd moment: σ², the variance
3rd moment: SK, skewness (symmetry)
4th moment: K, kurtosis (sharpness)
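A closing sketch (not from the slides) of the moment-based
statistics, using the usual normalizations for skewness and
kurtosis:

```python
def moment(x, r):
    """r-th moment about the mean: (1/n) * sum((x_i - x_bar)**r)."""
    n = len(x)
    m = sum(x) / n
    return sum((xi - m) ** r for xi in x) / n

x = [1, 2, 2, 5, 5, 7, 99]
var = moment(x, 2)
print(var)                        # variance (population form)
print(moment(x, 3) / var ** 1.5)  # skewness: positive means a long right tail
print(moment(x, 4) / var ** 2)    # kurtosis: large means a sharp peak, heavy tails
```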