EART20170 Computing, Data Analysis & Communication Skills
Lecturer: Dr Paul Connolly (F18 – Sackville Building)
[email protected]
1. Data analysis (statistics)
   - 3 lectures & practicals
   - statistics open-book test (2 hours)
2. Computing (Excel statistics/modelling)
   - 2 lectures
   - assessed practical work
Course notes etc.:
http://cloudbase.phy.umist.ac.uk/people/connolly
Recommended reading: Cheeney (1983), Statistical Methods in Geology. George, Allen & Unwin.
Recap – first lecture
- The four measurement scales: nominal, ordinal, interval and ratio.
- There are two types of errors: random errors (precision) and systematic errors (accuracy).
- Basic graphs: histograms, frequency polygons, bar charts, pie charts.
- Gaussian statistics describe random errors.
- The central limit theorem.
- Central values, dispersion, symmetry.
- Weighted mean.
Recap – last lecture
- Correlation between two variables
- Classical linear regression
- Reduced major axis regression
- Propagation of errors
Lecture 3
- Distribution of a continuous variable
  - Normal distribution
  - Standardised normal distribution
  - Confidence limits
  - Student's t distribution
- Statistical inference for two independent variables
  - Student's t test
  - Hypothesis testing
The Gaussian or Normal distribution
- Infinite sample size and a continuous curve that stretches from minus infinity to plus infinity on the x-axis.
- Any Normal distribution can be 'transformed' to the standardised normal distribution.
[Figure: four Normal distributions with different parameters – μ = 0.13889, σ = 0.20277; μ = 0.27219, σ = 0.19881; μ = 0.19872, σ = 0.60379; μ = 0.015274, σ = 0.74679]
Standardised normal distribution

y = f(z | μ, σ) = 1 / (σ √(2π)) × exp( −(z − μ)² / (2σ²) )
[Figure: the standardised normal curve for z from −3 to 3, with the areas ±1σ = 68.27%, ±2σ = 95.45% and ±3σ = 99.73% marked]
Properties:
- Mean = μ = 0
- Standard deviation = σ = 1
- Total area under curve = 1
- Total probability of obtaining a value from the distribution = 1
- Probability of obtaining a value in the interval a to b = area of curve between a and b.
- ±1σ = 68.27%; ±2σ = 95.45%; ±3σ = 99.73%
Confidence intervals
[Figure: the standardised normal curve, contrasting a narrow interval (low confidence) with a wide interval (high confidence); ±1σ = 68.27%, ±2σ = 95.45%, ±3σ = 99.73%]
- A way of describing the spread of a distribution – especially in the tails (descriptive statistics).
- There is a lot of choice about the confidence to quote (1σ ≈ 68%, 2σ ≈ 95.4%, 3σ ≈ 99.7%, …).
- There is a trade-off between interval size (precision) and degree of confidence (accuracy). In practice 1σ or 2σ confidence limits are used in science.
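The 68/95/99.7 figures quoted above can be recovered directly from the standard normal: the probability of falling within ±kσ of the mean is erf(k/√2). A quick check in Python (a sketch for illustration; the course itself uses tables and Excel):

```python
from math import erf, sqrt

# P(|z| <= k) for a standard normal variable
for k in (1, 2, 3):
    coverage = erf(k / sqrt(2))
    print(k, round(100 * coverage, 2))   # 68.27, 95.45, 99.73
```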
Last week's practical

[Figure: scatter plot of Age (Million years) against Distance (km, 0–1400), with fitted line y = 0.010657x − 0.73561, r² = 0.9906]

- How many points are on or touching the line?
- Approximately 10 … so that's 10/15 or 67% of the data.
- In other words our best line is correct 67% of the time, which is what we expect considering 1σ errors.
Confidence and the standardised normal curve
- Confidence levels are used in estimation.
- When dealing with large sample sizes we can use the Gaussian or normal distribution.
- We can convert a value x measured on any physical scale (mass, length, volts, temperature, etc.) to a standardised normal distribution in units of z (standard deviations) as follows:

z = (x − μ) / σ

This turns out to be quite useful.
Example
- Consider a sandstone body with average (μ) porosity of 20%, distributed normally with a standard deviation (σ) of 2%.
- We may need to answer a question like: what is the probability of a given specimen of this sandstone having between 17.5 and 23.0% porosity?

x = 17.5 gives z = (17.5 − 20) / 2 = −1.25
x = 23 gives z = (23 − 20) / 2 = 1.5

- The area under the normal curve corresponding to the interval −1.25 to +1.5 is found from tables to be:
  0.9332 − 0.1056 = 0.8276
- So the probability of obtaining a sample of sandstone with a porosity between 17.5 and 23.0% is 0.8276 (or 82.76%).
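The table lookup above can also be reproduced numerically, since the standard normal cumulative probability is expressible through erf. A minimal Python sketch (not part of the original notes):

```python
from math import erf, sqrt

def normal_cdf(z):
    """Cumulative probability of the standard normal up to z."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

mu, sigma = 20.0, 2.0            # mean and standard deviation of porosity (%)
z_low = (17.5 - mu) / sigma      # -1.25
z_high = (23.0 - mu) / sigma     # 1.5

p = normal_cdf(z_high) - normal_cdf(z_low)
print(round(p, 4))               # ~0.8275, matching the table value 0.8276
```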
Student's t distribution
- The t distributions were discovered by William S. Gosset in 1908. Gosset was a statistician employed by the Guinness brewing company, which had stipulated that he not publish under his own name. He therefore wrote under the pen name "Student". These distributions arise in the following situations.
Student's t-distribution
- Use of the standard normal curve is fine if you know the resolution of your experiment (the value of the population standard deviation, σ).
- What if the spread in the data is related to the basic data sample rather than the measuring device – what do you do then?
- You have to take several measurements and look at the spread. A few measurements enable the sample standard deviation, s, to be estimated. Instead of z we have to use the variable Student's t, defined as:

t = (x̄ − μ) / (s / √N)

- t is not normally distributed with unit variance, because of the additional uncertainty in s, our estimate of σ.
- The number of measurements minus one (N − 1) is called the number of degrees of freedom.
Student's t-distribution
- The t distribution is like a Gaussian, but the tails are larger. At large sample sizes (>30) t is almost identical to the standard Gaussian.
- Tables of t are more complicated than tables of z because of the extra parameter, the number of degrees of freedom (N − 1).
[Figure: Student's t density compared with the Normal density for ν = 1 to 16 degrees of freedom; the t curves have heavier tails and approach the Normal curve as ν increases]
One important use of Student's t looks at how a sample with mean x̄ and standard deviation s can be used to estimate the true mean, μ:

μ = x̄ ± t × s / √N
Example
- Using our sandstone example again, assume that 5 samples gave a mean porosity of 20% with s = 2%. What are the 95% confidence intervals of the average porosity of the whole sandstone body?
- The estimated standard error of the mean is given as:

SE(x̄) = s / √N = 2 / √5 ≈ 0.89

- The critical t for 4 (N − 1) degrees of freedom is 2.776. So the confidence limits are:

μ = x̄ ± t × s / √N = 20 ± 2.776 × 0.89 = 20 ± 2.5

NB. If this were Gaussian we would use 1.96 and get limits of 18.2 to 21.7.
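The arithmetic of this confidence interval is easy to script. A small Python sketch, taking the critical t (2.776 for 4 degrees of freedom at 95%) from tables exactly as the worked example does:

```python
from math import sqrt

# Sandstone example: 5 samples, mean porosity 20%, sample std dev 2%
n, xbar, s = 5, 20.0, 2.0

se = s / sqrt(n)          # standard error of the mean, ~0.894
t_crit = 2.776            # critical t for N - 1 = 4 df at 95% (from tables)
margin = t_crit * se      # ~2.48, i.e. the "±2.5" quoted above

print(xbar - margin, xbar + margin)   # roughly 17.5 to 22.5
```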
Hypothesis testing
- Taking decisions.
- Sometimes the information obtained from the data is not a number but a yes or no answer to a factual question:
  "Is there an effect present?"
  "Do x and y have the same value?"
- This type of problem is called hypothesis testing:
  (1) correct formulation of the hypothesis;
  (2) construction of a numerical test;
  (3) applying it to the data, and then;
  (4) rejecting or accepting the hypothesis based on the result of the test.
Accept or reject?
- Whether you accept or reject a hypothesis is really under your control, by determining the significance of the test. The test involves dividing a probability distribution into an acceptance region (large) and a rejection region (small).
[Figure: the standardised normal curve divided into a large central acceptance region and small rejection regions in the tails]
Significance
- The significance of a test is called α and is the rejection region of our probability distribution. For a good test α should be small – 1% or 5%.
- The hypothesis is usually the null effect (i.e. no difference, no effect), but the alternate hypothesis tends to be vague and unquantifiable (e.g. an effect exists, but this gives no clue as to why or how large).
- A final point is to remember that there is a difference between one-tailed directional and two-tailed non-directional tests:
  "X is greater than Y" is a one-tailed test
  "X is different from Y" is a two-tailed test
Additional
- Suppose we have a simple random sample of size n drawn from a Normal population with mean μ and standard deviation σ. Let x̄ denote the sample mean and s the sample standard deviation. Then the quantity

t = (x̄ − μ) / (s / √n)     (1)

has a t distribution with n − 1 degrees of freedom.
- Note that there is a different t distribution for each sample size; in other words, it is a class of distributions. When we speak of a specific t distribution, we have to specify the degrees of freedom. The degrees of freedom for this t statistic come from the sample standard deviation s in the denominator of equation 1.
- The t density curves are symmetric and bell-shaped like the normal distribution and have their peak at 0. However, the spread is greater than that of the standard normal distribution. This is due to the fact that in formula 1 the denominator is s rather than σ. Since s is a random quantity varying from sample to sample, the variability in t is greater, resulting in a larger spread.
- The larger the degrees of freedom, the closer the t density is to the normal density. This reflects the fact that the standard deviation s approaches σ for large samples.
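The two facts above – heavier tails than the normal, and convergence to the normal as the degrees of freedom grow – can be checked directly from the t density formula. A sketch in Python (the formula for the t density is standard, but this code is illustrative, not part of the course notes):

```python
from math import exp, lgamma, pi, sqrt

def t_pdf(x, nu):
    """Density of Student's t with nu degrees of freedom.
    lgamma (log-gamma) avoids overflow for large nu."""
    coef = exp(lgamma((nu + 1) / 2) - lgamma(nu / 2)) / sqrt(nu * pi)
    return coef * (1 + x * x / nu) ** (-(nu + 1) / 2)

def normal_pdf(x):
    """Density of the standard normal."""
    return exp(-x * x / 2) / sqrt(2 * pi)

# Heavier tails: at x = 3 the t density (nu = 2) is well above the normal
print(t_pdf(3, 2), normal_pdf(3))

# Convergence: for large nu the t density is almost the normal density
print(abs(t_pdf(0, 1000) - normal_pdf(0)))  # very small
```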
Hypothesis testing
- A hypothesis test is a procedure for determining if an assertion about a characteristic of a population is reasonable.
- For example, suppose that someone says that the average price of a gallon of regular unleaded gas in Massachusetts is $1.15. How would you decide whether this statement is true?
- You could try to find out what every gas station in the state was charging and how many gallons they were selling at that price. That approach might be definitive, but it could end up costing more than the information is worth.
- A simpler approach is to find out the price of gas at a small number of randomly chosen stations around the state and compare the average price to $1.15.
- Of course, the average price you get will probably not be exactly $1.15, due to variability in price from one station to the next. Suppose your average price was $1.18.
- Is this three-cent difference a result of chance variability, or is the original assertion incorrect? A hypothesis test can provide an answer.
Hypothesis test terminology
- The null hypothesis is the original assertion. In this case the null hypothesis is that the average price of a gallon of gas is $1.15. The notation is H0: µ = 1.15.
- There are three possibilities for the alternative hypothesis. You might only be interested in the result if gas prices were actually higher. In this case, the alternative hypothesis is H1: µ > 1.15. The other possibilities are H1: µ < 1.15 and H1: µ ≠ 1.15.
- The significance level is related to the degree of certainty you require in order to reject the null hypothesis in favor of the alternative.
- By taking a small sample you cannot be certain about your conclusion. So you decide in advance to reject the null hypothesis if the probability of observing your sampled result is less than the significance level. For a typical significance level of 5%, the notation is α = 0.05. For this significance level, the probability of incorrectly rejecting the null hypothesis when it is actually true is 5%. If you need more protection from this error, then choose a lower value of α.
- The p-value is the probability of observing the given sample result under the assumption that the null hypothesis is true. If the p-value is less than α, then you reject the null hypothesis.
- For example, if α = 0.05 and the p-value is 0.03, then you reject the null hypothesis.
- The converse is not true. If the p-value is greater than α, you have insufficient evidence to reject the null hypothesis.
- The outputs for many hypothesis test functions also include confidence intervals. Loosely speaking, a confidence interval is a range of values that have a chosen probability of containing the true hypothesized quantity. Suppose, in our example, 1.15 is inside a 95% confidence interval for the mean, µ. That is equivalent to being unable to reject the null hypothesis at a significance level of 0.05. Conversely, if the 100(1 − α)% confidence interval does not contain 1.15, then you reject the null hypothesis at the α level of significance.
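The gas-price test can be sketched end to end. The station prices below are hypothetical (invented for illustration – the lecture only quotes the sample mean $1.18), and the critical t is taken from tables for a two-tailed test at α = 0.05 with n − 1 = 7 degrees of freedom:

```python
from math import sqrt

# Hypothetical prices ($/gallon) from 8 randomly chosen stations
prices = [1.18, 1.14, 1.20, 1.17, 1.22, 1.12, 1.19, 1.16]

n = len(prices)
xbar = sum(prices) / n                               # sample mean
s = sqrt(sum((x - xbar) ** 2 for x in prices) / (n - 1))  # sample std dev

mu0 = 1.15                        # null hypothesis H0: mu = 1.15
t = (xbar - mu0) / (s / sqrt(n))  # t statistic, ~1.96 for this sample

t_crit = 2.365                    # two-tailed critical t, alpha = 0.05, 7 df
if abs(t) > t_crit:
    print("reject H0")
else:
    print("insufficient evidence to reject H0")
```

For this made-up sample |t| falls inside the acceptance region, so the null hypothesis is not rejected; equivalently, $1.15 lies inside the corresponding 95% confidence interval.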