Lecture 10 (MWF)
Data Analysis and Statistical Methods
Statistics 651
Suhasini Subba Rao
http://www.stat.tamu.edu/~suhasini/teaching.html

Review of previous lecture
• We calculated probabilities of a normal distribution by standardisation.
• Example: Suppose X ∼ N (−3, 0.5), what is P (X ≤ −3.5)?
  Standardise: P (X ≤ −3.5) = P ((X + 3)/√0.5 ≤ (−3.5 + 3)/√0.5) = P (Z ≤ −0.707) ≈ 0.239 (by using the normal tables). Note that when we compute (−3.5 + 3)/√0.5 we are going from the non-standard normal X ∼ N (−3, 0.5) to a standard normal, hence Z = (X + 3)/√0.5, where Z ∼ N (0, 1).
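A minimal R sketch of this calculation (reading N (−3, 0.5) as mean −3 and variance 0.5, so pnorm must be given the standard deviation √0.5; the call below is only an illustration of the table lookup):

  # P(X <= -3.5) for X ~ N(-3, 0.5); pnorm takes the standard deviation, not the variance
  pnorm(-3.5, mean = -3, sd = sqrt(0.5))   # approximately 0.24
  # equivalently, standardise first and use the standard normal
  pnorm((-3.5 - (-3)) / sqrt(0.5))         # P(Z <= -0.707)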
• We also did the reverse of this: finding the value on the x-axis where P (X ≤ x) = 0.8, when X ∼ N (6, 7) (for example).
  In this case we had to standardise: P ((X − 6)/√7 ≤ (x − 6)/√7) = 0.8. Look up in the tables the z-value that corresponds to 0.8; this is about 0.84. Therefore (x − 6)/√7 = 0.84, and solve for x.
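A minimal R sketch of this reverse lookup (again reading N (6, 7) as mean 6 and variance 7):

  # value x with P(X <= x) = 0.8 for X ~ N(6, 7)
  qnorm(0.8, mean = 6, sd = sqrt(7))   # about 8.23
  # or via the standard normal: x = 6 + sqrt(7) * z, where P(Z <= z) = 0.8
  6 + sqrt(7) * qnorm(0.8)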
• Up until now we have assumed that the random variable X is normally
distributed.
• Often, if we are not given any other information, we need to check whether this assumption is realistic.
• We check this assumption based on the data we have. We will do this in
this lecture.
• Note: In the case of the sample mean this assumption is close to valid,
thanks to the CLT, which comes later in the course.
Checking for Normality (a very rough check)
• Suppose x1, . . . , xn is a sample from a normal distribution with mean µ and variance σ².
• First we order them from the smallest number to the largest number: x(1), . . . , x(n).
• Estimate the mean and standard deviation from the data: x̄ and s.
• Plot all the observations on a number line. Locate the mean x̄ on this line and also the intervals [x̄ − s, x̄ + s], [x̄ − 2s, x̄ + 2s] and [x̄ − 3s, x̄ + 3s].
• If the observations came from a normal distribution, then
  – Roughly 68% of the observations should lie in the interval [x̄ − s, x̄ + s].
  – Roughly 95% of the observations should lie in the interval [x̄ − 2s, x̄ + 2s].
  – Roughly 99.7% of the observations should lie in the interval [x̄ − 3s, x̄ + 3s].
• Remember this means counting the number of points in each interval and dividing it by the total number of observations (see the sketch below).
• This is an extremely rough way to check for normality.
• There can exist weird non-normal distributions for which all of the following:
  – Roughly 68% of the observations lie in the interval [x̄ − s, x̄ + s].
  – Roughly 95% of the observations lie in the interval [x̄ − 2s, x̄ + 2s].
  – Roughly 99.7% of the observations lie in the interval [x̄ − 3s, x̄ + 3s].
  could still be true!
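A minimal R sketch of this counting check; the simulated data set x is an assumption purely for illustration:

  set.seed(1)
  x <- rnorm(100, mean = 5, sd = 2)   # illustrative data; replace with your own sample
  xbar <- mean(x)
  s <- sd(x)
  for (t in 1:3) {
    inside <- mean(x >= xbar - t * s & x <= xbar + t * s)   # proportion in [xbar - t*s, xbar + t*s]
    cat("t =", t, ":", round(100 * inside, 1), "% of observations (normal:",
        c(68, 95, 99.7)[t], "%)\n")
  }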
Motivating the QQplot
• We need to find a more accurate method (which is close in idea to the counting in an interval).
• This motivates the idea of the QQplot.
• Roughly speaking, the QQplot finds the percentage of observations which should lie in the interval [x̄ − t × s, x̄ + t × s] if they were to come from a normal distribution, for lots of different values of t (not just t = 1, 2, 3, as was done above). This is compared with the actual percentage of the observations which lie in the interval [x̄ − t × s, x̄ + t × s], and these two numbers are plotted against each other (sort of). If they are close then the resulting plot should be close to a straight line.
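A minimal R sketch of this idea, comparing the theoretical and observed proportions over a grid of t values (the simulated data and the grid are assumptions for illustration):

  x <- rnorm(100, mean = 5, sd = 2)                 # illustrative data
  xbar <- mean(x); s <- sd(x)
  t.grid <- seq(0.1, 3, by = 0.1)
  observed <- sapply(t.grid, function(t) mean(abs(x - xbar) <= t * s))
  theoretical <- pnorm(t.grid) - pnorm(-t.grid)     # P(|Z| <= t) under normality
  plot(theoretical, observed)                       # should lie close to a straight line
  abline(0, 1)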
Checking for normality: The QQ plot
• This is the QQplot: it plots what has been described above. It is not exactly this, but similar in idea. The details are given at the end of the lecture for those who are interested (but they are not necessary for the course).
• The QQplot consists of points and a straight 45 degree line.
[Figure: sketch of a QQplot; the ordered observations X(1), . . . , X(5) are plotted against the values y(1), . . . , y(5), together with the x = y line.]
• If the points tend to lie on the straight line, then this suggests the observations come from a normal distribution.
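In practice the plot is produced by software; a minimal R sketch (the simulated data x is an assumption for illustration):

  x <- rnorm(100, mean = 5, sd = 2)   # illustrative data
  qqnorm(x)                           # sample quantiles against theoretical normal quantiles
  qqline(x)                           # reference line through the quartiles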
Example: Antarctic maximum temperature QQplot
[Figure: Normal Q-Q plot; sample quantiles plotted against theoretical quantiles.]
What do you think about the assumption of normality?

Example: Antarctic minimum temperature QQplot
[Figure: Normal Q-Q plot; sample quantiles plotted against theoretical quantiles.]
What do you think about the assumption of normality? If it does not seem to come from a normal distribution, what can we say about the distribution it comes from?
Interpreting a QQ-plot
• Some experienced statisticians have shaman-like powers when it comes to interpreting QQ-plots.
• You don't need such powers, but it is good to have a feel for them.
• There are two main features you need to look for: skewness (left or right) and heavy tails.
  – Left skew. This means the distribution is not symmetric. Find the mode (the highest point of the distribution). The right side of the mode should be shorter than the left side.
  – Right skew. This means the distribution is not symmetric. Find the mode (the highest point of the distribution). The right side of the mode should be longer than the left side.
  – Heavy tails. This means that large values are much more likely than under a normal distribution. For example, for a normal distribution most of the observations (roughly 99.7%) lie within the interval [x̄ − 3s, x̄ + 3s]; for a heavy tailed distribution a far smaller proportion lies in this interval.
QQ-plot and skews
[Figure: QQ-plot of a right skewed sample against the 45 degree line.]
• The above indicates a right skewed distribution.
• The points are arched, going from below the 45 degree line, across it, and down again.
• For a left skewed distribution the QQ-plot is the mirror image along the 45 degree line (the arch goes upwards and towards the left).

A right skewed distribution
[Figure: density of a right skewed distribution (red) compared with a normal density (green).]
• A right skewed distribution (red) has a long right tail (green is the normal).
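A minimal R sketch of the arched pattern, using exponentially distributed data as an assumed example of a right skewed sample:

  set.seed(1)
  x.skew <- rexp(200, rate = 1)   # exponential data: strongly right skewed
  qqnorm(x.skew)                  # points arch relative to the reference line
  qqline(x.skew)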
QQ-plot and heavy tails
[Figure: QQ-plot of a heavy tailed sample against the 45 degree line.]
• The plot is like an 'S'. On the left of the plot the points lie to the left of the 45 degree line, and towards the right they move to the right of the 45 degree line.

Heavy tail distribution
[Figure: tails of a heavy tailed density (red) compared with the tails of a normal density (blue).]
• A heavy tailed distribution has much thicker tails than a normal distribution (the blue are the tails of a normal and the red are the tails of a heavy tailed distribution).
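A minimal R sketch of the 'S' shape, using t-distributed data as an assumed example of a heavy tailed sample:

  set.seed(1)
  x.heavy <- rt(200, df = 3)   # t distribution with 3 degrees of freedom: heavy tails
  qqnorm(x.heavy)              # both ends of the plot bend away from the reference line
  qqline(x.heavy)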
Transforming a distribution
• If the data is far from normal we often apply a transformation to make it 'more normal'.
• Standard transforms are (a short sketch of the first two follows this list):
  – The log transform: Xi → Yi = log Xi.
    This transformation should only be applied to positive observations. The variance of the transformed observations tends to be less than the variance of the original observations (sometimes this transformation is called 'variance stabilisation'). It is often used when the sample mean and sample variance of X are similar.
  – The square root transform: Xi → Yi = √Xi.
    This transformation tends to control outliers; very large values are pushed down (this is also true of the log transform).
  – There are many other transformations.
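A minimal R sketch of these two transforms, applied to the right skewed exponential sample used earlier (the data set is an assumption purely for illustration):

  x.skew <- rexp(200, rate = 1)   # positive, right skewed data
  y.log <- log(x.skew)            # log transform (positive observations only)
  y.sqrt <- sqrt(x.skew)          # square root transform: large values are pulled in
  par(mfrow = c(1, 2))
  qqnorm(y.log);  qqline(y.log)
  qqnorm(y.sqrt); qqline(y.sqrt)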
QQ plots and testing for normality
• There are 'statistical tests' (I have not defined this yet) for checking normality. One of the most famous is the Kolmogorov-Smirnov test.
• QQ plots for other distributions: it is possible to make a QQplot for other distributions, that is, to check whether the observations are drawn from another distribution of interest. The QQplot must be modified to the new distribution of interest. You could use the method detailed below to make this new plot, or do it using a statistical package.
• Again, the Kolmogorov-Smirnov test can be used to check whether the observations come from the distribution of interest.
• The rest of these lecture notes are an optional aside; they explain how a QQplot is made.
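A minimal R sketch of the Kolmogorov-Smirnov test for normality (here the mean and standard deviation are estimated from the same data, which is a common shortcut; strictly the test assumes they are specified in advance):

  x <- rnorm(100, mean = 5, sd = 2)                # illustrative data
  ks.test(x, "pnorm", mean = mean(x), sd = sd(x))  # compare x with a fitted normal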
Aside: Plotting a probability (Q-Q) plot
• Suppose we observe X1, . . . , Xn. We want to check if the observations X1, . . . , Xn are from a normal distribution with mean µ and variance σ².
• We order the observations X(1), . . . , X(n).
• If we draw the relative frequency histogram with the intervals only containing each point, the height of each point is 1/n.
[Figure: relative frequency histogram with a bar of height 1/n at each of the observations X1, . . . , X6.]
• Therefore, roughly speaking, P (X ≤ X(i)) ≈ i/n.
• To prevent the end point X(n) corresponding to 100 percent, that is P (X ≤ X(n)) = 1, which would make X(n) an absolute maximum (in another sample there could be a larger value), we remove a small amount from the probability. That is,
  P (X ≤ X(i)) ≈ i/n − 1/(2n).
  This means that P (X ≤ X(n)) = 1 − 1/(2n).
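A minimal R sketch of these plotting probabilities (n = 6 is an assumption, matching the worked example later in the notes):

  n <- 6
  i <- 1:n
  p <- i / n - 1 / (2 * n)   # P(X <= X_(i)) approx i/n - 1/(2n)
  p                          # 1/12, 3/12, 5/12, 7/12, 9/12, 11/12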
This is an example of how to make a probability plot.
• Let Z ∼ N (0, 1).
• Then, using the normal tables, we can evaluate the points yi where P (X ≤ yi) = i/n − 1/(2n). Basically, this is done by:
  – From the normal tables find the zi where P (Z ≤ zi) = i/n − 1/(2n) (where Z ∼ N (0, 1)).
  – Transform zi, such that yi = σzi + µ.
• Plot the points (X(i), yi).
[Figure: the ordered observations X(1), . . . , X(5) plotted against y(1), . . . , y(5), together with the x = y line.]
• The y-axis gives the 'true' values associated with the normal distribution, e.g. P (X ≤ yi) = i/n − 1/(2n).
• The x-axis is what we observe.
• We see how well the points fit through the x = y line.
• If the plot is close to linear, about the x = y line, then there is a large amount of evidence that the observations X1, . . . , Xn are from a normal distribution with mean µ and variance σ².
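A minimal R sketch of this construction; the data, and the use of the sample mean and standard deviation in place of µ and σ, are assumptions for illustration:

  set.seed(2)
  x <- rnorm(50, mean = 10, sd = 3)   # illustrative data
  n <- length(x)
  p <- (1:n) / n - 1 / (2 * n)        # P(X <= y_i) = i/n - 1/(2n)
  z <- qnorm(p)                       # z_i with P(Z <= z_i) = p_i
  y <- sd(x) * z + mean(x)            # y_i = sigma * z_i + mu (estimated here)
  plot(sort(x), y, xlab = "observed X_(i)", ylab = "theoretical y_i")
  abline(0, 1)                        # the x = y line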
Aside: example of making a QQplot
We observe 117, 132, 111, 107, 85, 89. We would like to know if it is a sample from a normal distribution with mean 106 and variance 258 (this is the sample mean and variance of the observations).
The idea is:
• Order the observations: 85, 89, 107, 111, 117, 132.
• We see that, using the plotting probabilities i/n − 1/(2n), approximately:
  8.3% of these observations are less than or equal to 85.
  25% of these observations are less than or equal to 89.
  41.6% of these observations are less than or equal to 107.
  58.3% of these observations are less than or equal to 111.
  75% of these observations are less than or equal to 117.
  91.6% of these observations are less than or equal to 132.
• What we want to do is to see whether, for a normal distribution with mean 106 and variance 258, approximately:
  8.3% of the curve is less than 85.
  25% of the curve is less than 89.
  41.6% of the curve is less than 107.
  58.3% of the curve is less than 111.
  75% of the curve is less than 117.
  91.6% of the curve is less than 132.
• We do this by finding the points on the normal curve where
  8.3% of the curve is less than y1.
  25% of the curve is less than y2.
  41.6% of the curve is less than y3.
  58.3% of the curve is less than y4.
  75% of the curve is less than y5.
  91.6% of the curve is less than y6.
• If 85 is close to y1, 89 is close to y2, 107 is close to y3, 111 is close to y4, 117 is close to y5 and 132 is close to y6, then we can say the observations may have come from a normal with mean 106 and variance 258.
• We do this by plotting a graph and seeing whether the points lie close to the 45 degree line.
• We now show how to evaluate yi.
  observation   order i   probability pi = i/6 − 1/12   zi on the standard normal curve, P (Z ≤ zi) = pi   transformed zi: yi = √258 × zi + 106
  85            1         1/6 − 1/12 = 0.083            −1.38                                              84
  89            2         2/6 − 1/12 = 0.25             −0.67                                              95
  107           3         3/6 − 1/12 = 0.416            −0.21                                              103
  111           4         4/6 − 1/12 = 0.583            0.21                                               109
  117           5         5/6 − 1/12 = 0.75             0.67                                               117
  132           6         6/6 − 1/12 = 0.916            1.38                                               128

We obtain zi from the normal tables, but the opposite way round from how we find probabilities: we look for the probability i/n − 1/(2n) inside the table and locate the corresponding zi.

Basically, if Z ∼ N (0, 1), then the numbers above mean:
• P (Z ≤ −1.38) ≈ 0.083,
• P (Z ≤ −0.67) ≈ 0.25,
• P (Z ≤ −0.21) ≈ 0.416,
• P (Z ≤ 0.21) ≈ 0.583,
• P (Z ≤ 0.67) ≈ 0.75 and
• P (Z ≤ 1.38) ≈ 0.916.
Plot yi against X(i).

  X(i):  85   89   107  111  117  132
  yi:    84   95   103  109  117  128

[Figure: plot of yi against X(i), with both axes running from 70 to 140.]
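A minimal R sketch reproducing this worked example; the yi values above can be checked with qnorm (small differences are just rounding):

  obs <- c(117, 132, 111, 107, 85, 89)
  x.sorted <- sort(obs)             # 85 89 107 111 117 132
  n <- length(obs)
  p <- (1:n) / n - 1 / (2 * n)      # 1/12, 3/12, 5/12, 7/12, 9/12, 11/12
  z <- qnorm(p)                     # standard normal z_i
  y <- sqrt(258) * z + 106          # y_i = sqrt(258) * z_i + 106
  round(cbind(X = x.sorted, p = p, z = z, y = y), 2)
  plot(x.sorted, y, xlab = "X_(i)", ylab = "y_i", xlim = c(70, 140), ylim = c(70, 140))
  abline(0, 1)                      # points close to this line suggest normality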