Quantitative Methods Module I
Gwilym Pryce
[email protected]
Lecture 1
Density curves and the CLT
Notices:
Register
Feedback forms
Labs:
– Who wants to do the afternoon lab?
– Who wants to do the evening lab?
Class Reps and Staff Student committee.
Message:
– those in Business taking the Master Class this week: come to seminar room 3 on the 3rd floor of the Business school at 10.00 on Wednesday? Thanks, Andy Furlong
Introduction:
In this lecture we introduce some statistical theory.
This theory sometimes seems abstract for an applied quants course:
– it is tempting just to use SPSS, which is a very powerful statistical package, without properly learning statistical theory
– … but a little knowledge is a dangerous thing...
Quants I (24/09/2005 - v23)
L1: Density Functions & CLT
L2: Calculating z-scores
L3: Introduction to Confidence Intervals
L4: Confidence Intervals for All Occasions
L5: Introduction to Hypothesis Tests
L6: Hypothesis Tests for All Occasions
L7: Relationships between Categorical Variables
L8: Regression
L1: Density Functions & CLT
Lect 1, 27sep05:
1. Review of Induction material
2. Density Functions
3. Normal Distribution
4. Central Limit Theorem
Lab 1, 3oct05:
– Example 1.3.3a How to Create a Histogram in SPSS p.1-24
– Example 1.5.6a Computing the Sample Mean in SPSS p.1-34
– Example 1.5.6b Computing the Sample Standard Deviation in SPSS p.1-35
– Example 1.5.7 How to obtain a Summary of File Info p.1-37
– Exercise 2.2 Understanding and Calculating Areas under a Density Curve p.2-14
– Example 2.6a Sampling Distribution of Means p.2-17
– Exercise 2.6b Impact on CLT of Reducing Sample Size p.2-20
– Exercise 2.8 Proportions and the CLT p.2-23
Reading 1, Pryce I&S in SPSS:
– Pryce, Sections 1.3, 1.5, 2.4
– Pryce, Ch.1
– Pryce, Section 2.6
– Pryce, Section 2.5
Aims & Objectives
Aim
• The aim of this lecture is to introduce the concepts that undergird statistical inference
Objectives
– By the end of this lecture students should be able to:
• understand what a density curve is
• understand the principles that allow us to make inferences about the population from samples
Plan
1. Review of Induction material
2. Density curves & Symmetrical Distributions
3. Normal Distribution
4. Central Limit Theorem
1. Review of Induction material
1. Measures of Central Tendency
– mean: $\bar{x} = \frac{1}{n} \sum_{i} x_i$
2. Measures of Spread
– range, standard deviation: $s = \sqrt{\frac{1}{n-1} \sum_{i} (x_i - \bar{x})^2}$
– percentiles & outliers
– Symmetric distributions
3. Density curves
4. Distribution of means from repeated
samples = central limit theorem.
5. Normal Distribution
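As a quick refresher on the first two review items, the sketch below computes the sample mean and the sample standard deviation by hand and compares the result with Python's `statistics` module. The data values are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import math
import statistics


def sample_mean(values):
    """Sum of the observations divided by the number of observations, n."""
    return sum(values) / len(values)


def sample_sd(values):
    """Square root of the summed squared deviations from the mean over n - 1."""
    mean = sample_mean(values)
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_deviations / (len(values) - 1))


# Hypothetical data, for illustration only.
data = [2, 4, 4, 4, 5, 5, 7, 9]

print(sample_mean(data))          # 5.0
print(round(sample_sd(data), 3))  # 2.138
# The hand-written formula should agree with the library version.
print(math.isclose(sample_sd(data), statistics.stdev(data)))
```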
2. Density curves:
idealised histograms
(rescaled so that area sums to one)
Properties of a density curve

Vertical axis indicates relative frequency
over values of the variable X
– Entire area under the curve is 1
– The density curve can be described by an
equation
– Density curves for theoretical probability
models have known properties
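To make two of these properties concrete (the curve is described by an equation, and the entire area under it is 1), here is a minimal sketch using the normal density as the example curve. The function names and the choice of a simple trapezium rule are illustrative assumptions, not part of the lecture.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Equation of the normal density curve with mean mu and S.D. sigma."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-0.5 * ((x - mu) / sigma) ** 2)


def area_under_curve(pdf, lower, upper, steps=10_000):
    """Approximate the area under pdf between lower and upper (trapezium rule)."""
    width = (upper - lower) / steps
    total = 0.5 * (pdf(lower) + pdf(upper))
    for i in range(1, steps):
        total += pdf(lower + i * width)
    return total * width


# The range -10 to 10 covers practically all of the standard normal curve.
total_area = area_under_curve(normal_pdf, -10.0, 10.0)
print(round(total_area, 4))
```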
Area under density curves:

the area under a density curve that lies
between two numbers
= the proportion of the data that lies
between these two numbers:
• e.g. if the area between two numbers x1 and x2 = 0.6, then 60% of the xi lie between x1 and x2
– when the density curve is symmetrical, we
make use of the fact that areas under the
curve will also be symmetrical
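The link between area and proportion can be checked by simulation. The sketch below assumes a normal density with a hypothetical mean of 100 and S.D. of 15, and compares the area under the curve between two numbers with the share of simulated observations that fall between them.

```python
import random
from statistics import NormalDist

# Hypothetical density curve and cut-off values, for illustration only.
MU, SIGMA = 100.0, 15.0
X1, X2 = 90.0, 120.0

# Area under the density curve between X1 and X2.
curve = NormalDist(MU, SIGMA)
area = curve.cdf(X2) - curve.cdf(X1)

# Proportion of simulated data lying between X1 and X2.
rng = random.Random(1)
draws = [rng.gauss(MU, SIGMA) for _ in range(100_000)]
proportion = sum(X1 < x < X2 for x in draws) / len(draws)

print(round(area, 3), round(proportion, 3))
```

The two numbers should agree closely, and the agreement generally improves as the number of simulated observations grows.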
Symmetrical Distributions
Mean = median
Areas of segments symmetrical
50% of sample < mean
50% of sample > mean
Symmetrical Distributions
• If 60% of sample falls between a and b, what % is greater than b?
• What’s the probability of randomly choosing an observation greater than b?
[Figure: symmetrical density curve with 60% of the area lying between a and b]
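One way to reason about the question, assuming a and b sit at equal distances either side of the mean (the slide does not state this explicitly): the area outside the interval is split equally between the two tails. The helper name below is hypothetical.

```python
def upper_tail_share(middle_share):
    """Share of a symmetrical distribution lying above b.

    Assumes a and b are equidistant from the mean, so the area outside
    the interval is divided equally between the lower and upper tails.
    """
    return (1.0 - middle_share) / 2.0


# 60% between a and b leaves 40% outside, i.e. 20% in each tail.
print(round(upper_tail_share(0.60), 2))  # 0.2
```

Under that assumption, the probability of randomly choosing an observation greater than b is the same figure, 0.2.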
What’s the probability of being less than
6ft tall?
[Figure: density curve of height, with a 20% area marked at 6ft]
3. Normal distribution:
68% and 95% rules

Slide 10 of 13 of Christian’s.
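The 68% and 95% rules can be verified numerically. This sketch uses `statistics.NormalDist` from the Python standard library to compute the share of a normal distribution lying within one and two standard deviations of its mean; the second set of parameters (170, 10) is a hypothetical example showing that the rule does not depend on the particular mean and S.D.

```python
from statistics import NormalDist


def share_within(k, mu=0.0, sigma=1.0):
    """Share of a normal distribution within k standard deviations of its mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)


print(round(share_within(1), 4))            # about 0.68
print(round(share_within(2), 4))            # about 0.95
print(round(share_within(1, 170, 10), 4))   # same share for any mean and S.D.
```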
Normal Curves are all related
Infinite number of possible normal distributions
– but they vary only by mean and S.D.
• so they are all related: just scaled and shifted versions of each other
A baseline normal distribution has been invented:
– called the standard normal distribution
– has a mean of zero and a standard deviation of one
[Figure: histogram of a normally distributed variable (NORM_2) with values a, b and c marked, standardised to a second histogram with the corresponding z-scores za, zb and zc marked]
Standard Normal Curve
We can standardise any observation from a normal distribution
– i.e. show where it fits on the standard normal distribution by:
• subtracting the mean from each value and dividing the result by the standard deviation.
• This is called the z-score = standardised value of any normally distributed observation.
$z_i = \frac{x_i - \mu}{\sigma}$
where $\mu$ = population mean and $\sigma$ = population S.D.
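The standardisation step is a one-line calculation. The sketch below applies it to a few observations; the population mean of 170 and S.D. of 10 are hypothetical values (think of heights in cm) used only for illustration.

```python
def z_score(x, mu, sigma):
    """Standardise x: subtract the population mean, divide by the population S.D."""
    return (x - mu) / sigma


# Hypothetical population parameters and observations.
MU, SIGMA = 170.0, 10.0
observations = [155.0, 170.0, 185.0]

z_values = [z_score(x, MU, SIGMA) for x in observations]
print(z_values)  # [-1.5, 0.0, 1.5]
```

An observation equal to the mean has a z-score of zero, and the sign shows whether an observation lies below or above the mean.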
• Areas under the standard normal curve between different z-scores are equal to areas between corresponding values on any normal distribution
• Tables of areas have been calculated for each z-score,
– so if you standardise your observation, you can find out the
area above or below it.
– But we saw earlier that areas under density functions
correspond to probabilities:
• so if you standardise your observation, you can find out the
probability of other observations lying above or below it.
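In place of printed tables, the areas can be computed directly. This sketch assumes a hypothetical normal population (mean 170, S.D. 10) and shows that the area below an observation on its own distribution equals the area below its z-score on the standard normal curve, which in turn gives the probability of another observation lying above or below it.

```python
from statistics import NormalDist

# Hypothetical population parameters and observation.
MU, SIGMA = 170.0, 10.0
x = 185.0

z = (x - MU) / SIGMA                          # standardised value, here 1.5
area_below_x = NormalDist(MU, SIGMA).cdf(x)   # area on the original curve
area_below_z = NormalDist().cdf(z)            # area on the standard normal curve
prob_above = 1.0 - area_below_z               # probability of lying above x

print(round(area_below_x, 4), round(area_below_z, 4))
print(round(prob_above, 4))
```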
4. Distribution of means from
repeated samples


We have looked at how to calculate the
sample mean
What distribution of means do we get if
we take repeated samples?
E.g. Suppose the distribution of income in the
population looks like this:

Then suppose we ask a random sample of
people what their income is.
– This sample will probably have a distribution of income similar to that of the population
• Positive skew: mean is “pulled-up” by the incomes of fat-cat, bourgeois capitalists.
• Since the median is a “resistant measure”, the mean is greater than the median

Then suppose we take a second sample, and
then a third; and then compute the mean
income of each sample:
– Sample 1: mean income = £20,500
– Sample 2: mean income = £18,006
– Sample 3: mean income = £21,230
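The repeated-sampling idea can be tried out in a few lines. The sketch below assumes, purely for illustration, that income follows a lognormal distribution with a median of £18,000 (a common stand-in for positively skewed incomes, not something stated in the lecture). It draws one sample to show the mean exceeding the median, then three further samples to show how the sample mean varies.

```python
import math
import random
import statistics

rng = random.Random(2005)

# Hypothetical, positively skewed income distribution (lognormal).
MEDIAN_INCOME = 18_000.0
SIGMA_LOG = 0.6
POP_MEAN = MEDIAN_INCOME * math.exp(SIGMA_LOG ** 2 / 2)  # mean of a lognormal


def draw_sample(n):
    """Ask a random sample of n people what their income is."""
    return [rng.lognormvariate(math.log(MEDIAN_INCOME), SIGMA_LOG) for _ in range(n)]


# One sample: the mean is pulled up above the median by the high incomes.
sample = draw_sample(1000)
sample_mean = statistics.mean(sample)
sample_median = statistics.median(sample)
print(round(sample_mean), round(sample_median))

# Three further samples: each gives a slightly different mean income.
means = [statistics.mean(draw_sample(1000)) for _ in range(3)]
print([round(m) for m in means])
```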
As more samples are taken, normal
distribution of mean emerges
[Figure: series of histograms of sample means (variable NORM_2); the shape approaches a normal curve as more samples are taken]
Why the normal distribution is
useful:

Even if a variable is not normally distributed, its sampling distribution of means will be approximately normally distributed, provided n is large (i.e. n > 30)
– i.e. some samples will have a mean that is way out of line from the population mean, but most will be reasonably close.
– “Central Limit Theorem”
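A small simulation illustrates the theorem. The population here is assumed to be exponential (strongly skewed, with mean 1 and S.D. 1), and the sample size of 50 and the 5,000 repetitions are arbitrary choices for the sketch. If the sample means are roughly normal, about 95% of them should fall within 1.96 standard deviations of the population mean, where the standard deviation of the sample mean is the population S.D. divided by the square root of n.

```python
import math
import random

rng = random.Random(30)

# Assumed skewed population: exponential with mean 1 and S.D. 1.
POP_MEAN, POP_SD = 1.0, 1.0
N = 50          # sample size, larger than 30
REPEATS = 5000  # number of repeated samples


def one_sample_mean():
    """Draw one sample of size N and return its mean."""
    return sum(rng.expovariate(1.0) for _ in range(N)) / N


sample_means = [one_sample_mean() for _ in range(REPEATS)]
mean_of_means = sum(sample_means) / REPEATS

# Spread of the sample means predicted by theory.
sd_of_means = POP_SD / math.sqrt(N)
share_close = sum(
    abs(m - POP_MEAN) < 1.96 * sd_of_means for m in sample_means
) / REPEATS

print(round(mean_of_means, 3))  # close to the population mean
print(round(share_close, 3))    # close to 0.95 if the means are roughly normal
```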
– “The Central Limit Theorem is the fundamental
sampling theorem. It is because of this theorem
(and variations thereof), and not because of
nature’s questionable tendency to normalcy, that
the normal distribution plays such a key role in our
work”
(Bradley & South)

Why….?
The standard error of the mean...
– When we are looking at the distribution of
the sample mean, the standard deviation of
this distribution is called the standard error
of the mean
• i.e. SE = standard deviation of the sampling distribution.
– but we don’t usually know this
• i.e. if we don’t know the population mean (i.e. the mean of all possible sample means), we are unlikely to know the standard error of sample means
– so what can we do?
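A standard textbook answer, not stated on this slide, is to estimate the standard error from a single sample as the sample S.D. divided by the square root of n. The sketch below assumes a uniform population on the interval 0 to 1 (whose S.D. is known to be 1/√12) so that the true standard error, the spread of many simulated sample means, and the single-sample estimate can be compared.

```python
import math
import random
import statistics

rng = random.Random(7)

N = 40                         # sample size (arbitrary choice)
POP_SD = 1.0 / math.sqrt(12)   # S.D. of the assumed uniform(0, 1) population


def draw_sample():
    return [rng.random() for _ in range(N)]


# True standard error: population S.D. over the square root of n.
true_se = POP_SD / math.sqrt(N)

# S.D. of the sampling distribution, approximated by many repeated samples.
sample_means = [sum(s) / N for s in (draw_sample() for _ in range(4000))]
simulated_se = statistics.stdev(sample_means)

# Estimate from one sample only, using the sample S.D. in place of POP_SD.
one_sample = draw_sample()
estimated_se = statistics.stdev(one_sample) / math.sqrt(N)

print(round(true_se, 4), round(simulated_se, 4), round(estimated_se, 4))
```

The single-sample estimate is noisier than the other two figures, but it needs neither the population S.D. nor repeated sampling.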
CLT: What about Proportions?


What proportion of 10 catchers were
female?
What happens if I repeat the
experiment?
– What would the distribution of sample
proportions look like?
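The experiment can be repeated many times in simulation. The sketch assumes, hypothetically, that each catcher is female with probability 0.5, and records the proportion of females among 10 catchers in each of 5,000 repetitions. Theory suggests the sample proportions centre on the true proportion p with a standard deviation of √(p(1 − p)/n).

```python
import math
import random
import statistics

rng = random.Random(10)

P_FEMALE = 0.5   # hypothetical true proportion of females
N = 10           # catchers per experiment
REPEATS = 5000   # number of times the experiment is repeated


def sample_proportion(n):
    """Proportion of females among n randomly chosen catchers."""
    return sum(rng.random() < P_FEMALE for _ in range(n)) / n


proportions = [sample_proportion(N) for _ in range(REPEATS)]
mean_proportion = sum(proportions) / REPEATS
sd_proportion = statistics.pstdev(proportions)
theoretical_se = math.sqrt(P_FEMALE * (1 - P_FEMALE) / N)

print(round(mean_proportion, 3))                           # close to P_FEMALE
print(round(sd_proportion, 3), round(theoretical_se, 3))   # close to each other
```

With only 10 catchers per experiment the sample proportions can take just eleven distinct values, so the histogram is lumpy; increasing `N` should make it look progressively more like a normal curve.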
Editing syntax files:
1. Start with an asterisk:
– Use *blah blah blah. to put headings in syntax
• anything after “ * ”, up to the closing full stop, is ignored by SPSS.
• Important way of keeping your syntax files in order
• e.g.
*Descriptive Statistics on Income.
*---------------------------------.
2. Forward slash and an asterisk:
– Use /*blah blah blah */ to comment on lines
• Anything between /* and */ is ignored by SPSS.
• e.g.
COMPUTE z = x + y.
/*Compute total income*/