Ms. Amal Jamil EL-Sayed
Modelling variations (D2)
This chapter is about modelling the variation observed in data, and it is
concerned primarily with one particular model for variation called the
normal distribution.
There are two main learning file themes for this chapter. The first involves
using the software package relevant to this block: the data analysis
software OUstats for MST 121, which from now on we shall call OUstats.
Consider how we might model the variation in men’s heights. The first step
is to obtain some data. Some searching turned up an old data set which
contains the heights of 1000 Cambridge men in 1902. These heights are a
sample of the heights of all Cambridge men in 1902, and will be used to
illustrate how the variation observed in men’s heights may be modeled
(page 9).
In the article the Cambridge men were taken to be representative of the
general population.
The following figure shows a frequency diagram for the heights of 1000
Cambridge men in 1902.
Figure 1.2 The heights of 1000 Cambridge men.
Activity 1.1 : Properties of the data.
Describe the shape of the frequency diagram in figure 1.2. Use the diagram
to estimate very roughly the mean height of the men. Over what range are
the heights of the men spread?
Answer:
1. The frequencies are low on the left side of the diagram, rise steadily to
reach a maximum for heights of 69 inches, and then decrease towards the
right.
2. The frequency diagram is roughly symmetrical about a single peak or
mode, it is unimodal.
3. The diagram could be described as approximately "bell-shaped". (So
the model has been represented by a curve).
4. Since the diagram is roughly symmetrical, the mean height is
approximately 69 inches (the central height).
5. The heights of the men range from approximately 61.5 inches to
approximately 77.5 inches.
Remark: Areas under the curve are used to represent proportions or
probabilities.
The normal distribution:
1) A bell-shaped curve would seem to be a good model for the variation in
the data. A model which has this shape is the normal distribution.
2) The equation of a typical normal curve is y = f(x), where
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right), \quad -\infty < x < \infty.$$
3) The function f is known as the probability density function of the
distribution.
4) The normal distribution is a continuous model.
5) μ gives the location of the centre of the curve.
σ governs the spread of the values that are most likely to occur (σ is the
standard deviation)
6) The normal distribution with mean equal to zero and standard deviation
equal to one is called standard normal distribution.
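Items 2) to 6) can be illustrated by evaluating the density directly. The following is a minimal Python sketch, not part of the course software; the function name `normal_pdf` is chosen here for illustration, and the values μ = 68.9 and σ = 2.57 are the ones used later in these notes for the Cambridge heights model.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# The curve is highest at the mean and symmetric about it.
peak = normal_pdf(68.9, mu=68.9, sigma=2.57)
left = normal_pdf(68.9 - 2.0, mu=68.9, sigma=2.57)
right = normal_pdf(68.9 + 2.0, mu=68.9, sigma=2.57)
print(peak, left, right)

# Standard normal distribution: mu = 0 and sigma = 1 (the defaults).
print(normal_pdf(0.0))
```

Changing `mu` slides the curve along the x-axis, while a larger `sigma` lowers the peak and spreads the curve out.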
Some properties of the normal curve:
1. Normal curves are bell-shaped and are symmetrical with respect to
a vertical line.
2. The mean is at the centre.
3. Irrespective of the shape, the area enclosed by the curve and the x-axis is
always equal to 1.
4. The probability that an outcome of a normally distributed experiment lies
between a and b equals the area under the associated normal curve from
x = a to x = b.
5. The standard deviation of a normal distribution plays a major role in
describing the area under the normal curve.
6. The curve never meets the x-axis.
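The total area between the normal curve and the x-axis from x = -N to x = N
is given by
$$\text{area} = \int_{-N}^{N} f(x)\,dx,$$
where
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2\right), \quad -\infty < x < \infty.$$
The area under the whole curve is therefore
$$\int_{-\infty}^{\infty} f(x)\,dx = \lim_{N\to\infty} \int_{-N}^{N} f(x)\,dx = 1.$$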
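The limit can be checked numerically. The sketch below approximates the area from -N to N for the standard normal curve with the trapezium rule; the helper names `normal_pdf` and `area_between` and the step count are assumptions made for this illustration.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def area_between(n_limit, steps=20000):
    """Trapezium-rule estimate of the area under the standard normal curve on [-N, N]."""
    h = 2.0 * n_limit / steps
    total = 0.5 * (normal_pdf(-n_limit) + normal_pdf(n_limit))
    for i in range(1, steps):
        total += normal_pdf(-n_limit + i * h)
    return total * h

# The area grows towards 1 as N increases.
for n_limit in (1, 2, 3, 5, 8):
    print(n_limit, area_between(n_limit))
```

The printed areas increase with N and settle very close to 1, although the curve itself never touches the x-axis.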
Activity 1.4 Interpreting the model
Figure 1.13 contains three sketches of the particular normal curve used to
model the heights of all Cambridge men in 1902. For each sketch, describe
in words what the shaded area represents.
Figure 1.13 Three areas under a normal curve used to model the heights of
Cambridge men in 1902.
Solution:
a) The shaded area represents the proportion of all Cambridge men in
1902 who were between 69 and 71 inches tall. Alternatively it
represents the probability that a man selected at random from all
Cambridge men would have been between 69 and 71 inches tall.
b) The shaded area represents the proportion of all Cambridge men in
1902 who were under 65 inches tall. Alternatively it represents the
probability that a man selected at random from all Cambridge men
would have been less than 65 inches tall.
c) The shaded area represents the proportion of all Cambridge men in
1902 who were over six feet tall. It represents the probability that
a man selected at random from all Cambridge men would have been
over six feet tall.
Section 2: Choosing a normal model.
1. Probability distributions are used to model the variation in a population.
2. The population parameters, namely the population mean and the population
standard deviation, are used to specify the normal model.
The sample mean.
Suppose that a sample of n observations x1, x2, ..., xn is taken from a
population. The sample mean is
$$\bar{x} = \frac{1}{n}\left(x_1 + x_2 + \cdots + x_n\right) = \frac{1}{n}\sum_{i=1}^{n} x_i.$$
Example: Find the mean of the heights of 1000 Cambridge men.
Heights of 1000 Cambridge men.
Height in inches    Frequency
62                  3
63                  20
64                  24
65                  40
66                  85
67                  122
68                  139
69                  179
70                  139
71                  107
72                  55
73                  47
74                  22
75                  12
76                  5
77                  1
Total               1000
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{k} x_i f_i = \frac{1}{n}\left(x_1 f_1 + \cdots + x_k f_k\right)$$
$$\bar{x} = \frac{1}{1000}\left(62 \times 3 + 63 \times 20 + \cdots + 77 \times 1\right) = 68.872$$
So the sample mean is approximately 68.9 inches.
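The frequency-table calculation can be reproduced in a few lines. This is a small Python sketch using the table of heights above; the variable names are chosen here for illustration.

```python
# Heights (inches) and their frequencies from the table of 1000 Cambridge men.
heights = list(range(62, 78))
frequencies = [3, 20, 24, 40, 85, 122, 139, 179, 139, 107, 55, 47, 22, 12, 5, 1]

n = sum(frequencies)  # sample size
weighted_sum = sum(x * f for x, f in zip(heights, frequencies))
sample_mean = weighted_sum / n

print(n, weighted_sum, sample_mean)  # 1000 68872 68.872
```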
Activity 2.1 Irish dipper nestlings
Table 2.2 Weights of Irish dipper nestlings
Weight in grams    Frequency
9-11               1
11-13              8
13-15              3
15-17              6
17-19              5
19-21              12
21-23              13
23-25              18
25-27              27
27-29              18
29-31              22
31-33              20
33-35              3
35-37              17
37-39              8
39-41              7
41-43              4
43-45              4
45-47              2
Total              198
What values would you use to calculate the mean weight of these nestlings?
Answer:
We use the midpoints of the intervals.
Weight in grams    Frequency    Midpoint
9-11               1            10
11-13              8            12
13-15              3            14
15-17              6            16
17-19              5            18
19-21              12           20
21-23              13           22
23-25              18           24
25-27              27           26
27-29              18           28
29-31              22           30
31-33              20           32
33-35              3            34
35-37              17           36
37-39              8            38
39-41              7            40
41-43              4            42
43-45              4            44
45-47              2            46
Total              198
So the sample mean is
$$\bar{x} = \frac{1}{198}\left(10 \times 1 + 12 \times 8 + \cdots + 46 \times 2\right) = \frac{5532}{198} \approx 27.94 \text{ grams.}$$
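For grouped data the same weighted sum is used, with each interval represented by its midpoint. The sketch below is a minimal Python illustration using the intervals and frequencies of Table 2.2 as tabulated above; the variable names are assumptions made for this example.

```python
# Class intervals 9-11, 11-13, ..., 45-47 (weights in grams) and their frequencies.
intervals = [(lo, lo + 2) for lo in range(9, 47, 2)]
frequencies = [1, 8, 3, 6, 5, 12, 13, 18, 27, 18, 22, 20, 3, 17, 8, 7, 4, 4, 2]

# Each interval is represented by its midpoint.
midpoints = [(lo + hi) / 2 for lo, hi in intervals]

n = sum(frequencies)
weighted_sum = sum(m * f for m, f in zip(midpoints, frequencies))
grouped_mean = weighted_sum / n

print(n, weighted_sum, grouped_mean)
```

Because the individual weights inside each interval are not known, the result is only an approximation to the mean of the raw data.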
The population mean.
The formula for the mean of a continuous distribution, which corresponds to
formula (2.1) (page 25) for the mean of a discrete distribution, is
$$\mu = \int_{-\infty}^{\infty} x f(x)\,dx,$$
where f is the probability density function of the distribution.
The standard deviation.

$$\text{Population variance} = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx$$

The population standard deviation is defined to be the square root of the
population variance.
The sample standard deviation.
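Consider the sample x1, x2, ..., xn of size n.
$$\delta^2 = \text{sample variance} = \frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2$$
$$\delta = \text{sample standard deviation} = \sqrt{\text{variance}} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2}$$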
Sample statistic                        Population parameter
x̄ (sample mean)                is used to estimate    μ
δ (sample standard deviation)  is used to estimate    σ
Exercise: The numbers of items of mail delivered on a Monday morning to
each of 5 houses chosen at random from those on a large estate were as
follows: 2 7 3 1 2
a) Find the sample mean and the sample standard deviation.
b) If there are 3000 houses on the estate, what is your estimate of the
total number of items of mail delivered to the estate on that morning?
Answer:
a) The sample mean is
$$\bar{x} = \frac{2 + 7 + 3 + 1 + 2}{5} = \frac{15}{5} = 3.$$
x          x - x̄          (x - x̄)²
2          2 - 3 = -1      1
7          7 - 3 = 4       16
3          3 - 3 = 0       0
1          1 - 3 = -2      4
2          2 - 3 = -1      1
Total 15                   22
n = 5, so n - 1 = 4.
The sample standard deviation is
$$\delta = \sqrt{\frac{22}{4}} \approx 2.3 \text{ (to 1 d.p.)}$$
b) The sample mean x̄ may be used to estimate the mean number of
items of mail delivered to each house on the estate: 3. So an estimate
of the total number of items of mail delivered to the estate on that
morning is 3 x 3000 = 9000.
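The steps of this answer can be followed in code. The sketch below is a minimal Python version of the calculation with the n - 1 divisor; the variable names are chosen here for illustration.

```python
import math

counts = [2, 7, 3, 1, 2]  # items of mail delivered to each sampled house
n = len(counts)

mean = sum(counts) / n
squared_deviations = [(x - mean) ** 2 for x in counts]

# The sample variance divides by n - 1, not n.
sample_variance = sum(squared_deviations) / (n - 1)
sample_sd = math.sqrt(sample_variance)

# Scale the sample mean up to the 3000 houses on the estate.
estimated_total = mean * 3000

print(mean, sum(squared_deviations), round(sample_sd, 1), estimated_total)
# 3.0 22.0 2.3 9000.0
```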
Example: The gross weekly earnings (in pounds) in 1995 of a sample of six
mechanical engineers were as follows:
310 635 464 520 381 732
Find the sample mean and the sample standard deviation.
Answer:
The sample mean is
$$\bar{x} = \frac{310 + 635 + 464 + 520 + 381 + 732}{6} = \frac{3042}{6} = 507.$$
x          x - x̄     (x - x̄)²
310        -197       38809
635        128        16384
464        -43        1849
520        13         169
381        -126       15876
732        225        50625
Total 3042            123712
n = 6, so n - 1 = 5.
The sample standard deviation is
$$\delta = \sqrt{\frac{123712}{5}} \approx 157.3 \text{ (to 1 d.p.)}$$
So the sample mean of the gross weekly earnings of the 6 engineers was £507,
and the sample standard deviation of gross weekly earnings was approximately £157.30.
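As a cross-check on the hand calculation, Python's standard `statistics` module can be used: `statistics.stdev` uses the n - 1 divisor, the same definition of the sample standard deviation as in these notes. This is only a quick check, not the course software.

```python
import statistics

earnings = [310, 635, 464, 520, 381, 732]  # gross weekly earnings in pounds

mean = statistics.mean(earnings)
sd = statistics.stdev(earnings)  # sample standard deviation (divides by n - 1)

print(mean, round(sd, 1))
```

Note that `statistics.pstdev` divides by n instead and would give a smaller value, so it is not the function to use here.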
Section 6
Suppose that a normal distribution is used to model the variation in a
population. Then according to the model:
o approximately 68.3% of the population are within 1 standard
deviation of the mean (that is, between μ – σ and μ + σ);
o approximately 95.4% of the population are within 2 standard
deviations of the mean (that is, between μ – 2σ and μ + 2σ);
o almost all the population – about 99.7% - are within 3 standard
deviations of the mean (that is, between μ – 3σ and μ + 3σ).
Note that these results hold true whatever the values of the mean μ and the
standard deviation σ of the distribution. The results are illustrated in
Figure 6.5.
Figure 6.5 Areas under normal curves
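These percentages do not depend on μ or σ because the area within k standard deviations of the mean can be written in terms of the error function as erf(k/√2). The sketch below is a short Python check of the three figures; the function name `prob_within` is chosen here for illustration.

```python
import math

def prob_within(k):
    """Area under any normal curve between mu - k*sigma and mu + k*sigma."""
    return math.erf(k / math.sqrt(2.0))

for k in (1, 2, 3):
    print(k, round(100 * prob_within(k), 1))
# 1 68.3
# 2 95.4
# 3 99.7
```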
Example 6.1
The normal distribution used to model the heights of Cambridge men in
1902 has mean μ = 68.9 and standard deviation σ = 2.57. According to this
model, within what range were the heights of almost all Cambridge men, that
is, about 99.7% of them? How does this compare with the sample of heights?
Solution:
Almost all Cambridge men (about 99.7%) were between μ - 3σ and μ + 3σ,
that is, between (68.9 - 3 x 2.57) and (68.9 + 3 x 2.57) inches tall, that is,
between 61.2 and 76.6 inches. So the model predicts that about three in a thousand
men will be either shorter than 61.2 inches or taller than 76.6 inches. From
the table (page 23), only one man's height was outside the range 61.2 inches
to 76.6 inches; no man was less than 61.5 inches tall.
So almost all the men in the sample were within 3 standard deviations of the
mean height. The normal model reflects quite well the proportion of men in
the sample who were unusually short or tall.
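The comparison with the sample can be repeated in code. The sketch below computes the range μ ± 3σ and counts the men in the frequency table whose recorded height falls outside it; it assumes each recorded height (to the nearest inch) can stand in for the man's actual height, which is a simplification.

```python
mu, sigma = 68.9, 2.57
lower, upper = mu - 3 * sigma, mu + 3 * sigma

# Frequency table of the 1000 Cambridge men (heights in inches).
heights = list(range(62, 78))
frequencies = [3, 20, 24, 40, 85, 122, 139, 179, 139, 107, 55, 47, 22, 12, 5, 1]

# Count recorded heights outside the 3-standard-deviation range.
outside = sum(f for x, f in zip(heights, frequencies) if x < lower or x > upper)

print(round(lower, 1), round(upper, 1), outside)  # 61.2 76.6 1
```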
Do activity 6.1 page 37.
Suppose that a normal distribution is used to model the variation in a
population. Then according to the model:
o approximately 90% of the population are within 1.64 standard
deviations of the mean (that is, between μ – 1.64σ and μ + 1.64σ);
o approximately 95% of the population are within 1.96 standard
deviations of the mean (that is, between μ – 1.96σ and μ + 1.96σ);
o almost all the population – about 99% - are within 2.58 standard
deviations of the mean (that is, between μ – 2.58σ and μ + 2.58σ).
Note that these results hold true whatever the values of the mean μ and the
standard deviation σ. The results are illustrated in Figure 6.6.
Figure 6.6 Areas under normal curves
Example 6.2
According to the normal distribution used to model the variation in the
heights of Cambridge men in 1902, within what ranges were the heights of
approximately 90%, 95% and 99% of Cambridge men?
Solution: μ = 68.9
σ = 2.57
Approximately 90% of Cambridge men were between 68.9-(1.64)(2.57) ≈ 64.7
and 68.9+(1.64)(2.57) ≈ 73.1 inches tall.
Approximately 95% of Cambridge men were between 68.9-(1.96)(2.57) ≈ 63.9
and 68.9+(1.96)(2.57) ≈ 73.9 inches tall.
Approximately 99% of Cambridge men were between 68.9-(2.58)(2.57) ≈ 62.3
and 68.9+(2.58)(2.57) ≈ 75.5 inches tall.
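The multipliers 1.64, 1.96 and 2.58 are quantiles of the standard normal distribution. The sketch below recovers them with `statistics.NormalDist` from the Python standard library (Python 3.8 or later) and recomputes the three ranges; it works with the unrounded multipliers, so intermediate values may differ slightly from the hand calculation before rounding.

```python
from statistics import NormalDist

mu, sigma = 68.9, 2.57
standard_normal = NormalDist()  # mean 0, standard deviation 1

results = []
for coverage in (0.90, 0.95, 0.99):
    # A central area of `coverage` leaves (1 - coverage) / 2 in each tail.
    z = standard_normal.inv_cdf(0.5 + coverage / 2)
    low, high = mu - z * sigma, mu + z * sigma
    results.append((coverage, round(z, 2), round(low, 1), round(high, 1)))

for row in results:
    print(row)
# (0.9, 1.64, 64.7, 73.1)
# (0.95, 1.96, 63.9, 73.9)
# (0.99, 2.58, 62.3, 75.5)
```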
Remark
If X is a continuous random variable, then a function y = f(x) is called a
(probability) density function for X if and only if it has the following
properties:
1. $f(x) \ge 0$ for all x.
2. $\int_{-\infty}^{\infty} f(x)\,dx = 1.$
3. $P(a \le X \le b) = \int_a^b f(x)\,dx.$
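These properties can be checked numerically for a candidate density. The sketch below does so for the function that equals 1 on [0, 1] and 0 elsewhere, the function considered in the first practice problem; the midpoint-rule helper `integrate`, the integration range [-1, 2] and the chosen values a = 0.2, b = 0.7 are assumptions made for this illustration.

```python
def uniform_density(x):
    """f(x) = 1 for 0 <= x <= 1, and 0 otherwise."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

def integrate(f, a, b, steps=3000):
    """Midpoint-rule approximation to the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

# Property 1: non-negative at a selection of points from -1.0 to 1.9.
non_negative = all(uniform_density(k / 10) >= 0 for k in range(-10, 20))

# Property 2: total area is 1 (f vanishes outside [0, 1], so [-1, 2] suffices).
total_area = integrate(uniform_density, -1.0, 2.0)

# Property 3: P(0.2 <= X <= 0.7) is the area between 0.2 and 0.7.
prob = integrate(uniform_density, 0.2, 0.7)

print(non_negative, total_area, prob)
```

A numerical check like this does not replace the algebraic verification, but it is a quick way to catch a function whose total area is not 1.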
More practice:
1) Show that the function
f(x) = 1, if 0 ≤ x ≤ 1
     = 0, otherwise
is a density function for X.
Solution:
We must verify that f(x) satisfies the three conditions in the definition of the
density function. First, f(x) is either 0 or 1, so f(x) ≥ 0.
Next,
$$\int_{-\infty}^{\infty} f(x)\,dx = \int_0^1 1\,dx = \left[x\right]_0^1 = 1.$$
Finally, to verify that $P(a \le X \le b) = \int_a^b f(x)\,dx$, we compute the area under
the graph between x = a and x = b. For 0 ≤ a ≤ b ≤ 1 we have
$$\int_a^b f(x)\,dx = \int_a^b 1\,dx = \left[x\right]_a^b = b - a,$$
which, as stated in the above remark, is the probability P(a ≤ X ≤ b).
2) Consider the exponential density function $f(x) = e^{-x}$ for x ≥ 0, and
f(x) = 0 for x < 0.
a. Find P(2 < X < 3).
b. Find P(X > 4).
Solution:
a. $$P(2 < X < 3) = \int_2^3 e^{-x}\,dx = \left[-e^{-x}\right]_2^3 = -e^{-3} - \left(-e^{-2}\right) = e^{-2} - e^{-3} \approx 0.086.$$
b. $$P(X > 4) = \int_4^{\infty} e^{-x}\,dx = \lim_{r\to\infty} \int_4^r e^{-x}\,dx = \lim_{r\to\infty} \left[-e^{-x}\right]_4^r = \lim_{r\to\infty} \left(-e^{-r} + e^{-4}\right)$$
$$= \lim_{r\to\infty} \left(-\frac{1}{e^{r}} + e^{-4}\right) = 0 + e^{-4} \approx 0.018.$$
Another solution for part b:
$$P(X > 4) = 1 - P(X \le 4) = 1 - \int_0^4 e^{-x}\,dx \approx 0.018.$$
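Both answers can be confirmed with a few lines of Python. The sketch below uses the fact that integrating the density from 0 to x gives P(X ≤ x) = 1 - e^{-x} for x ≥ 0; the function name `exp_cdf` is chosen here for illustration.

```python
import math

def exp_cdf(x):
    """P(X <= x) for the density e^(-x) on x >= 0 (and 0 for x < 0)."""
    return 1.0 - math.exp(-x) if x >= 0 else 0.0

p_a = exp_cdf(3) - exp_cdf(2)  # P(2 < X < 3) = e^-2 - e^-3
p_b = 1.0 - exp_cdf(4)         # P(X > 4) = e^-4

print(round(p_a, 3), round(p_b, 3))  # 0.086 0.018
```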
3) Consider the function $f(x) = 1 - x^2$ for x in [-1, 1].
a) Show that f(x) is non-negative in [-1, 1]. Find the area A under the
function f(x) in the interval [-1, 1].
b) Consider the function $g(x) = \frac{3}{4} f(x)$ in [-1, 1].
Show that g(x) is a probability density function in [-1, 1].
c) Consider the continuous random variable x, with probability density
function g(x). Find P(x = 0), P(x = 1/2), P(x in [0, 1/2]) and P(x in [0, 1]).
d) Show that the mean μ of the continuous random variable x, with
probability density function g(x), is equal to zero.
Solution:
a) f(-1) = f(1) = 0. For x in (-1, 1), $x^2 < 1$ and hence $x^2 - 1 < 0$, so
$1 - x^2 > 0$. Therefore f is non-negative in [-1, 1].
The area under the curve is
$$A = \int_{-1}^{1} \left(1 - x^2\right)dx = \left[x - \frac{x^3}{3}\right]_{-1}^{1} = \frac{4}{3}.$$
b) Since f is non-negative in [-1, 1] and 0 ≤ f(x) ≤ 1, g(x) is
non-negative in [-1, 1] and 0 ≤ g(x) ≤ 3/4.
Furthermore, the area under the function is
$$\text{Area} = \int_{-1}^{1} \frac{3}{4} f(x)\,dx = \frac{3}{4}\int_{-1}^{1} f(x)\,dx = \frac{3}{4}\cdot\frac{4}{3} = 1.$$
c) P(x = 0) = 0
P(x = 1/2) = 0
P(x in [0, 1/2]) = the area under g(x) in [0, 1/2]
$$= \int_0^{1/2} \frac{3}{4} f(x)\,dx = \frac{3}{4}\left(\frac{1}{2} - \frac{1}{24}\right) = \frac{3}{4}\cdot\frac{11}{24} = \frac{11}{32}.$$
$$P(x \in [0, 1]) = \frac{3}{4}\int_0^1 f(x)\,dx = \frac{3}{4}\int_0^1 \left(1 - x^2\right)dx = \frac{3}{4}\left[x - \frac{x^3}{3}\right]_0^1 = \frac{3}{4}\left(1 - \frac{1}{3}\right) = \frac{3}{4}\cdot\frac{2}{3} = \frac{1}{2}.$$
d) $$\mu = \int_{-1}^{1} x\,g(x)\,dx = \frac{3}{4}\int_{-1}^{1} x\left(1 - x^2\right)dx = \frac{3}{4}\left[\frac{x^2}{2} - \frac{x^4}{4}\right]_{-1}^{1} = 0.$$
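The fractions in this solution can be verified exactly with Python's `fractions` module, using the antiderivatives found above. The function names `G` and `M` are hypothetical helpers introduced for this check.

```python
from fractions import Fraction

def G(x):
    """Antiderivative of g(x) = (3/4)(1 - x^2), namely (3/4)(x - x^3/3)."""
    x = Fraction(x)
    return Fraction(3, 4) * (x - x**3 / 3)

def M(x):
    """Antiderivative of x*g(x), namely (3/4)(x^2/2 - x^4/4)."""
    x = Fraction(x)
    return Fraction(3, 4) * (x**2 / 2 - x**4 / 4)

total_area = G(1) - G(-1)           # part b: area under g on [-1, 1]
p_half = G(Fraction(1, 2)) - G(0)   # part c: P(x in [0, 1/2])
p_one = G(1) - G(0)                 # part c: P(x in [0, 1])
mean = M(1) - M(-1)                 # part d: mean of x

print(total_area, p_half, p_one, mean)  # 1 11/32 1/2 0
```

Working with exact fractions avoids rounding error, so the printed values can be compared directly with the hand calculation.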