Download INTRODUCTION - Department of Mathematics and Statistics

Document related concepts

Probability wikipedia , lookup

History of statistics wikipedia , lookup

Statistics wikipedia , lookup

Transcript
INTRODUCTION
STATISTICS deals with the
-
collection
-
description
-
interpretation
of DATA.
Data may arise naturally as observations of “things as they are”, or as a result of a
controlled designed experiment.
The collection of data must be done carefully and according to certain rules. We will
explore one of these methods, called SIMPLE RANDOM SAMPLING in the next
few pages. Beyond the technical details of collecting data in a specific discipline, the
methods used will determine the possible uses to which the data can be meaningfully
applied. The important thing to keep in mind is that data collected incorrectly may
invalidate any conclusion we may draw from it.
Once collected, it is often useful to describe the data through displays or summaries
that bring attention to certain characteristics which may be of interest to us. The
techniques used for such descriptions will involve us in the area of DESCRIPTIVE
STATISTICS.
The description of data is not usually an end in itself. We may be interested in what
the data can tell us about the group from which it was drawn, but which remains
mostly unsampled. The methods employed in dealing with such generalizations are
known under the label of INFERENTIAL STATISTICS.
In dealing with the questions raised by Inferential Statistics we are led to define two
fundamental concepts, SAMPLE and POPULATION.
(1)
POPULATIONS AND SAMPLES
POPULATION: the set of all measurements or objects of interest in a particular
study.
QUESTION: If the population is the thing that interests an investigator, why
doesn’t she just simply examine each member of the population.
ANSWER:
Thus to obtain information about the population the investigator must use a
__________.
SAMPLE: a subset of the population.
In a study, an investigator should use a SIMPLE RANDOM SAMPLE (srs)
SIMPLE RANDOM SAMPLE (srs): a sample chosen in such a way that each
member of the population has an equal chance of being chosen.
REASON for using a srs:
We can now refine our definition of “inferential statistics” and say that
INFERENTIAL STATISTICS deals with ways of using the sample to draw
conclusions about the population (from which it was drawn).
(2)
EXAMPLE (POPULATION AND SAMPLE) : A manufacturer of light bulbs
wishes to estimate the mean (average) lifetime of 40 watt bulbs his company
produces. He obtains a srs of the lifetime of four such bulbs. 300(hrs), 450, 500,
400.
What is the population?
What is the sample?
Estimate the mean lifetime of the 40 watt bulbs produced by this company.
Would you expect your estimate to be accurate? Explain.
EXAMPLE (POPULATION AND SAMPLE): A candidate for political office
wishes to assess his chances in the next election by estimating the proportion of
voters in his district who will vote for him. He obtained a srs of 1000 voters and
found that 600 voters in the sample intended to vote for him and 400 against
him?
What is the population?
What is the sample?
Estimate the proportion of voters in the population that intend to vote for this
candidate. Would you expect your estimate to be accurate? Explain.
(3)
TYPES OF DATA
In statistics one works with several kinds of data.
NUMERICAL or QUANTITATIVE DATA is data measured on a numerical scale.
It is used to tell us how much or how many. There are two types of numerical data.
CONTINUOUS DATA is data such as length or weight, which in theory, can be
measured to any degree of accuracy.
DISCRETE DATA is data whose possible values are isolated points on the
number line. For example, data such as the number of children in families or
number of fish caught by fisherman is discrete data.
CATEGORICAL or QUALITATIVE DATA is data that records qualities of
persons or objects, such as names, gender, or student numbers.
Categorical data, which has a natural ordering is called ORDINAL DATA. For
example if each student in a statistics class is assigned one of the grades A, B, C, D
or F, the resulting data is ordinal data.
EXAMPLE: State what type of data would be recorded in each of the following
cases.
(a) In a sample of 1000 Canadian voters each is classified as being Liberal,
Reform, NDP, or other.
_______________________________and________________________
(b) A biologist measures the height of tomato plants.
__________________________and______________________________
(c) In the automobile industry, quality control classifies each paint job as
excellent, good, fair or poor.
_______________________________and____________________________
(4)
EXPLORATORY DATA ANALYSIS
Frequency and Relative Frequency Distributions
A set of raw data gives us very little information as to how the observations are
distributed. For example, consider the following sample of grades from a large
mathematics class (the grades have been ordered from lowest to highest for
convenience).
55
65
69
73
77
83
55
66
69
73
77
83
55
66
69
73
78
83
56 56 57 57 57
66 67 67 67 67
70 70 70 71 71
74 74 74 74 76
78 79 79 79 80
84 84 85 85 87
58
67
72
76
80
87
59
67
72
76
80
88
59
68
72
76
80
88
60
68
72
76
81
89
60
68
73
76
82
92
60
69
73
76
82
92
61
69
73
77
82
94
65
69
73
77
83
What is the distribution of these grades? Are there more B’s than C’s? Do the
grades clump around a certain letter grade?
One method of describing a sample is to divide it into classes or categories and to
find the number of sample values in each class (the CLASS FREQUENCY) and/or
the proportion of sample values in each class (the CLASS RELATIVE
FREQUENCY).
Divide the sample of grades into five classes, say , [50, 60), [60, 70), [70, 80),
[80, 90), and [90, 100), and find the frequency and relative frequency for each class.
Frequency and Relative Frequency Distribution
Class Interval
[50, 60)
Class Midpoint
(mi)
55
Class Frequency
(fi)
11
Class Relative
Frequency (fi/n)
11/95 = .1158
[60, 70)
[70, 80)
[80, 90)
[90, 100)
n=
95/95 = 1.000
Notes: 1. For a class [a, b). ‘a’ is called the LOWER CLASS BOUNDARY (LCB)
,‘b’ is called the UPPER CLASS BOUNDARY (UCB).
For example, for the second class [60, 70): LCB =________, UCB=____________
(5)
2. The CLASS MID-POINT is the average of the lower and upper class boundaries.
For example, for the second class [60, 70),
m2 = (LCB + UCB)/2 =
The mid – point represents a typical value in the class.
3. The CLASS Width is CW = UCB – LCB. So for the second class
CW = UCB – LCB =
Frequency and Relative Frequency Histograms
A FREQUENCY (RELATIVE FREQUENCY) HISTOGRAM of a sample is simply
a picture of the frequency ( relative frequency) distribution of that sample.
Example (FREQUENCY HISTOGRAM): To draw a frequency histogram of our
sample of 95 grades from a large mathematics class we proceed as follows:
(a) Form an xy coordinate system with the class boundaries on the horizontal
axis and the frequencies marked on the vertical axis.
(b) Above each class draw a rectangle, whose height is equal to the class
frequency.
(6)
To draw a relative frequency histogram, mark the relative frequencies on the
vertical axis and above each class draw a rectangle whose height is equal to the class
relative frequency.
Notes: 1. The class (relative) frequencies are proportional to the AREAS of the
rectangles above the class. For example the area of the rectangle above class 2 is
roughly twice the area of the rectangle above class 1. Thus the class (relative)
frequency of class 2 is roughly twice that of class 1.
2. The class width determines the number of classes. This choice will affect the
“look” of the histogram. For example, what would the frequency histogram of the
class grades look like if we used one class [50, 100).
(7)
The Histogram on Minitab
(a) MTB > hist c1 [Draws a histogram of the sample in c1; Minitab chooses the
classes]
(b) MTB > hist c1;
MTB > start m;
MTB > incr w.
[ Draws a histogram of the sample in c1 with starting mid-point ‘m’ and class width
(CW) ‘w’]
Note: The Minitab Output gives the class mid-points mi and the class frequencies
(counts) fi . By hand we obtain the class width.
CW= the distance between successive mid-points.
The lower and upper boundaries of the ith class are given by
[LCB, UCB) = [mi - .5CW, mi + .5 CW)
Example ( rem.mtw): A part of sleep called REM sleep is associated with dreaming.
The data below gives the fraction of time spent in REM sleep for a random sample
of 40 adults.
0.23 0.18 0.24 0.24 0.27 0.13 0.15 0.35 0.18 0.18
0.17 0.14 0.19 0.21 0.20 0.14 0.21 0.18 0.18 0.28
0.23 0.24 0.17 0.24 0.17 0.17 0.12 0.23 0.18 0.28
0.10 0.20 0.19 0.20 0.31 0.26 0.23 0.17 0.18 0.17
(8)
A. MTB > hist c1
Histogram of %REM N = 40
Midpoint
Count
0.10
1 *
0.12
1 *
0.14
3 ***
0.16
1 *
0.18
13 *************
0.20
5 *****
0.22
2 **
0.24
8 ********
0.26
1 *
0.28
3 ***
0.30
0
0.32
1 *
0.34
0
0.36
1 *
(9)
B. MTB > hist c1;
SUBC> start .08;
SUBC> incr .04.
Histogram
Histogram of %REM N = 40
Midpoint
0.0800
0.1200
0.1600
0.2000
0.2400
0.2800
0.3200
0.3600
Count
0
3
9
14
8
4
1
1
***
*********
**************
********
****
*
*
(10)
QUESTIONS:
1. With reference to histogram A answer the following:
(a) What is the class width?
(b) What are the boundaries of the class width?
(c) What percentage of the subjects in the sample spent 23% or more of their sleep
in REM sleep?
2. With reference to histogram B answer the following:
(a) Find the boundaries of the class containing the greatest number of sample
values?
(b) How many sample values fall in the last four classes?
(c) What is the shape of this histogram?
3. Suppose you wanted to use Minitab to form a histogram with the lower
boundary of the first class equal to 1 and a class width equal to 6. What Minitab
Commands would you use?
(11)
Ideally we want a histogram that gives us as much information as possible. If we
have too few classes we lose too much individual information; if we have too many
we lose overall information.
MTB > hist c1;
SUBC> start .1;
SUBC> incr .18.
Histogram of %REM
Midpoint
0.100
0.280
N = 40
Count
19
21
*******************
*********************
(12)
MTB > hist c1;
SUBC> start .1;
SUBC> incr .005.
Histogram of %REM
N = 40
Midpoint
Count
0.10000
1 *
0.10500
0
0.11000
0
0.11500
0
0.12000
1 *
0.12500
0
0.13000
1 *
0.13500
0
0.14000
2 **
0.14500
0
0.15000
1 *
0.15500
0
0.16000
0
0.16500
0
0.17000
6 ******
0.17500
0
0.18000
7 *******
0.18500
0
0.19000
2 **
0.19500
0
0.20000
3 ***
0.20500
0
0.21000
2 **
0.21500
0
0.22000
0
0.22500
0
0.23000
4 ****
0.23500
0
0.24000
4 ****
0.24500
0
0.25000
0
0.25500
0
0.26000
1 *
0.26500
0
0.27000
1 *
0.27500
0
0.28000
2 **
0.28500
0
0.29000
0
0.29500
0
0.30000
0
0.30500
0
0.31000
1 *
0.31500
0
0.32000
0
0.32500
0
0.33000
0
0.33500
0
0.34000
0
0.34500
0
0.35000
1 *
(13)
Histograms for Discrete and Categorical Data
For data which is discrete, individual data values may, in some instances be used as
classes.
Example: The table below gives the number of children in a random sample of 40
Canadian families.
Number of
Children
Number of
Families
0
1
2
3
4
8
12
15
4
1
(14)
For a categorical data, the categories themselves may be used as classes.
Exercise: The table below gives the final grades to a class of 100 students in an
elementary statistics course:
Grades
Number of
students
A
15
B
25
C
30
(a) What type of data is presented here?
(b) Draw a histogram of this data?
(15)
D
20
F
10
Some Typical Sample Shapes
(1) A LEFT SKEWED sample is one whose histogram ( or stem and leaf plot) has a
long left tail; the sample values tend to cluster at the right end of the scale and
taper off at the lower end.
(2) A RIGHT SKEWED sample is one whose histogram (or stem and leaf plot) has a
long right tail; the sample values tend to cluster at the left end of the scale and
taper off at the higher end.
(3) A SYMMETRIC sample is one whose histogram ( or stem and leaf plot) is
distributed approximately the same on each side of some central value.
(4) A particular type of symmetric sample is a BELL- SHAPED; all bell shaped
samples are symmetric, but not all symmetric samples are bell shaped.
(16)
The Stem and Leaf Plot
Although a histogram tells us how many observations fall in a particular class, we
lose information about the different values within a class. For example, in the
histogram B of the REM sleep data the class [.26, .30) includes exactly 4 sample
values; but unless we go back to the original data, we would not know that these
values are .26, .27, .28 and .28. Note that the mid-point for this class is .28 and yet all
four sample values in the class are at or below this mid-point.
A different method for displaying data which overcomes some of these difficulties,
and which is easy to do, even by hand, is the Stem and Leaf Plot.
Recall that the position of an individual digit in a number tells us the value the digit
represents. For example, consider the number 962.78. For this number 9 is 100’s
digit, 6 is 10’s digit, 2 is 1’s digit, 7 is .1’s digit and 8 is .01’s digit.
The stem and leaf plot uses selected digits ( called stems) to group the sample into
classes. The individual observations are then represented by the stem and the next
significant digit. This is called the “truncation” method.
EXAMPLE: Consider the following sample data of English Scholastic Aptitude Test
(SAT) scores.
638 574 627 621 705 690 522 612 594 581 640 653 638 760 491
The data here ranges from 491 to 760. Thus it is reasonable to group the sample
using the 100’s digit.
The number “638” is plotted as 63, (8 is truncated).
“6” is called a stem. The stem unit is SU=100 since the stem digit is the 100’s digit.
“3” is called a Leaf. The leaf unit is LU=10, since the leaf digit is 10’s digit.
“63” represents 63 leaf units; thus 63=63(LU)=63(10)=630. Thus although a value
read from the plot contains more information about the observation than is
available from a histogram, it may only be an approximation to the actual sample
value because of truncation.
[ 63 = 630 approximates the actual value of 638].
(17)
Notes: 1. Stem unit = 10 (Leaf unit) i.e. SU=10(LU).
2. The leaf always consists of a single digit [anything more is truncated].
To illustrate these ideas, we plot the sample as follows:
638 574 627 621 705 690 522 612 594 581 640 653 638 760 491
Initial Plot:
Stems
Leaves
Final Plot:
Stems
Leaves
LU= 10, SU = 100.
This plot is called one leaf category per stem plot (LCPS). The number of LCPS
merely gives the number of lines for which the stem value is the same.
Two leaf category per stem plot:
Stems
Leaves
LU= 10, SU=100.
(18)
Five Leaf Category per stem plot:
STEMS
LEAVES
LU=10, SU=100.
Note: The increment of a stem and leaf plot is the distance from one line to the next.
INCR= SU/#LCPS.
Notice that the stem together with the increment represent classes. In a one leaf
category per stem plot the increment is 100. Therefore, this plot represent the
following 4 classes.
[400,500), [500,600), [600,700), [700, 800)
While a histogram gives the number of sample values in each class, the stem and leaf
display plots the (truncated) sample values belonging to each class.
(19)
The Stem and Leaf Plot on Minitab
MTB> stem c1 [Draws a stem and leaf plot of the sample in c1]
Note: The Minitab output gives only the leaf unit(LU). The stem unit (SU) can be
obtained by the rule:
SU=10*LU
Example: Consider the sample data of English Scholastic Aptitude Test (SAT)
scores
638 574 627 621 705 690 522 612 594 581 640 653 638 760 491.
MTB> stem c1
Stem-and-leaf of C1
Leaf Unit = 10
1
2
5
(6)
4
2
1
4
5
5
6
6
7
7
N=15
9
2
789
122334
59
0
6
Leaf Unit: LU=10
Stem Unit: SU=10*LU=100
#LCPS=2
INCREMENT=SU/#LCPS = 100/2 = 50.
Values are easily read from the plot. For example the smallest sample value is
49 = 49(LU) = 49*10 = 490
Note: 1. Minitab also displays a depth column. The numbers in this column count
cumulatively the number of leaves from the bottom and top lines as long as the
count is  n/2. If there is a row remaining ( the middle row) the number of leaves in
this row is given in parentheses [ In the example above (6)].
2. The scale ( i.e. the stem unit and number of leaf categories per stem ) can be
controlled using the subcommand INCRement. This is illustrated below.
(20)
To obtain a stem and leaf plot with stem unit SU and 1,2 or 5 leaf categories per
stem use the STEM command with the subcommand INCRement where
INCR= SU/1 for a plot with stem unit SU and 1 LCPS
= SU/2 for a plot with stem unit SU and 2 LCPS
=SU/5 for a plot with stem unit SU and 5 LCPS
Example: Consider the English SAT scores of the previous example. Suppose we
want a stem and leaf plot with SU=100.
(a) MTB> stem c1;
SUBC>incr 100.
Stem-and-leaf of C1 N=15
Leaf Unit = 10
1
5
(8)
2
4 9
5 2789
6 12233459
7 06
(b) MTB> stem c1;
SUBC> incr 50.
Stem-and-leaf of C1 N=15
Leaf Unit =10
1
2
5
(6)
4
2
1
4 9
5 2
5 789
6 122334
6 59
7 0
7 6
(c) MTB> stem c1;
SUBC> incr 20.
Stem-and-leaf of C1 N=15
Leaf Unit= 10
1
4 9
1
5
2
5 2
2 5
3 5 7
5 89
5 6 1
(4) 6 2233
5 6 45
3
6
3 6 9
2
7 0
1
7
1
7
1 7 6
(21)
Example: weights ( in ounces) of 30 major league baseballs are
5.08 5.21 5.26 5.26 5.04 5.21 5.28 5.04 5.17 5.12
5.30 5.07 5.06 5.17 5.24 5.13 5.26 5.16 5.09 5.22
5.06 5.09 5.24 5.22 5.22 5.11 5.23 5.06 5.27 5.13
Construct a stem and leaf plot with stem unit .10 and with two leaf categories per
stem.
MTB> stem c1;
SUBC> incr .05.
Stem-and-leaf of C1 N=30
Leaf Unit =0.010
1
9
13
(3)
14
6
1
50
50
51
51
52
52
53
44
6667899
1233
677
11222344
66678
0
QUESTIONS:
1(a) What value does “52  3” on the plot above represent?
(b) Find the sample range R.
.
2. Suppose you wish to use Minitab to draw a stem and leaf plot. What increment
would you use to obtain a plot with the following specifications?
(a) Leaf Unit = 1 and one leaf category per stem.
SU=
INCR=
(b) Stem Unit = 10 and 5 leaf categories per stem.
INCR=
(22)
MEASURES OF THE LOCATION or CENTRE OF A SAMPLE
Sample Mean: Let a sample has n observations X1 , X2 , . . . , Xn .
The sample mean is denoted by the symbolX and is defined as follows:
 is the upper case Greek letter “sigma”. In statistics it means to sum. Thus Xi
means sum of the sample values Xi .
(23)
Example: Stop watches are tested for reliability by counting the number of cycles
(on-off-restart) till some part of the mechanism fails. The data below gives the
failure time ( in thousands of cycles) for a random sample of a certain type.
22,
2,
12,
18,
16
X = Xi / n = 70/5 = 14
__________________________________________________________
0
5
10
15
20
25
Sample Median : The value above and below which approximately 50% of the
sample falls is called the median. It is denoted by the symbol m.
Steps in the calculation of the sample median
(1) Order the sample values from smallest to largest.
(2) Calculate the approximate position of m: AP(m)=(50/100)n, where n is the
sample size.
(3) From (2) calculate the exact position of m as follows:
(i)
If AP(m) is not a whole number:
Pos(m) = AP(m) rounded up to the next whole number.
(ii)
If AP(m) is a whole number:
Pos(m)= AP(m) + .5
(4) RESULT: m is the value at Pos(m).
If Pos(m) is not a whole number, there is no observation in that position. In that
case m is the average of the values in the two positions on either side of Pos(m).
(24)
Example: Consider the stop watch data of the previous example.
For this sample,
(1) Ordered Sample: 2 12 16 18 22
(2) AP(m)= (.50)n = (.50)5 = 2.5
(3) Pos(m) = 3
(4) m=16
Interpretation: Approximately 50% of the failure times in the sample fall below
m=16.
Example: Incomes ( in $1000) of a sample of 8 residents of a hypothetical town
are:
29
16
17
19
964
26
10
29
= Xi / n = 1110/8 = 138.75
Now we calculate sample median for this sample:
(1) Ordered Sample: 10
16
17
19
26
29
29
964
(2) AP(m) = (.50)n =
(3) Pos(m) =
(4) m =
Interpretation: Approximately 50% incomes are below 22.5 thousands in the
sample.
Note: Data values like “964” in the above example are called OUTLIERS. These are
extreme data values which are either
(a) infrequently occurring members of the population, or
(b) sample values that do not belong to the population such as measurement errors,
recording errors, or measurements from the wrong population.
In case (a) outliers should not be removed from the sample. In case (b) they should
be removed.
(25)
Questions: 1. How sensitive to outliers is
(a) The sample meanX ?
(c) The sample median m?
2. In general which is a better measure of the centre ( or location) of a sample when
the sample is
(a) Skewed or has outliers ______________________
(b) symmetric __________________________
3. What does each of (a)-(c) below indicate about the shape of the sample? (left
skewed, right skewed or symmetric):
(a)
X is well above m? ________________
(b)
X is well below m?___________________
(c)
X is close to m ?___________________
(26)
The Sample Mean and the Sample Median on Minitab
Consider the data from our previous example: Incomes ( in $1000) of a sample of 8
residents of a hypothetical town.
C1: 10
16
17
19
26
29
29
964
MTB> mean C1
Mean = 138.75
MTB> medi C1
Median = 22.5
MTB> desc C1
C1
N MEAN MEDIAN
8
139
22
MIN
MAX
Q1
10
964
16
TRMEAN
139
Q3
29
STDEV
334
Note: To see how the outlier “964” affects the sample mean
sample without the outlier.
C2: 10
16
17
19
26
29
SEMEAN
118
, consider the
29
MTB> mean c2
Mean = 20.857
MTB>medi c2
Median = 19.00
Notice that the removal of the outlier has changed the sample mean from 138.75 to
20.857 while the sample median has changed little [from 22.5 to 19]
(27)
MEASURES OF VARIATION (SPREAD) OF A SAMPLE
VARIATION: a measure of how far the sample values are from their central value.
There are many such measures. One convenient way is as follows:
Total Variation:
SSTO=  (
)2.
The average value of the total variation for a sample is known as the
Sample Variance:
s2 =
SSTO
n 1
Note that in calculating the average here we divide by n-1 rather than n. It
turns out that a sample variance defined this way has better statistical properties.
Since the total variation and the sample variance are calculated as squares of the
observations, their unit of measurement is the square of the unit of measurement of
the data ( for example if the data is measured in centimeters (cm) then SSTO and s2
are measured in centimeters2 (cm2)).
To obtain a measure of variation which has the same units of measurement as the
data we take the square root of s2 to get the
Sample Standard Deviation: s =  s2
(28)
Example: The stopwatch data from a previous example is given below:
22
We have calculated
_______
SSTO =
sample mean.
2
12
18
16
X =14. We now calculate SSTO, s2 and s.
______________
_____________
, Total squared deviations of the sample values from the
s2 = SSTO/n-1 =
=
, roughly the average distance2 of
the sample values from the sample mean.
s=
=
Some sort of “average distance” of the sample values from
the sample mean.
(29)
Computational Formula for Total variation
The computation of total variation can actually be done without calculating
An alternative formula is
SSTO =  xi2 - (xi)2/n
For the stopwatch data
__________
_______________
SSTO =
s2= SSTO/n-1
=
s= 
Example: Below is a srs of 36 grades of students in a mathematics course
57 83 75 79 60 75 60 51
57 47 89 50 62 54 96 66
60 55 75 78 59 62 57 77
For this sample n=36, x=2352,
56 65 52 78
73 62 68 64
64 68 57 61
 x2 = 158,210
Sample mean =
=
SSTO =
=
=
s2= SSTO/n-1 =
;s=
(30)
SAMPLE PERCENTILES
The median splits the sample evenly into two halves. We can define measures that
divide the sample into parts of different size.
The rth Percentile of a sample is a value Pr such that (approximately) r% of the
sample falls below Pr.
Example: If a students’s score on a university entrance exam is the 84th percentile,
this means that approximately 84% of all the scores in the exam are below this
student’s score.
STEPS IN THE CALCULATION OF Pr FOR A SAMPLE OF SIZE n
(1) Order the sample values from smallest to largest.
(2) Calculate the approximate position of Pr: AP(Pr)=(r/100)n
(3) From (2) calculate the exact position of Pr as follows:
If AP(r) is not a whole number: Pos(Pr) = AP(Pr) rounded up to the next whole
number.
If AP(Pr) is a whole number: Pos(Pr) = AP(Pr)+.5
(4) RESULT: ‘Pr’ is the value at Pos(Pr).
If Pos(Pr) is not a whole number, there is no observation in that position. In that
case, Pr is the average of the values in the two positions on either side of Pos(Pr).
Example: Below are SAT (scholastic aptitude test) mathematics scores for a srs of
32 university applicants. [the sample has been ordered for convenience]
484 490 506 509 523 532 539 539 544 545 550
558 578 580 591 593 610 610 630 634 641 647
648 655 662 673 682 688 693 726 745 780
(i) 60th Percentile:
AP(P60) =
; Pos(P60 ) =
P60 =
Interpretation:
(ii) 25th Percentile:
AP(P25)=
; Pos(P25) =
P25 =
Interpretation:
(31)
SPECIAL PERCENTILES
First Quartile: Q1 = P25
Second Quartile: Q2 = P50 = m
Third Quartile: Q3 = P75
Example: Below are the scores of a profile test given by a psychologist to
random sample of 30 convicted felons. The scores have been ordered from smallest
to largest.
146 165 171 179 181 184 190 191 192 192
192 193 195 196 196 197 198 199 200 200
201 203 204 205 206 213 215 221 232 247
First Quartile:
Second Quartile:
Third Quartile:
(32)
The Boxplot
(33)
Example: Consider Psychological profile scores for 30 convicted felons. For this
sample we know that
m = 196.5, Q1 = 191 and Q3 = 204
Inner Fences: LIF =
HIF =
Outer Fences: LOF =
HOF =
Sketch :
(34)
The Boxplot on Minitab
MTB>gstd
MTB> boxp C1
[draws a boxplot for the sample in C1]
Note: 1. On the boxplot scale one space (sp) “-“ is given by
sp = ( the distance between successive tick marks [+] )/10
2. Values read on the boxplot scale are approximate.
Example: Consider the psychological profile score data
MTB > boxp C1
------------I +
I--------*
O
------+---------+---------+---------+---------+---------+------C1
140
160
180
200
220
240
O
*
*
Question: From the boxplot scale determine the ( approximate) values of the sample
median, first and third quartiles and possible and probable outliers.
sp=
Q1 =
; m=
; Q3 =
Possible Outliers:
Probable Outliers:
(35)
Identifying Outliers in a Stem-and–leaf Plot
Minitab gives us an option when we use the stem command to identify outliers as
calculated in the boxplot.
Using the STEM command with the subcommand TRIM one can form a stem and
leaf plot with the outliers trimmed away and shown on special lines labeled LO and
HI. In this way one can examine the shape of the sample with the outliers removed.
Example: Scientists use ph (power of Hydrogen) as a numerical measure of acidity
with a ph=7.00 being neutral. A scientist suspects a ph meter is defective. He takes a
random sample of 45 measurements with this meter on a neutral substance. The
data have been entered into C1.
8.32
6.19
4.61
5.49
6.76
6.13
10.28
6.86
6.09
8.35
5.89
7.05
9.05
5.82
7.22
6.45
6.56
5.88
5.38
8.85
7.14
5.90
5.79
6.60
5.95
9.00
5.99
6.58
6.49
7.24
7.21
6.99
3.46
6.62
3.26
6.02
6.98
6.21
7.22
6.72
5.96
8.26
6.07
3.76
11.17
MTB > stem C1;
SUBC> trim.
Stem-and-leaf of PHlevel
Leaf Unit = 0.10
LO
4
6
14
22
(9)
14
8
8
5
4
4
5
5
6
6
7
7
8
8
9
HI
N
= 45
32, 34, 37, Low Outliers(values below the LIF)
6
34
78889999
00011244
555677899
012222
233
8
00
102, 111,
High Outliers(values above the HIF)
(36)
The outliers are the values beyond the inner fences. We can show this using the
approximate sample values read from the plot. See the calculations below:
To show the “HI” values are outliers, we need calculate the high inner fence.
AP(Q1) =(25/100)45 = 11.25, Pos(Q1) = 12
Q1 =
AP(Q3) = (75/100)45 = 33.75; Pos(Q3) = 34
Q3 =
IQR =
HIF=
The values 10.2 and 11.1 are both bigger than 9.15 and are plotted as “HI” outliers
in the stem and leaf plot.
Similarly,
LIF =
The values 3.2, 3.4 and 3.7 are all smaller than LIF and are plotted as “LO” outliers
in the previous stem and leaf plot.
(37)
Bivariate Data
All the data sets we have encountered up to now deal with measurements of one
variable on each of several “individuals” (i.e. incomes of individuals, test scores
etc.). Such type of data is called univariate data.
In many situations it may be of interest to take measurements of two variables on
each of several individuals. For example we may be interested in both the income of
an individual and the number of hours of television watched per week. In such a
case each observation will consist of two values ( a pair of measurements)
(income, number of hours watched).
Such type of data is called bivariate data and is often denoted in the form; (x,y), the
“x-observation” and “y-observation”. A bivariate sample of ‘n’ observations would
then be written as
(x1,y2), (x2,y2),. . ., (xn,yn)
The reason we take observations on two variables is that we are interested in
exploring the relationship between the variables. For example “ Do poorer people
tend to watch more TV”.
The simplest possible relationship between two variables x and y is that of a straight
line.
(38)
Measuring Association Between Variables
Here are some typical examples of association:
Positive linear relationship between x and y:
Curvilinear relationship between x and y:
Negative linear relationship between x and y:
There are many measures of association in statistics and the term CORRELATION
may be applied to such measures. We will study a specific type of correlation called
PEARSON’S SAMPLE CORRELATION COEFFICIENT which measures the
strength of the linear relationship between the two numerical variables.
(39)
Pearson’s Sample Correlation Coefficient r
Consider the following bivariate data on two class quizzes:
x
y
1
5
1
4
2
4
4
2
5
1
5
2
2
2
4
4
sxy =
=
=
sxy is called the sample covariance and is a measure of the linear relationship
between x and y. To make this value independent of the units of measurement we
divide by sx = 1.690 and sy =1.414.
The scatter plot of this sample data is as follows:
(40)
Pearson’s sample correlation coefficient r is
r = _______________________
=_______________________ =
Note: r is always between –1 and +1. The nearer r is to –1, the closer the plotted
points are to a straight line with negative slope. The nearer r is to +1, the closer the
plotted points are to a straight line with positive slope. ( If r is equal to –1 or +1 then
all the points are on a straight line).
Note: Pearson’s sample correlation coefficient r is a measure of the linear
relationship between two variables X and Y. Other relationship may exist which are
not detected by r.
Example: Consider the sample of pairs (-1,1), (0,0), (1,1). We use Minitab to
calculate Pearson’s sample correlation coefficient r.
C1: -1
0
1
C2: 1
0
1
MTB> corr c1 c2
Correlations (Pearson)
Correlation of X and Y =0.000
The correlation r is 0. Thus there is no linear component to the relationship between
X and Y; note however that for each pair (x,y), y=x2, so there is another type of
relationship between X and Y ( a quadratic relationship).
(41)
Correlation and Causality
Just because the value of r is close to 1, does not of itself mean that x “causes” y.
A strong ( negative or positive) correlation does not necessarily imply a cause and
effect relationship between X and Y. Often there is a third hidden variable which
creates an apparent relationship between X and Y. Consider the example below.
Example: A sample of students from a grade school were given a vocabulary test. A
high positive correlation was found between
X= “student’s height” and Y= “student’s score on the test”
Should one infer that growing taller will increase one’s vocabulary? Explain. Also
indicate a third variable which could offer a plausible explanation for this apparent
relationship.
Note: A cause and effect relationship is best established by an experiment in which
other variables which influence X and Y are controlled.
Question: In the example above, how might a more controlled experiment be
performed?
(42)
Fitting a Straight Line: The Least Squares (Regression) Line
A set of points (bivariate observations) may or may not have a linear relationship. If
we want to draw a straight line which “fits” these point, which line fits best? Our
choice depends on what we use to define a “good” fit.
Given a sample of bivariate data (x1, y1), (x2,y2), . . .(xn,yn), the least squares line L is
the line fitted to the points in such a way that d12+d22+…+dn2 is as small as possible.
The equation of the least squares line L is given by
y =b0 + b1x, where b1 = (syr/sx) and b0 =
Note:
1. The sign of the slope (b1) is the same as the sign of the sample
correlation coefficient.
2. The least squares line always passes through the point (
(43)
).
Example: Consider the previous example of scores on two class quizzes. Sample
data is (1,5), (1,4), (2,4), (4,2), (5,1), (5,2), (2,2), (4,4). In our calculation of r=-.717,
we also found that
x =3,
y =3,
sx =1.690
and
sy = 1.414
Thus,
b1 = (syr/sx) =
b0 =
=
=
ŷ
=
= b0 + b1x
SKETCH:
Note: 1.The least squares line only fits the points used in its calculation. Do not be
tempted to extend it below the smallest x-value or above the largest x-value. New
observations with x-values outside this range may not fall anywhere close to the
fitted line. So if a situation requires you to extend the line, do so with caution.
2. In the language of regression r2 is known as the COEFFICIENT OF
DETERMINATION. It is used in regression to measure how well the least squares
line fits the points. r2 satisfies the following inequality
0 r2 1
(44)
Pearson’s Sample Correlation Coefficient and the Least Squares Line on Minitab
Example: The data below gives the cost estimates and actual costs (in millions) of a
random sample of 10 construction projects at a large industrial facility.
Estimate(X)
44.277
2.737
7.004
22.444
18.843
46.514
3.165
21.327
42.337
7.737
Actual(Y)
51.174
9.683
14.827
22.159
26.537
50.281
15.550
23.896
50.144
13.567
To find Pearson’s sample correlation coefficient ‘r’ for this data and the equation of
the least squares line using Minitab, proceed as follows.
Name C1 as EST(X) and C2 as ACT(Y), then enter the x-data into C1 and the
corresponding y-data into C2. Then
Click: Stat Regression  Fitted Line Plot
Type in C2 in the Response (Y) box and type in C1 in the Predictor(X) box. Now
click OK.
Regression Plot
ACT(Y) = 7.49228 + 0.937658 EST(X)
S = 3.49312
R-Sq = 96.0 %
R-Sq(adj) = 95.5 %
50
ACT(Y)
40
30
20
10
0
10
20
30
EST(X)
(45)
40
50
From the output we find that the coefficient of determination r2 = .960
and the equation of the least squares line is
ŷ =7.49228+.937658x
Questions: (1) What is the value of the sample correlation coefficient? Interpret it.
(2) Predict the actual cost of a project whose estimated cost is 15 million dollars.
(3) Is it safe to use this data to predict the actual cost of a project estimated to
cost 80 million dollars? Explain.
(4) Estimate the average change in the actual cost of a project when the
estimated cost increases by 10 million.
(46)
PROBABILTY
In life an action (or an “experiment”) will produce one of a number of possible
outcomes, but the exact nature of outcome may not be precisely predicted. Thus
when we toss a coin we know the outcome will be “heads” or “tails” but we will not
be able to predict on any given toss which of the two possibilities will actually occur.
In real life few things can be predicted exactly and there are situations in which
some outcomes may be more likely than others. For example you know that you are
more likely to do well in this course than to purchase a winning Loto 6-49 ticket!
We can formalize this feeling about the likelihood of an event by giving it a value
which expresses the level of uncertainty associated with the event.
A PROBABILITY is a number between 0 and 1, which measures how likely an
event is to occur.
The SAMPLE SPACE is the collection of all possible outcomes that can occur in an
experiment. The identification of all the possible outcomes in S is often the first step
in determining probabilities.
An EVENT is a subset of the sample space S. Events are usually denoted by upper
case letters (A,B,C,…) and the probability of an event E is denoted by P(E).
If all the outcomes in the experiment are “equally likely”, then the probability of an
event is given by
P(Event) = Number of outcomes in the event/Number of outcomes in S
= N(Event)/N(S)
In many experiments we can use a “tree diagram” to list the possible outcomes of
the experiment (i.e. the Sample Space S) and the probabilities of each outcome.
Example : A “ fair” coin is tossed. List the sample space S and the probabilities of
each outcome.
(47)
Example: A fair four sided die is rolled twice.
(a) Use a tree diagram to list the sample space S.
(b) List and find the probability of each of the following events.
A = “ a sum of 4 or less”
B = “ the second roll is a 2”
C= “ a sum of 7”
D = “ a sum greater than 8”
(48)
Operations on Events and Associated Rules of Probability
1. P(S) =1, P() =0 [Since, P(S) = N(S)/N(S) =1 and P() = 0/N(S) =0]
Here  is the symbol for the so-called “empty set”, the event that contains no
outcomes; so N() =0.
2. Complement:
The complement of A is the event that A does not occur.
It is made up of all outcomes which are not in A.
Complement Rule:
(49)
3. Intersection: A and B [AB]
The intersection of two events A and B is
the event that both A and B occur. It is
made up of all the outcomes which are common
to A and B.
(50)
4. Union: A or B [AUB]
The union of two events A and B is the event
that A or B or both occur. It is made up of all
the outcomes which are in A together with all
the outcomes which are in B.
Addition Rule :
P(A or B) = P(A) + P(B) – P(A and B)
(51)
5. Mutually Exclusive Events
Two events A and B are mutually exclusive
(or disjoint) if their intersection is empty, that
is they have no outcomes in common. Practically
speaking this means that A and B cannot both
occur in a given experiment.
This is written as A and B =.
Addition Rule for Mutually Exclusive Events:
P(A or B) = P(A) + P(B)
(52)
Although some what artificial, examples with balls of different colors and numbers
being drawn from boxes are useful in highlighting some of the concepts and rules we
have introduced. Here is a typical example.
Example: Box I contains two balls one red and one black, box II contains three balls
numbered 1,2 and 3. A ball is randomly selected from each box.
(a) List the sample space.
(b) List and find the probability of the following events:
(i)
A = “ an odd number is drawn”,
(ii)
B = “ a black ball is drawn”,
(iii)
A and B,
(iv)
A or B,
(v)
Complement of A.
(c) Are the events A and B mutually exclusive?
(53)
Previous Example continued:
(54)
Condition Probability:
The conditional probability of an event A given that event B has occurred is defined
by
P(AB) = P( A and B)/P( B)
Note: This definition implies that for any events A and B
P( A and B) = P(AB)P(B) = P(BA)P(A)
This statement is called the Multiplication Rule.
Example: Suppose two cards are selected at random from a deck one at a time and
without replacement. Find the probability that
(a) an ace is drawn first and a king is drawn second.
(b) two queens are drawn.
(55)
Example: In a large insurance agency,
60% of the customers have automobile insurance
40% of the customers have homeowners insurance
75% of the customers have one type or the other.
(a) Find the proportion of customers with both types of insurance.
(b) Find the probability that a customer has homeowner insurance given that he
has automobile insurance.
(56)
Independence
Sometimes additional information is not relevant and does not lead to an update of
the probability.
Example: A group of 200 voters in a certain electoral poll are classified according to
gender on one hand and candidate preference on the other.
Male
Female
Total
A
20
40
60
B
36
44
80
C
24
36
60
Total
80
120
200
Probability a voter favours candidate A given that the voter is a male is
P(A/M) = 20/80 =.25
Notice that the probability a randomly chosen voter favours A is
P(A)= 60/200 = .30
While 30% of this group of voters favour candidate A, a different percentage of the
male voters (namely 25%) favour candidate A. Thus knowledge of the gender of the
voter is useful and leads us to update our probability of candidate A preference.
The events A = “a voter favours candidate A” and M=”a voter is a male” are called
DEPENDENT EVENTS because P(AM)P(A).
Now look at the probability a voter favours candidate C given that the voter is a
male.
P(CM) = 24/80 = .30
Notice that the probability a voter favours candidate C is
P(C) = 60/200 = .30
Now in this case knowledge of the gender of the voter gives us no additional
information regarding preference for candidate C, and does not lead to an updating
of this probability ( while 30% of this group of voters favour candidate C, the same
percentage of the male voters also favour candidate C). The events C = “ a voter
favours candidate C” and M=”a voter is a male” are called INDEPENDENT
EVENTS because P(CM) = P(C).
(57)
There are a number of equivalent ways of defining independence.
Two events A and B are said to be INDEPENDENT if any of the following
statements hold:
(a) P(AB) = P(A)
(b) P(BA) = P(B)
(c) P(A and B) = P(A)P(B).
Statement (c) is the MULTIPLICATION RULE FOR INDEPENDENT EVENTS.
Example: Suppose A is the event an adult is a male and B is the event an adult
favours capital punishment. Further suppose that in a certain country, 50% of the
adults are male, 80% of the adults favour capital punishment, and 40% of the
adults are male and favour capital punishment. For this country, answer the
following:
(a) Are the events A and B independent? Explain.
(b) Are the events A and B mutually exclusive? Explain.
(58)
Example: For a certain population of individuals
52% are overweight,
30% have high blood pressure,
20% are both overweight and have high blood pressure.
Is the fact that a person is overweight independent of his/her blood pressure?
Example: An unfair coin has a probability of 0.7 of coming up “heads” when
tossed. Suppose it is tossed in such a way that the outcome of each toss is
independent of the outcome of any other toss.
(a) If the coin is tossed n=2 times, find the probability of obtaining two heads.
(b) If the coin is tossed n = 3 times, find the probability of obtaining two heads
followed by a tail.
(59)
POPULATIONS AND RANDOM VARIABLES
Often the things we sample from a population are actually measurements of some
characteristic. These measurements will naturally vary in some random way from
observation to observation.
This type of measurement, taken at random from a population, is called a
RANDOM VARIABLE, and we will use the symbol X to refer to it.
Example: Hospital records show that for patients undergoing a certain surgical
procedure, the length of stay in hospital was 2 days for 10% of the patients, 4 days
20% of the patients, 5 days 40% of the patients, and 6 days 30% of the patients.
Let X= the length of stay for a randomly selected patient. We can “model” the
population of these measurements by using the “PROBABILITY DISTRIBUTION”
of the random variable X. This is simply a table which gives the probability that the
value of X that we pick will be a certain number.
Probability Distribution of X
x
P(X=x) or p(x)
2
4
5
6
We obtain the following probabilities from this probability distribution.
1. P(X=4) =
2. P(X=2 or X=5)
3. P(X 4) =
4. P(X>3) =
5. P( 2 < X  6) =
6. P( X < 4 or X  5) =
(60)
Total
The Mean, Variance and Standard Deviation of a Random Variable X
MEAN: The population mean (or the mean of the distribution of X) is denoted by 
or x and is given by the formula
 =  xp(x)
Variance: The population variance ( or the variance of the distribution of X) is
denoted by 2 [ or x2 ] and is given by the formula:
2 =  (x - )2p(x) = x2p(x) - 2
The population standard deviation ( or the standard deviation of the distribution of
X) is denoted by  or x and is defined as
 = 2
Example: Consider the probability distribution of the random variable X in the
previous example.
x
p(x)
2
.1
4
.2
5
.4
6
.3
 = xp(x) =
=
[the mean (average) length of stay for patients undergoing
this surgical procedure]
2 =  (x -  )2p(x) =
=
=
(61)
A Counting Rule
Definition: n! ( read as “n factorial”) means the following:
n! = n(n-1)(n-2)...(3)(2)(1)
Note: By definition 0! = 1.
Example: (a) 3! =
(b) 5! =
An Important Formula: Let Cnx ( read “ n choose x”) denote the number of ways of
obtaining “ x heads in n tosses” of a coin. Then
Cnx =
n!
x!(n  x)!
The number Cnx is called a binomial coefficient.
Example: How many ways are there to obtain
(a) 2 heads in 3 tosses of a coin?
(b) 0 heads in 3 tosses of a coin?
(c) 12 heads in 15 tosses of a coin?
Note: The formula used above is merely a short hand way of calculating something
that would otherwise require us to draw a tree diagram.
For example, the answer to (a) could have been obtained as follows:
(62)
THE BINOMIAL EXPERIMENT
Conditions for a Binomial Experiment
1. n identical trials are performed,
where n is determined prior to
the experiment
A Coin Tossing Experiment
1. A fair coin is tossed 10 times
exactly the same way each time.
So 10 identical trials are
performed.
2. The trials are independent; so the
outcome of any particular toss
does not influence the outcome of
any other toss.
3. Each trial has two possible
outcomes, heads (success) and
tails (failure)
4. At each trial:
P(success) = p =.5
P(failure) = q = .5,
since the coin is fair.
The Binomial Random Variable is
X = the total number of heads in the
10 tosses.
2. The trials are independent; so the
outcome of any particular trial
does not influence the outcome of
any other trial.
3. Each trial has two possible
outcomes which we call “success”
and “failure”.
4. The probability of “success” is
same for each trial and is denoted
by “p”. Thus at each trial:
P(success) = p
P(failure) = 1 –p =q.
The Binomial Random Variable is
X= the total number of successes in
the n trials.
Example of Binomial Experiment: A manufacturer supplies a certain component for
automobile companies. 2% of the components produced are defective. A srs of 30
components is examined. Of interest is the number of defective components in the
sample.
1. n =
, identical trials are performed.
A trial consists of ____________________________________________________
2. The trails are independent.
3. Each trial has two possible outcomes,
Success: _______________________________________________
Failure:________________________________________________
4. At each trial : P(success) =p =
P(failure) = q =
The Binomial Random Variable is :
X=
(63)
Note: In general If X is a Binomial(n,p) ;then,
P(X=x) = p(x) = Cnxpxqn-x; x = 0,1,2,…,n; q = 1-p.
n!
where, Cnx =
x!(n  x)!
Example: A biased coin which has P(heads) = p =.7 and p(tails) = q = .3 is tossed 3
times. The coin is tossed in such a way that the outcomes on each toss are
independent. We obtain the probability of 0,1,2 and 3 heads using the above
formula:
(i)
P(X=0) =
(ii)
P(X=1) =
(iii)
P( X=2) =
(iv)
P( X=3) =
(64)
Probability Histogram for a Binomial Distribution:
Consider our binomial n=3, p=.7 random variable X above.
x
p(x)
0
.027
1
.189
2
.441
3
.343
Questions:
1. What is the shape of this histogram?______________________
2. What do you think the shape of this histogram would be if
(a) p < .50?__________________________________
(b) p =.50?____________________________________
(65)
Mean and Variance of a Binomial Random Variable
Again consider our binomial n=3, p=.7 random variable X above.
Using the usual formula for  and 2 , we obtain
 =  xp(x) =
2 =  (x -  )2 p(x)
=
=
Fact: It can be shown that for a Binomial random variable X, the mean , the
variance 2 and the standard deviation  can be calculated using the following
formulas:
 = np,
2 = npq, and  =  npq
Thus in the example above:
 = np = 3(.7) = 2.1
2 = npq =
=
(66)
Example: In a multiple choice test, there are 5 questions each with three possible
answers. For each question a student chooses an answer at random. Find the
probability that she gets (a) exactly 3 correct answers (b) three or fewer correct.
Let X = the number of correct guesses out of 5.
X is binomial , n = _____________, p = _______________
P(X=x) = p(x) =
(a) P(X=3) =
(b) P(X3) =
(67)
Example: A variety of seed has a 40% chance of germinating. Ten seeds are planted.
You are interested in X = the number of seeds out of ten that germinates. Then X is
_______________, n = ____________, p = ______________
(a) find the mean (expected) number of seeds that will germinate.
(b) the variance and standard deviation.
(c) The probability that exactly 6 seeds germinate.
(d) The probability that fewer than 6 seeds germinate.
(68)
(e) The probability at most six seeds germinate.
(f) The probability at least 6 seeds germinate.
(g) The probability that more than 6 seeds germinate.
(69)
CONTINUOUS RANDOM VARIABLES
The Binomial random variable is called a discrete random variable because it takes on
discrete values 0, 1, 2, 3, ..,n. Consider the probability histogram of X, Binomial n =3, p
=.7. The areas under the histogram are related to the probability distribution.
P(X=2) =
P(X 2) =
The total area under this probability histogram is
p(0) + p(1) + p(2) + p(3) = 1
A continuous random variable X is one which represents measurements that
(theoretically) can be made to any degree of accuracy. For example suppose X = the
weight (in kg) of a randomly chosen newborn baby. Depending on the accuracy of
our scale the weight X of a randomly selected baby could be recorded either as 3 or
3.3 or 3.26 or 3.258 etc.. The probability histogram for such a variable has to be
formed in a very different way than for a discrete random variable.
For example, in the case of “X=the birth weight of a newborn”, we could take a very
large sample from the population of newborns, measure the sample birth weights
very accurately (with many decimal of accuracy) and form a histogram of the birth
weights using classes of small width. If the vertical scale is adjusted so the total area
under the histogram is one then area under the histogram can be used to calculate
(approximately) the probabilities. The larger the sample and the more accurate our
measurements, the more accurate will be these probabilities.
In the illustration below a srs of 812 babies was used to form a probability
histogram.
(70)
Birth Weight (kg) of 812 Newborns
Frequency
30
20
10
0
1
2
3
4
5
WT(kg)
Let X be the birth weight of a randomly chosen newborn. P(X3) = Shaded Area.
The larger the sample, the smaller we can make these rectangles, and the
“smoother” will be the resulting histogram. Thus a “model” of the distribution
could be obtained by fitting a curve to such a histogram and using the area under
the curve to calculate the probabilities ( this curve is called a Probability Density
Function).
Thus P(X3) is the area under the curve to the left of 3. Thus total area under the
curve must be 1.
Note: As we know there are many different shapes among various populations (e.g.
left skewed, right skewed, symmetric etc.). In the class of bell-shaped curves there is
a specific one which is called normal curve ( there is a mathematical formula which
defined it exactly). If a population can be modeled by this certain bell –shaped curve
the population is said to have a Normal ( or Gaussian) Distribution.
(71)
The Normal Distribution
A Normal ( or Gaussian) Population is one that can be modeled by a certain bellshaped curve called the normal curve. The population of weights of newborn babies
described above is an example of such a population. In describing such a population,
we need to know two quantities,  and  .  stands for the population mean and 
stands for the population standard deviation. In our example  is the mean
(average) weight of all newborn babies in the population and  measures the spread
of the population values about the mean  .
A Normal (or Gaussian) random variable X represents a randomly chosen
measurement sampled from this population. Probabilities about X are found by
finding the appropriate areas under the curve i.e. P(Xx) = the area under the
normal curve to the left of x.
Note: (a) The total area under the curve is 1.
(b) For a normal random variable P(X=x) is always 0.
Practically speaking this means that if we can measure observations very accurately,
the chances of finding a newborn weighing exactly 3.0000000000kg, say, is very
small. Thus for all practical purposes (X=3) =0. One consequence of this is that
P(Xx) = P(X<x).
The population of Z-scores of a normal population is called the Standard Normal
Population. A randomly chosen measurement from the standard normal population
is denoted by Z. Z is simply the Z-score of a normal random variable X.
X 
Z=

For the standard normal random variable Z: Z = 0 and Z = 1.
We will show later that most of the time an observed value of Z will fall between –3
and +3.
(72)
Probabilities for the Standard Normal
Probabilities for the standard normal distribution can be obtained from the Table A
on pages T-2 and T-3.
Examples: (a) (i) P(Z1.5) =
(ii) P (Z>1.5) =
(b) (i) P(Z  -1.5) =
(ii) P(Z-1.5) =
(c) P(-1.5  Z  1.5) =
(d) P(-1.5 < Z < 2.21) =
(73)
Probabilities for Normal Random Variables in General
Example: The heart rate of patients suffering from heart disease is normally
distributed with a mean of 97 beats per minute and a standard deviation of 18 beats
per minute. For a randomly chosen patient, find the probability the heart rate is
(a) below 80 (b) more than 140, (c) between 55 and 90.
Let X = the heart rate of a randomly selected patient;  =97,  = 18.
(a) P(X< 80) =
(b) P(X>140) =
(c) P(55<X<90) =
(74)
Example: Let X be any normal random variable with mean  and standard
deviation . What is the probability that X is within two standard deviations of the
mean?
First note that the statement “ X is within two standard deviations of the mean”
means that X lies between  - 2 and  + 2.
Thus P( X is within two standard deviations of the mean)
= P ( - 2 < X <  + 2)
=
(75)
Note: Similarly,
(i)
P(X is within one standard deviations of the mean)
=
(ii)
P(X is within three standard deviations of the mean)
=
(76)
Percentiles of the Standard Normal Distribution
(Using the Normal Tables backwards)
(a) Find the 95th percentile of the standard
normal distribution i.e. find the value of z0
such that
P( Z  z0 ) = .95
( b) Find z0 such that
P ( Z  z0 ) = .41
(c)Find z0 such that
P( z0  Z  0 ) = .1
(d) Find z0 such that
P( -z0  Z  z0 ) = .95
(77)
Example: Let X be a normal random variable with mean  = 100 and standard
deviation =10. Find x0 such that
(a) P(X  x0 ) =.80 [i.e. x0 is the 80th percentile of X].
(b) P ( X  x0 ) = .025 .
(78)
Example : The scores on the Scholastic Aptitude Test ( SAT) for verbal ability of
high school seniors is normally distributed with mean 430 and standard deviation
100. What score must a student attain in order to be in the top 5% of all the
students who took the test?
Example: The time to first failure for a certain model of television set is normally
distributed with mean 5 years and standard deviation 1.56 years. If the
manufacturer wishes to repair only 10% of the sets sold, for how long should he
guarantee his product?
(79)
Normal Approximation to Binomial Probabilities
Let X be a binomial random variable with parameters n and p. Mean and standard
deviation of this distribution are given by

 = np and
 =  npq ,
q = 1-p.
Then if both np  10 and nq  10 the binomial probabilities may be closely
approximated by Normal probabilities in the following way. For a = 0,1,2,3…n
P(X  a)  P(Z 
a  .5  np
npq
)
Example: An automotive plant employs workers and suffers a daily absentee rate of
1%.
(a) What is the mean ( or expected value) of absentees on a given day?
(b) What is the variance and standard deviation of the absentees on a given day?
(c) Find the probability that on a given day
(i) at most 30 of the workers are absent,
(ii) at least 60 of the workers are absent,
(iii) between 30 and 60 (inclusive) of the workers are absent.
Solution: Let, X = the number of workers absent on a given day. Therefore, X is
binomial with n =5000 and p =.01.
a.  = np =
b. 2 = npq =
=
c. Since the values for n and p are not in Table C (Binomial Probabilities)
and since using the binomial formula would take a very long time ( not to
mention the difficulty of the calculation when n = 5000), we use the
normal approximation to the binomial.
Check: np =
; nq =
(80)
(i)
P(X 30) 
(ii)
P( X  60 ) =
(iii)
P ( 30  X  60 ) =
(81)
Example : A travel agency promotes vacation packages by telephoning households
at random in the evening hours. Historically, only 65% of heads of households are
at home when the agency calls. Suppose that 30 households are phoned in a given
evening. Find the probability that the agency will find
(a) between 15 and 25 households, inclusively, with the head of the household at
home.
(b) fewer than 23 households with the head of the household at home.
(c) P( 17 < X < 28)
(82)
The Central Limit Theorem
Consider the following probability distribution . The random variable X is the
length of stay in the hospital for a randomly selected patient undergoing a certain
medical procedure.
X:
2
4
5
6
p(x):
.1
.2
.4
.3
The mean and standard deviation of this distribution are  = 4.8 and  = 1.166.
Knowing the appropriate model for a population is like knowing the entire
population itself. In a real life situation we would not know which model would be
the appropriate one for the population of interest, and we certainly would not know
the actual value of . The problem then is to ESTIMATE  in some way.
One way to do this is to take a simple random sample from the population and then
use it to calculate a value of the sample meanX .
If we proceeded in this way, we would become aware of the fact that each time we
took a new sample from the population and calculated X we would get a different
value for our estimate of . Thus it is important to try to understand a bit more
about the behaviour ofX.
Mean, Variance and Standard Deviation of the Sample MeanX
Given: 1. A population with mean  and standard deviation  .
2. A random sample of size n from the population:
X1 , X2 , … Xn with sample mean X = (X1 +X2 +… Xn)/n
Then:  =  [ the mean of the distribution ofX is equal to the population mean]
X
 =  / n [ the sd of the distribution of X is equal to ( the population sd)/n]
X
2 =  2 /n [the variance of the distribution ofX is equal to
X
(the population variance)/n ]
(83)
The Central Limit Theorem (CLT) is stated as follows:
Given a large random sample of size n from a population with mean  and
standard deviation , then The sample mean X is approximately normally
distributed with mean and standard deviation given by
 = 
X
 =  / n
X
Note: 1. n >30 is usually large enough for the CLT to apply.
2. If the population from which we sample is normal thenX is exactly
normally distributed with mean and standard deviation as above for any
sample size.
Example: The population of white males aged 35-44 in Canada has a mean
systolic blood pressure of 130 with a standard deviation of 8. Find the
probability that a random sample of 50 white males has a mean systolic blood
pressure
(a) below 133, (b) above 128, (c) within 2 of the population mean.
Solution: Here we are asked to calculate probabilities about the sample
meanX. This suggests we use the CLT.
Population: Systolic blood pressures of white Canadian males aged 35-44.
Population Mean  = 130; Population Stdev  = 8.
Since n =50 > 30, the CLT tells us thatX is approximately normally
distributed with
 =  = 130
X
(a) P ( X < 133 ) =
(84)
and  =  / n
X
(b) P( X > 128) =
(c) Notice that “ X within 2 of the population mean “ means that X lies
between 128 and 132.
Thus, P ( 128 < X < 132 ) =
(85)
Example: The scores on a Standardized Reading Test are known to be normally
distributed with a mean of  = 10 and standard deviation of  = 1.5.
(a) Find the probability that a single individual selected at random scores
between 9.4 and 10.6 on this test.
(b) Find the probability that a random sample of 25 students has a mean score
between 9.4 and 10.6 on this test.
(c) Find the probability that a random sample of 64 students has a mean score
between 9.4 and 10.6 on this test.
Solution: Let X be the score of the randomly selected individual. Then X is normal
 = 10,  = 1.5.
(a) P( 9.4 < X < 10.6 ) =
(86)
(b) Since the population the sample came from is normal, then for any n ( hence for
n =25), X is exactly normal with
 = 10 and  = 1 .5/5 =.3
X
X
P( 9.4 < X < 10.6 ) =
(c) For the same reason as in (b) above, for n =64,X is exactly normal ,  =  = 10
X
and  =  / n = 1.5/8 = 0.1875.
X
Therefore, P( 9.4 <X < 10.6 ) =
(87)