Chapter 1 Data and Statistics
Motivation: statements like the following appear very frequently in newspapers and
magazines:
• Sales of new homes are occurring at a rate of 70300 homes per year.
• The unemployment rate has dropped to 4.0%.
• The Dow Jones Industrial Average closed at 10000.
• Census
The above numerical descriptions are very familiar to most of us since we use them in
everyday life. As a matter of fact, they are part of statistics. Therefore, statistics is part of
our everyday life. We now give a description of statistics.
Definition of statistics:
Statistics is the art and science of collecting, analyzing, presenting, interpreting
and predicting data.
Objective of this course: to use statistics to give managers and decision makers a
better understanding of the business and economic environment and thus enable them
to make more informed and better decisions.
1.1 Data
(I) Basic components of a data set:
Usually, a data set consists of the following components:
Element: the entities on which data are collected.
Variable: a characteristic of interest for the element.
Observation: the set of measurements collected for a particular element.
Example 1:
We have a data set for the following 5 stocks:

Stock              Annual Sales (in millions)   Earnings per share ($)   Exchange (where traded)
Cache Inc.         86.6                         0.25                     OTC
Koss Corp          36.1                         0.89                     OTC
Par Technology     81.2                         0.32                     NYSE
Scientific Tech.   17.3                         0.46                     OTC
Western Beef       273.7                        0.78                     OTC

Note: OTC stands for “over the counter” while NYSE stands for “New York Stock
Exchange”.
In the above data set,
Elements: Cache Inc., Koss Corp, Par Technology, Scientific Tech., Western Beef
Variables: Annual Sales, Earnings per share, Exchange
Observations: (86.6, 0.25, OTC), (36.1, 0.89, OTC), (81.2, 0.32, NYSE),
(17.3, 0.46, OTC), (273.7, 0.78, OTC)
(II) Qualitative and Quantitative Data:
Qualitative data: labels or names used to identify an attribute of each element.
Quantitative data: numeric values indicating how much or how many.
Example 1 (continued):
Qualitative data: OTC, OTC, NYSE, OTC and OTC
Quantitative data: 86.6, 36.1, 81.2, 17.3, 273.7, 0.25, 0.89, 0.32, 0.46 and 0.78
The variable “Exchange” is referred to as a qualitative variable.
The variables “Annual Sales” and “Earnings per share” are referred to as quantitative
variables.
Note: quantitative data are always numeric, but qualitative data may be either
numeric or nonnumeric. For example, ID numbers and automobile license plate
numbers are qualitative data.
Note: ordinary arithmetic operations are meaningful only with quantitative data
and are not meaningful with qualitative data.
(III) Cross-Sectional and Time Series Data:
Cross-sectional data: data collected at the same or approximately the same point in
time.
Time series data: data collected over several time periods.
Online Exercise:
Exercise 1.1.1
Exercise 1.1.2
1.2 Data Source:
There are two sources of data: existing sources and statistical studies.
(I) Existing Sources:
There are two kinds of existing sources:
Company: data commonly available from the internal information sources of most
companies include employee records, production records, inventory records, sales
records, credit records, customer profiles, etc.
Government Agency: Department of Labor, Bureau of the Census, Federal Reserve
Board, Office of Management and Budget, Department of Commerce.
(II) Statistical Studies:
A statistical study can be conducted to obtain the data not readily available from
existing sources. Such statistical studies can be classified as either experimental or
observational.
Experimental Study: an attempt is made to control or influence the variables of interest,
for example, a drug test or an industrial product test.
Observational Study: no attempt is made to control or influence the variables of
interest, for example, a survey.
Online Exercise:
Exercise 1.2.1
1.3 Descriptive Statistics:
There are two classes of descriptive statistics: one class includes tables and graphs, and
the other class includes numerical measures and index numbers.
(I) Tabular and Graphical Approaches:
Example 1 (continued):
Tabular approach for stock data:

Exchange   Frequency   Percent
OTC        4           80%
NYSE       1           20%
Total      5           100%

Graphical approach for stock data:
[Bar graph of the percent of stocks on each exchange (OTC, NYSE); vertical axis from 0 to 80]
(II) Numerical Measures and Index Numbers:
Some numerical quantities can be used to provide important information about the
data, for example, the average or mean. Index numbers are widely used in business,
for example, the Consumer Price Index (CPI) and the Dow Jones Industrial Average
(DJIA).
Online Exercise:
Exercise 1.3.1
Exercise 1.3.2
1.4 Statistical Inference:
Descriptive statistics introduced in section 1.3 can provide important and intuitive
information about the data of interest. However, these statistical measures are mainly
exploratory. For more detailed, rigorous and accurate results, a statistical inference
procedure is required. To conduct statistical inference, data need to be drawn from a
set of elements of interest. We now introduce some basic components of the statistical
inference procedure. They are:
Population: the set of all elements of interest in a particular study.
Sample: a subset of the population.
Data from a sample can be used to make estimates and test hypotheses about the
characteristics of a population.
Example 2:
Suppose a bulb factory produces 100000 bulbs.
Objective: we want to know the average lifetime of the 100000 bulbs.
The 100000 bulbs are the population of interest. In practice, it is not possible (and not
realistic) to test all 100000 bulbs for their lifetime. One workable way is to draw a sample,
say 100 bulbs, and then test their lifetimes. Suppose the average lifetime of the 100
bulbs is 750 hours. Then, the estimate (guess) of the average lifetime of the 100000
bulbs is 750 hours.
Note: the process of making estimates and testing hypotheses about the
characteristics of a population is referred to as statistical inference.
Online Exercise:
Exercise 1.4.1
Chapter 2 Descriptive Statistics: Table and Graph
The logical flow of this chapter:
• Summarizing qualitative data using tables and graphs (2.1)
• Summarizing quantitative data using tables and graphs (2.2)
• Exploratory data analysis using simple arithmetic and easy-to-draw graphs such that
the data can be summarized quickly (2.3)
2.1 Summarizing Qualitative Data:
Qualitative data can be summarized with a frequency distribution, relative frequencies
and percent frequencies, which we now introduce.
Frequency distribution: tabular summary of data indicating the number of data
values in each of several nonoverlapping classes.
Relative frequency: (frequency of a class)/n, where n is the total number of data values.
Percent frequency: (relative frequency) × 100%.
Based on the frequency distribution, relative frequency, and percent frequency of the
data, we can use tables and graphs to display these frequencies.
Example:
Forbes investigated the degrees held by the 25 best-paid CEOs (chief executive officers).
Tabular summary:

Degrees      Frequency   Relative Frequency   Percent Frequency
None         2           0.08                 8
Bachelor     11          0.44                 44
Master       7           0.28                 28
Doctorate    5           0.20                 20
Total        25          1.00                 100
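
The three frequencies in the table can be reproduced with a few lines of Python. This is only a sketch: the list `degrees` is rebuilt from the class frequencies above (the individual CEO records are not given), and the helper name `frequency_table` is a hypothetical choice made for illustration.

```python
from collections import Counter

# Degrees of the 25 CEOs, rebuilt from the frequencies in the table
# (assumption: the raw records themselves are not listed in the notes).
degrees = ["None"] * 2 + ["Bachelor"] * 11 + ["Master"] * 7 + ["Doctorate"] * 5

def frequency_table(data):
    """Return {class: (frequency, relative frequency, percent frequency)}."""
    n = len(data)
    counts = Counter(data)
    return {label: (f, f / n, 100 * f / n) for label, f in counts.items()}

table = frequency_table(degrees)
for label, (freq, rel, pct) in table.items():
    print(f"{label:<10} {freq:>3} {rel:>6.2f} {pct:>5.0f}")
```

The relative frequencies sum to 1 and the percent frequencies to 100, which is a quick sanity check on any frequency table.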
Graphical display:
[Bar graph of the frequency of each degree (None, Bachelor, Master, Doctorate); vertical axis from 0 to 10]
[Pie graph titled “CEO Degrees” with slices None, Bachelor, Master, Doctorate]
Note: most statisticians recommend that from 5 to 20 classes be used in
a frequency distribution; classes with smaller frequencies should
normally be grouped!!
Online Exercise:
Exercise 2.1.1
Exercise 2.1.2
2.2 Summarizing Quantitative Data:
Determine the classes:
For quantitative data, we need to define the classes first. There are 3 steps to define
the classes for a frequency distribution:
Step 1: Determine the number of nonoverlapping classes, usually 5 to 20 classes.
Step 2: Determine the width of each class,
$\text{class width} \approx \dfrac{\text{largest data value} - \text{smallest data value}}{\text{number of classes}}$
Note: the number of classes and the approximate class width are determined by
trial and error!!
Step 3: Determine the class limits: the smallest data value should be no smaller
than the lower class limit of the first class while the largest data value should be no
larger than the upper class limit of the last class.
Example:
Suppose we have the following data (in days):
12, 14, 19, 18, 15, 15, 18, 17, 20, 27, 22, 23, 22, 21, 33, 28, 14, 18, 16, 13
We apply the above procedure to these data.
Step 1:
We choose 5 to be the number of classes.
Step 2:
$\text{class width} \approx \dfrac{\text{largest data value} - \text{smallest data value}}{\text{number of classes}} = \dfrac{33 - 12}{5} = 4.2$.
Therefore, we use 5 as the class width.
Step 3:
The 5 classes we choose are
10-14
15-19
20-24
25-29
30-34
Note: the lower class limit in the first class (10) is smaller than the
smallest data value 12. Also, the upper class limit in the last class (34) is
larger than the largest data value 33.
Summarizing quantitative data:
Tabular summary:
In addition to frequency, relative frequency and percent frequency, another tabular
summary of quantitative data is the cumulative frequency distribution.
Cumulative frequency distribution: the number of data items with values less than
or equal to the upper class limit of each class.
Graphical display:
In addition to the histogram, another graphical display of quantitative data is the ogive.
Ogive: a graph of the cumulative frequency distribution, i.e., for each class, the number
of data items with values less than or equal to the upper class limit of the class.
Example (continued):

Classes   Frequency   Relative Frequency   Percent Frequency
10-14     4           0.2                  20
15-19     8           0.4                  40
20-24     5           0.25                 25
25-29     2           0.1                  10
30-34     1           0.05                 5
Total     20          1                    100

Classes   Cumulative Frequency   Cumulative Relative Frequency   Cumulative Percent Frequency
≤ 14      4                      0.2                             20
≤ 19      4+8=12                 0.2+0.4=0.6                     20+40=60
≤ 24      4+8+5=17               0.2+0.4+0.25=0.85               20+40+25=85
≤ 29      4+8+5+2=19             0.2+0.4+0.25+0.1=0.95           20+40+25+10=95
≤ 34      4+8+5+2+1=20           0.2+0.4+0.25+0.1+0.05=1         20+40+25+10+5=100
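
As a rough check of the two tables, the class frequencies and the cumulative frequencies can be counted directly from the 20 data values. The sketch below assumes the five classes chosen in Step 3; the variable names are illustrative only.

```python
from itertools import accumulate

data = [12, 14, 19, 18, 15, 15, 18, 17, 20, 27,
        22, 23, 22, 21, 33, 28, 14, 18, 16, 13]
classes = [(10, 14), (15, 19), (20, 24), (25, 29), (30, 34)]

# Frequency: how many values fall between the lower and upper class limits.
freq = [sum(lo <= x <= hi for x in data) for lo, hi in classes]
# Cumulative frequency: running total up to each upper class limit.
cum_freq = list(accumulate(freq))

n = len(data)
for (lo, hi), f, c in zip(classes, freq, cum_freq):
    print(f"{lo}-{hi}: freq={f}, rel={f / n:.2f}, cum={c}, cum%={100 * c / n:.0f}")
```

Plotting `cum_freq` against the upper class limits gives the ogive; plotting `freq` as bars over the classes gives the histogram.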
The histogram is
[Histogram of the data; horizontal axis “data” from 10 to 35, vertical axis (frequency) from 0 to 8]
The ogive plot is
[Ogive plot; horizontal axis “data” from 0 to 35, vertical axis “cumulative frequency” from 0 to 20]
Online Exercise:
Exercise 2.2.1
Exercise 2.1.2
2.3 Exploratory Data Analysis:
Stem-and-leaf display is a useful exploratory data analysis tool which can provide an
idea of the shape of the distribution of a set of quantitative data.
Example:
Suppose the following data are the midterm scores of 10 students,
17, 22, 93, 82, 95, 87, 66, 68, 71, 52.
Then, the stem-and-leaf display is
1 | 7
2 | 2
3 |
4 |
5 | 2
6 | 6 8
7 | 1
8 | 2 7
9 | 3 5
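
A stem-and-leaf display like this one can be generated mechanically. The following is a minimal sketch, assuming two-digit scores so that the tens digit is the stem and the units digit is the leaf; `stem_and_leaf` is a hypothetical helper name.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (tens digit) to its sorted leaves (units digits)."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    # Include empty stems between the smallest and largest stem.
    stems = range(min(leaves), max(leaves) + 1)
    return {stem: leaves.get(stem, []) for stem in stems}

scores = [17, 22, 93, 82, 95, 87, 66, 68, 71, 52]
display = stem_and_leaf(scores)
for stem, leaf_list in display.items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaf_list))
```

Keeping the empty stems (3 and 4 here) matters, because the gaps are part of the shape of the distribution.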
Online Exercise:
Exercise 2.3.1
Chapter 3 Descriptive Statistics: Numerical
Methods
Suppose $y_1, y_2, \ldots, y_N$ are all the elements in the population and
$x_1, x_2, \ldots, x_n$ are the sample drawn from $y_1, y_2, \ldots, y_N$, where N is referred
to as the population size and n is the sample size. In this chapter, we introduce several
numerical measures to obtain important information about the population. Numerical
measures computed from a sample are called sample statistics while
those computed from a population are called population parameters.
In practice, it is often not realistic or not possible to obtain a population parameter from
the whole population, for example, the average lifetime of 100000 bulbs. Therefore, a sample
statistic can be used to estimate the population parameter; for example, the average
lifetime of 100 bulbs can be used to estimate the average lifetime of 100000 bulbs.
3.1 Measure of Location:
Example:
Suppose the following data are the scores of 10 students in a quiz,
1, 3, 5, 7, 9, 2, 4, 6, 8, 10.
Some measures need to be used to provide information about the performance of the
10 students in this quiz.
(I) Mean:
Sample mean: $\bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{n}$ (sample statistic)
Population mean: $\mu = \dfrac{\sum_{i=1}^{N} y_i}{N}$ (population parameter)
Basically, the mean can provide the information about the “center” of the data.
Intuitively, it can measure the rough “location” of the data.
Example (continued):
$\bar{x} = \dfrac{1 + 3 + \cdots + 10}{10} = 5.5$
(II) Median:
The data are arranged in ascending (or descending) order. Then,
1. If the sample size is odd, the median is the middle value.
2. If the sample size is even, the median is the mean of the middle two values.
Example (continued):
$\text{median} = \dfrac{5 + 6}{2} = 5.5$
If the data are 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, then
$\text{median} = 6$
Note: the median is less sensitive to extreme values than
the mean. For example, suppose the last value in the previous data has
been wrongly typed, so that the data become 1, 3, 5, 7, 9, 2, 4, 6, 8, 100. Then
the median is still 5.5 while the mean becomes 14.5.
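
The effect of the mistyped value can be checked with Python's standard `statistics` module. This is a small sketch using the quiz scores above; `mistyped` is an illustrative name for the data with 10 replaced by 100.

```python
from statistics import mean, median

scores = [1, 3, 5, 7, 9, 2, 4, 6, 8, 10]
mistyped = scores[:-1] + [100]   # last score typed as 100 instead of 10

print(mean(scores), median(scores))      # both 5.5
print(mean(mistyped), median(mistyped))  # mean moves to 14.5, median stays 5.5
```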
(III) Mode:
The mode is the data value that occurs with the greatest frequency (it is not necessarily numerical).
Note: if the data have exactly two modes, we say that the data are
bimodal. If the data have more than two modes, we say that the data are
multimodal.
(IV) Percentile:
The pth percentile is a value such that at least p percent of the data have this value or
less.
Note: 50th percentile = median!!
The procedure to calculate the pth percentile:
1. Arrange the data in ascending order.
2. Compute an index i, $i = \left(\dfrac{p}{100}\right) n$.
3. (a) If i is not an integer, round up, i.e., the next integer greater than i denotes
the position of the pth percentile.
(b) If i is an integer, the pth percentile is the average of the data values in positions
i and i+1.
Example (continued):
Please find the 40th percentile and the 26th percentile for the previous data.
[Solution]
Step 1: the data in ascending order are
1, 2, 3, 4, 5, 6, 7, 8, 9, 10.
Step 2:
For the 40th percentile, $i = \left(\dfrac{40}{100}\right) 10 = 4$.
For the 26th percentile, $i = \left(\dfrac{26}{100}\right) 10 = 2.6$.
Step 3:
$\text{40th percentile} = \dfrac{4 + 5}{2} = 4.5$
and
$\text{26th percentile} = 3$
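
The three-step procedure translates directly into code. The sketch below is one possible rendering, assuming p is an integer with 0 < p < 100; note that library routines (for example in NumPy) use other interpolation conventions by default and may return slightly different values.

```python
def percentile(data, p):
    """pth percentile by the three-step procedure; assumes integer 0 < p < 100."""
    ordered = sorted(data)                             # step 1: ascending order
    whole, remainder = divmod(p * len(ordered), 100)   # step 2: i = (p/100) n
    if remainder:
        # step 3(a): i is not an integer, round up to position whole + 1
        return ordered[whole]                          # zero-based index
    # step 3(b): i is an integer, average positions i and i + 1
    return (ordered[whole - 1] + ordered[whole]) / 2

scores = [1, 3, 5, 7, 9, 2, 4, 6, 8, 10]
print(percentile(scores, 40))  # 4.5
print(percentile(scores, 26))  # 3
```

Integer arithmetic (`divmod`) is used for the index so that the "is i an integer" test does not depend on floating-point rounding.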
(V) Quartiles:
When dividing data into 4 parts, the division points are referred to as the quartiles!!
That is,
$Q_1$ = the first quartile or 25th percentile
$Q_2$ = the second quartile or 50th percentile
$Q_3$ = the third quartile or 75th percentile
Example (continued):
Find the first quartile and the third quartile for the previous example.
Step 2:
For the first quartile, $i = \left(\dfrac{25}{100}\right) 10 = 2.5$.
For the third quartile, $i = \left(\dfrac{75}{100}\right) 10 = 7.5$.
Step 3:
$Q_1 = 3$
and
$Q_3 = 8$
Online Exercise:
Exercise 3.1.1
Exercise 3.1.2
3.2 Measure of Dispersion:
Example:
Suppose there are two factories producing the batteries. From each factory, 10
batteries are drawn to test for the lifetime (in hours). These lifetimes are:
Factory 1: 10.1, 9.9, 10.1, 9.9, 9.9, 10.1, 9.9, 10.1, 9.9, 10.1
Factory 2: 16, 5, 7, 14, 6, 15, 3, 13, 9, 12.
The mean lifetimes of the two factories are both 10. However, by looking at the data,
it is obvious that the batteries produced by factory 1 are much more reliable than the
ones by factory 2. This implies other measures for measuring the “dispersion” or
“variation” of the data are required.
(I) Range:
range=(largest value of the data)-(smallest value of the data).
Example (continue):
Range of lifetime data for factory 1=10.1-9.9=0.2
Range of lifetime data for factory 2=16-3=13
⇒ The range of battery lifetimes for factory 1 is much smaller than the one for
factory 2.
Note: the range is seldom used as the only measure of dispersion. The
range is highly influenced by an extremely large or an extremely small
data value.
(II) Interquartile Range:
The interquartile range is the difference between the third and the first quartiles. That is,
$IQR = Q_3 - Q_1$.
Example:
The first quartile and the third quartile for the data from factory 1 are 9.9 and 10.1,
respectively, and 6 and 14 for the data from factory 2. Therefore,
IQR (factory 1)=10.1-9.9=0.2
IQR (factory 2)=14-6=8.
⇒ The interquartile range of battery lifetimes for factory 1 is much smaller than the one
for factory 2.
(III) Variance and Standard Deviation:
Population deviation about the mean: $y_i - \mu$, $i = 1, 2, \ldots, N$
Sample deviation about the mean: $x_i - \bar{x}$, $i = 1, 2, \ldots, n$
Intuitively, the population deviation and the sample deviation measure how far each
data value is from the “center” of the data. The population variance and sample
variance are based on the sum of squares of the population deviations and sample deviations,
$\sigma^2 = \dfrac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}$
and
$s^2 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \dfrac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1}$,
respectively. The population standard deviation and sample standard deviation are the
square roots of the population variance and sample variance:
$\sigma = \sqrt{\sigma^2}$
and
$s = \sqrt{s^2}$,
respectively.
A large sample variance or sample standard deviation implies the data are “dispersed”
or highly varied.
Note:
$\sum_{i=1}^{n} (x_i - \bar{x}) = \sum_{i=1}^{n} x_i - n\bar{x} = \sum_{i=1}^{n} x_i - n \cdot \dfrac{\sum_{i=1}^{n} x_i}{n} = \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} x_i = 0$
Example:
$s^2(\text{factory 1}) = \dfrac{(10.1 - 10)^2 + (9.9 - 10)^2 + \cdots + (10.1 - 10)^2}{10 - 1} = 0.0111$
$s^2(\text{factory 2}) = \dfrac{(16 - 10)^2 + (5 - 10)^2 + \cdots + (12 - 10)^2}{10 - 1} = 21.1111$
⇒ The sample variance of battery lifetimes for factory 2 is about 1900 times larger than
the one for factory 1.
The sample standard deviations for the data from factories 1 and 2 are
$\sqrt{0.01111} = 0.1054$
and
$\sqrt{21.1111} = 4.5946$,
respectively.
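
The sample variance and standard deviation of the two battery samples can be recomputed as a check. The sketch below writes out the defining formula with the n - 1 divisor (the helper name `sample_variance` is illustrative) and compares it with `statistics.variance`, which uses the same divisor.

```python
import math
from statistics import variance

factory1 = [10.1, 9.9, 10.1, 9.9, 9.9, 10.1, 9.9, 10.1, 9.9, 10.1]
factory2 = [16, 5, 7, 14, 6, 15, 3, 13, 9, 12]

def sample_variance(xs):
    """Sum of squared deviations about the sample mean, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)

for name, xs in [("factory 1", factory1), ("factory 2", factory2)]:
    s2 = sample_variance(xs)
    print(f"{name}: s^2 = {s2:.4f}, s = {math.sqrt(s2):.4f}")
```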
(IV) Coefficient of Variation:
The coefficient of variation is another useful statistic for measuring the dispersion of
the data. The coefficient of variation is
$C.V. = \dfrac{s}{\bar{x}} \times 100$
The coefficient of variation is invariant with respect to the scale of the data. On the
other hand, the standard deviation is not scale-invariant. The following example
demonstrates the property.
Example:
In the battery data from factory 1, suppose the measurement is in minutes rather than
hours. Then, the data are 606, 594, 606, 594, 594, 606, 594, 606, 594, 606.
Thus, the standard deviation becomes 6.3245, which is 60 times larger than the value
0.1054 based on the original data measured in hours. However, whether the data are
measured in hours or in minutes, the coefficient of variation is
$C.V. = \dfrac{0.1054}{10} \times 100 = \dfrac{6.3245}{600} \times 100 = 1.054$.
Note: since the coefficient of variation is scale-invariant, it is very useful
for comparing the dispersion of different data. For example, in the
previous battery data, if the lifetime of the batteries from factory 1 and
factory 2 are measured in minutes and hours, respectively, the standard
deviation for factory 1, 6.3245, would be larger than for factory 2, 4.5946.
However, the coefficient of variation for factory 1, 1.054 is still much
smaller than the one for factory 2, 45.946.
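
The scale invariance of the coefficient of variation is easy to confirm numerically. This sketch uses the `statistics` module (whose `stdev` is the sample standard deviation with the n - 1 divisor); `coefficient_of_variation` is an illustrative helper name.

```python
from statistics import mean, stdev

def coefficient_of_variation(xs):
    """Sample standard deviation as a percentage of the sample mean."""
    return stdev(xs) / mean(xs) * 100

hours1 = [10.1, 9.9, 10.1, 9.9, 9.9, 10.1, 9.9, 10.1, 9.9, 10.1]
minutes1 = [60 * x for x in hours1]          # same lifetimes, in minutes
hours2 = [16, 5, 7, 14, 6, 15, 3, 13, 9, 12]

print(stdev(hours1), stdev(minutes1))        # standard deviation scales by 60
print(coefficient_of_variation(hours1))      # about 1.054
print(coefficient_of_variation(minutes1))    # unchanged, up to rounding
print(coefficient_of_variation(hours2))      # about 45.95
```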
Online Exercise:
Exercise 3.2.1
Exercise 3.2.2
3.3 Exploratory Data Analysis:
(I) Five-Number Summary:
The five-number summary provides important information about both the location
and the dispersion of the data. It consists of
• Smallest value
• First quartile
• Median
• Third quartile
• Largest value
Example (continued):
The original data (in hours) are:
Factory 1: 10.1, 9.9, 10.1, 9.9, 9.9, 10.1, 9.9, 10.1, 9.9, 10.1
Factory 2: 16, 5, 7, 14, 6, 15, 3, 13, 9, 12.
The five-number summary for the data from both factories is

            Smallest   Q1    Median   Q3     Largest
Factory 1   9.9        9.9   10       10.1   10.1
Factory 2   3          6     10.5     14     16
(II) Box Plot:
The box plot is a commonly used graphical method that provides information about both
the location and the dispersion of the data. It is especially insightful when the interest
is in comparing data from different populations. The box plot is
[Box plot diagram: a box from Q1 to Q3 (length IQR), with the lower limit 1.5 IQR below Q1 and the upper limit 1.5 IQR above Q3]
Note: data outside the upper limit and lower limit are called outliers.
Example (continued):
The box plot for the data from the two factories is
[Box plots for factory1 and factory2; vertical axis from 4 to 16]
Online Exercise:
Exercise 3.3.1
3.4 Measures of Relative Location:
The z-score is a quantity that can be used to measure the relative location of the data.
The z-score, referred to as the standardized value for observation i, is defined as
$z_i = \dfrac{x_i - \bar{x}}{s}$.
Note: $z_i$ is the number of standard deviations $x_i$ is from the mean $\bar{x}$.
Example (continued):
Factory 1:
$x_i$   10.1    9.9     10.1    9.9     9.9     10.1    9.9     10.1    9.9     10.1
$z_i$   0.948   -0.948  0.948   -0.948  -0.948  0.948   -0.948  0.948   -0.948  0.948
Factory 2:
$x_i$   16      5       7       14      6       15      3       13      9       12
$z_i$   1.305   -1.088  -0.652  0.870   -0.870  1.088   -1.523  0.652   -0.217  0.435
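
The z-scores can be recomputed from the definition. The sketch below uses the sample standard deviation from the `statistics` module; the values it prints agree with the tables up to rounding in the last digit, and `z_scores` is an illustrative helper name.

```python
from statistics import mean, stdev

def z_scores(xs):
    """Standardized values (x_i - x_bar) / s for every observation."""
    x_bar, s = mean(xs), stdev(xs)
    return [(x - x_bar) / s for x in xs]

factory1 = [10.1, 9.9, 10.1, 9.9, 9.9, 10.1, 9.9, 10.1, 9.9, 10.1]
factory2 = [16, 5, 7, 14, 6, 15, 3, 13, 9, 12]

z1, z2 = z_scores(factory1), z_scores(factory2)
print([round(z, 3) for z in z2])
# Share of each sample lying within one standard deviation of its mean.
print(sum(abs(z) < 1 for z in z1) / len(z1), sum(abs(z) < 1 for z in z2) / len(z2))
```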
There are two results related to the location of the data. The first result is Chebyshev’s
theorem.
Chebyshev’s Theorem:
For any population, within k standard deviations of the mean, there are at least
$\left(1 - \dfrac{1}{k^2}\right) \times 100\%$
of the data, where k is any value greater than 1.
Based on Chebyshev’s theorem, for any data set, it can be roughly estimated that at
least $\left(1 - \dfrac{1}{k^2}\right) \times 100\%$ of the data are within k sample standard deviations of the mean.
Example (continued):
For k=2, based on Chebyshev’s theorem, at least
$\left(1 - \dfrac{1}{2^2}\right) \times 100\% = 75\%$
of the data are estimated to be within 2 standard deviations of the mean. For the data from
factory 1 and factory 2, all the data are within 2 sample standard deviations of the mean, i.e., all
the data have z-scores with absolute values smaller than 2.
The second result is based on the empirical rule. The rule is especially applicable when
the data have a bell-shaped distribution. The empirical rule is
• Approximately 68% of the data will be within one standard deviation of the
mean ($-1 \le z_i \le 1$).
• Approximately 95% of the data will be within two standard deviations of the
mean ($-2 \le z_i \le 2$).
• Almost all of the data will be within three standard deviations of the mean
($-3 \le z_i \le 3$).
Example (continue):
For the data from factory 1, all the data are within one standard deviation of the mean
while 60% of the data are within one standard deviation of the mean for the data from
factory 2. The result based on the empirical rule is not applicable to the two data
sets since they are not bell-shaped. However, consider the following data:
2.11, -0.83, -1.43, 1.35, -0.42, -0.69, -0.65, -0.29, -0.54, 1.92,
0.53, -0.27, 1.7, 0.88, 1.25, 0.32, -2.18, 0.68, 0.85, 0.34
The histogram of the above data indicates the data are roughly bell-shaped.
[Histogram of the above data; horizontal axis “rn1” from -2 to 2, vertical axis from 0 to 4]
Approximately 65% of the data are within one standard deviation of the mean, which
is similar to the result based on the empirical rule (68%).
Detecting Outliers:
To identify the outliers, we can use either the box-plot or the z-score. The outliers
identified by the box-plot are those data outside the upper limit or lower limit while
the outliers identified by z-score are those with z-score smaller than –3 or greater
than 3.
Note: the outliers identified by box-plot might be different from
those identified by using z-score .
Online Exercise:
Exercise 3.4.1
3.5 The Weighted Mean and Grouped Data:
Weighted Mean:
$\bar{x}_w = \dfrac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$.
Note: when data values vary in importance, the analyst must choose the
weight that best reflects the importance of each data value in the
determination of the mean.
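
As a small illustration of the weighted mean formula, the sketch below uses the five raw-material purchases of Example 1 of this section, with the cost per pound as the data value and the number of pounds as the weight. The function name `weighted_mean` is an illustrative choice.

```python
def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights w_i."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

costs = [3.00, 3.40, 2.80, 2.90, 3.25]    # cost per pound ($)
pounds = [1200, 500, 2750, 1000, 800]     # number of pounds (weights)

print(round(weighted_mean(costs, pounds), 2))   # weighted by pounds bought
print(round(sum(costs) / len(costs), 2))        # unweighted mean, for comparison
```

The unweighted mean treats a 500-pound purchase the same as a 2750-pound purchase, which is why the two results differ.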
Example 1:
The following are 5 purchases of a raw material over the past 3 months.

Purchase   Cost per Pound ($)   Number of Pounds
1          3.00                 1200
2          3.40                 500
3          2.80                 2750
4          2.90                 1000
5          3.25                 800

Find the mean cost per pound.
[solutions:]
$w_1 = 1200, w_2 = 500, w_3 = 2750, w_4 = 1000, w_5 = 800$
and
$x_1 = 3.00, x_2 = 3.40, x_3 = 2.80, x_4 = 2.90, x_5 = 3.25$.
Then,
$\bar{x}_w = \dfrac{\sum_{i=1}^{5} w_i x_i}{\sum_{i=1}^{5} w_i} = \dfrac{1200 \times 3.00 + 500 \times 3.40 + 2750 \times 2.80 + 1000 \times 2.90 + 800 \times 3.25}{1200 + 500 + 2750 + 1000 + 800} = 2.96$
Population Mean for Grouped Data:
$\mu_g = \dfrac{\sum_{k=1}^{m} F_k M_k}{\sum_{k=1}^{m} F_k} = \dfrac{\sum_{k=1}^{m} F_k M_k}{N}$,
where
$M_k$: the midpoint for class k,
$F_k$: the frequency for class k in the population,
$N = \sum_{k=1}^{m} F_k$: the population size.
Sample Mean for Grouped Data:
$\bar{x}_g = \dfrac{\sum_{k=1}^{m} f_k M_k}{\sum_{k=1}^{m} f_k} = \dfrac{\sum_{k=1}^{m} f_k M_k}{n}$,
where
$f_k$: the frequency for class k in the sample,
$n = \sum_{k=1}^{m} f_k$: the sample size.
Population Variance for Grouped Data:
$\sigma_g^2 = \dfrac{\sum_{k=1}^{m} F_k (M_k - \mu_g)^2}{N}$
Sample Variance for Grouped Data:
$s_g^2 = \dfrac{\sum_{k=1}^{m} f_k (M_k - \bar{x}_g)^2}{n - 1} = \dfrac{\sum_{k=1}^{m} f_k M_k^2 - n\bar{x}_g^2}{n - 1}$
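
The grouped-data formulas for a sample can be sketched in a few lines of Python. The example data are the classes 10-14, 15-19, 20-24, 25-29 and 30-34 with frequencies 4, 8, 5, 2 and 1 from section 2.2, with class midpoints 12, 17, 22, 27 and 32; the function name is an illustrative choice.

```python
def grouped_mean_and_variance(midpoints, freqs):
    """Sample mean and sample variance computed from class midpoints and frequencies."""
    n = sum(freqs)
    mean_g = sum(f * m for m, f in zip(midpoints, freqs)) / n
    var_g = sum(f * (m - mean_g) ** 2 for m, f in zip(midpoints, freqs)) / (n - 1)
    return mean_g, var_g

midpoints = [12, 17, 22, 27, 32]
freqs = [4, 8, 5, 2, 1]

mean_g, var_g = grouped_mean_and_variance(midpoints, freqs)
print(mean_g, var_g)
```

Because every value in a class is replaced by the class midpoint, these results only approximate the mean and variance of the underlying raw data.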
Example 2:
The following is the frequency distribution of the time in days required to complete
year-end audits:

Audit Time (days)   Frequency
10-14               4
15-19               8
20-24               5
25-29               2
30-34               1

What are the mean and the variance of the audit time?
[solutions:]
$f_1 = 4, f_2 = 8, f_3 = 5, f_4 = 2, f_5 = 1$.
$n = f_1 + f_2 + f_3 + f_4 + f_5 = 4 + 8 + 5 + 2 + 1 = 20$
and
$M_1 = 12, M_2 = 17, M_3 = 22, M_4 = 27, M_5 = 32$.
Thus,
$\bar{x}_g = \dfrac{\sum_{i=1}^{5} f_i M_i}{\sum_{i=1}^{5} f_i} = \dfrac{4 \times 12 + 8 \times 17 + 5 \times 22 + 2 \times 27 + 1 \times 32}{4 + 8 + 5 + 2 + 1} = 19$
and
$s_g^2 = \dfrac{\sum_{i=1}^{5} f_i (M_i - \bar{x}_g)^2}{n - 1} = \dfrac{4(12 - 19)^2 + 8(17 - 19)^2 + 5(22 - 19)^2 + 2(27 - 19)^2 + 1(32 - 19)^2}{20 - 1} = 30$
Online Exercise:
Exercise 3.5.1
Chapter 4 Association Between Two Variables
In this chapter, we introduce several methods to measure the association between two variables.
They are:
• Crosstabulations and scatter diagrams
• Numerical measures of association
4.1 Crosstabulations and Scatter Diagrams:
The crosstabulation (table) and the scatter diagram (graph) can help us understand the
relationship between two variables.
1. Crosstabulations
Example:
Objective: explore the association between quality and price for
restaurants in the Los Angeles area.
The following table is the crosstabulation of the quality rating (good, very good
and excellent) and the meal price ($10-19, $20-29, $30-39, and $40-49) data
collected for a sample of 300 restaurants located in the Los Angeles area.

                 Meal Price
Quality Rating   $10-19   $20-29   $30-39   $40-49   Total
Good             42       40       2        0        84
Very Good        34       64       46       6        150
Excellent        2        14       28       22       66
Total            78       118      76       28       300

The above crosstabulation provides insight about the relationship between the two variables,
quality rating and meal price. Higher meal prices appear to be associated
with the higher quality restaurants and lower meal prices appear to be associated
with the lower quality restaurants. For example, for the most expensive restaurants
($40-49), none is rated the lowest quality but most of them (22 of 28) are
rated the highest quality. On the other hand, for the least expensive restaurants ($10-19),
only 2 are rated the highest quality (2/78 = 2.56%) but over half
of them (42 of 78) are rated the lowest quality.
2. Scatter Diagram
Suppose we have the following scatter diagrams for the weights and heights of the
students:
[Three scatter diagrams (left, middle, right) titled “Scatter Diagram of Weight v.s. Height”, each with weight (50 to 80) on the horizontal axis and height on the vertical axis]
The left scatter diagram indicates a positive relationship between weight and
height while the right scatter diagram implies a negative relationship between
the two variables. The middle scatter diagram shows that there is no apparent
relationship between weight and height.
Online Exercise:
Exercise 4.1.1
4.2 Numerical Measures of Association:
There are several numerical measures of association. We first introduce the covariance
of two variables.
(I) Covariance:
Suppose we have two populations,
population 1: $y_1, y_2, \ldots, y_N$ and population 2: $w_1, w_2, \ldots, w_N$.
Also, let
sample 1: $x_1, x_2, \ldots, x_n$ and sample 2: $z_1, z_2, \ldots, z_n$
be drawn from population 1 and population 2, respectively.
Let $\mu_y$ and $\mu_w$ be the population means of populations 1 and 2, respectively.
Let
$\bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{n}$ and $\bar{z} = \dfrac{\sum_{i=1}^{n} z_i}{n}$
be the sample means of samples 1 and 2, respectively.
Then, the population covariance is
$\sigma_{yw} = \dfrac{\sum_{i=1}^{N} (y_i - \mu_y)(w_i - \mu_w)}{N}$,
while the sample covariance is
$s_{xz} = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(z_i - \bar{z})}{n - 1} = \dfrac{\sum_{i=1}^{n} x_i z_i - n\bar{x}\bar{z}}{n - 1}$.
Intuitively, $s_{xz}$ would be large and positive when the observations in the two
samples tend to be larger or smaller than their sample means simultaneously. That is, the
observations are positively correlated. On the other hand, $s_{xz}$ would be large in
magnitude but negative when the observations in one sample are larger than their sample
mean while the ones in the other sample are smaller than theirs.
In that case, the observations are negatively correlated. Finally, $s_{xz}$ would be close
to 0 when the observations in one sample are larger than their sample mean while
the ones in the other sample are sometimes larger and sometimes smaller than their
sample mean, i.e., the observations in the two samples are not correlated.
Example:
Let $x_i$ be the total money spent on advertisement for some product and $z_i$ be the
sales volume (1 unit = 1000 packs).

$x_i$                                  2    5    1    3    4    1    5    3    4    2
$z_i$                                  50   57   41   54   54   38   63   48   59   46
$(x_i - \bar{x})(z_i - \bar{z})$       1    12   20   0    3    26   24   0    8    5

$s_{xz} = \dfrac{\sum_{i=1}^{10} (x_i - \bar{x})(z_i - \bar{z})}{10 - 1} = \dfrac{99}{10 - 1} = 11$.
Note: $s_{xz}$ is not scale-invariant. For example, in the above example, if
the sales volume is measured with 1 unit = 1 pack, then $z_i$ would be 50000, 57000,
41000, 54000, 54000, 38000, 63000, 48000, 59000, 46000. Thus, $s_{xz}$ would be 11000,
which is 1000 times larger than the original one. This is not plausible as a measure of
correlation, since the relationship between the total money spent on advertisement and
the sales volume should not change when the measurement unit changes. The quantity
introduced next is scale-invariant and can be used to measure the
correlation of two populations.
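
The sample covariance of the advertisement example, and its dependence on the measurement unit, can be checked with a short sketch. The function name `sample_covariance` and the variable names are illustrative; the rescaling factor 1000 corresponds to converting units of 1000 packs into single packs.

```python
def sample_covariance(xs, zs):
    """Sum of products of deviations about the sample means, divided by n - 1."""
    n = len(xs)
    x_bar, z_bar = sum(xs) / n, sum(zs) / n
    return sum((x - x_bar) * (z - z_bar) for x, z in zip(xs, zs)) / (n - 1)

ads = [2, 5, 1, 3, 4, 1, 5, 3, 4, 2]
sales = [50, 57, 41, 54, 54, 38, 63, 48, 59, 46]   # in units of 1000 packs
sales_in_packs = [1000 * z for z in sales]

print(sample_covariance(ads, sales))            # 11.0
print(sample_covariance(ads, sales_in_packs))   # 1000 times larger
```

Multiplying one variable by a constant multiplies the covariance by the same constant, which is the sense in which the covariance is not scale-invariant.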
(II) Correlation Coefficient:
Let
$\sigma_y$: population standard deviation for $y_1, y_2, \ldots, y_N$
$\sigma_w$: population standard deviation for $w_1, w_2, \ldots, w_N$
$s_x$: sample standard deviation for $x_1, x_2, \ldots, x_n$
$s_z$: sample standard deviation for $z_1, z_2, \ldots, z_n$.
Then, the population correlation coefficient is
$\rho_{yw} = \dfrac{\sigma_{yw}}{\sigma_y \sigma_w}$,
while the sample correlation coefficient is
$r_{xz} = \dfrac{s_{xz}}{s_x s_z}$.
Note: $|\rho_{yw}| \le 1$ and $|r_{xz}| \le 1$.
Example (continued):
$s_x = \sqrt{\dfrac{\sum_{i=1}^{10} (x_i - \bar{x})^2}{10 - 1}} = 1.4907$
and
$s_z = \sqrt{\dfrac{\sum_{i=1}^{10} (z_i - \bar{z})^2}{10 - 1}} = 7.9303$
Then,
$r_{xz} = \dfrac{s_{xz}}{s_x s_z} = \dfrac{\sum_{i=1}^{10} (x_i - \bar{x})(z_i - \bar{z})}{\sqrt{\sum_{i=1}^{10} (x_i - \bar{x})^2 \sum_{i=1}^{10} (z_i - \bar{z})^2}} = 0.93$.
Note: $r_{xz}$ is scale-invariant. For example, even if the sales volume is
measured with 1 unit = 1 pack, the value of $r_{xz}$ is still the same, 0.93.
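
The sample correlation coefficient and its scale invariance can be verified numerically. The sketch below uses the form of $r_{xz}$ in which the common divisor n - 1 cancels; `sample_correlation` is an illustrative name.

```python
import math

def sample_correlation(xs, zs):
    """Sample covariance divided by the product of the sample standard deviations."""
    n = len(xs)
    x_bar, z_bar = sum(xs) / n, sum(zs) / n
    sxz = sum((x - x_bar) * (z - z_bar) for x, z in zip(xs, zs))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    szz = sum((z - z_bar) ** 2 for z in zs)
    return sxz / math.sqrt(sxx * szz)   # the n - 1 factors cancel

ads = [2, 5, 1, 3, 4, 1, 5, 3, 4, 2]
sales = [50, 57, 41, 54, 54, 38, 63, 48, 59, 46]

print(round(sample_correlation(ads, sales), 2))                      # 0.93
print(round(sample_correlation(ads, [1000 * z for z in sales]), 2))  # unchanged
print(sample_correlation([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))         # perfect linear: 1.0
```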
Example:
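Let $z_i = 2x_i$, $i = 1, 2, 3, 4, 5$.

$x_i$   1   2   3   4   5
$z_i$   2   4   6   8   10

Then,
$\bar{x} = 3$, $\bar{z} = 6$, $s_x = \sqrt{\dfrac{\sum_{i=1}^{5} (x_i - \bar{x})^2}{5 - 1}} = \sqrt{\dfrac{5}{2}}$, $s_z = \sqrt{\dfrac{\sum_{i=1}^{5} (z_i - \bar{z})^2}{5 - 1}} = \sqrt{10}$,
$s_{xz} = \dfrac{\sum_{i=1}^{5} (x_i - \bar{x})(z_i - \bar{z})}{5 - 1} = 5$.
Thus,
$r_{xz} = \dfrac{s_{xz}}{s_x s_z} = \dfrac{5}{\sqrt{\dfrac{5}{2}} \sqrt{10}} = 1$.
Note: when there is a perfect positive linear relationship between
variables x and z, then $r_{xz} = 1$. A value of $r_{xz}$ close to 1 might indicate a positive linear
relationship.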
Online Exercise:
Exercise 4.2.1
Chapter 5 Introduction to Probability
5.1. Experiments, Counting Rules, and Probabilities
Experiment: any process that generates well-defined outcomes.
Example:

Experiment             Outcomes
Toss a coin            Head, Tail
Roll a die             1, 2, 3, 4, 5, 6
Play a football game   Win, Lose, Tie
Rain tomorrow          Rain, No rain

Sample Space: the set of all experimental outcomes, denoted by S.
Example:

Experiment             Sample Space
Toss a coin            S={Head, Tail}
Roll a die             S={1, 2, 3, 4, 5, 6}
Play a football game   S={Win, Lose, Tie}
Rain tomorrow          S={Rain, No rain}
Counting Rules: the rules for counting the number of the
experimental outcomes.
We have the following counting rules:
• Multiple Step Experiment
• Permutations
• Combinations
1. Multiple Step Experiment:
Example:

Step 1 (throw die)   Step 2 (throw coin)   Experimental Outcomes
1                    T, H                  (1,T), (1,H)
2                    T, H                  (2,T), (2,H)
3                    T, H                  (3,T), (3,H)
4                    T, H                  (4,T), (4,H)
5                    T, H                  (5,T), (5,H)
6                    T, H                  (6,T), (6,H)

⇒ S = {(1,T), (1,H), (2,T), (2,H), (3,T), (3,H), (4,T), (4,H), (5,T), (5,H), (6,T), (6,H)}
⇒ The total number of experimental outcomes = 12 = 6 × 2
Counting rule for multiple step experiments:
If an experiment has k steps with $n_1$ possible
outcomes on the first step, $n_2$ possible outcomes on the second step,
and so on, then the total number of experimental outcomes is given
by $n_1 \times n_2 \times \cdots \times n_k$.
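
The sample space of the two-step experiment can be enumerated with `itertools.product`, which forms every combination of one outcome per step. This is a small sketch; the variable names are illustrative.

```python
from itertools import product

die = [1, 2, 3, 4, 5, 6]   # step 1: throw a die
coin = ["T", "H"]          # step 2: throw a coin

sample_space = list(product(die, coin))
print(sample_space)
print(len(sample_space), "=", len(die), "x", len(coin))
```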
2. Permutations:
n objects are to be selected from a set of N objects, where the order is
important.
Example:
Suppose we take 3 balls from 5 balls, 1, 2, 3, 4 and 5. Then,
1
○
2
○
3
○
 two permutations (different orders)
2
○
1
○
3
○
35
Example:
With N=5 balls and n=3 positions, there are 5 choices for the first position,
4 choices for the second position and 3 choices for the third position, so the
number of permutations is N•(N-1)•……•(N-n+1) = 5•4•3 = 60.
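Example:
In general, with N objects and n positions, there are N choices for the first
position, N-1 choices for the second position, and so on, down to N-(n-1) choices
for the n-th position, so the number of permutations is
$N \cdot (N-1) \cdot (N-2) \cdots [N-(n-1)] = \frac{N!}{(N-n)!}$.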
Counting rule for permutation:
When n objects are taken from N objects, the total number of
permutations is given by
$P_n^N = \frac{N!}{(N-n)!} = (N-n+1)(N-n+2)\cdots N$,
where $N! = 1 \cdot 2 \cdot 3 \cdots N$ and $0! = 1$.
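The permutation count can be compared against a brute-force listing. This is a small sketch under the assumption that Python 3.8 or later is available (for `math.perm`); the helper name `count_permutations` is hypothetical.

```python
import math
from itertools import permutations

def count_permutations(N, n):
    """Number of ordered selections of n objects from N: N! / (N - n)!."""
    return math.factorial(N) // math.factorial(N - n)

balls = [1, 2, 3, 4, 5]
ordered = list(permutations(balls, 3))  # every ordered choice of 3 balls

print(count_permutations(5, 3), len(ordered), math.perm(5, 3))
```

Both (1, 2, 3) and (2, 1, 3) appear in `ordered`, since order matters for permutations.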
3. Combinations:
n objects are to be selected from a set of N objects, where the
order is not important.
Example:
The six orderings (1, 2, 3), (1, 3, 2), (2, 1, 3), (2, 3, 1), (3, 1, 2) and (3, 2, 1)
of the balls 1, 2 and 3 are
1 combination, but 6 permutations.
Example:
Taking 2 balls from 5 balls, there are $P_2^5 = 5 \cdot 4 = 20$ permutations.
Each pair of balls appears in 2! = 2 orders, for instance (1, 2) and (2, 1), so
$C_2^5 = \frac{P_2^5}{2!} = \frac{20}{2} = 10$.
⇒ 10 combinations, 20 permutations.
Example:
Taking 3 balls from 5 balls, there are $P_3^5 = 5 \cdot 4 \cdot 3 = 60$ permutations.
Each set of 3 balls, for instance {3, 4, 5}, appears in 3! = 6 orders,
(3, 4, 5), (3, 5, 4), (4, 3, 5), (4, 5, 3), (5, 3, 4) and (5, 4, 3), which count as
1 combination. Hence $C_3^5 \cdot 3! = P_3^5$ and
$C_3^5 = \frac{P_3^5}{3!} = \frac{60}{6} = 10$,
a total of 10 combinations.
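Example:
In general, taking n objects from N objects gives $P_n^N$ permutations. The n selected
objects can be arranged in
$P_n^n = \frac{n!}{(n-n)!} = \frac{n!}{0!} = n!$
orders (n choices for the first position, n-1 for the second, and so on, down to 1 for
the last), and these n! permutations count as 1 combination. Hence
$C_n^N \cdot n! = P_n^N$, that is, $C_n^N = \frac{P_n^N}{n!}$.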
Counting rule for combination:
When n objects are taken from N objects, the total number of
combinations is given by
$C_n^N = \binom{N}{n} = \frac{N!}{n!(N-n)!} = \frac{P_n^N}{n!}$.
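The relation between combinations and permutations can be verified on the 5-ball examples. A minimal sketch, assuming Python 3.8 or later; `count_combinations` is a hypothetical helper name.

```python
import math
from itertools import combinations

def count_combinations(N, n):
    """Number of unordered selections of n objects from N: N! / (n! (N - n)!)."""
    return math.factorial(N) // (math.factorial(n) * math.factorial(N - n))

balls = [1, 2, 3, 4, 5]
pairs = list(combinations(balls, 2))    # unordered pairs of balls
triples = list(combinations(balls, 3))  # unordered triples of balls

print(count_combinations(5, 2), len(pairs))
print(count_combinations(5, 3), len(triples))
# Each combination of 3 balls corresponds to 3! permutations.
print(math.perm(5, 3) // math.factorial(3))
```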
Online Exercise:
Exercise 5.1.1
Exercise 5.1.2
5.2. Events and Their Probability
Modern probability theory: a probability value that expresses our degree of belief that
the experimental outcome will occur is specified.
Basic requirement for assigning probabilities:
1. Let $e_i$ denote the i-th experimental outcome and $P(e_i)$ be its
probability. Then
$0 \le P(e_i) \le 1$.
2. If there are n experimental outcomes, $e_1, e_2, \ldots, e_n$, then
$P(e_1) + P(e_2) + \cdots + P(e_n) = 1$.
Example:
Roll a fair die. Let $e_i$ be the outcome that the point is i. Then,
$0 \le P(e_1) = P(e_2) = P(e_3) = P(e_4) = P(e_5) = P(e_6) = \frac{1}{6} \le 1$.
Event: an event is a collection (set) of sample points
(experimental outcomes).
Example:
$E_1$ = the event that the points are even.
$E_2$ = the event that the points are odd.
⇒ $E_1 = \{2,4,6\}$ and $E_2 = \{1,3,5\}$.
Probability of an event: the probability of any event
is equal to the sum of the probabilities of the sample points in the event.
Example:
$P(E_1) = P(\{2,4,6\}) = P(e_2) + P(e_4) + P(e_6) = \frac{1}{6} + \frac{1}{6} + \frac{1}{6} = \frac{1}{2}$.
Note: $P(S) = 1$.
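The rule for the probability of an event can be sketched in a few lines for the fair-die example. The dictionary `p` and the function `prob` are illustrative names, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

# Fair die: each sample point e_i has probability 1/6.
p = {point: Fraction(1, 6) for point in range(1, 7)}

def prob(event):
    """Probability of an event = sum of the probabilities of its sample points."""
    return sum(p[point] for point in event)

E1 = {2, 4, 6}   # the points are even
S = set(p)       # the whole sample space

print(prob(E1), prob(S))
```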
Online Exercise:
Exercise 5.2.1
5.3. Some Basic Relationships of Probability
$A^c$: the complement of A, the event containing all sample points that are not in A.
$A \cup B$: the union of A and B, the event containing all sample points belonging to A
or B or both.
$A \cap B$: the intersection of A and B, the event containing all sample points
belonging to both A and B.
Example:
$E_1 = \{2,4,6\}$, $E_2 = \{1,3,5\}$ and $E_3 = \{1,2,3\}$ = {points $\le 3$}. Then,
⇒ $E_1^c = E_2$.
⇒ $E_1 \cup E_3$ = {even points or points $\le 3$ or both} = $\{1,2,3,4,6\}$.
⇒ $E_1 \cap E_3$ = {even points and points $\le 3$} = $\{2\}$.
Note: two events having no sample points in common are called
mutually exclusive events. That is, if A and B are mutually
exclusive events, then $A \cap B = \emptyset$ (the empty event).
Example:
$E_1 = \{2,4,6\}$, $E_2 = \{1,3,5\}$
⇒ $E_1$ and $E_2$ are mutually exclusive events.
Results:
1. $P(A^c) = 1 - P(A)$.
2. If A and B are mutually exclusive events, then
$P(A \cap B) = 0$ and $P(A \cup B) = P(A) + P(B)$.
3. (addition law) For any two events A and B,
$P(A \cup B) = P(A) + P(B) - P(A \cap B)$.
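[Intuition of addition law]:
Divide the Venn diagram of A and B into three regions: Ⅰ (the part of A outside B),
Ⅱ (the part common to A and B) and Ⅲ (the part of B outside A), so that
A = Ⅰ ∪ Ⅱ, B = Ⅱ ∪ Ⅲ, $A \cap B$ = Ⅱ and $A \cup B$ = Ⅰ ∪ Ⅱ ∪ Ⅲ. Then,
$P(A \cup B) = P(I) + P(II) + P(III)$
$= [P(I) + P(II)] + [P(II) + P(III)] - P(II)$
$= P(I \cup II) + P(II \cup III) - P(A \cap B)$
$= P(A) + P(B) - P(A \cap B)$.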
Example:
1. $P(E_2) = P(\{1,3,5\}) = P(\{2,4,6\}^c) = P(E_1^c) = 1 - P(E_1) = 1 - \frac{1}{2} = \frac{1}{2}$.
2. $P(E_1 \cap E_2) = 0$, $P(E_1 \cup E_2) = P(E_1) + P(E_2) = \frac{1}{2} + \frac{1}{2} = 1$.
3. $P(E_1 \cup E_3) = P(\{1,2,3,4,6\}) = \frac{5}{6}$. We can also use the addition law:
$P(E_1 \cup E_3) = P(E_1) + P(E_3) - P(E_1 \cap E_3) = P(\{2,4,6\}) + P(\{1,2,3\}) - P(\{2\}) = \frac{1}{2} + \frac{1}{2} - \frac{1}{6} = \frac{5}{6}$.
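The three results can be checked numerically on the die events with Python sets, where `|` is union, `&` is intersection and `S - A` is the complement. This is a sketch only; `prob` assumes equally likely sample points, as for a fair die.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Probability of an event for a fair die (equally likely sample points)."""
    return Fraction(len(event), len(S))

E1, E2, E3 = {2, 4, 6}, {1, 3, 5}, {1, 2, 3}

lhs = prob(E1 | E3)                           # P(E1 ∪ E3)
rhs = prob(E1) + prob(E3) - prob(E1 & E3)     # addition law
print(lhs, rhs)
print(prob(S - E1), 1 - prob(E1))             # complement rule
print(prob(E1 & E2), prob(E1 | E2))           # mutually exclusive events
```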
Online Exercise:
Exercise 5.3.1
Exercise 5.3.2
5.4. Conditional Probability
A|B: event A given the condition that event B has
occurred.
Example:
{2}| $E_1$: point 2 occurs given that the point is known to be even.
$P(A|B)$: the conditional probability of A given B (the chance that event A
occurs, given that event B has occurred).
Formula of the conditional probability:
$P(A|B) = \frac{P(A \cap B)}{P(B)}$
and
$P(B|A) = \frac{P(A \cap B)}{P(A)}$.
Example:
$P(\{2\}|E_1) = \frac{P(E_1 \cap \{2\})}{P(E_1)} = \frac{P(\{2\})}{P(E_1)} = \frac{1/6}{1/2} = \frac{1}{3}$.
Note: $P(A|B) + P(A^c|B) = 1$.
Note: $P(A \cap B) = P(B)P(A|B) = P(A)P(B|A)$.
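The conditional probability formula can be applied mechanically to the die example. A minimal sketch; `cond_prob` is a hypothetical helper and assumes P(B) > 0.

```python
from fractions import Fraction

S = set(range(1, 7))

def prob(event):
    """Probability of an event for a fair die."""
    return Fraction(len(event), len(S))

def cond_prob(A, B):
    """P(A | B) = P(A ∩ B) / P(B); only defined when P(B) > 0."""
    return prob(A & B) / prob(B)

E1 = {2, 4, 6}
print(cond_prob({2}, E1))                             # point 2 given an even point
print(cond_prob({2}, E1) + cond_prob(S - {2}, E1))    # P(A|B) + P(A^c|B)
```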
Independent Events:
$P(A|B) = P(A)$ or $P(B|A) = P(B)$
⇒ A and B are independent events.
Dependent Events:
$P(A|B) \ne P(A)$ or $P(B|A) \ne P(B)$
⇒ A and B are dependent events.
Intuitively, if events A and B are independent, then the chance of event A occurring is
the same no matter whether event B has occurred. That is, event A occurring is
“independent” of event B occurring. On the other hand, if events A and B are
dependent, then the chance of event A occurring given that event B has occurred will
be different from the one with event B not occurring.
Example:
A: the event of a police officer getting a promotion.
M: the event of a police officer being a man.
W: the event of a police officer being a woman.
⇒ $P(A) = 0.27$, $P(A|M) = 0.3$, $P(A|W) = 0.15$.
The above result implies that the chance of a promotion for a male candidate
is twice that for a female candidate. In addition, the chance of
a promotion for a female candidate (0.15) is much lower than the
overall promotion rate (0.27). That is, the promotion event A is "dependent" on the
gender event M or W.
⇒ A promotion is related to gender.
Note: $P(A \cap B) = P(A)P(B)$ when events A and B are independent.
Online Exercise:
Exercise 5.4.1
Exercise 5.4.2
5.5. Bayes’ Theorem
Example 1:
B: test (positive)
A: no AIDS
$A^c$: AIDS
$B^c$: test (negative)
From past experience and records, we know
$P(A) = 0.99$, $P(B|A) = 0.03$, $P(B|A^c) = 0.98$.
That is, we know the probability of a patient having no AIDS, the conditional
probability of a positive test given no AIDS (wrong diagnosis), and the
conditional probability of a positive test given AIDS (correct diagnosis).
Our objective is to find $P(A|B)$, i.e., the probability that a patient
does not have AIDS even though the patient tested positive.
Example 2:
$A_1$: the finance of the company being good.
$A_2$: the finance of the company being O.K.
$A_3$: the finance of the company being bad.
$B_1$: good finance assessment for the company.
$B_2$: O.K. finance assessment for the company.
$B_3$: bad finance assessment for the company.
From the past records, we know
$P(A_1) = 0.5$, $P(A_2) = 0.2$, $P(A_3) = 0.3$,
$P(B_1|A_1) = 0.9$, $P(B_1|A_2) = 0.05$, $P(B_1|A_3) = 0.05$.
That is, we know the chances of the different finance situations of the company and
the conditional probabilities of the different assessments given the
finance of the company. For example, $P(B_1|A_1) = 0.9$ indicates that a good finance
year of the company is predicted correctly by the finance assessment with a 90%
chance.
Our objective is to obtain the probability $P(A_1|B_1)$, i.e., the conditional probability
that the finance of the company will be good in the coming year given a good
finance assessment for the company this year.
To find the required probability in the above two examples, the following Bayes’s
theorem can be used.
Bayes’s Theorem (two events):
$P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{P(A)P(B|A)}{P(A)P(B|A) + P(A^c)P(B|A^c)}$
[Derivation of Bayes’s theorem (two events)]:
The sample space is divided into A and $A^c$, so the event B is split into the two
disjoint parts $B \cap A$ and $B \cap A^c$.
We want to know $P(A|B) = \frac{P(B \cap A)}{P(B)}$. Since
$P(B \cap A) = P(A)P(B|A)$,
and
$P(B) = P(B \cap A) + P(B \cap A^c) = P(A)P(B|A) + P(A^c)P(B|A^c)$,
thus,
$P(A|B) = \frac{P(B \cap A)}{P(B)} = \frac{P(B \cap A)}{P(B \cap A) + P(B \cap A^c)} = \frac{P(A)P(B|A)}{P(A)P(B|A) + P(A^c)P(B|A^c)}$.
Example 1:
$P(A^c) = 1 - P(A) = 1 - 0.99 = 0.01$. Then, by Bayes’s theorem,
$P(A|B) = \frac{P(A)P(B|A)}{P(A)P(B|A) + P(A^c)P(B|A^c)} = \frac{0.99 \times 0.03}{0.99 \times 0.03 + 0.01 \times 0.98} = 0.7519$.
⇒ A patient who tests positive still has a high probability (0.7519) of having no AIDS.
Bayes’s Theorem (general):
Let $A_1, A_2, \ldots, A_n$ be mutually exclusive events and
$A_1 \cup A_2 \cup \cdots \cup A_n = S$,
then
$P(A_i|B) = \frac{P(A_i \cap B)}{P(B)} = \frac{P(A_i)P(B|A_i)}{P(A_1)P(B|A_1) + P(A_2)P(B|A_2) + \cdots + P(A_n)P(B|A_n)}$,
$i = 1, 2, \ldots, n$.
[Derivation of Bayes’s theorem (general)]:
The sample space is divided into $A_1, A_2, \ldots, A_n$, so the event B is split into
the disjoint parts $B \cap A_1, B \cap A_2, \ldots, B \cap A_n$.
Since
$P(B \cap A_i) = P(A_i)P(B|A_i)$,
and
$P(B) = P(B \cap A_1) + P(B \cap A_2) + \cdots + P(B \cap A_n)$
$= P(A_1)P(B|A_1) + P(A_2)P(B|A_2) + \cdots + P(A_n)P(B|A_n)$,
thus,
$P(A_i|B) = \frac{P(B \cap A_i)}{P(B)} = \frac{P(A_i)P(B|A_i)}{P(A_1)P(B|A_1) + P(A_2)P(B|A_2) + \cdots + P(A_n)P(B|A_n)}$.
Example 2:
$P(A_1|B_1) = \frac{P(A_1)P(B_1|A_1)}{P(A_1)P(B_1|A_1) + P(A_2)P(B_1|A_2) + P(A_3)P(B_1|A_3)}$
$= \frac{0.5 \times 0.9}{0.5 \times 0.9 + 0.2 \times 0.05 + 0.3 \times 0.05} = 0.95$.
⇒ A company with a good finance assessment has a very high probability (0.95) of a
good finance situation in the coming year.
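Both Bayes examples follow the same computation, so they can be reproduced with one short function. This is a sketch, not a library routine: `bayes_posterior` is a hypothetical name, `priors` holds the $P(A_k)$ and `likelihoods` holds the $P(B|A_k)$ in the same order.

```python
def bayes_posterior(priors, likelihoods, i):
    """P(A_i | B) computed from priors P(A_k) and likelihoods P(B | A_k)."""
    # Denominator: P(B) = sum over k of P(A_k) * P(B | A_k).
    total = sum(p * l for p, l in zip(priors, likelihoods))
    return priors[i] * likelihoods[i] / total

# Example 1: A = no AIDS, A^c = AIDS, B = positive test.
ex1 = bayes_posterior([0.99, 0.01], [0.03, 0.98], 0)
# Example 2: A1, A2, A3 = good, O.K., bad finance; B1 = good assessment.
ex2 = bayes_posterior([0.5, 0.2, 0.3], [0.9, 0.05, 0.05], 0)

print(round(ex1, 4), round(ex2, 4))
```

The two-event theorem is the special case with the two events $A$ and $A^c$.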
Online Exercise:
Exercise 5.5.1
Chapter 6 Probability Distribution
6.1. Random Variable
Example:
Suppose we gamble in a casino and the possible result is as follows.
Outcome     Token (X)     Money (Y)
Win         3             30
Lose        -4            -40
Tie         0             0
In this example, the sample space is S = {Win, Lose, Tie}, containing 3 outcomes. X is
the quantity representing the tokens obtained or lost under the different results while Y is
the one representing the money obtained or lost.
In the above example, X and Y can provide a numerical summary corresponding to
the experimental outcome. A formal definition for these numerical quantities is in the
following.
Definition (random variable): A random variable is a numerical
description of the outcome of an experiment.
Example:
In the previous example,
X: the random variable representing the tokens obtained or lost corresponding to
different outcomes.
Y: the random variable representing the money obtained or lost corresponding to
different outcomes.
X has 3 possible values corresponding to the 3 outcomes:
$X(\{Win\}) = 3$, $X(\{Lose\}) = -4$, $X(\{Tie\}) = 0$.
Y has 3 possible values corresponding to the 3 outcomes:
$Y(\{Win\}) = 30$, $Y(\{Lose\}) = -40$, $Y(\{Tie\}) = 0$.
Note that $Y = 10X$, since
$Y(\{Win\}) = 30 = 10X(\{Win\})$, $Y(\{Lose\}) = -40 = 10X(\{Lose\})$,
$Y(\{Tie\}) = 0 = 10X(\{Tie\})$.
That is, Y is 10 times X under all possible experimental outcomes.
There are two types of random variables. They are:
Discrete random variable: a quantity that assumes either a finite
number of values or an infinite sequence of values, such as 0, 1, 2, ...
Continuous random variable: a quantity that assumes any numerical
value in an interval or collection of intervals, such as time, weight,
distance, and temperature.
Example:
Let the sample space be
$S = \{z \mid z \text{ is the delay time for a flight}, 0 \le z \le 1\}$.
Let Z be the random variable representing the flight delay time, defined as
$Z(\{z = t\}) = Z(\{\text{the flight delay time is } t\}) = t$, $0 \le t \le 1$.
For example, Z = 0.5 corresponds to the outcome that the flight is 0.5 hour (30
minutes) late.
Online Exercise:
Exercise 6.1.1
6.2. Probability Distribution
Definition (probability distribution): a function that describes how
probabilities are distributed over the values of the random
variable.
(I): Discrete Random Variable:
Example:
Suppose the probabilities for the outcomes in the gamble example are
$P(\{Win\}) = \frac{1}{6} = P(X = 3) = P(Y = 30)$
$P(\{Lose\}) = \frac{2}{3} = P(X = -4) = P(Y = -40)$
$P(\{Tie\}) = \frac{1}{6} = P(X = 0) = P(Y = 0)$
Let $f_x(x)$ be the function giving the probability of each value of the random
variable X in the gambling example, defined as
$f_x(3) = P(X = 3) = \frac{1}{6}$, $f_x(-4) = P(X = -4) = \frac{2}{3}$, $f_x(0) = P(X = 0) = \frac{1}{6}$.
$f_x(x)$ is referred to as the probability distribution of random variable X.
Similarly, the probability distribution $f_y(x)$ of random variable Y is
$f_y(30) = P(Y = 30) = \frac{1}{6}$, $f_y(-40) = P(Y = -40) = \frac{2}{3}$, $f_y(0) = P(Y = 0) = \frac{1}{6}$.
Required conditions for a discrete probability distribution:
Let $a_1, a_2, \ldots, a_n, \ldots$ be all the possible values of the discrete random
variable X. Then, the required conditions for $f_x(x)$ to be the discrete
probability distribution for X are
(a) $f_x(a_i) \ge 0$, for every i.
(b) $\sum_i f_x(a_i) = f_x(a_1) + f_x(a_2) + \cdots + f_x(a_n) + \cdots = 1$.
Example:
In the gambling example, $f_x(x)$ is a discrete probability distribution for the random
variable X since
(a) $f_x(3) \ge 0$, $f_x(-4) \ge 0$, and $f_x(0) \ge 0$.
(b) $f_x(3) + f_x(-4) + f_x(0) = 1$.
Similarly, $f_y(x)$ is also a discrete probability distribution for the random variable Y.
Note: the discrete probability distribution describes the probability
of a discrete random variable at different values.
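The two required conditions are easy to test for a tabulated distribution. A minimal sketch with a finite number of values; the dictionary `f_x` maps each value of X to its probability and `is_discrete_distribution` is a hypothetical helper.

```python
from fractions import Fraction

# Gambling example: value of X -> probability.
f_x = {3: Fraction(1, 6), -4: Fraction(2, 3), 0: Fraction(1, 6)}

def is_discrete_distribution(f):
    """Check (a) every probability is >= 0 and (b) the probabilities sum to 1."""
    return all(p >= 0 for p in f.values()) and sum(f.values()) == 1

print(is_discrete_distribution(f_x))
```

Exact fractions are used so that condition (b) can be tested with equality; with floating-point probabilities a small tolerance would be needed instead.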
(II): Continuous Random Variable:
For a continuous random variable, it is not meaningful to assign a probability to every
numerical value since there are uncountably many values in an interval. Instead,
probability is assigned to intervals. The probability density function
describes how the probability is distributed over an interval.
Example:
In the flight delay time example, suppose the probability of being late within 0.5
hour is twice the probability of being late more than 0.5 hour, i.e.,
$P(0 \le Z \le 0.5) = \frac{2}{3}$ and $P(0.5 < Z \le 1) = \frac{1}{3}$.
Then, the probability density function $f_1(x)$ for the random variable Z is
$f_1(x) = \frac{4}{3}$, $0 \le x \le 0.5$; $f_1(x) = \frac{2}{3}$, $0.5 < x \le 1$.
[Figure: graph of $f_1(x)$ against x for $0 \le x \le 1$]
The area corresponding to the interval is the probability of the random variable Z
taking values in this interval. For example, the probability of the flight being late
within 0.5 hour (the random variable Z taking values in the interval [0,0.5]) is
$P(\text{the flight being late within 0.5 hour}) = P(0 \le Z \le 0.5) = \int_0^{0.5} f_1(x)dx = \frac{4}{3} \times 0.5 = \frac{2}{3}$.
Similarly, the probability of the flight being late more than 0.5 hour (the random
variable Z taking values in the interval (0.5,1]) is
$P(\text{the flight being late more than 0.5 hour}) = P(0.5 < Z \le 1) = \int_{0.5}^{1} f_1(x)dx = \frac{2}{3} \times 0.5 = \frac{1}{3}$.
On the other hand, if the probability of being late within 0.5 hour is the same as the
probability of being late more than 0.5 hour, i.e.,
$P(0 \le Z \le 0.5) = P(0.5 < Z \le 1) = \frac{1}{2}$,
then the probability density function $f_2(x)$ for the random variable Z is
$f_2(x) = 1$, $0 \le x \le 1$.
Note that the probability density function gives the probability of the random
variable taking values in some interval. However, unlike the probability distribution
of a discrete random variable, the probability density function
evaluated at some value cannot be used to
describe the probability of the random variable Z taking this value.
Required conditions for a continuous probability density:
Let the continuous random variable Z take values in [a,b]. Then,
the required conditions for $f(x)$ to be the continuous probability
density for Z are
(a) $f(x) \ge 0$, $a \le x \le b$.
(b) $\int_a^b f(x)dx = 1$.
Note: $P(c \le Z \le d) = \int_c^d f(x)dx$, $a \le c \le d \le b$. That is, the
area under the graph of $f(x)$ corresponding to a given interval is the
probability of the random variable Z taking values in this interval.
Example:
In the flight time example, $f_1(x)$ is a continuous probability density for the random
variable Z since
(a) $f_1(x) \ge 0$, $0 \le x \le 1$.
(b) $\int_0^{0.5} \frac{4}{3}dx + \int_{0.5}^{1} \frac{2}{3}dx = 1$.
Similarly, $f_2(x)$ is also a continuous probability density for the random variable Z.
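For a density that is constant on each piece, such as $f_1$ and $f_2$, the area under the graph is a sum of rectangle areas. The sketch below assumes that representation: each piece is a hypothetical tuple `(left, right, height)`, and `area` is an illustrative helper, not a general integrator.

```python
from fractions import Fraction as F

# f1(x) = 4/3 on [0, 0.5] and 2/3 on (0.5, 1], stored as (left, right, height).
f1 = [(F(0), F(1, 2), F(4, 3)), (F(1, 2), F(1), F(2, 3))]
f2 = [(F(0), F(1), F(1))]

def area(pieces, c, d):
    """P(c <= Z <= d): area under a piecewise-constant density between c and d."""
    total = F(0)
    for left, right, height in pieces:
        overlap = min(right, d) - max(left, c)  # width of the piece inside [c, d]
        if overlap > 0:
            total += overlap * height
    return total

print(area(f1, 0, 1), area(f1, 0, F(1, 2)), area(f1, F(1, 2), 1))
print(area(f2, 0, 1), area(f2, 0, F(1, 2)))
```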
Online Exercise:
Exercise 6.2.1
6.3. Expected Value and Variance:
(I): Discrete Random Variable:
(a) Expected Value:
Example:
X: the random variable representing the point obtained when throwing a fair die. Then,
$P(X = i) = f_x(i) = \frac{1}{6}$, $i = 1, 2, 3, 4, 5, 6$.
Intuitively, the average point when throwing a fair die is
$\frac{1+2+3+4+5+6}{6} = 3.5$.
The expected value of the random variable X is just the average,
$E(X) = \sum_{i=1}^{6} i f_x(i) = 1 \cdot \frac{1}{6} + 2 \cdot \frac{1}{6} + 3 \cdot \frac{1}{6} + 4 \cdot \frac{1}{6} + 5 \cdot \frac{1}{6} + 6 \cdot \frac{1}{6} = 3.5$ = average point.
Formula for the expected value of a discrete random
variable:
Let $a_1, a_2, \ldots, a_n, \ldots$ be all the possible values of the discrete random
variable X and $f_x(x)$ be the probability distribution. Then, the
expected value of the discrete random variable X is
$E(X) = \mu = \sum_i a_i f_x(a_i) = a_1 f_x(a_1) + a_2 f_x(a_2) + \cdots + a_n f_x(a_n) + \cdots$
Example:
In the gambling example, the expected value of the random variable X is
$E(X) = 3 \cdot f_x(3) + (-4) \cdot f_x(-4) + 0 \cdot f_x(0) = 3 \cdot \frac{1}{6} + (-4) \cdot \frac{2}{3} + 0 \cdot \frac{1}{6} = -\frac{13}{6}$.
Therefore, on the average, the gambler will lose $\frac{13}{6}$ tokens for every bet.
Similarly, the expected value of the random variable Y is
$E(Y) = 30 \cdot f_y(30) + (-40) \cdot f_y(-40) + 0 \cdot f_y(0) = 30 \cdot \frac{1}{6} + (-40) \cdot \frac{2}{3} + 0 \cdot \frac{1}{6} = -\frac{130}{6}$.
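The expected value formula is a weighted sum, which the following sketch applies to the die and gambling examples. The dictionaries map each value to its probability; `expected_value` is a hypothetical helper name.

```python
from fractions import Fraction as F

def expected_value(f):
    """E(X) = sum over all values a of a * f(a)."""
    return sum(a * p for a, p in f.items())

die = {i: F(1, 6) for i in range(1, 7)}
f_x = {3: F(1, 6), -4: F(2, 3), 0: F(1, 6)}
f_y = {10 * a: p for a, p in f_x.items()}   # Y = 10 X

print(expected_value(die), expected_value(f_x), expected_value(f_y))
```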
(b) Variance:
Example:
Suppose we want to measure the variation of the random variable X in the die
example. Then, the square distances between the values of X and its mean E(X)=3.5,
i.e., $(1-3.5)^2, (2-3.5)^2, (3-3.5)^2, (4-3.5)^2, (5-3.5)^2, (6-3.5)^2$, can
be used. The average square distance is
$\frac{(1-3.5)^2 + (2-3.5)^2 + (3-3.5)^2 + (4-3.5)^2 + (5-3.5)^2 + (6-3.5)^2}{6} = \frac{8.75}{3}$.
Intuitively, a large average square distance implies the values of X scatter widely.
The variance of the random variable X is just the average square distance (the
expected value of the square distance). The variance for the die example is
$Var(X) = E[X - E(X)]^2 = E(X - 3.5)^2 = \sum_{i=1}^{6} (i-3.5)^2 f(i)$
$= (1-3.5)^2 \cdot \frac{1}{6} + (2-3.5)^2 \cdot \frac{1}{6} + (3-3.5)^2 \cdot \frac{1}{6} + (4-3.5)^2 \cdot \frac{1}{6} + (5-3.5)^2 \cdot \frac{1}{6} + (6-3.5)^2 \cdot \frac{1}{6}$
$= \frac{8.75}{3}$ = the average square distance.
Formula for the variance of a discrete random variable:
Let $a_1, a_2, \ldots, a_n, \ldots$ be all the possible values of the discrete random
variable X and $f_x(x)$ be the probability distribution. Let $\mu = E(X)$
be the expected value of X. Then, the variance of the discrete random
variable X is
$Var(X) = \sigma^2 = E[X - E(X)]^2 = \sum_i (a_i - \mu)^2 f_x(a_i)$
$= (a_1 - \mu)^2 f_x(a_1) + (a_2 - \mu)^2 f_x(a_2) + \cdots + (a_n - \mu)^2 f_x(a_n) + \cdots$
Example:
In the gambling example, the variance of the random variable X is
$Var(X) = \left[3 - \left(-\frac{13}{6}\right)\right]^2 f_x(3) + \left[-4 - \left(-\frac{13}{6}\right)\right]^2 f_x(-4) + \left[0 - \left(-\frac{13}{6}\right)\right]^2 f_x(0)$
$= \left(\frac{31}{6}\right)^2 \cdot \frac{1}{6} + \left(-\frac{11}{6}\right)^2 \cdot \frac{2}{3} + \left(\frac{13}{6}\right)^2 \cdot \frac{1}{6} = 7.472$.
Similarly, the variance of the random variable Y is
$Var(Y) = \left[30 - \left(-\frac{130}{6}\right)\right]^2 f_y(30) + \left[-40 - \left(-\frac{130}{6}\right)\right]^2 f_y(-40) + \left[0 - \left(-\frac{130}{6}\right)\right]^2 f_y(0)$
$= \left(\frac{310}{6}\right)^2 \cdot \frac{1}{6} + \left(-\frac{110}{6}\right)^2 \cdot \frac{2}{3} + \left(\frac{130}{6}\right)^2 \cdot \frac{1}{6} = 747.2$.
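The variance formula can be evaluated the same way as the expected value. A minimal sketch with exact fractions; `variance` is a hypothetical helper that first computes the mean and then averages the squared distances.

```python
from fractions import Fraction as F

def variance(f):
    """Var(X) = sum over all values a of (a - mu)^2 * f(a), with mu = E(X)."""
    mu = sum(a * p for a, p in f.items())
    return sum((a - mu) ** 2 * p for a, p in f.items())

die = {i: F(1, 6) for i in range(1, 7)}
f_x = {3: F(1, 6), -4: F(2, 3), 0: F(1, 6)}
f_y = {10 * a: p for a, p in f_x.items()}   # Y = 10 X

print(variance(die), float(variance(f_x)), float(variance(f_y)))
```

In this example Var(Y) is 100 times Var(X), which matches Y being 10 times X.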
(II): Continuous Random Variable:
(a) Expected Value:
Example:
Z: the random variable representing the flight delay time taking values in [0,1].
$P(0 \le Z \le 0.5) = P(0.5 < Z \le 1) = \frac{1}{2}$.
Then, the probability density function for Z is
$f_2(x) = 1$, $0 \le x \le 1$.
Intuitively, since there is equal chance for any delay time in [0,1], 0.5 hour seems to
be a sensible estimate of the average delay time.
The expected value of the random variable Z is just the average delay time.
$E(Z) = \int_0^1 x f_2(x)dx = \int_0^1 x\,dx = \left.\frac{x^2}{2}\right|_0^1 = 0.5$ = average delay time.
Formula for the expected value of a continuous random
variable:
Let the continuous random variable X take values in [a,b] and
$f(x)$ be the probability density function. Then, the expected value of
the continuous random variable X is
$E(X) = \mu = \int_a^b x f(x)dx$.
Example:
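In the flight time example, suppose the probability density function for Z is
$f_1(x) = \frac{4}{3}$, $0 \le x \le 0.5$; $f_1(x) = \frac{2}{3}$, $0.5 < x \le 1$.
Then, the expected value of the random variable Z is
$E(Z) = \int_0^1 x f_1(x)dx = \int_0^{0.5} x \cdot \frac{4}{3}dx + \int_{0.5}^{1} x \cdot \frac{2}{3}dx = \left.\frac{x^2}{2} \cdot \frac{4}{3}\right|_0^{0.5} + \left.\frac{x^2}{2} \cdot \frac{2}{3}\right|_{0.5}^{1}$
$= \left(\frac{0.5^2}{2} \cdot \frac{4}{3} - \frac{0^2}{2} \cdot \frac{4}{3}\right) + \left(\frac{1^2}{2} \cdot \frac{2}{3} - \frac{0.5^2}{2} \cdot \frac{2}{3}\right) = \frac{5}{12}$.
Therefore, on the average, the flight delay time is $\frac{5}{12}$ hour.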
(b) Variance:
Example:
Suppose we want to measure the variation of the random variable Z in the flight time
example. Suppose $f_2(x)$ is the probability density function for Z. Then, the square
distance between the values of Z and its mean $E(Z) = \frac{1}{2}$, i.e.,
$\left(x - \frac{1}{2}\right)^2$, $0 \le x \le 1$, can be used. The average square distance is
$E\left(Z - \frac{1}{2}\right)^2 = \int_0^1 \left(x - \frac{1}{2}\right)^2 f_2(x)dx = \left.\left(\frac{x^3}{3} - \frac{x^2}{2} + \frac{x}{4}\right)\right|_0^1 = \left(\frac{1}{3} - \frac{1}{2} + \frac{1}{4}\right) - \left(\frac{0}{3} - \frac{0}{2} + \frac{0}{4}\right) = \frac{1}{12}$.
The variance of the random variable Z is just the average square distance (the
expected value of the square distance). The variance for the flight time example is
$Var(Z) = E[Z - E(Z)]^2 = E\left(Z - \frac{1}{2}\right)^2 = \frac{1}{12}$ = the average square distance.
Formula for the variance of a continuous random variable:
Let the continuous random variable X take values in [a,b] and
$f(x)$ be the probability density function. Let $\mu = E(X)$ be the expected
value of X. Then, the variance of the continuous random variable X is
$Var(X) = \sigma^2 = E[X - E(X)]^2 = \int_a^b (x - \mu)^2 f(x)dx$.
Example:
In the flight time example, suppose $f_1(x)$ is the probability density function for Z.
Then, the variance of the random variable Z is
$Var(Z) = E[Z - E(Z)]^2 = \int_0^1 \left(x - \frac{5}{12}\right)^2 f_1(x)dx = \int_0^{0.5} \left(x - \frac{5}{12}\right)^2 \cdot \frac{4}{3}dx + \int_{0.5}^{1} \left(x - \frac{5}{12}\right)^2 \cdot \frac{2}{3}dx$
$= \left.\left(\frac{x^3}{3} - \frac{5x^2}{12} + \frac{25x}{144}\right) \cdot \frac{4}{3}\right|_0^{0.5} + \left.\left(\frac{x^3}{3} - \frac{5x^2}{12} + \frac{25x}{144}\right) \cdot \frac{2}{3}\right|_{0.5}^{1} = \frac{11}{144}$.
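The integrals for $f_1$ and $f_2$ can be double-checked exactly, because both densities are constant on each piece and $\int_l^r x^k dx = \frac{r^{k+1} - l^{k+1}}{k+1}$. The sketch below uses a hypothetical `(left, right, height)` representation and the standard identity $Var(Z) = E(Z^2) - [E(Z)]^2$, which is equivalent to the definition above but is not derived in these notes.

```python
from fractions import Fraction as F

f1 = [(F(0), F(1, 2), F(4, 3)), (F(1, 2), F(1), F(2, 3))]
f2 = [(F(0), F(1), F(1))]

def moment(pieces, k):
    """E(Z^k) for a piecewise-constant density: sum of height * integral of x^k."""
    return sum(h * (r ** (k + 1) - l ** (k + 1)) / (k + 1) for l, r, h in pieces)

def mean_and_variance(pieces):
    mean = moment(pieces, 1)
    return mean, moment(pieces, 2) - mean ** 2   # Var = E(Z^2) - E(Z)^2

print(mean_and_variance(f1))
print(mean_and_variance(f2))
```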
Online Exercise:
Exercise 6.3.1
Exercise 6.3.2
Chapter 7 Discrete Probability Distribution
7.1. The Binomial Probability Distribution
Example:
$X_2$: representing the number of heads when flipping a fair coin twice (H: head, T: tail).
$X_2 = 0$: TT ⇒ $P(X_2 = 0) = \frac{1}{2} \cdot \frac{1}{2} = \binom{2}{0}\left(\frac{1}{2}\right)^2$ (1 combination)
$X_2 = 1$: HT, TH ⇒ $P(X_2 = 1) = 2 \cdot \frac{1}{2} \cdot \frac{1}{2} = \binom{2}{1}\left(\frac{1}{2}\right)^2$ (2 combinations)
$X_2 = 2$: HH ⇒ $P(X_2 = 2) = \frac{1}{2} \cdot \frac{1}{2} = \binom{2}{2}\left(\frac{1}{2}\right)^2$ (1 combination)
⇒ $P(X_2 = i) = f_2(i) = \binom{2}{i}\left(\frac{1}{2}\right)^2$
= (number of combinations) × (the probability of every combination), $i = 0, 1, 2$.
$X_3$: representing the number of heads when flipping a fair coin 3 times.
$X_3 = 0$: TTT ⇒ $P(X_3 = 0) = \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{2} = \binom{3}{0}\left(\frac{1}{2}\right)^3$ (1 combination)
$X_3 = 1$: HTT, THT, TTH ⇒ $P(X_3 = 1) = 3 \cdot \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{2} = \binom{3}{1}\left(\frac{1}{2}\right)^3$ (3 combinations)
$X_3 = 2$: HHT, HTH, THH ⇒ $P(X_3 = 2) = 3 \cdot \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{2} = \binom{3}{2}\left(\frac{1}{2}\right)^3$ (3 combinations)
$X_3 = 3$: HHH ⇒ $P(X_3 = 3) = \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{2} = \binom{3}{3}\left(\frac{1}{2}\right)^3$ (1 combination)
⇒ $P(X_3 = i) = f_3(i) = \binom{3}{i}\left(\frac{1}{2}\right)^3$
= (number of combinations) × (the probability of every combination), $i = 0, 1, 2, 3$.
$X_n$: representing the number of heads when flipping a fair coin n times.
Then,
$X_n = 0$: TT……T ⇒ $P(X_n = 0) = \left(\frac{1}{2}\right)^n = \binom{n}{0}\left(\frac{1}{2}\right)^n$ (1 combination)
$X_n = 1$: HT……T, TH……T, ……, TT……H ⇒ $P(X_n = 1) = n\left(\frac{1}{2}\right)^n = \binom{n}{1}\left(\frac{1}{2}\right)^n$ (n combinations)
⋮
⇒ $P(X_n = i) = f_n(i) = \binom{n}{i}\left(\frac{1}{2}\right)^n$
= (number of combinations) × (the probability of every combination).
Note: the number of combinations is equivalent to the number of ways of
drawing i balls (heads) from n balls (n flips).
Example:
$Z_3$: representing the number of successes over 3 trials (S: success, F: failure).
Suppose the probability of a success is $\frac{1}{3}$ while the probability of a failure is $\frac{2}{3}$.
Then,
$Z_3 = 0$: FFF ⇒ $P(Z_3 = 0) = \left(\frac{2}{3}\right)^3 = \binom{3}{0}\left(\frac{1}{3}\right)^0\left(\frac{2}{3}\right)^3$ (1 combination)
$Z_3 = 1$: SFF, FSF, FFS ⇒ $P(Z_3 = 1) = 3 \cdot \frac{1}{3} \cdot \left(\frac{2}{3}\right)^2 = \binom{3}{1}\left(\frac{1}{3}\right)^1\left(\frac{2}{3}\right)^2$ (3 combinations)
$Z_3 = 2$: SSF, SFS, FSS ⇒ $P(Z_3 = 2) = 3 \cdot \left(\frac{1}{3}\right)^2 \cdot \frac{2}{3} = \binom{3}{2}\left(\frac{1}{3}\right)^2\left(\frac{2}{3}\right)^1$ (3 combinations)
$Z_3 = 3$: SSS ⇒ $P(Z_3 = 3) = \left(\frac{1}{3}\right)^3 = \binom{3}{3}\left(\frac{1}{3}\right)^3\left(\frac{2}{3}\right)^0$ (1 combination)
⇒ $P(Z_3 = i) = f_3(i) = \binom{3}{i}\left(\frac{1}{3}\right)^i\left(\frac{2}{3}\right)^{3-i}$
= (number of combinations) × (the probability of every combination), $i = 0, 1, 2, 3$.
$Z_n$: representing the number of successes over n trials.
Then,
$Z_n = 0$: FF……F ⇒ $P(Z_n = 0) = \left(\frac{2}{3}\right)^n = \binom{n}{0}\left(\frac{1}{3}\right)^0\left(\frac{2}{3}\right)^n$ (1 combination)
$Z_n = 1$: SF……F, FS……F, ……, FF……S ⇒ $P(Z_n = 1) = n \cdot \frac{1}{3} \cdot \left(\frac{2}{3}\right)^{n-1} = \binom{n}{1}\left(\frac{1}{3}\right)^1\left(\frac{2}{3}\right)^{n-1}$ (n combinations)
⋮
⇒ $P(Z_n = i) = f_n(i) = \binom{n}{i}\left(\frac{1}{3}\right)^i\left(\frac{2}{3}\right)^{n-i}$
= (number of combinations) × (the probability of every combination).
From the above examples, we can readily describe the binomial experiment.
Properties of Binomial Experiment
• X: representing the number of successes over n independent
identical trials.
• The probability of a success in a trial is p while the probability of
a failure is (1-p).
Binomial Probability Distribution:
Let X be the random variable representing the number of successes
of a Binomial experiment. Then, the probability distribution function
for X is
$P(X = i) = f_x(i) = \binom{n}{i} p^i (1-p)^{n-i} = \frac{n!}{i!(n-i)!} p^i (1-p)^{n-i}$, $i = 0, 1, 2, \ldots, n$.
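The binomial probability distribution can be evaluated directly from its formula. A minimal sketch, assuming Python 3.8 or later for `math.comb`; `binomial_pmf` is a hypothetical helper name, and exact fractions are used so the values match the coin and success/failure examples above.

```python
from fractions import Fraction as F
from math import comb

def binomial_pmf(i, n, p):
    """P(X = i) = C(n, i) * p^i * (1 - p)^(n - i)."""
    return comb(n, i) * p ** i * (1 - p) ** (n - i)

# Number of heads in 3 flips of a fair coin.
coin = [binomial_pmf(i, 3, F(1, 2)) for i in range(4)]
# Number of successes in 3 trials with success probability 1/3.
trials = [binomial_pmf(i, 3, F(1, 3)) for i in range(4)]

print(coin)
print(trials, sum(trials))
```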
Properties of Binomial Probability Distribution:
If a random variable X has the binomial probability distribution $f(x)$
with parameters n and p, then
$E(X) = np$
and
$Var(X) = np(1-p)$.
[Derivation:]
$E(X) = \sum_{i=0}^{n} i \cdot f(i) = \sum_{i=0}^{n} i \binom{n}{i} p^i (1-p)^{n-i} = \sum_{i=0}^{n} i \cdot \frac{n!}{i!(n-i)!} p^i (1-p)^{n-i}$
$= \sum_{i=1}^{n} i \cdot \frac{n!}{i!(n-i)!} p^i (1-p)^{n-i} = \sum_{i=1}^{n} \frac{n!}{(i-1)!(n-i)!} p^i (1-p)^{n-i}$
$= np \sum_{i=1}^{n} \frac{(n-1)!}{(i-1)![(n-1)-(i-1)]!} p^{i-1} (1-p)^{(n-1)-(i-1)}$
$= np \sum_{j=0}^{n-1} \frac{(n-1)!}{j!(n-1-j)!} p^j (1-p)^{n-1-j}$ (with $j = i-1$)
$= np$
(since $\frac{(n-1)!}{j!(n-1-j)!} p^j (1-p)^{n-1-j}$ is the probability
distribution of a binomial random variable over n-1 trials, so the sum equals 1).
The derivation of $Var(X) = np(1-p)$ is left as an exercise.
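The two properties can also be checked numerically by summing over the distribution, which is a useful sanity check alongside the derivation. This is a sketch for one arbitrary choice of parameters (n = 10, p = 1/3 are assumptions for illustration only); `binomial_moments` is a hypothetical helper.

```python
from fractions import Fraction as F
from math import comb

def binomial_moments(n, p):
    """Mean and variance computed directly from the binomial distribution."""
    pmf = [comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(n + 1)]
    mean = sum(i * f for i, f in enumerate(pmf))
    var = sum((i - mean) ** 2 * f for i, f in enumerate(pmf))
    return mean, var

n, p = 10, F(1, 3)
mean, var = binomial_moments(n, p)
print(mean, n * p)
print(var, n * p * (1 - p))
```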
How to obtain the binomial probability distribution:
(a) Using a table of the binomial distribution.
(b) Using a computer
• by some software, for example, Excel or Minitab.
• by some computing resource on the internet, for example,
http://home.kimo.com.tw/g894730/stat/ca1/index.html
or http://140.128.104.155/wenwei/stat-life/stat/ca1/index.html
Online Exercise:
Exercise 7.1.1
Exercise 7.1.2
7.2. The Poisson Probability Distribution:
Properties of Poisson Experiment:
• X: representing the number of occurrences in a continuous
interval.
• $\mu$: expected value of occurrences in this interval.
• The probability of an occurrence is the same for any two
intervals of equal length. The expected value of occurrences in
an interval is proportional to the length of this interval.
• The occurrence or nonoccurrence in any interval is independent
of the occurrence or nonoccurrence in any other interval.
• The probability of two or more occurrences in a very small
interval is close to 0.
Poisson Probability Distribution:
Let X be the random variable representing the number of
occurrences of a Poisson experiment in some interval. Then, the
probability distribution function for X is
$P(X = i) = f_x(i) = \frac{e^{-\mu}\mu^i}{i!}$, $i = 0, 1, 2, \ldots$,
where $e = 2.71828$ and $\mu$ is some parameter.
Properties of Poisson Probability Distribution:
If a random variable X has the Poisson probability distribution $f(x)$
with parameter $\mu$, then
$E(X) = \mu$ = the expected number of occurrences
and
$Var(X) = \mu$.
The derivations of the above properties are similar to the ones for the binomial
random variable and are left as exercises.
Example:
Suppose the average number of car accidents on the highway in one day is 4. What is
the probability of no car accident in one day? What is the probability of 1 car
accident in two days?
[solution:]
It is sensible to use a Poisson random variable to represent the number of car accidents
on the highway. Let X represent the number of car accidents on the highway in
one day. Then,
$P(X = i) = f_x(i) = \frac{e^{-4}4^i}{i!}$, $i = 0, 1, 2, \ldots$
and
$E(X) = 4$.
Then,
$P(\text{no car accident in one day}) = P(X = 0) = f_x(0) = \frac{e^{-4}4^0}{0!} = e^{-4} = 0.0183$.
Since the average number of car accidents in one day is 4, the average number of
car accidents in two days should be 8. Let Y represent the number of car accidents in
two days. Then,
$P(Y = i) = f_y(i) = \frac{e^{-8}8^i}{i!}$, $i = 0, 1, 2, \ldots$
and
$E(Y) = 8$.
Then,
$P(\text{1 car accident in two days}) = P(Y = 1) = f_y(1) = \frac{e^{-8}8^1}{1!} = 8e^{-8} = 0.0027$.
Example:
Suppose the average number of calls by 104 in one minute is 2. What is the
probability of 10 calls in 5 minutes?
[solution]:
Since the average number of calls by 104 in one minute is 2, the average number
of calls in 5 minutes is 10. Let X represent the number of calls in 5 minutes. Then,
$P(X = i) = f_x(i) = \frac{e^{-10}10^i}{i!}$, $i = 0, 1, 2, \ldots$
and
$E(X) = 10$.
Then,
$P(\text{10 calls in 5 minutes}) = P(X = 10) = f_x(10) = \frac{e^{-10}10^{10}}{10!} = 0.1251$.
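The Poisson probabilities in the two examples can be reproduced with the standard library alone. A minimal sketch; `poisson_pmf` is a hypothetical helper name, and `mu` is the expected number of occurrences for the interval in question.

```python
from math import exp, factorial

def poisson_pmf(i, mu):
    """P(X = i) = e^(-mu) * mu^i / i! for a Poisson random variable."""
    return exp(-mu) * mu ** i / factorial(i)

no_accident_one_day = poisson_pmf(0, 4)      # mu = 4 accidents per day
one_accident_two_days = poisson_pmf(1, 8)    # mu = 8 accidents per two days
ten_calls_five_min = poisson_pmf(10, 10)     # mu = 10 calls per 5 minutes

print(no_accident_one_day, one_accident_two_days, ten_calls_five_min)
```

Note that the parameter is rescaled to the length of the interval before the formula is applied, as in the solutions above.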
How to obtain the Poisson probability distribution:
(a) Using a table of the Poisson distribution.
(b) Using a computer
• by some software, for example, Excel or Minitab.
• by some computing resource on the internet, for example,
http://home.kimo.com.tw/g894730/stat/ca1/index.html
or http://140.128.104.155/wenwei/stat-life/stat/ca1/index.html
Online Exercise:
Exercise 7.2.1
Exercise 7.2.2
7.3. The Hypergeometric Probability Distribution:
Example:
Suppose there are 50 officers, 10 female officers and 40 male officers. Suppose 20 of
them will be promoted. Let X represent the number of female promotions. Then,
$P(X = 0) = \frac{\binom{10}{0}\binom{40}{20}}{\binom{50}{20}}$
= (# of combinations for 0 female promotions) × (# of combinations for 20 male promotions) / (# of combinations for 20 promotions),
$P(X = 1) = \frac{\binom{10}{1}\binom{40}{19}}{\binom{50}{20}}$
= (# of combinations for 1 female promotion) × (# of combinations for 19 male promotions) / (# of combinations for 20 promotions),
⋮
$P(X = i) = \frac{\binom{10}{i}\binom{40}{20-i}}{\binom{50}{20}}$
= (# of combinations for i female promotions) × (# of combinations for 20-i male promotions) / (# of combinations for 20 promotions),
⋮
$P(X = 10) = \frac{\binom{10}{10}\binom{40}{10}}{\binom{50}{20}}$
= (# of combinations for 10 female promotions) × (# of combinations for 10 male promotions) / (# of combinations for 20 promotions).
Therefore, the probability distribution function for X is
$P(X = i) = \frac{\binom{10}{i}\binom{40}{20-i}}{\binom{50}{20}}$, $i = 0, 1, \ldots, 10$.
Hypergeometric Probability Distribution:
There are N elements in the population, r elements in group 1 and the
other N-r elements in group 2. Suppose we select n elements from the
two groups and the random variable X represents the number of
elements selected from group 1. Then, the probability distribution
function for X is
$P(X = i) = f_x(i) = \frac{\binom{r}{i}\binom{N-r}{n-i}}{\binom{N}{n}}$, $0 \le i \le r$.
Note: $\binom{r}{i}$ is the number of combinations when selecting i elements
from group 1 while $\binom{N-r}{n-i}$ is the number of combinations when
selecting n-i elements from group 2. $\binom{N}{n}$ is the total number of
combinations when selecting n elements from the two groups while
$\binom{r}{i}\binom{N-r}{n-i}$ is the total number of combinations when selecting i and n-i
elements from groups 1 and 2, respectively.
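The hypergeometric formula can be evaluated for the promotion example with exact arithmetic. A minimal sketch, assuming Python 3.8 or later for `math.comb`; `hypergeom_pmf` is a hypothetical helper whose arguments follow the notation above (N elements, r in group 1, n selected).

```python
from fractions import Fraction
from math import comb

def hypergeom_pmf(i, N, r, n):
    """P(X = i) = C(r, i) * C(N - r, n - i) / C(N, n)."""
    return Fraction(comb(r, i) * comb(N - r, n - i), comb(N, n))

# 50 officers, 10 female (group 1), 20 promotions.
pmf = [hypergeom_pmf(i, 50, 10, 20) for i in range(11)]

print(float(pmf[0]), float(pmf[4]))
print(sum(pmf))   # the probabilities over i = 0, ..., 10 add up to 1
```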
How to obtain the hypergeometric probability distribution:
(a) Using a table of the hypergeometric distribution.
(b) Using a computer
• by some software, for example, Excel or Minitab.
• by some computing resource on the internet, for example,
http://home.kimo.com.tw/g894730/stat/ca1/index.html
or http://140.128.104.155/wenwei/stat-life/stat/ca1/index.html
Online Exercise:
Exercise 7.3.1
Chapter 8 Continuous Probability Density
8.1. The Uniform Probability Density:
Example:
X: the random variable representing the flight time from Taipei to Kaohsiung.
Suppose the flight time can be any value in the interval from 30 to 50 minutes. That is,
$30\le X\le 50$.
Question: if the probability that the flight time falls within a time
interval is the same for every time interval of the same length, what
density $f(x)$ is sensible for describing the probability?
Recall that the area under the graph of $f(x)$ over any interval is the
probability of the random variable X taking values in this interval. Since the
probabilities of X taking values in intervals of equal length are the same, the
areas under the graph of $f(x)$ over intervals of equal length are the
same. Thus, $f(x)$ takes the same value everywhere on $[30,50]$. For example,
for one-minute intervals,
$$P(30\le X\le 31)=\int_{30}^{31}f(x)\,dx=P(31\le X\le 32)=\int_{31}^{32}f(x)\,dx=\cdots=P(49\le X\le 50)=\int_{49}^{50}f(x)\,dx.$$
Therefore, we have
$$f(x)=\frac{1}{20},\ 30\le x\le 50;\qquad f(x)=0,\ \text{otherwise}.$$
Note: since we know $f(x)=c$ (some constant) on $[30,50]$, then by the property
that the total probability is 1,
$$\int_{30}^{50}f(x)\,dx=\int_{30}^{50}c\,dx=1\ \Leftrightarrow\ 20c=1\ \Leftrightarrow\ c=\frac{1}{20}.$$
In the above example, the probability density has the same value throughout the interval in
which the random variable takes values. This probability density is referred to as the uniform
probability density function.
Uniform Probability Density Function:
A random variable X taking values in [a,b] has the uniform
probability density function $f(x)$ if
$$f(x)=\frac{1}{b-a},\ a\le x\le b;\qquad f(x)=0,\ \text{otherwise}.$$
[Figure: the graph of $f(x)$, a horizontal line at height 1/(b-a) between a and b on the x-axis.]
Properties of Uniform Probability Density Function:
If a random variable X taking values in [a,b] has the uniform
probability density function $f(x)$, then
$$E(X)=\frac{b+a}{2},\qquad Var(X)=\frac{(b-a)^2}{12}.$$
[Derivation]:
$$E(X)=\int_a^b xf(x)\,dx=\int_a^b x\cdot\frac{1}{b-a}\,dx=\frac{1}{b-a}\cdot\frac{x^2}{2}\Big|_a^b=\frac{1}{b-a}\left(\frac{b^2}{2}-\frac{a^2}{2}\right)=\frac{1}{b-a}\cdot\frac{(b-a)(b+a)}{2}=\frac{b+a}{2}.$$
The derivation of
$$Var(X)=\frac{(b-a)^2}{12}$$
is left as an exercise.
Example:
In the flight time example, $b=50$, $a=30$, then
$$E(X)=\frac{50+30}{2}=40,\qquad Var(X)=\frac{(50-30)^2}{12}=33.33.$$
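As a quick numerical check of the flight time example, here is a small sketch in Python. The helper names (`uniform_prob`, `uniform_mean`, `uniform_var`) are illustrative only and are not part of the course material.

```python
def uniform_prob(lo, hi, a, b):
    """P(lo <= X <= hi) for X uniform on [a, b]: interval length over (b - a)."""
    lo, hi = max(lo, a), min(hi, b)   # clip the interval to [a, b]
    return max(hi - lo, 0) / (b - a)

def uniform_mean(a, b):
    return (a + b) / 2

def uniform_var(a, b):
    return (b - a) ** 2 / 12

a, b = 30, 50  # flight time between 30 and 50 minutes
print(uniform_prob(30, 31, a, b))   # any one-minute interval has probability 1/20
print(uniform_mean(a, b), round(uniform_var(a, b), 2))
```

Every one-minute interval inside [30, 50] gives the same probability, which is exactly the property that motivated the constant density.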
Online Exercise:
Exercise 8.1.1
Exercise 8.1.2
8.2. The Normal Probability Density
The normal probability density, also called the Gaussian density, might be the most
commonly used probability density function in statistics.
Normal Probability Density Function:
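A random variable X taking values in $(-\infty,\infty)$ has the normal
probability density function $f(x)$ if
$$f(x)=\frac{1}{\sqrt{2\pi}\,\sigma}e^{-\frac{(x-\mu)^2}{2\sigma^2}},\quad -\infty<x<\infty,$$
where
$\mu=E(X)$, $\sigma^2=Var(X)$, $\pi\approx 3.14159$.
[Figure: the graph of $f(x)$, a bell-shaped curve centered at $\mu$ on the x-axis.]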
Properties of Normal Density Function:
(a)
$$\mu=E(X)=\int_{-\infty}^{\infty}xf(x)\,dx=\int_{-\infty}^{\infty}x\cdot\frac{1}{\sqrt{2\pi}\,\sigma}e^{-\frac{(x-\mu)^2}{2\sigma^2}}\,dx$$
and
$$\sigma^2=Var(X)=\int_{-\infty}^{\infty}(x-\mu)^2f(x)\,dx=\int_{-\infty}^{\infty}(x-\mu)^2\cdot\frac{1}{\sqrt{2\pi}\,\sigma}e^{-\frac{(x-\mu)^2}{2\sigma^2}}\,dx.$$
(b)
$\mu$ = the mean of the normal random variable X
= the median of the normal random variable X ($P(X\le\mu)=P(X\ge\mu)$)
= the mode of the normal probability density ($f(\mu)\ge f(x)$ for all $x$)
(c) If X is a random variable with the normal density function, X is
denoted by
$X\sim N(\mu,\sigma^2)$.
(d) The standard deviation determines the width of the curve. The
normal density with the larger standard deviation is more
dispersed than the one with the smaller standard deviation. In the
following graph, two normal density functions have the same
mean but different standard deviations: one is 1 (the solid line)
and the other is 2 (the dotted line).
[Figure: two normal density curves centered at the same mean $\mu$ on the x-axis; the solid curve (standard deviation 1) is narrower than the dotted curve (standard deviation 2).]
(e) The normal density is symmetric with respect to the mean. That is,
$f(\mu-c)=f(\mu+c)$, where c is any number.
(f) The probability of a normal random variable follows the
empirical rule introduced previously. That is,
$$P(\mu-\sigma\le X\le\mu+\sigma)=0.6826\approx 68\%$$
$$P(\mu-2\sigma\le X\le\mu+2\sigma)=0.9544\approx 95\%$$
$$P(\mu-3\sigma\le X\le\mu+3\sigma)=0.9973\approx 100\%$$
i.e., the probability of X taking values within one standard
deviation of the mean is about 0.68, within two standard deviations about
0.95, and within three standard deviations about 1.
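The three probabilities in the empirical rule do not depend on the mean or the standard deviation, so they can be checked once and for all. The sketch below assumes Python's standard `math.erf` and uses the identity P(|Z| ≤ k) = erf(k/√2); the function name `within_k_sigma` is illustrative. The last digit can differ slightly from the table-based values above because the normal table is rounded to four decimals.

```python
from math import erf, sqrt

def within_k_sigma(k):
    """P(mu - k*sigma <= X <= mu + k*sigma) for any normal random variable X."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(k, round(within_k_sigma(k), 4))
```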
Standard Normal Probability Density Function:
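A random variable Z taking values in $(-\infty,\infty)$ has the standard
normal probability density function $f(x)$ if
$$f(x)=\frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}},\quad -\infty<x<\infty,$$
where
$\mu=E(Z)=0$, $\sigma^2=Var(Z)=1$.
Note: we denote Z as $Z\sim N(0,1)$.
The probability of Z taking values in some interval can be found from the normal table.
In particular, the probability of Z taking values in $[0,z]$, $z\ge 0$, can be read from the normal
table. That is,
$$P(0\le Z\le z)=\text{the area of the region between the two vertical lines at } 0 \text{ and } z=\int_0^z\frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}\,dx.$$
The graph is given below:
[Figure: the standard normal density with the points -z, 0 and z marked on the horizontal axis; the region between the vertical lines at 0 and z is the tabulated area.]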
Example:
$$P(0\le Z\le 1.0)=0.3413$$
$$P(0\le Z\le 1.03)=0.3485$$
$$P(-1.0\le Z\le 1.0)=P(-1\le Z\le 0)+P(0\le Z\le 1)=2P(0\le Z\le 1)=0.6826\quad(\text{symmetry of } Z)$$
$$P(Z\ge 1.5)=1-P(Z<1.5)=1-\left[P(Z<0)+P(0\le Z<1.5)\right]=1-\left(\frac12+0.4332\right)=0.0668$$
$$P(Z\ge -1.5)=P(-1.5\le Z\le 0)+P(Z\ge 0)=P(0\le Z\le 1.5)+\frac12\quad(\text{symmetry of } Z)$$
$$=0.4332+0.5=0.9332$$
$$P(1\le Z\le 1.5)=P(0\le Z\le 1.5)-P(0\le Z\le 1)=0.4332-0.3413=0.0919$$
Example:
$P(Z\ge x)=0.0099$. What is x?
[solutions:]
$$P(Z<x)=1-P(Z\ge x)=1-0.0099=0.9901$$
$$\Rightarrow x>0\quad\left(\text{if } x\le 0, \text{ then } P(Z<x)\le\frac12\right)$$
$$\Rightarrow P(Z<x)=P(Z<0)+P(0\le Z<x)=\frac12+P(0\le Z<x)=0.9901$$
$$\Rightarrow P(0\le Z<x)=0.9901-\frac12=0.4901\ \Rightarrow\ x=2.33$$
Computing Probabilities for any Normal Random Variable:
Once the probability of the standard normal random variable can be
obtained, the probability of any normal random variable (not
standard) can be found via the following important rescaling:
$$X\sim N(\mu,\sigma^2)\ \Rightarrow\ \frac{X-\mu}{\sigma}=Z\sim N(0,1)$$
Example:
Let $X\sim N(1,4)$. Please find $P(1\le X\le 3)$.
[solutions:]
$\mu=1$ and $\sigma=2$. Then,
$$P(1\le X\le 3)=P(1-1\le X-1\le 3-1)=P\left(\frac02\le\frac{X-1}{2}\le\frac22\right)=P(0\le Z\le 1)=0.3413$$
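The rescaling step can be sketched in a few lines of Python. This is only an illustration: it replaces the normal table by the standard-library `math.erf`, and the names `phi` and `normal_prob` are hypothetical helpers introduced here.

```python
from math import erf, sqrt

def phi(z):
    """Standard normal cumulative probability P(Z <= z)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def normal_prob(lo, hi, mu, sigma):
    """P(lo <= X <= hi) for X ~ N(mu, sigma^2), via Z = (X - mu) / sigma."""
    return phi((hi - mu) / sigma) - phi((lo - mu) / sigma)

# X ~ N(1, 4), so mu = 1 and sigma = 2.
print(round(normal_prob(1, 3, mu=1, sigma=2), 4))  # 0.3413
# Table-style value P(Z >= 1.5) from the earlier example.
print(round(1 - phi(1.5), 4))                      # 0.0668
```

Note that the second argument of N(1, 4) is the variance, so the function is called with sigma = 2, not 4.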
Online Exercise:
Exercise 8.2.1
Exercise 8.2.2
8.3 The Exponential Density:
The exponential random variable can be used to describe the lifetime of a machine,
an industrial product, or a human being. It can also be used to describe the waiting time
of a customer for some service.
Exponential Probability Density Function:
A random variable X taking values in $[0,\infty)$ has the exponential
probability density function $f(x)$ if
$$f(x)=\frac{1}{\mu}e^{-\frac{x}{\mu}},\quad 0\le x<\infty,$$
where $\mu>0$.
[Figure: the graph of $f(x)$ for $x\ge 0$, a curve that decreases as x increases.]
Properties of Exponential Density Function:
Let X be the random variable with the exponential density function
$f(x)$ and the parameter $\mu$. Then
1.
$$P(X\le x_0)=1-e^{-\frac{x_0}{\mu}}\quad\text{for any } x_0\ge 0.$$
2.
$$E(X)=\int_0^\infty x\cdot\frac{1}{\mu}e^{-\frac{x}{\mu}}\,dx=\mu$$
and
$$Var(X)=\int_0^\infty(x-\mu)^2\cdot\frac{1}{\mu}e^{-\frac{x}{\mu}}\,dx=\mu^2.$$
[derivation:]
$$P(X\le x_0)=\int_0^{x_0}\frac{1}{\mu}e^{-\frac{x}{\mu}}\,dx=\int_0^{x_0/\mu}e^{-y}\,dy\quad\left(y=\frac{x}{\mu}\right)$$
$$=-e^{-y}\Big|_0^{x_0/\mu}=-e^{-\frac{x_0}{\mu}}+e^{0}=1-e^{-\frac{x_0}{\mu}}$$
The derivation of 2 is left as an exercise.
Note: $S(x_0)=P(X>x_0)=1-P(X\le x_0)=e^{-\frac{x_0}{\mu}}$ is called the
survival function.
Example:
Let X represent the life time of a washing machine. Suppose the average lifetime for
this type of washing machine is 15 years. What is the probability that this washing
machine can be used for less than 6 years? Also, what is the probability that this
washing machine can be used for more than 18 years?
[solution:]
X has the exponential density function with $\mu=15$ (years). Then,
$$P(X\le 6)=1-e^{-\frac{6}{15}}=0.3297\quad\text{and}\quad P(X\ge 18)=e^{-\frac{18}{15}}=0.3012$$
Thus, for this washing machine, there is about a 33% chance that it lasts less than 6 years
and about a 30% chance that it lasts more than 18 years.
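The two probabilities in the washing machine example follow directly from property 1 and the survival function. A minimal sketch, with illustrative function names `exp_cdf` and `exp_survival`:

```python
from math import exp

def exp_cdf(x0, mu):
    """P(X <= x0) for an exponential random variable with mean mu."""
    return 1 - exp(-x0 / mu) if x0 >= 0 else 0.0

def exp_survival(x0, mu):
    """S(x0) = P(X > x0), the survival function."""
    return 1 - exp_cdf(x0, mu)

mu = 15  # average lifetime in years
print(round(exp_cdf(6, mu), 4))        # 0.3297
print(round(exp_survival(18, mu), 4))  # 0.3012
```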
Relationship Between Poisson and Exponential Random
Variable:
Let Y be a Poisson random variable representing the number of
occurrences in a time interval of length t with the probability
distribution
$$P(Y=i)=\frac{e^{-\lambda}\lambda^i}{i!},$$
where $\lambda$ is the mean number of occurrences in this time interval.
Then, if X represents the time of one occurrence, X has the exponential
density function with mean
$$E(X)=\mu=\frac{t}{\lambda}.$$
The intuition of the above result is as follows. Suppose the time interval is [0,1] (in
hours) and $\lambda=4$. Then, on average, there are 4 occurrences during a 1-hour period.
Thus, the mean time for one occurrence is $\mu=\frac{1}{\lambda}=\frac14$ (hour). The number of
occurrences can be described by a Poisson random variable (discrete) with mean 4,
while the time of one occurrence can be described by an exponential random variable
(continuous) with mean $\frac14$.
Example:
Suppose the average number of car accidents on the highway in two days is 8. What is
the probability of no accident for more than 3 days?
[solutions:]
The average number of car accidents on the highway in one day is $\frac82=4$. Thus, the
mean time of one occurrence is $\frac14$ (day).
Let Y be the Poisson random variable with mean 4 representing the number of car
accidents in one day, and let X be the exponential random variable with mean $\frac14$ (day)
representing the time of one accident occurrence. Thus,
$$P(\text{no accident for more than 3 days})=P(\text{the time of one occurrence is larger than 3})$$
$$=P(X>3)=e^{-\frac{3}{1/4}}=e^{-12}\approx 0$$
Online Exercise:
Exercise 8.3.1
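The link between the Poisson and exponential descriptions in Section 8.3 can be checked numerically on the car accident example: "no accident in 3 days" is both the event that a Poisson count over 3 days equals zero and the event that the exponential waiting time exceeds 3 days. The sketch below is only an illustration; `poisson_pmf` is a helper name introduced here.

```python
from math import exp, factorial

def poisson_pmf(i, mean):
    """P(Y = i) for a Poisson random variable with the given mean."""
    return exp(-mean) * mean ** i / factorial(i)

rate_per_day = 8 / 2                    # 8 accidents in two days
mean_wait = 1 / rate_per_day            # mean time of one occurrence, in days

p_exponential = exp(-3 / mean_wait)             # P(X > 3), survival function
p_poisson = poisson_pmf(0, rate_per_day * 3)    # P(no accidents in 3 days)
print(p_exponential, p_poisson)
```

Both numbers equal e^(-12), which is why the answer in the example is essentially zero.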
Appendix: Poisson and Normal Approximations
(I)
Poisson Approximation:
Example:
Let X be the binomial random variable over 250 trials with $p=0.01$. Then, it might
not be easy to obtain
$$P(X=3)=\frac{250!}{3!\,247!}\,0.01^3\,0.99^{247}$$
directly. However, if we only want to obtain an approximation, the Poisson
approximation is a good choice.
Poisson approximation:
Let X be a binomial random variable over n trials and let
$p\le 0.05$, $n\ge 20$.
Let Y be a Poisson random variable with mean np.
Then, the probability of X taking value i can be approximated by the
probability of Y taking value i. That is,
$$\binom{n}{i}p^i(1-p)^{n-i}\approx\frac{e^{-np}(np)^i}{i!},\quad i=0,1,2,\ldots,n.$$
Example:
In the above example, the Poisson random variable with mean $np=250\times 0.01=2.5$
can be used for approximation. Thus,
$$P(X=3)=\frac{250!}{3!\,247!}\,0.01^3\,0.99^{247}\approx\frac{e^{-2.5}\,2.5^3}{3!}=0.2138$$
Note that the exact probability is
$$P(X=3)=\binom{250}{3}0.01^3\,0.99^{247}=0.2149.$$
Therefore, the Poisson approximation is reasonably accurate.
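With a computer the exact binomial probability is no longer hard to obtain, so the quality of the Poisson approximation can be checked directly. A minimal sketch using only the Python standard library; `binom_pmf` and `poisson_pmf` are illustrative helper names.

```python
from math import comb, exp, factorial

def binom_pmf(i, n, p):
    """Exact binomial probability P(X = i) over n trials."""
    return comb(n, i) * p ** i * (1 - p) ** (n - i)

def poisson_pmf(i, mean):
    """Poisson probability P(Y = i) with the given mean."""
    return exp(-mean) * mean ** i / factorial(i)

n, p = 250, 0.01
exact = binom_pmf(3, n, p)
approx = poisson_pmf(3, n * p)
print(round(exact, 4), round(approx, 4))
```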
(II)
Normal Approximation:
Normal approximation:
Let X be a binomial random variable over n trials and let the probability
of success be p. Let Y be the normal random variable with mean np
and variance np(1-p). Then, the probability of X taking value i can be
approximated by the probability of Y taking values in $\left[i-\frac12,\ i+\frac12\right]$.
That is,
$$\binom{n}{i}p^i(1-p)^{n-i}\approx P\left(i-\frac12\le Y\le i+\frac12\right)=\int_{i-\frac12}^{i+\frac12}\frac{1}{\sqrt{2\pi np(1-p)}}\,e^{-\frac{(x-np)^2}{2np(1-p)}}\,dx.$$
Note: the probability $P\left(i-\frac12\le Y\le i+\frac12\right)$ can be obtained by
transforming Y to the standard normal random variable Z.
Example:
Let X be the binomial random variable over 100 trials and let the probability of a
success be 0.1. What is the probability of 12 successes by normal approximation?
[solutions:]
The normal random variable with mean $np=100\times 0.1=10$ and variance
$np(1-p)=100\times 0.1\times(1-0.1)=9$ can be used for approximation. Thus,
$$P(X=12)\approx P\left(12-\frac12\le Y\le 12+\frac12\right)=P\left(\frac{11.5-10}{3}\le\frac{Y-10}{3}\le\frac{12.5-10}{3}\right)$$
$$=P(0.5\le Z\le 0.83)=P(0\le Z\le 0.83)-P(0\le Z\le 0.5)=0.2967-0.1915=0.1052$$
Note that the exact probability is
$$P(X=12)=\binom{100}{12}0.1^{12}\,0.9^{88}=0.0988.$$
Therefore, the normal approximation is reasonably accurate.
Note:
Let X be a binomial random variable over n trials with probability
of success p, and let Y be the normal random variable with mean
np and variance np(1-p). Then,
$$P(X\le i)\approx P\left(Y\le i+\frac12\right)=P\left(\frac{Y-np}{\sqrt{np(1-p)}}\le\frac{i+\frac12-np}{\sqrt{np(1-p)}}\right)=P\left(Z\le\frac{i+\frac12-np}{\sqrt{np(1-p)}}\right),$$
where Z is the standard normal random variable. Similarly,
$$P(X\ge k)\approx P\left(Y\ge k-\frac12\right)=P\left(Z\ge\frac{k-\frac12-np}{\sqrt{np(1-p)}}\right)$$
Example:
In the previous example, what is the probability of at most 13 successes by normal
approximation?
[solution:]
$$P(X\le 13)\approx P\left(Z\le\frac{13+\frac12-10}{3}\right)=P(Z\le 1.17)=P(Z\le 0)+P(0\le Z\le 1.17)=0.5+0.3790=0.8790$$
Note that the exact probability is
$$P(X\le 13)=\sum_{i=0}^{13}\binom{100}{i}0.1^i\,0.9^{100-i}=0.8761.$$
Therefore, the normal approximation is reasonably accurate.
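The normal approximation with the half-unit (continuity) correction can be compared against the exact binomial values for the example with n = 100 and p = 0.1. This is a sketch under the assumption that `math.erf` stands in for the normal table; the helper names are illustrative. Because no rounding of z to two decimals takes place, the approximate values differ slightly in the third decimal from the table-based ones above.

```python
from math import comb, erf, sqrt

def phi(z):
    """Standard normal cumulative probability P(Z <= z)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def binom_pmf(i, n, p):
    return comb(n, i) * p ** i * (1 - p) ** (n - i)

def normal_approx_pmf(i, n, p):
    """P(X = i) approximated by P(i - 1/2 <= Y <= i + 1/2)."""
    mu, sd = n * p, sqrt(n * p * (1 - p))
    return phi((i + 0.5 - mu) / sd) - phi((i - 0.5 - mu) / sd)

def normal_approx_cdf(i, n, p):
    """P(X <= i) approximated by P(Y <= i + 1/2)."""
    mu, sd = n * p, sqrt(n * p * (1 - p))
    return phi((i + 0.5 - mu) / sd)

n, p = 100, 0.1
exact_12 = binom_pmf(12, n, p)
exact_le_13 = sum(binom_pmf(i, n, p) for i in range(14))
print(round(exact_12, 4), round(normal_approx_pmf(12, n, p), 4))
print(round(exact_le_13, 4), round(normal_approx_cdf(13, n, p), 4))
```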
Review 1
Chapter 1:
1. Elements, Variable, and Observations:
Example:
Table 1.1 (p. 5) in the textbook!!
25 elements (25 companies): Advanced Comm. Systems, Ag-Chem
Equipment Co.,…,Webco Industries
Inc..
5 variables : Exchange, Ticker Symbol, Annual Sales, Share Price,
Price/Earnings Ratio.
25 observations: (OTC, ACSC, 75.10, 0.32, 39.10), (OTC, AGCH,
321.10, 0.48, 23.40),…, (AMEX, WEB, 153.50, 0.88,
7.50).
2. Type of Data: Qualitative Data and Quantitative Data
(a) Qualitative data may be nonnumeric or numeric.
(b) Quantitative data are always numeric.
(c) Arithmetic operations are only meaningful with quantitative data.
Example (continue):
Qualitative variables: Exchange, Ticker Symbol.
Quantitative variables: Annual Sales, Share Price, Price/Earnings Ratio.
Chapter 2: Figure 2.22, p. 66.
1. Summarizing qualitative data:
- Frequency distribution, relative frequency distribution, and
percent frequency distribution.
- Bar plot and Pie plot.
Example:
Below you are given the grades of 20 students.
D  C  E  B  B  B  A  D  B  C
B  E  C  B  C  B  B  D  B  C
Then,
Grades   Frequency   Relative Frequency   Percent Frequency
E        2           2/20 = 0.10          10
D        3           3/20 = 0.15          15
C        5           5/20 = 0.25          25
B        9           9/20 = 0.45          45
A        1           1/20 = 0.05          5
Total    20          1                    100
2. Summarizing quantitative data:
- Frequency distribution, relative frequency distribution, percent
frequency distribution, cumulative frequency distribution,
cumulative relative frequency distribution, cumulative percent
frequency distribution.
- Histogram, Ogive, and stem-and-leaf display.
Example:
Suppose we have the following data:
30  79  59  65  40  64  52  53  57
39  61  47  50  60  48  50  58  67
Suppose the number of nonoverlapping classes is determined to be 5.
Please construct the frequency distribution table (including frequency,
percent frequency, cumulative frequency, and cumulative percent
frequency) for the data.
[solution:]
$$\text{Approximate class width}=\frac{79-30}{5}=9.8$$
Therefore, the class width is 10.
Thus,
Class    Frequency   Percent Frequency   Cumulative Frequency   Cumulative Percent Frequency
30-39    2           (2/18)100 = 11      2                      11
40-49    3           (3/18)100 = 17      5                      28
50-59    7           (7/18)100 = 39      12                     67
60-69    5           (5/18)100 = 28      17                     95
70-79    1           (1/18)100 = 5       18                     100
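The counting behind the frequency table can be reproduced with a short script. This is a minimal sketch: the variable names are illustrative, and the class limits (width 10, starting at 30) are taken from the solution above.

```python
data = [30, 79, 59, 65, 40, 64, 52, 53, 57,
        39, 61, 47, 50, 60, 48, 50, 58, 67]

start, width, n_classes = 30, 10, 5

# Frequency of each class [start + i*width, start + (i+1)*width).
freq = [sum(1 for x in data if start + i * width <= x < start + (i + 1) * width)
        for i in range(n_classes)]

# Cumulative frequencies as running totals.
cum_freq = []
running = 0
for f in freq:
    running += f
    cum_freq.append(running)

for i in range(n_classes):
    lower = start + i * width
    percent = 100 * freq[i] / len(data)
    print(f"{lower}-{lower + width - 1}", freq[i], round(percent, 1), cum_freq[i])
```

The printed percentages are unrounded to one decimal, so they can differ by a point from the whole-number percentages in the table, which were chosen to add up to 100.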
Chapter 3: Key Formulas, pp. 128-129.
Example:
Suppose we have the following data:
Rent      Frequency     Rent      Frequency
420-439   8             520-539   4
440-459   17            540-559   2
460-479   12            560-579   4
480-499   8             580-599   2
500-519   7             600-619   6
What are the mean rent and the sample variance for the rent?
[solution:]
$$\bar{x}_g=\frac{\sum_{i=1}^{10}f_iM_i}{70},$$
where $f_i$ is the frequency of class i, $M_i$ is the midpoint of
class i, and $n=70$ is the sample size. Then,
Rent      f_i   M_i       Rent      f_i   M_i
420-439   8     429.5     520-539   4     529.5
440-459   17    449.5     540-559   2     549.5
460-479   12    469.5     560-579   4     569.5
480-499   8     489.5     580-599   2     589.5
500-519   7     509.5     600-619   6     609.5
Thus,
$$\sum_{i=1}^{10}f_iM_i=34525\quad\text{and}\quad\bar{x}_g=\frac{34525}{70}=493.21.$$
For the sample variance,
$$s_g^2=\frac{\sum_{i=1}^{10}f_i\left(M_i-\bar{x}_g\right)^2}{70-1}=\frac{208234.287}{69}=3017.89$$
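The grouped mean and grouped sample variance can be recomputed from the frequencies and class midpoints. A minimal sketch with illustrative variable names:

```python
freqs = [8, 17, 12, 8, 7, 4, 2, 4, 2, 6]
# Class midpoints 429.5, 449.5, ..., 609.5 (classes are 20 wide).
mids = [429.5 + 20 * i for i in range(10)]

n = sum(freqs)
weighted_sum = sum(f * m for f, m in zip(freqs, mids))
mean_g = weighted_sum / n
var_g = sum(f * (m - mean_g) ** 2 for f, m in zip(freqs, mids)) / (n - 1)

print(n, weighted_sum, round(mean_g, 2), round(var_g, 2))
```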
Chapter 4:
- Tabular and Graphical Methods: Crosstabulation (qualitative
and quantitative data) and Scatter Diagram (only quantitative
data).
- Numerical Method: Covariance and Correlation Coefficient.
Chapter 5:
1. Multiple Step Experiments, Permutations, and Combinations:
Example:
How many committees consisting of 3 female and 5 male students can be
selected from a group of 5 female and 8 male students?
[solution:]
$$\binom{5}{3}\binom{8}{5}=\frac{5!}{3!\,2!}\cdot\frac{8!}{5!\,3!}=10\times 56=560$$
2. Event, Addition Law, Mutually Exclusive Events and Independent
Event:
Example:
Assume you are taking two courses this semester (S and C). The
probability that you will pass course S is 0.835, the probability that you
will pass both courses is 0.276. The probability that you will pass at least
one of the courses is 0.981.
(a) What is the probability that you will pass course C?
(b) Are the events of passing the two courses independent?
(c) Are the events of passing the courses mutually exclusive? Explain.
[solution:]
(a)
Let A be the event of passing course S and B be the event of passing
course C. Thus,
$P(A)=0.835$, $P(A\cap B)=0.276$, $P(A\cup B)=0.981$.
$$\Rightarrow P(A^c\cap B)=P(A\cup B)-P(A)=0.981-0.835=0.146$$
$$\Rightarrow P(B)=P(A\cap B)+P(A^c\cap B)=0.276+0.146=0.422.$$
(b)
$$P(A\mid B)=\frac{P(A\cap B)}{P(B)}=\frac{0.276}{0.422}=0.654\ne P(A)=0.835$$
Thus, events A and B are not independent. That is, passing the two courses
are not independent events.
(c)
Since $P(A\cap B)=0.276\ne 0$, events A and B are not mutually exclusive.
Review 2
Chapter 4
Bayes’ Theorem:
Example:
In a random sample of Tung Hai University students 50% indicated they
are business majors, 40% engineering majors, and 10% other majors. Of
the business majors, 60% were female; whereas, 30% of engineering
majors were females. Finally, 20% of the other majors were female.
Given that a person is female, what is the probability that she is an
engineering major?
[solution:]
Let
A1: the students are engineering majors
A2: the students are business majors
A3: the students are other majors
(A1, A2 and A3 are mutually exclusive and together cover all students.)
B: the students are female.
Originally, we know
$P(A_1)=0.4$, $P(A_2)=0.5$, $P(A_3)=0.1$, $P(B\mid A_1)=0.3$, $P(B\mid A_2)=0.6$, $P(B\mid A_3)=0.2$.
Then, by Bayes’ theorem,
$$P(A_1\mid B)=\frac{P(A_1)P(B\mid A_1)}{P(A_1)P(B\mid A_1)+P(A_2)P(B\mid A_2)+P(A_3)P(B\mid A_3)}=\frac{0.4\times 0.3}{0.4\times 0.3+0.5\times 0.6+0.1\times 0.2}=0.2727.$$
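Bayes' theorem as used in this example amounts to weighting each prior by its conditional probability and dividing by the total. The sketch below is illustrative; `bayes_posterior` is a helper name introduced here, and the inputs are the priors and conditional probabilities listed in the solution.

```python
def bayes_posterior(priors, likelihoods):
    """Return (posteriors P(A_j | B), total probability P(B))."""
    joint = [p * l for p, l in zip(priors, likelihoods)]  # P(A_j) P(B | A_j)
    total = sum(joint)                                    # P(B)
    return [j / total for j in joint], total

# Order: engineering (A1), business (A2), other (A3).
posteriors, p_female = bayes_posterior([0.4, 0.5, 0.1], [0.3, 0.6, 0.2])
print(round(p_female, 2), [round(q, 4) for q in posteriors])
```

The first posterior is the probability that a female student is an engineering major; the three posteriors add up to 1.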
Chapter 5
1. Random Variables, Discrete Probability Function, Expected Value
and Variance
Example:
The probability distribution function for a discrete random variable X is
$$f(x)=\begin{cases}2k, & x=1\\ 3k, & x=3\\ 4k, & x=5\\ 0, & \text{otherwise}\end{cases}$$
where k is some constant. Please find
(a) k. (b) $P(X\ge 2)$. (c) $E(X)$ and $Var(X)$.
[solution:]
(a)
$$\sum_x f(x)=f(1)+f(3)+f(5)=2k+3k+4k=9k=1\ \Leftrightarrow\ k=\frac19.$$
(b)
$$P(X\ge 2)=P(X=3\text{ or }X=5)=P(X=3)+P(X=5)=f(3)+f(5)=3k+4k=7k=7\cdot\frac19=\frac79.$$
(c)
$$\mu=E(X)=\sum_x xf(x)=1\cdot f(1)+3\cdot f(3)+5\cdot f(5)=1\cdot\frac29+3\cdot\frac39+5\cdot\frac49=\frac{31}{9}$$
and
$$Var(X)=\sum_x(x-\mu)^2f(x)=\left(1-\frac{31}{9}\right)^2f(1)+\left(3-\frac{31}{9}\right)^2f(3)+\left(5-\frac{31}{9}\right)^2f(5)$$
$$=\frac{22^2}{81}\cdot\frac29+\frac{16}{81}\cdot\frac39+\frac{14^2}{81}\cdot\frac49=\frac{200}{81}$$
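The fractions in this solution are easy to verify exactly. A minimal sketch using Python's standard `fractions.Fraction`; the dictionary `weights` simply records the multiples of k given in the problem.

```python
from fractions import Fraction

# f(x) = weight * k at x = 1, 3, 5.
weights = {1: 2, 3: 3, 5: 4}

k = Fraction(1, sum(weights.values()))       # probabilities must sum to 1
f = {x: w * k for x, w in weights.items()}

p_at_least_2 = sum(p for x, p in f.items() if x >= 2)
mean = sum(x * p for x, p in f.items())
var = sum((x - mean) ** 2 * p for x, p in f.items())

print(k, p_at_least_2, mean, var)  # 1/9 7/9 31/9 200/81
```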
2. Continuous Probability Density Function, Expected Value and
Variance.
Example:
The probability density function for a continuous random variable X is
$$f(x)=\begin{cases}a+bx^2, & 0\le x\le 1\\ 0, & \text{otherwise}\end{cases}$$
where a, b are some constants. Please find
(a) a, b if $E(X)=\frac35$. (b) $Var(X)$.
[solution:]
(a)
$$\int_0^1 f(x)\,dx=1\ \Leftrightarrow\ \int_0^1\left(a+bx^2\right)dx=1\ \Leftrightarrow\ \left[ax+\frac b3x^3\right]_0^1=1\ \Leftrightarrow\ a+\frac b3=1$$
and
$$E(X)=\int_0^1 xf(x)\,dx=\int_0^1 x\left(a+bx^2\right)dx=\left[\frac a2x^2+\frac b4x^4\right]_0^1=\frac a2+\frac b4=\frac35.$$
Solving the two equations, we have
$$a=\frac35,\quad b=\frac65.$$
(b)
$$f(x)=\begin{cases}\frac35+\frac65x^2, & 0\le x\le 1\\ 0, & \text{otherwise.}\end{cases}$$
Thus,
$$Var(X)=E\left[X-E(X)\right]^2=E(X^2)-\left[E(X)\right]^2=E(X^2)-\left(\frac35\right)^2$$
$$=\int_0^1x^2f(x)\,dx-\frac{9}{25}=\int_0^1x^2\left(\frac35+\frac65x^2\right)dx-\frac{9}{25}$$
$$=\left[\frac15x^3+\frac{6}{25}x^5\right]_0^1-\frac{9}{25}=\frac15+\frac{6}{25}-\frac{9}{25}=\frac{2}{25}$$
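The two conditions in part (a) form a small linear system, and the moments of $a+bx^2$ on [0,1] have closed forms, so the whole solution can be checked exactly. The sketch below uses `fractions.Fraction` and Cramer's rule; `moment` is an illustrative helper that evaluates the integral of $x^k(a+bx^2)$ over [0,1] as $a/(k+1)+b/(k+3)$.

```python
from fractions import Fraction as F

# System:  a + b/3 = 1   (total probability)
#          a/2 + b/4 = 3/5   (E(X) = 3/5)
det = F(1) * F(1, 4) - F(1, 3) * F(1, 2)
a = (F(1) * F(1, 4) - F(1, 3) * F(3, 5)) / det
b = (F(1) * F(3, 5) - F(1, 2) * F(1)) / det

def moment(k):
    """Integral of x^k * (a + b*x^2) over [0, 1]."""
    return a / (k + 1) + b / (k + 3)

var = moment(2) - moment(1) ** 2
print(a, b, moment(0), moment(1), var)  # 3/5 6/5 1 3/5 2/25
```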
Chapter 6
1. Uniform Probability Density Function, Normal Probability
Density Function, and Exponential Probability Density Function:
Example:
The number of customers arriving at Taiwan Bank is Poisson distributed
with a mean of 4 customers per minute.
(a) Within 2 minutes, what is the probability that exactly 3 customers arrive?
(b) What is the probability density function for the time until the
arrival of the next customer?
[solution:]
(a)
Let
Y: the number of customers arriving within 2 minutes.
Then,
$$E(Y)=\lambda=2\times 4=8\ \text{(customers per two minutes)}$$
and
$$P(Y=i)=\frac{e^{-\lambda}\lambda^i}{i!}=\frac{e^{-8}8^i}{i!},\quad i=0,1,2,\ldots$$
Thus,
$$P(Y=3)=\frac{e^{-8}8^3}{3!}=0.0286.$$
(b)
X: the time until the arrival of the next customer.
The average time until the arrival of the next customer is
$$\mu=\frac{t}{\lambda}=\frac28=\frac14\ \text{(minute)}.$$
X has the exponential density function with mean $\frac14$,
$$f(x)=\frac1\mu e^{-\frac{x}{\mu}}=\frac{1}{1/4}e^{-\frac{x}{1/4}}=4e^{-4x},\quad x\ge 0.$$
Review 2
Chapters 3, 4
Measures of Location, Dispersion, Exploratory Data Analysis,
Measure of Relative Location, Weighted and Grouped Mean and
Variance, Association between Two Variables
Example:
The flashlight batteries produced by one of the manufacturers are known
to have an average life of 60 hours with a standard deviation of 4 hours.
(a) At least what percentage of batteries will have a life of 54 to 66 hours?
(b) At least what percentage of the batteries will have a life of 52 to 68
hours?
(c) Determine an interval for the batteries’ lives that will be true for at
least 80% of the batteries.
[solution:]
Denote
$\bar{x}=60$, $s=4$.
(a)
$$[54,66]=60\pm 6=\bar{x}\pm 1.5s$$
Thus, by Chebyshev’s theorem, within 1.5 standard deviations of the mean, there is at
least
$$\left(1-\frac{1}{1.5^2}\right)\times 100\%=55.56\%$$
of batteries.
(b)
$$[52,68]=60\pm 8=\bar{x}\pm 2s$$
Thus, by Chebyshev’s theorem, within 2 standard deviations of the mean, there is at
least
$$\left(1-\frac{1}{2^2}\right)\times 100\%=75\%$$
of batteries.
(c)
$$\left(1-\frac{1}{k^2}\right)\times 100\%=80\%\ \Leftrightarrow\ 1-\frac{1}{k^2}=0.8\ \Leftrightarrow\ k=\sqrt5$$
Thus, within $\sqrt5$ standard deviations of the mean, there is at least 80% of batteries.
Therefore, the interval is
$$\bar{x}\pm\sqrt5\,s=60\pm\sqrt5\times 4=60\pm 8.94=[51.06,\ 68.94].$$
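The three parts of the Chebyshev example use the same bound, 1 - 1/k², read in two directions. A small sketch with illustrative helper names (`chebyshev_bound`, `k_for_coverage`):

```python
from math import sqrt

def chebyshev_bound(k):
    """Minimum proportion of data within k standard deviations of the mean (k > 1)."""
    return 1 - 1 / k ** 2

def k_for_coverage(p):
    """Number of standard deviations needed to guarantee proportion p."""
    return sqrt(1 / (1 - p))

mean, s = 60, 4
print(round(chebyshev_bound((66 - mean) / s), 4))   # part (a), k = 1.5
print(chebyshev_bound((68 - mean) / s))             # part (b), k = 2
k = k_for_coverage(0.8)                             # part (c), k = sqrt(5)
print(round(mean - k * s, 2), round(mean + k * s, 2))
```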
Chapter 5
Basic Relationships of Probability, Conditional Probability and
Bayes’ Theorem
Example:
The following are the data on the gender and marital status of 200
customers of a company.
          Single   Married
Male      20       100
Female    30       50
(a) What is the probability of finding a single female customer?
(b) What is the probability of finding a married male customer?
(c) If a customer is female, what is the probability that she is single?
(d) What percentage of customers is male?
(e) If a customer is male, what is the probability that he is married?
(f) Are gender and marital status mutually exclusive? Explain.
(g) Is marital status independent of gender? Explain.
[solution:]
A1: the customers are single
A2: the customers are married
B1: the customers are male.
B2: the customers are female.
(a)
$$P(A_1\cap B_2)=\frac{30}{200}=0.15$$
(b)
$$P(A_2\cap B_1)=\frac{100}{200}=0.5$$
(c)
$$P(A_1\mid B_2)=\frac{P(A_1\cap B_2)}{P(B_2)}.$$
Since
$$P(B_2)=P(A_1\cap B_2)+P(A_2\cap B_2)=\frac{30}{200}+\frac{50}{200}=\frac{80}{200},$$
$$P(A_1\mid B_2)=\frac{P(A_1\cap B_2)}{P(B_2)}=\frac{30/200}{80/200}=\frac{30}{80}=0.375.$$
(d)
$$P(B_1)=P(A_1\cap B_1)+P(A_2\cap B_1)=\frac{20}{200}+\frac{100}{200}=\frac{120}{200}=0.6$$
(e)
$$P(A_2\mid B_1)=\frac{P(A_2\cap B_1)}{P(B_1)}=\frac{100/200}{120/200}=\frac{100}{120}=\frac56.$$
(f)
Gender and marital status are not mutually exclusive since
$P(A_1\cap B_1)\ne 0$.
(g)
Gender and marital status are not independent since
$$P(A_1\mid B_2)=\frac{30}{80}\ne\frac{50}{200}=P(A_1).$$
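All answers in this example come from the same four counts, so they can be recomputed from the table. This is an illustrative sketch; the dictionary layout and the helper names `prob` and `cond` are choices made here, not part of the course material.

```python
counts = {
    ("single", "male"): 20, ("married", "male"): 100,
    ("single", "female"): 30, ("married", "female"): 50,
}
total = sum(counts.values())

def prob(status=None, gender=None):
    """Joint or marginal probability; None means 'any value'."""
    matched = sum(c for (s, g), c in counts.items()
                  if (status is None or s == status)
                  and (gender is None or g == gender))
    return matched / total

def cond(status, gender):
    """P(status | gender)."""
    return prob(status, gender) / prob(gender=gender)

print(prob("single", "female"), prob("married", "male"))   # (a), (b)
print(cond("single", "female"), prob(gender="male"))       # (c), (d)
print(cond("married", "male"))                             # (e)
# (g): independence would require P(single | female) == P(single).
print(cond("single", "female"), prob(status="single"))
```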
Example:
In a recent survey in a Statistics class, it was determined that only 60% of
the students attend class on Thursday. From past data it was noted that
98% of those who went to class on Thursday pass the course, while only
20% of those who did not go to class on Thursday passed the course.
(a) What percentage of students is expected to pass the course?
(b) Given that a student passes the course, what is the probability that
he/she attended classes on Thursday.
[solution:]
A1: the students attend class on Thursday
A2: the students do not attend class on Thursday
(A1 and A2 are mutually exclusive and together cover all students.)
B1: the students pass the course
B2: the students do not pass the course
$P(A_1)=0.6$, $P(A_2)=1-P(A_1)=0.4$, $P(B_1\mid A_1)=0.98$, $P(B_1\mid A_2)=0.2$
(a)
$$P(B_1)=P(B_1\cap A_1)+P(B_1\cap A_2)=P(A_1)P(B_1\mid A_1)+P(A_2)P(B_1\mid A_2)=0.6\times 0.98+0.4\times 0.2=0.668$$
(b)
By Bayes’ theorem,
$$P(A_1\mid B_1)=\frac{P(A_1\cap B_1)}{P(B_1)}=\frac{P(A_1)P(B_1\mid A_1)}{P(A_1)P(B_1\mid A_1)+P(A_2)P(B_1\mid A_2)}=\frac{0.6\times 0.98}{0.6\times 0.98+0.4\times 0.2}=\frac{0.588}{0.668}=0.880$$
Chapter 6
3. Random Variables, Discrete Probability Function and Continuous
Probability Density
Example:
The probability distribution function for a discrete random variable X is
$$f(x)=\begin{cases}2k, & x=1\\ 3k, & x=3\\ 4k, & x=5\\ 0, & \text{otherwise}\end{cases}$$
where k is some constant. Please find
(a) k. (b) $P(X\ge 2)$.
[solution:]
(a)
$$\sum_x f(x)=f(1)+f(3)+f(5)=2k+3k+4k=9k=1\ \Leftrightarrow\ k=\frac19.$$
(b)
$$P(X\ge 2)=P(X=3\text{ or }X=5)=P(X=3)+P(X=5)=f(3)+f(5)=3k+4k=7k=7\cdot\frac19=\frac79.$$