Medical Statistics (醫學統計方法)
[email protected]
陳 宏
Course Information
Reference books:
*Rosner, B. (2000). Fundamentals of Biostatistics. Fifth edition, Duxbury Press.
Bowerman, B. L. and O'Connell, R. T. (1990). Linear Statistical Models: An Applied Approach, 2nd edition, Duxbury Press.
Pagano, M. and Gauvreau, K. (2000). Principles of Biostatistics. 2nd edition, Duxbury Press.
Hamilton, L. C. (1992). Regression with Graphics, Duxbury Press.
Time: Fridays, 3:30–5:20 p.m.
Location: Lecture Hall 101, 1st floor, Basic Medical Sciences Building
Date      Instructor   Topic
93/9/17   李文宗        Introduction
9/24      程毅豪        Descriptive statistics; population and sample
10/1      陳 宏         Probability and uncertainty; examples of statistical distributions
10/8      陳 宏         Central limit theorem; estimation methods; confidence intervals
10/15     戴 政         Experimental design
10/22     TA (助教)     First examination
Exploratory Data Analysis and Statistical Inference
Ch 2: 2, 3, 4, 8, 9; Ch 3: 6; Ch 4: 8; Ch 5: 3, 4, 5; Ch 6: 2, 3, 5, 7, 10
10/1/2004
Topics:
Review of population and sample
How to obtain a good sample
Quantifying the gap between unknown quantities (parameters) and known quantities (estimates, i.e., statistics)
The central limit theorem and the normal distribution
Confidence intervals
The binomial distribution
Hospital-stay data
• The data in Table 2.11 are a sample from a larger data set
collected on persons discharged from a selected Pennsylvania
hospital as part of a retrospective chart review of antibiotic usage
in hospitals [7].
– The data are also given in Data Set HOSPITAL.DAT with documentation
in HOSPITAL.DOC on the data disk.
• Compute the mean and median for the duration of hospitalization
for the 25 patients.
– How? By hand or with computer software (R, S-PLUS, SPSS, SAS)
– Why? Data summary: central tendency versus spread
• Compute the standard deviation and range for the duration of
hospitalization for the 25 patients.
• It is of clinical interest to know if the duration of hospitalization
is affected by whether or not a patient has received antibiotics.
–Answer this question descriptively using either numeric or
graphic methods.
–Would you feel confident reporting your finding?
Table 2.11. Hospital-stay data for 25 patients

ID   Duration  Age  Sex    First temp.  First WBC (x10^3)  Received     Received bacterial  Service
no.  of stay        1 = M  following    following          antibiotic   culture             1 = med.
                    2 = F  admission    admission          1=yes 2=no   1=yes 2=no          2 = surg.
 1       5      30    2       99.0            8                2              2                1
 2      10      73    2       98.0            5                2              1                1
 3       6      40    2       99.0           12                2              2                2
 4      11      47    2       98.2            4                2              2                2
 5       5      25    2       98.5           11                2              2                2
 6      14      82    1       96.8            6                1              2                2
 7      30      60    1       99.5            8                1              1                1
 8      11      56    2       98.6            7                2              2                1
 9      17      43    2       98.0            7                2              2                1
10       3      50    1       98.0           12                2              1                2
11       9      59    2       97.6            7                2              1                1
12       3       4    1       97.8            3                2              2                2
13       8      22    2       99.5           11                1              2                2
14       8      33    2       98.4           14                1              1                2
15       5      20    2       98.4           11                2              1                2
16       5      32    1       99.0            9                2              2                2
17       7      36    1       99.2            6                1              2                2
18       4      69    1       98.0            6                2              2                2
19       3      47    1       97.0            5                1              2                1
20       7      22    1       98.2            6                2              2                2
21       9      11    1       98.2           10                2              2                2
22      11      19    1       98.6           14                1              2                2
23      11      67    2       97.6            4                2              2                1
24       9      43    2       98.6            5                2              2                2
25       4      41    2       98.0            5                2              2                1
Compute the mean and median in R
• Importing and exporting data
– Most programs (e.g. Excel), as well as humans, know how to deal with
rectangular tables in the form of tab-delimited text files.
– Type conversions: understand the conventions your input files use, namely the delimiter character (space, comma, or tab) and the end-of-line character, and set the corresponding read options accordingly.
• R (Splus)
duration<- c(5,10, 6,11, 5, 14, 30, 11, 17, 3, 9, 3, 8, 8, 5, 5, 7, 4, 3, 7, 9, 11, 11,9,4)
mean(duration); median(duration)
[1] 8.6 [1] 8
var(duration); sd(duration)
[1] 32.66667 [1] 5.715476
summary(duration)
Min. 1st Qu. Median  Mean 3rd Qu.  Max.
 3.0     5.0    8.0   8.6    11.0  30.0
range(duration)
[1] 3 30
hist(duration, freq=FALSE, ylim=c(0,0.09)); points(density(duration))
[Figure: Histogram of duration with a density overlay; x-axis: duration (0–30), y-axis: Density (0.00–0.08).]
Is the duration of hospitalization affected by receiving antibiotics?
antibiotics<- c(2, 2 , 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 1, 1, 2, 2, 1, 2, 1, 2, 2, 1, 2, 2, 2)
• Data editing
– noduration<- duration [antibiotics > 1.5]
– yesduration<- duration [antibiotics < 1.5]
– boxplot(duration~antibiotics)
– plot(duration~antibiotics)
• Question: Is there causation between taking antibiotics and duration of stay?
– In this data set, there is an association.
– How do we measure association? Refer to correlation.
– Can we use correlation to handle this data set?
• How do we study the relationship between the population and the sample?
• Statistics is a collection of procedures and principles for gaining
and processing information in order to make decisions when
faced with uncertainty.
What is your conclusion?
[Figure: boxplots of duration (5–30) by group; x-axis: Receiving antibiotics (1 = yes, 2 = no).]
Read Ch 2.9 (Case Study 1).
Population and Sample
• Example. Suppose we wish to estimate the proportion p of students
in NTU who have not showered or bathed for over a day. This poses
a number of questions.
– Whom do we mean by students?
– Suppose time is limited and we can only interview 20 students on campus. Is it important that our survey leads to a good representation of all students? How can we ensure this?
– Will students we question be embarrassed to admit if they have not bathed?
– Even if we can get truthful answers, will we be happy with our estimate if that
chosen sample turns out to include no women, or if it includes only computer
scientists?
• Example 6.8 Suppose we wish to characterize the distribution of
birthweights of all liveborn infants who were born in the United
States in 1998.
– Assume that the underlying distribution of birthweight has an expected value (or mean) μ and variance σ².
– Ideally, we wish to estimate μ and σ² exactly, based on the entire population of U.S. liveborn infants in 1998. But this task is difficult with such a large group.
– Instead, we decide to select a collection of n infants who are representative of this large group and use the birthweights x1, . . . , xn from this sample to help us estimate μ and σ².
How to choose a representative sample: probability sampling
(1) Simple Random Sampling
– Choosing n people out of N, there are C(N,n) possible selections, so each set of n people has probability 1/C(N,n) of being chosen.
– Use a random number table or a computer random number generator.
(2) Stratified Random Sampling
– Divide the population into several strata and take a simple random sample within each stratum.
– If the sampling cost is the same in every stratum and the within-stratum variances are equal, make the stratum sample sizes proportional to the stratum population sizes (proportional allocation).
– Generally more accurate than simple random sampling (when the strata differ greatly from one another).
The reference, target, or study population is the group we wish to
study.
–The random sample is selected from the study population.
A random sample is a selection of some members of the population
such that each member is independently chosen and has a known
nonzero probability of being selected.
A simple random sample is a random sample in which each group
member has the same probability of being selected.
Stratified random sampling
Suppose that we want to find the average height of adults in Taiwan. Consider the following stratified random sampling scheme:
– Divide the population into two strata, men and women, and take a simple random sample within each stratum.
– Compute the mean height within each of the two strata.
– Suppose the population proportions of men and women are 49% and 51%.
– What is the average height?
– In mathematical notation, E(X) = E[E(X|Y)].
– What is Y in the above example?
– We can calculate the mean height by:
• computing the mean height of men, T(1) = E(X|Y=1);
• computing the mean height of women, T(0) = E(X|Y=0);
• asking what E(T) is (T takes on the two values T(0) and T(1); what is the probability of getting T(0)?).
– Var(X) = Var[E(X|Y)] + E[Var(X|Y)]
• If there is no difference between the mean heights of men and women, how do you compute Var(X)? (A short R sketch of the mean calculation follows.)
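A minimal R sketch of the E(X) = E[E(X|Y)] calculation above. The 49%/51% proportions come from the slide; the two stratum mean heights are made-up values used only for illustration.

mean.men   <- 172      # hypothetical T(1) = E(X | Y = 1), in cm
mean.women <- 160      # hypothetical T(0) = E(X | Y = 0), in cm
w <- c(men = 0.49, women = 0.51)                 # P(Y = 1) and P(Y = 0)
# Law of total expectation: E(X) = sum over strata of P(Y = y) * E(X | Y = y)
overall.mean <- w["men"] * mean.men + w["women"] * mean.women
overall.mean    # the population mean height implied by the two stratum means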
Cluster Sampling (叢聚抽樣)
Divide the population into many similar clusters, select some of these clusters at random, and then take a complete census of the selected clusters.
– A list of every individual in the population is not required.
– All the sampling units are close to one another, which saves the sampling costs that grow with distance.
• Example 6.9 The Minnesota Heart Study seeks to
– accurately assess the prevalence and incidence of different types of
cardiovascular morbidity (such as heart attack and stroke) in the state of
Minnesota
– and to assess trends in these rates over time.
– It is impossible to survey every individual in the state and impractical to survey,
in person, a random sample of individuals in the state. The latter requires a
large number of interviewers to be dispersed throughout the state.
– Sampling scheme: Divide the state of Minnesota into geographically compact
regions or clusters. A random sample of clusters is then chosen for study, and
several interviewers are sent to each cluster selected.
– Enumerate all households in each selected cluster, and then survey all members of these households. If some cardiovascular morbidity is identified by interviewers,
then the relevant individuals are invited to be examined in more detail at a
centrally located health site within the cluster.
– The total sample of all interviewed subjects over the entire state is referred to
as a cluster sample.
How to choose a representative sample: probability sampling
Table 6.2 gives the birthweight from 1000 consecutive deliveries at
Boston City Hospital (serving a low-income population).
•Description of this data set:
– birthweight<- scan("D:/teaching/statistics/elementary/birthweight.txt")
– summary(birthweight); var(birthweight); sd(birthweight)
Min. 1st Qu. Median  Mean 3rd Qu.  Max.     Var     SD
 17     100    113   112     126   198   424.06  20.59
– par(mfrow=c(2, 2))
– hist(birthweight, freq=FALSE, ylim=c(0,0.025), main="nclass=20");
points(density(birthweight))
– hist(birthweight, freq=FALSE, nclass= 80, main="nclass=80", ylim=c(0,0.025));
points(density(birthweight))
– hist(birthweight, freq=FALSE, nclass= 160, main="nclass=160", ylim=c(0,0.025));
points(density(birthweight))
– hist(birthweight, freq=FALSE, nclass= 240, main="nclass=240", ylim=c(0,0.025));
points(density(birthweight))
[Figure: four histograms of birthweight with density overlays, using nclass = 20, 80, 160, and 240; x-axis: birthweight (roughly 0–200), y-axis: Density (0.000–0.020).]
How to choose a representative sample: simple random sampling
• How can I draw a simple random sample of size 50 from those 1000 birthweights?
– s1<- sample(birthweight,50, replace = FALSE); mean(s1); median(s1); sd(s1)
– Two runs gave: 112.62, 114.5, 20.61 and 108.18, 112.5, 22.69.
– Population: 112, 113, 20.59
– Random sample: fluctuation
– How much information about those 1000 birthweights can be revealed through an SRS of size 50?
– How do we quantify the fluctuations 112.62 − 112, 108.18 − 112, … ?
• Repeat this sampling scheme N times.
N<- 1000000
a1sample<- matrix(rep(0,N*3),ncol=3)
for (i in 1:N) {s1<- sample(birthweight,50, replace = FALSE); a1sample[i,]<- c(mean(s1),median(s1),sd(s1))}
apply(a1sample, 2, summary)
mean:   (97.7, 110.1, 112.0, 112.0, 113.9, 125.9) = (min, 25%, 50%, mean, 75%, max)
median: (98.0, 110.5, 113.5, 113.1, 115.5, 127.5) = (min, 25%, 50%, mean, 75%, max)
sd:     (10.64, 18.57, 20.27, 20.43, 22.15, 33.60) = (min, 25%, 50%, mean, 75%, max)
What does the above study try to convey?
• Suppose I ask 10 of you to take a random sample of size 50 and
work out its mean.
– Results: 111.44, 113.30, 113.02, 113.66, 112.64, 111.04, 110.58, 117.16,
111.26, 116.68
– You may be the one who gets 117.16.
– Why do you think that you will be the one who gets 111.44?
• How do we settle this difficulty?
• Law of Large Numbers:
– The average of a sequence of random variables with a common distribution converges (in the sense given below) to their common expectation as the size of the sequence goes to infinity.
– Set X̄_n = (X_1 + … + X_n)/n. Then, for every ε > 0, P(|X̄_n − μ| > ε) → 0 as n → ∞.
– In the above study n = 50. What is N?
– What is the probability?
Classical definition: equally likely outcomes. If you can count outcomes, you know how to calculate the probability.
– Challenge: what is the success rate of an operation?
Relative frequency definition: the limit of f/n as n → ∞, where f is the number of times the event occurs in n trials.
An illustration of LLN
testmean<- a1sample[,1]
s1<- rep(0,1000)
for (i in 1:1000) s1[i]<- mean(testmean[1:(100*i)])
plot(ts(s1), xlab="count by thousand", ylab="average", main="n=50")
abline(mean(birthweight), 0)
[Figure: n = 50. Running average of the simulated sample means plotted against the count (by thousand, 0–1000); the average (y-axis, about 111.7–112.0) settles at the horizontal line drawn at mean(birthweight).]
Sampling Distributions
• If you pick a simple random sample of size 1 from those 1000 birthweights, how would you describe it?
• table(birthweight)
– It gives 17(1), 22(1), 32(2), …, 115(24), 116(19), 120(25), 121(26), …, 198(1).
• Let X denote the sample that you are going to take.
• X is a discrete random variable taking the value 17 with probability 0.001, and so on.
• Consider the average of a random sample of size 50.
• Numerical descriptive measures calculated from the sample are called statistics.
• Statistics vary from sample to sample and hence are random variables.
• The probability distributions for statistics are called sampling distributions.
• In repeated sampling, they tell us what values of the statistics can occur and how often each value occurs.
Sampling distributions of a simple random sample
of size 50 from those 1000 birthweights
testmean<- a1sample[,1]
testmeannorm<- sqrt(50)*(testmean - mean(birthweight))/sd(birthweight)
hist(testmeannorm, freq=FALSE, ylim=c(0,0.4), main="R: N(0,1)"); points(density(testmeannorm))
x <- seq(-3, 3, len = 101)
y <- (1/sqrt(2*pi))*exp(-x^2/2)
points(x, y, type = "l", xaxt = "n", col = "red")
How do I quantify the difference between 117.16 and the average weight of the 1000 birthweights?
How do I quantify the difference between 111.44 and the average weight of the 1000 birthweights?
Where do you think 117.16 is located on the x-axis in the next slide?
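One way to answer these questions is to standardize each observed sample mean exactly as testmeannorm is constructed above. This is only a sketch; it assumes the birthweight vector read in earlier is still loaded, with mean about 112 and standard deviation 20.59.

# Where do the two observed sample means fall on the N(0,1) axis?
sqrt(50) * (117.16 - mean(birthweight)) / sd(birthweight)   # roughly  1.8
sqrt(50) * (111.44 - mean(birthweight)) / sd(birthweight)   # roughly -0.2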
[Figure: histogram and density of testmeannorm with the N(0,1) density (red curve) superimposed; x-axis: testmeannorm (−4 to 4), y-axis: Density (0–0.4).]
Types of Inference
• Estimation:
–Estimating or predicting the value of the parameter
–“What is (are) the most likely value(s) of μ or p?”
• Hypothesis Testing:
–Deciding about the value of a parameter based on some preconceived idea.
–“Did the sample come from a population with μ = 5 or p = .2?”
Types of Inference
• Examples:
–A consumer wants to estimate the average price of similar
homes in her city before putting her home on the market.
Estimation: Estimate μ, the average home price.
–A manufacturer wants to know if a new type of steel is more resistant to high temperatures than an old type was.
Hypothesis test: Is the new average resistance, μN, equal to the old average resistance, μO?
Types of Inference
• Whether you are estimating parameters or testing hypotheses,
statistical methods are important because they provide:
–Methods for making the inference (Next lecture)
–A numerical measure of the goodness or reliability of the
inference (confidence interval)
• An estimator is a rule, usually a formula, that tells you how to
calculate the estimate based on the sample.
–Point estimation: A single number is calculated to estimate
the parameter.
–Interval estimation: Two numbers are calculated to create
an interval within which the parameter is expected to lie.
Properties of Point Estimators
• Since an estimator is calculated from sample values, it varies
from sample to sample according to its sampling distribution.
• An estimator is unbiased if the mean of its sampling
distribution equals the parameter of interest.
–It does not systematically overestimate or underestimate the
target parameter.
Properties of
Point Estimators
• Of all the unbiased estimators, we prefer
the estimator whose sampling distribution
has the smallest spread or variability.
Measuring the Goodness
of an Estimator
• The distance between an estimate and the true value of the parameter is the error of estimation (the distance between the bullet and the bull's-eye).
• In this chapter the sample sizes are large, so that, because of the Central Limit Theorem, our unbiased estimators will have approximately normal distributions.
The Margin of Error
• For unbiased estimators with normal
sampling distributions, 95% of all point
estimates will lie within 1.96 standard
deviations of the parameter of interest.
•Margin of error: The maximum error of estimation, calculated as 1.96 × (standard error of the estimator).
Estimating Means and Proportions
•For a quantitative population,
Point estimator of the population mean μ: x̄
Margin of error (n ≥ 30): ±1.96 s/√n
•For a binomial population,
Point estimator of the population proportion p: p̂ = x/n
Margin of error (n ≥ 30): ±1.96 √(p̂q̂/n)
Example
• A homeowner randomly samples 64 homes similar to her own
and finds that the average selling price is $252,000 with a
standard deviation of $15,000. Estimate the average selling price
for all similar homes in the city.
Point estimator of μ: x̄ = 252,000
Margin of error: ±1.96 s/√n = ±1.96 × 15,000/√64 = ±3675
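A quick R check of this margin-of-error calculation, using only the numbers given in the example (a sketch, not part of the original slides):

xbar <- 252000; s <- 15000; n <- 64
me <- 1.96 * s / sqrt(n)                        # margin of error = 1.96 * 15000 / 8
c(estimate = xbar, margin.of.error = me, lower = xbar - me, upper = xbar + me)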
Simulating confidence intervals
• The applet is at www.stat.berkeley.edu/~stark/Java/Ci.htm
Explanation of the confidence-interval simulation
• The applet simulates an event with success probability 0.5 (= p), takes samples of size 20, and repeats this 100 times, constructing a 95% confidence interval for p each time.
– If you want samples of size 250, repeated 1000 times, for 99.7% confidence intervals with p = 0.4: change the first field below the figure (sample size) from 20 to 250 (the maximum is 250), the second field (samples to take) from 100 to 1000 (the maximum is 1000), and the third field (#SE) to 3 (it shows 2 on this screen because 95% confidence intervals are being computed).
– Changing p from 0.5 to 0.4 is more involved. In the tall column on the right of the screen you will see the two numbers 0 and 1; this represents a box containing the numbers 0 and 1, so that when you draw one number at random the probability of a 1 is 1/2. To simulate p = 0.4, design a box containing the five numbers 0, 0, 0, 1, 1 (2/5 = 0.4) by entering 0,0,0,1,1 in that rightmost column. To simulate p = 0.55 (= 11/20), you would enter eleven 1s and nine 0s.
• The number 0.92 at the bottom right of the screen means that 92 of the 100 simulated confidence intervals (92/100) contain 0.5 (intervals that do not contain it are drawn in red). In theory we expect 95 of the intervals to contain the true p, but because of randomness it will not be exactly 95, just as the probability of getting exactly 50 heads in 100 tosses of a fair coin is only about 0.07958924 (while the probability of getting 46 to 55 heads is about 0.6802727).
• When you first open the page you will see this screen without the green and red intervals, and the rightmost column will contain the five numbers 0, 1, 2, 3, 4. Clicking the Take Sample field at the top of the figure produces a blue line whose horizontal coordinate is 2 (the average of these five numbers).
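The same experiment can be reproduced in R instead of the applet. This is a minimal sketch: the intervals use the form p̂ ± 2·SE to mirror the applet's "#SE = 2" setting, and the observed coverage will fluctuate around 95% from run to run.

set.seed(1)
p <- 0.5; n <- 20; nrep <- 100
phat  <- rbinom(nrep, size = n, prob = p) / n      # 100 sample proportions, each from n = 20 trials
se    <- sqrt(phat * (1 - phat) / n)               # estimated standard error of each phat
lower <- phat - 2 * se
upper <- phat + 2 * se
mean(lower <= p & p <= upper)                      # fraction of the 100 intervals that cover the true p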
Sampling Distributions
Definition: The sampling distribution of a statistic is the
probability distribution for the possible values of the statistic
that results when random samples of size n are repeatedly
drawn from the population.
Population: 3, 5, 2, 1. Draw samples of size n = 3 without replacement.

Possible sample   x̄
3, 5, 2           10/3 = 3.33
3, 5, 1            9/3 = 3.00
3, 2, 1            6/3 = 2.00
5, 2, 1            8/3 = 2.67

Each value of x̄ is equally likely, with probability 1/4 (this is the sampling distribution p(x̄)).
Sampling Distributions (without replacement)
a<- c(3,5,2,1); mean(a); sd(a)
2.75; 1.707825
N<- 1000000
a1sample<- rep(0,N)
for (i in 1:N) {s1<- sample(a,3, replace = FALSE);
a1sample[i]<- mean(s1)}
testmean<- a1sample
testmeannorm<- sqrt(3)*(testmean-mean(a))/sd(a)
hist(testmeannorm, freq=FALSE, main="R: N(0,1)")
x <- seq(-3, 3, len = 101)
y <- (1/sqrt(2*pi))*exp(-x^2/2)
points(x, y, type = "l", xaxt = "n",col = "red")
Why?
[Figure: histogram of testmeannorm compared with the N(0,1) density (red curve); because only four distinct samples are possible, the standardized mean takes just four values, between about −0.8 and 0.6, each with probability 1/4.]
table(testmean) gives (as relative frequencies over the N = 1000000 draws):
x̄:          2       2.67    3       3.33
frequency:   0.251   0.2497  0.2497  0.2497
Sampling Distributions (with replacement)
a<- c(3,5,2,1); mean(a); sd(a)
2.75; 1.707825
N<- 1000000
a1sample<- rep(0,N); n<- 3
for (i in 1:N) {s1<- sample(a,n, replace = TRUE);
a1sample[i]<- mean(s1)}
testmean<- a1sample
testmeannorm<- sqrt(3)*(testmean-mean(a))/sd(a)
hist(testmeannorm, freq=FALSE, main="R: N(0,1) ")
points(density(testmeannorm))
x <- seq(-3, 3, len = 101)
y <- (1/sqrt(2*pi))*exp(-x^2/2)
points(x, y, type = "l", xaxt = "n",col = "red")
Why? n = 3 only
[Figure: histogram and density of testmeannorm for samples of size n = 3 drawn with replacement, compared with the N(0,1) density (red curve); x-axis roughly −1 to 2, y-axis: Density (0–0.8).]
Sampling Distributions (with replacement)
a<- c(3,5,2,1); mean(a); sd(a)
2.75; 1.707825
N<- 1000000
a1sample<- rep(0,N); n<- 20
for (i in 1:N) {s1<- sample(a,n, replace = TRUE);
a1sample[i]<- mean(s1)}
testmean<- a1sample
testmeannorm<- sqrt(n)*(testmean-mean(a))/sd(a)
hist(testmeannorm, freq=FALSE, main="R: N(0,1); n=20")
points(density(testmeannorm))
x <- seq(-3, 3, len = 101)
y <- (1/sqrt(2*pi))*exp(-x^2/2)
points(x, y, type = "l", xaxt = "n",col = "red")
[Figure: histogram and density of testmeannorm for samples of size n = 20 drawn with replacement, compared with the N(0,1) density (red curve); x-axis: testmeannorm (−4 to 4), y-axis: Density (0–0.4).]
Sampling Distributions
Sampling distributions for statistics can be
Approximated with simulation techniques
Derived using mathematical theorems
The Central Limit Theorem is one such theorem.
Central Limit Theorem: If random samples of n observations are drawn from a nonnormal population with finite mean μ and standard deviation σ, then, when n is large, the sampling distribution of the sample mean x̄ is approximately normal, with mean μ and standard deviation σ/√n. The approximation becomes more accurate as n becomes large.
Example
Toss a fair die n = 1 time. The distribution of x, the number on the upper face, is flat or uniform.
μ = Σ x p(x) = 1(1/6) + 2(1/6) + … + 6(1/6) = 3.5
σ = √[Σ (x − μ)² p(x)] = 1.71
Denote by X the outcome of a simple random sample of size 1 from those 1000 birthweights; how would you describe its mean and variance?
Example
Toss a fair die n = 2 times. The distribution of x̄, the average of the two upper faces, is mound-shaped.
Mean: μ = 3.5
Std Dev: σ/√2 = 1.71/√2 = 1.21
Example
Toss a fair die n = 3 times. The distribution of x̄, the average of the three upper faces, is approximately normal.
Mean: μ = 3.5
Std Dev: σ/√3 = 1.71/√3 = .987
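These die-tossing pictures can be reproduced with a short simulation, in the same spirit as the birthweight code earlier. This is a sketch only; the number of replications is arbitrary.

set.seed(1)
die <- 1:6
sim.mean <- function(n, N = 100000) replicate(N, mean(sample(die, n, replace = TRUE)))
par(mfrow = c(1, 3))
hist(sim.mean(1), main = "n = 1", xlab = "average")   # flat (uniform)
hist(sim.mean(2), main = "n = 2", xlab = "average")   # mound-shaped
hist(sim.mean(3), main = "n = 3", xlab = "average")   # approximately normal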
Why is this Important?
• The Central Limit Theorem also implies that the sum of n measurements is approximately normal with mean nμ and standard deviation σ√n (variance nσ²).
• Many statistics that are used for statistical inference are sums or averages of sample measurements.
• When n is large, these statistics will have approximately normal distributions.
• This will allow us to describe their behavior and evaluate the reliability of our inferences.
How Large is Large?
If the sampled population is normal, then the sampling distribution of x̄ will also be normal, no matter what the sample size.
When the sampled population is approximately symmetric, the sampling distribution becomes approximately normal for relatively small values of n.
When the sampled population is skewed, the sample size must be at least 30 before the sampling distribution of x̄ becomes approximately normal.
The Sampling Distribution of the Sample Mean
A random sample of size n is selected from a population with mean μ and standard deviation σ.
The sampling distribution of the sample mean x̄ will have mean μ and standard deviation σ/√n.
If the original population is normal, the sampling distribution will be normal for any sample size.
If the original population is nonnormal, the sampling distribution will be normal when n is large.
The standard deviation of x̄ is sometimes called the STANDARD ERROR (SE).
Finding Probabilities for the Sample Mean
If the sampling distribution of x̄ is normal or approximately normal, standardize or rescale the interval of interest in terms of
z = (x̄ − μ) / (σ/√n)
Find the appropriate area using Table 3.
Example: A random sample of size n = 16 is taken from a normal distribution with μ = 10 and σ = 8.
P(x̄ ≥ 12) = P(z ≥ (12 − 10)/(8/√16)) = P(z ≥ 1) = 1 − .8413 = .1587
Example
A soda filling machine is supposed to fill cans of
soda with 12 fluid ounces. Suppose that the fills are actually
normally distributed with a mean of 12.1 oz and a standard
deviation of .2 oz. What is the probability that the average fill for
a 6-pack of soda is less than 12 oz?
P(x̄ ≤ 12) = P((x̄ − μ)/(σ/√n) ≤ (12 − 12.1)/(.2/√6)) = P(z ≤ −1.22) = .1112
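Both normal-probability calculations can be checked with pnorm, which replaces the table lookup (a sketch; the second value differs slightly from .1112 because the slide rounds z to −1.22):

1 - pnorm(12, mean = 10, sd = 8 / sqrt(16))     # P(xbar >= 12) for n = 16, mu = 10, sigma = 8: 0.1587
pnorm(12, mean = 12.1, sd = 0.2 / sqrt(6))      # P(xbar <= 12) for the soda example: about 0.110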
How do we check association?
• Two events A and B are independent if and only if P(A ∩ B) = P(A) P(B).
• Conditional probability: given that one event (A) has occurred, the probability that another event (B) occurs.
• P(B|A) = P(A ∩ B) / P(A)
• If A and B are independent, then P(B|A) = P(B).
• A study of helmets and head injury gave the following data:

                  Helmet worn   No helmet   Row total
Head injury            17           218         235
No head injury        130           428         558
Column total          147           646         793

The estimated proportion of head injury among those who wore a helmet, P(head injury | helmet worn), is 17/147 ≈ 11.6%.
The estimated proportion of head injury among those who did not wear a helmet, P(head injury | no helmet), is 218/646 ≈ 33.7%.
Read the definitions of risk ratio and odds ratio in Chapter 13.
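A sketch of how these conditional proportions, and the risk ratio mentioned above, can be computed in R from the 2×2 table (the object and dimension names are arbitrary):

injury <- matrix(c(17, 130, 218, 428), nrow = 2,
                 dimnames = list(head.injury = c("yes", "no"),
                                 helmet = c("worn", "not.worn")))
p.injury <- injury["yes", ] / colSums(injury)    # P(head injury | helmet group)
p.injury                                         # about 0.116 (worn) and 0.337 (not worn)
p.injury["not.worn"] / p.injury["worn"]          # estimated risk ratio, not worn vs. worn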
Example 1
• Toss a fair coin twice. Define
– A: head on second toss
– B: head on first toss
The four outcomes HH, HT, TH, TT each have probability 1/4.
P(A|B) = 1/2
P(A|not B) = 1/2
P(A) does not change, whether B happens or not, so A and B are independent!
Example 2
• A bowl contains five M&Ms®, two red and three blue. Randomly select two candies, and define
– A: second candy is red.
– B: first candy is blue.
P(A|B) = P(2nd red | 1st blue) = 2/4 = 1/2
P(A|not B) = P(2nd red | 1st red) = 1/4
P(A) does change, depending on whether B happens or not, so A and B are dependent!
Recall the question on whether the duration of hospitalization
is affected by receiving antibiotics.
Defining Independence
• We can redefine independence in terms of conditional
probabilities:
Two events A and B are independent if and only if
P(A|B) = P(A)
or
P(B|A) = P(B)
Otherwise, they are dependent.
• Once you’ve decided whether or not two events are independent,
you can use the following rule to calculate their intersection.
The Multiplicative Rule for Intersections
• For any two events A and B, the probability that both A and B occur is
P(A ∩ B) = P(A) P(B given that A occurred) = P(A) P(B|A)
• If the events A and B are independent, then the probability that both A and B occur is
P(A ∩ B) = P(A) P(B)
Example 1
In a certain population, 10% of the people can be classified as being at high risk for a heart attack. Three people are randomly selected from this population. What is the probability that exactly one of the three is high risk?
Define H: high risk, N: not high risk.
P(exactly one high risk) = P(HNN) + P(NHN) + P(NNH)
= P(H)P(N)P(N) + P(N)P(H)P(N) + P(N)P(N)P(H)
= (.1)(.9)(.9) + (.9)(.1)(.9) + (.9)(.9)(.1) = 3(.1)(.9)² = .243
Connection with coin tossing: flip a coin (with probability 0.1 of getting a head) three times and count the heads.
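This is exactly a binomial probability with n = 3 and p = 0.1, which can be checked with one line of R (a sketch):

dbinom(1, size = 3, prob = 0.1)    # P(exactly one of three is high risk) = 0.243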
Example 2
Suppose we have additional information in the
previous example. We know that only 49% of the population are
female. Also, of the female patients, 8% are high risk. A single
person is selected at random. What is the probability that it is a
high risk female?
Define H: high risk
F: female
From the example, P(F) = .49 and P(H|F) = .08. Use the
Multiplicative Rule:
P(high risk female) = P(H ∩ F) = P(F) P(H|F) = .49(.08) = .0392
The Law of Total Probability
• Let S1, S2, S3, …, Sk be mutually exclusive and exhaustive events (that is, one and only one must happen). Then the probability of another event A can be written as
P(A) = P(A ∩ S1) + P(A ∩ S2) + … + P(A ∩ Sk)
     = P(S1)P(A|S1) + P(S2)P(A|S2) + … + P(Sk)P(A|Sk)
[Figure: the events S1, S2, …, Sk partition the sample space, and A is split into the pieces A ∩ S1, …, A ∩ Sk.]
Bayes’ Rule
• Let S1, S2, S3, …, Sk be mutually exclusive and exhaustive events with prior probabilities P(S1), P(S2), …, P(Sk). If an event A occurs, the posterior probability of Si, given that A occurred, is
P(Si|A) = P(Si)P(A|Si) / [P(S1)P(A|S1) + … + P(Sk)P(A|Sk)],   for i = 1, 2, …, k
Example (risk factor)
From a previous example, we know that 49% of the population are female. Of the female patients, 8% are at high risk for heart attack, while 12% of the male patients are at high risk. A single person is selected at random and found to be at high risk. What is the probability that it is a male?
Define H: high risk, F: female, M: male.
We know: P(F) = .49, P(M) = .51, P(H|F) = .08, P(H|M) = .12.
P(M|H) = P(M)P(H|M) / [P(M)P(H|M) + P(F)P(H|F)]
       = .51(.12) / [.51(.12) + .49(.08)] = .61
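A sketch of the same Bayes' rule calculation in R (the vector names are arbitrary):

prior  <- c(M = 0.51, F = 0.49)          # P(M), P(F)
p.high <- c(M = 0.12, F = 0.08)          # P(H | M), P(H | F)
posterior <- prior * p.high / sum(prior * p.high)   # Bayes' rule
posterior["M"]                           # P(M | H), about 0.61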
Random Variables
• A quantitative variable x is a random variable if the value that it assumes, corresponding to the outcome of an experiment, is a chance or random event.
• Random variables can be discrete or continuous.
• Examples:
 x = SAT score for a randomly selected student
 x = number of people in a room at a randomly selected
time of day
 x = number on the upper face of a randomly tossed die
Probability Distributions for Discrete Random
Variables
• The probability distribution for a discrete random variable x
resembles the relative frequency distributions we constructed
in Chapter 2. It is a graph, table or formula that gives the
possible values of x and the probability p(x) associated with
each value.
We must have 0 ≤ p(x) ≤ 1 and Σ p(x) = 1.
Example
• Toss a fair coin three times and define x = number of heads.

Outcome   Probability   x
HHH       1/8           3
HHT       1/8           2
HTH       1/8           2
THH       1/8           2
HTT       1/8           1
THT       1/8           1
TTH       1/8           1
TTT       1/8           0

P(x = 0) = 1/8, P(x = 1) = 3/8, P(x = 2) = 3/8, P(x = 3) = 1/8

x      0     1     2     3
p(x)   1/8   3/8   3/8   1/8

Probability histogram for x.
Probability Distributions
• Probability distributions can be used to describe the
population, just as we described samples in Chapter 2.
– Shape: Symmetric, skewed, mound-shaped…
– Outliers: unusual or unlikely measurements
– Center and spread: mean and standard deviation. A population mean is called μ and a population standard deviation is called σ.
The Mean and Standard Deviation
• Let x be a discrete random variable with probability distribution p(x). Then the mean, variance, and standard deviation of x are given as
Mean: μ = Σ x p(x)
Variance: σ² = Σ (x − μ)² p(x)
Standard deviation: σ = √σ²
Example
• Toss a fair coin 3 times and record x, the number of heads.

x    p(x)   x p(x)   (x − μ)² p(x)
0    1/8    0        (−1.5)²(1/8)
1    3/8    3/8      (−0.5)²(3/8)
2    3/8    6/8      (0.5)²(3/8)
3    1/8    3/8      (1.5)²(1/8)

μ = Σ x p(x) = 12/8 = 1.5
σ² = Σ (x − μ)² p(x) = .28125 + .09375 + .09375 + .28125 = .75
σ = √.75 = .866
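The same mean and standard deviation can be computed directly in R from the probability distribution (a sketch):

x <- 0:3
p <- dbinom(x, size = 3, prob = 0.5)     # 1/8, 3/8, 3/8, 1/8
mu <- sum(x * p)                         # 1.5
sigma2 <- sum((x - mu)^2 * p)            # 0.75
sqrt(sigma2)                             # 0.866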
Example
• The probability distribution for x, the number of heads in tossing 3 fair coins:
– Shape? Symmetric; mound-shaped
– Outliers? None
– Center? μ = 1.5
– Spread? σ = .866
Key Concepts
• Population — The entire collection of entities about which one
wishes to make an inference or draw a conclusion about (also called
aggregate or universe).
• Sample — A subset of a population. Used because we usually
cannot measure all individuals in a population.
– It is the sample we observe, but the population we wish to know.
• Simple Random Sample — A sample of size n from a larger
population selected in such a way that every sample of size n has
the same chance of being selected.
• Parameter — The true value of some population attribute, which is
almost always unknown; or an unknown constant that describes a
key feature in a model for answering a question of interest.
– Parameters are often represented by Greek letters, such as μ for the population mean and σ for the population standard deviation.
• Statistic — Any quantity that is computed from sample observations.
• Probability — A set of mathematical tools to quantify concepts we understand intuitively, such as "likelihood" and "certainty." We use probability to gauge the amount of confidence to place on sample estimates.
Key Concepts
• Model — Some approximation of reality.
• Statistical model — A mathematical expression that helps us predict a response variable as a function of one or more explanatory variables, based on a set of assumptions. These assumptions allow the model not to fit exactly, and are made about random terms in the model called errors (e).
• Types of Variables:
– Refer to Ch9.1 of Rosner.
– Quantitative (continuous or cardinal) Data versus Qualitative (discrete or
categorical) Data
– Nominal—When data values for a variable are labels identifying a
category and their order is not meaningful. E.g., attributes such as sex,
race, and cause of death are nominal (meaning named) because the
categories do not represent some underlying, quantitative scale.
– Ordinal—Data values for a variable are labels identifying a category
and their order is meaningful. E.g., a person’s highest educational level might be recorded as ordinal, where the categories of interest might be
grade school, high school, college, and graduate school. Additional
examples are stage of cancer, severity, and preference.
Key Concepts
• Continuous—Data values for a variable are measured on a
continuous scale. E.g., body mass is often measured and recorded
on a continuous scale, where values such as 40 g or 3,154.2 g are
acceptable.
– It can be useful to distinguish between continuous and discrete data.
– Continuous data can be represented with any and all conceivable values
within a particular range, such as the height of a plant being 36.354 cm;
discrete data can be represented by only certain values within a particular
range, such as number of leaves on a plant, where 22, 185, or 45 are
possible, but 22.8 is not.
– Ratio— Examples are age, body weight, height, and blood pressure.
– Interval— Examples are Celsius and Fahrenheit temperatures.
Some Discrete Distributions
The Binomial Distribution
• The most commonly used discrete probability distribution is the
binomial distribution.
• An experiment which follows a binomial distribution will satisfy
the following requirements (think of repeatedly flipping a coin as
you read these):
–The experiment consists of n identical trials, where n is fixed
in advance.
–Each trial has two possible outcomes, S or F, which we denote
``success'' and ``failure'' and code as 1 and 0, respectively.
–The trials are independent, so the outcome of one trial has no
effect on the outcome of another.
–The probability of success, p=P(S), is constant from one trial
to another.
The Binomial Distribution
• The random variable X of a binomial distribution counts the
number of successes in n trials.
• Sampling distributions for counts and proportions
• The probability that X takes a particular value x is given by the formula
P(X = x) = C(n, x) p^x (1 − p)^(n − x), where 0 ≤ p ≤ 1 and x = 0, 1, …, n.
– E(X) = np and Var(X) = np(1 − p).
– A particularly important example of the use of the binomial distribution is sampling with replacement (this implies that p is constant).
– EXAMPLE: Suppose we have 10 balls in a bowl, 3 of them red and 7 blue. Define success S as drawing a red ball. If we sample with replacement, P(S) = 0.3 for every trial. Let's say n = 20; then P(X = 5) = 0.1789 (a quick R check appears after this list).
• Examples: the number of radiation leaks in 100 shipments of nuclear waste; the number of defective items among 200 tested components; the number of people in a city infected with a certain disease; the proportion of survey respondents who agree with a particular statement.
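A quick R check of the ball-drawing example above (a sketch):

dbinom(5, size = 20, prob = 0.3)                 # P(X = 5) when n = 20, p = 0.3: about 0.1789
c(mean = 20 * 0.3, variance = 20 * 0.3 * 0.7)    # E(X) = np and Var(X) = np(1 - p)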
The Hardy-Weinberg equilibrium law (1908): do carriers of the dominant gene keep increasing? (How do we reason about this?)
Suppose that in the first generation 20% of people carry genotype AA, 30% carry Aa, and 50% carry aa. The genotype distribution of the second generation can be worked out as follows:
• A denotes the dominant allele, a the recessive allele.

First-generation mating   Probability           Second-generation proportions: AA   Aa    aa
AA/AA                     20%×20% = 4%                                          1    0     0
AA/Aa                     2×20%×30% = 12%                                       1/2  1/2   0
AA/aa                     2×20%×50% = 20%                                       0    1     0
Aa/Aa                     30%×30% = 9%                                          1/4  1/2   1/4
Aa/aa                     2×30%×50% = 30%                                       0    1/2   1/2
aa/aa                     50%×50% = 25%                                         0    0     1
The Hardy-Weinberg equilibrium law
In the second generation, the proportions of AA, Aa, and aa are
• AA = 1×4% + 1/2×12% + 1/4×9% = 12.25%
• Aa = 1/2×12% + 1×20% + 1/2×9% + 1/2×30% = 45.5%
• aa = 1/4×9% + 1/2×30% + 1×25% = 42.25%

Second-generation mating   Probability                    Third-generation proportions: AA   Aa    aa
AA/AA                      12.25%×12.25% = 1.500625%                                     1    0     0
AA/Aa                      2×12.25%×45.5% = 11.1475%                                     1/2  1/2   0
AA/aa                      2×12.25%×42.25% = 10.35125%                                   0    1     0
Aa/Aa                      45.5%×45.5% = 20.7025%                                        1/4  1/2   1/4
Aa/aa                      2×45.5%×42.25% = 38.4475%                                     0    1/2   1/2
aa/aa                      42.25%×42.25% = 17.850625%                                    0    0     1
The Hardy-Weinberg equilibrium law
Continuing in this way, what are the proportions of AA, Aa, and aa in generation n (n > 3)?
• AA = 1×1.500625% + 1/2×11.1475% + 1/4×20.7025% = 12.25%
• Aa = 1/2×11.1475% + 1×10.35125% + 1/2×20.7025% + 1/2×38.4475% = 45.5%
• aa = 1/4×20.7025% + 1/2×38.4475% + 1×17.850625% = 42.25%
• Continuing in this way, the proportions of AA, Aa, and aa in every generation n (n > 3) are the same as in the second and third generations; this is the Hardy-Weinberg equilibrium law.
• Conditions that must hold:
(1) no mutation occurs
(2) no natural selection occurs
(3) the population is sufficiently large
(4) everyone marries
(5) mates are chosen at random (random mating)
(6) every individual produces about the same number of offspring
(7) there is no migration into or out of the population
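The recursion carried out in the tables above can be written as a short R function and iterated. This is a sketch only: under random mating, computing offspring proportions from the allele frequencies pA and pa is equivalent to summing over the mating table, and the hypothetical function below uses that shortcut.

next.gen <- function(g) {                       # g = c(AA, Aa, aa) genotype proportions
  pA <- unname(g["AA"] + g["Aa"] / 2)           # frequency of allele A
  pa <- 1 - pA                                  # frequency of allele a
  c(AA = pA^2, Aa = 2 * pA * pa, aa = pa^2)     # offspring proportions under random mating
}
g1 <- c(AA = 0.20, Aa = 0.30, aa = 0.50)
g2 <- next.gen(g1)      # 0.1225, 0.4550, 0.4225  (generation 2)
g3 <- next.gen(g2)      # unchanged from g2: Hardy-Weinberg equilibrium
rbind(g1, g2, g3)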
Descriptive statistics
• Minimum: the smallest of the observations; Maximum: the largest of the observations
• Median: the middle value (median) of all the observations
• Mode: the value that occurs most often among the observations
• Mean: X̄ = (Σ_{i=1}^{n} X_i) / n
• Sample variance: S² = Σ_{i=1}^{n} (X_i − X̄)² / (n − 1)
• Sample standard deviation: S = √[ Σ_{i=1}^{n} (X_i − X̄)² / (n − 1) ]
• Standard error (of the mean): computed as S / √n
Descriptive statistics
• Sample variance: S² = Σ_{i=1}^{n} (X_i − X̄)² / (n − 1)
• Kurtosis: computed as [ n(n+1) / ((n−1)(n−2)(n−3)) ] Σ_{i=1}^{n} (X_i − X̄)⁴ / S⁴ − 3(n−1)² / ((n−2)(n−3))
• Skewness: computed as [ n / ((n−1)(n−2)) ] Σ_{i=1}^{n} (X_i − X̄)³ / S³
• S: the sample standard deviation
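A sketch of these descriptive statistics written directly from the formulas above (the function name is arbitrary; x is any numeric vector, for example the duration data from the earlier slide):

desc.stats <- function(x) {
  n <- length(x); xbar <- mean(x); s <- sd(x)   # sd() already divides by n - 1
  skew <- n / ((n - 1) * (n - 2)) * sum((x - xbar)^3) / s^3
  kurt <- n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum((x - xbar)^4) / s^4 -
          3 * (n - 1)^2 / ((n - 2) * (n - 3))
  c(min = min(x), max = max(x), median = median(x), mean = xbar,
    var = var(x), sd = s, se = s / sqrt(n), skewness = skew, kurtosis = kurt)
}
# desc.stats(duration)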