Medical Statistics (醫學統計方法)
陳 宏 [email protected]

Course information

References:
* Rosner, B. (2000). Fundamentals of Biostatistics, 5th edition. Duxbury Press.
Bowerman, B. L. and O'Connell, R. T. (1990). Linear Statistical Models: An Applied Approach, 2nd edition. Duxbury Press.
Pagano, M. and Gauvreau, K. (2000). Principles of Biostatistics, 2nd edition. Duxbury Press.
Hamilton, L. C. (1992). Regression with Graphics. Duxbury Press.

Time: Fridays, 3:30 to 5:20 pm
Place: Lecture Hall 101, first floor, Basic Medical Sciences Building (基礎醫學大樓)

Date        Instructor          Topic
2004/9/17   李文宗              Introduction
9/24        程毅豪              Descriptive statistics, population, sample
10/1        陳 宏               Probability and uncertainty; examples of statistical distributions
10/8        陳 宏               Central limit theorem, estimation methods, confidence intervals
10/15       戴 政               Experimental design
10/22       Teaching assistant  First examination

Exploratory Data Analysis and Statistical Inference
Ch 2: 2, 3, 4, 8, 9; Ch 3: 6; Ch 4: 8; Ch 5: 3, 4, 5; Ch 6: 2, 3, 5, 7, 10

10/1/2004 Topics:
• Review of population and sample
• How to draw a good sample
• Quantifying the gap between the unknown quantity (parameter) and the known quantity (estimate or statistic)
• The central limit theorem and the normal distribution
• Confidence intervals
• The binomial distribution

Hospital-stay data
• The data in Table 2.11 are a sample from a larger data set collected on persons discharged from a selected Pennsylvania hospital as part of a retrospective chart review of antibiotic usage in hospitals [7].
– The data are also given in Data Set HOSPITAL.DAT, with documentation in HOSPITAL.DOC, on the data disk.
• Compute the mean and median for the duration of hospitalization for the 25 patients.
– How? By hand or with computer software (R, S-plus, SPSS, SAS).
– Why? Data summary: central tendency versus spread.
• Compute the standard deviation and range for the duration of hospitalization for the 25 patients.
• It is of clinical interest to know whether the duration of hospitalization is affected by whether or not a patient has received antibiotics.
– Answer this question descriptively using either numeric or graphic methods.
– Would you feel confident reporting your finding?

Table 2.11
Columns: ID no.; Dur = duration of hospital stay; Age; Sex (1 = M, 2 = F); Temp = first temperature following admission; WBC = first WBC (× 10^3) following admission; Antib = received antibiotic (1 = yes, 2 = no); Cult = received bacterial culture (1 = yes, 2 = no); Serv = service (1 = med., 2 = surg.)

ID  Dur  Age  Sex  Temp        WBC  Antib  Cult  Serv
 1    5   30    2  99.0          8      2     2     1
 2   10   73    2  98.0          5      2     1     1
 3    6   40    2  99.0         12      2     2     2
 4   11   47    2  98.2          4      2     2     2
 5    5   25    2  98.5         11      2     2     2
 6   14   82    1  96.8          6      1     2     2
 7   30   60    1  99.5          8      1     1     1
 8   11   56    2  98.6          7      2     2     1
 9   17   43    2  98.0          7      2     2     1
10    3   50    1  8.0 [sic]    12      2     1     2
11    9   59    2  97.6          7      2     1     1
12    3    4    1  97.8          3      2     2     2
13    8   22    2  99.5         11      1     2     2
14    8   33    2  98.4         14      1     1     2
15    5   20    2  98.4         11      2     1     2
16    5   32    1  99.0          9      2     2     2
17    7   36    1  99.2          6      1     2     2
18    4   69    1  98.0          6      2     2     2
19    3   47    1  97.0          5      1     2     1
20    7   22    1  98.2          6      2     2     2
21    9   11    1  98.2         10      2     2     2
22   11   19    1  98.6         14      1     2     2
23   11   67    2  97.6          4      2     2     1
24    9   43    2  98.6          5      2     2     2
25    4   41    2  98.0          5      2     2     1

Compute the mean and median by R
• Importing and exporting data
– Most programs (e.g. Excel), as well as humans, know how to deal with rectangular tables in the form of tab-delimited text files.
– Type conversions: understand the conventions your input files use (the delimiter character, i.e. space, comma, or tab, and the end-of-line character) and set the quote options accordingly.
• R (S-plus)

    duration <- c(5, 10, 6, 11, 5, 14, 30, 11, 17, 3, 9, 3, 8, 8, 5, 5, 7, 4, 3, 7, 9, 11, 11, 9, 4)
    mean(duration); median(duration)
    [1] 8.6
    [1] 8
    var(duration); sd(duration)
    [1] 32.66667
    [1] 5.715476
    summary(duration)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
        3.0     5.0     8.0     8.6    11.0    30.0
    range(duration)
    [1]  3 30
    # density-scale histogram with the kernel density estimate overlaid as points
    hist(duration, freq=FALSE, ylim=c(0,0.09)); points(density(duration))

[Figure: "Histogram of duration"; x-axis: duration (0 to 30), y-axis: Density]

Is the duration of hospitalization affected by receiving antibiotics?

    antibiotics <- c(2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 1, 1, 2, 2, 1, 2, 1, 2, 2, 1, 2, 2, 2)

• Data editing

    noduration  <- duration[antibiotics > 1.5]   # antibiotic = 2 (no)
    yesduration <- duration[antibiotics < 1.5]   # antibiotic = 1 (yes)
    boxplot(duration ~ antibiotics)
    plot(duration ~ antibiotics)

• Question: Is there a causal relationship between taking antibiotics and duration?
– This data set shows an association.
– How do we measure association? Refer to correlation.
– Can we use correlation to handle this data set?
• How do we study the relationship between population and sample?
• Statistics is a collection of procedures and principles for gaining and processing information in order to make decisions when faced with uncertainty.

What is your conclusion?
Read Ch 2.9 (Case Study 1).
[Figure: box plots of duration (5 to 30) by antibiotic group; x-axis: receiving antibiotics (1, 2)]

Population and Sample
• Example. Suppose we wish to estimate the proportion p of students at NTU who have not showered or bathed for over a day. This poses a number of questions.
– Whom do we mean by students?
– Suppose time is limited and we can only interview 20 students on campus. Is it important that our survey gives a good representation of all students? How can we ensure this?
– Will the students we question be too embarrassed to admit that they have not bathed?
– Even if we get truthful answers, will we be happy with our estimate if the chosen sample turns out to include no women, or only computer scientists?
• Example 6.8. Suppose we wish to characterize the distribution of birthweights of all liveborn infants born in the United States in 1998.
– Assume that the underlying distribution of birthweight has an expected value (or mean) μ and variance σ².
– Ideally, we wish to estimate μ and σ² exactly, based on the entire population of U.S. liveborn infants in 1998. But this task is difficult with such a large group.
– Instead, we decide to select a collection of n infants who are representative of this large group and use the birthweights $x_1, \ldots, x_n$ from this sample to help us estimate μ and σ².

How to select a representative sample: probability sampling
(1) Simple random sampling
– Choosing n people out of N can be done in C(N,n) ways; equivalently, each set of n people is selected with probability 1/C(N,n).
– Use a random number table or a computer random number generator.
(2) Stratified random sampling
– Divide the population into several strata and take a simple random sample within each stratum.
– If sampling costs are the same and the within-stratum variability is the same, the sample size in each stratum is proportional to the stratum's population size (proportional sampling).
– Generally more precise than simple random sampling (when the strata differ greatly from one another).

The reference, target, or study population is the group we wish to study.
– The random sample is selected from the study population.
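The two probability-sampling schemes described above can be sketched in R. This is only an illustration under stated assumptions: the sampling frame of N = 1000 ID numbers, the two strata and their sizes (490 and 510), the seed, and all variable names are hypothetical and do not come from the course data.

```r
# Hypothetical sampling frame: ID numbers 1..N (assumption, not course data)
N <- 1000
n <- 20
frame_ids <- seq_len(N)

# Number of distinct samples of size n, C(N, n); each has probability 1 / C(N, n)
n_possible <- choose(N, n)

# (1) Simple random sample without replacement
set.seed(1)  # arbitrary seed, for reproducibility only
srs <- sample(frame_ids, n, replace = FALSE)

# (2) Proportionate stratified sample: two hypothetical strata
stratum_sizes <- c(men = 490, women = 510)
stratum_of <- rep(names(stratum_sizes), times = stratum_sizes)
n_h <- round(n * stratum_sizes / N)   # allocation proportional to stratum size
strat_sample <- lapply(names(n_h), function(h) {
  sample(frame_ids[stratum_of == h], n_h[[h]], replace = FALSE)
})
names(strat_sample) <- names(n_h)
```

Proportional allocation is used here because it is the case the slide singles out (equal cost and equal within-stratum variability); other allocations would change only the `n_h` line.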
A random sample is a selection of some members of the population such that each member is independently chosen and has a known nonzero probability of being selected.
A simple random sample is a random sample in which each group member has the same probability of being selected.

Stratified random sampling
Suppose that we want to find the average height of adults in Taiwan. Consider the following stratified random sampling scheme:
– Divide the population into two strata, men and women, and take a simple random sample within each stratum.
– Compute the mean for each of the two strata.
– Suppose the proportions of men and women in the population are 49% and 51%.
– What is the average height?
– In mathematical notation: E(X) = E[E(X|Y)].
– What is Y in the above example?
– We can calculate the mean height as follows:
  • Compute the mean height of men, T(1) = E(X|Y=1).
  • Compute the mean height of women, T(0) = E(X|Y=0).
  • What is E(T)? (T takes on two values, T(0) and T(1). What is the probability of getting T(0)?)
– Var(X) = Var[E(X|Y)] + E[Var(X|Y)]
  • If there is no difference between the mean heights of men and women, how do you compute Var(X)?

Cluster sampling
Divide the population into several similar clusters, randomly select some of these clusters, and survey everyone in the selected clusters.
– A list of the whole population is not needed.
– All sampling units are close together, which saves the extra sampling cost that comes with distance.
• Example 6.9. The Minnesota Heart Study seeks to
– accurately assess the prevalence and incidence of different types of cardiovascular morbidity (such as heart attack and stroke) in the state of Minnesota, and
– assess trends in these rates over time.
– It is impossible to survey every individual in the state and impractical to survey, in person, a random sample of individuals in the state. The latter would require a large number of interviewers to be dispersed throughout the state.
– Sampling scheme: Divide the state of Minnesota into geographically compact regions or clusters. A random sample of clusters is then chosen for study, and several interviewers are sent to each cluster selected.
– All households in a selected cluster are enumerated, and all members of these households are surveyed. If some cardiovascular morbidity is identified by the interviewers, the relevant individuals are invited to be examined in more detail at a centrally located health site within the cluster.
– The total sample of all interviewed subjects over the entire state is referred to as a cluster sample.

How to select a representative sample: probability sampling
Table 6.2 gives the birthweights from 1000 consecutive deliveries at Boston City Hospital (serving a low-income population).
• Description of this data set:

    birthweight <- scan("D:/teaching/statistics/elementary/birthweight.txt")
    summary(birthweight); var(birthweight); sd(birthweight)

    Min.  1st Qu.  Median  Mean  3rd Qu.  Max.  Var     SD
    17    100      113     112   126      198   424.06  20.59

    # four density-scale histograms with increasingly fine bins
    par(mfrow=c(2, 2))
    hist(birthweight, freq=FALSE, ylim=c(0,0.025), main="nclass=20"); points(density(birthweight))
    hist(birthweight, freq=FALSE, nclass=80, main="nclass=80", ylim=c(0,0.025)); points(density(birthweight))
    hist(birthweight, freq=FALSE, nclass=160, main="nclass=160", ylim=c(0,0.025)); points(density(birthweight))
    hist(birthweight, freq=FALSE, nclass=240, main="nclass=240", ylim=c(0,0.025)); points(density(birthweight))

[Figure: 2 × 2 panel of histograms titled nclass=20, nclass=80, nclass=160, nclass=240; x-axis: birthweight (50 to 200), y-axis: Density (0 to 0.025)]

How to select a representative sample: simple random sampling
• How can I draw a simple random sample of size 50 from those 1000 birthweights?

    s1 <- sample(birthweight, 50, replace = FALSE); mean(s1); median(s1); sd(s1)

– Two such samples gave (mean, median, sd) = (112.62, 114.5, 20.61) and (108.18, 112.5, 22.69).
– Population: (112, 113, 20.59).
– Random sample: fluctuation.
– How much information about those 1000 birthweights can be revealed through a simple random sample of size 50?
– How do we quantify the fluctuations 112.62 − 112, 108.18 − 112, ...?
• Repeat this sampling scheme N times.
    N <- 1000000
    # one row per replicate: mean, median, sd of a sample of size 50
    a1sample <- matrix(rep(0, N*3), ncol=3)
    for (i in 1:N) {
      s1 <- sample(birthweight, 50, replace = FALSE)
      a1sample[i,] <- c(mean(s1), median(s1), sd(s1))
    }
    apply(a1sample, 2, summary)

    Statistic  min    25%    50%    mean   75%    max
    mean       97.7   110.1  112.0  112.0  113.9  125.9
    sd         10.64  18.57  20.27  20.43  22.15  33.60
    median     98.0   110.5  113.5  113.1  115.5  127.5

What does the above study try to convey?
• Suppose I ask 10 of you to take a random sample of size 50 and work out its mean.
– Results: 111.44, 113.30, 113.02, 113.66, 112.64, 111.04, 110.58, 117.16, 111.26, 116.68
– You may be the one who gets 117.16.
– Why do you think that you will be the one who gets 111.44?
• How do we settle this difficulty?
• Law of Large Numbers:
– The average of a sequence of random variables with a common distribution converges (in a suitable sense) to their common expectation, in the limit as the size of the sequence goes to infinity.
– Set $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$. Then $\bar{X}_n \to \mu$ as $n \to \infty$.
– In the above study n = 50. What is N?
– What is probability?
  Classical: equally likely outcomes. If you can count, you know how to calculate the probability.
– Challenge: What is the success rate of an operation?
  Relative frequency: $\lim_{n\to\infty} f/n$

An illustration of LLN

    testmean <- a1sample[,1]
    s1 <- rep(0, 1000)
    # running average of the first 1000*i simulated sample means
    for (i in 1:1000) s1[i] <- mean(testmean[1:(1000*i)])
    plot(ts(s1), xlab="count by thousand", ylab="average", main="n=50")
    abline(h = mean(birthweight))

[Figure: "n=50"; running average (about 111.7 to 112.0) against count by thousand (0 to 1000)]

Sampling Distributions
• If you draw a simple random sample of size 1 from those 1000 birthweights, how would you describe it?
• table(birthweight)
• It gives 17(1), 22(1), 32(2), ..., 115(24), 116(19), 120(25), 121(26), ..., 198(1).
• Let X denote the observation that you are going to take.
• X is a discrete random variable taking the value 17 with probability 0.001, etc.
• Numerical descriptive measures calculated from the sample are called statistics.
• Consider the average of a random sample of size 50.
• Statistics vary from sample to sample and hence are random variables.
• The probability distributions for statistics are called sampling distributions.
• In repeated sampling, they tell us what values of the statistics can occur and how often each value occurs.

Sampling distribution of the mean of a simple random sample of size 50 from those 1000 birthweights

    testmean <- a1sample[,1]
    # standardize the simulated sample means
    testmeannorm <- sqrt(50)*(testmean - mean(birthweight))/sd(birthweight)
    hist(testmeannorm, freq=FALSE, ylim=c(0,0.4), main="R: N(0,1)"); points(density(testmeannorm))
    # overlay the standard normal density in red
    x <- seq(-3, 3, len = 101)
    y <- (1/sqrt(2*pi))*exp(-x^2/2)
    points(x, y, type = "l", xaxt = "n", col = "red")

How do I quantify the difference between 117.16 and the average of the 1000 birthweights?
How do I quantify the difference between 111.44 and the average of the 1000 birthweights?
Where do you think 117.16 is located on the x-axis of the next slide?

[Figure: "R: N(0,1)"; histogram of testmeannorm (−4 to 4) with the standard normal density; y-axis: Density (0 to 0.4)]

Types of Inference
• Estimation:
– Estimating or predicting the value of the parameter.
– "What is (are) the most likely value(s) of μ or p?"
• Hypothesis testing:
– Deciding about the value of a parameter based on some preconceived idea.
– "Did the sample come from a population with μ = 5 or p = .2?"

Types of Inference
• Examples:
– A consumer wants to estimate the average price of similar homes in her city before putting her home on the market.
  Estimation: Estimate μ, the average home price.
– A manufacturer wants to know if a new type of steel is more resistant to high temperatures than an old type was.
  Hypothesis test: Is the new average resistance, $\mu_N$, equal to the old average resistance, $\mu_O$?
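Returning to the earlier question of how to quantify the gap between one student's sample mean and the population mean: a minimal R sketch that applies the same standardization as `testmeannorm` to the two means discussed (117.16 and 111.44). It assumes the rounded population summary reported above (mean 112, sd 20.59) and, like the slide, ignores any finite-population correction; the variable names are illustrative.

```r
# Rounded population summary taken from the slides (assumption: rounding is harmless)
pop_mean <- 112
pop_sd   <- 20.59
n        <- 50

# Two of the ten sample means reported in class
xbar <- c(117.16, 111.44)

# Standardized distance from the population mean, in standard-error units
z <- sqrt(n) * (xbar - pop_mean) / pop_sd
round(z, 2)
```

Read against the N(0,1) histogram, the first value sits well out in the right tail but inside ±1.96, while the second is close to the center.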
Types of Inference
• Whether you are estimating parameters or testing hypotheses, statistical methods are important because they provide:
– Methods for making the inference (next lecture)
– A numerical measure of the goodness or reliability of the inference (confidence interval)
• An estimator is a rule, usually a formula, that tells you how to calculate the estimate based on the sample.
– Point estimation: A single number is calculated to estimate the parameter.
– Interval estimation: Two numbers are calculated to create an interval within which the parameter is expected to lie.

Properties of Point Estimators
• Since an estimator is calculated from sample values, it varies from sample to sample according to its sampling distribution.
• An estimator is unbiased if the mean of its sampling distribution equals the parameter of interest.
– It does not systematically overestimate or underestimate the target parameter.

Properties of Point Estimators
• Of all the unbiased estimators, we prefer the estimator whose sampling distribution has the smallest spread or variability.

Measuring the Goodness of an Estimator
• The distance between an estimate and the true value of the parameter is the error of estimation. (Analogy: the distance between the bullet and the bull's-eye.)
• In this chapter, the sample sizes are large, so that, because of the Central Limit Theorem, our unbiased estimators will have approximately normal distributions.

The Margin of Error
• For unbiased estimators with normal sampling distributions, 95% of all point estimates will lie within 1.96 standard deviations (of the estimator) of the parameter of interest.
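• Margin of error: The maximum error of estimation, calculated as 1.96 × (standard error of the estimator).

Estimating Means and Proportions
• For a quantitative population,
  Point estimator of the population mean μ: $\bar{x}$
  Margin of error (n ≥ 30): $\pm 1.96\, s/\sqrt{n}$
• For a binomial population,
  Point estimator of the population proportion p: $\hat{p} = x/n$
  Margin of error (n ≥ 30): $\pm 1.96 \sqrt{\hat{p}\hat{q}/n}$

Example
• A homeowner randomly samples 64 homes similar to her own and finds that the average selling price is $252,000 with a standard deviation of $15,000. Estimate the average selling price for all similar homes in the city.
  Point estimator of μ: $\bar{x} = 252{,}000$
  Margin of error: $\pm 1.96\, s/\sqrt{n} = \pm 1.96 \times 15{,}000/\sqrt{64} = \pm 3675$

Simulating confidence intervals
• URL: www.stat.berkeley.edu/~stark/Java/Ci.htm

Notes on the confidence interval simulation
• The simulation uses an event with success probability 0.5 (= p), draws samples of size 20, and constructs a 95% confidence interval for p 100 times.
– If you want samples of size 250 and 1000 repetitions of a 99.7% confidence interval with p = 0.4, change the first field below the plot (sample size) from 20 to 250 (the maximum is 250), the second field (samples to take) from 100 to 1000 (the maximum is 1000), and the third field (#SE) to 3 (on this screen it is 2, because a 95% confidence interval is being computed).
– Changing p from 0.5 to 0.4 is a little more involved. The long box on the right of the screen shows the numbers 0 and 1. This represents a box containing the two numbers 0 and 1, so a number drawn at random is 1 with probability 1/2. To simulate p = 0.4, set up a box containing the five numbers 0, 0, 0, 1, 1 (2/5 = 0.4) by changing the contents of the box on the far right to 0,0,0,1,1. To simulate 0.55 (= 11/20), you would enter eleven 1s and nine 0s.
• The number 0.92 at the far right below the plot means that, of the 100 simulated confidence intervals, 92 (92/100) contain 0.5 (intervals that do not contain 0.5 are marked in red). In theory we expect 95 of the intervals to contain 0.5, but because of randomness the count will not be exactly 95. In the same way, if you toss a fair coin 100 times, the probability of getting exactly 50 heads is about 0.07958924 (while the probability of getting between 46 and 55 heads is about 0.6802727).
• When you open the URL above, you will see the screen above but without the green and red intervals, and the long box on the far right contains the five numbers 0, 1, 2, 3, 4. When you click the first control above the plot, Take Sample, a blue line appears at horizontal coordinate 2 (the mean of these five numbers).

Sampling Distributions
Definition: The sampling distribution of a statistic is the probability distribution for the possible values of the statistic that results when random samples of size n are repeatedly drawn from the population.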
Population: 3, 5, 2, 1
Draw samples of size n = 3 without replacement.

    Possible samples   Sample mean
    3, 5, 2            10/3 = 3.33
    3, 5, 1             9/3 = 3
    3, 2, 1             6/3 = 2
    5, 2, 1             8/3 = 2.67

Each value of x-bar is equally likely, with probability 1/4.
[Figure: probability histogram p(x) of the sample mean, with height 1/4 at each of the four values between 2 and 3.33]

Sampling Distributions (without replacement)

    a <- c(3,5,2,1); mean(a); sd(a)
    2.75; 1.707825
    N <- 1000000
    a1sample <- rep(0, N)
    for (i in 1:N) {s1 <- sample(a, 3, replace = FALSE); a1sample[i] <- mean(s1)}
    testmean <- a1sample
    testmeannorm <- sqrt(3)*(testmean - mean(a))/sd(a)
    hist(testmeannorm, freq=FALSE, main="R: N(0,1)")
    x <- seq(-3, 3, len = 101)
    y <- (1/sqrt(2*pi))*exp(-x^2/2)
    points(x, y, type = "l", xaxt = "n", col = "red")
    # simulated proportions of each possible sample mean
    table(testmean)/N
    2       2.67    3       3.33
    0.251   0.2497  0.2497  0.2497

Why?
[Figure: "R: N(0,1)"; histogram of testmeannorm (−0.8 to 0.6), y-axis: Density, shown beside the probability histogram p(x) with height 1/4]

Sampling Distributions (with replacement)

    a <- c(3,5,2,1); mean(a); sd(a)
    2.75; 1.707825
    N <- 1000000
    a1sample <- rep(0, N); n <- 3
    for (i in 1:N) {s1 <- sample(a, n, replace = TRUE); a1sample[i] <- mean(s1)}
    testmean <- a1sample
    testmeannorm <- sqrt(3)*(testmean - mean(a))/sd(a)
    hist(testmeannorm, freq=FALSE, main="R: N(0,1)")
    points(density(testmeannorm))
    x <- seq(-3, 3, len = 101)
    y <- (1/sqrt(2*pi))*exp(-x^2/2)
    points(x, y, type = "l", xaxt = "n", col = "red")

[Figure: "R: N(0,1)"; histogram of testmeannorm, y-axis: Density (0 to 0.8)]
Why?
(n = 3 only.)

Sampling Distributions (with replacement)

    a <- c(3,5,2,1); mean(a); sd(a)
    2.75; 1.707825
    N <- 1000000
    a1sample <- rep(0, N); n <- 20
    for (i in 1:N) {s1 <- sample(a, n, replace = TRUE); a1sample[i] <- mean(s1)}
    testmean <- a1sample
    testmeannorm <- sqrt(n)*(testmean - mean(a))/sd(a)
    hist(testmeannorm, freq=FALSE, main="R: N(0,1); n=20")
    points(density(testmeannorm))
    x <- seq(-3, 3, len = 101)
    y <- (1/sqrt(2*pi))*exp(-x^2/2)
    points(x, y, type = "l", xaxt = "n", col = "red")

[Figure: "R: N(0,1); n=20"; histogram of testmeannorm (−4 to 4), y-axis: Density (0 to 0.4)]

Sampling Distributions
Sampling distributions for statistics can be
• approximated with simulation techniques, or
• derived using mathematical theorems.
The Central Limit Theorem is one such theorem.

Central Limit Theorem: If random samples of n observations are drawn from a nonnormal population with finite mean μ and standard deviation σ, then, when n is large, the sampling distribution of the sample mean $\bar{x}$ is approximately normally distributed, with mean μ and standard deviation $\sigma/\sqrt{n}$. The approximation becomes more accurate as n becomes large.

Example
Toss a fair die n = 1 time. The distribution of x, the number on the upper face, is flat or uniform.
  $\mu = \sum x\,p(x) = 1(\tfrac{1}{6}) + 2(\tfrac{1}{6}) + \cdots + 6(\tfrac{1}{6}) = 3.5$
  $\sigma = \sqrt{\sum (x-\mu)^2 p(x)} = 1.71$
If X denotes the outcome of a simple random sample of size 1 from those 1000 birthweights, how would you describe its mean and variance?

Example
Toss a fair die n = 2 times. The distribution of $\bar{x}$, the average number on the two upper faces, is mound-shaped.
  Mean: μ = 3.5
  Std Dev: $\sigma/\sqrt{2} = 1.71/\sqrt{2} = 1.21$

Example
Toss a fair die n = 3 times. The distribution of $\bar{x}$, the average number on the three upper faces, is approximately normal.
  Mean: μ = 3.5
  Std Dev: $\sigma/\sqrt{3} = 1.71/\sqrt{3} = .987$

Why is this Important?
The Central Limit Theorem also implies that the sum of n measurements is approximately normal with mean $n\mu$ and standard deviation $\sigma\sqrt{n}$.
Many statistics that are used for statistical inference are sums or averages of sample measurements.
When n is large, these statistics will have approximately normal distributions.
This will allow us to describe their behavior and evaluate the reliability of our inferences.

How Large is Large?
If the sampled population is normal, then the sampling distribution of $\bar{x}$ will also be normal, no matter what the sample size.
When the sampled population is approximately symmetric, the distribution becomes approximately normal for relatively small values of n.
When the sampled population is skewed, the sample size must be at least 30 before the sampling distribution of $\bar{x}$ becomes approximately normal.

The Sampling Distribution of the Sample Mean
A random sample of size n is selected from a population with mean μ and standard deviation σ.
The sampling distribution of the sample mean $\bar{x}$ will have mean μ and standard deviation $\sigma/\sqrt{n}$.
If the original population is normal, the sampling distribution will be normal for any sample size.
If the original population is nonnormal, the sampling distribution will be approximately normal when n is large.
The standard deviation of x-bar is sometimes called the STANDARD ERROR (SE).

Finding Probabilities for the Sample Mean
If the sampling distribution of $\bar{x}$ is normal or approximately normal, standardize or rescale the interval of interest in terms of
  $z = \dfrac{\bar{x} - \mu}{\sigma/\sqrt{n}}$
Find the appropriate area using Table 3.
Example: A random sample of size n = 16 from a normal distribution with μ = 10 and σ = 8.
  $P(\bar{x} > 12) = P\!\left(z > \dfrac{12 - 10}{8/\sqrt{16}}\right) = P(z > 1) = 1 - .8413 = .1587$

Example
A soda filling machine is supposed to fill cans of soda with 12 fluid ounces. Suppose that the fills are actually normally distributed with a mean of 12.1 oz and a standard deviation of .2 oz. What is the probability that the average fill for a 6-pack of soda is less than 12 oz?
  $P(\bar{x} < 12) = P\!\left(\dfrac{\bar{x} - \mu}{\sigma/\sqrt{n}} < \dfrac{12 - 12.1}{.2/\sqrt{6}}\right) = P(z < -1.22) = .1112$

How do we check association?
• Two events A and B are defined to be independent if P(A∩B) = P(A) P(B).
• Conditional probability: the probability that an event (B) occurs given that some condition (A) is known to hold.
• P(B|A) = P(A∩B) / P(A)
• If A and B are independent, then P(B|A) = P(B).
• A study of helmets and head injuries produced the following data:

                      Helmet worn   Helmet not worn   Row total
    Head injury            17            218             235
    No head injury        130            428             558
    Column total          147            646             793

  The proportion of helmet wearers with a head injury, the estimate of P(head injury | helmet worn), is 17/147 = 11.6%.
  The proportion of non-wearers with a head injury, the estimate of P(head injury | helmet not worn), is 218/646 = 33.7%.
  Read the definitions of risk ratio and odds ratio in Chapter 13.

Example 1
• Toss a fair coin twice. Define
– A: head on second toss
– B: head on first toss
  The outcomes HH, HT, TH, TT each have probability 1/4.
  P(A|B) = ½
  P(A|not B) = ½
  P(A) does not change, whether B happens or not. A and B are independent!

Example 2
• A bowl contains five M&Ms®, two red and three blue. Randomly select two candies, and define
– A: second candy is red.
– B: first candy is blue.
  P(A|B) = P(2nd red | 1st blue) = 2/4 = 1/2
  P(A|not B) = P(2nd red | 1st red) = 1/4
  P(A) does change, depending on whether B happens or not. A and B are dependent!
Recall the question of whether the duration of hospitalization is affected by receiving antibiotics.

Defining Independence
• We can redefine independence in terms of conditional probabilities:
  Two events A and B are independent if and only if P(A|B) = P(A) or P(B|A) = P(B). Otherwise, they are dependent.
• Once you've decided whether or not two events are independent, you can use the following rule to calculate the probability of their intersection.

The Multiplicative Rule for Intersections
• For any two events, A and B, the probability that both A and B occur is
  P(A∩B) = P(A) P(B given that A occurred) = P(A) P(B|A)
• If the events A and B are independent, then the probability that both A and B occur is
  P(A∩B) = P(A) P(B)

Example 1
In a certain population, 10% of the people can be classified as being high risk for a heart attack. Three people are randomly selected from this population. What is the probability that exactly one of the three is high risk?
Define H: high risk, N: not high risk.
  P(exactly one high risk) = P(HNN) + P(NHN) + P(NNH)
  = P(H)P(N)P(N) + P(N)P(H)P(N) + P(N)P(N)P(H)
  = (.1)(.9)(.9) + (.9)(.1)(.9) + (.9)(.9)(.1) = 3(.1)(.9)² = .243
Connection with coin tossing: flip a coin (with probability 0.1 of getting a head) three times.

Example 2
Suppose we have additional information in the previous example. We know that only 49% of the population are female. Also, of the female patients, 8% are high risk. A single person is selected at random. What is the probability that it is a high risk female?
Define H: high risk, F: female.
From the example, P(F) = .49 and P(H|F) = .08. Use the Multiplicative Rule:
  P(high risk female) = P(H∩F) = P(F)P(H|F) = .49(.08) = .0392

The Law of Total Probability
• Let $S_1, S_2, S_3, \ldots, S_k$ be mutually exclusive and exhaustive events (that is, one and only one must happen). Then the probability of another event A can be written as
  $P(A) = P(A \cap S_1) + P(A \cap S_2) + \cdots + P(A \cap S_k)$
  $= P(S_1)P(A|S_1) + P(S_2)P(A|S_2) + \cdots + P(S_k)P(A|S_k)$

The Law of Total Probability
[Figure: Venn diagram showing the event A split into the pieces $A \cap S_1$, $A \cap S_2$, ..., $A \cap S_k$ by the partition $S_1, \ldots, S_k$]

Bayes' Rule
• Let $S_1, S_2, S_3, \ldots, S_k$ be mutually exclusive and exhaustive events with prior probabilities $P(S_1), P(S_2), \ldots, P(S_k)$. If an event A occurs, the posterior probability of $S_i$, given that A occurred, is
  $P(S_i|A) = \dfrac{P(S_i)P(A|S_i)}{\sum_{j=1}^{k} P(S_j)P(A|S_j)}$ for $i = 1, 2, \ldots, k$

Risk Factor Example
From a previous example, we know that 49% of the population are female. Of the female patients, 8% are high risk for heart attack, while 12% of the male patients are high risk. A single person is selected at random and found to be high risk. What is the probability that it is a male?
Define H: high risk, F: female, M: male.
We know: P(F) = .49, P(M) = .51, P(H|F) = .08, P(H|M) = .12
  $P(M|H) = \dfrac{P(M)P(H|M)}{P(M)P(H|M) + P(F)P(H|F)} = \dfrac{.51(.12)}{.51(.12) + .49(.08)} = .61$

Random Variables
• A quantitative variable x is a random variable if the value that it assumes, corresponding to the outcome of an experiment, is a chance or random event.
• Random variables can be discrete or continuous.
• Examples:
  x = SAT score for a randomly selected student
  x = number of people in a room at a randomly selected time of day
  x = number on the upper face of a randomly tossed die

Probability Distributions for Discrete Random Variables
• The probability distribution for a discrete random variable x resembles the relative frequency distributions we constructed in Chapter 2. It is a graph, table or formula that gives the possible values of x and the probability p(x) associated with each value.
  We must have $0 \le p(x) \le 1$ and $\sum p(x) = 1$.

Example
• Toss a fair coin three times and define x = number of heads.

    Outcome  Probability  x
    HHH      1/8          3
    HHT      1/8          2
    HTH      1/8          2
    THH      1/8          2
    HTT      1/8          1
    THT      1/8          1
    TTH      1/8          1
    TTT      1/8          0

  P(x = 0) = 1/8, P(x = 1) = 3/8, P(x = 2) = 3/8, P(x = 3) = 1/8

    x      0    1    2    3
    p(x)   1/8  3/8  3/8  1/8

[Figure: probability histogram for x]

Probability Distributions
• Probability distributions can be used to describe the population, just as we described samples in Chapter 2.
– Shape: symmetric, skewed, mound-shaped, ...
– Outliers: unusual or unlikely measurements
– Center and spread: mean and standard deviation. A population mean is called μ and a population standard deviation is called σ.

The Mean and Standard Deviation
• Let x be a discrete random variable with probability distribution p(x). Then the mean, variance and standard deviation of x are given as
  Mean: $\mu = \sum x\,p(x)$
  Variance: $\sigma^2 = \sum (x-\mu)^2 p(x)$
  Standard deviation: $\sigma = \sqrt{\sigma^2}$

Example
• Toss a fair coin 3 times and record x, the number of heads.
    x   p(x)  xp(x)  (x−μ)²p(x)
    0   1/8   0      (−1.5)²(1/8)
    1   3/8   3/8    (−0.5)²(3/8)
    2   3/8   6/8    (0.5)²(3/8)
    3   1/8   3/8    (1.5)²(1/8)

  $\mu = \sum x\,p(x) = \tfrac{12}{8} = 1.5$
  $\sigma^2 = \sum (x-\mu)^2 p(x) = .28125 + .09375 + .09375 + .28125 = .75$
  $\sigma = \sqrt{.75} = .866$

Example
• The probability distribution for x, the number of heads in tossing 3 fair coins.
• Shape? Symmetric; mound-shaped
• Outliers? None
• Center? μ = 1.5
• Spread? σ = .866

Key Concepts
• Population: The entire collection of entities about which one wishes to make an inference or draw a conclusion (also called aggregate or universe).
• Sample: A subset of a population. Used because we usually cannot measure all individuals in a population.
– It is the sample we observe, but the population we wish to know.
• Simple Random Sample: A sample of size n from a larger population selected in such a way that every sample of size n has the same chance of being selected.
• Parameter: The true value of some population attribute, which is almost always unknown; or an unknown constant that describes a key feature in a model for answering a question of interest.
– Parameters are often represented by Greek letters, such as μ for the population mean and σ for the population standard deviation.
• Statistic: Any quantity that is computed from sample observations.
• Probability: A set of mathematical tools to quantify concepts we understand intuitively, such as "likelihood" and "certainty." We use probability to gauge the amount of confidence to place in sample estimates.

Key Concepts
• Model: Some approximation of reality.
• Statistical model: A mathematical expression that helps us predict a response variable as a function of one or more explanatory variables, based on a set of assumptions. These assumptions allow the model not to fit exactly, and are made about random terms in the model called error (e).
• Types of Variables:
– Refer to Ch 9.1 of Rosner.
– Quantitative (continuous or cardinal) data versus qualitative (discrete or categorical) data
– Nominal: Data values for a variable are labels identifying a category, and their order is not meaningful. E.g., attributes such as sex, race, and cause of death are nominal (meaning named) because the categories do not represent some underlying, quantitative scale.
– Ordinal: Data values for a variable are labels identifying a category, and their order is meaningful. E.g., a person's highest educational level might be recorded as ordinal, where the categories of interest might be grade school, high school, college, and graduate school. Additional examples are stage of cancer, severity, and preference.

Key Concepts
• Continuous: Data values for a variable are measured on a continuous scale. E.g., body mass is often measured and recorded on a continuous scale, where values such as 40 g or 3,154.2 g are acceptable.
– It can be useful to distinguish between continuous and discrete data.
– Continuous data can be represented with any and all conceivable values within a particular range, such as the height of a plant being 36.354 cm; discrete data can be represented by only certain values within a particular range, such as the number of leaves on a plant, where 22, 185, or 45 are possible, but 22.8 is not.
– Ratio: Examples are age, body weight, height, and blood pressure.
– Interval: Examples are Celsius and Fahrenheit temperatures.

Some Discrete Distributions

The Binomial Distribution
• The most commonly used discrete probability distribution is the binomial distribution.
• An experiment which follows a binomial distribution will satisfy the following requirements (think of repeatedly flipping a coin as you read these):
– The experiment consists of n identical trials, where n is fixed in advance.
– Each trial has two possible outcomes, S or F, which we denote "success" and "failure" and code as 1 and 0, respectively.
– The trials are independent, so the outcome of one trial has no effect on the outcome of another.
– The probability of success, p = P(S), is constant from one trial to another.

The Binomial Distribution
• The random variable X of a binomial distribution counts the number of successes in n trials.
• Sampling distributions for counts and proportions
• The probability that X takes a certain value x is given by the formula
  $P(X = x) = C(n,x)\, p^x (1-p)^{n-x}$, where $0 \le p \le 1$ and $x = 0, 1, \ldots, n$.
– E(X) = np and Var(X) = np(1−p).
– A particularly important example of the use of the binomial distribution is sampling with replacement (this implies that p is constant).
– EXAMPLE: Suppose we have 10 balls in a bowl; 3 of the balls are red and 7 of them are blue. Define success S as drawing a red ball. If we sample with replacement, P(S) = 0.3 for every trial. With n = 20, P(X = 5) = 0.1789.
• Examples: the number of radiation leaks in 100 shipments of nuclear waste; the number of defective items among 200 tested parts; the number of people in a city infected with a certain disease; the proportion of questionnaire respondents who agree with a particular statement.

The Hardy-Weinberg equilibrium law (1908)
Do people carrying the dominant gene become more and more common? (How do we reason?)
Suppose that in the first generation 20% of people have genotype AA, 30% have Aa, and 50% have aa. The genotype distribution of the second generation is then as follows:
• A denotes the dominant allele, a the recessive allele.

    First generation                    Second-generation proportions
    mating            Probability         AA    Aa    aa
    AA/AA             20%×20% = 4%        1     0     0
    AA/Aa             2×20%×30% = 12%     1/2   1/2   0
    AA/aa             2×20%×50% = 20%     0     1     0
    Aa/Aa             30%×30% = 9%        1/4   1/2   1/4
    Aa/aa             2×30%×50% = 30%     0     1/2   1/2
    aa/aa             50%×50% = 25%       0     0     1

The Hardy-Weinberg equilibrium law
In the second generation, the proportions of AA, Aa, and aa are
• Proportion of AA = 1×4% + 1/2×12% + 1/4×9% = 12.25%
• Proportion of Aa = 1/2×12% + 1×20% + 1/2×9% + 1/2×30% = 45.5%
• Proportion of aa = 1/4×9% + 1/2×30% + 1×25% = 42.25%

    Second generation                                Third-generation proportions
    mating            Probability                      AA    Aa    aa
    AA/AA             12.25%×12.25% = 1.500625%        1     0     0
    AA/Aa             2×12.25%×45.5% = 11.1475%        1/2   1/2   0
    AA/aa             2×12.25%×42.25% = 10.35125%      0     1     0
    Aa/Aa             45.5%×45.5% = 20.7025%           1/4   1/2   1/4
    Aa/aa             2×45.5%×42.25% = 38.4475%        0     1/2   1/2
    aa/aa             42.25%×42.25% = 17.850625%       0     0     1

The Hardy-Weinberg equilibrium law
Proceeding in the same way, what are the proportions of AA, Aa, and aa in generation n (n > 3)?
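• Proportion of AA = 1×1.500625% + 1/2×11.1475% + 1/4×20.7025% = 12.25%
• Proportion of Aa = 1/2×11.1475% + 1×10.35125% + 1/2×20.7025% + 1/2×38.4475% = 45.5%
• Proportion of aa = 1/4×20.7025% + 1/2×38.4475% + 1×17.850625% = 42.25%
• Proceeding in the same way, the proportions of AA, Aa, and aa in generation n (n > 3) are the same as in the second and third generations. This is called the Hardy-Weinberg equilibrium law.
• Conditions that must be satisfied:
  (1) No gene mutation occurs.
  (2) No natural selection occurs.
  (3) The population is large enough.
  (4) Everyone marries.
  (5) Partners are chosen at random (random mating).
  (6) Every individual produces about the same number of offspring.
  (7) There is no migration into or out of the population.

Descriptive statistics
• Minimum: the smallest observation. Maximum: the largest observation.
• Median: the middle value of all the observations.
• Mode: the value that occurs most often among the observations.
• Mean: $\bar{X} = \dfrac{1}{n}\sum_{i=1}^{n} X_i$
• Sample variance: $S^2 = \dfrac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}$
• Sample standard deviation: $S = \sqrt{\dfrac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}}$
• Standard error: computed as $S/\sqrt{n}$

Descriptive statistics
• Kurtosis: $\dfrac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^{n} \dfrac{(X_i - \bar{X})^4}{S^4} - \dfrac{3(n-1)^2}{(n-2)(n-3)}$
• Skewness: computed as $\dfrac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \dfrac{(X_i - \bar{X})^3}{S^3}$
  S: sample standard deviation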
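The descriptive statistics listed above can be computed directly in R. The sketch below is illustrative only: it reuses the 25 hospital-stay durations from the first example, and it assumes that the kurtosis and skewness are the usual bias-corrected (spreadsheet-style) versions, since the signs in the slide formulas did not survive extraction. The variable names are not from the original slides.

```r
# Hospital-stay durations from Table 2.11 (same vector as in the first example)
duration <- c(5, 10, 6, 11, 5, 14, 30, 11, 17, 3, 9, 3, 8, 8,
              5, 5, 7, 4, 3, 7, 9, 11, 11, 9, 4)

n    <- length(duration)
xbar <- mean(duration)              # mean
med  <- median(duration)            # median
s    <- sd(duration)                # sample standard deviation (divisor n - 1)
se   <- s / sqrt(n)                 # standard error of the mean

# Mode(s): every value attaining the highest frequency
tab   <- table(duration)
modes <- as.numeric(names(tab)[tab == max(tab)])

# Bias-corrected skewness and excess kurtosis (assumed sign conventions)
zdev <- (duration - xbar) / s
skew <- n / ((n - 1) * (n - 2)) * sum(zdev^3)
kurt <- n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum(zdev^4) -
        3 * (n - 1)^2 / ((n - 2) * (n - 3))

c(min = min(duration), max = max(duration), mean = xbar, median = med,
  sd = s, se = se, skewness = skew, kurtosis = kurt)
```

For these data the mode is not unique, which is why `modes` is allowed to return more than one value; the single long stay of 30 days is what drives the positive skewness.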