Multivariate Analysis
陳宏, Department of Mathematics, National Taiwan University
Thursdays 9:10 to 12:00, Room A211
[email protected]

Course Content
• Basic probability, the language of statistics, and its tools (about 2 weeks of lectures)
  – Important probability distributions
  – Simulating random variables
  – Point estimation, confidence intervals, hypothesis testing
• Linear models (about 7 weeks of lectures)
  – Linear regression, logistic regression
  – Analysis of variance
  – Contingency table analysis
• Multivariate analysis (about 8 weeks of lectures)
  – Principal Component Analysis
  – Factor Analysis
  – Discriminant Analysis
  – Cluster Analysis
  – Canonical Correlation Analysis
• Reference book:
  – To be determined
• Programming language:
  – R (available online)
  – R has a home page at http://www.r-project.org/
  – Download
• Grading:
  – Midterm exam (30%), projects (70%)

Outline
• Overview
  – Exploratory Data Analysis: Decision Making
  – Data Mining
  – Data Collection: sampling and questionnaires
• Statistical software
  – R Software
• Basic probability, the language of statistics, and its tools
  – Probability and Random Variables
  – Variance
• Linear models
  – Association
  – IntroRegression
  – MultipleRegression
  – DAonREgression

Outline
• Multivariate analysis
  – Principal Component Analysis
  – Factor Analysis
  – Discriminant Analysis
  – Cluster Analysis
  – Canonical Correlation Analysis

Statistics for Decision Making
• Describing Sets of Data
  – Objective: Introduce numerical methods and graphical displays to summarize data sets.
  – Graphical and numerical tools
    • for examining the distribution of a single variable,
    • for comparing several distributions, and
    • for investigating changes over time.
• Sampling and Statistical Inference
  – Objective: Provide methods to infer about a population based on a sample of observations drawn from that population.
• Forecasting with Distinguishable Data
  – Objective: Introduce the basic concepts of forecasting to motivate a regression model.
  – Method for studying relationships among several variables.
• Regression Coefficients and Forecasts
  – Objective: Understand regression coefficients and how to use them for forecasting.

Statistics for Decision Making
• Measures of Goodness of Fit and Residual Analysis
  – Objective: Introduce a few statistics that measure how well a regression model fits the data, and show how to use residual analysis to detect inadequacies of a regression model.
• Developing a Regression Model
  – Objective: Demonstrate how to develop a useful regression model through
    • Selection of the Dependent Variable
    • Selection of the Independent Variables
    • Determining the Nature of Relationships

Sampling and Statistical Inference
• Objective: Provide methods to infer about a population based on a sample of observations drawn from that population.
• Inference from a Sample
• Statistical Estimation
• From Margin of Error to Confidence Interval
• Test of Significance

Inference from a Sample
• The sample provides useful information, but the information is imperfect.
  – Samples are taken when it is impossible, impractical, or too expensive to obtain complete data on the relevant population.
• EX. Suppose you asked 100 potential customers how much they will spend on a proposed new product next year.
  – From the 100 responses you obtained a sample average of $250. You could make the following inferences:
    • My best estimate of average sales per potential customer is $250.
    • Average sales per potential customer will be between $210 and $290 with 95% confidence.
    • Average sales per potential customer will be greater than the break-even amount of $210 at a 2.5% level of significance.
• Law of Large Numbers:
  – Draw independent observations at random from any population with finite mean.
  – As the number of observations increases, the mean of the observed values eventually comes as close to the population mean as you specify and then stays that close.
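
The law of large numbers stated above can be watched in a small simulation. This is only an illustrative sketch: the exponential population, its mean of 250 (chosen to echo the $250 example), the fixed seed, and the helper name `running_means` are assumptions made here and are not part of the course material.

```python
import random

def running_means(draws):
    """Return the mean of the first k draws for k = 1, ..., len(draws)."""
    means = []
    total = 0.0
    for k, x in enumerate(draws, start=1):
        total += x
        means.append(total / k)
    return means

# Assumed population: exponential (strongly skewed) with mean 250.
POPULATION_MEAN = 250.0
rng = random.Random(0)  # fixed seed so the run is reproducible
draws = [rng.expovariate(1.0 / POPULATION_MEAN) for _ in range(100_000)]
means = running_means(draws)

# The running mean wanders for small n and settles near 250 as n grows.
for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"n = {n:>7,}: running mean = {means[n - 1]:8.2f}")
```

Any population with a finite mean could be substituted for the exponential draw; a skewed one is used here only to show that the result does not depend on the population being bell-shaped.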

Sampling variability
• Parameter: p = the proportion of the adult population in the US (~190 million) that finds clothes shopping frustrating.
• Statistic: 66%, or 1650 out of 2500 adults.
• Sampling variability: The value of a statistic varies in repeated random sampling.
• To answer “What would happen if we took many samples?”:
  – Take a large number of samples from the same population.
  – Calculate the sample proportion p̂ for each sample.
  – Make a histogram of the values of p̂.
  – Examine the distribution displayed in the histogram.
• We can imitate the chance behavior of many samples by simulation, using random digits or a computer.

Sampling variability
• The sampling distribution of a statistic is the distribution of values taken by the statistic in all possible samples of the same size from the same population.
• It can be either
  – approximated by simulation or
  – obtained exactly by probability theory in statistics.

Figure: 1000 SRSs of size 100 when p = 0.6.
Figure: 1000 SRSs of size 100 and of size 2500 when p = 0.6.

Bias and variance
• A statistic is unbiased if the mean of its sampling distribution is equal to the true value of the parameter being estimated: no favoritism.
• The variability of a statistic is described by the spread of its sampling distribution.
  – 95% of the sample proportions will lie in the range 0.6 ± 0.1 (n = 100) or 0.6 ± 0.02 (n = 2500).
  – Larger samples have smaller spreads.
• As long as the population is much larger than the sample, the spread of the sampling distribution for a sample of fixed size n is approximately the same for any population size.
  – An SRS of size 2500 from 270 million US residents gives results as precise as an SRS of size 2500 from the 740,000 inhabitants of San Francisco (SFO)!

Why randomize?
• The act of randomizing guarantees that the results of analyzing our data are subject to the laws of probability.
  – Randomization removes bias.
  – Replication (a bigger sample) reduces variance.
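  – It lets us better answer “What would happen if the sample or the experiment were repeated many times?”
• Caution: the sampling distribution does not reflect bias due to under-coverage, non-response, lack of realism, etc.

Presidential Election and Poll
Background: the 1936 US presidential election
• President Franklin Roosevelt was running for re-election; Kansas governor Landon was the Republican presidential candidate.
• The US economy was gradually recovering from the Great Depression.
  – Nine million people were unemployed, and real income had fallen by one third between 1929 and 1933.
  – Governor Landon campaigned on “small government”. Slogan: “The spender must go.”
  – President Roosevelt campaigned on “expanding domestic demand” (deficit financing). Slogan: “Balance the budget of the American people first.”
• Claim 1: Most observers expected President Roosevelt to win by a wide margin.
• Claim 2: The Literary Digest magazine predicted that Landon would win 57% to 43%.
  – This figure was based on a poll of 2.4 million people.
  – Since 1916, the magazine's forecasting method had called every election correctly.
• Election result: Roosevelt won 62% to 38%. Why?
• The work of a new competitor, Gallup:
  – From the Literary Digest's sample of 2.4 million people, Gallup drew a sample of 3,000 and predicted that Landon would win 56% to 44%.
  – From his own sample of 50,000 people, Gallup predicted that Roosevelt would win 56% to 44%.

Where did the Digest go wrong?
Sampling method: 10 million questionnaires were mailed and 2.4 million were returned, but the recipients were chosen from telephone directories and club membership lists.
  – At the time there were only 11 million residential telephones, and nine million people were unemployed.
Possible sources of the problem:
• Selection bias: the Digest's sample contained too many wealthy people, and that year voting preferences differed sharply between rich and poor.
• Non-response bias: the response rate was low.
  – In Chicago, for example, questionnaires were sent to one third of the registered voters and about 20% were returned. More than half of the respondents said they would vote for Landon, yet Roosevelt received two thirds of the vote.

Why is simple random sampling a reasonable sampling method?
• Consider sampling 16 hospitals to estimate the mean number of discharged patients across 393 hospitals.
  – There are about 10^33 different possible samples.
  – By the central limit theorem, the distribution of the resulting sample means looks like a bell-shaped curve centered at the mean number of discharged patients over all hospitals, and most of the 16-hospital sample means lie not far from the center (law of large numbers).
For a more trustworthy sampling method, the sample should be selected by a random mechanism.

Statistical Estimation
• A parameter is a number that describes the population.
  – Its value is fixed but unknown.
• A statistic is a number that describes a sample.
  – Its value is known for a given sample, but it can change from sample to sample.
  – We use a statistic to estimate an unknown parameter.
• Error of estimation is the difference between an estimate and the estimated parameter.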
  – When estimating the population mean using the sample mean,
    Error of Estimation = sample mean − population mean
• The distribution of Error of Estimation: Central Limit Theorem
  – If the sample size is large, the error of estimation is approximately normally distributed with mean zero and a standard deviation that can be estimated by
    Standard Error = sample standard deviation / (sample size)^(1/2)
• The Normal Distribution
  – If X has an N(μ, σ²) distribution, then Z = (X − μ)/σ has an N(0, 1) distribution.

The normal density
• The height of the normal density curve for the normal distribution with mean μ and SD σ is given by:
  \[ \varphi(x, \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}} \]
• Why are normal distributions important?
  • They are good descriptions for some distributions of real data (e.g. test scores, repeated measurements, characteristics of biological populations).
  • They are good approximations to the results of many kinds of chance outcomes (e.g. coin tossing).
  • Many statistical inference procedures based on normal distributions work well for other roughly symmetric distributions.

From Margin of Error to Confidence Interval
• What is the probability that the error of estimation exceeds two standard errors?
  – If we use two standard errors as the margin of error around our estimate, what can we say about the resulting interval estimate?
• Confidence and Probability
  – When reporting that a confidence interval for a population mean extends from $210 to $290, it is tempting to slip into the language of probability and say there is only a 5% chance that the true mean of the population is outside this interval.
  – Such a probabilistic interpretation is much more natural and appealing than the rather convoluted interpretation above. But is it legitimate?
  – Example:
    • Suppose that, from a sample of 100 potential customers, one market researcher obtained a 95% confidence interval of ($190, $210) for the average amount a potential customer will spend on a product next year.
    • Another market researcher, from a different sample of size 400, obtained a 95% confidence interval of ($215, $225).
    • How do you reconcile these two results?

Test of Significance
• Example 1: A market researcher asked a sample of 100 potential customers how much they plan to spend on a product next year.
  – The mean of the sample turned out to be $250 and the standard deviation $200.
  – Is it likely that average sales per capita exceed a break-even level of $208?
• Example 2: Suppose a manager is trying to decide which of two new products, A or B, to introduce. Break-even sales per capita are $208 for both A and B.
  – Sample results are given below.
  – Product A: sample size = 10,000, sample mean = 211, sample SD = 100
  – Product B: sample size = 100, sample mean = 250, sample SD = 300
• Example 3: In a Business Week/Harris executive poll, senior executives were asked: “Compared with the last 12 months, do you think the rate of growth of the gross domestic product will go up, go down, or stay the same for the next 12 months?”

Test for Independence
• Application to business outlook
• Results of this poll are summarized below (Business Week, 1/09/95).

                      Date of Survey
  Outlook           12/94    6/94   12/93   Total
  Go Up               152     177     101     430
  Go Down             104      72      36     212
  Stay the Same       144     152     261     557
  Not Sure              0       0       4       4
  Total               400     401     402    1203

• Have the executives changed their outlook over time?

Relations in categorical data
• Relationship between two or more categorical variables.
• Use counts (frequencies) or percents (relative frequencies) of individuals that fall into various categories.

Two-way table
• A two-way table describes two categorical variables.
• Each horizontal row in the table describes individuals with one level of the row variable.
• Each vertical column describes individuals with one level of the column variable.
• EX: Years of school completed, by age (thousands of persons)

                                          Age group
  Education                       25 to 34   35 to 54   55 and over     Total
  Did not complete high school       5,325      9,152        16,035    30,512
  Completed high school             14,061     24,070        18,320    56,451
  College, 1 to 3 years             11,659     19,926         9,662    41,247
  College, 4 or more years          10,342     19,878         8,005    38,225
  Total                             41,387     73,026        52,022   166,435

Marginal distributions
• Look at the distribution of each variable separately.
• The “Total” column lists the row totals; the “Total” row lists the column totals.
• Row and column totals specify the marginal distributions of each of the two categorical variables.

Figure: The distribution of years of schooling completed among people aged 25 years and over.

Describing relationships
• What percent of people aged 25 to 34 have completed 4 years of college?
• What percent of people aged 35 to 54 have completed 4 years of college?
• What percent of people aged 55 and over have completed 4 years of college?
• Conclusion?

Figure: Conditional distribution of age group given the education level.

Three-way table
• The table of outcome by hospital by patient condition is a three-way table that reports the frequencies of each combination of levels of three categorical variables.
• We can aggregate a three-way table into a two-way table.
• A variable being aggregated can become a lurking variable.

NSF study on the salaries of new women engineers
• The median salary of newly graduated female engineers and scientists was 73% of that for males.
• Field is a lurking variable (life and social sciences versus physical sciences and engineering).

Establishing causation
• The best (and only?) method of establishing causation is to conduct a carefully designed experiment in which the effects of possible lurking variables are controlled.
• What other criteria apply when we can't do an experiment? Consider “Smoking causes lung cancer”:
• The association is strong.
• The association is consistent.
• Higher doses are associated with stronger responses.
• The alleged cause precedes the effect in time.
• The alleged cause is plausible.

Forecasting with Distinguishable Data
• Objective: Introduce the basic concepts of forecasting to motivate a regression model.
• Forecasting with Indistinguishable Data:
  – If the future value of the variable you would like to forecast is indistinguishable from the sample values you collected, then you forecast with indistinguishable data.
  – Example 1: To help forecast the selling price of your house, you obtained a sample of ten selling prices: $109,360, $137,980, $131,230, $130,230, $125,410, $124,370, $139,030, $140,160, $144,220, $154,190.
• Forecasting when the Data are Distinguishable:
  – When your sample contains additional information, so that the sample values are no longer indistinguishable from the future value you would like to forecast, you forecast with distinguishable data.
  – Example 2: Our sample also contains the square footage of the ten houses: ($109,360, 1404), ($137,980, 1477), ($131,230, 1503), ($130,230, 1552), ($125,410, 1608), ($124,370, 1633), ($139,030, 1717), ($140,160, 1775), ($144,220, 1838), ($154,190, 1934).

Forecasting with Distinguishable Data
• Assume that your house has 1682 square feet of living area.
  – Analysis 1: sample average of all ten houses = $133,618 (SD = $12,406)
• Analysis 2: Stratify the sample according to house size (square feet).

  Size Range    Sample Average        SD   Number of Observations
  1400-1599           $127,200   $12,381                        4
  1600-1799           $132,243    $8,513                        4
  1800-1999           $149,205    $7,050                        2

  Then use $132,243 (instead of $133,618) to forecast the selling value.
  – Does the cell standard deviation properly measure the forecast uncertainty?
  – Is it possible to have a measure of the overall efficacy of partitioning the sample into cells?
• Use the data more efficiently: The stratification method that we used is unsatisfactory for two reasons. First, we have ignored the data on houses that are “less like” yours and used only those “most like” yours. Second, we have stratified the data somewhat arbitrarily.
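
The two analyses above can be reproduced from the ten (price, square footage) pairs. The following is a minimal sketch, assuming the sample standard deviation uses the n − 1 divisor (this matches the SDs quoted on the slide); the names `houses`, `summarize`, `cells`, and `my_area` are illustrative and not from the course.

```python
from statistics import mean, stdev

# (selling price in dollars, living area in square feet) for the ten houses
houses = [
    (109_360, 1404), (137_980, 1477), (131_230, 1503), (130_230, 1552),
    (125_410, 1608), (124_370, 1633), (139_030, 1717), (140_160, 1775),
    (144_220, 1838), (154_190, 1934),
]

def summarize(prices):
    """Return (sample mean, sample SD with n - 1 divisor, count)."""
    return mean(prices), stdev(prices), len(prices)

# Analysis 1: treat all ten prices as indistinguishable.
overall_mean, overall_sd, _ = summarize([price for price, _ in houses])
print(f"all houses: mean = {overall_mean:,.1f}, SD = {overall_sd:,.1f}")

# Analysis 2: stratify by size range, as in the table above.
cells = [(1400, 1599), (1600, 1799), (1800, 1999)]
by_cell = {
    (lo, hi): summarize([p for p, area in houses if lo <= area <= hi])
    for lo, hi in cells
}
for (lo, hi), (m, sd, n) in by_cell.items():
    print(f"{lo}-{hi}: mean = {m:,.1f}, SD = {sd:,.1f}, n = {n}")

# Forecast for a 1682 square foot house: use the mean of its own cell.
my_area = 1682
my_cell = next(cell for cell in cells if cell[0] <= my_area <= cell[1])
cell_forecast = by_cell[my_cell][0]
print(f"cell forecast for {my_area} sq ft: {cell_forecast:,.1f}")
```

Changing the boundaries in `cells` changes the forecast, which is one way to see the arbitrariness of the stratification that the slide points out.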

The question of causation
• Mother's adult height vs daughter's adult height.
• Amount of saccharin in a rat's diet vs count of tumors in the rat's bladder.
• A student's SAT score vs the student's first-year GPA.
• Monthly flow of money into stock mutual funds vs monthly rate of return for the stock market.
• The anesthetic used in surgery vs whether the patient survives the surgery.
• The number of years of education a worker has vs the worker's income.

Explaining association
• Causation.
• Common response (a lurking variable).
• Confounding: two variables are confounded when their effects on a response variable are mixed together.

Data on the survival of patients after surgery in hospitals A and B

              Hospital A   Hospital B
  Died                63           16
  Survived          2037          784
  Total             2100          800

• Hospital A loses 3% of its patients while Hospital B loses 2%.

Lurking variable...

  Good condition
              Hospital A   Hospital B
  Died                 6            8
  Survived           594          592
  Total              600          600

  Bad condition
              Hospital A   Hospital B
  Died                57            8
  Survived          1443          192
  Total             1500          200

• Patients in good condition: Hospital A loses 1% vs 1.3% for Hospital B.
• Patients in bad condition: Hospital A loses 3.8% vs 4% for Hospital B.

Simpson's paradox
• How can A do better in each group, yet do worse overall?
• An association or comparison that holds for all of several groups can reverse direction when the data are combined to form a single group.

Regression Model
• Try to create a model that specifies the relationship between selling price (the dependent variable) and other variables (independent or explanatory variables) that help you forecast the selling price.
  – It is reasonable to assume that as size goes up, selling price will go up on average.

Regression Coefficients and Forecasts
• Objective: Understand regression coefficients and how to use them for forecasting.
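
As a rough illustration of regression coefficients used for forecasting, the sketch below fits a straight line, price = a + b × area, to the ten houses from the forecasting example by ordinary least squares. The straight-line form, the helper name `fit_line`, and the variable names are assumptions made for illustration; the slides do not state the fitted model.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Living area (square feet) and selling price (dollars) of the ten houses.
areas = [1404, 1477, 1503, 1552, 1608, 1633, 1717, 1775, 1838, 1934]
prices = [109_360, 137_980, 131_230, 130_230, 125_410,
          124_370, 139_030, 140_160, 144_220, 154_190]

intercept, slope = fit_line(areas, prices)

# The slope is the average change in price per additional square foot.
print(f"intercept a = {intercept:,.0f} dollars")
print(f"slope     b = {slope:,.2f} dollars per square foot")

# Forecast for the 1682 square foot house from the example.
forecast = intercept + slope * 1682
print(f"forecast at 1682 sq ft = {forecast:,.0f} dollars")
```

Unlike the stratified forecast, this uses all ten houses and needs no arbitrary cell boundaries, which is the motivation the slides give for moving to a regression model.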

Measures of Goodness of Fit and Residual Analysis
• Objective: Introduce a few statistics that measure how well a regression model fits the data, and show how to use residual analysis to detect inadequacies of a regression model.

Developing a Regression Model
• Objective: Demonstrate how to develop a useful regression model through
  – Selection of the Dependent Variable
  – Selection of the Independent Variables
  – Determining the Nature of Relationships
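
The slides do not say which goodness-of-fit statistics are meant. As one possible illustration, the sketch below computes R² and the residual standard error for a straight-line fit to the ten houses from the forecasting example, and lists the residuals for inspection. The choice of these two statistics, the straight-line model, and all variable names are assumptions.

```python
from math import sqrt

# Living area (square feet) and selling price (dollars) of the ten houses.
areas = [1404, 1477, 1503, 1552, 1608, 1633, 1717, 1775, 1838, 1934]
prices = [109_360, 137_980, 131_230, 130_230, 125_410,
          124_370, 139_030, 140_160, 144_220, 154_190]

n = len(areas)
x_bar = sum(areas) / n
y_bar = sum(prices) / n

# Least-squares line: price = intercept + slope * area.
sxx = sum((x - x_bar) ** 2 for x in areas)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(areas, prices))
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Residual = observed price minus fitted price.
fitted = [intercept + slope * x for x in areas]
residuals = [y - f for y, f in zip(prices, fitted)]

sse = sum(r ** 2 for r in residuals)          # unexplained variation
sst = sum((y - y_bar) ** 2 for y in prices)   # total variation
r_squared = 1 - sse / sst                     # share of variation explained
residual_se = sqrt(sse / (n - 2))             # typical size of a residual

print(f"R^2 = {r_squared:.3f}")
print(f"residual standard error = {residual_se:,.0f} dollars")

# Residual analysis: look for unusually large residuals or patterns by area.
for x, r in zip(areas, residuals):
    print(f"area {x}: residual {r:>10,.0f}")
```

A residual that is large relative to the residual standard error, or residuals that trend with area, would suggest that the straight-line model is inadequate for these data.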