Statistical Forecasting Methods (統計預測方法) - Department of Mathematics, National Taiwan University (國立臺灣大學 數學系)

Hung Chen (陳宏)
Thursdays 9:10 to 12:00, Room A211
[email protected]
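– Important probability distributions
– Simulating random variables
– Point estimation, confidence intervals, hypothesis testing
– Linear regression, logistic regression
– Analysis of variance
– Contingency table analysis
– Principal Component Analysis
– Factor Analysis
– Discriminant Analysis
– Cluster Analysis
– Canonical Correlation Analysis
• References:
– To be determined
• Programming language:
– R (available online)
– R has a home page at
– Download
– Midterm exam (30%), projects (70%)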
– Exploratory Data Analysis: Decision Making
– Data Mining
– Data Collection: Sampling and Questionnaires
– R Software
– Probability and Random Variables
– Variance
– Association
– IntroRegression
– MultipleRegression
– DAonREgression
– Principal Component Analysis
– Factor Analysis
– Discriminant Analysis
– Cluster Analysis
– Canonical Correlation Analysis
Statistics for Decision Making
•Describing Sets of Data
– Objective: Introduce numerical methods and graphical displays to
summarize data sets.
– Graphical and numerical tools
• for examining the distribution of a single variable,
• for comparing several distributions, and
• for investigating changes over time.
•Sampling and Statistical Inference
– Objective: Provide methods to infer about a population based on a
sample of observations drawn from that population
•Forecasting with Distinguishable Data
– Objective: Introduce the basic concepts of forecasting to motivate a
regression model.
– Method for studying relationships among several variables.
•Regression Coefficients and Forecasts
– Objective: Understand regression coefficients and how to use them for
forecasting.
•Measures of Goodness of Fit and Residual Analysis
– Objective: Introduce a few statistics that measure how well a
regression model fits the data and show how to use residual analysis
to detect inadequacies of a regression model
•Developing a Regression Model
– Objective: Demonstrate how to develop a useful regression model
•Selection of the Dependent Variable
•Selection of the Independent Variables
•Determining the Nature of Relationships
Sampling and Statistical Inference
•Objective: Provide methods to infer about a population based
on a sample of observations drawn from that population.
•Inference from a Sample
•Statistical Estimation
•From Margin of Error to Confidence Interval
•Test of Significance
Inference from a Sample
•The sample provides useful information, but the information
is imperfect.
– Samples are taken when it is impossible, impractical or too expensive
to obtain complete data on relevant population.
•EX. Suppose you asked 100 potential customers how
much they will spend on a proposed new product next year.
– From the 100 responses you obtained a sample average of $250. You
could make the following inference:
• My best estimate of average sales per potential customer is $250.
• Average sales per potential customer will be between $210 and $290 with
95% confidence.
• Average sales per potential customer will be greater than the break-even
amount of $210 at a 2.5% level of significance.
•Law of Large Numbers:
– Draw independent observations at random from any population with finite
mean μ.
– As the number of observations drawn increases, the mean of the
observed values eventually approaches the mean μ of the population
as closely as you specify and then stays that close.
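The law of large numbers stated above can be watched in a short simulation. This is only a sketch: the exponential population with mean 250, the seed, and the helper name `running_means` are assumptions chosen for illustration, not part of the lecture.

```python
import random

def running_means(draw, n, seed=0):
    """Return the running sample mean after each of n independent draws."""
    rng = random.Random(seed)
    total = 0.0
    means = []
    for i in range(1, n + 1):
        total += draw(rng)
        means.append(total / i)
    return means

# Assumed population: exponentially distributed amounts with mean MU.
MU = 250.0
means = running_means(lambda rng: rng.expovariate(1.0 / MU), 200_000)

# The running mean settles near MU as the number of draws grows.
for n in (10, 100, 10_000, 200_000):
    print(f"n = {n:>7}: running mean = {means[n - 1]:8.2f}")
```

Early running means can be far from 250, while the later ones stay close to it, which is the behaviour the law describes.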
Sampling variability
•Parameter: p=the proportion of the adult population in the
US (~190 million) that find clothes shopping frustrating.
•Statistic: 66% or 1650 out of 2500 adults.
•Sampling variability: The value of a statistic varies in repeated
random sampling.
•Answer to “What would happen if we took many samples?”
– Take a large number of samples from the same population.
– Calculate the sample proportion p̂ for each sample.
– Make a histogram of the values of p̂.
– Examine the distribution displayed in the histogram.
•We can imitate chance behavior of many samples by using
random digits or computer (simulation).
Sampling variability
•The sampling distribution of a statistic is the distribution of
values taken by the statistic in all possible samples of the
same size from the same population.
•Can be either
– approximated by simulation or
– obtained exactly by probability theory in statistics.
Figure: sampling distribution of p̂ for 1000 SRSs of size 100 when p=0.6.
Figure: sampling distributions of p̂ for 1000 SRSs of size 100 and of size 2500 when p=0.6.
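The simulation described on these slides (1000 SRSs of size 100 and of size 2500 with p = 0.6) can be imitated as follows. This is a minimal sketch that assumes the population is so much larger than the sample that each draw is an independent success with probability p; the function name and seed are illustrative.

```python
import random
import statistics

def sample_proportions(p, n, reps, seed=0):
    """Simulate `reps` SRSs of size n and return the sample proportion of each."""
    rng = random.Random(seed)
    return [sum(rng.random() < p for _ in range(n)) / n for _ in range(reps)]

P = 0.6
results = {}
for n in (100, 2500):
    p_hats = sample_proportions(P, n, reps=1000)
    centre = statistics.fmean(p_hats)   # centre of the simulated sampling distribution
    spread = statistics.stdev(p_hats)   # spread of the simulated sampling distribution
    results[n] = (centre, spread)
    print(f"n = {n:>4}: mean of p-hat = {centre:.4f}, SD of p-hat = {spread:.4f}")
```

A histogram of `p_hats` would approximate the sampling distribution. Both distributions centre near 0.6, and the one for n = 2500 is much narrower.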
Bias and variance
•A statistic is unbiased if the mean of its sampling distribution
is equal to the true value of the parameter being estimated: it shows no favoritism.
•The variability of a statistic is described by the spread of its
sampling distribution.
– 95% of the sample proportions will lie in the range 0.6 ± 0.1 (n=100) or
0.6 ± 0.02 (n=2500)
– Larger samples have smaller spreads.
•As long as the population is much larger than the sample, the
spread of the sampling distribution for a sample of fixed size
n is approximately the same for any population size.
– An SRS of size 2500 from 270 million US residents gives
results as precise as an SRS of size 2500 from 740,000
inhabitants of SFO!
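The claim that population size barely matters can be checked with the standard error of a sample proportion under simple random sampling without replacement, which includes the finite population correction (N − n)/(N − 1). The sketch below assumes p = 0.6, as in the earlier simulation slides; the function name is illustrative.

```python
import math

def se_proportion(p, n, population_size):
    """Standard error of a sample proportion for an SRS drawn without replacement."""
    fpc = (population_size - n) / (population_size - 1)  # finite population correction
    return math.sqrt(p * (1 - p) / n * fpc)

se_us = se_proportion(0.6, 2500, 270_000_000)  # US residents
se_sf = se_proportion(0.6, 2500, 740_000)      # San Francisco inhabitants
print(f"SE with N = 270,000,000: {se_us:.5f}")
print(f"SE with N = 740,000:     {se_sf:.5f}")
```

The two standard errors agree to within a fraction of a percent, because the correction factor is close to 1 whenever the population is much larger than the sample.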
Why randomize?
• The act of randomizing guarantees that the results of analyzing our
data are subject to the laws of probability.
– Randomization removes bias.
– Replication (bigger sample) reduces variance.
– Randomization lets us better answer “What would happen if the sample or the
experiment were repeated many times?”
•Caution: the sampling distribution does not reflect bias due
to under-coverage, non-response, lack of realism, etc.
Presidential Election and Poll
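– Governor Landon campaigned on “small government.” His slogan was “The spender must go.”
– President Roosevelt campaigned on “expanding domestic demand” (deficit financing). His slogan was “Balance
the budget of the American people first.”
•Claim 2: The Literary Digest predicted that Landon would win the election by 57% to 43%.
– This figure was based on a poll of 2.4 million people.
– Since 1916, the magazine had always predicted correctly using its method.
– From the 2.4 million people in the Literary Digest sample, Gallup sampled 3,000 people,
– There are about 1033 different possible samples.
– By the central limit theorem, the distribution of the resulting average number of discharged patients looks like a bell-shaped curve, whose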
Statistical Estimation
•A parameter is a number that describes the population.
– Its value is fixed but unknown.
•A statistic is a number that describes a sample.
–Its value is known for a sample, but it can change from sample to sample.
–We use a statistic to estimate an unknown parameter.
•Error of estimation is the difference between an estimate and the
estimated parameter.
–In case of estimating the population mean using the sample mean,
Error of Estimation = sample mean – population mean
•The distribution of Error of Estimation: Central Limit Theorem
–If the sample size is large, the error of estimation is approximately
normally distributed with mean zero and a standard deviation which can
be estimated by
Standard Error = sample standard deviation/(sample size)^(1/2)
•The Normal Distribution
–If X has the N(μ, σ²) distribution, then Z = (X − μ)/σ has the N(0,1) distribution.
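The statement about the error of estimation can be checked by simulation. The sketch below draws many samples from an assumed population (exponential with mean 250, which is not from the lecture), records the error of estimation for each, and counts how often the error falls within two estimated standard errors.

```python
import math
import random
import statistics

def standard_error(sample):
    """Sample standard deviation divided by the square root of the sample size."""
    return statistics.stdev(sample) / math.sqrt(len(sample))

rng = random.Random(1)
MU, N, REPS = 250.0, 100, 2000   # assumed population mean, sample size, repetitions

errors = []
within_two_se = 0
for _ in range(REPS):
    sample = [rng.expovariate(1.0 / MU) for _ in range(N)]
    error = statistics.fmean(sample) - MU          # error of estimation
    errors.append(error)
    if abs(error) <= 2 * standard_error(sample):
        within_two_se += 1

coverage = within_two_se / REPS
print(f"mean error = {statistics.fmean(errors):.2f}")
print(f"SD of errors = {statistics.stdev(errors):.2f}")
print(f"share of errors within 2 SE = {coverage:.3f}")
```

The errors average close to zero, and roughly 95% of them fall within two standard errors, as the normal approximation suggests, even though the assumed population is strongly skewed.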
The normal density
• The height of the normal density curve for the normal distribution
with mean μ and SD σ is given by:
\[ \phi(x, \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left( -\frac{1}{2} \left( \frac{x-\mu}{\sigma} \right)^{2} \right) \]
•Why are normal distributions important?
• Good description for some distributions of real data. (e.g. test scores,
repeated measurements, characteristics of biological populations, etc.)
• Good approximations to the results of many kinds of chance outcomes.
(e.g. coin tossing).
• Many statistical inference procedures based on normal distributions
work well for other roughly symmetric distributions.
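The normal density formula on this slide translates directly into code. In this sketch the function name `normal_density` is illustrative, and the result is compared with `statistics.NormalDist.pdf` from the Python standard library as a sanity check.

```python
import math
from statistics import NormalDist

def normal_density(x, mu, sigma):
    """Height of the N(mu, sigma^2) density curve at x."""
    z = (x - mu) / sigma                      # standardized value
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# Compare with the standard library at a few points of an N(250, 20^2) curve.
for x in (210, 250, 290):
    ours = normal_density(x, 250, 20)
    reference = NormalDist(250, 20).pdf(x)
    print(f"x = {x}: {ours:.6f} (library: {reference:.6f})")
```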
From Margin of Error to Confidence Interval
•What is the probability that the error of estimation exceeds
two standard errors?
– If we add two standard errors to our estimate as the margin of error,
what can we say about the resulting interval estimate?
•Confidence and Probability
– When reporting that a confidence interval for a population mean
extends from $210 to $290, it is tempting to slip into the language of
probability, and say there is only 5% chance that the true mean of the
population is outside this interval.
– Such probabilistic interpretation is much more natural and appealing
than the rather convoluted interpretation above. But is it legitimate?
– Example:
• Suppose from a sample of 100 potential customers one market researcher
obtained a 95% confidence interval of ($190,$210) for the average amount
a potential customer will spend on a product next year.
• Another market researcher from a different sample of size 400 obtained a
95% confidence interval of ($215,$225).
• How do you reconcile these two results?
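The first question on this slide has a direct numerical answer under the normal approximation, and the intervals quoted in these slides can be unpacked into an estimate and a standard error. This is a sketch that assumes each interval was built as estimate ± 2 standard errors (the exact 95% multiplier is about 1.96); the helper names are illustrative.

```python
from statistics import NormalDist

# Probability that a normally distributed error exceeds two standard errors
# in either direction.
p_exceeds_two_se = 2 * (1 - NormalDist().cdf(2.0))
print(f"P(|error| > 2 SE) = {p_exceeds_two_se:.4f}")

def interval(estimate, se, k=2.0):
    """Interval estimate: estimate plus or minus k standard errors."""
    return (estimate - k * se, estimate + k * se)

def implied_estimate_and_se(lower, upper, k=2.0):
    """Recover the centre and standard error behind an interval of half-width k * SE."""
    return ((lower + upper) / 2, (upper - lower) / (2 * k))

print(interval(250, 20))                     # the $210 to $290 interval
print(implied_estimate_and_se(190, 210))     # first researcher, n = 100
print(implied_estimate_and_se(215, 225))     # second researcher, n = 400
```

Under this assumption the two researchers' intervals imply estimates of $200 and $220 with standard errors of $5 and $2.50, so the intervals do not overlap even though each was built with 95% confidence.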
Test of Significance
•Example 1: A market researcher asked a sample of 100
potential customers how much they plan to spend on a
product next year.
– The mean of the sample turned out to be $25 and the standard deviation is
– Is it likely that average sales per capita exceeds a break-even level of $208?
• Example 2: Suppose a manager is trying to decide which of the two
new products, A or B, to introduce. Break-even sales per capita are
$208 for both A and B.
– Sample results are given in the following.
– Product A: sample size = 10,000, sample mean=211, sample SD= 100
– Product B: sample size = 100, sample mean=250, sample SD= 300
• Example 3: In a Business Week/Harris executive poll, senior
executives were asked: “Compared with the last 12 months, do you
think the rate of growth of the gross domestic product will go up,
go down, or stay the same for the next 12 months?”
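For Example 2, one way to compare the two products is a one-sided z-test of each sample mean against the break-even level of $208. The sketch below uses the sample sizes, means, and SDs given in the example; treating the test statistic as standard normal is an assumption that relies on the large-sample approximation, and the function name is illustrative.

```python
import math
from statistics import NormalDist

def z_test_greater(sample_mean, sample_sd, n, break_even):
    """One-sided z-test of H0: mean <= break_even against H1: mean > break_even."""
    se = sample_sd / math.sqrt(n)              # standard error of the sample mean
    z = (sample_mean - break_even) / se
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value

z_a, p_a = z_test_greater(211, 100, 10_000, 208)   # Product A
z_b, p_b = z_test_greater(250, 300, 100, 208)      # Product B
print(f"Product A: z = {z_a:.2f}, one-sided p-value = {p_a:.4f}")
print(f"Product B: z = {z_b:.2f}, one-sided p-value = {p_b:.4f}")
```

Product A is statistically significant at the 2.5% level although its mean is barely above break-even, while Product B has a much larger sample mean that is not significant, which shows that significance and practical importance are different questions.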
Test for Independence
•Application on Business outlook
•Results of this poll are summarized below (Business Week,
Date of Survey
12/94 6/94 12/93 Total
Go Up
152 177
Go Down
Outlook Stay the Same 144 152
Not Sure
400 401
•Have the executives changed their outlook over time?
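A question like this is commonly addressed with a chi-square test of independence on the two-way table of outlook by survey date. The sketch below computes the test statistic and its degrees of freedom. The counts in `hypothetical_poll` are made up for illustration and are not the Business Week figures, which are incomplete in the table above.

```python
def chi_square_independence(table):
    """Return the chi-square statistic and degrees of freedom for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total  # count expected under independence
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df

# HYPOTHETICAL counts: rows are Go Up, Go Down, Stay the Same, Not Sure;
# columns are three survey dates.
hypothetical_poll = [
    [150, 180, 200],
    [80, 60, 50],
    [150, 150, 140],
    [20, 10, 10],
]
stat, df = chi_square_independence(hypothetical_poll)
print(f"chi-square = {stat:.2f} on {df} degrees of freedom")
```

With 6 degrees of freedom, a statistic above about 12.59 would be significant at the 5% level.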
Relations in categorical data
•Relationship between two or more categorical variables.
•Use counts (frequencies) or percent (relative frequencies) of
individuals that fall into various categories.
Two-way table
•A two-way table describes two categorical variables.
•Each horizontal row in the table describes individuals with one
level of the row variable.
•Each vertical column describes individuals with one level of
the column variable.
•EX: Years of school completed, by age (thousands of persons)
Rows (education): did not complete high school; completed high school; college, 1 to 3 years; college, 4 or more years
Columns (age group): 25 to 34; 35 to 54; 55 and over
52,022 166,435
Marginal distributions
•Look at the distribution of each variable separately.
•“Total” columns list the totals for each of the rows or row
totals. Similarly for column totals.
•Row and column totals specify the marginal distributions of
each of the two categorical variables.
The distribution of years of schooling completed among people age 25
years and over
Describing relationships
•What percent of people aged 25 to 34 have completed 4 years
of college?
•What percent of people aged 35 to 54 have completed 4 years
of college?
•What percent of people aged 55 and over have completed 4
years of college?
Conditional distribution of age group on
the education level
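Marginal and conditional distributions are both simple ratios of counts in the two-way table. The sketch below uses the row and column labels from the schooling example, but the counts are made up for illustration because the original table values did not survive.

```python
EDUCATION = [
    "did not complete high school",
    "completed high school",
    "college, 1 to 3 years",
    "college, 4 or more years",
]
AGE_GROUPS = ["25 to 34", "35 to 54", "55 and over"]

# HYPOTHETICAL counts: one row per education level, one column per age group.
counts = [
    [40, 60, 100],
    [120, 180, 150],
    [80, 100, 50],
    [110, 160, 50],
]

row_totals = [sum(row) for row in counts]
col_totals = [sum(col) for col in zip(*counts)]
grand_total = sum(row_totals)

# Marginal distribution of education: row totals as shares of the grand total.
marginal_education = {e: t / grand_total for e, t in zip(EDUCATION, row_totals)}

# Conditional distribution within each age group: share with 4 or more years of college.
college_row = counts[EDUCATION.index("college, 4 or more years")]
college_given_age = {a: c / t for a, c, t in zip(AGE_GROUPS, college_row, col_totals)}

for age, share in college_given_age.items():
    print(f"{age}: {100 * share:.1f}% completed 4 or more years of college")
```

Comparing the conditional percentages across age groups describes the relationship between the two variables, which the marginal distributions alone cannot show.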
Three way table
• The table of outcome by hospital by patient
condition is a three-way table that reports
the frequencies of each combination of
levels of three categorical variables.
• We can aggregate a three-way table into a
two-way table.
• A variable being aggregated can become a
lurking variable.
NSF study on the salaries of new
women engineers
• The median salary of newly graduated
female engineers and scientists was 73% of
that for males.
• Field is a lurking variable. (life and social
sciences against physical and engineering)
Establishing causation
• The best (and only?) method of establishing
causation is to conduct a carefully designed
experiment in which the effects of possible
lurking variables are controlled.
• What other criteria can we use when we can’t do an experiment?
“Smoking causes lung cancer”
• The association is strong.
• The association is consistent.
• Higher doses are associated with stronger responses.
• The alleged cause precedes the effect in time.
• The alleged cause is plausible.
Forecasting with Distinguishable Data
• Objective: Introduce the basic concepts of forecasting to motivate a
regression model.
• Forecasting with Indistinguishable Data:
– If the future value of the variable you would like to forecast is
indistinguishable from the sample values you collected, then you forecast
with indistinguishable data.
– Example 1: To help forecast the selling price of your house, you obtained
a sample ($109,360, $137,980, $131,230, $130,230, $125,410, $124,370,
$139,030, $140,160, $144,220, $154,190).
• Forecasting when the Data are Distinguishable:
– When your sample contains additional information so that the sample values
are no longer indistinguishable from the future value you would like to
forecast, you forecast with distinguishable data.
– Example 2: Our sample also contains information on the square footage of
the ten houses. ($109,360,1404), ($137,980,1477), ($131,230,1503),
($130,230,1552), ($125,410,1608), ($124,370,1633), ($139,030,1717),
($140,160,1775), ($144,220,1838), ($154,190,1934).
Forecasting with Distinguishable Data
• Assume that your house has 1682 square feet of living area.
– Analysis 1: sample average of all ten houses = $133,618 (SD = $12,406)
• Analysis 2: Stratify the sample according to lot size.
Size Range
Sample Average
Number of Observations
Then use $132,243 (instead of $133,618) to forecast the selling value.
– Does the cell standard deviation properly measure the forecast uncertainty?
– Is it possible to have a measure of overall efficacy of our partitioning the
sample into cells?
• Use the data more efficiently: The stratification method that we
used is unsatisfactory for two reasons. First, we have ignored data
on houses that are “less like” yours, using only those “most like” yours. Second,
we have stratified the data somewhat arbitrarily.
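The two analyses on this slide can be reproduced from the ten (price, square footage) pairs listed earlier. The cell boundaries did not survive in the table, so the sketch below assumes a cell of 1600 to 1799 square feet, which is the range containing a 1682-square-foot house that reproduces the $132,243 figure.

```python
import statistics

# (selling price in dollars, living area in square feet) for the ten sample houses
houses = [
    (109360, 1404), (137980, 1477), (131230, 1503), (130230, 1552),
    (125410, 1608), (124370, 1633), (139030, 1717), (140160, 1775),
    (144220, 1838), (154190, 1934),
]
prices = [price for price, _ in houses]

# Analysis 1: ignore size and use the overall sample average.
overall_mean = statistics.fmean(prices)
overall_sd = statistics.stdev(prices)

# Analysis 2: use only the houses in the same size cell (ASSUMED cell: 1600 to 1799 sq ft).
cell = [price for price, sqft in houses if 1600 <= sqft < 1800]
cell_mean = statistics.fmean(cell)
cell_sd = statistics.stdev(cell)

print(f"all ten houses: mean = {overall_mean:,.0f}, SD = {overall_sd:,.0f}")
print(f"cell of {len(cell)} houses: mean = {cell_mean:,.1f}, SD = {cell_sd:,.0f}")
```

The cell average rests on only four observations, which illustrates the efficiency concern raised above: six of the ten houses contribute nothing to the forecast.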
The question of causation
•Mother’s adult height vs daughter’s adult height.
•Amount of saccharin in a rat’s diet vs count of tumors in the
rat’s bladder.
•A student’s SAT score and the student’s first year GPA.
•Monthly flow of money into stock mutual funds vs monthly
rate of return for the stock market.
•The anesthetic used in surgery vs whether the patient
survives the surgery.
•The number of years of education a worker has vs the
worker’s income.
Explaining association
•Common response. (a lurking variable).
•Confounding: two variables are confounded when their
effects on a response variable are mixed together.
Data on the survival of patients after
surgery in hospital A and B
Hospital A Hospital B
•Hospital A loses 3% of patients while Hospital B
loses 2%.
Lurking variable...
Good condition
Hospital A Hospital B
Bad condition
Hospital A Hospital B
• 1% vs 1.3% for patients in good condition
• 3.8% vs 4% for patients in bad condition
Simpson’s paradox
• How can A do better in each group, yet do
worse overall?
• An association or comparison that holds for
all of several groups can reverse direction
when the data are combined to form a single group.
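The reversal is easy to verify numerically. The counts below are assumptions chosen to reproduce the death rates quoted in these slides (3% vs 2% overall, 1% vs 1.3% in good condition, 3.8% vs 4% in bad condition); the original tables did not survive, so these need not be the actual counts.

```python
# ASSUMED counts as (deaths, patients), chosen to match the quoted percentages.
surgery = {
    "Hospital A": {"good": (6, 600), "bad": (57, 1500)},
    "Hospital B": {"good": (8, 600), "bad": (8, 200)},
}

def death_rate(deaths, patients):
    return deaths / patients

rates = {}
for hospital, groups in surgery.items():
    total_deaths = sum(d for d, _ in groups.values())
    total_patients = sum(n for _, n in groups.values())
    rates[hospital] = {
        "good": death_rate(*groups["good"]),
        "bad": death_rate(*groups["bad"]),
        "overall": death_rate(total_deaths, total_patients),
    }
    print(hospital, {k: f"{100 * v:.1f}%" for k, v in rates[hospital].items()})
```

Hospital A has the lower death rate within each condition but the higher rate overall, because under these counts most of its patients arrive in bad condition. Patient condition is the lurking variable.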
Regression Model
•Try to create a model that specifies the relationship between
selling price (dependent variable) and other variables
(independent or explanatory variable) that help you forecast
the selling price.
–It is reasonable to assume that as size goes up, selling price will go up on average.
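One common way to specify such a relationship is a straight line fitted by ordinary least squares. The sketch below fits price on square footage for the ten sample houses and forecasts a 1682-square-foot house. The slides do not show a fitted line at this point, so the choice of a simple linear model is an assumption.

```python
def least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

sqft = [1404, 1477, 1503, 1552, 1608, 1633, 1717, 1775, 1838, 1934]
price = [109360, 137980, 131230, 130230, 125410, 124370, 139030, 140160, 144220, 154190]

intercept, slope = least_squares(sqft, price)
forecast = intercept + slope * 1682   # forecast for a house with 1682 sq ft

print(f"price = {intercept:,.0f} + {slope:.2f} * square feet")
print(f"forecast for 1682 sq ft: {forecast:,.0f}")
```

Unlike the stratified average, this forecast uses all ten houses, with the slope describing how much the expected price changes per additional square foot.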
Regression Coefficients and Forecasts
• Objective: Understand regression coefficients and how to use
them for forecasting.
Measures of Goodness of Fit and Residual Analysis
• Objective: Introduce a few statistics that measure how well a
regression model fits the data and show how to use residual
analysis to detect inadequacies of a regression model
Developing a Regression Model
•Objective: Demonstrate how to develop a useful regression
model through
– Selection of the Dependent Variable
– Selection of the Independent Variables
– Determining the Nature of Relationships