Inference - National Taiwan University (國立臺灣大學), Department of Mathematics

Inference
Mean, Proportion, CLT
Bootstrap
From Probability to Statistics
• In all our probability calculations, we have assumed that we know
all quantities needed to solve the problem:
– Portfolio problems: To find the expected return and standard deviation of a
portfolio, we assumed we knew the mean and standard deviation of the
returns of the underlying stocks.
– Potato chip example: To find the proportion of bags below the 8-ounce
minimum, we assumed we knew the mean and standard deviation of the
weight of chips in each bag.
– In practice, these types of parameters are not given to us; we must estimate
them from data.
• Statistical analysis usually proceeds along the following lines:
– Postulate a probability model (usually including unknown parameters) for a
situation involving uncertainty; e.g., assume that a certain quantity follows a
normal distribution.
– Use data to estimate the unknown parameters in the model.
– Plug the estimated parameters into the model in order to make predictions
from the model.
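The three steps above can be sketched in R. This is only an illustration: the data are simulated rather than real price changes, and the names `changes`, `mu_hat`, `sigma_hat`, and `p_drop` are hypothetical.

```r
# Hypothetical data: 250 simulated daily price changes (not real market data).
set.seed(1)
changes <- rnorm(250, mean = 0.05, sd = 1.2)

# Step 1: postulate a model. Here: daily changes follow N(mu, sigma^2).
# Step 2: use the data to estimate the unknown parameters.
mu_hat    <- mean(changes)
sigma_hat <- sd(changes)

# Step 3: plug the estimates into the model to make a prediction,
# e.g. the probability that the next daily change is below -2.
p_drop <- pnorm(-2, mean = mu_hat, sd = sigma_hat)
p_drop
```

Whether the normal model in step 1 is plausible is a judgment call that the code cannot make.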
How do we start?
•The first step, picking a model, must be based on an understanding
of the situation to be modeled.
– Which assumptions are plausible?
– Which are not?
– These questions are answered by judgment, not by precise statistical
techniques.
•Examples:
– Assume that daily changes in a stock price follow a normal distribution.
•Use historical data to estimate the mean and standard deviation.
•Once we have estimates, we might use the model to predict future price ranges or to
value an option on the stock.
– Assume that demand for a fashion item is normally distributed.
•Use historical data to estimate the mean and standard deviation.
•Once we have estimates, we might use the model to set production levels.
How do we get data and make inference?
•The first step in understanding the process of estimation is
understanding basic properties of sampled data and sample statistics,
since these are the basis of estimation.
•When we talk about sampling it is always in the context of a fixed
underlying population:
– If we look at 50 daily changes in IBM stock, we are looking at a sample of
size 50 from the population of all daily changes in IBM stock.
– If we ask 150 shoppers whether or not they buy corn flakes, we have a sample
of size 150 from all possible shoppers.
•If the population is very large (as in these examples), we generally
treat it as though it were infinite; this simplifies matters. Thus, we
are primarily concerned with finite samples from infinite populations.
•A single sample from a population is a random variable. Its distribution is the population distribution; e.g.,
– The distribution of a randomly selected daily change in IBM stock is the
distribution over all daily changes;
– The probability that a randomly selected shopper buys corn flakes is the
proportion of the entire population that buys corn flakes.
Random Sample
• A random sample from a population is a set of randomly selected
observations from that population. If X1,…, Xn are a random sample,
then
– they are independent;
– they are identically distributed, all with the distribution of the underlying
population.
• A sample statistic is any quantity calculated from a random sample.
The most familiar example of a sample statistic is the sample mean
X̄, given by
X̄ = (X1 + X2 + … + Xn)/n
• The sample mean gives an estimate of the population mean μ =
E[Xi].
Distribution of the Sample Mean
• Every sample statistic is a random variable.
– Randomness is introduced through the sampling mechanism.
• As noted above, the sample mean of a random sample X1,…, Xn is
an estimate of the population mean μ = E[Xi].
– How good an estimate is it?
– How can we assess the uncertainty in the estimate?
– To answer these questions, we need to examine the sampling distribution of
the sample mean; that is, the distribution of the random variable X̄.
• Assume that the underlying population is normal with mean μ and
variance σ².
– This means that Xi ~ N(μ, σ²) for all i.
– The Xi's are independent, since we assume we have a random sample.
• The sum of independent normal random variables is normally
distributed. The usual rules for means and variances apply:
– The expected value of the sum is the sum of the expected values.
– The variance of the sum is the sum of the variances (by independence).
• Any linear transformation of a normal random variable is normal;
in particular, multiplication by a constant preserves normality.
Distribution of the Sample Mean
•Using these two facts, we find that if Xi ~ N(μ, σ²) for all i, then
– X1 + X2 + … + Xn ~ N(nμ, nσ²);
– X̄ = (X1 + X2 + … + Xn)/n ~ N(μ, σ²/n).
•The sample mean from a normal population has a normal
distribution.
•First consequence:
– The expected value of the sample mean is the population mean; "on average"
the sample mean correctly estimates the underlying mean.
– The standard deviation of a sample statistic is called its standard error. Thus,
we have shown that the standard error of the sample mean is σ/√n, where σ is
the underlying standard deviation and n is the sample size.
•Second consequence:
– Because the standard error of the sample mean is σ/√n, the uncertainty in this
estimate decreases as the sample size n increases. (That's good.)
– The uncertainty (as measured by the standard deviation) decreases rather
slowly: to cut the standard deviation in half, we need to collect four times as
much data, because of the square root. (That's not so good, but that's life.)
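A small simulation can illustrate the square-root rule. This is a sketch under assumed values: the population standard deviation `sigma = 2`, the sample sizes 25 and 100, and the helper `se_sim` are all chosen here for illustration.

```r
set.seed(42)
sigma <- 2   # assumed population standard deviation

# Empirical standard error: the standard deviation of many simulated sample means.
se_sim <- function(n, reps = 5000) {
  sd(replicate(reps, mean(rnorm(n, mean = 0, sd = sigma))))
}

se_25  <- se_sim(25)
se_100 <- se_sim(100)

c(simulated = se_25,  theory = sigma / sqrt(25))
c(simulated = se_100, theory = sigma / sqrt(100))
se_25 / se_100   # close to 2: four times the data halves the standard error
```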
Example:
•Suppose the number of miles driven each week by US car owners is
normally distributed with a standard deviation of σ = 75 miles.
– Suppose we plan to estimate the population mean number of miles driven per
week by US car owners using a random sample of size n = 100.
– What is the probability that our estimate will differ from the true value by
more than 10 miles?
•Denote the population mean by μ and the sample mean by X̄.
– We need to find P(|X̄ − μ| > 10).
– By symmetry of the normal distribution, it is
$$P(|\bar{X}-\mu| > 10) = 2\,P(\bar{X}-\mu > 10) = 2\,P\!\left(\frac{\bar{X}-\mu}{\sigma/\sqrt{n}} > \frac{10}{75/\sqrt{100}}\right) = 2\,P(Z > 1.33) = 0.1836.$$
Thus, the probability that our estimate will be off by more than 10 miles is 18.36%.
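The calculation in the miles-driven example can be checked with `pnorm`. The variable names are chosen here for illustration. Because R does not round z to 1.33 as a normal table does, its answer (about 0.182) differs slightly from the table-based 0.1836.

```r
sigma <- 75    # population standard deviation (miles)
n     <- 100   # sample size
tol   <- 10    # allowed estimation error (miles)

se <- sigma / sqrt(n)   # standard error of the sample mean: 7.5
z  <- tol / se          # 1.333...

p_off <- 2 * pnorm(-z)  # two-sided tail probability, unrounded z
p_off

2 * pnorm(-1.33)        # same calculation with z rounded as in a normal table
```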
• If the underlying population is not normal, what can be done?
Central Limit Theorem
• By the central limit theorem, regardless of the underlying
population, the distribution of the sample mean tends towards
N(μ, σ²/n) as n becomes large.
– If we accept the use of this approximation, we don't need to assume
that the number of miles driven per week in the example is normally
distributed (as long as our sample size n is large).
– This approximation is used repeatedly to assess the error in X̄ as an estimate of μ.
• How large should n be for the normal approximation to be
accurate?
– There is no simple answer (it depends on the underlying
distribution), but n ≥ 30 is a reasonable rule of thumb.
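One way to see the central limit theorem at work is to sample from a clearly non-normal population. The sketch below assumes an exponential population with rate 1 (mean 1, standard deviation 1) and n = 30; both choices are made here for illustration.

```r
set.seed(123)
n <- 30

# 10,000 sample means, each from n draws of a skewed exponential population.
xbar <- replicate(10000, mean(rexp(n, rate = 1)))

mean(xbar)   # close to the population mean, 1
sd(xbar)     # close to sigma / sqrt(n) = 1 / sqrt(30)

# Probability of missing the true mean by more than 0.3:
emp_tail  <- mean(abs(xbar - 1) > 0.3)          # simulated
norm_tail <- 2 * pnorm(-0.3 / (1 / sqrt(n)))    # normal approximation
c(simulated = emp_tail, normal_approx = norm_tail)
```

A histogram (`hist(xbar)`) shows a roughly bell-shaped distribution even though the population itself is strongly skewed.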
• If the underlying population is finite of size N, and if the
sample size n is not a small proportion of N, we use the
following finite population correction to the standard error:
$$\text{Std Error}(\bar{X}) = \frac{\sigma}{\sqrt{n}}\,\sqrt{\frac{N-n}{N-1}}.$$
Sampling Distribution of the Sample Proportion
•Consider estimating any of the following quantities:
– Proportion of voters who will vote for a third-party candidate in the
next election.
– Proportion of visits to a web site that result in a sale.
– Proportion of shoppers who prefer crunchy over creamy.
•In each of these examples, we are trying to estimate a
population proportion. Denote a generic population
proportion by the symbol p.
•Estimate a population proportion using a sample proportion.
– For example, if a poll surveys 1000 voters and finds that 85 of those
surveyed plan to vote for a third-party candidate, then the sample
proportion is 8.5%.
– The population proportion is what the poll would find if it could ask
every voter in the population.
– Denote the sample proportion by the symbol p̂.
– Once we have collected a random sample, the sample proportion p̂ is
known. We use it to estimate the true, unknown population
proportion p.
Estimating a proportion can be formulated as a special
case of estimating a population mean.
• Consider again the example of a poll of 1000 voters.
– Imagine encoding responses to a question about third-party candidates as
follows: for the ith person polled,
Xi = 1, if the ith person plans to vote for a third-party candidate;
Xi = 0, otherwise.
– Our random sample consists of X1,…, X1000. If 85 respondents indicated that
they would vote for a third-party candidate, then X1+…+ X1000 = 85; because
85 of the Xi 's are equal to 1 and all the rest are equal to 0.
– The sample proportion is just a special case of the sample mean.
• How good an estimate of the population proportion p is the sample
proportion? How effective are polls and surveys?
– By how much is the sample proportion likely to deviate from the true
population proportion p?
– This is measured by the standard deviation of the sample proportion (its
standard error).
– The standard error is √(p(1−p)/n); it is greatest when p = 0.5.
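The 0/1 encoding and the standard error formula can be sketched in R. The vector `x` reproduces the poll counts from the text (85 of 1000). The helper `se_prop` and the grid `p_grid` are names introduced here for illustration.

```r
# Encode the poll: 85 respondents coded 1, the remaining 915 coded 0.
x <- c(rep(1, 85), rep(0, 915))

p_hat <- mean(x)   # the sample proportion is the sample mean of the 0/1 values
p_hat

# Standard error of a sample proportion for a given true p and sample size n.
se_prop <- function(p, n) sqrt(p * (1 - p) / n)

# The standard error over a grid of p values peaks at p = 0.5.
p_grid <- seq(0.05, 0.95, by = 0.05)
p_peak <- p_grid[which.max(se_prop(p_grid, 1000))]
p_peak
```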
EXAMPLE
• Suppose that the true, unknown proportion p of voters who will
vote for a third-party candidate in the next election is 9%.
– What is the probability that a poll of 1000 voters will find a sample
proportion that differs from the true proportion by more than 2%?
• We need to find
$$P(|\hat{p}-p| > 0.02) = 2\,P\!\left(Z > \frac{0.02}{\sqrt{0.09(1-0.09)/1000}}\right) = 2\,P(Z > 2.21) \approx 0.027.$$
• We conclude that the probability that the poll will be off by more
than two percentage points is 0.027.
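The poll example can be reproduced in R. The normal-approximation part follows the calculation in the text. The exact binomial comparison is an addition made here as a sanity check and is not part of the original slides.

```r
p   <- 0.09   # assumed true proportion
n   <- 1000   # poll size
tol <- 0.02   # allowed error

se <- sqrt(p * (1 - p) / n)   # standard error of the sample proportion
z  <- tol / se                # about 2.21

p_off <- 2 * pnorm(-z)        # normal approximation
p_off

# Exact check: the count of "yes" answers is Binomial(n, p); being off by more
# than 2 points means fewer than 70 or more than 110 "yes" answers.
p_exact <- pbinom(69, n, p) + (1 - pbinom(110, n, p))
p_exact
```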
Confidence Intervals
•For the mean μ of a population, a 100(1−α)% CI is:
– When the population is normal and the SD σ is known:
x̄ ± z_{α/2} σ/√n, where z_{α/2} comes from the normal table.
– Reason: (X̄ − μ)/(σ/√n) ~ N(0,1), so P(|X̄ − μ| ≤ z_{α/2} σ/√n) = 1 − α.
– When the population is normal, σ is not known, but n is large (maybe > 50):
use the same formula with the sample standard deviation s in place of σ.
– When the population is not necessarily normal, but n is large (maybe > 50 to
100, depending on how close to normal the population is, or seems to be):
use the same formula with σ, if known, or with s if σ is not known.
•Summary: These intervals have probability approximately 1 − α of
containing the true value of µ.
Demonstration with R
•Take 1000 samples of size 200 from a Normal(µ=0, σ²=1)
population.
– Calculate a 95% CI for each sample.
– Check to see how many of these contain the true µ. Answer = ___.
Check that the percentage is approximately 1 − α.
x <- rnorm(200)                      # generate 200 standard normal random variables
xbar <- mean(x); s <- sd(x)          # sample mean and sample standard deviation
q95 <- qt(0.975, 199)                # 0.975 quantile of the t distribution, 199 df
                                     # (qnorm(0.975) is nearly identical at n = 200)
lower <- xbar - q95 * s / sqrt(200)  # CI limits; the standard error is s/sqrt(n)
upper <- xbar + q95 * s / sqrt(200)
contain <- if (lower <= 0 && upper >= 0) 1 else 0   # 1 if the CI contains the true mean 0
Demonstration with R
•Write a function that reports whether the confidence interval
contains the true mean.
demons <- function(nsize, conf) {
  x <- rnorm(nsize)                    # generate nsize standard normal random variables
  xbar <- mean(x); s <- sd(x)          # sample mean and sample standard deviation
  q <- qt((1 + conf) / 2, nsize - 1)   # quantile of the t distribution, nsize - 1 df
  lower <- xbar - q * s / sqrt(nsize)  # the standard error is s/sqrt(n)
  upper <- xbar + q * s / sqrt(nsize)
  if (lower <= 0 && upper >= 0) 1 else 0   # 1 if the CI contains the true mean 0
}
•Conduct a simulation study to check the validity of the
confidence interval based on the t-statistic.
nsimu <- 1000
contain <- numeric(nsimu)
for (i in 1:nsimu) contain[i] <- demons(200, 0.95)
mean(contain)                          # proportion of CIs that contain the true mean
Higher confidence (Good!) = Wider interval (Bad!)
• The only way to control both confidence and interval size is to
choose sufficiently large n.
• For confidence 100(1−α)% and width w we need w = 2 z_{α/2} σ/√n.
– If σ (or s) is not known, use your best guess (or preliminary data).
• Example. Fisher’s Iris data had n=50, s=3.5, and a 95% CI of
5.0±1.96× (3.5/√50) = 5.0 ± 0.97 = (4.03, 5.97). *
– This CI has width w=2×0.97=1.94.
– *The sample size 50, here, is on the borderline of what could be acceptable
for the use of this procedure. It would be (slightly) better to use the t-procedure discussed below.
• Suppose we want a CI of total width w = 0.5 (ignoring the data we
have already gathered). How large a sample size should we use?
– Our best guess for σ is 3.5. (We don't have any other information to give us a
better idea.)
– We should choose n ≈ (2×1.96×3.5/0.5)² ≈ 753.**
– **This value of n is large. If the answer to a question like this works out to be
a small n (suggesting use of a t-test) then it’s not really a valid answer – or, at
best, it should be thought of as only a very rough estimate.
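Solving w = 2 z σ/√n for n gives n = (2 z σ/w)², which can be wrapped in a small helper. The function name `n_for_width` and its arguments are hypothetical. It rounds up, since a sample size must be a whole number.

```r
# Sample size needed for a z-based confidence interval of total width w,
# given a guess sigma for the population standard deviation.
n_for_width <- function(w, sigma, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)   # e.g. about 1.96 for 95% confidence
  ceiling((2 * z * sigma / w)^2)
}

n_needed <- n_for_width(w = 0.5, sigma = 3.5)
n_needed   # 753, matching the calculation in the text
```

As the footnote in the text warns, the result is only trustworthy when it comes out large enough for the z-based interval to be appropriate.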
t-Interval
•When the population is normal but σ² is not known and n
is not large!
(p.s.: How do we tell whether the population is normal?)
•What we’ve done so far doesn’t work.
•Demonstration:
•Repeat the previous demonstration, but with 50,000 samples of
size 4 from an exponential distribution.
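The demonstration described above might be coded as follows. This is a sketch, not the original course code. The helper `coverage` is hypothetical, and the comparison with a normal population is added here to separate the effect of a small n from the effect of non-normality.

```r
set.seed(2024)

# Estimated coverage of nominal 95% intervals xbar +/- q * s / sqrt(n),
# using a normal quantile (z) and a t quantile (t) for q.
coverage <- function(rgen, mu, n = 4, reps = 50000, conf = 0.95) {
  x  <- matrix(rgen(n * reps), nrow = reps)     # one sample per row
  m  <- rowMeans(x)                             # sample means
  s  <- sqrt(rowSums((x - m)^2) / (n - 1))      # sample standard deviations
  se <- s / sqrt(n)
  zq <- qnorm((1 + conf) / 2)
  tq <- qt((1 + conf) / 2, df = n - 1)
  c(z = mean(abs(m - mu) <= zq * se),
    t = mean(abs(m - mu) <= tq * se))
}

cov_norm <- coverage(rnorm, mu = 0)   # normal population, true mean 0
cov_exp  <- coverage(rexp,  mu = 1)   # exponential(rate 1) population, true mean 1
rbind(normal = cov_norm, exponential = cov_exp)
```

With n = 4, the z-based interval falls well short of 95% coverage even for a normal population, while the t-based interval recovers it. For the skewed exponential population, the t-based interval also falls short.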
Bootstrap
• As a general term, bootstrapping describes any operation which
allows a system to generate itself from its own small well-defined
subsets (e.g. compilers, software to read tapes written in computer-independent form).
• The word is borrowed from the saying "pull yourself up by your own
bootstraps".
• In statistics, the bootstrap is a method allowing one to judge the
uncertainty of estimators obtained from small samples, without prior
assumptions about the underlying probability distributions.
–The method consists of forming many new samples of the same size as the
observed sample, by drawing a random selection of the original observations, i.e.
usually introducing some of the observations several times.
–The estimator under study (e.g. a mean, a correlation coefficient) is then formed
for every one of the samples thus generated, and will show a probability
distribution of its own.
–From this distribution, confidence limits can be given.
–For details, see B. Efron, Computers and the Theory of Statistics, SIAM Rev. 21
(1979) 460, or B. Efron, The Jackknife, the Bootstrap and Other Resampling Plans,
SIAM, Bristol, 1982.
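The resampling procedure described above can be sketched in a few lines of R. The data are simulated, and the number of resamples `B = 2000` is an arbitrary choice. The percentile interval used here is one simple way to read confidence limits off the bootstrap distribution, not necessarily the method the cited references recommend.

```r
set.seed(99)
x <- rexp(20, rate = 1 / 5)   # hypothetical small sample (n = 20)

B <- 2000                     # number of bootstrap resamples
# Each resample has the same size as x and is drawn with replacement,
# so some observations appear several times and others not at all.
boot_means <- replicate(B, mean(sample(x, size = length(x), replace = TRUE)))

se_boot <- sd(boot_means)                          # bootstrap standard error of the mean
ci_boot <- quantile(boot_means, c(0.025, 0.975))   # percentile confidence limits
se_boot
ci_boot
```

Replacing `mean` with another estimator (for example `median`) gives the bootstrap distribution of that estimator instead.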
Random Numbers
• Random numbers are particular occurrences of random variables.
They are used in Monte Carlo calculations, where three different
types may be distinguished according to the method used to
generate them:
– Truly random numbers are unpredictable in advance and can only be
generated by a physical process such as radioactive decay: in the presence of
radiation, a Geiger counter will record particles at time intervals that follow a
truly random (exponential) distribution.
– Pseudo random numbers are those most often used in Monte Carlo
calculations. They are generated by a numerical algorithm, and are therefore
predictable in principle, but appear to be truly random to someone who does
not know the algorithm.
– Quasi random numbers are also generated by a numerical algorithm, but are
not intended to appear to have the properties of a truly random sequence,
rather they are optimized to give the fastest convergence of the Monte Carlo
calculation.
Pseudo Random Numbers
• Generated in a digital computer by a numerical algorithm,
pseudorandom numbers are not random, but should appear to be
random when used in Monte Carlo calculations.
–The most widely used and best understood pseudorandom generator is the
Lehmer multiplicative congruential generator, in which each number r is
calculated as a function of the preceding number in the sequence:
r_i = (a·r_{i−1}) mod m   or   r_i = (a·r_{i−1} + c) mod m
where a and c are carefully chosen constants, and m is usually a power of two,
2^k.
–All quantities appearing in the formula (except m) are integers of k bits.
–The expression in brackets is an integer of length 2k bits, and the effect of the
modulo m is to mask off the most significant part of the result of the
multiplication.
–r0 is the seed of a generation sequence; many generators allow one to start
with a different seed for each run of a program, to avoid re-generating the
same sequence, or to preserve the seed at the end of one run for the beginning
of a subsequent one.
–Before being used in calculations, the ri are usually transformed to floating
point numbers normalized into the range [0,1].
• Generators of this type can be found which attain the
maximum possible period of 2^(k−2), and whose sequences
pass all reasonable tests of "randomness", provided one
does not exhaust more than a few percent of the full period.
– D.E. Knuth, The Art of Computer Programming, Addison-Wesley,
1981.
– A detailed discussion can be found in G. Marsaglia, A Current
View of Random Number Generators in Computer Science and
Statistics, Elsevier, Amsterdam, 1985.
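A toy congruential generator shows the recurrence in action. The constants a = 25173, c = 13849, and m = 2^16 are illustrative choices made here, not values from the text, and are far too small for serious Monte Carlo work. They are kept small so that a·r + c is computed exactly in R's double-precision arithmetic.

```r
# Toy mixed congruential generator: r_i = (a * r_{i-1} + c) mod m.
lcg <- function(n, seed, a = 25173, c = 13849, m = 2^16) {
  r <- numeric(n)
  state <- seed
  for (i in seq_len(n)) {
    state <- (a * state + c) %% m   # the modulo keeps only the low-order k bits
    r[i] <- state
  }
  r / m   # normalize to floating-point numbers in [0, 1)
}

u <- lcg(5, seed = 1)
u

identical(lcg(5, seed = 1), u)   # TRUE: the same seed regenerates the same sequence
```

With these constants (c odd, a − 1 divisible by 4), the generator visits all 2^16 states before repeating.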
Jackknife
• The jackknife is a method in statistics allowing one to judge the
uncertainties of estimators derived from small samples, without
assumptions about the underlying probability distributions.
• The method consists of forming new samples by
–omitting, in turn, one of the observations of the original sample.
–For each of the samples thus generated, the estimator under study can be
calculated, and the probability distribution thus obtained will allow one to draw
conclusions about the estimator's sensitivity to individual observations.
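A leave-one-out sketch in R follows. The data are simulated. The standard error formula used is the standard jackknife one, which is not stated in the text above and is included here as an assumption about how the leave-one-out values would typically be summarized.

```r
set.seed(5)
x <- rnorm(15, mean = 10, sd = 2)   # hypothetical small sample
n <- length(x)

theta_hat <- mean(x)                # estimator on the full sample

# Recompute the estimator n times, omitting one observation each time.
theta_loo <- sapply(seq_len(n), function(i) mean(x[-i]))

# Standard jackknife estimate of the standard error.
se_jack <- sqrt((n - 1) / n * sum((theta_loo - mean(theta_loo))^2))

c(jackknife = se_jack, formula = sd(x) / sqrt(n))   # these agree exactly for the mean
range(theta_loo - theta_hat)        # sensitivity of the estimate to single observations
```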