The Basic Idea
Statistics: interpretation of data, especially the analysis of population characteristics by
inference from sampling.
Population: the set of objects of interest in the study.
Random sample: a subset of the population.
Statistic: summarizes the information in a collection of data, i.e., the data sample.
Parameter: a numerical summary of the population.
Inferential statistics: predict population characteristics (parameters) based on sample
characteristics (statistics).
Sampling error: the error that occurs when a statistic based on a sample is used to predict
the value of a population parameter.
Distributions
Sample: frequency distribution
Population: probability distribution
Relative frequency for an interval is the proportion of sample observations that fall in
that interval.
Probability of an interval is the proportion of times that an observation would fall in that
interval in a long run of repeated observations.
The probability distribution for a variable lists the possible values of the variable
together with their probabilities.
Probability distributions are characterized by parameters such as the mean and standard
deviation. Letting $P(y_i)$ denote the probability of outcome $y_i$ for a variable, the mean
of the distribution is $\mu = \sum_i y_i P(y_i)$.
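As a small illustration of the formula for the mean of a probability distribution, the sketch below computes the mean and standard deviation of a discrete distribution. The distribution used (number of heads in two fair coin flips) and the function names are assumptions chosen for illustration; the standard deviation is computed as the square root of the probability-weighted squared deviations from the mean.

```python
import math

# Hypothetical example distribution: number of heads in two fair coin flips.
# Keys are the possible outcomes y_i, values are their probabilities P(y_i).
distribution = {0: 0.25, 1: 0.50, 2: 0.25}


def distribution_mean(dist):
    """Mean of a discrete distribution: sum over i of y_i * P(y_i)."""
    return sum(y * p for y, p in dist.items())


def distribution_std(dist):
    """Standard deviation: square root of sum over i of (y_i - mean)^2 * P(y_i)."""
    mu = distribution_mean(dist)
    variance = sum((y - mu) ** 2 * p for y, p in dist.items())
    return math.sqrt(variance)


mean = distribution_mean(distribution)
std = distribution_std(distribution)
print(mean, std)
```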
The normal distribution: a symmetric bell-shaped curve completely specified by two
parameters, the mean and the std. deviation.
Mean: the central tendency of a distribution, the average
Variance & standard deviation: measure the spread of a distribution about its central
point, i.e., deviation from the mean
The z-score of an observation measures the number of std. deviations that it falls from the
mean of the distribution. The z-score is positive when the observation falls above the mean
and negative when it falls below the mean.
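The z-score definition can be sketched in a few lines. The function name and the example numbers (a distribution with mean 100 and std. deviation 15) are hypothetical, used only to show the sign convention.

```python
def z_score(observation, mean, std_dev):
    """Number of std. deviations that an observation falls from the mean."""
    if std_dev <= 0:
        raise ValueError("std_dev must be positive")
    return (observation - mean) / std_dev


# Hypothetical distribution with mean 100 and std. deviation 15.
z_above = z_score(130, mean=100, std_dev=15)  # 2.0: two std. deviations above
z_below = z_score(85, mean=100, std_dev=15)   # -1.0: one std. deviation below
print(z_above, z_below)
```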
Sampling Distribution
Up to this point we have used a probability distribution to summarize the possible
outcomes of a variable. In practice, the population distributions of most variables are
unknown. Instead we use sample data to estimate characteristics of these distributions
such as their parameters.
When a sample statistic estimates a population parameter, the accuracy of that estimate
depends on sampling variation. A probability distribution that describes the variation
that occurs from repeatedly selecting samples of a certain size to form a particular
statistic is called a sampling distribution. In other words, a sampling distribution is a
probability distribution of the possible values of a sample statistic.
The Central Limit Theorem states that for large random samples, the sampling
distribution of the sample mean is approximately normal. This is true even if the
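population distribution is far from normal. The sampling distribution of the sample mean is
centered at the population mean, and its spread (the standard error) is the population std.
deviation divided by the square root of the sample size. In practice the standard error is
estimated by the sample std. deviation divided by the square root of the sample size.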
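A simulation can make the Central Limit Theorem concrete. The sketch below is only an illustration under assumed settings: the population is taken to be exponential with mean 1 and std. deviation 1 (clearly non-normal), and the sample size, number of repetitions, and random seed are arbitrary choices. It draws many samples, records each sample mean, and compares the center and spread of those means with the population mean and with the population std. deviation divided by the square root of the sample size.

```python
import math
import random
import statistics

random.seed(0)  # arbitrary seed, fixed so the run is reproducible

SAMPLE_SIZE = 50     # assumed size of each sample
NUM_SAMPLES = 5000   # assumed number of repeated samples
POP_MEAN = 1.0       # mean of an exponential population with rate 1
POP_STD = 1.0        # std. deviation of the same population


def sample_mean(n):
    """Mean of one random sample of size n from the skewed population."""
    return statistics.fmean(random.expovariate(1.0) for _ in range(n))


# Approximate the sampling distribution of the sample mean by simulation.
means = [sample_mean(SAMPLE_SIZE) for _ in range(NUM_SAMPLES)]

center = statistics.fmean(means)
spread = statistics.stdev(means)
expected_spread = POP_STD / math.sqrt(SAMPLE_SIZE)

# For an approximately normal sampling distribution, about 95% of the
# sample means should fall within 1.96 standard errors of the population mean.
within = sum(
    1 for m in means if abs(m - POP_MEAN) <= 1.96 * expected_spread
) / NUM_SAMPLES

print(center, spread, expected_spread, within)
```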
Estimation and Confidence Intervals
A point estimate of a population parameter is the value of the sample statistic that
predicts the value of that parameter. E.g. The sample mean is a point estimate of the
population mean.
We will use point estimates that are the maximum likelihood estimates of the population
parameters. A maximum likelihood estimate is the value of the parameter that is most
consistent with the observed data in the sense that if the parameter equaled that number,
the observed data would have the greatest chance of occurring.
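To illustrate the idea of maximum likelihood, the sketch below uses a hypothetical data set (7 successes in 10 independent trials) and searches a grid of candidate values for the success probability. The data, the grid, and the function names are assumptions made for illustration. The candidate under which the observed data have the greatest chance of occurring turns out to be the sample proportion.

```python
# Hypothetical observed data: 7 successes in 10 independent trials.
SUCCESSES = 7
TRIALS = 10


def likelihood(p, successes=SUCCESSES, trials=TRIALS):
    """Chance of the observed sequence of outcomes if the success probability is p."""
    return p ** successes * (1 - p) ** (trials - successes)


# Candidate parameter values 0.00, 0.01, ..., 1.00.
candidates = [i / 100 for i in range(101)]

# The maximum likelihood estimate is the candidate with the largest likelihood.
mle = max(candidates, key=likelihood)
sample_proportion = SUCCESSES / TRIALS
print(mle, sample_proportion)
```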
A confidence interval describes how close the point estimate is likely to be to the
population parameter. More precisely, the confidence interval for a parameter is the
range of numbers within which the parameter is believed to fall. The probability that the
confidence interval contains the parameter is called the confidence coefficient.
Example: large-sample confidence interval for the mean
Let $\bar{Y}$ denote the sample mean, $s$ the sample std. deviation, and $n$ the sample size.
90% confidence interval:
All we need to know is how many standard errors about the population mean will include 90%
of the sample means. On the standard normal curve, a total area of 0.90 is included between
z = -1.645 and z = 1.645.
[Figure: standard normal curve with area 0.90 between z = -1.645 and z = 1.645]
We call this value of z "$z_{.05}$" since the area of the tail to its right is .05 units:
90% confidence interval = $[\bar{Y} - 1.645\, s/\sqrt{n},\ \bar{Y} + 1.645\, s/\sqrt{n}]$
99% confidence interval:
Similarly, for the 99% confidence interval, a total area of 0.99 is included between
z = -2.576 and z = 2.576, and we obtain:
[Figure: standard normal curve with area 0.99 between z = -2.576 and z = 2.576]
99% confidence interval = $[\bar{Y} - 2.576\, s/\sqrt{n},\ \bar{Y} + 2.576\, s/\sqrt{n}]$
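The two intervals above can be computed with a short sketch. It assumes the interval has the form sample mean plus or minus z times the sample std. deviation divided by the square root of the sample size, and it obtains z from the standard normal distribution in Python's standard library rather than from a table. The function name and the example numbers (sample mean 50, sample std. deviation 10, sample size 100) are hypothetical.

```python
import math
from statistics import NormalDist


def large_sample_ci(sample_mean, sample_std, n, confidence):
    """Large-sample confidence interval for a population mean."""
    tail_area = (1 - confidence) / 2              # area in each tail
    z = NormalDist().inv_cdf(1 - tail_area)       # e.g. about 1.645 for 90%
    margin = z * sample_std / math.sqrt(n)        # z times the standard error
    return sample_mean - margin, sample_mean + margin


# Hypothetical sample: mean 50, std. deviation 10, size 100 (standard error 1).
ci_90 = large_sample_ci(50, 10, 100, 0.90)  # roughly (48.355, 51.645)
ci_99 = large_sample_ci(50, 10, 100, 0.99)  # roughly (47.424, 52.576)
print(ci_90, ci_99)
```

The 99% interval is wider than the 90% interval: greater confidence is bought with a less precise range.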
Significance Tests
Five elements of a significance test:
1. Assumptions:
a. type of data, form of population, method of sampling, sample size.
2. Hypotheses:
a. Null hypothesis, H0 (the parameter value being tested; the “no effect”
value)
b. Alternative hypothesis, Ha (alternative parameter values)
3. Test statistic
a. Compares the point estimate to the null hypothesis parameter value
4. P-value
a. Weight of evidence about H0; smaller P is more contrary to H0
5. Report P-value
The P-value is the probability, when H0 is true, of a test statistic value at least as
contrary to H0 as the value actually observed. The smaller the P-value, the more strongly
the data contradict H0.
Example: Significance test for a mean
1. Assumption: we have a random sample of size 30 or greater.
2. Hypotheses: H0: population mean equals y, where y is some number.
Ha: population mean does not equal y.
3. Test statistic: the sample mean, $\bar{Y}$, estimates the population mean. When the
sample size is 30 or greater, the distribution of the sample mean is approx. normal
about the population mean with standard error = sample std. deviation, $s$, divided by
the square root of the sample size, $n$.
The test statistic is the z-score: $z = (\bar{Y} - y) / (s / \sqrt{n})$
4. P-value: when H0 is true, the sampling distribution of $z = (\bar{Y} - y) / (s / \sqrt{n})$
is the standard normal distribution.
[Figure: distribution of sample means when there is no effect]
P = sum of the two tail probabilities beyond the observed z, i.e., $P = 2\,P(Z \ge |z|)$
5. Round the P-value to 1 to 3 significant digits before reporting.
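The steps of this example can be sketched as follows. This is a minimal illustration, assuming a two-sided alternative and a sample large enough for the normal approximation; the function name and the sample values (sample mean 52, sample std. deviation 10, sample size 100, null value 50) are hypothetical.

```python
import math
from statistics import NormalDist


def z_test_mean(sample_mean, sample_std, n, null_value):
    """Two-sided large-sample significance test for a population mean."""
    if n < 30:
        raise ValueError("the large-sample test assumes n of 30 or greater")
    standard_error = sample_std / math.sqrt(n)
    z = (sample_mean - null_value) / standard_error
    # P-value: sum of the two tail probabilities beyond the observed z.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical data: sample mean 52, std. deviation 10, n = 100; H0: mean = 50.
z, p_value = z_test_mean(52, 10, 100, null_value=50)

# Round the P-value to 3 significant digits before reporting.
reported_p = float(f"{p_value:.3g}")
print(z, reported_p)
```

Here the sample mean falls two standard errors above the null value, and the reported P-value of about 0.0455 is the chance of a z-score at least that far from zero in either direction when H0 is true.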