Download 01/22/2008

Normal Distribution
• The Normal Distribution is a density curve based on
the following formula. It is completely defined by two
parameters: the mean µ and the standard deviation σ.
$$ f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2\sigma^2}(x-\mu)^2}, \quad -\infty < x < \infty $$
• A density function describes the overall pattern of a
distribution. The total area under the curve is always
1.0.
• The normal distribution is symmetrical.
• The mean, median, and mode are all the same.
Normal Distribution
[Figure: "A Normal Density Curve", plotting f(x) against x, with µ, σ, and 2σ marked. The total area under the curve is 1.]
If we know µ and σ, we know everything about the normal
distribution.
Normal Distribution
The 68-95-99.7 Rule
In the normal distribution with mean µ and standard deviation σ:
About 68% of the observations fall within σ of the mean µ.
About 95% of the observations fall within 2σ of the mean µ.
About 99.7% of the observations fall within 3σ of the mean µ.
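The three percentages can be checked numerically. The following is a minimal sketch using Python's standard library; the helper name `prob_within` and the values µ = 10 and σ = 3 are illustrative assumptions, not part of the slides.

```python
from statistics import NormalDist

def prob_within(k, mu=0.0, sigma=1.0):
    """Probability that a normal observation falls within k*sigma of mu."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

# Hypothetical mean and standard deviation; the result depends only on k.
within = {k: prob_within(k, mu=10.0, sigma=3.0) for k in (1, 2, 3)}

for k, p in within.items():
    print(f"within {k} sigma of the mean: {p:.4f}")
```

The same three probabilities come out for any choice of µ and σ, which is why the rule can be stated once for every normal distribution.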
[Figure: "Normal Density Plot" of f(x) against x, with σ, 2σ, and 3σ marked on either side of the mean.]
Normal Density Plot
[Figure: Normal density function overlaid on a sample of 100 observations from a normal distribution with mean 0 and standard deviation 1, with the 68% and 95% regions marked. Axes: x and f(x).]
Normal Distribution
• Standardizing and z-Scores
If x is an observation from a distribution that has mean µ
and standard deviation σ, the standardized value of x is
$$ z = \frac{x - \mu}{\sigma}. $$
A standardized value is often called a z-score. If x is a
normal variable with mean µ and standard deviation σ,
then z is a standard normal variable with mean 0 and
standard deviation 1.
The density function of a standard normal variable z is
$$ f(z) = \frac{1}{\sqrt{2\pi}} e^{-\frac{z^2}{2}}, \quad -\infty < z < \infty $$
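Standardizing a value and evaluating the standard normal density can be sketched as follows. The function names and the numbers x = 14, µ = 10, σ = 2 are hypothetical, chosen only for illustration.

```python
import math
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Standardized value of x for a distribution with mean mu and sd sigma."""
    return (x - mu) / sigma

def standard_normal_density(z):
    """Density of a standard normal variable at z."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

# Hypothetical observation and parameters.
z = z_score(14.0, mu=10.0, sigma=2.0)
density_at_z = standard_normal_density(z)

# Cross-check against the library implementation of the same density.
library_density = NormalDist().pdf(z)
print(z, density_at_z, library_density)
```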
Normal Distribution
• Let x1, x2, …, xn be n independent normal random variables, each with
mean µ and standard deviation σ. Then their sum ∑xi is also normal,
with mean nµ and standard deviation σ√n. The distribution of the
mean x̄ is also normal, with mean µ and standard deviation σ/√n.
The standardized score of the mean x̄ is
$$ z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} $$
The mean of this standardized random variable is 0 and its
standard deviation is 1.
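A small simulation can illustrate that the sample mean has standard deviation σ/√n. This is only a sketch: the values µ = 10, σ = 2, n = 25, the number of repetitions, and the random seed are all assumptions made for the illustration.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the illustration is reproducible

# Hypothetical population parameters and sample size.
mu, sigma, n = 10.0, 2.0, 25
reps = 5000  # number of simulated samples

sample_means = []
for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    sample_means.append(statistics.fmean(sample))

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
theoretical_sd = sigma / math.sqrt(n)  # sigma / sqrt(n) = 0.4 here

print(f"mean of sample means: {mean_of_means:.3f} (theory: {mu})")
print(f"sd of sample means:   {sd_of_means:.3f} (theory: {theoretical_sd})")
```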
Assessing the normality of data
• Most statistical methods assume that the data come from a normal
population, so it is important to check the normality of the data.
• Normal quantile plots
If the points on a normal quantile plot lie close to a diagonal line, the plot
indicates that the data are normal. Otherwise, it indicates a departure from
normality. Points far away from the overall pattern indicate outliers. Minor
wiggles can be overlooked. We will see normal quantile plots in the next two
slides.
• The Shapiro-Wilk W statistic, the Kolmogorov-Smirnov (K-S) test, and
similar tests are used for testing the normality of the data.
• To perform a K-S test for normality in SPSS, choose Analyze > Nonparametric
Tests > 1 Sample K-S. Choose OK after selecting the variable(s).
• To perform the Shapiro-Wilk test of normality in R, use the following:
> x <- c(12, 14, 13, 11, 16, 18)   # input the data
> shapiro.test(x)                  # run the Shapiro-Wilk test of normality
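The points of a normal quantile plot can also be computed by hand. The sketch below uses Python's standard library on the six data values from the R example. The plotting position (i - 0.5)/n is one common convention and is an assumption here, as are the helper names; other software may use a slightly different formula.

```python
from math import sqrt
from statistics import NormalDist

def normal_qq_points(data):
    """Pair each sorted observation with a standard normal quantile."""
    ordered = sorted(data)
    n = len(ordered)
    std_normal = NormalDist()
    # Assumed plotting position (i - 0.5) / n for the i-th smallest value.
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))

def correlation(pairs):
    """Pearson correlation between the two coordinates of the points."""
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

x = [12, 14, 13, 11, 16, 18]  # data from the R example
points = normal_qq_points(x)
r = correlation(points)

for theoretical_q, sample_q in points:
    print(f"{theoretical_q:7.3f}  {sample_q}")
print(f"correlation of the q-q points: {r:.3f}")
```

A correlation close to 1 means the points lie close to a straight line. It is a rough numerical companion to the plot, not a replacement for a formal test such as Shapiro-Wilk.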
Normal quantile plot
[Figure: Normal q-q plot of 100 sample observations from a normal distribution with mean 0 and standard deviation 1. Axes: Theoretical Quantiles (horizontal) and Sample Quantiles (vertical).]
Normal quantile plot
[Figure: Normal q-q plot of Height of our Sample Data. Axes: Theoretical Quantiles (horizontal) and Sample Quantiles (vertical).]
This plot shows a minor deviation
from the diagonal line. We
can ignore these minor wiggles and
still assume that the data are
from a normal distribution.
Population and Sample
• Population: The entire collection of individuals or
measurements about which information is desired. For example, if we
want to know the average height of 5-year-old children in the USA, then
all 5-year-old children in the USA form our population.
• Sample: A subset of the population selected for study.
• Methods of sampling: Random sampling, stratified sampling,
systematic sampling, cluster sampling, multistage sampling,
area sampling, quota sampling, etc. We will discuss only random
sampling.
• Random Sample: A simple random sample of size n from a
population is a subset of n elements from that population,
chosen in such a way that every possible subset of size n (and
hence every unit of the population) has the same chance of being selected.
• Example: Consider a population of 5 numbers (1, 2, 3, 4, 5).
How many random samples (without replacement) of size 2 can
we draw from this population? There are 10:
(1,2), (1,3), (1,4), (1,5), (2,3), (2,4), (2,5), (3,4), (3,5),
(4,5)
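The list of possible samples can be generated mechanically. A minimal sketch with Python's standard library follows; the variable names are illustrative.

```python
from itertools import combinations
from math import comb

population = [1, 2, 3, 4, 5]
sample_size = 2

# Every subset of size 2, drawn without replacement, ignoring order.
samples = list(combinations(population, sample_size))

print(samples)
print(len(samples), "samples; the counting formula gives",
      comb(len(population), sample_size))
```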
Population and Sample
• Why do we need randomness in sampling?
It reduces the possibility of subjective and other
biases.
The mean and variance of a random sample are
unbiased estimates of the population mean and
variance, respectively.
• The population mean of the five numbers in the
previous slide is 3. The averages of the 10 samples of
size 2 are 1.5, 2, 2.5, 3, 2.5, 3, 3.5, 3.5, 4, and
4.5. The mean of these 10 averages is (1.5 + 2 + 2.5 +
3 + 2.5 + 3 + 3.5 + 3.5 + 4 + 4.5)/10 = 3, which
is the same as the population mean.
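The unbiasedness claim can be checked on the five-number population by averaging over all 10 possible samples. This is a sketch with hypothetical helper names. One assumption is worth stating: when sampling without replacement from a finite population, the sample variance (divisor n - 1) averages to the population variance computed with divisor N - 1, so that is the version used below.

```python
from itertools import combinations

def mean(values):
    return sum(values) / len(values)

def variance(values):
    """Variance with divisor (number of values - 1)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

population = [1, 2, 3, 4, 5]
samples = list(combinations(population, 2))

sample_means = [mean(s) for s in samples]
sample_variances = [variance(s) for s in samples]

mean_of_sample_means = mean(sample_means)
mean_of_sample_variances = mean(sample_variances)

print("population mean:", mean(population))
print("mean of the 10 sample means:", mean_of_sample_means)
print("population variance (divisor N - 1):", variance(population))
print("mean of the 10 sample variances:", mean_of_sample_variances)
```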
Parameter and Statistic
• Parameter: Any statistical characteristic of a
population. The population mean, population
median, and population standard deviation are
examples of parameters.
• Statistic: Any statistical characteristic of a
sample used as an estimate of a population
parameter, such as the sample mean, sample
median, or sample standard deviation.
• Statistical Issue: Describing the population
through a census, or making inferences from a
sample by estimating the value of the
parameter using a statistic.
Census and Inference
• Census: Complete enumeration of population units.
• Statistical Inference: We sample the population (in a
manner that ensures the sample correctly represents the
population), take measurements on our sample,
and then infer (or generalize) back to the population.
Example: We may want to know the average height of all
adults (over 18 years old) in the U.S. Our population is
then all adults over 18 years of age. If we were to take a census,
we would measure every adult and then compute the
average. By using statistics, we can take a random sample
of adults over 18 years of age, measure their average
height, and then infer that the average height of the total
population is "close to" the average height of our
sample.
Univariate, Bivariate, and
Multivariate Data
• Depending on how many variables we are
measuring on the individuals or objects in
our sample, we will have one of the three
following types of data sets:
• Univariate: Measurements made on only one
variable per observation.
• Bivariate: Measurements made on two variables
per observation.
• Multivariate: Measurements made on more than
two variables per observation.
Proportion
• Proportion: In many cases, it is appropriate to summarize a
group of independent observations by the number of
observations in the group that represent one of two outcomes.
• Consider a variable X with two outcomes, 1 and 0, for
an event happening and not happening, respectively.
Let p be the probability that the event happens; then
p = Prob(X=1).
• Suppose we want to estimate the proportion of the patients
coming to duPont who have some particular disease. To estimate
this (population) proportion, we need to take a sample of size
n and examine whether each patient has that particular disease.
Then the estimated proportion is
$$ \hat{p} = \frac{\text{Number of patients with that particular disease}}{\text{Sample size}} = \frac{X}{n} $$
Proportion
• For large n, the sampling distribution of p̂ is
approximately normal with mean p (the population proportion)
and standard deviation $\sqrt{p(1-p)/n}$.
• If the probability of an event happening is p, then the probability
of the same event not happening is 1-p, and the total
probability is 1.
• What is the difference between a proportion and a sample
mean?
If X takes two values, 0 or 1, and p is the proportion of
times the event happens (X=1), then the sample proportion is
the same as the sample mean. For example, the mean of
the data 1,0,1,1,0 is (1+0+1+1+0)/5 = 3/5, and
the proportion is also 3/5.
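The equality of the sample proportion and the sample mean can be seen directly on the five data values above. The sketch below also plugs p̂ into the standard deviation formula to get an estimated standard error; that plug-in step is an assumption of this illustration, and with n = 5 the normal approximation would not be reliable.

```python
import math

data = [1, 0, 1, 1, 0]  # the 0/1 data from the example
n = len(data)

# Proportion of observations where the event happened (X = 1).
p_hat = data.count(1) / n

# Ordinary sample mean of the same 0/1 values.
sample_mean = sum(data) / n

# Estimated standard error, replacing the unknown p by p_hat (assumption).
estimated_se = math.sqrt(p_hat * (1 - p_hat) / n)

print(p_hat, sample_mean, round(estimated_se, 4))
```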
Binomial Distribution
• Let us consider an experiment with two outcomes,
success (S) and failure (F), for each subject, and let the
experiment be done for n subjects. One of the possible
sequences of S and F can be arranged as follows:
SSFSFFFSSFS……F
where there are x successes out of n trials. Then the
probability distribution of x can be written as
$$ f(x) = \binom{n}{x} p^x (1-p)^{n-x}, \quad x = 0, 1, \ldots, n $$
where p = prob(S) and 1 − p = prob(F).
Binomial Distribution
• The mean and variance of x are np and
np(1-p), respectively.
• If p = 1/2, then the Binomial distribution is
symmetric.
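The mean, variance, and symmetry statements can be verified numerically from the probability function. A minimal sketch follows, in which n = 10 and p = 0.3 are hypothetical values and `binomial_pmf` is an illustrative helper name.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

# Hypothetical number of trials and success probability.
n, p = 10, 0.3
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]

total = sum(pmf)
mean = sum(x * f for x, f in enumerate(pmf))
variance = sum((x - mean) ** 2 * f for x, f in enumerate(pmf))

print(f"total probability: {total:.6f}")
print(f"mean: {mean:.4f} (np = {n * p})")
print(f"variance: {variance:.4f} (np(1-p) = {n * p * (1 - p):.4f})")

# With p = 1/2 the probabilities read the same forwards and backwards.
symmetric_pmf = [binomial_pmf(x, n, 0.5) for x in range(n + 1)]
```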
Test of hypothesis
• A statistical hypothesis is a statement or assertion about
the population distribution or a population parameter which
we want to verify on the basis of information available from a
sample.
• A test of hypothesis is a two-action decision problem: after the
sample values have been obtained, the two actions are the
acceptance or rejection of the hypothesis under consideration.
• Null Hypothesis: The statement being tested in a test of
significance. Usually the null hypothesis is a statement of "no
effect" or "no difference". H0 is the abbreviated form of the
null hypothesis.
• Alternative Hypothesis: The complement of the null hypothesis. Ha is
the abbreviated form of the alternative hypothesis.
Test of hypothesis
• Test statistic: The statistic used to test the significance of a
hypothesis.
• Two types of errors: Type I error and Type II error.
• Type I error: Rejecting H0 when it is true (Ha is false). Its
probability is denoted by α.
• Type II error: Accepting H0 when Ha is true. Its probability is
denoted by β.

                            Truth about the population
Decision based on sample    H0 true             Ha true
Reject H0                   Type I error        Correct decision
Accept H0                   Correct decision    Type II error
Test of hypothesis
• Our aim is to make inferences while controlling both type I and type II errors.
But a reduction in one results in an increase in the other, and the
consequence of a type I error is usually considered more severe than that of a type
II error. That is why we choose a test that minimizes the type II error
(maximizes the power of the test) while keeping the type I error at a fixed low
level.
• Level of Significance: The probability of a type I error is known as the
level of significance.
• Power (1-β): The probability of rejecting H0 when the alternative
hypothesis is true, at a fixed level of significance (α). The power of a test is
a function of the sample size and the parameter of interest. We calculate
power for a particular value of the parameter in the alternative hypothesis.
Increasing the sample size increases the power of a test.
• Increasing Power: If the power is too small, then we can do the following
to increase it:
  • Increase the sample size
  • Increase the significance level (α)
  • Reduce the variability
Test of hypothesis
• P-value: The probability, assuming H0 is true,
that the test statistic would take a value as
extreme or more extreme than that actually
observed is called the P-value of the test. The
smaller the P-value, the stronger the evidence
against H0 provided by the data.
• Statistical Significance: If the P-value is
as small as or smaller than α, we say that the test is
statistically significant at level α. A result is
statistically significant if it is unlikely to happen
by chance.
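As a rough sketch of how a P-value is obtained, the code below assumes a test statistic that is standard normal under H0 and a one-sided alternative in which large values count as extreme. The observed value z = 2.1 and the function names are hypothetical.

```python
from statistics import NormalDist

def upper_tail_p_value(z):
    """P(Z >= z) for a standard normal Z, i.e. a one-sided P-value."""
    return 1.0 - NormalDist().cdf(z)

def is_significant(p_value, alpha=0.05):
    """Statistically significant at level alpha if P-value <= alpha."""
    return p_value <= alpha

z_observed = 2.1  # hypothetical observed test statistic
p_value = upper_tail_p_value(z_observed)

print(f"P-value: {p_value:.4f}")
print("significant at 5%:", is_significant(p_value))
print("significant at 1%:", is_significant(p_value, alpha=0.01))
```

The same observed statistic can be significant at one level and not at another, which is why the level α should be fixed before looking at the data.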
Point estimation and Confidence
Interval
• Two types of parameter estimation:
• Point Estimation: A single value, calculated from sample data, that
estimates the parameter. Unbiasedness, consistency, efficiency,
and sufficiency are some of the criteria that should be satisfied by a
good estimator. The sample mean is a point estimate of the population
mean.
• Interval Estimation: A range calculated from sample data, along with
a confidence level that the population parameter lies within the range, is
called an interval estimate of the parameter.
• Confidence Interval: An interval computed from sample data that
contains the true value of the parameter with a certain level of
confidence. By a 95% confidence interval for a mean we mean that 95% of
the intervals computed from all samples of the same size will contain
the true population mean. A
confidence interval for the mean can be written as follows:
Confidence Interval
• Confidence Interval = Point Estimate ±
Margin of error
• Margin of error: The amount of allowed
sampling error.
• 95% Confidence Interval for the mean =
Sample mean ± z_{0.975} σ/√n (when σ is known) or
Sample mean ± t_{0.975, n-1} s/√n (when σ is estimated by s),
where s and σ are the sample and
population standard deviations,
respectively.
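The z-based version of the interval can be sketched with Python's standard library. The sample mean 50, σ = 4, and n = 16 are hypothetical numbers, and the function name is illustrative. The t-based version needs a t quantile, which the standard library does not provide, so it is not shown.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(sample_mean, sigma, n, level=0.95):
    """Confidence interval for a mean when the population sd is known."""
    # For level = 0.95 this is the 97.5th percentile, about 1.96.
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    margin_of_error = z * sigma / sqrt(n)
    return sample_mean - margin_of_error, sample_mean + margin_of_error

# Hypothetical sample mean, known population sd, and sample size.
low, high = z_confidence_interval(50.0, sigma=4.0, n=16)
print(f"95% confidence interval: ({low:.2f}, {high:.2f})")
```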
Sampling Distribution
• Standard Error: The standard deviation of a
statistic. Let x1, x2, …, xn be a sample of size n.
Then the standard deviation of the sample
mean is a standard error. It is written as σ/√n,
and s/√n is termed the estimated standard error of the
sample mean.
• Sampling distribution: The distribution of a
statistic.
• Commonly used statistics: The Z-statistic, t-statistic,
chi-square statistic, and F-statistic are some of the
commonly used statistics.
Z-Score
• Z-score: the statistic $z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$ is
distributed as standard normal with mean 0 and
standard deviation 1.
• Assumptions: The sample is large and the
population standard deviation is known.
• Applications:
1. Test of hypothesis of a single sample mean
2. Comparing means of two groups or populations.
Power Calculation and Sample Size
Detection
• Sample size calculation is an important part of a research
study. If the sample size is too small, even a well
designed study may fail to answer research questions,
for example by missing important effects or associations. If the sample
size is too large, the study will be expensive and difficult to
handle. Power analysis can provide us with an optimum
sample size that we need to answer the particular
research question with a certain level of accuracy.
• Calculation of sample size is a difficult and cumbersome
task, as it involves rigorous mathematical derivation.
There are many software packages for sample size calculation.
Some of them are PASS, Power and Precision, and PS.
These packages are not very costly, and some of them are free.
Example: Power calculation/
sample size detection
• A research team is planning a study to examine whether a 6-month
exercise program increases the total body bone mineral content
(TBBMC) of young women. Based on the results of a previous
study, they are willing to assume that σ = 2 for the percent
change in TBBMC over the 6-month period. A change in TBBMC
of 1% would be considered important, and the researchers would
like to have a reasonable chance of detecting a change this large
or larger. Are 25 subjects enough for this project?
• To answer the above question we need to calculate the power of the
test, which involves the following three steps:
  • State H0 and Ha, and the significance level α.
  • Find the values of the sample mean (xbar) that will lead us to reject H0.
  • Calculate the probability of observing these values of xbar when the
  alternative is true.
Example: Power calculation/
sample size detection
• Step 1:
H0: Mean percent change, µ = 0
Ha: µ > 0
A 5% level of significance (α) will be used.
• Step 2:
The z-test rejects H0 at α = 0.05 when z ≥ 1.645, i.e.
(xbar − 0)/(2/√25) ≥ 1.645,
so we reject H0 when xbar ≥ 0.658.
• Step 3:
$$ Prob(\bar{x} \ge 0.658 \text{ when } \mu = 1) = Prob\left(\frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \ge \frac{0.658 - 1}{2/\sqrt{25}}\right) = Prob(z \ge -0.855) = 0.80 $$
So with 25 subjects the test has about an 80% chance of detecting a 1% change in TBBMC.
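The three steps of this power calculation can be reproduced in a few lines. This is a sketch using Python's standard library, with the example's values (µ = 1 under the alternative, σ = 2, n = 25, α = 0.05) as defaults; the function name and its return values are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist

def one_sided_z_power(mu_alt, sigma, n, alpha=0.05, mu_null=0.0):
    """Power of the upper-tailed z-test of H0: mu = mu_null vs Ha: mu > mu_null."""
    std_normal = NormalDist()
    standard_error = sigma / sqrt(n)
    # Step 2: smallest sample mean that leads to rejecting H0.
    z_critical = std_normal.inv_cdf(1 - alpha)
    cutoff = mu_null + z_critical * standard_error
    # Step 3: probability of exceeding the cutoff when mu = mu_alt.
    power = 1 - std_normal.cdf((cutoff - mu_alt) / standard_error)
    return cutoff, power

cutoff, power = one_sided_z_power(mu_alt=1.0, sigma=2.0, n=25)
print(f"reject H0 when xbar >= {cutoff:.3f}")
print(f"power when mu = 1: {power:.2f}")

# Power for a few other sample sizes, for comparison.
for n in (10, 25, 50):
    print(n, round(one_sided_z_power(1.0, 2.0, n)[1], 2))
```

Calling the function with different values of n shows how the power grows with the sample size, which is the basis for choosing a sample size in advance.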
Useful Website(s)
• For the basic ideas and the statistical
definitions, the following website may be useful:
http://www.cas.lancs.ac.uk/glossary_v1.1/main.html