Download Introduction - The Department of Mathematics & Statistics

Stats 845
Applied Statistics
This Course will cover:
1. Regression
   – Non Linear Regression
   – Multiple Regression
2. Analysis of Variance and Experimental Design
The Emphasis will be on:
1. Learning Techniques through example.
2. Use of common statistical packages:
   • SPSS
   • Minitab
   • SAS
   • SPlus
What is Statistics?
It is the major mathematical tool of
scientific inference - the art of drawing
conclusions from data, where the data is to some
extent corrupted by some component of
random variation (random noise).
An analogy can be drawn between
data that is affected by
random components of
variation and signals that are
corrupted by noise.
Quite often sounds that are
heard or received by some
radio receiver can be thought
of as signals with
superimposed noise.
The objective in signal theory
is to extract the signal from
the received sound (i.e.
remove the noise to the
greatest extent possible). The
same is true in data analysis.
Example A:
Suppose we are comparing
the effect of three different
diets on weight loss.
An observation on weight loss
can be thought of as being
made up of two components:
1. A component due to the effect
of the diet being applied to the
subject (the signal)
2. A random component due to
other factors affecting weight
loss that are not considered (initial
weight of the subject, sex of the
subject, metabolic makeup of
the subject): the random noise.
Note:
that random assignment of
subjects to diets will ensure
that this component will be a
random effect.
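A minimal simulation sketch of Example A, showing an observation built as signal plus noise under random assignment. The diet effects, the noise level and the group sizes are hypothetical values chosen only for illustration; they do not come from the course.

```python
import random
import statistics

random.seed(1)

# Hypothetical diet effects (the "signal"), in kg of weight loss.
DIET_EFFECT = {"A": 2.0, "B": 4.0, "C": 6.0}
NOISE_SD = 1.5  # hypothetical spread of the factors not considered

# Randomly assign 300 subjects to the three diets, 100 per diet.
subjects = list(range(300))
random.shuffle(subjects)
assignment = {s: "ABC"[i % 3] for i, s in enumerate(subjects)}

# Observation = signal (diet effect) + random noise.
loss = {d: [] for d in DIET_EFFECT}
for s, d in assignment.items():
    loss[d].append(DIET_EFFECT[d] + random.gauss(0.0, NOISE_SD))

# Averaging within each diet group recovers most of the signal.
group_means = {d: statistics.mean(v) for d, v in loss.items()}
for d in sorted(group_means):
    print(d, round(group_means[d], 2))
```

The group means land close to the assumed diet effects because the noise averages out over the 100 subjects in each group.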
Example B
In this example we are again
comparing the effect of three diets on weight
gain. Subjects are randomly divided into three
groups. Diets are randomly distributed
amongst the groups. Measurements of weight
gain are taken at the following times:
- one month
- two months
- six months and
- one year
after commencement of the diet.
In addition to the factors Time and Diet
affecting weight gain, there are two random
sources of variation (noise):
- between subject variation and
- within subject variation
This can be illustrated in a schematic
fashion as follows:
Deterministic factors (Diet, Time) + Random Noise (within subject, between subject) → Response (weight gain)
The circle of Research
1. Questions arise about a phenomenon.
2. A decision is made to collect data.
3. A decision is made as to how to collect the data. (Statistics)
4. The data is collected.
5. The data is summarized and analyzed. (Statistics)
6. Conclusions are drawn from the analysis, which leads back to step 1.
Notice the two points on the
circle where statistics plays
an important role:
1. The analysis of the collected data.
2. The design of a data collection
procedure.
The analysis of the collected
data.
• This of course is the traditional use of statistics.
• Note that if the data collection procedure is well
thought out and well designed, the analysis step of
the research project will be straightforward.
• Usually experimental designs are chosen with the
statistical analysis already in mind.
• Thus the strategy for the analysis is usually
decided upon when any study is designed.
• It is a dangerous practice to select the form
of analysis after the data has been collected
(the choice may favour certain predetermined
conclusions and therefore result in a
considerable loss of objectivity).
• Sometimes however a decision to use a
specific type of analysis has to be made
after the data has been collected (because it was
overlooked at the design stage).
The design of a data collection
procedure
• The importance of statistics is quite
often ignored at this stage.
• It is important that the data collection
procedure will eventually result in
answers to the research questions,
• and that it will result in the most
accurate answers for the resources
available to the research team.
• Note that the success of a research
project should not depend on the
answers that it comes up with but on
the accuracy of the answers.
• Accuracy of this kind is usually an indicator of
a valuable research project.
Some definitions
important to Statistics
A population:
This is the complete collection of subjects
(objects) that are of interest in the study.
There may be (and frequently is) more
than one population, in which case a major objective
is that of comparison.
A case (elementary sampling
unit):
This is an individual unit (subject) of the
population.
A variable:
a measurement or type of measurement
that is made on each individual case in the
population.
Types of variables
Some variables may be measured on a
numerical scale while others are
measured on a categorical scale.
The nature of the variables has a great
influence on which analysis will be used.
For Variables measured on a numerical scale
the measurements will be numbers.
Ex: Age, Weight, Systolic Blood Pressure
For Variables measured on a categorical scale
the measurements will be categories.
Ex: Sex, Religion, Heart Disease
Types of variables
In addition some variables are labeled as
dependent variables and some variables
are labeled as independent variables.
This usually depends on the objectives of
the analysis.
Dependent variables are output or
response variables while the
independent variables are the input
variables or factors.
Usually one is interested in determining
equations that describe how the dependent
variables are affected by the independent
variables
A sample:
A subset of the population.
Types of Samples
Different types of samples are determined
by how the sample is selected.
Convenience Samples
In a convenience sample the subjects that
are most convenient to the researcher are
selected as objects in the sample.
This is not a very good procedure for
inferential Statistical Analysis but is
useful for exploratory preliminary work.
Quota samples
In quota samples subjects are chosen
conveniently until quotas are met for
different subgroups of the population.
This also is useful for exploratory
preliminary work.
Random Samples
Random samples of a given size are
selected in such a way that all possible samples
of that size have the same probability of
being selected.
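A small sketch of this definition, assuming a hypothetical population of five cases and samples of size two. It uses `random.sample` from Python's standard library and checks empirically that each of the possible samples turns up about equally often.

```python
import random
from collections import Counter
from itertools import combinations

population = ["a", "b", "c", "d", "e"]  # hypothetical population of 5 cases
size = 2

# All possible samples of the given size (order ignored): 10 of them.
all_samples = list(combinations(population, size))

random.seed(0)
draws = 20000
counts = Counter(
    tuple(sorted(random.sample(population, size))) for _ in range(draws)
)

# Each possible sample should appear in roughly 1/10 of the draws.
for sample in all_samples:
    print(sample, round(counts[sample] / draws, 3))
```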
Convenience Samples and Quota samples
are useful for preliminary studies. It is
however difficult to assess the accuracy
of estimates based on this type of
sampling scheme.
Sometimes however one has to be
satisfied with a convenience sample and
assume that it is equivalent to a random
sample.
A population statistic
(parameter):
Any quantity computed from the values
of variables for the entire population.
A sample statistic:
Any quantity computed from the values
of variables for the cases in the sample.
Statistical Decision Making
• Almost all problems in statistics
can be formulated as a problem of
making a decision.
• That is, given some data observed
from some phenomenon, a decision
will have to be made about the
phenomenon.
Decisions are generally broken
into two types:
• Estimation decisions
and
• Hypothesis Testing decisions.
Probability Theory plays a very
important role in these decisions
and the assessment of error made
by these decisions
Definition:
A random variable X is a
numerical quantity that is
determined by the outcome of a
random experiment
Example :
An individual is selected at
random from a population
and
X = the weight of the individual
The probability distribution of a
random variable (continuous) is
described by
its probability density curve f(x),
i.e. a curve which has the
following properties:
• 1. f(x) is never negative.
• 2. The total area under the curve f(x) is
one.
• 3. The area under the curve f(x) between
a and b is the probability that X lies
between the two values.
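The area properties can be checked numerically. The sketch below assumes a normal density with mean 50 and standard deviation 15 purely as an example curve, and approximates areas with a simple midpoint rule; the function names are illustrative.

```python
import math

def normal_pdf(x, mu=50.0, sigma=15.0):
    """Normal density; mu and sigma are illustrative values."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def area(f, a, b, steps=20000):
    """Approximate the area under f between a and b (midpoint rule)."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width

total = area(normal_pdf, -100.0, 200.0)     # effectively the whole curve
prob_40_60 = area(normal_pdf, 40.0, 60.0)   # P[40 <= X <= 60]
print(round(total, 4), round(prob_40_60, 4))
```

The first number is the total area (property 2) and the second is the probability that X lies between 40 and 60 (property 3).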
[Figure: a probability density curve f(x), plotted for x from 0 to 120]
Examples of some important
Univariate distributions
1. The Normal distribution
A common probability density curve is the “Normal”
density curve - symmetric and bell shaped.
Comment: If μ = 0 and σ = 1 the distribution is
called the standard normal distribution.
[Figure: normal density curves with μ = 50, σ = 15 and with μ = 70, σ = 20, plotted for x from 0 to 120]
$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
2. The Chi-squared distribution
with ν degrees of freedom
$$f(x) = \frac{1}{\Gamma(\nu/2)\,2^{\nu/2}}\, x^{(\nu-2)/2}\, e^{-x/2} \quad \text{if } x \ge 0$$
[Figure: a chi-squared density curve, plotted for x from 0 to 14]
Comment: If z1, z2, ..., zν are
independent random variables each
having a standard normal distribution
then
$$U = z_1^2 + z_2^2 + \cdots + z_\nu^2$$
has a chi-squared distribution with ν
degrees of freedom.
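A quick simulation sketch of this comment: sum the squares of independent standard normal draws and compare the result with the known mean (the degrees of freedom) and variance (twice the degrees of freedom) of a chi-squared distribution. The choice of 5 degrees of freedom and 20000 replications is arbitrary.

```python
import random
import statistics

random.seed(2)
n = 5          # degrees of freedom (illustrative)
reps = 20000   # number of simulated values of U

# U = z1^2 + ... + zn^2 for independent standard normal z's.
u_values = [
    sum(random.gauss(0.0, 1.0) ** 2 for _ in range(n)) for _ in range(reps)
]

u_mean = statistics.mean(u_values)
u_var = statistics.variance(u_values)
print(round(u_mean, 2), round(u_var, 2))  # theory: mean n, variance 2n
```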
3. The F distribution with
ν1 degrees of freedom in the
numerator and ν2 degrees of
freedom in the denominator
$$f(x) = K\, x^{(\nu_1-2)/2} \left(1 + \frac{\nu_1}{\nu_2}\,x\right)^{-(\nu_1+\nu_2)/2} \quad \text{if } x \ge 0$$
where
$$K = \frac{\Gamma\!\left(\frac{\nu_1+\nu_2}{2}\right)}{\Gamma\!\left(\frac{\nu_1}{2}\right)\Gamma\!\left(\frac{\nu_2}{2}\right)} \left(\frac{\nu_1}{\nu_2}\right)^{\nu_1/2}$$
[Figure: an F density curve, plotted for x from 0 to 6]
Comment: If U1 and U2 are independent
random variables having Chi-squared
distributions with ν1 and ν2 degrees of
freedom respectively then
$$F = \frac{U_1/\nu_1}{U_2/\nu_2}$$
has an F distribution with ν1 degrees of
freedom in the numerator and ν2 degrees of
freedom in the denominator.
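4. The t distribution with ν
degrees of freedom
$$f(x) = K \left(1 + \frac{x^2}{\nu}\right)^{-(\nu+1)/2}$$
where
$$K = \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right)\sqrt{\pi\nu}}$$
[Figure: a t density curve, plotted for x from -4 to 4]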
Comment: If z and U are independent
random variables, and z has a standard
Normal distribution while U has a Chi-squared
distribution with ν degrees of
freedom then
$$t = \frac{z}{\sqrt{U/\nu}}$$
has a t distribution with ν degrees of
freedom.
• An Applet showing critical values and tail
probabilities for various distributions:
1. Standard Normal
2. T distribution
3. Chi-square distribution
4. Gamma distribution
5. F distribution
The Sampling distribution
of a statistic
A random sample from a probability
distribution with density function
f(x) is a collection of n independent
random variables, x1, x2, ..., xn, each with the
probability distribution described by
f(x).
If for example we collect a random
sample of individuals from a population
and
– measure some variable X for each of
those individuals,
– the n measurements x1, x2, ...,xn will
form a set of n independent random
variables with a probability distribution
equivalent to the distribution of X across
the population.
A statistic T is any quantity
computed from the random
observations x1, x2, ...,xn.
• Any statistic will necessarily be
also a random variable and
therefore will have a probability
distribution described by some
probability density function fT(t).
• This distribution is called the
sampling distribution of the
statistic T.
• This distribution is very important if one is
using this statistic in a statistical analysis.
• It is used to assess the accuracy of a
statistic if it is used as an estimator.
• It is used to determine thresholds for
acceptance and rejection if it is used for
Hypothesis testing.
Some examples of Sampling
distributions of statistics
Distribution of the sample mean for a
sample from a Normal population
Let x1, x2, ..., xn be a sample from a normal
population with mean μ and standard
deviation σ.
Let
$$\bar{x} = \frac{\sum_i x_i}{n}$$
Then $\bar{x}$ has a normal sampling distribution with mean
$$\mu_{\bar{x}} = \mu$$
and standard deviation
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$
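The result for the sample mean can be illustrated by simulation. The sketch below assumes a normal population with mean 50 and standard deviation 15 and samples of size 25 (all illustrative values), so the sample means should centre on 50 with a standard deviation near 15/√25 = 3.

```python
import math
import random
import statistics

random.seed(3)
mu, sigma, n = 50.0, 15.0, 25   # illustrative population values and sample size
reps = 10000                    # number of simulated samples

# Draw many samples of size n and record each sample mean.
means = [
    statistics.mean([random.gauss(mu, sigma) for _ in range(n)])
    for _ in range(reps)
]

mean_of_means = statistics.mean(means)
sd_of_means = statistics.stdev(means)
print(round(mean_of_means, 2), round(sd_of_means, 2))
print("theory:", mu, sigma / math.sqrt(n))
```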
Distribution of the z statistic
Let x1, x2, ..., xn be a sample from a normal
population with mean μ and standard deviation σ.
Let
$$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$$
Then z has a standard normal distribution.
Comment:
Many statistics T have a normal distribution
with mean $\mu_T$ and standard deviation $\sigma_T$.
Then
$$z = \frac{T - \mu_T}{\sigma_T}$$
will have a standard normal distribution.
Distribution of the χ² statistic for
sample variance
Let x1, x2, ..., xn be a sample from a normal
population with mean μ and standard deviation σ.
Let
$$s^2 = \frac{\sum_i (x_i - \bar{x})^2}{n-1} \quad \text{(sample variance)}$$
and
$$s = \sqrt{\frac{\sum_i (x_i - \bar{x})^2}{n-1}} \quad \text{(sample standard deviation)}$$
Let
$$\chi^2 = \frac{\sum_i (x_i - \bar{x})^2}{\sigma^2} = \frac{(n-1)s^2}{\sigma^2}$$
Then χ² has a chi-squared distribution with ν
= n - 1 degrees of freedom.
[Figure: the chi-squared distribution, plotted for x from 0 to 24]
Distribution of the t statistic
Let x1, x2, ..., xn be a sample from a normal
population with mean μ and standard deviation σ.
Let
$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$
Then t has Student’s t distribution with ν = n - 1
degrees of freedom.
Comment:
If an estimator T has a normal distribution with
mean $\mu_T$ and standard deviation $\sigma_T$, and
if $s_T$ is an estimator of $\sigma_T$ based on ν degrees of
freedom, then
$$t = \frac{T - \mu_T}{s_T}$$
will have Student’s t distribution with ν degrees of
freedom.
[Figure: the t distribution compared with the standard normal distribution]
Point estimation
• A statistic T is called an estimator of the
parameter θ if its value is used as an
estimate of the parameter θ.
• The performance of an estimator T will be
determined by how “close” the sampling
distribution of T is to the parameter, θ,
being estimated.
• An estimator T is called an unbiased
estimator of θ if $\mu_T$, the mean of the
sampling distribution of T, satisfies $\mu_T = \theta$.
• This implies that in the long run the average
value of T is θ.
• An estimator T is called the Minimum
Variance Unbiased estimator of θ if T is an
unbiased estimator and it has the smallest
standard error $\sigma_T$ amongst all unbiased
estimators of θ.
• If the sampling distribution of T is normal,
the standard error of T is extremely
important. It completely describes the
variability of the estimator T.
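Unbiasedness can be illustrated with the sample variance. The sketch below, under assumed values (true variance 4, samples of size 5), compares the estimator that divides by n - 1 with the one that divides by n; only the first should average out to the true variance in the long run.

```python
import random
import statistics

random.seed(4)
sigma2 = 4.0          # true population variance (illustrative)
n, reps = 5, 40000    # sample size and number of simulated samples

unbiased, biased = [], []
for _ in range(reps):
    x = [random.gauss(0.0, sigma2 ** 0.5) for _ in range(n)]
    unbiased.append(statistics.variance(x))   # divides by n - 1
    biased.append(statistics.pvariance(x))    # divides by n

avg_unbiased = statistics.mean(unbiased)
avg_biased = statistics.mean(biased)
print(round(avg_unbiased, 2), round(avg_biased, 2))
# theory: sigma2 = 4.0 for the first, (n - 1)/n * sigma2 = 3.2 for the second
```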
Interval Estimation
(confidence intervals)
• Point estimators give only single values as
an estimate. There is no indication of the
accuracy of the estimate.
• The accuracy can sometimes be measured
and shown by displaying the standard error
of the estimate.
• There is however a better way:
using the idea of confidence interval
estimates.
• The unknown parameter is estimated with a
range of values that has a given probability
of capturing the parameter being estimated.
Confidence Intervals
• The interval $T_L$ to $T_U$ is called a (1 - α) ×
100% confidence interval for the parameter
θ, if the probability that θ lies in the range
$T_L$ to $T_U$ is equal to 1 - α.
• Here $T_L$ and $T_U$ are
– statistics
– random numerical quantities calculated from
the data.
Examples
Confidence interval for the mean of a Normal population
(based on the z statistic).
$$T_L = \bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \quad \text{to} \quad T_U = \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$
is a (1 - α) × 100% confidence interval for μ, the mean of a
normal population.
Here $z_{\alpha/2}$ is the upper α/2 × 100% percentage point of the
standard normal distribution.
More generally if T is an unbiased estimator of the parameter
θ and has a normal sampling distribution with known
standard error $\sigma_T$ then
$$T_L = T - z_{\alpha/2}\,\sigma_T \quad \text{to} \quad T_U = T + z_{\alpha/2}\,\sigma_T$$
is a (1 - α) × 100% confidence interval for θ.
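As a sketch of how the z-based interval for a mean might be computed, the function below uses `NormalDist` from Python's standard library to look up the upper percentage point. The function name and the numbers in the example (sample mean 52.3, known standard deviation 15, sample size 36) are hypothetical.

```python
import math
from statistics import NormalDist

def z_interval(xbar, sigma, n, alpha=0.05):
    """(1 - alpha) x 100% confidence interval for the mean, sigma known."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)   # upper alpha/2 point
    half_width = z * sigma / math.sqrt(n)
    return xbar - half_width, xbar + half_width

# Hypothetical data summary: sample mean 52.3, sigma = 15, n = 36.
lower, upper = z_interval(52.3, 15.0, 36)
print(round(lower, 2), round(upper, 2))
```

With these inputs the half-width is about 1.96 × 15/6 = 4.9, so the interval runs from roughly 47.4 to 57.2.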
Confidence interval for the mean of a Normal
population
(based on the t statistic).
$$T_L = \bar{x} - t_{\alpha/2}\frac{s}{\sqrt{n}} \quad \text{to} \quad T_U = \bar{x} + t_{\alpha/2}\frac{s}{\sqrt{n}}$$
is a (1 - α) × 100% confidence interval for μ, the
mean of a normal population.
Here $t_{\alpha/2}$ is the upper α/2 × 100% percentage point
of the Student’s t distribution with ν = n - 1 degrees of
freedom.
More generally if T is an unbiased estimator of the parameter
θ and has a normal sampling distribution with estimated
standard error $s_T$, based on ν degrees of freedom, then
$$T_L = T - t_{\alpha/2}\,s_T \quad \text{to} \quad T_U = T + t_{\alpha/2}\,s_T$$
is a (1 - α) × 100% confidence interval for θ.
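Common Confidence intervals
1. Situation: Sample from the Normal distribution with unknown mean and known variance (Estimating μ) (n large)
   Confidence interval: $\bar{x} \pm z_{\alpha/2}\,\dfrac{\sigma_0}{\sqrt{n}}$
2. Situation: Sample from the Normal distribution with unknown mean and unknown variance (Estimating μ) (n small)
   Confidence interval: $\bar{x} \pm t_{\alpha/2}\,\dfrac{s}{\sqrt{n}}$
3. Situation: Estimation of a binomial probability p
   Confidence interval: $\hat{p} \pm z_{\alpha/2}\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$
4. Situation: Two independent samples from the Normal distribution with unknown means and known variances (Estimating μ1 - μ2) (n, m large)
   Confidence interval: $\bar{x} - \bar{y} \pm z_{\alpha/2}\sqrt{\dfrac{\sigma_x^2}{n} + \dfrac{\sigma_y^2}{m}}$
5. Situation: Two independent samples from the Normal distribution with unknown means and unknown but equal variances (Estimating μ1 - μ2) (n, m small)
   Confidence interval: $\bar{x} - \bar{y} \pm t_{\alpha/2}\, s_{Pooled}\sqrt{\dfrac{1}{n} + \dfrac{1}{m}}$
6. Situation: Estimation of the difference between two binomial probabilities, p1 - p2
   Confidence interval: $\hat{p}_1 - \hat{p}_2 \pm z_{\alpha/2}\sqrt{\dfrac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \dfrac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$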
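One of the intervals listed above, the one for a binomial probability p, can be sketched as follows. The function name and the counts (120 successes in 400 trials) are made up for illustration, and the normal approximation behind the formula is assumed to be adequate for these counts.

```python
import math
from statistics import NormalDist

def proportion_interval(successes, n, alpha=0.05):
    """Approximate (1 - alpha) x 100% confidence interval for a binomial p."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)   # upper alpha/2 point
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

# Hypothetical data: 120 successes out of 400 trials.
p_lower, p_upper = proportion_interval(120, 400)
print(round(p_lower, 3), round(p_upper, 3))
```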
Multiple Confidence intervals
In many situations one is interested in estimating not
only a single parameter, θ, but a collection of
parameters, θ1, θ2, θ3, ... .
A collection of intervals, $T_{L1}$ to $T_{U1}$, $T_{L2}$ to $T_{U2}$, $T_{L3}$
to $T_{U3}$, ... is called a set of (1 - α) × 100% multiple
confidence intervals if the probability that all the
intervals capture their respective parameters is 1 - α.
Hypothesis Testing
• Another important area of statistical
inference is that of Hypothesis Testing.
• In this situation one has a statement
(Hypothesis) about the parameter(s) of the
distributions being sampled and one is
interested in deciding whether the statement
is true or false.
• In fact there are two hypotheses
– The Null Hypothesis (H0) and
– the Alternative Hypothesis (HA).
• A decision will be made either to
– Accept H0 (Reject HA) or to
– Reject H0 (Accept HA).
• The following table gives the different
possibilities for the decision and the
different possibilities for the correctness of
the decision
             H0 is true          H0 is false
Accept H0    Correct Decision    Type II error
Reject H0    Type I error        Correct Decision
• Type I error - The Null Hypothesis H0 is
rejected when it is true.
• The probability that a decision procedure
makes a type I error is denoted by α, and is
sometimes called the significance level of
the test.
• Common significance levels that are used
are α = .05 and α = .01
• Type II error - The Null Hypothesis H0 is
accepted when it is false.
• The probability that a decision procedure
makes a type II error is denoted by β.
• The probability 1 - β is called the Power of
the test and is the probability that the
decision procedure correctly rejects a false
Null Hypothesis.
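The two error rates can be estimated by simulation. The sketch below assumes a two-sided z test of a normal mean with illustrative settings (hypothesized mean 50, known standard deviation 15, sample size 25, significance level 0.05) and a hypothetical true mean of 56 when H0 is false.

```python
import math
import random
from statistics import NormalDist, mean

random.seed(5)
mu0, sigma, n, alpha = 50.0, 15.0, 25, 0.05   # illustrative settings
z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)

def rejects(true_mu):
    """Simulate one sample; report whether the two-sided z test rejects H0."""
    xbar = mean([random.gauss(true_mu, sigma) for _ in range(n)])
    z = math.sqrt(n) * (xbar - mu0) / sigma
    return abs(z) > z_crit

reps = 10000
# H0 true: the rejection rate estimates the type I error probability.
type1_rate = sum(rejects(mu0) for _ in range(reps)) / reps
# H0 false (true mean 56): the rejection rate estimates the power.
power = sum(rejects(mu0 + 6.0) for _ in range(reps)) / reps
print(round(type1_rate, 3), round(power, 3), round(1.0 - power, 3))
```

The first value should sit near 0.05; the last is the estimated type II error probability for this particular alternative.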
A statistical test is defined by
• 1. Choosing a statistic for making the
decision to Accept or Reject H0. This
statistic is called the test statistic.
• 2. Dividing the set of possible values of
the test statistic into two regions - an
Acceptance Region and a Critical Region.
• If upon collection of the data and evaluation
of the test statistic, its value lies in the
Acceptance Region, a decision is made to
accept the Null Hypothesis H0.
• If upon collection of the data and evaluation
of the test statistic, its value lies in the
Critical Region, a decision is made to reject
the Null Hypothesis H0.
• The probability of a type I error, α, is
usually set at a predefined level by choosing
the critical thresholds (boundaries between
the Acceptance and Critical Regions)
appropriately.
• The probability of a type II error, β, is
decreased (and the power of the test, 1 - β,
is increased) by
1. Choosing the “best” test statistic.
2. Selecting the most efficient experimental
design.
3. Increasing the amount of information
(usually by increasing the sample sizes
involved) on which the decision is based.
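The effect of sample size on power can be computed directly for the two-sided z test. This is a sketch under assumed values (true mean 6 units above the hypothesized mean, known standard deviation 15); the function name is illustrative.

```python
import math
from statistics import NormalDist

def z_test_power(delta, sigma, n, alpha=0.05):
    """Power of the two-sided z test when the true mean is mu0 + delta."""
    std = NormalDist()
    z_crit = std.inv_cdf(1.0 - alpha / 2.0)
    shift = math.sqrt(n) * delta / sigma   # mean of z under the alternative
    # P[z > z_crit] + P[z < -z_crit] when z is normal with mean `shift`, sd 1.
    return (1.0 - std.cdf(z_crit - shift)) + std.cdf(-z_crit - shift)

powers = {n: z_test_power(delta=6.0, sigma=15.0, n=n) for n in (10, 25, 50, 100)}
for n, p in powers.items():
    print(n, round(p, 3))
```

The significance level stays fixed at 0.05 throughout, while the power rises as the sample size grows.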
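Some common Tests
1. Situation: Sample from the Normal distribution with unknown mean and known variance (Testing μ) (n large)
   Test statistic: $z = \dfrac{\sqrt{n}\,(\bar{x} - \mu_0)}{\sigma}$
   H0: $\mu = \mu_0$
   HA: $\mu \ne \mu_0$, Critical Region: $z < -z_{\alpha/2}$ or $z > z_{\alpha/2}$
   HA: $\mu > \mu_0$, Critical Region: $z > z_{\alpha}$
   HA: $\mu < \mu_0$, Critical Region: $z < -z_{\alpha}$
2. Situation: Sample from the Normal distribution with unknown mean and unknown variance (Testing μ) (n small)
   Test statistic: $t = \dfrac{\sqrt{n}\,(\bar{x} - \mu_0)}{s}$
   H0: $\mu = \mu_0$
   HA: $\mu \ne \mu_0$, Critical Region: $t < -t_{\alpha/2}$ or $t > t_{\alpha/2}$
   HA: $\mu > \mu_0$, Critical Region: $t > t_{\alpha}$
   HA: $\mu < \mu_0$, Critical Region: $t < -t_{\alpha}$
3. Situation: Testing of a binomial probability p
   Test statistic: $z = \dfrac{\hat{p} - p_0}{\sqrt{\dfrac{p_0(1-p_0)}{n}}}$
   H0: $p = p_0$
   HA: $p \ne p_0$, Critical Region: $z < -z_{\alpha/2}$ or $z > z_{\alpha/2}$
   HA: $p > p_0$, Critical Region: $z > z_{\alpha}$
   HA: $p < p_0$, Critical Region: $z < -z_{\alpha}$
4. Situation: Two independent samples from the Normal distribution with unknown means and known variances (Testing μ1 - μ2) (n, m large)
   Test statistic: $z = \dfrac{\bar{x} - \bar{y}}{\sqrt{\dfrac{\sigma_x^2}{n} + \dfrac{\sigma_y^2}{m}}}$
   H0: $\mu_1 = \mu_2$
   HA: $\mu_1 \ne \mu_2$, Critical Region: $z < -z_{\alpha/2}$ or $z > z_{\alpha/2}$
   HA: $\mu_1 > \mu_2$, Critical Region: $z > z_{\alpha}$
   HA: $\mu_1 < \mu_2$, Critical Region: $z < -z_{\alpha}$
5. Situation: Two independent samples from the Normal distribution with unknown means and unknown but equal variances (Testing μ1 - μ2)
   Test statistic: $t = \dfrac{\bar{x} - \bar{y}}{s_{Pooled}\sqrt{\dfrac{1}{n} + \dfrac{1}{m}}}$
   H0: $\mu_1 = \mu_2$
   HA: $\mu_1 \ne \mu_2$, Critical Region: $t < -t_{\alpha/2}$ or $t > t_{\alpha/2}$
   HA: $\mu_1 > \mu_2$, Critical Region: $t > t_{\alpha}$
   HA: $\mu_1 < \mu_2$, Critical Region: $t < -t_{\alpha}$
6. Situation: Testing the difference between two binomial probabilities, p1 - p2
   Test statistic: $z = \dfrac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$, where $\hat{p}$ is the pooled proportion from both samples
   H0: $p_1 = p_2$
   HA: $p_1 \ne p_2$, Critical Region: $z < -z_{\alpha/2}$ or $z > z_{\alpha/2}$
   HA: $p_1 > p_2$, Critical Region: $z > z_{\alpha}$
   HA: $p_1 < p_2$, Critical Region: $z < -z_{\alpha}$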
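As one worked instance of the tests listed above, here is a sketch of the z test for the difference between two binomial probabilities. It assumes the unsubscripted proportion in the test statistic is the pooled proportion from both samples; the function name and the counts are hypothetical.

```python
import math
from statistics import NormalDist

def two_proportion_z(x1, n1, x2, n2):
    """z statistic for H0: p1 = p2, using the pooled sample proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)   # assumed meaning of p-hat without subscript
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    return (p1 - p2) / se

# Hypothetical data: 60 of 200 successes in sample 1, 45 of 200 in sample 2.
z_stat = two_proportion_z(60, 200, 45, 200)
z_crit = NormalDist().inv_cdf(1.0 - 0.05 / 2.0)   # two-tailed test, alpha = 0.05
reject_h0 = abs(z_stat) > z_crit
print(round(z_stat, 3), round(z_crit, 3), reject_h0)
```

For these counts the statistic falls inside the acceptance region of the two-tailed test at the 0.05 level.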
The p-value approach to
Hypothesis Testing
In hypothesis testing we need
1. A test statistic
2. A Critical and Acceptance region
for the test statistic
The Critical Region is set up under the
sampling distribution of the test statistic.
Area = α (0.05 or 0.01) above the critical
region. The critical region may be one tailed or
two tailed.
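The Critical region:
[Figure: sampling distribution of z with an area of α/2 in each tail; Reject H0 below $-z_{\alpha/2}$, Accept H0 between $-z_{\alpha/2}$ and $z_{\alpha/2}$, Reject H0 above $z_{\alpha/2}$]
$$P[\text{Accept } H_0 \text{ when true}] = P[-z_{\alpha/2} \le z \le z_{\alpha/2}] = 1 - \alpha$$
$$P[\text{Reject } H_0 \text{ when true}] = P[z < -z_{\alpha/2} \text{ or } z > z_{\alpha/2}] = \alpha$$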
The test is carried out by
1. Computing the value of the test
statistic
2. Making the decision
a. Reject if the value is in the Critical
region and
b. Accept if the value is in the
Acceptance region.
The value of the test statistic may be in the
Acceptance region but close to being in the
Critical region, or
it may be in the Critical region but close to
being in the Acceptance region.
To measure this we compute the p-value.
Definition – Once the test statistic has been
computed from the data the p-value is defined
to be:
p-value = P[the test statistic is as or more
extreme than the observed value of
the test statistic]
where “more extreme” means giving stronger evidence for
rejecting H0.
Example – Suppose we are using the z-test for the
mean μ of a normal population and α = 0.05.
$z_{0.025}$ = 1.960
Thus the critical region is to reject H0 if
z < -1.960 or z > 1.960.
Suppose z = 2.3; then we reject H0.
p-value = P[the test statistic is as or more extreme than
the observed value of the test statistic]
= P[z > 2.3] + P[z < -2.3]
= 0.0107 + 0.0107 = 0.0214
[Figure: the p-value shown as the two tail areas below -2.3 and above 2.3]
If the value of z = 1.2, then we accept H0.
p-value = P[the test statistic is as or more extreme than
the observed value of the test statistic]
= P[z > 1.2] + P[z < -1.2]
= 0.1151 + 0.1151 = 0.2302
There is a 23.02% chance that the test statistic is as or more
extreme than 1.2. This is fairly high, hence 1.2 is not very
extreme.
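The two p-values in this example can be reproduced with `NormalDist` from Python's standard library. This is only a sketch for the two-tailed z test; the helper name is illustrative.

```python
from statistics import NormalDist

def two_tailed_p(z):
    """Two-tailed p-value for an observed z statistic."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

p_23 = two_tailed_p(2.3)   # observed z = 2.3
p_12 = two_tailed_p(1.2)   # observed z = 1.2
print(round(p_23, 4), round(p_12, 4))
```

Computed without intermediate rounding, the second value comes out a shade below the 0.2302 obtained above by adding two tail areas that were each rounded to four decimals.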
[Figure: the p-value shown as the two tail areas below -1.2 and above 1.2]
Properties of the p -value
1. If the p-value is small (<0.05 or 0.01) H0 should be
rejected.
2. The p-value measures the plausibility of H0.
3. If the test is two tailed the p-value should be two
tailed.
4. If the test is one tailed the p-value should be one
tailed.
5. It is customary to report p-values when reporting
the results. This gives the reader some idea of the
strength of the evidence for rejecting H0
Multiple testing
Quite often one is interested in performing a
collection (family) of tests of hypotheses.
1. H0,1 versus HA,1.
2. H0,2 versus HA,2.
3. H0,3 versus HA,3.
etc.
• Let α* denote the probability that at least one type
I error is made in the collection of tests that are
performed.
• The value of α*, the family type I error rate, can
be considerably larger than α, the type I error rate
of each individual test.
• The value of the family error rate, α*, can be
controlled by altering the thresholds of each
individual test appropriately.
• A testing procedure of this nature is called a
Multiple testing procedure.
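A small sketch of how quickly the family error rate can grow. It assumes the individual tests are independent, which the text does not state, and it uses the Bonferroni adjustment (dividing the desired family rate by the number of tests) as one common way of altering the individual thresholds; that particular method is an addition here, not something named in the course notes.

```python
def family_error_rate(alpha, k):
    """P[at least one type I error] in k independent tests, each at level alpha."""
    return 1.0 - (1.0 - alpha) ** k

def bonferroni_level(alpha_family, k):
    """Per-test level that keeps the family error rate at most alpha_family."""
    return alpha_family / k

k = 10
unadjusted = family_error_rate(0.05, k)            # each test run at 0.05
per_test = bonferroni_level(0.05, k)               # 0.005 for each test
adjusted = family_error_rate(per_test, k)
print(round(unadjusted, 3), per_test, round(adjusted, 3))
```

With ten independent tests at the 0.05 level the family error rate is about 0.40; after the adjustment it drops just under 0.05.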
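A chart illustrating Statistical Procedures
(grouped by type of dependent variables, then by type of independent variables)
Dependent variables: Categorical
• Independent variables Categorical: Multiway frequency Analysis (Log Linear Model)
• Independent variables Continuous: Discriminant Analysis
• Independent variables Continuous & Categorical: Discriminant Analysis
Dependent variables: Continuous
• Independent variables Categorical: ANOVA (single dep var), MANOVA (Mult dep var)
• Independent variables Continuous: MULTIPLE REGRESSION (single dep variable), MULTIVARIATE MULTIPLE REGRESSION (multiple dependent variable)
• Independent variables Continuous & Categorical: ANACOVA (single dep var), MANACOVA (Mult dep var)
Dependent variables: Continuous & Categorical
• Independent variables Categorical: ??
• Independent variables Continuous: ??
• Independent variables Continuous & Categorical: ??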
Next topic: Fitting equations to
data