Chapter 7 Review, Part 2

Correlation 2
Computations, and the best fitting
line.
Computing r from a more realistic
set of data
• A study was performed to investigate
whether the quality of an image affects
reading time.
• The experimental hypothesis was that
reduced quality would slow down reading
time.
• Quality was measured on a scale of 1 to 10.
Reading time was in seconds.
Quality vs Reading Time data:
Compute the correlation
• Is there a relationship?
• Check for linearity.
• Compute r.

Quality        Reading time
(scale 1-10)   (seconds)
4.30           8.1
4.55           8.5
5.55           7.8
5.65           7.3
6.30           7.5
6.45           7.3
6.45           6.0
Calculate t scores for X
X       X - X-bar   (X - X-bar)^2   tX = (X - X-bar) / sX
4.30    -1.31       1.71            -1.48
4.55    -1.06       1.12            -1.19
5.55    -0.06       0.00            -0.07
5.65     0.04       0.00             0.05
6.30     0.69       0.48             0.78
6.45     0.84       0.71             0.95
6.45     0.84       0.71             0.95
ΣX = 39.25
n = 7
X-bar = 5.61
SSW = 4.73
MSW = 4.73/(7-1) = 0.79
sX = 0.89
Calculate t scores for Y
Y      Y - Y-bar   (Y - Y-bar)^2   tY = (Y - Y-bar) / sY
8.1     0.60       0.36             0.76
8.5     1.00       1.00             1.26
7.8     0.30       0.09             0.38
7.3    -0.20       0.04            -0.25
7.5     0.00       0.00             0.00
7.3    -0.20       0.04            -0.25
6.0    -1.50       2.25            -1.89
ΣY = 52.5
n = 7
Y-bar = 7.50
SSW = 3.78
MSW = 3.78/(7-1) = 0.63
sY = 0.79
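The t score computations above can be reproduced with a short script. This is a minimal sketch in Python, not part of the original slides; the function name `t_scores` and the variable names are assumptions chosen for illustration. It carries full precision, so the last digit can differ slightly from the hand-rounded slide values.

```python
from math import sqrt

def t_scores(values):
    """Convert raw scores to t scores: (score - mean) / s, with s based on n - 1."""
    n = len(values)
    mean = sum(values) / n
    ss_w = sum((v - mean) ** 2 for v in values)  # sum of squared deviations (SSW)
    ms_w = ss_w / (n - 1)                        # mean square (MSW)
    s = sqrt(ms_w)                               # estimated standard deviation
    return [(v - mean) / s for v in values]

# Data from the slides: image quality (X) and reading time in seconds (Y).
quality = [4.30, 4.55, 5.55, 5.65, 6.30, 6.45, 6.45]
reading_time = [8.1, 8.5, 7.8, 7.3, 7.5, 7.3, 6.0]

t_x = t_scores(quality)
t_y = t_scores(reading_time)

for x, tx, y, ty in zip(quality, t_x, reading_time, t_y):
    print(f"X={x:.2f}  tX={tx:6.2f}   Y={y:.1f}  tY={ty:6.2f}")
```

Because each t score is a deviation from the mean divided by s, the t scores for a variable always sum to zero.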
Plot t scores
tX       tY
-1.48     0.76
-1.19     1.28
-0.07     0.39
 0.05    -0.25
 0.78     0.00
 0.95    -0.25
 0.95    -1.89
t score plot with best fitting line:
linear? YES!
[Plot: Reading Time (t score) against Image quality (t score), both axes running from -2.00 to 2.00]
Calculate r
tX       tY       tX - tY   (tX - tY)^2
-1.48     0.76    -2.24     5.02
-1.19     1.28    -2.47     6.10
-0.07     0.39    -0.46     0.21
 0.05    -0.25     0.30     0.09
 0.78     0.00     0.78     0.61
 0.95    -0.25     1.20     1.44
 0.95    -1.88     2.83     8.01
Σ(tX - tY)^2 = 21.48
Σ(tX - tY)^2 / (nP - 1) = 21.48/(7-1) = 3.580
r = 1 - (1/2 * 3.580) = 1 - 1.79 = -0.790
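The formula just used, r = 1 - (1/2) * Σ(tX - tY)^2 / (nP - 1), can be checked with a short script. This is a sketch in Python under the assumption that t scores are computed with n - 1 as in the preceding slides; the function names are illustrative and not from the slides. Working at full precision, r for these data comes out near -0.78 rather than -0.790; the small gap comes from rounding the t scores to two decimals before squaring.

```python
from math import sqrt

def t_scores(values):
    """t score = (score - mean) / s, where s uses n - 1 in the denominator."""
    n = len(values)
    mean = sum(values) / n
    s = sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / s for v in values]

def correlation_from_t_scores(t_x, t_y):
    """r = 1 - (1/2) * sum((tX - tY)^2) / (nP - 1), with nP the number of pairs."""
    n_pairs = len(t_x)
    mean_sq_diff = sum((tx - ty) ** 2 for tx, ty in zip(t_x, t_y)) / (n_pairs - 1)
    return 1 - 0.5 * mean_sq_diff

quality = [4.30, 4.55, 5.55, 5.65, 6.30, 6.45, 6.45]
reading_time = [8.1, 8.5, 7.8, 7.3, 7.5, 7.3, 6.0]

r = correlation_from_t_scores(t_scores(quality), t_scores(reading_time))
print(f"r = {r:.3f}")
```

The negative sign matches the experimental hypothesis: lower image quality goes with longer reading times.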
Best fitting line
The definition of the best fitting line
plotted on t axes
• A “best fitting line” minimizes the average
squared vertical distance of Y scores in the sample
(expressed as tY scores) from the line.
• The best fitting line is a least squares, unbiased
estimate of values of Y in the sample.
• The generic formula for a line is Y=mx+b where
m is the slope and b is the Y intercept.
• Thus, any specific line, such as the best fitting
line, can be defined by its slope and its intercept.
The intercept of the best fitting line
plotted on t axes
The origin is the point where both tX and
tY=0.000
• So the origin represents the mean of both
the X and Y variable
• When plotted on t axes all best fitting lines
go through the origin.
• Thus, the tY intercept of the best fitting line
= 0.000
The slope of and formula for the
best fitting line
• When plotted on t axes the slope of the
best fitting line = r, the correlation
coefficient.
• To define a line we need its slope and Y
intercept
• r = the slope and tY intercept=0.00
• The formula for the best fitting line is
therefore tY=rtX + 0.00 or tY= rtX
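As a check on the claim that the best fitting line on t axes has slope r and intercept 0.00, here is a sketch in Python that fits an ordinary least squares line to the t scores from the image quality example. The helper names (`t_scores`, `least_squares_line`) are assumptions for illustration, and the least squares formulas used are the standard textbook ones rather than anything stated on the slides.

```python
from math import sqrt

def t_scores(values):
    """t score = (score - mean) / s, where s uses n - 1 in the denominator."""
    n = len(values)
    mean = sum(values) / n
    s = sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / s for v in values]

def least_squares_line(xs, ys):
    """Return (slope, intercept) of the line minimizing squared vertical distances."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

t_x = t_scores([4.30, 4.55, 5.55, 5.65, 6.30, 6.45, 6.45])
t_y = t_scores([8.1, 8.5, 7.8, 7.3, 7.5, 7.3, 6.0])

slope, intercept = least_squares_line(t_x, t_y)

# r from the slide formula: 1 - (1/2) * sum((tX - tY)^2) / (nP - 1)
r = 1 - 0.5 * sum((a - b) ** 2 for a, b in zip(t_x, t_y)) / (len(t_x) - 1)

print(f"slope = {slope:.3f}, intercept = {intercept:.3f}, r = {r:.3f}")
```

The fitted slope and r agree, and the intercept is zero apart from floating point noise, which is what tY = r tX says.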
Here’s how a visual representation of the best fitting line
(slope = r, Y intercept = 0.000) and the dots representing
tX and tY scores might be described. (Whether the
correlation is positive or negative doesn’t matter.)
• Perfect - scores fall exactly on a straight line.
• Strong - most scores fall near the line.
• Moderate - some are near the line, some not.
• Weak - the scores are only mildly linear.
• Independent - the scores are not linear at all.
Strength of a relationship
[Four scatterplots on t axes running from -1.5 to 1.5:]
• Perfect
• Strong - r about .800
• Moderate - r about .500
• Independent - r about 0.000
r=.800, the formula for the best
fitting line = ???
[Plot on t axes running from -1.5 to 1.5]
r=-.800, the formula for the best
fitting line = ???
[Plot on t axes running from -1.5 to 1.5]
r=0.000, the formula for the best
fitting line is:
[Plot on t axes running from -1.5 to 1.5]
Notice what that formula for
independent variables says
• tY = rtX = 0.000 (tX) = 0.000
• When tY = 0.000, you are at the mean of Y
• So, when variables are independent, the best
fitting line says that the best estimate of Y scores
in the sample is back to the mean of Y regardless
of your score on X
• Thus, when variables are independent we go back
to saying everyone will score right at the mean
A note of caution: Watch out for the plot
for which the best fitting line is a curve.
[Plot on t axes running from -1.5 to 1.5]
Confidence intervals around rhoT
– relation to Chapter 6
• In Chapter 6 we learned to create confidence intervals
around muT that allowed us to test a theory.
• To test our theory about mu we took a random sample,
computed the sample mean and standard deviation, and
determined whether the sample mean fell into that interval.
• If it did not, we had shown the theory that led us to predict
muT was false.
• We then discarded the theory and muT and used the sample
mean as our best estimate of the true population mean.
If we discard muT, what do we use as
our best estimate of mu?
• Generally, our best estimate of a population
parameter is the sample statistic that estimates it.
• Our best estimate of mu has been and is the
sample mean, X-bar.
• Since we have discarded our theory, we go back
to using X-bar as our best (least squares, unbiased,
consistent) estimate of mu.
More generally, we can test a theory
(hypothesis) about any population parameter
using a similar confidence interval.
• We theorize about what the value of the
population parameter is.
• We get an estimate of the variability of the
parameter
• We construct a confidence interval (usually a 95%
confidence interval) in which our hypothesis says
that the sample statistic should fall.
• We obtain a random sample and determine
whether the sample statistic falls inside or outside
our confidence interval
The sample statistic will fall inside
or outside of the CI.95
• If the sample statistic falls inside the confidence
interval, our theory has received some support and
we hold on to it.
• But the more interesting case is when the sample
statistic falls outside the confidence interval.
• Then we must discard the theory and the theory
based estimate of the population parameter.
• In that case, our best estimate of the population
parameter is the sample statistic
• Remember, the sample statistic is a least squares,
unbiased, consistent estimate of its population
parameter.
We are going to do the same
thing with a theory about rho
• rho is the correlation coefficient for the population.
• If we have a theory about rho, we can create a 95%
confidence interval into which we expect r will fall.
• An r computed from a random sample will then fall inside
or outside the confidence interval.
When r falls inside or outside of the
CI.95 around rhoT
• If r falls inside the confidence interval, our theory
about rho has received some support and we hold
on to it.
• But the more interesting case is when r falls
outside the confidence interval.
• Then we must discard the theory and the theory
based estimate of the population parameter.
• In that case, our best estimate of rho is the r we
found in our random sample
• Thus, when r falls outside the CI.95 we can go back
to using it as a least squares unbiased estimate of
rho.
Chapter 7 slides end here
Rest of slides are for other chapters
and should not be reviewed here.
RK – 10/24
Why is it so important to
determine whether r fits a theory?
• In Chapter 8 we go on to predict values of Y from
values of X and r.
• The formula we use is called the regression
equation. It is very much like the formula for the
best fitting line.
• The only difference is that the best fitting line
describes the relationship among the Y scores in
the sample.
• But in Chapter 8 we move to predicting scores for
people who are in the population from which the
sample was drawn, but not in the sample.
That’s dangerous.
Let me give you an example.
Assume, you are the personnel
officer for a mid size company.
• You need to hire a typist.
• There are 2 applicants for the job.
• You give the applicants a typing test.
• Which would you hire: someone who types
6 words a minute with 12 mistakes or
someone who types 100 words a minute
with 1 mistake?
Who would you hire?
• Of course, you would predict that the second
person will be a better typist and hire that person.
• Notice that we never gave the person with 6
words/minute a chance to be a typist in our firm.
• We prejudged her on the basis of the typing test.
• That is probably valid in this case – a typing test
probably predicts fairly well how good a typist
someone will be.
But say the situation is a little more complicated!
• You have several applicants for a leadership
position in your firm.
• But it is not 2002, it is 1957, when we knew that
only white males were capable of leadership in
corporate America.
• That is, we all “know” that leadership ability is
correlated with both gender and skin color: white
and male are associated with high leadership
ability, and darker skin color and female gender
with lower leadership ability.
• We now know this is absurd, but lots of people
were never
Confidence intervals around muT
Confidence intervals and hypothetical means
• We frequently have a theory about what the
mean of a distribution should be.
• To be scientific, that theory about mu must
be able to be proved wrong (falsified).
• One way to test a theory about a mean is to
state a range where sample means should
fall if the theory is correct.
• We usually state that range as a 95%
confidence interval.
• To test our theory, we take a random sample from
the appropriate population and see if the sample
mean falls where the theory says it should, inside
the confidence interval.
• If the sample mean falls outside the 95%
confidence interval established by the theory, the
evidence suggests that our theoretical population
mean and the theory that led to its prediction are
wrong.
• When that happens our theory has been falsified.
We must discard it and look for an alternative
explanation of our data.
For example:
• For example, let’s say that we had a new
antidepressant drug we wanted to peddle.
Before we can do that we must show that
the drug is safe.
• Drugs like ours can cause problems with
body temperature. People can get chills or
fever.
• We want to show that body temperature is
not affected by our new drug.
Testing a theory
• “Everyone” knows that normal body temperature
for healthy adults is 98.6°F.
• Therefore, it would be nice if we could show that
after taking our drug, healthy adults still had an
average body temperature of 98.6°F.
• So we might test a sample of 16 healthy adults,
first giving them a standard dose of our drug and,
when enough time had passed, taking their
temperature to see whether it was 98.6°F on the
average.
Testing a theory - 2
• Of course, even if we are right and our drug has no
effect on body temperature, we wouldn’t expect a
sample mean to be precisely 98.600000…
• We would expect some sampling fluctuation
around a population mean of 98.6°F.
• So, if our drug does not cause change in body
temperature, the sample mean should be close to
98.6. It should, in fact, be within the 95%
confidence interval around muT, 98.6.
• SO WE MUST CONSTRUCT A 95%
CONFIDENCE INTERVAL AROUND 98.6°
AND SEE WHETHER OUR SAMPLE MEAN
FALLS INSIDE OR OUTSIDE THE CI.
To create a confidence interval around muT,
we must estimate sigma from a sample.
• We randomly select a group of 16 healthy
individuals from the population.
• We administer a standard clinical dose of our new
drug for 3 days.
• We carefully measure body temperature.
• RESULTS: We find that the average body
temperature in our sample is 99.5°F with an
estimated standard deviation of 1.40° (s=1.40).
• IS 99.5°F IN THE 95% CI AROUND MUT???
Knowing s and n we can easily compute
the estimated standard error of the mean.
• Let’s say that s=1.40° and n = 16:
• sX-bar = s / √n = 1.40/4.00 = 0.35
• Using this estimated standard error we can
construct a 95% confidence interval for the
body temperature of a sample of 16 healthy
adults.
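The standard error arithmetic above is small enough to verify in a couple of lines. A minimal sketch in Python, with the function name `estimated_standard_error` chosen here for illustration:

```python
from math import sqrt

def estimated_standard_error(s, n):
    """Estimated standard error of the mean: s divided by the square root of n."""
    return s / sqrt(n)

# Values from the body temperature example: s = 1.40, n = 16.
se = estimated_standard_error(1.40, 16)
print(f"estimated standard error = {se:.2f}")
```

Because n sits under a square root, quadrupling the sample size only halves the standard error.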
We learned how to create confidence intervals with
the Z distribution in Chapter 4.
95% of sample means will fall in a symmetrical
interval around mu that goes from 1.960 standard
errors below mu to 1.960 standard errors above mu
• A way to write that fact in statistical language is:
CI.95: mu ± ZCRIT* sigmaX-bar
or
CI.95: mu - ZCRIT* sigmaX-bar < X-bar < mu + ZCRIT* sigmaX-bar
For a 95% CI, ZCRIT = 1.960
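The Z-based interval can be sketched in Python using the standard library's `statistics.NormalDist` to look up ZCRIT instead of a table. The function name and the example numbers (mu = 100, sigma = 15, n = 25) are hypothetical and are not from the slides; they are only there to show the arithmetic when sigma is known.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(mu, sigma, n, confidence=0.95):
    """Interval mu +/- ZCRIT * sigma / sqrt(n) in which sample means should fall."""
    # Two-tailed critical value: leave (1 - confidence) / 2 in each tail.
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    half_width = z_crit * sigma / sqrt(n)
    return mu - half_width, mu + half_width, z_crit

# Hypothetical illustration values, not from the slides.
lower, upper, z_crit = z_confidence_interval(mu=100.0, sigma=15.0, n=25)
print(f"ZCRIT = {z_crit:.3f}")
print(f"CI.95: {lower:.2f} < X-bar < {upper:.2f}")
```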
But when we must estimate sigma with s, we must
use the t distribution to define critical intervals
around mu or muT.
Here is how we would write the formulae
substituting t for Z and s for sigma
CI.95: muT ± tCRIT* sX-bar
or
CI.95: muT - tCRIT* sX-bar < X-bar < muT + tCRIT* sX-bar
Notice that the critical value of t that includes 95%
of the sample means changes with the number of
degrees of freedom for s, our estimate of sigma,
and must be taken from the t table.
If n= 16 in a single sample, dfW=n-k=15.
df       .05      .01
1        12.706   63.657
2        4.303    9.925
3        3.182    5.841
4        2.776    4.604
5        2.571    4.032
6        2.447    3.707
7        2.365    3.499
8        2.306    3.355
9        2.262    3.250
10       2.228    3.169
11       2.201    3.106
12       2.179    3.055
13       2.160    3.012
14       2.145    2.977
15       2.131    2.947
16       2.120    2.921
17       2.110    2.898
18       2.101    2.878
19       2.093    2.861
20       2.086    2.845
21       2.080    2.831
22       2.074    2.819
23       2.069    2.807
24       2.064    2.797
25       2.060    2.787
26       2.056    2.779
27       2.052    2.771
28       2.048    2.763
29       2.045    2.756
30       2.042    2.750
40       2.021    2.704
60       2.000    2.660
100      1.984    2.626
200      1.972    2.601
500      1.965    2.586
1000     1.962    2.581
2000     1.961    2.578
10000    1.960    2.576
So, muT=98.6, tCRIT=2.131, s=1.40, n=16
Here is the confidence interval
CI.95: muT ± tCRIT* sX-bar =
= 98.6 ± (2.131)*(1.40/√16) =
= 98.6 ± (2.131)*(1.40/4)
= 98.6 ± (2.131)(0.35) = 98.60 ± 0.75
CI.95: 97.85 < X-bar < 99.35
Our sample mean fell outside the CI.95 and falsifies
the theory that our drug has no effect on body
temperature. Our drug may cause a slight fever.
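The whole test can be sketched in a few lines of Python. This is an illustration under the slide's own numbers, with tCRIT = 2.131 entered by hand from the t table (df = 15, .05 column) rather than computed; the function and variable names are assumptions.

```python
from math import sqrt

def t_confidence_interval(mu_t, t_crit, s, n):
    """Interval muT +/- tCRIT * s / sqrt(n) in which the sample mean should fall."""
    standard_error = s / sqrt(n)
    half_width = t_crit * standard_error
    return mu_t - half_width, mu_t + half_width

# Values from the slides; t_crit is read from the t table for df = n - 1 = 15.
mu_t, t_crit, s, n = 98.6, 2.131, 1.40, 16
sample_mean = 99.5

lower, upper = t_confidence_interval(mu_t, t_crit, s, n)
inside = lower < sample_mean < upper

print(f"CI.95: {lower:.2f} < X-bar < {upper:.2f}")
print("theory supported" if inside else "theory falsified")
```

With a sample mean of 99.5 the check comes out as falsified, matching the conclusion above.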