STP 420 SUMMER 2002
STP 420
INTRODUCTION TO APPLIED STATISTICS
NOTES
PART 1 - DATA
CHAPTER 2
LOOKING AT DATA - RELATIONSHIPS
Introduction
Association between variables
Two variables measured on the same individuals are associated if some values of one
variable tend to occur more often with some values of the second variable than with other
values of that variable.
E.g. height and weight: as height increases, weight also tends to increase.
Or smoking and life expectancy: smokers tend to have shorter life expectancy
(an inverse relationship, but still an association).
response variable – measures an outcome of a study (dependent variable)
explanatory variable – explains or causes changes in the response variable
(independent variable)
2.1 Scatterplots
A scatterplot shows the relationship between two quantitative variables measured on the
same individuals. The values of one variable appear on the horizontal axis (explanatory
variable x) and the other on the vertical axis (response variable y). Each individual
appears as a point.
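Such a plot is straightforward to produce in software. Below is a minimal sketch in Python with matplotlib, using the age/price data from the example in section 2.3 (the axis labels are illustrative):

    import matplotlib.pyplot as plt

    # Explanatory variable on the horizontal axis, response on the vertical axis
    age = [5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7]                 # x: age (yr)
    price = [85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48]  # y: price ($100)

    plt.scatter(age, price)   # each individual appears as one point
    plt.xlabel("Age (yr)")
    plt.ylabel("Price ($100)")
    plt.show()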
Examining a scatterplot
In any graph of data, look for the overall pattern and for striking deviations/outliers.
Describe the overall pattern of a scatterplot by the form, direction (positive or
negative), and strength (how close the points lie to a straight line) of the relationship.
Outlier (important kind of deviation) – falls outside the overall pattern of the relationship
Positive association – points in scatterplot seem to increase from left to right
Negative association – points in scatterplot seem to decrease from left to right
Linear relationship – points follow a straight line approximately
Categorical variables – use different color or symbol for each category
Categorical explanatory variables with a quantitative response variable
Make a graph that compares the distributions of the response for each category
of the explanatory variable.
2.2 Correlation – r
Correlation – measures the direction and strength of the linear relationship between two
quantitative variables; its numeric value ranges from −1 to 1.
r = −1 implies a perfect negative linear relation: all points fall on a negatively sloped straight line
r = 0 implies no linear relationship
r = +1 implies a perfect positive linear relation: all points fall on a positively sloped straight line
r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)
where
n – # of individuals
x_i – observations for variable X
x̄ – mean of variable X
s_x – standard deviation for variable X
y_i – observations for variable Y
ȳ – mean of variable Y
s_y – standard deviation for variable Y
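As a sketch of how this formula translates into code (plain Python; assumes at least two paired observations and nonzero spread in each variable):

    from math import sqrt

    def correlation(xs, ys):
        """r = average (over n - 1) of the products of standardized values."""
        n = len(xs)
        xbar, ybar = sum(xs) / n, sum(ys) / n
        sx = sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))  # sample sd of X
        sy = sqrt(sum((y - ybar) ** 2 for y in ys) / (n - 1))  # sample sd of Y
        return sum(((x - xbar) / sx) * ((y - ybar) / sy)
                   for x, y in zip(xs, ys)) / (n - 1)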
Properties of correlation
1. Makes no distinction between explanatory and response variables; it makes no
difference which variable is x and which is y.
2. The two variables must be quantitative; correlation is not appropriate for
categorical variables.
3. r is computed from standardized values, so it is not affected if the units of
measurement for x, y, or both are changed (see the sketch after this list).
4. Positive r implies positive association between the variables; negative r implies
negative association.
5. −1 ≤ r ≤ 1; r close to 0 implies a weak linear relationship.
6. Correlation measures the strength of linear relationships only (not curves).
7. Like s, r is not resistant and is affected by outliers (be careful).
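A quick check of properties 1 and 3 (a sketch with numpy; the height/weight numbers below are made up for illustration):

    import numpy as np

    height_in = np.array([60, 62, 65, 68, 70, 72])        # inches
    weight_lb = np.array([115, 120, 140, 155, 165, 180])  # pounds

    r = np.corrcoef(height_in, weight_lb)[0, 1]
    # Property 3: changing units (inches -> cm, pounds -> kg) does not change r
    r_metric = np.corrcoef(height_in * 2.54, weight_lb * 0.454)[0, 1]
    # Property 1: swapping which variable is x and which is y does not change r
    r_swapped = np.corrcoef(weight_lb, height_in)[0, 1]
    print(r, r_metric, r_swapped)  # all three values agree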
Correlation is not a complete description of two-variable data; the means and standard
deviations should also be reported.
2.3 Least-squares regression
Example.

Age (yr), x:      5    4    6    5    5    5    6    6    2    7    7
Price ($100), y: 85  103   70   82   89   98   66   95  169   70   48
Plot y against x; if the points seem to follow a straight line, then a straight line can be
used to approximate the relationship between x and y.
A regression line is a straight line that describes how a response variable y changes as
an explanatory variable x changes. It can be used to predict y given x. You must know
which variable is explanatory and which is the response.
y = a + bx
where
b is the slope and tells how much y changes as x changes one unit
a is the intercept, the value of y when x = 0
Least-squares regression line of y on x is the line that makes the sum of the squares of
the vertical distances of the data points from the line as small as possible.
Extrapolation – use of a regression line to predict far outside the range of values of the
explanatory variable x; such predictions may be inaccurate.
Equation of the least-squares regression line
\hat{y} = a + bx
br
with slope
x
y
r
x
y
sx
sy
sy
sx
and intercept
a  y  bx
explanatory variable
response variable
correlation between x and y
sample mean of x
sample mean of y
sample standard deviation of x
sample deviation of y
Example, continued.
Regression equation – the equation of the regression line:

\hat{y} = 195.47 - 20.26x
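These coefficients can be reproduced from the formulas b = r·s_y/s_x and a = ȳ − b·x̄ (a sketch with numpy, using the example data):

    import numpy as np

    x = np.array([5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7])               # age (yr)
    y = np.array([85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48])  # price ($100)

    r = np.corrcoef(x, y)[0, 1]
    b = r * y.std(ddof=1) / x.std(ddof=1)  # slope: b = r * sy / sx
    a = y.mean() - b * x.mean()            # intercept: a = ybar - b * xbar
    print(round(a, 2), round(b, 2))        # 195.47 -20.26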
Computational formulas in regression (for by hand computations)
Definition                                    Computational

S_{xx} = \sum (x - \bar{x})^2                 S_{xx} = \sum x^2 - (\sum x)^2 / n

S_{xy} = \sum (x - \bar{x})(y - \bar{y})      S_{xy} = \sum xy - (\sum x)(\sum y) / n

S_{yy} = \sum (y - \bar{y})^2                 S_{yy} = \sum y^2 - (\sum y)^2 / n

b = \frac{S_{xy}}{S_{xx}}   and   a = \frac{1}{n}\left(\sum y - b \sum x\right) = \bar{y} - b \bar{x}
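The shortcut formulas translate directly into code (a plain-Python sketch; same example data as above):

    def least_squares(xs, ys):
        """Slope b and intercept a via the computational formulas."""
        n = len(xs)
        sum_x, sum_y = sum(xs), sum(ys)
        Sxx = sum(x * x for x in xs) - sum_x ** 2 / n
        Sxy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
        b = Sxy / Sxx                 # slope
        a = (sum_y - b * sum_x) / n   # intercept: (1/n)(sum y - b sum x)
        return a, b

    a, b = least_squares([5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7],
                         [85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48])
    print(round(a, 2), round(b, 2))   # 195.47 -20.26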
Coefficient of determination, r² – the square of the correlation r; it is the fraction of
the variation in the observed values of y that is explained by the least-squares regression
of y on x.
Computational formulas:

r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}   and   r^2 = \frac{S_{xy}^2}{S_{xx} S_{yy}}
0 ≤ r² ≤ 1, i.e. r² varies from 0 to 1
r2 close to 0 implies the least-squares regression explains very little of the variation in y
r2 close to 1 implies the least-squares regression explains most of the variation in y
r^2 = \sum (\hat{y} - \bar{y})^2 / \sum (y - \bar{y})^2
x     y      ŷ       y − ȳ    ŷ − ȳ    y − ŷ
5    85    94.16    -3.64     5.53    -9.16
4   103   114.42    14.36    25.79   -11.42
6    70    73.90   -18.64   -14.74    -3.90
5    82    94.16    -6.64     5.53   -12.16
5    89    94.16     0.36     5.53    -5.16
5    98    94.16     9.36     5.53     3.84
6    66    73.90   -22.64   -14.74    -7.90
6    95    73.90     6.36   -14.74    21.10
2   169   154.95    80.36    66.31    14.05
7    70    53.64   -18.64   -35.00    16.36
7    48    53.64   -40.64   -35.00    -5.64
\sum (\hat{y} - \bar{y})^2 = 8285.0,   \sum (y - \bar{y})^2 = 9708.5

r² = 8285.0 / 9708.5 = 0.853 (85.3%)
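The same value can be checked in code both ways (a sketch with numpy; the fitted values use the regression equation above):

    import numpy as np

    x = np.array([5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7])
    y = np.array([85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48])
    yhat = 195.47 - 20.26 * x   # predicted values from the regression line

    # Fraction of the variation in y explained by the regression
    r2 = ((yhat - y.mean()) ** 2).sum() / ((y - y.mean()) ** 2).sum()

    # Shortcut form: r^2 = Sxy^2 / (Sxx * Syy)
    Sxx = ((x - x.mean()) ** 2).sum()
    Syy = ((y - y.mean()) ** 2).sum()
    Sxy = ((x - x.mean()) * (y - y.mean())).sum()
    print(round(r2, 3), round(Sxy ** 2 / (Sxx * Syy), 3))  # both about 0.853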
2.4 Cautions about regression and correlation
Correlation and regression, along with the scatterplot, allow us to study the
relationship between variables considered in pairs.
Residual – the difference between an observed value of the response variable and the
value predicted by the regression line
Residual = observed y − predicted y = y − ŷ
Residual plot – a scatterplot of the regression residuals against the explanatory variable.
- It helps us assess the fit of the regression line.
- If the plot is unstructured and centered about 0, there is no major problem.
- If the plot shows a curve, then a straight line is not the best fit to the data.
- If the residuals get bigger as you go from left to right, predictions are more
precise on the left than on the right (see the sketch below).
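A residual plot takes only a few lines once the line is fitted (a sketch with numpy and matplotlib, continuing the same example):

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.array([5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7])
    y = np.array([85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48])

    b, a = np.polyfit(x, y, 1)     # least-squares slope and intercept
    residuals = y - (a + b * x)    # observed y - predicted y

    plt.scatter(x, residuals)      # residuals against the explanatory variable
    plt.axhline(0)                 # an unstructured plot centers about this line
    plt.xlabel("Age (yr)")
    plt.ylabel("Residual")
    plt.show()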
Lurking variable – variable that has an important effect on the relationship among
variables in a study but is not included among the variables studied.
Outlier – observation that lies outside the overall pattern of the other observations. Points
that are outliers in the y direction of a scatterplot have large regression residuals; other
outliers need not have large residuals.
Influential observation – if removed there would be a change in the result of some
statistical calculation. Points that are outliers in the x direction of a scatterplot are often
called influential points for the least-squares regression line.
Difference between fitted values (DFFITS) – find the predicted response (ŷ_i) for the
ith individual with this individual in the data and with it removed, take the difference, and
standardize it (subtract the mean and divide by the standard deviation). Do this for all
individuals to give the DFFITS.
Studentized residuals – standardizing the residuals using the standard deviation of the
data with the individual omitted from the data (helps to avoid having too big a standard
deviation)
Beware of lurking variables
Correlation measures only linear association.
Extrapolation can be inaccurate.
Correlation and least-squares regression are not resistant measures.
Lurking variables can make correlation or regression misleading.
Association does not imply causation
An association between an explanatory variable x and a response variable y, even if it is
very strong, is not by itself good evidence that changes in x actually cause changes in y.
A correlation based on averages over many individuals is usually higher than the
correlation between the same variables based on the data for individuals.
Prediction does not require a cause-and-effect relationship. (e.g. height & weight)
2.6 Relations in categorical data (case of response variable being quantitative)
Relationships are described using counts (frequencies) or percents (relative
frequencies) of each category.
Two-way table – presents data for two variables
Row variable - education
Column variable - age
                          Age group
Education        25–34    35–54    >= 55     Total
< HS             5,325    9,152   16,035    30,152
= HS            14,061   24,070   18,320    56,451
College 1–3     11,659   19,926    9,662    41,247
College >= 4    10,342   19,878    8,005    38,225
Total           41,388   73,028   52,022   166,438
Roundoff error – values rounded to nearest thousand.
Education alone and age alone are the marginal distributions.

E.g.

Education         Total
< HS             30,152
= HS             56,451
College 1–3      41,247
College >= 4     38,225
Total           166,438

Age group        25–34    35–54    >= 55     Total
Total           41,388   73,028   52,022   166,438
Conditional distribution of education given an age group (25–34)

Education        25–34
< HS             5,325
= HS            14,061
College 1–3     11,659
College >= 4    10,342
Total           41,388
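Marginal and conditional distributions can be computed directly from the two-way table (a sketch with pandas, using the counts above; a conditional distribution is normally reported as percents):

    import pandas as pd

    table = pd.DataFrame(
        {"25-34": [5325, 14061, 11659, 10342],
         "35-54": [9152, 24070, 19926, 19878],
         ">=55":  [16035, 18320, 9662, 8005]},
        index=["< HS", "= HS", "College 1-3", "College >= 4"])

    print(table.sum(axis=1))  # marginal distribution of education alone
    print(table.sum(axis=0))  # marginal distribution of age alone

    # Conditional distribution of education given age 25-34, as percents
    cond = table["25-34"] / table["25-34"].sum() * 100
    print(cond.round(1))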
Simpson’s paradox – an association or comparison that holds for all of several groups
can reverse direction when the data are combined to form a single group.
- reversal of direction by aggregation of data
Example of three-way table – presenting information on three variables, one two-way
table for each level (value) of the third variable.
Good condition
            Hosp. A   Hosp. B
Died              6         8
Survived        594       592
Total           600       600

Poor condition
            Hosp. A   Hosp. B
Died             57         8
Survived       1443       192
Total          1500       200
Condition variable – good and poor
Hospital variable – A and B
Survival variable – Died and survived
Aggregation of data – adding up across one variable (elimination of one variable)
E.g. eliminating condition (ignoring condition):

            Hosp. A   Hosp. B
Died             63        16
Survived       2037       784
Total          2100       800
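The reversal is easy to verify by computing death rates within each condition and overall (a plain-Python sketch using the hospital counts above):

    # (deaths, total patients) by hospital, within each condition
    good = {"A": (6, 600), "B": (8, 600)}
    poor = {"A": (57, 1500), "B": (8, 200)}

    for label, table in (("good", good), ("poor", poor)):
        for hosp, (died, total) in table.items():
            print(f"{label} condition, hospital {hosp}: {died / total:.1%} died")

    # Aggregated over condition, the comparison reverses (Simpson's paradox):
    # A has the lower death rate in each condition but the higher rate overall.
    for hosp in ("A", "B"):
        died = good[hosp][0] + poor[hosp][0]
        total = good[hosp][1] + poor[hosp][1]
        print(f"overall, hospital {hosp}: {died / total:.1%} died")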
2.7 The question of causation
Two variables are often associated, even strongly associated, but this does not by itself
mean that one causes the other (i.e. that the explanatory variable causes the response variable).
Explaining association - causation
One variable causes the other
[Diagram: three causal structures that can explain an observed association.]
Causation:        x → y
Common response:  x ← z → y (z drives both x and y)
Confounding:      x → y ← z, with x and z associated, so the x–y link is uncertain
x, y – observed variables
z – lurking variable
arrows show cause-and-effect relationships
Explaining association – common response
Observed association between x and y is explained by a lurking variable z; both x and y
change when z changes.
Explaining association - confounding
The effects of more than one variable on the response are mixed together (x and z may
each be explanatory or lurking variables). We cannot distinguish the influence of x on y
from the influence of z on y.