STP 420
INTRODUCTION TO APPLIED STATISTICS
NOTES
PART 1 - DATA
CHAPTER 2
LOOKING AT DATA - RELATIONSHIPS
Introduction
Association between variables
Two variables measured on the same individuals are associated if some values of one
variable tend to occur more often with some values of the second variable than with other
values of that variable.
Eg. height and weight: as height increases, weight tends to increase as well.
Or smoking and life expectancy: smokers tend to have a shorter life expectancy
(an inverse relationship, but still an association)
response variable – measures an outcome of a study (dependent variable)
explanatory variable – explains or causes changes in the response variable
(independent variable)
2.1
Scatterplots
A scatterplot shows the relationship between two quantitative variables measured on the
same individuals. The values of one variable appear on the horizontal axis (explanatory
variable x) and the other on the vertical axis (response variable y). Each individual
appears as a point.
Examining a scatterplot
In any graph of data, look for the overall pattern and for striking deviations/outliers.
Describe the overall pattern of the scatterplot by the form, direction (positive or
negative), and strength (how close the points are to a straight line) of the relationship.
Outlier (important kind of deviation) – falls outside the overall pattern of the relationship
Positive association – points in scatterplot seem to increase from left to right
Negative association – points in scatterplot seem to decrease from left to right
Linear relationship – points follow a straight line approximately
Categorical variables – use different color or symbol for each category
Categorical explanatory variables with a quantitative response variable
Make a graph that compares the distributions of the response for each category
of the explanatory variable.
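A minimal sketch of such a scatterplot in Python with matplotlib, using a different color and symbol for each category; the height/weight numbers and the sex labels are hypothetical, chosen only to illustrate:

```python
import matplotlib.pyplot as plt

# Hypothetical data: explanatory variable on x, response on y.
height = [160, 165, 170, 175, 180, 185]   # x: explanatory
weight = [55, 62, 68, 74, 80, 88]         # y: response
sex    = ["F", "F", "M", "F", "M", "M"]   # categorical variable

# Use a different color/symbol for each category.
for cat, color, marker in [("F", "tab:red", "o"), ("M", "tab:blue", "s")]:
    xs = [h for h, s in zip(height, sex) if s == cat]
    ys = [w for w, s in zip(weight, sex) if s == cat]
    plt.scatter(xs, ys, c=color, marker=marker, label=cat)

plt.xlabel("Height (cm)")   # explanatory variable x
plt.ylabel("Weight (kg)")   # response variable y
plt.legend(title="Sex")
plt.show()
```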
2.2
Correlation – r
Correlation - measures the direction and strength of the linear relationship between two
quantitative variables; its numeric value ranges from –1 to 1.
r = -1 implies a perfect negative linear relation; all points fall on a negatively sloped straight line
r = 0 implies no linear relationship
r = 1 implies a perfect positive linear relation; all points fall on a positively sloped straight line
r = [1/(n−1)] Σ [(xi − x̄)/sx][(yi − ȳ)/sy]
where
n - # of individuals
xi – observations for variable X
x̄ - mean of variable X
sx – standard deviation for variable X
yi – observations for variable Y
ȳ - mean of variable Y
sy – standard deviation for variable Y
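As a quick check of this formula, here is a sketch in plain Python that computes r from standardized values. The data reused here are the age/price pairs from the Section 2.3 example; the last line anticipates property 3 below by showing that changing units (age in months) leaves r unchanged.

```python
import math

def correlation(x, y):
    """r = (1/(n-1)) * sum of products of standardized values."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    # Sample standard deviations (divide by n - 1).
    sx = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))
    sy = math.sqrt(sum((yi - ybar) ** 2 for yi in y) / (n - 1))
    return sum(((xi - xbar) / sx) * ((yi - ybar) / sy)
               for xi, yi in zip(x, y)) / (n - 1)

x = [5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7]               # age (yr)
y = [85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48]  # price ($100)
print(round(correlation(x, y), 3))                  # -0.924: strong negative

# Property 3: a unit change (age in months) does not change r.
print(round(correlation([12 * xi for xi in x], y), 3))
```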
Properties of correlation
1. Makes no distinction between explanatory and response variables; it makes no
difference which variable is x and which is y.
2. Both variables must be quantitative; correlation is not appropriate for categorical
variables.
3. r is computed from standardized values, so it is not affected if the units of
measurement for x, y, or both are changed.
4. Positive r implies a positive association between the variables; negative r implies
a negative association.
5. -1 ≤ r ≤ 1; r close to 0 implies a weak linear relationship.
6. Correlation measures the strength of linear relationships only (not curves).
7. Like s, r is not resistant and is affected by outliers (be careful).
Correlation is not a complete description of two-variable data; the means and
standard deviations should also be given.
2.3
Least-squares regression
Example.
Age (yr), x:      5    4    6    5    5    5    6    6    2    7    7
Price ($100), y: 85  103   70   82   89   98   66   95  169   70   48
Plot y against x; if the points seem to follow a straight line, then a straight line can be
used to approximate the relationship between x and y.
A regression line is a straight line that describes how a response variable y changes as
an explanatory variable x changes. It can be used to predict y given x. You must know
which variable is explanatory and which is the response.
y = a + bx
where
b is the slope and tells how much y changes as x changes one unit
a is the intercept, the value of y when x = 0
Least-squares regression line of y on x is the line that makes the sum of the squares of
the vertical distances of the data points from the line as small as possible.
Extrapolation – the use of a regression line to predict far outside the range of values of
the explanatory variable x. Such predictions may be inaccurate.
Equation of the least-squares regression line
ŷ = a + bx
with slope
b = r(sy/sx)
and intercept
a = ȳ − b·x̄
where
x – explanatory variable
y – response variable
r – correlation between x and y
x̄ – sample mean of x
ȳ – sample mean of y
sx – sample standard deviation of x
sy – sample standard deviation of y
Example, continued.
Regression equation – the equation of the regression line:
ŷ = 195.47 − 20.26x
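A short sketch in plain Python that recovers this equation from the data above, using the slope and intercept formulas just given:

```python
import math

x = [5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7]               # age (yr)
y = [85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48]  # price ($100)
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
sx = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))
sy = math.sqrt(sum((yi - ybar) ** 2 for yi in y) / (n - 1))
r = sum((xi - xbar) * (yi - ybar)
        for xi, yi in zip(x, y)) / ((n - 1) * sx * sy)

b = r * sy / sx        # slope
a = ybar - b * xbar    # intercept
print(f"yhat = {a:.2f} + ({b:.2f})x")   # yhat = 195.47 + (-20.26)x
```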
Computational formulas in regression (for by-hand computation)

Definition                     Computational
Sxx = Σ(x − x̄)²               Sxx = Σx² − (Σx)²/n
Sxy = Σ(x − x̄)(y − ȳ)         Sxy = Σxy − (Σx)(Σy)/n
Syy = Σ(y − ȳ)²               Syy = Σy² − (Σy)²/n

b = Sxy/Sxx
and
a = (1/n)(Σy − b·Σx) = ȳ − b·x̄
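A sketch of the same fit via the computational formulas, again assuming the age/price data from the example; only running sums are needed, which is why these forms suit hand computation:

```python
x = [5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7]
y = [85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48]
n = len(x)

# Computational forms: sums only, no deviations from the mean.
Sxx = sum(xi * xi for xi in x) - sum(x) ** 2 / n
Sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n

b = Sxy / Sxx
a = (sum(y) - b * sum(x)) / n
print(round(b, 2), round(a, 2))   # -20.26 195.47
```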
Coefficient of determination, r², is the square of the correlation r – it is the fraction of
the variation in the observed values of y that is explained by the least-squares regression
of y on x.
Computational formula:
r = Sxy / √(Sxx·Syy)
so
r² = Sxy² / (Sxx·Syy)
ie. r² varies from 0 to 1:
0 ≤ r² ≤ 1
r2 close to 0 implies the least-squares regression explains very little of the variation in y
r2 close to 1 implies the least-squares regression explains most of the variation in y
r2(yˆ/) (y y)2
x
y
ŷ
5
4
6
5
5
5
6
6
2
7
7
85
103
70
82
89
98
66
95
169
70
48
94.16
114.42
73.90
94.16
94.16
94.16
73.90
73.90
154.95
53.64
53.64
(yˆ
y)2 = 8285.0 ,
(y
y
y
-3.64
14.36
-18.64
-6.64
0.36
9.36
-22.64
6.36
80.36
-18.64
-40.64
ŷ
y
5.53
25.79
-14.74
5.53
5.53
5.53
-14.74
-14.74
66.31
-35.00
-35.00
y
yˆ
-9.16
-11.42
-3.90
-12.16
-5.16
3.84
-7.90
21.10
14.05
16.36
-5.64
y)2= 9708.5
r2 = 8285.0/9708.5 = 0.853 (85.3%)
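A sketch in plain Python confirming r² both ways for the example data: via Sxy²/(Sxx·Syy) and via the explained-over-total variation ratio from the table above.

```python
x = [5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7]
y = [85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48]
n = len(x)

Sxx = sum(xi * xi for xi in x) - sum(x) ** 2 / n
Sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
Syy = sum(yi * yi for yi in y) - sum(y) ** 2 / n

# Way 1: r^2 = Sxy^2 / (Sxx * Syy)
print(round(Sxy ** 2 / (Sxx * Syy), 3))            # 0.853

# Way 2: explained variation over total variation.
b = Sxy / Sxx
a = (sum(y) - b * sum(x)) / n
ybar = sum(y) / n
yhat = [a + b * xi for xi in x]
print(round(sum((yh - ybar) ** 2 for yh in yhat) /
            sum((yi - ybar) ** 2 for yi in y), 3))  # 0.853
```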
2.4
Cautions about regression and correlation
Correlation and regression, together with the scatterplot, allow us to study the
relationship between variables considered in pairs.
Residual – the difference between an observed value of the response variable and the
value predicted by the regression line
Residual = observed y − predicted y = y − ŷ
Residual plot – a scatterplot of the regression residuals against the explanatory variable
(see the sketch below).
- It helps us assess the fit of the regression line.
- If the plot is unstructured and centered about 0, there is no major problem.
- If the plot shows a curve, then a straight line is not the best fit for the data.
- If the residuals get bigger as you go from left to right, predictions are more
precise on the left than on the right.
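A minimal residual-plot sketch with matplotlib, assuming the fitted line ŷ = 195.47 − 20.26x from the Section 2.3 example; the dashed reference line makes the "centered about 0" check easy to see:

```python
import matplotlib.pyplot as plt

x = [5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7]
y = [85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48]

# Residuals from the fitted line of the Section 2.3 example.
residuals = [yi - (195.47 - 20.26 * xi) for xi, yi in zip(x, y)]

plt.scatter(x, residuals)
plt.axhline(0, linestyle="--")      # residuals should center about 0
plt.xlabel("Age (yr)")              # explanatory variable
plt.ylabel("Residual (y - yhat)")
plt.show()
```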
Lurking variable – variable that has an important effect on the relationship among
variables in a study but is not included among the variables studied.
Outlier – an observation that lies outside the overall pattern of the other observations.
Points that are outliers in the y direction of a scatterplot have large regression residuals;
other outliers need not have large residuals.
Influential observation – an observation that, if removed, would change the result of
some statistical calculation. Points that are outliers in the x direction of a scatterplot are
often influential for the least-squares regression line.
Difference between fitted values (DFFITS) - Find the predicted response ŷi for the
ith individual with this individual in the data and with it left out of the data, take the
difference, and standardize it (divide by an estimate of its standard deviation). Doing this
for all individuals gives the DFFITS.
Studentized residuals – residuals standardized using the standard deviation estimated
from the data with the individual omitted (this avoids an outlying individual inflating the
standard deviation used to judge it)
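Both diagnostics are tedious by hand; as a sketch, the third-party statsmodels library exposes them through its OLS influence object. The attribute names below are as I understand recent statsmodels versions, so treat the exact API as an assumption to check against your installed version:

```python
import numpy as np
import statsmodels.api as sm

x = np.array([5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7])
y = np.array([85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48])

model = sm.OLS(y, sm.add_constant(x)).fit()
influence = model.get_influence()

# Externally studentized residuals: each residual scaled by the
# standard deviation estimated with that individual omitted.
print(influence.resid_studentized_external)

# DFFITS: standardized change in the fitted value when the
# individual is left out (returned with a rule-of-thumb threshold).
dffits, threshold = influence.dffits
print(dffits)
```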
Beware of lurking variables
Correlation measures only linear association.
Extrapolation can be inaccurate.
Correlation and least-squares regression are not resistant measures.
Lurking variables can make correlation or regression misleading.
Association does not imply causation
An association between an explanatory variable x and a response variable y, even if it is
very strong, is not by itself good evidence that changes in x actually cause changes in y.
A correlation based on averages over many individuals is usually higher than the
correlation between the same variables based on the data for individuals.
Prediction does not require a cause-and-effect relationship. (Eg. height and weight.)