Lecture 8: Linear Regression
May 24, 2012
GENOME 560, Spring 2012
Su‐In Lee, CSE & GS
[email protected]

Goals
• Develop basic concepts of linear regression from a probabilistic framework
• Estimating parameters and hypothesis testing with linear models
• Linear regression in R
Regression
• A technique used for the modeling and analysis of numerical data
• Exploits the relationship between two or more variables so that we can gain information about one of them through knowing values of the others
• Regression can be used for prediction, estimation, hypothesis testing, and modeling causal relationships

Why Linear Regression?
• Suppose we want to model the outcome variable Y in terms of three predictors, X1, X2, X3:
  Y = f(X1, X2, X3)
• Typically we will not have enough data to estimate f directly
• Therefore, we usually have to assume that f has some restricted form, such as linear:
  Y = X1 + X2 + X3
Regression Terminology
• Y = X1 + X2 + X3
• Y: dependent variable, outcome variable, response variable
  Examples: lung cancer risk; expression level of gene X
• X1, X2, X3: independent variables, predictor variables, explanatory variables
  Examples: genetic factor, smoking, diet, etc.; expression levels of X's TFs A, B and C

Linear Regression is a Probabilistic Model
• Much of mathematics is devoted to studying variables that are deterministically related to one another:
  Y = β0 + β1 X
  where β0 is the intercept term and β1 is the slope, e.g. Y = "lung cancer risk" and X = "smoking"
• But we're interested in understanding the relationship between variables related in a nondeterministic fashion
A Linear Probabilistic Model
• Definition: There exist parameters β0, β1 and σ² such that, for any fixed value of the predictor variable X, the outcome variable Y is related to X through the model equation
  Y = β0 + β1 X + ε,
  where ε is a random variable assumed to be N(0, σ²)
  (Figure: lung cancer risk versus smoking, with points scattered around the regression line)
• Variables & symbols: how is x different from X?
  Capital letter X: a random variable
  Lower-case letter x: corresponding values (i.e. the real numbers the RV X maps into)
  For example, X: genotype of a certain locus; x: 0, 1 or 2 (meaning AA, AG and GG, respectively)
Implications
• The expected value of Y is a linear function of X, but for a fixed value x, the variable Y differs from its expected value by a random amount
• Formally, let x* denote a particular value of the predictor variable X; then our linear probabilistic model says:
  E(Y | x*) = μY|x* = mean value of Y when X is x*
  V(Y | x*) = σ²Y|x* = variance of Y when X is x*
Graphical Interpretation
• Say that X = height and Y = weight
• Then μY|x=60 is the average weight for all individuals 60 inches tall in the population
  (Figure: weight (Y) versus height (X), showing the regression line and the distribution of Y around it at a fixed x)

One More Example
• Suppose the relationship between the predictor variable height (X) and the outcome variable weight (Y) is described by a simple linear regression model with true regression line
  Y = 7.5 + 0.5 X + ε, where ε ~ N(0, σ²) and σ = 3
• Q1: What is the interpretation of β1 = 0.5?
  It is the expected change in weight (Y) associated with a 1-unit increase in height (X)
• Q2: If x = 20, what is the expected value of Y?
  μY|x=20 = 7.5 + 0.5 (20) = 17.5
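Not part of the original slides, but a minimal R sketch that simulates data from this true regression line makes the "random scatter around a linear mean" idea concrete (the sample size and seed are arbitrary choices of mine):

    set.seed(1)                          # for reproducibility
    n <- 100                             # arbitrary sample size
    x <- runif(n, min = 10, max = 70)    # predictor: height
    y <- 7.5 + 0.5 * x + rnorm(n, mean = 0, sd = 3)   # outcome: weight
    plot(x, y)                           # points scatter randomly around...
    abline(a = 7.5, b = 0.5)             # ...the true line E(Y|x) = 7.5 + 0.5x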
One More Example (cont'd)
• Q3: If x = 20, what is P(Y > 22)?
• Given Y ~ N(μ = 17.5, σ = 3) when x = 20:
  P(Y > 22 | x = 20) = 1 − Φ((22 − 17.5) / 3) = 1 − Φ(1.5) ≈ 0.067
  where Φ is the CDF of the standard normal distribution N(0, 1)

Estimating Model Parameters
• Where do the parameters β0 and β1 come from?
• Predicted, or fitted, values are the values of y obtained by plugging x1, x2, …, xn into the estimated regression line y = β̂0 + β̂1 x:
  ŷ1 = β̂0 + β̂1 x1
  ŷ2 = β̂0 + β̂1 x2
  ŷ3 = β̂0 + β̂1 x3
• Residuals are the deviations between the observed values (red dots) and the predicted values (red line):
  e1 = y1 − ŷ1
  e2 = y2 − ŷ2
  e3 = y3 − ŷ3
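As an aside, the Q3 probability above is easy to verify in R (this check is mine, not from the slides):

    1 - pnorm(22, mean = 17.5, sd = 3)   # 0.0668, matching 1 - Φ(1.5)
    1 - pnorm((22 - 17.5) / 3)           # same value via the standard normal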
Residuals Are Useful!
• The error sum of squares (SSE) tells us how well the line fits the data:
  SSE = Σ (ei)² = Σ (yi − ŷi)², summing over i = 1, …, n
  (Figure: three data points (x1, y1), (x2, y2), (x3, y3), the fitted line y = β̂0 + β̂1 x, and the residuals as vertical deviations)

Least Squares
• Least squares: find the β0 and β1 that minimize the SSE
  f(β0, β1) = Σ [yi − (β0 + β1 xi)]², summing over i = 1, …, n
• Denote the solutions by β̂0 and β̂1
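The slides do not show the solutions explicitly, but minimizing f(β0, β1) gives the standard closed-form estimates β̂1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)² and β̂0 = ȳ − β̂1 x̄. A minimal R check, reusing the x and y simulated earlier:

    b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
    b0 <- mean(y) - b1 * mean(x)
    c(b0, b1)          # close to the true values 7.5 and 0.5
    lsfit(x, y)$coef   # the same estimates from R's least-squares fitter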
Coefficient of Determination
• An important statistic referred to as the coefficient of determination (R²):
  R² = 1 − SSE / SST
  where
  SSE = Σ (ei)² = Σ (yi − ŷi)², summing over i = 1, …, n
  SST = Σ (yi − ȳ)², summing over i = 1, …, n
• SSE is the error sum of squares around the least-squares line y = β̂0 + β̂1 x, while SST is the error sum of squares when the line is forced to be flat, i.e. when β0 = avg(y) and β1 = 0
  (Figures: error sum of squares around the fitted line y = β0 + β1 x, and around the horizontal line y = average y)
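Continuing the simulated example, a short R check that computes R² from the definition (assuming the x, y, b0 and b1 from the sketches above):

    yhat <- b0 + b1 * x              # fitted values
    SSE <- sum((y - yhat)^2)         # error sum of squares
    SST <- sum((y - mean(y))^2)      # total sum of squares
    1 - SSE / SST                    # coefficient of determination R^2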
Multiple Linear Regression
• Extension of the simple linear regression model to two or more independent variables:
  y = β0 + β1 x1 + β2 x2 + … + βn xn + ε
  For example: Expression = Baseline + Age + Tissue + Sex + Error
• Partial regression coefficients: βi ≡ the effect on the outcome variable of increasing the ith predictor variable by 1 unit, holding all other predictors constant
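As a sketch of how such a model is fit in R (the data, variable names, and coefficients below are invented for illustration; they are not from the lecture):

    set.seed(2)                                                # arbitrary seed
    dat <- data.frame(age = rnorm(50, mean = 40, sd = 10),     # invented predictor
                      sex = rbinom(50, size = 1, prob = 0.5))  # sex coded 0/1
    dat$expression <- 1 + 0.02 * dat$age + 0.5 * dat$sex + rnorm(50)  # invented truth
    fit <- lm(expression ~ age + sex, data = dat)
    summary(fit)   # each slope is a partial regression coefficient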
Categorical Independent Variables
• Qualitative variables are easily incorporated into the regression framework through dummy variables
• Simple example: sex can be coded as 0/1
• What if my categorical variable contains three levels? Can we code it as
  Xi = 0 if AA, 1 if AG, 2 if GG?
  NO! The previous coding would result in collinearity
• The solution is to set up a series of dummy variables. In general, for k levels you need (k − 1) dummy variables:
  X1 = 1 if AA, 0 otherwise
  X2 = 1 if AG, 0 otherwise

        X1   X2
  AA     1    0
  AG     0    1
  GG     0    0
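In R, this dummy coding is handled automatically when a categorical variable is stored as a factor; a small sketch, with the genotype values invented for illustration:

    geno <- factor(c("AA", "AG", "GG", "AG", "AA"))   # hypothetical genotypes
    model.matrix(~ geno)   # intercept plus k - 1 = 2 dummy columns: genoAG, genoGG

Note that R's default treatment coding takes the first level (AA) as the baseline, whereas the table above uses GG; the two parameterizations give equivalent fits.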
Hypothesis Testing: Model Utility Test
• The first thing we want to know after fitting a model is whether any of the predictor variables (X's) are significantly related to the outcome variable (Y):
  H0: β1 = β2 = … = βk = 0
  HA: at least one βi ≠ 0
• Let's frame this in our ANOVA framework
• In ANOVA, we partitioned the total variance (SST) into two components:
  SSE (unexplained variation)
  SSR (variation explained by the linear model)

ANOVA Formulation of Model Utility Test
• Partition the total variance (SST) into SSE and SSR
• Let's consider n (= 3) data points and a k (= 1) predictor model:
  SST = Σ (yi − ȳ)², summing over i = 1, …, n
  SSE = Σ (ei)² = Σ (yi − ŷi)², summing over i = 1, …, n
  SSR = Σ (ŷi − ȳ)², summing over i = 1, …, n
• Degrees of freedom for SSE: n − (k + 1), i.e. (# data points) − (# parameters in the model)
  (Figures: the fitted line y = β0 + β1 x and the horizontal line y = average y)
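Using the quantities computed in the earlier sketches, the decomposition SST = SSE + SSR can be verified numerically:

    SSR <- sum((yhat - mean(y))^2)   # variation explained by the fitted line
    SSE + SSR                        # equals SST, up to floating-point rounding
    SST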
ANOVA Formulation of Model Utility Test (cont'd)
• F-test statistic:
  F = MSR / MSE = (SSR / k) / (SSE / [n − (k + 1)]) = (R² / k) / [(1 − R²) / (n − (k + 1))]
• Rejection region: F ≥ Fα,k,n−(k+1)
  Pick the F distribution based on k and n − (k + 1)
  Choose the critical value based on α: say that α = 0.05; then Prob(F > Fα,k,n−(k+1)) = 0.05

F Test For Subsets of Independent Variables
• A powerful tool in multiple regression analysis is the ability to compare two models; for instance, say we want to compare
  Full model: y = β0 + β1 x1 + β2 x2 + β3 x3 + β4 x4 + ε
  Reduced model: y = β0 + β1 x1 + β2 x2 + ε
• Again, another example of ANOVA:
  F = [(SSER − SSEF) / (k − l)] / (SSEF / [n − (k + 1)])
  where SSER is the error sum of squares for the reduced model with l predictors, and SSEF is the error sum of squares for the full model with k predictors
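In R, this partial F test is exactly what anova() reports for two nested lm() fits; a minimal sketch with simulated data (all variable names and coefficients invented):

    set.seed(3)                                   # arbitrary seed
    n <- 100
    x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n); x4 <- rnorm(n)
    y <- 1 + 0.5 * x1 - 0.3 * x2 + rnorm(n)       # x3, x4 have no real effect
    full    <- lm(y ~ x1 + x2 + x3 + x4)          # k = 4 predictors
    reduced <- lm(y ~ x1 + x2)                    # l = 2 predictors
    anova(reduced, full)   # F on (k - l, n - (k + 1)) = (2, 95) degrees of freedom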
Example of Model Comparison
• We have a quantitative trait and want to test the effects at two markers, M1 and M2:
  Full model: Trait = Mean + M1 + M2 + (M1 × M2) + ε
  Reduced model: Trait = Mean + M1 + M2 + ε
• With k = 3 predictors in the full model, l = 2 in the reduced model, and n = 100 observations:
  F = [(SSER − SSEF) / (3 − 2)] / (SSEF / [100 − (3 + 1)]) = (SSER − SSEF) / (SSEF / 96)
• Rejection region: Fα,1,96

How To Do In R
• You can fit a least-squares regression using the function
  mm <- lsfit(x, y)
• The coefficients of the fit are then given by
  mm$coef
• The residuals are
  mm$residuals
• And to print out the tests for zero slope, just do
  ls.print(mm)
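Equivalently, using the more common lm() interface (the same model, with different accessors):

    mm2 <- lm(y ~ x)       # least-squares fit of y on x
    coef(mm2)              # intercept and slope estimates
    residuals(mm2)         # residuals
    summary(mm2)           # t-tests for zero slope, R^2, and the model utility F test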
Input Data
• http://www.cs.washington.edu/homes/suinlee/genome560/data/cats.txt
• Data on fluctuating proportions of marked cells in marrow from heterozygous Safari cats: proportions of cells of one cell type in samples from cats (taken in our department many years ago)
• 1st column: ID number of the particular cat. You will want to plot the data from one cat; for example, cat 40004 is rows 1:17, 40005a is 18:31, 40005b is 32:47, 40006 is 48:65, 40665 is 66:83, and so on
• 2nd column: time, in weeks from the start of monitoring, at which the measurement from marrow was recorded
• 3rd column: percent of domestic-type progenitor cells observed in a sample of cells at that time
• 4th column: sample size at that time, i.e. the number of progenitor cells analyzed
  (Figure: example time-series plots for two cats, labeled Cat #1 and Cat #2)
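A minimal R sketch for loading the file and plotting one cat's time series (the column labels are my own, following the descriptions above, and I assume the file has no header row):

    cats <- read.table("http://www.cs.washington.edu/homes/suinlee/genome560/data/cats.txt")
    colnames(cats) <- c("id", "weeks", "percent", "n")   # labels per the column descriptions
    cat1 <- cats[1:17, ]                                 # rows 1:17 are cat 40004
    plot(cat1$weeks, cat1$percent, type = "b",
         xlab = "Time (weeks)", ylab = "% domestic-type progenitor cells")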