Lecture 8: Linear Regression
GENOME 560, Spring 2012
May 24, 2012
Su‐In Lee, CSE & GS ([email protected])

Goals
* Develop basic concepts of linear regression from a probabilistic framework
* Estimate parameters and test hypotheses with linear models
* Linear regression in R

Regression
* A technique used for the modeling and analysis of numerical data.
* Exploits the relationship between two or more variables so that we can gain information about one of them by knowing values of the others.
* Regression can be used for prediction, estimation, hypothesis testing, and modeling causal relationships.

Why Linear Regression?
* Suppose we want to model the outcome variable Y in terms of three predictors, X1, X2, X3:
  Y = f(X1, X2, X3)
* We typically will not have enough data to estimate f directly, so we usually have to assume that it has some restricted form, such as linear:
  Y = X1 + X2 + X3

Regression Terminology
* Y = β0 + β1 X, where β0 is the intercept term and β1 is the slope.
* Y: the dependent variable, outcome variable, or response variable (e.g., lung cancer risk; the expression level of gene X).
* X: the independent variable, predictor variable, or explanatory variable (e.g., genetic factors, smoking, diet; the expression levels of X's TFs A, B, and C).

Linear Regression is a Probabilistic Model
* Much of mathematics is devoted to studying variables that are deterministically related to one another, but we are interested in understanding the relationship between variables related in a nondeterministic fashion.
  [Figure: scatterplot of lung cancer risk (y-axis) against smoking (x-axis)]

A Linear Probabilistic Model
* Definition: there exist parameters β0, β1, and σ² such that, for any fixed value of the predictor variable X, the outcome variable Y is related to X through the model equation
  Y = β0 + β1 X + ε,
  where ε is a random variable assumed to be N(0, σ²).
* Variables and symbols: how is x different from X?
  * Capital letter X: a random variable.
  * Lowercase letter x: a corresponding value (i.e., a real number that the RV X maps into).
  * For example, X: the genotype of a certain locus; x: 0, 1, or 2 (meaning AA, AG, and GG, respectively).

Implications
* The expected value of Y is a linear function of X, but for a fixed value x, the variable Y differs from its expected value by a random amount.
* Formally, let x* denote a particular value of the predictor variable X. Then our linear probabilistic model says:
  E(Y | x*) = μY|x* = the mean value of Y when X is x*
  V(Y | x*) = σ²Y|x* = the variance of Y when X is x*

Graphical Interpretation
  [Figure: true regression line y = β0 + β1 x, with the normal distribution of Y drawn around the line at x1, x2, x3]
* Say that X = height and Y = weight. Then μY|x=60 is the average weight for all individuals 60 inches tall in the population.

One More Example
* Suppose the relationship between the predictor variable height (X) and the outcome variable weight (Y) is described by a simple linear regression model with true regression line
  Y = 7.5 + 0.5 X + ε,  ε ~ N(0, σ²),  σ = 3
* Q1: What is the interpretation of β1 = 0.5?
  It is the expected change in weight (Y) associated with a 1-unit increase in height (X).
* Q2: If x = 20, what is the expected value of Y?
  μY|x=20 = 7.5 + 0.5 (20) = 17.5
* Q3: If x = 20, what is P(Y > 22)?
  Given Y ~ N(μ = 17.5, σ = 3),
  P(Y > 22 | x = 20) = 1 − Φ((22 − 17.5) / 3) = 1 − Φ(1.5) ≈ 0.067,
  where Φ is the CDF of the standard normal distribution N(0, 1). (A short R check follows.)
  [Figure: N(17.5, σ = 3) density at x = 20, with the area above Y = 22 shaded]
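Q2 and Q3 can be checked numerically in R; this is a minimal sketch using pnorm() and only the numbers from the example above:

    # Expected value of Y at x = 20 under the true line Y = 7.5 + 0.5 X
    mu <- 7.5 + 0.5 * 20               # 17.5

    # P(Y > 22 | x = 20) with Y ~ N(17.5, sd = 3)
    1 - pnorm(22, mean = mu, sd = 3)   # same as 1 - pnorm(1.5); about 0.067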
Estimating Model Parameters
* Where do the parameters β0 and β1 come from?
* Predicted, or fitted, values are the values of y predicted by plugging x1, x2, …, xn into the estimated regression line y = β̂0 + β̂1 x:
  ŷ1 = β̂0 + β̂1 x1,  ŷ2 = β̂0 + β̂1 x2,  ŷ3 = β̂0 + β̂1 x3, …
* Residuals are the deviations of the observed values from the predicted values:
  e1 = y1 − ŷ1,  e2 = y2 − ŷ2,  e3 = y3 − ŷ3, …
  [Figure: observed points y1, y2, y3 (red dots) around the fitted line y = β̂0 + β̂1 x (red line) at x1, x2, x3]

Residuals Are Useful!
* The error sum of squares (SSE) can tell us how well the line fits the data:
  SSE = Σi (ei)² = Σi (yi − ŷi)²

Least Squares
* Find the β0 and β1 that minimize the SSE:
  f(β0, β1) = Σi [yi − (β0 + β1 xi)]²
* Denote the solutions by β̂0 and β̂1.

Coefficient of Determination
* An important statistic referred to as the coefficient of determination (R²):
  R² = 1 − SSE / SST,
  where SSE = Σi (yi − ŷi)² and SST = Σi (yi − ȳ)².
* SST is itself an error sum of squares: it is the SSE obtained when β0 = avg(y) and β1 = 0, i.e., when the data are fit by the horizontal line y = ȳ. R² therefore measures how much better the least-squares line y = β0 + β1 x fits than the flat line at the mean. (An R sketch of the fit and of R² follows.)
  [Figure: two panels comparing the error sum of squares around the fitted line y = β0 + β1 x and around the line y = average y]
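A minimal R sketch of the least-squares fit and of R² computed from its definition; the simulated x and y here are stand-ins for real data, and lm(), coef(), and residuals() are standard R functions:

    # Simulate data from the height/weight line used earlier: Y = 7.5 + 0.5 X + eps
    set.seed(1)
    x <- runif(50, 40, 80)
    y <- 7.5 + 0.5 * x + rnorm(50, sd = 3)

    fit <- lm(y ~ x)     # least-squares estimates of beta0 and beta1
    coef(fit)

    # R^2 from its definition, 1 - SSE/SST
    sse <- sum(residuals(fit)^2)    # error sum of squares
    sst <- sum((y - mean(y))^2)     # total sum of squares
    1 - sse / sst                   # agrees with summary(fit)$r.squared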
Multiple Linear Regression
* An extension of the simple linear regression model to two or more independent variables:
  y = β0 + β1 x1 + β2 x2 + … + βn xn + ε
  For example: Expression = Baseline + Age + Tissue + Sex + Error.
* Partial regression coefficients: βi ≡ the effect on the outcome variable of increasing the ith predictor variable by 1 unit, holding all other predictors constant.

Categorical Independent Variables
* Qualitative variables are easily incorporated into the regression framework through dummy variables.
* Simple example: sex can be coded as 0/1.
* What if my categorical variable contains three levels, coded as
  Xi = 0 if AA, 1 if AG, 2 if GG?
  NO! This coding would result in collinearity. The solution is to set up a series of dummy variables:
  X1 = 1 if AA, 0 otherwise
  X2 = 1 if AG, 0 otherwise

  Genotype  X1  X2
  AA         1   0
  AG         0   1
  GG         0   0

* In general, for k levels you need (k − 1) dummy variables.

Hypothesis Testing: Model Utility Test
* The first thing we want to know after fitting a model is whether any of the predictor variables (X's) are significantly related to the outcome variable (Y):
  H0: β1 = β2 = … = βk = 0
  HA: at least one βi ≠ 0
* Let's frame this in our ANOVA framework. In ANOVA, we partitioned the total variation (SST) into two components: SSE (unexplained variation) and SSR (variation explained by the linear model).

ANOVA Formulation of Model Utility Test
* Partition the total variation (SST) into SSE and SSR. Consider n (= 3) data points and a k (= 1) predictor model:
  SST = Σi (yi − ȳ)²
  SSE = Σi (ei)² = Σi (yi − ŷi)²
  SSR = Σi (ŷi − ȳ)²
* Degrees of freedom for SSE: n − (k + 1) = (# data points) − (# parameters in the model).
* F-test statistic:
  F = MSR / MSE = (SSR / k) / (SSE / [n − (k + 1)]) = (R² / k) / [(1 − R²) / (n − (k + 1))]
* Rejection region: F ≥ F(α, k, n − (k + 1)). Pick the F distribution based on k and n − (k + 1), then choose the critical value based on α. Say that α = 0.05; then P(F > F(α, k, n − (k + 1))) = 0.05.
  [Figure: F density with the upper-α rejection region shaded]

F Test For Subsets of Independent Variables
* A powerful tool in multiple regression analysis is the ability to compare two models. For instance, say we want to compare
  Full model:    y = β0 + β1 x1 + β2 x2 + β3 x3 + β4 x4
  Reduced model: y = β0 + β1 x1 + β2 x2
* Again, this is another example of ANOVA. Let SSE_R be the error sum of squares for the reduced model with l predictors, and SSE_F the error sum of squares for the full model with k predictors. Then
  F = [(SSE_R − SSE_F) / (k − l)] / [SSE_F / (n − (k + 1))]

Example of Model Comparison
* We have a quantitative trait and want to test the effects at two markers, M1 and M2:
  Full model:    Trait = Mean + M1 + M2 + (M1 × M2)
  Reduced model: Trait = Mean + M1 + M2
* With n = 100 observations, k = 3, and l = 2:
  F = [(SSE_R − SSE_F) / (3 − 2)] / [SSE_F / (100 − (3 + 1))] = (SSE_R − SSE_F) / (SSE_F / 96)
  Rejection region: F(α, 1, 96). (A hypothetical lm()/anova() sketch of this comparison is given at the end of this section.)

How To Do In R
* You can fit a least-squares regression using the function lsfit:
  mm <- lsfit(x, y)
* The coefficients of the fit are then given by mm$coef, and the residuals are mm$residuals.
* To print out the tests for zero slope, just do:
  ls.print(mm)

Input Data
* http://www.cs.washington.edu/homes/suinlee/genome560/data/cats.txt
* Data on fluctuating proportions of marked cells in marrow from heterozygous Safari cats: proportions of cells of one cell type in samples from cats (taken in our department many years ago).
* Column 1 is the ID number of the particular cat. You will want to plot the data from one cat. For example, cat 40004 is rows 1:17, 40005a is 18:31, 40005b is 32:47, 40006 is 48:65, 40665 is 66:83, and so on.
* Column 2: the time, in weeks from the start of monitoring, at which the measurement from marrow is recorded.
* Column 3: the percent of domestic-type progenitor cells observed in a sample of cells at that time.
* Column 4: the sample size at that time, i.e., the number of progenitor cells analyzed.
  [Figure: example time courses for two cats (Cat #1, Cat #2)]
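Putting the last two slides together, here is a sketch of loading the cats data and fitting one cat with lsfit(). The column labels ("id", "weeks", "percent", "n") are my own names based on the description above, and the file is assumed to have no header row:

    # Load the cats data (4 unlabeled columns, described above)
    cats <- read.table("http://www.cs.washington.edu/homes/suinlee/genome560/data/cats.txt")
    names(cats) <- c("id", "weeks", "percent", "n")   # assumed labels

    # Plot one cat: rows 1:17 are cat 40004
    cat1 <- cats[1:17, ]
    plot(cat1$weeks, cat1$percent,
         xlab = "Weeks from start of monitoring",
         ylab = "% domestic-type progenitor cells")

    # Least-squares fit and tests for zero slope, as described above
    mm <- lsfit(cat1$weeks, cat1$percent)
    abline(coef = mm$coef)   # draw the fitted line
    ls.print(mm)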
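Finally, a hypothetical sketch of the M1/M2 marker comparison from "Example of Model Comparison," using lm() and anova() rather than lsfit(). A data frame d with numeric columns trait, M1, and M2 (n = 100 rows) is assumed:

    # Full model: Mean + M1 + M2 + (M1 x M2); the reduced model drops the interaction
    full    <- lm(trait ~ M1 + M2 + M1:M2, data = d)
    reduced <- lm(trait ~ M1 + M2, data = d)

    # Partial F test: F = [(SSE_R - SSE_F)/(k - l)] / [SSE_F/(n - (k + 1))],
    # here on 1 and 96 degrees of freedom with numeric predictors and n = 100
    anova(reduced, full)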