Steps in a Statistical Analysis
Step 1: Collect and clean data (spreadsheet from heaven)
Step 2: Calculate descriptive statistics
Step 3: Explore graphics
Step 4: Choose outcome(s) and potential predictive variables (covariates)
Step 5: Pick an appropriate statistical procedure & execute
Step 6: Evaluate fitted model, make adjustments as needed

Four Considerations
1) Purpose of the investigation - descriptive orientation
2) The mathematical characteristics of the variables - level of measurement (nominal, ordinal, continuous) and distribution
3) The statistical assumptions made about these variables - distribution, independence, etc.
4) How the data are collected - random sample, cohort, case-control, etc.

Purpose of analysis: to relate two variables, where we designate one as the outcome of interest (dependent variable, or DV) and one or more as the predictor variables (independent variables, or IVs). In general, k represents the number of IVs; here k = 1. Given a sample of n individuals, we observe pairs of values (Xi, Yi) for each individual i.
Type of variables: continuous (interval or ratio).

Goals of Regression Analysis
- Characterize the relationship by determining the extent, direction, and strength of association between the IVs and the DV
- Predict the DV as a function of the IVs
- Describe the relationship between the IVs and the DV, controlling for other variables (confounders)
- Determine which IVs are important for predicting the DV and which are not
- Determine the best mathematical model for describing the relationship between the IVs and the DV
- Assess the interactive effects (effect modification) of 2 or more IVs with regard to the DV
- Obtain a valid and precise estimate of 1 or more regression coefficients from a larger set of regression coefficients in a given model

NOTE: Finding a statistically significant association between IVs and a DV does not imply that those IVs caused the DV to occur.

Criteria for Causality
- Strength of association - does the association appear strong across a number of different studies?
- Dose-response effect - the DV changes in a meaningful manner with changes in the IV
- Lack of temporal ambiguity - the cause precedes the effect
- Consistency of findings - most studies show similar results
- Biological and theoretical plausibility - the causal relationship is consistent with current biological and theoretical knowledge
- Coherence of evidence - the findings do not seriously conflict with accepted facts about the DV being studied
- Specificity of association - the study factor is associated with only one effect

Simple Linear Regression Model
Yi = β0 + β1*Xi + εi
where:
- Yi is the value of the response (outcome, dependent) variable for the ith unit (e.g., SBP)
- β0 and β1 are parameters representing the intercept and slope, respectively
- Xi is the value of the predictor (independent) variable (e.g., age) for the ith unit; X is considered fixed, not random
- εi is a random error term with mean 0 and variance σ2; εi and εj are uncorrelated for all i ≠ j, i = 1, ..., n

The model is "simple" because there is only one independent variable. The model is "linear in the parameters" because β0 and β1 do not appear as exponents and are not multiplied or divided by another parameter. The model is also "linear in the independent variable" because Xi appears only to the first power.

The observed value of Y for the ith unit is the sum of 2 components: (1) the constant term β0 + β1*Xi and (2) the random error term εi. Hence, Yi is a random variable. Since εi has mean 0, Yi must have mean β0 + β1*Xi:
E(Yi|Xi) = E(β0 + β1*Xi + εi) = β0 + β1*Xi + E(εi) = β0 + β1*Xi
where E = "expected value" = mean.
The fitted (or estimated) regression line, Ŷ = β̂0 + β̂1*X, estimates the expected value of Y at the given value of X, i.e., E(Y|X).

Residuals
Define the residuals: ε̂i = Yi − Ŷi

Interpreting the Coefficients
- β0: expected value of Y when X = 0
- β1: expected change in Y per unit change in X

Assumptions
- Linear relationship between Y and X (i.e., only linear β's allowed)
- Independent observations
- Normally distributed residuals, in particular εi ~ N(0, σ2)
- Equal variances across values of X (homogeneity of variance)

Normality Assumption
Yi ~ N(β0 + β1*xi, σ2), independently

Homoscedasticity - the variance of Y is the same for any X.
[Figure: scatter plot of Y vs. X illustrating constant spread about the regression line]

Departures from Normality Assumption
- If the normality assumption is not "badly" violated, the model is generally robust to violations of normality
- If the normality assumption is badly violated, try a transformation of Y (e.g., the natural log)
- If you transform the data, you must consider whether Y is normally distributed as well as whether the variance homogeneity assumption holds - the two often go together

The "Correct" Model Is Fitted
- All IVs included are truly related to the DV
- No (conceivable) IVs related to the DV have been left out
Violation of either of these assumptions can lead to "model misspecification bias".

Null Hypothesis: The simple linear regression model does not fit the data better than the baseline model (β1 = 0).
Alternative Hypothesis: The simple linear regression model fits the data better than the baseline model (β1 ≠ 0).

Fitting Data to a Linear Model
Yi = β0 + β1*Xi + εi
Linear regression determines the values of β0 and β1 that minimize Σi ε̂i² = Σi (Yi − Ŷi)².

The Least-Squares Solution
For each pair of observations (Xi, Yi), the method of least squares considers the deviation of Yi from its expected value:
Σi=1..n (Yi − Ŷi)² = Σi=1..n (Yi − β̂0 − β̂1*Xi)²
The least-squares method finds the β̂0 and β̂1 that minimize this sum of squares. The least-squares regression line of Y on X is the line that makes the sum of the squared vertical distances of the data points from the line the smallest.
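As a sketch of the model above, the data-generating process Yi = β0 + β1*Xi + εi can be simulated directly to see that E(Y|X) = β0 + β1*X. The numeric values of β0, β1, and σ below are illustrative assumptions, not estimates from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative "true" parameters (assumed values, not from the lecture)
beta0, beta1, sigma = 54.0, 1.7, 9.0   # e.g., SBP intercept, slope per year of age

n = 10_000
x = rng.uniform(30, 70, size=n)        # X is treated as fixed once observed
eps = rng.normal(0.0, sigma, size=n)   # random error: mean 0, variance sigma^2
y = beta0 + beta1 * x + eps            # Y_i = beta0 + beta1*X_i + eps_i

# Since E(eps) = 0, the conditional mean of Y near X = 50 should be
# close to beta0 + beta1*50 = 54 + 1.7*50 = 139
near_50 = np.abs(x - 50) < 1
print(y[near_50].mean())
```

The averaged simulated Y values in a narrow band of X land near the regression line, while individual Yi scatter around it with standard deviation σ.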
The Least-Squares Method
β̂1 = Σi=1..n (Xi − X̄)(Yi − Ȳ) / Σi=1..n (Xi − X̄)²
β̂0 = Ȳ − β̂1*X̄

The method of least squares is purely mathematical. However, statistically the least-squares estimators are very appealing because they are the Best Linear Unbiased Estimators (BLUE). This means that, among all of the equations we could have picked to estimate β0 and β1, the least-squares equations give us estimates that:
1. Have expectation β0 and β1 (unbiased)
2. Have minimum variance among all possible linear estimators of β0 and β1 (most efficient)

SSE = Σi=1..n (Yi − Ŷi)², where Ŷi = β̂0 + β̂1*Xi
SSE is the sum of squares due to error (i.e., the sum of the squared residuals) - the quantity we wish to minimize.

Decomposing the variability at each observation:
- Total variability: Yi − Ȳ
- Unexplained variability: Yi − Ŷi
- Explained variability: Ŷi − Ȳ

If SSE = 0, the model is a perfect fit. SSE is affected by:
1. Large σ2 (a lot of variability)
2. Nonlinearity
Need to look at both (1) and (2). For now assume linearity, and estimate σ2 as:
S²Y|X = Σi=1..n (Yi − Ŷi)² / (n − 2) = SSE / (n − 2)
We use n − 2 because we estimate 2 parameters, β0 and β1. SSE/(n − 2) is also known as the "mean squared error" or MSE.

Simple Linear Regression
How do I build my model? Using the tools of statistics:
1. First I use estimation, in particular least squares, to estimate β̂0, β̂1, σ̂2, Ŷ
2. Then I use my distributional assumptions to make inference about the estimates
3. Hypothesis testing, e.g., is the slope 0?
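The closed-form formulas above can be checked numerically. A minimal sketch (the x and y values below are made up for illustration, not the lecture's SBP data):

```python
import numpy as np

# Small illustrative data set (values are assumptions for demonstration)
x = np.array([35.0, 41.0, 46.0, 52.0, 58.0, 63.0, 70.0])        # e.g., age
y = np.array([114.0, 124.0, 121.0, 130.0, 135.0, 142.0, 149.0])  # e.g., SBP

n = len(x)
xbar, ybar = x.mean(), y.mean()

# Closed-form least-squares estimates
b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
b0 = ybar - b1 * xbar

yhat = b0 + b1 * x
sse = np.sum((y - yhat) ** 2)   # sum of squared residuals
mse = sse / (n - 2)             # estimate of sigma^2, with n - 2 df

# np.polyfit solves the same least-squares problem
b1_np, b0_np = np.polyfit(x, y, 1)
print(b0, b1, mse)
```

The hand-computed β̂0 and β̂1 agree with np.polyfit, and the residuals sum to zero, as they must for any least-squares line with an intercept.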
Interpretation: interpret in light of the assumptions.

Hypothesis Testing for Regression Parameters
To test the hypothesis H0: β1 = β1(0), where β1(0) is some hypothesized value for β1, the test statistic is
T = (β̂1 − β1(0)) / S(β̂1), where S(β̂1) = SY|X / (Sx * sqrt(n − 1))
This test statistic has a t distribution with n − 2 degrees of freedom. The CI is given by
β̂1 ± t(n−2, 1−α/2) * S(β̂1)

Timeout: The t-Distribution
The t distribution (or Student's t distribution) arises when we use an estimated variance to construct the test statistic:
T = (Ȳ − μ) / (S / sqrt(n)), where S² = Σi (Yi − Ȳ)² / (n − 1) is the sample variance
As n → ∞, T → Z ~ N(0, 1). We pay a penalty for estimating σ2: the t distribution can be thought of as a thick-tailed normal.

Inference Concerning the Intercept
To test the hypothesis H0: β0 = β0(0), we use the statistic
T = (β̂0 − β0(0)) / S(β̂0), where S(β̂0) = SY|X * sqrt(1/n + X̄² / ((n − 1) * Sx²))
which also has the t distribution with n − 2 degrees of freedom when H0: β0 = β0(0) is true. The CI is given by
β̂0 ± t(n−2, 1−α/2) * S(β̂0)

Null Hypothesis: The simple linear regression model does not fit the data better than the baseline model (β1 = 0).
Alternative Hypothesis: The simple linear regression model does fit the data better than the baseline model (β1 ≠ 0).

Interpretations of Tests for Slope
Failure to reject H0: β1 = 0 could mean:
- Ȳ is essentially as good as Ŷ = Ȳ + β̂1(X − X̄) for predicting Y
[Figure: scatter plot with no apparent trend in Y vs. X]
- The true relationship between Y and X is not linear (i.e., it could be quadratic or some other higher power)
[Figure: scatter plot with a clearly curved pattern in Y vs. X]
Dude, that's why you always plot Y vs. X!
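A sketch of the slope test and confidence interval above, again on made-up data (scipy's linregress is used only as a cross-check, not as the lecture's software):

```python
import numpy as np
from scipy import stats

# Illustrative data (assumed values, not the lecture's SBP data)
x = np.array([35.0, 41.0, 46.0, 52.0, 58.0, 63.0, 70.0])
y = np.array([114.0, 124.0, 121.0, 130.0, 135.0, 142.0, 149.0])
n = len(x)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

s_yx = np.sqrt(np.sum(resid ** 2) / (n - 2))   # S_{Y|X} = sqrt(MSE)
sx = x.std(ddof=1)
se_b1 = s_yx / (sx * np.sqrt(n - 1))           # S(beta1-hat)

t_stat = (b1 - 0.0) / se_b1                    # H0: beta1 = 0
p_val = 2 * stats.t.sf(abs(t_stat), df=n - 2)  # two-sided p-value, t with n-2 df
t_crit = stats.t.ppf(0.975, df=n - 2)
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)
print(t_stat, p_val, ci)

# Cross-check against scipy's built-in simple regression
res = stats.linregress(x, y)
```

The hand-computed standard error and p-value match stats.linregress exactly, since both use the same S(β̂1) formula.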
Interpretations of Tests for Slope (continued)
Failure to reject H0: β1 = 0 could also mean:
- We do not have enough power to detect a significant slope
Not rejecting H0: β1 = 0 implies that a straight-line model in X is not the best model to use, and that X does not provide much help for predicting Y (ignoring power).

The Intercept
We often leave the intercept, β0, in the model regardless of whether the hypothesis H0: β0 = 0 is rejected or not. This is because if we say the intercept is zero, we force the regression line through the origin (0, 0), and rarely is this true.

Regression of SBP on Age: Analysis of Variance
Source             DF   Sum of Squares   Mean Square
Model               1       4008.12372    4008.12372
Error              28       2319.37628      82.83487
Corrected Total    29       6327.50000

Root MSE  9.10137   (= ŜY|X = sqrt(SSE/(n − 2)))

Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1             54.21462         13.08530      4.14      .0003
age          1              1.70995          0.24582      6.96     <.0001
(H0: β1 = 0 is tested by the t value for age.)

Multiple Linear Regression
Response variable: Y; explanatory variables: X1, ..., Xk
Model (extension of simple regression): E(Y) = β0 + β1*X1 + ... + βk*Xk, V(Y) = σ2
Partial regression coefficients (βi): the effect of increasing Xi by 1 unit, holding all other predictors constant. Computer packages fit these models; hand calculations are very tedious.
Model parameters: β0, β1, ..., βk, σ. Estimators: β̂0, β̂1, ..., β̂k, σ̂.
Least-squares prediction equation: Ŷ = β̂0 + β̂1*X1 + ... + β̂k*Xk
Residuals: ε̂i = Yi − Ŷi
Error sum of squares: SSE = Σi (Yi − Ŷi)²
Estimated conditional standard deviation: σ̂ = sqrt(SSE / (n − k − 1))

When there are 2 independent variables (X1 and X2), we can view the regression as fitting the best plane to the 3-dimensional set of points (as compared to the best line in simple linear regression). When there are more than 2 IVs, plotting becomes much more difficult.

Analysis of Variance:
- Regression sum of squares: SSR = Σ (Ŷ − Ȳ)², dfR = k
- Error sum of squares: SSE = Σ (Y − Ŷ)², dfE = n − k − 1
- Total sum of squares: TSS = Σ (Y − Ȳ)², dfT = n − 1
Coefficient of (multiple) determination: R² = SSR/TSS (the % of variation explained by the model)

Least-Squares Estimates (what the fitted model reports)
- Regression coefficients
- Estimated standard errors
- t-statistics
- P-values (significance levels for 2-sided tests)

Example Data (first 10 of N = 706 participants)
Participant ID  Gender  Reader   Max Diameter,        Time to Max     Pre-cuff       Post-cuff      Age
                                 Dilation Phase (mm)  Diameter (sec)  Baseline (mm)  Baseline (mm)  (yrs)
3000028         M       Crotts   6.835                 84             6.559          6.573          84.2
3000052         F       Manli    2.905                 89             2.809          2.829          75.3
3000079         M       Manli    3.677                 52             3.583          3.576          80.1
3000087         M       Manli    4.974                 57             4.957          4.909          78.3
3000257         F       Crotts   4.748                 62             4.492          4.291          78.0
3000346         M       Drum     5.973                114             5.929          5.917          78.5
3000419         F       Drum     3.429                 94             3.288          3.312          76.6
3000524         M       Drum     4.971                 34             4.897          4.887          75.4
3000559         F       Crotts   4.162                 46             3.825          3.751          76.5
3000591         M       Crotts   4.677                115             4.477          4.493          80.7

[Figure: histogram, boxplot, and normal probability plot of Max Diameter, Dilation Phase (mm); roughly normal, range about 2.25-7.75 mm]
[Figure: histogram, boxplot, and normal probability plot of Pre-cuff Baseline (mm); roughly normal, range about 2.25-7.75 mm]
[Figure: stem-and-leaf plot, boxplot, and normal probability plot of Time to Max Diameter (sec); range about 4-119 sec]
[Figure: histogram, boxplot, and normal probability plot of Age (years); roughly normal, range about 63-93 years]

[Figure: scatter plot of Max. Diameter, Dilation Phase (mm) vs. Pre-cuff Baseline (mm)]

Regression of Max Diameter (mm) on Pre-cuff Baseline (mm)
Analysis of Variance
Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               1        549.03550     549.03550   70386.5   <.0001
Error             700          5.46021       0.00780
Corrected Total   701        554.49571

Root MSE        0.08832   R-Square   0.9902
Dependent Mean  4.56780   Adj R-Sq   0.9901
Coeff Var       1.93352

Parameter Estimates
Variable   Label                    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept  Intercept                 1              0.17015          0.01691     10.06     <.0001
PREBL      Pre-cuff Baseline (mm)    1              0.99302          0.00374    265.30     <.0001

[Figure: Max. Diameter, Dilation Phase (mm) vs. Pre-cuff Baseline (mm) with fitted regression line and 95% CI]

[Figure: scatter plot of Max. Diameter, Dilation Phase (mm) vs. Age (yrs)]

Regression of Max Diameter (mm) on Age (yrs)
Analysis of Variance
Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               1          0.04862       0.04862      0.06   0.8044
Error             700        554.44709       0.79207
Corrected Total   701        554.49571

Root MSE        0.88998   R-Square    0.0001
Dependent Mean  4.56780   Adj R-Sq   -0.0013
Coeff Var      19.48382

Parameter Estimates
Variable   Label      DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept  Intercept   1              4.42124          0.59247      7.46     <.0001
age                    1              0.00186          0.00751      0.25     0.8044

Regression of Max Diameter (mm) on Pre-cuff Baseline (mm) and Age (yrs)
Analysis of Variance
Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               2        549.09606     274.54803   35541.0   <.0001
Error             699          5.39965       0.00772
Corrected Total   701        554.49571

Root MSE        0.08789   R-Square   0.9903
Dependent Mean  4.56780   Adj R-Sq   0.9902
Coeff Var       1.92414

Parameter Estimates
Variable   Label                    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept  Intercept                 1              0.33282       0.06049         5.50     <.0001
PREBL      Pre-cuff Baseline (mm)    1              0.99323       0.00373       266.60     <.0001
age        Age (yrs)                 1             -0.00208       0.00074187     -2.80     0.0053

Multicollinearity
- Many research studies have large numbers of predictor variables
- Problems arise when the various predictors are highly related among themselves (collinear)
- Estimated regression coefficients can change dramatically depending on whether or not other predictor(s) are included in the model
- Standard errors of regression coefficients can increase, causing non-significant t-tests and wide confidence intervals
- The variables are explaining the same variation in Y

Multicollinearity - Example
Pearson Correlation Coefficients (Prob > |r| under H0: Rho = 0; number of observations below each pair)
          MAXD      PREBL     POSTBL    T2MAXD     age
MAXD      1.00000   0.99506   0.99475   0.02827    0.00936
                    <.0001    <.0001    0.4546     0.8044
          702       702       702       702        702
PREBL     0.99506   1.00000   0.99716   0.02597    0.02194
          <.0001              <.0001    0.4918     0.5605
          702       706       703       703        706
POSTBL    0.99475   0.99716   1.00000   0.01667    0.02075
          <.0001    <.0001              0.6590     0.5828
          702       703       703       703        703
T2MAXD    0.02827   0.02597   0.01667   1.00000   -0.04169
          0.4546    0.4918    0.6590               0.2697
          702       703       703       703        703
age       0.00936   0.02194   0.02075  -0.04169    1.00000
          0.8044    0.5605    0.5828    0.2697
          702       706       703       703        706
(MAXD = Max. Diameter, Dilation Phase (mm); PREBL = Pre-cuff Baseline (mm); POSTBL = Post-cuff Baseline (mm); T2MAXD = Time to Max. Diameter (sec); age = Age (yrs))

Multicollinearity - Example (continued)
Parameter Estimates (model with PREBL and age)
Variable   Label                    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept  Intercept                 1              0.33282       0.06049         5.50     <.0001
PREBL      Pre-cuff Baseline (mm)    1              0.99323       0.00373       266.60     <.0001
age        Age (yrs)                 1             -0.00208       0.00074187     -2.80     0.0053

Parameter Estimates (model with PREBL, POSTBL, and age)
Variable   Label                    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept  Intercept                 1              0.32369       0.05707         5.67     <.0001
PREBL      Pre-cuff Baseline (mm)    1              0.55326       0.04716        11.73     <.0001
POSTBL     Post-cuff Baseline (mm)   1              0.44290       0.04735         9.35     <.0001
age        Age (yrs)                 1             -0.00213       0.00069985     -3.04     0.0025

Note how the PREBL coefficient drops from 0.99 to 0.55 once the highly correlated POSTBL (r = 0.997 with PREBL) enters the model.

Assumptions about the IVs
We assume that the outcomes (the Y's) are normally distributed. What assumptions have we made about the distribution of the IVs (the X's)? None, except that they are RVs with some underlying distribution. Recall that the model assumptions center on the conditional distribution of the Y's (conditional on the values of the X's).

ANOVA is simply linear regression with a series of dichotomous indicators for the "levels" of X. Are there any differences among the population means?
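The standard-error inflation described above can be reproduced in a small simulation. This is a sketch, not the study data: x1 and x2 are simulated to be about as tightly correlated as PREBL and POSTBL (r ≈ 0.997), and all names and values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Two nearly collinear predictors (simulated; roughly r ~ 0.997 like PREBL/POSTBL)
x1 = rng.normal(4.5, 0.9, size=n)
x2 = x1 + rng.normal(0.0, 0.07, size=n)        # x2 tracks x1 very closely
y = 0.2 + 1.0 * x1 + rng.normal(0.0, 0.09, size=n)

def fit_ols(predictors, y):
    """OLS with intercept: return coefficients and their standard errors."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    mse = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(np.diag(mse * np.linalg.inv(X.T @ X)))
    return beta, se

_, se_one = fit_ols([x1], y)        # x1 alone
_, se_two = fit_ols([x1, x2], y)    # x1 plus its near-duplicate

print("SE(b1), x1 alone:         ", se_one[1])
print("SE(b1), with collinear x2:", se_two[1])   # substantially larger
```

With the near-duplicate predictor in the model, the standard error of the x1 coefficient inflates by roughly the square root of the variance inflation factor, 1/(1 − r²), even though the fit to y barely changes.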
One-Way ANOVA
Response: continuous. Predictor: categorical.
H0: All means are equal
H1: At least one mean is different

[Figure: side-by-side dot plots comparing populations A, B, C, and D]

Assumptions
- Independent observations
- Normally distributed data for each group, or pooled error terms that are normally distributed
- Equal variances for each group

Partitioning Variability
Total variability = variability between groups + variability within groups (i indexes the k groups, j indexes the ni individuals within group i)
Within sum of squares:  SSW = Σi=1..k Σj=1..ni (Yij − Ȳi)²
Between sum of squares: SSB = Σi=1..k ni (Ȳi − Ȳ)²
Total sum of squares:   SST = Σi=1..k Σj=1..ni (Yij − Ȳ)²
where Ȳi = (1/ni) Σj=1..ni Yij and Ȳ = (1/n) Σi=1..k ni Ȳi
SST = SSB + SSW

ANOVA Example - Max Diameter by Reader
Source                  DF   Sum of Squares   Mean Square   F Value   Pr > F
Model (SSB)              2        3.6421805     1.8210902      2.31   0.0999
Error (SSW)            699      550.8535268     0.7880594
Corrected Total (SST)  701      554.4957073

R-Square 0.006568   Coeff Var 19.43447   Root MSE 0.887727   MAXD Mean 4.567798

READER   MAXD LSMEAN   Standard Error   Pr > |t|   LSMEAN Number
Crotts   4.64841096    0.05998704       <.0001     1
Drum     4.48160364    0.05353196       <.0001     2
Manli    4.59687981    0.06155280       <.0001     3

ANCOVA
The ability to mix continuous and categorical predictors is one of the things that makes the General Linear Model (or GLM) so flexible. ANCOVA analyses should always assess possible interactions between continuous IVs and categorical IVs. If interactions are present, the model must be interpreted carefully.

ANCOVA Example - Max Diameter vs. Reader, Adjusting for Pre-cuff Diameter
Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               5      549.2229489   109.8445898   14499.4   <.0001
Error             696        5.2727583     0.0075758
Corrected Total   701      554.4957073

R-Square 0.990491   Coeff Var 1.905493   Root MSE 0.087039   MAXD Mean 4.567798

Source          DF   Type III SS    Mean Square    F Value   Pr > F
READER           2     0.0596745      0.0298373       3.94   0.0199
PREBL            1   541.7764744    541.7764744    71514.1   <.0001
PREBL*READER     2     0.0508130      0.0254065       3.35   0.0355

(PREBL is the continuous covariate, pre-cuff diameter; PREBL*READER is the interaction between pre-cuff diameter and reader.) Yikes, the interaction is significant!
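The SST = SSB + SSW decomposition and the resulting F statistic can be verified on a toy data set. A sketch (the three groups and their values are assumptions for illustration, not the MAXD-by-reader data; scipy's f_oneway is used only as a cross-check):

```python
import numpy as np
from scipy import stats

# Three illustrative groups (assumed values)
groups = [
    np.array([4.6, 4.8, 4.5, 4.9, 4.7]),
    np.array([4.4, 4.3, 4.6, 4.5, 4.2]),
    np.array([4.7, 4.6, 4.8, 4.5, 4.9]),
]

all_y = np.concatenate(groups)
grand_mean = all_y.mean()
k, n = len(groups), len(all_y)

ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)  # between
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)            # within
sst = ((all_y - grand_mean) ** 2).sum()                           # total

# F = MSB / MSW with (k-1, n-k) degrees of freedom
f_stat = (ssb / (k - 1)) / (ssw / (n - k))
print(f_stat)

# scipy's one-way ANOVA computes the same F statistic
f_scipy, p_scipy = stats.f_oneway(*groups)
```

The hand-computed sums of squares satisfy SST = SSB + SSW, and the F statistic agrees with stats.f_oneway.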
ANCOVA Example - Stratified Analysis

Reader Effects in 1st Quartile of Pre-cuff Diameter
READER   MAXD LSMEAN   Standard Error
Crotts   3.50587755    0.04807411
Drum     3.51649412    0.03650059
Manli    3.50760000    0.04759094

Reader Effects in 2nd Quartile of Pre-cuff Diameter
READER   MAXD LSMEAN   Standard Error
Crotts   4.24069492    0.02395722
Drum     4.25861538    0.02282474
Manli    4.28982456    0.02437390

Reader Effects in 3rd Quartile of Pre-cuff Diameter
READER   MAXD LSMEAN   Standard Error
Crotts   4.84836735    0.02640555
Drum     4.84447541    0.02366619
Manli    4.77966667    0.02667919

Reader Effects in 4th Quartile of Pre-cuff Diameter
READER   MAXD LSMEAN   Standard Error
Crotts   5.78133871    0.06433425
Drum     5.64400000    0.06332105
Manli    5.78918868    0.06958252

The interaction is driven by a "swapping" of the reader effects across the various levels of pre-cuff diameter.
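One way to see what the interaction model is fitting is to build its design matrix by hand: an intercept, dummy indicators for reader, the covariate, and covariate-by-reader interaction columns. This is a simulation sketch under assumed values; the reader codes, slopes, and noise level are illustrative, not the study's data.

```python
import numpy as np

rng = np.random.default_rng(2)
n_per = 50

# Simulated ANCOVA data: 3 readers, one continuous covariate (illustrative)
readers = np.repeat([0, 1, 2], n_per)      # e.g., Crotts, Drum, Manli as codes
prebl = rng.normal(4.5, 0.9, size=3 * n_per)

# Reader-specific slopes create an interaction (assumed values)
slopes = np.array([0.98, 1.01, 0.99])[readers]
maxd = 0.2 + slopes * prebl + rng.normal(0.0, 0.09, size=3 * n_per)

# Design matrix: intercept, 2 reader dummies, covariate, 2 interaction columns
d1 = (readers == 1).astype(float)
d2 = (readers == 2).astype(float)
X = np.column_stack([np.ones(3 * n_per), d1, d2, prebl, d1 * prebl, d2 * prebl])

beta, *_ = np.linalg.lstsq(X, maxd, rcond=None)
# beta[3] is the covariate slope for the reference reader;
# beta[4] and beta[5] are how the other readers' slopes differ from it
print(beta)
```

Testing the interaction (as the Type III PREBL*READER test did above) amounts to testing whether the last two coefficients are jointly zero; if they are not, the covariate slope differs by reader and group effects must be interpreted at specific covariate levels, exactly as in the stratified analysis.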