Statistics

PROC LOGISTIC: A FORM OF REGRESSION ANALYSIS
Edith Flaster
Winthrop-University Hospital

1. When do we use Proc Logistic?
Proc Logistic is used when your response variable is binary; that is, yes or no, present or absent, high or low.

2. SAS's tricky definition of Yes and No.
SAS has succumbed to the epidemiologists on this proc and used their notation for the response variable as opposed to the standard statistical notation. SAS keeps yes as 1, but no must be 2, not 0, the way statisticians do it. If you use 0 for no, all your signs come out in the opposite direction, and you will drive yourself nuts trying to straighten it out after the fact. Note that the 1 and 2 apply only to the response variable. All your independent variables should be coded the way we usually do: 1 for high, 0 for low; 1 for present, 0 for absent.

3. Comparison of logistic and ordinary regression

a. Response variables and assumptions
In logistic regression, the response variable is binary, as stated above, while in ordinary least squares regression the response variable is continuous and assumed to have a normal distribution.

b. Independent or explanatory variables
The independent variables can be either continuous or binary for both methods. If you have a categorical variable with more than two levels, you work with a series of dummy variables in both methods.

c. Methods of calculation
Ordinary regression is known as least squares regression, because the equations are solved by the method of least squares. Logistic regression is solved by maximum likelihood methods. The two methods do not necessarily give the same solution. (I tried it with a small data set and got different solutions, but a NYASUG friend reported that he usually gets similar solutions with his very large data sets.)

d. Meaning of coefficients (slopes vs relative risk)
For least squares regressions the coefficients are interpreted as slopes, or change in response per unit change in the explanatory variable; that is, if your equation is:

    Weight(lbs) = 306 + 3*(Height, in)

then weight should increase 3 lbs for each inch in height. In this equation, 306 is the intercept, and 3 is the slope or regression coefficient.

For logistic regressions, the coefficients are interpreted as the natural logarithm (ln) of the odds ratio when the independent variable has only two levels. When the independent variable is continuous, the coefficient is the ln of the odds ratio per unit change in the independent variable. In order to get the odds ratio, you must exponentiate the coefficient. For example, with an equation of:

    Disease = .5 + 1.1*(decade of age)

the odds ratio is exp(1.1) or 3.0, meaning that you are 3 times more likely to get the disease for each 10 years that you age.

4. Logistic regression

a. Calculation of outcome probabilities
Using logistic regression, we can calculate for each person or row in the table the probability of a positive outcome. The response variable that is fit in logistic regression is called the logit, which is expressed as ln(p/(1-p)), where p is the fractional probability of a positive response. Using some arithmetic, which I'll go into at the session if someone wants to know, we can solve for p. If we let Z stand for the right-hand side of our equation:

    ln(p/(1-p)) = b0 + b1X1 + b2X2 ...

then

    p = 1/(1 + exp(-Z))

b. Using Proc Logistic
Proc Logistic is similar in form to both Proc Reg and Proc GLM. That is, you use a model statement, and can use by, output, and weight statements. There are many other options and various diagnostic procedures, which you can examine in the manual. Several are shown in the 5 examples given.

c. Example from Applied Logistic Regression by Hosmer and Lemeshow (1989, Wiley)
We will go over a simple example from an excellent text on applied logistic regression.
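To make the two conversions above concrete, here is a minimal numeric sketch. Python stands in for SAS purely for illustration; the coefficients .5 and 1.1 are the hypothetical ones from the Disease equation above, not fitted values.

```python
import math

def inv_logit(z):
    """Invert the logit: given Z = b0 + b1*X1 + ..., return p = 1/(1 + exp(-Z))."""
    return 1.0 / (1.0 + math.exp(-z))

# Odds ratio for a continuous predictor: exponentiate its coefficient.
# With Disease = .5 + 1.1*(decade of age), each decade multiplies the odds by:
print(round(math.exp(1.1), 1))  # 3.0

# Probability for a (hypothetical) person aged 3 decades under that equation:
z = 0.5 + 1.1 * 3
print(round(inv_logit(z), 3))   # 0.978
```

Note that the coefficient itself (1.1) is on the log-odds scale; only after exponentiating does it become the familiar "3 times the odds per decade" statement.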
The first example in the text is a table showing age, and the presence or absence of coronary heart disease (CHD). Using the simplest SAS code, we first input the data:

    Data Raw;
      Input Age CHD;
      If CHD = 0 then CHD = 2;
      Cards;
    20 0
    23 0
    ...

You can see that the coding was changed so that presence of CHD remains at 1, but absence is coded as 2. We then request the model:

    Proc Logistic;
      Model CHD = Age;
    Run;

The output is shown on the following page. You will be happy to know that SAS gets the same results as the text.

The response profile near the top of the page serves as a check that your response variable has been coded as 1 and 2, and shows how many of each code there were. The simple statistics show the usual mean, standard deviation, and minimum and maximum. It's always a good idea to look at these, to protect yourself against outliers, whether real or errors.

The criteria for assessing model fit follow next. The most commonly used are the -2 LOG L, where L is the likelihood, and the Score statistic. Both of these are distributed as chi-square, with the degrees of freedom corresponding to the number of explanatory variables in the model, and the usual p-value is printed. AIC stands for the Akaike Information Criterion and SC stands for the Schwarz Criterion. These are primarily used for comparing different models for the same data. Lower values are better.

We finally arrive at the results, listed under the analysis of maximum likelihood estimates. It is a very similar table to those that appear in Proc Reg or Proc GLM. The first column lists either the intercept or the coefficient of the listed variable. The column labelled parameter estimate gives the corresponding numerical values. The coefficient for age is 0.111, with a standard error of 0.024. This value is statistically significant, as shown in the column which is second from the right. As we said before, the odds ratio is exp(0.111), or 1.117 to 1. This may be interpreted as: the chance that a person will have coronary heart disease increases about 12% for every year of age.

If you want to calculate the probability of having heart disease at 30, for instance, you substitute in the equation we talked about before:

    p = 1/(1 + exp(-(-5.31 + .111*Age)))

If we do the calculations, we find that the chance of having CHD at 30 is 0.12 or 12%. At 90, it is 0.99 or 99%. At 100, it is 0.997 or 99.7%. With a logistic model, the probability will never exceed 100%. These chances seem very high. According to the brief description in the book, these were people selected to participate in a study. They may be a sample of people who came to the hospital because of chest pain. I doubt that they're a sample of a normal population.

The last table in the output gives four rank correlation indices and the number of pairs with different responses. These statistics assess the predictive ability of the model.

Proc Logistic is a very rich procedure, with regression diagnostics, many other options, and the ability to tackle a wide variety of problems. I hope that this talk has enabled you to get a feel for the basics of the procedure.
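The probabilities quoted for ages 30, 90, and 100 can be reproduced directly from the fitted intercept and slope; here is a small check in Python (standing in for SAS purely for illustration), using the rounded coefficients -5.31 and 0.111 from the output:

```python
import math

def p_chd(age):
    """Probability of CHD at a given age, from the fitted logistic equation."""
    z = -5.31 + 0.111 * age
    return 1.0 / (1.0 + math.exp(-z))

for age in (30, 90, 100):
    print(age, round(p_chd(age), 3))
# 30 -> 0.121, 90 -> 0.991, 100 -> 0.997
```

This matches the 12%, 99%, and 99.7% figures in the text, and illustrates why the logistic curve can never exceed 100%: as Z grows, exp(-Z) shrinks toward zero, so p approaches but never reaches 1.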
SAS

The LOGISTIC Procedure

Data Set: WORK.RAW
Response Variable: CHD
Response Levels: 2
Number of Observations: 100
Link Function: Logit

Response Profile

    Ordered Value   CHD   Count
    1               1     43
    2               2     57

Simple Statistics for Explanatory Variables

    Variable   Mean        Standard Deviation   Minimum   Maximum
    AGE        44.380000   11.721327            20.0000   69.0000

Criteria for Assessing Model Fit

    Criterion   Intercept Only   Intercept and Covariates   Chi-Square for Covariates
    AIC         138.663          111.353
    SC          141.268          116.563
    -2 LOG L    136.663          107.353                    29.310 with 1 DF (p=0.0001)
    Score       .                .                          26.399 with 1 DF (p=0.0001)

Analysis of Maximum Likelihood Estimates

    Variable   Parameter Estimate   Standard Error   Wald Chi-Square   Pr > Chi-Square   Standardized Estimate
    INTERCPT   -5.3095              1.1337           21.9350           0.0001            .
    AGE        0.1109               0.0241           21.2541           0.0001            0.716806

Association of Predicted Probabilities and Observed Responses

    Concordant = 79.0%    Somers' D = 0.600
    Discordant = 19.0%    Gamma     = 0.612
    Tied       =  2.0%    Tau-a     = 0.297
    (2451 pairs)          c         = 0.800
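The numbers in this listing are internally consistent, which makes a nice check on the definitions discussed earlier. The sketch below recomputes the fit criteria and association indices in Python from -2 LOG L and the concordant/discordant percentages; the formulas (AIC = -2 LOG L + 2k, SC = -2 LOG L + k*ln(n), with k the number of parameters) are the standard definitions, an assumption on my part since the paper does not spell them out:

```python
import math

n = 100                 # observations
neg2logl_int = 136.663  # -2 LOG L, intercept only (k = 1 parameter)
neg2logl_cov = 107.353  # -2 LOG L, intercept and covariates (k = 2 parameters)

print(round(neg2logl_int + 2 * 1, 3))            # 138.663  AIC, intercept only
print(round(neg2logl_cov + 2 * 2, 3))            # 111.353  AIC, with covariates
print(round(neg2logl_int + 1 * math.log(n), 3))  # 141.268  SC, intercept only
print(round(neg2logl_cov + 2 * math.log(n), 3))  # 116.563  SC, with covariates

# Likelihood-ratio chi-square for covariates is the drop in -2 LOG L:
print(round(neg2logl_int - neg2logl_cov, 3))     # 29.31

# Association indices from the concordant/discordant/tied fractions:
conc, disc, tied = 0.79, 0.19, 0.02
print(43 * 57)                                   # 2451 pairs (43 yes x 57 no)
print(round(conc - disc, 3))                     # 0.6    Somers' D
print(round((conc - disc) / (conc + disc), 3))   # 0.612  Gamma
print(round((conc - disc) * 2451 / (0.5 * n * (n - 1)), 3))  # 0.297  Tau-a
print(round(conc + 0.5 * tied, 3))               # 0.8    c statistic
```

Every printed value matches the corresponding entry in the SAS output above, up to rounding in the listing.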