Download Using Propensity Scores To Adjust For Treatment Selection Bias

Using Propensity Scores To Adjust For Treatment Selection Bias R. Scott Leslie, MedImpact Healthcare Systems, Inc., San Diego, CA ABSTRACT A key strength of observational studies is the ability to estimate treatment effect on health outcomes in “real world” conditions. However, studies that lack randomization of subjects into treatment groups must address selection bias to properly estimate the effect of treatment as non randomized groups usually differ on observed and unobserved characteristics. That is, the observed treatment effect may be due to the treatment itself or due to the differential selection into treatment groups from non-randomization. Conventional regression adjustment, matching, and stratification using propensity scores are widely used techniques to adjust for selection bias. Included in this tutorial is a description of these propensity score methods, an explanation of the advantages and disadvantages of each method, and an application of the methods using various examples. This tutorial is intended for intermediate level statisticians, SAS® programmers, and/or data analysts. INTRODUCTION AND WHY USE PROPENSITY SCORES Randomized control trials (RCTs) measure drug efficacy in controlled environments: however, can often be restricted to subpopulations that limit generalizability of results. Observational studies, on the other hand, can evaluate treatment effectiveness in routine care settings or everyday use patterns. However, a limitation of observational studies is the lack of treatment assignment. Non randomized groups usually differ in observed and unobserved characteristics causing selection bias when evaluating the effect of treatment. For example, the below figure shows how 2 patient groups may differ in age, gender, health status, and previous medication use. Patients eligible for Drug A or Drug B Drug A Patient Characteristics Drug B Patient Characteristics Younger, More females, Currently taking less medications, Poorer medication use behavior Older, More male, Currently taking more medications, Better medication use behavior Statistical techniques such as matching, stratification, and regression adjustment are commonly used to account for differences in treatment groups but may be limited if using too few covariates in the adjustment process. The use of propensity score techniques avoids this limitation because it can summarize more or all the covariate information into a single score. So what is a propensity score? The propensity score is the conditional probability of being treated based on individual covariates. Rosenbaum and Rubin demonstrated that propensity scores can account for imbalances in treatment groups and reduce bias by resembling randomization of subjects into treatment groups. By using propensity scores to balance groups, traditional adjustment methods can better estimate treatment effect on outcomes while adjusting for covariates. One method professed by Ralph B. D’Agostino, Jr. to adjust for the non randomized treatment selection is to use a propensity score method in conjunction with traditional regression techniques. This process is performed using two steps, the first of which calculates propensity scores as the probability of patients being included in each treatment group based on pre-treatment observables. The aim of this step is to create balance treatment groups and simulate random treatment allocation. The second step utilizes the created propensity scores with ANCOVA to more accurately estimate outcomes and study the possible covariate predictors. HOW TO CREATE PROPENSITY SCORES The logistic model describes the relationship of several independent variables to a dichotomous dependent variable. Furthermore, logistic regression is used to predict the probability of event occurring as a function of independent variables (continuous and/or dichotomous) The logistic model: P( X ) = 1 1 + e −(α + ΣβiXi1) Propensity scores are easily created using PROC LOGISTIC. In the cases described in this paper, the dependent variable is treatment group (treatment for the patient) and the independent variables are patient and baseline characteristics. In other cases the dependent variable may be any dichotomous outcome (treated or untreated, disenrolled or not, visited specialist or not). The GENMOD procedure for generalized linear models may also create propensity scores by using the OUTPUT statement and keyword PREDICTED. Example of Creating Propensity Scores Using PROC LOGISTIC The following example illustrates the use of PROC LOGISTIC to create propensity scores. A large pharmacy claims database was used to identify 19,433 patients using oral antidiabetic therapy. Patients were categorized into two drug treatment groups, A and B, with the main objective of comparing compliance and adherence rates. Compliance was measured as the proportion of days a medication was supplied over a 180 day period. Adherent patients were identified as those reaching a threshold of 80% compliance. Selection bias is believed to be a factor as the two drug treatment groups differ in patient tolerance, adverse events, and side effects which possibly influence drug choice and compliance to each drug. Other variables controlled for in the analyses include demographic variables (age, gender) and previous medication use/patterns measured in a 6 month baseline period prior to treatment. Previous medication use was recorded by the use of specific cardiovascular, asthma, and antidepressant medications and previous medication pattern use was measured by refill patterns of maintenance type medications. PROC LOGISTIC calculates propensity scores as the conditional probability of each patient receiving a particular treatment based on pre-treatment variables and can output the propensity score to a data set. In this example, propensity scores were calculated based on the covariates listed in Table 1 below. The objective was to balance the treatment groups so to reduce bias of treatment selection and obtain better idea of treatment effect on the outcome of compliance. The logit function is specified in the LINK option to fit the binary logit model and the RSQUARE option assesses the amount of variation explained by the independent variables. The propensity score is output to data set named “psdataset”. The predicted probabilities are output to a variable named “ps”. proc logistic data = wuss; class naive0; model tx (event=’Drug A’) = age female b_hmo pre_drug_cnt_subset naive0 pre_refill_pct copay_idxdrug pre_sulf pre_htn pre_asthma pre_pain pre_lipo pre_depress /link=logit rsquare; output out=psdataset pred=ps; run; Where, tx = treatment selection indicator, age = age of patient at date of treatment initiation (index date), female = indicator variable, b_hmo= HMO insurance, copay_idxdrug = patient co-payment for initial prescription All variables with the prefix “pre_” describe utilization in the baseline period. pre_drug_cnt = number of medications (generic) utilized excluding medications for the categories below pre_sulf = sulfonylurea use pre_htn = hypertension use pre_asthma = asthma use pre_pain = pain medication use pre_lipo = lipotropic use pre_depress = antidepressant use pre_refill_pct = refill percentage for maintenance medications 2 After creating the propensity scores, an evaluation of the distributions can check comparability of the treatment groups. Sizeable overlaps among the groups illustrate satisfactory overlap in covariate distributions and indicate that the groups are comparable. 14 12 P D e r r u c g e n A t 10 8 6 4 2 0 14 12 P D e r r u c g e n B t 10 8 6 4 2 0 0. 33 0. 35 0. 37 0. 39 0. 41 0. 43 0. 45 0. 47 0. 49 0. 51 0. 53 0. 55 0. 57 0. 59 0. 61 0. 63 Est i m at ed Pr obabi l i t y Tables 1 and 2 demonstrate how propensity scoring can balance the groups. Table 1 contains the unadjusted values (before propensity scores) for the two treatment groups. Descriptors include demographic variables (age, gender), type of health plan insurance, and medication use in a 6 month period prior to treatment (baseline period). Age was the age of the patient at date of treatment initiation (index date). Type of insurance was categorized by health maintenance organization (HMO), Medicare, Medicaid, or self-insured. Prior medication use was measured for sulfonylureas, antihypertensives, lipid lowering agents as well as asthma and antidepressant medications. Refill patterns of medications for chronic diseases, or maintenance medications, were used to estimate patient’s behavior to following prescribed dosing. The TABULATE procedure was used to create the table below and PROC TTEST and PROC FREQ was used to test for differences between groups. Those observables that differed significantly in the two groups of patients include type of insurance (proportion of patients enrolled in an HMO), oral antidiabetic, sulfonylurea, and pain medication use. Table 1. Unadjusted Demographic and Baseline Measures UNADJUSTED VALUES Drug A Member Count 9,129 Age Mean 57.7 SD 10.5 Female % 43.1% HMO * % 73.3% # of Drugs Utilized Mean 4.9 3 Drug B 10,304 57.5 10.7 43.4% 75.3% 5.0 p-value 0.3786 0.7115 0.0018 0.0841 UNADJUSTED VALUES Maintenance Medication Refill * Prior Oral Antidiabetic Use * Sulfonylurea * Hypertension Lipid Irregularity Pain Management * Antidepressant Asthma * p < .05 SD % % % % % % % % Drug A 3.0 59.6% 77.5 55.3% 79.1% 68.3% 23.6% 19.5% 9.3% Drug B 3.1 60.5% 80.4 51.3% 78.4% 68.5% 24.9% 18.4% 9.9% p-value 0.0492 <.0001 <.0001 0.2381 0.7509 0.0357 0.0575 0.1967 Table 2 contains the adjusted values (after propensity scores) for the two treatment groups. PROC GLM was used to compare groups while adjusting for the propensity score. Differences between groups were minimized when using the propensity score method. Table 2. Propensity Score Adjusted Demographic and Baseline Measures WITH PROPENSTIY SCORING Drug A Drug B Member Count 9,129 10,304 Age Mean 57.4 57.4 SD 10.5 10.7 Female % 42.9% 42.8% HMO % 75.2% 75.1% # of Drugs Utilized Mean 4.9 4.9 SD 3.0 3.0 Maintenance Medication Refill % 60.3 60.3 Prior Oral Antidiabetic Use % 80.2% 80.7% Sulfonylurea % 53.2% 53.6% Hypertension % 78.6% 78.7% Lipid Irregularity % 68.7% 68.8% Pain Management % 24.0% 23.9% Antidepressant % 18.8% 18.8% Asthma % 9.5% 9.5% p-value 0.9329 0.9062 0.8922 0.9390 0.0687 0.3088 0.5204 0.8674 0.9303 0.8380 0.9112 0.9407 USE OF PROPENSITY SCORES Once the propensity score is calculated what to do you with them? As explained above, the 3 methods commonly used are matching on propensity score, stratification, and regression adjustment. Regression Adjustment Continuing with the study example described above, the created propensity scores were used in regression adjustment where a propensity score weight, also referred to as the inverse probability of treatment weight (IPTW), was calculated as the inverse of the propensity score (Hogan and Lancaster). The treatment selection model above modeled the propensity to receive drug a. For those patients receiving drug b, the propensity score would be 1- ps and the propensity score weight would be the inverse of 1-ps. data psdataset; set psdataset; if druga=1 then ps_weight=1/ps;else ps_weight=1/(1-ps); run; Next, a propensity score-weighted linear regression model, using the GLM procedure, was fitted to compare drug treatment on the outcome of compliance while controlling for other covariates. The LSMEANS statement computes the least-squares means for the treatment variable allowing for multiple comparisons. The ADJUST=TUKEY option uses the Tukey-Kramer method to adjust the least-squares means and the PDIFF and CL options give the p values and corresponding confidence intervals for the differences in the least-squares means. 4 proc glm data=psdataset; class tx naive0 female; model p_dayscovered = tx age female b_hmo pre_drug_cnt_subset naive0 pre_refill_pct copay_idxdrug pre_sulf pre_htn pre_asthma pre_pain pre_lipo pre_depress /solution; lsmeans tx/OM ADJUST=TUKEY PDIFF CL; weight ps_weight; quit; Results of the model above showed no difference between treatment groups (p = 0.2066). Comparing this result to unadjusted compliance means (a model with drug treatment as the only independent variable), where there is a significant difference (p=0.0071), shows the effect of controlling for selection bias and confounding. The increase in compliance from the unadjusted means (0.6673 and 0.6793) to the adjusted means (0.7032 and 0.7082) was probably due to the large number of patients with prior use of the drug. These patients had higher compliance values than those new to therapy and therefore when this variable is placed in the model as a covariate the mean compliance values rise. Table 3. Mean Compliance by Treatment: Unadjusted vs. PS Adjusted Models Model Treatment Compliance Outcome 95% Confidence Limits Unadjusted Model P = 0.0071 Drug A 0.6673 0.6601 0.6736 Drug B 0.6793 0.6733 0.6852 Propensity Score Adjusted Model P = 0.2066 Drug A 0.7032 0.6975 0.7089 Drug B 0.7082 0.7029 0.7135 Stratification Stratification, subclassification or binning using propensity scores involves grouping subjects into classes or strata based on the subject’s observed characteristics. Once the propensity scores are calculated, subjects are placed into strata (Cochran states that 5 strata can remove 90% of the bias) with the idea that subjects in the same stratum are similar in the characteristics used in the propensity score development process. The tutorial by D’Agostino details how to perform this technique. Briefly, quintiles are used to group subjects into five strata after making sure that there is adequate propensity score overlap between treatment groups. To prove that the propensity scores removed any bias due to differences in covariates between treatment groups, t-tests or chi-square tests are conducted before and after propensity score creation. Finally, outcomes and treatment effects can be assessed using models while adjusting for the propensity scores. Continuing with the example and code above, subjects are divided into 5 classes based on the common propensity score overlap using the RANK procedure. Checking for difference between treatment group before and after stratifying subjects by propensity scores can be done using PROC FREQ, PROC TTEST and PROC GLM. proc rank data= psdataset groups=5 out = r; ranks rnks + 1; var ps; run; data quintile; set r; quintile = rnks + 1; run; /*check for differences in groups before propensity score*/ proc freq data=quintile; tables tx*(female b_hmo naive0 pre_sulf pre_htn pre_asthma pre_pain pre_lipo pre_depress)/chisq;run; 5 proc ttest data=quintile; var age pre_drug_cnt_subset maintrefillratio; class tx;run; /*check for differences in groups while adjusting for propensity scores*/ proc glm data=quintile; class tx quintile; model age female b_hmo pre_drug_cnt_subset maintrefillratio naive0 pre_sulf pre_htn pre_asthma pre_pain pre_lipo pre_depress = tx quintile; lsmeans tx; quit; The result is similar to Tables 1 and 2 above showing minimal differences between groups when subclassifying subjects. Outcomes can be compared within the 5 subclasses or averaged to report for the overall treatment groups. Matching Matching groups by propensity scores is a common method to balance groups on covariates. Once the propensity score is calculated, subjects are matched by this single score as opposed to traditional direct matching by one or more covariates. A disadvantage of matching methods includes incomplete matching and inexact matching. That is, subjects may be excluded because of difficulty finding a match. Reducing this bias is well explained in Lori Parsons’ papers (see reference section). Her papers offer code for performing case-control match using a greedy matching algorithm. As she explains in the paper, cases were matched to controls based on propensity scores. Tables 1 and 2 below show how propensity scores were used to balance a treated (“Early Intervention” group) and untreated group of subjects (“Conservative” group). Table 1 is the original population and includes results of rank-sum tests and chisquare tests showing differences between groups in many characteristics (only a few a shown in the tables). Table 2 shows how differences are eliminated after matching. This matched subset of patients can now be used to model outcomes and assess effect of treatment. Table 1: Original Population Total Patients Age (Mean±sd) Male Gender White Race Hx Angina Hx MI Early Intervention N (%) 2,402 61.3 ±12.2 1,744 (72.6) 2,079 (91.8) 444 (18.5) 574 (23.9) Conservative N (%) 17,735 68.2±13.0 10,914 (61.5) 15,002 (88.4) 4,441 (25.0) 5,382 (30.3) p-value Conservative N (%) 2,036 61.7±13.3 1,445 (71.0) 1,858 (91.3) 381 (18.7) 491 (24.1) p-value <0.0001 <0.0001 <0.0001 <0.0001 <0.0001 Table 2: Greedy 5 to 1 Digit Matched Population Total Patients Age (Mean±sd) Male Gender White Race Hx Angina Hx MI Early Intervention N (%) 2,036 61.9 ±12.0 1,452 (71.3) 1,865 (91.6) 390 (19.2) 488 (24.0) 0.5405 0.8087 0.6952 0.7189 0.9124 SUMMARY OF AND STRENGTHS AND LIMITATIONS OF PROPENSITY SCORING Calculating the propensity score as the conditional probability of treatment summarizes observed values into a single score. The scores can then be used to control for selection bias by matching subjects, stratifying subjects, and/or as a regressor. All techniques have the purpose of balancing groups to remove bias when assessing treatment effect on outcomes. Traditional techniques to control for bias may be limited if accounting for only a few covariates. Compared to multiple regression, the propensity score methodology summarizes many observables and is less sensitive to model misspecification (Perkins et al). Propensity scores can also diagnose if groups are comparable before moving onto the modeling stage of analysis. If distributions of the propensity scores fail to show much overlap in covariate values, the comparison groups are too different making it difficult to balance groups. When creating propensity scores all covariates that affect both treatment and outcome must be included in the model and it is assumed that all patients have a non zero probability of receiving each treatment. The technique only looks at observed characteristics of a patient thus does not account for unobserved factors, such as patient attitudes, 6 socioeconomic status, and education level. This limitation is modified if unobserved covariates are correlated to observed factors. Analyses also need large samples sizes in order to establish adequate variance in covariate distributions. CONCLUSION This tutorial introduces potential selection bias in observational studies and describes how propensity score methodology can control for overt bias when estimating treatment effectiveness. Propensity scoring methodology attempts to balance groups before comparing outcomes between treatment groups. Commonly used techniques via propensity scores include matching, stratification, and regression adjustment using the inverse of the propensity score. Each method can be used in conjunction with traditional risk adjustment techniques to reduce bias and better describe the effect of treatment on outcomes. REFERENCES D’Agostino R.B. Sr, Kwan H. 1995. “Measuring Effectiveness: What to Expect Without a Randomized Control Group”. Medical Care. 195:33 (4 suppl): AS95-AS105. D’Agostino R.B., Jr, D’Agostino R.B., Sr. 2007. “Estimating Treatment Effects Using Observational Data”. JAMA. 297 (3). 314-316 Rosenbaum P.R. and Rubin D.B. 1983. “The Central Role of the Propensity Score in Observational Studies for Causal Effects”, Biometrika, 70, 41-55. D’Agostino, R.B. 1998. “Tutorial on Biostatistics: Propensity Score Methods for Bias Reduction in the comparison of a treatment to a non-randomized control group”. Statistics in Medicine 17, 2265-2281. Hogan, J.W., Lancaster, T. 2004. ”Instrumental variable and propensity weighting for causal inference from longitudinal observational studies”. Statistical Methods in Medical Research 13: 17-48. Obenchain, R.L., Melfi, C.A., “Propensity Score and Heckman Adjustments for Treatment Selection Bias in Database Studies”. Pasta, David J. 2000. “Using Propensity Scores to Adjust for Group Differences: Examples Comparing Alternative Surgical Methods”. Proceedings of the Twenty-Fifth Annual SAS Users Group International Conference, Indianapolis, IN, 261-25. Parsons, Lori. 2000. “Using SAS® Software to Perform a Case Control Match on Propensity Score in an Observational Study”. Proceedings of the Twenty-Fifth Annual SAS Users Group International Conference, Indianapolis, IN, 214-26. SAS Institute Inc. 2004. “SAS Procedures: The LOGISTIC Procedure”. SAS OnlineDoc® 9.1.3. Cary, NC: SAS Institute Inc. http://support.sas.com/documentation/onlinedoc/91pdf/sasdoc_91/stat_ug_7313.pdf CONTACT INFORMATION Your comments and questions are valued and encouraged. Contact the author at: R. Scott Leslie MedImpact Healthcare Systems, Inc. 10680 Treena Street San Diego, CA 92131 Work Phone: 858-790-6685 Fax: 858-689-1799 Email: [email protected] SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies. 7

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Using Propensity Scores To Adjust For Treatment Selection Bias