Download Documentation for R code

Documentation for R code: Rihc

Method: This method provides power estimates, as well as estimated hazard ratios and classification error rates, for an immunohistochemistry (IHC) study following up on a gene expression study. It requires data from the gene expression study, including a training set and a test set, with subjects drawn from two groups (e.g., two diagnoses). It also requires survival data for each subject. The methodology is described in: Betensky, RA, Nutt, CL, Batchelor, TT, Louis, DN (2004). Statistical Considerations for Immunohistochemistry Panel Development Following Gene Expression Profiling of Human Cancers.

Software: This code is for the freely available R programming language (http://www.r-project.org/), which can be run on a PC or on a UNIX platform.

Data: The gene expression data should be placed in a CSV file called “genes.csv,” with one row for each subject, the first column for group (in our example, “GBM” or “Oligo”), and the remaining columns for genes. The file should contain the gene expression values for the 2X most differentially expressed genes between the two groups. The first X gene columns should contain the genes that are most highly differentially expressed in group 1, and the second X gene columns should contain the genes that are most highly differentially expressed in group 2. Within each set of X columns, the genes should be ordered by degree of differential expression.

The rows should be ordered as follows. The first set of subjects should be the training set (in our example, the “classic” brain tumors); within the training set, the subjects from group 1 (in our example, the GBMs) should come first, followed by the subjects from group 2 (the oligos). The second set of subjects should be the test set (in our example, the “non-classic” brain tumors).
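As a rough illustration of the genes.csv layout described above, the following R sketch writes a toy file and checks its row and column ordering. Everything in it is hypothetical and not part of Rihc: the toy values, the choice X = 2, the column names (group, g1_1, ...), and the helper check_genes_layout. The training-set sizes are passed in by hand because the file itself does not mark where the test set begins.

```r
# Hypothetical helper (not part of Rihc): checks that a data frame read from a
# genes.csv-style file follows the documented layout.
check_genes_layout <- function(genes, ntrain, ntrain1, group1) {
  grp <- as.character(genes[[1]])        # first column holds the group label
  ngenes <- ncol(genes) - 1              # all remaining columns are genes
  train <- grp[seq_len(ntrain)]          # training subjects come first
  list(
    even_gene_count = ngenes %% 2 == 0,  # 2X genes: X per group
    X = ngenes / 2,
    # Within the training set, group 1 rows come first, then group 2 rows.
    group1_first = all(head(train, ntrain1) == group1),
    group2_second = all(tail(train, ntrain - ntrain1) != group1),
    n_test = nrow(genes) - ntrain        # whatever follows is the test set
  )
}

# Toy data: 3 training rows (2 GBM, then 1 Oligo) followed by 2 test rows.
set.seed(1)
toy <- data.frame(
  group = c("GBM", "GBM", "Oligo", "GBM", "Oligo"),
  g1_1 = runif(5), g1_2 = runif(5),  # X = 2 genes for group 1
  g2_1 = runif(5), g2_2 = runif(5)   # X = 2 genes for group 2
)

path <- file.path(tempdir(), "genes.csv")
write.csv(toy, path, row.names = FALSE)

genes <- read.csv(path, stringsAsFactors = FALSE)
layout <- check_genes_layout(genes, ntrain = 3, ntrain1 = 2, group1 = "GBM")
print(layout)
```

A check of this kind cannot confirm that the genes are ordered by degree of differential expression; that ordering has to come from the original gene expression analysis.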
The survival data should be placed in a second CSV file called “ptinfo.csv,” with one row for each subject (ordered as in genes.csv) and the following columns, identified by their column headers:
surv = time to death or last follow-up
cens = death indicator (coded “1” if the subject died and “0” if the subject is alive)
Path = group indicator (in our example, “GBM” or “Oligo”)

Inputs: The following are required inputs to the program, “simrun”:
assaynum = number of assays initially considered for development
n = number of simulations to be run to compute estimated power and other measures
pdiff = probability that a gene that displayed differential gene expression between the two groups also displays differential protein expression
pos, pos1, pos2, pos3, pm, pm1, pm2, pm3 = probabilities for the model linking gene expression data to IHC data

The premise of this simulation is that the user has gene expression data available and needs to simulate IHC data in order to estimate the power available as a function of the number of IHC assays developed. The model is a probability model based on the observed gene expression value and the median expression value for that gene.
When expression > 1.25*median:
Probability(IHC=4+) = pos
Probability(IHC=3+) = pos1-pos
Probability(IHC=2+) = pos2-pos1
Probability(IHC=1+) = pos3-pos2
Probability(IHC=0) = 1-pos3

When expression < 0.75*median:
Probability(IHC=0) = pos
Probability(IHC=1+) = pos1-pos
Probability(IHC=2+) = pos2-pos1
Probability(IHC=3+) = pos3-pos2
Probability(IHC=4+) = 1-pos3

When 0.75*median <= expression <= 1.25*median:
Probability(IHC=2+) = pm
Probability(IHC=1+) = pm1-pm
Probability(IHC=3+) = pm2-pm1
Probability(IHC=4+) = pm3-pm2
Probability(IHC=0) = 1-pm3

For example, if pos=0.75, pos1=0.90, pos2=0.95, pos3=1.0, and a particular subject’s gene expression value is greater than 1.25 times the median expression value for that gene, then that individual will be assigned to be a 4+ with probability 0.75, a 3+ with probability 0.15, a 2+ with probability 0.05, a 1+ with probability 0.05, and a 0 with probability 0.0.

Alternatively, if that subject’s gene expression value is less than 0.75 times the median expression value for that gene, then that individual will be assigned to be a 0 with probability 0.75, a 1+ with probability 0.15, a 2+ with probability 0.05, a 3+ with probability 0.05, and a 4+ with probability 0.0.

Lastly, if pm=0.5, pm1=0.75, pm2=1.0, pm3=1.0, and that subject’s gene expression value is between 0.75 and 1.25 times the median expression value for that gene (inclusive), then that individual will be assigned to be a 2+ with probability 0.50, a 1+ with probability 0.25, a 3+ with probability 0.25, a 4+ with probability 0.0, and a 0 with probability 0.0.
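The probability model above can be sketched in R as follows. This is only an illustration of how the cumulative cut points translate into category probabilities and how a single IHC score might be drawn from them; the helpers cut_to_probs and ihc_probs are hypothetical names, not functions from Rihc, and the actual code may be organized differently.

```r
# Turn four cumulative cut points into five category probabilities, in the
# order the categories are listed for each case of the model.
cut_to_probs <- function(p, p1, p2, p3) {
  c(p, p1 - p, p2 - p1, p3 - p2, 1 - p3)
}

# Hypothetical helper: probabilities of IHC scores 0, 1+, ..., 4+ for one
# expression value `expr`, given the gene's median expression `med`.
ihc_probs <- function(expr, med, pos, pos1, pos2, pos3, pm, pm1, pm2, pm3) {
  probs <- numeric(5)
  names(probs) <- c("0", "1+", "2+", "3+", "4+")
  if (expr > 1.25 * med) {
    # High expression: most mass on 4+, then 3+, 2+, 1+, 0.
    probs[c("4+", "3+", "2+", "1+", "0")] <- cut_to_probs(pos, pos1, pos2, pos3)
  } else if (expr < 0.75 * med) {
    # Low expression: the mirror image, most mass on 0.
    probs[c("0", "1+", "2+", "3+", "4+")] <- cut_to_probs(pos, pos1, pos2, pos3)
  } else {
    # Near the median: most mass on 2+, then 1+, 3+, 4+, 0.
    probs[c("2+", "1+", "3+", "4+", "0")] <- cut_to_probs(pm, pm1, pm2, pm3)
  }
  probs
}

# Cut points from the worked example in the text; the median is set to 1 so
# that expression values read directly as multiples of the median.
high <- ihc_probs(1.5, 1, 0.75, 0.90, 0.95, 1.0, 0.5, 0.75, 1.0, 1.0)
low  <- ihc_probs(0.5, 1, 0.75, 0.90, 0.95, 1.0, 0.5, 0.75, 1.0, 1.0)
mid  <- ihc_probs(1.0, 1, 0.75, 0.90, 0.95, 1.0, 0.5, 0.75, 1.0, 1.0)
print(rbind(high, low, mid))

# Draw one simulated IHC score for the high-expression case.
set.seed(1)
simulated_score <- sample(names(high), size = 1, prob = high)
```

Each row of the printed matrix sums to 1, which is a quick check that a chosen set of cut points is non-decreasing and does not exceed 1.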
Users should select values for these probabilities to best reflect their understanding of the relationship between the gene expression data and the IHC data.

The remaining inputs are:
opt = estimated success rate of optimizing commercially available antibodies for immunohistochemical assays on formalin-fixed, paraffin-embedded tissues
ntrain = total number of tumors in the original training set (e.g., the “classic” brain tumors in the Nutt et al. data; in our example, 21)
ntrain1 = total number of tumors of tumor type 1 in the training set (e.g., the “classic” GBMs in Nutt et al.; in our example, 14)
ntrain2 = total number of tumors of tumor type 2 in the training set (e.g., the “classic” oligos in Nutt et al.; in our example, 7)
ntot = total number of tumors used in the original analysis (i.e., the number in the original training set plus the number in the test set; in the Nutt et al. original analysis, 50)
newtrain1 = total number of tumors of tumor type 1 expected for the training set in the new data (i.e., the number in the MGH data set; in our example, 135)
newtrain2 = total number of tumors of tumor type 2 expected for the training set in the new data (i.e., the number in the MGH data set; in our example, 23)

Running the program: To run the program from within R, source the code by typing:

    source("Rihc")

then select values for the input parameters and type:

    simrun(assaynum,n,pdiff,pos,pos1,pos2,pos3,pm,pm1,pm2,pm3,opt,ntrain,ntrain1,ntot,newtrain1,newtrain2)

Run the program several times under different values of “assaynum” to evaluate the power for various numbers of assays to consider for development. Also vary the probability model (i.e., the values of pos, pos1, pos2, pos3), the estimate of pdiff (i.e., the relationship between gene expression and IHC outcome), and the estimate of opt (i.e., the success rate of antibody optimization) to evaluate their effects.
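One way to organize repeated runs over different values of assaynum is sketched below. The sample sizes and the pos/pm cut points are taken from the examples in this documentation; the values of n, pdiff, and opt and the grid of assaynum values are placeholder assumptions chosen only for illustration. The arguments are passed by position, in the order of the simrun call shown above, and the loop runs only if simrun has already been loaded with source("Rihc").

```r
# Arguments shared by every run, in the order simrun expects them after
# assaynum. Values marked "assumption" are placeholders, not recommendations.
base_args <- list(
  n = 100,                                          # assumption
  pdiff = 0.5,                                      # assumption
  pos = 0.75, pos1 = 0.90, pos2 = 0.95, pos3 = 1.0, # from the worked example
  pm = 0.5, pm1 = 0.75, pm2 = 1.0, pm3 = 1.0,       # from the worked example
  opt = 0.5,                                        # assumption
  ntrain = 21, ntrain1 = 14, ntot = 50,             # Nutt et al. example
  newtrain1 = 135, newtrain2 = 23                   # MGH example
)

assay_grid <- c(5, 10, 20)  # assumption: candidate numbers of assays

if (exists("simrun")) {
  # One run per candidate assaynum; whatever simrun returns is stored as-is.
  results <- lapply(assay_grid, function(a) {
    do.call(simrun, unname(c(list(a), base_args)))
  })
  names(results) <- paste0("assaynum_", assay_grid)
} else {
  message("simrun not found; run source(\"Rihc\") first.")
}
```

The same pattern can be reused to vary pdiff, opt, or the cut points by replacing the corresponding entry of base_args inside the loop.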
Output:
actualassaynumber = actual number of assays to be developed after paring down to those genes for which there is differential protein expression and for which an antibody is available
errororig = error rate in classification based on simulated IHC outcomes when applied to the original test set
errornew = estimated error rate in classification based on simulated IHC outcomes when applied to the new test set
pownew = estimated power for detecting a non-unity hazard ratio with respect to IHC-based classification, after adjusting for group (e.g., pathologic diagnosis)
haznew = estimated hazard ratio with respect to IHC-based classification, after adjusting for group (e.g., pathologic diagnosis)
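To make the phrase “hazard ratio with respect to IHC-based classification, after adjusting for group” concrete, the sketch below fits a Cox proportional hazards model to simulated data using the survival package. It is an assumption that a Cox model is the relevant analysis; this documentation does not state how Rihc computes haznew or pownew. The data are simulated, and the column ihc_class and its labels are hypothetical; surv, cens, and Path follow the ptinfo.csv column headers.

```r
library(survival)  # recommended package distributed with R

set.seed(42)
n_subj <- 200

# Simulated subjects: pathologic group plus a hypothetical IHC-based class.
pt <- data.frame(
  Path = sample(c("GBM", "Oligo"), n_subj, replace = TRUE),
  ihc_class = sample(c("classA", "classB"), n_subj, replace = TRUE)
)

# Exponential event times whose rate depends on both covariates,
# with independent exponential censoring.
rate <- 0.1 * exp(0.7 * (pt$ihc_class == "classB") + 0.5 * (pt$Path == "GBM"))
event_time <- rexp(n_subj, rate)
censor_time <- rexp(n_subj, 0.05)
pt$surv <- pmin(event_time, censor_time)           # time to death or last follow-up
pt$cens <- as.integer(event_time <= censor_time)   # 1 = died, 0 = alive

# Cox model: effect of IHC-based class, adjusted for pathologic group.
fit <- coxph(Surv(surv, cens) ~ ihc_class + Path, data = pt)

hr <- exp(coef(fit)[["ihc_classclassB"]])          # adjusted hazard ratio
p_value <- summary(fit)$coefficients["ihc_classclassB", "Pr(>|z|)"]
print(c(hazard_ratio = hr, p_value = p_value))
```

Repeating a fit like this over many simulated data sets and recording how often the test of a hazard ratio equal to 1 rejects gives a simulation-based power estimate, which is the general idea behind a quantity such as pownew.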
