Documentation for R code: Rihc
Method: This method provides power estimates, as well as estimated hazard rates and
classification error rates, for an immunohistochemistry study following up on a gene
expression study. Required are data from the gene expression study, including a training
set and test set, and two groups (e.g., two diagnoses). Also required are survival data
from each of the subjects. The methodology is described in:
Betensky RA, Nutt CL, Batchelor TT, Louis DN (2004). Statistical Considerations for
Immunohistochemistry Panel Development Following Gene Expression Profiling of
Human Cancers.
Software: This code is for the freely available R programming language
(http://www.r-project.org/), which can be run on a PC or on a UNIX platform.
Data: The gene expression data should be placed in a CSV file, called “genes.csv,”
with one row for each subject, the first column for group (e.g., in our example, “GBM” or
“Oligo”), and the remaining columns for genes. The file should contain the gene
expression values for the 2X most differentially expressed genes between the two groups.
The first X columns should contain the genes that are most highly differentially
expressed in group 1 and the second X columns should contain the genes that are most
highly differentially expressed in group 2. Within each set of X columns, the genes
should be ordered by degree of differential expression. The first set of subjects should be
the training set (i.e., in our example, the “classic” brain tumors), and among them, the
first set should be from group 1 (i.e., in our example, the GBMs) and the second set from
group 2 (i.e., the oligos). The second set of subjects should be the test set (i.e., in our
example, the “non-classic” brain tumors).
The survival data should be placed in a second CSV file, called “ptinfo.csv,” with
one row for each subject (ordered as in genes.csv), one column containing time to death
or last follow-up with column header="surv", one column containing a death indicator
(i.e., coded “1” if subject died and “0” if subject alive) with column header="cens", and
one column containing a group indicator (i.e., in our example, “GBM” or “Oligo”), with
column header="Path".
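The layout rules above can be checked before running the simulation. The following is a minimal sketch, not part of Rihc: the helper name check_inputs and the toy data frames are hypothetical and stand in for the real genes.csv and ptinfo.csv.

```r
# Hypothetical helper (not part of Rihc): sanity-check the two input tables.
check_inputs <- function(genes, ptinfo) {
  # Column 1 of genes.csv holds the group label; the remaining columns are genes.
  n_genes <- ncol(genes) - 1
  # 2X genes are expected: X for group 1 followed by X for group 2.
  stopifnot(n_genes > 0, n_genes %% 2 == 0)
  # ptinfo.csv must have one row per subject, in the same order as genes.csv.
  stopifnot(nrow(genes) == nrow(ptinfo))
  # Required column headers in ptinfo.csv.
  stopifnot(all(c("surv", "cens", "Path") %in% names(ptinfo)))
  # Death indicator is coded 1 (died) or 0 (alive).
  stopifnot(all(ptinfo$cens %in% c(0, 1)))
  invisible(n_genes / 2)  # X, the number of genes per group
}

# With real data one would read the files first:
#   genes  <- read.csv("genes.csv")
#   ptinfo <- read.csv("ptinfo.csv")
# Toy stand-in data (illustrative values only, X = 1):
genes <- data.frame(group = c("GBM", "GBM", "Oligo", "GBM"),
                    g1 = c(5.1, 4.8, 1.2, 3.0),
                    g2 = c(0.9, 1.1, 6.3, 2.2))
ptinfo <- data.frame(surv = c(10, 14, 60, 22),
                     cens = c(1, 1, 0, 1),
                     Path = c("GBM", "GBM", "Oligo", "GBM"))
X <- check_inputs(genes, ptinfo)
```

The check does not verify the ordering of subjects or genes, which cannot be inferred from the files alone.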
Inputs: The following are required inputs to the program, “simrun”:
assaynum=number of assays initially considered for development
n=number of simulations to be run to compute estimated power and other measures
pdiff=probability that a gene that displayed differential gene expression between the two
groups also displays differential protein expression
pos, pos1, pos2, pos3, pm, pm1, pm2, pm3=probabilities for the model linking gene expression
data to IHC data. The premise of this simulation is that the user has gene expression data
available and needs to simulate IHC data in order to estimate the power available as a
function of the number of IHC assays developed. The model is a probability model that is
based on the observed gene expression value and the median expression value for that
gene. The values are cumulative probabilities, so pos<=pos1<=pos2<=pos3<=1 and
pm<=pm1<=pm2<=pm3<=1:
Probability(IHC=4+ given expression>1.25median)=pos
Probability(IHC=3+ given expression>1.25median)=pos1-pos
Probability(IHC=2+ given expression>1.25median)=pos2-pos1
Probability(IHC=1+ given expression>1.25median)=pos3-pos2
Probability(IHC=0 given expression>1.25median)=1-pos3
Probability(IHC=0 given expression<0.75median)=pos
Probability(IHC=1+ given expression<0.75median)=pos1-pos
Probability(IHC=2+ given expression<0.75median)=pos2-pos1
Probability(IHC=3+ given expression<0.75median)=pos3-pos2
Probability(IHC=4+ given expression<0.75median)=1-pos3
Probability(IHC=2+ given 0.75median<=expression<=1.25median)=pm
Probability(IHC=1+ given 0.75median<=expression<=1.25median)=pm1-pm
Probability(IHC=3+ given 0.75median<=expression<=1.25median)=pm2-pm1
Probability(IHC=4+ given 0.75median<=expression<=1.25median)=pm3-pm2
Probability(IHC=0 given 0.75median<=expression<=1.25median)=1-pm3
For example, if pos=0.75, pos1=0.90, pos2=0.95, pos3=1.0, and if a particular subject’s
gene expression value is greater than 1.25 times the median expression value for that
gene, then that individual will be assigned to be a 4+ with probability 0.75, a 3+ with
probability 0.15, a 2+ with probability 0.05, a 1+ with probability 0.05, and a 0 with
probability 0.0.
Alternatively, if that subject’s gene expression value is less than 0.75 times the median
expression value for that gene, then that individual will be assigned to be a 0 with
probability 0.75, a 1+ with probability 0.15, a 2+ with probability 0.05, a 3+ with
probability 0.05, and a 4+ with probability 0.0.
Lastly, if pm=0.5, pm1=0.75, pm2=1.0, pm3=1.0, and that subject’s gene expression is
between 0.75 and 1.25 times the median expression for that gene (inclusive), then that
individual will be assigned to be a 2+ with probability 0.50, a 1+ with probability 0.25,
a 3+ with probability 0.25, a 4+ with probability 0.0, and a 0 with probability 0.0.
Users should select values for these probabilities to best reflect their understanding of the
relationship between the gene expression data and the IHC data.
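The probability model above can be written out in a few lines of R. This is only a sketch of the model as documented here, not the code inside Rihc: the function names ihc_probs and simulate_ihc are hypothetical, and the thresholds assume a positive median expression value.

```r
# Hypothetical helper: probability of each IHC score (0 to 4) for one subject
# and one gene, given the expression value and that gene's median expression.
ihc_probs <- function(expr, med, pos, pos1, pos2, pos3, pm, pm1, pm2, pm3) {
  if (expr > 1.25 * med) {
    # High expression: the cumulative probabilities run from 4+ down to 0.
    cum <- c(pos, pos1, pos2, pos3); scores <- c(4, 3, 2, 1, 0)
  } else if (expr < 0.75 * med) {
    # Low expression: the same cumulative probabilities run from 0 up to 4+.
    cum <- c(pos, pos1, pos2, pos3); scores <- c(0, 1, 2, 3, 4)
  } else {
    # Middle range: the order is 2+, 1+, 3+, 4+, 0.
    cum <- c(pm, pm1, pm2, pm3); scores <- c(2, 1, 3, 4, 0)
  }
  # Successive differences turn cumulative values into category probabilities;
  # the last category receives whatever probability remains.
  probs <- diff(c(0, cum, 1))
  stopifnot(all(probs >= 0))  # fails if the inputs are not non-decreasing
  names(probs) <- scores
  probs
}

# Hypothetical helper: draw one simulated IHC score from those probabilities.
simulate_ihc <- function(...) {
  p <- ihc_probs(...)
  as.integer(sample(names(p), size = 1, prob = p))
}

# Probabilities for a subject whose expression is twice the median,
# using the example values from the text.
p_high <- ihc_probs(expr = 2.0, med = 1.0,
                    pos = 0.75, pos1 = 0.90, pos2 = 0.95, pos3 = 1.0,
                    pm = 0.5, pm1 = 0.75, pm2 = 1.0, pm3 = 1.0)
round(p_high, 2)
```

Calling ihc_probs with different candidate values is a quick way to see which IHC distributions a given choice of pos, pos1, pos2, pos3 and pm, pm1, pm2, pm3 implies before running the full simulation.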
opt=estimated success rate of optimizing commercially available antibodies for
immunohistochemical assays on formalin-fixed paraffin-embedded tissues
ntrain=total number of tumors in original training set (e.g., “classic” brain tumors in
Nutt et al. data in our example=21)
ntrain1=total number of tumors of tumor type 1 in training set (e.g., “classic” GBMs in
Nutt et al. in our example=14)
ntrain2=total number of tumors of tumor type 2 in training set (e.g., “classic” oligos in
Nutt et al. in our example=7)
ntot=total number of tumors used in original analysis (i.e., total number in original
training set plus total number in test set in Nutt et al. original analysis=50)
newtrain1=total number of tumors of tumor type 1 expected for training set in new data
(i.e., number in MGH data set in our example=135)
newtrain2=total number of tumors of tumor type 2 expected for training set in new data
(i.e., number in MGH data set in our example=23)
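Before running the program, it can help to collect the inputs in one place and check that they are mutually consistent. The sketch below is not part of Rihc: the helper validate_params is hypothetical, the sample sizes and probability values are those of the example in this document, and the values of assaynum, n, pdiff and opt are illustrative placeholders only.

```r
# Inputs gathered in a named list.
params <- list(
  assaynum = 10, n = 100, pdiff = 0.5, opt = 0.5,  # placeholders, not recommendations
  pos = 0.75, pos1 = 0.90, pos2 = 0.95, pos3 = 1.0,
  pm = 0.5, pm1 = 0.75, pm2 = 1.0, pm3 = 1.0,
  ntrain = 21, ntrain1 = 14, ntrain2 = 7, ntot = 50,
  newtrain1 = 135, newtrain2 = 23
)

# Hypothetical helper: basic consistency checks on the inputs.
validate_params <- function(p) {
  # The two tumor types must add up to the training set, which is part of ntot.
  stopifnot(p[["ntrain1"]] + p[["ntrain2"]] == p[["ntrain"]],
            p[["ntrain"]] <= p[["ntot"]])
  # Cumulative probabilities must lie in [0, 1] and be non-decreasing.
  cum_ok <- function(v) all(v >= 0 & v <= 1) && !is.unsorted(v)
  stopifnot(cum_ok(c(p[["pos"]], p[["pos1"]], p[["pos2"]], p[["pos3"]])),
            cum_ok(c(p[["pm"]], p[["pm1"]], p[["pm2"]], p[["pm3"]])))
  # pdiff and opt are probabilities.
  stopifnot(p[["pdiff"]] >= 0, p[["pdiff"]] <= 1,
            p[["opt"]] >= 0, p[["opt"]] <= 1)
  invisible(TRUE)
}

ok <- validate_params(params)
```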
Running the program: To run the program from within R, source the code by
typing:
source("Rihc")
then select values for the input parameters and type:
simrun(assaynum,n,pdiff,pos,pos1,pos2,pos3,pm,pm1,pm2,pm3,opt,ntrain,ntrain1,ntot,newtrain1,newtrain2)
Run the program several times under different values for “assaynum” to evaluate the
power for various numbers of assays to consider for development, and to evaluate the
effect of changing the probability model (i.e., values of pos, pos1, pos2, pos3, pm, pm1,
pm2, pm3), as well as estimates of pdiff (i.e., relationship between gene expression and
IHC outcome) and opt (i.e., estimate of success rate of antibody optimization).
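Repeated runs over several values of assaynum can be scripted. The following is a sketch under stated assumptions: run_sweep and fake_simrun are hypothetical names, fake_simrun is a placeholder that only echoes its input so the block runs on its own, and passing arguments by name assumes that the argument names of simrun match the input names listed above. How simrun returns or prints its results is not described in this document, so the sketch simply keeps whatever each call returns.

```r
# Hypothetical helper: call a simulation function once per assaynum value
# and collect the returned objects in a named list.
run_sweep <- function(run_fn, assaynums, ...) {
  results <- lapply(assaynums, function(a) run_fn(assaynum = a, ...))
  names(results) <- paste0("assaynum_", assaynums)
  results
}

# Placeholder standing in for simrun so that this example is self-contained;
# it performs no simulation and only records the requested assaynum.
fake_simrun <- function(assaynum, ...) list(assaynum = assaynum)

res <- run_sweep(fake_simrun, assaynums = c(5, 10, 20), n = 100)
names(res)
```

After sourcing Rihc, the placeholder would be replaced by simrun and the remaining inputs supplied in place of n = 100.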
Output:
actualassaynumber=actual number of assays to be developed after paring down to those
genes for which there is differential protein expression and for which an antibody is
available
errororig=error rate in classification based on simulated IHC outcomes when applied to
original test set
errornew=estimated error rate in classification based on simulated IHC outcomes when
applied to new test set
pownew=estimated power for detecting a non-unity hazard ratio with respect to IHC-based
classification, after adjusting for group (e.g., pathologic diagnosis)
haznew=estimated hazard ratio with respect to IHC-based classification, after adjusting
for group (e.g., pathologic diagnosis)