Paper PR03-2009

Translation of Drug Metabolic Enzyme and Transporter (DMET) Genetic Variants into Star Allele Notation using SAS

Mark Farmen, Eli Lilly and Company, Indianapolis, IN
William Koh, Lilly Singapore CDD, Republic of Singapore
Sandra L Close, Eli Lilly and Company, Indianapolis, IN

ABSTRACT

Conversion of genotyping results from metabolic enzyme and transporter (DMET) genetic assays into the consensus star allele nomenclature is necessary for clinical utilization of DMET genetics. The problem involves translation of variant genotypes, such as single nucleotide polymorphisms (SNPs) or small insertion-deletion events, into gene level star allele nomenclature. Genetic variants in the DMET genes can occur in combinations. A star allele is defined in some instances by a single variant and often by one of these combinations. When and how these variants combine is often poorly understood. However, given adequate definitions, a simple algorithm based on vector addition and comparison can be easily implemented in SAS proc IML to perform the conversion. The simplicity and transparency of this algorithm can help de-mystify the translation process and allow statisticians and data management experts to handle this new type of patient information.

INTRODUCTION

The cytochrome P450 superfamily of enzymes (CYPs), together with other enzyme classes and transport proteins, has important roles in the uptake, distribution, metabolism and excretion of a host of therapeutic drugs and other xenobiotic molecules (Lewis (2005) and Cascorbi (2006)). Extensive literature evidence exists that significant variability in individual drug disposition and response is commonplace. Much of this observed heterogeneity is believed to be due to the underlying genetic variation found in metabolism and transport genes.
The functional consequences or predicted phenotypes resulting from genetic variation, such as single nucleotide polymorphisms, deletions, insertions, and gene duplications, in these enzymes and transporters can markedly influence drug pharmacokinetics or alter efficacy and/or toxicity profiles. An early example is the metabolism of the hypertension drug debrisoquine, which was shown to be influenced by common genetic variants in the CYP2D6 gene (Gonzalez et al. (1988)). More recently, Mega et al. (2009) have demonstrated that genetic variation in CYP2C19 affects clopidogrel metabolism, with subsequent effects on pharmacodynamic response and cardiovascular event rates.

DNA can be thought of as a sequence of nucleotides denoted by A, C, G and T. Locations along the DNA sequence that differ from person to person in the nucleotide(s) present at that location are called genetic variants. Currently, relatively inexpensive genotyping platforms can readily determine point variations by determining which nucleotide (A, C, G or T) pair exists at a distinct location along each of the chromosomes for a patient (see Figure 1). This common type of genetic variant is called a single nucleotide polymorphism (SNP). The variant is defined by its location on the chromosome and the composition of nucleotides seen at this locus. Since any given gene is represented by two copies (one on the paternal chromosome and one on the maternal chromosome), data collected from these platforms is a genotype consisting of a pair of nucleotides (e.g. CC, TC or TT - for a C/T variant) at the locus. However, in many cases information about which of the two chromosomes contributed which alleles is not known due to limitations of the genotyping technology. In addition to SNPs, an alternate but equally important type of biallelic genetic variation in metabolism and transport genes is the insertion/deletion (IN/DEL).
At a locus within a gene, nucleotides are either present (insertion) or missing (deletion) (see Figure 2).

[Figure 1 graphic: CYP4B1 gene from one of two chromosomes illustrating variant 517C>T. Sequence along the DNA location axis: …AGTTATCCAG … AGAGAAAGCT[C (>T)]GGGAGGGTAA …, with the variant locus at bp 517. C is the major allele (wild type) and T is the minor allele (mutant).]

Figure 1. Example genetic variant 517C>T from http://www.cypalleles.ki.se/cyp4b1.htm. The units along the axis are base pairs, labeled with a nucleotide position number. At the relative location, 517, the usual C nucleotide (major allele) can be a variant T nucleotide (minor allele). The presence of T changes how the resulting liver enzyme functions.

[Figure 2 graphic: CYP4B1 gene from one of two chromosomes illustrating variant 881_882delAT. Sequence along the DNA location axis: …AGTTATCCAG … CCCTAACCCAGG [AT>(-)] GAAGATGACATC…, with the variant locus at bp 881. "-" indicates that AT are missing from the location (a Del).]

Figure 2. Example genetic variant 881_882delAT from http://www.cypalleles.ki.se/cyp4b1.htm. At the relative location 881-882, the major AT nucleotides (INS major allele) can be a “variant” deletion in which these two nucleotides are missing from the sequence at this position (DEL minor allele indicates the missing “AT” at 881). Again, the absence of the AT changes how a protein is made from the genetic template, resulting in liver enzyme dysfunction for this allele.

The resulting data for the SNP at position 517 and the IN/DEL at 881 in allele pairs for four hypothetical patients is shown in Table 1. For each of the loci, major (represented in greater than 50% of the population) and minor (represented in less than 50% of population) alleles are customarily defined. Worth noting is patient 2 with one minor T allele and one minor DEL allele.
A limitation of almost all current genotyping platforms is the inability to determine whether the 517C>T “T” is on the same chromosome as the 881_882delAT “DEL.” This lack of chromosomal information can be problematic, as most genes have multiple loci with defined variation on a particular chromosome (e.g. maternal). Although much of this variation is rare, and observing the variants in combination is infrequent, some of the known functional variation is quite common. The inability to determine if the person has one copy of the gene with two minor variants or two copies of the gene with one minor variant represented in each copy may lead to difficulty in interpreting the functional consequences. For some of the metabolism and transporter genes, there has been extensive research on how these variants are inherited, which generally aids in inferring which chromosomes have which of the observed variants (minor alleles).

patient   517C>T   881_882delAT
1         CC       INS/INS
2         CT       INS/DEL
3         TT       INS/DEL
4         CC       DEL/DEL

Table 1. Example genetic data on two variants for 4 patients.

For many genes in drug metabolism and transport, a “common consensus nomenclature” has been developed, based upon the presence or absence of defined genetic variants. Translation of multiple locus level genetic variants within a gene into this standardized nomenclature, known as star alleles, has become commonplace. Nebert and Ingelman-Sundberg have maintained star allele nomenclature and annotations for many of the Cytochrome P450 genes (http://www.cypalleles.ki.se/ and Sim and Ingelman-Sundberg (2006)). For some non-CYP drug metabolism enzymes or transporters, there are similar sites that maintain star allele nomenclature [UDP-glucuronosyltransferase (UGT) (http://galien.pha.ulaval.ca/alleles/alleles.html) and N-acetyltransferase (NAT) (http://louisville.edu/medschool/pharmacology/NAT.html)], as pointed out in Robarge et al. (2007).
For clinical utility, a transparent and flexible means of translating locus level variants to star allele genotypes is needed. Star alleles are typically a multi-locus variant known as a haplotype (haploid meaning from one chromosome vs. diploid from a pair of chromosomes). There are very powerful methods for determining haplotype pairs that are consistent with a given set of locus level genotypes (e.g. Schaid et al. (2002)). These methods assume that the haplotypes are unknown and do not make use of information outside the sample genotypes being analyzed. For DMET genes, there is considerable literature on known gene haplotypes/star alleles. Given a set of variants at multiple loci, publicly available information can be used by a clinical expert to create star allele definitions. These will be referred to as translation tables. The translation involves finding star allele pairs from the translation table that are consistent with a set of locus level genotypes. In addition to the need to translate genetic information appropriately, when genotyping is performed on a single or few variants rather than comprehensively for the gene, missing information regarding these multilocus variants may lead to misinterpretation of the patient’s function.

In the quest to utilize pharmacogenetic information for tailoring drug prescribing, a multiplex platform has been developed by Affymetrix® to genotype variants at roughly 1000 loci in Drug Metabolic Enzyme and Transporter (DMET™) genes (Dumaual et al. (2007) and Daly et al. (2007)). The data discussed here was generated by an early version of the Affymetrix DMET™ Plus Premier Pack (http://www.affymetrix.com/products_services/arrays/specific/dmet.affx). This report describes efforts to create an automated translation algorithm which transforms individual genetic variant data for 22 genes and 165 variants into the common consensus star allele nomenclature.
TRANSLATION TABLE

To demonstrate the method, the cytochrome P450 enzyme CYP4B1 was selected. The Karolinska website lists variants at 7 loci that are used to define the star alleles (see Table 2). The “nucleotide changes” field lists the locus level variants within the gene that must be present on a chromosome for that chromosome to have the indicated star allele. For example, 517C>T indicates that the major allele, C, is replaced by the variant or minor allele, T, at the 517 nucleotide position in the CYP4B1 gene. If this variant occurs on a chromosome without any other listed variants then the chromosome has the CYP4B1*3 allele. The star allele provides a description of one copy of the patient’s gene.

It is useful to think of the star alleles as vectors and the variants as vector components (i.e. dimensions or fields). Table 3 lists variants as fields, and star alleles are defined by an “x” indicating that this variant must be present to define the allele. No “x” means the variant (minor allele) is not present. Table 3 is a translation table.

Comparing Tables 2 and 3, the CYP4B1*6 allele is missing and the 1033G>A variant is missing. In a multiplex assay, it happens on occasion that one of several variants used to define an allele is missing. When markers are missing, decisions have to be made about how to modify the translation table in terms of its star allele definitions. This requires intimate knowledge of the gene and its variants. Missing locus level variants (i.e. not in the multiplex assay) cause some star alleles to be indistinguishable from one another. The last section shows manipulations of Table 2 as a SAS dataset that identify these “aliased” star alleles. The logic is very similar to that of translation, so many of the details behind the code will be left to the reader. For CYP4B1, only the 1033G>A variant is missing and CYP4B1*6 cannot be distinguished from CYP4B1*3.
For demonstration purposes, the variant pattern with an x only for 517C>T, which cannot distinguish CYP4B1*6 from CYP4B1*3, will be assigned the star allele CYP4B1*3.

The translation table, shown in Table 3, will be used to translate locus level variant data into gene level star allele pairs (i.e. star allele genotypes). As mentioned above, the “x” indicates the variant(s) that defines the star allele. For a copy of the gene (1 chromosome), the minor allele must be present for the indicated variants and the major allele for the other variants in order for the gene copy to be the indicated star allele. Using 0 to indicate the major allele and 1 to indicate the minor allele, the star allele, CYP4B1*2, is present on a specific chromosome if the pattern of locus level variants is 0-1-0-1-1-1.

Allele    Protein   Nucleotide changes, cDNA                 Effect                                               References
CYP4B1*1  CYP4B1.1  None                                     None
CYP4B1*2            881_882delAT; 993G>A; 1018C>T; 1123C>T   294Frameshift (premature stop); M331I; R340C; R375C  LoGuidice et al, 2002
CYP4B1*3  CYP4B1.3  517C>T                                   R173W                                                LoGuidice et al, 2002
CYP4B1*4  CYP4B1.4  964A>G                                   S322G                                                LoGuidice et al, 2002
CYP4B1*5  CYP4B1.5  993G>A                                   M331I                                                LoGuidice et al, 2002
CYP4B1*6  CYP4B1.6  517C>T; 1033G>A                          R173W; V345I                                         Hiratsuka et al., 2004
CYP4B1*7  CYP4B1.7  881_882delAT; 993G>A; 1018C>T            294Frameshift (premature stop); M331I; R340C         Hiratsuka et al., 2004

Enzyme activity columns (In_vivo, In_vitro): Normal, Normal

Additional SNPs, where the haplotype has not yet been determined:
                    1061T>G                                  F354C                                                NCBI dbSNP

Table 2. Allele nomenclature table taken from http://www.cypalleles.ki.se/cyp4b1.htm on 02/25/2009. The Nucleotide Changes field lists the variants that define each star allele. A copy of the gene must have the listed variants without any of the other variants in order for it to be the indicated star allele.
However, the locus level data consists of genotypes from two chromosomes, not just one, necessitating that the algorithm handle genetic data from the chromosome pairs simultaneously. For the following example, refer to Table 3 and its minor alleles. If the locus level variants, ordered as in Table 3, have patient genotypes CC, DelDel, AA, AA, TT, TT, then the translation would be star allele genotype CYP4B1*2/CYP4B1*2 or *2/*2 (a *2 homozygote for CYP4B1). Two of the *2 haplotypes, C-DEL-A-A-T-T, are the only star allele pair consistent with this set of variant level genotypes. At first glance, the translation appears complex. However, the simple vector algorithm described below makes finding consistent star allele pairs relatively simple.

TRANSLATION ALGORITHM

Given a set of biallelic variants, the star alleles or gene haplotypes can be converted to a vector of zeros and ones. The variants can be ordered by their relative location on the gene or by any other consistent ordering. The value of 1 is assigned if the minor allele of the variant is in the star allele and zero otherwise. For the star alleles in Table 3 ordered by the locus of the individual variants, *1=(0,0,0,0,0,0), *2=(0,1,0,1,1,1), *3=(1,0,0,0,0,0), *4=(0,0,1,0,0,0), *5=(0,0,0,1,0,0), and *7=(0,1,0,1,1,0).

CYP4B1
refSNP ID            rs4646487               NA                              rs45467195   rs2297810               rs4646491         rs2297809
Genomic variant      517C>T                  881_882delAT                    964A>G       993G>A                  1018C>T           1123C>T
Amino acid changes   R173W                   294Frameshift (premature stop)  S322G        M331I                   R340C             R375C
Affy DMET chip (v1)  CYP4B1star3(RS4646487)  CYP4B1star2_881delAT            CYP4B1star4  CYP4B1star5(RS2297810)  CYP4B1star2_R340  CYP4B1star2_R375
Validated            N                       N                               N            N                       N                 N
Haplotype            Y                       Y                               Y            Y                       Y                 Y
Major allele         C                       INS                             A            G                       C                 C
Minor allele         T                       DEL                             G            A                       T                 T
*1
*2                                           x                                            x                       x                 x
*3                   x
*4                                                                           x
*5                                                                                        x
*7                                           x                                            x                       x

Table 3. Translation table for the CYP4B1 star alleles. The nucleotide after “>”, or the “del/ins”, indicates the variant or minor allele. The major and minor alleles are listed again in the table for clarity.
The variant name used by Affymetrix® is listed to identify the data (variant order must be controlled). The genetic variants are ordered from a starting location in the gene. For this particular gene, the star alleles are defined not by a single variant but by a combination, or haplotype. The variants that define the star allele for the gene on a specific chromosome are indicated with an “x” for each of the star alleles *1-*5 and *7. *1 has no variants and it is C-INS-A-G-C-C or 0-0-0-0-0-0 in numbers of variants at each locus. The *7 allele has 3 variants and it is C-DEL-A-A-T-C or 0-1-0-1-1-0.

The patient diploid genotypes (from the chromosome pair) can also be converted to a numeric vector. The elements of the vector are the number of minor alleles at each locus (e.g. SNP). The values at any given locus are 0, 1 or 2:
0 = the variant (minor allele) is not present on either chromosome at the locus,
1 = the variant (minor allele) is present on one chromosome at the locus but not the other (heterozygous),
2 = the variant (minor allele) is present on both chromosomes at the locus.
Converting the genotypes into numbers is actually a better representation of the data. A heterozygote such as CT is now 1, with no possible implication of which nucleotide is on which chromosome.

The algorithm simply loops through all possible star allele pair vectors and compares their vector sum to the numeric patient vector that is defined by the genotypes. This is a double loop with the first index going from the first to the last star allele and the second index going from the star allele of the first index to the last star allele. A star allele pair that has a vector sum equal to the patient numeric genotype vector is a star allele genotype that is consistent with the patient’s variant level genotypes. For the six star alleles of our CYP4B1 translation (Table 3), the sum of all possible pairs of star alleles can be listed (as the loop would run).
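The double loop can be sketched in a few lines of PROC IML. This is only a minimal illustration under stated assumptions, not the macro used for the translations: the star allele vectors are typed in directly from Table 3 rather than read from a dataset, and the patient vector wpat is a made-up complete genotype with no missing loci. The name gt_call is hypothetical.

```sas
proc iml;
  /* Star allele definition vectors from Table 3, one row per star allele */
  star_al = {"*1", "*2", "*3", "*4", "*5", "*7"};
  vv = {0 0 0 0 0 0,
        0 1 0 1 1 1,
        1 0 0 0 0 0,
        0 0 1 0 0 0,
        0 0 0 1 0 0,
        0 1 0 1 1 0};

  /* Assumed patient: number of minor alleles at each of the six loci */
  wpat = {0 1 0 2 1 1};

  nallele = nrow(vv);
  do j = 1 to nallele;
    do k = j to nallele;           /* k >= j visits each unordered pair once */
      cgt = vv[j, ] + vv[k, ];     /* vector sum of the star allele pair */
      if all(cgt = wpat) then do;  /* every locus must agree */
        gt_call = concat(star_al[j], "/", star_al[k]);
        print gt_call;
      end;
    end;
  end;
quit;
```

With six star alleles the loop visits 21 pairs. For the vector chosen here, the only pair whose sum equals the patient vector is *2 with *5.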
Figure 3 shows the vector sum of each CYP4B1 star allele pair as a column vector. Comparing the patient genotype vector to the vector sums is the basis of the algorithm used to find the star allele genotype(s) that is consistent with the patient’s variant level data.

Locus          *1/*1  *1/*2  *1/*3  *1/*4  *1/*5  *1/*7  *2/*2
517C>T           0      0      1      0      0      0      0
881_882delAT     0      1      0      0      0      1      2
964A>G           0      0      0      1      0      0      0
993G>A           0      1      0      0      1      1      2
1018C>T          0      1      0      0      0      1      2
1123C>T          0      1      0      0      0      0      2

Locus          *2/*3  *2/*4  *2/*5  *2/*7  *3/*3  *3/*4  *3/*5
517C>T           1      0      0      0      2      1      1
881_882delAT     1      1      1      2      0      0      0
964A>G           0      1      0      0      0      1      0
993G>A           1      1      2      2      0      0      1
1018C>T          1      1      1      2      0      0      0
1123C>T          1      1      1      1      0      0      0

Locus          *3/*7  *4/*4  *4/*5  *4/*7  *5/*5  *5/*7  *7/*7
517C>T           1      0      0      0      0      0      0
881_882delAT     1      0      0      1      0      1      2
964A>G           0      2      1      1      0      0      0
993G>A           1      0      1      1      2      2      2
1018C>T          1      0      0      1      0      1      2
1123C>T          0      0      0      0      0      0      0

Figure 3. CYP4B1 vector sums for each pair of star alleles that could be a possible star allele genotype. The algorithm is comparing the numeric patient data vectors to these vector sums to find consistent star allele genotypes.

For missing patient locus variants, a weight matrix must be added to the comparison of star allele haploid pairs with the patient’s numeric genotype vector. By making the variant weight 0, for a missing patient genotype, this variant can be ignored in comparison of star allele vector sums at this variant. The result is generally more pairs of star alleles that are consistent with the patient’s variant level genotype data.
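The weighting idea can be sketched as follows. This is an illustration under stated assumptions rather than the production macro: a missing locus is assumed to be coded -1, the weight vector w and the weighted difference wdiff are hypothetical names, and the star allele vectors are again typed in from Table 3.

```sas
proc iml;
  star_al = {"*1", "*2", "*3", "*4", "*5", "*7"};
  vv = {0 0 0 0 0 0,
        0 1 0 1 1 1,
        1 0 0 0 0 0,
        0 0 1 0 0 0,
        0 0 0 1 0 0,
        0 1 0 1 1 0};

  /* Assumed patient with the 881_882delAT assay missing (coded -1) */
  wpat = {1 -1 0 1 1 0};
  w = (wpat >= 0);                /* weight 1 = observed locus, 0 = missing locus */

  nallele = nrow(vv);
  do j = 1 to nallele;
    do k = j to nallele;
      cgt = vv[j, ] + vv[k, ];
      wdiff = w # (wpat - cgt);   /* elementwise product zeroes the missing locus */
      if max(abs(wdiff)) = 0 then do;
        gt_call = concat(star_al[j], "/", star_al[k]);
        print gt_call;
      end;
    end;
  end;
quit;
```

For this vector the only pair that agrees at the five observed loci is *3 with *7. Setting more weights to 0 can only keep or enlarge the set of consistent pairs, which is why missing data tends to produce more candidate genotypes.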
However, there are many cases where the missing variant does not affect which star allele pairs are consistent with the non-missing patient genotypes.

SAS Proc IML (Interactive Matrix Language) is ideal for implementing the algorithm and performing the vector comparisons. It is also easy to track and store the text star allele genotypes, which have vector sums that match the patient vector. The algorithm is not efficient but it is transparent. Since clinical translations are not always well studied, transparency is crucial for working with a wide range of subject matter experts. Sometimes multiple versions of the translation tables must be tested to find definitions that are consistent with observed genotypes. Literature on some rare star alleles is quite sparse.

The translation algorithm has the following three limitations. First, the translation table must include all the important haplotypes. Due to the random nature of cross-over events during meiosis, it is possible that some important haplotypes have not been observed. Second, the variants are assumed to be of a biallelic type (normal vs. mutant: two possible versions at each locus). In the case of a variant that is not biallelic, the second limitation can often be overcome, since one locus with several (say n) variants can be represented by at most an n(n-1)/2 set of vector components. Third, copy number variants, or cases where patients do not have two gene copies, cannot easily be included in the translation. Inclusion of copy number variants is of high importance due to their functional role in some genes such as CYP2D6.

ALGORITHM IN SAS

First, input SAS datasets are needed. The imported translation table dataset is shown in Table 4. Patient variant level data is shown in Table 5. It is not uncommon for an assay measuring a locus variant to fail. A variant assay failure is noted numerically with a -1. The early Affymetrix® platform also reported possible rare alleles.
In some cases, “possible rare allele” implied that at least one chromosome carried the minor allele. If it is known that the minor allele is present on 1 or more of the chromosomes, then a -2 is entered instead of -1.

CYP4B1                       F2                      F3                              F4           F5                      F6                F7
Triallelic
refSNP ID                    rs4646487               NA                              rs45467195   rs2297810               rs4646491         rs2297809
Genetic nucleotide position  517C>T                  881_882delAT                    964A>G       993G>A                  1018C>T           1123C>T
Amino acid changes           R173W                   294Frameshift (premature stop)  S322G        M331I                   R340C             R375C
Affy DMET chip (v1)          CYP4B1star3(RS4646487)  CYP4B1star2_881delAT            CYP4B1star4  CYP4B1star5(RS2297810)  CYP4B1star2_R340  CYP4B1star2_R375
Validated                    N                       N                               N            N                       N                 N
Haplotype                    Y                       Y                               Y            Y                       Y                 Y
Major allele                 C                       INS                             A            G                       C                 C
Minor allele                 T                       DEL                             G            A                       T                 T
*1
*2                                                   x                                            x                       x                 x
*3                           x
*4                                                                                   x
*5                                                                                                x
*7                                                   x                                            x                       x

Table 4. SAS dataset pl.transtable_cyp4b1 derived from a standard import of the excel translation table shown in Table 3. The structure of the table (row/column position of information) is assumed by the macro.

Study_Protocol  Patient_Identifier  _02  _03  _04  _05  _06  _07
SUGI            eg05                  0   -1    0    1    1    1
SUGI            eg06                  1   -1    0    1    1    1
SUGI            eg01                  0    0    0    1    0    0
SUGI            eg07                  1   -1    0    1    1    0
SUGI            eg00                  0    2    0    2    2    2
SUGI            eg08                  1    1    0    1    1   -1
SUGI            eg03                  2    0    0    0    0    0
SUGI            eg02                  0    1    0    2    1    1
SUGI            eg04                 -1    0    0    0    0    0
SUGI            eg10                  0   -1    0    2    0    0
SUGI            eg09                  1    0    1    0    0    0

Table 5. SAS table: num_dat.transpose_CYP4B1 of patient locus level genotype data. Variables _02-_07 are the variants listed in the order of the translation table. The genotypes have been converted to number of minor alleles. The macro assumes numeric data is the locus level genotypes ordered as they are listed in the translation table from left to right. The -1 is used to indicate missing data for a variant. Though the patient identifiers are artificial, the variant level genotypes were observed on patient samples.

The algorithm is easiest to follow in the Proc IML code. SAS datasets are essential in manipulating the input data into a vector/matrix form that can be fed into the algorithm.
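The conversion of character genotypes to minor allele counts is not shown in the paper. One way it might be done is sketched below, under several assumptions that are not from the original: the input has one record per patient and variant, the two alleles are separated by a slash (for example C/T or INS/DEL), the minor allele is supplied on each record, and a blank genotype marks a failed assay. The dataset names geno_num and geno_wide and the variable names are hypothetical. The first four patients follow Table 1; patient 5 is invented to show the missing code.

```sas
data geno_num;
  length variant $12 gt $7 minor $3;
  infile datalines dlm=',' dsd truncover;
  input patient variant gt minor;
  /* -1 flags a failed assay; otherwise count the minor alleles in the pair */
  if missing(gt) then n_minor = -1;
  else n_minor = (scan(gt, 1, '/') = minor) + (scan(gt, 2, '/') = minor);
  datalines;
1,517C>T,C/C,T
1,881_882delAT,INS/INS,DEL
2,517C>T,C/T,T
2,881_882delAT,INS/DEL,DEL
3,517C>T,T/T,T
3,881_882delAT,INS/DEL,DEL
4,517C>T,C/C,T
4,881_882delAT,DEL/DEL,DEL
5,517C>T,,T
5,881_882delAT,INS/INS,DEL
;
run;

/* One row per patient, variants in the order they appear within each patient */
proc transpose data=geno_num out=geno_wide(drop=_name_) prefix=v;
  by patient;
  var n_minor;
run;

proc print data=geno_wide noobs;
run;
```

Each comparison evaluates to 1 or 0, so their sum is the number of minor alleles. The variant order within each patient must match the column order of the translation table, since the translation compares vectors position by position.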
There is complexity in the macro because the code is intended to work on an arbitrary gene’s translation table. The patient locus level genotypes must have been converted to numeric form for the given gene.

    *** start of translation code for gene defined by macro variable current_gene ***;
    libname pl 'A directory';
    libname num_dat 'A directory with patient data';
    %let ncol=1;
    %let current_gene=CYP4B1;

    /*** Convert translation table from excel to 0/1 vector that defines
         star allele by the presence vs absence of the variant in the
         star allele definition vectors. ***/
    %macro get_gene(gene);
      /* list the character columns that are blank in row 7 */
      data tr_&gene.;
        set pl.transtable_&gene.;
        array allvar{*} _character_;
        length name $32. vlist $300.;
        if (_N_=7) then do i=1 to dim(allvar);
          if allvar(i)=" " then do;
            call vname(allvar{i}, name);
            vlist=catx(' ',vlist,trim(name));
          end;
        end;
        if (_N_=7) then call symput('to_drop', left(trim(vlist)));
        drop name vlist i;
      run;

      /* drop the blank columns, if any were found */
      %if not("&to_drop." = " ") %then %do;
        data tr_&gene.;
          set pl.transtable_&gene.;
          drop &to_drop.;
        run;
      %end;

      /* number of variant columns = character columns minus the label column */
      data tr_&gene.;
        set tr_&gene.;
        array allvar{*} _character_;
        call symput('ncol', left(trim(put(dim(allvar)-1,best.))));
      run;

      /* rows 12 onward hold the star alleles: "x" becomes 1, blank becomes 0 */
      data tr_&gene.;
        set tr_&gene.;
        array allvar{*} _character_;
        array variants{&ncol.} v1-v&ncol.;
        length star_al $5.;
        if _N_>=12 then do;
          star_al=allvar{1};
          do i=1 to &ncol.;
            if allvar{i+1}=" " then variants{i}=0;
            else variants{i}=1;
          end;
          if not(star_al=" ") then output;
        end;
        keep star_al v1-v&ncol.;
      run;
    %mend;

    %get_gene(&current_gene.);

    /**********************************************************************
       v1  v2  v3  v4  v5  v6  star_al
        0   0   0   0   0   0  *1
        0   1   0   1   1   1  *2
        1   0   0   0   0   0  *3
        0   0   1   0   0   0  *4
        0   0   0   1   0   0  *5
        0   1   0   1   1   0  *7
       tr_&gene as row vectors defining the star alleles (star_al)
       as shown above for CYP4B1
    **********************************************************************/

    /*** macro performs the translation of the locus level genotypes that
         have been converted from character genotypes to numeric number of
         minor alleles in the genotype (0, 1, 2). The code -1 is used for
         missing data and -2 is used for the cases of "either 1 or 2" i.e.
         at least one minor allele is present in the genotype. ****/
    %macro pros_gene(gene);
      proc iml;
        use tr_&gene.;
        read all var {star_al};
        read all var _num_ into vv;
        use num_dat.transpose_&gene.;
        read all var {Study_Protocol Patient_Identifier};
        read all var _num_ into ww;
        wwt=t(ww);
        vvt=t(vv);
        nprobe = nrow(vvt);
        nallele = ncol(vvt);
        npat=nrow(ww);
        do i=1 to npat;
          /*** begin process patient i variant level data ***/
          wpat=wwt[,i];
          icall=loc(wpat>-1);
          i_nc=loc(wpat=-1);
          i_pra=loc(wpat=-2);
          if nrow(i_nc)>0 then wpat[i_nc]=0;
          if nrow(i_pra)>0 then wpat[i_pra]=1;
          pgt={" "};
          sgt={" "};
          do j=1 to nallele;
            al1=vvt[,j];       /*** get star allele vector j ***/
            do k=j to nallele;
              al2=vvt[,k];     /*** get star allele vector k>= j ***/
              cgt=al1+al2;     /*** compute the sum of the j,k star allele pair ***/
              diffp=wpat-cgt;  /*** compare with patient i numeric locus genotype vector ***/
              pmatch=max(abs(diffp));
              if pmatch=0 then do;
                /** when match is found then store the star allele genotype **/
                pgt=concat(pgt,star_al[j],{"/"},star_al[k],{","});
              end;
              /**** Code from this point on deals with missing data and
                    accumulates the patient star allele genotype calls. ****/
              else do;
                if nrow(i_pra)>0 then smat1=min(cgt[i_pra]);
                else smat1=1;
                diffs=diffp;
                if nrow(i_nc)>0 then diffs[i_nc]=0;
                if nrow(i_pra)>0 then diffs[i_pra]=0;
                smatch=max(abs(diffs));
                if smat1>0 & smatch=0 then do;
                  sgt=concat(sgt,star_al[j],{"/"},star_al[k],{","});
                end;
              end;
            end;
          end;
          if i=1 then do;
            pat_gt=Study_Protocol[1] || Patient_Identifier[1] || pgt || sgt;
          end;
          else do;
            pat_gt=pat_gt // (Study_Protocol[i] || Patient_Identifier[i] || pgt || sgt);
          end;
        end;
        print pat_gt;
        create pl.&gene.lcalls from pat_gt[colname={'Study_Protocol' 'Patient_Identifier' 'pgt' 'sgt'}];
        append from pat_gt;
      quit;
    %mend;

    /**** Run macro to translate the data. Merge translation with raw data
          for review. ****/
    %pros_gene(&current_gene.);

    proc sort data=pl.&current_gene.lcalls;
      by Study_Protocol Patient_Identifier;
    run;
    proc sort data=num_dat.transpose_&current_gene.;
      by Study_Protocol Patient_Identifier;
    run;
    data pl.&current_gene.lcalls;
      merge num_dat.transpose_&current_gene. pl.&current_gene.lcalls;
      by Study_Protocol Patient_Identifier;
      length gene_name $12.;
      pgt=compress(pgt);
      sgt=compress(sgt);
      gene_name=upcase("&current_gene.");
    run;

The translated patient data is stored in the table cyp4b1lcalls with the transposed locus level genotypes coded numerically. Table 6 shows the translated results. The variable PGT is the primary genotype. The primary genotype(s) is based on complete locus level data or, in the case of missing variant data, on the assumption that the missing variant is 0 (no minor alleles). The secondary genotypes (SGT) are possible star allele genotypes, which result from “non-wild type” possible variants due to missing data. Table 5 can be compared to Figure 3 to see directly how the star allele genotypes are matched to the patient locus level variant genotypes in vector representation.

Study_Protocol  Patient_Identifier  _02  _03  _04  _05  _06  _07  PGT     SGT
SUGI            eg00                  0    2    0    2    2    2  *2/*2,
SUGI            eg01                  0    0    0    1    0    0  *1/*5,
SUGI            eg02                  0    1    0    2    1    1  *2/*5,
SUGI            eg03                  2    0    0    0    0    0  *3/*3,
SUGI            eg04                 -1    0    0    0    0    0  *1/*1,  *1/*3,*3/*3,
SUGI            eg05                  0   -1    0    1    1    1          *1/*2,
SUGI            eg06                  1   -1    0    1    1    1          *2/*3,
SUGI            eg07                  1   -1    0    1    1    0          *3/*7,
SUGI            eg08                  1    1    0    1    1   -1  *3/*7,  *2/*3,
SUGI            eg09                  1    0    1    0    0    0  *3/*4,
SUGI            eg10                  0   -1    0    2    0    0  *5/*5,

Table 6. SAS dataset pl.cyp4b1lcalls. The star allele genotype(s) are in variables PGT (Primary Genotype) and SGT (Secondary Genotype).

COLLAPSING HAPLOTYPES IN CONSTRUCTION OF TRANSLATION TABLES

Typically a genotyping platform will not have all the variants needed to identify all known star alleles. This results in components of the star allele defining vectors being removed and previously distinct star allele patterns becoming identical to other star alleles.
Given a current list of star alleles, which has all necessary variants, one must determine the star alleles that are aliased with one another when some necessary variants are removed. For CYP4B1, one can visually compare CYP4B1*3 and CYP4B1*6 with the other star alleles. It is not hard to see that the loss of the 1033G>A variant will only make CYP4B1*3 indistinguishable from CYP4B1*6. No other star alleles are affected by the loss of 1033G>A. However, there are much more complex CYP genes (see CYP2D6 in Appendix 1). Fortunately, it is possible to use the vector idea for defining star alleles to determine which star alleles are aliased with one another using SAS code.

Table 7 displays a SAS table containing the first few columns of the Karolinska CYP4B1 star allele nomenclature, http://www.cypalleles.ki.se/cyp4b1.htm. The information is identical to Table 2, but care has been taken to put the variant list (Nucleotide Changes) into one variable record with semicolons delimiting the variants. The SAS code in Appendix 1 is used to systematically uncover which star alleles cannot be distinguished from one another. The SAS code results, based on Table 7 as the cyp_raw input, are shown in Table 8. Note that the ordering of the variants is changed due to the SAS transpose statement. However, the loss of the 1033G>A variant shows that *3 and *6 are not distinguishable with the remaining locus variants because they now have the same pattern.

Allele    Protein   Variants cDNA
CYP4B1*1  CYP4B1.1  None
CYP4B1*2  CYP4B1.2  881_882delAT; 993G>A; 1018C>T; 1123C>T
CYP4B1*3  CYP4B1.3  517C>T
CYP4B1*4  CYP4B1.4  964A>G
CYP4B1*5  CYP4B1.5  993G>A
CYP4B1*6  CYP4B1.6  517C>T; 1033G>A
CYP4B1*7  CYP4B1.7  881_882delAT; 993G>A; 1018C>T

Table 7. CYP4B1 information from http://www.cypalleles.ki.se/cyp4b1.htm imported into a SAS table, cyp_raw, for processing. This is the input information for the SAS dataset code in Appendix 1.
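The aliasing check can be illustrated with a short DATA step sketch. This is not the Appendix 1 code: it assumes the input variables are named allele and variants, hard-codes the six variants on the assay panel in Table 3 order (so the patterns are ordered differently from the transposed output), and uses the hypothetical names cyp_pattern and panel. DATALINES4 is used because the variant lists themselves contain semicolons.

```sas
data cyp_raw;
  length allele $10 variants $60;
  infile datalines dlm='|' truncover;
  input allele variants;
  datalines4;
CYP4B1*1|None
CYP4B1*2|881_882delAT; 993G>A; 1018C>T; 1123C>T
CYP4B1*3|517C>T
CYP4B1*4|964A>G
CYP4B1*5|993G>A
CYP4B1*6|517C>T; 1033G>A
CYP4B1*7|881_882delAT; 993G>A; 1018C>T
;;;;
run;

data cyp_pattern;
  set cyp_raw;
  length pattern $6 v $12;
  /* Variants kept on the assay panel; 1033G>A is deliberately left out */
  array panel{6} $12 _temporary_
    ('517C>T' '881_882delAT' '964A>G' '993G>A' '1018C>T' '1123C>T');
  pattern = '000000';
  do i = 1 to countw(variants, ';');
    v = strip(scan(variants, i, ';'));
    do p = 1 to 6;
      if v = panel{p} then substr(pattern, p, 1) = '1';
    end;
  end;
  keep allele pattern;
run;

proc sort data=cyp_pattern;
  by pattern allele;
run;

/* Star alleles sharing a pattern are collected into one group */
data cyp_condensed;
  length allele_grp $60;
  retain allele_grp;
  set cyp_pattern;
  by pattern;
  if first.pattern then allele_grp = allele;
  else allele_grp = catx(', ', allele_grp, allele);
  if last.pattern then output;
  keep pattern allele_grp;
run;
```

Because 1033G>A is not on the panel, CYP4B1*3 and CYP4B1*6 both receive the pattern 100000 and end up in the same allele_grp, while every other star allele keeps a pattern of its own.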
The code works on general nomenclature tables of the form listed on http://www.cypalleles.ki.se/ with data imported to cyp_raw in the format of Table 7. The code is based on the simple vector ideas presented for translation. First, one constructs a matrix table with rows that are the vectors defining all star alleles. This is the current complete translation table based on a reference (in this case Karolinska). Again, the fields are the locus variants that define the star alleles. In order to construct this complete translation table, the information from the Karolinska website must be imported into a SAS table and then transposed into the matrix of star allele row vectors.

The next step is to keep only the variants in a particular assay panel. This is done with a keep statement. The star allele row vectors are now converted to (single field) character variant patterns. With variants removed, these star allele vectors and equivalent patterns are no longer unique. Star alleles that share a pattern can no longer be distinguished from one another. These groups are listed in the variable allele_grp in the final table.

_1123C_T  _881_882delAT  _993G_A  _1018C_T  _517C_T  _964A_G  pattern  allele_grp
0         0              0        0         0        0        '000000  CYP4B1*1
0         0              0        0         0        1        '000001  CYP4B1*4
0         0              0        0         1        0        '000010  CYP4B1*3, CYP4B1*6
0         0              1        0         0        0        '001000  CYP4B1*5
0         1              1        1         0        0        '011100  CYP4B1*7
1         1              1        1         0        0        '111100  CYP4B1*2

Table 8. Using our appendix SAS code on CYP4B1, the output table, cyp_condensed, has the star alleles that are not distinguishable from one another (see variable allele_grp). The keep statement in the creation of the SAS table cyptt2 has all the variants except 1033G>A. The result is that the variant patterns are the same for both CYP4B1*3 and CYP4B1*6. They cannot be distinguished with the variants that we had in our early DMET platform.

The methods described here represent a simple, systematic and transparent process for translating genetic variants into star alleles (e.g. Table 2).
However, the “effect” information in column 4 of the Karolinska tables, where contextually important hypertext is listed as effects and annotated, is of crucial importance. Currently this information must be processed manually. It is essential when dealing with groups of star alleles that are aliased with one another (see CYP2D6 in Appendix 1).

CONCLUSION

The amount of genetic information with clinical application is increasing (for example, www.fda.gov/Cder/guidance/6400fnl.pdf, page 4). Improvements in data processing and analysis are necessary in order to maximize the usefulness of this information for clinicians. Efficient algorithms and implementations can lead to large efficiency gains in both developing and using genetic assays. This manuscript describes how SAS may be used to simplify the translation of genetic variation into the common drug metabolizing enzyme and transporter (DMET) nomenclature. Algorithms that translate the often unfamiliar or uninterpretable genetic variants into the nomenclature recognized by clinicians, the star allele nomenclature, will increase interpretability and thus utility in a clinical setting. In a SAS environment, the translations can be made with current information from the literature and from websites such as Karolinska. However, intimate knowledge of both the biological relevance and the mathematical structure is required, and it is rare for any one individual to have both. Close and often persistent collaboration between statisticians, programmers, data management experts and the clinician or biologist is essential to bring this potential to fruition.

REFERENCES

Cascorbi I. (2006), Genetic basis of toxic reactions to drugs and chemicals. Toxicol. Lett. 162(1):16-28.

Daly TM, Dumaual CM, Miao X. (2007), Multiplex assay for comprehensive genotyping of genes involved in drug metabolism, excretion, and transport. Clin Chem. 53(7):1222-1230.

Dumaual C, Miao X, Daly TM, et al. (2007), Comprehensive assessment of metabolic enzyme and transporter genes using the Affymetrix Targeted Genotyping System. Pharmacogenomics. 8(3):293-305.

Gonzalez FJ, Skoda RC, Kimura S, Umeno M, Zanger UM, Nebert DW. (1988), Characterization of the common genetic defect in humans deficient in debrisoquine metabolism. Nature 331:442–446.

Lewis DF. (2005), Human P450s in the metabolism of drugs: molecular modeling of enzyme-substrate interactions. Expert Opin. Drug Metab. Toxicol. 1(1):5-8.

Mega JL, Close S, Wiviott SD, Shen L, Hockett RD, Brandt JT, Walker JR, Antman EM, Macias W, Braunwald E, and Sabatine MS. (2009), Cytochrome P-450 Polymorphisms and Response to Clopidogrel. NEJM 360:354-362.

Robarge JD, Li L, Desta Z, Nguyen A, and Flockhart DA. (2007), The star-allele nomenclature: retooling for translational genomics. Clin Pharmacol Ther 82:244-248.

Schaid DJ, Rowland CM, Tines DE, Jacobson RM, and Poland GA. (2002), Score Tests for Association between Traits and Haplotypes when Linkage Phase is Ambiguous. American Journal of Human Genetics 70:425-434.

Sim SC and Ingelman-Sundberg M. (2006), The human cytochrome P450 Allele Nomenclature Committee Web site: submission criteria, procedures, and objectives. Methods Mol. Biol. 320:183–191.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the author at:

Mark Farmen
Lilly Research Laboratories
Lilly Corporate Center
Indianapolis, IN 46285
Work Phone: (317) 433-4262
E-mail: [email protected]

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.
APPENDIX 1

SAS CODE FOR COLLAPSING HAPLOTYPES AND BUILDING TRANSLATION TABLES

/* Split the semicolon-delimited variant list into one row per variant */
data cypv1;
  set cyp_raw;
  retain star_allele;
  length gv $70.;
  if upcase(variants)="NONE" then variants=" ";
  if length(Allele)>5 then do;
    if upcase(substr(compress(Allele),1,3))="CYP" then star_allele=Allele;
  end;
  do ii=1 to 1000;
    gv = compress(trim(scan(variants, ii, ";")));
    * if no more gv AND not first entry then exit loop;
    if gv='' & ii>1 then leave;
    output;
  end;
run;

/* Add a 0/1 indicator; alleles without variants get gv="NONE", ind=0 */
data cypt;
  set cypv1;
  ind=1;
  if (gv=" ") then do;
    if (star_allele^=" ") then do;
      gv="NONE";
      ind=0;
    end;
    else delete;
  end;
  if (star_allele=" ") then delete;
  keep star_allele gv ind;
run;

proc sort data=cypt;
  by star_allele gv;
run;

/* One row per star allele, one column per variant */
proc transpose data=cypt out=cyptt;
  by star_allele;
  id gv;
  var ind;
run;

/* Replace missing indicators with 0 */
data cyptt;
  set cyptt;
  array allnum{*} _numeric_;
  do i=1 to dim(allnum);
    if allnum{i}=. then allnum{i}=0;
  end;
  drop i NONE _NAME_;
run;

/**** Save cyptt in order to select the variants that are genotyped by the
      platform. These go in the keep statement below. ****/
data cyptt2;
  set cyptt;
  keep star_allele _1123C_T _881_882delAT _993G_A _1018C_T _517C_T _964A_G;
run;

/* Concatenate the 0/1 indicators into a single character pattern */
data cyptt2;
  set cyptt2;
  length pattern $25. bit $1.;
  array allnum{*} _numeric_;
  do i=1 to dim(allnum);
    bit=put(allnum{i},1.);
    pattern=compress(pattern)||bit;
  end;
  drop bit i;
run;

proc sort data=cyptt2;
  by pattern star_allele;
run;

/* Collapse star alleles that share a pattern into one allele_grp row */
data cyp_condensed;
  set cyptt2;
  by pattern star_allele;
  retain allele_grp;
  length allele_grp $350.;
  if first.pattern then do;
    allele_grp=star_allele;
  end;
  if not first.pattern then do;
    allele_grp=trim(allele_grp)||", "||compress(star_allele);
  end;
  if last.pattern then output;
run;

The above code can be applied to much more complex genes. The CYP2D6 star alleles were imported into SAS from http://www.cypalleles.ki.se/cyp2d6.htm in the format shown in Table 7. The first 3 columns and a small selection of star alleles are shown in Table A1.
Though importing the webpage table into SAS is difficult, the SAS code can be run essentially unchanged.

Allele      Protein    variants
CYP2D6*1A   CYP2D6.1
CYP2D6*2A   CYP2D6.2   -1584C>G; -1235A>G; -740C>T; -678G>A; CYP2D7 gene conversion in intron 1; 1661G>C; 2850C>T; 4180G>C
CYP2D6*3A              2549delA
CYP2D6*3B              1749A>G; 2549delA
CYP2D6*4A              100C>T; 974C>A; 984A>G; 997C>G; 1661G>C; 1846G>A; 4180G>C
CYP2D6*4B              100C>T; 974C>A; 984A>G; 997C>G; 1846G>A; 4180G>C
CYP2D6*4C              100C>T; 1661G>C; 1846G>A; 3887T>C; 4180G>C
CYP2D6*4D              100C>T; 1039C>T; 1661G>C; 1846G>A; 4180G>C
CYP2D6*4E              100C>T; 1661G>C; 1846G>A; 4180G>C
CYP2D6*4F              100C>T; 974C>A; 984A>G; 997C>G; 1661G>C; 1846G>A; 1858C>T; 4180G>C
CYP2D6*4G              100C>T; 974C>A; 984A>G; 997C>G; 1661G>C; 1846G>A; 2938C>T; 4180G>C
CYP2D6*4H              100C>T; 974C>A; 984A>G; 997C>G; 1661G>C; 1846G>A; 3877G>C; 4180G>C
CYP2D6*4J              100C>T; 974C>A; 984A>G; 997C>G; 1661G>C; 1846G>A
CYP2D6*4K              100C>T; 1661G>C; 1846G>A; 2850C>T; 4180G>C
CYP2D6*4L              100C>T; 997C>G; 1661G>C; 1846G>A; 4180G>C
CYP2D6*4M              -1235A>G; 746C>G; 843T>G; 974C>A; 984A>G; 997C>G; 1661G>C; 1846G>A; 2097A>G; 3384A>C; 3582A>G; 4401C>T

Table A1. Vital variables from the SAS table cyp_raw containing CYP2D6 star allele definitions based on the Karolinska website as of 12/12/2008. Due to size issues, only some of the rows are listed. Refer to http://www.cypalleles.ki.se/cyp2d6.htm for the complete table.

The validated CYP2D6 variants on the early DMET assays were 100C>T; 4180G>C; 2850C>T; 883G>C; 124G>A; 1758G>A; 1758G>T; 137_138insT; 1023C>T; 2539_2542delAACT; 2573_2574insC; 2988G>A; 2587_2590delGACT; 2549delA; 2950G>C; 1846G>A; CYP2D6deleted; 1707delT; 2935A>C; 2615_2617delAAG. Using the CYP2D6 data in cyp_raw (see Table A1) as the input table, and the following data step (see the keep statement) to restrict table cyptt2 to these variants, the star alleles that cannot be distinguished (because the variants are not measured) are obtained from cyp_condensed. The variant patterns and allele_grp are shown in Table A2.
data cyptt2;
  set cyptt;
  keep star_allele _100C_T _4180G_C _2850C_T _883G_C _124G_A _1758G_A
       _1758G_T _137_138insT _1023C_T _2539_2542delAACT _2573_2574insC
       _2988G_A _2587_2590delGACT _2549delA _2950G_C _1846G_A CYP2D6deleted
       _1707delT _2935A_C _2615_2617delAAG;
run;

Pattern               allele_grp               Star Allele choice
00000000000000000000  CYP2D6*13, CYP2D6*16, CYP2D6*17XN, CYP2D6*18, CYP2D6*1A, CYP2D6*1B, CYP2D6*1C, CYP2D6*1D, CYP2D6*1E, CYP2D6*1XN, CYP2D6*22, CYP2D6*23, CYP2D6*24, CYP2D6*25, CYP2D6*26, CYP2D6*27, CYP2D6*33, CYP2D6*43, CYP2D6*48, CYP2D6*50, CYP2D6*53, CYP2D6*60, CYP2D6*61, CYP2D6*62, CYP2D6*66, CYP2D6*67, CYP2D6*68, CYP2D6*71
00000000000000000001  CYP2D6*9
00000000000000000100  CYP2D6*7
00000000000000001000  CYP2D6*6A, CYP2D6*6B, CYP2D6*6D
00000000000000010000  CYP2D6*5
00000000000000100000  CYP2D6*4M
00000000000001000000  CYP2D6*44
00000000000010000000  CYP2D6*3A, CYP2D6*3B
00000000000100000000  CYP2D6*38
00000010000000000000  CYP2D6*15
00100000000000000000  CYP2D6*34, CYP2D6*63
01000000000000000000  CYP2D6*39, CYP2D6*70
01000000000000001000  CYP2D6*6C
01100000000000000000  CYP2D6*20, CYP2D6*28, CYP2D6*29, CYP2D6*2A, CYP2D6*2B, CYP2D6*2C, CYP2D6*2D, CYP2D6*2E, CYP2D6*2F, CYP2D6*2G, CYP2D6*2H, CYP2D6*2J, CYP2D6*2K, CYP2D6*2L, CYP2D6*30, CYP2D6*31, CYP2D6*32, CYP2D6*35, CYP2D6*35X2, CYP2D6*42, CYP2D6*45A, CYP2D6*45B, CYP2D6*46_1, CYP2D6*46_2, CYP2D6*51, CYP2D6*55, CYP2D6*56A, CYP2D6*59
01100000000000000010  CYP2D6*8
01100000001000000000  CYP2D6*2M, CYP2D6*41
01100000010000000000  CYP2D6*21A, CYP2D6*21B
01100000100000000000  CYP2D6*19
01100001000000000000  CYP2D6*17, CYP2D6*40, CYP2D6*58
01100100000000000000  CYP2D6*14B
01101000000000000000  CYP2D6*12
01110000000000000000  CYP2D6*11
10000000000000100000  CYP2D6*4G, CYP2D6*4J
11000000000000000000  CYP2D6*10A, CYP2D6*10B, CYP2D6*10D, CYP2D6*36Dupl., CYP2D6*36single, CYP2D6*37, CYP2D6*47, CYP2D6*49, CYP2D6*52, CYP2D6*54, CYP2D6*56B, CYP2D6*57
11000000000000100000  CYP2D6*4A, CYP2D6*4B, CYP2D6*4C, CYP2D6*4D, CYP2D6*4E, CYP2D6*4F, CYP2D6*4H, CYP2D6*4L, CYP2D6*4N
11000001000000000000  CYP2D6*64
11100000000000000000  CYP2D6*65
11100000000000100000  CYP2D6*4K
11100000001000000000  CYP2D6*69
11100100000000000000  CYP2D6*14A

Table A2. Output SAS table cyp_condensed for CYP2D6, which shows the star alleles that cannot be distinguished based on the variants that can be genotyped. The star allele choice must be made based on knowledge of the key variants and star allele frequency in order to create the translation table.

The star alleles that cannot be determined because their variants are not measured with the DMET platform are classified as wild type alleles (*1 in most cases). When no data suggest that subtypes of a star allele behave differently in vivo, these subtypes are grouped together (e.g. *6A, *6B, and *6D into *6). The *10 allele poses some challenges. The allele-classifying variants are 100C>T and 4180G>C. These two variants are also found in several other star alleles (for example *4, *14, *36, and *37). When both are present and there is no 1846G>A (which calls a *4) or 1758G>A (which calls a *14), the allele is classified as *10. No attempt is made to distinguish *10 from *36 or *37, as there is no data to suggest that the additional variants present in the *36 or *37 haplotypes cause additional enzyme activity changes compared to the *10 allele.
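The *10 decision rule described above can be written as a short data step. The following is a minimal sketch under stated assumptions, not the translation code used for the DMET platform: the table name star10_demo, the variable star_call and the four demonstration rows are hypothetical, the indicator variables are assumed to be named as in cyptt2, and the *14 indicator is assumed to be _1758G_A, the panel variant carried by CYP2D6*14A and CYP2D6*14B in Table A2.

```sas
/* Sketch of the *10 rule on four hypothetical haplotype rows.       */
/* Each row holds 0/1 indicators for the variants used by the rule.  */
data star10_demo;
  length star_call $12;
  input _100C_T _4180G_C _1846G_A _1758G_A;
  if _100C_T=1 and _4180G_C=1 then do;
    /* Both classifying variants present: rule out *4 and *14 first */
    if _1846G_A=1 then star_call='CYP2D6*4';
    else if _1758G_A=1 then star_call='CYP2D6*14';
    else star_call='CYP2D6*10';
  end;
  /* Placeholder label (an assumption): the rule does not apply */
  else star_call='not *10';
  datalines;
1 1 0 0
1 1 1 0
1 1 0 1
0 1 0 0
;
run;

proc print data=star10_demo noobs;
run;
```

For the four demonstration rows the calls are CYP2D6*10, CYP2D6*4, CYP2D6*14 and the placeholder, in that order. Alleles such as *36 and *37 would fall into the *10 branch, which matches the grouping choice described in the text.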