Download PDF

PERSPECTIVE ARTICLE published: 21 March 2013 doi: 10.3389/fgene.2013.00038 Where in the genome are we? A cautionary tale of database use in genomics research Laura K. Vaughan* and Vinodh Srinivasasainagendra Section on Statistical Genetics, Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL, USA Edited by: Danielle Posthuma, VU University Amsterdam, Netherlands Reviewed by: Dirk-Jan De Koning, Swedish University of Agricultural Sciences, Sweden Yulia A. Medvedeva, King Abdullah University of Science and Technology, Saudi Arabia *Correspondence: Laura K. Vaughan, Section on Statistical Genetics, Department of Biostatistics, University of Alabama at Birmingham, 1665 University Blvd., RPHB 317D, Birmingham, AL 35294, USA. e-mail: [email protected] With the advent of high throughput data genomic technologies the volume of available data is now staggering. In addition databases that provide resources to annotate, translate, and connect biological data have grown exponentially in content and use. The availability of such data emphasizes the importance of bioinformatics and computational biology in genomics research and has led to the development of thousands of tools to integrate and utilize these resources. When utilizing such resources, the principles of reproducible research are often overlooked. In this manuscript we provide selected case studies illustrating issues that may arise while working with genes and genetic polymorphisms. These case studies illustrate potential sources of error which can be introduced if the practices of reproducible research are not employed and non-concurrent databases are used. We also show examples of a lack of transparency when these databases are concerned when using popular bioinformatics tools. These examples highlight that resources are constantly evolving, and in order to provide reproducible results, research should be aware of and connected to the correct release of the data, particularly when implementing computational tools. Keywords: reproducible research, database, bioinformatics, gene mapping INTRODUCTION When conducting genetics research, whether from the perspective of a candidate gene study or genome wide association study (GWAS), researchers must be able to accurately identify and translate where molecular markers are located on the genome in reference to the coordinates of known genes. While this may seem straightforward, it can be quite complicated and is often overlooked (Hong et al., 2009; Wang et al., 2010). The mapping of markers to genes and subsequent data mining of information about these genes is further complicated by the ever increasing amounts of data and resulting evolution in databases, which in turn can lead to changes in genomic coordinates, annotations, and other information. Additionally, few studies (and methodologies) report the version of databases that are used in the bioinformatic workflow process. For example, the popular bioinformatics resource DAVID lists the download date of the various databases used for the Knowledgebase (the most recent of which is in 2009), but does not provide the version of those databases (Huang et al., 2009a,b). This lack of reporting can make subsequent analysis and reproduction of others research difficult, if not impossible. In this manuscript we describe the key steps involved in the use of database resources for the mapping of markers to genes (and vice versa) in a typical candidate gene based study and highlight several ambiguities that can have potentially serious consequences in subsequent research. WORKFLOW CASE STUDIES The steps in identifying SNPs from a list of candidate genes can be described as (1) determining the candidate gene pool, www.frontiersin.org (2) annotating, or retrieving information about those genes, (3) determining the location (and boundaries) of those genes, and (4) identifying molecular markers (e.g., single nucleotide polymorphisms, SNPs) within those boundaries. Similar steps are involved in identifying genes that are related to SNPs found to be significant from a GWAS. In a collaborative research setting investigators will often provide a list of gene names that they are interested in to their bioinformatic collaborators who then retrieve information related to these genes for further analysis. The naming of these genes is the first point of ambiguity. Often, these names are common names or synonyms instead of the official names or gene symbols [see HUGO Gene Names Committee HGNC (Seal et al., 2011)]. Due to the structure of most bioinformatic data sources, it can be difficult to correctly identify the gene that an investigator is interested in when the official name is not provided. An example of this is described in Table 1. In this example the original list of TOSO, PIGR, FCAMR, ADRA1A, ADRA1B, and ADRA1D was provided by a collaborator. When searching the UCSC Genome Browser or Entrez Gene databases (accessed July 2011), we see that TOSO is not an HGNC official gene symbol, but is instead a synonym for the gene Fas apoptotic inhibitory molecule 3 (official symbol FAIM3, geneID 9214), and searching earlier versions of UCSC Genome Browser or into the Gene Accession Conversion tool in DAVID, TOSO would not produce any results. The second source of ambiguity, gene annotation, is also illustrated in Table 1. In general, capturing gene level annotations (HGNC id, geneID, synonyms, chromosome, description, etc.), not only provides more information, but also allows March 2013 | Volume 4 | Article 38 | 1 Vaughan and Srinivasasainagendra Database use in genomics research Table 1 | Inconsistency in gene “names” and locations. GeneID Symbol Synonyms Chr Description Genome build/base pair location Hg17 Hg18 Hg19 9214 FAIM3 TOSO 1 Fas apoptotic inhibitory molecule 3 203466126203483738 205144354205161966 207076633207095378 5284 PIGR FLJ22667|MGC125 361|MGC125362 1 Polymeric immunoglobulin receptor 203490267203508202 205168495205186430 207101869207119811 83953 FCAMR FCA/MR|FKSG87 1 Fc receptor, IgA, IgM, high affinity Information Not available 205198027205210593 207131404207143970 148 ADRA1A ADRA1C|ADRA1L 1|ALPHA1AAR 8 Adrenergic, alpha1A-, receptor 2668313926778839 2668313926778839 2662722226722922 147 ADRA1B ADRA1|ALPHA1B AR 5 Adrenergic, alpha1B-, receptor 159276318159332129 159276318159332595 159343740159400017 146 ADRA1D ADRA1|ADRA1A| ADRA1R|ALPHA1| DAR|dJ779E11.2 20 Adrenergic, alpha1D-, receptor 41498164177659 41492784177659 42012784229659 Bold genes indicate terms from original list from collaborator. Annotation information retrieved for the candidate gene list through Entrez gene’s GeneInfo. GeneInfo data can be downloaded from Entrez gene’s FTP location on September 10th 2010 (ftp://ftp.ncbi.nih.gov/gene/DATA/). investigators to perform quality checks. In the example discussed here, ADRA1A is listed as both an HGNC official gene symbol (adrenergic alpha-1B-receptor, gene ID 147) and as a synonym for ADRA1D (adrenergic alpha-1D-receptor, gene ID 146). Without scrutiny, it is difficult to tell which gene(s) the investigator is indeed interested. A survey of the most recent build of Entrez Gene (Hg 19) reveals that there are 43,037 unique gene symbols, 53,215 unique gene synonyms and 1122 instances where a term is both an official gene symbol and a synonym for at least one other gene, and 2632 terms that occur as synonyms for multiple genes, Although it may seem trivial in this example where there are only a few genes, in situations where there are dozens to hundreds of genes this manual verification of genes represents a significant investment of time and potential sources of error. It is also important to note that inconsistencies between databases can also introduce significant errors when translating gene IDs from one source to another. Even when using one of the several ID converters available (e.g., DAVID ID Converter or GeneCruiser), errors can be introduced when synonyms, HGNC symbols and other identifiers are inconsistently mapped or when the timelines for the database releases are not correctly matched or are out of date (discussed further below). In the candidate gene framework, the gene coordinates can be identified from databases such as Entrez Gene or UCSC Genome Browser relatively easily. However the third ambiguity, determining the gene location, is illustrated in part by inconstancies in the use of gene symbol vs. gene synonym (Table 1). For example, the genes ADRA1A and ADRA1D (discussed above) are located on separate chromosomes. Choosing the wrong gene will result in choosing a completely inappropriate location which Frontiers in Genetics | Statistical Genetics and Methodology will have obviously significant potential implications on downstream analysis. This ambiguity is perhaps more of a concern when taking the approach common for GWAS of identifying genes related to interesting SNPs. For a GWAS, usually both the SNP coordinates and genes that contain those SNPs are provided by the manufacturer of the genotyping platform. However, how these coordinates and genes are identified is often unclear, and these annotation files themselves are often additional source of errors. This is strikingly illustrated in Table 2. In this instance a SNP (rs2844871) was identified as interesting based on an association study genotyped on the Affymetrix Genome Wide Human SNP Array 6.0. When following the bioinformatic workflow to identify the gene of interest, it was discovered that the SNP is mapped to different genes based on not only different databases, but also on different versions of those databases. A query of build 135 of dbSNP identified 1,226,430 SNPs that have multiple coordinates, 805,555 of which have more than one distinct chromosome assigned to the same rs ID (or 1.5% of the 54,212,080 SNPs). Additionally, 1,164,480 single base pair coordinates were found to be associated with multiple rsID’s (with the maximum of 97 rsID’s associated with the coordinates for one single nucleotide polymorphism). Searching for annotation information for rs2844871 in the UCSC Genome Browser, dbSNP, HapMap and Affymetrix databases not only provided different genomic locations based on which build that was accessed, but different (and multiple) chromosomes. Although in this case the multiple locations are likely due to a duplication event [a BLAST (http://blast.ncbi.nlm.nih.gov/) search of the 100 base pair sequence surrounding the SNP shows that regions of >90% identity occur on chromosomes 22, 14, 2, 4, and 21], it serves to dramatically illustrate errors that can be introduced March 2013 | Volume 4 | Article 38 | 2 Vaughan and Srinivasasainagendra Database use in genomics research Table 2 | Discrepancies in the location of SNP rs2844871. Database Database version Human genome build Human reference chromosome/ NCBI build dbSNP build Location (bp) NetAffx 30 Hg18 NCBI36 Not specified chr22:14459243 NetAffx 31 Hg19 GRCH 37 131 SNP is listed, but no position information given NetAffx 32 Hg19 GRCH 37 132 SNP is listed, but no position information given July 2011 Hg18 NCBI36 128 chr22: 14459242 Hg18 NCBI36 130 chr14: 19763716 chr22: 16079242 Hg 19 GRCH 37 132 chr14:19763467 chr22:16078993 chrUn_gl000244:34403 Hg 19 GRCH 37 135 chr14:19763467 chr22:16078993 chrUn_gl000244:34403 July 2011 Hg 19 NCBI 37.1 132 chr2: 125655701 October 2012 Hg19 NCBI 37.3 137 chr14: 19763717 chr22: 16079243 NA chr2:124523528 HapMap Release 27 Hg18 Not specified Not specified chr22: 14459243* HapMap Release 28 Hg18 NCBI36 126 chr22: 14459243 Affymetrix 6.0 UCSC genome browser dbSNP ∗ Genotyped on Affy 6.0 for Phase II samples, no dates or other information was given in HapMart. Search results for SNP rs284487 from various sources. Search conducted on Jul 11th, 2011 and updated October 29th, 2012. with the use of different databases and lack of stringent quality controls. Additionally, when mapping a marker to a gene, investigators are often not just interested in a SNP that lies directly with the gene boundary, but also genes that lie within a certain distance or are in linkage disequilibrium with a SNP of interest. The accurate identification of SNPs and related genes is dependent on both an accurate identification of gene boundaries and the synchronization of multiple databases, which often leads to the final source of ambiguity and is discussed further below. The final source of ambiguity, variation between databases and across time, is intrinsic to every step of the workflow outlined above. Bioinformatic analysis is dependent on key database resources such as dbSNP, Entrez Gene, UCSC Genome Browser and Ensembl (Sherry, 2001; Fujita et al., 2011; Maglott et al., 2011; Flicek et al., 2012). These databases are in a state of dynamic flux, and are constantly being updated, sometimes resulting in significant changes (Data Changes that Occur Between Builds, 2005; Fujita et al., 2011). More often than not, investigators fail to provide the date and database version of each of the data sources that www.frontiersin.org was used in the process of their analysis. Comparing the number of official gene symbols and synonyms for Hg18 and Hg19 highlights the differences in database builds. As discussed above, there are 43,037 and 53,215 unique gene symbols and synonyms respectively in Hg19, compared to 38,586 and 53,475 in Hg18, with 23,325 gene symbols overlapping between the two versions of the human genome builds. Tables 1 and 2 illustrate how the version of the database used can have an effect on the data that is retrieved. For example, (1) FACMR (gene ID 83953) was not included in the human genome build 17 (Hg17) and has gene boundary location that is shifted by almost two million bases from Hg18 to Hg19 and (2) when the coordinates from Hg17 are used to search UCSC Genome Browser using Hg19 the gene OPTC (gene ID 26254) is retrieved instead of FAIM3. Although these shifts in boundaries are a result of updates to the genome builds, one can see how using gene boundary coordinates from Hg19 for data that was originally built on Hg17, without first correcting for the change, can introduce errors. Tools such as the UCSC Genome Browser LiftOver Utility are available to convert genome coordinates between assemblies; however to correctly apply the tool, researchers must first be aware of the issue. March 2013 | Volume 4 | Article 38 | 3 Vaughan and Srinivasasainagendra Furthermore, the timeline of these changes is not coordinated across databases. A search conducted in January 2010 in dbSNP would have been based on Hg18 instead of Hg19 which the NCBI released in early 2009 (build #37) and the August 2010 HapMap data release (release #28) for both Phase II and III data is based on NCBI build 36 and dbSNP release 126 (from 2006). As discussed above, the location of a gene may change with different builds of the human genome, sometimes significantly, and investigators should take the necessary steps to ensure that they are using coordinated builds of the different resources. When conducting research that is based on a genotyping chip, investigators should also carefully consider the version of databases used for bioinformatic analysis. If the corresponding changes in the coordinates of the markers on the genotyping chip are not also accounted for, SNPs could be mapped to incorrect genes, which can result in very costly mistakes (Karow, 2010). DISCUSSION In recent years there has been a paradigm shift in the field of genetics. In the not too distant past, researchers were limited by their ability to acquire data. Now, with the availability of genome scale DNA and RNA platforms and recent introduction of affordable whole genome sequencing technologies, scientists are limited by their ability to effectively organize and analyze vast amounts of data. Part of this process is the accurate and consistent annotation of genomic information as part of the bioinformatics workflow. As describe above, changes in database versions and genome builds throughout the life of a study can have potentially significant impact. In the case study discussed here we highlight several ambiguities that can be introduced in a candidate gene or SNP based study. When going across database versions using gene name, coordinates or rsID’s up to the individual researcher. In the candidate gene based approach where the aim is to identify variants within a gene, one will typically use the coordinates of both genes and SNPs to identify SNPs for further study. As described above, one first verify they have the correct gene, and then must either stay within the same human genome version for each database used, or must correctly convert coordinates in order to avoid introducing errors. For the complementary approach based on identifying genes related to interesting SNPs, often the only data provided is the rsID for that SNP and no coordinates or genome build information is provided. Without this extra information errors can again be introduced when, as shown above, multiple positions, and therefore multiple genes are associated with a variant. One way to prevent these errors is for investigators to involve bioinformaticians in all stages of a study, and for everyone involved to follow the principles of reproducible research. Reproducibility in research has been defined by the uniform Guidelines of the International Committee of Medical Journal Editors as the responsibility of authors to “identify the methods, apparatus and procedures in sufficient detail to allow other workers to reproduce the results.” Young scientists are taught to include in the methods and materials section of manuscripts the details Frontiers in Genetics | Statistical Genetics and Methodology Database use in genomics research which would be needed for successful repetition and extension of their work (Hothorn and Leisch, 2011). Unfortunately, the same attention that is given to laboratory based experimental details and protocols have not been applied to the bioinformatics or computational components of many large genetic studies. This is beginning to change, especially in the domains of bioinformatics and biocomputing, where there has been growing interest in following the philosophy and best principles of reproducibility and repeatability in scientific research (Hothorn et al., 2009; Mesirov, 2010). As we move toward fully embracing the concepts of reproducible research, there is an increasing need for reproducible research modules in many of the software and tools where underlying computer code and data tend to change over time. The continued growth in data volume has introduced a new set of issues that must be considered and addressed in genomics studies. The examples discussed above illustrate the importance of involving bioinformaticians in the entire process of a study. Researchers can avoid these pitfalls by implementing procedures that follow the principles of reproducible research. Similar to the use of a notebook in a wet lab, a wiki based notebook (our own group uses a Confluence powered wiki), employing a Reproducible Research Systems (RSS) approach or using tools such as myExperiment, GenePattern GRRD, Galaxy or Sweave, can be used to detail the workflow involved in the computational analysis of complex genomic data (Friedrich Leisch, 2002; Reich et al., 2006; Goble et al., 2010; Goecks et al., 2010; Hothorn and Leisch, 2011). Accurate depiction of the research process will become even more important as journals follow the trend set by Biometrical Journal, Journal of Epidemiology, and Biostatistics which now suggest that authors go beyond the common practice of making data freely accessible, but also meet some standard of reproducibility (Peng et al., 2006; Peng, 2009; Mesirov, 2010). CONCLUSION The importance of following the principles of reproducible research has been recently highlighted with several high profile examples (Hothorn et al., 2009; Baggerly and Coombes, 2011). Seemingly small mistakes can have significant downstream consequences in any data analysis that utilizes large amounts of data and multiple steps of analysis. As exemplified here, the simple mistake of not reporting, or using an incorrect version of a database can affect the interpretability and reproducibility of a study. To prevent these issues from having a greater impact, it is important for the research community as a whole to embrace the concepts of reproducible research and make a conscious effort toward moving toward that goal. ACKNOWLEDGMENTS The authors would like to thank the members of the Section on Statistical Genetics Programming Team for their helpful comments and suggestions. This work was supported by National Institutes of Health K01DK080188 and R03DK096071. March 2013 | Volume 4 | Article 38 | 4 Vaughan and Srinivasasainagendra REFERENCES Baggerly, K., and Coombes, K. (2011). What information should be required to support clinical “Omics” publications? Clin. Chem. 57, 688–690. Data Changes that Occur Between Builds. (2005). Bookshelf ID NBK44467. Flicek, P., Ridwan Amode, M., Barrell, D., Beal, K., Brent, S., CarvalhoSilva, D., et al. (2012). Ensembl 2012. Nucleic Acids Res. 40, D84–D90. Friedrich Leisch, J. (2002). “Sweave: dynamic generation of statistical reports using literate data analysis,” in Compstat 2002 - Proceedings in Computational Statistics, eds H. Wolfgang and R. Bernd (Heidelberg: Physica Verlag), 575–580. Fujita, P. A., Rhead, B., Zweig, A. S., Hinrichs, A. S., Karolchik, D., Cline, M. S., et al. (2011). The UCSC Genome Browser database: update 2011. Nucleic Acids Res. 39, D876–D882. Goble, C. A., Bhagat, J., Aleksejevs, S., Cruickshank, D., Michaelides, D., Newman, D., et al. (2010). myExperiment: a repository and social network for the sharing of bioinformatics workflows. Nucleic Acids Res. 38, W677–W682. www.frontiersin.org Database use in genomics research Goecks, J., Nekrutenko, A., and Taylor, J. (2010). Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 11:R86. doi: 10.1186/gb-2010-11-8-r86 Hong, M.-G., Pawitan, Y., Magnusson, P. K. E., and Prince, J. (2009). Strategies and issues in the detection of pathway enrichment in genome-wide association studies. Hum. Genet. 126, 289–301. Hothorn, T., Held, L., and Friede, T. (2009). Biometrical journal and reproducible research. Biom. J. 51, 553–555. Hothorn, T., and Leisch, F. (2011). Case studies in reproducibility. Brief. Bioinformatics 12, 288–300. Huang, D. W., Sherman, B. T., and Lempicki, R. A. (2009a). Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 37, 1–13. Huang, D. W., Sherman, B. T., and Lempicki, R. A. (2009b). Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc. 4, 44–57. Karow, J. (2010). NimbleGen Array Capture Outperforms Two TargetEnrichment Methods in ABRF Research Group Comparison In Seq- uence Sequencing GenomeWeb. GenomeWeb In Sequence. Available online at: http://www.genomeweb. com/sequencing/nimblegen-arraycapture-outperforms-two-target-en richment-methods-abrf-researchMaglott, D., Ostell, J., Pruitt, K. D., and Tatusova, T. (2011). Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res. 39, D52–D57. Mesirov, J. P. (2010). Accessible reproducible research. Science 327, 415–416. Peng, R. D. (2009). Reproducible research and Biostatistics. Biostatistics 10, 405–408. Peng, R. D., Dominici, F., and Zeger, S. L. (2006). Reproducible epidemiologic research. Am. J. Epidemiol. 163, 783–789. Reich, M., Liefeld, T., Gould, J., Lerner, J., Tamayo, P., and Mesirov, J. P. (2006). GenePattern 2.0. Nat. Genet. 38, 500–501. Seal, R. L., Gordon, S. M., Lush, M. J., Wright, M. W., and Bruford, E. A. (2011). Genenames.org: the HGNC resources in 2011. Nucleic Acids Res. 39, D514–D519. Sherry, S. T. (2001). dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 29, 308–311. Wang, K., Li, M., and Hakonarson, H. (2010). Analysing biological pathways in genome-wide association studies. Nat. Rev. Genet. 11, 843–854. Conflict of Interest Statement: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. Received: 14 August 2012; accepted: 04 March 2013; published online: 21 March 2013. Citation: Vaughan LK and Srinivasasainagendra V (2013) Where in the genome are we? A cautionary tale of database use in genomics research. Front. Genet. 4:38. doi: 10.3389/fgene. 2013.00038 This article was submitted to Frontiers in Statistical Genetics and Methodology, a specialty of Frontiers in Genetics. Copyright © 2013 Vaughan and Srinivasasainagendra. This is an openaccess article distributed under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in other forums, provided the original authors and source are credited and subject to any copyright notices concerning any third-party graphics etc. March 2013 | Volume 4 | Article 38 | 5 Vaughan and Srinivasasainagendra Database use in genomics research URLS Resource URL Entrez Gene www.ncbi.nlm.nih.gov/gene UCSC Genome Browser http://genome.ucsc.edu Affymetrix NetAffx Analysis Center www.affymetrix.com/analysis/index.affx dbSNP www.ncbi.nlm.nih.gov/projects/SNP/ UCSC LiftOver http://genome.ucsc.edu/cgi-bin/hgLiftOver My Experiment www.myexperiment.org/ GenePattern GRRD www.broadinstitute.org/cancer/software/genepattern/grrd/ Sweave www.stat.uni-muenchen.de/∼leisch/Sweave/ HUGO www.genenames.org/aboutHGNC.html DAVID http://david.abcc.ncifcrf.gov/ GeneCruiser http://genecruiser.broadinstitute.org/genecruiser3/ Frontiers in Genetics | Statistical Genetics and Methodology March 2013 | Volume 4 | Article 38 | 6

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download PDF