Download OSIRIS: a tool for retrieving literature about sequence variants

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Genomic imprinting wikipedia , lookup

Transposable element wikipedia , lookup

Protein moonlighting wikipedia , lookup

Epistasis wikipedia , lookup

Tag SNP wikipedia , lookup

Metagenomics wikipedia , lookup

Epigenetics of neurodegenerative diseases wikipedia , lookup

Epigenetics of human development wikipedia , lookup

Genomics wikipedia , lookup

History of genetic engineering wikipedia , lookup

Pathogenomics wikipedia , lookup

Pharmacogenomics wikipedia , lookup

Epigenetics of diabetes Type 2 wikipedia , lookup

Copy-number variation wikipedia , lookup

NEDD9 wikipedia , lookup

Human genetic variation wikipedia , lookup

Genetic engineering wikipedia , lookup

Saethre–Chotzen syndrome wikipedia , lookup

Genome evolution wikipedia , lookup

Point mutation wikipedia , lookup

Gene therapy of the human retina wikipedia , lookup

Nutriepigenomics wikipedia , lookup

Neuronal ceroid lipofuscinosis wikipedia , lookup

Vectors in gene therapy wikipedia , lookup

Gene wikipedia , lookup

Genome (book) wikipedia , lookup

The Selfish Gene wikipedia , lookup

Public health genomics wikipedia , lookup

Genome editing wikipedia , lookup

Gene therapy wikipedia , lookup

Gene expression programming wikipedia , lookup

RNA-Seq wikipedia , lookup

Site-specific recombinase technology wikipedia , lookup

Gene expression profiling wikipedia , lookup

Gene desert wikipedia , lookup

Helitron (biology) wikipedia , lookup

Therapeutic gene modulation wikipedia , lookup

Gene nomenclature wikipedia , lookup

Designer baby wikipedia , lookup

Microevolution wikipedia , lookup

Artificial gene synthesis wikipedia , lookup

Transcript
BIOINFORMATICS APPLICATIONS NOTE
Vol. 22 no. 20 2006, pages 2567–2569
doi:10.1093/bioinformatics/btl421
Data and text mining
OSIRIS: a tool for retrieving literature about sequence variants
Julio Bonis, Laura Inés Furlong and Ferran Sanz
Research Unit on Biomedical Informatics (GRIB), IMIM, Universitat Pompeu Fabra, carrer Dr Aiguader 88,
E-08003 Barcelona, Spain
Received on April 19, 2006; revised on July 13, 2006; accepted on July 28, 2006
Advance Access publication July 31, 2006
Associate Editor: Alvis Brazma
ABSTRACT
Summary: Sequence variants, in particular single nucleotide
polymorphisms (SNPs), are key elements for the identification of
genes associated with complex diseases and with particular drug
responses. The search for literature about sequence variation is
hampered by the large number of allelic variants reported for many
genes and by the variability in both gene and sequence variants
nomenclatures. We describe OSIRIS, a search tool that integrates
different sources of information with the aim to retrieve literature
about sequence variation of a gene. In addition, it provides a method
to link a dbSNP entry with the articles referring to it.
Availability: OSIRIS is available for public use at http://ibi.imim.es/
Contact: [email protected]
Supplementary information: Supplementary data are available at
Bioinformatics online.
1
INTRODUCTION
Sequence variants, in particular single nucleotide polymorphisms
(SNPs), are key elements for the identification of genes associated
with complex diseases and with particular drug responses
(McCarthy 2002). A common task in the disease–gene association
studies and pharmacogenomics research involves the search for
articles that refer to all the SNPs identified for a gene or genes
of interest. This task involves several steps such as retrieving all
the SNPs reported for the gene by searching on databases, for
instance dbSNP (Sherry et al., 1999), translating each SNP into
character strings suitable for queries in PubMed, and finally performing the corresponding PubMed searches. A particular difficulty
arises from the variability in gene and genetic variants nomenclature. One gene can be referred to by different names in the literature,
e.g. the gene encoding the 2A serotonin receptor can be found as
HTR2A, 5-HT2A or HTR2, among other terms. A similar situation
arises with the sequence variation nomenclature; for instance, a SNP
can be referred to as a change in nucleotide (e.g. C102T, T102C,
102 C>T) or in the protein sequence (e.g. V34I, I34V, Ile34Val,
Isoleucine 34 Valine). It has to be pointed out that this terminology
is ambiguous as, for instance, C102T represents a cytosine to
thymine change, but also a cysteine to threonine one, requiring
contextual information for disambiguation. Moreover, different
numbering conventions are frequently used. The complexity of
the nomenclature hampers the exhaustive retrieval of literature
about the sequence variants of a gene in a systematic and
reproducible way.
To whom correspondence should be addressed.
In the previous years many computational methods have been
proposed for the automatic retrieval of biomedical literature (Jensen
et al., 2006). Several initiatives have been directed to the task of
gene and protein name recognition (Chang et al., 2004; Fundel
et al., 2005; Hanisch et al., 2005) and some works have been
published on the problem of retrieval of sequence variants from
literature (Horn et al., 2004; Rebholz-Schuhmann et al., 2004).
These approaches are important for the development of repositories
such as the collection of mutational data for a set of genes extracted
from literature (Rebholz-Schuhmann et al., 2004) or the database on
mutations of nuclear receptors and G-proteins coupled to them
(Horn et al., 2004). However, a general method for the automatic
search of literature dealing with sequence variants of a gene is
still lacking.
In this communication we describe OSIRIS, a tool that facilitates
the PubMed retrieval of biomedical literature that refers to the
sequence variants located in a gene and its vicinity. OSIRIS integrates different sources of information (NCBI Gene, dbSNP and
Medline databases) and provides a list of the articles that refer to
the sequence variants present in the studied gene.
2
METHODS
The only input required by OSIRIS is the gene name, and the output are
XML and HTML files containing information regarding the gene, its
sequence variants and the list of articles found. See Figure 1 in supplementary information for a schematic view of OSIRIS data flow. OSIRIS is
implemented in Python programming language (version 2.3.3), using the
Biopython library (http://www.python.org/). OSIRIS uses the NCBI Eutils
facilities (http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html)
in order to access to the information available at the NCBI Gene (http://www.
ncbi.nlm.nih.gov/entrez/query.fcgi?db¼gene), dbSNP (http://www.ncbi.
nlm.nih.gov/projects/SNP/) and Medline (http://www.ncbi.nlm.nih.gov/
entrez/query.fcgi?db¼PubMed) databases.
2.1
Retrieving information about the gene and
sequence variants
Given the gene name of interest the application performs a search for the
corresponding human entries in the Gene database of NCBI using Eutils
facilities. For this purpose, the filter ‘gene name[GENE] AND homo sapiens[ORGN]’ is used. If a database entry is not found for the gene name
introduced by the user, a second search without the ‘[GENE]’ tag is carried
out and the user is asked to choose between the Gene entries before resuming
the OSIRIS search. Then, gene name synonyms, location in the chromosome, summary and the list of gene products are extracted from the Gene
XML records. The list of sequence variants located in the gene and its
vicinity are retrieved using Eutils-Elink. The sequence variations considered
The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
2567
J.Bonis et al.
are SNPs, insertion/deletion polymorphisms (indel), microsatellite and
named variations (e.g. Alu sequences). The type of variation, nucleotidic
and amino acidic alleles, location in the sequence and validation status are
obtained from dbSNP. For all the variants, OSIRIS reports their position
taking as reference point the beginning of the gene sequence. In the case of
coding variants, the positions relative to the mRNA and protein sequences
are also provided.
2.2
Alternative sequence variants nomenclature
algorithm
For each variant, the algorithm generates a set of alternative nomenclatures.
For instance, given the SNP position ‘POS’, the reference allele ‘A’
and alternative allele ‘B’, some of the nomenclature variants are coded
as follows:
‘A’ þ ‘POS’ þ ‘B’
‘B’ þ ‘POS‘ þ ‘A’
‘A’ þ ‘non-alph’ þ ‘POS’ þ ‘non-alph’ þ ‘B’
‘B’ þ ‘non-alph’ þ ‘POS‘ þ ‘non-alph’ þ ‘A’
‘POS’ þ ‘non-alph’ þ ‘A’ þ ‘non-alph’ þ ‘B’
‘POS’ þ ‘non-alph’ þ ‘B’ þ ‘non-alph’ þ ‘A’‚
where ‘non-alph’ represents any, or none, non-alphanumeric strings. This
process is repeated for each nucleotidic notation and, in the case of coding
variants, amino acidic notation. In the latter case, the notations generated
include the entire standard amino acid notation (one letter, three letters and
full name).
2.3
Generation of Medline query strings and
literature retrieval
Once the different names of the gene have been retrieved from Gene database, and the alternative nomenclatures have been obtained for each variant
in the gene, a collection of PubMed queries is generated. The queries are
built by Boolean combination of each of the gene name synonyms with each
of the alternative nomenclatures for all the SNPs. Each query is submitted to
PubMed through Eutils-Efetch, and the list of articles retrieved is classified
under headings specifying the corresponding dbSNP entries. In addition, a
query containing the gene name synonyms combined with the filter ‘AND
(SNP[TW] OR polymorphism[TW] OR ‘Polymorphism, Single Nucleotide’[MeSH])’ is performed to retrieve the articles related with genetic
variants not associated to a particular dbSNP entry (‘generic’ search).
2.4
Final report
Data generated by an OSIRIS search are stored as an XML file, which is
automatically converted into HTML through XSLT. A link to the HTML
output file is sent by email to the user by the online facility. The XML files
are available only upon request.
3
RESULTS
Examples of OSIRIS output files are provided in the Supplementary
information section. They refer to the genes of the serotonin receptor subtype 2A (HTR2A), transthyretin (TTR), dopamine receptor
D3 (DRD3), lysil oxidase (LOX), cyclooxigenase 2 (PTGS2) and
superoxide dismutase 1 (SOD1) (date of search April 4, 2006).
The first section of the output files is the General Information
that includes the list of the synonyms used to refer to the gene,
the Gene database accession number, the chromosome location
and a brief summary describing the gene function. Information
about sequence variants (non-coding and coding variants) is provided in the Sequence Variation section. For each variant, the
2568
dbSNP accession number (rs code), the type of variation (snp,
in-del, etc.), its position in the sequence and the alternative alleles
are given. Finally, the list of articles found for each variant, and a list
of articles that result from a ‘generic’ search are provided (see the
Methods section for details). All data are provided with their corresponding links to the NCBI databases. For instance, the OSIRIS
search for the HTR2A gene found 255 non-coding and 7 coding
variants, and assigned 160 articles to the 7 coding variants. Some
articles were annotated to more than one variant, resulting in 130
unique articles. The checking of these articles showed that all of
them were correctly annotated to a specific dbSNP entry (Supplementary Table 1). In other OSIRIS searches (TTR, DRD3, LOX), all
the automatically found relationships between articles and dbSNP
entries were also correct. In contrast, the results for the PTGS2
gene showed one wrong annotation (PMID: 11738487) out of
the 12 articles retrieved. This annotation was owing to an incorrect
matching of the ‘3A-G’ SNP (rs:20426) with the abbreviation of
‘3-(arylmethylidene)aminoxy-(3a-g, 4a-g)’ that appears in the
abstract. In the case of the SOD1 gene, the only two articles
retrieved were false positives. One of the articles (PMID:
16290266) was correctly matched to the gene name and a particular
SNP (rs:7277748 and 1788098, G39A); however, its was a wrong
annotation since there was a typo in the abstract (‘G93A’ had
been erroneously typed as ‘G39A’). The other incorrectly assigned
article in the SOD1 search resulted from the coincidence of the term
ALS, which is a synonym of the SOD1 gene, and the acronym of
Amyotropic Lateral Sclerosis, being the article referred to this
disease. Surprisingly, the abstract included an SNP of the PERSYN
gene, which has the same location and alleles as the SOD1 one
(rs: 17886025).
4
DISCUSSION AND CONCLUSION
OSIRIS is a search tool that integrates different sources of information with the aim to retrieve the literature about the genetic variation
of a gene. In contrast with related approaches (Horn et al., 2004;
Rebholz-Schuhmann et al., 2004), OSIRIS links the articles to the
corresponding entries in dbSNP.
For most of the tests carried out, OSIRIS showed a high precision
in the task of assigning articles to dbSNP entries. The few incorrect
annotations were owing to the ambiguity existing in the terms used
to designate genes and variants. We anticipate that an improvement
in the performance of the method could be attained by adding a
validation step to eliminate false positive results. On the other hand,
the use of more comprehensive lists of synonyms of the studied
genes would improve the recall of the method. We are currently
working on this improvement.
ACKNOWLEDGEMENTS
The authors acknowledge the financial support provided by the
networks INBIOMED and INFOBIOMED.
REFERENCES
Chang,J.T. et al. (2004) GAPSCORE: finding gene and protein names one word at a
time. Bioinformatics, 20, 216–225.
Fundel,K. et al. (2005) A simple approach for protein name identification: prospects
and limits. BMC Bioinformatics, 6 (Suppl. 1), S15.
OSIRIS
Hanisch,D. et al. (2005) ProMiner: rule-based protein and gene entity recognition.
BMC Bioinformatics, 6 (Suppl. 1), S14.
Horn,F. et al. (2004) Automated extraction of mutation data from the literature:
application of MuteXt to G protein-coupled receptors and nuclear hormone receptors. Bioinformatics, 20, 557–568.
Jensen,L.J. et al. (2006) Literature mining for the biologist: from information retrieval
to biological discovery. Nat. Rev. Genet., 7, 119–129.
McCarthy,J. (2002) Turning SNPs into Useful Markers of Drug Response. In Licinio,J.
and Wong,M.L. (eds), Pharmacogenomics: The Search for Individualized Therapies. Wiley-VCH, Weinheim, pp. 35–55.
Rebholz-Schuhmann,D. et al. (2004) Automatic extraction of mutations from Medline
and cross-validation with OMIM. Nucleic Acids Res., 32, 135–142.
Sherry,S.T. et al. (1999) dbSNP-database for single nucleotide polymorphisms and
other classes of minor genetic variation. Genome Res., 9, 677–679.
2569