Literature retrieval

Human gene thesaurus
Gene names are ambiguous [1, 2]: a gene may have synonyms (different names for the same gene)
and homonyms (different genes or unrelated concepts that share a name). To resolve this
ambiguity, GenCLiP uses a human gene thesaurus that collects all aliases for each gene and
improves the specificity of each gene's search terms using several previously described
methods [3-6]. The human gene thesaurus was compiled from the HUGO Nomenclature Committee
database [7] and the Entrez Gene database [8]. Human gene symbols (both official and alias),
gene names, and product names were included. Gene names were processed as follows. (i) Contents
in parentheses were deleted. (ii) Variant forms of each gene symbol were generated by adding or
removing a space between the final non-digit character and the trailing digits, such as ‘Bcl 2’
versus ‘Bcl2’. (iii) Symbols shorter than three characters (such as ‘CT’ and ‘A1’) were removed.
(iv) Symbols that are English words (such as ‘FAT’) were removed using an English dictionary
[4, 5]. (v) Gene names that are common words (such as ‘protein’ and ‘tissue’) were removed using
a baseline occurrence list (provided by D. Chaussabel, personal communication), which has been
shown to be unbiased [6]. A gene name was considered common if its baseline occurrence was
greater than 1%. We chose this somewhat high cutoff because some of the most investigated genes
(p53, for example) have a baseline occurrence of 1%. (vi) Common phrase gene names (i.e., those
in which every term is a common word, such as ‘novel protein’) with an exceptionally high number
of hits (more than 100) were manually curated. (vii) If a gene name was shorter than five
characters [3], identical to a cell line name [9], or composed of common words, an assistant
search term (one of the uncommon words or phrases derived from the list of full gene names) was
required. It should be noted that some of these processing steps, such as the removal of English
words and the use of assistant search terms, reduce sensitivity to some degree. These
parameters, however, can be manually corrected in the literature retrieval window of GenCLiP.
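
The automated steps above, (i) to (v) and (vii), can be illustrated with a minimal Python sketch. It is not the GenCLiP implementation: the function names, the tiny word lists, and the baseline occurrence values are hypothetical stand-ins for the English dictionary, the baseline occurrence list, and the cell line name list used in the text. For simplicity the sketch applies every filter to every name, measures length with spaces ignored, and omits the manual curation of step (vi).

```python
import re

# Illustrative stand-ins (assumptions): the source used a full English
# dictionary, a baseline occurrence list and a cell line name list.
ENGLISH_WORDS = {"fat", "cat", "rest"}
BASELINE_OCCURRENCE = {"protein": 0.20, "tissue": 0.05, "novel": 0.03, "p53": 0.01}
CELL_LINE_NAMES = {"hela"}
COMMON_CUTOFF = 0.01  # the 1% baseline occurrence cutoff from the text


def strip_parentheses(name):
    """Step (i): delete parenthesised content."""
    return re.sub(r"\s*\([^)]*\)", "", name).strip()


def spacing_variants(symbol):
    """Step (ii): forms with and without a space before the trailing digits."""
    match = re.fullmatch(r"(.*[^\d\s])\s?(\d+)", symbol)
    if match is None:
        return {symbol}
    stem, digits = match.groups()
    return {symbol, stem + digits, stem + " " + digits}


def is_common(term):
    """A term is common if its baseline occurrence exceeds the cutoff."""
    return BASELINE_OCCURRENCE.get(term.lower(), 0.0) > COMMON_CUTOFF


def needs_assistant(name):
    """Step (vii): short names, cell line names and all-common-word names."""
    words = name.lower().split()
    return (
        len(name.replace(" ", "")) < 5  # assumption: spaces are not counted
        or name.lower() in CELL_LINE_NAMES
        or all(is_common(word) for word in words)
    )


def build_terms(raw_names):
    """Map each retained search term to whether it needs an assistant term."""
    terms = {}
    for raw in raw_names:
        name = strip_parentheses(raw)
        if len(name.replace(" ", "")) < 3:   # step (iii): too short
            continue
        if name.lower() in ENGLISH_WORDS:    # step (iv): English word
            continue
        if is_common(name):                  # step (v): common word
            continue
        flag = needs_assistant(name)
        for variant in spacing_variants(name):
            terms[variant] = flag
    return terms
```

With these stand-in lists, ‘FAT’, ‘CT’ and ‘protein’ are dropped, whereas ‘p53’ is kept because its assumed baseline occurrence of exactly 1% does not exceed the cutoff. It is still flagged as needing an assistant term because it is shorter than five characters.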
Evaluation of performance
To test whether this gene thesaurus construction strategy resolves the synonym and homonym
problems, we conducted PubMed searches for 4,999 random human genes using three previously
described search strategies [5]: (i) the official symbol for each gene (Symbol), (ii) the
official symbol with all its aliases and gene product names (Expanded), and (iii) informative
terms only (Filtered). The Expanded search identified literature for ~700 additional genes
compared with searching the official gene symbols alone (3,433 versus 2,738; Table 1), but it
also raised the number of queries that returned unreasonable results from 2 to 42. The Filtered
search retained most of this gain (3,353 genes) while returning unreasonable results for only 3
queries. In addition to expanding the number of genes found in the literature, the Filtered
search terms increased the number of articles found per gene, from an average of 165 when
searching with the symbol alone to an average of 363. These results indicate that our gene
thesaurus construction strategy retrieves more of the relevant literature for each gene while
limiting the addition of irrelevant information.
Table 1. Summary of PubMed hit counts for 4,999 random human genes using different search
strategies.

Type of primary term (a)   Positive results (b)   Unreasonable results (c)   Articles per gene (d)
Symbol                     2,738                  2                          165
Expanded                   3,433                  42                         1,139
Filtered                   3,353                  3                          363

(a) The PubMed search was conducted using three search strategies: (i) ‘Symbol’ refers to a
search in which each gene was represented by its official symbol; (ii) ‘Expanded’ refers to
searches in which each gene was represented by the gene symbol, all its synonyms, and the
official gene product name; (iii) ‘Filtered’ refers to searches in which uninformative names
were filtered out of the expanded list.
(b) Number of queries that returned at least one result.
(c) Number of queries that returned more than 44,000 results. We used 44,000 as a rough
threshold for unreasonable results because some of the most investigated genes, like p53,
appear in fewer than 44,000 abstracts.
(d) The average number of abstracts per gene, counting only genes that appeared at least once
and did not appear in more than 44,000 abstracts.
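
The three summary columns of Table 1 follow directly from the per-gene hit counts, as defined in the table footnotes. The Python sketch below shows the arithmetic under those definitions. The function name and the example counts are hypothetical, and the per-gene counts from the original evaluation are not reproduced here.

```python
UNREASONABLE_CUTOFF = 44_000  # threshold for unreasonable results, from footnote (c)


def summarise(hit_counts):
    """Summarise per-gene PubMed hit counts as in Table 1.

    hit_counts: one integer per query (gene), the number of abstracts returned.
    """
    # Footnote (b): queries that returned at least one result.
    positive = [n for n in hit_counts if n >= 1]
    # Footnote (c): queries that returned more than 44,000 results.
    unreasonable = [n for n in positive if n > UNREASONABLE_CUTOFF]
    # Footnote (d): average over genes with at least one hit and no more
    # than 44,000 hits.
    usable = [n for n in positive if n <= UNREASONABLE_CUTOFF]
    mean = sum(usable) / len(usable) if usable else 0.0
    return {
        "positive": len(positive),
        "unreasonable": len(unreasonable),
        "articles_per_gene": mean,
    }


# Made-up counts for four genes, purely to show the calculation.
print(summarise([0, 10, 30, 50_000]))
```

Queries above the cutoff are excluded from the average but still counted as positive results, as in footnotes (b) and (d).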
References
1. Fundel K, Zimmer R: Gene and protein nomenclature in public databases. BMC
   Bioinformatics 2006, 7:372.
2. Tsai RT, Wu SH, Chou WC, Lin YC, He D, Hsiang J, Sung TY, Hsu WL: Various criteria in
   the evaluation of biomedical named entity recognition. BMC Bioinformatics 2006, 7:92.
3. Jenssen TK, Laegreid A, Komorowski J, Hovig E: A literature network of human genes for
   high-throughput analysis of gene expression. Nat Genet 2001, 28(1):21-28.
4. Alako BT, Veldhoven A, van Baal S, Jelier R, Verhoeven S, Rullmann T, Polman J, Jenster G:
   CoPub Mapper: mining MEDLINE based on search term co-publication. BMC
   Bioinformatics 2005, 6(1):51.
5. Rubinstein R, Simon I: MILANO--custom annotation of microarray results using
   automatic literature searches. BMC Bioinformatics 2005, 6(1):12.
6. Chaussabel D, Sher A: Mining microarray expression data by literature profiling. Genome
   Biol 2002, 3(10):RESEARCH0055.
7. HUGO Nomenclature Committee [http://www.gene.ucl.ac.uk/nomenclature/]
8. Entrez Gene [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?DB=gene]
9. Human and Animal Cell Line Names [http://www.biotech.ist.unige.it/cldb/cname-1c.html]