Playing Biology’s Name Game: Identifying Protein Names in Scientific Text
Daniel Hanisch, Juliane Fluck, Heinz-Theodor Mevissen and Ralf Zimmer
Pac Symp Biocomput. 2003:403-14.

Abstract
- Construction of a comprehensive general-purpose name dictionary
- An accompanying automatic curation procedure based on a simple token model of protein names
- An efficient search algorithm to analyze all abstracts in MEDLINE
- Parameters are optimized using machine learning techniques

Model for protein and gene names
- Protein names are often composed of more than one word (token)
- The order of these words is not very important: permutations of tokens may occur
- General-purpose dictionaries of protein names must be composed automatically

Token classes (1/3)

Token classes (2/3)
- Extract all words from the dictionary with frequency of occurrence > 100
- Non-descriptive tokens: words that occur in databases but are rarely used in free text, or that have no influence on the significance of a match
- Modifier tokens: words crucial for correct recognition

Token classes (3/3)
- Specifier tokens: Arabic and Roman numerals and Greek letters
- Delimiter tokens: used to gain specificity in the matching procedure; they help identify name boundaries
- Common words: obtained by comparison with a standard English dictionary
- Standard tokens: gene identifiers, as they cannot be easily assigned to a separate class

Automatic generation of the dictionary
- Extract gene symbols, alias names, and full names for all human genes from the HUGO Nomenclature database
- Create an entry for each official gene symbol and add the corresponding names from the OMIM database
- Extract all synonyms from the SWISSPROT and TREMBL databases and match these to HUGO entries

Curation of the dictionary (1/3)
- Goal: resolve ambiguities and remove nonsensical names from the dictionary
- The curation procedure consists of two phases: expansion and pruning
- Expansion:

Curation of the dictionary (2/3)
- Pruning: remove redundancies, ambiguities, and irrelevant synonyms
- First: synonym → a sequence of token class identifiers
- Use regular expressions to search for unspecific synonyms (e.g. only non-descriptive tokens, only specifier tokens, etc.)
- Finally, a list of ambiguous names is stored separately with references to their original records

Curation of the dictionary (3/3)
- The ambiguity list can be used to identify such entries and move them to the manual curation list based on their frequency of occurrence

Efficient detection of names (1/3)
- MEDLINE contains about 11 million abstracts
- The search runs in time linear in the number of tokens of the parsed text
- Sweep over the abstract one token at a time, keeping a set of candidate solutions and two associated scoring measures for the present position: the boundary score s_b and the acceptance score s_a

Efficient detection of names (2/3)
- Boundary score s_b: controls the end of the extension of a candidate match and is increased on a token mismatch. The candidate is pruned if s_b > boundary threshold
- Acceptance score s_a: determines whether the candidate is reported as a match. s_a is a linear combination of token-class-specific match and mismatch terms; in other words, the significance of token classes varies

Efficient detection of names (3/3)
- Example:
- Only the non-descriptive token “precursor” is unmatched in the candidate → a nearly maximal match score is computed (if non-descriptive tokens receive a small weight)
- However, the semantically significant modifier token “receptor” leads to a substantial mismatch term (if weights are set appropriately)

Parameter optimization
- Robust linear programming (RLP) was used to compute a set of sensible weights
- This supervised machine learning technique uses a set of positive samples, i.e. correctly identified protein names, and a set of negative ones
- The match and mismatch weighting parameters for delimiter, specifier, modifier, and standard tokens were tuned
- The optimized weights penalize mismatches of modifier and number tokens and reward matches of other token classes to varying extents

Evaluation
- The test dataset is based on the TRANSPATH database on regulatory interactions
- All human proteins with SWISSPROT annotations were extracted
- Abstracts were discarded if no text was available or if a protein was described for the first time
- The resulting benchmark set consists of 611 associations (141 objects in 470 abstracts)

Results – 5-fold cross-validation
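
To make the token-class model and the two scores from “Efficient detection of names” more concrete, here is a minimal Python sketch. Everything specific in it is an assumption for illustration only: the word lists, the weights, the boundary threshold, and the names `token_class` and `score_candidate` do not come from the slides. It extends a single candidate from one start position instead of performing the full linear-time sweep over all candidates, and the weights are hand-picked rather than optimized by robust linear programming.

```python
# Hypothetical token-class word lists; the slides do not give the real ones.
NON_DESCRIPTIVE = {"precursor", "protein", "fragment"}
MODIFIER = {"receptor", "inhibitor", "ligand"}
GREEK = {"alpha", "beta", "gamma", "delta", "kappa"}
ROMAN = {"i", "ii", "iii", "iv", "v", "vi", "vii", "viii", "ix", "x"}

# Hand-picked illustrative weights per token class (assumed, not optimized).
MATCH_W = {"standard": 1.0, "modifier": 1.0,
           "specifier": 0.5, "non_descriptive": 0.1}
MISMATCH_W = {"standard": 1.0, "modifier": 2.0,
              "specifier": 2.0, "non_descriptive": 0.1}
BOUNDARY_PENALTY = 1.0  # added to the boundary score per unmatched text token


def token_class(token):
    """Assign a token to one of the (simplified) token classes."""
    t = token.lower()
    if t in NON_DESCRIPTIVE:
        return "non_descriptive"
    if t in MODIFIER:
        return "modifier"
    if t.isdigit() or t in GREEK or t in ROMAN:
        return "specifier"
    return "standard"


def score_candidate(synonym, text_tokens, start, boundary_threshold=1.0):
    """Extend one candidate from text_tokens[start].

    Returns (acceptance, end), where end is the index just past the last
    matched text token. Token order is ignored, as in the token model.
    """
    remaining = {t.lower() for t in synonym}
    boundary = 0.0    # grows on mismatches; prunes the extension
    acceptance = 0.0  # linear combination of match and mismatch terms
    end = start
    for i in range(start, len(text_tokens)):
        tok = text_tokens[i].lower()
        if tok in remaining:
            remaining.discard(tok)
            acceptance += MATCH_W[token_class(tok)]
            end = i + 1
            if not remaining:
                break
        else:
            boundary += BOUNDARY_PENALTY
            if boundary > boundary_threshold:
                break
    # Synonym tokens never found in the text contribute mismatch terms.
    acceptance -= sum(MISMATCH_W[token_class(t)] for t in remaining)
    return acceptance, end


text = "the tumor necrosis factor binds".split()
with_precursor = score_candidate(
    ["tumor", "necrosis", "factor", "precursor"], text, start=1)
with_receptor = score_candidate(
    ["tumor", "necrosis", "factor", "receptor"], text, start=1)
print(with_precursor, with_receptor)
```

With these assumed weights, the dictionary entry whose only unmatched token is the non-descriptive “precursor” keeps a high acceptance score, while the entry with the unmatched modifier “receptor” is penalized much more strongly, mirroring the example on the slides.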
