Michael Miller, Haissam Abdelhamid, Eugene Gross
XLS-CS-553-WS 2014S
Assignment 8
Case Study: tmVar: A Text Mining Approach For Extracting Sequence Variants in Biomedical Literature

The Problem

Bioinformatics is the science of organizing, storing, retrieving, and analyzing complex biological data, and it is currently at the forefront of understanding diseases and developing novel medicines. One of its most important areas is analyzing the relationships among genes, proteins, and disease. Genes store the genetic code for synthesizing the proteins that make up organelles, cells, tissues, and ultimately organs. Genes are the blueprint of who we are, and genetic sequence variation plays a major role in the link between genes and disease. Given the multitude of structured and free-text experimental data that exists today, text mining can be a powerful tool for making fuller use of biological data, and as a result it can benefit bioinformatics and ultimately medicine.

This case study focuses specifically on using text mining to efficiently find, interpret, and analyze biochemical sequence variations, also referred to as mutations, in complex diseases. This can also lead to the creation of more useful disease-related databases. Thus far, most existing approaches to extracting this data have been rule-based systems that cover a very limited set of sequence variations, and expanding their coverage requires extensive manual effort to examine new variations and deduce the corresponding rules. The efficiency and speed of extracting biological mutations were therefore greatly limited, and an automatic approach was needed to speed up the research process and unlock the potential of bioinformatics disease research.
To the best of the researchers’ knowledge, this work was the first attempt to identify various mutation types according to the standard nomenclature of the Human Genome Variation Society (HGVS), an affiliate of the International Federation of Human Genetics Societies (IFHGS) and of the Human Genome Organization (HUGO).

Solution Overview

As previously mentioned, existing methods such as MutationFinder were very limited: they either targeted only protein point mutations or handled only a few mutation types. The new approach, termed tmVar, treats mutation identification mainly as a sequence labeling task and uses a conditional random field (CRF) to extract a wide range of sequence variants at the protein, DNA, and RNA levels. CRFs are a statistical modeling method applied in pattern recognition and used for structured prediction. The study covered several important mutation types that earlier studies had not considered. The solution overview is shown in Figure 1 and includes three major components, followed by an output step:

1. Pre-processing is the first step, where a tokenizer divides the input text into a sequence of tokens. Whereas tokenizers traditionally split input text on spaces or punctuation, this method splits at a finer level, on special characters, numbers, and changes between upper and lower case.
2. Next, mutation identification is performed using a probability-based sequence detection CRF model.
3. Lastly, post-processing recovers the few mentions missed by the CRF model, to minimize false negatives. The mentions extracted by the CRF module are translated into regular expression patterns, which are then used to find additional mentions of a similar kind in the same article.
4. The results are output as normalized mutations.

Fig. 1. The system overview, consisting of three major components

Methods and Procedures

PubMed was used to obtain a corpus of MEDLINE abstracts containing a high number of mutation mentions; the submitted PubMed query returned 5116 abstracts. The first step applied to the input text was tokenization, implemented at a finer level of separation than usual. Special characters such as ‘-’, ‘*’, and ‘+’ were used as split points, as were numbers and changes of letter case. For example, the mention ‘c.2708_2711delTTAG’ from Figure 1 was not treated as one token; instead it was divided into seven pieces, as illustrated with another example in Table 1.

As Table 1 shows, every mutation mention becomes a series of labels. The probability-based sequence detection CRF model was then applied. It defines the conditional probability distribution $P(Y \mid X)$ of a label sequence $Y$ given an observation sequence $X$:

$$P(Y \mid X) = \frac{\exp(F(X, Y))}{\sum_{Y'} \exp(F(X, Y'))}$$

where $Y = (y_1, \dots, y_n)$ is a label sequence and $X = (x_1, \dots, x_n)$ is a token sequence. Here

$$F(X, Y) = \sum_{j=1}^{n} \sum_{i=1}^{w} \omega_i \, f_i(y_j, y_{j-1}, X)$$

is the global feature function for label sequence $Y$ and observation sequence $X$, and $(\omega_1, \dots, \omega_w)$ is the feature weight vector. [1] Each $y_j$ indicates the label for the corresponding token; the 10 labels describing mutation elements are listed in Table 2. Six types of features were engineered for the CRF:

1. Dictionary features, which follow the HGVS mutation nomenclature.
2. General linguistic features, to capture mutation mentions written in brief natural language, such as ‘G→A’.
3. Character features, to capture mutation mentions containing numbers and special characters, which are often used in amino acid substitutions.
4. Semantic features, where binary features describe mutation-specific characteristics.
5. Case pattern features, where a pattern is constructed to capture case shifts within a token.
6. Contextual features, where the dictionary and linguistic features of three neighboring tokens on each side were included to make use of contextual information.

The third major component, post-processing, translated the extracted mutation mentions into regular expression patterns for finding additional mentions missed by the CRF model. Two rules were applied: first, all numerical digits became ‘[0-9]+’; second, all lowercase and uppercase letters became ‘[a-z]’ and ‘[A-Z]’, respectively. The only exceptions were three special tokens: IVS, EX, and RS. In addition, several regular expression patterns based on MutationFinder were added.

System Deployment and Results

tmVar was tested and then compared with earlier methods such as MutationFinder. Precision, recall, and F-measure were computed over all mentions to evaluate the system’s ability to extract different mutations. The F-measure, or F1, is the harmonic mean of precision and recall and is a measure of a test’s accuracy. It is defined as:

$$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

where precision is the fraction of retrieved instances that are relevant and recall is the fraction of relevant instances that are retrieved. Tables 5 and 6 show that tmVar achieved higher F-measures than MutationFinder both on the study corpus and on MutationFinder’s own corpus. The one exception was a small drop in precision on the normalized mutations of MutationFinder’s corpus, probably because DNA substitutions were incorrectly identified.

In conclusion, a CRF-based machine-learning method was developed to extract mutations from text with high efficiency and performance. The tmVar method complements and extends the capabilities of existing methods for extraction from biomedical literature; however, there are future directions that should be studied with tmVar as well.
How well it extracts mutations from text in other genres, and how it performs when integrated into other fields of research, are the next questions to explore.
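
The mechanical steps described under Methods and Procedures (fine-grained tokenization, turning an extracted mention into a regular expression, and computing F1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tmVar implementation: the function names are hypothetical, replacing each letter individually (rather than each run of letters) is one reading of the rule, and treating IVS, EX, and RS case-insensitively is our own choice.

```python
import re

# Assumption: these tokens are kept literal (the three exceptions named in
# the text); comparing them case-insensitively is our own choice.
SPECIAL_TOKENS = {"IVS", "EX", "RS"}

# One token per run of uppercase letters, lowercase letters, or digits,
# or per single non-alphanumeric, non-space character.
TOKEN_RE = re.compile(r"[A-Z]+|[a-z]+|[0-9]+|[^A-Za-z0-9\s]")


def tokenize(text):
    """Split text on special characters, numbers, and case changes."""
    return TOKEN_RE.findall(text)


def mention_to_pattern(mention):
    """Generalize one extracted mention into a regular expression string."""
    parts = []
    for token in tokenize(mention):
        first = token[0]
        if token.upper() in SPECIAL_TOKENS:
            parts.append(re.escape(token))       # exception: keep literal
        elif "0" <= first <= "9":
            parts.append("[0-9]+")               # rule 1: any run of digits
        elif "a" <= first <= "z":
            parts.append("[a-z]" * len(token))   # rule 2: one class per letter
        elif "A" <= first <= "Z":
            parts.append("[A-Z]" * len(token))
        else:
            parts.append(re.escape(token))       # punctuation stays literal
    return "".join(parts)


def find_additional_mentions(seed_mentions, text):
    """Return strings in text matching a seed's pattern but not yet known."""
    known = set(seed_mentions)
    found = []
    for mention in seed_mentions:
        pattern = re.compile(mention_to_pattern(mention))
        for match in pattern.finditer(text):
            if match.group() not in known:
                known.add(match.group())
                found.append(match.group())
    return found


def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(tokenize("c.2708_2711delTTAG"))
    # ['c', '.', '2708', '_', '2711', 'del', 'TTAG']
    print(mention_to_pattern("c.2708_2711delTTAG"))
    print(find_additional_mentions(
        ["c.2708_2711delTTAG"],
        "We also saw c.35_38delACGT in exon 2.",
    ))
```

The tokenizer yields seven tokens for ‘c.2708_2711delTTAG’, in line with the seven pieces mentioned in the text. Patterns generated this way are deliberately broad: because every letter becomes a character class, a pattern derived from a deletion would also match an insertion of the same shape, which suits a recall-oriented step that runs only within the same article.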