Michael Miller, Haissam Abdelhamid, Eugene Gross
XLS-CS-553-WS 2014S
Assignment 8
Case Study:
tmVar: A Text Mining Approach For Extracting Sequence Variants in Biomedical Literature
The Problem
Bioinformatics is the science of organizing, storing, retrieving, and analyzing complex biological
data, and it is currently at the forefront of understanding diseases and developing novel medicines.
One of the most important areas of bioinformatics is the analysis of the relationships among genes,
proteins, and disease. Genes store the genetic code for synthesizing the proteins that make
up organelles, cells, tissues, and ultimately organs. Genes are the blueprint of who we are, and genetic
sequence variation plays a major role in the link between genes and disease. With the multitude of
structured and free-text experimental data existing today, text mining can be a very powerful tool for
making fuller use of biological data. As a result, it can greatly enhance bioinformatics
and ultimately medicine. This case study focuses specifically on using text mining to find,
interpret, and analyze biochemical sequence variations, also referred to as mutations, in complex
diseases more efficiently. This can also lead to the creation of more useful disease-related
databases. Thus far, most existing biological data extraction approaches have been rule-based systems
that cover very limited sequence variations, and expanding their extraction capabilities involves
extensive manual effort in examining new variations and then deducing the corresponding rules. The
efficiency and speed of extracting biological mutations have therefore been greatly limited, and an
automatic approach was needed to speed up the research process and unlock the potential of
bioinformatics disease research. To the best of the researchers' knowledge, this work was the
first attempt to identify various mutation types according to the standard nomenclature of the Human
Genome Variation Society (HGVS), an affiliate of the International Federation of Human Genetics
Societies (IFHGS) and of the Human Genome Organization (HUGO).
Solution Overview
As previously mentioned, existing methods such as MutationFinder were very limited: they either
targeted only protein point mutations or covered only a few mutation types. The new approach, termed
tmVar, treats mutation identification mainly as a sequence labeling task and uses a conditional random
field (CRF) to extract a very wide range of sequence variants at the protein, DNA, and RNA levels. CRFs
are a statistical modeling method applied in pattern recognition and used for structured prediction.
The study covered several important mutation types that had not been considered in past studies. The
solution overview can be seen in Figure 1, which includes three major components:
1. Pre-processing is the first step, where a tokenizer divides the input text into a sequence of
tokens. Whereas tokenizers traditionally separate input text at spaces or punctuation, this
method separates tokens at a finer level, based on special
characters, numbers, and changes between upper and lower case.
2. Next, mutation identification is performed using a probability-based sequence detection CRF
model.
3. Lastly, post-processing is done to recover the few mentions missed by the CRF model and so
minimize false negatives. Here, the mentions extracted by the CRF module are translated into
regular expression patterns, which are then used to find additional mentions of a similar
kind in the same article.
The results are then output as normalized mutations.
Fig. 1. The system overview consisting of three major components
Methods and Procedures
PubMed was used to obtain a corpus of MEDLINE abstracts containing a very high number
of mutation mentions; the submitted PubMed query returned 5116 abstracts. The first step applied to
the input text was tokenization, where a finer level of separation was implemented. Tokens were split
at special characters such as '-', '*', and '+', as well as at numbers and at changes of letter
case. As an example, the mention in Figure 1, 'c.2708_2711delTTAG', was not regarded as one
token; instead, it was divided into seven pieces, as seen with another example in Table 1.
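The tokenization step described above can be illustrated with a minimal Python sketch. The regular expression and the names `TOKEN_PATTERN` and `tokenize` are assumptions made for illustration, not the tmVar implementation; the sketch only follows the stated rule of splitting at special characters, numbers, and changes of case.

```python
import re

# Hypothetical pattern (an assumption, not tmVar's actual tokenizer):
# a run of uppercase letters, a run of lowercase letters, or a run of
# digits forms one token; any other non-space character is its own token.
TOKEN_PATTERN = re.compile(r"[A-Z]+|[a-z]+|[0-9]+|[^A-Za-z0-9\s]")

def tokenize(text):
    """Split text into fine-grained tokens."""
    return TOKEN_PATTERN.findall(text)

print(tokenize("c.2708_2711delTTAG"))
# ['c', '.', '2708', '_', '2711', 'del', 'TTAG']
```

Under these assumptions the example mention yields seven tokens, matching the count given in the text.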
As seen in Table 1, every mutation mention becomes a series of labels. The probability-based sequence detection CRF model was then used, which defines the conditional probability distribution
P(Y|X) of label sequence Y given observation sequence X:
$$P(Y \mid X) = \frac{\exp(F(X, Y))}{\sum_{Y'} \exp(F(X, Y'))}$$
where y1 to yn is the label sequence Y and x1 to xn is the token sequence X.
$$F(X, Y) = \sum_{j=1}^{n} \sum_{i=1}^{w} \omega_i f_i(y_j, y_{j-1}, X)$$
is a global feature vector for label sequence Y and observation sequence X, and ω1 to ωw is a feature
weight vector [1].
Each label yj applies to the corresponding token xj; the 10 different labels, which describe
mutation elements, are listed in Table 2.
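To make the two formulas concrete, the following Python sketch computes P(Y|X) by brute-force enumeration over every candidate label sequence. The two-label set, the three feature functions, and the weights are all hypothetical values chosen for illustration; they are not the labels of Table 2 or the features used by tmVar. The position j is passed explicitly to the feature function, which the formula leaves implicit. A practical CRF would use dynamic programming rather than enumeration.

```python
import itertools
import math

LABELS = ["O", "M"]          # hypothetical label set, not the Table 2 labels
WEIGHTS = [1.5, 0.5, 0.8]    # hypothetical feature weights (omega_1..omega_w)

def features(y, y_prev, X, j):
    """Toy feature functions f_i(y_j, y_{j-1}, X), evaluated at position j."""
    return [
        1.0 if y == "M" and X[j].isdigit() else 0.0,
        1.0 if y == "O" and X[j].isalpha() else 0.0,
        1.0 if y == y_prev else 0.0,
    ]

def score(X, Y):
    """F(X, Y): weighted feature values summed over all positions."""
    total = 0.0
    y_prev = "START"  # assumed placeholder for the label before position 1
    for j, y in enumerate(Y):
        total += sum(w * f for w, f in zip(WEIGHTS, features(y, y_prev, X, j)))
        y_prev = y
    return total

def probability(X, Y):
    """P(Y|X) = exp(F(X, Y)) / sum over all Y' of exp(F(X, Y'))."""
    z = sum(math.exp(score(X, list(Yp)))
            for Yp in itertools.product(LABELS, repeat=len(X)))
    return math.exp(score(X, Y)) / z

tokens = ["c", ".", "2708"]
print(probability(tokens, ["O", "O", "M"]))
```

Because the denominator sums over every possible label sequence, the probabilities of all sequences for a given token sequence add up to 1.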
Six different types of features were engineered for CRF:
1. Dictionary features, where HGVS mutation nomenclature was followed.
2. General linguistic features, to capture mutation mentions with brief natural language such
as 'G→A', for example.
3. Character features, to capture mutation features with numbers and special characters, which
are often used in amino acid substitutions.
4. Semantic features, where a binary system was used for describing mutation-specific
characteristics.
5. Case pattern features, where a pattern is constructed to capture case shifting in a token.
6. Contextual features, where dictionary and linguistic features of three neighboring tokens
from each side were included to utilize contextual information.
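A few of these feature types can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the case pattern mapping (uppercase to 'A', lowercase to 'a', digit to '0') and the function names are hypothetical, and the contextual features here reuse the case pattern of neighboring tokens as a stand-in for the dictionary and linguistic features named in the list.

```python
def case_pattern(token):
    """Map each character to a class symbol to expose case shifts.

    Assumed mapping: uppercase -> 'A', lowercase -> 'a', digit -> '0',
    anything else is kept unchanged.
    """
    out = []
    for ch in token:
        if ch.isupper():
            out.append("A")
        elif ch.islower():
            out.append("a")
        elif ch.isdigit():
            out.append("0")
        else:
            out.append(ch)
    return "".join(out)

def token_features(tokens, j, window=3):
    """Build a feature dictionary for tokens[j], with context from
    up to `window` neighboring tokens on each side."""
    feats = {
        "token": tokens[j],
        "case_pattern": case_pattern(tokens[j]),
        "is_number": tokens[j].isdigit(),
    }
    for offset in range(-window, window + 1):
        k = j + offset
        if offset != 0 and 0 <= k < len(tokens):
            # Stand-in contextual feature (assumption): neighbor's case pattern.
            feats[f"case_pattern[{offset:+d}]"] = case_pattern(tokens[k])
    return feats

tokens = ["c", ".", "2708", "_", "2711", "del", "TTAG"]
print(token_features(tokens, 5))
```

A dictionary of this shape is the kind of per-token input that common CRF toolkits accept, although the exact format depends on the toolkit.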
The third major component, post-processing, entailed translating extracted mutation mentions into
regular expression patterns for finding additional mentions that were missed by the CRF model. Two
rules were applied to do this: first, all numerical digits became '[0-9]+'; second, all lowercase and
uppercase letters became '[a-z]' and '[A-Z]', respectively. The only exceptions were three special
tokens, IVS, EX, and RS. In addition, several regular expression patterns adapted from MutationFinder
were added.
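The two translation rules can be sketched in Python as below. Several details are assumptions, since the text does not specify them: each letter is replaced individually, the special tokens are matched case-insensitively and kept literal, and all other characters are escaped and kept as-is. The function name `mention_to_pattern` is hypothetical.

```python
import re

SPECIAL_TOKENS = {"IVS", "EX", "RS"}  # kept literal, per the stated exception

def mention_to_pattern(mention):
    """Generalize an extracted mention into a regular expression string."""
    parts = []
    for tok in re.findall(r"[A-Z]+|[a-z]+|[0-9]+|[^A-Za-z0-9]", mention):
        if tok.upper() in SPECIAL_TOKENS:
            parts.append(re.escape(tok))        # assumption: keep token literal
        elif "0" <= tok[0] <= "9":
            parts.append("[0-9]+")              # rule 1: digit run
        elif "a" <= tok[0] <= "z":
            parts.append("[a-z]" * len(tok))    # rule 2: one class per letter
        elif "A" <= tok[0] <= "Z":
            parts.append("[A-Z]" * len(tok))
        else:
            parts.append(re.escape(tok))        # punctuation stays literal
    return "".join(parts)

pattern = mention_to_pattern("c.2708_2711delTTAG")
# The generalized pattern also matches a different mention of the same shape.
print(bool(re.fullmatch(pattern, "c.123_456delAAGT")))
```

A pattern built this way from one CRF-extracted mention can then be searched for across the rest of the same article to pick up mentions the model missed.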
System Deployment and Results
tmVar was tested and then compared with existing methods, such as MutationFinder.
Precision, recall, and F-measure were measured on all mentions to evaluate the tmVar system's
ability to extract different mutations. F-measure, or F1, is the harmonic mean of precision and recall,
and is a measure of a test's accuracy. It is defined as:
$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
where precision is the fraction of retrieved instances that are relevant and recall is the fraction of
relevant instances that are retrieved. Tables 5 and 6 show that the tmVar method achieved higher
F-measures than MutationFinder on both the study corpus and MutationFinder's own corpus. The one
exception is a small drop in precision on the normalized mutations of
MutationFinder's corpus, probably because it incorrectly identified DNA substitutions.
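The three evaluation measures defined above can be computed from counts of true positives, false positives, and false negatives, as in this short Python sketch. The function name and the example counts are hypothetical and are not results from the study.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from mention counts.

    tp: correctly extracted mentions
    fp: extracted mentions that are not real mutations
    fn: real mutation mentions that were missed
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts for illustration only.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.9 0.75 0.818
```

Because F1 is a harmonic mean, it sits closer to the lower of the two values, so a system cannot score well by maximizing precision or recall alone.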
In conclusion, a CRF-based machine-learning method was developed to extract mutations from
text with high efficiency and performance. The tmVar method complements and extends the
capabilities of existing methods for extraction from biomedical literature. However, there are
future directions that should be studied with tmVar as well. How well it can extract mutations from
text in other genres, and how it performs when integrated into other fields of research, are the next
questions that need to be explored.