Metabolic entities corpus
Annotation guideline
Version 1.01 (2016-JAN-15)
Patumcharoenpol et al.
[email protected]
This work is licensed under a Creative Commons
Attribution-ShareAlike 4.0 International License.
Content
Introduction
Convention
Entities and Events
Entities
Event
Metabolic event
The task
General guidelines
Type-specific guidelines
Gene and Protein (GP) entities specific guidelines
Metabolite entities specific guidelines
Events specific guidelines
Appendix
Introduction
We are currently building a corpus to support the development of an integrated text mining
framework for metabolic interaction network reconstruction. This document is a practical guide
to the annotation task, intended to help create a consistent and well-formed corpus.
The guideline first sets out the conventions used throughout this document, then describes the
annotation task step by step. Finally, it provides a set of general and type-specific rules for
deciding what to include in the corpus.
Convention
The text in this document follows a simple convention. Bold font marks types of entities and
events, such as Gene and Protein (GP) for entities and Metabolic reaction, Metabolic
consumption, Metabolic production, and Positive regulation for events. All examples in this
document are shown in either text format or picture format, as described below.
Text format
Description: An excerpt from an abstract in italic font, with annotations underlined. The
example below shows the annotation of GP entities; for example, E.C. 1.1.1.262 is annotated as
a GP entity.
The fourth step is catalyzed by 4-hydroxythreonine-4-phosphate dehydrogenase (PdxA, E.C.
1.1.1.262), which converts 4-hydroxy-l-threonine phosphate (HTP) to 3-amino-2-oxopropyl
phosphate.
Picture format
Description: An excerpt from an abstract with two metabolic event annotations highlighted in
orange and two metabolite annotations highlighted in blue, as shown in Figure 1.
Figure 1 An example of picture format.
Entities and events
In this guideline, Gene and Protein (GP) and Metabolite are annotated as entities. Events are
processes or actions that involve entities. Entities are mostly expressed as nouns or noun
phrases, while events are mostly expressed as verbs or nominalized verbs. In addition, entities
and events can be classified into different sub-types. As shown in Figure 2, for example,
Phosphoglucosamine mutase, GlmM, glucosamine-1-phosphate, and glucosamine-6-phosphate are
identified as entities. The first two are Gene and Protein entities and the latter two are
Metabolite entities. Next, catalyzes and formation are identified as events of sub-types
Positive regulation and Metabolic reaction, respectively.
Figure 2 An example of annotation for entities and events.
We model the types of these entities and events to assist the later automatic extraction of
metabolic context from text. The types are therefore modeled around the synthesis and usage of
metabolites in text. The hierarchical relationships between annotation types are shown in
Figure 3. Nodes shown in all capitals are for organizational purposes only and are not
assignable types.
Figure 3 Hierarchical relationships between annotation types.
Entities
To meet our current implementation of the integrated text mining framework for metabolic
interaction network reconstruction, we divide entities into two types, Gene and Protein (GP)
and Metabolite, as shown in Table 1.
GP means genes and proteins that reside within an organism. This includes:
Gene: a genetic sequence on DNA that codes for mRNA or protein.
Protein: a long chain of amino acids.
Enzyme: a subset of proteins that have catalytic function.
mRNA: a polymer of ribonucleotides. Among RNA species, only mRNA is annotated.
Note: Other types of gene products that are not mentioned here, e.g., sRNA and snRNA, are not
annotated.
Metabolite means a specified chemical substance that is an intermediate or product of
metabolism. A metabolite is usually a small molecule, but it can also be an amino acid, lipid,
carbohydrate, or nucleotide. In this work, only chemicals that reside in the cell (in vivo) are
considered metabolites.
Table 1 Description of entity types
Entity type | Reference | Ontology ID
Gene or Protein (GP) | EcoCyc | SBO:0000246
Metabolite | ChEBI | SBO:0000247
Events
An event is an occurrence of a process or action. An event is normally expressed by a verb or
nominalized verb and is accompanied by entities as its arguments.
Figure 4 Metabolic reaction event with two arguments
In the example in Figure 4, the sentence contains an event that is represented by the word
transformation. This event has two arguments: acetolactate, of type Theme, and
dihydroxyisovalerate, of type Product.
More formally, an event is composed of two components, a trigger word and arguments.
1. Trigger word: A sequence of words that represents the event.
2. Argument: An entity that is associated with the event through the trigger word. Arguments
are further classified into two categories.
- Theme, the entity that instigates an event.
- Other argument, an optional argument that adds more biological detail to the event.
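The trigger-plus-arguments structure described above could be represented roughly as follows.
This is a minimal sketch and not part of the guideline: the class names Entity and Event and
their fields are hypothetical, and the example reuses the trigger word and arguments named for
Figure 4.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    """An annotated entity mention (hypothetical structure)."""
    text: str
    etype: str  # "GP" or "Metabolite"


@dataclass
class Event:
    """An event: one trigger word plus a list of (role, entity) arguments."""
    trigger: str
    etype: str
    arguments: list = field(default_factory=list)

    def add(self, role, entity):
        self.arguments.append((role, entity))

    def by_role(self, role):
        """Return the text of every argument that fills the given role."""
        return [entity.text for r, entity in self.arguments if r == role]


# The Figure 4 example: "transformation" with a Theme and a Product.
event = Event(trigger="transformation", etype="Metabolic reaction")
event.add("Theme", Entity("acetolactate", "Metabolite"))
event.add("Product", Entity("dihydroxyisovalerate", "Metabolite"))
```

Keeping arguments as a list of pairs, rather than one field per role, allows an event to carry
more than one Theme.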
Metabolic events
We focus on a specific type of event: a mention that gives a mechanistic description of a
metabolic interaction, which we call a metabolic event in this guideline. Such a mention should
explicitly describe the change of one chemical into another. In this work, we classify
metabolic events into 4 categories, as listed in Table 2.
1. Metabolic production: Metabolic event that corresponds to the formation of a metabolite.
2. Metabolic consumption: Metabolic event that corresponds to the consumption of a
metabolite.
3. Metabolic reaction: Metabolic event that corresponds to the conversion of a metabolite.
4. Positive regulation: The relation of an enzyme with a metabolic event.
Table 2 Event types and their arguments for this annotation task. The type of each argument is shown in parenthesis.
Event type | Arguments | Additional Arguments | Description | Ontology ID
Metabolic production | Theme: Metabolite, Cause: Enzyme | | Metabolic event that results in formation of metabolite. | SBO:0000176
Metabolic consumption | Theme: Metabolite, Cause: Enzyme | | Metabolic event that results in consumption of metabolite. | SBO:0000176
Metabolic reaction | Theme: Metabolite, Cause: Enzyme | | Metabolic event that results in conversion of metabolite. | SBO:0000176
Positive regulation | Theme: Event, Cause: Enzyme | | Enzyme relation with metabolic event (Metabolic production, Metabolic consumption and Metabolic reaction) | GO:0048518, GO:0044093
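The argument types in Table 2 can be read as a small schema. The sketch below checks an event
against it. It is illustrative only: the names EVENT_SCHEMA and check_event are hypothetical,
the Enzyme argument type is treated as a GP entity, a Theme is assumed to be mandatory, and
roles that Table 2 does not list (such as Product) are left unchecked.

```python
# Argument types per event type, transcribed from Table 2.
# Assumption: "Enzyme" arguments are filled by GP entities.
EVENT_SCHEMA = {
    "Metabolic production": {"Theme": "Metabolite", "Cause": "GP"},
    "Metabolic consumption": {"Theme": "Metabolite", "Cause": "GP"},
    "Metabolic reaction": {"Theme": "Metabolite", "Cause": "GP"},
    "Positive regulation": {"Theme": "Event", "Cause": "GP"},
}


def check_event(event_type, arguments):
    """Return a list of problems for a list of (role, filler_type) pairs."""
    schema = EVENT_SCHEMA.get(event_type)
    if schema is None:
        return [f"unknown event type: {event_type}"]
    problems = []
    for role, filler_type in arguments:
        expected = schema.get(role)
        # Roles outside Table 2 are skipped rather than rejected.
        if expected is not None and filler_type != expected:
            problems.append(f"{role} should be {expected}, got {filler_type}")
    # Assumption: every event needs at least one Theme.
    if not any(role == "Theme" for role, _ in arguments):
        problems.append("missing Theme")
    return problems
```

For example, check_event("Positive regulation", [("Theme", "Metabolite")]) reports that the
Theme should be an Event.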
The task
In this work, we would like to discover all events that represent a metabolic process. To do
this, we need to identify the following.
1. All genes and proteins as well as metabolites that are mentioned in text.
2. All metabolic reactions that can be assigned as events.
3. Roles of proteins and metabolites in the context of metabolic events.
This information can be used to infer the underlying linguistic patterns and can then be used
in further text mining tasks.
General guidelines
The instructions in this section apply to all entity and event annotations.
1. Every annotation must be a continuous stretch of words.
(a) Last, the SerC (PdxF) enzyme uses 4PHT as a substrate in the reverse
transamination reaction (13).
As shown in (a), SerC (PdxF) enzyme is considered to be a single GP entity.
2. Prepositions and determiners are excluded from annotation.
(b) Substitution of pyridoxal 5'-phosphate in D-serine dehydratase from Escherichia coli
by cofactor analogues provides information on cofactor binding and catalysis.
In (b), we first consider D-serine dehydratase from Escherichia coli as a potential GP entity.
Because prepositions must not be included, however, we shorten the candidate to D-serine
dehydratase.
3. Pronouns must be excluded. A pronoun such as “which”, “it”, or “they” is not annotated, even
though it could be resolved to a proper noun.
(c) Pyridoxine 5'-phosphate (PNP) synthase is the key enzyme in the pdx group. It
catalyses a multistep ring closure reaction yielding PNP and inorganic phosphate (Pi).
In example (c), It refers to Pyridoxine 5'-phosphate (PNP) synthase but is not annotated.
4. Special characters (e.g., quote, dash, or parenthesis) should not be at the beginning or
end of an annotation.
(d) L-Glutamine:D-fructose-6-phosphate amidotransferase ('glucosamine synthase', EC
5.3.1.19) from Escherichia coli MRE 600 was purified at least 75-fold.
As shown in (d), the parenthesis and single quotes are excluded from glucosamine synthase.
5. People’s names should be excluded from annotation.
(e) This enzyme is referred to as the Wood and Gunsalus L-threonine deaminase.
6. Always apply the most specific type from the hierarchy that is applicable.
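Rules 2 to 4 above are mechanical enough to be checked automatically on a candidate span. The
following is a minimal sketch under stated assumptions: the function name check_span and the
word lists are hypothetical and deliberately short, determiners are only checked at the start
of the span, and rules 5 and 6 are left to the annotator. Rule 1 holds by construction when a
span is stored as a single string.

```python
# Illustrative word lists; these are assumptions, not part of the guideline.
PRONOUNS = {"which", "it", "they"}
DETERMINERS = {"a", "an", "the", "this", "that", "these", "those"}
PREPOSITIONS = {"of", "from", "in", "by", "to", "with", "on", "at", "for"}
SPECIAL = set("'\"()[]-\u2018\u2019\u201c\u201d")


def check_span(span):
    """Return the general-guideline rules that a candidate span appears to break."""
    words = span.split()
    if not words:
        return ["empty span"]
    lowered = [word.lower() for word in words]
    problems = []
    if lowered[0] in DETERMINERS:
        problems.append("rule 2: starts with a determiner")
    if any(word in PREPOSITIONS for word in lowered):
        problems.append("rule 2: contains a preposition")
    if len(words) == 1 and lowered[0] in PRONOUNS:
        problems.append("rule 3: pronoun")
    if span[0] in SPECIAL or span[-1] in SPECIAL:
        problems.append("rule 4: special character at the boundary")
    return problems
```

Applied to example (b), the full candidate D-serine dehydratase from Escherichia coli is flagged
for its preposition, while the shortened D-serine dehydratase passes.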
Type-specific guidelines
The next sections describe the annotation of individual types, namely the entity guidelines and
the event guidelines.
We must emphasize that it is crucial to annotate every entity (GP and Metabolite), whether or
not it corresponds to an event.
Where two or more entity names share the head of a phrase, annotate them as one entity.
(a) In the present work, we provide in vivo evidence that gadC is co-transcribed with
gadB and that the functional glutamic acid-dependent system requires the activities of
both GadA/B and GadC.
(b) The ratio of the valine- and isoleucine-alpha-ketoglutarate activities did not change
significantly during purification.
In example (a), GadA/B has A and B sharing the head Gad, so it can be expanded into GadA and
GadB. Similarly, in example (b), valine- and isoleucine-alpha-ketoglutarate can be
expanded into valine-alpha-ketoglutarate and isoleucine-alpha-ketoglutarate.
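The guideline annotates such a mention as a single entity. The sketch below only illustrates
how the shared head is distributed when the mention is expanded, for instance during later
normalization. Both function names are hypothetical, and the patterns cover only the two shapes
seen in examples (a) and (b).

```python
import re


def expand_slash(mention):
    """'GadA/B' -> ['GadA', 'GadB']: each later part replaces the tail of the first."""
    first, *rest = mention.split("/")
    expanded = [first]
    for part in rest:
        if len(part) >= len(first):
            # Nothing to share; keep the part as written.
            expanded.append(part)
        else:
            # Assumes the part replaces an equally long suffix of the first name.
            expanded.append(first[: len(first) - len(part)] + part)
    return expanded


def expand_hyphen_coordination(mention):
    """'valine- and isoleucine-alpha-ketoglutarate' -> the two full names."""
    match = re.fullmatch(r"(\S+)- and ([^-\s]+)-(\S+)", mention)
    if match is None:
        return [mention]
    left, right, head = match.groups()
    return [f"{left}-{head}", f"{right}-{head}"]
```

Mentions that fit neither shape are returned unchanged, so the functions are safe to call on
any annotated span.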
Gene and protein (GP) entities specific guidelines
1. IDs from reference databases (e.g., NCBI and EBI) are annotated as GP.
2. Common names, gene symbols, and EC numbers are annotated as GP.
(a) The fourth step is catalyzed by 4-hydroxythreonine-4-phosphate dehydrogenase
(PdxA, E.C. 1.1.1.262), which converts 4-hydroxy-l-threonine phosphate (HTP) to
3-amino-2-oxopropyl phosphate.
In example (a), 4-hydroxythreonine-4-phosphate dehydrogenase, PdxA, and E.C. 1.1.1.262 are
annotated as GP since they are a common name, a gene symbol, and an EC number, respectively.
3. A prefix or suffix that adds biological meaning to a GP is considered part of the GP.
(b) The accumulation of 2-ketoisovalerate in ilvE leu double mutants was shown to
interfere with 2-KIC amination by the tyrB-encoded transaminase and also by the
aspC- and avtA-encoded transaminases.
(c) The FolB protein shows 30% identity to the paralogous dihydroneopterin-triphosphate
epimerase, which is specified by the folX gene located at 2427 kilobases on the E.
coli chromosome.
(d) The role of intersubunit side chain-side chain interactions in the stability of the
Escherichia coli aspartate aminotransferase (eAATase) homodimer was investigated
by directed mutagenesis at 10 different interface contacts.
(e) Evidence was obtained for two monocistronic gltA transcripts extending anticlockwise, to a common terminus, from independent promoters with start points 196
bp (major) and 299 bp (minor) upstream of the gltA coding region.
(f) In all conditions tested, this regulation required a functional narL gene product.
(g) Each subunit (361 residues) of the PSAT homodimer is composed of a large
pyridoxal-5'-phosphate binding domain (residues 16-268).
As shown in examples (b)-(g) above, words such as enzyme, protein, gene, cluster, family,
homodimer, coding region, transcripts, and gene product are included as part of the GP.
Mentions of species and compartments are also included (e.g., E. coli and cytoplasmic).
4. A GP mention must be resolvable without any additional or external information.
(h) A three-dimensional structural comparison to four other vitamin B6-dependent
enzymes reveals that three alpha-helices of the large domain, as well as an
N-terminal domain (subgroup II) or subdomain (subgroup I) are absent in PSAT.
(i) Two polytopic membrane proteins, NarK and NarU, are assumed to transport nitrite
out of the Escherichia coli cytoplasm.
Examples (h) and (i) contain phrases that could potentially be GP: vitamin B6-dependent enzymes
in (h) and Two polytopic membrane proteins in (i). Neither is annotated as GP, since it is
impossible to identify an exact enzyme or protein without using additional context around them.
5. Words that do not carry significant biological meaning should not be included.
(j) Purified transaminase B catalyzed transamination.
In example (j), Purified does not add anything biologically significant to transaminase B, so
it is omitted.
6. Amino acid residues and functional groups are not annotated.
(k) The cofactor is bound through an aldimine linkage to Lys198 in the active site.
In the example above, Lys198 is a lysine residue, so it is omitted.
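Of the GP mention types listed in guideline 2, EC numbers have a regular enough shape to be
located with a pattern. The following is a minimal sketch: the name find_ec_numbers and the
regular expression are assumptions that cover only the two spellings seen in this document
(E.C. 1.1.1.262 and EC 5.3.1.19), not every form found in the literature.

```python
import re

# "EC" or "E.C." followed by four dot-separated numbers. Whitespace between the
# prefix and the number may include a line break, as in the wrapped examples.
EC_NUMBER = re.compile(r"\bE\.?C\.?\s*\d+\.\d+\.\d+\.\d+")


def find_ec_numbers(text):
    """Return EC number mentions with internal whitespace normalized to one space."""
    return [re.sub(r"\s+", " ", m.group(0)) for m in EC_NUMBER.finditer(text)]
```

The matches are candidates only; each still has to satisfy the general guidelines before it is
annotated as GP.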
Metabolite entities specific guidelines
1. Oxygen, CO2, NADH and its variants, and ATP and its variants are considered Metabolite
entities.
2. As with GP entities, prefixes and suffixes that add meaningful description are included as
part of the Metabolite entity.
(a) The enzyme activity is dependent on the presence of a divalent magnesium ion…
(b) A Zn2+ ion is bound within each active site,…
(c) It is suggested that anticapsin behaves as a glutamine analogue…
As shown in (a)-(c), ion and analogue are counted as part of the Metabolite entity.
3. Cofactors are annotated as Metabolite entities.
(d) Either NADP+ or NAD+ function as cofactors, whereas the free alcohol
4-hydroxy-L-threonine is not a substrate for the reaction.
In example (d), NADP+ and NAD+ are annotated as Metabolite.
4. An amino acid that does not have catalytic activity is considered a Metabolite.
(e) It is suggested that anticapsin behaves as a glutamine analogue and that a reaction
of its epoxide group with a thiol group of glucosamine synthase results in its linkage to
the enzyme by a covalent bond.
In example (e), the glutamine analogue is annotated as Metabolite.
5. Mentions that are too ambiguous are not annotated.
(f) The reductoisomerase is able to catalyze the reduction of ketopantoate to produce
pantoate (the intermediate in coenzyme A biosynthesis) which again requires that the
reduction half-reaction produce a 2-(R)-hydroxy acid (Primerano & Burns, 1983).
In example (f), 2-(R)-hydroxy acid can refer to many types of metabolites. In such a case,
2-(R)-hydroxy acid is not annotated as Metabolite.
Events specific guidelines
In this section, we explain the guidelines for event annotation.
1. An event’s trigger word should be one word long.
(a) … the amination of 2-ketoisocaproate (2-KIC) to form leucine…
(b) … divalent metal ion-dependent oxidative decarboxylation of a b-hydroxy acid
substrate…
In examples (a) and (b), amination and decarboxylation are metabolic events of type
Metabolic consumption.
2. Try to annotate the event using the most specific type first. For example, Metabolic
consumption and Metabolic production are preferred to Metabolic reaction.
However, in some cases the text has only a preposition (e.g., from, in), which cannot be
used as an event trigger (a technical limitation). In this case, we use Metabolic reaction,
as seen in Figure 5.
Figure 5 Annotation example of metabolic reaction for specific case.
A metabolic event can act on multiple entities. There are two cases: separate events and a
combined event. In the separate case shown in Figure 6, two events occur, a biosynthesis of
isoleucine and a biosynthesis of valine, and a Metabolic production event is assigned to each.
Figure 6 Assigning two Metabolic production events to biosynthesis, with isoleucine and valine as their respective Themes
However, there are cases where the text explicitly states that both events occur in the same
reaction at the same time. In that case, the events are combined, as illustrated in Figure 7.
In this particular case, deoxyxylulose 5-phosphate and 4-phosphohydroxy-L-threonine
are used together in the same reaction.
Figure 7 Metabolic consumption event with two Themes, deoxyxylulose 5-phosphate and 4-phosphohydroxy-L-threonine
3. Only annotate events that actually occur.
(c) This strain did not form aminoacetone from threonine, but it slowly degraded
threonine.
In example (c), form and degraded are not annotated as events, since the sentence states that
the first reaction did not occur (did not form) and that the second occurred only slowly.
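The difference between the separate case (Figure 6) and the combined case (Figure 7) under
guideline 2 comes down to how many events one trigger yields. The sketch below shows one way to
express this. The function build_events and its dictionary layout are hypothetical, and the
trigger word used for the combined case is a placeholder, since the sentence behind Figure 7 is
not reproduced in the text.

```python
def build_events(event_type, trigger, themes, combined=False):
    """Return the event records for one trigger word acting on several themes."""
    if combined:
        # The text says all themes take part in the same reaction (Figure 7 case).
        return [{"type": event_type, "trigger": trigger, "themes": list(themes)}]
    # Otherwise each theme gets its own event (Figure 6 case).
    return [{"type": event_type, "trigger": trigger, "themes": [t]} for t in themes]


# Figure 6: two Metabolic production events share the trigger "biosynthesis".
separate = build_events("Metabolic production", "biosynthesis", ["isoleucine", "valine"])

# Figure 7: one Metabolic consumption event with two Themes.
# "used" is a placeholder trigger, not taken from the figure.
joint = build_events(
    "Metabolic consumption",
    "used",
    ["deoxyxylulose 5-phosphate", "4-phosphohydroxy-L-threonine"],
    combined=True,
)
```

Whether to set combined is the annotator's decision, based on whether the text explicitly
places the entities in the same reaction.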
Metabolic production
Typically, we want to annotate all text that states the synthesis of a metabolite. Most of the
time, synthesis and biosynthesis are counted as Metabolic production events. Figure 8 shows an
event of pyridoxine production.
Figure 8 Example of Metabolic production event
Metabolic consumption
1. Verbs or nominalized verbs that indicate the usage of a metabolite are considered
Metabolic consumption events.
(a) for a third enzyme which can utilize only L-serine
(b) confined to the B6 vitamer salvage pathway
In example (a), utilize is annotated as Metabolic consumption with L-serine as its Theme
argument. Catabolism, metabolism, utilize, and salvage are counted as Metabolic consumption
events.
2. A metabolite being bound to an enzyme does not count as a Metabolic consumption event, since
binding alone does not specify whether any catalytic activity is involved.
(c) The x-ray structure of PdxA bound to Zn2+ as well as the HTP presented here…
As shown in (c), PdxA being bound to Zn2+ is not annotated as Metabolic consumption.
Metabolic reaction
In particular cases where it is unclear whether a metabolite is used or synthesized (for
example, transfer of a functional group), we recommend using Metabolic reaction as a general
assignment. As seen in Figure 9, Transamination is assigned as Metabolic reaction.
Figure 9 Example of specific case of metabolic reaction event
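The trigger words named so far in this guideline can be collected into a small lookup table for
suggesting an event type. This is a sketch only: the names TRIGGER_TYPES and suggest_event_type
are hypothetical, the table holds just the words mentioned in this document, and the result is
a suggestion that the annotator still has to confirm against the sentence.

```python
# Trigger words and event types as named in this guideline.
TRIGGER_TYPES = {
    "synthesis": "Metabolic production",
    "biosynthesis": "Metabolic production",
    "catabolism": "Metabolic consumption",
    "metabolism": "Metabolic consumption",
    "utilize": "Metabolic consumption",
    "salvage": "Metabolic consumption",
    "amination": "Metabolic consumption",
    "decarboxylation": "Metabolic consumption",
    "transamination": "Metabolic reaction",
}


def suggest_event_type(trigger):
    """Return the suggested event type for a trigger word, or None if unknown."""
    return TRIGGER_TYPES.get(trigger.lower())
```

An exact-match lookup is used on purpose, so that amination and transamination stay distinct.
Inflected forms such as utilizes would need lemmatization first, which is not shown here.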
Positive regulation
Positive regulation is an event that represents the catalytic relation between an enzyme and a
metabolic event.
1. If possible, use Positive regulation in place of a Cause argument. In Figure 10,
catalyzes is annotated as Positive regulation. However, if there is no verb or nominalized
verb to annotate as Positive regulation, use the Cause argument instead, as seen in Figure 11.
Figure 10 Example of Positive regulation event
Figure 11 Using a Cause argument instead of a Positive regulation event.
2. One Positive regulation covers one enzyme and one event. Separate the annotations
according to the number of events and enzymes, as shown in Figures 12 and 13,
respectively.
Figure 12 Isocitrate dehydrogenase catalyzing two separate events.
Figure 13 Amination reactions catalyzed by two separate enzymes.
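Guideline 2 can be read as producing one Positive regulation per enzyme and event pair. The
sketch below follows that reading. The function name and record layout are hypothetical, the
event identifiers E1 and E2 are placeholders, and pairing every enzyme with every event is an
assumption that holds for the one-to-many cases in Figures 12 and 13 but may not suit every
sentence.

```python
from itertools import product


def positive_regulations(trigger, enzymes, events):
    """Create one Positive regulation record per (enzyme, event) pair."""
    return [
        {
            "type": "Positive regulation",
            "trigger": trigger,
            "Cause": enzyme,
            "Theme": event,
        }
        for enzyme, event in product(enzymes, events)
    ]


# Figure 12 case: one enzyme catalyzing two separate events (placeholder IDs).
regs = positive_regulations("catalyzes", ["Isocitrate dehydrogenase"], ["E1", "E2"])
```

With one enzyme and two events this yields two records that share the same Cause and differ
only in their Theme.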
Appendix
In this annotation process, we look for metabolic events in text together with their associated
genes, proteins, or metabolites.
To make the annotation process as consistent as possible, we recommend following this
procedure.
1. Pre-process the text with BANNER to get a candidate list of GP and Metabolite entities.
2. For each sentence in the text, read carefully and correct the GP and Metabolite
entities from BANNER.
3. Locate verbs and nominalized verbs and mark them with the appropriate events.
4. Determine the entities that are associated with events within the same sentence.
We recommend using BRAT (https://github.com/nlplab/brat) for annotation process. A
configuration file for BRAT could be downloaded from
http://www.sbi.kmutt.ac.th/~preecha/metrecon/.
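BRAT stores annotations in a tab-separated standoff format, with T lines for text-bound
annotations and E lines for events. The sketch below reads those two line kinds. It is an
illustration, not the project's tooling: parse_ann is a hypothetical helper, the type labels
Metabolic_consumption and Metabolite are assumed spellings that may differ from the downloadable
configuration, and the sample annotates the utilize example from the Metabolic consumption
section.

```python
def parse_ann(lines):
    """Parse text-bound (T) and event (E) lines of BRAT standoff annotation."""
    entities, events = {}, {}
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("T"):
            ann_id, type_and_span, text = line.split("\t")
            etype, offsets = type_and_span.split(" ", 1)
            # A ";" in the offsets marks a discontinuous span, which the
            # general guidelines (rule 1) do not allow.
            if ";" in offsets:
                raise ValueError(f"{ann_id}: discontinuous span")
            start, end = map(int, offsets.split())
            entities[ann_id] = (etype, start, end, text)
        elif line.startswith("E"):
            ann_id, body = line.split("\t")
            # First pair is TYPE:TRIGGER_ID, the rest are ROLE:ARGUMENT_ID.
            (etype, trigger_id), *args = [part.split(":") for part in body.split()]
            events[ann_id] = (etype, trigger_id, [tuple(a) for a in args])
    return entities, events


TEXT = "for a third enzyme which can utilize only L-serine"
ANN = [
    "T1\tMetabolic_consumption 29 36\tutilize",
    "T2\tMetabolite 42 50\tL-serine",
    "E1\tMetabolic_consumption:T1 Theme:T2",
]
entities, events = parse_ann(ANN)
```

The offsets are character positions in the source text, so TEXT[29:36] is utilize and
TEXT[42:50] is L-serine.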
Version history
1.01 – Add license.
1.0 – First draft.