Automatic functional annotation
of predicted active sites:
combining PDB and literature mining
Kevin Nagel
Wolfson College
A dissertation submitted to the University of Cambridge
for the degree of Doctor of Philosophy
European Molecular Biology Laboratory,
European Bioinformatics Institute,
Wellcome Trust Genome Campus, Hinxton,
Cambridge CB10 1SD, United Kingdom.
Email: [email protected]
January 2009
Declaration
This dissertation is the result of my own work, and includes nothing which is the outcome
of work done in collaboration, except where specifically indicated in the text. The dissertation does not exceed the specified length limit of 300 pages as defined by the Biology
Degree Committee. This thesis has been typeset in 12pt font using LaTeX 2ε according
to the specifications defined by the Board of Graduate Studies and the Biology Degree
Committee.
Summary
Kevin Nagel
European Bioinformatics Institute
University of Cambridge
Dissertation title: Automatic functional annotation of predicted active sites:
combining PDB and literature mining.
Proteins are essential to cell function, and their functions are mainly identified in biological experiments.
The structural models for proteins help to explain their function, but are not direct
evidence for their function. Nonetheless, we can mine structural databases, such as Protein
Data Bank (PDB), to filter out shared structural components that are meaningful with
regards to the protein function.
This thesis applied mining techniques to PDB to identify evolutionarily conserved structural patterns, e.g. active sites. This analysis retrieved 3- and 4-bodies with assumed two- and three-way residue interactions that have been selected from a distribution analysis of
residue triplets. A subset of the mined patterns is assumed to represent an active site,
which should be confirmed by annotations gathered by automatic literature analysis.
Literature analysis for the functional annotation of proteins relies on the extraction
of GO terms from the context of a protein mention. The annotation of protein residues
requires the identification of chemical functions, which could be found in the context
of residue mentions. MEDLINE abstracts have been processed to identify protein mentions in combination with species and residues (F1-measure 0.52; the F1-measure is a
statistical measure of a test’s accuracy based on the precision and recall of a test). The
identified protein-species-residue triplets have been validated and benchmarked against
reference data resources. Then, contextual features were extracted through shallow and
deep parsing and the features have been classified into predefined categories (F1-measure
ranges from 0.15 to 0.67). Furthermore, the feature sets have been aligned with annotation types in UniProtKB to assess the relevance of the annotations for ongoing curation
projects. Altogether, the annotations have been assessed automatically and manually
against reference data resources.
All of MEDLINE has been processed to filter out annotations for residues. A subset of
identified catalytic sites could be cross-validated against the Catalytic Site Atlas (CSA;
44 out of 221). 429 out of 512 protein residues from MSDsite were then annotated with
contextual data. Altogether, MEDLINE does not provide sufficient data to fully annotate
the content from PDB. Conversely, residue annotation is achieved with a different feature
set than provided from GO, and incomplete annotations in the reference datasets can be
filled from public literature.
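The F1-measure quoted above is the harmonic mean of precision and recall. As a minimal sketch (the function name `f1_measure` and the example counts are illustrative assumptions, not values from this thesis), it can be computed from the counts of true positives, false positives, and false negatives as follows:

```python
def f1_measure(tp: int, fp: int, fn: int) -> float:
    """Return the F1-measure (harmonic mean of precision and recall).

    tp, fp, fn are the counts of true positives, false positives,
    and false negatives of an extraction run.
    """
    if tp == 0:
        # No correct extractions: precision and recall are both zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only.
print(round(f1_measure(tp=50, fp=50, fn=50), 2))
```

Because the harmonic mean is dominated by the smaller of the two values, a system with high precision but low recall (or the reverse) still receives a low F1-measure.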
Acknowledgements
This thesis would not have been possible without the support, direction, and love of a multitude of people. First, I would like to thank my supervisor Dietrich Rebholz-Schuhmann
for his trust, encouragements, and for all his unconditional support and guidance. Dietrich has throughout given me opportunity and a sound research methodology. Working
with him I have learned the value of vision, and persistence in achieving it.
I am blessed to have had Tom Oldfield for my second supervisor. Ever since I was
interviewed by Tom, he has been inspiring, helpful and most of all patient. I will look back
fondly on our discussions, the "insights" into protein science he gave me, and the cheerful
and motivational chats. I am deeply indebted for his belief in me.
I would like to thank my thesis committee members, Michael Ashburner, Kim Henrick, and
Rob Russell, for their constructive comments and valuable criticism.
They all seemed to find time for me despite their busy schedules.
A special thank you must go to Kim Henrick; had he not encouraged me to pursue a
research position I would not be a scientist now.
I would also like to acknowledge Antonio Jimeno for his time, patience, and suggestions
and especially for reminding me to keep my focus always. But most of all I will remember
the great times we had cycling to and from work.
I would like to thank the past and present members of the Rebholz Group (Text
Mining). During my years of research, the group has expanded and I have had the chance
to learn from them as well as to have fun with them within the group.
I am also thankful to the European Molecular Biology Laboratory (EMBL) for the scholarship and the organised EMBL International PhD programme, throughout which I have
had the chance to meet many talented and cheerful PhD students from the EMBL/EBI
Hinxton.
A special thank you to Christina Granroth and Dagmar Harzheim, who have done the
proofreading of this thesis. Thank you, Dagmar, for helping me make clearer what I want to say.
Finally, I would like to acknowledge my wife Almut Nagel and my daughter Juli Nagel.
Without Almut I would have become a working maniac with no joy in life; she helped me
to maintain balance during my PhD research and also for the future. My special thanks
and love will go to Juli, aged one, from whom I have learned so much.
Contents
1 Introduction
  1.1 Proteins and functional sites
  1.2 Motivation
  1.3 Objective
  1.4 Related works
  1.5 Challenges
  1.6 Guide to remaining chapters
2 Background
  2.1 Protein related data resources
    2.1.1 Protein Data Bank
    2.1.2 Universal Protein Knowledge base
    2.1.3 Gene Ontology
    2.1.4 Biomedical literature
  2.2 Protein structure data mining
    2.2.1 Hypothesis-driven data analysis
    2.2.2 Discovery-driven data mining
  2.3 Biomedical literature mining
    2.3.1 Biological entity recognition
    2.3.2 Biological relation extraction
  2.4 Conclusion
3 Mining residue interactions as triads from PDB
  3.1 Algorithms
    3.1.1 Structural feature extraction
    3.1.2 Detection of significant configurations as interactions
    3.1.3 Grouping and selecting frequent configurations
  3.2 Analysing available non-redundant protein structure sets
  3.3 Evaluation methods
  3.4 Results
    3.4.1 Identification of residue interactions is dependent on data selection
    3.4.2 The interaction distance correlates with the distribution of residue triads
    3.4.3 Interaction classification is sensitive to the size of cross-validation
  3.5 Discussion
  3.6 Conclusion
4 Prediction of functions for mined residue triads
  4.1 Evaluation methods
  4.2 Results
    4.2.1 Identification of homologous metal binding sites
    4.2.2 Validation of convergent metal binding sites
    4.2.3 Recovering active sites and catalytic triads from the dataset
    4.2.4 Discovering the conserved serine residue in the catalytic triad (quartet)
  4.3 Discussion
  4.4 Conclusion
5 Identification of protein residues in MEDLINE
  5.1 Algorithms
    5.1.1 Protein and organism entity recognition
    5.1.2 Entity recognition of protein residue
    5.1.3 Association identification of the entity triplet organism, protein, and residue
  5.2 The construction of evaluation test corpora
  5.3 Evaluation methods
  5.4 Results
    5.4.1 Evaluation of organism, protein, and residue entity recognition
    5.4.2 Performance study on the entity triplet association
    5.4.3 Cross-validation of identified residues with UniProtKB
    5.4.4 Identified residues in MEDLINE for Uniprot/PDB proteins
  5.5 Discussion
  5.6 Conclusion
6 Information extraction from the context of a residue in text
  6.1 Algorithms
    6.1.1 Extraction of contextual features
    6.1.2 Categorisation of contextual features
  6.2 Evaluation methods
  6.3 Results
    6.3.1 Contextual feature extraction evaluated
    6.3.2 Performance analysis of the classifiers
  6.4 Discussion
  6.5 Conclusion
7 Extraction of functional annotation for protein residues from MEDLINE
  7.1 Evaluation methods
  7.2 Results
    7.2.1 Evaluation of the developed functional annotation extraction system
    7.2.2 Studying mined functional annotations for the proteins p53 and Jak2
    7.2.3 Cross-validation of mined catalytic residues with CSA
    7.2.4 Annotation of protein residues in MSDsite
  7.3 Discussion
  7.4 Conclusion
8 Combining active site prediction with mined functional annotations
  8.1 Algorithms
    8.1.1 Combining protein structure data with literature data
  8.2 Evaluation methods
  8.3 Results
    8.3.1 Protein residue mapping between three data resources
    8.3.2 Rediscovery of active sites and catalytic residues
    8.3.3 Search for novel catalytic residues
    8.3.4 General correlation found between predicted functional sites and extract functional annotations.
  8.4 Discussion
  8.5 Conclusion
9 Conclusions and future work
  9.1 Summary of main contributions
  9.2 Limitations and future works
A Examples of errors in relation extraction.
B Examples of extracted functional annotations compared with UniProtKB
C Examples of extracted functional annotations for the protein p53
D Examples of extracted functional annotations for the protein Jak2
E Examples of extracted functional annotations of the category binding event
F Examples of extracted functional annotations of active site residues
G Glossary
List of Figures
1.1 The standard amino acids
1.2 Examples of functional sites in proteins
1.3 The protein universe and its knowledge representation
2.1 Data banks in the protein universe
2.2 Three hyperlinked protein data banks
2.3 Categories for protein sequence annotation UniProtKB
2.4 GO terms are not suitable for protein residue annotation
3.1 Overview of processes and evaluation methods of the developed 3D pattern identification system
3.2 Four classes of interactions within a 3-body
3.3 Non-redundant structure set for 3D pattern mining
3.4 Distribution analysis of extracted residue triplets
3.5 Comparison of extracted residue triplets based on their interaction type
3.6 The effect of varying the cross-validation sample size on significance testing of residue interaction
4.1 A metal binding site with the 3Cys pattern in OLDFIELD
4.2 A metal binding site with the Cys-2His pattern in OLDFIELD
4.3 A metal binding site with the 3Cys pattern in SCOP40
4.4 A metal binding site with the Cys-2His pattern in SCOP40
4.5 Re-discovery of the catalytic triad as Asp-His-Ser pattern in OLDFIELD
5.1 Overview of processes and evaluation methods for the developed protein residue identification system
5.2 Test corpora for information extraction evaluation
5.3 Identified protein residues in MEDLINE
5.4 Cross-validation of citations from identified protein residues with UniProtKB/PDB
6.1 Overview of processes and evaluation methods of the developed contextual feature extraction system
7.1 Performance evaluation of the functional annotation extraction system
7.2 Cross-validation of text mined catalytic residues with CSA
7.3 Cross-validation of text mined binding residues with MSDsite
8.1 Overview of processes and evaluation methods of combining the protein structure dataset and literature dataset
8.2 Lookup table for PDB/UniProtKB mapping
8.3 Overview of the combined datasets from protein structure data and biomedical literature data
List of Tables
3.1 Study on the effect of varying the interaction distance threshold in structure triangulation
4.1 Summary of extracted data at each protein structure data mining step
4.2 Identification of metal binding sites in OLDFIELD
4.3 Convergent metal binding sites identified in SCOP40
4.4 List of cross-validated active site residues
4.5 Extending the catalytic triad into 4-bodies
5.1 Regular expression patterns for the detection of residue mentions in text
5.2 Performance evaluation of residue entity recognition
5.3 Performance evaluation of protein entity recognition
5.4 Performance evaluation of organism entity recognition
5.5 Performance evaluation of residue-protein-organism entity association detection
5.6 Performance evaluation of protein-organism and protein-residue entity association detections
5.7 A specialised performance evaluation between GC and XC2.
6.1 Biological categories for the classification of protein residue related information
6.2 Category distribution in the text feature reference set
6.3 Evaluation of syntactical language parser performance
6.4 Performance analysis of the classifiers (confusion matrix)
6.5 Performance evaluation of the classifiers (precision, recall, F1 measure)
8.1 Extracted MEDLINE information on the catalytic residues in bovine chymotrypsinogen
8.2 Identified catalytic residues from MEDLINE extraction
8.3 Catalytic triad residues available from the mined functional annotations
8.4 Functional annotations of protein residues in predicted functional sites.
8.5 Homology-based transfer of extracted functional annotations for protein residues in the mined pattern data.
A.1 Examples of errors in the relation extraction for the detection of contextual features.
B.1 Comparison of extracted functional annotations from GC with UniProtKB.
C.1 Examples of literature mined annotations of protein residues in p53.
D.1 Examples of literature mined annotations of protein residues in Jak2.
E.1 Mined functional annotations of protein residues with information on binding events.
F.1 Identified catalytic triad residues from MEDLINE extraction.
Chapter 1
Introduction
1.1 Proteins and functional sites
The genomic information encodes the blueprint to build an organism. The decoding and
implementation of genetic information depend on the functions of the proteins. Each protein
is the result of transcribing a gene into mRNA, which is translated into a polypeptide.
Hence, a protein is a gene product. The elementary units of a protein are the 20 natural
standard amino acids, each with four invariant parts, a central alpha carbon (Cα; chiral in
all amino acids except glycine), an amine group (NH2), a carboxylic acid group (COOH), and
a hydrogen atom (H), plus a characteristic side chain (R). Apart from the invariant amine
and carboxylic acid groups, which give every amino acid the property of a zwitterion,
distinctive physicochemical properties are defined by the side chain group. Side chains can
be polar, acidic or basic, aromatic, bulky, or conformationally flexible, and can provide
cross-linking ability, hydrogen-bond capability, or chemical reactivity. Figure 1.1 lists
all the standard amino acids and their common classification on the basis of the nature of
the side chain group.
Amino Acid      3-Letter  1-Letter  Side-chain polarity
Alanine         Ala       A         nonpolar
Arginine        Arg       R         polar
Asparagine      Asn       N         polar
Aspartic acid   Asp       D         polar
Cysteine        Cys       C         nonpolar
Glutamic acid   Glu       E         polar
Glutamine       Gln       Q         polar
Glycine         Gly       G         nonpolar
Histidine       His       H         polar
Isoleucine      Ile       I         nonpolar
Leucine         Leu       L         nonpolar
Lysine          Lys       K         polar
Methionine      Met       M         nonpolar
Phenylalanine   Phe       F         nonpolar
Proline         Pro       P         nonpolar
Serine          Ser       S         polar
Threonine       Thr       T         polar
Tryptophan      Trp       W         nonpolar
Tyrosine        Tyr       Y         polar
Valine          Val       V         nonpolar
Figure 1.1: The standard amino acids. The trivial names, 3-letter and 1-letter abbreviations are listed
along with the physicochemical properties of their side chains.
During biosynthesis, ribosomes catalyse the polymerisation of amino acids through
condensation and form peptide bonds between the NH2 and COOH groups of two consecutive
amino acids. The backbone (main chain) of the resulting polypeptide is the repeating
sequence of NH2-C-CO-[NH-C-CO]n-NH-C-CO. This is the primary structure of a protein,
and it will fold spontaneously due to different interactions of its amino acid composition
with environmental factors, e.g. solvent, salt, chaperones. The most prominent formation
during the folding process is the hydrophobic core, which stabilises the protein structure.
Amino acids such as alanine, valine, leucine, isoleucine, phenylalanine, and methionine
are clustered in the interior of a protein, while charged or polar side chains are turned to
the solvent-exposed surface and interact with surrounding water molecules. Minimising
the exposure of hydrophobic side chains to water is the principal driving force of folding.
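The classification in Figure 1.1 lends itself to a simple lookup. The following sketch encodes the side-chain polarity column of the figure as two sets of 1-letter codes and reports the fraction of nonpolar residues in a sequence. The function name `nonpolar_fraction` and the example sequence are hypothetical and serve only as illustration; a real hydrophobicity analysis would use a graded scale rather than this binary split.

```python
# Side-chain polarity per 1-letter code, transcribed from Figure 1.1.
NONPOLAR = set("ACGILMFPWV")
POLAR = set("RNDEQHKSTY")


def nonpolar_fraction(sequence: str) -> float:
    """Return the fraction of residues with a nonpolar side chain."""
    seq = sequence.upper()
    unknown = set(seq) - NONPOLAR - POLAR
    if unknown:
        raise ValueError(f"not a standard amino acid code: {sorted(unknown)}")
    if not seq:
        return 0.0
    return sum(1 for aa in seq if aa in NONPOLAR) / len(seq)


# Hypothetical hexapeptide: Ala-Val-Leu-Lys-Asp-Glu.
print(nonpolar_fraction("AVLKDE"))
```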
The process of protein folding involves the formation of regular secondary structure
elements (SSE), such as alpha helix and beta strand, which are stabilised by intramolecular
hydrogen bonds and contacts between side chain atoms (van der Waals interaction). By
following a helical path, the carboxyl group of residue i and the amino group of residue i+4
of the main chain are aligned and stabilise the local structure by hydrogen-bond
formation. The side chains protrude from the helically coiled backbone and
define the surface of the helix. In contrast, beta strands are stabilised by hydrogen bonds
between distant regions of the peptide. Depending on the direction of the peptide regions,
two adjacent strands can be characterised as parallel or antiparallel. Because the backbone
adopts an almost fully extended conformation, the side chains of residues i and i+2 face
the same direction. A set of interacting strands is called a sheet. Within the process of
intramolecular stabilisation of the main chain, the regions between secondary structure
elements adopt a loosely defined conformation such as turns and random coils or loops.
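The two regularities described above (the i to i+4 hydrogen-bond pairing in an alpha helix, and the alternation of side-chain direction along a beta strand) can be written down directly. This is a minimal sketch that assumes residues are numbered consecutively; the function names are illustrative and not part of any method used in this thesis.

```python
def helix_hbond_pairs(first: int, last: int) -> list[tuple[int, int]]:
    """List the (i, i+4) main-chain hydrogen-bond pairs of a helix.

    The helix is assumed to span residues first..last inclusive, with the
    carboxyl group of residue i paired with the amino group of residue i+4.
    """
    return [(i, i + 4) for i in range(first, last - 3)]


def same_strand_face(i: int, j: int) -> bool:
    """Return True if residues i and j of one extended strand point the same way.

    In an extended backbone, side chains alternate between the two faces,
    so residues i and i+2 face the same direction.
    """
    return (i - j) % 2 == 0


print(helix_hbond_pairs(1, 8))
print(same_strand_face(10, 12), same_strand_face(10, 11))
```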
The attractive and repulsive forces (e.g. ionic or van der Waals interactions between
residues) among the SSEs balance each other during the folding process and lead to a
relatively stable and complex three-dimensional structure. Stabilisation of the conformation
may involve covalent bonding, e.g. disulphide bridges between two cysteine residues,
or the formation of metal-binding motifs. The spatial arrangement of sequentially proximate
or distant residues allows the generation of biochemical functional sites. Identifying
these and other novel biologically functional regions in a protein is one of the greatest
research interests in the protein bioinformatics community, because they explain phenological
data, e.g. cellular processes. Figure 1.2 lists some of the well-known functional
sites in various proteins, classified according to a categorisation scheme of my own design.
Finally, the formation of quaternary structure is the assembly of tertiary structures
within a multi-chain protein. In this respect, each polypeptide chain is regarded as an
individual functional unit (subunit or domain). Within the interfaces of the subunits,
a multi-domain based functional site can be formed, which is not present or functional
in the individual domains. For example, the proteins cAMP-dependent protein kinase
(PDBID:1rdq), hexokinase (PDBID:1bdq), or maltodextrin phosphorylase (PDBID:1l5w)
contain ligand binding sites consisting of more than one protein structure domain (A.
Kahraman, pers. comm.). The identification of these multi-domain functional sites is
another great challenge in protein bioinformatics. First, the prediction system has to
find the correct assembly of tertiary structures (a crystal structure of a protein does
not necessarily reflect the biological state of assembly). Second, the structure models
have to be adjusted (proteins are not rigid molecules and have flexible parts), and finally
co-factors, e.g. metal ions, have to be considered.
site
  1. evolutionary site
    1.1. conserved site
  2. functional site
    2.1. interaction site
      2.1.1. active site
        2.1.1.1. catalytic site / reactive site
          2.1.1.1.1. catalytic residue
          2.1.1.1.2. donor site
          2.1.1.1.3. acceptor site
        2.1.1.2. binding site / contact site / substrate binding site / ligand binding site / recognition site
          2.1.1.2.1. specificity residue / specific site
            2.1.1.2.1.1. high affinity binding site
            2.1.1.2.1.2. low affinity binding site
          2.1.1.2.2. peptide binding site
          2.1.1.2.3. protein binding / receptor site
            2.1.1.2.3.1. nf kappab site
            2.1.1.2.3.2. antibody binding site
            2.1.1.2.3.3. antigen binding site
            2.1.1.2.3.4. actin binding site
          2.1.1.2.4. sugar binding
          2.1.1.2.5. lipid binding
          2.1.1.2.6. nucleic acid binding
            2.1.1.2.6.1. atp binding site
          2.1.1.2.7. metal binding site
            2.1.1.2.7.1. calcium binding site / ca(2+) binding site
            2.1.1.2.7.2. copper site
      2.1.2. passive site / target site
        2.1.2.1. cleavage site / lesion site / processing site / proteolytic cleavage site
        2.1.2.2. PTM site
          2.1.2.2.1. phosphorylation site
            2.1.2.2.1.1. tyrosine phosphorylation site
          2.1.2.2.2. glycosylation site
          2.1.2.2.3. regulatory site
          2.1.2.2.4. inhibitory site
          2.1.2.2.5. activation site
    2.2. structural site
      2.2.1. hydrophobic site
        2.2.1.1. hydrophobic core
        2.2.1.2. hydrophobic patch
      2.2.2. n terminal site
      2.2.3. c terminal site
      2.2.4. transmembrane site
      2.2.5. intracellular site / cellular site
      2.2.6. extracellular site
      2.2.7. anionic site
      2.2.8. cationic site
      2.2.9. nucleation site
Figure 1.2: Examples of functional sites in proteins. An excerpt of a proposed classification scheme,
based on my own perspective of the biomolecular function of specific residue configurations in
protein structures.
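Because the scheme in Figure 1.2 is numbered hierarchically, the parent categories of any entry follow from its number alone. The sketch below stores a few entries of the figure in a flat dictionary and derives the path from the top-level category down to a given entry. The function name `category_path` and the choice of a flat dictionary are assumptions made for illustration; the thesis does not prescribe a data structure for the scheme.

```python
# A few entries of Figure 1.2, keyed by their hierarchical number.
SITES = {
    "2": "functional site",
    "2.1": "interaction site",
    "2.1.1": "active site",
    "2.1.1.1": "catalytic site / reactive site",
    "2.1.1.1.1": "catalytic residue",
    "2.1.1.2": "binding site",
    "2.1.1.2.7": "metal binding site",
}


def category_path(number: str) -> list[str]:
    """Return the category names from the top level down to `number`.

    A trailing dot, as printed in the figure, is tolerated.
    Raises KeyError if any level is missing from SITES.
    """
    parts = number.rstrip(".").split(".")
    prefixes = [".".join(parts[:k]) for k in range(1, len(parts) + 1)]
    return [SITES[p] for p in prefixes]


print(" > ".join(category_path("2.1.1.1.1.")))
```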
1.2 Motivation
The understanding of the biological function of proteins remains a central challenge in
biology. Our knowledge of the protein universe can be partitioned into at least three
knowledge spaces (cf. figure 1.3): protein sequence space, protein structure space, and
protein function space. Each space represents a specific view of proteins. For example, the
protein structure space contains information about the number of biological conformations
of protein structures (cf. figure 1.3, top panel), whereas the function space describes the
spectrum of protein function. Although the information in these spaces partially overlaps,
few data are available to explain their relationship. For example, site-directed
mutational analysis is often reported in the context of gain or loss of a protein function,
while the biological correlation between sequence and function is not understood. This is
because the mechanism of protein function is not explained by information within sequence
space. In contrast, structural data are more expressive than sequence data, because a
protein structure provides spatial context of residues. Proteins are physical entities and
as such, they perform interactions with other proteins or ligands. The shape of a protein,
or more precisely, the spatial configuration of a set of residues in a functional site, is
one explanation for protein function. While protein structure data mining is concerned
with the prediction of novel functional sites in proteins, a mined structural pattern by
itself carries no evidence of biological function. In contrast, the biomedical literature
reports a range of biological functions of protein residues without a structural context or
an explanation of the molecular mechanism (cf. figure 1.3, middle panel). The combination of information from
protein structure space and protein function space seems to be an obvious approach in
order to gain new knowledge on protein function.
Figure 1.3: The protein universe and its knowledge representation. Information on a protein can be
collected from at least three different knowledge domains: crystallography provides the spatial
coordinates of a protein, protein sequencing determines the linear composition of amino acids in a
protein, and biochemical experiments characterise the biological function (top panel). In principle,
protein function prediction can be based on information from any one of these knowledge spaces;
however, combining them can overcome some domain-specific limitations (middle panel).
1.3 Objective
This thesis aims to discover hypothetical functional sites from the Protein Data Bank (PDB)
and annotate them with functional information from biomedical literature. The main
idea is to combine the information from two currently detached data resources: protein
structure information from PDB, and functional annotations of residues from MEDLINE
(cf. figure 1.3, lower panel). More specifically, this research focuses on the prediction of
active sites by data mining recurrent spatial residue configurations (3D patterns) in
proteins. Contextual features of residues are extracted from biomedical literature to provide
functional annotations. The results from both datasets are then combined to verify
predicted functional sites with evidence of biological function. While existing approaches in
protein structure data mining and biomedical literature mining have been used to generate
data for each research domain, the combination of the datasets is a novel approach in
protein bioinformatics research.
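To make the notion of a recurrent spatial residue configuration concrete, the following sketch counts residue-type triplets whose members all lie within a distance cutoff of one another. It is a simplified illustration under stated assumptions, not the mining algorithm developed in this thesis: each residue is reduced to a single hypothetical coordinate, and the cutoff value, the function name `count_triplets`, and the input format are invented for the example.

```python
from collections import Counter
from itertools import combinations
from math import dist


def count_triplets(residues, cutoff):
    """Count residue-type triplets in mutual spatial proximity.

    residues: list of (residue_type, (x, y, z)) tuples, one point per residue
              (an assumption made for this sketch).
    cutoff:   maximum pairwise distance, in the same unit as the coordinates.
    Returns a Counter keyed by the alphabetically sorted residue types.
    """
    counts = Counter()
    for trio in combinations(residues, 3):
        # Keep the triplet only if all three pairwise distances are within the cutoff.
        if all(dist(a[1], b[1]) <= cutoff for a, b in combinations(trio, 2)):
            counts[tuple(sorted(r[0] for r in trio))] += 1
    return counts


# Hypothetical coordinates, for illustration only.
example = [
    ("Asp", (0.0, 0.0, 0.0)),
    ("His", (3.0, 0.0, 0.0)),
    ("Ser", (0.0, 3.0, 0.0)),
    ("Gly", (20.0, 20.0, 20.0)),
]
print(count_triplets(example, cutoff=5.0))
```

Enumerating all triplets is cubic in the number of residues, so this brute-force form is only suitable for small examples; counting the same triplet types across many structures is what would reveal recurrent patterns.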
1.4 Related works
To verify a predicted protein function with functional annotations extracted from biomedical literature, two different levels have to be considered: (1) the protein level, and (2) the residue
level (i.e. groups of residues forming a functional site).
The recent publication of [JGLRS08] is one example for case (1): The prediction of
protein function is based on the search for a conserved and connected subgraph (CCS) in
protein-protein interaction graphs, generated from several biological databases. Within
the set of CCS, all available functional annotations of a protein in a database are transferred to homologous proteins. The annotations consist of Gene Ontology (GO) terminologies and the transfer is the prediction of protein function. The verification of a predicted
function was done by identifying GO terms in abstract texts of the corresponding protein.
The approach of this thesis has some similarities to this report [JGLRS08], e.g. in
both approaches, results from data mining were verified by information extracted from
biomedical literature. However, there are crucial differences between the two that need
to be considered when assessing the result of this thesis. First, in contrast to the CCS
identification, the data mining part in this work does not aim to identify known patterns,
but to discover new structural features that may represent a novel functional site.
Secondly, in [JGLRS08] the prediction of protein function utilises terminologies of a well-developed public resource, the Gene Ontology, while the same resource is not suitable
for annotation of protein residues. This is because GO is designed to describe function
of genes and gene products. From a conceptual point of view, terminologies in GO describe a high level of biological function, while descriptions of residue function are at a
lower level. For example, descriptions of protein-protein interactions are found in the context of
metabolomics, signal-transduction or other cellular processes. In contrast, the function of
a protein residue can be explained in light of molecular interactions or chemical reaction
mechanisms. Finally, the distribution of information on biological function is expected to
be different in biomedical publications. Because protein function is conceptually a high
level of biological function, it is likely that abstract texts of biomedical articles contain
information on this level. Conversely, the interaction of protein residues is a detailed description of protein function, and key information is expected to be mentioned in the results
or discussion sections of full-text articles. To my knowledge, the most closely related
work in terms of functional annotation of protein residues (case (2)) is the system called
Mutation extraction and STRucture Annotation Pipeline (mSTRAP) [KCRB07]. The key
feature of mSTRAP is the visualisation of mutation annotations, which are projected onto
a structure of a protein of interest. The advantage of mSTRAP is that it allows the interpretation of impacts of
mutation in context of the protein structure. However, the prediction of functional sites
is done by visual analysis of the protein structure. The provided annotations are sets
of complete sentences extracted from MEDLINE, which means that the interpretation of
the information requires expert knowledge.
The system developed in this work differs from mSTRAP in that the extracted information is not used exclusively to annotate point mutations; other functional
descriptions of wild-type residues are also collected. Another distinction from mSTRAP is that
the mined information is represented in a so-called predicate-argument structure (PAS)
format: only relevant text segments that describe a biological
function or a biological context of a mentioned residue are extracted from sentences. The structured format allows, to
some extent, queries for specific information in the extracted annotation dataset.
In conclusion, only a few related works have been reported that describe an automated
system to verify a predicted protein function by using functional annotations extracted
from the literature. This work retains its originality, because it aims to find novel functional sites in proteins by mining the PDB, and by extracting functional annotations from
a wide range of biomedical literature data.
1.5 Challenges
Is it possible to identify a functional site, e.g. an active site, on the basis of mining
PDB and the literature, and then combine the information of both? We can expect
that a significant population of similarly arranged residues in a protein can be identified
from a non-redundant protein set, if this evolutionary conserved interaction provides a
functional or structural advantage. We can also expect that residues are mentioned in
conjunction with their corresponding protein, and that the biological role of a protein
residue is reported in context of gain or loss of function of the overall protein in biomedical
literature.
One task presented in this thesis is the identification of textual features as functional
annotation. The problem differs from other information extraction tasks, e.g. the annotation of proteins, because the target is to provide knowledge on the biological role of a
residue. For example, to extract protein-protein interactions from text, a list of protein
names is used, and the task is reduced to finding only associations between listed proteins.
In contrast, extracting a protein residue and its corresponding biological function
is difficult, because an adequate dictionary of terms is not available.
1.6 Guide to remaining chapters
Chapter 2 presents background knowledge that is important for this work. Four different data resources are reviewed and their limitations discussed in the context of this
thesis. Then follows an explanation of methods in the field of protein structure data
mining and biomedical literature mining. Some of the introduced methodologies are
reused in this work, while ideas and approaches of others were adopted to develop
task specific extraction systems.
Chapter 3 describes the developed protein structure data mining system for the identification of 3D patterns in PDB. Algorithms for the identification of conserved
spatial residue configurations are explained and the effects of algorithm-related and
data-related parameters are discussed.
Chapter 4 demonstrates the biological implication of the mined 3D patterns from chapter 3. Two examples of rediscovered functional sites in proteins are shown to justify
the presented data mining approach. The first biological validation is the identification of metal binding sites, while the second validation is the rediscovery of the catalytic
triad from the mined data.
Chapter 5 is the first of three text mining chapters in this thesis. It explains the developed protein residue identification system, which consists of two main modules:
biological entity recognition of residue, protein, and organism, and association detection of the entity triplet.
Chapter 6 describes the approach to detect contextual features of a mentioned residue in
text. An automatic method is introduced to assign semantic labels to the extracted
textual features.
Chapter 7 presents the third part of the three text mining chapters. Both text mining
modules from the previous chapters (protein residue identification, and contextual
feature extraction) are combined to form the functional annotation extraction system. The overall performance of this information extraction system is studied. The
validity of the extracted information as functional annotation is demonstrated by
manual analysis on two example proteins (p53 and Jak2), and by cross-validation
of identified catalytic or binding residues with two reference databases: CSA and
MSDsite.
Chapter 8 presents results on combining protein structure data with literature data.
The validity is studied by examining the correlation of predicted active site residues
with enzyme-related functional annotations.
Chapter 9 summarises the thesis and presents limitations and open questions for follow-up
research.
Chapter 2
Background
In the previous chapter, I have presented the motivation and objective of this thesis. The
purpose of this chapter is to familiarise the reader with relevant concepts in protein science,
data mining, and literature mining. The limitations of each reviewed data resource or
methodology are discussed in context of this research work.
2.1 Protein related data resources
Proteins are both building blocks of cellular structures and the major machinery in cells.
In order to perform their functions, proteins need to fold into their three-dimensional
structures and thereby form functional sites. The prediction of a structural pattern associated with a biological function is an important aspect in protein bioinformatics. To
interpret the multiple functions of proteins, annotations are linked with results from
bioinformatics analysis tools. In addition, data are collected from generic and specific
databases, from biological knowledge accumulated in the literature, and from genome-wide
experiments, such as transcriptomics and proteomics. One major goal is to
describe protein function within biological context by using a standardised hierarchical
classification scheme and controlled vocabulary.
The biological community has developed databases and functional annotation schemes
that are not only used to archive protein data, but also to describe protein function on
a molecular, cellular and phenotypical level. Figure 2.1 shows some of the most popular
and relevant databases in the field of protein bioinformatics. These protein-related data
resources are hyperlinked in order to foster bioinformatics research. Statistics for
three example databanks and their hyperlinked references are given in figure 2.2.
2.1.1 Protein Data Bank
The Protein Data Bank (PDB) is an archive of 3D structures of large biological molecules,
such as proteins and nucleic acids. Currently, PDB lists 43,099 proteins determined by
crystallography (version November 2008). Despite the large amount of structure data
available for a range of proteins, the information in the PDB has three significant limitations. First of all, the structure data have a low correlation with sequence data. In
comparison with UniProtKB (cf. section 2.1.2), the sequence space covered is much larger than the structure space covered by PDB. Therefore, the derived information
from PDB is only applicable to a limited set of proteins.
The second limitation is the coverage of annotation available for proteins. In the
PDB, there are some facilities to annotate proteins, for example the SITE record is used
to annotate protein residues that are part of active sites. However, annotations are not
mandatory and many sites are not updated, even when new evidence of the biological
functionality of these residues has been found. An automatically derived database called PDBSITE [IPGK05] stores the SITE record information and makes the search for these data
accessible. Another, rather predictive, database of functional sites in protein structures is
the MSDmotif [GH08], which provides information about ligands, sequence and structure
motifs, their relative position, and their neighbour environment. Another database of predicted functional sites is MSDtemplate [Old02], which contains small fragments generated
by data mining on a structurally unique protein set from PDB. Examples of biologically
relevant fragments were identified in this data collection, such as the catalytic triad and
Figure 2.1: Data banks in the protein universe. This figure shows my interpretation of how our knowledge about proteins can be categorised. A selection of the most relevant data resources and web services
are reproduced in this figure. UniProtKB = Universal Protein Knowledge base [WAB+ 06]; PIR = Protein
Information Resource [BGH+ 00]; PDB SELECT = representative list of PDB chain identifiers [HSSS92];
PISCES = Protein Sequence Culling Server [WD03]; UniqueProt = web-service to create representative
protein sequence sets [MR03]; MEROPS = the Peptidase Database [RMK+ 07]; CAZy = Carbohydrate-Active enZYmes [CCR+ 08]; TC-DB = Membrane Transport Protein Classification Database [STB06];
PMD = Protein Mutant Database [KON99]; Phospho.ELM = a database of S/T/Y phosphorylation sites
[DCG+ 04]; PROSITE = Database of protein domains, families and functional sites [HBB+ 08]; PRINTS
= Protein Motif Fingerprint Database [Att02]; BMC = BioMed Central [BMC08]; PMC = PubMed
Central [PMC08]; PDB = Protein Data Bank [BWF+ 00]; SCOP = Structural Classification of Proteins
[HMBC97]; CATH = Class, Architecture, Topology, Homologous superfamily - Protein structure classification [OMJ+ 97]; Relibase = database of protein-ligand complexes [HBGK03]; CSA = Catalytic Site
Atlas [PBT04]; MSDmotif = an integrated resource of protein structure motifs.
Figure 2.2: Three hyperlinked protein data banks. Illustrated is the size of three databanks, PDB,
UniProtKB, and MEDLINE, along with their cross-references. For example, the PDB contains in total
42,943 PDB identifiers (version November 2008) with cross-references to 42,085 out of 333,445 UniProt
identifiers, which in turn point to 10,466 biomedical journal articles (PMIDs). Notice that PDB
also holds a small number of primary citations for each record; however, these are mainly pointers to
crystallographic publications and provide few hints of the biological function of the protein or annotation
of functional sites.
various metal binding sites. The Catalytic Site Atlas (CSA) [PBT04] is another database
documenting active sites in 3D structures of enzymes. The data are either manually
curated or predicted, based on searches for homologous proteins.
Another serious limitation of PDB is its use for statistical analysis of structure data.
The PDB represents a redundant and biased snapshot of the protein universe. Redundancy is due to the fact that many highly similar structures or identical folds are deposited
in the database leading to an over-representation of some proteins. In the past, structure determination has been guided by hypothesis-driven experiments, short-listed target
proteins in the medical or commercial field, and by a preference for small proteins that are
methodologically tractable for crystallisation. Consequently, the fold space has not been fully explored
yet. Although techniques in protein crystallography are improving, there are still other
underrepresented proteins, e.g. membrane proteins or large proteins, which define the
boundaries of representativeness of the structure data.
While there is little we can do about exploring the complete ensemble of folds from
a bioinformatics point of view, the over-representation can be filtered. For example,
protein sequence based clustering [AGM+ 90] [AMS+ 97] is the principal method used to produce
the following datasets: PDB SELECT [HSSS92], PISCES [WD03], UniqueProt [MR03].
However, this approach is limited by the uncertain sequence-structure relation in the
so-called twilight zone, i.e. below 30 per cent sequence identity, proteins may or may not
have similar folds [Ros99]. Another critical issue with sequence based clustering is the
comparison of protein chain sequences rather than the alignment of segments defined by
protein domain boundaries.
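The filtering idea described above can be sketched as a greedy selection of representative chains. This is only an illustration under stated assumptions: the chain identifiers and sequences are invented, the `naive_identity` function is an ungapped stand-in for a real alignment-based identity, and the 30 per cent threshold is borrowed from the twilight-zone remark rather than from any of the cited tools.

```python
from typing import Dict, List


def naive_identity(a: str, b: str) -> float:
    """Fraction of identical positions over the shorter sequence (no gaps).

    Assumption: a stand-in for an alignment-based sequence identity.
    """
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / n


def select_representatives(chains: Dict[str, str], threshold: float = 0.3) -> List[str]:
    """Keep a chain only if its identity to every chain kept so far is below threshold."""
    kept: List[str] = []
    # Visit longer chains first (ties broken by identifier) so the order is deterministic.
    for chain_id in sorted(chains, key=lambda c: (-len(chains[c]), c)):
        seq = chains[chain_id]
        if all(naive_identity(seq, chains[k]) < threshold for k in kept):
            kept.append(chain_id)
    return kept


# Hypothetical toy data: c2 duplicates c1, c3 differs by one residue, c4 is unrelated.
toy_chains = {
    "c1": "MKTAYIAKQR",
    "c2": "MKTAYIAKQR",
    "c3": "MKTAYIAKQL",
    "c4": "GSHWDEPLVC",
}
representatives = select_representatives(toy_chains)
print(representatives)
```

In this toy set only `c1` and `c4` survive, which mirrors how over-represented proteins are collapsed to a single representative before statistical analysis.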
Structure based approaches cluster the data on the basis of domain structures. Several
databases of domain-based structure clustering were created, with the most prominent
ranging from entirely manual work (SCOP [HMBC97]) and a semi-automatic approach (CATH
[OMJ+ 97]) to entirely unsupervised methods (FSSP-Dali [HS94]). Differences between these
classifications were studied by [HJ99] and [DBAD03].
2.1.2 Universal Protein Knowledge base
The major repository of protein sequence data is the Universal Protein Knowledge base
(UniProtKB). Along with the collection of sequence data is the listing of protein names
and synonyms, taxonomic data, citation references, and other manually curated information from literature survey. One important aspect of UniProtKB when evaluating
structure-function relationships is the annotation of protein residues. In the feature table
the biological function of a residue site is described along with several other key categories
(cf. figure 2.3). Currently, UniProtKB lists 333,445 entries with 2,088,573 site-specific
annotations (version from January 2008).
Despite the high quality data contained in UniProtKB, the process of extracting functional annotations from literature remains a laborious human expert curation work. The
curator surveys the biomedical literature, represents the experimentally determined functional information, and formulates the precise functional role by utilising standardised
semantic resources (cf. section 2.1.3). Despite the highly reliable quality of manual curation, this approach is evidently inefficient considering the amount of full-text publications
curators have to distil. According to Frishman, if we assume
”[...] that one needs on average roughly 30 min to assess published fact
and bioinformatics evidence for one protein, one thousand annotators would
have to work 1 year long, 8 h a day, to annotate all 5 million sequences that
are currently known. However, since the size of the protein database has been
consistently doubling every 18 months, the moving target of annotating all
proteins will never be achieved.” [Fri07]
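The arithmetic behind the quoted estimate can be checked in a few lines. The figures are taken from the quotation; the variable names are only illustrative.

```python
# Figures as quoted from [Fri07]
sequences = 5_000_000        # known sequences
minutes_per_protein = 30     # average assessment time per protein
annotators = 1_000
hours_per_day = 8

total_hours = sequences * minutes_per_protein / 60          # person-hours in total
days_per_annotator = total_hours / annotators / hours_per_day
print(total_hours, days_per_annotator)
```

The result is 2.5 million person-hours, or 312.5 eight-hour days per annotator, which is roughly one year of daily work and is consistent with the quoted estimate.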
Considering that the estimated total number of proteins is in excess of 10^10 [CK06],
an automatic or semi-automatic solution is needed to facilitate the laborious human expert work. Currently, methods for the automatic expansion of citation sets [YLPV07]
[HLC04] [LHC07] and the automatic annotation of protein function with GO terminologies [CSL+ 06] [GJYLRS08] [RSKA+ 07] are being developed in the field of text mining.
Key        Description
INIT_MET   Initiator methionine.
SIGNAL     Extent of a signal sequence (prepeptide).
PROPEP     Extent of a propeptide.
TRANSIT    Extent of a transit peptide (mitochondrion, chloroplast, thylakoid, cyanelle, peroxisome etc.).
CHAIN      Extent of a polypeptide chain in the mature protein.
PEPTIDE    Extent of a released active peptide.
TOPO_DOM   Topological domain.
TRANSMEM   Extent of a transmembrane region.
DOMAIN     Extent of a domain, which is defined as a specific combination of secondary structures organised into a characteristic three-dimensional structure of fold.
REPEAT     Extent of an internal sequence repetition.
CA_BIND    Extent of a calcium-binding region.
ZN_FING    Extent of a zinc finger region.
DNA_BIND   Extent of a DNA-binding region.
NP_BIND    Extent of a nucleotide phosphate-binding region.
REGION     Extent of a region of interest in the sequence.
COILED     Extent of a coiled-coil region.
MOTIF      Short (up to 20 amino acids) sequence motif of biological interest.
COMPBIAS   Extent of a compositionally biased region.
ACT_SITE   Amino acid(s) involved in the activity of an enzyme.
METAL      Binding site for a metal ion.
BINDING    Binding site for any chemical group (co-enzyme, prosthetic group, etc.).
SITE       Any interesting single amino-acid site on the sequence, that is not defined by another feature key. It can also apply to an amino acid bond which is represented by the positions of the two flanking amino acids.
NON_STD    Non-standard amino acid.
MOD_RES    Posttranslational modification of a residue.
LIPID      Covalent binding of a lipid moiety.
CARBOHYD   Glycosylation site.
DISULFID   Disulfide bond.
CROSSLNK   Posttranslationally formed amino acid bonds.
VAR_SEQ    Description of sequence variants produced by alternative splicing, alternative promoter usage, alternative initiation and ribosomal frameshifting.
VARIANT    Authors report that sequence variants exist.
MUTAGEN    Site which has been experimentally altered by mutagenesis.
CONFLICT   Different sources report differing sequences.
Figure 2.3: Categories for protein sequence annotation in UniProtKB. Key categories used to describe
regions or sites of interest in a protein sequence are listed. The key and the corresponding information
(value) are stored in the feature table (FT line) in UniProtKB. Along with the listed categories are their
definitions presented in this figure.
Clearly, the annotation for a whole protein cannot be transferred to residue site annotation, because different groups of residues in the protein structure have different functions.
In this respect, the biological community is missing an information extraction system for
the annotation of proteins at residue level.
2.1.3 Gene Ontology
The Gene Ontology (GO) [AL02] [GOC06] is one of the most widely used functional
classification schemes, including all of the most important criteria for annotations of biological data [PKS06]. Currently, the ontology lists a total of 26,302 terms with 15,643
biological process terms, 2,233 cellular component terms, and 8,426 molecular function
terms (version November 2008). The UniProtKB/InterPro group at the European Bioinformatics Institute (EBI) belongs to the Gene Ontology Consortium, and uses its standard
vocabulary for the annotation of protein function. The vocabulary is meant to describe
biological phenomenology of genes and gene products (proteins). This is the reason why
terminologies in GO are not suitable to describe the function and property of a protein
residue. Figure 2.4 lists some examples where the identification of GO terms [GJYLRS08]
did not find the more relevant keywords for the annotation of residues. At the moment,
an ontology dedicated solely to the functional annotation of protein residues has not been
developed. However, terminologies can in general be collected from other substantial resources, such as the Open Biomedical Ontologies [SAR+ 07], which contain, for example,
REX (an ontology of physico-chemical processes), and PSI-MOD (an ontology describing
protein chemical modifications).
2.1.4 Biomedical literature
Biomedical research tackles biological questions from a number of perspectives and the
published experimental data are always heterogeneous. The sum of descriptions of biological phenomena enables scientists to understand mechanisms in biology within various
Sentence: ”The catalytic mechanism of the non-phosphorylating glyceraldehyde-3-phosphate dehydrogenase and the other aldehyde dehydrogenases resembles a thioester mechanism involving the universally conserved cysteine 298 (pea GAPN).” (PMID:9461340)
Manual annotation: thioester mechanism, conserved cysteine
GO annotation: glyceraldehyde-3-phosphate dehydrogenase (NADP+) (phosphorylating activity), glyceraldehyde-3-phosphate biosynthesis, glyceraldehyde-3-phosphate catabolism, phosphoglycerate dehydrogenase activity

Sentence: ”However, mutations of a key residue, His48, show significant deviation from the relationship, implying a role for the side chain in protection of the complex from hydroxide attack.” (PMID:2690955)
Manual annotation: protection of the complex from hydroxide attack
GO annotation: AT DNA binding, tRNA, tyrosine tRNA ligase activity

Sentence: ”Second, this reactive cysteinyl residue, which is required for L-cysteine desulfurization activity, was identified as Cys325 by the specific alkylation of that residue and by site-directed mutagenesis experiments.” (PMID:81615929)
Manual annotation: L-cysteine desulfurization activity
GO annotation: pyridoxal biosynthesis, phosphate binding, mutagenesis, nitrogenase activity, L-alanine biosynthesis, pyridoxal phosphate binding
Figure 2.4: GO terms are not suitable for protein residue annotation. The presented examples demonstrate that predicted GO terms are not always suitable for protein residue annotation. The prediction of
GO terms was done with an information theory based parser [GJYLRS08].
contexts. This body of text has also been compared to an ”unstructured knowledge
database”, where information is present, but difficult to retrieve due to the complexity of
natural language. According to Sidhu,
”[...] it is generally acknowledged that only 20 per cent of biological knowledge and data is available in a structured format or a database. The remaining
80 per cent of biological information is hidden in the unstructured, free text
of scientific publications.” [SDC06]
In context of information extraction, the data to be extracted from an article are
words (keywords) regarding biological concepts that could summarise the key message
of the article. At first glance, abstract texts have a high density of keywords but a
low coverage of information, while full-texts cover a larger but more dispersed quantity of data
[FKY+ 01] [YHF+ 02] [SPIBA03] [SWS+ 04] [NBD+ 06].
Another key distinction between abstract texts and full-texts is the availability of
data resources. Biomedical abstract texts can be publicly downloaded from MEDLINE
without restriction, while full-texts from various journals are only available to
subscribers. Although some full-text articles are accessible through various initiatives
[BMC08] [Plo08] [PMC08], the extraction of information from a whole document is expected to be much more complex than from an abstract text. For example, a biological
feature of a residue may be expressed over several sentences, requiring a co-reference
resolution of the residue and the feature.
2.2 Protein structure data mining
Data mining is an analytic method to identify valid and novel patterns in data. A general
data mining solution does not exist. Instead human data mining expertise and human
domain expertise are required to solve each specific data mining problem. A data mining
process consists of the following main processes: data selection, feature extraction, and
correlation analysis.
In respect of protein structure data mining, data selection means the identification
of a non-redundant set of protein structures from PDB (cf. section 2.1.1). Although a
protein structure contains only geometrical information, it is important to distinguish
the types of structural features to be analysed. The following structural features can serve
as targets: the configuration of amino acids as Cα, the configuration of backbone
atoms, the spatial arrangement of chemical groups [JIDG03] [YEC+ 07] [Rus98] [SSR03]
[Old02], and the physicochemical environments [OCR01] [YEC+ 07]. In order to discover
new information from the data, a developed data mining algorithm must not contain any
biochemical knowledge. The target should be a mathematical model and not a biological
template.
2.2.1 Hypothesis-driven data analysis
”Within the field of bioinformatics research, the term data mining is used very loosely to
describe any type of data analysis” (T. Oldfield, pers. comm.). Hypothesis-driven data
analysis consists of defining a biological target (hypothesis), and searching for the target.
Consequently, the result of a hypothesis-driven data analysis is not the discovery of new
information.
A number of methods were published that predict a known protein function on the
basis of protein structure information. Initially, the research work focused on global fold
recognition [HS96] [WR97] [MB99] [KH04] [HPS+ 03] [AZP+ 05] to identify evolutionary
distant, but structurally conserved homologues. Once a match is found, functional annotations are transferred from the target to the query. Another more specific approach
focuses on the search for matching local substructures in the proteins. The rationale is
that a biological function can be mapped to a particular residue configuration in the
protein, which is independent in function from the global fold of the structure. One obvious
approach was to design structure templates, which contain all the essential residues
for a biological function. Several specific types of sites or motifs have been studied in
detail to capture metal binding sites [Glu91], the catalytic triad of the serine proteases
[FWLN94] [WBT97], and binding sites for anions such as sulphate and phosphate [Cha93]
[CB94]. Computer-assisted methods were subsequently developed to help experts to
design templates by analysing motifs over large sets of proteins corresponding to active
sites [APG+ 94] [Rus98] [SSR03] [Kle99] [FS98] [FGS98] [WBT97] [BT03] [PB06], surface
patches or clefts [Las95] [KJ94] [LEW98] [SPNW04] [BFL04] or structural binding site
locations [GPP+ 03] [KN03].
2.2.2 Discovery-driven data mining
The key feature of discovery-driven data mining is the search for common characteristics
(pattern) in the data, without providing any domain knowledge. More specifically, the
target is mathematically defined and the system aims to identify over-representations,
data variations, or singularities in the dataset. Hence discovery-driven data mining can
deliver novel information, although establishing the biological significance of the result is not trivial.
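A minimal sketch of such a mathematically defined target is the counting of residue-type triplets that lie close together in space, followed by a search for over-represented triplets across a structure set. The coordinates, structure names, and the 7.0 distance cutoff below are invented for illustration and are not the parameters used in this thesis; no biochemical knowledge enters the counting itself.

```python
from collections import Counter
from itertools import combinations
from math import dist
from typing import Dict, List, Set, Tuple

# (residue type, one representative coordinate per residue)
Residue = Tuple[str, Tuple[float, float, float]]


def close_triplets(residues: List[Residue], cutoff: float) -> Set[Tuple[str, ...]]:
    """Residue-type triplets whose three pairwise distances are all within cutoff."""
    found: Set[Tuple[str, ...]] = set()
    for a, b, c in combinations(residues, 3):
        if all(dist(p[1], q[1]) <= cutoff for p, q in ((a, b), (a, c), (b, c))):
            found.add(tuple(sorted((a[0], b[0], c[0]))))
    return found


def count_patterns(structures: Dict[str, List[Residue]], cutoff: float = 7.0) -> Counter:
    """Count in how many structures each residue-type triplet occurs."""
    counts: Counter = Counter()
    for residues in structures.values():
        counts.update(close_triplets(residues, cutoff))
    return counts


# Hypothetical toy structures
structures = {
    "s1": [("ASP", (0, 0, 0)), ("HIS", (3, 0, 0)), ("SER", (0, 4, 0)), ("GLY", (30, 30, 30))],
    "s2": [("SER", (10, 10, 10)), ("HIS", (12, 10, 10)), ("ASP", (10, 13, 10)), ("ALA", (50, 0, 0))],
    "s3": [("ALA", (0, 0, 0)), ("GLY", (2, 0, 0)), ("LEU", (0, 2, 0))],
}
pattern_counts = count_patterns(structures)
print(pattern_counts.most_common(1))
```

Here the triplet ASP/HIS/SER recurs in two of the three toy structures. Whether a recurrent triplet has any biological meaning is a separate question that the counting cannot answer.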
One important aspect in identifying residue interactions in protein structures is the
consideration of contextual information, such as interaction distance, chemical environment, and evolutionary conservation, in the data mining algorithm. The systems called
ET/MA [CFK+ 05] and ConSurf use evolutionary information in combination with structural and chemical data, in order to highlight regions of local structure with functional
importance. In contrast, the systems PINTS [Rus98] [SSR03] and SIDEMINE [Old02] find
patterns within the distribution of a non-redundant structure set, by using solely a mathematical model of interactions. One critical issue in the development of these data mining
methods was the improvement of the signal/noise ratio. In order to boost the signal frequency, two structural features are merged if one is biologically equivalent to the other.
While the analysis showed that the mined output contained biologically valid data, the
result actually incurs some bias, because biological knowledge was introduced.
2.3 Biomedical literature mining
Biomedical text mining extracts information from text for the integration into biological
databases. Due to the complexity of natural language, text processing involves structuring the text input by means of parsing and the annotation of some linguistic features,
e.g. part-of-speech tags. The majority of biological text analysis is concerned with the
extraction of explicitly stated facts from text, a task referred to as biological information extraction [Hob02]. Biomedical text mining processes typically consist of two main analysis
steps: biological entity recognition, and biological relation extraction.
The vast amount of published biomedical articles contains phenomenological data on
proteins, such as their molecular function. The information is encoded in unstructured
text, and mining it involves tasks of differing complexity. Extracting functional annotation
poses text mining challenges at several levels: the identification of mutations
[LHC07] [WK07] [BW05] [RSMA+ 04] [HLC04] or genetic sequences [MG03], identification
of gene or protein names [RSAG+ 08] [PJYLRS08] [TMA08] [Fuk98] and chemical entities
[CMR06], the extraction of annotation of molecular function [GJYLRS08] [RSKA+ 07]
[DS05] [KNT05] [GDAW03] [HNR+ 05], and the identification of semantic relations between the biological entities [BLK+ 08] [LCM03] [SB06].
2.3.1 Biological entity recognition
The process of entity recognition (ER) can be split into three parts: locating the mentioned entity in text, classifying the entity into a predefined category, and normalising
the entity by linking it to an entry in a database.
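The three parts can be sketched with a simple dictionary lookup. This is an illustration only: the lexicon, its categories, and the `EXAMPLE:` identifiers are placeholders, and a lookup of this kind does not resolve ambiguous names.

```python
import re
from typing import Dict, List, NamedTuple, Tuple


class Mention(NamedTuple):
    start: int        # location: character offset where the mention begins
    end: int          # location: character offset where the mention ends
    text: str         # surface form as it appears in the text
    category: str     # classification
    identifier: str   # normalisation: placeholder database identifier


# Hypothetical lexicon: surface form -> (category, placeholder identifier)
LEXICON: Dict[str, Tuple[str, str]] = {
    "hunchback": ("protein", "EXAMPLE:0001"),
    "rho-like protein": ("protein", "EXAMPLE:0002"),
    "drosophila": ("organism", "EXAMPLE:0003"),
}


def recognise(text: str) -> List[Mention]:
    """Locate lexicon terms in text, attach their category and identifier."""
    mentions = []
    for term, (category, identifier) in LEXICON.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            mentions.append(Mention(m.start(), m.end(), m.group(0), category, identifier))
    return sorted(mentions)  # ordered by position in the text


example = "The hunchback protein is expressed in Drosophila embryos."
for mention in recognise(example):
    print(mention)
```

A lookup like this would also tag the everyday English word "hunchback", which is exactly the kind of ambiguity that makes protein name recognition difficult.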
Biological entities are often ambiguous in terms of their boundaries and categories.
Probably the most challenging task is the correct identification of protein or gene names.
For example, ”hunchback” is a protein in Drosophila, while it is also a general English
term. Furthermore, protein names consist mostly of multiple words, e.g. ”Rho-like protein” or ”HIV-1 envelope glycoprotein gp120”. An ER system needs to identify all the
constituents of a protein name in order to relate the detected entity to its reference entry
in a database. The BioCreAtIvE challenge addressed this problem with the 1B subtask;
the target is the identification of protein/gene names in text, and the annotation of their
correct gene identifier. Various solutions were published ranging from rule-based methods [HFM+ 05] [TW02] [Fuk98] to machine learning approaches [CMP05]. The developed
methods are, in general, reusable for any other biological entity recognition or terminology
identification problem.
Work has also been published that focuses on the extraction of protein point mutations [RSMA+ 04] [HLC04] [BW05] [LHC07] [YLPV07], which are one category of protein
residue terminology. Other categories are residue sequences and residue interaction pairs.
The most widely adopted method to identify these terminologies is the design of regular
expression patterns.
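The regular expression approach can be illustrated with a minimal sketch. The patterns below are hypothetical and are not taken from any of the cited systems; they only cover the two common notations for a point mutation (one-letter form such as A123T, three-letter form such as Ala123Thr), and the name `find_point_mutations` is an assumption made for this example.

```python
import re

# Hypothetical patterns for illustration only; real systems use larger pattern sets.
AA1 = "ACDEFGHIKLMNPQRSTVWY"
AA3 = ("Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|"
       "Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val")

# Either <wild type><position><mutant> in one-letter or in three-letter code.
POINT_MUTATION = re.compile(
    rf"\b(?:([{AA1}])(\d+)([{AA1}])|({AA3})(\d+)({AA3}))\b"
)

def find_point_mutations(text):
    """Return (wild type, position, mutant) tuples for every match in text."""
    hits = []
    for m in POINT_MUTATION.finditer(text):
        # Groups 1-3 belong to the one-letter form, groups 4-6 to the three-letter form.
        wt, pos, mut = m.group(1, 2, 3) if m.group(1) else m.group(4, 5, 6)
        hits.append((wt, int(pos), mut))
    return hits

print(find_point_mutations("The D614G and Ala123Thr variants were inactive."))
```

A pattern of this kind also matches strings that are not mutations (for instance some cell line or strain names), which is why such expressions are normally combined with further filtering.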
2.3.2
Biological relation extraction
Relation extraction (RE) aims to find associations between entities, or between an entity and a terminology within a text phrase. One objective in biomedical information
extraction is the mining of biological facts from text. An example of a biological fact is
the semantic relation between two biological entities, such as protein-protein interaction
[TOT04].
Until now, three strategies have been investigated for biological relation extraction: the
co-occurrence based analysis [LC05] [SB05], pattern-based approach [HZH+ 04] [LCM03],
and machine learning based methods [BM05] [BM06]. The common limitation of all of
these extraction systems is that only the relation targets, e.g. the proteins within a protein-protein interaction, are extracted. Contextual information that would describe or explain the association of the entities is not considered in the extraction. Within the
information extraction community, a consensus has been reached that deeper analysis of
sentence structures is required in order to adequately acquire biomolecular relations from
text [WSC04].
In respect of biological relation extraction, two classes of syntactic parsers have been studied. The first is the shallow parsing technique, which aims to detect the main constituents
of a sentence without determining the complete syntactic structure. Results were published in which protein-protein interactions [KNT05] and general biological entity relations
[LCM03] were extracted based on shallow parsing. The second class of syntactic parser
is the full parser, which attempts a deep analysis of the syntactic structure of a sentence. Several systems have been reported [NED03] [FKY+ 01] that utilise full parsing
for relation extraction from biomedical literature. One interesting full parser is ENJU
[YMTT05] [MT05], a so-called head-driven phrase structure grammar (HPSG) parser,
which identifies the predicate-argument structure (PAS) of a sentence.
The use of PAS as a template for biomolecular relation extraction was first reported
in [TOT04] [YMTT05]. Recently, two proposition banks were reported that are designed
to capture relations in molecular biology: PASBio [WSC04] and BioProp [TCS+ 07].
Within this work, there are two types of semantic relations to be extracted. The
first is the residue-protein association. The system called MEMA [RSMA+ 04] uses a
word distance metric to associate a list of residue-protein pairs with the smallest word
distance. Another approach is to look up valid associations between a residue and a
protein in the context of a predetermined association of a protein and an organism. Three
systems have been reported, that adopt this approach: MuteXt [HLC04], MutationMiner
[BW05], and MutationGraB [LHC07].
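The word distance idea can be sketched as follows. This is not the MEMA implementation; it is a simplified illustration that assumes the residue and protein mentions have already been recognised as single tokens, and the function name and inputs are hypothetical.

```python
def associate_by_word_distance(tokens, residues, proteins):
    """Pair each residue mention with the protein mention closest in word distance.

    tokens   : list of words of one sentence
    residues : set of tokens already recognised as residue mentions (assumption)
    proteins : set of tokens already recognised as protein mentions (assumption)
    """
    res_pos = [i for i, t in enumerate(tokens) if t in residues]
    prot_pos = [i for i, t in enumerate(tokens) if t in proteins]
    pairs = []
    for r in res_pos:
        if not prot_pos:
            break  # no protein mention to associate with
        p = min(prot_pos, key=lambda j: abs(j - r))
        pairs.append((tokens[r], tokens[p], abs(p - r)))
    return pairs

sentence = "In lysozyme the residue Glu35 is catalytic whereas trypsin uses Ser195"
pairs = associate_by_word_distance(
    sentence.split(), {"Glu35", "Ser195"}, {"lysozyme", "trypsin"}
)
print(pairs)
```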
The other semantic relation to be extracted in this work is the association between
a residue entity and its description of function. The systems MuteXt [HLC04], MEMA
[RSMA+ 04], MutationMiner [BW05], and MutationGraB [LHC07] are all dedicated to
the extraction of point mutations, but provide no extraction of functional annotation. In
a recent publication [WK07], an ontological model was proposed that should hold information extracted from MutationMiner as well as point mutation annotations. However,
no results of feature extraction were provided, nor was an extraction strategy proposed.
2.4
Conclusion
In this chapter, I have reviewed some of the most relevant data resources and research
works in the field of protein structure data mining and text mining. Some of the data
resources are used in this thesis. In the following, I will present the extraction systems I
have developed during my PhD.
Chapter 3
Mining residue interactions as triads from PDB
In this chapter, I present a novel approach to mining 3D patterns from protein structures.
More specifically, a pattern is defined as the irreducible interaction of a chemical and
spatial configuration of residues. The goal is to identify new information from a non-redundant dataset on the basis of solely mathematical targets. The mined 3D
patterns represent predictions of functional sites in proteins.
3.1
Algorithms
The novelty of this presented 3D pattern mining approach is based on the classification of
residue triplets into one of four interaction classes. The idea of analysing side chain interactions within a residue triplet is based on the work of [Old02], while the classification of
residue interaction relies on the methodology developed by [JB04]. The developed data
mining method consists of three processing steps: structural feature extraction, detection
of significant configurations as interactions, and grouping and selection of frequent configurations. Figure 3.1 illustrates the procedures of the entire protein structure data mining
system developed in this thesis.
Figure 3.1: Overview of processes and evaluation methods of the developed 3D pattern identification
system.
3.1.1
Structural feature extraction
Theory
Residue triplet as spatial pattern unit. The presented protein structure data mining algorithm aims to identify significant interactions of residues within a triplet configuration. The rationale for analysing residue triplets is as follows. In order to
form a functional site in a protein structure, residues need to be in close physical contact. In other words, there exists a mutual dependency or interaction among the residues.
The interaction can be studied on a two-residue basis (doublet 3D pattern). However,
given the size of the structure data, the probability of any two-residue configuration is
too high for it to be detected as specific. Hence, the signal-to-noise ratio is the reason why
a two-residue 3D pattern is not the target of protein structure data mining [Old02].
A two residue contact is completely defined by a scalar property, while a three residue
contact is defined by vectors. Consequently, a three residue constellation encodes much
more information. This makes information theory based methods tractable to find conserved residue interactions as signals.
In reality, functional sites can be composed of more than three residues, e.g. various
metal binding sites use four coordinating cysteine residues. However, data sparseness
and the mathematical complexity [CL64] [Sin04] of modelling interactions of four or more residues make this infeasible. In principle, the more variables are introduced in modelling
residue interactions, the more specific the data mining. It should be noted that the identification of N-body interactions of residues can be approached combinatorially:
two triplets are combined if two out of the three residues of each
triplet are identical [Old02]. This approach was adopted in this study to demonstrate that larger interaction configurations are extractable. However, this investigation concentrates mainly
on the identification of three-residue interactions. The assumption is that if the output
of the data mining provides valid results, the approach is justified and more complex residue
configurations may inherit this property.
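The combinatorial rule for building larger configurations from triplets can be sketched as below. The sketch assumes that each residue is represented by a unique, hashable identifier and that a triplet holds three distinct residues; the function name is hypothetical.

```python
from itertools import combinations

def combine_triplets(triplets):
    """Merge every pair of triplets that shares exactly two residues into a 4-body."""
    four_bodies = set()
    for t1, t2 in combinations(triplets, 2):
        s1, s2 = frozenset(t1), frozenset(t2)
        if len(s1 & s2) == 2:
            # Two shared residues plus one unique residue from each triplet.
            four_bodies.add(s1 | s2)
    return four_bodies

# Residue identifiers are plain integers here for brevity.
print(combine_triplets([(1, 2, 3), (2, 3, 4), (5, 6, 7)]))
```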
Side chain interaction model. The determination of residue interactions requires a
transformation of a full atom model into a simpler representation. This is because the
mathematical model, that needs to describe all combinations of atom interactions of two
residues, would be too complex. The solution is to replace the all-atom structure model
with a coarse grained model, by reducing each residue to a single point. In principle,
a residue point can be calculated either by the centre of mass, or the geometric centre
(centroid). Each representation can be calculated from main chain atoms, main and side
chain atoms, or side chain atoms only.
The focus in this study is the side chain interactions within residue triplet configuration. For this reason, a protein structure is represented as a point spread of side chain
centroids.
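A side chain centroid can be computed as sketched below. The backbone atom names follow the usual PDB naming, hydrogens are assumed to be absent, and the treatment of glycine (which has no side chain heavy atoms) is an assumption of this sketch, since the text does not state how glycine is handled.

```python
BACKBONE = {"N", "CA", "C", "O", "OXT"}  # usual PDB names of main chain atoms

def side_chain_centroid(atoms):
    """Geometric centre of the side chain atoms of one residue.

    atoms: list of (atom_name, (x, y, z)) tuples.
    Returns None if the residue has no side chain atoms (e.g. glycine);
    this fallback is an assumption, not the thesis' documented behaviour.
    """
    coords = [xyz for name, xyz in atoms if name not in BACKBONE]
    if not coords:
        return None
    n = len(coords)
    return tuple(sum(c[axis] for c in coords) / n for axis in range(3))

serine = [("N", (0.0, 0.0, 0.0)), ("CA", (1.0, 0.0, 0.0)), ("C", (2.0, 0.0, 0.0)),
          ("O", (2.0, 1.0, 0.0)), ("CB", (1.0, 0.0, 0.0)), ("OG", (3.0, 2.0, 0.0))]
print(side_chain_centroid(serine))
```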
Protein structure triangulation. The extraction of residue triplets from a protein is
based on triangulation of structures. Here structures are triangulated on the basis of three
criteria. The first is the compositional constraint. Each residue in a triplet must be an
element of the 20 natural amino acids, while hetero atoms are excluded. One prominent
reason is that there are not many examples of residue-hetero atom interactions in the
dataset that would support a statistical analysis.
The second condition of triplet extraction requires that none of the residues are direct
neighbours in the protein sequence. The assumption made here is that covalently
bonded residues are more likely to be next to each
other in space than any two residues that are not bonded. Similarly, the probability of finding three connected residues in
space is higher than that of finding unconnected triplets of residues. Consequently, admitting sequence neighbours would over-represent such configurations in the distribution of interacting residues in space. The
definition of residue neighbourhood affects the data mining result, e.g. by requiring a
pair interaction in the triplet to have a sequence distance of more than one residue, patches of
residues at one side of a beta-sheet may not be discovered. While tuning this parameter
can modify the result of the data mining, the objective here is to discover new knowledge
from the input data set while providing as little biological information as possible.
The last criterion in triplet extraction is concerned with the geometrical property of a
triplet. The Euclidean distances between the residues must fulfil the triangle inequality,
while only two interaction distances of less than 6 Å were allowed. Although the interaction
distance threshold is based on an empirical study of a number of protein structures, this
value may not be adequate, because it would prefer close contacts of large side chains
of residue pairs. For example the pair interaction of two tryptophans may have a near
maximal allowed interaction distance of the centroids, while the distance of the contacting
atoms are actually very close. The alternative is to set up a threshold system for residue
pairs or triplets, which depends on the types of residues. Although this approach was not
studied in this thesis, future work could improve the developed algorithm. Yet another
approach in selecting residue interactions from a protein structure is based on the analysis
of surface contacts of the side chain groups. While not all functional sites require their
constituents to be in physicochemical contact (e.g. a metal binding site consists of metal
ion coordinating residues without physical contacts), a protein binding site is an example
where residues of two different proteins are in non-covalent interaction. However, the
presented data mining approach aims at an unbiased search for residue interactions in
a dataset of monomeric protein structure domains, and therefore a surface-based selection
criterion would bias the analysis.
Implementation
A coarse grained representation is used in this protein structure analysis. From a full atom
model of a protein structure, centroid positions of each protein residue were calculated
on the basis of their side chain atoms. The resulting simplified structure model is then
triangulated based on three criteria: (1) each residue in a triplet must be an element of
the 20 natural amino acids; (2) pairs of residues in the triplet must not have a sequential
relation in respect of their protein sequence position; and (3) only two pairs of residues
can have a maximal interaction distance of 6 Å, and only one pair an interaction
distance of less than 12 Å.
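The three criteria can be sketched as a brute-force enumeration over the centroids of one domain. This is a simplified illustration, not the implementation used in this work. Criterion (3) is read here as "at least two pair distances within the threshold and the third below twice the threshold", which is an interpretation of the text; a single chain with integer sequence positions is also assumed.

```python
from itertools import combinations
from math import dist

STANDARD = {"ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
            "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL"}

def extract_triplets(residues, d_max=6.0):
    """residues: list of (sequence_index, residue_name, centroid_xyz)."""
    triplets = []
    for a, b, c in combinations(residues, 3):
        # (1) only the 20 natural amino acids
        if not {a[1], b[1], c[1]} <= STANDARD:
            continue
        # (2) no direct sequence neighbours within the triplet
        i, j, k = sorted((a[0], b[0], c[0]))
        if j - i <= 1 or k - j <= 1:
            continue
        # (3) two pair distances within d_max, the third below 2 * d_max
        d = sorted((dist(a[2], b[2]), dist(b[2], c[2]), dist(a[2], c[2])))
        if d[1] <= d_max and d[2] < 2 * d_max:
            triplets.append((a, b, c))
    return triplets

domain = [(1, "HIS", (0, 0, 0)), (2, "ALA", (1, 0, 0)), (5, "ASP", (4, 0, 0)),
          (9, "SER", (0, 4, 0)), (12, "HOH", (1, 1, 0)), (20, "GLY", (30, 30, 30))]
found = [tuple(r[0] for r in t) for t in extract_triplets(domain)]
print(found)
```

The enumeration is cubic in the number of residues; a spatial index would be the obvious refinement for larger structures.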
For the interaction analysis it is necessary to define a hash table, based on integer
values of centroid distances and the names of the residues. The integer value of a distance is
calculated by dividing the measured distance by a precision value (hash precision), which
was set at ±0.5 Å. Given a 3-body with
\[
\mathit{trip} = (A, B, C), \tag{3.1}
\]
a three-dimensional hash table is defined as
\[
HT(A, B, C) = \text{3D hash bin}[i][j][k], \tag{3.2}
\]
where i, j, and k are the integer values of the measured distances between the spatial coordinates of two residues. The integer values are given by the equation
\[
\begin{aligned}
i &= \mathrm{INT}(\mathrm{dist}(A, B) / \text{hash precision}) \\
j &= \mathrm{INT}(\mathrm{dist}(B, C) / \text{hash precision}) \\
k &= \mathrm{INT}(\mathrm{dist}(A, C) / \text{hash precision}).
\end{aligned}
\tag{3.3}
\]
For a detailed definition of the implemented hash table cf. [Old01].
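Equations 3.1 to 3.3 can be sketched with a dictionary keyed by residue names and distance bins. The data layout below is an assumption made for illustration (the actual hash table is defined in [Old01]); the bin width is passed in explicitly, and the order of the residues within a triplet is left to the caller.

```python
from collections import Counter
from math import dist

def hash_key(triplet, hash_precision):
    """triplet: ((name_A, xyz_A), (name_B, xyz_B), (name_C, xyz_C))."""
    (na, pa), (nb, pb), (nc, pc) = triplet
    i = int(dist(pa, pb) / hash_precision)  # INT(dist(A, B) / hash precision)
    j = int(dist(pb, pc) / hash_precision)  # INT(dist(B, C) / hash precision)
    k = int(dist(pa, pc) / hash_precision)  # INT(dist(A, C) / hash precision)
    return (na, nb, nc), (i, j, k)

def build_hash_table(triplets, hash_precision):
    """Count how often each (composition, distance bin) combination occurs."""
    return Counter(hash_key(t, hash_precision) for t in triplets)

t = (("HIS", (0.0, 0.0, 0.0)), ("ASP", (3.2, 0.0, 0.0)), ("SER", (3.2, 4.7, 0.0)))
# A bin width of 1.0 is chosen for illustration only.
print(hash_key(t, 1.0))
```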
3.1.2
Detection of significant configurations as interactions
Theory
The method for residue interaction detection relies on the comparison of two probabilistic
models: the reductionistic part-to-whole approximation model, and the holistic reference
model. Part-to-whole approximation is modelled with a collection of marginal distributions defined by subsets of the variables. Formally, a 3-body consists of three variables
(cf. equation 3.1). To verify whether the probability of a triplet, P (A, B, C), can be
factorised, we attempt to approximate it by using all attainable marginals
\[
M = \{P(A, B),\ P(A, C),\ P(B, C),\ P(A),\ P(B),\ P(C)\}. \tag{3.4}
\]
If the approximation fits the data, i.e. the probability of finding a particular triplet
is explained by the approximation model, then there is no evidence for an interaction.
In other words, a significant interaction is given when the two models are significantly
different. The difference between two joint probability density functions O and M is
measured by the Kullback-Leibler divergence
\[
D(O \,\|\, M) = \sum_{i} O(i) \log\!\left(\frac{O(i)}{M(i)}\right). \tag{3.5}
\]
In this context O usually refers to the observed probability or the reference model,
while M is the approximation model. The null hypothesis in testing the interaction model
is that the part-to-whole approximation matches the observed data. The alternative one
is that the approximation does not fit and that there is an interaction. Three cases can
be listed:
\[
\begin{array}{ll}
D(O \,\|\, M) > 0: & \text{there is a pattern among $k$ attributes} \\
D(O \,\|\, M) = 0: & \text{there is no pattern of order $k$} \\
D(O \,\|\, M) < 0: & \text{there is redundancy among the parts.}
\end{array}
\tag{3.6}
\]
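Equation 3.5 can be written as a short function. The sketch assumes that both models are given as dictionaries over the same states and uses base-2 logarithms; the base is an assumption, as the text does not specify it.

```python
from math import log2

def kl_divergence(O, M):
    """D(O||M) = sum_i O(i) * log(O(i) / M(i)); states with O(i) = 0 contribute 0."""
    return sum(p * log2(p / M[state]) for state, p in O.items() if p > 0)

O = {"x": 0.5, "y": 0.5}
print(kl_divergence(O, O))                       # identical models
print(kl_divergence(O, {"x": 0.25, "y": 0.75}))  # approximation deviates
print(kl_divergence(O, {"x": 1.0, "y": 1.0}))    # unnormalised approximation
```

For two normalised distributions the divergence cannot be negative. The third call illustrates that a negative value can arise when the approximation M does not sum to one, which can be the case for a part-to-whole approximation.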
Within a 3-body system, four different configurations of interactions can be defined
(cf. figure 3.2): no-interaction, one-pair interaction, two-pair interactions, and three-pair
interactions. For each of these configurations it is possible to formulate a part-to-whole
approximation model, i.e. the interaction can be factorised. In the case of no-interaction,
the probability of the observable is expected to be estimated by its singlet probabilities
\[
k = 0: \quad \hat{P}_{0}(A, B, C) = P(A)\,P(B)\,P(C), \tag{3.7}
\]
Figure 3.2: Four classes of interactions within a 3-body. A circle represents a protein residue, and an
intersection resembles an interaction between two residues. k=0: no-interaction; k=1: one-way or one
pair interaction; k=2: two-way or two-pair interactions; k=3: three-way or three-pair interactions.
whereas in a system with one-pair interaction, two variables are dependent on each other.
Consequently, within a 3-body state there are three isoforms of one-pair interactions:
\[
k = 1: \quad
\left\{
\begin{array}{l}
\hat{P}_{1,1}(A, B, C) = \dfrac{P(A, B)\,P(C)}{P(A)\,P(B)} \\[2ex]
\hat{P}_{1,2}(A, B, C) = \dfrac{P(A, C)\,P(B)}{P(A)\,P(C)} \\[2ex]
\hat{P}_{1,3}(A, B, C) = \dfrac{P(B, C)\,P(A)}{P(B)\,P(C)}
\end{array}
\right.
\tag{3.8}
\]
There are two forms of three variable interactions, but with different dependencies:
two-pair interaction (k=2) and three-pair interaction (k=3). These interactions represent
the target of 3D pattern mining. In a two-pair interaction, two pairs of variables are dependent on each other, while sharing a common attribute. For example, given A interacts
with B, and B interacts with C, there is no clear observation that A also interacts with
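C. Three isoforms are formulated for this interaction:
\[
k = 2: \quad
\left\{
\begin{array}{l}
\hat{P}_{2,1}(A, B, C) = \dfrac{P(A, B)\,P(B, C)}{P(B)} \\[2ex]
\hat{P}_{2,2}(A, B, C) = \dfrac{P(A, C)\,P(A, B)}{P(A)} \\[2ex]
\hat{P}_{2,3}(A, B, C) = \dfrac{P(B, C)\,P(A, C)}{P(C)}
\end{array}
\right.
\tag{3.9}
\]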
In case of a three-pair interaction, all three variables are dependent on each other, and
the approximation model is defined as
\[
k = 3: \quad \hat{P}_{3,1}(A, B, C) = \frac{P(A, B)\,P(B, C)\,P(A, C)}{P(A)\,P(B)\,P(C)}. \tag{3.10}
\]
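The approximation models of equations 3.7, 3.9 (first isoform) and 3.10 can be evaluated from a joint distribution as sketched below. The sketch assumes that the joint distribution is given as a dictionary over states (a, b, c); the function names are hypothetical.

```python
from collections import defaultdict

def marginals(joint):
    """Singlet and doublet marginals of a joint distribution {(a, b, c): p}."""
    pa, pb, pc = defaultdict(float), defaultdict(float), defaultdict(float)
    pab, pbc, pac = defaultdict(float), defaultdict(float), defaultdict(float)
    for (a, b, c), p in joint.items():
        pa[a] += p
        pb[b] += p
        pc[c] += p
        pab[a, b] += p
        pbc[b, c] += p
        pac[a, c] += p
    return pa, pb, pc, pab, pbc, pac

def approximations(joint, state):
    """Return the k=0, k=2 (isoform 2,1) and k=3 approximations for one state."""
    a, b, c = state
    pa, pb, pc, pab, pbc, pac = marginals(joint)
    k0 = pa[a] * pb[b] * pc[c]                                          # eq. 3.7
    k2 = pab[a, b] * pbc[b, c] / pb[b]                                  # eq. 3.9
    k3 = pab[a, b] * pbc[b, c] * pac[a, c] / (pa[a] * pb[b] * pc[c])    # eq. 3.10
    return k0, k2, k3

# Fully dependent variables: only two of the eight possible states occur.
coupled = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}
print(approximations(coupled, (0, 0, 0)))
```

In this toy case the no-interaction model underestimates the observed probability of 0.5 and the two-pair model reproduces it, while the three-pair approximation yields a value that is not normalised.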
If the state is disturbed, e.g. by exchanging one variable, a partial interaction will
not be observed. In respect of protein biology, this could mean that a residue mutation
abolishes an intramolecular stabilising network. However, as this does not provide an
evolutionary advantage the conservation of this residue is likely to be promoted and can
be detected as a recurrent structural feature.
The determined sets of two-way (k=2) and three-way (k=3) interactions are the targets
in this data mining.
Implementation
Triplets of residues are classified into one of the four defined interaction configurations.
The classification is based on a non-parametric cross-validation sampling method described by [JB04]. A significant interaction is given when the two models O and M are
significantly different. Because the data can be regarded as a sample of a multinomial distribution, the representativeness of the approximation model can be tested by the self-loss
function D(P′||P). Here, P′ and P are the probability distributions from two samples of equal size. The weight of evidence for accepting the null hypothesis, i.e. the approximation
model, can be estimated by p_cv-values from a 2-fold cross-validation. For each random
sampling the dataset is partitioned into two equally sized subsets: one training set and
one test set. From these subsets two joint probability distribution functions, P′ and P,
are determined from the training and test set, respectively. The marginal distributions,
singlets and doublets, are determined from P′ to construct the part-to-whole approximation P̂′. The p_cv-value is defined as the probability that the self-loss is greater than or equal
to the approximation loss
\[
p_{cv} = \Pr\{\, D(P \,\|\, P') \geq D(P \,\|\, \hat{P}') \,\}. \tag{3.11}
\]
On the basis of p_cv-values, an interaction is discovered if p_cv ≤ α, and an interaction
is rejected when p_cv > α. High threshold values of α, e.g. 0.95, bias towards an
interaction and risk overfitting, while lower values, e.g. 0.05, move the bias towards the no-interaction model and risk underfitting. In this study, a reductionistic bias approach was
chosen, to prefer a simpler no-interaction model, by selecting α = 0.05. The value
of α used is based on the research work of [JB04].
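The estimation of p_cv by repeated 2-fold cross-validation can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure of [JB04] in full: the pseudo-count smoothing (needed to avoid division by zero for states missing from one half), the natural logarithm, and all function names are assumptions of this sketch.

```python
import random
from collections import Counter
from math import log

def distribution(samples, states, pseudo=1e-3):
    """Relative frequencies with a small pseudo-count (assumption of this sketch)."""
    counts = Counter(samples)
    total = len(samples) + pseudo * len(states)
    return {s: (counts[s] + pseudo) / total for s in states}

def kl(P, Q):
    return sum(p * log(p / Q[s]) for s, p in P.items() if p > 0)

def no_interaction_model(P):
    """k=0 part-to-whole approximation built from the singlet marginals of P."""
    singles = [Counter(), Counter(), Counter()]
    for state, p in P.items():
        for axis, value in enumerate(state):
            singles[axis][value] += p
    return {s: singles[0][s[0]] * singles[1][s[1]] * singles[2][s[2]] for s in P}

def p_cv(samples, approximate, n_iter=100, seed=0):
    """Fraction of random 2-fold splits with self-loss >= approximation loss."""
    rng = random.Random(seed)
    states = sorted(set(samples))
    hits = 0
    for _ in range(n_iter):
        shuffled = samples[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        p_train = distribution(shuffled[:half], states)  # P'
        p_test = distribution(shuffled[half:], states)   # P
        p_hat = approximate(p_train)                     # part-to-whole approximation of P'
        if kl(p_test, p_train) >= kl(p_test, p_hat):
            hits += 1
    return hits / n_iter

# Toy data in which the three positions are fully dependent on each other.
data = [("H", "D", "S")] * 50 + [("C", "C", "C")] * 50
alpha = 0.05
print(p_cv(data, no_interaction_model, n_iter=20) <= alpha)
```

For the toy data the no-interaction model always loses more than the self-loss, so p_cv falls below α and the no-interaction model is rejected.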
3.1.3
Grouping and selecting frequent configurations
Theory
The result of data mining protein structures can be a large set of 3D patterns. The
data needs to be clustered in order to select the most frequent patterns. The assumption
behind data clustering is, that residue configurations in protein structures are unlikely
to be absolute and static. By grouping spatially similar configurations, the geometrical
variation of patterns can be compensated and their frequencies improved.
Implementation
The objective in this section is to identify frequent groups of geometrically similar triplets
with identical chemical configurations. Data clustering was done in two steps. For each
residue triplet combination, the initial step is to group geometrically similar patterns
and then count the combined frequencies
\[
G(HT(i, j, k)) = \sum_{i-1}^{i+1} \sum_{j-1}^{j+1} \sum_{k-1}^{k+1} HT(i, j, k), \tag{3.12}
\]
where HT is a hash table of the residue triplets (cf. equation 3.2). Then local geometrical
peaks were searched by comparing the frequencies of the grouped triplets
\[
\arg\max\; G(HT(a, b, c)) < G(HT(i, j, k)), \tag{3.13}
\]
where HT(a, b, c) ≠ HT(i, j, k) with a ∈ {i − 1, i, i + 1}, b ∈ {j − 1, j, j + 1} and
c ∈ {k − 1, k, k + 1}.
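The grouping of equation 3.12 and the peak search of equation 3.13 can be sketched for the hash table of a single residue composition. The sketch compares a bin only with its occupied neighbours and uses a strict inequality, so tied neighbours produce no peak; both choices are assumptions of this illustration.

```python
from itertools import product

OFFSETS = list(product((-1, 0, 1), repeat=3))  # the 27 bins of the 3x3x3 neighbourhood

def grouped_count(ht, key):
    """G(HT(i, j, k)): summed frequency of a bin and its 26 neighbours."""
    i, j, k = key
    return sum(ht.get((i + di, j + dj, k + dk), 0) for di, dj, dk in OFFSETS)

def local_peaks(ht):
    """Bins whose grouped count exceeds that of every occupied neighbouring bin."""
    G = {key: grouped_count(ht, key) for key in ht}
    peaks = []
    for (i, j, k), g in G.items():
        neighbours = ((i + di, j + dj, k + dk)
                      for di, dj, dk in OFFSETS if (di, dj, dk) != (0, 0, 0))
        if all(G[n] < g for n in neighbours if n in G):
            peaks.append((i, j, k))
    return peaks

# Toy table: distance bins (i, j, k) -> frequency, for one residue composition.
ht = {(3, 4, 5): 10, (3, 4, 6): 4, (2, 4, 4): 2, (9, 9, 9): 1}
print(grouped_count(ht, (3, 4, 5)), sorted(local_peaks(ht)))
```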
Dataset    PDBIDs  Domains  Domain definition  Data selection        Properties
OLDFIELD   1,442   2,320    mathematical       Sequence alignment    Homologous structural features of divergent proteins.
SCOP40     3,449   4,734    human expert       Structure comparison  Convergent structural features of divergent proteins.
Figure 3.3: Non-redundant structure set for 3D pattern mining. The dataset OLDFIELD is based on
the publication of [Old02], and SCOP40 was obtained from ASTRAL Compendium [BKL00]. The size
of the datasets, the method for data selection, and key properties are summarised.
The second step in data clustering finds subgroups of triplets from a local peak, based
on an all atom structure alignment. The determined clusters are ranked by their probability scores, which are defined as:
\[
P(\text{cluster}) = \frac{\#\,\text{cluster member}}{\#\,\text{peak member}}. \tag{3.14}
\]
On the basis of P(cluster), a cluster of residue interactions is selected if P(cluster) ≥
τ. In this study, the threshold τ for selecting a cluster was set to 0.66.
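The selection rule of equation 3.14 amounts to a simple ratio test, sketched below with hypothetical cluster names and sizes.

```python
def select_clusters(cluster_sizes, peak_size, tau=0.66):
    """Keep clusters whose share of the peak members is at least tau."""
    scores = {name: n / peak_size for name, n in cluster_sizes.items()}
    return {name: p for name, p in scores.items() if p >= tau}

# A local peak with 100 members that splits into two structural clusters.
print(select_clusters({"cluster_1": 70, "cluster_2": 30}, peak_size=100))
```

With τ = 0.66 at most one cluster per peak can be selected, since two clusters cannot both hold two thirds of the members.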
3.2
Analysing available non-redundant protein structure sets
The significance of this data mining result is greatly dependent on the representativeness
of the data. For the frequencies of structural features to be true, they would have to be
taken from protein structures of all of the naturally occurring protein folds. However,
such a data resource is not available at present (cf. section 2.1.1). This effectively means
that protein structure data mining is bound by the availability of fold examples. While
from a bioinformatical point of view, little can be done to improve the coverage of the
fold space, a number of efforts have been dedicated to the compilation of non-redundant
datasets from PDB.
The results in this thesis are based on the study of two non-redundant protein structure sets: OLDFIELD [Old02] and SCOP40 [HMBC97] [BKL00]. Figure 3.3 summarises
key features of each dataset. The major distinction between both datasets lies in the
definition of a non-redundant dataset. The purpose in compiling OLDFIELD is to create a dataset that allows the detection of interesting structural equivalence from the
non-specific structural features. The primary data selection is in sequence space. The
resulting dataset contains only sequentially dissimilar protein fragments, while common
fold motifs are preserved. This allows the detection of homologous structural components
of divergent proteins. In contrast, SCOP represents a biased view of protein data by defining classes in structure space. The assignment to a class, of a novel protein, is based on
structure and sequence comparisons. SCOP40 is the data subset of SCOP, where sequentially divergent proteins with convergent structural features are retained. Because the
classification contains structurally divergent proteins, any identified recurrent structural
feature in SCOP40 is an indication of convergent evolution.
Another distinction between OLDFIELD and SCOP40 is the method of identifying
domain structures. In OLDFIELD, protein fragmentation was done mathematically by
analysing Cα distances [Old01], while in SCOP40 human experts were recruited to process
a batch of protein structures. Both approaches have their advantages and caveats. On the one
hand, an automatic structure domain identification system delivers reproducible data,
but the results may not be justified in some cases. On the other hand, expert curated
data represent a single precision view, but the information is difficult to reproduce
as new data become available.
The difference in automatic and manual data selection is also reflected in the size of
the datasets. In 2002, the compiled non-degenerated domain structure set from OLDFIELD listed 2,320 domain structures, corresponding to 1,442 PDB identifiers. In contrast, SCOP40 contained 4,734 domain structures determined from 3,449 PDB identifiers
in the same year.
3.3
Evaluation methods
The presented 3D pattern identification system is a discovery-driven data mining solution.
The assessment of performance is done on two levels: the study of parameter dependency
(presented in this chapter), and the validation of biological significance of the data (cf.
chapter 4).
The effect of data-related parameters was studied by comparing the mined results from
OLDFIELD and SCOP40. In the first part of the analysis, the distributions of extracted
residue triplets were compared. Then the determined sets of k=2 and k=3 interactions
were studied.
The developed data mining method is a three-step process, and the effect of algorithm-related parameters was studied on two levels. Although the developed data mining
method is controlled by many different parameters, the following key parameters were
studied: residue interaction distance, and size of cross-validation to compute p-values.
The effect of the interaction distance parameter was studied by varying the maximal
distance between the centroids of residues. Three different distance settings were tested:
4Å, 6Å, and 8Å.
Repeated cross-validation sampling was used to determine confidence values for residue
triplet classification. Various iterations were tested (from 100 to 1,500 in steps of 100) to
study the effect on the size of interaction datasets.
3.4
Results
3.4.1
Identification of residue interactions is dependent on data selection
The result of a data mining analysis is greatly dependent on the input dataset. The
objective in this section is to study the effect of data-related parameters by comparing
results from data mining on OLDFIELD and SCOP40.
With 590,255 unique triplet configurations in SCOP40 and 429,471 in OLDFIELD,
the common set of triangulated triplets is 381,578 (cf. figure 3.4). Due to the difference
in the probability distributions of both datasets, the classification of residue interactions
resulted in different sizes of interaction classes. A set analysis on the classification data
shows, that the classes have different sizes of overlaps (cf. figure 3.5). For example,
OLDFIELD/k=3 and SCOP40/k=3 have a large common set of residue configurations of
around 89 per cent for OLDFIELD and 44 per cent for SCOP40. In contrast, the common
set of k=2 interaction is much lower, i.e. 21 per cent for OLDFIELD and 13 per cent for
SCOP40. The analysis also found two proportions of non-agreed classifications (k2/k3
between OLDFIELD/SCOP40).
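The set analysis reduces to computing the share of each set that lies in the intersection. The sketch below uses a hypothetical helper and toy sets, and then applies the same arithmetic to the unique triplet counts reported above (381,578 common configurations out of 429,471 in OLDFIELD and 590,255 in SCOP40).

```python
def shared_fractions(set_a, set_b):
    """Fraction of each set that is also contained in the other set."""
    common = len(set_a & set_b)
    return common / len(set_a), common / len(set_b)

# Toy example with three shared elements.
print(shared_fractions({1, 2, 3, 4}, {2, 3, 4, 5, 6, 7}))

# Same arithmetic on the reported counts of unique triplet configurations.
frac_oldfield = 381_578 / 429_471
frac_scop40 = 381_578 / 590_255
print(round(frac_oldfield, 3), round(frac_scop40, 3))
```

On these counts the common set covers roughly 89 per cent of the OLDFIELD configurations and roughly 65 per cent of the SCOP40 configurations.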
These results highlight the effect of data selection on the data mining result. A different
probability distribution of residue triplets, singlets and doublets is the reason, why certain
residue configurations were classified as k=2 in one dataset, and k=3 in another dataset.
3.4.2
The interaction distance correlates with the distribution of residue triads
The extraction of residue configurations is controlled by the data representation, feature
extraction, and by the feature selection method. Structural features were extracted by
triangulation of a protein structure, which was modelled by a point spread of side chain
centroids. The goal in this section is to study the effect of varying the interaction distance
parameter. For this analysis the dataset OLDFIELD was used.
Table 3.1 summarises the determined set of residue triplets by using three different
maximal interaction distances. With the change of the distance threshold, the amount
of extracted triplets, and the probability distributions of the singlets and doublets are
changed (data not shown). Consequently, the testing of significance of residue interactions
returns different results. It must be noted that a complete analysis with an 8 Å interaction
Figure 3.4: Distribution analysis of extracted residue triplets. The determined residue triplet distribution from OLDFIELD is compared with SCOP40. The upper panel shows a set analysis of the extracted
residue triplets (numbers are the unique counts of the residue configuration). The middle panel illustrates the frequency of each triplet (t) (represented as information, I(t)) from the set of triplets (T). For
a better visualisation the difference of the distributions is measured by the Kullback-Leibler divergence
(lower panel).
Figure 3.5: Comparison of extracted residue triplets based on their interaction type. The determined
k=2 and k=3 classification sets from OLDFIELD and SCOP40 are compared by a set analysis. Due to
the interaction classification (k=2, and k=3) there is no intersection of all four datasets.
               Triplets
Distance (Å)   Total       Unique      k=2     k=3
4              2,938       1,799       16      165
6              1,379,545   429,471     9,681   134,465
8              7,128,886   2,016,306   N/A     N/A
Table 3.1: Study on the effect of varying the interaction distance threshold in structure triangulation.
The different determined sets of residue triplet configurations in OLDFIELD were achieved by using the
interaction distance thresholds: 4Å, 6Å, and 8Å.
distance was not done in this study.
In conclusion, the effect of varying the interaction distance on the triangulation output is in agreement with the expected result. While the frequencies of "small" triplet
configurations are the same for increasing interaction distance thresholds, the calculated probabilities are different because of the different distributions. This also affects
the result of interaction classification.
3.4.3
Interaction classification is sensitive to the size of cross-validation
Significance testing of residue interactions is a method for assigning confidence values to
the classification of residue triplets. The p-values were calculated from a two-fold cross-validation with n iterations of random data sampling. Here, the effect of varying the number
of iterations is studied. OLDFIELD is used as the dataset for this analysis.
Figure 3.6 shows the logarithmic dependency between iteration size and determined
classification sets. Regression analysis indicates that the finite classification set was
not found after 1,500 iterations. The study of classified residue interactions from each
iteration revealed that the set from iteration i is always a subset of the set from iteration j
with i < j.
In conclusion, the result of varying the iteration sizes indicates that the classification
sets are stable and reproducible. With increasing iteration size, the previously determined sets are
not altered, meaning that the classification result is reliable, but additional elements are identified.
3.5
Discussion
3D pattern identification is the result of a data mining method that finds recurrent structural features within a protein dataset. The developed analysis method consists of three
major modules: triangulation of a protein structure, significance testing of residue interaction, and data clustering of the determined residue interactions.
Figure 3.6: The effect of varying the cross-validation sample size on significance testing of residue
interaction. The diagram shows the increasing but converging number of determined residue triplet
configurations with one-way, two-way, and three-way interactions at various iteration steps (from 100 to
1,500 in steps of 100) of a non-parametric cross-validation sampling.
Protein structure triangulation is the basis of collecting spatial configurations of residues.
The definition of residue interaction is a complex task, because an amino acid consists
of many atoms. Many of them are candidates of interaction partners. A coarse grained
model was used to overcome this problem, at the cost, however, of redefining the interaction distance. Instead of measuring interaction distances between atoms of two different
amino acids, the distance between the side chain centroids is used. The theoretical physicochemical interaction distance between two atoms cannot be transferred to measure the
centroid based side chain interactions. The upper bound of the interaction distance of 6 Å was
determined from several visual inspections and measurements of residue configurations.
The analysis shows that with d = 6 Å, various side chain rotamer configurations are captured, which may represent a physicochemical interaction. By reducing the interaction
distance threshold, a bias towards tightly inert residue configurations is observed. Conversely, the increase in d results in a huge set of triplet combinations. Some of the larger
triplets do not capture a 3-body interaction, but may be part of a four-body interaction,
where the fourth residue is situated between all three residues. Although larger interaction states may reflect a complete picture of a structural unit, the primary aim here is to
find local and adjacent interactions of residues.
The performance of correlation analysis based on hash tables is sensitive to positional
errors, which typically translate into the computation of "wrong" hash bin indices.
Consider the sample values a = 3.99, b = 4.01, and c = 4.99, where a is assigned to hash
bin index i(a) = 1, while b and c are assigned to i(b) = i(c) = 2. The difference between a
and b is actually smaller than that between b and c. A correlation analysis with these hashed data
therefore seems inadequate, although the "correct" hash bin is in the neighbourhood. A solution
to this problem is to consider adjacent hash bins, i.e. a rectangular region of the table
[LW91].
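The binning problem and the neighbouring-bin remedy can be sketched in one dimension. The bin width of 2.0 is an assumption chosen because it reproduces the indices of the example (i(a) = 1, i(b) = i(c) = 2); the function names are illustrative, and the actual table spans several dimensions, where the neighbourhood becomes a rectangular region.

```python
import math
from collections import defaultdict

BIN_WIDTH = 2.0  # assumed width; reproduces i(3.99) = 1 and i(4.01) = i(4.99) = 2

def bin_index(value, width=BIN_WIDTH):
    """Map a value to its hash bin index by flooring."""
    return math.floor(value / width)

def build_table(values, width=BIN_WIDTH):
    """Group values by hash bin index."""
    table = defaultdict(list)
    for value in values:
        table[bin_index(value, width)].append(value)
    return table

def lookup_with_neighbours(table, value, width=BIN_WIDTH):
    """Collect entries from the bin of `value` and its two adjacent bins."""
    i = bin_index(value, width)
    hits = []
    for j in (i - 1, i, i + 1):
        hits.extend(table.get(j, []))
    return hits

table = build_table([3.99, 4.01, 4.99])
print(bin_index(3.99), bin_index(4.01), bin_index(4.99))
print(lookup_with_neighbours(table, 4.01))
```

A lookup restricted to the bin of b alone would miss a, although a is the closer value; including the adjacent bins recovers it at the price of more candidates to compare.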
The identification of an interaction class, e.g. a two-way interaction, is based on a
probabilistic classification approach. Confidence values were assigned to the classification
result by calculating p-values from non-parametric cross-validation sampling. Theoretically, the more sampling iterations are used, the more stable the calculated p-values become. At a certain point, the number of determined interacting residues should converge
to some value. The implication of determining a stable p-value is the identification of a
finite set of residue interactions. Within this study, the final set was not determined and,
for practical reasons, the set after 100 iterations was used.
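The dependence of a sampled p-value on the number of iterations can be illustrated with a generic empirical estimate. This is a sketch of a common resampling estimator, not necessarily the sampling procedure used in this thesis; the function names and the add-one correction are assumptions.

```python
def empirical_p_value(observed, null_values):
    """Fraction of resampled values at least as large as the observed value.

    The +1 in numerator and denominator counts the observation itself,
    so the estimate can never be exactly zero.
    """
    n_extreme = sum(1 for value in null_values if value >= observed)
    return (n_extreme + 1) / (len(null_values) + 1)

def smallest_p_value(n_iterations):
    """Smallest p-value this estimator can report after n_iterations samples."""
    return 1 / (n_iterations + 1)

# Hypothetical resampled statistics and a hypothetical observed statistic.
null_values = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(empirical_p_value(5, null_values))
print(smallest_p_value(100), smallest_p_value(1500))
```

Under this estimator the resolution of the p-value is bounded by the iteration count, which is one way to see why the set of significant interactions keeps changing until enough iterations have been run.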
The output of extracted patterns depends on the distribution of structural features
in the input dataset. The introduced algorithm is based on the assumption that there
are significant trends of residue configurations in proteins, if these interactions provide
a significant functional or structural advantage. Obviously, we cannot expect that data
mining on two differently defined data selections would deliver the same mining output.
From a mathematical point of view, the results are still correct, because the algorithm is
detecting recurrent residue configurations in the data.
3.6
Conclusion
In this chapter, I have presented a novel data mining approach for the discovery of 3D
patterns in protein structures. A pattern is a residue triplet with two- or three-way
interaction of residues. The extraction of 3D patterns depends not only on algorithm-related parameters, but also on the data selection. The validity of the data mining
approach is justified on the basis of knowing the limits and effects of data and parameters.
In the following chapter, I will present the biological significance of the mined result.
Chapter 4
Prediction of functions for mined
residue triads
In the previous chapter, a data mining approach was introduced that identifies recurrent
interacting residues as triplets in protein structures. Assuming that a certain residue
configuration is conserved in evolution if it provides a structural or functional advantage,
the mined 3D pattern may represent a functional site in the protein. The objective in
this chapter is to demonstrate the biological validity of the data mined results by cross-validation with a reference database. I present two example cases of validated residue
interactions. The first example represents the validation of a metal binding site, where
the mined patterns represent either a homologous or a convergent structural feature.
The second validation identifies the catalytic triad from the mined data. The analysis
includes the search for a 4-body configuration of the catalytic triad (quartet), in order to
find a previously reported conserved serine residue. The results presented in this chapter
demonstrate the biological significance of the mined data and justify the data mining
approach.
4.1
Evaluation methods
The biological significance of the mined 3D patterns is demonstrated by the rediscovery of
known residue interactions. A systematic performance analysis, in terms of coverage and
accuracy is not possible, because a test set with complete functional annotations of local
residue interactions with biological function is not available. Therefore, various protein
databases were used as references for cross-validations.
The automatic cross-validation of metal binding sites is based on the comparison of
the mined 3D patterns with a metal binding site database. Two reference databases were
used and the results compared with each other: MSDsite [GDO+05] and MDB [CHR+02].
The identification of available metal binding sites in the input dataset considered only
configurations with more than 2 residues. A hit was found if all residues of a metal binding
site were present in a protein structure. Likewise, a mined 3D pattern was identified as
a metal binding site if all residues of the pattern form a subset of a metal binding
site. However, because a metal binding site can contain more than three residues, and
the mined patterns can include two overlapping triplets, only identified metal binding sites
were counted and not every matched pattern. The coverage is computed as:
\[
c_{\mathrm{coverage}} = \frac{\#\,\text{unique sites matched by all residues in a 3D pattern}}{\#\,\text{available sites in protein structure set}} \tag{4.1}
\]
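The counting rule behind equation 4.1 can be sketched as follows. The site and pattern data are hypothetical, the function name is illustrative, and residue identifiers are assumed to be unique within the structure set (in practice they would include the structure and chain).

```python
def coverage(sites, patterns):
    """Coverage as in equation 4.1.

    sites:    dict mapping a site identifier to the set of its residues
    patterns: iterable of residue triplets (any iterable of residue ids)
    A site counts once, no matter how many patterns match it.
    """
    matched = {
        site_id
        for site_id, site_residues in sites.items()
        if any(set(pattern) <= site_residues for pattern in patterns)
    }
    return len(matched) / len(sites)

# Hypothetical reference sites and mined triplets.
sites = {
    "site1": {"CYS10", "CYS13", "CYS30", "CYS33"},
    "site2": {"HIS5", "HIS9", "ASP40"},
}
patterns = [
    ("CYS10", "CYS13", "CYS30"),  # subset of site1
    ("CYS13", "CYS30", "CYS33"),  # overlapping triplet, same site1
    ("HIS5", "HIS9", "TRP70"),    # TRP70 is not part of site2: no hit
]
print(coverage(sites, patterns))
```

Two overlapping triplets that fall into the same site contribute a single hit, which matches the rule that sites rather than matched patterns are counted.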
The result of metal binding site cross-validation is compared with the performance of
SIDEMINE [Old02] extraction. Because a similar experiment was not performed before,
it was repeated here. The cross-validation of a metal binding site is analogous to the
identification of active sites in the dataset (cf. above).
The identification of a convergent metal binding site was done by a manual search in
the mined output from SCOP40. The protein structures of a found metal binding site
pattern were analysed with respect to their SCOP classification identifiers.
Dataset    #Triangulated  Interaction  #Classified   #Clustered  #Pattern
           triplet        type         interactions  patterns    frequencies
OLDFIELD   429,471        k=2                9,681         925        5,697
                          k=3              134,465       1,007       11,957
SCOP40     590,255        k=2               15,455         765          927
                          k=3              269,683       2,019        2,361
Table 4.1: Summary of extracted data at each protein structure data mining step. The data mining
was performed on OLDFIELD and SCOP40. The number of identified residue triplet interactions is
given in "#Classified interactions", while the column "#Clustered patterns" indicates the number of unique
residue interaction configurations after data clustering, and "#Pattern frequencies" is the total number
of examples of the found residue interactions in the dataset.
The automatic cross-validation of catalytic residues was done by comparing residues
from active site templates in CSA [PBT04]. The validation of a catalytic active site for
all example protein structures was based on manual analysis.
To test whether the mined result contains a second conserved serine residue in the
catalytic triad (quartet) (Asp-His-Ser/Ser), larger residue configurations were constructed.
The method for finding N-bodies is based on the algorithm of [Old02]: two 3D patterns
(triplets) from the same protein structure were combined, if they share two common
residues. The analysis considered only the search for 4-, 5-, and 6-bodies.
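The merging rule for 4-bodies can be sketched as follows. This is an illustration of the rule as stated here (two triplets from the same structure sharing two residues), not a reimplementation of [Old02]; the function name is illustrative, and the residues are those of the PDB entry 1ar5 discussed in section 4.2.1.

```python
from itertools import combinations

def merge_triplets(triplets):
    """Combine triplets from one structure that share exactly two residues.

    Returns the set of resulting 4-bodies, each as a frozenset of residues.
    """
    four_bodies = set()
    for first, second in combinations(triplets, 2):
        first, second = frozenset(first), frozenset(second)
        if len(first & second) == 2:
            four_bodies.add(first | second)
    return four_bodies

# Mined triplets for 1ar5 as reported in section 4.2.1.
triplets_1ar5 = [
    ("HIS27", "HIS75", "TRP126"),   # 2His-Trp
    ("ASP161", "HIS75", "TRP126"),  # Asp-His-Trp
]
for body in merge_triplets(triplets_1ar5):
    print(sorted(body))
```

Larger N-bodies would follow by applying the same idea to the merged sets, requiring an overlap of all but one residue; that extension is an assumption and is not shown.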
4.2
Results
In the following sections, the biological significance of the mined 3D patterns is evaluated.
Data mining was performed on the datasets OLDFIELD and SCOP40 with the following
parameters: interaction distance d = 6Å, cross-validation iteration = 100, and selection
of cluster based on τ = 0.66 (cf. section 3.4). Table 4.1 summarises the extracted data
at each processing step.
MSDsite
            Dataset    Determined  Validated  Coverage
Reference   OLDFIELD   567         85         0.15
SIDEMINE    OLDFIELD   567         60         0.11

MDB
            Dataset    Determined  Validated  Coverage
Reference   OLDFIELD   302         36         0.12
SIDEMINE    OLDFIELD   302         18         0.06
Table 4.2: Identification of metal binding sites in OLDFIELD. The available metal binding sites in the
protein domain structures in OLDFIELD (input dataset) were determined by two reference databases
(MSDsite and MDB). The figures were compared with the cross-validated metal binding sites in the
mined 3D pattern dataset. A hit was found in the pattern data if all three residues of a pattern are a
subset of the residues of a metal binding site. The performance was measured in terms of coverage.
4.2.1
Identification of homologous metal binding sites
Metal binding proteins play a vital role in a wide range of biological processes, such as
structural stability and complex formation. The identification of metal binding proteins
is therefore crucial. The objective in this section is to identify metal binding sites within
the mined 3D patterns from OLDFIELD by cross-validation with the reference databases
MSDsite [GDO+05] and MDB [CHR+02].
Table 4.2 lists the number of determined metal binding sites in the input dataset and
the validated 3D patterns. The analysis shows that the determined coverage for both
references is quite similar, providing some confidence in the determined value. While
the mined result covers only a small fraction of the available metal binding sites, the
performance is comparable with SIDEMINE.
A manual analysis shows that some of the annotated metal binding sites can be partially recovered by merging two 3-bodies into a single 4-body. For example, MSDsite
lists the iron binding site, Asp-3His, for the PDB entry 1ar5 with the residues ASP161,
HIS27, HIS75, and HIS165. The mined result from OLDFIELD contains the patterns
2His-Trp and Asp-His-Trp, with the residues HIS27, HIS75, TRP126, and ASP161, HIS75,
TRP126, respectively. Both triplets can be merged into the 4-body Asp-2His-Trp.
A systematic analysis of false negatives is beyond the scope of this work. However,
preliminary studies indicate that the selection of the interaction distance plays an important
role in discovering 3D patterns. For example, by setting the interaction distance d to 8 Å,
various triplet configurations can be extracted that contain the missing histidine, HIS165,
from the example above.
The validity of a mined 3D pattern as a metal binding site is demonstrated by manual
analysis of several example structures. The examples show that the residues of a metal
binding site have a strong conservation of the side chain groups, indicating a high energy
bond in the formation of a coordinative tetrahedral site. Figure 4.1 illustrates an example
configuration with three cysteines from six structure examples. The listed proteins are
heterogeneous in nature but have the 3Cys-mediated ion binding site in common. Except
for one entry, all structures coordinate a zinc ion in a tetrahedral configuration.
Another metal binding site with the configuration Cys-2His is shown in figure 4.2.
The cluster lists 11 proteins with the majority being electron transfer proteins.
In conclusion, the mined 3D pattern data contain validated metal-coordinating residue
configurations. The result indicates that the presented data mining system is able to
identify homologous structural features, which are recurrent in the dataset.
4.2.2
Validation of convergent metal binding sites
Proteins with different folds can share a common structural feature. For example, various
metal binding sites share a common residue arrangement, while the global fold of the
metal binding proteins is quite different. In this case, the common pattern represents
a convergent structural feature. The objective in this section is to test whether the
developed data mining algorithm is able to find patterns of convergent structural features.
For this analysis, the data mining was performed on SCOP40.
PDBID  Description                 Bound metal
1h2r   periplasmic hydrogenase     nickel-iron
1lat   glucocorticoid receptor     zinc
2nll   retinoic acid receptor      zinc
1ptq   protein kinase c            zinc
2ohx   alcohol dehydrogenase       zinc
4mt2   metallothionein isoform II  zinc
Figure 4.1: A metal binding site with the 3Cys pattern. Cross-validation of metal binding sites with 3D
pattern from OLDFIELD identified the 3Cys configuration (top panel). List of protein structures with
the common 3Cys residue configuration (bottom panel).
PDBID  Description        Bound metal
1kdi   plastocyanin       cu
1aoz   ascorbate oxidase  cu
6paz   pseudoazurin       cu
1jer   stellacyanin       cu
2azu   azurin             cu
1bqk   pseudoazurin       cu
1aac   amicyanin          cu
1byo   plastocyanin       cu
1as7   nitrite reductase  cu
1nic   nitrite reductase  cu
1rcy   rusticyanin        cu
Figure 4.2: A metal binding site with the Cys-2His pattern. Cross-validation of metal binding sites with
3D pattern from OLDFIELD identified the Cys-2His configuration (top panel). List of protein structures
with the common Cys-2His residue configuration (bottom panel).
PDBID  Description                        Bound metal
1iml   metal-binding protein              zn
1zin   phosphotransferase                 zn
1kk1   translation                        zn
1ibi   metal-binding protein              zn
1dgs   ligase                             zn
1hc7   aminoacyl-trna synthetase          zn
1gax   ligase/rna                         zn
1dsv   virus/virus protein                zn
1i50   transcription                      zn
1ptq   phosphotransferase                 zn
1zbd   complex (gtp-binding/effector)     zn
1kb4   transcription/dna                  zn
1dcq   metal binding protein              zn
1jj2   ribosome                           cd
1vfy   transport protein                  zn
1ffy   ligase/rna                         zn
1dcq   metal binding protein              zn
1dsz   transcription/dna                  zn
1d66   transcription regulation           cd
2alc   dna binding protein                zn
1tfi   transcription regulation           zn
4mt2   metallothionein                    zn
1jr3   transferase                        zn
1a5t   zinc finger                        zn
1jjd   metal binding protein              zn
1bor   transcription regulation           zn
1zbd   complex (gtp-binding/effector)     zn
1g25   metal binding protein              zn
1pyi   complex (dna-binding protein/dna)  zn
1hwt   complex (activator/dna)            zn
1het   oxidoreductase                     zn
Figure 4.3: A metal binding site with the 3Cys pattern. Cross-validation of metal binding sites with
3D pattern from SCOP40 identified the 3Cys configuration (top panel). List of protein structures with
the common 3Cys residue configuration (bottom panel).
PDBID  Description                             Bound metal
1ncs   transcription regulation                zn
1rmd   dna-binding protein                     zn
2drp   complex (transcription regulation/dna)  zn
1yuj   complex (dna-binding protein/dna)       zn
1a1i   complex (zinc finger/dna)               zn
1ubd   complex (transcription regulation/dna)  zn
5znf   zinc finger dna binding domain          zn
2gli   complex (dna-binding protein/dna)       co
1tf3   complex (transcription regulation/dna)  zn
1bhi   dna-binding regulatory protein          n/a
1e53   transcription                           zn
1g2a   hydrolase                               ni
1jym   hydrolase                               co
Figure 4.4: A metal binding site with the Cys-2His pattern. Cross-validation of metal binding sites with
3D pattern from SCOP40 identified the Cys-2His configuration (top panel). List of protein structures
with the common Cys-2His residue configuration (bottom panel).
3Cys
SCOP classification  SCOP domain identifiers
a.4.11.1             1i50j
a.27.1.1             1ffya1
a.60.2.2             1dgsa1
b.35.1.2             1heta1
c.26.1.1             1gaxa3
c.37.1.8             1kk1a3
c.37.1.13            1jr3a2, 1a5t_2
g.38.1.1             1d66a1, 2alca, 1pyia1, 1hwtc1
g.39.1.2             1kb4b, 1dsza
g.39.1.3             1iml_2, 1ibia1, 1ibia2
g.39.1.6             1jj2t
g.40.1.1             1dsva
g.41.2.1             1zin_2
g.41.3.1             1tfi
g.44.1.1             1bor, 1g25a
g.45.1.1             1dcqa2
g.46.1.1             4mt2, 1jjda
g.49.1.1             1ptq
g.50.1.1             1vfya, 1zbdb
g.56.1.1             1hc7a3

Cys2His
SCOP classification  SCOP domain identifiers
g.37.1.1             d1ncs, d1rmd_1, d2drpa1, d2drpa2, d1yuja, d1a1ia1, d1ubdc1, d5znf,
                     d1ubdc2, d2glia4, d2glia2, d2glia3, d1tf3a1, d1bhi
g.49.1.2             d1e53a
d.167.1.1            d1g2aa, d1jyma
Table 4.3: Convergent metal binding sites identified in SCOP40. The determined metal binding sites
from the 3D patterns in SCOP40 belong to different fold classes of unrelated proteins (convergent structural feature).
Two patterns were identified in this study that represent metal binding sites. The
3Cys configuration is the first example with 31 structure examples (cf. figure 4.3). The
second metal binding configuration is the Cys-2His pattern with 17 structure examples
(cf. figure 4.4). A visual analysis determined that the identified metal binding sites from
SCOP40 are similar to the mined result from OLDFIELD (cf. previous section). According to the SCOP classification scheme, groups of protein structures can be determined
that have different domain structures but share the same metal binding site (cf. table 4.3). This indicates that the pattern was found as a recurrent structural feature in
evolutionarily distant proteins.
The result of this analysis suggests that the developed data mining algorithm is able
to find recurrent and convergent structural features in a non-redundant structure set.
4.2.3
Recovering active sites and catalytic triads from the dataset
The catalytic triad of serine proteases is one of the best-characterised non-metal active sites. The enzymatic reaction is based on the conserved residues serine, aspartate, and
histidine that work together in a specific spatial arrangement. Previously, the identification of the catalytic triad has been described as the key evaluation analysis in protein
structure data mining, because the occurrence of this pattern is just above the noise level
in a dataset of analogous proteins [Old02]. The objective in this section is the search
for active sites, and the catalytic triad in particular, by cross-validation with CSA. The
mined result from OLDFIELD was analysed in this study.
Within OLDFIELD, 235 active sites were determined, while the number of cross-validated active sites from the mined output was 27. Table 4.4 lists the validated protein
residues. The majority of these residues are found in the Asp-His-Ser pattern, which
was validated as the catalytic triad by manual analysis. The identified catalytic triad
configuration lists 22 structure examples, with the majority belonging to the enzyme class
hydrolase, and only a few belonging to the class oxidoreductase. In comparison, [Old02]
identified 9 proteins, of which 7 were rediscovered in this analysis. The remaining
15 out of 22 are additional and approved solutions. Figure 4.5 shows the superimposed
structures for the Asp-His-Ser configuration.
This study shows that the presented data mining system is able to find the catalytic
triad in OLDFIELD. The mined result contains 15 additional valid solutions that were
not discovered in [Old02].
3D pattern (k=2)
Pattern      PDBID  RID                                 EC         UID
Ala-Arg-Asn  1qgj   A ALA 71, A ARG 38, A ASN 67        1.11.1.7   PER59_ARATH
             7atj   A ALA 74, A ARG 38, A ASN 70        1.11.1.7   PER1A_ARMRU
His-2Ser     1elt   A HIS 57, A SER 195, A SER 214      3.4.21.36  ELA1_SALSA
             1ppf   E HIS 57, E SER 195, E SER 214      3.4.21.37  ELNE_HUMAN
             1bma   A HIS 60, A SER 203, A SER 222      3.4.21.36  ELA1_PIG
             1avw   A HIS 57, A SER 195, A SER 214      3.4.21.4   N/A
             1hyl   A HIS 57, A SER 195, A SER 214      3.4.21.-   COGS_HYPLI
             1bit   A HIS 57, A SER 195, A SER 214      3.4.21.4   TRY1_SALSA
             1jrt   A HIS 57, A SER 195, A SER 214      3.4.21.4   TRY1_BOVIN
             1try   A HIS 57, A SER 195, A SER 214      3.4.21.4   TRYP_FUSOX
             1au8   A HIS 57, A SER 195, A SER 214      3.4.21.20  CATG_HUMAN
             1ct0   E HIS 57, E SER 195, E SER 214      N/A        N/A
Asp-His-Ser  1a8q   A ASP 223, A HIS 252, A SER 94      1.11.1.10  BPA1_STRAU
             1a7u   A ASP 228, A HIS 257, A SER 98      1.11.1.10  PRXC_STRAU
             1a88   A ASP 226, A HIS 255, A SER 96      1.11.1.10  PRXC_STRLI
             1a8s   A ASP 224, A HIS 253, A SER 94      1.11.1.10  PRXC_PSEFL
             1tib   A ASP 201, A HIS 258, A SER 146     3.1.1.3    LIP_THELA
             3tgl   A ASP 203, A HIS 257, A SER 144     3.1.1.3    LIP_RHIMI
             1bs9   A ASP 175, A HIS 187, A SER 90      3.1.1.6    AXE2_PENPU
             1avw   A ASP 102, A HIS 57, A SER 195      3.4.21.4   N/A
             1acb   E ASP 102, E HIS 57, E SER 195      3.4.21.4   CTRA_BOVIN
             1taw   A ASP 102, A HIS 57, A SER 195      3.4.21.4   N/A
             1au8   A ASP 102, A HIS 57, A SER 195      3.4.21.20  CATH_HUMAN
             1elt   A ASP 102, A HIS 57, A SER 195      3.4.21.36  ELA1_SALSA
             3tgi   E ASP 102, E HIS 57, E SER 195      3.4.21.4   TRY2_RAT
             1agj   A ASP 120, A HIS 72, A SER 195      3.4.21.-   ETA_STAAU
             1auo   A ASP 168, A HIS 199, A SER 114     3.4.22.38  CATK_HUMAN
             1arb   A ASP 113, A HIS 57, A SER 194      3.4.21.50  API_ACHLY
             1jrt   A ASP 102, A HIS 57, A SER 195      3.4.21.4   TRY1_BOVIN
             1try   A ASP 102, A HIS 57, A SER 195      3.4.21.4   TRYP_FUSOX
             2tec   E ASP 38, E HIS 71, E SER 225       3.4.21.66  THET_THEVU
             1ppf   E ASP 102, E HIS 57, E SER 195      3.4.21.37  ELNE_HUMAN
             1jfr   A ASP 177, A HIS 209, A SER 131     N/A        P83850_STREX
             1ct0   E ASP 102, E HIS 57, E SER 195      N/A        N/A

3D pattern (k=3)
Pattern      PDBID  RID                                 EC         UID
Ala-Asp-Ser  1brt   A ALA 123, A ASP 228, A SER 98      1.11.1.10  BPOA2_STRAU
             1onr   A ALA 225, A ASP 17, A SER 176      2.2.1.2    TALB_ECOLI
Asp-Cys-Lys  1nba   A ASP 51, A CYS 177, A LYS 144      3.5.1.59   CSH_ARTSP

[Columns "Cross-validated: CSA" and "Cross-validated: SIDEMINE": the "+" marks cannot be assigned to individual rows in this transcript.]
Table 4.4: List of cross-validated active site residues. The catalytic residues in the mined k=2 or k=3
residue triplets were compared against active site templates in CSA. RID = a Residue identifier consisting
of a chain identifier + a residue name + a residue sequence position.
Figure 4.5: Re-discovery of the catalytic triad in OLDFIELD. Examples of protein structures with the
Asp-His-Ser pattern were cross-validated by CSA.
4.2.4
Discovering the conserved serine residue in the catalytic
triad (quartet)
The catalytic triad template (Asp-His-Ser) has been reported as a four-residue configuration (Asp-His-Ser/Ser) [WBT97] [BFW+94]. Based on the identified catalytic triad
pattern in OLDFIELD (cf. previous section), the objective in this section is to test
whether a 4-body or even larger residue configurations can be generated from the
mined 3D patterns. In addition, the analysis searches for the conserved serine residue in these
extended configurations.
The result of extending the catalytic triad is summarised in table 4.5. Of the
22 structure examples, 10 have a single residue extension, and only 7 of the 10 determined
4-bodies contain the conserved serine residue (Asp-His-2Ser).
Other 4-bodies were also found with an additional alanine or cysteine residue. Preliminary studies indicate that even larger configurations can be obtained by combining the
determined 4-bodies into a 5- or 6-body. However, the biological validity of the additional
alanine or cysteine in a 4-body, or even other amino acids in larger residue configurations,
needs to be determined.
Columns: PDBID, Asp-His-Ser, His-2Ser, Ala-His-Ser, Cys-His-Ser, Ala-Asp-His
PDBID: 1jrt, 1au8, 1ppf, 1avw, 1ct0, 1elt, 1try, 3tgi, 1acb, 1arb, 2tec, 1agj, 1taw, 1a8s, 1jfr,
1a7u, 1auo, 1a88, 1a8q, 1tib, 3tgl
[The "+" marks of the pattern columns cannot be assigned to individual rows in this transcript.]
Table 4.5: Extending the catalytic triad into 4-bodies. Two residue triplets from the same
protein structure are merged if two of their residues are identical. The first column indicates the
catalytic triad configuration, while the second column represents an extension with a previously reported
conserved serine residue. The remaining columns show other solutions of 3-body extensions with the
catalytic triad.
In conclusion, the presented algorithm is able to find the catalytic triad (quartet),
i.e. the second conserved serine residue was rediscovered from data mining. While other
residue configurations of 4-bodies were also found, the biological role of these residues is
being investigated further.
4.3
Discussion
The biological cross-validation of the mined 3D patterns requires an adequate knowledge
base as reference. A precision score cannot be estimated from cross-validation studies,
because the result is the solution of discovery-driven data mining, and current knowledge
bases have an incomplete coverage of functional sites. In this respect, the mined 3D
patterns may contain known biological motifs, which are the detectable true positives,
or unknown functional sites, which cannot be confirmed yet. In addition, the result may
contain noise, which is impossible to detect as false positives. The biological significance
of the presented data mining was evaluated by examples of known biological functional
sites: the metal binding site, and the catalytic triad. In particular, only known functional
sites for proteins in the input structure set were used as benchmark. An alternative to this
stringent evaluation is to transfer functional sites from homologous proteins, e.g. based
on the Homology-derived Secondary Structure of Proteins (HSSP) database [SS96], and
consider this information as the true positive reference.
About one third of the data in the PDB are protein structures co-crystallised with
metal ions, which allows the study of metal binding sites [BW03]. Within the analysis,
only a small fraction of proteins with metal binding sites were rediscovered. A systematic
optimisation of the developed data mining algorithm was not pursued, e.g. by modification
of feature selection criteria, because this would have exceeded the limit of this thesis.
Preliminary studies on the source of the false negative rate indicate that the interaction
distance threshold is the first parameter to be optimised. However, a change of this
parameter also modifies the probability distribution of triangulated structural features,
and the effect cannot be estimated easily.
The datasets OLDFIELD and SCOP40 are quite different (cf. section 3.2). OLDFIELD consists of sequentially dissimilar protein structures, while the proteins may still
share structural similarity. This property allows the mining of homologous structural
features of divergent proteins, such as metal binding sites or the catalytic triad. The developed data mining method was also tested for its ability to extract convergent structural
features, by analysing SCOP40. This dataset consists only of divergent proteins with no
global structural similarities. As a consequence, structural components are mainly represented by convergent features, and these residue configurations might fall
below the detection level. That is, the occurrences of convergent structural features are similar to the background level. However, metal binding sites are examples of convergent patterns
that were found in this study. The coordination of metal ions is greatly dependent on the
distances and orientations of the conjugating residues. For that reason, data mining can
detect these convergent structural features in structurally unrelated proteins.
The presented data mining system identifies local three-residue interactions with respect to their spatial and chemical configuration. In addition, examples of 4- and 5-body
interactions were shown as a solution for extending the catalytic triad pattern. The analysis shows that larger residue configurations can be found with the presented combinatorial
approach. However, the search for larger structural patterns might deliver only protein
stabilising features or other biological units in protein structures that are difficult to
interpret.
4.4
Conclusion
The solution of the developed data mining algorithm is justified by the cross-validation
of biologically relevant structure motifs provided in this study. The mining system is
able to detect recurrent homologous or convergent structural features in the dataset.
More importantly, two biological motifs, the metal binding site and the catalytic triad,
were rediscovered, indicating that the mined output contains biologically valid solutions.
While the prediction of functional sites is an important task in structural biology, the
biological interpretation of a 3D pattern requires evidence of biological significance. The
combination with published biochemical and experimental data can provide evidence and
a biological context for data interpretation. In the next chapter, I will present a biomedical
literature mining system for the extraction of functional annotation of protein residues.
Chapter 5
Identification of protein residues in
MEDLINE
In this chapter, I present a text mining method to identify protein residues in biomedical
texts. In the first step, the algorithm identifies the biological entities of residue, protein,
and organism, and then determines the association of entity triplets. As a result a residue
is linked to its source protein, and the protein is mapped to its hosting organism. Because
the developed text mining solution relies on information from UniProtKB, an identified
protein residue is directly linked to a unique UniProtKB entry. One application of this
method is to search MEDLINE for abstracts that mention protein residues, and then to use
the result to update citations in UniProtKB. The identification of protein residues
in biomedical texts is a prerequisite for the extraction of functional annotation of residues.
5.1
Algorithms
The developed protein residue identification system is based on the algorithm of [HLC04].
Basically, the developed method is a four step procedure: biological entity recognition
of organism, protein, and residue, and the association of the entity triplet. Figure 5.1
illustrates the procedures of this text mining system.
Figure 5.1: Overview of processes and evaluation methods for the developed protein residue identification system.
5.1.1
Protein and organism entity recognition
Theory
The recognition of protein and organism entities in text is based on a dictionary lookup
approach. Basically, names of proteins, their synonyms, and their gene names are collected from UniProtKB to populate a protein terminology dictionary. The lookup of the
protein dictionary considers the matching of morphological variants. The dictionary is
not expanded by syntactical variants of terminological entries, like structural or formal
variants, and addition of modifier or head word, because the lookup approach with the
vast number of permutations requires much more computational memory resources. The
alternative is to use a probabilistic approach.
A similar method is also used to populate the organism terminology dictionary with
names and synonyms from the NCBI Taxonomy database [WBB+ 06]. The lookup of
terminologies also considers the matching of morphological variants.
Implementation
The recognition of protein entities was based on an approach that combined dictionary
lookup with basic disambiguation [RSKA+ 07]. All protein names and synonyms were
collected from UniProtKB.
Names of species were extracted from the NCBI Taxonomy references in UniProtKB,
and their scientific and common names collected. The dictionary was complemented with
terms describing only the referenced genus. Full organism names were augmented
with abbreviated genus forms, i.e. first-letter abbreviation of the genus + species.
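The dictionary augmentation for organism names can be sketched as follows. The function names are hypothetical, and the exact written form of the abbreviation (capital initial followed by a period) is an assumption.

```python
def abbreviate_genus(scientific_name):
    """Return the abbreviated genus form, e.g. 'Escherichia coli' -> 'E. coli'.

    Returns None for names without a species part.
    """
    parts = scientific_name.split()
    if len(parts) < 2:
        return None
    genus, species = parts[0], " ".join(parts[1:])
    return f"{genus[0].upper()}. {species}"

def organism_terms(scientific_name):
    """Dictionary terms derived from one scientific name (assumed scheme)."""
    terms = {scientific_name}
    parts = scientific_name.split()
    if len(parts) >= 2:
        terms.add(parts[0])                           # genus only
        terms.add(abbreviate_genus(scientific_name))  # abbreviated genus form
    return terms

print(abbreviate_genus("Escherichia coli"))
print(sorted(organism_terms("Homo sapiens")))
```

Genus-only and abbreviated terms are ambiguous across species, so in practice they would need the kind of disambiguation mentioned for protein names.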
The fast and efficient method for annotating texts with protein and organism names
was based on the publicly available web service called Whatizit [RSAG+ 08]. The result is
an annotation of protein and organism names in text with references to UniProtKB and
NCBI Taxonomy.
5.1.2
Entity recognition of protein residue
Theory
The identification of residue entities is based on the re-implementation of previously published regular expression patterns for point mutations [HLC04] [RSMA+ 04]. Here, the
patterns are extended to capture in total three types of residues: wild-type, point mutation, and range of residues or pair of residues. Although amino acid sequences can
be considered in the residue entity identification, the lack of information about sequence
position prevents the precise association detection with proteins.
The first basic type of residue mention is the single protein residue sequence reference,
which consists of the name of an amino acid, followed by the sequence position number,
e.g. "Gly-12", "arginine 4", "Tyr74", "Arg(53)". A point mutation is the second type of
residue mention, where the description details the exchange of an amino acid at a given
position. The common notation is the name of the amino acid, its sequence position
number, followed by the exchange. The following are examples of point mutations found
in text: "W77R", "Cys560Arg", "ser-52->ala", "ala2-methionine". Finally, the third
type of residue mention describes either a range of residues or an interaction pair, e.g.
"Tyr 85 to Ser 85", "Trp27–Cys29". The common notation is the string sequence: amino
acid name, sequence position, a connection symbol or connection word, amino acid name,
and then sequence position. The correct identification of this type of residue
mention requires the consideration of contextual information, which is not handled in
this version.
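The first two mention types can be sketched with regular expressions. The patterns below are written for illustration only and are not the published patterns of [HLC04] or [RSMA+04]; all names are hypothetical, and the normalised output format (tuples with three-letter codes) is an assumption. Ranges and pairs are left out.

```python
import re

# (three-letter code, one-letter code, full names)
AMINO_ACIDS = [
    ("Ala", "A", ["alanine"]), ("Arg", "R", ["arginine"]),
    ("Asn", "N", ["asparagine"]), ("Asp", "D", ["aspartate", "aspartic acid"]),
    ("Cys", "C", ["cysteine"]), ("Gln", "Q", ["glutamine"]),
    ("Glu", "E", ["glutamate", "glutamic acid"]), ("Gly", "G", ["glycine"]),
    ("His", "H", ["histidine"]), ("Ile", "I", ["isoleucine"]),
    ("Leu", "L", ["leucine"]), ("Lys", "K", ["lysine"]),
    ("Met", "M", ["methionine"]), ("Phe", "F", ["phenylalanine"]),
    ("Pro", "P", ["proline"]), ("Ser", "S", ["serine"]),
    ("Thr", "T", ["threonine"]), ("Trp", "W", ["tryptophan"]),
    ("Tyr", "Y", ["tyrosine"]), ("Val", "V", ["valine"]),
]

# Map every lower-cased spelling to the three-letter code.
TO_THREE = {}
for three, one, full_names in AMINO_ACIDS:
    TO_THREE[three.lower()] = three
    for name in full_names:
        TO_THREE[name] = three
ONE_TO_THREE = {one: three for three, one, _ in AMINO_ACIDS}

# Longest spellings first, so that "arginine" is tried before "arg".
NAME = "(?:" + "|".join(sorted(TO_THREE, key=len, reverse=True)) + ")"
ONE = "[" + "".join(ONE_TO_THREE) + "]"

# Named point mutation, e.g. "Cys560Arg", "ser-52->ala", "ala2-methionine".
MUT_NAMED = re.compile(rf"({NAME})[\s-]?(\d+)\s?(?:->|-)?\s?({NAME})", re.IGNORECASE)
# One-letter point mutation, e.g. "W77R" (case-sensitive).
MUT_ONE = re.compile(rf"({ONE})(\d+)({ONE})")
# Wild-type residue, e.g. "Gly-12", "arginine 4", "Tyr74", "Arg(53)".
WILD_TYPE = re.compile(rf"({NAME})[\s-]?\(?(\d+)\)?", re.IGNORECASE)

def normalise(mention):
    """Classify one candidate mention and return a normalised tuple, or None."""
    m = MUT_NAMED.fullmatch(mention)
    if m:
        return ("mutation", TO_THREE[m.group(1).lower()], int(m.group(2)),
                TO_THREE[m.group(3).lower()])
    m = MUT_ONE.fullmatch(mention)
    if m:
        return ("mutation", ONE_TO_THREE[m.group(1)], int(m.group(2)),
                ONE_TO_THREE[m.group(3)])
    m = WILD_TYPE.fullmatch(mention)
    if m:
        return ("wild-type", TO_THREE[m.group(1).lower()], int(m.group(2)))
    return None

for text in ["Gly-12", "arginine 4", "Arg(53)", "W77R", "ser-52->ala"]:
    print(text, normalise(text))
```

The mutation patterns are tried before the wild-type pattern because a mention such as "Cys560Arg" begins with a valid wild-type mention. One-letter patterns such as "W77R" also match unrelated tokens in running text, which is one reason why contextual filtering is needed beyond pattern matching.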
In addition to the abbreviated notation, protein residues can be expressed in syntactical form, e.g. ”isoleucine at position 3”, ”substitution of Ala at position 4 to Gly”,
”Ser472 to glutamic acid”. Additional patterns were developed to accommodate these
and other less precisely defined residue mentions in syntactical form, e.g. ”residue at position 22, 34, and 40”. Although the entity triplet association algorithm does not utilise
the latter identified residue mentions, annotation can generally be extracted for these
underspecified residues to increase the recall in information extraction.
Implementation
The extraction of residue mentions reuses the idea of designing regular expressions to find
residue entities in text [RSMA+ 04] [HLC04]. Some of the previously published regular
expression patterns were adopted, while other patterns were created to cover other types
of residue mentions, such as basic abbreviational point mutation patterns. In this thesis,
sets of regular expressions were developed and implemented as finite state transducer to
identify three types of residue entities (cf. table 5.1): wild-type, point mutation, and
range or pair of residues. The result is an annotation of residue mention in text with
normalised expressions.
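As an illustration of the abbreviated notations covered by the patterns in table 5.1, the following is a minimal sketch in Python. It is not the finite state transducer used in this work: it covers only the 20 standard amino acids, only the abbreviated single-site and point-mutation forms, and the normalised output (three-letter name plus integer position) as well as all names in the code are assumptions made for the example.

```python
import re

ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}
RESN3 = "(?:" + "|".join(ONE_TO_THREE.values()) + ")"   # three-letter names
RESN1 = "[" + "".join(ONE_TO_THREE) + "]"               # one-letter codes
POS = "[1-9][0-9]*"                                     # sequence position

# Point mutations: "Cys560Arg", "ser-52->ala" (three-letter) and "W77R" (one-letter).
MUTATION3 = re.compile(rf"\b({RESN3})-?({POS})(?:->|-)?({RESN3})\b", re.IGNORECASE)
MUTATION1 = re.compile(rf"\b({RESN1})({POS})({RESN1})\b")
# Single sites: "Gly-12", "Tyr74", "Arg(53)", "Ser 85".
SITE = re.compile(rf"\b({RESN3})(?:-| |\()?({POS})\)?(?![0-9A-Za-z])", re.IGNORECASE)


def find_residue_mentions(text):
    """Return (kind, wild_type, position, mutant) tuples in order of appearance."""
    found = []  # entries are (start, end, record)
    for m in MUTATION3.finditer(text):
        wt, pos, mt = m.groups()
        found.append((m.start(), m.end(),
                      ("mutation", wt.capitalize(), int(pos), mt.capitalize())))
    for m in MUTATION1.finditer(text):
        wt, pos, mt = m.groups()
        found.append((m.start(), m.end(),
                      ("mutation", ONE_TO_THREE[wt], int(pos), ONE_TO_THREE[mt])))
    taken = [(start, end) for start, end, _ in found]
    for m in SITE.finditer(text):
        # Skip site matches that are part of an already detected mutation.
        if any(m.start() < end and start < m.end() for start, end in taken):
            continue
        res, pos = m.groups()
        found.append((m.start(), m.end(), ("site", res.capitalize(), int(pos), None)))
    return [record for _, _, record in sorted(found, key=lambda item: item[:2])]
```

Mutation patterns are applied before the single-site pattern so that a string such as ”ser-52->ala” is not reported a second time as the wild-type site ”ser-52”.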
5.1.3
Association identification of the entity triplet organism,
protein, and residue
Theory
The association of the entities organism, protein, and residue is a difficult text mining
task. Unlike the association of two proteins, e.g. the physical interactions of two proteins
(protein-protein interaction), the binary semantic relationships of organism-protein and
protein-residue are not necessarily explicitly stated in biomedical texts. For example, a
protein may be mentioned at the beginning of a paragraph, while a site-directed mutation
on the same protein is described in later sections. This is one reason why approaches
relying only on language patterns or word distance metrics are not feasible to find protein-residue associations. The association task becomes more complex when multiple proteins
are mentioned in the text. Usually a residue has a one-to-one relationship with a protein,
however two proteins can have the same residue at the same sequence position. While
this ambiguity cannot be solved without deeper natural language processing techniques,
the problem can be tackled with a knowledge based approach.
RANGE-TO   = ("-"+ ("to" "-+")? | "to");
CONVERT-TO = ("to" | "-"+ ">"?);
XAA        = ( "X" | "XAA" | "xaa" );
POS        = (1-9)(0-9)*;
RESN1      = [ARNDCQEGHILKMFPSTWYVOUBZX];
RESN3      = ( [aA]la|ALA | [aA]rg|ARG | [aA]sn|ASN | [aA]sp|ASP | [cC]ys|CYS
             | [gG]ln|GLN | [gG]lu|GLU | [gG]ly|GLY | [hH]is|HIS | [iI]le|ILE
             | [lL]eu|LEU | [lL]ys|LYS | [mM]et|MET | [pP]he|PHE | [pP]ro|PRO
             | [sS]er|SER | [tT]hr|THR | [tT]rp|TRP | [tT]yr|TYR | [vV]al|VAL
             | [pP]yl|PYL | [sS]ec|SEC | [aA]sx|ASX | [gG]lx|GLX | [xX]aa|XAA);
RESNF      = ( [aA]lanine | [aA]rginine | [aA]sparagine | [aA]spart(ate|ic acid) | [cC]ysteine
             | [gG]lutamine | [gG]lutam(ate|ic acid) | [gG]lycine | [hH]istidine | [iI]soleucine
             | [lL]eucine | [lL]ysine | [mM]ethionine | [pP]henylalanine | [pP]roline
             | [sS]erine | [tT]hreonine | [tT]ryptophan | [tT]yrosine | [vV]aline
             | [pP]yrrolysine | [sS]elenocysteine | [aA]spartic acid or [aA]sparagine
             | [gG]lutamic acid or[gG]lutamine);
SITE       = ( (RESN3 | RESNF) POS "residue"?
             | (RESN3 | RESNF) "-"+ POS "residue"?
             | (RESN3 | RESNF) "residue"? "at position"? POS "residue"?
             | (RESN3 | RESNF) "(" POS ")" "residue"?
             | "amino acid"? "residue" "at position"? POS
             | "amino acid" "residue"? "at position"? POS
             | RESNF "residue" POS);
SITES      = ( RESNF"s" (("," | "and" | "or") RESNF"s")*
             | RESNF"s"? ("at position""s"?)? ("," | "and" | "or") (("at position""s"?)? ("," | "and" | "or") POS)+
             | RESNF "residue""s"?
             | RESN3 "residue""s"? ("at position""s"?)? POS (("at position""s"?)? ("," | "and" | "or") POS)+
             | RESN3 "residue""s"?
             | "residue""s"? ("at position""s"?)? POS ("," | "and" | "or") POS)+
             | (RESN3 | RESNF) "for" (RESN3 | RESNF) "at position" POS ("," | "and" | "or") POS)+
             | RESNF ("," | "and" | "or") POS)* "residue""s"?);
RANGE/PAIR = ( "residue""s"? ("," | "and" | "or") RANGE-TO POS)+
             | "amino acid" "residue"? "s"? ("," | "and" | "or") RANGE-TO POS)+
             | ("resiude""s"?)? "at position""s"? ("," | "and" | "or") RANGE-TO POS)+
             | RESI RANGE-TO RESI);
MUTATION   = ( RESN1 POS RESN1
             | RESN1 "-" POS "-" RESN1
             | RESN1 "(" POS ")" RESN1
             | RESI CONVERT-TO (RESN3 | RESNF)
             | RESI RESN3
             | "from" (RESNF | RESN3) CONVERT-TO (RESNF | RESN3) "at position" POS
             | (RESN3 | RESNF) "for" (RESN3 | RESNF) "at position" POS
             | RESI ("-"+ | CONVERT-TO) RESI "substitution");
Table 5.1: Regular expression patterns for the detection of residue mentions in text. The patterns
recognise single (SITE) or multiple wild-type residue sites (SITES), a sequence range or residue pair
(RANGE/PAIR), and point mutation (MUTATION). The set covers abbreviated notations of residues
as well as grammatical expressions found in text.
The developed method in this work is based on the algorithm of [HLC04]. Basically,
the identification of a protein residue can only be validated, if it is part of the protein
sequence, as it is denoted in a reference database, e.g. UniProtKB. This requires that the
protein mentioned in the text is further supported by evidence for the organisms under
scrutiny to select the appropriate protein sequence from the bioinformatics database; that
excludes the risk of using orthologous protein sequences.
Implementation
In this study, the developed system to identify the entity triplet association of organism,
protein, and residue, was based on the algorithm described by [HLC04] with some modifications. In the first step proteins were associated with their hosting organisms. Given a
protein, all pairs of protein-organism (species) were determined from text and ranked according to a word distance measure. The word distance between two entities was defined
by the smallest number of words between them. The identification of protein-organism
began with the pair with the smallest word distance measure. A valid association was
found, if a semantic relation was specified in UniProtKB. If an association was validated
then the search was terminated, and the protein was annotated with the corresponding
Uniprot identifier, otherwise the next entity pair from the list was tested. If no match
between protein and organism (species) was found, then the search was relaxed to genus
matching. This relaxed matching is an extension of the [HLC04] algorithm. Because
entries in UniProtKB are species specific, the protein-organism (genus) association will
result in a list of Uniprot identifiers as annotation of the protein.
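The protein-organism step described above can be sketched as follows. This is a simplified illustration and not the implementation used in this work: mentions are assumed to carry token spans, the two lookup tables standing in for UniProtKB (`species_index`, `genus_index`) are hypothetical, and all identifiers are made up.

```python
def word_distance(span_a, span_b):
    """Number of words between two mentions given as (first_token, last_token)."""
    if span_a[0] > span_b[0]:
        span_a, span_b = span_b, span_a
    return max(0, span_b[0] - span_a[1] - 1)


def associate_protein(protein, organisms, species_index, genus_index):
    """Return the identifiers supported by the nearest validated organism.

    protein:       (name, span)
    organisms:     list of (species, genus, span)
    species_index: {(protein name, species): identifier}        (hypothetical)
    genus_index:   {(protein name, genus): set of identifiers}  (hypothetical)
    """
    name, protein_span = protein
    ranked = sorted(organisms, key=lambda o: word_distance(protein_span, o[2]))
    # First pass: species-level validation, nearest organism first.
    for species, _genus, _span in ranked:
        uid = species_index.get((name, species))
        if uid is not None:
            return [uid]  # a validated pair terminates the search
    # Second pass: relaxed genus-level matching, which may yield several entries.
    for _species, genus, _span in ranked:
        uids = genus_index.get((name, genus))
        if uids:
            return sorted(uids)
    return []
```

A species-level hit returns a single identifier, whereas the genus fallback returns every entry of that genus, which mirrors the list of identifiers mentioned above.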
The second step of this algorithm was the association of residues with their source
proteins. The procedure of selecting and ranking the residue-protein pairs was similar
to the protein-organism association identification. For each pair that was to be tested
the annotated Uniprot identifier of the protein was used to retrieve the protein sequence
from the database. Three cases of results can be distinguished: (1) the residue correctly
matches the protein sequence; (2) several alternative sequences are matching from a list
of proteins; and (3) no match can be found for the residue with the available protein
sequences. If a match was found, then the residue was annotated with references to the
protein, otherwise the search continued with the next pair from the ranked list.
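A minimal sketch of the sequence check in this second step is given below. It assumes that residues are already normalised to a one-letter code plus a 1-based position and that the sequences have been retrieved into a plain dictionary; the function names and the toy identifiers are hypothetical.

```python
def residue_matches(sequence, one_letter, position):
    """True if the sequence carries the given amino acid at the 1-based position."""
    return 1 <= position <= len(sequence) and sequence[position - 1] == one_letter


def associate_residue(residue, ranked_proteins, sequences):
    """Test ranked protein mentions until one is supported by a sequence.

    residue:         (one_letter, position)
    ranked_proteins: candidate identifier lists, nearest protein mention first
    sequences:       {identifier: amino acid sequence}
    """
    for candidate_ids in ranked_proteins:
        hits = [uid for uid in candidate_ids
                if residue_matches(sequences.get(uid, ""), *residue)]
        if hits:
            return hits  # one hit: case (1); several hits: case (2)
    return []            # case (3): no sequence supports the residue
```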
5.2
The construction of evaluation test corpora
UniProtKB is one of the most comprehensive protein knowledge bases (cf. section 2.1.2).
It contains manually curated functional annotations on three levels: protein, protein sequence, and protein residue. Information is derived from surveys of biomedical articles,
and entries are annotated with citation references (PMIDs; PubMed identifiers). However, the precise association of a citation and a protein residue in context of functional
annotation is generally not available.
The test dataset for the developed functional annotation extraction is based on the
citation references from UniProtKB. A Uniprot corpus was generated by retrieving abstract texts from MEDLINE that are indexed by the knowledge base. From the 136,566
citations listed in UniProtKB, a virtually complete set of 136,559 abstract texts was retrieved from MEDLINE. Although not all information presented in UniProtKB is
necessarily available in the Uniprot corpus, the Uniprot corpus is a starting point for the
evaluation of the developed text mining modules. In particular three derived test corpora
were generated from the Uniprot corpus: the gold standard corpus with manual annotation (GC), and the two cross-validation corpora with annotated information derived from
UniProtKB (XC1, and XC2). Figure 5.2 summarises key features in both test corpora.
For the automatic evaluation of extracted data, a cross-validation corpus (XC) was
derived from the Uniprot corpus. This test set was used to analyse the performance of protein-organism (XC1) and residue-protein (XC2) associations. The test set was annotated
automatically, i.e. the biological entities were detected with the same ER systems. The
documents in the Uniprot corpus were scanned for tri-occurrences of organism, protein,
Dataset                    Gold standard corpus (GC)        Cross-validation corpus (XC1)   Cross-validation corpus (XC2)
Abstracts count            100                              55,998                          4,503
Method of annotation       manual                           automatic                       automatic
total/unique residues      362/262 (with 262/191 having     N/A                             N/A
                           residue name + residue
                           sequence position)
total/unique proteins      990/511                          N/A                             N/A
total/unique organisms     323/123                          N/A                             N/A
total/unique associations  240/172 residue-protein-         NA/70,401 protein-organism      NA/10,152 protein-residue
                           organism associations            as UTP                          as URP
Application (GC): Test the type, amount and reliability of the extracted information (reproduction of manually annotated information).
Application (XC1): Test set is assumed to contain the same type of information as GC, but certainty is not clear. Study the reproduction of information contained in the database.
Application (XC2): Test set is assumed to contain the same type of information as GC, but certainty is not clear. Study the reproduction of information contained in the database.
Figure 5.2: Test corpora for information extraction evaluation. Based on the citation references from
UniProtKB a base corpus was generated by retrieving abstract texts from MEDLINE. Two test corpora
were derived from this corpus: (1) the gold standard corpus (GC), which resembles a manually annotated test set; and (2) the cross-validation corpora (XC1, XC2), which contain automatically assigned
annotations based on information from UniProtKB.
and residue in text and a subset was retained if the combinations of the identifier triplet
(UID+TID+PMID) for each document can be found in the database. UID is the Uniprot
ID, TID is the NCBI Taxonomy ID, and PMID is the PubMed identifier. If at least a single
match was found, then a document was selected. For the non-matching combinations the
corresponding annotations were removed from text. This results in the test set XC1 with
the associated set of the triple identifier combinations UTP = (UID+TID+PMID). XC2
is a subselection from XC1 by filtering for documents where the identifier combination
URP=(UID+RID+PMID) were validated by entries in UniProtKB. RID is a residue
identifier which consists of a residue name + sequence position. 70,401 UTPs from 55,998
abstract texts were determined for XC1, and correspondingly 10,152 URPs were derived
from 4,503 MEDLINE articles in XC2.
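The selection of XC1 and XC2 can be pictured as a filter over identifier combinations. The sketch below assumes that the detected annotations and the database content are available as simple Python sets; the function name and all identifiers are made up for the example and do not correspond to real entries.

```python
def filter_validated(detected, database):
    """Keep, per abstract, only the identifier pairs confirmed by the database.

    detected: {pmid: set of (uid, other_id)} found by entity recognition
    database: set of (uid, other_id, pmid) combinations listed in the knowledge base
    Abstracts without any confirmed pair are dropped; unconfirmed pairs are removed.
    """
    corpus = {}
    for pmid, pairs in detected.items():
        confirmed = {(uid, other) for uid, other in pairs
                     if (uid, other, pmid) in database}
        if confirmed:
            corpus[pmid] = confirmed
    return corpus


# Toy data with made-up identifiers.
detected_utp = {"pmid1": {("U1", "T1"), ("U2", "T1")}, "pmid2": {("U3", "T2")}}
database_utp = {("U1", "T1", "pmid1")}
xc1 = filter_validated(detected_utp, database_utp)

detected_urp = {"pmid1": {("U1", "Gly12")}, "pmid2": {("U3", "Ser5")}}
database_urp = {("U1", "Gly12", "pmid1"), ("U3", "Ser5", "pmid2")}
# XC2 is restricted to abstracts that are already part of XC1.
xc2 = filter_validated(
    {pmid: pairs for pmid, pairs in detected_urp.items() if pmid in xc1},
    database_urp,
)
```

In the toy data the second abstract has a confirmed residue-protein pair but no confirmed protein-organism pair, so it enters neither XC1 nor XC2.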
The gold standard corpus (GC) was created through manual curation, since no suitable
annotated corpora are available for this study. A random sample of 100 MEDLINE
abstract texts was drawn from the Uniprot corpus, where every abstract text must contain
the tri-occurrences of organism, protein and residue. Notice that the detection of the
entities was based on the entity recognition (ER) systems described in the previous section.
It is not expected that the ER systems are performing at top level, and therefore a certain
proportion of the filtered abstract texts contains false positives of identified entities.
From this set of 100 abstract texts, manual analysis provided four types of annotations.
The first type is the annotation of the biological entities of organism, protein, and residue,
while the second is the annotation of entity triplet associations, i.e. organism-protein-residue. Notice that this process did not include the grounding of protein or organism
entities to entries in the specialised databases, i.e. UniProtKB and NCBI Taxonomy. In
addition, text segments of sentences with a residue entity were annotated, if they represent
keywords for functional annotation. Finally, the association of a keyword and a residue
was also annotated in GC.
Notice, that the set of documents in GC is partially contained in XC2; only 26 abstracts
are shared among both datasets. From manual annotation 38 entity triplet associations
were determined, while the corresponding number from XC2 was 58. The total number
of manually annotated triplet associations in GC is 172 (cf. figure 5.2).
The major difference between both evaluation corpora is, that GC contains manually
confirmed biological entities and their associations. In contrast, the same annotations
in XC1 and XC2 were done with UniProtKB, based on the assumption that the same
database information is present in abstract texts. The interpretation of performance
analysis has to consider the properties of these evaluation test corpora.
5.3
Evaluation methods
The performance of each process of the developed protein residue identification system
was scored against a manually annotated gold standard corpus. Proteins, where the
protein entity recognition system and manual curation assigned the same entity (full
term matching) were considered as true positives (TP). The same rule also applied for
counting TP for the detection of residue and organism entities.
The evaluation of the entity triplet association detections considered only associations
as TP, if both pair relations organism-protein and protein-residue were determined correctly. If one of the relations was incorrect, a found association was counted as false
positive (FP).
In contrast, the automatic evaluation of the entity recognition and entity association
detection systems was performed on XC. A true positive of an annotated entity within
an abstract text was identified, if UniProtKB lists the same entity in context of the
given PMID. For example, if organism X in text Y is also indexed in UniProtKB as a
combination of TID+PMID, then a TP was counted.
A correct protein-organism association was detected, if the determined identifier combination UTP was found in XC. Similarly, a correct residue-protein association was found,
if the derived identifier combination URP was found in the test corpus.
The effectiveness of the ER and the association detection systems was measured in
terms of precision, recall and the balanced F-measure (F1):
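\[ \text{precision} = \frac{\#\,\text{true positive}}{\#\,\text{true positive} + \#\,\text{false positive}}, \tag{5.1} \]
\[ \text{recall} = \frac{\#\,\text{true positive}}{\#\,\text{true positive} + \#\,\text{false negative}}, \tag{5.2} \]
\[ F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}. \tag{5.3} \]
5.4
Results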
The developed protein residue identification system in this study consists of four modules.
The following sections assess first performances of biological entity recognition, and then
Unique residue entities
Reference      Dataset               Available  Extracted  Common  Precision  Recall  F1
This study     Gold standard corpus  191        203        187     0.92       0.98    0.95
MutationGraB   GPCR corpus           N/A        N/A        N/A     0.98       0.77    0.86
MutationMiner  Xylanase corpus       N/A        N/A        N/A     1.00       0.85    0.92
MEMA           Mutation corpus       N/A        N/A        N/A     0.98       0.75    0.85
Table 5.2: Performance evaluation of residue entity recognition. The performance is compared with other
published residue entity recognition systems: MutationGraB (GPCR corpus) [LHC07]; MutationMiner
(Xylanase corpus) [BW05]; and MEMA (Mutation corpus) [RSMA+ 04]. Performance was measured in
terms of precision, recall, and F1 measure.
the association of the entity triplet organism, protein, and residue. The final section
presents an application of the presented text mining solution that can be used to update
the citation set of UniProtKB or any other derived databases.
5.4.1
Evaluation of organism, protein, and residue entity recognition
The goal of biological entity recognition, in this study, is to detect the mentions of residue,
protein, and organism in biomedical abstract texts. In order to evaluate the performance
of the developed ER systems, the detections were compared against the results from
the manually curated test set, the gold standard corpus (GC).
The evaluation shows that the developed regular expression patterns are highly usable
for the detection of residue mentions in biomedical texts. ER for residue mentions yields
a precision of 0.92 and a recall of 0.98. With an F1 measure of 0.95 the performance
of this ER system is within range of previous reports on point mutation identification
[LHC07] [BW05] [RSMA+ 04] (cf. table 5.2).
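The reported scores can be reproduced from the counts listed in table 5.2. The sketch below assumes that "Common" is the number of true positives, that "Extracted" equals true positives plus false positives, and that "Available" equals true positives plus false negatives; the function name is chosen for this example only.

```python
def scores(available, extracted, common):
    """Precision, recall and F1 from entity counts.

    available: entities in the reference annotation (TP + FN)
    extracted: entities reported by the system (TP + FP)
    common:    entities found in both (TP)
    """
    precision = common / extracted if extracted else 0.0
    recall = common / available if available else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Residue entity recognition on the gold standard corpus (table 5.2).
precision, recall, f1 = scores(available=191, extracted=203, common=187)
```

Rounded to two decimals, these counts give the precision, recall and F1 measure stated above.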
The performance for protein mention identification is evaluated with 65% precision and
60% recall (62% F1 measure). The result is difficult to compare to previously reported
systems, e.g. ProMiner and MutationMiner (cf. table 5.3), due to the different experimental setup. ProMiner was evaluated on the BioCreAtIvE corpus (80% F1 measure)
Unique protein entities
Reference      Dataset               Available  Extracted  Common  Precision  Recall  F1
This study     Gold standard corpus  511        471        305     0.65       0.60    0.62
ProMiner       BioCreAtIvE corpus    N/A        N/A        N/A     0.8        0.8     0.8
MutationMiner  Xylanase corpus       N/A        N/A        N/A     0.88       0.71    0.79
Table 5.3: Performance evaluation of protein entity recognition. The performance is compared with the
other published protein entity recognition systems: ProMiner (BioCreAtIvE corpus, Task 1B, protein
and gene name identification) [HFM+ 05]; and MutationMiner (Xylanase corpus) [BW05]. Performance
was measured in terms of precision, recall, and F1 measure.
Unique organism entities
Reference      Dataset               Available  Extracted  Common  Precision  Recall  F1
This study     Gold standard corpus  123        109        88      0.81       0.72    0.76
MutationMiner  Xylanase corpus       N/A        N/A        N/A     0.88       0.71    0.79
Table 5.4: Performance evaluation of organism entity recognition. The performance is compared with
the NER system of MutationMiner (Xylanase corpus) [BW05]. Performance was measured in terms of
precision, recall, and F1 measure.
which links the contained protein mentions to only a small set of organisms. However,
we have repeated the experiment on the BioCreAtIvE dataset and the result suggests
that our method yields a comparable performance (76% F1 measure). Conversely, the
evaluation of MutationMiner not only considers abstract texts but also the content of the
full-text articles which should improve the results (79% F1 measure).
Although the developed organism entity recognition system relies on a similar dictionary lookup approach as protein entity recognition, the performance is higher (precision
of 0.81 and recall of 0.72; cf. table 5.4). This indicates that the list of terminologies is
precise and covers a wide range of expressions.
In conclusion, with F1 measures of 0.95, 0.62, and 0.76 for the entity recognition of
residue, protein, and organism, the developed text mining system is able to detect these
three biological entities in biomedical abstract texts.
Unique resi.-prot.-org.-associations
Reference     Dataset               Available  Extracted  Common  Precision  Recall  F1
This study    Gold standard corpus  172        79         65      0.82       0.38    0.52
MutationGraB  Mutation corpus       N/A        N/A        N/A     0.85       0.69    0.76
MEMA          Mutation corpus       N/A        N/A        N/A     0.93       0.35    0.51
MuteXt        tinyGRAP              N/A        N/A        N/A     0.88       0.83    0.85
Table 5.5: Performance evaluation of residue-protein-organism entity association detection. The performance is compared with the other published point mutation detection systems: MutationGraB (Mutation
corpus1) [LHC07]; and MEMA (Mutation corpus2) [RSMA+ 04]. Notice that MEMA identified only associations but without grounding. Performance was measured in terms of precision, recall, and F1 measure.
5.4.2
Performance study on the entity triplet association
The objective of the developed association detection system is to identify the entity triplet
of organism, protein, and residue. In this section, the performance of this detection
system is studied by comparing the predicted association with the manually annotated
associations in the gold standard corpus (GC).
With a precision of 0.82 and a recall of 0.38 the developed detection system is a reliable
method for association detection, and the precision is comparable to other related reports
(cf. table 5.5). In comparison to the systems, MutationGraB and MuteXt, the low recall
can be explained by the differences in the test corpora; both systems were evaluated on
protein family specific full-text articles. The evaluated precision of MEMA is different
from this study, because MEMA identifies only associations without grounding to Uniprot
entries.
Manual analysis isolated two main reasons for the low recall. First, the association of
all the three entities failed in several cases, because the system did not find an association
between protein and organism. Second, cases were encountered where a protein-organism association was correctly identified, but a protein-residue association could not
be found. A detailed explanation is given in the discussion section.
Despite the low recall of this text mining module, the evaluation indicates that the
developed method is able to detect associations of residue, protein, and organism. More
UTP
Dataset  Available  Extracted  Common  Precision  Recall  F1
XC1      70,401     77,407     62,068  0.82       0.88    0.85

URP
Dataset  Available  Extracted  Common  Precision  Recall  F1
XC2      10,152     10,876     9,325   0.86       0.92    0.89
Table 5.6: Performance evaluation of protein-organism and protein-residue entity association detection. A cross-validation corpus (XC) was derived from the UniProtKB citations by first retrieving
abstract texts from MEDLINE, searching for tri-occurrences of the named entities residue, protein, and organism, and then retaining only those entries for which the identifier combination UTP (Uniprot identifier
+ NCBI Taxonomy identifier + PubMed identifier) was found in UniProtKB. The result is the test set
XC1 for the protein-organism association study. XC2 is a subset of XC1 obtained by scanning for documents where
the identifier combination URP (Uniprot identifier + Residue identifier + PubMed
identifier) was validated by UniProtKB. Performance was measured in terms of precision, recall, and F1
measure.
importantly, the detected associations are in accordance with manually identified semantic
relations between the three biological entities. With a precision of 0.82 the developed
method is able to identify precisely protein residues in biomedical texts.
5.4.3
Cross-validation of identified residues with UniProtKB
In the previous section the system for the association of the entity triplet organism,
protein, and residue, was evaluated manually on the gold standard corpus. The objective
in this section is to perform an analysis on a larger test set by cross-validation with
UniProtKB. For this task, the cross-validation corpora XC1 and XC2 were used. The
analysis consists of a two-step association study, i.e. the association of protein-organism
and residue-protein were evaluated individually. Table 5.6 summarises the results.
With a precision of 0.82 and a recall of 0.88, the result for organism-protein association
indicates that the system is able to extract correct semantic relations from XC1. The second step of the evaluation determines the performance of the residue-protein association
detection. A similar precision score of 0.86 was determined, while the recall (0.92) was
triplet association/UTRP
Resource  Available  Extracted  Common  Precision  Recall  F1
GC        38         61         29      0.48       0.76    0.59
XC2       58         61         52      0.84       0.90    0.87
Table 5.7: A specialised performance evaluation between GC and XC2. The test set consists of the 26
common documents between GC and XC2. A comparison of the annotated entity triplet associations
from both resources shows that the list of targets are different.
almost twice as high as the triple entity association determined with GC (cf. table 5.5).
This can be explained by the differences of the used annotation methods for both test
corpora. The entities and their associations in GC were determined manually and did not
consider a grounding step.
To better compare the performance between the GC and XC2 data the common set of
26 abstract texts from both corpora were studied (cf. section 5.2). By reusing the URP
information from the cross-validation corpus the determined performance is similar to the
one evaluated on the whole XC2 dataset (compare table 5.7 with table 5.6). However,
this result is different from the evaluation based on manual annotation. A
detailed analysis shows that manual annotation determined 38 entity triplets, whereas
XC2 lists 58 associations and only 25 of these are common among both data sets (data
not shown). This indicates that the annotated targets in GC and XC2 are different and
cannot be compared directly.
The results indicate that the developed method is able to detect correct associations
of residue, protein, and organism.
5.4.4
Identified residues in MEDLINE for Uniprot/PDB proteins
The developed text mining system annotates an identified protein residue in a text passage
with references to its source protein and its hosting organism. Therefore, each MEDLINE
Figure 5.3: Identified protein residues in MEDLINE. From a MEDLINE extraction, a subset of 2,884
Uniprot proteins were identified, with cross-references to 14,007 PDB entries, and a corresponding set of
18,427 MEDLINE records. In comparison, the citation set of the corresponding entries in UniProtKB
has only 4,652 PMIDs. Only 657 out of 18,427 PMIDs are cross-validated by UniProtKB data. Dashed
line = MEDLINE based extraction; solid line = database values.
record with an identified protein residue can be used to update the citation set of a
correspondent protein entry in UniProtKB, or any other hyperlinked database, e.g. PDB
(UniProtKB/PDB). In this study, the whole MEDLINE was scanned with the developed
protein residue identification method, and the determined set of PMIDs compared with the
citation sets in UniProtKB/PDB (cf. figure 5.3; for an overview of databanks hyperlinks
and citation references cf. section 2.1).
The protein residue identification system found a total of 40,750 MEDLINE records
where residues were associated with co-mentioned proteins. The unique count of Uniprot
proteins within the entity triplet associations is 9,354, where 2,884 out of 9,364 proteins
have hyperlinks to 14,007 PDB entries. Corresponding to these 2,884 Uniprot proteins
is the set of 18,427 out of 40,750 PMIDs. In comparison, UniProtKB indexes for these
2,884 Uniprot entries a set of 4,652 PMIDs. A set analysis determined that both datasets
are common in 657 PMIDs. This means that only 3.6 per cent of the identified PMIDs
can be cross-validated with UniProtKB (cf. figure 5.4).
The low number of rediscovery can be explained, in that most of the annotations
in UniProtKB are done from sections only available in full-text articles. Although the
analysis was based on MEDLINE, the extraction was already able to find a large number
of relevant abstract texts for citation expansion. With a precision of 0.82 (determined
by gold standard evaluation), the estimated number of true positives in the PMID set is
15,110. In context of the 4,652 citations from the database for the 2,884 Uniprot proteins,
and the consideration of the 657 re-discovered abstract texts, the result of MEDLINE
analysis expands the citation set by 3 fold.
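The figures in this paragraph follow from simple arithmetic on the reported counts, as the short sketch below shows. The reading of the "3 fold" expansion as the number of new, presumably correct PMIDs per existing citation is an assumption of this example; the variable names are chosen for illustration.

```python
identified_pmids = 18_427  # PMIDs found for the 2,884 Uniprot proteins
database_pmids = 4_652     # PMIDs cited by UniProtKB for the same proteins
shared_pmids = 657         # PMIDs present in both sets
precision = 0.82           # taken from the gold standard evaluation

# Share of identified PMIDs that UniProtKB can cross-validate.
overlap_percent = 100 * shared_pmids / identified_pmids

# Expected number of correct PMIDs if the gold standard precision carries over.
estimated_true_positives = round(precision * identified_pmids)

# Assumed reading of "3 fold": new, presumably correct PMIDs per existing citation.
expansion_factor = (estimated_true_positives - shared_pmids) / database_pmids
```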
In conclusion, the presented text mining system can be used to determine relevant
literature data for the update of the citation sets in UniProtKB/PDB.
The extracted abstract texts for those proteins provide the basis for functional annotation extraction.
5.5
Discussion
The presented text mining method identifies protein residues in biomedical texts. The
first step is the recognition of the entities residue, protein, and organism in texts. The
language expressions of all three biological entities are quite different. A residue entity,
for example, is generally mentioned in the text by its three-letter abbreviation form +
protein sequence position. The regular expression patterns were designed specifically for
these and other derived expressions, which explains the high precision and recall of the
residue entity recognition system. However, a residue can also be expressed by its one-letter abbreviation or syntactical form. While the latter expression is considered and
implemented in this thesis, it was suggested that these expressions resemble only a small
Figure 5.4: Cross-validation of citations from identified protein residues with UniProtKB/PDB. For a
subset of UniProtKB/PDB proteins (i.e. proteins with UID and PDBID) the determined PMIDs can be
cross-validated with the relevant citation set from UniProtKB. Dashed line = the number of common
PMIDs; uni = UniProtKB/PDB based citations; med = protein residue identification based citations;
comm = common set of citations between uni and med.
fraction [LHC07] in biomedical texts. The implementation of one-letter abbreviation
would increase the recall, but the method would become less precise. For example the
matched string ”C4” could be a nucleotide, a gene, an atom in a chemical compound, or
any other acronym.
The identification of protein terminologies in text is a great challenge in the biomedical
text mining community. This is based on the fact that protein names are not standardised,
and the usage of many alternative names are common, e.g. abbreviations, pet names,
or synonymous names. In addition, there is no guideline in the construction of names,
therefore a name can be short or long in respect of word counts, e.g. ”MAP kinase kinase”
and ”MAP kinase kinase kinase”. The developed protein entity recognition system is
based on a lookup of names and synonyms in a dictionary. Because the entries are finite,
syntactical variants of protein names cannot be detected, if they are not covered by the
dictionary. This explains the low recall of this ER system. In contrast, sub-matching of
a whole protein name or the tagging of ambiguous protein names reduces the precision
of the method. For example, ”SNF” could be a protein in yeast or the funding agency
”Swiss National Science Foundation”.
The principal method for organism entity recognition is the same as protein name
identification in this investigation. A list of terms from NCBI taxonomy was utilised to
generate an organism name dictionary. Although the developed method is the same as
protein entity recognition, the system yielded a higher performance. One explanation is,
that the dictionary contains predominantly unambiguous terminologies. However, some
ambiguous terms can also be found, e.g. ”RAT” could be a protein, an organism, or a
method. To my knowledge, a dedicated research in organism entity recognition has not
been published nor is a gold standard for performance evaluation available.
Based on the finding of residue, protein, organism entities in a text, the developed system identifies semantic relations between these biological entities. The approach is based
on the idea of reusing explicitly stated relations contained in UniProtKB. The correct
association between protein and residue relies on several factors: the ER performance,
the correct protein sequence retrieval, which is dependent on the correct organism-protein
association, and the correct alignment of a residue with a protein sequence at the specified
position. On one hand, a low recall in residue-protein association can be explained by
a missing protein sequence variant in the repository. On the other hand, an incorrect
protein-organism association leads to the retrieval of a wrong protein sequence. Another
consideration is, that the protein sequence in the database could deviate from the author’s data, because either side may have used different indexing rules. Conversely, the
true positive rate can also be blurred by the same reason that a non corresponding residue
sequence index results in a by chance matching with a protein sequence. One solution to
this specific problem is to consider all residues of the same protein in the sequence alignment. However, this method may only be applicable for full-text analysis, as abstract
texts rarely mention multiple residues of the same protein.
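The positional check and the idea of using all residues of one protein can be sketched as follows. This is a minimal illustration and not the implementation used in this work: the function names, the one-letter residue encoding, and the search window `max_shift` are assumptions made for the example.

```python
from collections import Counter


def matches(sequence: str, residue: str, position: int, offset: int = 0) -> bool:
    """True if `residue` (one-letter code) sits at the 1-based `position`, shifted by `offset`."""
    index = position + offset - 1
    return 0 <= index < len(sequence) and sequence[index] == residue


def best_offset(sequence, mentions, max_shift=5):
    """Return the index shift under which most residue mentions agree with the sequence.

    `mentions` is a list of (residue, position) pairs reported for one protein.
    Ties are resolved in favour of the smallest absolute shift.
    """
    support = Counter()
    for shift in range(-max_shift, max_shift + 1):
        support[shift] = sum(matches(sequence, r, p, shift) for r, p in mentions)
    return max(support, key=lambda s: (support[s], -abs(s)))
```

With a single mention several shifts can match by chance, whereas two or more mentions of the same protein usually agree on one shift only, which is why the approach is better suited to full-text articles.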
The evaluation of the entity recognition and the association detection systems was
done by a manual analysis on the gold standard corpus, and by an automatic cross-validation
study, for the following reasons. Protein annotations in UniProtKB are
primarily derived from manual information extraction from full-text articles. Although a
considerable amount of this information may not be present in MEDLINE, the combination
of X+PMID, where X is either UID or TID, can be used to estimate the information
extraction performance. However, the false positive rate in this cross-validation study
cannot be determined, because the knowledge base is incomplete, even for the indexed
citations. Therefore, manual evaluation on a gold standard test set
has the advantage that both the false positive and the false negative rate can be studied.
An identified protein residue is annotated with references to its source protein (UniProt
identifier) and the hosting organism (NCBI Taxonomy identifier). Based on these annotations,
a link can be made between MEDLINE and biological knowledge bases. One
immediate application is to scan MEDLINE for protein residues and use the UniProt
identifier annotations in combination with the MEDLINE identifier (or PubMed identifier;
PMID) to update the citation sets of the corresponding UniProt entries. The significance
of this approach was studied by automatic cross-validation analysis. Although the results
indicate that only a small proportion of UniProt proteins can be found and associated with
residues from MEDLINE analysis, the identified set of PMIDs has only a small overlap
with the corresponding citation sets. One explanation is that annotations were extracted
from full-text articles, where the same information is not present in the abstract texts;
they represent the true negative fraction in the sense that the information cannot be identified
from abstract sections. Another explanation is based on the fact that curators provide
only a list of relevant citations from a batch of processed biomedical articles. In other
words, the information on irrelevant citations (false positives), or the complete list of true
positive citations from the sample of reviewed biomedical articles, is not available in
UniProtKB; this information would have allowed a more precise evaluation.
5.6
Conclusion
The developed text mining solution identifies protein residues in text and annotates them
with references to UniProtKB and NCBI Taxonomy. Based on these references, a link
between MEDLINE and UniProtKB is created. Although the identification of protein
residues in MEDLINE does not necessarily mean that functional annotations are present
in abstract texts, the analysis is a prerequisite for the mining of functional annotation.
The extraction of contextual features as annotations of a protein residue is the topic of the
following chapter.
Chapter 6
Information extraction from the
context of a residue in text
In the previous chapter, I have introduced a method for the identification of protein
residues in biomedical texts. The objective, in this chapter, is to extract textual features
from the context of protein residues that can be used as functional annotation. Because a
terminological resource is not utilised, the developed method can discover new information
from text. The extracted contextual features are then enriched with semantic labels
according to a categorisation scheme. The design of this scheme was data-driven, and
contains concepts of biological interest. The overall result of this text mining solution
is the annotation of protein residues with text segments that are classified by a set of
biological categories.
6.1
Algorithms
The developed information extraction system can be divided into two parts: extraction
of contextual features associated with protein residues, and classification of the extracted
textual features. Figure 6.1 illustrates the procedures involved in the developed information extraction system.
Figure 6.1: Overview of processes and evaluation methods of the developed contextual feature extraction
system.
6.1.1
Extraction of contextual features
Theory
Finding functional annotations of protein residues in biomedical text. In this
study, several assumptions have been made for the extraction of functional annotations
from biomedical texts, which are explained in the following. The first assumption is that
noun phrases in a text are semantically rich in the sense that they are able to represent
a subject content (keyword) [JK95]. Consequently, they are good candidates of textual
features for the functional annotation of protein residues.
The second assumption is that a biological function of a protein residue can be found
as a verbal or nominal expression in natural language. In other words, a syntactical relation
between a residue and a term can capture their semantic relation. Therefore, a syntactical
analysis of a sentence enables the identification of an explicitly stated biological function.
For example, from the phrase
”A inhibits B by phosphorylation of C”,
the relations
A—inhibits—B-by-phosphorylation-of-C
A—inhibits—B-by-phosphorylation
A—inhibits—B
UNK—phosphorylate—C,
can be identified. Although the identification of a residue-keyword association can be
attempted with co-occurrence analysis, the target is to extract reliable associations with
contextual information on their association. In other words, the type of association
expressed by a verb or by a preposition, and the context expressed by a prepositional phrase,
are important pieces of information that represent a justifiable functional annotation. A
discussion on semantic relation and syntactical relation extraction can be found in
section 2.3.2.
Generally, to identify descriptions of biological function in text, the terminologies from
GO can be reused. However, this ontology is not specialised on protein residues;
for example, the term ”active site” does not even appear as a stand-alone term in the
repository. Generally, descriptions of protein function refer to a higher level of biological
function, e.g. metabolomics or cell signalling. In contrast, the annotation of protein
residues requires a different set of terminologies that describe molecular interactions or
chemical reactions.
Because a suitable terminological resource is not available, the extraction of syntactical
relations focuses on semantic relations with two elements: a residue entity and a contextual
feature (keyword). The following demonstrates how a description of function can
be identified from a parsed sentence. Given the example sentence from MEDLINE
”Parathyroid hormone inhibits renal phosphate transport by phosphorylation of serine 77 of sodium-hydrogen exchanger regulatory factor-1.”
(PMID:17975671),
a syntactical analysis produces the following phrase structure representation
[Parathyroid hormone]/NP
[inhibits]/V
[renal phosphate transport]/NP
[by]/P
[phosphorylation]/NP
[of]/P
[serine 77]/NP
[of]/P
[sodium-hydrogen exchanger regulatory factor-1]/NP,
where NP is a noun phrase, P a preposition, and V a verb. From this parsed sentence,
the following semantic relations can be determined:
Parathyroid hormone—inhibits—renal phosphate transport-by-phosphorylation-of-serine 77
Parathyroid hormone—inhibits—renal phosphate transport-by-phosphorylation
Parathyroid hormone—inhibits—renal phosphate transport
UNK—phosphorylate—serine 77.
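The enumeration of nested verbal relations shown above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the extraction code of this work: the function name `verbal_relations` and the flat `(text, tag)` input format are hypothetical, the input is assumed to have the form NP V NP (P NP)*, and the nominalisation relation (UNK—phosphorylate—...) is not derived here.

```python
def verbal_relations(chunks):
    """Enumerate subject-verb-object relations from a flat chunk sequence.

    `chunks` is a list of (text, tag) pairs with tags NP, V and P, in the form
    NP V NP (P NP)*. One relation is produced per number of attached
    prepositional phrases, from the bare object to the fully specified one.
    """
    tags = [tag for _, tag in chunks]
    verb_index = tags.index("V")
    subject = chunks[verb_index - 1][0]
    verb = chunks[verb_index][0]
    obj = chunks[verb_index + 1][0]
    tail = chunks[verb_index + 2:]
    # Pair each preposition with the noun phrase that follows it.
    attachments = [f"{prep[0]}-{noun[0]}" for prep, noun in zip(tail[0::2], tail[1::2])]
    relations = []
    for k in range(len(attachments) + 1):
        argument = "-".join([obj] + attachments[:k])
        relations.append((subject, verb, argument))
    return relations
```

Because every prefix of the attachment chain is emitted, the sketch also returns the relation with all trailing prepositional phrases attached, so the caller decides where to cut the chain.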
In the next section, a template for storing the extracted relation information is discussed.
Semantic representation of extracted relations. The objective of syntactical relation
extraction is to identify biological relations in a sentence, i.e. a semantic relation
between a residue entity and a term. Because the result is a set of syntactical relations
with different contextual specification (cf. example in previous section), a suitable
data collation method is necessary to avoid data redundancy. That is, the set of relations
determined within a given syntactic frame contains relations that are specifications
of other relations. For example, the relation
A—inhibits—B-by-phosphorylation,
is a specification of the relation
A—inhibits—B.
Here, the predicate-argument structure (PAS) is proposed as a semantic representation of extracted syntactical relations. A PAS is a template for information extraction,
where the predicate and the arguments represent the slots to be filled. In this study, the
predicate (pred) of a PAS is defined as the verb, while the arguments of the verb are
the numerically labelled arguments arg1 and arg2, or even higher numerically labelled
arguments. The arg1 label is assigned to arguments, which are understood as agents,
causers, or experiencers, i.e. the semantic subject. Conversely, the arg2 label is usually
assigned to the patient argument, i.e. the argument which undergoes the change of state
or is being affected by the action.
The transformation of the extracted relations into PAS data does not consider the
analysis of the semantic role of the verb arguments, i.e. argument modifiers, such as
location, time, cause, etc. Noun phrases of the extracted relations can have prepositional
attachments, and the prepositions are often indicators of thematic roles of the verb
arguments. Therefore, prepositional phrases are listed as modifiers of arguments with the
following label notation: main argument label + preposition, e.g. arg1-of and arg2-by.
The following illustrates the transformation of relations into a PAS for the previous
example:
pred = inhibit
arg1 = Parathyroid hormone
arg2 = renal phosphate transport
arg2-by = phosphorylation
arg2-of = serine 77,
which corresponds to the following verb frame set:
inhibit sub-arg1 obj-arg2 P by-arg2 P of-arg2.
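The PAS template can be pictured as a small record type. The sketch below is an assumption-laden illustration rather than the data format used in this work: the class `PAS`, the helper `to_pas`, and the dictionary of modifiers are hypothetical names, and all prepositional attachments are assumed to modify arg2, as in the example above.

```python
from dataclasses import dataclass, field


@dataclass
class PAS:
    pred: str
    arg1: str
    arg2: str
    # Prepositional modifiers keyed by label, e.g. {"arg2-by": "phosphorylation"}.
    modifiers: dict = field(default_factory=dict)

    def frame(self) -> str:
        """Render the verb frame set in the notation used in the text."""
        slots = [self.pred, "sub-arg1", "obj-arg2"]
        for label in self.modifiers:
            arg, prep = label.split("-", 1)
            slots.append(f"P {prep}-{arg}")
        return " ".join(slots)


def to_pas(subject, verb_lemma, obj, attachments):
    """Fill a PAS; `attachments` is a list of (preposition, noun phrase) pairs."""
    modifiers = {f"arg2-{prep}": noun for prep, noun in attachments}
    return PAS(pred=verb_lemma, arg1=subject, arg2=obj, modifiers=modifiers)
```

A dictionary keeps one value per label, so a second attachment with the same preposition would overwrite the first; a list of pairs would be needed to keep both.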
Notice that the defined PAS does not conform to the PAS schemes of some propositional
banks, e.g. PropBank or PASBio. For example, for the verb ”inhibit” PropBank lists the
following frame set:
inhibit sub-ARG0 obj-ARG1
inhibit sub-ARG0 S-ARG1,
while additional arguments are not defined (notice that the definition of ARG0 in PropBank
is equivalent to arg1 in this definition, and ARG1 corresponds to arg2). Although
verb frame sets from publicly available propositional banks could be considered in this study,
the set of listed verbs has a low coverage of the set of verbs co-occurring with residue
mentions in MEDLINE. The low coverage and the non-domain specific verb frame sets
are the main reasons why these resources were not reused.
Implementation
The extraction of contextual features is based on a syntactical analysis of natural language
sentences. Two approaches were developed in this work and compared in the performance
evaluation study: shallow parser based relation extraction, and full parser based relation
extraction.
Shallow parser based relation extraction. The first approach was to develop a
shallow parser, which aims to find the boundaries of major constituents in a sentence,
such as noun phrases. The design is based on heuristics and the idea of finding general
relations between closed-class English words [LCM03]. The reported parser finds verbal
relations between noun phrases, and prepositional relations of a set of the most frequent
prepositions, i.e. ”of”, ”in”, and ”by”. Here, the parser is implemented as a general
relation extraction method, where the list of prepositions is not limited to the three
mentioned ones. The purpose is to find more contextual features, and thereby discover
more information.
Initially, an abstract text was split into sentences, and then annotated with
part-of-speech (POS) tags using the CISTAGGER. The tagger was trained on the CISLEX
lexical resource that contains a rich terminological set of the biomedical domain [Gue96].
Based on a rule set and the POS information, the developed shallow parser identified noun
phrases, verb groups, verb phrases, and prepositional phrases for analysed sentences:
NP = Det? (Adj|Adv|N)* N
PP = P NP
VG = (Adv|Aux|V|InfTo)* V
VP = VG NP PP*.
N is a noun, Det a determiner, Adj an adjective, Adv an adverb, Aux an auxiliary verb,
InfTo the infinitive marker ”to”, P a preposition, PP a prepositional phrase, VP a verb
phrase, and VG a verb group. Notice that the grammar
does not consider coordinating conjunctions, e.g. with ”and”, ”or” and ”,”. The grammar
can easily be extended to capture conjunctions by
NPx = NP (CC NP)*,
where
CC = (”and” | ”or” | ”,”){1,2}.
However, the pattern would then also find false positives as illustrated in the following
example. The sentence
”Highly conserved phosphopantothenate binding residues include Asn59,
Ala179, Ala180, and Asp183 from one monomer and Arg55’ from the
adjacent monomer.” (PMID:12906824),
contains the noun phrases
NP1 = ”Asn59, Ala179, Ala180, and Asp183 from one monomer”
NP2 = ”Arg55’ from the adjacent monomer”.
The extended pattern would have extracted a single noun phrase, from which the correct
post-nominal prepositional phrase attachment cannot easily be identified:
NPx =
”Asn59, Ala179, Ala180, and Asp183 from one monomer and
Arg55’ from the adjacent monomer”.
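The base grammar (without the NPx extension) can be approximated by regular expressions over a string of POS codes. The following is only a sketch under assumptions: it does not reproduce the CISTAGGER-based parser, the one-letter codes and the function name `chunk` are invented for the example, and VP = VG NP PP* is left to a later step that combines the chunks.

```python
import re

# One-letter codes for the POS classes used by the grammar (assumed tag names).
CODES = {"Det": "D", "Adj": "J", "Adv": "R", "N": "N", "P": "P",
         "V": "V", "Aux": "X", "InfTo": "T"}

NP = r"D?[JRN]*N"                      # NP = Det? (Adj|Adv|N)* N
CHUNKS = re.compile(
    rf"(?P<PP>P{NP})"                  # PP = P NP
    rf"|(?P<NP>{NP})"
    rf"|(?P<VG>[RXVT]*V)"              # VG = (Adv|Aux|V|InfTo)* V
)


def chunk(tagged):
    """Group a list of (word, tag) pairs into labelled chunks; unknown tags are skipped."""
    codes = "".join(CODES.get(tag, "O") for _, tag in tagged)
    result = []
    for m in CHUNKS.finditer(codes):
        words = [word for word, _ in tagged[m.start():m.end()]]
        result.append((m.lastgroup, " ".join(words)))
    return result
```

Encoding each token as exactly one character keeps match offsets aligned with token positions, so the matched span can be used directly to slice the word list.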
Based on the determined phrase structure, the parser then extracts verbal relations of
noun phrases or prepositional phrases. A condition of the extraction is that at least one
relation element must contain one or more residue mentions:
REL = NP PP* VP.
The extracted relation is then transformed to fill the slots of the predefined PAS template.
Full parser based relation extraction. The second approach in contextual feature
extraction utilises the full parser ENJU [MT05] (version 2.3), which generates a so-called
head-driven parse tree from a sentence. The advantage of this parser is that it uses a parsing
model adapted to biomedical text. This parser generates predicate-argument
relations between words. Because the generated output contains a lot of information,
different interpretations are possible. In this study, a wrapper was developed that converts
the parser’s output into the presented PAS data format. The assumption is that by
following the direct links of a verb to its arguments in the tree, and then collecting all the
sub-branches of each argument, the phrase structure of a verb argument can be found.
The identified NP PP* VP structures are then decomposed to fill the PAS template.
6.1.2
Categorisation of contextual features
Theory
A PAS captures a verb frame within a text sentence, where the arguments may represent a
subject content. In order to evaluate the relevance of these arguments, a semantic
interpretation is needed. Here, a classification method was developed that automatically
assigns semantic labels to the arguments of a PAS. For this task, the categories have to be defined
as suitable labels for information interpretation. Although an ontological model of protein
residue function is not available, there are two approaches to this problem. The first is
to adopt annotation schemes from various protein databases, e.g. the UniProtKB. This
represents a top-down approach. One motivation for reusing the categorisation scheme of
UniProtKB is that information classified with this scheme can be directly used to update
the relevant fields in the database.
Alternatively, a bottom-up approach can propose new categories. In this study, text
segments from MEDLINE were analysed to determine whether they represent suitable functional
annotations for residues. The result is an overview of the information distribution in
MEDLINE, which has led to the proposal of a categorisation scheme. The defined categories
of both schemes are compared in table 6.1. Both categorisation schemes reflect concepts
of biological interest. However, the bottom-up approach has the advantage that the proposed
categories are data-driven, while in a top-down approach examples of listed categories may
not be present in natural language text, or other categories may be missing in the scheme.
The assignment of categories to contextual features is based on the endogenous
classification approach [Cer00]. In contrast, the exogenous, i.e. corpus-based, approach requires
large amounts of contextual cues, which are difficult to obtain. According to the author,
the endogenous approach produces results more reliably, even under conditions of
sparse data.
From a reference set of terms with manually assigned labels according to a categorisation scheme, the algorithm computes the mutual information of the lexical constituents of
terms and their assigned categories. These scores are then used to calculate and select the
highest scoring association of a term and a category. The algorithm was re-implemented
and used in this study.
Implementation
The semantic interpretation of contextual features, which are the arguments of the extracted PAS, relies on the endogenous classification approach described by [Cer00]. The
method was re-implemented in this study. The algorithm relies only on the mutual information of the lexical constituents of terms and their assigned categories.
During the training phase, lexical constituents of multi-word terms were extracted
from a labelled reference set. They represent the features of the predefined categories.
MAN category | MAN definition | FEAT category | FEAT definition
STR COMP | Structure component. Class denoting concepts that represent pieces and parts of the protein structure. | DOMAIN | Extent of a domain, which is defined as a specific combination of secondary structures organised into a characteristic three-dimensional structure or fold.
 | | MOTIF | Short (up to 20 amino acids) sequence motif of biological interest.
 | | TOPO DOM | Topological domain.
 | | CHAIN | Extent of a polypeptide chain in the mature protein.
 | | TRANSMEM | Extent of a transmembrane region.
 | | COILED | Extent of a coiled-coil region.
CHEM MOD | Chemical modification. Class denoting changes to the protein sequence and the chemical composition. | VARIANT | Authors report that sequence variants exist.
 | | MOD RES | Posttranslational modification of a residue.
 | | PEPTIDE | Extent of a released active peptide.
 | | VAR SEQ | Description of sequence variants produced by alternative splicing, alternative promoter usage, alternative initiation and ribosomal frameshifting.
 | | LIPID | Covalent binding of a lipid moiety.
 | | CARBOHYD | Glycosylation site.
STR MOD | Structural modification. Class denoting the changes to the protein structure without changes to the chemical composition. | REGION | Extent of a region of interest in the sequence.
 | | SITE | Any interesting single amino-acid site on the sequence, that is not defined by another feature key.
BINDING | Binding type. Class denoting different physico-chemical forces leading to a bond formation between a protein structure component and a chemical entity. | BINDING | Binding site for any chemical group (co-enzyme, prosthetic group, etc.).
 | | METAL | Binding site for a metal ion.
 | | DISULFID | Disulfide bond.
 | | CROSSLNK | Posttranslationally formed amino acid bonds.
 | | DNA BIND | Extent of a DNA-binding region.
 | | NP BIND | Extent of a nucleotide phosphate-binding region.
 | | ZN FING | Extent of a zinc finger region.
 | | CA BIND | Extent of a calcium-binding region.
ENZ ACT | Enzymatic activity. Types of enzymatic reactions as a subpart to protein functions. | ACT SITE | Amino acid(s) involved in the activity of an enzyme.
CELL | Cellular phenotype. Class denoting different cellular phenotypes that can be affected by structural or compositional changes of a protein. | N/A |

Table 6.1: Biological categories for the classification of protein residue related information. Two sets
of schemes were used: a text data motivated definition of categories (MAN) determined from manual
analysis of sentences with annotations for protein residues from MEDLINE, and key categories from the
feature table of UniProtKB (FEAT).
The association between a feature (w) and a category (c) was estimated based on
their mutual information score
\[ I(w, c) = \log_2 \frac{P(w, c)}{P(w)\,P(c)}. \tag{6.1} \]
The association between the multi-word term $T = \{w_i\}_{i=1}^{n}$ and a category $c$ was
computed by the sum of the associations of its words
\[ A(T, c) = P^{*}(c) \sum_{i=1}^{n} I(w_i, c), \tag{6.2} \]
where $P^{*}(c)$ is the probability of a category associated with a term. The categorization
of a multi-word term into one of the categories amounts to the identification of the best
fitting category $c^{*}$ for a term, based on the words in a term
\[ c^{*} = \arg\max_{c} A(T, c). \tag{6.3} \]
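Equations 6.1 to 6.3 can be turned into a compact sketch. It is not the re-implementation used in this study: the probability estimates are assumptions (relative frequencies over word tokens for P(w, c), P(w) and P(c), and the fraction of reference terms per category for P*(c)), unseen word/category pairs are assumed to contribute zero, and all function names are hypothetical.

```python
import math
from collections import Counter


def train(reference):
    """Count word/category statistics from (term, category) pairs."""
    pair, word, cat_tokens, cat_terms = Counter(), Counter(), Counter(), Counter()
    for term, cat in reference:
        cat_terms[cat] += 1
        for w in term.lower().split():
            pair[w, cat] += 1
            word[w] += 1
            cat_tokens[cat] += 1
    return pair, word, cat_tokens, cat_terms


def mutual_information(w, c, model):
    """Equation 6.1; unseen (w, c) pairs contribute 0 (an assumption of this sketch)."""
    pair, word, cat_tokens, _ = model
    if pair[w, c] == 0:
        return 0.0
    n = sum(word.values())
    return math.log2((pair[w, c] / n) / ((word[w] / n) * (cat_tokens[c] / n)))


def classify(term, model):
    """Equations 6.2 and 6.3: return the category with the highest association score."""
    cat_terms = model[3]
    total = sum(cat_terms.values())

    def association(c):
        return (cat_terms[c] / total) * sum(
            mutual_information(w, c, model) for w in term.lower().split())

    return max(cat_terms, key=association)
```

If no word of a term has been seen in training, all scores are zero and the first category encountered in training is returned; a real system would need an explicit fallback for that case.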
The reference set was generated by using maximal length noun phrase (MLNP) analysis.
The assumption of this approach is that textual features co-occurring with a residue
within a noun phrase (NPr) are good candidates of terms for functional annotation. In
order to identify the boundaries of these candidate terms, the MLNP algorithm relies on
the lookup of a determined set of noun phrases without nested residue entities (NP¬r). In
other words, the algorithm assumes that nested terms in NPr are also expressed as
stand-alone noun phrases, which can be identified by a broad syntactical analysis on MEDLINE.
The following is an example for illustration. Consider the term
”complex formation”,
which is identified as a stand-alone noun phrase NP¬r in the sentence
”The GlyNH2 was removed and the reactive-site peptide bond X18Glu19 was synthesized by complex formation with proteinase K.”
(PMID:9047374).
The same term co-occurs with a residue entity within another noun phrase (NPr)
”Rb-E2F-DNA complex formation”
in the sentence
”MDM2 also interacts with Rb through its central acidic domain and inhibits Rb function in part by blocking Rb-E2F-DNA complex formation.”
(PMID:16337594).
The determined MLNP in this example is ”complex formation”.
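The lookup step of the MLNP analysis can be sketched as a search for the longest nested word sequence that also occurs as a stand-alone noun phrase. This is a simplified illustration, not the algorithm as implemented in this work: the function name is hypothetical, tokens are split on whitespace, and the set of stand-alone noun phrases is assumed to be lower-cased.

```python
def maximal_length_np(np_with_residue, standalone_nps):
    """Return the longest contiguous word sequence of the phrase that also
    occurs in `standalone_nps` (a set of lower-cased noun phrases), or None."""
    tokens = np_with_residue.split()
    for length in range(len(tokens), 0, -1):           # longest candidates first
        for start in range(len(tokens) - length + 1):
            candidate = " ".join(tokens[start:start + length])
            if candidate.lower() in standalone_nps:
                return candidate
    return None
```

Trying the longest candidates first means that ”complex formation” is preferred over its shorter parts whenever all of them occur as stand-alone noun phrases.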
Once the set of MLNPs was extracted, each item (NP) was manually labelled, based
on a categorisation scheme. Within this study, two categorisation schemes (cf. table 6.1)
were used independently and studied: the categories defined by manual analysis on
MEDLINE sentences (bottom-up approach), and the categories defined as keys in the feature
table from UniProtKB (top-down approach). The sets of categories from the bottom-up
approach and from the top-down approach are referred to as MAN and FEAT in this study.
Table 6.2 compares the distribution of labels within the reference set.
MAN category | Frequency | FEAT category | Frequency
STR COMP | 433 | DOMAIN | 28
 | | MOTIF | 8
 | | TOPO DOM | 4
 | | CHAIN | 2
 | | TRANSMEM | 2
 | | COIL | 1
CHEM MOD | 361 | VARIANT | 275
 | | MOD RES | 59
 | | PEPTIDE | 13
 | | VAR SEQ | 6
 | | LIPID | 3
 | | CARBOHYD | 1
STR MOD | 25 | REGION | 100
 | | SITE | 246
BINDING | 195 | BINDING | 139
 | | METAL | 25
 | | DISULFID | 11
 | | CROSSLNK | 10
 | | DNA BIND | 6
 | | NP BIND | 5
 | | ZN FING | 2
 | | CA BIND | 1
ENZ ACT | 90 | ACT SITE | 110
CELL | 161 | N/A |
GEN BIOL | 2,172 | GEN BIOL | 2,372
GEN ENG | 643 | GEN ENG | 651

Table 6.2: Category distribution in the text feature reference set. The text feature reference set was
compiled from maximal length noun phrase analysis (MLNP) from two sets of noun phrases: one without
residue mentions and the other with identified protein residue entities. The features in the reference set
were manually assigned with labels of the categorisation scheme MAN and FEAT. GEN BIOL = general
biological terminologies; GEN ENG = general English words.

The following example illustrates how a determined MLNP can be used to find relevant
information from the contextual features of a protein residue. From the sentence
”Mutation K241Q completely abolishes DNA glycosylase activity and
covalent complex formation in the presence of NaBH4.” (PMID:9241232),
the following relation can be identified
mutation K241Q—abolish—covalent complex formation.
A semantic label can be assigned to the relation argument ”covalent complex formation”
because the term ”complex formation” is labelled in the reference set.
6.2
Evaluation methods
The extraction of contextual features of residues results in a set of syntactical relations,
which are represented as PAS. The performance of this extraction module was evaluated
by comparing the returned PAS data with manual annotations in the gold standard test
corpus (cf. section 5.2). A true positive was counted, if the syntactical relations in a PAS
were correct, and if the arguments in the PAS contained the annotated residue entity and
the marked keyword(s) in the test corpus. If any of these conditions were not met, then a
false positive was registered. The performance was measured in terms of precision, recall
and F1-measure, as described earlier in section 5.3.
The performance of the developed classification method was evaluated by 100 repetitions
of a 5-fold cross-validation. For each iteration, terms in the reference set were shuffled, and
partitioned into a test set (1/5 of the data) and a training set (4/5 of the data). The
average precision, recall and F1-measure (cf. section 5.3) were calculated for each classifier
from the determined confusion matrix.
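The evaluation protocol can be sketched as follows. This is a generic illustration under assumptions and not the evaluation script of this study: `train_fn` and `predict_fn` stand for any classifier, the fold assignment by striding is one of several possible partitioning schemes, and the scores are computed from confusion counts accumulated over all repetitions.

```python
import random
from collections import Counter


def repeated_cross_validation(items, train_fn, predict_fn, repeats=100, folds=5, seed=0):
    """Accumulate confusion counts over `repeats` shuffled `folds`-fold splits.

    `items` is a list of (term, category) pairs. The result is a Counter keyed
    by (actual, predicted).
    """
    rng = random.Random(seed)
    confusion = Counter()
    data = list(items)
    for _ in range(repeats):
        rng.shuffle(data)
        for k in range(folds):
            test = data[k::folds]                      # every folds-th item
            training = [x for i, x in enumerate(data) if i % folds != k]
            model = train_fn(training)
            for term, actual in test:
                confusion[actual, predict_fn(model, term)] += 1
    return confusion


def scores(confusion, category):
    """Precision, recall and F1 of one category from the confusion counts."""
    tp = confusion[category, category]
    predicted = sum(n for (_, p), n in confusion.items() if p == category)
    actual = sum(n for (a, _), n in confusion.items() if a == category)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Precision divides the diagonal count by the column total (everything predicted as the category), while recall divides it by the row total (everything that actually belongs to the category).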
Method | PAS available | PAS extracted | PAS common | Precision | Recall | F1
Shallow parsing | 117 | 82 | 56 | 0.68 | 0.48 | 0.56
Full parsing | 117 | 86 | 32 | 0.37 | 0.27 | 0.31

Table 6.3: Evaluation of syntactical language parser performance. The performance of the two language
parsers (shallow and full parsing) was evaluated on the basis of precision, recall and F1 measures by
comparing the annotated PAS data in the test set with the returned PAS output from the parsers.
6.3
Results
In this section, the performances of contextual feature extraction and categorisation are
studied. The test dataset is the gold standard corpus.
6.3.1
Contextual feature extraction evaluated
The objective in contextual feature extraction is to find textual features that are suitable
as functional annotations for protein residues.
In this section, the performance of this extraction system is studied by comparing
the results produced with two different language parsers: the shallow parser, and the full
parser. Sentences from the gold standard corpus (GC) were used as test dataset for this
analysis.
Within this study, the analysis determined that the developed shallow parser has a
better performance than the full parser ENJU. The shallow parser yielded an F1 measure
of 0.56 (precision of 0.68 and recall of 0.48), while the full parser ENJU has an F1 measure
of 0.31 (precision of 0.37 and recall of 0.27) (cf. table 6.3).
The results suggest that contextual information of a residue entity can be extracted
from a syntactical analysis with an F1 measure of 0.56 and 0.31 for shallow parsing and
full parsing, respectively.
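The reported figures can be retraced from the PAS counts in table 6.3. The reading used here is an inference from the reported values rather than something stated explicitly: precision is taken as common PAS divided by extracted PAS, recall as common PAS divided by available PAS, and F1 is computed from the rounded precision and recall.

```latex
\begin{align*}
\text{Shallow parsing:}\quad
  P &= \tfrac{56}{82} \approx 0.68, &
  R &= \tfrac{56}{117} \approx 0.48, &
  F_1 &= \frac{2PR}{P + R} = \frac{2 \cdot 0.68 \cdot 0.48}{0.68 + 0.48} \approx 0.56 \\
\text{Full parsing:}\quad
  P &= \tfrac{32}{86} \approx 0.37, &
  R &= \tfrac{32}{117} \approx 0.27, &
  F_1 &= \frac{2PR}{P + R} = \frac{2 \cdot 0.37 \cdot 0.27}{0.37 + 0.27} \approx 0.31
\end{align*}
```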
6.3.2
Performance analysis of the classifiers
One problem in functional annotation extraction is the semantic interpretation of the
extracted text data. The solution proposed in this work is based on a classification
approach. Two different categorisation schemes were tested in this study: MAN and
FEAT. The performance of the developed classification method was evaluated by repeated
cross-validation studies. Table 6.5 summarises the results from the determined confusion
matrix (cf. table 6.4).
For MAN, the top three performing classifiers with F1 measures of 0.62, 0.57, and 0.57
are STR COMP (precision of 0.56, recall of 0.69), CHEM MOD (precision of 0.54, recall
of 0.59) and BINDING (precision of 0.63, recall of 0.52). The whole classification system
for this categorisation scheme yielded an average precision
of 0.48 and an average recall of 0.42. In comparison, the classification based on FEAT has
a much lower average performance: average precision of 0.24, average recall of 0.18. The
weak performance of the FEAT classifiers is explained by the distribution of examples
in the categories; for some categories the number of corresponding features or examples
is low (cf. table 6.2). A discussion is presented in section 6.4.
Examining the false positive rate in the confusion matrix of MAN reveals that the
classifiers are confused with the category GEN BIOL (general biological terms) or GEN ENG
(general English terms). This is not surprising considering that English terms are
ambiguous. In addition, some categories show confusions with others, e.g. STR COMP with
CHEM MOD, and ENZ ACT with STR COMP. One explanation is that some terms
can be assigned to more than one category. For example, ”mutant structure” refers to
an altered protein structure state, which is based on a chemical change in the protein
sequence.
Despite the average performances of some classifiers, the presented method can be
used to assign categories to textual features. However, significant improvements on the
performances of some classifiers are necessary before the system can be used automatically.
Rows: actual category; columns: predicted category.

Actual \ Prediction | BINDING | GEN BIOL | CELL | CHEM MOD | GEN ENG | ENZ ACT | STR COMP | STR MOD
BINDING | 1,772 | 762 | 28 | 93 | 165 | 26 | 546 | 0
GEN BIOL | 560 | 15,815 | 525 | 1,496 | 4,514 | 159 | 1,714 | 65
CELL | 96 | 1,167 | 836 | 150 | 325 | 91 | 67 | 0
CHEM MOD | 38 | 1,103 | 12 | 3,742 | 761 | 79 | 546 | 25
GEN ENG | 144 | 2,556 | 126 | 510 | 1,820 | 46 | 480 | 35
ENZ ACT | 33 | 338 | 80 | 201 | 226 | 324 | 457 | 0
STR COMP | 160 | 783 | 64 | 551 | 592 | 35 | 4,914 | 11
STR MOD | 1 | 91 | 1 | 129 | 125 | 0 | 21 | 43

Table 6.4: Performance analysis of the classifiers (confusion matrix). Classification with categories
from MAN was analysed by cross-validation studies with 100 iterations. The result is represented as a
confusion matrix.
MAN category | Precision | Recall | F1 | FEAT category | Precision | Recall | F1
STR COMP | 0.56 | 0.69 | 0.62 | DOMAIN | 0.50 | 0.24 | 0.32
 | | | | MOTIF | 0.98 | 0.36 | 0.53
 | | | | TOPO DOM | 0 | 0 | 0
 | | | | CHAIN | 0 | 0 | 0
 | | | | TRANSMEM | 0 | 0 | 0
 | | | | COIL | 0 | 0 | 0
CHEM MOD | 0.54 | 0.59 | 0.57 | VARIANT | 0.50 | 0.69 | 0.58
 | | | | MOD RES | 0.40 | 0.23 | 0.29
 | | | | PEPTIDE | 0.05 | 0.06 | 0.05
 | | | | VAR SEQ | 0 | 0 | 0
 | | | | LIPID | 1 | 0.32 | 0.48
 | | | | CARBOHYD | 0 | 0 | 0
STR MOD | 0.24 | 0.10 | 0.15 | REGION | 0.44 | 0.44 | 0.44
 | | | | SITE | 0.40 | 0.55 | 0.46
BINDING | 0.63 | 0.52 | 0.57 | BINDING | 0.41 | 0.45 | 0.43
 | | | | METAL | 0.05 | 0.02 | 0.03
 | | | | DISULFID | 0.53 | 0.15 | 0.23
 | | | | CROSSLNK | 0 | 0 | 0
 | | | | DNA BIND | 0 | 0 | 0
 | | | | NP BIND | 0 | 0.06 | 0
 | | | | ZN FING | 0 | 0 | 0
 | | | | CA BIND | 0 | 0 | 0
ENZ ACT | 0.43 | 0.20 | 0.27 | ACT SITE | 0.45 | 0.31 | 0.36
CELL | 0.50 | 0.31 | 0.38 | N/A | | |
GEN BIOL | 0.70 | 0.64 | 0.67 | GEN BIOL | 0.76 | 0.65 | 0.70
GEN ENG | 0.21 | 0.32 | 0.26 | GEN ENG | 0.23 | 0.32 | 0.27
Average | 0.48 | 0.42 | 0.43 | Average | 0.25 | 0.18 | 0.19

Table 6.5: Performance evaluation of the classifiers (precision, recall, F1 measure). Evaluation of
classification of textual features (noun phrases). Classification with categories from MAN and FEAT was
analysed by cross-validation studies with 100 iterations. The performance was measured in terms of
precision, recall, and F1 measure.
One option is to increase the amount of training data, or the number of features for each
classifier. Another alternative is to modify the definition of classes. The results suggest
that the algorithm is, in general, suitable for classification.
6.4
Discussion
The presented text mining solution extracts textual features from the context of residue
entities. The identification of the contextual features, and the association with the residue
entity, is based on the syntactical analysis of the sentence. More specifically, only a subset
of semantic relations that are found in verbal and prepositional relations are extracted
from text. The advantage of this approach is that not only the semantic relation partners
and the semantic relation type are found, but also contextual information is extracted.
Within this study, two approaches in syntactical analysis were compared, i.e. shallow
parsing and full parsing; the result indicates that the ENJU parser had a weaker
performance than the developed shallow parser. Manual analysis of the false positives
indicates that incorrectly determined syntactical structures originate from
false part-of-speech tagging. For example, in the sentence
”Conversely, K382Q displays a highly altered responsiveness to the activator, suggesting that Lys(382) is involved in both activator binding and
allosteric transition mechanism.” (PMID:10751408),
both parsers identified ”altered” as a verb in past tense, although the correct POS is a
noun modifier. The performance of the POS tagger is critical for the detection of phrase
boundaries. However, the two parsers rely on different methods for POS tagging, and the
performance of the POS tagger has to be considered as well when comparing the shallow
and full parser. Table A.1 lists some examples, where a parser failed in extracting the
annotated PAS data from GC.
The extracted information is difficult to normalise, because there is no gold standard
for how to represent the association or how to qualify the contextual information. In
this work, the predicate-argument structure is used as a template for the extracted information. Although verb frame sets from PropBank or PASBio can be used to normalise
the extracted data, they are not designed to capture descriptions of protein residue function. On the other hand, the absence of normalisation gives the extraction method the advantage of being able to discover new
knowledge. Because the extracted information is not normalised, the performance can
only be measured in terms of sensitivity.
The evaluation of the classification method indicates that the presented approach can
provide an automatic solution for text interpretation. However, some of the categories
have only a few examples, which is reflected in the weak performance of their classifiers. One
solution to this problem is to balance the example sets of each category, for example,
by collecting more terminologies from MEDLINE. Alternatively, other categories may
be defined to balance the ratio between a category and the associated set of examples.
Yet another approach is not to classify the arguments of a PAS, but to cluster them based on,
for example, their contextual usage. The advantage here is that more similarities
among the PAS data can be found, because the analysis is not limited by how representative a
training (reference) set is.
Despite the fact that semantic labels can be assigned to the arguments in a PAS,
the developed method is not able to interpret the meaning of the whole extracted text
segment. For example, from the sentence
"Specific binding of the WT and mutant receptors Cys14Ala and
Cys199Ala was inhibited in the presence of the disulfide bond reducing agent, DTT, implying that disulfide bonds are formed and can be
reduced in these mutant receptors." (PMID:9202220),
the following information was extracted and semantic categories were assigned to the
arguments of the PAS:
pred = inhibited
arg1 = Specific binding
arg1-of = [the WT and mutant receptors CYS14 ALA and
CYS199 ALA]/CHEM MOD
arg2-in = the presence
arg2-of = the disulfide bond reducing agent.
Although one part of the information in the example has been correctly assigned with the
label CHEM MOD, the entire text phrase should be labelled with BINDING. A solution
to this problem is not trivial and requires several levels of linguistic analysis.
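The limitation can be made concrete by holding the extracted record in a small data structure. The following is a minimal Python sketch and not the thesis implementation: the class names `Argument` and `PAS`, their field names, and the helper `labelled_arguments` are hypothetical and chosen only for illustration. It shows that a semantic label is attached to an individual argument, so a category for the whole text segment (here BINDING) cannot be read off the record.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Argument:
    role: str                     # e.g. "arg1", "arg1-of", "arg2-in"
    text: str                     # surface text of the argument
    label: Optional[str] = None   # semantic category, if one was assigned


@dataclass
class PAS:
    pred: str                                  # the predicate (verb)
    args: list = field(default_factory=list)   # list of Argument objects

    def labelled_arguments(self):
        """Return only the arguments that carry a semantic label."""
        return [a for a in self.args if a.label is not None]


# The record from the example sentence (PMID:9202220).
example = PAS(
    pred="inhibited",
    args=[
        Argument("arg1", "Specific binding"),
        Argument("arg1-of",
                 "the WT and mutant receptors CYS14 ALA and CYS199 ALA",
                 label="CHEM MOD"),
        Argument("arg2-in", "the presence"),
        Argument("arg2-of", "the disulfide bond reducing agent"),
    ],
)

# Labels exist per argument only; nothing describes the segment as a whole.
labels = {a.label for a in example.labelled_arguments()}
```

In this representation a segment-level label would need its own field and its own classifier, which is the additional level of analysis the paragraph above refers to.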
6.5 Conclusion
In this chapter, I have presented the developed contextual feature extraction system for
the annotation of residue entities. Because a suitable terminological resource is not available, the identification of functional annotation is based on the extraction of syntactical
relations between a residue entity and a noun phrase. The developed method allows the
discovery of novel information that can provide key information for functional annotation. In the next chapter, I will demonstrate the validity of the extracted information as
functional annotation of protein residues.
Chapter 7
Extraction of functional annotation
for protein residues from MEDLINE
In the previous two chapters, two fundamental text mining components for the functional
annotation extraction were presented. In this chapter, I present the results of the combined
extraction and assess the performance of the combined system. The objective of
this study is to determine the qualitative and quantitative distribution of information in
MEDLINE. Because the information is derived solely from biomedical abstract texts, it
is necessary to examine the data in terms of validity, novelty, and biological significance.
In the first part of the evaluation, the performance of the functional annotation extraction is studied on the gold standard corpus. Then the biological significance of the
data extracted from MEDLINE is studied on two example proteins, the tumour suppressor protein
p53 and the Janus kinase 2 protein. Finally, the distribution of information is examined
by two specific analyses: the cross-validation of identified active site residues with CSA,
and the cross-validation of binding residues with MSDsite.
7.1 Evaluation methods
The evaluation of the functional annotation extraction system was based on the performance analysis of its extraction components: protein residue identification, and contextual
feature extraction (cf. section 5.3 and section 6.2).
The biological validity of the mined functional annotations was analysed
manually. For each protein residue, the set of extracted annotations was reviewed
and grouped by similar topics. Because a set of annotations for each associated protein
residue can be very large, random samples were drawn from a list of annotations sorted
by residue name and position. The result is a set of sample annotations for each extracted
residue of a protein. The information was compared with the corresponding annotations
in UniProtKB.
The validation of catalytic residues was done by cross-validation with CSA [PBT04].
The analysis was performed on three levels, i.e. the comparison of identified protein
residues from MEDLINE with CSA, comparison of residues with extracted functional annotations, and comparison of residues with extracted annotations classified as ENZ ACT
(cf. section 6.1.2). The residues were compared by using the combination of the identifiers
RID+UID (cf. section 5.3).
The validation of binding residues from MEDLINE extraction was done accordingly.
The third level of validation compared residues with extracted annotations classified as
BINDING.
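The three validation levels amount to successive set intersections over RID+UID keys. The following is a minimal Python sketch under the assumption that each dataset is available as a set of `(RID, UID)` tuples; the function name `three_level_overlap` and the toy identifiers are hypothetical and are not taken from the thesis data.

```python
def three_level_overlap(reference, identified, annotated, classified):
    """Count reference residues recovered at each validation level.

    All arguments are sets of (RID, UID) tuples:
      reference  - curated residues (e.g. from CSA or MSDsite)
      identified - residues identified in MEDLINE
      annotated  - identified residues with extracted annotations
      classified - annotated residues whose annotation has the target class
    """
    level1 = reference & identified   # residue mentioned at all
    level2 = level1 & annotated       # ... and has a functional annotation
    level3 = level2 & classified      # ... and the annotation has the class
    return len(level1), len(level2), len(level3)


# Toy data, for illustration only.
reference = {("H57", "P1"), ("D102", "P1"), ("S195", "P1"), ("C32", "P2")}
identified = {("H57", "P1"), ("D102", "P1"), ("C32", "P2"), ("K10", "P3")}
annotated = {("H57", "P1"), ("C32", "P2"), ("K10", "P3")}
classified = {("C32", "P2")}

counts = three_level_overlap(reference, identified, annotated, classified)
```

Because each level is a subset of the previous one, the three counts can only decrease, which is the funnel shape the cross-validation results take.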
7.2 Results
7.2.1 Evaluation of the developed functional annotation extraction system
The presented functional annotation extraction system consists of two basic modules:
identification of protein residues, and contextual feature extraction. The following describes an analysis of the overall performance of the combined text mining system. The
test set is the gold standard corpus (GC; cf. section 5.2). The evaluation was done
in two respects: manual validation of extracted information, and cross-validation with
UniProtKB annotations.
Manual validation of extracted information. The gold standard corpus consists
of 100 abstract texts with tri-occurrences of the triplet protein, residue and organism.
However, manual analysis identified only 51 abstract texts with residue entities that can
be associated with their proteins and hosting organisms. The number of associations
(OPR) is 172. This represents the target for protein residue identification.
Corresponding to these OPRs is the set of functional annotations (PAS data). For 109
out of 172 OPRs, keywords were co-mentioned in verbal relations. The number of PAS
associated with the 109 OPRs is 117. This represents the target of functional annotation
extraction.
Figure 7.1 summarises the performance of the functional annotation extraction. With
a previously determined precision of 0.82 and a recall of 0.38, the protein residue identification module detects 79 OPRs, 65 of which are correct. Contextual
feature extraction for these 65 protein residues resulted in 35 PAS data. In comparison
with the 117 annotated PAS of the 109 OPRs, only 16 out of 35 extracted PAS are true
positives. However, the total number of extracted PAS is 46, which results in a precision
of 0.35 and a recall of 0.13. A systematic analysis revealed that the false positives
have the following sources: a false positive OPR with extracted PAS, a true positive
OPR with no annotated PAS, and a true positive OPR with a false positive PAS.

PAS data
Dataset   Available   Extracted   Common   Precision   Recall   F1
GC        117         46          16       0.35        0.13     0.25

Figure 7.1: Performance evaluation of the functional annotation extraction system. The performance
is dependent on the two combined text mining modules: protein residue identification; and contextual
feature extraction. The performance was measured in terms of precision, recall, and F1 measure.
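The reported measures can be recomputed from the three counts given for the gold standard corpus (117 available, 46 extracted, 16 common). The following is a minimal Python sketch using the standard definitions; the function name `precision_recall_f1` is hypothetical. With these definitions the recall is about 0.137 (reported as 0.13) and the F1 measure is about 0.20 rather than the 0.25 listed in Figure 7.1, so the listed F1 value may have been derived differently.

```python
def precision_recall_f1(extracted, common, available):
    """Standard precision, recall and F1 from raw counts.

    extracted - number of items returned by the system
    common    - number of returned items that are correct (true positives)
    available - number of items in the gold standard
    """
    precision = common / extracted
    recall = common / available
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts for the gold standard corpus (GC) as reported in the text.
p, r, f1 = precision_recall_f1(extracted=46, common=16, available=117)
```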
In comparison, had the system identified all protein residues correctly, the
whole extraction would have yielded a precision of 0.68 and a
recall of 0.48 (cf. section 6.3). Considering that the presented text mining solution is a pilot
approach to extract functional annotations for the validation of predicted functional sites,
the result is good for this area and comparable to first studies in BioCreAtIvE or Critical
Assessment of Techniques for Protein Structure Prediction (CASP). The recall can be
explained by the performance of the contextual feature extraction module.
The result indicates that the extracted functional annotations have a reasonable precision in this first attempt at functional annotation extraction, but are low in coverage.
This can be explained by the combined performance of the text mining modules. On
one hand, an incorrectly determined protein residue leads to a false positive PAS. On
the other hand, a failed entity recognition contributes to the false negative rate. In addition, language complexity and incorrectly parsed sentences contribute to both the
false positive and the false negative rate of functional annotation extraction.
In conclusion, the presented functional annotation extraction system delivers precise
information, but has a low coverage of extraction. However, in the context of the bioinformatics work of this thesis, a precision-driven extraction system is preferred over a recall-oriented
text mining solution.
Cross-validation with UniProtKB functional annotations. Despite the low coverage of the functional annotation extraction system, the extracted information is correct
and reusable for the annotation of protein residues. Table B.1 lists the 16 verified PAS
data, corresponding to 17 verified protein residues. A comparison with UniProtKB shows
that 5 out of 16 are rediscovered knowledge. The remaining 11 out of 16 contain novel
information that can be used to update the protein knowledge base.
The extraction of functional annotations is a multi-step system. Although the performance
of each module may not be at an optimal level, the results demonstrate that
functional annotations are available and extractable from MEDLINE.
7.2.2 Studying mined functional annotations for the proteins p53 and Jak2
UniProtKB curates functional annotations for proteins on three levels: protein level,
protein domain level, and protein residue level. The objective in this section is to study the
validity and novelty of mined functional annotations from whole MEDLINE extraction.
The result provides an indication of the biological significance for automatic extraction
from MEDLINE. The annotations of two example proteins, p53 and Jak2, are analysed
and compared with relevant information from UniProtKB.
Tumour suppressor protein p53. p53 plays a critical role in preventing human cancer formation. In the native state, the protein assembles to a tetrameric phosphoprotein.
It consists of four functional domains: (1) the proline-rich, acidic, N-terminus, which is
involved in transcriptional activation, e.g. Mdm2 binding; (2) the central core, which
binds DNA; (3) the oligomerisation domain with nuclear localisation signals, which allows the transfer into the nucleus; and (4) the C-terminus, which regulates DNA-binding
[SYH+ 03].
The extraction of functional annotations from MEDLINE for the human tumour protein
p53 resulted in 1,665 PAS data. A manual analysis of samples of the mined functional
annotations indicates that there are two main topics: regulatory post-translational
modification, and the binding activity of residues, where in some cases the interaction
partner is also stated. Table C.1 lists example annotations grouped by similar topics. For 5
out of 6 of the identified residues with post-translational modification, i.e. THR18, SER46,
SER15, THR55, and SER315, the extracted information is similar to the annotations in
the UniProtKB entry. The remaining residue, SER6, has no annotation in UniProtKB.
The knowledge base does not provide further information on the biological implication
of these residues, while the extracted data contain more contextual information. For
example:
”[...]ATM-mediated phosphorylation of the ser15 site of p53[...]”
(PMID:14757188),
”[...]Ser46 phosphorylation activates p53-dependent apoptosis[...]”
(PMID:17172844).
The analysis also found annotations for some critical residues that are not recorded in
UniProtKB. For example:
”[...]the amino acid change C135R generates the loss of TP53 DNA-binding activity[...]” (PMID:17914575),
”[...]R248W abolish the association with p63[...]” (PMID:11172034).
The activity of p53 is thought to be regulated through a number of post-translational
modifications at the N- and C-terminal regions. Review articles report that seven serines
(SER6, SER9, SER15, SER20, SER33, SER37, and SER46) and two threonines (THR18,
and THR81) in the N-terminal domain are modified by kinases upon exposure of cells to
ionising radiation or UV light. The analysis shows that MEDLINE extraction can recover
this information for the residues SER6, SER15, SER46, and THR18.
Janus Kinase 2 (Jak2). Jak2 plays a crucial part in various growth factor and cytokine signalling pathways. Similar to other protein tyrosine kinases of the Janus kinase
family, Jak2 consists of a tyrosine kinase domain and a tyrosine kinase-like domain. It is
thought that the kinase-like domain can negatively regulate the kinase domain.
The set of extracted functional annotations for Jak2 comprises 624 PAS data and
contains information on only seven residues: L539 (1 annotation), W515 (1 annotation),
K607 (2 annotations), V617 (630 annotations), F617 (5 annotations; a reported variant
associated with Budd-Chiari syndrome), V678 (3 annotations), and D816 (1 annotation).
A comparison with UniProtKB data shows that the extracted information for F617, K607,
and L539 is similar to the annotations in the database. These and other annotations for
D816, V678, and W515 describe mutation events (data not shown).
In order to assess the extracted information on V617, random samples were selected
and studied manually. The analysis indicates that the set of annotations
contains a lot of redundant information. The data can be grouped into two main topics: disease, and genetic origin. Table D.1 lists some examples of extracted functional
annotations.
The effect of mutating residue 617 on cellular function, and its association with particular diseases, has already been reported, but none of the extracted annotations provides a
molecular explanation. A survey of research publications on Jak2 revealed that myeloid
and lymphoid malignancies are associated with Jak2 V617F. It is proposed that the mutation at
residue 617 destabilises the kinase and kinase-like domain interactions, and thereby promotes activation of kinase activity [POHS05]. These results suggest that the extracted
information reflects pieces of evidence; however, their biological relations may not be
available in the mined output or even in MEDLINE.
In summary, the study of the mined functional annotations of residues for the two proteins presented here indicates that MEDLINE contains information that recurs
in a number of abstract texts. Despite the data redundancy, some functional annotations are not contained in UniProtKB, indicating that MEDLINE extraction still yields
novel information.
7.2.3 Cross-validation of mined catalytic residues with CSA
In the previous section, functional annotations were extracted from MEDLINE, and for a
range of annotations, the contained information was analysed on its biological validity and
novelty. This section focuses on enzyme-related information in the extracted annotations.
The objective is to study how reliable the extracted information is for the validation of
catalytic residues. The identified residues with these associated annotations are compared
with CSA. Figure 7.2 summarises the result of this analysis.
The CSA lists 12,971 protein residues (RID+UID), of which 799 were identified in
MEDLINE. That the other 12,172 CSA residues were not found can be explained by the performance of the identification system (cf. section 5.4). Another explanation is that CSA
is curated from full-text publications, and the same information may not be
available in MEDLINE.
By selecting residues with extracted functional annotations from MEDLINE, 691 out
of 799 protein residues were retained. This result indicates that many functional descriptions are available as contextual features of the identified protein residues. The result
is consistent with previous performance evaluation studies (cf. section 6.4). With a precision of 0.43 and recall of 0.20, the classifier for the category ENZ ACT (cf. section 6.3)
identified enzyme-related functional annotations for 77 out of 691 protein residues. Manual analysis shows that this reduction can be explained by the classifier's performance.
Another explanation is the absence of relevant contextual cues in the extracted text.
A search for the term "catalytic triad" in the sentences of the identified protein residues
yielded a sub-selection of 221 out of 46,750 residues. A comparison with CSA shows
that 44 out of 221 are rediscoveries of active site residues. The annotations for the
remaining 177 may contain supporting evidence to identify the residues as catalytic. A
systematic analysis of these predicted catalytic residues should start with the 27 out of
177 residues that have annotations classified as ENZ ACT.
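The keyword-based selection described above (221 selected, 44 rediscovered, 177 remaining, 27 of them with ENZ ACT annotations) can be sketched as a filter followed by two set partitions. The following Python sketch is an illustration only: the record layout (`key`, `sentence`, `labels`), the function name `triage_by_keyword`, and the toy sentences and identifiers are assumptions and do not reproduce the thesis pipeline or its data.

```python
def triage_by_keyword(records, csa, keyword="catalytic triad"):
    """Select residues whose sentence mentions the keyword, then split them.

    records - iterable of dicts with the (assumed) keys
              'key' ((RID, UID) tuple), 'sentence' (str), 'labels' (set of str)
    csa     - set of (RID, UID) tuples known as catalytic residues
    Returns (rediscovered, candidates, priority) as sets of (RID, UID).
    """
    selected = {}
    for rec in records:
        if keyword.lower() in rec["sentence"].lower():
            # A residue may occur in several sentences; merge its labels.
            selected.setdefault(rec["key"], set()).update(rec["labels"])

    rediscovered = {k for k in selected if k in csa}
    candidates = set(selected) - rediscovered
    # Candidates with an enzyme-related annotation are examined first.
    priority = {k for k in candidates if "ENZ ACT" in selected[k]}
    return rediscovered, candidates, priority


# Toy data, for illustration only.
records = [
    {"key": ("S195", "PROT_A"), "labels": {"ENZ ACT"},
     "sentence": "Ser195 is part of the catalytic triad."},
    {"key": ("H57", "PROT_A"), "labels": set(),
     "sentence": "The catalytic triad includes His57."},
    {"key": ("D30", "PROT_B"), "labels": {"ENZ ACT"},
     "sentence": "Asp30 completes the Catalytic Triad of the enzyme."},
    {"key": ("K12", "PROT_B"), "labels": {"BINDING"},
     "sentence": "Lys12 binds ATP."},
    {"key": ("E77", "PROT_C"), "labels": set(),
     "sentence": "Glu77 was proposed as a catalytic triad member."},
]
csa = {("S195", "PROT_A"), ("H57", "PROT_A")}

rediscovered, candidates, priority = triage_by_keyword(records, csa)
```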
In conclusion, the developed text mining system rediscovers active site residues by
mining solely abstract texts from MEDLINE. While the false positive rate is not known,
the extraction identified 1,391 protein residues with enzyme-related functional annotations. The significance of these potentially new CSA residues is being studied in
ongoing work.
Figure 7.2: Cross-validation of text mined catalytic residues with CSA. The analysis was based
on the comparison of the determined RID+UID pairs, and the numbers give the counts of these
pairs. RID = Residue identifier; UID = UniProt identifier.
Figure 7.3: Cross-validation of text mined binding residues with MSDsite. Annotation was studied
at three levels: the mentioned protein residue alone, the residue with PAS data, and the residue with
information on binding. The numbers indicate the counted RID+UID pairs in the data. RID = Residue
identifier; UID = UniProt identifier.
7.2.4 Annotation of protein residues in MSDsite
The MSDsite [GDO+ 05] holds a number of predicted ligand binding sites, by automatically
analysing ligand contacting residues in the PDB. The objective in this section is to analyse
how many of these binding residues can be annotated from mining MEDLINE.
The analysis shows that 512 out of the 46,750 identified protein residues in MEDLINE
are also contained in MSDsite (cf. figure 7.3). A large proportion of these residues are
associated with PAS data (429 out of 512), while only a small subset of 12 have information classified as BINDING. Manual analysis shows that all of these 12 annotations are
correct. They can be used to validate the predicted ligand binding residues in MSDsite
(table E.1).
For the remaining 417 residues with PAS data (429 minus the 12 classified as BINDING), the associated PAS data may still contain
valid information for the annotation. However, a systematic analysis was not performed
at this stage of the study.
In summary, a relatively small set of protein residues recovered from MEDLINE extraction can be used for the annotation of MSDsite entries.
7.3 Discussion
The extraction of functional annotation is a multi-step process, and the quality of the
result has to be interpreted in the context of each subprocess's performance. Although the
performance of each extraction module may not be at an optimal level, the evaluation results
indicate that the mined output contains biologically meaningful data. Considering that the
validation of a predicted function requires evidence of biological function, the developed text mining system can become a valuable tool, for example for the protein function
prediction assessment in the Critical Assessment of Techniques for Protein Structure Prediction (CASP) [LRTV07]. With the improvement of the information extraction modules,
the mined functional annotations are expected to become more reliable.
The biological relevance of the extracted functional annotation was demonstrated on
two different proteins, p53 and Jak2. The results show that not only can information in
UniProtKB be rediscovered from MEDLINE, but novel information can be extracted as well. These functional annotations can be considered to complement existing
annotations in UniProtKB. However, manual analysis of subsets of the extracted annotations
indicates that the information is represented redundantly in MEDLINE. One major
reason is that biological facts are reported repeatedly within the biological community.
The study of identifying catalytic residues and binding residues from the mined functional annotations, and the cross-validation with CSA and MSDsite, shows that the developed text mining solution is able to find relevant data in MEDLINE. Although the
developed classifiers have a weak performance, it is not clear whether this completely explains the cross-validation results. It is possible that key information that would identify
the biological role of the protein residues is not mentioned in abstract texts. Another
explanation is the protein residue identification performance, which was
evaluated with a low recall score.
Although abstract texts cover only a subset of the information in full-text articles, and
information is represented repeatedly in MEDLINE, this study shows that the text mined
information is biologically valid and contains snippets of additional information that are
relevant for UniProtKB. For example, the extracted annotations complement existing
information in UniProtKB and provide first data on functional sites in proteins that have
not yet been curated.
7.4 Conclusion
In this chapter, two text mining components were combined to form the functional annotation extraction system. Performance analysis shows that the system is precise, but
has a low coverage. However, the low recall is compensated by the fact that information
is distributed redundantly. The extracted information is biologically valid, and contains
some novel data, which can be used to update UniProtKB. So far, functional annotations
of residues have been evaluated in isolation, i.e. independently of the structural context in
proteins. In the following chapter, a biological context is created by combining functional
annotations with protein structure data (cf. chapter 3 and chapter 4).
Chapter 8
Combining active site prediction
with mined functional annotations
The goal of this thesis is to combine information from two disjoint information resources.
To this end, various methodologies were developed for the prediction of functional sites
in proteins, and for the extraction of relevant information for the functional annotation of
protein residues from scientific articles. More specifically, a predicted functional site
can be validated by a set of functional annotations of protein residues. Conversely, a
set of functional annotations requires a structural context to understand the molecular
mechanism of a protein function.
In the previous chapters, I have presented the results on 3D pattern mining from PDB
(cf. chapter 3) and functional annotation extraction from MEDLINE (cf. chapters 5, 6,
and 7). Here, the produced datasets are combined and analysed. The objective in this
chapter is to validate predicted active sites that the data mining output may contain,
by combining specific functional annotations extracted from MEDLINE. The result is
compared with data from CSA.
Figure 8.1: Overview of processes and evaluation methods of combining the protein structure dataset
and literature dataset.
8.1 Algorithms
8.1.1 Combining protein structure data with literature data
Theory
The method to combine PDB with MEDLINE data, i.e. the functional annotation of a
residue from a protein structure, is based on the combination of two identifiers: RID+UID
(cf. section 5.3). There are two major subtasks to combine the datasets (cf. figure 8.1):
linking PDB entries to a Uniprot entry, and associating a residue with its co-mentioned
protein in text.
Mapping residues in PDB to UniProtKB. The mapping between PDB and UniProtKB,
and the inherited mapping of a protein residue from a PDB entry to its UniProtKB sequence index, is a non-trivial task. One problem is that the author of a determined protein
structure may have used an arbitrary residue index system that is not in accordance with the wild-type
protein sequence. Furthermore, residues in a protein deletion mutant may have
been numbered sequentially, irrespective of sequence gaps. Another problem is that
UniProtKB may not have the corresponding protein sequence for a crystallised protein,
which may be, for example, a novel splice variant.
In some cases, cross-links from PDB to UniProtKB, or UniProtKB to PDB are available. However, over time the links may have become outdated. In order to find the correct
mapping between the protein residue indices in both databases, an exhaustive sequence
alignment is required. Various solutions and services have been provided for the periodic
update of UniProtKB-PDB mappings [VMMR+ 05] [Mar05] [VZHC05] [MSD08].
Here, I reuse a previously published lookup table file [Mar05] for the mapping of
protein residues in PDB to UniProtKB. Note that the lookup table is based on the
alignment analysis work of the Macromolecular Structure Database (MSD) group at the
European Bioinformatics Institute [MSD08].
Mapping protein residue in text to UniProtKB. The mapping of a residue entity in text to its co-mentioned protein, and ultimately the mapping to UniProtKB, is
explained in section 5.1.
Implementation
The correct sequence index mapping of a PDB entry to its corresponding Uniprot entry
was based on the lookup table produced by [Mar05] (version October 2008). An example
of the lookup table data is shown in figure 8.2. The combination of the following keys was
used to unambiguously map a residue from PDB to its UniProt native sequence position:
PDBID + chainID + RID.
PDB                                          UniProtKB
PDBID   chainID   serial   resName   resSeq   UID           resName   seqIndex
11gs    B         1        PRO       2        GSTP1_HUMAN   P         3
11gs    B         2        TYR       3        GSTP1_HUMAN   Y         4
11gs    B         3        THR       4        GSTP1_HUMAN   T         5
11gs    B         4        VAL       5        GSTP1_HUMAN   V         6
11gs    B         5        VAL       6        GSTP1_HUMAN   V         7
11gs    B         6        TYR       7        GSTP1_HUMAN   Y         8
11gs    B         7        PHE       8        GSTP1_HUMAN   F         9
11gs    B         8        PRO       9        GSTP1_HUMAN   P         10
11gs    B         9        VAL       10       GSTP1_HUMAN   V         11
11gs    B         10       ARG       11       GSTP1_HUMAN   R         12
Figure 8.2: Lookup table for PDB/UniProtKB mapping. Excerpt of the lookup table to map protein
residues from a PDB entry to the corresponding UniProtKB entry.
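The key-based mapping can be illustrated with a dictionary lookup. The following Python sketch is not the thesis implementation: the function names `build_lookup` and `map_residue` are hypothetical, the RID is assumed to be the three-letter residue name followed by the PDB residue number (e.g. TYR3), and the UID is written in the UniProtKB entry-name form GSTP1_HUMAN. The three rows are taken from the excerpt in figure 8.2.

```python
# Rows from figure 8.2: (PDBID, chainID, resName, resSeq, UID, seqIndex).
ROWS = [
    ("11gs", "B", "PRO", 2, "GSTP1_HUMAN", 3),
    ("11gs", "B", "TYR", 3, "GSTP1_HUMAN", 4),
    ("11gs", "B", "THR", 4, "GSTP1_HUMAN", 5),
]


def build_lookup(rows):
    """Index the table by the combined key PDBID + chainID + RID."""
    lookup = {}
    for pdb_id, chain_id, res_name, res_seq, uid, seq_index in rows:
        rid = f"{res_name}{res_seq}"   # assumed RID format, e.g. "TYR3"
        lookup[(pdb_id, chain_id, rid)] = (uid, seq_index)
    return lookup


def map_residue(lookup, pdb_id, chain_id, rid):
    """Return (UID, seqIndex) for a PDB residue, or None if unmapped.

    PDB identifiers are lower-cased and RIDs upper-cased before the
    lookup; chain identifiers are case-sensitive and left unchanged.
    """
    return lookup.get((pdb_id.lower(), chain_id, rid.upper()))


lookup = build_lookup(ROWS)
mapped = map_residue(lookup, "11GS", "B", "tyr3")
unmapped = map_residue(lookup, "11gs", "A", "TYR3")
```

In the excerpt the UniProtKB sequence index is the PDB residue number plus one, but a lookup per residue is used rather than a fixed offset because, as noted above, PDB numbering need not follow the native sequence.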
8.2 Evaluation methods
The validation of identified catalytic residues was done by manual examination of the
functional descriptions of annotated protein residues. Within this analysis 6 datasets
were used (cf. section 7.2): CSA is the set of active site residues from the Catalytic Site
Atlas [PBT04]; OLDFIELD is the set of residues in the non-redundant structure set from
[Old02]; PATTERN is the set of residues from the data mined 3D patterns; OPR is the
set of protein residues identified from MEDLINE extraction; FA is the subset of OPR
that has functional annotations extracted from MEDLINE; and ENZ is the subset of
FA in which the contained information is classified as ENZ ACT, i.e. the information is
enzyme-related.
8.3 Results
8.3.1 Protein residue mapping between three data resources
This section gives an overview of the analysed datasets. Figure 8.3 summarises the data.
OLDFIELD contains in total 341,365 protein residues, counted as RID+PDBID.
328,796 out of 341,365 residues are found in the lookup table, which corresponds to
280,521 RID+UID. In parallel, the residues from the mined 3D pattern set (PATTERN) were
mapped to 24,500 RID+UID. The identification of protein residues in MEDLINE found
a total of 132,476 RID+UID with a unique count of 46,750 RID+UID. This dataset is
referred to as OPR. 36,569 out of 46,750 protein residues have functional annotations (FA),
while a further subset of 1,467 out of 36,569 have annotations classified as ENZ ACT
(ENZ). A set analysis between OLDFIELD and OPR determined 2,402 common protein
residues, 197 of which are also listed in CSA.
Figure 8.3: Overview of the combined datasets from protein structure data and biomedical literature
data. The combined dataset is analysed to identify active site residues. CSA = active site database; OPR
= identified protein residues; PAS = contextual feature assigned to a protein residue; ENZ = contextual
feature with enzyme-related information; OLDFIELD = protein structure subset from PDB; PATTERN
= data mined structural features from OLDFIELD.
In summary, for a large fraction of protein residues in OLDFIELD, mapping to
UniProtKB sequence indices is available. However, only 2,402 are recovered from MEDLINE extraction, which can be used for validation.
8.3.2 Rediscovery of active sites and catalytic residues
The identification of catalytic residues from protein structure data mining, and from
biomedical literature mining was studied previously (cf. sections 4.2 and 7.2). Each
result was evaluated by cross-validation with CSA. This section studies the validation of
predicted active sites from the combined datasets.
Previously, three structural patterns were identified as active sites by cross-validation
with CSA (cf. chapter 4). One of the patterns represents the well-known catalytic triad.
This pattern was found in 19 proteins within the dataset (cf. section 4.2). Associated
with these 19 proteins is a set of 57 protein residues. The analysis shows that only 3 out
of 57 residues were identified in MEDLINE. The 3 residues identified in text belong
to the same protein, bovine chymotrypsinogen (cf. table 8.1). The functional
annotations associated with the residues ASP102 and HIS57 were not classified as ENZ ACT.
The information in these annotations indicates the catalytic property
of these residues only indirectly; the annotations do not mention them as part of the catalytic triad. In
conclusion, a structure-based prediction of an active site was not validated by literature
data.
The intersection of PATTERN, OPR, and CSA results in a set of 15 protein residues.
RID+UID: S195 CTRA_BOVIN; D102 CTRA_BOVIN; H57 CTRA_BOVIN
Sentence: "These include the NH2-terminal four residues, the sequences near histidine-57 (chymotrypsinogen A numbering system), aspartic acid-102, aspartic acid-189, and serine-195,
the regions of the three disulfide bridges, and the COOH-terminal end (residues 225-229) of the proteins. When aligned to maximize homology the identity of residues is
34%." (PMID:804314)
PAS: N/A

RID+UID: D102 CTRA_BOVIN; H57 CTRA_BOVIN
Sentence: "In bovine chymotrypsinogen A in 2H2O at 31 degrees C, histidine-57 has a pK’ of 7.3 and
aspartate-102 a pK’ of 1.4, and the histidine-40-aspartate-194 system exhibits inflections at
pH 4.6 and 2.3." (PMID:31898)
PAS:
  pred = has
  arg1 = HIS57
  arg2 = a pK
  arg2-of = 7.3 and ASP102 a pK
  arg2-of = 1.4

RID+UID: D102 CTRA_BOVIN
Sentence: "In bovine chymotrypsin Aalpha under the same conditions, the histidine-57-aspartate-102
system has pK’ values of 6.1 and 2.8, and histidine-40 has a pK’ of 7.2." (PMID:31898)
PAS:
  pred = have
  arg1 = the HIS57 ASP102 system
  arg2 = pK values
  arg2-of = 6.1 and 2.8

RID+UID: D102 CTRA_BOVIN; H57 CTRA_BOVIN
Sentence: "The results suggest that the pK’ of histidine-57 is higher than the pK’ of aspartate-102 in
both zymogen and enzyme." (PMID:31898)
PAS:
  pred = is
  arg1 = that the pK
  arg1-of = HIS57
  arg2 = higher than the pK
  arg2-of = ASP102
  arg2-in = both zymogen and enzyme

RID+UID: H57 CTRA_BOVIN
Sentence: "The 1H NMR chemical shift of the Cepsilon1 H of histidine-57 in the chymotrypsin Aalpha-pancreatic trypsin inhibitor (Kunitz) complex is constant between pH 3 and 9 at a value
similar to that of histidine-57 in the porcine trypsin-pancreatic trypsin inhibitor complex
[Markley, J.L., and Porubcan, M. A. (1976), J. Mol. Biol. 102, 487–509], suggesting that the
mechanisms of interaction are similar in the two complexes." (PMID:31898)
PAS:
  pred = is
  arg1 = complex
  arg2 = constant
  arg2-between = pH 3 and 9
  arg2-at = a value similar
  arg2-to = that
  arg2-of = HIS57
  arg2-in = the porcine trypsin-pancreatic trypsin inhibitor complex
Table 8.1: Extracted MEDLINE information on the catalytic residues in bovine chymotrypsinogen.
Given the performance of the functional annotation extraction system and the availability of information in MEDLINE, only little information was extracted. The mined information on the active site
residues mentions their catalytic properties only indirectly.
RID+UID:  C32 THIO_HUMAN; C35 THIO_HUMAN
Sentence: “A hydrogen bond between the sulfhydryls of Cys32 and Cys35 may reduce the pKa of Cys32 and this pKa depression probably results in increased nucleophilicity of the Cys32 thiolate group.” (PMID:8805557)
PAS:      pred = reduce
          arg1 = A hydrogen bond
          arg1-between = the sulfhydryls
          arg1-of = CYS32 and CYS35
          arg2 = the pKa
          arg2-of = [CYS32 and this pKa depression]/ENZ ACT

RID+UID:  C215 PTN1_HUMAN
Sentence: “The structure of the catalytically inactive mutant (C215S) of the human protein-tyrosine phosphatase 1B (PTP1B) has been solved to high resolution in two complexes.” (PMID:9391040)
PAS:      pred = solved
          arg1 = [inactive mutant (C215S)]/ENZ ACT
          arg1-of = the human protein-tyrosine phosphatase 1B (PTP1B)
          arg2 = unk
          arg2-to = to high resolution
          arg2-in = in two complexes
Table 8.2: Identified catalytic residues from MEDLINE extraction. The mined functional annotations
were classified as enzyme-related, suggesting that the corresponding protein residue has some catalytic properties. The identified residues were also cross-validated by CSA; however, the mined 3D patterns containing
these residues were not validated as active sites by the database.
The analysis shows that only 3 of the 15 protein residues have enzyme-related annotations.
Two of these 3 residues belong to human thioredoxin (cf. table 8.2). However,
none of the mined 3D patterns provides a structural context for the identified catalytic
residues. A manual analysis of the remaining 12 residues shows that some of their
annotations were not correctly classified as enzyme-related, which can be explained by
the performance of the classifier (cf. section 6.3).
For 16 out of 197 protein residues, i.e. the intersection between OLDFIELD, OPR,
and CSA, the term “catalytic triad” is co-mentioned within the same sentence. While none
of the 16 residues is associated with a mined 3D pattern, 6 of them have
enzyme-related functional annotations (cf. table 8.3).
In conclusion, the results of this study indicate that the coverage of relevant information is too low to validate predicted active sites. However, some of the enzyme-related
annotations are biologically valid but have no correlation with a 3D pattern.
RID+UID:  S80 HNL_HEVBR; D207 HNL_HEVBR; H235 HNL_HEVBR
Sentence: “Our results yielded further support for an enzymatic mechanism involving the catalytic triad Ser80, His235, and Asp207 as a general acid/base.” (PMID:11354003)
PAS:      pred = involving
          arg1 = further support
          arg1-for = for an enzymatic mechanism
          arg2 = [the catalytic triad SER80, HIS235, and ASP207]/ENZ ACT

RID+UID:  E132 LINB_PSEPA; D108 LINB_PSEPA; H272 LINB_PSEPA
Sentence: “The enzyme belongs to the alpha/beta hydrolase family and contains a catalytic triad (Asp108, His272, and Glu132) in the lipase-like topological arrangement previously proposed from mutagenesis experiments.” (PMID:11087355)
PAS:      pred = contains
          arg1 = unk
          arg1-to = the alpha/beta hydrolase family and
          arg2 = [a catalytic triad (ASP108, HIS272, and GLU132)]/ENZ ACT
Table 8.3: Catalytic triad residues available from the mined functional annotations. The active site
residues were identified by a search for the term ”catalytic triad” in the mined functional annotation
data. The validity was also confirmed by comparison with CSA.
8.3.3 Search for novel catalytic residues
In the previous section, the combined dataset was evaluated by cross-validation with CSA.
The catalytic residues identified there are therefore only rediscoveries of known data. The
goal in this section is to search for novel catalytic residues by combining enzyme-related
annotations with mined 3D patterns.
A set analysis of CSA, OLDFIELD, and OPR revealed that 2,205 residues
are included in both OLDFIELD and OPR, but not in CSA (cf. figure 8.3). A search for
the term “catalytic triad” in the sentences associated with these 2,205 residues yielded a
subset of 24 residues. None of the 24 residues was found in
the mined 3D patterns. However, 15 of the 24 residues have enzyme-related annotations
(cf. table F.1), suggesting that they are catalytic residues. A manual analysis confirmed
that the annotations contain valid evidence to identify the residues as catalytic.
The results of this study indicate that MEDLINE extraction can find some additional
catalytic residues that are not represented in CSA. However, no correlation with the mined
3D patterns was found, so the functional annotations could not be interpreted in a structural
context.
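The set analysis used in this section can be illustrated with a minimal sketch. It assumes that each dataset is available as a set of residue identifiers and that the sentences mentioning each residue are stored in a dictionary; the identifiers, sentences, and the function name novel_candidates are hypothetical and only show the principle, not the implementation used in this thesis.

```python
# Toy residue identifiers ("<residue>_<protein>"); all values are hypothetical.
oldfield = {"S195_P1", "H57_P1", "D102_P1", "C32_P2", "K10_P3"}
opr = {"S195_P1", "H57_P1", "C32_P2", "K10_P3", "E7_P4"}
csa = {"S195_P1", "H57_P1"}

# Sentences in which a residue was mentioned (hypothetical text).
sentences = {
    "C32_P2": ["Cys32 is part of the catalytic triad."],
    "K10_P3": ["Lys10 is exposed to the solvent."],
}


def novel_candidates(oldfield, opr, csa, sentences, term="catalytic triad"):
    """Residues in OLDFIELD and OPR but not in CSA whose sentences mention `term`."""
    not_in_csa = (oldfield & opr) - csa
    return sorted(
        residue
        for residue in not_in_csa
        if any(term in s.lower() for s in sentences.get(residue, []))
    )


candidates = novel_candidates(oldfield, opr, csa, sentences)
print(candidates)
```

In this toy example only the residue whose sentence contains the search term remains as a candidate; a manual check of the annotations, as described above, would still be required.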
8.3.4 General correlation found between predicted functional sites and extracted functional annotations
Previously, the validation of predicted active sites was studied by cross-validation against known
catalytic residues. This section presents a more general correlation analysis between structure
and function data. Because the coverage of extracted functional annotations
of protein residues is too low to annotate every residue of a prediction,
we cannot expect that all residues in one prediction are annotated with a description of
biological function. However, if a predicted functional site has some features that point
to a common concept of function, then this can be used to prioritise the prediction.
Table 8.4 (left panel) shows the top 25 mined structural patterns, ranked
by the number of distinct residues with PAS data. In total, 168 patterns have annotations,
ranging from one to a maximum of nine distinct annotated residues per pattern. Another
view relates the number of annotated residues to the total
number of residues in a prediction (cf. table 8.4, right panel). This gives an indication of
how frequent a pattern is and how much we know about each residue from the text-mined
data.
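The two rankings can be reproduced with a short sketch. The four rows below are taken from table 8.4 (A is the number of residues with PAS data, B the total number of residues in the pattern); the class name PatternStats and its field names are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatternStats:
    pattern: str
    with_pas: int    # A: distinct residues with PAS data
    in_pattern: int  # B: residues in all structure examples of the pattern

    @property
    def ratio(self) -> float:
        """Fraction of the pattern's residues that carry an annotation (A/B)."""
        return self.with_pas / self.in_pattern


rows = [
    PatternStats("9 10 16 CYS CYS PHE-1", 6, 12),
    PatternStats("10 15 11 ASP HIS TRP-2", 4, 18),
    PatternStats("10 11 11 ALA HIS HIS-1", 4, 6),
    PatternStats("17 11 9 ALA LEU VAL-1", 3, 102),
]

# Left panel: rank by the absolute number of annotated residues (A).
by_count = sorted(rows, key=lambda r: r.with_pas, reverse=True)
# Right panel: rank by the annotated fraction (A/B).
by_ratio = sorted(rows, key=lambda r: r.ratio, reverse=True)

for r in by_ratio:
    print(f"{r.pattern:<26} A={r.with_pas:<3} B={r.in_pattern:<4} A/B={r.ratio:.4f}")
```

The two orderings differ: a frequent pattern with many structure examples can have a high count A but a low ratio A/B, which is why both views are reported.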
The biological features extracted from text for protein residues match a number of different proteins, including homologous proteins. So far, the annotation of residues
in a predicted functional site considered only first-level information (annotations for the exact
protein); however, the correlation analysis can also exploit information from homologous
proteins (second-level information). Based on the information from the Homology-derived
Secondary Structure of Proteins (HSSP) database [SS96], the annotation of the predictions
was expanded with information extracted for homologues. The result of this study shows
that the number of residue annotations increased by 10% (cf. table 8.5). A control analysis of how many residues in the non-redundant protein dataset OLDFIELD are identified
in MEDLINE, and how many of these are associated with PAS data, indicates that
the low recall of the developed text mining system is the reason for the weak annotation
Left panel: top 25 patterns ranked by the number of residues with PAS data (A)

Pattern                     #residues with PAS (A)   #residues in pattern (B)   A/B
9 10 16 CYS CYS PHE-1       6                        12                         0.5
10 15 11 ASP HIS TRP-2      4                        18                         0.2222
10 11 20 HIS MET PHE-1      4                        12                         0.3333
9 18 11 GLY MET TYR-1       4                        12                         0.3333
9 11 17 ALA LEU VAL-1       4                        30                         0.1333
8 9 10 CYS CYS HIS-1        4                        12                         0.3333
11 8 18 HIS HIS SER-1       4                        12                         0.3333
11 18 9 CYS ILE PHE-1       4                        12                         0.3333
11 11 12 HIS HIS MET-1      4                        21                         0.1905
9 15 11 GLN LEU TRP-2       4                        6                          0.6667
10 15 11 ASP HIS TRP-1      4                        15                         0.2667
10 11 11 ALA HIS HIS-1      4                        6                          0.6667
20 9 11 ASP GLY MET-1       4                        12                         0.3333
18 10 10 ASP CYS PHE-1      4                        12                         0.3333
19 11 10 ASP CYS ILE-1      4                        12                         0.3333
11 14 7 ASP MET SER-1       4                        18                         0.2222
9 17 10 ALA ILE PHE-1       4                        18                         0.2222
9 10 8 CYS HIS MET-1        3                        9                          0.3333
10 13 10 CYS PHE TYR-1      3                        6                          0.5
21 11 10 CYS GLY VAL-1      3                        21                         0.1429
11 9 9 ASP MET SER-1        3                        15                         0.2
17 11 9 ALA LEU VAL-1       3                        102                        0.0294
10 10 19 ALA HIS MET-1      3                        18                         0.1667
8 8 15 ASP HIS SER-1        3                        33                         0.0909
10 9 11 CYS VAL VAL-1       3                        33                         0.0909

Right panel: top 25 patterns ranked by the ratio A/B

Pattern                     #residues with PAS (A)   #residues in pattern (B)   A/B
10 11 11 ALA HIS HIS-1      4                        6                          0.6667
9 15 11 GLN LEU TRP-2       4                        6                          0.6667
9 10 16 CYS CYS PHE-1       6                        12                         0.5
10 13 10 CYS PHE TYR-1      3                        6                          0.5
10 11 20 HIS MET PHE-1      4                        12                         0.3333
11 18 9 CYS ILE PHE-1       4                        12                         0.3333
11 8 18 HIS HIS SER-1       4                        12                         0.3333
18 10 10 ASP CYS PHE-1      4                        12                         0.3333
19 11 10 ASP CYS ILE-1      4                        12                         0.3333
20 9 11 ASP GLY MET-1       4                        12                         0.3333
8 9 10 CYS CYS HIS-1        4                        12                         0.3333
9 18 11 GLY MET TYR-1       4                        12                         0.3333
9 10 8 CYS HIS MET-1        3                        9                          0.3333
11 13 9 ASN LYS SER-1       2                        6                          0.3333
11 14 8 ALA ARG ASN-2       2                        6                          0.3333
11 17 10 CYS PHE PRO-1      2                        6                          0.3333
18 10 11 ARG GLU PRO-1      2                        6                          0.3333
19 9 11 ALA PRO TYR-1       2                        6                          0.3333
9 11 9 ASP CYS LYS-1        2                        6                          0.3333
10 10 20 HIS PRO TYR-1      1                        3                          0.3333
10 12 11 ILE LEU PHE-1      1                        3                          0.3333
14 8 7 ASP HIS SER-1        1                        3                          0.3333
8 11 17 GLU THR THR-1       1                        3                          0.3333
10 15 11 ASP HIS TRP-1      4                        15                         0.2667
10 15 11 ASP HIS TRP-2      4                        18                         0.2222

Table 8.4: Functional annotations of protein residues in predicted functional sites. A functional site is
predicted as a structure pattern that is recurrent among a non-redundant set of proteins. The
left panel lists the top 25 patterns ranked by the total number of annotated protein residues for each
pattern, while the right panel ranks the patterns by the number of annotated protein
residues relative to the total number of residues found in all structure examples.
Residue Annotations
-HSSP
OLDFIELD
PATTERN
+HSSP
OPR
FA
OPR
FA
2,402
168
1,963
132
243
16
192
19
Table 8.5: Homology-based transfer of extracted functional annotations for protein residues in the
mined pattern data. Based on the HSSP information the identified protein residues and their associated
functional annotations were transferred from homologous proteins to the target proteins and residues in
the mined structure pattern data.
expansion.
In conclusion, a general correlation between protein structure and function data is
found in this study. The set of available annotations for protein residues is an indication
of biological function for a predicted functional site. The biological significance of this
result is being investigated further.
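The homology-based transfer of annotations used in this section can be pictured as a position mapping along a pairwise alignment. The following is a minimal sketch under stated assumptions: the alignment is given as two gapped strings of equal length, annotations are keyed by 1-based residue positions, and an annotation is transferred only if the aligned residue type is identical. The conservation check and the function names are illustrative choices; the sketch does not read HSSP files and is not the procedure implemented in this thesis.

```python
from __future__ import annotations


def position_map(aligned_source: str, aligned_target: str) -> dict[int, int]:
    """Map 1-based residue positions of the source onto the target."""
    if len(aligned_source) != len(aligned_target):
        raise ValueError("aligned sequences must have equal length")
    mapping: dict[int, int] = {}
    src_pos = tgt_pos = 0
    for s, t in zip(aligned_source, aligned_target):
        if s != "-":
            src_pos += 1
        if t != "-":
            tgt_pos += 1
        if s != "-" and t != "-":
            mapping[src_pos] = tgt_pos
    return mapping


def transfer(annotations: dict[int, list[str]],
             aligned_source: str,
             aligned_target: str) -> dict[int, list[str]]:
    """Copy annotations to aligned target positions with an identical residue type."""
    mapping = position_map(aligned_source, aligned_target)
    src = aligned_source.replace("-", "")
    tgt = aligned_target.replace("-", "")
    transferred: dict[int, list[str]] = {}
    for pos, notes in annotations.items():
        tgt_pos = mapping.get(pos)
        if tgt_pos is not None and src[pos - 1] == tgt[tgt_pos - 1]:
            transferred[tgt_pos] = list(notes)
    return transferred


# Hypothetical homologue (source) and target protein, aligned with gaps.
result = transfer(
    {2: ["surface residue"], 3: ["nucleophile"], 6: ["binds substrate"]},
    aligned_source="MKC-GHS",
    aligned_target="M-CAGHT",
)
print(result)
```

Here only the annotation of the conserved cysteine is transferred: source position 2 is aligned to a gap, and source position 6 is aligned to a different residue type.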
8.4 Discussion
The distribution of information in the combined data was studied by a search for active
site residues. Another approach in sampling the dataset is the identification of ligand
binding residues. A search can be done from the protein structure data, by selecting only
residues of an identified metal binding site, and then consulting the literature for relevant
annotations.
The validation of a predicted active site in this study demonstrates that the amount
of extracted functional annotations was not sufficient for this task. Considering that
the catalytic triad is a well-characterised structural feature, the information should be
available in MEDLINE. In fact, a search for the term “catalytic triad” in the text-mined
data finds several associations between the term and residues. A close
examination reveals that some are annotations for homologous proteins with the Asp-His-Ser catalytic triad motif (data not shown). However, the results of the presented
studies indicate that the recall of the text mining system is too low to capture sufficient
annotations for protein homologues.
Despite the identification of some catalytic residues in this analysis, it must be noted
that literature-based verification of predicted active sites cannot rule out false positives.
The absence of biological evidence in the literature does not mean that
the prediction is wrong, but simply that no knowledge is currently available. Biological
research is hypothesis-driven, and therefore predicted active site residues that have not
been a biological research target are not expected to be reported in the literature.
8.5 Conclusion
In this chapter I performed a correlation analysis between the datasets from protein structure data mining and literature mining. The results suggest that the
combined data show little correlation. For example, a structure-based prediction of an
active site had no functional annotations with biological evidence, even though the prediction was
cross-validated with CSA. Conversely, literature-based identification of catalytic residues
could not be interpreted in an evolutionarily conserved structural context, because data
mining did not find a suitable recurrent structure pattern.
Chapter 9
Conclusions and future work
9.1 Summary of main contributions
The goal of this thesis was to identify functional sites in proteins. For this purpose a
novel approach that combines protein structure data mining and literature mining was
used. Below is a summary of contributions.
Significance testing of residue interaction is a novel approach to identify statistically significant spatial and chemical configurations of residues. The developed
method relies solely on mathematical models, and the analysis shows that recurrent
homologous or convergent structural features can be extracted. More importantly,
the mined result contains biologically valid data. For example, 22 proteins with the
catalytic triad were identified from cross-validation studies. Altogether, the developed data mining method can be used to discover novel information; the result is a
prediction of functional sites.
Identification of protein residues is an important text mining component developed
in this study for the extraction of functional annotations. The implemented solution
uses regular expression patterns and lists of terminologies from UniProtKB and
NCBI Taxonomy to find and associate biological entities. Ultimately, an
identified protein residue is mapped to a UniProt protein, which means that other extracted information can be integrated into UniProtKB. With a precision of 0.82 and
a recall of 0.38, residues can be identified and associated precisely with their UniProt
proteins. An analysis of the whole of MEDLINE found 15,110 abstracts that
can be used for information extraction on 2,884 UniProtKB/PDB proteins.
Contextual feature extraction is a discovery-driven information extraction approach
to find descriptions of function associated with a residue entity in the text. The developed method extracts the verbal and prepositional relations
of a residue and its contextual features from a parsed sentence. The Gene Ontology was not used, because
it does not contain suitable terminology for identifying functional descriptions of residues. With a precision of 0.68 and a recall of 0.48, the language parser
found 46,750 annotations for the identified protein residues from MEDLINE. Manual analysis indicates that some of the extracted annotations are valid and contain
novel information that can be used to update the feature table in UniProtKB.
Annotation of protein structures is the main objective of this thesis. The goal is to
create a synthesis between protein structure data and protein function data. The
hypothesis is that the intersection of information from both datasets can lead to
the discovery of new biological information. For example, a predicted active site can
be validated with evidence from the set of functional annotations. Although cross-validation demonstrates that the information mined from PDB and from the literature contains
correct results, no correlation was found between the two datasets. Nevertheless, the
text-mined information is valid, and 1,391 catalytic residues were found that can
be used to update CSA.
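The residue identification component summarised above relies on regular expression patterns. The sketch below shows what such a pattern could look like for mentions such as “histidine-57”, “Ser80”, or “aspartic acid-102”, normalised to the form used in the extracted annotations (e.g. HIS57). The pattern, the name table, and the function find_residues are illustrative assumptions, not the patterns implemented in this thesis; the mapping of a residue to a UniProt protein and single-letter mutation notation such as C215S are not covered.

```python
import re

# Three-letter codes and common full names of the 20 standard amino acids.
AMINO_ACIDS = {
    "ALA": ("alanine",), "ARG": ("arginine",), "ASN": ("asparagine",),
    "ASP": ("aspartate", "aspartic acid"), "CYS": ("cysteine",),
    "GLN": ("glutamine",), "GLU": ("glutamate", "glutamic acid"),
    "GLY": ("glycine",), "HIS": ("histidine",), "ILE": ("isoleucine",),
    "LEU": ("leucine",), "LYS": ("lysine",), "MET": ("methionine",),
    "PHE": ("phenylalanine",), "PRO": ("proline",), "SER": ("serine",),
    "THR": ("threonine",), "TRP": ("tryptophan",), "TYR": ("tyrosine",),
    "VAL": ("valine",),
}

# Lookup from any lower-case name or code to the three-letter code.
NAME_TO_CODE = {}
for code, names in AMINO_ACIDS.items():
    NAME_TO_CODE[code.lower()] = code
    for name in names:
        NAME_TO_CODE[name] = code

# Longest names first, so that "aspartic acid" is tried before "asp".
_names = sorted(NAME_TO_CODE, key=len, reverse=True)
RESIDUE_RE = re.compile(
    r"\b(" + "|".join(re.escape(n) for n in _names) + r")[\s-]?(\d+)\b",
    re.IGNORECASE,
)


def find_residues(sentence):
    """Return normalised residue mentions (e.g. 'HIS57') in order of appearance."""
    return [
        NAME_TO_CODE[m.group(1).lower()] + m.group(2)
        for m in RESIDUE_RE.finditer(sentence)
    ]


print(find_residues("histidine-57 has a pK' of 7.3 and aspartate-102 a pK' of 1.4"))
print(find_residues("the catalytic triad Ser80, His235, and Asp207"))
```

A pattern of this kind favours precision over recall: it misses residues written in other notations, and an ambiguous token such as “his” followed by a number would need additional context checks.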
9.2 Limitations and future work
During the work on this thesis, various research techniques and three major analysis
components were developed. Their algorithms and implementations were explained,
their performance was analysed, and suggestions for improvement were made. What
follows is a discussion of possible improvements to the combined dataset analysis.
To biologically validate a predicted functional site with published experimental
results, it has to be assumed that the functional annotations extracted from the literature
provide sufficient supporting evidence for a biological function. This has been shown to
be partly correct for some examples. However, it will probably not work in all cases. My
results suggest that other factors have to be considered in order to achieve one of the
following: (1) a standardised description of the function of protein residues; (2) identification
of a representative functional concept for a structural feature; and (3) verification of the
validity of the pattern as a consensus functional site, where the other protein
examples share the same annotations. Although the verification approach uses the vast
and broad information in MEDLINE, the analysis indicates that this might
not be sufficient for this task.
Another serious limitation of the literature-based verification of functional sites is
that our knowledge of the protein function space could be incomplete or
even incorrect. Protein structure data mining aims to deliver biologically unbiased results,
since 3D pattern mining relies on mathematical models and uses no biological knowledge.
The result is a prediction of functional sites. However, the input is biologically
biased. Currently, we do not have complete knowledge of the fold space, which
means that the distribution of structural features in the input may be skewed. As a consequence,
the prediction may contain a large fraction of false positives. In the long run, various
structural genomics initiatives may expand our knowledge of the fold space.
In the meantime, the literature is the main source of biological evidence to validate
predictions. Yet our knowledge of protein residue function, and even of the spectrum of
biological function, is still incomplete. This can lead to four scenarios: (1) a
true functional site is fully supported by evidence (true positive); (2) a true functional
site is partly supported by evidence (incomplete knowledge); (3) a falsely predicted
functional site is partly supported by evidence (incomplete knowledge); and (4) a falsely
predicted functional site is fully supported by contradictory evidence (false positive).
While, from a bioinformatics point of view, there is little we can do about this problem,
the identification of cases (2), (3), and (4) can suggest further biological experiments
to find the missing data.
Bibliography
[AGM+ 90]
SF Altschul, W Gish, W Miller, EW Myers, and DJ Lipman. Basic local
alignment search tool. Journal of Molecular Biology, 215(3):403–10, 1990.
[AL02]
M Ashburner and SE Lewis. On ontologies for biologists: the gene ontology
- uncoupling the web. Novartis Foundation Symposium, 2002.
[AMS+ 97]
SF Altschul, TL Madden, AA Schaffer, J Zhang, Z Zhang, W Miller, and
DJ Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein
database search programs. Nucleic Acids Research, 25(17):3389–402, 1997.
[APG+ 94]
PJ Artymiuk, AR Poirrette, HM Grindley, DW Rice, and P Willett. A
graph-theoretic approach to the identification of three-dimensional patterns
of amino acid side-chains in protein structures. Journal of Molecular Biology, 243(2):327–44, 1994.
[Att02]
TK Attwood. The PRINTS database: a resource for identification of protein
families. Brief Bioinform, 3(3):252–63, 2002.
[AZP+ 05]
G Ausiello, A Zanzoni, D Peluso, A Via, and M Helmer-Citterich. pdbFun:
mass selection and fast comparison of annotated PDB residues. Nucleic
Acids Research, 33:W133–137, Jul 2005.
[BFL04]
T Binkowski, P Freeman, and J Liang. pvSOAR: detecting similar surface
patterns of pocket and void surfaces of amino acid residues on proteins.
Nucleic Acids Research, 32:555–558, 2004.
[BFW+ 94]
A Barth, K Frost, M Wahab, W Brandt, HD Schadler, and R Franke. Classification of serine proteases derived from steric comparisons of their active
sites, part ii: ”ser, his, asp arrangements in proteolytic and nonproteolytic
proteins”. Drug Design Discovery, 2:89–111, November 1994.
[BGH+ 00]
WC Barker, JS Garavelli, H Huang, PB Mcgarvey, BC Orcutt, GY Srinivasarao, C Xiao, LL Yeh, RS Ledley, JF Janda, F Pfeiffer, HW Mewes,
A Tsugita, and C Wu. The protein information resource (pir). Nucleic
Acids Research, 28(1):41–44, January 2000.
[BKL00]
SE Brenner, P Koehl, and M Levitt. The astral compendium for protein
structure and sequence analysis. Nucleic Acids Research, 28(1):254–256,
January 2000.
[BLK+ 08]
E Beisswanger, V Lee, JJ Kim, D Rebholz-Schuhmann, A Splendiani,
O Dameron, S Schulz, and U Hahn. Gene regulation ontology (gro): design principles and use cases. Studies in health technology and informatics,
136:9–14, 2008.
[BM05]
R Bunescu and RJ Mooney. A shortest path dependency kernel for relation extraction. In Proceedings of the Joint Conference on Human Language Technology / Empirical Methods in Natural Language Processing
(HLT/EMNLP’05), 2005.
[BM06]
R Bunescu and RJ Mooney. Subsequence kernels for relation extraction. In
Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, pages 171–178. MIT Press, 2006.
[BMC08]
BMC. Biomed central. http://www.biomedcentral.com/, November 2008.
[BT03]
JA Barker and JM Thornton. An algorithm for constraint-based structural
template matching: application to 3D templates with statistical analysis.
Bioinformatics, 19(13):1644–1649, September 2003.
[BW03]
PE Bourne and H Weissig. Structural Bioinformatics (Methods of Biochemical Analysis, V. 44). Wiley-Liss, 1 edition, February 2003.
[BW05]
CJO Baker and R Witte. Mutation miner - textual annotation of protein
structures. CERMM Symposium, 2005.
[BWF+ 00]
HM Berman, J Westbrook, Z Feng, G Gilliland, TN Bhat, H Weissig,
IN Shindyalov, and PE Bourne. The protein data bank. Nucleic Acids
Research, 28(1):235–242, January 2000.
[CB94]
RR Copley and GJ Barton. A structural analysis of phosphate and sulphate
binding sites in proteins. Estimation of propensities for binding and conservation of phosphate binding sites. Journal of Molecular Biology, 242:321–
329, Sep 1994.
[CCR+ 08]
BL Cantarel, PM Coutinho, C Rancurel, T Bernard, V Lombard, and
B Henrissat. The Carbohydrate-Active EnZymes database (CAZy): an
expert resource for Glycogenomics. Nucleic Acids Research, Oct 2008.
[Cer00]
F Cerbah. Exogenous and endogenous approaches to semantic categorization of unknown technical terms. In Proceedings of the 18th International Conference on Computational Linguistics (COLING), pages 145–151,
2000.
[CFK+ 05]
BY Chen, VY Fofanov, DM Kristensen, M Kimmel, O Lichtarge, and
LE Kavraki. Algorithms for structural comparison and statistical analysis
of 3D protein motifs. Pacific Symposium on Biocomputing, pages 334–345,
2005.
[Cha93]
P Chakrabarti. Anion binding sites in protein structures. Journal of Molecular Biology, 234:463–482, Nov 1993.
[CHR+ 02]
JM Castagnetto, SW Hennessy, VA Roberts, ED Getzoff, JA Tainer, and
ME Pique. Mdb: the metalloprotein database and browser at the scripps
research institute. Nucleic Acids Research, 30(1):379–382, January 2002.
[CK06]
IG Choi and SH Kim. Evolution of protein structural classes and protein sequence families. Proceedings of the National Academy of Sciences,
September 2006.
[CL64]
RV Cochran and LH Lund. On the kirkwood superposition approximation.
Journal of Physical Chemistry, 1964.
[CMP05]
J Crim, R McDonald, and F Pereira. Automatically annotating documents
with normalized gene lists. BMC Bioinformatics, 6 Suppl 1, 2005.
[CMR06]
P Corbett and P Murray-Rust. High-throughput identification of chemistry
in life science texts. In Computational Life Sciences II, pages 107–118.
Springer, 2006.
[CSL+ 06]
FM Couto, MJ Silva, V Lee, E Dimmer, E Camon, R Apweiler, H Kirsch,
and D Rebholz-Schuhmann. Goannotator: linking protein go annotations
to evidence text. Journal of Biomedical Discovery and Collaboration, 1:19+,
December 2006.
[DBAD03]
R Day, DA Beck, RS Armen, and V Daggett. A consensus view of fold
space: combining SCOP, CATH, and the Dali Domain Dictionary. Protein
Science, 12:2150–2160, Oct 2003.
[DCG+ 04]
F Diella, S Cameron, C Gemuend, R Linding, A Via, B Kuster, ST Ponten,
N Blom, and TJ Gibson. Phospho.elm: a database of experimentally verified
phosphorylation sites in eukaryotic proteins. BMC Bioinformatics, 5, June
2004.
[DS05]
A Doms and M Schroeder. Gopubmed: exploring pubmed with the gene
ontology. Nucleic Acids Research, 33(Web Server issue), July 2005.
[FGS98]
JS Fetrow, A Godzik, and J Skolnick. Functional analysis of the escherichia
coli genome using the sequence-to-structure-to-function paradigm: identification of proteins exhibiting the glutaredoxin/thioredoxin disulfide oxidoreductase activity. Journal of Molecular Biology, 282(4):703–711, October
1998.
[FKY+ 01]
C Friedman, P Kra, H Yu, M Krauthammer, and A Rzhetsky. Genies: a
natural-language processing system for the extraction of molecular pathways
from journal articles. Bioinformatics, 17 Suppl 1, 2001.
[Fri07]
D Frishman. Protein annotation at genomic scale: the current status. Chem
Rev, 107(8):3448–3466, August 2007.
[FS98]
JS Fetrow and J Skolnick. Method for prediction of protein function from sequence using the sequence-to-structure-to-function paradigm with application to glutaredoxins/thioredoxins and T1 ribonucleases. Journal of Molecular Biology, 281(5), September 1998.
[Fuk98]
K Fukuda. Toward information extraction: identifying protein names from
biological papers, 1998.
[FWLN94]
D Fischer, H Wolfson, SL Lin, and R Nussinov. Three-dimensional, sequence order-independent structural comparison of a serine protease against
the crystallographic database reveals active site similarities: potential implications to evolution and to protein folding. Protein Science, 3(5):769–778,
May 1994.
[GDAW03]
R Gaizauskas, G Demetriou, PJ Artymiuk, and P Willett. Protein structures and information extraction from biological texts: the pasta system.
Bioinformatics, 19(1):135–143, January 2003.
[GDO+ 05]
A Golovin, D Dimitropoulos, TJ Oldfield, A Rachedi, and K Henrick.
Msdsite: A database search and retrieval system for the analysis and viewing of bound ligands and active sites. Proteins: Structure, Function, and
Bioinformatics, 58(1):190–199, 2005.
[GH08]
A Golovin and K Henrick. Msdmotif: exploring protein sites and motifs.
BMC Bioinformatics, 9(1), 2008.
[GJYLRS08]
S Gaudan, A Jimeno Yepes, V Lee, and D Rebholz-Schuhmann. Combining
evidence, specificity, and proximity towards the normalization of gene ontology terms in text. EURASIP journal on bioinformatics & systems biology,
2008.
[Glu91]
JP Glusker. Structural aspects of metal liganding to functional groups in
proteins. Advances in Protein Chemistry, 42:1–76, 1991.
[GOC06]
GOConsortium. The gene ontology (go) project in 2006. Nucleic Acids
Research, 34(Database issue), January 2006.
[GPP+ 03]
F Glaser, T Pupko, I Paz, RE Bell, D Bechor-Shental, E Martz, and N BenTal. ConSurf: identification of functional regions in proteins by surfacemapping of phylogenetic information. Bioinformatics, 19(1):163–164, January 2003.
[Gue96]
F Guenthner. Electronic lexica and corpora research at cis. CIS Bericht96-100, 1996.
[HBB+ 08]
N Hulo, A Bairoch, V Bulliard, L Cerutti, BA Cuche, E de Castro,
C Lachaize, PS Langendijk-Genevaux, and CJ Sigrist. The 20 years of
PROSITE. Nucleic Acids Research, 36:D245–249, Jan 2008.
[HBGK03]
M Hendlich, A Bergner, J Günther, and G Klebe. Relibase: design and
development of a database for comprehensive analysis of protein-ligand interactions. Journal of Molecular Biology, 326(2):607–620, February 2003.
[HFM+ 05]
D Hanisch, K Fundel, HT Mevissen, R Zimmer, and J Fluck. Prominer:
rule-based protein and gene entity recognition. BMC Bioinformatics, 6
Suppl 1, 2005.
[HJ99]
C Hadley and DT Jones. A systematic comparison of protein structure
classifications: SCOP, CATH and FSSP. Structure, 7:1099–1112, Sep 1999.
[HLC04]
F Horn, AL Lau, and FE Cohen. Automated extraction of mutation data
from the literature: application of mutext to g protein-coupled receptors and
nuclear hormone receptors. Bioinformatics, 20(4):557–568, March 2004.
[HMBC97]
TJ Hubbard, AG Murzin, SE Brenner, and C Chothia. SCOP: a structural
classification of proteins database. Nucleic Acids Research, 25:236–239, Jan
1997.
[HNR+ 05]
ZZ Hu, M Narayanaswamy, KE Ravikumar, K Vijay-Shanker, and CH Wu.
Literature mining and database annotation of protein phosphorylation using
a rule-based system. Bioinformatics, 21(11):2759–2765, June 2005.
[Hob02]
JR Hobbs. Information extraction from biomedical text. Journal of Biomedical Informatics, 35(4):260–264, August 2002.
[HPS+ 03]
A Harrison, F Pearl, I Sillitoe, T Slidel, R Mott, JM Thornton, and
CA Orengo. Recognizing the fold of a protein structure. Bioinformatics,
19(14):1748–1759, September 2003.
[HS94]
L Holm and C Sander. The fssp database of structurally aligned protein
fold families. Nucleic Acids Research, 22(17):3600–3609, September 1994.
[HS96]
L Holm and C Sander. Mapping the protein universe. Science,
273(5275):595–603, August 1996.
[HSSS92]
U Hobohm, M Scharf, R Schneider, and C Sander. Selection of representative protein data sets. Protein Science, 1(3):409–417, March 1992.
[HZH+ 04]
M Huang, X Zhu, Y Hao, DG Payan, K Qu, and M Li. Discovering patterns to extract protein-protein interactions from full texts. Bioinformatics,
20(18):3604–3612, December 2004.
[IPGK05]
VA Ivanisenko, SS Pintus, DA Grigorovich, and NA Kolchanov. PDBSite:
a database of the 3D structure of protein functional sites. Nucleic Acids
Research, 33:D183–187, Jan 2005.
[JB04]
A Jakulin and I Bratko. Testing the significance of attribute interactions.
In ICML, pages 409–416. ACM Press, 2004.
[JGLRS08]
S Jaeger, S Gaudan, U Leser, and D Rebholz-Schuhmann. Integrating
protein-protein interactions and text mining for protein function prediction.
BMC Bioinformatics, 9(Suppl 8), 2008.
[JIDG03]
M Jambon, A Imberty, G Deléage, and C Geourjon. A new bioinformatic approach to detect common 3d sites in protein structures. Proteins:
Structure, Function, and Genetics, 52:137–145, 2003.
[JK95]
J Justeson and S Katz. Technical terminology: some linguistic properties
and an algorithm for identification in text. Natural Language Engineering,
pages 9–27, 1995.
[KCRB07]
R Kanagasabai, KH Choo, S Ranganathan, and CJ Baker. A workflow for
mutation extraction and structure annotation. Journal of Bioinformatics
and Computational Biology, 5(6):1319–1337, December 2007.
[KH04]
E Krissinel and K Henrick. Secondary-structure matching (ssm), a new tool
for fast protein structure alignment in three dimensions. Acta Crystallographica Section D: Biological Crystallography, 60(1):2256–2268, December
2004.
[KJ94]
GJ Kleywegt and TA Jones. Detection, delineation, measurement and display of cavities in macromolecular structures. Acta Crystallographica Section
D: Biological Crystallography, 50(Pt 2):178–185, March 1994.
[Kle99]
GJ Kleywegt. Recognition of spatial motifs in protein structures. Journal
of Molecular Biology, 285(4):1887–1897, January 1999.
[KN03]
K Kinoshita and H Nakamura. Identification of protein biochemical functions by similarity search using the molecular surface database ef-site. Protein Science, 12(8):1589–1595, August 2003.
[KNT05]
A Koike, Y Niwa, and T Takagi. Automatic extraction of gene/protein
biological functions from biomedical text. Bioinformatics, 21(7):1227–1236,
April 2005.
[KON99]
T Kawabata, M Ota, and K Nishikawa. The protein mutant database.
Nucleic Acids Research, 27(1):355–357, January 1999.
[Las95]
RA Laskowski. Surfnet: a program for visualizing molecular surfaces, cavities, and intermolecular interactions. Journal of Molecular Graphics, 13(5),
October 1995.
[LC05]
G Leroy and H Chen. Genescene: An ontology-enhanced integration of
linguistic and co-occurrence based relations in biomedical texts: Research
articles. Journal of the American Society for Information Science and Technology, 56(5):457–468, March 2005.
[LCM03]
G Leroy, H Chen, and JD Martinez. A shallow parser based on closed-class words to capture relations in biomedical text. Journal of Biomedical
Informatics, pages 145–158, June 2003.
[LEW98]
J Liang, H Edelsbrunner, and C Woodward. Anatomy of protein pockets
and cavities: measurement of binding site geometry and implications for
ligand design. Protein Science, 7(9):1884–1897, September 1998.
[LHC07]
LC Lee, F Horn, and FE Cohen. Automatic extraction of protein point
mutations using a graph bigram association. PLoS Computational Biology,
3(2):e16+, February 2007.
[LRTV07]
Gonzalo Lopez, Ana Rojas, Michael Tress, and Alfonso Valencia. Assessment of predictions submitted for the CASP7 function prediction category.
Proteins, 69 Suppl 8:165–74, 2007.
[LW91]
Y Lamdan and HJ Wolfson. On the error analysis of 'geometric hashing'. Computer Vision and Pattern
Recognition, 1991. Proceedings CVPR ’91., IEEE Computer Society Conference on, pages 22–27, June 1991.
[Mar05]
AC Martin. Mapping pdb chains to uniprotkb entries. Bioinformatics,
21(23):4297–4301, December 2005.
[MB99]
Y Matsuo and SH Bryant. Identification of homologous core structures.
Proteins, 35:70–79, Apr 1999.
[MG03]
J McCallum and S Ganesh. Text mining of DNA sequence homology searches. Applied Bioinformatics, 2:59–63, 2003.
[MR03]
S Mika and B Rost. UniqueProt: Creating representative protein sequence
sets. Nucleic Acids Research, 31:3789–3791, Jul 2003.
[MSD08]
MSDmapping. Msdmapping. http://www.ebi.ac.uk/msd-as/MSDMapping/, November 2008.
[MT05]
Y Miyao and J Tsujii. Probabilistic disambiguation models for wide-coverage hpsg parsing. In ACL ’05: Proceedings of the 43rd Annual Meeting
on Association for Computational Linguistics, pages 83–90. Association for
Computational Linguistics, 2005.
[NBD+ 06]
J Natarajan, D Berrar, W Dubitzky, C Hack, Y Zhang, C Desesa,
JR Van Brocklyn, and EG Bremer. Text mining of full-text journal articles combined with gene expression analysis reveals a relationship between
sphingosine-1-phosphate and invasiveness of a glioblastoma cell line. BMC
Bioinformatics, 7:373+, August 2006.
[NED03]
S Novichkova, S Egorov, and N Daraselia. Medscan, a natural language
processing engine for medline abstracts. Bioinformatics, 19(13):1699–1706,
September 2003.
[OCR01]
MJ Ondrechen, JG Clifton, and D Ringe. Thematics: A simple computational predictor of enzyme function from structure. Proceedings of the
National Academy of Sciences, 98(22):12473–12478, October 2001.
[Old01]
TJ Oldfield. Creating structure features by data mining the PDB to use as
molecular-replacement models. Acta Crystallographica Section D: Biological
Crystallography, 57:1421–1427, Oct 2001.
[Old02]
TJ Oldfield. Data mining the protein data bank: residue interactions. Proteins, 49(4):510–528, December 2002.
[OMJ+ 97]
CA Orengo, AD Michie, S Jones, DT Jones, MB Swindells, and JM Thornton. CATH-a hierarchic classification of protein domain structures. Structure, 5:1093–1108, Aug 1997.
[PB06]
BJ Polacco and PC Babbitt. Automated discovery of 3d motifs for protein
function annotation. Bioinformatics, 22(6):723–730, March 2006.
[PBT04]
CT Porter, GJ Bartlett, and JM Thornton. The Catalytic Site Atlas: a
resource of catalytic sites and residues identified in enzymes using structural
data. Nucleic Acids Research, 32(Database issue), January 2004.
[PJYLRS08]
P Pezik, A Jimeno Yepes, V Lee, and D Rebholz-Schuhmann. Static dictionary features for term polysemy identification. Building and evaluating
resources for biomedical text mining, LREC Workshop, 2008.
[PKS06]
G Pandey, V Kumar, and M Steinbach. Computational approaches for
protein function prediction: A survey. Technical Report 06-028, Department
of Computer Science and Engineering, University of Minnesota, Twin Cities,
2006.
[Plo08]
PloS. Public library of science. http://www.plos.org/, November 2008.
[PMC08]
PMC. Pubmed central. http://www.pubmedcentral.nih.gov/, November
2008.
[POHS05]
M Pesu, J O’Shea, L Hennighausen, and O Silvennoinen. Identification of an
acquired mutation in Jak2 provides molecular insights into the pathogenesis
of myeloproliferative disorders. Molecular Interventions, 5:211–215, Aug
2005.
[RMK+ 07]
ND Rawlings, FR Morton, CY Kok, J Kong, and AJ Barrett. Merops: the
peptidase database. Nucleic Acids Research, pages gkm954+, November
2007.
[Ros99]
B Rost. Twilight zone of protein sequence alignments. Protein Engineering
Design and Selection, 12(2):85–94, February 1999.
[RSAG+ 08]
D Rebholz-Schuhmann, M Arregui, S Gaudan, H Kirsch, and A Jimeno Yepes. Text processing through web services: Calling whatizit. Bioinformatics, 2008.
[RSKA+ 07]
D Rebholz-Schuhmann, H Kirsch, M Arregui, S Gaudan, M Riethoven, and
P Stoehr. Ebimed-text crunching to gather facts for proteins from medline.
Bioinformatics, 23(2), January 2007.
[RSMA+ 04]
D Rebholz-Schuhmann, S Marcel, S Albert, R Tolle, G Casari, and H Kirsch.
Automatic extraction of mutations from medline and cross-validation with
omim. Nucleic Acids Research, 2004.
[Rus98]
RB Russell. Detection of protein three-dimensional side-chain patterns:
new examples of convergent evolution. Journal of Molecular Biology, 279(5):1211–1227, June 1998.
[SAR+ 07]
B Smith, M Ashburner, C Rosse, K Bard, W Bug, W Ceusters, LJ Goldberg,
K Eilbeck, A Ireland, CJ Mungall, N Leontis, P Rocca-Serra, A Ruttenberg,
SA Sansone, RH Scheuermann, N Shah, PL Whetzel, and S Lewis. The
OBO Foundry: coordinated evolution of ontologies to support biomedical
data integration. Nature Biotechnology, 25(11):1251–5, 2007.
[SB05]
A Schutz and P Buitelaar. Relext: A tool for relation extraction from text
in ontology extension. The Semantic Web - ISWC 2005, pages 593–606,
2005.
[SB06]
J Schuman and S Bergler. Postnominal prepositional phrase attachment
in proteomics. In Proceedings of the HLT-NAACL BioNLP Workshop on
Linking Natural Language and Biology. Association for Computational Linguistics, 2006.
[SDC06]
A Sidhu, T Dillon, and E Chang. Unification of protein data and knowledge
sources. Knowledge-Based Intelligent Information and Engineering Systems,
pages 728–737, 2006.
[Sin04]
A Singer. Maximum entropy formulation of the Kirkwood superposition
approximation. Journal of Chemical Physics, 121:3657–3666, Aug 2004.
[SPIBA03]
PK Shah, C Perez-Iratxeta, P Bork, and MA Andrade. Information extraction from full text scientific articles: where are the keywords? BMC
Bioinformatics, 4(1), May 2003.
[SPNW04]
A Shulman-Peleg, R Nussinov, and HJ Wolfson. Recognition of functional
sites in protein structures. Journal of Molecular Biology, 339(3):607–633,
June 2004.
[SS96]
R Schneider and C Sander. The HSSP database of protein structure-sequence alignments. Nucleic Acids Research, 24(1):201–5, 1996.
[SSR03]
A Stark, S Sunyaev, and RB Russell. A model for statistical significance of
local similarities in structure. Journal of Molecular Biology, 326(5):1307–
1316, March 2003.
[STB06]
MH Saier, CV Tran, and RD Barabote. Tcdb: the transporter classification
database for membrane transport protein analyses and information. Nucleic
Acids Research, 34(Database issue), January 2006.
[SWS+ 04]
MJ Schuemie, M Weeber, BJ Schijvenaars, EM van Mulligen, CC van der
Eijk, R Jelier, B Mons, and JA Kors. Distribution of information in biomedical abstracts and full-text publications. Bioinformatics, 20(16):2597–2604,
November 2004.
[SYH+ 03]
S Saito, H Yamaguchi, Y Higashimoto, C Chao, Y Xu, AJ Fornace, E Appella, and CW Anderson. Phosphorylation site interdependence of human
p53 post-translational modifications in response to stress. Journal of Biological Chemistry, 278:37536–37544, Sep 2003.
[TCS+ 07]
RT Tsai, WC Chou, YS Su, YC Lin, CL Sung, HJ Dai, IT Yeh, W Ku,
TY Sung, and WL Hsu. Biosmile: A semantic role labeling system for
biomedical verbs using a maximum-entropy model with automatically generated template features. BMC Bioinformatics, 8:325+, September 2007.
[TMA08]
Y Tsuruoka, J Mcnaught, and S Ananiadou. Normalizing biomedical terms
by minimizing ambiguity and variability. BMC Bioinformatics, 9(Suppl 3),
2008.
[TOT04]
Y Tateisi, T Ohta, and J Tsujii. Annotation of predicate-argument structure
on molecular biology text. In First International Joint Conference on Natural Language Processing In the IJCNLP-04 workshop on Beyond Shallow
Analyses, March 2004.
[TW02]
L Tanabe and WJ Wilbur. Tagging gene and protein names in biomedical
text. Bioinformatics, 18(8):1124–1132, August 2002.
[VMMR+ 05]
S Velankar, P McNeil, V Mittard-Runte, A Suarez, D Barrell, R Apweiler,
and K Henrick. E-msd: an integrated data resource for bioinformatics.
Nucleic Acids Research, 33(Database issue), January 2005.
[VZHC05]
A Via, A Zanzoni, and M Helmer-Citterich. Seq2Struct: a resource for
establishing sequence-structure links. Bioinformatics, 21(4):551–3, 2005.
[WAB+ 06]
CH Wu, R Apweiler, A Bairoch, DA Natale, WC Barker, B Boeckmann, S Ferro, E Gasteiger, H Huang, R Lopez, M Magrane, MJ Martin, R Mazumder, C O’Donovan, N Redaschi, and B Suzek. The universal
protein resource (uniprot): an expanding universe of protein information.
Nucleic Acids Research, 34(Database issue), January 2006.
[WBB+ 06]
DL Wheeler, T Barrett, DA Benson, SH Bryant, K Canese, V Chetvernin,
DM Church, M Dicuccio, R Edgar, S Federhen, LY Geer, W Helmberg,
Y Kapustin, DL Kenton, O Khovayko, DJ Lipman, TL Madden, DR Maglott, J Ostell, KD Pruitt, GD Schuler, LM Schriml, E Sequeira, ST Sherry,
K Sirotkin, A Souvorov, G Starchenko, TO Suzek, R Tatusov, TA Tatusova,
L Wagner, and E Yaschenko. Database resources of the national center
for biotechnology information. Nucleic Acids Research, 34(Database issue),
January 2006.
[WBT97]
AC Wallace, N Borkakoti, and JM Thornton. Tess: a geometric hashing algorithm for deriving 3d coordinate templates for searching structural
databases. application to enzyme active sites. Protein Science, 6(11):2308–
2323, November 1997.
[WD03]
G Wang and RL Dunbrack. Pisces: a protein sequence culling server. Bioinformatics, 19(12):1589–1591, August 2003.
[WK07]
R Witte and T Kappler. Enhanced semantic access to the protein engineering literature using ontologies populated by text mining. International
Journal of Bioinformatics Research and Applications, 2007.
[WR97]
HJ Wolfson and I Rigoutsos. Geometric hashing: an overview. Computational Science and Engineering, IEEE [see also Computing in Science &
Engineering], 4(4):10–21, 1997.
[WSC04]
T Wattarujeekrit, PK Shah, and N Collier. Pasbio: predicate-argument
structures for event extraction in molecular biology. BMC Bioinformatics,
5, October 2004.
[YEC+ 07]
S Yoon, JC Ebert, EY Chung, G De Micheli, and RB Altman. Clustering
protein environments for function prediction: finding prosite motifs in 3d.
BMC Bioinformatics, 8 Suppl 4, 2007.
[YHF+ 02]
H Yu, V Hatzivassiloglou, C Friedman, A Rzhetsky, and WJ Wilbur. Automatic extraction of gene and protein synonyms from medline and journal
articles. Proceedings of the AMIA Symposium, pages 919–923, 2002.
[YLPV07]
YL Yip, N Lachenal, V Pillet, and AL Veuthey. Retrieving mutation-specific information for human proteins in UniProt/Swiss-Prot Knowledgebase. Journal of Bioinformatics and Computational Biology, 5:1215–1231,
Dec 2007.
[YMTT05]
A Yakushiji, Y Miyao, Y Tateisi, and J Tsujii. Biomedical information
extraction with predicate-argument structure patterns. In SMBM, 2005.
Appendix A
Examples of errors in relation extraction
Table A.1: Examples of errors in the relation extraction for the detection of contextual features.

Example 1 (TP shallow parsing, FP full parsing)
Sentence: “This observation provides a rationale for the reduced electron-transfer efficiency displayed by the E92K mutant.” (PMID:10089511)
Annotated residue: GLU92
Annotated keywords: reduced electron-transfer efficiency
Annotated PAS:
  pred = displayed
  arg1 = the reduced electron-transfer efficiency
  arg2-by = the E92K mutant
TP shallow parsing:
  pred = displayed
  arg1 = a rationale
  arg1-for = the reduced electron-transfer efficiency
  arg2-by = the GLU92 LYS mutant
FP full parsing:
  pred = displayed
  arg1-by = the GLU92 LYS mutant

Example 2 (FP shallow parsing, TP full parsing)
Sentence: “An apparent ’acceptor consensus overlap’ at Ser474 suggests that the mechanism behind the glycosaminoglycan split of TM may involve a competition for substrate between xylosyltransferase and N-acetylgalactosaminyltransferase.” (PMID:8216207)
Annotated residue: SER474
Annotated keywords: acceptor consensus overlap
Annotated PAS:
  pred = suggests
  arg1 = An apparent ’acceptor consensus overlap’
  arg1-at = SER474
  arg2 = the mechanism behind the glycosaminoglycan split
  arg2-of = TM
FP shallow parsing:
  pred = suggests
  arg1-at = SER474
  arg2 = that the mechanism
  arg2-behind = the glycosaminoglycan split
  arg2-of =
TP full parsing:
  pred = suggests
  arg1 = An apparent ’acceptor consensus overlap’
  arg1-at = SER474
  arg2 = that the mechanism
  arg2-behind = the glycosaminoglycan split
  arg2-of = TM

Example 3 (FP shallow parsing, FP full parsing)
Sentence: “Using this approach, coupled with Edman degradation of the 32PO4-labeled tryptic peptides, and comparison with tryptic peptides analyzed after labeling normal human colonic tissues, we identified ser-52 as the major K18 physiologic phosphorylation site.” (PMID:7523419)
Annotated residue: SER52
Annotated keywords: physiologic phosphorylation site
Annotated PAS:
  pred = identified
  arg1 = unk
  arg2 = SER52
  arg2-as = the major K18 physiologic phosphorylation site
FP shallow parsing:
  pred = identified
  arg2 = SER52
  arg2-as = the major
FP full parsing:
  pred = identified
  arg1 = we
  arg2 = SER52
Appendix B
Examples of extracted functional annotations compared with UniProtKB
Table B.1: Comparison of extracted protein residue annotations from GC with UniProtKB. Mined functional annotations are listed as PAS, while relevant information from UniProtKB is reproduced from the feature table (FT) entry line.

RID+UID: SER15 P53_HUMAN
Sentence: “Previous studies have demonstrated that phosphorylation of <o>human</o> <p>p53</p> on <r>serine 15</r> contributes to <a>protein stabilization</a> after DNA damage and that this is mediated by the <p>ATM family of kinases</p>.” (PMID:11865061)
UniProtKB/FT:
  SER15 MOD_RES: Phosphoserine; by PRPK
  SER15 VARIANT: S->R in a sporadic cancer; somatic mutation.
PAS:
  pred = contributes
  arg1 =
  arg1-on = SER15
  arg2 =
  arg2-to = protein stabilization
  arg2-after = DNA damage and that

RID+UID: GLU189 CP27B_HUMAN, LEU343 CP27B_HUMAN
Sentence: “The <r>R389G</r> mutant was totally inactive,but mutant <r>L343F</r> retained 2.3% of <a>wild-type activity</a>,and mutant <r>E189G</r> retained 22% of <a>wild-type activity</a>.” (PMID:12050193)
UniProtKB/FT:
  GLU189 VARIANT: E->K in VDDR I; 11% of wild-type activity.
  LEU343 VARIANT: L->F in VDDR I; 2.3% of wild-type activity.
PAS:
  pred = retained
  arg1 = but mutant LEU343 PHE
  arg2 = 2.3 %
  arg2-of = wild-type activity
PAS:
  pred = retained
  arg1 = and mutant GLU189 GLY
  arg2 = 22 %
  arg2-of = wild-type activity

RID+UID: CYS260 TGA1_ARATH, CYS266 TGA1_ARATH
Sentence: “Furthermore,site-directed mutagenesis of <p>TGA1</p> <r>Cys-260</r> and <r>Cys-266</r> enables <a>the interaction</a> with <p>NPR1</p> in <o>yeast</o> and <o>Arabidopsis</o>.” (PMID:12953119)
UniProtKB/FT:
  C260/C266 DISULFID: (potential).
  C260 MUTAGEN: C->N; Gain of interaction with NPR1; when associated with S-266.
  C266 MUTAGEN: C->S: Gain of interaction with NPR1; when associated with S-260.
PAS:
  pred = enables
  arg1 = site-directed mutagenesis
  arg1-of = TGA1 CYS260 and CYS266
  arg2 = the interaction
  arg2-with = NPR1
  arg2-in = yeast and Arabidopsis

RID+UID: THR13 RUM1_SCHPO, SER19 RUM1_SCHPO
Sentence: “Direct in vitro kinase assay using GST-fusion proteins of wild-type as well as various mutants of <p>p25</p>(<p>rum1</p>) demonstrated that <p>MAPK</p> phosphorylates the N-terminal portion of <p>p25</p>(<p>rum1</p>) and residues <r>Thr13</r> and <r>Ser19</r> are <a>major phosphorylation sites</a> for <p>MAPK</p>.” (PMID:12135491)
UniProtKB/FT:
  THR13 MOD_RES: Phosphothreonine; by MAPK
  SER19 MOD_RES: Phosphoserine; by MAPK
  SER19 MUTAGEN: S->E:reduces activity as a cdc2 inhibitor; when associated with E-13
PAS:
  pred = are
  arg1 = the N-terminal portion
  arg1-of = p25(rum1) and residues THR13 and SER19
  arg2 = major phosphorylation sites
  arg2-for = MAPK

RID+UID: THR13 RUM1_SCHPO, SER19 RUM1_SCHPO
Sentence: “Together with the fact that replacement of both <r>Thr13</r> and <r>Ser19</r> with Glu,which mimics the phosphorylated state of these residues,also significantly reduces the activity of <p>p25</p>(<p>rum1</p>) as a <p>Cdc2 inhibitor</p>,it was suggested that <a>the phosphorylation</a> of <r>Thr13</r> and <r>Ser19</r> negatively regulates <a>the function</a> of <p>p25</p>(<p>rum1</p>).” (PMID:12135491)
UniProtKB/FT:
  THR13 N/A
  SER19 N/A
PAS:
  pred = suggested
  arg2 = that the phosphorylation
  arg2-of = THR13 and SER19
PAS:
  pred = regulates
  arg1 = that the phosphorylation
  arg1-of = THR13 and SER19
  arg2 = the function
  arg2-of = p25(rum1)

RID+UID: THR13 RUM1_SCHPO, SER19 RUM1_SCHPO
Sentence: “Further evidence indicates that <a>phosphorylation</a> of <r>Thr13</r> and <r>Ser19</r> may retain <a>a negative effect</a> on the function of <p>p25</p>(<p>rum1</p>) even in vivo.” (PMID:12135491)
UniProtKB/FT:
  THR13 N/A
  SER19 N/A
PAS:
  pred = retain
  arg1 = that
  arg1-of = THR13 and SER19
  arg2 = a negative effect
  arg2-on = the function
  arg2-of = p25(rum1)

RID+UID: GLU55 DHMA_MYCAV, ASP123 DHMA_MYCAV, TRP124 DHMA_MYCAV
Sentence: “Many residues essential for the dehalogenation reaction are conserved in <p>DhmA</p>;the <a>putative catalytic triad</a> consists of <r>Asp123</r>,<r>His279</r>,and <r>Asp250</r>,and the <a>putative oxyanion hole</a> consists of <r>Glu55</r> and <r>Trp124</r>.” (PMID:12147465)
UniProtKB/FT:
  GLU55 N/A
  ASP123 ACT_SITE: Nucleophile (by similarity).
  TRP124 N/A
PAS:
  pred = consists
  arg1 = the putative catalytic triad
  arg2 =
  arg2-of = ASP123
PAS:
  pred = consists
  arg1 = and the putative oxyanion hole
  arg2 =
  arg2-of = GLU55 and TRP124

RID+UID: CYS48 THIO_RAT, CYS152 THIO_RAT, CYS73 THIO_RAT
Sentence: “Thus,<p>PrxV</p> mutants lacking <r>Cys(48)</r> or <r>Cys(152)</r> showed no detectable <a>thioredoxin-dependent peroxidase activity</a>,whereas mutation of <r>Cys(73)</r> had no <a>effect on activity</a>.” (PMID:10751410)
UniProtKB/FT:
  N/A
PAS:
  pred = showed
  arg1 = CYS48 or CYS152
  arg2 = no detectable thioredoxin-dependent peroxidase activity
PAS:
  pred = had
  arg1 = whereas mutation
  arg1-of = CYS73
  arg2 = no effect on activity

RID+UID: GLY43 PPCS_HUMAN
Sentence: “Highly <a>conserved ATP binding residues</a> include <r>Gly43</r>,<r>Ser61</r>,<r>Gly63</r>,<r>Gly66</r>,<r>Phe230</r>,and <r>Asn258</r>.” (PMID:12906824)
UniProtKB/FT:
  N/A
PAS:
  pred = include
  arg1 = conserved ATP binding residues
  arg2 = GLY43

RID+UID: ASN59 PPCS_HUMAN
Sentence: “Highly <a>conserved phosphopantothenate binding residues</a> include <r>Asn59</r>,<r>Ala179</r>,<r>Ala180</r>,and <r>Asp183</r> from one monomer and <r>Arg55’</r> from the adjacent monomer.” (PMID:12906824)
UniProtKB/FT:
  N/A
PAS:
  pred = include
  arg1 = conserved phosphopantothenate binding residues
  arg2 = ASN59

RID+UID: GLU50 SHD_HUMAN, GLU51 SHD_HUMAN
Sentence: “<p>Rab3A</p> <a>binding-defective mutants</a> of <p>rabphilin</p> (<r>E50A</r>) and <p>Noc2</p>( <r>E51A</r>) were still localized in the distal portion of the neurites (where dense-core vesicles had accumulated) in nerve growth factor-differentiated PC12 cells,the same as the wild-type proteins,whereas <p>Rab27A</p> binding-defective mutants of <p>rabphilin</p> ( <r>E50A</r>/<r>I54A</r>) and <p>Noc2</p>( <r>E51A</r>/<r>I55A</r>) were present throughout the cytosol.” (PMID:14722103)
UniProtKB/FT:
  N/A
PAS:
  pred = localized
  arg1 = Rab3A binding-defective mutants
  arg1-of = rabphilin ( GLU50 ALA ) and Noc2 ( GLU51 ALA )
  arg2 =
  arg2-in = the distal portion
  arg2-of = the neurites ( where dense-core vesicles

RID+UID: TRP124 DHMA_MYCAV
Sentence: “<r>Trp124</r> should be involved in <a>substrate binding</a> and <a>product (halide) stabilization</a>,while the second halide-stabilizing residue cannot be identified from a comparison of the <p>DhmA</p> sequence with the sequences of three <p>dehalogenases</p> with known tertiary structures.” (PMID:12147465)
UniProtKB/FT:
  N/A
PAS:
  pred = involved
  arg1 = TRP124
  arg2 =
  arg2-in = substrate binding and product (halide) stabilization
Appendix C
Examples of extracted functional annotations for the protein p53
Table C.1: Examples of literature-mined annotations of protein residues in p53. The listed data are grouped by topic.

regulatory PTM

RID+UID: SER6 P53_HUMAN
PMID: 10930428
PAS:
  pred = creased
  arg1 = a background
  arg1-of = constitutive phosphorylation
  arg1-at = SER6 that
  arg2 = 10-fold
  arg2-upon = upon exposure
  arg2-to = either ionizing radiation or UV light
PAS:
  pred = exhibited
  arg1 = Untreated A549 cells
  arg2 = a background
  arg2-of = constitutive phosphorylation
  arg2-at = SER6 that
PAS:
  pred = is
  arg1 = The relative phosphorylation
  arg1-of = THR18
  arg1-by = VRK2B
  arg2 = similar
  arg2-in = magnitude
  arg2-to = that induced
  arg2-by = taxol

RID+UID: THR18 P53_HUMAN
PMID: 12487430
PAS:
  pred = compared
  arg1 = that phosphorylation
  arg1-at = THR18 decreased binding
  arg1-to = recombinant Mdm2 protein
  arg2 =
  arg2-with = the unphosphorylated and the two other single phosphorylated analogues

RID+UID: SER46 P53_HUMAN
PMID: 11030628
PAS:
  pred = regulates
  arg1 = and phosphorylation
  arg1-of = SER46
  arg2 = the transcriptional activation
  arg2-of = this apoptosis-inducing gene

RID+UID: SER46 P53_HUMAN
PMID: 11875057
PAS:
  pred = hibited
  arg1 = IR-induced phosphorylation
  arg1-at = SER46
  arg2 =
  arg2-by = wortmannin

RID+UID: SER15 P53_HUMAN
PMID: 14757188
PAS:
  pred = duce
  arg1 =
  arg1-in = synergy
  arg2 = ATM-mediated phosphorylation
  arg2-of = the SER15 site
  arg2-of =

RID+UID: SER15 P53_HUMAN
PMID: 17292432
PAS:
  pred = suppressed
  arg2 = both NaVO(3)-induced SER15 phosphorylation and accumulation
  arg2-of =

RID+UID: SER15 P53_HUMAN
PMID: 11850826
PAS:
  pred = observed
  arg1 = Increased phosphorylation
  arg1-of = SER15
  arg2 =
  arg2-in = heat shocked GM638

RID+UID: THR55 P53_HUMAN
PMID: 10933801
PAS:
  pred = define
  arg1 = These data
  arg2 = THR55
  arg2-as = a novel phosphorylation site and
  arg2-for = the first time show threonine phosphorylation
  arg2-of = human

RID+UID: THR55 P53_HUMAN
PMID: 15116093
PAS:
  pred = clarify
  arg1 = This study
  arg2 = the biological significance
  arg2-of = doxorubicin-induced THR55 phosphorylation
PAS:
  pred = reduced
  arg1 = phosphorylation
  arg1-at = SER15
  arg2 = and phosphorylation
  arg2-at = SER392

RID+UID: SER315 P53_HUMAN
PMID: 9246643
PAS:
  pred = reversed
  arg1 = but SER315
  arg2 = the effect
  arg2-of = phosphorylation
  arg2-at = SER392

binding activity

RID+UID: PHE19 P53_HUMAN
PMID: 7926727
PAS:
  pred = are
  arg1 = PHE19
  arg2 = crucial
  arg2-for = the interactions
  arg2-between =

RID+UID: SER20 P53_HUMAN
PMID: 11323395
PAS:
  pred = play
  arg1 =
  arg1-of = SER20
  arg2 = a key role
  arg2-in = the dissociation
  arg2-of = mdm2
  arg2-in = response
  arg2-to = Cr(VI)

RID+UID: CYS135 P53_HUMAN
PMID: 17914575
PAS:
  pred = generates
  arg1 = that the amino acid change CYS135˜ARG
  arg1-in = the human TP53
  arg2 = the loss
  arg2-of = TP53 DNA-binding activity

RID+UID: SER315 P53_HUMAN
PMID: 16784539
PAS:
  pred = dephosphorylates
  arg1 = both
  arg1-in = vitro and
  arg1-in = vivo and
  arg2 = the SER315 site
  arg2-of =

protein-protein-interaction

RID+UID: SER20 P53_HUMAN
PMID: 10432310
PAS:
  pred = containing
  arg2 = phosphate
  arg2-at = SER20 inhibited DO-1 binding

RID+UID: SER166 P53_HUMAN
PMID: 11960368
PAS:
  pred = mutated
  arg1 = analysis
  arg1-of = HDM2 proteins
  arg2 =
  arg2-at = the consensus Akt recognition sites
  arg2-at = SER166

RID+UID: ARG175 P53_HUMAN
PMID: 11172034
PAS:
  pred = abolish
  arg1 = mutations ARG175˜HIS or ARG248˜TRP
  arg2 = the association
  arg2-of =

RID+UID: SER315 P53_HUMAN
PMID: 7624134
PAS:
  pred = abolished
  arg1 =
  arg1-to = alanine ( p53- SER315˜ALA )
  arg2 = phosphorylation
  arg2-by = cdk2 kinase

biological activity

RID+UID: SER315 P53_HUMAN
PMID: 7624134
PAS:
  pred = required
  arg1 = SER315
  arg1-of = wtp53
  arg2 =
  arg2-for = transcriptional activity
  arg2-in = vivo

RID+UID: CYS238 P53_HUMAN
PMID: 16818505
PAS:
  pred = retains
  arg1 = ( CYS238˜TYR ) mutant
  arg2 = functional wild-type

RID+UID: ARG175 P53_HUMAN
PMID: 16707427
PAS:
  pred = displayed
  arg1 = the ARG175˜LEU mutant
  arg2 = an attenuated tumor suppressor activity
  arg2-in = the regulation
  arg2-of = transcription

disease

RID+UID: ARG72 P53_HUMAN
PMID: 10616523
PAS:
  pred = suggests
  arg1 = The acquisition
  arg1-of = both mutations ( GLY245˜VAL and ARG72˜PRO )
  arg1-in = the transformation
  arg1-from = transient leukemia
  arg1-to = overt acute megakaryoblastic leukemia
  arg2 = a functional role
  arg2-of = mutant

RID+UID: ARG72 P53_HUMAN
PMID: 18181044
PAS:
  pred = sociated
  arg1 = the development
  arg1-of = lung carcinoma and that ARG72˜PRO genotype
  arg2 =
  arg2-with = a poorer prognosis
  arg2-of = lung cancer

molecular stability

RID+UID: VAL138 P53_HUMAN
PMID: 7761089
PAS:
  pred = showed
  arg1 = The human VAL138 mutant
  arg2 = temperature-sensitive transformation
  arg2-of = rat embryo fibroblasts ( REFs )
  arg2-in = collaboration assay
  arg2-with = activated

RID+UID: ARG249 P53_HUMAN
PMID: 15703170
PAS:
  pred = duce
  arg1 = oncogenic mutations HIS168˜ARG and ARG249˜SER
  arg2 = substantial structural perturbation
  arg2-around = the mutation site
  arg2-in = the L2 and L3 loops
Appendix D
Examples of extracted functional annotations for the protein Jak2
Table D.1: Examples of literature-mined annotations of protein residues in Jak2. The listed data are grouped by topic.

disease

PMID: 16896569
RID+UID: VAL617 JAK2_HUMAN
PAS:
  pred = improved
  arg1 = The improved knowledge
  arg1-of = the molecular basis
  arg1-of = the disease because
  arg1-of = the discovery
  arg1-of = the VAL617˜PHE mutation
  arg1-in = the JAK2 gene
  arg2 = the molecular diagnosis and

PMID: 16503548
RID+UID: VAL617 JAK2_HUMAN
PAS:
  pred = is
  arg1 = that the JAK2 VAL617˜PHE mutation
  arg2 = rare
  arg2-in = patients
  arg2-with = idiopathic erythrocytosis

PMID: 16247455
RID+UID: VAL617 JAK2_HUMAN
PAS:
  pred = reported
  arg1 = A missense somatic mutation
  arg1-in = JAK2 gene ( JAK2 VAL617˜PHE )
  arg2 =
  arg2-in = chronic myeloproliferative disorders

PMID: 18024388
RID+UID: VAL617 JAK2_HUMAN
PAS:
  pred = is
  arg1 = The JAK2 VAL617˜PHE point mutation
  arg2 = rare
  arg2-in = hypereosinophilic syndrome and/or chronic eosinophilic leukemia

genetic

PMID: 15858187
RID+UID: VAL617 JAK2_HUMAN
PAS:
  pred = had
  arg1 = All 51 patients
  arg1-with = 9pLOH
  arg2 = the VAL617˜PHE mutation
PAS:
  pred = is
  arg1 = VAL617˜PHE
  arg2 = a somatic mutation present
  arg2-in = hematopoietic cells

molecular function

PMID: 15970705
RID+UID: VAL617 JAK2_HUMAN
PAS:
  pred = sociated
  arg1 = JAK2 ( VAL617˜PHE )
  arg2 =
  arg2-with = constitutive phosphorylation
  arg2-of = JAK2 and its downstream effectors
  arg2-as =

PMID: 16239216
RID+UID: VAL617 JAK2_HUMAN
PAS:
  pred = duces
  arg1 = that the homologous VAL617˜PHE mutation
  arg2 = activation
  arg2-of = JAK1 and Tyk2

PMID: 16384930
RID+UID: VAL617 JAK2_HUMAN
PAS:
  pred = link
  arg1 = the presence
  arg1-in = PV erythroblasts
  arg1-of = proliferative and antiapoptotic signals that
  arg2 = the JAK2 VAL617˜PHE mutation
  arg2-with = the inhibition
  arg2-of = death receptor signaling

PMID: 16442619
RID+UID: VAL617 JAK2_HUMAN
PAS:
  pred = does
  arg1 = crease
  arg1-of = expression and kinase activity
  arg1-of = JAK2
  arg1-in = CML cells
  arg2 = result
  arg2-from = the JAK2 VAL617˜PHE activation mutation and that transformation
  arg2-into = to blast crisis

PMID: 16461300
RID+UID: VAL617 JAK2_HUMAN
PAS:
  pred = sociated
  arg1 = the presence
  arg1-of = the JAK2 VAL617˜PHE mutation
  arg2 =
  arg2-with = higher platelet activation

PMID: 16904848
RID+UID: VAL617 JAK2_HUMAN
PAS:
  pred = transmit
  arg1 = that JAK2 VAL617˜PHE
  arg2 = signals
  arg2-from = ligand-activated TpoR or EpoR

PMID: 15863514
RID+UID: VAL617 JAK2_HUMAN
PAS:
  pred = changes
  arg2 = conserved VAL617˜PHE
  arg2-in = the pseudokinase domain
  arg2-of = JAK2 that
Appendix E
Examples of extracted functional annotations of the category binding event
Table E.1: Mined functional annotations of protein residues with information on binding events. The mined information corresponds to 17 protein residues listed in MSDsite. The extracted information can be used for functional annotation and validation of predicted binding sites in the database.

RID+UID: T199 CAH2_HUMAN
Sentence: “The three-dimensional structures of azide-bound and sulfate-bound T199V CAIIs were determined by x-ray crystallographic methods at 2.25 and 2.4 A, respectively (final crystallographic R factors are 0.173 and 0.174, respectively).” (PMID:8262987)
PAS:
  pred = determined
  arg1 = The three-dimensional structures
  arg1-of = [azide-bound and sulfate-bound THR199 VAL CAIIs]/BINDING
  arg2 =
  arg2-by = x-prot:ray crystallographic methods
  arg2-at = at 2.25 and 2.4 A ,respectively ( final crystallographic

RID+UID: R55 PPIA_HUMAN
Sentence: “On the basis of the structure, it is proposed that Arg55 hydrogen-bonds to the nitrogen to deconjugate the resonance of the prolyl amide bond and thus facilitates the cis-trans rotation.” (PMID:8652511)
PAS:
  pred = proposed
  arg2 = [that ARG55 hydrogen-bonds]/BINDING
  arg2-to = the nitrogen
PAS:
  pred = deconjugate
  arg1 = [that ARG55 hydrogen-bonds]/BINDING
  arg1-to = the nitrogen
  arg2 = the resonance
  arg2-of = the prolyl amide bond and

RID+UID: L255 PH4H_HUMAN
Sentence: “Only for the R252Q and L255V mutants were catalytically active tetramer and dimer recovered and for R252G some dimer, i.e. 20% (R252Q, tetramer), 44% (L255V, tetramer) and 4.4% (R252G, dimer) of the activity for the respective wild-type (wt) forms.” (PMID:9799096)
PAS:
  pred = recovered
  arg1 = active tetramer and dimer
  arg2 = and
  arg2-for = [ARG252 GLY some dimer]/BINDING

RID+UID: Y156 HGXR_TRIFO
Sentence: “But the forces involved in recognizing the exocyclic C2-substituents of the purine ring, which involve the Tyr156 hydroxyl, Ile157 backbone carbonyl, and Asp163 side-chain carboxyl, may be weakened by the shifted conformation of the peptide backbone resulted from loss of the Glu11-Arg155 salt bridge.” (PMID:9843428)
PAS:
  pred = resulted
  arg1 =
  arg1-by = the shifted conformation
  arg1-of = the peptide backbone
  arg2 =
  arg2-from = loss
  arg2-of = [the GLU11 ARG155 salt bridge]/BINDING

RID+UID: K79 HGXR_TOXGO
Sentence: “The Leu78-Lys79 peptide bond in the active site adopts the cis configuration, which it must to bind PRPP or pyrophosphate.” (PMID:10545171)
PAS:
  pred = adopts
  arg1 = [The LEU78 LYS79 peptide bond]/BINDING
  arg1-in = the active site
  arg2 = the

RID+UID: G57 FLAV_CLOBE
Sentence: “In the Clostridium beijerinckii flavodoxin, the reduction of the flavin mononucleotide (FMN) cofactor is accompanied by a local conformation change in which the Gly57-Asp58 peptide bond “flips” from primarily the unusual cis O-down conformation in the oxidized state to the trans O-up conformation such that a new hydrogen bond can be formed between the carbonyl group of Gly57 and the proton on N(5) of the neutral FMN semiquinone radical [Ludwig, M. L., Pattridge, K. A., Metzger, A. L., Dixon, M. M., Eren, M., Feng, Y., and Swenson, R. P. (1997) Biochemistry 36, 1259-1280].” (PMID:10353827)
PAS:
  pred = accompanied
  arg1 = ) cofactor
  arg2 =
  arg2-by = a local conformation change
  arg2-in = [which the GLY57 ASP58 peptide bond]/BINDING
D160 APX STRGR; M161 APX STRGR; G201 APX STRGR; R202 APX STRGR; F219 APX STRGR
”These studies allowed the tracing of the previously disordered region of the enzyme (Glu196-Arg202) and the identification of some of the functional groups of the enzyme that are
involved in enzyme-substrate interactions (Asp160, Met161, Gly201, Arg202 and Phe219).”
(PMID:10771423)
pred = involved
arg1 = disordered region
arg1-of = the enzyme ( GLU196 ARG202 ) and the identification
arg1-of = some
arg1-of = the functional groups
arg1-of = the enzyme that
arg2 =
arg2-in = [enzyme-substrate interactions ( ASP160, MET161, GLY201, ARG202, PHE219)]/BINDING
I209 FIXL RHIME
”Interaction between the iron-bound O(2) and Ile209 was also observed in the resonance
Raman spectra of RmFixLH as evidenced by the fact that the Fe-O(2) and Fe-CN stretching
frequencies were shifted from 575 to 570 cm(-1) (Fe-O(2)), and 504 to 499 cm(-1), respectively,
as the result of the replacement of Ile209 with an Ala residue.” (PMID:10926518)
pred = observed
arg1 = Interaction
arg1-between = [the iron-bound O(2) and ILE209]/BINDING
arg2 =
arg2-in = the resonance Raman spectra
arg2-of = RmFixLH as
Appendix F
Examples of extracted functional annotations of active site residues
Table F.1: Catalytic triad residues identified by MEDLINE extraction. The
listed sentences describe the mentioned protein residues as catalytic (co-mention with the term ”catalytic triad”). None of these residues is recorded
in CSA, so the identified information is novel data.
RID+UID
Sentence
PAS
D44 TPP2 HUMAN, H264 TPP2 HUMAN, S449 EPHA3 HUMAN
”The amino acids forming the putative catalytic triad (Asp-44, His-264, Ser-449) as well as
the conserved Asn-362, potentially stabilizing the transition state, were replaced by alanine
and the mutated cDNAs were transfected into human embryonic kidney (HEK) 293 cells.”
(PMID:12445476)
pred = forming
arg1 = The amino acids
arg2 = [the putative catalytic triad ( ASP44, HIS264, SER449)]/ENZ ACT
C25 CYSP1 CARCN, H159 CYSP1 CARCN, D175 CYSP1 CARCN
”The seven cysteine residues are aligned with those of papain and the catalytic triad
(Cys25, His159, Asn175) of all cysteine peptidases of the papain family is conserved.”
(PMID:10355634)
pred = aligned
arg1 = The seven CYS+
arg2 =
arg2-with = with those
arg2-of = of papain and the catalytic triad ( CYS25
C176 NADE MYCTU, E52 NADE MYCTU, K121 NADE MYCTU
”The residues forming the putative catalytic triad (Cys176, Glu52 and Lys121) were replaced
by alanine; the mutated enzymes were expressed in the Escherichia coli Origami (DE3) strain
and purified.” (PMID:15748981)
pred = forming
arg1 = The residues
arg2 = [the putative catalytic triad ( CYS176, GLU52, and LYS121)]/ENZ ACT
S1752 POLG BVDVS
”Our study provides experimental evidence that histidine at position 1658 and aspartic acid
at position 1686 constitute together with the previously identified serine at position 1752
(S1752) the catalytic triad of the pestiviral NS3 serine protease.” (PMID:10915606)
pred = identified
arg1 =
arg1-with = the
arg2 = [SER1752 ( S1752 ) the catalytic triad]/ENZ ACT
arg2-of = the pestiviral NS3 serine protease.
D167 POLS SFV, H145 POLS SFV, S219 POLS SFV
”After this autoproteolytic cleavage, the free carboxylic group of Trp267 interacts with the
catalytic triad (His145, Asp167 and Ser219) and inactivates the enzyme.” (PMID:18177892)
pred = interacts
arg1 = the free carboxylic group
arg1-of = TRP267
arg2 =
arg2-with = [the catalytic triad ( HIS145, ASP167, and SER219)]/ENZ ACT
D122 ARY2 RAT
”Substitution of the catalytic triad Asp-122 with either alanine or asparagine resulted in the
complete loss of protein structural integrity and catalytic activity.” (PMID:15209520)
pred = resulted
arg1 = Substitution
arg1-of = the catalytic triad ASP122
arg1-with = either alanine or asparagine
arg2 =
arg2-in = the complete loss
arg2-of = [protein structural integrity and catalytic activity]/ENZ ACT
. . . continuation of table F.1
D156 LYPA1 HUMAN
”To investigate whether this bridging function occurs in vivo, two transgenic mouse lines
were established expressing a muscle creatine kinase promoter-driven human LPL (hLPL)
minigene mutated in the catalytic triad (Asp156 to Asn).” (PMID:9811888)
pred = mutated
arg1 = ( hLPL ) minigene
arg2 =
arg2-in = [the catalytic triad (ASP156 ASN)]/ENZ ACT
Appendix G
Glossary
3D pattern – a recurrent residue triplet configuration (with k=2 or k=3 interaction of residues) within a dataset of protein structures.
arg – the argument of a PAS
BIND – the set of binding-related functional annotations of extracted protein residues, i.e. annotations are labelled as BINDING.
BINDING – a category in MAN, describing binding events of a protein residue.
CSA – a database of manually curated active sites with structure templates derived from PDB.
Contextual feature .
EC – an Enzyme Commission number, i.e. an enzyme classification identifier.
ER – entity recognition.
ENZ – the set of enzyme-related functional annotations of extracted protein residues, i.e. annotations are labelled as ENZ ACT.
ENZ ACT – a category in MAN, describing enzyme-related information.
FA – a functional annotation; or the set of extracted protein residues with functional annotations.
FEAT – a categorisation scheme based on UniProtKB.
FN – a false negative.
FP – a false positive.
FT – a record in the UniProt data file with functional annotation.
Functional annotation – Information on biological function assigned to a protein residue.
GC – a manually annotated test set with abstract texts drawn from a random selection of UniProtKB citations.
GO – Gene Ontology.
MAN – a categorisation scheme based on manual analysis of MEDLINE.
MEDLINE – a database of citations and abstract texts from biomedical publications.
NP – a noun phrase is defined as a nominal sequence.
OLDFIELD – a non-redundant structure dataset of protein domains selected from PDB by sequence alignments.
OPR – a semantic relation between a residue, its source protein, and hosting organism; or the set of mined protein residues.
PAS – a data structure to accommodate the semantic relation between a predicate and its arguments.
PDBID – PDB identifier.
PDB – the primary database of protein structure with spatial coordinates.
PMID – a PubMed identifier.
POS – a class of words, e.g. noun, verb, adjective, used for linguistic analysis.
PP – a prepositional phrase is defined as preposition + noun phrase.
pred – the predicate of a PAS.
Protein residue – a residue with known association to its source protein within a hosting organism (OPR).
RE – Relation extraction.
RID – a Residue identifier: residue name + residue position in the protein sequence.
SCOP40 – a non-redundant protein structure dataset derived from SCOP.
SCOP – a derived protein structure database with manual classification of proteins based on structure similarities.
SITE – a record in the PDB data file denoting residues of a functional site.
Structure pattern – cf. 3D pattern.
TN – a true negative.
TP – a true positive.
TID – a Taxonomy identifier based on the NCBI Taxonomy guideline.
UID – a Protein identifier based on the UniProtKB guideline.
UniProtKB – a protein sequence database with manual annotations on protein residues.
VG – a verb group is a sequence of verbs, auxiliaries, or verb modifiers.
VP – a verb phrase, consisting of a verb group + noun phrase.
XC – a cross-validation corpus based on references from UniProtKB.
chainID – a protein chain identifier in a PDB entry.
k=2, k=3 – a residue triplet configuration with two-way or three-way interaction.
resName – a residue name.
resSeq – a protein residue sequence identifier from a PDB entry.
seqIndex – a protein residue sequence identifier from a UniProtKB entry.
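The glossary entries for PAS, pred, arg and RID can be illustrated with a minimal Python sketch. It reads the textual PAS notation used in Tables E.1 and F.1 (lines such as pred = adopts or arg1-in = the active site) into a small data structure, and splits an RID such as K79 into a residue name and a sequence position. The names Argument, PAS, parse_pas and parse_rid are hypothetical and chosen for illustration only; they are not the implementation used in this thesis, and the line format is assumed from the tables rather than from a formal specification.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Standard one-letter to three-letter amino acid codes.
ONE_TO_THREE = {
    "A": "ALA", "R": "ARG", "N": "ASN", "D": "ASP", "C": "CYS",
    "Q": "GLN", "E": "GLU", "G": "GLY", "H": "HIS", "I": "ILE",
    "L": "LEU", "K": "LYS", "M": "MET", "F": "PHE", "P": "PRO",
    "S": "SER", "T": "THR", "W": "TRP", "Y": "TYR", "V": "VAL",
}


def parse_rid(rid: str) -> Tuple[str, int]:
    """Split an RID such as 'R55' into (residue name, sequence position)."""
    m = re.fullmatch(r"([A-Z])(\d+)", rid)
    if m is None or m.group(1) not in ONE_TO_THREE:
        raise ValueError(f"not a residue identifier: {rid!r}")
    return ONE_TO_THREE[m.group(1)], int(m.group(2))


@dataclass
class Argument:
    role: str                       # e.g. 'arg1' or 'arg2-in' (assumed format)
    text: str                       # the argument phrase
    category: Optional[str] = None  # e.g. 'BINDING' or 'ENZ ACT', if labelled


@dataclass
class PAS:
    pred: str
    args: List[Argument] = field(default_factory=list)


# A labelled argument is written as '[phrase]/CATEGORY' in the tables.
_LABELLED = re.compile(r"\[(.*)\]/(.+)")


def parse_pas(lines: List[str]) -> PAS:
    """Build a PAS from 'key = value' lines; lines before 'pred' are ignored."""
    pas = None
    for line in lines:
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if key == "pred":
            pas = PAS(pred=value)
        elif pas is not None and key.startswith("arg"):
            m = _LABELLED.fullmatch(value)
            if m:
                pas.args.append(Argument(key, m.group(1), m.group(2)))
            else:
                pas.args.append(Argument(key, value))
    if pas is None:
        raise ValueError("no 'pred' line found")
    return pas
```

Applied to the K79 HGXR TOXGO entry of Table E.1, parse_pas would yield the predicate adopts with a first argument labelled BINDING, and parse_rid("K79") would give the residue name LYS and position 79. Empty arguments (e.g. arg2 =) are kept as empty strings so that the record mirrors the table.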