Rewiring of the plant interactome in response to environmental stress
Jasmien VERCRUYSSE
Master's dissertation submitted to obtain the degree of Master of Science in Biochemistry and Biotechnology, Major Bioinformatics
Academic year 2013-2014
Promoter(s): Prof. Dr. Yves Van de Peer and Dr. Sofie Van Landeghem
Scientific supervisor: Dr. Sofie Van Landeghem
UGent - Department Plant Biotechnology and Bioinformatics
VIB - Department Plant Systems Biology
Research Unit Bioinformatics & Evolutionary Genomics

Acknowledgements
First and foremost, I would like to thank Sofie. Without her, this thesis would never have come about. Sofie, you let me roam free in the wondrous world of bioinformatics, yet you kept a watchful eye to steer me where needed. I was able to feast on your knowledge and learned an enormous amount: from scientifically sound writing in English and Dutch to running the most intricate algorithms. And you were always available, whether by e-mail or in person, to answer my questions. Outside working hours, too, you were enthusiastic and introduced me to the group. You confirmed that bioinformatics is by no means a dull field. There was always room for a joke or a cheerful note. Thank you once again, Sofie; this year I thoroughly enjoyed every aspect.
My friends certainly cannot be left out of these acknowledgements either. Under the motto that the bow cannot always be drawn tight, we spent evenings in each other's student rooms enjoying a cold beer, or we painted the Overpoort red. At the VIB, too, there was always room for a short break around 15:00. Thank you Marco, Dario, Jeroen, Lukas, Nick, Michiel, Karen, Kaatje and Hanneloor. We will keep these fantastic times going.
Especially Dario and Marco, you provided unforgettable moments and the necessary stress to complete this thesis whenever I granted myself a morning full of sleep because of the night before... I would also like to thank Thomas, Michiel, Nicolas, Lieven, Jonas and Bo from the lab for enriching the lunch hours and breaks in the lounge with their presence. Your jokes, remarks, opinions and stories were always welcome.
Finally, I would like to thank my parents, mum and dad, for supporting me all these years. I owe the road I have travelled so far largely to you. I know you will go through this manuscript with great interest. You have asked about it often enough this year. Thank you for being there for me.
Jasmien, 10 June 2014

Table of Contents
Acknowledgements
Table of Contents
List of Figures
List of Tables
List of Abbreviations
English Summary
Dutch Summary
PART 1: Introduction
1.1 Abiotic stress
1.2 Ontology-based data integration
1.2.1 What is an ontology?
1.2.2 The importance of an ontology
1.3 Description of the data sources
1.3.1 Text-mining data: EVEX
1.3.2 Experimental data: CORNET
1.4 Differential networks
1.4.1 Construction possibilities
1.4.2 The Diffany tool to construct differential networks
2 PART 2: Aim of Research Project
3 PART 3: Results
3.1 Creation of the abiotic stress ontology
3.1.1 The Gene Ontology
3.1.2 The Environmental Ontology
3.1.3 The Plant Environmental Ontology
3.1.4 The Plant Ontology
3.1.5 The Plant Environmental Stress Ontology
3.2 Text-mining
3.2.1 The stress trigger algorithm
3.2.2 Evaluation of the stress trigger algorithm
3.2.3 The impact of wrongly predicted events by EVEX
3.2.4 Visualizing TM data
3.3 CORNET
3.4 Differential networks
3.4.1 One-against-one versus one-against-all comparisons
3.4.2 The contribution of TM data to differential networks
4 PART 4: Discussion
4.1 The PESO
4.2 Text-mining data
4.2.1 Evaluation of the stress trigger algorithm
4.2.2 Future goals
4.3 Co-expression data
4.4 Network representation
4.4.1 One-against-one differential networks
4.4.2 One-against-all differential networks
4.4.3 The contribution of TM data to differential networks
4.5 Conclusion
5 PART 5: Materials and Methods
5.1 Construction of the PES ontology
5.2 Data processing
5.2.1 Processing of TM data
5.2.2 Evaluation of the stress trigger algorithm
5.2.3 MySQL scripts for the retrieval of EVEX data
5.2.4 Processing of co-expression data
5.3 Network representation
5.3.1 Generation of a reference network
5.3.2 Condition-dependent network
5.3.3 Differential network
References
6 Addendum
6.1 Perl Code
6.2 Lists of establishment and specificity of stress
6.2.1 List including words establishing the occurrence of stress in sentences (not complete)
6.2.2 List including word chains establishing the occurrence of stress in sentences (not complete)
6.2.3 List including words identifying the type of abiotic stress in sentences (not complete)
6.2.4 List including chains of words identifying the type of abiotic stress in sentences (not complete)
6.3 Absolute values to evaluate the stress trigger algorithm
6.3.1 Test run 1
6.3.2 Test run 2
List of Figures
Figure 1: Overall model of an abiotic stress response.
Figure 2: Representation of a hierarchical tree-like structure formed by is a relations.
Figure 3: Term neighbourhood for intracellular organelle part.
Figure 4: The evolution of adding the term duration of stress to an existing ontology which describes a plant's response to abiotic stress.
Figure 5: Representation of event extraction in EVEX.
Figure 6: Overview of the TM pipeline in EVEX. EVEX includes five tools.
Figure 7: Representation of the Co-regulation probability radar-chart in DINA for a set of genes involved in the pancreatic secretion pathway.
Figure 8: Diffany employs an ontology to classify different types of interactions.
Figure 9: Scheme of one reference network and three condition-dependent networks.
Figure 10: Representation of four differential networks constructed in Diffany based on the reference and condition-dependent networks depicted in Figure 9.
Figure 11: Overview of thesis.
Figure 12: Representation of the branch Environmental system in EnvO.
Figure 13: Term neighborhood for Cellular response to cold to illustrate the complex relations between terms.
Figure 14: Overview of the Plant Stress Ontology.
Figure 15: Overview of the Plant Environment Stress Ontology.
Figure 16: EVEX generates two sorts of output data: a set of GGPs used in the article and a list of events connecting the GGPs by interactions.
Figure 17: The stress trigger algorithm processes word by word except when multiple words combined express the establishment of stress or a kind of stress.
Figure 18: Visualization of events which do not have a cause.
Figure 19: Since calculations were done based on a subset of the data, the ABA reference network consists of different clusters.
Figure 20: Representation of the drought differential and overlap network with the one-against-one method.
Figure 21: Example of the construction of the drought differential network.
Figure 22: All interacting partners of HPR in the drought differential network.
Figure 23: The construction of a differential network from three input networks.
Figure 24: Differential network constructed by the one-against-all method.
Figure 25: Example of a differential network in Diffany, when there is no overlapping network.
Figure 26: Depiction of the rewiring of HPR and its relations when ABA, drought, and salt stress are present.
Figure 27: When variants of the same gene can be classified under a common denominator, the visual representation of their interactions is clearer.
Figure 28: The ABA stress dependent network with all the TM data.
Figure 29: Representation of the ABA condition-dependent network with 50,000 co-expression interactions and 119 TM relations.
Figure 30: Example of a differential network with multiple types of interactions.
Figure 31: Relations with different weights are present in the differential network.
Figure 32: One cluster in the ABA stress dependent network had both a TM and multiple co-expression interactions.
Figure 33: Cluster of co-expression relations in which the overlapping network was found.

List of Tables
Table 1: There are different agreements regarding the nomenclature of mutant variants, wild-type variants, phenotypes, proteins, different alleles and different genes with the same symbol.
Table 2: GO differentiates between four kinds of synonyms for its terms.
Table 3: All PESO terms with their ID and references to GO, EnvO and EO.
Table 4: To avoid complexity, independent concepts such as duration, localization and concentration of stress are modeled in separate ontologies.
Table 5: General overview of the four kinds of results.
Table 6: Overview of the precision (P), recall (R), and F-score (F) for three papers to test the power of the stress trigger algorithm.
Table 7: Overview of the collected co-expression data from CORNET.
Table 8: List of HPR interacting genes in the drought differential network.
Table 9: Degree of four nodes with both TM and co-expression data.
Table 10: Numbers of true and false positives, and true and false negatives used to calculate the F-score in test run 1.
Table 11: Numbers of true and false positives, and true and false negatives used to calculate the F-score in test run 2.

List of Abbreviations
ABA     Abscisic acid
AOA     Aminooxyacetic acid
BAR     The BioAnalytic Resource
COSINE  Condition-Specific sub-network algorithm
DAG     Directed acyclic graph
DINA    Differential Network Analysis Algorithm
EnvO    Environmental Ontology
EO      Plant environmental ontology
EVEX    Event extraction
FN      False negatives
FP      False positives
GA      Gibberellic acid
GEO     Gene Expression Omnibus
GGP     Gene or gene product
GO      Gene Ontology
IPCC    Intergovernmental Panel on Climate Change
JA      Jasmonate
MO      MGED Ontology
OBO     Open Biological and Biomedical Ontologies
PESO    Plant Environment Stress Ontology
PM      PubMed
PMC     PubMed Central
PO      Plant Ontology
PPI     Protein-protein interaction
PRO     Protein Ontology
RMA     Robust Multi-array Average
RO      Relation Ontology
ROS     Reactive oxygen species
SA      Salicylic acid
TAIR    The Arabidopsis Information Resource
TF      Transcription factor
TM      Text-mining
TN      True negatives
TP      True positives

English Summary
Plants are continually challenged to respond to detrimental changes in their environment in order to avoid destructive effects on growth and development. Understanding the mechanisms that plants use to cope with these kinds of abiotic stresses is of great importance for creating durable crops which can withstand abiotic stress. Unfortunately, current knowledge is spread across various databases and available in different formats. To overcome this difficulty, data integration is applied: different data types are merged to provide more information than conclusions based on just one data type. Therefore, we modeled abiotic stress responses in a structural framework, creating an ontology specifically designed to describe types of abiotic stress responses. In this research, text-mining (TM) data from the EVEX database and experimental data from the CORNET database were employed to extract knowledge about abiotic stress.
Text-mining includes the automated extraction of information from published articles to create events, which represent the relation between genes and gene products. These relations are described in texts as molecular processes such as binding, phosphorylation, and more indirect regulatory associations. The experimental data included co-expression interactions between two genes. We developed a stress trigger algorithm to further process the TM relations. In that way, events occurring during various types of abiotic stress were grouped accordingly in the ontology and could be used as input for condition-specific networks. Co-expression data were also categorized in the ontology. In Cytoscape, a network visualization tool, we used the Diffany plugin to construct differential networks from both data sources. Differential networks focus on the rewiring of interactions when abiotic stress occurs. Those networks are very useful to discover interactions that changed during the presence of one or more types of abiotic stress. We compared the one-against-one and the one-against-all approach, in which respectively one and multiple abiotic stress response networks are compared to a reference network that includes interactions under normal conditions. Based on a case study, we concluded that both methods give a clear view of how the wiring between genes and gene products changes. We also evaluated the contribution of TM to the construction of differential networks. TM is of great potential value to the scientific world because it can uncover hidden links in the literature. However, TM relations and co-expression interactions were rarely combined into one input network. As a result, differential networks based on both data sources fell apart into clusters containing either TM relations or co-expression interactions. Further research is thus recommended in order to combine both TM and co-expression relations into one network.

Dutch Summary
Plants are constantly beset by various kinds of environmental stress. To protect itself against these different threats, the plant has a range of abiotic stress responses at its disposal. Understanding these mechanisms is important because the knowledge contributes to the design of durable crops that are able to withstand environmental stress. Unfortunately, current knowledge is spread across different databases and is available in different formats. Data integration is applied to bring this information together. This means that the value of the data increases because more than one data platform is consulted. To achieve this, we created an ontology describing different types of abiotic stress. In this research, text-mining (TM) data from the EVEX database and experimental data from the CORNET database are used to bundle knowledge about abiotic stress responses in plants. TM comprises the automatic extraction of information from texts. This information is represented as events that capture the relation between genes and other macromolecules. Relations represent, among others, binding, phosphorylation and indirect regulatory associations. The experimental data contain co-expression as the interaction between two genes. We further developed a stress trigger algorithm that is able to process the text-mining data further. In this way, interactions were grouped according to their occurrence during a specific environmental stress. Co-expression and text-mining interactions were thus linked to the ontology. Cytoscape and Diffany, a plugin, were used to form differential networks. Differential networks focus on the changes in interactions between genes and gene products that occur when the environment changes.
We compared the one-against-one method (one stress-dependent network is compared with one network under normal conditions) and the one-against-all method (multiple stress-dependent networks are compared with one network under normal conditions) for creating differential networks. We relied on a case study to assess these methods and concluded that they represent the changes between genes and gene products well when stress occurs. Furthermore, we evaluated the contribution of TM when differential networks were constructed. TM has great power in identifying hidden connections in the scientific literature. However, the co-occurrence of TM interactions and co-expression relations in one network turned out to be limited. Consequently, differential networks consisted mainly of clusters formed exclusively by TM or co-expression data. Further research is therefore needed on combining TM and co-expression interactions in one network.

PART 1: Introduction
1.1 Abiotic stress
Plants are sessile organisms: they cannot run away when danger arises. Plants therefore have to cope with a number of harsh environmental conditions such as drought, UV radiation, freezing and salinity. These extreme conditions are all examples of abiotic stresses (Barkla et al, 2013). According to the Intergovernmental Panel on Climate Change (IPCC) (http://www.ipcc.ch), plants will have to endure even more severe environmental conditions due to global warming, while abiotic stress already causes enormous yield losses (Skirycz & Inzé, 2010). The growing human population, which demands higher crop yields, adds further to the demands placed on crops. Therefore, understanding abiotic stress responses in plants is now thought to be one of the most important topics in plant science for future breeding applications (Debnath et al, 2011).
Plants have developed two mechanisms to deal with environmental stresses: stress avoidance and stress tolerance. Stress avoidance mechanisms in the plant try to counteract the negative effects of the stress, resulting in a heritable adaptation to the stress. Stress tolerance processes allow the plant to adjust to a stress, and this adjustment can be reversed if the stress does not persist (Töpfer et al, 2013). Major leaps in this field have come from the application of molecular biology and whole genome sequencing (Hirayama & Shinozaki, 2010).
Molecular biology allows abiotic stress responsive genes to be characterized in transgenic plants, while whole genome sequencing followed by microarray hybridization gives genome-wide expression profiles in response to abiotic stress. Together, these two techniques provide a detailed and broad view of abiotic stress responses and tolerance in plants. Moreover, experiments with overexpression or knock-out mutants of genes involved in abiotic stress responses have also contributed immensely to the understanding of these responses (Bartels & Sunkar, 2005).
Abiotic stress responses in a plant are very complex because they appear in different phases (Ma & Bohnert, 2007). Furthermore, plants are exposed to multiple stresses simultaneously (Rasmussen et al, 2013). Therefore, the induction of stress responses occurs in different locations at different moments. Each induction phase may be controlled by a different pathway or a different set of genes. Figure 1 shows the current understanding of a plant's overall abiotic stress responses.

1.2 Ontology-based data integration
Much research on a plant's response to abiotic stress has already been done, and the findings are stored in different kinds of formats, including papers, expression profiles, lists of differentially expressed genes and heat maps (Kilian et al, 2007). Furthermore, microarray data and proteomics data, such as protein-protein interactions (PPIs), are often stored in an unstructured manner, which complicates data retrieval and interpretation (De Bodt et al, 2010). To get an overview of the current knowledge, it is recommended to consult different data sources when retrieving information about abiotic stress responses. In that way, the currently available information is bundled. This approach is based on the idea of the wisdom of crowds, which refers to the phenomenon in which the collective knowledge of a community is greater than the knowledge of one individual (Marbach et al, 2012).
Indeed, the value of data significantly increases when it can be integrated with other data (Smith et al, 2007). In this research, we used an ontology, or structured vocabulary, specifically designed to describe and categorize genes and gene products involved in abiotic stress responses. An ontology provides the textual framework needed for the integration of both textual and experimental data. The data types are further explained below in 1.3.

Figure 1: Overall model of an abiotic stress response. Changes in the environment that cause abiotic stress are noticed by a range of receptors and sensors in and near the cell wall of a plant cell. Those sensors pass the abiotic stress alert on to various signaling cascades such as the reactive oxygen species (ROS) signaling pathway. These signaling pathways end in transcription factors (TFs), which help transcribe a range of abiotic stress response genes into mRNA. The mRNAs can undergo posttranscriptional modifications and are then translated. The gene products can be either functional proteins (molecular chaperones, metabolic enzymes), which lead to stress avoidance, or regulatory proteins, which lead to stress tolerance. This last group gives positive or negative feedback to the signaling cascades. Figure adapted from Hirayama & Shinozaki, 2010.

1.2.1 What is an ontology?
The representation of knowledge is based on conceptualization: a conceptualization is a simplified view of an object or a biological process of interest (Gruber, 1995). For instance, a plant can be represented as a combination of three concepts: roots, stem and leaves. Of course, a plant is much more complex than this representation, but these terms help describe the plant and help distinguish it from other organisms. An ontology is an explicit specification of a conceptualization.
It is a classification methodology for formalizing a subject's properties in a structured way (Plant Ontology Consortium, 2002). An ontology thus comprises a structured vocabulary of well-defined terms, descriptions and relations between those terms which describe an object or a process of interest, such as a plant's response to abiotic stress.
Functional annotations are defined as descriptions of a gene's or gene product's biological identity (Berardini et al, 2004). Molecular functions, biological roles, and subcellular locations can all be described in these annotations. However, in the literature, annotations of genes and gene products are described in scientific natural language or plain text. This form makes them readable and understandable for humans, but difficult to interpret computationally due to ambiguity, synonyms and lexical variations (Lord et al, 2003). By using a structured vocabulary to describe genes and gene products involved in biological processes, computational interpretation becomes possible and automated reasoning can be performed (Noy, 2001). Calculations and adaptations can be made with the genes and gene products which represent this object or process. After all, using a computational approach to analyze an object or a biological process provides more accurate information than examining it manually. For the purpose of this research, ontologies are further explained in terms of the plant's responses to abiotic stress.
An ontology needs to be flexible in its terms and descriptions (Ashburner et al, 2000). Terms and descriptions need to be well defined and unambiguous, but synonyms of terms have to be taken into account. Suppose an ontology of a plant's responses to abiotic stress comprises the term cold to classify genes and gene products involved in responses to cold stress.
Novel research could, for instance, identify a gene that is involved in the response to cold stress and functionally annotate it with activation at low temperature. If the ontology does not contain the synonym low temperature, this gene will not be categorized correctly. Another reason an ontology needs to be flexible is the constant accumulation of knowledge about a plant's abiotic stress responses. Since this domain knowledge evolves rapidly, with new discoveries and techniques appearing all the time, biological knowledge will be described at different stages of completeness in an ontology (Ashburner et al, 2000). An ontology must thus leave room for adjustments and refinements in its terms and descriptions. New terms can be added and existing terms can be deleted. It is important to note that the flexibility of an ontology must not extend to an overall change in structure. It is a challenge to develop an ontology that is generic enough to describe the majority of genes and gene products involved in a plant's response to abiotic stress, while at the same time leaving room for small modifications. An ontology is thus a structured, well-defined, flexible vocabulary for describing the roles of genes and gene products in a biological process of interest.

Terms and descriptions in an ontology are all tied together by their relations (Cooper et al, 2013). Relations specify how terms are related to other terms and form the ontology structure, which reflects the current representation of biological knowledge (Plant Ontology Consortium, 2002). For instance, in the Gene Ontology (GO), response to cold is a response to stress and the term oxidoreductase complex is a part of the term cell. Just like terms and descriptions in an ontology, relations also need to be structured and unambiguous. As such, the Relation Ontology (RO) was created to prevent the inconsistent use of relations in ontologies.
It comprises a set of relations used in the Open Biological and Biomedical Ontologies (OBO) (Smith et al, 2007). The OBO Foundry is a collaborative experiment for creating a set of principles that can be consulted when creating an ontology. As such, different ontologies become interoperable. A number of ontologies already follow these principles, such as GO (Ashburner et al, 2000), Plant Ontology (PO) (Cooper et al, 2013), Protein Ontology (PRO) (Natale et al, 2014), and the Environment Ontology (EnvO) (Buttigieg et al, 2013). The most important relation is subsumption, also called the hyponym-hypernym relation or the is a relation. This relationship creates a hierarchical tree-like structure or taxonomy (Figure 2).

Figure 2: Representation of a hierarchical tree-like structure formed by is a relations. For instance, response to water flooding is a response to water stress.

For instance, in an ontology that categorizes genes and gene products involved in a plant's abiotic stress responses, response to water deprivation is a response to water stress and, in turn, response to water stress is a response to environmental factors. Because of this hierarchical structure, parent-child relations are established (Lord et al, 2003). Each term is a child of a parent term one level higher, with the exception of the upper-level or root terms (response to abiotic stress in Figure 2). Another important relation is the mereology relation or part of relation, which represents how terms combine to form composite terms. This kind of relation no longer upholds the hierarchical tree structure, but creates a directed acyclic graph (DAG) instead (Figure 3) (Plant Ontology Consortium, 2002). In a DAG, terms are connected by directed relations in such a way that there are no loops in the graph. Similar to the is a relation, the part of relation can create a structure in which child terms have multiple parents (Figure 3).
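The terms, synonyms and relations described above can be sketched in a few lines of Python. This is only an illustrative toy and not the PESO or GO data model: the edge list, the synonym table and the helper names `resolve` and `ancestors` are assumptions made for this example, with term names borrowed from the examples in the text.

```python
from collections import defaultdict

# Toy ontology (illustrative only): (child, relation, parent) triples.
EDGES = [
    ("response to water deprivation", "is_a", "response to water stress"),
    ("response to water flooding", "is_a", "response to water stress"),
    ("response to water stress", "is_a", "response to abiotic stress"),
    ("response to cold", "is_a", "response to stress"),
    ("cell part", "part_of", "cell"),
    ("cell part", "is_a", "cellular component"),
]

# Hypothetical synonym table mapping free-text labels to a canonical term.
SYNONYMS = {
    "cold": "response to cold",
    "low temperature": "response to cold",
}

# Map each child term to its set of (relation, parent) pairs.
parents = defaultdict(set)
for child, relation, parent in EDGES:
    parents[child].add((relation, parent))


def resolve(label):
    """Return the canonical term for a label, falling back to the label itself."""
    key = label.strip().lower()
    return SYNONYMS.get(key, key)


def ancestors(term):
    """Collect every term reachable by following is_a and part_of edges upwards."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relation, parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(resolve("Low temperature"))
print(sorted(ancestors("response to water deprivation")))
print(sorted(parent for _relation, parent in parents["cell part"]))
```

Because a term such as cell part keeps one parent through part of and another through is a, the structure is a graph rather than a tree, which is why the ancestors are collected with a visited set instead of a single parent pointer.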
Figure 3. Term neighbourhood for intracellular organelle part. Dotted arrows represent the PART OF relation and plain arrows represent the IS A relation. Together, these relations form a directed acyclic graph (DAG). Because of the PART OF relation between cell part and cell, child term cell part now has two parent terms: cell and cellular component. Intracellular part has three parent terms because of the IS A relation between them: cell part, intracellular organelle and intracellular organelle part. Terms and relations are taken from the Gene Ontology.

During the construction of an ontology, it is important to note that child terms have to depend on their parent terms. If a child term can be seen as an independent concept, it is not a properly classified term. We could, for example, hypothesize that duration of stress is a child term of response to temperature stress to model the duration of stress (Figure 4A and Figure 4B). However, duration of stress is also relevant to response to heat, response to cold, response to wounding and response to water stress (Figure 4C). As seen in Figure 4C, the term duration of stress would be present repeatedly in the ontology. This recurrence complicates the structure of the ontology more than needed. Since duration of stress is applicable to all terms in the ontology, it can be seen as a more general independent concept and should thus be modeled as such (Figure 4D and Figure 4E). The other children of response to temperature, namely response to heat and response to cold, depend only on their parent, are not applicable to other terms and are thus properly classified.

Figure 4: The evolution of adding the term duration of stress to an existing ontology which describes a plant's response to abiotic stress. Figure A depicts the ontology before duration of stress was added. In figure B, duration of stress is added as a child term of response to temperature stress.
However, figure C shows how duration of stress is also relevant to all other terms in the ontology. Duration of stress is thus a more general independent concept and should be modeled as seen in figure D. Duration of stress is now depicted to be applicable to every term.

An ontology can be extended with other relations which further refine the semantics it models. Some relations are domain-specific and categorize specific kinds of terms. Synapsed by/via type III bouton is such a specific relation in RO. It describes how the genes and gene products active in the neural circuit of Drosophila sp. are related to each other.

1.2.2 The importance of an ontology

As stated before, a structured vocabulary allows a complex object or biological process to be interpreted by a computer. Therefore, this vocabulary makes the integration of data more efficient. After all, the value of any kind of data is greatly enhanced when it is integrated with other data (Smith et al, 2007). Integration connects common data elements and thus allows the combined analysis and correlation of different data sets. Integration of data through the use of an ontology allows cross-species comparisons of genes and gene products (Berardini et al, 2004). Knowledge about one organism can then be transferred to another one. Arabidopsis thaliana is an extensively studied and annotated organism. Because of this knowledge transferability, economically more interesting plants, such as Zea mays (maize), Oryza sativa (rice) and Glycine max (soybean), can be annotated as well. These plants are a big part of the food and fuel industry and are thus of great importance to humankind. For instance, the information about the A. thaliana gene rof1 (AT3G25230), namely its involvement in osmotic stress acclimatization, can be transferred to the similar gene ZM02G19570 in Z. mays (Karali et al, 2012; Van Bel et al, 2012). Both genes can then be assumed to be involved in the response to osmotic stress in their respective species. However, it is important to critically review these knowledge transfers, so manual curation is necessary afterwards. The transfer of knowledge is possible because there is a pool of genes and gene products which are shared among different closely related species (Ashburner et al, 2000). This fact thus encourages the use of structured vocabularies.

1.3 Description of the data sources

In this research we developed an ontology to describe a specific concept: a plant's response to abiotic stress. Unlike the broad GO, which has terms describing different species and processes, we created specific terms applicable to abiotic stress responses of plants. Since the value of data improves significantly with the integration of different data sources, the ontology will be used to integrate two different sources of data: text-mining data from the EVEX database and experimental knowledge from the CORNET resource.

1.3.1 Text-mining data: EVEX

Text mining is a powerful tool to extract and process knowledge from text-based sources (Van Landeghem, 2012). There are a number of web-based text-mining tools available, such as Bio-Context (Gerner et al, 2012), iHOP (Hoffmann & Valencia, 2004), Predictive Networks (Haibe-Kains et al, 2012) and EVEX (Van Landeghem et al, 2013). In this research EVEX (short for EVent EXtraction) will be used to generate text-mining data. This program was co-developed by the University of Turku (Finland), Ghent University and the Plant Systems Biology Department at the VIB. EVEX extracts events from full texts (Figure 5). Its algorithm is based on machine learning, which means that the algorithm adapts itself based on a training set. This set contains manually annotated sentences, which the algorithm uses to make predictions for other sentences.

Figure 5: Representation of event extraction in EVEX. The identified GGPs are colored red and the extracted interaction between them is colored blue. In this sentence the word activated is interpreted as an event of the type positive regulation, Ubiquitin is the regulatory target or theme and E1 Ub-activating enzyme is the source or cause of the activation.

EVEX consists of different tools, which carry out different steps in the information extraction process. Figure 6 gives an overview of how EVEX extracts information from plain text. First, all abstracts from PubMed and all full articles from PMC are bundled. Then, each paper is broken down into sentences by the GENIA sentence splitter tool (Kim et al, 2011a). In those sentences, the BANNER tool recognizes names of genes and gene products (Leaman & Gonzalez, 2008). Event extraction is the next step, in which the relation between the genes or gene products (GGPs) is revealed; this is done by the TEES tool (Van Landeghem et al, 2013). Furthermore, TEES examines the reliability of the extracted event and assigns confidence scores based on that reliability (Van Landeghem et al, 2013). During the next step, unique species IDs are assigned to each GGP by the GenNorm tool to resolve gene name ambiguity (Wei et al, 2009; Chen et al, 2005). GenNorm and BANNER are stand-alone tools, meaning that GenNorm does not check whether the GGPs selected by BANNER are correct. After assigning a species ID, GenNorm tries to assign unique gene IDs found in Entrez Gene in a step called gene normalization.
When this is not possible, for instance because of a falsely identified GGP or a GGP about which there is no knowledge yet, EVEX provides a fallback mechanism in which it searches for a gene family in HomoloGene and/or Ensembl families to assign to the GGP. It is possible that there is no family ID, due to a wrong NCBI taxon assignment or an incorrect identification of a GGP by BANNER. Gene normalization also allows the text-mining results to be linked with other databases such as NCBI Entrez (Sayers et al, 2010), KEGG (Kanehisa & Goto, 2000) and PDB (Rose et al, 2011). The extracted event is then completed with extra information such as the PubMed ID (PMID), the authors of the article, the title and the year of publication, and is stored in the EVEX database.

Figure 6. Overview of the TM pipeline in EVEX. EVEX includes five tools. GENIA splits full texts from PubMed (full articles, titles and abstracts) into sentences, which get further analyzed. BANNER identifies GGPs and GenNorm assigns a unique species ID to each GGP. Based on that species ID, EVEX tries to assign gene IDs from Entrez Gene, and tries to place the GGPs in a gene family based on data from HomoloGene and Ensembl families. It is possible that no family ID or gene ID is found due to incorrect identification of GGPs or identification of newly discovered GGPs about which there is no knowledge yet. TEES extracts the event corresponding to the identified GGPs. In this example, PKS3 is identified as a GGP and, according to EVEX, binds ABA insensitive 2. Furthermore, GenNorm found PKS3 to be present in the organism G. max and assigned the MELK gene family to the gene. No gene ID was discovered. Note that SOS2 and ABI2 were also identified by BANNER as GGPs, but were not part of the example. When all the information found by BANNER, TEES, GenNorm, and EVEX is added to the sentence, it is stored in the EVEX database.
All sentences are formally represented by the Stav visualizer (Kim et al, 2011b; Stenetorp et al, 2011) as seen at the bottom of the figure. Black arrows represent the progress through the EVEX pipeline. Red arrows represent the different ways in which a GGP can be referenced. Figure adapted from (Van Landeghem et al, 2013).

Teaching a program like EVEX to extract information from sentences is challenging. It is hard for a program to interpret the context of sentences. Sentences can indicate speculation or negation. For instance, the sentences "Whether or not gene A binds with gene B is unknown" and "gene A does not bind with gene B" mean very different things, but are often wrongly interpreted by the text-mining program. There are different conventions in the Arabidopsis community regarding the notation of gene names, alleles, protein names, and variants (Table 1). Because of those different conventions, it is hard to differentiate between genes, proteins, alleles, or variants. Indeed, BANNER, and by extension EVEX, does not distinguish between those concepts and labels them all as GGP: gene or gene product.

Table 1: There are different conventions regarding the nomenclature of mutant variants, wild-type variants, phenotypes, proteins, different alleles and different genes with the same symbol. In this example the notation convention is shown for the fictitious variant, gene, protein, phenotype, and allele abc.

In this research we will employ the raw data of EVEX to extract events that are present during abiotic stress. In that way it is possible to list or visualize interactions between genes and gene products that occur during a certain stress condition such as drought or freezing temperatures.

1.3.2 Experimental data: CORNET

CORNET (short for CORrelation NETworks) is a database developed by the Department of Plant Systems Biology at the VIB UGent. The database holds information from experiments about known and predicted interactions between plant genes (De Bodt et al, 2012). The metadata of those experiments are annotated using ontology terms from the PO, the EO (http://www.gramene.org) and the MGED Ontology (MO) (Whetzel et al, 2006). For instance, one can browse experiments based on liquid growth media (GRAMENE), Protocol Package (MO) or flowering (PO) terms. CORNET uses different sources of data and already performs data integration. To access the information, three different tools can be consulted: the co-expression tool, the PPI tool and the TF tool. The co-expression tool calculates the correlation between gene expression profiles. Genes corresponding to the expression profiles are divided among thirteen partly overlapping subsets or compendia: ten specific compendia containing genes involved in abiotic stress, biotic stress, development, flower, genetic modification, hormone, leaf, root, seed and abiotic plus biotic stress, and three global compendia holding genes involved in the whole plant.
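To make the idea of correlating expression profiles concrete, the following Python sketch computes a Pearson correlation coefficient between hypothetical expression profiles and keeps the strongly correlated gene pairs as co-expression edges. It is an illustration under stated assumptions and not CORNET's implementation: the choice of Pearson correlation, the cutoff of 0.8, the gene names and the expression values are all invented for this example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equally long profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant profile has no defined correlation")
    return covariance / sqrt(var_x * var_y)


def coexpression_edges(profiles, threshold=0.8):
    """Return {(gene1, gene2): r} for gene pairs with |r| >= threshold."""
    genes = sorted(profiles)
    edges = {}
    for i, gene1 in enumerate(genes):
        for gene2 in genes[i + 1:]:
            r = pearson(profiles[gene1], profiles[gene2])
            if abs(r) >= threshold:
                edges[(gene1, gene2)] = r
    return edges


# Hypothetical expression values over four experiments (not real data).
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [1.0, 3.0, 2.0, 3.0],
}

for pair, r in coexpression_edges(profiles).items():
    print(pair, round(r, 3))
```

Restricting the profiles to the experiments of one compendium, for instance abiotic stress only, would yield edges specific to that condition, which is the kind of condition-specific network used later on.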
The PPI tool lists all available protein-protein interactions from other databases, such as IntAct (Aranda et al, 2010), BAR (Toufighi et al, 2005) and TAIR (Lamesch et al, 2012), including both experimentally verified and computationally predicted interactions. The TF tool retrieves regulatory interactions from AGRIS (Yilmaz et al, 2011) as well as interactions inferred from CORNET microarray data. In this research we will use co-expression data from the abiotic stress compendia and integrate them with our text-mining results using a plant environmental stress ontology. After data integration, an overview or visualization of those interactions can reveal interesting links. Furthermore, visualizing interactions specific for a type of stress or term from the ontology can uncover the rewiring of interactions between genes and gene products.

1.4 Differential networks

Abiotic stress responses of plants can be represented as a web of relations between genes, gene products, and other macromolecules. In order to get a clear view of this web, abiotic stress responses of plants can be represented in a network, which consists of genes and gene products (i.e. nodes) and their interactions (i.e. edges, directed or symmetrical relations). Such a network is very informative since it gives a bird's-eye view of the whole range of abiotic stress responses and provides the cellular context of all interacting genes and gene products (de la Fuente, 2010). However, this approach gives a static view of the interactions between genes and gene products (Ideker & Krogan, 2012). To obtain interactions that are characteristic of a plant's abiotic stress response, differential networking is applied. Differential networks thus focus on detecting changes in interactions between genes and/or gene products over different conditions. These conditions can be different subject types (e.g.
mutant versus control), tissue types (e.g. leaf versus root), time points in a time series, or different climate types (e.g. normal versus drought stress) (Gill et al, 2010). In that way the rewiring of physical interactions in a biological process can be visualized, leading to meaningful insights into pathway alterations across different conditions.

1.4.1 Construction possibilities

The most common way to construct a differential network is to include interactions between differentially expressed genes (Gambardella et al, 2013; Ideker et al, 2002). The differential network algorithm requires a single network of a biological process to identify a set of genes, the differential network, whose expression changes across two conditions (e.g. salt stress and normal conditions). However, this method does not incorporate the size of the change. Changes in expression can be very small or even absent, but are visualized in the same way as strong changes in expression. To overcome this problem, differential co-regulation was used as the interaction in differential networks. This meant that the differential network included genes and gene products which were co-regulated in a certain condition, but not in other conditions (de la Fuente, 2010; Ideker & Krogan, 2012). More advanced methods do not only take differential expression into account, but also try to identify the differential network in which the most changed interactions are present. That takes a lot of computing time, since all possible differential networks have to be examined in order to construct the network with the interactions that changed the most. Therefore, those methods are limited in the number of conditions that can be compared (Kostka & Spang, 2004; Choi & Kendziorski, 2009; Langfelder et al, 2011). In general, the algorithm calculating differential networks employs two key components.
These are a scoring function, which calculates the alteration of a differential network across different conditions, and a search algorithm, which extracts the highest-scoring differential network. Finding the highest score is a complex process that requires heuristic, or approximate, methods that rely on greedy search algorithms. This process also costs a lot of computing time (Gill et al, 2010). The Differential Network Analysis Algorithm (DINA) (Gambardella et al, 2013), the Condition-Specific sub-network algorithm (COSINE) (Ma et al, 2011) and Diffany (Van Landeghem et al) are three examples of programs that calculate differential networks. DINA requires as input a set of genes involved in a known pathway of interest. The probability of co-regulation of these genes is calculated across different condition-specific networks, i.e. tissue types in Mammalia. After that, a co-regulation probability radar chart is drawn (Figure 7).

Figure 7: Representation of the co-regulation probability radar chart in DINA for a set of genes involved in the pancreatic secretion pathway. In this example, a set of genes involved in pancreatic secretion was taken from KEGG (hsa04972) to test the co-regulation probability of DINA. As seen on the radar chart, there is strong evidence that the genes (depicted on the left) are co-regulated in the pancreas and are present in Mammalia. Image taken from dina.tigem.it.

COSINE considers both the differential expression of individual genes and the differential correlation of gene pairs. It applies a scoring function measuring the condition-specific changes of two genes (nodes) and their co-expression (edge) to form a differential network. Both the alteration in the expression level of individual genes and the variation of gene correlation in different conditions are thus analyzed. Since COSINE only handles the co-expression between genes and the differential expression of a gene as input, differential networks cannot be formed based on other data. Diffany is explained in detail in section 1.4.2.

1.4.2 The Diffany tool to construct differential networks

Diffany's input consists of one reference network and one or more condition-dependent networks. The edges between nodes can represent different interaction types such as binding, activation, co-expression and phosphorylation. Those types of interactions are modeled in an ontology (Figure 8). When a network is analyzed, Diffany uses the ontology to categorize the relations present in the network. Based on the terms in this framework, Diffany compares each interaction to form a differential network. The algorithm calculates the overlap between the reference network and a condition-dependent network. Based on this overlap, the differences between the two input networks are calculated and a differential network is constructed.

In this research, the network visualization program Cytoscape will be used (Shannon et al, 2003). The program is capable of visualizing elaborate biological networks and can execute network calculations. To calculate and construct differential networks, the Diffany plugin will be used since it allows multiple data sources as input and creates differential networks based on two methods. In contrast to other differential networking tools, Diffany can compute both one-against-one and one-against-all comparisons of reference and condition-dependent networks (Figure 9 and Figure 10). In the one-against-one approach the reference network is compared against a network comprising interactions between genes and gene products under one type of abiotic stress, e.g. drought stress. When the one-against-all method is applied, the reference network is compared against several networks at once, e.g. three networks (drought stress, cold stress and wounding stress dependent networks). This method of comparison involves a challenge. It is possible that genes are differentially expressed in the cold stress and the drought stress dependent network, but not in the wounding stress dependent network. Therefore, Diffany only constructs a differential network including interactions that changed in the same way in all of the condition-dependent/reference comparisons (Figure 10A).

Figure 8: Diffany employs an ontology to classify different types of interactions. In that way relations can be compared.

Figure 9: Scheme of one reference network and three condition-dependent networks. Image A shows the reference network under non-stress conditions. Images B, C, and D show condition-dependent networks of three different stress conditions.
The interactions between hypothetical genes A, B, D, and E are present under all conditions (red rectangle). These could represent housekeeping genes and would be removed when differential networks are constructed (Figure 10 B, C, and D). Interactions between hypothetical genes A and G are only present under stress conditions (green rectangle). This interaction would be present if the three stress condition networks were merged into one differential network (Figure 10A). Blue lines represent interactions present in the reference network but not present in all stress differential networks. Note that in a one-against-all approach, it would be challenging to represent these kinds of interactions in one differential network of all stress conditions since they do not occur in every stress condition. A one-against-one approach would solve this. This means that each individual stress network would be compared to the reference network. In that way, there would be three differential networks (Figure 10 B, C, and D). Purple lines in the three stress-dependent networks represent interactions unique to that stress condition. Again, these interactions would be challenging to represent in one differential network of all stress conditions.

Figure 10: Representation of four differential networks constructed in Diffany based on the reference and condition-dependent networks depicted in Figure 9.

PART 2: Aim of Research Project

In this research we will investigate whether differential networks can discover changed interactions between genes and gene products under specific types of abiotic stress, using Diffany, a Cytoscape plugin. Moreover, we will examine the contribution of TM data to these differential networks. With differential networks, the rewiring of biological processes is visualized. As such, differential networks can offer insights into the underlying biological processes that correspond to this rewiring.

To do this, we will first model different abiotic stress conditions by creating an ontology: the Plant Environmental Stress Ontology (PESO). It is not necessary to invent an entirely new ontology since a number of ontologies are already available, which will be used to construct the basis of the PESO. As such, GO, EO, EnvO, and PO will all be considered for the creation of the PESO. Since the value of knowledge is significantly increased when data integration is applied, we will use the PESO to integrate two different data sources: TM data from the EVEX database and co-expression data from the CORNET database. Both sources will be checked for genes and gene products interacting with each other when abiotic stress occurs.
TM data will be created using a stress trigger algorithm, which searches sentences for interactions between genes and gene products during the occurrence of abiotic stress. The EVEX database already contains parsed articles, which will now be searched for descriptions of abiotic stress. To retrieve co-expression data from CORNET, we will create user-defined compendia for different types of abiotic stress. Those compendia will contain all genes and gene products that are differentially expressed during various types of abiotic stress, as well as genes and gene products with expression patterns similar to those differentially expressed genes, according to the CORNET database. After integration we will have, for each term in the ontology, a list of genes and gene products interacting with each other, either predicted by the EVEX TM algorithm and the stress trigger algorithm or experimentally validated by experiments stored in the CORNET database.

To allow biological interpretation, those lists will be visualized in a network. In general, we will create one big reference network, comprising interactions between genes and gene products under normal conditions. For each term in the PESO we will create a condition-specific network, including interactions between genes and gene products occurring under the respective abiotic stress condition. Using Diffany, the different condition-dependent networks will be compared against the reference network, creating a differential network that reveals the rewiring of the plant interactome when abiotic stress is present. We will explore both differential networks resulting from one-against-one comparisons (one condition-dependent network is compared to one reference network) and differential networks resulting from one-against-all comparisons (multiple condition-dependent networks are compared to each other and to one reference network). In that way we can conclude which approach shows the most biological information.
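The two comparison modes can be illustrated with plain set operations on edge lists. The Python sketch below is a deliberate simplification and not Diffany's actual algorithm, which additionally relies on an interaction ontology: here an edge either exists or not, the function names `one_against_one` and `one_against_all` are hypothetical, and the toy networks loosely follow the hypothetical genes of Figure 9.

```python
def edge(a, b):
    """Represent a symmetrical interaction as a sorted pair of node names."""
    return tuple(sorted((a, b)))


def one_against_one(reference, condition):
    """Edges gained and lost in one condition relative to the reference."""
    return {
        "gained": condition - reference,
        "lost": reference - condition,
    }


def one_against_all(reference, conditions):
    """Keep only the changes that occur in every condition-specific network."""
    if not conditions:
        raise ValueError("at least one condition-dependent network is required")
    diffs = [one_against_one(reference, condition) for condition in conditions]
    return {
        "gained": set.intersection(*(d["gained"] for d in diffs)),
        "lost": set.intersection(*(d["lost"] for d in diffs)),
    }


# Toy networks with hypothetical genes (illustrative only).
core = {edge("A", "B"), edge("B", "D"), edge("D", "E")}
reference = core | {edge("A", "C")}
drought = core | {edge("A", "G"), edge("C", "F")}
cold = core | {edge("A", "G"), edge("A", "C")}
wounding = core | {edge("A", "G")}

print(one_against_one(reference, drought))
print(one_against_all(reference, [drought, cold, wounding]))
```

In this toy data the interaction between A and G is gained under every stress condition and therefore survives the one-against-all comparison, whereas the loss of the interaction between A and C is only visible in the one-against-one comparisons because it does not occur under cold stress.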
With these techniques, we want to discover (i) whether the overall structure of condition-dependent and reference networks is different, (ii) whether the connectivity of a set of genes (i.e. using a known pathway occurring during stress as a case study) has changed between condition-dependent and reference networks, and (iii) whether TM can contribute as a data source to the construction of differential networks. After observing several differential networks, we will conclude whether this approach allowed us to reach our goal, i.e. using two data sources, EVEX and CORNET, and the Diffany tool to create biologically meaningful differential networks. Figure 11 gives an overview of this thesis.

Figure 11: Overview of thesis. In this thesis, we will examine whether differential networks can contribute to the knowledge about a plant's response to abiotic stress and whether text mining can serve as a reliable data type to help construct those differential networks. To do this, text-mining data from the EVEX database and experimental data from the CORNET database will be integrated by using a plant environmental stress ontology. After integration, two kinds of networks will be generated: a reference network, holding all interactions between genes and/or gene products which occur in normal conditions (i.e. no stress); and a series of condition-dependent networks, including all interactions between genes and/or gene products which occur in stress conditions such as drought stress and cold stress. Using the Cytoscape plugin Diffany, we will construct differential networks, which will only show interactions between genes and/or gene products that differ between the reference network and the condition-dependent network. In other words, we will model the dynamic nature of molecular networks by showing and analyzing specifically those interactions that changed under abiotic stress.
PART 3: Results

We examined the ability of differential networks to reveal abiotic stress specific interactions between A. thaliana genes and gene products using Diffany, a Cytoscape plugin. Furthermore, we explored the contribution of TM data to these differential networks. To this end, we integrated two data sources employing an ontology specifically designed to describe GGPs involved in responses to abiotic stress. This ontology or structured vocabulary was based on a number of existing ontologies such as GO and EO. The two data sources included text-mining data from the EVEX database and co-expression data from the CORNET database. After integration, a reference network and various condition-dependent networks were constructed using the Cytoscape network visualization tool. The reference network included all interactions between GGPs under normal conditions, while condition-dependent networks included all interactions between GGPs under different stress conditions such as osmotic stress or ABA stress. These condition-dependent networks were then compared against the reference network, creating a differential network which only included changed interactions between GGPs.

3.1 Creation of the abiotic stress ontology

In this thesis we developed an abiotic stress ontology or structured vocabulary for the integration of two data sources. Furthermore, this ontology allowed automated reasoning since it was a formal framework in which GGPs were classified. The construction of this specific ontology was based on a number of existing ontologies, among which the well-known GO, the EO, the EnvO, and the PO. Therefore, we examined those four ontologies by comparing their weak and strong characteristics. Each of them is described in more detail below, together with the properties we will use in our newly developed ontology.

3.1.1 The Gene Ontology

GO's structure resembles a tree with three roots or domains: cellular component, biological process, and molecular function. These roots are not related to each other, meaning that they do not have a common parent. All terms and descriptions in GO are categorized under one of these three root terms. General terms and descriptions are positioned close to the root terms, while specific and more detailed terms are placed further away from the root terms. Each term in GO's structure has a couple of essential elements, including a unique term name and identifier. For instance, response to stress is a unique term with the unique identifier GO:0006950. Each term also has a namespace, which denotes the root term it belongs to. In this example, response to stress is categorized under biological process (GO:0008150). A definition is given to each term, which is a textual description of what this term represents. Links between terms represent how terms are related to each other. Alternative words or sentences closely related in meaning to a term are incorporated as synonyms. GO differentiates between four kinds of synonyms (Table 2).

Table 2: GO differentiates between four kinds of synonyms for its terms.

When the GO project was initiated in 1998, it had three major goals. First, the GO project strove to develop a set of controlled, structured vocabularies to describe key domains of molecular biology, including gene product attributes and biological sequences. Secondly, the GO project tried to incorporate its terms and descriptions in the annotations of sequences, genes or gene products in biological databases.
Lastly, GO attempted to provide a centralized public resource allowing universal access to the ontologies, annotation data sets and software tools developed for use with GO data (Consortium, 2004). GO is an open project, which means that researchers from every field can propose changes to the structure of the ontology. This fact partly reverses the generic character of GO.

The GO categorizes genes and gene products from multiple species, among which Drosophila, Mus musculus (mouse), Saccharomyces cerevisiae (baker's yeast), Danio rerio (zebrafish) and A. thaliana. The incorporation of so many species presents a challenge: species-specific terms are required to describe functions, processes, and components in individual groups of organisms (Hill et al, 2002). It thus becomes difficult to maintain both the global interspecies relationships for which GO strives and the precise terms required for intra-species gene annotation.

We borrowed a number of terms from GO to construct our own PES ontology. For instance, response to abiotic stress, response to nitrosative stress, and response to cold stress were all found in the GO and transferred to the PESO (Table 3). Note that some PESO terms have multiple GO IDs. This is because GO differentiates between a response to stress and a cellular response to stress, whereas we did not distinguish between locations of the stress response. As such, GO's response to salt stress (GO:0009651) and cellular response to salt stress (GO:0071472) are merged into one PESO term, response to salt stress (PESO:0610). Just like the GO, we incorporated synonyms for almost every term and assigned a unique identifier to each PESO term.

Table 3: All PESO terms with their ID and references to GO, EnvO and EO.
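The merging of several GO terms into one PESO term can be pictured as a small record with cross-references. The following is only a minimal Python sketch under our own assumptions: the class name `PesoTerm`, the helper `build_go_index` and the synonym shown are hypothetical illustrations, while the three identifiers are the ones mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PesoTerm:
    """One PESO term with cross-references to the GO terms it was borrowed from."""
    peso_id: str
    name: str
    go_ids: tuple = ()      # one PESO term may map to several GO terms
    synonyms: tuple = ()


# Identifiers taken from the text; the synonym is a made-up placeholder.
SALT = PesoTerm(
    peso_id="PESO:0610",
    name="response to salt stress",
    go_ids=("GO:0009651", "GO:0071472"),
    synonyms=("response to salinity",),
)


def build_go_index(terms):
    """Map every GO ID to the PESO ID it was merged into."""
    index = {}
    for term in terms:
        for go_id in term.go_ids:
            index[go_id] = term.peso_id
    return index


GO_TO_PESO = build_go_index([SALT])
```

With such an index, an annotation that uses either the general or the cellular GO term resolves to the same PESO term.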
3.1.2 The Environmental Ontology

The EnvO provides a controlled, structured vocabulary that is designed to support the annotation of any organism or biological sample with environment descriptors. EnvO provides a definition and a unique identifier for each term. For instance, the term acid habitat has ID ENVO:00002021 and the definition "a habitat in which the pH is lower than seven". EnvO's most developed root terms, which are of primary interest to annotators, are environmental systems, environmental features, and environmental materials. An environmental system is defined as a system to which its communities have evolved. Examples of such community or biome terms are: boreal moist forest biome, tropical rain forest biome, and oceanic pelagic zone biome. Environmental features include environmental systems that make reference to some central, supporting feature. Examples of environmental feature terms are: mountain, pond, whale fall, and karst. Environments containing a mountain, a pond, a whale fall or a karst would not have the same properties if those features were not present. The class environmental material describes terms that cannot be defined by one type of material or medium. The term soil, for instance, can comprise fine rock particles, sand grains, various microorganisms and other organic materials. Other, less developed upper-level terms include ecozone, ecoregion, chemical entity, quality and environmental condition. With these upper-level terms and their child terms, a standardized, comprehensive description of environments is given.

The terms categorized under environmental system were found to be highly relevant for the creation of the ontology describing a plant's response to abiotic stress, particularly the terms under extreme habitat (Figure 12 and Table 3). The environmental material, condition and feature roots were less relevant to the creation of the new ontology.

Figure 12. Representation of the branch Environmental system in EnvO. The terms classified under Extreme habitat can be useful to describe abiotic stress environments for plants. Black arrows represent the is a relation between the terms. Note that only a couple of child terms of Extreme habitat are represented here.

The EnvO was created because there was a need for consistent descriptions of the environmental origins of tissue, pathogen, and metagenomics samples. It was built on the OBO Foundry principles, which makes the ontology interoperable with other OBO ontologies such as GO and PO. Just like GO, EnvO is an open project, which allows its users to propose adjustments to the ontology.

3.1.3 The Plant Environmental Ontology

The EO includes a set of standardized controlled vocabularies describing various types of treatments given to an individual plant, a population of plants or a certain cell type in order to evaluate the response to that exposure. The ontology has four root terms: abiotic environment, biotic environment, study type, and unknown environment. The abiotic and biotic environments describe treatments such as temperature environment (EO:0007175) and virus (EO:0007106) respectively. The study type term includes terms such as green house and field study, which identify certain growth study facilities. Unknown environment is meant for terms that do not fit any of the other root terms; it currently holds no child terms. This ontology, and in particular the abiotic environment term and its child terms, provides a good base to start modeling our plant environment stress ontology.

3.1.4 The Plant Ontology

The core of the PO includes two root terms: plant structure development stage based on published growth stage descriptors (e.g. fruit development PO:0001002) and plant anatomical entity categorizing cell and tissue types in the plant body (e.g. trichome PO:0000282) (Plant Ontology Consortium, 2002).
This means that there are no terms that denote the responses of a plant to abiotic stress. The PO resembles the GO because they both follow the ontology principles set by the OBO Foundry initiative (Smith et al, 2007). Just like the GO, the PO has four types of synonyms for its terms, as seen in Table 2, and unique identifiers for each term (e.g. PO:0025195 for Pollen tube cell) (Cooper et al, 2013). The PO was created because of the inconsistent description of gene structures, gene products, gene functions, phenotypes, traits, developmental stages and anatomical parts in scientific literature.

3.1.5 The Plant Environmental Stress Ontology

It is not necessary to create a whole new ontology to model abiotic stress responses of plants, since a number of existing ontologies already provide terms or structural elements for the development of this ontology. Still, we designed a 'new' ontology, mainly because the previously described ontologies were not elaborate enough regarding abiotic stress responses of plants. Furthermore, relations between terms describing abiotic stress responses in plants were often too complex, and the terms themselves were sometimes scattered over the consulted ontologies (Figure 13). For instance, the GO term Response to cold (GO:0009409) has two parent terms, Response to stress and Response to temperature stimulus, making the ontology complex. Moreover, Response to temperature stimulus is not directly linked to the Response to stress term, which seems unusual. This structure was thus revised for the Plant Environmental Stress Ontology (PESO).

Figure 13. Term neighborhood for Cellular response to cold to illustrate the complex relations between terms. Black arrows represent the is a relation between terms. Response to temperature stimulus is only indirectly linked to Response to stress, which makes the ontology complex. Figure adapted from the Gene Ontology database (Ashburner et al, 2000).
The general structure of the PESO was based on a Plant Stress Ontology used in a case study demonstrating the benefits of combining data integration with text-mining (Hassani-Pak et al, 2010). In that ontology, the Abiotic stress term had two child terms: Chemical abiotic stress and Environmental abiotic stress. For the PESO we employed the same terms as root terms (Figure 15). Under the Chemical abiotic stress term, oxidative, nitrosative and hormone stress were classified. The Environmental abiotic stress term included water, osmotic, wounding, pH, temperature and light stress. Note that Response to drought and Response to flooding, two antonyms, have different parents: Response to osmotic stress and Response to water stress respectively. This choice was made after reading several scientific articles describing osmotic stress experiments, in which drought stress was almost always accompanied by osmotic stress. Therefore, the term Response to drought stress was modeled as a child term of Response to osmotic stress. Response to flooding, meaning the response of a plant to an excess of water, was categorized under the Response to water stress term since flooding is not a form of osmotic stress.

Figure 14. Overview of the Plant Stress Ontology. The red square contains the overall structure of the ASS ontology we developed and used in this research. Figure adapted from (Hassani-Pak et al, 2010).

To obtain a straightforward structure, we decided to only include is a relations in the ontology, resulting in a hierarchical framework. No part of relations were taken into account and thus no DAG was created, meaning that each term (except for the upper-level term Response to abiotic stress) had exactly one parent term (Figure 15). Independent concepts such as duration, location and concentration of stress were not modeled into the ontology due to time limits.
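Because every term has exactly one parent, the hierarchy described above can be stored as a simple child-to-parent map. The snippet below is a minimal Python sketch for illustration only: the dictionary `PARENT` and the function `ancestors` are hypothetical names, the term labels approximate those used in the text, and only a handful of the PESO terms are included.

```python
# child term -> its single parent ("is a" relation); a small subset of the PESO
PARENT = {
    "Chemical abiotic stress": "Response to abiotic stress",
    "Environmental abiotic stress": "Response to abiotic stress",
    "Response to osmotic stress": "Environmental abiotic stress",
    "Response to water stress": "Environmental abiotic stress",
    "Response to drought": "Response to osmotic stress",
    "Response to flooding": "Response to water stress",
}


def ancestors(term, parent=PARENT):
    """Return the chain of parents from `term` up to the root term."""
    chain = []
    while term in parent:
        term = parent[term]
        chain.append(term)
    return chain
```

In a DAG, a term such as GO's Response to cold would need a list of parents and the walk would branch; the single-parent design keeps every path to the root unique.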
As seen in Section 1.2.1, it is recommended to model such independent terms in a different ontology to avoid complexity in the main PESO (Figure 4). As a result, the genes would be categorized in four different ontologies (Table 4). The development of such ontologies is a future goal and is not to be underestimated.

Table 4. To avoid complexity, independent concepts such as duration, localization and concentration of stress are modeled in separate ontologies. This example uses the sentence "The GsAPK gene is differentially expressed when 100 µL ABA is applied to the roots of one-month-old seedlings". Data adapted from (Yang et al, 2012).

Figure 15. Overview of the Plant Environment Stress Ontology.

3.2 Text-mining

TM is the automated extraction of knowledge from plain text. For this research, we built upon the EVEX resource to generate stress-specific TM data. In its current form, EVEX identifies genes or gene products (GGPs) and their relations, which are called events. Multiple events can occur in one sentence (Figure 16). Raw data for roughly 4.5 million (4,532,311) articles are stored in the EVEX database. Of those 4.5 million articles, 29,223 involve the A. thaliana model organism: 8,002 PMC full articles and 21,221 PubMed articles including only title and abstract. We created a stress trigger algorithm to parse the 8,002 full articles and abstracts for abiotic stress terms from the PESO. If one or more stress terms were described in a sentence, the stress trigger algorithm assumed that the events extracted by EVEX occurred during that type of stress. The stress trigger algorithm found a total of 1,441 full articles in which abiotic stress responses in A. thaliana were described.

Figure 16: EVEX generates two sorts of output data: a set of GGPs used in the article and a list of events connecting the GGPs by interactions.
The sentence is depicted using the Stav Visualiser in the EVEX database. Information is taken from article PMCID 3306294.

It is important to note that mentioning A. thaliana in an article does not necessarily mean that A. thaliana is involved in the experiments described in that article. A. thaliana is a model organism and could be used as an example in the text or as an attribute of an experiment. For instance, the article Modulation of intracellular calcium and proliferative activity of invertebrate and vertebrate cells by ethylene (PMCID 32299) mentions A. thaliana only once, to state a general, well-known fact. Still, different responses to ethylene are picked up by the algorithm, but those concern another organism: the marine sponge Suberites domuncula. That is why we added an extra filter checking whether the described responses to abiotic stress occurred in A. thaliana. As explained in 1.3.1, each GGP is assigned an NCBI taxon reference indicating an organism. We thus matched the NCBI taxon reference of each GGP involved in an abiotic stress response to the A. thaliana NCBI taxon reference (3702).

3.2.1 The stress trigger algorithm

We built a stress trigger algorithm as an extra layer on top of the EVEX data to extract events occurring during abiotic stress. For each article stored in EVEX, two files are generated. One file is a flat text file including the sentences in which EVEX and its tools found GGPs and/or events. The other file contains a list of the identified events and GGPs with their references to species, gene and family IDs and confidence values. Both files were used as input for the stress trigger algorithm, which searched for the presence of abiotic stress descriptions in the texts. The algorithm's objective was to identify events in EVEX that occur during various types of abiotic stress.
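The A. thaliana filter described at the start of Section 3.2 amounts to a lookup of each GGP's taxon reference. The following Python sketch is an illustration under our own assumptions rather than the implementation used in this thesis: the function name, the event tuple layout and the requirement that every GGP in an event matches are hypothetical choices; only the taxon ID 3702 comes from the text.

```python
ARABIDOPSIS_TAXON = 3702  # NCBI taxon reference for A. thaliana, as given in the text


def keep_arabidopsis_events(events, taxon_of):
    """Keep the events whose GGPs all carry the A. thaliana taxon reference.

    events   : list of (source_ggp, interaction, target_ggp); source may be None
    taxon_of : dict mapping a GGP name to its NCBI taxon ID (hypothetical layout)
    """
    kept = []
    for source, interaction, target in events:
        ggps = [g for g in (source, target) if g is not None]
        # Assumption: an event is kept only if every GGP in it is from A. thaliana.
        if all(taxon_of.get(g) == ARABIDOPSIS_TAXON for g in ggps):
            kept.append((source, interaction, target))
    return kept
```

A less strict variant could keep an event as soon as one of its GGPs matches; which of the two is preferable depends on how often articles mix organisms within one sentence.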
To accomplish that, the algorithm analyzed the sentences word by word and matched the words against manually assembled lists of words involving different descriptions of abiotic stress (Section 6.2). In order to mark an event as occurring during abiotic stress, the algorithm had to mark both the establishment of stress and the kind of stress that was described. Therefore, four kinds of lists were made. The first list contained single words that described the presence of stress, such as stress, treatment, responses, stimulation and exposure. The second list included single words describing the kind of stress. Basically, this list was an elaborate version of the PES ontology, including words such as abiotic, environmental, drought, NaCl and ethylene. The third and fourth lists contained chains of words. They included parts of sentences or expressions such as water deficit, water deprivation, temperature rise or low temperature. Since the algorithm tried to match the regular expressions word by word, including these lists prevented the algorithm from missing occurrences of the right kind of stress. Indeed, there is a big difference between picking up water stress and water deficit stress. When such a chain of words was identified in a sentence, the algorithm skipped the words that were part of that chain (Figure 17).

Figure 17: The stress trigger algorithm processes word by word except when multiple words combined express the establishment of stress or a kind of stress. Each word is checked as to whether it describes the establishment of stress or the type of stress. In this example the algorithm skips deficit, since it is part of the water deficit word chain or suffix tree.

In order to pick up as many stress-related events as possible, all lists included synonyms and variants of each word. As such, synonyms for wounding (injure, harming and stabbing) were recognized, but also variants of salt (salty, saline, salinity and NaCl). The Perl language was used to implement the highlighting of both the establishment and the type of stress, since that language is exceptionally strong in pattern matching. The algorithm could pick up Inducing, induced, inducible, induction and induce using one regular expression: [Ii]nduc.+.

When both the establishment of stress and a specific type of abiotic stress were described in one sentence, all events in that sentence were stored in folders corresponding to the type of abiotic stress. As such, 33 lists with events were made, one for each term in the PES ontology.

It is important to note that the interactions stored by the stress trigger algorithm only serve the condition-dependent networks, since sentences with GGPs were only parsed for abiotic stress trigger words. A sentence such as gene A regulates gene B under drought stress means that genes A and B interact during drought stress. But gene A does not regulate gene B under drought stress does not necessarily mean that there is an interaction between A and B under normal conditions. There could be no interaction at all between genes A and B, or there could be an interaction between those two genes under other stress conditions. That is why we omitted sentences with negation and only used TM data for the condition-dependent networks. The control network will be constructed with data from CORNET.

3.2.2 Evaluation of the stress trigger algorithm

To evaluate the algorithm, we determined the F-score, which is a measure of a test's accuracy. The F-measure considers both the precision and the recall (or sensitivity) of the algorithm. Precision is the number of correct results divided by the number of all results: $\text{precision} = \frac{TP}{TP + FP}$. Recall represents the fraction of relevant results that are retrieved: $\text{recall} = \frac{TP}{TP + FN}$ (Table 5).
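The sentence-level matching of Section 3.2.1 can be sketched in a few lines. Note that the thesis algorithm was written in Perl; the version below is a simplified Python illustration, not the original code. The word lists are heavily shortened and hypothetical (the real lists are given in Section 6.2), and the mapping of word chains to stress types is our own assumption.

```python
import re

# Shortened, hypothetical stand-ins for the four manually assembled lists.
ESTABLISHMENT = re.compile(
    r"stress|treatment|responses?|stimulation|exposure|[Ii]nduc.+"
)
STRESS_TYPES = {
    "drought": "drought", "cold": "cold",
    "salt": "salt", "salty": "salt", "saline": "salt",
    "salinity": "salt", "nacl": "salt",
}
# Word chains are tried before single words; the stress type assigned to each
# chain is an illustrative assumption.
WORD_CHAINS = {
    ("water", "deficit"): "drought",
    ("water", "deprivation"): "drought",
    ("low", "temperature"): "cold",
}


def stress_triggers(sentence):
    """Return (stress_established, set_of_stress_types) for one sentence."""
    words = re.findall(r"[A-Za-z0-9]+", sentence)
    established, types = False, set()
    i = 0
    while i < len(words):
        pair = tuple(w.lower() for w in words[i:i + 2])
        if pair in WORD_CHAINS:
            types.add(WORD_CHAINS[pair])
            i += 2  # skip the words that belong to the chain
            continue
        word = words[i]
        if ESTABLISHMENT.fullmatch(word):
            established = True
        elif word.lower() in STRESS_TYPES:
            types.add(STRESS_TYPES[word.lower()])
        i += 1
    return established, types


def assign_events(sentence, events):
    """Store the events of a sentence per stress type, if both triggers fire."""
    established, types = stress_triggers(sentence)
    if not (established and types):
        return {}
    return {stress_type: list(events) for stress_type in types}
```

Checking the word chains first is what makes the matcher skip deficit once water deficit has been recognized, as in Figure 17.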
During the creation of this algorithm, we wanted to retrieve as many interactions occurring in the presence of abiotic stress as possible, without including too many false positive interactions. The F-score is the harmonic mean of precision (the positive predictive value) and recall (the true positive rate): $F = \frac{2 \times R \times P}{R + P}$. We randomly selected three papers to manually parse for events occurring during stress. In that way we could check whether the algorithm picked up the same responses as we did manually. Based on this evaluation, the F-score could be determined.

Table 5. General overview of the four kinds of results. The test algorithm determines whether an event in a sentence occurs during stress (POSITIVE) or not (NEGATIVE). Based on manual curation, one can determine which positive results are true positives and which are false positives. The same is done for the negative test results. In that way, both the accuracy and the specificity of the stress trigger algorithm are determined.

For each event identified by EVEX, we manually checked whether it was a biologically meaningful event. Events falsely predicted by EVEX were not taken into account when calculating the F-score, since evaluating the accuracy of EVEX is beyond the scope of this research. Note that the algorithm was evaluated on two levels. If TEES extracted a correct event, the algorithm was tested on whether or not it could highlight the occurrence of stress and, if so, whether it could identify the specific type of stress. For instance, in the sentence Gene A activates gene B under drought stress, the event gene A positively regulates gene B is correctly extracted by EVEX, and the algorithm should determine that this event occurs during abiotic stress, and more specifically during drought stress. As such, we calculated two F-scores: a less stringent one for the presence of abiotic stress, and a more stringent one for the type of abiotic stress.
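As a worked illustration of these measures, the short Python sketch below computes precision, recall and the F-score from counts of true positives, false positives and false negatives. The function names are ours, and the counts in the example are invented for illustration; they are not the counts obtained in this thesis (those are listed in Section 6.3).

```python
def precision(tp, fp):
    """Fraction of the events flagged as stress-related that are correct."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of the truly stress-related events that were flagged."""
    return tp / (tp + fn)


def f_score(tp, fp, fn):
    """Harmonic mean of precision and recall: 2 * R * P / (R + P)."""
    if tp == 0:
        return 0.0  # avoids division by zero when nothing was found correctly
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * r * p / (r + p)


# Invented example counts: 9 true positives, 1 false positive, 3 false negatives.
example_f = f_score(tp=9, fp=1, fn=3)
```

For these invented counts, precision is 0.90 and recall is 0.75, so the F-score lies between the two and closer to the lower value, which is the characteristic behavior of a harmonic mean.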
The algorithm was constructed based on the paper GsAPK, an ABA-activated and Calcium-Independent SnRK2-Type Kinase from G. soja, Mediates the Regulation of Plant Tolerance to Salinity and ABA stress (PMCID 3306294). Note that this paper did not involve A. thaliana. Its results were thus not included in further steps. This paper was randomly selected from PubMed as a basis for the first versions of the algorithm. The A. thaliana filter, described at the beginning of this section, was added later.

After conducting the first test run with that paper, we obtained an F-score of 0.93 both for recognizing the occurrence of stress and for recognizing the specific type of stress (Table 6). An F-score close to 1.00 meant that the algorithm performed very well. Because testing was done with the same text on which the algorithm was based, it was not surprising that the algorithm scored high on both precision and recall. When we tested the same algorithm with another paper (PMCID 2996028), we obtained lower scores for precision and thus lower F-scores for both noticing the occurrence of stress and deducing the right type of stress, 0.78 and 0.62 respectively (Table 6). This meant that the algorithm relied too much on the initial text and was not generic enough to parse other texts. After some mild adjustments, we tested the same paper again and obtained an F-score of 0.91 for both highlighting stress and stress specificity. A third paper was manually parsed during the second test run to confirm that the algorithm did not build too much upon the first two papers. This was not the case, since we obtained F-scores for recognizing stress and the specificity of stress of 0.83 and 0.79 respectively (Table 6). For the absolute values of the counted events, see Section 6.3.

Table 6. Overview of the precision (P), recall (R), and F-score (F) for the three papers used to test the performance of the stress trigger algorithm. Two test runs were conducted, each involving manual curation of two papers. For each event correctly determined by EVEX, the possible presence under stress conditions was evaluated. Precision, recall and F-score ($F = \frac{2 \times R \times P}{R + P}$) were determined for the presence of abiotic stress and for the presence of the right type of abiotic stress.

The first modification was that the events categorized under response to basic stress (pH > 7.4) were omitted from the final TM output. The algorithm generated many false positive results by recognizing parts of sentences such as basic leucine zipper, basically and the basic level. These expressions were wrongly interpreted by the algorithm since they do not describe responses to basic stress. The second alteration excluded presence of [type of chemical compound] as a trigger for the occurrence of chemical stress. After reviewing some false positive results, it became clear that mentioning the presence of [chemical compound] implied the existence of a normal amount of that compound and no occurrence of stress.
The third adjustment disallowed the word effectively as a trigger for the presence of stress. This word was originally picked up because the algorithm matched words similar to effect as evidence of the occurrence of stress in a sentence. The fourth change also involved the stress trigger words: sentences with expressions similar to the exogenous administration of [chemical compound] were from then on considered descriptions of the occurrence of stress. The fifth and final modification omitted tolerance to [type of abiotic stress] as a trigger for the algorithm, since this expression mostly meant that a mutant plant was more or less tolerant to a type of abiotic stress. In that way, events in the sentence […] Arabidopsis plants engineered to be drought-tolerant through overexpression of ABF3 […] will not be picked up by the stress trigger algorithm.

Note that it is easy to over-fit the rules and filters of the algorithm to one particular paper, since the authors of that article have a particular writing style. When similar versions of the same sentence describing abiotic stress responses are used over and over again, the F-score changes drastically. The sentences are either all wrongly interpreted by the algorithm, or they are all correctly analyzed, since they are very much alike.

3.2.3 The impact of wrongly predicted events by EVEX

As previously mentioned, evaluating or improving EVEX is beyond the scope of this research. However, it is important that common mistakes of EVEX are taken into account when EVEX's raw data is processed further. The BANNER tool, which identifies genes and/or gene products in a sentence, sometimes overlooks GGPs or identifies the wrong GGPs. For instance, Fig, Qin, and Zea were identified as GGPs in the texts used for constructing and testing the stress trigger algorithm.
Fig (fos intronic gene) and Qin (CG43726-PB) are the names of two proteins in Drosophila melanogaster, but they are often wrongly identified because fig is also an abbreviation for figure and Qin is the name of an author. Zea is part of the taxonomic name Zea mays and is often wrongly identified as a GGP.

The TEES tool does not always extract the right event. In one of the test papers (PMCID 2996028), variants of AtAIRp1-overexpressing plants were subjected to [type of abiotic stress] are mentioned very often. Almost every occurrence of this variant is wrongly extracted as an event: positive regulation of AtAIRp1. As such, the output of the stress trigger algorithm will also be wrong: AtAIRp1 is positively regulated under abiotic stress. These kinds of mistakes are thus not to be underestimated when TM data is processed further.

3.2.4 Visualizing TM data

The TM data generated with the stress trigger algorithm can serve as input to condition-dependent networks. However, the modeling of reference networks from text is more challenging. One option involves adding an extra layer to the algorithm that searches for negation in events. For instance, in the sentence Gene A does not bind with gene B under drought stress, the stress trigger algorithm could identify that the relation between A and B does not occur under drought stress. However, this sentence does not prove that gene A and gene B bind under normal conditions. It could very well be that genes A and B never bind under any circumstance. Another possibility to include TM events in reference networks involves all events mentioned in sentences for which the algorithm was not triggered by descriptions of abiotic stress. However, when describing events occurring during abiotic stress, articles tend to mention the abiotic stress first and then define, in the next sentences, all events present during abiotic stress.
We therefore cannot assume that an event occurs under normal conditions when no type of abiotic stress is described in its sentence.

Many sentences that are picked up by the stress trigger algorithm have a similar structure, e.g. GsAPK is up-regulated when salt stress occurs, meaning that the entity salt stress activates GsAPK. Since BANNER only identifies GGPs, however, the output will be that an unknown component up-regulates GsAPK. To visualize this knowledge in a network, genes or nodes were assigned colors when their regulation was known but their interaction partner was not (Figure 18).

Figure 18. Visualization of events which do not have a cause. Sentences are searched for events by EVEX (Image A). If those events occur during a type of abiotic stress, the stress trigger algorithm stores these events as source-gene, type of interaction, target-gene according to their stress condition (Image B). When there is no source-gene present, visualization in a network is still possible by coloring the nodes (Image C). In that way no information is lost when generating a network from the algorithm output. In this example, nodes or genes without interaction partners are colored green when they are up-regulated under their respective stress conditions. Green arrows depict a directed positive regulation between two genes, red lines represent a directed negative regulation between interacting genes and blue lines indicate a symmetrical relation: binding.

3.3 CORNET

Experimental co-expression data was generated using the CORNET web-based database. We created 27 user-defined groups of microarray experiments with differentially expressed genes, involving 14 types of abiotic stress (Table 7). The expression profiles of the genes listed in those user-defined groups were then compared to the expression profiles of the whole A. thaliana genome.
In that way, a set of co-expressing genes was generated for each type of abiotic stress based on the Pearson's correlation coefficient. Since the co-expression data were derived from both control experiments and experimental set-ups (meaning all data were experimentally verified and not computationally predicted), each set could be divided into a set of genes with similar expression profiles under stress conditions and a set of co-expressing genes under normal control conditions. Due to technical limitations of the CORNET database, only the first 50,000 genes with the highest Pearson's correlation coefficients could be retrieved for each type of abiotic stress, for both control experiments and experimental set-ups. A general threshold for each user-defined group could thus not be implemented. To overcome this problem, a threshold was set when creating networks in a later step. It is important to note that expression profiles of genes with a Pearson's correlation coefficient of 1.0 were omitted from the list of co-expressing genes. We assumed that a Pearson's correlation coefficient of 1.0 meant that those expression profiles were identical and thus involved the same genes.

Table 7. Overview of the collected co-expression data from CORNET. A total of 27 user-defined groups of microarray experiments with differentially expressed genes were created for 14 types of abiotic stress corresponding with the ontology terms from the PESO. Each user-defined group included a list of differentially expressed genes under a certain abiotic stress condition. The expression profiles of those genes were compared to the expression profiles of the whole A. thaliana genome according to the Pearson's correlation coefficient. For each user-defined set, we listed the ontology IDs and the parent terms of the PESO. Both experimental set-ups and control experiments were considered to collect co-expression data, and for each of the set-ups the number of experiments and the lowest Pearson's correlation coefficient are depicted. Note that no co-expression data could be generated from experiments involving ethylene stress, since the CORNET database returned an error each time this was attempted.

3.4 Differential networks

Differential networks depict the rewiring of interactions across different conditions. In order to create a differential network with the Diffany tool, at least two input networks are required.
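To make the idea of comparing two input networks concrete, the Python sketch below builds two co-expression edge sets with a correlation threshold and reports the interactions that differ between them. It is a deliberately naive illustration and not a description of how Diffany works internally: the function names, the gene labels and the threshold of 0.8 are hypothetical, and a plain set difference is used in place of Diffany's own comparison logic.

```python
def coexpression_edges(pairs, threshold):
    """Build an undirected edge set from (gene_a, gene_b, pearson_r) triples."""
    edges = set()
    for gene_a, gene_b, r in pairs:
        if r >= 1.0:
            # Identical profiles were assumed to be the same gene and omitted.
            continue
        if r >= threshold:
            edges.add(frozenset((gene_a, gene_b)))
    return edges


def differential(reference, condition):
    """Return the interactions that differ between the two input networks."""
    return {
        "gained": condition - reference,  # present only under stress
        "lost": reference - condition,    # present only under normal conditions
    }


# Invented example data; 0.8 is an arbitrary illustrative threshold.
reference = coexpression_edges(
    [("geneA", "geneB", 0.90), ("geneB", "geneC", 0.95), ("geneA", "geneA2", 1.0)], 0.8
)
condition = coexpression_edges(
    [("geneA", "geneB", 0.92), ("geneC", "geneD", 0.85), ("geneB", "geneC", 0.50)], 0.8
)
rewired = differential(reference, condition)
```

In this toy example the geneA and geneB pair is present in both networks and therefore does not appear in the result, which mirrors the idea that a differential network only shows changed interactions.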
A reference network, i.e. interactions between genes and gene products under normal conditions, and one or more condition-dependent networks, i.e. interactions occurring during different types of stress, are analyzed to form one or more differential networks depending on the chosen comparison method (Section 3.4.1). Since our goal was to observe the general behavior of differential networks with different input networks and settings, we chose two types of stress to compare as a case study: drought stress and ABA stress.

Note that the reference and condition-dependent networks did not form one single cluster of interactions. Instead, there was a main network holding the majority of the relations, and many small networks with four nodes or fewer. This was because we only included a small portion of the co-expression data due to the computational limits of Cytoscape and Diffany. When all expression data was visualized together with the TM data, we could observe one big hairball and a couple of very small networks (Figure 19).

Figure 19: Since calculations were done based on a subset of the data, the ABA reference network consists of different clusters. This is also the case for the ABA stress-dependent network and the drought input networks (not shown).

3.4.1 One-against-one versus one-against-all comparisons

In this section we present the results of both one-against-one and one-against-all comparisons in differential networking. As stated in Section 1.4.2, Diffany can compute both types of comparisons. We only used co-expression data from the CORNET database to review the differential networks. In Section 3.4.2, we look further into the integration of both co-expression and TM data.

The one-against-one comparison

For this method, Diffany requires one reference network and one condition-dependent network for comparison. The reference network included all interactions (co-expression) between genes under normal conditions. The condition-dependent network included co-expressing genes under drought stress. Note that we retrieved 50,000 interactions from the CORNET database for each of the types of stress in the PESO. However, Diffany had some computational limits, so we had to limit our reference and drought stress-dependent networks. The drought stress-dependent network included 336 genes and 966 interactions between them. The reference network comprised 248 genes and 981 interactions.

The output of Diffany includes a differential network, showing the rewired interactions, and an overlap network, which comprises all interactions between genes that are present in both the reference network and the drought stress-dependent network (Figure 20). In the one-against-one comparison, Diffany looks at each interacting pair of genes in the reference network and determines whether this pair is also present in the drought stress-dependent network. If this is the case, the interacting gene pair becomes part of the overlap network. When a gene pair is found to be different in the drought stress-dependent network, it becomes part of the differential network. Diffany labels each interaction in the differential network as either a decreased or an increased relation. A decrease occurs when the interaction is present in the reference network and is weaker or absent in the condition-dependent network. The opposite is true for an increase, where the interaction is present in the condition-dependent network and is weaker or absent in the reference network (Figure 21).

Figure 20: Representation of the drought differential and overlap network with the one-against-one method.

Figure 21: Example of the construction of the drought differential network. In this example the relation between HK5 and CYP705A25 disappears when drought stress is present. In the differential network there is thus a decrease in co-expression between those two genes. The opposite is true for the interaction between CYP705A4 and AT5G35940: the interaction appears when drought stress occurs and is thus visible as an increase in co-expression in the drought differential network. Since we only included one type of interaction in the input networks, co-expression, the differential network only calculates an increase or decrease in co-expression.

Because we included only one type of interaction (co-expression) in the input networks, only two possible changes can be observed in the differential network: the co-expression relation either increases or decreases (Figure 21).
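The one-against-one comparison described above can be sketched with plain set operations. This is a minimal illustration for unweighted, undirected co-expression edges and not Diffany’s actual implementation: the function name `one_against_one`, the labels "increase" and "decrease", and the shared edge between the placeholder genes GENE_X and GENE_Y are assumptions made for this example. The other two gene pairs are the ones from Figure 21.

```python
def one_against_one(reference_edges, condition_edges):
    """Compare two unweighted, undirected networks given as lists of gene pairs."""
    # frozenset makes (A, B) and (B, A) the same undirected edge.
    reference = {frozenset(pair) for pair in reference_edges}
    condition = {frozenset(pair) for pair in condition_edges}

    # Edges present in both networks form the overlap network.
    overlap = reference & condition

    # Edges present in only one of the two networks are rewired.
    differential = {}
    for edge in reference - condition:
        differential[edge] = "decrease"  # relation lost under stress
    for edge in condition - reference:
        differential[edge] = "increase"  # relation gained under stress
    return overlap, differential


# Gene pairs from Figure 21 plus one hypothetical shared edge.
reference = [("HK5", "CYP705A25"), ("GENE_X", "GENE_Y")]
drought = [("CYP705A4", "AT5G35940"), ("GENE_Y", "GENE_X")]

overlap, differential = one_against_one(reference, drought)
```

With this input, the HK5 and CYP705A25 pair ends up as a decrease, the CYP705A4 and AT5G35940 pair as an increase, and the placeholder edge lands in the overlap. Weighted edges, where a relation can be both overlapping and changed, would need a comparison of weights rather than of mere presence.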
In Section 3.4.2, we show the results when multiple interaction types are used in the input networks. In these situations, there are more possibilities regarding the decrease or increase of the interaction types.

We looked further into the drought differential network and could identify the HPR gene (AT1G68010), which is active during drought stress according to GO. We identified all interaction partners of HPR in the differential network and searched for gene descriptions (Figure 22 and Table 8). None of the genes connected with HPR is known to be involved in responses to drought stress. However, SHM1 is involved in responses to salt and high light stress. All genes whose co-expression with HPR increased during drought stress are involved in the energy pathway. A possible explanation could be that the cell requires more energy for defense and stress tolerance mechanisms to deal with the threat of drought stress. However, there is a decrease in co-expression with SBPASE and SHM1, genes active in the Calvin cycle (which converts the sun’s energy into a storable form, i.e. glucose) and in the photorespiratory pathway, respectively. If the plant needed more energy to cope with drought stress, those genes would be expected to increase in co-expression with HPR as well. The other two genes whose co-expression decreases during drought stress, CRB and FBA2, are responsible for proper function of the chloroplast and for the response to ABA, respectively.

Overall, the drought stress-dependent network (336 nodes and 966 edges), the reference network (248 nodes and 981 edges), and the differential network (400 nodes and 1629 edges) all showed a dense network with highly interacting nodes. The overlap network was smaller and sparser. There were also separate clusters, of which the biggest (39 nodes and 86 edges) included interactions relevant to the HPR environment.
Figure 22: All interacting partners of HPR in the drought differential network.

Table 8: List of HPR interacting genes in the drought differential network. Function descriptions followed by ‘*’ were found in GO; other function descriptions were taken from NCBI.

The one-against-all comparison

In this approach, more than one condition-dependent network is compared to the reference network. Diffany offers two ways to do this: the pairwise comparison and the one-to-all method. When the pairwise method is applied, Diffany compares each condition-dependent network on its own to the reference network as described previously. When three condition-dependent networks are compared to one reference network with this method, the output includes three differential networks and three overlap networks, one for each comparison. When the one-to-all approach is used, one differential network is generated, comprising the interactions that changed in all three comparisons between a condition-dependent network and the reference network. This method encounters an extra challenge when certain interactions are found in only one of the condition-dependent networks. To cope with this challenge, Diffany includes such interactions in the differential network only when they change in the same way in all condition-dependent networks relative to the reference network (Figure 9). Note that an interaction can be part of both the overlap network and the differential network. This occurs when weights are assigned to the edges in the input networks or when different interaction types are used (Section 3.4.2 and Figure 23).

Figure 23: The construction of a differential network from three input networks. The A-B relation represents co-expression between nodes A and B. In all three input networks the A-B interaction is present, and it is thus also part of the overlap network. This relation is also weighted and is weaker in both condition-dependent networks than in the reference network. Therefore, the interaction changes negatively and is also part of the differential network. Note that the A-C interaction is also present in all three networks, but changes differently in the two condition-dependent networks. Therefore, the interaction is only present in the overlap network. Figure taken from Van Landeghem et al.
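The consensus rule of the one-to-all method can be sketched as follows. This is an illustrative reading of the description above, not Diffany’s code: the names `edge` and `one_against_all`, the convention that a missing edge has weight 0, the restriction to non-negative weights, the choice of the minimum weight for the overlap, and all numeric weights are assumptions of this sketch. The example mirrors the A-B and A-C relations of Figure 23.

```python
def edge(a, b):
    """Canonical key for an undirected edge."""
    return tuple(sorted((a, b)))


def one_against_all(reference, conditions):
    """reference: {edge: weight}; conditions: non-empty list of {edge: weight}.

    A missing edge is treated as weight 0; weights are assumed non-negative.
    """
    if not conditions:
        raise ValueError("at least one condition-dependent network is required")

    all_edges = set(reference)
    for network in conditions:
        all_edges.update(network)

    overlap, differential = {}, {}
    for e in all_edges:
        ref_w = reference.get(e, 0.0)
        cond_ws = [network.get(e, 0.0) for network in conditions]

        # Overlap: the edge exists in the reference and in every condition.
        # Using the minimum weight is an assumption of this sketch.
        if ref_w > 0 and all(w > 0 for w in cond_ws):
            overlap[e] = min([ref_w] + cond_ws)

        # Differential: only when all conditions change in the same direction.
        if all(w > ref_w for w in cond_ws):
            differential[e] = "increase"
        elif all(w < ref_w for w in cond_ws):
            differential[e] = "decrease"
    return overlap, differential


# Hypothetical weights: A-B weakens in both conditions, A-C changes inconsistently.
reference = {edge("A", "B"): 0.9, edge("A", "C"): 0.7}
condition_1 = {edge("A", "B"): 0.6, edge("A", "C"): 0.9}
condition_2 = {edge("A", "B"): 0.7, edge("A", "C"): 0.5}

overlap, differential = one_against_all(reference, [condition_1, condition_2])
```

Here both edges appear in the overlap, but only A-B enters the differential network (as a decrease), because A-C strengthens in one condition and weakens in the other.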
To illustrate the one-against-all comparison, we compared condition-dependent networks including interactions occurring under ABA stress, drought stress, and salt stress in a case study. Note that the ABA, drought, and salt stress condition-dependent networks based on CORNET data each have their own reference network. This is a consequence of the data retrieval step from the CORNET database (Section 3.3). In order to create a general reference network for all three condition-dependent networks, Diffany calculated the overlap between the three reference networks. This general reference network was then compared to the ABA, drought, and salt stress-dependent networks in order to create one differential network (Figure 24). However, no reference overlap was found when the three conditions (drought, salt, and ABA stress) were compared. In the end, all interactions in the condition-dependent networks were increased in co-expression, since they did not appear in the overlap reference network. This is not a problem, because Diffany can handle reference and condition-dependent networks that do not overlap (Figure 25).

Figure 24: Differential network constructed by the one-against-all method. Since there is no overlap network, all interactions increased in co-expression.

Figure 25: Example of a differential network in Diffany when there is no overlap network. Image A depicts the input networks: three condition-dependent networks and one reference network, i.e. normal conditions. Image B shows how Diffany calculates the overlapping interactions in all input networks and tries to form an overlap network. Here only three nodes overlap in some networks, but there are no overlapping relations. This means that all interactions are regarded as differential and are represented in the differential network (Image C). Decreased interactions are shown in red and increased interactions are shown in green. Note that interactions from the reference network disappeared in the condition-dependent networks and are thus shown as decreased relations. Interactions from the three condition-dependent networks are all present in the differential network as increased relations, since they were not present in the reference network.

We focused again on HPR and its environment in order to assess the differential network. Now, we could see the rewiring of the HPR cluster when three stresses are present (Figure 26). For instance, the HPR-GAPB interaction is present in the reference network, but not in the three condition-dependent networks. Therefore, the interaction is depicted as a decrease in co-expression in the differential network, which means that the connection is broken after induction of stress. The connection with GAPA, the other HPR-interacting node in the reference network, is present only in the drought stress-dependent network, suggesting that this interaction might be specific to drought stress rather than to abiotic stress in general.
However, as stated in Section 1.3.1 and Table 1, GAPA-2 is an allele of the GAPA gene, which could mean that the GAPA genes are drought and salt stress specific. Overall, the HPR cluster is present, with at least three interaction partners, in all three condition-dependent networks, suggesting that genes involved in this cluster are important in responses to stress. Moreover, most of the HPR interaction partners are involved in the electron transport system and are responsible for the energy levels in the plant cell.

Figure 26: Depiction of the rewiring of HPR and its relations when ABA, drought, and salt stress are present. All relations in the differential network represent a decrease in co-expression (red). Note that the reference network is larger than depicted; only interactions relevant to HPR are shown here. Co-expression relations are shown in purple.

3.4.2 The contribution of TM data to differential networks

We created a stress trigger algorithm that listed events in EVEX under specific terms of abiotic stress. As such, TM data could only be used in the condition-dependent networks to form differential networks. In order to visualize the TM interactions together with the co-expression data, we had to map official gene names onto the TAIR IDs used in the CORNET database. This involved an extra challenge. BANNER, the GGP identification tool in EVEX, differentiates between variants of the same GGP. For instance, GsAPK genes, GsAPK gene transcripts, and the GsAPK gene are all considered to be different GGPs. This complicates representation in a network (Figure 27A). When those variants can be categorized under one common denominator such as a gene family, the network becomes clearer (Figure 27B). Furthermore, integration with other data types becomes more accurate. However, EVEX recognizes those variants and assigns them to an official gene or gene product symbol.
Lists with the official symbol and its variants were thus employed to resolve this ambiguity and construct denser and more meaningful condition-dependent networks (Section 5.2.3).

Figure 27. When variants of the same gene can be classified under a common denominator, the visual representation of their interactions becomes clearer. Figure A depicts a network without a common denominator for various descriptions of gene A. Figure B shows a network in which all variants of gene A are assembled, resulting in an uncluttered network.

General facts regarding the input networks

In order to assess the TM data in differential networks, we conducted a case study based on responses to ABA stress. The TM input consisted of two parts: on the one hand, full events including two or more interacting genes and their relations, and on the other hand, single genes without interaction partners for which the up- or down-regulation is known (Figure 28 and Figure 18). These single genes help complete the condition-dependent network, but cannot contribute to the construction of differential networks, since these networks are based on interacting genes and gene products.

First we looked at the merging and the distribution of TM data with co-expression data in the ABA condition-dependent network (Figure 29). Overall, there are very few TM interactions compared to co-expression relations. This is because many of the identified TM events consisted of one gene or gene product and its regulation, without mentioning its specific activator or repressor (Figure 28 and Figure 18). An example of four nodes including both TM and co-expression relations is shown in Table 9. When looking further into the relations represented by TM, we could conclude that there are very few relations confirmed by both data resources.
However, in the ABA stress-dependent network, TM interactions help link separate clusters of co-expression data, which illustrates that knowledge is enriched when different resources are integrated (Figure 29).

Figure 28: The ABA stress-dependent network with all the TM data. Genes with unknown activators or repressors are shown as independent nodes in green and red, respectively.

Figure 29: Representation of the ABA condition-dependent network with 50,000 co-expression interactions and 119 TM relations. Co-expression relations are green and TM interactions are purple. It can be seen that TM relations, while being underrepresented, still help link two separate clusters of genes, demonstrating the potential of data integration with TM data as a resource.

Table 9: Degree of four nodes with both TM and co-expression data.

Differential networks with multiple types of interactions

As briefly mentioned in Section 1.4.2, Diffany employs an ontology to model different interaction types (Figure 8). When interactions in the reference and condition-dependent network have the same parent term, they can be compared to each other. Moreover, Diffany can ‘subtract’ relations in the reference network from interactions in the condition-dependent network when those interactions have the same parent term in the ontology. For instance, relations categorized under positive and negative regulation can be compared to each other, but interactions labeled as PPI cannot be compared to positive or negative regulation (Figure 30). The differential network can have relations with various weights. The change in an interaction is smaller when a positive regulation appears in the condition-dependent network than when a negative regulation in the reference network rewires to a positive regulation in the condition-dependent network (Figure 31). Since TM interactions were not present in the reference network, no subtraction of negative or positive regulation could be performed. As such, the ABA differential network did not show differences in the size of the changes: all relations were assumed to change by the same amount. One possibility to include weights in the differential network is to assign weights to co-expression interactions based on the Pearson’s correlation coefficients.

Figure 30: Example of a differential network with multiple types of interactions. Interactions of type binding and regulation are classified under different terms in the Diffany ontology. Therefore, they cannot be subtracted from each other, as can be seen in the relations between hypothetical genes A and C, and B and C. The interaction between C and B (binding) in the reference network disappears when the condition occurs, and a positive regulation appears in the condition-dependent network. This means that the differential network has two types of interactions between nodes C and B: on the one hand a decrease in binding (decrease_ppi, red) and on the other hand an increase in regulation (increases_regulation, green). Note that symmetrical interactions such as binding have no direction in the differential network.
Directed interactions, such as positive and negative regulation, are depicted with an arrow to show the direction.

Figure 31: Relations with different weights are present in the differential network. When the reference network is subtracted from the condition-dependent network, a bigger change occurs between nodes A and B than between nodes A and C. The A-B relation rewires from a negative regulation to a positive regulation, while the A-C relation rewires from no interaction to a positive regulation. Both relations involve a positive rewiring, but the A-B transition is bigger.

We performed a case study to construct a differential network showing the rewiring of the ABA stress responses of plants. The condition-dependent network included all available TM interactions listed by the stress trigger algorithm from 8,002 full articles. A total of 200 co-expression relations with the highest Pearson’s correlation coefficients were also added to the ABA stress-dependent network. The reference network included only the co-expression relations. We found only one cluster in the condition-dependent input network that had both TM and co-expression relations, indicating that there is very little overlap between the two data sources (Figure 32). Other clusters included either TM or co-expression data. Furthermore, Diffany found a very small overlap network (three nodes and two interactions), which suggests a nearly complete rewiring of the relations involved in the response to ABA stress (Figure 32). We could hypothesize that when ABA stress occurs, different signalling pathways are switched off or switched on. However, we worked with a small sample of the available co-expression data due to the computational limits of Diffany. Important interactions connecting separate clusters, and linking TM data with co-expression data, could still exist.

Figure 32: One cluster in the ABA stress-dependent network had both a TM interaction and multiple co-expression interactions. The relation between GA and LEA was predicted by TM and the stress trigger algorithm, and represents gene expression of LEA induced by GA (depicted as a green line in the ABA stress-dependent network). Other relations represent co-expression and are purple. Note that there were no overlapping interactions for this cluster.

In Figure 33 we present the relevant cluster of interactions around the small overlap network. All three nodes, AT5G09480, AT2G33850, and AT3G25930, are present in the differential network, but with a different wiring than in the input networks. Furthermore, the interactions between those three nodes all represent a decrease in co-expression, suggesting that this network falls apart. Apart from that, there is an increase in co-expression among the tightly interacting genes AT5G36900, AT2G14430, AT3G62040, AT5G35260, MLP3228, and AIR1, which could suggest the activation of a response pathway to deal with the presence of drought stress.

Figure 33: Cluster of co-expression relations in which the overlap network was found.

PART 4: Discussion

In this research, we looked into the formation of differential networks uncovering abiotic stress-specific interactions between A. thaliana genes and gene products. To accomplish this, a new ontology describing different abiotic stress responses was created, which allowed us to integrate two data sources. The terms and structure of this ontology were based on various existing ontologies, among which GO and EO. The two data sources were text-mining data from the EVEX database and co-expression data from the CORNET database. To construct the differential networks, we compared a reference network to three different condition-dependent networks. The reference network included all interactions between genes and gene products under normal conditions and was composed solely of co-expression data. The condition-dependent networks were formed from both co-expression data and TM data, and included all interactions between genes and gene products under different stress conditions such as drought stress or wounding stress.

4.1 The PESO

The newly created PES ontology gives a good overview of the different plant stress responses, both environmental and chemical (Figure 15). Moreover, it allows us to categorize genes, gene products, and their interactions from different data sources. However, we only created an ontology to model different types of abiotic stress responses. Other concepts such as location, duration, and concentration were not taken into account. Without those concepts, a lot of information is lost when data is categorized in this ontology: all classified genes, gene products, and their relations are currently considered to occur at the same place and time. For future work, we recommend constructing separate ontologies to model independent concepts such as location, duration, and concentration of abiotic stress. Those four ontologies could then be connected and capture all available information about an interaction between genes and gene products (Table 4).

Most TM data (published articles) contain a Materials & Methods section describing in detail how experiments were conducted. Many papers on abiotic stress describe in detail how the stresses were administered. Using three extra ontologies (location, duration, and concentration) could help reveal additional information about the type of abiotic stress that was applied.

The metadata of the experiments listed in the CORNET database were partly annotated using the EO. Terms of this ontology were also employed in the PESO, which made it convenient to categorize CORNET’s co-expression data in the PESO.
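Categorizing annotated experiments in a strict is-a hierarchy, as done with the PESO, amounts to walking from a term up to its parent terms. The sketch below shows that lookup under loud assumptions: the term names, their parent links, and the experiment identifiers are hypothetical placeholders and do not reproduce the actual PESO or CORNET metadata.

```python
# Hypothetical is-a links (hyponym -> hypernym); not the actual PESO terms.
IS_A = {
    "drought stress": "environmental stress",
    "salt stress": "environmental stress",
    "ABA stress": "chemical stress",
    "environmental stress": "abiotic stress",
    "chemical stress": "abiotic stress",
}


def ancestors(term):
    """Return all hypernyms of a term, nearest first."""
    chain = []
    while term in IS_A:
        term = IS_A[term]
        chain.append(term)
    return chain


def categorize(annotations, category):
    """Select items whose term equals the category or descends from it."""
    return sorted(
        item
        for item, term in annotations.items()
        if term == category or category in ancestors(term)
    )


# Hypothetical experiment identifiers, each annotated with one term.
experiments = {
    "exp_01": "drought stress",
    "exp_02": "ABA stress",
    "exp_03": "salt stress",
}

environmental = categorize(experiments, "environmental stress")
```

Because every term has at most one parent in this sketch, the walk always terminates at the root. Separate ontologies for location, duration, and concentration could be queried the same way and combined per experiment.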
4.2 Text-mining data

4.2.1 Evaluation of the stress trigger algorithm

We developed a stress trigger algorithm which analyzes sentences word by word to reveal events occurring during various types of abiotic stress. The algorithm was evaluated on two levels, using the F-score (Table 6). On the first level, we assessed whether the algorithm could pick up the occurrence of stress; on the second level, we assessed whether it could identify the specific type of stress. The precision, recall, and F-score do not differ much between the two levels, meaning that when the presence of stress is noticed in a sentence, the algorithm usually identifies the right type of abiotic stress. In general, the algorithm has a very high recall, which means that a mention of abiotic stress in a sentence is almost always picked up. However, the precision of the algorithm is lower, meaning that there are some false positive results: events that do not occur during abiotic stress, or events that are categorized under the wrong type of abiotic stress. It is thus recommended that the results are manually curated afterwards. We want to stress that sentences can be wrongly interpreted. When false events are picked up by EVEX, the stress trigger algorithm is not capable of correcting these events. It is thus important that TM results are reviewed critically.

4.2.2 Future goals

In this research, we configured the stress trigger algorithm to analyze word by word and to process one sentence at a time. Many papers first name the type of stress and then describe, in the following sentences, a couple of interactions occurring during that type of stress without mentioning it again. Processing paragraph by paragraph could therefore yield more results: many more interactions would be picked up that now seem irrelevant according to the algorithm. This would improve the recall of the algorithm.
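The difference between sentence-level and paragraph-level processing can be sketched as follows. This is not the stress trigger algorithm itself: the trigger word list, the naive sentence splitter, and the names `find_stress` and `tag_sentences` are assumptions made for illustration, and the example paragraph is invented.

```python
import re

# Hypothetical trigger words mapped to stress types; the real lists differ.
STRESS_TRIGGERS = {
    "drought": "drought stress",
    "salt": "salt stress",
    "salinity": "salt stress",
    "aba": "ABA stress",
}


def find_stress(sentence):
    """Scan a sentence word by word and return the first stress type found."""
    for word in re.findall(r"[A-Za-z]+", sentence.lower()):
        if word in STRESS_TRIGGERS:
            return STRESS_TRIGGERS[word]
    return None


def tag_sentences(paragraph, carry_over=False):
    """Pair each sentence with a stress type.

    With carry_over=True, the most recently mentioned stress type is kept
    for later sentences in the same paragraph that do not name a stress.
    """
    # Naive splitter: breaks after '.', '!' or '?' followed by whitespace,
    # so abbreviations such as "A. thaliana" would be split incorrectly.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]
    tagged, current = [], None
    for sentence in sentences:
        found = find_stress(sentence)
        if found is not None:
            current = found
        tagged.append((sentence, current if carry_over else found))
    return tagged


paragraph = "Plants were exposed to drought. Gene A induced the expression of gene B."
per_sentence = tag_sentences(paragraph)
per_paragraph = tag_sentences(paragraph, carry_over=True)
```

In the sentence-level run the second sentence receives no stress label, whereas the paragraph-level run assigns it to drought stress. Carrying context forward may also attach a label to sentences that are unrelated to the stress, so the gain in recall could come at some cost in precision.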
4.3 Co-expression data

We collected co-expression data from CORNET by creating lists including experimental set-ups and control experiments for 14 types of abiotic stress, which were mapped to terms in the PES ontology. CORNET assumed that genes with expression profiles similar to those of differentially expressed genes under stress conditions were also involved in the respective abiotic stress response. No expression profiles were available for 12,210 genes in the A. thaliana genome (based on the TAIR database), and these genes were thus excluded from the research. However, some of these genes could be involved in the abiotic stress response. To counteract this problem, other databases or data resources could be considered for data integration with co-expression data and TM data. For instance, the ATTED-II database (Obayashi et al, 2014) and the CressExpress database (Srinivasasainagendra et al, 2008) could be employed for additional co-expression data.

As previously mentioned (see Section 3.3), we could not create lists including all available co-expression data due to computational limits of CORNET. The program was only able to retrieve the first 50,000 genes with the highest Pearson’s correlation coefficients. As such, the differential networks based on co-expression data were not depicted to their full potential. For future work, processing the raw data could help avoid the limitations of the web-based application.

4.4 Network representation

Using Diffany, a novel Cytoscape plugin, we created differential networks to observe the rewiring of interactions when a type of abiotic stress occurred. Diffany requires two input networks: a reference network, including all interactions present under normal conditions, and a condition-dependent network, comprising all interactions occurring when a type of abiotic stress arose. Based on these two input networks, Diffany calculated a differential network. We focused on two methods Diffany offers to construct differential networks, the one-against-one and the one-against-all approach (Section 3.4.1), using three types of abiotic stress: drought, ABA, and salt stress. Then, we evaluated whether TM could contribute to the formation of differential networks.

Overall, Diffany ran into computational limits when calculating networks of more than 50,000 genes. Therefore, we were forced to construct differential networks with a small subset of the co-expression interactions. We could still work with all available TM data. In order to lower the computing time, small clusters can be omitted from the input networks, allowing Diffany to focus on the big clusters with more information.

4.4.1 One-against-one differential networks

The one-against-one method compared one type of abiotic stress to a control situation. The drought differential network was particularly promising. Both input networks were dense, which resulted in a compact differential network and a separate cluster.
Using the GO category cellular response to drought deprivation, we identified the HPR gene (AT1G68010) in the differential network. Looking at the known interaction partners of HPR, we found 11 genes that were not previously known to be involved in the response to drought stress (Table 8). Moreover, differential networks show how the relations between HPR and its interaction partners change when drought stress occurs. In this case study, all relations were formed by co-expression. These kinds of networks can guide and challenge scientists in setting up new experiments involving genes identified in differential networks.

4.4.2 One-against-all differential networks

We applied the one-against-all method to construct a differential network including the interactions that changed when drought, ABA, and salt stress were present. This network included one hairball of interactions, suggesting a tight connection between responses to ABA, salt, and drought stress (Figure 24). Once again, we zoomed in on the HPR gene and its one interaction partner, GAPB. With the three condition-dependent networks and the one reference network as input, we observed the rewiring of the interactions around HPR (Figure 26). Overall, we can conclude that the one-against-all approach helps reveal concealed interactions that are shared among different types of abiotic stress.

4.4.3 The contribution of TM data to differential networks

We looked at the construction of differential networks including different types of data sources (TM and co-expression data) and different types of relations, e.g. positive and negative interactions, gene expression, and co-expression. Overall, we can conclude that the integration of TM data with co-expression data is limited. Only a few nodes in the condition-dependent network included both co-expression and TM relations.
Often, we could observe separate clusters of interactions including only TM data or only co-expression data. However, we worked with small samples of data due to the computational limits of Diffany. This means that the contribution of TM data to condition-dependent and differential networks could be more extensive. Indeed, when analyzing all 50,000 co-expression interactions and all 119 TM relations in the ABA stress-dependent network, it is clear that TM relations can help link the co-expression data (Figure 29). For future work, it would be promising to calculate differential networks based on all co-expression and TM data. To include more TM relations, the stress trigger algorithm should be adapted in order to obtain higher recall values, generating more accurate TM interactions.

4.5 Conclusion

We wanted to model abiotic stress conditions using a structural framework partly based on existing ontologies. This ontology allowed categorization of genes and gene products under various types of abiotic stress. Furthermore, it formed a basis to integrate two different data sources: TM and co-expression data. However, the ontology failed to model other independent concepts such as duration of stress, concentration of stress administration, and location of stress. For future work, different ontologies should be created, each modeling an independent concept of abiotic stress. Combined, these ontologies could store all relevant data of the respective abiotic stress.

Co-expression data was retrieved from the CORNET database and was ready for use in network visualization. To integrate the TM data, we constructed a stress trigger algorithm that identifies interactions between genes and gene products involved in different types of stress. This algorithm has a high recall, meaning that few interactions occurring during stress are missed. However, the precision was lower, which calls for a manual review of the results. In general, TM data is computationally predicted and should be handled accordingly.

In order to visualize the rewiring of interactions when abiotic stress appears, TM interactions and co-expression relations were used to construct differential networks with the Diffany tool. Diffany uses both a reference network and a stress-dependent network to construct differential networks.
Diffany allows the comparison of either one type of abiotic stress (or, in general, one type of condition) or multiple types of abiotic stress to a reference network. We evaluated both methods and concluded that they can reveal concealed interactions that present themselves during stress. The one-against-one method shows the condition-specific rewiring (stress versus normal), while the one-against-all method identifies common regulation pathways over multiple conditions or stresses.

When we evaluated differential networks enriched with TM data, we noticed a rather limited merging of co-expression and TM data. Clusters in the differential network consisted entirely of either TM interactions or co-expression relations. For a better merging of the two data sources, larger datasets will have to be used. Generally speaking, the use of differential networks to reveal the rewiring of abiotic stress response pathways shows promising results and should be explored further.

PART 5: Materials and Methods

5.1 Construction of the PES ontology

The PESO was based on four existing ontologies: GO, EO, EnvO and PO. The general structure was borrowed from a Plant Stress Ontology created for a case study showing the benefits of combining data integration with text mining (Hassani-Pak et al, 2010). The ontology was constructed manually by thoroughly reviewing each of these ontologies. The final ontology contains only hyponym-hypernym relations and no mereology relations, resulting in a strict hierarchical structure. This PES ontology was used to integrate two different data sources: TM data from the EVEX database and co-expression data from the CORNET database.

5.2 Data processing

In this research, two data sources provided the data for the differential networks. We consulted the EVEX database for TM data. All articles that mentioned A. thaliana were parsed with a stress trigger algorithm that is explained further below. Since EVEX's last update was in 2012, articles published after that year were not included in this research. Data from CORNET was retrieved by constructing user-defined sets, which held genes differentially expressed under different abiotic stress conditions. Those lists served as input for both the reference network and the condition-dependent networks, from which the differential networks were constructed.

5.2.1 Processing of TM data

To save time, we parsed the articles in parallel on a supercomputer. Roughly 460 full articles were parsed per minute, of which on average 18 contained the A. thaliana keyword and were run through the stress trigger algorithm. A total of 1,441 full articles were found to involve abiotic stress.

5.2.2 Evaluation of the stress trigger algorithm

We evaluated the stress trigger algorithm using the F-score. First, we calculated both precision and recall on two levels. The first level assessed the ability of the algorithm to pick up events related to stress conditions. The second level examined the capability of the algorithm to store the stress-related events under the right term in the PESO. The F-score was calculated using the formula

$$F_1 = \frac{2 \times R \times P}{R + P}$$

with $P$ = precision and $R$ = recall.

5.2.3 MySQL scripts for the retrieval of EVEX data

- Retrieval of all possible TM relations
  SELECT DISTINCT(type) FROM occurrence_event;
- Counting of all articles stored in EVEX
  SELECT COUNT(DISTINCT(bibliome_id))
  FROM normalization_occurrence_tax noct
  JOIN (occurrence_tax ot) ON (ot.id = noct.occurrence_tax_id);
- Counting of all articles mentioning A. thaliana
  -- 3702 is the NCBI Taxonomy ID of Arabidopsis thaliana
  SELECT COUNT(DISTINCT(bibliome_id))
  FROM normalization_occurrence_tax noct
  JOIN (occurrence_tax ot) ON (ot.id = noct.occurrence_tax_id)
  WHERE ncbitax_id = 3702;
- Example of the canonical mapping used to merge variants of the same gene description (Section 3.4.2)
  SELECT DISTINCT(string)
  FROM occurrence_ggp og
  JOIN (canonical_ggp_occmap map) ON (map.occurrence_ggp_id = og.id)
  WHERE bibliome_id = 1002996028 AND canonical_ggp_id = 3174138
  LIMIT 20;
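To make the two article counts of Section 5.2.3 concrete, the following sketch replays them against a small in-memory SQLite database. It is only an illustration under stated assumptions: the table and column names are taken from the scripts above, but the toy schema (in particular which table holds `bibliome_id` and which holds `ncbitax_id`), the sample rows, the helper `count_articles`, and the use of SQLite instead of MySQL are assumptions and do not reflect the real EVEX schema.

```python
import sqlite3

# Toy stand-in for two EVEX tables. Column placement is an assumption:
# here occurrence_tax holds the article ID and normalization_occurrence_tax
# holds the NCBI Taxonomy ID assigned to each organism mention.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE occurrence_tax (id INTEGER PRIMARY KEY, bibliome_id INTEGER);
    CREATE TABLE normalization_occurrence_tax (
        occurrence_tax_id INTEGER, ncbitax_id INTEGER);
""")

# Hypothetical sample data: (mention ID, article ID)
conn.executemany("INSERT INTO occurrence_tax VALUES (?, ?)",
                 [(1, 100), (2, 100), (3, 200), (4, 300)])
# (mention ID, taxon): 3702 = A. thaliana, 9606 = H. sapiens
conn.executemany("INSERT INTO normalization_occurrence_tax VALUES (?, ?)",
                 [(1, 3702), (2, 3702), (3, 9606), (4, 3702)])

QUERY = """
    SELECT COUNT(DISTINCT bibliome_id)
    FROM normalization_occurrence_tax noct
    JOIN occurrence_tax ot ON ot.id = noct.occurrence_tax_id
    {where}
"""

def count_articles(taxon_id=None):
    """Count distinct articles, optionally restricted to one NCBI taxon."""
    if taxon_id is None:
        return conn.execute(QUERY.format(where="")).fetchone()[0]
    return conn.execute(QUERY.format(where="WHERE ncbitax_id = ?"),
                        (taxon_id,)).fetchone()[0]

all_articles = count_articles()
thaliana_articles = count_articles(3702)
print(all_articles, thaliana_articles)
```

The `DISTINCT` on `bibliome_id` matters here: article 100 contains two A. thaliana mentions in the toy data but is counted only once.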
5.2.4 Processing of co-expression data

We consulted the CORNET database to create lists of differentially expressed genes for terms in the ASS ontology. The metadata of experiments stored in the CORNET database is annotated with a standardized ontology. Terms and descriptions of this ontology were borrowed from PO, EO and the MGED Ontology. This ontology was used to compile user-defined sets of experiments (Table 7). Those datasets needed to be large enough for a reliable calculation of the correlation coefficients (Usadel et al, 2009). A total of 27 user-defined datasets were compiled in the CORNET database. Each was either a group of control experiments or a group of experimental set-ups searching for differentially expressed genes. After creating various sets with experiments identifying differentially expressed genes under different abiotic stress conditions, the expression profiles of those differentially expressed genes were matched against all other available expression profiles. If the expression profile of a differentially expressed gene matched another gene's profile, the two genes were considered to be co-expressed. Co-expression was measured by Pearson's correlation coefficient. A coefficient close to 1.0 meant that the genes had very similar expression profiles and were considered co-expressed. Genes with a correlation coefficient close to 0.0 showed no co-expression, and genes with a coefficient close to -1.0 were anti-correlated.

CORNET retrieved Affymetrix GeneChip microarray data from the Gene Expression Omnibus (GEO). The microarray data were preprocessed with Robust Multi-array Average (RMA), an algorithm used to create an expression matrix from Affymetrix data. The raw intensity values are background corrected, log2 transformed and then quantile normalized. Next, a linear model is fit to the normalized data to obtain an expression measure for each probe set on each array. Due to a technical limitation, CORNET was only able to retrieve the 50,000 co-expression relations with the highest Pearson's correlation coefficients. Because of this limitation, an overall threshold was set when networks were formed in a later step of this research.

5.3 Network representation

To visualize the processed TM and co-expression data, Cytoscape was used (version 3.1.1, http://www.cytoscape.org/download.html). Cytoscape is a publicly available software platform created to visualize biomolecules, represented by nodes, and their interactions, represented by edges (Shannon et al, 2003; Cline et al, 2007). The program can perform heavy network calculations such as clustering, and allows the user to adjust settings such as the style of edges and nodes. Attributes of nodes or edges can be loaded, for example to rename nodes or to scale the edge width. Furthermore, many plugins can be loaded, such as Diffany (beta version 0.2), which was used in this research.

5.3.1 Generation of a reference network

A reference network includes all interactions between genes and gene products under normal conditions, i.e. in the absence of abiotic stress. In this research, the reference network was created from co-expression data only. With CORNET, we generated 13 lists of co-expressed genes under control conditions. The overlapping interactions were calculated using Diffany to form one generic reference network that could be compared against one or multiple condition-dependent networks.

5.3.2 Condition-dependent network

Both TM data and co-expression data contributed to the construction of a condition-dependent network. We created 13 condition-dependent networks, corresponding to 13 abiotic stress responses from the PESO: ABA, auxin, cold, drought, gibberellin, heat, high light, jasmonate, low light, osmotic, salicylic acid, and salt.

5.3.3 Differential network

We used the Cytoscape plugin Diffany to create differential networks. Diffany is based on an ontology that provides a framework to categorize interactions between genes and gene products. Diffany takes directed and undirected edges into account, as well as edge weights, negation and semantic interpretations of different interaction types. Due to computational limits, we were not able to generate differential networks based on all the data we compiled. With a maximum of 1,000 edges in the reference and condition-dependent networks, we created differential networks based on the one-against-one and one-against-all approaches.

References

Aranda B, Achuthan P, Alam-Faruque Y, Armean I, Bridge A, Derow C, Feuermann M, Ghanbarian AT, Kerrien S, Khadake J, Kerssemakers J, Leroy C, Menden M, Michaut M, Montecchi-Palazzi L, Neuhauser SN, Orchard S, Perreau V, Roechert B, Eijk K van, et al (2010) The IntAct molecular interaction database in 2010. Nucleic Acids Res. 38: D525–D531
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM & Sherlock G (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25: 25–29
Barkla BJ, Vera-Estrella R & Pantoja O (2013) Progress and challenges for abiotic stress proteomics of crop plants. Proteomics 13: 1801–1815
Bartels D & Sunkar R (2005) Drought and Salt Tolerance in Plants. Crit. Rev. Plant Sci. 24: 23–58
Van Bel M, Proost S, Wischnitzki E, Movahedi S, Scheerlinck C, Van de Peer Y & Vandepoele K (2012) Dissecting plant genomes with the PLAZA comparative genomics platform. Plant Physiol. 158: 590–600
Berardini TZ, Mundodi S, Reiser L, Huala E, Garcia-Hernandez M, Zhang P, Mueller LA, Yoon J, Doyle A, Lander G, Moseyko N, Yoo D, Xu I, Zoeckler B, Montoya M, Miller N, Weems D & Rhee SY (2004) Functional annotation of the Arabidopsis genome using controlled vocabularies. Plant Physiol. 135: 745–755
De Bodt S, Carvajal D, Hollunder J, Van den Cruyce J, Movahedi S & Inzé D (2010) CORNET: a user-friendly tool for data mining and integration. Plant Physiol. 152: 1167–1179
De Bodt S, Hollunder J, Nelissen H, Meulemeester N & Inzé D (2012) CORNET 2.0: integrating plant coexpression, protein–protein interactions, regulatory interactions, gene associations and functional annotations. New Phytol. 195: 707–720
Buttigieg PL, Morrison N, Smith B, Mungall CJ & Lewis SE (2013) The environment ontology: contextualising biological and biomedical entities. J. Biomed. Semant. 4: 43
Chen L, Liu H & Friedman C (2005) Gene name ambiguity of eukaryotic nomenclatures. Bioinformatics 21: 248–256
Choi Y & Kendziorski C (2009) Statistical methods for gene set co-expression analysis. Bioinformatics 25: 2780–2786
Cline MS, Smoot M, Cerami E, Kuchinsky A, Landys N, Workman C, Christmas R, Avila-Campilo I, Creech M, Gross B, Hanspers K, Isserlin R, Kelley R, Killcoyne S, Lotia S, Maere S, Morris J, Ono K, Pavlovic V, Pico AR, et al (2007) Integration of biological networks and gene expression data using Cytoscape. Nat. Protoc. 2: 2366–2382
Gene Ontology Consortium (2004) The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res. 32: D258–D261
Cooper L, Walls RL, Elser J, Gandolfo MA, Stevenson DW, Smith B, Preece J, Athreya B, Mungall CJ, Rensing S, Hiss M, Lang D, Reski R, Berardini TZ, Li D, Huala E, Schaeffer M, Menda N, Arnaud E, Shrestha R, et al (2013) The plant ontology as a tool for comparative plant anatomy and genomic analyses. Plant Cell Physiol. 54: e1
Debnath M, Pandey M & Bisen PS (2011) An omics approach to understand the plant abiotic stress. Omics J. Integr. Biol. 15: 739–762
De la Fuente A (2010) From ‘differential expression’ to ‘differential networking’ - identification of dysfunctional regulatory networks in diseases. Trends Genet. 26: 326–333
Gambardella G, Moretti MN, de Cegli R, Cardone L, Peron A & di Bernardo D (2013) Differential network analysis for the identification of condition-specific pathway activity and regulation. Bioinformatics 29: 1776–1785
Gerner M, Sarafraz F, Bergman CM & Nenadic G (2012) BioContext: an integrated text mining system for large-scale extraction and contextualization of biomolecular events. Bioinformatics 28: 2154–2161
Gill R, Datta S & Datta S (2010) A statistical framework for differential network analysis from microarray data. BMC Bioinformatics 11: 95
Gruber TR (1995) Toward principles for the design of ontologies used for knowledge sharing. Int. J. Hum.-Comput. Stud. 43: 907–928
Haibe-Kains B, Olsen C, Djebbari A, Bontempi G, Correll M, Bouton C & Quackenbush J (2012) Predictive networks: a flexible, open source, web application for integration and analysis of human gene networks. Nucleic Acids Res. 40: D866–D875
Hassani-Pak K, Legaie R, Canevet C, van den Berg HA, Moore JD & Rawlings CJ (2010) Enhancing data integration with text analysis to find proteins implicated in plant stress response. J. Integr. Bioinforma. 7
Hill DP, Blake JA, Richardson JE & Ringwald M (2002) Extension and Integration of the Gene Ontology (GO): Combining GO Vocabularies With External Vocabularies. Genome Res. 12: 1982–1991
Hirayama T & Shinozaki K (2010) Research on plant abiotic stress responses in the post-genome era: past, present and future. Plant J. 61: 1041–1052
Hoffmann R & Valencia A (2004) A gene network for navigating the literature. Nat. Genet. 36: 664
Ideker T & Krogan NJ (2012) Differential network biology. Mol. Syst. Biol. 8: 565
Ideker T, Ozier O, Schwikowski B & Siegel AF (2002) Discovering regulatory and signalling circuits in molecular interaction networks. Bioinformatics 18 Suppl 1: S233–S240
Kanehisa M & Goto S (2000) KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 28: 27–30
Karali D, Oxley D, Runions J, Ktistakis N & Farmaki T (2012) The Arabidopsis thaliana immunophilin ROF1 directly interacts with PI(3)P and PI(3,5)P2 and affects germination under osmotic stress. PLoS ONE 7: e48241
Kilian J, Whitehead D, Horak J, Wanke D, Weinl S, Batistic O, D’Angelo C, Bornberg-Bauer E, Kudla J & Harter K (2007) The AtGenExpress global stress expression data set: protocols, evaluation and model data analysis of UV-B light, drought and cold stress responses. Plant J. 50: 347–363
Kim J-D, Ohta T, Pyysalo S, Kano Y & Tsujii J (2011a) Extracting Bio-Molecular Events from Literature – the BioNLP’09 Shared Task. Comput. Intell. 27: 513–540
Kim J-D, Pyysalo S, Ohta T, Bossy R, Nguyen N & Tsujii J (2011b) Overview of BioNLP Shared Task 2011. In Proceedings of the BioNLP Shared Task 2011 Workshop pp 1–6. Stroudsburg, PA, USA: Association for Computational Linguistics. Available at: http://dl.acm.org/citation.cfm?id=2107691.2107692 [Accessed May 28, 2014]
Kostka D & Spang R (2004) Finding disease specific alterations in the co-expression of genes. Bioinformatics 20 Suppl 1: i194–i199
Lamesch P, Berardini TZ, Li D, Swarbreck D, Wilks C, Sasidharan R, Muller R, Dreher K, Alexander DL, Garcia-Hernandez M, Karthikeyan AS, Lee CH, Nelson WD, Ploetz L, Singh S, Wensel A & Huala E (2012) The Arabidopsis Information Resource (TAIR): improved gene annotation and new tools.
Nucleic Acids Res. 40: D1202–D1210
Van Landeghem S (2012) Playing hide and seek on the genomic playground: unveiling biological function from literature. PhD thesis, Ghent University
Van Landeghem S, Björne J, Wei C-H, Hakala K, Pyysalo S, Ananiadou S, Kao H-Y, Lu Z, Salakoski T, Van de Peer Y & Ginter F (2013) Large-Scale Event Extraction from Literature with Multi-Level Gene Normalization. PLoS ONE 8: e55814
Van Landeghem S, Van Parys T & Van de Peer Y (2014) Diffany: an ontology-driven framework to infer, visualise and analyse differential molecular networks. Manuscript under review
Langfelder P, Luo R, Oldham MC & Horvath S (2011) Is My Network Module Preserved and Reproducible? PLoS Comput. Biol. 7. Available at: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3024255/ [Accessed June 5, 2014]
Leaman R & Gonzalez G (2008) BANNER: an executable survey of advances in biomedical named entity recognition. Pac. Symp. Biocomput.: 652–663
Lord PW, Stevens RD, Brass A & Goble CA (2003) Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation. Bioinformatics 19: 1275–1283
Ma H, Schadt EE, Kaplan LM & Zhao H (2011) COSINE: COndition-SpecIfic sub-NEtwork identification using a global optimization method. Bioinformatics 27: 1290–1298
Ma S & Bohnert HJ (2007) Integration of Arabidopsis thaliana stress-related transcript profiles, promoter structures, and cell-specific expression. Genome Biol. 8: R49
Marbach D, Costello JC, Küffner R, Vega NM, Prill RJ, Camacho DM, Allison KR, DREAM5 Consortium, Kellis M, Collins JJ & Stolovitzky G (2012) Wisdom of crowds for robust gene network inference. Nat. Methods 9: 796–804
Natale DA, Arighi CN, Blake JA, Bult CJ, Christie KR, Cowart J, D’Eustachio P, Diehl AD, Drabkin HJ, Helfer O, Huang H, Masci AM, Ren J, Roberts NV, Ross K, Ruttenberg A, Shamovsky V, Smith B, Yerramalla MS, Zhang J, et al (2014) Protein Ontology: a controlled structured network of protein entities. Nucleic Acids Res. 42: D415–D421
Noy NF (2001) Ontology Development 101: A Guide to Creating Your First Ontology. Knowledge Systems Laboratory, Stanford University. Stanford Knowledge Systems Laboratory Technical Report KSL-01-05 and Stanford Medical Informatics Technical Report SMI-2001-0880. Available at: http://ci.nii.ac.jp/naid/10018137174/ [Accessed May 28, 2014]
Obayashi T, Okamura Y, Ito S, Tadaka S, Aoki Y, Shirota M & Kinoshita K (2014) ATTED-II in 2014: evaluation of gene coexpression in agriculturally important plants. Plant Cell Physiol. 55: e6
Plant Ontology Consortium (2002) The Plant Ontology Consortium and plant ontologies. Comp. Funct. Genomics 3: 137–142
Rasmussen S, Barah P, Suarez-Rodriguez MC, Bressendorff S, Friis P, Costantino P, Bones AM, Nielsen HB & Mundy J (2013) Transcriptome responses to combinations of stresses in Arabidopsis. Plant Physiol. 161: 1783–1794
Rose PW, Beran B, Bi C, Bluhm WF, Dimitropoulos D, Goodsell DS, Prlić A, Quesada M, Quinn GB, Westbrook JD, Young J, Yukich B, Zardecki C, Berman HM & Bourne PE (2011) The RCSB Protein Data Bank: redesigned web site and web services. Nucleic Acids Res. 39: D392–D401
Sayers EW, Barrett T, Benson DA, Bolton E, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Federhen S, Feolo M, Geer LY, Helmberg W, Kapustin Y, Landsman D, Lipman DJ, Lu Z, Madden TL, Madej T, Maglott DR, et al (2010) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 38: D5–D16
Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B & Ideker T (2003) Cytoscape: A Software Environment for Integrated Models of Biomolecular Interaction Networks. Genome Res. 13: 2498–2504
Skirycz A & Inzé D (2010) More from less: plant growth under limited water. Curr. Opin. Biotechnol. 21: 197–203
Smith B, Ashburner M, Rosse C, Bard J, Bug W, Ceusters W, Goldberg LJ, Eilbeck K, Ireland A, Mungall CJ, OBI Consortium, Leontis N, Rocca-Serra P, Ruttenberg A, Sansone S-A, Scheuermann RH, Shah N, Whetzel PL & Lewis S (2007) The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nat. Biotechnol. 25: 1251–1255
Srinivasasainagendra V, Page GP, Mehta T, Coulibaly I & Loraine AE (2008) CressExpress: a tool for large-scale mining of expression data from Arabidopsis. Plant Physiol. 147: 1004–1016
Stenetorp P, Topić G, Pyysalo S, Ohta T, Kim J-D & Tsujii J (2011) BioNLP Shared Task 2011: Supporting Resources. In Proceedings of the BioNLP Shared Task 2011 Workshop pp 112–120. Stroudsburg, PA, USA: Association for Computational Linguistics. Available at: http://dl.acm.org/citation.cfm?id=2107691.2107707 [Accessed May 28, 2014]
Töpfer N, Caldana C, Grimbs S, Willmitzer L, Fernie AR & Nikoloski Z (2013) Integration of genome-scale modeling and transcript profiling reveals metabolic pathways underlying light and temperature acclimation in Arabidopsis. Plant Cell 25: 1197–1211
Toufighi K, Brady SM, Austin R, Ly E & Provart NJ (2005) The Botany Array Resource: e-Northerns, Expression Angling, and promoter analyses. Plant J. 43: 153–163
Usadel B, Obayashi T, Mutwil M, Giorgi FM, Bassel GW, Tanimoto M, Chow A, Steinhauser D, Persson S & Provart NJ (2009) Co-expression tools for plant biology: opportunities for hypothesis generation and caveats. Plant Cell Environ. 32: 1633–1651
Wei C-H, Huang I-C, Hsu Y-Y & Kao H-Y (2009) Normalizing Biomedical Name Entities by Similarity-Based Inference Network and De-ambiguity Mining. In Ninth IEEE International Conference on Bioinformatics and BioEngineering, 2009. BIBE ’09 pp 461–466.
Whetzel PL, Parkinson H, Causton HC, Fan L, Fostel J, Fragoso G, Game L, Heiskanen M, Morrison N, Rocca-Serra P, Sansone S-A, Taylor C, White J & Stoeckert CJ (2006) The MGED Ontology: a resource for semantics-based description of microarray experiments. Bioinformatics 22: 866–873
Yang L, Ji W, Gao P, Li Y, Cai H, Bai X, Chen Q & Zhu Y (2012) GsAPK, an ABA-activated and calcium-independent SnRK2-type kinase from G. soja, mediates the regulation of plant tolerance to salinity and ABA stress. PLoS ONE 7: e33838
Yilmaz A, Mejia-Guerra MK, Kurz K, Liang X, Welch L & Grotewold E (2011) AGRIS: the Arabidopsis Gene Regulatory Information Server, an update. Nucleic Acids Res. 39: D1118–D1122

Addendum

6.1 Perl Code

Scripts implementing all algorithms used are available in a separate folder in the Minerva dropbox.

6.2 Lists of establishment and specificity of stress.

6.2.1 List including words establishing the occurrence of stress in sentences (not complete)

[sS]tress.*
[tT]reat.*
[rR]esponse.*
[iI]nduc.+
[eE]xpos.+
[sS]timula.+
[aA]ffect.*
[eE]ffect.*
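
As an illustration of how such a word list might be applied, the sketch below tests each token of a sentence against the patterns of Section 6.2.1. It is written in Python rather than Perl, and the whitespace tokenization, the whole-token matching, the helper name `stress_triggers` and the example sentences are assumptions made for this example; the original scripts may work differently.

```python
import re

# Patterns copied from the list in Section 6.2.1 (which is not complete).
TRIGGER_PATTERNS = [
    r"[sS]tress.*", r"[tT]reat.*", r"[rR]esponse.*", r"[iI]nduc.+",
    r"[eE]xpos.+", r"[sS]timula.+", r"[aA]ffect.*", r"[eE]ffect.*",
]
TRIGGERS = [re.compile(p) for p in TRIGGER_PATTERNS]

def stress_triggers(sentence):
    """Return the tokens of a sentence that fully match a trigger pattern."""
    # Assumption: tokens are split on whitespace and stripped of punctuation.
    tokens = [t.strip(".,;:()") for t in sentence.split()]
    return [t for t in tokens if any(p.fullmatch(t) for p in TRIGGERS)]

# Hypothetical example sentences.
print(stress_triggers("Gene A is induced by drought stress."))
print(stress_triggers("Gene A interacts with gene B."))
```

Whole-token matching with `fullmatch` is one reading of the trailing `.*` and `.+` in the patterns, which let a stem such as `[iI]nduc` cover inflected forms like "induced" or "induction".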
6.2.2 List including word chains establishing the occurrence of stress in sentences (not complete)

[uU]p [rR]egulat.+
[uU]p [sS]timulat.+
[dD]own [rR]egulat.+
[dD]own [sS]timulat.+
[rR]egulat.+ [uU]p.*
[rR]egulat.+ [dD]own.*
[Ss]timulat.+ [uU]p.*
[Ss]timulat.+ [dD]own.*
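
The word chains of Section 6.2.2 pair a first word with a second word, for example `[uU]p` followed by `[rR]egulat.+`. The sketch below shows how adjacent tokens could be tested against such pairs. The pairing of the patterns is an interpretation of the list, and the tokenization (hyphens separate words), the helper name `find_chains` and the example sentences are assumptions made for this illustration.

```python
import re

# A few (first word, second word) chains read from Section 6.2.2.
# The pairing of the patterns is an interpretation of the list.
CHAINS = [
    (r"[uU]p", r"[rR]egulat.+"),
    (r"[dD]own", r"[rR]egulat.+"),
    (r"[rR]egulat.+", r"[uU]p.*"),
    (r"[rR]egulat.+", r"[dD]own.*"),
]
COMPILED = [(re.compile(a), re.compile(b)) for a, b in CHAINS]

def find_chains(sentence):
    """Return adjacent token pairs that match one of the word chains."""
    # Assumption: hyphens separate words, so "up-regulated" gives two tokens.
    tokens = re.findall(r"[A-Za-z0-9]+", sentence)
    hits = []
    for first, second in zip(tokens, tokens[1:]):
        if any(a.fullmatch(first) and b.fullmatch(second) for a, b in COMPILED):
            hits.append((first, second))
    return hits

# Hypothetical example sentences.
print(find_chains("Gene A is up-regulated after salt treatment."))
print(find_chains("Gene B was regulated downwards by ABA."))
```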
6.2.3 List including words identifying the type of abiotic stress in sentences (not complete)

[aA]biot.*
[eE]nvironment.*
[cC]hemi.+
[wW]ater
[hH]2[0o]
[Ii]rrigat.*
[wW]et.*
[fF]ill.*
[wW]ound.* [iI]njur.* [hH]arm.*
[sS]tab.+
pH
[tT]emperature
[lL]ight
[Rr]ay.*
[Bb]eam.*
[rR]adiat.+ [Ss]hine
[gG]low
[gG]leam
[bB]right.*
[oO]smo.*
[nN]itro.+
[hH]ormon.+
[aA][bB][aA]
[aA]bsci.+
[dD]rought
[sS][aA]
[Ss]alt.*
[Ss]ali.+
NaCl
PEG
[Pp]olyethyleneglycol
[fF]lood.*
[cC]old.*
[hH]yperosmo.+
[mM]annitol
[sS]ucro.+
[aA]cidic
[bB]asic
[hH]eat
[wW]arm.*
[Oo]xidat.+
[Hh]yperoxia.*
[hH]ypoxia.*
[Ee]thylene [Ee]thene
C2H4
[Gg]ibberellin.* GA.*
[sS]had.+
[iI][aA][aA]
[Aa]uxin.*
[Aa][Oo][Aa]{1,2}
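
The patterns of Section 6.2.3 only become useful once a match is tied to a stress type; Section 5.2.2 describes this as storing an event under the right PESO term. The sketch below shows one way such a lookup could work. The grouping of patterns under labels such as `drought` or `salt` is hypothetical, since the actual mapping to PESO terms is not listed here, and the whole-token matching and the function name `stress_types` are likewise assumptions.

```python
import re

# Hypothetical grouping of a few patterns from Section 6.2.3 by stress type.
STRESS_TYPES = {
    "drought": [r"[dD]rought"],
    "salt": [r"[Ss]alt.*", r"[Ss]ali.+", r"NaCl"],
    "ABA": [r"[aA][bB][aA]", r"[aA]bsci.+"],
    "cold": [r"[cC]old.*"],
}
COMPILED = {label: [re.compile(p) for p in pats]
            for label, pats in STRESS_TYPES.items()}

def stress_types(sentence):
    """Return the sorted stress-type labels whose patterns match a token."""
    tokens = re.findall(r"[A-Za-z0-9]+", sentence)
    found = set()
    for label, patterns in COMPILED.items():
        if any(p.fullmatch(t) for p in patterns for t in tokens):
            found.add(label)
    return sorted(found)

# Hypothetical example sentences.
print(stress_types("Plants were treated with NaCl or abscisic acid."))
print(stress_types("Salicylic acid levels increased."))
```

Under this hypothetical grouping, the second sentence is labelled `salt` because `[Ss]ali.+` also matches "Salicylic". Broad stems of this kind favour recall over precision, which is consistent with the need for manual review noted in the conclusion.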
6.2.4 List including chains of words identifying the type of abiotic stress in sentences (not complete)

[wW]ater
[Dd]eficit.*
[Dd]eprivat.+
[Dd]eficit.*
[oO]f
[Dd]eprivat.+
[oO]f
[hH]2[0o]
[Dd]eficit.*
[Pp]olyethylene
[gG]lycol
[lL]ow
[tT]emperature
[hH]igh
[tT]emperature
[Jj]asmo.+ [aA]cid
[Aa]bsci.+ [aA]cid
[Aa]minooxy.+
[aA]cid
[Ii]ndole
[aA]cetic
[aA]cetic
[Aa]cid
[Ss]alicyl [aA]cid
[Dd]eprivat.+
[lL]ight
[lL]ight
[oO]xy.+
[oO]xy.+
[nN]itro.*
[nN]itro.*
pH
pH
6.3 Absolute values to evaluate the stress trigger algorithm

6.3.1 Test run 1

Table 10: Numbers of true and false positives, and true and false negatives used to calculate the F-score in test run 1.

6.3.2 Test run 2

Table 11: Numbers of true and false positives, and true and false negatives used to calculate the F-score in test run 2.
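
Counts of this kind feed into the F-score of Section 5.2.2. As a worked sketch, the function below derives precision, recall and the F-score from the numbers of true positives, false positives and false negatives. The function name and the counts in the usage example are made up for illustration; they are not the values from Tables 10 and 11.

```python
def f_score(tp, fp, fn):
    """Return (precision, recall, F-score) from raw counts.

    Precision P = TP / (TP + FP), recall R = TP / (TP + FN),
    F = 2 * R * P / (R + P), as in Section 5.2.2.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * recall * precision / (recall + precision)

# Hypothetical counts: high recall, lower precision.
p, r, f = f_score(tp=90, fp=60, fn=10)
print(f"P={p:.2f} R={r:.2f} F={f:.2f}")
```

True negatives do not enter the formula, so the F-score is unaffected by the number of sentences that were correctly left unflagged.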