Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Downloaded from http://rspb.royalsocietypublishing.org/ on June 18, 2017 Using phylogenetic information and chemical properties to predict species tolerances to pesticides rspb.royalsocietypublishing.org Research Cite this article: Guénard G, Carsten von der Ohe P, Carlisle Walker S, Lek S, Legendre P. 2014 Using phylogenetic information and chemical properties to predict species tolerances to pesticides. Proc. R. Soc. B 281: 20133239. http://dx.doi.org/10.1098/rspb.2013.3239 Received: 9 January 2014 Accepted: 11 June 2014 Subject Areas: ecology, evolution, computational biology Keywords: phylogenetic modelling, bilinear model, tolerance, pesticide, effect prediction model Author for correspondence: Guillaume Guénard e-mail: [email protected] Electronic supplementary material is available at http://dx.doi.org/10.1098/rspb.2013.3239 or via http://rspb.royalsocietypublishing.org. Guillaume Guénard1, Peter Carsten von der Ohe2,3, Steven Carlisle Walker4, Sovan Lek5 and Pierre Legendre1 1 Département de Sciences Biologiques, Université de Montréal, CP 6128, Succursale Centre-Ville, Montréal, Québec, Canada H3C 3J7 2 Department of Ecological Chemistry, Helmholtz Centre for Environmental Research –UFZ, Permoserstrasse 15, Leipzig 04318, Germany 3 Amalex Environmental Solutions, Kreuzstrasse 1, Leipzig 04103, Germany 4 Department of Mathematics and Statistics, McMaster University, 1280 Main Street West, Hamilton, Ontario, Canada L8S 4K1 5 Laboratoire Évolution et Diversité Biologique (EDB) UMR 5174, CNRS/Université Paul-Sabatier, 118 Rte de Narbonne 4R3-b1, Toulouse Cedex 9 31062, France SCW, 0000-0002-4394-9078; PL, 0000-0002-3838-3305 Direct estimation of species’ tolerance to pesticides and other toxic organic substances is a combinatorial problem, because of the large number of species–substance pairs. We propose a statistical modelling approach to predict tolerances associated with untested species–substance pairs, by using models fitted to tested pairs. This approach is based on the phylogeny of species and physico-chemical descriptors of pesticides, with both kinds of information combined in a bilinear model. This bilinear modelling approach predicts tolerance in untested species–compound pairs based on the facts that closely related species often respond similarly to toxic compounds and that chemically similar compounds often have similar toxic effects. The three tolerance models (median lethal concentration after 96 h) used up to 25 aquatic animal species and up to nine pesticides (organochlorines, organophosphates and carbamates). Phylogeny was estimated using DNA sequences, while the pesticides were described by their mode of toxic action and their octanol–water partition coefficients. The models explained 77–84% of the among-species variation in tolerance (log10 LC50). In cross-validation, 84–87% of the predicted tolerances for individual species were within a factor of 10 of the observed values. The approach can also be used to model other species response to multivariate stress factors. 1. Introduction Organisms are adapted to exploit the resources of their habitats and withstand a range of environmental conditions found therein. Stress or disturbance occurs when conditions depart from that range [1]. Toxic stress occurs when exposure to chemical substances is sufficient to disrupt important physiological processes, leading to depression or complete failure of functions that are necessary for growth, reproduction or survival. Such disruption depends on a broad array of interactions, at the molecular level, between physico-chemical properties of the compound and biochemical traits (e.g. receptors and enzymes) of the organism. Hence, tolerance to toxic stress can be regarded as a complex species trait integrating an array of subordinate biochemical and physiological features. On the other hand, chemical compound toxicity results from the physico-chemical properties of the compounds; these properties can be regarded as features of the compounds. Toxic stress therefore arises from complex interactions between the characteristics of the organisms and those of the chemical compounds. Understanding the effects of toxicity on living organisms is mandatory to assess the impact of chemical pollution on the environment [2]. An appreciable proportion of the numerous chemical compounds that are developed, produced & 2014 The Author(s) Published by the Royal Society. All rights reserved. Downloaded from http://rspb.royalsocietypublishing.org/ on June 18, 2017 (a) Data sources We obtained tolerance data from the US Environmental Protection Agency ECOTOX database [13]. We selected data of median lethal (b) Phylogenetic analysis We estimated the phylogenies using DNA sequences obtained online from the US National Center for Biotechnology Information’s GenBank website (http://www.ncbi.nlm.nih.gov/genbank/; database: Nucleotide). We collected the available DNA sequences for the whole mitochondrial genomes as well as for the nuclear 28S, 18S and 5.8S rRNA transcripts and their internal transcribed spacers (ITS 1 and ITS 2). We used sequences for the complete genes, whenever possible, and took the longest sequences available when only partial sequences could be found. When a sequence for a widespread gene was not available for a given species and no other species in the same genus or family was in the dataset, we used that of a related species instead. Then, we performed multiple DNA sequences alignment separately for each individual gene using the program MUSCLE v. 3.8.31 ([15]; http://www.drive5.com/muscle). After alignment, we concatenated the sequences for all genes into a single super-alignment and estimated the phylogeny using the maximum-likelihood method [16,17], which was computed by program fdnaml within software EMBOSS v. 6.3.1 [18]. (c) Modelling approach Phylogenetic eigenvector regression [19,20] is an approach whereby eigenvectors are used as descriptors in a regression model. For this study, we used a particular type of such eigenvectors called phylogenetic eigenvector maps (PEM) [12]. PEM allows one to calculate phylogenetic eigenvectors that are finetuned to the processes through which traits have evolved (neutral evolution or natural selection). In its simplest form, PEM calculation involves the estimation of a selection strength parameter a that take values between 0 ( purely neutral evolution) and 1 (strong natural selection). Calculation of PEMs and of the species scores used to make predictions is described in [12]; here, we will only summarize the method. These calculations involve the singular value decomposition of a weighted influence matrix (i.e. a mathematical representation of the edges linking the nodes and leaves of the tree) whose columns are centred on 0. These eigenvectors allow one to decompose species trait variation into components associated with different phylogenetic resolution levels; they are used as descriptors in the bilinear models for predicting tolerance to toxic substances. One can then obtain scores for new species for which predictions are to be made (we will hereafter refer to them as the ‘target’ species) using calculations similar to those that are made when a new point is added to an ordination diagram [21,22]. These scores can be used in models to predict tolerance (figure 1). Note that these are true out-of-sample predictions because the target species are not used to estimate the parameters of the bilinear model. In this study, we used phylogenetic eigenvectors to explain the among-species variation in LC50; let U ¼ [ui,k] be a design 2 Proc. R. Soc. B 281: 20133239 2. Material and methods concentrations (LC50) after 96 h of exposure excluding entries with inequality indications (i.e. greater than or smaller than). The mortality was selected as the endpoint because it is the one for which data are the most abundant. We performed a quality check when multiple LC50 values were found for one substance-species combination (see section ‘Data quality check’ in the electronic supplementary material for details). From the resulting dataset, we selected the species and pesticides that allowed us to build the species-by-pesticides table having the largest number of elements, without missing information. The data for any given model were assembled in a multivariate response matrix LC50 ¼ [LC50i,j ] whose rows i represent individual species and whose columns j represent a set of pesticides, and LC50i,j is the log10 of the concentration of pesticide j causing 50% death for species i after 96 h. We obtained the octanol–water partition coefficients (kow) for each pesticide from the KOWWIN software [14]. rspb.royalsocietypublishing.org and used by humans eventually reach ecosystems. These ecosystems are themselves populated by many different species, each of them having its particular susceptibilities towards each of those compounds. The exhaustive testing of every possible species-substance combination is impossible in practice because ecosystems are generally inhabited by hundreds to thousands of species and affected by environmental mixtures containing hundreds, sometimes thousands, of different compounds [3]. To our knowledge, most approaches proposed so far have addressed this problem as two separate issues: compound multiplicity and species diversity. Quantitative structure– activity relationships (QSAR) [4–10] seek to address compound multiplicity by predicting physiological effects of chemical compounds on single biological endpoints from their molecular structure. QSAR parameters are estimated using observed effects of compounds (e.g. their level of toxicity from a given mode of action) on target organisms and descriptors of its chemical structure and properties. Phylogenetic modelling [3], on the other hand, seeks to address species diversity by estimating species’ unknown tolerances to toxic compounds or other stressors from that of other related species with known values. This approach relies on the fact that traits which influence tolerance are subject to evolutionary change, and therefore that the effects of these traits can be modelled as phylogenetic signals. These signals can be regarded as a combination of large-scale phylogenetic structures and small-scale structures. A large-scale structure arises from an evolutionary change that occurred early in the phylogeny, with observable effects on lineages shared by a large number of species. A small-scale structure, on the other hand, arises from changes that occurred recently in the phylogeny, thereby only affecting a few, closely related, species. Here, we demonstrate that chemical compound multiplicity and species diversity can be addressed as a single problem. Bilinear modelling is an extension of multivariate linear regression where a table of response variables is fitted using two different tables of descriptors [11]. The first table contains descriptors of the variation among the rows of the multivariate response data (i.e. as in multiple regression), whereas the second table contains descriptors of the variation among the columns of the multivariate response. A bilinear model can therefore be used to relate a multivariate response, such as the tolerance of many species to multiple compounds, to descriptors of the species’ phylogenetic signal component [3,12] and descriptors of the molecular structure and/or physico-chemical properties [10], as those used for QSAR. The objective of this study is to demonstrate how bilinear modelling can be used to develop models addressing the issue of predicting the tolerance of multiple species to multiple compounds. To exemplify the modelling approach, we applied it to the modelling of multiple species tolerances to acute toxicity by multiple pesticides, using phylogenetic eigenvectors and physico-chemical properties as predictors. The ability of the models to make external predictions of tolerance to pesticides was also assessed. Downloaded from http://rspb.royalsocietypublishing.org/ on June 18, 2017 u1 (a) E9 U : row-wise explanatory variable (e.g. eigenvectors) Z : column-wise explanatory variable (e.g. properties) C : matrix of billinear regression coefficients Utarget : predictors (e.g. phylogenetic eigenfunctions) D E8 Y ·...Ò : concatenation of the columns of a matrix learning knowledge (C) application E E5 NZ NY E1 N2 E4 F N4 E3 G Z interpreting the coefficients UtargetCZT =Ŷpred 2 cz = –0.26 u2 2.04 0.02 u3 0.00 0.57 c0 u... cz u... u4 –0.12 –0.11 c0 u u u5 –0.12 0.04 u6 –0.15 0.02 2 l 2 l y1 y2 y3 y4 y5 u 0.03 –0.41 6 cz l 6 response y4 y5 –3.0 –1.5 0 1.5 3.0 c0u l 1 –0.51 y3 –3.0 –1.5 0 1.5 3.0 cz u 1 1 u1 y2 –3.0 –1.5 0 1.5 3.0 c0u l predictions C= 3 cz 0 y1 A zl 1 c00 l k –3.0 –1.5 0 1.5 3.0 z 2 1 0 –1 –2 7 (b) 4 examples property of each of the responses ·ŶfitÒ = (Z ƒ U) ·CÒ –3.0 –1.5 0 1.5 3.0 (c) B X C D Y E F 6 G Z 1 Figure 1. Example flowchart of bilinear model calculation. (a) Y is a multivariate response matrix whose values for y1, y2, y4, and y5 are known for seven model species A 2 G (black markers), whereas the values that are being predicted (grey markers) are those of three target species X, Y and Z for all the responses ( yX,. . ., yY,. . . and yZ,. . ., respectively) as well as those of all species for response y3. Two kinds of information are available in the form of row-wise and column-wise descriptors and predictors. (b) The row-wise design matrix U contains the species loadings of the model species on the phylogenetic eigenvectors uk (black markers), whereas the matrix of row-wise predictors U target contains the target species scores on the phylogenetic eigenfunctions (grey markers). (c) The column-wise design matrix Z contains a descriptor z l (e.g. a chemical property; there could have been many column-wise descriptors); here, it is also used as a predictor as it is known for all response variables. Model building proceeds as follows: using the observed response variables (1) together with the species loadings (2) and the column-wise descriptor(s) (3), we calculate C, the matrix of bilinear regression coefficients. This matrix contains the intercept, the main effects of the individual phylogenetic eigenvectors uk, the main effects of the individual properties zl and the interactions among the row-wise and column-wise descriptors. The model interrogation proceeds as follows: we pre-multiply C, which was estimated from data (4), with the design matrix containing U target (5) and post-multiply it with the transpose of Z (6), yielding predicted values (7). matrix of phylogenetic eigenvectors whose elements ui,k are the loadings of species i (rows) on eigenvectors k (columns). As mentioned in the Introduction, a bilinear model requires a second set of descriptors. In our study, these descriptors are properties of the pesticides that we used to explain variation in the columns of matrix LC50. Let Z ¼ [zj,l ] be a design matrix of descriptors of the chemical properties l of pesticides j. LC50 can be represented as the matrix product LC50 ¼ UCZT þ E, (2:1) where C ¼ [cj,l ] is a matrix of bilinear regression coefficients, and E ¼ [1i,j ] is a matrix of error terms (see section ‘Estimating the matrix of bilinear regression coefficients’ in the electronic supplementary material for details). The design matrix describing the among-row variation (U) had a centring variable (i.e. a column of ones to estimate the intercept) in addition to the phylogenetic eigenvectors. The design matrix describing the among-column variation (Z) included a centring variable, contrasts variables depicting the different modes of action of the pesticides and the log10 of the octanol–water partition coefficient (i.e. the proportion of the compound dissolving in octanol solvent compared with that dissolving in water solvent) as a descriptor of pesticides in LC50, and p q columns, where p and q are the numbers of columns in U and Z, respectively. We regularized the bilinear models in order to maximize the ability of the models to predict LC50. Regularization was performed by the forward selection of a subset of the modelling terms (elements cj,l ) on the basis of the Akaike information criterion with correction for small sample sizes (AICc [23]). We chose the subset that minimized AICc, thereby obtaining models minimizing information loss. Elements of C that were found to be irrelevant for prediction were given a value of 0, resulting in a sparse bilinear model. We used cross-validation to evaluate the ability of models to make dependable and accurate predictions. For each model, cross-validation was performed by first selecting a given species i, then performing the aforementioned model construction procedure using the n 2 1 remaining species, predicting the LC50 for species i, and finally repeating the procedure for each and every species in the dataset. In addition to cross-validation of the species, we obtained predictions for tolerance to other pesticides not used in model estimation and compared them to their observed values. Following these predictive validations, we assessed models performance at the global and observation scales using the prediction coefficient P 2 and the deviation factor di,j, respectively (see section ‘Metrics of model performance’ in the electronic supplementary material for calculation details). Model calculation and cross-validation were performed using the R language and environment v. 3.0.1 [24] and R package ape v. 3.0– 8 [25]. (d) Candidate models To explore the performance of our modelling approach under different circumstances of data availability, we defined three bilinear LC50 models from the data described in table 1. All models span three Proc. R. Soc. B 281: 20133239 c00 : intercept c0u : main effect of phylogenetic eigenfunction uk k cz 0 : main effect of compound property zl l cz u : interactions between uk and zl –1.0 –0.5 0 0.5 1.0 –1.0 –0.5 0 0.5 1.0 –1.0 –0.5 0 0.5 1.0 –1.0 –0.5 0 0.5 1.0 –1.0 –0.5 0 0.5 1.0 –1.0 –0.5 0 0.5 1.0 E2 Ŷfit,Ŷpred : fitted and predicted response, respectively 5 Utarget NX fy (0.1) B X C N5 E7 3 Y : multivariate response matrix (e.g. LC50) u6 U fx (0.25) 0.1 u2 rspb.royalsocietypublishing.org E11 N3 fz (0.2) u2 A E12 N1 u2 N6 E10 E6 u2 Downloaded from http://rspb.royalsocietypublishing.org/ on June 18, 2017 name log10 kow mode of action used 50293 DDT 6.91 NaM 123 55389 56382 fenthion parathion 4.09 3.83 AChE AChE 2, 3 123 58899 lindane 3.72 GABA 123 60571 63252 dieldrin carbaryl 5.40 2.36 GABA AChE 123 2, 3 115297 121755 endosulfan malathion 3.83 2.75 GABA AChE 123 122145 fenitrothion 3.30 AChE 123 333415 2921882 diazinon chlorpyrifos 3.81 4.96 AChE AChE 2, 3 123 1, 2 8001352 toxaphene 3.30 GABA 3 1, 2 modes of action: acetylcholinesterase inhibition (AChE), g-amino butyric acid-triggered chloride channel antagonism (GABA) and sodium channel modulation (NaM). The first model (M1) incorporates five pesticides: malathion and parathion (AChE), lindane and dieldrin (GABA) and DDT (NaM), as well as 25 species. The 25 species included two amphibians, 13 fish species, six crustaceans and four insects; total number of observations: 125. The two other models included more pesticides; the cost was that they had smaller numbers of species, given the available data. We obtained a second model (M2) by adding the pesticides carbaryl and fenthion (AChE) and dropping six species, which left 19 species for modelling (13 fish species, five crustaceans and one insect, total number of observations: 126). In a third model (M3), we added chlorpyrifos (AChE) and toxaphene (GABA), which required dropping four more species, leaving 15 species for modelling (10 fish species, four crustaceans and one insect, total number of observations: 117). We obtained predictions for each of these three models by using data for the pesticides that were not used to estimate model parameters. These other pesticides span three additional compounds: endosulfan (GABA), fenitrothion and diazinon (both AChE). To represent the difference in the effects of the three modes of actions in the column-wise descriptor matrices (Z), we defined two orthogonal contrast variables called MOA1 and MOA2 (see section ‘Orthogonal contrast variables’ in the electronic supplementary material for more details). We found DNA sequences for most species, with the exception of the Isopode (Crustacea) Asellus brevicaudus and the Hemiptera (Insecta) Notonecta undulata, for which we substituted the congeneric species Asellus aquaticus in the first case, and for which information was only available at the genus level in the second case. The most widespread DNA sequences were the mitochondrial large ribosomal subunit (16S; 25 species, of which 12 were complete and 13 were partial), the cytochrome oxidase subunit 1 (COX1; 24 species: 13 complete, 11 partial), the mitochondrial small ribosomal subunit (12S; 23 species: 13 complete, 10 partial), the nuclear small ribosomal subunit (18S; 21 species: height complete, 13 partial) and cytochrome b (CYTB; 21 species: 13 complete, height partial). Most of the remaining 39 sequences that were available for fewer than 19 species were complete mitochondrial DNA sequences for species whose entire mitochondria can be found on GenBank. predictions 1 1 123 3. Results With a few exceptions, the phylogeny we estimated from DNA data placed most species within their recognized taxonomic group (see section ‘Estimated phylogeny’ in the electronic supplementary material for more details on the phylogeny and common species names). The PEM used to calculate models M1 and M2 had small selection strength parameters (a, table 2) indicating nearly neutral and neutral trait evolution, respectively, whereas M3 had much higher value of a, indicating a stronger intervention of natural selection, on average, when pesticides chlorpyrifos and toxaphene are used in the model. The number of bilinear regression terms involved in the models ranged from 24 (M1, five main effect and 19 interaction terms; table 2; see ‘tables of bilinear model coefficients’ in the electronic supplementary material for more details on bilinear model parameters) to 11 (M3, five main effect and six interaction terms; M2 had 19 terms, with nine main effect and 10 interaction terms). The models had adjusted R 2 (R2adj: ) ranging from 0.77 to 0.84 (from 0.16 to 0.30 when using only chemical descriptors). The prediction coefficients (P 2) obtained by comparing observed LC50 values with those predicted following leave-one-out jack-knives ranged from 0.49 to 0.73 (table 2 and figure 2). The percentages of these individual jack-knifed predictions having an absolute deviation factor (jdj) lower than 10 (i.e. a variation often observed between studies for the same species –compound combination) ranged from 84 to 87%, whereas from 0 to 2.3% of the predictions deviated from observations by a factor of 100 or more. The most extreme deviations by model were observed for species A. brevicaudus and pesticide malathion in models M1 (d ¼ 2246) and M2 (d ¼ 2313), as well as for Cyprinus carpio and dieldrin in model M3 (d ¼ 297). The numbers of predictions made for pesticides not used in the models were 118 for M1, 68 for M2 and 31 for M3. They had P 2s ranging from 0.48 to 0.74 (table 2 and figure 3). From 61 to 87% of the predictions were within an absolute deviation factor of 10 of the observed LC50 values, whereas from 0 to 3% of the predictions deviated by a factor 100 or Proc. R. Soc. B 281: 20133239 CAS no. 4 rspb.royalsocietypublishing.org Table 1. The pesticides, their properties used for phylogenetic bilinear modelling (mode of action, NaM: Naþ channel modulator, AChE: Acetylcholinesterase inhibitor and GABA: GABA-gated Cl2 channel antagonist; and log10 octanol– water partition coefficient), the models in which they are used and the model(s) for which predictions were calculated. Downloaded from http://rspb.royalsocietypublishing.org/ on June 18, 2017 parameter sample size adjusted R 2 P 2 deviation factor M2 M3 species 25 18 13 compounds total 5 125 7 126 9 117 chemistry 0.06 1 0.00 3 0.57 2 phylogeny interaction 4 19 6 10 3 6 chemistry alone 0.16 0.30 0.19 all terms jack-knife 0.80 0.49 0.84 0.64 0.77 0.73 jdj , 10 0.54 84% 0.48 87% 0.74 86% jdj , 100 98% 98% 100% jdj , 10 jdj , 100 61% 97% 65% 97% 87% 100% external jack-knife external greater. Most of the extreme predictions by model were overestimates of Ictalurus punctatus tolerance to endosulfan (M1: d ¼ 152) and diazinon (M2: d ¼ 127, M3: d ¼ 58). A small prediction bias was evidenced for model M2 (the 95% confidence limits of the regression line did not include the 1 : 1 line in the whole range of observed values), the latter having a tendency to predict LC50s were smaller than the observed values. For models M1 and M3, the 1 : 1 line fit within the 95% confidence limits of the regression line and no prediction bias was evident. 4. Discussion In this paper, we developed an approach to predict tolerance using species phylogenetic patterns of pesticide tolerance coupled with properties used as proxies of the activity of the compounds. The models presented here as examples used relatively simple column-wise descriptor matrices encompassing only the mode of toxic action (in the form of orthogonal contrast variables) and log10 kow. This modelling approach (i.e. using phylogenetic eigenvectors within a bilinear model) has, however, a much wider application scope; it can be applied whenever one wants to model multiple species attributes towards sets of environmental conditions that species are exposed to. For example, one may want to predict multiple species reactions (e.g. physiological, behavioural and immunological endpoints) to a mixture of toxic stresses observed in situ using descriptors for this mixture in addition to phylogenetic eigenvectors. Still, despite the fact that we used relatively simple bilinear models for expository purposes, we were able to predict from half to two-thirds of the out-of-sample variation in LC50 values (0.50–0.66) for jack-knifes and from half to three-quarters (0.48–0.74) for a set of external compounds. We stress that these were true out-of-sample predictions, and therefore are honest assessments of the information content in phylogenetic trees about the relationships between pesticide properties and species tolerances. (a) Candidate models The challenge in selecting tolerance data was to obtain a broad phylogeny that encompassed species from as many higher taxonomic groups as possible with, ideally, several representative species for each of these groups, while having complete sets of values for as many pesticides as possible. The three models covered the following cases: (M1) many species, but few substances, (M3) many substances, but few species, and (M2) an intermediate model between these two extremes. As a consequence of data availability and the requirement that the response matrix (Y) be complete (no missing value for any element yi,j ), all three models were implemented using modest numbers of species, compared with those typically inhabiting ecosystems, and very few compounds, compared with the many chemicals that are found in ecosystems. Although these models may be useful for several applications, we hope that enhanced models, with many more species and compounds, will be developed. We propose the following four steps to maximize the development of phylogenetic bilinear models in the near future. First, it is necessary to pinpoint, within the pool of currently unavailable data, those data whose estimation using bioassays would maximize the size of the response matrix Y in the training data. To do so, an online database can be queried to highlight unavailable species – pesticide pairs associated with either many species or many pesticides. For example, if a particular species is lacking tolerance data about only one pesticide, then a bioassay on that species –pesticide pair will permit the addition of all measured tolerances for that particular species to the response matrix. Second, the suites of bioassays to estimate LC50 should be performed for the Proc. R. Soc. B 281: 20133239 selection strength (a) number of terms M1 5 rspb.royalsocietypublishing.org Table 2. Comparison of three phylogenetic bilinear model: sample size, selection strength (a), number of bilinear regression terms, adjusted coefficient of determination (R2adj: ), prediction coefficient (P 2) and the percentages of deviation factor values (d ) that are below given values for jack-knife (performed for one species at a time using a model developed using the remaining species) and external (using all the species of a given model and predicting for pesticides that were not in the models) predictions. Downloaded from http://rspb.royalsocietypublishing.org/ on June 18, 2017 (a) 6 lindane malathion DDT parathion dieldrin 10–3 1 103 10–3 1 103 10–3 1 103 10–3 1 103 10–3 1 103 lindane malathion DDT parathion dieldrin carbaryl fenthion 10–3 1 103 10–3 1 103 10–3 1 103 10–3 1 103 10–3 1 103 10–3 1 103 10–3 1 103 lindane malathion DDT parathion dieldrin carbaryl fenthion chlorpyrifos toxaphene 10–3 1 103 10–3 1 103 10–3 1 103 10–3 1 103 10–3 1 103 10–3 1 103 10–3 1 103 10–3 1 103 10–3 1 103 Proc. R. Soc. B 281: 20133239 (b) rspb.royalsocietypublishing.org O. mykiss O. mossambicus O. latipes G. affinis P. reticulata M. salmoides L. macrochirus L. cyanellus M. saxatilis C. carpio C. auratus P. promelas I. punctatus B. woodhousei P. triseriata C. dipterum Peltodytes ssp. P. californica N. undulata D. pulex D. magna M. monoceros G. fasciatus G. lacustris A. brevicaudus O. mykiss O. mossambicus O. latipes G. affinis P. reticulata M. salmoides L. macrochirus L. cyanellus M. saxatilis C. carpio C. auratus P. promelas I. punctatus P. californica D. pulex D. magna G. lacustris A. brevicaudus (c) O. mykiss G. affinis P. reticulata L. macrochirus L. cyanellus C. carpio C. auratus P. promelas I. punctatus P. californica D. pulex D. magna G. lacustris LC50 (µmol l−1) Figure 2. The observed LC50 (filled squares) and values predicted using phylogeny (unfilled circles) following leave-one-out jack-knife runs on the datasets used to build model M1 (a; 25 species and five pesticides), M2 (b; 19 species and seven pesticides) and M3 (c; 15 species and nine pesticides). still missing species –pesticide combinations with the most useful predictive information. Third, the quality of the explanatory variables could be improved. For example, the phylogeny could be re-estimated using DNA sequence for the poorly resolved species. Furthermore, more descriptors of chemical activity (e.g. new metrics of molecular structure, enzymatic assays) should also be considered in order to improve models. It is noteworthy that using the toxic mode of action limits the scope of application of models to compounds with modes of action that are known and represented in the models. Future efforts to seek better descriptors of chemical activity for use in phylogenetic bilinear tolerance models should therefore focus on those that are easy to estimate (e.g. using descriptors of molecular structure and activity, in vitro tests) and applicable to many (ideally all) compounds. Fourth, the phylogenetic bilinear model should be re-estimated to account for any new information provided by the previous three steps. We view these four steps as an Downloaded from http://rspb.royalsocietypublishing.org/ on June 18, 2017 model 1 2.0 (b) 2.0 P2 = 0.54 1.0 1.0 –0.5 –0.5 k –2.0 j vk −3.5 g P2 = 0.74 1.0 –0.5 –2.0 j vk k −3.5 −3.5 –2.0 –0.5 1.0 2.0 log10[observed LC50 (µmol l−1)] m −3.5 –2.0 –0.5 1.0 2.0 log10[observed LC50 (µmol l−1)] a A. brevicaudus b B. woodhousei c C. auratus d C. dipterum e C. carpio f D. magna g D. pulex h G. affinis i G. fasciatus j G. lacustris k I. punctatus l L. cyanellus m L. macrochirus endosulfan fenitrothion fenthion toxaphene n M. monoceros o M. salmoides p M. saxatilis q N. undulata r O. mykiss s O. mossambicus t O. latipes u Peltodytes ssp. v P. promelas w P. reticulata x P. triseriata y P. californica Figure 3. (a – c) Observed LC50 values as a function of those predicted using three phylogenetic bilinear models, for pesticides other than that used to build the models, with regression lines (solid; dotted: 95% confidence limits; dashed: 1 : 1 relationship), and coefficients of prediction (the letters used as markers represent the different pesticides). Species labels are only shown when predictions deviated from observations by a factor of more than 50. iterative procedure of model building. The tools for such a procedure could conceivably be implemented within a web-based interface, with data provided from collaborators worldwide. Another relevant matter to increase the scope of phylogenetic bilinear LC50 models beyond pesticides is to focus the efforts on a standard set of species and compounds. Standard species could be chosen, for instance, on the basis of their representativeness of the taxonomic group to which they belong and to maximize the number of high-order taxonomic groups that the models represent. Similarly, standard chemical compounds can be chosen on the basis of their representativeness of the chemical group they belong to (e.g. alkanes, ketones, alcohols, etc.). By increasing sample size, orienting sampling effort to add more representative species and compounds, and developing better descriptors of chemical activity, it would be possible to extend the scope of applications of phylogenetic bilinear tolerance models. A potential future application could be to accurately estimate the toxicity of compounds that have not yet been synthesized, including designer drugs. We anticipate that such models would be very useful to the chemical industry and government regulation authorities by giving scientists an early warning to avoid or ban harmful compounds and make the industry focus on more benign alternatives. (b) Model performance The developed models explained between 77 and 84% of the variability in the acute compound toxicity (table 2) and were therefore in the range of previous models (61–85%) that used existing empirical data to make predictions [3]. Regarding the ability of the three models to make predictions, it became apparent that the local representation of phylogeny was very important for the final performance. Most outliers in M1 were related to single branches in the phylogenetic tree, e.g. Morone saxatilis (DDT), A. brevicaudus and Metapenaeus monoceros (for both malathion and parathion). In M2, the latter species was removed, while for A. brevicaudus, another outlier for fenthion was included. For M3 finally, both M. saxatilis and A. brevicaudus were removed, with the best prediction results, respectively. Hence, to increase the prediction power for species not included in the model, it is suggested to have at least one related species in the phylogenetic tree. Generally, a number of other ecotoxicological models for single biological endpoints (e.g. acute toxicity for Daphnia magna or Pimephales promelas) are available, applying physicochemical properties or read-across methodologies to make quantitative predictions [6,9,10]. Thereby, models predicting baseline toxicity from kow usually explain more variability than those for diverse chemical classes using read-across. However, these models are restricted to single species, while the proportions of variability they explain (e.g. 73–87% for fishes and 85% for Daphnids) are in the same range as that of the models presented here (table 2). Hence, owing to the wide taxonomic range of species covered, the models developed in this paper are particularly useful to identify the taxonomic groups at highest risk for a certain compound. Given the widespread distribution of pollution [26], these models might also help explain why certain species have been lost from certain areas Proc. R. Soc. B 281: 20133239 2.0 k –2.0 carbaryl chlorpyrifos diazinon model 3 (c) 7 P2 = 0.48 −3.5 −3.5 –2.0 –0.5 1.0 2.0 log10[observed LC50 (µmol l−1)] log10[predicted LC50 (µmol l−1)] model 2 rspb.royalsocietypublishing.org log10[predicted LC50 (µmol l−1)] (a) Downloaded from http://rspb.royalsocietypublishing.org/ on June 18, 2017 (c) Assumptions (d) Directions We hope that this study will inspire scientists to pursue the implementation of phylogenetic models in various fields, such as ecotoxicology, environmental sciences, pharmacology and beyond, to other disciplines where it is relevant. We also hope that it will inspire ecotoxicologists to develop more powerful tolerance models and make them available to the people who need them [31]. Acknowledgements. We would like to thank our fellow scientists at Pierre Legendre’s laboratory and NSERC Hydronet for their useful advice and fruitful discussion during the elaboration of the manuscript. Funding statement. G.G. was funded by NSERC grant no. 7738–07 to P.L., and S.C.W. was funded by an NSERC PDF and NSERC grant no. 7738– 07 to P.L. The work of P.v.d.O. was supported by the European Commission through the Project HEROIC (FP7–282896; Coordination Action funded under the 7th Framework Program). References 1. 2. Townsend CR, Hildrew AG. 1994 Species traits in relation to a habitat templet for river systems. Freshw. Biol. 31, 265–275. (doi:10.1111/j.13652427.1994.tb01740.x) von der Ohe PC et al. 2011 A new risk assessment approach for the prioritization of 500 classical and emerging organic microcontaminants as potential river basin specific pollutants under the European Water Framework Directive. Sci. Total Environ. 409, 2064–2077. (doi:10.1016/j.scitotenv.2011.01.054) 3. 4. Guénard G, von der Ohe PC, de Zwart D, Legendre P, Lek S. 2011 Using phylogenetic information to predict species tolerances to toxic chemicals. Ecol. Appl. 21, 3178– 3190. (doi:10.1890/102242.1) Russom CL, Bradbury SP, Broderius SJ, Hammermeister DE, Drummond RA. 1997 Predicting modes of toxic action from chemical structure: acute toxicity in the fathead minnow (Pimephales promelas). Environ. Toxicol. Chem. 16, 948–967. 5. 6. (doi:10.1897/1551-5028(1997)016,0948: PMOTAF.2.3.CO;2) Schultz TW, Cronin MTD, Netzeva TI. 2003 The present status of QSAR in toxicology. J. Mol. Struct. Theochem. 622, 23 –38. (doi:10.1016/S0166-1280 (02)00615-2) von der Ohe PC, Kühne R, Ebert RU, Altenberger R, Liess M, Schüürmann G. 2005 Structural alerts: a new classification model to discriminate excess toxicity from narcotic effect levels of organic 8 Proc. R. Soc. B 281: 20133239 Bilinear modelling is an adaptation of multivariate regression analysis with the addition of a second table of descriptors. Therefore, it has the same assumptions as multivariate regression. An assumption that is of concern in this study is that the response variable should have independent and identically distributed residuals. Hence, as LC50 are taken from multiple species having different degrees of common ancestry, their values are indeed expected to lack statistical independence, because of phylogenetic autocorrelation, which is induced by phylogenetic processes. The latter processes, however, are themselves described by the phylogenetic eigenvectors acting as a fixed effect in the bilinear models, thereby directly controlling for the lack of independence originating from phylogenetic correlation. When one uses ordinary least squares to fit models—as we did in this study—residuals must be normally distributed for parametric tests of significance to be correct. Linear models with distribution other than normal may also be used; it is straightforward to fit generalized linear models [11]. Perhaps the most relevant assumption of any modelling method, including bilinear regression, is the adequacy of descriptors. In this study, row-wise descriptors were phylogenetic eigenvectors, whereas column-wise descriptors were more standard variables (i.e. orthogonal contrasts and log10 kow, a quantitative variable). We can enumerate three main assumptions underlying the adequacy of phylogenetic eigenvectors as descriptors. The first assumption is that the phylogenetic tree is estimated without systematic and significant error. This assumption is similar for any model whose descriptors need to be estimated by some other method prior to modelling. In cases where the reliability of phylogeny is questionable, Monte Carlo simulations can be used to assess the impact of phylogenetic estimation errors on the correctness and accuracy of tolerance estimates. It is also noteworthy that while this study did no DNA sequencing, future modelling efforts would ideally be granted with that possibility. Nowadays, DNA sequencing is cheaper than performing a suite of bioassays. We therefore regard the bioassays as the most demanding constraint to model development. The second assumption is that traits evolve in a steady fashion along the branches of the phylogenetic tree. In this paper, we used PEM, a method enabling one to fine-tune eigenvectors to represent the evolutionary dynamics of traits (i.e. neutral evolution versus evolution by stabilizing or directional selection) and specify different evolutionary rates adapted to particular parts of a tree. However, in this expository treatment we did not use that latter functionality. The third assumption is that the test species are sufficient and adequate to dependably represent the patterns of trait variation. Hence, when a group shows extreme trait variability, many representative species are necessary to correctly describe the patterns of trait variation within it. As a word of caution, it is noteworthy that, in practice, the possibility of large trait variation over small phylogenetic distances can never be entirely overruled. In this study, for instance, species A. aquaticus was substantially more tolerant to malathion and parathion than the other Malacostraca species present in the data. When the latter species was withdrawn during crossvalidation, the resulting phylogenetic model naively inferred relative sensitivity of that group to these pesticides leading to the highest deviation values we observed. In general, a species showing small inter-population variability in tolerance may include outlying populations that are highly tolerant (or sensitive; [29,30]). In such a situation, it may not be practical to estimate tolerance from a phylogeny as one would need intra-specific phylogeny and tolerance data focused on the evolutionary event that led to the emergence of these outstandingly tolerant populations [3]. Although not explored in this paper, trait data may be used either in place of or in combination with phylogenetic data, in order to directly measure the species characteristics that lead to variation in tolerance [12]. rspb.royalsocietypublishing.org while others remain [27]. Moreover, the predictions might support mixture toxicity models that usually suffer from a low number of test data for each compound [28]. Downloaded from http://rspb.royalsocietypublishing.org/ on June 18, 2017 8. 10. 11. 12. 13. 14. 24. The R Development Core Team 2012 R: a language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. See http://www.R-project.org. 25. Paradis E, Claude J, Strimmer K. 2004 APE: analyses of phylogenetics and evolution in R language. Bioinformatics 20, 289– 290. (doi:10.1093/ bioinformatics/btg412) 26. Malaj E, von der Ohe PC, Grote M, Brack W, Schäfer RB. In press. Organic chemicals jeopardise freshwater ecosystem integrity on the continental scale. Proc. Natl Acad. Sci. USA (doi:10.1073/pnas. 1321082111) 27. de Deckere E, De Cooman W, Leloup V, Meire P, Schmitt C, von der Ohe PC. 2011 Development of sediment quality guidelines for freshwater ecosystems. J. Soil Sediments. 11, 504–517. (doi:10.1007/s11368-010-0328-x) 28. Schäfer RB, Gerner N, Kefford BJ, Rasmussen JJ, Beketov MA, de Zwart D, Liess M, von der Ohe PC. 2013 How to characterize chemical exposure to predict ecologic effects on aquatic communities? Environ. Sci. Tech. 47, 7996–8004. (doi:10.1021/es4014954) 29. Nacci DE, Champlin D, Jayaraman S. 2010 Adaptation of the estuarine fish Fundulus heteroclitus (Atlantic killifish) to polychlorinated biphenyls (PCBs). Estuaries Coasts 33, 853–864. (doi:10.1007/s12237-009-9257-6) 30. Naudula VK. 2010 Glyphosate resistance in crops and weeds: history, development, and management. Hoboken, NJ, USA: Wiley Publishing. 31. Schäfer RB, Bundschuh M, Focks A, von der Ohe PC. 2013 To the editor. Environ. Toxicol. Chem. 32, 734–735. (doi:10.1002/etc.2140) 9 Proc. R. Soc. B 281: 20133239 9. 15. Edgar RC. 2004 MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acid Res. 32, 1792–1797. (doi:10.1093/nar/gkh340) 16. Felsenstein J. 1981 Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Ecol. 17, 368 –376. (doi:10.1007/ BF01734359) 17. Felsenstein J, Churchill GA. 1996 A hidden Markov Model approach to variation among sites in rate of evolution. Mol. Biol. Evol. 13, 93 –104. (doi:10. 1093/oxfordjournals.molbev.a025575) 18. Rice P, Longden I, Bleasby A. 2000 EMBOSS: the European molecular biology open software suite. Trends Genet. 16, 276 –277. (doi:10.1016/S01689525(00)02024-2) 19. Desdevise Y, Legendre L, Azouzi L, Morand S. 2003 Quantifying phylogenetically structured environmental variation. Evolution 57, 2647–2652. (doi:10.1111/j.0014-3820.2003.tb01508.x) 20. Diniz-Filho JAF, de Sant’Ana CER, Bini LM. 1998 An eigenvector method for estimating phylogenetic inertia. Evolution 52, 1247 –1262. (doi:10.2307/ 2411294) 21. Gower JC. 1966 Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika 53, 325–338. (doi:10.1093/ biomet/53.3-4.325) 22. Gower JC. 1969 Adding a point to vector diagrams in multivariate analysis. Biometrika 55, 582–585. (doi:10.1093/biomet/55.3.582) 23. Hurvich CM, Tsai CL. 1993 A corrected Akaike information criterion for vector autoregressive model selection. J. Time Ser. Anal. 14, 271–279. (doi:10. 1111/j.1467-9892.1993.tb00144.x) rspb.royalsocietypublishing.org 7. compounds in the acute daphnid assay. Chem. Res. Toxicol. 18, 536– 555. (doi:10.1021/tx0497954) de Roode D, Hoekzema C, de Vries-Buitenweg S, van de Waart B, van der Hoeven J. 2006 QSARs in ecotoxicological risk assessment. Regul. Toxicol. Pharm. 45, 24 –35. (doi:10.1016/j.yrtph.2006. 01.012) Ajmani S, Jadhav K, Kulkarni SA. 2009 Group-based QSAR (G-QSAR): mitigating interpretation challenges in QSAR. QSAR Comb. Sci. 28, 36 –51. (doi:10.1002/ qsar.200810063) Schüürmann G, Ebert RU, Kühne R. 2011 Quantitative read-across for predicting the acute fish toxicity of organic compounds. Environ. Sci. Technol. 45, 4616 –4622. (doi:10.1021/es200361r) Kühne R, Ebert RU, von der Ohe PC, Ulrich N, Brack W, Shüürmann G. 2013 Read-across prediction of the acute toxicity of organic compounds toward the water flea Daphnia magna. Mol. Inform. 32, 108–120. (doi:10.1002/minf.201200085) Gabriel KR. 1998 Generalized bilinear regression. Biometrika 85, 689–700. (doi:10.1093/biomet/85. 3.689) Guénard G, Legendre P, Peres-Neto P. 2013 Phylogenetic eigenvector maps (PEM): a framework to model and predict species traits. Methods Ecol. Evol. 4, 1120–1131. (doi:10.1111/2041210X.12111) US Environmental Protection Agency 2011 ECOTOX user guide: ECOTOXicology database system. See www.epa.gov/ecotox/ (accessed 1 July 2011). US Environmental Protection Agency 2012 Epi suite for Windows v4.11. Duluth, MN. See http://www. epa.gov/oppt/exposure/pubs/episuitedl.htm.