Michel Veuille
Ecole pratique des Hautes Etudes
Director of the Systematics and Evolution dept, Muséum National d’Histoire Naturelle, Paris
Scientific Advisory Board of the CBOL
Data Analysis Working Group

What is the molecular signature of speciation events?
There is no molecular signature of speciation events.

What are the other signatures of speciation events?
There is no universal signature of speciation events. But there are local signatures of speciation events, and one kind of signature (e.g. morphological) can be present when the other (e.g. genetic) is absent.

Two examples: 1st / 2
A case of two mtDNA species with no morphological difference
In 1998, the common European earwig was shown to consist of two sympatric and reproductively isolated species differing only in the number of annual broods (one or two broods per year).
The two species differ strikingly in COII sequence. This is because the GC% of these species evolves at a very high rate.
But since they present no apparent morphological difference, the two species remain unnamed.
[Figure: GC% at COII in Hexapoda. Labels: European earwig Forficula auricularia; other Hexapoda; earwigs]
Wirth, Le Guellec, Vancassel & Veuille. 1998. Evolution 52: 260-265.
Wirth, Le Guellec & Veuille. 1999. MBE 16: 1645-1653.

Two examples: 2nd / 2
A case of two morphological species with no mtDNA difference
[Map: São Tomé, with Drosophila santomea and Drosophila yakuba]
Drosophila santomea lives in the highlands of São Tomé, above 1100 m.
Drosophila yakuba lives in the lowlands, below 1100 m.
They hybridize at 1100 m, and nevertheless remain genetically distinct.
They share the same mitochondria, but can be easily identified through the colour pattern of the abdomen.
After Lachaise et al. Proc. Roy. Soc. London, 2000.

They belong to the Drosophila melanogaster ("black abdomen") subgroup
Species           Year described   Range
D. orena          1978             Cameroon
D. erecta         1974             Tropical Africa
D. teissieri      1971             Tropical Africa
D. yakuba         1954             Tropical Africa
D. santomea       2000             São Tomé island
D. melanogaster   1830             Tropical Africa + worldwide
D. simulans       1919             Tropical Africa + worldwide
D. mauritiana     1974             Mauritius island
D. sechellia      1981             Seychelles islands
D. santomea and D. yakuba share the same mitochondrion through common descent.

The condition of the barcoder is challenging
The species concept is hotly debated. There are many definitions of species.
« Species » make sense to everybody. For example, 12% of the nouns in the French vocabulary* correspond to taxa that make sense to a taxonomist (species, families, varieties).
A solution is to let people use whatever species concept they prefer and limit the barcoder’s activity to the domain where he/she can be helpful.
* From the Robert, a classic French dictionary

What data analysis is about
[Diagram: ?0,000,000 species feed a black box (barcoder), which passes data & tools to the taxonomist, who concludes « This is species A or B » or « This is a new species »]
Data analysis consists in providing data to taxonomists, so that they can make decisions about the status of specimens and taxa. Barcoding and taxonomic decisions are logically distinct, even though they can be performed by the same person.

What data analysis is about (contd)
[Diagram: tree of life showing the query sequence, its sister group, the local barcode, the closest COI validated node, and the closest validated node using additional information]
If we want to be 100% sure of the assignment of a taxon, then we must look at the nodes below the closest node excluding a sister group with probability p < 0.01. Below this point, a series of statistical and classificatory approaches allow us to estimate the probability that the query sequence belongs or not to an already described species, based on the available information. Alternatively, additional information from other genes, or an enlarged dataset, can increase our understanding of the taxonomic status of the query.
The population genetics background behind data analysis
Principle: going back in time, two sequences from the same population find their last common ancestor in each generation with some constant probability p = 1/N.
It is a « death process », very different from a normal distribution.
[Figure: distribution of coalescence times; axis: past (generations)]
- the most probable coalescence time: t = 1
- the expectation: t = N
- the probability that the two sequences have not yet coalesced falls to P = 0.05 at t = 3N

[Figure: probability p (0 to 1) against sample size n (ticks at 0, 2, 9, 19, 39); sample n1 and its MRCA]
Probability p that the MRCA of a sample of size n is also the MRCA of the species, assuming a standard Wright-Fisher model. In a very large population, p = (n-1)/(n+1).
p increases very rapidly: p = 0.6667 for n = 5, and p = 0.8 for n = 9.
Increasing the sample size beyond this is of little use.

[Figure: genealogy of sample n1 with its MRCA; N generations for a pair of sequences, 2N (1-1/n) generations for the whole tree]
Typically, under a standard equilibrium Wright-Fisher model (*), the expected time to the most recent common ancestor of the tree (MRCA) is only twice the time to the common ancestor of two randomly sampled sequences.
(*) assuming:
- neutrality
- constant population size
- no structuring
- mutation-drift equilibrium
- N = effective number of genes

Using a larger dataset does not increase the information very much at this level.
[Figure: genealogies of sample n1 and of a larger sample n2 > n1, each with its MRCA; N generations and 2N (1-1/n) generations]
« The older nodes of a genealogy tend to be revealed in a small sample, whereas more recent portions are, on average, only revealed as the sample size per locus grows large. » Kliman et al. 2000.

Polymorphisms can go very far back in the past of the species, and enter the ancestral population shared with a sister species.
After AG Clark 1997.
A long time after they have split, two species still share some neutral polymorphisms.

Exploring shallow nodes
1. Matz and Nielsen’s MCMC method
Derived from Nielsen and Hey’s (2001) IM method, based on MCMC (Markov chain Monte Carlo). That method estimates 5 parameters, and therefore involves very long computation times.
Matz and Nielsen (2005) reduce it to two parameters:
- the population size
- time to speciation.
They estimate the probability that the query sequence belongs or not to the same species as the reference sample.

2. Evaluating classification and phylogenetic methods: Austerlitz et al.
They compare two classification methods:
- CART
- random forest
and two phylogenetic methods:
- neighbour-joining
- PhyML
The classification methods partition the dataset using a few characters.
The distance methods work well with a small dataset, provided there are enough mutations.
They simulate n + 1 individuals in each species: n individuals form a reference sample, and the last individual is the query. Repeated simulations allow them to record the rate of correct assignment of the query to its species.

Comparison of the methods for a low θ (2 populations, reference sample size = 10, θ = 3)
[Figure: success rate (50% to 100%) against separation time (100 to 10000) for ML, CART and RF]
Classification methods perform better for a low variation.

Comparison of the methods for a high θ (2 populations, reference sample size = 10, θ = 30)
[Figure: success rate (50% to 100%) against separation time (100 to 10000) for ML, CART and RF]
Phylogenetic methods perform better for a highly variable population.
Conclusion: the appropriate method varies with the properties of the dataset.

Comparing methods using realistic datasets
1. Litoria nannotis: 4 species, average sample size 43.7, average θ = 1.54
[Figure: success rate (80% to 100%) against sample size (0 to 30) for ML, CART and Random Forest]
2. Astraptes fulgerator: 12 species, average sample size 38.8, average θ = 23.5
[Figure: good assignment rate (90% to 100%) against reference sample size (3 to 10) for phylo and CART]
3. Cowries
[Figure: good assignment rate (80% to 100%) against sample size (0 to 30) for ML, CART and Random Forest]

Other solutions: Can we replace CO1? Can we complement it with other genes?
Properties of bilaterian mtDNA: Other systems
- Large number of copies per cell: rDNA has a high copy number
- High mutation rate: Microsatellites also
- Low variation / divergence ratio: Centromeres, telomeres (documented in Drosophila)
- No recombination: Centromeres, telomeres (documented in Drosophila)
- Haploid: X chromosome, Y chromosome
- Maternally inherited, asexual: The Y is asexual; the other chromosomes recombine
- Variation in mtDNA is lowered due to selective sweeps, according to Bazin et al (2006): Variation is also lowered in some nuclear regions due to background selection
The main disadvantage of maternal inheritance is that mitochondria can be transferred horizontally along with Wolbachia endosymbiotic bacteria. Examples: Protocalliphora and Drosophila.
The main disadvantage of asexuality is that mitochondria do not follow the 2nd law of Mendel: mtDNA carries no information on genetic barriers.

Maternally transmitted endosymbiotic bacteria: hitchhiking by Wolbachia
[Figure panels: nuclear; mtDNA]
Phylogeny of the fly Protocalliphora based on AFLP (nuclear markers), according to Whitworth et al (2007). Symbols represent different Wolbachia strains.
Phylogeny of Protocalliphora based on COI+COII. The authors claim that the assignment of unknown individuals to species is impossible in 60% of the species.
After Whitworth et al. Proc. Roy. Soc. B, in press.

[Figure: phylogenetic tree of mtDNA, with its MRCA; phylogram of nuclear DNA]
A phyletic tree in mtDNA represents true phyletic relationships. Mutations are in linkage disequilibrium because they do not recombine. Having two divergent clades is trivial under a standard Wright-Fisher model, whereas the phylogram of a recombining gene represents distances between haplotypes, where mutations can seem to « appear » repeatedly on several terminal branches.
Recombining genes thus inform us on the existence of barriers to gene flow.

Conclusions
1. There is no mitochondrial signature of speciation. There is no room for a barcode species concept, or anything like a « barcodon ».
2. Even a moderate sample can provide a wealth of information on the history of a species.
3. Additional information can be obtained in difficult cases, either by increasing the population sample, or by using additional markers.

The END
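
Conclusion 2 rests on the coalescent figures quoted in the population genetics background. As a rough numerical check, here is a minimal Python sketch, assuming the standard Wright-Fisher model described there (neutrality, constant size, no structure, N = effective number of genes). The formulas are the ones given in the talk; the function names are illustrative only and do not come from the talk or from any library.

```python
def prob_pair_coalesces_at(t: int, N: int) -> float:
    """P(two sequences find their common ancestor exactly t generations back).

    Each generation they coalesce with constant probability p = 1/N,
    so the waiting time is geometric.
    """
    p = 1.0 / N
    return (1.0 - p) ** (t - 1) * p


def prob_pair_not_coalesced_by(t: int, N: int) -> float:
    """P(two sequences still have no common ancestor after t generations)."""
    return (1.0 - 1.0 / N) ** t


def prob_sample_mrca_is_species_mrca(n: int) -> float:
    """Large-population probability that a sample of n shares the species MRCA."""
    return (n - 1) / (n + 1)


def expected_tmrca(n: int, N: int) -> float:
    """Expected time (generations) to the MRCA of a sample of n sequences."""
    return 2.0 * N * (1.0 - 1.0 / n)


if __name__ == "__main__":
    N = 10_000  # hypothetical effective number of genes, chosen for illustration
    print("P(not coalesced by 3N):", prob_pair_not_coalesced_by(3 * N, N))
    for n in (2, 5, 9, 19, 39):
        print(n, prob_sample_mrca_is_species_mrca(n), expected_tmrca(n, N))
```

With n = 2 the expected tree depth reduces to N generations, and it can never exceed 2N however large the sample, which is the sense in which a moderate sample already reaches the deep nodes of the genealogy.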