Phylogenomics
Using complete genomes to determine the phylogeny of species
Bas E. Dutilh

Tree of life
• Bacteria
• Archaea
• Eukaryota

Evolution
• What we can see are the present-day species
• Offspring look like their parents
• Mutations
  – Phenotype
  – Genotype
• Nature selects: survival of the fittest

Phenotype
• Which properties to compare?
• Watanabe's Ugly Duckling Theorem: “All things have an infinite number of features. So any two things share an infinite number of features. Therefore two things cannot be of the same kind because they share more features than they do with things of a different kind.”

Genotype
• Genome sequence is finite and you do not have to choose
• Genetic properties
  – Word frequency
  – Sequence (nt/aa)
  – Gene content
  – Gene order

Why sequence similarity works
• Every residue (nt/aa) is a separate dimension
  – Human: 3 billion nucleotides
• Most mutations are …
• Sequences never converge

Evolution: mutation and selection
• Mutation is responsible for changes
• Selection is responsible for continuity
• The more differences, the more distantly related two sequences are
• Contrary to structure or phenotype, sequences do not converge

Phylogenetics
• Inferring the evolution of a gene
• Distance matrix
• Hierarchical clustering
• Evaluate likelihood of all possible trees
• Maximum likelihood

Substitution matrix
• Describes the rate at which one character in a sequence changes to other character states
• BLOck SUbstitution Matrix (BLOSUM) is based on observed substitutions in aligned protein blocks, with sequences clustered at a chosen identity threshold (e.g. 62% for BLOSUM62)

Neighbour joining

Maximum likelihood
• Make all possible trees
• Calculate the likelihood that the alignment evolved in this tree

Maximum likelihood tree
• Very computationally intensive
• PhyML searches “around” a starting tree (e.g. NJ)

Maximum parsimony
• Parsimony is a special case of likelihood
• The tree with the smallest number of mutations is the maximum parsimony tree

SSU rRNA
• Present in all species
• Constant function
• Slowly evolving
Fox et al, Science 1980

SSU rRNA
• Phylogeny of SSU rRNA revealed the three domains
• Representative of the evolutionary history of species
• Bacteria
• Archaea
• Eukaryota
Olsen et al, J Bacteriol 1994

Different genes tell different stories
• Conflict between trees based on single genes
• Unrecognized paralogy
  – Orthologs
  – Paralogs
• Horizontal gene transfer
• Mutation saturation, biases, divergent rates
[Figure: gene tree with ancestor, spec A, spec B, spec C]

Is a tree the right representation?
• Genomes are chimeras with genes from different origins
  – Endosymbiosis (mitochondrion, chloroplast)
  – Horizontal gene transfer (many examples, often adaptations to environment)

More data = more consistent trees
• Combine information from more genes to average out these anomalies
• Complete genomes contain the maximum phylogenetic information

Fungi
• Yeasts, filamentous and dimorphic fungi
• Fungi are the eukaryotic clade with the largest number of completely sequenced genomes
• S. cerevisiae is a well-studied model organism
• Much consensus about phylogeny

Consensus phylogeny (literature)
19 target nodes

Orthology
• Which genes to compare between species
  – Homologs (originated “de novo”)
  – Orthologs (originated at speciation)
• Orthology has higher resolution
  – Pairwise orthology
  – Cluster orthology
  – Tree-based orthology
[Figure: tree with ancestor, spec A, spec B, spec C]

Pairwise orthology (Inparanoid)
• Compare all proteins in species A to all proteins in species B to find homologs
• Find the bi-directional best hit
• All proteins closer than the bi-directional best hit are (in-)paralogs

Cluster orthology (COG)
• First group in-paralogs in every species
• Find bi-directional best hits between in-paralogous groups
• Join in-paralogs to orthologous groups
  – Link all pairs of in-paralogous groups
  – Only if the link is confirmed by a third species (triangle)

Tree-based orthology
• Phylogenetic tree of homologs
• Find gene duplication nodes
• Two homologous genes are orthologs if their last common ancestor is not a duplication node but a speciation node

Gene content methods
• Presence/absence matrix (0/1)

        OG1   OG2   OG3   OG4   …
  sp1   1     1     0     1     …
  sp2   0     1     0     0     …
  sp3   0     0     1     1     …
  …     …     …     …     …     …

• Similarity: number of shared orthologous groups (OGs)
  – Genomes that share few OGs are distantly related
  – Genomes that share many OGs are closely related, but…

Genome size correction
• Large genomes have more genes, so they also share more genes
• Divide the number of shared genes by
  – Average genome size
  – Smallest of two genomes
  – Weighted average genome size
[Figure: number of shared genes plotted against genome size, with P. chrysosporium labelled. Korbel et al, Trends Genet 2002]

Gene content methods
• Similarity: corrected number of shared genes
• Distance: (1 – similarity)

  dist(spA, spB) = 1 – (# shared OGs(spA, spB) / weighted average size(spA, spB))

• Combined matrix, with distances (d) below the diagonal and similarities (s) above it:

  d\s   sp1   sp2   sp3   sp4   …
  sp1   0\1   0.8   0.6   0.8   …
  sp2   0.2   0\1   0.1   0.9   …
  sp3   0.4   0.9   0\1   0.7   …
  sp4   0.2   0.1   0.3   0\1   …
  …     …     …     …     …     …

• Neighbour joining

Gene content methods
• Dollo parsimony
  – Gaining a complex character (gene) is rare and happens once
  – Losing it is relatively easy
  – Minimize the number of gene losses for maximum parsimony

Superalignment methods
• Multiple alignment
• Concatenate alignments (1:1:1)
• A missing gene in a certain species (row) can be seen as a gap in the alignment

Superdistance methods
• Combine distance matrices from separate gene families, e.g. by averaging

Supertree methods
• Make phylogenetic trees for all gene families separately
• Matrix Representation using Parsimony (MRP)
[Figure: panels labelled “13 trees”, “14 trees”, “15 trees”, “12 trees”]

Gene content vs. sequence based
Gene content supertrees differ from sequence-based supertrees

Consensus phylogeny (literature): 19 target nodes

Gene content: 10.38
• Low-dimensional compared to genotype
• Intermediate between genotype and phenotype
  – Main dichotomy between yeasts and filamentous Fungi, not Ascomycota and Basidiomycota
  – Dimorphic Basidiomycota exclude filamentous P. chrysosporium

Superalignment: 18.21
Supertree: 17.50
• Sequence-based trees agree better with literature
• Literature is dominated by sequence-based trees

Hyperthermophiles
Nanoarchaeum
[Figure: tree with Nanoarchaeota, Crenarchaeota, Euryarchaeota. Waters et al, PNAS 2003; Di Giulio, J Theor Biol 2006]
• Gene content tree
[Figure: trees with Eury and Cren labels. Ciccarelli et al, Science 2006; Brochier et al, Genome Biol 2005]

Assignment
www.cmbi.ru.nl/edu/seminars
• Make a gene content tree
• Compare with other phylogenetic trees
• Describe the differences
  – Can you find literature that specifically studies these species?
  – What do you think is going on? Why are the trees different?
• Write a paper about some of your most interesting findings, include references
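
As a starting point for the first step of the assignment, the gene content distance from the “Gene content methods” slides can be sketched in Python. This is a minimal sketch under stated assumptions, not the method used for the trees in the lecture: the toy matrix is the three-species example from the slides, genome size is taken to be the number of OGs present in a genome, and all function names are hypothetical. The slides do not spell out the “weighted average genome size”, so the formula used for that option below is an assumption.

```python
from math import sqrt

# Presence/absence of orthologous groups (OG1..OG4), copied from the
# example matrix in the slides.
presence = {
    "sp1": [1, 1, 0, 1],
    "sp2": [0, 1, 0, 0],
    "sp3": [0, 0, 1, 1],
}


def genome_size(row):
    """ASSUMPTION: genome size = number of OGs present in the genome."""
    return sum(row)


def shared_ogs(row_a, row_b):
    """Number of OGs present in both genomes."""
    return sum(1 for a, b in zip(row_a, row_b) if a and b)


def normaliser(size_a, size_b, method="smallest"):
    """Genome size correction: one of the three options listed in the slides."""
    if method == "average":
        return (size_a + size_b) / 2
    if method == "smallest":
        return min(size_a, size_b)
    if method == "weighted":
        # ASSUMPTION: the slides give no formula for the weighted average;
        # this form reduces to the genome size when both sizes are equal.
        return sqrt(2) * size_a * size_b / sqrt(size_a ** 2 + size_b ** 2)
    raise ValueError(f"unknown method: {method}")


def gene_content_distance(row_a, row_b, method="smallest"):
    """dist = 1 - (# shared OGs / corrected genome size)."""
    size = normaliser(genome_size(row_a), genome_size(row_b), method)
    return 1 - shared_ogs(row_a, row_b) / size


def distance_matrix(presence, method="smallest"):
    """All pairwise distances, keyed by (species, species)."""
    names = sorted(presence)
    return {
        (a, b): gene_content_distance(presence[a], presence[b], method)
        for a in names
        for b in names
    }


if __name__ == "__main__":
    for pair, dist in distance_matrix(presence).items():
        print(pair, round(dist, 2))
```

The resulting distance matrix can then be passed to a neighbour joining implementation of your choice to obtain the gene content tree. Genomes without any OGs would cause a division by zero and are not handled in this sketch.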