Comparative genomics
BIOINFORMATICS TO ANALYZE AND COMPARE GENOMES

We sequenced and assembled a genome, but this is only a long stretch of A, T, C and G. What should we do now?
1. find genes

Gene calling
The simplest thing is to look for ORFs (open reading frames). An ORF is a stretch of DNA that starts with a start codon and ends with a stop codon. Our goal is to call (which means to find) the ORFs in a genome sequence.

ORF calling
We need software that will recognize start and stop codons → in all six possible frames. These usually are:
ATG = start (methionine)
TGA, TAA, TAG = stop

Gene calling issues
Sounds pretty easy... however there are some issues:
1. The genetic code is NOT really universal, so we need to know which variation of the code our organism follows
2. Eukaryotes have introns. Rules for intron/exon boundaries vary among species, so we will need software that is suited to our organism

Gene calling is easy and straightforward, but it is fundamental to use the right software, which is in general a good rule for bioinformatics.

Gene Annotation
The process of assigning a name and a function to each of our genes. This is done by comparing each gene in our genome to a database, to detect a gene that is similar enough for us to say that our novel sequence has the same function.

... comparing each gene in our genome...
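The six-frame ORF search described under "ORF calling" can be sketched in a few lines of Python. This is a minimal illustration, not a real gene caller: it assumes the standard start and stop codons listed above, ignores introns and alternative genetic codes, and the function names and the `min_codons` parameter are made up for this example.

```python
# Start and stop codons of the standard genetic code; an organism that
# uses a variant code would need different sets here.
START = "ATG"
STOPS = {"TGA", "TAA", "TAG"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame, min_codons=1):
    """Yield (start, end) for ORFs in one frame of one strand.

    `end` is exclusive and includes the stop codon.
    """
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == START:
            start = i
        elif start is not None and codon in STOPS:
            if (i - start) // 3 >= min_codons:
                yield start, i + 3
            start = None


def find_orfs(seq, min_codons=1):
    """Scan all six frames (three per strand) and list the ORFs found.

    Coordinates on the '-' strand refer to the reverse-complemented sequence.
    """
    seq = seq.upper()
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(s, frame, min_codons):
                hits.append((strand, frame, start, end, s[start:end]))
    return hits


print(find_orfs("CCATGAAATAGCC"))
```

In practice `min_codons` would be set much higher than 1, because short ORFs occur by chance very often in any sequence.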
When I want to compare two sequences, or two sets of sequences, I use the NCBI BLAST algorithm.

Gene Annotation - BLAST
BLAST means Basic Local Alignment Search Tool.
- It can be used online or offline; offline is better for entire genomes
- It is fast and accurate
- It is highly customizable: four main algorithms, with varying inputs, cover the combinations of nucleotides and proteins in the input and in the database sequences
- It outputs hits with a score, indicating the strength of the similarity

It is even more customizable offline. We can set a number of parameters, such as:
- Cost of a gap: how much negative score a gap in the alignment causes
- % identity between the query and the database sequence
- Output format: for example a table
The most important parameter is possibly the E value.

Gene Annotation - BLAST: E value
The Expect value (E) is a parameter that describes the number of hits one can "expect" to see by chance when searching a database of a particular size. It decreases exponentially as the Score (S) of the match increases. Essentially, the E value describes the random background noise.
Very low E values indicate a very low probability that the hit was found because of a random similarity between the two sequences. High E values indicate a high probability of a random hit. So an E value of 1 is VERY BAD!
The E value is strictly correlated with the database size: a bigger database contains more sequences, and thus more sequences that will be randomly similar to the input.
10^-5 is widely considered a stringent E value, HOWEVER the parameter must be set based on the task.

Gene Annotation Databases
...comparing each gene in our genome to a database...
Which database?
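To make the behaviour of the E value concrete, the sketch below uses the standard relation between E value and bit score, E = m × n × 2^(−S'), where m is the query length, n the total database length and S' the bit score. This is a simplification: real BLAST uses corrected "effective" lengths, and the function name and the numbers are chosen only for illustration.

```python
def expect_value(bit_score, query_len, db_len):
    """Approximate E value from a bit score: E = m * n * 2**(-S').

    Assumption: raw lengths are used instead of BLAST's effective lengths.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# The same alignment (same bit score) searched against a small and a
# tenfold larger database: the E value grows with the database size.
small_db = expect_value(bit_score=50, query_len=300, db_len=1_000_000)
large_db = expect_value(bit_score=50, query_len=300, db_len=10_000_000)
print(small_db, large_db)

# Each additional bit of score halves the E value (exponential decrease).
print(expect_value(51, 300, 1_000_000) / small_db)
```

This is why the same hit can pass a fixed E value threshold against a small custom database and fail it against a large general one.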
There are multiple possible databases: some are very general, some are species-specific.

Gene Annotation Databases
NR = non-redundant NCBI database (proteins)
NT = non-redundant NCBI database (nucleotides)
UCSC Genome Browser → for human genes
COG = clusters of orthologous genes
FlyBase for Drosophila
RAST for bacteria

The choice should be made carefully, based on the organism, and multiple databases should be compared when possible. A specific database can also be generated for the task, for example based on NCBI searches.

Gene Annotation
...comparing each gene in our genome to a database, to detect a gene that is similar enough for us to say that our novel sequence has the same function.
How much is enough?

The case of the creeping fox terrier clone
Stephen Jay Gould, essay contained in the book 'Bully for Brontosaurus' (1992)
We may imagine the earliest herds of horses in the lower Eocene as resembling a lot of Fox-Terriers in size...
H. F. Osborn was the first to use this comparison, in 1905, and since then most books have used it. Do the authors really know the size of a fox terrier? Or are they just copying the old comparison?

When we use comparisons to annotate our genes, we need to be careful. How many times has this comparison been reused, starting from a gene whose function was experimentally determined?
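In practice, "how much is enough?" becomes a set of explicit thresholds applied to the BLAST hits. The sketch below reads BLAST tabular output (the default 12 columns of `-outfmt 6`) and keeps, for each query gene, the best-scoring hit that passes an E value and identity cut-off. The thresholds, the function name and the idea of transferring the annotation of the single best hit are assumptions made for illustration, not a recommended standard.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(tabular_text, max_evalue=1e-5, min_identity=40.0):
    """Return, per query, the highest-scoring hit that passes both filters.

    The default thresholds are illustrative and must be tuned to the task.
    """
    best = {}
    reader = csv.reader(io.StringIO(tabular_text), delimiter="\t")
    for row in reader:
        if len(row) != len(BLAST_COLUMNS):
            continue  # skip blank, comment or malformed lines
        hit = dict(zip(BLAST_COLUMNS, row))
        evalue = float(hit["evalue"])
        identity = float(hit["pident"])
        bitscore = float(hit["bitscore"])
        if evalue > max_evalue or identity < min_identity:
            continue
        query = hit["qseqid"]
        if query not in best or bitscore > best[query]["bitscore"]:
            best[query] = {
                "subject": hit["sseqid"],
                "evalue": evalue,
                "identity": identity,
                "bitscore": bitscore,
            }
    return best
```

Genes with no hit passing the filters are simply absent from the result. Leaving them unannotated is safer than copying a weak annotation that may itself have been copied from somewhere else.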
Avoid misannotation when possible:
- Use multiple databases
- Use stringent BLAST parameters
- Double-check the important genes (we cannot check them all, we are working high-throughput)

Genes can be useful for many tasks. A couple of examples:
- Evaluating the metabolic potential of the newly sequenced genome
- Determining the phylogenetic position of the organism

From genes to metabolisms
The presence of a gene often indicates an active enzyme. If all enzymes of a pathway are present, our organism can very probably complete the pathway. Specific software can reconstruct metabolic pathways or cellular structures.
KEGG – Kyoto Encyclopedia of Genes and Genomes

Phylogenetics
Phylogenetics is the study of the evolutionary history of organisms.

Phylogeny
All organisms have a common ancestor. Similarly to genealogy, phylogeny aims at reconstructing a 'tree': a reconstruction of EVOLUTION, using differences and common traits. Originally it was based on morphology.

Phylogenetics
To study phylogenies we need hereditary characters that group and separate the units present in our dataset. We can use morphology, but interpretation and analogy can provide false evidence or make our results noisy. Advanced techniques of geometric morphometrics can work.

It is important to use traits that are
Homologous: derived from a common ancestor
and not
Analogous: evolved independently, in a process of convergent evolution
Fins in this dataset are an analogous character (fish and whales). Endothermy is a homologous character (mammals).

Phylogeny
ATCTTCTG ATCTTCATG ACCTTCATG ATCTGCATG ATCAGCATG ATCTGCATG ATCCGCATG
Reconstruction of EVOLUTION, using DNA sequences. DNA is perfect, because it is a digital character that is not influenced by interpretation.

Phylogeny
We do not have information on the ancestors!!
ATCTTCTG ACCTTCATG ATCAGCATG ATCCGCATG
So we need to infer the evolution of the character.

Phylogenetics
DNA is perfect, because it is a digital character that is not influenced by interpretation.
With single genes → phylogenetic analyses
With entire genomes → phylogenomic analyses
More characters = more power:
- I can discriminate ancient events, where the noise is very strong
- I can discriminate extremely recent events, where the variation between the different taxa is extremely low

Phylogenetics
1. sequence and assemble genome
2. extract genes
3. obtain genes of other organisms from a database
4. align them
5. run a phylogenetic software

Phylogenomics – obtain genes
Finding homologous genes is not enough: the genes that we want are called orthologous genes.
orthologous: genes that derive from a speciation event
paralogous: genes that derive from a duplication event
The software OrthoMCL is an example of a tool to obtain orthologous genes. This software:
1. compares all the genes of all the organisms in a dataset (bidirectional BLAST hit)
2. uses a Markov Cluster algorithm to create networks to determine orthologous genes
To obtain orthologous genes → bidirectional BLAST hit: the software only accepts the gene pairs for which each of the two genes is the best hit in the BLAST search against the other genome.

Align the genes
Orthologous genes are generally very similar, so they can be aligned by software such as MUSCLE.

Phylogeny
Starting from an alignment, we can use specific algorithms with the goal of understanding the evolutionary relationships between our organisms. A number of such phylogenetic algorithms exist:
Maximum likelihood methods: try to find the evolutionary tree that is most likely to explain the variation present in the dataset
Maximum parsimony methods: try to find the tree that explains the variation with the lowest number of evolutionary changes

SNPs analysis
When we work with very similar genomes, orthologous genes can contain too little information, as they are too similar → we need higher resolution.
We need to work at a 'lower' level, not on genes but on single positions: Single Nucleotide Polymorphisms.
This approach allows us to detect single variations
- in highly variable genes → excluded from the orthology analysis
- in intergenic regions

SNPs ANALYSIS – how to
We sequence and assemble our genomes (contigs), then we align them to a REFERENCE GENOME.
REFERENCE GENOME: a closely related genome that we can use as a blueprint, as a reference, to compare our novel genomes to.
Variations between the reference and our novel genomes can be recorded and used for comparison purposes, such as SNP-based phylogeny.
The alignment of the genomes to a REFERENCE GENOME can be done using specific software: MAUVE does the alignment and gives us the SNPs.

SNPs ANALYSIS
If the phylogenomic approach can exclude important information, the SNPs approach may include areas with questionable alignment. To avoid this problem not all variations are used, but just the CORE SNPs: SNPs that are flanked on both sides by identical nucleotides in all the genomes of our alignment. This yields a dataset that is precise and informative.
SNPs analysis allows us to detect minimal differences between very similar genomes. It is the analysis of choice when working in human genomics, and in general in the genomics of model systems. With the increase of available genomes, it has also become the method of choice for bacterial genomics of single species, a field that is called genomic epidemiology.

Phylogenetics with SNPs
1. sequence and assemble genome
2. align the contigs to a reference genome
3. extract core SNPs
4. run a phylogenetic software
However there is a FASTER alternative: mapping the reads directly to the REFERENCE GENOME.

MAPPING OF THE READS
The assembly of the reads into a genome is not the only way.
Assembly from reads = DE NOVO, as in 'try to generate a novel genome de novo, without previous information'.
An alternative that is very useful in specific situations is mapping the reads to a genome we already know, indicated, again, as REFERENCE GENOME.
Mapping means using a bioinformatic algorithm to determine in what position of a previously sequenced REFERENCE GENOME we can locate our reads, without assembling them.

Phylogenetics with MAPPING
1. sequence genome
2. map the reads to the reference
3. extract core SNPs
4. run a phylogenetic software
Perfect for big genomes (less computational power needed), and also useful for finding variations for genomics of alleles... and for transcriptomics.

Genomic epidemiology
Tracing the origin of epidemic outbreaks: whole-genome sequencing and the microevolution of pathogenic agents.

Genomic epidemiology: molecular typing of pathogens
Important in microbiology to classify bacteria at the subspecies level: find virulent clones...
1995-2000: analysis of a single gene (e.g. 16S rDNA), ~1000 bp, ~50 Euro
2000-2012: MLST, ~4000 bp, ~300 Euro
2012-NOW: whole genome sequencing, 1-5 million bp, 100-300 Euro (plasmids included)

Genomic epidemiology: WHOLE GENOME typing of pathogens
Approaches and advantages:
- Thousands of characters to discriminate between different strains
- Comparative genomics can be used to study the origin of phenotypic traits and host/environment adaptation mechanisms
- Not only classification/clustering of microorganisms, but also reconstruction of their evolutionary history thanks to phylogeny
- Small genomic changes can be used to track the spread of a pathogen at different time and space scales
This makes WGS the perfect tool for investigation.

Genomic epidemiology: example in medical epidemiology, Klebsiella pneumoniae
The model: Klebsiella pneumoniae is a nosocomial pathogen, known for its multiple resistances to antibiotics, usually carried by PLASMIDS. The plasmid gene KPC gives resistance to carbapenem antibiotics.
The problem: resistance to carbapenem antibiotics has rapidly spread in Italy in the last few years. How has this happened?

Genomic epidemiology: THE APPROACH
1. 89 K. pneumoniae isolates with various antibiotic resistance profiles were collected from 5 Italian hospitals
2. Whole genome sequencing using the MiSeq machine from Illumina
3. Genome assembly using the software MIRA
4. Comparative genomics and phylogeny

Genomic epidemiology project: GLOBAL phylogeny
All available K. pneumoniae genomes from all over the world (n=230) were added to the database, for a total of 319 genomes.
Multiple genomic alignment, based on several pairwise alignments (using Mauve).
Extraction of single nucleotide polymorphisms (SNPs) with an in-house suite of scripts (Python, Perl, R, shell).
94,812 core SNPs detected. Core SNPs are one-base mutations in genomic regions present in all genomes of the alignment.
SNP phylogeny: Maximum Likelihood, 100 bootstrap replicates (RAxML software).

Branch length in phylogenies
Phylogenetic trees contain the information on the phylogenetic relationships between the analyzed organisms. However, they can also contain the information on how 'distant' the different organisms are. This information can be shown as branch length.

Genomic epidemiology project: GLOBAL phylogeny
203 genomes cluster here! THEY FORM THE CLONAL COMPLEX 258 (CC258) (i.e. all of them have Multilocus Sequence Type 258 or single mutations of it).
97% of them have the gene blaKPC. Only 4% of the others have the gene.

Why are almost all blaKPC-positive genomes in CC258?
Maybe the plasmid cannot be transferred... NO! There is plenty of evidence in the literature of plasmid transfer.
Maybe there is a genomic reason:
A) a genomic element of CC258 acts as a "plasmid magnet"
B) genomic traits make these strains highly virulent and/or highly fit (so that they are massively isolated worldwide)

Genomic epidemiology project: recombination events
Two genomic areas with high SNP density were detected. Are they recombinations?
PHYLOGENY OF PUTATIVE RECOMBINATIONS
Yes, they are!
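The idea of spotting "genomic areas with high SNP density" can be illustrated with a simple sliding-window count. This is only a sketch of the general idea, not the method used in the project: the window size, the step and the "three times the mean" rule are arbitrary assumptions, and the function names are invented for this example.

```python
from bisect import bisect_left


def snp_density(positions, genome_length, window=1000, step=500):
    """Count SNPs in sliding windows; returns a list of (window_start, count).

    Positions are 0-based and each window covers [start, start + window).
    """
    positions = sorted(positions)
    counts = []
    for start in range(0, max(genome_length - window, 0) + 1, step):
        end = start + window
        count = bisect_left(positions, end) - bisect_left(positions, start)
        counts.append((start, count))
    return counts


def dense_windows(positions, genome_length, window=1000, step=500, factor=3.0):
    """Return windows whose SNP count exceeds `factor` times the mean count.

    The cut-off rule is an illustrative assumption, not a statistical test.
    """
    counts = snp_density(positions, genome_length, window, step)
    mean = sum(c for _, c in counts) / len(counts)
    return [(start, c) for start, c in counts if c > factor * mean]


# Toy data: a cluster of SNPs around position 5000 in a 10 kb genome.
snps = [100, 2500, 5010, 5020, 5030, 5040, 5050, 5060, 5070, 5080, 8000]
print(dense_windows(snps, genome_length=10_000, window=1000, step=1000))
```

A dense window is only a candidate: as on the slide, whether it is really a recombination has to be checked, for example by building a separate phylogeny of that region.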
Genomic epidemiology project: recombination events
~5.6 Mb, ~1.3 Mb, ~1.1 Mb, ~300 Kb

Genomic epidemiology project: the Klebsiella hopeful monster
Recombinations as evolutionary leaps: CC258 derived from giant genomic 'fusions', with a high fitness, as indicated by its global spread in hospitals all around the world in less than 30 years... sounds like punctuated equilibrium!
Commentary in mBio, 2014

Genomic epidemiology project: GLOBAL phylogeny
Four Italian clades in CC258 → four different diffusion events in Italy

Genomic epidemiology project: molecular clock
Date the nodes, date the 4 events of entrance into Italy. Method used: Bayesian inference (BEAST). Recombination events were also dated.

Outbreak reconstruction: almost forensic genomics
Outbreak of CC258 K. pneumoniae in a hospital in northern Italy: 7 genomes (that fit in one of the four Italian clades). Using DATES of isolation and SNPs it is possible to reconstruct the spreading route of the pathogen.

Whose fault is it?
Star-like diffusion: the diffusion does not correlate with the arrangement of the beds. The pathogen is likely to be carried around by the hospital staff: a better safety protocol is needed.
In addition, comparative genomics shows that the isolates from the seven patients do not present any specific virulence or resistance factor that makes them different from other strains from the same hospital.
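Both the global phylogeny and the outbreak reconstruction rest on core SNPs. As a rough sketch, the code below takes an already aligned set of genomes, keeps the variable positions that have no gaps and are flanked on both sides by identical nucleotides in all genomes, and then counts pairwise SNP differences between isolates. It is not the in-house pipeline mentioned above: the function names, the `flank` parameter and the toy alignment are assumptions made for illustration.

```python
from itertools import combinations

BASES = set("ACGT")


def core_snp_positions(alignment, flank=1):
    """Return 0-based alignment columns that qualify as core SNPs.

    A column qualifies if every genome has a base there (no gaps or Ns),
    at least two different bases occur, and the `flank` columns on each
    side are identical in all genomes.
    """
    seqs = list(alignment.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")

    def column(i):
        return {s[i] for s in seqs}

    def conserved(i):
        col = column(i)
        return len(col) == 1 and col <= BASES

    positions = []
    for i in range(flank, length - flank):
        col = column(i)
        if len(col) < 2 or not col <= BASES:
            continue
        neighbours = [j for j in range(i - flank, i + flank + 1) if j != i]
        if all(conserved(j) for j in neighbours):
            positions.append(i)
    return positions


def snp_distances(alignment, positions):
    """Count differing core SNP positions for every pair of genomes."""
    distances = {}
    for (a, seq_a), (b, seq_b) in combinations(alignment.items(), 2):
        distances[(a, b)] = sum(seq_a[i] != seq_b[i] for i in positions)
    return distances


# Toy alignment with hypothetical isolate names.
toy = {
    "isolate_A": "ACGTACGTAC",
    "isolate_B": "ACGAACGTAC",
    "isolate_C": "ACGTAC-CAC",
}
core = core_snp_positions(toy)
print(core, snp_distances(toy, core))
```

In the toy alignment the variable column next to the gap is discarded, which is the point of the core SNP filter: it trades some information for positions whose alignment can be trusted. The resulting distances, combined with isolation dates, are the kind of input an outbreak reconstruction starts from.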