Comparative genomics

BIOINFORMATICS TO ANALYZE AND COMPARE GENOMES
We sequenced and assembled a
genome, but this is only a long
stretch of ATCG
What should we do now?
1. find genes
Gene calling
The simplest approach is to look for ORFs (open
reading frames)
An ORF is a stretch of DNA that starts with a start
codon and ends with a stop codon
Our goal is to call (which means to find) the ORFs
in a genome sequence
ORF calling
So we need software that will recognize start
and stop codons
These usually are
start: ATG (methionine)
stop: TGA, TAA, TAG
ORF calling
The software must look for start and stop codons
in all six possible reading frames
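The six-frame search described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming the standard start and stop codons listed on the slide; the function names and the `min_codons` parameter are hypothetical, and real gene callers use much richer models.

```python
# Standard start and stop codons, as listed above (standard genetic code assumed).
START = "ATG"
STOPS = {"TGA", "TAA", "TAG"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement, i.e. the opposite strand read 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=2):
    """List (strand, frame, start, end, orf) for ORFs in all six frames.

    start and end are 0-based positions on the scanned strand;
    min_codons is a hypothetical length filter (stop codon included).
    """
    orfs = []
    for strand, dna in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(dna) - 2, 3):
                codon = dna[i:i + 3]
                if start is None and codon == START:
                    start = i  # open an ORF at the first ATG
                elif start is not None and codon in STOPS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3, dna[start:i + 3]))
                    start = None  # close the ORF and keep scanning
    return orfs


orfs = find_orfs("CCATGAAATAGCC")
```

In this toy sequence the only ORF is ATGAAATAG, found on the forward strand in the third frame.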
Gene calling
Sounds pretty easy...
… however there are some issues
Gene calling issues
1. The genetic code is NOT really universal
So we need to know which variant of the code our
organism follows
2. Eukaryotes have introns
Rules for intron/exon boundaries vary among species, so we
need software that is suited to our organism
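To illustrate the first issue, here is a small hedged sketch showing how the choice of genetic code changes where a gene ends. It uses two NCBI translation tables: table 1 (standard) and table 4, in which TGA encodes tryptophan instead of stop (used, for example, by Mycoplasma). The function name and the toy sequence are made up for the example.

```python
# Stop codons for two NCBI translation tables.
STOP_CODONS = {
    1: {"TAA", "TAG", "TGA"},  # standard code
    4: {"TAA", "TAG"},         # TGA encodes Trp (e.g. Mycoplasma, some mitochondria)
}


def first_stop(seq, table=1):
    """Return the 0-based position of the first in-frame stop codon, or None."""
    stops = STOP_CODONS[table]
    for i in range(0, len(seq) - 2, 3):
        if seq[i:i + 3] in stops:
            return i
    return None


toy_gene = "ATGAAATGACCCTAA"  # codons: ATG AAA TGA CCC TAA
standard_end = first_stop(toy_gene, table=1)
mycoplasma_end = first_stop(toy_gene, table=4)
```

With the standard code the ORF is cut short at the TGA codon; with table 4 it runs on to the final TAA.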
Gene calling
Easy and straightforward, provided that we
use the right software
Which is in general a good rule for bioinformatics
Gene Annotation
The process of assigning a name and a function to each
of our genes
This is done by comparing each gene in our genome
to a database, to find a gene that is similar
enough for us to say that our novel sequence has
the same function
Gene Annotation
... comparing each gene in our genome...
When I want to compare two sequences, or two sets
of sequences, I use the NCBI BLAST algorithm
Gene Annotation - BLAST
BLAST means Basic Local Alignment Search Tool
It can be used online or offline
Offline is better for entire genomes
It is fast and accurate
It is highly customizable
It outputs hits with a score, indicating the strength of the
similarity
Gene Annotation - BLAST
It is highly customizable
Four main algorithms, with varying inputs
Combinations of nucleotides and proteins in the
input and in the database sequences
Gene Annotation - BLAST
Even more customizable offline
We can set a number of parameters such as:
Cost of a gap: how much a gap in the alignment
lowers the score
Minimum % identity between the query and the database hit
Output format: for example a table
The most important parameter is possibly the
E value
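As an illustration of how such parameters are used downstream, the sketch below filters BLAST hits by e-value and % identity. It assumes the hits were saved in the BLAST+ tabular format (`-outfmt 6`) with its default twelve columns; the function name, the thresholds and the two example hits are made up.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def filter_hits(handle, max_evalue=1e-5, min_identity=30.0):
    """Keep only hits at or below max_evalue and at or above min_identity."""
    hits = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        hit["pident"] = float(hit["pident"])
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        if hit["evalue"] <= max_evalue and hit["pident"] >= min_identity:
            hits.append(hit)
    return hits


# Two invented hits: one strong, one that looks like random similarity.
example = io.StringIO(
    "gene1\tprotA\t85.0\t200\t30\t0\t1\t200\t1\t200\t1e-50\t350\n"
    "gene1\tprotB\t25.0\t80\t60\t2\t5\t84\t10\t89\t0.5\t28\n"
)
strong_hits = filter_hits(example)
```

Only the hit against protA survives the filter. The thresholds here are placeholders and should be chosen for the task at hand.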
Gene Annotation - BLAST
E value
The Expect value (E) is a parameter that describes the
number of hits one can "expect" to see by chance when
searching a database of a particular size. It decreases
exponentially as the Score (S) of the match increases.
Essentially, the E value describes the random background
noise.
So very low e-values indicate a very low probability that the
hit is due to random similarity between the two
sequences
High e-values indicate a high probability of a random hit
Gene Annotation - BLAST
E value
So an e-value of 1 is VERY BAD: one hit of that quality is
expected by chance alone
It depends directly on the database size
A bigger database contains more sequences, and thus more
sequences that will be randomly similar to the input
An e-value of 10^-5 (1e-5) is widely considered stringent
HOWEVER the threshold must be set based on the task
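The dependence of the e-value on the score and on the database size can be made concrete with the relation E = m * n * 2^(-S'), where S' is the bit score, m the query length and n the total database length. The sketch below is a simplification: BLAST itself uses corrected "effective" lengths, and the numbers in the example are invented.

```python
def expect_value(bit_score, query_length, database_length):
    """Simplified e-value: E = m * n * 2**(-S'), with S' the bit score."""
    return query_length * database_length * 2.0 ** (-bit_score)


# The same alignment (bit score 50, query of 300 residues) against a
# database of one billion residues and against one twice as large.
e_small_db = expect_value(50, 300, 1_000_000_000)
e_large_db = expect_value(50, 300, 2_000_000_000)

# One extra bit of score halves the e-value.
e_better_hit = expect_value(51, 300, 1_000_000_000)
```

Doubling the database doubles the e-value of the same hit, which is why a threshold that is stringent for a small custom database can be permissive for NR or NT.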
Gene Annotation Databases
...comparing each gene in our genome to a
database...
Which database?
There are multiple possibilities
some are very general, some are species-specific
Gene Annotation Databases
NR = non-redundant NCBI database (proteins)
NT = non-redundant NCBI database (nucleotides)
UCSC Genome Browser → for human genes
COG = Clusters of Orthologous Groups
FlyBase for Drosophila
RAST for bacteria
Gene Annotation Databases
Multiple possibilities
The database should be chosen carefully, based on the organism,
and multiple databases should be compared
when possible
A specific database can be generated for the task,
for example based on NCBI searches
Gene Annotation
...comparing each gene in our genome to a
database, to detect a gene that is similar enough
for us to say that our novel sequence has the same
function
How much is enough?
The case of the creeping fox terrier clone
Stephen Jay Gould
Essay contained in the book 'Bully for Brontosaurus' (1992)
The case of the creeping fox terrier clone
'We may imagine the earliest herds of horses in the lower Eocene as
resembling a lot of fox terriers in size...'
HF Osborn was the first to use this comparison in 1905, and since then most
books have repeated it
Do authors really know the size of a fox terrier? Or are they just copying the
old comparison?
The case of the creeping fox terrier clone
When we use comparisons to annotate our genes, we need to be
careful
How many times has an annotation been copied, starting from a gene
whose function was experimentally determined?
Avoid misannotation when possible
Use multiple databases
Use stringent BLAST parameters
Double-check the important genes (we cannot
check them all, we are working high-throughput)
Genes can be useful for many tasks
A couple of examples
Evaluating the metabolic potential of the newly
sequenced genome
Determining the phylogenetic position of the
organism
From genes to metabolisms
The presence of a gene often indicates an
active enzyme
If all enzymes of a pathway are present, our
organism can very probably complete the
pathway
Specific software can reconstruct metabolic
pathways, or cellular structures
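The reasoning above (a pathway is probably complete if all of its enzymes are present) can be sketched as a simple set comparison. The pathway definitions and gene names below are hypothetical placeholders, not real KEGG entries; dedicated tools use curated pathway databases.

```python
def pathway_completeness(pathways, annotated_genes):
    """For each pathway return (fraction of enzymes found, missing enzymes)."""
    report = {}
    for name, required in pathways.items():
        present = required & annotated_genes
        missing = sorted(required - annotated_genes)
        report[name] = (len(present) / len(required), missing)
    return report


# Hypothetical pathways, each defined by the enzymes it requires.
pathways = {
    "pathway_A": {"enzA", "enzB", "enzC"},
    "pathway_B": {"enzD", "enzE"},
}
# Hypothetical set of enzymes found by annotating our genome.
annotated = {"enzA", "enzB", "enzC", "enzD"}

report = pathway_completeness(pathways, annotated)
```

Here pathway_A is complete, while pathway_B lacks enzE, so the organism may not be able to complete it. An apparently missing enzyme can also be an annotation gap, so such results are a starting point rather than a proof.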
From genes to metabolisms
KEGG – Kyoto Encyclopedia of Genes and Genomes
Phylogenetics
Phylogenetics is the study of the evolutionary
history of organisms
Phylogeny
All organisms have a common ancestor
Similarly to genealogy, phylogeny aims at reconstructing a 'tree'
Reconstruction of EVOLUTION, using differences and common traits
Originally it was based on morphology
Phylogenetics
To study phylogenies we need hereditary characters
that group and separate the units present in our
dataset
We can use morphology, but interpretation and analogy
can provide false evidence or make our results noisy
Advanced techniques of geometric morphometrics can work
Phylogenetics
It is important to use traits that are
Homologous: that derive from a common ancestor
And not
Analogous: that evolved independently, in a
process of convergent evolution
Fins in this dataset are an analogous character (fish
and whales)
Endothermy is a homologous character (mammals)
Phylogeny
ATCTTC-TG
ATCTTCATG
ACCTTCATG
ATCTGCATG
ATCAGCATG
ATCTGCATG
ATCCGCATG
Reconstruction of EVOLUTION, using DNA sequences
DNA is perfect, because it is a digital character that is not
influenced by interpretation
Phylogeny
We do not have information on the ancestors!!
ATCTTC-TG
ACCTTCATG
ATCAGCATG
ATCCGCATG
So we need to infer the evolution of the character
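A first, very rough way to see the signal in the four sequences above is to count the differences between each pair. The sketch below does this for the aligned sequences, skipping positions with a gap. The taxon labels are hypothetical, and counting differences is only a starting point, not a reconstruction of the ancestral states.

```python
from itertools import combinations

# The four aligned sequences from the slide; taxon names are hypothetical.
alignment = {
    "taxon1": "ATCTTC-TG",
    "taxon2": "ACCTTCATG",
    "taxon3": "ATCAGCATG",
    "taxon4": "ATCCGCATG",
}


def count_differences(seq_a, seq_b):
    """Count mismatching positions, ignoring columns where either has a gap."""
    return sum(
        1 for a, b in zip(seq_a, seq_b)
        if a != "-" and b != "-" and a != b
    )


distances = {
    (x, y): count_differences(alignment[x], alignment[y])
    for x, y in combinations(alignment, 2)
}
```

taxon1 and taxon2 differ at one position, as do taxon3 and taxon4, while every pair across the two groups differs at two or three positions. This already suggests two groups, which a phylogenetic method would then test properly.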
Phylogenetics
DNA is perfect, because it is a digital character that
is not influenced by interpretation
With single genes → phylogenetic analyses
With entire genomes → phylogenomic analyses
More characters = more power
Phylogenetics
More characters = more power
I can discriminate ancient events where the noise is
very strong
Phylogenetics
More characters = more power
I can discriminate extremely recent events, where the
variation between the different taxa is extremely low
Phylogenetics
1. sequence and assemble genome
2. extract genes
3. obtain genes of other organisms from a database
4. align them
5. run phylogenetic software
Phylogenomics – obtain genes
Finding homologous genes is not
enough:
The genes that we want are called
orthologous genes
orthologous: genes that derive from a
speciation event
paralogous: genes that derive from a
duplication event
Phylogenomics – obtain genes
The software OrthoMCL is an example of a tool to obtain orthologous
genes
This software
1. compares all the genes of all the organisms in a dataset
(bidirectional BLAST hit)
2. uses a Markov Cluster algorithm to create networks
to determine orthologous genes
Phylogenomics – obtain genes
To obtain orthologous genes → bidirectional BLAST hit
The software only accepts the gene pairs for which each of the two
genes is the best hit in the BLAST search against the other genome
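The bidirectional best hit criterion can be sketched as follows. This is not OrthoMCL's implementation, only an illustration of the pair-selection rule, assuming each BLAST search has been reduced to (query, subject, bit score) tuples. The gene names and scores are invented.

```python
def best_hits(hits):
    """Return the best-scoring subject for each query gene."""
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Keep pairs (a, b) where b is a's best hit and a is b's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Invented hits: genome A searched against genome B, and the reverse.
a_vs_b = [("a1", "b1", 300), ("a1", "b2", 120), ("a2", "b2", 250), ("a3", "b1", 90)]
b_vs_a = [("b1", "a1", 295), ("b2", "a2", 240), ("b2", "a1", 110)]

pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
```

Gene a3 has b1 as its best hit, but b1's best hit is a1, so the pair (a3, b1) is rejected. a3 may be a paralog of a1.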
Align the genes
Orthologous genes are generally very similar, so they can be aligned by
software such as MUSCLE
Phylogeny
Starting from an alignment we can use specific algorithms with the
goal of understanding the evolutionary relationships between our
organisms
A number of such phylogenetic algorithms exist:
Maximum Likelihood methods: try to find the evolutionary tree that is
most likely to explain the variation present in the dataset
Maximum parsimony methods: try to find the tree that explains the
variation with the smallest number of evolutionary changes
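The parsimony criterion can be illustrated with Fitch's algorithm, which counts the minimum number of changes a given tree needs to explain an alignment. The sketch below scores two candidate trees for four short sequences. The taxon names are hypothetical and the gap is treated as an extra character state, which is a simplification. Real software also searches over many trees rather than scoring two by hand.

```python
def fitch(tree, states):
    """Return (possible ancestral states, minimum changes) for one site.

    tree is a leaf name or a pair (left subtree, right subtree);
    states maps each leaf name to its character at this site.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # No shared state: one change is needed on this part of the tree.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_length(tree, alignment):
    """Sum the minimum number of changes over all alignment columns."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(n_sites)
    )


alignment = {
    "t1": "ATCTTC-TG",
    "t2": "ACCTTCATG",
    "t3": "ATCAGCATG",
    "t4": "ATCCGCATG",
}
tree_a = (("t1", "t2"), ("t3", "t4"))
tree_b = (("t1", "t3"), ("t2", "t4"))

length_a = parsimony_length(tree_a, alignment)
length_b = parsimony_length(tree_b, alignment)
```

The first tree needs 5 changes and the second needs 6, so maximum parsimony prefers the tree that groups t1 with t2 and t3 with t4.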
Phylogenetics
1. sequence and assemble genome
2. extract genes
3. obtain genes of other organisms from a database
4. align them
5. run phylogenetic software
When we work with very similar genomes orthologous genes can
contain too little information, as they are too similar → we need
higher resolution
SNPs analysis
We need to work at a 'lower' level, not on genes, but on single
positions
Single Nucleotide Polymorphisms
This approach allows us to detect single variations
- in highly variable genes, which are excluded from the orthology analysis
- in intergenic regions
SNPs ANALYSIS – how to
We sequence and assemble our genomes (contigs)
We align them to a REFERENCE GENOME
REFERENCE GENOME: a closely related genome that we can use
as a blueprint to compare our novel genomes to
Variations between the reference and our novel genomes can be
recorded and used for comparison purposes, such as SNP-based
phylogeny
SNPs ANALYSIS – how to
Alignment of the genomes to a REFERENCE GENOME
This can be done using specific software
MAUVE does the alignment and gives us the SNPs
SNPs ANALYSIS
While the phylogenomic approach can exclude important
information ...
... the SNP approach may include areas with questionable
alignment
To avoid this problem not all variations are used, but just the
CORE SNPs: SNPs that are flanked on both sides by identical
nucleotides in all the genomes of our alignment
This yields a dataset that is precise and informative
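The core SNP definition above can be sketched on a small multiple alignment. The code keeps a variable column only if it contains no gaps and its neighbouring columns are identical in all genomes. The `flank` parameter (how many neighbouring columns to check on each side) and the genome names are assumptions made for this example.

```python
NUCLEOTIDES = set("ACGT")


def core_snps(alignment, flank=1):
    """Return 0-based alignment positions of core SNPs."""
    seqs = list(alignment.values())
    length = len(seqs[0])

    def column(i):
        return {seq[i] for seq in seqs}

    def conserved(i):
        col = column(i)
        return len(col) == 1 and col <= NUCLEOTIDES

    positions = []
    for i in range(flank, length - flank):
        col = column(i)
        if len(col) < 2 or not col <= NUCLEOTIDES:
            continue  # not variable, or contains a gap / ambiguous base
        neighbours = list(range(i - flank, i)) + list(range(i + 1, i + flank + 1))
        if all(conserved(j) for j in neighbours):
            positions.append(i)
    return positions


alignment = {
    "genome1": "ATCTTC-TG",
    "genome2": "ACCTTCATG",
    "genome3": "ATCAGCATG",
    "genome4": "ATCCGCATG",
}
snps = core_snps(alignment)
```

Only position 1 qualifies. The variable columns at positions 3 and 4 sit next to each other, so neither has conserved flanks, and the column with the gap is discarded.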
SNPs ANALYSIS
SNP analysis allows us to detect minimal differences between very
similar genomes
It is the analysis of choice when working in human genomics, and
in general in the genomics of model systems
With the increase of available genomes, it has also become the
method of choice for bacterial genomics of single species, a
field that is called genomic epidemiology
Phylogenetics with SNPs
1. sequence and assemble genome
2. align the contigs to a reference genome
3. extract core SNPs
4. run phylogenetic software
However there is a FASTER alternative: mapping the reads directly to
the REFERENCE GENOME
MAPPING OF THE READS
The assembly of the reads into a genome is not the only way
Assembly from reads = DE NOVO
As in 'try to generate a novel genome de novo, without previous
information'
An alternative that is very useful in specific situations is mapping
the reads to a genome we already know, again called the
REFERENCE GENOME
MAPPING OF THE READS
Mapping means using a bioinformatic algorithm to determine in
what position of a previously sequenced REFERENCE GENOME
we can locate our reads, without assembling them
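The idea of mapping can be sketched with a toy "seed and check" approach: index every short word (k-mer) of the reference, use the k-mers of a read to propose positions, then count mismatches at each proposed position. Real mappers use far more efficient indexes and handle gaps. The sequences, names and parameters below are invented for the example.

```python
from collections import defaultdict


def build_index(reference, k):
    """Map every k-mer of the reference to the positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def map_read(read, reference, index, k, max_mismatches=1):
    """Return sorted (position, mismatches) placements of the read."""
    placements = {}
    for offset in range(len(read) - k + 1):
        for hit in index.get(read[offset:offset + k], []):
            start = hit - offset  # where the whole read would begin
            if start < 0 or start + len(read) > len(reference) or start in placements:
                continue
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                placements[start] = mismatches
    return sorted(placements.items())


reference = "ACGTTGCATCGGATCCTAGA"
index = build_index(reference, k=4)
# The read matches reference positions 5-12 except for one base (a SNP).
placements = map_read("GCATCGTA", reference, index, k=4)
```

The read is placed at position 5 with one mismatch. That mismatch, at reference position 11, is exactly the kind of variation that is later recorded as a SNP.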
Phylogenetics with MAPPING
1. sequence genome
2. map the reads to the reference
3. extract core SNPs
4. run phylogenetic software
Perfect for big genomes (less computational power needed) and also
useful for finding variants when studying alleles...
... and for transcriptomics
Genomic epidemiology
Tracing the origin of epidemic outbreaks:
whole-genome sequencing and the microevolution of pathogenic agents
Genomic epidemiology
Molecular typing of pathogens
Important in microbiology to classify bacteria at the subspecies level: find virulent clones...
Analysis of a single gene (e.g. 16S rDNA): ~1000 bp, ~50 Euro (1995-2000)
MLST: ~4000 bp, ~300 Euro (2000-2012)
Whole genome sequencing: 1-5 million bp, 100-300 Euro, plasmids included (2012-now)
Genomic epidemiology
WHOLE GENOME typing of pathogens
Approaches and advantages
Thousands of characters to discriminate
between different strains
Comparative genomics can be used to study
the origin of phenotypic traits and
host/environment adaptation
mechanisms
Not only classification/clustering of
microorganisms but also reconstruction of
their evolutionary history thanks to phylogeny
Genomic epidemiology
WHOLE GENOME typing of pathogens
Approaches and advantages
Small genomic changes can be used to
track the spread of a pathogen in
different time and space scales
This makes WGS the perfect tool for
investigation
Genomic epidemiology
Example in medical epidemiology
Klebsiella pneumoniae
The model: Klebsiella pneumoniae is a nosocomial pathogen,
known for its multiple resistances to antibiotics, usually
carried by PLASMIDS
The plasmid gene KPC gives resistance to
carbapenem antibiotics
The problem: resistance to carbapenem antibiotics has
rapidly spread in Italy in the last few years.
How has this happened?
Genomic epidemiology
THE APPROACH
1. 89 K. pneumoniae isolates of various antibiotic
resistance profiles were collected from 5 Italian
hospitals
2. Whole genome sequencing using the MiSeq
machine from Illumina
3. Genome assembly using the software MIRA
4. Comparative genomics and phylogeny
Genomic epidemiology project
GLOBAL phylogeny



All available K. pneumoniae genomes from all over the world (n=230)
were added to the database, for a total of 319 genomes
Multiple genomic alignment,
based on several pairwise
alignments (using Mauve)
Extraction of single nucleotide
polymorphisms (SNPs) with an
in-house suite of scripts
(Python, Perl, R, shell)
Genomic epidemiology project
GLOBAL phylogeny
94,812 core SNPs detected
Core SNPs are one-base mutations
in genomic regions present in all
genomes of the alignment
SNP phylogeny
Maximum Likelihood, 100 bootstrap replicates (RAxML software)
Genomic epidemiology project
GLOBAL phylogeny
Branch length in phylogenies
Phylogenetic trees contain the information of the phylogenetic relationship
between the analyzed organisms
However they can also contain the information of how 'distant' the different
organisms are
This information can be shown as
branch length
Genomic epidemiology project
GLOBAL phylogeny
203 genomes cluster here!
THEY FORM THE CLONAL COMPLEX
258 (CC258)
(i.e. all of them have Multilocus Sequence Type
258 or single mutations of it)
97% of them have the gene blaKPC
Only 4% of the other genomes have the gene
Genomic epidemiology project
GLOBAL phylogeny
Why are almost all blaKPC-positive genomes in CC258?
Maybe the plasmid cannot be transferred...
NO! Plenty of evidence in the literature of plasmid transfer
Maybe there is a genomic reason
A) a genomic element of CC258 acts as a “plasmid magnet”
B) genomic traits make these strains highly virulent and/or highly fit
(so that they are massively isolated worldwide)
Genomic epidemiology project
recombination events
Two genomic areas with high
SNP density were detected.
Are they recombination events?
PHYLOGENY OF PUTATIVE
RECOMBINATIONS
Yes, they are!
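Areas of unusually high SNP density can be spotted by counting SNPs in fixed-size windows along the genome. The slides do not say how the two areas were detected, so the sketch below is only a generic illustration. The window size, the "fold above the mean" rule and the SNP positions are all invented.

```python
def snp_counts(positions, genome_length, window):
    """Count SNPs in consecutive, non-overlapping windows."""
    n_windows = (genome_length + window - 1) // window
    counts = [0] * n_windows
    for position in positions:
        counts[position // window] += 1
    return counts


def dense_windows(counts, fold=3.0):
    """Return indexes of windows with more than fold times the mean count."""
    mean = sum(counts) / len(counts)
    return [i for i, count in enumerate(counts) if count > fold * mean]


# Invented SNP positions on a toy genome of 500 bp, windows of 100 bp.
positions = [5, 120, 130, 135, 140, 150, 160, 170, 410]
counts = snp_counts(positions, genome_length=500, window=100)
candidates = dense_windows(counts)
```

The second window holds 7 of the 9 SNPs and is flagged as a candidate. A dense region is only a hint: as on the slide, a phylogeny of the putative recombinant region is needed to confirm it.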
Genomic epidemiology project
recombination events
~5.6 Mb
~1.3 Mb
~1.1 Mb
~300 Kb
Genomic epidemiology project
the Klebsiella hopeful monster
Recombinations as evolutionary leaps: CC258 derived from giant genomic 'fusions', with a
high fitness, as indicated by its global spread in hospitals around the world in less
than 30 years
...sounds like punctuated equilibrium!
Commentary in mBio - 2014
Genomic epidemiology project
GLOBAL phylogeny
Four Italian clades in the
CC258
Four different diffusion
events in Italy
Genomic epidemiology project
molecular clock
Date the nodes, date the 4 events of entrance into Italy
Method used: Bayesian inference (BEAST)
Genomic epidemiology project
molecular clock
Recombination events were also dated
Outbreak reconstruction
almost forensic genomics
Outbreak of CC258 K. pneumoniae
in a hospital in northern Italy
7 genomes (that fit in one of the
four Italian clades)
Using DATES of isolation and SNPs
it is possible to reconstruct the
spreading route of the pathogen
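One simple way to combine dates and SNPs is to link each isolate to the earlier isolate that differs from it by the fewest SNPs. This is only a toy heuristic to show the idea, not the method used in the study, and the patient labels, dates and SNP profiles below are invented.

```python
from datetime import date


def snp_distance(profile_a, profile_b):
    """Count positions at which two SNP profiles differ."""
    return sum(a != b for a, b in zip(profile_a, profile_b))


def putative_sources(isolates):
    """Link each isolate to the closest isolate sampled before it.

    isolates maps a name to (isolation date, SNP profile string).
    The earliest isolate has no candidate source and maps to None.
    """
    links = {}
    for name, (day, profile) in isolates.items():
        earlier = [
            (snp_distance(profile, other_profile), other_day, other)
            for other, (other_day, other_profile) in isolates.items()
            if other_day < day
        ]
        links[name] = min(earlier)[2] if earlier else None
    return links


# Invented example: four patients, one character per core SNP position.
isolates = {
    "P1": (date(2013, 1, 5), "AAAA"),
    "P2": (date(2013, 1, 12), "AAAT"),
    "P3": (date(2013, 1, 20), "AACA"),
    "P4": (date(2013, 2, 1), "ACAA"),
}
links = putative_sources(isolates)
```

In this toy data every later isolate is one SNP away from P1 and two SNPs away from the others, so all links point back to P1, giving a star-like pattern rather than a chain from patient to patient.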
Outbreak reconstruction
almost forensic genomics
Whose fault is it?
Star-like diffusion
The diffusion does not correlate with
the arrangement of the beds
The pathogen is likely to be carried
around by the hospital staff: a better
safety protocol is needed
In addition, comparative genomics shows that the
isolates from the seven patients do not present any
specific virulence or resistance factor that makes
them different from other strains from the same
hospital.