Genomics
Irena Artamonova
Second European School of Bioinformatics
Nijmegen, January 22, 2005
Complete genomes
[Figure: bar chart of the number of complete genomes by year, 1995-2002; vertical axis from 0 to 90]
Brief calculation
Approximately 233 complete genomes with
about 3000 genes in each on average.
Almost all genes are new and unstudied
In a lab: investigation of function of one gene
requires one postdoc-year at least.
Hurrah! There is already enough work for all molecular
biologists for thousands of years!
We have a new “complete genome”.
What can we do with it now (in silico)?
(outline of the lecture)
• Gene recognition
• Prediction of regulation of gene expression
• Functional annotation of proteins
• Metabolic reconstruction
• Study of genome evolution
Main differences:
Prokaryotes and Eukaryotes
Gene recognition I. Prokaryotes
Size of a prokaryotic genome:
Pathogenic bacteria: from < 1 Mb and 600 genes
Free-living bacteria: up to 6-9 Mb, 9000 genes
E.g., Escherichia coli: 4.6 Mb, 4400 genes
• Projection of known genes
• Genome comparisons
• Finding long ORFs
• Using DNA statistics
• Identification of gene starts
Mapping “known” genes
BLASTx: //www.ncbi.nlm.nih.gov/BLAST/
A lot of information when a close genome
is well-studied. But it happens rarely.
Problems: choice of thresholds, fine
mapping of start positions in other cases.
No perfect solutions.
Using long ORFs
–What minimal length is functional?
–Which Met is the start?
ORFs in a fragment of the K. pneumoniae genome
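The long-ORF idea can be sketched in a few lines. This is a minimal illustration, not the method used for the figure: the function name `find_orfs`, the `min_len` threshold and the choice of the most upstream ATG as the start are all assumptions made for the example, and only the forward strand is scanned.

```python
def find_orfs(seq, min_len=300):
    """Return (start, end, frame) for ATG...stop ORFs on the forward strand.

    Assumptions for this sketch: the most upstream in-frame ATG is taken
    as the start, and min_len (in nucleotides, stop codon included) is an
    arbitrary threshold that would have to be tuned per genome.
    """
    stops = {"TAA", "TAG", "TGA"}
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first candidate Met
            elif start is not None and codon in stops:
                if i + 3 - start >= min_len:   # keep only long ORFs
                    orfs.append((start, i + 3, frame))
                start = None                   # look for the next ORF
    return orfs


# Toy sequence: one short ORF in frame 2
example = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
orfs = find_orfs(example, min_len=9)
print(orfs)
```

Both open questions from the slide show up as parameters here: `min_len` encodes the minimal functional length, and the start rule could be swapped for a ribosome-binding-site check.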
Use of DNA statistics in gene recognition
Frequencies of codons differ from frequencies of non-coding triplets:
• frequencies of amino acids (and their codons);
• frequencies of dipeptides;
• frequencies of synonymous codons (genome-specific, correlate with tRNA
concentration).
Coding potential
A function measuring whether the genomic fragment is
coding or non-coding based on its DNA statistics.
We can calculate coding potential for ORFs or for sliding
window
“Sliding window” technique:
•Scan the DNA sequence with sliding window of fixed
size
•Calculate coding potential for each window position and
plot it above the sequence (horizontal axis)
• Choose the window size so as to minimize random
noise
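The sliding-window steps above can be sketched as follows. The slides do not define the coding-potential function, so this example assumes one simple choice: the mean log-odds of codon frequencies in known coding versus non-coding sequence. All names (`codon_log_odds`, `coding_potential`, `sliding_window`) and the toy training sequences are hypothetical.

```python
import math
from collections import Counter
from itertools import product

CODONS = ["".join(p) for p in product("ACGT", repeat=3)]


def codon_log_odds(coding_seqs, noncoding_seqs, pseudo=1.0):
    """Log-odds of each triplet: frequency in coding vs non-coding training data."""
    def freqs(seqs):
        counts = Counter()
        for s in seqs:
            for i in range(0, len(s) - 2, 3):
                counts[s[i:i + 3]] += 1
        total = sum(counts[c] for c in CODONS) + pseudo * len(CODONS)
        return {c: (counts[c] + pseudo) / total for c in CODONS}

    fc, fn = freqs(coding_seqs), freqs(noncoding_seqs)
    return {c: math.log(fc[c] / fn[c]) for c in CODONS}


def coding_potential(fragment, log_odds):
    """Mean log-odds over the in-frame triplets of the fragment."""
    codons = [fragment[i:i + 3] for i in range(0, len(fragment) - 2, 3)]
    scores = [log_odds.get(c, 0.0) for c in codons]  # unknown triplets score 0
    return sum(scores) / len(scores) if scores else 0.0


def sliding_window(seq, log_odds, window=96, step=3):
    """(position, coding potential) for each window position along seq."""
    return [(i, coding_potential(seq[i:i + window], log_odds))
            for i in range(0, len(seq) - window + 1, step)]


# Toy training data (invented for illustration only)
lo = codon_log_odds(["ATGGCTGCTGCTGCT" * 4], ["ATATATATATATATA" * 4])
profile = sliding_window("GCT" * 4, lo, window=9, step=3)
```

A smaller `window` follows gene boundaries more closely but gives a noisier profile, which is the trade-off behind the window-size choice.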
Selection of window size
for sliding window
E. coli: 96nt window
48nt window
Exact mapping of gene
start positions
• Prokaryotes: starting methionine is
preceded by a ribosome-binding site
(so-called Shine-Dalgarno box, any part
of GGAGGA)
• Extension of the nucleotide alignment
with the orthologous region from a related
genome: mutation patterns in the coding
region differ from those in the
intergenic region
rbsD in enterobacteria
Sty  AGGGTTACACTGCGGC-CAGCGAAACGTTTCGCTAGTGGAGCAGAAAAATGAAGAAAGGC
Sen  AGGGTTACACTGCGGC-CAGCGAAACGTTTCGCTAGTGGAGCAGAAAAATGAAGAAAGGC
Stm  GGGGTTACACTGCGGC-CAGCGAAACGTTTCGCTAGTGGAGCAGAAAAATGAAGAAAGGC
Eco  AGGATTAAACTGTGGGTCAGCGAAACGTTTCGCTGATGGAGAA-AAAAATGAAAAAAGGC
Ype  TTTTCTAAACTCCTTGTTAGCGAAACGTTTCGCTCTTGGAGTA-GATCATGAAAAAAGGT
** ***
**************** ***** * * ***** *****
Sty  ACCGTACTCAACTCTGAAATCTCGTCGGTCATTTCCCGTCTGGGGCATACTGATACTCTG
Sen  ACCGTACTCAACTCTGAAATCTCGTCGGTCATTTCCCGTCTGGGGCATACTGATACTCTG
Stm  ACCGTACTCAACTCTGAAATCTCGTCGGTCATTTCCCGTCTGGGGCATACTGATACTCTG
Eco  ACCGTTCTTAATTCTGATATTTCATCGGTGATCTCCCGTCTGGGACATACCGATACGCTG
Ype  GTATTACTGAACGCTGATATTTCCGCGGTTATCTCCCGTCTGGGCCATACCGATCAGATT
* ** ** **** ** ** **** ** *********** ***** ***
*
Pattern of nucleotide changes in
protein-coding regions
pdxB in enterobacteria
Sty  TCGCTCG--CAGCGGAAAGAGGATTACGCCCTTCGCCTGGAGGCTGTGCAGGGGC---GCCGGAGATGGGATGCATAATT
Stm  TCGCTCG--CAGCGGAAAGAGGATTACGCCCTTCGCCTGGAGGCTGTGCAGGGGC---GCCGGAGATGGGATGCATAATT
Sen  TCGCTCG--CAGCGGAAAGAGGATTACGCCCTTCGCCTGGAGGCTGTGCAGGGGC---GCCGGAGATGGGATGCATAATT
Eco  TTGCCCG--TGCCAGACGGCAGATTATCTCCCTGACCTGGTGGTTGCCCAGGAGGAGGGCCGGAAATAGGTTGTATCATT
Kpn  ----CGG--TGGCGCAGTGCCTGATGGG-CCTCGCCCTGGAGGACGGTCTGGCAT---ATCAGCAAGGGGGTGCGTCATG
Ype  TTGTTAGAACAGGGGAAAACGGTAAACAGTGTGGCATTAGATGTCGGTTATAGCT-----CCGCCTCTGCTTTTATCGCC
*
*
* * * *
* *
* *
*
Sty  AATTATCCTTTAAC----------CATAAATCTGAGCAATA-TATGCTTGGCGGCCAGATTATGGC--ACACTTGTCCGG
Stm  AATTATCCTTTAAC----------CATAAATCTGAGCAATA-TATGCCTGGCGGCCAGATTATGGC--ACACTTGTCCGG
Sen  AATTATCCTTTAAC----------CATAAATCTGAGCAATA-TATGCCTGGCGGCCAGATTATGGC--ACACTTGTCCGG
Eco  ACGTATCCTTATAC----------CTGAAATCTTCGCAAG--TATGCCTGGCCGCGAGATTATGGC--ACACTTGTCCGG
Kpn  ATTCATCCTTTCGATATCGCGGTGCTGGAACCAGGTGATGAGTATGCCTGGCGGCCAGATTATGGC--ACACTTCCCCAG
Ype  ATGTTTCAGCAAATAT--------CGGGTACCA-CGCCTGAGCGTTTCCGGCGGGGCAATAGTGGCTTATACTAAGCCCC
*
**
*
* *
*
*** *
** **** * ***
**
Sty  TTAACTCTCGTT-CTCAAACAG------GTACGACAGTC--GTGAAAATTCTCGTTGATGAAAATATGCCTTACGCCCGC
Stm  TTAACTCTCGTT-CTCAAACAG------GTACGACAGTC--GTGAAAATTCTCGTTGATGAAAATATGCCTTACGCCCGC
Sen  TTAACTCTCGTT-CTCAAACAG------GTACGACAGTC--GTGAAAATTCTCGTTGATGAAAATATGCCTTACGCCCGC
Eco  TTAACTCTCGT--CTCATACAG------GTAACACAAAC--GTGAAAATCCTTGTTGATGAAAATATGCCTTATGCCCGC
Kpn  TTAACTCTCGTT-CTCAGACAG------GTACTGAACT---GTGAAAATCCTCGTTGATGAAAATATGCCCTATGCCCGT
Ype  CTGTTTTTCATCTGTATGGCAGTTCGCTGTCGGAGAGTAAAGTGAAAATTCTGGTTGATGAAAATATGCCGTACGCTGAG
*
* ** *
*
***
**
*
******** ** ***************** ** **
123123123123123123123123123123123123123
Operons
The majority of genes in prokaryotes are transcribed in
operons.
There are some examples of operons in eukaryotes (C. elegans).
The ideas behind de novo prediction of operon structure are
simple:
• Small distance between adjacent genes
• Co-orientation (lie on the same strand)
• More reliability when these features are conserved in
different species
Additional arguments:
• Similar functional annotations of adjacent genes
• Observed co-expression
• Known average operon length
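The two basic criteria (small intergenic distance and co-orientation) can be sketched as a single pass over sorted gene coordinates. The function name `predict_operons`, the tuple layout and the 100-nucleotide gap threshold are assumptions made for this example; the conservation and functional criteria from the slide are not modelled.

```python
def predict_operons(genes, max_gap=100):
    """Group adjacent genes into candidate operons.

    genes: iterable of (name, start, end, strand) tuples.
    Two neighbouring genes are joined when they lie on the same strand
    and the gap between them is at most max_gap nucleotides
    (max_gap is an illustrative threshold, not a recommended value).
    """
    ordered = sorted(genes, key=lambda g: g[1])
    operons = []
    for gene in ordered:
        name, start, end, strand = gene
        if operons:
            prev = operons[-1][-1]
            same_strand = prev[3] == strand
            close = start - prev[2] <= max_gap
            if same_strand and close:
                operons[-1].append(gene)   # extend the current operon
                continue
        operons.append([gene])             # start a new operon
    return [[g[0] for g in op] for op in operons]


# Hypothetical coordinates
genes = [("a", 100, 500, "+"), ("b", 520, 900, "+"),
         ("c", 1500, 2000, "+"), ("d", 2010, 2500, "-")]
operons = predict_operons(genes)
print(operons)
```

Running the same grouping on several related genomes and keeping only the gene pairs that stay together would add the conservation criterion.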
Training for a completely new genome
For all the methods discussed so far we need some
initial knowledge about genes in the genome
(DNA statistics, minimal ORF length, etc.),
taken from known genes or their very close orthologs
When we have no information at all, we use an
iterative process with initial parameters from
very long ORFs (and/or distant orthologs with
reconstructed structure) as genes, and regions
with no ORFs as intergenic regions
Gene recognition II. Eukaryotes
Specifics:
• Exon-intron structure
• 9-10 coding exons per gene on average (human),
~5 exons (insects)
• Average length of internal exons is 120-130
nucleotides
• Very long introns (>10Kb) are frequent, may be as
long as > 1 Mb
• There are no Shine-Dalgarno sequences (the Kozak
rule can be used instead, but it is much weaker)
=> ORFs and “sliding window” techniques are
inapplicable!
Inapplicability of “sliding window”
technique for eukaryotic genomes
The rat chymotrypsin gene
Nothing (intergenic region)
Search for “known” genes
BlastX is reliable only for large exons (short
introns are treated as long deletions)
What can we use instead? Splicing
signals!
“Spliced alignment” is an alignment of a DNA
fragment with a sequence coding for a
homologous protein. Unlike standard
alignments, it is allowed to contain non-penalized long “deletions” flanked by
splicing signals (that is, introns). BLAT,
ProFrame, TWINSCAN
Spliced alignments of genomic sequences
VISTA (www-gsd.lbl.gov/vista/): human-dog-mouse
HMM (Hidden Markov Model)
Definition: An HMM is a 5-tuple (Q, V, p, A, E), where:
• Q is a finite set of states, |Q| = N
• V is a finite set of observation symbols per state, |V| = M
• p is the vector of initial state probabilities
• A is the matrix of state transition probabilities $a_{st}$; for each s, t ∈ Q the transition probability is $a_{st} \equiv P(x_i = t \mid x_{i-1} = s)$
• E is the emission probability matrix, $e_{sk} \equiv P(v_k \text{ at time } t \mid q_t = s)$
Output: only the emitted symbols are observable, not the
underlying random walk between states, hence “hidden”
Property: emissions and transitions depend on the current state
only and not on the past.
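To make the definition concrete, here is a minimal sketch of decoding the most probable hidden path (the Viterbi algorithm) for a two-state toy model. The state names, all probabilities and the observed string are invented for illustration; real gene finders use far richer state spaces.

```python
import math


def viterbi(obs, Q, p, A, E):
    """Most probable state path for observations obs, in log space.

    Q: list of states; p[s]: initial probability; A[s][t]: transition
    probability s -> t; E[s][v]: probability that state s emits symbol v.
    Assumes all probabilities used are non-zero (log(0) is not handled).
    """
    score = {s: math.log(p[s]) + math.log(E[s][obs[0]]) for s in Q}
    back = []
    for symbol in obs[1:]:
        new_score, pointers = {}, {}
        for t in Q:
            best = max(Q, key=lambda s: score[s] + math.log(A[s][t]))
            new_score[t] = (score[best] + math.log(A[best][t])
                            + math.log(E[t][symbol]))
            pointers[t] = best
        score = new_score
        back.append(pointers)
    state = max(Q, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):      # trace back from the last position
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Toy model: a GC-rich "coding" state and an AT-rich "noncoding" state
Q = ["coding", "noncoding"]
p = {"coding": 0.5, "noncoding": 0.5}
A = {"coding": {"coding": 0.9, "noncoding": 0.1},
     "noncoding": {"coding": 0.1, "noncoding": 0.9}}
E = {"coding": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
     "noncoding": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}
path = viterbi("GCGCGCATATAT", Q, p, A, E)
print(path)
```

The observed symbols are the nucleotides; the labels in `path` are the hidden part that the model recovers.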
HMM-based Gene Finding
• GENSCAN (Burge 1997)
• FGENESH (Solovyev 1997)
• HMMgene (Krogh 1997)
• GENIE (Kulp 1996)
• GENMARK (Borodovsky & McIninch 1993)
• VEIL (Henderson, Salzberg, & Fasman 1997)
GenScan Overview
• Developed by Chris Burge (Burge 1997), in the
research group of Samuel Karlin, Dept of
Mathematics, Stanford Univ.
• Characteristics:
– Designed to predict complete gene structures
• Introns and exons, Promoter sites, Polyadenylation signals
– Incorporates:
• Descriptions of transcriptional, translational and splicing signal
• Length distributions (Explicit State Duration HMMs)
• Compositional features of exons, introns, intergenic, C+G regions
– Larger predictive scope
• Deal with partial and complete genes
• Multiple genes separated by intergenic DNA in a sequence
• Consistent sets of genes on either/both DNA strands
• Based on a general probabilistic model of genomic sequences
composition and gene structure
GenScan Architecture
• It is based on Generalized
HMM (GHMM)
• Model both strands at once
– Other models: Predict on one
strand first, then on the other
strand
– Avoids prediction of overlapping
genes on the two strands (rare)
• Each state may output a string
of symbols (according to
some probability distribution).
• Explicit intron/exon length
modeling
• Special sensors for Cap-site
and TATA-box
• Advanced splice site sensors
Regulation
Less than 5% of the human genome sequence
is protein-coding. What is the role
of the remaining DNA?
It has been suggested that a much larger part of the
human genome encodes the regulatory
machinery
Processes whose regulation we try to predict:
• Transcription (DNA → RNA)
• Splicing (pre-mRNA → mRNA)
• Translation (mRNA → protein)
Two types of analysis of regulation
Prediction of regulatory signals:
• Identification of the signal
• Finding new sites
A signal is an ideal “site” or the set of ALL observed sites.
A site is a representative of the signal in the genome.
Deriving of the signal ab initio I.
Ubiquitous (necessary) signals
• Examples: promoters of transcription,
ribosome-binding signal, acceptor and donor
splicing sites, stop codon, polyadenylation
signal
• We know many examples and some
biological characteristics (and landmarks)
• Often short (4-6 nucleotides)
Re-alignment approaches
• Initial alignment by a biological landmark
– start of transcription for promoters
– start codon for ribosome binding sites
– exon-intron boundary for splicing sites
• Fix the width of the sliding window and the expected
signal size
• Derive the signal (the most frequent word) within a
sliding window
• Repeat for other parameters, select the best set
• Re-align anchoring on the signal
• Identify the signal positions (with non-uniform
nucleotide frequencies)
Gene starts of Bacillus subtilis
dnaN   ACATTATCCGTTAGGAGGATAAAAATG
gyrA   GTGATACTTCAGGGAGGTTTTTTAATG
serS   TCAATAAAAAAAGGAGTGTTTCGCATG
bofA   CAAGCGAAGGAGATGAGAAGATTCATG
csfB   GCTAACTGTACGGAGGTGGAGAAGATG
xpaC   ATAGACACAGGAGTCGATTATCTCATG
metS   ACATTCTGATTAGGAGGTTTCAAGATG
gcaD   AAAAGGGATATTGGAGGCCAATAAATG
spoVC  TATGTGACTAAGGGAGGATTCGCCATG
ftsH   GCTTACTGTGGGAGGAGGTAAGGAATG
pabB   AAAGAAAATAGAGGAATGATACAAATG
rplJ   CAAGAATCTACAGGAGGTGTAACCATG
tufA   AAAGCTCTTAAGGAGGATTTTAGAATG
rpsJ   TGTAGGCGAAAAGGAGGGAAAATAATG
rpoA   CGTTTTGAAGGAGGGTTTTAAGTAATG
rplM   AGATCATTTAGGAGGGGAAATTCAATG
dnaN   ACATTATCCGTTAGGAGGATAAAAATG
gyrA   GTGATACTTCAGGGAGGTTTTTTAATG
serS   TCAATAAAAAAAGGAGTGTTTCGCATG
bofA   CAAGCGAAGGAGATGAGAAGATTCATG
csfB   GCTAACTGTACGGAGGTGGAGAAGATG
xpaC   ATAGACACAGGAGTCGATTATCTCATG
metS   ACATTCTGATTAGGAGGTTTCAAGATG
gcaD   AAAAGGGATATTGGAGGCCAATAAATG
spoVC  TATGTGACTAAGGGAGGATTCGCCATG
ftsH   GCTTACTGTGGGAGGAGGTAAGGAATG
pabB   AAAGAAAATAGAGGAATGATACAAATG
rplJ   CAAGAATCTACAGGAGGTGTAACCATG
tufA   AAAGCTCTTAAGGAGGATTTTAGAATG
rpsJ   TGTAGGCGAAAAGGAGGGAAAATAATG
rpoA   CGTTTTGAAGGAGGGTTTTAAGTAATG
rplM   AGATCATTTAGGAGGGGAAATTCAATG
cons.  aaagtatataagggagggttaataATG
num.   001000000000110110000000111
       760666658967228106888659666
dnaN   ACATTATCCGTTAGGAGGATAAAAATG
gyrA   GTGATACTTCAGGGAGGTTTTTTAATG
serS   TCAATAAAAAAAGGAGTGTTTCGCATG
bofA   CAAGCGAAGGAGATGAGAAGATTCATG
csfB   GCTAACTGTACGGAGGTGGAGAAGATG
xpaC   ATAGACACAGGAGTCGATTATCTCATG
metS   ACATTCTGATTAGGAGGTTTCAAGATG
gcaD   AAAAGGGATATTGGAGGCCAATAAATG
spoVC  TATGTGACTAAGGGAGGATTCGCCATG
ftsH   GCTTACTGTGGGAGGAGGTAAGGAATG
pabB   AAAGAAAATAGAGGAATGATACAAATG
rplJ   CAAGAATCTACAGGAGGTGTAACCATG
tufA   AAAGCTCTTAAGGAGGATTTTAGAATG
rpsJ   TGTAGGCGAAAAGGAGGGAAAATAATG
rpoA   CGTTTTGAAGGAGGGTTTTAAGTAATG
rplM   AGATCATTTAGGAGGGGAAATTCAATG
cons.  tacataaaggaggtttaaaaat
num.   0000000111111000000001
       5755779156663678679890
Positional information content
before and after re-alignment
Deriving of the signal II.
Transcription regulation
• Transcription factors binding sites
• Usually longer (10-20 nts or more)
• Relatively small sample: only several sites
in a genome at all, very few examples are
known
• Often have some symmetry
• Conserved among species
• Experimental studies are not sufficient: they
define only the regulatory region
Why are TFBSs palindromes?
Examples
Prokaryotes
Eukaryotes
Use of symmetry
• DNA-binding factors and their signals
– Co-operative homogeneous
  • Palindromes
  • Repeats
– Co-operative non-homogeneous
  • Cassettes
– Others
– RNA signals: special conserved secondary structure
Regulation of transcription
in eukaryotes
Signal, consensus
codB       CCCACGAAAACGATTGCTTTTT
purE       GCCACGCAACCGTTTTCCTTGC
pyrD       GTTCGGAAAACGTTTGCGTTTT
purT       CACACGCAAACGTTTTCGTTTA
cvpA       CCTACGCAAACGTTTTCTTTTT
purC       GATACGCAAACGTGTGCGTCTG
purM       GTCTCGCAAACGTTTGCTTTCC
purH       GTTGCGCAAACGTTTTCGTTAC
purL       TCTACGCAAACGGTTTCGTCGG
consensus     ACGCAAACGTTTTCGT
Pattern
codB       CCCACGAAAACGATTGCTTTTT
purE       GCCACGCAACCGTTTTCCTTGC
pyrD       GTTCGGAAAACGTTTGCGTTTT
purT       CACACGCAAACGTTTTCGTTTA
cvpA       CCTACGCAAACGTTTTCTTTTT
purC       GATACGCAAACGTGTGCGTCTG
purM       GTCTCGCAAACGTTTGCTTTCC
purH       GTTGCGCAAACGTTTTCGTTAC
purL       TCTACGCAAACGGTTTCGTCGG
consensus     ACGCAAACGTTTTCGT
pattern       aCGmAAACGtTTkCkT
Frequency matrix
j  a  C  G  m  A  A  A  C  G  t  T  T  k  C  k  T
A  6  0  0  2  9  9  8  0  0  1  0  0  0  0  0  0
C  1  8  0  7  0  0  1  9  0  0  0  0  0  9  1  0
G  1  1  9  0  0  0  0  0  9  1  1  0  5  0  5  0
T  1  0  0  0  0  0  0  0  0  7  8  9  4  0  3  9
Information content
$I = \sum_j \sum_b f(b,j) \log \frac{f(b,j)}{p(b)}$
$W(b,j) = \ln(N(b,j)+0.5) - 0.25 \sum_i \ln(N(i,j)+0.5)$
Positional weight matrix (PWM)
Counts N(b,j):
j  a  C  G  m  A  A  A  C  G  t  T  T  k  C  k  T
A  6  0  0  2  9  9  8  0  0  1  0  0  0  0  0  0
C  1  8  0  7  0  0  1  9  0  0  0  0  0  9  1  0
G  1  1  9  0  0  0  0  0  9  1  1  0  5  0  5  0
T  1  0  0  0  0  0  0  0  0  7  8  9  4  0  3  9
Weights W(b,j):
j     a     C     G     m     A     A     A     C     G     t     T     T     k     C     k     T
A   1.1  -1.0  -0.7   0.5   2.2   2.2   1.9  -0.7  -0.7  -0.1  -1.0  -0.7  -1.1  -0.7  -1.4  -0.7
C  -0.4   1.9  -0.7   1.6  -0.7  -0.7   0.1   2.2  -0.7  -1.2  -1.0  -0.7  -1.1   2.2  -0.3  -0.7
G  -0.4   0.1   2.2  -1.1  -0.7  -0.7  -1.0  -0.7   2.2  -0.1  -0.1  -0.7   1.2  -0.7   1.0  -0.7
T  -0.4  -1.0  -0.7  -1.1  -0.7  -0.7  -1.0  -0.7  -0.7   1.5   1.9   2.2   1.0  -0.7   0.6   2.2
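The count matrix, the positional weights and the information content can be computed directly from aligned sites. The sketch below uses the 16-nucleotide cores of the nine sites listed earlier and the weight formula W(b,j) = ln(N(b,j)+0.5) - 0.25 Σ_i ln(N(i,j)+0.5). The function names are hypothetical, and two choices are assumptions: a uniform background p(b) = 0.25 and base-2 logarithms for the information content. Values are recomputed from the sites as listed, so individual entries need not coincide with the slide's table.

```python
import math

# 16-nt cores (positions 4-19) of the nine sites listed above
SITES = [
    "ACGAAAACGATTGCTT",  # codB
    "ACGCAACCGTTTTCCT",  # purE
    "CGGAAAACGTTTGCGT",  # pyrD
    "ACGCAAACGTTTTCGT",  # purT
    "ACGCAAACGTTTTCTT",  # cvpA
    "ACGCAAACGTGTGCGT",  # purC
    "TCGCAAACGTTTGCTT",  # purM
    "GCGCAAACGTTTTCGT",  # purH
    "ACGCAAACGGTTTCGT",  # purL
]


def count_matrix(sites):
    """N(b,j): how many sites carry nucleotide b at position j."""
    return [{b: sum(s[j] == b for s in sites) for b in "ACGT"}
            for j in range(len(sites[0]))]


def weight_matrix(counts):
    """W(b,j) = ln(N(b,j)+0.5) - 0.25 * sum_i ln(N(i,j)+0.5)."""
    weights = []
    for col in counts:
        mean_log = 0.25 * sum(math.log(col[i] + 0.5) for i in "ACGT")
        weights.append({b: math.log(col[b] + 0.5) - mean_log for b in "ACGT"})
    return weights


def information_content(counts, background=0.25):
    """Sum over positions and nucleotides of f * log2(f / p); 0*log 0 = 0."""
    total = 0.0
    for col in counts:
        n = sum(col.values())
        for b in "ACGT":
            f = col[b] / n
            if f > 0:
                total += f * math.log2(f / background)
    return total


def site_score(site, weights):
    """Weight of a candidate site: sum of W(b,j) over its positions."""
    return sum(weights[j][b] for j, b in enumerate(site))


counts = count_matrix(SITES)
weights = weight_matrix(counts)
print(round(information_content(counts), 2))
print(round(site_score("ACGCAAACGTTTTCGT", weights), 1))
```

Scanning a genome then amounts to calling `site_score` on every 16-nucleotide window and keeping the windows above a chosen threshold.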
Sequence logo
Greedy algorithms (MEME)
Find a signal among all k-words (assuming that we
know the signal length).
For all k-words it’s too time-consuming (k~16). So
initially we consider only k-words that were
present in the fragments.
For each k-word construct a matrix of “sites”:
alignment of best “copies” of the k-word from
every sequence fragment.
Select the best k-word. What is the measure for
comparison of matrices? Information content!
Greedy algorithms. Cont’d
• Select the k-word with maximal
information content
Problem. We considered only k-words from
our sequences => may select not the
signal (the consensus word), but only its
best representative in our sample
Solution. For each k-word from the sample
construct PWM and reconstruct the
frequency matrix based on it. Repeat until
stabilization of the matrix. Use the
consensus of this matrix.
Limitation of greedy algorithms
• Started from k-words in our sequences
and increase the information content at
each step => find a local (not global)
maximum of the functional.
• We need an alternative algorithm that
will not be “greedy”!
Gibbs sampler
Let A be a signal (a set of sites), and I(A)
be its information content.
At each step a new site is selected in one
sequence with probability
$P \propto \exp[I(A_{new})]$
For each candidate site the total time of
occupation is computed.
(Note that the signal changes all the time)
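A minimal sketch of the sampling scheme described above, under several assumptions: the information content uses natural logarithms, a uniform background and a pseudocount of 0.5; the sequence to update is picked at random at each step; and the reported site in each sequence is the position with the longest occupation time. The function names, the number of steps and the toy sequences are hypothetical.

```python
import math
import random


def information_content(sites, pseudo=0.5):
    """I(A) for a set of equal-length sites (uniform background assumed)."""
    n = len(sites)
    total = 0.0
    for j in range(len(sites[0])):
        for b in "ACGT":
            f = (sum(s[j] == b for s in sites) + pseudo) / (n + 4 * pseudo)
            total += f * math.log(f / 0.25)
    return total


def gibbs_sampler(seqs, k, steps=200, seed=0):
    """Return, per sequence, the start position occupied most often."""
    rng = random.Random(seed)
    pos = [rng.randrange(len(s) - k + 1) for s in seqs]
    occupancy = [[0] * (len(s) - k + 1) for s in seqs]
    for _ in range(steps):
        i = rng.randrange(len(seqs))              # sequence to resample
        others = [s[p:p + k]
                  for idx, (s, p) in enumerate(zip(seqs, pos)) if idx != i]
        candidates = list(range(len(seqs[i]) - k + 1))
        # P(candidate) is proportional to exp(I(A_new))
        weights = [math.exp(information_content(others + [seqs[i][c:c + k]]))
                   for c in candidates]
        pos[i] = rng.choices(candidates, weights=weights)[0]
        for idx, p in enumerate(pos):             # accumulate occupation time
            occupancy[idx][p] += 1
    return [max(range(len(o)), key=o.__getitem__) for o in occupancy]


# Toy fragments (invented); each contains a GGAGG-like word
seqs = ["TTACGGAGGTATT", "CAGGAGGTTTACA", "ATTTAGGAGGCAT", "GGAGGATTTCATA"]
sites = gibbs_sampler(seqs, k=5)
print(sites)
```

Because a worse site can still be chosen with non-zero probability, the sampler can leave a local maximum, unlike the greedy procedure.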
Recognition of signals I.
Ubiquitous signals
• Consensus
• Pattern (consensus with degenerate positions)
• Positional weight matrix (PWM, or profile)
Weight of the site:
$W(b,j) = \ln(N(b,j)+0.5) - 0.25 \sum_i \ln(N(i,j)+0.5)$
• Logical rules
• Neural networks
Neural networks: architecture
• 4k input neurons (sensors), each
responsible for observing a particular
nucleotide at particular position
OR 2k neurons (one discriminates
between purines and pyrimidines, the
other, between A/T and G/C)
• One or more layers of hidden neurons
• One output neuron
Neural networks: architecture. II
• Each neuron is connected to all neurons of the
next layer
• Each connection is ascribed a numerical weight
A neuron
• Sums the inputs at incoming connections
• Compares the total with the threshold (or
transforms it according to a fixed function)
• If the threshold is passed, excites the
outgoing connections (or, respectively, sends the
modified value)
Training of the neural network
• Sites and non-sites from the training sample
are presented one by one.
• The output neuron produces the prediction.
• The connection weights increase if the
prediction is correct and decrease if it’s
incorrect.
Networks differ by architecture, particulars of
the signal processing, the training schedule
Recognition of signals II.
Regulation of transcription
• Neural networks don’t work: they need
training, and there are too few examples
• PWM – ok, but too many false positive
predictions => we need rules to select the
true sites among predicted.
• Many genomes are available =>
comparative approach:
– Consistency filtering
– Phylogenetic footprinting
– Phylogenetic shadowing
Definition of orthologs
• Orthologous genes:
– the result of speciation
– the “same” role in the cell
• Paralogous genes:
– the result of duplication
– keep common biochemical function
[Diagram: tree with a duplication and a speciation event; genes A1, B1 in Genome 1 and A2, B2 in Genome 2]
Example: gluconate and idonate kinases
Consistency filtering
Basic assumption. Regulons (sets of co-regulated
genes) are conserved =>
• True sites occur upstream of orthologous genes
• False sites are scattered at random
We need to check that transcription factors are true
orthologs by themselves (BBH, COGs are not
sufficient; conservation of the DNA-binding
domain, conservation of the core pathway), have
exactly the same specificity (similar binding sites)
and then compare genes (and whole operons) after
the predicted sites
The basic procedure
[Diagram: set of known sites; profile; Genome 1, Genome 2, ..., Genome N]
Accounting for the operon structure
[Diagram: operon structures compared between the «old» and the «new» genome; labels: A, BC, D, XD, EF, E, F, X]
Tryptophan operons
Closely related genomes:
Phylogenetic footprinting
Regulatory sites are more conserved than non-coding regions
in general and are often seen as conserved islands in
alignments of gene upstream regions.
Low conservation
yjcD
ST AAA-GCATAAAAAGCGGCAAAGTTCAGTTGAAAAAGCGTTGATGATCGCTGGATAATCGTTTGCTTTTTTTTG---CCAC
EC AAA-GAGAAAAAAGCAGCAAACTTCGGTTGAAAAAGCCGCTATGATCGCCGGATAATCGTTTGCTTTTTTTA----CCAC
YP AAATGTATTAAATGTCGCATTCGGGTGTTGATTAGTCACCACTGATGGCTAGATAATCGTTTGCCTTAAATGACATCTGC
*** *
*** * ***
***** * *
**** ** ************* **
*
* *
ST CC--------GTTTTGT--------ATACGTG----GAGCTAAACGTTTGCTTTTTTGCGGCGCCCCG-G-TTGTCGTAA
EC CC--------GTTTTGT--------ATGCGCG----GAGCTAAACGTTTGCTTTTTTGCGACGCAGCA-AATTGTCGCAA
YP CCTAAACTTCGATTTTTTTTCAGTCATGCGTTCTCCCAGCTAATCGTTTGCTATTTTTCCCCGCTCTATGAGTCAGGGAG
**
* *** *
** **
****** ******** **** * ***
*
* *
ST ATGTAGC----------ACAAGGA-GATAACGTTGCGCTGTTAGTGGATTACCTCCCACGTATACCGACGAATAATAAAT
EC ACCTGGA----------GCAGGAA-GATAACGTTTCGCTGGCAGGGGATTGTCCGCCACGCATCTTGACGAAAATTAAAC
YP AGTTAGTGAGTTCATCGACAGGAACGGAAACGATTACGTAGAGAAGGGCGCTTGGCTTGGCATGCTATTTTAAAATGA-C
* * *
** * * * **** *
*
**
*
* **
* * * *
ST TCTCAGGGGATGTTTTCT-ATGTCT------ACGCCTTCAGCGCGTACCGGCGGTTCACTCGACGCCTGGTTTAAAATTT
EC TCTCAGGGGATGTTTTCTTATGTCT------ACGCCATCAGCGCGTACCGGCGGTTCACTCGACGCCTGGTTTAAAATTT
YP ACACAGGGGACATCACC--ATGTCTAGCAGCAACCCTCAAGCACAGCCAAAGGGCACGCTTGATGCATTCTTTAAGCTTA
* ******* *
* ******
* **
*** *
*
** * ** ** ** * ***** **
High conservation
purL
ST AGCGGCATTTTGCGTAACAATGCGCCAGTTGGCAACTT-ATT-CGCAACGATAGCCGCACC--GTATGACAAGAAAAAGC
EC AGCGGCATTTTGCGTAAACCTGCGCCAGATGGCAACTT-ATT-ACAGCCATTGGCGGCACG--CGTTGCTAATTCACGAT
YP AGTGGCATTTTGCGCAACAAAACGCCAGTGTGCAACTTTATTGCGAGCTATTTGCTGAGTCTGCGTTACACACACATAGC
** *********** **
******
******* ***
* ** *
*
*
*
ST GG-TGATT---------TTATTTCT-------ACGCAAACGGTTTCGTCGGCGCGTCAGATTCTTTATAATGACGGCCGT
EC GG-TGATT---------TTATTTCC-------ACGCAAACGGTTTCGTCAGCGCATCAGATTCTTTATAATGACGCCCGT
YP GGCTGTTTCTGACTGAATTATTAATAATAGATACGCAAACGGTTTCGTCGGCGGCTCAGATTCACTATAATGGCGCGCGT
** ** **
*****
***************** *** ******** ******* ** ***
ST TTCCCCCC-------------------TTGCGCACACCAAA--------------GCTTAGAAGACGAGAGA--CTTA--
EC TTCCCCCCC------------------TTGGGTACACCGAAA-------------GCTTAGAAGACGAGAGA--CTTA--
YP TTTGCCCTGTTGTTGCGCCAATGAATGTTGCGCCCAATGAAGTGCTGTTCCAGCCGCTTCGAAGACGAGAGAAACTTAGA
** ***
*** * **
**
**** ************ ****
ST TGATGGAAATTCTGCGTGGTTCGCCTGCACTGTCTGCATTCCGTATCAATAAACTGCTGGCGCGCTTTCAGGCTGCCAAC
EC TGATGGAAATTCTGCGTGGTTCGCCTGCACTGTCGGCATTCCGAATCAACAAACTGCTGGCACGTTTTCAGGCTGCCAGG
YP TTATGGAAATACTGCGTGGTTCACCCGCTTTGTCGGCTTTTCGTATCACCAAACTGTTGTCCCGTTGCCAGGATGCTCAC
* ******** *********** ** ** **** ** ** ** **** ****** ** * ** * **** ***
Another variation.
Phylogenetic shadowing
Idea. Instead of distant orthologs use very
close orthologs, but from multiple (very
close) species. True sites would look like
islands of strongly conserved columns on
multiple alignment.
Need to sequence orthologous upstream
regions from a series of close genomes
(e.g., from many different primates) and
analyze their multiple alignment
RNA regulation. Riboswitches
mRNA has two alternative conformations
of its leader region: one of them blocks
the expression.
Two main cases (prokaryotes): a
terminator interrupts transcription, or a
special structure blocks the ribosome-binding site.
Eukaryotes: a splicing site is blocked
Riboswitches are RNA signals stabilized
by a small molecule
Example of the secondary
structure of riboswitch
Capitals: invariant (absolutely conserved) positions.
Lower case letters: strongly conserved positions.
Obligatory base pairs are set in bold.
Degenerate positions: R = A or G; Y = C or U; K = G or
U; B= not A; V = not U. N: any nucleotide. X: any
nucleotide
Importance of prediction of RNA
regulation as bioinformatics problem
• Phenomenon was discovered by
means of bioinformatics
• RNA signal is strongly conserved (on
the sequence level, not only as the
secondary structure) => well-predictable
(no “false positive” predictions)
• A considerable share of genes is regulated in
this way (~5% of all genes for some
species)
Assignment of function based
on homology
We want to characterize a new gene. What is
the function of the product?
The first step: BlastP.
The best case: we obtain a hit with known
function
Have we obtained functional information about our
gene?
Similarity ≠ homology: the E-value is a measure of the
statistical significance (non-randomness) of
similarity.
Orthologs or paralogs?
The best proof is a phylogenetic tree, but building it is
too time-consuming.
We use BBH - Bidirectional Best Hit.
COGs – Clusters of orthologous genes
(//www.ncbi.nlm.nih.gov/COGs/new)
(prokaryotes) or KOGs (eukaryotes)
Search for orthologs (fast and dirty)
Genome 1
Genome 2
A
A'
B
B'
B"
symmetrical best hit
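The bidirectional (symmetrical) best hit criterion can be sketched as follows. The input format is an assumption: each dictionary maps a query gene to the similarity scores of its hits in the other genome (for instance, BLAST bit scores). The gene names mirror the diagram and the scores are invented.

```python
def best_hit(hits):
    """Best-scoring target for every query that has at least one hit."""
    return {query: max(targets, key=targets.get)
            for query, targets in hits.items() if targets}


def bidirectional_best_hits(hits_1_to_2, hits_2_to_1):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    best12 = best_hit(hits_1_to_2)
    best21 = best_hit(hits_2_to_1)
    return sorted((a, b) for a, b in best12.items() if best21.get(b) == a)


# Hypothetical scores: B has two homologs in genome 2, B' and B"
hits_1_to_2 = {"A": {"A'": 500}, "B": {"B'": 400, 'B"': 300}}
hits_2_to_1 = {"A'": {"A": 500}, "B'": {"B": 400}, 'B"': {"B": 300}}
pairs = bidirectional_best_hits(hits_1_to_2, hits_2_to_1)
print(pairs)
```

Here B" finds B as its best hit, but B prefers B', so only the pair (B, B') is reported as putative orthologs.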
Assignment of a new gene to
specific functional system. I
• Positional clustering
Operon: co-transcription of several genes
(usually for prokaryotes, rarely for
eukaryotes - Caenorhabditis elegans).
Genes are transcribed together and so,
exactly under the same conditions => they
are dependent functionally
Assignment of a new gene to
specific functional system. II
• Genes are not in the same operon, but in
the same locus: horizontal transfer
• Divergon: a regulatory signal influences genes on
the direct and the complementary strands
(usually with opposite effects)
regulatory site(s)
gene (operon) on (+) strand
gene (operon) on (-) strand
Measure of positional
closeness
Let’s use a measure of positional
neighborhood: the fraction of divergent
genomes in which our genes are closely
located
Servers that predict functional dependence:
ERGO (//www.cordis.lu/ergo/ ),
STRING (//string.embl.de/, may be
described at the proteomics day):
implementation and visualization of ALL
the techniques related to this area
Eukaryotic case: domain shuffling
Compression of biochemical functions into single molecules
Prokaryotes:
all enzymatic activities carried out by separate proteins
Fungi:
FAS1 gene encodes activities 3 and 4
FAS2 gene encodes activities 1,2 and 5-7
Animals:
All activities encoded by fatty-acid synthase
Genomic structure of fatty-acid
synthase from rat
Protein domains
InterPro:
www.ebi.ac.uk/interpro/
Pfam:
http://www.sanger.ac.uk/Software/Pfam/
Co-regulation
Genes that are distant in the genome, but
are regulated similarly.
Very similar to the case of operons
But it’s hard to work with computationally.
A lot of manual analysis is necessary.
Co-expression
• If the expression of two genes changes
consistently in response to changing
conditions or in time => they are
functionally related
Microarray data analysis: a special area
of bioinformatics (Transcriptomics
session)
Protein-protein interactions
• Evidence of physical interaction is a
direct proof of the functionality in one
cellular system (together)
Will be discussed in detail at the
Proteomics session
Phylogenetic profiling
Usually functional system is present or absent in a
genome as a whole (or it’s true for a separate
subsystem) =>
If we have many distant complete genomes, we can
compare patterns of occurrence (phylogenetic
profile) for individual genes.
This is rather weak evidence, but useful in
combination with other techniques.
The converse situation also is interesting: genes
with complementary phylogenetic profiles may
have identical function (non-orthologous
displacement: paralogs, specificity changes or
really different structure).
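Comparing phylogenetic profiles reduces to comparing presence/absence vectors. In this minimal sketch a profile is a tuple of 0/1 flags, one per genome; the gene names, the five-genome profiles and both function names are invented for illustration.

```python
def profile_agreement(p, q):
    """Fraction of genomes in which two genes are both present or both absent."""
    return sum(a == b for a, b in zip(p, q)) / len(p)


def complementary(p, q):
    """True if exactly one of the two genes occurs in every genome."""
    return all(a != b for a, b in zip(p, q))


# Hypothetical profiles over five genomes (1 = gene present)
profiles = {
    "geneA": (1, 1, 0, 1, 0),
    "geneB": (1, 1, 0, 1, 0),   # same pattern as geneA
    "geneC": (0, 0, 1, 0, 1),   # complementary to geneA
}
same = profile_agreement(profiles["geneA"], profiles["geneB"])
swap = complementary(profiles["geneA"], profiles["geneC"])
print(same, swap)
```

High agreement suggests membership in the same functional system; a complementary pair is a candidate for non-orthologous displacement. With few genomes both signals are weak, which is why they are combined with other evidence.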
Combining of methods
Each individual type of evidence is rather weak
=> we need to combine methods in every
case.
BlastP => general biochemical function
Positional clustering and/or domain shuffling
and/or phylogenetic profiling => assignment to
functional system
Metabolic reconstruction => gaps in this system
Try placing the product of our gene into each gap =>
(if we are lucky) exact biochemical function
and exact position in the metabolic pathway
Archaeal shikimate-kinase
Chorismate biosynthesis pathway (E. coli)
Pectin utilization
E. chrysanthemi
… and transport of oligogalacturonates
E. chrysanthemi
Y. pestis
K. pneumoniae
YpaA: riboflavin transport
• 5 predicted TM segments => potential
transporter
• Regulatory RFN-element => co-regulation with genes of riboflavin
metabolism => transport of riboflavin
or one of its precursors
• S. pyogenes, E. faecalis, Listeria: have
ypaA, no genes of riboflavin
biosynthesis => transport of riboflavin
So, prediction:
YpaA is a riboflavin transporter (Gelfand
et al., 1999)
Verification:
• YpaA imports riboflavin (genetic
analysis, Kreneva et al., 2000)
• YpaA is regulated with riboflavin
(microarray expression analysis, Lee et
al., 2001; direct verification, Winkler et al.,
2002).
Genome evolution. Repeats
• More than 45% of human genome is repetitive
DNA
• A.Smith: ”The best algorithm of gene prediction
is to mask the repeats, and the rest will be
genes!”
• Genome-specific classes of repeats are unique
markers of genome post-speciation evolution
(did humans appear due to special repeats?!)
• Too many repeats => this is a computational task
• Repeats influence gene recognition, similarity search
and other genomic analyses. Mask repeats
first!
RepeatMasker
www.repeatmasker.org/
Duplications in genomes. Example of
a locus with internal duplications
MAGE-A locus, human X chromosome
[Figure: map of the MAGE-A locus with two pairs of internal repeats (repeat I, repeat II). Gene labels: MAGEA9a, LW-1a, FAM11a, LW-1b, GABRE, MAGEA9b, MAGE8, MAGEA5, MAGEA10, GABRA3, GABRQ, MAGEA6, TRAG3a, MAGEA12, MAGEA4, CSAGE, MAGEA2b, TRAG3b, MAGEA3, MAGEA1, MAGEA2a. Distance labels: “2 Mb”, “6 genes”.]
Duplications
• The main problem of duplications:
assembly of newly sequenced genomes
• No universal solution: every group uses
its own algorithm and software
Human genome: the number of
duplications changes from one release
to another. Two initial versions (Int.
consortium, Celera) were significantly
different at the point of duplications
Synteny groups
• Human chromosomes cut into > 100 pieces and
reassembled become a reasonable facsimile of the
mouse chromosome
Rearrangements as a unit of
genome evolution
rearrangement
Rearrangements of alfalfa
and garden pea
Transforming alfalfa
into pea
Whole genome duplication in yeast
Kellis M, Birren BW, Lander ES. (2004) Proof and evolutionary analysis of ancient genome
duplication in the yeast Saccharomyces cerevisiae. Nature. 428:617-24
Thank you!
The BioSapiens project is funded by the European
Commission within its FP6 Programme, under the
thematic area "Life sciences, genomics and
biotechnology for health", contract number LHSG-CT-2003-503265.