Genomics
Irena Artamonova
Second European School of Bioinformatics, Nijmegen, January 22, 2005

Complete genomes
[Figure: number of complete genomes by year, 1995-2002]

Brief calculation
• Approximately 233 complete genomes, with about 3000 genes each on average.
• Almost all genes are new and unstudied.
• In a lab, investigating the function of one gene requires at least one postdoc-year.
• Hurrah! We have work for all molecular biologists for thousands of years right now!

We have a new “complete genome”. What can we do with it now (in silico)?
(outline of the lecture)
• Gene recognition
• Prediction of regulation of gene expression
• Functional annotation of proteins
• Metabolic reconstruction
• Study of genome evolution
Main differences: prokaryotes and eukaryotes

Gene recognition I. Prokaryotes
Size of a prokaryotic genome:
• Pathogenic bacteria: from < 1 Mb and 600 genes
• Free-living bacteria: up to 6-9 Mb, 9000 genes
• E.g., Escherichia coli: 4.6 Mb, 4400 genes
Approaches:
• Projection of known genes
• Genome comparisons
• Finding long ORFs
• Using DNA statistics
• Identification of gene starts

Mapping “known” genes
BLASTx: //www.ncbi.nlm.nih.gov/BLAST/
• Gives a lot of information when a closely related genome is well studied, but this happens rarely.
• Problems in other cases: choice of thresholds, fine mapping of start positions.
• No perfect solutions.

Using long ORFs
– What minimal length is functional?
– Which Met is the start?
[Figure: ORFs in a fragment of the K. pneumoniae genome]

Use of DNA statistics in gene recognition
Frequencies of codons differ from frequencies of non-coding triplets:
• frequencies of amino acids (and of their codons);
• frequencies of dipeptides;
• frequencies of synonymous codons (genome-specific, correlate with tRNA concentration).

Coding potential
A function that measures, from the DNA statistics of a genomic fragment, whether the fragment is coding or non-coding.
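One simple way to realize such a function, given here only as a sketch, is a log-odds score that compares in-frame triplet frequencies learned from known coding sequences with those learned from non-coding sequence. The toy training sequences, the pseudocount, and the function names are illustrative assumptions and are not taken from the lecture.

```python
import math
from collections import Counter
from itertools import product

CODONS = ["".join(p) for p in product("ACGT", repeat=3)]

def triplet_freqs(seqs, pseudocount=1.0):
    """Relative frequencies of in-frame triplets (frame 0), with a pseudocount."""
    counts = Counter()
    for seq in seqs:
        for i in range(0, len(seq) - 2, 3):
            counts[seq[i:i + 3]] += 1
    total = sum(counts[c] for c in CODONS) + pseudocount * len(CODONS)
    return {c: (counts[c] + pseudocount) / total for c in CODONS}

def coding_potential(fragment, coding, noncoding):
    """Mean log-odds per triplet: positive looks coding, negative looks non-coding."""
    triplets = [fragment[i:i + 3] for i in range(0, len(fragment) - 2, 3)]
    triplets = [t for t in triplets if t in coding]  # skip triplets with N etc.
    if not triplets:
        return 0.0
    return sum(math.log(coding[t] / noncoding[t]) for t in triplets) / len(triplets)

# Hypothetical training data, far too small for real use.
coding = triplet_freqs(["ATGGCGAAACTGGCGAAA", "ATGAAACTGGCGCTGAAA"])
noncoding = triplet_freqs(["TTTTATATATTTAATTAT", "TATTTTAATATATTTTTA"])

print(coding_potential("GCGAAACTG", coding, noncoding))
print(coding_potential("TTTTATATA", coding, noncoding))
```

In practice the two frequency tables would be estimated from many known genes and intergenic regions of the genome under study, since the statistics are genome-specific.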
We can calculate the coding potential for ORFs or for a sliding window.

“Sliding window” technique:
• Scan the DNA sequence with a sliding window of fixed size
• Calculate the coding potential for each window position and plot it above the sequence (horizontal axis)
• Choose the window size so as to minimize random noise

Selection of window size for sliding window
[Figure: coding potential plots for E. coli with a 96 nt window and a 48 nt window]

Exact mapping of gene start positions
• Prokaryotes: the starting methionine is preceded by a ribosome-binding site (the so-called Shine-Dalgarno box, any part of GGAGGA)
• Extension of the nucleotide alignment with the orthologous region from a related genome: mutation patterns in the coding region differ from those in the intergenic region

rbsD in enterobacteria
(on the slide, conserved columns were marked with asterisks under each block)

Sty  AGGGTTACACTGCGGC-CAGCGAAACGTTTCGCTAGTGGAGCAGAAAAATGAAGAAAGGC
Sen  AGGGTTACACTGCGGC-CAGCGAAACGTTTCGCTAGTGGAGCAGAAAAATGAAGAAAGGC
Stm  GGGGTTACACTGCGGC-CAGCGAAACGTTTCGCTAGTGGAGCAGAAAAATGAAGAAAGGC
Eco  AGGATTAAACTGTGGGTCAGCGAAACGTTTCGCTGATGGAGAA-AAAAATGAAAAAAGGC
Ype  TTTTCTAAACTCCTTGTTAGCGAAACGTTTCGCTCTTGGAGTA-GATCATGAAAAAAGGT

Sty  ACCGTACTCAACTCTGAAATCTCGTCGGTCATTTCCCGTCTGGGGCATACTGATACTCTG
Sen  ACCGTACTCAACTCTGAAATCTCGTCGGTCATTTCCCGTCTGGGGCATACTGATACTCTG
Stm  ACCGTACTCAACTCTGAAATCTCGTCGGTCATTTCCCGTCTGGGGCATACTGATACTCTG
Eco  ACCGTTCTTAATTCTGATATTTCATCGGTGATCTCCCGTCTGGGACATACCGATACGCTG
Ype  GTATTACTGAACGCTGATATTTCCGCGGTTATCTCCCGTCTGGGCCATACCGATCAGATT

Pattern of nucleotide changes in protein-coding regions
pdxB in enterobacteria
(on the slide, conserved columns were marked with asterisks under each block)

Sty  TCGCTCG--CAGCGGAAAGAGGATTACGCCCTTCGCCTGGAGGCTGTGCAGGGGC---GCCGGAGATGGGATGCATAATT
Stm  TCGCTCG--CAGCGGAAAGAGGATTACGCCCTTCGCCTGGAGGCTGTGCAGGGGC---GCCGGAGATGGGATGCATAATT
Sen  TCGCTCG--CAGCGGAAAGAGGATTACGCCCTTCGCCTGGAGGCTGTGCAGGGGC---GCCGGAGATGGGATGCATAATT
Eco  TTGCCCG--TGCCAGACGGCAGATTATCTCCCTGACCTGGTGGTTGCCCAGGAGGAGGGCCGGAAATAGGTTGTATCATT
Kpn  ----CGG--TGGCGCAGTGCCTGATGGG-CCTCGCCCTGGAGGACGGTCTGGCAT---ATCAGCAAGGGGGTGCGTCATG
Ype  TTGTTAGAACAGGGGAAAACGGTAAACAGTGTGGCATTAGATGTCGGTTATAGCT-----CCGCCTCTGCTTTTATCGCC

Sty  AATTATCCTTTAAC----------CATAAATCTGAGCAATA-TATGCTTGGCGGCCAGATTATGGC--ACACTTGTCCGG
Stm  AATTATCCTTTAAC----------CATAAATCTGAGCAATA-TATGCCTGGCGGCCAGATTATGGC--ACACTTGTCCGG
Sen  AATTATCCTTTAAC----------CATAAATCTGAGCAATA-TATGCCTGGCGGCCAGATTATGGC--ACACTTGTCCGG
Eco  ACGTATCCTTATAC----------CTGAAATCTTCGCAAG--TATGCCTGGCCGCGAGATTATGGC--ACACTTGTCCGG
Kpn  ATTCATCCTTTCGATATCGCGGTGCTGGAACCAGGTGATGAGTATGCCTGGCGGCCAGATTATGGC--ACACTTCCCCAG
Ype  ATGTTTCAGCAAATAT--------CGGGTACCA-CGCCTGAGCGTTTCCGGCGGGGCAATAGTGGCTTATACTAAGCCCC

Sty  TTAACTCTCGTT-CTCAAACAG------GTACGACAGTC--GTGAAAATTCTCGTTGATGAAAATATGCCTTACGCCCGC
Stm  TTAACTCTCGTT-CTCAAACAG------GTACGACAGTC--GTGAAAATTCTCGTTGATGAAAATATGCCTTACGCCCGC
Sen  TTAACTCTCGTT-CTCAAACAG------GTACGACAGTC--GTGAAAATTCTCGTTGATGAAAATATGCCTTACGCCCGC
Eco  TTAACTCTCGT--CTCATACAG------GTAACACAAAC--GTGAAAATCCTTGTTGATGAAAATATGCCTTATGCCCGC
Kpn  TTAACTCTCGTT-CTCAGACAG------GTACTGAACT---GTGAAAATCCTCGTTGATGAAAATATGCCCTATGCCCGT
Ype  CTGTTTTTCATCTGTATGGCAGTTCGCTGTCGGAGAGTAAAGTGAAAATTCTGGTTGATGAAAATATGCCGTACGCTGAG

Codon positions under the coding part of the last block: 123123123123123123123123123123123123123

Operons
The majority of genes in prokaryotes are transcribed in operons. There are some examples of operons in eukaryotes: C. elegans.
Ideas for de novo prediction of operon structure are trivial:
• Small distance between adjacent genes
• Co-orientation (the genes lie on the same strand)
• More reliability when these features are conserved in different species
Additional arguments:
• Similar functional annotations of adjacent genes
• Observed co-expression
• Known average operon length

Training for a completely new genome
All the methods discussed so far need some initial knowledge about genes in the genome (DNA statistics, minimal ORF length, etc.), taken from known genes or their very close orthologs.
When we have no information at all, we use an iterative process. Initial parameters come from very long ORFs (and/or distant orthologs with reconstructed structure) treated as genes, and from regions with no ORFs treated as intergenic regions.

Gene recognition II. Eukaryotes
Specifics:
• Exon-intron structure
• 9-10 coding exons per gene on average (human), ~5 exons (insects)
• Average length of internal exons is 120-130 nucleotides
• Very long introns (> 10 kb) are frequent and may be longer than 1 Mb
• There are no Shine-Dalgarno sequences (the Kozak rule can be used instead, but it is much weaker)
=> ORFs and “sliding window” techniques are inapplicable!

Inapplicability of “sliding window” technique for eukaryotic genomes
[Figure: coding potential plots for the rat chymotrypsin gene and for an intergenic region containing no genes]

Search for “known” genes
BlastX is reliable only for large exons (short introns are treated as long deletions).
What can we use instead? Splicing signals!
A “spliced alignment” is an alignment of a DNA fragment with a sequence coding for a homologous protein. Unlike standard alignments, it may contain unpenalized long “deletions” flanked by splicing signals (that is, introns).
BLAT, ProFrame, TWINSCAN

Spliced alignments of genomic sequences
VISTA (www-gsd.lbl.gov/vista/): human-dog-mouse

HMM (Hidden Markov Model)
Definition: an HMM is a 5-tuple (Q, V, p, A, E), where:
• Q is a finite set of states, |Q| = N
• V is a finite set of observation symbols per state, |V| = M
• p is the vector of initial state probabilities
• A is the matrix of state transition probabilities, denoted $a_{st}$ for each s, t ∈ Q: $a_{st} \equiv P(x_i = t \mid x_{i-1} = s)$
• E is the emission probability matrix: $e_{sk} \equiv P(v_k \text{ at time } t \mid q_t = s)$
Output: only the emitted symbols are observable, not the underlying random walk between states, hence “hidden”.
Property: emissions and transitions depend on the current state only and not on the past.
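To make the 5-tuple concrete, the sketch below decodes the most probable hidden state path with the Viterbi algorithm, which is the standard decoding method for HMMs (the lecture itself does not name a decoding algorithm). The two states, their probabilities, and the test sequence are invented for illustration only.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most probable state path for an HMM (Q, V, p, A, E), computed in log space.

    Assumes a non-empty observation sequence and strictly positive probabilities.
    """
    # best[i][s]: log-probability of the best path that ends in state s at position i
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = [{}]
    for symbol in observations[1:]:
        scores, pointers = {}, {}
        for t in states:
            prev, score = max(
                ((s, best[-1][s] + math.log(trans_p[s][t])) for s in states),
                key=lambda pair: pair[1],
            )
            scores[t] = score + math.log(emit_p[t][symbol])
            pointers[t] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    return path[::-1]

# Hypothetical two-state model: AT-rich versus GC-rich sequence.
states = ["AT", "GC"]
start_p = {"AT": 0.5, "GC": 0.5}
trans_p = {"AT": {"AT": 0.9, "GC": 0.1}, "GC": {"AT": 0.1, "GC": 0.9}}
emit_p = {
    "AT": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
    "GC": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4},
}

path = viterbi("AAAATTTTGGGGCCCC", states, start_p, trans_p, emit_p)
print(path)
```

Gene finders use far richer state sets (exons, introns, intergenic regions, signals), but the decoding idea is the same: the states are never observed directly and are inferred from the emitted nucleotides.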
HMM-based Gene Finding
• GENSCAN (Burge 1997)
• FGENESH (Solovyev 1997)
• HMMgene (Krogh 1997)
• GENIE (Kulp 1996)
• GENMARK (Borodovsky & McIninch 1993)
• VEIL (Henderson, Salzberg, & Fasman 1997)

GenScan Overview
• Developed by Chris Burge (Burge 1997), in the research group of Samuel Karlin, Dept of Mathematics, Stanford Univ.
• Characteristics:
– Designed to predict complete gene structures
  • Introns and exons, promoter sites, polyadenylation signals
– Incorporates:
  • Descriptions of transcriptional, translational and splicing signals
  • Length distributions (explicit state duration HMMs)
  • Compositional features of exons, introns, intergenic, C+G regions
– Larger predictive scope
  • Deals with partial and complete genes
  • Multiple genes separated by intergenic DNA in a sequence
  • Consistent sets of genes on either/both DNA strands
• Based on a general probabilistic model of genomic sequence composition and gene structure

GenScan Architecture
• It is based on a Generalized HMM (GHMM)
• Models both strands at once
– Other models: predict on one strand first, then on the other strand
– Avoids prediction of overlapping genes on the two strands (rare)
• Each state may output a string of symbols (according to some probability distribution)
• Explicit intron/exon length modeling
• Special sensors for Cap-site and TATA-box
• Advanced splice site sensors

Regulation
Less than 5% of the human genome sequence is protein-coding.
What is the role of the remaining DNA?
It has been suggested that a much larger part of the human genome encodes the regulatory machinery.
Processes whose regulation we try to predict:
• Transcription (DNA → RNA)
• Splicing (pre-mRNA → mRNA)
• Translation (mRNA → protein)

Two types of analysis of regulation
Prediction of a regulatory signal consists of:
• Identification of the signal
• Finding new sites
A signal is an ideal “site” or the set of ALL observed sites.
A site is a representative of the signal in the genome.

Deriving of the signal ab initio I. Ubiquitous (necessary) signals
• Examples: promoters of transcription, ribosome-binding signal, acceptor and donor splicing sites, stop codon, polyadenylation signal
• We know many examples and some biological characteristics (and landmarks)
• Often short (4-6 nucleotides)

Re-alignment approaches
• Initial alignment by a biological landmark
– start of transcription for promoters
– start codon for ribosome-binding sites
– exon-intron boundary for splicing sites
• Fix the width of the sliding window and the expected signal size
• Derive the signal (the most frequent word) within a sliding window
• Repeat for other parameters, select the best set
• Re-align, anchoring on the signal
• Identify the signal positions (those with non-uniform nucleotide frequencies)

Gene starts of Bacillus subtilis
dnaN   ACATTATCCGTTAGGAGGATAAAAATG
gyrA   GTGATACTTCAGGGAGGTTTTTTAATG
serS   TCAATAAAAAAAGGAGTGTTTCGCATG
bofA   CAAGCGAAGGAGATGAGAAGATTCATG
csfB   GCTAACTGTACGGAGGTGGAGAAGATG
xpaC   ATAGACACAGGAGTCGATTATCTCATG
metS   ACATTCTGATTAGGAGGTTTCAAGATG
gcaD   AAAAGGGATATTGGAGGCCAATAAATG
spoVC  TATGTGACTAAGGGAGGATTCGCCATG
ftsH   GCTTACTGTGGGAGGAGGTAAGGAATG
pabB   AAAGAAAATAGAGGAATGATACAAATG
rplJ   CAAGAATCTACAGGAGGTGTAACCATG
tufA   AAAGCTCTTAAGGAGGATTTTAGAATG
rpsJ   TGTAGGCGAAAAGGAGGGAAAATAATG
rpoA   CGTTTTGAAGGAGGGTTTTAAGTAATG
rplM   AGATCATTTAGGAGGGGAAATTCAATG

Before re-alignment (sequences anchored on the start codon); “num.” gives, as a two-digit number read top to bottom, how many of the 16 sequences carry the consensus letter at each position:
cons.  aaagtatataagggagggttaataATG
num.   001000000000110110000000111
       760666658967228106888659666

After re-alignment (sequences anchored on the signal; the shifted rows are not reproduced here):
cons.  tacataaaggaggtttaaaaat
num.   0000000111111000000001
       5755779156663678679890

[Figure: positional information content before and after re-alignment]

Deriving of the signal II. Transcription regulation
• Transcription factor binding sites (TFBS)
• Usually longer (10-20 nt or more)
• Relatively small sample: only several sites in a whole genome, and very few examples are known
• Often have some symmetry
• Conserved among species
• Experimental studies are not sufficient: they define only the regulatory region

Why are TFBS palindromes?
[Figure: examples from prokaryotes and eukaryotes]

Use of symmetry
• DNA-binding factors and their signals
– Co-operative homogeneous: palindromes, repeats
– Co-operative non-homogeneous: cassettes
– Others
• RNA signals: special conserved secondary structure

Regulation of transcription in eukaryotes
[Figure]

Signal, consensus, pattern
codB       CCCACGAAAACGATTGCTTTTT
purE       GCCACGCAACCGTTTTCCTTGC
pyrD       GTTCGGAAAACGTTTGCGTTTT
purT       CACACGCAAACGTTTTCGTTTA
cvpA       CCTACGCAAACGTTTTCTTTTT
purC       GATACGCAAACGTGTGCGTCTG
purM       GTCTCGCAAACGTTTGCTTTCC
purH       GTTGCGCAAACGTTTTCGTTAC
purL       TCTACGCAAACGGTTTCGTCGG
consensus     ACGCAAACGTTTTCGT
pattern       aCGmAAACGtTTkCkT

Frequency matrix N(b,j) (counts of nucleotide b at position j of the pattern)
j   a  C  G  m  A  A  A  C  G  t  T  T  k  C  k  T
A   6  0  0  2  9  9  8  0  0  1  0  0  0  0  0  0
C   1  8  0  7  0  0  1  9  0  0  0  0  0  9  1  0
G   1  1  9  0  0  0  0  0  9  1  1  0  5  0  5  0
T   1  0  0  0  0  0  0  0  0  7  8  9  4  0  3  9

Information content
$I = \sum_j \sum_b f(b,j) \log \frac{f(b,j)}{p(b)}$
where f(b,j) is the frequency of nucleotide b at position j and p(b) is the background frequency of b.

Positional weight matrix (PWM)
$W(b,j) = \ln(N(b,j) + 0.5) - 0.25 \sum_i \ln(N(i,j) + 0.5)$

j     a     C     G     m     A     A     A     C     G     t     T     T     k     C     k     T
A   1.1  -1.0  -0.7   0.5   2.2   2.2   1.9  -0.7  -0.7  -0.1  -1.0  -0.7  -1.1  -0.7  -1.4  -0.7
C  -0.4   1.9  -0.7   1.6  -0.7  -0.7   0.1   2.2  -0.7  -1.2  -1.0  -0.7  -1.1   2.2  -0.3  -0.7
G  -0.4   0.1   2.2  -1.1  -0.7  -0.7  -1.0  -0.7   2.2  -0.1   0.1  -0.7   1.2  -0.7   1.0  -0.7
T  -0.4  -1.0  -0.7  -1.1  -0.7  -0.7  -1.0  -0.7  -0.7   1.5   1.9   2.2   1.0  -0.7   0.6   2.2

[Figure: sequence logo]

Greedy algorithms (MEME)
Find a signal among all k-words (assuming that we know the length of the signal).
Checking all possible k-words is too time-consuming (k ~ 16), so initially we consider only the k-words that are present in the fragments.
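The pieces introduced so far (count matrix, information content, PWM weights, and the candidate k-words taken from the fragments) can be sketched as follows. The code uses the nine purine-regulon fragments from the slide; taking the 16-nt core at offset 3 of each fragment, using log base 2 with a uniform background of 0.25, and all function names are assumptions made for this illustration.

```python
import math

BASES = "ACGT"

# The nine 22-nt fragments shown above (codB ... purL).
FRAGMENTS = [
    "CCCACGAAAACGATTGCTTTTT", "GCCACGCAACCGTTTTCCTTGC", "GTTCGGAAAACGTTTGCGTTTT",
    "CACACGCAAACGTTTTCGTTTA", "CCTACGCAAACGTTTTCTTTTT", "GATACGCAAACGTGTGCGTCTG",
    "GTCTCGCAAACGTTTGCTTTCC", "GTTGCGCAAACGTTTTCGTTAC", "TCTACGCAAACGGTTTCGTCGG",
]
# Assumption: the site is the 16-nt core that lines up with the consensus.
SITES = [fragment[3:19] for fragment in FRAGMENTS]

def count_matrix(sites):
    """N(b, j): how many sites carry base b at position j."""
    return [{b: sum(site[j] == b for site in sites) for b in BASES}
            for j in range(len(sites[0]))]

def weight_matrix(counts):
    """W(b, j) = ln(N(b,j) + 0.5) - 0.25 * sum_i ln(N(i,j) + 0.5)."""
    pwm = []
    for column in counts:
        logs = {b: math.log(column[b] + 0.5) for b in BASES}
        mean = 0.25 * sum(logs.values())
        pwm.append({b: logs[b] - mean for b in BASES})
    return pwm

def information_content(counts, background=0.25):
    """Sum over positions and bases of f * log2(f / p); zero-frequency terms are skipped."""
    total = 0.0
    for column in counts:
        n = sum(column.values())
        for b in BASES:
            f = column[b] / n
            if f > 0:
                total += f * math.log2(f / background)
    return total

def site_weight(pwm, word):
    """Weight of a candidate site: the sum of W(b, j) over its positions."""
    return sum(column[b] for column, b in zip(pwm, word))

def candidate_kwords(fragments, k):
    """Only the k-words that actually occur in the fragments."""
    return {f[i:i + k] for f in fragments for i in range(len(f) - k + 1)}

counts = count_matrix(SITES)
pwm = weight_matrix(counts)
print(round(information_content(counts), 2))
print(round(site_weight(pwm, "ACGCAAACGTTTTCGT"), 1))
print(len(candidate_kwords(FRAGMENTS, 16)))
```

With 22-nt fragments and k = 16 there are at most seven candidate words per fragment, which is why restricting the search to observed k-words is so much cheaper than enumerating all 4^16 words.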
For each k-word, construct a matrix of “sites”: an alignment of the best “copies” of the k-word from every sequence fragment. Select the best k-word.
What is the measure for comparing matrices? Information content!

Greedy algorithms. Cont’d
• Select the k-word with maximal information content.
Problem. We considered only k-words from our sequences => we may select not the signal (the consensus word), but only its best representative in our sample.
Solution. For each k-word from the sample, construct a PWM and reconstruct the frequency matrix based on it. Repeat until the matrix stabilizes. Use the consensus of this matrix.

Limitation of greedy algorithms
• They start from k-words in our sequences and increase the information content at each step => they find a local (not global) maximum of the functional.
• We need an alternative algorithm that is not “greedy”!

Gibbs sampler
Let A be a signal (a set of sites), and I(A) its information content.
At each step a new site is selected in one sequence with probability $P \sim \exp[I(A_{new})]$.
For each candidate site the total time of occupation is computed. (Note that the signal changes all the time.)

Recognition of signals I. Ubiquitous signals
• Consensus
• Pattern (consensus with degenerate positions)
• Positional weight matrix (PWM, or profile). The weight of a site is the sum over its positions j of W(b,j), where $W(b,j) = \ln(N(b,j) + 0.5) - 0.25 \sum_i \ln(N(i,j) + 0.5)$
• Logical rules
• Neural networks

Neural networks: architecture
• 4k input neurons (sensors), each responsible for observing a particular nucleotide at a particular position, OR 2k neurons (one discriminates between purines and pyrimidines, the other between A/T and G/C)
• One or more layers of hidden neurons
• One output neuron

Neural networks: architecture. II
• Each neuron is connected to all neurons of the next layer
• Each connection is assigned a numerical weight
A neuron:
• Sums the inputs at its incoming connections
• Compares the total with the threshold (or transforms it according to a fixed function)
• If the threshold is passed, excites the outgoing connections (or, respectively, sends the transformed value)

Training of the neural network
• Sites and non-sites from the training sample are presented one by one.
• The output neuron produces the prediction.
• The connection weights increase if the prediction is correct and decrease if it is incorrect.
Networks differ in architecture, particulars of the signal processing, and the training schedule.

Recognition of signals II. Regulation of transcription
• Neural networks don’t work: they need training, and there are too few examples
• PWM: OK, but too many false positive predictions => we need rules to select the true sites among the predicted ones
• Many genomes are available => comparative approach:
– Consistency filtering
– Phylogenetic footprinting
– Phylogenetic shadowing

Definition of orthologs
• Orthologous genes:
– the result of speciation
– the “same” role in the cell
• Paralogous genes:
– the result of duplication
– keep a common biochemical function
[Diagram: duplication and speciation events; genes A1, B1 in Genome 1 and A2, B2 in Genome 2]
Example: gluconate and idonate kinases

Consistency filtering
Basic assumption. Regulons (sets of co-regulated genes) are conserved =>
• True sites occur upstream of orthologous genes
• False sites are scattered at random
We need to check that the transcription factors are themselves true orthologs (BBH and COGs are not sufficient; check conservation of the DNA-binding domain and of the core pathway) and have exactly the same specificity (similar binding sites), and then compare the genes (and whole operons) downstream of the predicted sites.

The basic procedure
[Diagram: set of known sites → profile → Genome 1, Genome 2, ..., Genome N]

Accounting for the operon structure
[Figure: operons A, BC, D, EF in the «old» genome and their rearranged counterparts in the «new» genome, with predicted sites marked X]

Tryptophan operons
[Figure]

Closely related genomes: phylogenetic footprinting
Regulatory sites are more conserved than non-coding regions in general and are often seen as conserved islands in alignments of gene upstream regions.
(On the slides, conserved columns were marked with asterisks under each block.)

Low conservation: yjcD

ST  AAA-GCATAAAAAGCGGCAAAGTTCAGTTGAAAAAGCGTTGATGATCGCTGGATAATCGTTTGCTTTTTTTTG---CCAC
EC  AAA-GAGAAAAAAGCAGCAAACTTCGGTTGAAAAAGCCGCTATGATCGCCGGATAATCGTTTGCTTTTTTTA----CCAC
YP  AAATGTATTAAATGTCGCATTCGGGTGTTGATTAGTCACCACTGATGGCTAGATAATCGTTTGCCTTAAATGACATCTGC

ST  CC--------GTTTTGT--------ATACGTG----GAGCTAAACGTTTGCTTTTTTGCGGCGCCCCG-G-TTGTCGTAA
EC  CC--------GTTTTGT--------ATGCGCG----GAGCTAAACGTTTGCTTTTTTGCGACGCAGCA-AATTGTCGCAA
YP  CCTAAACTTCGATTTTTTTTCAGTCATGCGTTCTCCCAGCTAATCGTTTGCTATTTTTCCCCGCTCTATGAGTCAGGGAG

ST  ATGTAGC----------ACAAGGA-GATAACGTTGCGCTGTTAGTGGATTACCTCCCACGTATACCGACGAATAATAAAT
EC  ACCTGGA----------GCAGGAA-GATAACGTTTCGCTGGCAGGGGATTGTCCGCCACGCATCTTGACGAAAATTAAAC
YP  AGTTAGTGAGTTCATCGACAGGAACGGAAACGATTACGTAGAGAAGGGCGCTTGGCTTGGCATGCTATTTTAAAATGA-C

ST  TCTCAGGGGATGTTTTCT-ATGTCT------ACGCCTTCAGCGCGTACCGGCGGTTCACTCGACGCCTGGTTTAAAATTT
EC  TCTCAGGGGATGTTTTCTTATGTCT------ACGCCATCAGCGCGTACCGGCGGTTCACTCGACGCCTGGTTTAAAATTT
YP  ACACAGGGGACATCACC--ATGTCTAGCAGCAACCCTCAAGCACAGCCAAAGGGCACGCTTGATGCATTCTTTAAGCTTA

High conservation: purL

ST  AGCGGCATTTTGCGTAACAATGCGCCAGTTGGCAACTT-ATT-CGCAACGATAGCCGCACC--GTATGACAAGAAAAAGC
EC  AGCGGCATTTTGCGTAAACCTGCGCCAGATGGCAACTT-ATT-ACAGCCATTGGCGGCACG--CGTTGCTAATTCACGAT
YP  AGTGGCATTTTGCGCAACAAAACGCCAGTGTGCAACTTTATTGCGAGCTATTTGCTGAGTCTGCGTTACACACACATAGC

ST  GG-TGATT---------TTATTTCT-------ACGCAAACGGTTTCGTCGGCGCGTCAGATTCTTTATAATGACGGCCGT
EC  GG-TGATT---------TTATTTCC-------ACGCAAACGGTTTCGTCAGCGCATCAGATTCTTTATAATGACGCCCGT
YP  GGCTGTTTCTGACTGAATTATTAATAATAGATACGCAAACGGTTTCGTCGGCGGCTCAGATTCACTATAATGGCGCGCGT

ST  TTCCCCCC-------------------TTGCGCACACCAAA--------------GCTTAGAAGACGAGAGA--CTTA--
EC  TTCCCCCCC------------------TTGGGTACACCGAAA-------------GCTTAGAAGACGAGAGA--CTTA--
YP  TTTGCCCTGTTGTTGCGCCAATGAATGTTGCGCCCAATGAAGTGCTGTTCCAGCCGCTTCGAAGACGAGAGAAACTTAGA

ST  TGATGGAAATTCTGCGTGGTTCGCCTGCACTGTCTGCATTCCGTATCAATAAACTGCTGGCGCGCTTTCAGGCTGCCAAC
EC  TGATGGAAATTCTGCGTGGTTCGCCTGCACTGTCGGCATTCCGAATCAACAAACTGCTGGCACGTTTTCAGGCTGCCAGG
YP  TTATGGAAATACTGCGTGGTTCACCCGCTTTGTCGGCTTTTCGTATCACCAAACTGTTGTCCCGTTGCCAGGATGCTCAC

Another variation. Phylogenetic shadowing
Idea. Instead of distant orthologs, use very close orthologs, but from multiple (very close) species. True sites then look like islands of strongly conserved columns in the multiple alignment.
This requires sequencing orthologous upstream regions from a series of close genomes (e.g., from many different primates) and analyzing their multiple alignment.

RNA regulation. Riboswitches
An mRNA has two alternative conformations of its leader region; one of them blocks expression.
Two main cases (prokaryotes): a terminator interrupts transcription, or a special structure blocks the ribosome-binding site. Eukaryotes: a splicing site is blocked.
Riboswitches are RNA signals stabilized by a small molecule.

Example of the secondary structure of a riboswitch
[Figure legend] Capitals: invariant (absolutely conserved) positions. Lower-case letters: strongly conserved positions. Obligatory base pairs are set in bold. Degenerate positions: R = A or G; Y = C or U; K = G or U; B = not A; V = not U. N: any nucleotide. X: any nucleotide.

Importance of prediction of RNA regulation as a bioinformatics problem
• The phenomenon was discovered by means of bioinformatics
• The RNA signal is strongly conserved (at the sequence level, not only as secondary structure) => well predictable (no “false positive” predictions)
• The share of regulation of this type is considerable (~5% of all genes for some species)

Assignment of function based on homology
We want to characterize a new gene. What is the function of its product?
The first step: BlastP.
The best case: we obtain a hit with known function. Have we then obtained functional information about our gene?
Similarity ≠ homology: the e-value is a measure of the statistical significance (non-randomness) of similarity.

Definition of orthologs
• Orthologous genes:
– the result of speciation
– the “same” role in the cell
• Paralogous genes:
– the result of duplication
– keep a common biochemical function
[Diagram: duplication and speciation events; genes A1, B1 in Genome 1 and A2, B2 in Genome 2]
Example: gluconate and idonate kinases

Orthologs or paralogs?
The best proof is a phylogenetic tree, but building one takes too much time.
We use BBH (Bidirectional Best Hit).
COGs: Clusters of Orthologous Groups (//www.ncbi.nlm.nih.gov/COGs/new) for prokaryotes, or KOGs for eukaryotes.

Search for orthologs (fast and dirty)
[Diagram: genes A, B in Genome 1 and A', B', B" in Genome 2, connected by symmetrical best hits]

Assignment of a new gene to a specific functional system. I
• Positional clustering
Operon: co-transcription of several genes (usual in prokaryotes, rare in eukaryotes, e.g. Caenorhabditis elegans). The genes are transcribed together, and therefore under exactly the same conditions => they are functionally dependent.

Assignment of a new gene to a specific functional system. II
• Genes are not in the same operon, but in the same locus: horizontal transfer
• Divergon: a regulatory signal influences genes on both the direct and the complementary strand (usually with opposite effects)
[Diagram: regulatory site(s) between a gene (operon) on the (+) strand and a gene (operon) on the (-) strand]

Measure of positional closeness
Let us use a measure of positional neighborhood: the fraction of divergent genomes in which our genes are located close to each other.
Servers that predict functional dependence: ERGO (//www.cordis.lu/ergo/) and STRING (//string.embl.de/, may be described at the proteomics day). They implement and visualize ALL the techniques related to this area.

Eukaryotic case: domain shuffling
Compression of biochemical functions into single molecules:
• Prokaryotes: all enzymatic activities are carried out by separate proteins
• Fungi: the FAS1 gene encodes activities 3 and 4; the FAS2 gene encodes activities 1, 2 and 5-7
• Animals: all activities are encoded by fatty-acid synthase
[Figure: genomic structure of fatty-acid synthase from rat]

Protein domains
InterPro: www.ebi.ac.uk/interpro/
Pfam: http://www.sanger.ac.uk/Software/Pfam/

Co-regulation
Genes that are distant in the genome, but are regulated similarly.
Very similar to the case of operons, but hard to work with computationally. A lot of manual analysis is necessary.
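The positional-neighborhood measure defined a few slides above can be sketched as a simple fraction. The lecture does not fix the details, so the following are assumptions of this sketch: each genome is represented as an ordered list of ortholog identifiers, "close" means within `max_gap` positions in that list, and only genomes containing both genes are counted in the denominator. The gene and genome names are hypothetical.

```python
def positional_closeness(gene_a, gene_b, genomes, max_gap=3):
    """Fraction of genomes containing both genes in which they lie within max_gap positions."""
    shared = 0
    close = 0
    for gene_order in genomes.values():
        if gene_a in gene_order and gene_b in gene_order:
            shared += 1
            distance = abs(gene_order.index(gene_a) - gene_order.index(gene_b))
            if distance <= max_gap:
                close += 1
    return close / shared if shared else 0.0

# Hypothetical gene orders in four divergent genomes.
genomes = {
    "genome1": ["geneA", "geneX", "geneB", "geneC"],
    "genome2": ["geneB", "geneA", "geneD"],
    "genome3": ["geneA", "geneD", "geneE", "geneF", "geneG", "geneB"],
    "genome4": ["geneA", "geneC"],
}

print(positional_closeness("geneA", "geneB", genomes, max_gap=2))
```

A real implementation would work with chromosome coordinates and strand information, and would weight genomes by their evolutionary distance so that conservation in closely related species is not overcounted.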
Co-expression
• If the expression of two genes changes consistently in response to changing conditions or over time => they are functionally related
Microarray data analysis: a special area of bioinformatics (Transcriptomics session)

Protein-protein interactions
• Evidence of physical interaction is direct proof that the proteins function together in one cellular system
Will be discussed in detail at the Proteomics session

Phylogenetic profiling
Usually a functional system is present or absent in a genome as a whole (or this is true for a separate subsystem) =>
If we have many distant complete genomes, we can compare patterns of occurrence (phylogenetic profiles) for individual genes.
This is rather weak evidence, but it is useful in combination with other techniques.
The converse situation is also interesting: genes with complementary phylogenetic profiles may have identical function (non-orthologous displacement: paralogs, specificity changes or a really different structure).

Combining of methods
Each individual type of evidence is rather weak => we need to combine methods in every case.
• BlastP => general biochemical function
• Positional clustering and/or domain shuffling and/or phylogenetic profiling => assignment to a functional system
• Metabolic reconstruction => gaps in this system
• Try to place the product of our gene into each gap => (if we are lucky) exact biochemical function and exact position in the metabolic pathway

Archaeal shikimate kinase
[Figure: chorismate biosynthesis pathway (E. coli)]

Pectin utilization
[Figure: pectin utilization in E. chrysanthemi, and transport of oligogalacturonates in E. chrysanthemi, Y. pestis and K. pneumoniae]

YpaA: riboflavin transport
• 5 predicted TM segments => potential transporter
• Regulatory RFN element => co-regulation with genes of riboflavin metabolism => transport of the metabolite or one of its precursors
• S. pyogenes, E. faecalis, Listeria: have ypaA but no genes of riboflavin biosynthesis => transport of riboflavin
So, the prediction: YpaA is a riboflavin transporter (Gelfand et al., 1999)
Verification:
• YpaA imports riboflavin (genetic analysis, Kreneva et al., 2000)
• YpaA is regulated by riboflavin (microarray expression analysis, Lee et al., 2001; direct verification, Winkler et al., 2002)

Genome evolution. Repeats
• More than 45% of the human genome is repetitive DNA
• A. Smith: “The best algorithm of gene prediction is to mask the repeats, and the rest will be genes!”
• Genome-specific classes of repeats are unique markers of post-speciation genome evolution (did humans appear due to special repeats?!)
• Too many repeats => this task is computational
• Repeats influence gene recognition, similarity search and other genomic analyses. Mask repeats first!
RepeatMasker: www.repeatmasker.org/

Duplications in genomes. Example of a locus with internal duplications
[Figure: map of the MAGE-A locus on the human X chromosome, with two copies each of repeat I and repeat II; labels on the map: MAGEA9a, LW-1a, FAM11a, LW-1b, GABRE, MAGEA9b, MAGE8, “2 Mb”, MAGEA5, MAGEA10, GABRA3, GABRQ, MAGEA6, TRAG3a, MAGEA12, MAGEA4, CSAGE, MAGEA2b, TRAG3b, MAGEA3, “6 genes”, MAGEA1, MAGEA2a]

Duplications
• The main problem caused by duplications: assembly of newly sequenced genomes
• No universal solution: every group uses its own algorithm and software
Human genome: the number of duplications changes from one release to another. The two initial versions (International consortium, Celera) differed significantly with respect to duplications.

Synteny groups
• Human chromosomes cut into > 100 pieces and reassembled become a reasonable facsimile of the mouse chromosomes

Rearrangements as a unit of genome evolution
[Figure: a rearrangement]

Rearrangements of alfalfa and garden pea
[Figure: transforming alfalfa into pea]

Whole genome duplication in yeast
Kellis M, Birren BW, Lander ES. (2004) Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae. Nature 428:617-24.

Thank you!
The BioSapiens project is funded by the European Commission within its FP6 Programme, under the thematic area "Life sciences, genomics and biotechnology for health", contract number LHSG-CT2003-503265.