Download alignable - gobics.de: Department of Bioinformatics
Transcript
Alignment of large genomic sequences

The fragment-based alignment approach (DIALIGN) is useful for the alignment of genomic sequences. Possible applications:
Detection of regulatory elements
Identification of pathogenic microorganisms
Gene prediction

The DIALIGN approach

Input sequences:
atctaatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaagagtatcacccctgaattgaataa

The alignment is built up step by step from fragments; gaps appear as fragments are added:
atc------taatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaa--gagtatcacc----------cctgaattgaataa

Alignment after all fragments have been added:
atc------taatagttaaactcccccgtgc-ttag
cagtgcgtgtattactaac----------gg-ttcaatcgcg
caaa--gagtatcacc----------cctgaattgaataa

Consistency!
The DIALIGN approach

Aligned residues are shown in upper case, unaligned residues in lower case:
atc------TAATAGTTAaactccccCGTGC-TTag
cagtgcGTGTATTACTAAc----------GG-TTCAATcgcg
caaa--GAGTATCAcc----------CCTGaaTTGAATaa

First step in sequence comparison: alignment

[Diagram: segments S1, S2, S3 and their counterparts S1', S2', S3']

For genomic sequences, neither local nor global methods are appropriate.
A local method finds only the single best local similarity.
Multiple application of local methods is possible, but a threshold has to be applied to filter the alignments: reduced sensitivity!
First step in sequence comparison: alignment

Alternative approach: few large-scale rearrangements occur during evolution, so the relative order of homologies is conserved. Search for a chain of local homologies.

[Diagram: segments S1, S2, S3 and S1', S2', S3'. Genomic alignment: chain of homologies]

Novel approaches for genomic alignment: WABA, PipMaker, MGA, TBA, Lagan, Avid, DIALIGN

Alignment of large genomic sequences

Gene-regulatory sites are identified by multiple sequence alignment (phylogenetic footprinting).

Objective function for DIALIGN

Weight score for every possible fragment f, based on a P-value:
P(f) = probability of finding a fragment "like f" by chance in random sequences with the same length as the input sequences
w(f) = -log P(f) ("weight score" of f)
"Like f" means: at least the same number of matches (DNA, RNA) or at least the same sum of similarity values (proteins)

Score of an alignment: sum of the weight scores of its fragments; no gap penalty!

Optimization problem for DIALIGN: find a consistent collection of fragments with maximum total weight score!
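The weight score and the optimization problem above can be illustrated for the pairwise case with a small Python sketch. It rests on two simplifying assumptions that are not taken from the slides: P(f) is approximated by a binomial tail for a single fragment position (the real P(f) also depends on the lengths of the input sequences), and a consistent collection of fragments is taken to be a chain of fragments that are strictly ordered in both sequences. The function names `fragment_weight` and `best_chain` are hypothetical.

```python
import math


def fragment_weight(length, matches, p_match=0.25):
    """Simplified w(f) = -log P(f) for a DNA fragment.

    ASSUMPTION: P(f) is the probability of seeing at least `matches`
    matches in `length` independent positions with match probability
    `p_match`. Sequence lengths are ignored, unlike in DIALIGN itself.
    """
    p = sum(
        math.comb(length, k) * p_match**k * (1 - p_match) ** (length - k)
        for k in range(matches, length + 1)
    )
    return -math.log(p)


def best_chain(fragments):
    """Maximum-weight chain of fragments for two sequences.

    Each fragment is (start1, start2, length, weight). A fragment may
    follow another only if it starts after the other one ends in BOTH
    sequences. The score is the plain sum of weights (no gap penalty).
    Simple O(n^2) dynamic programming.
    """
    frags = sorted(fragments, key=lambda f: (f[0], f[1]))
    if not frags:
        return 0.0, []
    best = [f[3] for f in frags]   # best chain weight ending at fragment j
    prev = [-1] * len(frags)       # predecessor index for backtracking
    for j, (s1j, s2j, _, wj) in enumerate(frags):
        for i in range(j):
            s1i, s2i, li, _ = frags[i]
            if s1i + li <= s1j and s2i + li <= s2j and best[i] + wj > best[j]:
                best[j] = best[i] + wj
                prev[j] = i
    end = max(range(len(frags)), key=lambda j: best[j])
    total, chain = best[end], []
    while end != -1:
        chain.append(frags[end])
        end = prev[end]
    return total, chain[::-1]


example = [(0, 0, 5, 3.0), (3, 10, 4, 2.0), (6, 6, 4, 2.5), (12, 4, 3, 4.0)]
total, chain = best_chain(example)
print(total, chain)  # 5.5 [(0, 0, 5, 3.0), (6, 6, 4, 2.5)]
```

In the example the single heaviest fragment (weight 4.0) is not part of the best chain, because two lighter but mutually compatible fragments add up to a higher total.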
Alternative fragment weight scores for genomic sequences: calculate fragment scores at the nucleotide level and at the peptide level.

catcatatcttatcttacgttaactcccccgt
cagtgcgtgatagcccatatccgg

Standard score: consider the length and the number of matches, and compute the probability of random occurrence.

Translation option:

          L   S   Y   V
catcatatc tta tct tac gtt aactcccccgt
cagtgcgtg ata gcc cat atc cgg
          I   A   H   I

DNA segments are translated to peptide segments; the fragment score is based on peptide similarity: calculate the probability of finding a fragment of the same length with (at least) the same sum of BLOSUM values.

P-fragment (in both orientations):

          L   S   Y   V
catcatatc tta tct tac gtt aactcccccgt
cagtgcgtg ata gcc cat atc cgg
          I   A   H   I

N-fragment:

catcatatc ttatcttacgtt aactcccccgtgct
           || |  |  |
cagtgcgtg atagcccatatc cg

For each fragment f, three probability values are calculated; the score of f is based on the smallest P value.

[Figure: DIALIGN alignment of human and murine genomic sequences]
[Figure: DIALIGN alignment of tomato and Arabidopsis thaliana genomic sequences]

Alignment of large genomic sequences

Evaluation of signal detection methods: apply the method to data with known signals (the correct answer is known!), e.g. experimentally verified genes for gene finding.
TP = true positives = number of signals correctly predicted (i.e. signal present)
FP = false positives = number of signals predicted but wrong (i.e. no signal present)
TN = true negatives = no signal predicted, no signal present
FN = false negatives = no signal predicted, signal present!
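As a small illustration of how these counts are used, the Python sketch below derives TP, FP and FN from scored predictions at a given threshold and turns them into sensitivity TP / (TP + FN) and specificity TP / (TP + FP), the two measures used in this talk. The input format, a list of (score, is_real_signal) pairs, and the function names are assumptions made for the example.

```python
def counts(items, threshold):
    """Count TP, FP, FN for predictions scored at or above `threshold`.

    `items` is a list of (score, is_real_signal) pairs (hypothetical
    input format): a candidate is "predicted" if score >= threshold.
    """
    tp = sum(1 for score, real in items if score >= threshold and real)
    fp = sum(1 for score, real in items if score >= threshold and not real)
    fn = sum(1 for score, real in items if score < threshold and real)
    return tp, fp, fn


def sn_sp(tp, fp, fn):
    """Sensitivity TP/(TP+FN) and specificity TP/(TP+FP).

    Returns None for a measure whose denominator is zero.
    """
    sn = tp / (tp + fn) if tp + fn else None
    sp = tp / (tp + fp) if tp + fp else None
    return sn, sp


items = [(0.9, True), (0.8, True), (0.7, False), (0.4, True), (0.2, False)]
for threshold in (0.85, 0.5, 0.1):
    print(threshold, sn_sp(*counts(items, threshold)))
```

Raising the threshold in this toy data set moves the result toward high specificity and low sensitivity, and lowering it does the opposite.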
Alignment of large genomic sequences

Sn = sensitivity = correctly predicted signals / present signals = TP / (TP + FN)
Sp = specificity = correctly predicted signals / predicted signals = TP / (TP + FP)

Comprehensive evaluation of a signal prediction method:
The method assigns a score to each prediction.
Apply a threshold parameter.
High threshold -> high specificity (Sp), low sensitivity (Sn)
Low threshold -> high sensitivity, low specificity
ROC curve ("receiver operating characteristic" curve): vary the threshold parameter, plot Sn against Sp

[Figure: Performance of long-range alignment programs for exon discovery (human - mouse comparison)]
[Figure: DIALIGN alignment of tomato and Arabidopsis thaliana genomic sequences]

AGenDA: Alignment-based Gene Detection Algorithm

Bridge small gaps between DIALIGN fragments -> clusters of fragments.
Search for conserved splice sites and start/stop codons at cluster boundaries to identify candidate exons.
A recursive algorithm finds a biologically consistent chain of potential exons.

Identification of candidate exons

Fragments in the DIALIGN alignment
Build clusters of fragments
Identify conserved splice sites
Candidate exons are bounded by conserved splice sites

Construct gene models using candidate exons

The score of a candidate exon E is based on the DIALIGN scores of the fragments f_i in its cluster C, the score of the splice junctions, and a penalty for shortening / extending. The operators of the formula did not survive extraction; the slide layout suggests the reading

sc(E) = \frac{len(C) - dis(C,E)}{len(C)} \cdot \sum_i w(f_i) + sc(SP)

Find a biologically consistent chain of candidate exons (starting with a start codon, ending with a stop codon, no internal stop codons ...) with maximal total score.

Find optimal consistent chain of candidate exons

[Diagram: candidate exons delimited by the signals atg, gt, ag, gt, ag, tga and atg, tga, forming the gene models G1 and G2]

A recursive algorithm calculates the optimal chain of candidate exons in O(N log N) time.

[Diagram: DIALIGN fragments, candidate exons, gene model]

Result: 105 pairs of genomic sequences from human and mouse (Batzoglou et al., 2000)

[Bar chart: sensitivity and specificity of AGenDA and GenScan, axis from 0% to 100%]
[Chart: AGenDA, GenScan; values 64 %, 12 %, 17 %]

Alignment of large genomic sequences

DIALIGN is used by tracker for phylogenetic footprinting (Prohaska et al., 2004).
Alignment of the Hox gene cluster: DIALIGN is able to identify small regulatory elements, but entire genes are totally mis-aligned.
Reason for the mis-alignment: duplications!

The Hox gene cluster: 4 Hox gene clusters in pufferfish. 14 genes, different genes in different clusters!
Complete mis-alignment of entire genes!

Alignment of sequence duplications

[Diagrams: sequences S1 and S2, later S1, S2 and S3]
Conserved motifs; no similarity outside the motifs.
Duplication in two sequences: a mis-alignment would have a lower score!
Duplication in one sequence: possible mis-alignment.
Duplication in one sequence, three sequences S1, S2, S3: consistency problem.
More plausible alignment, and higher score.
Alternative alignment; probably biologically wrong; lower numerical score!

Anchored sequence alignment

Biologically meaningful alignment is often not possible by automated approaches.
Idea: use expert knowledge to guide the alignment procedure.
The user defines a set of anchor points that are to be "respected" by the alignment procedure.

Anchored sequence alignment

NLFVALYDFVASGDNTLSITKGEKLRVLGYNHN
IIHREDKGVIYALWDYEPQNDDELPMKEGDCMT

Use a known homology as anchor point:

NLFV ALYDFVASGDNTLSITKGEKLRVLGYNHN
IIHREDKGVIYALWDYEPQNDDELPMKEGDCMT

Anchor point = anchored fragment (gap-free pair of segments).
The remainder of the sequences is aligned automatically.

Anchored alignment:

-------NLF VALYDFVASG DNTLSITKGE klrvlgynhn
iihredkGVI YALWDYEPQN DDELPMKEGD cmt-------

Anchor points in multiple alignment:

NLFV ALYDFVASGDNTLSITKGEKLRVLGYNHN
IIHREDKGVIYALWDYEPQND DELPMKEGDCMT
GYQYRALYDYKKEREEDIDLHLGDILTVNKGSLVALGFS

Anchored multiple alignment:

-------NLF V-ALYDFVAS GD-------- NTLSITKGEk lrvLGYNhn
iihredkGVI Y-ALWDYEPQ ND-------- DELPMKEGDC MT------
-------GYQ YrALYDYKKE REedidlhlg DILTVNKGSL VA-LGFS--

Algorithmic questions

Goal: find the optimal alignment (= consistent set of fragments) under the constraints given by user-specified anchor points!
Algorithmic questions

Additional input file with anchor points:

sequences   start positions   length   score
1   3       215   231         5        4.5
2   3       34    78          23       1.23
1   4       317   402         8        8.5

Requirements: anchor points need to be consistent! If necessary, select a consistent subset from the user-specified anchor points.

atctaatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaagagtatcacccctgaattgaataa
Inconsistent anchor points!

atctaat---agttaaactcccccgtgcttag
Cagtgcgtgtattac-taacggttcaatcgcg
caaagagtatcacccctgaattgaataa
Inconsistent anchor points!

Find the alignment under the constraints given by the anchor points!
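Reading the anchor-point file described above might look like the following Python sketch. It assumes one anchor per line with six whitespace-separated fields in the order given by the slide labels (two sequence numbers, two start positions, a length, a score); the `Anchor` type and the `parse_anchors` function are hypothetical names for this example and are not part of DIALIGN.

```python
from typing import NamedTuple


class Anchor(NamedTuple):
    seq1: int      # number of the first sequence
    seq2: int      # number of the second sequence
    start1: int    # start position in the first sequence
    start2: int    # start position in the second sequence
    length: int    # length of the gap-free anchored fragment
    score: float   # user-defined priority of this anchor


def parse_anchors(text):
    """Parse anchor points, one per line; blank lines are skipped.

    ASSUMPTION: six whitespace-separated fields per line in the order
    seq1 seq2 start1 start2 length score.
    """
    anchors = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue
        if len(fields) != 6:
            raise ValueError(f"line {line_no}: expected 6 fields, got {len(fields)}")
        seq1, seq2, start1, start2, length = (int(x) for x in fields[:5])
        anchors.append(Anchor(seq1, seq2, start1, start2, length, float(fields[5])))
    return anchors


example = """\
1 3 215 231 5 4.5
2 3 34 78 23 1.23
1 4 317 402 8 8.5
"""
for anchor in parse_anchors(example):
    print(anchor)
```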
Algorithmic questions

Use data structures from multiple alignment.

atctaatagttaaactcccccgtgcttag
cagtgcgtgtattactaacggttcaatcgcg
caaagagtatcacccctgaattgaataa

Greedy procedure for multiple alignment. Question: which positions are still alignable?

For each position x and each sequence Si there exist an upper bound ub(x,i) and a lower bound lb(x,i) for the residues y in Si that are alignable with x.
Starting from their initial values, ub(x,i) and lb(x,i) are updated during the greedy procedure.

Anchor points are treated like fragments in the greedy algorithm:
Sorted according to user-defined scores
Accepted if consistent with previously accepted anchors
ub(x,i) and lb(x,i) updated during the greedy procedure
Resulting values of ub(x,i) and lb(x,i) used as initial values for the alignment procedure

The initial values of lb(x,i) and ub(x,i) are thus calculated using the anchor points.

Ranking of anchor points to prioritize them, e.g.
anchor points from verified homologies: higher priority
automatically created anchor points (using CHAOS, BLAST, ...): lower priority

Application: Hox gene cluster

Use gene boundaries as anchor points + CHAOS / BLAST hits

                  no anchoring   anchoring
Ali. columns
  2 seq           2958           3674
  3 seq           668            1091
  4 seq           244            195
Score             1166           1007
CPU time          4:22           0:19

Example: teleost Hox gene cluster: the score of the anchored alignment is 15 % higher than the score of the non-anchored alignment!
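The greedy treatment of anchor points (sort by user-defined score, accept an anchor only if it is consistent with those already accepted) can be sketched in Python as follows. This is a deliberately simplified stand-in and not the DIALIGN data structure: instead of maintaining the bounds lb(x,i) and ub(x,i), it only compares anchors that join the same pair of sequences, so inconsistencies that arise transitively across three or more sequences are not detected. The function names and the tuple layout (seq1, seq2, start1, start2, length, score) are assumptions.

```python
def pair_consistent(a, b):
    """Check two anchors (seq1, seq2, start1, start2, length, score).

    SIMPLIFICATION: anchors on different sequence pairs are always
    treated as compatible; transitive conflicts are ignored.
    """
    if (a[0], a[1]) != (b[0], b[1]):
        return True
    if a[2] - a[3] == b[2] - b[3]:
        return True  # same diagonal: overlapping parts agree
    a_before_b = a[2] + a[4] <= b[2] and a[3] + a[4] <= b[3]
    b_before_a = b[2] + b[4] <= a[2] and b[3] + b[4] <= a[3]
    return a_before_b or b_before_a


def greedy_select(anchors):
    """Accept anchors in order of decreasing score if they fit the rest."""
    accepted = []
    for anchor in sorted(anchors, key=lambda x: x[5], reverse=True):
        if all(pair_consistent(anchor, other) for other in accepted):
            accepted.append(anchor)
    return accepted


anchors = [
    (1, 2, 10, 12, 5, 4.0),
    (1, 2, 20, 5, 4, 9.0),   # highest score, crosses the anchor above
    (1, 2, 30, 40, 3, 2.0),
    (1, 3, 1, 1, 5, 1.0),
]
for anchor in greedy_select(anchors):
    print(anchor)
```

In this example the anchor with score 4.0 is rejected because it crosses the higher-scoring anchor with score 9.0 on the same pair of sequences; the other three are accepted.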
Conclusion: the greedy optimization algorithm does a bad job!

Application: Improvement of alignment programs

Two possible reasons for mis-alignments:
Wrong objective function: the biologically correct alignment gets a bad numerical score
Bad optimization algorithm: the biologically correct alignment gets the best numerical score, but the algorithm fails to find this alignment
Anchored alignments can help to decide.

Application: RNA alignment

Non-anchored alignment (structural motif mis-aligned):

aa----CCCC AGC---GUAa gucgcuaucc a
cacucuCCCA AGC---GGAG Aac-------
ccg----CCA AaagauGGCG Acuuga----

Anchored alignment, with 3 conserved nucleotides as anchor points:

aaCCCCAGCG UAAGUCGCUA UCca---CACUCUCC CAAGCGGAGA AC-------CCGCCA AAAGAUGGCG ACuuga

WWW interface at GOBICS (Göttingen Bioinformatics Compute Server)