What is the problem?
• Very large databases
• Unrefined datasets
  – Whole genomes in draft form
• Pairwise searching
  – Alignment
  – O(n^2) for each sequence in the database
  – BLAST: a tool that searches with “hashes” to speed this up.
• Basic idea: if you have a sequence from a “related” gene, then you can find new genes:
  – Copies of genes in the same species
  – The same gene in different species
• The problem is that a single instance may not represent the diversity that can be biologically interesting.

Central Dogma of Genome Function

Hidden Markov Models (Basic Concepts)
• Goal: Construct a model which can be built from a multiple sequence alignment (i.e., a training dataset) and which will score future sequences by their degree of similarity to the set of training sequences.
• Note: Fundamentally different from BLAST, with its universal substitution matrices.

PAM-250 Matrix

Hidden Markov Models (Basic Concepts)
• Uses the notion of a prior probability (Bayesian statistics) to reverse the roles of observation and expectation.
• E.g., in a random sequence, P(A) = P(C) = P(G) = P(T) = 0.25. These are prior probabilities.
• Now assume that, in a training data set, a ‘G’ was seen to follow an ‘AT’ 30% of the time. We would say that P(G|AT) = 0.3, yet P(G) is still 0.25 overall.

HMMs: Start Codon Recognition
  State A: A: 0.91 C: 0.03 G: 0.03 T: 0.03
  State T: A: 0.03 C: 0.03 G: 0.03 T: 0.91
  State G: A: 0.03 C: 0.03 G: 0.91 T: 0.03
• Above: a “state machine/model” for outputting sequences. It outputs various sequences with varying probabilities:
  ATG .91 x .91 x .91 = .7536
  ATT .91 x .91 x .03 = .0248
  TAG .03 x .03 x .91 = .000819
• What are these? P(ATT|M), where M is the model.
• But what we want is P(M|ATT), i.e., the probability that we are looking at a real start codon, given that we have seen ‘ATT’.
• Subtle, but very important difference.
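The start-codon arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch: the emission values are copied from the slide, while the names `START_CODON_MODEL`, `likelihood`, and `odds_vs_random` are hypothetical and chosen only for illustration. The ratio against the random-sequence probability (0.25 per base) is included as an added comparison.

```python
# Per-position emission probabilities copied from the slide (states A, T, G).
START_CODON_MODEL = [
    {"A": 0.91, "C": 0.03, "G": 0.03, "T": 0.03},
    {"A": 0.03, "C": 0.03, "G": 0.03, "T": 0.91},
    {"A": 0.03, "C": 0.03, "G": 0.91, "T": 0.03},
]
BACKGROUND = 0.25  # prior probability of each base in random sequence


def likelihood(seq, model=START_CODON_MODEL):
    """Return P(seq | M): the product of the per-position emission probabilities."""
    if len(seq) != len(model):
        raise ValueError("sequence length must match the number of states")
    p = 1.0
    for base, emissions in zip(seq.upper(), model):
        p *= emissions[base]
    return p


def odds_vs_random(seq):
    """Return P(seq | M) divided by the probability of seq in random sequence."""
    return likelihood(seq) / BACKGROUND ** len(seq)


for codon in ("ATG", "ATT", "TAG"):
    print(codon, likelihood(codon), odds_vs_random(codon))
```

Note that these are still values of P(sequence | M); turning them into P(M | sequence) requires Bayes' rule and a prior P(M).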
HMMs: Bayes Rule and Key Derivation
• Bayes Rule: P(A|B) x P(B) = P(B|A) x P(A)
• Rearranged: P(A|B) = (P(B|A) x P(A)) / P(B)
• Let A be M, and B be the observed sequence, e.g. ATT from our codon example:
  P(M|ATT) = (.0248 x P(M)) / (0.25 x 0.25 x 0.25) = 1.587 x P(M)
  Note: P(M) is a constant, so it falls out of all comparisons between scores of sequences.

Profile HMMs
• Example alignment:
  ACA---ATG
  TCAACTATC
  ACAC--AGC
  AGA---ATC
  ACCG--ATG
• [AT][CG][AC][ACGT]*A[TG][GC]
• This regular expression (RE) captures many sequences, including the ones above. However, it does not distinguish TGCTAGG from ACACATC: both match equally, although ACACATC follows the most common symbol in every column.

HMMs: Building a Model
• Rules:
  – One state for each “clear” position, or for each term in the RE.
  – Insert states for Kleene closure terms in the RE.
  – State probabilities computed from state “populations”.
  – Transition probabilities must sum to 1.0.
• Starting out…
  – The [AT] term in the previous example has 80% As and 20% Ts.
  – Transition to the next “state” is unconditional.
  [AT] state: A: 0.8 C: 0.0 G: 0.0 T: 0.2, transition 1.0
  [CG] state: A: 0.0 C: 0.8 G: 0.2 T: 0.0, transition 1.0
  [AC] state: A: 0.8 C: 0.2 G: 0.0 T: 0.0 ...

HMMs: Building a Model
• Continuing . . .
  – If states must split, transition probabilities must reflect the probability of going to the insert state versus bypassing the insert state.
  ACA---ATG
  TCAACTATC
  ACAC--AGC
  AGA---ATC
  ACCG--ATG
  – 3 sequences lead to the insert state: 3/5 = 0.6
  – 2 sequences bypass the insert state: 2/5 = 0.4
  [AT] state: A: 0.8 C: 0.0 G: 0.0 T: 0.2, transition 1.0
  [CG] state: A: 0.0 C: 0.8 G: 0.2 T: 0.0, transition 1.0
  [AC] state: A: 0.8 C: 0.2 G: 0.0 T: 0.0, transition 0.6 to the insert state, 0.4 bypassing it
  Insert state: A: - C: - G: - T: - (emissions not yet computed)

HMMs: Building a Model
• Probabilities of symbols on insert state
  ACA---ATG
  TCAACTATC
  ACAC--AGC
  AGA---ATC
  ACCG--ATG
  [ACGT]* state: A: 0.2 C: 0.4 G: 0.2 T: 0.2
  – 2 C’s (0.4), 1 G (0.2), 1 A (0.2), 1 T (0.2); total of 5 symbols
• Probabilities of transitions leaving insert state
  – Of the 5 symbols emitted in the insert state, 2 are followed by another insertion: 2/5 = 0.4
  – Otherwise, we leave this state.
  1 - 2/5 = 0.6
  [AC] state: A: 0.8 C: 0.2 G: 0.0 T: 0.0, transition 0.4 directly to the [A] state
  [A] state: A: 1.0 C: 0.0 G: 0.0 T: 0.0

Example HMM Derivation

HMMs: Example Sequence Scoring
• P(ACACATC|M) = 0.8 x 1.0 x 0.8 x 1.0 x 0.8 x 0.6 x 0.4 x 0.6 x 1.0 x 1.0 x 0.8 x 1.0 x 0.8
  (state probabilities alternating with transition probabilities)
  = 0.04718 = 4.7 x 10^-2
• P(M|ACACATC) = ((4.7 x 10^-2) / (0.25)^7) x P(M) = (7.7 x 10^2) x P(M)
• Log-odds: ln(7.7 x 10^2) = 6.65; log2(7.7 x 10^2) = 9.6
• This number is a “score” of the likelihood that seeing this sequence implies that the model applies.

HMM Scoring of Sequences

Log-odds HMM Model Example

HMM Profile Model Structure

Example Alignment (SH3 domain)

Example HMM Profile Model (No synthetic pseudo-counts)

HMM Model Example with Pseudo-count of 1

Gene Prediction with HMMs
• HMMs can be used for “predicting” where genes are encoded in genomic sequence.
• Models can be built from sets of known genes:
  – Promoters
  – Start of coding (start codon)
  – Intron/exon splice sites
  – Stop codon
  – Polyadenylation site

Genome Architecture Primer
[Figure: annotated gene structure. Labels: Promoter, Transcription Start, 5’ UTR, Start codon, Codons, Exon, Donor site, Intron, Acceptor site, Stop Codon, 3’ UTR, Poly-A site. Sequence fragments shown: GCCGCCGCCATGCCCTTCTCCAACAGGTGAGTGAG; CTCCCAGCCCTGCC; ATCCCCATGCCTGAGGGCCCCT; GCAGAAACAATAAAACCA]

Comprehensive HMM Model for Unspliced Genes

Coding Region Model

Intron Modeling

Gene Prediction Approaches
• Ab initio methods:
  – Profile Hidden Markov Models (GENSCAN, HMMgene)
  – Neural Networks (GRAIL, Genie)
  – Decision Trees (MORGAN)
• Issues:
  – Seeding from training sets
  – Fully general approaches?
• Interesting question:
  – Can gene finding be done in a species-independent way?
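Returning to the ACACATC scoring example from the profile HMM slides: the path probability and the log-odds score can be checked numerically. The sketch below hard-codes the state and transition probabilities exactly as they appear in the product on the slide (it does not re-derive them from the alignment), and the names `path_probability` and `log_odds` are hypothetical helpers, not part of any HMM package.

```python
import math

# Emission (state) probabilities along the path for A, C, A, C (insert), A, T, C,
# copied from the product shown on the scoring slide.
EMISSIONS = [0.8, 0.8, 0.8, 0.4, 1.0, 0.8, 0.8]
# Transition probabilities between consecutive states on that path:
# 1.0, 1.0, 0.6 (enter insert state), 0.6 (leave insert state), 1.0, 1.0.
TRANSITIONS = [1.0, 1.0, 0.6, 0.6, 1.0, 1.0]


def path_probability(emissions, transitions):
    """P(sequence | M) along one path: product of all emissions and transitions."""
    return math.prod(emissions) * math.prod(transitions)


def log_odds(p_seq_given_model, length, background=0.25, base=math.e):
    """Log of P(sequence | M) over the random-sequence probability background**length."""
    return math.log(p_seq_given_model / background ** length, base)


p = path_probability(EMISSIONS, TRANSITIONS)
print("P(ACACATC | M) =", p)
print("odds vs. random =", p / 0.25 ** 7)
print("ln score   =", log_odds(p, 7))
print("log2 score =", log_odds(p, 7, base=2))
```

Working in log space, as in the last two lines, is the usual choice in practice because products of many small probabilities underflow quickly for longer sequences.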
Gene Prediction: Recognizing Initiation of Coding
[Figure labels: 5’ UTR; 1st Exon; ATG; Kozak Consensus; Stops in all 3 frames; No in-frame stops; GT; Exon; AG; Intron; Exon]

Classifier Outline
[Flowchart labels: Consensus Kozak; 0 errors; 1 error; ATG/UTR Heuristic; 2 errors; ATG; L; M; CDS; R; stop ratio; frame shift check; Stops upstream; UTR; ~E(stop); Check ORF for frame shifts]

Classifier Heuristic Components
226 Classes
• Kozak Existence and Fidelity
• ATG Heuristic:
  Template: (sIFl, sl, sFl) 5len : ATG : 3len (sIFr, sr, sFr)
  Ideal: ( 1, 3, 3) 125 : ATG : 300 ( 0, 6, 2)
• # Stops left of candidate ATG
• CDS: # Stops in minimum frame
• UTR Heuristic
• In-frame stops to all stops ratio
• # Frame shifts needed for perfect ORF
• Not Used:
  – Codon or Hexamer Frequencies.
  – Known protein starting motifs.

Verification and Testing
• Generation of sets of:
  – known CDS “reads” (12,826)
  – known ATG “reads” (13,672)
  – known UTR “reads” (1,035)
• Run classifier against all three sets:
  – Identify classes with highest CDS to ATG differential and UTR vs. CDS/ATG
• Grade A:
  K0E.ATG.L.pSL.ORFr0F or 1FS
  K0E.ATG.L.npSL.ORFr0FS or 1FS
  K1E.ATG.L.pSL.ORF0FS or 1FS
  K1E.ATG.L.npSL.ORF0FS or 1FS
  KG1E.ATG.L.pSL.ORF0FS or 1FS
  KG1E.ATG.L.npSL.ORF0FS or 1FS
• Grade B: Same as A, but with ATG in Middle 1/2
• Grade C: zSL for K0E only and ATG in L, M, or R
• UTR Class

Accuracy and Yield of Classes
• ATG True Positive (of 13,672):
  – Grade A: 867 - 6.3%
  – Grade B: 3,742 - 27.3%
  – UTR: 82 - 0.6%
  – Total: 34.3% (4,691)
• CDS False Positive (of 12,826):
  – Grade A: 3 - 0.02%
  – Grade B: 753 - 5.5%
  – UTR: 1725 - 13.5%
  – Total: 19.3% (2481)
• UTR True Positive (of 1,035):
  – 691 - 66.8%
• Yield: 34% (ATG), 67% (UTR)
• Confidence: 95%, 87% (same order)
• Notes:
  – The yield estimate is conservative due to variable fidelity of mRNA source.

Gene Prediction: Finding Intron Boundaries
[Figure labels: Consensus; Stops in all 3 frames; No in-frame stops; GT; Exon; AG; Intron; Exon]

Simple Dicty Gene Finder (Intuition and an Example)
• Basic Idea (G. Klein) based on GC/AT content of Intron vs.
Exons
• Idealized Example: Count G/Cs and A/Ts in a window size of 10 bases.
[Figure: <EXON> …….CGCGGGCGCCGTATTTATATATTATA…..AATATTTTATATAGCCCGGCGCGGCCG…... <EXON>, with the <INTRON> between the two exons. AT content and GC content are plotted per window (values shown: 6, 10, 10, 10 and 10, 10, 6, 2). Labels: Donor Site; Acceptor Site; Point where GC.left and AT.right are both maximized]

Dicty Gene Finding Tool Model
• Model Parameters:
  – W: window size
  – low: threshold below which GC or AT content does not match hypothesis
  – high: threshold above which GC or AT content matches hypothesis
  – m: number of consecutive windows that will be examined
  – n: number of windows out of m that must exceed the threshold to qualify for an intron/exon or exon/intron transition
  – tol: maximum distance from the GC/AT content transition at which the GT or AG motif must be found

Dicty Gene Finding Tool Model
W = 8, m = 4, high = 7, low = 6
[Figure: windows 1 to 6 over . . .GCGGGCGCTGGGGCCGCGTATTATAGTATTTAT. . ., with G/C=7, n=3 and n=4 marked]

Dicty Intron/Gene Prediction Algorithm
1. Calculate AT (GC) content in size W windows right and left of each base position.
2. For the m windows to the left and to the right of each base position, count how many have a high AT count and how many have a low AT count.
3. For each position:
   If ATleft_high >= n && ATright_low >= n: potential acceptor site
   If ATleft_low >= n && ATright_high >= n: potential donor site

Dicty Intron/Gene Prediction Algorithm (continued)
4. For each potential donor or acceptor site: if a GT (donor) or AG (acceptor) motif is found within tol bases, note this as an intron boundary.
5. Sort boundaries into candidate introns.
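Steps 1 to 4 of the algorithm above can be sketched in Python as follows. This is an illustrative reading of the slides, not the original tool: the function names are hypothetical, windows are assumed to slide one base at a time, a window is assumed to count as "high" when its AT count is at least `high` and as "low" when it is at most `low`, and step 5 (pairing boundaries into introns) is left out.

```python
def at_count(seq, start, width):
    """Number of A/T bases in the window seq[start:start + width]."""
    return sum(1 for base in seq[start:start + width].upper() if base in "AT")


def window_votes(seq, starts, width, low, high):
    """Count how many of the windows beginning at `starts` are AT-high / AT-low."""
    n_high = n_low = 0
    for s in starts:
        if s < 0 or s + width > len(seq):
            continue  # ignore windows that run off either end of the sequence
        count = at_count(seq, s, width)
        if count >= high:
            n_high += 1
        if count <= low:
            n_low += 1
    return n_high, n_low


def candidate_sites(seq, W, m, n, low, high):
    """Steps 1-3: positions where AT content switches between low and high."""
    donors, acceptors = [], []
    for i in range(len(seq)):
        left_high, left_low = window_votes(
            seq, [i - W - k for k in range(m)], W, low, high)
        right_high, right_low = window_votes(
            seq, [i + k for k in range(m)], W, low, high)
        if left_low >= n and right_high >= n:
            donors.append(i)      # AT-poor exon on the left, AT-rich intron on the right
        if left_high >= n and right_low >= n:
            acceptors.append(i)   # AT-rich intron on the left, AT-poor exon on the right
    return donors, acceptors


def has_motif_near(seq, pos, motif, tol):
    """True if `motif` starts within `tol` bases of `pos`."""
    return motif in seq[max(0, pos - tol):pos + tol + len(motif)].upper()


def intron_boundaries(seq, W, m, n, low, high, tol):
    """Step 4: keep only candidates with a GT (donor) or AG (acceptor) nearby."""
    donors, acceptors = candidate_sites(seq, W, m, n, low, high)
    return ([d for d in donors if has_motif_near(seq, d, "GT", tol)],
            [a for a in acceptors if has_motif_near(seq, a, "AG", tol)])


# Idealized toy sequence: GC-only exons around an AT-only intron (GT ... AG).
toy = "GC" * 6 + "GT" + "AT" * 8 + "AG" + "GC" * 6
donor_sites, acceptor_sites = intron_boundaries(toy, W=4, m=3, n=3, low=1, high=3, tol=3)
print("donor candidates:", donor_sites)
print("acceptor candidates:", acceptor_sites)
```

On the toy sequence the candidates cluster in a few adjacent positions around each true boundary, which is why a later step is needed to sort them into candidate introns.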
Test Data
>IIADP1D6358 Antiparallèle 811 bases
AAAAACCTGCTTAGGATTAATTATGAGCGAATTTTTTTTCTTTAAAACTT
CCAAAAATATTTTTTTTTTTTTTTTTTTTTAATAATTTCGGTTTGCTCAT
AGATTTTTTATTTATTTAATTAATATTTTTAATTTTTTTTTTTTTAATCC
TAAAAATAGATTTTATTTATTTTATTTAATTTTTAATTATTAAAAGATAT
GAGATTTTTAAAGTTCGGGTTAGAAATTAATTTGGGTAAAGGAACTCTTA
TTGAATTTGATGAACAgtgtacttaaatatttaattaatttttttttttt
atttgttttaagaagaagaaaaagaaaaaatatagaaatagTAAAAAACT
ATTTCCATATATTTGTTATACTCTTACACACAAGGTTATAAATTTAAAGT
gttataaataatttaaaaattttattctgtaagaaaatttgttttgaaat
tatttgattaaaaatagaaggtttttttttttattttttttttttatttt
tatttttttttattttttataatttccgcgtttgaatttgttgtgtaaat
taattttaattttttttttttttttttttttttttttttttttttttttt
ttcatttttaacatcatttgattcattaatttattttttttttcaacatc
cccaacccaaaaaaaaaaaataaaaaaaaatgataagAAATTTAACAAAA
TTAACAAAATTTACAATTGAAAATAGATTTTACCAATCCTCATCAAAAGG
AAGATTCAGTGGTAAAAATGGAAACAATGCATTCAGGGGATCTCTAGAGT
CGACCGAAGGC
• Probable Correct Introns: +267 -341 +401 -687

Parameter Space to Search
• Ranges
  – W: 3 to 10 (8 values)
  – high: 0.7 x W to W (4 values)
  – low: 0.5 x W to 0.9 x W (4 values)
  – m: 3 to 11 (9 values)
  – n: m/2 to m (4 values)
  – tol: 3 to 7 (5 values)
• 3584 x 5 ≈ 18,000 sets of parameters
• Search for sets that find all expected sites with a minimum of false positives.

Test Data
idt t1.fasta 3 1 3 1 2 4 2 2 269 401 341 687
idt t1.fasta 3 1 3 2 2 4 2 2 269 401 341 687
idt t1.fasta 3 1 3 1 3 4 2 2 269 401 341 687
idt t1.fasta 3 1 3 2 3 4 2 2 269 401 341 687
idt t1.fasta 3 2 3 1 2 4 2 2 269 401 341 687
idt t1.fasta 3 2 3 2 2 4 2 2 269 401 341 687
idt t1.fasta 3 2 3 1 3 4 2 2 269 401 341 687
idt t1.fasta 3 2 3 2 3 4 2 2 269 401 341 687
idt t1.fasta 3 3 3 1 2 4 2 2 269 401 341 687
idt t1.fasta 3 3 3 2 2 4 2 2 269 401 341 687
idt t1.fasta 3 3 3 1 3 4 2 2 269 401 341 687
idt t1.fasta 3 3 3 2 3 4 2 2 269 401 341 687
idt t1.fasta 3 2 4 1 2 4 2 2 269 401 341 687
idt t1.fasta 3 2 4 2 2 4 2 2 269 401 341 687
idt t1.fasta 3 2 4 1 3 4 2 2 269 401 341 687
idt t1.fasta 3 2 4 2 3 4 2 2 269 401 341 687
idt t1.fasta 3 3 4 1 2 4 2 2 269 401 341 687
idt t1.fasta 3 3 4 2 2 4 2 2 269 401 341 687
idt t1.fasta 3 3 4 1 3 4 2 2 269 401 341 687
idt t1.fasta 3 3 4 2 3 4 2 2 269 401 341 687
idt t1.fasta 3 4 4 1 2 4 2 2 269 401 341 687
idt t1.fasta 3 4 4 2 2 4 2 2 269 401 341 687
idt t1.fasta 3 4 4 1 3 4 2 2 269 401 341 687
idt t1.fasta 3 4 4 2 3 4 2 2 269 401 341 687
idt t1.fasta 3 2 5 1 2 4 2 2 269 401 341 687
idt t1.fasta 3 2 5 2 2 4 2 2 269 401 341 687
. . . About 18,000 more lines like this . . .

Test Data Raw Results
len=811 W=3 n=1 m=3 thrL=1 thrH=2, Tol=4, Sites Found=18
Intron: 1 + 91
Intron: 2 + 236
Intron: 3 + 267
Intron: 4 + 385
Intron: 5 + 471
Intron: 6 + 799
[remaining site columns, not recoverable row by row: + + + - 213 - 213 241 267 399 - 399 - 467 759 - 759 - 797 799]

len=811 W=3 n=1 m=3 thrL=2 thrH=2, Tol=4, Sites Found=29
Intron: 1 + 91
Intron: 2 + 219
Intron: 3 + 236
Intron: 4 + 267
Intron: 5 + 305
Intron: 6 + 341
Intron: 7 + 385
Intron: 8 + 429
Intron: 9 + 441
Intron: 10 + 471
Intron: 11 + 759
Intron: 12 + 799
[remaining site columns, not recoverable row by row: + + - 213 223 241 267 312 341 399 433 467 753 759 799 - 213 - 335 - 399 - 797]

len=811 W=3 n=1 m=3 thrL=1 thrH=3, Tol=4, Sites Found=13
Intron: 1 + 91 + 213 - 213 - 241
Intron: 2 + 267 + 399 - 399 - 467
Intron: 3 + 471 + 759 - 759 - 786
. . . About 18,000 sets of results like this. . .
40 Test Data Filtered Results len=811 W=6 n=5 m=9 thrL=5 thrH=6, Tol=3, Sites Found=11 ALL KNOWN SITES FOUND len=811 W=6 n=5 m=10 thrL=5 thrH=6, Tol=3, Sites Found=11 ALL KNOWN SITES FOUND len=811 W=6 n=5 m=10 thrL=5 thrH=6, Tol=4, Sites Found=11 ALL KNOWN SITES FOUND len=811 W=6 n=5 m=5 thrL=5 thrH=6, Tol=6, Sites Found=11 ALL KNOWN SITES FOUND len=811 W=6 n=5 m=6 thrL=5 thrH=6, Tol=6, Sites Found=11 ALL KNOWN SITES FOUND len=811 W=6 n=5 m=7 thrL=5 thrH=6, Tol=6, Sites Found=11 ALL KNOWN SITES FOUND len=811 W=6 n=5 m=8 thrL=5 thrH=6, Tol=6, Sites Found=11 ALL KNOWN SITES FOUND len=811 W=6 n=5 m=5 thrL=5 thrH=6, Tol=7, Sites Found=11 ALL KNOWN SITES FOUND len=811 W=6 n=5 m=6 thrL=5 thrH=6, Tol=7, Sites Found=11 ALL KNOWN SITES FOUND len=811 W=6 n=5 m=7 thrL=5 thrH=6, Tol=7, Sites Found=11 ALL KNOWN SITES FOUND This provides an initial set of likely to be optimal parameters 41 Analysis of a Known Gene len=811 W=6 n=5 m=10 thrL=5 thrH=6, Tol=4, Sites Found=11 1: AAAAACCTGC TTAGGATTAA TTATGAGCGA ATTTTTTTTC TTTAAAACTT 51: CCAAAAATAT TTTTTTTTTT TTTTTTTTTT AATAATTTCG GTTTGCTCAT 101: AGATTTTTTA TTTATTTAAT TAATATTTTT AATTTTTTTT TTTTTAATCC 151: TAAAAATAGA TTTTATTTAT TTTATTTAAT TTTTAATTAT TAAAAGATAT 201: GAGATTTTTA AAgttcgggt tagaaattaa tttgggtaaa gGAACTCTTA 251: TTGAATTTGA TGAACAgtgt acttaaatat ttaattaatt tttttttttt 301: atttgtttta agaagaagaa aaagaaaaaa tatagaaata gTAAAAAACT 351: ATTTCCATAT ATTTGTTATA CTCTTACACA CAAGgttata aatttaaagt 401: gttataaata atttaaaaat tttattctgt aagAAAATTT GTTTTGAAAT 451: TATTTGATTA AAAATAGAAG gttttttttt ttattttttt tttttatttt 501: tatttttttt tattttttat aatttccgcg tttgaatttg ttgtgtaaat 551: taattttaat tttttttttt tttttttttt tttttttttt tttttttttt 601: ttcattttta acatcatttg attcattaat ttattttttt tttcaacatc 651: cccaacccaa aaaaaaaaaa taaaaaaaaa tgataagAAA TTTAACAAAA 701: TTAACAAAAT TTACAATTGA AAATAGATTT TACCAATCCT CATCAAAAGG 751: AAGATTCAGT GGTAAAAATG GAAACAATGC ATTCAGGGGA TCTCTAGAGT 801: CGACCGAAGG C 90% correct prediction Intron 1: + 213 - 241 
overpredicted (45 bases) Intron 2: + 267 - 341 UNDERPREDICTED (37 BASES) Intron 3: + 385 + 399 - 433 correct + (325 bases) 42 Intron 4: + 471 - 687 CORRECT - (404 BASES) Analysis of Unknown Gene • Started with 21 reads from •Used phred to assemble them •4 contigs found •4th contig was longest (1759 bases) •Used parameters from previous analysis •Results for contig4 compared . . . . . . . 43 Contig4 Sampled Results (a closer look) W=6 n=5 Intron Intron Intron Intron Intron m=5 thrL=5 thrH=6, Tol=6, Sites Found=14 1: + 54 - 401 2: + 579 - 612 3: + 711 + 782 -1113 4: +1185 +1350 -1350 -1504 5: +1628 -1709 W=6 n=5 Intron Intron Intron Intron Intron m=5 thrL=5 thrH=6, Tol=7, SitesFound=14 1: + 54 - 401 2: + 579 - 612 3: + 711 + 782 -1113 4: +1185 +1350 -1350 -1504 5: +1628 -1709 len=1759 W=6 n=5 Found=14 Intron 1: + 54 Intron 2: + 579 Intron 3: + 711 Intron 4: +1174 Intron 5: +1628 m=6 thrL=5 thrH=6, Tol=6, Sites len=1759 W=6 n=5 Found=14 Intron 1: + 54 Intron 2: + 579 Intron 3: + 711 Intron 4: +1174 Intron 5: +1628 m=6 thrL=5 thrH=6, Tol=7, Sites - 401 - 612 + 782 -1164 +1350 -1350 -1504 -1709 - 401 - 612 + 782 -1164 +1350 -1350 -1504 -1709 len=1759 W=6 n=5 Intron 1: + 54 Intron 2: + 579 Intron 3: + 650 Intron 4: +1087 Intron 5: +1174 Intron 6: +1628 Intron 7: +1735 m=7 thrL=5 thrH=6, Tol=6, Sites Found=17 - 401 - 612 + 782 -1043 -1164 +1350 -1350 -1504 -1709 len=1759 W=6 n=5 Intron 1: + 54 Intron 2: + 579 Intron 3: + 650 Intron 4: + 711 Intron 5: +1087 Intron 6: +1350 Intron 7: +1628 Intron 8: +1735 m=7 thrL=5 thrH=6, Tol=7, Sites Found=19 - 401 - 612 - 683 + 782 -1043 +1164 -1164 -1350 -1504 -1709 len=1759 W=6 n=5 Intron 1: + 54 Intron 2: + 579 Intron 3: + 650 Intron 4: + 711 Intron 5: +1087 Intron 6: +1350 Intron 7: +1628 Intron 8: +1735 m=8 thrL=5 thrH=6, Tol=6, Sites Found=19 - 401 - 612 - 683 + 782 -1043 +1164 -1164 -1350 -1504 -1709 44 Contig4 Results len=1759 W=6 n=5 1: TGATAATAAC 51: ATTgtaataa 101: gataatgata 151: tataaataat 201: tatcaccaat 251: aataattcaa 
301: ttgtagaaat 351: ttcctataac 401: gACCAATTTA 451: AAACCCAATA 501: TACAATCACA 551: TCAAAATCAT 601: taaaaaatca 651: taattaatca 701: AAAATAACCA 751: aaatatatga 801: tatatatgat 851: attatagaac 901: agtttgattg 951: tatagatttt . . . m=10 thrL=5 thrH=6, Tol=4, Sites AATAATAACA ATAATAATAA TAATAATAAT taataatatt aataatgata ataataataa ataataatat taatactgtt gataatcatg ancaataatt ttaataaaaa tgaatatcca atctccaaaa tcttcaatat caagttttcc taaataatac aggttcaatg gtttcagatt tcgatttcct ctagttcaat tgattcaagt aatacaatca atagattttg aagataagaa AAATAATATC AAAATCAAAT ATAGAAAATA CCTCCATTCA ATCAAACCAA TAACCAATGT TTCTTTACCA ACAATTTTAA AACAACCACA TTTCTAGTAG TATCAATAgt aatagtaaaa agATCATTTG AAATTGAATC AAAAATTAAT tatatattta aacctttcaa aagTTGGTAG gtatgtatta aattaacaaa tgattaatat aactaattta atattttaaa ggtgttttta aagggtttta tttcaagaga tgatttaaaa taaacaaaat gggttaaaat ttcaagactt atcacatttt tcaacaattt gataaaaata gaagaaattt aaagtgaatt aacaattaac Found=19 AATATTAATA taataataat atgatgatat tcaagtaata aacaaattta ctttaagttc gttgcttcaa tattaaatca CAATTGAAAC GAAGTTCAGT TATTTATAAA ttaaaaaaat TTATTTGATg TGAAGAACAA attgttgtaa aattatatga gaagtattaa tacaatggaa tggatggata actggaaatt . . 1001: 1051: 1101: 1151: 1201: 1251: 1301: 1351: 1401: 1451: 1501: 1551: 1601: 1651: 1701: 1751: Intron Intron Intron Intron Intron Intron . 
aagttaaaga TATTGGAATA ttaaaaattg aaattcaatt AAAGAGCAAT GCTCAATTAA ACAATTATTT TTGATAAATA GCTTCATTTT CGGTAAACCT TTAGGCCACC GGTTTCACTT AAAATTATTN aaaaaaaaaa ttattatagC atgggacaa 1: 2: 3: 4: 5: 6: + 54 + 579 + 650 + 711 +1087 +1628 aaaaggaaga TATACCGGAA aaggatcaaa ttagTAATAA AGAATTATTT TTGAATTCAA ACAATGATTA TATGACATTT TACATACTAT GATAATATTT AGTTTGGGAA ACCAAAATCA GGAAATCTAA aaaaattaat CATCATTTAT aaatccaaat AAAGAAAGTT attatttttt CAAGTTTTTT GGCCCGGGTG TGCAGCAATA GAAATACCAA CATAAATTAA TGGTTGGATT TTTATGATTG ATGATTTTTA TTTTTAATAA TTTNGNAgta tattttttat TTATNGGATT tatattttta TTCATAgttt atatctttat AAATGTTCAT TATATATAAC ATTTTAATGA ATTTAAATTT TTGGTTATAC GTTGGTATGG TTTAGCACCT ACCGTTTACC TTATGGCAAT agtttttttt tatataattt TTATgtttna aagaagAAAA aaaaagatat tttttattat GCAAATAATA AAGAATTGCA CAATGTGTAA TTATTTCCAG ATTAATCATT CAGTTGCNCC CATTTTAAAT AGGTGTAACA TTTATCTTTA tttaaaaaaa tatagttatt ttaattttac - 401 - 612 - 683 + 782 -1046 -1164 -1709 (poor quality) 45 Further Intron Finding Options • • • • • • • Exhaustive parsing of sequence 400 base sequence 50 acceptor/donors 20 donor/acceptors 5 minutes on P750/.5GB 24 donor/acceptors 1 day 30 donor/acceptors ~year Hybrid solution: rank top 20 d/a sites and parse Use protein/predicted gene homology to edit results 46 Domain Finding with HMMs • Basic Elements of Method • Example from Defensin Genes 47 Antimicrobial Proteins and Peptides Lysozyme, lactoferrin, SLPI, PLA2, SP-A, SP-D, LL37, BPI, a- and ß-defensins, inorganics, immunoglobulins Macrophages ? 
Defensins CCL20 48 T cells DCs Functions of defensins Comprise an ever-ready shield at mucosal surfaces Antimicrobial effects: disrupt bacterial cell walls, sequester nutrients, act as decoys for microbial attachment, enhance phagocytosis Prevent attachment, colonization or infection Constitutive and/or inducible expression Cross-talk to adaptive immune system Synergy or additivity among factors Alterations in these properties may contribute to disease 49 Genomics Approach to Defensin Gene Discovery - Rationale Defensin gene discovery in humans has generally proceeded from identification of the protein All known defensin genes in humans cluster to a <1 Mb region on 8p22-p23 It is likely that not all defensin genes are known Hypothesis: Novel defensins in the gene cluster can be found using a computational genomics-based strategy 50 Structure of mature b-defensin peptides C1-C5 C2-C4 GAL3 DEFB3 DEFB1 BNBD12 DEFB2 EP2E TQCRIRGGFC YYCRVRGGRC YNCVSSGGQC LSCGRNGGVC VTCLKSGAIC TICRMQQGIC Consensus hSC+xxxGhC hhhxCPxxx+ QIGTCxxxxh +CC+ T1 b1 RVGSCRFPHI AVLSCLPKEE LYSACPIFTK IPIRCPVPMR HPVFCPRRYK RLFFCHSGEK C3-C6 T2 T3 AIGKCATFIS QIGKCSTRGR IQGTCYRGKA QIGTCFGRPV QIGTCGLPGT KRDICSDPWN b2 b-loop b-bulge -CCGRAYEV(+20) KCCRRKK KCCK KCCRSW KCCKKP RCCVSNTDE(+14) b3 51 Structure of leader sequence of b-defensin proteins EP2C EP2E TAP Defb4 GAL1 DEFB2 DEFB1 MRQRLLPSVTSLLLVALLFPGSS MKVFFLFAVLFCLVQTNSGDVPP MRLHHLLLALLFLVLSAWSGFTQ MRIHYLLFTFLLVLLSPLAAFTQ MRIVYLLLPFILLLAQGAAGSSQ MRVLYLLFSFLFIFLMPLPGVFG MRTSYLLLFTLCLLLSEMASGGN Consensus MRhxxLLhhhhhhhhhxxxxxxx 52 Genome approach for discovering b-defensin genes Known genes HUMAN DEFB1 DEFB2 MOUSE Defb1 Defb2 Defb3 Defb4 Defb5 BLAST HTGS DEFB1 DEFB2 DEFB3 EP2D DEFB4 DEFB5 DEFB6 DEFB7 DEFB8 DEFB9 EP2C EP2D Defbp1 DEFB10 DEFB11 DEFB12 DEFB13 DEFB14 DEFB15 DEFB16 DEFB17 DEFB18 DEFB19 DEFB20 DEFB21 DEFB22 DEFB23 DEFB24 DEFB25 DEFB26 DEFB27 DEFB28 DEFB29 Markov Celera Defb1 Defb2 Defb3 Defb4 Defb5 Defb6 Defb7 Defb8 Defb9 Defb10 Defb11 
Defb12 Defb13 Defb14 Defb15 Defb16 Defb17 Defb18 Defb19 Defb20 Defb21 Defb22 Defb28 Defb31 Defb32 Defbp1 BACs DEFB1 DEFB2 DEFB3 EP2D DEFB4 DEFB5 DEFB6 DEFB7 DEFB8 DEFB9 EP2C EP2D Defbp1 DEFB10 DEFB11 DEFB12 DEFB13 DEFB14 DEFB15 DEFB16 DEFB17 DEFB18 DEFB19 DEFB20 DEFB21 DEFB22 DEFB23 DEFB24 DEFB25 DEFB26 DEFB27 DEFB28 DEFB29 GA-contigs Defb1 Defb2 Defb3 Defb4 Defb5 Defb6 Defb7 Defb8 Defb9 Defb10 Defb11 Defb12 Defb13 Defb14 Defb15 Defb16 Defb17 Defb18 Defb19 Defb20 Defb21 Defb22 Defb28 Defb31 Defb32 Defbp1 Defb23 Defb24 Defb25 Defb26 Defb27 Defb29 Defb30 Defb33 Defbp2 Defbp3 36 33 53 Chromosomal localization of b-defensin genes 6p11-p21 Mouse 1 8p23 Mouse 8 20q11 Mouse 2 54 TEL 10.5 EP2C HE2b1/EP2D EP2E b 16 16 16 44 D8S542 34 115c21 179c23 16g12 44n19 2541m15 397k22 372k15 24f4 207i3 2629i16 633e22** 540n10** 561b17** 332a23** 877e9 415d8 b 15 15 GCT10E01 D8S1469 D8S503 10 4 D8S1825 33 8 3 D8S351 31 2 D8S277 D8S1511 D8S561 DEFB1 D8S1819/D8S439 DEFA6 DEFA4 DEFA1/3 DEFA7 DEFA5 D8S1706 HE2/EP2 DEFB3 DEFB2 8 A004x20 cR 7.6 D8S1099 D8S1742 cM 1 WI-4625 Mb 211c9 458d3 3023L14 398f12 398f10 399g23 556o5 540e4 776f21 351i21 177k12* 18L2 295j18* 62h7 449o20* 429b7 8o7 10a14 497j4* 115j16 324n11* 375n15 10 kb b b DEFB3 DEFB2 55 CEN Synteny between human 8p and mouse 8 Chromosome 8p22-p23 (human) BAC 295j18 BAC 324n11 Chromosome 8 (mouse) GA_x5J8B7W6WMR 56 Synteny between human 6p21 and mouse 1 Chromosome 6p21 (human) BAC RP11-397g17 Chromosome 1 (mouse) GA_x5J8B7W3NRM 57 Synteny between human 20q11 and mouse 2 Chromosome 20q11.1 (human) BAC RP5-854e16 BAC RP5-1018d12 BAC RP5-1093g12 Chromosome 2 (mouse) GA_x5J8B7W3FJ8 58 Human and Mouse b-defensin alignment – all 69 genes EP2d _c 8 EP2e _c 8 EPm2d _c 8 DEFB5 _c 8 Defbm12_c 8 Defbm13_c 8 DEFB11 _c 6 Defbm17_c 1 DEFB12 _c 6 DEFB14 _c 6 EP2c _c 8 EPm2c _c 8 DEFB10 _c 6 Defbm16_c 1 DEFB13 _c 6 Defbm18_c 1 Defbm28 DEFB9 _c 8 DEFB27 _c20 DEFB17 _c20 Defbm19_c 2 DEFB18 _c20 Defbm21_c 2 DEFB20 _c20 DEFB4 _c 8 DEFB1 _c 8 Defbm1 _c 8 
Defbm7 _c 8 Defbm8 _c 8 Defbm2 _c 8 Defbm31 Defbm9 _c 8 Defbm10_c 8 Defbm15_c 8 Defbm3 _c 8 Defbm4 _c 8 Defbm6 _c 8 DEFB2 _c 8 Defbm5 _c 8 DEFB3 _c 8 Defbm14_c 8 DEFB16 _c20 Defbm29 DEFB8 _c 8 DEFB29 _c20 Defbm23_c 2 DEFB28 _c20 Defbm20_c 2 DEFB15 _c20 Defbm32 DEFB25 _c20 Defbm26_c 2 DEFB24 _c20 Defbm25_c 2 DEFB6 _c 8 Defbm11_c 8 Defbm30 DEFB21 _c20 DEFB19 _c20 Defbm24_c 2 DEFB22 _c20 Defbm27_c 2 DEFB23 _c20 TI CRMQ--Q GICRLF-FCHSGEKKRDICSDPWNR CCVSNT TI CRMQ--Q GICRLF-FCHSGEKKRDICSDPWNR CCVSNT TV CLMQ--Q GHCRLF-MCRSGERKGDICSDPWNR CCVPYS ES CKLG--R GKCRK--ECLENEKPDGNCRL-NFL CCRQRI ET CRLG--R GKCRR--TCIESEKIAGWCKL-NFF CCRERI FL CKKM--N GQCEA--ECFTFEQKIGTCQA-NFL CCRKRRE CRIG--N GQCKN--QCHENEIRIAYCIRPGTH CCLQQKE CKMR--R GHCKL--QCSEKELRISFCIRPGTH CC---KS CTAI--G GRCKN--QCDDSEFRISYCARPTTH CCV--DR CTKR--Y GRCKR--DCLESEKQIDICSLPRKI CC---VD CRRS--E GFCQE--YCNYMETQVGYCSKKKDA CCLH-VN CKKS--E GQCQE--YCNFMETQVGYCSKKKEP CCLH--- CEKV--R GICKT--FCDDVEYDYGYCIKWRSQ CCV--ER CEKV--R GMCKT--VCDIDEYDYGYCIRWRNQ CCI--RE CQLV--R GACKP--ECNSWEYVYYYCN--VNP CC---HK CSLV--R GTCKS--ECNSWEYKYNYCH--TEP CCVVRE RT CFYG--L GKCRR--ICRANEKKKERCGE-RTF CCLRET GH CLNL--S GVCRRD-VCKVVEDQIGACRR-RMK CCRAWW KK CWNNYVQ GHCRK--ICRVNEVPEALCEN-GRY CCLNIK KS CWII--K GHCRK--NCKPGEQVKKPCKN-GDY CCIPSN KACWVL--R GHCRK--HCRSGERVRKPCSN-GDY CC---KK CWNR--S GHCRK--QCKDGEAVKDTCKN-LRA CCIPSN KR CLKI--L GHCRR--HCKDGEMDHGSCKY-YRV CCVPDL VE CW-M--D GHCRL--LCKDGEDSIIRCRN-RKR CCVPSR RI CGYG--TARCRK--KCRSQEYRIGRCPN-TYA CCLRKYN CVSS--G GQCLYS-ACPIFTKIQGTCYRGKAK CCK--YK CLQH--G GFCLRS-SCPSNTKLQGTCKPDKPN CCKS-TR CYKF--G GFCHYN-ICPGNSRFMSNCHPENLR CCKNIK AR CYKF--G GFCYNS-MCPPHTKF IGNCHPDHLH CCINMK DH CHTN--G GYCVRA-ICPPSARRPGSCFPEKNP CCKYMK --CRSW-- GTCSIAAICFDSLSRRGQCGPVKDP CCPL-ER CHKK--G GYCYF--YCFSSHKK IGSCFPEWPR CCKNIK VS CIRN--G GICQ-Y-RCIGLRHK IGTCGSP-FK CCK--RA CYRE--G GEC--L-RCIGLFHK IGTCNFR-FK CCKFQVS CLRK--G GRCWN--RCIGNTRQ IGSCGVPFLK CCKRKIT CMTN--GAICWG--PCPTAFRQ IGNCGHFKVR CCKIRVT CMSY--G GSCQR--SCNGSFRLGGHCGHPKIR 
CCRRKVT CLKS--GAICHPV-FCPRRYKQ IGTCGLPGTK CCKKPVS CCMI--GGICRY--LCKGNILQNGNCGVTSLN CCKRKYY CRVR--G GRCAVL-SCLPKEEQIGKCSTRGRK CCRRKK FF CRIR--G GRCAVL-NCLGKEEQIGRCSNSGRK CCRKKK NP CELY--Q GMCRN--ACREYEIQYLTCPN-DQK CCLKLS IA CELY--Q GLCRN--ACQKYEIQYLSCPK-TRK CCLKYEI CERP--N GSCRD--FCLETEIHVGRCLN-SRP CCLPLG RR CLMG--L GRCRD--HCNVDEKEIQKCKM-KK- CCVGPK KR CLVG--F GKCKD--SCLADETQMQHCKA-KK- CCIGPK KK CFNK-VT GYCRK--KCKVGERYEIGCLS-GKL CCANDE -R CFSN-VE GYCRK--KCRLVEISEMGCLH-GKY CC---RR CYYG--T GRCRK--SCKEIERKKEKCGE-KHI CCVPKE KL CLDQ--KDTCPDSRTCLEGTQ---PCHPHHPN CCESSQK CWKN-NV GHCRR--RCLDTERYILLCRN-KLS CCISII -K CWKN-SL GYCRV--RCQEEERYIYLCKN-KVS CCIHRT KR CWKG--Q GACQT--YCTRQETYMHLCPD-ASL CCLSYA KR CWNG--Q GACRT--FCTRQETFMHLCPD-ASL CCLSYS EK CNKL--K GTCKN--NCGKNEELIALCQK-SLK CCRTIQ EK CSRV--N GRCTA--SCLKNEELVALCQK-NLK CCVTVQ DT CWKL--K GICRN--TCQKEEIYHIFCG-IQSL CCLEKK MK CWGK--S GRCRT--TCKESEVYYILCKT-EAK CCVDPK LR CMGN--S GICRA--SCKKNEQPYLYCRN-CQSCCLQSY LQ CMGN--R GFCRS--SCKKSEQAYFYCRT-FQM CCLQSY ET CWNF--R GSCRD--ECLKNERVYVFCVS-GKL CCLKPK ER CWKS--F GVCRE--ECAKKESFYIFCWN-GKL CCVKPK QR CWNL--Y GKCRY--RCSKKERVYVYCIN-NKM CCVKPK 59 ESTs provide sequence for exon 1 Chromosome 8 cluster Gene Name Exon 1 aa sequence (exon 2) C pattern EST Exprssion DEFB1 MRTSYLLLFTLCLLLSEMASGGNxxxxxxxxxFLTGLGHRSDHYNCVSSGGQCLYSACPIFTKIQGTCYRGKAKCCK 6 4 9 6 ai688359,ai688522, epithelia,ai733355,ai3 kidney DEFB7 MKIFFFILAALILLAQIFQG xxLKTNCFLYLARTAIHRALISKRMEGHCEAE-CLTFEVKIGGCRAELAPFCCKNRKKH 21 3 9 7 no ESTs TFPGKLPQQLFLGTGEFAVCESCKLGRGKCRKE-CLENEKPDGNCRLNFL-CCRQRI DEFB5 6 3 9 5 no ESTs xxxxxxxxxxxxxAKNAFFDEKCNKLKGTCKNN-CGKNEELIALCQKSLK-CCRTIQPCGSIID DEFB6 MRTFLFLFAVLFFLTP 6 3 9 5 aw103145, lung/testis ai910580 EST DEFB4 (Forss man) MQRLVLLLAISLLLYQDLPG xxxxxxxxxxYLVRSEFELDRICGYGTARCRKK-CRSQEYRIGRCPNTYA-CCLRK 6 3 9 5 no ESTs EP2c MRQRLLPSVTSLLLVALLFPG xxxxxxxxxxxxEPASDLKVVDCRRSEGFCQEY-CNYMETQVGYCSKKKDACCLH 6 3 9 6 1st exon aa778602, epithelia aa400545 EP2d/HE2b1 
MRQRLLPSVTSLLLVALLFPGSSxxxxxxxxxxxxxxxxxxxxTICRMQQGICRLFFCHSGEKKRDICSDPWNRCCVSNTDE 6 4 9 6 aa778602 testis EP2e MKVFFLFAVLFCLVQTNSGDVPPxxxxxxxxxxxxxxxxxxxxTICRMQQGICRLFFCHSGEKKRDICSDPWNRCCVSNTDE 6 4 9 6 2nd exon aa176631, testis be044355, ai DEFB3 MRIHYLLFALLFLFLVPVPG xxxxxxxxxGHGGIINTLQKYYCRVRGGRCAVLSCLPKEEQIGKCSTRGRKCCRRKK 6 4 9 6 epithelia DEFB2 MRVLYLLFSFLFIFLMPLPG xxxxxxxxxxxGVFGGIGDPVTCLKSGAICHPVFCPRRYKQIGTCGLPGTKCCKKP 6 4 9 6 bf08889, bf088086, epithelia,be714509 head/neck xxxxxxxxxxxxxxxxxxxxxxxxx*RCVCVLNVCSTSLKQIGTYGHDRIKCCKK DEFBp1 YLLFSFRFVFLMPLP pseudogene xxxxxxxxxxxLHVAKGKFKEICERPNGSCRDF-CLETEIHVGRCLNSRP-CCLPLGHQPRIESTTPKKD DEFB8 6 3 9 5 aa406058 possible EST testis xxxxxxxxxxxxxGGLGPAEGHCLNLSGVCRRDVCKVVEDQIGACRRRMK-CCRAWWILMSIPTPLIMSDYQEPLKPKLK DEFB9 6 4 9 5 aw383156 head_neck Chromosome 6 cluster Gene Name Exon 1 DEFB10 DEFB11 DEFB12 DEFB13 DEFB14 Chromosome 20 Cluster Gene Name Exon 1 DEFB15 DEFB16 DEFB17 DEFB18 DEFB19 DEFB20 MKLLLLALPMLVSYPKZSQ MKLLYLFLAILLAIEEPVIS May share with 20-5 DEFB21 DEFB22 DEFB23 maybe 3 exons MKLLLLTLTVLLLLSQLTP DEFB24 DEFB25 DEFB26 DEFB27 MKSLLFTLAVFMLLAQLVS MGLFMIIAILLFQKPT DEFB28 DEFB29 MKLLFPIFASLMLQYQVNT aa sequence (exon 2) xxxxxxxxxxxxxxxxxxxFERCEKVRGICKTF-CDDVEYDYGYCIKWRSQCCV xxxxxxxxxxxxxxxxxDLRRECRIGNGQCKNQ-CHENEIRIAYCIRPGTHCCLQQ xxxxxxxxxxxxxxxxxxxxWKSCTAIGGRCKNQ-CDDSEFRISYCARPTTHCCVTECDP xxxxxxxxxxxxxxxxxxxKRECQLVRGACKPE-CNSWEYVYYYCNVNP--CCAVWE xxxxxxxxxxxxxTCTLVNADRCTKRYGRCKRD-CLESEKQIDICSLPRKICCTEKL C 6 6 6 6 6 pattern 3 9 6 no 3 9 6 no 3 9 6 no 3 9 4 no 3 9 6 no EST ESTs ESTs ESTs ESTs ESTs Exon 2 xxxxxxxxxxxGWIRRCYYGTGRCRKSCKEIERKKEKCGEKHICCVPKEKDKLSHIHDQKETSELYI 6 3 9 5 no ESTs GLFRSHNGKSREPWNPCELYQGMCRNACREYEIQYLTCPNDQKCCLKLSVKITSSKNVKEDYDSNSNLSVTNSSSYSHI 6 3 9 5 no ESTs xxxxxxxxxxxxSQKSCWIIKGHCRKNCKPGEQVKKPCKNGDYCCIPSNTDS 6 3 9 5 no ESTs xxxxxxxxxxxxGEKKCWNRSGHCRKQCKDGEAVKDTCKNLRACCIPSNEDHRRVPATSPTPLSDSTPGIIDDILTVRFTTDYFEVSSKKDMVEESEAGRGT 6 3 9 5 AA335178, Epididymis, AI220434 Pooled NFL 
xxxxxxxxxxKRHILRCMGNSGICRASCKKNEQPYLYCRNCQSCCLQSYMRISISGKEENTDWSYEKQWPRLP 6 3 9 5 AA939044, Pooled AW193716, NFL,AIPooled 807541, Ger xxxxxxxxxxxxVKSVECWMDGHCRLLCKDGEDSIIRCRNRKRCCVPSRYLTIQPVTIHGILGWTTPQMSTTAPKMKTNITNR 5 3 9 5 AW070283, Pooled AA834919, NFL,H92063 Testis, Re xxxxxxxxxxxxxxMKCWGKSGRCRTTCKESEVYYILCKTEAKCCVDPKYVPVKPKLTDTNTSLESTSAV 6 3 9 5 AI476463 Pooled NFL xxxxxxxxxxxxRIETCWNFRGSCRDECLKNERVYVFCVSGKLCCLKPKDQPHLPQHIKN 6 3 9 5 AI989655, Pooled AW236570 Germ Cell Tumor xxxxxxxxxxxxGTQRCWNLYGKCRYRCSKKERVYVYCINNKMCCVKPKYQPKERWWPF 6 3 9 5 AA933749, Pooled AA970840, GermBF08527, Cell Tumor AA xxxxxxxxxxxxEFKRCWKGQGACQTYCTRQETYMHLCPDASLCCLSYALKPPPVPKHEYE6 3 9 5 no ESTs xxxxxxxxxxFEPQKCWKNNVGHCRRRCLDTERYILLCRNKLSCCISIISHEYTRRPAFPVIHLEDITLDYSDVDSFTGSPVSMLNDLITFDTTKFGETMTP 7 3 9 5 AA935636(not Pooled cysteine NFL domain) xxxxxxxxxxNWYVKKCLNDVGICKKKCKPEEMHVKNGWAMCGKQRDCCVPADRRANYPVFCVQTKTTRISTVTATTATTTLMMTTASMSSMAPTPVSPTG 6 3 13 5 AA994981, Testis, AA846419, Pooled AA453384, NFL A xxxxxxxxTEQLKKCWNNYVQGHCRKICRVNEVPEALCENGRYCCLNIKELEACKKITKPPRPKPATLALTLQDYVTIIENFPSLKTQST 8 3 9 5 AI694319, Testis, AA812652, Pooled AA454191, NFL A 60 xxxxxxxxxxxxLKKCFNKVTGYCRKKCKVGERYEIGCLSGKLCCANDEEEKKHVSFKKPHQHSGEKLSVLQDYIILPTITIFTV 7 3 9 5 no ESTs xxxxxxxxxxxxxxRRCLMGLGRCRDHCNVDEKEIQKC-KMKKCCVGPKVVKLIKNYLQYGTPNVLNEDVQEMLKPAKNSSAVIQRKHILSVLPQIKSTSFF 6 3 9 4 AA401404, Testis, AA446332, Pooled AA399988, NFL A – Summary – Gene Discovery with HMMs • Increased number of defensin genes in mouse and human from 7 to 69 • Genomic searches based solely on BLAST may miss genes related by tertiary structure • Hidden Markov Tool is a more reliable approach for identifying gene families related by tertiary structure 61 “Curing” Disease and Finding New Treatments I. “Curing” disease – know the disease-causing gene(s) – diagnose with genetic test (before onset) – preempt entire disease with intervention (therapy or lifestyle advice) II. 
Finding new treatments of disease
  – know the gene(s)
  – understand the biological pathway like never before
    a. identify existing drug candidates that interact
    b. precisely design a new drug from a molecular basis

“Curing” Disease and Finding New Treatments
• After all the analysis and data visualization….
  – Make some decisions:
    1. Is this a (strongly) genetic phenomenon?
    2. Is/are there regulating “known” gene(s)?
    3. Can they be prioritized for further study?
    • Can the pathway be deduced or refined?
    • Are there existing related products/drugs?
• BUT, where do we obtain candidate “targets”?….