Neural Network Applications
Lecture 10, CS567

• Problems
• Input transformation
• Network Architectures
• Assessing Performance

Problems

• Deducing the genetic code
• Predicting genes
• Predicting signal peptide cleavage sites

Deducing the genetic code

• Problem: Given a codon, predict the corresponding amino acid
• Of didactic value
  – Trivial mapping table, after the fact
• Perfect classification problem, rather than prediction
  – With minimal network
• Learning issues
  – ‘Similar’ codons code for ‘similar’ amino acids
  – Abundance of amino acids proportional to code redundancy (this and the previous point reduce the effect of mutations)
  – Third base ‘wobble’
  – N:1 mapping between codon and amino acid

The genetic code

First   Second base
base    T             C             A             G
T       TTT Phe (F)   TCT Ser (S)   TAT Tyr (Y)   TGT Cys (C)
        TTC Phe (F)   TCC Ser (S)   TAC Tyr (Y)   TGC Cys (C)
        TTA Leu (L)   TCA Ser (S)   TAA Ter       TGA Ter
        TTG Leu (L)   TCG Ser (S)   TAG Ter       TGG Trp (W)
C       CTT Leu (L)   CCT Pro (P)   CAT His (H)   CGT Arg (R)
        CTC Leu (L)   CCC Pro (P)   CAC His (H)   CGC Arg (R)
        CTA Leu (L)   CCA Pro (P)   CAA Gln (Q)   CGA Arg (R)
        CTG Leu (L)   CCG Pro (P)   CAG Gln (Q)   CGG Arg (R)
A       ATT Ile (I)   ACT Thr (T)   AAT Asn (N)   AGT Ser (S)
        ATC Ile (I)   ACC Thr (T)   AAC Asn (N)   AGC Ser (S)
        ATA Ile (I)   ACA Thr (T)   AAA Lys (K)   AGA Arg (R)
        ATG Met (M)   ACG Thr (T)   AAG Lys (K)   AGG Arg (R)
G       GTT Val (V)   GCT Ala (A)   GAT Asp (D)   GGT Gly (G)
        GTC Val (V)   GCC Ala (A)   GAC Asp (D)   GGC Gly (G)
        GTA Val (V)   GCA Ala (A)   GAA Glu (E)   GGA Gly (G)
        GTG Val (V)   GCG Ala (A)   GAG Glu (E)   GGG Gly (G)

Source: http://molbio.info.nih.gov/molbio/gcode.html

Network Architecture

• Orthogonal coding (4 × 3)
• 2 hidden neurons (Is this a linear or non-linear problem?)
• 20 output neurons
  – Winner takes all
• Total of 86 parameters (How?)
• Feed-forward network trained with backpropagation (FFBP)

Deducing the genetic code (Fig 6.7)

Deducing the genetic code (Fig 6.8)

Improving classification error

• Training rate high for misclassified codons, low otherwise (in addition to iteration dependence)
• Balanced cycles (balanced in terms of amino acids, not codons)
• Adaptive training
  – Present misclassified examples more often

Is it a gene or not a gene?

• Approaches depend on
  – Bias at junctions of coding and non-coding regions
    • Donor (5’ end of intron) and acceptor sites (3’ end of intron) have biases in composition (GT [junk]+ C/U+ AG)
  – Bias in composition of coding regions (but not of non-coding regions, e.g., introns)
    • Exons are “regular guys”, introns are “freshman dorm rooms”
    • Seen as GC bias, codon usage frequency and codon bias
  – Inverse relationship between the two (splice site strength and regularity within exons)
    • “Food exit sign on highway doesn’t need prominent restaurant signs”
    • “Stretch of prominent restaurant signs doesn’t need a sign indicating food”

Regularity within coding regions (Fig 6.11)
Panels: Bacteria, C. elegans, Mammals, A. thaliana

Predicting Exons: The holy GRAIL

• Neural networks for gene prediction
  – Input representation/transformation is key
  – NN per se trivial: MLP with a single hidden layer and a single output neuron
  – Input = coding region candidate, transformed to
    • 6mer (di-codon) score of candidate region
    • 6mer (di-codon) score of flanking regions
    • GC composition of candidate region
    • GC composition of flanking region
    • Markov model score
    • Length of candidate
    • Splice site score

Signal peptide (SignalP) prediction

• Signal peptides are N-terminal subsequences in proteins that act as “export tags”, including a “dotted line” (cleavage site) indicating the point of detachment
  – Coding is species specific
• Problem analogous to exon/intron delineation
  – Distinguish between the signal peptide and the rest of the protein
  – Find the junction between the signal peptide and the rest of the protein

Signal peptide (SignalP) prediction

• Two kinds of network that output, for each position,
  – S-score: Probability of classification as signal peptide
  – C-score: Probability of being the junction
• Key is post-processing: using S and C scores to come up with the final prediction
• C-score prediction: Based on asymmetric windows (why?)
• S-score prediction: Based on symmetric windows (why?)
• Y-score: $Y_i = (C_i \, \Delta_d S_i)^{1/2}$, where $\Delta_d S_i$ is the average difference in $S_i$ between the windows of size $d$ flanking position $i$

Signal peptide (SignalP) prediction (Fig 6.5)
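
Returning to the Network Architecture slide: the "(4 × 3)" orthogonal coding and the "86 parameters (How?)" question can be worked through with a small sketch. It assumes a fully connected network with one bias per hidden and per output neuron, which is one reading that reproduces the stated total; the function names and the T, C, A, G base ordering are illustrative choices, not taken from the lecture.

```python
BASES = "TCAG"  # assumed ordering; any fixed ordering works


def one_hot_codon(codon):
    """Orthogonal (one-hot) coding: 4 inputs per base, 3 bases -> 12 inputs."""
    vec = []
    for base in codon:
        vec.extend(1 if base == b else 0 for b in BASES)
    return vec


def mlp_parameter_count(n_in, n_hidden, n_out):
    """Weights plus biases of a fully connected net with one hidden layer."""
    hidden_layer = n_in * n_hidden + n_hidden   # 12*2 weights + 2 biases = 26
    output_layer = n_hidden * n_out + n_out     # 2*20 weights + 20 biases = 60
    return hidden_layer + output_layer


encoded = one_hot_codon("ATG")
total = mlp_parameter_count(n_in=12, n_hidden=2, n_out=20)
print(encoded)
print(total)
```

Under these assumptions the count is 26 + 60 = 86, matching the slide.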
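
The "balanced cycles" idea from the Improving classification error slide (balanced by amino acid, not by codon) might look like the following sketch. It assumes that stop codons are left out of training, so that the classes match the 20 output neurons, and that one codon per amino acid is drawn at random in each cycle; both are assumptions, as are all names in the code.

```python
import random

BASES = "TCAG"
# Standard genetic code in T, C, A, G order; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def codons_by_amino_acid(table):
    """Group sense codons by amino acid (stop codons skipped, an assumption)."""
    groups = {}
    for codon, aa in table.items():
        if aa != "*":
            groups.setdefault(aa, []).append(codon)
    return groups


def balanced_cycle(groups, rng):
    """One training cycle with exactly one (codon, amino acid) pair per class."""
    cycle = [(rng.choice(codons), aa) for aa, codons in groups.items()]
    rng.shuffle(cycle)
    return cycle


GROUPS = codons_by_amino_acid(CODON_TABLE)
cycle = balanced_cycle(GROUPS, random.Random(0))
print(len(cycle), "examples in one balanced cycle")
```

Sampling this way means that amino acids with six codons (such as Leu) and those with one codon (such as Met) are seen equally often per cycle.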
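
For the GRAIL slide, the point that input transformation matters more than the network itself can be illustrated with three of the simpler listed inputs: GC composition of the candidate, GC composition of the flanks, and candidate length. This is only a sketch; the function names, the `flank` width, and the dictionary layout are hypothetical, and the di-codon, Markov model, and splice site scores are omitted because they need trained frequency tables.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def candidate_features(sequence, start, end, flank=10):
    """Turn a candidate region sequence[start:end] into a few numeric inputs."""
    candidate = sequence[start:end]
    left_flank = sequence[max(0, start - flank):start]
    right_flank = sequence[end:end + flank]
    return {
        "gc_candidate": gc_fraction(candidate),
        "gc_flanks": gc_fraction(left_flank + right_flank),
        "length": len(candidate),
    }


# Toy sequence: a GC-rich stretch surrounded by AT-rich flanks.
toy = "ATATATAT" + "GCGCGCGC" + "ATATATAT"
features = candidate_features(toy, start=8, end=16, flank=8)
print(features)
```

A vector of such values, one per candidate region, would be what the single-hidden-layer MLP receives instead of the raw sequence.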
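
The Y-score post-processing step can be sketched as follows. The code assumes that the "difference" is the mean S-score over the d positions before position i minus the mean over the d positions starting at i, that windows are truncated at the sequence ends, and that negative differences are clipped to zero before taking the square root. These conventions, the function name, and the toy scores are assumptions for illustration, not the published SignalP procedure.

```python
import math


def y_scores(c, s, d):
    """Combine per-position C- and S-scores: Y_i = sqrt(C_i * delta_i)."""
    y = []
    for i in range(len(c)):
        before = s[max(0, i - d):i]   # up to d positions left of i
        after = s[i:i + d]            # up to d positions from i onward
        if not before or not after:
            y.append(0.0)             # no window on one side (sequence edge)
            continue
        delta = sum(before) / len(before) - sum(after) / len(after)
        y.append(math.sqrt(c[i] * max(delta, 0.0)))
    return y


# Toy scores: S drops from high to low at index 4, where C also peaks.
s_toy = [0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1]
c_toy = [0.0, 0.1, 0.1, 0.2, 0.8, 0.1, 0.0, 0.0]
y_toy = y_scores(c_toy, s_toy, d=3)
best = max(range(len(y_toy)), key=y_toy.__getitem__)
print(best)
```

The product rewards positions where a high C-score coincides with a sharp drop in S-score, so an isolated C-score peak inside a flat S-score region scores low.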