Download PowerPoint file

Gene Predictor Date:20/11/2003 Implemented By: Zohar Idelson Supervisor: Dr. Yizhar Lavner Winter - Summer 2003 1 Genomic Signal Processing • Genomic Signal Processing is a relatively new • field in Bioinformatics, in which signal processing algorithms and methods are used to study functional structures in the DNA. An appropriate mapping of the DNA sequence into one or more numerical sequences, enables the use of many digital signal processing tools. DNA Segment atgcggatttgccgtcgatgtc… Gene Predictor DNA Segment Gene Gene 2 DNA Basics • DNA in Eukaryotes is organized in chromosomes. • The DNA in each chromosome can be read as a discrete signal to • • • • • {a,t,c,g}. (For example: atgatcccaaatggaca…). In exons (protein-coding region), during the biological amino acids building, those letters are read as triplets (codons). Every codon signals which amino acid to build (there 20 aa). There are 6 ways of translating DNA signal to codons signal, called the reading frames (3 * 2 directions). Every gene start with a start-codon and ends with a stop-codon. An exon cannot consists of more than one stop-codon. Non coding areas (majority usually) has a lot more random behavior than genes. Most of the DNA is non coding. Genes can be detected by some statistics regularities, like codon usage, nucleotide usage, periodicity and data base comparison. 3 Organisms • Classified into two types: • Eukaryotes: contain a membrane-bound nucleus and organelles (plants, animals, fungi,…) • Prokaryotes: lack a true membrane-bound nucleus and organelles (single-celled, includes bacteria) • Not all single celled organisms are prokaryotes! 4 Cells • Complex system enclosed in a membrane • Organisms are unicellular (bacteria, baker’s yeast) or multicellular • Humans: – 60 trillion cells – 320 cell types Example Animal Cell www.ebi.ac.uk/microarray/ biology_intro.htm 5 DNA Basics – cont. • DNA in Eukaryotes is organized in chromosomes. 6 Chromosomes • In eukaryotes, nucleus contains one or several double stranded DNA molecules orgainized as chromosomes • Humans: – 22 Pairs of autosomes – 1 pair sex chromosomes Human Karyotype http://avery.rutgers.edu/WSSP/StudentScholars/ Session8/Session8.html 7 www.biotec.or.th/Genome/whatGenome.html 8 What is DNA? • DNA: Deoxyribonucleic Acid • Single stranded molecule (oligomer, polynucleotide) chain of nucleotides • 4 different nucleotides: – – – – Adenosine (A) Cytosine (C) Guanine (G) Thymine (T) 9 Nucleotide Bases • Purines (A and G) • Pyrimidines (C and T) • Difference is in base structure Image Source: www.ebi.ac.uk/microarray/ biology_intro.htm 10 DNA 11 12 Genome • chromosomal DNA of an organism • number of chromosomes and genome size varies quite significantly from one organism to another • Genome size and number of genes does not necessarily determine organism complexity 13 Genome Comparison ORGANISM CHROMOSOMES GENOME SIZE GENES Homo sapiens (Humans) 23 3,200,000,000 ~ 30,000 Mus musculus (Mouse) 20 , 2600,000,000 ~30,000 Drosophila melanogaster (Fruit Fly) 4 180,000,000 ~18,000 Saccharomyces cerevisiae (Yeast) 16 14,000,000 ~6,000 Zea mays (Corn) 10 2,400,000,000 ??? 14 15 DNA Basics – cont. • The DNA in each chromosome can be read as a discrete signal to {a,t,c,g}. (For example: atgatcccaaatggaca…) 16 DNA Basics – cont. • In genes (protein-coding region), during the construction of proteins by amino acids, these nucleotides (letters) are read as triplets (codons). Every codon signals one amino acid for the protein synthesis (there are 20 aa). 17 DNA Basics – cont. • There are 6 ways of translating DNA signal to codons signal, called the reading frames (3 * 2 directions). …CATTGCCAGT… 18 DNA Basics – Cont. …CATTGCCAGT… Start: ATG Stop: TAA, TGA, TAG gene Exon Intron Exon Intron Exon Exon 19 The Problem • Given unannotated DNA, find the genes. • In practice, find the exons and their RF. • Smaller scale problem: given some annotated DNA of a creature, find the exons of unannotated DNA of the same creature. atgcggatttgccgtcgatgtc… Gene Predictor Exon Exon 20 Solution Scheme • Solution scheme: – Work in windows analysis. – Find parameters that gives a good prediction in annotated DNA (of the same organism). Learn how to distinguish exons regions from non-exons regions. – Extract those parameters from the unannotated DNA, and use the discrimination rule in order to predict. • Almost all methods shown here fit to this scheme. 21 Creatures in the Project C. elegans S. cerevisiae (yeast) 22 Existing Methods • Many methods relies on the pseudo periodicity of 3 in genes. For that we define: – Ub is the binary indicator series for base B. – UB is the STFT of ub. • N, the window size, is in the hundreds. Exons size is in order of 101…103 (in S. Cerevisiae). • Overlapping windows. – There exists a connection between the DFT in k = N/3 frequency and nucleotides usage. 23 Calculating the DFT of a DNA sequence* N 1 n 0 X (k )  DFT {x(n)}   x ( n )e N 1 i N U b     ub  n  e  3  n 0 S(n) N 1 i 2 nk N 0  k  N 1 n 0 2 n 3 b  A, T , C , G ATCGTACAGCTGCAAAGCATAGATTCGGTCACAGTTG… uA(n) 1000010100000111001010100000001010000 uT(n) 01001000001… uC(n) 0010001001… uG(n) 000100001001… 1 N DFT   N 3 1 N UA   N  3  1 N UT   1 N N  3  UC   N  3 1 U N G   N *Silverman and Linsker 1986; Voss 1992  3  24 Spectrogram Position(nucleotides) A way for showing the amplitude of UA, UC, UG and UT together. Linear Transform to RGB. Magnitude is represented by N/3 brightness Finding exons visually: bright horizontal lines, usually in k = N/3 25 Spectrogram – cont. DNA of C. Elegans chr. III versus totally random DNA 26 Power Spectrum 2 2 2 S= A +C +G + T 2 1 1 A = Ua(k),C = Uc(k) ... N N k {0, 1, ... N/2} Difference between gene to non-gene areas is in 1 order of magnitude Used for k = N/3 27 IIR Anti Notch Filtering R 2  2 R cos z 1  z 2 A( z )  1 2 2 1  2 R cos z  R z all-pass 1  A( z ) H ( z)  Anti-notch 2 • IIR anti notch filter aimed to find “peaks” of a chosen frequency 28 Optimized Spectral Content Measure (OSCM) 2 E{ aA  tT  gG }  E{ aAr  tTr  gGr } argmax std ( aA  tT  gG )  std ( aAr  tTr  gGr ) a,t,g W = aA + cC + gG + tT 2 Find good coefficients (a,g,t) for high differentiation between exons and introns. Ignoring C since of the linear dependency in the rest. Ar, Tr, Gr are generated from random DNA sequence, or Introns. Performance: 29 OSCM Example Direction mistake Good forward detection Good reverse detection 30 OSCM Justification • In genes, the 4 complex • • variables A,T,C,G are not all-random and tend to be near a specific angle (phase). In introns, the values of phase seems to be pure random. Those unique angles enable us to detect reading frame as well. 31 Distribution of the phase of the DFT at the freq of 1/3 in the genes of S. Cerevisiae: angular mean = -1.3734 angular deviation = 0.7903 Distribution of arg(A) angular mean = -2.6862 angular deviation = 0.8416 Distribution of arg(C) angular mean = 2.7962 angular deviation = 0.5723 Distribution of arg(T) angular mean = 0.3556 angular deviation = 0.4016 Distribution of arg(G) Argument distributions for all experimental genes in all chromosomes in S. Cerevisiae 32 Distribution of the phase of the DFT at the freq of 1/3 in the introns of S. Cerevisiae: Distribution of arg(A) Distribution of arg(T) Distribution of arg(C) Distribution of arg(G) Argument distribution for non-coding regions in all chromosomes in S. Cerevisiae 33 Fourier Spectra and Position Asymmetry i N U b    f  b,1 1  f  b, 2   e 3 2 3  f  b, 3   e i 2 3 f(b,i) is the frequency of the base b in the codon position i, i=1,2,3. 34 Genes versus Introns Distribution of the DFT of T at 1/3 frequency Distribution of the DFT of G at 1/3 frequency (Data taken from S.Cerevisiae, chr. IV) Introns and intergenic spacers Coding regions genes and exons Magnitude small LARGE Phase Randomly distributed Narrow distribution 35 Finding Reading Frame (OSCM Phase)  = arg(aA + cC + gG + tT) Reading Frame Color 1 Red 2 Green 3 Blue •  Is concentrated around 1, 2 and 3 corresponding to each reading frame. • Lowering the variance of  with the optimization: aA  gG  tT argmax E{ } aA  gG  tT a,t,g • Transforming  to color. • Deriving reading frame by a simple look. 36 New Methods in This Project • Linear prediction • Classification by clustering (CC) • Classification by compression ratios 37 Linear Prediction n x[n]   au A [k ]  cuC [k ]  guG [k ]  tuT [k ] k 1 • Create a walk from the indicator sequences • For each window, find LP coefficients. Look for differences in correlation by: – Poles map – Frequency response – Prediction error • No new findings in this method. 38 Classification by Clustering • Recall: DFT in k=N/3 frequency has a strong correlation with genes locations and reading frames (as shown in part A) • Here we’ll attempt to use it in order to discriminate exons from the rest, in a 6D space • Learning phase: clustering • Classification phase: fuzzy KNN 39 Classification by Clustering Clustering Stage: Example From left to right: C, G and T. S. Cerevisiae 5th chromosome. 40 Classification by Clustering RF =1 (T,C,G) new sample RF =? 1 T Exon? RF =1 C G RF =? 3 Max -120° RF =1 Reading frame (if it’s an exon) RF =? 2 +120° DFT(k=N/3) DFT(k=N/3) DFT(k=N/3) uT uC uG Indicator Indicator Indicator DNA = … atcgtgactagc … Start here ‫סף‬ 41 Classification Rule • Fuzzy KNN: create a fuzzy • Two methods for membership function and choose the one with the highest score. Add fuzzy clustering iteration to the LBG algorithm. classifying gene/nongene: – Add genes and non-genes scores, and max sum wins. – Max centroid score wins. • 2nd method used (better performance). Scores sums are used for reading frame: max r.f. wins. 42 Results f_p • Creature: S. Cerevisiae. • Learning was done on the 5th chromosome. • Parameters: – K=7 and m=2 of fuzzy KNN. – True exon  50% exon. – Thresh = 1. • Total: only 4.6% of true exons weren’t detected at all. f_n rf_true f_n_exons # exons # missed 1 0.1037 0.4524 0.9574 0.0882 102 9 2 0.0821 0.4735 0.9685 0.0472 381 18 3 0.0917 0.4618 0.9551 0.071 155 11 4 0.0821 0.4615 0.9654 0.029 725 21 6 0.1102 0.4247 0.9762 0.05 120 6 7 0.0821 0.4749 0.9647 0.0258 504 13 8 0.103 0.4716 0.9671 0.0456 263 12 9 0.1091 0.452 0.9476 0.04 200 8 10 0.1005 0.4719 0.9723 0.0293 341 10 11 0.0822 0.4816 0.9641 0.0703 327 23 12 0.0973 0.4759 0.9722 0.0514 486 25 13 0.0885 0.4682 0.9607 0.0365 438 16 14 0.1041 0.4597 0.9616 0.0397 378 15 15 0.0904 0.4644 0.9665 0.0311 514 16 16 0.0824 0.4744 0.9662 0.0452 442 20 Total 5376 223 43 CC - Example 44 CC - Improving • Instead of deciding for each reading frame separately and then decide which r.F. “Won”, we can replicate the centroids for the other reading frames and the classification rule will determine [exon / non-exon] + [reading frame], at the same time. This suppose to cause a more fair competition between the reading frames. 45 Classification by Compression Rates A T C G A T C G T A C G C A T G C A T G C A T G C A T G A A A A 29 18 1 60… Nucleotides (‘A’,’C’,’T’,’G’) Codons (0..63) • In forward coding, creating 3 different codon sequences. • In classification of reverse coding, first complementing all the DNA, then treating it like forward (and results will also be reversed) • In the end of this stage, we have 6 codon seriates. 46 The Idea • If we have a dictionary with the popular words ( = codon sequences) in exons which aren’t popular in non-exons then: – Good compression will be achieved in exons – Good compression will not be achieved in introns • So we need a good dictionary and a good compressing algorithm 47 Building the Dictionary • Aim: the output • Add restriction on • • • • dictionary is expected to hold short popular words in exons. Using LZW algorithm. Input: all exons of learnt chromosome. Initial dictionary: all codons. • length of words to be entered to the dictionary. Output I: dictionary with words that appeared in exons. Output II: the code of the exons by the dictionary. 48 LZW: Encoding 1) Accum  first input letter 2) If dict.Find(accum) == false 1) 2) 3) 4) Dict.Add(accum) Code.Add(index) Accum  accum(end) Return to (2) 3) Else: 1) Index = dict.Findwhere(accum) 2) Accum.Add(next letter from input) 3) Return to (2) 49 Dictionary Pruning • Output LZW dictionary is a tree (TRIE). • Aim: keep the most popular words, but don’t • allow undesired redundancy. Method: – Go on every level of the tree (starting in max length words) and take predefined number of popular words. – Pass number of appearances (from output code) to parents: pass the sum of all, OR pass the sum of untaken. More variations: multiply by the entropy. 50 Using Entropy for Better Pruning 24*log(4) = 48 40*log(1) = 0 [31 45 1] 6 [31 45 1 60] [31 45 1] 6 [31 45 1 31] 6 [31 45 1 30] 6 40 [31 45 1 13] [31 45 1 30] 20*(-1)*[5/6*log(5/6) + 2*1/24*log(1/24) + 1/16*log(1/16)] = 20*0.8513 = 17.0255 [31 45 1] 1 [31 45 1 60] 20 [31 45 1 31] 1 [31 45 1 30] 2 [31 45 1 13] 51 Compression Rates Classification 2. 6 codons sequences for the 6 different reading frames 3. Compressing with genes based dictionary 1. Input: DNA of a chromosome and gene based dictionary 8. 6 binary vectors – the final classification 7. Post Processing 4. 6 compress rates vectors 5. Rf_wins = Argmax{compress_rate(rf),thresh) Lowerthresh = Argmax{compress_rate(rf),lower-thresh) Too_much_stops = 1 if window has more than 1 stop codon 6. 6 binary vectors + post processing data 52 Post Processing • Lower threshold technique: tag as true every window that is between close already-tagged windows, if value larger than the lower threshold. • Stop codons quantity in the window: more than one => not an exon-window (which is larger than analysis window size). 53 Compression Rates: Example 54 Stop Codons Usage • 100,000b of 2nd • chromosome 1 where there is one stop codon in the window, at most 55 Post Processing: Stop-codon Usage • Stop codon usage cleans up many potential false positives, without damaging any success measure • Hence, a lower principal threshold can be determined and we’ll get better performance Without stop codon usage 56 Compression Rates: Results • Learnt chromosome = 1st , window size = 100c, dictionary size = 1381 (32 codons, branching = 3) • After choosing best configuration, going over all the chromosomes: # f_p f_n rf_true f_n_exons # exons # miss THRESH 2 0.10442 0.13809 0.93866 0.046875 384 18 0.457 3 0.10015 0.16098 0.92234 0.03871 155 6 0.457 4 0.084271 0.14014 0.93809 0.036986 730 27 0.457 5 0.090556 0.13763 0.92723 0.034483 261 9 0.457 6 0.13909 0.14274 0.92495 0.041667 120 5 0.457 7 0.12053 0.14733 0.93927 0.027723 505 14 0.457 8 0.15057 0.14538 0.93362 0.059925 267 16 0.457 9 0.13161 0.13816 0.92458 0.045 200 9 0.457 10 0.12222 0.12411 0.93447 0.03207 343 11 0.457 11 0.07833 0.14575 0.93712 0.069069 333 23 0.457 12 0.14106 0.13654 0.9405 0.064777 494 32 0.457 13 0.11051 0.14338 0.92814 0.040816 441 18 0.457 14 0.15044 0.15475 0.93434 0.026525 377 10 0.457 15 0.089995 0.14578 0.9357 0.044231 520 23 0.457 16 0.12039 0.13794 0.93657 0.033784 444 15 0.457 total 0.0423394 5574 236 57 Compression Rates: Improving • Use non-exon dictionary, or prune exon- dictionary considering non-exon common words. • Adaptive dictionary: when detecting an exon, use its common words to update the current dictionary. 58

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download PowerPoint file