Simple sequence repeat
● Also known as a low complexity sequence
● Repetition of a nucleotide in genomic sequences
● Can also be repetition of a pair of nucleotides, e.g. ACACACACACACACACACAC ..., a microsatellite used for forensic identification
● Can also be a triplet or more complex repeat
● Filtered from BLAST searches. Two common programs that filter sequences for BLAST searches are seg (amino acids; off by default) and dust (nucleotides; on by default)

Simple sequence repeat
● You can also find low complexity regions in protein-coding sequences
● Repetition can be in the DNA or in the amino acids of the proteins
● Found in almost all eukaryotic organisms to greater or lesser degrees

Simple sequence repeat
Protein DDB_G0268506 from Dictyostelium discoideum:
MSKDHHHQQHQYQQLHPPIPSQHHHHHHHQSQNSDSELNHDNHKKFGHDRIVSNSFSPPPLHQFNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNYNNENNGNNVSFNPHQSNNNNNNPPMSSNEQYKPIYGQPSS
LSHLWFENKSNRASNNNNNNNHNNNNNNHNNNNNNSNNDNNNVSLTESYGPQAHDHPHHHHHPNHHSNNQ
NLFNQFSLQNSTPCNLSNNADMSNSNQHHHSNNSEIVRDRNIDNNNNINNNNNNTTTTNNNNSGNRDRYK
DSVDVLEKSTEKSKITTLGKHNTNINNNNSNKYKQLLPPLPIPNEQYNGIGIDNGLSHSSSNGSLGSADS
LDSPHTPMSSPSLSSLSLSQNLHINNNSNNYNNNNNGNNNNFNNNNYNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNYNNNNNYNSSSNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
GNNNNNNNINYNQSHNYNNYTPPISPLSTPPLTPTGSSISGPIGFLSGSPQNSPRNPNSPRLDPLVVQNN
QMVIYKRQFDQMLTKSMGDVWLKINKDVEEGSPTLPSATSTLPLARIKKIMKSDPGVKMISWEAPILFAK
ACEFFILELAARSWIHTDLSKRRTLQRSDIIHAVARVETFDFLIDVLPRDEIKPKKVDDIKPSYINSPEG
FPISLEPIPINNSGRLNSNNNNNNSNNRALTLTNPSPLNSNLTTQLPNIPTPQHQNQNQNQNQNQNQNQN
QHQHQNQNQNQNQNQNQNQNQNQYQHQHQHQHQHQHQHQHHQHHQHQHHQHHHHQHQHQNQNQNQNQHQQ
HQHQIYQPNQQQIHHINHQLGMHHHNPHQNQNQHPMYSHQFQNYSQVAFNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNSNNSNNSNNSSNNNNNNNNNNSNNNNNNNNNNNNNNNNNNNSNNNNNSNNSNNNN
NNNYNNYNGNNNNYNNYNSSSNNNSNNNNNNNNNNNNNNNNNNNNNNNNNNSNNNNNGNNNFENINPFQP
HNHMQSQYYYNQSINQYQNQNHNNNNNNSNNNNSNNQNSNNIYTRQYENEEDDENEEDQKSSTSGSESES

Simple sequence repeat
● Saccharomyces cerevisiae SRP40
>YKR092C
MASKKIKVDEVPKLSVKEKEIEEKSSSSSSSSSSSSSSSSSSSSSSSSSG
ESSSSSSSSSSSSSSDSSDSSDSESSSSSSSSSSSSSSSSDSESSSESDS
SSSGSSSSSSSSSDESSSESESEDETKKRARESDNEDAKETKKAKTEPES
SSSSESSSSGSSSSSESESGSESDSDSSSSSSSSSDSESDSESDSQSSSS
SSSSDSSSDSDSSSSDSSSDSDSSSSSSSSSSDSDSDSDSSSDSDSSGSS
DSSSSSDSSSDESTSSDSSDSDSDSDSGSSSELETKEATADESKAEETPA
SSNESTPSASSSSSANKLNIPAGTDEIKEGQRKHFSRVDRSKINFEAWEL
TDNTYKGAAGTWGEKANEKLGRVRGKDFTKNKNKMKRGSYRGGSITLESG
SYKFQD*

Fis binding sites

Sequence complexity
● Information theory
Claude Elwood Shannon, 1916 - 2001
● Information theory (Shannon entropy)
● Proved that Boolean algebra and binary arithmetic can be used to simplify arrangements of electromechanical relays (Master's thesis)
● PhD: “An algebra for theoretical genetics”
● 1948: “A mathematical theory of communication”
From Wikipedia: http://en.wikipedia.org/wiki/Claude_Shannon

Sequence complexity
● Information theory
Information source → Encoder → Negative / positive noise → Decoder → Destination
From Shannon 1948

Sequence complexity
● Information theory
Parent → DNA → Natural, sexual selection; mutation → DNA → Offspring

Sequence complexity
● Information theory
  – In biology:
    ● DNA
    ● RNA
    ● Proteins
DNA → (transcription) → RNA → (splicing) → mRNA → (translation) → Proteins

Sequence complexity
● Information theory
  – Noise sources:
    ● heterologous sequences
    ● rearranged and deleted sequences
    ● repetitive elements
    ● sequencing error
    ● natural polymorphism
    ● frameshift
    ● codon usage
    ● selection

Sequence complexity
● Information theory
  – Developed by Shannon and Weaver to describe the transmission of electronic signals
  – Used to look for pattern and complexity in DNA and protein sequences

Shannon's entropy
$H = -L \sum_i p_i \log_2(p_i)$
$L$: number of elements
$p_i$: probability of occurrence of element $i$
$H$: units in “bits”

Entropy
● Entropy
  – A measure of the disorder or randomness in a closed system
  – A measure of the loss of information in a transmitted message
● Given a random variable $X$ with probabilities $P(x_i)$ for a discrete set of events $x_1, x_2, \ldots, x_n$, the Shannon entropy is just the negative
expected value of $\log P(X)$
  – $H(X) = -E[\log P(X)]$

Entropy
– From basic probability:
  ● $E(X) = \sum_i x_i \, p(x_i)$
  ● $E(h(X)) = \sum_i h(x_i) \, p(x_i)$
  ● $H(X) = -\sum_i p(x_i) \log p(x_i)$

Entropy
● The entropy measures the prior uncertainty in the outcome of a random experiment described by $P$, or the information gained when the outcome is observed
● Uses the logarithm base 2, which makes the unit of entropy bits

Entropy
● Properties
  – $H(X) \ge 0$
  – If we are certain of the outcome of a sample from the distribution ($P(x_k) = 1$, all other $P(x_i) = 0$), then the entropy is 0
  – Entropy is maximized when all $n$ of the $P(x_i)$ are equal to $1/n$; the maximum is then $\log_2(n)$

Sequence complexity
- For a DNA sequence of length $L$ with 4 nucleotides, each with $p_i = 0.25$: $H_{\max} = L \times \log_2(4) = 2L$ bits
- Representation of each nucleotide as a 2-bit number (11, 10, 01, 00)
- In case of departure from equal probability: $H < H_{\max}$
- If $H/L = 0$: sequence of minimal complexity (the same nucleotide or amino acid throughout)
- If $H/L = 2$ (for DNA): maximal complexity, all nucleotides are equally represented

ATGTTCTATGGGCCACAAGTCACGAGCT
            A      T      G      C
Count       7      7      7      7
Frequency   0.25   0.25   0.25   0.25
$H = -28 \times (0.25 \log_2 0.25 + 0.25 \log_2 0.25 + 0.25 \log_2 0.25 + 0.25 \log_2 0.25)$
$H = 56$ bits
$H/L = 2$

ACTTATATATACCGGAGACTATATGAGA
            A      T      G      C
Count       11     8      5      4
Frequency   0.39   0.29   0.18   0.14
$H = -28 \times (0.39 \log_2 0.39 + 0.29 \log_2 0.29 + 0.18 \log_2 0.18 + 0.14 \log_2 0.14)$
$H = 52.92$ bits
$H/L = 1.89$

Sequence complexity
● Uncertainty in the information
  – Because of selective pressures acting on sequences (DNA or protein), some departure from expectation can be observed
  – $I(X) = H_{\text{expected}} - H_{\text{observed}}$
  – The more conserved the sequence, the higher the information content

Sequence complexity
● Example
  – If $p_i = 0.25$ for each nucleotide, $H_{\text{expected}} = 2$ bits
  – At a particular position in a number of related sequences we observe only A or G, with $p_A = 0.7$ and $p_G = 0.3$
  – $H_{\text{observed}} = -0.7 \log_2(0.7) - 0.3 \log_2(0.3) - 0 - 0$
  – $H_{\text{observed}} = 0.88$
  – $I(X) = 2 - 0.88 = 1.12$
bits

Sequence complexity
● Example

GTGTACTCTC CATTTGCGAT
GTGTATTCTC CATTTGCGTT
GTGTTCTCCC CAATTGCTCT
GTGTTTTCTC CATTTGCGGT

Assuming all four nucleotides are equally possible, $H_{\text{expected}} = 2$ at every position. The information content then depends on the column:
  – Fully conserved column (e.g. position 1, G in all four sequences): $H_{\text{observed}} = 0$, so $I = 2$ bits
  – Column with two nucleotides at equal frequency (e.g. position 5, A, A, T, T): $H_{\text{observed}} = -0.5 \log_2(0.5) - 0.5 \log_2(0.5) - 0 - 0 = 1$, so $I = 2 - 1 = 1$ bit
  – Column with all four nucleotides at equal frequency (e.g. position 19, A, T, C, G): $H_{\text{observed}} = -0.25 \log_2(0.25) - 0.25 \log_2(0.25) - 0.25 \log_2(0.25) - 0.25 \log_2(0.25) = 2$, so $I = 2 - 2 = 0$ bits

Figure: Shannon-Weaver H content. Amino acid complexity in Saccharomyces cerevisiae (Nsr1p). Window size: 10 aa

Sequence window complexity
● Sequence complexity can be investigated using a sliding window analysis
  – Shannon-Weaver index ($H/L$)
  – GC content
● Maximum complexity is expected to be found in the exons
● High GC content is often associated with protein-coding sequences
● High AT content is often associated with non-coding DNA such as introns

Bit scores
“Raw scores have little meaning without detailed knowledge of the scoring system used, or more simply its statistical parameters K and lambda. Unless the scoring system is understood, citing a raw score alone is like citing a distance without specifying feet, meters, or light years. By normalizing a raw score using the formula
$S' = (\lambda S - \ln K) / \ln 2$
one attains a "bit score" S', which has a standard set of units. The E-value corresponding to a given bit score is simply
$E = m n \, 2^{-S'}$
Bit scores subsume the statistical essence of the scoring system employed, so that to calculate significance one needs to know in addition only the size of the search space.”
From http://www.ncbi.nlm.nih.gov/BLAST/tutorial/#head3
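
The two formulas in the quotation can be tried out numerically. The following is a minimal Python sketch, not taken from the NCBI tutorial: the function names `bit_score` and `e_value` are hypothetical, and the values used for λ, K, the raw score, and the search space (m, n) are assumptions chosen only for illustration.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalize a raw score S into a bit score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits, m, n):
    """E-value for a bit score S' in a search space of size m * n: E = m * n * 2^(-S')."""
    return m * n * 2.0 ** (-bits)


# Illustrative values only (assumptions, not taken from the slides):
lam, k = 0.267, 0.041    # statistical parameters lambda and K of some scoring system
raw = 100                # raw alignment score S
m, n = 250, 1_000_000    # query length and database length

s_prime = bit_score(raw, lam, k)
print(f"bit score S' = {s_prime:.2f}")
print(f"E-value      = {e_value(s_prime, m, n):.3g}")
```

Because λ and K enter only through `bit_score`, the E-value step needs just the bit score and the size of the search space, which is the point made at the end of the quotation.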