Zipf's monkeys

Observations from real and random genomes

Environmental genomics
- When an organism dies, it decomposes and the DNA in its cells degenerates into smaller and smaller fragments
- Given a collection of DNA fragments (i.e. reads), figure out which organisms they came from

The data
AGTCGATGCAGTCAGCATACGATCAGACTGCAGCTTATATCGCATCGCGCATGATTACTACTGCGCGATCAGCATCATATACGACTACGGCAGATCATCATCGCGCATCAATCAGTG…

[Slide animation: the sequence grows until it fills the slide, then breaks up row by row into short fragments (reads) of varying length.]

How can we reconstruct the original genomes?
Approaches
- Jigsaw puzzle
  - Find common subsequences
  - Align overlapping regions
- Statistics
  - Compute histograms of oligonucleotides (n-grams)
  - Match to distributions for known organisms
  - Use rare polymers to select anchor points (BLAST-like)

Compression distance
- Conjecture: a lossless, dictionary-based sequence compressor built for a genome compresses one of its own subsequences better than would the compressor built for another genome
- (Normalized) universal compression distance, where C(s) is the compressed size of sequence s and xy is x followed by y:

  \[ \mathrm{UCD}(x,y) = \frac{\max[\, C(xy) - C(x),\; C(yx) - C(y) \,]}{\max[\, C(x),\; C(y) \,]} \]

CM clustering
- Compression Maximization
- Adapt compression into a kind of EM clustering
  - Partition reads randomly into (say) two groups
  - For each read, compute compression distance to each group (à la leave-one-out)
  - Reassign read to closest group
  - Iterate until some stopping criterion
  - Apply recursively to each group

Experiment
groupA: DG2 NM1 MR2 DE3 AD5 AF1 DE1 AF3 DG4 AF5 CA1 MR4 CA3 DE4 CA5 NM4 CS2 AD2 CS4 MR3
groupB: AF2 DE2 AD4 CA4 DE5 DG1 AD1 NM3 AF4 DG5 MR1 AD3 CS5 CA2 MR5 CS3 NM2 DG3 CS1 NM5

Experiment: result
groupA: AD1 AD2 AD3 AD4 AD5 AF1 AF2 AF3 AF4 AF5 CA1 CA2 NM1 NM2 NM3 CS1 CS2 CS3 CS4 CS4
groupB: DE1 DE2 DE3 DE4 DE5 DG1 DG2 DG3 DG4 DG5 MR1 MR2 MR3 MR4 MR5 CA3 CA4 CA5 NM4 NM5
Stop when µCD > 70

Reassembly
- Can the LZ trie be used to reassemble reads into genomes?
  - The LZ trie is a regular grammar of the set of reads
  - A long phrase is an extension of a shorter phrase
  - The start of one read is the end of another
  - The part of a long phrase that is the suffix after a shorter phrase (i.e. the difference between the short phrase and the long one) is the prefix of another phrase

Along the way …
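The compression distance defined above can be sketched in a few lines. This is only an illustration under stated assumptions: `zlib` stands in for the dictionary-based compressor (the slides do not name one), and `compressed_size`, `ucd` and `random_dna` are hypothetical helper names. Because zlib's DEFLATE window is 32 KiB, the sketch is only meaningful for short inputs.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """C(s): size in bytes of s after compression (zlib is an assumed stand-in)."""
    return len(zlib.compress(data, 9))


def ucd(x: bytes, y: bytes) -> float:
    """max[C(xy) - C(x), C(yx) - C(y)] / max[C(x), C(y)], as on the slide."""
    cx, cy = compressed_size(x), compressed_size(y)
    numerator = max(compressed_size(x + y) - cx, compressed_size(y + x) - cy)
    return numerator / max(cx, cy)


def random_dna(n: int, seed: int) -> bytes:
    """Uniformly random nucleotides, used here only as toy input."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(n)).encode("ascii")


if __name__ == "__main__":
    a = random_dna(2000, seed=1)
    b = random_dna(2000, seed=2)
    # A sequence should be closer to itself than to an unrelated sequence.
    print("UCD(a, a) =", round(ucd(a, a), 3))
    print("UCD(a, b) =", round(ucd(a, b), 3))
```

In a CM clustering loop, one plausible use would be to score each read against the concatenation of the other reads in each group and reassign it to the group with the smaller distance; the slides do not spell out that detail.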
- While setting up the initial experiments, we started to ponder things that might go wrong
  - Different genomes might share many common subsequences, which would confound the clustering result
  - SNPs and missing fragments might thwart compression
  - The compressor might take too long to converge on a useful model (paucity of data)
- What is the underlying principle being leveraged?

Information theory
- A linear sequence of symbols intended for communication exhibits a balance between randomness and regularity
  - If a sequence is entirely random, it is noise
  - If a sequence is entirely predictable, it is redundant
- Patterns provide means for recognition (interpretation) and irregularities provide for novelty (information)
- Compression attempts to minimize redundancy

Information theory
- Human languages exhibit non-uniform distributions over letters, phonemes, words, etc.
[Figure: Brown Corpus word frequencies]

DNA primary sequences
- Four nucleotide symbols: A, C, G, T
- Much of a genome codes for nothing, and the rest is genes
- A gene is copied (transcription) off the genome, and the copy is used to build a protein (translation)
- Three consecutive nucleotides form a codon, which codes for a specific amino acid
- A sequence of amino acids (residues) constitutes a protein
- Proteins are where structure definitely exists

DNA primary sequences
- 4^3 = 64 possible codons
- 20 possible amino acids
- Many amino acids have more than one codon

Genomic regularities
- Most genes start with ATG and end with a stop codon (TAG, TAA, and TGA most frequent)
- TATA-box in regulatory region (for binding)
- GC-rich regions (for stability)
But
- Frequency of individual nucleotides or residues is not so interesting (no syntax)
- Tertiary structure of proteins is The Thing: the interactions of amino acid residues are paramount

Genomic regularities
- Do genomes have sequential syntactic structures?
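One way to start probing that question is to tally n-gram (k-mer) frequencies, as in the "Statistics" approach listed earlier. The sketch below is a minimal illustration; `kmer_counts`, `codon_counts` and the toy sequence are assumptions made for this example, not part of the original experiments.

```python
from collections import Counter
from itertools import product


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer (n-gram) using a sliding window."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def codon_counts(seq: str, frame: int = 0) -> Counter:
    """Count non-overlapping triplets, starting at the given reading frame."""
    return Counter(seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))


# Enumerate all 4^3 = 64 possible codons over the alphabet A, C, G, T.
ALL_CODONS = ["".join(p) for p in product("ACGT", repeat=3)]

if __name__ == "__main__":
    toy = "ATGGCGTGCATGTAA"  # made-up sequence for illustration only
    print(len(ALL_CODONS))
    print(kmer_counts(toy, 4).most_common(3))
    print(codon_counts(toy).most_common(3))
```

Sorting the counts in descending order gives the rank-frequency histogram that the frequency plots in the slides are based on.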
[Figure: codon frequencies in real DNA]
[Figure: 4-gram frequencies in real DNA]
[Figure: 5-gram frequencies in real DNA]
[Figure: 6-gram frequencies in real DNA]
[Figure: 6-gram probabilities in real DNA]

Problems from paucity of data
- It takes time for an LZ compression trie to become saturated with characteristic phrases
- The experimental data are somewhat small, so interesting sequences may not manifest quickly enough
- Prime the trie by prepending some random DNA to the data prior to computing CD
- How much? How about a million?

[Figure: bigram frequency in random DNA]
[Figure: codon frequency in random DNA]
[Figure: 10-gram frequency in random DNA]
[Figure: 4-gram frequency in random DNA]
[Figure: 5-gram frequency in random DNA]
[Figure: 5-gram frequency in random DNA]
[Figure: 7-gram frequency in random DNA]
[Figure: 8-gram frequency in random DNA]
[Figure: 9-gram frequency in random DNA]

Miller's monkey
- 19th century: Vilfredo Pareto showed that power-law distributions abound in social, scientific, economic and geophysical data
- 1949: G.K. Zipf argued that power-law distributions are an interesting linguistic phenomenon
- 1957: G.A. Miller argued that the effect is related to the random placement of spaces, and that a monkey at a typewriter would produce 'language' with a Zipfian distribution
- 1968: David Howes argued that Miller's proof is flawed
- 2004: Michael Mitzenmacher demonstrated the connection between power-law distributions and lognormal distributions

Conclusion
- Probably nothing!
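Miller's random-typing argument in the timeline above is easy to try out. The following is a rough sketch, not a reproduction of anything in the slides: the alphabet, the space probability `p_space`, and the function names `monkey_text` and `rank_frequency` are all assumptions chosen for illustration.

```python
import random
from collections import Counter


def monkey_text(n_chars: int, alphabet: str = "abcd",
                p_space: float = 0.2, seed: int = 0) -> str:
    """Emit n_chars keystrokes: a space with probability p_space, else a random letter."""
    rng = random.Random(seed)
    keys = []
    for _ in range(n_chars):
        if rng.random() < p_space:
            keys.append(" ")
        else:
            keys.append(rng.choice(alphabet))
    return "".join(keys)


def rank_frequency(text: str) -> list:
    """Word frequencies sorted from most to least frequent (rank 1 first)."""
    counts = Counter(text.split())
    return sorted(counts.values(), reverse=True)


if __name__ == "__main__":
    freqs = rank_frequency(monkey_text(1_000_000))
    # Print a few (rank, frequency) pairs spread across the distribution.
    for rank in (1, 10, 100, 1000):
        if rank <= len(freqs):
            print(rank, freqs[rank - 1])
```

Plotting log frequency against log rank for the resulting list is one way to judge how closely the monkey's output follows a power law.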
