Zipf's monkeys
Observations from real and random genomes

Environmental genomics
When an organism dies, it decomposes and the DNA in its cells degenerates into smaller and smaller fragments
Given a collection of DNA fragments (i.e. reads), figure out which organisms they came from

The data
AGTCGATGCAGTCAGCATACGATCAGACTGCAGCTTATATCGCATCGCGCATGATTACTACTGCGCGATCAGCATCATATACGACTACGGCAGATCATCATCGCGCATCAATCAGTG…
[Slide animation: the sequence grows to fill the slide, then progressively breaks up into fragments of varying length]

How can we reconstruct the original genomes?

Approaches
Jigsaw puzzle
  Find common subsequences
  Align overlapping regions
Statistics
  Compute histograms of oligonucleotides (n-grams)
  Match to distributions for known organisms
  Use rare polymers to select anchor points (BLAST-like)

Compression distance
Conjecture: a lossless, dictionary-based sequence compressor built for a genome compresses one of its own subsequences better than would the compressor built for another genome
(Normalized) universal compression distance:
$$\mathrm{UCD}(x,y) = \frac{\max[\,C(xy) - C(x),\ C(yx) - C(y)\,]}{\max[\,C(x),\ C(y)\,]}$$

CM clustering
Compression Maximization
Adopt compression into a kind of EM clustering
  Partition reads randomly into [say] two groups
  For each read, compute compression distance to each group (à la leave-one-out)
  Reassign read to closest group
  Iterate until some stopping criterion
  Apply recursively to each group

Experiment
groupA: DG2 NM1 MR2 DE3 AD5 AF1 DE1 AF3 DG4 AF5 CA1 MR4 CA3 DE4 CA5 NM4 CS2 AD2 CS4 MR3
groupB: AF2 DE2 AD4 CA4 DE5 DG1 AD1 NM3 AF4 DG5 MR1 AD3 CS5 CA2 MR5 CS3 NM2 DG3 CS1 NM5

Experiment: result
groupA: AD1 AD2 AD3 AD4 AD5 AF1 AF2 AF3 AF4 AF5 CA1 CA2 NM1 NM2 NM3 CS1 CS2 CS3 CS4 CS4
groupB: DE1 DE2 DE3 DE4 DE5 DG1 DG2 DG3 DG4 DG5 MR1 MR2 MR3 MR4 MR5 CA3 CA4 CA5 NM4 NM5
stop when µCD > 70

Reassembly
Can the LZ trie be used to reassemble reads into genomes?
The LZ trie is a regular grammar of the set of reads
A long phrase is an extension of a shorter phrase
The start of one read is the end of another
The part of a long phrase that is the suffix after a shorter phrase (i.e. the difference between the short phrase and the long one) is the prefix of another phrase

Along the way ….
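
As an aside on the compression-distance and CM-clustering slides, here is a minimal Python sketch of both ideas. It rests on several assumptions that are not in the slides: zlib stands in for the dictionary-based LZ compressor, C(x) is taken to be the compressed size in bytes, the distance from a read to a group is taken to be the UCD between the read and the concatenation of the group's other reads, and iteration simply stops when no read moves or after a fixed number of rounds. The names c, ucd and cm_cluster are illustrative only.

```python
import random
import zlib


def c(data: bytes) -> int:
    """Compressed size in bytes. zlib is an assumed stand-in for the LZ compressor."""
    return len(zlib.compress(data, 9))


def ucd(x: bytes, y: bytes) -> float:
    """max[C(xy) - C(x), C(yx) - C(y)] / max[C(x), C(y)], as on the slide."""
    return max(c(x + y) - c(x), c(y + x) - c(y)) / max(c(x), c(y))


def cm_cluster(reads, rounds=10, seed=0):
    """Two-group CM clustering sketch; returns one label (0 or 1) per read."""
    rng = random.Random(seed)
    labels = [rng.randrange(2) for _ in reads]  # random initial partition
    for _ in range(rounds):  # assumed stopping rule: cap on the number of rounds
        changed = False
        for i, read in enumerate(reads):
            dists = []
            for g in (0, 1):
                # Leave-one-out: the group text excludes the read being scored.
                group = b"".join(r for j, r in enumerate(reads)
                                 if labels[j] == g and j != i)
                dists.append(ucd(read, group) if group else float("inf"))
            best = 0 if dists[0] <= dists[1] else 1
            if best != labels[i]:
                labels[i], changed = best, True
        if not changed:  # stop early once the partition is stable
            break
    return labels
```

A sequence should be much closer to itself than to an unrelated random sequence, so ucd(x, x) is expected to be far smaller than ucd(x, y) for independent x and y. The recursive step on the slide would call cm_cluster again on the reads of each resulting group.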
While setting up the initial experiments, we started to ponder things that might go wrong
Different genomes might have a lot of common subsequences that will conflate the clustering result
SNPs and missing fragments might thwart compression
Compression model might take too long to converge on a useful model (paucity of data)
What is the underlying principle being leveraged?

Information theory
A linear sequence of symbols intended for communication exhibits a balance between randomness and regularity
If a sequence is entirely random, it is noise
If a sequence is entirely predictable, it is redundant
Patterns provide means for recognition (interpretation) and irregularities provide for novelty (information)
Compression attempts to minimize redundancy

Information theory
Human languages exhibit non-uniform distributions over letters, phonemes, words, etc.
[Figure: Brown Corpus word frequencies]

DNA primary sequences
Four nucleotide symbols: A, C, G, T
Much of a genome codes for nothing, and the rest is genes
A gene is copied (transcription) off the genome, and the copy is used to build a protein (translation)
Three consecutive nucleotides form a codon, which codes for a specific amino acid
A sequence of amino acids (residues) constitutes a protein
Proteins are where structure definitely exists

DNA primary sequences
4^3 = 64 possible codons
20 possible amino acids
Many amino acids have more than one codon

Genomic regularities
Most genes start with ATG and end with a stop codon (TAG, TAA, and TGA most frequent)
TATA-box in regulatory region (for binding)
GC-rich regions (for stability)
But
  Frequency of individual nucleotides or residues is not so interesting (no syntax)
  Tertiary structure of proteins is The Thing: the interactions of amino residues are paramount

Genomic regularities
Do genomes have sequential syntactic structures?
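
One way to look for sequential structure is to tabulate n-gram and codon frequencies. The sketch below is an illustration rather than the code behind the slides: the function names are hypothetical, n-grams are counted with a sliding window of step 1, and codons are counted as non-overlapping triplets in reading frame 0 (an assumption; the slides do not say which frame was used).

```python
from collections import Counter
from itertools import product

# The 4^3 = 64 possible codons over the alphabet A, C, G, T.
ALL_CODONS = ["".join(p) for p in product("ACGT", repeat=3)]


def ngram_counts(seq: str, n: int) -> Counter:
    """Count every overlapping n-gram (sliding window, step 1)."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def codon_counts(seq: str) -> Counter:
    """Count non-overlapping triplets in reading frame 0 (assumed frame)."""
    return Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))


if __name__ == "__main__":
    # First sample read shown on the data slide.
    read = "AGTCGATGCAGTCAGCATACGATCAGACTGCAGCT"
    print(len(ALL_CODONS), "possible codons")
    print(ngram_counts(read, 4).most_common(5))
    print(codon_counts(read).most_common(5))
```

Sorting the counts by rank and plotting them on log-log axes gives the kind of rank-frequency curve that the n-gram slides appear to show.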

[Figures: codon, 4-gram, 5-gram and 6-gram frequencies in real DNA; 6-gram probabilities in real DNA]

Problems from paucity of data
Takes time for an LZ compression trie to become saturated with characteristic phrases
Experimental data somewhat small, thus interesting sequences may not manifest quickly enough
Prime the trie by prepending some random DNA to the data prior to computing CD
How much? How about a million?

[Figures: bigram, codon, 4-gram, 5-gram, 7-gram, 8-gram, 9-gram and 10-gram frequencies in random DNA]

Miller's monkey
19th century – Vilfredo Pareto showed that power-law distributions abound in social, scientific, economic and geophysical data
1949 – G.K. Zipf argued that power-law distributions are an interesting linguistic phenomenon
1957 – G.A. Miller argued that the effect is related to the random placement of spaces, and that a monkey at a typewriter would produce 'language' with a Zipfian distribution
1968 – David Howes argued that Miller's proof is flawed
2004 – Michael Mitzenmacher demonstrated the connection between power-law distributions and lognormal distributions

Conclusion
Probably nothing!
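
Miller's monkey argument can be tried out with a small simulation. The sketch below is an illustration under stated assumptions, not a reproduction of Miller's 1957 setup: the alphabet ("abcd"), the probability of hitting the space bar (0.2), the number of keystrokes and the function names are all arbitrary choices made here.

```python
import random
from collections import Counter


def monkey_text(n_keys, letters="abcd", p_space=0.2, seed=0):
    """Random keystrokes: a space with probability p_space, else a uniform letter."""
    rng = random.Random(seed)
    return "".join(" " if rng.random() < p_space else rng.choice(letters)
                   for _ in range(n_keys))


def rank_frequency(text):
    """Word counts sorted from most to least frequent (rank 1 first)."""
    return sorted(Counter(text.split()).values(), reverse=True)


if __name__ == "__main__":
    freqs = rank_frequency(monkey_text(200_000))
    # Print the count at a few ranks spaced evenly on a log scale.
    for rank in (1, 10, 100, 1000):
        if rank <= len(freqs):
            print(rank, freqs[rank - 1])
```

With uniform letters, all words of the same length are equally likely, so the rank-frequency curve comes out as a staircase that falls roughly along a straight line on log-log axes, even though the text carries no meaning.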