Zipf's monkeys

Observations from real and random genomes
Environmental genomics
 When an organism dies, it decomposes and the DNA
in its cells degenerates into smaller and smaller
fragments
 Given a collection of DNA fragments (i.e. reads), figure
out which organisms they came from
The data
AGTCGATGCAGTCAGCATACGATCAGACTGCAGCTTATATCGCATCGCGCATGATTACTACTGCGCGATCAGCATCATATACGACTACGGCAGATCATCATCGCGCATCAATCAGTG…
[Animation: the sequences are progressively broken into many short fragments (reads) of varying length.]
How can we reconstruct the original genomes?
Approaches
 Jigsaw puzzle
 Find common subsequences
 Align overlapping regions
 Statistics
 Compute histograms of oligonucleotides (n-grams)
 Match to distributions for known organisms
 Use rare polymers to select anchor points (BLAST-like)
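The statistical approach above can be sketched in a few lines. This is only an illustration, assuming overlapping n-grams and an L1 distance between frequency profiles; the function names and the choice of metric are hypothetical and do not come from the talk.

```python
from collections import Counter


def ngram_counts(seq: str, n: int) -> Counter:
    """Count every overlapping n-gram (oligonucleotide of length n) in seq."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def ngram_profile(seq: str, n: int) -> dict:
    """Relative n-gram frequencies, so sequences of different length compare."""
    counts = ngram_counts(seq, n)
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}


def profile_distance(p: dict, q: dict) -> float:
    """L1 distance between two profiles (an illustrative choice of metric)."""
    return sum(abs(p.get(g, 0.0) - q.get(g, 0.0)) for g in set(p) | set(q))


# First read shown on "The data" slide.
read = "AGTCGATGCAGTCAGCATACGATCAGACTGCAGCT"
print(ngram_counts(read, 3).most_common(3))
```

A read would then be assigned to the known organism whose profile is nearest, provided the read is long enough for its histogram to be meaningful.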
Compression distance
 Conjecture: a lossless, dictionary-based sequence
compressor built for a genome compresses one of its
own subsequences better than would the compressor
built for another genome
 (normalized) universal compression distance
\[ \mathrm{UCD}(x,y) = \frac{\max[\,C(xy) - C(x),\; C(yx) - C(y)\,]}{\max[\,C(x),\; C(y)\,]} \]
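As a rough sketch of the distance above, the following uses Python's `zlib` as a stand-in compressor. That is an assumption: the talk's own compressor is LZ-trie based and is not shown, and `random_dna` is a hypothetical helper used only to produce test input.

```python
import random
import zlib


def C(data: bytes) -> int:
    """Compressed size in bytes; zlib stands in for the talk's compressor."""
    return len(zlib.compress(data, 9))


def ucd(x: bytes, y: bytes) -> float:
    """max[C(xy) - C(x), C(yx) - C(y)] / max[C(x), C(y)]."""
    return max(C(x + y) - C(x), C(y + x) - C(y)) / max(C(x), C(y))


def random_dna(n: int, seed: int) -> bytes:
    """Hypothetical helper: n uniformly random nucleotides."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(n)).encode()


x = random_dna(2000, seed=1)
y = random_dna(2000, seed=2)
print(ucd(x, x), ucd(x, y))  # a sequence should be closer to itself
```

Note that zlib only looks back 32 KB, so for genome-scale inputs a compressor with a larger dictionary would be needed.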
CM clustering
 Compression Maximization
 Adapt compression into a kind of EM clustering
 Partition reads randomly into [say] two groups
 For each read, compute compression distance to each
group (à la leave-one-out)
 Reassign read to closest group
 Iterate until some stopping criterion
 Apply recursively to each group
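The loop described above might look roughly like this. It is a simplified sketch under several assumptions: `zlib` replaces the talk's compressor, the distance is a one-sided conditional compression cost rather than the full UCD, the stopping criterion is "no read changed group", and the recursive step is omitted. All names are hypothetical.

```python
import random
import zlib


def csize(data: bytes) -> int:
    """Compressed size in bytes (zlib is an assumed stand-in compressor)."""
    return len(zlib.compress(data, 9))


def distance_to_group(read: bytes, members: list) -> float:
    """Extra bytes needed to compress `read` after the group, normalized."""
    context = b"".join(members)
    return (csize(context + read) - csize(context)) / csize(read)


def cm_cluster(reads: list, k: int = 2, max_iter: int = 20, seed: int = 0) -> list:
    """Return one group label per read."""
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in reads]      # random initial partition
    for _ in range(max_iter):
        changed = False
        for i, read in enumerate(reads):
            dists = []
            for g in range(k):
                # Leave-one-out: the read itself never counts as group context.
                members = [r for j, r in enumerate(reads)
                           if labels[j] == g and j != i]
                dists.append(distance_to_group(read, members))
            best = min(range(k), key=dists.__getitem__)
            if best != labels[i]:
                labels[i] = best                    # reassign to closest group
                changed = True
        if not changed:                             # assumed stopping criterion
            break
    return labels


reads = [b"ACACACACACACACACACAC", b"GTGTGTGTGTGTGTGTGTGT",
         b"ACACACACACACACACAC", b"GTGTGTGTGTGTGTGTGT"]
print(cm_cluster(reads))
```

Recompressing each group for every read is quadratic in the number of reads; an incremental compressor would avoid that, but it is beyond this sketch.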
Experiment
groupA: DG2 NM1 MR2 DE3 AD5 AF1 DE1 AF3 DG4 AF5 CA1 MR4 CA3 DE4 CA5 NM4 CS2 AD2 CS4 MR3
groupB: AF2 DE2 AD4 CA4 DE5 DG1 AD1 NM3 AF4 DG5 MR1 AD3 CS5 CA2 MR5 CS3 NM2 DG3 CS1 NM5
Experiment: result
groupA: AD1 AD2 AD3 AD4 AD5 AF1 AF2 AF3 AF4 AF5 CA1 CA2 NM1 NM2 NM3 CS1 CS2 CS3 CS4 CS5
groupB: DE1 DE2 DE3 DE4 DE5 DG1 DG2 DG3 DG4 DG5 MR1 MR2 MR3 MR4 MR5 CA3 CA4 CA5 NM4 NM5
stop when µCD > 70
Reassembly
 Can the LZ trie be used to reassemble reads into
genomes?
 The LZ trie is a regular grammar of the set of reads
 A long phrase is an extension of a shorter phrase
 The start of one read is the end of another
 The part of a long phrase that is the suffix after a
shorter phrase (i.e. the difference between the short
phrase and the long one) is the prefix of another
phrase
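To make "a long phrase is an extension of a shorter phrase" concrete, here is a minimal LZ78-style parse that builds the trie as nested dictionaries. Whether the talk used LZ78 specifically is an assumption; the function name is hypothetical.

```python
def lz78_phrases(seq: str) -> list:
    """LZ78 parse: each new phrase is a previously seen phrase plus one symbol."""
    trie = {}                  # nested dicts, one level per symbol
    phrases = []
    node, current = trie, ""
    for symbol in seq:
        current += symbol
        if symbol in node:
            node = node[symbol]        # still inside a known phrase: extend it
        else:
            node[symbol] = {}          # new leaf: record the phrase, restart
            phrases.append(current)
            node, current = trie, ""
    if current:
        phrases.append(current)        # trailing phrase that was already known
    return phrases


print(lz78_phrases("AABABC"))  # → ['A', 'AB', 'ABC']
```

Each phrase in the output is its predecessor in the trie plus one nucleotide, and concatenating the phrases reproduces the input.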
Along the way…
 While setting up the initial experiments, we started to
ponder things that might go wrong
 Different genomes might have a lot of common
subsequences that will confound the clustering result
 SNPs and missing fragments might thwart compression
 Compression model might take too long to converge on
a useful model (paucity of data)
 What is the underlying principle being leveraged?
Information theory
 A linear sequence of symbols intended for
communication exhibits a balance between
randomness and regularity
 If a sequence is entirely random, it is noise
 If a sequence is entirely predictable, it is redundant
 Patterns provide means for recognition
(interpretation) and irregularities provide for novelty
(information)
 Compression attempts to minimize redundancy
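The two extremes above can be illustrated with the empirical Shannon entropy of single symbols. This is a first-order measure only, so it ignores symbol order: a perfectly periodic sequence such as ACGTACGT still scores the maximum of 2 bits. The function name is hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Empirical entropy in bits per symbol from single-symbol frequencies."""
    counts = Counter(seq)
    n = len(seq)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


print(shannon_entropy("AAAAAAAA"))  # fully predictable: 0 bits per symbol
print(shannon_entropy("ACGTACGT"))  # uniform over four symbols: 2 bits
```

Capturing sequential regularity requires longer contexts (n-grams) or a compressor, which is what the rest of the talk turns to.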
Information theory
 Human languages exhibit non-uniform distributions
over letters, phonemes, words, etc
Brown Corpus word frequencies
DNA primary sequences
 Four nucleotide symbols: A, C, G, T
 Much of a genome codes for nothing, and the rest is genes
 A gene is copied (transcription) off the genome, and
the copy is used to build a protein (translation)
 Three consecutive nucleotides form a codon, which
codes for a specific amino acid
 A sequence of amino acids (residues) constitutes a
protein
 Proteins are where structure definitely exists
DNA primary sequences
 43 = 64 possible codons
 20 possible amino acids
 Many amino acids have more than one codon
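The counting argument can be checked directly. The sketch below enumerates all triplets and sets aside the three stop codons of the standard genetic code; the variable names are illustrative.

```python
from itertools import product

NUCLEOTIDES = "ACGT"

# Every ordered triple of nucleotides is a codon: 4 * 4 * 4 of them.
codons = ["".join(triple) for triple in product(NUCLEOTIDES, repeat=3)]
print(len(codons))  # → 64

# Stop codons of the standard genetic code code for no amino acid.
STOP_CODONS = {"TAA", "TAG", "TGA"}
sense_codons = [c for c in codons if c not in STOP_CODONS]
print(len(sense_codons))  # → 61
```

With 61 sense codons shared among 20 amino acids, an amino acid has about three codons on average, which is why many of them have more than one.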
Genomic regularities
 Most genes start with ATG and end with a stop codon
(TAG, TAA, and TGA most frequent)
 TATA-box in regulatory region (for binding)
 GC rich regions (for stability)
But
 Frequency of individual nucleotides or residues is not so
interesting (no syntax)
 Tertiary structure of proteins is The Thing: the interactions
of amino acid residues are paramount
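The start/stop-codon and GC-content regularities listed above are simple enough to scan for. The following is a naive sketch, assuming a single strand and a single reading frame per ATG; it is not a gene finder, and the names are hypothetical.

```python
import re

# ATG, then whole codons, then the first in-frame stop codon (lazy match).
ORF_PATTERN = re.compile(r"ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGA)")


def find_orfs(seq: str, min_codons: int = 2) -> list:
    """Return (start, end) spans of non-overlapping ATG...stop candidates."""
    spans = []
    for m in ORF_PATTERN.finditer(seq):
        if (m.end() - m.start()) // 3 >= min_codons:
            spans.append((m.start(), m.end()))
    return spans


def gc_content(seq: str) -> float:
    """Fraction of G and C nucleotides in a non-empty sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)


print(find_orfs("CCATGAAATAGGG"))  # → [(2, 11)]
print(gc_content("ATGC"))          # → 0.5
```

Because the codon group is matched lazily, the span ends at the first in-frame stop codon, and out-of-frame occurrences such as the TGA inside ATGA are skipped.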
Genomic regularities
 Do genomes have sequential syntactic structures?
Codon frequencies in real DNA
4-gram frequencies in real DNA
5-gram frequencies in real DNA
6-gram frequencies in real DNA
6-gram probabilities in real DNA
Problems from paucity of data
 Takes time for an LZ compression trie to become
saturated with characteristic phrases
 Experimental data somewhat small, thus interesting
sequences may not manifest quickly enough
 Prime the trie by prepending some random DNA to
the data prior to computing CD
 How much? How about a million?
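A toy version of the priming idea, assuming an LZ78-style trie and uniformly random nucleotides as the primer. The priming length here is 100,000 rather than a million to keep the example fast, and the function names are hypothetical.

```python
import random


def lz78_phrase_count(seq: str, trie: dict = None) -> int:
    """Parse seq with LZ78, extending `trie` in place; return phrases emitted."""
    if trie is None:
        trie = {}
    node, count = trie, 0
    for symbol in seq:
        if symbol in node:
            node = node[symbol]        # keep extending a known phrase
        else:
            node[symbol] = {}          # new phrase ends here
            count += 1
            node = trie
    if node is not trie:
        count += 1                     # trailing, already-known phrase
    return count


def random_dna(n: int, seed: int = 0) -> str:
    """Hypothetical helper: n uniformly random nucleotides."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(n))


# Longest read shown on "The data" slide.
read = ("AGTCGATGCAGTCAGCATACGATCAGACTGCAGCTTATATCGCATCGCGCATGATTACTACTGC"
        "GCGATCAGCATCATATACGACTACGGCAGATCATCATCGCGCATCAATCAGTG")

cold = lz78_phrase_count(read)                 # empty trie
primed_trie = {}
lz78_phrase_count(random_dna(100_000), primed_trie)
primed = lz78_phrase_count(read, primed_trie)  # trie primed with random DNA
print(cold, primed)
```

Fewer phrases after priming means each phrase is longer, i.e. the trie already holds most short oligonucleotides before the real data arrives.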
bigram frequency in random DNA
codon frequency in random DNA
10-gram frequency in random DNA
4-gram frequency in random DNA
5-gram frequency in random DNA
5-gram frequency in random DNA
7-gram frequency in random DNA
8-gram frequency in random DNA
9-gram frequency in random DNA
Miller’s monkey
 19th century – Vilfredo Pareto showed that power-law
distributions abound in social, scientific, economic and
geophysical data
 1949 – G.K. Zipf argued that power-law distributions are an
interesting linguistic phenomenon
 1957 – G.A. Miller argued that the effect is related to the random
placement of spaces, and that a monkey at a typewriter
would produce ‘language’ with a Zipfian distribution
 1968 – David Howes argued that Miller’s proof is flawed
 2004 – Michael Mitzenmacher demonstrated the
connection between power-law distributions and lognormal distributions
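Miller's thought experiment is easy to simulate. The sketch below assumes a four-letter keyboard and a fixed probability of hitting the space bar; both parameters and the function name are illustrative choices, not values from the talk.

```python
import random
from collections import Counter


def monkey_words(n_keys: int, alphabet: str = "abcd",
                 p_space: float = 0.2, seed: int = 0) -> list:
    """Random keystrokes split on spaces, giving the monkey's 'words'."""
    rng = random.Random(seed)
    keys = [" " if rng.random() < p_space else rng.choice(alphabet)
            for _ in range(n_keys)]
    return "".join(keys).split()


words = monkey_words(200_000)
ranked = Counter(words).most_common()   # sorted by descending frequency

# Frequency at a few ranks; plotted on log-log axes the curve is roughly
# a staircase following a power law.
for rank in (1, 10, 100, 1000):
    if rank <= len(ranked):
        print(rank, ranked[rank - 1])
```

Short words dominate the top ranks simply because they are more probable, with no linguistic structure involved.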
Conclusion
 Probably nothing!