A Hands-on Introduction to Computational Genomics
Amos Tanay, Eran Segal
Weizmann Institute of Science

Topics
• Modern technologies, data centric, practical tips
• Analysis of high-throughput sequencing data
• Genome structure and organization
• Normalization and analysis of tiling microarray data
• Peak finding
• Probabilistic models, Hidden Markov Models
• Clustering, bi-clustering
• Model based analysis of transcription factor binding
• Model based analysis of nucleosome binding
• Master basic genomic techniques – clustering, functional enrichments, DNA/RNA motif finding
• Construction of regulatory networks
• Classification
• Understand and use evolutionary conservation
• Testing of complex biological hypotheses

Administrative Issues
• Course: Ziskind, Room 1, Sundays 9-11
  – Next week: Wed. 9-11, Ziskind, Room 1
• Contacts
  – Amos Tanay ([email protected])
  – Eran Segal ([email protected])
  – Questions must be communicated through the Wiki!
• www.wisdom.weizmann.ac.il/~atanay/compgenomics
• Homework
  – A lot...
  – Practical programming assignment given almost every week
  – Assignments are due two weeks from the day they are given
  – Work can be done in pairs (up to 3 assignments with the same partner)
  – Analyze unpublished data

Administrative Issues
• Grade:
  – 60% for all homework tasks
  – 40% for a project
    • Each pair selects one homework assignment to extend
    • Prepare a more comprehensive report
    • Submit documented code
• No final exam!

The Genome
[Figure: gene structure with intergenic regions, exons and introns; the triplet code starts at ATG]
• Genome sizes vary over 6 orders of magnitude
• But DNA is the universal information-coding molecule in the biosphere
From: Lynch 2007

Sequencing and computational biology evolved together
• 70s: sequences were few and sparse, but evolutionary biologists were starting to apply algorithms to analyze collections of protein sequences
• 80s: people like Eric Lander and Michael Waterman introduced ideas from computer science and statistics that later served as the foundation for the Human Genome Project
• 90s: computational genomics became a field as the first genomes became available. Computational scientists solved critical problems and built the first genomics pipelines
• 00s: sequenced genomes have become abundant and technology is breaking new ground – computation is integrated into biological research and is no longer just solving ad hoc problems
[Photos: Haussler, Felsenstein, Waterman, Lander]

Sanger sequencing
Dideoxynucleotide chain termination
• Original DNA, fragmented DNA, cloned DNA
• Starting from a restriction site and growing until termination with ddATP, ddCTP, ddGTP or ddTTP
• Each terminating nucleotide has its own color (dye)
• Running on 4 gel lanes indicates all termination points
• Vectors
  – Plasmid (virus) – 2000-10,000 bp
  – Cosmid – 40kb
  – BAC (bacterial chrom) – 70kb-300kb
[Figure labels: gel electrophoresis; restriction enzyme sites]

Base calling
• PHRED: processing the gel chromatograms (Phil Green)
• Generates sequence and quality estimation
• PHRED quality = $-10 \log_{10} \mathrm{Prob}(\mathrm{Error})$
• ABI 3700 (look for one of these on eBay – but it served genomics for a long time)

1998-2001 – The genome sequencing pipeline – Celera's version

November 2008 – Resequencing in 8 weeks!
Nature 456, 53-59 (2008) doi:10.1038/nature07484

Sequence by synthesis
• 7-8 lanes, ~5,000,000 reads of ~36bp on each
• 1000-3000$ per lane
• 48 images per second
• 25-90 million bases/hour
• Claiming $10^9$ reads per experiment – not quite there
• Real single molecule!

Sequence by ligation

What makes it so important?
• Resequencing: population genetics and association studies
• Resequencing: cancer genome project
• Resequencing: bacterial resistance
• Sequencing: bacteria, new species
• RNA-seq: unbiased RNA discovery
• ChIP-seq: mapping transcription factors
• ChIP-seq: mapping histone modifications
• MNase-seq: mapping nucleosomes
• BIS-seq: mapping DNA methylation
• 3C-seq: mapping chromosomal interactions

Dealing with large sequence datasets
• The matching/local alignment problem:
  – Given a database (long string d) and a query (short string q)
  – Find all appearances of the query q in the database
  – Allow mismatches, insertions and deletions
[Figure labels: query (small); GenBank or genome database]
• As an optimization problem:
  – Define the similarity s(q, s)
  – $s(q,s) = \sum \delta(c_1, c_2) - \gamma g$ (g: number of gaps in the affine gap cost model)
  – $\gamma$: gap open cost
  – $\delta$: nucleotide similarity matrix
  – Goal: find the best s in d
  – Or: find all s where d(q,s) < x

The alignment dynamic programming graph
a.k.a. Smith-Waterman, Needleman-Wunsch
[Figure: dynamic programming grid with the database ATCTGATC along one axis and the query TGCATAC along the other; edge types: gap in query, gap in DB, match/mismatch]
• Initialize $s_{0,0} = 0$
• Global alignment:
  $s_{i,j} = \max \{\, s_{i-1,j-1} + \delta(v_i, w_j),\; s_{i-1,j} + \delta(v_i, -),\; s_{i,j-1} + \delta(-, w_j) \,\}$
• Local alignment:
  $s_{i,j} = \max \{\, 0,\; s_{i-1,j-1} + \delta(v_i, w_j),\; s_{i-1,j} + \delta(v_i, -),\; s_{i,j-1} + \delta(-, w_j) \,\}$
• How can we align the whole query to part of the database?

BLAST (Basic Local Alignment Search Tool)
• For us CS guys, DP and alignment are boring:
  – Time complexity: O(nm)
  – Space complexity: O(max(n,m))
• However, in the 90s n started to get out of hand ($10^{10}$) – and the query could get quite long as well
• There is also the issue of the statistical significance of a hit when the database is so huge (we will not go into the details here)
• Input
  – A query sequence
  – A database of biological sequences
• Output
  – A ranked list of "hits": database sequences that are locally similar to the query sequence (from which the unknown function of the query sequence might be inferred)
  – The statistical significance of each hit
SF Altschul et al. (1990) J Mol Biol 215:403

BLAST Algorithm
Mask repetitive sequences
• MNPQQQQQQRST = MNPXXXXXXRST
• X will not match anything in the database. It does preserve position, however.

BLAST Algorithm
Word Seeds & Neighboring
• Word length: Prot = 3, Nucl = 11
• ACDEF = ACD + CDE + DEF
• Threshold T = 11: ACD = {ACA, ACE, ACG…}
• Threshold T = 13: ACD = {ACE, ACG…}
S Muthukrishnan & H Ramesh (1995) Information & Computation 122:140

BLAST Algorithm
Ungapped (Diagonal) Extension
Probe along a diagonal if it contains
• Two T = 11 hits within 40 positions of each other
• A single T = 13 hit

BLAST Algorithm
Gapped (Off-Diagonal) Extension
gapxdrop Subroutine
• If a diagonal score reaches $S_g$ = 22 bits, try to extend across gaps, until the score drops $X_g$ = 40 bits below the best score found.
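
The local alignment recurrence from the dynamic programming slide can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions that are not in the slides: a match/mismatch score stands in for the similarity matrix δ, every gap position pays the same linear penalty, and the function name `smith_waterman` and its default scores are made up for the example. It returns the best score and the cell where the best local alignment ends, without a traceback.

```python
def smith_waterman(query, db, match=1, mismatch=-1, gap=-2):
    """Return (best local score, (i, j) cell where that alignment ends).

    Rows index the query, columns index the database, both 1-based in the
    table; row 0 and column 0 are the all-zero border.
    """
    rows, cols = len(query) + 1, len(db) + 1
    s = [[0] * cols for _ in range(rows)]
    best, best_cell = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            delta = match if query[i - 1] == db[j - 1] else mismatch
            s[i][j] = max(
                0,                        # local alignment: restart anywhere
                s[i - 1][j - 1] + delta,  # match / mismatch
                s[i - 1][j] + gap,        # query character against a gap
                s[i][j - 1] + gap,        # database character against a gap
            )
            if s[i][j] > best:
                best, best_cell = s[i][j], (i, j)
    return best, best_cell


if __name__ == "__main__":
    # GAT occurs once in ATCTGATC, ending at database position 7.
    print(smith_waterman("GAT", "ATCTGATC"))
```

The table costs O(nm) time and memory here; keeping only the previous row would be enough when just the score is needed.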

Hash Table (Present Method)
• Database sequences: Sequence 1, Sequence 2, Sequence 3, …, Sequence N
• Query sequence: construct a table of all $20^3$ words of length 3 (AAA, AAC, AAD, …, YYY)

Finite-state Automaton
• Database sequences: Sequence 1, Sequence 2, Sequence 3, …, Sequence N
• Query sequence: construct an automaton that finds query neighbor words of length 3 (AAA, AAC, AAD, …, YYY)

BLAT
• BLAST was designed in the 90s when RAM was limited
• When searching the same DB many times, it does not make sense not to preprocess the DB (i.e., online complexity is what matters)
• BLAT (BLAST-like alignment tool) from UCSC's Jim Kent uses database indexing to save time in Genome Browser queries
  – Non-overlapping k-mer hashing of the database
  – Filter over-represented k-mers
  – Search query k-mers (overlapping)
  – Find pairs of nearby matches on the same diagonal
  – Or find one longer near-match (one mismatch)
[Figure labels: Query 1, Query 2, Query 3; genome database]

Simple hits statistics
• K: k-mer size
• M: the match probability when it is a true positive
• Q: query sequence size, G: genome size
• H: expected hit length
• A: alphabet size
• Sensitivity, perfect match:
  $p_1 = M^K$
  $P = 1 - (1 - p_1)^{H/K}$
• Sensitivity, one mismatch:
  $p_1 = K M^{K-1} (1 - M) + M^K$
  $P = 1 - (1 - p_1)^{H/K}$
• Number of chance hits (assuming a uniform nucleotide distribution), perfect match:
  $F = (Q - K + 1)\,(G/K)\,(1/A)^K$
• Number of chance hits, one mismatch:
  $F = (Q - K + 1)\,(G/K)\,\left( K (1/A)^{K-1} (1 - 1/A) + (1/A)^K \right)$

Mapping Solexa reads
• Mapping Solexa reads to a genome has unique characteristics
• The query consists of a very large number of short reads
• Similarity to the reference genome is expected to be very high
• You can index the query k-mers (using which k?) and traverse the database to search for hits
• Or you can index the database and map queries one by one
• You can expect a low level of errors: 1 or 2 per read
• You can assume that no more than one gap occurred (even this is a lot)
• The algorithm must pay particular attention to ambiguous hits (reads that map to more than one position)

The meta-algorithm:
• Solexa query
• Build index for exact k-mers (db/query?)
• Genome database
• Find k-mer hits
• Extend k-mer hits to matches (filter double matches upon detection)

Sequence Quality
• As with Sanger sequencing, the NG sequencers generate quality scores
• (for Solexa these are not $-10 \log_{10}(p)$ but some conversion!)
• One would like to weight a mismatch at a low-quality position appropriately

Uniqueness in genome
• For a genome of size G, what is the expected number of k-mer hits as a function of K?
• If nucleotides have variable G+C content?
• If we map all C's to T's?
• In fact, the genome k-mer spectrum is strongly affected by repetitive elements and microsatellites – more on this next time

Hashing
• DNA k-mers of length 11 are really easy ($2^{22}$ entries)
• For longer k-mers (for searching with mismatches), storage is bounded by the genome length!
• How to access the hash efficiently?
• Best: random access using integer encoding
  – A DNA word needs 2 bits per character; you can hash 12-mers in a vector with 16 million entries
• Possible: binary search tree (e.g., STL map<string>, Perl associative containers)

Suffix Trees
• Suffix trees are efficient string encodings
• Geared toward O(d) lookup of substrings
• The tree contains all suffixes of a string as paths from the root
• Each node has no more than A out edges (A=4 for DNA)
• Naïve construction: in $O(N^2)$
• O(N) construction (!)
• O(N) memory (Prove!)
[Figure: suffix tree for "dabdac"]

Sampling short reads
• How many reads do we expect to detect at a certain genomic location?
• We sample N times (e.g., 10,000,000) from a large population – the number of hits for a single locus is expected to be binomial B(N, p), where p is the fraction of fragments in the pool that cover the locus
• If Np is large (>10) we can assume a normal distribution
• If Np is small the distribution should be geometric
• The p's (the fraction of fragments that cover a locus) will vary among loci:
  – In ChIP-seq – loci that are occupied by the targeted factor will be covered
  – In MNase-seq – loci that are adjacent to a cutting site
• As is often the case, the theoretical assumptions need not hold – test the distribution of values and see for yourself
• When pooling samples together, you observe average or median statistics.

From mapped reads to coverage statistics
• Divide the genome into fixed bins
• Compute how many reads cover each bin
• A better strategy will depend on the application:
  – Add ~500 for fragmented ChIP product
  – Add 147 (or -10) for nucs (or linkers)
  – Add fragment length for RNA
  – Paired-end reads
• Statistics on spatial bins

Assignment 1:
• Read an MNase-seq Solexa reads file in FASTA format
    >name
    ACGTACGTACGT…
    >name2
    ACGTAAAGAC…
• Read a genome reference in FASTA format (a set of chromosomes)
• Write a mapping program (choose reasonable parameters!) and find the genomic coordinate of each mappable read. You can ignore insertions/deletions (if it helps)
• Use a given genomic Transcription Start Site (TSS) table to compute the average coverage statistics in fixed-size bins (50bp) around the feature
• Determine which of the bins around the TSS are significantly different from the TSS bin.
• Submit:
  – 1 page description of the algorithm and the parameters you used
  – Mapping statistics (how many reads mapped successfully, how many were non-unique, running time)
  – Graph showing the average around the TSS with a paragraph discussing its statistics
  – Your code

Implementation considerations
• C/C++: get used to the STL
  – Vectors
  – Maps
  – Integer encodings
• Java
• Perl: be aware of your memory model
  – Associative arrays are expensive
  – Lists when you can
  – vec($myvar, $id, bits)
  – (can use BioPerl)
• Python
• R
• Matlab (don't)
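
As one possible starting point for the integer encodings mentioned above, here is a small Python sketch of the 2-bit k-mer encoding from the Hashing slide, used to index a reference and look up exact matches. Everything in it is an assumption made for illustration: the names `encode_kmer`, `build_index` and `map_read` are hypothetical, a dictionary stands in for the $4^k$-entry vector, and only the forward strand and exact matches are handled (no mismatches, gaps or quality scores).

```python
from collections import defaultdict

# 2 bits per base, as on the Hashing slide.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_kmer(kmer):
    """Pack a DNA k-mer into an integer; None if it contains a non-ACGT base."""
    value = 0
    for base in kmer:
        if base not in CODE:
            return None
        value = (value << 2) | CODE[base]
    return value


def build_index(genome, k):
    """Map each encoded k-mer to the list of its 0-based start positions."""
    index = defaultdict(list)
    for pos in range(len(genome) - k + 1):
        code = encode_kmer(genome[pos:pos + k])
        if code is not None:  # skip windows containing N or other symbols
            index[code].append(pos)
    return index


def map_read(read, genome, index, k):
    """Positions where the whole read matches exactly, seeded by its first k-mer."""
    if len(read) < k:
        return []
    code = encode_kmer(read[:k])
    if code is None:
        return []
    # Extend each seed hit by comparing the full read against the genome.
    return [p for p in index.get(code, []) if genome[p:p + len(read)] == read]


if __name__ == "__main__":
    genome = "ACGTACGTTTGA"
    index = build_index(genome, k=4)
    for read in ("ACGT", "ACGTT", "GGGG"):
        hits = map_read(read, genome, index, k=4)
        status = "unique" if len(hits) == 1 else "ambiguous" if hits else "unmapped"
        print(read, hits, status)
```

A read with exactly one returned position would count as uniquely mapped; reads with several positions are the ambiguous hits that the mapping statistics ask you to report.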