
A Hands-On Introduction to Computational Genomics
Amos Tanay
Eran Segal
Weizmann Institute of Science
Topics
• Modern technologies, data centric, practical tips
• Analysis of high-throughput sequencing data
• Genome structure and organization
• Normalization and analysis of tiling microarray data
• Peak finding
• Probabilistic models, Hidden Markov Models
• Clustering, bi-clustering
• Model based analysis of transcription factor binding
• Model based analysis of nucleosome binding
• Master basic genomic techniques
– clustering, functional enrichments, DNA/RNA motif finding
• Construction of regulatory networks
• Classification
• Understand and use evolutionary conservation
• Testing of complex biological hypotheses
Administrative Issues
• Course: Ziskind, Room 1, Sundays 9-11
– Next week: Wed. 9-11, Ziskind, Room 1
• Contacts
– Amos Tanay ([email protected])
– Eran Segal ([email protected])
– Questions must be communicated through Wiki!
• www.wisdom.weizmann.ac.il/~atanay/compgenomics
• Homework
– A lot...
– A practical programming assignment is given almost every week
– Assignments are due two weeks from the day they are given
– Work can be done in pairs (up to 3 assignments with the same partner)
– Analyze unpublished data
Administrative Issues
• Grade:
– 60% for all homework tasks
– 40% for a project
• Each pair selects one homework assignment to extend
• Prepare a more comprehensive report
• Submit documented code
• No final exam!
The Genome
[Figure: the genome as a string of A, T, G, C; genes are made of alternating exons and introns, read through the triplet code, and separated by intergenic regions]
• Genome sizes vary over 6 orders of magnitude
• But DNA is the universal information-coding molecule in the biosphere
From: Lynch 2007
Sequencing and computational biology evolved together
• 70’s: sequences were few and sparse, but evolutionary biologists were starting to apply algorithms to analyze collections of protein sequences
• 80’s: people like Eric Lander and Michael Waterman introduced ideas from computer science and statistics that later served as the foundation for the human genome project
• 90’s: computational genomics became a field with the first genomes becoming available. Computational scientists solved critical problems and formed the first genomics pipelines
• 00’s: sequenced genomes have become abundant, technology is breaking new ground – computation is integrated into biological research and is not just solving ad-hoc problems.
[Photos: Haussler, Felsenstein, Waterman, Lander]
Sanger sequencing
Dideoxynucleotide chain termination
Original DNA
Fragmented DNA
Cloned DNA
Starting from a restriction site and growing until termination with (ddATP, ddCTP, ddGTP, ddTTP)
Each terminating nucleotide has its own color (dye)
Running on 4 gel lanes indicates all termination points
Vectors:
Plasmid (virus) – 2000-10,000 bp
Cosmid – 40kb
BAC (bacterial chrom) – 70kb-300kb
Gel electrophoresis
Restriction enzyme sites
Base calling
PHRED: processing the gel chromatograms (Phil Green)
Generates sequence and quality estimation
PHRED quality: $Q = -10 \log_{10} P(\text{error})$
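The PHRED definition above is easy to apply in both directions. The following is a minimal sketch; the function names are illustrative and are not taken from the PHRED software.

```python
import math

def phred_from_error_prob(p_error):
    """PHRED quality Q = -10 * log10(p_error)."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("p_error must be in (0, 1]")
    return -10.0 * math.log10(p_error)

def error_prob_from_phred(q):
    """Invert the definition: p_error = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)
```

Under this definition Q = 20 corresponds to one expected error in 100 base calls and Q = 30 to one in 1000.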
ABI 3700 (look for one of these on eBay – but it served genomics for a long time)
1998-2001 – The genome sequencing pipeline – Celera’s version
November 2008 – Resequencing in 8 weeks!
Nature 456, 53-59 (2008) doi:10.1038/nature07484
Sequence by synthesis
7-8 lanes - ~5,000,000 reads of ~36bp on each
1000-3000$ per lane
48 images per second
25-90 million bases/hour
Claiming $10^9$ reads per experiment – not quite there
Real Single molecule!
Sequence by ligation
What makes it so important?
• Resequencing: population genetics and association
studies
• Resequencing: cancer genome project
• Resequencing: Bacterial resistance
• Sequencing: Bacteria, new species
• RNA-seq: unbiased RNA discovery
• ChIP-seq: mapping transcription factors
• ChIP-seq: mapping histone modifications
• MNase-seq: mapping nucleosomes
• BIS-seq: mapping DNA methylation
• 3C-seq: mapping chromosomal interactions
Dealing with large sequence datasets
• The matching/local alignment problem:
– Given a database (long string d) and a query (short string q)
– Find all appearances of the query q in the database
– Allow mismatches, insertions and deletions
[Figure: a small query matched against GenBank or a genome database]
• As an optimization problem:
– Define a similarity s(q, s) between the query q and a database substring s
– $s(q, s) = \sum \delta(c_1, c_2) - \gamma g$, summed over the aligned character pairs $(c_1, c_2)$
– g: number of gaps in the affine gap cost model
– $\gamma$: gap open cost
– $\delta$: nucleotide similarity matrix
– Goal: find the best s in d
– Or: find all s where d(q,s) < x
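The alignment dynamic programming graph
a.k.a.: Smith-Waterman, Needleman-Wunsch
[Figure: dynamic programming grid with the database ATCTGATC along the columns (j = 0..8) and the query TGCATAC along the rows (i = 0..7); edge types: query gap, DB gap, match/mismatch]
Global Alignment
Initialize $s_{0,0}$ to 0
$$s_{i,j} = \max \begin{cases} s_{i-1,j-1} + \delta(v_i, w_j) \\ s_{i-1,j} + \delta(v_i, -) \\ s_{i,j-1} + \delta(-, w_j) \end{cases}$$
Local Alignment
$$s_{i,j} = \max \begin{cases} 0 \\ s_{i-1,j-1} + \delta(v_i, w_j) \\ s_{i-1,j} + \delta(v_i, -) \\ s_{i,j-1} + \delta(-, w_j) \end{cases}$$
How can we align the whole query to part of the database?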
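The global and local recurrences can be sketched in a few lines. This is a minimal illustration that assumes a simple scoring scheme (one match score, one mismatch score, and a linear gap cost standing in for δ(v, -) and δ(-, w)); these parameter values are assumptions, not the course's settings. It returns only the best score and keeps a single previous row in memory.

```python
def alignment_score(v, w, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score of query v against database w.

    local=False: global (Needleman-Wunsch style) score.
    local=True : local (Smith-Waterman style) score, with the extra 0 option.
    """
    m = len(w)
    # Row 0: a global alignment pays for leading gaps, a local one does not.
    prev = [0 if local else j * gap for j in range(m + 1)]
    best = 0
    for i in range(1, len(v) + 1):
        cur = [0 if local else i * gap] + [0] * m
        for j in range(1, m + 1):
            delta = match if v[i - 1] == w[j - 1] else mismatch
            s = max(prev[j - 1] + delta,  # match/mismatch edge
                    prev[j] + gap,        # v[i] aligned to a gap
                    cur[j - 1] + gap)     # w[j] aligned to a gap
            if local:
                s = max(0, s)             # restart the alignment here
                best = max(best, s)
            cur[j] = s
        prev = cur
    return best if local else prev[m]
```

For example, `alignment_score("TGCATAC", "ATCTGATC", local=True)` scores the query and database strings drawn in the slide's grid. Recovering the alignment itself would require keeping the full table (or back-pointers) instead of one row.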
BLAST (Basic Local Alignment Search Tool)
For us CS-guys, DP and alignment is boring:
Time complexity: O(nm)
Space complexity: O(max(n,m))
However, in the 90’s n started to get out of hand ($10^{10}$), and the query could get quite long as well
There is also the issue of the statistical significance of a hit when the database is so huge (we will not go into the details here)
Input
–A query sequence
–A database of biological sequences
Output
–A ranked list of “hits”: database sequences that are locally similar to the query sequence (from which the unknown function of the query sequence might be inferred)
–The statistical significance of each hit
SF Altschul et al. (1990) J Mol Biol 215:403
BLAST Algorithm
Mask repetitive sequences
MNPQQQQQQRST = MNPXXXXXXRST
X will not match anything in the database.
It does preserve position, however.
BLAST Algorithm
Word Seeds & Neighboring
Word length: Prot = 3, Nucl = 11
ACDEF = ACD + CDE + DEF
Threshold T = 11: ACD = {ACA, ACE, ACG…}
Threshold T = 13: ACD = {ACE, ACG…}
[Figure: query words matched against the database sequences]
S Muthukrishnan & H Ramesh (1995) Information & Computation 122:140
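Word seeding and neighborhoods can be illustrated with a toy example. The sketch below splits a sequence into overlapping words and lists every word whose score against a query word reaches a threshold. The scoring function `toy_score` is a hypothetical stand-in (match +4, mismatch -1), not BLOSUM or PAM, so its thresholds are not comparable to the T = 11 and T = 13 values on the slide.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def words(seq, w=3):
    """All overlapping words of length w, e.g. ACDEF -> ACD, CDE, DEF."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]

def toy_score(a, b, match=4, mismatch=-1):
    """Toy substitution score (an assumption, not a real matrix)."""
    return sum(match if x == y else mismatch for x, y in zip(a, b))

def neighborhood(word, threshold, alphabet=AMINO_ACIDS):
    """All words of the same length scoring >= threshold against `word`."""
    candidates = ("".join(c) for c in product(alphabet, repeat=len(word)))
    return [c for c in candidates if toy_score(word, c) >= threshold]
```

Raising the threshold shrinks the neighborhood, which is the same effect the slide shows when moving from T = 11 to T = 13.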
BLAST Algorithm
Ungapped (Diagonal) Extension
Probe along a diagonal if it contains either
• Two T = 11 hits within 40 positions of each other
• A single T = 13 hit
BLAST Algorithm
Gapped (Off-Diagonal) Extension
gapxdrop Subroutine
If a diagonal score reaches Sg = 22 bits, try to extend across gaps, until the score drops Xg = 40 bits below the best score found so far.
Hash Table (Present Method)
Database sequences: Sequence 1, Sequence 2, Sequence 3, …, Sequence N
Query sequence
Construct a table of all $20^3$ words of length 3: AAA, AAC, AAD, …, YYY
Finite-state Automaton
Database sequences: Sequence 1, Sequence 2, Sequence 3, …, Sequence N
Query sequence
Construct an automaton that finds query neighbor words of length 3: AAA, AAC, AAD, …, YYY
BLAT
BLAST was designed in the 90s when RAM was limited
When searching the same DB many times, it does not make sense not to preprocess the DB (i.e., online complexity is what matters)
BLAT (BLAST-Like Alignment Tool) from UCSC’s Jim Kent uses database indexing to save time in Genome Browser queries
Non-overlapping k-mer hashing
Filter over-represented k-mers
Search query k-mers (overlapping)
Find pairs of nearby matches on the same diagonal
Or find one longer near-match (one mismatch)
[Figure: Query 1, Query 2 and Query 3 searched against an indexed genome database]
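The indexing steps listed above can be sketched as follows. This is a simplified illustration in the spirit of BLAT, not Jim Kent's implementation; the function names and the `max_occurrences` and `max_dist` parameters are assumptions made for the example.

```python
from collections import defaultdict

def build_index(db, k, max_occurrences=None):
    """Index the non-overlapping k-mers of the database: k-mer -> starts."""
    index = defaultdict(list)
    for pos in range(0, len(db) - k + 1, k):
        index[db[pos:pos + k]].append(pos)
    if max_occurrences is not None:
        # Filter over-represented k-mers (e.g. repeats).
        return {kmer: hits for kmer, hits in index.items()
                if len(hits) <= max_occurrences}
    return dict(index)

def diagonal_hits(query, index, k):
    """Look up every overlapping query k-mer; group hits by diagonal."""
    diagonals = defaultdict(list)
    for qpos in range(len(query) - k + 1):
        for dpos in index.get(query[qpos:qpos + k], ()):
            diagonals[dpos - qpos].append(qpos)  # diagonal = db pos - query pos
    return dict(diagonals)

def two_hit_diagonals(query, index, k, max_dist):
    """Diagonals that hold two hits at most max_dist apart in the query."""
    found = []
    for diag, qs in diagonal_hits(query, index, k).items():
        # qs is already sorted because query positions are scanned in order.
        if any(b - a <= max_dist for a, b in zip(qs, qs[1:])):
            found.append(diag)
    return sorted(found)
```

Because the database k-mers do not overlap, the index holds about G/K entries, while every overlapping query k-mer is still looked up, so no alignment offset is missed.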
Simple hits statistics
• K: k-mer size
• M: the match probability when it is a true positive
• Q: query sequence size, G: genome size
• H: expected hit length
• A: alphabet size (4 for DNA)
Perfect match:
$$p_1 = M^K$$
One mismatch:
$$p_1 = K M^{K-1} (1 - M) + M^K$$
Sensitivity (in both cases):
$$P = 1 - (1 - p_1)^{H/K}$$
Number of chance hits (assuming uniform nucleotide distribution), perfect match:
$$F = (Q - K + 1) \cdot (G / K) \cdot (1/A)^K$$
One mismatch:
$$F = (Q - K + 1) \cdot (G / K) \cdot \left( K (1/A)^{K-1} (1 - 1/A) + (1/A)^K \right)$$
Mapping Solexa reads
Mapping Solexa reads to a genome has unique characteristics
Query consists of a very large number of short reads
Similarity to the reference genome is expected to be very high
You can index the query k-mers (using which k?) and traverse the database to search for hits
Or you can index the database and map queries one by one
You can expect a low level of errors: 1 or 2 per read
You can assume that no more than one gap occurred (even this is a lot)
The algorithm must pay particular attention to ambiguous hits (reads that map to more than one position)
The meta-algorithm:
Build index for exact k-mers (db/query?)
Find k-mer hits
Extend k-mer hits to matches (filter double matches upon detection)
[Figure: Solexa query reads mapped against a genome database]
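The meta-algorithm can be sketched for the simplest case: index the database, allow substitutions only, and ignore gaps. This is an illustration under those assumptions rather than a reference mapper; the names and the default of two mismatches are hypothetical choices.

```python
from collections import defaultdict

def index_genome(genome, k):
    """Map every overlapping k-mer of the genome to its start positions."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index

def hamming(a, b):
    """Number of mismatching positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def map_read(read, genome, index, k, max_mismatches=2):
    """Sorted genome starts where the read matches with few mismatches.

    Seeds are the non-overlapping k-mers of the read. If the read holds at
    least max_mismatches + 1 such seeds, one of them must be error free.
    """
    candidates = set()  # the set filters double matches from several seeds
    for offset in range(0, len(read) - k + 1, k):
        for hit in index.get(read[offset:offset + k], ()):
            start = hit - offset
            if 0 <= start <= len(genome) - len(read):
                candidates.add(start)
    return sorted(s for s in candidates
                  if hamming(read, genome[s:s + len(read)]) <= max_mismatches)
```

A read with an empty result is unmapped, one position means a unique hit, and several positions mark the read as ambiguous. Reverse-strand reads would need the same lookup on the reverse complement, which is omitted here.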
Sequence Quality
• As in Sanger sequencing, the next-generation sequencers generate quality scores
• (for Solexa these are not $-10\log_{10}(p)$ but some conversion!)
• One would like to treat a mismatch at a low-quality position appropriately
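The slide does not say which conversion it has in mind. A minimal sketch, assuming the log-odds definition used by early Solexa pipelines, $Q_{solexa} = -10\log_{10}(p/(1-p))$, converts such scores to error probabilities and to PHRED scale:

```python
import math

def solexa_to_error_prob(q_solexa):
    """Solve Q = -10 * log10(p / (1 - p)) for the error probability p."""
    return 1.0 / (1.0 + 10.0 ** (q_solexa / 10.0))

def solexa_to_phred(q_solexa):
    """PHRED-scale quality, -10 * log10(p), for a Solexa log-odds score."""
    return 10.0 * math.log10(10.0 ** (q_solexa / 10.0) + 1.0)
```

The two scales agree closely for high qualities and differ mostly below about Q = 10, which is exactly the range that matters when weighting mismatches.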
Uniqueness in genome
• For a genome of size G, what is the expected number of k-mer hits as a function of K?
• If nucleotides have variable G+C content?
• If we map all C’s to T’s?
• In fact, the genome k-mer spectrum is strongly affected by repetitive elements and microsatellites – more on this next time
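A first-order answer to these questions comes from an independence model. The sketch below assumes both the genome and the k-mer are drawn position by position from the same nucleotide frequencies, so two random positions agree with probability equal to the sum of squared frequencies. Real genomes violate this because of repeats, as the last bullet notes.

```python
def expected_kmer_hits(G, k, probs=(0.25, 0.25, 0.25, 0.25)):
    """Expected chance occurrences of a random k-mer in a genome of size G."""
    per_position_match = sum(p * p for p in probs)
    return (G - k + 1) * per_position_match ** k

def gc_probs(gc):
    """(A, C, G, T) frequencies for a given G+C fraction."""
    return ((1 - gc) / 2, gc / 2, gc / 2, (1 - gc) / 2)

# After mapping every C to T the alphabet is A, G and a merged C/T letter.
C_TO_T_PROBS = (0.25, 0.25, 0.5)
```

Under the uniform model the expected count is roughly G / 4^k, so about one chance hit is expected when 4^k is close to G. Skewed G+C content or the C-to-T mapping raises the per-position match probability (3/8 instead of 1/4 in the latter case), so a longer k is needed for the same uniqueness.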
Hashing
• Hashing DNA k-mers of length 11 is really easy ($2^{22}$ entries)
• For longer k-mers (for searching mismatches), storage is bounded by genome length!
• How to access the hash efficiently?
• Best: random access using integer encoding
– A DNA word needs 2 bits per character, so you can hash 12-mers in a vector with 16 million entries
• Possible: binary search tree (e.g., STL map<string>, Perl associative containers)
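The integer encoding can be sketched as follows: each base takes 2 bits, so a k-mer becomes an integer in the range 0 to 4^k - 1 that indexes directly into a table. The names are illustrative, and the sketch assumes sequences contain only A, C, G and T (an N would raise a KeyError).

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode_kmer(kmer):
    """Pack a DNA word into an integer, 2 bits per character."""
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]
    return value

def kmer_positions_table(seq, k):
    """Direct-address table with 4**k slots of k-mer start positions."""
    table = [[] for _ in range(4 ** k)]
    mask = (1 << (2 * k)) - 1
    value = 0
    for i, base in enumerate(seq):
        # Rolling update: shift in the new base, drop the oldest one.
        value = ((value << 2) | CODE[base]) & mask
        if i >= k - 1:
            table[value].append(i - k + 1)
    return table
```

The rolling update costs O(1) per position, so building the table is linear in the sequence length. For k = 12 the table has 4^12 = 16,777,216 slots, matching the 16 million entries mentioned above.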
Suffix Trees
• Suffix trees are efficient string encodings
• Geared toward O(d) lookup of substrings
• The tree contains all suffixes of a string as paths from the root
• Each node has no more than A out edges (A=4 for DNA)
• Naïve construction: in $O(N^2)$
• O(N) construction (!)
• O(N) memory (Prove!)
[Figure: suffix tree for “dabdac”, with leaves numbered 1-6 by suffix start position]
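The naive construction can be illustrated with an uncompressed suffix trie. Note the swap: this is a trie with one character per edge, so it takes O(N^2) time and memory, unlike a true suffix tree with compressed edges and O(N) memory. It still shows the O(d) lookup of a pattern of length d. Positions are 0-based here, while the figure numbers suffixes from 1.

```python
def build_suffix_trie(text):
    """Insert every suffix character by character (naive O(N^2) build)."""
    root = {"children": {}, "starts": []}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node["children"].setdefault(
                ch, {"children": {}, "starts": []})
            # Record which suffix passes through this node.
            node["starts"].append(start)
    return root

def find(trie, pattern):
    """Start positions of `pattern` in the text, in O(len(pattern)) steps."""
    node = trie
    for ch in pattern:
        node = node["children"].get(ch)
        if node is None:
            return []
    return node["starts"]
```

For the slide's example, `find(build_suffix_trie("dabdac"), "da")` walks two edges from the root and reports both occurrences of "da".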
Sampling short reads
• How many reads do we expect to detect at a certain genomic location?
• We sample N times (e.g., 10,000,000) from a large population – the number of hits for a single locus is expected to be binomial B(N, p), where p is the fraction of fragments in the pool that cover the locus
• If Np is large (>10) we can assume a normal distribution
• If Np is small the distribution is approximately Poisson
• The p’s (the fraction of fragments that cover a locus) will vary among loci:
– In ChIP-seq – loci that are occupied by the targeted factor will be covered
– In MNase-seq – loci that are adjacent to a cutting site
• As is often the case, the theoretical assumptions need not hold – test the distribution of values and see for yourself
• When pooling samples together, you observe average or median statistics.
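The approximations to the binomial read count can be checked numerically. The sketch below evaluates the exact binomial probability in log space and compares it with the normal density and with the Poisson distribution, which is the standard limit of a binomial when p is small and Np is moderate. The function names are illustrative.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p), computed in log space for large n."""
    log_comb = (math.lgamma(n + 1) - math.lgamma(k + 1)
                - math.lgamma(n - k + 1))
    return math.exp(log_comb + k * math.log(p) + (n - k) * math.log1p(-p))

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
```

With N = 10,000,000 and p = 2e-7 (Np = 2) the binomial and Poisson probabilities are nearly identical, while with p = 1e-5 (Np = 100) the normal density with mean Np and variance Np(1-p) is already a good fit. Comparing either curve with the observed histogram of counts is one way to test the assumptions, as the slide suggests.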
From mapped reads to coverage statistics
• Divide the genome into fixed bins
• Compute how many reads cover each bin
• A better strategy will depend on the application:
Add ~500 for fragmented ChIP product
Add 147 (or -10) for nucs (or linkers)
Add fragment length for RNA
Paired-end reads
Statistics on spatial bins
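Binning with read extension can be sketched as below. The example assumes each read is given by the 0-based coordinate of its 5' end plus a strand, and that it is extended downstream by an assumed fragment length (147 for nucleosomes, following the slide). The function name and argument layout are hypothetical.

```python
def binned_coverage(read_starts, strands, genome_length,
                    bin_size=50, extend=147):
    """Count, per fixed-size bin, how many extended reads overlap it."""
    n_bins = (genome_length + bin_size - 1) // bin_size
    counts = [0] * n_bins
    for start, strand in zip(read_starts, strands):
        if strand == "+":
            lo, hi = start, start + extend - 1
        else:
            # Minus-strand reads extend toward lower coordinates.
            lo, hi = start - extend + 1, start
        lo, hi = max(lo, 0), min(hi, genome_length - 1)
        for b in range(lo // bin_size, hi // bin_size + 1):
            counts[b] += 1
    return counts
```

Counting only the read's first base (extend=1) gives cut-site statistics instead, which may suit MNase-seq better; which variant is appropriate depends on the application, as the slide notes.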
Assignment 1:
• Read a MNase-seq Solexa reads file in FASTA format
>name
ACGTACGTACGT…
>name2
ACGTAAAGAC…
• Read a genome reference in FASTA format (a set of chromosomes)
• Write a mapping program (choose reasonable parameters!) and find the genomic coordinate of each mappable read. You can ignore insertions/deletions (if it helps)
• Use a given genomic Transcription Start Site table to compute the average coverage statistics in fixed size bins (50bp) around the feature
• Determine which of the bins around the TSS are significantly different from the TSS bin.
• Submit:
– 1 page description of the algorithm and the parameters you used
– Mapping statistics (how many reads mapped successfully, how many were non-unique, running time)
– Graph showing the average around the TSS with a paragraph discussing its statistics
– Your code
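As a starting point for the input step, a FASTA file in the format shown above can be read with a small generator. This is a minimal sketch that assumes plain multi-line FASTA; the function name is illustrative, and the same reader works for both the reads file and the chromosomes.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None:
            # Sequence lines of one record may be split over several lines.
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)
```

Passing an open file handle, for example `for name, seq in read_fasta(open(path))` with `path` supplied by the caller, streams the records one by one, which keeps memory use low for a file with millions of reads.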
Implementation considerations
• C/C++: get used to the STL
– Vectors
– Maps
– Integer encodings
• Java
• Perl: be aware of your memory model
– Associative arrays are expensive
– Use lists when you can
– vec($myvar, $id, bits)
• (can use BioPerl)
• Python
• R
• Matlab (don’t)