CSC user accounts
- also for students
http://www.csc.fi/asiakkaaksi/korkeakoulut/kayttolupahakemukset/index_html
• Give Petri Törönen as the supervisor's name
• As justification, give "studies" and the name of the lecture course
(Geneettisen Bioinformatiikan luennot)
• Also request access to Chipster (tick the box) if you are coming to
the Bioinformatics practicals course (Bioinfon työt)!
Sequence searches and comparisons:
BLAST
2014
Sequence databases
• Many different types of databases:
– Nucleotide data banks EMBL, GenBank, DDBJ
– EST databases (dbEST:
http://www.ncbi.nlm.nih.gov/dbEST/ )
– genome builds with browsers
• ENSEMBL (www.ensembl.org)
• UCSC (http://www.genome.ucsc.edu/cgi-bin/hgGateway)
– Protein databases (Uniprot, …)
– etc
Searching databases
• Different types of queries:
– Find DNA sequences that are homologous to
your sequence (the same evolutionary origin)
– Find gene families across species
– Align an mRNA sequence to a genome
assembly (which gene does my mRNA come from?)
– Does my RNA sequence resemble any protein?
– Design primers for PCR
– Search a database for a specific motif
Database search
• The most common application in bioinformatics
• Used for searching a sequence database for a
match for the query sequence.
• Proteins are frequently composed of functional
domains repeated in many different proteins –
These parts are most likely to be conserved ->
look for shared patterns
• Two computer program families
– FastA (older, seldom used nowadays, BUT may offer
advantages that BLAST does not have)
– BLAST
Database search tools
• BLAST vs. other tools
• SPEED versus SENSITIVITY
– FAST TOOLS: BLAT, MEGABLAST, …
– Standard BLAST
– SENSITIVE TOOLS: SSEARCH, PSI-BLAST, …
What is BLAST?
• BLAST
– Basic Local Alignment Search Tool
– finds regions of local similarity between sequences. The
program compares nucleotide or protein sequences to
sequence databases and calculates the statistical
significance of matches.
– A set of several computer programs (blastn, blastx…)
– Optimized for finding local alignments between two
sequences
– Requires user to choose some parameters
http://blast.ncbi.nlm.nih.gov/Blast.cgi
BLAST
• Since DNA databases can be very large,
searching for the optimal alignment to all
sequences would take too long. Thus,
BLAST is a heuristic algorithm.
• Heuristic algorithms find a match reasonably
close to the optimal one in a much shorter time
than the full dynamic programming.
• The alignment found can then be separately
verified / refined using slower but more accurate
dynamic programming
• Understanding how BLAST works helps in
understanding when it fails to find hits!
• http://en.wikipedia.org/wiki/BLAST
How does BLAST work?
• All BLAST programs work more or less
similarly, and computationally a BLAST
search consists of three phases:
– Seeding
– Extension
– Evaluation
How BLAST works
• The query sequence is divided into
subsequences of a given length.
– word size 3 for proteins, 11 for nucleotides.
• These are used to look for exact or nearly
exact matches in the sequence database.
– Fast to do = computationally inexpensive.
• When a match is found, it is extended
further.
Seeding
Word size (W=3)
• The query KRISTIAN is divided into six overlapping words:
KRI, RIS, IST, STI, TIA, IAN
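The division of a query into overlapping words can be sketched as follows. This is only an illustration of the seeding idea, not BLAST's actual implementation; the function name `query_words` is made up for this example.

```python
def query_words(query: str, word_size: int = 3) -> list[str]:
    """Return every overlapping word of length word_size in the query."""
    return [query[i:i + word_size] for i in range(len(query) - word_size + 1)]


# Word size 3, as used by default for protein queries.
print(query_words("KRISTIAN"))
```

A query of length L yields L - W + 1 words, so the 8-letter query gives 6 words of size 3.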
Remember dot plot!
[Figure: search space with the query on one axis and the database on the other, showing word hits, an alignment, and a gapped alignment]
Threshold in seeding
• Word hit
– In blastn, a hit is two identical words, one in the database,
the other in the query sequence
– In protein-related searches, a hit is a match to the neighborhood of a query word
• The neighborhood of a word contains the word itself and all
other words whose score is at least T (threshold) when
compared via the scoring matrix.
• For example, if T=13, word=PQG, matrix=BLOSUM62, only words
scoring at least 13 count as hits: PQG-PEG (15) is
accepted, but PQG-PQA (12) is not.
• Setting T higher will remove more word hits, making BLAST
run faster, but increases the chance of missing an
interesting alignment.
• Setting W (wordsize) higher will decrease sensitivity (chance of
finding the alignment), but increase speed of the search.
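The threshold test for the PQG example can be checked with a few lines of code. This is a minimal sketch: the dictionary below is only a small excerpt of BLOSUM62 covering the pairs in the example, and the function names are invented for illustration.

```python
# Excerpt of BLOSUM62: only the residue pairs needed for the PQG example.
BLOSUM62_EXCERPT = {
    ("P", "P"): 7,
    ("Q", "Q"): 5,
    ("G", "G"): 6,
    ("Q", "E"): 2,
    ("G", "A"): 0,
}


def pair_score(a: str, b: str) -> int:
    """Look up a symmetric substitution score (KeyError outside the excerpt)."""
    if (a, b) in BLOSUM62_EXCERPT:
        return BLOSUM62_EXCERPT[(a, b)]
    return BLOSUM62_EXCERPT[(b, a)]


def word_score(word1: str, word2: str) -> int:
    """Sum the substitution scores of two equally long words."""
    return sum(pair_score(a, b) for a, b in zip(word1, word2))


def is_word_hit(query_word: str, db_word: str, threshold: int = 13) -> bool:
    """A database word is in the neighborhood if its score is at least T."""
    return word_score(query_word, db_word) >= threshold


print(word_score("PQG", "PEG"), is_word_hit("PQG", "PEG"))
print(word_score("PQG", "PQA"), is_word_hit("PQG", "PQA"))
```

Raising `threshold` shrinks the neighborhood, which is why a higher T gives a faster but less sensitive search.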
Extension
• Word hits found during seeding are
extended from their ends.
• Extension is stopped when the
alignment score drops, or in newer
implementations, when the alignment
score has dropped enough (drop-off
score) compared to its previous
maximum.
[Figure: a word hit inside an alignment, with the extension proceeding outward from the hit]
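Extension, example
• Scoring: BLOSUM62, gap=0, drop-off limit X=2
Query:            K  R  I  S  T  I  A  N
Database:         -  R  I  S  T  I  S  A  N  A
BLOSUM62 value:   0  5  4  4  5  4  1 -2  0  0
Score:            0  5  9 13 18 22 23 21 21 21
Drop-off score:   0  0  0  0  0  0  0  2
• The drop-off score is the amount by which the score has fallen from its maximum so far.
• Extension terminates when the drop-off score reaches X.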
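The drop-off rule can be sketched in code. The per-column values 0, 5, 4, 4, 5, 4, 1, -2, 0, 0 are the BLOSUM62 scores of aligning KRISTIAN against -RISTISANA with gap=0. The sketch assumes that extension stops as soon as the score has fallen X below its running maximum; the exact stopping comparison in real BLAST implementations may differ, and the function name `extend` is hypothetical.

```python
def extend(column_scores: list[int], x_drop: int) -> tuple[int, int, int]:
    """Extend along an alignment until the score drops x_drop below its maximum.

    Returns (best score, index of the column where it was reached,
    index of the column where extension stopped).
    """
    score = 0
    best = 0
    best_end = -1
    for i, value in enumerate(column_scores):
        score += value
        if score > best:
            best, best_end = score, i
        elif best - score >= x_drop:
            return best, best_end, i  # drop-off limit reached: stop here
    return best, best_end, len(column_scores)  # ran off the end


values = [0, 5, 4, 4, 5, 4, 1, -2, 0, 0]
print(extend(values, x_drop=2))
```

With X=2 the extension stops at the eighth column (index 7), and the reported alignment ends at the maximum score of 23 reached one column earlier.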
Evaluation
• When the extension stage has produced the
alignments, they will be evaluated to
determine whether they are statistically
significant.
• Statistical significance is determined using
Karlin-Altschul statistics (the E-score)
– Some simplifying assumptions are made (such
as infinitely long sequences and no gaps), but in
practice K-A statistics generalize well.
E-score
• The lower the E-score, the more
significant the alignment
• The E-score is dependent on both the
database size and the scoring system
(substitution matrix, gap penalties).
– If these are changed, the E-score for a
specific alignment will also change.
Karlin-Altschul statistics
• E-value: $E = K m n e^{-\lambda S}$
– E = expected number of alignments reaching score S just by
chance
– K = a minor constant
– m = the length of the query sequence
– n = the size of the database (DB)
– e (Euler's number) ≈ 2.718
– λS = normalized alignment score (S is the score,
λ is a normalization factor)
– The E-value estimates the number of equally good (or better)
hits expected from the DB at random
Karlin-Altschul, example
• What is the chance that when two equally
long (250) amino acid sequences are
aligned using the PAM250 matrix, the
alignment score is at least 75?
• $E = K m n e^{-\lambda S} = 0.1 \times 250 \times 250 \times e^{-(0.229 \times 75)} \approx 0.000217$
http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
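The Karlin-Altschul formula E = K·m·n·e^(−λS) is easy to evaluate directly. The sketch below recomputes the example with K = 0.1, m = n = 250, λ = 0.229 and S = 75; the function name `e_value` is chosen only for this illustration.

```python
import math


def e_value(k: float, m: int, n: int, lam: float, score: float) -> float:
    """Expected number of chance alignments with score >= `score`."""
    return k * m * n * math.exp(-lam * score)


# Two 250-residue sequences, parameters from the example.
print(f"{e_value(0.1, 250, 250, 0.229, 75):.6f}")
```

Because E grows linearly with m and n, the same alignment score becomes less significant when the database is larger.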
Filtering out repeats
• The human genome (like most others) contains
large amounts of repetitive DNA (LINE, SINE,
Alu, etc.)
• If the query sequence contains repeats, many of
the matches identified will be to other
sequences containing the same repeats.
• Repeats should in most instances be masked
out
• Masked nucleotides are usually represented with N, e.g. AATAGNNNNCGC
• Masked amino acids are represented in the same way with X
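Masking itself is a simple string operation once the repeat coordinates are known. The sketch below assumes the repeat region has already been located by some other tool; the function name `mask_region` and the unmasked input sequence are invented for this illustration.

```python
def mask_region(sequence: str, start: int, end: int, mask_char: str = "N") -> str:
    """Replace sequence[start:end] (0-based, end exclusive) with mask_char."""
    if not 0 <= start <= end <= len(sequence):
        raise ValueError("region must lie within the sequence")
    return sequence[:start] + mask_char * (end - start) + sequence[end:]


# Nucleotides are masked with N, amino acids with X.
print(mask_region("AATAGACGTCGC", 5, 9))
print(mask_region("KRISTIAN", 2, 5, mask_char="X"))
```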
Disadvantages of BLAST
• When expected sequence similarity drops
below 80%, nucleotide-nucleotide blast no
longer performs that well.
• Many significant homologies are missed
due to the initial word size requirement.
• If initial words are allowed to be
discontinuous, matching is improved.
Discontinuous initial words
• For instance, require 11 positions out of 21
consecutive nucleotides to match.
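A discontinuous word can be pictured as a template that marks which positions must match. The sketch below uses a hypothetical template with 11 required positions out of 21; it is not one of the templates actually used by any BLAST program, and the function name is made up for this example.

```python
# Hypothetical template: "1" = position must match, "0" = position is ignored.
TEMPLATE = "110100110010110100101"  # 21 positions, 11 of them required


def spaced_seed_match(a: str, b: str, template: str = TEMPLATE) -> bool:
    """True if the two windows agree at every required template position."""
    if len(a) != len(template) or len(b) != len(template):
        raise ValueError("windows must be as long as the template")
    return all(x == y for x, y, t in zip(a, b, template) if t == "1")


window = "ACGTACGTACGTACGTACGTA"
mutated_ignored = window[:2] + "A" + window[3:]   # change at an ignored position
mutated_required = "C" + window[1:]               # change at a required position
print(spaced_seed_match(window, mutated_ignored))
print(spaced_seed_match(window, mutated_required))
```

A contiguous word of 11 would be broken by a single mismatch inside it, whereas the discontinuous word tolerates mismatches at the ignored positions.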
Description of BLAST Services
http://www.ncbi.nlm.nih.gov/blast/html/BLASThomehelp.html
Different varieties of BLAST
• BLASTN: DNA query against a database
of DNA sequences (blastn).
• BLASTP: Protein query against protein
sequences (blastp).
• BLASTX: DNA query translated in six
reading frames against a protein database
(blastx).
• TBLASTX: DNA query against a database of DNA
sequences, both compared via their translation
to proteins (tblastx).
Blastn and Megablast
• Typically used for ”identifying” your
sequence.
• Megablast is a fast alternative for finding
nearly exact matches.
• Blastn is better at finding somewhat
diverged sequences (e.g. from a related
species).
• Blastn is more sensitive but slower than
megablast
Blastx and tblastx
• Blastx translates the query sequence in all
reading frames and compares it to a protein
database.
• Aggregate statistics are provided for all reading
frames.
• Tblastx queries a translated DNA sequence
against a database of translated DNA
sequences.
• Also produces aggregate statistics for all reading
frames.
BLAST programs
Query            Database         Program    Typical uses
DNA              DNA              blastn     Annotation, mapping oligonucleotides to genome
protein          protein          blastp     Identifying common regions in proteins
translated DNA   protein          blastx     Finding protein-coding genes in genomic DNA
protein          translated DNA   tblastn    Identifying transcripts, possibly from multiple organisms
translated DNA   translated DNA   tblastx    Cross-species gene prediction, searching for genes not yet in protein databases
DNA              DNA              megablast  Large and closely related sequences
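The choice among the five basic programs depends only on the type of the query, the type of the database, and whether DNA should be compared through its translation. A small lookup sketch (the function name and argument names are invented for this example):

```python
def choose_program(query_type: str, db_type: str, translate_dna: bool = False) -> str:
    """Pick a BLAST program from sequence types ("dna" or "protein").

    translate_dna only matters for DNA against DNA: it selects tblastx,
    which compares both sides as translated protein, instead of blastn.
    """
    if query_type == "dna" and db_type == "dna":
        return "tblastx" if translate_dna else "blastn"
    if query_type == "protein" and db_type == "protein":
        return "blastp"
    if query_type == "dna" and db_type == "protein":
        return "blastx"
    if query_type == "protein" and db_type == "dna":
        return "tblastn"
    raise ValueError("types must be 'dna' or 'protein'")


print(choose_program("dna", "protein"))
print(choose_program("protein", "dna"))
```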
Extensions to BLAST
• Make specific primers with Primer-BLAST (finding
primers specific to your PCR template):
http://www.ncbi.nlm.nih.gov/tools/primer-blast/index.cgi?LINK_LOC=BlastHome
• PSI-BLAST: protein sequence search method where:
– Best matches are aligned with the query sequence
– The alignment creates a profile that emphasises conserved regions
– The search is repeated with the created profile
– The whole thing is repeated
• CS-BLAST: BLAST version that uses information about
the neighboring amino acids to estimate substitutions
BLAST from command line
Why command line usage?
• Runs with 100 – 10 000 query sequences
• Runs against a specific database (say all sequences from
human, chimp and gorilla)
Applications:
• BLAST all human genes vs. all mouse genes
• Running BLAST between all sequences in an analyzed set
….
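A batch run is usually scripted. The sketch below only assembles a command line for the NCBI BLAST+ programs using their `-query`, `-db`, `-out`, `-evalue` and `-outfmt` options; the file and database names are placeholders, and actually running the command assumes BLAST+ is installed and the database has been built.

```python
def build_blast_command(program: str, query_fasta: str, database: str,
                        out_path: str, evalue: float = 1e-5) -> list[str]:
    """Assemble a BLAST+ command line as an argument list."""
    return [
        program,
        "-query", query_fasta,   # FASTA file, may hold thousands of queries
        "-db", database,         # name of a local BLAST database
        "-out", out_path,
        "-evalue", str(evalue),  # report only hits with E-value below this
        "-outfmt", "6",          # tabular output, one hit per line
    ]


# Placeholder names: all_human_genes.fasta and mouse_genes are hypothetical.
cmd = build_blast_command("blastn", "all_human_genes.fasta", "mouse_genes", "hits.tsv")
print(" ".join(cmd))
# To execute (requires BLAST+): subprocess.run(cmd, check=True)
```

Building the argument list separately from running it makes it easy to loop over many query files or databases.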
...how to choose the BLAST version best
suited to your purpose?!
Help with choosing the program:
http://www.ncbi.nlm.nih.gov/blast/producttable.shtml#pstab
Decision chart for finding homologous sequences
• Does the sequence code for a protein, or
can it be translated into protein?
– No: search the DNA sequence databases
– Yes: search the protein databases or a
translated DNA sequence database
• Is the search result meaningful
and statistically significant?
– No: first adjust the BLAST parameters; next, look for
motifs and blocks
– Yes: repeat the database search using
the sequences found as query sequences,
then make a multiple sequence alignment
of the sequences obtained and build a possible
phylogenetic tree
Chart after J. Tuimala, "Bioinformatiikan perusteet"
Let's try it
tttcggctca
catgcctaat
ctaacagggt
catacattat
caataactgc
tatatcccat
atgtcaacta
attcgacaac
cggagcatct
ctactaggag
tacccagatc
tatttctagc
acacctgaca
cattttcatc
atctgccgag
cggttgaatt
tacactcaaa
attttcttcc
Practical tips for searches
• Work with the same care as in laboratory experiments!
• Always start with the default search parameters (gap penalties, scoring
matrices), and if the results are not satisfactory, adjust the
word size, the scoring matrix and/or
possibly the E-value threshold in a more suitable direction
• "General-purpose" scoring matrices for amino acids:
BLOSUM62 (with gap penalties -8 and -2) and BLOSUM50 (-12/-2 or
-14/-2)
• Restrict the search to the database (and/or the section of it)
you are interested in; this can speed up your search substantially!
For example, if you do not want many different forms of the same sequence
among the results, and you know which organism's
sequence matches you are after, run your search with Genomic
BLAST! (A BLAST that searches the gene bank, i.e. the nucleotide database,
directly returns ALL matching
sequences, even if they are only different versions of the same genomic
sequence)
Practical tips:
• The search servers are most heavily loaded in the middle of the working day,
10-16 local time
• If your sequence is protein-coding, use the amino acid sequence, not
DNA, for comparisons. Organisms differ in, among other things, codon usage,
which can cause problems in database searches!
• Remove low-complexity regions (simple and repetitive regions)
by filtering (available as an option in BLAST) -> this reduces
the number of biologically irrelevant similarities found.
• Very short sequences, about 20 bp: the default
search parameters of basic BLAST do not work well for these! Therefore:
– Decrease the word size, increase the E-value
– When studying the genome specificity of PCR primers, use the new Primer-BLAST!
• Short amino acid sequences: use scoring matrices suited to closely related
sequences, e.g. PAM30, BLOSUM80, BLOSUM90
• For distant relatives: PAM250, BLOSUM62
Interpreting the results
• Hits to ESTs and "hypothetical"
proteins (especially very short ones):
treat these with caution!
• Poor hits are easy to recognize
by the large number of gaps in the alignment
(in that case, raise the values of
your gap penalty parameters!)
Example sequence 2: BLAST it!
gactgtgagc
cattttcaaa
tattgcttat
aaacatggca
gccatcagaa
cattgaccca
aaatattaca
atgggctaag
aaagctttag
tcaagtggac
gagttccggg
ggcaaaagta
cctatgttgg
cctggaaaac
agacacacag
cctgaatata
taccaggcaa
ttacagatgg
tgattgcaga
agccaagcaa
ctctggatcc
cagtacctct
taacacttaa
ctgggggctt