CSC user accounts - also for students
http://www.csc.fi/asiakkaaksi/korkeakoulut/kayttolupahakemukset/index_html
• Enter Petri Törönen as the supervisor's name
• As justification, give "studies" and the name of the lecture course (Geneettisen Bioinformatiikan luennot)
• Also request access to Chipster (tick the box) if you are coming to the Bioinformatics practicals course!

Sequence searches and comparisons: BLAST
2014

Sequence databases
• Many different types of databases:
– Nucleotide data banks: EMBL, GenBank, DDBJ
– EST databases (dbEST: http://www.ncbi.nlm.nih.gov/dbEST/ )
– Genome builds with browsers
  • ENSEMBL (www.ensembl.org)
  • UCSC (http://www.genome.ucsc.edu/cgi-bin/hgGateway)
– Protein databases (UniProt, …)
– etc.

Searching databases
• Different types of queries:
– Find DNA sequences that are homologous to your sequence (share the same evolutionary origin)
– Find gene families across species
– Align an mRNA sequence to a genome assembly (which gene does my mRNA come from?)
– Does my RNA sequence resemble any protein?
– Design primers for PCR
– Search a database for a specific motif

Database search
• The most common application in bioinformatics
• Used for searching a sequence database for a match to the query sequence.
• Proteins are frequently composed of functional domains that recur in many different proteins
– These parts are the most likely to be conserved -> look for shared patterns
• Two families of computer programs
– FASTA (older and now seldom used, BUT may offer advantages that BLAST lacks)
– BLAST

Database search tools
• BLAST vs. other tools
• SPEED versus SENSITIVITY
– Fast tools: BLAT, MEGABLAST, …
– In between: standard BLAST
– Sensitive tools: SSEARCH, PSI-BLAST, …

What is BLAST?
• BLAST (Basic Local Alignment Search Tool) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches.
– A set of several computer programs (blastn, blastx, …)
– Optimized for finding local alignments between two sequences
– Requires the user to choose some parameters
http://blast.ncbi.nlm.nih.gov/Blast.cgi

BLAST
• Since DNA databases can be very large, searching for the optimal alignment against every sequence would take too long. BLAST is therefore a heuristic algorithm.
• Heuristic algorithms find a match reasonably close to the optimal one in much less time than full dynamic programming.
• The alignment found can then be separately verified or refined using slower but more accurate dynamic programming.
• Understanding how BLAST works is important for understanding when it fails to find hits!
• http://en.wikipedia.org/wiki/BLAST

How does BLAST work?
• All BLAST programs work more or less similarly; computationally, a BLAST search consists of three phases:
– Seeding
– Extension
– Evaluation

How BLAST works
• The query sequence is divided into subsequences (words) of a given length.
– Word size is 3 for proteins and 11 for nucleotides.
• The words are used to look for exact or nearly exact matches in the sequence database.
– Fast to do = computationally inexpensive.
• When a match is found, it is extended further.

Seeding
Word size (W=3)
• The query KRISTIAN is split into the overlapping words KRI, RIS, IST, STI, TIA, IAN
Remember the dot plot!
[Figure: search space with the query on one axis and the database on the other, showing word hits, an alignment, and a gapped alignment]

Threshold in seeding
• Word hit
– A hit is a pair of identical words, one in the database and the other in the query sequence (used in blastn)
– A hit is a database word that lies in the neighborhood of a query word (used in protein-related searches)
• The neighborhood of a word contains the word itself and all other words whose score against it, computed with the scoring matrix, is at least T (the threshold).
• For example, if T=13, word=PQG and matrix=BLOSUM62, only words scoring at least 13 count as hits: PQG-PEG (15) is accepted, but PQG-PQA (12) is not.
• Setting T higher removes more word hits, which makes BLAST run faster but increases the chance of missing an interesting alignment.
• Setting W (word size) higher decreases sensitivity (the chance of finding the alignment) but increases the speed of the search.

Extension
• Word hits found during seeding are extended from both ends.
• Extension stops when the alignment score drops or, in newer implementations, when the alignment score has dropped far enough (the drop-off score) below its previous maximum.
[Figure: an alignment with the word hit in the middle and the extension on either side]

Extension, example: drop-off score (gap = 0, X = 2, BLOSUM62)
Query:            K  R  I  S  T  I  A  N  -  -
Subject:          -  R  I  S  T  I  S  A  N  A
BLOSUM62 value:   0  5  4  4  5  4  1 -2  0  0
Score:            0  5  9 13 18 22 23 21 21 21
Drop-off:         0  0  0  0  0  0  0  2
• The drop-off is the previous maximum score minus the current score. Extension terminates when the drop-off reaches X.

Evaluation
• Once the extension stage has produced the alignments, they are evaluated to determine whether they are statistically significant.
• Statistical significance is determined using Karlin-Altschul statistics (the E-score)
– Some simplifying assumptions are made (such as infinitely long sequences and no gaps), but in practice K-A statistics generalize well.

E-score
• The lower the E-score, the more significant the alignment
• The E-score depends on both the database size and the scoring system (substitution matrix, gap penalties).
– If these are changed, the E-score for a specific alignment will also change.

Karlin-Altschul statistics
• E-value: $E = K m n e^{-\lambda S}$
– E = number of alignments reaching score S just by chance
– K = a minor constant
– m = the length of the query sequence
– n = the size of the database (DB)
– e (Euler's number) ≈ 2.718
– λS = normalized alignment score (S is the score, λ is a normalization factor)
– The E-value estimates the number of equally good (or better) hits expected from the DB by chance

Karlin-Altschul, example
• How many alignments scoring at least 75 are expected by chance when two equally long (250 residues) amino acid sequences are aligned using the PAM250 matrix?
• $E = K m n e^{-\lambda S} = 0.1 \cdot 250 \cdot 250 \cdot e^{-0.229 \cdot 75} \approx 0.000217$
http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html

Filtering out repeats
• The human genome (like most others) contains large amounts of repetitive DNA (LINE, SINE, Alu, etc.)
• If the query sequence contains repeats, many of the homologies identified will be to other sequences containing the same repeats.
• Repeats should in most instances be masked out
• Masked nucleotides are usually shown as N, e.g. AATAGNNNNCGC
• Masked amino acids are shown in the same way with X

Disadvantages of BLAST
• When the expected sequence similarity drops below 80%, nucleotide-nucleotide BLAST no longer performs well.
• Many significant homologies are missed because of the initial word size requirement.
• If initial words are allowed to be discontinuous, matching is improved.

Discontinuous initial words
• For instance, require 11 positions out of 21 consecutive nucleotides to match.

Description of BLAST Services
http://www.ncbi.nlm.nih.gov/blast/html/BLASThomehelp.html

Different varieties of BLAST
• BLASTN: DNA query against a database of DNA sequences (blastn).
• BLASTP: Protein query against protein sequences (blastp).
• BLASTX: DNA query translated in six reading frames against a protein database (blastx).
• TBLASTX: DNA query translated in six reading frames against a DNA database translated in six reading frames (tblastx).

Blastn and Megablast
• Typically used for "identifying" your sequence.
• Megablast is a fast alternative for finding nearly exact matches.
• Blastn is better at finding somewhat diverged sequences (e.g. from a related species).
• Blastn is more sensitive but slower than megablast.

Blastx and tblastx
• Blastx translates the query sequence in all reading frames and compares it to a protein database.
• Aggregate statistics are provided for all reading frames.
• Tblastx queries a translated DNA sequence against a database of translated DNA sequences.
• It also produces aggregate statistics for all reading frames.
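The six-frame translation that blastx and tblastx apply to DNA can be sketched in a few lines of Python. This is a minimal illustration, not BLAST's own implementation: it assumes the standard genetic code, the function names (reverse_complement, translate, six_frame_translation) are hypothetical, and any codon containing a character other than A, C, G or T is written as X.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: the
# standard code; BLAST also supports other genetic codes).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate DNA codon by codon from position 0; '*' marks a stop codon."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")  # unknown codon -> X
        for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return the translations of frames +1, +2, +3, -1, -2 and -3."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translation("ATGGCCTAA").items():
        print(frame, protein)
```

For the toy input ATGGCCTAA, frame +1 reads ATG GCC TAA and gives MA*, while frame -1 reads the reverse complement TTA GGC CAT and gives LGH. Each of the six protein strings is then compared against the database separately, which is why the statistics are aggregated over reading frames.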
BLAST programs
Query            Database         Program     Typical uses
DNA              DNA              blastn      Annotation, mapping oligonucleotides to genome
protein          protein          blastp      Identifying common regions in proteins
translated DNA   protein          blastx      Finding protein-coding genes in genomic DNA
protein          translated DNA   tblastn     Identifying transcripts, possibly from multiple organisms
translated DNA   translated DNA   tblastx     Cross-species gene prediction, searching for genes not yet in protein databases
DNA              DNA              megablast   Large and closely related sequences

Extensions to BLAST
• Make specific primers with Primer-BLAST (finding primers specific to your PCR template, http://www.ncbi.nlm.nih.gov/tools/primer-blast/index.cgi?LINK_LOC=BlastHome)
• PSI-BLAST: protein sequence search method where:
– The best matches are aligned with the query sequence
– The alignment is used to create a profile that emphasises conserved regions
– The search is repeated with the created profile
– The whole procedure is repeated
• CS-BLAST: BLAST version that uses information about the neighboring amino acids to estimate substitutions

BLAST from command line
Why command line usage?
• Runs with 100 to 10 000 query sequences
• Runs against a specific database (say, all sequences from human, chimp and gorilla)
Applications:
• BLAST all human genes vs. all mouse genes
• Running BLAST between all sequences in an analyzed set
• …

... how do you choose the BLAST version best suited to your purpose?!
Help with choosing a program:
http://www.ncbi.nlm.nih.gov/blast/producttable.shtml#pstab

Decision chart for finding homologous sequences
(Chart after J. Tuimala, "Bioinformatiikan perusteet")
• Does the sequence encode a protein, or can it be translated into protein?
– No: search the DNA sequence databases
– Yes: search the protein databases or a translated DNA sequence database
• Is the search result meaningful and statistically significant?
– No: first adjust the BLAST parameters; next, look for motifs and blocks
– Yes: repeat the database search using the sequences found as query sequences
• Make a multiple sequence alignment of the sequences obtained and build a possible phylogenetic tree

Let's try
tttcggctca catgcctaat ctaacagggt catacattat caataactgc tatatcccat atgtcaacta attcgacaac cggagcatct ctactaggag tacccagatc tatttctagc acacctgaca cattttcatc atctgccgag cggttgaatt tacactcaaa attttcttcc

Practical tips for searches
• Do your searches with the same care as your laboratory experiments!
• Always start with the default search parameters (gap penalties, scoring matrices); if the results are unsatisfactory, adjust the word size, the scoring matrix and/or possibly the E-value threshold in a more suitable direction
• "General-purpose" scoring matrices for amino acids: BLOSUM62 (with gap penalties -8 and -2) and BLOSUM50 (-12/-2 or -14/-2)
• Restrict the search to the database (and/or the division of it) that interests you; this can speed up your search substantially! For example, if you do not want different forms of the same sequence returned many times, and you know which organism's sequence matches you are after, run your search with Genomic BLAST! (A BLAST search run directly against GenBank, i.e. the nucleotide database, returns ALL matching sequences, even if they are only different versions of the same genomic sequence.)

Practical tips:
• The search servers are most heavily loaded in the middle of the working day, 10:00-16:00 local time
• If your sequence is protein-coding, use the amino acid sequence, not the DNA, for comparisons. Organisms differ in, among other things, codon usage, which can cause problems in database searches!
• Remove low-complexity regions (simple sequence and repeats) by filtering ("filtering" is available as an option in BLAST) -> this reduces the number of biologically irrelevant similarities found.
• Very short sequences, about 20 bp: the default search parameters of standard BLAST do not work well for these!
Therefore:
– Decrease the word size, increase the E-value threshold
– When checking the genome specificity of PCR primers, use the new Primer-BLAST!
• Short amino acid sequences: use scoring matrices suited to closely related sequences, e.g. PAM30, BLOSUM80, BLOSUM90
• For distant relatives: PAM250, BLOSUM62

Interpreting the results
• Hits to ESTs and to "hypothetical" proteins (especially very short ones): treat these with caution!
• Poor hits are easy to recognise from the large number of gaps in the alignment (in that case, raise your gap penalty values!)

Example sequence 2: BLAST it!
gactgtgagc cattttcaaa tattgcttat aaacatggca gccatcagaa cattgaccca aaatattaca atgggctaag aaagctttag tcaagtggac gagttccggg ggcaaaagta cctatgttgg cctggaaaac agacacacag cctgaatata taccaggcaa ttacagatgg tgattgcaga agccaagcaa ctctggatcc cagtacctct taacacttaa ctgggggctt
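The example sequences in these slides are printed in space-separated blocks of ten, which is not what a command-line BLAST run expects as input. The sketch below shows one way to turn such a pasted block into a FASTA record. The function name to_fasta, the default header and the 60-column line width are assumptions made for illustration; only the letters A, C, G, T and N are accepted.

```python
def to_fasta(raw_sequence, header="example_sequence_2", width=60):
    """Strip spaces, digits and line breaks from a pasted nucleotide
    sequence and return it as a FASTA record (header line + wrapped lines)."""
    residues = "".join(ch for ch in raw_sequence if ch.isalpha()).upper()
    unexpected = set(residues) - set("ACGTN")
    if unexpected:
        # Refuse to guess: report the offending letters instead.
        raise ValueError(f"unexpected characters: {sorted(unexpected)}")
    lines = [residues[i:i + width] for i in range(0, len(residues), width)]
    return ">" + header + "\n" + "\n".join(lines) + "\n"


if __name__ == "__main__":
    # First three blocks of example sequence 2, as printed on the slide.
    pasted = "gactgtgagc cattttcaaa tattgcttat"
    print(to_fasta(pasted), end="")
```

The resulting text can be pasted into the BLAST web form or saved to a file and used as the query in a command-line run.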