Sequence Search
Abhishek Niroula
Department of Experimental Medical Science
Lund University
2015-12-10

Sequence search
• Sequences
  – Nucleotide and amino acid sequences
  – Known sequences are stored in different databases
    • NCBI, Ensembl, and others
  – Number of organisms being sequenced is increasing
    • Ensembl, Ensembl Plants, Ensembl Fungi, Ensembl Bacteria, etc.
    • Genome 10K (genomic zoo) project
  – Use of sequences is expanding rapidly in biomedical research
  – 1000 Genomes Project, 100K Genomes Project
• Sequence search
  – Search for an appropriate sequence
  – Search for similar sequences in a database

Sequence search
• Identify a new sequence
• Functional and structural annotation of sequences
• Finding homologous sequences for
  – Genomic, phylogenetic, structural studies, etc.
• Haemophilus influenzae
  – The genome of Haemophilus influenzae was reported in 1995 (Fleischmann et al.)
  – 1,743 assumed coding regions were translated into amino acid sequences and searched for similarity in the Swiss-Prot database
  – 1,007 of them matched
    • The biochemical function could be deduced for each of them
• Multiple sclerosis (source: Martin Tompa)
  – Autoimmune disease in which the T-cells recognise the nerves' myelin sheaths as foreign bodies and attack them
  – Hypothesis: the nerves' myelin sheath proteins are similar to bacterial sheath proteins from an earlier infection
  – Methodology
    • Myelin sheath proteins were sequenced
    • A database was searched for similar bacterial and viral sequences
    • Lab tests checked whether the T-cells attacked the identified bacterial and viral proteins

Similarity vs homology
• Similarity
  – Similarity is the degree of likeness of two sequences
  – It is a quantitative measure
• Homology
  – Homology is an evolutionary relationship between two sequences
  – It cannot be measured
  – Distant and close homology refer to the distance between the sequences and their common ancestor
• Two sequences are 80% similar.
• Two sequences are 80% homologous.
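
Because similarity is a quantitative measure, it can be computed directly from an alignment. The sketch below is an illustration and not part of the original slides: it computes percent identity, one simple similarity measure, for two sequences that are already aligned. The function name percent_identity, the use of "-" as the gap character, and the choice to skip gapped columns are all assumptions made for this example.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of aligned, non-gap columns that hold identical residues.

    Assumes both strings come from the same pairwise alignment, so they
    have equal length and use '-' for gaps (an assumption of this sketch).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    compared = 0   # columns where neither sequence has a gap
    identical = 0  # compared columns with the same residue
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a == b:
            identical += 1

    if compared == 0:
        return 0.0
    return 100.0 * identical / compared


if __name__ == "__main__":
    print(percent_identity("MDLSALTRQ", "MDLSAVTRQ"))
```

Here 8 of the 9 aligned positions match, so the result is about 88.9. No comparable calculation yields a "percent homology": two sequences either share a common ancestor or they do not.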

Orthologs and paralogs
[Figure: example gene tree]
Source: http://www.ensembl.org/info/genome/compara/tree_example1.png

Sequence search: Problem
• Given
  – Query sequence
  – Database
• Goal
  – To find statistically significant similarity that can be used to infer homology
[Figure: a query is searched against a database of sequences 1-7; the result contains sequences 1, 2, 4, and 7]
• Sensitivity: Are all related sequences identified?
• Specificity: Are all unrelated sequences rejected?
  TP  1, 2, 4
  FP  7
  FN  3, 5
  TN  6

Heuristic database searching
• Sequence search: problem
  – Exact similarity computation between a query sequence and a database using dynamic programming is computationally intensive
  – With available technology, aligning a query sequence against an entire database is not feasible
• Solution
  – Heuristic methods: fast scanning for similar sequences
  – Sequences similar to a query sequence are searched from the database using heuristic methods before computing exact alignment scores
  – Tools
    • BLAST
    • FASTA

BLAST
• Basic Local Alignment Search Tool
• Developed by Altschul et al., 1990
• Determines the local alignment between a query and a database
• BLAST consists of two steps:
  – Searching for matches
  – Computing the statistical significance of the matches

BLAST
• Given a query sequence, split the query sequence into words of k residues
  – k = 3 for amino acid sequences
  – k = 11-12 for nucleotide sequences
• Generate all other combinations of words of k residues
• Score each of the words using a substitution matrix
  – Words with scores higher than a threshold are considered in the next step
[Figure: the query sequence MDLSALTRQ is split into 3-mers (MDL, DLS, LSA, SAL, ALT, ...); similar words are listed under each, e.g. MDV, MDM, MRL, MQL under MDL and DLR, QLS, DVS, DKS under DLS]

BLAST
• Match each of the high-scoring words in the database sequences
• The matches are extended in both directions to form ungapped local alignments, yielding high-scoring segment pairs (HSPs)
• HSPs that score above the cutoff threshold are kept
• Significance of the ungapped HSP is calculated
[Figure: high-scoring words are matched in the database and extended without gaps into an HSP]

Gapped BLAST
• Altschul et al., 1997
• Extension of matches requires two non-overlapping matches on the same diagonal within a distance "A"
• Fewer extensions make the search faster
• Perform gapped alignments around the hits that score higher than a pre-defined score

FASTA
• FAST-All (extension of FASTP and FASTN)
• Developed by Lipman and Pearson, 1985 (FASTP); FASTA itself followed in Pearson and Lipman, 1988
• FASTA also builds a local alignment between query and database
• FASTA has four steps:
  – Hashing
  – 1st scoring
  – 2nd scoring
  – Alignment

FASTA
• Hashing
  – Query sequence is split into words of size k
  – Exact word matches are identified in the database
  – Regions populated with matches are identified and the 10 best regions are selected
• 1st scoring
  – Within the selected regions, optimal local alignment is computed using a substitution matrix
• 2nd scoring
  – Alignments are combined to obtain a single larger alignment
  – Gaps are allowed in the alignment
• Alignment
  – Alignment is optimized using Smith-Waterman dynamic programming
• Statistical significance for each alignment is computed

Variants of BLAST and FASTA
  Query        Database     Program                   Comment
  Protein      Protein      blastp, fasta
  Nucleotide   Nucleotide   blastn, fasta
  Nucleotide   Protein      blastx, fastx, fasty      Translate query to a protein
  Protein      Nucleotide   tblastn, tfastx, tfasty   Translate database
  Nucleotide   Nucleotide   tblastx                   Translate both query and database

Using BLAST and FASTA
• Web application
  – BLAST
    • http://blast.ncbi.nlm.nih.gov/Blast.cgi
  – FASTA
    • http://www.ebi.ac.uk/Tools/sss/fasta/
• Standalone
  – Local installation
  – Database should also be downloaded

BLASTP
[Figure not recovered]

FASTA
[Figure not recovered]

Input formats
• FASTA format files
  – Widely used in bioinformatics
• Other file formats
  – GCG, EMBL, GenBank, PIR, UniProtKB/Swiss-Prot, PHYLIP
• Identifiers
  – Supported in BLAST
  – Accession
  – Gene
    identifier

Database
• Generic databases
  – UniProt or RefSeq databases
  – UniRef and non-redundant database: database of unique sequence entries
  – Genome, chromosome
• Structure databases
  – Database of sequences for which 3D structures are available in PDB
  – Used especially for finding a template sequence for homology modelling
• Specialized database
  – A local database can be created from the sequences that are relevant for your purpose

Other parameters
• Expect
  – Statistical significance parameter
  – Default = 10, i.e. hits are reported as long as no more than 10 matches with an equal or better score are expected by chance
• Filter
  – Mask regions of low complexity and short repeats
• Alignment options
  – Substitution table and gap function

Output
• There are three major sections in BLAST output
  – Header
    • Information about the query sequence and the database searched
    • Graphical overview of matches (only in web version)
  – Description
    • Description of the sequences (hits)
    • Scores: generated from the alignment; higher is better
    • E-value: number of hits expected by chance; lower is better
    • Sequence identifier in NCBI databases
  – Alignment
    • Pairwise alignment
    • Details of the alignment (score, E-value, similarity, etc.)

PSI-BLAST
• Position-Specific Iterated BLAST
• More sensitive to distantly related sequences
• Algorithm
  – In the first iteration, standard BLAST is run
    • A PSSM (position-specific scoring matrix) is generated based on the significant alignments
  – In the next iteration, the new PSSM is used to score the alignments
    • A new PSSM is generated based on the significant alignments
  – The above step is repeated until a stop criterion is met.
    Stop criteria may be:
    • No new sequences are identified in two consecutive iterations
    • The desired number of iterations is reached

Sequence search: Challenges
• Self-hits are uninteresting
• Size of target database
  – Use a database no bigger than required
• Paralogs have similar sequences but often have different functions
• Low-complexity regions reduce the quality of alignments
• Short repeats give false hits
• Results for very short queries may be less reliable
  – Matches that are 50% identical over 20-40 amino acids occur frequently by chance
• Distant homologs may have very low similarity

Sequence search
• Exercise
  – BLAST
  – FASTA
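
As a companion to the exercise, the low-complexity challenge listed above can be illustrated with a small filter. The sketch below is a simplified stand-in and not the filtering algorithm that BLAST itself applies: it slides a window along the sequence and masks windows whose residue composition has low Shannon entropy. The function names, the window size of 12, and the 2.2-bit threshold are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Shannon entropy, in bits, of the residue composition of a window."""
    counts = Counter(window)
    total = len(window)
    # Sum of p * log2(1/p) over the residues present in the window.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def mask_low_complexity(seq: str, window: int = 12,
                        min_entropy: float = 2.2) -> str:
    """Replace every residue covered by a low-entropy window with 'X'.

    window and min_entropy are illustrative defaults, not values taken
    from any real tool.
    """
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)


if __name__ == "__main__":
    # A run of a single residue is the extreme low-complexity case.
    print(mask_low_complexity("MDLSALTRQ" + "Q" * 15 + "KVLHGEWDNF"))
```

A window made of one repeated residue has an entropy of 0 bits and is always masked, while a window of 12 distinct residues is left untouched. Sequences shorter than the window are returned unchanged.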