Bioinformatics Unit 1: Data Bases and Alignments
Lecture 2: “Homology” Searches and Sequence Alignments

Overview of Lecture
• Introduction: When, what and how to search for “homologous” sequences
• Terminology
• Nucleotide database searches
  – BLAST programs
  – FASTA programs
  – Others
• Protein database searches

Introduction
• When do I search?
• What do I search for (which database)?
• How do I search (which program)?
• What do the search results mean?
• Answer: Database searches (hopefully) identify biologically relevant sequence alignments

Sequence Alignments
• Sequence alignments allow comparison of a new sequence with one, a group of, or all known sequences
• A well-designed alignment can allow one to infer:
  – gene or protein function
  – evolutionary relationships among genes, proteins or species
  – structure of proteins or nucleic acids
• The process is highly dependent on the choice of query and the alignment parameters

Terminology Associated with Searches and Alignments
• Query: The input sequence (or other type of search term) with which all of the entries in a database are to be compared.
  – Examples: Your unknown DNA sequence, a word, an accession number, etc.
• Algorithm: A fixed procedure embodied in a computer program
  – Examples: Alignment programs like BLAST, FASTA, BLITZ, etc.

Terminology Associated with Searches and Alignments (cont.)
• Homology: Similarity attributed to descent from a common ancestor (often misused).
• Identity: The extent to which two (nucleotide or amino acid) sequences are invariant. Often expressed as a percentage.
• Similarity: The extent to which nucleotide or protein sequences are related. The extent of similarity between two sequences can be based on percent sequence identity (nucleotides) and/or conservation (proteins, e.g., a lysine substituted for an arginine).

Terminology Associated with Searches and Alignments (cont.)
• Gap: A space introduced into an alignment to compensate for insertions and deletions in one sequence relative to another.
  – Example: Aligning a cDNA sequence with a gene requires gaps at the positions of introns
• Substitution (scoring) matrices: Speed vs. sensitivity
  – Allow a query sequence to be aligned with sequences in the database very rapidly. The most significant matches (successful alignments) are reported.
  – Less complex, faster matrices sacrifice some sensitivity (i.e., a match must be better to be recognized than with a slower, more complex matrix).
  – The matrix, together with the choice of program, essentially determines the search sensitivity and speed.

Terminology Associated with Searches and Alignments (cont.)
• Filters: usually part of an alignment program and turned on by default.
  – The filter masks (hides) regions of the query sequence (your sequence) that have low compositional complexity (like poly-A tails). Masking is achieved by replacing the sequence with a string of N's (NNNNNN), the code for any DNA base.
  – Poly-A tails, for example, can give rise to artificially high scores and therefore misleading results, because such sequences occur in large numbers throughout the genome, and therefore throughout the database.
  – Similarly, programs now exist to filter out vector sequences.
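
The masking idea described above can be illustrated with a small Python sketch. This is not the filter that BLAST actually applies (real low-complexity filters score compositional complexity over windows); it is a deliberately simple stand-in that only replaces long single-base runs, such as poly-A tails, with N's. The function name `mask_homopolymers` and the `min_run` cutoff of 10 are assumptions made for illustration.

```python
import re


def mask_homopolymers(seq, min_run=10):
    """Replace every run of one repeated base (length >= min_run) with N's.

    Hypothetical helper for illustration only: real filters use
    compositional-complexity scores, not a simple run-length rule.
    """
    seq = seq.upper()
    # ([ACGT]) captures one base; \1{min_run-1,} requires that same base
    # to repeat, so the whole match is a run of at least min_run bases.
    run_pattern = re.compile(r"([ACGT])\1{%d,}" % (min_run - 1))
    return run_pattern.sub(lambda m: "N" * len(m.group(0)), seq)


if __name__ == "__main__":
    query = "ACGT" + "A" * 12 + "GGC"   # toy query ending in a poly-A stretch
    print(mask_homopolymers(query))
```

Because the masked stretch is written as N's, it can no longer produce exact word matches against the many poly-A regions in a database, which is the effect the filter is meant to have.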
Nucleotide Database Searching
• Commonly used search algorithms:
  – BLAST (at NCBI)
  – FASTA (in France)
  – BLITZ (at EBI, part of EMBL)
  – SSEARCH (in France)
  – PSI-BLAST (at NCBI)

Basic Local Alignment Search Tool (BLAST)
• A set of similarity search tools
• Fast and sensitive
• “Real” matches are fairly easily distinguished from random matches by scoring
• Seeks local rather than global alignment
• Can detect relationships between sequences that share only regions of similarity
  – GREAT, as proteins are “modular”

Algorithms Within BLAST
• Blastn: compares a nucleotide query sequence against a nucleotide sequence database
• Blastp: compares an amino acid query sequence against a protein sequence database
• Blastx: compares a nucleotide query sequence translated in all reading frames against a protein sequence database

Algorithms Within BLAST (cont.)
• Tblastn: compares an amino acid query sequence against a nucleotide sequence database dynamically translated in all reading frames
• Tblastx: compares the six-frame translation of a nucleotide query sequence against a nucleotide sequence database dynamically translated in all reading frames. COMPUTATIONALLY INTENSE!!
• Choose the correct algorithm!!!
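
To make "translated in all reading frames" concrete, here is a minimal Python sketch of a six-frame translation using the standard genetic code. It is an illustration of the concept, not the code BLAST uses; the function names and the frame labels ("+1" to "-3") are assumptions chosen for this example, and any codon containing a character other than A, C, G or T is reported as "X".

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from the start of dna; '*' marks a stop."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna):
    """Return a dict mapping frame labels '+1'..'+3', '-1'..'-3' to peptides."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for frame, peptide in six_frame_translation("ATGGCC").items():
        print(frame, peptide)
```

Blastx applies this kind of translation to the query only, Tblastn to every database sequence, and Tblastx to both, which is why Tblastx is the most computationally expensive of the five programs.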
A Sample BLAST Search
• Query sequence:
AAAAGAAAAGGTTAGAAAGATGAGAGATGATAAAGGGTCCATTTGAGGTTAGGTAA
TATGGTTTGGTATCCCTGTAGTTAAAAGTTTTTGTCTTATTTTAGAATACTGTGAT
CTATTTCTTTAGTATTAATTTTTCCTTCTGTTTTCCTCATCTAGGGAACCCCAAGA
GCATCCAATAGAAGCTGTGCAATTATGTAAAATTTTCAACTGTCTTCCTCAAAATA
AAGAAGTATGGTAATCTTTACCTGTATACAGTGCAGAGCCTTCTCAGAAGCACAGA
ATATTTTTATATTTCCTTTATGTGAATTTTTAAGCTGCAAATCTGATGGCCTTAAT
TTCCTTTTTGACACTGAAAGTTTTGTAAAAGAAATCATGTCCATACACTTTGTTGC
AAGATGTGAATTATTGACACTGAACTTAATAACTGTGTACTGTTCGGAAGGGGTTC
CTCAAATTTTTTGACTTTTTTTGTATGTGTGTTTTTTCTTTTTTTTTAAGTTCTTA
TGAGGAGGGGAGGGTAAATAAACCACTGTGCGTCTTGGTGTAATTTGAAGATTGCC
CCATCTAGACTAGCAATCTCTTCATTATTCTCTGCTATATATAAAACGGTGCTGTG
AGGGAGGGGAAAAGCATTTTTCAATATATTGAACTTTTGTACTGAATTTTTTTGTA
ATAAGCAATCAAGGTTATAATTTTTTTTAAAATAGAAATTTTGTAAGAAGGCAATA
TTAACCTAATCACCATGTAAGCACTCTGGATGATGGATTCCACAAAACTTGGTTTT
ATGGTTACTTCTTCTCTTAGATTCTTAATTCATGAGGAGGGTGGGGGAGGGAGGTG
GAGGGAGGGAAGGGTTTCTCTATTAAAATGCATTCGTTGTGTTTTTTAAGATAGTG
TAACTTGCTTAAATTTCTTATGTGACATTAACAAATAAAAAAGCTCTTTTAATATTAGATAA
• In the graphical results, the top red line represents the query sequence
• Each line below indicates a matching sequence, sorted by score (shown in color) and position of match
• Below the graphic is a list of high-scoring matches, followed by the actual alignments

The “Expectation” Value (E Value)
• Expectation value: the number of different alignments with scores equivalent to or better than S (the threshold score) that are expected to occur in a database search by chance.
• Given in scientific notation.
• For example, an E value of e-167 means that about 10^-167 such alignments are expected by chance, i.e., there is roughly a 1 in 10^167 chance that the match is random
• The smaller the E value, the more significant the match
• Varies with the size of the database (number of bp) and the length of the query sequence

How Does the BLAST Algorithm Work?
An Overview
• A two-step process
• Initial scanning identifies high-scoring matches to “words” in the query sequence
  – Positive scores for exactly matching bases or amino acids
  – Negative scores for mismatches
  – Default word size is 11 bases (for blastn)
• Sequences with high scores are extended in both directions in the second step until the best score is achieved
• Scoring matrices are used in each step

Options
• Word length
  – Set at 11 bases for blastn
  – Requires a perfect 11 bp match to go to the second step
  – The chance of a random 11 bp exact match at a given position is 1/4^11 (= 1/4,194,304)
  – Shortening the word length may make the search more sensitive, but it may increase the number of hits that are not biologically significant

Options (cont.)
• Filters
  – Can mask regions of low complexity
    • Poly-A tails
    • Proline-rich regions
  – Can now mask human repetitive sequences
  – The low-complexity filter is on by default; others must be activated

Options (cont.)
• The Expect threshold
  – The statistical significance threshold for reporting matches against database sequences
  – The default value is 10, meaning that 10 matches are expected to be found merely by chance
  – If the E value of a match is greater than the EXPECT threshold, the match will not be reported
  – Lower EXPECT thresholds are more stringent, leading to fewer chance matches being reported
  – Increasing the threshold shows less stringent matches. Fractional values are acceptable.

Protein Database Searching
• 2-5 times more sensitive than a DNA database search!
  – The DNA alphabet is smaller than the protein alphabet (4 vs. 20 letters)
  – The genetic code is redundant (e.g., 6 serine codons)
  – There is selection for function, thus protein sequence is more highly conserved through time
• Genes or proteins in different organisms that derive from a common ancestral gene and generally have the same function are called “orthologs”
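
The two-step process described in the BLAST overview and the word-length option (exact word matches first, then extension in both directions) can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the BLAST implementation: it performs ungapped extension only, and the scores (+1 match, -3 mismatch), the `drop_off` stopping rule and all function names are hypothetical choices made for this example.

```python
def find_seeds(query, subject, word_size=11):
    """Step 1: return (query_pos, subject_pos) pairs of exact word matches."""
    index = {}
    for i in range(len(query) - word_size + 1):
        index.setdefault(query[i:i + word_size], []).append(i)
    seeds = []
    for j in range(len(subject) - word_size + 1):
        for i in index.get(subject[j:j + word_size], []):
            seeds.append((i, j))
    return seeds


def extend_one_way(query, subject, q_pos, s_pos, step,
                   match=1, mismatch=-3, drop_off=5):
    """Step 2 (one direction): extend without gaps and return
    (best_score, best_length). Extension stops once the running score
    has fallen more than drop_off below the best score seen."""
    score = best_score = 0
    length = best_length = 0
    while 0 <= q_pos < len(query) and 0 <= s_pos < len(subject):
        score += match if query[q_pos] == subject[s_pos] else mismatch
        length += 1
        if score > best_score:
            best_score, best_length = score, length
        elif best_score - score > drop_off:
            break
        q_pos += step
        s_pos += step
    return best_score, best_length


def seed_and_extend(query, subject, word_size=11):
    """Return sorted (score, query_start, subject_start, length) tuples."""
    hits = set()
    for q, s in find_seeds(query, subject, word_size):
        right_score, right_len = extend_one_way(
            query, subject, q + word_size, s + word_size, +1)
        left_score, left_len = extend_one_way(query, subject, q - 1, s - 1, -1)
        score = word_size + left_score + right_score  # each seed base scores +1
        hits.add((score, q - left_len, s - left_len,
                  left_len + word_size + right_len))
    return sorted(hits, reverse=True)


if __name__ == "__main__":
    core = "ACGATCGTTAGCCTAGGA"            # 18 bases shared by both sequences
    print(seed_and_extend("GGG" + core + "TTT", "CCCCC" + core + "AAAAA"))
    print("random 11-mer match: 1 in", 4 ** 11)
```

Lowering `word_size` produces more seeds, and therefore more extensions to evaluate, which mirrors the sensitivity versus chance-hit trade-off noted under the word-length option.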