Module 2: Sequence DBs and Similarity Searches

Learning objectives
  Understand how information is stored in GenBank.
  Learn how to read a GenBank flat file.
  Learn how to search GenBank for information.
  Understand the difference between header, features and sequence.
  Learn the difference between a primary database and a secondary database.
  Understand the principle of similarity searches using the BLAST program.

What is GenBank?
  Gene sequence database.
  Annotated records that represent single contiguous stretches of DNA or RNA; a record may have more than one coding region (limit 350 kb).
  Generated from direct submissions to the DNA sequence databases by the authors.
  Part of the International Nucleotide Sequence Database Collaboration, which exchanges information on a daily basis.

International Nucleotide Sequence Database Collaboration
  GenBank (NCBI)
  EMBL (EBI), United Kingdom
  DDBJ, Japan

History of GenBank
  Began with the Atlas of Protein Sequence and Structure (Dayhoff et al., 1965).
  In 1986 it began collaborating with EMBL, and in 1987 with DDBJ.
  It is a primary database (i.e., experimental data is placed into it).
  Examples of secondary databases derived from GenBank/EMBL/DDBJ: Swiss-Prot, PIR.
  The GenBank Flat File (GBFF) is a human-readable form of the records.

General Comments on GBFF
  Three sections:
  1) Header: information about the whole record.
  2) Features: description of annotations, each represented by a key.
  3) Nucleotide sequence: each record ends with // on its last line.
  DNA-centered: the translated sequence is only a feature.

Feature Keys
  Purpose:
  1) Indicates the biological nature of the sequence.
  2) Supplies information about changes to sequences.

  Feature Key     Description
  conflict        Separate determinations of the same sequence differ
  rep_origin      Origin of replication
  protein_bind    Protein binding site on DNA
  CDS             Protein coding sequence

Feature Keys: Terminology
  Feature Key     Location/Qualifiers
  CDS             23..400
                  /product="alcohol dehydro."
                  /gene="adhI"
  Interpretation: The feature CDS is a coding sequence beginning at base 23 and ending at base 400. It has a product called "alcohol dehydrogenase" and corresponds to the gene called "adhI".

Feature Keys: Terminology (Cont.)
  Feature Key     Location/Qualifiers
  CDS             join(544..589,688..1032)
                  /product="T-cell recep. B-ch."
                  /partial
  Interpretation: The feature CDS is a partial coding sequence formed by joining the indicated elements into one contiguous sequence encoding a product called T-cell receptor beta-chain.

Record from GenBank
  (Slide annotations are shown in square brackets.)

  LOCUS       SCU49845     5028 bp    DNA    PLN    21-JUN-1999
              [PLN: GenBank division (plant, fungal and algal); 21-JUN-1999: modification date]
  DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p
              (AXL2) and Rev7p (REV7) genes, complete cds.
              [cds: coding region]
  ACCESSION   U49845
              [Unique identifier (never changes)]
  VERSION     U49845.1  GI:1293613
              [U49845.1: nucleotide sequence identifier in accession.version form (changes when there is a change in sequence)]
              [GI: GeneInfo identifier (changes whenever there is a change)]
  KEYWORDS    .
              [Word or phrase describing the sequence (not based on controlled vocabulary). Not used in newer records.]
  SOURCE      baker's yeast.
              [Common name for organism]
    ORGANISM  Saccharomyces cerevisiae
              Eukaryota; Fungi; Ascomycota; Hemiascomycetes; Saccharomycetales;
              Saccharomycetaceae; Saccharomyces.
              [Formal scientific name for the source organism and its lineage, based on the NCBI Taxonomy Database]

Record from GenBank (cont. 1)
  REFERENCE   1  (bases 1 to 5028)
              [Oldest reference first]
    AUTHORS   Torpey,L.E., Gibbs,P.E., Nelson,J. and Lawrence,C.W.
    TITLE     Cloning and sequence of REV7, a gene whose function is required for
              DNA damage-induced mutagenesis in Saccharomyces cerevisiae
    JOURNAL   Yeast 10 (11), 1503-1509 (1994)
    MEDLINE   95176709
              [Medline UID]
  REFERENCE   2  (bases 1 to 5028)
    AUTHORS   Roemer,T., Madden,K., Chang,J. and Snyder,M.
    TITLE     Selection of axial growth sites in yeast requires Axl2p, a novel
              plasma membrane glycoprotein
    JOURNAL   Genes Dev. 10 (7), 777-793 (1996)
    MEDLINE   96194260
  REFERENCE   3  (bases 1 to 5028)
    AUTHORS   Roemer,T.
              [Submitter of sequence (always the last reference)]
    TITLE     Direct Submission
    JOURNAL   Submitted (22-FEB-1996) Terry Roemer, Biology, Yale University, New
              Haven, CT, USA

Record from GenBank (cont. 2)
  There are three parts to a feature: a key (a keyword that indicates the functional group), a location (instructions for finding the feature), and qualifiers (auxiliary information about the feature).

  FEATURES             Location/Qualifiers
       source          1..5028
                       [source, CDS: keys; 1..5028: location]
                       /organism="Saccharomyces cerevisiae"
                       /db_xref="taxon:4932"
                       [Database cross-reference]
                       /chromosome="IX"
                       /map="9"
                       [Qualifiers]
       CDS             <1..206
                       [Partial sequence on the 5' end. The 3' end is complete.]
                       /codon_start=3
                       [Start of open reading frame]
                       /product="TCP1-beta"
                       [Descriptive free text must be in quotation marks]
                       /protein_id="AAA98665.1"
                       [Protein sequence ID number]
                       /db_xref="GI:1293614"
                       [Value]
                       /translation="SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEA
                       AEVLLRVDNIIRARPRTANRQHM"
                       [Note: only a partial sequence]

Record from GenBank (cont. 3)
       gene            687..3158
                       [New location]
                       /gene="AXL2"
       CDS             687..3158
                       /gene="AXL2"
                       /note="plasma membrane glycoprotein"
                       /codon_start=1
                       /function="required for axial budding pattern of S.
                       cerevisiae"
                       /product="Axl2p"
                       /protein_id="AAA98666.1"
                       /db_xref="GI:1293615"
                       /translation="MTQLQISLLLTATISLLHLVVATPYEAYPIGKQYPPVARVN. . ."
                       [Cut off]
       gene            complement(3300..4037)
                       [New location]
                       /gene="REV7"
       CDS             complement(3300..4037)
                       /gene="REV7"
                       /codon_start=1
                       /product="Rev7p"
                       /protein_id="AAA98667.1"
                       /db_xref="GI:1293616"
                       /translation="MNRWVEKWLRVYLKCYINLILFYRNVYPPQSFDYTTYQSFNLPQ. . ."
                       [Cut off]

Record from GenBank (cont. 4)
  BASE COUNT     1510 a   1074 c    835 g   1609 t
  ORIGIN
          1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg
         61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct
  . . .
  //

Primary databases contain experimental biological information
  GenBank/EMBL/DDBJ
  Alu: Alu repeats in human DNA
  dbEST: expressed sequence tags; single-pass cDNA sequences (high error frequency). It is non-redundant.
  HTGS: high-throughput genomic sequence database (errors!)
  PDB: three-dimensional structure coordinates of biological molecules
  PROSITE: database of protein domain/function relationships

Types of secondary databases that contain biological information
  dbSTS: non-redundant database of sequence-tagged sites (useful for physical mapping)
  Genome databases (there are over 20 genome databases that can be searched)
  EPD: eukaryotic promoter database
  NR: non-redundant GenBank+EMBL+DDBJ+PDB. Entries with 100% sequence identity are merged into one.
  Vector: a subset of GenBank containing vector DNA
  ProDom
  PRINTS
  BLOCKS

Workshop 2A
  Look up a GenBank record. Use the annotations to determine the first open reading frame.

Dot Plots
  [Figure: dot plot of AATGCCTAG (horizontal axis) against TGCCTAG (vertical axis); an asterisk marks each position where the two residues match. Window = 1.]
  Note that 25% of the table will be filled due to random chance: there is a 1 in 4 chance of a match at each position.

Dot Plots with window = 2
  [Figure: the same dot plot with the sequences compared two residues at a time. Window = 2.]
  The larger the window, the more noise can be filtered out.
  What is the percent chance that you will get a match randomly? 1/16 * 100 = 6.25%

Identity Matrix
        A  C  I  L
     A  1
     C  0  1
     I  0  0  1
     L  0  0  0  1
  The simplest type of scoring matrix.

Similarity Searching
  It is easy to score whether an amino acid is identical to another (the score is 1 if identical and 0 if not). However, it is not easy to give a score for amino acids that are somewhat similar.
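
The identity scoring described here (1 for a match, 0 otherwise) is also what fills a dot plot. The following is a minimal Python sketch, not taken from the slides, of a windowed dot plot for the two sequences used on the dot plot slides. The function names `dot_plot`, `render` and `random_match_probability` are hypothetical, and the chance-match figure assumes four equally frequent nucleotides.

```python
def dot_plot(seq_x, seq_y, window=1):
    """Return rows of booleans; a cell is True when `window` consecutive
    residues of seq_y (rows) and seq_x (columns) are identical."""
    rows = []
    for i in range(len(seq_y) - window + 1):
        row = []
        for j in range(len(seq_x) - window + 1):
            row.append(seq_y[i:i + window] == seq_x[j:j + window])
        rows.append(row)
    return rows


def render(grid):
    """Draw the grid with '*' for a match and '.' for a mismatch."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in grid)


def random_match_probability(window, alphabet_size=4):
    """Chance that `window` residues match at random (assumes equal base frequencies)."""
    return (1 / alphabet_size) ** window


# Sequences from the dot plot slides: horizontal and vertical axes.
horizontal, vertical = "AATGCCTAG", "TGCCTAG"
for w in (1, 2):
    print(f"window = {w}, chance of a random match = {random_match_probability(w):.2%}")
    print(render(dot_plot(horizontal, vertical, w)))
```

With the larger window the isolated background dots thin out while the diagonal produced by the shared stretch stays, which is the noise-filtering effect the slides describe.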
  [Figure: chemical structures of leucine, H2N-CH(CH2-CH(CH3)2)-COOH, and isoleucine, H2N-CH(CH(CH3)-CH2-CH3)-COOH.]
  Should they get a 0 (non-identical), a 1 (identical), or something in between?

Purpose of finding differences and similarities of amino acids
  Infer structural information
  Infer functional information
  Infer evolutionary relationships

Evolutionary Basis of Sequence Alignment
  1. Similarity: a quantity that relates to how alike two sequences are.
  2. Identity: a quantity that describes how alike two sequences are in the strictest terms.
  3. Homology: a conclusion drawn from data suggesting that two genes share a common evolutionary history.

Evolutionary Basis of Sequence Alignment (Cont. 1)
  1. Example: Shown on the next page is a pairwise alignment of two proteins. One is mouse trypsin and the other is crayfish trypsin. They are homologous proteins. The sequences share 41% identity.
  2. Underlined residues are identical. Asterisks and the diamond mark residues that participate in catalysis. Five gaps are placed to optimize the alignment.

Evolutionary Basis of Sequence Alignment (Cont. 2)
  Why are there regions of identity?
  1) Conserved function: residues participate in the reaction.
  2) Structural: residues participate in maintaining the structure of the protein (for example, conserved cysteine residues that form a disulfide linkage).
  3) Historical: residues that are conserved solely due to a common ancestor gene.

Modular nature of proteins
  The previous alignment was global. However, many proteins do not display global patterns of similarity. Instead, they possess local regions of similarity.
  Proteins can be thought of as assemblies of modular domains. Think Mr. Potato Head.

Scoring Matrices
  Scoring matrices tell how similar amino acids are.
  There are two main sets of scoring matrices: PAM and BLOSUM.
  PAM is based on evolutionary distances.
  BLOSUM is based on structure/function similarities.

The bottom line on PAM
  ratio = (frequencies of alignment) / (frequencies of occurrence)
  That is, the probability that two amino acids, i and j, are aligned by evolutionary descent divided by the probability that they are aligned by chance.

BLOSUM Matrices
  BLOSUM is built from distantly related sequences, whereas PAM is built from closely related sequences.
  BLOSUM is built from conserved blocks of aligned protein segments found in the BLOCKS database (remember that the BLOCKS database is a secondary database that depends on the PROSITE families).

Global Alignment vs. Local Alignment
  Global alignment is used when the overall gene sequence is similar to another sequence; it is often used in multiple sequence alignment.
    Clustal W algorithm
  Local alignment is used when only a small portion of one gene is similar to a small portion of another gene.
    BLAST
    FASTA
    Smith-Waterman algorithm

Two proteins that are similar in certain regions
  Tissue plasminogen activator (PLAT)
  Coagulation factor 12 (F12)

BLAST
  Basic Local Alignment Search Tool
  Speed is achieved by:
    Pre-indexing the database before the search
    Parallel processing
    Using a hash table that contains neighborhood words rather than just identical words

Neighborhood words
  The program declares a hit if the word taken from the query sequence has a score >= T when a substitution matrix is used.
  This allows the word size W (similar to the ktup value in FASTA) to be kept high (for speed) without sacrificing sensitivity.
  If T is increased by the user, the number of background hits is reduced and the program runs faster.

The expectation (E) value
  The Expect value (E) is a parameter that describes the number of hits one can "expect" to see just by chance when searching a database of a particular size. It decreases exponentially with the similarity score (S) that is assigned to a match between two sequences. The higher the score, the lower the E value.
  Essentially, the E value describes the random background noise that exists for matches between sequences.
  The Expect value is used as a convenient way to create a significance threshold for reporting results. When the Expect threshold is increased from the default value of 10, a larger list with more low-scoring hits can be reported.
  An E value of 1 assigned to a hit can be interpreted as meaning that, in a database of the current size, you might expect to see 1 match with a similar score simply by chance.

What influences the E value?
  Length of sequence: the longer the query, the lower the probability that it will find a sequence in the database by chance (E value decreases).
  Size of database: the larger the database, the higher the probability that the query will find a match by chance (E value increases).
  Word size (W): the larger the word size, the lower the probability that the query will find a sequence in the database by chance (E value decreases).
  Scoring matrix: the less stringent the scoring matrix, the higher the probability that the query will find a sequence in the database by chance (E value increases).

Workshop for module 2
  Perform a BLAST search of different databases using a peptide sequence.
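
The way the E value falls off with score and grows with database size can be illustrated with the Karlin-Altschul relation commonly used to summarize BLAST statistics, E = K * m * n * exp(-lambda * S), where m is the query length and n the database length. This formula is not stated in the slides, and the function name `expect_value` is hypothetical. The lambda and K values below are illustrative placeholders only; BLAST derives its own from the scoring matrix and gap penalties.

```python
import math


def expect_value(score, query_len, db_len, lam=0.267, k=0.041):
    """Estimate E = K * m * n * exp(-lambda * S).

    lam and k are illustrative assumptions, not values to rely on:
    the real ones depend on the scoring matrix and gap penalties in use.
    """
    return k * query_len * db_len * math.exp(-lam * score)


# Higher score -> exponentially lower E value (query and database fixed).
for s in (30, 50, 70):
    print(f"S = {s}: E = {expect_value(s, query_len=250, db_len=1e8):.3g}")

# Larger database -> proportionally higher E value for the same score.
for n in (1e8, 2e8):
    print(f"database length = {n:.0e}: E = {expect_value(50, query_len=250, db_len=n):.3g}")
```

Under this sketch, doubling the database length doubles E for a given score, while each added unit of score shrinks E by a constant factor.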