Workshop #7

Information organization
Oct 2, 2012

Learning objectives
• Demonstrate the Dotter program.
• Understand how information is stored in GenBank.
• Learn how to read a GenBank flat file.
• Learn how to search GenBank for information.
• Understand the difference between header, features and sequence.
• Distinguish between a primary database and a secondary database.

Homework #2 due today. Homework #3 due Tues. Oct. 9.

What is GenBank?
• Gene sequence database.
• Annotated records that represent single contiguous stretches of DNA or RNA; a record may have more than one coding region.
• Generated from direct submissions to the DNA sequence databases from the authors.
• Part of the International Nucleotide Sequence Database Collaboration.
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html

History of GenBank
• Began with the Atlas of Protein Sequence and Structure (Dayhoff et al., 1965).
• In 1986 it shared data with EMBL, and in 1987 it shared data with DDBJ.
• Primary database.
• Examples of secondary databases derived from GenBank: UniProt, EST database.
• The GenBank Flat File (GBFF) is a human-readable form of a GenBank record.

[Figure: gene structure and expression. Labels: upstream and downstream (relative to CDS); start of gene; transcription initiation site; promoter; 5' untranslated region (5'UTR); protein coding sequence (CDS); 3' untranslated region (3'UTR); transcription termination site; end of gene; coding strand (5' to 3') and template strand (3' to 5'). Flow: DNA -> transcription -> RNA -> translation -> protein -> protein folding -> folded protein.]

[Figure: transcript splicing. DNA with exons 1 to 4 separated by introns 1 to 3 -> transcription -> primary transcript (1 2 3 4) -> splicing -> mRNA -> translation -> protein.]

[Figure: alternative splicing of the primary transcript (1 2 3 4).]

General Comments on GBFF
Three sections:
 1) Header: information about the whole record.
 2) Features: description of annotations, each represented by a key.
 3) Nucleotide sequence: each record ends with // on its last line.
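The three-section layout described above can be sketched in Python. This is a minimal illustration, not a full parser: the function name split_genbank_record and the SAMPLE_RECORD string are made up for this example, and the sample is a heavily abbreviated excerpt of the U49845 record used in this workshop (most header fields, most features and most of the sequence are left out). It assumes that FEATURES, BASE COUNT and ORIGIN start in column 1 and that // closes the record.

```python
# Abbreviated excerpt of the U49845 record (illustrative, not the full entry).
SAMPLE_RECORD = """\
LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
ACCESSION   U49845
VERSION     U49845.1  GI:1293613
FEATURES             Location/Qualifiers
     source          1..5028
                     /organism="Saccharomyces cerevisiae"
     CDS             <1..206
                     /product="TCP1-beta"
BASE COUNT     1510 a   1074 c    835 g   1609 t
ORIGIN
        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg
       61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct
//
"""


def split_genbank_record(record_text):
    """Split one flat-file record into (header, features, sequence).

    header and features are returned as text; sequence is returned as a
    plain string of bases with the position numbers and spaces removed.
    """
    header, features, sequence = [], [], []
    current = header
    for line in record_text.splitlines():
        if line.startswith("//"):
            break  # '//' marks the end of the record
        if line.startswith("FEATURES"):
            current = features  # the feature table starts here
        elif line.startswith(("BASE COUNT", "ORIGIN")):
            current = sequence  # sequence block; skip the keyword line itself
            continue
        current.append(line)
    # Keep only the letters from the numbered sequence lines.
    bases = "".join(ch for row in sequence for ch in row if ch.isalpha())
    return "\n".join(header), "\n".join(features), bases


header, features, bases = split_genbank_record(SAMPLE_RECORD)
print(header.splitlines()[0])
print(features.splitlines()[0])
print(len(bases), "bases read from the excerpt")
```

For real work, an established parser such as Biopython's SeqIO module is a safer choice than hand-written line matching.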
• DNA-centered.
• The translated sequence is a feature.

Feature Keys
Purpose:
 1) Indicates the biological nature of the sequence.
 2) Supplies information about changes to sequences.

Feature Key     Description
conflict        Separate determinations of the same seq. differ
rep_origin      Origin of replication
protein_bind    Protein binding site on DNA
CDS             (Protein) coding sequence

Feature Keys-Terminology

Feature Key     Location/Qualifiers
CDS             23..400
                /product="alcohol dehydro."
                /gene="adhI"

The feature CDS is a coding sequence beginning at base 23 and ending at base 400 that has a product called "alcohol dehydrogenase" and corresponds to the gene called "adhI".

Feature Keys-Terminology (Cont.)

Feat. Key       Location/Qualifiers
CDS             join(544..589,688..1032)
                /product="T-cell recep. B-ch."
                /partial

The feature CDS is a partial coding sequence formed by joining the indicated elements to form one contiguous sequence encoding a product called T-cell receptor beta-chain.

Record from GenBank

LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p
            (AXL2) and Rev7p (REV7) genes, complete cds.
ACCESSION   U49845
VERSION     U49845.1  GI:1293613
KEYWORDS    .
SOURCE      baker's yeast.
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Hemiascomycetes; Saccharomycetales;
            Saccharomycetaceae; Saccharomyces.

Annotations:
• LOCUS: locus name (SCU49845), GenBank division (PLN: plant, fungal and algal) and modification date (21-JUN-1999).
• DEFINITION: "cds" stands for coding sequence.
• ACCESSION: accession number (never changes).
• VERSION: nucleotide sequence identifier in accession.version form (changes when there is a change in the sequence).
• GI: GeneInfo identifier (changes whenever the sequence changes).
• KEYWORDS: word or phrase describing the sequence (not based on a controlled vocabulary). Not used in newer records.
• SOURCE: common name for the organism.
• ORGANISM: formal scientific name for the source organism and its lineage, based on the NCBI Taxonomy Database.

Record from GenBank (cont. 1)

REFERENCE   1  (bases 1 to 5028)
  AUTHORS   Torpey,L.E., Gibbs,P.E., Nelson,J. and Lawrence,C.W.
  TITLE     Cloning and sequence of REV7, a gene whose function is required for
            DNA damage-induced mutagenesis in Saccharomyces cerevisiae
  JOURNAL   Yeast 10 (11), 1503-1509 (1994)
  MEDLINE   95176709
REFERENCE   2  (bases 1 to 5028)
  AUTHORS   Roemer,T., Madden,K., Chang,J. and Snyder,M.
  TITLE     Selection of axial growth sites in yeast requires Axl2p, a novel
            plasma membrane glycoprotein
  JOURNAL   Genes Dev. 10 (7), 777-793 (1996)
  MEDLINE   96194260
REFERENCE   3  (bases 1 to 5028)
  AUTHORS   Roemer,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (22-FEB-1996) Terry Roemer, Biology, Yale University,
            New Haven, CT, USA

Annotations:
• References are listed oldest first.
• MEDLINE: Medline UID.
• Reference 3: submitter of the sequence (always the last reference).

Record from GenBank (cont. 2)

Each feature has three parts: a key (a keyword that indicates the functional group), a location (instructions for finding the feature), and qualifiers (auxiliary information about the feature).

FEATURES             Location/Qualifiers
     source          1..5028
                     /organism="Saccharomyces cerevisiae"
                     /db_xref="taxon:4932"
                     /chromosome="IX"
                     /map="9"
     CDS             <1..206
                     /codon_start=3
                     /product="TCP1-beta"
                     /protein_id="AAA98665.1"
                     /db_xref="GI:1293614"
                     /translation="SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEA
                     AEVLLRVDNIIRARPRTANRQHM"

Annotations:
• Keys: source, CDS.
• Location: 1..5028.
• Qualifiers: /organism, /db_xref, /chromosome, /map and so on. The text after the = sign is the qualifier's value.
• /db_xref: database cross-references.
• <1..206: partial sequence on the 5' end. The 3' end is complete.
• /codon_start=3: start of the open reading frame (the first complete codon begins at base 3 of the feature).
• /product: descriptive free text must be in quotation marks.
• /protein_id: protein sequence ID number.
• /translation: note that this is only a partial sequence.

Record from GenBank (cont. 3)

Note: 687..3158 is another location.

     gene            687..3158
                     /gene="AXL2"
     CDS             687..3158
                     /gene="AXL2"
                     /note="plasma membrane glycoprotein"
                     /codon_start=1
                     /function="required for axial budding pattern of S. cerevisiae"
                     /product="Axl2p"
                     /protein_id="AAA98666.1"
                     /db_xref="GI:1293615"
                     /translation="MTQLQISLLLTATISLLHLVVATPYEAYPIGKQYPPVARVN. . .
                     "    (translation cut off)
     gene            complement(3300..4037)
                     /gene="REV7"
     CDS             complement(3300..4037)
                     /gene="REV7"
                     /codon_start=1
                     /product="Rev7p"
                     /protein_id="AAA98667.1"
                     /db_xref="GI:1293616"
                     /translation="MNRWVEKWLRVYLKCYINLILFYRNVYPPQSFDYTTYQSFNLPQ . . .
                     "    (translation cut off)

Note: complement(3300..4037) is another location. Here the coding strand is the complementary strand.

Record from GenBank (cont. 4)

BASE COUNT     1510 a   1074 c    835 g   1609 t
ORIGIN
        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg
       61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct
. . .
//

Databases derived from GenBank
[Figure: DNA, RNA (cDNA) and protein, each with its derived databases.]

DNA databases derived from GenBank containing data for a single gene:
• Non-redundant (nr)
• dbGSS
• dbSTS

RNA (cDNA) databases derived from GenBank containing data for a single gene:
• dbEST
• UniGene
• RefSeq

Protein databases derived from GenBank containing data for a single gene:
• Non-redundant (nr)
• UniProtKB

Types of primary databases carrying biological information
• GenBank/EMBL/DDBJ
• dbEST: expressed sequence tags; single-pass cDNA sequences (high error freq.). It is non-redundant.
• PDB: three-dimensional structure coordinates of biological molecules.
• PROSITE: database of protein domain/function relationships.

Summary
• GenBank: longest-running molecular biology database.
• Three sections in every GenBank record.
• Primary databases and secondary databases.
• RefSeq: contains a unique record for each RNA variant.
• UniProtKB: protein-centered.

Workshop
Do problem 1 in Chapter 2.

Homework
Do problems 2 and 3 in Chapter 2.
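
The location strings used in this workshop (23..400, <1..206, join(544..589,688..1032) and complement(3300..4037)) can be read with a few lines of Python. The sketch below is illustrative only: parse_location, feature_length and reverse_complement are names invented for this example, and the parser handles just the forms shown on the slides. The real feature-table syntax allows more (single bases, order(...), references to other accessions, complement nested inside join), which this sketch rejects. The partial flag here reflects only the < and > markers in the location; the /partial qualifier is a separate piece of information.

```python
import re


def parse_location(location):
    """Return (strand, spans, partial) for a simple feature location.

    strand  is '+' or '-' ('-' when the location is wrapped in complement()).
    spans   is a list of (start, end) pairs, 1-based and inclusive.
    partial is True when a '<' or '>' marker is present.
    """
    text = location.strip()
    strand = "+"
    if text.startswith("complement(") and text.endswith(")"):
        strand = "-"
        text = text[len("complement("):-1]
    if text.startswith("join(") and text.endswith(")"):
        text = text[len("join("):-1]
    partial = "<" in text or ">" in text
    spans = []
    for part in text.split(","):
        match = re.fullmatch(r"<?(\d+)\.\.>?(\d+)", part.strip())
        if match is None:
            raise ValueError(f"unsupported location: {location!r}")
        spans.append((int(match.group(1)), int(match.group(2))))
    return strand, spans, partial


def feature_length(spans):
    """Total number of bases covered by the spans."""
    return sum(end - start + 1 for start, end in spans)


def reverse_complement(seq):
    """Reverse-complement a DNA string, as needed for complement() features."""
    return seq.translate(str.maketrans("acgtACGT", "tgcaTGCA"))[::-1]


for loc in ("23..400", "<1..206", "join(544..589,688..1032)",
            "687..3158", "complement(3300..4037)"):
    strand, spans, partial = parse_location(loc)
    print(loc, strand, spans, feature_length(spans), "partial" if partial else "")
```

A quick sanity check on the record itself: the BASE COUNT line adds up to 1510 + 1074 + 835 + 1609 = 5028, which matches the 5028 bp given on the LOCUS line.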
