Download - Cal State LA - Instructional Web Server
Sequence Databases – 21 June 2007

Learning objectives
• Be able to describe how information is stored in GenBank.
• Be able to read a GenBank flat file.
• Be able to search GenBank for information.
• Be able to explain how the content of the header, features and sequence sections differs.
• Be able to say what distinguishes a primary database from a secondary database.
• Be able to access and navigate the ENTREZ platform for biological data analysis.

BIOSEQs – entry common to all sequence databases
BIOSEQ = Biological sequence
• Central element in the NCBI database model.
• Found in both the nucleotide and protein databases.
• Comprises the sequence of a single continuous molecule of nucleic acid or protein.
An entry must have:
• At least one sequence identifier (Seq-id)
• Information on the physical type of molecule (DNA, RNA, or protein)
• Descriptors, which describe the entire Bioseq
• Annotations, which provide information regarding specific locations within the Bioseq

What is GenBank?
• The NIH genetic sequence database, an annotated collection of all publicly available DNA sequences.
• Each record represents a single contiguous stretch of DNA or RNA.
• DNA stretches may have more than one coding region (gene).
• RNA sequences are presented with T, not U.
• Records are generated from direct submissions to the DNA sequence databases by the investigators (authors).
• GenBank is part of the International Nucleotide Sequence Database Collaboration.

General Comments on GBFF
A GenBank flat file (GBFF) has three sections:
1) Header: information about the whole record.
2) Features: description of annotations, each represented by a key.
3) Nucleotide sequence: ends with // on the last line of the record.
A nucleic acid (DNA or RNA (cDNA)) sequence translated to an amino acid sequence is a "feature".

GenBank Flat File (MyoD1 as an example)

Feature Keys
Purpose:
1) Indicates biological nature of sequence
2) Supplies information about changes to sequences

Feature Key     Description
conflict        Separate determinations of the same seq. differ
rep_origin      Origin of replication
protein_bind    Protein binding site on DNA
CDS             Protein coding sequence

Feature Keys-Terminology

Feature Key     Location/Qualifiers
CDS             23..400
                /product="alcohol dehydro."
                /gene="adhI"

The feature CDS is a coding sequence beginning at base 23 and ending at base 400, has a product called "alcohol dehydrogenase" and corresponds to the gene called "adhI".

Feature Keys-Terminology (Cont.)

Feat. Key       Location/Qualifiers
CDS             join(544..589,688..1032)
                /product="T-cell recep. B-ch."
                /partial

The feature CDS is a partial coding sequence formed by joining the indicated elements to form one contiguous sequence encoding a product called T-cell receptor beta-chain.

(For MyoD1 – Accession number X61655)

Record from GenBank

LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p
            (AXL2) and Rev7p (REV7) genes, complete cds.
ACCESSION   U49845
VERSION     U49845.1  GI:1293613
KEYWORDS    .
SOURCE      baker's yeast.
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Hemiascomycetes; Saccharomycetales;
            Saccharomycetaceae; Saccharomyces.

Notes on the fields:
• LOCUS: locus name (SCU49845), GenBank division (PLN = plant, fungal and algal) and modification date (21-JUN-1999).
• DEFINITION: "cds" refers to the coding region.
• ACCESSION: unique identifier (never changes).
• VERSION: nucleotide sequence identifier in accession.version form (changes when there is a change in sequence).
• GI: GeneInfo identifier (changes whenever there is a change).
• KEYWORDS: word or phrase describing the sequence (not based on controlled vocabulary). Not used in newer records.
• SOURCE: common name for the organism.
• ORGANISM: formal scientific name for the source organism and its lineage, based on the NCBI Taxonomy Database.

Record from GenBank (cont.1)

REFERENCE   1  (bases 1 to 5028)
  AUTHORS   Torpey,L.E., Gibbs,P.E., Nelson,J. and Lawrence,C.W.
  TITLE     Cloning and sequence of REV7, a gene whose function is required for
            DNA damage-induced mutagenesis in Saccharomyces cerevisiae
  JOURNAL   Yeast 10 (11), 1503-1509 (1994)
  MEDLINE   95176709
REFERENCE   2  (bases 1 to 5028)
  AUTHORS   Roemer,T., Madden,K., Chang,J. and Snyder,M.
  TITLE     Selection of axial growth sites in yeast requires Axl2p, a novel
            plasma membrane glycoprotein
  JOURNAL   Genes Dev. 10 (7), 777-793 (1996)
  MEDLINE   96194260
REFERENCE   3  (bases 1 to 5028)
  AUTHORS   Roemer,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (22-FEB-1996) Terry Roemer, Biology, Yale University, New
            Haven, CT, USA

Notes on the fields:
• MEDLINE: Medline UID.
• The last reference always names the submitter of the sequence.

Record from GenBank (cont.2)
There are three parts to a feature: a key (a keyword that indicates the functional group), a location (instructions for finding the feature), and qualifiers (auxiliary information about the feature).

FEATURES             Location/Qualifiers
     source          1..5028
                     /organism="Saccharomyces cerevisiae"
                     /db_xref="taxon:4932"
                     /chromosome="IX"
                     /map="9"
     CDS             <1..206
                     /codon_start=3
                     /product="TCP1-beta"
                     /protein_id="AAA98665.1"
                     /db_xref="GI:1293614"
                     /translation="SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEA
                     AEVLLRVDNIIRARPRTANRQHM"

Notes on the fields:
• Keys: source, CDS.
• Location: for example 1..5028.
• Qualifiers: /organism, /chromosome, /map and so on. Values are the part of each qualifier after the = sign.
• /db_xref: database cross-references.
• <1..206: the 5' end of the coding sequence begins upstream of the first nucleotide of the sequence. The 3' end is complete.
• /codon_start=3: start of open reading frame.
• /product: descriptive free text must be in quotations.
• /protein_id: protein sequence ID #.
• /translation: note that this is only a partial sequence.

Record from GenBank (cont.3)

     gene            687..3158
                     /gene="AXL2"
     CDS             687..3158
                     /gene="AXL2"
                     /note="plasma membrane glycoprotein"
                     /codon_start=1
                     /function="required for axial budding pattern of S.
                     cerevisiae"
                     /product="Axl2p"
                     /protein_id="AAA98666.1"
                     /db_xref="GI:1293615"
                     /translation="MTQLQISLLLTATISLLHLVVATPYEAYPIGKQYPPVARVN. . ."
     gene            complement(3300..4037)
                     /gene="REV7"
     CDS             complement(3300..4037)
                     /gene="REV7"
                     /codon_start=1
                     /product="Rev7p"
                     /protein_id="AAA98667.1"
                     /db_xref="GI:1293616"
                     /translation="MNRWVEKWLRVYLKCYINLILFYRNVYPPQSFDYTTYQSFNLPQ . . ."

Notes on the fields:
• 687..3158 and complement(3300..4037) are further locations.
• Both translations are cut off here (. . .).

Record from GenBank (cont.4)

BASE COUNT     1510 a   1074 c    835 g   1609 t
ORIGIN
        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg
       61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct
. . .
//

Primary databases vs. Secondary databases
Primary database
• comprises information submitted directly by the experimenter.
• is called an archival database.
Secondary database
• comprises information derived from a primary database.
• is a curated database.

Types of primary databases carrying biological information
• GenBank/EMBL/DDBJ
• PDB: three-dimensional structure coordinates of biological molecules
• PROSITE: database of protein domain/function relationships. http://www.expasy.org/prosite/

Types of secondary databases carrying biological information
• dbSTS: non-redundant db of sequence-tagged sites (useful for physical mapping)
• Genome databases (there are over 20 genome databases that can be searched)
• EPD: eukaryotic promoter database http://www.epd.isb-sib.ch/
• NR: non-redundant GenBank+EMBL+DDBJ+PDB. Entries with 100% sequence identity are merged as one.
• ProDom http://protein.toulouse.inra.fr/prodom/current/html/home.php
• PRINTS http://bioinf.man.ac.uk/dbbrowser/PRINTS/
• BLOCKS http://bioinformatics.weizmann.ac.il/blocks/

Secondary Databases
(Diagram labels: DNA, RNA, cDNA, protein)

DNA databases derived from GenBank containing data for a single gene
• Non-redundant (nr)
• dbGSS (genome survey sequences)
• dbHTGS (high throughput)
• dbSTS (sequence tagged site)
• LocusLink

Protein databases derived from GenBank containing data for a single gene
• Non-redundant (nr)
• Swissprot
• PIR (Int'l. protein sequence)
• LocusLink

RNA (cDNA) databases derived from GenBank containing data for a single gene
• dbEST (expressed sequence tag)
• UniGene
• LocusLink

References for understanding the NCBI sequence database model
Here is the website for NCBI developer tools.
http://www.ncbi.nlm.nih.gov/IEB/ToolBox/SDKDOCS/INDEX.HTML

(Figure labels: DNA, RNA, PROTEIN; RNA processing; RNA, but NOT mRNA; Mature mRNA)
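
Worked example: reading feature locations

The location strings in the feature examples above (23..400, join(544..589,688..1032), complement(3300..4037) and <1..206) follow a small grammar, and a short program can make the reading rules concrete. The following is a minimal sketch in Python, not an NCBI tool: the names Span, Location, parse_location and total_length are hypothetical, and it assumes only the four location forms that appear in these slides. Real GenBank records use further forms that it does not handle.

```python
import re
from typing import List, NamedTuple


class Span(NamedTuple):
    start: int        # first base, 1-based, inclusive
    end: int          # last base, 1-based, inclusive
    start_open: bool  # True if written as "<start" (extends beyond the record)
    end_open: bool    # True if written as ">end"


class Location(NamedTuple):
    spans: List[Span]  # one span, or several when join(...) is used
    complement: bool   # True if the feature lies on the complementary strand


# One "a..b" element, optionally marked partial with "<" or ">".
_SPAN_RE = re.compile(r"^(<?)(\d+)\.\.(>?)(\d+)$")


def parse_location(text: str) -> Location:
    """Parse the location forms shown in the slides (hypothetical helper)."""
    s = "".join(text.split())  # tolerate stray spaces such as "join (...)"
    complement = False
    if s.startswith("complement(") and s.endswith(")"):
        complement = True
        s = s[len("complement("):-1]
    if s.startswith("join(") and s.endswith(")"):
        s = s[len("join("):-1]
    spans = []
    for part in s.split(","):
        m = _SPAN_RE.match(part)
        if m is None:
            raise ValueError(f"unsupported location element: {part!r}")
        spans.append(
            Span(
                start=int(m.group(2)),
                end=int(m.group(4)),
                start_open=(m.group(1) == "<"),
                end_open=(m.group(3) == ">"),
            )
        )
    return Location(spans=spans, complement=complement)


def total_length(loc: Location) -> int:
    """Number of bases covered; both ends of every span are inclusive."""
    return sum(span.end - span.start + 1 for span in loc.spans)


if __name__ == "__main__":
    for text in ("23..400", "join(544..589,688..1032)",
                 "complement(3300..4037)", "<1..206"):
        loc = parse_location(text)
        print(text, "->", total_length(loc), "bases,",
              "complement" if loc.complement else "forward")
```

Because both ends are inclusive, the CDS at 23..400 covers 378 bases, and the joined CDS covers 46 + 345 = 391 bases. The sketch records the < marker as a flag rather than guessing how far the feature extends beyond the record.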