Nucleotide Sequence Databases
Your guide to genes & genomes

Nucleotide Sequence Databases
• First generation
  – GenBank is a representative example
  – started as a sort of museum to preserve knowledge of a sequence from its first discovery
  – great repositories, particularly for long-term study of bioinformatic data
  – flat files; not built for (and not great at) querying
• Second generation
  – Entrez Gene is an example
  – information is gene-centric (not just sequence-centric)
  – all sequence information for a given gene can be found in one place
• Third generation
  – Ensembl is a good example
  – information is organized around whole genomes; not only a specific gene’s structure, but its context:
    • position of this gene relative to others
    • strand orientation
    • how the gene relates to the presence or absence of biochemical functions in the organism

Prokaryotes (& Archaea)
• microscopic organisms
• single cell
• no nucleus
• simple genome:
  – single, circular DNA molecule
  – 600,000 to 8 million base pairs
• about 70% of the genome codes for proteins
• genes generally don’t overlap
• no introns; mRNA is collinear with the gene sequence
• protein sequences are derived by translating the longest ORF (ATG to STOP) spanning the gene transcript sequence
source: http://www.cod.edu/people/faculty/fancher/CellStructure.htm

Thought for today …
source: http://www.scicomics.com/uploads/prokaryote.jpg

Eukaryotes
• way more complicated
  – genes found in the cell nucleus
  – genome size: 10 million to more than 100 billion base pairs (the human genome is about 3 billion)
• much lower gene density than prokaryotes: in human chromosomes, about one gene for every 100,000 base pairs
source: http://www.cod.edu/people/faculty/fancher/CellStructure.htm
• much less efficient than prokaryotes; less than 5% of the human genome codes for protein
• genes are transcribed after a promoter region, but the process may be strongly influenced by sequence elements relatively far away
source: http://www.cit.gu.edu.au/~anthony/dungeon/balcony/
• gene sequences and mRNA/protein sequences are not collinear; only exons are retained in the mature mRNA that encodes the protein
• a single gene may (and often does) exhibit more than one mRNA and protein form

GenBank
• First example: prokaryotic gene
  – point your browser to: http://www.ncbi.nlm.nih.gov/entrez
  – choose Nucleotide from the Search pull-down menu
  – in the For box, type X01714 and click Go
  – click the link labeled X01714
  – can “Send To Text” if you want to save the file

GenBank fields
• LOCUS
  – size of sequence (in base pairs)
  – nature of molecule (e.g. DNA or RNA)
  – topology (linear or circular)
• DEFINITION: brief description of the gene
• ACCESSION: unique identifier for this (and some other) databases
• VERSION: lists synonymous or past ID numbers
• KEYWORDS: list of terms related to the entry; can be used for keyword searching for related data
• SOURCE: common name of the relevant organism
• ORGANISM: complete ID, with taxonomic classification
  – note that ORGANISM is indented under SOURCE; this indicates that ORGANISM is a subordinate term, or subsection, of SOURCE
• REFERENCE: credits the author(s) who initially determined the sequence; includes subsections:
  – AUTHORS
  – TITLE
  – JOURNAL
  – PUBMED
• COMMENT: free-formatted text that doesn’t fit in another category
• FEATURES: table describing gene regions and associated biological properties
  – source: origin of specific regions of the sequence; useful for distinguishing cloning vectors from host sequences
  – promoter: precise coordinates of a promoter element in the sequence; there may be more than one of these
  – misc_feature: in this example, indicates the (putative) location of the transcription start (mRNA synthesis)
  – RBS (ribosome binding site): location of the last upstream element
  – CDS (CoDing Sequence): describes the ORF

GenBank fields: FEATURES: CDS
• gives coordinates from the initial nucleotide (ATG) to the last nucleotide of the stop codon (TAA)
• several lines follow, listing protein products, the reading frame to use, the genetic code to apply and several IDs for the protein sequence
• the /translation section gives a computer translation of the sequence into an amino acid sequence

Last Section: sequence itself
• This is the most important section in terms of analysis using other tools
• Can isolate just this section and save the file, as follows:
  – choose FASTA from the Display pull-down menu (top of page)
  – choose Text in the Send To pull-down menu
  – use File/Save As to save the file
    • use “Text” as the file type
    • give the file a name that you’ll know to associate with this particular sequence

Example 2: eukaryotic mRNA
• Can obtain this example by searching the Nucleotide database for U90223
• Similar to the prokaryote example, because we’re looking at a direct coding sequence for a protein (an mRNA, in other words, not genomic DNA)
• Notes on the example:
  – the KEYWORDS field is empty: this is an example of an incomplete annotation; remember, you’re looking at a primary database!
  – the FEATURES field contains some new terms:
    • sig_peptide: location of the mitochondrial targeting sequence
    • mat_peptide: exact boundaries of the mature peptide

Example 3: Eukaryotic gene
• Can obtain this record by searching Nucleotide for AF018430
• General information:
  – LOCUS: same info as in the previous examples; note that the locus name is different from the accession number this time
  – DEFINITION: specifies the exon; remember, protein-coding regions in eukaryotes are not contiguous as they are in prokaryotes
  – SEGMENT: indicates this is the second of 4; you’d need all 4 to reconstruct the mRNA that codes for the protein

Eukaryotic gene: FEATURES section
• source subsection includes a /map section:
  – indicates chromosome (15)
  – arm (q means long arm)
  – cytogenetic band (q21.1)
• gene subsection: describes how to reconstruct the mRNAs found in this and separate entries:
  – the strings that begin with “AF” refer to GenBank entries (remember, this one was AF018430), and the numbers represent nucleotide positions within those entries
  – if a set of numbers (example: 1..1177) is NOT preceded by an entry indicator, it’s from the current entry
  – the < and > signs indicate that the feature is partial: its true start or end lies beyond the coordinate shown
• mRNA section: can be read in a similar manner to the gene section
• note that there are two mRNA sections (each followed by a CDS section)
  – the first describes the mRNA for the mitochondrial form of the protein
  – the second describes the mRNA for the nuclear form
• exon section: indicates the position of the exon(s) in the sequence

Retrieving GenBank entries without accession numbers
• Search Nucleotide for the specific product you’re interested in; for example:
    human[organism] AND dUTPase[Protein name]
  – this search yields several entries; can click the Links link to the right of one of these (AF018432) and choose Related Sequences from the pull-down that appears
  – this retrieves several more entries, some DNA and some mRNA
  – terms used in the titles of these entries can give us additional search criteria:
    human[organism] AND "dUTP pyrophosphatase"[Title]
  – this yields a somewhat different set of entries
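
The two fielded searches above can also be assembled programmatically. The following is a minimal sketch, not part of the original slides: it assumes you want to send the same queries to NCBI's E-utilities ESearch service rather than type them into the web form. The helper names `build_term` and `build_esearch_url` are hypothetical, `nuccore` is the E-utilities name for the Nucleotide database, and the code only builds the request URL without contacting NCBI.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint (assumption: the slides only use the web form).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(*clauses):
    """Join (value, field) pairs into an Entrez query such as
    'human[organism] AND dUTPase[Protein name]'. Hypothetical helper."""
    parts = []
    for value, field in clauses:
        # Quote multi-word values so Entrez treats them as a single phrase.
        if " " in value:
            value = f'"{value}"'
        parts.append(f"{value}[{field}]")
    return " AND ".join(parts)


def build_esearch_url(term, db="nuccore", retmax=20):
    """Return the ESearch URL for a query term; nothing is fetched here."""
    return ESEARCH_URL + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


# The two searches from the slide.
first = build_term(("human", "organism"), ("dUTPase", "Protein name"))
second = build_term(("human", "organism"), ("dUTP pyrophosphatase", "Title"))

print(first)
print(second)
print(build_esearch_url(second))
```

Opening the printed URL in a browser (or fetching it with any HTTP client) should return an XML list of matching record IDs; comparing the lists for the two terms shows how the choice of field changes the result set.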
