
In the name of God, the Most Gracious, the Most Merciful

Using NCBI Resources for Gene Discovery
Lecturer: Dr. Farkhondeh Poursina, PhD
[email protected]
1392

National Center for Biotechnology Information (NCBI)
National Library of Medicine
National Institutes of Health
http://www.ncbi.nlm.nih.gov/

PRIMARY BIOLOGICAL DATABASES
• Nucleic acid & protein
  − EMBL (European Molecular Biology Laboratory)
  − DDBJ (DNA Data Bank of Japan)
  − GenBank (NCBI, the National Center for Biotechnology Information)

EMBL/GENBANK/DDBJ
• These three databases contain mainly the same information (with a few differences in format).
• They serve as archives containing all sequences (single genes, ESTs, complete genomes, etc.) derived from:
  − Genome projects and sequencing centers
  − Individual scientists
• Non-confidential data are exchanged daily.
• Currently: 2.5 × 10^7 sequences, over 3.2 × 10^10 bp.
• Sequences from > 50,000 different species.

THE ‘PERFECT’ DATABASE
• Comprehensive, but easy to search.
• Annotated, but not “too annotated”.
• A simple, easy to understand structure.
• Cross-referenced.
• Minimum redundancy.
• Easy retrieval of data.

THE NATIONAL CENTER FOR BIOTECHNOLOGY INFORMATION
Bethesda, MD
Created in 1988 as a part of the National Library of Medicine at NIH (National Institutes of Health)
  − Establish public databases
  − Research in computational biology
  − Develop software tools for sequence analysis
  − Disseminate biomedical information

WEB ACCESS: WWW.NCBI.NLM.NIH.GOV
• New pages!
• New homepage
• Common footer

TYPES OF MOLECULAR DATABASES (SEQUENCE) AT NCBI
• Primary databases
  − Original submissions by experimentalists
  − Content controlled by the submitter
  − Examples: GenBank, Trace, SRA, SNP, GEO
• Derivative databases
  − Derived from primary data
  − Curated/expert review (content controlled by a third party, NCBI)
  − Compilation and correction of data
  − Examples: NCBI Protein, RefSeq, RefSNP, UniGene, HomoloGene, Structure, Conserved Domain

PRIMARY VS. DERIVATIVE SEQUENCE DATABASES
[Figure: Labs and sequencing centers submit sequences to GenBank (updated ONLY by submitters). Curators, genome assembly, and algorithms produce RefSeq and UniGene (updated continually by NCBI).]

THE PROBLEM
• Rapidly growing databases with complex and changing relationships
• Rapidly changing interfaces to match the above
• Result: many people don’t know
  − Where to begin
  − Where to click on a Web page
  − Why it might be useful to click there

DERIVATIVE SEQUENCE DATABASES

ENTREZ: FINDING RELEVANT INFORMATION IN NCBI DATABASES

YOU CAN SEARCH DNA SEQUENCE DATABASE
• Retrieve known sequences by
  − ENTREZ: http://www.ncbi.nlm.nih.gov/Entrez/ (click Nucleotide)
  − OR accession number
  − Keyword search
• Entrez is internally cross-linked
  − DNA and protein sequences are linked to other similar sequences
  − Medline citations are linked to other citations that contain similar keywords
  − 3-D structures are linked to similar structures

DATABASES CONTAIN MORE THAN JUST DNA & PROTEIN SEQUENCES
• Retrieve all sequences for an organism or taxon
  − Starting with an organism or taxon name...
• How to: Download the complete genome for an organism
  − Starting at the Genomes
• How to: Find transcript sequences for a gene
  − Starting with ... a gene name, product name, or symbol
• How to: Obtain genomic sequence for/near a gene, marker, transcript or protein
  − Starting with... a gene name or symbol

ENTREZ TIP: START SEARCHES IN GENE
[Figure: Gene linked to Entrez Protein, BLink, HomoloGene, Gene Neighbors, UniGene, and other Entrez DBs.]
• How to: Display genomic annotation graphically
  − Starting with... a nucleotide record (e.g. NC_000001)

BY APPLYING LIMITS, THERE ARE NOW JUST TWO ENTRIES
• Precise results

A TRADITIONAL GENBANK RECORD
[Figure: annotated GenBank record. Callouts: locus field, accession number, accession version, molecular weight, definition line, GI (GenInfo), taxonomy, submission field, molecule type, modification date, GenBank division.]

TRADITIONAL GENBANK RECORD
• ACCESSION U07418
  − Accession: stable, reportable, universal
• VERSION U07418.1
  − Version: tracks changes in the sequence
• GI:466461
  − GI number: NCBI internal use
• Coding sequence
• The sequence is the data

WHAT IS AN ACCESSION NUMBER?
An accession number is a label used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence.

Examples (all for retinol-binding protein, RBP4):

DNA
  X02775       GenBank genomic DNA sequence
  NT_030059    Genomic contig
  rs7079946    dbSNP (single nucleotide polymorphism)
RNA
  N91759.1     An expressed sequence tag (1 of 170)
  NM_006744    RefSeq DNA sequence (from a transcript)
Protein
  NP_007635    RefSeq protein
  AAC02945     GenBank protein
  Q28369       SwissProt protein
  1KT7         Protein Data Bank structure record

GENPEPT: GENBANK CDS TRANSLATIONS
[Figure labels: Feature Table, GenPept Record, Genomic DNA Sequence]

GenPept record (excerpt):
  >gi|463989|gb|AAC50285.1| DNA mismatch repair prote...
  MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKSTSIQVIV...
  EDLDIVCERFTTSKLQSFEDLASISTYGFRGEALASISHVAHVTITTKTAD...

Feature table:
  FEATURES             Location/Qualifiers
       source          1..2484
                       /organism="Homo sapiens"
                       /mol_type="mRNA"
                       /db_xref="taxon:9606"
                       /chromosome="3"
                       /map="3p22-p23"
       gene            1..2484
                       /gene="MLH1"
       CDS             22..2292
                       /gene="MLH1"
                       /note="homolog of S. cerevisiae PMS1 (Swiss-Prot Accession Number P14242), S. cerevisiae MLH1 (GenBank Accession Number U07187), E. coli MUTL (Swiss-Prot Accession Number P23367), Salmonella typhimurium MUTL (Swiss-Prot Accession Number P14161) and Streptococcus pneumoniae (Swiss-Prot Accession Number P14160)"
                       /codon_start=1
                       /product="DNA mismatch repair protein homolog"
                       /protein_id="AAC50285.1"
                       /db_xref="GI:463989"
                       /translation="MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKS
                       TSIQVIVKEGGLKLIQIQDNGTGIRKEDLDIVCERFTTSKLQSFEDLASISTYGFRGE
                       ALASISHVAHVTITTKTADGKCAYRASYSDGKLKAPPKPCAGNQGTQITVEDLFYNIA
                       TRRKALKNPSEEYGKILEVVGRYSVHNAGISFSVKKQGETVADVRTLPNASTVDNIRS

REFSEQ
• Reference Sequences
  − Nucleotide sequences and protein translations
  − Curated by NCBI or NCBI-approved programs
• Difference between GenBank and RefSeq
  − GenBank has raw data and duplicated records
  − Metadata in GenBank can be incomplete
  − RefSeq is annotated, curated and non-redundant
  − NCBI takes the best sequences from GenBank and curates them for RefSeq records

SELECTED REFSEQ ACCESSION NUMBERS

mRNAs and Proteins
  NM_123456    Curated mRNA
  NP_123456    Curated protein
  NR_123456    Curated non-coding RNA
  XM_123456    Predicted mRNA
  XP_123456    Predicted protein
  XR_123456    Predicted non-coding RNA
Gene Records
  NG_123456    Reference genomic sequence
Chromosome
  NC_123455    Microbial replicons, organelle
  AC_123455    Alternate assemblies
Assemblies
  NT_123456    Contig
  NW_123456    WGS supercontig

• Over 100,000 nucleotide entries for HIV-1, but only 1 RefSeq

HOW TO SAVE?
• Choose FASTA from the Display drop-down menu.
• Transform the content of this window into plain text by choosing Text from the drop-down menu located on the far right of the menu bar.
• Save the FASTA sequence by using the following protocol:
  a. In the Edit menu of your Web browser, click Select All and then click Copy.
  b. Open a default Word document and, in the Edit menu of Word, click Paste.
  c. Finally, save your document as dUTPaseDNA.txt by choosing the Save as type option text only (*.txt).
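
The manual save procedure above can also be approximated in a script. The sketch below is not part of the original slides: it assumes NCBI's E-utilities `efetch` endpoint, and the helper names `efetch_fasta_url` and `save_fasta` are made up for illustration. It builds a URL that requests a nucleotide record in FASTA format as plain text, using the accession U07418 from the GenBank record slide.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: NCBI E-utilities efetch endpoint (not mentioned in the slides).
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(accession: str, db: str = "nuccore") -> str:
    """Build an efetch URL asking for one record as plain-text FASTA."""
    params = {
        "db": db,            # Entrez database; "nuccore" is the nucleotide database
        "id": accession,     # accession number, e.g. "U07418"
        "rettype": "fasta",  # same idea as choosing FASTA in the Display menu
        "retmode": "text",   # same idea as choosing Text output
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


def save_fasta(accession: str, path: str, db: str = "nuccore") -> None:
    """Download the FASTA text and write it to a plain-text file.

    Needs network access; not called below, so the example runs offline.
    """
    with urlopen(efetch_fasta_url(accession, db)) as response:
        text = response.read().decode("utf-8")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


# Only build and show the URL here; no request is sent.
print(efetch_fasta_url("U07418"))
```

Calling `save_fasta("U07418", "U07418.txt")` would then write the record to a text file, the scripted counterpart of the copy-and-paste steps. The output file name here is only an example.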
FASTA FORMAT DESCRIPTION
• FASTA is a DNA and protein sequence alignment software package first described (as FASTP) by David J. Lipman and William R. Pearson in 1985.
• Its sequence format is popular and commonly used.
• A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. It is recommended that all lines of text be shorter than 80 characters in length.

Shokouh Riazi, Faranak Kazemi
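
The FASTA layout described above (a ">" description line followed by wrapped sequence lines) is simple enough to read and write by hand. The sketch below is a minimal illustration, not a full parser: `parse_fasta` and `format_fasta` are hypothetical helper names. The example record reuses the protein ID, product name, and first two translation lines from the MLH1 feature table. The 70-character default width is an assumption chosen to stay under the recommended 80.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each description line (without '>') to its joined sequence."""
    records: Dict[str, str] = {}
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence data belonging to the current record
    if header is not None:
        records[header] = "".join(chunks)
    return records


def format_fasta(header: str, sequence: str, width: int = 70) -> str:
    """Write one record with sequence lines wrapped to `width` characters."""
    lines = [">" + header]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines) + "\n"


# Example record assembled from the MLH1 feature table (first two
# translation lines only, so the sequence is deliberately incomplete).
EXAMPLE = """>AAC50285.1 DNA mismatch repair protein homolog
MSFVAGVIRRLDETVVNRIAAGEVIQRPANAIKEMIENCLDAKS
TSIQVIVKEGGLKLIQIQDNGTGIRKEDLDIVCERFTTSKLQSFEDLASISTYGFRGE
"""

records = parse_fasta(EXAMPLE.splitlines())
for description, sequence in records.items():
    print(description, len(sequence))
```

Keying the dictionary on the whole description line keeps the sketch short. Real files can repeat descriptions or carry extra metadata, so production code would more likely use an established library.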
