Retrieving Information: Using Entrez
Lecture 2.2

Retrieving information: how it works
• Servers have the records you want.
• You need to understand the data they have and how it is organized.
• There are often many ways to get to an answer.
• The route there is not always obvious, so think about alternatives and traps.
• Use some query language; each system has its own.
• Retrieve data in a specified format.
• Save it in a way that will be useful to you.

What you may be looking for:
• You did a BLAST search and need more information about some of the proteins it found similarities to.
• You heard about a recently discovered disease gene and want to know more about it.
• You want to build a dataset for local BLAST searches.
• A colleague wants you to do an alignment of all sequences from a given protein family.

What you are looking for:
• PubMed paper from author X
• Sequence from gene X in organism Y
• All information about organelle W in model organism Y
• All information about disease X in human
• Orthologs of those disease genes in other model organisms

Central Dogma: NCBI version
DNA → RNA → protein
Write a paper about it

Entrez: Pathway to Discovery
[Figure: Entrez in 1993. Databases: MEDLINE abstracts, Nucleotide sequences, Protein sequences. Link labels: term frequency statistics; literature citations in sequence databases; nucleotide sequence similarity; coding region features; amino acid sequence similarity.]

Type in your last name and find a paper from one of your teammates
Related Articles

Hard link: DNA to protein
L12345

From Fig. 1 of "Entrez search and retrieval system", Jim Ostell, Chapter 14, The NCBI Handbook, 2003.

Ctrl-F

Getting started in Entrez

“ouellette bf” [au] AND yeast

MeSH: Medical Subject Headings

A query
• Word <free text>: too many hits
  – More words (the Boolean ‘AND’ is the default)
  – Limit query to specified field
  – Limit query in time
  – Do Boolean on queries
    • #1 AND #2
    • #3 NOT #5
    • #7 OR #8

hieter p [au]

Limit in time: 1993-01-01 to 1993-12-31

[Figure legend: No abstract; With abstract; Full Text on-line; Full Text in PubMed Central]

boguski m [au]: 99
boguski ms [au]: 80

#24 NOT #23: 19

Other types of links in Entrez
• The next slides explore other kinds of things linked into Entrez records.

“hieter p” [au] cdc16p

“Books”

(2)

Link to Genome View of Chromosome I

RefSeq
• RefSeq is NCBI's curated set of “reference sequences” for all ‘worked’ genomes.
• Historically, these were referred to as “GenBank-Gold”.
• RefSeq records are either genomic, mRNA, or protein sequences.
• Not all sequences are in RefSeq.
• All RefSeq sequences are assembled or taken from records in GenBank.
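
The refinement steps on the "A query" slide (restrict to a field, limit in time, combine with Boolean operators) can also be written down programmatically. The sketch below is only an illustration: the slides use the Entrez web interface, whereas this assumes NCBI's E-utilities `esearch` endpoint and its `db`, `term`, `retmax`, `datetype`, `mindate` and `maxdate` parameters. The helper names `author`, `combine` and `esearch_url` are hypothetical, and the code only builds the URL without contacting NCBI.

```python
from urllib.parse import urlencode

# Assumption: NCBI E-utilities search endpoint (not mentioned in the slides).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def author(name):
    """Quote an author name and restrict it to the author field, e.g. "hieter p"[au]."""
    return f'"{name}"[au]'


def combine(op, left, right):
    """Join two query fragments with a Boolean operator (AND, OR, NOT)."""
    if op not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported Boolean operator: {op}")
    return f"({left}) {op} ({right})"


def esearch_url(term, db="pubmed", mindate=None, maxdate=None, retmax=20):
    """Build (but do not send) an esearch URL, optionally limited in time."""
    params = {"db": db, "term": term, "retmax": retmax}
    # E-utilities expects mindate and maxdate together, as YYYY/MM/DD.
    if mindate and maxdate:
        params.update({"datetype": "pdat", "mindate": mindate, "maxdate": maxdate})
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # Field-limited author search combined with a free-text word.
    print(combine("AND", author("ouellette bf"), "yeast"))
    # Author search limited to publication dates in 1993.
    print(esearch_url(author("hieter p"), mindate="1993/01/01", maxdate="1993/12/31"))
```

Keeping the query string separate from the URL makes it easy to paste the same term into the Entrez search box and compare hit counts with the programmatic route.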
Some of the features of RefSeq:
• non-redundancy
• explicitly linked nucleotide and protein sequences
• updates to reflect current knowledge of sequence data and biology
• data validation and format consistency
• distinct accession series
• ongoing curation by NCBI staff and collaborators, with review status indicated on each record

Accession number space
• GenBank:
  – 1+5 (L12345, U00001)
  – 2+6 (AF000001, AC000003)
  – 4+2+6 (WGS)
• All have accession.version
• Protein:
  – 1+5 (SwissProt/UniProt)
  – 3+5 (GenPept)
• All have accession.version
• RefSeq:
  – N*_12345

RefSeq Accession Number Space
NC_123456     Genomic   Complete genomic molecules including genomes, chromosomes, organelles, plasmids.
NG_123456     Genomic   Incomplete genomic region; supplied to support the NCBI Genome Annotation pipeline.
NM_123456     mRNA
NR_123456     RNA       Non-coding transcripts including structural RNAs, transcribed pseudogenes, and others.
NP_123456     Protein
NP_12345678   Protein   Planned expansion of accession series.

Automated Assemblies
NT_123456     Genomic   Intermediate genomic assemblies of BAC sequence data.
NW_123456     Genomic   Intermediate genomic assemblies of Whole Genome Shotgun sequence data.

Model RefSeq records
XM_123456     mRNA      Model mRNA provided by the Genome Annotation process; sequence corresponds to the genomic contig.
XR_123456     RNA       Model non-coding transcripts provided by the Genome Annotation process; sequence corresponds to the genomic contig.
XP_123456     Protein   Model proteins provided by the Genome Annotation process; sequence corresponds to the genomic contig.

WGS special case
NZ_ABCD12345678   Genomic   A collection of whole genome shotgun sequence data for a project. Accessions are not tracked between releases. The first four characters following the underscore (e.g. 'ABCD') identify a genome project.
ZP_12345678       Protein   Proteins annotated on NZ_ accessions (often via computational methods).
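
The accession formats listed above lend themselves to a small pattern-based classifier. The following is a minimal sketch that covers only the formats named on these slides; real accession schemes have since grown beyond them. The function name `classify_accession` and the label strings are made up for illustration, and a 1+5 accession is reported as ambiguous because the slides give that shape for both GenBank and SwissProt/UniProt.

```python
import re

# RefSeq two-letter prefixes and their meaning, as listed on the slides.
REFSEQ_PREFIXES = {
    "NC": "genomic (complete molecule)",
    "NG": "genomic (incomplete region)",
    "NM": "mRNA",
    "NR": "non-coding RNA",
    "NP": "protein",
    "NT": "genomic (BAC-based intermediate assembly)",
    "NW": "genomic (WGS-based intermediate assembly)",
    "XM": "model mRNA",
    "XR": "model non-coding RNA",
    "XP": "model protein",
    "NZ": "genomic (WGS project collection)",
    "ZP": "protein annotated on an NZ_ record",
}

# Letters+digits shapes for archival (non-RefSeq) accessions from the slides.
ARCHIVE_PATTERNS = [
    (re.compile(r"[A-Z]\d{5}"), "1+5: GenBank nucleotide or SwissProt/UniProt protein"),
    (re.compile(r"[A-Z]{2}\d{6}"), "2+6: GenBank nucleotide"),
    (re.compile(r"[A-Z]{3}\d{5}"), "3+5: GenPept protein"),
    (re.compile(r"[A-Z]{4}\d{2}\d{6}"), "4+2+6: GenBank WGS"),
]


def classify_accession(accession):
    """Return a label for an accession, ignoring any '.version' suffix."""
    base = accession.strip().upper().partition(".")[0]
    prefix, underscore, rest = base.partition("_")
    if underscore:
        # An underscore marks the RefSeq accession space.
        if prefix in REFSEQ_PREFIXES and rest.isalnum():
            return "RefSeq " + REFSEQ_PREFIXES[prefix]
        return "unknown"
    for pattern, label in ARCHIVE_PATTERNS:
        if pattern.fullmatch(base):
            return label
    return "unknown"


if __name__ == "__main__":
    for acc in ["L12345", "AF000001.1", "NM_123456", "XP_123456.2", "NZ_ABCD12345678"]:
        print(acc, "->", classify_accession(acc))
```

A lookup like this is handy when sorting BLAST hits or mixed download lists into curated RefSeq records, model records, and archival GenBank entries before deciding which ones to trust.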
Download all the data
Entrez and RefSeq

LocusLink

Things to watch out for: