
Tri-I Bioinformatics Workshop: Public data and tool repositories
Alex Lash & Maureen Higgins
Bioinformatics Core
Memorial Sloan-Kettering Cancer Center

Workshop sections
1. Retrieving data from public resources
   • public databases at NCBI, EBI, Ensembl
   • locate and utilize some of the myriad of publicly available bioinformatics tools
   • common data formats
2. Genome Browsers
   • genome build process, ongoing and complete genome projects
   • genome browsers of Ensembl, UCSC and NCBI Mapviewer
   • broad survey of analysis tools and tutorials available on the Web for use directly and after download

Public data and tool repositories
Section 1: Retrieving data from public resources

Goals
A. Understand the scope and organization of the major public databases: NCBI, EBI/Ensembl.
B. Understand the importance of unique identifiers, database fields, logical operators and wildcards.
C. Be able to query, retrieve and display publications and sequences.
D. Be able to visualize/analyze protein structure.

Amyloid Precursor Protein (APP)
[Figure: APP schematic. Labels: G-protein coupled receptor that binds heparin and laminin; controls nerve cell growth; interacts with protein-synthesis machinery; β-secretase; secretase; amyloid fibril; amyloid plaque]

NCBI
Strengths are data storage, annotation and BLAST:
1. PubMed: Biomedical publications
2. Heritable diseases and syndromes
3. GenBank: Nucleotide and protein sequences
4. BLAST: Pairwise sequence comparison
5. Curated gene-centric data, including reference sequences
6. Genome builds
7. Nucleotide sequence traces

Ex: Finding Entrez Gene record for APP

Indexing and logical operators
Query: app[Gene Name] AND homo sapiens[Organism]
[Figure: each index term (aardvark, ..., app, ..., homo sapiens, ..., mus musculus) points to a bit vector with one bit per record UID (1, 2, 3, 4, 5, 6, 7, 8, ...). The bit vectors for the two query terms are combined with AND to give the result bit vector.]

An Entrez Query
1. Query parsed: terms, fields and operators organized in a tree (if syntax incorrect, generate error or warning)
2. Unfielded terms matched to synonyms, and extra terms, fields and operators added as needed
3. For each database:
   a) According to order of operations:
      i. Term found in appropriate index (if term not found, then generate warning)
      ii. Bit map pulled and uncompressed
      iii. Pairwise operations performed with previous result (if zero result, then stop)
   b) Number of results generated
4. If Global Query, display results summary and stop
5. List of UIDs generated from final result
6. UIDs sorted by user preference
7. Records pulled and displayed by user preference

Gene-centric questions
1. Where is a gene located?
2. What’s its genomic sequence?
3. What variations are associated with it?
4. What’s its exon-intron structure?
5. What are the mRNA sequences of its alternate transcripts?
6. What are the protein sequences of its isoforms?
7. What post-translational modification is possible?
8. What regulates its transcription?
9. What are its co-regulated partners?
10. What’s its normal function?
11. What’s its function in disease?
12. How does it fit into the larger cellular context?
May depend upon cellular “state”

Ex: Looking over the Entrez Gene record for APP

Common id and record formats
1. Ids
   • GenBank accession
     Nucleotide: BI559391, Y00264
     Protein: AAB23646
   • RefSeq
   • Ensembl
   • UniGene: Hs.651215
   • PDB Structures: 1iyt
   • HUGO Gene Names: APP
2. Formats
   a) Flat
      i. GenBank and GenPept
      ii. FASTA
      iii. Multiple FASTA
      iv. Alignment
      v. Multiple alignment
      vi. Tab-delimited
   b) Hierarchical
      i. ASN.1
      ii. XML
      iii. HTML

NCBI’s RefSeq project
1. Is a project to create curated sequence records for the biopolymers of the Central Dogma: DNA, mRNA and protein
2. First release 2003
3. 4,079 organisms, 3,234,358 proteins
4. Goals
   • non-redundancy
   • explicitly linked nucleotide and protein sequences
   • updates to reflect current knowledge of sequence data and biology
   • data validation and format consistency
   • distinct accession series
   • ongoing curation by NCBI staff and collaborators, with reviewed records indicated
5. What’s its relationship to the BLAST database called “nr”?

UniGene versus Entrez Gene
1. UniGene
   1. Automated process that compares and clusters transcript-source sequences (no assembly)
   2. Gene discovery tool: predates Entrez Gene, genome assemblies
   3. Based primarily on EST sequences
   4. ID turn-over and retirement is common
   5. Currently 76 taxa and 1,299,304 clusters
2. Entrez Gene
   1. Curated clearinghouse of gene-centric information
   2. Grew out of LocusLink (eukaryote model organisms) and Entrez Genome (bacteria, viruses, organelles)
   3. ID turn-over and retirement happens, but is less common since it is based primarily on sequenced genomes
   4. Currently 3882 taxa and 2,479,759 genes
3. Hs: 85,793 UniGene clusters compared to 38,604 Entrez Gene records

EBI/Ensembl
Strengths are data storage and analysis software:
1. Biomedical publications
2. Nucleotide and protein sequences
3. Protein domains/signatures
4. Sequence comparison
5. Sequence analysis
6. Structure analysis
7. Protein function analysis
8. Ensembl genome browser

Ex: Looking at the APP gene in the EBI/Ensembl resources

Ensembl ids
1. Human
   1. ENSG: gene
   2. ENST: transcript
   3. ENSE: exon
   4. ENSP: protein
2. Other organisms
   1. ENS{species 3-letter code}{G|T|P}{11 digits}
   2. RNO=rat
   3. MUS=mouse

Amyloid Precursor Protein (APP)
[Figure: APP schematic. Labels: G-protein coupled receptor that binds heparin and laminin; β-secretase; secretase; amyloid fibril; amyloid plaque. Peptide sequence shown: DAEFRHDSGYEVHHQKLVFFAEDVGSNKGAIIGLMVGGVVIA]

Ex: Viewing the structure of an amyloid fibril

Other structure tools
1. Structure visualization. Free applications:
   a) RasMol
   b) Cn3D
   c) VMD
2. Structure prediction servers/applications
   a) CASP: Critical Assessment of Techniques for Protein Structure Prediction
   b) General method:
      i. Sequence similarity search to identify closest homolog with known structure
      ii. Fit to homolog’s known structure, minimizing some constraint

Problems
1. Query Entrez Gene with the following two queries separately and then explain the differences between the two results using a logical NOT operation:
   a) tyrosine kinase[Gene Ontology] AND human[Organism]
   b) cd00192[Domain] AND human[Organism]
2. Retrieve the APP gene record from NCBI and use the Display dropdown menu to display Conserved Domain Links. Use the ids of the listed domains to query Entrez Gene for records with the same domains.
3. Use the SNP Geneview link at NCBI to identify coding SNPs in the APP gene. Which SNP present in the Ensembl APP protein record is missing from this display?
4. Use the Homologene link at NCBI to identify possible functional orthologs for human APP. How does this list compare to the Ensembl list of orthologs that we reviewed previously?
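
Sketch: bit-map indexing and logical operators

The "Indexing and logical operators" slide and Problem 1 both rest on the same idea: each indexed term points to a bit map over record UIDs, and AND / OR / NOT are pairwise bit operations on those maps. The Python below is a minimal sketch of that idea only, not NCBI's implementation. The UIDs, the term-to-record assignments, and the helper names (to_bitmap, to_uids, lookup, op_and, op_or, op_not) are all invented for illustration.

```python
# Toy field-specific index: (field, term) -> bit map over record UIDs.
# Every UID and term assignment here is made up for illustration.

def to_bitmap(uids):
    """Pack positive integer UIDs into one int; bit i set means UID i matches."""
    bits = 0
    for uid in uids:
        bits |= 1 << uid
    return bits

def to_uids(bits):
    """Unpack a bit map into a sorted list of UIDs."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]

INDEX = {
    ("Gene Name", "app"): to_bitmap([1, 5, 6]),
    ("Organism", "homo sapiens"): to_bitmap([1, 2, 5]),
    ("Organism", "mus musculus"): to_bitmap([4, 6]),
}

def lookup(field, term):
    """Return the bit map for a term, or an empty map if the term is not indexed."""
    return INDEX.get((field, term.lower()), 0)

def op_and(a, b):
    return a & b

def op_or(a, b):
    return a | b

def op_not(a, b):
    """Binary NOT as used in queries: records in a that are not in b."""
    return a & ~b

app = lookup("Gene Name", "app")
human = lookup("Organism", "homo sapiens")

# app[Gene Name] AND homo sapiens[Organism]
human_app = to_uids(op_and(app, human))

# app[Gene Name] NOT homo sapiens[Organism]
non_human_app = to_uids(op_not(app, human))

print(human_app)      # [1, 5]
print(non_human_app)  # [6]
```

For Problem 1, the same op_not pattern applies in both directions: (a NOT b) gives the records found only by the Gene Ontology query, and (b NOT a) gives those found only by the domain query.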
