Protein sequence retrieval and other database information

Databases
• Protein sequence (primary)
  – SWISS-PROT
  – PIR-International
• Protein sequence (composite)
  – OWL
  – NRDB
• Protein sequence (secondary)
  – PROSITE
  – PRINTS
  – Pfam
• Macromolecular structures
  – Protein Data Bank (PDB)
  – Nucleic Acids Database (NDB)
  – HIV Protease Database
  – ReLiBase
  – PDBsum
  – CATH
  – SCOP
  – FSSP
• Nucleotide sequences
  – GenBank
  – EMBL
  – DDBJ
• Genome sequences
  – Entrez genomes
  – GeneCensus
  – COGs
• Integrated databases
  – InterPro
  – Sequence retrieval system (SRS)
  – Entrez

Protein Sequence Alignment and Database Searching
• Alignment of Two Sequences (Pair-wise Alignment)
  – The Scoring Schemes or Weight Matrices
  – Techniques of Alignments
  – DOTPLOT
• Multiple Sequence Alignment (Alignment of > 2 Sequences)
  – Extending Dynamic Programming to more sequences
  – Progressive Alignment (Tree or Hierarchical Methods)
  – Iterative Techniques
      • Stochastic Algorithms (SA, GA, HMM)
      • Non Stochastic Algorithms
• Database Scanning
  – FASTA, BLAST, PSI-BLAST, ISS
• Alignment of Whole Genomes
  – MUMmer (Maximal Unique Match)

An Overview of BLAST

Input query            Program   Compares against
Amino acid sequence    blastp    Protein sequence database
Amino acid sequence    tblastn   Translated nucleotide sequence database
DNA sequence           blastn    Nucleotide sequence database
DNA sequence           blastx    Protein sequence database
DNA sequence           tblastx   Translated nucleotide sequence database

Comparison of Whole Genomes
• MUMmer (Salzberg group, 1999, 2002)
  – Pair-wise sequence alignment of genomes
  – Assumes that the sequences are closely related
  – Detects repeats, inverted repeats, and SNPs
  – Detects inserted or deleted domains
  – Identifies the exact matches
• How it works
  – Identify the maximal unique matches (MUMs) in the two genomes
  – Because the two genomes are similar, long MUMs are expected
  – Sort the MUMs and extract the longest set of matches that occur in the same order in both genomes (ordered MUMs)
  – A suffix tree is used to identify the MUMs
  – Close the gaps by identifying SNPs and large inserts
  – Align the regions between MUMs with Smith-Waterman

Secondary protein databases
• SWISS-PROT (1986)
  – Best annotated, least redundant
• PIR (Protein Information Resource)
  – More automated annotation
  – Collaborations with MIPS and JIPID
• UniProt (2003)
  – UniProt (Universal Protein Resource) is a central repository of protein sequence and function created by joining the information contained in Swiss-Prot, TrEMBL, and PIR.

Databases
• Primary (archival)
  – GenBank/EMBL/DDBJ
  – UniProt
  – PDB
  – Medline (PubMed)
  – BIND
• Secondary (curated)
  – RefSeq
  – Taxon
  – UniProt
  – OMIM
  – SGD

Organismal Divisions

Code   Division                  Used in which database?
BCT    Bacterial                 DDBJ - GenBank
FUN    Fungal                    EMBL
HUM    Homo sapiens              DDBJ - EMBL
INV    Invertebrate              all
MAM    Other mammalian           all
ORG    Organelle                 EMBL
PHG    Phage                     all
PLN    Plant                     all
PRI    Primate (also see HUM)    all (not same data in all)
PRO    Prokaryotic               EMBL
ROD    Rodent                    all
SYN    Synthetic and chimeric    all
VRL    Viral                     all
VRT    Other vertebrate          all

Functional Divisions

Code   Division
PAT    Patent
EST    Expressed Sequence Tags
STS    Sequence Tagged Site
GSS    Genome Survey Sequence
HTG    High Throughput Genome (unfinished)
HTC    High throughput cDNA (unfinished)
CON    Contig assembly instructions

Organismal divisions: BCT, FUN, INV, MAM, PHG, PLN, PRI, ROD, SYN, VRL, VRT

EST: Expressed Sequence Tag
Expressed Sequence Tags are short (300-500 bp) single reads from mRNA (cDNA) that are produced in large numbers. They represent a snapshot of what is expressed in a given tissue and developmental stage.
Also see:
http://www.ncbi.nlm.nih.gov/dbEST/
http://www.ncbi.nlm.nih.gov/UniGene/

STS: Sequence Tagged Site
A Sequence Tagged Site is an operationally unique sequence, defined by the primer pair used in a PCR assay, that serves as a mapping reagent and maps to a single position within the genome.
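The "single position" requirement can be illustrated with a toy in-silico PCR check. This is only a minimal sketch under strong assumptions: exact primer matches, the forward primer on the given strand only, and a hypothetical `max_product` size limit. The function names (`revcomp`, `find_all`, `sts_sites`, `is_operationally_unique`) are made up for illustration and are not taken from any NCBI tool.

```python
def revcomp(seq: str) -> str:
    """Reverse complement of a DNA string (A, C, G, T, N only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def find_all(text: str, pattern: str) -> list:
    """Start positions of every (possibly overlapping) exact occurrence."""
    hits = []
    start = text.find(pattern)
    while start != -1:
        hits.append(start)
        start = text.find(pattern, start + 1)
    return hits


def sts_sites(genome: str, fwd_primer: str, rev_primer: str,
              max_product: int = 1000) -> list:
    """Return (start, end) of every predicted PCR product.

    Assumption: the reverse primer anneals to the opposite strand, so on
    the given strand its site appears as the reverse complement.
    """
    genome = genome.upper()
    fwd = fwd_primer.upper()
    rev_site = revcomp(rev_primer)
    sites = []
    for f in find_all(genome, fwd):
        for r in find_all(genome, rev_site):
            end = r + len(rev_site)
            # The reverse site must lie downstream of the forward primer,
            # and the product must not exceed the size limit.
            if r >= f + len(fwd) and end - f <= max_product:
                sites.append((f, end))
    return sites


def is_operationally_unique(genome: str, fwd_primer: str, rev_primer: str,
                            max_product: int = 1000) -> bool:
    """True when the primer pair predicts exactly one product."""
    return len(sts_sites(genome, fwd_primer, rev_primer, max_product)) == 1
```

A primer pair that amplifies one region passes the check; duplicating the region makes the pair ambiguous, so it would no longer qualify as an STS in this simplified sense.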
Also see:
http://www.ncbi.nlm.nih.gov/dbSTS/
http://www.ncbi.nlm.nih.gov/genemap/

GSS: Genome Survey Sequences
Genome Survey Sequences are similar in nature to ESTs, except that their sequences are genomic in origin rather than cDNA (mRNA). The GSS division contains:
• random "single pass read" genome survey sequences
• single pass reads from cosmid/BAC/YAC ends (these could be chromosome specific, but need not be)
• exon trapped genomic sequences
• Alu PCR sequences
Also see:
http://www.ncbi.nlm.nih.gov/dbGSS/

HTG: High Throughput Genome
High Throughput Genome sequences are records from unfinished genome sequencing efforts. Unfinished records have gaps in the nucleotide sequence, low accuracy, and no annotations.
Also see:
http://www.ncbi.nlm.nih.gov/HTGS/
Ouellette and Boguski (1997) Genome Res. 7:952-955

Which tool?
• mRNA
  – EST: dbEST, by e-mail or FTP
  – Other, simple: BankIt (WWW)
  – Other (better control of annotations, pop/phylo, segmented sets): Sequin or tbl2asn, by e-mail
• Genomic
  – STS/GSS: dbSTS, dbGSS, by e-mail or FTP
  – HTGS: customized software or tbl2asn, by e-mail or FTP
  – Simple: BankIt (WWW)
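
Returning to the MUMmer steps listed under "How it works": the first and third steps (find the MUMs, then keep the longest set that occurs in the same order in both genomes) can be sketched on short strings. This is a brute-force illustration, not MUMmer's implementation: MUMmer uses a suffix tree, whereas the sketch compares every pair of positions, and it treats "longest set" as the largest number of matches. The names `find_mums`, `ordered_mums`, and `min_len` are hypothetical, and the gap-closing and Smith-Waterman steps are left out.

```python
def count_overlapping(text: str, pattern: str) -> int:
    """Number of (possibly overlapping) exact occurrences of pattern."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def find_mums(a: str, b: str, min_len: int = 3) -> list:
    """Return (pos_a, pos_b, length) for each maximal unique match.

    Brute force: quadratic in the sequence lengths, for small inputs only.
    """
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Skip starts that could be extended to the left.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right as far as the sequences agree.
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k < min_len:
                continue
            match = a[i:i + k]
            # Keep the match only if it is unique in both sequences.
            if count_overlapping(a, match) == 1 and count_overlapping(b, match) == 1:
                mums.append((i, j, k))
    return sorted(mums)


def ordered_mums(mums: list) -> list:
    """Largest subset of MUMs whose positions increase in both sequences."""
    mums = sorted(mums)
    if not mums:
        return []
    best = [1] * len(mums)    # best chain size ending at each MUM
    prev = [-1] * len(mums)   # back-pointer to rebuild the chain
    for x in range(len(mums)):
        for y in range(x):
            in_order = mums[y][0] < mums[x][0] and mums[y][1] < mums[x][1]
            if in_order and best[y] + 1 > best[x]:
                best[x] = best[y] + 1
                prev[x] = y
    x = max(range(len(mums)), key=best.__getitem__)
    chain = []
    while x != -1:
        chain.append(mums[x])
        x = prev[x]
    return chain[::-1]
```

For two sequences that differ by a single substitution, the sketch reports one MUM on each side of the SNP. The one-base gap between them is what the gap-closing step would then classify.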