Download A 2 - Computer Science

Introduction to Bioinformatics
Junhui Wang
May 2004

Outline
• What is bioinformatics?
• Introduction to biological databases
• Sequence alignment

Why use bioinformatics?
• The explosive growth in the amount of biological information necessitates the use of computers for cataloguing and retrieval.
• It is impossible to analyze the data by manual inspection.
• Data mining: functional and structural information is important for studying the molecular basis of diseases (and evolutionary patterns).

What is bioinformatics?
• A mixture of computer science, mathematics and biology.
• Development of new algorithms and statistics to assess relationships among members of large data sets.
• Analysis and interpretation of various types of data.
• Development and implementation of tools to efficiently access and manage different types of information.

Databases for bioinformatics
• Nucleotide databases and protein databases
• Primary databases and secondary databases
• DNA/RNA databases: GenBank / EMBL / DDBJ
• Protein databases: PDB, SWISS-PROT / PIR

DNA -> RNA -> protein
• DNA: genomic DNA databases
• RNA: cDNA, ESTs
• protein: protein sequence databases

There are three major public DNA databases
• EMBL: housed at EBI (European Bioinformatics Institute)
• GenBank: housed at NCBI (National Center for Biotechnology Information)
• DDBJ: housed in Japan

www.ncbi.nlm.nih.gov

PubMed is…
• National Library of Medicine's search service
• 11 million citations in MEDLINE
• links to participating online journals
• PubMed tutorial (via "Education" on the side bar)

Entrez is…
• a search and retrieval system that integrates NCBI databases:
• the scientific literature;
• DNA and protein sequence databases;
• 3D protein structure data

BLAST is…
• Basic Local Alignment Search Tool
• NCBI's sequence similarity search tool
• supports analysis of DNA and protein databases
• 80,000 searches per day

OMIM is…
• Online Mendelian Inheritance in Man
• catalog of human genes and genetic disorders
• edited by Dr.
Victor McKusick

Books is…
• searchable resource of on-line books

TaxBrowser is…
• browser for the major divisions of living organisms (bacteria, viruses)
• taxonomy information such as genetic codes
• molecular data on extinct organisms

Structure site includes…
• Molecular Modeling Database (MMDB)
• biopolymer structures obtained from the Protein Data Bank (PDB)
• a 3D-structure viewer

Four questions we can answer at NCBI (and elsewhere):
[1] How can I do a literature search using PubMed?
[2] How can WelchWeb help?
[3] How can I use Entrez to find information about a particular gene or protein?
[4] How can I find information about a particular disease?

Question #1: How can I use PubMed at NCBI to find literature information?

PubMed is the NCBI gateway to MEDLINE. MEDLINE contains bibliographic citations and author abstracts from over 4,000 journals published in the United States and in 70 foreign countries. It has 12 million records dating back to 1966.

MeSH is the acronym for "Medical Subject Headings." MeSH is the list of vocabulary terms used for subject analysis of biomedical literature at NLM. The MeSH vocabulary is used for indexing journal articles for MEDLINE. The MeSH controlled vocabulary imposes uniformity and consistency on the indexing of biomedical literature.

PubMed search strategies
• Try the tutorial ("Education" on the left sidebar)
• Use boolean queries: AND, OR, NOT
• Try using "limits"
• Try "LinkOut" to find external resources
• Obtain articles on-line via the Welch Medical Library (and download pdf files): http://www.welch.jhu.edu/

Question #2: How can I use WelchWeb (from the Welch Medical Library) to do literature searches?

WelchWeb is available at http://www.welch.jhu.edu
• E-mail gateway
• PubMed gateway
• Library catalog
• Remote access to Welch services
• Request literature
• Browse journals
• Browse databases

Question #3: How can I use NCBI (or other sites) to find information about a protein or gene?
Four ways to access protein and DNA sequences
[1] LocusLink with RefSeq
[2] Entrez
[3] UniGene
[4] ExPASy Sequence Retrieval System (this is separate from NCBI)

[1] LocusLink with RefSeq
LocusLink is a great starting point: it collects key information on each gene/protein from major databases. It now covers 8 organisms.
RefSeq provides a curated, optimal accession number for each DNA (NM_006744) or protein (NP_007635).

[2] Entrez
Entrez is divided into sites for nucleotide, protein, structure, genomes, OMIM, and more. You can use limits (such as RefSeq) to focus your Entrez search.

The GenBank flatfile:
• the elementary unit of information
• one of the most commonly used formats
• LOCUS: locus name, length of the sequence, molecule type, GenBank division code, date
• DEFINITION: summarizes the biology of the record (genus and species, product name, ….)
• ACCESSION: an accession number is a label used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence.
• VERSION: accession version
• GID: the gi (GenInfo identifier)

The GenBank flatfile (cont.):
• KEYWORDS: identify the particular entry; not very useful
• SOURCE: gives either the common name of the organism or its scientific name
• REFERENCE: at least one reference or citation, which can be published or unpublished; the MEDLINE and PUBMED identifiers provide links to the MEDLINE and PubMed databases
• COMMENT: refers to the whole record

Graphics format

[3] UniGene
UniGene collects expressed sequence tags (ESTs) into clusters, in an attempt to form one gene per cluster. Use UniGene to study where your gene is expressed in the body, when it is expressed, and how abundant it is.
[4] ExPASy SRS
There are many bioinformatics servers outside NCBI. Try ExPASy's sequence retrieval system at http://www.expasy.ch/ (ExPASy = Expert Protein Analysis System)

Question #4: How can I find information about a particular disease?

Answer: Try OMIM

Two main kinds of disease database: general and locus-specific
General
• OMIM
• GeneCards (Weizmann): http://bioinformatics.weizmann.ac.il/cards/
• Genes & Disease (at NCBI): http://www.ncbi.nlm.nih.gov/disease/
Locus-specific
• Human Gene Mutation Database (HGMD): http://archive.uwcm.ac.uk/uwcm/mg/docs/oth_mut.html

Comparative method: sequence alignment
• Biology has a long tradition of comparative analysis leading to discovery.
• Compare the similarities and differences in biological data to infer structural, functional and evolutionary relationships.
• The most common comparative method is alignment.
• The focus here is sequence alignment.

Sequence alignment
• Definition: provides an explicit mapping between the residues of two or more sequences.
• Sufficiently high similarity typically implies homology.
• Homology: similarity due to descent from a common ancestor. Homology information is in genes.
• Two types, according to the number of sequences:
  Pairwise alignment: two sequences are compared
  Multiple alignment: more than two sequences are involved

Pairwise alignment: simple example
Before inserting a gap (5 identical residues):
  query sequence   AGGVLAQVG
  object sequence  AGGVLQVG
After inserting a gap (8 identical residues):
  query sequence   AGGVLAQVG
  object sequence  AGGVL-QVG
A gap represents an insertion or a deletion.

Pairwise sequence alignment
• Each aligned pair receives a value that depends on its content.
• The total score is the sum of these values.
• There may be many alignments with the maximal score.
• Example scoring scheme:
  identical characters (match): +1
  different characters (mismatch): -1
  gap: -1

Pairwise alignment: simple example
Before inserting gaps:
  TCATG
  CATTG
After inserting gaps (two alignments, each with score = 4 - 2 = 2):
  TCAT-G      TCA-TG
  -CATTG      -CATTG

Similarity search: dot plot
[Figure: dot plot; horizontal sequence MTFRDLLSVSFEGPRPDSSAGGSSAGG, vertical sequence beginning MTFRDLLSVS]

Pairwise alignment: dot plot
[Figure panels: identical sequences; highly similar sequences; somewhat similar sequences]

Similarity search and alignment
• Exhaustive methods:
  Needleman-Wunsch: global alignment
  Smith-Waterman: local alignment
• Common programs:
  FastA (1985): compares a query sequence against a database of sequences
  BLAST (Basic Local Alignment Search Tool, 1990): an improvement on FastA

Alignment models: global & local
• Global alignment: take all of one sequence and align it with all of a second sequence.
• Disadvantage: short and highly similar subsequences may be missed in the alignment.
• Input: two sequences of similar lengths (if the sequences differ in length, spaces may be introduced).
• Output: the best similarity score between the sequences.
• Example:
  ACCTGC        -ACC-TGC
  TACGTG   ->   TAC-GTG-

Needleman-Wunsch Algorithm (1970)
• The first step is to place the two sequences along the margins of a matrix and put a 1 wherever the two sequences match and a 0 elsewhere.
• For each element in the matrix, perform the following operation:
  $M_{i,j} \leftarrow M_{i,j} + \max(M_{k,j+1},\ M_{i+1,n})$
  where k is any integer larger than i and n is any integer larger than j.
• In words, alter the matrix by adding to each element the largest element from the row just below and to the right of that element and from the column just to the right and below the element of interest.
• After this operation is completed, the number in each cell is the largest number of identical pairs that can be found on a pathway that starts at that element and runs toward the lower right; the summation itself proceeds from the lower right toward the upper left.
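The summation step above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `nw_match_matrix` and the 0-based indexing are assumptions made here, and the two test sequences are the ones used in the slide example that follows.

```python
def nw_match_matrix(row_seq, col_seq):
    """Match-count matrix of the 1970 formulation, summed from the lower right."""
    n, m = len(row_seq), len(col_seq)
    # Step 1: 1 where the residues match, 0 elsewhere.
    M = [[1 if r == c else 0 for c in col_seq] for r in row_seq]
    # Step 2: the last row and last column have nothing below/right of them,
    # so they keep their 0/1 values; every other cell gains the largest value
    # found in the next column (rows below) or the next row (columns to the right).
    for i in range(n - 2, -1, -1):
        for j in range(m - 2, -1, -1):
            best_in_next_col = max(M[k][j + 1] for k in range(i + 1, n))
            best_in_next_row = max(M[i + 1][c] for c in range(j + 1, m))
            M[i][j] += max(best_in_next_col, best_in_next_row)
    return M


matrix = nw_match_matrix("ADLGRTQNCDRYYQ", "ADLGAVFALCDRYFQ")
print(matrix[0][0])  # 9: the largest number of identical pairs for a path from the top-left cell
```

The triple loop is deliberately naive so that it mirrors the wording of the slide. It scores identities only and has no gap penalty.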
Example (matrix during the summation)

    A D L G A V F A L C D R Y F Q
  A 1 0 0 0 1 0 0 1 0 4 3 2 1 1 0
  D 0 1 0 0 0 0 0 0 0 4 4 2 1 1 0
  L 0 0 1 0 0 0 0 0 1 4 3 2 1 1 0
  G 0 0 0 1 0 0 0 0 5 4 3 2 1 1 0
  R 0 0 0 0 0 0 0 0 5 4 3 3 1 1 0
  T 0 0 0 0 0 0 0 0 5 4 3 2 1 1 0
  Q 0 0 0 0 0 0 0 0 5 4 3 2 1 1 1
  N 0 0 0 0 0 0 0 0 5 4 3 2 1 1 0
  C 0 0 0 0 0 0 0 0 4 5 3 2 1 1 0
  D 0 1 0 0 0 0 0 0 3 3 4 2 1 1 0
  R 0 0 0 0 0 0 0 0 2 2 2 3 1 1 0
  Y 0 0 0 0 0 0 0 0 2 2 2 2 2 1 0
  Y 0 0 0 0 0 0 0 0 1 1 1 1 2 1 0
  Q 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1

Example (completed matrix)

    A D L G A V F A L C D R Y F Q
  A 9 7 6 6 7 6 6 7 5 4 3 2 1 1 0
  D 7 8 6 6 6 6 6 6 5 4 4 2 1 1 0
  L 6 6 7 5 5 5 5 5 6 4 3 2 1 1 0
  G 5 5 5 6 5 5 5 5 5 4 3 2 1 1 0
  R 5 5 5 5 5 5 5 5 5 4 3 3 1 1 0
  T 5 5 5 5 5 5 5 5 5 4 3 2 1 1 0
  Q 5 5 5 5 5 5 5 5 5 4 3 2 1 1 1
  N 5 5 5 5 5 5 5 5 5 4 3 2 1 1 0
  C 4 4 4 4 4 4 4 4 4 5 3 2 1 1 0
  D 3 4 3 3 3 3 3 3 3 3 4 2 1 1 0
  R 2 2 2 2 2 2 2 2 2 2 2 3 1 1 0
  Y 2 2 2 2 2 2 2 2 2 2 2 2 2 1 0
  Y 1 1 1 1 1 1 1 1 1 1 1 1 2 1 0
  Q 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1

Alignment models: global & local
• Local alignment: not all of the sequences need be aligned together.
• Input: two sequences S and T.
• Output: the maximum similarity between a subsequence of S and a subsequence of T.
• Example:
  ACCTGC        ACCTGC
  TACGTA   ->   TACGTA

Smith-Waterman Algorithm (1981)
• Similar to the global alignment.
• Difference: the best local alignment score is the greatest value in the table.
• A negative score/weight must be given to mismatches.
• Zero must be the minimum score recorded in the matrix.
• $M_{i,j} \leftarrow M_{i,j} + \max(M_{k,j-1},\ M_{i-1,n})$
  where k is any integer smaller than i and n is any integer smaller than j; go left to right, top to bottom in the matrix.

Example: here a penalty of -0.5 is applied for each mismatch.

Common programs: FastA and BLAST
FastA (1985): compares a query sequence against a database of sequences.
BLAST (Basic Local Alignment Search Tool, 1990): an improvement on FastA; a set of similarity search programs for protein or DNA sequences (BLASTN, BLASTP, ...), developed at NCBI.
• By seeking local alignments, BLAST is able to detect relationships among sequences that share only isolated regions of similarity.
• Structural similarity (usually functional similarity)

FASTA format

Multiple Alignment
• Definition: a multiple alignment of sequences S1, S2, .., Sn is a series of sequences S1', S2', .., Sn' such that
  • all Si' sequences are of equal length;
  • Si' is an extension of Si obtained by inserting gaps.
• Motivation:
  Find diagnostic patterns to characterize protein families.
  Detect/demonstrate homology between a new sequence and existing families of proteins.
  Help predict the secondary and tertiary structures of new sequences.

Multiple Alignment Method: ClustalW Algorithm
• Based on the idea that similar sequences are usually evolutionarily related.
• First, all pairs of sequences are aligned separately in order to calculate their similarity scores.
• The sequences are grouped according to the similarity scores.
• The groups are then aligned with each other to obtain new similarity scores.
• Grouping is repeated until the final result is obtained.
• The sequences with high similarity are aligned first, followed by the sequences with low similarity.
• A guide tree is calculated; similar sequences are neighbors in the tree, and distant sequences are distant from each other in the tree.
• The sequences are progressively aligned according to the guide tree.

Reference
1. http://www.ncbi.nlm.nih.gov/
2. Elementary Sequence Analysis, McMaster University, http://colorbasepair.com/bioinformatics_courses_tutorials.html
3. Introduction to Bioinformatics, T. K. Attwood and D. J. Parry-Smith
4. Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, Andreas D. Baxevanis and B. F. Francis Ouellette
5. http://omega.cbmi.upmc.edu/~vanathi/syllabus.html
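As a supplement to the Smith-Waterman slide, here is a minimal Python sketch of local alignment scoring. It uses the common textbook recurrence (diagonal, up, and left neighbours with a linear gap penalty), not the row/column-maximum formula written on the slide. The mismatch value of -0.5 comes from the slide's example; the function name `smith_waterman_score`, the match score of +1, the gap penalty of -1, and the two test sequences are assumptions chosen for illustration.

```python
def smith_waterman_score(s, t, match=1.0, mismatch=-0.5, gap=-1.0):
    """Best local alignment score between s and t (score only, no traceback)."""
    rows, cols = len(s) + 1, len(t) + 1
    # H[i][j] is the best score of a local alignment ending at s[i-1], t[j-1].
    H = [[0.0] * cols for _ in range(rows)]
    best = 0.0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            H[i][j] = max(
                0.0,                     # zero is the minimum score recorded
                H[i - 1][j - 1] + pair,  # align s[i-1] with t[j-1]
                H[i - 1][j] + gap,       # gap in t
                H[i][j - 1] + gap,       # gap in s
            )
            best = max(best, H[i][j])    # the answer is the greatest value in the table
    return best


print(smith_waterman_score("TTACGTAA", "CCACGTGG"))  # 4.0 (the shared "ACGT")
```

The two rules from the slide appear directly in the code: scores are floored at zero, and the result is the largest entry in the table rather than a corner cell.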
