Bioinformatics
CSC 391/691; PHY 392; BICM 715
Bioinformatics Course, Spring 2004

Importance of bioinformatics
• A more global perspective in experimental design
• The ability to capitalize on the emerging technology of database mining: the process by which testable hypotheses are generated regarding the function or structure of a gene or protein of interest by identifying similar sequences in better-characterized organisms

Amino acids: chemical composition or digital symbols for proteins
http://wbiomed.curtin.edu.au/teach/biochem/tutorials/AAs/AA.html
Link found on the Research Collaboratory for Structural Bioinformatics web site: www.rcsb.org/pdb/education.html
See also Table 2.2 (Mount)

Nucleotides: chemical composition or digital symbols for nucleic acids
http://ndbserver.rutgers.edu/NDB/archives/NAintro/
http://www.web-books.com/MoBio/Free/Ch3A.htm
Link found on the Research Collaboratory for Structural Bioinformatics web site: www.rcsb.org/pdb/education.html
See also Table 2.1 (Mount)

The Genetic Code: how DNA nucleotides encode protein amino acids
http://www.accessexcellence.org/AB/GG/genetic.html

Biologists think it’s a lot of data, but maybe it’s really not
He made fun of biologists for complaining that the human genome, which takes up about 3 gigabytes, is "a lot of data". He offered the comparison of the DVD movie "Evita", which is about 12 gigabytes, with the genome of Madonna (3 gigabytes). "The movie contains four times more information than Madonna's genome. And Madonna shares 99% of her DNA with a chimp... And 90% with Craig Venter’s dog." More proof that the genome is not a lot of data: about 90-something percent of genetic information is common to all humans. "The unique part of you will fit on a floppy disk."
Nathan Myhrvold, former Chief Technology Officer for Microsoft
Keynote Speech at NIH Digital Biology Meeting 2003

Review of Lab 1
• What did you learn about the sites you visited: SGD, SwissProt, EntrezRefSeq, EntrezNeighbor, EntrezProtein, PIR-US
• Can you define the term protein function?
• Does the term gene function have any meaning?
• Questions?

Biologists think it’s a lot of data, and maybe it really is
• The genome is not a static, one-time picture
• Genome changes over time: mutations and other changes
• Genes expressed to make proteins
  – Set of genes that are expressed changes with cell type
  – Set of genes that are expressed changes over time and state

Definition of a Biological Database
A biological database is a large, organized body of persistent data, usually associated with computerized software designed to update, query, and retrieve components of the data stored within the system.

Sources of sequence data
1.
GenBank at the National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD (nucleotides and proteins) http://www.ncbi.nlm.nih.gov/Entrez
2. European Molecular Biology Laboratory (EMBL) Outstation at Hinxton, England http://www.ebi.ac.uk/embl/index.html
3. DNA DataBank of Japan (DDBJ) at Mishima, Japan http://www.ddbj.nig.ac.jp/
4. Protein Information Resource (PIR) database at the National Biomedical Research Foundation in Washington, DC (see Barker et al. 1998) http://www-nbrf.georgetown.edu/pirwww/
5. The SwissProt protein sequence database at ISREC, Swiss Institute for Experimental Cancer Research in Epalinges/Lausanne http://www.expasy.ch/cgi-bin/sprot-search-de
6. The Sequence Retrieval System (SRS) at the European Bioinformatics Institute allows both simple and complex concurrent searches of one or more sequence databases. The SRS system may also be used on a local machine to assist in the preparation of local sequence databases. http://srs6.ebi.ac.uk
Table 2.5, Mount

Sources of protein structure data
• RCSB Protein Data Bank (PDB): www.rcsb.org
• BioMagResBank: http://www.bmrb.wisc.edu/
• MMDB: http://www.ncbi.nlm.nih.gov/Structure/MMDB/mmdb.shtml

Review of Lab 2
• What did you learn about the RCSB web page?
• What are your thoughts about the PDB file format?
• Was RasMol easy or hard to use? Is there anything you tried to do, but couldn’t figure out how?
• What is the difference between the two glutaredoxin structures (1aaz and 1die)?
• MMDB: database of protein structures, ASN.1 format (http://www.ncbi.nlm.nih.gov/Structure/MMDB/mmdb.shtml)
• Other questions?

Levels of protein structure
• Primary structure
• Secondary structure
• (Super secondary structure)
• Tertiary structure
• Quaternary structure

Databases of protein structure classification
• SCOP: Murzin A. G., Brenner S.
E., Hubbard T., Chothia C. (1995). J. Mol. Biol. 247, 536-540.
• CATH: Orengo, C.A., Michie, A.D., Jones, S., Jones, D.T., Swindells, M.B., and Thornton, J.M. (1997) Structure, Vol 5, No 8, p. 1093-1108. http://www.biochem.ucl.ac.uk/bsm/cath/
• Dali: L. Holm and C. Sander (1996) Science 273:595-602. http://www.bioinfo.biocenter.helsinki.fi:8080/dali/index.html
• VAST: S. H. Bryant and C. Hogue. http://www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml

RNA Structure
• Primary structure: sequence of GACU nucleotides
• Secondary structure: stem-loop structures
• Tertiary structure
• http://www.rnabase.org/

DNA structure
• Primary structure: sequence of GACT nucleotides
• Secondary structure: double helix
• Higher levels of structure… nucleosome… chromatin… chromosome

An example of pairwise alignment
(A) ./wwwtmp/lalign/.17728.1.seq Glutaredoxin, T4, 1AAZ.pdb - 87 aa
(B) ./wwwtmp/lalign/.17728.2.seq Unknown protein - 93 aa
using matrix file: BL50, gap penalties: -14/-4
27.0% identity in 89 aa overlap; score: 101 E(10,000): 0.0014

Glutar KVYGYDSNIHKCVYCDNAKRLLTVKKQPFEFINIMPEKGV---FDD--EKIAELLTKLGR
Unknow EIYGIPEDVAKCSGCISAIRLCFEKGYDYEIIPVLKKANNQLGFDYILEKFDECKARANM

Glutar DTQIGLTMPQVFAPDGSHIGGFDQLREYF
Unknow QTR-PTSFPRIFV-DGQYIGSLKQFKDLY

Pairwise Sequence Alignment
• The alignment of two sequences (either protein or nucleic acid) based on some algorithm
• What is the “right answer”?
  – Align (pairwise) the following words: instruction, insurrection, incision
• There is NO unique, precise, and universally applicable method of pairwise alignment

Global vs Local Alignment
Figure 3.1, Mount

Pairwise Sequence Alignment Websites
• Bayes block aligner
  http://www.wadsworth.org/res&res/bioinfo
  Zhu et al. (1998)
• BCM Search Launcher: Pairwise sequence alignment
  http://searchlauncher.bcm.tmc.edu/seq-search/alignment.html
• SIM: local similarity program for finding alternative alignments
  http://www.expasy.ch/tools/sim.html
  Huang et al. (1990); Huang and Miller (1991); Pearson and Miller (1992)
• Global alignment programs (GAP, NAP)
  http://genome.cs.mtu.edu/align/align.html
  Huang (1994)
• FASTA program suite
  http://fasta.bioch.virginia.edu/fasta/fasta_list.html
  Pearson and Miller (1992); Pearson (1996)
• BLAST 2 sequence alignment (BLASTN, BLASTP)
  http://www.ncbi.nlm.nih.gov/gorf/bl2.html
  Altschul et al. (1990)
• LALIGN
  http://www.ch.embnet.org/software/LALIGN_form.html
  Huang and Miller, published in Adv. Appl. Math. (1991) 12:337-357
• Likelihood-weighted sequence alignment (lwa)
  http://stateslab.bioinformatics.med.umich.edu/service/lwa.html
Table 3.1, Mount

What is multiple sequence alignment?
• Multiple sequence alignment is the alignment of more than two nucleotide or protein sequences
• Compare pairwise sequence alignment with multiple sequence alignment

Issues with multiple sequence alignment
• Try creating a multiple sequence alignment of the three words:
  – Insurrection
  – Incision
  – Instruction

Issues with multiple sequence alignment
• What’s the right answer?
    in----cision    in----cision    inci----sion    in-----cision
    insurrection    insurrection    insurrection    insurrec-tion
    instr-uction    ins-truction    ins-truction    instr-uc-tion
• Computational complexity
• What is a reasonable method for obtaining a cumulative score?
• Placement and scoring of gaps

Pairwise sequence alignment: LALIGN of OVCA2 and DYR_SCHPO (global)

./wwwt MAAQRPLRVLCLAGFRQSERGFREKTGALRKALRGRAELVCLSGPHPVPDPPGPEGARSD
dihydr MS--KPLKVLCLHGWIQSGPVFSKKMGSVQKYLSKYAELHFPTGPVVADEEADPNDEEEK

./wwwt FGSCPPEEQPRGWWFSEQEADVFSALEEPAVCRGLEESLGMVAQALNRLGPFDGLLGFSQ
dihydr KRLAALGGEQNGGKFGWFEVEDFKN-----TYGSWDESLECINQYMQEKGPFDGLIGFSQ

./wwwt GAALAALVCALGQAGDPRFPL---P--RFILLVSSFCPRGIGFKESILQRPLSLPSLHVF
dihydr GAGIGAMLAQMLQPGQPPNPYVQHPPFKFVVFVGGFRAEKPEF-DHFYNPKLTTPSLHIA

./wwwt GDTDKVIPSQESVQLASQFPGAITLTHSGGHFIPA-------------AAP--------
dihydr GTSDTLVPLARSKQLVERCENAHVLLHPGQHIVPQQAVYKTGIRDFMFSAPTKEPTKHPR

19.2% sequence identity; score -413

Multiple sequence alignment

What is multiple sequence alignment used for?
• Consensus sequences: which residues can be used to identify other members of the family?
• Gene and protein families: which residues are functionally important; functional families
• Relationships and phylogenies: contains evolutionary “history” of sequences
• Data underlying some protein structure prediction algorithms
• Genome sequencing: sequence random, overlapping fragments; automation of assembly (in this case, there is a RIGHT answer)

Consensus sequences and important functional residues
Baxter, et al, Mol Cell Prot 2003

Relationships and phylogenies
• Serine-threonine protein phosphatases
• Same biochemical function
• Clustering clearly shows PP1, PP2a and PP2B families
• What is different about these families?
Fetrow, Siew, Skolnick, FASEB J, 1999

Possible redox site in PP1 family
Only a clustering, not a true phylogenetic tree

Methods to solve computational complexity
• Progressive global alignment
• Iterative methods
• Alignments based on locally conserved patterns
• Statistical methods and probabilistic models

Multiple Sequence Alignment: Global
• CLUSTALW or CLUSTALX (latter has graphical interface)
  FTP to ftp.ebi.ac.uk/pub/software
  Thompson et al. (1994a, 1997); Higgins et al. (1996)
• MSA
  http://www.psc.edu/
  http://www.ibc.wustl.edu/ibc/msa.html
  FTP to fastlink.nih.gov/pub/msa
  Lipman et al. (1989); Gupta et al. (1995)
• PRALINE
  http://mathbio.nimr.mrc.ac.uk/~jhering/praline/
  Heringa (1999)
Table 4.1, Mount

Multiple Sequence Alignment: Iterative
• DIALIGN segment alignment
  http://www.gsf.de/biodv/dialign.html
  Morgenstern et al.
(1996)
• MultAlin
  http://protein.toulouse.inra.fr/multalin.html
  Corpet (1988)
• Parallel PRRN progressive global alignment
  http://prrn.ims.u-tokyo.ac.jp/
  Gotoh (1996)
• SAGA genetic algorithm
  http://igs-server.cnrs-mrs.fr/~cnotred/Projects_home_page/saga_home_page.html
  Notredame and Higgins (1996)
Table 4.1, Mount

Multiple Sequence Alignment: Local
• Aligned Segment Statistical Evaluation Tool (Asset)
  FTP to ncbi.nlm.nih.gov/pub/neuwald/asset
  Neuwald and Green (1994)
• BLOCKS Web site
  http://blocks.fhcrc.org/blocks/
  Henikoff and Henikoff (1991, 1992)
• eMOTIF Web server
  http://dna.Stanford.EDU/emotif/
  Nevill-Manning et al. (1998)
• GIBBS, the Gibbs sampler statistical method
  FTP to ncbi.nlm.nih.gov/pub/neuwald/gibbs9_95/
  Lawrence et al. (1993); Liu et al. (1995); Neuwald et al. (1995)
• HMMER hidden Markov model software
  http://hmmer.wustl.edu/
  Eddy (1998)
• MACAW, a workbench for multiple alignment construction and analysis
  FTP to ncbi.nlm.nih.gov/pub/macaw/
  Schuler et al. (1991)
• MEME Web site, expectation maximization method
  http://meme.sdsc.edu/meme/website/
  Bailey and Elkan (1995); Grundy et al. (1996, 1997); Bailey and Gribskov (1998)
• Profile analysis at UCSD
  http://www.sdsc.edu/projects/profile/
  Gribskov and Veretnik (1996)
• SAM hidden Markov model Web site
  http://www.cse.ucsc.edu/research/compbio/sam.html
  Krogh et al.
(1994); Hughey and Krogh (1996)
Table 4.1, Mount

Methods to solve computational complexity
• Progressive global alignment
  – Start with most related sequences
  – Problem is that errors in the initial alignments are propagated
• Iterative methods
  – Iterative alignment of subgroups of sequences to find “best”; then align subgroups
• Alignments based on locally conserved patterns
  – Block analysis
• Statistical methods and probabilistic models
  – Expectation maximization; Gibbs sampler; hidden Markov models

Profile Methods
• Perform a global multiple sequence alignment on a group of sequences
• Extract more highly conserved regions
• Profile = scoring matrix for these highly conserved regions
• Used to search unknown sequences for membership in the family
Figures 4.11 (p. 162) and 4.12 (p. 166-167)

Limitations of such profiles
• Limited by sequences in original msa:
  – Sequence bias (too many of one type of sequence)
  – Sequences in msa not representative of entire family

Blocks
• Blocks are conserved regions of msa (like profiles) but no gaps allowed
• Servers for producing Blocks:
  – Blocks server
  – eMotif server
• Block libraries for database searching
  – Blocks (Henikoff and Henikoff)
  – Prosite (Bairoch)
  – Prints (Attwood)

Blocks that might be extracted from an msa
Baxter, et al, Mol Cell Prot 2003

Database searching
• Identify a new sequence by experimental methods: what is it?
• Search databases to find similar sequences
• If “enough similarity”, can say that function of new sequence is same as known sequence: function annotation transfer
• What is “enough similarity”?
• What is “function”?
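
The profile idea described in the slides above (a scoring matrix built from conserved, ungapped columns, then used to test whether a new sequence belongs to the family) can be sketched in a few lines. This is only an illustration under stated assumptions: the four-residue block, the uniform background frequency, the pseudocount of 1, and the names `build_profile` and `best_hit` are all hypothetical and are not taken from the course material or from any of the servers listed.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(block, pseudocount=1.0):
    """Turn an ungapped block into one {residue: log-odds score} dict per column."""
    width = len(block[0])
    if any(len(seq) != width for seq in block):
        raise ValueError("blocks contain no gaps, so all rows must have equal length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background frequencies
    total = len(block) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for col in range(width):
        counts = Counter(seq[col] for seq in block)
        profile.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return profile


def best_hit(profile, sequence):
    """Slide the profile along the sequence; return (best score, start index)."""
    width = len(profile)
    windows = range(len(sequence) - width + 1)
    return max(
        (sum(profile[i][sequence[start + i]] for i in range(width)), start)
        for start in windows
    )


# Hypothetical toy block (not real family data).
block = ["CPYC", "CPFC", "CGYC", "CPHC"]
profile = build_profile(block)
score, start = best_hit(profile, "MKVACPYCGSA")
print(f"best window starts at {start} with score {score:.2f}")
```

Because the scores come only from the sequences in the block, the sketch also shows the limitation noted above: a biased or unrepresentative block yields a biased profile.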
Chapter 7, Mount

Relationships between family members
• Sequence relationships between family members
• Not all members of family have significant sequence similarity to all others
• Can be represented by nodes and edges of a graph
[Figure: graph of family members with nodes Z, F, E, A, D, C, B]

Beware of issues with function annotation transfer
• Multiple domains
• High sequence identity, but functional residues not conserved
• Sequence repeats (low complexity regions)
[Figure: known serine hydrolase compared with a new sequence; labels include Function A, New Function B, and residues S, H, D, L]

Methods for database searching
• Sequence similarity with query sequence: FASTA, BLAST (Fig 7.5, p. 305)
• Profile search: ProfileSearch
• Position-specific scoring matrix: MAST
• Iterative alignment (combination of sequence searching and profile search): PSI-BLAST
• Patterns: Prosite, Blocks, Prints, CDD/Impala
Table 7.1, Mount

The problem with speed
• Dynamic programming
  – Guaranteed to find optimal answer
  – Too slow (number of searches performed and number of sequences in databases that are searched): Smith-Waterman dynamic programming algorithm is 50X slower than BLAST or FASTA
  – Faster hardware has made this problem feasible
• Heuristic methods
  – FASTA: searches for short, common patterns in query and database sequences
  – BLAST: similar, but searches for rarer, more significant patterns

Searches on DNA vs Protein Sequences
• 20-letter alphabet vs 4-letter alphabet
• Fivefold larger variety of sequence characters in proteins: easier to detect patterns
• Searches with DNA sequences produce fewer significant matches
• What if you don’t know the reading frame?
• Sometimes must do nucleic acid searches (searching for similarities in non-coding regions)

Sensitivity vs selectivity
• Sensitivity: method’s ability to find most members of the protein family
• Selectivity: method’s ability to distinguish true members from non-members
• Want a method to have high sensitivity (get all true positives) and high selectivity (not get false positives)
• Can be a difficult test with biological data sets: not all true positives are known

Scoring matrices commonly used
• PAM250: point accepted mutation; Dayhoff, M., Schwartz, R. M., and Orcutt, B. C., Atlas of Protein Sequence and Structure (1978) 5(3):345
• BLOSUM62: blocks amino acid substitution matrices; Henikoff and Henikoff, Amino acid substitution matrices from protein blocks. (1992) Proc. Natl. Acad. Sci. USA 89:10915-10919.

PAM250
  – Calculated for families of related proteins (>85% identity)
  – 1 PAM is the amount of evolutionary change that yields, on average, one substitution in 100 amino acid residues
  – A positive score signifies a common replacement whereas a negative score signifies an unlikely replacement
  – PAM250 matrix assumes/is optimized for sequences separated by 250 PAM, i.e. 250 substitutions in 100 amino acids (longer evolutionary time)

BLOSUM62
• BLOSUM matrices are based on local alignments (“blocks” or conserved amino acid patterns)
• BLOSUM 62 is a matrix calculated from comparisons of sequences with no more than 62% identity (more similar sequences are clustered before substitutions are counted)
• All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins
• BLOSUM 62 is the default matrix in BLAST 2.0

Comparison of PAM250 and BLOSUM62
Less divergent                      More divergent
BLOSUM80        BLOSUM62        BLOSUM45
PAM1            PAM120          PAM250
The relationship between BLOSUM and PAM substitution matrices.
BLOSUM matrices with higher numbers and PAM matrices with low numbers are both designed for comparisons of closely related sequences. BLOSUM matrices with low numbers and PAM matrices with high numbers are designed for comparisons of distantly related proteins. If distant relatives of the query sequence are specifically being sought, the matrix can be tailored to that type of search.

Scoring matrices commonly used
• PAM250
  – Represents a period of time during which only about 20% of amino acids will remain unchanged
  – Shown to be appropriate for searching for sequences of 17-27% identity
• BLOSUM62
  – Matrix calculated from comparisons of sequences with no more than 62% identity
  – Though it is tailored for comparisons of moderately distant proteins, it performs well in detecting closer relationships
• BLOSUM50
  – Shown to be better for FASTA searches

Methods for database sequence searching
• Sequence similarity with query sequence: FASTA, BLAST
• Profile search: ProfileSearch
• Position-specific scoring matrix: MAST
• Iterative alignment (combination of sequence searching and profile search): PSI-BLAST
• Patterns: Prosite, PFAM, CDD/Impala

Review of protein structure
• Primary structure: sequence of amino acids
• Secondary structure: local segments of protein structure
• Tertiary structure: three-dimensional structure of a single protein chain
• Quaternary structure: packing of 2 or more protein chains

Classification of protein tertiary structure
• All alpha proteins
• All beta proteins
• Alpha+beta proteins
• Alpha/beta proteins
• Irregular proteins
Classify these proteins: T-cell protein CD8 (1cd8), myoglobin, triose phosphate isomerase, G-specific endonuclease (1rnb)

Representations of protein structures
• All atom
• CPK models
• Cartoons (ribbons, etc)
• Topology diagrams

Protein structure databases
• RCSB (PDB): http://www.rcsb.org/pdb
  – General repository for all protein coordinate files
• MMDB: http://www.ncbi.nlm.nih.gov/Structure
  – NCBI structure database; structures from pdb
  – Links to sequence and genome databases
• BioMagResBank: http://www.bmrb.wisc.edu/
  – General repository for NMR structure data

Alignment of protein structure
• Superposition of protein 3D structures
• Used in searching for structural similarity and grouping proteins into “fold families”
• Structural similarity is common and does not necessarily indicate an evolutionary relationship (different from sequence similarity)

Structure Alignment: A difficult problem
• Alignment of atom positions in 3D space
• Pieces of proteins may align
  – What is significant and what is not? (Is alignment of two helices significant?)
• Alignment of topology or secondary structure packing gives different answers
Easy example (Eidhammer and Jonassen)
More difficult examples: http://www.sbg.bio.ic.ac.uk/people/rob/sf/sf.html

Structure alignment used to classify (group) protein structures
• SCOP (Structural Classification Of Proteins; http://scop.mrc-lmb.cam.ac.uk/scop/)
  – Class (all alpha, all beta, alpha+beta, alpha/beta), family, superfamily, fold
  – Reflects structural and evolutionary relationships
  – Mostly done by “hand” (expert analysis)
• CATH (classification by class, architecture, topology and homology; http://www.biochem.ucl.ac.uk/bsm/cath)
  – Class (all alpha, all beta, alpha+/beta), architecture, fold, superfamily, family
  – Uses SSAP structure alignment program
• FSSP (fold classification based on structure-structure protein alignment; http://www.bioinfo.biocenter.helsinki.fi:8080/dali/index.html)
  – Based on pairwise alignment of all non-redundant proteins in PDB
  – Divides proteins into structures and domains: represents unique configuration of secondary structure elements
  – Uses Dali structure alignment program
• MMDB (molecular modeling database; http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure)
  – Proteins classified into structurally related groups by VAST, based on arrangements of secondary structures
  – Groupings of all PDB structures
• SARF (spatial arrangement of backbone fragments; http://123d.ncifcrf.gov/)

Web sites for structure alignment
• VAST: http://www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml
  – NCBI structure comparison
  – Comparison of orientations of secondary structures (vector representation of secondary structures)
  – Approach from graph theory
• Dali: http://www.ebi.ac.uk/dali/
  – FSSP structure comparison
  – Protein represented as distance matrix between alpha carbons
  – Monte Carlo simulation to do random search for sub-distance-matrices
• SSAP: http://www.biochem.ucl.ac.uk/cgi-bin/cath/GetSsapRasmol.pl
  – CATH structure comparison
  – Set structure environment for each residue, then align residue by residue using double dynamic programming
  – Structure environment can use beta carbon vectors or phi/psi backbone dihedral angles
• Others: Lots, such as Structal (Gerstein and Levitt); Minarea (Falicov and Cohen); Lock (Singh and Brutlag)

Protein Structure Prediction
• Goal is to understand the relationship between the primary amino acid sequence and the structure of the protein
• Relationship between sequence and structure is not simple and is not understood
• “Protein folding problem” remains unsolved

Protein Structure Prediction
• Secondary structure prediction: unsolved?
• Tertiary structure prediction: unsolved problem (CASP competition)
• Quaternary structure prediction: unsolved problem
  – “Docking” of two subunits

Secondary structure prediction
• Prediction of three classes of secondary structure: helix, strand, “coil”
  – Solved problem? 70-80% “correct predictions”
  – Methods (web sites) can give very different answers
• Prediction of non-regular secondary structure (loops and turns) not as successful

Secondary structure prediction
• Method development
  – Frequencies of the types of residues found in each secondary structure
  – Frequencies calculated from database of known structures (training set)
• Method evaluation
  – Test method on proteins whose structures are known (testing set)
• Training and testing sets must not be the same

Secondary structure prediction methods and references
Method types: single residue statistics; explicit rules; nearest neighbors; neural networks; hidden Markov models
• 1st generation: Chou/Fasman (’74); GOR I; Lim (’74)
• 2nd generation: GOR III (’87); Predator (’96); Levin (’86); Nishikawa and Ooi (’86); Yi and Lander (’93); Qian and Sejnowski (’88); Holley and Karplus (’89); Asai/Handa (’93)
• 3rd generation: GOR IV; DSC (Prof) (’96); NNssp (’95); PHD (’93); Jnet (’99); PsiPred (’99); PASSML (’98)
See Table 9.7, Mount, for list of servers

GOR IV secondary structure prediction
• Three state prediction: helix, strand, loop
• Statistics of pair frequencies observed within a window of 17 amino acid residues
• Based on information theory: sound statistical basis and no ad hoc rules
• Mean accuracy of 64.4% for a three state prediction (Q3)
Garnier, Gibrat, Robson; http://abs.cit.nih.gov/index.html

PHD secondary structure prediction
• Three state prediction: helix, strand, loop
• Predicts secondary structure from multiple sequence alignments
• Three consecutive neural networks (feed forward)
  – Raw 3-state prediction for each position, based on alignment composition in 13-residue window
  – Filter 3-state probabilities based on probabilities of flanking positions in 17-residue window
  – Jury network using several raw/filter combinations trained separately
• Expected average accuracy > 72% for three state prediction (Q3)
Rost and Sander; http://www.predictprotein.org

Method evaluation: how good is “good”?
• Testing of prediction methods involves
  – Applying the method to a set of proteins whose secondary structures are known experimentally and comparing prediction results to known results
  – Calculating measures of how good the performance is
• Q1 (h, s, or c)
  – (number of residues correctly predicted in one state/number of residues in that state) * 100
• Q3 (h, s, and c):
  – (number of residues correctly predicted in each of 3 states/number of all residues) * 100
• Matthews correlation coefficient (Cs)
  – (TpTn - FpFn) / sqrt[(Tp+Fp)(Tn+Fn)(Tp+Fn)(Tn+Fp)]
Num:  ....,....1....,....2....,....3....,....4....,....5....,....6
Res:  MSTKQHSAKDELLYLNKAVVFGSGAFGTALAMVLSKKCREVCVWHMNEEEVRLVNEKREN|
Actu: HHHHHHHHHHHH EE HHHHHHHHHHHHHHH EE HHHHHHHHHHHHHH|
Pred: HHHHHH EEEEE HHHHHHHHHHHH EEEEEE HHHHHHHH |
Pred: HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH|

Method evaluation: how good is “good”?
• Matthews correlation coefficient (Cs)
  – (TpTn - FpFn) / sqrt[(Tp+Fp)(Tn+Fn)(Tp+Fn)(Tn+Fp)]
  – Where Tp, true positive predictions (method predicts helix, and residue is in a helix); Tn, true negative prediction (method predicts “not helix”, and residue is not in a helix); Fp, false positive prediction (method predicts helix, but residue is not in a helix); Fn, false negative prediction (method predicts “not helix”, but residue is in a helix)
Num:  ....,....1....,....2....,....3....,....4....,....5....,....6
Res:  MSTKQHSAKDELLYLNKAVVFGSGAFGTALAMVLSKKCREVCVWHMNEEEVRLVNEKREN|
Actu: HHHHHHHHHHHH EE HHHHHHHHHHHHHHH EE HHHHHHHHHHHHHH|
Pred: HHHHHH EEEEE HHHHHHHHHHHH EEEEEE HHHHHHHH |
Q1 (helix) = (4+12+8)/(12+15+14)*100 ≈ 58.5%
Q3 = (4+12+8+2+2)/60*100 ≈ 47%
Tp = 4+12+8 = 24; Tn = 9+8 = 17
Fp = 2; Fn = 8+1+2+5+1 = 17
Ch = [(24*17)-(2*17)]/sqrt[(24+2)(17+17)(24+17)(17+2)] ≈ 0.45

Tertiary Structure Prediction
• Homology modeling: identifiable sequence similarity
• Fold recognition (“threading”; Table 9.8 for server list)
• “Ab initio” methods

Homology modeling
• Sequence alignment
• Side chain modeling
• Modeling insertions and deletions
• Optimizing the model
• Model evaluation
• Repeat?

Fold Recognition (“threading”)
• Template identification/sequence alignment/alignment optimization
• Side chain modeling
• Modeling insertions and deletions
• Optimizing the model
• Model evaluation
• Repeat?
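
The evaluation measures from the "how good is good" slides (Q1, Q3, and the Matthews correlation coefficient) are simple enough to check by hand or with a short script. The sketch below plugs in the counts from the worked helix example (Tp=24, Tn=17, Fp=2, Fn=17); the function names are illustrative choices, and returning 0 when the denominator is zero is an assumed convention rather than something stated in the slides.

```python
import math


def q_score(correct, total):
    """Percentage of residues predicted correctly (used for both Q1 and Q3)."""
    return 100.0 * correct / total


def matthews_correlation(tp, tn, fp, fn):
    """(TpTn - FpFn) / sqrt[(Tp+Fp)(Tn+Fn)(Tp+Fn)(Tn+Fp)], as on the slide."""
    denominator = math.sqrt((tp + fp) * (tn + fn) * (tp + fn) * (tn + fp))
    if denominator == 0:
        return 0.0  # assumed convention when a class is never predicted or observed
    return (tp * tn - fp * fn) / denominator


# Counts copied from the worked example in the slides.
q1_helix = q_score(4 + 12 + 8, 12 + 15 + 14)
q3 = q_score(4 + 12 + 8 + 2 + 2, 60)
c_helix = matthews_correlation(tp=24, tn=17, fp=2, fn=17)

print(f"Q1(helix) = {q1_helix:.1f}%  Q3 = {q3:.1f}%  Ch = {c_helix:.2f}")
```

Unlike Q1, the correlation coefficient penalizes over-prediction: a method that calls every residue helix gets a perfect Q1 for helix but no true negatives, so its Ch does not benefit.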

Ab initio methods: folding “from scratch”
• Start with unfolded protein or random conformation
• Use atomic-level forces, solve energetic equations
• Identify most stable conformation (lowest free energy)
• Computational demands high: for protein of 100 amino acids
  – Assume constant bond lengths and angles
  – Allow 2 of the 3 backbone torsion angles per amino acid to rotate
  – Do not allow side chain torsion angles to move
  – Assuming 10 allowed conformations per residue, must explore 10^100 conformations
  – Calculation of 10^100 energies (one for each conformation) is not possible

Ab initio methods: simplifications
• Lattice models to simplify the conformational search space
• Monte Carlo statistical sampling of conformational space
• Stepwise processes:
  – Predict regular secondary structures
  – Pack secondary structures to form tertiary structures
• Others…

Review of Definitions
• Cell: fundamental working unit of biology
• DNA: encodes all information to create cells and allow them to function
  – Linear arrangement of bases (AGTC)
• Genome: organism’s complete set of DNA
• Chromosome: physically distinct molecules of DNA
  – Genomes can be composed of 1, 2 or more chromosomes
• Gene: basic physical and functional unit of heredity
  – Linear arrangement of bases along the chromosome
  – Contain instructions for encoding protein
  – (Remember genetic code?)
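
As a reminder of how the genetic code turns a gene's bases into protein, here is a minimal translation sketch. It assumes the standard genetic code, written in the conventional compact form (first, second, and third base each cycling through T, C, A, G), and a hypothetical helper named `translate`; it stops at the first stop codon and does not handle ambiguous bases such as N.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, frame=0):
    """Translate a DNA coding strand in the given frame (0, 1, or 2) up to the first stop."""
    dna = dna.upper()
    protein = []
    for i in range(frame, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical toy sequence: ATG GCC AAG TAA
print(translate("ATGGCCAAGTAA"))
# When the reading frame is unknown, all three forward frames can be tried.
print([translate("CATGGCCAAGTAA", frame=f) for f in range(3)])
```

Trying each frame in turn is one simple answer to the reading-frame question raised in the DNA vs protein search slide; a full search would also translate the reverse complement.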

Genomes and proteomes
• Genome: Sum of all genes and intergenic DNA sequences in a cell
  – the smallest known genome for a free-living organism (a bacterium) contains about 600,000 DNA base pairs
  – human and mouse genomes have about 3 billion
  – “relatively” unchanging from cell to cell
• Proteome: The entire set of proteins encoded in the genome of an organism and produced by that organism
  – Constellation of proteins in cells is highly dynamic

The Human Genome
• 24 distinct chromosomes (22 autosomes plus X and Y)
• Chromosomes range in size from 50 million to 250 million base pairs
• Total size of the human genome is over 3 billion base pairs (3.1647 billion)
  – 99.9% of all bases are the same in all people
• Genes comprise only 2% of the total genome
  – Human genome is estimated to contain 30,000 to 40,000 genes
  – Average gene size is about 3000 bases
  – Largest identified so far is 2.4 million bases (dystrophin)
  – Functions for less than 50% of genes and gene products are known
• Remainder of genome is non-coding regions
  – Chromosomal structural integrity
  – Repetitive sequences
  – Regulation of protein production
  – Other functions that we don’t know about

Human Genome Sequencing Project Goals
• Determine the sequences of the 3 billion chemical base pairs that make up human DNA
• Identify all the approximately 30,000 genes in human DNA
• Store this information in databases
• Improve tools for data analysis
• Transfer related technologies to the private sector
• Address the ethical, legal, and social issues that may arise from the project
Human Genome Project (DOE): http://www.ornl.gov/sci/techresources/Human_Genome/home.shtml
NIH: http://www.ncbi.nlm.nih.gov/genome/guide/human/

Other sequencing projects
• Over 200 genomes sequenced
• Range of archaeal, bacterial, and eukaryotic genomes
  – Organisms that have been well-studied in the laboratory
  – Organisms that are pathogenic to humans
  – Organisms of special scientific or technical interest
NCBI list of sequenced genomes (NIH): http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome

Prokaryotes and eukaryotes
• Prokaryotes (bacteria and archaea)
  – No true nucleus
  – DNA generally circular (one chromosome)
• Eukaryotes
  – True nucleus contains (most) DNA
  – DNA linear and arranged in chromosomes
Phylogenetic analysis of small subunit ribosomal RNAs, C. Woese, 1987

Anatomy of a prokaryotic genome
• DNA compact and circular
• ORFs (open reading frames) with start and stop codons
• No introns

Anatomy of a eukaryotic genome
• Linear DNA; chromosomes
• Centromeres
• Telomeres
• Tandem repeats
• Transposable elements
• Introns
• Pseudogenes
Example of chromosome maps: http://www.ncbi.nlm.nih.gov/genome/guide/human/

DNA sequencing AGCT
• Separate strands of DNA
• Anneal primer to one strand
• Replicate using fluorescently labeled ddNTPs (as opposed to normal dNTPs)
• Separate fragments by size
• Image gel for fluorescent labels
See also electropherogram, Fig. 2.2, Mount

Methods of genome sequencing
• Mapping method
  – Fragment chromosome
  – Identify markers and order them
  – Arrange fragments, then sequence
• Shotgun method
  – Fragment chromosome
  – Sequence fragments, then arrange
• cDNA sequencing (ESTs)
  – Isolate mRNA (expressed in cell)
  – Reverse transcribe mRNA to create cDNA
  – Sequence cDNA

Maps
• Gene map
• Chromosome map
• Sequence map
• Maps important for obtaining sequence information (mapping method)
  – Restriction map
  – Contig (contiguous clone) map
NCBI map viewer: http://www.ncbi.nlm.nih.gov/mapview/

Prediction of genes
• Method
• Difference between prokaryotes and eukaryotes
• Tests for validation of predictions
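For the prokaryotic case, where genes have no introns, a first pass at gene prediction can be sketched as a scan for ORFs with start and stop codons in all six reading frames. The Python below is only an illustration under stated assumptions: ATG is the sole start codon, the stop codons are TAA, TAG and TGA, and the names find_orfs, reverse_complement and min_codons are hypothetical. It is not the method of any particular gene-finding program, and eukaryotic genes with introns need more than this.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"  # assumption: alternative start codons are ignored


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq, min_codons=3):
    """Scan both strands in three frames each.

    Returns (strand, start, end, orf) tuples. Coordinates are 0-based,
    end-exclusive, and refer to the strand that was scanned (for "-",
    positions on the reverse complement). The length check counts the
    stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == START_CODON:
                    start = i                      # open an ORF at the first ATG
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3, s[start:i + 3]))
                    start = None                   # close it at the in-frame stop
    return orfs


# Hypothetical 16-base sequence containing one short ORF on the forward strand.
print(find_orfs("CCATGAAATTTTAGGG"))
```

In practice a much larger min_codons would be used, since short ORFs occur frequently by chance; candidate ORFs then need validation, as the slide's last bullet notes.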

Genome Analysis
• General approach (p. 492)
• Comparative genomics
  – Self-comparison reveals gene families and duplication
  – Between-genome comparison reveals orthologs, gene families and domains
  – Gene ordering on chromosomes
• Phylogenetic analysis
• Genetic diversity