Bioinformatics Introduction

Acknowledgments
• These slides and exercises were prepared as Bioinformatics Teaching Modules developed by Elizabeth Murray, Ph.D. and Andrew Rieser, Integrated Science and Technology, Marshall University.
• Development of the slide shows and exercises was funded in part by the National Center for Research Resources (NCRR) of the NIH Grant #P20 RR16477.

Acknowledgments
• These slides have been inspired by many sources available on the Internet. We have tried to acknowledge the contributions of others and the source of the images in the notes.
• If we have overlooked an acknowledgment, please let us know and we will correct this.
• We are basing some of the examples on the excellent new text “Bioinformatics and Functional Genomics” by Jonathan Pevsner (Wiley-Liss, 2004).

Computational Biology vs. Bioinformatics
What is Computational Biology?
The development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems.
What is Bioinformatics?
Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.
- NIH Biomedical Information Science and Technology Initiative Consortium

What is Informatics?
• The term informatics is widely used in both health care and computer science.
• Computer specialists use the term informatics for computer hardware, software, and information theory.
• Medical informatics includes all data management in a hospital, from patient records, billing, and images to the medical literature.

A Good Definition
• Bioinformatics is the use of computers for the acquisition, management, and analysis of biological information.
• It incorporates elements of molecular biology, computational biology, database computing, and the Internet.
• The key element of the definition is information management.

What Kinds of Information?
• Bioinformatics deals with any type of data that is of interest to biologists:
  – DNA and protein sequences
  – Gene expression (microarray)
  – Articles from the literature and databases of citations
  – Images of microarrays or 2-D protein gels
  – Raw data collected from any type of field or laboratory experiment
  – Software
• The analysis of DNA sequence data dominates the field of bioinformatics, but the term can be used to describe any type of biological data that can be recorded as numbers or images and handled by computers.

Who Works in Bioinformatics?
• Bioinformatics is clearly a multidisciplinary field including:
  – computer systems management
  – networking, database design
  – computer programming
  – molecular biology

How to Get a Job in Bioinformatics
• Few scientists describe themselves as specialists in bioinformatics.
• It is difficult to train people to specialize in this field, since using computer tools to analyze data requires different skills than designing those tools.
• Other specialists create the mathematical algorithms used to build the tools.
• Strong knowledge of molecular biology is also needed to frame meaningful questions and problems for software development and analysis.

Every Molecular Biologist must be “Bioinformatics Literate”
• Most biologists are “users” not “developers” of software and algorithms.
• This series of presentations and exercises is intended to help you be a knowledgeable user of software packages and be able to frame interesting questions and interpret your results.

A Good Day Using Bioinformatics
• A scientist studying a model organism, Arabidopsis, finds a T-DNA insertional mutation in a gene they are studying.
• They use the T-DNA sequence as a probe to hybridize to a genomic DNA library and identify a clone of the genomic DNA. Now they can determine the sequence of the gene they mutated with T-DNA.

A Good Day Using Bioinformatics
• The scientist enters the sequence into a search tool (BLAST, FASTA) and compares their DNA sequence with all the DNA sequences in all the databases.
• The scientist finds a group of sequences related to the gene with the T-DNA insertion.
• BUT the scientist doesn’t know anything about those related genes.

A Good Day Using Bioinformatics
• The scientist can:
  – Search publications on the related gene to determine the gene function in other organisms (or even their own organism).
  – Look at the structure of the domains in related genes to analyze function.
  – Compare sequences with those of other organisms to develop “trees” of sequence relationships.
  – Analyze the promoter sequence of the gene to see what transcription factor binding sites are there.
  – Analyze expression data if the gene was included on microarrays.
• All without lifting a pipette or thawing a tube!

A Good Day Using Bioinformatics
• Now the scientist can use bioinformatics to guide their next experiments in the lab.
  – Design PCR primers to amplify the DNA.
  – Search the sequence for restriction enzyme cut sites for cloning the DNA for additional experiments.
  – Test hypotheses about structure and function of the protein suggested by the sequence similarities. If the protein looks like it has a kinase domain, clone, express and purify it to see if it actually is a kinase!

Introduction to Molecular Genetics
• Using these slides requires some familiarity with the principles of molecular biology and genetics.
• If you are from a mathematics or computer science background, the information in these slides may be too jargon-filled and detailed for you.
• There are many excellent resources on the Internet to help you learn some of the basic terminology of molecular biology.

Excellent Introductory Resources
• The US Department of Energy has created a useful Primer on Molecular Genetics.
  – http://www.ornl.gov/sci/techresources/Human_Genome/publicat/primer/toc.html
• On Line Biology Textbook
  – http://www.emc.maricopa.edu/faculty/farabee/BIOBK/BioBookTOC.html
• NCBI’s Science Primer
  – http://www.ncbi.nlm.nih.gov/About/primer/index.html

The Challenges of Molecular Biology Computing
• The big dataset problem
• DNA sequencing
• Pairwise and Multiple Alignments
• Similarity searching the databanks
• Structure-function relationships: Can sequence patterns predict function?
• Phylogenetic analysis: Sequence conservation across evolution
• Genomics

The Big Dataset Problem
• Biologists have been very successful in finding the sequences of DNA and protein molecules:
  – Automated DNA sequencers
  – The Human Genome Project
  – High throughput sequencing of cDNAs (ESTs)
• Information scientists have to develop tools to keep up with the data.

The Big Dataset Problem
• Information is being collected, organized, and made available in databases:
  – GenBank is the central sequence information database in the United States.
  – Data is shared between GenBank, the European Molecular Biology Laboratory (EMBL) and the DNA Database of Japan (DDBJ).
  – All sequence data submitted to any of these databases is automatically integrated into the others.
  – Sequence data is also incorporated from the Genome Sequence Data Base (GSDB) and from patent applications.

The Big Dataset Problem
• These presentations will familiarize students with these databases and their organization.
• Students will learn to enter data into the databases and search for and download data from the databases.
• Students will learn to use some of the additional bioinformatics tools used to organize the databases (LocusLink, COGs, OMIM, SNP, UniGene and others).

DNA Sequencing
• One technician with an automated DNA sequencer can produce over 20 KB of raw sequence data per day.
• The real challenge of DNA sequencing is in the analysis of the data.
• DNA sequence reads of ~500 base pairs must be assembled into complete genes and chromosomes.
• These 500 bp reads contain errors, both incorrect bases and insertions/deletions.

DNA Sequencing
• These presentations allow students to become familiar with different strategies for genome sequencing projects.
• Students will learn to analyze electronic DNA sequence files and to use the Accelrys Wisconsin Package Software to assemble DNA sequences from such projects.

Pairwise and Multiple Alignments
• Pairwise alignment is the basis of similarity searching.
• Pairwise alignment has been "solved" as a computational problem through dynamic programming.
• However, the "optimal" alignment calculated by the computer may not be the best representation of the true biological alignment.

Pairwise and Multiple Alignments
• Multiple Alignment is the basis for the analysis of protein families and functional domains.
• When pairwise alignment is expanded to compare multiple sequences, it becomes a computationally huge problem.
• To reduce the nearly infinite permutations, a simplified heuristic (approximate) algorithm known as progressive pairwise alignment is used.
• Since this problem is so complex, it is not possible to mathematically define a truly optimal alignment of multiple sequences.

Pairwise and Multiple Alignments
• These presentations will explain the dynamic programming algorithm and its application.
• These presentations will allow the student to distinguish between global and local alignment algorithms and apply them appropriately.
• Students will learn the significance of the Needleman/Wunsch and Smith/Waterman algorithms and their application.
• These slides will allow the student to understand the role of scoring matrices (PAM and BLOSUM) and gaps in sequence alignment.
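
To make the dynamic programming idea above concrete, here is a minimal Python sketch of global (Needleman/Wunsch-style) pairwise alignment. It is an illustration only, not the GCG Gap program: the function name `needleman_wunsch` and the scoring values (match +1, mismatch -1, linear gap penalty -2) are assumptions chosen for simplicity, whereas real tools use scoring matrices such as PAM or BLOSUM and usually separate gap-opening and gap-extension penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align strings a and b.

    Returns (score, aligned_a, aligned_b). The scoring values are
    illustrative assumptions, not those of any particular package.
    """
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap          # b[:j] aligned against gaps only

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    s, top, bottom = needleman_wunsch("ACGT", "AGT")
    print(s)
    print(top)
    print(bottom)
```

A local (Smith/Waterman-style) variant would differ mainly in clamping each cell at zero and starting the traceback from the highest-scoring cell rather than the corner. Note that several alignments can share the optimal score; this sketch returns only one of them, which is one reason the computed "optimal" alignment need not be the biologically true one.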

Pairwise and Multiple Alignments
• These exercises will allow the student to use, display and interpret data generated by the Pairwise and Multiple alignment programs included in the GCG Wisconsin Package. These programs include:
• Pairwise Comparison
  – Gap; FrameAlign; Compare; DotPlot; GapShow; ProfileGap
• Multiple Comparison
  – PileUp; HmmerAlign; SeqLab®; PlotSimilarity; Pretty; PrettyBox; MEME; HmmerCalibrate; ProfileMake; ProfileGap; Overlap; NoOverlap; OldDistances.

Similarity Searching the Databanks
• "Are there any sequences in the databanks similar to my sequence?"
• Directly searching the databanks by comparing sequences takes too much computer time.
• The scientist uses timesaving heuristic tools: FASTA and BLAST.
• Meaningful interpretation relies on the informed judgment of the biologist and interpretation of the statistics.

Similarity Searching the Databanks
• Students will master the popular search tools BLAST and FASTA (in their many versions) used to search the databanks and learn to interpret the significance of the statistics and output from these programs.
• Students will learn additional sequence searching and retrieval programs within the GCG Wisconsin Package (FrameSearch; MotifSearch; ProfileMake; ProfileSegments; FindPatterns; Motifs; WordSearch; Segments; Fetch and NetFetch).

Structure-function relationships:
• Sequence patterns that predict function.
  – The prediction of the function of protein molecules from their sequence is one of the most challenging areas of computational molecular biology.
• Sequence determines 3-D structure, structure determines function.
  – Currently, we can’t predict a 3-D protein structure from amino acid sequence alone. The best current approach, known as "threading", is based on comparing sequence similarity to proteins of known structure.

Structure-function relationships:
• Can predict some aspects of 3-D structure from sequence:
  – α-helix vs. β-sheet
  – membrane spanning region
  – helix-turn-helix
  – signal peptide
• Identifying conserved regions (domains or motifs).
• Functions of these conserved domains are defined by laboratory research.
• Domain databases can be used to scan any unknown protein sequence for the presence of over a thousand known domains.

Structure-function relationships:
• Databases of important conserved elements within DNA sequences have been developed:
  – transcription factor binding sites
  – restriction enzyme recognition sites
• Some 3D RNA structures can also be predicted based strictly on sequence:
  – by sequence comparison with other known sequences (such as tRNA)
  – by simple detection of stem-loop structures as inverted repeats

Structure-function relationships:
• Students will learn to use PubMed and other literature databases to obtain on-line journal articles, abstracts and texts.
• Students will analyze proteins using software which identifies sequence motifs, predicts peptide properties, looks at secondary structure, hydrophobicity, and antigenicity, and identifies repeats and regions of low complexity.

Structure-function relationships:
• Students will analyze sequences to predict RNA or DNA structure. GCG Wisconsin Package programs include MFold; PlotFold; StemLoop.
• Students will use Gene Prediction software packages available on the internet, including Genefinder, Genscan and GrailII.

Structure-function relationships:
• Students will learn to view 3-D protein structures using Chime, Cn3D, Mage, RasMol and the Swiss 3D viewer, Spdbv.
• Students will design primer pairs using Oligo3, Prime, PrimePair and TempMelt.

Phylogenetic Analysis
• There are evolutionary assumptions underlying the science of molecular sequence analysis.
  – evolution = mutation of DNA sequences
  – two species that have genes that are similar in sequence are more closely related than are two species that have less sequence similarity.
• It is possible to collect sequence data from several different organisms, add up the differences, and estimate their relationships.

A Phylogenetic Tree

Phylogenetic Analysis
• There are many controversies and objections to such simplistic analyses.
  – Not all DNA sequences mutate at the same rate: protein coding regions mutate more slowly than non-coding regions.
  – Some positions in protein coding DNA sequences are more free to mutate than others.
  – Parsimony vs. maximum likelihood methods of measuring distance.

Phylogenetic Analysis
• Students will investigate the relationships within an aligned set of sequences through computation of the pairwise distance between sequences, construction of phylogenetic trees, or calculation of the degree of divergence of two protein coding regions.
• The student will be able to collect a set of related DNA sequences, calculate phylogenetic distances and create a tree using software programs in the GCG Wisconsin Package (PAUPSearch; PAUPDisplay; GrowTree; Diverge).

Genomics
• What is genomics?
  – An operational definition: The application of high throughput automated technologies to molecular biology.
  – A philosophical definition: A holistic or systems approach to the study of information flow within a cell.

Genomics
• Genomics Technologies include:
  – Automated DNA sequencing and annotation of sequences
  – Gene Finding and Pattern Recognition
  – DNA microarrays
    – gene expression (measuring RNA levels)
    – single nucleotide polymorphisms (SNPs)
  – Protein chips
  – Protein-protein interactions

Genomics
• The student will learn to use microarray software available from NCBI and Marshall University’s microarray facility to analyze gene expression data.
• The student will learn to use GCG Wisconsin Package programs designed for genome analysis, including TestCode, Codon Preferences, Frame, Repeat, FindPatterns, Composition, CodonFrequency, Window, StatPlot, Consensus, FitConsensus, Xnu and Seg.
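
As a rough illustration of the kind of statistics that composition and codon-usage programs report, the following Python sketch counts bases and in-frame codons in a DNA sequence. It is not the GCG Composition or CodonFrequency program; the function names `base_composition` and `codon_frequency` and the example sequence are hypothetical, chosen only for this sketch.

```python
from collections import Counter


def base_composition(seq):
    """Count how often each base occurs in a DNA sequence (case-insensitive)."""
    return Counter(seq.upper())


def codon_frequency(seq, frame=0):
    """Count complete codons read from the given frame offset (0, 1 or 2).

    Trailing bases that do not fill a whole codon are ignored.
    """
    seq = seq.upper()
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return Counter(codons)


if __name__ == "__main__":
    example = "ATGGCCATGTAA"  # made-up sequence for illustration
    print(base_composition(example))
    print(codon_frequency(example))
```

Comparing codon counts across the three frames of a sequence is one simple way to see why frame-dependent statistics can help distinguish coding from non-coding regions.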