Introductory Biological Sequence Analysis Through Spreadsheets
Stephen J. Merrill, Sandra E. Merrill
Marquette University, Milwaukee, WI
November 18, 2000, ICTCM 2000

Teaching Mathematics to Students of Biology
• Need to make the math in the courses correlate with the math needed in that discipline
• The most important “math” needed is statistics
• The molecular biology revolution presents data in a form on which calculus has little impact (sequences of letters)

The Nature of Biological Sequence Data
• The primary structures of DNA, RNA, and proteins are sequences of letters -- 4 letters in the case of DNA (ATGC) and RNA (AUGC), and 20 letters representing the sequence of amino acids that makes up a protein
• Secondary and tertiary structure (bending, folding, and twisting) determines function -- hints of it can be seen in the primary structure

Use of Spreadsheets in this Setting
• Commonly found and used in biological labs for data acquisition, storage and organization, and data analysis
• Commonly present on student computers and in computer labs
• Unlike calculators, able to handle data sets typical of “real world” applications
• R.F. Murphy at CMU has developed a set of worksheets for sequence analysis

Meaningful Questions & Problems
1. Measuring the similarity between two strings -- “alignment” or “homology”
2. Finding instances of a pattern in a string
3. Describing the composition and properties of a string
4. Graphing the evolutionary process and construction of phylogenetic trees

Measuring the Similarity between Strings
• Given a gene, suggest the function of the protein it codes for by finding a similar sequence (possibly in another species)
• Simple homology involves assigning a “1” for agreement and a “0” for disagreement at each site.
  Then sum over all sites
• Homology is the fraction of the highest possible score, in %

Spreadsheet #1: Simple Homology
Part of two 70-base sequences of yeast DNA

Site  Sequence 1  Sequence 2  Match
1     C           T           0
2     C           A           0
3     C           C           1
4     A           C           0
5     C           C           1
6     A           G           0
7     C           G           0
8     A           C           0
9     C           T           0
10    A           T           0
11    C           T           0
12    C           C           1
13    A           T           0

[Chart: Series1; x-axis 0 to 80, y-axis 0 to 1.2]

Spreadsheet #1 (cont.): comparing random sequences
Recording the results of many trials

Simresult alignment: 0.271429 (this is updated each time any cell is entered)

Trial alignment scores, in recorded order:
0.314286, 0.171429, 0.271429, 0.285714, 0.228571, 0.185714,
0.242857, 0.185714, 0.271429, 0.357143, 0.242857, 0.2

Histogram
Bin    Frequency
0      0
0.05   0
0.1    0
0.15   0
0.2    4
0.25   3
0.3    3
0.35   1
0.4    1
0.45   0
0.5    0
More   0

[Chart: Histogram; x-axis: Bin, 0 to 0.5; y-axis: Frequency, 0 to 5]

Finding Instances of a Particular Pattern in a String
• The process of locating genes involves locating regions of the DNA sequence that contain patterns resembling those of known genes
• Identifying sites on DNA where one of the restriction enzymes can cleave DNA -- also of interest is the size of the fragments that result
• Identifying regions of RNA which correspond to particular features (e.g. loops) which may be splice sites

Describing the Composition and Properties of a String
• Counts of frequencies of particular letters due to their properties (e.g. regions rich in G&C or A&T in DNA)
• Properties of proteins (e.g. charge or hydrophobicity) which depend on the nature and frequencies of the particular amino acids

Spreadsheet #2: Hydropathy Plot
Human IL-10 having 148 amino acids
Hydrophobic regions are yellow
Hydrophilic regions in blue

Spreadsheet #2 (cont.)
Kyte-Doolittle Chart
[Chart: Hydrophobicity Plot (series S1); x-axis: amino acid sequence number, 1 to 144; y-axis: -5 to 5]

Amino acid  Value
A            1.8
C            2.5
D           -3.5
E           -3.5
F            2.8
G           -0.4
H           -3.2
I            4.5
K           -3.9
L            3.8
M            1.9
N           -3.5
P           -1.6
Q           -3.5
R           -4.5
S           -0.8
T           -0.7
V            4.2
W           -0.9
Y           -1.3

Graphing Evolution and Phylogenetic Trees
• Evolutionary distance between two DNA sequences is used to determine the process of the changes in the sequences over time (e.g. the evolution of HIV or the flu viruses)
• Trees are constructed to express the relationship between related sequences -- distance in the tree is a monotone function of homology

Spreadsheet #3: Mutation & Evolution
[Chart: Total Differences from original sequence (Series1); x-axis: number of mutations, 1 to 31; y-axis: total number of different letters, 0 to 30]

Spreadsheet #3 (cont.)
To study the evolution of a sequence, we randomly pick a site for mutation, then change its letter

Site #  Letter  Letter in original sequence  Different (1 for yes)  Distance away from original
9       T       T                            0                      0
6       A       A                            0                      0
40      G       T                            1                      1
70      T       C                            1                      2
33      A       C                            1                      3
25      T       T                            0                      3
28      C       A                            1                      4
52      C       A                            1                      5
67      T       T                            0                      5
8       G       A                            1                      6
52      G       A                            1                      7
29      A       T                            1                      8
3       G       T                            1                      9
13      T       C                            1                      10

position #  1 2 3 4 5 6 7 8 9 10 11 12 13 14
orig.seq    C T T C T A C A T A  G  C  C  C

Conclusion
• Use of a spreadsheet makes possible an experimental approach to introducing the mathematics of sequence analysis
• The use of spreadsheets makes possible the use of real-world data and presents the computational tool in a meaningful context
• The importance of the topics to all educated individuals suggests that the topics be included in many liberal arts math courses
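
The simple homology score from "Measuring the Similarity between Strings" and the random-sequence comparison from Spreadsheet #1 can be sketched outside a spreadsheet as well. The Python below is only an illustration under stated assumptions: the function names are invented here, the two short strings are the 13 sites recoverable from the Spreadsheet #1 listing, and random sequences are assumed to draw each of the four letters with equal probability.

```python
import random


def simple_homology(seq_a: str, seq_b: str) -> float:
    """Percent of sites at which two equal-length sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    if not seq_a:
        raise ValueError("sequences must be non-empty")
    # Score 1 for agreement and 0 otherwise at each site, then sum.
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return 100.0 * matches / len(seq_a)


def random_dna(length: int, rng: random.Random) -> str:
    """Random DNA string; uniform letter frequencies are an assumption."""
    return "".join(rng.choice("ACGT") for _ in range(length))


# The 13 sites listed under Spreadsheet #1 (sequence 1 and sequence 2).
seq1 = "CCCACACACACCA"
seq2 = "TACCCGGCTTTCT"
print(round(simple_homology(seq1, seq2), 1))  # 3 matches out of 13 -> 23.1

# Repeat the random comparison many times, as the "many trials" sheet does.
rng = random.Random(0)
scores = [
    simple_homology(random_dna(70, rng), random_dna(70, rng))
    for _ in range(1000)
]
mean_score = sum(scores) / len(scores)
print(mean_score)  # expected to sit near 25% for uniform random letters
```

With four equally likely letters, two unrelated sites agree with probability 1/4, which is why the recorded trial scores cluster between 0.2 and 0.3.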
 
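
The pattern-finding and composition questions (restriction sites, fragment sizes, G&C-rich regions) reduce to a few string operations. The sketch below is hypothetical: the helper names and the 20-base test sequence are made up for illustration, and it uses the EcoRI recognition site GAATTC, which is cut after the first G, as the example enzyme. It treats the DNA as a single linear strand.

```python
def find_sites(sequence: str, pattern: str) -> list[int]:
    """0-based start positions of every occurrence, overlapping ones included."""
    positions = []
    start = sequence.find(pattern)
    while start != -1:
        positions.append(start)
        start = sequence.find(pattern, start + 1)
    return positions


def fragment_lengths(sequence: str, pattern: str, cut_offset: int) -> list[int]:
    """Lengths of the pieces left after cutting a linear sequence at each site.

    cut_offset is the number of bases between the start of the pattern
    and the cut position.
    """
    cuts = [pos + cut_offset for pos in find_sites(sequence, pattern)]
    bounds = [0] + cuts + [len(sequence)]
    return [end - begin for begin, end in zip(bounds, bounds[1:])]


def gc_fraction(sequence: str) -> float:
    """Fraction of letters that are G or C."""
    return sum(1 for base in sequence if base in "GC") / len(sequence)


# Made-up 20-base sequence containing two EcoRI sites (GAATTC).
dna = "AAGAATTCCCGGGAATTCTT"
print(find_sites(dna, "GAATTC"))           # [2, 12]
print(fragment_lengths(dna, "GAATTC", 1))  # [3, 10, 7]
print(gc_fraction(dna))                    # 0.4
```

The fragment lengths always add up to the length of the input, which is a quick sanity check when the same calculation is set up in spreadsheet cells.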
									 
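
A hydropathy plot like the one in Spreadsheet #2 is usually drawn by averaging the Kyte-Doolittle values over a sliding window along the protein. The slides do not say which window the spreadsheet uses, so the default of 9 residues below is an assumption, as are the function name and the short test peptide. Positive averages indicate hydrophobic stretches and negative averages hydrophilic ones.

```python
# Kyte-Doolittle values as listed in the chart for Spreadsheet #2.
KYTE_DOOLITTLE = {
    "A": 1.8, "C": 2.5, "D": -3.5, "E": -3.5, "F": 2.8,
    "G": -0.4, "H": -3.2, "I": 4.5, "K": -3.9, "L": 3.8,
    "M": 1.9, "N": -3.5, "P": -1.6, "Q": -3.5, "R": -4.5,
    "S": -0.8, "T": -0.7, "V": 4.2, "W": -0.9, "Y": -1.3,
}


def hydropathy_profile(protein: str, window: int = 9) -> list[float]:
    """Mean Kyte-Doolittle value of each full window along the sequence.

    The window length is an assumed parameter, not taken from the slides.
    """
    if window < 1 or window > len(protein):
        raise ValueError("window must be between 1 and the sequence length")
    values = [KYTE_DOOLITTLE[residue] for residue in protein]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Made-up peptide: three isoleucines followed by three aspartates.
profile = hydropathy_profile("IIIDDD", window=3)
print(profile)  # starts at 4.5 (hydrophobic) and ends at -3.5 (hydrophilic)
```

Each entry corresponds to one window position, so a sequence of length n with window w gives n - w + 1 points to plot against sequence number.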
									 
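
The mutation experiment in Spreadsheet #3 (pick a random site, change its letter, track the distance from the original) can be sketched as follows. This is not the authors' worksheet: the function name is hypothetical, and the distance here is recounted as the number of sites that currently differ from the original, whereas the spreadsheet table appears to keep a running total of the "different" flags. The two agree until the same site is hit twice.

```python
import random


def simulate_mutations(
    original: str,
    n_mutations: int,
    rng: random.Random,
    alphabet: str = "ACGT",
) -> list[int]:
    """After each random mutation, count the sites differing from the original."""
    current = list(original)
    distances = []
    for _ in range(n_mutations):
        site = rng.randrange(len(current))
        # The new letter may equal the old one, which matches the table
        # rows where "different" is 0.
        current[site] = rng.choice(alphabet)
        distances.append(sum(1 for a, b in zip(original, current) if a != b))
    return distances


rng = random.Random(1)
start_sequence = "A" * 70  # placeholder 70-base sequence, not real data
distances = simulate_mutations(start_sequence, 30, rng)
print(distances)
```

Plotting the returned list against the mutation number gives the same kind of curve as the "Total Differences from original sequence" chart, rising by at most one per mutation.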