Introductory Biological Sequence
Analysis Through Spreadsheets
Stephen J. Merrill
Sandra E. Merrill
Marquette University
Milwaukee, WI
November 18, 2000
ICTCM 2000
Teaching Mathematics
to Students of Biology
• Need to make the math in the courses correlate with the math needed in that discipline
• The most important “math” needed is statistics
• The molecular biology revolution presents data in a form in which calculus has little impact (sequences of letters)
The Nature of Biological Sequence Data
• Primary structures of DNA, RNA, and proteins are sequences of letters -- 4 letters in the case of DNA (ATGC) and RNA (AUGC), and 20 letters representing the sequence of amino acids that makes up a protein
• Secondary and tertiary structures (bending, folding, and twisting) determine function -- hints of them can be seen in the primary structure
Use of Spreadsheets in this setting
• Commonly found and used in biological labs for data acquisition, storage and organization, and data analysis
• Commonly present on student computers and in computer labs
• Unlike calculators -- able to handle data sets typical of “real world” applications
• R.F. Murphy at CMU has developed a set of worksheets for sequence analysis
Meaningful Questions & Problems
1. Measuring the similarity between two
strings -- “alignment” or “homology”
2. Finding instances of a pattern in a string
3. Describing the composition and properties
of a string
4. Graphing the evolutionary process and
construction of phylogenetic trees
Measuring the Similarity between Strings
• Given a gene -- suggest the function of the protein it codes for by finding a similar sequence (possibly in another species)
• Simple homology involves assigning a “1” for agreement and a “0” for disagreement at each site, then summing over all sites
• Homology is the score expressed as a percentage of the highest possible score
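The scoring rule above can be sketched in a few lines of Python. This is only an illustration of the 1/0 scoring described on the slide, not the authors' spreadsheet; the function name `simple_homology` and the two short example strings are made up for the demonstration.

```python
def simple_homology(seq_a: str, seq_b: str) -> float:
    """Percent of sites at which two equal-length sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    if not seq_a:
        raise ValueError("sequences must be non-empty")
    # Score 1 for agreement and 0 for disagreement at each site, then sum.
    score = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    # The highest possible score is one point per site.
    return 100.0 * score / len(seq_a)


# Hypothetical 8-base example: the strings agree at 6 of 8 sites.
print(simple_homology("ATGCATGC", "ATGAATGG"))
```

This version compares the strings site by site with no gaps or shifts, which is the same restriction the simple spreadsheet score has.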
Spreadsheet #1 Simple Homology
Part of two 70-base sequences of yeast DNA
[Spreadsheet table: the two sequences listed base by base, with a 0/1 agreement score and a position number for each site. Chart: Series1, vertical axis 0 to 1.2, horizontal axis 0 to 80.]
Spreadsheet #1 (cont.)
comparing random sequences
Recording the results of many trials

Simresult (alignment): 0.271429 -- this is updated each time any cell is entered

Trial #   alignment
1         0.314286
2         0.171429
3         0.271429
4         0.285714
5         0.228571
5         0.185714
6         0.242857
7         0.185714
8         0.271429
9         0.357143
10        0.242857
11        0.2

Bin    Frequency
0      0
0.05   0
0.1    0
0.15   0
0.2    4
0.25   3
0.3    3
0.35   1
0.4    1
0.45   0
0.5    0
More   0

[Histogram: Frequency (0 to 5) plotted against Bin (0 to 0.5)]
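The random-sequence experiment can also be sketched outside a spreadsheet. The code below is a minimal sketch that assumes each base is drawn uniformly from A, C, G, T and that each sequence has 70 bases; the function names and the seed are illustrative choices, not part of the original worksheet. Under the uniform assumption the expected fraction of matching sites is 1/4, so the simulated scores should cluster near 0.25.

```python
import random


def random_dna(length, rng):
    """A random DNA string; assumes the four bases are equally likely."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def match_fraction(a, b):
    """Fraction of sites at which two equal-length strings agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def run_trials(n_trials, length=70, seed=0):
    """Compare n_trials independent pairs of random sequences."""
    rng = random.Random(seed)
    return [
        match_fraction(random_dna(length, rng), random_dna(length, rng))
        for _ in range(n_trials)
    ]


def histogram(values, bin_width=0.05, n_bins=10):
    """Count values per bin, each bin labelled by its upper edge.

    A value v falls in the first bin whose edge is >= v; anything above
    the last edge is counted under "More".
    """
    edges = [round(k * bin_width, 10) for k in range(n_bins + 1)]
    counts = [0] * (len(edges) + 1)  # final slot is the "More" bin
    for v in values:
        for i, edge in enumerate(edges):
            if v <= edge:
                counts[i] += 1
                break
        else:
            counts[-1] += 1
    return list(zip(edges + ["More"], counts))


for label, count in histogram(run_trials(12)):
    print(label, count)
```

Increasing the number of trials gives a smoother histogram, which makes it easier to judge whether an observed homology between two real sequences is larger than chance alone would produce.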
Finding Instances of a Particular
Pattern in a String
• The process of locating genes involves locating regions of the DNA sequence that contain patterns resembling those of known genes
• Identifying sites on DNA where one of the restriction enzymes can cleave the DNA -- also of interest is the size of the fragments that result
• Identifying regions of RNA that correspond to particular features (e.g. loops) which may be splice sites
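As an illustration of the restriction-site problem, the sketch below locates every occurrence of a recognition pattern and reports the sizes of the resulting fragments. It assumes a linear molecule and an exact string match. The example uses the EcoRI recognition sequence GAATTC, which is cut after the first G; the DNA string itself is invented for the demonstration, and the function names are illustrative.

```python
def find_sites(sequence, pattern):
    """0-based start index of every occurrence (overlaps allowed)."""
    positions = []
    start = sequence.find(pattern)
    while start != -1:
        positions.append(start)
        start = sequence.find(pattern, start + 1)
    return positions


def fragment_lengths(sequence, pattern, cut_offset):
    """Lengths of the pieces left after cutting a linear sequence.

    cut_offset is the number of bases into the pattern at which the
    cut is made (1 for EcoRI, which cuts G^AATTC).
    """
    cuts = [p + cut_offset for p in find_sites(sequence, pattern)]
    bounds = [0] + cuts + [len(sequence)]
    return [b - a for a, b in zip(bounds, bounds[1:])]


# Hypothetical 20-base sequence containing two EcoRI sites.
dna = "AAGAATTCGGGGGAATTCTT"
print(find_sites(dna, "GAATTC"))
print(fragment_lengths(dna, "GAATTC", cut_offset=1))
```

The same `find_sites` helper works for any fixed pattern; patterns with ambiguous positions would need a regular expression or a scoring scheme instead of an exact match.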
Describing the Composition
and Properties of a String
• Counts or frequencies of particular letters, of interest because of their properties (e.g. regions rich in G&C or A&T in DNA)
• Properties of proteins (e.g. charge or hydrophobicity) which depend on the nature and frequencies of the particular amino acids
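Letter counts of this kind are simple to compute. The sketch below counts each base and then slides a fixed-length window along the sequence to show where G&C-rich regions lie. The window length and the example string are assumptions chosen only for illustration.

```python
from collections import Counter


def letter_counts(seq):
    """How many times each letter occurs in the sequence."""
    return Counter(seq)


def gc_fraction(seq):
    """Fraction of the letters that are G or C."""
    return sum(base in "GC" for base in seq) / len(seq)


def windowed_gc(seq, window):
    """G&C fraction in every window of the given length."""
    return [
        gc_fraction(seq[i:i + window])
        for i in range(len(seq) - window + 1)
    ]


# Hypothetical sequence that runs from a GC-rich end to an AT-rich end.
dna = "GGCCAATT"
print(letter_counts(dna))
print(windowed_gc(dna, window=4))
```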
Spreadsheet #2 Hydropathy Plot
Human IL-10 having 148 amino acids
Hydrophobic regions are yellow
Hydrophilic regions in blue
Spreadsheet #2 (Cont.)
[Hydrophobicity Plot: series S1; horizontal axis: amino acid sequence number (1 to 144); vertical axis from -5 to 5.]

Kyte-Doolittle Chart
A    1.8
C    2.5
D   -3.5
E   -3.5
F    2.8
G   -0.4
H   -3.2
I    4.5
K   -3.9
L    3.8
M    1.9
N   -3.5
P   -1.6
Q   -3.5
R   -4.5
S   -0.8
T   -0.7
V    4.2
W   -0.9
Y   -1.3
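A hydropathy profile of the kind plotted in Spreadsheet #2 can be sketched by averaging the Kyte-Doolittle values over a sliding window. The values in the dictionary are the ones listed in the chart above. The window length of 9 and the short example peptide are assumptions made for illustration, and the function name is hypothetical; the slides do not state which window the spreadsheet uses.

```python
KYTE_DOOLITTLE = {
    "A": 1.8, "C": 2.5, "D": -3.5, "E": -3.5, "F": 2.8,
    "G": -0.4, "H": -3.2, "I": 4.5, "K": -3.9, "L": 3.8,
    "M": 1.9, "N": -3.5, "P": -1.6, "Q": -3.5, "R": -4.5,
    "S": -0.8, "T": -0.7, "V": 4.2, "W": -0.9, "Y": -1.3,
}


def hydropathy_profile(protein, window=9):
    """Mean Kyte-Doolittle value in each window along the protein.

    Positive averages suggest hydrophobic stretches and negative
    averages suggest hydrophilic ones.
    """
    scores = [KYTE_DOOLITTLE[aa] for aa in protein]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


# Hypothetical peptide: a hydrophobic run followed by a charged run.
peptide = "IIIIKKKK"
for start, value in enumerate(hydropathy_profile(peptide, window=4), start=1):
    print(start, round(value, 2))
```

A longer window smooths the profile and highlights extended regions, while a shorter window responds to local changes; plotting the averages against position gives a chart like the one on the slide.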
Graphing Evolution and
Phylogenetic Trees
• Evolutionary distance between two DNA sequences is used to determine the process of change in the sequences over time (e.g. the evolution of HIV or the flu viruses)
• Trees are constructed to express the relationship between related sequences -- distance in the tree is a monotone function of homology
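One way to turn homology into a distance is sketched below. The slides say only that distance is a monotone function of homology; the Jukes-Cantor correction used here is a standard choice swapped in for illustration, not necessarily the one used in the authors' worksheet. The sequence names and strings are hypothetical.

```python
import math


def p_distance(a, b):
    """Fraction of sites at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def jukes_cantor(p):
    """Jukes-Cantor distance, d = -(3/4) ln(1 - (4/3) p).

    It grows as the fraction of differing sites p grows, so it shrinks
    as homology rises. It is undefined for p >= 0.75.
    """
    if p >= 0.75:
        raise ValueError("p must be below 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def distance_matrix(sequences):
    """Pairwise Jukes-Cantor distances for a dict of named sequences."""
    names = list(sequences)
    return {
        (m, n): jukes_cantor(p_distance(sequences[m], sequences[n]))
        for m in names
        for n in names
    }


# Hypothetical aligned sequences of equal length.
seqs = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "ACGAACTA"}
for pair, d in distance_matrix(seqs).items():
    print(pair, round(d, 3))
```

A matrix like this is the usual input to a tree-building method, which joins the closest sequences first.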
Spreadsheet #3 Mutation & Evolution
[Chart: Total Differences from original sequence. Series1; horizontal axis: number of mutations (1 to 31); vertical axis: total number of different letters (0 to 30).]
Spreadsheet #3 (cont.)
To study the evolution of a sequence,
we randomly pick a site for mutation, then change its letter
Site #   Letter   Letter in the original sequence   Different (1 for yes)   Distance away from original
9        T        T                                 0                       0
6        A        A                                 0                       0
40       G        T                                 1                       1
70       T        C                                 1                       2
33       A        C                                 1                       3
25       T        T                                 0                       3
28       C        A                                 1                       4
52       C        A                                 1                       5
67       T        T                                 0                       5
8        G        A                                 1                       6
52       G        A                                 1                       7
29       A        T                                 1                       8
3        G        T                                 1                       9
13       T        C                                 1                       10

position #   1  2  3  4  5  6  7  8  9  10  11  12  13  14
orig.seq     C  T  T  C  T  A  C  A  T  A   G   C   C   C
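The mutation experiment can be sketched as a short simulation. The code below assumes that each mutation picks a site uniformly at random and replaces it with a letter drawn uniformly from A, C, G, T, so a mutation may redraw the letter already present. Unlike a running sum of the "different" column, this sketch recounts the differing sites after every step, so a site hit twice is counted once. The function name and seed are illustrative; the starting string is the first 14 letters of the original sequence shown on the slide.

```python
import random


def simulate_mutations(original, n_mutations, seed=0):
    """Number of sites differing from the original after each mutation."""
    rng = random.Random(seed)
    current = list(original)
    distances = []
    for _ in range(n_mutations):
        # Pick a site at random, then change its letter.
        site = rng.randrange(len(current))
        current[site] = rng.choice("ACGT")
        # Recount the sites that now differ from the original.
        distances.append(sum(c != o for c, o in zip(current, original)))
    return distances


start = "CTTCTACATAGCCC"  # first 14 positions of orig.seq on the slide
for step, d in enumerate(simulate_mutations(start, 30), start=1):
    print(step, d)
```

Plotting the distance against the number of mutations shows the curve flattening as repeated hits on the same site and back-mutations become more common.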
Conclusion
• Use of a spreadsheet makes possible an experimental approach to introducing the mathematics of sequence analysis
• The use of spreadsheets makes possible the use of real-world data and presents the computational tool in a meaningful context
• The importance of the topics to all educated individuals suggests that the topics be included in many liberal arts math courses