Bioinformatics - Sequences and Computers

Hort 503: Bioinformatics for Research Exercise I
The information for the set-up of living organisms is stored in the sequences of
nucleotides in DNA. DNA serves two purposes: to provide the information during the life cycle of
a cell and to pass it on to offspring. The discovery of genes and the genetic code triggered the
hope to be able to read the information stored in our genes, and today we are able to do so:
massive progress in sequencing technology has delivered entire genomes to the tips of our
fingers. The era of genomics and proteomics has opened up the opportunity to go beyond the
analysis of single genes and proteins, towards understanding the interactions between all
components of genomes and proteomes. Having tried to comprehend life by cutting it into smaller
and smaller pieces, we are beginning to see it the way it has been functioning since its
beginning: as a whole.
Computer scientists are important allies for biologists in the struggle to understand the
information in DNA. On the one hand, the massive amount of sequencing data requires new tools -
computers and programs - to generate, proofread, store, and access these data. On the other hand,
the deciphering of genomes necessitates the development of new hardware and software which
allow us to detect genes, determine relationships between them and study their expression to
help us understand the basis of development and disease.
Bioinformatics provides the tools to understand the information in biological molecules - DNA,
RNA, and proteins. The two major work routines of bioinformaticists are:
(1) Comparing sequences in order to identify similarities, and
(2) Analyzing sequence composition in order to identify patterns.
The first routine is applied to identify relationships between genes, proteins, or organisms and
regulatory regions in genomes. The second routine is applied to the prediction of genes and
regulatory regions in genomes. In protein analysis it is used to determine the structure and
function of proteins, e.g. in motif identification.
Having read the above, go to the website listed below and click on the bioinformatics link
http://www.dnalc.org/bioinformatics/2003/. Read and complete the exercises using the web links
provided.
Sequence Analysis: Introduction
Learn how information-bearing sequences are different from random sequences and become
familiar with bioinformatics tools for the analysis of sequences.
Language and DNA use sequences to communicate information. The sequence elements
in language are letters and punctuation, in DNA they are the nucleotides. As the letters in books
contain information that is realized by readers, the sequence of nucleotides in DNA contains
information that is realized by the gene expression machinery of cells. Just as documents may
provide information that stimulates readers to act on something (like adopting a new life style),
genes contain the information for the building of proteins that perform specific actions within
cells.
Living organisms are not the only source for DNA sequences. DNA sequencing and
computer technologies have made it possible to "isolate" DNA from databases. In addition, many
experimental routines such as restriction digests and hybridizations can now be performed
electronically or "in silico". As restriction enzymes recognize specific target sequences so do
search algorithms. Algorithms then determine the lengths of the DNA stretches between
restriction sites and display the result of the "digest" in a graphical form that resembles the
banding pattern of a DNA gel. The electronic equivalent of Southern blotting and hybridization
are search algorithms that search databases for sequences that are similar to the sequence of a
DNA "probe". Then, matches and probe "hybridize" by way of alignments. However, in silico
molecular biology provides the advantage of data-linking. With the click of a mouse button, the
detection of a hybridizing fragment can be extended into identifying:
(1) Where exactly in a genome a match is located.
(2) Whether there are similar sequences elsewhere in the genome.
(3) Whether homologs exist in other organisms.
(4) Whether it has been found to be linked to disease.
(5) Whether it contains a gene, transposon, or other important features.
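The "in silico" restriction digest described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the algorithm used by any particular tool: the function name `digest`, the `cut_offset` parameter, and the example sequence are made up for this sketch. The recognition site GAATTC (EcoRI, which cuts after the first G) is used as the example enzyme.

```python
def digest(sequence, site, cut_offset=0):
    """Cut a linear DNA sequence at every occurrence of a recognition site.

    cut_offset is the position within the site where the cut is made
    (1 for EcoRI, which cuts G^AATTC). Returns the fragment lengths in
    the order in which the fragments occur along the sequence.
    """
    sequence = sequence.upper()
    site = site.upper()
    cut_positions = []
    start = sequence.find(site)
    while start != -1:
        cut_positions.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    boundaries = [0] + cut_positions + [len(sequence)]
    return [end - begin for begin, end in zip(boundaries, boundaries[1:])]


# Hypothetical 20-nucleotide sequence containing two EcoRI sites.
fragments = digest("AAGAATTCGGGGGAATTCTT", "GAATTC", cut_offset=1)
# Sorting by size mimics the order of bands on a gel (largest on top).
for length in sorted(fragments, reverse=True):
    print(f"{length:>4} bp  " + "=" * length)
```

The printed bars are a crude stand-in for the graphical "banding pattern" mentioned in the text.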
Whether a DNA sequence exists as a chemical or in a database does not affect its
nucleotide order. Thus, the information stored in this sequence remains the same, whether it
occurs in the form of a molecule or as a stretch of letters. It is the realization of this information that
requires DNA in its molecular form, e.g. to synthesize proteins which then perform certain
actions.
In the following segments you will learn how DNA stores information and how the function of
DNA sequences can be determined through electronic ("in silico") sequence analysis.
Sequence Analysis: Outline
The composition of sequences yields clues about their genesis as well as their potential
function. This is true for languages, for secret codes, and for DNA sequences. This chapter will
follow the outline provided on the new website of the Dolan DNA Learning Center, DNA
interactive or DNAi, and use Gene Boy, an intuitive bioinformatics tool that was developed by
non-bioinformaticists for biologists.
1. Encoding Meaning;
2. Random vs. Genic vs. Intergenic DNA;
3. Patterns and Consensus;
Bioinformatics - Sequence Analysis: Encoding Meaning
Language and computer codes provide great examples to examine the encoding of
information using sequences - of 26 elements (letters) in language and of 2 elements (0,1) in
computer language. Work through the segment 'meaning' in the 'Genome mining' section on
DNAi to develop an understanding of how language encodes meaning and how the analysis of
sequence composition can give valuable clues about the information content of a text or a piece
of computer code.
1. Go to the DNAi website. (http://www.dnai.org/index.html)
2. Open 'Genome'
3. Select 'Genome mining'
4. Read through the text giving you an overview.
5. Select 'meaning'
6. Work through the consecutive pages, using the forward icon wherever necessary.
Respond to the questions below:
1. Click forward. Frame #2: Does the text convey meaning?
2. Click forward. Frame #3: Was there anything that may have kept you initially
from answering this question correctly?
3. Stay on frame. Frame #3: Do you think that text contains the letters of the
English alphabet in equal proportions?
4. Click forward. Frame #4: Click on each of the four sample letters to find the
actual counts:
E=
J=
R=
Y=
5. Stay on frame. Frame #4: Given that the total number of letters in this text is
1,064 (see frame #2), express your results in percent:
E=
J=
R=
Y=
6. Click forward. Frame #5: Do your results match the results on display?
7. Stay on frame. Frame #5: In what percentages would you assume the text
contains the other 22 letters of the English alphabet?
8. Click forward. Frame #6: What can you say about the distribution of the ratios
for each letter in this meaningful text? Would you call them rather equal or
rather dispersed?
9. Click forward. Frame #7: How often would you think letters occur in random
text? A sequence of letters that was composed under the rule that none of the
letters should be treated preferentially?
10. Click forward. Frame #8: What can you say about the distribution of the ratios
for each letter in this random text? Would you call them rather equal or rather
"all over the place"?
11. Stay on frame. Compare the results for the frequency analysis of the two types
of text. Can you think of a method to use frequency analysis to determine
whether a code may contain meaning without trying to read it?
12. Click forward. Frame #9: Does this result confirm your suggestion? Or is it
different? Discuss your answer with the rest of the class.
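The letter counting in the questions above can also be done by a program. The following Python sketch is one possible way to do it; the function names `letter_percentages` and `spread`, the sample sentence, and the randomly generated text are assumptions made for this illustration and are not the texts shown in DNAi.

```python
import random
import string
from collections import Counter


def letter_percentages(text):
    """Percentage of each of the 26 letters in a text (case is ignored)."""
    letters = [ch for ch in text.upper() if ch in string.ascii_uppercase]
    if not letters:
        return {letter: 0.0 for letter in string.ascii_uppercase}
    counts = Counter(letters)
    return {letter: 100.0 * counts[letter] / len(letters)
            for letter in string.ascii_uppercase}


def spread(percentages):
    """Difference between the most and the least frequent letter, in percent."""
    values = list(percentages.values())
    return max(values) - min(values)


# A meaningful sample sentence versus letters drawn without any preference.
english = ("The information for the set-up of living organisms is stored "
           "in the sequences of nucleotides in DNA")
rng = random.Random(0)  # fixed seed so that the run is repeatable
random_text = "".join(rng.choice(string.ascii_uppercase) for _ in range(5200))

print("spread in meaningful text:", round(spread(letter_percentages(english)), 1))
print("spread in random text:    ", round(spread(letter_percentages(random_text)), 1))
```

A single number such as the spread is a deliberately crude summary; looking at all 26 percentages, as the exercise does, is more informative.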
Sequence analysis provides indeed a useful tool to determine the potential of sequences to
contain information. However, just analyzing the occurrences of single letters can easily lead to
erroneous conclusions. Work through the remaining frames of 'meaning' using the example of
binary code to understand how the analysis of single elements can be misleading and how to fix
this problem.
1. Click forward. Frame #11: Can you see any pattern in this sequence? List them:
2. Stay on frame. Does the sequence follow any rules? List them:
3. Stay on frame. Can you make any predictions about the sequence? Which element would
you think comes next?
4. Click forward. Frame #11: Check your answers against the answers provided at the
bottom of the left frame.
5. Click forward. Frame #12: Count the numbers of 0's and 1's in this sequence. In what
percentages do they occur?
6. Stay on frame. Working through the text examples earlier, what have you learned does a
distribution of elements like this indicate? Randomness? Or meaning? So, do you think
there is a problem here?
7. Click forward. Frame #14: Read the text.
8. Click forward. Frame #15: How many different pairs of elements can you form of 0's and
1's? In a perfectly random sequence, how often should each pair occur?
9. Click forward. Frame #16: Count the occurrences of these pairs in the actual sequence at
hand. After you determined each pair, move only one letter forward. This way each pair
overlaps with the next and you should be able to find a total of 19 pairs in this sequence
of 20 elements.
10. Stay on frame: Count the occurrences of each of the four different pairs in this sequence
and calculate their percentages.
00
01
10
11
11. Stay on frame: Do these data show that the percentages for each pair are very similar?
Or are they quite dispersed?
12. Stay on frame: Does this analysis confirm that the sequence is random? Or that it is
meaningful and predictable?
13. Click forward. Frame #17: Read the text.
14. Click forward. Frame #18: Read the text.
15. Click forward. Frame #19: In order for frequency analysis to be accurate it needs to
include analyzing the occurrences of element combinations.
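The overlapping pair count from the questions above (19 pairs in a sequence of 20 elements) can be sketched as follows. The sequence used here is a made-up, strictly alternating example and not necessarily the one shown in DNAi; the function names are likewise chosen only for this illustration.

```python
from collections import Counter


def overlapping_counts(sequence, size):
    """Count all overlapping windows of the given size, moving one element at a time."""
    return Counter(sequence[i:i + size] for i in range(len(sequence) - size + 1))


def to_percent(counts):
    """Convert counts to percentages of the total number of windows."""
    total = sum(counts.values())
    return {key: 100.0 * value / total for key, value in counts.items()}


sequence = "01" * 10  # hypothetical 20-element sequence: 0101...01

singles = overlapping_counts(sequence, 1)
pairs = overlapping_counts(sequence, 2)

print("singles:", to_percent(singles))  # both elements occur equally often
for pair in ("00", "01", "10", "11"):
    print(pair, pairs[pair], "of", sum(pairs.values()))
```

In this example the single elements look perfectly "random" (50% each), while the pairs 00 and 11 never occur, which exposes the rule behind the sequence.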
Bioinformatics - Sequence Analysis: Random vs. Genic vs. Intergenic DNA
DNA, just like languages or binary code, is a code. It consists of four different elements
and the fact that it is a molecule does not diminish the capacity of nucleotide sequences to
encode information. Given the sheer number of nucleotides in a DNA string one could perceive
DNA as a random sequence of nucleotides. Now test this possibility by applying nucleotide
frequency analysis to random, genic, and intergenic DNA stretches. Use our new bioinformatics
tool Gene Boy to calculate the percentages for the four different nucleotides in these three types
of sequences. Then, apply your conclusions about meaningful text and binary code to interpret
your data about DNA.
1. Go to the DNAi website.
2. Open 'Genome'
3. Select 'Genome mining'
4. Read through the introductory and overview page.
5. Select 'DNA analysis'
6. Following the instructions in DNAi, analyze the nucleotide composition of a random
sequence and respond to the questions below. Use the Excel spreadsheet provided to
record your data.
   o What did you find out about the distribution of A's, C's, G's, and T's in random
     sequences?
   o Based on these results and what you learned during text analysis, formulate a
     hypothesis about the distribution of nucleotides in the random DNA sequence.
   o Analyze the second random sequence on Gene Boy.
   o Does the result confirm or contradict your hypothesis? Make sure you check singles,
     pairs, and triplets.
7. Now analyze the sequence composition of a genic sequence.
   o What did you find out about the distribution of A's, C's, G's, and T's in the genic
     sequence?
   o How does it differ from the result for the random sequence?
   o Form a hypothesis about the difference between genic and random sequences.
   o Analyze the second genic sequence on Gene Boy.
   o Does the result confirm or contradict your hypothesis?
8. Why would you think genic DNA differs from random DNA?
9. Discuss your answer with the rest of the class.
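The composition analysis of singles, pairs, and triplets can be reproduced outside of Gene Boy with a short script. The Python sketch below is not Gene Boy's implementation; the function names `composition` and `percentage_range` are invented here, and the sequence to analyze has to be supplied by you. It reports the lowest and highest percentage among all possible mono-, di-, or trinucleotides, which is the kind of range recorded in the results table.

```python
from itertools import product


def composition(sequence, k):
    """Percentage of every possible k-mer (k=1 singles, 2 pairs, 3 triplets).

    Windows overlap and move one nucleotide at a time. Windows containing
    symbols other than A, C, G, T (e.g. N) are skipped.
    """
    sequence = sequence.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    windows = 0
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if kmer in counts:
            counts[kmer] += 1
            windows += 1
    if windows == 0:
        return {kmer: 0.0 for kmer in kmers}
    return {kmer: 100.0 * n / windows for kmer, n in counts.items()}


def percentage_range(sequence, k):
    """Lowest and highest percentage among all possible k-mers."""
    values = list(composition(sequence, k).values())
    return min(values), max(values)


# Hypothetical short stretch; paste a real sequence here instead.
dna = "ATGGCGCGCTTAACGCGATATTTAGCGCGC"
for k, label in ((1, "mononucleotides"), (2, "dinucleotides"), (3, "trinucleotides")):
    low, high = percentage_range(dna, k)
    print(f"{label}: {low:.1f} - {high:.1f} %")
```

For a sequence in which no element is preferred, the expected value is 25% for each nucleotide, 6.25% for each of the 16 pairs, and about 1.56% for each of the 64 triplets, so a narrow range around these values points towards randomness.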
Genomic DNA which does not contain genes is called intergenic DNA. Which sequence
characteristics would you expect for intergenic DNA? It's non-coding so would it look like random
DNA? Like genic DNA? Or different from that? The answer to this question will provide you with
first clues about how one can detect genes in DNA sequences.
1. Compare a random with an intergenic sequence. What differences can you observe?
2. Compare the second intergenic sequence with the second random sequence.
3. Are intergenic sequences random? Explain.
4. Are there any differences between genic and intergenic DNA?
5. Analyze the composition of genic sequence #1 and intergenic sequence #1. Can you
identify any differences?
6. Determine the nucleotide distributions for the second set of genic and intergenic
sequences. Can you identify any differences?
7. Repeat the previous analysis for genic and intergenic DNA examining the ratios for
nucleotide pairs. Can you identify any differences between the two sequence types?
8. Examine the occurrences of CG di-nucleotides in the two sequence types. Where do you
find high frequencies of CG-pairs?
9. What are "CG-islands" associated with? They are often called "CpG-islands" to denote that
the C and the G are connected through a phosphate bridge. In other words, CpG refers to
the fact that C and G are neighbors on the same strand and not base pairs which sit
opposite each other.
10. From your previous answers can you name anything in the nucleotide and di-nucleotide
composition of genic and intergenic DNA that could be used to find genes in genomic
DNA sequences? How would you go about doing that?
11. Make a comparative statement about the hemoglobin alpha gene cluster on chromosome
16 and a high CG content.
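One common way to put a number on the CG di-nucleotide question is the ratio of observed to expected CpG counts: the number of CG pairs multiplied by the sequence length, divided by the product of the numbers of C's and G's. The Python sketch below applies this ratio to a whole sequence; the function names and the example sequences are assumptions made for this illustration, and no particular cut-off value is implied.

```python
def gc_percent(sequence):
    """Percentage of C and G nucleotides in a sequence."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return 100.0 * (sequence.count("C") + sequence.count("G")) / len(sequence)


def cpg_observed_expected(sequence):
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G).

    A value near 1 means CG pairs occur about as often as the C and G
    content alone would predict; lower values mean CG is under-represented.
    """
    sequence = sequence.upper()
    c_count = sequence.count("C")
    g_count = sequence.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    # "CG" cannot overlap with itself, so str.count finds every occurrence.
    return sequence.count("CG") * len(sequence) / (c_count * g_count)


# Hypothetical examples, not taken from Gene Boy.
for name, dna in (("CG-rich stretch", "GCGCGCGACGCGTCGCGA"),
                  ("CG-poor stretch", "GCATGCCTGGAGCCATGC")):
    print(name, round(gc_percent(dna), 1), round(cpg_observed_expected(dna), 2))
```

Applying the same function to consecutive windows along a genomic sequence is one simple way to look for regions where CG pairs pile up.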
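Results of Composition Analysis

Composition Analysis Range (%)

Type            Mononucleotides   Dinucleotides   Trinucleotides
Random 1        24.6 - 25.3       ______          ______
Random 2        ______            ______          ______
Genic 1         ______            ______          ______
Genic 2         ______            ______          ______
Intergenic 1    ______            ______          ______
Intergenic 2    ______            ______          ______

Mononucleotide Composition Analysis (%)

Mononucleotides   Random 1   Random 2   Genic 1   Genic 2   Intergenic 1   Intergenic 2
A                 ______     ______     ______    ______    ______         ______
C                 ______     ______     ______    ______    ______         ______
G                 ______     ______     ______    ______    ______         ______
T                 ______     ______     ______    ______    ______         ______

Example entry for A: 25.2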
Bioinformatics - Sequence Analysis: Patterns and Consensus
DNA sequences which serve the same function frequently share characteristics. Genes
share many structural features with each other, such as promoters, start and stop codons, and
coding sequences. Sometimes they are identical in different genes, such as start and stop
codons. Sometimes, they share an overall consensus but the sequences for this feature might
deviate somewhat from the consensus in different genes. Familiarize yourself with the most
important of these features by working through the 'gene feature' segment in 'Genome Mining',
answering the questions below:
1. Frame #1: What is the central dogma?
2. Frame #3: Which triplet is the start codon? On mRNA ______ On DNA ______
3. Frame #3: Which amino acid does the start codon encode?
4. Frame #5: What are the three stop codons? On mRNA ______, ______, ______
On DNA ______, ______, ______
5. Frame #5: Which amino acids do the stop codons encode?
6. Frame #7 and #8: What is meant by a reading frame RF?
7. Frame #9: What is meant by an open reading frame ORF?
8. Frame #10: What is a UTR?
9. Frame #10: What is the actual beginning and end of transcription?
10. Frame #10: What is the actual beginning and end of translation?
11. Frame #11: View the 3D animation "Promoting transcription".
12. Frame #11: What is a promoter?
13. Frame #11: What is the actual purpose of a promoter in gene expression?
14. Frame #13: A consensus sequence in the promoter of the six different genes is
highlighted in yellow - what is it called? What is the actual consensus sequence?
15. Frames #14 and #15: Explain the process by which a computer program can be used to
identify TATA boxes.
16. Frame #15: Do the "TATA"-boxes of all genes contain exactly the same sequence?
17. Frame #16: What is meant by "splicing"?
18. Frame #16: Does splicing occur on the DNA or the RNA level?
19. Frame #16: What are "spliced genes"?
20. Frame #16: What are introns?
21. Frame #16: What are exons?
22. Frame #16 and #17: What are splice sites?
23. Frame #17: Introns begin with the sequence ______ and end with the sequence ______.
24. Frame #18: What is a polyA-tail?
25. Frame #18: At which end of the RNA is the polyA-tail added?
26. Frame #18: What is a polyA-signal?
27. Frame #19 and #20: What is the consensus sequence for a polyA-signal?
28. Bonus question: How can these gene features be used in gene identification?
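Searching for a feature that follows a consensus but deviates somewhat from gene to gene can be sketched as a sliding-window comparison that tolerates a limited number of mismatches. The Python example below is a minimal sketch of that idea and not the method used in DNAi; the function name `find_consensus`, the simplified consensus string TATAAA, and the example sequence are assumptions for this illustration.

```python
def find_consensus(sequence, consensus, max_mismatches=1):
    """Report every window that differs from the consensus in at most max_mismatches positions.

    Returns a list of (start position, matching window, number of mismatches).
    Positions are counted from 0.
    """
    sequence = sequence.upper()
    consensus = consensus.upper()
    width = len(consensus)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        mismatches = sum(1 for a, b in zip(window, consensus) if a != b)
        if mismatches <= max_mismatches:
            hits.append((start, window, mismatches))
    return hits


# Hypothetical promoter fragment with a box that deviates in one position.
print(find_consensus("CCTATATACC", "TATAAA", max_mismatches=1))
```

The same function can be pointed at other short signals from the questions above by changing the consensus string. Allowing more mismatches finds more candidates but also more chance hits.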
Hort 503: Bioinformatics for Research Exercise II
Go to http://www.dnalc.org/bioinformatics/2003/
Click on Similarity, read and complete the exercises:
Bioinformatics - Sequence Similarity & Alignments
According to the concept of evolution (and our own experience) living organisms can be
traced back to forebears. Going back further and further in time allows us to connect currently
distinct species through common ancestors until, ultimately, all life might spring from a few or
even one ancestral "cell". Thus, all current life forms may actually be related to each other.
Support for this hypothesis can be found in our genes and proteins. Could there be any organisms
more different than E. coli, lettuce, yeast, worms, flies, and humans? Yet, humans share genes and
proteins with similar functions and sequences with other primates, rodents, worms, and even
prokaryotes. Since life in its diversity might relate back to one ancestral life form, all our genomes
might relate back to the genome of this very same organism. All differences that exist today among
genomes were then introduced during the ensuing billions of years of genomic changes and evolution.
These changes then led to the appearance of new organisms. This evolutionary development
of life is likened to a tree, with emerging new species and kingdoms representing the branching
points of this tree of life.
The amount of similarity between two sequences
is a measure for their relatedness. The relationship
among nucleotide sequences can differ from the
relationship among amino acid sequences which, in
turn, may differ from the relationship in structure and
function. Closely related sequences are more similar
than more distantly related sequences. Thus, sequence similarity serves to estimate evolutionary
distance following the assumption that sequence similarity that goes beyond the similarity which
can be expected just by chance, indicates relatedness. The determination of sequence similarity
is not trivial, though. It requires sophisticated computer algorithms which attempt to align
sequences with each other in order to determine and score identities and differences between
them. Aligning sequences is the basis for many research objectives such as finding genes,
determining relationships, and finding sequences in databases. This module will guide you
through examples to understand how alignments work and lead to new findings.
Bioinformatics - Sequence Similarity: Outline
Identifying similarities among sequences is an important technique for many biological
questions such as: are two organisms/genes/proteins related to each other? Or: can a drug that
works against one type of cancer be applied to other cancers, too? Learn what sequence
similarity is, how it is determined, and how it can be used to answer biological questions.
1. Similarity, identity, and homology;
2. Sequence comparison through alignments;
3. Local and global alignments;
4. Pairwise sequence alignments and sequence searches;
5. Pairwise sequence alignments and homology;
Bioinformatics - Similarity, identity, and homology
Things that are related are often similar and things that are similar are often related.
Relationships are defined by common ancestry: siblings are related to and through their
parents, nieces and nephews through their grandparents. By the same token, different globin
molecules are related: they all relate back to an ancestral heme-incorporating globin-like
protein. Relationships among family members are determined by similarities: biologically
through physical similarities, socially/legally by identical last names. Relationships of proteins
and genes are determined by the degree of similarity among their sequences.
Everybody has their own, unique DNA sequence, just as everybody has their own, unique
set of fingerprints. Any organism's DNA sequence has been provided by its forebears; parents
for humans, and progenitor cells for bacteria and other single-cell organisms. Changes in DNA
are introduced by two mechanisms: spontaneous mutations (infrequent) and recombination of
paternal and maternal DNA during meiosis (frequent). Thus, a person's genome usually does not
resemble either her father's or her mother's genome; it is a combination of parts of both of these.
Homology is a term used for genes or proteins which are derived from the same ancestor.
Homology cannot be expressed as a fraction: two sequences are either homologous or not.
Scientists infer homology from the similarity among sequences.
Similarity is a measure for relatedness. 100% similarity would be identity. In order to find out
whether things are similar, we usually compare them side-by-side, such as the two images
below. Or we find some way to describe them, such as the formula that is used to
describe the loops and whorls in fingerprints. Then, searching for matches does not require
comparing images; instead, the formulas can be used to query databases.
Below are the images of regular hemoglobin and an artistically "mutated" hemoglobin molecule.
Compare the two images and try to find the three defects in the right one, that would render a
person's blood unable to transport oxygen if this molecule really existed.


Click the left image. To preserve the large molecule in a new browser window, hold down the
'ctrl' key and press 'n'. Then click the right image. Re-size the windows so that they fit next
to each other. Compare.
Bioinformatics - Sequence comparison by alignment
As you place images and many other things next to each
other to compare them, so are sequences compared by
aligning them with each other. Alignments between
nucleotide or amino acid sequences represent the evolutionary history of these sequences - their
relatedness. Many recent advances in understanding the information in genomes and proteomes
were derived through sequence comparisons, and alignment and search methods have become
indispensable routines in bioinformatics and biological research.
An alignment between two sequences is simply a pairwise match between the characters of
each sequence. While sequences are either homologs or not, similarity, the measure which is
used to infer homology, can be expressed as a fraction or percentage. As an example, determine
the similarity between the two sequences:
AATCTATA
AAGATA
Align the two sequences in various ways with paper and pencil. Or click anywhere on the box
and do it online. Then score their similarity by rewarding matches with +1 and scoring
mismatches with 0. In how many different ways can you align the two sequences? Which are
the highest and lowest scores you can find? Do the alignments allow you to determine a
relationship among the two sequences? Results here. However, none of these three alignments
is actually very satisfying. There seems to be more similarity at the beginning and the end of the
sequence than in the middle part - if only the second sequence weren't so short!
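The ungapped scoring described above can be checked with a few lines of Python. This sketch slides the shorter sequence along the longer one and counts matches (+1) and mismatches (0). The function name `ungapped_scores` is made up for this example, and only placements in which the short sequence lies completely within the long one are considered, which is an assumption of this sketch.

```python
def ungapped_scores(long_seq, short_seq, match=1, mismatch=0):
    """Score every placement of short_seq within long_seq without gaps.

    Returns a list of (offset, score), where offset is the number of
    positions by which short_seq is shifted to the right.
    """
    scores = []
    for offset in range(len(long_seq) - len(short_seq) + 1):
        window = long_seq[offset:offset + len(short_seq)]
        score = sum(match if a == b else mismatch
                    for a, b in zip(window, short_seq))
        scores.append((offset, score))
    return scores


for offset, score in ungapped_scores("AATCTATA", "AAGATA"):
    print("AATCTATA")
    print(" " * offset + "AAGATA", " score:", score)
```

For these two sequences the three placements score 4, 1, and 3.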
Three kinds of changes can occur at any given position within a sequence: a change from
one nucleotide to another, an insertion of one or more nucleotides, or a deletion of one or more
nucleotides. This fact has prompted scientists to allow the insertions of gaps into sequences that
they try to align. A gap would represent a deletion in one of the sequences. Or an insertion in
the other one. However, insertions and deletions (so-called 'indels') have been found to occur
much less frequently than substitutions, i.e. changes of one nucleotide into another. In order to
account for this fact, gaps are penalized with a negative score, the gap penalty. Align the
sequences above again. But this time you are permitted to insert gaps in the form of dashes into
either sequence. (Don't delete or replace any nucleotides, though. Upon inserting a gap the
nucleotides which follow a gap would be pushed forward.) Realign the two sequences, inserting
gaps wherever it seems necessary. Then score the alignments, a match with '1', a mismatch
with '0', and a gap with '-1'. Find the highest and the lowest scores possible. (Use paper and
pencil or click here to do it online.) Do the alignments allow you to determine a relationship
among the two sequences? Click here and compare your alignments.
Are two holes more than one? The third gapped alignment from the last activity contains
two consecutive gaps. Should they be scored the same way as two isolated gaps or differently?
Maybe consecutive gaps should be viewed as just one gap? Regardless of the numbers of
dashes following each other? In fact, insertions or deletions that sit next to each other are more
likely to have occurred at once than due to distinct events. Several gaps in a row are therefore
charged differently than several isolated gaps. This happens by breaking up the gap penalty into
a gap-opening penalty and a gap-extension penalty. The gap-opening penalty is charged
for each isolated gap as well as for the first gap in a longer stretch of consecutive gaps. Each
further consecutive gap is then charged the gap-extension penalty, which is smaller than the
gap-opening penalty. Thus, alignment 3 would not score 3 but somewhat higher, depending on
the gap-extension penalty. Re-score your alignments by setting the gap-extension penalty to 0.5.
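Scoring a finished gapped alignment, with or without a separate gap-extension penalty, can be sketched as follows. The function name `score_alignment` and its parameters are assumptions made for this illustration. The alignment used in the example contains two consecutive gaps and was chosen here for demonstration; it is not necessarily identical to "alignment 3" of the online activity.

```python
def score_alignment(top, bottom, match=1, mismatch=0, gap_open=-1, gap_extend=-1):
    """Score two aligned sequences of equal length in which '-' marks a gap.

    The first dash of a run of dashes in one sequence is charged gap_open,
    every further dash of the same run is charged gap_extend. With
    gap_extend equal to gap_open every dash costs the same.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    previous_gap_row = None  # which sequence held a dash in the previous column
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            row = "top" if a == "-" else "bottom"
            score += gap_extend if previous_gap_row == row else gap_open
            previous_gap_row = row
        else:
            score += match if a == b else mismatch
            previous_gap_row = None
    return score


top, bottom = "AATCTATA", "AAG--ATA"
print(score_alignment(top, bottom))                   # every dash costs 1
print(score_alignment(top, bottom, gap_extend=-0.5))  # second dash costs 0.5
```

With five matches, one mismatch, and two consecutive gaps this alignment scores 3 when every gap costs 1, and 3.5 when the second gap is charged only the extension penalty of 0.5.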
Bioinformatics - Global and local alignments
So far we compared sequences looking at their entire length. Find out how this technique can
lead to misinterpretations and what other technique you need to know in order to understand
how to correctly use sequence alignments for the analysis of nucleotide and amino acid
sequences.
Determine the relationship between these two sequences aligning them with paper and pencil or
electronically:
ACGT
AACACGTGTCT
Global alignments try to match sequences by comparing them at each given position. Charging
gap penalties does not take into account whether gaps are inserted within a sequence or at the
beginning or the end of one or both sequences. However, in aligning the two sequences above,
you may have found that the optimal alignment is to not split up ACGT but to leave the
sequences in one piece. Here, several alignments are possible, yet, only one that contains the
short sequence entirely in the long one. Gaps at the beginning and end of sequences are usually
the result of incomplete data and are not based on biological circumstances. They should
therefore be treated differently than internal gaps. This kind of alignment is called semi-global
alignment. Repeat the previous alignment charging gaps at the beginning and at the end of the
sequence with '0'. Find the alignment yielding the highest score. (For the electronic version click
here.)
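The difference between a global and a semi-global alignment can be made concrete with a small dynamic-programming sketch. The Python function below computes only the best score, not the alignment itself, and uses a single gap penalty without a separate extension penalty. The function name `align_score` and the `free_end_gaps` switch are assumptions of this sketch rather than features of a specific program.

```python
def align_score(a, b, match=1, mismatch=0, gap=-1, free_end_gaps=False):
    """Best alignment score of sequences a and b.

    free_end_gaps=False: global alignment, every gap is charged.
    free_end_gaps=True:  semi-global alignment, gaps before the start and
                         after the end of either sequence cost nothing.
    """
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    if not free_end_gaps:
        # Leading gaps are charged: first row and column hold growing penalties.
        for i in range(1, rows):
            table[i][0] = i * gap
        for j in range(1, cols):
            table[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = table[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            table[i][j] = max(diagonal,
                              table[i - 1][j] + gap,   # gap in b
                              table[i][j - 1] + gap)   # gap in a
    if free_end_gaps:
        # Trailing gaps are free: the best score may end anywhere in the
        # last row or the last column.
        return max(max(table[-1]), max(row[-1] for row in table))
    return table[-1][-1]


print(align_score("ACGT", "AACACGTGTCT"))                      # global
print(align_score("ACGT", "AACACGTGTCT", free_end_gaps=True))  # semi-global
```

For the two sequences above the global score is -3 (four matches minus seven charged gaps), while the semi-global score is 4 because the gaps at both ends are no longer charged.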
Local Alignments combine global and semi-global alignments in that they attempt to determine
the best matching subsequences within two sequences. Align these two sequences with paper
and pencil or electronically:
AACCTATAGCT
CCGATATA
Again, you could align these sequences in a variety of ways, yet, alignments which leave larger
chunks of sequence intact are favored over those that chop the entire sequence into its
nucleotides and insert a whole lot of individual gaps into both sequences. Using a local
alignment algorithm, scores of 1 for matches and 0 for mismatches, and a gap penalty of -1
would yield this alignment. The gap in the upper sequence could have been placed at a few
other positions but the point is that this alignment provides the least number of gaps while
maintaining the integrity of as many subsequences as possible.
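A local alignment of this kind can be computed with the Smith-Waterman approach, in which scores are never allowed to drop below zero and the alignment is read backwards from the highest-scoring cell. The Python sketch below uses the scores named above (match 1, mismatch 0, gap -1) with a single gap penalty. The function name `local_alignment` is chosen for this example, and when several alignments share the best score the one reported here need not be the one shown in the online activity.

```python
def local_alignment(a, b, match=1, mismatch=0, gap=-1):
    """Return (score, aligned part of a, aligned part of b) of a best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    best, best_cell = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = table[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            table[i][j] = max(0, diagonal, table[i - 1][j] + gap, table[i][j - 1] + gap)
            if table[i][j] > best:
                best, best_cell = table[i][j], (i, j)

    # Trace back from the best cell until a cell with score 0 is reached.
    i, j = best_cell
    top, bottom = [], []
    while i > 0 and j > 0 and table[i][j] > 0:
        score = table[i][j]
        step = match if a[i - 1] == b[j - 1] else mismatch
        if score == table[i - 1][j - 1] + step:
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score == table[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = local_alignment("AACCTATAGCT", "CCGATATA")
print(score)
print(top)
print(bottom)
```

Because a mismatch costs nothing in this scoring scheme, an alignment with a mismatch can score as well as or better than one with a gap; raising the mismatch penalty changes which alignment wins.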
Find out more about scoring alignments in this animation. It's somewhat complex at times. But
it provides a good overview of how optimal sequence alignments are determined using
alignment matrices.
Bioinformatics - Pairwise alignments in searches
Searching databases for sequences is an extremely important bioinformatics routine. Learn how
alignment programs are used to query databases for sequences.
Searching the Internet for information often necessitates the use of so-called search
engines. Bioinformatics has its own search engines which are specialized tools to search
databases containing nucleotide and amino acid sequence data. These databases are located in
a variety of different places around the world, most notably in the US at the National Center for
Biotechnology Information (NCBI), in England at the European Bioinformatics Institute of the
European Molecular Biology Laboratory (EMBL-EBI), in Switzerland (SwissProt), and in Japan at
the DNA Data Bank of Japan (DDBJ). The sheer number
of sequences in these databases prevents direct sequence-to-sequence alignments and search
tools have to be quite sophisticated in order to complete searches in a reasonable amount of
time.
BLAST is one of the better-known sequence search engines. BLAST stands for Basic Local
Alignment Search Tool. BLAST finds sequences in databases that are similar and related to
subsequences in a query sequence. It returns a brief title line describing the nature of the
search hit, links to the database entry for the hit, shows the actual sequence alignment between
query and hit sequence, and assigns each search hit a 'score' and a so-called 'E-value'.
Perform a BLAST search with this sequence:
TTAACTCCACCATTAGCACC
1. Highlight and copy the sequence
2. Open Gene Boy
3. Click 'Your Sequence', paste the sequence into the central window, change Your
Sequence on top into a name of your choosing, select 'Save'
4. Open 'WWW Tools', select 'Sequence Search'
5. In the next window select 'Format', wait for the results to come up
6. Examine the different parts of the results page. Try to resolve questions first by
consulting with the help provided under 'BLAST FAQs'.
7. Clicking on a score moves further down the page to a view of the alignment between
the match and the query.
8. Clicking on a 'gi|.....' number will link you to the database entry for the respective
match. Use the browser's 'Back'-button to move back to the result page.
9. Which organisms are the matching DNA sequences from?
10. What scores and E-values can you find for different matches? Read the definition for
score and for E-value under 'BLAST FAQs'.
11. Check Genetic Origins - mtDNA - Recipes. What sequence was the primer derived from?
The two scoring parameters 'score' and 'E-value' are provided to judge the quality of the
search result. The score is determined based on the number of matches, mismatches and gaps
in the sequence alignment. However, score can be somewhat misleading in that it strongly
depends on the length of the query sequence.
The E-value, on the other hand, is seen as a more informative parameter to judge the validity of
a search result. It estimates the number of matches with an equal or better score that would be
expected to turn up in the database just by chance. The higher the E-value, the more likely it is
that the result matches the query just by chance. The lower the E-value, the more significant the
search result. (As if this weren't confusing enough, the E-value is often written in scientific
notation, as a number followed by 'e' and a negative exponent of 10. While this exponent can be
quite big, e.g. 56, e-56 (10 to the power of -56) is a very small number and an E-value that you
would want to see for meaningful search results.)
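The length dependence of the score and the scientific notation of E-values can be illustrated with a minimal Python sketch. The scoring values used here (+1 per match, -3 per mismatch, -5 per gap position) and the function name `raw_score` are assumptions chosen for illustration; they are not necessarily the values used by the BLAST server behind Gene Boy.

```python
def raw_score(query, subject, match=1, mismatch=-3, gap=-5):
    """Score two already-aligned strings of equal length; '-' marks a gap.

    The scoring values are illustrative assumptions, and every gap
    position is charged the same penalty (a simple linear gap cost).
    """
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            score += gap
        elif q == s:
            score += match
        else:
            score += mismatch
    return score


primer = "TTAACTCCACCATTAGCACC"

# A perfect match of a 20-nucleotide query and a perfect match of a
# 60-nucleotide query are equally "good", but the longer one scores higher.
print(raw_score(primer, primer))
print(raw_score(primer * 3, primer * 3))

# E-values are printed in scientific notation: "1e-56" means 1 x 10**-56.
e_values = ["4.2", "1e-56", "0.003"]
for text in sorted(e_values, key=float):  # most significant hit first
    print(text, "=", float(text))
```

Sorting by the numeric value puts the hit with the lowest E-value, i.e. the most significant one, at the top of the list.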
Bioinformatics - Pairwise alignments to determine homology
Sequence homology can be determined through aligning the sequences and scoring them. If the
score differs significantly from the degree of similarity one could expect if the sequences were
just random, homology can be inferred.
Homologous sequences share a common ancestor. Since they diverged from this ancestor,
both sequences have undergone changes. The number of these changes and, therefore, the
degree of similarity between the sequences are correlated with the number of generations that
have passed since the two sequences diverged. If sequences diverged recently they are likely to
be very similar; conversely, if sequences are very similar they are probably closely related. If
sequences diverged very far in the past, they might be quite different. Consequently, sequences
that are highly different might not be homologous at all, or they might be homologous without
sequence similarity alone being able to show it. In the following example, determine the
similarity between genic sequences for proteins that have the same function but are derived from
different organisms. The sequences are human alpha globin 1, mouse alpha globin 1, and lupine
leghemoglobin, a plant-derived oxygen-binding protein.
1. Determine the degree of identity between these two sequences by calculating the
percentage of identical nucleotides:
Hs hba >gcctggggtaaggtcggcgcgcacgctggcgagtatggtgcggaggccctggagagg<
        |||||||| ||| | || |  || | ||  || ||||| || || |||||||| |||
Mm hba >gcctgggggaagattggtggccatggtgctgaatatggagctgaagccctggaaagg<
2. Do you think these two genes share a common ancestor?
3. Now calculate the identity between these two genic sequences:
Hs hba >gcgagtatggtgcggaggccctggagaggtgaggctccctcccctgctcc<
          || | |  |||  |    |   | |       |    ||  |  ||
L lhb  >aagaatttaatgcaaatattcctaaaaacacccaccgtttcttcaccttg<
4. Do you think these two genes share a common ancestor?
5. How does it change your thinking looking at this alignment of the two proteins for the
two genes?
6. The occurrence of hemoglobin is not limited to red blood cells. Legume plants such as
clover, pea, beans, and many others are able to synthesize a form of hemoglobin when
they undergo a symbiosis with nitrogen-fixing bacteria, Rhizobia. Rhizobia are able to
capture atmospheric nitrogen, N2, by reducing it to ammonia, NH3. During the
establishment of the symbiosis, the plant host develops new organs, nodule-like
structures on its roots, within which it isolates and houses the bacteria. The plant
receives nitrogenous compounds from the bacterial partner, and can grow independently
from fertilizer. The bacterial partner receives carbon compounds from the plant host. The
bacterial enzyme that reduces atmospheric nitrogen is called nitrogenase; it is extremely
oxygen-sensitive. On the other hand, nitrogen-fixation requires a large amount of
energy, ATP, which depends on the availability of oxygen. In order to accommodate this
paradox, legumes synthesize in their nodules a form of hemoglobin leading to a pinkish
to dark red hue within the nodules. This form of hemoglobin is called leghemoglobin
since it is synthesized in legumes. Leghemoglobin binds the free oxygen around the
bacteria in the nodules and protects their nitrogenase from being destroyed. On the
other hand, it presents the oxygen to the bacterial symbionts, which use it to satisfy their
ATP needs.
7. Comparing nucleotide sequences does not always give you a good idea about the
relatedness of two different functional structures such as hemoglobin. Comparing the
protein structures gave a much more accurate clue about the similarity between the
proteins. Therefore, in order to understand the relatedness of proteins you have to not
only look at the genes but also at the amino acid sequences to determine their similarity.
What is the percentage of identical amino acids in this alignment between mouse and
human hemoglobin?
Human alpha globin 1 >GKVGAHAGEYGAEALER<
                      || | |  |||||||||
Mouse hemoglobin     >GKIGGHGAEYGAEALER<
8. Amino acids differ from nucleotides in that similarity and identity are distinguished,
because amino acids can be grouped according to their physicochemical
properties such as size, charge, hydrophobicity etc. (see image, web page). Just by
looking at the image below, it is obvious that leucine and valine are more similar to each
other than either is to histidine.
Thus, amino acid sequence alignments are analyzed by a) determining the percentage of
identical amino acids as % identity, and b) determining how many amino acids are
identical plus how many represent substitutions by similar ones, and expressing the
result as % similarity. Groups of similar amino acids are as follows (as provided by the
ClustalW site at the European Bioinformatics Institute, EBI):
Small + hydrophobic + aromatic: A,V,F,P,M,I,L,W;
Acidic: D,E;
Basic: R,H,K;
Hydroxyl + Amine + Basic: S,T,Y,H,C,N,G,Q.
9. How similar are the two sequences if similar amino acids are labeled with a '+'?
Human alpha globin 1 >GKVGAHAGEYGAEALER<
                      ||+| |  |||||||||
Mouse hemoglobin     >GKIGGHGAEYGAEALER<
10. Now determine the degree of identity between the human and the legume sequence:
Hs Hba >VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPH<
         |       ||  |    |                 |  |  |
L Lghb >ALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSF<
11. How similar are the two sequences if similar amino acids are labeled with a '+'?
Hs Hba
L Lghb
>VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFP<
+|+
|| |
+|+
+
+ ++ + | | |
>ALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSF<
12. Since proteins are composed of 20 different amino acids, one would expect a
random identity of about 5% (1 in 20, assuming all amino acids are equally frequent)
between two entirely different amino acid sequences. The amino acid sequences of human
and legume hemoglobins are significantly more similar than 5%, and it can be safely
assumed that the two sequences are derived from a common ancestor; these two sequences
are homologous. The amount of similarity between two sequences can be used to estimate
the point in time when they split from each other: the more different two related
sequences are, the longer ago they split.
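The identity and similarity percentages asked for in this exercise can also be computed with a few lines of code. The following Python sketch assumes that the two sequences are already aligned and of equal length, without gaps; the function names are hypothetical. The similarity groups are copied from the list given in step 8, in which histidine (H) appears in two groups, so two residues count as similar here if they share at least one group.

```python
# Similarity groups as listed in step 8 (note that H occurs in two groups).
GROUPS = ["AVFPMILW", "DE", "RHK", "STYHCNGQ"]


def is_similar(a, b):
    """True if two different residues share at least one similarity group."""
    return any(a in group and b in group for group in GROUPS)


def marker_line(seq1, seq2):
    """Build the line drawn between two aligned sequences:
    '|' for identical positions, '+' for similar ones, ' ' otherwise."""
    return "".join(
        "|" if a == b else "+" if is_similar(a, b) else " "
        for a, b in zip(seq1, seq2)
    )


def percent_identity(seq1, seq2):
    """Percentage of aligned positions carrying the same letter."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    return 100.0 * marker_line(seq1, seq2).count("|") / len(seq1)


def percent_similarity(seq1, seq2):
    """Percentage of positions that are identical or similar."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    markers = marker_line(seq1, seq2)
    return 100.0 * (markers.count("|") + markers.count("+")) / len(seq1)


human = "GKVGAHAGEYGAEALER"
mouse = "GKIGGHGAEYGAEALER"
print(human)
print(marker_line(human, mouse))
print(mouse)
print(round(percent_identity(human, mouse), 1), "% identity")
print(round(percent_similarity(human, mouse), 1), "% similarity")
```

`percent_identity` works on nucleotide alignments as well, since it only compares letters. Because `marker_line` applies the groups mechanically, it may mark more positions with '+' than a hand-annotated alignment does.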
Hort 503: Bioinformatics for Research Exercise III
Bioinformatics - Finding Genes, Intro
Genes are DNA sequence stretches that are transcribed into RNA. Some RNAs, messenger
RNAs or mRNAs, serve as templates that are translated into amino acid sequences. Other types
of RNA perform tasks as RNA molecules, such as transfer RNA or tRNA, ribosomal RNA or rRNA,
small nuclear RNA or snRNA, small interfering RNA or siRNA, and small nucleolar RNA or snoRNA.
For the purposes of this course we will be looking at genes whose transcripts are translated
into proteins.
DNA makes RNA makes protein: this is the central dogma of molecular biology. This is
generally true, even for retroviruses. Retroviruses consist of RNA which is reverse
transcribed into DNA by means of an enzyme called reverse transcriptase. However, this is not
a contradiction of the Central Dogma - it is just a necessary first step in the process to express
the viral genes, because only after they have been reverse transcribed into DNA can these genes
be used to synthesize proteins - by way of the Central Dogma. The transcription of RNA follows
this schema:
DNA Watson strand  5'-nnnnnCATGCTGACGCAGTCGCTAGTCTGAAnnnnn-3'  (equivalent to RNA)
DNA Crick strand   3'-nnnnnGTACGACTGCGTCAGCGATCAGACTTnnnnn-5'  (template for RNA)
RNA                5'-nnnnnCAUGCUGACGCAGUCGCUAGUCUGAAnnnnn-3'  (transcript of DNA)
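The schema above can be reproduced in a few lines of Python. This is a minimal sketch using the sequence from the schema without the flanking n's; the function names are hypothetical.

```python
# Translation table that maps each DNA base to its complementary base.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(dna):
    """Complementary strand, written in the same left-to-right order
    (i.e. 3'->5' if the input is written 5'->3')."""
    return dna.upper().translate(COMPLEMENT)


def reverse_complement(dna):
    """Complementary strand written 5'->3'."""
    return complement(dna)[::-1]


def transcribe(coding_strand):
    """The RNA carries the sequence of the coding (Watson) strand,
    with uracil (U) in place of thymine (T)."""
    return coding_strand.upper().replace("T", "U")


watson = "CATGCTGACGCAGTCGCTAGTCTGAA"      # 5'->3', equivalent to the RNA
print("Watson 5'-" + watson + "-3'")
print("Crick  3'-" + complement(watson) + "-5'")
print("RNA    5'-" + transcribe(watson) + "-3'")
```

`reverse_complement` is not needed to print the schema, but it is the operation required whenever the second strand has to be read in its own 5'->3' direction.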
Bioinformatics provides two fundamentally different methods to find genes. The first one
uses sequence alignments to search databases for a highly similar sequence
which has already been annotated and points to a potential gene. This could result in the finding
that the gene has already been identified. Or it could yield a similar gene, the finding of which
may significantly speed up research. Finally, it may end up yielding no significant match, in
which case one would have to follow the second approach to finding genes in silico: finding
genes from scratch. Finding genes from scratch is called ab initio or de novo gene prediction. It
is based on the observations and experiences derived from comparing many known genes and
identifying common sequence features. Once some common sequence features, motifs, or
consensus sequences have been identified, they are searched for to identify new genes in
as yet un-annotated DNA. Sequence features associated with functional aspects of genes are open
reading frames (ORFs), promoters, splice sites, polyA-signals, and start and stop codons. Another
feature, this one related to the structural aspect of genes, is the occurrence of specific
nucleotides and nucleotide combinations in genic vs. non-genic regions.
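Of the features named above, the open reading frame is the simplest one to search for. The following Python sketch scans all six reading frames (three on the given strand, three on its reverse complement) for stretches that begin with ATG and end at the first in-frame stop codon. It is only an illustration of the principle under these assumptions, not the algorithm used by Gene Boy; the function name `find_orfs` and the `min_codons` threshold are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def find_orfs(dna, min_codons=2):
    """Return a list of (strand, frame, start, end) tuples.

    start/end are 0-based positions on the scanned strand, end exclusive,
    and the ORF includes its stop codon. An ORF must hold at least
    `min_codons` codons before the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None  # position of the first ATG since the last stop
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame + 1, start, i + 3))
                    start = None
    return orfs


# The Watson strand from the transcription schema contains one short ORF.
for strand, frame, start, end in find_orfs("CATGCTGACGCAGTCGCTAGTCTGAA"):
    print(strand, frame, start, end, end - start)
```

In real genomic DNA, a much larger `min_codons` value would be needed, since short ORFs occur by chance in any sequence.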
Bioinformatics - ORF Finding: Finding a human gene
Purpose: to identify a gene in human DNA by finding ORFs, and to determine the nature of the
potential gene.
ORF finding
1. Open Gene Boy at http://www.dnai.org/geneboy/index.html
2. Select 'Clear'
3. On the left, under 'Sequences', select 'Genic 1'
4. How long is the sequence?
   o On the right, under 'Operations', select 'Find Genes', then 'ORF'
   o Below the window select 'Reverse'
   o Explain what you see on the screen answering these questions:
   o What do the three horizontal bars represent?
   o What do the yellow and green boxes represent?
   o What constitutes an ORF?
   o Why are you searching for ORFs?
   o What does 'Reverse' do?
   o Why is this function needed?
5. Transfer the data from both windows into the table below:
Frame    From    To    Length    Organism    Gene
6.
Gene verification and identification
1. Go to the DNAi website at http://www.dnai.org/index.htm
2. Select 'Genome', then 'Genome Mining', then 'Gene Finding'
3. Click the forward icon until you arrive at frame #9
4. Compare your ORFs with the map for the gene. Did you find the ORF shown in the
map?
5. What are UTRs?
6. Why did the ORF finder not find the UTRs?
7. How could you find out what function the different genes have?
Gene function: Find out what the gene is by using Genic 1 for a BLAST search.
1. On Gene Boy select 'Clear', then select 'Genic 1'
2. Select 'WWW Tools', then 'Sequence Search'
3. Select 'Format' and wait for your results to come up
4. On the results page determine which matches are the best, the ones with low E-values
or high E-values? (Use the BLAST FAQs link to answer this question)
5. Scroll down to results #7, 8, or 9 - what gene is mentioned there?
6. Clicking on the gi- number on the left side of a hit links you to the database (GenBank)
entry for this hit. Determine the function of the gene.
Bioinformatics - Finding Spliced Genes
The majority of genes in eukaryotes are spliced genes. This means that the coding
sequence is interrupted by non-coding sequences. These interspersed non-coding
sequences are called introns; the expressed sequences they separate are called exons. In
order to restore an open reading frame that can be translated into a protein, the
precursor mRNA (pre-mRNA), consisting of the entire sequence between transcription
start site and transcription stop site, undergoes a process called splicing. During splicing,
so-called spliceosomes attach to the pre-mRNA molecule, often while it is still being
generated by transcription. Then the introns are cut out, and the exons are
connected (spliced together). This process generates a molecule that may contain some
untranslated regions (UTRs) at the beginning and the end, but internally it contains a
continuous ORF - ready to be translated into an amino acid sequence. Thus, even spliced
genes have ORFs - on the level of the mRNA molecule.
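The effect of splicing on the reading frame can be mimicked with a short Python sketch. The toy pre-mRNA below is invented for illustration (exons in upper case, the intron in lower case), and the exon coordinates are simply given; in a real gene they would have to be predicted or derived from a cDNA alignment.

```python
def splice(pre_mrna, exons):
    """Join the exon segments of a pre-mRNA.

    `exons` is a list of (start, end) pairs, 0-based with end exclusive.
    Everything between consecutive exons is treated as an intron and dropped.
    """
    return "".join(pre_mrna[start:end] for start, end in sorted(exons))


# Hypothetical toy transcript: two exons separated by one 12-nucleotide intron.
pre_mrna = "AUGGCC" + "guaaguuuucag" + "GAAUAA"
exons = [(0, 6), (18, 24)]

mrna = splice(pre_mrna, exons)
print(mrna)

# After splicing, the start codon (AUG) and the stop codon (UAA) lie in one
# continuous reading frame, which they do not in the unspliced sequence.
codons = [mrna[i:i + 3] for i in range(0, len(mrna), 3)]
print(codons)
```

The same function accepts DNA as well as RNA letters, since it only cuts and joins strings.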
ORF finders are not suitable to identify spliced genes. In spliced genes the true start
and stop codons are separated by introns and thus do not generally form part of a single
ORF in the genomic sequence. The first coding exon does contain a start codon, but no stop
codon that would follow the start codon in frame. Otherwise, translation would stop right
there and the gene would not be a spliced gene. Internal exons may contain methionine codons,
and they may also contain stop codons. But the methionine codon would not code for a start
but for a methionine which is internal to the final protein product. And, while internal exons
may contain stop codons in two reading frames, the third one has to remain free of stop codons.
Finally, terminal exons may or may not contain ATGs, but these would not be start codons.
So, applying ORF-finding to the identification of spliced genes is almost hopeless. Except ...
When ORF-finders are applied to mRNA or cDNA they should be able to pick up the
coding sequence for a particular splice form of the gene. Once the ORF has been identified,
the cDNA sequence can be aligned to the genomic DNA, revealing the position of introns
(non-aligning) and exons (aligning).
A variety of tools is available to identify spliced genes. Gene prediction is geared to
assign to a raw DNA sequence like this:
a structure like this:
Region                        Typical features
Upstream intergenic region
Promoter                      e.g. TATA
First exon                    Transcriptional start, 5'-UTR, translational start, CDS/ORF & enhancer sites
Intron(s)                     Frequent stop codons
Exon(s)                       CDS/ORF & enhancer sites
Intron(s)                     Frequent stop codons
Last exon                     CDS/ORF & enhancer sites, translational stop, 3'-UTR, polyA insertion site, transcriptional stop
Downstream intergenic region
De novo gene prediction programs probe sequence features such as the distribution of
nucleotide pairs (CpG), triplets, and heptamers to determine regions which are likely to
contain genes. Other programs predict individual sequence features such as splice sites,
promoters, polyA-signals, etc., providing data that can be used to piece the gene together.
Alignment-based gene prediction methods search sequence databases for sequences which
are similar to the sequence awaiting annotation.
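As an illustration of what probing the distribution of nucleotide pairs can mean in practice, the following Python sketch computes the GC content and the observed/expected ratio of CpG dinucleotides, overall and in sliding windows. The ratio is calculated as (number of CG x length) / (number of C x number of G), a commonly used measure; the function names, window size and step are assumptions, and actual gene prediction programs combine many more signals.

```python
def cpg_stats(dna):
    """Return (GC fraction, observed/expected CpG ratio) for a sequence."""
    dna = dna.upper()
    n = len(dna)
    c = dna.count("C")
    g = dna.count("G")
    cpg = dna.count("CG")  # C directly followed by G on the same strand
    gc_fraction = (c + g) / n if n else 0.0
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp


def sliding_windows(dna, size=200, step=100):
    """Yield (window start, stats) along the sequence.

    Sequences shorter than `size` are reported as a single window.
    """
    last_start = max(len(dna) - size, 0)
    for start in range(0, last_start + 1, step):
        yield start, cpg_stats(dna[start:start + size])


# Hypothetical toy sequence: a CpG-rich stretch followed by an AT-rich one.
toy = "CGCGGCGCCGCG" * 20 + "ATTTAATACATT" * 20
for start, (gc, ratio) in sliding_windows(toy, size=120, step=120):
    print(start, round(gc, 2), round(ratio, 2))
```

Windows with a high GC fraction and a high CpG ratio stand out from the rest of the sequence, which is the kind of contrast such programs exploit.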
Bioinformatics - Finding Spliced Genes: ORFs
Purpose: to identify a gene in human DNA by finding ORFs.
ORF Finding
1. Open Gene Boy at http://www.dnai.org/geneboy/index.html
2. Select 'Genic 2'
3. How long is this sequence?
4. Select 'Find Genes' and 'ORF'
5. Below the window select 'Reverse'
6. Explain what you see on the screen
7. Transfer the data from the screens into the table below:
Frame    From    To    Length    Organism    Gene
8.
9. Did you detect any gene in the sequence?
Gene verification and identification Find out whether the identified ORF is in fact the gene
in Genic 2 by translating the sequence into its amino acid equivalent and comparing it to the
amino acid sequence of the gene.
1. Open Gene Boy, press 'Clear'
2. Select 'Genic 2'
3. Copy and paste the 'Genic 2' sequence - numbers, empty spaces, nucleotides
4. Come back to the course website and open 'Translator' (in the bar to the right or here)
5. Paste the sequence into the window of the translation tool
6. Check 'Show Aminoacids'
7. Select 'go'
8. Print out the DNA/Amino acid sequence, reconstruct the gene/protein by cutting out
the paper stretches and adding them head-to-tail using scissors and glue.
9. In order to include the complementary strand in this examination use the 'Transform
Sequence' function in GeneBoy to generate the reverse complement of the sequence.
Then, translate the reverse complement using the 'Translator' link on the right.
10. Identify the coding portion of the gene by highlighting in the forward strand and/or the
complementary strand of the reconstructed gene the nucleotide stretches that code for
the amino acid sequences in this view of the protein encoded by Genic 2
11. What strikes you about the structure of the gene?
12. Compare this map for the gene in Genic 2 with the result of 'Find ORFs' (repeat if
necessary). Which portion of the gene did ORF-finder detect?
13. Why would ORF-finders not be able to detect the entire gene?
14. Given the way cells process the information in genes, at what point could you have
successfully applied ORF-finding to identify the coding portion of the gene?
15. Click the forward icon until you arrive at frame #12
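What the 'Translator' tool does in the steps above can be sketched in Python. The sketch below first strips the numbers and spaces that are copied along with the sequence and then translates each of the three forward reading frames using the standard genetic code. The function names are hypothetical, and the course's own tool may present its output differently.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def clean_sequence(text):
    """Remove position numbers, spaces and line breaks from a pasted sequence."""
    return "".join(ch for ch in text if ch.isalpha()).upper()


def translate(dna, frame=0):
    """Translate one reading frame (0, 1 or 2); '*' marks a stop codon,
    'X' a codon containing letters other than A, C, G, T."""
    dna = dna.upper().replace("U", "T")
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


pasted = "1 catgctgacg cagtcgctag 21 tctgaa"
sequence = clean_sequence(pasted)
for frame in range(3):
    print("frame +%d:" % (frame + 1), translate(sequence, frame))
```

To cover the complementary strand as in step 9, the same function can be applied to the reverse complement of the sequence.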
Gene features Try to predict the gene in Genic 2 by identifying individual gene features.
1. Get the sequence: Open Gene Boy, press 'Clear', select 'Genic 2', highlight and copy the
'Genic 2' sequence - numbers, empty spaces, nucleotides
2. Determine the start and the end of the gene by using bioinformatics tools that
determine transcriptional start sites (beginning) and polyA-tail signals in DNA
sequences.
   o Here is the output of a program which predicts transcriptional start sites.
     Viewing the output, determine where DNA transcription may start in your
     sequence, and thus the 5'-region of the gene. Determine the location of the
     transcriptional start sites which are predicted by both programs and which
     have the highest scores by comparing the nucleotide positions and the scores
     provided by the different programs. (If you would like to run the programs for
     yourself, highlight and copy the DNA sequence. Then, use the CSHL Human
     Core Promoter Predictor and the UC Berkeley Promoter Predictor to predict
     transcriptional start sites by pasting the sequence into the input window for
     each program and running the program. In order to examine the second
     strand of the sequence, generate its reverse complement through the
     'Transform Sequence' function in Gene Boy.)
   o Here is the output of a program which predicts polyA-tail insertion signals.
     Viewing the output, determine the region where the polyA-tail may be inserted,
     and thus the 3'-region of the gene. (If you would like to run the programs for
     yourself, highlight and copy the DNA sequence. Then, use the CSHL PolyASignal
     Predictor to predict polyA-insertion signals by pasting the sequence into
     the input window of the program and running the program.)
   o Mark in your worksheet the promoter region, and the prospective translational
     start and polyA-signal sites. Which direction is the gene most likely transcribed
     in?
3. Discern exons and introns by identifying splice sites (i.e. the borders
between exons and introns):
   o Here is the output of a program which predicts splice sites. Viewing the output,
     determine the two models that would incorporate the findings above about
     promoters and polyA-signal regions of the gene. (If you would like to run the
     programs for yourself, highlight and copy the DNA sequence. Then, use the
     Splice site prediction program at UC Berkeley. Paste the sequence into the
     input window of the program and change the donor score cutoff to 0.88 and
     acceptor score cutoff to 0.94 (donor site at beginning of intron, acceptor site at
     end of intron). Then run the program.)
   o What does the output page tell you about the prediction mechanism?
   o Use the output to determine the predicted splice site positions and record the
     results in the worksheet.
   o From the total of six splice site predictions build a couple of alternative maps
     that show how exons and introns could follow each other in this gene.
4. Determine how the predicted splice sites align with open reading frames:
Relate the predicted splice sites to ORFs. Try Gene Boy or NCBI's ORF Finder here to
complete this task.
5. Determine which of the predicted splice sites border exons and introns: For
each splice site examine where exactly it is preceded (donor sites) or followed
(acceptor sites) by a stop codon utilizing the Translation Tool in Sequence Utilities,
checking 'Show only start and stop codons'. Make sure you examine all three reading
frames +1, +2, and +3, for each splice site. (Hint: start out with identifying the open
read-through/coding sequence - CDS - in the internal exon. Then determine the CDS in
the last exon and the translational stop site. Then figure out the first exon and the
translational start site ATG.)
6. Characterize the gene by determining its length, exons, introns, splice sites,
promoter, etc.
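To get a feeling for why splice site predictors report scores rather than plain positions, candidate sites can be listed with a very simple rule. The Python sketch below assumes the canonical GT...AG rule (most introns begin with GT and end with AG), which is not stated in the text above and is far cruder than the models used by the prediction programs; the function names and the toy sequence are hypothetical.

```python
def candidate_splice_sites(dna):
    """Return (donors, acceptors) under the GT...AG assumption.

    A donor is the 0-based position of the G of a GT (first base of a
    possible intron); an acceptor is the position just after an AG
    (first base of the following exon).
    """
    dna = dna.upper()
    donors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "GT"]
    acceptors = [i + 2 for i in range(len(dna) - 1) if dna[i:i + 2] == "AG"]
    return donors, acceptors


def candidate_introns(dna, min_length=4):
    """Pair every donor with every downstream acceptor that leaves an
    intron of at least `min_length` nucleotides."""
    donors, acceptors = candidate_splice_sites(dna)
    return [(d, a) for d in donors for a in acceptors if a - d >= min_length]


# Hypothetical toy gene: exon 1, a 12-nucleotide intron, exon 2.
toy_gene = "ATGGCC" + "GTAAGTTTTCAG" + "GAATAA"
print(candidate_splice_sites(toy_gene))
for start, end in candidate_introns(toy_gene):
    spliced = toy_gene[:start] + toy_gene[end:]
    print(start, end, spliced)
```

Even in this short sequence the rule yields several candidate introns, only one of which is the real one. Checking which spliced product keeps an open reading frame, as in step 5, is one way to narrow the candidates down.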
Bioinformatics - Finding Spliced Genes: Gene Prediction
Purpose: to identify a gene in human DNA through gene predictions.
Gene prediction
1. On Gene Boy select 'Clear', then select 'Genic 2'
2. Highlight and copy the sequence, incl. digits and spaces
3. Select WWW Tools, then Gene Prediction
4. Select FgenesH or GenScan
5. On the website of the prediction tool find the input window for the sequence
6. Paste the sequence into the window (Ctrl-v) and submit for analysis by clicking the left
button below the input window
7. Compare the result with the map in frame #12 of Genome Mining in DNAi. How
accurate are the gene predictions?
Bioinformatics - Finding Spliced Genes: Function
Purpose: to identify the nature of a gene in human DNA by searching for similar sequences in
sequence databases.
Gene function: In order to find out what the gene is submit 'Genic 2' to a BLAST search.
1. On Gene Boy select 'Clear', then select 'Genic 2'
2. Select 'WWW Tools', then 'Sequence Search'
3. Select 'Format' and wait for your results to come up
4. On the results page determine which matches are the best, the ones with low E-values
or high E-values? (Use the BLAST FAQs link to answer this question)
5. Scroll down to the match listing - what gene is mentioned there?
6. How could you find out which of these genes is the one in 'Genic 2'?
7. Clicking on the gi- number on the left side of a hit links you to the database (GenBank)
entry for this hit. Determine the function of the gene.
Bioinformatics - Finding Spliced Genes: Location
Purpose: to learn about a human gene by studying its location in the human genome.
Gene location: Find 'Genic 2' in the human genome.
1. On Gene Boy select 'Clear', then select 'Genic 2'
2. Select 'WWW Tools', then 'Genome View'
3. Select 'Format' and wait for your results to come up
4. Select 'Genome View'
5. What chromosome has 'Genic 2' been located on?
6. For a closer view select the number underneath the respective chromosome.
7. Select 'Maps & Options', 'Remove' everything with the exception of 'Gene' from 'Maps
displayed'
8. Select 'Gene', click 'Toggle Ruler'
9. Click 'Apply', close this window, and begin zooming into the chromosome.
10. What genes do you find surrounding the 'Genic 2' locus? What gene cluster can you
find in there?