CIS 667
Bioinformatics
Cleveland State University
Department of Computer and
Information Science
Fall 2003
What is Bioinformatics?
• Field of science in which biology, computer
science, and information technology merge to
form a single discipline
Historically, creation/maintenance of biological
sequence databases was important
• Biology is being transformed from a purely
lab-based science to an information
science as well
What is Bioinformatics?
• Three important sub-disciplines
Development of new algorithms and statistical
methods to analyze relationships among
members of large data sets
Analysis and interpretation of various types of
data (nucleotide and amino acid sequences,
protein structures, etc.)
Development/implementation of tools for
efficient access/mgmt. of various types of data
Why now?
• Recent advances in molecular biology and
genomic technologies have led to explosive
growth in the amount of biological
information generated
• Requires computerized databases to
store/organize/index data and specialized
tools to view and analyze data
What skills should a
Bioinformatician have?
• Deep background in some area of molecular
biology
• Understand the central dogma of molecular
biology
• Substantial experience with at least one or two
major packages
• Experience working in a command-line
computing environment
• Experience with both high-level and scripting
languages
Others…
• Molecular Evolution
• Physical chemistry
• Statistics and probability
• Database design
• Algorithm development
• Molecular biology lab methods
What will we learn?
• Central dogma of molecular biology +
other necessary biology background
• Working in a Unix command-line
environment
• Programming in Perl
• Algorithms for molecular biology
• Hands-on experience with bioinformatics
tools
Molecular Biology
• Primarily concerned with two basic
molecules of all living things:
Proteins
Structural proteins are tissue building blocks while
enzymes catalyze chemical reactions
Proteins are chains of amino acids
Example Amino Acid
[Figure: alanine. The alpha carbon (C) is bonded to an amino group (H2N), a carboxy group (COOH), a hydrogen (H), and a side chain (CH3).]
Amino Acids
• There are 20 naturally occurring amino
acids
Amino acids can be identified by a 3-letter
code (and sometimes by 1-letter code)
In a protein, amino acids are joined by peptide
bonds (C from carboxy group binds to N from
amino group)
A water molecule is liberated so we speak of
residues in the chain
Amino Acids
Name            One-letter code   Three-letter code
Alanine         A                 Ala
Cysteine        C                 Cys
Aspartic Acid   D                 Asp
Glutamic Acid   E                 Glu
Phenylalanine   F                 Phe
Glycine         G                 Gly
Histidine       H                 His
Isoleucine      I                 Ile
Lysine          K                 Lys
Leucine         L                 Leu
Methionine      M                 Met
Asparagine      N                 Asn
Proline         P                 Pro
Glutamine       Q                 Gln
Arginine        R                 Arg
Serine          S                 Ser
Threonine       T                 Thr
Valine          V                 Val
Tryptophan      W                 Trp
Tyrosine        Y                 Tyr
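The two codes in the table above can be held in a simple lookup. The following is a minimal sketch; the names `THREE_TO_ONE` and `to_one_letter` are made up for illustration and do not come from any bioinformatics package.

```python
# Three-letter to one-letter amino acid codes, taken from the table above.
THREE_TO_ONE = {
    "Ala": "A", "Cys": "C", "Asp": "D", "Glu": "E", "Phe": "F",
    "Gly": "G", "His": "H", "Ile": "I", "Lys": "K", "Leu": "L",
    "Met": "M", "Asn": "N", "Pro": "P", "Gln": "Q", "Arg": "R",
    "Ser": "S", "Thr": "T", "Val": "V", "Trp": "W", "Tyr": "Y",
}


def to_one_letter(residues):
    """Convert a list of three-letter codes into a one-letter string."""
    return "".join(THREE_TO_ONE[r] for r in residues)


print(to_one_letter(["Met", "Ala", "Gly", "Trp"]))  # MAGW
```

The one-letter form is the one usually used when a protein's primary structure is stored as a string.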
Proteins
• A typical protein contains about 300
residues
• Each chain has an amino group at one end
and a carboxy group at the other, giving
the chain an orientation (start - end)
• The sequence of residues in the chain is
called the protein’s primary structure
Proteins
• Proteins fold in three dimensions resulting
in secondary, tertiary, quaternary
structures
• The two most common secondary
structures are the α-helix and the β-sheet
Secondary Structure
• Only a small number of patterns are common
• Patterns formed by regular intramolecular
hydrogen bonding patterns
Proteins
• The specific shape that a protein folds into
determines its unique function
Different shapes mean the protein can bind to
different molecules
• Proteins are produced in a cell structure
called a ribosome
Amino acids are added one after the other in
the sequence coded by a messenger
ribonucleic acid (mRNA) molecule
Ribosomes
Large subunit
Small subunit
Nucleic Acid
• Two types of nucleic acids
Ribonucleic acid (RNA)
Deoxyribonucleic acid (DNA)
• DNA, like protein, is a chain of simpler
molecules, but double stranded
Each strand consists of a chain of nucleotides
Nucleic Acids
• Each nucleotide consists of
A sugar molecule
A phosphate residue
A base
• The sugar molecule has five carbon atoms
labeled 1’ - 5’
The 3’ carbon of one nucleotide is bound to the 5’
carbon of the next nucleotide in the chain giving an
orientation to the chain
5’ is the start and 3’ is the end
Nucleic Acids
• The chain of sugar/phosphate groups
forms the backbone of a strand of DNA
• Attached to each 1’ carbon in the
backbone is a molecule called a base
There are four different bases
Adenine (A)
Guanine (G)
Cytosine (C)
Thymine (T)
DNA
• DNA molecules are double strands
The strands form a double helix
The strands are held in the helix form by
bonds between complementary bases in the
two strands
A and T are complements
G and C are complements
We refer to the paired bases as base pairs
(bp) and use base pairs as the unit of length
of DNA molecules
DNA Double Helix
DNA
• DNA can be considered as a string of
letters from the set {A, T, C, G}
5’ … TACTGAA … 3’
• The other strand connected to this one is
antiparallel and complementary
3’ … ATGACTT … 5’
• Note that the orientations of the two
strands are opposite
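Treating a strand as a string, the complementary strand can be computed base by base. This is a minimal sketch using the example strand from the slide; the function names are illustrative only.

```python
# Base pairing rules: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand):
    """Complement each base, keeping the same left-to-right position."""
    return "".join(COMPLEMENT[base] for base in strand)


def reverse_complement(strand):
    """The other strand, written in its own 5' to 3' direction."""
    return complement(strand)[::-1]


top = "TACTGAA"                  # 5' ... TACTGAA ... 3'
print(complement(top))           # ATGACTT, read 3' to 5' as on the slide
print(reverse_complement(top))   # TTCAGTA, the same strand read 5' to 3'
```

Because the strands are antiparallel, sequence files normally store the reverse complement so that both strands are written 5' to 3'.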
DNA
• Given one of the strands, we can infer the
other strand
One of the strands can act as a template for
the construction of the other
This property allows for cell division and
replication with each new cell containing a
copy of the DNA from the original cell
• Complementary base pairs are held
together by hydrogen bonds
DNA
• In higher organisms, DNA is found inside the cell
nucleus
Also in cell organelles called mitochondria (plants and
animals) and chloroplasts (plants only)
• The DNA is found in a few very long DNA
molecules called chromosomes
RNA
• RNA molecules are similar to DNA, but
Have a different sugar
Have the base uracil (U) instead of thymine (T)
U binds with A, as does T
RNA does not form a double helix
Hybrid DNA-RNA helices may form
Parts of an RNA molecule may bind to other parts of the
same molecule by complementarity
Three-dimensional structure is variable (compare
Protein)
Central Dogma of Molecular
Biology
• Information stored in DNA is used to make
a transient RNA
Process is called transcription accomplished
through use of enzyme RNA polymerase
• The RNA is used to make proteins
Process is called translation and is performed
by ribosomes
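At the string level, transcription can be sketched in two equivalent ways. The assumptions here are loud ones: sequences are plain uppercase strings written 5' to 3', the whole string is transcribed, and the function names are hypothetical. RNA polymerase reads the template strand, so the mRNA has the same letters as the other (coding) strand with U in place of T.

```python
# Pairing of a DNA template base with the RNA base placed opposite it.
DNA_TO_RNA_PAIR = {"A": "U", "T": "A", "C": "G", "G": "C"}


def transcribe_coding(coding_strand):
    """mRNA from the coding strand: same sequence with U instead of T."""
    return coding_strand.replace("T", "U")


def transcribe_template(template_strand):
    """mRNA from the template strand (5' to 3'): pair each base, then reverse."""
    return "".join(DNA_TO_RNA_PAIR[base] for base in reversed(template_strand))


print(transcribe_coding("TACTGAA"))    # UACUGAA
print(transcribe_template("TTCAGTA"))  # UACUGAA, same mRNA from the opposite strand
```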
RNA Transcription
Genes and the Genetic Code
• Each protein in an organism is
specified by a contiguous stretch of DNA
called a gene
Remember that the DNA is contained in a
small number of molecules called
chromosomes
Not all of the DNA specifies some protein
Some genes code for RNA products
Gene Expression
• Gene expression is the process of using
the information stored in DNA to make an
RNA molecule and then a protein
RNA polymerases must
determine the start of genes
determine whether the protein coded by a gene is
needed at the present moment
The start of a gene is marked by a promoter
sequence of 13 nucleotides (why 13 and not, e.g., 1?)
Gene Expression
• How does the RNA polymerase then tell if
a protein should now be produced?
Specific regulatory genes produce proteins
capable of binding to a cell’s DNA near the
promoter sequence of a gene they control in
some circumstances
Positive regulation when binding makes RNA
polymerase initiation of transcription easier,
negative regulation when harder
Genetic Code
• A gene codes the sequence of amino
acids needed to form a protein
• 20 aa > 4 bases, so we need more than one
base to specify an aa
4² = 16 < 20 but 4³ = 64 > 20, so 3 bases suffice
Each sequence of 3 bases (a codon) codes
for an amino acid (with 3 exceptions)
Three codons cause translation to end and
are called stop codons
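The counting argument for codon length can be checked by enumeration. A minimal sketch, with `count_words` as an illustrative name:

```python
from itertools import product

BASES = "ACGU"       # the four RNA bases
AMINO_ACIDS = 20     # number of amino acids to encode


def count_words(length):
    """Number of distinct base sequences of the given length."""
    return len(list(product(BASES, repeat=length)))


for length in (1, 2, 3):
    n = count_words(length)
    print(length, n, "enough" if n >= AMINO_ACIDS else "too few")
```

Lengths 1 and 2 give 4 and 16 sequences, which is too few; length 3 gives 64, the first value that covers 20 amino acids.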
Genetic Code
• Since 64 > 20, more than one codon must code
for some amino acid(s)
• In fact, 18 of the 20 amino acids are coded for
by more than one codon
• The genetic code is therefore a degenerate code
Errors in transcription may not cause the wrong aa to
be produced (especially if the error is in the 3rd
nucleotide)
Even if the wrong aa is produced due to a single
error, a similar aa is likely to be produced
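The degeneracy counts can be verified from the standard genetic code. The sketch below builds the table from a compact 64-letter string (a common encoding of the standard code, written here in the DNA alphabet; it is not taken from the slides), with `*` marking stop codons.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Amino acids for codons TTT, TTC, TTA, TTG, TCT, ... GGG (standard code).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}

counts = Counter(CODON_TABLE.values())
stop_count = counts.pop("*")                      # remove stops from the tally
degenerate = sorted(aa for aa, n in counts.items() if n > 1)
single = sorted(aa for aa, n in counts.items() if n == 1)

print(stop_count)        # 3 stop codons
print(len(degenerate))   # 18 amino acids with more than one codon
print(single)            # ['M', 'W']: methionine and tryptophan


def translate(dna):
    """Translate codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGGCTTAA"))   # MA
print(translate("ATGGCCTAA"))   # MA, a third-position change with the same result
```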
Open Reading Frames
• One special start codon (AUG) marks the
spot where translation begins
• A sequence of codons is called a reading
frame
A sequence of codons which begins with a
start codon and has no stop codons is called
an open reading frame (orf)
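A simple way to look for open reading frames in an mRNA string is to start at each AUG and read codons until a stop codon appears. This is a minimal sketch under assumptions: the input is an uppercase RNA string, only one strand is scanned, and `find_orfs` is a made-up name rather than a tool from a real package.

```python
START = "AUG"
STOPS = {"UAA", "UAG", "UGA"}   # the three stop codons


def find_orfs(mrna):
    """Return (start_index, codons) for every AUG in the string.

    Codons are read in frame from the AUG up to, but not including,
    the first stop codon (or the end of the string if none is found).
    """
    orfs = []
    for i in range(len(mrna) - 2):
        if mrna[i:i + 3] != START:
            continue
        codons = []
        for j in range(i, len(mrna) - 2, 3):
            codon = mrna[j:j + 3]
            if codon in STOPS:
                break
            codons.append(codon)
        orfs.append((i, codons))
    return orfs


print(find_orfs("GGAUGGCUUAAGG"))   # [(2, ['AUG', 'GCU'])]
```

Real gene finders also scan the reverse complement and usually require a minimum length, which this sketch leaves out.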
Prokaryotes and Eukaryotes
• Living organisms may be classified as
either prokaryote (bacteria) or eukaryote
(higher organisms like yeast, plants,
people)
The cells of eukaryotes have a nucleus while
prokaryotes don’t
DNA is linear in eukaryotes and circular in
prokaryotes
Prokaryotes and Eukaryotes
Introns and Exons
• In prokaryotes, the mRNA copies of the
genes correspond directly to the DNA
sequence in the genome (with U
substituted for T)
• In eukaryotes, the mRNA is carried outside
the nucleus before translation
The mRNA is modified by splicing out
sequences of introns and rejoining the exons
that flank them
Introns and Exons
• Splicing is controlled by enzyme
complexes called spliceosomes
Incorrect splicing leads to frame shifts or
premature stop codons which make the
resulting protein useless
The position of introns is signalled by several
specific sequences of nucleotides
Since there is more than one sequence we can
have alternative splicing resulting in different
proteins being produced in different circumstances.
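Splicing can be sketched as cutting known intron intervals out of a string and joining what remains. The sequence and intron coordinates below are invented for illustration (the intron is in lowercase only to make it visible); real splice sites are recognized from sequence signals, which this sketch does not model.

```python
def splice(pre_mrna, introns):
    """Remove introns given as (start, end) half-open index pairs and join the exons."""
    exons = []
    position = 0
    for start, end in sorted(introns):
        exons.append(pre_mrna[position:start])   # exon before this intron
        position = end                           # skip over the intron
    exons.append(pre_mrna[position:])            # final exon
    return "".join(exons)


pre_mrna = "AUGGCC" + "guaaguccag" + "UUUUAA"    # exon, intron, exon (hypothetical)

correct = splice(pre_mrna, [(6, 16)])
print(correct, len(correct) % 3)    # AUGGCCUUUUAA 0

# Splicing one base too early leaves an extra base behind: a frame shift.
shifted = splice(pre_mrna, [(6, 15)])
print(shifted, len(shifted) % 3)    # AUGGCCgUUUUAA 1
```

Passing different intron lists for the same pre-mRNA is a crude model of alternative splicing.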
Molecular Biology Tools
• A small set of laboratory techniques are
used by molecular biologists to identify the
information content of organisms so that it
can be processed using bioinformatics
methods
Restriction Enzyme Digests
• Restriction enzymes can be used to cut
DNA molecules wherever a particular
sequence occurs
Digesting a DNA molecule and observing how
many fragments occur gives some insight into
the organization and sequence of that DNA
This is called restriction mapping
Allowed the isolation of, and experimentation
on, individual genes for the first time
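A digest can be simulated on a string by cutting at every occurrence of a recognition sequence. The sketch assumes a linear molecule, looks at one strand only, and uses the EcoRI site GAATTC (cut between G and A) as the default; the example DNA is invented.

```python
def digest(dna, site="GAATTC", cut_offset=1):
    """Cut dna at each occurrence of site, cut_offset bases into the site."""
    fragments = []
    start = 0
    pos = dna.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(dna[start:cut])
        start = cut
        pos = dna.find(site, pos + 1)   # look for the next occurrence
    fragments.append(dna[start:])       # piece after the last cut
    return fragments


fragments = digest("AAGAATTCCCGGGAATTCTT")
print(fragments)                         # ['AAG', 'AATTCCCGGG', 'AATTCTT']
print([len(f) for f in fragments])       # [3, 10, 7]
```

The list of fragment lengths is the kind of information a restriction map is built from.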
Restriction Enzyme Digests
Gel Electrophoresis
• We can separate the fragments of DNA
obtained by restriction enzymes with gel
electrophoresis
DNA fragments, which are negatively charged,
are pulled through a gel towards a positive electrical charge
Larger fragments do not move as quickly, so
this provides a way to separate the fragments
by size
Gel Electrophoresis
Blotting and Hybridization
• To study a single fragment, DNA is
transferred from the gel to a piece of paper
or cloth (blotting)
The DNA fragments are then permanently
attached to the membrane using (e.g.) UV
light
A specially prepared labeled fragment of DNA
(a probe) is allowed to base pair with the
fragments to try to find a specific fragment
Blotting and Hybridization
• The probe is tagged using (e.g.) a fluorescent
dye; its base pairing with a fragment is called hybridization
Then determine where on the membrane base pairing
has occurred
• DNA chip or microarray techniques are similar
Thousands of nucleotide sequences are affixed to
portions of a small silica chip
A large number of probes are washed over the chip
and a laser is used to find which probes bind to which
sequences
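Probe binding can be approximated as a search for the probe's reverse complement. This is a simplified sketch: it assumes perfect base pairing over the full probe length, and the spot names and sequences of the toy "chip" are invented.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq):
    """Complementary strand, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def binding_sites(target, probe):
    """Start indices in target where the probe can base pair perfectly."""
    site = reverse_complement(probe)
    return [
        i for i in range(len(target) - len(site) + 1)
        if target[i:i + len(site)] == site
    ]


print(binding_sites("GGTACTGAACC", "TTCAGTA"))   # [2]

# A toy chip: each spot holds one affixed sequence (hypothetical data).
chip = {"spot1": "GGTACTGAACC", "spot2": "CCCCGGGGAAAA"}
probe = "TTCAGTA"
print([spot for spot, seq in chip.items() if binding_sites(seq, probe)])   # ['spot1']
```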
DNA Chip
Cloning
• A large amount of DNA material is typically
required for analysis
In cloning, specific DNA fragments are
inserted into chromosome-like carriers called
vectors in living cells
The identical copies of the fragments are
called molecular clones and can be stored in
libraries for later study
Vectors are derived from bacteria and yeast
chromosomes
Cloning
Polymerase Chain Reaction
• Polymerase Chain Reaction (PCR) is an
alternative to cloning for amplifying DNA
fragments
DNA fragments are heated to break them into
single strands
Probes (short primers) are added to bind to the
portion of DNA to be amplified
DNA polymerase grows the strands from the
probes
The process repeats
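Since each cycle copies every strand present, the number of copies ideally doubles per cycle. A minimal sketch of that arithmetic, assuming perfect doubling in every cycle (real reactions fall short of this); `pcr_copies` is an illustrative name.

```python
def pcr_copies(initial_copies, cycles):
    """Ideal number of copies after the given number of PCR cycles."""
    return initial_copies * 2 ** cycles


for cycles in (1, 10, 20, 30):
    print(cycles, pcr_copies(1, cycles))
```

Under this idealization, a single fragment gives 2^30 (about a billion) copies after 30 cycles.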
Polymerase Chain Reaction