CIS 667
Bioinformatics
Cleveland State University
Department of Computer and
Information Science
Fall 2003
What is Bioinformatics?
• Field of science in which biology, computer
science, and information technology merge to
form a single discipline
 Historically, creation/maintenance of biological
sequence databases was important
• Biology is being transformed from a purely
lab-based science to an information
science as well
What is Bioinformatics?
• Three important sub-disciplines
 Development of new algorithms and statistical
methods to analyze relationships among
members of large data sets
 Analysis and interpretation of various types of
data (nucleotide and amino acid sequences,
protein structures, etc.)
 Development/implementation of tools for
efficient access/management of various types of data
Why now?
• Recent advances in molecular biology and
genomic technologies have led to explosive
growth in the amount of biological
information generated
• Requires computerized databases to
store/organize/index data and specialized
tools to view and analyze data
What skills should a Bioinformatician have?
• Deep background in some area of molecular
biology
• Understand the central dogma of molecular
biology
• Substantial experience with at least one or two
major packages
• Experience working in a command-line
computing environment
• Experience with both high-level and scripting
languages
Others…
• Molecular Evolution
• Physical chemistry
• Statistics and probability
• Database design
• Algorithm development
• Molecular biology lab methods
What will we learn?
• Central dogma of molecular biology +
other necessary biology background
• Working in a Unix command-line
environment
• Programming in Perl
• Algorithms for molecular biology
• Hands-on experience with bioinformatics
tools
Molecular Biology
• Primarily concerned with two basic
molecules of all living things:
 Proteins
 Structural proteins are tissue building blocks while
enzymes catalyze chemical reactions
 Proteins are chains of amino acids
Example Amino Acid
[Figure: an amino acid. The alpha carbon (C) is bonded to an amino group (H2N), a carboxy group (COOH), a hydrogen atom (H), and a side chain (here CH3).]
Amino Acids
• There are 20 naturally occurring amino
acids
 Amino acids can be identified by a 3-letter
code (and sometimes by 1-letter code)
 In a protein, amino acids are joined by peptide
bonds (C from carboxy group binds to N from
amino group)
 A water molecule is liberated so we speak of
residues in the chain
Amino Acids
Name             One-letter code   Three-letter code
Alanine          A                 Ala
Cysteine         C                 Cys
Aspartic Acid    D                 Asp
Glutamic Acid    E                 Glu
Phenylalanine    F                 Phe
Glycine          G                 Gly
Histidine        H                 His
Isoleucine       I                 Ile
Lysine           K                 Lys
Leucine          L                 Leu
Methionine       M                 Met
Asparagine       N                 Asn
Proline          P                 Pro
Glutamine        Q                 Gln
Arginine         R                 Arg
Serine           S                 Ser
Threonine        T                 Thr
Valine           V                 Val
Tryptophan       W                 Trp
Tyrosine         Y                 Tyr
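The two code columns of the table above can be used as a lookup. The following is a minimal Python sketch, not part of the original slides; the names `ONE_TO_THREE` and `to_three_letter` are hypothetical and chosen only for illustration.

```python
# One-letter -> three-letter codes, copied from the amino acid table.
ONE_TO_THREE = {
    "A": "Ala", "C": "Cys", "D": "Asp", "E": "Glu", "F": "Phe",
    "G": "Gly", "H": "His", "I": "Ile", "K": "Lys", "L": "Leu",
    "M": "Met", "N": "Asn", "P": "Pro", "Q": "Gln", "R": "Arg",
    "S": "Ser", "T": "Thr", "V": "Val", "W": "Trp", "Y": "Tyr",
}


def to_three_letter(protein):
    """Spell out a chain of residues (one-letter codes) with three-letter codes."""
    return "-".join(ONE_TO_THREE[residue] for residue in protein.upper())


print(to_three_letter("MKV"))  # Met-Lys-Val
```

A one-letter string is the form most sequence files use; the three-letter form is easier to read aloud.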
Proteins
• Typical protein contains about 300
residues
• Chains have an amino group at one end
and a carboxy group at the other, giving
the chain an orientation (start - end)
• The sequence of residues in the chain is
called the protein’s primary structure
Proteins
• Proteins fold in three dimensions resulting
in secondary, tertiary, quaternary
structures
• The two most common secondary
structures are the α-helix and the β-sheet
Secondary Structure
• Only a small number of patterns are common
• Patterns are formed by regular intramolecular
hydrogen bonding
Proteins
• The specific shape that a protein folds into
determines its unique function
 Different shapes mean the protein can bind to
different molecules
• Proteins are produced in a cell structure
called a ribosome
 Amino acids are added one after the other in
the sequence coded by a messenger
ribonucleic acid (mRNA) molecule
Ribosomes
[Figure: a ribosome, showing the large subunit and the small subunit]
Nucleic Acid
• Two types of nucleic acids
 Ribonucleic acid (RNA)
 Deoxyribonucleic acid (DNA)
• DNA, like protein, is a chain of simpler
molecules, but double stranded
 Each strand consists of a chain of nucleotides
Nucleic Acids
• Each nucleotide consists of
 A sugar molecule
 A phosphate residue
 A base
• The sugar molecule has five carbon atoms
labeled 1’ - 5’
 The 3’ carbon of one nucleotide is bound to the 5’
carbon of the next nucleotide in the chain giving an
orientation to the chain
 5’ is the start and 3’ is the end
Nucleic Acids
• The chain of sugar/phosphate groups
forms the backbone of a strand of DNA
• Attached to each 1’ carbon in the
backbone is a molecule called a base
 There are four different bases
 Adenine (A)
 Guanine (G)
 Cytosine (C)
 Thymine (T)
DNA
• DNA molecules are double strands
 The strands form a double helix
 The strands are held in the helix form by
bonds between complementary bases in the
two strands
 A and T are complements
 G and C are complements
 We refer to the paired bases as base pairs
(bp) and use base pairs as the unit of length
of DNA molecules
DNA Double Helix
DNA
• DNA can be considered as a string of
letters from the set {A, T, C, G}
 5’ … TACTGAA … 3’
• The other strand connected to this one is
antiparallel and complementary
 3’ … ATGACTT … 5’
• Note that the orientations of the two
strands are opposite
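Treating DNA as a string makes the complement rule easy to check. The following Python sketch is an illustration added to these notes (the function names are hypothetical); it reproduces the two example strands from the slide.

```python
# Base-pairing rules: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand):
    """Pair each base; the result reads in the opposite orientation (3' to 5')."""
    return "".join(COMPLEMENT[base] for base in strand)


def reverse_complement(strand):
    """The other strand written in the usual 5' to 3' direction."""
    return complement(strand)[::-1]


top = "TACTGAA"                  # 5' ... TACTGAA ... 3'
print(complement(top))           # ATGACTT, read 3' to 5'
print(reverse_complement(top))   # TTCAGTA, read 5' to 3'
```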
DNA
• Given one of the strands, we can infer the
other strand
 One of the strands can act as a template for
the construction of the other
 This property allows for cell division and
replication with each new cell containing a
copy of the DNA from the original cell
• Complementary base pairs are held
together by hydrogen bonds
DNA
• In higher organisms, DNA is found inside the cell
nucleus
 Also in cell organelles called mitochondria (plants and
animals) and chloroplasts (plants only)
• The DNA is found in a few very long DNA
molecules called chromosomes
RNA
• RNA molecules are similar to DNA, but
 Have a different sugar
 Have the base uracil (U) instead of thymine (T)
 U binds with A, as does T
 RNA does not form a double helix
 Hybrid DNA-RNA helices may form
 Parts of an RNA molecule may bind to other parts of the
same molecule by complementarity
 Three-dimensional structure is variable (compare
Protein)
Central Dogma of Molecular Biology
• Information stored in DNA is used to make
a transient RNA
 Process is called transcription accomplished
through use of enzyme RNA polymerase
• The RNA is used to make proteins
 Process is called translation and is performed
by ribosomes
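As a string operation, transcription can be sketched in Python as below. This is an illustration only: the terms "coding strand" and "template strand" do not appear on the slides and are assumed here, and the function names are hypothetical. RNA polymerase reads the template strand, so the RNA it produces matches the other (coding) strand with U in place of T.

```python
def transcribe(coding_strand):
    """RNA with the same sequence as the coding strand, with U in place of T."""
    return coding_strand.upper().replace("T", "U")


def transcribe_from_template(template_3_to_5):
    """Build the RNA by pairing each template base (given 3' to 5')."""
    pair = {"A": "U", "T": "A", "G": "C", "C": "G"}
    return "".join(pair[base] for base in template_3_to_5.upper())


# The two strands of the earlier example: 5' TACTGAA 3' and 3' ATGACTT 5'.
print(transcribe("TACTGAA"))                # UACUGAA
print(transcribe_from_template("ATGACTT"))  # UACUGAA
```

Both calls give the same RNA because the two inputs are the two strands of one double helix.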
RNA Transcription
Genes and the Genetic Code
• Each protein in an organism is
specified by a contiguous stretch of DNA
called a gene
 Remember that the DNA is contained in a
small number of molecules called
chromosomes
 Not all of the DNA specifies some protein
 Some genes code for RNA products
Gene Expression
• Gene expression is the process of using
the information stored in DNA to make an
RNA molecule and then a protein
 RNA polymerases must
 determine the start of genes
 determine whether the protein coded by a gene is
needed at the present moment
 Start of a gene is marked by a promoter sequence
of 13 nucleotides (why 13 and not, e.g., 1?)
Gene Expression
• How does the RNA polymerase then tell if
a protein should now be produced?
 Specific regulatory genes produce proteins
capable of binding to a cell’s DNA near the
promoter sequence of a gene they control in
some circumstances
 Positive regulation when binding makes RNA
polymerase initiation of transcription easier,
negative regulation when harder
Genetic Code
• A gene codes the sequence of amino
acids needed to form a protein
• 20 aa > 4 bases, so more than one
base is needed to specify an aa
 4^2 = 16 < 20 but 4^3 = 64 > 20, so 3 bases suffice
 Each sequence of 3 bases (a codon) codes
for an amino acid (with 3 exceptions)
 Three codons cause translation to end and
are called stop codons
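The counting argument for a codon length of 3 can be checked directly. This short Python sketch is an added illustration; `codon_length_needed` is a hypothetical helper name.

```python
def codon_length_needed(n_bases=4, n_amino_acids=20):
    """Smallest word length whose number of possible words covers all amino acids."""
    length = 1
    while n_bases ** length < n_amino_acids:
        length += 1
    return length


for k in (1, 2, 3):
    print(k, 4 ** k)            # 1 4, then 2 16, then 3 64

print(codon_length_needed())    # 3
```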
Genetic Code
• Since 64 > 20, more than one codon must code
for some amino acid(s)
• In fact, 18 of the 20 amino acids are coded for
by more than one codon
• The genetic code is therefore a degenerate code
 Errors in transcription may not cause the wrong aa to
be produced (especially if the error is in the 3rd
nucleotide)
 Even if the wrong aa is produced due to a single
error, a similar aa is likely to be produced
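The degeneracy claim can be verified by counting codons per amino acid. The Python sketch below is an added illustration, not part of the slides. It assumes the standard genetic code, written in DNA letters (T rather than U) with bases ordered T, C, A, G; the names `CODON_TABLE` and `is_silent` are hypothetical.

```python
from collections import Counter
from itertools import product

# Standard genetic code; '*' marks the three stop codons.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Count how many codons specify each amino acid (stop codons excluded).
codons_per_aa = Counter(aa for aa in CODON_TABLE.values() if aa != "*")
degenerate = sorted(aa for aa, n in codons_per_aa.items() if n > 1)
single = sorted(aa for aa, n in codons_per_aa.items() if n == 1)
print(len(degenerate), single)  # 18 ['M', 'W']


def is_silent(codon, position, new_base):
    """True if changing one base leaves the coded amino acid unchanged."""
    mutated = codon[:position] + new_base + codon[position + 1:]
    return CODON_TABLE[codon] == CODON_TABLE[mutated]


print(is_silent("GGT", 2, "C"))  # True: GGT and GGC both code for Gly
print(is_silent("ATG", 2, "A"))  # False: ATG is Met, ATA is Ile
```

Only Met and Trp have a single codon, which matches the count of 18 given on the slide.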
Open Reading Frames
• One special start codon (AUG) marks the
spot where translation begins
• A sequence of codons is called a reading
frame
 A sequence of codons which begins with a
start codon and has no stop codons is called
an open reading frame (orf)
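A simple ORF scan over an mRNA string might look like the Python sketch below. It is an added illustration with two assumptions not stated on the slide: an ORF is reported only when a stop codon closes it, and start codons inside an already open ORF are ignored. The name `find_orfs` is hypothetical.

```python
START = "AUG"
STOP_CODONS = {"UAA", "UAG", "UGA"}


def find_orfs(mrna):
    """Return (start_index, orf) pairs, checking all three reading frames."""
    mrna = mrna.upper()
    orfs = []
    for frame in range(3):
        # Split the sequence into complete codons for this frame.
        codons = [mrna[i:i + 3] for i in range(frame, len(mrna) - 2, 3)]
        open_at = None  # codon index of the start codon of the ORF being read
        for index, codon in enumerate(codons):
            if open_at is None and codon == START:
                open_at = index
            elif open_at is not None and codon in STOP_CODONS:
                orfs.append((frame + 3 * open_at, "".join(codons[open_at:index])))
                open_at = None
    return orfs


print(find_orfs("GGAUGAAACCCUAGAUGC"))  # [(2, 'AUGAAACCC')]
```

The second AUG in the example is never followed by a stop codon, so under the assumptions above it is not reported.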
Prokaryotes and Eukaryotes
• Living organisms may be classified as
either prokaryote (bacteria) or eukaryote
(higher organisms like yeast, plants,
people)
 The cells of eukaryotes have a nucleus while
prokaryotes don’t
 DNA is linear in eukaryotes and circular in
prokaryotes
Prokaryotes and Eukaryotes
Introns and Exons
• In prokaryotes, the mRNA copies of the
genes correspond directly to the DNA
sequence in the genome (with U
substituted for T)
• In eukaryotes, the mRNA is carried outside
the nucleus before translation
 The mRNA is modified by splicing out
sequences of introns and rejoining the exons
that flank them
Introns and Exons
• Splicing is controlled by enzyme
complexes called spliceosomes
 Incorrect splicing leads to frame shifts or
premature stop codons which make the
resulting protein useless
 The position of introns is signalled by several
specific sequences of nucleotides
 Since there is more than one sequence we can
have alternative splicing resulting in different
proteins being produced in different circumstances.
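Splicing can be sketched as cutting index ranges out of a string. The Python example below is an added illustration; the sequence is invented, the intron coordinates are given by hand rather than detected from signal sequences, and the name `splice` is hypothetical.

```python
def splice(pre_mrna, introns):
    """Remove introns, given as (start, end) half-open index pairs, and join the exons."""
    exons = []
    position = 0
    for start, end in sorted(introns):
        exons.append(pre_mrna[position:start])
        position = end
    exons.append(pre_mrna[position:])
    return "".join(exons)


# Invented example: exon 1, a 12-base intron, exon 2.
pre_mrna = "AUGGCC" + "GUAAGUUUUCAG" + "UUUUAA"

print(splice(pre_mrna, [(6, 18)]))  # AUGGCCUUUUAA (correct splice, 4 codons)
print(splice(pre_mrna, [(6, 17)]))  # AUGGCCGUUUUAA (one base off: frame shift)
```

The second call leaves one extra base behind, so every codon after the splice point is read in the wrong frame.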
Molecular Biology Tools
• A small set of laboratory techniques is
used by molecular biologists to identify the
information content of organisms so that it
can be processed using bioinformatics
methods
Restriction Enzyme Digests
• Restriction enzymes can be used to cut
DNA molecules wherever a particular
sequence occurs
 Digesting a DNA molecule and observing how
many fragments occur gives some insight into
the organization and sequence of that DNA
 This is called restriction mapping
 Allows isolation and experimentation of
individual genes for the first time
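A restriction digest can be simulated on a string. The Python sketch below is an added illustration under simplifying assumptions: a single strand of a linear molecule, and one enzyme. The defaults model EcoRI, which recognizes GAATTC and cuts between the G and the first A; the name `digest` and the example sequence are hypothetical.

```python
def digest(dna, site="GAATTC", cut_offset=1):
    """Cut dna at every occurrence of site; the cut falls cut_offset bases into the site."""
    fragments = []
    start = 0
    position = dna.find(site)
    while position != -1:
        cut = position + cut_offset
        fragments.append(dna[start:cut])
        start = cut
        position = dna.find(site, position + 1)
    fragments.append(dna[start:])
    return fragments


fragments = digest("AAGAATTCGGGGGAATTCTT")
print(fragments)                     # ['AAG', 'AATTCGGGGG', 'AATTCTT']
print([len(f) for f in fragments])   # [3, 10, 7]
```

The fragment lengths are what a restriction map is built from: two sites in a linear molecule give three fragments.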
Restriction Enzyme Digests
Gel Electrophoresis
• We can separate the fragments of DNA
obtained by restriction enzymes with gel
electrophoresis
 DNA fragments are negatively charged, so an
electric field pulls them through a gel
towards the positive electrode
 Larger fragments do not move as quickly, so
this provides a way to separate the fragments
by size
Gel Electrophoresis
Blotting and Hybridization
• To study a single fragment, DNA is
transferred from the gel to a piece of paper
or cloth (blotting)
 The DNA fragments are then permanently
attached to the membrane using (e.g.) UV
light
 A specially prepared labeled fragment of DNA
(a probe) is allowed to base pair with the
fragments to try to find a specific fragment
Blotting and Hybridization
• The probe is tagged using (e.g.) a fluorescent
dye (hybridization)
 Then determine where on the membrane base pairing
has occurred
• DNA chip or microarray techniques are similar
 Thousands of nucleotide sequences are affixed to
portions of a small silica chip
 A large number of probes are washed over the chip
and a laser is used to find which probes bind to which
sequences
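Probe binding by base pairing can be sketched as a string search for the probe's reverse complement. The Python example below is an added illustration that assumes perfect matches only (real hybridization tolerates some mismatches); the function names and sequences are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(strand):
    """Complementary strand, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))


def binding_sites(target, probe):
    """Positions on the 5'-to-3' target where the probe can base pair perfectly."""
    site = reverse_complement(probe)
    hits = []
    position = target.find(site)
    while position != -1:
        hits.append(position)
        position = target.find(site, position + 1)
    return hits


print(binding_sites("GGTACTGAACC", "TTCAGTA"))  # [2]
print(binding_sites("GGTACTGAACC", "AAAAAAA"))  # []
```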
DNA Chip
Cloning
• A large amount of DNA material is typically
required for analysis
 In cloning, specific DNA fragments are
inserted into chromosome-like carriers called
vectors in living cells
 The identical copies of the fragments are
called molecular clones and can be stored in
libraries for later study
 Vectors are derived from bacteria and yeast
chromosomes
Cloning
Polymerase Chain Reaction
• Polymerase Chain Reaction (PCR) is an
alternative to cloning for amplifying DNA
fragments
 DNA fragments are heated to separate them into
single strands
 Short probes (primers) are added to bind to the
ends of the portion of DNA to be amplified
 DNA polymerase grows the strands from the
primers
 The process repeats
Polymerase Chain Reaction
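The PCR steps can be sketched in Python as follows. This is an added illustration under idealized assumptions: primers match perfectly, and every cycle exactly doubles the number of copies. The function names and sequences are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(strand):
    """Complementary strand, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))


def amplicon(template, forward_primer, reverse_primer):
    """Region of the 5'-to-3' template bounded by the two primers, or None."""
    start = template.find(forward_primer)
    if start == -1:
        return None
    # The reverse primer binds the other strand, so look for its reverse complement.
    end_site = reverse_complement(reverse_primer)
    end = template.find(end_site, start + len(forward_primer))
    if end == -1:
        return None
    return template[start:end + len(end_site)]


def copies_after(cycles, starting_copies=1):
    """Idealized copy count: each cycle doubles the amplified fragment."""
    return starting_copies * 2 ** cycles


print(amplicon("TTACGTGGGCCCAATTGG", "ACGT", "ATTG"))  # ACGTGGGCCCAAT
print(copies_after(30))                                # 1073741824
```

The doubling is why a few dozen cycles are enough to go from one molecule to a usable amount of DNA.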