Download Powerpoint slides - SEAS - The George Washington University

CS 177 Introduction to Bioinformatics
Fall 2004
• Instructor: Anna Panchenko
([email protected])
• Instructor: Tom Madej
([email protected])
• Co-Instructor: Rahul Simha
([email protected])
Lecture 1: Introduction
• Instructors
• Course goals
• Grading policy
• Motivating problem
• Course overview
• Molecular basis of cellular processes
• Historical timeline
Course Goals
• The student will be introduced to the fundamental
problems and methods of bioinformatics.
• The student will become thoroughly familiar with on-line
public bioinformatics databases and their available
software tools.
• The student will acquire a background knowledge of
biological systems so as to be able to interpret the
results of database searches, etc.
• The student will also acquire a general understanding of
how important bioinformatics algorithms/software tools
work, and how the databases are organized.
Grading Policy
• Homework: 50%, weekly assignments
• Final exam: 50%
“All examinations, papers, and other graded
work products and assignments are to be
completed in conformance with: The George
Washington University Code of Academic
Integrity”.
Optional Texts
P.E. Bourne and H. Weissig
(2003), Structural Bioinformatics,
Wiley & Sons.
What is Bioinformatics?
• A merger of biology, computer science, and information
technology.
• Enables the discovery of new biological insights and
unifying principles.
• Born from necessity, because of the massive amount of
information required to describe biological organisms
and processes.
Severe Acute Respiratory Syndrome
(SARS)
• SARS is a respiratory illness caused by a previously
unrecognized coronavirus; it first appeared in southern
China in Nov. 2002.
• Between Nov. 2002 and July 2003, there were 8,098
cases worldwide and 774 fatalities (WHO).
• The global outbreak was over by late July 2003. A few
new cases have arisen sporadically since then in China.
• There is currently no vaccine or cure available.
Fig. 2 from Rota et al.
Phylogenetic analysis of coronavirus proteins
Fig. 2 from Rota et al.
Conserved motifs in coronavirus S proteins.
Fig. 2 from Rota et al.
Exercise!
Look up the SARS genome on the NCBI
website: www.ncbi.nlm.nih.gov
The (ever expanding) Entrez System
OMIM
PubMed
PubMed Central
3D Domains
Journals
Structure
Books
CDD/CDART
Entrez
Protein
Taxonomy
Genome
GEO/GDS
UniSTS
UniGene
Nucleotide
SNP
PopSet
Course Overview
Lecture 1: Introduction
• Instructors
• Grading policy
• Motivating problem
• Course overview
• Molecular basis of cellular processes
• Historical timeline
Lecture 2: General principles of DNA/RNA
structure and stability
• Physico-chemical properties of nucleic acids
• RNA folding and structure prediction
• Gene identification
• Genome analysis
Lecture 3: General principles of protein structure
and stability
• Physico-chemical properties of proteins
• Prediction of protein secondary structure
• Protein domains and prediction of domain boundaries
• Protein structure-function relationships
Lecture 4: Sequence alignment algorithms
• The alignment problem
• Pairwise sequence alignment algorithms
• Multiple sequence alignment algorithms
• Sequence profiles and profile alignment methods
• Alignment statistics
Lecture 5: Computational aspects of protein structure,
part I
• Protein folding problem
• Problem of protein structure prediction
• Homology modeling
• Protein design
• Prediction of functionally important sites
Lecture 6: Computational aspects of protein structure,
part II
• Structure-structure alignment algorithms
• Significance of structure-structure similarity
• Protein structure classification
Lecture 7: Bioinformatics databases
• Sequence and sequence alignment formats, data exchange
• Public sequence databases
• Sequence retrieval and examples
• Public protein structure databases
• Lab exercises
Lecture 8: Bioinformatics database search tools
• Sequence database search tools
• Structure database search tools
• Assessment of results, ROC analysis
• Lab exercises
Lecture 9: Phylogenetic analysis, part I
• Molecular basis of evolution
• Taxonomy and phylogenetics
• Phylogenetic trees and phylogenetic inference
• Software tools for phylogenetic analysis
Lecture 10: Phylogenetic analysis, part II
• Accuracies and statistical tests of phylogenetic trees
• Genome comparisons
• Protein structure evolution
Lecture 11: Experimental techniques for
macromolecular analysis
• Sequencing, PCR
• Protein crystallography
• Mass spectrometry
• Microarrays
• RNA interference
Lecture 12: Systems biology
• Genomic circuits
• Modeling complex integrated circuits
• Protein-protein interaction
• Metabolic networks
Lectures 13, 14: To be decided…
Molecular Biology Background
• Cells – general structure/organization
• Molecules – that make up cells
• Cellular processes – what makes the cell alive
Two Cell Organizations
• Prokaryotes – lack a nucleus, have a simpler internal structure,
and are generally much smaller
• Eukaryotes – with nucleus (containing DNA) and various
organelles
Selected organelles…
• Nucleus – contains chromosomes/DNA
• Mitochondria – generate energy for the cell, contain
mitochondrial DNA
• Ribosomes – where translation from mRNA to proteins
takes place (protein synthesis machinery)
• Lysosomes – where protein degradation takes place
Cells can become specialized…
Three domains of life
• Prokarya
Bacteria
Archaea
• Eukarya
Eukaryotes
Universal phylogenetic tree.
Fig. 1 from:
N.R. Pace, Science 276
(1997) 734-740.
Molecules in the cell
• Proteins – catalyze reactions, form structures, control
membrane permeability, cell signaling, recognize/bind
other molecules, control gene function
• Nucleic acids – DNA and RNA; encode information about
proteins
• Lipids – make up biomembranes
• Carbohydrates – energy sources, energy storage,
constituents of nucleic acids and surface membranes
• Other small molecules – e.g. ATP, water, ions, etc.
The Central Dogma of Molecular Biology
Exercise!
Retrieve a protein structure from the SARS
coronavirus from the NCBI website; you can use:
www.ncbi.nlm.nih.gov/Structure/
Look at the structure for the SARS protease using
Cn3D.
Timeline
1859 Darwin publishes On the Origin of Species…
1865 Mendel’s experiments with peas show that hereditary
traits are passed on to offspring in discrete units.
1869 Miescher isolates DNA.
1895 Röntgen discovers X-rays.
1902 Sutton proposes the chromosome theory of heredity.
Timeline (cont.)
1911 Morgan and co-workers establish the chromosome
theory of heredity, working with fruit flies.
1943 Astbury observes the first X-ray pattern of DNA.
1944 Avery, MacLeod, and McCarty show that DNA
transmits heritable traits (not proteins!).
1951 Pauling and Corey predict the structure of the alpha-helix and beta-sheet.
Timeline (cont.)
1953 Watson and Crick propose the double helix model for
DNA based on X-ray data from Franklin and Wilkins.
1955 Sanger announces the sequence of the first protein to
be analyzed, bovine insulin.
1955 Kornberg and co-workers isolate the enzyme DNA
polymerase (used for copying DNA, e.g. in PCR).
1958 The first integrated circuit is constructed by Kilby at
Texas Instruments.
Timeline (cont.)
1960 Perutz and Kendrew obtain the first X-ray structures
of proteins (hemoglobin and myoglobin).
1961 Brenner, Jacob, and Meselson discover that mRNA
transmits the information from the DNA in the nucleus to
the cytoplasm.
1965 Dayhoff starts the Atlas of Protein Sequence and
Structure.
1966 Nirenberg, Khorana, Ochoa and colleagues crack the
genetic code!
1970 The Needleman-Wunsch algorithm for sequence
comparison is published.
Timeline (cont.)
1972 Dayhoff develops the Protein Sequence Database
(PSD).
1972 Berg and colleagues create the first recombinant DNA
molecule.
1973 Cohen invents DNA cloning.
1975 Sanger and others (Maxam, Gilbert) invent rapid DNA
sequencing methods.
Timeline (cont.)
1977 The first complete genome sequence of an organism
(bacteriophage ΦX174) is published. The genome
consists of 5,386 bases coding 9 proteins.
1981 The Smith-Waterman algorithm for sequence
alignment is published.
1981 IBM introduces its Personal Computer to the market.
1982 The GenBank sequence database is created at Los
Alamos National Laboratory.
Timeline (cont.)
1983 Mullis and co-workers describe the PCR reaction.
1985 The FASTP algorithm is published by Lipman and
Pearson.
1986 The SWISS-PROT database is created.
1986 The Human Genome Initiative is announced by DOE.
1988 The National Center for Biotechnology Information
(NCBI) is established at the National Library of Medicine
in Bethesda.
Timeline (cont.)
1988 The FASTA algorithm for sequence comparison is
published by Pearson and Lipman.
1990 Official launch of the Human Genome Project.
1990 The BLAST program by Altschul et al. is published.
1991 The CERN research institute in Geneva announces
the creation of the protocols which make up the World
Wide Web.
Timeline (cont.)
1992 Human Genome Sciences, in Gaithersburg, MD, is
founded by Haseltine.
1992 The Institute for Genomic Research (TIGR) is
established by Venter in Rockville, MD.
1995 The Haemophilus influenzae genome is sequenced
(1.8 Mb).
1996 Affymetrix produces the first commercial DNA chips.
Timeline (cont.)
1996 The yeast genome is sequenced; the first complete
eukaryotic genome.
1996 Human DNA sequencing begins.
1997 The E. coli genome is sequenced (4.6 Mb, approx. 4k
genes).
1998 The C. elegans genome is sequenced (97 Mb,
approx. 20k genes); the first genome of a multicellular
organism.
Timeline (cont.)
1998 Venter founds Celera in Rockville, MD.
1998 The Swiss Institute of Bioinformatics is established in
Geneva.
1999 The HGP completes the first human chromosome
(no. 22).
2000 The Drosophila genome is completed.
Timeline (cont.)
2000 Human chromosome no. 21 is completed.
2001 A draft of the entire human genome (3,000 Mb) is
published.
2003 The Human Genome is “completed”! Approx. 30,000
genes (estimated).
DNA, RNA, protein overview
Questions about the genome in an organism:
How much DNA, how many nucleotides?
How many genes are there?
What types of proteins appear to be coded by these genes?
Questions about the proteome:
What proteins are present?
Where are they?
When are they present - under what conditions?
DNA, RNA, Mutations, Amino acids/protein structure
DNA overview
DNA: deoxyribonucleic acid
4 bases:
Purine (C5N4H4): A = Adenine, G = Guanine
Pyrimidine (C4N2H4): T = Thymine, C = Cytosine
Nucleoside: base + sugar (deoxyribose)
Nucleotide: base + sugar + phosphate
Numbering of carbons?
[Figure: nucleotide with a phosphate group (PO4) on the 5’ CH2 of the deoxyribose ring; ring carbons numbered 1’ to 4’, with OH at 3’ and H at 2’]
Linking nucleotides
Linking nucleotides:
The 3’-OH of one nucleotide is linked to the 5’-phosphate of the next nucleotide
What next?
Hydrogen bonds (N-H------N and N-H------O) join the bases of the two strands
[Figure: two antiparallel strands with 5’ and 3’ ends marked, about 2 nm across; thymine paired with adenine and cytosine paired with guanine]
Base pairing
Base pairing (Watson-Crick):
A/T (2 hydrogen bonds)
G/C (3 hydrogen bonds)
Always pairing a purine and a pyrimidine yields a constant width
DNA base composition:
A + G = T + C (Chargaff’s rule)
[Figure: double-stranded DNA with 5’ and 3’ ends marked, showing A-T and G-C base pairs]
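Chargaff’s rule follows directly from the pairing rules, which can be checked with a short Python sketch. This is an illustration only, not part of the original slides; the names PAIR, complement, and chargaff_holds are made up for the example.

```python
from collections import Counter

# Watson-Crick partners: A pairs with T, G pairs with C.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement(strand: str) -> str:
    """Return the base-by-base partner strand (same left-to-right order)."""
    return "".join(PAIR[base] for base in strand)

def chargaff_holds(strand: str) -> bool:
    """Count bases over both strands of the duplex and test A + G == T + C."""
    counts = Counter(strand + complement(strand))
    return counts["A"] + counts["G"] == counts["T"] + counts["C"]

top = "ATGCCGTA"                 # arbitrary example strand
print(complement(top))           # TACGGCAT
print(chargaff_holds(top))       # True
```

Because every purine (A, G) on one strand sits opposite a pyrimidine (T, C) on the other, the check returns True for any strand made only of A, C, G, and T.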
DNA conventions
1. DNA is a right-handed helix
2. The 5’ end is to the left by convention
5’ -ATCGCAATCAGCTAGGTT- 3’   sense (forward)
3’ -TAGCGTTAGTCGATCCAA- 5’   antisense (reverse)
The same duplex with the antisense strand drawn on top:
3’ -TAGCGTTAGTCGATCCAA- 5’
5’ -ATCGCAATCAGCTAGGTT- 3’
[Figure: the same two strands drawn vertically, with 5’ and 3’ ends marked]
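The convention above can be sketched in Python: complementing the sense strand gives the antisense strand as drawn (3’ to 5’), and reversing that string writes it 5’ to 3’. This is a minimal illustration; the names COMPLEMENT and reverse_complement are assumptions for the example, not something the slides define.

```python
# Translation table that swaps each base for its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Return the partner strand written in the 5' -> 3' direction."""
    return seq.translate(COMPLEMENT)[::-1]

sense = "ATCGCAATCAGCTAGGTT"                  # 5' -> 3', from the slide
antisense_3to5 = sense.translate(COMPLEMENT)  # as drawn under the sense strand
antisense_5to3 = reverse_complement(sense)    # same strand, read from its 5' end

print(antisense_3to5)  # TAGCGTTAGTCGATCCAA
print(antisense_5to3)  # AACCTAGCTGATTGCGAT
```

Applying reverse_complement twice returns the original sense strand.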
DNA structure
Some more facts:
1. Forces stabilizing DNA structure: Watson-Crick H-bonding and base stacking
(planar aromatic bases overlap geometrically and electronically → energy gain)
2. Genomic DNAs are large molecules:
Escherichia coli: 4.7 x 10^6 bp; ~ 1 mm contour length
Human: 3.2 x 10^9 bp; ~ 1 m contour length
3. Some DNA molecules are circular and have no free ends:
plasmids
mtDNA
bacterial DNA (only one circular chromosome)
4. Average gene of 1000 bp can code for average protein of about 330 amino acids
5. Percentage of non-coding DNA varies greatly among organisms
Organism          # Base pairs    # Genes             Non-coding DNA
small virus       4 x 10^3        3                   very little
‘typical’ virus   3 x 10^5        200                 very little
bacterium         5 x 10^6        3000                10 - 20%
yeast             1 x 10^7        6000                > 50%
human             3.2 x 10^9      30,000?             99%
amphibians        < 80 x 10^9     ?                   ?
plants            < 900 x 10^9    23,000 - >50,000    > 99%
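The contour lengths and the gene-to-protein estimate quoted above can be reproduced with simple arithmetic. The sketch below assumes a rise of 0.34 nm per base pair (the usual value for B-form DNA, which the slides do not state) and uses made-up function names.

```python
RISE_NM_PER_BP = 0.34  # assumed rise per base pair for B-form DNA

def contour_length_m(base_pairs: float) -> float:
    """Length of the stretched-out double helix, in metres."""
    return base_pairs * RISE_NM_PER_BP * 1e-9

def max_protein_length(gene_bp: int) -> int:
    """Upper bound on amino acids encoded: one residue per 3 bases."""
    return gene_bp // 3

print(f"E. coli: {contour_length_m(4.7e6) * 1e3:.1f} mm")  # about 1.6 mm
print(f"Human:   {contour_length_m(3.2e9):.2f} m")         # about 1.09 m
print(max_protein_length(1000))                            # 333
```

Both lengths agree with the slide's figures to within the rounding used there, and 1000 bp gives at most 333 residues, matching the "about 330" estimate.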
RNA structure
RNA: ribonucleic acid
3 major types of RNA
messenger RNA (mRNA); template for protein synthesis
transfer RNA (tRNA); adaptor molecules that decode the genetic code
ribosomal RNA (rRNA); catalyzing the synthesis of proteins
4 bases:
Purine (C5N4H4): A = Adenine, G = Guanine
Pyrimidine (C4N2H4): U = Uracil, C = Cytosine
Nucleoside: base + sugar (ribose)
Nucleotide: base + sugar + phosphate
[Figure: thymine (DNA) compared with uracil (RNA); nucleotide with a phosphate group (PO4) on the 5’ CH2 of the ribose ring, ring carbons numbered 1’ to 4’, with OH at both 3’ and 2’]
Base interactions in RNA
Base pairing:
A/U (2 hydrogen bonds; U takes the place of T)
G/C (3 hydrogen bonds)
RNA base composition:
A + G ≠ U + C
Chargaff’s rule does not apply (RNA usually occurs as a single strand)
RNA structure:
- usually single stranded
- many self-complementary regions → RNA commonly exhibits an intricate secondary structure
(relatively short, double helical segments alternating with single stranded regions)
- complex tertiary interactions fold the RNA into its final three dimensional form
- the folded RNA molecule is stabilized by interactions (e.g. hydrogen bonds and base stacking)
RNA structure
Primary structure
Secondary structure
A) single stranded regions: formed by unpaired nucleotides
B) duplex: double helical RNA (A-form with 11 bp per turn)
C) hairpin: duplex bridged by a loop of unpaired nucleotides
D) internal loop: nucleotides not forming Watson-Crick base pairs
E) bulge loop: unpaired nucleotides in one strand, other strand has contiguous base pairing
F) junction: three or more duplexes separated by single stranded regions
G) pseudoknot: tertiary interaction between bases of hairpin loop and outside bases
[Figure: RNA secondary structure with elements A to G labelled]
RNA structure
Primary structure
Secondary structure
Tertiary structure
[Figure: folded RNA with structural elements A to G labelled]
RNA structure
How to predict RNA secondary/tertiary structure?
Probing RNA structure experimentally:
- physical methods (single crystal X-ray diffraction, electron microscopy)
- chemical and enzymatic methods
- mutational analysis (introduction of specific mutations to test change in some
function or protein-RNA interaction)
Thermodynamic prediction of RNA structure:
- RNA molecules obey the laws of thermodynamics; therefore it should be
possible to deduce RNA structure from its sequence by finding the conformation
with the lowest free energy
- Pros: only one sequence required; no difficult experiments; does not rely on
alignments
- Cons: thermodynamic data are experimentally determined, but not always accurate;
possible interactions of the RNA with solvent, ions, and proteins are not taken into account
Comparative determination of RNA structure:
- basic assumption: the secondary structure of a functional RNA will be conserved during the
evolution of the molecule (at least more conserved than the primary structure);
when a set of homologous sequences has a certain structure in common, this structure can
be deduced by comparing the structures possible from their sequences
- Pros: very powerful in finding secondary structure, relatively easy to use, only sequences
required, not affected by interactions of the RNA with other molecules
- Cons: a large number of sequences is preferred, structural constraints in fully conserved
regions cannot be inferred, extremely variable regions cause problems with alignment
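The idea of searching all possible foldings of one sequence for the best one can be sketched with dynamic programming. The Python example below is a simplified stand-in, not the thermodynamic method itself: it is a Nussinov-style recursion that maximizes the number of nested base pairs instead of minimizing free energy, and it ignores pseudoknots. The function names, the inclusion of G-U wobble pairs, and the minimum hairpin loop of 3 unpaired bases are assumptions made for the illustration.

```python
def can_pair(a: str, b: str) -> bool:
    """Watson-Crick pairs plus the G-U wobble pair."""
    return {a, b} in ({"A", "U"}, {"G", "C"}, {"G", "U"})

def max_base_pairs(rna: str, min_loop: int = 3) -> int:
    """Maximum number of nested base pairs (Nussinov-style recursion).

    min_loop is the minimum number of unpaired bases inside a hairpin loop.
    """
    n = len(rna)
    if n == 0:
        return 0
    # best[i][j] = best pair count for the subsequence rna[i..j]
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i + 1][j]                # case 1: base i stays unpaired
            for k in range(i + min_loop + 1, j + 1):
                if can_pair(rna[i], rna[k]):      # case 2: base i pairs with base k
                    inside = best[i + 1][k - 1]
                    after = best[k + 1][j] if k < j else 0
                    score = max(score, 1 + inside + after)
            best[i][j] = score
    return best[0][n - 1]

print(max_base_pairs("GGGAAAUCCC"))  # 3
```

Real free-energy predictors use the same kind of table but score stacked pairs and loops with measured energy parameters, which is where the accuracy concerns listed under "Cons" for the thermodynamic method come in.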
Amino acids/proteins
The “central dogma” of modern biology: DNA → RNA → protein
Getting from DNA to protein:
Two parts: 1. Transcription, in which a short portion of chromosomal DNA is used to
make an RNA molecule small enough to leave the nucleus.
2. Translation, in which the RNA code is used to assemble the protein at the
ribosome.
The genetic code
- The code problem: 4 nucleotides in RNA, but 20 amino acids in proteins
- Bases are read in groups of 3 (= a codon)
- The code consists of 64 codons (4^3 = 64)
- All codons are used in protein synthesis:
- 61 codons specify the 20 amino acids
- 3 stop codons
- AUG (methionine) is the start codon (also used internally)
- The code is non-overlapping and punctuation-free
- The code is degenerate (but NOT ambiguous): most amino acids are specified by more than
one codon, but each codon specifies only one amino acid
- The code is universal (virtually all organisms use the same code)
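The two steps, transcription and translation, can be sketched in Python with the standard genetic code. This is a minimal illustration: the compact 64-letter string lists the amino acids (one-letter codes, * for stop) in the order UUU, UUC, UUA, UUG, UCU, ..., GGG, and the function names and example sequence are made up.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon, first base varying slowest.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def transcribe(sense_dna: str) -> str:
    """The mRNA carries the sense-strand sequence with U in place of T."""
    return sense_dna.replace("T", "U")

def translate(mrna: str) -> str:
    """Read non-overlapping codons from the first AUG until a stop codon."""
    start = mrna.find("AUG")
    if start < 0:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

print(len(CODON_TABLE))                      # 64
print(translate(transcribe("ATGTTTGGCTAA")))  # MFG
```

The example reads AUG UUU GGC UAA, giving methionine, phenylalanine, glycine, and then a stop.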
The genetic code
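Base 1  Base 2 = T       Base 2 = C       Base 2 = A       Base 2 = G       Base 3
T       Phenylalanine F  Serine S         Tyrosine Y       Cysteine C       T
T       Phenylalanine F  Serine S         Tyrosine Y       Cysteine C       C
T       Leucine L        Serine S         STOP             STOP             A
T       Leucine L        Serine S         STOP             Tryptophan W     G
C       Leucine L        Proline P        Histidine H      Arginine R       T
C       Leucine L        Proline P        Histidine H      Arginine R       C
C       Leucine L        Proline P        Glutamine Q      Arginine R       A
C       Leucine L        Proline P        Glutamine Q      Arginine R       G
A       Isoleucine I     Threonine T      Asparagine N     Serine S         T
A       Isoleucine I     Threonine T      Asparagine N     Serine S         C
A       Isoleucine I     Threonine T      Lysine K         Arginine R       A
A       Methionine M     Threonine T      Lysine K         Arginine R       G
G       Valine V         Alanine A        Aspartate D      Glycine G        T
G       Valine V         Alanine A        Aspartate D      Glycine G        C
G       Valine V         Alanine A        Glutamate E      Glycine G        A
G       Valine V         Alanine A        Glutamate E      Glycine G        G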
In-class exercise
1. Which amino acids are specified by single codons?
methionine and tryptophan
2. How many amino acids are specified by the first two nucleotides only?
five: proline, threonine, valine, alanine, glycine
3. What is the RNA code for the start codon?
AUG
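The answers to questions 1 and 2 can be checked by grouping the codons of the standard genetic code by amino acid. The Python sketch below is an illustration with made-up variable names; it uses the DNA alphabet (T in place of U) to match the table, and it reads question 2 as asking for amino acids whose codons all share the same first two bases.

```python
from collections import defaultdict
from itertools import product

BASES = "TCAG"
# Standard genetic code in the order TTT, TTC, TTA, TTG, TCT, ..., GGG (* = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

codons_for = defaultdict(list)
for codon, amino_acid in zip(product(BASES, repeat=3), AMINO):
    if amino_acid != "*":
        codons_for[amino_acid].append("".join(codon))

# Question 1: amino acids with exactly one codon.
single = sorted(aa for aa, codons in codons_for.items() if len(codons) == 1)

# Question 2: amino acids whose 4 codons all share one two-base prefix,
# so the third base never matters.
two_base = sorted(
    aa for aa, codons in codons_for.items()
    if len(codons) == 4 and len({c[:2] for c in codons}) == 1
)

print(single)    # ['M', 'W']
print(two_base)  # ['A', 'G', 'P', 'T', 'V']
```

The one-letter codes correspond to methionine and tryptophan for question 1, and to alanine, glycine, proline, threonine, and valine for question 2. Leucine, serine, and arginine also fill a complete four-codon box, but each has two further codons elsewhere, so they are not counted.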