Molecular Biology Primer
COMP.3500/5800 Topics in Bioinformatics


What is bioinformatics?
• Study of DNA sequences, genomes, and proteins
• Modeling/Inference
What is involved in bioinformatics?
• Biology
• Statistics
• Computer Science
  – Algorithms
  – Machine learning
http://sphweb.bumc.bu.edu/otlt/MPHModules/PH/PH709_DNA-Genetics/
http://pages.uoregon.edu/aarong/teaching/G4074_Outline/node1.html

Why study bioinformatics?
• What makes us different? Human genomes are 99.9% identical
• How do different cells develop from the same genome?
• Study mutations in the genome -> drug development
Topics today
• Textbook, pgs. 16-19
• Genes to Proteins
  – Transcription
  – Translation
• DNA & RNA
• Genome
Genome
• One or more chromosomes that contain the code (genes) that directs the synthesis of proteins essential for an organism's structure and function
• Human: 22 pairs of homologous chromosomes plus the sex chromosomes (X, Y)
• http://www.ncbi.nlm.nih.gov/genome/?term=txid9606[orgn]
Genes
• Allele
  – Alternative forms of the same gene
  – Dominant, recessive
Central Dogma in Molecular Biology
• Info flows in one direction
• DNA/genome
  – A template or a roadmap
• RNA
  – Copies of genes to be expressed (activated)
• Protein
  – Biochemical molecules performing biological functions
Gene to Protein:
Transcription & Translation
Transcription
[Figure: sense and anti-sense strands]
Gene to Protein
[Figure labels: 5' UTR, protein-coding region, exon, intron, 3' UTR, intergenic non-protein-coding region, Protein 1, Protein 2]
Alternative Splicing
Translation
• Genetic Code
• A triplet (called codon)
• Ribosome moves along mRNA 3 bases at a time
• Degenerate coding
• 4x4x4=64 possible triplets into 20 Amino Acids
• For 8 amino acids the 3rd base is irrelevant, so third-position mutations in those codons are silent
• Anti-codon – reverse complement of a codon
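The translation rules above can be sketched in a few lines of Python. This is only an illustration, assuming the standard genetic code written in the common T, C, A, G ordering; the names `CODON_TABLE`, `translate`, and `fourfold_degenerate` are hypothetical helpers introduced here, not part of the course material.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Read a coding strand 3 bases at a time until a stop codon or the end."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def fourfold_degenerate():
    """Amino acids with a codon family in which the 3rd base does not matter."""
    result = set()
    for first, second in product(BASES, repeat=2):
        family = {CODON_TABLE[first + second + third] for third in BASES}
        if len(family) == 1:
            result |= family
    return result - {"*"}
```

For example, `translate("ATGGCCTAA")` reads ATG (M), GCC (A), and stops at TAA, and `fourfold_degenerate()` returns the 8 amino acids whose third base is irrelevant.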
Genetic Code
Amino Acids
• General structure of amino acids
  – an amino group
  – a carboxyl group
  – α-carbon bonded to a hydrogen and a side-chain group, R
• R determines the identity of a particular amino acid
• Figure color key
  – R: large white and gray
  – C: black
  – Nitrogen: blue
  – Oxygen: red
  – Hydrogen: white
DNA & RNA
DNA and RNA
• DNA (deoxyribonucleic acid) and RNA (ribonucleic acid) are composed of linear chains of monomeric units called nucleotides
• A nucleotide has three parts: a sugar, a phosphate, and a base
• Four bases
Base Types
• Nucleic acid bases are of two types
  – Pyrimidine [pairímədì:n]: C, T, U (two nitrogens in a 6-member ring at positions 1 and 3)
  – Purine: A, G (a pyrimidine ring fused to an imidazole ring, C3H4N2)
[Figure: nucleotide codes R, Y, W, M, K, B, S, V, D, H, and N, each shown against the four bases A, G, T, C]
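The single-letter codes R, Y, W, M, K, B, S, V, D, H, and N are the standard IUPAC ambiguity codes, where R stands for a purine and Y for a pyrimidine. A minimal sketch of how they can be used, assuming DNA bases only (no U); the `IUPAC` dictionary and `matches` helper are names made up for this illustration.

```python
# IUPAC nucleotide ambiguity codes mapped to the DNA bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG",   # puRine
    "Y": "CT",   # pYrimidine
    "W": "AT",   # Weak pairing (2 hydrogen bonds)
    "S": "CG",   # Strong pairing (3 hydrogen bonds)
    "M": "AC",   # aMino
    "K": "GT",   # Keto
    "B": "CGT",  # not A
    "D": "AGT",  # not C
    "H": "ACT",  # not G
    "V": "ACG",  # not T
    "N": "ACGT", # any base
}


def matches(pattern, seq):
    """True if every base of seq is allowed by the code at the same position."""
    if len(pattern) != len(seq):
        return False
    return all(base in IUPAC[code]
               for code, base in zip(pattern.upper(), seq.upper()))
```

For instance, `matches("RY", "AC")` is true (purine followed by pyrimidine), while `matches("RY", "CA")` is false.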
Primary Structure of DNA and RNA
• Nucleotides are joined by phosphodiester bonds and form a sugar-phosphate backbone
• The sugar is deoxyribose in DNA (left) and ribose in RNA (right)
• Nitrogen-containing nucleobases are bonded to the sugar
Online course on Biology
• Educational Portal
  – DNA chemical structure
  – http://education-portal.com/academy/lesson/dna-and-thechemical-structure-of-nucleic-acids.html
Secondary Structure
• Double helix: 1953, Watson and Crick, using X-ray diffraction
• Sugar-phosphate backbone is the outer part of the helix
• Two strands run in antiparallel directions
• Dimensions
  – Inside diameter of backbone: 11 Å (1.1 nm)
  – Outer diameter: 20 Å (1 Å = 10^-10 m = 0.1 nm)
  – Length of one complete turn: 34 Å, 10 base pairs
  – Major and minor grooves: where drugs or polypeptides bind to DNA
Secondary Structure of DNA
• Two strands are complementary
  – Base pairing: A-T; G-C
  – A pyrimidine and a purine form complementary hydrogen bonds
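Because the two strands are complementary and antiparallel, the partner strand can be computed from either one. The sketch below assumes plain A/C/G/T input; `complement` and `reverse_complement` are illustrative helper names.

```python
# Watson-Crick pairing: A pairs with T, G pairs with C.
PAIRING = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(strand):
    """Base-by-base partner, written in the same left-to-right order."""
    return strand.translate(PAIRING)


def reverse_complement(strand):
    """Partner strand read 5'->3', i.e. the complement read backwards."""
    return complement(strand)[::-1]
```

For example, `reverse_complement("ATGC")` is `"GCAT"`. The same operation turns a codon into its anti-codon.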
Monomer counts in DNA
• In double strands
  – # of A = # of T; # of G = # of C
  – Erwin Chargaff’s 1st Parity Rule, 1951
• In a single strand?
  – # of A ≈ # of T; # of G ≈ # of C (approximately, for sufficiently long strands)
  – Erwin Chargaff’s 2nd Parity Rule
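The first parity rule follows directly from base pairing, which a short count makes concrete. This is a toy check on a made-up sequence; `duplex_counts` is a hypothetical helper, and the second rule is an empirical observation on long genomic strands that a short example cannot demonstrate.

```python
from collections import Counter

PAIRING = str.maketrans("ACGT", "TGCA")


def duplex_counts(strand):
    """Base counts over both strands of the double helix built on `strand`."""
    partner = strand.translate(PAIRING)[::-1]
    return Counter(strand) + Counter(partner)


single = Counter("AAGCT")          # one strand: A and T counts may differ
double = duplex_counts("AAGCT")    # both strands: A == T and G == C
```

Here the single strand has 2 A but 1 T, while the duplex has 3 A, 3 T, 2 G, and 2 C.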
Importance of Hydrogen Bonding
• Many consider the hydrogen bond essential to the evolution of life
• An individual hydrogen bond is weak, but many H bonds collectively exert a very strong force
• The orderly, repetitive arrangement of H bonds in polymers determines their shape
Online course on Biology
• Educational Portal
  – Four bases
  – http://education-portal.com/academy/lesson/dna-adenineguanine-cytosine-thymine-complementary-base-pairing.html
Chromosome Length
• 3.4 Å per base
• 3 billion bases
  – About 1.8 meters of DNA
  – About 0.09 mm of chromatin after being wound on histones
• Five families of histones
  – H1/H5, H2A, H2B, H3, and H4
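The order of magnitude of the DNA length can be checked from the two numbers on this slide. A rough sketch, assuming 3.4 Å of helix length per base pair and 3 billion base pairs per copy of the genome; the variable names are illustrative.

```python
ANGSTROM_M = 1e-10                 # 1 Å in meters
RISE_PER_BASE_M = 3.4 * ANGSTROM_M # helix length per base pair
HAPLOID_BASES = 3e9                # one copy of the genome

haploid_length_m = RISE_PER_BASE_M * HAPLOID_BASES  # one copy
diploid_length_m = 2 * haploid_length_m             # two copies per cell

print(f"one copy:   {haploid_length_m:.2f} m")
print(f"two copies: {diploid_length_m:.2f} m")
```

Under these assumptions one copy comes to about 1 m and the two copies in a diploid cell to about 2 m, the same order as the quoted 1.8 meters.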
RNA
• The sugar in an RNA nucleotide is ribose rather than 2'-deoxyribose
• Thymine is replaced by uracil (U)
• RNA polymers are usually a few thousand nucleotides or shorter
• RNA in cells is usually single-stranded
• RNA is considered to be the original gene-coding material, and it still encodes genes in a few viruses
RNA Types
• Four RNAs are involved in protein synthesis

RNA Type          | Size        | Function
Transfer RNA      | Small       | Transports amino acids to protein synthesis sites
Ribosomal RNA     | Variable    | Combines with proteins to form the ribosome, where the polypeptide chain grows
Messenger RNA     | Variable    | Carries the amino-acid sequence information transcribed from genes
Small nuclear RNA | (not given) | Processes the initial mRNA into its mature form in eukaryotes
Online course on Biology
• Educational Portal
  – RNA
  – http://education-portal.com/academy/lesson/differences-betweenrna-and-dna-types-of-rna-mrna-trna-rrna.html
Genome
Genome
• Genome
  – The entire DNA of a cell is its genome
  – Individual units coding for proteins or RNA are genes
  – A protein-coding gene starts with ATG and ends with one or two stop codons
  – Called an ORF (Open Reading Frame)
• Biological Info
  – Contained in the genome
  – Encoded in nucleotide sequences of DNA or RNA
  – Partitioned into discrete units, genes
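The ORF definition above (start at ATG, end at a stop codon) maps onto a simple scan. The sketch below is a simplified illustration that looks only at the three reading frames of the given strand and ignores the reverse strand, introns, and overlapping ORFs; `find_orfs` and its `min_codons` parameter are names assumed for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=1):
    """Return (start, end, sequence) for each ATG...stop run in frames 0-2."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # index of the ATG that opened the current ORF, if any
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Number of codons before the stop codon.
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None
    return orfs
```

For example, `find_orfs("CCATGAAATAGGG")` finds the single ORF `ATGAAATAG` starting at index 2 in the third reading frame.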
Cell
• Different levels of cells
  – Prokaryote (pro, "before"; karyon, "kernel" in Greek)
  – Eukaryote (eu, "true")
  – The main difference is the presence of organelles, especially the nucleus, in eukaryotes
Organelle             | Prokaryotes         | Eukaryotes
Nucleus               | No definite nucleus | Present
Cell membrane         | Present             | Present
Mitochondria          | None                | Present
Endoplasmic reticulum | None                | Present
Ribosomes             | Present             | Present
Chloroplasts          | None                | Present in green plants
[Figures: animal cell, plant cell, prokaryotic cell]
Three Domains
• Classification purely based on biochemistry (RNA)
– C. Woese, 1981
• Eubacteria (true bacteria)
• Archaea (archaebacteria, early bacteria)
• Eukarya (eukaryotes)
Genome Sequencing Projects
• Major genome sequencing centers
  – U.S. Dept. of Energy Joint Genome Institute (435 projects)
  – J. Craig Venter Institute (302)
  – The Institute for Genomic Research (TIGR) (206)
  – Washington Univ. (184)
  – Institut Pasteur, Univ. of Tokyo
  – www.ncbi.nlm.nih.gov/genomes/static/lcenters.html (National Center for Biotechnology Information)
• Completely sequenced genomes include
  – Several hundred bacteria, over 20 archaea, and over 30 eukarya
  – Human (Homo sapiens), chimpanzee (Pan troglodytes), mouse (Mus musculus), brown rat (Rattus norvegicus), dog (Canis familiaris), thale cress (Arabidopsis thaliana), rice (Oryza sativa), fruit fly (Drosophila melanogaster), yeast (Saccharomyces cerevisiae)
  – http://www.ebi.ac.uk/2can/genomes/genomes.html has descriptions of species and their clinical and scientific significance
  – http://www.genomesonline.org has the current status of genome projects
Genome Databases
• Completed genomes
  – ftp site: ftp://ftp.ncbi.nlm.nih.gov/genomes/
  – http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/allorg.html
  – http://www.ebi.ac.uk/genomes/mot/index.html
  – http:/pir.goergetown.edu/pirwww/search/genome.html
• Organism-specific databases
  – http://www.unledu/stc-95/ResTools/biotools/biotools10.html
  – http://www.fp.mcs.anl.gov/~gaasterland/genomes.html
  – http://www.hgmp.mrc.ac.uk/GenomeWeb/genome-db.html
  – http://www.bioinformatik.de/cgibin/browse/Catalog/Databases/Genome_Proejcts
Genomes of Prokaryotes
• Circular double-stranded DNA
  – Protein-coding regions do not contain introns
  – Protein-coding regions are partially organized into operons: tandem genes transcribed into a single mRNA molecule
  – The trp operon in E. coli begins with a control region, followed by genes performing successive steps in the synthesis of the amino acid tryptophan (figure labels: trpE, trpD)
• The density of coding regions is high
  – ~89% in E. coli
Genome of E. coli
• Many E. coli proteins were known before the sequencing (1,853 proteins)
• Genome of Escherichia coli, strain MG1655, published in 1997
  – By F. Blattner at Univ. of Wisconsin
  – 4.64 Mbp
  – 4,284 protein-coding genes, 122 structural RNA genes, non-coding repeat sequences, regulatory elements, etc.
  – Average size of an ORF is 317 AA
  – Average intergenic gap is 118 bp
  – About ¾ of transcription units contain a single gene; the rest are operons (gene clusters)
  – Functions are known for 60% of the proteins
• http://wishart.biology.ualberta.ca/BacMap/index.html contains an atlas of bacterial genome diagrams (2005)
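The high coding density quoted for E. coli can be roughly reproduced from the numbers on this slide. A back-of-the-envelope sketch, assuming every protein-coding gene has the average ORF length and ignoring stop codons and RNA genes; the variable names are illustrative.

```python
GENOME_BP = 4_640_000   # 4.64 Mbp
PROTEIN_GENES = 4284
AVG_ORF_AA = 317        # average ORF length in amino acids

coding_bp = PROTEIN_GENES * AVG_ORF_AA * 3   # 3 bases per amino acid
coding_fraction = coding_bp / GENOME_BP

print(f"protein-coding bases: {coding_bp:,}")
print(f"coding fraction: {coding_fraction:.1%}")
```

This estimate comes out just under 88%, close to the ~89% coding density cited for E. coli.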
Genome of Archaea
• Microorganism Methanococcus jannaschii
  – Thrives in hydrothermal vents at temperatures from 48 to 94 °C
  – Capable of self-reproduction from inorganic components
  – Its metabolism synthesizes methane from H2 and CO2
• Sequenced in 1996 by TIGR
  – 1.665 Mbp in a chromosome containing a circular DNA molecule, plus two extrachromosomal elements
  – 1,784 protein-coding regions
  – Proteins in archaea for transcription and translation are closer to those in eukaryotes
  – Proteins involved in metabolism are closer to those of bacteria
Genomes of Eukarya
• Majority of DNA is in the nucleus
  – Organized into chromosomes, each containing a single DNA molecule
  – Diverse among species
  – Humans have 23 pairs of chromosomes, chimpanzees have 24
  – Human chromosome 2 is equivalent to a fusion of chimpanzee chromosomes 12 and 13
• Smaller amount of DNA in organelles such as mitochondria and chloroplasts
  – Organelles originated as intracellular parasites
  – Organelle genomes are usually circular, but sometimes linear or multi-circular
  – Their genetic code differs from the one used for nuclear genes
• List of genome sequences
  – http://en.wikipedia.org/wiki/List_of_sequenced_eukaryotic_genomes
Genome of Saccharomyces cerevisiae (Yeast)
• One of the simplest eukaryotic organisms
• Sequencing by about 100 labs, completed in 1996
  – 12.06 Mbp
  – 16 chromosomes
  – 6,172 protein-coding genes
  – Dense: only 231 genes contain introns
Genome of Caenorhabditis elegans (C. elegans)
• Completed in 1998
  – First full DNA sequence of a multi-cellular organism
  – 97 Mbp
• Paired chromosomes
  – XX for a self-fertilizing hermaphrodite (simultaneously male and female)
  – XO for male
• Avg. 5 introns per gene
• Proteins
  – 42% have homologues in other species
  – 34% are specific to nematodes (round worms)
  – 24% have no known homologues

Chromosome | Size (Mbp) | Protein genes | Kbp/gene
I          | 7.9        | 2803          | 5.06
II         | 8.5        | 3559          | 3.05
III        | 7.6        | 2508          | 5.40
IV         | 9.2        | 3094          | 5.17
V          | 9.8        | 4082          | 4.15
X          | 10.1       | 2631          | 6.54
Genome of Drosophila melanogaster (Fruit fly)
• Completed in 1999 by Celera Genomics and Berkeley
  – 180 Mbp
  – Five chromosomes: 3 large autosomes, Y, and a tiny fifth
  – 13,601 genes, 1 gene/8 Kbp
  – Has 289 homologues to human genes implicated in disease, such as cancer, cardiovascular, and neurological disorders
  – There are fly models for Parkinson's disease and malaria
Genome of Arabidopsis thaliana
• Relatively small genome, 146 Mbp, completed in 2000
  – Five chromosomes
  – 25,498 predicted genes; 1 gene/4.6 Kbp
• Proteins
  – Most A. thaliana proteins have homologues in animals
  – 60% of genes have human homologues, e.g., BRCA2
• Gene distribution
  – Nucleus: genome size (125 Mbp), genes (25,500)
  – Chloroplast: genome (154 Kbp), genes (79)
  – Mitochondrion: genome (367 Kbp), genes (58)
• 20 of 54 genes in a 340-Kbp stretch of the rice genome (top) are conserved and retain the same order in five A. thaliana strands
Human Genome
• Human Genome Project
  – Conceived in 1984, begun in 1990, completed in 2001, ahead of the 2003 schedule
• What did the sequence reveal?
  – 3 Bbp (billion base pairs)
  – 24 chromosomes: 22 autosomes plus two sex chromosomes (X, Y)
  – Longest 250 Mbp, shortest 55 Mbp
  – Mitochondrial genome: circular DNA molecule of 16.569 Kbp
  – ~10^13 cells
• How many is 3 Bbp?
  – A typical 11-pt font prints 60 nucleotides in about 10 cm (~4 in)
  – In this format, 3 Bbp writes out to about 5,000 km (~3,100 mi)
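The printed-length estimate is simple arithmetic and can be checked directly. The sketch below assumes 60 nucleotides per printed line and takes the line as 10 cm long; both the variable names and the choice of the 10 cm figure are assumptions made for this illustration.

```python
GENOME_BP = 3_000_000_000
NUCLEOTIDES_PER_LINE = 60
LINE_LENGTH_CM = 10          # assumed printed width of 60 nucleotides
KM_PER_MILE = 1.609344

lines = GENOME_BP / NUCLEOTIDES_PER_LINE
total_km = lines * LINE_LENGTH_CM / 100 / 1000   # cm -> m -> km
total_mi = total_km / KM_PER_MILE

print(f"{lines:,.0f} lines, {total_km:,.0f} km, {total_mi:,.0f} mi")
```

Under these assumptions the genome fills 50 million lines, which laid end to end span about 5,000 km, or roughly 3,100 miles.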
Genome of Homo sapiens
• 22 chromosomes plus X (163 Mbp) and Y (51 Mbp)
• Web resources
  – Interactive access to DNA and protein sequences: http://www.ensembl.org
  – Images of chromosomes, maps, loci: http://www.ncbi.nlm.nih.gov/projects/genome/guide/
  – Gene map 99: http://www.ncbi.nlm.nih.gov/genemap99
  – Overview of human genome structure: http://www.ims.u-tokyo.ac.jp/imsut/en
  – SNP (single nucleotide polymorphisms): http://snp.cshl.org
  – Human genetic diseases: http://www.ncbi.nlm.nih.gov/Omim (Online Mendelian Inheritance in Man), http://www.geneclinics.org/profiles/all-html
Human Genome Insights (ENCODE)
• Majority of genome is transcribed
• ~50% transposons
• ~25% protein-coding genes; 1.3% exons
• ~23,700 protein-coding genes
• ~160,000 transcripts
• Average gene: ~36,000 bp
  – 7 exons @ ~300 bp
  – 6 introns @ ~5,700 bp
  – 7 alternatively spliced products (95% of genes)
• RefSeq: ~34,600 “reference sequence” genes (includes pseudogenes, known RNA genes)
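The average-gene figures are mutually consistent, which a short calculation shows. This sketch uses only the approximate numbers listed above and assumes that exons make up the same share of every gene; the variable names are illustrative.

```python
EXONS, EXON_BP = 7, 300
INTRONS, INTRON_BP = 6, 5_700
GENIC_FRACTION = 0.25   # share of the genome covered by protein-coding genes

exon_bp = EXONS * EXON_BP
gene_bp = exon_bp + INTRONS * INTRON_BP
exon_share_of_gene = exon_bp / gene_bp
exon_share_of_genome = GENIC_FRACTION * exon_share_of_gene

print(f"average gene: {gene_bp:,} bp, of which {exon_bp:,} bp are exons")
print(f"exons: {exon_share_of_gene:.1%} of a gene, "
      f"{exon_share_of_genome:.1%} of the genome")
```

The gene length comes to 36,300 bp, matching the ~36,000 bp average, and the exon share of the genome lands near 1.4%, in line with the 1.3% figure.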
Genome of Homo sapiens (cont’d)
• Repeat sequences make up >50% of the genome
  – Short interspersed nuclear elements (SINEs): 13%; long interspersed nuclear elements (LINEs): 21%
  – Simple stutters (repeats of short oligomers, including mini- and microsatellites)
  – Triplet repeats such as CAG are implicated in numerous diseases (e.g., glutamine repeats in the encoded protein)
• SNP (pronounced "snip")
  – An A->T mutation in beta-globin changes Glu -> Val, creating a sticky surface on haemoglobin molecules => sickle-cell anaemia
  – Progeria
  – Avg 1 SNP/Kbp (100 SNPs per 100 Kbp)
  – Many 100-Kbp regions tend to remain intact, with fewer than five SNPs
  – Discrete combinations of SNPs define an individual’s haplotype (haploid genotype)
  – Individual genomes are characterized by a distribution of genetic markers, including SNPs
  – Int’l HapMap Consortium
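A SNP is a single-position difference between two otherwise aligned sequences, so finding SNPs reduces to a position-by-position comparison. A minimal sketch, assuming the two sequences are already aligned and of equal length; `find_snps` and `snps_per_kbp` are hypothetical helper names.

```python
def find_snps(reference, sample):
    """List (position, reference_base, sample_base) wherever the two differ."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [(i, r, s)
            for i, (r, s) in enumerate(zip(reference, sample))
            if r != s]


def snps_per_kbp(reference, sample):
    """SNP density, scaled to differences per 1,000 bases."""
    return 1000 * len(find_snps(reference, sample)) / len(reference)
```

For the sickle-cell example, comparing the Glu codon `GAG` with the Val codon `GTG` gives `[(1, "A", "T")]`, the single A->T change.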
Genome of Homo sapiens (cont’d)

• SNP consortium
  – Collects human SNPs, nearly 5 million SNPs
• Shows
  – Most variations appear in all populations
  – However, a few SNPs are unique to particular populations
  – Genomes of individuals from Japan and China are very similar
  – Chromosome X varies more than other chromosomes (X is more subject to selective pressure)
Mitochondrial DNA
• Double-stranded closed circular molecule of 16,569 bp
• Inherited almost exclusively through maternal lines
• Not subject to recombination, and changes only by mutation
• About 1 mutation every 25,000 years
mtDNA and Y
• mtDNA is inherited through maternal lines
  – Both sons and daughters get it from their mother
  – All existing sequence variants trace back to a single woman (Mitochondrial Eve) in Africa roughly 200,000 years ago
  – Supports the “out of Africa” hypothesis
  – Avg difference in mtDNA between pairs of individuals is 61.1; between Africans, 76.7; between non-Africans, 38.5
  – Populations in Africa have been diverging for much longer than those in the rest of the world
• Y chromosome
  – Most recent common male ancestor (Y-chromosome Adam) lived around 59,000 years ago
  – Most divergent sequences are found among Africans
Other Species

Organism                    | Genome size | # of genes
Epstein-Barr virus          | 0.17 Mbp    | 80
E. coli                     | 4.6 Mbp     | 4,406
Yeast (S. cerevisiae)       | 12.5 Mbp    | 6,172
Nematode worm (C. elegans)  | 100.3 Mbp   | 19,099
Thale cress (A. thaliana)   | 115.4 Mbp   | 25,498
Fruit fly (D. melanogaster) | 128.3 Mbp   | 13,601
Human (H. sapiens)          | 3223.0 Mbp  | 20,500
Fugu (Takifugu rubripes)    | 390.0 Mbp   | 30,000
Wheat                       | 16000.0 Mbp | 30,000
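The genome sizes and gene counts in this comparison can be turned into a gene density, which makes the lack of correlation between genome size and gene number easier to see. A small sketch using only the values listed for these organisms; `GENOMES` and `kbp_per_gene` are names introduced for this illustration.

```python
# organism: (genome size in Mbp, number of genes)
GENOMES = {
    "Epstein-Barr virus": (0.17, 80),
    "E. coli": (4.6, 4_406),
    "Yeast (S. cerevisiae)": (12.5, 6_172),
    "Nematode worm (C. elegans)": (100.3, 19_099),
    "Thale cress (A. thaliana)": (115.4, 25_498),
    "Fruit fly (D. melanogaster)": (128.3, 13_601),
    "Human (H. sapiens)": (3223.0, 20_500),
    "Fugu (Takifugu rubripes)": (390.0, 30_000),
    "Wheat": (16000.0, 30_000),
}


def kbp_per_gene(organism):
    """Average stretch of genome per gene, in Kbp."""
    size_mbp, genes = GENOMES[organism]
    return size_mbp * 1000 / genes


for name in sorted(GENOMES, key=kbp_per_gene):
    print(f"{name:28s} {kbp_per_gene(name):8.1f} Kbp/gene")
```

Sorted this way, E. coli is the densest at about 1 Kbp per gene, while human and wheat sit at the sparse end with well over 100 Kbp per gene.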