Genome Sizes (haploid)

Organism                    Base pairs       Genes            Notes
Phi-X 174                   5,386            10               virus of E. coli
Human mitochondrion         16,569           37
Epstein-Barr virus (EBV)    172,282          80               causes mononucleosis
Nanoarchaeum equitans       490,885          552              This parasitic archaean has the smallest genome of a true organism yet found.
Mycoplasma genitalium       580,073          483              One of the smallest true organisms
Rickettsia prowazekii       1,111,523        834              bacterium that causes epidemic typhus
Mimivirus                   1,181,404        1,262            A virus (of an amoeba) with a genome larger than many cellular organisms
Borrelia burgdorferi        1.44 x 10^6      1,738            bacterium that causes Lyme disease
Haemophilus influenzae      1,830,138        1,738            bacterium that causes middle ear infections
Neisseria meningitidis      2,272,351        2,221            Group B; the most frequent cause of meningitis in the U.S.
Propionibacterium acnes     2,560,265        2,333            causes acne
Listeria monocytogenes      2,944,528        2,926            2,853 of these encode proteins; the rest RNAs
E. coli                     4,639,221        4,377            4,290 of these genes encode proteins; the rest RNAs
Saccharomyces cerevisiae    12,495,682       5,770            Budding yeast. A eukaryote.
Arabidopsis thaliana        115,409,949      25,498           a flowering plant
Drosophila melanogaster     122,653,977      13,379           the "fruit fly"
Humans                      3.3 x 10^9       20,000–25,000
Rice                        4.3 x 10^8       ~60,000
Amphibians                  10^9 - 10^11     ?
Shotgun sequencing
• Experimental methods for DNA sequencing work reliably for segments about 500 bases long.
• The genome must therefore be randomly divided into fragments of about 500 bp, which are then sequenced and reassembled.
• A continuous stretch of overlapping segments is called a contig.
• Let $a = NL/G$ be the coverage of a genome of size $G$ by $N$ segments of length $L$.
Problems:
1) Find the mean proportion of the genome covered by the contigs.
Ans: $\langle G_c \rangle = 1 - e^{-a}$
2) Determine the average number of contigs for given G and L as a function of a.
Ans: $\langle n \rangle = N e^{-a}$
3) Determine the mean contig size.
Ans: $\langle l \rangle = L \, \frac{e^{a} - 1}{a}$, which is the covered length $G \langle G_c \rangle$ divided by the number of contigs $\langle n \rangle$.
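The three answers above can be evaluated numerically. The following is a minimal sketch, assuming the idealized model of the slide (fragments placed uniformly at random, any overlap is enough to join two fragments); the function name `contig_stats` and the example values of G, N and L are illustrative and not part of the original.

```python
import math

def contig_stats(G, N, L):
    """Expected assembly statistics for N random fragments of length L
    on a genome of size G (idealized model, no minimum overlap)."""
    a = N * L / G                                   # coverage a = NL/G
    covered_fraction = 1.0 - math.exp(-a)           # <G_c> = 1 - e^{-a}
    n_contigs = N * math.exp(-a)                    # <n> = N e^{-a}
    mean_contig_len = L * (math.exp(a) - 1.0) / a   # <l> = L (e^{a} - 1) / a
    return a, covered_fraction, n_contigs, mean_contig_len

# Illustrative numbers only: a 1 Mb genome, 10,000 reads of 500 bases.
a, frac, n, l = contig_stats(G=1_000_000, N=10_000, L=500)
print(f"coverage a = {a:.1f}")
print(f"covered fraction = {frac:.4f}")
print(f"expected contigs = {n:.1f}")
print(f"mean contig size = {l:.0f} bases")
```

Because the mean contig size equals the covered length divided by the number of contigs, the three quantities can be cross-checked against each other for any choice of G, N and L.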
Genetic code
[Figure: diagram with the labels genome, RNAP, gene, A, C, U, G, regulatory proteins, ribosome, mRNA, nuclease, AA, protein, protease]
DNA: sequence of 4 nucleotides: A adenine, G guanine, C cytosine, T thymine
Base pairing of DNA: A–T, C–G
mRNA: sequence of 4 nucleotides: A adenine, G guanine, C cytosine, U uracil
Base pairing of RNA: A–U, C–G, G·U
protein: 20 amino acids: A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y
A  Alanine        Ala     I  Isoleucine  Ile     R  Arginine    Arg
C  Cysteine       Cys     K  Lysine      Lys     S  Serine      Ser
D  Aspartic Acid  Asp     L  Leucine     Leu     T  Threonine   Thr
E  Glutamic Acid  Glu     M  Methionine  Met     V  Valine      Val
F  Phenylalanine  Phe     N  Asparagine  Asn     W  Tryptophan  Trp
G  Glycine        Gly     P  Proline     Pro     Y  Tyrosine    Tyr
H  Histidine      His     Q  Glutamine   Gln
Mapping DNA –> mRNA:
A –> A
G –> G
C –> C
T –> U
Mapping mRNA –> protein: uses a triplet code
UUU Phe   UCU Ser   UAU Tyr    UGU Cys
UUC Phe   UCC Ser   UAC Tyr    UGC Cys
UUA Leu   UCA Ser   UAA STOP   UGA STOP
UUG Leu   UCG Ser   UAG STOP   UGG Trp
CUU Leu   CCU Pro   CAU His    CGU Arg
CUC Leu   CCC Pro   CAC His    CGC Arg
CUA Leu   CCA Pro   CAA Gln    CGA Arg
CUG Leu   CCG Pro   CAG Gln    CGG Arg
AUU Ile   ACU Thr   AAU Asn    AGU Ser
AUC Ile   ACC Thr   AAC Asn    AGC Ser
AUA Ile   ACA Thr   AAA Lys    AGA Arg
AUG Met   ACG Thr   AAG Lys    AGG Arg
GUU Val   GCU Ala   GAU Asp    GGU Gly
GUC Val   GCC Ala   GAC Asp    GGC Gly
GUA Val   GCA Ala   GAA Glu    GGA Gly
GUG Val   GCG Ala   GAG Glu    GGG Gly
• Met (AUG) at the beginning of a gene signals the START of translation
• The code is redundant, comma-less (requires specification of the reading frame), and almost universal
ORF: open reading frame, a region devoid of stop codons
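The two mappings (DNA to mRNA, mRNA to protein) and the role of the reading frame can be illustrated with a short sketch. It assumes the standard codon table and a coding-strand DNA input; the names `transcribe` and `translate` and the test sequence are hypothetical, chosen only for illustration.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... ('*' = STOP).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def transcribe(dna):
    """Map coding-strand DNA to mRNA: T becomes U, other bases are unchanged."""
    return dna.upper().replace("T", "U")

def translate(mrna, frame=0):
    """Read codons in the given frame, start at the first AUG,
    and stop at the first stop codon. Returns one-letter amino acid codes."""
    protein = []
    started = False
    for i in range(frame, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        if not started:
            if codon != "AUG":
                continue            # still upstream of the start codon
            started = True
        aa = CODON_TABLE[codon]
        if aa == "*":
            break                   # stop codon ends the open reading frame
        protein.append(aa)
    return "".join(protein)

print(translate(transcribe("ATGGCCTTTTAA")))            # prints MAF
print(translate(transcribe("CATGGCCTTTTAA"), frame=1))  # prints MAF
print(repr(translate(transcribe("CATGGCCTTTTAA"), frame=0)))  # prints ''
```

The last two calls show why the code is called comma-less: the same sequence yields a protein in one frame and nothing in another, so the frame has to be specified.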
Sequence motifs and patterns
• Translation start and stop codons
• Restriction enzyme cutting sites
• Promoters
• Regulatory or structural protein binding sites
• Exons & introns
Restriction Enzymes
• Bind to short (4 to 6 bases) sequences of DNA
• Cut the DNA at the binding site
• Function during recombination
• Help to protect against viruses
EcoRI
5’-GAATTC-3’
3’-CTTAAG-5’
Examples of Restriction Enzymes
Enzyme     Organism from which derived     Target sequence (cut at *), 5' --> 3'
Bam HI     Bacillus amyloliquefaciens      G*GATCC
Bgl II     Bacillus globigii               A*GATCT
Eco RI     Escherichia coli RY 13          G*AATTC
Hae III    Haemophilus aegyptius           GG*CC
Hha I      Haemophilus haemolyticus        GCG*C
Hind III   Haemophilus influenzae Rd       A*AGCTT
Hpa I      Haemophilus parainfluenzae      GTT*AAC
Kpn I      Klebsiella pneumoniae           GGTAC*C
Mbo I      Moraxella bovis                 *GATC
Pst I      Providencia stuartii            CTGCA*G
Sma I      Serratia marcescens             CCC*GGG
Sst I      Streptomyces stanford           GAGCT*C
Sal I      Streptomyces albus G            G*TCGAC
Taq I      Thermus aquaticus               T*CGA
Xma I      Xanthomonas malvacearum         C*CCGGG
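Cutting at a recognition site can be sketched in a few lines. This is a simplified illustration, assuming a linear molecule and modelling only the 5'-->3' strand shown in the table; the function `digest`, the `ENZYMES` dictionary and the test sequence are hypothetical names introduced here.

```python
def digest(dna, site, cut_offset):
    """Cut a linear DNA strand at every occurrence of `site`.
    `cut_offset` is the number of bases of the site that lie to the left
    of the cut (the position of * in the target sequence)."""
    dna = dna.upper()
    cuts = []
    start = dna.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = dna.find(site, start + 1)
    fragments = []
    previous = 0
    for cut in cuts:
        fragments.append(dna[previous:cut])
        previous = cut
    fragments.append(dna[previous:])
    return fragments

# (recognition site, cut offset) for three enzymes from the table
ENZYMES = {
    "EcoRI": ("GAATTC", 1),   # G*AATTC
    "BamHI": ("GGATCC", 1),   # G*GATCC
    "SmaI":  ("CCCGGG", 3),   # CCC*GGG
}

seq = "AAGAATTCTTGGGAATTCAA"
print(digest(seq, *ENZYMES["EcoRI"]))  # prints ['AAG', 'AATTCTTGGG', 'AATTCAA']
```

A sequence without the site comes back as a single fragment, so the same function can be used to check whether an enzyme cuts a given molecule at all.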
Analysis of a single sequence
Statistical models for a single sequence
Independent distribution: $P(A) = p_A$, $P(G) = p_G$, $P(C) = p_C$, $P(T) = p_T$
iid DNA: $p_A = p_G = p_C = p_T = 0.25$
Weight matrix: probability of finding the base b at location i: $p_{bi}$
Markov type distribution (homogeneous):
Transition matrix: probability that base i + 1 is b given that base i is d: $p_{bd}$
Limiting probability distribution $\pi_b$: $\sum_{d} p_{bd} \, \pi_d = \pi_b$, with $b, d \in \{A, G, C, T\}$
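The limiting distribution of the Markov model can be found by repeatedly applying the transition matrix. Below is a minimal sketch using plain power iteration; the transition matrix is a made-up example (not taken from any real genome), built so that its limiting distribution is known in advance, and the name `stationary_distribution` is an assumption of this illustration.

```python
BASES = "AGCT"

def stationary_distribution(P, n_iter=200):
    """Iterate pi_b <- sum_d P[b][d] * pi_d, where P[b][d] is the probability
    that the next base is b given that the current base is d."""
    pi = {b: 1.0 / len(BASES) for b in BASES}
    for _ in range(n_iter):
        pi = {b: sum(P[b][d] * pi[d] for d in BASES) for b in BASES}
    return pi

# Hypothetical example: with probability 0.5 the base is repeated, otherwise
# the next base is drawn from q. The limiting distribution is then q itself.
q = {"A": 0.4, "G": 0.1, "C": 0.1, "T": 0.4}
P = {b: {d: 0.5 * q[b] + (0.5 if b == d else 0.0) for d in BASES} for b in BASES}

pi = stationary_distribution(P)
print({b: round(pi[b], 3) for b in BASES})
```

Power iteration is used here only because it needs no libraries; for larger state spaces an eigenvector solver would be the more usual choice.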
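Problems:
1) Find the mean and variance of the number of occurrences Y of an M base long word in an N base long iid DNA.
Ans: for large N,
$\frac{\langle Y \rangle}{N} \cong 4^{-M}$, $\quad \frac{\mathrm{Var}(Y)}{N} \cong 4^{-2M} \left( 4^{M} - (2M - 1) + 2 \sum_{s=1}^{M-1} w_s 4^{s} \right)$
where $w_s = 1$ if the first s bases of the word coincide with its last s bases, and $w_s = 0$ otherwise.
2) Find the mean distance between consecutive occurrences of the word.
Ans: $\langle l \rangle = 4^{M} - 1$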
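The dependence of the variance on the self-overlap of the word is easy to see numerically. The sketch below evaluates the large-N expressions per base, using Var(Y)/N ≈ 4^(-2M) (4^M - (2M - 1) + 2 Σ w_s 4^s), which follows from adding the variance of each position's indicator to the covariances of overlapping positions. The function name `word_stats` is hypothetical, and w_s is taken to be 1 when the first s bases of the word equal its last s bases.

```python
def word_stats(word):
    """Per-base mean and variance of the number of occurrences of `word`
    in a long iid DNA sequence with all four bases equally likely."""
    M = len(word)
    p = 4.0 ** (-M)
    # Sum of 4^s over overlap lengths s where the word's prefix equals its suffix.
    overlap = sum(4.0 ** s for s in range(1, M) if word[:s] == word[-s:])
    mean_per_base = p
    var_per_base = p * p * (4.0 ** M - (2 * M - 1) + 2.0 * overlap)
    return mean_per_base, var_per_base

for w in ["A", "AA", "AC", "GAATTC", "AAAAAA"]:
    mean, var = word_stats(w)
    print(f"{w:8s} mean/N = {mean:.3e}  var/N = {var:.3e}")
```

Words of the same length have the same mean count, but a self-overlapping word such as AA has a larger variance than AC because its occurrences tend to cluster.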
Location of protein binding sites
• A DNA-binding protein can attach to a variety of sites.
• The binding site can be described by a consensus sequence or a weight matrix.
Problem:
Find the most likely binding sites for a protein in a given DNA sequence.
Ans: Maximize the score $S = \sum_i s_{x_i, i}$, where $x_i$ is the base at position i of the candidate site.
Ex: narL binding sites in E. coli
Data:
TACCTCAATAGCGGTA
TGCTCCTTTATAGGTA
TAACTCTTTCCGGGTA
TACATCGGTAAGGGTA
TTACTCACTATGGGTA
GTATTCCCCATGGGTA
TTATCCTAAAGGGGTA
TCACTCGAAAGTAGTA
TACCCCGATCGGGGTA
TACTCCTTAATGGGTA
TAACTCTAAAGTGGTA
Consensus:
TAACTCTATAGGGGTA
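The consensus can be recomputed directly from the eleven aligned sites: count each base in each column and take the most frequent one. A minimal sketch follows; the function names `frequency_matrix` and `consensus` are illustrative, and ties (none occur in this data set) would be broken in the order A, G, C, T.

```python
SITES = [
    "TACCTCAATAGCGGTA",
    "TGCTCCTTTATAGGTA",
    "TAACTCTTTCCGGGTA",
    "TACATCGGTAAGGGTA",
    "TTACTCACTATGGGTA",
    "GTATTCCCCATGGGTA",
    "TTATCCTAAAGGGGTA",
    "TCACTCGAAAGTAGTA",
    "TACCCCGATCGGGGTA",
    "TACTCCTTAATGGGTA",
    "TAACTCTAAAGTGGTA",
]

def frequency_matrix(sites):
    """Return one dict per position giving the frequency p_bi of each base b."""
    n = len(sites)
    return [
        {b: sum(site[i] == b for site in sites) / n for b in "AGCT"}
        for i in range(len(sites[0]))
    ]

def consensus(freqs):
    """Most frequent base at each position."""
    return "".join(max(column, key=column.get) for column in freqs)

freqs = frequency_matrix(SITES)
print(consensus(freqs))                           # prints TAACTCTATAGGGGTA
print({b: round(p, 2) for b, p in freqs[1].items()})
```

Position 2 shows why the consensus alone loses information: A is the most frequent base there, yet it appears in only 6 of the 11 sites.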
Weight matrix (frequencies) $p_{bi}$
pos   1     2     3     4     5     6     7     8     9     10    11    12    13    14    15    16
A     0     0.55  0.55  0.09  0     0     0.18  0.45  0.36  0.82  0.09  0.09  0.09  0     0     1
G     0.09  0.09  0     0     0     0     0.27  0.09  0     0     0.45  0.64  0.91  1     0     0
C     0     0.09  0.45  0.55  0.36  1     0.09  0.18  0.09  0.18  0.09  0.09  0     0     0     0
T     0.91  0.27  0     0.36  0.64  0     0.45  0.27  0.55  0     0.36  0.18  0     0     1     0

$s_{bi} = \ln \left( \frac{p_{bi}}{p_b} + f \right)$

Weight matrix (scores) $s_{bi}$
pos   1     2     3     4     5     6     7     8     9     10    11    12    13    14    15    16
A     −3.9  0.8   0.8   −1.0  −3.9  −3.9  −0.3  0.6   0.4   1.2   −1.0  −1.0  −1.0  −3.9  −3.9  1.4
G     −1.0  −1.0  −3.9  −3.9  −3.9  −3.9  0.1   −1.0  −3.9  −3.9  0.6   0.9   1.3   1.4   −3.9  −3.9
C     −3.9  −1.0  0.6   0.8   0.4   1.4   −1.0  −0.3  −1.0  −0.3  −1.0  −1.0  −3.9  −3.9  −3.9  −3.9
T     1.3   0.1   −3.9  0.4   0.9   −3.9  0.6   0.1   0.8   −3.9  0.4   −0.3  −3.9  −3.9  1.4   −3.9
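Scoring candidate sites with the weight matrix amounts to sliding a 16-base window along the sequence and summing one score per position. The sketch below rebuilds the scores from the narL sites with $s_{bi} = \ln(p_{bi}/p_b + f)$. The values $p_b = 0.25$ and $f = 0.02$ are assumptions: they are not stated in the slides but reproduce the tabulated extremes (ln 0.02 ≈ −3.9, ln 4.02 ≈ 1.4). The function names and the test sequence are hypothetical.

```python
import math

SITES = [
    "TACCTCAATAGCGGTA", "TGCTCCTTTATAGGTA", "TAACTCTTTCCGGGTA",
    "TACATCGGTAAGGGTA", "TTACTCACTATGGGTA", "GTATTCCCCATGGGTA",
    "TTATCCTAAAGGGGTA", "TCACTCGAAAGTAGTA", "TACCCCGATCGGGGTA",
    "TACTCCTTAATGGGTA", "TAACTCTAAAGTGGTA",
]
BASES = "AGCT"

def score_matrix(sites, background=0.25, f=0.02):
    """s_bi = ln(p_bi / p_b + f); background and f are assumed values."""
    n = len(sites)
    scores = []
    for i in range(len(sites[0])):
        column = {}
        for b in BASES:
            p_bi = sum(site[i] == b for site in sites) / n
            column[b] = math.log(p_bi / background + f)
        scores.append(column)
    return scores

def site_score(window, scores):
    """S = sum over positions i of s_{x_i, i} for one candidate site."""
    return sum(scores[i][base] for i, base in enumerate(window))

def best_site(dna, scores):
    """Return (start index, score) of the highest-scoring window."""
    width = len(scores)
    starts = range(len(dna) - width + 1)
    k = max(starts, key=lambda j: site_score(dna[j:j + width], scores))
    return k, site_score(dna[k:k + width], scores)

scores = score_matrix(SITES)
dna = "GGGG" + "TAACTCTATAGGGGTA" + "CCCC"   # consensus embedded at index 4
position, s = best_site(dna, scores)
print(position, round(s, 1))
```

In practice one would report every window above a score threshold instead of only the maximum, since a genome may contain several binding sites.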
Sequence logo (http://weblogo.berkeley.edu/)
Height: $h_{ia} = p_{ia} \left( 2 + \sum_{a'} p_{ia'} \log_2 p_{ia'} \right)$
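The letter heights of a sequence logo can be computed column by column. The sketch below uses the usual information-content form, height = p · (2 + Σ p log2 p), in which the bracket is 2 bits minus the column's entropy; it omits the small-sample correction that logo software may apply. The function name `logo_heights` and the example columns are illustrative.

```python
import math

def logo_heights(column):
    """column maps each base to its frequency at one position.
    Returns the height in bits of each letter in the logo."""
    # Sum of p * log2(p) over bases with p > 0 (this is minus the entropy).
    plogp = sum(p * math.log2(p) for p in column.values() if p > 0)
    information = 2.0 + plogp
    return {base: p * information for base, p in column.items()}

conserved = {"A": 1.0, "G": 0.0, "C": 0.0, "T": 0.0}
two_way = {"A": 0.5, "G": 0.5, "C": 0.0, "T": 0.0}
uniform = {"A": 0.25, "G": 0.25, "C": 0.25, "T": 0.25}

print(logo_heights(conserved))  # A has height 2.0, the others 0.0
print(logo_heights(two_way))    # A and G have height 0.5 each
print(logo_heights(uniform))    # all heights are 0.0
```

A fully conserved position therefore fills the full 2 bits, while a position with no base preference disappears from the logo.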
Available software on the web
Regulatory sequence analysis tool (RSA) (http://rsat.ulb.ac.be/rsat/)
• Allows search for patterns in single or multiple sequences
• Construction of consensus sequences
• Search for weight matrix motifs
Consensus server (Stormo Laboratory) (http://adric.wustl.edu/oldconsensus/)
• Construction of weight matrices by searching multiple sequences.