Welcome to
Introduction to Bioinformatics
Wednesday, 1 December
• Intro to Scenario 8
Identification of genes of foreign origin
E. coli K12
E. coli O157:H7
Scenario 1
Comparison of genomes of
pathogenic and nonpathogenic E. coli
E. coli: What makes it kill?
[Figure: nucleotide sequences representing the two genomes]
Gene finder (for each genome)
Blast + parser
Pathogen-specific genes (~1000!)
Pathogenesis-specific genes: ?
E. coli: What makes it kill?
Pathogenicity Islands (PAI)
[Diagram: pathogen DNA compared with nonpathogen DNA; labels: virus-related genes, disease-related genes]
E. coli: What makes it kill?
[Figure: nucleotide sequences representing the two genomes, as on the earlier slide]
Gene finder (for each genome)
Blast + parser
Pathogen-specific genes (~1000!)
Pathogenesis-specific genes: ?
Foreign genes
How to find foreign genes?
Current methods
• % GC
• Dinucleotide frequencies
• Codon bias
• Amino acid bias
How to find foreign genes?
Dinucleotide frequencies
Gene from Prochlorococcus marinus MED4
ATGGAGATTGTTTGTAATCAAAATGAATTTAATTATGCTATTCAATTA
GTTAGTAAAGCAGTTGCTTCAAGACCTACGCATCCTATCCTTGCAAAT
TTACTTCTAACAGCTGATCAAGGTACTAATAAAATTAGTTTAACTGGA
TTTGATTTGAATCTAGGAATACAAACTTCATTTGATGCAACTGTAAAC
AAAAGTGGAGCAATTACAATTCCATCTAAACTTTTATCTGAAATAGTT
AATAAACTACCAAGCGAAACTCCTGTCTCTCTTGATGTTGATGAGAGT
TCTGACAATATTTTAATTAAAAGTGATAGGGGTTCTTTTAATATTAAA
GGTATTCCATCAGACGATTACCCAAGCTTACCGTTTGTAGAAAGTGGT
ACATCTTTGAATATTGATCCAAGTTCTTTTTTAAAAGCTTTAAAATTA
ACTATATTCGCTAGTAGTAGTGATGATTCAAAGCAATTACTCACAGGA
GTAAATTTTACATTTAATTTAAAATATTTGGAGTCAGCTGCAACAGAT
GGGCATAGATTGGCTGTTGTTTTGGTTGATAACAAAGAAAATTTTGAT
GAAAAAGAAGATTTTGCTTCAAATGAAGAAAACTTATCAGTTACTATA
CCAACAAGATCTTTAAGAGAAATTGAAAAGCTTGTTAGCCTTAGAAGT
TCTGAAAATTCAATTAAACTTTTCTATGACAAAGGTCAAGTAGTATTT
ATTTCCTCTAATCAAATAATTACTACTAGAACCCTTGAAGGTTCTTAT
CCAAATTATTCTCAATTAATACCTGATAATTTTACTAAAATTTTTACA
TTTAATACAAAAAAAATAATCGAATCACTTGAAAGAATAGCAGTTTTA
GCAGACCAACAAAGTAGTGTCGTTAAGATTAAACTTAATGAAAAGGAT
TTAGCATTAGTCAGTGCTGATGCTCAAGACATAGGGAATGCCAGCGAA
TTAGTTCCTGTATCTTTTTATTTTGATCAATTTGATATAGCTTTTAAT
GTAAGGTATTTATTAGAAGGTTTAAAAGTTATATCAAGTGAAAATGTA
ATTTTTAAATGTAATCTTCCAACTACTCCAGCTGTTTTAGTTCCAGAA
      A    C    G    T
A   167   47   80  116
C    61   26   10   65
G    65   30   23   59
T   117   59   64  168
ρ*XY = f*XY / (f*X · f*Y)
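A minimal Python sketch of the ratio above, for readers who want to try it on a sequence. The function name `rho` and the `both_strands` switch are illustrative, not from the slides. The slide does not say whether the starred frequencies pool both strands, so that behavior is an assumption left off by default.

```python
from collections import Counter

def rho(seq, both_strands=False):
    """Return {XY: f_XY / (f_X * f_Y)} for all 16 dinucleotides XY."""
    seq = seq.upper()
    strands = [seq]
    if both_strands:
        # Assumption: pool counts from the sequence and its reverse complement.
        comp = str.maketrans("ACGT", "TGCA")
        strands.append(seq.translate(comp)[::-1])

    mono, di = Counter(), Counter()
    for s in strands:
        mono.update(s)                                   # single-base counts
        di.update(s[i:i + 2] for i in range(len(s) - 1)) # overlapping pairs

    n_mono = sum(mono[b] for b in "ACGT")
    n_di = sum(di[x + y] for x in "ACGT" for y in "ACGT")

    out = {}
    for x in "ACGT":
        for y in "ACGT":
            fx, fy = mono[x] / n_mono, mono[y] / n_mono
            fxy = di[x + y] / n_di
            # Undefined when a base never occurs in the sequence.
            out[x + y] = fxy / (fx * fy) if fx and fy else float("nan")
    return out

if __name__ == "__main__":
    for pair, value in sorted(rho("AACGTTGCAACG").items()):
        print(pair, round(value, 3))
```

A value near 1 means the dinucleotide occurs about as often as its two bases would predict; values well above or below 1 mark over- or under-represented pairs.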
How to find foreign genes?
Dinucleotide frequencies
Study Question 3: Calculate ρ*AA in the following 50-nt sequence:
TGATGACAGTCGATTTTTCGGTAGG
ATAACTGCCATGCCTCTCAAAGTAC
ρ*XY = f*XY / (f*X · f*Y)
How to find foreign genes?
Dinucleotide frequencies
δ*(f, g) = (1/16) Σ | ρ*XY(f) - ρ*XY(g) |   (sum over all 16 dinucleotides XY)
Calculate δ*(human, mouse)?
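The distance can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `rho_profile` and `delta` are hypothetical names, frequencies are taken from a single strand, and the two short input sequences are made up for the demonstration (they are not human or mouse data).

```python
from itertools import product

DINUCS = ["".join(p) for p in product("ACGT", repeat=2)]

def rho_profile(seq):
    """Return {XY: f_XY / (f_X * f_Y)} using single-strand frequencies."""
    seq = seq.upper()
    f1 = {b: seq.count(b) / len(seq) for b in "ACGT"}
    if min(f1.values()) == 0:
        raise ValueError("every base must occur at least once")
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]  # overlapping pairs
    f2 = {d: pairs.count(d) / len(pairs) for d in DINUCS}
    return {d: f2[d] / (f1[d[0]] * f1[d[1]]) for d in DINUCS}

def delta(seq_f, seq_g):
    """Average absolute difference between the two rho profiles."""
    rf, rg = rho_profile(seq_f), rho_profile(seq_g)
    return sum(abs(rf[d] - rg[d]) for d in DINUCS) / 16

if __name__ == "__main__":
    # Made-up toy sequences, for illustration only.
    a = "ACGTACGTTGCA"
    b = "AAAACCCGGT"
    print(delta(a, a))  # identical sequences give 0.0
    print(delta(a, b))
```

A small δ* suggests the two sequences share a dinucleotide signature; a gene whose δ* from its host genome is unusually large is a candidate for foreign origin.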
How to find foreign genes?
Cholera toxin locus
100 kbases
How to find foreign genes?
Current methods
• % GC
0.29274613
0.24678363
0.2797619
0.32916668
0.29405162
0.36720142
0.34567901
0.28865057
0.29333332
0.35947713
0.39583334
0.3116883
0.6219512
0.26923078
0.28615862
0.28374657
0.2905983
0.30982906
0.29574862
0.25382262
...
0.27824858
0.3089947
0.3152174
0.37955555
0.26300985
0.3510848
0.34285715
0.33333334
0.3200222
0.32882884
0.29283488
0.2754821
0.37037036
0.3197176
0.30438185
0.3477868
0.3215859
0.34751773
0.30287206
0.34519956
%GC of genes of Prochlorococcus
0.32350427
0.32791328
0.46096095
0.26495728
0.3646139
0.31604227
0.38206628
0.2717087
0.3412162
0.34351662
0.47674417
0.30488145
0.29738563
0.27430555
0.272578
0.31501833
0.3195021
0.35598704
0.3068182
0.3283208
0.321013
0.2984127
0.31343284
0.28431374
0.29989868
0.2984234
0.33838382
0.33004925
0.32882375
0.26504064
0.296
0.36578172
0.42553192
0.32748538
0.32188296
0.31860465
0.29113925
0.30678466
0.3878205
0.26345214
0.28214577
0.32879817
0.34343433
0.32193732
0.33516484
0.23024055
0.32380953
0.31546623
0.29943502
0.32506204
0.30749354
0.28992248
0.36120042
0.32178217
0.35273737
0.2952816
0.3432836
0.32900432
0.3469388
0.2958245
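As a rough illustration of how such a list could be screened, the Python sketch below computes a GC fraction and flags values far from the mean. The names `gc_fraction` and `flag_outliers` and the z-score cutoff of 3 are assumptions made for this example, not part of the lecture. The values are the first 20 from the list above.

```python
from statistics import mean, stdev

def gc_fraction(seq):
    """Fraction of G and C among the A/C/G/T bases of seq."""
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0

def flag_outliers(values, z_cutoff=3.0):
    """Return (index, value) pairs lying more than z_cutoff SDs from the mean."""
    m, s = mean(values), stdev(values)
    if s == 0:
        return []
    return [(i, v) for i, v in enumerate(values) if abs(v - m) / s > z_cutoff]

# First 20 %GC values from the slide.
gc_values = [
    0.29274613, 0.24678363, 0.2797619, 0.32916668, 0.29405162,
    0.36720142, 0.34567901, 0.28865057, 0.29333332, 0.35947713,
    0.39583334, 0.3116883, 0.6219512, 0.26923078, 0.28615862,
    0.28374657, 0.2905983, 0.30982906, 0.29574862, 0.25382262,
]

if __name__ == "__main__":
    print(gc_fraction("ATGGAGATTG"))
    print(flag_outliers(gc_values))
```

On these 20 values only the gene at 0.6219512 is flagged. A plain z-score is a crude choice, since the outlier inflates the standard deviation it is judged against; a median-based measure would be one alternative.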
How to find foreign genes?
Current methods
• % GC
• Dinucleotide frequencies
• Codon bias
• Amino acid bias
Good for large blocks of nucleotides
Not as good for individual genes
Need an indicator more information-dense
How to find foreign genes?
Markov Models
Building the model
Training Set
AAA, AAC, AAG, AAT, ACA, ..., TTG, TTT
AAAA: 10%
AAAC: 15%
AAAG: 40%
AAAT: 35%
How to find foreign genes?
Markov Models
Building the model
Training Set
AAA, AAC, AAG, AAT, ACA, ..., TTG, TTT
AACA: 25%
AACC: 45%
AACG: 25%
AACT: 5%
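The model-building step can be sketched as follows: for each 3-mer in the training set, count which base comes next and convert the counts to fractions. The function name `train_markov` and the tiny training sequence are hypothetical, and the sketch does no smoothing, so unseen transitions get probability 0.

```python
from collections import Counter, defaultdict

def train_markov(sequences, order=3):
    """Return {context: {base: P(base | context)}} from training sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - order):
            context, nxt = seq[i:i + order], seq[i + order]
            # Skip windows containing anything other than A, C, G, T.
            if set(context + nxt) <= set("ACGT"):
                counts[context][nxt] += 1

    model = {}
    for context, c in counts.items():
        total = sum(c.values())
        model[context] = {b: c[b] / total for b in "ACGT"}
    return model

if __name__ == "__main__":
    # Made-up training sequence, for illustration only.
    model = train_markov(["AAAACAAAG"])
    for context in sorted(model):
        print(context, model[context])
```

With a real training set (for example, genes believed to be native to the genome), each row of the resulting table plays the role of the percentages shown for AAA and AAC above.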
How to find foreign genes?
Markov Models
Using the model
3rd order Markov model
Candidate gene: AAAACAA…
        A     C     G     T
AAA   0.10  0.15  0.40  0.35
AAC   0.25  0.45  0.25  0.05
AAG   0.25  0.20  0.30  0.25
AAT   0.25  0.20  0.30  0.25
ACA   0.15  0.20  0.25  0.40
...
TTG   0.20  0.50  0.05  0.25
TTT   0.10  0.55  0.25  0.10
0.10
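To make the scoring step concrete, the Python sketch below walks along the candidate and multiplies the probability of each base given the three bases before it. It uses only the rows visible on the slide. The function name `log_likelihood` is hypothetical, and ignoring the probability of the first three bases is an assumption of this sketch.

```python
import math

# Rows shown on the slide; the elided rows ("...") are not filled in.
MODEL = {
    "AAA": {"A": 0.10, "C": 0.15, "G": 0.40, "T": 0.35},
    "AAC": {"A": 0.25, "C": 0.45, "G": 0.25, "T": 0.05},
    "AAG": {"A": 0.25, "C": 0.20, "G": 0.30, "T": 0.25},
    "AAT": {"A": 0.25, "C": 0.20, "G": 0.30, "T": 0.25},
    "ACA": {"A": 0.15, "C": 0.20, "G": 0.25, "T": 0.40},
    "TTG": {"A": 0.20, "C": 0.50, "G": 0.05, "T": 0.25},
    "TTT": {"A": 0.10, "C": 0.55, "G": 0.25, "T": 0.10},
}

def log_likelihood(seq, model, order=3):
    """Sum of log P(base | previous `order` bases) along seq.

    Raises KeyError if a context is missing from the model.
    """
    total = 0.0
    for i in range(order, len(seq)):
        context, base = seq[i - order:i], seq[i]
        total += math.log(model[context][base])
    return total

if __name__ == "__main__":
    score = log_likelihood("AAAACAA", MODEL)
    # Steps: AAA->A 0.10, AAA->C 0.15, AAC->A 0.25, ACA->A 0.15
    print(math.exp(score))
```

Working in logarithms avoids underflow on gene-length sequences, where the raw product of hundreds of small probabilities would round to zero.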
How to find foreign genes?
Markov Models
Analyze sequence
Compare test sequence to model
model
Produce new sequence per model
How to find foreign genes?
Markov Models
Analyze sequence
Take a test run through Hamlet.pl
model
Produce new sequence per model
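Producing a new sequence from a model can be sketched as repeated weighted sampling: look up the last three bases, draw the next base from that row, and repeat. The Python below is an illustration only and is not a description of Hamlet.pl. The function name `generate` and the toy model, in which every context prefers A and T, are made up for the example.

```python
import random
from itertools import product

def generate(model, start, length, order=3, seed=None):
    """Extend `start` to `length` bases by sampling from the model."""
    rng = random.Random(seed)
    seq = start
    while len(seq) < length:
        context = seq[-order:]
        if context not in model:
            break  # no row for this context; stop early
        bases = list(model[context])
        weights = [model[context][b] for b in bases]
        seq += rng.choices(bases, weights=weights)[0]
    return seq

# Hypothetical toy model: all 64 contexts share one AT-rich distribution.
toy_model = {
    "".join(k): {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}
    for k in product("ACGT", repeat=3)
}

if __name__ == "__main__":
    print(generate(toy_model, "AAA", 40, seed=1))
```

A sequence generated this way shares the local statistics of the training set without copying it, which is a quick way to see what a model has and has not captured.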
Scenario 7
Gene Identification
How do you tell if an ORF is real?
Genetic Code
UUU Phe    UCU Ser    UAU Tyr      UGU Cys
UUC Phe    UCC Ser    UAC Tyr      UGC Cys
UUA Leu    UCA Ser    UAA ochre    UGA opal
UUG Leu    UCG Ser    UAG amber    UGG Trp
CUU Leu    CCU Pro    CAU His      CGU Arg
CUC Leu    CCC Pro    CAC His      CGC Arg
CUA Leu    CCA Pro    CAA Gln      CGA Arg
CUG Leu    CCG Pro    CAG Gln      CGG Arg
AUU Ile    ACU Thr    AAU Asn      AGU Ser
AUC Ile    ACC Thr    AAC Asn      AGC Ser
AUA Ile    ACA Thr    AAA Lys      AGA Arg
AUG Met    ACG Thr    AAG Lys      AGG Arg
GUU Val    GCU Ala    GAU Asp      GGU Gly
GUC Val    GCC Ala    GAC Asp      GGC Gly
GUA Val    GCA Ala    GAA Glu      GGA Gly
GUG Val    GCG Ala    GAG Glu      GGG Gly
The code is degenerate
Are codons equally used?
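One way to explore this question is to tally codons in a reading frame. The Python sketch below does that for the first 48 nt of the Prochlorococcus marinus MED4 gene shown earlier. The function name `codon_usage` is hypothetical, and 16 codons is far too few for a real estimate; the point is only to show the bookkeeping.

```python
from collections import Counter

def codon_usage(orf):
    """Return {codon: fraction} for in-frame codons, written with U as in the table."""
    orf = orf.upper().replace("T", "U")
    usable = len(orf) - len(orf) % 3          # drop a trailing partial codon
    codons = [orf[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}

# First line of the gene sequence shown on the earlier slide.
fragment = "ATGGAGATTGTTTGTAATCAAAATGAATTTAATTATGCTATTCAATTA"

if __name__ == "__main__":
    usage = codon_usage(fragment)
    for codon, frac in sorted(usage.items(), key=lambda kv: -kv[1]):
        print(codon, round(frac, 4))
```

Even in this short fragment the tally is uneven: AAU (Asn) appears three times and its synonym AAC not at all.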
Scenario 7
Gene Identification
How do you tell if an ORF is real?
Most frequently used codons
Genetic Code (human)
[Genetic code table repeated from the previous slide]
The third position is biased
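Third-position bias can be checked by tabulating base composition separately at codon positions 1, 2 and 3. The Python sketch below is a minimal illustration with a hypothetical function name, `codon_position_composition`. It is applied to the first 48 nt of the Prochlorococcus gene shown earlier rather than to human data.

```python
from collections import Counter

def codon_position_composition(orf):
    """Return three dicts of base fractions, one per codon position."""
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3          # drop a trailing partial codon
    if usable == 0:
        raise ValueError("need at least one complete codon")
    result = []
    for pos in range(3):
        counts = Counter(orf[pos:usable:3])   # every third base from pos
        total = sum(counts.values())
        result.append({b: counts[b] / total for b in "ACGT"})
    return result

# First line of the gene sequence shown on the earlier slide.
fragment = "ATGGAGATTGTTTGTAATCAAAATGAATTTAATTATGCTATTCAATTA"

if __name__ == "__main__":
    for pos, comp in enumerate(codon_position_composition(fragment), start=1):
        print(pos, comp)
```

In this fragment 10 of the 16 third positions are T and none are C, in line with the low overall GC content of the genome.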
Scenario 7
Gene Identification
How do you tell if an ORF is real?
If AT    If CG
G  G  C  A  T
ATG CGG TGG GCC CAA CCA CAT CGT GGG CAG TCC CTT
Most frequently used codons
Genetic Code (human)
[Genetic code table repeated from the earlier slide]
PSSM for third position?
Third-order Markov chain