Surrogate Filters
Frog’s eye view of the jungle
(time frozen)
Push to
restart time
Frog’s eye view of the jungle
(time moving)
Frog’s eye view of the jungle
(through movement filter)
Push to
restart time
Filters: Information reducers
Movement filter
Filters: Information reducers
Sequence filter
TCTACTTATA
AAGAGTCTGT
TTCTGTCTGC
TGGATTTCGG
GAACCTTAGT
CTCCGTAAAC
TGAATAAACT
AAGAGTTTAA
AAACCTGTAT
TTATATATTT
CCCCAGCTGT
GACAGCACTG
GCTGAAATTC
CCCTGCACCA
ATGAATGACT
TATGAGGCAA
CTCGGGAGCG
CCTTTAGATG
AGGCCGGAGG
CCCCGGCCTA
TTCCCTGGGC
TTCAATCCAC
TGAATGAACA
TCTGACCTCT
AACTCTAGCC
GACTTCTGCT
CTCTAACATG
TTGTTAAAGG
AGTTAAAAAC
GGTTACATGA
TAAGAAATTA
CATTAAAAAG
ACCCTCAAGA
CGCTGAGAGC
GGTCTTTCCT
GAACGAACGA
TCACAGCATC
CACGGCTCTA
CAAGAAGGAG
GTCAAGAACT
AGGCTGCCTG
TCGGCGGGAC
AGGGCTACAC
CATACATGGT
GGCAGCTTTC
TGCCCCACTC
ATACCAAAGT
ATGTCAGCAA
TACAAATGAA
GAATTGCAGT
ACTGCCTAAA
ATTGCAATTA
AGGCAAATAC
AGGCACCGGC
AGAGTGGTAC
GTGGGCACTG
TTGAATGAAA
AGGTGACCTT
AAGAGGCCCA
GAAACAGCTC
CTCCACCGGC
TGCTATAAAT
AGATAACATG
CTAGTTCTTG
TTATCTGTTT
CACTAGTTTC
TTAGATAAAC
CTCCACGCCC
ATATTAAAAA
AATTAGCAAA
CATTCTAGGG
AAACAAGCTA
ATTTCCTGGG
AGCCAAGGAC
TGACAGACAG
ATTGAACCCT
AGTGCAGACA
AGAAATGAGA
AGTATCTATT
TATCCAGGCA
GAAATCCCTG
GGCAGCGGCC
ACGCGGCCCA
AATGTGCCCT
How organism is made
CTCCGTAAAC CTCTAAC...
How organism works
From Sequence to Organism
How does Nature do it?
Active site
ATGACTTATGATCAACGCACAGGGCTA
Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu...
Genetic code
Rules of folding
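The genetic-code step on this slide can be sketched in a few lines. This is a minimal illustration assuming the standard genetic code; the names `CODON_TABLE`, `translate` and `three_letter` are made up for this example and do not come from the slides.

```python
from itertools import product

# Standard genetic code: one-letter amino acids listed in T, C, A, G codon
# order (TTT, TTC, TTA, TTG, TCT, ...). '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
    "*": "Stop",
}


def translate(dna):
    """Translate a DNA string codon by codon, starting at its first base."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def three_letter(protein):
    """Render a one-letter protein string in three-letter notation."""
    return "-".join(THREE_LETTER[aa] for aa in protein)


print(three_letter(translate("ATGACTTATGATCAACGCACAGGGCTA")))
```

Translation is the easy, rule-based filter; the slide's point is that folding has no comparably simple lookup table.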
From Sequence to Organism
How does Nature do it?
Active site
ATGACTTATGATCAACGCACAGGGCTA
Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu...
Genetic code
Rules of folding
Metabolism,
Architecture
Cell interaction
From Sequence to Organism
How does Nature do it?
Active site
ATGACTTATGATCAACGCACAGGGCTA
Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu...
Genetic code
Rules of folding
• Custom antibiotics
Gives us:
From Sequence to Organism
How does Nature do it?
Active site
ATGACTTATGATCAACGCACAGGGCTA
Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu...
Genetic code
Gives us:
Rules of folding
• Custom antibiotics
• Custom antibodies
• Custom enzymes
• New materials
From Sequence to Organism
How does Nature do it?
?
ATGACTTATGATCAACGCACAGGGCTA
Met-Thr-Tyr-Asp-Gln-Arg-Thr-Gly-Leu...
Genetic code
TCTACTTATA
AAGAGTCTGT
TTCTGTCTGC
TGGATTTCGG
GAACCTTAGT
CTCCGTAAAC
TGAATAAACT
AAGAGTTTAA
AAACCTGTAT
TTATATATTT
CCCCAGCTGT
GACAGCACTG
GCTGAAATTC
CCCTGCACCA
ATGAATGACT
TTCAATCCAC
TGAATGAACA
TCTGACCTCT
AACTCTAGCC
GACTTCTGCT
CTCTAACATG
TTGTTAAAGG
AGTTAAAAAC
GGTTACATGA
TAAGAAATTA
CATTAAAAAG
ACCCTCAAGA
CGCTGAGAGC
GGTCTTTCCT
GAACGAACGA
AGGGCTACAC
CATACATGGT
GGCAGCTTTC
TGCCCCACTC
ATACCAAAGT
ATGTCAGCAA
TACAAATGAA
GAATTGCAGT
ACTGCCTAAA
ATTGCAATTA
AGGCAAATAC
AGGCACCGGC
AGAGTGGTAC
GTGGGCACTG
TTGAATGAAA
CTAGTTCTTG
TTATCTGTTT
CACTAGTTTC
TTAGATAAAC
CTCCACGCCC
ATATTAAAAA
AATTAGCAAA
CATTCTAGGG
AAACAAGCTA
ATTTCCTGGG
AGCCAAGGAC
TGACAGACAG
ATTGAACCCT
AGTGCAGACA
AGAAATGAGA
3%
97%
TCTACTTATATTCAATCCACAGGGCTA
CACCTAGTTCTTGAAGAGTCTGTTGAA
TGAACACATACATGGTTTATCTGTTTT
TCTGTCTGCTCTGACCTCTGGCAGCTT
ATGACTTATGATCAACGCACAGGGCTA
TAGCCTGCCCCACTCTTAGATAAACGA
ACCTTAGTGACTTCTGCTATACCAAAG
TCTCCACGCCCCTCCGTAAACCTCTAA
CATGATGTCAGCAAATATTAAAAATGA
Rules of transcriptional and post-transcriptional control
• Transcriptional initiation
• Transcriptional termination / polyA tailing
• Splicing
• Translational initiation
From Sequence to Organism
How does Nature do it?
Natural filters/transformations
• Selective transcription
DNA
• Selective processing
• Translation
• Folding
Functional
protein
From Sequence to Organism
How does Nature do it?
Natural filters/transformations
DNA
Functional
protein
From Sequence to Organism
How can WE do it?
Simulation of Nature
“Whether ‘tis nobler in the mind
to suffer the slings and arrows
of outrageous fortune...”
Utterance of
Wm Shakespeare
Utterance of
George W Bush
“We must give our military
every tool and weapon
it needs to prevail...”
???
From Sequence to Organism
How can WE do it?
Surrogate Processes
“Whether ‘tis nobler in the mind
to suffer the slings and arrows
of outrageous fortune...”
“We must give our military
every tool and weapon
it needs to prevail...”
Utterance of
Wm Shakespeare
Utterance of
George W Bush
Words/sentence; Choice of words; Sentence structure; …
From Sequence to Organism
How can WE do it?
Natural filters/transformations
Surrogate filters
• Selective transcription
• Gene finders
• Selective processing
• Translation
• Folding
Predicted coding regions
My sequence
Characteristics of
coding sequences/introns
From Sequence to Organism
How can WE do it?
Natural filters/transformations
Surrogate filters
• Selective transcription
• Gene finders
• Selective processing
• Similarity finders
• Translation
• Folding
Sequence/motif
Databases
My sequence
From Sequence to Organism
How can WE do it?
Natural filters/transformations
Surrogate filters
• Selective transcription
• Gene finders
• Selective processing
• Similarity finders
• Translation
• Feature finders
• Folding
Predicted features
My sequence
Characteristics
of features
From Sequence to Organism
How can WE do it?
Natural filters/transformations
Surrogate filters
• Selective transcription
• Gene finders
• Selective processing
• Similarity finders
• Translation
• Feature finders
• Folding
• Pattern finders
My sequences
Statistical engine
Surrogate Filters
How do they work?
Case studies
• Gene finders
• Real problems
• Similarity finders
• Mixed strategies
• Feature finders
• Pattern finders
You do it
Surrogate Filters
Gene finders
Class 1: Start/Stop codon search (Map, Frames, OrfFinder)
Look for start codons (ATG) (GTG,TTG)
Look for stop codons (TAA,TAG,TGA)
CTCCACGCCCCTCCGTACACCTCTAACATGATGTCAGCAAATATTAAAAATGAATAAACTTTGTGACATGTACAAATGGAAATATGCAA
CTC CAC GCC CCT CCG TAC ACC TCT AAC ATG ATC TCA GCA AAT ATT AAA AAT GAA TAA ACT TTG TGA CAT GTA CAA ATG GAA ATA TGC AA
C TCC ACG CCC CTC CGT ACA CCT CTA ACA TGA TCT CAG CAA ATA TTA AAA ATG AAT AAA CTT TGT GAC ATG TAC AAA TGG AAA TAT GCA A
CT CCA CGC CCC TCC GTA CAC CTC TAA CAT GAT CTC AGC AAA TAT TAA AAA TGA ATA AAC TTT GTG ACA TGT ACA AAT GGA AAT ATG CAA
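A Class 1 search can be sketched as a scan of the three forward reading frames for a start codon followed, in the same frame, by a stop codon. This is a minimal sketch, not the algorithm of Map, Frames or OrfFinder; the function name `find_orfs` and its parameters are assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, starts=("ATG",), min_codons=1):
    """Return (frame, start, end) for each open reading frame.

    frame is 0, 1 or 2; start is the 0-based index of the start codon;
    end is the index just past the stop codon. Pass
    starts=("ATG", "GTG", "TTG") to allow the alternative start codons.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        orf_start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if orf_start is None and codon in starts:
                orf_start = i                      # first start codon opens the ORF
            elif orf_start is not None and codon in STOP_CODONS:
                if (i - orf_start) // 3 >= min_codons:
                    orfs.append((frame, orf_start, i + 3))
                orf_start = None                   # look for the next ORF
    return orfs


# Toy sequence (hypothetical): ATG AAA TAG sits in the third frame.
print(find_orfs("CCATGAAATAGGG"))
```

Raising `min_codons` is the usual way to suppress doubtful short open reading frames, at the cost of missing genuinely small genes.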
Surrogate Filters
Gene finders
Class 1: Start/Stop codon search (Map, Frames, OrfFinder)
Look for start codons (ATG) (GTG,TTG)
Look for stop codons (TAA,TAG,TGA)
CTCCACGCCCCTCCGTACACCTCTAACATGATGTCAGCAAATATTAAAAATGAATAAACTTTGTGACATGTACAAATGGAAATATGCAA
TTGCATATTTCCATTTGTACATGTCACAAAGTTTATTCATTTTTAATATTTGCTGAGATCATGTTAGAGGTGTACGGAGGGGCGTGGAG
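Because a gene may lie on either strand, the search has to be repeated on the reverse complement, giving six reading frames in all. A minimal sketch follows; `reverse_complement` and `six_frames` are illustrative names, not functions of any program named on the slide.

```python
# Translation table mapping each base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement every base, then reverse to read the other strand 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def six_frames(seq):
    """Split both strands into codons for each of the three offsets."""
    frames = {}
    for label, strand in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            codons = [strand[i:i + 3] for i in range(offset, len(strand) - 2, 3)]
            frames[f"{label}{offset + 1}"] = codons
    return frames


for name, codons in six_frames("ATGAAATAG").items():
    print(name, " ".join(codons))
```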
Surrogate Filters
Gene finders
Class 1: Start/Stop codon search (Map, Frames, OrfFinder)
Pro: Quick, simple
Con: Useless for eukaryotic genomic sequences (introns)
Inaccurate (start codon problem)
Inaccurate (doubtful short open reading frames)
Surrogate Filters
Gene finders
Class 2: Codon bias recognition (TestCode)
Genetic Code
UUU Phe    UCU Ser    UAU Tyr      UGU Cys
UUC Phe    UCC Ser    UAC Tyr      UGC Cys
UUA Leu    UCA Ser    UAA ochre    UGA opal
UUG Leu    UCG Ser    UAG amber    UGG Trp
CUU Leu    CCU Pro    CAU His      CGU Arg
CUC Leu    CCC Pro    CAC His      CGC Arg
CUA Leu    CCA Pro    CAA Gln      CGA Arg
CUG Leu    CCG Pro    CAG Gln      CGG Arg
AUU Ile    ACU Thr    AAU Asn      AGU Ser
AUC Ile    ACC Thr    AAC Asn      AGC Ser
AUA Ile    ACA Thr    AAA Lys      AGA Arg
AUG Met    ACG Thr    AAG Lys      AGG Arg
GUU Val    GCU Ala    GAU Asp      GGU Gly
GUC Val    GCC Ala    GAC Asp      GGC Gly
GUA Val    GCA Ala    GAA Glu      GGA Gly
GUG Val    GCG Ala    GAG Glu      GGG Gly
The code is degenerate
Are codons equally used?
Surrogate Filters
Gene finders
Class 2: Codon bias recognition (TestCode)
Most frequently used codons
Genetic Code (human)
UUU Phe    UCU Ser    UAU Tyr      UGU Cys
UUC Phe    UCC Ser    UAC Tyr      UGC Cys
UUA Leu    UCA Ser    UAA ochre    UGA opal
UUG Leu    UCG Ser    UAG amber    UGG Trp
CUU Leu    CCU Pro    CAU His      CGU Arg
CUC Leu    CCC Pro    CAC His      CGC Arg
CUA Leu    CCA Pro    CAA Gln      CGA Arg
CUG Leu    CCG Pro    CAG Gln      CGG Arg
AUU Ile    ACU Thr    AAU Asn      AGU Ser
AUC Ile    ACC Thr    AAC Asn      AGC Ser
AUA Ile    ACA Thr    AAA Lys      AGA Arg
AUG Met    ACG Thr    AAG Lys      AGG Arg
GUU Val    GCU Ala    GAU Asp      GGU Gly
GUC Val    GCC Ala    GAC Asp      GGC Gly
GUA Val    GCA Ala    GAA Glu      GGA Gly
GUG Val    GCG Ala    GAG Glu      GGG Gly
Codon usage is biased
Codon bias universal?
Surrogate Filters
Gene finders
Class 2: Codon bias recognition (TestCode)
Pro: Quick, simple, available through GCG
Better than Class 1 in excluding false open reading frames
Con: Useless for eukaryotic genomic sequences (introns)
Gives only general areas of open reading frames
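The question "Are codons equally used?" can be answered for any coding sequence by counting. The sketch below only tabulates codon frequencies in one reading frame; it is not the TestCode algorithm, and the function name `codon_usage` and the toy sequence are assumptions made for illustration.

```python
from collections import Counter


def codon_usage(cds):
    """Return the fraction of each codon in a coding sequence read from base 1."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3          # ignore a trailing partial codon
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


# Toy example (hypothetical sequence): two CUG-type leucine codons, one CUA.
usage = codon_usage("CTGCTGCTAAAA")
for codon, fraction in sorted(usage.items()):
    print(codon, round(fraction, 2))
```

A region whose codon frequencies resemble those of known genes from the same organism is more likely to be coding than one whose frequencies look random.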
Surrogate Filters
Gene finders
Class 3: Hidden Markov Model (HMM)-based recognition
Principle
Step 1: Create model through extensive training set
* Training set = proven or suspected genes
* Organism-specific
Step 2: Assess candidate genes through filter of model
Surrogate Filters
Gene finders
Class 3: Hidden Markov Model (HMM)-based recognition
Step 1: Create model through extensive training set
Training
Set
AAGCTTGACCAAAAAGTTAAAACACTGACGGCAAATAATC
AATGACTATCAGACAGAGAATCATCGTGCTGTCAGTAAAA
CCTCTGATTTCGATCTTTACCATAATTGTTATGTTGTAAT
GACTAACCAGACTATCTTTTACAGAGCTTCTGGTTAACAC
TTGTCTAATTAGACATTGATAATGTTTGTGGGGGTTGGTC
ATCAGGAATGGTAAATAGCAATTACCCTTCAGACTTTCCT
ATGAGACGCTCCGCCAACGAGCAGTGTCTCTTAAAGAACG
TTATGAGCGCTCAGTTAACTTCAGAAATTCACGGCGGAAA
TCCATAGTTATTATTACTTATGACTAAAACAAAATTACTA
TGGCGGCTTGTTTAATATAGATTCTGTGTTCTGAGAAATG
ACTTTTAAAGTCCCACTAACTTTTTTCTCATCTATTGCTA
TATTTCGACTTTAAAACTTATAGTAGATGGCTTAATTCTC
AAATAACAAACTCATTTTTAGTAGATATTTCATGCAAACT
GAGGTTTTTAGTGATATTTTCCCCTTATTGAGTACAGCCA
CTCCACAAACCTTAGAATGGCTACTCAATATTGCAATTGA
TCATGAATATCCCACTGGTAGAGCAGTTTTAATGGAAGAT
GCCTGGGGTAATGCAGTTTATTTCGTTGTATCTGGATGGG
TAAAAGTTCGGCGCACCTGTGGAGATGATTCGGTAGCTTT
AAA
AAC
AAG
AAT
ACA
...
TTG
TTT
Surrogate Filters
Gene finders
Class 3: Hidden Markov Model (HMM)-based recognition
Step 1: Create model through extensive training set
AAAA: 33%
Training
Set
AAGCTTGACCAAAAAGTTAAAACACTGACGGCAAATAATC
AATGACTATCAGACAGAGAATCATCGTGCTGTCAGTAAAA
CCTCTGATTTCGATCTTTACCATAATTGTTATGTTGTAAT
GACTAACCAGACTATCTTTTACAGAGCTTCTGGTTAACAC
TTGTCTAATTAGACATTGATAATGTTTGTGGGGGTTGGTC
ATCAGGAATGGTAAATAGCAATTACCCTTCAGACTTTCCT
ATGAGACGCTCCGCCAACGAGCAGTGTCTCTTAAAGAACG
TTATGAGCGCTCAGTTAACTTCAGAAATTCACGGCGGAAA
TCCATAGTTATTATTACTTATGACTAAAACAAAATTACTA
TGGCGGCTTGTTTAATATAGATTCTGTGTTCTGAGAAATG
ACTTTTAAAGTCCCACTAACTTTTTTCTCATCTATTGCTA
TATTTCGACTTTAAAACTTATAGTAGATGGCTTAATTCTC
AAATAACAAACTCATTTTTAGTAGATATTTCATGCAAACT
GAGGTTTTTAGTGATATTTTCCCCTTATTGAGTACAGCCA
CTCCACAAACCTTAGAATGGCTACTCAATATTGCAATTGA
TCATGAATATCCCACTGGTAGAGCAGTTTTAATGGAAGAT
GCCTGGGGTAATGCAGTTTATTTCGTTGTATCTGGATGGG
TAAAAGTTCGGCGCACCTGTGGAGATGATTCGGTAGCTTT
AAA
AAC
AAG
AAT
ACA
...
TTG
TTT
AAAC: 25%
AAAG: 12%
AAAT: 30%
Surrogate Filters
Gene finders
Class 3: Hidden Markov Model (HMM)-based recognition
Step 1: Create model through extensive training set
Training
Set
AAGCTTGACCAAAAAGTTAAAACACTGACGGCAAATAATC
AATGACTATCAGACAGAGAATCATCGTGCTGTCAGTAAAA
CCTCTGATTTCGATCTTTACCATAATTGTTATGTTGTAAT
GACTAACCAGACTATCTTTTACAGAGCTTCTGGTTAACAC
TTGTCTAATTAGACATTGATAATGTTTGTGGGGGTTGGTC
ATCAGGAATGGTAAATAGCAATTACCCTTCAGACTTTCCT
ATGAGACGCTCCGCCAACGAGCAGTGTCTCTTAAAGAACG
TTATGAGCGCTCAGTTAACTTCAGAAATTCACGGCGGAAA
TCCATAGTTATTATTACTTATGACTAAAACAAAATTACTA
TGGCGGCTTGTTTAATATAGATTCTGTGTTCTGAGAAATG
ACTTTTAAAGTCCCACTAACTTTTTTCTCATCTATTGCTA
TATTTCGACTTTAAAACTTATAGTAGATGGCTTAATTCTC
AAATAACAAACTCATTTTTAGTAGATATTTCATGCAAACT
GAGGTTTTTAGTGATATTTTCCCCTTATTGAGTACAGCCA
CTCCACAAACCTTAGAATGGCTACTCAATATTGCAATTGA
TCATGAATATCCCACTGGTAGAGCAGTTTTAATGGAAGAT
GCCTGGGGTAATGCAGTTTATTTCGTTGTATCTGGATGGG
TAAAAGTTCGGCGCACCTGTGGAGATGATTCGGTAGCTTT
AAA
AAC
AAG
AAT
ACA
...
TTG
TTT
AACA: 30%
AACC: 20%
AACG: 15%
AACT: 35%
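Step 1 amounts to counting, for every three-base context in the training set, how often each base comes next. A minimal sketch under that reading of the slides follows; `train_markov` is an illustrative name and the toy training sequence is hypothetical.

```python
from collections import Counter, defaultdict


def train_markov(sequences, order=3):
    """Estimate P(next base | preceding `order` bases) from training sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - order):
            context = seq[i:i + order]
            counts[context][seq[i + order]] += 1
    model = {}
    for context, next_bases in counts.items():
        total = sum(next_bases.values())
        model[context] = {base: next_bases[base] / total for base in "ACGT"}
    return model


# Toy training set (hypothetical): AAA is followed by A four times, by C once.
model = train_markov(["AAAAAAAC"])
print(model["AAA"])
```

With a real training set, the entry for AAA would correspond to the AAAA/AAAC/AAAG/AAAT percentages shown on the slide.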
Surrogate Filters
Gene finders
Class 3: Hidden Markov Model (HMM)-based recognition
Step 2: Assess candidate genes
3rd order Markov model
Candidate gene: AAAGCAA…
Preceding 3 bases    Probability of next base
                      A     C     G     T
AAA                  0.33  0.25  0.12  0.30
AAC                  0.30  0.20  0.15  0.35
AAG                  0.35  0.15  0.20  0.30
AAT                  0.30  0.15  0.20  0.25
ACA                  0.25  0.20  0.15  0.35
...
TTG                  0.25  0.30  0.15  0.30
TTT                  0.30  0.25  0.10  0.35
0.12
Surrogate Filters
Gene finders
Class 3: Hidden Markov Model (HMM)-based recognition
Step 2: Assess candidate genes
3rd order Markov model
Candidate gene: AAAGCAA…
Preceding 3 bases    Probability of next base
                      A     C     G     T
AAA                  0.33  0.25  0.12  0.30
AAC                  0.30  0.20  0.15  0.35
AAG                  0.35  0.15  0.20  0.30
AAT                  0.30  0.15  0.20  0.25
ACA                  0.25  0.20  0.15  0.35
...
TTG                  0.25  0.30  0.15  0.30
TTT                  0.30  0.25  0.10  0.35
0.12 x 0.15
Surrogate Filters
Gene finders
Class 3: Hidden Markov Model (HMM)-based recognition
Step 2: Assess candidate genes
3rd order Markov model
Candidate gene: AAAGCTA…
Preceding 3 bases    Probability of next base
                      A     C     G     T
AAA                  0.33  0.25  0.12  0.30
AAC                  0.30  0.20  0.15  0.35
AAG                  0.35  0.15  0.20  0.30
AAT                  0.30  0.15  0.20  0.25
ACA                  0.25  0.20  0.15  0.35
...
TTG                  0.25  0.30  0.15  0.30
TTT                  0.30  0.25  0.10  0.35
0.12 x 0.15 . . .
So far, not a good candidate!
Surrogate Filters
Gene finders
Class 3: Hidden Markov Model (HMM)-based recognition
Pro: Among the most accurate methods known
Con: Needs big training set
May miss genes of foreign origin
Will miss very small genes
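The running product in Step 2 (0.12 x 0.15 ...) can be reproduced directly from the slide's table. In the sketch below only the three contexts needed for the example are filled in, and the comparison against a uniform background of 0.25 per base is an assumption added for illustration, not something the slides specify.

```python
import math

# Rows copied from the slide's table: P(next base | preceding three bases).
MODEL = {
    "AAA": {"A": 0.33, "C": 0.25, "G": 0.12, "T": 0.30},
    "AAC": {"A": 0.30, "C": 0.20, "G": 0.15, "T": 0.35},
    "AAG": {"A": 0.35, "C": 0.15, "G": 0.20, "T": 0.30},
}


def log_probability(seq, model, order=3):
    """Sum log P(base | previous `order` bases) along the candidate."""
    total = 0.0
    for i in range(len(seq) - order):
        total += math.log(model[seq[i:i + order]][seq[i + order]])
    return total


def log_odds(seq, model, order=3, background=0.25):
    """Compare the model with a uniform background (an assumed null model)."""
    steps = len(seq) - order
    return log_probability(seq, model, order) - steps * math.log(background)


candidate = "AAAGC"                                   # first bases of the slide's candidate
print(math.exp(log_probability(candidate, MODEL)))    # 0.12 x 0.15
print(log_odds(candidate, MODEL))                     # negative: worse than background
```

Working in logarithms avoids underflow once the product runs over hundreds of bases.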
Surrogate Filters
Scenario I – Case of the Hidden Heterocyst
Case of the Hidden Heterocyst
NH3
heterocysts
N2
O2
NH3
Matveyev and Elhai (unpublished)
Case of the Hidden Heterocyst
Strategy to find heterocyst differentiation genes
Nostoc genome
Transposon
1. Use transposon mutagenesis
Case of the Hidden Heterocyst
Strategy to find heterocyst differentiation genes
Nostoc genome
Transposon
1. Use transposon mutagenesis
to find a mutant defective in
heterocyst differentiation
Case of the Hidden Heterocyst
Strategy to find heterocyst differentiation genes
Nostoc genome
AAGCTTGACCAAAAAGTTAAAACACTGACGGCAAATA
ATCAATGACTATCAGACAGAGAATCATCGTGCTGTCA
GTAAAACCTCTGATTTCGATCTTTACCATAATTGTTA
TGTTGTAATGACTAACCAGACTATCTTTTACAGAGCT
TCTGGTTAACACTTGTCTAATTAGACATTGATAATGT
TTGTGGGGGTTGGTCATCAGGAATGGTAAATAGCAAT
TACCCTTCAGACTTTCCTATGAGACGCTCCGCCAACG
AGCAGTGTCTCTTAAAGAACGTTATGAGCGCTCAGTT
AACTTCAGAAATTCACGGCGGAAATCCATAGTTATTA
TTACTTATGACTAAAACAAAATTACTATGGCGGCTTG
TTTAATATAGATTCTGTGTTCTGAGAAATGACTTTTA
AAGTCCCACTAACTTTTTTCTCATCTATTGCTATATT
TCGACTTTAAAACTTATAGTAGATGGCTTAATTCTCA
AATAACAAACTCATTTTTAGTAGATATTTCATGCAAA
CTGAGGTTTTTAGTGATATTTTCCCCTTATTGAGTAC
AGCCACTCCACAAACCTTAGAATGGCTACTCAATATT
GCAATTGATCATGAATATCCCACTGGTAGAGCAGTTT
TAATGGAAGATGCCTGGGGTAATGCAGTTTATTTCGT
TGTATCTGGATGGGTAAAAGTTCGGCGCACCTGTGGA
1. Use transposon mutagenesis
to find a mutant defective in
heterocyst differentiation
2. Sequence out from transposon
Case of the Hidden Heterocyst
Strategy to find heterocyst differentiation genes
Nostoc genome
AAGCTTGACCAAAAAGTTAAAACACTGACGGCAAATA
ATCAATGACTATCAGACAGAGAATCATCGTGCTGTCA
GTAAAACCTCTGATTTCGATCTTTACCATAATTGTTA
TGTTGTAATGACTAACCAGACTATCTTTTACAGAGCT
TCTGGTTAACACTTGTCTAATTAGACATTGATAATGT
TTGTGGGGGTTGGTCATCAGGAATGGTAAATAGCAAT
TACCCTTCAGACTTTCCTATGAGACGCTCCGCCAACG
AGCAGTGTCTCTTAAAGAACGTTATGAGCGCTCAGTT
AACTTCAGAAATTCACGGCGGAAATCCATAGTTATTA
TTACTTATGACTAAAACAAAATTACTATGGCGGCTTG
TTTAATATAGATTCTGTGTTCTGAGAAATGACTTTTA
AAGTCCCACTAACTTTTTTCTCATCTATTGCTATATT
TCGACTTTAAAACTTATAGTAGATGGCTTAATTCTCA
AATAACAAACTCATTTTTAGTAGATATTTCATGCAAA
CTGAGGTTTTTAGTGATATTTTCCCCTTATTGAGTAC
AGCCACTCCACAAACCTTAGAATGGCTACTCAATATT
GCAATTGATCATGAATATCCCACTGGTAGAGCAGTTT
TAATGGAAGATGCCTGGGGTAATGCAGTTTATTTCGT
TGTATCTGGATGGGTAAAAGTTCGGCGCACCTGTGGA
Do it
1. Use transposon mutagenesis
to find a mutant defective in
heterocyst differentiation
2. Sequence out from transposon
3. Find gene boundaries
4. Identify gene
Case of the Hidden Heterocyst
Strategy to find heterocyst differentiation genes
1. Go to http://www.vcu.edu/~elhaij/BioInf
2. Open second browser (Ctrl-N in Netscape)
Go to same site (copy and paste URL)
3. In 1st browser, go to Program List
Click on Gene Finders
Open GeneMark
4. In 2nd browser, open Nostoc sequence
Case of the Hidden Heterocyst
Strategy to find heterocyst differentiation genes
Mission successful:
>Translation: 397..639 (direct), 81 amino acids
VLGSKIEEGPKHIILDLSQIDFIDSSGLGALVQLAKQAQTAEGTLQIVTNAR
VTQTVKLVRLEKFLSLQKSVEEALENVK*
… or was it?
Check predicted protein against databases
Surrogate Filters
Similarity finders
Blast
• BlastP: Protein sequence to search protein database
• BlastN: Nucleotide sequence to search nucleotide database
• BlastX: Nucleotide sequence (translated) to search protein database
• TBlastN: Protein sequence to search (translated) nucleotide database
• Blast2Seq: Compare two sequences you specify
FastA
• (Various flavors)
Do it
Pfam (Protein motif families)
Finds conserved motifs similar to protein sequence
Case of the Hidden Heterocyst
Strategy to find heterocyst differentiation genes
Mission successful:
>Translation: 397..639 (direct), 81 amino acids
VLGSKIEEGPKHIILDLSQIDFIDSSGLGALVQLAKQAQTAEGTLQIVTNAR
VTQTVKLVRLEKFLSLQKSVEEALENVK*
VLGSK
Why?
• GeneMark correct: Conservation of noncoding regions
• GeneMark wrong: Fooled by weird aa sequence or start codon
Case of the Hidden Heterocyst
Strategy to find heterocyst differentiation genes
Moral
Automated gene finders are wonderful,
but common sense is better
Don’t trust automated annotation
Surrogate Filters
Feature finders
Hidden Markov model-based methods
• Good for contiguous features (e.g. signal sequences)
• Not good with features with gaps (e.g. promoters)
Ad hoc methods
• Feature-specific rules (e.g. tandem repeats, terminators)
Position-dependent frequency tables
= Position-specific scoring matrix (PSSM)
= Weight table
Surrogate Filters
Feature finders
Position-dependent frequency tables
Some of 106 aligned human promoter sequences (near -26)
histone H1t         CCCTATATAAGGC...
HMG-17              CGCTATAAAAACT...
β-tubulin β2        GGGTATATAAGCG...
α-actin skel-m.     GGCTATATAAAAC...
α-cardiac actin     TTCTATAAAGCGG...
β-actin             CCCTATAAAACCC...
keratin I 50K       GAGTATAAAGCAC...
vimentin            GGTTATAAAAACA...
α1(I) collagen      CAGTATAAAAGGG...
α2(I) collagen      CCGTATAAATAGG...
fibronectin         TCCCATATAAGCC...
Consensus              TATAAA
Surrogate Filters
Feature finders
Position-dependent frequency tables
Some of 106 aligned human promoter sequences (near -26)
histone H1t         CCCTATATAAGGC...
HMG-17              CGCTATAAAAACT...
β-tubulin β2        GGGTATATAAGCG...
α-actin skel-m.     GGCTATATAAAAC...
α-cardiac actin     TTCTATAAAGCGG...
β-actin             CCCTATAAAACCC...
keratin I 50K       GAGTATAAAGCAC...
vimentin            GGTTATAAAAACA...
α1(I) collagen      CAGTATAAAAGGG...
α2(I) collagen      CCGTATAAATAGG...
fibronectin         TCCCATATAAGCC...
Frequency (%) of each base at each position
Consensus    -    -    T    A    T    A    A    A    -    -    -    -
A           21   29    0  100    0  100   81   91   57   32   15   26
T           16   22   87    0  100    0   19    0   21    6   10   11
C           28   24   13    0    0    0    0    0    0   15   33   28
G           35   25    0    0    0    0    0    9   22   47   42   34
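A position-dependent frequency table is simply a per-column base count over the aligned sequences. The sketch below builds one from the eleven promoter fragments shown on the slide; since the slide's own table was computed from all 106 sequences, the numbers will differ. The names `frequency_table` and `consensus` are illustrative.

```python
# The eleven aligned promoter fragments shown on the slide.
ALIGNED = [
    "CCCTATATAAGGC", "CGCTATAAAAACT", "GGGTATATAAGCG", "GGCTATATAAAAC",
    "TTCTATAAAGCGG", "CCCTATAAAACCC", "GAGTATAAAGCAC", "GGTTATAAAAACA",
    "CAGTATAAAAGGG", "CCGTATAAATAGG", "TCCCATATAAGCC",
]


def frequency_table(aligned):
    """Return {base: [fraction at position 0, 1, ...]} for equal-length sequences."""
    n = len(aligned)
    table = {base: [] for base in "ATCG"}
    for column in zip(*aligned):              # one tuple of bases per position
        for base in "ATCG":
            table[base].append(column.count(base) / n)
    return table


def consensus(table):
    """Most frequent base at each position (first of A, T, C, G wins a tie)."""
    length = len(table["A"])
    return "".join(max("ATCG", key=lambda b: table[b][i]) for i in range(length))


table = frequency_table(ALIGNED)
for base in "ATCG":
    print(base, " ".join(f"{100 * f:3.0f}" for f in table[base]))
print(consensus(table)[3:9])
```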
Surrogate Filters
Feature finders
Position-Specific Scoring Matrix in action
aceB   ACTATGGAGCATCTGCACATGAAAACC
atpI   ACCTCGAAGGGAGCAGGAGTGAAAAAC
bioB   ACGTTTTGGAGAAGCCCCATGGCTCAC
glnA   ATCCAGGAGAGTTAAAGTATGTCCGCT
glnH   TAGAAAAAAGGAAATGCTATGAAGTCT
lacZ   TTCACACAGGAAACAGCTATGACCATG
rpsJ   AATTGGAGCTCTGGTCTCATGCAGAAC
serC   GCAACGTGGTGAGGGGAAATGGCTCAA
sucA   GATGCTTAAGGGATCACGATGCAGAAC
trpE   CAAAATTAGAGAATAACAATGCAAACA
unknown
Experimentally proven start sites
Surrogate Filters
Feature finders
Position-Specific Scoring Matrix in action
aceB   ACCACATAACTATGGAGCATCTGCACATGAAAACC
atpI   ACCTCGAAGGGAGCAG.....GAGTGAAAAAC
bioB   ACGTTTTGGAGAAGC...CCCATGGCTCAC
glnA   ATCCAGGAGAGTTA.AAGTATGTCCGCT
glnH   TAGAAAAAAGGAAATG.....CTATGAAGTCT
lacZ   TTCACACAGGAAACAG....CTATGACCATG
rpsJ   AATTGGAGCTCTGGTCTCATGCAGAAC
serC   GCAACGTGGTGAGGG...GAAATGGCTCAA
sucA   GATGCTTAAGGGATCA....CGATGCAGAAC
trpE   CAAAATTAGAGAATA...ACAATGCAAACA
A
C
G
T
Surrogate Filters
Feature finders
Position-Specific Scoring Matrix in action
aceB   ACCACATAACTATGGAGCATCT.GCACATGAAAACC
atpI   ACCTCGAAGGGAGCAG.....GAGTGAAAAAC
bioB   ACGTTTTGGAGAAGC...CCCATGGCTCAC
glnA   ATCCAGGAGAGTTA.AAGTATGTCCGCT
glnH   TAGAAAAAAGGAAATG.....CTATGAAGTCT
lacZ   TTCACACAGGAAACAG....CTATGACCATG
rpsJ   AATTGGAGCTCTGGTCTCATGCAGAAC
serC   GCAACGTGGTGAGGG...GAAATGGCTCAA
sucA   GATGCTTAAGGGATCA....CGATGCAGAAC
trpE   CAAAATTAGAGAATA...ACAATGCAAACA
A
C
G
T
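Once a scoring matrix exists, using it means sliding it along a new sequence and reporting the best-scoring window. The sketch below uses log-odds scores with pseudocounts against a uniform background; the four aligned sites are hypothetical ribosome-binding-site-like strings, not data from the slide, and the gapped alignments shown above are beyond what this ungapped sketch handles.

```python
import math


def build_pssm(sites, pseudocount=1.0, background=0.25):
    """Log2-odds score for each base at each position of the aligned sites."""
    n = len(sites)
    pssm = []
    for column in zip(*sites):
        scores = {}
        for base in "ACGT":
            freq = (column.count(base) + pseudocount) / (n + 4 * pseudocount)
            scores[base] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def best_window(seq, pssm):
    """Return (score, start) of the highest-scoring window in seq."""
    width = len(pssm)
    scored = []
    for start in range(len(seq) - width + 1):
        score = sum(pssm[j][seq[start + j]] for j in range(width))
        scored.append((score, start))
    return max(scored)


# Hypothetical aligned sites used only to illustrate the mechanics.
SITES = ["AGGAGG", "AGGAGA", "AAGAGG", "AGGAAG"]
pssm = build_pssm(SITES)
score, start = best_window("TTTTAGGAGGTTTT", pssm)
print(start, round(score, 2))
```

Pseudocounts keep a base that never appeared in the training sites from scoring minus infinity.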
Surrogate Filters
Pattern finders
Specified patterns (FindPatterns, PatScan)
e.g. Find instances of restriction sites
New pattern discovery (Meme, Gibbs sampler)
Human sequences 5’ to transcriptional start
snRNA U1 (pU1-6)     AGGTATATGGAGCTGTGACAGGGCAGAAGTGTGTGAAGTC
histone H1t          GCCCTACCCTATATAAGGCCCCGAGGCCGCCCGGGTGTTT
HMG-14               CGGCCGGCGGGGAGGGGGAGCCCGCGGCCGGGGACGCGGG
TP1                  GCCAAGGCCTTAAATACCCAGACTCCTGCCCCCGGGCCTT
protamine P1         CCCTGGCATCTATAACAGGCCGCAGAGCTGGCCCCTGACT
nucleolin            GCAGGCTCAGTCTTTCGCCTCAGTCTCGAGCTCTCGCTGG
snRNP E              TGCCGCCGCGTGACCTTCACACTTCCGCTTCCGGTTCTTT
rp S14               GACACGGAAGTGACCCCCGTCGCTCCGCCCTCTCCCACTC
rp S17               TGGCCTAAGCTTTAACAGGCTTCGCCTGTGCTTCCTGTTT
ribosomal p. S19     ACCCTACGCCCGACTTGTGCGCCCGGGAAACCCCGTCGTT
α-tubulin bα1        GGTCTGGGCGTCCCGGCTGGGCCCCGTGTCTGTGCGCACG
β-tubulin β2         GGGAGGGTATATAAGCGTTGGCGGACGGTCGGTTGTAGCA
α-actin skel-m.      CCGCGGGCTATATAAAACCTGAGCAGAGGGACAAGCGGCC
α-cardiac actin      TCAGCGTTCTATAAAGCGGCCCTCCTGGAGCCAGCCACCC
β-actin              CGCGGCGGCGCCCTATAAAACCCAGCGGCGCGACGCGCCA
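Searching for a specified pattern, such as a restriction site, is plain string matching. A minimal sketch using Python's `re` module follows; it is not how FindPatterns or PatScan are implemented, and the function name `find_sites` and the test sequence are made up for this example. The recognition sequences for EcoRI, HindIII and BamHI are the standard ones.

```python
import re

# Recognition sequences of three common restriction enzymes.
RESTRICTION_SITES = {
    "EcoRI": "GAATTC",
    "HindIII": "AAGCTT",
    "BamHI": "GGATCC",
}


def find_sites(seq, sites=RESTRICTION_SITES):
    """Return {enzyme: [0-based start positions]} for every occurrence."""
    seq = seq.upper()
    hits = {}
    for enzyme, pattern in sites.items():
        # A lookahead reports overlapping occurrences as well.
        hits[enzyme] = [m.start() for m in re.finditer(f"(?={pattern})", seq)]
    return hits


print(find_sites("AAGCTTGACCGAATTCAAGCTT"))
```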
Surrogate Filters
Pattern finders
How do pattern finders work?
snRNA U1 (pU1-6)
histone H1t
HMG-14
TP1
protamine P1
AGGTATATGGAGCTGTGACAGGGCAGAAGTGTGTGAAGTC
GCCCTACCCTATATAAGGCCCCGAGGCCGCCCGGGTGTTT
CGGCCGGCGGGGAGGGGGAGCCCGCGGCCGGGGACGCGGG
GCCAAGGCCTTAAATACCCAGACTCCTGCCCCCGGGCCTT
CCCTGGCATCTATAACAGGCCGCAGAGCTGGCCCCTGACT
Step 1. Arbitrarily choose candidate pattern from a sequence
Step 2. Find best matches to pattern in all sequences
Step 3. Construct position-dependent frequency table based on matches
Step 4. Calculate relative probability of matches from frequency table
GACAGGGCAGAA
GCCCGGGTGTTT
GCCGGGGACGCG
GCCCCCGGGCCT
GCCGCAGAGCTG
A   0.208  0.292  0.000  0.999  0.000  0.999  0.811  0.905  0.575  0.321  0.151  0.264
T   0.160  0.217  0.867  0.000  0.999  0.000  0.189  0.000  0.208  0.057  0.104  0.113
C   0.283  0.236  0.132  0.000  0.000  0.000  0.000  0.000  0.000  0.151  0.330  0.283
G   0.349  0.255  0.000  0.000  0.000  0.000  0.000  0.095  0.217  0.472  0.415  0.340
Surrogate Filters
Pattern finders
How do pattern finders work?
snRNA U1 (pU1-6)
histone H1t
HMG-14
TP1
protamine P1
AGGTATATGGAGCTGTGACAGGGCAGAAGTGTGTGAAGTC
GCCCTACCCTATATAAGGCCCCGAGGCCGCCCGGGTGTTT
CGGCCGGCGGGGAGGGGGAGCCCGCGGCCGGGGACGCGGG
GCCAAGGCCTTAAATACCCAGACTCCTGCCCCCGGGCCTT
CCCTGGCATCTATAACAGGCCGCAGAGCTGGCCCCTGACT
Step 1. Arbitrarily choose candidate pattern from a sequence
Step 2. Find best matches to pattern in all sequences
Step 3. Construct position-dependent frequency table based on matches
Step 4. Calculate relative probability of matches from frequency table
Step 5. If probability score high, remember pattern and score
Surrogate Filters
Pattern finders
How do pattern finders work?
snRNA U1 (pU1-6)
histone H1t
HMG-14
TP1
protamine P1
AGGTATATGGAGCTGTGACAGGGCAGAAGTGTGTGAAGTC
GCCCTACCCTATATAAGGCCCCGAGGCCGCCCGGGTGTTT
CGGCCGGCGGGGAGGGGGAGCCCGCGGCCGGGGACGCGGG
GCCAAGGCCTTAAATACCCAGACTCCTGCCCCCGGGCCTT
CCCTGGCATCTATAACAGGCCGCAGAGCTGGCCCCTGACT
Step 1. Arbitrarily choose candidate pattern from a sequence
Step 2. Find best matches to pattern in all sequences
Step 3. Construct position-dependent frequency table based on matches
Step 4. Calculate relative probability of matches from frequency table
Step 5. If probability score high, remember pattern and score
Step 6. Repeat Steps 1 - 5
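Steps 1 to 6 can be followed almost literally in code. The sketch below is a simplified illustration of the slide's outline, not the Meme or Gibbs sampler algorithm: instead of choosing candidates arbitrarily it tries every window in turn so the result is reproducible, it scores by log-odds against a uniform background (an assumption), and the three input sequences are hypothetical.

```python
import math


def best_match(seq, pattern):
    """Step 2: the window of seq with the fewest mismatches to pattern."""
    width = len(pattern)
    windows = (seq[i:i + width] for i in range(len(seq) - width + 1))
    return min(windows, key=lambda w: sum(a != b for a, b in zip(w, pattern)))


def frequency_table(matches, pseudocount=0.5):
    """Step 3: position-dependent base frequencies, with pseudocounts."""
    n = len(matches)
    return [{b: (col.count(b) + pseudocount) / (n + 4 * pseudocount) for b in "ACGT"}
            for col in zip(*matches)]


def score(matches, table, background=0.25):
    """Step 4: summed log2 odds of the matches under the table."""
    return sum(math.log2(table[j][b] / background)
               for m in matches for j, b in enumerate(m))


def find_pattern(seqs, width):
    best = (float("-inf"), None, None)
    for seq in seqs:                                   # Step 6: repeat
        for start in range(len(seq) - width + 1):
            pattern = seq[start:start + width]         # Step 1: candidate
            matches = [best_match(s, pattern) for s in seqs]
            table = frequency_table(matches)
            s = score(matches, table)
            if s > best[0]:                            # Step 5: remember the best
                best = (s, pattern, matches)
    return best


SEQS = ["GGCGTATAAAGCGC", "CCTATAAAGGCGCG", "GCGCGCTATAAACC"]
print(find_pattern(SEQS, 6))
```

Real pattern finders replace the exhaustive loop with sampling or expectation maximization so that they scale to long sequences and unknown motif widths.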
Surrogate Filters
Scenario II – Case of the Masked Motif
• You’ve found a gene related to Purple Tongue Syndrome
• BlastP: Encoded protein related to cAMP-binding proteins
• Are the similarities trivial? Related to cAMP binding?
• Does your protein contain cAMP-binding site?
• What IS a cAMP-binding site?
Task
1. Determine what is a cAMP-binding site
2. Determine if your protein has one
Surrogate Filters
Scenario II – Case of the Masked Motif
Strategy
1. Collect sequences of known cAMP-binding proteins
2. Run Meme, a pattern-finding program
Ask it to find any significant motifs
Do it
3. Rerun Meme. Demand that every protein has identified motifs
4. Run Pfam over known sequence to check
Surrogate Filters
Scenario III – Case of the Mortal Mitochondrion
Progressive External Ophthalmoplegia (PEO)
• Slow paralysis of voluntary eye muscles
• Many other symptoms (e.g., frequent deafness)
• Loss of mitochondrial DNA
Surrogate Filters
Scenario III – Case of the Mortal Mitochondrion
Progressive External Ophthalmoplegia (PEO)
• Slow paralysis of voluntary eye muscles
• Many other symptoms (e.g., frequent deafness)
• Loss of mitochondrial DNA
Inheritance
• Mendelian
• Autosomal dominant
• Linked to chromosome 4q34
Surrogate Filters
Scenario III – Case of the Mortal Mitochondrion
Progressive External Ophthalmoplegia (PEO)
• Slow paralysis of voluntary eye muscles
• Many other symptoms (e.g., frequent deafness)
• Loss of mitochondrial DNA
Inheritance
• Mendelian
• Autosomal dominant
• Linked to chromosome 4q34
Your task
• Examine sequence of 4q34 region
• Assess likelihood that a gene in the area could cause disease symptoms
Surrogate Filters
Scenario III – Case of the Mortal Mitochondrion
Examining Sequence of 4q34 Region
tctacttatattcaatccacagggctacacctagttcttggtacacagtacatgctcagcaagagtctgttgaatgaacacatacatggtttatctgtttgtctcttccgagttcttgacttctgtctgct
ctgacctctggcagctttccactagtttctagctttcattctgcttacctggatttcggaactctagcctgccccactcttagataaacgcatgccctctgtggccctggaaccttagtgacttctgctat
accaaagtctccacgcccagggtgacacgcagctgcagctccgtaaacctctaacatgatgtcagcaaatattaaaaaaaaaaagtttataaaaacaatgaataaactttgttaaaggtacaaatgaaaat
tagcaaacatgggaagataattgagtaaagagtttaaagttaaaaacgaattgcagtcattctaggggaaggaacagttgtatttgaaaacctgtatggttacatgaactgcctaaaaaacaagctaagga
aaattaaagctcagatttatatattttaagaaattaattgcaattaatttcctgggattaaatagcatttcctcaaccccagctgtcattaaaaagaggcaaatacagccaaggactggatcttctccgga
aggctgacagcactgaccctcaagaaggcaccggctgacagacagaacattctgccctaatatgtgctgaaattccgctgagagcagagtggtacattgaaccctttaggggcttacaaaagaagtgtcct
gtgttttagagtcacagagttttgcagaaacaagtatgaattcacctagtggccccctgcaccaggtctttcctgtgggcactgagtgcagacacatcaatatgtaatagcagaatgaatgactgaacgaa
cgattgaatgaaaagaaatgagaggcagcaggttgtcagattctatgaggcaatcacagcatcaggtgaccttagtatctatttgagaggactgccatttattctcgggagcgcacggctctaaagaggcc
catatccaggcagtgagctctggtggggggcgcctttagatgcaagaaggaggaaacagctcgaaatccctgggcctgagcgcggcccgtgcaggccggagggtcaagaactctccaccggcggcagcggc
ccggtgtctgccccggcttcgccccggcctaaggctgcctgtgctataaatacgcggcccacatgccgcggtgacacggtgttccctgggctcggcgggacagataacatgaatgtgccctttaaacgtcc
caagttgcagggacagcccccggcccagcctcgctcccggaagcgccttcgcccccgatgccctctgcagctgggaggagggggcgccccgcacctgcccagccaatgcgcggcgcgagcgccggccgcga
cccgcctcctctcgcgagagcccggcggggatataagggggagctgcgggccaggcggcggccccctagcgtcgcgcagggtcggggactgcgcgcggtgccaggccgggcgtgggcgagagcacgaacgg
gctgcctgcgggctgagagcgtcgagctgtcaccatgggtgatcacgcttggagcttcctaaaggacttcctggccgggggcgtcgccgctgccgtctccaagaccgcggtcgcccccatcgagagggtca
aactgctgctgcaggtgaggaccgcgcggtgcaagaggcgggcgcgggcgcggcgggccgggcggggcgcgcgatgcggcgcgagctgcagggcgcggggcgccgcggaaaatctgcgccaggccacaggc
ccgggcgcccgcccgcccgcgggggaagaaggtgccctctgcgtagagacaggtccagcgtcagtcgcagattcctggtgtcgggtggcgcccggcgttcgggtgtctatatatggaaacccacccggagc
cggtttacgtgtgccagatcctgcgcccgtgacagcacgggcgtgcactcaggcccggaggcacctagtgattgccagtatttttggcaccgtcttatgcgcacgcacctttacaataaaaacatcaaaat
aatcatcacccaagaattcccttatcgtatctcatgcacaatgctgtatgtaggctgacgccttcatctttatgtaacctctgtgagagagttattcttctccattttacagatgaagctgaggttttgaa
atattaagaaacaattttcggaataaactcagatcatcctgtctccaaatcttttcctcccctacctggtcgctgaatggtttatcatcctctcgtgttttcctccacctgcccaaaaggtcagggcccct
caatgaggaagagcccaatttgggagtcagaattactaacaacaaaacccccacaaattgctcacaacggcagcaaacccttaataattgattacttggattatctgcttgaaaactttggaggcctaatg
tttagtggatttattctccttcctctattagagcatctagtagagatcctcatctccagggtgatcagagtgacactgagaaattgtcattttttggccatcatgtctattaaatccaaagccctttgaag
cagggagtgttactcatttctgtcccccagtaagcccctcatacagttctcaaacctagggaaagtgaaataaataaatggctatagctttatataattcaatcaccttttcagtttatttggggcaatac
ctttccctcaaataccctaataattgaagcaacattggattattttggcttgttatccagtaactaacatggataacagtatccatttacacgtcctcgtatccatttgatttcctcatcctttttttctt
caaaaaaaaaatctaggaagtgcaaaccttttttttttctcctgtcctcttcccttctctctaccctgcctgtcctctgtcacccaccctcccctccaccaggtccagcatgccagcaaacagatcagtgc
tgagaagcagtacaaagggatcattgattgtgtggtgagaatccctaaggagcagggcttcctctccttctggaggggtaacctggccaacgtgatccgttacttccccacccaagctctcaacttcgcct
tcaaggacaagtacaagcagctcttcttagggggtgtggatcggcataagcagttctggcgctactttgctggtaacctggcgtccggtggggccgctggggccacctccctttgctttgtctacccgctg
gactttgctaggaccaggttggctgctgatgtgggcaagggcgccgcccagcgtgagttccatggtctgggcgactgtatcatcaagatcttcaagtctgatggcctgagggggctctaccagggtttcaa
cgtctctgtccaaggcatcattatctatagagctgcctacttcggagtctatgatactgccaagggtgagagaggggcatcggggagaaggagggtggtgtggaaagaggatcctatgggatctataactc
acaaaggacctgatatatattgatcttgttttttctagtctctgggataattgaggcttctgaatgaggaggtgatgtgcataagttaatagctgaagcgttccttgtgtcctctactgaaataaactctg
gcctttagttattcagagaggaggaggggggagcctgtctccctctagacacagccatagcagttactgagtttaacttgaagccacttccaatgccctgtatacaagctgagcactgcccctccggggtc
cggagagggcagcagccacctttgctgtctgcctggtcatatgtgaagcacctgcacaggggcaggttccccgcaaggtcagagcatggagctggaggtgcagtggcctctctccctccacctgctttctg
ctgagaacaggcacttcatagccgttcggcttctgggctctgtccacagggatgctgcctgaccccaagaacgtgcacatttttgtgagctggatgattgcccagagtgtgacggcagtcgcagggctggt
gtcctacccctttgacactgttcgtcgtagaatgatgatgcagtccggccggaaagggggtaagcttgtgctctactcatctaaacttgtttggttttgcccgaggagaacattttacagggctcctttca
gtcttccttactggaaattaattttcaaaattatttgataaggacttagggaagaaagatggtattaattccccctaacgttctcaactatcctattagggaaaagtattttccattttattagagatgat
aagaacatgaatagtaagacatttagatgtgaatttaactaggtatccagcattatagagaccctaggccctcttcccttagagcctgggtgcaaaagctagggaaaagaagtagttagctacttcttaca
aagaactcttgcttccctcctagttacaggtgttagtgggatggggtgtttagctgggtagagatggcctgaagcaatctgttgtgccagagaaagttttggcttctataggttgaaccatatgaaattgc
cactttaaaagtcaaaaacagtccaatgttagcagtttcgtatgtttcaacgaatagttacagccttttatttagactgcataacctcgtgcaggatcatctgaggctcagcctcagttcggtcctccata
aaaaaaggtaaccgcgtagcataatactcctgctccactgcgcccttcttgtttcgcagttgggcagtccatgaattacttggttaattgccccagttcttcactgaccttgaactaatggagtaggaatg
acaggagacccagcctgccagtgaagcaaggaaggagatgtccagtgggatgttgcatggagctgggactccatgcccagatgaccctgattttataaaactggtaacagtgtgtacagatatgtttcagg
ggaaaagtctctttcctccagcgttacggagccctcaccagcatttgtttccacagccgatattatgtacacggggacagttgactgctggaggaagattgcaaaagacgaaggagccaaggccttcttca
aaggtgcctggtccaatgtgctgagaggcatgggcggtgcttttgtattggtgttgtatgatgagatcaaaaaatatgtctaatgtaattaaaacacaagttcacagatttacatgaacttgatctacaag
ttcacagatccattgtgtggtttaatagactattcctaggggaagtaaaaagatctgggataaaaccagactgaaggaatacctcagaagagatgcttcattgagtgttcattaaaccacacatgtatttt
Surrogate Filters
Scenario III – Case of the Mortal Mitochondrion
Strategy
• Assume that encoded protein is in mitochondria
• Protein has function associated with mitochondrial location?
– Use Gene finder to identify protein sequence(s)
– Use Similarity finder to identify possible function
• Protein has structure associated with mitochondrial location?
– Use Feature finders to identify pertinent regions
– (What ARE pertinent regions?)
Surrogate Filters
Scenario III – Case of the Mortal Mitochondrion
Run 4q34 region through FGene
Name: PEO-related_gene?
First three lines of sequence:
tctacttatattcaatccacagggctacacctagttcttggtacacagtacatgctcagcaagagtctgttgaat
gaacacatacatggtttatctgtttgtctcttccgagttcttgacttctgtctgctctgacctctggcagctttc
cactagtttctagctttcattctgcttacctggatttcggaactctagcctgccccactcttagataaacgcatg
fgene Wed Feb 27 16:55:29 GMT 2002
>PEO-related_gene?
length of sequence 5768
number of predicted exons - 5
positions of predicted exons:
1607 1717 w= 17.84 ORF: 1607 1717
2985 3231 w=  9.13 ORF: 2985 3230
3421 3471 w=  6.08 ORF: 3423 3470
3980 4120 w= 12.62 ORF: 3982 4119
5035 5192 w=  1.93 ORF: 5037 5192
Length of Coding region 708bp
Amino acid sequence MGDHAWSFLKDFLAGGVAAAVSKTAVAPIERVKLLLQVQHASKQISAEKQYKGIIDCVVR
IPKEQGFLSFWRGNLANVIRYFPTQALNFAFKDKYKQLFLGGVDRHKQFWRYFAGNLASG
IIIYRAAYFGVYDTAKGMLPDPKNVHIFVSWMIAQSVTAVAGLVSYPFDTVRRRMMMQSG
RKGADIMYTGTVDCWRKIAKDEGAKAFFKGAWSNVLRGMGGAFVLVLYDEIKKYV*
235aa
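The reported totals can be checked from the exon coordinates: the five predicted exons should add up to the 708 bp coding region, and 708 bp is 236 codons, that is 235 amino acids plus a stop. A small sketch of that arithmetic follows, together with a helper that joins exons out of a genomic sequence; `coding_length` and `splice` are illustrative names and the toy sequence is hypothetical.

```python
# Exon coordinates (1-based, inclusive) as listed in the FGene output above.
FGENE_EXONS = [(1607, 1717), (2985, 3231), (3421, 3471), (3980, 4120), (5035, 5192)]


def coding_length(exons):
    """Total number of bases covered by the exons."""
    return sum(end - start + 1 for start, end in exons)


def splice(genomic, exons):
    """Concatenate exon segments from a genomic sequence (1-based, inclusive)."""
    return "".join(genomic[start - 1:end] for start, end in exons)


length = coding_length(FGENE_EXONS)
print(length, "bp =", length // 3, "codons =", length // 3 - 1, "aa + stop")

# Toy example (hypothetical sequence): join bases 3-5 and 9-11.
print(splice("AAATGCCCTTTGGG", [(3, 5), (9, 11)]))
```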
Surrogate Filters
Scenario III – Case of the Mortal Mitochondrion
Run 4q34 region through FGeneSH
Name: PEO-related_gene?
First three lines of sequence:
tctacttatattcaatccacagggctacacctagttcttggtacacagtacatgctcagcaagagtctgttgaat
gaacacatacatggtttatctgtttgtctcttccgagttcttgacttctgtctgctctgacctctggcagctttc
cactagtttctagctttcattctgcttacctggatttcggaactctagcctgccccactcttagataaacgcatg
Fgenesh Wed Feb 27 16:59:14 GMT 2002
FGENESH 1.0 Prediction of potential genes in Human
Time: Wed Feb 27 16:59:14 2002
Seq name: PEO-related_gene?
Length of sequence: 5768 GC content: 48 Zone: 2
Positions of predicted genes and exons:
G  Str  Feature    Start    End    Score   ORF            Len
1  +    TSS         1216           -2.70
1  +    1 CDSf      1607 -  1717   18.01   1607 -  1717   111
1  +    2 CDSi      2985 -  3471   52.41   2985 -  3470   486
1  +    3 CDSi      3980 -  4120   20.99   3982 -  4119   138
1  +    4 CDSl      5035 -  5192    2.32   5037 -  5192   156
1  +    PolA        5471            0.92
FGENE output on the same genomic DNA, for comparison:
1607  1717  w= 17.84
2985  3231  w=  9.13
3421  3471  w=  6.08
3980  4120  w= 12.62
5035  5192  w=  1.93
Predicted protein(s):
>FGENESH 1  4 exon (s)  1607 - 5192  298 aa, chain +
MGDHAWSFLKDFLAGGVAAAVSKTAVAPIERVKLLLQVQHASKQISAEKQYKGIIDCVVR
IPKEQGFLSFWRGNLANVIRYFPTQALNFAFKDKYKQLFLGGVDRHKQFWRYFAGNLASG
GAAGATSLCFVYPLDFARTRLAADVGKGAAQREFHGLGDCIIKIFKSDGLRGLYQGFNVS
How to decide where exons are?
[Diagram: DNA with promoter (P) followed by Exon, Intron, Exon, Intron, Exon; hnRNA; mRNA with poly(A) tail (AAAAAAAA)]
Strategy
• Compare sequence of 4q34 region to sequence of mRNA
• Sequence of mRNA may be in cDNA library
• Expressed Sequence Tag (EST) library
Problems
• Library may not exist
• Expression of gene may be low
Surrogate Filters
Scenario III – Case of the Mortal Mitochondrion
Run 4q34 region through BlastN (vs. human ESTs)
Final Score Card for Gene Finders
Feature                    FGene             FGeneSH        BlastN of EST library
                           (splice site      (FGene +       (compare with known)
                           recognition)      HMM model)
Transcription Start Site                     1216           1501
Exon 1                     …1607-1717        …1607-1717     …1607-1717
Exon 2                     2985-3231         2985-3471      2985-3471
Exon X                     3421-3471
Exon 3                     3980-4120         3980-4120      3980-4120
                           5035-5192…        5035-5192…     5035-5192…
PolyA site                                   ???            ???
MORAL: Trust, but verify.
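The comparison behind the score card can be sketched as a simple tally of how many methods report each exon. This is an illustrative sketch only: the FGene and FGeneSH coordinates come from the outputs shown earlier, the BlastN coordinates are read off the score card, the leading and trailing ellipses are ignored, and `tally_support` is a hypothetical helper name.

```python
from collections import Counter

# Exon coordinates per method. FGene and FGeneSH values are copied from their
# outputs; BlastN values are as read from the score card (an assumption).
predictions = {
    "FGene":   [(1607, 1717), (2985, 3231), (3421, 3471), (3980, 4120), (5035, 5192)],
    "FGeneSH": [(1607, 1717), (2985, 3471), (3980, 4120), (5035, 5192)],
    "BlastN":  [(1607, 1717), (2985, 3471), (3980, 4120), (5035, 5192)],
}

def tally_support(predictions):
    """Count how many methods report each exact (start, end) exon."""
    return Counter(exon for exons in predictions.values() for exon in set(exons))

support = tally_support(predictions)
unanimous = sorted(e for e, n in support.items() if n == len(predictions))
disputed = sorted(e for e, n in support.items() if n < len(predictions))

print("agreed by all methods:", unanimous)
print("needs checking:", disputed)
```

Only the region between 2985 and 3471 is in dispute: FGene splits it into two exons, while FGeneSH and the EST comparison treat it as one.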
Surrogate Filters
Scenario III – Case of the Mortal Mitochondrion
Strategy
• Assume that encoded protein is in mitochondria
• Protein has function associated with mitochondrial location?
– Use Gene finder to identify protein sequence(s)
– Use Similarity finder to identify possible function
• Protein has structure associated with mitochondrial location?
– Use Feature finders to identify pertinent structures
– (What ARE pertinent structures?)
Surrogate Filters
Scenario III – Case of the Mortal Mitochondrion
Run 4q34 region through BlastP
Name: PEO-related_gene?
First three lines of sequence:
tctacttatattcaatccacagggctacacctagttcttggtacacagtacatgctcagcaagagtctgttgaat
gaacacatacatggtttatctgtttgtctcttccgagttcttgacttctgtctgctctgacctctggcagctttc
cactagtttctagctttcattctgcttacctggatttcggaactctagcctgccccactcttagataaacgcatg
Fgenesh Wed Feb 27 16:59:14 GMT 2002
FGENESH 1.0 Prediction of potential genes in Human
Time: Wed Feb 27 16:59:14 2002
Seq name: PEO-related_gene?
Length of sequence: 5768 GC content: 48 Zone: 2
Positions of predicted genes and exons:
G  Str  Feature    Start    End    Score   ORF            Len
1  +    TSS         1216           -2.70
1  +    1 CDSf      1607 -  1717   18.01   1607 -  1717   111
1  +    2 CDSi      2985 -  3471   52.41   2985 -  3470   486
1  +    3 CDSi      3980 -  4120   20.99   3982 -  4119   138
1  +    4 CDSl      5035 -  5192    2.32   5037 -  5192   156
1  +    PolA        5471            0.92
Predicted protein(s):
>FGENESH 1  4 exon (s)  1607 - 5192  298 aa, chain +
MGDHAWSFLKDFLAGGVAAAVSKTAVAPIERVKLLLQVQHASKQISAEKQYKGIIDCVVR
IPKEQGFLSFWRGNLANVIRYFPTQALNFAFKDKYKQLFLGGVDRHKQFWRYFAGNLASG
GAAGATSLCFVYPLDFARTRLAADVGKGAAQREFHGLGDCIIKIFKSDGLRGLYQGFNVS
VQGIIIYRAAYFGVYDTAKGMLPDPKNVHIFVSWMIAQSVTAVAGLVSYPFDTVRRRMMM
Surrogate Filters
Scenario III – Case of the Mortal Mitochondrion
Run 4q34 region through BlastP
Summary
• One protein in region
• Contains mitochondrial carrier motifs
• Similar to ATP/ADP transporter
• Mitochondrial signal sequence?
Reasonable candidate for PEO-related protein
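The "mitochondrial carrier motifs" point in the summary above can be illustrated with a simple pattern scan over the predicted protein. This is a rough sketch, not the method used by the similarity or feature finders on the slides: it assumes the commonly quoted simplified carrier motif P-x-[DE]-x-x-[KR], and it scans only the 240 residues shown in the transcript (the full prediction is reported as 298 aa). The names `CARRIER_MOTIF` and `find_motifs` are made up for this example.

```python
import re

# Residues of the FGeneSH-predicted protein as shown in the transcript
# (the slide shows 240 of the reported 298 aa; the rest is not scanned).
protein = (
    "MGDHAWSFLKDFLAGGVAAAVSKTAVAPIERVKLLLQVQHASKQISAEKQYKGIIDCVVR"
    "IPKEQGFLSFWRGNLANVIRYFPTQALNFAFKDKYKQLFLGGVDRHKQFWRYFAGNLASG"
    "GAAGATSLCFVYPLDFARTRLAADVGKGAAQREFHGLGDCIIKIFKSDGLRGLYQGFNVS"
    "VQGIIIYRAAYFGVYDTAKGMLPDPKNVHIFVSWMIAQSVTAVAGLVSYPFDTVRRRMMM"
)

# Simplified mitochondrial carrier motif P-x-[DE]-x-x-[KR].
# This pattern is an assumption for illustration, not a curated definition.
CARRIER_MOTIF = re.compile(r"P.[DE]..[KR]")

def find_motifs(sequence, pattern=CARRIER_MOTIF):
    """Return (1-based position, matched text) for each non-overlapping match."""
    return [(m.start() + 1, m.group()) for m in pattern.finditer(sequence)]

for position, text in find_motifs(protein):
    print(position, text)
```

On the residues shown, the scan reports three hits spaced roughly 100 residues apart (positions 28, 133 and 230), which is the kind of repeated pattern a carrier-like protein would be expected to show. A curated motif database should be used for any real assignment.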
Complex gene discovery
Your turn: Repeat and extend characterization of PEO-related gene
1. Take same sequence (FastA format) e-mailed to you
2. Get better estimate of promoter and polyA site
(e.g. by TSSW and PolyASH)
(Is there a TATA box upstream from the predicted promoter?)
3. Find encoded protein sequence by suitable method
(e.g. FGeneSH(GC) or comparison with cDNA)
4. Continue characterization of protein
* Contains signal sequence?
* Contains transmembrane domains?
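For the transmembrane question in step 4, one simple starting point is a sliding-window hydropathy profile. The sketch below uses the Kyte-Doolittle scale with a 19-residue window and a cutoff of 1.6, which is one conventional choice rather than the method of any tool named above. The function names are hypothetical, and the example input is just the last line of protein sequence shown on the earlier slide.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(protein, window=19):
    """Mean hydropathy of every full window along the protein.

    Raises KeyError on non-standard symbols such as '*' or 'X'.
    """
    scores = [KD[aa] for aa in protein]
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]

def candidate_tm_windows(protein, window=19, threshold=1.6):
    """1-based start positions of windows whose mean hydropathy exceeds threshold."""
    profile = hydropathy_profile(protein, window)
    return [i + 1 for i, h in enumerate(profile) if h > threshold]

# Example input: one line of the predicted protein from the earlier slide.
fragment = "VQGIIIYRAAYFGVYDTAKGMLPDPKNVHIFVSWMIAQSVTAVAGLVSYPFDTVRRRMMM"
print(candidate_tm_windows(fragment))
```

Windows above the cutoff are only candidates. A dedicated transmembrane predictor should be used to confirm them.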
Filter limitation
Inevitable…
but whose filter?
Filters controlled by
outside programmers
Filters controlled by you