Coordinated Laboratory for Computational Genomics

What is the problem?
• Very large databases
• Unrefined datasets
– Whole genomes in draft form
• Pairwise searching
– Alignment – O(n^2) for each sequence in the database
– BLAST: Tool that searches with “hashes” to speed up.
• Basic idea is that if you have a sequence from a
“related” gene, then you can find new genes:
– Copies of genes in same species
– Same gene in different species
• The problem is that single instances may not
represent the diversity that can be biologically
interesting.
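As a rough illustration of the "hashes" idea mentioned above, the sketch below indexes every k-letter word of a database sequence in a dictionary, so a query can jump straight to candidate match positions instead of being aligned against every position. This is a simplified illustration and not BLAST's actual implementation; the function names, the example sequences, and the word size k=3 are assumptions made for the example.

```python
from collections import defaultdict

def build_kmer_index(sequence, k=3):
    """Map each k-letter word to the list of positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return index

def find_seeds(query, index, k=3):
    """Return (query_pos, db_pos) pairs where a k-letter word matches exactly."""
    seeds = []
    for q in range(len(query) - k + 1):
        for d in index.get(query[q:q + k], []):
            seeds.append((q, d))
    return seeds

# Hypothetical database and query, chosen only for the example.
database = "GCCGCCATGCCCTTC"
index = build_kmer_index(database)
seeds = find_seeds("CATGC", index)
print(seeds)
```

Each seed is only a starting point; a real tool would extend seeds into scored alignments.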
1
Central dogma of Genome Function
2
Hidden Markov Models
(Basic Concepts)
• Goal: Construct a model which can be built
from a multiple sequence alignment (i.e., a
training dataset) that will score future
sequences with their degree of similarity to
the set of training sequences.
• Note: Fundamentally different from
BLAST, with its universal substitution
matrices.
3
PAM-250 Matrix
4
Hidden Markov Models
(Basic Concepts)
• Uses notion of a prior probability (Bayesian
Statistics) to reverse roles of observation and
expectation
• E.g., in a random sequence, P(A) = P(C) = P(G) =
P(T) = 0.25. These are prior probabilities.
• Now, assume that in a training data set, that 30%
of the time, a ‘G’ was seen to follow an ‘AT’. We
would say that P(G|AT) = 0.3, yet P(G) is still 0.25
overall.
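A conditional probability such as P(G|AT) can be estimated by counting, as in the minimal sketch below. The training string is made up so that 3 of the 10 occurrences of 'AT' are followed by 'G'; the function name is an assumption for the example.

```python
from collections import Counter

def conditional_prob(sequences, context, symbol):
    """Estimate P(symbol | context) by counting what follows each occurrence of context."""
    followers = Counter()
    n = len(context)
    for seq in sequences:
        for i in range(len(seq) - n):
            if seq[i:i + n] == context:
                followers[seq[i + n]] += 1
    total = sum(followers.values())
    return followers[symbol] / total if total else 0.0

# Hypothetical training data: ten 'AT' occurrences, three of them followed by 'G'.
training = ["ATGATCATGATCATGATCATCATTATTATC"]
p_g_given_at = conditional_prob(training, "AT", "G")
print(p_g_given_at)
```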
5
HMMs: Start Codon Recognition
A: 0.91
C: 0.03
G: 0.03
T: 0.03
A: 0.03
C: 0.03
G: 0.03
T: 0.91
A: 0.03
C: 0.03
G: 0.91
T: 0.03
A
T
G
• Above: A “state machine/model” for outputting sequences. It would output
various sequences with varying probabilities
ATG → .91 x .91 x .91 = .7536
ATT → .91 x .91 x .03 = .0248
TAG → .03 x .03 x .91 = .000819
• What are these? P(ATT|M) -- M is the model
• But, what we want is P(M|ATT) – I.e., Probability that we are looking at a real
start codon, given that we have seen ‘ATT’.
• Subtle, but very important difference.
6
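The start codon model can be written as one emission table per position, and P(sequence|M) is then the product of the per-position probabilities. The sketch below uses the numbers from the slide; the names START_CODON_MODEL and likelihood are assumptions for the example.

```python
# Per-position emission probabilities copied from the A-T-G state model on the slide.
START_CODON_MODEL = [
    {"A": 0.91, "C": 0.03, "G": 0.03, "T": 0.03},
    {"A": 0.03, "C": 0.03, "G": 0.03, "T": 0.91},
    {"A": 0.03, "C": 0.03, "G": 0.91, "T": 0.03},
]

def likelihood(seq, model=START_CODON_MODEL):
    """Return P(seq | M) for a sequence with one base per model state."""
    if len(seq) != len(model):
        raise ValueError("sequence length must equal the number of states")
    p = 1.0
    for base, emissions in zip(seq, model):
        p *= emissions[base]
    return p

for codon in ("ATG", "ATT", "TAG"):
    print(codon, round(likelihood(codon), 6))
```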
HMMs: Bayes Rule and
Key Derivation
• Bayes Rule:
P(A|B) x P(B) = P(B|A) x P(A)
• Rearranged:
P(A|B) = (P(B|A) x P(A)) / P(B)
• Let A be M, and B be the observed sequence,
e.g. ATT from our codon example
P(M|ATT) = (.0248 x P(M)) / (0.25 x 0.25 x 0.25)
= 1.587 x P(M) ;
note: P(M) is a constant, so falls out of all comparisons between scores
of sequences
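The arithmetic in the derivation above can be checked with a few lines. This is only a sketch: odds_vs_random is a hypothetical helper name, and the uniform background of 0.25 per base is the random-sequence prior used on the earlier slide.

```python
def odds_vs_random(p_seq_given_model, length, background=0.25):
    """Return P(seq|M) / P(seq|random); multiplying by P(M) gives P(M|seq)."""
    return p_seq_given_model / background ** length

ratio = odds_vs_random(0.0248, 3)
print(round(ratio, 3))
```

Because P(M) is the same for every sequence scored against the same model, comparing these ratios is enough to rank sequences.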
7
Profile HMMs
• Example alignment:
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATG
• [AT][CG][AC][ACGT]*A[TG][GC]
• This regular expression (RE) captures many
sequences, including the ones above. However, it
does not distinguish the unlikely TGCTAGG from the consensus-like ACACATC.
8
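The regular expression can be tried directly with Python's re module. The sketch below confirms that all five training sequences (with gaps removed) match, and that TGCTAGG and ACACATC match equally; a match is yes or no, with no notion of one sequence fitting better than another.

```python
import re

PROFILE_RE = re.compile(r"[AT][CG][AC][ACGT]*A[TG][GC]")

aligned = ["ACA---ATG", "TCAACTATC", "ACAC--AGC", "AGA---ATC", "ACCG--ATG"]
training = [row.replace("-", "") for row in aligned]  # strip alignment gaps

results = {seq: bool(PROFILE_RE.fullmatch(seq))
           for seq in training + ["TGCTAGG", "ACACATC"]}
for seq, matched in results.items():
    print(seq, matched)
```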
HMMs: Building a Model
• Rules:
– One state for each “clear” position, or for each term in the RE.
– Insert states for Kleene closure terms in the RE.
– State probabilities computed from state “populations”.
– Transition probabilities must sum to 1.0.
• Starting out…
– The [AT] term in the previous example has 80% As, and 20% Ts.
– Transition to the next “state” is unconditional.
A: 0.8
C: 0.0
G: 0.0
T: 0.2
[AT] state
1.0
A: 0.0
C: 0.8
G: 0.2
T: 0.0
[CG] state
1.0
A: 0.8
C: 0.2
G: 0.0
T: 0.0
[AC] state
...
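The state probabilities for the first three columns can be derived by counting residues in each alignment column, as in this minimal sketch. The function name column_probs is an assumption; the alignment is the one from the slides.

```python
from collections import Counter

ALIGNMENT = ["ACA---ATG", "TCAACTATC", "ACAC--AGC", "AGA---ATC", "ACCG--ATG"]

def column_probs(alignment, col):
    """Relative frequency of each base in one alignment column, ignoring gaps."""
    column = [row[col] for row in alignment if row[col] != "-"]
    counts = Counter(column)
    return {base: counts[base] / len(column) for base in "ACGT"}

for col in (0, 1, 2):
    print(col, column_probs(ALIGNMENT, col))
```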
9
HMMs: Building a Model
• Continuing . . .
– If states must split, transition probabilities must
reflect the probabilities of going to the insert
state, versus bypassing the insert state
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATG
3 sequences lead to insert state → 3/5 = 0.6
2 sequences bypass the insertion state → 2/5 = 0.4
A: -
C: -
G: -
T: -
0.6
A: 0.8
C: 0.0
G: 0.0
T: 0.2
[AT] state
1.0
A: 0.0
C: 0.8
G: 0.2
T: 0.0
[CG] state
1.0
A: 0.8
C: 0.2
G: 0.0
T: 0.0
[AC] state
10
0.4
HMMs: Building a Model
• Probabilities of symbols on insert state
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATG

0.4
A: 0.2
C: 0.4
G: 0.2
T: 0.2
[ACGT*] state
2 C’s (0.4)
1 G (0.2)
1 A (0.2)
1 T (0.2)
Total of 5 symbols
0.6
0.6
• Probabilities of transitions
leaving insert state
– After arriving in insert state, 2 insertions remain
2/5 = 0.4
– Otherwise, we leave this state.
1 - 2/5 = 0.6
A: 0.8
C: 0.2
G: 0.0
T: 0.0
[AC] state
0.4
A: 1.0
C: 0.0
G: 0.0
T: 0.0
[A] state
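The insert-state numbers can be reproduced by counting in the gapped region of the alignment. The sketch below assumes that alignment columns 4 to 6 (Python slice 3:6) are the insert region, which is how the slides treat them; the variable names are illustrative.

```python
from collections import Counter

ALIGNMENT = ["ACA---ATG", "TCAACTATC", "ACAC--AGC", "AGA---ATC", "ACCG--ATG"]

# Residues each sequence places in the insert region (empty string if none).
inserted = [row[3:6].replace("-", "") for row in ALIGNMENT]
with_insert = [s for s in inserted if s]

p_enter = len(with_insert) / len(ALIGNMENT)        # sequences that enter the insert state
symbols = "".join(with_insert)                      # all inserted residues
counts = Counter(symbols)
emissions = {b: counts[b] / len(symbols) for b in "ACGT"}
p_stay = (len(symbols) - len(with_insert)) / len(symbols)  # insertions that follow an insertion
p_leave = 1 - p_stay

print(p_enter, emissions, p_stay, p_leave)
```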
11
Example HMM Derivation
12
HMMs: Example Sequence Scoring
• P(ACACATC|M) =
0.8x1.0x0.8x1.0x0.8x1.0x0.6x0.4x0.6x1.0x1.0x0.8x1.0x0.8
(a product of state probabilities and transition probabilities)
= 0.04718 = 4.7 x 10^-2
• P(M|ACACATC) =
((4.7 x 10^-2)/(0.25)^7) x P(M) = (7.7 x 10^2) x P(M)
Log-odds = ln(7.7 x 10^2) = 6.65
log2(7.7 x 10^2) = 9.6
This number is a “score” of the likelihood that seeing this sequence
implies that the model applies.
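The score above can be recomputed from the listed factors. In this sketch the factor list is copied verbatim from the slide's product, and the background probability of 0.25 per base is the random-sequence prior used earlier; requires Python 3.8 or later for math.prod.

```python
import math

# State and transition probabilities along the path for ACACATC, as listed on the slide.
factors = [0.8, 1.0, 0.8, 1.0, 0.8, 1.0, 0.6, 0.4, 0.6, 1.0, 1.0, 0.8, 1.0, 0.8]

p_seq_given_model = math.prod(factors)
p_seq_random = 0.25 ** 7               # seven bases, each with probability 0.25
odds = p_seq_given_model / p_seq_random

print(round(p_seq_given_model, 5))
print(round(math.log(odds), 2), round(math.log2(odds), 2))
```

Working in log space also avoids numerical underflow when many small probabilities are multiplied for long sequences.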
13
HMM Scoring of Sequences
14
Log-odds HMM Model Example
15
HMM Profile Model Structure
16
Example Alignment (SH3 domain)
17
Example HMM Profile Model
(No synthetic pseudo-counts)
18
HMM Model Example with
Pseudo-count of 1
19
Gene Prediction with HMMs
• HMMs can be used for “predicting”, in
genomic sequence, where genes are encoded.
• Models can be built from sets of known genes
– Promoters
– Start of coding (start codon)
– Intron/exon splice sites
– Stop Codon
– Polyadenylation site
20
Genome Architecture Primer
Start codon
Codons
Donor site
GCCGCCGCCATGCCCTTCTCCAACAGGTGAGTGAG
Transcription Start
5’ UTR
Promoter
Exon
CTCCCAGCCCTGCC
Acceptor site
Stop Codon
Intron
Poly-A site
ATCCCCATGCCTGAGGGCCCCT
GCAGAAACAATAAAACCA
3’ UTR
21
Comprehensive HMM Model for
Unspliced Genes
22
Coding Region Model
23
Intron Modeling
24
Gene Prediction Approaches
• Ab initio methods:
– Profile Hidden Markov Models (GENSCAN, HMMgene)
– Neural Networks (GRAIL, Genie)
– Decision Trees (MORGAN)
• Issues:
– Seeding from training sets
– Fully general approaches?
• Interesting question:
– Can gene finding be done species-independent?
25
Gene Prediction:
Recognizing Initiation of Coding
5’ UTR
1st Exon
ATG
Kozak
Consensus
Stops in
all 3 frames
No in-frame
stops
GT
Exon
AG
Intron
Exon
26
Classifier Outline
Consensus Kozak
0 errors
1 error
ATG/UTR
Heuristic
 2 errors
ATG
L
M
CDS
R
stop ratio;
frame shift check
Stops upstream


UTR
~E(stop)
Check ORF for frame shifts
27
Classifier Heuristic Components
226 Classes
• Kozak Existence and Fidelity
• ATG Heuristic:
Template (sIFl, sl, sFl) 5len : ATG : 3len (sIFr, sr, sFr)
Ideal ( 1, 3, 3) 125 : ATG : 300 ( 0, 6, 2)
• # Stops left of candidate ATG
• CDS: # Stops in minimum frame
• UTR Heuristic
• In frame stops to All stops Ratio
• # Frame shifts needed for perfect ORF
• Not Used:
• Codon or Hexamer Frequencies.
• Known protein starting motifs.
28
Verification and Testing
• Generation of sets of known CDS “reads” (12,826)
known ATG “reads” (13,672)
known UTR “reads” (1,035)
Run Classifier against all three sets:
• Identify classes with highest CDS to ATG differential & UTR vs. CDS/ATG
• Grade A:
K0E.ATG.L.pSL.ORFr0F or 1FS
K0E.ATG.L.npSL.ORFr0FS or 1FS
K1E.ATG.L.pSL.ORF0FS or 1FS
K1E.ATG.L.npSL.ORF0FS or 1FS
KG1E.ATG.L.pSL.ORF0FS or 1FS
KG1E.ATG.L.npSL.ORF0FS or 1FS
• Grade B: Same as A, but with ATG in Middle 1/2
• Grade C: zSL for K0E only and ATG in L, M, or R
• UTR Class
29
Accuracy and Yield of Classes
•ATG True Positive (of 13,672):
•Grade A: 867 - 6.3%
•Grade B: 3,742 - 27.3%
•UTR: 82 - 0.6%
Total: 34.3% (4,691)
•CDS False Positive (of 12,826):
•Grade A: 3 - 0.02%
•Grade B: 753 - 5.9%
•UTR: 1725 - 13.5%
Total: 19.3% (2481)
•UTR True Positive (of 1,035):
•691 - 66.8%
Yield 34%  67%
Confidence 95%  87%
• Notes:
•the yield estimate is conservative due to variable fidelity of mRNA source.
30
Gene Prediction:
Finding Intron Boundaries
Consensus
Stops in all 3 frames
No in-frame stops
GT
Exon
AG
Intron
Exon
31
Simple Dicty Gene Finder
(Intuition and an Example)
• Basic Idea (G. Klein) based on GC/AT content of
Intron vs. Exons
• Idealized Example: Count G/Cs and A/Ts in a
window size of 10 bases.
6
10
10
10 AT content
<EXON>
<EXON>
…….CGCGGGCGCCGTATTTATATATTATA…..AATATTTTATATAGCCCGGCGCGGCCG…...
<INTRON>
GC content 10
Donor Site
10
6
2
Acceptor Site
Point where GC.left and AT.right are both maximized
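The idealized example can be sketched as follows: for each position, count G/C in the window to the left and A/T in the window to the right, then pick the position where the two are jointly largest. The sequence is the first exon/intron boundary from the slide; the function names and the use of the sum as a joint score are assumptions for the example.

```python
def gc_count(window):
    """Number of G or C bases in a window."""
    return sum(base in "GC" for base in window.upper())

def flanking_counts(seq, pos, w=10):
    """GC count in the w bases left of pos and AT count in the w bases right of pos.

    Assumes the sequence contains only A, C, G, and T.
    """
    left = seq[max(0, pos - w):pos]
    right = seq[pos:pos + w]
    return gc_count(left), len(right) - gc_count(right)

seq = "CGCGGGCGCCGTATTTATATATTATA"
w = 10
# Only consider positions with a full window on both sides.
best = max(range(w, len(seq) - w + 1), key=lambda p: sum(flanking_counts(seq, p, w)))
print(best, flanking_counts(seq, best, w))
```

Here the best position falls between the G and the T of the GT motif, with both counts at the window size of 10.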
32
Dicty Gene Finding Tool Model
• Model Parameters:
– W -- Window Size
– low -- threshold below which GC or AT
content does not match hypothesis
– high -- threshold above which GC or AT
content matches hypothesis
– m -- number of consecutive windows
that will be examined
– n -- number of windows out of m that
must exceed the threshold to qualify for an
intron/exon or exon/intron transition
– tol -- maximum distance from the GC/AT
content transition at which the GT or
AG motif must be found
33
Dicty Gene Finding Tool Model
W = 8, m = 4, high = 7, low = 6
1
2
3
4
5
6
G/C=7
. . .GCGGGCGCTGGGGCCGCGTATTATAGTATTTAT. . .
n=3
n=4
34
Dicty Intron/Gene
Prediction Algorithm
1. Calculate AT (GC) content in size W
windows right and left of each base position.
2. Count the windows whose AT count is above high,
and those whose AT count is below low,
among the m windows to the left and
right of each base position.
3. For each position:
ATleft,high ≥ n && ATright,low ≥ n
→ potential acceptor site
ATleft,low ≥ n && ATright,high ≥ n
→ potential donor site
35
Dicty Intron/Gene
Prediction Algorithm
(continued)
4. For each potential donor or acceptor site:
If the GT (donor) or AG (acceptor) motif is
found within tol bases distance, note this as
an intron boundary.
5. Sort boundaries into candidate introns.
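Steps 1 to 4 can be sketched roughly as below. Several details are not given on the slides and are assumptions here: the m consecutive windows slide by one base, the thresholds are inclusive (AT count >= high, AT count <= low), and the motif search simply looks for GT or AG anywhere within tol bases of the position. Step 5 (pairing boundaries into introns) is not implemented. All names and the toy sequence are hypothetical.

```python
def at_count(window):
    return sum(base in "AT" for base in window.upper())

def window_counts(seq, pos, w, m, side):
    """AT counts of up to m size-w windows on one side of pos (assumed to slide by one base)."""
    counts = []
    for k in range(m):
        if side == "left":
            start, end = pos - w - k, pos - k
        else:
            start, end = pos + k, pos + w + k
        if start >= 0 and end <= len(seq):   # skip windows that run off the sequence
            counts.append(at_count(seq[start:end]))
    return counts

def candidate_sites(seq, w, low, high, m, n, tol):
    """Return (position, kind) pairs for potential donor and acceptor sites."""
    sites = []
    for pos in range(len(seq) + 1):
        left = window_counts(seq, pos, w, m, "left")
        right = window_counts(seq, pos, w, m, "right")
        left_high = sum(c >= high for c in left)
        left_low = sum(c <= low for c in left)
        right_high = sum(c >= high for c in right)
        right_low = sum(c <= low for c in right)
        nearby = seq[max(0, pos - tol):min(len(seq), pos + tol)].upper()
        # Donor: GC-rich exon on the left, AT-rich intron on the right, GT nearby.
        if left_low >= n and right_high >= n and "GT" in nearby:
            sites.append((pos, "donor"))
        # Acceptor: AT-rich intron on the left, GC-rich exon on the right, AG nearby.
        if left_high >= n and right_low >= n and "AG" in nearby:
            sites.append((pos, "acceptor"))
    return sites

# Toy sequence: GC-rich exon, GT...AG intron of AT repeats, GC-rich exon.
toy = "GCGCGCGCGCGC" + "GT" + "ATATATATATAT" + "AG" + "GCGCGCGCGCGC"
sites = candidate_sites(toy, w=4, low=1, high=3, m=3, n=2, tol=3)
print(sites)
```

Note that neighbouring positions around each true boundary also qualify, so a real tool would need to merge or rank adjacent candidates.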
36
Test Data
>IIADP1D6358 Antiparallèle 811 bases
AAAAACCTGCTTAGGATTAATTATGAGCGAATTTTTTTTCTTTAAAACTT
CCAAAAATATTTTTTTTTTTTTTTTTTTTTAATAATTTCGGTTTGCTCAT
AGATTTTTTATTTATTTAATTAATATTTTTAATTTTTTTTTTTTTAATCC
TAAAAATAGATTTTATTTATTTTATTTAATTTTTAATTATTAAAAGATAT
GAGATTTTTAAAGTTCGGGTTAGAAATTAATTTGGGTAAAGGAACTCTTA
TTGAATTTGATGAACAgtgtacttaaatatttaattaatttttttttttt
atttgttttaagaagaagaaaaagaaaaaatatagaaatagTAAAAAACT
ATTTCCATATATTTGTTATACTCTTACACACAAGGTTATAAATTTAAAGT
gttataaataatttaaaaattttattctgtaagaaaatttgttttgaaat
tatttgattaaaaatagaaggtttttttttttattttttttttttatttt
tatttttttttattttttataatttccgcgtttgaatttgttgtgtaaat
taattttaattttttttttttttttttttttttttttttttttttttttt
ttcatttttaacatcatttgattcattaatttattttttttttcaacatc
cccaacccaaaaaaaaaaaataaaaaaaaatgataagAAATTTAACAAAA
TTAACAAAATTTACAATTGAAAATAGATTTTACCAATCCTCATCAAAAGG
AAGATTCAGTGGTAAAAATGGAAACAATGCATTCAGGGGATCTCTAGAGT
CGACCGAAGGC
•Probable Correct Introns:
+267 -341
+401 -687
37
Parameter Space to Search
• Ranges
– W -- 3 → 10 (8 values)
– high -- .7xW → W (4 values)
– low -- .5xW → .9xW (4 values)
– m -- 3 → 11 (9 values)
– n -- m/2 → m (4 values)
– tol -- 3 → 7 (5 values)
• 3584 x 5 ≈ 18,000 sets of parameters
• Search for sets that find all expected sites
with a minimum of false positives.
38
Test Data
idt t1.fasta 3 1 3 1 2 4 2 2 269 401 341 687
idt t1.fasta 3 1 3 2 2 4 2 2 269 401 341 687
idt t1.fasta 3 1 3 1 3 4 2 2 269 401 341 687
idt t1.fasta 3 1 3 2 3 4 2 2 269 401 341 687
idt t1.fasta 3 2 3 1 2 4 2 2 269 401 341 687
idt t1.fasta 3 2 3 2 2 4 2 2 269 401 341 687
idt t1.fasta 3 2 3 1 3 4 2 2 269 401 341 687
idt t1.fasta 3 2 3 2 3 4 2 2 269 401 341 687
idt t1.fasta 3 3 3 1 2 4 2 2 269 401 341 687
idt t1.fasta 3 3 3 2 2 4 2 2 269 401 341 687
idt t1.fasta 3 3 3 1 3 4 2 2 269 401 341 687
idt t1.fasta 3 3 3 2 3 4 2 2 269 401 341 687
idt t1.fasta 3 2 4 1 2 4 2 2 269 401 341 687
idt t1.fasta 3 2 4 2 2 4 2 2 269 401 341 687
idt t1.fasta 3 2 4 1 3 4 2 2 269 401 341 687
idt t1.fasta 3 2 4 2 3 4 2 2 269 401 341 687
idt t1.fasta 3 3 4 1 2 4 2 2 269 401 341 687
idt t1.fasta 3 3 4 2 2 4 2 2 269 401 341 687
idt t1.fasta 3 3 4 1 3 4 2 2 269 401 341 687
idt t1.fasta 3 3 4 2 3 4 2 2 269 401 341 687
idt t1.fasta 3 4 4 1 2 4 2 2 269 401 341 687
idt t1.fasta 3 4 4 2 2 4 2 2 269 401 341 687
idt t1.fasta 3 4 4 1 3 4 2 2 269 401 341 687
idt t1.fasta 3 4 4 2 3 4 2 2 269 401 341 687
idt t1.fasta 3 2 5 1 2 4 2 2 269 401 341 687
idt t1.fasta 3 2 5 2 2 4 2 2 269 401 341 687
. . . About 18,000 more lines like this . . .
39
Test Data Raw Results
len=811 W=3 n=1 m=3 thrL=1 thrH=2, Tol=4, Sites Found=18
Intron:
Intron:
Intron:
Intron:
Intron:
Intron:
1
2
3
4
5
6
+
+
+
+
+
+
91
236
267
385
471
799
+
+
+
-
213 - 213
241
267
399 - 399 - 467
759 - 759 - 797
799
len=811 W=3 n=1 m=3 thrL=2 thrH=2, Tol=4, Sites Found=29
Intron: 1 + 91
Intron: 2 + 219
Intron: 3 + 236
Intron: 4 + 267
Intron: 5 + 305
Intron: 6 + 341
Intron: 7 + 385
Intron: 8 + 429
Intron: 9 + 441
Intron: 10 + 471
Intron: 11 + 759
Intron: 12 + 799
+
+
-
213
223
241
267
312
341
399
433
467
753
759
799
- 213
- 335
- 399
- 797
len=811 W=3 n=1 m=3 thrL=1 thrH=3, Tol=4, Sites Found=13
Intron: 1 + 91 + 213 - 213 - 241
Intron: 2 + 267 + 399 - 399 - 467
Intron: 3 + 471 + 759 - 759 - 786
. . . About 18,000 sets of results like this. . .
40
Test Data Filtered Results
len=811 W=6 n=5 m=9 thrL=5 thrH=6, Tol=3, Sites Found=11 ALL KNOWN SITES FOUND
len=811 W=6 n=5 m=10 thrL=5 thrH=6, Tol=3, Sites Found=11 ALL KNOWN SITES FOUND
len=811 W=6 n=5 m=10 thrL=5 thrH=6, Tol=4, Sites Found=11 ALL KNOWN SITES FOUND
len=811 W=6 n=5 m=5 thrL=5 thrH=6, Tol=6, Sites Found=11 ALL KNOWN SITES FOUND
len=811 W=6 n=5 m=6 thrL=5 thrH=6, Tol=6, Sites Found=11 ALL KNOWN SITES FOUND
len=811 W=6 n=5 m=7 thrL=5 thrH=6, Tol=6, Sites Found=11 ALL KNOWN SITES FOUND
len=811 W=6 n=5 m=8 thrL=5 thrH=6, Tol=6, Sites Found=11 ALL KNOWN SITES FOUND
len=811 W=6 n=5 m=5 thrL=5 thrH=6, Tol=7, Sites Found=11 ALL KNOWN SITES FOUND
len=811 W=6 n=5 m=6 thrL=5 thrH=6, Tol=7, Sites Found=11 ALL KNOWN SITES FOUND
len=811 W=6 n=5 m=7 thrL=5 thrH=6, Tol=7, Sites Found=11 ALL KNOWN SITES FOUND
This provides an initial set of parameters that are likely to be optimal.
41
Analysis of a Known Gene
len=811 W=6 n=5 m=10 thrL=5 thrH=6, Tol=4, Sites Found=11
1: AAAAACCTGC TTAGGATTAA TTATGAGCGA ATTTTTTTTC TTTAAAACTT
51: CCAAAAATAT TTTTTTTTTT TTTTTTTTTT AATAATTTCG GTTTGCTCAT
101: AGATTTTTTA TTTATTTAAT TAATATTTTT AATTTTTTTT TTTTTAATCC
151: TAAAAATAGA TTTTATTTAT TTTATTTAAT TTTTAATTAT TAAAAGATAT
201: GAGATTTTTA AAgttcgggt tagaaattaa tttgggtaaa gGAACTCTTA
251: TTGAATTTGA TGAACAgtgt acttaaatat ttaattaatt tttttttttt
301: atttgtttta agaagaagaa aaagaaaaaa tatagaaata gTAAAAAACT
351: ATTTCCATAT ATTTGTTATA CTCTTACACA CAAGgttata aatttaaagt
401: gttataaata atttaaaaat tttattctgt aagAAAATTT GTTTTGAAAT
451: TATTTGATTA AAAATAGAAG gttttttttt ttattttttt tttttatttt
501: tatttttttt tattttttat aatttccgcg tttgaatttg ttgtgtaaat
551: taattttaat tttttttttt tttttttttt tttttttttt tttttttttt
601: ttcattttta acatcatttg attcattaat ttattttttt tttcaacatc
651: cccaacccaa aaaaaaaaaa taaaaaaaaa tgataagAAA TTTAACAAAA
701: TTAACAAAAT TTACAATTGA AAATAGATTT TACCAATCCT CATCAAAAGG
751: AAGATTCAGT GGTAAAAATG GAAACAATGC ATTCAGGGGA TCTCTAGAGT
801: CGACCGAAGG C
90% correct prediction
Intron 1: + 213 - 241
overpredicted (45 bases)
Intron 2: + 267 - 341
UNDERPREDICTED (37 BASES)
Intron 3: + 385 + 399 - 433
correct + (325 bases)
Intron 4: + 471 - 687
CORRECT - (404 BASES)
42
Analysis of Unknown Gene
• Started with 21 reads from
•Used phred to assemble them
•4 contigs found
•4th contig was longest (1759 bases)
•Used parameters from previous analysis
•Results for contig4 compared . . . . . . .
43
Contig4 Sampled Results
(a closer look)
W=6 n=5 m=5 thrL=5 thrH=6, Tol=6, Sites Found=14
Intron 1: + 54 - 401
Intron 2: + 579 - 612
Intron 3: + 711 + 782 -1113
Intron 4: +1185 +1350 -1350 -1504
Intron 5: +1628 -1709
W=6 n=5 m=5 thrL=5 thrH=6, Tol=7, Sites Found=14
Intron 1: + 54 - 401
Intron 2: + 579 - 612
Intron 3: + 711 + 782 -1113
Intron 4: +1185 +1350 -1350 -1504
Intron 5: +1628 -1709
len=1759 W=6 n=5 m=6 thrL=5 thrH=6, Tol=6, Sites Found=14
Intron 1: + 54 - 401
Intron 2: + 579 - 612
Intron 3: + 711 + 782 -1164
Intron 4: +1174 +1350 -1350 -1504
Intron 5: +1628 -1709
len=1759 W=6 n=5 m=6 thrL=5 thrH=6, Tol=7, Sites Found=14
Intron 1: + 54 - 401
Intron 2: + 579 - 612
Intron 3: + 711 + 782 -1164
Intron 4: +1174 +1350 -1350 -1504
Intron 5: +1628 -1709
len=1759 W=6 n=5 m=7 thrL=5 thrH=6, Tol=6, Sites Found=17
Intron 1: + 54 - 401
Intron 2: + 579 - 612
Intron 3: + 650 + 782 -1043
Intron 4: +1087 -1164
Intron 5: +1174 +1350 -1350 -1504
Intron 6: +1628 -1709
Intron 7: +1735
len=1759 W=6 n=5 m=7 thrL=5 thrH=6, Tol=7, Sites Found=19
Intron 1: + 54 - 401
Intron 2: + 579 - 612
Intron 3: + 650 - 683
Intron 4: + 711 + 782 -1043
Intron 5: +1087 +1164 -1164
Intron 6: +1350 -1350 -1504
Intron 7: +1628 -1709
Intron 8: +1735
len=1759 W=6 n=5 m=8 thrL=5 thrH=6, Tol=6, Sites Found=19
Intron 1: + 54 - 401
Intron 2: + 579 - 612
Intron 3: + 650 - 683
Intron 4: + 711 + 782 -1043
Intron 5: +1087 +1164 -1164
Intron 6: +1350 -1350 -1504
Intron 7: +1628 -1709
Intron 8: +1735
44
Contig4 Results
len=1759 W=6 n=5 m=10 thrL=5 thrH=6, Tol=4, Sites Found=19
1: TGATAATAAC AATAATAACA ATAATAATAA TAATAATAAT AATATTAATA
51: ATTgtaataa taataatatt aataatgata ataataataa taataataat
101: gataatgata ataataatat taatactgtt gataatcatg atgatgatat
151: tataaataat ancaataatt ttaataaaaa tgaatatcca tcaagtaata
201: tatcaccaat atctccaaaa tcttcaatat caagttttcc aacaaattta
251: aataattcaa taaataatac aggttcaatg gtttcagatt ctttaagttc
301: ttgtagaaat tcgatttcct ctagttcaat tgattcaagt gttgcttcaa
351: ttcctataac aatacaatca atagattttg aagataagaa tattaaatca
401: gACCAATTTA AAATAATATC AAAATCAAAT ATAGAAAATA CAATTGAAAC
451: AAACCCAATA CCTCCATTCA ATCAAACCAA TAACCAATGT GAAGTTCAGT
501: TACAATCACA TTCTTTACCA ACAATTTTAA AACAACCACA TATTTATAAA
551: TCAAAATCAT TTTCTAGTAG TATCAATAgt aatagtaaaa ttaaaaaaat
601: taaaaaatca agATCATTTG AAATTGAATC AAAAATTAAT TTATTTGATg
651: taattaatca tatatattta aacctttcaa aagTTGGTAG TGAAGAACAA
701: AAAATAACCA gtatgtatta aattaacaaa tgattaatat attgttgtaa
751: aaatatatga aactaattta atattttaaa ggtgttttta aattatatga
801: tatatatgat aagggtttta tttcaagaga tgatttaaaa gaagtattaa
851: attatagaac taaacaaaat gggttaaaat ttcaagactt tacaatggaa
901: agtttgattg atcacatttt tcaacaattt gataaaaata tggatggata
951: tatagatttt gaagaaattt aaagtgaatt aacaattaac actggaaatt
1001: aagttaaaga aaaaggaaga aaatccaaat tatattttta aagaagAAAA
1051: TATTGGAATA TATACCGGAA AAAGAAAGTT TTCATAgttt aaaaagatat
1101: ttaaaaattg aaggatcaaa attatttttt atatctttat tttttattat
1151: aaattcaatt ttagTAATAA CAAGTTTTTT AAATGTTCAT GCAAATAATA
1201: AAAGAGCAAT AGAATTATTT GGCCCGGGTG TATATATAAC AAGAATTGCA
1251: GCTCAATTAA TTGAATTCAA TGCAGCAATA ATTTTAATGA CAATGTGTAA
1301: ACAATTATTT ACAATGATTA GAAATACCAA ATTTAAATTT TTATTTCCAG
1351: TTGATAAATA TATGACATTT CATAAATTAA TTGGTTATAC ATTAATCATT
1401: GCTTCATTTT TACATACTAT TGGTTGGATT GTTGGTATGG CAGTTGCNCC
1451: CGGTAAACCT GATAATATTT TTTATGATTG TTTAGCACCT CATTTTAAAT
1501: TTAGGCCACC AGTTTGGGAA ATGATTTTTA ACCGTTTACC AGGTGTAACA
1551: GGTTTCACTT ACCAAAATCA TTTTTAATAA TTATGGCAAT TTTATCTTTA
1601: AAAATTATTN GGAAATCTAA TTTNGNAgta agtttttttt tttaaaaaaa
1651: aaaaaaaaaa aaaaattaat tattttttat tatataattt tatagttatt
1701: ttattatagC CATCATTTAT TTATNGGATT TTATgtttna ttaattttac
1751: atgggacaa
Intron 1: + 54 - 401
Intron 2: + 579 - 612
Intron 3: + 650 - 683
Intron 4: + 711 + 782 -1046
Intron 5: +1087 -1164
Intron 6: +1628 -1709 (poor quality)
45
Further Intron Finding Options
• Exhaustive parsing of sequence
• 400 base sequence → 50 acceptor/donors
• 20 donor/acceptors → 5 minutes on P750/.5GB
• 24 donor/acceptors → 1 day
• 30 donor/acceptors → ~year
• Hybrid solution: rank top 20 d/a sites and parse
• Use protein/predicted gene homology to edit results
46
Domain Finding with HMMs
• Basic Elements of Method
• Example from Defensin Genes
47
Antimicrobial Proteins and Peptides
Lysozyme, lactoferrin, SLPI, PLA2, SP-A, SP-D, LL37,
BPI, α- and β-defensins, inorganics, immunoglobulins
Macrophages
? Defensins
CCL20
48
T cells
DCs
Functions of defensins
• Comprise an ever-ready shield at mucosal surfaces
• Antimicrobial effects: disrupt bacterial cell walls, sequester
nutrients, act as decoys for microbial attachment, enhance
phagocytosis
• Prevent attachment, colonization or infection
• Constitutive and/or inducible expression
• Cross-talk to adaptive immune system
• Synergy or additivity among factors
• Alterations in these properties may contribute to disease
49
Genomics Approach to Defensin Gene
Discovery - Rationale
• Defensin gene discovery in humans has generally
proceeded from identification of the protein
• All known defensin genes in humans cluster to a <1
Mb region on 8p22-p23
• It is likely that not all defensin genes are known
• Hypothesis: Novel defensins in the gene cluster can be
found using a computational genomics-based strategy
50
Structure of mature β-defensin peptides
Cysteine pairings: C1-C5, C2-C4, C3-C6
GAL3      TQCRIRGGFC RVGSCRFPHI AIGKCATFIS -CCGRAYEV(+20)
DEFB3     YYCRVRGGRC AVLSCLPKEE QIGKCSTRGR KCCRRKK
DEFB1     YNCVSSGGQC LYSACPIFTK IQGTCYRGKA KCCK
BNBD12    LSCGRNGGVC IPIRCPVPMR QIGTCFGRPV KCCRSW
DEFB2     VTCLKSGAIC HPVFCPRRYK QIGTCGLPGT KCCKKP
EP2E      TICRMQQGIC RLFFCHSGEK KRDICSDPWN RCCVSNTDE(+14)
Consensus hSC+xxxGhC hhhxCPxxx+ QIGTCxxxxh +CC+
Structure labels: T1, β1, T2, T3, β2, β-loop, β-bulge, β3
51
Structure of leader sequence
of β-defensin proteins
EP2C      MRQRLLPSVTSLLLVALLFPGSS
EP2E      MKVFFLFAVLFCLVQTNSGDVPP
TAP       MRLHHLLLALLFLVLSAWSGFTQ
Defb4     MRIHYLLFTFLLVLLSPLAAFTQ
GAL1      MRIVYLLLPFILLLAQGAAGSSQ
DEFB2     MRVLYLLFSFLFIFLMPLPGVFG
DEFB1     MRTSYLLLFTLCLLLSEMASGGN
Consensus MRhxxLLhhhhhhhhhxxxxxxx
52
Genome approach for discovering
β-defensin genes
Known genes
HUMAN: DEFB1, DEFB2
MOUSE: Defb1, Defb2, Defb3, Defb4, Defb5
BLAST
HTGS: DEFB1, DEFB2, DEFB3, EP2D, DEFB4, DEFB5, DEFB6, DEFB7, DEFB8, DEFB9, EP2C, EP2D, Defbp1, DEFB10, DEFB11, DEFB12, DEFB13, DEFB14, DEFB15, DEFB16, DEFB17, DEFB18, DEFB19, DEFB20, DEFB21, DEFB22, DEFB23, DEFB24, DEFB25, DEFB26, DEFB27, DEFB28, DEFB29
Markov
Celera: Defb1, Defb2, Defb3, Defb4, Defb5, Defb6, Defb7, Defb8, Defb9, Defb10, Defb11, Defb12, Defb13, Defb14, Defb15, Defb16, Defb17, Defb18, Defb19, Defb20, Defb21, Defb22, Defb28, Defb31, Defb32, Defbp1
BACs: DEFB1, DEFB2, DEFB3, EP2D, DEFB4, DEFB5, DEFB6, DEFB7, DEFB8, DEFB9, EP2C, EP2D, Defbp1, DEFB10, DEFB11, DEFB12, DEFB13, DEFB14, DEFB15, DEFB16, DEFB17, DEFB18, DEFB19, DEFB20, DEFB21, DEFB22, DEFB23, DEFB24, DEFB25, DEFB26, DEFB27, DEFB28, DEFB29
GA-contigs: Defb1, Defb2, Defb3, Defb4, Defb5, Defb6, Defb7, Defb8, Defb9, Defb10, Defb11, Defb12, Defb13, Defb14, Defb15, Defb16, Defb17, Defb18, Defb19, Defb20, Defb21, Defb22, Defb28, Defb31, Defb32, Defbp1, Defb23, Defb24, Defb25, Defb26, Defb27, Defb29, Defb30, Defb33, Defbp2, Defbp3
36
33
53
Chromosomal localization of β-defensin genes
6p11-p21
Mouse 1
8p23
Mouse 8
20q11
Mouse 2
54
[Figure: physical map running from TEL to CEN, with scales in cR, cM, and Mb and a 10 kb scale bar; shows D8S markers, BAC clones, and the genes DEFB1, DEFA1/3, DEFA4, DEFA5, DEFA6, DEFA7, HE2/EP2, EP2C, HE2b1/EP2D, EP2E, DEFB2, and DEFB3]
55
Synteny between human 8p and mouse 8
Chromosome 8p22-p23 (human)
BAC 295j18
BAC 324n11
Chromosome 8 (mouse)
GA_x5J8B7W6WMR
56
Synteny between human 6p21 and mouse 1
Chromosome 6p21 (human)
BAC RP11-397g17
Chromosome 1 (mouse)
GA_x5J8B7W3NRM
57
Synteny between human 20q11 and mouse 2
Chromosome 20q11.1 (human)
BAC RP5-854e16
BAC RP5-1018d12
BAC RP5-1093g12
Chromosome 2 (mouse)
GA_x5J8B7W3FJ8
58
Human and Mouse β-defensin alignment – all 69 genes
EP2d _c 8
EP2e _c 8
EPm2d _c 8
DEFB5 _c 8
Defbm12_c 8
Defbm13_c 8
DEFB11 _c 6
Defbm17_c 1
DEFB12 _c 6
DEFB14 _c 6
EP2c _c 8
EPm2c _c 8
DEFB10 _c 6
Defbm16_c 1
DEFB13 _c 6
Defbm18_c 1
Defbm28
DEFB9 _c 8
DEFB27 _c20
DEFB17 _c20
Defbm19_c 2
DEFB18 _c20
Defbm21_c 2
DEFB20 _c20
DEFB4 _c 8
DEFB1 _c 8
Defbm1 _c 8
Defbm7 _c 8
Defbm8 _c 8
Defbm2 _c 8
Defbm31
Defbm9 _c 8
Defbm10_c 8
Defbm15_c 8
Defbm3 _c 8
Defbm4 _c 8
Defbm6 _c 8
DEFB2 _c 8
Defbm5 _c 8
DEFB3 _c 8
Defbm14_c 8
DEFB16 _c20
Defbm29
DEFB8 _c 8
DEFB29 _c20
Defbm23_c 2
DEFB28 _c20
Defbm20_c 2
DEFB15 _c20
Defbm32
DEFB25 _c20
Defbm26_c 2
DEFB24 _c20
Defbm25_c 2
DEFB6 _c 8
Defbm11_c 8
Defbm30
DEFB21 _c20
DEFB19 _c20
Defbm24_c 2
DEFB22 _c20
Defbm27_c 2
DEFB23 _c20
TI CRMQ--Q GICRLF-FCHSGEKKRDICSDPWNR CCVSNT
TI CRMQ--Q GICRLF-FCHSGEKKRDICSDPWNR CCVSNT
TV CLMQ--Q GHCRLF-MCRSGERKGDICSDPWNR CCVPYS
ES CKLG--R GKCRK--ECLENEKPDGNCRL-NFL CCRQRI
ET CRLG--R GKCRR--TCIESEKIAGWCKL-NFF CCRERI
FL CKKM--N GQCEA--ECFTFEQKIGTCQA-NFL CCRKRRE CRIG--N GQCKN--QCHENEIRIAYCIRPGTH CCLQQKE CKMR--R GHCKL--QCSEKELRISFCIRPGTH CC---KS CTAI--G GRCKN--QCDDSEFRISYCARPTTH CCV--DR CTKR--Y GRCKR--DCLESEKQIDICSLPRKI CC---VD CRRS--E GFCQE--YCNYMETQVGYCSKKKDA CCLH-VN CKKS--E GQCQE--YCNFMETQVGYCSKKKEP CCLH--- CEKV--R GICKT--FCDDVEYDYGYCIKWRSQ CCV--ER CEKV--R GMCKT--VCDIDEYDYGYCIRWRNQ CCI--RE CQLV--R GACKP--ECNSWEYVYYYCN--VNP CC---HK CSLV--R GTCKS--ECNSWEYKYNYCH--TEP CCVVRE
RT CFYG--L GKCRR--ICRANEKKKERCGE-RTF CCLRET
GH CLNL--S GVCRRD-VCKVVEDQIGACRR-RMK CCRAWW
KK CWNNYVQ GHCRK--ICRVNEVPEALCEN-GRY CCLNIK
KS CWII--K GHCRK--NCKPGEQVKKPCKN-GDY CCIPSN
KACWVL--R GHCRK--HCRSGERVRKPCSN-GDY CC---KK CWNR--S GHCRK--QCKDGEAVKDTCKN-LRA CCIPSN
KR CLKI--L GHCRR--HCKDGEMDHGSCKY-YRV CCVPDL
VE CW-M--D GHCRL--LCKDGEDSIIRCRN-RKR CCVPSR
RI CGYG--TARCRK--KCRSQEYRIGRCPN-TYA CCLRKYN CVSS--G GQCLYS-ACPIFTKIQGTCYRGKAK CCK--YK CLQH--G GFCLRS-SCPSNTKLQGTCKPDKPN CCKS-TR CYKF--G GFCHYN-ICPGNSRFMSNCHPENLR CCKNIK
AR CYKF--G GFCYNS-MCPPHTKF IGNCHPDHLH CCINMK
DH CHTN--G GYCVRA-ICPPSARRPGSCFPEKNP CCKYMK
--CRSW-- GTCSIAAICFDSLSRRGQCGPVKDP CCPL-ER CHKK--G GYCYF--YCFSSHKK IGSCFPEWPR CCKNIK
VS CIRN--G GICQ-Y-RCIGLRHK IGTCGSP-FK CCK--RA CYRE--G GEC--L-RCIGLFHK IGTCNFR-FK CCKFQVS CLRK--G GRCWN--RCIGNTRQ IGSCGVPFLK CCKRKIT CMTN--GAICWG--PCPTAFRQ IGNCGHFKVR CCKIRVT CMSY--G GSCQR--SCNGSFRLGGHCGHPKIR CCRRKVT CLKS--GAICHPV-FCPRRYKQ IGTCGLPGTK CCKKPVS CCMI--GGICRY--LCKGNILQNGNCGVTSLN CCKRKYY CRVR--G GRCAVL-SCLPKEEQIGKCSTRGRK CCRRKK
FF CRIR--G GRCAVL-NCLGKEEQIGRCSNSGRK CCRKKK
NP CELY--Q GMCRN--ACREYEIQYLTCPN-DQK CCLKLS
IA CELY--Q GLCRN--ACQKYEIQYLSCPK-TRK CCLKYEI CERP--N GSCRD--FCLETEIHVGRCLN-SRP CCLPLG
RR CLMG--L GRCRD--HCNVDEKEIQKCKM-KK- CCVGPK
KR CLVG--F GKCKD--SCLADETQMQHCKA-KK- CCIGPK
KK CFNK-VT GYCRK--KCKVGERYEIGCLS-GKL CCANDE
-R CFSN-VE GYCRK--KCRLVEISEMGCLH-GKY CC---RR CYYG--T GRCRK--SCKEIERKKEKCGE-KHI CCVPKE
KL CLDQ--KDTCPDSRTCLEGTQ---PCHPHHPN CCESSQK CWKN-NV GHCRR--RCLDTERYILLCRN-KLS CCISII
-K CWKN-SL GYCRV--RCQEEERYIYLCKN-KVS CCIHRT
KR CWKG--Q GACQT--YCTRQETYMHLCPD-ASL CCLSYA
KR CWNG--Q GACRT--FCTRQETFMHLCPD-ASL CCLSYS
EK CNKL--K GTCKN--NCGKNEELIALCQK-SLK CCRTIQ
EK CSRV--N GRCTA--SCLKNEELVALCQK-NLK CCVTVQ
DT CWKL--K GICRN--TCQKEEIYHIFCG-IQSL CCLEKK
MK CWGK--S GRCRT--TCKESEVYYILCKT-EAK CCVDPK
LR CMGN--S GICRA--SCKKNEQPYLYCRN-CQSCCLQSY
LQ CMGN--R GFCRS--SCKKSEQAYFYCRT-FQM CCLQSY
ET CWNF--R GSCRD--ECLKNERVYVFCVS-GKL CCLKPK
ER CWKS--F GVCRE--ECAKKESFYIFCWN-GKL CCVKPK
QR CWNL--Y GKCRY--RCSKKERVYVYCIN-NKM CCVKPK
59
ESTs provide sequence for exon 1
Chromosome 8 cluster
Gene Name
Exon 1
aa sequence (exon 2)
C pattern EST
Expression
DEFB1
MRTSYLLLFTLCLLLSEMASGGNxxxxxxxxxFLTGLGHRSDHYNCVSSGGQCLYSACPIFTKIQGTCYRGKAKCCK
6 4 9 6 ai688359,ai688522,
epithelia,ai733355,ai3
kidney
DEFB7
MKIFFFILAALILLAQIFQG xxLKTNCFLYLARTAIHRALISKRMEGHCEAE-CLTFEVKIGGCRAELAPFCCKNRKKH 21 3 9 7
no ESTs
TFPGKLPQQLFLGTGEFAVCESCKLGRGKCRKE-CLENEKPDGNCRLNFL-CCRQRI
DEFB5
6 3 9 5
no ESTs
xxxxxxxxxxxxxAKNAFFDEKCNKLKGTCKNN-CGKNEELIALCQKSLK-CCRTIQPCGSIID
DEFB6
MRTFLFLFAVLFFLTP
6 3 9 5 aw103145, lung/testis
ai910580
EST
DEFB4 (Forssman)
MQRLVLLLAISLLLYQDLPG xxxxxxxxxxYLVRSEFELDRICGYGTARCRKK-CRSQEYRIGRCPNTYA-CCLRK
6 3 9 5
no ESTs
EP2c
MRQRLLPSVTSLLLVALLFPG xxxxxxxxxxxxEPASDLKVVDCRRSEGFCQEY-CNYMETQVGYCSKKKDACCLH
6 3 9 6 1st exon aa778602,
epithelia aa400545
EP2d/HE2b1
MRQRLLPSVTSLLLVALLFPGSSxxxxxxxxxxxxxxxxxxxxTICRMQQGICRLFFCHSGEKKRDICSDPWNRCCVSNTDE 6 4 9 6 aa778602 testis
EP2e
MKVFFLFAVLFCLVQTNSGDVPPxxxxxxxxxxxxxxxxxxxxTICRMQQGICRLFFCHSGEKKRDICSDPWNRCCVSNTDE 6 4 9 6 2nd exon aa176631,
testis
be044355, ai
DEFB3
MRIHYLLFALLFLFLVPVPG xxxxxxxxxGHGGIINTLQKYYCRVRGGRCAVLSCLPKEEQIGKCSTRGRKCCRRKK
6 4 9 6
epithelia
DEFB2
MRVLYLLFSFLFIFLMPLPG xxxxxxxxxxxGVFGGIGDPVTCLKSGAICHPVFCPRRYKQIGTCGLPGTKCCKKP
6 4 9 6 bf08889, bf088086,
epithelia,be714509
head/neck
xxxxxxxxxxxxxxxxxxxxxxxxx*RCVCVLNVCSTSLKQIGTYGHDRIKCCKK
DEFBp1
YLLFSFRFVFLMPLP
pseudogene
xxxxxxxxxxxLHVAKGKFKEICERPNGSCRDF-CLETEIHVGRCLNSRP-CCLPLGHQPRIESTTPKKD
DEFB8
6 3 9 5 aa406058 possible EST testis
xxxxxxxxxxxxxGGLGPAEGHCLNLSGVCRRDVCKVVEDQIGACRRRMK-CCRAWWILMSIPTPLIMSDYQEPLKPKLK
DEFB9
6 4 9 5 aw383156 head_neck
Chromosome 6 cluster
Gene Name
Exon 1
DEFB10
DEFB11
DEFB12
DEFB13
DEFB14
Chromosome 20 Cluster
Gene Name
Exon 1
DEFB15
DEFB16
DEFB17
DEFB18
DEFB19
DEFB20
MKLLLLALPMLVSYPKZSQ
MKLLYLFLAILLAIEEPVIS
May share with 20-5
DEFB21
DEFB22
DEFB23
maybe 3 exons
MKLLLLTLTVLLLLSQLTP
DEFB24
DEFB25
DEFB26
DEFB27
MKSLLFTLAVFMLLAQLVS
MGLFMIIAILLFQKPT
DEFB28
DEFB29
MKLLFPIFASLMLQYQVNT
aa sequence (exon 2)
xxxxxxxxxxxxxxxxxxxFERCEKVRGICKTF-CDDVEYDYGYCIKWRSQCCV
xxxxxxxxxxxxxxxxxDLRRECRIGNGQCKNQ-CHENEIRIAYCIRPGTHCCLQQ
xxxxxxxxxxxxxxxxxxxxWKSCTAIGGRCKNQ-CDDSEFRISYCARPTTHCCVTECDP
xxxxxxxxxxxxxxxxxxxKRECQLVRGACKPE-CNSWEYVYYYCNVNP--CCAVWE
xxxxxxxxxxxxxTCTLVNADRCTKRYGRCKRD-CLESEKQIDICSLPRKICCTEKL
C
6
6
6
6
6
pattern
3 9 6 no
3 9 6 no
3 9 6 no
3 9 4 no
3 9 6 no
EST
ESTs
ESTs
ESTs
ESTs
ESTs
Exon 2
xxxxxxxxxxxGWIRRCYYGTGRCRKSCKEIERKKEKCGEKHICCVPKEKDKLSHIHDQKETSELYI
6 3 9 5 no ESTs
GLFRSHNGKSREPWNPCELYQGMCRNACREYEIQYLTCPNDQKCCLKLSVKITSSKNVKEDYDSNSNLSVTNSSSYSHI
6 3 9 5 no ESTs
xxxxxxxxxxxxSQKSCWIIKGHCRKNCKPGEQVKKPCKNGDYCCIPSNTDS
6 3 9 5 no ESTs
xxxxxxxxxxxxGEKKCWNRSGHCRKQCKDGEAVKDTCKNLRACCIPSNEDHRRVPATSPTPLSDSTPGIIDDILTVRFTTDYFEVSSKKDMVEESEAGRGT
6 3 9 5 AA335178, Epididymis,
AI220434
Pooled NFL
xxxxxxxxxxKRHILRCMGNSGICRASCKKNEQPYLYCRNCQSCCLQSYMRISISGKEENTDWSYEKQWPRLP
6 3 9 5 AA939044, Pooled
AW193716,
NFL,AIPooled
807541,
Ger
xxxxxxxxxxxxVKSVECWMDGHCRLLCKDGEDSIIRCRNRKRCCVPSRYLTIQPVTIHGILGWTTPQMSTTAPKMKTNITNR
5 3 9 5 AW070283, Pooled
AA834919,
NFL,H92063
Testis, Re
xxxxxxxxxxxxxxMKCWGKSGRCRTTCKESEVYYILCKTEAKCCVDPKYVPVKPKLTDTNTSLESTSAV
6 3 9 5 AI476463 Pooled NFL
xxxxxxxxxxxxRIETCWNFRGSCRDECLKNERVYVFCVSGKLCCLKPKDQPHLPQHIKN 6 3 9 5 AI989655, Pooled
AW236570
Germ Cell Tumor
xxxxxxxxxxxxGTQRCWNLYGKCRYRCSKKERVYVYCINNKMCCVKPKYQPKERWWPF 6 3 9 5 AA933749, Pooled
AA970840,
GermBF08527,
Cell Tumor
AA
xxxxxxxxxxxxEFKRCWKGQGACQTYCTRQETYMHLCPDASLCCLSYALKPPPVPKHEYE6 3 9 5 no ESTs
xxxxxxxxxxFEPQKCWKNNVGHCRRRCLDTERYILLCRNKLSCCISIISHEYTRRPAFPVIHLEDITLDYSDVDSFTGSPVSMLNDLITFDTTKFGETMTP
7 3 9 5 AA935636(not
Pooled
cysteine
NFL domain)
xxxxxxxxxxNWYVKKCLNDVGICKKKCKPEEMHVKNGWAMCGKQRDCCVPADRRANYPVFCVQTKTTRISTVTATTATTTLMMTTASMSSMAPTPVSPTG
6 3 13 5 AA994981, Testis,
AA846419,
Pooled
AA453384,
NFL A
xxxxxxxxTEQLKKCWNNYVQGHCRKICRVNEVPEALCENGRYCCLNIKELEACKKITKPPRPKPATLALTLQDYVTIIENFPSLKTQST
8 3 9 5 AI694319, Testis,
AA812652,
Pooled
AA454191,
NFL A
60
xxxxxxxxxxxxLKKCFNKVTGYCRKKCKVGERYEIGCLSGKLCCANDEEEKKHVSFKKPHQHSGEKLSVLQDYIILPTITIFTV
7 3 9 5 no ESTs
xxxxxxxxxxxxxxRRCLMGLGRCRDHCNVDEKEIQKC-KMKKCCVGPKVVKLIKNYLQYGTPNVLNEDVQEMLKPAKNSSAVIQRKHILSVLPQIKSTSFF
6 3 9 4 AA401404, Testis,
AA446332,
Pooled
AA399988,
NFL A
– Summary –
Gene Discovery with HMMs
• Increased number of defensin genes in mouse and
human from 7 to 69
• Genomic searches based solely on BLAST may miss
genes related by tertiary structure
• Hidden Markov Tool is a more reliable approach for
identifying gene families related by tertiary structure
61
“Curing” Disease and
Finding New Treatments
I. “Curing” disease
– know the disease-causing gene(s)
– diagnose with genetic test (before onset)
– preempt entire disease with intervention (therapy or
lifestyle advice)
II. Finding new treatments of disease
– know the gene(s)
– understand the biological pathway like never before
a. identify existing drug candidates that interact
b. precisely design a new drug from a molecular basis
62
“Curing” Disease and
Finding New Treatments
• After all the analysis and data visualization….
– Make some decisions:
• 1. Is this a (strongly) genetic phenomenon?
• 2. Is/are there regulating “known” gene(s)?
• 3. Can they be prioritized for further study?
• Can the pathway be deduced or refined?
• Are there existing related products/drugs?
• BUT, where do we obtain candidate “targets”?….
63