RBP1 Splicing Regulation in Drosophila melanogaster
03-711 - Fall 2005
Jacob Joseph, Ahmet Bakan, Amina Abdulla
This presentation is available at http://www.jjoseph.org/biology/
Alternative Splicing in Dros.
RBP1 Regulation


Involved in dsx splicing and Rbp1 auto-regulation
Suspected in many other related pathways
Genome Data


Sequences of all introns of known splice variants
Two annotated genomes available:
 D. melanogaster
 D. pseudoobscura
As the gene names for D. Mel. and D. Pseu. differ, a list of gene orthologs was also obtained
Computational Approach







Create profile HMM for each motif (B-B, B-A)
Select the end of every intron (~50 bases)
Perform an HMM search for each intron
segment, in both D. Mel. and D. Pseu.
Keep matches found in both species
Keep matches at the end of introns (~15 bases)
Return alignment of both species
Examine biological similarity of matches
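The filtering steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the `Hit` record, the function names, and the ortholog dictionary are hypothetical, hits are assumed to be given as 0-based coordinates inside the 50-base intron tail, and the motif search itself (done with HMMer in the project) is not reproduced here.

```python
from dataclasses import dataclass

SEGMENT_LEN = 50   # bases kept from the 3' end of each intron (slide: ~50)
END_WINDOW = 15    # a hit must end within this many bases of the 3' end (slide: ~15)


@dataclass(frozen=True)
class Hit:
    """Hypothetical record for one motif match inside an intron tail."""
    gene: str      # gene the intron belongs to
    start: int     # 0-based start of the match within the segment
    end: int       # 0-based exclusive end of the match within the segment
    score: float   # score reported by the motif search


def intron_tail(intron: str, length: int = SEGMENT_LEN) -> str:
    """Return the last `length` bases of an intron (the whole intron if shorter)."""
    return intron[-length:]


def near_splice_site(hit: Hit, segment_len: int = SEGMENT_LEN,
                     window: int = END_WINDOW) -> bool:
    """True if the match ends within `window` bases of the segment's 3' end."""
    return segment_len - hit.end <= window


def conserved_hits(mel_hits, pse_hits, orthologs):
    """Pair D. Mel. hits with D. Pseu. hits in the orthologous gene.

    Only hits near the 3' end of the intron tail are kept, in both species.
    `orthologs` maps a D. Mel. gene name to its D. Pseu. ortholog.
    """
    pse_by_gene = {}
    for hit in pse_hits:
        if near_splice_site(hit):
            pse_by_gene.setdefault(hit.gene, []).append(hit)

    pairs = []
    for hit in mel_hits:
        if not near_splice_site(hit):
            continue
        for other in pse_by_gene.get(orthologs.get(hit.gene), []):
            pairs.append((hit, other))
    return pairs
```

Each returned pair would then be aligned and inspected by hand, as in the last two steps of the list.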
Data Summary
Profile Hidden Markov Models (HMM) and HMMer

We needed a profile HMM builder and search program.
HMMer uses a revised version of the Krogh/Haussler model, called Plan 7

Supports more than global alignment
HMMer Advantages

Possible Alignments





Classic global alignment
Classic local alignment
Global Profile, Local Sequence alignment
Fully local “multihit” alignment
Scoring


Raw alignment score
E-value, showing the significance of the alignment
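The slides do not say how the E-value is derived from the raw score. A minimal sketch, assuming the scores of random sequences follow an extreme value (Gumbel) distribution with fitted location `mu` and scale `lam`: the E-value is then the number of hits expected by chance at or above a given score among `n_targets` searched segments. The parameter names and numeric values here are illustrative assumptions, not values from the project.

```python
import math


def gumbel_pvalue(score: float, mu: float, lam: float) -> float:
    """P(S >= score) for a Gumbel distribution with location mu and scale lam."""
    exponent = -lam * (score - mu)
    if exponent > 700.0:
        # exp() would overflow; the probability is 1 to machine precision.
        return 1.0
    # 1 - exp(-x), computed with expm1 to stay accurate for very small x.
    return -math.expm1(-math.exp(exponent))


def evalue(score: float, mu: float, lam: float, n_targets: int) -> float:
    """Expected number of chance hits scoring at least `score` in n_targets segments."""
    return n_targets * gumbel_pvalue(score, mu, lam)
```

A lower E-value means a more significant match; an E-value near `n_targets` means the score is what a random segment would typically reach.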
HMMer



Create an HMM from the multiple alignment of each motif (B-B and B-A)
The genome is scanned for high-scoring matches
Only hits within 15 base pairs of the 3’ splice site are considered
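To make the idea of building a profile from a multiple alignment and scanning for high-scoring matches concrete, here is a deliberately simplified stand-in: a position-specific log-odds matrix with match columns only. It is not the Plan 7 model and not HMMer's algorithm (there are no insert or delete states), and the function names, the pseudocount, and the uniform background are assumptions made for illustration.

```python
import math

ALPHABET = "ACGT"


def build_profile(aligned, pseudocount=1.0, background=0.25):
    """Build per-column log-odds scores from equal-length, gap-free motif instances."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all motif instances must have the same length")
    profile = []
    for col in range(length):
        counts = {base: pseudocount for base in ALPHABET}
        for seq in aligned:
            counts[seq[col].upper()] += 1
        total = sum(counts.values())
        profile.append(
            {base: math.log2(counts[base] / total / background) for base in ALPHABET}
        )
    return profile


def best_hit(profile, sequence):
    """Return (start, score) of the best-scoring window, or None if the sequence is too short."""
    sequence = sequence.upper()
    width = len(profile)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Bases outside ACGT (e.g. N) contribute a neutral score of 0.
        score = sum(col.get(base, 0.0) for col, base in zip(profile, window))
        if best is None or score > best[1]:
            best = (start, score)
    return best
```

For example, a profile built from a few toy motif instances can be scanned over a 50-base intron tail, and the returned start position compared against the 15-base window before the splice site.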
Results: B-A Motif
CG30271-RC-in_5 (27 - 39), GA15740-in_5 (27 - 39)
score: -6
ctgttgaatcacttggaaagcaatcaGTCGACAATTGTTtacttttacag
cctttgaatcactcggaaagcaatcaGTCGACAATTGTTtacttttacag

CG30020-RA-in_3 (25 - 37), GA15581-in_9 (24 - 36)
score: -8
ccgtcccagtgacttacaatacgaTTCTACTATTTTTtgtacgcttacag
taaggctcttcatactttatcaaATCTACAATTTCTcaatgtaattgcag

Klp3A-RA-in_3 (31 - 43), GA21186-in_3 (26 - 38)
score: -9
ttgaagttcgaaaactcctgaaactaattgTTCCACAATTTTTttttatt
tgttcaattcttaaataaaaccaatTTCGACTCTTTTTctcttctttcag

na-RB-in_0 (33 - 45), GA13546-in_2 (25 - 37)
score: -9
tctggtgcactgagagaaatgccatctacttcATCGATACTCTTTtgcag
tgtaaacactcgttgcaaacacaaATTTACAATCAATttccatgttttat

CG30428-RA-in_2 (33 - 45), GA15840-in_1 (25 - 37)
score: -9
ggtaaggaagcgtaaaaataaattctttttttATCACCAATATTTttcag
aaaatatcaagccgaaacaaatttATGTACAATTTTTtttttatggaaag

CG2199-RB-in_0 (36 - 48), GA15296-in_0 (33 - 45)
score: -10
ttgctactgccattataggtagtttaaaaactgttTTCTACACTCTTTct
aacaaaaacaaaaatatggccctctgataattGGGGACACTTTATttcag
Results: B-B Motif
ps-RD-in_4 (31 - 42), GA20847-in_4 (31 - 42)
score: -11
catttaatatcttgaaaatatttaacataaATCTGATGCAAAtattccag
attactattcttaaaatatatttaacataaATCTGATGCAAAtattccag

fru-RE-in_6 (26 - 37), GA12896-in_5 (24 - 35)
score: -13
cccacccccacagtgatgacgcctaATATGAACCAAGcaaatgtttgcag
tgctaaataaaccaaattccaaaCTCTGATCAAAAaataccgataaaaag

Ptp52F-RA-in_0 (38 - 49), GA14851-in_14 (34 - 45)
score: -13
tactctttgaaaaataagcatatggatgtcactgataATATGATATTAAt
tctaaatcgtattcaaatcgaattgaaacataaATCGAATCCAAAaacag

CG9455-RA-in_0 (32 - 43), GA21800-in_0 (27 - 38)
score: -13
aatagtggctttgttttaataacaatgtaatATCTGATATTTAttctcag
cagagcgtgccccgtctgatgatccgAACTGATCTGATgtttttcggtag

CG8709-RA-in_2 (34 - 45), GA21271-in_9 (34 - 45)
score: -13
acaaatcttaggaaataccaaagttgttctacgATCTTATCTATGgagtc
gccccatcagtgtcagtggcagctgaccccaccATTTGATCTATTtgcag

CG7966-RA-in_0 (37 - 48), GA20727-in_4 (26 - 37)
score: -13
tatatgtacacattgtactgcaaacacatgccctgaATCTTTGATAAAga
gtgttgaatgaaagaatacacttgaATCGGTTCTAAAttgcatcgcacag
Biomolecular Activity: B-A
Biomolecular Activity: B-B
Biomolecular activity analysis





The fru gene, which is regulated by the tra and tra2 genes, is expressed at the same time as the dsx gene; its presence among the matches helps validate our results.
Expected presence of sxl and tra genes.
Functional similarity:
B-A motif: SNF4Agamma, rdgc, qtc.
B-B motif: ps, ptp, CG9455.
Difficulties & Future Directions




Support Vector Machines were applied
Lack of significant training data.
Lack of direct experimental data for cross-validation.
Since the current D. Pseu. genome has far fewer intron sequences, reliance upon orthologs introduces many false negatives.
Alternate Approach:
Support Vector Machines (SVM)




Used for data classification
Creates hyperplanes that separate data into two classes with maximum margin
Appropriate for multidimensional classification problems
Examples



Article classification
Protein classification
Critical points


Feature selection
Training
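The maximum-margin idea on this slide can be written down compactly. The following is the standard hard-margin formulation for linearly separable data, given here for reference rather than as the exact variant used in the project; the symbols (feature vectors x_i, labels y_i in {-1, +1}, weight vector w, offset b) are notation introduced for this sketch.

```latex
% Separating hyperplane:  w^\top x + b = 0
% Training data:          (x_i, y_i),  y_i \in \{-1, +1\},  i = 1, \dots, n
\begin{aligned}
\min_{w,\,b}\quad & \tfrac{1}{2}\,\lVert w \rVert^{2} \\
\text{subject to}\quad & y_i \left( w^\top x_i + b \right) \ge 1, \qquad i = 1, \dots, n
\end{aligned}
% The margin between the two classes is 2 / \lVert w \rVert,
% so minimizing \lVert w \rVert maximizes the margin.
```

A new point x is then assigned to a class by the sign of w^T x + b.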
HMM and SVM




HMMer is used to generate features
The whole genome is searched for the A and B consensus sequences
Search results for each intron are combined to create features
Features






Scores of the two motifs in the upstream region (2)
Distance of the motifs to the splice site (1)
Length of consensus sequence overlap (1)
Length of motif (1)
Does consensus sequence B precede A? (1)
Number of features = 6
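As a rough illustration of how the six features might be assembled for one intron, the sketch below builds a feature vector from one hit for consensus A and one for consensus B. The slide does not define the features precisely, so several choices here are assumptions: the `MotifHit` record and its coordinates are hypothetical, "distance to the splice site" is taken from the end of the later hit, and "length of motif" is taken as the combined span of both hits.

```python
from typing import NamedTuple


class MotifHit(NamedTuple):
    """Hypothetical record for one consensus match inside an intron segment."""
    start: int     # 0-based start within the segment
    end: int       # 0-based exclusive end within the segment
    score: float   # score reported by the motif search


def intron_features(hit_a: MotifHit, hit_b: MotifHit, segment_len: int) -> list:
    """Return a 6-element feature vector for one intron segment."""
    first_start = min(hit_a.start, hit_b.start)
    last_end = max(hit_a.end, hit_b.end)
    overlap = max(0, min(hit_a.end, hit_b.end) - max(hit_a.start, hit_b.start))
    return [
        hit_a.score,                              # score of consensus A
        hit_b.score,                              # score of consensus B
        float(segment_len - last_end),            # distance to the 3' splice site
        float(overlap),                           # length of A/B overlap
        float(last_end - first_start),            # length of the combined motif
        1.0 if hit_b.start < hit_a.start else 0.0,  # does B precede A?
    ]
```

Vectors of this shape, one per intron, are what a two-class SVM would be trained on.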
Summary




Profile HMMs were used to model the motifs
Comparative analysis with the D. Pseu. genome
High-scoring alignments for both motifs were further analyzed for biomolecular activity
The presence of fru and other close matches helps to validate our results