Soft Computing Tools for Gene Matching in Bioinformatics

Set Similarity Measures for Gene Matching
Mihail Popescu#, James Keller+, Joyce Mitchell#
# Department of Health Management and Informatics; + Department of Electrical and Computer Engineering;
University of Missouri-Columbia, Columbia, MO 65211
Example of Similarity Calculation for the
Gene Ontology (GO) Dimension
Why Similarity Measures?
• For a unified clustering approach in a 4D gene space
• Gene space dimensions (4D): sequence, microarray expression,
literature abstracts (articles), gene ontology (GO)
• Two dimensions are numeric (sequence, expression) and two
symbolic
• s(ATM, STK11)=? (GO dimension)
• Algorithm:
•1. Retrieve LocusLink GO annotations:
• Dice, Jaccard: do not consider the weight of the elements
• Maximum and average usually overestimate and underestimate the
similarity, respectively
• Example: ATM (human ataxia telangiectasia mutated) and
STK11 (serine/threonine kinase 11). The geneticist assessed
these two genes as quasi-similar (similarity ~0.5) because:
• they both have protein serine/threonine kinase enzyme
activity (they share a kinase domain)
[Figure: 4D gene space with axes Sequence, Expression, Abstracts and GO annotations. ATM and STK11 are shown as two gene records, each with a sequence (ACAC..., CCAT...), an expression value (195, 300), a set of abstracts (abstract11 ... abstract1n; abstract21 ... abstract2n) and a set of GO annotations (term11 ... term1m; term21 ... term2m). Expert-assessed similarity: ~0.5.]
Rows: ATM GO terms; columns: STK11 GO terms. Cell format: Resnik value (depth) [normalized].

ATM \ STK11   5524               4674                6468               16740
4674          1.12(0.67)[0.1]    4.93(1)[0.44]       0                  3.69(0.975)[0.33]
3677          2.21(0.89)[0.2]    1.12(0.67)[0.1]     0                  1.12(0.67)[0.1]
4428          1.12(0.67)[0.1]    4.3(0.986)[0.38]    0                  3.69(0.975)[0.33]
7131          0                  0                   2.12(0.88)[0.19]   0
6281          0                  0                   2.12(0.88)[0.19]   0
7165          0                  0                   0.86(0.58)[0.08]   0
5634          0                  0                   0                  0
16740         1.12(0.67)[0.1]    3.69(0.975)[0.33]   0                  3.69(1)[0.33]
45786         0                  0                   1.33(0.74)[0.12]   0
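As a rough illustration of the pair-based aggregations, the sketch below applies the average, the maximum and an OWA operator to the normalized pair-wise values (the bracketed numbers) for the ATM and STK11 GO terms. The function names and the OWA weights are assumptions made only for this example; the poster does not state which OWA weights were used, so the OWA output here is not meant to reproduce its result.

```python
# Normalized pair-wise values: rows are ATM terms, columns are STK11 terms
# (5524, 4674, 6468, 16740).
NORMALIZED = [
    [0.10, 0.44, 0.00, 0.33],  # 4674
    [0.20, 0.10, 0.00, 0.10],  # 3677
    [0.10, 0.38, 0.00, 0.33],  # 4428
    [0.00, 0.00, 0.19, 0.00],  # 7131
    [0.00, 0.00, 0.19, 0.00],  # 6281
    [0.00, 0.00, 0.08, 0.00],  # 7165
    [0.00, 0.00, 0.00, 0.00],  # 5634
    [0.10, 0.33, 0.00, 0.33],  # 16740
    [0.00, 0.00, 0.12, 0.00],  # 45786
]


def flatten(matrix):
    """Return all pair-wise values as one flat list."""
    return [value for row in matrix for value in row]


def average_similarity(matrix):
    """Mean of all pair-wise values."""
    values = flatten(matrix)
    return sum(values) / len(values)


def maximum_similarity(matrix):
    """Largest pair-wise value."""
    return max(flatten(matrix))


def owa_similarity(matrix, weights):
    """Ordered weighted average: weights[0] applies to the largest value,
    weights[1] to the second largest, and so on. Missing weights count as 0."""
    values = sorted(flatten(matrix), reverse=True)
    if len(weights) > len(values):
        raise ValueError("more weights than values")
    return sum(w * v for w, v in zip(weights, values))


# Hypothetical OWA weights that emphasize the four best-matching pairs.
EXAMPLE_WEIGHTS = [0.4, 0.3, 0.2, 0.1]

if __name__ == "__main__":
    print("average:", average_similarity(NORMALIZED))
    print("maximum:", maximum_similarity(NORMALIZED))
    print("OWA:", owa_similarity(NORMALIZED, EXAMPLE_WEIGHTS))
```

With these values the average comes out at about 0.095 and the maximum at 0.44, which shows how the plain average is pulled down by the many zero entries while the maximum depends on a single pair.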
Possible similarity measures
• Expression dimension: real-valued measures (Euclidean measure, etc.).
• Sequence dimension: sequence similarity measures (Smith-Waterman,
Needleman-Wunsch, etc.)
Measure                 s(ATM, STK11)
Jaccard                 0.18
Average (normalized)    0.09
Maximum (normalized)    0.44
OWA (normalized)        0.37
Example of Similarity Calculation for the
Retrieved Abstracts Dimension
• GO and Abstract dimension: set similarity.
Set similarity measures
• Set similarity: given two gene products, G1 and G2, we can
consider them as being represented by collections of terms:
$G_1 = \{T_{11}, \dots, T_{1i}, \dots, T_{1n}\}$
$G_2 = \{T_{21}, \dots, T_{2j}, \dots, T_{2m}\}$
Based on the two sets, the goal is to define a natural similarity
between G1 and G2, denoted $s(G_1, G_2)$.
• Two types of set similarity:
• element based (Dice, Jaccard, Cosine, fuzzy measure)
• pair of elements based (Maximum, Average, OWA, Choquet)
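To make the element-based measures concrete, here is a minimal sketch of the Jaccard and Dice coefficients applied to the GO term identifiers listed in this poster for ATM and STK11. The function names are illustrative, and the term sets are treated as plain unweighted sets, which is exactly the limitation noted for these two measures.

```python
# GO term identifiers annotated to each gene, as listed in the poster.
ATM = {4674, 3677, 4428, 7131, 6281, 7165, 5634, 16740, 45786}
STK11 = {5524, 4674, 6468, 16740}


def jaccard(a, b):
    """|A intersect B| / |A union B|; returns 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def dice(a, b):
    """2 * |A intersect B| / (|A| + |B|); returns 0.0 when both sets are empty."""
    total = len(a) + len(b)
    if total == 0:
        return 0.0
    return 2 * len(a & b) / total


if __name__ == "__main__":
    # The two genes share the terms 4674 and 16740.
    print("Jaccard:", jaccard(ATM, STK11))  # 2/11, about 0.18
    print("Dice:", dice(ATM, STK11))        # 4/13, about 0.31
```

Both coefficients count only how many terms are shared, so a very specific shared term such as 4674 weighs no more than a generic one such as 16740.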
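[Figure: abstracts of Gene 1 (Abstract 1 with confidence g(A11) and term set {T11i}; Abstract 2 with confidence g(A12) and term set {T12i}) compared with abstracts of Gene 2 (term sets {T21i}, {T22i}); each pair of abstracts has a similarity s(A11,A21) and a confidence c(A11,A21).]
Abstracts (PubMed ID - journal (impact factor)):
12183403 - Cancer Res (8.30)
12234250 - Biochem J (4.326)
12805220 - EMBO J. (12.459)
11853558 - Biochem J (4.326)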
0.10  0.10  0.09  0.35
0.29  0.29  0.27  1.00
0.10  0.10  0.09  0.35

0.44  0.0   0.00  0.00
0.07  0.29  0.1   0.11
0.00  0.13  0.26  0.32
0.00  0.20  0.16  0.24
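For the abstract dimension, the confidence of a pair of abstracts is described as the product of the two journal impact factors, normalized by the maximum. The sketch below illustrates that step and a confidence-weighted average. The impact factors are the ones listed in this poster, but the way they are paired, the similarity values and the function names are hypothetical, chosen only so the example runs; the output is not meant to reproduce the poster's numbers.

```python
def pair_confidences(impact_factor_pairs):
    """Confidence of each pair, g(A1, A2) = g(A1) * g(A2),
    divided by the largest product so that the best pair gets 1.0."""
    raw = [g1 * g2 for g1, g2 in impact_factor_pairs]
    largest = max(raw)
    return [value / largest for value in raw]


def weighted_average(similarities, weights):
    """Average of the pair similarities, weighted by pair confidence.
    Dividing by the sum of the weights is an assumption of this sketch."""
    return sum(s * w for s, w in zip(similarities, weights)) / sum(weights)


# Hypothetical pairing of the listed journal impact factors.
PAIRS = [
    (8.30, 23.329),   # Cancer Res, Science
    (4.326, 6.373),   # Biochem J, Nucleic Acids Res.
    (12.459, 6.737),  # EMBO J., Oncogene
    (4.326, 6.737),   # Biochem J, Oncogene
]

# Hypothetical similarities of the four best-matching pairs.
SIMILARITIES = [0.44, 0.29, 0.26, 0.24]

if __name__ == "__main__":
    confidences = pair_confidences(PAIRS)
    print("confidences:", confidences)
    print("weighted average:", weighted_average(SIMILARITIES, confidences))
```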
• Similarity calculation:
•Using weighted average: s(ATM, STK11)=0.37
•Using Choquet integral: s(ATM, STK11)=0.53
• For the GO dimension, the best method of assigning densities
was normalizing the information content [4] by the maximum value
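The normalization mentioned above can be sketched as follows. Resnik-style information content is the negative logarithm of the probability of a term (or any of its descendants) occurring in an annotation corpus; dividing by the largest value maps the densities into [0, 1]. The term probabilities and names below are hypothetical placeholders, not values taken from the Gene Ontology.

```python
import math


def information_content(probability):
    """Information content of a term: -log p(term)."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log(probability)


def normalized_densities(term_probabilities):
    """Map each term to its information content divided by the maximum."""
    content = {term: information_content(p) for term, p in term_probabilities.items()}
    largest = max(content.values())
    if largest == 0.0:
        return {term: 0.0 for term in content}
    return {term: value / largest for term, value in content.items()}


# Hypothetical probabilities: the root term occurs in every annotation,
# more specific terms occur less often.
EXAMPLE_PROBABILITIES = {
    "root": 1.0,
    "generic term": 0.3,
    "specific term": 0.01,
}

if __name__ == "__main__":
    for term, density in normalized_densities(EXAMPLE_PROBABILITIES).items():
        print(term, round(density, 3))
```

Because the densities are ratios of logarithms, the choice of logarithm base does not affect the normalized values.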
Jaccard   Dice   FMS (normalized)   FMS (depth)
0.36
STK11
Conclusions
•3. Compute the similarity:
0.31 0.64
14499692 - Science (23.329)
s(Ak) FMS
5524
16740
14500819 - Nucleic Acids Res. (6.373)
• The pair-wise similarity values calculated using FMS are:
•2. Compute GO term densities using the Resnik formula [4], the
normalized version [.] or the depth in the hierarchy (.)
4D Gene space
12970738 - Oncogene (6.737)
0.19
0.19
i
g   0.18
0.67
• They both cause cancers when mutated, including breast
cancer.
12917635 - Oncogene (6.737)
• Calculate the confidence of the pair g(A1, A2) = g(A1)*g(A2) and
normalize using the maximum value.
• ATM={4674: “protein serine/threonine kinase activity”,
3677: “DNA binding”,
4428: “inositol/phosphatidylinositol kinase activity”,
7131: “meiotic recombination”,
6281: “DNA repair”,
7165: “signal transduction”,
5634: “nucleus”,
16740: “transferase activity”,
45786: “negative regulation of cell cycle”}
• STK11={5524: “ATP binding”,
4674: “protein serine/threonine kinase activity”,
6468: “protein amino acid phosphorylation”,
16740: “transferase activity”}
• The existing symbolic measures are not adequate.
• The proposed fuzzy similarity measure (FMS) agrees better with
our intuition of similarity: if the common elements have a high
confidence, then the similarity is stronger. In addition, the
non-common terms also contribute to the similarity, since the
measure is computed a priori for each term set.
• The Choquet similarity measure is much more general, depending
only on the fuzzy measure. In addition, the optimal fuzzy measure
can be learned from examples.
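The dependence of the Choquet similarity on the fuzzy measure can be illustrated with a minimal sketch of the discrete Choquet integral. The helper names and the example values are assumptions for illustration only; the fuzzy measure actually used in the poster is not given. With an additive measure the integral reduces to a weighted average, and with a measure that assigns 1 to every non-empty subset it reduces to the maximum.

```python
from typing import Callable, FrozenSet, Sequence


def choquet_integral(values: Sequence[float],
                     measure: Callable[[FrozenSet[int]], float]) -> float:
    """Discrete Choquet integral of `values` with respect to `measure`.

    `measure` maps a set of indices to a number in [0, 1]; it should be
    monotone, give 0 for the empty set and 1 for the full index set.
    """
    # Visit the values from largest to smallest.
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    total = 0.0
    previous = 0.0
    included = set()
    for index in order:
        included.add(index)
        current = measure(frozenset(included))
        # Each value is weighted by the gain in measure it brings.
        total += values[index] * (current - previous)
        previous = current
    return total


def additive_measure(weights: Sequence[float]) -> Callable[[FrozenSet[int]], float]:
    """Measure of a subset = sum of its element weights (weights sum to 1)."""
    return lambda subset: sum(weights[i] for i in subset)


def max_measure(subset: FrozenSet[int]) -> float:
    """Measure that is 1 for any non-empty subset; yields the maximum."""
    return 1.0 if subset else 0.0


if __name__ == "__main__":
    pair_similarities = [0.2, 0.5, 0.3]  # hypothetical pair similarities
    weights = [0.5, 0.25, 0.25]          # hypothetical densities
    print(choquet_integral(pair_similarities, additive_measure(weights)))
    print(choquet_integral(pair_similarities, max_measure))
```

Measures between these two extremes let the result fall anywhere between the weighted average and the maximum, which is the flexibility the conclusion refers to.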
• s(ATM, STK11)=? (Abstract dimension)
• Algorithm:
• Retrieve PubMed abstracts for ATM, STK11
• Calculate all the pair-wise distances based on the MeSH indexing
• Keep the 4 best-matching pairs
• Find the impact factor for each journal: g(Ai), i=1…8
Acknowledgements
This research was supported by National Library of Medicine Biomedical and Health Informatics
Research Training grant 2-T15-LM07089-11.
References
[1] C.D. Manning, H. Schütze, Foundations of Statistical Natural Language Processing, MIT Press, 2001.
[2] R. Yager, “Criteria Aggregation Functions Using Fuzzy Measures and the Choquet Integral”, Int. Jour. of Fuzzy Systems, Vol. 1, No. 2, December 1999.
[3] J.J. Jiang, D.W. Conrath, “Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy”, Proc. of Int. Conf. Research on Comp. Linguistics X, 1997, Taiwan.
[4] P.W. Lord, R.D. Stevens, A. Brass, C.A. Goble, “Semantic similarity measure as a tool for exploring the gene ontology”, In Pacific Symposium on Biocomputing, pages 601-612, 2003.
[5] M. Sugeno, “Fuzzy measures and fuzzy integrals: a survey”, in M.M. Gupta, G.N. Saridis, and B.R. Gaines (eds.), Fuzzy Automata and Decision Processes, pp. 89-102, North-Holland, New York, 1977.
[6] S. Raychaudhuri, R.B. Altman, “A literature-based method for assessing the functional coherence of a gene group”, Bioinformatics, 19(3), pp. 396-401, Feb. 2003.
[7] M. Grabisch, T. Murofushi, and M. Sugeno (eds.), Fuzzy Measures and Integrals: Theory and Applications, Springer-Verlag, 2000.
[8] T.R. Hvidsten, J. Komorowski, A.K. Sandvik, A. Laegreid, “Predicting gene function from gene expressions and ontologies”, Pac. Symp. Biocomput., 2001, pp. 299-310.
[9] T. Joshi, “Cellular function prediction for hypothetical proteins using high-throughput data”, MS thesis, University of Tennessee, Knoxville, 2003.
[10] J. Keller, M. Popescu, J. Mitchell, “Soft Computing Tools for Gene Similarity Measures in Bioinformatics”, FLINT-CIBI 2003, Berkeley, Dec 15-18, 2003.