Text Mining and Its Application in Bioinformatics
Xiaohua Tony Hu, College of Information Science & Technology, Drexel University, USA

Agenda
• Introduction
• Problems of Biomedical Literature Mining Approaches
• Related Work
• Our System: Bio-SET-DM
• Sub-Network Modeling, Simulation and Evaluation
• Conclusion and Future Studies

Biomedical Literature Mining
• Much biomedical and bioinformatics knowledge, along with experimental results, is published only in text documents, and these documents are collected in online digital libraries/databases (Medline, PubMed Central, BioMed Central).
• How big is Medline?
– Abstracts from more than 4,800 journals, over 16 million abstracts in total
– Over 10,000 papers are added per week

[Figure: The exploding number of MEDLINE articles over the years, 1950 to 2006 (Apr.), growing to roughly 16,000,000.]

Introduction
• How can we cope with the information overload in the biomedical literature?
– Develop scalable searching and mining methods
– Integrate information extraction and data mining methods to automatically
o search and retrieve biomedical literature efficiently and effectively
o extract the results into a structured format
o mine important biological relationships

Major Issues in Biomedical Literature Mining
• Huge numbers of documents
• Lack of structure
• Many subdomains
• Many aliases and typographical variants for most biomedical objects
• Abbreviations, synonyms, polysemy, etc.

The General Text Mining View
1. Select what to read (Information Retrieval)
2. Identify important entities and the relations between those entities (Information Extraction)
3. Combine this new information with other documents and other knowledge into a database
4.
Mine the extracted results (Data Mining)

Issues in Current Information Retrieval (IR)
• Keyword-based search retrieves many irrelevant documents and misses many relevant ones
– Ambiguous terms, e.g., mouse, bank, chip, apple
• Query expansion
• Probabilistic language modeling

Issues in Current Information Extraction (IE)
• Examining every document
– Doing so against Medline is extremely time-consuming
• Using filters to select promising abstracts for extraction
– Requires human involvement to maintain and to adapt to new topics or subdisciplines

Our Approach: Bio-SET-DM
• Information Retrieval: semantic query expansion (Xiaohua Zhou's Ph.D. thesis)
• Information Extraction: mutual reinforcement learning for automatic pattern learning and tuple extraction (Illhoi Yoo's Ph.D. thesis)
• Text Mining: graph-based representation for text clustering and summarization (Xiaodan Zhang's Ph.D. thesis)
• Bio-SET-DM (Biomedical Literature Searching, Extracting and Text Data Mining)
• Biomedical ontologies: UMLS and GO are the glue
• NSF CAREER: A Unified Architecture for Data Mining Biomedical Literature Databases ($415K, March 2005-Feb 2010)

Problem Description for IR
• Many biomedical literature searches are about relationships between biological entities.
• The co-occurrence of two keywords does not necessarily mean the two keywords are really related.
– Example query to retrieve documents addressing the interaction of obesity and hypertension from PubMed:
obesity [TIAB] AND hypertension [TIAB] AND hasabstract [text] AND ("1900"[PDAT] : "2005/03/08"[PDAT])
– A ranked hit list of 6,687 documents is returned. We took the top 100 abstracts for human relevance judgment; as expected, only 33 of them were relevant.
– Goal: explicitly index and search documents with relationships

Statistical Language Model
• A statistical language model is a probabilistic mechanism for generating text.
• Text generation
– Suppose the word is the unit of a text (e.g., a document).
The text generation process is as follows:
• Choose a language model at each step.
• Generate a word according to the chosen model.

Language Modeling and IR
• Example:
– Document 1 = {(A,3), (B,5), (C,2)}
– Document 2 = {(A,4), (B,1), (C,5)}
– Query = {A, B}
– Which document is more relevant to the query?
Doc 1: 0.3 * 0.5 = 0.15
Doc 2: 0.4 * 0.1 = 0.04
Doc 1 is more relevant to the query than Doc 2.

Why Smoothing?
• Avoid zero probabilities
– Document 1 = {(A,3), (B,5), (C,2)}
– Document 2 = {(A,4), (B,1), (C,5)}
– Query = {A, D}
– Which document is more relevant to the query?
Doc 1: 0.3 * 0 = 0
Doc 2: 0.4 * 0 = 0
Obviously, this result is not reasonable.

Why Smoothing?
• Discount high-frequency terms: stop words (e.g., the, a, an, you) occur frequently in documents. Under the Maximum Likelihood Estimate (MLE), their generative probability is very high, yet stop words are obviously trivial to those documents.
• Assign reasonable probability to unseen words (data sparsity)
– Test words may not appear in the training corpus.
– We need an effective smoothing method, especially one that incorporates the semantic relationship between test words and training words into the model.
– Example: a document containing "auto" for the query "car" in a text retrieval task.
o With Laplacian or background smoothing, the document will not be returned for the query.
o With semantic smoothing, the document will be returned for the query.

LM and IR
• Steps:
– Estimate the word distribution for each document, i.e., p(w|di), also referred to as the document language model or document model.
– Compute the probability of generating the query according to each document model.
– Rank documents in the collection by their query-generating probabilities.

Language Modeling IR Formalism
• LM views IR as a process of word sampling from the document.
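The Doc 1/Doc 2 example and the smoothing fix above can be sketched in a few lines of Python. This is a minimal illustration of query-likelihood scoring with Jelinek-Mercer smoothing; the background distribution `bg` is a hypothetical corpus model invented for the example.

```python
def query_likelihood(query, doc_counts, background, lam=0.0):
    """Score p(q|d): product over query words of the smoothed model
    p(w|d) = (1 - lam) * p_ml(w|d) + lam * p(w|C)   (Jelinek-Mercer).
    lam = 0 reduces to the unsmoothed Maximum Likelihood Estimate."""
    doc_len = sum(doc_counts.values())
    score = 1.0
    for w in query:
        p_ml = doc_counts.get(w, 0) / doc_len
        score *= (1 - lam) * p_ml + lam * background.get(w, 0.0)
    return score

doc1 = {"A": 3, "B": 5, "C": 2}
doc2 = {"A": 4, "B": 1, "C": 5}

# Unsmoothed MLE reproduces the slide's numbers for Query = {A, B}:
#   Doc 1: 0.3 * 0.5 = 0.15   Doc 2: 0.4 * 0.1 = 0.04
print(query_likelihood(["A", "B"], doc1, {}, lam=0.0))

# Query = {A, D}: MLE gives 0 for both documents, because D is unseen.
# A (hypothetical) background model p(w|C) repairs the zero probability:
bg = {"A": 0.35, "B": 0.30, "C": 0.30, "D": 0.05}
print(query_likelihood(["A", "D"], doc1, bg, lam=0.5) > 0)
```

With `lam > 0` both documents get a nonzero score for the query {A, D}, so they can still be ranked rather than being tied at zero.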
The higher the probability of generating the query, the more relevant the document is to the query (Ponte and Croft 1998):

\log \frac{p(r \mid Q,D)}{p(\bar{r} \mid Q,D)} = \log \frac{p(Q \mid D,r)}{p(Q \mid D,\bar{r})} + \log \frac{p(r \mid D)}{p(\bar{r} \mid D)} \stackrel{rank}{=} \log p(Q \mid D,r) + \log \frac{p(r \mid D)}{p(\bar{r} \mid D)} \stackrel{rank}{=} \log p(Q \mid D,r)

The formula is from (Lafferty and Zhai 2002).

Context-Sensitive Semantic Smoothing (Our Approach)
• Definition
– Like the statistical translation model, term semantic relationships are used for model smoothing.
– Unlike the statistical translation model, contextual and sense information is considered.
• Method
– Decompose a document into a set of context-sensitive topic signatures, then statistically translate topic signatures into individual words.

Topic Signatures
• Concept pairs
– A pair of concepts that are semantically and syntactically related to each other
– Examples: computer and mouse; hypertension and obesity
– Extraction: ontology-based approach (Zhou et al. 2006, SIGIR)
• Multiword phrases
– Examples: Space Program, Star War, White House
– Extraction: Xtract (Smadja 1993)

Translation Probability Estimate
• Method
– Use co-occurrence counts of topic signatures and individual words
– Use a mixture model to remove noise from topic-free general words

[Figure 2: Illustration of document indexing. Vt, Vd and Vw are the topic signature set, the document set and the word set, respectively.]

p(w \mid D_k) = (1-\alpha)\, p(w \mid t_k) + \alpha\, p(w \mid C)

where Dk denotes the set of documents containing the topic signature tk, and α is the coefficient controlling the influence of the corpus model in the mixture model.
Translation Probability Estimate
• Log likelihood of generating Dk:

\log p(D_k \mid t_k, C) = \sum_w c(w, D_k) \log \big( (1-\alpha)\, p(w \mid t_k) + \alpha\, p(w \mid C) \big)

• EM for estimation:

\hat{p}^{(n)}(w) = \frac{(1-\alpha)\, p^{(n)}(w \mid t_k)}{(1-\alpha)\, p^{(n)}(w \mid t_k) + \alpha\, p(w \mid C)}

p^{(n+1)}(w \mid t_k) = \frac{c(w, D_k)\, \hat{p}^{(n)}(w)}{\sum_i c(w_i, D_k)\, \hat{p}^{(n)}(w_i)}

where c(w, Dk) is the document frequency of term w in Dk, i.e., the co-occurrence count of w and tk in the whole collection.

Contrasting Translation Example
Space: space 0.245; shuttle 0.057; launch 0.053; flight 0.042; air 0.035; program 0.031; center 0.030; administration 0.026; develop 0.025; like 0.023; look 0.022; world 0.020; director 0.020; plan 0.018; release 0.017; problem 0.017; work 0.016; place 0.016; mile 0.015; base 0.014
Program: program 0.193; washington 0.026; congress 0.026; administration 0.024; need 0.024; billion 0.023; develop 0.023; bush 0.020; plan 0.020; money 0.020; problem 0.020; provide 0.020; writer 0.018; d 0.018; help 0.018; work 0.017; president 0.017; house 0.017; million 0.016; increase 0.016
Space Program: space 0.101; program 0.071; NASA 0.048; shuttle 0.043; astronaut 0.041; launch 0.040; mission 0.038; flight 0.037; earth 0.037; moon 0.035; orbit 0.032; satellite 0.031; Mar 0.030; explorer 0.028; station 0.028; rocket 0.027; technology 0.026; project 0.025; science 0.023; budget 0.023

Topic Signature LM
• Basic idea
– Linearly interpolate the topic-signature-based translation model with a simple language model.
– Document expansions based on context-sensitive semantic smoothing are very specific.
– The simple language model captures the points the topic signatures miss.

p_{bt}(w \mid d) = (1-\lambda)\, p_b(w \mid d) + \lambda\, p_t(w \mid d)

where the translation coefficient λ controls the influence of the translation component in the mixture model.
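The EM updates for the translation probability estimate above can be sketched as follows. This is a toy illustration under stated assumptions: the counts and background model are invented, and ties to a single topic signature tk; the point is that the corpus component absorbs general words such as "the", as in the Space / Program / Space Program example.

```python
def train_translation_model(counts, p_bg, alpha=0.3, iters=30):
    """EM estimation of the translation model p(w|t_k) under the mixture
    p(w|D_k) = (1-alpha) p(w|t_k) + alpha p(w|C).
    counts holds c(w, D_k), the co-occurrence counts of word w with the
    topic signature t_k; p_bg is the background (corpus) model p(w|C)."""
    total = sum(counts.values())
    p_t = {w: c / total for w, c in counts.items()}     # MLE initialization
    for _ in range(iters):
        # E-step: posterior that an occurrence of w came from the topic model
        post = {w: (1 - alpha) * p_t[w]
                   / ((1 - alpha) * p_t[w] + alpha * p_bg.get(w, 1e-12))
                for w in counts}
        # M-step: renormalize the expected topic-generated counts
        norm = sum(counts[w] * post[w] for w in counts)
        p_t = {w: counts[w] * post[w] / norm for w in counts}
    return p_t

# Hypothetical counts: the general word "the" co-occurs often with the
# signature, but the background model discounts it during EM.
counts = {"space": 50, "shuttle": 20, "the": 100}
p_bg = {"space": 0.01, "shuttle": 0.005, "the": 0.50}
p_t = train_translation_model(counts, p_bg, alpha=0.5)
# p_t sums to 1, and p_t["the"] ends up below its raw MLE of 100/170
```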
Topic Signature LM
• The simple language model:

p_b(w \mid d) = (1-\alpha)\, p_{ml}(w \mid d) + \alpha\, p(w \mid C)

• The topic signature translation model:

p_t(w \mid d) = \sum_k p(w \mid t_k)\, p_{ml}(t_k \mid d), \qquad p_{ml}(t_k \mid d) = \frac{c(t_k, d)}{\sum_i c(t_i, d)}

where c(ti, d) is the frequency of topic signature ti in document d.

Text Retrieval Experiments
• Collections
– TREC Genomics Track 2004 and 2005 (sub-collections)
– 2004: 48,753 documents
– 2005: 41,018 documents
• Measures: Mean Average Precision (MAP), Recall
• Settings
– Simple language model as the baseline
– Concept pairs as topic signatures
– Background coefficient: 0.05
– Pseudo-relevance feedback: top 50 documents, expand 10 terms

Baseline Models
Table 1. Comparison of the baseline language model (SLM) to the Okapi model. The Okapi formula is the same as the one in [10]. The numbers of relevant documents for TREC04 and TREC05 are 8266 and 4585, respectively. The asterisk indicates the initial query is weighted.

Collection | Recall: SLM | Okapi | Change | MAP: SLM | Okapi | Change
TREC04     | 6411        | 6662  | +3.9%  | 0.345    | 0.363 | +5.2%
TREC04*    | 6527        | 6704  | +2.7%  | 0.364    | 0.364 | +0.0%
TREC05     | 4084        | 4124  | +1.0%  | 0.255    | 0.250 | -2.0%
TREC05*    | 4135        | 4134  | -0.0%  | 0.260    | 0.254 | -2.3%

Experiment Results
Table 2. Comparison of the baseline language model (DM0) to the document smoothing model (DM2, λ=0.3) and the query smoothing model (FM1, γ=0.6).

Collection | Metric | DM0   | DM2   | Change | FM1   | Change
TREC04     | MAP    | 0.345 | 0.395 | +14.5% | 0.451 | +30.9%
TREC04     | Recall | 6411  | 6749  | +5.3%  | 6929  | +8.0%
TREC04*    | MAP    | 0.364 | 0.414 | +13.7% | 0.460 | +26.9%
TREC04*    | Recall | 6527  | 6905  | +5.8%  | 7039  | +7.8%
TREC05     | MAP    | 0.255 | 0.277 | +8.6%  | 0.279 | +9.4%
TREC05     | Recall | 4084  | 4167  | +2.0%  | 4227  | +3.5%
TREC05*    | MAP    | 0.260 | 0.288 | +10.8% | 0.287 | +10.4%
TREC05*    | Recall | 4135  | 4214  | +1.9%  | 4235  | +2.4%

Context-sensitive vs.
Context-insensitive
• The context-sensitive semantic smoothing approach performs significantly better than context-insensitive semantic smoothing approaches.

Table 3. Comparison of context-sensitive semantic smoothing (DM2) to context-insensitive semantic smoothing (DM2') on MAP. The rightmost column is the change of DM2 over DM2'.

Collection | DM0 MAP | DM2' MAP | Change | DM2 MAP | Change | DM2 over DM2'
TREC04     | 0.346   | 0.367    | +6.1%  | 0.395   | +14.5% | +7.6%
TREC04*    | 0.364   | 0.384    | +5.5%  | 0.414   | +13.7% | +7.8%
TREC05     | 0.255   | 0.260    | +2.0%  | 0.277   | +8.6%  | +6.5%
TREC05*    | 0.260   | 0.269    | +3.5%  | 0.288   | +10.8% | +7.1%

Relevant Publications
1. Hu X., Xu X., Mining Novel Connections from Online Biomedical Databases Using Semantic Query Expansion and Semantic-Relationship Pruning, International Journal of Web and Grid Services, 1(2), 2005, pp. 222-239
2. Zhou X., Hu X., Zhang X., Topic Signature Language Models for Ad-hoc Retrieval, IEEE Transactions on Knowledge and Data Engineering (IEEE TKDE), September 2007
3. Song M., Song I-Y., Hu X., Allen B., Integration of Association Rules and Ontology for Semantic-based Query Expansion, Journal of Data & Knowledge Engineering
4. Zhou X., Hu X., Zhang X., Lin X., Song I-Y., Context-Sensitive Semantic Smoothing for the Language Modeling Approach to Genomic IR, Proc. of the 29th Annual International ACM SIGIR Conference on Research & Development on Information Retrieval (SIGIR 2006)
5. Zhou X., Zhang X., Hu X., Semantic Smoothing of Document Models for Agglomerative Clustering, Twentieth International Joint Conference on Artificial Intelligence (IJCAI 07), Hyderabad, India, Jan 6-12, 2007
6. Zhang X., Hu X., Zhou X., A Comparative Evaluation of Different Link Types on Enhancing Document Clustering, 31st Annual International ACM SIGIR Conference on Research & Development on Information Retrieval (SIGIR 2008)

SPIE: Scalable and Portable Information Extraction
• The Scalable and Portable Information Extraction system (SPIE) is influenced by the idea of DIPRE introduced by Brin [Brin, 1998].
• The goal is to develop an efficient and portable information extraction system that automatically extracts various biological relationships from the online biomedical literature with little or no human intervention.

The main ideas of SPIE:
– Automatic query generation and query expansion for effective search and retrieval from text databases
– Dual-reinforcement information extraction for pattern generation and tuple extraction
– Scales well to huge collections of text files because it does not need to scan every text file

[Figure: SPIE architecture. Initial seed tuples drive automatic query generation; the search engine retrieves documents from the biomedical literature DB; retrieved documents are automatically categorized and mined to generate rules; mutual reinforcement of pattern generation and instance extraction populates the pattern base and the instance relation.]

SPIE (Scalable & Portable IE)
SPIE takes the
following steps:
1. Starting with a set of user-provided seed tuples, SPIE retrieves a sample of documents from the biomedical literature library.
– The seed tuple set can be quite small; 5 to 10 tuples are normally enough.
– Simple queries are constructed from the attribute values of the initial seed tuples to retrieve a document sample of a pre-defined size from the search engine.
2. The tuple set induces a binary partition (a split) on the documents:
– those that contain tuples and those that do not contain any tuple from the relation
– The documents are thus labeled automatically as positive or negative examples, respectively.
– Positive examples are documents that contain at least one tuple; negative examples are documents that contain no tuples.

Query Generation/Expansion for Document Retrieval
3. STEP 3 consists of two stages:
– converting the positive and negative examples into an appropriate representation for training
– running data mining algorithms on the training examples to generate a set of rules, then converting the rules into an ordered list of queries expected to retrieve new useful documents

• In STEP 3, three data mining algorithms are used for rule generation: Ripple, CBA and DB-Deci.
• The rules are ranked by Laplace measures.
• The top 10% of rules are converted into a query list:
Positive IF WORDS ~ protein AND binding → Query 1: protein AND binding
Positive IF WORDS ~ cell AND function → Query 2: cell AND function

Pattern Generation
• A pattern is a 5-tuple <prefix, entity_tag1, infix, entity_tag2, suffix>
– prefix, infix and suffix are vectors associating weights with terms
– prefix is the part of the sentence before entity1
– infix is the part of the sentence between entity1 and entity2
– suffix is the part of the sentence after entity2
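A much-simplified matcher for such a pattern can be sketched as follows. Assumptions are loud here: entities are recognized against a small hard-coded protein dictionary, and the infix must match literally, whereas the actual SPIE patterns weight terms in the prefix/infix/suffix vectors rather than requiring exact matches.

```python
import re

# Hypothetical entity dictionary standing in for a real protein tagger.
PROTEINS = {"HP1", "HDAC4", "BRCA1"}

def match_pattern(sentence, infix):
    """Extract an (entity1, entity2) tuple if two known proteins are
    joined by the given infix; return None otherwise."""
    m = re.search(r"(\w+) " + re.escape(infix) + r" (\w+)", sentence)
    if m and m.group(1) in PROTEINS and m.group(2) in PROTEINS:
        return (m.group(1), m.group(2))
    return None

print(match_pattern("HP1 interacts with HDAC4 in the two-hybrid system",
                    "interacts with"))
# -> ('HP1', 'HDAC4')
```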
Example: "HP1 interacts with HDAC4 in the two-hybrid system…" yields the pattern {"", <Protein>, "interacts with", <Protein>, ""}.

Pattern Matching

Experiment
• Keyword-based vs. SPIE
• Keyword-based experiment input:
o around 7,000 protein names (expanded from 1,600 protein names using protein synonyms)
o 23 keywords
o 1.5 million abstracts (obtained by keyword searching in PubMed)
• SPIE experiment input:
o only 10 protein-protein interaction (PPI) pairs
– Maximum number of documents used in each iteration: 10k
– Starting with 50k documents and stopping at 500k documents

Experiment Results
Experiment    | Abstracts used | # of distinct PPI
Keyword-based | 1,444,002      | 9,980
SPIE          | 500k           | 9,483

• SPIE clearly has a significant performance advantage over the keyword-based approach.

Chromatin Protein Network

Biomolecular Network Analysis
• Biomolecular networks dynamically respond to stimuli and implement cellular functions.
• Understanding these dynamic changes is the key challenge for cell biologists.
• As biomolecular networks grow in size and complexity, computer simulation becomes an essential tool for understanding biomolecular network models.
• A sub-network executes a specific cellular function and deserves to be studied in its own right.

Biomolecular Network Analysis
• Our method consists of two steps.
• First, a novel scale-free network clustering approach is applied to the biomolecular network to obtain various sub-networks.
• Second, computational models are generated for each sub-network and simulated to predict its behavior in the cellular context.
• We discuss and evaluate three advanced computational models: the state-space model, the probabilistic Boolean network model, and the fuzzy logic model.

Mining the Large-scale Biomolecular Network (1)
Main Algorithm SNBuilder(G, s, f, d)
1: G(V, E) is the input graph with vertex set V and edge set E
2: s is the seed vertex; f is the affinity threshold; d is the distance threshold
3: N ← {adjacency list of s} ∪ {s}
4: C ← FindCore(N)
5: C' ← ExpandCore(C, f, d)
6: return C'

Mining a Large-scale Biomolecular Network (2)
Sub-Algorithm FindCore(N)
8: for each v ∈ N
9:   calculate k_v^in(N)
10: end for
11: Kmin ← min{k_v^in(N) : v ∈ N}
12: Kmax ← max{k_v^in(N) : v ∈ N}
13: if Kmin = Kmax or (k_i^in(N) = k_j^in(N) for all i, j ∈ N with i, j ≠ s, i ≠ j) then return N
14: else return FindCore(N − {v}), where v satisfies k_v^in(N) = Kmin

Mining a Large-scale Biomolecular Network (3)
Sub-Algorithm ExpandCore(C, f, d)
16: D ← ∪_{(v,w) ∈ E, v ∈ C, w ∉ C} {v, w}
17: C' ← C
18: for each t ∈ D with t ∉ C and distance(t, s) ≤ d
19:   calculate k_t^in(D)
20:   calculate k_t^out(D)
21:   if k_t^in(D) > k_t^out(D) or k_t^in(D)/|D| > f then C' ← C' ∪ {t}
22: end for
23: if C' = C then return C

Experiment Results
Promising protein-protein interaction clusters

Experiment Results
[Fig 1: A sub-network obtained using the algorithm]

State-space Model for Simulation (1)
[Figure: A gene regulatory network with gene expression levels x1 … xn, internal variables z1 … zp, external inputs, dynamic equations and observation equations.]
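The FindCore step of the SNBuilder algorithm above can be sketched in Python. A sketch only, assuming an undirected graph given as an adjacency dict `{vertex: set_of_neighbors}`, with arbitrary tie-breaking that never removes the seed vertex (the full algorithm's uniform-degree condition on non-seed vertices is collapsed into the min = max check).

```python
def find_core(graph, neighborhood, seed):
    """Repeatedly remove a vertex with the minimum in-neighborhood
    degree k_v^in until all remaining degrees are equal, as in
    FindCore(N) of SNBuilder."""
    core = set(neighborhood)
    while True:
        k_in = {v: len(graph[v] & core) for v in core}   # k_v^in(N)
        k_min, k_max = min(k_in.values()), max(k_in.values())
        if k_min == k_max:
            return core                                   # degrees uniform
        victim = next(v for v in core if k_in[v] == k_min and v != seed)
        core.remove(victim)

# Toy graph: seed s with neighbors a, b, c; a and b are mutually linked,
# so the loosely attached c is stripped from the core.
graph = {"s": {"a", "b", "c"}, "a": {"s", "b"},
         "b": {"s", "a"}, "c": {"s"}}
print(find_core(graph, {"s", "a", "b", "c"}, "s"))  # the set {s, a, b}
```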
State-space Model for Simulation (2)

z(t+1) = A z(t) + B u(t) + n_1(t)
x(t) = C z(t) + n_2(t)

• x: gene expression data
• z: internal variables (promoters)
• A: state transition matrix
• B: control (input) matrix
• C: transformation matrix
• n1(t) and n2(t) stand for noise

State-space Model for Simulation (3)
• Applying the state-space modeling method to the gene expression data of the 16 genes in Figure 1, we obtained an inferred gene regulatory network with nine internal variables.
• The analysis shows that the inferred network is stable, robust, and periodic.
• Using the model constructed from the training dataset Thy-Thy 3 to predict the expression profiles in the testing dataset Thy-Noc gives the result shown in Figure 2.

State-space Model for Simulation (4)
[Fig 2: Comparison of experimental (solid lines) and predicted (dotted lines) gene expression profiles for DMTF (A), F2 (B), RRM2 (C) and TYR (D).]

Fuzzy Logic Model for Simulation (1)
• The fuzzy biomolecular network model is a set of rule sets for each node (in this case, gene) in the network, governing the response to each fuzzy state of the input genes to that node (the output gene).
• Fuzzy rule sets are generated for the genes in the sub-network in Figure 1.
• Using the model constructed from the training dataset Thy-Thy 3 to predict the expression profiles in the testing dataset Thy-Noc gives the results shown in Figure 3.

Fuzzy Logic Model for Simulation (2)
[Fig 3: Best-fit rule on training set "Thy-Thy 3" predicting gene expression on the test data (solid line) compared to actual data from the test set "Thy-Noc" (dashed line) for CDK2 (A), BRCA1 (B), EP300 (C), and CDK4 (D).]

Probabilistic Boolean Networks for Simulation (1)
• A probabilistic Boolean network (PBN) is a Markov chain capturing transition probabilities among different gene expression states.
• We construct PBNs for the given microarray data set "Thy-Thy 3" and use the data set "Thy-Noc" to test the constructed PBNs.
• The results are shown in Tables 1 through 3.

Probabilistic Boolean Networks for Simulation (2)
Table 1: Prediction accuracy (%) based on the given genetic network using 2-state microarray data.
Gene   | 2 states | Gene  | 2 states
DMTF   | 66.67    | PTEN  | 72.22
BRCA1  | 55.56    | RRM2  | 77.78
HIFX   | 77.78    | PLAT  | 50.00
HE     | 22.22    | TYR   | 55.56
PPP2R4 | 55.56    | CAD   | 66.67
MYC    | 44.44    | CDK2  | 50.00
NR4A2  | 72.22    | CDK4  | 66.67
F2     | 61.11    | EP300 | 72.22

Table 2: Prediction accuracy (%) based on the given genetic network using 3-state microarray data.
Gene   | 3 states | Gene  | 3 states
DMTF   | 50.00    | PTEN  | 44.44
BRCA1  | 55.56    | RRM2  | 72.22
HIFX   | 66.67    | PLAT  | 66.67
HE     | 16.67    | TYR   | 50.00
PPP2R4 | 61.11    | CAD   | 61.11
MYC    | 55.56    | CDK2  | 38.89
NR4A2  | 55.56    | CDK4  | 66.67
F2     | 61.11    | EP300 | 61.11

Probabilistic Boolean Networks for Simulation (3)
• To improve the prediction accuracy for HE, MYC and CDK2, we use the developed multivariate Markov chain to model the microarray data set.
The results are shown in Table 3.

Table 3: Prediction accuracy (%) based on the input genes estimated from the multivariate Markov chain model (values in parentheses are the corresponding accuracies from Tables 1 and 2).
Gene | 2 states      | 3 states
HE   | 55.56 (22.22) | 27.78 (16.67)
MYC  | 61.11 (44.44) | 55.56 (55.56)
CDK2 | 66.67 (50.00) | 38.89 (38.89)

Conclusions
• We present a new method for mining and dynamic simulation of sub-networks from a large biomolecular network.
• The presented method applies a scale-free network clustering approach to the biomolecular network to obtain biologically functional sub-networks.
• Three computational models: the state-space model, the probabilistic Boolean network, and the fuzzy logic model, are employed to simulate the sub-network, using time-series gene expression data of the human cell cycle.
• The results indicate that our method is promising for the mining and simulation of sub-networks.

Relevant Publications
1. Hu X., Wu D., Data Mining and Predictive Modeling of Biomolecular Network from Biomedical Literature Databases, IEEE/ACM Transactions on Computational Biology and Bioinformatics, April-June 2007, pp. 251-263
2. Hu X., Sokhansanj B., Wu D., Tang Y., A Novel Approach for Mining and Dynamic Fuzzy Simulation of Biomolecular Network, IEEE Transactions on Fuzzy Systems
3. Hu X., Wu F.X., Ng M., Sokhansanj B., Mining and Dynamic Simulation of Sub-Networks from Large Biomolecular Networks, 2007 International Conference on Artificial Intelligence, June 25-28, Las Vegas, USA (Best Paper Award, out of 500 submissions)
4. Hu X., Yoo I., Song I-Y., Song M., Han J., Lechner M., Extracting and Mining Protein-Protein Interaction Network from Biomedical Literature, Proceedings of the 2004 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (IEEE CIBCB 2004), Oct. 7-8, 2004, San Diego, USA (Best Paper Award), pp. 244-251
5. Tang Y.C., Zhang Y-Q., Huang Z., Hu X., and Zhao Y.,
Recursive Fuzzy Granulation for Gene Subsets Extraction and Cancer Classification, accepted for publication in the IEEE Transactions on Information Technology in Biomedicine

Dragon Toolkit
• A software package designed for language modeling, information retrieval and text mining
• Free download: http://www.ischool.drexel.edu/dmbio/dragontool/default.asp
• 500 Java libraries for NLP, search engines, entity extraction
• One of the most popular software packages for information retrieval, NLP, etc.; more than 1,500 research groups in the world have downloaded it since July 2007

Call for Papers
International Journal of Data Mining and Bioinformatics (IJDMB)
Editor-in-Chief: Xiaohua Hu
Inaugural issue: July 2006; SCI indexed: Oct. 2007

Call for Participation
2008 IEEE International Conference on Bioinformatics and Biomedicine (BIBM 08)
Philadelphia, USA, Nov 3-5, 2008
IEEE BIBM Steering Committee Chair: Xiaohua Hu

My Ph.D. Students and Joint Ph.D. Students with Chinese Universities
1. Illhoi Yoo (graduated in 2006, tenure-track assistant professor at Univ. of Missouri-Columbia)
2. Xiaodan Zhang (4th-year Ph.D. student: text and web data mining, digital libraries, bioterrorism)
3. Daniel Wu (5th-year Ph.D. student: data mining and biomolecular network analysis)
4. Xuheng Xu (4th-year Ph.D. student: semantic-based query optimization and intelligent searching)
5. Davis Zhou (5th-year Ph.D. student: semantic-based information extraction and retrieval)
6. Palakorn Achananuparp (4th-year Ph.D. student: text mining)
7. Deima Elnatour (4th-year Ph.D. student: semantic-based text mining)
8. Guisu Li (2nd-year Ph.D. student: healthcare informatics)
9. Zhong Huang (2nd-year Ph.D. student: bioinformatics, computational biology)
10. Xin Chen (first-year Ph.D. student, USTC)
11. Xiaoshi Yin (joint Ph.D. student with Prof. Zhoujun Li from BUAA)
12. Min Xu (joint Ph.D. student with Prof. Shuigeng Zhou from Fudan University)
13. Yaoyu Zuo (joint Ph.D. student with Prof.
Ying Tong from Zhongshan University)

Acknowledgements
• PI: NSF CAREER: A Unified Architecture for Data Mining Large Biomedical Literature Databases (NSF CAREER IIS 0448023, $415K, 03/15/2005-02/28/2010)
• PI: High Performance Rough Sets Data Analysis in Data Mining (NSF CCF 0514679, $102K, 08/01/2005-07/31/2008)
• Co-PI: The Drexel University GAANN Fellowship Program: Educating Renaissance Engineers (US Dept. of Education, 9/1/2006 to 8/31/2009, around $700K)
• Co-PI: Penn State Cancer Education Network Evaluation (PA Dept. of Health, 04/25/2006-07/31/2010, $1.2M)
• Co-PI: Center for Public Health Readiness and Communication (PA Dept. of Health, 08/01/2004-08/31/2007, $1.5M)
• Co-PI: Origin and Evolution of Genomic Instability in Breast Cancer (PA Dept. of Health, $100K, 05/01/2004-04/30/2005)
• Co-PI: Systems Biology Approach to Understanding Protein-Protein Interactions (PA Dept. of Health, $100K, 05/01/2004-04/30/2005)

Thanks for your attention. Any comments or questions?