Functional Gene Clustering via Gene Annotation Sentences, MeSH and GO Keywords from Biomedical Literature

Dr. N. JEYAKUMAR, M.Sc., Ph.D.
Bioinformatics Centre, School of Biotechnology
Madurai Kamaraj University
Madurai – 625021, INDIA

Purpose & Goals

- Extract gene-specific functional ‘keywords’ from the biological literature.
- Augment the extracted keywords with MeSH and GO keywords related to each gene.
- Compare the accuracy of the results on a test data set across keyword extraction methods:
  - full abstracts
  - gene-specific sentences
  - gene-specific sentences + MeSH keywords
  - gene-specific sentences + MeSH and GO keywords
- Use the keyword extraction method to cluster the differentially expressed genes found in a microarray experiment.

Outline

Two parts:
- Part I: Text mining and keyword extraction from literature
  - Our text mining methodology
- Part II: Applications to microarrays
  - Functional keyword clustering of microarray data

Part I: Text Mining

Text Mining: Introduction and Overview

Text mining
- aims to identify non-trivial, implicit, previously unknown, and potentially useful patterns in text (e.g. classification systems, association rules, hypotheses);
- includes more established research areas such as information retrieval (IR), natural language processing (NLP), information extraction (IE), and traditional data mining (DM);
- is relevant to bioinformatics because of
  - the explosive growth of the biomedical literature (e.g. MEDLINE, 15 million records);
  - the availability of some information in textual form only, e.g. clinical records.

Text Mining: System Architecture

[Figure: system architecture. Component labels: Microarray Experiment, Gene List, Gene/Protein Dictionary, MedLine Abstracts, Abstract Filtering, Set of Abstracts, Sentence Extraction, Extraction Patterns, Keyword Extraction, MeSH / GeneOntology, MeSH/GO Keyword Extraction, Annotation Keywords, Feature Vector Generation, Clustering, Visualization.]
Caption: Experimental design of gene clustering with sentence-level, MeSH and GO keywords.

Text Mining: Keyword Extraction from Biomedical Literature

Steps to extract sentence-level keywords:

- Gene-synonym dictionary. A special gene name and synonym dictionary was created for human genes using Entrez Gene.
- Gene-name normalization. All gene names in an abstract are replaced with their unique canonical identifier (Entrez Gene ID), using the gene-synonym dictionary constructed for this study.
- Sentence filtering. Corpus-specific regular expressions select the gene-specific sentences. For example, the expression

    ($gene @{0,6} $action (of|with) @{0,2} $gene)

  extracts sentences that match the structure shown below. The notational construct ‘A B ...’ is interpreted as ‘A followed by B followed by ...’.

    gene name | 0-6 words | action verb | ‘of’ or ‘with’ | 0-2 words | gene name

- Keyword extraction. Described under "Keyword extraction" below.

Text Mining: Keyword Extraction from Biomedical Literature

Table 1. An example set of regular expressions: nouns describing agents and actions, and passive and active verbs

Nouns describing agents
    Pattern:  ($gene (is)? (the|an|a) @{0,2} $action of @{0,2} $gene)
    Sentence: IL6, a known mediator of STAT3 response

Nouns describing actions
    Pattern:  ($gene @{0,6} $action (of|with) @{0,1} $gene)
    Sentence: abi5 domains required for interaction with abi3

Passive verbs
    Pattern:  ($gene @{0,6} (is|was|be|are|were) @{0,1} $action (by|via|through) @{0,3} $gene)
    Sentence: Protein kinase c (PKC) has been shown to be activated by parathyroid hormone

Active verbs
    Pattern:  ($gene $sub-action @{0,1} $action @{0,2} $gene)
    Sentence: Insulin mediated inhibition of hormone sensitivity lipase activity

Text Mining: Keyword Extraction from Biomedical Literature

Keyword extraction

- Example sentence: BRCA1 physically associates with p53 and stimulates its transcriptional activity.
- Brill POS-tagged sentence: BRCA1/NNP physically/RB associates/VBZ with/IN p53/NN and/CC stimulates/VBZ its/PRP$ transcriptional/JJ activity/NN ./.
- Sentence keywords: associates, stimulates, transcription activity
- Sentence keywords after manual curation: transcription activity

Text Mining: MeSH Keyword Extraction

MeSH keywords
- MeSH keywords are subject index terms assigned to each article by the National Library of Medicine (NLM) for the purpose of subject indexing and searching journal articles via PubMed.

MeSH keyword extraction
- Extracted directly from the gene-specific abstracts via Perl scripts.

MeSH keyword curation
- Uninformative terms are removed using a MeSH stop-word dictionary (e.g., human, DNA, animal, Support U.S. Govt).

For example, the MeSH keywords associated with the gene ‘FOS’ in our gene list are ‘oncogene, felypressin, transcription-factor, thermo-receptors, DNA-binding, antibiosis, inflammatory-response, zinc-fingers, gene-regulation, and neuronal-plasticity’.
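
The sentence-filtering step described earlier can be sketched in Python. This is only an illustration of how a pattern such as ($gene @{0,6} $action (of|with) @{0,2} $gene) might be translated into a standard regular expression, not the scripts used in the study. It assumes that `@{m,n}` stands for m to n arbitrary words, that `$gene` matches any name from a supplied gene list, and that `$action` matches any word from an action lexicon. The names `ACTION_WORDS`, `build_pattern` and `filter_sentences` are hypothetical, and the lexicon contents are invented for the example.

```python
import re

# Hypothetical action nouns; the slides do not list the lexicon actually used.
ACTION_WORDS = ["interaction", "inhibition", "activation", "regulation"]


def build_pattern(gene_names, action_words):
    """Translate ($gene @{0,6} $action (of|with) @{0,2} $gene) into a Python regex.

    Assumption: @{m,n} means "between m and n arbitrary words".
    """
    gene = "(?:" + "|".join(re.escape(g) for g in gene_names) + ")"
    action = "(?:" + "|".join(re.escape(a) for a in action_words) + ")"
    word = r"(?:\S+\s+)"  # one arbitrary word followed by whitespace
    return re.compile(
        rf"\b{gene}\s+{word}{{0,6}}{action}\s+(?:of|with)\s+{word}{{0,2}}{gene}\b",
        re.IGNORECASE,
    )


def filter_sentences(sentences, pattern):
    """Keep only the sentences that contain a match for the pattern."""
    return [s for s in sentences if pattern.search(s)]


if __name__ == "__main__":
    pattern = build_pattern(["abi5", "abi3"], ACTION_WORDS)
    sentences = [
        "abi5 domains required for interaction with abi3",  # example from Table 1
        "abi5 is expressed in seeds",                       # invented non-matching sentence
    ]
    print(filter_sentences(sentences, pattern))
```

In a pipeline like the one on the slides, the gene list would contain the normalized identifiers produced by the gene-name normalization step, so one alternation covers every synonym of a gene.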
Text Mining: GO Keyword Extraction

GO keywords
- Gene Ontology (GO) is a hierarchical organization of gene and gene product terms from various databases, in which concepts at higher levels of the hierarchy are more general than those further down.

GO keyword extraction
- Of the three GO annotation categories, we included only molecular function and biological process. Cellular component was left out, as it is less important for characterizing gene function.
- Because of the hierarchical nature of GO and the multiple inheritance in its structure, we also consider, for each annotated term, every ancestor up to level 2 of the GO tree.

For example, the GO keywords associated with the gene ‘FOS’ in our gene list are ‘protein-dimerization, DNA binding, RNA polymerase, transcription factor, DNA methylation, and inflammatory-response’.

Text Mining: Keyword Representation and Calculation of Numeric Vectors

This process computes the numeric weight, wij, for each gene-keyword pair (gi, tj) (i = 1, 2, … n and j = 1, 2, … k) to represent the gene’s characteristics in terms of the associated keywords. Common techniques for such numeric encoding include:

- Binary. The presence or absence of a keyword relative to a gene.
- Term frequency. The frequency of occurrence of a keyword with a gene.
- Term frequency / inverse document frequency (TF*IDF). The relative frequency of occurrence of a keyword with a gene compared to other genes.

Text Mining: TF*IDF Weighting

The most common weighting scheme in information retrieval and text classification is TF*IDF (term frequency / inverse document frequency).

- TF(w, d) (term frequency) is the number of times word w occurs in document d.
- DF(w) (document frequency) is the number of documents in which word w occurs at least once.
- The inverse document frequency is calculated as

$$\mathrm{IDF}(w) = \log\left(\frac{|D|}{\mathrm{DF}(w)}\right)$$

  where |D| is the total number of documents in the corpus.

Text Mining: Keyword Representation and Calculation of Numeric Vectors

In our study the keywords are extracted from gene-specific sentences rather than from full abstracts, so the number of keywords associated with each gene is small. Further, the frequency of occurrence of most keywords tended to be one. Therefore, the binary encoding scheme was adopted, as illustrated in Table 2.

Table 2. Binary representation of genes * keywords

Terms \ Genes   g1        g2        ...   gn
t1              w11 = 0   w12 = 1   ...   w1n = 0
t2              w21 = 1   w22 = 1   ...   w2n = 0
...             ...       ...       ...   ...
tk              wk1 = 1   wk2 = 0   ...   wkn = 1

Text Mining: Gene Clustering

- The binary coding scheme adopted in this study yields numeric row vectors representing genes (via the associated biological functional keywords) and numeric column vectors representing annotation terms (via the associated genes).
- Clustering can produce useful and specific information about the biological characteristics of sets of genes.
- Clustering partitions unlabeled examples into disjoint subsets (clusters), such that:
  - examples within a cluster are very similar;
  - examples in different clusters are very different.
- New categories are discovered in an unsupervised manner.

Text Mining: Test Set and Evaluation

A test set of 20 genes with 10 abstracts per gene, giving a total of 200 abstracts in two cancer categories (Table 3), was used to evaluate the usefulness of our keyword extraction method.

Table 3. Test set of 20 human genes manually grouped into two cancer categories

Category        Genes
Brain Tumor     ADAM23, DKK1, IGF2, LRRC4, L3MBTL, MMP9, MSH2, PTPNS1, SFMBT1, ZIC1
Breast Cancer   AMPH, ATM, BRCA1, BRCA2, CHEK2, CDH1, PHB, TFF1, TSG101, XRCC3

Text Mining: Evaluation

Four keyword extraction methods were compared:

1. Full abstract keywords (baseline). Extracts gene annotation terms based on term frequency * inverse document frequency (TF*IDF) within the entire abstract, without regard to sentence structure.
2. Sentence keywords. Extracts gene-specific keywords based on sentence-level processing.
3. Sentence + MeSH keywords. As in (2) above, plus MeSH terms (see "Text Mining: MeSH Keyword Extraction").
4. Sentence + MeSH + GO keywords. As in (2) above, plus MeSH terms (see "Text Mining: MeSH Keyword Extraction") and GO terms (see "Text Mining: GO Keyword Extraction").

Text Mining: Evaluation

Results of various keyword extraction methods

Keyword Extraction Method        Precision   Recall   F-measure (%)
Abstract keywords (baseline)     0.31        0.24     27.05
Sentence keywords only           0.57        0.38     45.60
Sentence + MeSH keywords         0.64        0.47     54.19
Sentence + MeSH + GO keywords    0.78        0.72     74.88

Part II: Applications to Microarrays

Functional keyword clustering of genes resulting from a microarray experiment

Applications to Microarrays: Data and Analysis

As an illustrative example, our keyword extraction method was applied to the functional interpretation of clusters of genes found to be differentially expressed in a microarray experiment investigating the impact of two mitogens, epidermal growth factor (EGF) and sphingosine 1-phosphate (S1P), on glioblastoma cell lines.

Compared to the resting state, 19 genes were significantly differentially expressed in response to EGF, 35 genes in response to S1P, and 30 genes in response to COM, i.e., the combined stimuli of S1P and EGF. The three gene lists are referred to as G(EGF), G(S1P) and G(COM), respectively (Table 4).

Applications to Microarrays: Data and Analysis

Table 4. List of Differentially Expressed Genes

G(EGF) (19 genes): HRY, KLF2, ID1, JUN, DUSP6, IMPDH2, GP1BB, PNUTL1, CGI-96, CALD1, TRIM15, FOS, SPRY4, CLU, SLC5A3, MRPS6, ABCA1, OLFM1, PHLDA1

G(S1P) (35 genes): F3, NR4A1, KLF5, GADD45B, IL8, CITED2, CALD1, IL6, BCL6, LBH, HRB2, KIAA0992, NFKBIA, TNFAIP3, CCL2, DSCR1, TXNIP, NAB1, EHD1, GBP1, GLIPR1, MAP2K3, FZD7, RGS3, SOCS5, FOSL2, JAG1, DOC1, NRG1, BTG1, PDE4C, KIAA1718, KIAA0346, SFRS3, PLAU

G(COM) (30 genes): MAFF, DUSP5, EGR3, SERPINE1, ZFP36, DUSP1, LIF, DTR, MYC, GADD45B, RTP801, ATF3, JUNB, SNARK, WEE1, EGR2, TIEG, SPRY2, CEBPD, SGK, GEM, NEDD9, LDLR, EGR1, C8FW, UGCG, MCL1, ZYX, FOSL1, DIPA

Applications to Microarrays: Data and Analysis

- Using the three gene lists obtained from the microarray experiment (Table 4) as queries in MEDLINE returned three corresponding sets of abstracts, A(EGF), A(S1P) and A(COM), respectively (Table 5).
- The abstracts were processed with the keyword extraction method based on sentence-level keywords augmented with MeSH and GO keywords.
- The resulting keywords were encoded with the binary weighting scheme.
- The resulting representations were clustered using the average linkage hierarchical clustering algorithm.

Applications to Microarrays: Data and Analysis

Table 5. Three sets of abstracts, A(EGF), A(S1P), and A(COM), retrieved via MEDLINE for this study

Gene List   # of Genes in List   Retrieved Abstract Set   # of Abstracts in Set
G(EGF)      19                   A(EGF)                   28,913
G(S1P)      35                   A(S1P)                   19,705
G(COM)      30                   A(COM)                   39,890

Applications to Microarrays: Average Linkage Hierarchical Clustering Algorithm

- Use the average similarity across all pairs within the merged cluster to measure the similarity of two clusters:

$$\mathrm{sim}(c_i, c_j) = \frac{1}{|c_i \cup c_j|\,(|c_i \cup c_j| - 1)} \sum_{x \in (c_i \cup c_j)} \; \sum_{y \in (c_i \cup c_j):\, y \neq x} \mathrm{sim}(x, y)$$

- This is a compromise between single and complete link.
- The similarity is averaged across all ordered pairs in the merged cluster, instead of unordered pairs between the two clusters.
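
The binary encoding and the average linkage step described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the slides do not say which pairwise similarity sim(x, y) was applied to the binary vectors, so the Jaccard coefficient is assumed here, and the gene names `g1` to `g4` with their keyword sets are toy data. The function names are hypothetical.

```python
from itertools import combinations


def binary_vectors(gene_keywords):
    """Encode each gene as a 0/1 vector over the sorted keyword vocabulary."""
    vocab = sorted({k for kws in gene_keywords.values() for k in kws})
    vectors = {g: [1 if t in kws else 0 for t in vocab]
               for g, kws in gene_keywords.items()}
    return vocab, vectors


def jaccard(u, v):
    """Assumed pairwise similarity: shared keywords / keywords in either gene."""
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    return both / either if either else 0.0


def group_average(cluster_a, cluster_b, vectors):
    """Average similarity over all ordered pairs x != y in the merged cluster."""
    merged = cluster_a + cluster_b
    n = len(merged)
    total = sum(jaccard(vectors[x], vectors[y])
                for x in merged for y in merged if x != y)
    return total / (n * (n - 1))


def average_link(vectors, n_clusters):
    """Merge the most similar pair of clusters until n_clusters remain."""
    clusters = [[g] for g in sorted(vectors)]
    while len(clusters) > n_clusters:
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda p: group_average(clusters[p[0]], clusters[p[1]], vectors))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


if __name__ == "__main__":
    # Toy data for illustration only; not taken from the study's results.
    toy = {
        "g1": {"transcription factor", "DNA-binding"},
        "g2": {"transcription factor", "DNA-binding", "zinc fingers"},
        "g3": {"angiogenesis", "cell-adhesion"},
        "g4": {"angiogenesis"},
    }
    vocab, vectors = binary_vectors(toy)
    print(average_link(vectors, 2))  # [['g1', 'g2'], ['g3', 'g4']]
```

Recomputing the group average from scratch at every merge keeps the sketch close to the formula; for gene lists of the size in Table 4 this cost is negligible, while larger inputs would call for cached pairwise similarities.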
Applications to Microarrays: Results

Summary of analysis of EGF cluster

[Figure: keyword-based clustering of the G(EGF) genes. Only the axis labels survive, listed in figure order.]
Gene labels: TRIM15, GP1BB, SPRY4, MRPS6, IMPDH2, PNUTL1, OLFM1, PHLDA1, DUSP6, ID1, KLF2, CALD1, ABCA1, CLU, FOS, JUN, SLC5A3, HRY
Keyword labels (as extracted): antibiosis osteoblasts v-fos fusion sensation immuno-reactivity recombination thermo-receptors intracellular atherogeneses glutamine-transport DNA-methylation felypressin transition clustering relaxation tumorigenesis desaturases shape-regulation assemble secretion biosynthesis regulation glycoprotein androgens odontogenesis calmodulin-binding trans-activators zinc fingers mitogenesis inhibition embryonic development neural tube defects transcription factor cell death embryogenesis ion binding angiogenesis

Applications to Microarrays: Results

Summary of analysis of S1P cluster

[Figure: keyword-based clustering of the G(S1P) genes. Only the axis labels survive, listed in figure order.]
Gene labels: TNFAIP3, KLF5, BCL6, NAB1, BTG1, NFKBIA, NR4A1, SOCS5, CITED2, NRG1, JAG1, PLAU, CCL2, IL8, IL6, GLIPR1, F3, MAP2K3, EHD1, GBP1, DSCR1, HRB2, GADD45B, FOSL2, PDE4C, RGS3, FZD7, SFRS3, TXNIP, DOC1, CALD1
Keyword labels (as extracted): atherogenesis mitogenesis assemble inflammation angiogenesis endocytosis lymphocytes pathogenesis immune-response DNA-dependent focal-contact DNA-damage splicing G1 phase extracellular motility protein-binding cos-cells myosin RNA localization dose-response anticodon cytotoxicity parasitophorous G protein demyelination cytolysis Ca release locomotion homeostasis circulation phosphorylation synthesis repair protein kinase endothelialization organogenesis cell-adhesion mutagenesis

Applications to Microarrays: Results

Summary of analysis of COM cluster

[Figure: keyword-based clustering of the G(COM) genes. Only the axis labels survive, listed in figure order.]
Gene labels: LDLR, SPRY2, GEM, ZYX, NEDD9, MYC, LIF, SERPINE1, DTR, MCL1, C8FW, MAFF, ATF3, RTP801, EGR1, JUNB, FOSL1, CEBPD, TIEG, EGR2, EGR3, ZFP36, WEE1, SNARK, SGK, GADD45B, DUSP1, DUSP5, UGCG, DIPA
Keyword labels (as extracted): DNA modification DNA methylation jun genes G2-m transition mRNA splicing immortality DNA recombination microtubule gene silencing helix-loop-helix motifs transcription factor seizures genome instability DNA-binding zinc fingers repressor proteins DNA-dependent nucleus transactivation leucine zippers transcription gene expression regulation oxidative stress proto-oncogene cell survival signal transduction maturation endocytosis differentiation mitogenesis mitosis G2 phase chemosensitivity mutagenesis lymphangiogenesis ion binding RNA processing

Conclusions

- An important topic in microarray data mining is linking transcriptionally modulated genes to functional pathways, or determining how transcriptional modulation is associated with specific biological events such as genetic disease phenotypes or cell differentiation.
- However, the amount of functional annotation available for each transcriptionally modulated gene is still a limiting factor, because not all genes are well annotated.
- Jenssen et al. (2001) earlier compiled a network of human gene relationships from MEDLINE abstracts and compared these relationships to gene expression cluster results. Their approach gave a very interesting result: functionally related genes can show totally different expression patterns, and hence belong to different clusters (Jenssen et al.: A literature network of human genes for high-throughput analysis of gene expression, Nat. Genet., 28, 21-28, 2001).

Conclusions

- Our functional keyword clustering/grouping of genes makes it possible to select functionally informative genes from the differentially expressed genes for further investigation.
- Our evaluation suggests that this approach provides more specific and useful information than typical approaches using abstract-level information. This is particularly the case when the sentence-level terms are augmented by MeSH and GO keywords.
- The current trend in text mining is toward full-text mining. Because full text contains many more irrelevant sentences than abstracts, this approach is well suited to full-text studies, as it filters out irrelevant sentences before clustering.

Acknowledgments

- Eric G. Bremer, Brain Tumor Research Program, Children’s Memorial Research Center, Chicago, IL, USA, and James R. van Brocklyn, Division of Neuropathology, Department of Pathology, The Ohio State University, Columbus, Ohio, USA, for the microarray data set
- Dr. Daniel Berrar, Bioinformatics Research Group, University of Ulster, UK
- Members of the Bioinformatics Centre, Madurai Kamaraj University, India
- Dept. of Biotechnology, Govt. of India, for Bioinformatics facilities

THANK YOU