Retrospective study of a gene by mining texts: The Hepcidin use-case
Fouzia Moussouni-Marzolf

Introduction
Life Science is becoming the most VOLUMINOUS science. 3 major reasons:
• Modern digital revolution: INTERNET
• Increasing incentive to publish:
    • The competition pressure
    • Evaluation concerns at several levels
• Sharing of knowledge at a global scale

Introduction
Rapid expansion of the biomedical literature: available papers exploding
• The comprehension of the iron regulation system is still difficult
• BOOM of publications since 2000
• Comprehension of associated diseases by medical experts
[Figure: MLTrends, Hepcidin, since Dec 2000]
Increased demand for effective text mining tools to find relevant information quickly.

Introduction
These tools extract a deluge of information: very dense data
[Figures: Hepcidin, January 2011; Hepcidin, February 2011]
Text mining with Ali-Baba and a global query « Hepcidin » [1]: many common events, few news
• For a non-expert: the information is dense and unreadable, and the pertinent information is hidden. Biologists are rapidly discouraged from using these tools.
• For an expert: a considerable amount of well-known data (background).
[1] Plake, C., Schiemann, T., Pankalla, M., Hakenberg, J. & Leser, U. AliBaba: PubMed as a graph. Bioinformatics. 22, 2444-2445 (2006).

Introduction
Which solutions for managing this increasing flood of extracted information?
Unfolding time during the process of text mining:
• Reduce the density of information at each period of time
• Perception of a certain chronology in the sequence of events linked to a gene: enhances comprehension
• Ability to locate trivial information repeatedly published and extracted [2]
• Select the most relevant events over time
= Reduced density of information
[2] Jensen, L.J., Saric, J. & Bork, P. Literature mining for the biologist: from information retrieval to biological discovery. Nat Rev Genet. 7, 119-129 (2006).

Methods
Focus on 2 frames of study
1. Exploit the text mining engine Ali-Baba (HU-Berlin)
Information extraction tool working on the Medline abstracts that result from a PubMed query, e.g. Hepcidin 2005 [dp]
Ali-Baba is not a simple pattern matching tool for counting keyword occurrences. It recognizes actual biological entities located in the abstracts, using dictionaries.
Different sorts of bio-entities extracted:
• Proteins
• Diseases
• Cell types
• Tissues
• Drugs
• Species

Methods
Ali-Baba extracts relationships between recognized bio-entities, namely bio-events.
Example sentence: "… STAT3 inhibitors, including curcumin, AG490 and a peptide (PpYLKTK), reduced hepcidin1 …"
Biological events extracted (source entity, relationship, target entity):
• Curcumin, reduce, hepcidin1
• AG490, reduce, hepcidin1
• Peptide (PpYLKTK), reduce, hepcidin1

Methods
From abstracts to a graph of events:
• Abstracts of « Hepcidin 2005 [dp] »
• Extraction of bio-events: natural language processing (NLP), co-occurrence
• Graph of events

Methods
2. Focus on the Hepcidin gene
Corpus of linked biological events published from the discovery of the gene until today
Retrospective study of Hepcidin over time: Dec 2000 to June 2012, time period = 1 month
• Filter trivial bio-events
• Select relevant bio-entities

Methods
What is a time-relevant biological entity?
Definition: a biological entity e recognized by an IE-based text mining system is time relevant for period t if, at time t, it reaches the maximum number of relationships with other biological entities recognized by the same IE-based system.
In the graph G(Nodes, Edges) of extracted bio-events, a t-relevant biological entity e is one that is highly targeted by other bio-entities at time t.

Methods
T-relevance can be computed for different sorts of biological entities. In an event (source entity, relationship, target entity), both the source and the target can be a:
• Protein
• Disease
• Cell type
• Tissue
• Drug
• Species
Each kind of relevance gives different valuable information.

Methods
What is a trivial biological event at time t?
A trivial event Te = an event already published before t
G0 = graph of events at time t0
G1 = graph of events at time t1 = t0+p
G2 = graph of events at time t2 = t0+2p
...
• At t1 = t0+p, Te is trivial if Te ∈ G1 and Te ∈ G0
• At t2 = t0+2p, Te is trivial if Te ∈ G2 and (Te ∈ G1 or Te ∈ G0)
• And so on for t0+3p and later periods

Methods
Data processing pipeline
For each period t in [t0, tn]:
• Query(t) = « Gene t [dp] »
• Ali-Baba web service for Query(t): events extracted and drawn for period t
• GraphML export
• Insert the GraphML into the database
Data integration (data transformation, data stamping): integrated time-based events of the decade
Final retrospective data analysis:
• Clearing of trivial data
• Selection of t-relevant bio-entities

Results
Hepcidin gene use case
- From t0 = 12/2000 to tn = 12/2011
- Database of more than 50,000 published biological events
Considerable amount of trivial events: background?
Cumulative quantification of trivial events over time: 52% of the events published over the whole Hepcidin decade are trivial.

Results
Hepcidin gene use case: relevant bio-entities over time
Relevant proteins over time
• Before clearing trivial events: permanent visibility of Hepcidin as relevant
• After clearing: new information emerges as highly targeted: several proteins regulate Hepcidin transcription

Results
Hepcidin gene use case: relevant bio-entities over time
Relevant diseases over time
• Before clearing trivial events: permanent visibility of hemochromatosis and iron overload
• After clearing: new diseases linked to Hepcidin and iron, such as the neurological diseases, emerge as highly targeted

Results
More annotations of the "relevant entities"

Conclusion
A new, straightforward approach for retrospective studies of genes has been proposed.
Time has been coupled to the process of information extraction to improve comprehension of the considerable number of biological events linked to the Hepcidin gene since its discovery in Dec 2000.
This work is still ongoing.
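The two definitions from the Methods slides (trivial events and t-relevant entities) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `clear_trivial` and `t_relevant` are hypothetical, the sample events are invented for illustration, and "maximum of relationships" is assumed here to mean the largest number of distinct source entities targeting an entity within one period.

```python
# Each event is a (source entity, relationship, target entity) triple.
# One set of events per period (t0, t0+p, t0+2p). Sample data is invented.
events_by_period = [
    {("curcumin", "reduce", "hepcidin1"), ("AG490", "reduce", "hepcidin1")},
    {("curcumin", "reduce", "hepcidin1"), ("BMP6", "induce", "hepcidin1"),
     ("hepcidin1", "associated_with", "anemia")},
    {("BMP6", "induce", "hepcidin1"), ("IL6", "associated_with", "anemia"),
     ("EPO", "associated_with", "anemia")},
]


def clear_trivial(periods):
    """Drop from each period every event already seen in an earlier period."""
    seen = set()
    cleared = []
    for events in periods:
        cleared.append(events - seen)  # keep only events new at this period
        seen |= events                 # everything published so far
    return cleared


def t_relevant(events):
    """Return the entities targeted by the most distinct other entities."""
    sources_of = {}
    for source, _relationship, target in events:
        if source != target:
            sources_of.setdefault(target, set()).add(source)
    if not sources_of:
        return set()
    best = max(len(sources) for sources in sources_of.values())
    return {entity for entity, sources in sources_of.items()
            if len(sources) == best}


cleared_by_period = clear_trivial(events_by_period)

total = sum(len(events) for events in events_by_period)
kept = sum(len(events) for events in cleared_by_period)
trivial_share = (total - kept) / total  # fraction of events that are trivial

for index, (raw, cleared) in enumerate(zip(events_by_period, cleared_by_period)):
    print(index, sorted(t_relevant(raw)), sorted(t_relevant(cleared)))
```

In this toy data, clearing trivial events changes which entities come out as relevant in the second period, which is the effect the Results slides describe at a much larger scale.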

Current developments …
• Toward a generalization to queries of any biological entity
• Exclude review papers and the "background" and "methods" sections from mining, to minimize trivial events and entities
• Threshold of relevance, threshold of triviality

Acknowledgments
Contributors
• Bertrand De-Cadeville, Master2 MSB
• Olivier Loréal, resp. Iron Team, INSERM UMR 991
• Ulf Leser, resp. Bioinformatics Team, HU-Berlin
• Astrid Rheinlander, Ali-Baba Team at Berlin