The Topic Scoring Engine
Developed in Partnership with the European Molecular Biology Laboratory (EMBL)
Ulrich Reincke, SAS Germany
Copyright © 2003, SAS Institute Inc. All rights reserved.

Agenda: Topic Scoring Engine
• EMBL at a Glance
• The Biological Research Workflow
• The Business Problem
• The Text Mining Approach
• The Text Mining/Portal Solution
• Benefits
• Questions

The EMBL at a Glance
• Established in 1974
• Supported by sixteen countries, including nearly all of Western Europe and Israel
• Five facilities:
  − the main Laboratory in Heidelberg (Germany)
  − Outstations: Hamburg (Germany), Grenoble (France), Hinxton (U.K.)
  − external Research Programme in Monterotondo (Italy)

EMBL was founded with a four-fold mission:
• to conduct basic research in molecular biology
• to provide essential services to scientists in its Member States
• to provide high-level training to its staff, students, and visitors
• to develop new instrumentation for biological research

EMBL Features
• Provides access to instruments for the study of protein structures
• Hosts some of the world's oldest and biggest databases of DNA and protein sequences
• Provides services operated by highly trained biologists who are simultaneously involved in their own research
• The scientific networks created by EMBL alumni have contributed to the development of a truly international scientific community throughout Europe

EMBL in Numbers
• Current annual budget: €95.6M
  − €59.1M subscriptions of the Member States
  − €14.5M from personnel contributions
  − €22M from external sources
• 810 scientific and technical staff
• 100 research faculty
• 140 postdoctoral fellows
• 170 PhD students
Development Partnership between SAS and the Structural and Computational Biology Programme

Biological Research Workflow
[Diagram, built up over several slides] A biologist somewhere in the world:
• Thinks, conjectures, doubts
• Experiments to prove ideas
• Records and publishes results (paper)
• Enters the citation in PubMed

PubMed / Medline
The premier bibliographic database covering the fields of biology, medicine, nursing, dentistry, veterinary medicine, the health care system and preclinical science
• Contains bibliographic citations and author abstracts
• More than 12 million citations since 1966
• 4,600 biomedical journals
• Journals published in 70 countries
• On average 2,000 new contributions per day

Biological Research Workflow (extended diagram)
[Diagram] Labels added to the workflow above:
• Select papers
• Read papers
• Understand papers
• Structure data of papers
• Enter data in a databank
• Relate data to other data
• Query: sequences
• Data cleansing
• Download citations
• Provide predicted functions
• Query: keywords in PubMed
[Slides with images only; surviving labels: Location, Functional Information]

Business Problem
[Diagram] The workflow diagram is shown again, with the step "Query: keywords in PubMed" marked as the business problem.

Business Problem: Document Retrieval
• Need to overcome the "keyword barrier"
• With an increasing number of objects (topics), the task of updating will take more and more resources
• Need to filter 2,000 new abstracts per day and retain only what is valuable to the database
• Need to learn from previous query experiences
• Need to move from single-user expertise to process expertise, where query quality can be monitored and improved
• At stake is the quality of the query service

The Text Mining Approach
1. Relate each paper's PMID (document no.) to a topic
2. Generate an HTML link to PubMed for each document
3. Download the text from PubMed
4. Read the text into SAS datasets
5. Programs allow modification of topic definitions via e-mail:
   - create, delete topics
   - split topics
   - redefine topics (include or delete papers)
=> User friendliness and flexibility

1. Text mining pre-processing in batch (V8) with different parameter settings (parallelisation with MP Connect)
2. Generating 200-3000 concepts ("COLs" or SV) for each document in the training data
3. Variable selection for each topic (Proc Corr, Proc Varclus, Proc Reg)
4. Estimating the final model for the profile of each topic (Proc Logistic)
=> The best pre-processing, the most important variables and the best model are provided in batch in order to optimise recall and precision for each topic

[Process diagram] Labels: Read new publications in XML format; Input by ELM consortium; Analysis; Scoring; Personalisation in Portal; Distribution; Evaluation and feedback loop by ELM consortium

The Text Mining / Portal Solution

Demo Report
• Topic Scoring Engine

Technical Overview on the Solution
[Architecture diagram] Labels:
• Sources: WWW, Intranet, Extranet; file system, document management systems, RDBMS
• Text Retrieval
• Datasets
• Training & Scoring
• Data Mart with scores and links to the scored documents
• Personalized Text Retrieval
• Information Delivery

Benefits of the Topic Scoring Engine (TSE)
• TSE frees resources from querying and speeds up literature queries
• TSE provides automatic information on rapid literature developments in any topic profile
• EMBL's databases will better reflect the literature actually available in PubMed/Medline
• TSE goes far beyond keywords: even relevant abstracts of papers that were accidentally assigned a wrong or misleading keyword will be flagged

Summary
The Topic Scoring Engine operates like an automatic gravel mine

Questions?
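
Supplementary sketch
The text mining approach above generates an HTML link to PubMed for each document and tunes each topic model to optimise recall and precision. A minimal sketch of those two pieces in Python (an illustration only, not the SAS code used in the project) might look as follows. The URL pattern is today's public PubMed address and is an assumption; the class, the function names, the thresholds, and the toy PMIDs, scores and relevance labels are all hypothetical.

```python
from dataclasses import dataclass

# Assumption: current public PubMed URL pattern, not taken from the slides.
PUBMED_BASE = "https://pubmed.ncbi.nlm.nih.gov/"


@dataclass
class ScoredAbstract:
    pmid: int        # PubMed identifier (document no.)
    score: float     # hypothetical model probability for one topic
    relevant: bool   # curator judgement from the feedback loop


def pubmed_link(pmid: int) -> str:
    """Build an HTML link to the PubMed record of one document."""
    return f'<a href="{PUBMED_BASE}{pmid}/">PMID {pmid}</a>'


def precision_recall(abstracts, threshold):
    """Precision and recall of the abstracts scored at or above threshold."""
    selected = [a for a in abstracts if a.score >= threshold]
    true_pos = sum(a.relevant for a in selected)
    total_relevant = sum(a.relevant for a in abstracts)
    precision = true_pos / len(selected) if selected else 0.0
    recall = true_pos / total_relevant if total_relevant else 0.0
    return precision, recall


# Made-up toy data for a single topic (PMIDs are not real citations).
SAMPLE = [
    ScoredAbstract(1, 0.9, True),
    ScoredAbstract(2, 0.8, False),
    ScoredAbstract(3, 0.6, True),
    ScoredAbstract(4, 0.3, True),
    ScoredAbstract(5, 0.1, False),
]

if __name__ == "__main__":
    for cutoff in (0.5, 0.2):
        p, r = precision_recall(SAMPLE, cutoff)
        print(f"threshold={cutoff}: precision={p:.2f} recall={r:.2f}")
    print(pubmed_link(SAMPLE[0].pmid))
```

On this toy data, lowering the threshold from 0.5 to 0.2 raises recall from 2/3 to 1.0 and changes precision from 2/3 to 0.75. In practice the two measures usually trade off against each other, which is why the cut-off would be chosen per topic.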