The Topic Scoring Engine
Developed in Partnership with the
European Molecular Biology Laboratory (EMBL)
Ulrich Reincke
SAS Germany
Copyright © 2003, SAS Institute Inc. All rights reserved.
Agenda: Topic Scoring Engine
! EMBL at a Glance
! The Biological Research Workflow
! The Business Problem
! The Text Mining Approach
! The Text Mining/Portal Solution
! Benefits
! Questions
The EMBL at a Glance
! established in 1974
! supported by sixteen countries including nearly
all of Western Europe and Israel.
! five facilities:
• the main Laboratory in Heidelberg (Germany),
• Outstations
− Hamburg (Germany),
− Grenoble (France),
− Hinxton (U.K.),
• external Research Programme in Monterotondo (Italy).
EMBL was founded with a four-fold mission:
! to conduct basic research in molecular
biology,
! to provide essential services to scientists
in its Member States,
! to provide high-level training to its staff,
students, and visitors,
! to develop new instrumentation for
biological research.
EMBL Features:
! Provides access to instruments for the study of
protein structures
! Hosts some of the world's oldest and biggest
databases of DNA and protein sequences
! Provides services operated by highly-trained
biologists who are simultaneously involved in
their own research
! The scientific networks created by EMBL alumni
have contributed to the development of a truly
international scientific community throughout
Europe.
EMBL in Numbers
! current annual budget: €95.6M,
• €59.1M subscriptions of the Member States,
• €14.5M from personnel contributions
• €22M from external sources.
! 810 scientific and technical staff
! 100 research faculty
! 140 postdoctoral fellows
! 170 PhD students
Development Partnership between SAS and the
Structural and Computational Biology Programme
Biological Research Workflow
A Biologist Somewhere in the World:
Thinks, Conjectures, Doubts
Experiments to Prove Ideas
Records and Publishes Results
Paper
Citation Entered in PubMed
PubMed / Medline, the premier bibliographic database
covering the fields of biology, medicine, nursing, dentistry, veterinary medicine, the health care system & preclinical science
! contains bibliographic citations and author abstracts
! More than 12 million citations since 1966
! 4,600 biomedical journals
! Journals published in 70 countries
! On average 2,000 new contributions per day
Biological Research Workflow
• Select Papers
• Read Papers
• Understand Papers
• Structure Data of Papers
• Enter Data in a Databank
• Relate Data to other Data
Query: Sequences
Data Cleansing
Download Citations
Provide Predicted Functions
Query: Keywords in PubMed
Location
Functional Information
[Figure: Biological Research Workflow diagram, shown again with the label "Business Problem"]
Business Problem: Document Retrieval
! Need to overcome the "keyword barrier"
! With an increasing number of objects (topics), the
task of updating will take more and more
resources.
! Need to filter 2,000 new abstracts per day and
retain only what is valuable to the database
! Need to learn from previous query experiences
! Need to move from single-user expertise to
process expertise, where query quality can be
monitored and improved
! At stake is the quality of the query service
The Text Mining Approach
1. Relate each paper's PMID (document no.) to a topic
2. Generate an HTML link to PubMed for each document
3. Download text from PubMed
4. Read text into SAS datasets
5. Programs allow modification of topic definitions via e-mail:
- create, delete topics
- split topics
- redefine topics (include, delete papers)
=> User friendliness, flexibility
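Steps 1 and 2 above could look roughly like the following Python sketch. It is an illustration only: the original programs were written in SAS, the function name `pubmed_link` and the topic table are hypothetical, and the URL pattern is today's public PubMed address, not necessarily the one used in 2003.

```python
from html import escape

# Assumption: the current public PubMed URL scheme (one page per PMID).
PUBMED_BASE = "https://pubmed.ncbi.nlm.nih.gov"


def pubmed_link(pmid: str, topic: str) -> str:
    """Return an HTML anchor pointing at the PubMed record for one PMID."""
    if not pmid.isdigit():
        raise ValueError(f"PMID must be numeric, got {pmid!r}")
    return f'<a href="{PUBMED_BASE}/{pmid}/">PMID {pmid} ({escape(topic)})</a>'


# Hypothetical topic table relating each PMID to a topic (step 1).
topic_by_pmid = {"12345678": "Topic A", "23456789": "Topic B"}

# One HTML link per document (step 2).
links = [pubmed_link(pmid, topic) for pmid, topic in topic_by_pmid.items()]
```

Keeping the PMID-to-topic relation in one table means a topic can be created, split or redefined by editing rows, without touching the link generation.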
1. Text mining pre-processing in batch (V8) with different parameter settings (parallelisation with MP Connect)
2. Generating 200-3000 concepts ("COLs" or SV) for each document in the training data
3. Variable selection for each topic (PROC CORR, PROC VARCLUS, PROC REG)
4. Estimating the final model for the profile of each topic (PROC LOGISTIC)
=> The best pre-processing, the most important variables and the best model are provided in batch in order to optimise recall & precision for each topic
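To make "optimise recall & precision for each topic" concrete, here is a minimal Python sketch of how the two measures can be computed from model scores and known topic labels, and how a cut-off might be chosen. The slides do not say which criterion was used; the F1 measure (harmonic mean of precision and recall) and all names and numbers below are assumptions for illustration.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall of the rule 'score >= threshold' against 0/1 labels."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def best_threshold(scores, labels, candidates):
    """Pick the candidate threshold with the highest F1 (an assumed criterion)."""
    def f1(threshold):
        p, r = precision_recall(scores, labels, threshold)
        return 2 * p * r / (p + r) if p + r else 0.0
    return max(candidates, key=f1)


# Made-up scores for five training abstracts; label 1 = belongs to the topic.
scores = [0.9, 0.8, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0]

precision, recall = precision_recall(scores, labels, threshold=0.5)
cutoff = best_threshold(scores, labels, candidates=[0.25, 0.5, 0.85])
```

A low cut-off favours recall (few relevant papers are missed), a high cut-off favours precision (few irrelevant papers are delivered); the trade-off can be set separately for each topic.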
Read new publications in XML format
Input by ELM consortium
Analysis, Scoring
Personalisation in Portal
Distribution
Evaluation and Feedback Loop by ELM consortium
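The scoring step for new publications can be sketched in Python as follows. This is a simplified illustration, not the SAS implementation: the sample XML is made up (it only mimics the `PubmedArticle`, `PMID` and `AbstractText` elements of PubMed's XML export), and the model weights are hypothetical term weights, whereas the engine described here scores on generated concepts with a logistic model per topic.

```python
import math
import xml.etree.ElementTree as ET

# Made-up sample in the shape of a PubMed XML export.
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>11111111</PMID>
      <Article><Abstract>
        <AbstractText>A short linear motif mediates protein binding.</AbstractText>
      </Abstract></Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>22222222</PMID>
      <Article><Abstract>
        <AbstractText>Crop yields vary with rainfall.</AbstractText>
      </Abstract></Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

# Hypothetical logistic model for one topic: intercept plus per-term weights.
INTERCEPT = -2.0
WEIGHTS = {"motif": 2.0, "binding": 1.0}


def read_abstracts(xml_text):
    """Map each PMID to its abstract text."""
    root = ET.fromstring(xml_text)
    abstracts = {}
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID")
        parts = [el.text or "" for el in article.iter("AbstractText")]
        abstracts[pmid] = " ".join(parts)
    return abstracts


def topic_score(text):
    """Logistic score in (0, 1): sigmoid of intercept plus summed term weights."""
    tokens = text.lower().replace(".", " ").split()
    z = INTERCEPT + sum(WEIGHTS.get(tok, 0.0) for tok in tokens)
    return 1.0 / (1.0 + math.exp(-z))


scores = {pmid: topic_score(text) for pmid, text in read_abstracts(SAMPLE_XML).items()}
```

Documents whose score exceeds the topic's cut-off would be passed on for distribution; the consortium's evaluation of those documents can then be fed back as new training labels.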
The Text Mining / Portal Solution
Demo Report
! Topic Scoring Engine
Technical Overview on the Solution
WWW, Intranet, Extranet
Text Retrieval
File system, Document Management Systems, RDBMS
Datasets
Training & Scoring
Data Mart with Scores and Links to the Scored Documents
Personalized Text Retrieval
Information Delivery
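One way to picture the data mart and the personalized retrieval on top of it is the Python sketch below. The record layout, the names `ScoredDocument` and `personalised_view`, the sample rows and the default cut-off are all assumptions for illustration; the slides only state that the data mart holds scores and links to the scored documents.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredDocument:
    """One hypothetical data mart row: a document scored against one topic."""
    pmid: str
    topic: str
    score: float
    link: str


def personalised_view(data_mart, subscribed_topics, min_score=0.5):
    """Return the rows for a user's topics above the cut-off, best score first."""
    hits = [
        doc for doc in data_mart
        if doc.topic in subscribed_topics and doc.score >= min_score
    ]
    return sorted(hits, key=lambda doc: doc.score, reverse=True)


# Made-up rows for illustration.
data_mart = [
    ScoredDocument("11111111", "Topic A", 0.73, "https://pubmed.ncbi.nlm.nih.gov/11111111/"),
    ScoredDocument("22222222", "Topic A", 0.12, "https://pubmed.ncbi.nlm.nih.gov/22222222/"),
    ScoredDocument("33333333", "Topic B", 0.91, "https://pubmed.ncbi.nlm.nih.gov/33333333/"),
    ScoredDocument("44444444", "Topic A", 0.88, "https://pubmed.ncbi.nlm.nih.gov/44444444/"),
]

view = personalised_view(data_mart, subscribed_topics={"Topic A"})
```

Because scoring happens once in batch and the portal only filters and sorts stored rows, each user's view stays cheap to compute.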
Benefits Topic Scoring Engine:
! TSE frees resources from querying and speeds up
literature queries
! TSE provides automatic information on rapid
literature developments in any topic profile
! EMBL’s databases will better reflect the literature
actually available in PubMed/Medline.
! The Topic Scoring Engine goes far beyond keywords:
even relevant abstracts of papers that have been
assigned by accident to a wrong or misleading
keyword will be flagged.
Summary: Topic Scoring Engine
operates like an automatic gravel mine
Questions?