Visualization of Relational Text Information for Biomedical Knowledge Discovery
James W. Cooper
IBM T J Watson Research Center
Hawthorne, NY
Overview
• Prior work
• Java-based text mining
• Computation of unnamed relations
• Graphical display of relations
Relations between terms


• Noun phrase co-occurrence statistics [Roark, Charniak]
• Choose seed words and look for terms near them. [Brin] [Gravano, Agichtein]
– Repeat
• Biomedical domain
– Blaschke used dictionary of common verbs
– Pustejovsky found inhibit relations
• Stevens, Palakal, Mostafa
– Detected abstract-wide co-occurrence using dictionary of genes and useful verbs.
Graphical Displays
• BioLayout – protein similarity
• ProtInAct – interactive system using yFiles
• Zhang – interactive 3D system
• Jenssen – gene network
• Leroy – GeneScene

BioLayout – Enright and Ouzounis
Five related protein families and their corresponding relationships.
Spheres represent proteins and lines represent protein similarities.
ProInAct – Spencer and Bennett
Proteins clustered by functional interaction
Zhang – Protein interaction mapping
Jenssen – A literature network
Lines connect genes that have co-occurred in 1 or more papers.
Leroy – GeneScene
What would we like to do?

• Find scientifically meaningful connections between important terms.
– Such as Swanson’s Raynaud’s disease – fish oil connection.
• Allow exploration of relations by user.
• Filter the relations by ontology or term types
• Perform path analysis
• Let the user vary the graphical display.
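One way to picture path analysis is as a shortest-path search over the graph of stored relations. The sketch below is only an illustration, not the system's method: the class name `RelationPaths`, its methods, and the sample terms in `main` are all assumptions. It uses breadth-first search to find the shortest chain of relations between two terms, as in a Swanson-style indirect connection.

```java
import java.util.*;

/** Hypothetical sketch: breadth-first search for the shortest chain of relations. */
class RelationPaths {
    // Undirected adjacency list: term -> directly related terms.
    private final Map<String, Set<String>> graph = new HashMap<>();

    void addRelation(String a, String b) {
        graph.computeIfAbsent(a, k -> new LinkedHashSet<>()).add(b);
        graph.computeIfAbsent(b, k -> new LinkedHashSet<>()).add(a);
    }

    /** Returns the shortest path from start to goal, or an empty list if none exists. */
    List<String> shortestPath(String start, String goal) {
        Map<String, String> parent = new HashMap<>(); // visited term -> predecessor
        Deque<String> queue = new ArrayDeque<>();
        parent.put(start, null);
        queue.add(start);
        while (!queue.isEmpty()) {
            String current = queue.poll();
            if (current.equals(goal)) {
                // Walk the predecessor links back to the start.
                LinkedList<String> path = new LinkedList<>();
                for (String t = goal; t != null; t = parent.get(t)) {
                    path.addFirst(t);
                }
                return path;
            }
            for (String next : graph.getOrDefault(current, Collections.<String>emptySet())) {
                if (!parent.containsKey(next)) {
                    parent.put(next, current);
                    queue.add(next);
                }
            }
        }
        return Collections.emptyList();
    }

    public static void main(String[] args) {
        // Illustrative toy data only; not taken from the patent collections.
        RelationPaths g = new RelationPaths();
        g.addRelation("fish oil", "blood viscosity");
        g.addRelation("blood viscosity", "Raynaud's disease");
        System.out.println(g.shortestPath("fish oil", "Raynaud's disease"));
    }
}
```

Breadth-first search returns a path with the fewest intermediate terms; a weighted search would be needed to prefer stronger relations.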

Data we analyzed

• Two sets of patent data
– 584 patents on Viagra and phosphodiesterase inhibitors.
– 1514 patents on quinolones (like Cipro)
• Recognized major technical terms in each patent.
• Filtered organic chemical nomenclature.

The Talent text mining system

• Text Analysis and Language Engineering Tools
– Finds multiword noun phrases
– Does shallow parse
– Can extract NPs and VGs
– As well as all other sentence parts
The JTalent Library

• Java class library with JNI interface
– To Talent DLL
• Creates database load files of terms
– Paragraph
– Sentence
– Offset
– Term type (NP, VG)
TalentShow Demo
The KSS Library

• Java class library of functions for
– Accessing a database (DB2, Access)
– Manipulating a search engine
– Manipulating tables of information created by JTalent.
Database Tables

• Documents
– Title, author, URL, ID
• TermDocs
– Term
– Paragraph
– Sentence
– Offset
– Type
• Dictionary of terms, types and IDs
– Such as MeSH
Computing term information
• Compute unique terms from TermDocs
• Compute frequency
• Compute salience
– Based on frequency
– Number of docs they appear in more than once
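The two inputs to salience named above can be sketched as simple counts over TermDocs rows. This is a minimal illustration under assumptions: the class `TermStats`, the `TermDoc` fields, and the method names are hypothetical, and the slide does not say how the two counts are combined into a salience score, so the sketch stops at the counts.

```java
import java.util.*;

/** Hypothetical sketch: per-term counts that could feed a salience measure. */
class TermStats {
    /** One TermDocs row, reduced to the fields needed here (names are assumptions). */
    static final class TermDoc {
        final int docId;
        final String term;
        TermDoc(int docId, String term) {
            this.docId = docId;
            this.term = term;
        }
    }

    /** Total number of occurrences of each unique term. */
    static Map<String, Integer> frequency(List<TermDoc> rows) {
        Map<String, Integer> freq = new TreeMap<>();
        for (TermDoc r : rows) {
            freq.merge(r.term, 1, Integer::sum);
        }
        return freq;
    }

    /** Number of documents in which each term occurs more than once. */
    static Map<String, Integer> multiOccurrenceDocs(List<TermDoc> rows) {
        // term -> (docId -> occurrences of the term in that document)
        Map<String, Map<Integer, Integer>> perDoc = new TreeMap<>();
        for (TermDoc r : rows) {
            perDoc.computeIfAbsent(r.term, k -> new HashMap<>())
                  .merge(r.docId, 1, Integer::sum);
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, Map<Integer, Integer>> e : perDoc.entrySet()) {
            int docs = 0;
            for (int count : e.getValue().values()) {
                if (count > 1) {
                    docs++;
                }
            }
            result.put(e.getKey(), docs);
        }
        return result;
    }
}
```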
Compute term relations
• Named relations based on abbreviation expansions.
• Unnamed relations based on proximity, with weight based on how frequently they occur near each other.
• Mutual information weight:

$$m = \log\left(\frac{\mathit{totalterms} \times \mathit{paircount}}{\mathit{freq1} \times \mathit{freq2}}\right)$$
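The mutual information weight can be sketched in a few lines of Java. The assumptions here are mine rather than the slide's: the class and method names are hypothetical, "proximity" is taken to mean occurring in the same sentence, and the logarithm is the natural log, since the slide gives neither a window size nor a base.

```java
import java.util.*;

/** Hypothetical sketch: co-occurrence counts and the mutual information weight. */
class CooccurrenceWeights {
    /**
     * Counts how many sentences contain each unordered pair of terms.
     * Keys have the form "termA | termB" with the two terms in sorted order.
     */
    static Map<String, Integer> pairCounts(List<List<String>> sentences) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> sentence : sentences) {
            // De-duplicate and sort so each pair is counted once per sentence.
            List<String> unique = new ArrayList<>(new TreeSet<>(sentence));
            for (int i = 0; i < unique.size(); i++) {
                for (int j = i + 1; j < unique.size(); j++) {
                    counts.merge(unique.get(i) + " | " + unique.get(j), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    /** m = log((totalTerms * pairCount) / (freq1 * freq2)), using the natural log. */
    static double weight(long totalTerms, long pairCount, long freq1, long freq2) {
        if (totalTerms <= 0 || pairCount <= 0 || freq1 <= 0 || freq2 <= 0) {
            throw new IllegalArgumentException("all counts must be positive");
        }
        return Math.log((double) totalTerms * pairCount / ((double) freq1 * freq2));
    }
}
```

With these definitions a pair that co-occurs exactly as often as chance would predict gets a weight of zero, and pairs that co-occur more often than chance get positive weights.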


Tuning Computed relations
• Select only terms above a salience threshold.
• Keep only relations in which one or both terms are members of an ontology.
• Store relations in a database table for rapid access:
Term | weight | term
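The tuning steps above might look like the following filter. It is a sketch under assumptions: `RelationTuning`, `Relation`, and `tune` are hypothetical names, salience scores are assumed to be available as a map, and applying both criteria together is my reading of the slide rather than something it states.

```java
import java.util.*;

/** Hypothetical sketch: keep only salient relations that touch an ontology. */
class RelationTuning {
    /** One row of the relations table: term | weight | term. */
    static final class Relation {
        final String term1;
        final double weight;
        final String term2;
        Relation(String term1, double weight, String term2) {
            this.term1 = term1;
            this.weight = weight;
            this.term2 = term2;
        }
    }

    static List<Relation> tune(List<Relation> relations,
                               Map<String, Double> salience,
                               double threshold,
                               Set<String> ontologyTerms) {
        List<Relation> kept = new ArrayList<>();
        for (Relation r : relations) {
            // Both terms must be above the salience threshold (unknown terms score 0).
            boolean salient = salience.getOrDefault(r.term1, 0.0) > threshold
                    && salience.getOrDefault(r.term2, 0.0) > threshold;
            // At least one term must be a member of the ontology.
            boolean inOntology = ontologyTerms.contains(r.term1)
                    || ontologyTerms.contains(r.term2);
            if (salient && inOntology) {
                kept.add(r);
            }
        }
        return kept;
    }
}
```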

Original System
• Visual client
• SOAP server
– Queries database to get relations
– Round trip for each new query
• Instead, we export the data for the user to visualize as they wish.
Exporting relations


• Save relations and ontology information in an XML file.
<relation>
  <term>
    <iq>78</iq>
    <source>MeSH</source>
    <relationDocuments>
      <doc>34</doc>
    </relationDocuments>
  </term>
  <term> </term>
</relation>
• This XML file is a portable version of the computed relations that we can then use with any number of viewers.
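A viewer could read such a file with the standard Java DOM parser. The sketch below is an assumption-laden illustration: the class `RelationXmlReader` and its `TermInfo` type are hypothetical, and only the element names visible on the slide (`relation`, `term`, `iq`, `source`, `doc`) are used. Any other elements in the real export are unknown and are ignored.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical sketch: read exported relations with the JDK DOM parser. */
class RelationXmlReader {
    /** Data found under one <term> element; iq and source are null when absent. */
    static final class TermInfo {
        final Integer iq;
        final String source;
        final List<Integer> docIds;
        TermInfo(Integer iq, String source, List<Integer> docIds) {
            this.iq = iq;
            this.source = source;
            this.docIds = docIds;
        }
    }

    /** Returns one list of TermInfo per <relation> element in the XML text. */
    static List<List<TermInfo>> read(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document dom = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<List<TermInfo>> relations = new ArrayList<>();
            NodeList relationNodes = dom.getElementsByTagName("relation");
            for (int i = 0; i < relationNodes.getLength(); i++) {
                Element relation = (Element) relationNodes.item(i);
                List<TermInfo> terms = new ArrayList<>();
                NodeList termNodes = relation.getElementsByTagName("term");
                for (int j = 0; j < termNodes.getLength(); j++) {
                    Element term = (Element) termNodes.item(j);
                    String iqText = firstText(term, "iq");
                    List<Integer> docIds = new ArrayList<>();
                    NodeList docNodes = term.getElementsByTagName("doc");
                    for (int k = 0; k < docNodes.getLength(); k++) {
                        docIds.add(Integer.parseInt(docNodes.item(k).getTextContent().trim()));
                    }
                    terms.add(new TermInfo(
                            iqText == null ? null : Integer.valueOf(iqText),
                            firstText(term, "source"),
                            docIds));
                }
                relations.add(terms);
            }
            return relations;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse relations XML", e);
        }
    }

    /** Trimmed text of the first descendant with the given tag, or null if none. */
    private static String firstText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }
}
```

Because an XML document needs a single root element, a real export file would wrap its `<relation>` elements in some enclosing element; the reader above does not depend on that element's name.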
A Graphical Relations Viewer
• Creates a Java Relations object for each relation it reads from the XML file.
• Inserts them into a Trie structure based on lowercased first term.
– If there is already a Relation at that point, it adds them to a Vector for that term.
• Creates an alphabetical list of all terms in a 2nd Trie.
Using the Viewer


• When you enter part of a term, it shows all terms starting with that fragment in the left list box.
• When you click on a term, it shows all its relations in the right list box.
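The lookup behind the two list boxes can be sketched with a small trie keyed on the lowercased first term. This is not the viewer's code: `TermTrie` and its methods are hypothetical, relations are represented as plain strings, and a single trie with an `ArrayList` per term stands in for the two Tries and the Vector mentioned on the slides.

```java
import java.util.*;

/** Hypothetical sketch: trie keyed on the lowercased first term of each relation. */
class TermTrie {
    private static final class Node {
        // TreeMap keeps children sorted, so traversal yields terms alphabetically.
        final Map<Character, Node> children = new TreeMap<>();
        final List<String> relations = new ArrayList<>();
        String term; // set when a complete term ends at this node
    }

    private final Node root = new Node();

    /** Stores a relation under its first term; repeated terms share one list. */
    void insert(String firstTerm, String relation) {
        String key = firstTerm.toLowerCase(Locale.ROOT);
        Node node = root;
        for (char c : key.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.term = key;
        node.relations.add(relation);
    }

    /** All stored terms that start with the fragment, in alphabetical order. */
    List<String> termsStartingWith(String fragment) {
        List<String> out = new ArrayList<>();
        Node node = find(fragment);
        if (node != null) {
            collect(node, out);
        }
        return out;
    }

    /** All relations stored under exactly this term (case-insensitive). */
    List<String> relationsOf(String term) {
        Node node = find(term);
        return node == null
                ? Collections.<String>emptyList()
                : Collections.unmodifiableList(node.relations);
    }

    private Node find(String key) {
        Node node = root;
        for (char c : key.toLowerCase(Locale.ROOT).toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return null;
            }
        }
        return node;
    }

    private void collect(Node node, List<String> out) {
        if (node.term != null) {
            out.add(node.term);
        }
        for (Node child : node.children.values()) {
            collect(child, out);
        }
    }
}
```

In this sketch `termsStartingWith` would fill the left list box as the user types, and `relationsOf` would fill the right one when a term is clicked.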
Lexical Navigation

• Displays relations between terms graphically and allows you to explore them without formulating a specific query.
Possible enhancements
• Show only terms belonging to an ontology.
• Show only higher IQ terms
• Show the documents the relations occur in.
• Show the ontology reference.
• Show computed paths
• Show more kinds of named relations.
– Inhibits, expresses
Evaluations of Information Visualization
• Few, if any, graphical displays have been evaluated thus far for effectiveness.
• Usability studies are hard to construct and carry out.
• Intuition seems to show
– that exploration may result in discoveries.
– Relations more than one step apart seem best displayed graphically.
• It remains to be shown that such visualizations are actually useful.
Differences in Intent

• Displays may represent information your system has discovered.
– Gene – protein relations
• Or they may represent data from which the user may discover new information.
– New 2nd or 3rd order relationships
• These are rather different applications of visualization technology.
Summary
• Java-based text mining system
• Database of terms and positions
• Computation of relations
• Export as XML
• Graphical relations viewer
• The value of such visual interfaces has not yet been established.

Acknowledgements
• Bhavani Iyer – XML export
• Eric Brown – DictMatcher hash code
• Daniel Tunkelang – graphical layout
• Bob Mack – paper suggestions
