Machine Learning of Text Analysis Rules for Clinical Records
Stephen Soderland, David Aronow*, David Fisher, Jonathan Aseltine, Wendy Lehnert
University of Massachusetts, Amherst MA
*Harvard Community Health Plan, Brookline, MA
{soderlan aronow dfisher aseltine lehnert}@cs.umass.edu
Automatically extracting information in clinical
free text can make available an information
resource that is largely untapped. This paper
describes the BADGER text analysis system,
which identifies concepts contained in a text
based on linguistic context. A key component of
BADGER is the CRYSTAL dictionary induction
system that automatically learns text analysis
rules from a set of training documents. Each of
these rules is generalized as far as possible
without producing errors, so that a minimum
number of dictionary entries cover the positive
training instances.
INTRODUCTION
Much of the vital information in clinical records
is in narrative form. Although this information
is accessible by searching the codes with which
the text is associated, the meaning of the text is
relatively inaccessible in automated systems.
The growing interest in automated and integrated
medical records has spurred intense research into
machine learning for indexing, abstracting and
understanding clinical text. While most of this
work has focused on areas such as radiology and
pathology reports,1 some researchers are making
progress in the automated classification of
clinical free text to code.2,3
The National Center for Intelligent Information
Retrieval (CIIR) at the University of
Massachusetts in Amherst is working on a
strategic research and development project to
automate, to the extent possible, the assignment
of ICD-9-CM codes to hospital discharge
summaries. The current task is to automate
sufficient understanding of the content of the
summaries to label phrases in them that contain
information concerning (1) diagnoses and (2)
signs or symptoms of disease. The categories
are divided into subcategories:
Diagnosis:      Sign or Symptom:
confirmed       present
ruled out       absent
suspected       presumed
pre-existing    unknown
past            history
Looking up isolated word meanings will not
suffice to make these distinctions. Consider the
word "fever" in sentences 1 and 2 in Figure 1. It
should be classified as sign or symptom with
subtype "present" in sentence 1, and with
subtype "absent" in sentence 2. An automated
system will need rules about linguistic contexts
that distinguish absent from present symptoms,
rules that are specific to the writing style of
hospital discharge summaries.
The problem of classifying phrases becomes even
more difficult when no single word in isolation
conveys the target concept. Examples of this are
the phrase "menstrual irregularities" in sentence 3
and "cigarette smoking" in sentence 4, both of
which represent the concept absent sign or
symptom. Sometimes the classification of a
phrase is contrary to the usual meaning of a word
in isolation. In sentence 5, the word "pain",
which in isolation is a sign or symptom, is part
of a phrase referring to a pre-existing diagnosis.
Figure 1. Sample discharge summary sentences
1) He continues to have fever and chills.
2) The patient denies fever, chills or dysuria.
3) The patient denies any menstrual irregularities.
4) The patient denies cigarette smoking.
5) PAST HISTORY: Chronic lower back pain.
Two systems developed by the Natural Language
Processing Group at CIIR are being applied to
this task: BADGER, a sentence analyzer, and
CRYSTAL, a dictionary induction tool.
BADGER is a general-purpose tool that can be
used to analyze newswire stories as easily as
medical records. It relies on a semantic lexicon
and a dictionary of text analysis rules which are
specific to discharge summaries. CRYSTAL
uses a machine learning approach to derive the
rules from a training set of discharge summaries,
finding features of the local linguistic context
that reliably identify the target concepts. This
paper introduces BADGER and CRYSTAL and
presents the results of our work to date.
BADGER SENTENCE ANALYZER
The conceptual content of each relevant phrase in
the text is represented by BADGER as a case
frame called a concept node (CN). As BADGER
analyzes each sentence in the text, it uses a set of
rules called CN definitions to determine whether
to create a concept node for each segment of text.
In these experiments, concept nodes are only
created for phrases referring to diagnoses or to
signs or symptoms of disease.
Each CN definition specifies a set of syntactic
and semantic constraints that must be satisfied
for the definition to be applied. The syntactic
constraints operate on the subject, verb phrase,
direct or indirect objects, and prepositional
phrases. Any of these constituents may be tested
for a sequence of specific words, for specific
semantic classes in the head noun of a phrase, or
for specific semantic classes in the modifiers of a
phrase. The verb can be further constrained with
respect to active or passive voice.
The example shown in Figure 2 is a CN
definition that identifies references to absent sign
or symptom in the direct object. It has a set of
syntactic and semantic constraints that must be
met for the CN definition to apply:
a) the subject must include the word "patient"
(of semantic class <Patient or Disabled
Group>)
b) the verb must include the word "denies" in
the active voice
c) the direct object must have the semantic
class <Sign or Symptom>.
Figure 2. CN Definition for Absent Sign or Symptom in the Direct Object
CN-type: Sign or Symptom
Subtype: Absent
Extract from Direct Object
Active voice verb
Subject constraints:
    words include "PATIENT"
    head class <Patient or Disabled Group>
Verb constraints:
    words include "DENIES"
Direct Object constraints:
    head class <Sign or Symptom>
This CN definition would extract "any episodes
of nausea" from the sentence "The patient denies
any episodes of nausea". It would fail to apply
to the sentence "Patient denies a history of
asthma", since asthma is of semantic class
<Disease or Syndrome>, which is not a subclass
of <Sign or Symptom>. Other CN definitions
would be needed to handle "She denies any
episode of nausea" or "The patient has not had
any episodes of nausea". Several hundred CN
definitions may be required to cover most
references to absent signs or symptoms.
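The way a CN definition is tested against a parsed clause can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not BADGER's implementation: the names `Constituent`, `Constraint`, `CNDefinition`, and `apply_definition` are hypothetical, the clauses are parsed by hand, and the `PARENT` table is a small illustrative fragment of a semantic hierarchy.

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative fragment of a semantic hierarchy (child -> parent).
# Assumed for this sketch; it is not the lexicon used by BADGER.
PARENT = {
    "Sign or Symptom": "Finding",
    "Disease or Syndrome": "Pathologic Function",
    "Patient or Disabled Group": "Group",
}

def is_a(cls, ancestor):
    """True if cls equals ancestor or lies below it in the hierarchy."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = PARENT.get(cls)
    return False

@dataclass
class Constituent:
    words: list                                   # upper-cased tokens
    head_classes: set = field(default_factory=set)  # classes of the head noun

@dataclass
class Constraint:
    words: list = field(default_factory=list)     # required word sequence
    head_class: Optional[str] = None              # required head class

def contains_sequence(tokens, seq):
    """True if seq occurs as a contiguous run inside tokens."""
    n = len(seq)
    return any(tokens[i:i + n] == seq for i in range(len(tokens) - n + 1))

def satisfies(constituent, constraint):
    if constituent is None:
        return False
    if not contains_sequence(constituent.words, constraint.words):
        return False
    if constraint.head_class is None:
        return True
    return any(is_a(c, constraint.head_class) for c in constituent.head_classes)

@dataclass
class CNDefinition:
    cn_type: str
    subtype: str
    extract_from: str          # slot whose text is extracted
    voice: Optional[str]       # "active", "passive", or None for no constraint
    constraints: dict          # slot name -> Constraint

def apply_definition(defn, clause):
    """Return the extracted phrase, or None if any constraint fails."""
    if defn.voice is not None and clause.get("voice") != defn.voice:
        return None
    for slot, constraint in defn.constraints.items():
        if not satisfies(clause.get(slot), constraint):
            return None
    target = clause.get(defn.extract_from)
    return " ".join(target.words).lower() if target is not None else None

# The definition of Figure 2, written in this hypothetical representation.
ABSENT_SYMPTOM = CNDefinition(
    "Sign or Symptom", "Absent", "direct_object", "active",
    {
        "subject": Constraint(["PATIENT"], "Patient or Disabled Group"),
        "verb": Constraint(["DENIES"]),
        "direct_object": Constraint(head_class="Sign or Symptom"),
    },
)

# Hand-built parses of two sentences discussed in the text.
clause_nausea = {
    "voice": "active",
    "subject": Constituent(["THE", "PATIENT"], {"Patient or Disabled Group"}),
    "verb": Constituent(["DENIES"]),
    "direct_object": Constituent(["ANY", "EPISODES", "OF", "NAUSEA"], {"Sign or Symptom"}),
}
clause_asthma = {
    "voice": "active",
    "subject": Constituent(["PATIENT"], {"Patient or Disabled Group"}),
    "verb": Constituent(["DENIES"]),
    "direct_object": Constituent(["A", "HISTORY", "OF", "ASTHMA"], {"Disease or Syndrome"}),
}

print(apply_definition(ABSENT_SYMPTOM, clause_nausea))
print(apply_definition(ABSENT_SYMPTOM, clause_asthma))
```

In this sketch the first clause yields the direct object and the second yields nothing, because <Disease or Syndrome> is not below <Sign or Symptom> in the assumed hierarchy.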
We are using a semantic lexicon and semantic
hierarchy derived from the Unified Medical
Language System (UMLS) MetaThesaurus and
Semantic Network. This gives us a semantic
hierarchy with 133 semantic classes. We derived
a lexicon of 95,000 terms and phrases from the
MetaThesaurus that map into classes in the
semantic hierarchy. A statistically-based
semantic disambiguation component is added
since many of these terms and phrases have
multiple semantic classes.
Figure 3 shows another example of a CN
definition, which identifies pre-existing
diagnoses with a set of constraints that could be
summarized as "... was diagnosed with
recurrence of <body_part> <disease>":
a) the verb phrase must include the word
"DIAGNOSED" in the passive voice
b) there must be a prepositional phrase with
preposition "WITH" and including the
sequence of words "recurrence of" as well as
a head noun whose semantic class is a
<disease_or_syndrome> and a modifying
term whose class is a <body_part>
Figure 3. CN Definition for Disease Recurrence
CN-type: Diagnosis
Subtype: Pre-existing
Extract from Prep. Phrase "WITH"
Passive voice verb
Verb constraints:
    words include "DIAGNOSED"
Prep. Phrase constraints:
    preposition = "WITH"
    words include "RECURRENCE OF"
    modifier class <Body Part or Organ>
    head class <Disease or Syndrome>
This CN definition applies to a sentence such as
"The patient was diagnosed with a recurrence of
laryngeal cancer". Since there are no constraints
on the subject, the text segment is free to have
any subject, including a relative pronoun or an
omitted subject.
Constructing a dictionary of several hundred
such CN definitions would be a laborious task,
requiring someone who combines clinical
knowledge with a deep understanding of the
BADGER sentence analyzer. The following
section shows how CRYSTAL eliminates this
manual knowledge engineering by automatically
inducing CN definitions.
CRYSTAL DICTIONARY INDUCTION TOOL
CRYSTAL derives a dictionary of CN
definitions from a set of training documents.
Each clause that contains a phrase with relevant
information serves as a motivating example for
an initial CN definition that requires the exact
sequence of words and the exact set of semantic
classes as in the motivating example.
CRYSTAL then relaxes these constraints to
merge similar CN definitions. The CN
definitions in CRYSTAL's final dictionary are
generalized as much as possible without
producing extraction errors on the training
corpus.
The first step in dictionary creation is annotation
of training texts by a clinician. Each phrase that
contains information to be extracted is labeled
with an appropriate CN type and subtype. The
annotated texts are then segmented by BADGER
to create a set of training instances. Each
instance is a text segment, generally a simple
clause, some of whose syntactic constituents
may be tagged as positive instances of a
particular CN type and subtype.
Figure 4 shows the initial CN definition derived
from the sentence fragment "Unremarkable with
the exception of mild shortness of breath and
chronically swollen ankles." The domain expert
has marked "shortness of breath" and "swollen
ankles" with CN type sign or symptom and
subtype present. BADGER's syntactic analysis
has the word "unremarkable" as the subject and
the complex noun phrase "the exception of mild
shortness of breath and chronically swollen
ankles" as the object of a single prepositional
phrase.
Because the word "unremarkable" was not found
in the semantic lexicon, it gets no semantic
constraint. When a syntactic constituent is made
up of multiple simple noun phrases ("the
exception", "mild shortness_of_breath",
"chronically swollen ankles"), the semantic
constraints become a list of classes, all of which
must be satisfied.
Figure 4. An Initial CN Definition
CN-type: Sign or Symptom
Subtype: Present
Extract from Prep. Phrase "WITH"
Verb = <NULL>
Subject constraints:
    words include "UNREMARKABLE"
Prep. Phrase constraints:
    preposition = "WITH"
    words include "THE EXCEPTION OF MILD SHORTNESS_OF_BREATH AND CHRONICALLY SWOLLEN ANKLES"
    modifier class <Sign or Symptom>
    head class <Sign or Symptom>, <Body Location or Region>
The initial CN definitions created from a training
example are designed to operate properly on the
motivating example, but are too tightly
constrained to be useful on new sentences.
CRYSTAL must generalize each initial CN
definition to increase its coverage by relaxing
some of its constraints. Figure 5 summarizes
the CRYSTAL algorithm.
CRYSTAL decides which constraints are
essential by finding a similar CN definition,
retaining the constraints the two have in
common, and dropping all other constraints. If
word constraints from the two definitions have
an intersecting string of words, the unified word
constraint is that intersecting string. Otherwise
the word constraint is dropped. Unifying two
class constraints may involve moving up the
semantic hierarchy to find a common ancestor of
classes in the two constraints.
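The unification step described above can be illustrated with a short Python sketch. It rests on several assumptions that the text does not settle: the "intersecting string of words" is taken to be the longest contiguous run of tokens the two constraints share, the common ancestor is taken to be the lowest one, and the `PARENT` table is a toy hierarchy. All function names are hypothetical.

```python
# Toy semantic hierarchy (child -> parent), assumed for illustration only.
PARENT = {
    "Sign or Symptom": "Finding",
    "Laboratory or Test Result": "Finding",
    "Finding": "Entity",
}

def common_word_string(a, b):
    """Longest contiguous run of tokens shared by a and b ([] if none)."""
    best = []
    for i in range(len(a)):
        for j in range(len(b)):
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k > len(best):
                best = a[i:i + k]
    return best

def ancestors(cls):
    """The class itself followed by its ancestors, nearest first."""
    chain = []
    while cls is not None:
        chain.append(cls)
        cls = PARENT.get(cls)
    return chain

def common_ancestor(c1, c2):
    """Lowest class that is an ancestor of (or equal to) both c1 and c2."""
    seen = set(ancestors(c1))
    for c in ancestors(c2):
        if c in seen:
            return c
    return None

def unify(d1, d2):
    """Keep only the constraints two definitions share.

    Each definition maps a slot name to {"words": [...], "cls": str or None}.
    A slot whose word and class constraints both vanish is dropped.
    """
    unified = {}
    for slot in d1.keys() & d2.keys():
        words = common_word_string(d1[slot]["words"], d2[slot]["words"])
        c1, c2 = d1[slot].get("cls"), d2[slot].get("cls")
        cls = common_ancestor(c1, c2) if c1 and c2 else None
        if words or cls:
            unified[slot] = {"words": words, "cls": cls}
    return unified

d1 = {
    "subject": {"words": ["THE", "PATIENT"], "cls": None},
    "verb": {"words": ["DENIES"], "cls": None},
    "direct_object": {"words": ["ANY", "EPISODES", "OF", "NAUSEA"],
                      "cls": "Sign or Symptom"},
}
d2 = {
    "subject": {"words": ["SHE"], "cls": None},
    "verb": {"words": ["DENIES"], "cls": None},
    "direct_object": {"words": ["FEVER"], "cls": "Sign or Symptom"},
}
print(unify(d1, d2))
```

Here the subject constraint disappears, the verb keeps the word "DENIES", and the direct object keeps only its semantic class.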
Figure 5. The CRYSTAL Algorithm
Initialize Dictionary and Training Instances Database
Do for each initial CN definition in Dictionary
    D = an initial CN definition
    Loop
        D' = the most similar CN definition to D
        U = the unification of D and D'
        Test the coverage of U in Training Instances
        If the error rate of U > Tolerance
            exit loop
        Set D = U
    Add D to the Dictionary
Return the Dictionary
The resulting generalization is tested for accuracy
on the set of training instances. If the
generalized CN definition is valid, the process is
repeated, until further relaxation would lead to a
bad CN definition. A user-defined error
tolerance parameter is used to decide whether a
proposed CN definition has an acceptable error
rate.
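The generalize-and-test loop can be sketched in Python under strong simplifications. In this sketch, which is not the CRYSTAL implementation, a CN definition is reduced to a set of required feature strings, unification is set intersection, similarity is the number of shared features, and the error rate is the fraction of covered training instances that are negative. Skipping duplicate results and refusing to generalize to an empty definition are additional assumptions. All names are hypothetical.

```python
def covers(defn, features):
    """A definition covers an instance if all its required features are present."""
    return defn <= features

def error_rate(defn, instances):
    """Fraction of covered instances that are negative (0.0 if none covered)."""
    covered = [positive for features, positive in instances if covers(defn, features)]
    if not covered:
        return 0.0
    return covered.count(False) / len(covered)

def most_similar(d, others):
    """Definition sharing the most features with d, among those that would
    strictly generalize d without emptying it; None if there is no such one."""
    candidates = [o for o in others if (o & d) and not d <= o]
    return max(candidates, key=lambda o: len(o & d), default=None)

def induce(instances, tolerance):
    """Generalize each initial definition until the error tolerance is exceeded."""
    initial = [frozenset(f) for f, positive in instances if positive]
    dictionary = []
    for d in initial:
        while True:
            d_prime = most_similar(d, initial)
            if d_prime is None:
                break
            u = d & d_prime                      # unification as intersection
            if error_rate(u, instances) > tolerance:
                break
            d = u                                # accept the generalization
        if d not in dictionary:                  # assumption: skip duplicates
            dictionary.append(d)
    return dictionary

# Toy training instances for "sign or symptom, absent": (features, is_positive).
instances = [
    ({"subj:PATIENT", "verb:DENIES", "obj:FEVER"}, True),
    ({"subj:SHE", "verb:DENIES", "obj:NAUSEA"}, True),
    ({"subj:PATIENT", "verb:HAS", "obj:FEVER"}, False),
]
print(induce(instances, tolerance=0.0))
```

With these three toy instances, both initial definitions generalize to the single requirement that the verb be "DENIES", which covers both positive instances and no negative one.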
This discussion has omitted some issues of how
to find the most similar CN definitions or test
the coverage and error rate of a proposed CN
definition against the training instances
efficiently. Please refer to Soderland et al. for
more detail.4
Figure 5. Effect of Error Tolerance Setting
on Performance
EXPERIMENTAL RESULTS
BADGER has been tested on a corpus of 385
hospital discharge summaries, averaging just
under one thousand words each, which produced
14,719 training instances with 2,122 positive
instances of "diagnosis" and 6,047 positive
instances of "sign or symptom". CRYSTAL
induced a dictionary of all CN types and
subtypes from this training set in 10 minutes of
clock time on a DEC ALPHA AXP 3000 using
45 MB of memory.
The experimental methodology was to repeatedly
partition the annotated texts into a training set
and a blind test set. Dictionaries for each CN
type and subtype were induced from the training
set and then evaluated on the test set.
Performance is measured here in terms of recall
and precision, where recall is the percentage of
possible phrases that the dictionary extracts and
precision is the percentage correct of the extracted
phrases. For example, if there are 5,000 phrases
that could possibly be extracted from the test set,
but the dictionary extracts only 3,000, recall is
60%. If the dictionary extracts 4,000 phrases, of
which only 3,000 are correct, precision is 75%.
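The two measures can be written out as a short Python sketch that reproduces the worked example above. The function and parameter names are ours, and we assume that the 3,000 phrases counted toward recall are the correctly extracted ones.

```python
def recall(correct_extracted, possible):
    """Percentage of extractable phrases that were correctly extracted."""
    return 100.0 * correct_extracted / possible

def precision(correct_extracted, extracted):
    """Percentage of extracted phrases that are correct."""
    return 100.0 * correct_extracted / extracted

# Figures from the example in the text.
print(recall(3000, 5000))      # 60.0
print(precision(3000, 4000))   # 75.0
```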
The dictionary for all CN types and subtypes,
induced at error tolerance 0.2 from all 14,719
training instances, had 194 CN definitions that
covered 10 or more training instances, 527 that
covered from 3 to 9, and 793 with coverage of 2.
Figure 6 displays recall against the number of
positive training instances for the most frequent
CN type and subtypes in the corpus. Recall is
over 60 and still increasing at this level of
training for "diagnosis" and "sign or symptom"
of any subtype. With error tolerance set at .20
and minimum coverage at 2, precision remains
fairly constant regardless of training size, at
about 70 for most CN types and subtypes.
Figure 6. CRYSTAL Learning Curve:
Increase of Coverage with Training Size
The choice of error tolerance parameter has a
significant impact on performance and can be
used to manipulate a tradeoff between recall and
precision. Figure 5 shows performance of a
dictionary of CN definitions for "sign or
symptom" of any subtype, where the error
tolerance is varied from 0.0 to 0.4. The results
shown are the averages of 50 random partitions
of the corpus into 90% training and 10% test
documents for each error tolerance.
The results in Figure 5 are for generalized CN
definitions, those with coverage >= 2. Precision
can be boosted by raising the minimum coverage
threshold. At error tolerance of 0.0 the
dictionary for "sign or symptom" had precision
79 and recall 44 at minimum coverage of 2;
precision 87 and recall 35 at minimum coverage
of 5; and precision 91 and recall 24 at minimum
coverage of 10.
For the learning curve in Figure 6 we have set the training partitions
at 10%, 30%, 50%, 70%, and 90% of the 385
annotated documents. This was done 50 times
for each training size and results averaged. The
number of positive training instances in a
partition depends on what CN type and subtype
is being learned.
The difference in coverage for "symptom, absent"
and "symptom, present" is due to a limitation in
negation handling by the current version of
CRYSTAL. CRYSTAL can require the word
"no" or the word "without" in its constraints,
which allows CN definitions that identify absent
symptoms. There is no mechanism to require
the absence of "no" to ensure an affirmative sense
for present symptoms. The next version of
CRYSTAL will explicitly handle negation and
allow a constraint that a phrase be in affirmative
or negative mode.
Figure 7 shows representative CN definitions to
give a sense of what CRYSTAL is learning.
Figure 7. Representative CRYSTAL CN Definitions
1) Extract sign or symptom present from direct object
    Verb: "REVEALED"
    Direct object: "A" <Finding>
2) Extract sign or symptom absent from direct object
    Verb: "REVEALED"
    Direct object: "NO"
3) Extract sign or symptom absent from prep. phrase
    Preposition = "WITHOUT"
CN Definition 1 in Figure 7 will apply to cases
where a test revealed "a mass", "a pleural
effusion", or "a murmur" and correctly identify
the direct object as a present sign or symptom.
The requirement of the exact word "a" eliminates
cases where the test revealed "no murmurs",
"normal bowel sounds", and other phrases that
should be labeled as absent, rather than present,
sign or symptom.
CN Definition 2 includes an exact word
constraint "no" to identify absent sign or
symptom. This has over 90% accuracy with no
semantic constraints. CN Definition 3 also has
no semantic constraints and finds absent sign or
symptom in a prepositional phrase with the
preposition "WITHOUT". This CN definition
covers 229 training instances with 13% error
rate. If a constraint is added to require the class
<Finding> in the prepositional phrase, coverage
drops to 75 with 5% errors.
The lack of semantic constraints on most of the
high coverage CN definitions may be attributed
to gaps in the semantic lexicon. We have
concentrated on developing CRYSTAL's
mechanism for learning generalized CN
definitions and have not yet addressed the issue
of adapting and augmenting for our particular
application the UMLS lexicon, which lacks
words such as "lesions", "rate", "rhythm",
"tenderness", and "distention".
CONCLUSIONS
The BADGER text analysis system is able to
use extraction rules automatically derived by the
CRYSTAL dictionary induction tool. When
CRYSTAL is presented with a training set of
hand-annotated hospital discharge summaries, it
uses machine learning techniques to derive a set
of extraction rules. The goal of CRYSTAL is to
find the minimum set of generalized rules that
cover all of the positive training instances and to
test each proposed rule against the training
corpus to ensure that the error rate is within a
predefined tolerance.
The requirements of CRYSTAL are a sentence
analyzer, a semantic lexicon that maps individual
words into classes in a semantic hierarchy, and a
set of annotated training texts. CRYSTAL uses
the training text to build a fully functional
conceptual dictionary that requires no further
knowledge engineering.
Acknowledgments
This research is supported by NSF Grant No.
EEC-9209623, State/Industry/University
Cooperative Research on Intelligent Information
Retrieval.
References
1. Aronow DB, Coltin KL. Information
technology applications in quality assurance and
quality improvement, Part II. Joint Commission
Journal on Quality Improvement. 10:465-478,
1993.
2. Satomura Y, do Amaral MB. Automated
diagnostic indexing by natural language
processing. Medical Informatics. 3:149-163,
1992.
3. Sager N, Lyman M, Nhan NT, Tick LJ.
Automatic encoding into SNOMED III: A
preliminary investigation. Journal of the
American Medical Informatics Association.
Supp:230-234, 1994.
4. Soderland S, Fisher D, Aseltine J, Lehnert W.
CRYSTAL: Inducing a conceptual dictionary.
In Proceedings of the Fourteenth International
Joint Conference on Artificial Intelligence.
1995.