Knowledge Representation and Inference Models
Textual Entailment
Dan Roth
University of Illinois
Rodrigo Braz, Roxana Girju, Vasin Punyakanok, Mark Sammons
Fundamental Task
By “textually entailed” we mean: most
people would agree that one sentence
implies the other.
(more later)
S: WalMart defended itself in court today against claims that its female employees were kept out of jobs in management because they are women.
Entails / Subsumed by
T: WalMart was sued for sexual discrimination.
Why Textual Entailment?
A fundamental task that can be used as a building block in
multiple NLP and information extraction applications
There is always a risk in solving a separate ’fundamental’ task rather
than the task one really wants to solve…
Some of the examples here are very direct, though.
Has multiple direct applications
Question Answering
Q: Who acquired Overture?
Determine: (and distinguish from other candidates)
A: Eyeing the huge market potential, currently
led by Google, Yahoo took over search company
Overture Services Inc last year.
Subsumed by
Yahoo acquired Overture
Story Comprehension
A process that maintains and updates a collection of propositions about
the state of affairs.
Viewed this way, a fundamental task to consider is that of textual
entailment: Given a snippet of text S, does it entail a proposition T?
(ENGLAND, June, 1989) - Christopher Robin is alive and well. He lives in England. He is
the same person that you read about in the book Winnie the Pooh. As a boy, Chris
lived in a pretty home called Cotchfield Farm. When Chris was three years old, his father
wrote a poem about him. The poem was printed in a magazine for others to read. Mr.
Robin then wrote a book. He made up a fairy tale land where Chris lived. His friends
were animals. There was a bear called Winnie the Pooh. There was also an owl and a
young pig, called a piglet. All the animals were stuffed toys that Chris owned. Mr.
Robin made them come to life with his words. The places in the story were all near
Cotchfield Farm. Winnie the Pooh was written in 1925. Children still love to read about
Christopher Robin and his animal friends. Most people don't know he is a real person.
He has written books of his own that tell what it is like to be famous.
1. Christopher Robin was born in England.
2. Winnie the Pooh is a title of a book.
3. Christopher Robin’s dad was a magician.
4. Christopher Robin must be at least 65 now.
More Examples
You may disagree with the truth of this
statement; and you may infer also that: the
presidential candidate’s wife was born in N.C.
A key problem in natural language understanding is to abstract over the
inherent syntactic and semantic variability in natural language.
Multiple tasks attempt to do just that.
Relation Extraction:
Dole’s wife, Elizabeth, is a native of Salisbury, N.C.
⇒ Elizabeth Dole was born in Salisbury, N.C.
Information Integration (Data Bases)
Different database schemas represent the same information under
different titles.
Information retrieval:
Multiple issues, from variability in the query and target text, to
Multiple techniques can be applied; all are entailment problems.
Direct Application: Semantic Verification
A long contract that you need to ACCEPT
Determine: (and distinguish from other candidates)
Does it satisfy the 3 conditions that you really
care about?
Why Study Textual Entailment?
A fundamental task for language comprehension.
Builds on a lot of research (and tools) done in the last few
years in Learning and Inference in Natural Language.
Opens up a large collection of questions both from the
natural language perspective and from the machine learning,
knowledge representation and inference perspectives.
This Talk
A brief perspective & technical motivation
An Approach to Textual Entailment
Some examples
The CCG Inference model for textual entailment
Inference as optimization
Knowledge modules
Two Extremes in Representation and Inference
Statistics: Using relatively simple statistical techniques for
BOW and/or paraphrases
Multiple problems that may not be addressed just from the data:
Entailment vs. Correlation
[Geffet & Dagan ’04, ’05]
An important component, but:
 How to put together/chain/weigh paraphrases? Inference model.
Inference in NL requires mapping sentences to logical forms
and using general purpose theorem proving.
Extensions include various relaxations in the way the representation
is generated and in the type of information incorporated in a KB, to
support the theorem prover; non-logical, probabilistic paradigms.
Key problems include the realization that underspecificity of the
language is a feature, rather than a bug.
⇒ A representation, but not a canonical representation
New (Better?) View on Problems
Access to information requires tolerating “loose speak”
[Porter et al., ’04]
Refers to the imprecise way queries/questions are formed –
with respect to the representation of the information source.
Metonymy: referring to an entity or event by one of its
Causal factor: referring to a result by one of its causes
Aggregate: referring to an aggregate by one of its members
Generic: referring to a specific concept by the generic class to which it
[The potato was cultivated first in SA]
Noun compounds: referring to a relation between nouns by using
just the noun phrase consisting of the two nouns. [wooden table]
Many other kinds of ambiguities – some language related
and some knowledge related.
Example: New (Better?) View on Problems
Colin Powell addressed the General Assembly yesterday
⇒ Colin Powell gave a speech at the UN
⇒ The secretary of state gave a speech at the UN
Resolving the sense ambiguity in “addressed”?
Or a weaker, “existential”, Yes/No with respect to “gave a speech” is
[Ido Dagan; Senseval ’04]
How about Colin Powell?
In many disambiguation problems, the view taken when
studying entailment is that keeping the underspecificity of
language is possible, and perhaps the right thing to do.
Task-based Refinement
Learning to Reason [’94-’97]
A unified framework to study Learning, Knowledge Representation and Reasoning
A series of theoretical results on the advantages of a unified framework for
L, KR & R, in situations where:
The goal is to Reason - deduction; abduction (best explanation)
Starting point for Reasoning is not a static Knowledge Base but rather A
representation of knowledge learned via interaction with the world.
Quality of the learned representation is determined by the reasoning stage.
Intermediate Representation is important – but only to the extent that it is
learnable, and it facilitates reasoning.
There may not be a need (or even a possibility) to learn an exact
intermediate representation, but only to the extent that it supports
[Khardon & Roth JACM97, AAAI94; Roth95, Roth96, Khardon&Roth99
Learning to Plan: Khardon’99]
Defining Textual Entailment
Mapping text to a canonical representation is often not the
right approach (or: not possible)
Not a computational issue
Rather, the representation might depend on the task, in our case, on
the hypothesis sentence.
Suggests a definition for textual entailment:
Let s, t be text snippets with representations $r_s, r_t \in R$.
We say that s textually entails t if there is a representation
$r \in R$ of s for which we can prove that $r \subseteq r_t$
Defining Semantic Entailment
R - a knowledge representation language, with a well defined
syntax and semantics over a domain D.
For text snippets s, t:
$r_s, r_t$ - their representations in R.
$M(r_s), M(r_t)$ - their model theoretic representations
There is a well defined notion of subsumption in R, defined
model theoretically
For $u, v \in R$:
u is subsumed by v when $M(u) \subseteq M(v)$
Not an algorithm; need a proof theory.
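The model-theoretic definition above can be made concrete on a toy scale. The following is a minimal sketch, assuming a propositional stand-in for R in which a representation is any Boolean function over two made-up atoms (`kill`, `die`); the names `models` and `is_subsumed_by` are illustrative and not part of the system described in the talk. Enumerating models is only feasible for such tiny vocabularies, which is exactly why a proof theory is needed.

```python
from itertools import product

ATOMS = ("kill", "die")  # hypothetical atomic propositions


def models(formula, atoms=ATOMS):
    """Return M(formula): the truth assignments (as tuples) that satisfy it.

    `formula` is a callable taking a dict {atom: bool} and returning bool.
    """
    result = set()
    for values in product((False, True), repeat=len(atoms)):
        world = dict(zip(atoms, values))
        if formula(world):
            result.add(values)
    return result


def is_subsumed_by(u, v, atoms=ATOMS):
    """u is subsumed by v when M(u) is a subset of M(v)."""
    return models(u, atoms) <= models(v, atoms)


# u: "kill and die", v: "die"
u = lambda w: w["kill"] and w["die"]
v = lambda w: w["die"]

print(is_subsumed_by(u, v))  # True: every model of u is a model of v
print(is_subsumed_by(v, u))  # False: v also holds when "kill" is false
```

The more specific representation has fewer models, so it is the one that is subsumed.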
Defining Semantic Entailment (2)
The proof theory is weak; will show $r_s \subseteq r_t$ only when they are
relatively “similar”.
$r \in R$ is faithful to s if $M(r_s) = M(r)$
Definition: Let s, t be text snippets with representations $r_s, r_t \in R$.
We say that s textually entails t if there is a representation $r \in R$
that is faithful to s, for which we can prove that $r \subseteq r_t$
Given $r_s$ one needs to generate many equivalent
representations $r'_s$ and test $r'_s \subseteq r_t$
Cannot be done exhaustively
How to generate alternative representations?
The Role of Knowledge: Refining Representations
A rewrite rule (l, r) is a pair of expressions in R such that $l \subseteq r$
Given a representation $r_s$ of s and a rule (l, r) for which $r_s \subseteq l$,
the augmentation of $r_s$ via (l, r) is $r'_s = r_s \wedge r$.
$l \subseteq r$, $r_s \subseteq l$
$r'_s = r_s \wedge r$
Claim: $r'_s$ is faithful to s.
Proof: In general, since $r'_s = r_s \wedge r$, then $M(r'_s) = M(r_s) \cap M(r)$
However, since $r_s \subseteq l \subseteq r$, then $M(r_s) \subseteq M(r)$.
Consequently: $M(r'_s) = M(r_s)$
And the augmented representation is faithful to s.
The claim suggests an algorithm for generating alternative (equivalent)
representations, and for textual entailment.
The resulting algorithm is sound, but is not complete.
Completeness depends on the quality of the KB of rules.
The power of this re-representation algorithm is in the rules KB and in an
inference procedure that incorporates them.
Choosing appropriate refinements
Depends on the target sentence
Is an optimization procedure
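The faithfulness claim and the role of the rules KB can be checked on a small example. This is a minimal sketch under stated assumptions: representations are conjunctions of made-up atoms (frozensets), the single rule "kill implies die" is hypothetical, and the "weak proof theory" is modeled as a purely syntactic containment test. None of these names come from the original system.

```python
from itertools import product

ATOMS = ("kill", "die", "monday")                   # hypothetical atoms
RULE = (frozenset({"kill"}), frozenset({"die"}))    # rewrite rule (l, r)


def holds(conj, world):
    """A conjunction holds when all of its atoms are true in the world."""
    return all(world[a] for a in conj)


def admissible_worlds():
    """Truth assignments that respect the rule, i.e. every model of l is a model of r."""
    l, r = RULE
    worlds = []
    for values in product((False, True), repeat=len(ATOMS)):
        world = dict(zip(ATOMS, values))
        if not holds(l, world) or holds(r, world):
            worlds.append(values)
    return worlds


def models(conj):
    """M(conj), restricted to worlds in which the rule holds."""
    return {v for v in admissible_worlds() if holds(conj, dict(zip(ATOMS, v)))}


def provable(u, v):
    """Weak syntactic proof theory: u is shown subsumed by v only if v's atoms all appear in u."""
    return v <= u


def augment(rs, rule):
    """Augmentation of rs via (l, r): conjoin r when rs is provably subsumed by l."""
    l, r = rule
    return rs | r if provable(rs, l) else rs


rs = frozenset({"kill", "monday"})   # source representation
rt = frozenset({"die"})              # target representation
rs_aug = augment(rs, RULE)

print(provable(rs, rt))              # False: the weak proof theory cannot see it
print(provable(rs_aug, rt))          # True: after augmentation it can
print(models(rs_aug) == models(rs))  # True: the augmented form is faithful to s
```

The augmentation adds no new commitments (the model sets coincide) but makes the entailment visible to a proof procedure that only compares similar structures.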
General Strategy
Given a sentence S (answer)
Induce an abstract representation
of S (a concept graph)
Given a sentence T (question)
Induce an abstract representation
of T (a concept graph)
Given a KB of semantic,
structural and pragmatic
transformations (rules):
Re-represent S
Find the optimal set of
transformations that maps
one sentence to the target
The One Slide Approach Summary
Inducing an Abstract Representation of Text
Refining the representation using an existing KB
Multiple learning Steps; centered around a semantic parse (predicate-argument
representation) of a sentence augmented by additional information.
Final representation is a hierarchical concept graph (DL inspired)
Rewrite rules at multiple levels; application depends on target; [Features]
Modeling Entailment as Constrained Optimization
Entailment is a mapping between sentence representation
Find an optimal mapping [minimal cost proof; abduction] that respects
The hierarchy
Transformations (rules) applied to nodes/edges/sub-graphs
The confidence in the induced information
All modeled as (soft) constraints
Provides robustness against inherent variability in natural language,
inevitable noise in learning processes and missing information.
Learning, Representing and Reasoning take part at several levels in the
A unified knowledge representation of the text, that
provides an hierarchical encoding of the structural, relational and
semantic properties of the given text
is integrated with learning mechanisms that can be used to induce
such information from newly observed raw text, and
that is equipped with an inferential mechanism that can be used to
support inferences with respect to such representations.
An Inference Model for Semantic Entailment
Experiments with a Semantic Entailment System
An Example
s: Lung cancer put an end to the life of Jazz singer Marion
Montgomery on Monday.
t: Singer dies of carcinoma.
s is re-represented in several ways; one of these is shown to be
subsumed by t
s’1: Lung cancer killed Jazz singer Marion Montgomery on Monday.
s’2: Jazz singer Marion Montgomery died of lung cancer on Monday.
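The chain s, s'1, s'2 above can be sketched as rule chaining over predicate-argument triples. This is an illustrative toy, not the system's implementation: the triple format, the `REWRITE_RULES` table, and the `IS_A` lexicon (which stands in for the lexical knowledge that lung cancer is a carcinoma) are all assumptions, and the temporal modifier "on Monday" is dropped for brevity.

```python
# Hypothetical rewrite rules: predicate -> (new predicate, swap arguments?)
REWRITE_RULES = {
    "put an end to the life of": ("kill", False),
    "kill": ("die of", True),   # X killed Y  =>  Y died of X
}

# Hypothetical lexical knowledge used when matching arguments.
IS_A = {
    "Jazz singer Marion Montgomery": {"singer"},
    "lung cancer": {"carcinoma"},
}


def re_represent(frame):
    """Chain rewrite rules and collect every frame reached from `frame`."""
    pred, a, b = frame
    seen = [frame]
    while pred in REWRITE_RULES:
        pred, swap = REWRITE_RULES[pred]
        if swap:
            a, b = b, a
        seen.append((pred, a, b))
    return seen


def arg_matches(source_arg, target_arg):
    """An argument matches if it is identical or a known instance of the target."""
    return source_arg == target_arg or target_arg in IS_A.get(source_arg, set())


def entails(source_frame, target_frame):
    """True if some re-representation of the source is matched by the target."""
    t_pred, t_a, t_b = target_frame
    return any(
        pred == t_pred and arg_matches(a, t_a) and arg_matches(b, t_b)
        for pred, a, b in re_represent(source_frame)
    )


s = ("put an end to the life of", "lung cancer", "Jazz singer Marion Montgomery")
t = ("die of", "singer", "carcinoma")

for frame in re_represent(s):
    print(frame)
print(entails(s, t))  # True
print(entails(t, s))  # False
```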
Hierarchical; Multiple types of information;
All hanging on the sentence itself.
Formally, represented using Description Logic
Expressions; Rewrite rules have the same
Representation (2)
Representation is formal – not to be confused with a logical/canonical
Attempt is made to represent the text, and augment/refine the
representation as part of the inference process.
The skeleton of the representation is a predicate-argument representation
learned based on PropBank (the semantic role labelling task).
Resources used to augment the
Segmentation; tokenization;
Lemmatizer;POS tagger
Shallow Parser
Syntactic parser (Collins;Charniak)
Named entity tagger
Entity identification. (co-Reference)
Resources used to Rewrite/Refine
and for Subsumption
Dirt paraphrase rules (Lin)
Word clusters (Lin)
Ad hoc modules (later)
In house machine learning based tools [
Predicate-Argument Representation
For each predicate in a sentence [currently – verbs]
Represent all constituents that fill a semantic role
Core Arguments, e.g., Agent, Patient or Instrument
Their adjuncts, e.g., Locative, Temporal or Manner
[Figure: predicate-argument annotation of an example sentence (fragments: “I said”, “I left”, “pearls”, “to my daughter-in-law”); role labels include A0: leaver, A1: thing left, A2: benefactor, and an utterance role]
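A predicate-argument frame of the kind described on this slide can be held in a small data structure. The sketch below is an assumption about one convenient layout (the class name `Frame` and its fields are made up); the example sentence and its role tags are taken from a later slide in this talk.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    """One predicate with the constituents that fill its semantic roles."""
    predicate: str                                   # verb lemma
    core: dict = field(default_factory=dict)         # core arguments, e.g. ARG0, ARG1
    adjuncts: dict = field(default_factory=dict)     # adjuncts, e.g. AM_TMP, AM_LOC

    def roles(self):
        """All filled roles, core arguments first."""
        return {**self.core, **self.adjuncts}


# [The president]/ARG0 visited [Iraq]/ARG1 [in September]/AM_TMP
visit = Frame(
    predicate="visit",
    core={"ARG0": "The president", "ARG1": "Iraq"},
    adjuncts={"AM_TMP": "in September"},
)

print(visit.roles())
```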
Semantic Role Labelling
Screen shot from a CCG demo
This problem itself is modelled as
a constrained optimization
problem over the output of a
large number of classifiers, and
multiple constraints.
Solution: formulating it as an integer
linear program and solving
it with an ILP solver.
Top system in CoNLL shared
Task; presentation later today
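The idea of combining classifier outputs under global constraints can be illustrated without an ILP solver. The sketch below swaps in brute-force enumeration for the integer linear program, which is only viable for a handful of constituents; the scores are invented and the single constraint (no core label used twice) is one plausible example, not the actual constraint set used in the system.

```python
from itertools import product

LABELS = ("A0", "A1", "A2", "NONE")
CORE = {"A0", "A1", "A2"}

# Hypothetical classifier scores: one {label: score} dict per candidate constituent.
scores = [
    {"A0": 0.6, "A1": 0.3, "A2": 0.05, "NONE": 0.05},
    {"A0": 0.5, "A1": 0.4, "A2": 0.05, "NONE": 0.05},
    {"A0": 0.1, "A1": 0.2, "A2": 0.6, "NONE": 0.1},
]


def satisfies_constraints(assignment):
    """Example constraint: each core argument label is used at most once."""
    core_used = [lab for lab in assignment if lab in CORE]
    return len(core_used) == len(set(core_used))


def best_assignment(scores):
    """Exhaustively find the highest-scoring label assignment that is feasible."""
    best, best_score = None, float("-inf")
    for assignment in product(LABELS, repeat=len(scores)):
        if not satisfies_constraints(assignment):
            continue
        total = sum(s[lab] for s, lab in zip(scores, assignment))
        if total > best_score:
            best, best_score = assignment, total
    return best, best_score


labels, total = best_assignment(scores)
print(labels)  # ('A0', 'A1', 'A2')
```

Picking each constituent's top label independently would give A0 twice; the constraint forces the second constituent to its next-best label.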
Rewrite Rules (KB)
Goal: Acquire transformations that preserve meaning
Basic linguistics processing levels:
Keyword matching;
(Discourse, Pragmatic, …)
The mechanism supports chaining. Rules may contain variables;
the augmentation mechanism supports inheritance.
Some examples later
Rules are used also to avoid semantic parsing problems.
managed to enter ⇒ entered; failed to enter ⇒ did not enter
The Inference Problem
1. Optimizing over the transformations applied to the initial
2. Optimizing over the transformations applied to determine
final subsumption
Even after the refinement of the representation, requiring exact
subsumption (embedding of the target graph in the source graph) is
Words can be replaced by synonyms; modifiers can be dropped, etc.
We develop a notion of functional subsumption: say “yes” when
node & edges unify modulo some allowed transformations.
[Why do we separate to two stages?]
Modeling Inference as Optimization
1. Incrementally augment the original representation and generate faithful
re-representations of it.
2. Compute whether the target representation subsumes the augmented
concept graph via an extended subsumption algorithm.
Uncertainty is encoded by optimizing a linear cost function. Cost can
be learned in a straightforward way via an EM-like algorithm.
The inference model seeks the optimal re-representation $S'_i$ such that:
$S'_i = \arg\min_{S'} \left[ C(S, S') + D(S', T) \right]$
Over the space of all possible re-representations of S given KB (subject
to multiple constraints – order, structure)
C returns the cost of augmenting S to S'i and
D returns the costs of performing extended subsumption from S'i to T.
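The selection step can be sketched as a plain argmin over candidate re-representations. The candidate names and cost values below are invented for illustration; in particular, an infinite D is used here as an assumed way to mark candidates for which extended subsumption to T fails outright.

```python
import math


def select_re_representation(candidates):
    """Pick the candidate minimizing C + D.

    `candidates` is an iterable of (name, C, D) where C is the cost of
    augmenting S to that candidate and D the cost of extended subsumption to T.
    """
    return min(candidates, key=lambda c: c[1] + c[2])


# Hypothetical costs for the original S and two re-representations.
candidates = [
    ("S", 0.0, math.inf),     # no rules applied, but T cannot be matched
    ("S'1", 0.3, math.inf),   # one rule applied, still no match
    ("S'2", 0.6, 0.4),        # two rules applied, cheap match against T
]

best = select_re_representation(candidates)
print(best[0])  # S'2
```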
Inference: Key Points
Hierarchical Subsumption
Decision List: if succeeds at a level, go on to the next; otherwise, fail
Match both attributes and edges (relational information)
Match may not be perfect
Inference (unification) as Optimization
At the Predicate-Argument level
At the phrase level
At the word level
The optimal unification U’ is the one minimizing:
$\sum_{H_i} \sum_{\{(X,Y) \in U \mid X \in H_i\}} \lambda_i \, G(X,Y)$
(X, Y: substructures of S and T, respectively)
where $\lambda_i$ is a fixed constant that ensures the hierarchical behavior is as a
decision list.
($\lambda_i$ makes sure that changes in $H_0$ dominate changes in $H_1$)
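A small numeric sketch shows how level weights can produce decision-list behavior. The weights, the level numbering (0 for the predicate-argument level, 1 for phrases, 2 for words), and the per-pair costs are assumptions chosen for illustration; the only property that matters is that each level's weight exceeds the total cost that lower levels can contribute.

```python
def unification_cost(unification, weights):
    """Weighted cost of a unification.

    `unification` is a list of (level, pair_cost) entries, one per matched
    pair (X, Y), where `level` is the hierarchy level of X.
    """
    return sum(weights[level] * cost for level, cost in unification)


# Hypothetical weights: level 0 dominates level 1, which dominates level 2.
WEIGHTS = {0: 100.0, 1: 10.0, 2: 1.0}

# u_a: perfect match at level 0, worst-case mismatches below.
u_a = [(0, 0.0), (1, 1.0), (2, 1.0)]
# u_b: small mismatch at level 0, perfect below.
u_b = [(0, 0.25), (1, 0.0), (2, 0.0)]

print(unification_cost(u_a, WEIGHTS))  # 11.0
print(unification_cost(u_b, WEIGHTS))  # 25.0
```

Even a small mismatch at the top level outweighs complete mismatches at the lower levels, so the top level is effectively decided first.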
Integer Linear Programming formulation for Unification
[Acquisition & Inference]
A knowledge base consisting of syntactic and semantic rewrite rules,
written at several levels of abstractions
A description logic inspired hierarchical KR into which we re-represent
the surface level text augmented with multiple abstractions.
[Learning & Inference]
[modeled as optimization: flexibility & error tolerance]
An extended subsumption algorithm which determines subsumption
between representations.
An Inference Model for Semantic Entailment
Experiments with a Semantic Entailment System [IJCAI’05-WS]
Evaluation: SRL (CoNLL Shared Task) ; Pascal
Ablation study on the PARC collection
Ablation study on the PARC Data
76 Pairs of Q-A sentences
Designed to test linguistic (lexical and constructional) entailment
Out of 76 pairs:
questions converted manually
treat label “unknown” as “false”
64 pairs – got perfect SRL labelling
System versions: Vary Two Dimensions
Structure: add more parsing capabilities
Semantic: add more semantic resources (some use parse structure)
System Versions
Suite of tests, incrementally adding system components
System versions:
LLM: Uses BOW++ to match entire sentences
SRL + LLM: Uses SRL tagging (filter) and BOW on verb arguments
SRL + Deep Structure: System parses arguments of Verbs
Uses full parse, shallow parse tagging to identify argument structure
Knowledge Base (of rewrite rules) active or inactive
Testing the Entailment System
Entailment (Knowledge Base) Modules (can only be
activated when appropriate parse structure is present)
Verb Phrase Compression
Discourse Analysis
Detect embedded predicates
Annotate effect of embedding predicate on embedded predicate
Qualifier Reasoning
Rewrite verb constructions – modal, VERB to VERB, tense
Detect qualifiers and scope – some, no, all, any, etc.
Determine entailment of qualified arguments
Not shown: Functional Subsumption – rules (e.g., synonyms) used
to allow other rules to fire.
Results for Different Entailment Systems
Perfect Corpus with applicable entailment modules,
with Knowledge Base
[Table: Active Components (Base + VP; Base + VP + DA; Base + VP + DA + Qual) by system version (SRL + Deep)]
Results for Different Entailment Systems
Full Corpus with applicable entailment modules,
with Knowledge Base
[Table: Active Components (Base + VP; Base + VP + DA; Base + VP + DA + Qual) by system version (SRL + Deep)]
Baseline Entailment System (1)
Baseline system is Lexical Level Matching (LLM)
Ignores many “stopwords”, including “be” verbs, prepositions, determiners
Lemmatizes words before matching
Requiring structure may hurt: LLM allows entailment when SRL-based subsumption requires a rewrite rule:
S: [The diplomat]/ARG0 visited [Iraq]/ARG1 [in September]/AM_TMP
T: [The diplomat]/ARG1 was in [Iraq]/ARG2
For LLM, the only words of T that register are “diplomat” and “Iraq”
As these are present in S, LLM will return “true”
Baseline System (1.1)
But, LLM is insensitive to small changes in wording
S: [Legally]/AM_ADV, [John]/ARG0 [could]/AM_MOD drive.
T: [John]/ARG0 drove.
LLM ignores modal “could”, so returns incorrect answer
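The two baseline examples can be reproduced with a toy lexical matcher. This is a minimal sketch, assuming a hand-written stopword list and lemma table in place of a real lemmatizer; it is not the BOW++ matcher used in the experiments.

```python
import re

# Hypothetical stopword list: "be" verbs, prepositions, determiners, and the modal "could".
STOPWORDS = {"the", "a", "an", "be", "in", "to", "of", "could"}
# Hypothetical lemma table standing in for a lemmatizer.
LEMMAS = {"was": "be", "is": "be", "were": "be", "visited": "visit", "drove": "drive"}


def content_lemmas(sentence):
    """Lowercase, tokenize, lemmatize, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    lemmas = (LEMMAS.get(tok, tok) for tok in tokens)
    return {lem for lem in lemmas if lem not in STOPWORDS}


def llm_entails(s, t):
    """Lexical Level Matching: every content lemma of T must appear in S."""
    return content_lemmas(t) <= content_lemmas(s)


# Correct "true": only "diplomat" and "iraq" register in T.
print(llm_entails("The diplomat visited Iraq in September", "The diplomat was in Iraq"))
# Incorrect "true": the modal "could" is ignored.
print(llm_entails("Legally, John could drive.", "John drove."))
```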
SRL + LLM (2.)
SRL + LLM system uses Semantic Role Labeler tagging
First, tries to match verb and argument types in the two sentences
If successful, system uses LLM to determine entailment of arguments
Advantage over LLM when argument or modifier attached to different
verb in T than in S:
S: [The president]/ARG0 said [[the diplomat]/ARG0 left
T: [The diplomat]/ARG0 said [[the president]/ARG0 left
Words are identical, so LLM incorrectly labels example “true”
SRL+LLM returns “false” because arguments of “said”, “visit” don’t match.
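The contrast between bag-of-words matching and role-aware matching can be sketched as follows. The frames are written by hand (a real system would obtain them from the semantic role labeler) and are simplified to a single ARG0 per verb, omitting the embedded clause argument of "said"; the function names are illustrative.

```python
STOPWORDS = {"the", "a", "an"}  # hypothetical, minimal


def content_words(text):
    """Lowercased words of `text` with stopwords removed."""
    return {w for w in text.lower().split() if w not in STOPWORDS}


def srl_llm_entails(s_frames, t_frames):
    """Each T frame needs an S frame with the same verb whose arguments cover it."""
    for t_verb, t_args in t_frames:
        matched = any(
            s_verb == t_verb
            and all(
                role in s_args
                and content_words(t_args[role]) <= content_words(s_args[role])
                for role in t_args
            )
            for s_verb, s_args in s_frames
        )
        if not matched:
            return False
    return True


# S: The president said the diplomat left ...
s_frames = [("say", {"ARG0": "the president"}), ("leave", {"ARG0": "the diplomat"})]
# T: The diplomat said the president left ...
t_frames = [("say", {"ARG0": "the diplomat"}), ("leave", {"ARG0": "the president"})]

same_bag = content_words("The president said the diplomat left") == content_words(
    "The diplomat said the president left"
)
print(same_bag)                             # True: identical words fool LLM
print(srl_llm_entails(s_frames, t_frames))  # False: the ARG0 of "say" differs
```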
SRL + LLM (2.1)
Disadvantage of using SRL+LLM compared to LLM:
SRL generates predicate frames for verbs that LLM ignores as stopwords
Example: “went” in following sentence pair:
S: [The president]/ARG0 visited [Iraq]/ARG1 [in September]/AM_TMP
T: [The president]/ARG0 went to [Iraq]/ARG1.
LLM ignores “went”, returns correct label “true”
SRL generates a verb frame for “went”
Subsumption fails as no match for this verb in S
In this data set there are more instances like the second case than like the first;
the result is a drop in performance
However, SRL forms crucial backbone for other functionality
SRL+LLM with Verb Processing (3.0)
The Verb Processing (VP) module rewrites certain verb
phrases as a single verb with additional attributes
Uses word order and Part of Speech information to identify candidate
Presently recognizes modal and tense constructions, and simple verb
compounds of the form ”VERB to VERB” (such as “manage to
Verb phrase replaced by single predicate (verb) node with additional
Modality (“CONFIDENCE”)
Requires POS and word order information
SRL+LLM with Verb Processing (3.1)
Example where Verb Processing (VP) module helps:
S: [Legally]/AM_ADV, [John]/ARG0 [could]/AM_MOD drive.
T: [John]/ARG0 drove.
Subsumption in LLM and SRL+LLM system succeeds, as
argument and verb lemma in T match those in S
VP module rewrites “could drive” as “drive”, adds attribute
“CONFIDENCE: POTENTIAL” to “drive” predicate node
In SRL+LLM+VP, subsumption fails at verb level, as
CONFIDENCE attributes don’t match
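The effect of the VP module on this example can be sketched as collapsing a verb phrase into one node with attributes. Only the mapping from "could" to CONFIDENCE: POTENTIAL comes from the slide; the value "DEFAULT" for unmodified verbs, the lemma table, and the function names are assumptions, and tense is ignored here.

```python
MODALS = {"could": "POTENTIAL"}                  # mapping taken from the slide
LEMMAS = {"drive": "drive", "drove": "drive"}    # hypothetical lemma table


def verb_node(tokens):
    """Collapse a verb phrase (optional modal + main verb) into one predicate node."""
    confidence = "DEFAULT"   # hypothetical value for verbs without a modal
    for tok in tokens:
        if tok in MODALS:
            confidence = MODALS[tok]
    return {"lemma": LEMMAS[tokens[-1]], "CONFIDENCE": confidence}


def verb_level_match(s_node, t_node):
    """Verb-level subsumption succeeds only if lemma and attributes agree."""
    return s_node == t_node


s_node = verb_node(["could", "drive"])   # from S: "... John could drive."
t_node = verb_node(["drove"])            # from T: "John drove."

print(s_node)                            # {'lemma': 'drive', 'CONFIDENCE': 'POTENTIAL'}
print(verb_level_match(s_node, t_node))  # False: CONFIDENCE attributes differ
```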
VP module rewrites auxiliary construction in T as a
single verb with tense and modality attributes attached
S: Bush said that Khan sold centrifuges to North Korea.
T: Centrifuges sold to North Korea.
Now, SRL generates only a single predicate frame for “sold”
This matches its counterpart in S, and subsumption succeeds, because the
qualifying effect of the verb “said” in S cannot be recognized
without the deeper parse structure and the Discourse Analysis module
SRL + Deep Structure (4.0)
SRL + Deep Structure entailment system identifies
substructure in SRL predicate arguments
uses full- and shallow parse, Named Entity and Part of Speech
identifies the key entity in each argument
Identifies modifiers of key entity such as adjectives, titles, and
Enables further semantic modules, such as Qualifier module
for reasoning about entailment of qualified arguments
SRL + Deep Structure (4.1)
S: No US congressman visited Iraq until the war.
T: Some US congressmen visited Iraq before the war.
“Some” and “no” are stopwords (i.e., ignored by LLM), so
LLM and SRL+LLM incorrectly label this example “true”
SRL + Deep Structure gives correct label, “false”, because “no”
and “some” are identified as key entity modifiers for
matching argument, and they don’t match
SRL + Deep Structure (4.2)
Handling modifiers:
S: The room was full of women.
T: The room was full of intelligent women.
No rules for modifiers: The LLM and SRL+LLM systems find
no match for “intelligent” in S, and so return the correct
answer, “false”
SRL + Deep Structure system allows unbalanced T adjective
modifiers (assumption: S must be more general than T) and
returns “true”.
Context sensitive handling of modifiers?
SRL + Deep Structure + Discourse Analysis (5.0) *
Detecting the effects of an embedding predicate on the embedded
Presently, supports distinction between “FACTUAL” (default
assumption) and a set of values that distinguish various types of
uncertainty, such as “REPORTED”
S: The New York Times reported that Hanssen sold FBI secrets to the
Russians and could face the death penalty.
T: Hanssen sold FBI secrets to the Russians.
All systems lacking Discourse Analysis (DA) module label this sentence
pair “true”, because T is a literal fragment of S
Actual truth value depends on interpretation of “reported”
Other embedding constructions DA can handle:
Adjectival: “It is unlikely that Hanssen sold secrets…”
Nominal: “There was a suspicion that Hanssen sold secrets…”
SRL + Deep Structure + DA + Qualifier (6.0)
The Qualifier module allows comparison of qualifiers such as
all, some, many, no, etc.
In the following example it is used to identify that “all
soldiers” entails “many soldiers”
S: All soldiers were killed in the ambush.
T: Many soldiers were killed in the ambush.
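Qualifier comparison can be sketched with a simple strength scale. The slides only state that "all" entails "many" and that "no" and "some" do not match; the full ordering all > many > some used below is an assumption for illustration (it also presumes the set being quantified over is non-empty), and `qualifier_entails` is a made-up name.

```python
# Hypothetical strength scale for positive qualifiers.
STRENGTH = {"all": 3, "many": 2, "some": 1}


def qualifier_entails(s_q, t_q):
    """True if an argument qualified by s_q entails the same argument qualified by t_q."""
    if s_q == t_q:
        return True
    if "no" in (s_q, t_q):
        return False   # "no" is only compatible with itself
    return STRENGTH[s_q] >= STRENGTH[t_q]


print(qualifier_entails("all", "many"))  # True:  "all soldiers" entails "many soldiers"
print(qualifier_entails("no", "some"))   # False: "no" vs. "some" do not match
print(qualifier_entails("some", "all"))  # False: weaker does not entail stronger
```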
Experiment: Conclusions
Monotonic improvement as additional analysis resources are
Best performance for system with most structural
information (which supports the most semantic analysis
Non-monotonic improvement, relative to LLM, because:
LLM robust to certain errors due to stopwords
SRL matching stricter: fewer false positives, more false negatives
Corpus distribution favors LLM
Consistent behavior for “imperfect” corpus (includes SRL
Hierarchical representational approach shows strong promise
Progress in Natural Language Understanding requires the ability to learn,
represent and reason with respect to structured and relational data.
The task of Textual Entailment provides a general setting within which to
study and develop these theories. At the same time, it supports some
immediate applications.
We argued for an approach that
Attempts to refine a learned representation using a collection of knowledge
modules, thus maintaining some of the underspecificity in language as far as
Models inference as an optimization problem that attempts to find the
minimal cost solution.
No surprise, the key issues in this approach are in knowledge acquisition.