VACNET: Extracting and analyzing non-trivial linguistic structures at scale
Matthew Brook O’Donnell,
Nick C. Ellis, Ute Römer & Gin Corden
English Language Institute
[email protected]
The 2nd University of Michigan Workshop on
Data, Text, Web, and Social Network Mining
April 22, 2011
Challenge of natural language for data mining
• Much work in NLP, IR and text classification relies upon frequency analysis of
  • single words
  • n-grams (contiguous word sequences of various lengths)
• Units are computationally trivial to retrieve
  • the Map-Reduce ‘Hello World’!
• Techniques tend to use a ‘bag of words’ approach, disregarding structure
• Frequency and statistical measures highlight distinctive items and document ‘aboutness’
• But this is a weak proxy for meaning, which remains somewhat elusive!
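A minimal sketch of this frequency-based approach (illustrative only; the toy sentence and the use of Python's collections.Counter are assumptions, not the project's tooling):

from collections import Counter

# Toy 'bag of words': count single words and contiguous bigrams, ignoring all structure.
text = "she talks about her book and he worries about the island"
tokens = text.split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))  # contiguous word pairs (n-grams with n = 2)

print(unigrams.most_common(3))  # e.g. [('about', 2), ('she', 1), ('talks', 1)]
print(bigrams.most_common(3))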
Can linguistic theory help?... NLP tools:
Typical NLP Pipeline
[Pipeline diagram: raw text → sentence splitting → word tokenization → POS tagging → chunking/parsing → named-entity recognition → meaning???]
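A rough sketch of such a pipeline using NLTK (the slides do not name a toolkit, so the library choice is an assumption; the relevant NLTK data packages must be downloaded first):

import nltk

text = "He grumbled incessantly about the provincial life on the island."
for sent in nltk.sent_tokenize(text):   # sentence splitting
    tokens = nltk.word_tokenize(sent)   # word tokenization
    tagged = nltk.pos_tag(tokens)       # POS tagging
    tree = nltk.ne_chunk(tagged)        # chunking / named-entity recognition
    print(tree)
# ...and meaning is still nowhere in sight.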
Challenge of natural language for data mining
“Analyzing natural language data is, in my opinion, the problem of the next 2-3 decades. It's an incredibly difficult issue […] It's imperative to have a sufficiently sophisticated and rigorous enough approach that relevant context can be taken into account.”
Matthew Russell, Author
Can linguistic theory help?... What is relevant context?
Learning meaning in language
How are we able to learn what novel words mean?

V about n
① She moogles about her book
• each word contributes individual meaning
• verb meaning is central; yet verbs are highly polysemous
• the larger configuration of words carries meaning
• these we call CONSTRUCTIONS: ‘recurrent patterns of linguistic elements that serve some well-defined linguistic function’ (Ellis 2003)
• moogle inherits its interpretation from the echoes of the verbs that occupy the V about n Verb Argument Construction (VAC), words like:
  • talk, think, know, write, hear, speak, worry … fuss, shout, mutter, gossip
VACNET
Collaborative project to build an inventory of a large number of English verb argument constructions (VACs) using:
• the COBUILD Verb Grammar Patterns descriptions
• tools from computational and corpus linguistics
• techniques from data mining, machine learning and network analysis
The project has two components:
(1) a computational analysis of corpora to retrieve instances and verb distributions for the full range of VACs
(2) psycholinguistic experiments to measure speaker knowledge of these VACs through the verbs selected.
V about n – some examples
• He grumbled incessantly about the ‘disgusting’ provincial life we had to lead on the island
• You should try to think ahead about your financial situation
• He worried persistently about the poverty of his social life
• She would keep banging on about her son
• He wondered briefly about the effects of prolonged exposure to solar radiation
• The housekeeper left the room, muttering about ingratitude
• I do not want to carp about the work of the Committee
• ‘Any views expressed about Master Matthew?’
• There are several other valid justifications for teaching explicitly about language
• Those who gossip about him tend to meet with nasty accidents.
VACNET: Language engineering challenge
• TASK
  – retrieval of 700+ verb argument constructions from a 100-million-word corpus with minimal manual intervention but a requirement for high precision and high recall
• Multidisciplinary TEAM
  – linguists, psychologists, information scientists
  – undergraduate/graduate student RAs, faculty
• TOOLS
  – dependency-parsed corpus in GraphML format
  – web-based precision analysis tool
  – processing pipeline
Architecture: Large scale extraction of constructions
[Architecture diagram, main components: BNC corpus (100 million words); POS tagging & dependency parsing; word sense disambiguation; WordNet; DISCO; COBUILD Verb Patterns construction descriptions; CouchDB document database; web application; statistical analysis of distributions; network analysis & visualization]
Method: Collaborative semi-automatic extraction
1. DEFINE search graph
2. ENCODE in XML
3. CONVERT to Python code
4. SEARCH corpus and RECORD matches
5. ERROR CODE
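A hypothetical sketch of steps 1-4 for the V about n pattern, matching against a dependency parse stored as GraphML via networkx (the node attribute names "pos" and "lemma" are assumptions, not the project's actual schema):

import networkx as nx

def match_v_about_n(graphml_path):
    """Find verbs linked to 'about' plus a nominal in one dependency-parsed sentence."""
    g = nx.read_graphml(graphml_path).to_undirected()  # one GraphML graph per sentence
    for prep in g.nodes():
        if g.nodes[prep].get("lemma") != "about":
            continue
        neighbours = list(g.neighbors(prep))
        verbs = [n for n in neighbours if g.nodes[n].get("pos", "").startswith("V")]
        nouns = [n for n in neighbours if g.nodes[n].get("pos", "").startswith("N")]
        for v in verbs:
            for n in nouns:
                yield g.nodes[v]["lemma"], g.nodes[n]["lemma"]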
Precision analysis interface
Recall analysis
Results: V about n

Verb       VAC freq
talk       2232
think      1810
know       879
hear       349
worry      347
forget     322
write      299
ask        298
say        281
care       250
go         203
complain   192
speak      181
find       148
learn      143
be         124
feel       118
look       115
wonder     102
read       101

• Types (list of different verbs occurring in the VAC)
• Frequency (Zipfian?)
• Contingency (attraction between verb and construction)
• Semantics: prototypicality of meaning & radial structure (Zipfian?)
Results: V about n

Ranked by VAC frequency (as above): talk 2232, think 1810, know 879, hear 349, worry 347, forget 322, write 299, ask 298, say 281, care 250, go 203, complain 192, speak 181, find 148, learn 143, be 124, feel 118, look 115, wonder 102, read 101

Ranked by faithfulness (VAC freq / corpus freq):

Verb         VAC freq   Corpus freq   Faithfulness
reminisce    12         98            0.1224
moon         5          51            0.098
talk         2232       24566         0.0909
brag         5          69            0.0725
carp         5          72            0.0694
worry        347        5027          0.069
generalize   15         244           0.0615
generalise   10         176           0.0568
enthuse      13         236           0.0551
complain     192        3947          0.0486
grumble      18         407           0.0442
rave         9          205           0.0439
fret         10         265           0.0377
fuss         9          246           0.0366
care         250        7064          0.0354
speculate    26         771           0.0337
gossip       9          270           0.0333
forget       322        10240         0.0314
enquire      38         1341          0.0283
prowl        5          179           0.0279
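A minimal sketch of the faithfulness calculation behind the right-hand ranking (frequencies taken from the table above; the dictionary-based layout is an illustrative assumption):

# faithfulness = frequency of the verb in the VAC / overall corpus frequency of the verb
vac_freq = {"talk": 2232, "worry": 347, "reminisce": 12}
corpus_freq = {"talk": 24566, "worry": 5027, "reminisce": 98}

faithfulness = {v: vac_freq[v] / corpus_freq[v] for v in vac_freq}
for verb, score in sorted(faithfulness.items(), key=lambda kv: -kv[1]):
    print(f"{verb:10s} {score:.4f}")  # reminisce 0.1224, talk 0.0909, worry 0.0690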
VAC              Types   Tokens   TTR     Lead verb     Token*Faith   MIcw
V about n        365     3519     10.37   talk          talk          brag
V across n       799     4889     16.34   come          spread        scud
V after n        1168    7528     15.52   look          look          lust
V among pl-n     417     1228     33.96   find          divide        nestle
V around n       761     3801     20.02   look          revolve       traipse
V as adj         235     1012     23.22   know          regard        class
V as n           1702    34383    4.95    know          act           masquerade
V at n           1302    9700     13.42   look          look          officiate
V between pl-n   669     3572     18.73   distinguish   distinguish   sandwich
V for n          2779    79894    3.48    look          wait          vie
V in n           2671    37766    7.07    find          result        couch
V into n         1873    46488    4.03    go            divide        delve
V like n         548     1972     27.79   look          look          glitter
V n n            663     9183     7.22    give          give          rename
V of n           1222    25155    4.86    think         consist       partake
V over n         1312    9269     14.15   go            preside       pore
V through n      842     4936     17.06   go            riffle        riffle
V to n           707     7823     9.04    go            listen        randomize
V towards n      190     732      25.96   move          bias          gravitate
V under n        1243    8514     14.6    come          come          wilt
V way prep       365     2896     12.6    make          wend          wend
V with n         1942    24932    7.79    deal          deal          pepper
Initial Findings
• The frequency distributions for the types occupying each VAC are Zipfian
  – the most frequent verb for each VAC is much more frequent than the other members, taking the lion’s share of the distribution
• The most frequent verb in each VAC is prototypical of that construction’s functional interpretation
  – generic in its action semantics
• VACs are selective in their verb form family occupancy:
  – individual verbs select particular constructions
  – particular constructions select particular verbs
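A minimal sketch of the kind of Zipf check implied here, fitting a straight line in log-log space to the rank/frequency data for V about n above (numpy and this simple fitting approach are assumptions, not the project's statistical method):

import numpy as np

# VAC frequencies for the top 20 verbs in "V about n" (talk ... read), from the results table
freqs = [2232, 1810, 879, 349, 347, 322, 299, 298, 281, 250,
         203, 192, 181, 148, 143, 124, 118, 115, 102, 101]
ranks = np.arange(1, len(freqs) + 1)

slope, intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)
print(f"freq ~ rank^{slope:.2f}")  # a slope near -1 is the classic Zipfian signature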
What do speakers know about verbs in VACs?
Two Experiments
276 native & 276 L1 German speakers of English were asked to fill the gap with the first word that comes to mind, given the prompt:
s/he/it _____ about the …
But what about meaning?...
• We want to quantify the semantic coherence or ‘clumpiness’ of the verbs extracted in the previous steps
  – {think, know, hear, worry, care,…} ABOUT
• Construction patterns are productive units in language and subject to polysemy just like words. Can we separate meaning groups within verb distributions?
  – COMMUNICATION: {talk, write, ask, say, argue,…} ABOUT
  – COGNITION: {think, know, hear, worry, care,…} ABOUT
  – MOTION: {move, walk, run, fall, wander,…} ABOUT
• The semantic sources must not be based on localized distributional language analysis
  – Use WordNet and Roget’s
    • Pedersen et al. (2004) WordNet similarity measures
    • Kennedy, A. (2009). The Open Roget's Project: Electronic lexical knowledge base
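A hedged sketch of one WordNet-based similarity measure (path similarity), here via NLTK's WordNet interface rather than the Pedersen et al. WordNet::Similarity package itself; taking the first verb synset of each lemma is a simplifying assumption:

from nltk.corpus import wordnet as wn

def verb_similarity(v1, v2):
    """Path similarity between the first verb synsets of two lemmas (crude but illustrative)."""
    s1 = wn.synsets(v1, pos=wn.VERB)[0]
    s2 = wn.synsets(v2, pos=wn.VERB)[0]
    return s1.path_similarity(s2)

print(verb_similarity("talk", "speak"))   # within COMMUNICATION: relatively high
print(verb_similarity("talk", "wander"))  # COMMUNICATION vs MOTION: lower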
Building a semantic network
• Use semantic similarity scores for pairs of verbs (from WordNet, Roget, DISCO, etc.) to create a network
  • nodes = lemma forms from the VAC/CEC distribution
  • edges = links between pairs of nodes whose similarity score is among the top n
[Network visualization: community detection over the top 100 verbs in the V about n VAC, with COGNITION and COMMUNICATION clusters labelled]
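A minimal sketch of that construction step, assuming networkx and a pluggable similarity function (WordNet, Roget or DISCO scores would be dropped in here); the function names and the value of n are illustrative:

import itertools
import networkx as nx

def build_semantic_network(verbs, similarity, n=200):
    """Nodes are verb lemmas; edges link the n most similar verb pairs."""
    pairs = list(itertools.combinations(verbs, 2))
    top_pairs = sorted(pairs, key=lambda p: similarity(*p), reverse=True)[:n]
    g = nx.Graph()
    g.add_nodes_from(verbs)
    for v1, v2 in top_pairs:
        g.add_edge(v1, v2, weight=similarity(v1, v2))
    return g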
Semantic Networks
• Exploring community detection algorithms
  • Edge Betweenness (Girvan and Newman, 2002)
  • Fast Greedy (Clauset, Newman and Moore, 2004)
  • Label Propagation (Raghavan, Albert and Kumara, 2007)
  • Leading Eigenvector (Newman, 2006)
  • Spinglass (Reichardt and Bornholdt, 2006)
  • Walktrap (Pons and Latapy, 2005)
  • Louvain (Blondel, Guillaume, Lambiotte and Lefebvre, 2008)
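An illustrative comparison of a few of these algorithms using python-igraph (the slides do not say which implementation was used, so the library choice and the Zachary karate-club stand-in graph are assumptions):

from igraph import Graph

g = Graph.Famous("Zachary")  # stand-in for a verb similarity network

methods = [
    ("Fast Greedy", lambda: g.community_fastgreedy().as_clustering()),
    ("Label Propagation", lambda: g.community_label_propagation()),
    ("Walktrap", lambda: g.community_walktrap().as_clustering()),
    ("Louvain", lambda: g.community_multilevel()),
]
for name, detect in methods:
    clusters = detect()
    print(f"{name}: {len(clusters)} communities, modularity {clusters.modularity:.3f}")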
VACNET Summary
• Challenge of natural language for data mining
• Project investigates usage of VACs at scale
  – constructions = meaning through patterns
  – IR challenge: retrieving non-trivial structures at scale
• Corpus analysis examines the distributions of verbs in VACs
  – frequency distribution
  – contingency
  – semantics
• Psycholinguistic experiments explore the psychological reality of VACs
• VACNET structured inventory
  – verb to construction and construction to verb
  – valuable for NLP and DM tasks
• Future explorations:
  – train classifiers on our datasets
  – tackle ‘big data’ sets
Thank you!
[email protected]