VACNET: Extracting and analyzing non-trivial linguistic structures at scale
Matthew Brook O'Donnell, Nick C. Ellis, Ute Römer & Gin Corden
English Language Institute
[email protected]
The 2nd University of Michigan Workshop on Data, Text, Web, and Social Network Mining
April 22, 2011

Challenge of natural language for data mining
• Much work in NLP, IR and text classification relies upon frequency analysis of:
  • single words
  • n-grams (contiguous word sequences of various lengths)
• These units are computationally trivial to retrieve (the Map-Reduce 'Hello World'!)
• Such techniques tend to use a 'bag of words' approach, disregarding structure
• Frequency and statistical measures highlight distinctive items and document 'aboutness'
• But this is a weak proxy for meaning, which remains somewhat elusive!

Can linguistic theory help?

NLP tools: Typical NLP pipeline
text → sentence splitting → word tokenization → POS tagging → chunking/parsing → named-entity recognition → meaning???

Challenge of natural language for data mining
"Analyzing natural language data is, in my opinion, the problem of the next 2-3 decades. It's an incredibly difficult issue […] It's imperative to have a sufficiently sophisticated and rigorous enough approach that relevant context can be taken into account." (Matthew Russell, author)

Can linguistic theory help? What is relevant context?

Learning meaning in language
How are we able to learn what novel words mean?
V about n
① She moogles about her book
• Each word contributes individual meaning
• Verb meaning is central, yet verbs are highly polysemous
• The larger configuration of words carries meaning
• These we call CONSTRUCTIONS: 'recurrent patterns of linguistic elements that serve some well-defined linguistic function' (Ellis 2003)
• The nonce verb moogle inherits its interpretation from the echoes of the verbs that occupy the V about n Verb Argument Construction (VAC), words like: talk, think, know, write, hear, speak, worry … fuss, shout, mutter, gossip

VACNET
Collaborative project to build an inventory of a large number of English verb argument constructions (VACs) using:
• the COBUILD Verb Grammar Patterns descriptions
• tools from computational and corpus linguistics
• techniques from data mining, machine learning and network analysis
The project has two components:
(1) a computational analysis of corpora to retrieve instances and verb distributions for the full range of VACs
(2) psycholinguistic experiments to measure speaker knowledge of these VACs through the verbs selected

V about n – some examples
• He grumbled incessantly about the 'disgusting' provincial life we had to lead on the island
• You should try to think ahead about your financial situation
• He worried persistently about the poverty of his social life
• She would keep banging on about her son
• He wondered briefly about the effects of prolonged exposure to solar radiation
• The housekeeper left the room, muttering about ingratitude
• I do not want to carp about the work of the Committee
• 'Any views expressed about Master Matthew?'
• There are several other valid justifications for teaching explicitly about language
• Those who gossip about him tend to meet with nasty accidents.
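The retrieval task can be illustrated with a toy sketch. Note the assumptions: the real project searches a dependency-parsed corpus via GraphML search graphs, whereas this simplification just scans a POS-tagged token list for a verb followed by "about" and a downstream noun; the function name and the tag set shown are hypothetical.

```python
# Toy sketch of matching the "V about n" pattern in one POS-tagged sentence.
# Illustrative only: the actual VACNET pipeline searches dependency graphs.

def match_v_about_n(tagged):
    """tagged: list of (word, POS) tuples; returns matched (verb, noun) pairs."""
    matches = []
    for i, (word, pos) in enumerate(tagged):
        if pos.startswith("V"):  # a verb form
            # look ahead for "about", then for the first following noun
            for j in range(i + 1, len(tagged)):
                w, _ = tagged[j]
                if w.lower() == "about":
                    for k in range(j + 1, len(tagged)):
                        if tagged[k][1].startswith("N"):
                            matches.append((word, tagged[k][0]))
                            break
                    break
    return matches

sentence = [("She", "PRP"), ("worried", "VBD"), ("persistently", "RB"),
            ("about", "IN"), ("the", "DT"), ("poverty", "NN")]
print(match_v_about_n(sentence))  # [('worried', 'poverty')]
```

A linear scan like this over-matches badly on real text (intervening clauses, "about" as adverb), which is exactly why the project works from dependency parses with human error-coding for precision.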
VACNET: Language engineering challenge
• TASK: retrieval of 700+ verb argument constructions from a 100-million-word corpus, with minimal intervention but a requirement for high precision and high recall
• Multidisciplinary TEAM: linguists, psychologists, information scientists; undergraduate/graduate student RAs, faculty
• TOOLS:
  • dependency-parsed corpus in GraphML format
  • web-based precision analysis tool
  • processing pipeline

Architecture: Large-scale extraction of constructions
[Diagram: BNC corpus (100 million words) → POS tagging & dependency parsing → word sense disambiguation (WordNet, DISCO) → CouchDB document database, fed by the COBUILD Verb Patterns construction descriptions → web application → statistical analysis of distributions → network analysis & visualization]

Method: Collaborative semi-automatic extraction
1. DEFINE search graph
2. ENCODE in XML
3. CONVERT to Python code
4. SEARCH corpus and RECORD matches
5. ERROR CODE matches
[Screenshots: precision analysis interface; recall analysis]

Results: V about n
Verb       freq
talk       2232
think      1810
know        879
hear        349
worry       347
forget      322
write       299
ask         298
say         281
care        250
go          203
complain    192
speak       181
find        148
learn       143
be          124
feel        118
look        115
wonder      102
read        101

Dimensions of analysis:
• Types (list of different verbs occurring in the VAC)
• Frequency (Zipfian?)
• Contingency (attraction of verb to construction)
• Semantics: prototypicality of meaning & radial structure (Zipfian?)
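The "Zipfian?" question and the contingency measure can both be checked in a few lines. The frequencies are the V about n counts listed above; the least-squares fit is an illustrative sketch, not the project's actual statistical analysis, and `faithfulness` simply names the VAC-frequency / corpus-frequency ratio used on the next slide.

```python
import math

# Top-20 verb frequencies in V about n, as listed above.
freqs = [2232, 1810, 879, 349, 347, 322, 299, 298, 281, 250,
         203, 192, 181, 148, 143, 124, 118, 115, 102, 101]

# A Zipfian distribution is roughly linear in log-log space with slope
# near -1. Fit log(freq) against log(rank) by ordinary least squares.
xs = [math.log(r) for r in range(1, len(freqs) + 1)]
ys = [math.log(f) for f in freqs]
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
print(round(slope, 2))  # close to -1 for this distribution

# One simple contingency ('faithfulness') measure: the share of a verb's
# total corpus occurrences that fall inside this construction.
def faithfulness(vac_freq, corpus_freq):
    return vac_freq / corpus_freq

print(round(faithfulness(2232, 24566), 4))  # talk: 0.0909
```

Note that frequency and faithfulness rank verbs differently: talk dominates by raw tokens, while rarer verbs like reminisce devote a larger share of their use to this construction.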
Results: V about n (verbs ranked by faithfulness = VAC freq / corpus freq)

Verb         VAC freq   Corpus freq   Faithfulness
reminisce          12            98         0.1224
moon                5            51         0.098
talk             2232         24566         0.0909
brag                5            69         0.0725
carp                5            72         0.0694
worry             347          5027         0.069
generalize         15           244         0.0615
generalise         10           176         0.0568
enthuse            13           236         0.0551
complain          192          3947         0.0486
grumble            18           407         0.0442
rave                9           205         0.0439
fret               10           265         0.0377
fuss                9           246         0.0366
care              250          7064         0.0354
speculate          26           771         0.0337
gossip              9           270         0.0333
forget            322         10240         0.0314
enquire            38          1341         0.0283
prowl               5           179         0.0279

Overview across VACs (TTR = types/tokens × 100):

VAC              Types   Tokens    TTR   Lead verb     Token*Faith   MIcw
V about n          365     3519  10.37   talk          talk          brag
V across n         799     4889  16.34   come          spread        scud
V after n         1168     7528  15.52   look          look          lust
V among pl-n       417     1228  33.96   find          divide        nestle
V around n         761     3801  20.02   look          revolve       traipse
V as adj           235     1012  23.22   know          regard        class
V as n            1702    34383   4.95   know          act           masquerade
V at n            1302     9700  13.42   look          look          officiate
V between pl-n     669     3572  18.73   distinguish   distinguish   sandwich
V for n           2779    79894   3.48   look          wait          vie
V in n            2671    37766   7.07   find          result        couch
V into n          1873    46488   4.03   go            divide        delve
V like n           548     1972  27.79   look          look          glitter
V n n              663     9183   7.22   give          give          rename
V of n            1222    25155   4.86   think         consist       partake
V over n          1312     9269  14.15   go            preside       pore
V through n        842     4936  17.06   go            riffle        riffle
V to n             707     7823   9.04   go            listen        randomize
V towards n        190      732  25.96   move          bias          gravitate
V under n         1243     8514  14.60   come          come          wilt
V way prep         365     2896  12.60   make          wend          wend
V with n          1942    24932   7.79   deal          deal          pepper

Initial findings
• The frequency distributions for the types occupying each VAC are Zipfian
• The most frequent verb for each VAC is much more frequent than the other members, taking the lion's share of the distribution
• The most frequent verb in each VAC is prototypical of that construction's functional interpretation, generic in its
action semantics

• VACs are selective in their verb form family occupancy:
  • Individual verbs select particular constructions
  • Particular constructions select particular verbs

What do speakers know about verbs in VACs?
Two experiments: 276 native & 276 L1 German speakers of English
Asked to fill the gap with the first word that comes to mind given the prompt:
  s/he/it _____ about the …

But what about meaning?
• We want to quantify the semantic coherence or 'clumpiness' of the verbs extracted in the previous steps
• Construction patterns are productive units in language and subject to polysemy just like words. Can we separate meaning groups within verb distributions?
  • COMMUNICATION: {talk, write, ask, say, argue, …} about
  • COGNITION: {think, know, hear, worry, care, …} about
  • MOTION: {move, walk, run, fall, wander, …} about
• The semantic sources must not be based on localized distributional language analysis; use WordNet and Roget's
  • Pedersen et al. (2004) WordNet similarity measures
  • Kennedy, A. (2009). The Open Roget's Project: electronic lexical knowledge base

Building a semantic network
• Use semantic similarity scores for pairs of verbs (from WordNet, Roget, DISCO, etc.)
to create a network
• nodes = lemma forms from the VAC/CEC distribution
• edges = links between nodes for the top n similarity scores for a pair of verbs

[Figure: community detection on the top 100 verbs in the VAC V about n, showing COGNITION and COMMUNICATION communities]

Semantic networks: exploring community detection algorithms
• Edge betweenness (Girvan and Newman, 2002)
• Fast greedy (Clauset, Newman and Moore, 2004)
• Label propagation (Raghavan, Albert and Kumara, 2007)
• Leading eigenvector (Newman, 2006)
• Spinglass (Reichardt and Bornholdt, 2006)
• Walktrap (Pons and Latapy, 2005)
• Louvain (Blondel, Guillaume, Lambiotte and Lefebvre, 2008)

VACNET summary
• Challenge of natural language for data mining
  • constructions = meaning through patterns
  • IR challenge: retrieving non-trivial structures at scale
• Project investigates usage of VACs at scale
• Corpus analysis examines the distributions of verbs in VACs: frequency distribution, contingency, semantics
• Psycholinguistic experiments explore the psychological reality of VACs
• VACNET: a structured inventory, verb to construction and construction to verb, valuable for NLP and DM tasks
• Future explorations: train classifiers on our datasets; tackle 'big data' sets

Thank you!
[email protected]
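As a closing sketch, the network-building step described above (nodes = verb lemmas, edges = top similarity scores) can be mocked up in a few lines. All similarity scores here are invented for illustration (the project draws them from WordNet, Roget's and DISCO), and simple connected components stand in for the listed community-detection algorithms.

```python
# Toy semantic network: link verbs whose (hypothetical) pairwise similarity
# exceeds a threshold, then read off connected components as crude
# 'communities'. Not the project's method; the real work uses the
# community-detection algorithms listed above on much larger graphs.

verbs = ["talk", "write", "say", "think", "know", "worry"]
sim = {  # invented similarity scores for illustration
    ("talk", "write"): 0.8, ("talk", "say"): 0.9, ("write", "say"): 0.7,
    ("think", "know"): 0.85, ("think", "worry"): 0.6, ("know", "worry"): 0.55,
    ("say", "think"): 0.2, ("talk", "know"): 0.1,
}

THRESHOLD = 0.5  # keep only strong-similarity edges
edges = [pair for pair, s in sim.items() if s >= THRESHOLD]

# Union-find over the kept edges gives the connected components.
parent = {v: v for v in verbs}
def find(v):
    while parent[v] != v:
        v = parent[v]
    return v
for a, b in edges:
    parent[find(a)] = find(b)

communities = {}
for v in verbs:
    communities.setdefault(find(v), set()).add(v)
print(sorted(map(sorted, communities.values())))
```

With these toy scores the graph splits into a COMMUNICATION-like cluster (talk, write, say) and a COGNITION-like cluster (think, know, worry), mirroring the communities found for V about n.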