Googleology is bad science
  Adam Kilgarriff
  Lexical Computing Ltd
  Universities of Sussex, Leeds

Web as language resource
  Replaceable or replacable? check

  Very very large
  Most languages
  Most language types
  Up-to-date
  Free
  Instant access

How to use the web?
  Google or other commercial search engines (CSEs)
  not

Using CSEs
  No setup costs
  Start querying today
  Methods
    Hit counts
    ‘snippets’
    Metasearch engines, WebCorp
    Find pages and download

Googleology
  CSE hit counts for language modelling
    36 queries to estimate freq(fulfil, obligation) to each of Google and Altavista (Keller & Lapata 2003)
  finding noun-noun relations
    “we issue exact phrase Google queries of type noun2 THAT * noun1” (Nakov and Hearst 2006)
  Small community of researchers
    Corpora mailing list
  Very interesting work
  Intense interest in query syntax
  Creativity and person-years

The Trouble with Google
  not enough instances
    max 1000
  not enough queries
    max 1000 per day with API
  not enough context
    10-word snippet around search term
  ridiculous sort order
    search term in titles and headings
  untrustworthy hit counts
  limited search syntax
    No regular expressions
  linguistically dumb
    not lemmatised: aime/aimer/aimes/aimons/aimez/aiment …
    not POS-tagged
    not parsed

Appeal
  Zero-cost entry, just start googling
Reality
  High-quality work: high-cost methodology

Also:
  No replicability
  Methods, stats not published
  At mercy of commercial corporation
  Bad science

The 5-grams
  A present from Google
  All 1-, 2-, 3-, 4-, 5-grams with fr>=40 in a terabyte of English
  A large dataset

Prognosis
  Next 3 years
    Exciting new ideas
    Dazzlingly clever uses
    Drives progress in NLP

Prognosis
  Next 3 years
    Exciting new ideas
    Dazzlingly clever uses
  After 5+ years
    A chain round our necks
    Cf Penn Treebank (others? Brickbats?)
    Resource-led vs. ideas-led research

How to use the web?
  Google or other commercial search engines (CSEs)
  not

Language and the web
  Web is mostly linguistic
  Text on web << whole web (in GB)
  Not many TB of text
  Special hardware not needed
  We are the experts

Community-building
  ACL SIGWAC
  WAC Kool Ynitiative (WaCKY)
    Mailing list
    Open source
  WAC workshops
    WAC1, Birmingham 2005
    WAC2, Trento (EACL), April 2006
    WAC3, Louvain, Sept 15-16 2007

Proof of concept: DeWaC, ItWaC
  1.5 B words each, German and Italian
  Marco Baroni, Bologna (+ AK)

What is out there?
  What text types?
    some are new: chatroom
    proportions
    is it overwhelmed by porn? How much?
  Hard question

What is out there
  The web
    a social, cultural, political phenomenon
    new, little understood
    a legitimate object of science
    mostly language
      we are well placed
    a lot of people will be interested
  Let’s study the web
    source of language data
    apply our tools for web use (dictionaries, MT)
    use the web as infrastructure

How to do it: Components
  1. web crawler
  2. filters and classifiers
     de-duplication
  3. linguistic processing
     • Lemmatise, POS-tag, parse
  4. Database
     • Indexing
     • user interface

1. Crawling
  How big is your hard disk?
  When will your sysadmin ban you?
  DeWaC/ItWaC
    Open source crawler: Heritrix

1.1 Seeding the crawl
  Mid-frequency words
  Spread of text types
    Formal and informal, not just newspaper
  DeWaC
    Words from newspaper corpus
    Words from list with “kitchen” vocab
  Use Google to get seeds for crawls

2. Filtering
  non ‘running-text’ stripping
  Function word filtering
  Porn filtering
  De-duplication

2.1 Filtering: Sentences
  What is the text that we want?
    Lists? Links? Catalogues? …
  For linguistics, NLP
    in sentences
  Use function words

2.2 Filtering: CLEANEVAL
  “Text cleaning”
  Lots to be done, not glamorous
  Many kinds of dirt needing many kinds of filter
  Open Competition/shared task
    Who can produce the cleanest text?!
  Input: arbitrary web pages
  “gold standard”
    paragraph-marked plain text
    Prepared by people
  Workshop Sept 2007. Do join us!
  http://cleaneval.sigwac.org.uk

3. Linguistic processing
  Lemmatise, POS-tag, parse
  Find leading NLP group for each language
    Be nice to them
    Use their tools

Database, interface
  Solved problem (at least for 1.5 BW)
  Sketch Engine

“Despite all the disadvantages, it’s still so much bigger”

How much bigger?
  Method
    Sample words
      30
      Mid-to-high freq
      Not common words in other major lgs
      Min 5 chars
    Compare freqs, Google vs ItWaC/DeWaC

Google results (Italian)
  Arbitrariness
    Repeat identical searches
      9/30: > 10% difference
      6/30: > 100% difference
    API: typically 1/18th ‘manual’ figure
  Language filter
    mista bomba clima
      mostly non-Italian pages
    use MAX and MIN of 6 lg-filtered results

Clima=
  Computational logic in multi-agent systems
  Centre for Legumes in Mediterranean Agriculture
  (5-char limit too short)

Ratios, Google:DeWaC

  WORD          MAX     MIN     RAW      CLEAN
  ----------------------------------------------
  besuchte      10.5    3.8     81840    18228
  stirn         3.38    0.62    32320    11137
  gerufen       7.14    3.72    66720    27187
  verringert    6.86    3.46    52160    15987
  bislang       24.4    11.6    239000   90098
  brach         4.36    2.26    44520    19824
  ----------------------------------------------
  MAX/MIN: max/min of 6 Google values (millions)
  RAW: DeWaC document frequency before filters, dedupe
  CLEAN: DeWaC document frequency after filters, dedupe

ItWaC:Google ratio, best estimate
  For each of 30 words
    Calculate ratio, max:raw
    Calculate ratio, min:raw
    Take mid-point and average: 1:33 or 3%
  Calculate raw:vert
    Average = 4.4
    half (for conservativeness/uncertainty) = 2.2
  3% x 2.2 = 6.6%
  ItWaC:Google = 6.6%

Italian web size
  ItWaC = 1.67b words
  Google indexes 1.67/.066 = 25 bn words
    sentential non-dupe Italian

German web size
  Analysis as for Italian
  DeWaC: 3% Google
  DeWaC = 1.41b words
  Google indexes 1.41/.03 = 44 bn words
    sentential non-dupe German

Effort
  ItWaC, DeWaC
    Less than 6 person months
    Developing the method
  (EnWaC: in progress)

Plan
  ACL adopts it (like ACL Anthology) (LDC?)
  Say: 3 core staff, 3 years
  Goals could be:
    English: 2% G-scale (still biggest part)
    6 other major languages: 30% G-scale
    30 other languages: 10% G-scale
  Online for
    Searching as in SkE
    Specifying, downloading subcorpora for intensive NLP
      “corpora on demand”
  Don’t quote me

Logjams
  Cleaning
    See CLEANEVAL
  Text type
    “what kind of page is it?”
    Critical but under-researched
    WebDoc proposal (with Serge Sharoff, Tony Hartley)
    (a different talk)

Moral
  Google, CSEs are wonderful
    Start today
    but bad science
  Not
    Good science, reliable counts
    We (the NLP community) have the skills
    With collective effort, mid-sized project
    Google-scale is achievable

Thank you
  http://www.sketchengine.co.uk

Scale and speed, LSE
  Commercial search engines
    banks of computers
    highly optimised code
    but this is for performance
      no downtime
      instant responses to millions of queries
  This proposal
    crawling: once a year
    downtime: acceptable
    not so many users

…but it’s not representative
  The web is not representative
    but nor is anything else
  Text type variation
    under-researched, lacking in theory
    Atkins, Clear, Ostler 1993 on design brief for BNC; Biber 1988, Baayen 2001, Kilgarriff 2001
  Text type is an issue across NLP
  Web: issue is acute because, as against BNC or WSJ, we simply don’t know what is there

Oxford English Corpus
  Method as above
  Whole domains chosen and harvested
    control over text type
  1 billion words
  Public launch April 2006
  Loaded into Sketch Engine

Examples
  DeWaC, ItWaC
    Baroni and Kilgarriff, EACL 2006
  Serge Sharoff, Leeds Univ UK
    English Chinese Russian English French Spanish, all searchable online
  Oxford English Corpus

Options for academics
  Give up
    Niche markets, obscure languages
    Leave the mainstream to the big guys
  Work out how to work on that scale
    Web is free, data availability not a problem
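
Worked example: the web-size arithmetic
  A minimal sketch of the calculation on the "ItWaC:Google ratio, best estimate" and "Italian web size" slides, using only the figures given there (3%, 4.4, 1.67 bn words). The function and variable names are illustrative assumptions, not part of the talk, and the halving step simply mirrors the slide's "half (for conservativeness/uncertainty)".

```python
def corpus_share_of_google(raw_share: float, raw_to_vert: float) -> float:
    """Estimate the corpus size as a fraction of what Google indexes.

    raw_share   -- averaged mid-point ratio from the slide (1:33, i.e. about 0.03)
    raw_to_vert -- average raw:vert ratio from the slide (4.4)
    The raw:vert ratio is halved, as on the slide, to stay conservative.
    """
    conservative_factor = raw_to_vert / 2  # 4.4 -> 2.2
    return raw_share * conservative_factor


def google_indexed_words_bn(corpus_words_bn: float, share: float) -> float:
    """Scale the corpus size up to an estimate of Google's indexed text (bn words)."""
    return corpus_words_bn / share


# Figures for Italian, taken from the slides.
share = corpus_share_of_google(raw_share=0.03, raw_to_vert=4.4)
estimate = google_indexed_words_bn(corpus_words_bn=1.67, share=share)

print(f"ItWaC:Google = {share:.1%}")
print(f"Google indexes about {estimate:.0f} bn words of Italian")
```

  Run as written, this reproduces the 6.6% and 25 bn figures for Italian. The sketch only restates the slide arithmetic; it says nothing about how reliable the underlying hit counts are.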