Googleology is bad science
Adam Kilgarriff
Lexical Computing Ltd
Universities of Sussex, Leeds
1
Web as language resource
• Replaceable or replacable?
• check
2
• Very very large
• Most languages
• Most language types
• Up-to-date
• Free
• Instant access
3
How to use the web?
• Google or other commercial search engines (CSEs)
• not
4
Using CSEs
• No setup costs
• Start querying today
• Methods
  • Hit counts
  • ‘snippets’
  • Metasearch engines, WebCorp
  • Find pages and download
5
Googleology
• CSE hit counts for language modelling
  • 36 queries to estimate freq(fulfil, obligation) to each of Google and Altavista (Keller & Lapata 2003)
• finding noun-noun relations
  • “we issue exact phrase Google queries of type noun2 THAT * noun1” (Nakov and Hearst 2006)
• Small community of researchers
• Corpora mailing list
• Very interesting work
• Intense interest in query syntax
• Creativity and person-years
6
The Trouble with Google
• not enough instances
  • max 1000
• not enough queries
  • max 1000 per day with API
• not enough context
  • 10-word snippet around search term
• ridiculous sort order
  • search term in titles and headings
• untrustworthy hit counts
• limited search syntax
  • No regular expressions
• linguistically dumb
  • not lemmatised
    • aime/aimer/aimes/aimons/aimez/aiment …
  • not POS-tagged
  • not parsed
7
• Appeal
  • Zero-cost entry, just start googling
• Reality
  • High-quality work: high-cost methodology
8
Also:
• No replicability
• Methods, stats not published
• At mercy of commercial corporation
Bad science
10
The 5-grams
• A present from Google
• All
  • 1-, 2-, 3-, 4-, 5-grams
  • with frequency >= 40
  • in a terabyte of English
• A large dataset
11
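What such a dataset contains can be sketched in a few lines: every 1- to 5-gram in a token stream, kept only if it reaches a frequency threshold. This is an illustration only. The function name `ngram_counts`, the whitespace tokenisation and the toy threshold are assumptions for the example, and nothing here describes how Google actually built its release.

```python
from collections import Counter


def ngram_counts(tokens, max_n=5, min_freq=40):
    """Count all 1..max_n-grams in `tokens`; keep those with freq >= min_freq."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return {gram: freq for gram, freq in counts.items() if freq >= min_freq}


if __name__ == "__main__":
    # Toy input; a threshold of 2 stands in for the slide's 40.
    text = "the web is big and the web is free"
    kept = ngram_counts(text.split(), max_n=5, min_freq=2)
    for gram, freq in sorted(kept.items()):
        print(" ".join(gram), freq)
```

Holding every n-gram in memory only works for small inputs; at terabyte scale the counting would have to be sharded or streamed.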
Prognosis
• Next 3 years
  • Exciting new ideas
  • Dazzlingly clever uses
  • Drives progress in NLP
• After 5+ years
  • A chain round our necks
  • Cf. Penn Treebank (others? Brickbats?)
  • Resource-led vs. ideas-led research
13
How to use the web?
• Google or other commercial search engines (CSEs)
• not
14
Language and the web
• Web is mostly linguistic
• Text on web << whole web (in GB)
  • Not many TB of text
  • Special hardware not needed
• We are the experts
15
Community-building
• ACL SIGWAC
• WAC Kool Ynitiative (WaCKY)
  • Mailing list
  • Open source
• WAC workshops
  • WAC1, Birmingham 2005
  • WAC2, Trento (EACL), April 2006
  • WAC3, Louvain, Sept 15-16 2007
16
Proof of concept: DeWaC, ItWaC
• 1.5 bn words each, German and Italian
• Marco Baroni, Bologna (+ AK)
17
What is out there?
• What text types?
  • some are new: chatroom
  • proportions
  • is it overwhelmed by porn? How much?
• Hard question
18
What is out there
• The web
  • a social, cultural, political phenomenon
  • new, little understood
  • a legitimate object of science
  • mostly language
  • we are well placed
• a lot of people will be interested
• Let’s
  • study the web
  • source of language data
  • apply our tools for web use (dictionaries, MT)
  • use the web as infrastructure
19
How to do it:
Components
1. web crawler
2. filters and classifiers
  • de-duplication
3. linguistic processing
  • Lemmatise, POS-tag, parse
4. Database
  • Indexing
  • user interface
20
1. Crawling
• How big is your hard disk?
• When will your sysadmin ban you?
DeWaC/ItWaC
• Open source crawler: Heritrix
21
1.1 Seeding the crawl
• Mid-frequency words
• Spread of text types
  • Formal and informal, not just newspaper
• DeWaC
  • Words from newspaper corpus
  • Words from list with “kitchen” vocab
• Use Google to get seeds for crawls
22
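The seeding step can be sketched as follows: take words from a mid-frequency band of a frequency list and turn them into search-engine queries whose results become crawl seeds. The rank band, the pairing of words into queries, and the names `mid_frequency_words` and `seed_queries` are all assumptions made for this illustration; the slides state only that mid-frequency words and Google were used.

```python
import random


def mid_frequency_words(freq_list, low_rank=1000, high_rank=6000):
    """Return words ranked low_rank..high_rank by descending frequency."""
    ranked = sorted(freq_list.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[low_rank:high_rank]]


def seed_queries(words, n_queries=10, words_per_query=2, seed=0):
    """Combine randomly chosen words into query strings (assumed scheme)."""
    rng = random.Random(seed)
    return [" ".join(rng.sample(words, words_per_query)) for _ in range(n_queries)]


if __name__ == "__main__":
    # Invented toy frequency list, for illustration only.
    toy = {f"word{i}": 10_000 - i for i in range(200)}
    mids = mid_frequency_words(toy, low_rank=50, high_rank=150)
    for query in seed_queries(mids, n_queries=3):
        print(query)
```

Drawing the words from two lists, newspaper and "kitchen" vocabulary, would simply mean calling `seed_queries` once per list.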
2. Filtering
• non ‘running-text’ stripping
• Function word filtering
• Porn filtering
• De-duplication
23
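De-duplication, the last of these filters, can be illustrated with a simple near-duplicate check. The slides do not say which method was used, so the technique here (word 5-gram shingles compared by Jaccard similarity) and the 0.8 threshold are assumptions chosen for the sketch.

```python
def shingles(text, n=5):
    """Return the set of word n-grams (shingles) in `text`."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(docs, n=5, threshold=0.8):
    """Keep each document unless it nearly duplicates one already kept."""
    kept, kept_shingles = [], []
    for doc in docs:
        sh = shingles(doc, n)
        if all(jaccard(sh, other) < threshold for other in kept_shingles):
            kept.append(doc)
            kept_shingles.append(sh)
    return kept
```

The pairwise comparison is quadratic in the number of documents, so this form only suits small samples; a billion-word crawl would need hashing or sketching on top of the same idea.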
2.1 Filtering: Sentences
• What is the text that we want?
  • Lists?
  • Links?
  • Catalogues?
  • …
• For linguistics, NLP
  • in sentences
• Use function words
24
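The function-word idea can be sketched like this: running text in sentences contains a high proportion of function words, while lists, link blocks and catalogues contain few. The short English word list, the thresholds and the function names are assumptions for the illustration, not the settings used for DeWaC or ItWaC.

```python
# A small, assumed sample of English function words.
FUNCTION_WORDS = {
    "the", "of", "and", "a", "in", "to", "is", "that",
    "it", "for", "was", "on", "with", "as", "be",
}


def function_word_ratio(text):
    """Proportion of tokens in `text` that are function words."""
    tokens = [t.strip(".,;:!?\"'()").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    return sum(t in FUNCTION_WORDS for t in tokens) / len(tokens)


def looks_like_running_text(text, min_ratio=0.25, min_tokens=10):
    """Accept text that is long enough and rich enough in function words."""
    return len(text.split()) >= min_tokens and function_word_ratio(text) >= min_ratio


if __name__ == "__main__":
    sentence = ("The crawler is given a list of seeds and it fetches "
                "the pages that are linked to them.")
    menu = ("Home Products Catalogue Contact Login Basket Checkout "
            "Help Sitemap Search News")
    print(looks_like_running_text(sentence))
    print(looks_like_running_text(menu))
```

The same test doubles as a crude language filter, since each language needs its own function-word list.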
2.2 Filtering: CLEANEVAL
• “Text cleaning”
  • Lots to be done, not glamorous
  • Many kinds of dirt needing many kinds of filter
• Open Competition/shared task
  • Who can produce the cleanest text?!
  • Input: arbitrary web pages
  • “gold standard”
    • paragraph-marked plain text
    • Prepared by people
• Workshop Sept 2007. Do join us!
  • http://cleaneval.sigwac.org.uk
25
3. Linguistic processing
• Lemmatise, POS-tag, parse
• Find leading NLP group for each language
  • Be nice to them
  • Use their tools
26
Database, interface
• Solved problem (at least for 1.5 billion words)
• Sketch Engine
27
“Despite all the disadvantages, it’s still so much bigger”
28
How much bigger?
• Method
  • Sample words
    • 30
    • Mid-to-high frequency
    • Not common words in other major languages
    • Min 5 chars
  • Compare frequencies, Google vs ItWaC/DeWaC
29
Google results (Italian)
• Arbitrariness
  • Repeat identical searches
    • 9/30: > 10% difference
    • 6/30: > 100% difference
  • API: typically 1/18th ‘manual’ figure
• Language filter
  • mista bomba clima
    • mostly non-Italian pages
  • use MAX and MIN of 6 language-filtered results
30
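The arbitrariness check can be sketched as a comparison of hit counts from repeated identical searches. The slides do not define "difference"; measuring it relative to the smaller count is an assumption, chosen because only then can a difference exceed 100%. The counts in the example are invented.

```python
def relative_difference(a, b):
    """Difference between two counts as a fraction of the smaller one."""
    low, high = sorted((a, b))
    if low == 0:
        return float("inf") if high else 0.0
    return (high - low) / low


def share_exceeding(pairs, threshold):
    """Fraction of (first, repeat) count pairs differing by more than threshold."""
    return sum(relative_difference(a, b) > threshold for a, b in pairs) / len(pairs)


def google_range(filtered_counts):
    """MAX and MIN over several language-filtered counts for one word."""
    return max(filtered_counts), min(filtered_counts)


if __name__ == "__main__":
    # Invented counts for three words, each searched twice.
    pairs = [(100, 105), (100, 120), (100, 250)]
    print(share_exceeding(pairs, 0.10))
    print(share_exceeding(pairs, 1.00))
    print(google_range([3.8, 10.5, 6.1, 4.4, 9.0, 7.2]))
```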
• Clima =
  • Computational logic in multi-agent systems
  • Centre for Legumes in Mediterranean Agriculture
• (5-char limit too short)
31
Ratios, Google:DeWaC
WORD           MAX    MIN      RAW   CLEAN
------------------------------------------
besuchte      10.5    3.8    81840   18228
stirn         3.38   0.62    32320   11137
gerufen       7.14   3.72    66720   27187
verringert    6.86   3.46    52160   15987
bislang       24.4   11.6   239000   90098
brach         4.36   2.26    44520   19824
------------------------------------------
MAX/MIN: max/min of 6 Google values (millions)
RAW: DeWaC document frequency before filters, dedupe
CLEAN: DeWaC document frequency after filters, dedupe
32
ItWaC:Google ratio, best estimate
• For each of 30 words
  • Calculate ratio, max:raw
  • Calculate ratio, min:raw
  • Take mid-point and average: 1:33 or 3%
• Calculate raw:vert
  • Average = 4.4
  • half (for conservativeness/uncertainty) = 2.2
• 3% x 2.2 = 6.6%
• ItWaC:Google = 6.6%
33
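The arithmetic of the best estimate can be written out as a small calculation using only the summary figures from the slide (3%, 4.4, and the halving for caution). The function and parameter names are assumptions for the illustration; the per-word ratios behind the 3% figure are not reproduced here.

```python
def corpus_to_google_ratio(raw_ratio, raw_to_clean, caution=0.5):
    """Scale the raw corpus:Google ratio by a halved raw:clean factor."""
    return raw_ratio * raw_to_clean * caution


if __name__ == "__main__":
    # 3% raw ratio, raw:vert average of 4.4, halved to 2.2.
    ratio = corpus_to_google_ratio(raw_ratio=0.03, raw_to_clean=4.4)
    print(f"{ratio:.1%}")  # prints 6.6%
```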
Italian web size
• ItWaC = 1.67 bn words
• Google indexes 1.67/0.066 = 25 bn words
  • sentential non-dupe Italian
34
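The size estimate is a single division: corpus size over the corpus:Google ratio. A minimal sketch, with a hypothetical function name, using the Italian figures from the slide:

```python
def estimated_index_words(corpus_words, corpus_to_google):
    """Estimate words indexed by the search engine from corpus size and ratio."""
    if not 0 < corpus_to_google <= 1:
        raise ValueError("ratio must be in (0, 1]")
    return corpus_words / corpus_to_google


if __name__ == "__main__":
    italian = estimated_index_words(1.67e9, 0.066)
    print(f"{italian / 1e9:.1f} bn words")  # prints 25.3 bn words
```

The slide rounds the result to 25 bn.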
German web size
• Analysis as for Italian
• DeWaC: 3% Google
• DeWaC = 1.41 bn words
• Google indexes 1.41/0.03 = 44 bn words
  • sentential non-dupe German
35
Effort
• ItWaC, DeWaC
  • Less than 6 person months
  • Developing the method
• (EnWaC: in progress)
36
Plan
• ACL adopts it (like ACL Anthology) (LDC?)
• Say: 3 core staff, 3 years
• Goals could be:
  • English: 2% G-scale (still biggest part)
  • 6 other major languages: 30% G-scale
  • 30 other languages: 10% G-scale
• Online for
  • Searching as in SkE
  • Specifying, downloading subcorpora for intensive NLP
    • “corpora on demand”
• Don’t quote me :-)
37
Logjams
• Cleaning
  • See CLEANEVAL
• Text type
  • “what kind of page is it?”
  • Critical but under-researched
  • WebDoc proposal
    • (with Serge Sharoff, Tony Hartley)
    • (a different talk)
38
Moral
• Google, CSEs are wonderful
  • Start today, but bad science
• Not
  • Good science, reliable counts
  • We (the NLP community) have the skills
  • With collective effort, mid-sized project, Google-scale is achievable
39
Thank you
• http://www.sketchengine.co.uk
40
Scale and speed, LSE
• Commercial search engines
  • banks of computers
  • highly optimised code
  • but this is for performance
    • no downtime
    • instant responses to millions of queries
• This proposal
  • crawling: once a year
  • downtime: acceptable
  • not so many users
41
…but it’s not representative
• The web is not representative
  • but nor is anything else
• Text type variation
  • under-researched, lacking in theory
  • Atkins, Clear and Ostler 1993 on design brief for BNC; Biber 1988, Baayen 2001, Kilgarriff 2001
• Text type is an issue across NLP
  • Web: issue is acute because, as against BNC or WSJ, we simply don’t know what is there
42
Oxford English Corpus
• Method as above
• Whole domains chosen and harvested
  • control over text type
• 1 billion words
• Public launch April 2006
• Loaded into Sketch Engine
43
Examples
• DeWaC, ItWaC
  • Baroni and Kilgarriff, EACL 2006
• Serge Sharoff, Leeds Univ UK
  • English Chinese Russian English French Spanish, all searchable online
• Oxford English Corpus
46
Options for academics
• Give up
  • Niche markets, obscure languages
  • Leave the mainstream to the big guys
• Work out how to work on that scale
  • Web is free, data availability not a problem
47