Which English dominates the World Wide Web, British or American?
(Combining research and teaching in corpus linguistics)
by Eric Atwell, Junaid Arshad, Chien-Ming Lai, Lan Nim, Noushin Rezapour Asheghi, Josiah Wang, and Justin Washtell
School of Computing, Leeds University

Outline
Introduction
• This paper reports the results of an experiment to combine research and teaching in Corpus Linguistics, using an AI-inspired intelligent agent architecture, but casting students as the intelligent agents.
Methods
• Detailed coursework specifications: Appendix A, B
Results
• Draft journal papers by Junaid Arshad, Chien-Ming Lai, Lan Nim, Noushin Rezapour Asheghi, Josiah Wang, and Justin Washtell
Conclusions
• ? … also, I need research questions for next year’s classes!

Introduction
93 Computing students studying Computational Modelling and Technologies for Knowledge Management were given the data-mining coursework task of harvesting and analysing a Data Warehouse from the WWW, using WWW-BootCat web-as-corpus technology (Baroni et al 2006).
Each student/agent collected English-language web-pages from a specific national top-level domain.
The analysis task involved comparing each national sample web-as-corpus with given “gold standard” samples from UK and US domains, to assess whether national WWW English terminology / ontology was closer to UK or US English.

Methods
• CRISP-DM
• WWW-BootCat and Google
• Compare to .UK and .US
• Follow-up: regional overviews

CRISP-DM
The task was cast as an exercise in applying the CRISP-DM methodology for computational modelling: the Cross-Industry Standard Process for Data Mining projects.
CRISP-DM specifies a series of phases or sub-tasks in a data-mining project; it is a “recipe” to follow, allowing novices and non-experts to carry out data mining experiments:
• Business Understanding
• Data Understanding
• Data Preparation
• Modelling
• Evaluation
• Deployment

WWW-BootCat and Google
WWW-BootCat: easy-to-use web front-end to BootCat.
User supplies “seed terms”, typical English words (Sharoff).
Constrain search to Domain (eg .fr), Language (eg English).
WWW-BootCat uses Google to find and download web-pages … hey presto: 200,000-word national English corpus!
Problems:
• Technical, eg user licences/keys required; server downtime, …
• Small “national domains” eg South Georgia Island
• Legal restrictions, eg Algerian law promotes Arabic over French (et al)

Compare to .UK and .US
Next, each agent/student had to decide if their national sample was closer to British or American English.
Computing students/agents could not use Linguistic expertise.
Instead, compute similarity to .UK and .US “gold standards” (also collected via WWW-BootCat and Google):
• Word-frequency Log-Likelihood profiles and averages
• Occurrences of selected words (color/colour, tap/faucet)
• Lexical analysis only – not syntax or pronunciation

Follow-up: regional overviews
This yielded 93 reports on national web-as-corpus analyses … but it was still difficult to collate results and see patterns.
Follow-up coursework for MSc students: collate and compare results across a group of countries in a single geographical or political region, to produce overviews of English in the region.
Students could base their regional overview on the results gathered in the first exercise, though some chose to collate and analyse their own web-as-corpus data afresh.
Each regional report was to be written as a research journal paper, targeted at a journal specific to the region.
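
Sketch: comparing a national sample to the .UK and .US gold standards
The word-frequency Log-Likelihood comparison described under “Compare to .UK and .US” can be illustrated roughly as below. This is only a minimal sketch, not the coursework's actual procedure: it uses the standard two-corpus log-likelihood (G2) statistic per word, and it assumes that a "profile average" means the mean score over the joint vocabulary, with the lower mean taken as the closer variety. The function names (log_likelihood, mean_ll, closer_to) and the toy word lists are hypothetical, introduced purely for illustration.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Two-corpus log-likelihood (G2) for one word.

    a, b: occurrences of the word in corpus 1 and corpus 2
    c, d: total token counts of corpus 1 and corpus 2
    """
    # Expected counts if the word were equally frequent in both corpora.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    total = 0.0
    if a > 0:
        total += a * math.log(a / e1)
    if b > 0:
        total += b * math.log(b / e2)
    return 2 * total


def mean_ll(sample_tokens, reference_tokens):
    """Mean per-word G2 over the joint vocabulary (an assumed averaging scheme)."""
    sample, ref = Counter(sample_tokens), Counter(reference_tokens)
    c, d = sum(sample.values()), sum(ref.values())
    vocab = set(sample) | set(ref)
    return sum(log_likelihood(sample[w], ref[w], c, d) for w in vocab) / len(vocab)


def closer_to(sample_tokens, uk_tokens, us_tokens):
    """Return ('UK' or 'US', mean G2 vs UK, mean G2 vs US); lower mean = closer."""
    uk = mean_ll(sample_tokens, uk_tokens)
    us = mean_ll(sample_tokens, us_tokens)
    return ("UK" if uk < us else "US"), uk, us


if __name__ == "__main__":
    # Toy data only; real corpora would be around 200,000 words each.
    sample = "the colour of the harbour".split()
    uk_gold = "the colour of the harbour and the centre".split()
    us_gold = "the color of the harbor and the center".split()
    print(closer_to(sample, uk_gold, us_gold))
```

With real data, each token list would come from a tokenised WWW-BootCat download; the selected-word counts (color/colour, tap/faucet) can be read directly off the same Counter objects.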

Results
Draft journal papers (accepted for CL2007, BUT they can’t afford time or fees):
Junaid Arshad, Chien-Ming Lai, Lan Nim, Noushin Rezapour Asheghi, Josiah Wang, Justin Washtell
More draft journal papers by:
Precious CHIVESE, Binita DUTTA, Dureid EL-MOGHRABY, Sanaz GHODOUSI, Olatomiwale MALOMO, Anh NGUYEN

Junaid Arshad
Analysis of English used in a web corpus from the Middle East
“… Jordan and Egypt English corpora were closer to UK than US English; English websites in Saudi Arabia, Lebanon, Israel, Kuwait, and Bahrain were more similar to US English than UK English; and UAE and Iran English websites contained a mix of UK and US English, with neither dominant…”

Chien-Ming Lai
Studying Influences of British English and American English on World Wide Web in Southeast Asia by Applying Web as Corpus
“… The countries studied were Indonesia, Malaysia, Philippines, Singapore, Thailand and Vietnam. Among these countries, only Philippines and Singapore recognize English as official language, but English is widely used in the other countries … the English texts used in most of the chosen countries in the Southeast Asia are closer to the American English…”

Lan Nim
The Dominant English Type within the World Wide Web Domains of France and its Former Colonies
“… This paper investigates the English used in the WWW domains of France (.fr) and its former colonies of Vietnam (.vn), Laos (.la), Mauritius (.mu) and Senegal (.sn) … British English is more dominant overall in Francophone domains compared to American English. However, some local variation was observed: American English is more widespread in Vietnam, probably due to American political influence after the end of French colonization; and, more surprisingly, American English seems more prevalent than British English in the .FR domain of France.”

Noushin Rezapour Asheghi
Which English dominates the World Wide Web in countries where English is a native language: British or American?
“… The results from Log-Likelihood technique in modelling phase indicate that English used in Australian, South African and Irish web sites is closer to British English and text in New Zealand, Jamaican and Canadian web sites are more similar to American English. However, there is not a great difference between the results of comparing these corpora with British and American English… and British spelling is used predominantly in the New Zealand domain…”

Josiah Wang
Dominance of British and American English on the World Wide Web in Malaysia, Singapore and Brunei
“… Malaysia, Singapore and Brunei have a history as British post-colonial countries ... As a comparison, we have also included three neighbouring countries … Former British colonies like Malaysia, Singapore and Brunei still favour British English on the World Wide Web. In addition, Indonesia and Papua New Guinea which are indirectly influenced by British English (i.e. through the Netherlands and Australia) also tend to lean towards British English. The Philippines on the other hand still continue to exhibit America’s influence with their preference for American English on the Internet.”

Justin Washtell
The Polynesian influence on English in the World Wide Web of Pacific island nations
“… This study analyses the effect of indigenous Polynesian languages upon the balance of a core of function (non-lexical) words in sample English web corpora taken from Polynesian island nation domains: from a selection of New Zealand, Cook Islands and French Polynesian websites. These corpora are compared to those recovered from .uk and .us domains and significant grammatical differences are sought.
Noted differences are compared with those found between a French corpus from France and one captured from French Polynesian websites using an identical technique…”

Conclusions
We expected US English to dominate the WWW:
• Computing generally has been American-led
• US-owned companies might base national websites on US originals
Result: British English is holding its own; no clear winner?
It is hard to find major differences; International English?
Main differences are in pronunciation, not lexis?

And finally…
I want to run a similar exercise next year: casting students as intelligent agents to combine teaching and research…
I need other web-as-corpus research questions to answer,
… to be divided into 50+ subtasks, one for each student
… with computable metrics, for Computing students
SUGGESTIONS WELCOME!