(NOTE: This article is the pre-release draft of an article to appear in the summer
issue of the Cambridge University Press publication Cambridge Connections. The
final version of the article will be posted later.)
The New General Service List: A Core Vocabulary for EFL Students &
Dr. Charles Browne, Meiji Gakuin University
Dr. Brent Culligan, Aoyama Gakuin Women’s Junior College
Joseph Phillips, Aoyama Gakuin Women’s Junior College
The English language has a surprisingly large number of words. Even if we count
words like ACCEPT, ACCEPTS, ACCEPTING and ACCEPTABLE as part of the same
“word family”, there are still more than half a million words in English! Fortunately
for teachers and students, language has built-in redundancy, with certain words
occurring much more frequently than others (the word THE, for example, makes
up 6-7% of all the words in any book, magazine or newspaper). Because of this,
the average native speaker of English knows only a small percentage of these
half-million words (about 22,000 words for a recent college graduate).
Although 22,000 words may sound like a daunting number, there is more good
news. Corpus linguistics, the science of analyzing large collections of texts, has
shown that knowledge of just a few thousand of the most important words can
give an astonishing degree of coverage of English used in daily life. In 1953,
Michael West published a list of about 2000 important vocabulary words known
as the General Service List (GSL). Based on more than two decades of
pre-computer corpus research and a corpus size of 2.5 million to 5 million words, the
GSL gives about 84% coverage of general English. However, as useful and helpful
as this list has been to us over the decades, it has been criticized for (1) being
based on a corpus that is both dated and small by modern standards and (2)
not clearly defining what constitutes a “word”.
On the 60th anniversary of West’s publication of the GSL, we would like to
announce the creation of a New General Service List (NGSL) that is based on a
carefully selected 273-million-word subsection of the 1.6-billion-word
Cambridge English Corpus (CEC). Following many of the same steps as West and
his colleagues (as well as the suggestions of Professor Paul Nation, project
advisor and a leading figure in modern second language vocabulary acquisition),
we have tried to combine the strong objective scientific principles of corpus and
vocabulary list creation with useful pedagogic insights to create a list of
approximately 2800 high-frequency words that meet the following goals:
1. to update and expand the size of the corpus used (273 million words)
compared to the limited corpus behind the original GSL (about 5 million
words), with the hope of increasing the generalizability and validity of the
list.
2. to create a NGSL of the most important high-frequency words for second
language learners of English which gives the highest possible coverage of
English texts with the fewest words.
3. to make a NGSL that is based on a clearer definition of what constitutes a
“word”.
4. to be a starting point for discussion among interested scholars and
teachers around the world, with the goal of updating and revising the list
based on this input (in much the same way that West did with the original
Interim version of the GSL).
The NGSL: A word list based on a large, modern corpus
Utilizing a range of computer-based corpus tools, we began developing the NGSL
with an analysis of the Cambridge English Corpus (formerly known as the
Cambridge International Corpus). The CEC is a 1.6-billion-word corpus of the
English language that contains both written and spoken data of British and
American English. The initial corpus was created using a subset of the
1.6-billion-word CEC that was queried and analyzed using the SketchEngine (2006).
The size of each sub-corpus that was initially included is outlined in Table 1:
Table 1. CEC corpora used for preliminary analysis of NGSL
Running Words
Upon revision, the Newspaper and Academic corpora were removed from the
compilation. The Newspaper corpus was removed because its enormous size
(748,391,436 running words) dominated the total frequencies and it also
showed a marked bias towards financial terms. The academic sub-corpus
(260,904,352 words) was removed because it was a specific genre not directly
related to general English. The final 273-million-word corpus is far more
balanced as a result.
The resulting word lists were then cleaned up by removing proper nouns,
abbreviations, slang and other noise, and excluding certain word sets such as
days of the week, months of the year and numbers. Then we used a series of
computations to combine the frequencies from the various sub-corpora while
adjusting for differences in their relative sizes. Based on a series of meetings and
discussions with Paul Nation about how to improve the list, the combined list
was then compared to other important lists such as the original GSL, the BNC and
COCA to make sure important words were included/excluded as necessary.
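The step of combining frequencies across sub-corpora of different sizes can be sketched roughly as follows. The article does not say which computations were actually used, so this is only one plausible approach and not the authors' method: each raw count is converted to a rate per million running words, and the rates are then averaged so that every sub-corpus carries equal weight. The function names (`per_million`, `combine`) and the toy counts are invented for illustration.

```python
from collections import Counter


def per_million(counts, corpus_size):
    """Convert raw counts to occurrences per million running words."""
    return {word: n * 1_000_000 / corpus_size for word, n in counts.items()}


def combine(sub_corpora):
    """Average per-million rates so that each sub-corpus weighs equally.

    sub_corpora maps a name to (raw_counts, running_words).
    A word absent from a sub-corpus contributes a rate of 0 for it.
    """
    totals = Counter()
    for counts, size in sub_corpora.values():
        for word, rate in per_million(counts, size).items():
            totals[word] += rate
    k = len(sub_corpora)
    return {word: total / k for word, total in totals.items()}


# Toy data (hypothetical): a small spoken corpus and a larger written one.
toy = {
    "spoken": ({"the": 60_000, "yeah": 9_000, "market": 100}, 1_000_000),
    "written": ({"the": 650_000, "yeah": 2_000, "market": 30_000}, 10_000_000),
}

combined = combine(toy)
ranked = sorted(combined, key=combined.get, reverse=True)

# For contrast: simply pooling raw counts lets the larger corpus dominate.
pooled = Counter()
for counts, _ in toy.values():
    pooled.update(counts)
pooled_ranked = [word for word, _ in pooled.most_common()]

print(ranked)         # size-adjusted ranking
print(pooled_ranked)  # raw pooled ranking
```

In this toy data the size-adjusted ranking places "yeah" above "market", while pooling the raw counts reverses the order because the larger sub-corpus dominates. This is the same kind of distortion the article describes for the Newspaper corpus.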
The NGSL: More coverage for your money!
One of the important goals of this project was to develop a NGSL that would be
more efficient and useful to language learners and teachers by providing more
coverage with fewer words than the original GSL. For a meaningful comparison
between the GSL and NGSL, the words on each list need to be counted
in the same way. A comparison of the number of “word families” in the GSL and
NGSL reveals that there are 1964 word families in the former and 2368 in the
latter (using level 6 of Bauer and Nation’s 1993 word family taxonomy).
Coverage within the 273-million-word CEC is summarized in Chart 1, showing
that the 2368 word families in the NGSL provide 90.34% coverage while the
1964 word families in the original GSL provide only 84.24%. That the NGSL, with
approximately 400 more word families, provides more coverage than the original
GSL may not seem a surprising result, but when these lists are lemmatized, the
usefulness of the NGSL becomes more apparent: the NGSL contains more than 800
fewer lemmas yet provides 6.1 percentage points more coverage than West’s
original GSL.
Vocabulary List    Number of “Word Families”    Coverage in CEC
GSL                1964                         84.24%
NGSL               2368                         90.34%
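To make the notion of coverage concrete, here is a minimal sketch of how the percentage of running words covered by a word list might be computed. It is not the procedure used for the figures above. The function `coverage`, the optional `headword_of` mapping (standing in for a lemma or word-family table) and the sample sentence are all hypothetical.

```python
def coverage(tokens, word_list, headword_of=None):
    """Return the percentage of running words covered by word_list.

    headword_of optionally maps an inflected or derived form to its
    headword, so the same function works for lemma-based or
    word-family-based counting, depending on the mapping supplied.
    """
    if not tokens:
        return 0.0
    headword_of = headword_of or {}
    listed = set(word_list)
    covered = sum(1 for t in tokens if headword_of.get(t, t) in listed)
    return 100 * covered / len(tokens)


# Hypothetical sample text and a tiny word list of headwords.
text = "she accepts the offer and the terms are acceptable".split()
small_list = ["the", "and", "accept", "be", "she"]

# Hypothetical mapping from word forms to headwords.
forms = {"accepts": "accept", "acceptable": "accept", "are": "be"}

print(round(coverage(text, small_list), 2))         # exact word forms only
print(round(coverage(text, small_list, forms), 2))  # forms mapped to headwords
```

The same text and the same list give different coverage figures depending on how forms are grouped under headwords, which is why the two lists need to be counted in the same way before they are compared.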
Where to find the NGSL:
The list of 2818 words is now available for download, comments and debate
from a new website we’ve dedicated to the development of this list:
It is our hope that this list will be of use to you and your students. Please join the
discussion on the NGSL as we begin to present on it at academic conferences
throughout the year, such as KOTESOL and the World Congress on Extensive
Reading in Korea; JALT-CALL and JALT National in Japan; the Vocab@Vic
Conference in New Zealand; and the AILA Conference in Australia in mid-2014.
Later this year you will also be able to find the NGSL taught in a new course from
Cambridge University Press, In Focus.
West, M. (1953). A General Service List of English Words. London: Longman,
Green & Co.
Bauer, L., & Nation, I. S. P. (1993). Word Families. International Journal of
Lexicography, 6(4), 253–279.
(This paper is a modified version of the article titled “The New General Service
List: Celebrating 60 years of Vocabulary Learning”, published by Browne, C. in the
July 2013 issue of JALT’s The Language Teacher.)