Download doc - Montclair State University

Introduction to Corpus Linguistics
for Advanced Structure of American English
Introduction
Shifts in Word Meaning
Collocations
  Powerful tea and cooked coffee?
  Stand on line or in line?
Part of Speech Identification
  Testing textbook claims about POS
Exemplifying Standard and non-standard forms
  Adverbs or Adjectives in intransitive sentences?
Syntactic Constructions
  The passive with by
  Subject and object selection
  Verb complementation
Textbooks on English grammar attempt to describe the language as a system of rules that might
explain how the child comes to know the language so quickly. This knowledge of language is
referred to as grammatical competence. However, the use of language in everyday situations,
known as grammatical performance, often affects competence since it provides the data that the
child hears. Corpus linguistics aims to look at the actual use of language, written and spoken.
The tasks you will do below are designed to make you familiar with this approach and to
appreciate some of its possibilities for your own research and teaching.
Shifts in Word Meaning
Most aspects of language remain fairly constant over time, but words can shift meaning, or
develop new meanings, rather rapidly. The usage of a word is also restricted by the domain in
which it occurs. The word hot is a good example of this.
Using the Virtual Language Centre (VLC) Web Concordancer, compare the usage of the word
hot in the Brown Corpus (data from a wide variety of English texts) with its usage in computer
ASAE. Corpus Tutorial
texts from the 1990s. To do so, click on "Simple search" under "English". In the VLC Web
Concordancer (English), type the word "hot" in the second box after "Search string:"; then go to
"Select corpus:" and, on the pull-down menu, select "Brown Corpus".
What word does "hot" most frequently modify among these 136 entries?
What is the most common meaning of "hot" in the Brown Corpus?
There are only 7 entries for "hot" in "Business and Economy", but what is the most common
meaning here?
Powerful tea and cooked coffee?
Certain words commonly occur together: coffee is brewed, not cooked, lights are turned off, not
closed. These collocational patterns are language-specific and, as such, are often mind-boggling
to the language learner. They are learned by constant exposure. A corpus can provide this
exposure quickly.
The following sentences are ESL student productions in which the underlined word is not a
standard collocate of the following word(s). Using the VLC concordancer, find a better word for
each of the underlined words below.
A powerful dollar overseas hurts European markets. (Search Business and Economy and Sort left.)
This was his single chance for success. (Search Brown and Sort left.)
I like being with people, knowing new people, etc. (Search Times for 'new people', and Sort left.)
Stand on line or in line?
Should ESL students be taught to 'stand on line' or 'stand in line', or does it matter? You can
check the use of the prepositions at the Collins CoBuild website. In the "Type in your query"
box, type stand +1line. The +1 will ignore a single word between stand and line. How many
times does stand on line occur? _________ stand in line? __________ Do Americans use on or
in more often? _________
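The count this exercise asks for can also be tried offline on any plain text. The following is a minimal Python sketch, not the Collins CoBuild site's own method: the function name slot_fillers and the sample sentences are invented for illustration, and the pattern requires exactly one word between stand and line.

```python
import re
from collections import Counter

def slot_fillers(text, first="stand", last="line"):
    """Count the single words that occur between `first` and `last`."""
    # Exactly one word must intervene between the two search words.
    pattern = re.compile(
        rf"\b{re.escape(first)}\s+(\w+)\s+{re.escape(last)}\b",
        re.IGNORECASE,
    )
    return Counter(m.group(1).lower() for m in pattern.finditer(text))

# Invented sample sentences, not corpus data.
sample = (
    "We had to stand in line for an hour. "
    "Some New Yorkers say they stand on line. "
    "Please stand in line over there."
)
counts = slot_fillers(sample)
print(counts)
```

Only the base form stand is matched here; inflected forms such as stands or standing would need a broader pattern.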
Part of Speech Identification
Many grammar texts give form criteria by which to judge the part of speech of a word. For
example, a word is classified as a noun if it can occur with a plural or possessive ending, or if it
has a noun-making morpheme like -ness or -tion, while it is classified as a verb if it can occur
with the tense and participial endings, or if it has a verb-making morpheme like -ize or -ate.
However, many words in English have the same form as nouns and verbs, e.g., house, button,
garden, progress, permit, record. How is part of speech determined in such cases?
To answer this question, look at the usage of the word permit as it occurs in the Brown corpus.
(Sort left.) The word permit can be a noun or a verb in English and, in its base form, there is no
formal, morphological way of telling whether it is a noun or a verb.
Find the first three occurrences of permit as a verb. How do you know permit is being used as a
verb here?
If you're not sure about assigning parts of speech, you might want to check out the online Part of
Speech Tagger at the University of Colorado.
Testing textbook claims about POS
Klammer, Schulz and Della Volpe's Analyzing English Grammar says that completely,
absolutely, totally, extremely, and excessively are adverbs even though they fit the qualifier frame
The handsome man seems ______ handsome.
These words do fit the adverb form test (they end in -ly) but they fail all the function tests (they
can't modify verbs and they can't move within the sentence). What's going on? Looking at the
usage of one of the -ly degree words in MICASE, the Michigan Corpus of Academic Spoken
English, will clarify how these words behave.
If we classify adverbs as words that (1) modify verbs and (2) can be moved within a sentence,
find an example in the data above of totally used as an adverb.
If we classify qualifiers as words that (1) modify adjectives or adverbs and (2) can fit the slot in
the frame sentence The handsome man is ________________ handsome, find an example in the
data above of totally used as a qualifier.
This exercise shows that the part of speech class of a word depends in part on
a. the morphological form of the word
b. the context in which the word occurs
c. the grammatical function of the word
Exemplifying Standard and non-standard forms
Adverbs or Adjectives in intransitive sentences?
Many grammar texts claim that there is a usage issue relating to the use of adverbs vs. adjectives
following intransitive verbs, as in doing well vs. doing good, with the latter considered informal.
Find all instances in MICASE of doing good and doing well. Which form is more
MICASE classifies its data by speech event. (The speech events are listed to the left of
the data.) Is there any correlation between the informal nature of the speech event and
doing good, as opposed to the formal nature of the speech event and doing well?
Speech Event: ADV – advising session; COL – colloquia; DIS – discussion section; LAB – lab
sections; LEL – large lecture; LES – small lecture; MTG – meetings; OFC – office hours; SEM –
seminars; SGR – study groups; TUT – tutorials.
How about the age and/or status of the speakers? (Speaker characteristics are listed to the
right of the data.)
Syntactic Constructions
The passive with by
The often politically motivated claim "Mistakes were made" has recently been dubbed the 'past
exonerative'. How often does the passive occur without by? To see how often mistakes occurs
with made, with and without the by, in Collins' 56 million word database, type the following in
the Collins Cobuild site's query box and click "Show Concs":
mistakes+2made
The +2 allows for a maximum of two words to intervene between mistakes and made. (If you
want results for mistake and mistakes, type mistake*.)
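The behaviour of the +2 gap and the * wildcard can be approximated with a regular expression. This is a rough Python sketch under stated assumptions, not the Collins Cobuild query engine: the helper name gap_query is hypothetical, the test sentences are invented, and punctuation between words is not handled.

```python
import re

def gap_query(first, max_gap, second):
    """Build a regex: `first`, at most `max_gap` intervening words, then `second`.

    A trailing * on a word matches any ending, so mistake* covers
    both mistake and mistakes.
    """
    def word(w):
        if w.endswith("*"):
            return re.escape(w[:-1]) + r"\w*"
        return re.escape(w)

    # Zero to max_gap words, each followed by whitespace.
    gap = r"(?:\w+\s+){0," + str(max_gap) + "}"
    return re.compile(rf"\b{word(first)}\s+{gap}{word(second)}\b", re.IGNORECASE)

pattern = gap_query("mistake*", 2, "made")

# Invented sentences for illustration only.
sentences = [
    "Mistakes were made by the committee.",
    "A mistake was clearly made.",
    "Mistakes that nobody ever made.",   # three intervening words: no match
    "He made mistakes.",                 # wrong order: no match
]
hits = [s for s in sentences if pattern.search(s)]
print(hits)
```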
Searching for passives is tricky because the past participle used in the passive voice often has the
same form as the past tense. For example: I made a mistake (past tense); A mistake was made
(past participle).
The Collins Cobuild site distinguishes past tense forms, which it labels VBD, from past
participle forms, which it labels VBN, and verbs can be searched for as, for example, made/VBN.
(Be aware that the VBN tag will give you all past participles, i.e., both has/have made and
am, is, are, was, were, be made.) So you can try the search again as follows: mistake*+2made/VBN
Of the 40 samples that you see, how many have a by phrase? _________
You can see how often the passive form of a specific verb occurs with a by phrase in the entire
56 million word database by asking for the T-score under "Collocation Sampler". What is the
joint frequency of the following items?
mistake*+2made and by? __________
(Type the first term; the by will appear in the table.)
mistake*/NOUN and made? __________
What percentage of the time does by appear with mistakes were made? _____________
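Once the two joint frequencies are written down, the percentage is a single division. A small Python sketch follows; the function name percent_with_by is hypothetical and the counts are placeholders, not figures from the Collins Cobuild database.

```python
def percent_with_by(joint_with_by, joint_total):
    """Share of `mistake* ... made` hits that also contain `by`, as a percentage."""
    if joint_total == 0:
        raise ValueError("no hits to compare against")
    return 100.0 * joint_with_by / joint_total

# Placeholder counts for illustration; substitute the numbers the site reports.
share = percent_with_by(12, 300)
print(round(share, 1))
```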
Subject and object selection
According to the Collins Cobuild data, what subjects and objects can the verb prove take?
Animate, inanimate, abstract, concrete, mass, count?
Verb complementation
Can you pretend something, i.e. a noun or a noun phrase? Type
into the Collins Cobuild box to find out. (Note the occasional errors in the POS tagging of
pretend.) What somethings can one pretend? _________________________________________
Can pretend be followed by "-ing" forms, infinitives or anything else? Type pretend* in the box
to find out.
Know Your Tools
A concordance, in its simplest form, is an alphabetical listing of the words in a text, given
together with the contexts in which they appear. The most common form of concordance today is
the Keyword-in-Context (KWIC) index, in which each word is centered in a fixed-length field
(e.g., 80 characters). The example given below was produced by Conc 1.70 (Macintosh), from a
plain ASCII text version of the first book of Dickens' A Tale of Two Cities. Note that the line
numbers are as calculated by Conc.
Figure 3.1.1: Concordance of poor in Tale of Two Cities, Book 1
 taste it is that such poor cattle always have in their mouths
        of sparing the poor child the inheritance of any part of
  small property of my poor father, whom I never saw--so long
  desolate, while your poor heart pined away, weep for it
          Miss, if the poor lady had suffered so intensely
        the love of my poor mother hid his torture from me
stockings, and all his poor tatters of clothes, had, in a long
     faded away into a poor weak stain. So sunken and
    on your way to the poor wronged gentleman, and, with a
   detachment from the poor young lady, by laying a brawny hand
A concordancer is a software tool that produces such a list.
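To make the idea concrete, here is a minimal KWIC sketch in Python. It is not how Conc or any of the web concordancers is implemented; the function name kwic and the width parameter are assumptions, and the sample text joins two fragments from the Dickens figure.

```python
import re

def kwic(text, keyword, width=30):
    """Return keyword-in-context lines with the keyword starting at column `width`."""
    lines = []
    for m in re.finditer(rf"\b{re.escape(keyword)}\b", text, re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so every keyword lines up vertically.
        lines.append(f"{left:>{width}}{m.group(0)}{right}")
    return lines

sample = (
    "the love of my poor mother hid his torture from me. "
    "Miss, if the poor lady had suffered so intensely"
)
lines = kwic(sample, "poor")
for line in lines:
    print(line)
```

Because the left context is padded to a fixed width, the keyword forms a single column, which is what makes patterns to its left and right easy to scan.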
A collocate list is a list of words that occur in the neighborhood of the keyword. For example, a
search for the keyword so in the Hong Kong Web Concordancer with a request for words that
occur at a distance of two words from the so, returns the following words as the top collocates of
so that occur to its right:
Right collocates for 'so'
Collocate   Frequency
The         132
that        127
as          110
to          77
in          57
a           49
and         47
it          46
He          40
of          35
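A collocate list of this kind can be sketched in a few lines of Python. The version below is an illustration only: it counts every word within span words to the right of the keyword, which is one possible reading of "a distance of two words", it ignores sentence boundaries, and the function name and sample text are invented.

```python
import re
from collections import Counter

def right_collocates(text, keyword, span=2):
    """Count the words found up to `span` positions to the right of `keyword`."""
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == keyword:
            counts.update(tokens[i + 1:i + 1 + span])
    return counts

# Invented sample text, not corpus data.
sample = "It was so cold that we left. She was so tired that she slept. So that was it."
collocates = right_collocates(sample, "so")
for word, freq in collocates.most_common():
    print(word, freq)
```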
A part-of-speech tagger automatically tags each word in a text with its part of speech. Current
taggers are about 97% accurate (as are human experts). The Collins CoBuild Concordancer
allows you to search for part of speech strings rather than strings of words.
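Searching by tag rather than by word form can be illustrated on text that is already tagged. The Python sketch below assumes a simple word/TAG layout and hand-assigned Penn Treebank style tags; it does not run a tagger, and the function name find_tagged is hypothetical.

```python
def find_tagged(tagged_text, word, tag):
    """Return the token positions where `word` carries `tag` in word/TAG text."""
    hits = []
    for i, token in enumerate(tagged_text.split()):
        # Split on the last slash so the tag is always the final field.
        form, _, pos = token.rpartition("/")
        if form.lower() == word and pos == tag:
            hits.append(i)
    return hits

# Hand-tagged for illustration: VBD = past tense, VBN = past participle.
tagged = "I/PRP made/VBD a/DT mistake/NN ./. A/DT mistake/NN was/VBD made/VBN ./."

participles = find_tagged(tagged, "made", "VBN")
past_tense = find_tagged(tagged, "made", "VBD")
print(participles, past_tense)
```

The two occurrences of made are identical as strings; only the tag separates the past tense from the past participle.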
Searching, in the context of corpus work, means looking in the online text for a specific
keyword, phrase, part-of-speech tag, etc.
Browsing means reading through the documents in the corpus. This is a useful activity only if
the documents have been classified. For example, the MICASE corpus is categorized by speech
event (lab, lecture, office hour, etc.) and by speaker (professor, undergraduate, grad student,
native speaker, non-native speaker, male, female, etc.). This classification allows you to get a
sense of the differences between one speech event or speaker type and another.
Sorting means listing words in alphabetical order. The Hong Kong Web Concordancer allows
you to sort the collocates immediately to the right or to the left of the keyword.
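Sorting left or right amounts to ordering the hits by the neighbouring word. The following Python sketch shows the idea under simple assumptions (one word of context on each side, case-insensitive ordering); the function name sorted_hits and the sample text are invented, and this is not the Hong Kong Web Concordancer's own code.

```python
import re

def sorted_hits(text, keyword, side="right"):
    """Return (left, keyword, right) triples sorted by the chosen neighbour."""
    tokens = re.findall(r"[A-Za-z']+", text)
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword:
            left = tokens[i - 1] if i > 0 else ""
            right = tokens[i + 1] if i + 1 < len(tokens) else ""
            hits.append((left, tok, right))
    index = 0 if side == "left" else 2
    return sorted(hits, key=lambda hit: hit[index].lower())

# Invented sample text, not corpus data.
sample = "She drank hot coffee. The hot water ran out. It was a hot day."
by_right = sorted_hits(sample, "hot", side="right")
by_left = sorted_hits(sample, "hot", side="left")
print(by_right)
print(by_left)
```

Sorting right groups the nouns that the keyword modifies; sorting left groups the words that precede it, which is why the exercises above specify a sort direction.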
Websites to get you started.
The Internet Grammar of English is an online course in English grammar written primarily for university
undergraduates. IGE does not assume any prior knowledge of grammar. It includes interactive exercises.
English Grammar on the Web is a resource designed to support ESL/EFL teachers, but it has valuable lists of links
to other web resources on English grammar. Particularly helpful is its Lists of Grammar Lists.
The Hong Kong Web Concordancer is a concordance program that allows you to search several million words of
English sampled from many sources.
Alternate URL for The Hong Kong Web Concordancer
The Collins CoBuild Concordancer allows you to search 56 million words of contemporary written and spoken
documents. It also allows you to tag these documents for part of speech.
A Ten-step Introduction to Concordancing through the Collins Cobuild Corpus Concordance Sampler.
Instructions for using Collins Cobuild effectively.
MICASE, the Michigan Corpus of Academic Spoken English, allows you to search 1,848,364 words of English
transcribed from lectures, conversations, service encounters, etc. recorded at the University of Michigan.
POS Tagger, a statistically-based Part-of-Speech Tagger from the University of Colorado; it returns your sentence
with Penn Treebank part-of-speech tags assigned.
Bookmarks for Corpus Linguists is a list of web resources on using corpora and links to software and corpus
The Web Concordances and Workbooks from the University of Dundee English Department. This site is devoted to
the study of literature using literary computer concordancing, a form of analysing text. This document will attempt
to help students understand what is meant by literary concordancing.
WordNet, a lexical database for the English language.
Concordances and Corpora. A tutorial by Catherine Ball on the design of corpora, the use of concordances, and
available concordancing software.
The Montclair Electronic Language Database (MELD). A collection of ESL student essays and background
information on L1, native country, age, gender, and other languages spoken.