Download homenaje a mervyn smale - Universidad de Granada

(DIS)ADVANTAGES OF WORKING WITH A PARSED CORPUS:
LOOKING FOR INDIRECT OBJECTS IN THE ICE-GB
CARMEN AGUILERA CARNERERO
Universidad de Granada
During the last decades, the world of Corpus Linguistics has witnessed the
proliferation of many different and varied corpora. Among them, there are
some in which syntactic information is added to the plain text: a grammatical
category (i.e. tagged corpora) and a syntactic function (i.e. parsed corpora);
the latter being quite useful for saving effort and time when retrieving
syntactic structures to carry out syntactic analyses. This paper deals with
some of the problems that arise when a parsed corpus – the British
component of the International Corpus of English – is used to analyse
indirect objects in English. In the course of our study, two main sorts of problems
were detected: our disagreement with the parsers of the corpus over
some syntactic categories we find doubtful – ‘dimonotransitivity’, ‘parataxis’
and ‘transitive complementation’ – and several inconsistencies
in the labelling of the constituents in the corpus. The particular problems
faced in the accomplishment of our particular task illustrate some of the
difficulties found working with a parsed corpus and make us question the real
utility of parsed corpora.
INTRODUCTION
Within the great development that Corpus Linguistics has undergone
recently, two main different approaches have been distinguished: on the one
hand, the corpus-based approach and, on the other hand, the corpus-driven
perspective. The two positions tackle the analysis of language from different
angles – especially concerning methodological issues – with far-reaching
implications for the way they engage in corpus analysis. Notably, they differ
considerably in their attitude towards annotated corpora: the corpus-based approach
takes a positive stance towards annotation, whereas corpus-driven linguists
distrust it.
As we have said above, this paper deals with the obstacles found working
with a parsed corpus – the British component of the International Corpus of
English (ICE) – when a lexico-grammatical analysis of indirect objects in
English was carried out. This study not only helped us to look more deeply into the
syntactic-semantic nature of indirect objects in English, but also entailed an
intense reflection on the use of tagged and, above all, parsed corpora.
In what follows, we will make a brief survey of the differences between the
corpus-based approach and the corpus-driven approach concerning the use of
tagged and parsed corpora to introduce, in the following section, the main
features of the International Corpus of English (ICE). We will then deal with
some of the difficulties faced when working with the ICE-GB and we will finish
with the exposition of the conclusions reached in the light of our findings.
THE ROLE OF ANNOTATION IN CORPORA
As we have stated in the previous section, the study of language phenomena
within Corpus Linguistics gave rise to two different approaches to the topic, that
is, the corpus-based approach on the one hand, and the corpus-driven approach
on the other. The positions vary in a considerable number of aspects that have
been clearly summarized by Tognini-Bonelli (1991:65ff):
One could argue that the two positions we are addressing with respect to corpus
work, the corpus-based and the corpus-driven, reflect two opposed stances
concerning this issue and while the corpus-based linguist attempts to insulate it,
standardise it and reduce it, the corpus-driven linguists build it into the theoretical
categories (s)he derives from the data. (Tognini-Bonelli 1991:67)
Aarts (2002:3) considers that the linguist’s preference for one of the
methodologies rather than the other depends on his/her answers to questions such
as:
1. the type of evidence he/she thinks the corpus data provide
2. if he/she considers there is room for different data from those included in
the corpus
3. the role played by non-corpus data in previous linguistic research.
One of the main differences between these two perspectives lies in their view of
the implications of assigning certain labels to particular corpus
items. Whereas corpus-based linguists take annotation as a useful tool to help
them in their analysis, corpus-driven linguists think it will predetermine the
researcher’s conception of the results, which, in their opinion, should
emanate from the corpus itself without being contaminated by previous
research. This idea is overtly expressed by John Sinclair in the following passage:
[M]y reservations about annotation are quite specific, and concern only their
inclusion in the resources around generic corpora. Because they impose one
particular model of language on the corpus, they restrict the kind of research that
can be done; because the practice of annotation normally requires human
intervention, it is not a replicable process and therefore fails the first test of
scientific method. Because the models imposed by current conventions of
annotation are unlikely to be informed by corpus evidence, I believe researchers
who use them are likely to make unnecessary problems for themselves. (Sinclair 2004:54)
A strong objection to this aseptic approach to language is raised by
Mukherjee (2005:72):
[T]his dogmatic stance [corpus-driven approach] with its fixation about corpus
data seems both unrealistic and implausible to me. It is unrealistic because any
linguistic research activity stems from some sort of initial intuitions about
language. [...] So, there always is some sort of theoretical preconception involved,
and, what is more, even the avoidance of a priori theory is a theoretical
preconception. The distrust of intuition of CDL-methodology is also implausible
since any corpus is compiled on the grounds of linguists’ informed intuitions
about language in the first place.
We agree entirely with Mukherjee on his rejection of corpus-driven
methodology. It seems to be difficult to imagine the linguist’s brain as a tabula
rasa, not being biased at all by any linguistic preconception. What’s more, the
fact of working with a corpus implies support of a whole conception of language
based on the observation of real language, as opposed to the Chomskyan
tradition. What we think is not completely correct in Sinclair’s quotation is the
corpus-driven assumption that all corpus-based linguists accept all the theoretical
ideas expressed in the corpus by the taggers and parsers as a dogma by the mere
fact of using it.
We cannot deny the fact that working with a tagged and parsed corpus
implies that the researcher has to deal with the material someone else
manipulated before, which may be a disadvantage if he or she does not agree
with the previous decisions made by the tagger and/or parser.
Nevertheless, we think this does not have to be so. Using a tagged and parsed
corpus for your analysis does not necessarily mean that you have to share all the
linguistic preconceptions that led the tagger/parser to choose one decision or
another. Once more, it is not a black-or-white issue. One could agree on, let’s
say, the majority of decisions in the corpus, but not with every single choice that
taggers and parsers make. In fact, agreement with the ideas existing in the corpus
confirms the linguist’s conceptions on particular linguistic problems, whereas
disagreement with the specific issues motivates debate and critical reflection in
the researcher. As Mukherjee (2005: 79-80) explains:
There is a danger, therefore, that already available corpora with their syntactic
annotation predetermine the linguistic theory of and research into syntax. As a
matter of fact, the reverse order should be aimed at; not that corpus annotation
should influence linguistic research, but linguistic research questions should be
the guideline for the corpus annotation.
Paradoxically, annotated corpora have also been accused of containing less
information than non-annotated ones. According to Aarts (2002:9), it is a
problem of considering the type of information added by annotation different
from the information contained in the corpus, as it is the result of a “descriptive
framework that generated the tags”.
Within the category of annotated corpora, two main different kinds can also
be distinguished: tagged corpora and parsed corpora. Tagged corpora assign each
lexical item in the corpus a grammatical category; parsed corpora add to them a
syntactic function in the form of a tree.
The advantages and disadvantages of these two types of corpora have been
pointed out by Gilquin (2002:192). The structural information contained in the
parsed ones is less easily available and unreliable (in the majority of cases). If
parsed corpora are analysed in detail (as the ICE-GB is), then the main disadvantage is
their small size and, therefore, their inadequacy for studying infrequent structures
in the language.
Leaving aside the controversy over whether the linguist shares the principles behind the
grammatical and syntactic categories contained in the corpus, one
considerable merit of a tagged and parsed corpus is that it simplifies the
whole process of analysis, which is otherwise quite a time-consuming task.
However, this advantage of an annotated corpus is, in our view, offset
by one of its great dangers: the inaccuracy of the
tags concerning the grammatical and syntactic categories and, consequently, the
unreliability of the results, as we will discuss below.
THE INTERNATIONAL CORPUS OF ENGLISH: ICE
The corpus chosen for this study is the British component of the International
Corpus of English (ICE). The International Corpus of English was a project
launched at London University in 1988 under the auspices of Professor Sidney
Greenbaum, who felt the need to compare the linguistic varieties of England and
the United States, both written and spoken, an aim he could not carry out
satisfactorily using the corpora available at that time. 1
The ICE was compiled by the Survey of English Usage, at the University of
London, and the software programmes needed to carry out the grammatical and
syntactic analysis were developed by the TOSCA group at Nijmegen University.
In particular, the software for using the corpus (ICECUP) was created by Aidan
Quinn and Nick Porter at the Survey of English Usage. The compilation of the
British component was only the beginning, and later on, other national teams
joined the project 2 in every country in which English was the first language or in
those countries in which, although not official, English was the language of
administration, education or the law courts.
1. As Greenbaum (1991:83) explains, at that moment, there were some corpora devoted to the study
of written English such as the Brown Corpus and the LOB corpus, as well as the London-Lund
corpus specializing in spoken English, but there was no corpus that contained both modes so that
written and spoken English could be compared.
The ICE-GB corpus is not very large (1 million words) and it comprises 500
texts (200 written and 300 spoken, belonging to different genres, each one of
approximately 2000 words). All the material was collected between 1990 and 1993
and was produced by adults (18+ years) who were educated in the English
language at least until they finished secondary school. One of the features that
makes the ICE-GB very appealing for researchers interested in syntax and
grammar is the fact of being not only a tagged corpus but also a parsed one,
which means that all clause constituents are associated with a grammatical
category and also with a syntactic function in a tree. Besides, ICECUP allows the
linguist to retrieve grammatical and syntactic information, thanks to the tool
known as ‘fuzzy tree fragment’.
The compilation and annotation of the ICE-GB was a long and complicated
process which covered the following stages (Nelson, Wallis and Aarts
2002:10ff):
1) Structural markup: Structural markup encodes features of the original
text that are lost when it is converted into a plain text file on a computer.
In written texts, markup symbols are used to encode typographic features,
such as boldface, italics and underlining, as well as structural features
such as sentence boundaries, paragraph boundaries, and headings. In
spoken texts, markup encodes sentence boundaries, speaker turns,
overlapping strings, and pauses.
2) Part-of-speech tagging: During this stage, each lexical item was assigned
a part-of-speech label or tag, such as ‘N’ for noun, or ‘V’ for verb. In
addition to the main label, most tags carry additional information, which
appears in brackets. With some modifications, the tagset is based on the
classifications given in Quirk, Greenbaum, Leech, and Svartvik 1985. The
tagger assigned one or more tags to each lexical item, and the output was
manually checked at the Survey of English Usage. The checking stage
involved choosing the correct tag for each item and removing the
incorrect tags.
3) Parsing: This is the most important stage for a parsed corpus, since a
syntactic function is assigned to every element in the clause. The
syntactic parsing was carried out automatically using the software created
by the TOSCA group of the University of Nijmegen, but previously, there
was a phase of pre-edition in which high frequency constructions were
marked manually “in order to reduce the ambiguity of the input, and
thereby reduce the number of decisions that the automatic parser would
have to make” (Nelson et al. 2002:14). The TOSCA parser analysed 70%
of the parsing units of the corpus, then the Survey Parser analysed the
rest.
2. For the time being, there are fifteen teams compiling their own national or regional components of
the ICE in such different countries as Malaysia, Sri Lanka, Ghana or Ireland, to mention just a few.
The components of Britain, New Zealand, India, Hong Kong, East Africa, Singapore and Philippines
have already been finished and are available.
In summary, the annotation of the ICE corpus was partly automatic and partly
manual, the latter phase being intended to correct the possible mistakes derived
from the automatic tagging.
PROBLEMS DERIVED FROM WORKING WITH A PARSED CORPUS
As we have said in the previous section, the most outstanding characteristic
of the ICE-GB is the fact that it is parsed, which – a priori – simplifies the work
of linguists enormously, especially since querying the corpus takes just a few
seconds and spares the linguist the manual work. Accordingly, any
parsed corpus potentially seems to be the perfect solution to ease and speed up the
tedious – but unavoidable – part of any syntactic study: the compilation of data.
However, what may be at first a great advantage for the researcher could turn out
to be a disadvantage, mainly due to problems of a twofold nature:
a) Unidirectional problems of the corpus on account of the inconsistencies
of the corpus itself. These are, from our point of view, the most serious
difficulties, since they do not allow the linguist to rely on the results
obtained from the search queries, demanding a subsequent analysis to
check if the previous results are right or not, thus requiring the researcher
to spend a lot of time on this manual work to find the mismatches
obtained. As a consequence, the process previously shortened by the
automatic searches gets much slower.
b) Bidirectional problems, that is, problems the researcher may face as a
result of his/her disagreement with the language categories existing in the
corpus. These sorts of problems are of a theoretical nature.
In the case of the ICE-GB, one has to be particularly careful; indeed, the
authors issue the following warning in the manual:
[W]e could not guarantee that all similar constructions would always be analysed
in the same way throughout the corpus. In other words, while we could achieve
accuracy in individual cases, we could not guarantee consistency across the whole
corpus. (Nelson et al. 2002:17, emphasis added)
After thinking about this statement, our question is: how can the linguist
work with a corpus that warns about the incongruence of the data found? What
then are the advantages (if any) tagged and parsed corpora have to offer the
linguist?
In the following section, we will concentrate on the aspects we have already
brought up: the presumed easiness of analysis of search queries in terms of
investment of time and simplification of work, the inaccuracy of the data, and the
possibility of not sharing linguistic principles with the parsers of the corpus.
These principles are discussed in relation to the problems we found when we
undertook the study of the indirect objects complementing the six most frequent
ditransitive verbs in the ICE-GB: give, tell, show, send, ask and offer (Mukherjee
2005). The distribution of the examples analysed is shown in the following table:
Table 1. Distribution of the examples studied in the corpus
Ditransitive verb    Number of occurrences in the corpus
give                 1159
tell                 794
show                 639
send                 518
ask                  346
offer                198
Total                3654
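The figures in Table 1 can be checked, and each verb's share of the sample worked out, with a few lines of Python. This is only an illustrative sketch: the variable names are our own, and the percentages are derived here from the counts in the table rather than reported in the study.

```python
# Occurrences per ditransitive verb, copied from Table 1.
counts = {
    "give": 1159,
    "tell": 794,
    "show": 639,
    "send": 518,
    "ask": 346,
    "offer": 198,
}

# The total should match the last row of the table.
total = sum(counts.values())

# Share of each verb in the sample, as a percentage rounded to one decimal.
shares = {verb: round(100 * n / total, 1) for verb, n in counts.items()}

print(total)
for verb, share in shares.items():
    print(f"{verb}: {share}%")
```

The sum of the six counts agrees with the total of 3654 given in the table.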
Some of the results found in a previous sample had already made us aware of
certain irregularities in the labelling of the grammatical and functional
categories; therefore, we started with a lexical search of the complementation
patterns of these verbs. In particular, the grammatical categories of
dimonotransitivity, parataxis and transitive complements are open to question.
Dimonotransitivity
As Mukherjee (2005:78) rightly points out in his study of ditransitive verbs in
English, the concept of transitivity in the ICE-GB is not a stable property of the
verb and it is purely syntactic. This means that the number of elements present in
the sentence dictates the character of the verb and, consequently, the kind of
sentence: if there is no object, the verb (and sentence) will be intransitive, if there
is one object there are two possibilities: either the verb may be monotransitive
(complemented just by a direct object) or dimonotransitive (only complemented
by an indirect object), and finally, if the verb is complemented by a direct object
and an indirect object, the sentence will be ditransitive. It is as simple (or as
difficult) as that. In this line, the new category they coin ‘dimonotransitive’
makes sense. They need a new label to designate the possibility of having
sentences with just one object: an indirect object. Only a reduced group of verbs
admit this pattern (Nelson et al. 2002:49): show, ask, assure, grant, inform,
promise, reassure and tell. For example:
(1) When I asked her, she burst into tears. <ICE-GB: S1A-094#110>
(2) I’ll tell you tomorrow. <ICE-GB: S1A-099#396>
(3) Show me. <ICE-GB: S1A-042#119>
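The purely syntactic notion of transitivity described above, in which the label depends only on which objects are overtly present in the clause, can be sketched as a small decision rule. This is a minimal illustration, not the ICE-GB parser: the function name and its two boolean parameters are hypothetical, and complex transitive and other patterns are left out.

```python
def classify_transitivity(has_direct_object: bool, has_indirect_object: bool) -> str:
    """Label a clause by the objects overtly present in it (syntactic criterion only)."""
    if has_direct_object and has_indirect_object:
        return "ditransitive"
    if has_indirect_object:
        # Only an indirect object is present: the category coined in the ICE-GB.
        return "dimonotransitive"
    if has_direct_object:
        return "monotransitive"
    return "intransitive"


# Example (2), "I'll tell you tomorrow": only an indirect object is present.
print(classify_transitivity(has_direct_object=False, has_indirect_object=True))
```

Under this rule the same verb can receive different labels from one clause to the next, which is why transitivity is not a stable property of the verb in the ICE-GB.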
As our concept of (di) transitivity is semantic and not syntactic, we cannot
accept the label ‘dimonotransitive’ since we do not consider the possibility of
finding the indirect object as the sole required semantic complement of the verb
in the same way monotransitive verbs behave. What we do not admit is the fact
that there may be verbs which cognitively evoke the presence of just an indirect
object: wherever there is an indirect object in a sentence, we have a ditransitive
verb. This is not incompatible with admitting – as we do – the possibility of
omission of the direct object, leaving only the indirect object explicitly expressed
in the sentence.
A lexical search of the complementation of the most frequent ditransitive
verbs in the corpus showed us the following examples:
(4) I’m asking you <ICE-GB: S1A-070#182>
(5) By asking people <ICE-GB: W2A-016#014>
(6) if he wanted anything he asked the nearest girl and firmly called his daughter
Pamela, never Pig <ICE-GB: W2F-017#025>
(7) I’m not sure I ever got round to asking her <ICE-GB: S1A-023#182>
(8) Well, I’ll ask one of the stallholders down Chapel Street <ICE-GB: S1A-010#025>
(9) You’re meant to ask me <ICE-GB: S1A-017#146>
(10) Did you ask anybody there <ICE-GB: S1A-024#012>
(11) Oh, I suppose it’s a question a lot of people ask each other <ICE-GB:
050#028>
(12) I mean ask Nigel you know <ICE-GB: S1A-090#208>
(13) people have to be asked <ICE-GB: 078#207>
(14) Some candidates may also be asked to attend for interview or to take an
entrance examination <ICE-GB: W2D-007#049>
(15) We hardly told anybody <ICE-GB: S2A-027#127>
(16) If he does then I hope that he will approach the Health and Safety executive
and talk to them about why these numbers have changed because he would
be told <ICE-GB: S1B-057#092>
(17) He’s called Basil in the stables and I’m told likes a pint of MacEwan with his
feed <ICE-GB: S2A-011#064>
(18) He was not told <ICE-GB: S2B-046#038>
(19) And this way I can usually discover proposed future programmes all long
before I’d officially be told <ICE-GB: S1A-082#036>
(20) By the age of forty, he had risen to the position of managing director - a sign,
as I supposed, that he possessed all those qualities of drive, initiative and
enterprise which I am told are required for success in the world of commerce
and industry <ICE-GB: W2F-011#043>
In all the examples above, the only object present in the sentence is
considered to be a direct object and the verb phrase is considered monotransitive.
After a brief comparison between these sentences and the ones used to illustrate
the concept of dimonotransitivity, a question immediately arises: what are the
differences between the underlined monotransitive constituents in examples
(16)-(18) and the dimonotransitives in (1), (2) and (3)? All these examples have
similar subjects (personal pronouns) and the same verb (tell). In other words,
why have some of them been parsed as direct objects and others as indirect
objects?
If, according to the ICE-GB, dimonotransitive structures are combined with
only an indirect object, we could logically think this is going to be the case
throughout the whole corpus. The main problem with this label is to find out what
lexico-grammatical criteria are used by the ICE to label some elements as
indirect objects and what criteria lead it to classify other constituents as direct
objects.
In all the cases, the only participant in the predicate is a noun phrase, has a [+
animate, + human] referent, occupies an immediate post-verbal position or, in
some cases of passive sentences (examples (13), (14), (16), (17), and (18)), it is
the subject of a passive clause. Semantically, in all the examples the only
animate constituent of the predicate has the semantic role of recipient, that is, the
participant who receives the entity transferred by the agent. So, the results
provided by the corpus are not reliable at all.
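The kind of cross-check that such inconsistencies make necessary can be sketched in Python. This is only an illustration under our own assumptions: the function name `find_inconsistent_verbs` and the hand-typed list are hypothetical and have nothing to do with ICECUP. The labels are those reported above for examples (1)-(3), where the sole object is parsed as an indirect object, and for examples (4), (9) and (15), where it is parsed as a direct object.

```python
from collections import defaultdict

# (example number, verb, label given in the ICE-GB to the only object in the clause).
# In every case the object is a post-verbal noun phrase with a human referent.
examples = [
    (1, "ask", "indirect object"),
    (2, "tell", "indirect object"),
    (3, "show", "indirect object"),
    (4, "ask", "direct object"),
    (9, "ask", "direct object"),
    (15, "tell", "direct object"),
]


def find_inconsistent_verbs(rows):
    """Return the verbs whose sole human object received more than one label."""
    labels_by_verb = defaultdict(set)
    for _number, verb, label in rows:
        labels_by_verb[verb].add(label)
    return {
        verb: sorted(labels)
        for verb, labels in labels_by_verb.items()
        if len(labels) > 1
    }


inconsistent = find_inconsistent_verbs(examples)
for verb, labels in inconsistent.items():
    print(verb, labels)
```

On this small sample, ask and tell are flagged because structurally similar objects carry both labels; show is not flagged, since only one example of it is included here.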
To make things worse, in the following sentence, the highlighted constituent
is considered an adverbial instead of a direct object:
(21) the contribution of modern genetics has shown however that the genetic code
is really a fundamental organising principle <ICE-GB: S1B-060#068>
In the next example, there is a confusion between the direct object (you,
according to the ICE) and the indirect object (what, according to the ICE):
(22) And what Mr Lampitt told you was that he was interested in acquiring a
business whereby he could bring that business into the centre of London in
effect <ICE-GB: S1B-064#017>
The ICE-GB user’s manual is not very explicit when explaining the reasons
which led the parsers to make one decision instead of another. They just mention
the formal categories both direct and indirect object are related to: NP in the case
of indirect object and NP, CL, AJP, REACT, INTERJEC and DISP for direct
objects, as well as the sorts of verbs they go with: ditransitive and
dimonotransitive with both direct and indirect objects and monotransitive, and
complex transitive in relation to direct objects. However, the labelling of these
categories is all the more surprising given that the ICE
grammar is based on Quirk et al.’s grammar. Quirk et al. (1985:759) overtly mention the
different lexical nature (usually animate in the case of indirect objects and
prototypically inanimate in the case of direct objects) as one of the
distinguishing features between direct and indirect object, a characteristic which
has obviously not been taken into account by the ICE parsers to differentiate
between these two types of constituents.
Parataxis
The label PARATAXIS is used with direct speech or reported speech and
thoughts. It is assumed to have the clause level ‘main’ and is associated with the
categories CL, DISP and NONCL, for example (Nelson et al. 2002:51,67):
(23) And he said oh yes I agree with you <ICE-GB: S1A-005#025>
(24) So I said yes here <ICE-GB: 008#274>
The leading problem we find in relation to this category is the non-inclusion
of the paratactic element within the main clause. Giving this sort of constituent
the functional category ‘parataxis’, the parsers are automatically considering it
an element apart, not integrated in the clause structure, that is, without any
function in the clause. Furthermore, the non-integration of paratactic elements
within the clause makes them part of a higher unit, that is, a superordinate
element. In this sense, it would be in the line of other types of constituents, such
as disjuncts or conjuncts, elements which are clearly out of the scope of the
clause.
We contend that the element labelled in the corpus as parataxis is part of the
clause structure, usually having the function of direct object. This has two main
implications:
1) The (di)monotransitive nature of the verb complemented by a paratactic
constituent, in keeping with the principle of syntactic
transitivity (i.e. one object → (di)monotransitivity).
2) The controversy about the functional nature of paratactic constituents:
what sort of elements are they? Are they disjuncts, conjuncts or perhaps
something different? Do they (disjuncts or adjuncts) share with paratactic
elements any lexico-grammatical feature? An analysis of the next
examples suggests that the function of the paratactic element is
that of direct object of the clause:
(25) What I am sensing is my own dread, she told herself <ICE-GB: W2F-020#043>
(26) And lots of people ask me well why do you go on <ICE-GB: S1B-026#217>
(27) Rajiv perhaps best captured the imagination of the members of Congress
when he told them: “India is an old country but a young nation: and like
the young everywhere we are impatient” <ICE-GB: W2B-011#051>
Semantically, the highlighted constituents refer to the content of the verbs tell
and ask respectively, and syntactically they could be transformed into indirect
speech and be replaced by a that/wh- clause: She told herself what she was
sensing was her own dread, And lots of people ask me why I go on, and Rajiv
told them India is an old country but a young nation. However, we found that
the following sentence has been considered ditransitive, with the direct speech
constituent acting as direct object:
(28) He had refused to increase child benefits or pay large old age pensions and
told unemployed “if it isn’t hurting it’s not working” <ICE-GB: W2C-018#064> [bold face present in the original example]
The fact that the realization of any element in the clause is in direct speech is
not, in our view, a reason clear enough to think of it as a different sort of
constituent, even less, not being part of the predicator. To entangle the whole
taxonomy even more, the ICE does not admit the possibility of formal realization
of the category PARATACTIC by a noun phrase, which prevents the next
examples from being considered as such:
(29) Nonsense, she told herself <ICE-GB: W2F-020#024>
(30) “An act of bravado”, she’d told Mr Rainbow <ICE-GB: W2F-020#088>
The syntactic units in direct speech are labelled as ELE (element), a category
which is defined by the taggers of the ICE as “an isolated element. All phrases
occurring within a NONCL have the function ELE.” The constituents labelled
ELE do not have a role in the clause either, being treated as elements outside the clause
structure.
Transitive complement
In relation to the category ‘transitive complements’, Nelson et al. (2002:40ff)
explain:
The transitivity of a verb is unclear in many instances where the main verb is
transitive and is followed by a noun phrase that may be the subject of the
nonfinite clause or the object of the host clause. In all such cases, we avoid
deciding the type of transitivity by tagging the main verb “V (trans,...)” [...] The
problem seems to be therefore, with nonfinite clauses with intervening nominal
since they were considered transitive. Transitive complements occur with trans
(transitive verbs) and are associated with the categories CL and DISP. However
the label trans is not applied
(a) if the verb is be: ‘one of my aims is to finish my PhD’
(b) if the nonfinite clause does not have an overt Subject: ‘I enjoy doing it’
Concerning this and taking into account the double nature of some of the
constituents in these constructions (having a function within the subordinate
clause and a different one with respect to the main clause), the most outstanding
difficulty seems to be to decide if what we have is just one constituent (i.e. the
verb is monotransitive, as in the case of want) or two (i.e. it is ditransitive, as in
the case of tell). Greenbaum clearly justifies the lack of a steady decision
adopted by the parsers of the ICE-GB:
We do not want to pre-empt the findings of those investigating this conspicuous
example of syntactic gradience. We therefore avoid deciding the type of transitivity
by tagging the verb simply as transitive. We leave it to researchers to weigh
criteria and decide what distinctions to make. (Greenbaum 1993:15)
Greenbaum’s quotation is rather striking since one could interpret that the
parsers do pre-empt the rest of the categories of the ICE, some of them still at the
centre of hot linguistic debates, such as the possible formal realization of the
indirect object by a prepositional phrase, for instance. From our point of view,
this drawback could have been easily solved by following Quirk et al.’s tests
(1985:16.25-67) to determine the degree of mono, di- or complex transitivity:
possibility of replacement by a pronoun, answer to a wh- question, focus of a
pseudo-cleft sentence, passivization of the subordinate clause, and retaining or
dropping of the preposition to, instead of proposing a new and confusing
category: the transitive complementation which is opposed (by system) to the
already existing classes of (di)monotransitivity and ditransitivity. The application
of these tests to a verb such as tell in an example labelled as ‘transitive
complement’ confirms its ditransitive nature:
(31) She told her Ministers at a Downing Street reception last night to work harder
and argued that the most important thing for the Conservatives was to get the
economy right <ICE-GB: 006#037>
a) Replacement by a pronoun: She told her Ministers at a Downing Street reception
last night something.
b) Answer to a wh- question: What did she tell her Ministers?
c) Focus of a pseudo-cleft sentence: What she told them was to work harder.
d) Passivization of the subordinate clause: They were told to work harder.
e) Retention or dropping of the preposition to: She told them.
Other examples of transitive complements are:
(32) It’s told its eighteen thousand employees not to report for work
<ICE-GB: S2B-015#074>
(33) He was told to use the normal exit and that caused “resentment and
friction” <ICE-GB: 011#016>
(34) Her husband told her not to attend as a result the trial was impeded
<ICE-GB: W2B-020#069>
According to the ICE, “In passive constructions, the tagging of the
main verb is the same as it would be if the verb were active” (Nelson et al.
2002:40). This means that the examples below would also have to be considered
monotransitive in the active voice, since they have been parsed as monotransitive
(neither transitive nor ditransitive):
(35) We’ve all been told to do it <ICE-GB: S1A-093#032>
(36) Fielding shows us giving the poor man a: “severe rebuke” concluding that
“Every parish ought to keep their own poor” <ICE-GB: W1A-010#065>
Nevertheless, the following two sentences have been parsed as
ditransitive in spite of having a structure similar to that of the previous pair of
sentences, which are parsed as monotransitive:
(37) The Interior Ministry has told people to carry on with their work and that
attempts to destabilise the country will be severely punished
<ICE-GB: S2B-008#096>
(38) Probably there was a bullet in him somewhere , but when she tried to protest
he told her brusquely to pad and cover where the blood was seeping
through and leave everything else alone <ICE-GB: W2F-015#074>
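The kind of inconsistency illustrated in (31)-(38) can also be surfaced mechanically once the relevant hits have been exported from the corpus. What follows is a minimal sketch in Python and not part of the original study: the record layout, the coarse pattern descriptions, the label strings and the function name `find_inconsistent_patterns` are all assumptions made for illustration. The labels simply restate the analyses reported above for the examples with tell; they are not read from the ICE-GB files.

```python
from collections import defaultdict

# Hypothetical records: (example number, verb lemma, surface pattern, label).
# The labels restate the analyses reported in the text for the examples
# with "tell"; they are not extracted from the ICE-GB itself.
EXAMPLES = [
    (31, "tell", "V + NP + to-infinitive", "transitive"),
    (32, "tell", "V + NP + to-infinitive", "transitive"),
    (34, "tell", "V + NP + to-infinitive", "transitive"),
    (37, "tell", "V + NP + to-infinitive", "ditransitive"),
    (38, "tell", "V + NP + to-infinitive", "ditransitive"),
    (33, "tell", "passive V + to-infinitive", "transitive"),
    (35, "tell", "passive V + to-infinitive", "monotransitive"),
]


def find_inconsistent_patterns(examples):
    """Return the (lemma, pattern) pairs that carry more than one label.

    The result maps each such pair to a dict of label -> example numbers.
    """
    labels_by_pattern = defaultdict(lambda: defaultdict(list))
    for number, lemma, pattern, label in examples:
        labels_by_pattern[(lemma, pattern)][label].append(number)
    return {
        key: dict(labels)
        for key, labels in labels_by_pattern.items()
        if len(labels) > 1  # keep only patterns with conflicting labels
    }


if __name__ == "__main__":
    for (lemma, pattern), labels in find_inconsistent_patterns(EXAMPLES).items():
        print(f"{lemma}: {pattern}")
        for label, numbers in sorted(labels.items()):
            print(f"  {label}: examples {numbers}")
```

A check of this kind only groups hits and reports conflicting labels; deciding which label is the right one still requires the manual tests discussed above.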
CONCLUSIONS
In the light of all that we have said, some questions inevitably stand out: Do
linguists need a tagged/parsed corpus to carry out syntactic analysis? What are
the advantages of working with a parsed corpus? Does such a corpus really make the
researcher’s work easier?
In our study of indirect objects in English, we have not found many
advantages in working with a parsed corpus such as the ICE-GB, mainly because of
some functional categories we find doubtful, as well as the inconsistency
revealed in the treatment of similar examples. Most of the problems that arise
in the analysis of the corpus are rooted in the mistaken labelling of
constituents, which prevents the linguist from retrieving the right information.
In this sense, Gilquin (2002:207) says:
[T]he fully automatic retrieval of syntactic structures with no manual intervention
is still something of an impossible dream for lack of suitable and/or reliable tools
and corpora [...] the lack of the ideal parsed corpus (i.e. accurate, detailed and big
enough) forces one to turn to a tagged corpus and use a method requiring more
manual post-editing and yielding slightly less satisfactory results.
It is worth noting that the three categories we have called into question,
namely dimonotransitivity, parataxis and transitive complementation, are not
included either in Quirk et al.’s A comprehensive grammar of the English
language (1985) or in A student’s grammar of the English language,³ in spite of
the fact that Greenbaum (1993:13) recognises that these two reference grammars
provided the basis for the categories in the ICE-GB.
However, does our experience imply that parsed corpora are not useful for
this sort of study? Not necessarily. It just means that the ICE-GB is not appropriate
for our needs, mainly on account of the inaccuracy of its parsing. All in all, the
outcome would have been the same whether we had used a parsed corpus such as the
ICE-GB or a tagged-only corpus. Furthermore, in terms of time and ease of use,
working with the ICE-GB required much greater effort than working with a
non-parsed corpus.
3. Not even in Greenbaum’s Oxford grammar of the English language (1996).
At least, we hope to have proved that Sinclair was totally wrong when he
stated:
Each tagger will put into practice a policy for these categories that is more likely
to be the result of expediency than the elaboration of a theory, and these decisions
will affect a decade or more of research, without the users even being aware of
them. Most researchers are content that someone has tagged the corpus, and they
are not inquisitive as to how this was done, or what the shortcomings are.
(Sinclair 2004:53, emphasis added)
REFERENCES
Aarts, J. 2002, “Does corpus linguistics exist? Some old and new issues”, in Leiv
Egil Breivik and A. Hasselgren (eds.). From the COLT’s mouth...and others’.
Language Corpora Studies. In honour of Anna-Brita Stenström. Amsterdam
and New York: Rodopi.
Francis, G. 1993, “A corpus-driven approach to grammar. Principles, methods,
and examples”, in G. Sampson and D. McCarthy (eds.). Corpus linguistics.
Readings in a widening discipline. London and New York: Continuum.
Greenbaum, S. 1991, “ICE: The International Corpus of English”. English
Today, 28 (7): 3-7.
Greenbaum, S. 1993, “The tagset for the International Corpus of English”, in C.
Souter and E. Atwell (eds.). Corpus-based computational linguistics. Amsterdam:
Rodopi, 11-24.
Greenbaum, S. 1996, The Oxford grammar of the English language. Oxford:
Oxford University Press.
Greenbaum, S. and R. Quirk 1990, A student’s grammar of the English
language. London: Longman.
Gilquin, G. 2002, “Automatic retrieval of syntactic structures. The quest for the
Holy Grail”. International Journal of Corpus Linguistics, 7 (2): 183-214.
Mukherjee, J. 2005, English ditransitive verbs. Aspects of theory, description
and a usage-based model. Amsterdam: Rodopi.
Nelson, G., S. Wallis and B. Aarts 2002, Exploring natural language. Working
with the British component of the International Corpus of English.
Amsterdam/Philadelphia: John Benjamins Publishing Company.
Quinn, A. and N. Porter 1994, “Investigating English usage with ICECUP”.
English Today, 10 (3): 21-24.
Quirk, R., S. Greenbaum, G. Leech and I. Svartvik 1985, A comprehensive
grammar of the English language. London: Longman.
Sinclair, J. 2004, “Intuition and annotation: the discussion continues”, in K.
Aijmer and B. Altenberg (eds.). Advances in corpus linguistics. Papers from
the 23rd International Conference on English Language Research on
Computerized Corpora (ICAME 23). Amsterdam/New York: Rodopi, 39-59.
Tognini-Bonelli, E. 2001, Corpus linguistics at work. Amsterdam/Philadelphia:
John Benjamins Publishing Company.