DOMAIN AND GENRE IN
SUBLANGUAGE TEXT:
DEFINITIONAL MICROTEXTS IN
THREE CORPORA
Marie-Paule PERY-WOODLEY
Josette REBEYROLLE
Equipe de Recherche en Syntaxe et Sémantique
Université Toulouse-le Mirail
5 allées Antonio Machado
31058 Toulouse, France
tel. 33 (0)5 61 50 36 09
fax 33 (0)5 61 50 46 77
email: {pery,rebeyrol}@cict.fr
Abstract
In this paper we outline a shallow grammar of definitions as they
occur in technical and scientific texts. We show how visual
clues, typography and layout, interact with lexical and syntactic
clues to signal a definitional text. Whilst starting from the
hypothesis that domain and genre have an impact on the
grammar of definitions, we also expect to find stable features,
which would make the retrieval of definitions possible in new
domains. We illustrate this approach with a comparative study
on three texts which differ in terms of domain and genre and we
show how it is possible to identify some constraints on
variations in the formulation of definitional texts.
Keywords: corpus linguistics, sublanguages, textual genres,
definitions, knowledge extraction from texts
1. INTRODUCTION
Our objective is twofold: first, it is to formulate a shallow
grammar of a functional text object – definitions – in
technical and scientific texts, as a first step towards
automatic identification in textual databases; a secondary
objective is the investigation of the impact of genre and
domain on the signalling of such structures.
We draw on previous studies devoted to the identification
of markers of semantic relations and other “knowledge
probes” in discourse (Hearst, 1992; Borillo, 1997; Ahmad,
1993; Kavanagh, 1996; Desclès & Jouis, 1993). Our
approach differs, however, in three ways: its focus is a text
object rather than a relation, and the markers we are
seeking to identify mostly signal the initial boundary of
this object; secondly, we propose a broader conception of
marker, in terms of configurations of lexical, syntactic,
typographical and layout features; finally, rather than start
from the principle that markers belong to “general
language”, we hypothesize variation linked to genre and
domain, and we aim to throw light on areas of stability and
areas of variability in the signalling of definitions in
different corpora.
This surface modelling of definitions and of their
integration within text has relevance for knowledge
extraction, in particular in the context of the constitution
of terminological knowledge bases, and for text
generation. It is also a test-bed for the automatic
identification and mark-up of functional objects for more
intelligent information retrieval.
We start by outlining the model of text architecture, which
forms the theoretical foundation for our approach to
signalling in text (section 2). Section 3 then presents the
results of the study carried out on three corpora.
2. TEXT ARCHITECTURE AND THE
SIGNALLING OF TEXT OBJECTS
Our analysis is based on a model of text structure which
stresses the fact that written texts are visual objects, and
that their visual properties are directly involved – and
exploited by readers – in the construction of meaning. In
the model of the representation of text architecture, Virbel
and his group (Virbel, 1985; Virbel, 1989; Pascual, 1991)
propose an extended notion of text formatting, which
brings out the relationship between text organisation
signalled via visual means (typography and layout) and via
discursive means (lexical and syntactic markers). This
relationship is best presented through an example:

Definitions
A: __________________        A is __________________
B: __________________        B can be defined as ______
C: __________________        ______________________
                             We call C ______________

Figure 1: Formatting-based vs. discursive formulation
The same effect can be said to be created in the text image
on the left and on the right: three definitions are
formulated, and meant to be recognised as definitions, in
both cases. The formulations on the left are based mostly
on layout, typography and enumerations, while those on
the right, though not devoid of visual formatting, rely
more on discursive means. The example is designed to
show extreme cases, but in-between formulations are
obviously possible. The resources available for signalling
written text organisation thus appear as a continuum from
wholly discursive to wholly visual. There seem to be no
hard and fast conventions for layout and typographical
enhancement, but rather a general principle of contrast.
It is on the basis of these initial observations that the
model of text architecture was developed. Its main tenets
can be summarised as follows:
– these formulations are perceived as equivalent because
they are interpreted by the reader as performing the same
“text act”, here defining. Such text acts succeed when
they are recognised, and when the text segments concerned
(the arguments of the performative) are understood as
definitions. These are metalinguistic performatives whose
performativity is directed at the text itself.
– the textual metalanguage, exemplified by the fully
discursive formulations, is part of the language, and
therefore open to description in terms of operator-argument
relations (after Harris, 1968, 1982). The
operators are verbs such as organise, entitle, illustrate,
conclude, define…; their arguments are text segments
called text objects. A text object is therefore a segment
corresponding to a specific metalinguistic formulation and
signalled by formatting. The notion of formatting [1] covers
lexico-syntactic, typographical, layout and punctuation
markers.
– the formulation of constraints governing the combination
of text objects amounts to a theory of text organisation.
Our study is grounded in this model in several ways:
– it focusses on a functional text object, definitions, and
hypothesises that this text object is marked by identifiable
formatting features;
– it is also concerned with the interaction of text objects,
in particular the role of presentational objects such as lists
within definitions;
– finally, the notion of formatting features encompasses
visual aspects generally overlooked by research on
markers.
3. THE STUDY
3.1 METHOD
The initial identification of definitions is carried out on the
basis of our competence as readers. In the first instance,
we simply respond to the formatting features of texts in an
intuitive manner, our aim being then to identify the
features which underlie this competence. An iterative
process therefore starts whereby successive formulations
of the configurations of markers which signal definitions
are tested on the data (via filters created using
text-analysis software) and gradually refined. This process
continues until we obtain a stable configuration of lexical,
syntactic, typographical and layout features. The
comparative analysis presented in this paper was
performed on data tagged for grammatical categories.
Visual aspects, however, were not tagged, and had to be
taken into account “manually”. We believe the method
could be adapted for raw data with visual formatting
indications, such as SGML- or HTML-formatted text.
Our method also aims to identify variations in the patterns
of markers so as to see whether they can be correlated
with domain and/or genre, and whether a common pattern,
applicable to new domains, may emerge.
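The iterative procedure can be pictured as a small test harness. The sketch below is only an illustration under stated assumptions: it applies a regular expression to raw paragraph text rather than to the grammatically tagged data used in the study, the names CANDIDATE, PARAGRAPHS and evaluate are hypothetical, and the fourth paragraph is an invented non-definition. The configuration it encodes (paragraph-initial definite determiner, a present-tense form of "être" or "désigner", an indefinite determiner) is the one reported in the results of the paper.

```python
import re

# Hypothetical candidate filter (not the authors' software): a paragraph
# that opens with a definite determiner, a term, a present-tense form of
# "être" or "désigner", and an indefinite determiner.
CANDIDATE = re.compile(
    r"^(?:(?:Les|Le|La)\s+|L['’])"          # definite determiner
    r"(?P<term>.+?)\s+"                     # term being defined (lazy)
    r"(?:est|sont|désigne|désignent)\s+"    # restricted verb class, present tense
    r"(?:un|une|des)\s+"                    # indefinite determiner
)

# Three paragraphs taken from the examples in the paper, plus one
# invented non-definition (index 3) added for illustration.
PARAGRAPHS = [
    "La commande Distance est un analyseur lexicostatistique.",
    "Les diffuseurs sont des organisations partenaires chargées de la "
    "diffusion commerciale du Produit Logiciel.",
    "CARACTERISER permet de préciser le fonctionnement du journal.",
    "Le journal est affiché à l'écran.",
]


def evaluate(paragraphs, gold):
    """Compare the filter with a reader's judgements.

    gold is the set of paragraph indices that a reader identified as
    definitions. Returns precision, recall, missed and spurious indices.
    """
    found = {i for i, p in enumerate(paragraphs) if CANDIDATE.match(p)}
    true_positives = len(found & gold)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, sorted(gold - found), sorted(found - gold)


precision, recall, missed, spurious = evaluate(PARAGRAPHS, {0, 1, 2})
```

With the reader's judgements set to paragraphs 0, 1 and 2, the filter misses paragraph 2, which has no expressed genus. Misses of this kind are what would prompt the next reformulation of the configuration.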
3.2 CORPUS
Our objective of identifying areas of variation and areas of
constancy in the formulation of definitions implies a
comparative study of a diverse corpus. In this initial stage
of our study, our corpus is made up of three texts:
1) a subpart (32400 occurrences) of a textbook in
geomorphology [2] (T1)
2) a manual for a text analysis software programme [3]
(66300 occurrences) (T2)
3) a handbook for a software engineering project (54400
occurrences) (T3).
In terms of the hypothesis we are testing here, the size of
the corpus is of minor importance; its composition,
however, is crucial. As well as in domain, these texts differ
in terms of discourse function and intended audience: the
first one is an expository text intended for non-expert
specialists; the other two are instructional texts, but
addressing a general readership in the case of the software
manual, domain experts in the case of the software
engineering handbook.
Clearly, the stable aspects of the “grammar of definitions”
identified in this study will in the future need to be
validated on larger corpora.
[1] The original term is "mise en forme matérielle".
[2] Derruau, M. (1988). Précis de géomorphologie, Masson, 7ème
Edition.
[3] Daoust, F. (1996). SATO (Système d'Analyse de Textes par
Ordinateur) version 4.0, Manuel de référence. Centre ATO,
Université du Québec à Montréal.
3.3 RESULTS
In general terms, a definition first states the genus of the
term to be defined and then moves on to the expression of
the differentiae. This general formulation of the textual
organisation of definitions highlights the link between this
study and the well-researched question of the expression
of hyponymy, since the hyponymy relation provides the
primary means of categorisation of a word, whether in
dictionary definitions or in discourse. Out of the diversity
of expressions encountered in the corpus a general pattern
clearly emerges: the term being defined is associated with
one or two hypernyms (genus), the differentiae is realised
by a modifier which can be a relative clause, a present
participle, an adjectival or prepositional phrase. A first
approximation of this pattern is as follows:

Nc1 Nn Vi Nc2 Mod

where Nc is a classifier noun phrase, Nn the domain noun
phrase being defined, Vi the copula or a verb belonging to
a restricted class, and Mod the modifier. This pattern,
called Full Pattern (FP), subsumes the different structures
occurring in the corpus. These different realisations are
presented in Table 1, together with their distribution in the
three texts making up the corpus. Table 1 distinguishes a
Basic Pattern (BP), so-called because it is complete in
terms of the “genus-differentiae” structure, and present in
all three sub-corpora, from Reduced Patterns (RP), where
one of the basic elements is not expressed.
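The relation between the Full Pattern and its variants can be made concrete as a presence/absence test on the four slots. The following is a minimal sketch, assuming the slot composition given in Table 1 of the paper (FP: all slots; BP: no Nc1; RP1: no Nc1 and no genus; RP2: no Nc1 and no modifier). The names Analysis and pattern_name are hypothetical, and the segmentation into slots is assumed to have been done beforehand.

```python
from typing import NamedTuple, Optional


class Analysis(NamedTuple):
    """One analysed definition; None means the slot is not expressed."""
    nc1: Optional[str]      # optional classifier noun phrase
    nn: str                 # term being defined
    vi_nc2: Optional[str]   # verb plus hypernym (genus)
    mod: Optional[str]      # modifier (differentiae)


def pattern_name(a: Analysis) -> str:
    """Name the pattern from the slots that are filled."""
    has_nc1 = a.nc1 is not None
    has_genus = a.vi_nc2 is not None
    has_mod = a.mod is not None
    if has_nc1 and has_genus and has_mod:
        return "FP"
    if not has_nc1 and has_genus and has_mod:
        return "BP"
    if not has_nc1 and not has_genus and has_mod:
        return "RP1"
    if not has_nc1 and has_genus and not has_mod:
        return "RP2"
    # Any other combination does not occur in the corpus described here.
    return "unattested"


bp_example = Analysis(
    nc1=None,
    nn="Les diffuseurs",
    vi_nc2="sont des organisations partenaires",
    mod="chargées de la diffusion commerciale du Produit Logiciel.",
)
rp2_example = Analysis(
    nc1=None,
    nn="Le palse",
    vi_nc2="est donc une forme de ségrégation.",
    mod=None,
)
```

Here pattern_name(bp_example) gives "BP" and pattern_name(rp2_example) gives "RP2". Combinations that the paper does not report, such as "Nc1 Nn Vi Nc2" without a modifier, fall through to "unattested".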
Pattern | Nc1 | Nn | Vi Nc2 | Mod | T1 | T2 | T3
FP      |  +  | +  |   +    |  +  | ?  | *  | ?
BP      |  –  | +  |   +    |  +  | *  | *  | *
RP1     |  –  | +  |   –    |  +  |    | *  | *
RP2     |  –  | +  |   +    |  –  | *  |    |
(FP occurs in T2 and in one other text; whether that text is T1
or T3 cannot be read from this copy of the table.)
Table 1: Full pattern variations [4]
[4] Table 1 represents patterns actually occurring in our corpus. If
the objective was to generate all possible variations, a pattern
"Nc1 Nn Vi Nc2" should be included.
Table 2 illustrates these patterns with examples drawn
from the three texts of our corpus:

Pattern | Text | Nc1 | Nn | Vi Nc2 | Mod
FP  | T2 | § La commande | Distance | est un analyseur lexicostatistique. | Elle permet de comparer statistiquement les lexiques de deux sous-textes quelconques d’un corpus
BP  | T3 | – | § Les diffuseurs | sont des organisations partenaires | chargées de la diffusion commerciale du Produit Logiciel.
RP1 | T2 | – | § CARACTERISER | – | permet de préciser le fonctionnement du journal.
RP2 | T1 | – | § Le palse | est donc une forme de ségrégation. | –
Table 2: Examples of composition of definition pattern
Besides its illustrative function, Table 2 adds to the
patterns elements which are stable throughout the variants,
and contribute significantly to the identification of
definitions in discourse. These lexico-syntactic,
typographical and layout features are integral parts of the
configurations of markers that make up our patterns:
– these definitional patterns in our corpus always occur at
the beginning of a paragraph (noted §);
– Nn, the term to be defined, is almost always
typographically marked, whether by capitals, inverted
commas, bold or italics;
– Vi belongs to a class which can be defined in extension
(in our corpus: {être, désigner}), and is always in the
present tense;
– Nn is always a definite noun phrase;
– Nc2, the hypernym, is always an indefinite noun phrase.
3.3.1 The composition of definition patterns
The examination of the genus has thrown light on a
number of stable elements in the pattern, but also on
several variations, which we now analyse further.
a) the Full Pattern accounts for the possibility of a twofold
expression of the genus: Nn can be preceded by a
classifier (Nc1). Thus in:
(1) La commande Distance est un analyseur lexicostatistique. Elle permet de comparer….
“Distance” is classified by Nc1 as a type of command, and
by Nc2 as a type of analyser. It is however the Basic
Pattern – the single-hypernym definition – which
constitutes the most common form, the Full Pattern being
the most complete, but not the most representative.
b) Reduced Pattern 1 (RP1) represents definitions which
display all the formatting features described above
(determiners, verb form and tense, typographical marking
and layout) but where the genus is not expressed. There
are numerous occurrences of this seemingly paradoxical
pattern in two of our texts. Closer examination shows
however that they are subject to strict formatting
constraints which in fact eliminate the paradox: they only
occur in list structures, which in themselves can express a
hyponymic relation, as can be seen in this contextualised
version of the RP1 example from T2 given in Table 2:
(2) Cinq actions s’appliquent à cet objet : AFFICHER,
CARACTERISER (…).
– AFFICHER permet de ….
– CARACTERISER
permet de préciser le
fonctionnement du journal.
–…
The genus is indeed present (“action”), and the hyponymy
relation is expressed by the list structure.
c) Finally, just as RP1 is genus-less, our second reduced
pattern, RP2, is differentiae-less. The term “palse” is
defined solely by a hyponymic relation, with no
explicitation of the differences that distinguish it from its
hypernym “ségrégation”. We must however emphasize
that this is the only occurrence of an RP2, and that the
presence of “donc” should be investigated further as it
may in fact invalidate the definition status of this example.
We have analysed in this section the variations in the
structures in terms of presence or absence of certain
elements. We now turn to the variations in the formulation
of the differentiae.
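The treatment of RP1 in list structures, where the genus is supplied by the sentence introducing the list, can be sketched as follows. This is an illustration under strong assumptions rather than the method used in the study: the helper names are hypothetical, the genus is found naively as the noun following a leading numeral, plural marking is removed by stripping a final "s", and capitals are the only typographical marking checked.

```python
from typing import Dict, List, Optional

# Small closed list of French numerals, enough for the example.
NUMERALS = {"deux", "trois", "quatre", "cinq", "six",
            "sept", "huit", "neuf", "dix"}


def genus_from_intro(intro: str) -> Optional[str]:
    """Naive guess: the noun that follows a leading numeral."""
    words = intro.split()
    if len(words) >= 2 and words[0].lower() in NUMERALS:
        noun = words[1].lower()
        return noun[:-1] if noun.endswith("s") else noun
    return None


def list_definitions(intro: str, items: List[str]) -> List[Dict[str, str]]:
    """Pair each capitalised list item with the genus of the list intro."""
    genus = genus_from_intro(intro)
    definitions = []
    for item in items:
        text = item.lstrip("–- ").strip()       # drop the list bullet
        term, _, rest = text.partition(" ")
        if genus and term.isupper():            # term marked by capitals
            definitions.append(
                {"term": term, "genus": genus, "differentiae": rest}
            )
    return definitions


intro = "Cinq actions s’appliquent à cet objet : AFFICHER, CARACTERISER (…)."
items = [
    "– AFFICHER permet de ….",
    "– CARACTERISER permet de préciser le fonctionnement du journal.",
    "–…",
]
rp1_definitions = list_definitions(intro, items)
```

On example (2), each capitalised item is paired with the genus "action" taken from the introductory sentence, and the elided final item is skipped.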
3.3.2 Variations in the formulation of the differentiae
We have seen that the modifier expressing the differentiae
could be realised by a relative clause, a present participle,
an adjectival or prepositional phrase. It is indeed possible
to treat these diverse forms as transformations of relative
clauses (reductions and permutations in Harrisian terms,
cf. Pascual & Péry-Woodley, 1997b). Here, however, our
purpose is to examine the surface variations in order to
identify any correlation with genre and domain. Table 3
summarises through examples the main types of modifiers
found in the three texts of the corpus.
Text | Nc1 | Nn | Vi Nc2 | Mod
T1 | – | § Les fjords | sont des auges | creusées par un glacier de vallée.
T1 | – | § Les griffures | sont des sillons | étroits.
T2 | – | § Le filtre | est un patron de fouille | qui permet de définir les entrées du dictionnaire que l’on veut sauvegarder.
T2 | § La commande | Distance | est un analyseur lexicostatistique | Elle permet de comparer statistiquement les lexiques de deux sous-textes quelconques d’un corpus
T3 | – | § Le guide d’élaboration de la documentation de spécification | est un guide méthodologique | pour la production des documents de spécification des logiciels scientifiques.
Table 3: Examples of variations of differentiae
As Table 3 suggests, T1, the geomorphology textbook,
favours adjectives and past participles as modifiers for the
expression of the differentiae. In the other two texts (T2,
T3), both concerned with software, modifiers are
essentially relative clauses, sometimes independent
clauses linked via pronominalization, or prepositional
phrases with “pour”.
The explanation for these specific distributions requires
that one considers the semantic information linked to the
domain to which these texts belong. On the one hand, we
find descriptive definitions in the domain of geomorphology,
with the primary aim of specifying the properties of
objects (adjectives), their location, how they were formed
(past participles); on the other hand, functional definitions
in the software domain, quite regularly constructed with a
relative clause with verbs such as “permettre”, “servir à”,
or with prepositional phrases introduced by “pour”.
Besides domain, genre has a clear role. Texts T2 and T3
are both instructional texts, in which definitions are the
preferred way to formulate expressions: “command X is
used to do Y” also says “in order to do Y, use command
X”. It is therefore no surprise that definitions are never of
the RP2 form, when the differentiae, the functional
aspects, are so obviously central in this genre. Similarly,
the dominance of the Basic Pattern in the geomorphology
textbook is explained by the pedagogical nature of this
text, which has as one of its primary aims to familiarise the
reader with the terms of the domain.
3.4 SYNTHESIS
Starting from a definition organised around genus and
differentiae, we have described a basic structure which
reflects its syntactic realisation, and which leads, through
the detail of the configuration of formatting markers, to an
approximation of filters for the identification of definitions
in texts. Table 4 summarises these stages:

genus                                        | differentiae
(Nc1) Nn       | Vi Nc2                      | Mod
§ def_det “NP” | {être, désigner} ind_det NP | {adjective, past participle, relative clause, “pour” NP, “pour” clause}
Table 4: From structure to a configuration of markers
Key:
def_det: definite determiner
ind_det: indefinite determiner
“NP”: the inverted commas indicate typographical
enhancement
4. DISCUSSION
Without claiming to have formulated an exhaustive
grammar of definitions, we have identified some
constraints which can play an essential role in their
automatic identification:
a) The expression of the genus is crucial for there to be a
definition. We have shown however that it can take
diverse forms. We have mentioned lists, emphasising the
fact that the genus is present in those structures in the
introductory sentence. Lists can work as two-level
definitions: each item of the list is a definition (as in
example 2), with a common genus expressed in the
introductory sentence; but the whole list may in itself be a
definition, with the genus in the introduction and the
differentiae in the list items. We intend to take further the
investigation of the link between presentational text
objects such as lists, enumerations, parentheses, footnotes
and functional text objects such as definitions. As there is
no suggestion that any of the visual formatting features
involved are dedicated to definitions, it is necessary to
identify other markers or constraints which would produce
definition-specific configurations. Along these lines,
Borillo (1997) studies the conditions under which
parentheses may signal a hyponymy relation: a
parenthesis “can quite often play the role of an
apposition if fairly strict conditions are met concerning
the juxtaposition of the noun phrases concerned and the
absence of a determiner on the phrase inside the
parentheses” (1997:118, translated from the French).
b) Our data gathering was approached on the basis of an
intuitive recognition of definitions. Our analysis took as its
starting point the genus-differentiae pair. For a while,
there seemed to be a mismatch between this “ideal”
definition and the actual data, which led us to describe two
reduced patterns, genus-less and differentiae-less
respectively. Closer examination however shows that these
variations are only apparent: the genus can be external to
the definition in list structures; as for differentiae-less
definitions, there is only one somewhat dubious case. On
the basis of this study therefore, we can assert that
variations affect the form but not the fundamental
structure of definitions.
c) Typographical enhancement of the term to be defined,
though not absolutely systematic, could give added weight
to a configuration. The possibility of giving weightings to
the different markers, and defining minimal configurations
for definition-hood, needs further investigation.
d) The constraint concerning the position at the start of a
paragraph appears to be essential if the expression is to
convey a generic relation, as a definition must. The
patterns described must be seen as the initial boundary of
a definitional text object. The integration of such text
objects in the overall structure of texts, and the
distribution of the different patterns in relation to the
organisational hierarchy of texts have been the focus of
previous studies (Pascual & Péry-Woodley, 1997a).
5. CONCLUSION
The comparative study of structures signalling definitions
in three sublanguage texts differing in domain and genre
enables us to determine what varies and what remains
constant in these expressions.
Domain influences the wording of the differentiae in
relation with the semantic nature of the information
(description of the physical aspect of objects in
geomorphology vs. their intended use in the software
domain).
Genre appears to have an impact on the structural
properties of definitions: the hypernym is systematically
expressed within the definition in the geomorphology
textbook, consistent with the importance in this genre of
linking a new term to a class. The differentiae is
sometimes minimally expressed in this text, whereas it
tends to constitute the most developed part of definitions
in the software manual and the handbook, being the
expression of the function, and of direct relevance to the
actions towards which these instructional texts are geared.
These variations do not preclude the possibility of
elaborating filters for the identification of definitions in
texts. Those should be based on the stable markers which
characterise the initial boundary of the definition:
– layout and typographical markers
– the lexico-syntactic configuration of the expression of
the genus (including noun phrase determiners, verb tense
and class).
We therefore go some way in this study towards
distinguishing which clues can be expected to belong to
general language and which vary according to genre and
domain.
Taking into account layout and typographical features as
an integral part of the signalling of definitions is a first
stage in the study of the interaction of visual and
discursive formatting as suggested by the model of
representation of text architecture. We show that this
theoretically-motivated broadening of the notion of marker
leads to the formulation of finer models for specific text
objects.
6. REFERENCES
Ahmad, K. (1993). Terminology and knowledge
acquisition: a text-based approach. In Proceedings of
TKE’93: Terminology and Knowledge Engineering (pp.
56--70). INDEKS-VERLAG.
Borillo, A. (1997). Exploration automatisée de textes de
spécialité : repérage et identification automatique de la
relation lexicale d’hyperonymie. LINX, 34-35, 113--121.
Desclès, J.-P., Jouis, C. (1993). L’exploration
contextuelle: une méthode linguistique et informatique
pour l’analyse automatique de textes. In Actes du colloque
Informatique et Langue Naturelle (pp 339--351). Nantes.
Harris, Z.S. (1968). Mathematical Structures of
Language. New York: Wiley & Sons.
Harris, Z.S. (1982). A grammar of English on
Mathematical Principles. New York: Wiley-Interscience.
Hearst, M.A. (1992). Automatic acquisition of
hyponyms from large text corpora. In Proceedings of
COLING’92 (pp. 539--545). Nantes.
Kavanagh, J. (1996). The Text Analyzer: A tool for
extracting knowledge from text. Master’s thesis.
University of Ottawa.
Pascual, E. (1991). Représentation de l’architecture
textuelle et génération de texte. Thèse de Doctorat en
Informatique. Université Paul Sabatier, Toulouse, France.
Pascual, E., Péry-Woodley, M.-P. (1997a). Modèles de
texte pour la définition. In Actes des Ières Journées
Scientifiques et Techniques du Réseau Francophone de
L’Ingénierie de la Langue de l'AUPELF-UREF (pp.
137--145). Paris: AUPELF-UREF.
Pascual, E., Péry-Woodley, M.-P. (1997b).
Modélisation des définitions dans les textes à consignes.
In J. Virbel, J.M. Cellier & J.L. Nespoulous (Eds.),
Cognition, Discours procédural, Action (pp. 37--53).
Toulouse: PRESCOT.
Virbel, J. (1985). Langage et méta-langage dans le texte
du point de vue de l’édition en informatique textuelle.
Cahiers de Grammaire, 10, 1--72.
Virbel, J. (1989). The contribution of linguistic
knowledge to the interpretation of text structures. In J.
André, V. Quint & R.K. Furuta (Eds), Structured
Documents (pp. 161--181). Cambridge: CUP.