DOMAIN AND GENRE IN SUBLANGUAGE TEXT: DEFINITIONAL MICROTEXTS IN THREE CORPORA

Marie-Paule PERY-WOODLEY, Josette REBEYROLLE
Equipe de Recherche en Syntaxe et Sémantique
Université Toulouse-le Mirail
5 allées Antonio Machado, 31058 Toulouse, France
tel. 33 (0)5 61 50 36 09, fax 33 (0)5 61 50 46 77
email: {pery,rebeyrol}@cict.fr

Abstract
In this paper we outline a shallow grammar of definitions as they occur in technical and scientific texts. We show how visual clues, typography and layout, interact with lexical and syntactic clues to signal a definitional text. Whilst starting from the hypothesis that domain and genre have an impact on the grammar of definitions, we also expect to find stable features, which would make the retrieval of definitions possible in new domains. We illustrate this approach with a comparative study of three texts which differ in terms of domain and genre, and we show how it is possible to identify some constraints on variations in the formulation of definitional texts.

Keywords: corpus linguistics, sublanguages, textual genres, definitions, knowledge extraction from texts

1. INTRODUCTION
Our objective is twofold. The first aim is to formulate a shallow grammar of a functional text object, the definition, in technical and scientific texts, as a first step towards automatic identification in textual databases. The second is to investigate the impact of genre and domain on the signalling of such structures. We draw on previous studies devoted to the identification of markers of semantic relations and other “knowledge probes” in discourse (Hearst, 1992; Borillo, 1997; Ahmad, 1993; Kavanagh, 1996; Desclès & Jouis, 1993). Our approach differs, however, in three ways. First, its focus is a text object rather than a relation, and the markers we are seeking to identify mostly signal the initial boundary of this object. Secondly, we propose a broader conception of marker, in terms of configurations of lexical, syntactic, typographical and layout features. Finally, rather than start from the principle that markers belong to “general language”, we hypothesize variation linked to genre and domain, and we aim to throw light on areas of stability and areas of variability in the signalling of definitions in different corpora.

This surface modelling of definitions and of their integration within text has relevance for knowledge extraction, in particular in the context of the constitution of terminological knowledge bases, and for text generation. It is also a test-bed for the automatic identification and mark-up of functional objects for more intelligent information retrieval.

We start by outlining the model of text architecture, which forms the theoretical foundation for our approach to signalling in text (section 2). Section 3 then presents the results of the study carried out on three corpora.

2. TEXT ARCHITECTURE AND THE SIGNALLING OF TEXT OBJECTS
Our analysis is based on a model of text structure which stresses the fact that written texts are visual objects, and that their visual properties are directly involved in the construction of meaning and exploited by readers to that end. In the model of the representation of text architecture, Virbel and his group (Virbel, 1985; Virbel, 1989; Pascual, 1991) propose an extended notion of text formatting, which brings out the relationship between text organisation signalled via visual means (typography and layout) and via discursive means (lexical and syntactic markers). This relationship is best presented through an example:

    Definitions                      A is __________________
    A: __________________            B can be defined as ______
    B: __________________            ______________________
    C: __________________            We call C ______________

Figure 1: Formatting-based vs. discursive formulation

The same effect can be said to be created in the text image on the left and on the right: three definitions are formulated, and meant to be recognised as definitions, in both cases. The formulations on the left are based mostly on layout, typography and enumerations, while those on the right, though not devoid of visual formatting, rely more on discursive means. The example is designed to show extreme cases, but in-between formulations are obviously possible. The resources available for signalling written text organisation thus appear as a continuum from wholly discursive to wholly visual. There seem to be no hard-and-fast conventions for layout and typographical enhancement, but rather a general principle of contrast.

It is on the basis of these initial observations that the model of text architecture was developed. Its main tenets can be summarised as follows:
– these formulations are perceived as equivalent because they are interpreted by the reader as performing the same “text act”, here defining. Such text acts succeed when they are recognised as such and when the text segments concerned (the arguments of the performative) are understood as definitions. These are metalinguistic performatives whose performativity is directed at the text itself.
– the textual metalanguage, exemplified by the fully discursive formulations, is part of the language, and therefore open to description in terms of operator-argument relations (after Harris, 1968, 1982). The operators are verbs such as organise, entitle, illustrate, conclude, define…; their arguments are text segments called text objects. A text object is therefore a segment corresponding to a specific metalinguistic formulation and signalled by formatting. The notion of formatting¹ covers lexico-syntactic, typographical, layout and punctuation markers.
– the formulation of constraints governing the combination of text objects amounts to a theory of text organisation.

Our study is grounded in this model in several ways:
– it focusses on a functional text object, definitions, and hypothesises that this text object is marked by identifiable formatting features;
– it is also concerned with the interaction of text objects, in particular the role of presentational objects such as lists within definitions;
– finally, the notion of formatting features encompasses visual aspects generally overlooked by research on markers.

3. THE STUDY

3.1 METHOD
The initial identification of definitions is carried out on the basis of our competence as readers. In the first instance, we simply respond to the formatting features of texts in an intuitive manner, our aim then being to identify the features which underlie this competence. An iterative process therefore starts, whereby successive formulations of the configurations of markers which signal definitions are tested on the data (via filters created using text-analysis software) and gradually refined. This process continues until we obtain a stable configuration of lexical, syntactic, typographical and layout features.

The comparative analysis presented in this paper was performed on data tagged for grammatical categories. Visual aspects, however, were not tagged, and had to be taken into account “manually”. We believe the method could be adapted for raw data with visual formatting indications, such as SGML- or HTML-formatted text. Our method also aims to identify variations in the patterns of markers so as to see whether they can be correlated with domain and/or genre, and whether a common pattern, applicable to new domains, may emerge.

3.2 CORPUS
Our objective of identifying areas of variation and areas of constancy in the formulation of definitions implies a comparative study of a diverse corpus. In this initial stage of our study, our corpus is made up of three texts:
1) a subpart (32,400 occurrences) of a textbook in geomorphology² (T1)
2) a manual for a text analysis software programme³ (66,300 occurrences) (T2)
3) a handbook for a software engineering project (54,400 occurrences) (T3).

In terms of the hypothesis we are testing here, the size of the corpus is of minor importance; its composition, however, is crucial. As well as in domain, these texts differ in terms of discourse function and intended audience: the first one is an expository text intended for non-expert specialists; the other two are instructional texts, addressing a general readership in the case of the software manual and domain experts in the case of the software engineering handbook. Clearly, the stable aspects of the “grammar of definitions” identified in this study will in the future need to be validated on larger corpora.

3.3 RESULTS
In general terms, a definition first states the genus of the term to be defined and then moves on to the expression of the differentiae. This general formulation of the textual organisation of definitions highlights the link between this study and the well-researched question of the expression of hyponymy, since the hyponymy relation provides the primary means of categorisation of a word, whether in dictionary definitions or in discourse.

Out of the diversity of expressions encountered in the corpus a general pattern clearly emerges: the term being defined is associated with one or two hypernyms (genus), and the differentiae are realised by a modifier which can be a relative clause, a present participle, or an adjectival or prepositional phrase. A first approximation of this pattern is as follows:

    Nc1 Nn Vi Nc2 Mod

where Nc is a classifier noun phrase, Nn the domain noun phrase being defined, Vi the copula or a verb belonging to a restricted class, and Mod the modifier. This pattern, called Full Pattern (FP), subsumes the different structures occurring in the corpus.
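To make the pattern concrete, the following Python sketch approximates it as a surface filter over the first sentence of a paragraph. It is an illustration only, not the filters used in the study, which were built with text-analysis software on data tagged for grammatical categories. The names `DEFINITION`, `Definition` and `match_definition` are hypothetical, the determiner and verb lists are plain string approximations of the constraints reported in the results (definite Nn, indefinite Nc2, present-tense forms of être and désigner), and typographical marking of Nn is not modelled.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Surface approximations (assumptions): no POS tagging, French input only.
DEF_DET = r"(?:(?:Les|Le|La)\s+|L')"      # definite determiner opening (Nc1) Nn
IND_DET = r"(?:une|un|des)"               # indefinite determiner opening Nc2
VI = r"(?:est|sont|désigne|désignent)"    # present tense of {être, désigner}

DEFINITION = re.compile(
    rf"(?P<subject>{DEF_DET}[^.]+?)\s+"   # (Nc1) Nn, as short as possible
    rf"(?P<vi>{VI})\s+"                   # Vi
    rf"(?P<nc2>{IND_DET}\s+\w+)"          # Nc2: determiner + head noun only
    rf"(?:\s+(?P<mod>[^.]+))?\."          # Mod: rest of the sentence, if any
)


@dataclass
class Definition:
    subject: str          # optional classifier (Nc1) plus the defined term (Nn)
    vi: str
    nc2: str
    mod: Optional[str]    # None when no differentiae follow the hypernym


def match_definition(paragraph: str) -> Optional[Definition]:
    """Return the pattern slots if the paragraph opens with a definition."""
    m = DEFINITION.match(paragraph.strip())   # match() anchors at paragraph start
    if m is None:
        return None
    return Definition(m["subject"], m["vi"], m["nc2"], m["mod"])


if __name__ == "__main__":
    print(match_definition(
        "Les fjords sont des auges creusées par un glacier de vallée."))
```

Without grammatical tags the sketch cannot tell an adjective inside Nc2 from a modifier, so everything after the head noun of Nc2 is assigned to Mod. It also cannot separate Nc1 from Nn, which is why both are returned together as `subject`.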
The different realisations of the Full Pattern are presented in Table 1, together with their distribution in the three texts making up the corpus. Table 1 distinguishes a Basic Pattern (BP), so called because it is complete in terms of the “genus-differentiae” structure and present in all three sub-corpora, from Reduced Patterns (RP), where one of the basic elements is not expressed.

                 Patterns                  Corpus
          Nc1  Nn   Vi   Nc2  Mod      T1   T2   T3
    FP     +    +    +    +    +            *
    BP     –    +    +    +    +       *    *    *
    RP1    –    +    +    –    +            *    *
    RP2    –    +    +    +    –       *

Table 1: Full pattern variations⁴

¹ The original term is "mise en forme matérielle".
² Derruau, M. (1988). Précis de géomorphologie, Masson, 7ème Edition.
³ Daoust, F. (1996). SATO (Système d'Analyse de Textes par Ordinateur) version 4.0, Manuel de référence. Centre ATO, Université du Québec à Montréal.
⁴ Table 1 represents patterns actually occurring in our corpus. If the objective were to generate all possible variations, a pattern "Nc1 Nn Vi Nc2" should be included.

Table 2 illustrates these patterns with examples drawn from the three texts of our corpus:

    Pattern | Text | Nc1 | Nn | Vi | Nc2 | Mod
    FP | T2 | § La commande | Distance | est | un analyseur lexicostatistique. | Elle permet de comparer statistiquement les lexiques de deux sous-textes quelconques d’un corpus
    BP | T3 | – | § Les diffuseurs | sont | des organisations partenaires | chargées de la diffusion commerciale du Produit Logiciel.
    RP1 | T2 | – | § CARACTERISER | permet de | – | préciser le fonctionnement du journal.
    RP2 | T1 | – | § Le palse | est donc | une forme de ségrégation. | –

Table 2: Examples of composition of definition pattern

Besides its illustrative function, Table 2 adds to the patterns elements which are stable throughout the variants and contribute significantly to the identification of definitions in discourse. These lexico-syntactic, typographical and layout features are integral parts of the configurations of markers that make up our patterns:
– these definitional patterns in our corpus always occur at the beginning of a paragraph (noted §);
– Nn is always a definite noun phrase;
– Nn, the term to be defined, is almost always typographically marked, whether by capitals, inverted commas, bold or italics;
– Vi belongs to a class which can be defined in extension (in our corpus: {être, désigner}), and is always in the present tense;
– Nc2, the hypernym, is always an indefinite noun phrase.

3.3.1 The composition of definition patterns
The examination of the genus has thrown light on a number of stable elements in the pattern, but also on several variations, which we now analyse further.

a) The Full Pattern accounts for the possibility of a twofold expression of the genus: Nn can be preceded by a classifier (Nc1). Thus in:

(1) La commande Distance est un analyseur lexicostatistique. Elle permet de comparer….

“Distance” is classified by Nc1 as a type of command, and by Nc2 as a type of analyser. It is however the Basic Pattern, the single-hypernym definition, which constitutes the most common form: the Full Pattern is the most complete, but not the most representative.

b) Reduced Pattern 1 (RP1) represents definitions which display all the formatting features described above (determiners, verb form and tense, typographical marking and layout) but where the genus is not expressed. There are numerous occurrences of this seemingly paradoxical pattern in two of our texts. Closer examination shows however that they are subject to strict formatting constraints which in fact eliminate the paradox: they only occur in list structures, which in themselves can express a hyponymic relation, as can be seen in this contextualised version of the RP1 example from T2 given in Table 2:

(2) Cinq actions s’appliquent à cet objet : AFFICHER, CARACTERISER (…).
– AFFICHER permet de ….
– CARACTERISER permet de préciser le fonctionnement du journal.
– …

The genus is indeed present (“action”), and the hyponymy relation is expressed by the list structure.

c) Finally, just as RP1 is genus-less, our second reduced pattern, RP2, is differentiae-less. The term “palse” is defined solely by a hyponymic relation, with no specification of the differences that distinguish it from its hypernym “ségrégation”. We must however emphasize that this is the only occurrence of an RP2, and that the presence of “donc” should be investigated further, as it may in fact invalidate the definition status of this example.

We have analysed in this section the variations in the structures in terms of presence or absence of certain elements. We now turn to the variations in the formulation of the differentiae.

3.3.2 Variations in the formulation of the differentiae
We have seen that the modifier expressing the differentiae can be realised by a relative clause, a present participle, or an adjectival or prepositional phrase. It is indeed possible to treat these diverse forms as transformations of relative clauses (reductions and permutations in Harrisian terms, cf. Pascual & Péry-Woodley, 1997b). Here, however, our purpose is to examine the surface variations in order to identify any correlation with genre and domain. Table 3 summarises through examples the main types of modifiers found in the three texts of the corpus.

    Text | Nc1 | Nn | Vi | Nc2 | Mod
    T1 | – | § Les fjords | sont | des auges | creusées par un glacier de vallée.
    T1 | – | § Les griffures | sont | des sillons | étroits.
    T2 | – | § Le filtre | est | un patron de fouille | qui permet de définir les entrées du dictionnaire que l’on veut sauvegarder.
    T2 | § La commande | Distance | est | un analyseur lexicostatistique | Elle permet de comparer statistiquement les lexiques de deux sous-textes quelconques d’un corpus
    T3 | – | § Le guide d’élaboration de la documentation de spécification | est | un guide méthodologique | pour la production des documents de spécification des logiciels scientifiques.

Table 3: Examples of variations of differentiae

As Table 3 suggests, T1, the geomorphology textbook, favours adjectives and past participles as modifiers for the expression of the differentiae. In the other two texts (T2, T3), both concerned with software, modifiers are essentially relative clauses, sometimes independent clauses linked via pronominalization, or prepositional phrases with “pour”.

The explanation for these specific distributions requires that one consider the semantic information linked to the domain to which these texts belong. On the one hand, definitions in the domain of geomorphology are descriptive, with the primary aim of specifying the properties of objects (adjectives), their location and how they were formed (past participles). On the other hand, definitions in the software domain are functional, quite regularly constructed with a relative clause containing verbs such as “permettre” or “servir à”, or with prepositional phrases introduced by “pour”.

Besides domain, genre has a clear role. Texts T2 and T3 are both instructional texts, in which definitions are the preferred way to formulate expressions: “command X is used to do Y” also says “in order to do Y, use command X”. It is therefore no surprise that definitions are never of the RP2 form, when the differentiae, the functional aspects, are so obviously central in this genre. Similarly, the dominance of the Basic Pattern in the geomorphology textbook is explained by the pedagogical nature of this text, which has as one of its primary aims to familiarise the reader with the terms of the domain.

3.4 SYNTHESIS
Starting from a definition organised around genus and differentiae, we have described a basic structure which reflects its syntactic realisation, and which leads, through the detail of the configuration of formatting markers, to an approximation of filters for the identification of definitions in texts. Table 4 summarises these stages:

    genus                                                      | differentiae
    (Nc1) Nn         | Vi                | Nc2                 | Mod
    § def_det “NP”   | {être, désigner}  | ind_det NP          | {adjective, past participle, relative clause, “pour” NP, “pour” clause}

Table 4: From structure to a configuration of markers
Key:
def_det: definite determiner
ind_det: indefinite determiner
“NP”: the inverted commas indicate typographical enhancement

Without claiming to have formulated an exhaustive grammar of definitions, we have identified some constraints which can play an essential role in their automatic identification:

4. DISCUSSION
a) The expression of the genus is crucial for there to be a definition. We have shown however that it can take diverse forms. We have mentioned lists, emphasising the fact that in those structures the genus is present in the introductory sentence. Lists can work as two-level definitions: each item of the list is a definition (as in example 2), with a common genus expressed in the introductory sentence; but the whole list may in itself be a definition, with the genus in the introduction and the differentiae in the list items. We intend to take further the investigation of the link between presentational text objects such as lists, enumerations, parentheses and footnotes, and functional text objects such as definitions. As there is no suggestion that any of the visual formatting features involved are dedicated to definitions, it is necessary to identify other markers or constraints which would produce definition-specific configurations. Along these lines, Borillo (1997) studies the conditions under which parentheses may signal a hyponymy relation: “un parenthésage peut jouer assez souvent le rôle d’une apposition si des conditions assez strictes sont données concernant la juxtaposition des syntagmes nominaux concernés et la non-détermination du syntagme à l’intérieur de la parenthèse” [a parenthesis can quite often play the role of an apposition if fairly strict conditions are met concerning the juxtaposition of the noun phrases concerned and the absence of a determiner on the phrase inside the parenthesis] (1997:118).

b) Our data gathering was approached on the basis of an intuitive recognition of definitions.
Our analysis took as its starting point the genus-differentiae pair. For a while, there seemed to be a mismatch between this “ideal” definition and the actual data, which led us to describe two reduced patterns, genus-less and differentiae-less respectively. Closer examination however shows that these variations are only apparent: the genus can be external to the definition in list structures; as for differentiae-less definitions, there is only one somewhat dubious case. On the basis of this study therefore, we can assert that variations affect the form but not the fundamental structure of definitions.

c) Typographical enhancement of the term to be defined, though not absolutely systematic, could give added weight to a configuration. The possibility of giving weightings to the different markers, and defining minimal configurations for definition-hood, needs further investigation.

d) The constraint concerning the position at the start of a paragraph appears to be essential if the expression is to convey a generic relation, as a definition must. The patterns described must be seen as the initial boundary of a definitional text object. The integration of such text objects in the overall structure of texts, and the distribution of the different patterns in relation to the organisational hierarchy of texts, have been the focus of previous studies (Pascual & Péry-Woodley, 1997a).

5. CONCLUSION
The comparative study of structures signalling definitions in three sublanguage texts differing in domain and genre enables us to determine what varies and what remains constant in these expressions. Domain influences the wording of the differentiae in relation to the semantic nature of the information (description of the physical aspect of objects in geomorphology vs. their intended use in the software domain). Genre appears to have an impact on the structural properties of definitions: the hypernym is systematically expressed within the definition in the geomorphology textbook, consistent with the importance in this genre of linking a new term to a class. The differentiae are sometimes minimally expressed in this text, whereas they tend to constitute the most developed part of definitions in the software manual and the handbook, being the expression of the function, and of direct relevance to the actions towards which these instructional texts are geared.

These variations do not preclude the possibility of elaborating filters for the identification of definitions in texts. Those should be based on the stable markers which characterise the initial boundary of the definition:
– layout and typographical markers;
– the lexico-syntactic configuration of the expression of the genus (including noun phrase determiners, verb tense and class).

We therefore go some way in this study towards distinguishing which clues can be expected to belong to general language and which vary according to genre and domain. Taking into account layout and typographical features as an integral part of the signalling of definitions is a first stage in the study of the interaction of visual and discursive formatting as suggested by the model of representation of text architecture. We show that this theoretically-motivated broadening of the notion of marker leads to the formulation of finer models for specific text objects.

6. REFERENCES
Ahmad, K. (1993). Terminology and knowledge acquisition: a text-based approach. In Proceedings of TKE’93: Terminology and Knowledge Engineering (pp. 56--70). INDEKS-VERLAG.
Borillo, A. (1997). Exploration automatisée de textes de spécialité : repérage et identification automatique de la relation lexicale d’hyperonymie. LINX, 34-35, 113--121.
Desclès, J.-P., Jouis, C. (1993). L’exploration contextuelle: une méthode linguistique et informatique pour l’analyse automatique de textes. In Actes du colloque Informatique et Langue Naturelle (pp. 339--351). Nantes.
Harris, Z.S. (1968). Mathematical Structures of Language. New York: Wiley & Sons.
Harris, Z.S. (1982). A Grammar of English on Mathematical Principles. New York: Wiley-Interscience.
Hearst, M.A. (1992). Automatic acquisition of hyponyms from large text corpora. In Proceedings of COLING’92 (pp. 539--545). Nantes.
Kavanagh, J. (1996). The Text Analyzer: A tool for extracting knowledge from text. Master’s thesis. University of Ottawa.
Pascual, E. (1991). Représentation de l’architecture textuelle et génération de texte. Thèse de Doctorat en Informatique. Université Paul Sabatier, Toulouse, France.
Pascual, E., Péry-Woodley, M.-P. (1997a). Modèles de texte pour la définition. In Actes des Ières Journées Scientifiques et Techniques du Réseau Francophone de L’Ingénierie de la Langue de l'AUPELF-UREF (pp. 137--145). Paris: AUPELF-UREF.
Pascual, E., Péry-Woodley, M.-P. (1997b). Modélisation des définitions dans les textes à consignes. In J. Virbel, J.M. Cellier & J.L. Nespoulous (Eds.), Cognition, Discours procédural, Action (pp. 37--53). Toulouse: PRESCOT.
Virbel, J. (1985). Langage et méta-langage dans le texte du point de vue de l’édition en informatique textuelle. Cahiers de Grammaire, 10, 1--72.
Virbel, J. (1989). The contribution of linguistic knowledge to the interpretation of text structures. In J. André, V. Quint & R.K. Furuta (Eds.), Structured Documents (pp. 161--181). Cambridge: CUP.