Download Slicing and Dicing a Linguistic Data Cube

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Big data wikipedia , lookup

Entity–attribute–value model wikipedia , lookup

Data Protection Act, 2012 wikipedia , lookup

Clusterpoint wikipedia , lookup

Data center wikipedia , lookup

Data model wikipedia , lookup

Forecasting wikipedia , lookup

Data analysis wikipedia , lookup

Information privacy law wikipedia , lookup

3D optical data storage wikipedia , lookup

Data vault modeling wikipedia , lookup

Business intelligence wikipedia , lookup

Database model wikipedia , lookup

Transcript
288
Chapter XVII
Slicing and Dicing a Linguistic
Data Cube
Jan H. Kroeze
University of Pretoria, South Africa
Theo J. D. Bothma
University of Pretoria, South Africa
Machdel C. Matthee
University of Pretoria, South Africa
Abstract
This chapter discusses the application of some data warehousing techniques on a data cube of linguistic
data. The results of various modules of clausal analysis can be stored in a three-dimensional data cube
in order to facilitate on-line analytical processing of data by means of three-dimensional arrays. Slicing
is such an analytical technique, which reveals various dimensions of data and their relationships to other
dimensions. By using this data warehousing facility the clause cube can be viewed or manipulated to
reveal, for example, phrases and clauses, syntactic structures, semantic role frames, or a two-dimensional
representation of a particular clause’s multi-dimensional analysis in table format. These functionalities
are illustrated by means of the Hebrew text of Genesis 1:1-2:3. The authors trust that this chapter will
contribute towards efficient storage and advanced processing of linguistic data.
INTRODUCTION
This chapter suggests a way in which data warehousing concepts may be used and adapted to store and
view complex sets of linguistic data. After explaining and illustrating the concept of a three-dimen-
Copyright © 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.
Slicing and Dicing a Linguistic Data Cube
sional data cube as a suitable data structure to capture multi-dimensional linguistic data, slicing and
dicing are discussed as a way in which various perspectives of this data can be revealed. Although these
functionalities are illustrated by means of the Hebrew text of Genesis 1:1-2:3, linguistic data of text in
any language could be explored and manipulated in a similar way.
BACKGROUND: USING A DATA CUBE TO INTEGRATE COMPLEX SETS OF
LINGUISTIC DATA
The clauses constituting a text can be analysed linguistically in various ways depending on the chosen
perspective of a specific researcher. These different analytical perspectives regarding a collection of
clauses can be integrated into a paper-based medium as a series of two-dimensional tables, where each
table represents one clause and its multi-dimensional analysis.
This concept can be explained with a simplified grammatical paradigm and a very small micro-text
consisting of only three sentences (e.g. Gen. 1:1a, 4c and 5a)1:
•
•
•
1).
Bre$it bara elohim et ha$amayim ve’et ha’arets (in the beginning God created the heaven and the
earth)
Vayavdel elohim ben ha’or uven haxo$ex (and God separated the light and the darkness)
Vayiqra elohim la’or yom (and God called the light day)2
An interlinear multi-dimensional analysis of this text can be done as a series of tables (see Table
The linguistic modules3 that are represented here were chosen only to illustrate the concept of an
integrated structure of linguistic data, as well as the manipulation thereof, and should not be regarded
as comprehensive. In analyses that are more detailed additional layers of analyses, such as morphology,
transliteration4 and pragmatics could be added.
Although such series of tables can be regarded as a databank, if it is electronically available, these
tables are not combined into a single coherent data structure and they do not allow for flexible analytical
operations. Knowing the advanced ad hoc query possibilities that are facilitated by database management
systems on highly structured data, the ability to perform similar operations on implicitly structured
linguistic data becomes attractive. Such queries would be facilitated if all the separate tables could be
combined into one complex data structure. This is an example of document processing that “needs
database processing for storing and manipulating data” (Kroenke, 2004, p. 464).
The obvious suggestion for solving this problem would be to use a relational database to capture
linguistic data, but there are some prohibiting factors. There are many differences among the structures
of clauses and the result will be a very sparse database (containing many empty fields) if one were to
create attributes for all possible syntactic and semantic fields. Even in the event that this could work, an
extra field will be needed to capture the word-order position for every phrase. Furthermore, relational
database management systems are restricted to two dimensions: “The table in an RDBMS can only ever
represent multi-dimensional data in two dimensions” (Connolly & Begg, 2005, p. 1209).
Closer inspection of the above-mentioned two-dimensional clause tables reveals that they actually
represent multi-dimensional data. The various rows of each table do not represent separate records (as
is typical of a two-dimensional relational database), but deeper modules of analysis, which are related
289
11 more pages are available in the full version of this document, which may
be purchased using the "Add to Cart" button on the publisher's webpage:
www.igi-global.com/chapter/slicing-dicing-linguistic-data-cube/21730
Related Content
Implications for Nursing Research and Generation of Evidence
Suzanne Bakken, Robert Lucero, Sunmoo Yoon and Nicholas Hardiker (2013). Data Mining: Concepts,
Methodologies, Tools, and Applications (pp. 1082-1096).
www.irma-international.org/chapter/implications-nursing-research-generation-evidence/73485/
Efficient Identification of Similar XML Fragments Based on Tree Edit Distance
Hongzhi Wang, Jianzhong Li and Fei Li (2012). XML Data Mining: Models, Methods, and Applications (pp.
78-97).
www.irma-international.org/chapter/efficient-identification-similar-xml-fragments/60905/
Review of Data Mining Techniques and Parameters for Recommendation of Effective Adaptive E
-Learning System
Renuka Mahajan (2017). Collaborative Filtering Using Data Mining and Analysis (pp. 1-23).
www.irma-international.org/chapter/review-of-data-mining-techniques-and-parameters-forrecommendation-of-effective-adaptive-e-learning-system/159492/
Automated Integration of Heterogeneous Data Warehouse Schemas
Marko Banek, Boris Vrdoljak, A Min Tjoa and Zoran Skocir (2008). International Journal of Data
Warehousing and Mining (pp. 1-21).
www.irma-international.org/article/automated-integration-heterogeneous-data-warehouse/1815/
Using TreeNet to Cross-sell Home Loans to Credit Card Holders
Dan Steinberg, Nicholas C. Cardell, John Ries and Mykhaylyo Golovnya (2008). International Journal of
Data Warehousing and Mining (pp. 32-45).
www.irma-international.org/article/using-treenet-cross-sell-home/1805/