Data Warehousing: the New Knowledge
Management Architecture for Humanities
Research?
Dr Janet Delve,
Department of Information Systems and Computing Applications
University of Portsmouth
1-8 Burnaby Terrace
PO1 3AE
[email protected]
Telephone 02392 846669
Fax 02392 846402
Keywords
Data warehousing, humanities research, knowledge management
1. Introduction
Data Warehousing is now a well-established part of the business world. Companies like
Amazon, Wal*Mart and Opodo use data warehouses to house terabytes of data about
their products. These data warehouses can be subjected to complex, large-scale queries
that produce detailed results. The customer occasionally catches a glimpse of what is
going on behind the scenes with companies such as Opodo when searching for a plane
ticket - an online message displays the large number of databases the warehouse is
interrogating to locate the best ticket prices, all in a matter of seconds. It
was this desire for greater analysis of business and scientific data that brought about the
introduction of data warehouses - huge data repositories housing multifarious data
extracted from databases and additionally from other relevant sources. As a by-product of
this phenomenon, many improvements also occurred in data quality and integrity. The
outcome of all these developments is that there is now much greater flexibility when
modelling data for a data warehouse (DW) via dimensional modelling, a real contrast to
the strict rules imposed on relational database modelling via normalization. However, up
until very recently, DWs were restricted to modelling essentially numerical data –
examples being sales figures in the business arena and astronomical data in scientific
research.
Although some humanities research may be essentially numerical, much is not, for
example memoirs and trade directories. Recent innovations have opened up new
possibilities for largely textual DWs, making them truly accessible to a wide variety of
humanities research for the first time. Humanities research data is often difficult to
model due to the irregular and complex nature of the data involved, and it can be very
awkward to manipulate time shifts in a relational database. Fitting such data into a
normalized data model has proved exceedingly unnatural and cumbersome for many
researchers. This paper considers history and linguistics as two exemplars of humanities
research and investigates current difficulties using relational databases. The paper sets out
DWs in a business context, examines advances in DW modelling, and puts forward ways
these can be applied to historical and linguistic research. The concluding suggestion is
that data warehousing might provide a more suitable knowledge architecture for these
two fields in particular and humanities in general than the relational databases they are
currently employing.
2. Computing in Historical Research
Overview
Historians have been harnessing computing power in their research for at least the last
two decades. The Association for History and Computing[1] has held national and / or
international conferences since the mid eighties and has produced / inspired many
publications (for example Denley et al., 1989; Mawdsley et al., 1990) and journal
articles[2]. These and other scholarly works (for example Mawdsley and Munck, 1993)
display the wide range of ways in which historians have used computing, encompassing
database management systems (DBMSs), artificial intelligence and expert systems,
quantitative analysis, XML and SGML. In their dedicated book on historical databases,
Harvey and Press (1996, xi ) have observed that historical database creation has been
widespread in recent years and that the popularity of such database-centred research
would appear to derive from its applicability to a broad spectrum of historical study.
Hudson (2001, 240) also reported that 'the biggest recent growth areas of computer use in
new sorts of historical research have involved relational database software...'. Historical
databases have been created using different types of data from diverse countries over a
range of time periods. Various historical communities have used databases, from
medievalists to economic historians. Some databases have been modest and stand alone[3],
others part of a larger conglomerate[4]. Databases involving international collaboration on
this scale require vast resources (Harvey and Press 1996, xi). One issue that is essential
to good database creation is that of data modelling, which has been the subject of
contentious debate in historical circles over recent years. Before embarking on this, a
brief summary of the relational model is necessary as there are inherent perceptions
regarding relational modelling for a business community that contrast with those in
academia.
The Relational Model
The relational model is currently the business standard (Begg and Connolly 1999, vii)
and is based on relations (tables). Each relation represents a single 'entity' and contains
attributes defining only that entity and there is a formal system of relating these relations
via relationships. The relational model is based upon mathematical relations (Date 2003;
Elmasri and Navathe 2003) and is strictly controlled by normalization (Codd 1972 and
1974), which ensures that each relation contains a unique key that determines all the
other attributes in that relation. Unfortunately this is problematic, even in the business
world as '(t)he process of normalization generally leads to the creation of relations that do
not correspond to entities in the 'real world'. The fragmentation of a 'real world' entity
into many relations, with a physical representation that reflects this structure, is
inefficient, leading to many joins during query processing' (Begg and Connolly 1999,
732). These joins are very expensive in terms of computer processing. In addition to
processing issues, there are other matters to take into account.
[1] http://grid.let.rug.nl/ahc/
[2] The associated journal of the AHC is History and Computing, Edinburgh University Press
[3] For example, the Register of Music in London Newspapers, 1660-1800, as outlined in Harvey and Press, 40-47
[4] The current North Atlantic Population Project (NAPP) entails integrating international census data
In practice, integrity and enterprise constraints are not implemented as rigorously as
relational theory demands. Also, the relational model is 'semantically overloaded' (Begg
and Connolly 1999, 733) as it is unable to build semantics into the simple relation /
relationship model. New object-oriented data models are part of third generation DBMSs
and they do address the problem of semantics and 'real world' modelling, but they are not
clearly defined (Begg and Connolly 1999, 752) and have not, as yet, replaced the
relational model. As noted above by Hudson and Harvey and Press, the relational model
is the database model of choice for increasing numbers of historians.
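The cost of fragmentation described by Begg and Connolly can be illustrated with a minimal sketch. It assumes a hypothetical trade-directory entry split across three normalized tables; the table names, column names and the sample row are invented for illustration and are not taken from any of the projects cited. It uses Python's built-in sqlite3 module. Reassembling the single 'real world' entry needs two joins.

```python
import sqlite3

def rebuild_entry(conn, person_id):
    """Reassemble one source entry from its normalized fragments (two joins)."""
    return conn.execute(
        """
        SELECT p.name, o.title, a.street
        FROM person AS p
        JOIN occupation AS o ON o.person_id = p.person_id
        JOIN address AS a ON a.person_id = p.person_id
        WHERE p.person_id = ?
        """,
        (person_id,),
    ).fetchone()

# Hypothetical schema: one directory entry is spread over three relations.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE person (person_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE occupation (
        person_id INTEGER REFERENCES person(person_id), title TEXT);
    CREATE TABLE address (
        person_id INTEGER REFERENCES person(person_id), street TEXT);
    INSERT INTO person VALUES (1, 'A. R. Smith');
    INSERT INTO occupation VALUES (1, 'Baker');
    INSERT INTO address VALUES (1, 'High Street');
    """
)

entry = rebuild_entry(conn, 1)
print(entry)
```

Every further attribute that normalization moves into its own relation adds another join to this query, which is the processing cost referred to above.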
Data Modelling in Historical Databases
When reviewing relational modelling as used for historical research, Bradley (1994)
contrasted seemingly straightforward business data with often incomplete, irregular,
complex- or semi-structured data. He noted that the relational model worked well for
simply-structured business data, but that it could be tortuous to use for historical data.
Breure (1995) pointed out the advantages of inputting data into a model that matches it
closely, something that is very hard to achieve, as noted above by Begg and Connolly.
Burt (Delve) and James (1996) considered the relative freedom when using source-oriented data modelling as compared to relational modelling with its restrictions due to
normalization, and drew attention to the forthcoming possibilities of data warehouses
(DWs).
Source-Oriented Data Modelling
Dr Manfred Thaller[5] decided that historical data modelling was so important that he
developed a dedicated DBMS, κλειω, described as ‘a semantic network tempered by
hierarchical considerations’ (Thaller 1991, 155). His principal aim was to carry out
source-oriented data modelling, which entailed encapsulating an entire data source in a
model, without having to split data up into separate tables as occurs in
normalization. He felt that it was imperative to keep each entire source for posterity as it
had been all too common for historians to create datasets by picking out just the parts of a
source they were interested in at a particular time for a particular project. This could
prove very wasteful should other historians at a later date wish to analyse other parts of
that source that had not been included in the database. In terms of modelling, κλειω's
flexible nature gives a ‘rubber band data structures’ facility (Denley 1994, 37), thus
overcoming the problems of semantic modelling. The fluid nature of creating a database
with κλειω marks it out as an ‘organic’ DBMS. κλειω is a sophisticated DBMS that
handles image processing as well as geographical coordinates. There was much debate in
historical circles about which data modelling was superior, κλειω or relational (Denley
1994, 33-44). However, despite its versatility and power, κλειω never replaced relational
databases as the main DBMS for historians, who continue using relational modelling and
normalization to prepare data for RDBMSs, despite the inherent difficulties of so doing.
[5] Max Planck Institute of Historical Studies, Göttingen
Examples of Difficulties Modelling Historical Data
Burt and James (1996, 160-168) discussed four historical projects that found the
relational model very constraining, and outlined how using κλειω as the DBMS had
greatly improved their data modelling. The Educational Times project entailed the
creation and querying of a database of mathematical problems found in the mathematical
department of this 19th century journal. The format of the data was irregular, and the
appearance of some sections within it was erratic. The data appeared monthly and using
a relational model it was impossible to preserve the hierarchical nature of the data as the
process of normalization caused it to be split by subject and placed in 28 tables. A
similar problem was encountered during the modelling stage of the Winchester Census
Database where the removal of uninhabited properties into a separate table following
normalization would make it impossible to retain the hierarchical nature of the data, thus
greatly adding to the complexity of the data model. The Britmath database contained
data pertaining to British mathematicians and institutions acquired from a wide range of
disparate archives. The data was not uniform and it was not possible to predict in
advance a model that could accurately represent all the data for the database. For the
Winchester Probate Database, the problem is more profound. Previous researchers have
looked for evidence of personal status by searching through wills and inventories for
objects conferring such status. Crucially, these objects have been decided in advance,
and the data has been searched for a given list of objects. The desire when creating the
Winchester Probate Database was to encapsulate several entire 17th century wills and
inventories, and then search them for a variety of status indicators. In this way it was
possible to redefine queries as the project developed, instead of having a predetermined
set of queries before the project even began or the data was thoroughly known. All of
these projects found that it would be a problem to model extraneous detail in a
normalized data model. Another difficulty with historical material arises from the wealth
of different dating systems in use in a variety of countries over time.
Historical dating systems encompass a number of different systems, including the
western, Islamic, Revolutionary and Byzantine calendars. Historical data may refer to
‘the first Sunday after Michaelmas’, which would need calculations to be undertaken
before a date can be entered into a database. Even then, not all databases and
spreadsheets have been able to cope with dates other than those of the 20th century.
Similarly, for researchers in historical geography, it might be necessary to calculate dates
based on the local introduction of the Gregorian calendar. These difficulties can be time-consuming and arduous for researchers. They were addressed by the κλειω DBMS, but
as this is not a widespread system, researchers still have to contend with these issues in
the relational DBMS. To summarise, awkward and irregular data with abstruse dating
systems do not fit easily into a relational model that does not lend itself to hierarchical
data. Many of these concerns also occur in linguistics computing.
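The kind of calculation needed before such a date can be entered into a database can be sketched as follows. This is a minimal illustration, not the method of any system discussed here: it uses the standard integer Julian Day Number formulas to place Julian (Old Style) and Gregorian (New Style) dates on one day count, and then finds the first Sunday after Michaelmas (29 September) for a year reckoned in the Julian calendar. The function names are hypothetical, and the sketch assumes the source uses the Julian calendar and means the Sunday strictly after the feast.

```python
def julian_to_jdn(year, month, day):
    """Julian Day Number of a date given in the Julian (Old Style) calendar."""
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083

def gregorian_to_jdn(year, month, day):
    """Julian Day Number of a date given in the Gregorian (New Style) calendar."""
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    return (day + (153 * m + 2) // 5 + 365 * y
            + y // 4 - y // 100 + y // 400 - 32045)

def first_sunday_after_michaelmas(year):
    """Return (month, day), in the Julian calendar, of the first Sunday
    strictly after Michaelmas (29 September) of the given year."""
    michaelmas = julian_to_jdn(year, 9, 29)
    weekday = michaelmas % 7              # 0 = Monday ... 6 = Sunday
    offset = (6 - weekday) % 7 or 7       # 1 to 7 days ahead, never 0
    day = 29 + offset                     # September has 30 days
    return (9, day) if day <= 30 else (10, day - 30)

# The day after Julian 4 October 1582 was Gregorian 15 October 1582
# in the countries that adopted the reform at once.
print(gregorian_to_jdn(1582, 10, 15) - julian_to_jdn(1582, 10, 4))
print(first_sunday_after_michaelmas(1650))
```

Because both calendars map onto the same day count, a date recorded under a locally late adoption of the Gregorian calendar can be compared with dates from elsewhere by converting each with the function for the calendar actually in force.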
3. Linguistic Databases
Lawler and Aristar Dry (1998, 1) claimed that 'in the last decade computers have
dramatically altered the professional life of the ordinary working linguist, altering the
things we can do, the ways we can do them, and even the ways we can think about them',
and then went so far as to say that the technology 'is shaping the way we conceptualise
both linguistics and language'. Linguistic databases have a well-established place in
language research, with at least two dedicated conferences in the last decade: 1995 in
Groningen, with proceedings edited by Nerbonne (1998), and the IRCS Workshop on
Linguistic Databases at Philadelphia 2001 and their accompanying web-based
proceedings (http://www.ldc.upenn.edu/annotation/database/proceedings.html).
Nerbonne (1998) highlighted in his introduction that 'linguistics is a data-rich study' and
then proceeded to delineate the multi-faceted nature of this discipline. He outlined the
abundance of languages with their myriad words having many possible word forms, and
then alluded to the plethora of coding rules for phrases, words and sounds. In addition,
many factors are studied for their effect on linguistic variation, for example sex,
geography, social and educational status.
The sheer volume of data in linguistics has encouraged researchers to look to databases as
natural tools for storage and analysis of linguistic data and metadata. Nerbonne stated
'Databases have long been standard repositories in phonetics... and psycholinguistics ...
research, but they are finding increasing use not only in phonology, morphology, syntax,
historical linguistics and dialectology but also in areas of applied linguistics such as
lexicography and computer-assisted learning.' He proceeded to discuss the fact that data
integrity and consistency is of particular importance, and that deductive databases
(providing logical power) and object-oriented databases (providing flexibility) both have
their place in the field. He concluded his preliminary remarks by claiming that '(t)he
most important issues are overwhelmingly practical' rather than conceptual. However,
many conceptual issues appear in the IRCS 2001 proceedings, with linguistic data modelling
playing a central role.
Linguistic Data Modelling
The benefits of using the relational model together with the process of normalization are
put forward by Hayashi and Hatton (2001 IRCS conference proceedings) who stated that
'(b)ecause the repository of classes is normalized, the user can use the same class in a
number of visual models. This is very helpful when modelling integrated linguistic data
structures.' Brugman and Wittenburg (2001 IRCS conference proceedings)
acknowledged the central role played by data models and object oriented models right
from the start of their work on linguistic corpora, annotation formats and software tools.
They traced the evolution of various data models to date and concluded by saying:
The choice of what models to create has a profound influence on how a problem is attacked and how a
solution is shaped. There is a close relation between a model on one side and a set of problems and their
solution in the form of a software system on the other side. A different problem requires a different model.
Over the years, user requirements changed, partly because new technology made new things possible, and
partly because of the user's experience with our own tools. The models changed with the user requirements
and will continue to do so.
Bliss and Ritter (2001 IRCS conference proceedings) discussed the constraints imposed
on them when using 'the rigid coding structure of the database' developed to house
pronoun systems from 109 languages. They observed that coding introduced
interpretation of data and concluded that designing 'a typological database is not unlike
trying to fit a square object into a round hole. Linguistic data is highly variable, database
structures are highly rigid, and the two do not always "fit".' Brown (2001 IRCS
conference proceedings) outlined the fact that different database structures may reflect a
particular linguistic theory, and also mentioned the trade-off between quality and quantity
in terms of coverage.
Summary of data modelling problems in historical and linguistic research
The choice of data model has a profound effect on the problems that can be tackled and
the data that can be interrogated. For both historical and linguistic research, relational
data modelling and normalization often appear to impose data structures which do not fit
naturally with the data and which constrain subsequent analysis. Coping with
complicated dating systems can also be very problematic. Surprisingly, similar
difficulties have already arisen in the business community, and have been addressed by
data warehousing.
4. Data Warehousing in the business context
DWs came into being as a response to the problems caused by large, centralized
databases (Inmon 2002, 6-14). Users in each department would resent the time taken to
query the slow, unwieldy database and would write queries to siphon off a copy of the
portion of the central database pertinent to their needs. They would store this compact
'extract' database locally and benefit from the quicker response rate when running their
queries against it. The advantages to the department were obvious as they now had
control of their own database that they could customize, but the problems to the company
were legion when managers tried to get an overall picture of the state of the business. A
call would go out for reports from each department, which would respond by querying
their own extract database. However, since each of these had been created without
reference to any standard, an eclectic set of often-contradictory reports from the
departments would be received by head office. (There is also the problem of copies being
altered and then no longer reflecting the original they came from.) Inmon described this
as the cabinet effect (1991) and the 'spider-web' problem (2002). The need was thus
recognized for a single, integrated source of clean data to serve the analytical needs of a
company.
DW Queries
Behind the business analyst's desire to search the company data is the theory that there is
a single question (or a small suite of questions) that, if answered correctly, will provide
the company with a competitive edge over their rivals. More fundamentally, a DW can
provide answers to a completely different range of queries than those aimed at a
traditional database. Using an estate agency as a typical business, the type of question
their local databases should be able to answer might be 'How many three-bedroomed
properties are there in the Botley area up to the value of £100,000?' The type of overarching question a business analyst (and company managers) would be interested in
might be of the general form 'Which type of property sells for prices above the average
selling price for properties in the main cities of Great Britain and how does this correlate
to demographic data?’ (Begg and Connolly, 1999, 917). To trawl through each local
estate agency database and corresponding local county council database, then correlate
the results into a report would take a long time and a lot of resources. The DW was
created to answer this need, set in the wider context of data mining.
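The contrast between the two kinds of question can be sketched with a toy example. It assumes a single hypothetical table of property sales held in an in-memory SQLite database; the table, its columns and every row are invented for illustration. The first query is the sort a local estate agency database answers; the second approximates the analyst's question by finding property types whose average price is above the overall average. The correlation with demographic data mentioned in the quotation is left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical, invented sales data.
    CREATE TABLE property_sale (
        city TEXT, area TEXT, property_type TEXT,
        bedrooms INTEGER, price INTEGER
    );
    INSERT INTO property_sale VALUES
        ('City A', 'Botley',   'terraced', 3,  95000),
        ('City A', 'Botley',   'detached', 4, 180000),
        ('City A', 'Eastside', 'terraced', 3,  90000),
        ('City B', 'Harbour',  'detached', 3, 160000),
        ('City B', 'Harbour',  'flat',     2,  70000);
    """
)

# Operational question: a narrow lookup against current local data.
operational = conn.execute(
    "SELECT COUNT(*) FROM property_sale "
    "WHERE area = 'Botley' AND bedrooms = 3 AND price <= 100000"
).fetchone()[0]

# Analytical question: aggregate over all cities and compare with the whole.
analytical = conn.execute(
    """
    SELECT property_type, AVG(price) AS avg_price
    FROM property_sale
    GROUP BY property_type
    HAVING AVG(price) > (SELECT AVG(price) FROM property_sale)
    ORDER BY property_type
    """
).fetchall()

print(operational)
print(analytical)
```

In practice the second query would have to run over data gathered from every local database, which is why an integrated store is needed before it can be asked at all.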
Data Mining
Somewhat confusingly, data mining is the name given to the whole field comprising data
cleansing, data warehousing and data mining (Byte, 1995). Data cleansing is the process
of painstakingly checking the vast quantities (terabytes) of data for consistency and
coherence. Data warehousing is the actual storing of the data and data mining (DM) is
the procedure carried out to search for meaningful patterns. DM involves neural nets,
machine learning, cluster analysis and decision trees which are just some of the AI
methods used to try to model and classify the vast quantities of data typically stored in a
DW. Traditional database querying tools like SQL are also utilised. Having established
the place of DW within the field of DM, it is still not obvious how to precisely define
data warehousing.
The Philosophy of Data Warehousing
‘Data warehousing is an architecture, not a technology. There is the architecture, and
there is the underlying technology, and they are two very different things.
Unquestionably there is a relationship between data warehousing and database
technology, but they are most certainly not the same. Data warehousing requires the
support of many different kinds of technology’ (Inmon 2002, xv). A detailed survey of
the hardware needed to run a DW is beyond the scope of this paper, but a separate DW
server is advantageous to avoid draining the main database server. In addition, the
processing power of parallel-distributed data processors is needed to mine (query) the
data. There is a plethora of supporting technology from, for example, Oracle (Oracle Warehouse
Builder, OWB, outlined in Begg and Connolly), Prism Solutions, IBM, SAP and others.
Building a DW can be a complex task because it is difficult to find a vendor that provides
an ‘end-to-end’ set of tools. Thus a data warehouse is often built using multiple products
from different suppliers. Ensuring the coordination and integration of these products can
be a major challenge (Begg and Connolly 1999, 927). Data warehousing thus fits into the
wider field of data mining.
Basic Components of a DW
Inmon (2002, 31), the father of data warehousing (Krivda, 1995), defined a DW as being
subject-oriented, integrated, non-volatile and time-variant. Emphasis is placed on
choosing the right subjects to model as opposed to being constrained to model around
applications. As an example, an insurance company would have as subjects - customer,
policy, premium and claim whereas previously data would have been modelled around
the applications - car, health, life and accident.
Here it is necessary to clarify the fundamental relationship between databases and DWs.
DWs do not replace databases as such - they co-exist alongside them in a symbiotic
fashion. Databases are needed both to serve the clerical community who answer day-to-day queries such as 'what is A.R. Smith's current overdraft?' and also to 'feed' a DW. To
do this, snapshots of data are extracted from a database on a regular basis (daily, hourly
and in the case of some mobile phone companies almost real-time). The data is then
transformed (cleansed to ensure consistency) and loaded into a DW. In addition, a DW
can cope with diverse data sources, including external data in a variety of formats and
summarized data from a database. The myriad types of data of different provenance
create an exceedingly rich and varied integrated data source opening up possibilities not
available in databases. Thus all the data in a DW is integrated. Crucially, data in a
warehouse is not updated - it is only added to, thus making it non-volatile, which has a
profound effect on data modelling as the main function of normalization is to obviate
update anomalies. Finally, a DW has a time horizon (that is, it contains data over a period)
of five to ten years, whereas a database typically holds data that is current for two to three
months. Data in general has thus taken on a new significance in the whole DM process,
and meta data is of vital importance in the whole process.
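The extract, transform and load cycle just described can be sketched in a few lines. This is an illustrative outline only: the record layout, the cleansing rule and the sample values are assumptions made for the example, and a real DW would use dedicated tools for each step. The point it shows is that each snapshot is cleansed and then appended with its snapshot date, and that existing rows are never updated.

```python
from dataclasses import dataclass

@dataclass(frozen=True)          # frozen: a loaded row cannot be altered
class WarehouseRow:
    snapshot_date: str           # when the snapshot was taken (time-variant)
    source: str                  # database or external file it came from
    customer: str
    balance: float

def transform(raw):
    """Cleanse one raw record: tidy the name's spacing and case, coerce the amount."""
    return {
        "customer": " ".join(raw["customer"].split()).title(),
        "balance": float(raw["balance"]),
    }

def load_snapshot(warehouse, snapshot_date, source, raw_rows):
    """Append a cleansed snapshot; earlier rows are left untouched (non-volatile)."""
    for raw in raw_rows:
        warehouse.append(WarehouseRow(snapshot_date, source, **transform(raw)))
    return warehouse

# Two invented daily snapshots of the same account from a hypothetical source.
warehouse = []
load_snapshot(warehouse, "day 1", "accounts_db",
              [{"customer": "  a.r.  smith ", "balance": "-120.50"}])
load_snapshot(warehouse, "day 2", "accounts_db",
              [{"customer": "A.R. Smith", "balance": "-80"}])

for row in warehouse:
    print(row)
```

Both snapshots remain in the warehouse, so the history of the balance can be analysed, whereas the operational database would hold only the current figure.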
Meta Data
Meta data is extremely important in a DW (Inmon 2002, 25,26; 113, 171-2). It is used to
create a log of the extraction and loading of data into the warehouse, record how the data
has been mapped when carrying out data cleansing and transformations, locate the most
appropriate data source as part of query management and also to help end users to build
queries, and lastly manage all the data in the DW. With the sheer quantity of data and the
large number of indexes to ensure smooth querying in the DW, it is a matter of the
highest priority that meta data is created and stored efficiently. The fact that data is held
over such a long period of time adds to the urgency, especially given the fact that the data
stewards responsible for putting the data into the DW may have long since moved on. If
an accurate record of the data that captures all aspects of the meaning of the data does not
exist, then the consequences for those using the DW are potentially dire. The DW
contains a meta data manager to control this aspect of the DW. As with data cleansing,
the tools now on the market specifically geared towards meta data management may be
of interest to those in the linguistic and historical communities with their vast quantities
of data and meta data. Data modelling in a DW is of greater relevance, however.
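As a rough illustration of the record-keeping role of meta data, the sketch below keeps a log of loads together with the field mappings used in each. The class and field names are hypothetical and do not describe the meta data manager of any particular product; the sample entries borrow the idea of census transcriptions purely as an example.

```python
from dataclasses import dataclass, field

@dataclass
class LoadRecord:
    source: str                  # where the data was extracted from
    loaded_on: str               # when it was loaded
    steward: str                 # who was responsible for the load
    mappings: dict = field(default_factory=dict)  # source field -> DW field
    notes: str = ""              # meaning of codes, known gaps and so on

class MetaDataLog:
    """An append-only log of loads, searchable by warehouse field."""

    def __init__(self):
        self.records = []

    def record(self, entry):
        self.records.append(entry)

    def sources_for(self, warehouse_field):
        """List the sources that feed a given warehouse field."""
        return [r.source for r in self.records
                if warehouse_field in r.mappings.values()]

log = MetaDataLog()
log.record(LoadRecord("1841 census transcription", "load 1", "steward A",
                      {"occ": "occupation", "par": "parish"}))
log.record(LoadRecord("1861 census transcription", "load 2", "steward B",
                      {"profession": "occupation"},
                      notes="occupation recorded in full, not coded"))

print(log.sources_for("occupation"))
print(log.sources_for("parish"))
```

A log of this kind lets a later user discover where a field came from and how it was mapped even after the original data stewards have moved on.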
4. Data Modelling in a Data Warehouse
There is a fundamental split in the DW community as to whether to construct a DW from
scratch, or to build them via data marts.
Data Marts
A data mart is essentially a cut-down DW that is restricted to one department or one
business process. Whilst acknowledging the pivotal role they play, the industry is
divided about data marts. Inmon (2002, 142) recommended building the DW first, then
extracting the data from it to fill up several data marts. The DW modelling expert
Kimball (2002) advised the incremental building of several data marts that are then
carefully integrated into a DW. Whichever way is chosen, the data must first be
modelled via dimensional modelling.
Dimensional Modelling
Dimensional models need to be linked to the company's corporate ERD (Entity
Relationship Diagram) as the data is actually taken from this (and other) source(s).
Dimensional models are somewhat different from ERDs, the typical star model having a
central fact table surrounded by dimension tables.
Fact Tables
Kimball (2002, 16-18) defined a fact table as 'the primary table in a dimensional model
where the numerical performance measurements of the business are stored…Since
measurement data is overwhelmingly the largest part of any data mart, we avoid
duplicating it in multiple places around the enterprise.' Thus the fact table contains
dynamic numerical data such as sales quantities and sales and profit figures. It also
contains key data in order to link to the dimension tables.
Dimension Tables
Dimension tables contain the textual descriptors of the business process being modelled
and their depth and breadth define the analytical usefulness of the DW. As they contain
descriptive data, it is assumed they will not change at the same rapid rate as the numerical
data in the fact table that will certainly change every time the DW is refreshed.
Dimension tables can have 50-100 attributes and these are not usually normalized. The
data is often hierarchical in the tables and can be an accurate reflection of how data
actually appears in its raw state (Kimball 2002, 19-21). There is not the need to
normalize as data is not updated in the DW, although there are variations on the star
model such as the snowflake and starflake models which allow varying degrees of
normalization in some or all of their dimension tables. Coding is disparaged due to the
long-term view that definitions may be lost and that the dimension tables should contain
the fullest, most comprehensible descriptions possible (Kimball 2002, 21). The
restriction of data in the fact table to numerical data has been a hindrance to academic
computing. However, Kimball has recently developed 'factless' fact tables (Kimball
2002, 49) that can accommodate textual data in the central fact table, thus opening the
door to a much broader spectrum of possible DWs. The freedom to model data without
heed to the strictures of normalization is a very attractive feature of DWs and one which
may appeal to those trying to construct linguistic or literary databases (or indeed any
humanities database) with abstruse or awkward data.
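A star model for largely textual material might look like the following sketch. It is an assumption-laden illustration rather than a design taken from Kimball or from any project cited: the tables, columns and all rows are invented. In Kimball's usage a factless fact table holds only keys to its dimensions, each row recording that an event took place, so here the descriptive text sits in wide, unnormalized dimension tables and the central table simply records that a person is mentioned in a source at a date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables: descriptive, hierarchical, deliberately not normalized.
    CREATE TABLE dim_person (
        person_key INTEGER PRIMARY KEY,
        full_name TEXT, occupation TEXT, parish TEXT, county TEXT
    );
    CREATE TABLE dim_source (
        source_key INTEGER PRIMARY KEY,
        title TEXT, source_type TEXT
    );
    CREATE TABLE dim_date (
        date_key INTEGER PRIMARY KEY,
        as_written TEXT, year INTEGER, calendar TEXT
    );
    -- Factless fact table: keys only, one row per recorded mention.
    CREATE TABLE fact_mention (
        person_key INTEGER REFERENCES dim_person(person_key),
        source_key INTEGER REFERENCES dim_source(source_key),
        date_key   INTEGER REFERENCES dim_date(date_key)
    );
    -- Invented sample rows.
    INSERT INTO dim_person VALUES
        (1, 'Person One', 'mercer', 'Parish X', 'County Y'),
        (2, 'Person Two', 'widow',  'Parish Z', 'County Y');
    INSERT INTO dim_source VALUES
        (1, 'Will of Person One', 'will'),
        (2, 'Inventory of Person One', 'inventory');
    INSERT INTO dim_date VALUES
        (1, 'the first Sunday after Michaelmas', 1650, 'Julian'),
        (2, '1651', 1651, 'Julian');
    INSERT INTO fact_mention VALUES (1, 1, 1), (2, 1, 1), (1, 2, 2);
    """
)

# Analysis: how often does each occupation appear in each type of source?
counts = conn.execute(
    """
    SELECT p.occupation, s.source_type, COUNT(*)
    FROM fact_mention AS f
    JOIN dim_person AS p ON p.person_key = f.person_key
    JOIN dim_source AS s ON s.source_key = f.source_key
    GROUP BY p.occupation, s.source_type
    ORDER BY p.occupation, s.source_type
    """
).fetchall()

print(counts)
```

Note that the date dimension keeps the date as written in the source alongside a derived year and the calendar in use, so nothing from the original wording is lost when the data is queried.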
5. Applying the Data Warehouse Architecture to Historical and
Linguistic Research
One of the major advantages of data warehousing is the enormous flexibility in modelling
data. Normalization is no longer an automatic straitjacket and hierarchies can be
represented in dimension tables. The expansive time dimension (Kimball 2002, 39) is a
welcome by-product of this modelling freedom, allowing country-specific calendars,
synchronization across multiple time zones and the inclusion of multifarious time
periods. It is possible to add external data from diverse sources and summarised data
from the source database(s). The DW is built for analysis, which immediately makes it
attractive to humanities researchers. It is designed to continuously receive huge volumes
(terabytes) of data, but is sensitive enough to cope with the idiosyncrasies of geographic
location dimensions within GISs (Kimball, 2002, 227). Additionally a DW has advanced
indexing facilities that make it desirable for those controlling vast quantities of data.
With a DW it is theoretically possible to publish the ‘right data’ that has been collected
from a variety of sources and edited for quality and consistency. In a DW all data is
collated so a variety of different subsets can be analysed whenever required. It is
comparatively easy to extend a DW and add material from a new source. The data
cleansing techniques developed for data warehousing are of interest to researchers, as is
the tracking facility afforded by the meta data manager (Begg and Connolly 1999, 931-933).
In terms of using DWs 'off the shelf', some humanities research might fit into the
‘numerical fact’ topology, but some doesn’t. The 'factless fact table' has been used in
several American universities, but expertise in this area would not be a widespread as that
with numerical fact tables. The whole area of data cleansing may perhaps be daunting for
humanities researchers (as it is to those in industry). Ensuring vast quantities of data is
clean and consistent may be an unattainable goal for humanities researchers without
recourse to expensive data cleansing software. The DW technology is far from easy and
is based on having existing databases to extract from, hence double the work. It is
unlikely that researchers would be taking regular snapshots of their data, as occurs in
industry, but they could equate data sets taken at different periods of time to DW
snapshots (e.g. 1841 census, 1861 census). Whilst many DWs use familiar WYSIWYG
interfaces and can be queried with SQL-type commands, there is undeniably a huge amount to
learn in data warehousing. Nevertheless, there are many areas in linguistics where DWs
may prove attractive.
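The idea of treating data sets from different periods as DW snapshots can be sketched as follows. This is only an illustration under stated assumptions: the table fact_population, the functions load_snapshot and change_between, the parish names and every figure are invented for the example and do not come from any real census.

```python
import sqlite3


def load_snapshot(conn, snapshot_year, rows):
    """Append one data set to the warehouse, tagged with its snapshot year.

    rows is a list of (parish, population) pairs. All values used with this
    sketch are invented placeholders, not real census figures.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_population ("
        "snapshot_year INTEGER, parish TEXT, population INTEGER)"
    )
    conn.executemany(
        "INSERT INTO fact_population VALUES (?, ?, ?)",
        [(snapshot_year, parish, population) for parish, population in rows],
    )


def change_between(conn, earlier, later):
    """Compare two snapshots with an ordinary SQL query."""
    return conn.execute(
        """SELECT a.parish, b.population - a.population
           FROM fact_population AS a
           JOIN fact_population AS b ON a.parish = b.parish
           WHERE a.snapshot_year = ? AND b.snapshot_year = ?
           ORDER BY a.parish""",
        (earlier, later),
    ).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load_snapshot(conn, 1841, [("Parish A", 100), ("Parish B", 250)])
    load_snapshot(conn, 1861, [("Parish A", 130), ("Parish B", 240)])
    print(change_between(conn, 1841, 1861))
```

Earlier snapshots are never overwritten. Each new data set is simply appended with its own year, so any pair of periods can be compared after the event.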
DWs and Linguistics Research
Brugman and Wittenburg (2001 IRCS conference proceedings) highlighted the
importance of metadata for language resources since the launch of the EAGLES/ISLE
metadata initiative. The problems related by Bliss and Ritter (2001 IRCS conference
proceedings) concerning rigid relational data structures and pre-coding problems would
be alleviated by data warehousing. Brown (2001 IRCS conference proceedings) outlined
the dilemma arising from the alignment of database structures with particular linguistic
theories, and also the conflict of quality and quantity of data. With a DW there is room
for both vast quantities of data and a plethora of detail. No structure is forced onto the
data so several theoretical approaches can be investigated using the same DW. Dalli
(2001 IRCS conference proceedings) observed that many linguistic databases are
standalone with no hope of interoperability. His proffered solution to create an
interoperable database of Maltese linguistic data involved an RDBMS and XML. Using
DWs to store linguistic data should ensure interoperability. There is growing interest in
corpora databases, with the recent dedicated conference at Portsmouth, November 2003.
Teich, Hansen and Fankhauser drew attention to the multi-layered nature of corpora and
speculated as to how 'multi-layer corpora can be maintained, queried and analyzed in an
integrated fashion.' A DW would be able to cope with this complexity.
Nerbonne (1998) alluded to the 'importance of coordinating the overwhelming amount of
work being done and yet to be done.' Kretzschmar (2001 IRCS conference proceedings)
delineated 'the challenge of preservation and display for massive amounts of survey data.'
There appear to be many linguistics databases containing data from a range of locations or
countries, for example ALAP, the American Linguistic Atlas Project; ANAE, the Atlas
of North American English (part of the TELSUR Project); TDS, the Typological
Database System containing European data; and AMPER, the Multimedia Atlas of the
Romance Languages. Possible research ideas for the future may include a broadening of
horizons - instead of the ‘me and my database’ mentality with the emphasis on individual
projects, there may develop an ‘our integrated warehouse’ approach with the emphasis
on even larger scale, collaborative projects. These could compare different languages or
contain many different types of linguistic data for a particular language, allowing for
analysis on a new order of magnitude.
DWs and Historical Research
There are inklings of historical research involving data warehousing in Britain and
Canada. A DW of current census data is underway at the University of Guelph, Canada
and the Canadian Century Research Infrastructure aims to house census data from the last
100 years in data marts constructed using IBM software at several sites across the
country. At the University of Portsmouth, UK, a historical DW of American mining data
is under construction using Oracle Warehouse Builder. These projects give some idea of
the scale of project a DW can cope with, that is, really large country- or state-wide
problems. Following these examples, it would be possible to create a DW to analyse all
British censuses from 1841 to 1901 (approximately 10^8 bytes of data). Data from a variety
of sources over time, such as hearth tax, poor rates, trade directories, census, street
directories, wills and inventories, and GIS maps for a city, e.g. Winchester, could go into a city
DW. Such a project is under active consideration for Oslo, Norway. Similarly, a Voting
DW could contain voting data – poll book data and rate book data up to 1870 for the
whole country (but it must be noted that some data is missing). A Port DW could contain
all data from portbooks for all British ports together with yearly trade figures. Similarly a
Street directories DW would contain data from this rich source for the whole country for the
last 100 years. Lastly, a Taxation DW could afford an overview of taxation of different
types, areas or periods. 19th century British census data doesn’t fit into the typical DW
model as it doesn’t have the numerical facts to go into a fact table, but with the advent of
factless fact tables a DW could now be made to house this data. The fact that some
institutions have Oracle site licenses opens the way for humanities researchers with Oracle
databases to use Oracle Warehouse Builder as part of the suite of programs available to
them. These are practical project suggestions which would be impossible to construct
using relational databases, but which, if achieved, could grant new insights into our
history. Comparisons could be made between counties and cities and much broader
analysis would be possible than has previously been the case.
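The factless fact table mentioned above can be illustrated with a small sketch. It is a hypothetical star schema rather than a design from any of the projects described: the table and column names (dim_census, dim_place, dim_occupation, fact_enumeration), the helper functions and all rows are invented for the example. The fact table holds only keys, one row per enumerated person, so analysis proceeds by counting rows rather than summing numerical measures.

```python
import sqlite3

SCHEMA = """
CREATE TABLE dim_census (census_key INTEGER PRIMARY KEY, census_year INTEGER);
CREATE TABLE dim_place (place_key INTEGER PRIMARY KEY, parish TEXT, county TEXT);
CREATE TABLE dim_occupation (occupation_key INTEGER PRIMARY KEY, occupation TEXT);
-- Factless fact table: keys only, no numerical measures.
CREATE TABLE fact_enumeration (
    census_key     INTEGER REFERENCES dim_census(census_key),
    place_key      INTEGER REFERENCES dim_place(place_key),
    occupation_key INTEGER REFERENCES dim_occupation(occupation_key)
);
"""


def build_example_warehouse(conn):
    """Create the schema and load a handful of invented rows."""
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO dim_census VALUES (?, ?)",
                     [(1, 1841), (2, 1861)])
    conn.executemany("INSERT INTO dim_place VALUES (?, ?, ?)",
                     [(1, "Parish A", "County X"), (2, "Parish B", "County X")])
    conn.executemany("INSERT INTO dim_occupation VALUES (?, ?)",
                     [(1, "labourer"), (2, "servant")])
    # One row per enumerated person: (census, place, occupation).
    conn.executemany("INSERT INTO fact_enumeration VALUES (?, ?, ?)",
                     [(1, 1, 1), (1, 1, 2), (1, 2, 1),
                      (2, 1, 1), (2, 2, 2), (2, 2, 2)])


def persons_by_year_and_occupation(conn):
    """Count fact rows grouped by attributes from two dimensions."""
    return conn.execute(
        """SELECT c.census_year, o.occupation, COUNT(*)
           FROM fact_enumeration AS f
           JOIN dim_census AS c USING (census_key)
           JOIN dim_occupation AS o USING (occupation_key)
           GROUP BY c.census_year, o.occupation
           ORDER BY c.census_year, o.occupation"""
    ).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_example_warehouse(conn)
    for row in persons_by_year_and_occupation(conn):
        print(row)
```

The same counting query can be regrouped by parish or county simply by joining dim_place instead, which is the kind of comparison between counties and cities suggested above.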
6. Conclusions
The advances made in business data warehousing are directly applicable to many areas of
historical and linguistics research. DW dimensional modelling would allow historians
and linguists to model vast amounts of data on a countrywide basis (or larger),
incorporating data from existing databases and other external sources. Summary data
could also be included, and this would all lead to a DW containing more data than is
currently possible. The data would also be richer than in current databases
because normalization is no longer obligatory. Whole data sources could be
captured, and more post-hoc analysis would result. Dimension tables particularly lend
themselves to hierarchical modelling, so data would not need splitting into many tables,
which forces joins while querying. The time dimension particularly lends itself to
historical research where significant difficulties have been encountered in the past. These
suggestions for historical and linguistics research will undoubtedly resonate in other areas
of humanities research. As feedback becomes available from the humanities DW
projects outlined above, other researchers will be able to ascertain whether data
warehousing is the new knowledge management architecture for humanities research.
7. References
Begg, C. and Connolly, T. (1999) Database Systems. Addison-Wesley, Harlow.
Bliss and Ritter, IRCS (Institute for Research into Cognitive Science) Conference
Proceedings (2001) http://www.ldc.upenn.edu/annotation/database/proceedings.html
Bradley, J. (1994) Relational Database Design, History and Computing, 6.2: pp. 71-84.
Breure, L. (1995) Interactive data Entry, History and Computing, 7.1: pp. 30-49.
Brown, IRCS (Institute for Research into Cognitive Science) Conference Proceedings
(2001) http://www.ldc.upenn.edu/annotation/database/proceedings.html
Burt, J. and James, T. B. (1996) Source-Oriented Data Processing, The triumph of the
micro over the macro? History and Computing, 8.3: pp. 160-169.
Brugman and Wittenburg, IRCS (Institute for Research into Cognitive Science)
Conference Proceedings (2001)
http://www.ldc.upenn.edu/annotation/database/proceedings.html
Codd, E. F. (1972) Further normalization of the data base relational model. In Rustin,
R. (ed) Data Base Systems, Englewood Cliffs, NJ.
Codd, E. F. (1974) Recent Investigations in relational data base systems. Proceedings
IFIP Congress.
Dalli, IRCS (Institute for Research into Cognitive Science) Conference Proceedings
(2001) http://www.ldc.upenn.edu/annotation/database/proceedings.html
Date, C. (2003) An Introduction to Database Systems. Addison-Wesley, Reading MA.
Elmasri, R. and Navathe, S. (2003) Fundamentals of Database Systems. Addison-Wesley, Reading MA.
Harvey, C. and Press, J. (1996) Databases in Historical Research. Basingstoke.
Hayashi and Hatton, IRCS (Institute for Research into Cognitive Science) Conference
Proceedings (2001) http://www.ldc.upenn.edu/annotation/database/proceedings.html
Hudson, P. (2000) History by numbers. An introduction to quantitative approaches.
London.
Inmon, W.H. (1991) The Cabinet Effect, Database Programming and Design, pp. 70-71.
Inmon, W. H. (2002) Building the Data Warehouse. Wiley, New York.
IRCS (Institute for Research into Cognitive Science) Conference Proceedings (2001)
http://www.ldc.upenn.edu/annotation/database/proceedings.html
Kimball, R. and Ross, M. (2002) The Data Warehouse Toolkit. Wiley, New York.
Krivda, C.D. (1995) Data-Mining Dynamite, Byte, pp. 97-100.
Nerbonne, J. (ed) (1998) Linguistic Databases. CSLI Publications, Stanford, California.
Stackpole, B. (2001) Wash Me, Data Management, CIO Magazine, Feb. 15th.
Teich, Hansen and Fankhauser, IRCS (Institute for Research into Cognitive Science)
Conference Proceedings (2001)
http://www.ldc.upenn.edu/annotation/database/proceedings.html
Thaller, M. (1991) 'The Historical Workstation Project', Computers and the Humanities,
25, pp. 149-162