Research
Databases and Tools for
Structured Data
An increasing number of research projects are now making use of some form of
structured data – that is, data which consists of sets of comparable information,
in which multiple items or objects share certain common features. Even if you
have no intention of producing a website or presenting your data publicly, there
are many situations in which databases or database-like tools may provide the
best way of organising this sort of information. This article provides some tips
on how to select the tools that are most appropriate for a given research project.
Contents
When a word processor is not enough
Spreadsheets
Relational databases
XML databases
  TEI XML
RDF data
Where to go for more information
When a word processor is not enough
Some researchers use a word processor for a considerable portion of their work,
particularly when writing books, papers, theses, and notes. Because of this
familiarity, it can be tempting to use a word processor for everything.
While this approach can save time which would otherwise need to be invested in
learning new software and methods, it can lead to people generating extremely
long documents, from which it is a struggle to retrieve useful information in any
meaningful manner. For structured data, one of the approaches described below
may be a better bet.
Spreadsheets
If the information you are working with consists of a number of discrete objects,
each of which shares essentially the same limited set of characteristics, a
spreadsheet may provide the ideal means for structuring your data. Historical
surveys, such as censuses, and bureaucratic records are often most clearly set
out in the form of a spreadsheet; a list of the bibliographic details of a set of
books might be another example, or a list of financial returns from a publishing
house.
Spreadsheet software is well adapted to sort and re-sort records alphabetically
or numerically, and is ideal if you wish to conduct numerical analysis – to
establish means and medians in a particular dataset, for instance, or visualise
information in the form of charts and graphs.
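The sorting and summary statistics described above can also be sketched outside a spreadsheet package. The following is a minimal illustration in Python, assuming a small table exported as CSV; the titles, years, and prices are invented for the example.

```python
import csv
import io
import statistics

# Hypothetical data: a tiny table of books with invented years and prices.
CSV_TEXT = """title,year,price
Book C,1850,12.5
Book A,1790,8.0
Book B,1820,10.0
"""

# Read each row into a dictionary keyed by the column headings.
rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

# Sort the records alphabetically by title, and numerically by year.
by_title = sorted(rows, key=lambda r: r["title"])
by_year = sorted(rows, key=lambda r: int(r["year"]))

# Simple numerical analysis: the mean and median of one column.
prices = [float(r["price"]) for r in rows]
mean_price = statistics.mean(prices)
median_price = statistics.median(prices)

print([r["title"] for r in by_title])
print(mean_price, median_price)
```

A spreadsheet does the same work through menus and formulas; the point is only that each record is a row and each shared characteristic is a column.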
Figure 1: Example of information in a spreadsheet
Useful for:
Ordering simple records
Numerical analysis
Generating charts and graphs
Disadvantages:
Not as good at handling complex relationships as relational databases
Popular software packages:
Microsoft Excel
OpenOffice.org Calc
Relational databases
Spreadsheets are fine for a lot of basic data organisation and analysis, but in
some cases a relational database offers significant advantages. If you’re working
with information or sources that have relationships with other objects, which in
turn have interesting properties or relationships, using a relational database is
probably a good idea.
When you construct a relational database, you can create separate tables (which
individually tend to look much like spreadsheets) and link fields within each
table to fields in other tables. So, for instance, you could create a table of
bibliographic details about books, including the names of the authors, and link
this to a separate table of authors, containing information about when they were
born and died, where they were educated, and so forth. If you wished, you could
link the information about where they were educated to another table, providing
information about the size and location of the school or university they
attended. Relational databases cater for one-to-many relationships, or even
many-to-many.
Relational databases can be designed to enable quite complex cross-searching,
for instance, listing all the books published by authors who attended a particular
university during a given period. Searches of databases are called queries, and
are written in a query language such as SQL (Structured Query Language) –
though many database software packages include a query-building function
which will automatically convert your instructions to SQL for you. Learning to
construct queries is not difficult, but there are various clever tricks that one can
pick up to perform complex but efficient searches.
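As a rough sketch of the linked tables and cross-searching described above, the following uses Python's built-in sqlite3 module. The table layout, column names, and all the data are hypothetical, and for simplicity each author is assumed to have attended only one university.

```python
import sqlite3

# An in-memory database; every name and value below is invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE universities (id INTEGER PRIMARY KEY, name TEXT, location TEXT);
CREATE TABLE authors (
    id INTEGER PRIMARY KEY,
    name TEXT,
    university_id INTEGER REFERENCES universities(id),
    attended_from INTEGER,
    attended_to INTEGER
);
CREATE TABLE books (
    id INTEGER PRIMARY KEY,
    title TEXT,
    author_id INTEGER REFERENCES authors(id)
);
INSERT INTO universities VALUES
    (1, 'University X', 'Town X'),
    (2, 'University Y', 'Town Y');
INSERT INTO authors VALUES
    (1, 'Author A', 1, 1818, 1821),
    (2, 'Author B', 2, 1828, 1831),
    (3, 'Author C', 1, 1840, 1843);
INSERT INTO books VALUES
    (1, 'First Book', 1),
    (2, 'Second Book', 1),
    (3, 'Third Book', 2),
    (4, 'Fourth Book', 3);
""")

# All books by authors who attended a given university at some point
# during a given period (the attendance years overlap the period).
query = """
SELECT books.title
FROM books
JOIN authors ON books.author_id = authors.id
JOIN universities ON authors.university_id = universities.id
WHERE universities.name = ?
  AND authors.attended_from <= ?
  AND authors.attended_to >= ?
ORDER BY books.title
"""
period_start, period_end = 1815, 1825
titles = [row[0] for row in
          conn.execute(query, ("University X", period_end, period_start))]
print(titles)
```

The link from books to authors is one-to-many (one author, several books). A many-to-many link, such as co-authored books, would normally need an additional linking table.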
Figure 2: Example of a relational database structure
Useful for:
Situations where you are not sure in advance how you (or others) will want to query your data, and wish to keep your options open and flexible
Spotting unexpected relationships between things
Hosting information on the Web and allowing others to search it
Disadvantages:
Efficiently structuring large databases can be a challenge
Popular software packages:
Microsoft Access
FileMaker Pro
MySQL (particularly for databases hosted on the Web)
PostgreSQL (particularly for databases hosted on the Web)
XML databases
Spreadsheets and relational databases can be very useful if you are working with
essentially consistent data – where there are a limited number of shared
characteristics common to each record in a given table. If the information you
wish to analyse is difficult to characterise in such a way, however, you may wish
to take a different approach.
XML (eXtensible Markup Language) is a standard for tagging information in
order to render it machine-readable. In research it is often used to assist textual
analysis, as it can be used to indicate particular characteristics that apply to
particular sections in a text.
For instance, you may have a number of texts which cover very different
subjects, but you want to find all instances where a particular individual or
event is mentioned. You could surround each personal name with tags
indicating that the part of the text in question is a personal name –
<name>Christopher Columbus</name>, for example. You could then search
your texts for all the people named in them, or index each occurrence of a
specific name. You could also use XML to create a standardised version of a
name that occurs in many different variants, in order to render it searchable but
without having to alter the original spellings in the document itself.
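A minimal sketch of the name-tagging idea, using Python's standard xml.etree.ElementTree module. The sample sentences are invented, and the key attribute used here to hold a standardised form of each name is an assumption for illustration, not something prescribed by the text above.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented sample text. The 'key' attribute is a hypothetical way of
# recording a standardised form without altering the original spelling.
TEXT = """<text>
  <p><name key="Christopher Columbus">Christopher Columbus</name> sailed west.</p>
  <p>Later, <name key="Christopher Columbus">Cristoforo Colombo</name> returned,
  and <name key="Isabella I">Isabella</name> received him.</p>
</text>"""

root = ET.fromstring(TEXT)

# The spellings exactly as they appear in the document.
original_spellings = [el.text for el in root.iter("name")]

# An index of occurrences, grouped by the standardised form.
occurrences = Counter(el.get("key") for el in root.iter("name"))

print(original_spellings)
print(occurrences)
```

Both variant spellings are counted under the same standardised name, while the text inside the tags is left untouched.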
Other tags can be used to indicate how a text should be displayed. Enclosing a
piece of text between two emphasis tags, for instance, will indicate to a Web
browser or some other XML reader that it should be displayed as bold, or italic.
The precise interpretation of how an XML file should be displayed can be
customised – the important thing is that XML separates content from its
representation, ensuring that the document does not become unreadable just
because the technology used to display it has changed.
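To illustrate the separation of content from representation, the sketch below renders one tagged sentence in three different ways without changing the source. The emph tag name and the render function are hypothetical, and the function only handles tags one level deep.

```python
import xml.etree.ElementTree as ET

# One source sentence; 'emph' is a hypothetical emphasis tag.
SOURCE = "<p>This word is <emph>important</emph> here.</p>"


def render(xml_text, open_mark, close_mark):
    """Flatten a <p> element, wrapping <emph> content in the given markers."""
    root = ET.fromstring(xml_text)
    parts = [root.text or ""]
    for child in root:
        if child.tag == "emph":
            parts.append(open_mark + (child.text or "") + close_mark)
        else:
            parts.append(child.text or "")
        # Text that follows the child element, up to the next tag.
        parts.append(child.tail or "")
    return "".join(parts)


# The same content, three different presentations.
as_italic = render(SOURCE, "<i>", "</i>")
as_bold = render(SOURCE, "<b>", "</b>")
as_plain = render(SOURCE, "*", "*")

print(as_italic)
print(as_bold)
print(as_plain)
```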
As is the case when working with relational databases, it is possible to create
quite complicated queries when working with XML-tagged documents. XQuery
is one popular language for searching XML databases. As with SQL and its
equivalents, it is fairly straightforward to learn how to return results from
simple searches, but complex queries can also be constructed with a little more
knowledge and experience.
XML is not only used to indicate textual content, but is also widely used in
linguistics to indicate parts of speech or features of spoken language. It is also
popular amongst those working with manuscripts or multiple editions, to
indicate variations, alternative translations, and so forth.
TEI XML
Text Encoding Initiative (TEI) XML is a schema established to aid consistency
and interoperability between digital humanities projects. Essentially, the TEI
has defined a number of labels (about 500) for use when tagging texts, so that
people do not end up having to create their own definitions every time they want
to make a text machine-readable. The TEI guidelines are available from
http://www.tei-c.org/Guidelines/.
The University of Oxford is a centre of expertise in TEI XML. OUCS (Oxford
University Computing Services) runs an annual summer school, and members of
the University can email
[email protected] at any time for free advice.
Figure 3 (below) shows part of the play The Raigne of King Edvvard the Third
marked up into XML. Some of the tags instruct the Web browser how to display
the text, whereas others are ‘invisible’, but can assist searching and analysis of
the text. For instance, homographs are indicated in the XML, but are not flagged
up in the text displayed in the browser. One can see here that the browser has
not been instructed to recognise all of the rendering information in the XML
original, as it is not displaying the names of the speakers or the stage directions
in italic.
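As a rough sketch of how such mark-up can assist analysis, the following reads a small TEI-style fragment with Python's standard library. The fragment is a made-up placeholder rather than the play's actual text; it assumes the TEI elements sp (speech), speaker, stage (stage direction), and l (verse line), and the TEI namespace shown.

```python
import xml.etree.ElementTree as ET

# Namespace used by TEI documents; 'tei' is just a local prefix.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Placeholder fragment: the speakers and lines are invented.
FRAGMENT = """<div xmlns="http://www.tei-c.org/ns/1.0">
  <stage>Enter two placeholder characters.</stage>
  <sp>
    <speaker>First Speaker</speaker>
    <l>A placeholder line of verse.</l>
  </sp>
  <sp>
    <speaker>Second Speaker</speaker>
    <l>Another placeholder line.</l>
    <l>And one more.</l>
  </sp>
</div>"""

root = ET.fromstring(FRAGMENT)

# Collect the stage directions, wherever they occur.
stage_directions = [el.text for el in root.findall(".//tei:stage", TEI_NS)]

# Count the verse lines in each speech, keyed by speaker.
lines_per_speaker = {
    sp.find("tei:speaker", TEI_NS).text: len(sp.findall("tei:l", TEI_NS))
    for sp in root.findall("tei:sp", TEI_NS)
}

print(stage_directions)
print(lines_per_speaker)
```

Because speakers and stage directions are tagged rather than merely styled, they can be counted or extracted whether or not a browser displays them in italic.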
Figure 3: Example of a text with TEI XML mark-up, rendered into simple HTML
Useful for:
Working with texts
Providing access to textual databases via the Web
Textual and linguistic analysis
Disadvantages:
Tagging documents is time-consuming
You need to ensure that you tag elements consistently
Popular software packages:
Oxygen XML Editor is useful for editing and tagging XML documents and checking that they meet TEI standards
eXist is a free, open source, native XML database management system
RDF data
Although not yet as widespread as other means of structuring data, the use of
RDF (Resource Description Framework) metadata (data that describes other
data) is gaining prevalence as a means of linking together data from disparate
sources.
RDF represents relationships between things in the form of subject-predicate-object
expressions. Any given subject (a particular book, for instance) may have
a particular relationship (such as being published by) with a particular object (a
given publisher). The book will have other relationships and properties as well,
such as being published in (a relationship/property) a particular year (object);
or being published as (relationship/property) a paperback (object). In RDF
terms, such subject-predicate-object expressions are called ‘triples’, and a
database containing them is called a ‘triplestore’.
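The idea of triples can be sketched in plain Python, with a set of three-part tuples standing in for a triplestore. The identifiers, predicate names, and the match function are all hypothetical; real RDF uses URIs and dedicated software.

```python
# Each triple is (subject, predicate, object). All names are invented.
triples = {
    ("book:1", "publishedBy", "publisher:A"),
    ("book:1", "publishedIn", "1851"),
    ("book:1", "publishedAs", "paperback"),
    ("book:2", "publishedBy", "publisher:A"),
}


def match(store, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Everything known about one subject.
about_book_1 = match(triples, s="book:1")

# Every subject with a given relationship to a given object.
from_publisher_a = [t[0] for t in match(triples, p="publishedBy", o="publisher:A")]

print(about_book_1)
print(from_publisher_a)
```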
RDF data is used especially to describe the relationships between resources on
the Web in a machine-readable manner, and as such is a key component in what
is known as the ‘Semantic Web’. The idea behind the Semantic Web is
essentially to evolve the Web from a linked document store to a database of
interlinked information. This may not be the easiest concept to envisage, but it
basically means enriching data by enabling data from different sources to be
searched together.
RDF data is often written using XML tags (a format known as RDF/XML) to
describe the relationship being expressed, although other notations exist. Some
sort of standard ontology will need to be chosen to ensure a degree of
consistency between descriptions.
The predominant query language for RDF data is SPARQL, which as its name
suggests has certain similarities to the SQL-type languages used to query
relational databases.
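A SPARQL query typically matches several triple patterns at once, linked by shared variables. The plain-Python sketch below imitates that idea rather than running SPARQL itself, and shows why searching data from different sources together can be useful. The two sources, their identifiers, and the function name are hypothetical.

```python
# Two hypothetical sources that share identifiers for publishers.
source_a = {
    ("book:1", "publishedBy", "publisher:A"),
    ("book:2", "publishedBy", "publisher:B"),
}
source_b = {
    ("publisher:A", "locatedIn", "City X"),
    ("publisher:B", "locatedIn", "City Y"),
}

# Because both sources use the same triple form, merging is a set union.
merged = source_a | source_b


def books_published_in(store, city):
    """Join two patterns: (?pub locatedIn city) and (?book publishedBy ?pub)."""
    publishers = {s for (s, p, o) in store if p == "locatedIn" and o == city}
    return sorted(s for (s, p, o) in store
                  if p == "publishedBy" and o in publishers)


# Neither source can answer the question alone; the merged data can.
alone = books_published_in(source_a, "City X")
together = books_published_in(merged, "City X")

print(alone)
print(together)
```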
Useful for:
Integrating existing data from disparate sources
Network analysis
Disadvantages:
Can be tricky to conceptualise at first
Coding RDF relationships by hand would be time-consuming. It is therefore often generated automatically from SQL or various XML formats.
Most triplestore software is at present aimed more at developers than ‘ordinary’ users – you will almost certainly need technical help
Popular software packages:
Jena
Sesame
Where to go for more information
If you wish to produce anything more than a very simple relational database, it
would be wise first to learn a little about the principles of structuring data and
the capabilities of the software you are considering using. Most good bookshops
will have a selection of introductory guides to database design and XML. The IT
Learning Programme at OUCS offers a number of courses that may be of
interest: see http://www.oucs.ox.ac.uk/itlp/courses/ for more information.
Alternatively, if you have an idea for a research database and want to talk it
through with a technical expert, speak to a member of the Infodev team at
OUCS: see http://www.oucs.ox.ac.uk/infodev/ for more information. Infodev
can also help with other aspects of research support, including data
manipulation and website building.