Combining Data Integration and Information
Extraction Techniques
Dean Williams
[email protected]
School of Computer Science and Information Systems, Birkbeck College,
University of London
Abstract
We describe a class of applications which are built using databases comprising some
structured data and some free text. Conventional database management systems have proved ineffective for these applications, which are also rarely amenable to current text and data mining techniques. We argue that combining Information Extraction and Data
Integration techniques is a promising direction for research and we outline how our
ESTEST system demonstrates this approach.
1. Introduction
A class of applications exists which can be characterised by the way in which they combine data conforming to a schema with some related free text. We describe this application class in Section 2. Our approach is to combine Data Integration (DI) and Information Extraction (IE) techniques to better exploit the text data; Section 3 summarises related areas of research and shows how our method relates to them. Section 4 details why we believe text is used in these applications and, as a result, why we believe combining DI and IE techniques will benefit them. Details of our system, Experimental Software to Extract Structure from Text (ESTEST), are given in Section 5, which shows how we plan to realise our goals. Finally, we give our conclusions and plans for future work in Section 6.
2. Partially Structured Data
In [1] King and Poulovassilis define a distinct category of data - partially structured
data (PSD). Many database applications rely on storing significant amounts of data in
the form of free text. Recent developments in database technology have improved the
facilities available for storing large amounts of text. However, the provision for making use of this text data largely relies on searching the text for keywords.
A class of applications exists where the information to be stored consists partly of some structured data conforming to a schema, with the remainder left as free text. We consider this data to be partially structured. This idea of PSD is distinct from semi-structured data, which is generally taken to mean data that is self-describing: in semi-structured data there may not be a schema defined, but the data itself contains some structural information, e.g. XML tags.
An example of an application based on the use of PSD is operational intelligence
gathering, which is used in serious crime investigations. The data collected in this application area takes the form of a report containing some structured data, such as the name of the police officer making the report and the time and location of the incident, as well as details of the subjects and locations involved. This is combined with the actual account of the sighting or information received, which is captured as text. A number of other text-based applications exist in crime investigation, e.g. witness statements and scene-of-crime reports.
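To make the distinction concrete, a minimal sketch in Python of such a partially structured record is given below; the field names and values are hypothetical, invented for illustration rather than taken from any real intelligence system.

from dataclasses import dataclass

@dataclass
class IntelligenceReport:
    """A hypothetical partially structured record: fixed attributes plus a free-text account."""
    officer_name: str        # structured: conforms to the schema
    incident_time: str       # structured
    incident_location: str   # structured
    report_text: str         # unstructured: the free-text account

report = IntelligenceReport(
    officer_name="PC Smith",
    incident_time="2004-03-12 22:15",
    incident_location="Camden High Street",
    report_text="Informant states a white van was parked outside the premises; "
                "the driver matched the description of a known associate.",
)

# Only the structured fields can be queried directly; the entities mentioned in
# report_text (the van, the driver, the associate) are invisible to a conventional query.
print(report.officer_name, report.incident_location)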
Other application domains we are familiar with which have partially structured data include Road Traffic Accident reports, where standard-format statistics are combined with free-text accounts written in a formalised subset of English. In Bioinformatics, structured databases such as SWISS-PROT [2] include comment fields that contain related unstructured information.
A common theme of many of these applications, including crime and SWISS-PROT, is the requirement for expert users to annotate the text, trying to use standard terms to assist with queries, reduce duplication and highlight important facts. This is often a time-consuming, demanding task whose results are less effective than desired, and applications to assist with this work are being developed both as academic research projects, e.g. [3], and as commercial software, e.g. [4].
3. Related Areas
A number of active areas of research deal with text in databases, and we use the following definitions to establish how our approach relates to them.
Data Integration Providing a single schema over a collection of data sources that facilitates queries across the sources [5].
Information Extraction Finding pre-defined entities in text and using the extracted data to fill slots in a template using shallow NLP techniques [6].
Data Mining / Knowledge Discovery in Databases Finding patterns in structured data, discovering new deep knowledge embedded in data.
Text Mining Application of data mining to text (often some NLP process creates a structured dataset from the text, which is then used for data mining [7]).
Graph-Based Data Models Current industry-standard databases are essentially record-based (e.g. the relational model or some form of object data model), and the schema must be determined in advance of populating the database. Graph-based data models offer finer semantic granularity and greater flexibility [8].
We are not proposing a text mining technique that finds patterns in very large collections of text, e.g. Nahm and Mooney [9], who combine IE with text mining. For many of the PSD applications we have described this is unlikely to be effective, as there are no very large static datasets to be mined (although there are some exceptions, e.g. SWISS-PROT); rather, new query requirements arise over time and extensions to the schema are required.
We propose an evolutionary system where the user iterates through the steps as new information sources and new query requirements arise. First, an initial integrated schema is built from a variety of sources, including structured data schemas, domain ontologies and natural language ontologies. Then information extraction rules are semi-automatically generated from this schema to be used as input to the IE processor. The data extracted from the text is added to the integrated schema and is available to answer queries. The schema may then be extended, by new data sources being added or new schema elements being identified, and the process repeats. Figure 1 shows how the user will use the ESTEST system in this evolutionary manner.
Because of this evolutionary approach, we suggest that a graphical workbench will be required for end-user use of ESTEST, and we intend to consider the requirements of such a workbench.
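As a rough illustration of this cycle, the sketch below walks through one iteration in Python; every function is a simplified, invented stand-in for a phase shown in Figure 1, not an actual ESTEST or AutoMed API.

# A schematic, runnable sketch of one pass through the evolutionary cycle.
# All functions are simplified stand-ins for the phases in Figure 1, not real
# ESTEST or AutoMed APIs.

def integrate_sources(sources):
    """Build an initial 'global schema' as a set of concept names."""
    return set().union(*sources)

def derive_ie_input(schema):
    """Semi-automatic IE configuration: one (empty) template per concept."""
    return {concept: [] for concept in schema}

def run_ie(texts, templates):
    """Trivial IE stand-in: a template is 'filled' whenever its concept name appears in a text."""
    return {c: [t for t in texts if c in t.lower()] for c in templates}

def integrate_results(schema, extracted):
    """Make the extracted instances available under the global schema."""
    return {c: extracted.get(c, []) for c in schema}

schema = integrate_sources([{"vehicle", "location"}, {"person"}])
templates = derive_ie_input(schema)
extent = integrate_results(schema, run_ie(["A vehicle left the location."], templates))
print(extent)
# In ESTEST the user would then add new sources or schema elements and repeat the cycle.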
Fig. 1. Evolutionary Use of the ESTEST System. (The figure shows the cycle of steps: Integrate Datasources; Create Data to assist the IE process; Information Extraction (IE); Integrate Results of IE; Enhance Schema; and Query Global Schema, connected by control-flow and data-flow links through the Global Schema.)
4. Combining Data Integration and Information Extraction
We believe that the data collected in the form of free text is important to PSD applications, that it is not stored as text because it is of only secondary value, and that there are two main reasons for storing data as text in these applications:
– It is not possible to know in advance all of the queries that will be required in the future. The captured text represents an intuitive attempt by the user to provide all information that could possibly be relevant. The Road Traffic Accident reports are a good example of this: the schema of the structured part of the data covers all currently known requirements, in a format known as STATS20 [10], and the text part is used when new reporting requirements arise.
– Data is captured as text because of the limitations of dynamically extending a schema in a conventional DBMS, where simply adding a column to an existing table can be a major task in a production system. For example, in systems storing witness statements in crime reports, as entities and relationships are mentioned for the first time it is not possible to dynamically expand the underlying data schema, and so the new information is stored only in its text form.
Furthermore, the real-world entities and relationships described in the text are related to the entities in the structured part of the data. An application combining IE and Data Integration will provide advantages in these applications for a number of reasons. Information Extraction is based on the idea of filling pre-defined templates, and Data Integration can provide a global schema to be used as a template. Combining the schema of the structured data with ontologies and other metadata sources can create the global schema / template. Metadata from the data sources can be used to assist the IE process by semi-automatically creating the required input to the IE modules. Data Integration systems which use a low-level graph-based common data model (e.g. AutoMed [11]) are able to extend the schema as new entities become known, without the overhead associated with conventional DBMSs, as they are not based on record-based structures such as the tables of relational databases. The templates filled by the IE process will provide a new data source to be added to the global schema, supporting new queries which could not previously be answered.
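To illustrate why a graph-based model eases this kind of schema evolution, the toy sketch below adds a new entity type as just another node and edge; it is not the HDM or the AutoMed API, and the concept names are invented.

# Toy illustration of schema evolution in a graph-based model: nodes and edges
# held in Python sets, not the HDM or the AutoMed API.

schema_nodes = {"person", "vehicle"}
schema_edges = {("person", "drives", "vehicle")}

def add_concept(name, related_to=None, via=None):
    """Extending the schema is just adding a node (and optionally an edge);
    no ALTER TABLE against a populated production database is needed."""
    schema_nodes.add(name)
    if related_to and via:
        schema_edges.add((name, via, related_to))

# A new entity type mentioned for the first time in a witness statement:
add_concept("weapon", related_to="person", via="carried_by")
print(schema_nodes)
print(schema_edges)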
5. The ESTEST System
Our ESTEST system makes use of the AutoMed heterogeneous data integration system
being developed at Birkbeck and Imperial Colleges [12]. In data integration systems,
several data sources, each with an associated local schema, are integrated to form a
single virtual database with an associated global schema. If the data sources conform to
different data models, then these need to be transformed into a common data model as
part of the integration process. The AutoMed system uses a low-level graph-based data model, the HDM, as its common data model; this is suitable for incremental extension of a global schema as new requirements arise. We have developed an AutoMed HDM data store [13] to store instance data and intermediate results for ESTEST. AutoMed implements bi-directional schema transformation pathways to transform and integrate heterogeneous schemas [11], a flexible approach amenable to including new domain knowledge dynamically.
In summary, the ESTEST system works as follows. The data sources are first identified and integrated into a single global schema. In AutoMed, each data model which can be integrated is defined in terms of the HDM, with each construct in the external data model having an associated set of HDM nodes and edges. In the ESTEST system, some features of data models are required to be preserved across all the integrated data sources. These features include an IS-A concept hierarchy, support for attributes, the identification of text data to be mined, and the ability to attach word forms to concepts. To facilitate the automatic creation of the global schema, all the data sources used by ESTEST will be transformed to an ESTEST data model. Each construct in the external model also has a set of transformations to map it onto the ESTEST data model. Once all the data sources have been transformed to this standard representation and mappings between schema elements have been obtained, it will be possible to integrate the schemas.
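The following sketch gives a rough idea of the kind of construct-to-graph mapping involved; the node/edge representation is illustrative only and differs in detail from the actual HDM and ESTEST data models, and the table and column names are invented.

# Illustrative mapping of a relational construct into a node/edge representation,
# in the spirit of the HDM-based ESTEST data model described above; the real
# AutoMed/HDM representation differs in detail, and the example table is invented.

def table_to_graph(table_name, columns, text_columns=()):
    """Return (nodes, edges, text_nodes) for one relational table."""
    nodes = {table_name} | {f"{table_name}:{c}" for c in columns}
    edges = {(table_name, f"{table_name}:{c}") for c in columns}   # attribute edges
    text_nodes = {f"{table_name}:{c}" for c in text_columns}       # text data to be mined
    return nodes, edges, text_nodes

nodes, edges, text_nodes = table_to_graph(
    "accident",
    columns=["road", "severity", "description"],
    text_columns=["description"],
)
print(nodes)
print(text_nodes)   # columns marked for the IE process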
ESTEST then takes the metadata in the global schema and uses it to suggest input to the IE process. The user confirms, corrects and adds to this configuration data, and the IE process is run. We make use of the GATE [14] IE architecture to build the ESTEST IE processor. As well as reusing standard IE components such as named entity gazetteers, sentence splitters and pattern-matching grammars (with their configuration inputs semi-automatically created by ESTEST), a number of new IE components are being developed:
TemplateFromSchema Takes an ESTEST global schema, creates templates to be filled by the IE engine, and creates input to the standard IE components.
NE-DB Named entity recognition in IE is typically driven by flat-file lists; the NE-DB component will instead associate a query on the global schema with an annotation type. A list of word forms will be materialised in the HDM store for use when the IE process is running (GATE NE gazetteers generate finite state machines for the possible transitions of tokens).
WordForm For a given concept, retrieves relevant word forms from the WordNet natural-language ontology. More word forms can be generated by increasing the number of traversals allowed through the WordNet hierarchy, ordered by an approximation of semantic distance (see the sketch after this list).
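The sketch below shows the kind of bounded WordNet traversal the WordForm component might perform, written here against NLTK's WordNet interface rather than the actual ESTEST code; the max_depth parameter plays the role of the bounded number of traversals mentioned above.

# Sketch of a bounded WordNet traversal for collecting word forms for a concept.
# Uses NLTK's WordNet corpus (run nltk.download('wordnet') once beforehand); this
# illustrates the idea behind the WordForm component, it is not the ESTEST code.
from nltk.corpus import wordnet as wn

def word_forms(concept, max_depth=1):
    """Collect lemma names for a concept and its hyponyms up to max_depth steps away."""
    forms = set()
    frontier = wn.synsets(concept)
    for synset in frontier:
        forms.update(synset.lemma_names())
    for _ in range(max_depth):
        frontier = [h for s in frontier for h in s.hyponyms()]
        for synset in frontier:
            forms.update(synset.lemma_names())
    return sorted(forms)

# Increasing max_depth yields more, but more loosely related, word forms.
print(word_forms("vehicle", max_depth=1)[:10])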
The templates filled by the IE process will then be used to add to the extents of the corresponding concepts in the global schema. Extracted annotations which match objects in the global schema will be placed in the HDM store. The global query facilities of AutoMed are then available to the user, who can query the global schema using the IQL query language [15, 16].
For more detailed information on the design of the ESTEST system we refer the
reader to [17] and for an example of its operation in the Road Traffic Accident domain
to [18].
Recent work within the Tristarp group [19] has resulted in advanced visualisation tools for graph-based databases becoming available [20], which may be of assistance in the proposed user workbench. This research interest is also reflected in recent products developed in industry: the Sentences [21] DBMS from Lazysoft is based on a quadruple store and sets out to challenge the dominance of the relational model.
6. Conclusions and Future Work
We have discussed how a class of applications based on partially structured data is not adequately supported by current database and data mining techniques. We have stated why we believe combining Information Extraction and Data Integration techniques is a promising direction for research.
We are currently completing an initial implementation of the ESTEST system which
we will test in the Road Traffic Accident reporting and Crime Investigation domains.
ESTEST extends the facilities offered by data integration systems by moving towards handling text, and extends IE systems by attempting to use schema information to semi-automatically configure the IE process.
References
1. P.J.H. King and A. Poulovassilis. Enhancing database technology to better manage and exploit partially structured data. Technical report, Birkbeck College, University of London, 2000.
2. A. Bairoch, B. Boeckmann, S. Ferro, and E. Gasteiger. Swiss-Prot: Juggling between evolution and stability. Brief. Bioinform., 5:39–55, 2000.
3. SOCIS Scene of Crime Information System. http://www.computing.surrey.ac.uk/ai/socis/.
4. QUENZA. http://www.xanalys.com/quenza.html.
5. Alon Y. Halevy. Data integration: A status report. In Gerhard Weikum, Harald Schöning,
and Erhard Rahm, editors, BTW, volume 26 of LNI, pages 24–29. GI, 2003.
6. D. Appelt. An introduction to information extraction. Artificial Intelligence Communications, 1999.
7. A.H. Tan. Text mining: The state of the art and the challenges. Proceedings of the PAKDD
1999 Workshop on Knowledge Discovery from Advanced Databases, pages 65–70, 1999.
8. William Kent. Limitations of record-based information models. ACM Transactions on
Database Systems, 4(1):107–131, March 1979.
9. U.Y. Nahm and R. Mooney. Using information extraction to aid the discovery of prediction rules from text. Proceedings of the KDD-2000 Workshop on Text Mining, pages 51–58, 2000.
10. UK Government Department for Transport. Instructions for the completion of road accident report form STATS19. http://www.dft.gov.uk/stellent/groups/dft transstats/documents/page/dft transstats 505596.pdf.
11. P.J. McBrien and A. Poulovassilis. Data integration by bi-directional schema transformation
rules. In Proc. ICDE’03, 2003.
12. AutoMed Project. http://www.doc.ic.ac.uk/automed/.
13. D. Williams. The AutoMed HDM data store. Technical report, Automed Project, 2003.
14. H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan. Experience with a language engineering architecture: Three years of GATE. Proceedings of the 40th Anniversary Meeting
of the Association for Computational Linguistics (ACL’02), 2002.
15. A. Poulovassilis. The AutoMed Intermediate Query Language. Technical report, AutoMed
Project, 2001.
16. E. Jasper. Global query processing in the AutoMed heterogeneous database environment. In
Proc. BNCOD02, LNCS 2405, pages 46–49, 2002.
17. D. Williams and A. Poulovassilis. Combining data integration with natural language technology for the semantic web. In Proc. Workshop on Human Language Technology for the
Semantic Web and Web Services, at ISWC’03, page TBC, 2003.
18. D. Williams and A. Poulovassilis. An example of the ESTEST approach to combining
unstructured text and structured data. In DEXA Workshops, pages 191–195. IEEE Computer
Society, 2004.
19. Tristarp Project. http://www.dcs.bbk.ac.uk/TriStarp.
20. M.N. Smith and P.J.H. King. Database support for exploring criminal networks. Intelligence
and Security Informatics: First NSF/NIJ Symposium, 2003.
21. Lazysoft (maker of Sentences). http://www.lazysoft.com/index.html.