Knowledge Extraction and Integration using Automatic
and Visual Methods
Vedran Sabol, Roman Kern, Barbara Kump, Viktoria Pammer, Michael Granitzer
vsabol|rkern|bkump|vpammer|[email protected]
Know-Center, Inffeldgasse 21a, 8010 Graz, Austria
Introduction
Techniques for efficiently accessing, analyzing and presenting large amounts of
dynamic, heterogeneous data, with the goal of acquiring new knowledge and facts, play
an increasingly important role in various application domains, such as media, patent
databases, scientific publication repositories, medical databases etc. We identify two
areas of research which, we believe, can provide significant improvements and benefits
for fact discovery and acquisition of knowledge from large amounts of data, particularly
when combined and applied together:
• There is a large potential in combining the knowledge present in structured,
semantic data, such as open linked data (DBPedia, GeoData, Friend of a Friend,
etc.), with unstructured information, such as human readable content. Integration
of structured and unstructured data will facilitate the development of advanced
data mining and knowledge extraction techniques.
• Visual interfaces are a powerful means for exploring and analyzing large amounts
of data and providing access to knowledge and facts. Additionally, visual
interfaces can be used as a common discourse platform supporting collaboration
and networking, and empowering users to share their insights with co-workers
and contribute knowledge into the Linked Data Cloud.
Vision
We envision an approach combining automatic methods and visual techniques for
discovering facts and deriving new knowledge from massive data. Besides data and
content, which contain implicit knowledge, two explicit knowledge sources shall be
utilized in the process: semantic databases and human expertise. Patterns and facts,
which are either discovered by algorithms or unveiled through visual analysis, shall be
validated by humans and integrated into Linked Data repositories through dedicated
visual interfaces (Figure 1). The advantage of integrating the newly acquired facts and
knowledge is twofold:
• it can be used by extraction and mining algorithms to improve their performance.
• it is made available to other people - either direct collaborators or people
independently working on related topics.
It should be noted that the proposed scenario borrows from the ideas of the social
Web, whereby instead of publishing new user-generated content on various Web
platforms, the focus here is on deriving and sharing of new facts and knowledge through
integration with existing knowledge bases (i.e. Linked Data).
[Figure 1 diagram. Labels: LOD Cloud; Data/Content Repositories (Data, Content); Extraction of Semantics, Data Mining; Knowledge, Extracted Facts, Patterns; Visual Interfaces (pattern recognition, hypothesis generation & validation, deriving new knowledge); Validated Facts and Knowledge.]
Figure 1: Extraction of new knowledge and facts and their integration into the Linked
Data Cloud using a combination of automatic and visual methods.
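The loop shown in Figure 1 (algorithms propose facts, humans validate them, validated facts enter the Linked Data Cloud) can be illustrated with a minimal sketch. It is not the authors' implementation: the `Fact` class, the `example.org` IRIs and the hard-coded review step are assumptions made for illustration. Only the output syntax, N-Triples, is a real RDF serialization.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Fact:
    """A candidate fact as an RDF-style triple of IRIs (hypothetical example values)."""
    subject: str
    predicate: str
    obj: str
    validated: bool = False  # becomes True once a human has confirmed the fact


def to_ntriples(facts):
    """Serialize only the human-validated facts, one N-Triples statement per line."""
    return "\n".join(
        f"<{f.subject}> <{f.predicate}> <{f.obj}> ."
        for f in facts
        if f.validated
    )


# Hypothetical candidates, as an extraction algorithm might propose them.
candidates = [
    Fact("http://example.org/Graz", "http://example.org/locatedIn", "http://example.org/Austria"),
    Fact("http://example.org/Graz", "http://example.org/locatedIn", "http://example.org/Germany"),
]

# Stand-in for the visual validation step: the user accepts the first candidate only.
reviewed = [replace(c, validated=(i == 0)) for i, c in enumerate(candidates)]

print(to_ntriples(reviewed))
```

Only the accepted statement is emitted. In a real setting the review decision would come from a visual interface, and the output would be loaded into a triple store instead of being printed.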
Semantic Enrichment and Data Mining
Extraction of semantics from unstructured data such as text is a well-established
area of research, but extraction and disambiguation of facts still pose significant
challenges. Current information extraction systems [Etzioni et al. 2010] are capable of
extracting simple, frequent factual relationships from open data sets like the Web, but
extraction and disambiguation of infrequent entities in specific domains, and
integration of the extracted semantics into Linked Data repositories, remain a major
challenge. Integration of different, ontologically encoded knowledge can be achieved
through ontology alignment or ontology merging algorithms [Euzenat & Shvaiko 2007];
however, approaches involving human intervention through visual interfaces
may help alleviate algorithm limitations [Granitzer et al. 2010]. The use of semantic
information to advance data mining methods appears to be a promising research topic:
several authors have shown that semantic information can be used successfully
in mining tasks such as text clustering [Hotho et al. 2002], [Szczuka et al.
2011].
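To make the idea of semantics-aware text clustering concrete, the following sketch extends bag-of-words vectors with concept identifiers before computing cosine similarity. It is a simplified illustration and not the method of any cited paper: the `CONCEPTS` mapping, the concept names and the three toy documents are invented, and a real system would draw concepts from an ontology or a Linked Data source.

```python
import math
from collections import Counter

# Hypothetical term-to-concept mapping (invented for illustration).
CONCEPTS = {
    "car": "concept:Vehicle",
    "truck": "concept:Vehicle",
    "apple": "concept:Fruit",
    "pear": "concept:Fruit",
}


def vectorize(text, use_concepts):
    """Bag-of-words vector, optionally extended with the concepts of its terms."""
    terms = text.lower().split()
    vector = Counter(terms)
    if use_concepts:
        vector.update(CONCEPTS[t] for t in terms if t in CONCEPTS)
    return vector


def cosine(a, b):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


doc_a, doc_b, doc_c = "red car", "red truck", "red apple"

# Without concepts, doc_a is equally similar to doc_b and doc_c.
plain_ab = cosine(vectorize(doc_a, False), vectorize(doc_b, False))
plain_ac = cosine(vectorize(doc_a, False), vectorize(doc_c, False))

# With concepts, the shared "Vehicle" concept pulls doc_a and doc_b together.
rich_ab = cosine(vectorize(doc_a, True), vectorize(doc_b, True))
rich_ac = cosine(vectorize(doc_a, True), vectorize(doc_c, True))

print(plain_ab, plain_ac, rich_ab, rich_ac)
```

With plain terms both pairs score 0.5. With concepts added, the car/truck pair rises to about 0.67 and the car/apple pair drops to about 0.33, so a clustering algorithm working on these similarities would group the two vehicle documents.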
Our goal is to develop methods for extraction of semantic information and facts, and
enrichment of human readable content, such as governmental information, scientific
publications, patent databases, media articles, user generated content etc. Newly
extracted knowledge and facts shall be integrated with the Linked Open Data Cloud. The
resulting methods will be based on natural language processing (NLP) and machine
learning techniques, such as scalable text clustering [Muhr et al. 2010a] and cluster
labelling [Muhr et al. 2010b]. In the setup described above,
several challenges deserve particular attention: entity disambiguation techniques [Kern
et al. 2010], [Kern et al. 2011b], information diffusion and reuse [Kern et al. 2011a], and
information quality and trust [Lex et al. 2010].
Visual Access and Analysis
Visual Analytics [Thomas & Cook 2005, Keim et al. 2010] is a research field focused
on supporting humans in analytical reasoning over massive data sets using visual
interfaces. It strives to effectively integrate human knowledge and experience into
complex analytical processes by suitably combining machine processing with visual
analysis methods.
Our goal is to develop visual analysis techniques for large unstructured repositories
and for Linked Data repositories. Visual interfaces built atop automatic techniques shall
provide intuitive access to data and knowledge, pattern recognition possibilities and
integration of the discovered, validated facts back into the Linked Data Cloud.
Ontological information can be used to improve the presentation of visual interfaces in a
variety of ways [Paulheim & Probst 2010]. Ontologically described user interface
components will be easily adaptable to various data types and sources, allowing users to
explore, analyse and manipulate information delivered by semantic enrichment, mining
and integration methods. Visual components for assisting and simplifying data binding
from Linked Open Data repositories to user interface elements shall also be considered.
A visualization system providing methods for exploring and analysing massive
repositories, and for integrating the discovered knowledge into semantic knowledge
repositories, will be composed of a variety of visualization components which can be
grouped into two categories [Granitzer et al. 2011]: discovery components, where the
information flow is from the repository to the users, and description components
suitable for expressing knowledge and integrating it with the Linked Data.
Discovery Components serve analytical purposes such as explorative navigation
and discovery of new insights. Examples are charting components (for example bar, pie,
line and spider charts) and advanced analytical components targeting abstract
information such as graphs and networks, topical relatedness, change and temporal
information etc. [Sabol et al. 2009]. Selection of suitable visual metaphors for a
particular task and data shall be performed (semi-)automatically based on available best
practices [Lengler & Eppler 2007].
Description components are customizable knowledge visualization [Eppler &
Burkhard 2004], [Bertschi et al. 2011] metaphors empowering users to intuitively
express and communicate facts and knowledge. A new visualization shall initially be
created as an empty "skeleton" of a chosen metaphor, where the user applies a "Builder
Tool" to construct a visual representation expressing (newly acquired) knowledge. In
doing so the expressed facts shall be integrated into the Linked Data Cloud. Also, if made
publicly available, such a visual representation would provide a platform for
transferring and communicating knowledge to a broader audience which can extend it
by integrating additional knowledge.
Challenges
We compile an (incomplete) list of challenges, grouped into five categories, which
need to be addressed to realize the vision described above:
1. Data and information:
a) Information quality: With the advent of the Social Web and
user-generated content, information quality and trustworthiness gain a prominent
role. The same is true for user-generated knowledge.
b) Information reuse and diffusion: In large-scale collaboration scenarios,
tracking the diffusion of information and knowledge, and identifying their
reuse, become important.
c) Security and ownership: Access control and security mechanisms are hard
to address in large, distributed data and knowledge bases.
d) Change and evolution: Identification of trends and temporal patterns, and
handling of high rates of change in data, content and knowledge are
necessary when dealing with dynamically changing repositories.
e) Data integration: Binding to different data and knowledge repositories, and
transformation of data into the required form, become indispensable when
dealing with heterogeneous information.
2. Storage infrastructures: efficiently applying distributed Big Data storage
infrastructures, such as Hadoop/HDFS, and integrating them with traditional
infrastructure (relational databases) and with emerging models, such as cloud
computing.
3. Algorithms: To cope with huge amounts of data, scalable algorithms need to be
developed. Two approaches appear promising in this context:
a) Distributed algorithms, for example based on Hadoop (MapReduce + HDFS, the
Hadoop counterpart of the Google File System), process data on a large number of
computing nodes, which may be placed in geographically separate locations.
b) Streaming algorithms process a huge stream of information in one or a few
passes and with limited resources (memory), by building an approximate
summary/aggregation of the data.
4. User interfaces and visualization:
a) Visual analytics: Advanced, scalable visual interfaces are necessary for
analysis and exploration of large data and knowledge databases. Usage of
semantic descriptors holds the promise for automatically binding to
semantically described data.
b) Mobility: The mobile boom redefines the requirements on GUI design and
interactivity.
c) Context: Besides considering the classical user context (e.g. task, preferences),
user context can be extended with sensory data, such as those provided by
mobile devices.
d) Collaboration: New collaboration possibilities arise through increased
mobility and permanent broadband network access.
5. Use and Commercialization: For various types of content (such as videos,
music, news etc.) established commercial and non-commercial utilization models
exist. For linked data this is not the case (yet). Sustainable eco-systems for LOD
utilization need to be built and established (possibly learning from the lessons
delivered by the social Web and user generated content).
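As an illustration of the streaming approach named in challenge 3b, the sketch below implements the Misra-Gries frequent-items summary, a well-known one-pass algorithm. It is chosen here as an example only; the text above does not name a specific algorithm. It keeps at most k - 1 counters regardless of stream length, and each reported count undercounts the true frequency by at most n / k for a stream of n items.

```python
def misra_gries(stream, k):
    """One-pass approximate frequent-items summary with at most k - 1 counters."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement every counter and drop those reaching zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Toy stream; "a" occurs 4 times out of 7 items.
stream = ["a", "b", "a", "c", "a", "d", "a"]
summary = misra_gries(stream, k=3)
print(summary)
```

For this stream the summary holds two counters and reports "a" with a count of 3, which is within the n / k error bound of its true frequency of 4. Memory use depends only on k, which is what makes such summaries suitable for data that cannot be stored or re-read.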
Each of the challenges presents a research topic on its own. Our interests have
focused on topics such as information reuse, quality and evolution, consideration of
user context, and visual analytical interfaces (including mobile applications). To bring
forward the presented vision and address the challenges in a satisfactory manner,
bundling of resources and competencies appears to be a natural way forward.
References
[Bertschi et al. 2011] Bertschi, S., Bresciani, S., Crawford, T., Goebel, R., Kienreich, W.,
Lindner, M., Sabol, V., Vande Moere, A., (2011): What is Knowledge Visualization?
Opinions on Current and Future State, in Proceedings of the 15th International
Conference Information Visualisation (IV'11)
[Eppler & Burkhard 2004] Eppler, M.J., Burkhard, R.A., Knowledge Visualization –
Towards a New Discipline and its Fields of Application, ICA Working Paper
#2/2004, University of Lugano, Switzerland, 2004.
[Etzioni et al. 2010] Etzioni, O., Banko, M., Soderland, S., & Weld, D. S. (2008). Open
information extraction from the web. Communications of the ACM, 51(12), 68.
doi:10.1145/1409360.1409378
[Euzenat & Shvaiko 2007] Euzenat, J., Shvaiko, P., (2007): Ontology matching, Springer-Verlag.
[Granitzer et al. 2010] Granitzer, M., Sabol, V., Onn, K.W., Lukose, D., Tochtermann,
K. (2010): Ontology Alignment – A Survey with Focus on Visually Supported Semi-Automatic Techniques, Future Internet, Volume 2, Issue 3, 238-258, MDPI AG
[Granitzer et al. 2011] Granitzer, M., Sabol, V., Kienreich, W., Lukose, D., Onn, K.W.
(2011): Visual Analyses on Linked Data – An Opportunity for both Fields, The 2011
STI Semantic Summit, Riga, Latvia
[Hotho et al. 2002] Hotho, A., Maedche, A. and Staab, S., (2002): “Ontology-based text
document clustering”, Künstliche Intelligenz, 16 (4), pp 48-54.
[Keim et al. 2010] Keim, D., Mansmann, F., & Thomas, J. (2010). Visual analytics: how
much visualization and how much analytics? ACM SIGKDD Explorations Newsletter,
11(2), 5–8. ACM.
[Kern et al. 2010] Kern, R., Muhr, M., Granitzer, M. (2010): KCDC: Word Sense Induction
by Using Grammatical Dependencies and Sentence Phrase Structure, in Proceedings
of SemEval-2, pages 351-354.
[Kern et al. 2011a] Kern, R., Seifert, C., Zechner, M., Granitzer, M., (2011): Vote/Veto
Meta-Classifier for Authorship Identification, 3rd International Competition on
Plagiarism Detection
[Kern et al. 2011b] Kern, R., Zechner, M., Granitzer, M., (2011): Model Selection
Strategies for Author Disambiguation, IEEE Computer Society: 8th International
Workshop on Text-based Information Retrieval, in Proceedings of the 22nd
International Conference on Database and Expert Systems Applications (DEXA 11),
pages 155-160.
[Lengler & Eppler 2007] Lengler R., Eppler M. (2007): Towards A Periodic Table of
Visualization Methods for Management. IASTED Proceedings of the Conference on
Graphics and Visualization in Engineering (GVE 2007), Clearwater, Florida, USA.
[Lex et al. 2010] Lex, E., Khan, I., Bischof, H., Granitzer, M., (2010): Assessing the Quality
of Web Content, Proceedings of the ECML/PKDD Discovery Challenge.
[Muhr et al. 2010a] Muhr, M., Sabol, V., Granitzer, M. (2010): Scalable Recursive Top-Down Hierarchical Clustering Approach with implicit Model Selection for Textual
Data Sets, 7th International Workshop on Text-based Information Retrieval, in
Proceedings of the 21st International Conference on Database and Expert Systems
Applications (DEXA 10), IEEE.
[Muhr et al. 2010b] Muhr, M., Kern, R., Granitzer, M., (2010): Analysis of
Structural Relationships for Hierarchical Cluster Labeling, in Proceedings of the 33rd
International ACM SIGIR Conference on Research and Development in Information
Retrieval, pages 175-185, ACM
[Paulheim & Probst 2010] Paulheim, H., Probst, F., (2010): Ontology-Enhanced User
Interfaces: A Survey, International Journal on Semantic Web and Information
Systems (IJSWIS), 6(2).
[Sabol et al. 2009] Sabol, V., Kienreich, W., Muhr, M., Klieber, W., Granitzer, M., (2009):
Visual Knowledge Discovery in Dynamic Enterprise Text Repositories, Proceedings
of the 13th International Conference on Information Visualisation (IV09), IEEE.
[Szczuka et al. 2011] Szczuka, M., Janusz, A., Herba, K., (2011): Clustering of rough set
related documents with use of knowledge from DBpedia, Proceedings of the 6th
international conference on Rough sets and knowledge technology RSKT'11, pages
394-403.
[Thomas & Cook 2005] Thomas, J. J., Cook, K. A. (2005). Illuminating the Path: The
Research and Development Agenda for Visual Analytics (p. 186). IEEE Computer
Society. Retrieved from http://nvac.pnl.gov/agenda.stm.