www.iaard.net
eISSN: 2455-3204
International Association of Advances in Research and Development
International Journal of Computer Science and Information Technology
International Journal of Computer Science and Information Technology, 2015, 1(1), 1-5
Research Article
Big Data Analysis and Its Applications for Knowledge Management
L. Velmurugan*1, P. Sasikumar2, Alema Gebre3, Tilahun A.4
1 Professor, Computer Science, Institute of Technology, AMBO University, Ethiopia
2 Assistant Professor, Information Technology, Institute of Technology, AMBO University, Ethiopia
3 Head, Computer Science, Institute of Technology, AMBO University, Ethiopia
4 Head, Information Technology, Institute of Technology, AMBO University, Ethiopia
……………………………………………………………………………………………….
Abstract: Big Data analysis is nowadays one of the most important steps in the knowledge discovery in databases process and is considered a significant subfield of knowledge management. Research in big data analysis will continue to grow in business and in learning organizations over the coming decades. This review paper surveys the applications of big data techniques that have been developed to support the knowledge management process. Journal articles indexed in the Science Direct database from 2007 to 2012 are analyzed and classified. The discussion of the findings covers knowledge resources, knowledge types and/or knowledge datasets, and the big data analysis techniques and applications used in knowledge management. The article briefly outlines the definition of big data and its functionality. The rationale for knowledge management and the major knowledge management tools integrated in the knowledge management cycle are then described. Finally, the applications of big data techniques in the knowledge management process are summarized and discussed.
………………………………………………………………………………………………………………
1. Introduction
Modern data-mining applications, often called "big-data" analysis, require us to manage immense amounts of data quickly. In many of these applications the data is extremely regular, and there is ample opportunity to exploit parallelism. Experiments, observations, and numerical simulations in many areas of science nowadays generate terabytes of data and, in some cases, are on the verge of generating many petabytes. This rapid growth heralds an era of "data-centric science," which requires new paradigms addressing how data are captured, processed, discovered, exchanged, distributed, and analyzed. Many business organizations have gathered and stored immense amounts of data. However, they are unable to discover the valuable information hidden in the data, that is, to transform these data into useful knowledge [1]. Managing knowledge resources can be a challenge. Many organizations are employing information technology in knowledge management to aid the creation, sharing, integration, and distribution of knowledge.
Knowledge management is a process of data usage [2]. The basis of data mining is a process of using tools to extract useful knowledge from large datasets; data mining is an essential part of knowledge management [2]. Wang & Wang (2008) point out that data mining can be useful for KM in two main ways: (i) to share common knowledge of the business intelligence (BI) context among data miners and (ii) to use data mining as a tool to extend human knowledge.
2. Challenges to the analysis of massive data
A number of challenges exist in both data management and data analysis that require new approaches to support the "big data" era. These challenges span the generation of the data, its preparation for analysis, and policy-related challenges in its sharing and use. Initiatives in research and development that are leading to improved capabilities include the following:
• Dealing with highly distributed data sources,
• Tracking data provenance, from data generation through data preparation,
• Validating data,
• Coping with sampling biases and heterogeneity,
• Working with different data formats and structures,
• Developing algorithms that exploit parallel and distributed architectures,
• Ensuring data integrity,
• Ensuring data security,
• Enabling data discovery and integration,
• Enabling data sharing,
• Developing methods for visualizing massive data, and
• Developing scalable and incremental algorithms.
As data volumes increase, the ability to perform analysis on the data is constrained by the increasingly distributed nature of modern data sets. Highly distributed data sources present challenges due to the diverse nature of their technical
infrastructures, creating challenges in data access,
integration, and sharing. The distributed nature
also creates additional challenges due to the
limitations in moving massive data through
channels with limited bandwidth. In addition, data
produced by different sources are often defined
using different representation methods and
structural specifications. Bringing such data
together becomes a challenge because the data are
not properly prepared for data integration and
fusion, and the technical infrastructures lack the
appropriate information infrastructure services to
support analysis of the data if it remains
distributed. Statistical inference procedures often
require some form of aggregation that can be
expensive in distributed architectures, and a major
challenge involves finding cheaper approximations
for such procedures. Finally, security and policy
issues also limit the ability to share data. Yet the ever-increasing generation of data in medicine, the physical sciences, defense, and other sectors requires that analysis be performed on data that are captured and managed across distributed databases.
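One way to avoid expensive aggregation over distributed architectures is to exchange small sufficient summaries rather than raw data. The following Python sketch illustrates that idea under a deliberately simple assumption (the target statistic is a global mean); it is an illustration of the general point above, not a procedure taken from this paper.

from dataclasses import dataclass

@dataclass
class PartialAggregate:
    n: int        # number of records held at the site
    total: float  # sum of the values held at the site

def local_aggregate(values):
    """Runs at each distributed site; only two numbers leave the site."""
    return PartialAggregate(n=len(values), total=sum(values))

def merge(parts):
    """Runs centrally: an exact global mean without moving raw records."""
    n = sum(p.n for p in parts)
    return sum(p.total for p in parts) / n

sites = [[2.0, 4.0, 6.0], [1.0, 3.0], [5.0]]      # three data sources
print(merge([local_aggregate(v) for v in sites]))  # 3.5, the pooled mean

For a mean, the merged answer is exact; for more complex inference procedures, the challenge noted above is precisely that such cheap, mergeable summaries may only approximate the statistic of interest.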
3. Knowledge management
There are various concepts of knowledge management. In this paper we use the definition of knowledge management by McInerney (2002): "Knowledge management (KM) is an effort to increase useful knowledge within the organization. Ways to do this include encouraging communication, offering opportunities to learn, and promoting the sharing of appropriate knowledge artifacts." This definition emphasizes the interaction aspect of knowledge management and organizational learning. The knowledge management process focuses on knowledge flows and the processes of creating, sharing, and distributing knowledge (Figure 1) [3]. Each of the knowledge phases of capture and creation, sharing and dissemination, and acquisition and application can be facilitated by information technology.
Figure 1. KM technologies integrated into the KM cycle (source: Dalkir, K., 2005).
Liao (2003) classifies KM technologies using
seven categories:
1. KM Framework
2. Knowledge-Based Systems (KBS)
3. Data Mining
4. Information and Communication Technology
5. Artificial Intelligence (AI)/Expert Systems (ES)
6. Database Technology (DT)
7. Modeling
As technologies play an important role in KM, they stand as a necessary tool for KM usage [4]. Thus, KM requires technologies to
facilitate communication, collaboration, and
content for better knowledge capture, sharing,
dissemination, and application [5].
3.1 Knowledge Management: Capture and
Creation Tools
This section provides an overview of a classification of KM technologies as tools, focusing on tools for knowledge capture and creation. Liao (2003) classifies KM technologies into the seven categories listed above. Ruggles et al. (1997) classify KM technologies as tools that generate knowledge (e.g., data mining), codify knowledge, and transfer knowledge. Dalkir (2005) classifies KM tools according to the phase of the KM cycle.
3.2 Knowledge Types
This section describes knowledge types in eight organizational domains where data mining supports collaboration in knowledge creation.
• Health-care system domain: the dataset was composed of three databases: the health-care providers' database, the out-patient health-care statistics database, and the medical status database [5]. Another data source was hospital inpatient medical records [6].
• Construction industry domain: a sample data set took the form of Post Project Reviews (PPRs) labelled as good or bad information. Multiple Key Term Phrasal Knowledge Sequences (MKTPKS) were formed through applications of text mining and used as an essential part of the text analysis in the classification of text documents.
• Retailing domain: customer data and the products purchased were collected and stored in databases to mine whether customers' purchase
habits and behaviour affect product line and brand extensions [7].
• Financial domain: two datasets were used: (i) to identify bond ratings, knowledge sets contained strings of data, models, parameters, and reports for each analytical study; and (ii) to predict rating changes of bonds, clustered data of bond features as well as the model parameters were stored, classified, and applied to rating predictions [8].
• Small and Middle Businesses (SMBs) domain: knowledge types in small and middle businesses, in the case of a food company, were related to the corporate conditions or goals of the problem across all departments; these were used to develop a decision system platform and then to form a knowledge tree that finds relations through human-computer interaction and optimizes the decision-making process [9]. To solve food supply chain network problems, Li et al. (2010) developed an EW&PC prototype composed of five major components: (i) a knowledge base, (ii) a task classifier and template approaches, (iii) a DM methods library with an expert system for method selection, (iv) an explorer and predictor, and (v) a user interface [10]. This system built decision support models and helped managers accomplish decision-making.
• Research Assets domain: Cantú & Ceballos (2010) focused on managing knowledge assets by applying a knowledge and information network (KIN) approach. This platform contained three component types: research products, human resources or intellectual capital, and research programs. The various types of research assets were handled through domain ontologies and databases [11].
• Business domain: two types of knowledge attributes were used: condition attributes and a decision attribute. The condition attributes comprised four independent attributes: the KM purpose, the explicit-oriented degree, the tacit-oriented degree, and the success factor. The decision attribute was a single dependent attribute, the KM performance.
• Collaboration and Teamwork domain: the dataset came from a research laboratory in a research institute. It contained 14 knowledge workers, 424 research documents, and a workers' log that recorded when documents were accessed and which documents the workers needed [12]. From the workers' log, two levels of knowledge flow were generated: codified-level knowledge flow and topic-level knowledge flow [12]. These two types of knowledge flow were used to describe a worker's needs. To collect the knowledge flow, documents in the dataset were categorized into eight clusters using a data mining clustering approach [12].
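For illustration, the Python sketch below shows one common way to cluster documents into eight groups, using TF-IDF features and k-means from scikit-learn. This is an assumed stand-in: the cited study [12] does not specify its algorithm or features in this paper, and the toy strings replace the 424 real documents.

# Toy document clustering: TF-IDF features + k-means with k = 8 as in [12].
# Hypothetical stand-in for the study's unspecified clustering method.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = [  # stand-ins for the 424 research documents in [12]
    "data mining for knowledge discovery in databases",
    "clustering algorithms for large document collections",
    "knowledge management in learning organizations",
    "expert systems for decision support",
    "supply chain network optimization",
    "bond rating prediction with classification models",
    "text mining of post project reviews",
    "workflow systems for distributed data analysis",
    "customer purchase behaviour in retailing",
    "ontologies for research asset management",
]

features = TfidfVectorizer(stop_words="english").fit_transform(documents)
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(features)
for label, doc in sorted(zip(labels, documents)):
    print(label, doc)  # cluster id followed by the document assigned to it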
4. Big data techniques / applications used in
knowledge management
Experiments, observations, and numerical
simulations in many areas of science and business
are currently generating terabytes of data, and in
some cases are on the verge of generating
petabytes and beyond. Analyses of the information
contained in these data sets have already led to
major breakthroughs in fields ranging from
genomics to astronomy and high-energy physics
and to the development of new information-based
industries. Traditional methods of analysis have
been based largely on the assumption that analysts
can work with data within the confines of their
own computing environment, but the growth of
“big data” is changing that paradigm, especially in
cases in which massive amounts of data are
distributed across locations.
While the scientific community and the
defense enterprise have long been leaders in
generating and using large data sets, the emergence
of e-commerce and massive search engines has led
other sectors to confront the challenges of massive
data. For example, Google, Yahoo!, Microsoft, and
other Internet-based companies have data that is
measured in exabytes (10^18 bytes). Social media (e.g., Facebook, YouTube, Twitter) have exploded beyond anyone's wildest imagination, and today some of these companies have hundreds of millions of users. Data mining of these massive data sets is transforming the way we think about crisis response, marketing, entertainment, cybersecurity, and national intelligence. It is also
transforming how we think about information
storage and retrieval. Collections of documents,
images, videos, and networks are being thought of
not merely as bit strings to be stored, indexed, and
retrieved, but as potential sources of discovery and
knowledge, requiring sophisticated analysis
techniques that go far beyond classical indexing
and keyword counting, aiming to find relational
and semantic interpretations of the phenomena
underlying the data.
One can envision a health-care system in which increasingly detailed data are maintained for each individual, including genomic, cellular, and environmental data, and in which such data can be combined with data from other individuals and with results from fundamental biological and medical research so that optimized treatments can be designed for each individual. One can also envision numerous
business opportunities that combine knowledge of
preferences and needs at the level of single
individuals with fine-grained descriptions of
goods, skills, and services to create new markets.
It is natural to be optimistic about the
prospects. Several decades of research and
development in databases and search engines have
yielded a wealth of relevant experience in the
design of scalable data-centric technology. In
particular, these fields have fuelled the advent of
cloud computing and other parallel and distributed
platforms that seem well suited to massive data
analysis. Moreover, innovations in the fields of
machine learning, data mining, statistics, and the
theory of algorithms have yielded data-analysis
methods that can be applied to ever-larger data
sets. However, such optimism must be tempered
by an understanding of the major difficulties that
arise in attempting to achieve the envisioned goals.
In part, these difficulties are those familiar from
implementations of large-scale databases—finding
and mitigating bottlenecks, achieving simplicity
and generality of the programming interface,
propagating metadata, designing a system that is
robust to hardware failure, and exploiting parallel
and distributed hardware—all at an unprecedented
scale. But the challenges for massive data go
beyond the storage, indexing, and querying that
have been the province of classical database
systems (and classical search engines) and, instead,
hinge on the ambitious goal of inference. Inference
is the problem of turning data into knowledge,
where knowledge often is expressed in terms of
entities that are not present in the data per se but
are present in models that one uses to interpret the
data. Statistical rigor is necessary to justify the
inferential leap from data to knowledge, and many
difficulties arise in attempting to bring statistical
principles to bear on massive data. Overlooking this foundation may yield results that are at best not useful and at worst harmful. In any discussion
of massive data and inference, it is essential to be
aware that it is quite possible to turn data into
something resembling knowledge when actually it
is not. Moreover, it can be quite difficult to know
that this has happened.
Indeed, many issues impinge on the quality of inference. A major one is that of "sampling bias." Data may have been collected according to a certain criterion (for example, in a way that favors "larger" items over "smaller" items), but the inferences and decisions made may refer to a different sampling criterion. This issue seems likely to be particularly severe in many massive data sets, which often consist of many subcollections of data, each collected according to a particular choice of sampling criterion and with little control over the overall composition.
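A small simulation makes the danger concrete. In this hypothetical Python sketch, items enter the sample with probability proportional to their size, and the naive sample mean lands far from the population mean; the numbers are illustrative only.

# A hypothetical illustration of sampling bias: items are sampled with
# probability proportional to their size, so the naive sample mean drifts
# far above the true population mean.
import random

random.seed(0)
population = [random.expovariate(1.0) for _ in range(100_000)]
true_mean = sum(population) / len(population)

# Sampling criterion that favors "larger" items over "smaller" ones:
biased_sample = random.choices(population, weights=population, k=10_000)
biased_mean = sum(biased_sample) / len(biased_sample)

print(f"population mean:  {true_mean:.2f}")    # close to 1.0
print(f"size-biased mean: {biased_mean:.2f}")  # close to 2.0: inference fails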
Another major issue is "provenance." Many systems
involve layers of inference, where “data” are not
the original observations but are the products of an
inferential procedure of some kind. This often
occurs, for example, when there are missing entries
in the original data. In a large system involving
interconnected inferences, it can be difficult to
avoid circularity, which can introduce additional
biases and can amplify noise. Finally, there is the
major issue of controlling error rates when many
hypotheses are being considered. Indeed, massive
data sets generally involve growth not merely in
the number of individuals represented (the “rows”
of the database) but also in the number of
descriptors of those individuals (the “columns” of
the database). Moreover, we are often interested in
the predictive ability associated with combinations
of the descriptors; this can lead to exponential
growth in the number of hypotheses considered,
with severe consequences for error rates. That is, a
naive appeal to a “law of large numbers” for
massive data is unlikely to be justified; if anything,
the perils associated with statistical fluctuations
may actually increase as data sets grow in size.
While the field of statistics has developed tools
that can address such issues in principle, in the
context of massive data care must be taken with all
such tools for two main reasons: (1) all statistical
tools are based on assumptions about
characteristics of the data set and the way it was
sampled, and those assumptions may be violated in
the process of assembling massive data sets; and
(2) tools for assessing errors of procedures, and for
diagnostics, are themselves computational
procedures that may be computationally infeasible
as data sets move into the massive scale.
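To illustrate the error-rate problem above, the short Python simulation below tests thousands of pure-noise columns: without correction, roughly alpha times the number of columns come back as spurious "findings," while a Bonferroni-style threshold suppresses nearly all of them. The per-column z-test and the thresholds are assumptions chosen for illustration; the text names the problem, not this specific remedy.

# Simulated multiple testing on pure noise: every "discovery" is false.
import math
import random
import statistics

random.seed(1)
n_rows, n_cols = 200, 5_000
z_alpha = 1.96        # two-sided threshold at alpha = 0.05
z_bonferroni = 4.42   # approx. two-sided threshold at alpha / n_cols

naive_hits = corrected_hits = 0
for _ in range(n_cols):
    column = [random.gauss(0.0, 1.0) for _ in range(n_rows)]
    z = statistics.mean(column) / (statistics.stdev(column) / math.sqrt(n_rows))
    naive_hits += abs(z) > z_alpha
    corrected_hits += abs(z) > z_bonferroni

print(naive_hits)      # roughly alpha * n_cols = 250 spurious "findings"
print(corrected_hits)  # almost always 0 after the correction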
In spite of the cautions raised above, the
Committee on the Analysis of Massive Data
believes that many of the challenges involved in
performing inference on massive data can be
confronted usefully.
While large businesses in the past have
used relational databases, these do not scale well to
such extreme sizes. Industries dealing with big
data are reacting to data that is more distributed,
heterogeneous, and generated from a variety of
sources. This is leading to new approaches for data
analysis and the demand for new computing
approaches. Various innovative data management solutions have emerged. These models work well in
the commercial setting, where enormous resources
are spent on harvesting and collecting the data
through actions such as Internet crawling, aerial
photos for geospatial information systems, or
collecting user data in search engines. Some of the
technical trends that have been occurring to
address the data challenges include the following:
• Distributed systems (access, federation, linking,
etc.),
• Technologies (MapReduce algorithms, cloud computing, workflow, etc.; a toy MapReduce sketch follows this list),
• Scalable infrastructures for data- and compute-intensive applications,
• Service-oriented architectures,
• Ontologies and models for information representation,
• Scalable database systems with different underlying models (relational to triple stores),
• Federated data security mechanisms, and
• Technologies for moving large data sets.
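As referenced in the list above, the MapReduce pattern splits work into a map phase, a shuffle that groups intermediate pairs by key, and a reduce phase. The minimal Python sketch below simulates all three phases in-process for a word count; real deployments use frameworks such as Hadoop or Spark, so this is only an illustration of the structure, not any system described in this paper.

# Toy MapReduce: map -> shuffle -> reduce, simulated in a single process.
from collections import defaultdict

def map_phase(document):
    """Map: emit (word, 1) for every word in the document."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all intermediate values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each key's values into a single count."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data analysis", "big data for knowledge management"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
print(reduce_phase(shuffle(pairs)))
# {'big': 2, 'data': 2, 'analysis': 1, 'for': 1, 'knowledge': 1, 'management': 1}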
Many of these technologies are being used to drive
toward more systematic approaches. Rather than
constructing one large database, the general
concept is to enable analysis by bringing together a
variety of tools that allow for capture, preparation,
management, access, and distribution of data. This
collection of tools is configured as a series of steps
that constitute a complex workflow for generating
and distributing data sets.
4.1 The new KM model
For the past decade or so, businesses have
often categorised data according to a traditional
knowledge management (KM) model known as the
DIKW hierarchy (data, information, knowledge,
wisdom). In this model, each level is built from
elements contained in the previous level. But in the
context of Big Data, this needs to be extended to
more accurately reflect organisations’ need to gain
business value from their (and others’) data. A
better model might be:
1. Integrated data — data that is connected and
reconnected to make it more valuable
2. Actionable information — information put into
the hands of those that can use it
3. Insightful knowledge — knowledge that
provides real insight (i.e. not just a stored
document)
4. Real-time wisdom — getting the answer now,
not next week.
Of course, some organisations have put
significant investment into traditional knowledge
management systems and processes. So in regard
to KM and its relationship with Big Data, it is
worth noting the following:
1. KM is an enabler for Big Data, but not the
goal.
2. KM activities achieve better outcomes for
structured data than for unstructured or semi-structured data.
3. The principles of KM are still important but
they need to be interpreted in new ways for
the new types of data being processed.
4. KM focuses much effort on storing all data,
but that is not always the focus with Big Data,
particularly when analysing ‘in-flight’
(transient) data.
In that sense Big Data has a librarian’s
focus. The archivist wants to store data but is less
interested in making it accessible. The librarian is
less interested in storing data as long as he or she
has access to it and can provide the information
that their clients need.
5. References
1. Berson, A., Smith, S.J. & Thearling, K. (1999). Building Data Mining Applications for CRM. New York: McGraw-Hill.
2. Dawei, J. (2011). IEEE Computer Society, 58, 79.
3. Dalkir, K. (2005). Knowledge Management in Theory and Practice. Boston: Butterworth-Heinemann.
4. Ang, X. & Wang, W. (2010). A literature review. 138-141.
5. Lavrac, N., Bohanec, M., Pur, A., Cestnik, B., Debeljak, M. & Kobler, A. (2007). Journal of Biomedical Informatics, 40, 438-447.
6. Hwang, H.G., Chang, I.C., Chen, F.J. & Wu, S.Y. (2008). Expert Systems with Applications, 34(1), 725-733.
7. Liao, S.H., Chen, C.M. & Wu, C.H. (2008). Expert Systems with Applications, 34(3), 1763-1776.
8. Cheng, H., Lu, Y. & Sheu, C. (2009). Expert Systems with Applications, 36, 3614-3622.
9. Li, X., Zhu, Z. & Pan, X. (2010). Procedia Computer Science, 1(1), 2479-2488.
10. Li, Y., Kramer, M.R., Beulens, A.J.M. & Van Der Vorst, J.G.A.J. (2010). Computers in Industry, 61, 852-862.
11. Cantú, F.J. & Ceballos, H.G. (2010). Expert Systems with Applications, 37(7), 5272-5284.
12. Tipawan Silwattananusarn & Kulthida Tuamsuk (2012). International Journal of Data Mining & Knowledge Management Process (IJDKP), 2(5).