www.iaard.net eISSN: 2455-3204
International Association of Advances in Research and Development
International Journal of Computer Science and Information Technology, 2015, 1(1), 1-5

Research Article

Big Data Analysis and Its Applications for Knowledge Management

L. Velmurugan*1, P. Sasikumar2, Alema Gebre3, Tilahun A.4
1 Professor, Computer Science, Institute of Technology, AMBO University, Ethiopia
2 Assistant Professor, Information Technology, Institute of Technology, AMBO University, Ethiopia
3 Head, Computer Science, Institute of Technology, AMBO University, Ethiopia
4 Head, Information Technology, Institute of Technology, AMBO University, Ethiopia

Abstract: Big data analysis is now one of the most important steps in the knowledge discovery in databases process and is considered a significant subfield of knowledge management. Research in big data analysis will continue to grow in business and in learning organizations over the coming decades. This review paper surveys the applications of big data techniques that have been developed to support the knowledge management process. Journal articles indexed in the Science Direct database from 2007 to 2012 are analyzed and classified. The discussion of the findings is divided into four topics: (i) knowledge resources; (ii) knowledge types and/or knowledge datasets; (iii) big data analysis techniques; and (iv) applications used in knowledge management. The article briefly presents the definition of big data and its functionality. Then the knowledge management rationale and the major knowledge management tools integrated in the knowledge management cycle are described. Finally, the applications of big data techniques in the process of knowledge management are summarized and discussed.

1. Introduction
Modern data-mining applications, often called "big-data" analysis, require us to manage immense amounts of data quickly. In many of these applications the data is extremely regular, and there is ample opportunity to exploit parallelism. Experiments, observations, and numerical simulations in many areas of science nowadays generate terabytes of data and, in some cases, are on the verge of generating many petabytes. This rapid growth heralds an era of "data-centric science," which requires new paradigms addressing how data are captured, processed, discovered, exchanged, distributed, and analyzed. Many business organizations have gathered and stored immense amounts of data. However, they are unable to discover the valuable information hidden in that data by transforming it into useful knowledge [1]. Managing knowledge resources can be a challenge. Many organizations therefore employ information technology in knowledge management to aid the creation, sharing, integration, and distribution of knowledge. Knowledge management is a process of data usage [2]. The basis of data mining is the process of using tools to extract useful knowledge from large datasets; data mining is thus an essential part of knowledge management [2]. Wang & Wang (2008) point out that data mining can be useful for KM in two main ways: (i) to share common knowledge of the business intelligence (BI) context among data miners, and (ii) to use data mining as a tool to extend human knowledge.

2. Challenges to the analysis of massive data
A number of challenges exist in both data management and data analysis that require new approaches to support the "big data" era. These challenges span the generation of the data, its preparation for analysis, and policy-related challenges in its sharing and use. Initiatives in research and development that are leading to improved capabilities include the following:
• Dealing with highly distributed data sources,
• Tracking data provenance, from data generation through data preparation,
• Validating data,
• Coping with sampling biases and heterogeneity,
• Working with different data formats and structures,
• Developing algorithms that exploit parallel and distributed architectures,
• Ensuring data integrity,
• Ensuring data security,
• Enabling data discovery and integration,
• Enabling data sharing,
• Developing methods for visualizing massive data, and
• Developing scalable and incremental algorithms.
As data volumes increase, the ability to perform analysis on the data is constrained by the increasingly distributed nature of modern data sets. Highly distributed data sources present challenges due to the diverse natures of the underlying technical infrastructures, creating difficulties in data access, integration, and sharing. The distributed setting also creates additional challenges because of the limitations of moving massive data through channels with limited bandwidth. In addition, data produced by different sources are often defined using different representation methods and structural specifications. Bringing such data together becomes a challenge because the data are not properly prepared for integration and fusion, and the technical infrastructures lack the information infrastructure services needed to support analysis of the data if it remains distributed. Statistical inference procedures often require some form of aggregation that can be expensive in distributed architectures, and a major challenge involves finding cheaper approximations for such procedures, as the sketch below illustrates. Finally, security and policy issues also limit the ability to share data. Yet the ever-increasing generation of data from medicine, physical science, defense, and other industries requires that analysis be performed on data that are captured and managed across distributed databases.
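To make the aggregation point concrete, the following is a minimal sketch (in Python, which this paper does not use; the node names and values are hypothetical) of one cheap distributed aggregation strategy: each node reduces its records to a small mergeable summary, and only those summaries, not the raw records, cross the network.

```python
# Sketch: cheap distributed aggregation via mergeable summaries.
# Each node reduces its local records to a small (sum, count) pair;
# only these pairs are moved and merged, not the raw data.
# Node names and data here are hypothetical placeholders.

def local_summary(records):
    """Reduce one node's records to a mergeable partial summary."""
    return (sum(records), len(records))

def merge(summaries):
    """Combine per-node summaries into a global mean."""
    total = sum(s for s, _ in summaries)
    count = sum(c for _, c in summaries)
    return total / count if count else float("nan")

node_data = {
    "node_a": [4.0, 8.0, 15.0],
    "node_b": [16.0, 23.0],
    "node_c": [42.0],
}
partials = [local_summary(recs) for recs in node_data.values()]
print(merge(partials))  # global mean computed without centralizing records
```

Mergeable summaries of this kind (sums, counts, approximate sketches) are one common way to work around the bandwidth limits described above.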
3. Knowledge management
There are various concepts of knowledge management. In this paper we use the definition of knowledge management by McInerney (2002): "Knowledge management (KM) is an effort to increase useful knowledge within the organization. Ways to do this include encouraging communication, offering opportunities to learn, and promoting the sharing of appropriate knowledge artifacts." This definition emphasizes the interaction aspect of knowledge management and organizational learning. The knowledge management process focuses on knowledge flows and the processes of creating, sharing, and distributing knowledge (Figure 1) [3]. Each of the knowledge cycle stages of capture and creation, sharing and dissemination, and acquisition and application can be facilitated by information technology.

[Figure 1. KM technologies integrated in the KM cycle (source: Dalkir, K., 2005).]

Liao (2003) classifies KM technologies into seven categories (listed in Section 3.1). As technologies play an important role in KM, they stand to be a necessary tool for KM usage [4].
Thus, KM requires technologies to facilitate communication, collaboration, and content for better knowledge capture, sharing, dissemination, and application [5].

3.1 Knowledge Management: Capture and Creation Tools
This section provides an overview of a classification of KM technologies as tools, focusing on tools for the capture and creation of knowledge. Liao (2003) classifies KM technologies using seven categories:
1. KM Framework
2. Knowledge-Based Systems (KBS)
3. Data Mining
4. Information and Communication Technology
5. Artificial Intelligence (AI)/Expert Systems (ES)
6. Database Technology (DT)
7. Modeling
Ruggles et al. (1997) classify KM technologies as tools that generate knowledge (e.g., data mining), codify knowledge, and transfer knowledge. Dalkir (2005) classifies KM tools according to the phase of the KM cycle.

3.2 Knowledge Types
This section describes knowledge types in eight organizational domains for the data mining collaboration process in knowledge creation.
• Health-care system domain: the dataset was composed of three databases: the health-care providers' database, the out-patient health-care statistics database, and the medical status database [5]. Another data source was hospital inpatient medical records [6].
• Construction industry domain: a sample dataset took the form of Post Project Reviews (PPRs) defining good or bad information. Multiple Key Term Phrasal Knowledge Sequences (MKTPKS) were generated through text mining and used as an essential part of the text analysis in classifying the text documents.
• Retailing domain: customer data and the products purchased were collected and stored in databases to mine whether customers' purchase habits and behaviour affect product line and brand extensions [7].
• Financial domain: two datasets were used: (i) to identify bond ratings, knowledge sets contained strings of data, models, parameters, and reports for each analytical study; and (ii) to predict rating changes of bonds, cluster data of bond features as well as the model parameters were stored, classified, and applied to rating predictions [8].
• Small and Middle Businesses (SMBs) domain: knowledge types in small and middle businesses, in the case of a food company, related to the corporate conditions or goals of the problem across all departments; a decision-system platform was developed, and a knowledge tree was then formed to find relations by a human-computer interaction method and to optimize the decision-making process [9]. To solve food supply chain network problems, Li et al. (2010) developed the EW&PC prototype, whose major components were: (i) a knowledge base, (ii) a task classifier and template approaches, (iii) a DM-methods library with an expert system for method selection, (iv) an explorer and predictor, and (v) a user interface [10]. This system built decision support models and helped managers accomplish decision-making.
• Research assets domain: Cantú & Ceballos (2010) focused on managing knowledge assets by applying a knowledge and information network (KIN) approach. This platform contained three component types: research products, human resources or intellectual capital, and research programs. The various types of research assets were handled on domain ontologies and databases [11].
• Business domain: two types of knowledge attributes were used: condition attributes and a decision attribute. The condition attributes comprised four independent attributes: the KM purpose, the explicit-oriented degree, the tacit-oriented degree, and the success factor. The decision attribute was a single dependent attribute, the KM performance.
• Collaboration and teamwork domain: the dataset came from a research laboratory in a research institute. It covered 14 knowledge workers and 424 research documents, together with a workers' log that recorded when each document was accessed and which documents the workers needed [12]. From the workers' log, two levels of knowledge flow were generated: codified-level knowledge flow and topic-level knowledge flow [12]. The two types of knowledge flow were determined to describe a worker's needs. To collect the knowledge flow, the documents in the dataset were categorized into eight clusters by a data mining clustering approach [12] (a minimal sketch of this kind of clustering follows this list).
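The cited study does not specify how the eight document clusters were produced; the following is a minimal sketch of one common approach, k-means over bag-of-words features. scikit-learn and the sample documents are assumptions made for illustration, not the study's actual tooling or data.

```python
# Sketch: clustering documents into topic groups, as in the
# collaboration-domain example (eight clusters in the cited study).
# scikit-learn is assumed here; the paper does not name a tool,
# and the documents below are hypothetical placeholders.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "data mining for knowledge discovery",
    "knowledge discovery in large databases",
    "supply chain decision support systems",
    "decision support for food supply networks",
]

vectors = TfidfVectorizer().fit_transform(documents)  # bag-of-words features
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
for doc, label in zip(documents, labels):
    print(label, doc)  # documents sharing vocabulary land in one cluster
```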
4. Big data techniques / applications used in knowledge management
Experiments, observations, and numerical simulations in many areas of science and business are currently generating terabytes of data, and in some cases are on the verge of generating petabytes and beyond. Analyses of the information contained in these data sets have already led to major breakthroughs in fields ranging from genomics to astronomy and high-energy physics, and to the development of new information-based industries. Traditional methods of analysis have been based largely on the assumption that analysts can work with data within the confines of their own computing environment, but the growth of "big data" is changing that paradigm, especially in cases in which massive amounts of data are distributed across locations. While the scientific community and the defence enterprise have long been leaders in generating and using large data sets, the emergence of e-commerce and massive search engines has led other sectors to confront the challenges of massive data. For example, Google, Yahoo!, Microsoft, and other Internet-based companies have data that is measured in exabytes (10^18 bytes). Social media (e.g., Facebook, YouTube, Twitter) have exploded beyond anyone's wildest imagination, and today some of these companies have hundreds of millions of users. Data mining of these massive data sets is transforming the way we think about crisis response, marketing, entertainment, cybersecurity, and national intelligence. It is also transforming how we think about information storage and retrieval. Collections of documents, images, videos, and networks are being thought of not merely as bit strings to be stored, indexed, and retrieved, but as potential sources of discovery and knowledge, requiring sophisticated analysis techniques that go far beyond classical indexing and keyword counting and that aim to find relational and semantic interpretations of the phenomena underlying the data. One can envision a health-care system in which increasingly detailed data are maintained for each individual (including genomic, cellular, and environmental data) and in which such data can be combined with data from other individuals and with results from fundamental biological and medical research, so that optimized treatments can be designed for each individual.
One can also envision numerous business opportunities that combine knowledge of preferences and needs at the level of single individuals with fine-grained descriptions of goods, skills, and services to create new markets. It is natural to be optimistic about these prospects. Several decades of research and development in databases and search engines have yielded a wealth of relevant experience in the design of scalable data-centric technology. In particular, these fields have fuelled the advent of cloud computing and other parallel and distributed platforms that seem well suited to massive data analysis. Moreover, innovations in the fields of machine learning, data mining, statistics, and the theory of algorithms have yielded data-analysis methods that can be applied to ever-larger data sets. However, such optimism must be tempered by an understanding of the major difficulties that arise in attempting to achieve the envisioned goals. In part, these difficulties are those familiar from implementations of large-scale databases: finding and mitigating bottlenecks, achieving simplicity and generality of the programming interface, propagating metadata, designing a system that is robust to hardware failure, and exploiting parallel and distributed hardware, all at an unprecedented scale. But the challenges for massive data go beyond the storage, indexing, and querying that have been the province of classical database systems (and classical search engines) and instead hinge on the ambitious goal of inference. Inference is the problem of turning data into knowledge, where knowledge is often expressed in terms of entities that are not present in the data per se but are present in the models one uses to interpret the data. Statistical rigor is necessary to justify the inferential leap from data to knowledge, and many difficulties arise in attempting to bring statistical principles to bear on massive data. Overlooking this foundation may yield results that are not useful at best, and harmful at worst. In any discussion of massive data and inference, it is essential to be aware that it is quite possible to turn data into something resembling knowledge when it actually is not, and it can be quite difficult to know that this has happened. Indeed, many issues impinge on the quality of inference. A major one is "sampling bias." Data may have been collected according to a certain criterion (for example, in a way that favors "larger" items over "smaller" items), but the inferences and decisions made may refer to a different sampling criterion. This issue seems likely to be particularly severe in many massive data sets, which often consist of many subcollections of data, each collected according to a particular choice of sampling criterion and with little control over the overall composition. Another major issue is "provenance." Many systems involve layers of inference, where the "data" are not the original observations but are the products of some inferential procedure. This often occurs, for example, when there are missing entries in the original data. In a large system involving interconnected inferences, it can be difficult to avoid circularity, which can introduce additional biases and amplify noise. Finally, there is the major issue of controlling error rates when many hypotheses are being considered. Indeed, massive data sets generally involve growth not merely in the number of individuals represented (the "rows" of the database) but also in the number of descriptors of those individuals (the "columns" of the database). Moreover, we are often interested in the predictive ability associated with combinations of the descriptors; this can lead to exponential growth in the number of hypotheses considered, with severe consequences for error rates. That is, a naive appeal to a "law of large numbers" for massive data is unlikely to be justified; if anything, the perils associated with statistical fluctuations may actually increase as data sets grow in size.
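To see why error rates need explicit control, consider the Bonferroni correction, one of the standard statistical tools the next paragraph alludes to. This is an illustrative sketch with hypothetical p-values, not a procedure taken from the paper.

```python
# Sketch: controlling the family-wise error rate when testing many
# hypotheses at once. With m tests at level alpha, roughly alpha*m
# false positives are expected by chance alone; Bonferroni compares
# each p-value against alpha/m instead. P-values are hypothetical.

def bonferroni(p_values, alpha=0.05):
    """Return indices of hypotheses still significant after correction."""
    m = len(p_values)
    return [i for i, p in enumerate(p_values) if p <= alpha / m]

p_values = [0.0004, 0.021, 0.049, 0.38, 0.0012]
print(bonferroni(p_values))  # only the strongest results survive: [0, 4]
```

As the number of descriptor combinations grows, the per-test threshold alpha/m shrinks accordingly, which is exactly the tension between scale and statistical power described above.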
While the field of statistics has developed tools that can address such issues in principle, in the context of massive data care must be taken with all such tools, for two main reasons: (1) all statistical tools are based on assumptions about the characteristics of the data set and the way it was sampled, and those assumptions may be violated in the process of assembling massive data sets; and (2) tools for assessing the errors of procedures, and for diagnostics, are themselves computational procedures that may become computationally infeasible as data sets move to massive scale. In spite of the cautions raised above, the Committee on the Analysis of Massive Data believes that many of the challenges involved in performing inference on massive data can be confronted usefully. While large businesses have used relational databases in the past, these do not scale well to such extreme sizes. Industries dealing with big data are reacting to data that is more distributed, more heterogeneous, and generated from a greater variety of sources. This is leading to new approaches to data analysis and a demand for new computing approaches. Various innovative data management solutions have emerged. These models work well in the commercial setting, where enormous resources are spent on harvesting and collecting data through actions such as Internet crawling, aerial photography for geospatial information systems, or collecting user data in search engines. Some of the technical trends that have been occurring to address the data challenges include the following:
• Distributed systems (access, federation, linking, etc.),
• Technologies (MapReduce algorithms, cloud computing, workflow, etc.; a toy MapReduce sketch follows this list),
• Scalable infrastructures for data- and compute-intensive applications,
• Service-oriented architectures,
• Ontologies and models for information representation,
• Scalable database systems with different underlying models (relational to triple stores),
• Federated data security mechanisms, and
• Technologies for moving large data sets.
Many of these technologies are being used to drive toward more systematic approaches. Rather than constructing one large database, the general concept is to enable analysis by bringing together a variety of tools that allow for the capture, preparation, management, access, and distribution of data. This collection of tools is configured as a series of steps that constitute a complex workflow for generating and distributing data sets.
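Since the trends list above mentions MapReduce, here is a toy single-process sketch of the pattern: map emits key-value pairs, a shuffle groups them by key, and reduce aggregates each group. A real deployment distributes these phases across a cluster; everything here is illustrative.

```python
# Sketch: the MapReduce pattern in miniature (word count).
# A real MapReduce run distributes map and reduce tasks across a
# cluster; this single-process version only shows the data flow.
from collections import defaultdict

def map_phase(document):
    return [(word, 1) for word in document.split()]  # emit (key, value)

def shuffle(pairs):
    groups = defaultdict(list)                       # group values by key
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

documents = ["big data big analysis", "data mining"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
print(reduce_phase(shuffle(pairs)))
# {'big': 2, 'data': 2, 'analysis': 1, 'mining': 1}
```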
4.1 The new KM model
For the past decade or so, businesses have often categorised data according to a traditional knowledge management (KM) model known as the DIKW hierarchy (data, information, knowledge, wisdom). In this model, each level is built from elements contained in the previous level. But in the context of Big Data, the model needs to be extended to reflect more accurately organisations' need to gain business value from their own (and others') data. A better model might be:
1. Integrated data: data that is connected and reconnected to make it more valuable
2. Actionable information: information put into the hands of those who can use it
3. Insightful knowledge: knowledge that provides real insight (i.e., not just a stored document)
4. Real-time wisdom: getting the answer now, not next week.
Of course, some organisations have invested significantly in traditional knowledge management systems and processes. So with regard to KM and its relationship with Big Data, it is worth noting the following:
1. KM is an enabler for Big Data, but not the goal.
2. KM activities achieve better outcomes for structured data than for unstructured or semi-structured data.
3. The principles of KM are still important, but they need to be interpreted in new ways for the new types of data being processed.
4. KM focuses much effort on storing all data, but that is not always the focus with Big Data, particularly when analysing 'in-flight' (transient) data. In that sense Big Data has a librarian's focus rather than an archivist's: the archivist wants to store data but is less interested in making it accessible, whereas the librarian is less interested in storing data as long as he or she has access to it and can provide the information that clients need.

5. References
1. Berson, A., Smith, S.J. & Thearling, K. (1999). Building Data Mining Applications for CRM. New York: McGraw-Hill.
2. Dawei, J. (2011). IEEE Computer Society, 58, 79.
3. Dalkir, K. (2005). Knowledge Management in Theory and Practice. Boston: Butterworth-Heinemann.
4. Ang, X. & Wang, W. (2010). A literature review. 138-141.
5. Lavrac, N., Bohanec, M., Pur, A., Cestnik, B., Debeljak, M. & Kobler, A. (2007). Journal of Biomedical Informatics, 40, 438-447.
6. Hwang, H.G., Chang, I.C., Chen, F.J. & Wu, S.Y. (2008). Expert Systems with Applications, 34(1), 725-733.
7. Liao, S.H., Chen, C.M. & Wu, C.H. (2008). Expert Systems with Applications, 34(3), 1763-1776.
8. Cheng, H., Lu, Y. & Sheu, C. (2009). Expert Systems with Applications, 36, 3614-3622.
9. Li, X., Zhu, Z. & Pan, X. (2010). Procedia Computer Science, 1(1), 2479-2488.
10. Li, Y., Kramer, M.R., Beulens, A.J.M. & Van Der Vorst, J.G.A.J. (2010). Computers in Industry, 61, 852-862.
11. Cantú, F.J. & Ceballos, H.G. (2010). Expert Systems with Applications, 37(7), 5272-5284.
12. Silwattananusarn, T. & Tuamsuk, K. (2012). International Journal of Data Mining & Knowledge Management Process (IJDKP), 2(5).