International Journal of Scientific Research and Education, Volume 3, Issue 4, Pages 3323-3330, April 2015, ISSN (e): 2321-7545
Website: http://ijsae.in

Upshot of Knowledge Discovery from Big Data

Authors: Shweta M Nirmanik, Preethi S
Department of Computer Science and Engineering, CiTech, Bangalore-36
Email: [email protected]

ABSTRACT
Knowledge discovery is the process of extracting new knowledge from large, complex databases. It is a non-trivial process that identifies valid, useful and understandable patterns from a variety of data. Big data refers to the collection and processing of very large data sets. In today's world it is difficult to understand and manage such rapidly growing data, so there is a need to extract useful information from it. In this paper we present three principles used in knowledge discovery from databases, the issues in the data that have to be solved, and the knowledge discovery process. Finally, we discuss an efficient association rule mining algorithm and the machine learning tools used in knowledge discovery.

Keywords: Big Data, Knowledge Discovery, Machine Learning, Mining Algorithm, Principles of Knowledge Discovery.

1. INTRODUCTION
Knowledge discovery from data (KDD) is the non-trivial extraction of useful information from data. The growth in the size of data exceeds human ability to analyze it, hence the need for knowledge discovery. Big data is the collection and processing of very large data sets. This paper first gives an overview of three principles used in knowledge discovery; these principles can guide the development of useful and flexible data analysis pipelines. Next, it covers the issues in databases, such as noisy and dynamic data in large volumes, that have to be solved, and then the process used for knowledge discovery to produce effective results. Finally, it presents an association rule mining algorithm that deals with time-changing data and the machine learning tools used in knowledge discovery to handle large volumes of data.

2. THREE PRINCIPLES OF KNOWLEDGE DISCOVERY
Three principles for knowledge discovery from big data have been presented by ORNL (Oak Ridge National Laboratory), USA. ORNL cooperates with several state and federal agencies on big data projects. The three principles are as follows.

A. Support a Variety of Analysis Methods
Knowledge discovery uses distributed programming, data mining, machine learning, visualization and human-computer interaction, so a system should support a range of tools rather than restricting the user to a limited set. Statistical analysis deals with summarizing large data sets (for example averages, minima and maxima) and with defining models for prediction. Data mining is the automatic discovery of useful models and patterns in large data sets. Machine learning combines data mining and statistical analysis by allowing machines to understand a set of data. Visualization and visual analysis allow the user to understand the data and discover interesting relationships.

B. One Size Does Not Fit All
KDD has to provide storage and processing of data at all stages of the pipeline. A single storage mechanism may be efficient for small data volumes but problematic for large-scale analysis. The first stage in the pipeline is data preparation and batch analysis; at this stage the data may contain errors and may not be in a usable form. Hadoop is well suited to this stage. Hadoop is a collection of open-source software based on Google's BigTable and the Google File System, and it combines a MapReduce processing component with a scalable storage component (a minimal sketch of the map/reduce model appears below). On top of Hadoop, Hive and HBase are available: Hadoop can process structured data using Hive, and semi-structured data such as hierarchical documents, graphs and geospatial data using HBase and Cassandra.
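To make the map/reduce programming model that Hadoop provides more concrete, the following is a minimal sketch in plain Python rather than on a Hadoop cluster; the mapper, reducer and sample records are illustrative assumptions and not part of the original paper.

```python
from collections import defaultdict
from typing import Iterable, Tuple

# Minimal illustration of the MapReduce model that Hadoop implements.
# In Hadoop the framework distributes these phases across a cluster;
# here they run locally on a small in-memory data set.

def mapper(record: str) -> Iterable[Tuple[str, int]]:
    """Map phase: emit (word, 1) for every word in a record."""
    for word in record.lower().split():
        yield word, 1

def reducer(key: str, values: Iterable[int]) -> Tuple[str, int]:
    """Reduce phase: sum the counts emitted for one key."""
    return key, sum(values)

def run_mapreduce(records: Iterable[str]) -> dict:
    # Shuffle step: group intermediate values by key before reducing.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(k, vs) for k, vs in groups.items())

if __name__ == "__main__":
    sample = ["big data needs big storage", "knowledge discovery from big data"]
    print(run_mapreduce(sample))  # e.g. {'big': 3, 'data': 2, ...}
```

The point of the model is that the mapper and reducer are stateless per record and per key, which is what lets Hadoop scale them out over a cluster.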
C. Make Data Accessible
The two principles above concern data analysis and organization; this one concerns the end product. Results should be presented to the user in a well-supported framework. To accomplish this, use open and popular standard protocols, use lightweight architectures, create rich applications on demand, and expose results through APIs so that users have flexible ways to interact with the data system.

3. ISSUES OF THE DATABASE THAT HAVE TO BE SOLVED
This section gives a brief description of several issues in the data that have to be solved. Dynamic data: the contents of the data change over time. Irrelevant fields: whether a field is relevant to the current discovery task. Missing fields: the presence or absence of values in the data used for discovery. Noise: errors in the data, for example values inconsistent with the data type assigned to a field. The discovered knowledge must also take an appropriate form: inter-field patterns relate values of fields within the same record, inter-record patterns relate values aggregated over groups of records (clusters), quantitative discovery relates numeric field values, and the representation must be in a format understandable to the end user.

4. KNOWLEDGE DISCOVERY PROCESS (KDP)
The process is iterative and interactive, and it may be necessary to move back to previous steps. The KDP can be based on the application domain and particular business objectives. Here we describe three KDP models, each of which involves several steps: the Fayyad et al. KDP model, the industrial model, and a hybrid model.

A. Fayyad et al. KDP Model
This model consists of nine steps (Fig. 1: Knowledge discovery process), described as follows:
1. Developing an understanding of the application domain: The people in charge of KDD must understand and define the goals of the end user and the environment in which knowledge discovery takes place.
2. Creating a target data set: The data that will be used for knowledge discovery must be determined. This is an important stage because data mining works on this available data. At this stage the knowledge discovery data is integrated into one data set; missing attributes may lead to failure.
3. Data cleaning and preprocessing: This includes handling missing values, removing noise and outliers, and keeping account of changes in the data. It may involve data mining algorithms or statistical methods (a small sketch of this step, together with step 4, follows the list).
4. Data reduction and projection: This includes dimensionality reduction and attribute transformation to find useful attributes.
5. Choosing the data mining task: Having completed the steps above, the next steps relate to the data mining part. The data miner must decide the type of data mining to use, based on the goals defined in step 1. Common data mining tasks are regression, classification and clustering.
6. Choosing the data mining algorithm: The data miner must select the method to be used for searching for patterns. This step determines whether a particular data mining algorithm is appropriate for the problem at hand.
7. Data mining: Finally, the data mining algorithm is applied. It generates patterns in a representational form, such as classification rules, decision trees or regression models. The algorithm is run until a satisfactory result is obtained.
8. Interpreting the mined patterns: Analysts evaluate and interpret whether the mined patterns satisfy the rules and reliability factors with respect to the goals defined in the first step. Here, the analyst visualizes the extracted patterns and models.
9. Consolidating discovered knowledge: The last step is the usage of, and overall feedback on, the discovered results. This includes incorporating the discovered knowledge into the performance system, documenting and reporting it to the interested parties, and checking for and resolving conflicts.
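As a rough illustration of steps 3 and 4, the sketch below cleans and projects a small toy table with pandas; the column names, the outlier rule and the choice of pandas itself are assumptions made only for illustration, not part of the Fayyad et al. model.

```python
import pandas as pd

# Toy raw data containing the issues listed in Section 3:
# a missing value, an impossible (noisy) value, and an irrelevant field.
raw = pd.DataFrame({
    "age":    [34, None, 29, 41, 230, 38],     # None = missing, 230 = noisy outlier
    "income": [42000, 51000, None, 61000, 58000, 47000],
    "id":     [1, 2, 3, 4, 5, 6],              # irrelevant field for mining
})

# Step 3: handle missing values and remove obvious outliers.
clean = raw.copy()
clean["age"] = clean["age"].fillna(clean["age"].median())
clean["income"] = clean["income"].fillna(clean["income"].median())
clean = clean[clean["age"].between(0, 120)]    # drop impossible ages

# Step 4: reduction/projection -- keep only attributes useful for the goal.
target = clean.drop(columns=["id"])
print(target)
```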
B. Industrial Model
The CRISP-DM KDP model consists of six steps, described as follows:
1. Business understanding: This is the first step in the KDP. It involves understanding the objectives and requirements from a business perspective.
2. Data understanding: Once the objectives are known, the next step is to collect the initial data. This step involves description, exploration and verification of the data.
3. Data preparation: At this stage selection, cleansing, construction, integration and formatting of the data are done.
4. Modeling: This step involves selecting a modeling technique, generating a test design, creating models and assessing the generated models.
5. Evaluation: Analysts evaluate the results, review the process and determine the next step.
6. Deployment: In this final step the discovered knowledge must be presented in a way the customer can use.

C. Hybrid Model
This model is a combination of the Fayyad et al. KDP model and the industrial model. It has six steps, as follows:
1. Understanding the problem domain: This first step defines the problem domain and determines the goals of the project.
2. Understanding the data: This step includes collecting sample data and verifying the usefulness of the data with respect to the goals defined in the first step.
3. Preparation of the data: This step decides which data will be used as input. It involves removing noisy data and data cleaning; the cleaned data is processed by the extraction algorithm and summarized. The final result must meet the requirements specified in the first step.
4. Data mining: The data miner decides on the data mining methods for processing the cleaned data.
5. Evaluation of the discovered knowledge: Analysts check whether the discovered knowledge is appropriate. At this stage the entire process is revisited to decide which steps should be repeated to improve the results.
6. Use of the discovered knowledge: The final step includes deciding where and how the discovered knowledge can be used. Implementation and documentation of the full project are done here.

5. ASSOCIATION RULE MINING ALGORITHM
An incremental model that reflects changing data and user beliefs is useful for making knowledge discovery in big data more effective and efficient. Classical association rule mining does not consider the time at which data arrives; in that setting, a combination of old and new data is used to build a new model from scratch and previously discovered knowledge (PDK) becomes invalid, so incremental models have been designed.

A. Apriori Algorithm
Apriori mines association rules. Mining association rules has two sub-problems: first, generating all large (frequent) itemsets in the database, and second, generating association rules from the large itemsets produced in the first step. The Apriori algorithm is shown in Fig. 2.
(Fig. 2: Apriori algorithm.)
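Fig. 2 of the paper gives the Apriori pseudocode; the following compact Python sketch of level-wise frequent itemset generation and rule extraction is one possible rendering of that idea. The toy transactions and the support and confidence thresholds are illustrative assumptions.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return frequent itemsets (frozensets) with support >= min_support.

    Level-wise search: candidates of size k are joined from frequent
    (k-1)-itemsets, pruned, then counted against the transactions.
    """
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {s: support(s) for s in items if support(s) >= min_support}
    result = dict(frequent)
    k = 2
    while frequent:
        prev = list(frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune candidates with an infrequent (k-1)-subset (downward closure).
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        frequent = {c: support(c) for c in candidates if support(c) >= min_support}
        result.update(frequent)
        k += 1
    return result

def rules(frequent, min_conf):
    """Generate association rules A -> B with confidence >= min_conf."""
    out = []
    for itemset, supp in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                conf = supp / frequent[antecedent]
                if conf >= min_conf:
                    out.append((set(antecedent), set(itemset - antecedent), conf))
    return out

if __name__ == "__main__":
    data = [["bread", "milk"], ["bread", "diapers", "beer"],
            ["milk", "diapers", "beer"], ["bread", "milk", "diapers"]]
    freq = apriori(data, min_support=0.5)
    for a, b, c in rules(freq, min_conf=0.6):
        print(a, "->", b, f"(conf={c:.2f})")
```

The pruning step, which discards any candidate with an infrequent subset before counting, is what keeps the level-wise search tractable on larger databases.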
B. Incremental Algorithm
This algorithm is similar to Apriori, except that after each frequent itemset generation a "shocking" interestingness measure is computed with respect to the existing model Ti, and uninteresting items that are not significant in the current training set are pruned.
(Fig. 3: Incremental association rule mining algorithm.)

6. MACHINE LEARNING TOOLS
Knowledge discovery is characterized into two parts: descriptive knowledge discovery and predictive knowledge discovery. The probabilistic approach uses graphical representation models; visualization tools are a good example of graphical representation. The statistical approach uses rule discovery and is based on data relationships; online analytical processing (OLAP) is a good example of the statistical approach. Deviation and trend analysis filters important patterns from temporal data. Classification approaches use decision trees, pattern discovery and data cleaning models.
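As a small, hedged illustration of the classification approach just mentioned, the sketch below fits a decision tree to a toy table; scikit-learn and the toy weather data are assumptions made for illustration, not tools prescribed by the paper.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy records: [temperature (deg C), humidity (%)] -> label "play"/"stay in"
X = [[30, 85], [27, 90], [22, 70], [18, 65], [15, 80], [25, 60]]
y = ["stay in", "stay in", "play", "play", "stay in", "play"]

clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X, y)

# The learned tree is itself a human-readable pattern, which is the point of
# combining descriptive and predictive knowledge discovery.
print(export_text(clf, feature_names=["temperature", "humidity"]))
print(clf.predict([[20, 68]]))  # e.g. ['play']
```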
7. CONCLUSIONS
In this paper we have presented the upshot of knowledge discovery from big data. Big data is generating considerable interest in the IT industry, and knowledge discovery from big data can give organizations deeper insight and a view of the bigger picture and of their projects. The three principles can give organizations a useful and flexible data analysis pipeline. The KDP aims at understanding the project domain and the data through data preparation, analysis and evaluation, and it loops back to previous steps several times. The incremental association rule mining algorithm deals with time-changing data and user beliefs; it is used where the volume of data keeps growing and changing with time. Knowledge discovery uses machine learning tools and statistical techniques. However, timeliness and security still pose great challenges in the knowledge discovery process. Future work can address how to use cloud computing to reduce cost and maximize performance, how to handle security and analysis as data flows through the pipeline, what storage and analysis systems are needed, whether Hadoop can be used for graphs, and the use of incremental algorithms, integration algorithms and interactive systems to deal with data issues.

REFERENCES
1. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Advances in Knowledge Discovery and Data Mining, 1996.
2. T. Kalil, "Fact sheet: Big data across the federal government," Office of Science and Technology Policy, Executive Office of the President, March 2012.
3. E. Bauer and R. Kohavi, "An empirical comparison of voting classification algorithms: Bagging, boosting, and variants," Machine Learning, 36:105-139, 1999.
4. R. Bayardo, "Efficiently mining long patterns from databases," in Proceedings of ACM SIGMOD '98, pages 85-93, Seattle, WA, USA, 1998.
5. U. Fayyad et al., "From data mining to knowledge discovery in databases," American Association for Artificial Intelligence. URL: http://www.kdnuggets.com/gpspubs/aimag-kdd-overview-1996-Fayyad.pdf
6. AZMY Thinkware Inc., SuperQuery 1.50, http://www.azmy.com, Fort Lee, NJ, 1997.
7. Bissantz Küppers & Company GmbH, Delta Miner 3.5, http://www.bissantz.de, Erlangen, Germany, 1998.
8. A. T. Bjorvand, "Rough Enough -- software demonstration," 15th IMACS World Congress on Scientific Computation, Modelling and Applied Mathematics, Berlin, Germany, August 24-29, 1997.
9. "Greenplum database community edition." [Online]. Available: http://www.greenplum.com/products/community-edition
10. A. Sorokine, J. Daniel, and C. Liu, "Parallel visualization for GIS applications," in Proceedings of GeoComputation, 2005.