International Journal Of Scientific Research And Education
||Volume||3||Issue||4||Pages-3323-3330||April-2015|| ISSN (e): 2321-7545
Website: http://ijsae.in
Upshot of Knowledge Discovery from Big Data
Authors
Shweta M Nirmanik1, Preethi S2
Department of Computer Science and Engineering, CiTech, Bangalore-36
Email- [email protected]
ABSTRACT
Knowledge Discovery is the process of extracting new knowledge from large, complex databases. It is a non-trivial process that identifies valid, useful, and understandable patterns from a variety of data. Big Data refers to the collection and processing of very large data sets. In today's world it is difficult to understand and manage such rapidly growing data, so there is a need to extract useful information from it. In this paper we present three principles used in knowledge discovery from databases, the data issues that have to be solved, and the knowledge discovery process. Lastly, we discuss an efficient Association Rule Mining algorithm and the machine learning tools used in Knowledge Discovery.
Keywords- Big Data, Knowledge Discovery, Machine learning, Mining algorithm, Principles of Knowledge
Discovery.
1. INTRODUCTION
Knowledge discovery from data (KDD) is the non-trivial extraction of useful information from data. The growth in the size of data exceeds human abilities to analyze such data; hence there is a need for knowledge discovery. Big Data is the collection and processing of very large data sets. This paper first gives an overview of three principles used in knowledge discovery; these principles can guide the development of useful and flexible data analysis pipelines. Next, it describes the data issues, such as noisy and dynamic data in large volumes, that have to be solved, followed by the process models used for knowledge discovery. Lastly, it presents an association rule mining algorithm that deals with time-changing data and the machine learning tools used in knowledge discovery to handle large volumes of data.
2. THREE PRINCIPLES OF KNOWLEDGE DISCOVERY
Three principles for knowledge discovery from big data have been presented by the Oak Ridge National Laboratory (ORNL), USA. ORNL cooperates with several state and federal agencies on big data projects. The three principles are as follows.
A. Support Variety of Analysis Methods
Knowledge discovery uses distributed programming, data mining, machine learning, visualization, and human-computer interaction. A knowledge discovery system should support a variety of tools rather than restricting the user to a limited set. Statistical analysis deals with summarizing large data sets (e.g., averages, minima, maxima) and with defining models for prediction.
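For instance, a minimal Python sketch of such statistical summarization (the column name and values are hypothetical; pandas is an illustrative choice):

```python
import pandas as pd

# Hypothetical sensor readings; in practice this would be a large data set.
readings = pd.Series([21.3, 22.1, 19.8, 23.4, 20.7], name="temperature")

# Summarize the data set with the statistics named above:
# average, minimum, and maximum.
print(readings.agg(["mean", "min", "max"]))
```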
Data mining is the automatic discovery of useful models and patterns in large data sets. Machine learning builds on both data mining and statistical analysis by allowing machines to learn from sets of data. Visualization and visual analysis allow the user to understand the data and discover interesting relationships.
B. One Size Does Not Fit All
KDD has to provide storage and processing of data at all stages of the pipeline. A single storage mechanism may be efficient for small data volumes but problematic for large-scale analysis. The first stage in the pipeline is data preparation and batch analysis; at this stage the data may contain errors and may not be in a usable form. Hadoop is a good tool for this stage. Hadoop is a collection of open-source software based on Google's BigTable and the Google File System; it combines a MapReduce processing component with a scalable storage component. The Hadoop ecosystem includes Hive and HBase: structured data can be processed using Hive, while semi-structured data such as hierarchical documents, graphs, and geospatial data can be handled using HBase and Cassandra.
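To illustrate the MapReduce model that Hadoop implements, the following is a minimal pure-Python sketch of the map, shuffle, and reduce phases over hypothetical log lines; a real Hadoop job would distribute these phases across a cluster:

```python
from collections import defaultdict

records = ["error disk full", "login ok", "error timeout"]  # hypothetical log lines

# Map phase: emit (key, value) pairs from each record.
pairs = [(word, 1) for line in records for word in line.split()]

# Shuffle phase: group the emitted values by key.
groups = defaultdict(list)
for key, value in pairs:
    groups[key].append(value)

# Reduce phase: aggregate the values for each key.
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # {'error': 2, 'disk': 1, ...}
```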
C. Make Data Accessible
While the above two principles concern data analysis and organization, this principle concerns the end product. Results should be presented to the user within a well-supported framework. To accomplish this: use open, popular standard protocols; use lightweight architectures; create rich applications on demand; and expose results through an API so that users have flexible ways to interact with the data system.
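As a sketch of this principle, results could be exposed through a lightweight HTTP API; Flask is used here as an illustrative choice (the endpoint and the pattern payload are hypothetical, not taken from the paper):

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Hypothetical discovered patterns produced by an earlier pipeline stage.
PATTERNS = [{"rule": "bread -> butter", "support": 0.12, "confidence": 0.80}]

@app.route("/api/patterns")
def patterns():
    # Expose results over open standard protocols (HTTP + JSON) so that
    # clients can build rich applications on demand.
    return jsonify(PATTERNS)

if __name__ == "__main__":
    app.run()
```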
3. ISSUES OF DATABASES THAT HAVE TO BE SOLVED
This section gives a brief description of several data issues that have to be solved. Dynamic data: the contents of the data are ever changing with time. Irrelevant fields: whether a field is relevant to the current discovery task. Missing fields: the presence of missing values in the relevant data used for discovery. Noise: errors in the data, for example values inconsistent with the data type assigned to a field. Discovered knowledge must also take an appropriate form: inter-field patterns relate the values of fields within the same record; inter-record patterns relate values aggregated over groups of records, called clusters; quantitative discovery relates numeric field values; and the representation must be in a format understandable to the end user.
4. KNOWLEDGE DISCOVERY PROCESS (KDP)
The process is iterative and interactive; it may be necessary to move back to previous steps. The KDP can be tailored to the application domain and to particular business objectives. Here we describe three KDP models, each involving several steps: the Fayyad et al. KDP model, the industrial model, and the hybrid model.
A. Fayyad et al. KDP Model
This model consists of nine steps, described as follows:
Fig. 1 Knowledge discovery process
1. Developing an understanding of the application domain:
In this step the people in charge of the KDD project must understand and define the goals of the end user and the environment in which knowledge discovery takes place.
2. Creating target data set:
The data that will be used for knowledge discovery must be determined. This is an important stage because data mining works on this available data. At this stage the relevant data are integrated into one data set; if important attributes are missing, the discovery may fail.
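For example, a minimal sketch of integrating two hypothetical sources into one target data set with pandas (all table and column names are assumptions for illustration):

```python
import pandas as pd

# Two hypothetical sources to be combined into one target data set.
customers = pd.DataFrame({"cust_id": [1, 2], "age": [34, 41]})
purchases = pd.DataFrame({"cust_id": [1, 2], "amount": [120.0, 80.5]})

# Integrate on the shared key into a single data set for mining.
target = customers.merge(purchases, on="cust_id", how="inner")

# Guard against missing attributes, which could lead to failure later.
assert not target.isna().any().any(), "missing attribute values"
print(target)
```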
3. Data cleaning and preprocessing:
This step includes handling missing values, removing noise and outliers, and keeping account of changes in the data over time. It may involve data mining algorithms or statistical methods.
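A minimal cleaning sketch with pandas (the data are hypothetical, and the interquartile-range rule is one common statistical choice for outlier removal, not one prescribed by the paper):

```python
import pandas as pd

df = pd.DataFrame({"value": [10.0, 11.2, None, 10.5, 250.0, 9.8]})  # hypothetical

# Handle missing values by dropping incomplete records.
df = df.dropna()

# Remove outliers with the interquartile-range (IQR) rule.
q1, q3 = df["value"].quantile([0.25, 0.75])
iqr = q3 - q1
cleaned = df[df["value"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
print(cleaned)  # the extreme reading 250.0 is dropped
```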
4. Data reduction and projection:
This step includes dimensionality reduction and attribute transformation to find useful attributes.
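As one sketch of dimensionality reduction, principal component analysis with scikit-learn (PCA is an illustrative technique; the data are random placeholders):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data set: 100 records with 10 attributes each.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))

# Project the records onto the 3 directions of highest variance.
reduced = PCA(n_components=3).fit_transform(X)
print(reduced.shape)  # (100, 3)
```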
5. Choosing data mining task:
Having completed the above tasks, the next steps relate to the data mining part. The data miner must decide the type of data mining to be used based on the goals defined in step 1. Common data mining types are regression, classification, and clustering.
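The choice of task maps naturally onto families of methods; a brief scikit-learn sketch (one illustrative estimator per task, chosen here for brevity):

```python
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression, LogisticRegression

# The task chosen in this step constrains the algorithms available in step 6.
TASKS = {
    "regression": LinearRegression(),        # predict a numeric value
    "classification": LogisticRegression(),  # predict a category
    "clustering": KMeans(n_clusters=3),      # group similar records
}
model = TASKS["classification"]
```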
6. Choosing data mining algorithm:
The data miner must select the method to be used for searching for patterns. This step determines whether a given data mining algorithm is appropriate for the particular problem or not.
7. Data mining:
Finally, the data mining algorithm is implemented. It generates patterns in a representational form, such as classification rules, decision trees, or regression models. The algorithm is run, possibly repeatedly, until a successful result is obtained.
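As one sketch of this step, a decision tree can be fitted and then inspected as human-readable classification rules (scikit-learn and its bundled iris data are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# Run the chosen mining algorithm over the prepared data.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# The mined patterns in representational form: classification rules.
print(export_text(tree))
```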
8. Interpreting the mined patterns:
Analysts evaluate and interpret whether the mined patterns satisfy the rules and reliability factors with respect to the goals defined in the first step. Here, the analyst visualizes the extracted patterns and models.
9. Consolidating discovered knowledge:
The last step covers the usage of, and overall feedback on, the discovered results. This includes incorporating the discovered knowledge into the performance system and documenting and reporting it to the interested parties. It also includes checking for and resolving conflicts with previously believed knowledge.
B. Industrial Model
The CRISP-DM KDP model consists of six steps described as follows:
1. Business understanding:
This is the first step in the KDP. It involves understanding the objectives and requirements of the project from a business perspective.
2. Data understanding:
After the objectives are known, the next step is to collect the initial data. This step involves the description, exploration, and verification of the data.
3. Data preparation:
At this stage the selection, cleansing, construction, integration, and formatting of the data are done.
4. Modeling:
This step involves the selection of a modeling technique, the generation of a test design, and the creation and assessment of models.
5. Evaluation:
Analysts are involved in the evaluation of results, review of the process, and determination of the next steps.
6. Deployment:
In this final step the discovered knowledge must be presented in a way the customer can use.
C. Hybrid Model
This model is a combination of the Fayyad et al. KDP model and the industrial model. It has six steps, as follows:
1. Understanding of problem domain:
This first step defines the problem domain and determines the goals of the project.
2. Understanding the data:
This step includes collecting sample data and verifying the usefulness of the data against the goals defined in the first step.
3. Preparation of data:
This step decides which data will be used as input. It involves removing noisy data and data cleaning. The cleaned data are then processed by extraction and data summarization algorithms. The final result must meet the requirements specified in the first step.
4. Data mining:
The data miner chooses the data mining methods for processing the cleaned data.
5. Evaluation of discovered knowledge:
Analysts check whether the discovered knowledge is appropriate. At this stage the entire process is revisited to identify steps that could improve the results.
6. Use of discovered knowledge:
The final step includes deciding where and how the discovered knowledge can be used. The implementation and documentation of the full project are completed.
5. ASSOCIATION RULE MINING ALGORITHM
An incremental model that reflects changing data and user beliefs is useful for making knowledge discovery in big data more effective and efficient. The classical association rule mining algorithm does not consider the time at which data arrive; in this setting, the combination of old and new data is used to build a new model from scratch, and the previously discovered knowledge (PDK) becomes invalid. Incremental models have therefore been designed.
A. Apriori algorithm
It uses association rules. Mining association rules involves two sub-problems: first, generating all large (frequent) itemsets in the database, and second, generating association rules from the large itemsets produced in the first step. The Apriori algorithm is shown in Fig. 2.
Fig. 2 Apriori algorithm
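As a rough illustration of the first sub-problem, a minimal pure-Python Apriori sketch over hypothetical transactions (min_support is a user-chosen threshold; rule generation and Apriori's subset-pruning optimization are omitted for brevity):

```python
from itertools import combinations

transactions = [{"bread", "milk"}, {"bread", "butter"},
                {"bread", "milk", "butter"}, {"milk"}]  # hypothetical
min_support = 2  # absolute support threshold, a user choice

def frequent_itemsets(transactions, min_support):
    # Level 1: candidate 1-itemsets are all items seen in the data.
    items = {item for t in transactions for item in t}
    level = {frozenset([i]) for i in items}
    frequent, k = {}, 1
    while level:
        # Count the support of each candidate; keep the frequent ones.
        counts = {c: sum(c <= t for t in transactions) for c in level}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        # Join frequent k-itemsets into candidate (k+1)-itemsets.
        k += 1
        level = {a | b for a, b in combinations(survivors, 2) if len(a | b) == k}
    return frequent

print(frequent_itemsets(transactions, min_support))
```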
B. Incremental Algorithm
It is similar to the Apriori algorithm, except that after each frequent-itemset generation step a "shocking" interestingness measure is computed with respect to the existing model Ti, and uninteresting items that are not significant in the current training set are pruned.
Fig. 3 Incremental association rule mining algorithm
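The shocking measure is not defined precisely in the text; one plausible reading, sketched below, scores each itemset by how much its support deviates from the support recorded in the existing model Ti and prunes itemsets whose deviation falls below a threshold. All names, values, and the measure itself are assumptions for illustration:

```python
def shocking(itemset, new_support, model_ti):
    # Assumed measure: deviation of the itemset's current support from
    # the support recorded in the existing model Ti (0.0 if unseen).
    return abs(new_support - model_ti.get(itemset, 0.0))

def prune(candidates, model_ti, min_shock=0.1):
    # Keep only itemsets whose support changed enough to be "shocking".
    return {c: s for c, s in candidates.items()
            if shocking(c, s, model_ti) >= min_shock}

# Hypothetical relative supports: existing model Ti vs. current training set.
model_ti = {frozenset({"bread", "milk"}): 0.50}
candidates = {frozenset({"bread", "milk"}): 0.52,    # barely changed: pruned
              frozenset({"bread", "butter"}): 0.30}  # new pattern: kept
print(prune(candidates, model_ti))
```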
6. MACHINE LEARNING TOOLS
Knowledge discovery is characterized into two parts: descriptive knowledge discovery and predictive knowledge discovery. The probabilistic approach uses graphical representation models; visualization tools are the best example of graphical representation. The statistical approach uses rule discovery and is based on data relationships; online analytical processing (OLAP) is the best example of the statistical approach. Deviation and trend analysis filters important patterns from temporal data. Classification approaches use decision trees, pattern discovery, and data cleaning models.
7. CONCLUSIONS
In this paper we have presented the upshot of knowledge discovery from big data. Big data is generating considerable excitement in the IT industry. Knowledge discovery from big data can allow organizations to gain deeper insights and to see the bigger picture across their projects. The three principles can give organizations a useful and flexible data analysis pipeline. The KDP aims at understanding the project domain and the data through data preparation, analysis, and evaluation, and it includes several loops back to previous steps. The incremental association rule mining algorithm deals with time-changing data and user beliefs; it is suited to settings where the volume of data keeps growing and changing with time. Knowledge discovery draws on machine learning tools and statistical techniques. However, timeliness and security still pose great challenges in the knowledge discovery process. Future work includes: how to use cloud computing to reduce cost and maximize performance; how to handle security and analysis as data flow through the pipeline; what storage and analysis systems are needed; whether Hadoop can be used for graphs; and the use of incremental algorithms, integration algorithms, and interactive systems to deal with data issues.
REFERENCES
1. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Advances in Knowledge Discovery and Data Mining, 1996.
2. T. Kalil, "Fact sheet: Big data across the federal government," Office of Science and Technology Policy, Executive Office of the President, March 2012.
3. E. Bauer and R. Kohavi, "An empirical comparison of voting classification algorithms: bagging, boosting, and variants," Machine Learning, 36:105-139, 1999.
4. R. J. Bayardo, "Efficiently mining long patterns from databases," in Proceedings of ACM SIGMOD'98 (L. M. Haas and A. Tiwary, eds.), pages 85-93, Seattle, WA, USA, 1998.
5. U. Fayyad et al., "From data mining to knowledge discovery in databases," American Association for Artificial Intelligence. URL: <http://www.kdnuggets.com/gpspubs/aimagkdd-overview-1996-Fayyad.pdf>
6. AZMY Thinkware Inc., SuperQuery 1.50, http://www.azmy.com, Fort Lee, NJ, 1997.
7. Bissantz Küppers & Company GmbH, Delta Miner 3.5, http://www.bissantz.de, Erlangen, Germany, 1998.
8. A. T. Bjorvand, "Rough Enough -- software demonstration," 15th IMACS World Congress on Scientific Computation, Modelling and Applied Mathematics, Berlin, Germany, August 24-29, 1997.
9. "Greenplum database community edition." [Online]. Available: http://www.greenplum.com/products/community-edition
10. A. Sorokine, J. Daniel, and C. Liu, "Parallel visualization for GIS applications," in Proceedings of GeoComputation, 2005.