CHAPTER 1
INTRODUCTION

1.1 DATA MINING

Data mining refers, loosely, to finding relevant information or discovering knowledge from a large volume of data. It attempts to discover statistical rules and patterns automatically from stored data; however, it differs from machine learning in that it deals with very large volumes of data stored on disk [1]. Discovering knowledge from a large volume of data is not a simple process but an iterative and interactive one. Data mining should be "the nontrivial process of identifying valid, novel, potentially useful, and ultimately comprehensible knowledge" from databases; such knowledge can be useful in making crucial decisions [41].

Nontrivial - Rather than simple computations, complex processing is required to uncover the patterns buried in the data.
Valid - The discovered patterns should hold for all data, including new data.
Novel - The discovered patterns should be innovative.
Useful - The organization should be able to act upon these patterns to become more profitable or efficient.
Comprehensible - The new patterns should be understandable to users and add to their knowledge.

Steps in the knowledge discovery in databases (KDD) process:

Data Cleaning: The process of removing noise and inconsistent data.
Data Integration: The process of combining data from multiple sources.
Data Selection: The process of retrieving the data relevant to the analysis task from the database.
Data Transformation: The process in which data are transformed into forms appropriate for mining, for example by performing summary or aggregation operations.
Data Mining: The essential process in which intelligent methods are applied to extract data patterns.
Pattern Evaluation: The patterns obtained in the data mining stage are converted into knowledge on the basis of interestingness measures.
Knowledge Presentation: Visualization and knowledge representation methods are used to present the discovered knowledge to the user.

(A short illustrative sketch of the first four of these steps follows Section 1.2 below.)

1.2 ARCHITECTURE OF DATA MINING

Data mining is the process of discovering knowledge from large amounts of data stored either in very large databases or in other information repositories. The major components of a data mining system are the following.

Database, Data Warehouse or Other Information Repository: A single database or a collection of multiple databases, data warehouses, flat files, spreadsheets or other kinds of information repositories. Data cleaning and integration methods may be performed on the data.

Database or Data Warehouse Server: This server fetches the relevant data based on the data mining request.

Knowledge Base: The domain knowledge that guides the search or is used to assess the interestingness of the resulting patterns.

Data Mining Engine: This is essential to the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association, classification, cluster analysis, evolution and outlier analysis.

Figure 1.1 Architecture of a typical data mining system (a graphical user interface above a pattern evaluation module and knowledge base, which interact with the data mining engine; the engine draws on a database or data warehouse server connected to databases, data warehouses, the World Wide Web and other repositories)

Pattern Evaluation Module: This module interacts with the data mining modules to focus the search towards interesting patterns. It may use interestingness thresholds to filter out discovered patterns; alternatively, it may be integrated with the mining module.

Graphical User Interface: This module communicates between users and the data mining system. It allows the user to interact with the system by specifying a task or a data mining query, and to perform exploratory data mining based on intermediate mining results. It also allows the user to browse database and data warehouse schemas or data structures, evaluate mined patterns and visualize the patterns in different forms such as maps and charts.
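The preprocessing stages of the KDD process in Section 1.1 can be made concrete with a short sketch. The Python fragment below, written against the pandas library, is a minimal illustration only; the file names, column names and the aggregation chosen are hypothetical placeholders rather than anything prescribed by this thesis.

```python
import pandas as pd

# Hypothetical source files standing in for multiple repositories.
sales = pd.read_csv("sales.csv")        # columns: customer_id, item, amount
profiles = pd.read_csv("profiles.csv")  # columns: customer_id, region

# Data cleaning: drop noisy and inconsistent records.
sales = sales.dropna()
sales = sales[sales["amount"] > 0]

# Data integration: combine data from multiple sources.
data = sales.merge(profiles, on="customer_id", how="inner")

# Data selection: keep only the records relevant to the task.
data = data[data["region"] == "south"]

# Data transformation: aggregate into a form suitable for mining.
summary = data.groupby("item")["amount"].agg(["count", "sum"])
print(summary)
```

The remaining stages (mining, pattern evaluation and knowledge presentation) would then operate on such a summarized table.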
1.3 DATA MINING TECHNIQUES

Several techniques are used in data mining, describing the type of mining and data recovery operation; each analyzes the data in a different way. The most commonly used techniques are:

1. Neural Networks
Neural networks have the ability to derive meaning from complicated data and can be used to extract patterns and detect relationships that are very complex. A well-trained neural network can be considered an "expert" in the kind of information it has been given to examine; this expert can then be used to supply projections for new situations of interest and answer "what if" questions. Neural networks use a set of processing elements analogous to neurons in the brain. These processing elements are interconnected in a network that can identify patterns in data once it is exposed to the data; that is, the network learns from experience just as people do.

2. Decision Tree
Decision trees are a simple knowledge representation that classifies examples into a finite number of classes. The nodes are labeled with attribute names, the edges with the possible values of the attribute, and the leaves with the different classes. These tree-shaped structures represent sets of decisions that generate rules for the classification of a dataset. Decision trees produce rules that are mutually exclusive and collectively exhaustive with respect to the training database. Particular decision tree methods include Classification and Regression Trees (CART) and Chi-Square Automatic Interaction Detection (CHAID).

3. Nearest Neighbor Method
A method that classifies each record in a dataset based on a combination of the classes of the k most similar record(s) in a historical dataset is called the k-nearest neighbor technique. Nearest neighbor is a prediction technique quite similar to clustering; its essence is that, to predict the value for one record, it looks for records with similar predictor values in the historical database and uses the prediction value from the record nearest to the unclassified record (see the sketch after this list).

4. Cluster Analysis
Cluster analysis is a mathematically based tool for exploring the underlying structure of data. Clustering is a method of grouping objects into clusters such that objects in the same cluster are similar and objects in different clusters are dissimilar. Objects can be described in terms of dimensions or by relationships with other objects. Clustering is sometimes used to mean segmentation; clustering and segmentation basically partition the database so that each group is similar according to some rule or metric. Many data mining applications make use of clustering by similarity, for example to segment a client/customer base.

5. Association
Association is the most popular data mining technique. It makes a simple correlation between two or more items, often of the same type, to identify patterns. For example, when tracking people's buying habits, it might identify that a customer always buys butter when buying bread.

6. Rule Induction
Rule induction is the process of extracting useful if-then rules from data on the basis of statistical significance. Rule induction on a database can be a massive undertaking in which all possible patterns are systematically pulled out of the data, and their accuracy and significance are then calculated, telling users how strong a pattern is and how likely it is to occur again.

7. Genetic Algorithms
Genetic algorithms dictate how populations of organisms should be formed, evaluated and modified. A genetic algorithm is an optimization technique that uses processes such as genetic combination, mutation and natural selection in a design based on the concepts of evolution. Genetic algorithms are typically applied on top of an existing data mining technique such as neural networks or decision trees.

8. Data Visualization
Data visualization makes it possible for the analyst to gain a deeper, more intuitive understanding of the data, and as such it works well alongside data mining. Data mining allows the analyst to focus on certain patterns and trends and to explore them in depth using visualization. On its own, data visualization can be overwhelmed by the volume of data in a database, but in conjunction with data mining it can help with exploration.
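As noted under technique 3 above, here is a minimal sketch of k-nearest neighbor classification using scikit-learn. The records, labels and the choice of k = 3 are hypothetical; the point is only to show "predict from the most similar historical records" in working form.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical historical records: [age, income] with a churn label.
X_train = [[25, 30000], [32, 45000], [47, 80000], [51, 62000], [23, 28000]]
y_train = ["churn", "stay", "stay", "stay", "churn"]

# The classifier stores the history and classifies a new record
# from the classes of its k most similar stored records.
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

new_record = [[29, 33000]]
print(model.predict(new_record))  # label drawn from the 3 nearest records
```

In practice the predictor values would be scaled first, since a raw income column would otherwise dominate the distance computation.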
1.4 APPLICATIONS OF DATA MINING

Nowadays, many industries use electronic data repositories to store huge volumes of data. With traditional techniques, extracting knowledge from data sources of this size for better decision making is not feasible for the analyst; such techniques are insufficient for analyzing these kinds of data. In today's world, data are collected and stored at enormous speeds, so it is essential for industries to find special tools for storing and accessing these databases. Data mining tools are such tools, and they are applied to both commercial and scientific data. Commercial data are mined to provide better service to customers and to customize and proactively offer services. The tools help to extract and understand complex relationships, predict future status and study interesting similarities in data. Data mining is applied in the following areas: the retail market, banking [127], fraud detection [169], insurance, transport engineering [154], telecommunications, the stock market [133] [161], crime and terrorism [37] [68], aircraft maintenance, geographical and spatial data, software engineering [87], the healthcare industry [89], education [132], social network analysis [66] and sport databases. Notably, data mining techniques have been extended into the medical industry [89]. Medical databases store large amounts of information about patients, results of laboratory tests, diagnosis status and treatments. Data mining techniques applied to these databases discover relationships and patterns that are helpful in studying the progression and management of diseases.

1.5 ASSOCIATION RULES

Association analysis is used to extract interesting correlations, frequent patterns, associations or causal structures among sets of items or objects in transaction databases, relational databases or other data repositories. In 1993, Agrawal, Imielinski and Swami (AIS) developed an algorithm to discover relationships or correlations between items based on such measures [1]. Since then, association analysis has become one of the most widely used and studied techniques in data mining. The main aim of this technique is the discovery of relationships and co-occurrences between items in a dataset. The results of association analysis are represented in the form of rules. Association rules are IF/THEN statements used to find relationships between seemingly unrelated data in voluminous databases. An association rule has two parts, an antecedent (if) and a consequent (then). An antecedent is an item found in the data; a consequent is an item found in combination with the antecedent [2]. A common example of this method is market basket analysis [1], for instance: "customers who buy product A often also buy product B". A decision maker such as a shopper or a marketer can access a large volume of historical data from which such rules have been extracted, and so draw conclusions and make decisions that are well supported by the data with greater confidence.

In general, association rule mining can be viewed as a two-step process:

1. Find all frequent itemsets: by definition, each of these itemsets will occur at least as frequently as a predetermined minimum support count, min_sup. Support: the percentage of transactions in D that contain A ∪ B.
2. Generate strong association rules from the frequent itemsets: by definition, these rules must satisfy minimum support and minimum confidence. Confidence: the percentage of transactions in D containing A that also contain B.

Support(A => B) = P(A ∪ B)
Confidence(A => B) = P(B | A)

Rules that satisfy both a minimum support threshold (min_sup) and a minimum confidence threshold (min_conf) are called strong.
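The two measures just defined can be checked by hand on a tiny example. The five transactions below are hypothetical and chosen only to make the arithmetic visible; they are not drawn from the datasets used later in the thesis.

```python
# Hypothetical transaction database D (market baskets).
D = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]

A, B = {"bread"}, {"butter"}

# Support(A => B): fraction of transactions containing A ∪ B.
support = sum(1 for t in D if (A | B) <= t) / len(D)

# Confidence(A => B): fraction of transactions containing A
# that also contain B.
containing_A = [t for t in D if A <= t]
confidence = sum(1 for t in containing_A if B <= t) / len(containing_A)

print(support)     # 3/5 = 0.6
print(confidence)  # 3/4 = 0.75
```

With, say, min_sup = 0.5 and min_conf = 0.7, the rule bread => butter would be called strong under the definitions above.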
1.6 OBJECTIVE OF THE RESEARCH

Searching is the process of finding a particular item in a collection of items or in a very large database. Searches use indexes to reduce the time spent scanning the entire data, and use keys for storing the values; at retrieval time the key is referenced and decoded for rapid scanning. For particular reasons, such algorithms are seldom applied to relational databases, which store and manage very large volumes of data, even though relational databases have the power and speed to traverse large volumes of data within a short span of time. A divide and conquer algorithm can also be used, recursively breaking a problem down into two or more subproblems until these become simple enough to be searched or solved directly.

The aim of the Refined Search Divide and Conquer (Hybrid) Algorithm (RSDCA) is to apply frequent itemset mining to very large databases. The thesis proposes that RSDCA exhibits a reduction in the time factor by using the indexes of the database with the help of the divide and conquer method. RSDCA also seeks to reduce, in a generic way, the memory space used while searching for frequent itemsets in databases.
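To make the divide and conquer idea mentioned above concrete, the sketch below shows the classic binary search: the sorted search space is recursively split into halves until the subproblem is simple enough to solve directly. This is a generic illustration of the principle only, not the RSDCA algorithm proposed in this thesis.

```python
def binary_search(items, key, lo=0, hi=None):
    """Divide and conquer search over a sorted list; returns an index or -1."""
    if hi is None:
        hi = len(items) - 1
    if lo > hi:                       # empty subproblem: key is absent
        return -1
    mid = (lo + hi) // 2              # divide the search space in half
    if items[mid] == key:             # simple enough to solve directly
        return mid
    if items[mid] < key:              # recurse into the right half
        return binary_search(items, key, mid + 1, hi)
    return binary_search(items, key, lo, mid - 1)  # recurse into the left half

print(binary_search([2, 3, 5, 7, 11, 13], 7))  # -> 3
```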
1.7 SCOPE OF THE RESEARCH

Though earlier researchers have developed many algorithms and methodologies, it remains imperative to find new ones. It is often unsuitable to apply the task of mining all frequent itemsets to very large datasets. Existing algorithms use their own in-memory data structures for retrieving datasets and producing results, and these structures impose a limit on the size of data that can be processed. RDBMSs, by contrast, provide the benefits of their buffer management systems, so that users and applications can be freed from size considerations of the data; RDBMSs also give the advantage of mining over very large datasets.

The proposed algorithm can be applied to any database; it further explores new possibilities for the generic searches executed on Very Large Databases (VLDBs). The algorithm shows the need for proper indexing, and more particularly generic indexing, to improve the speed of queries. The approach of this thesis enables the construction of a new refined divide and conquer algorithm with a completely different perception. In this approach, the divide and conquer method is applied to extract the desired frequent datasets from the very large database in order to improve the speed of the queries, and the user's or application's own in-memory data structure suffices for retrieving the dataset and executing the queries.

1.8 CONTRIBUTION OF THE RESEARCH

The proposed Refined Search Divide and Conquer (Hybrid) Algorithm (RSDCA) is used to access itemsets very quickly in large databases. The search algorithm takes a problem as input, processes it within its search space and returns a solution. RSDCA uses a pattern growth method for mining frequent patterns; it rapidly reads transactions and updates support counts at the same time. Based on closed frequent itemset mining, RSDCA specifies the order of set enumeration by using constraints. The algorithm generates bases from frequent closed itemsets, discovers frequent itemsets and derives the exact association rules with maximum confidence. RSDCA stresses the importance of the data dictionary in relational databases and exhibits the power of the RDBMS in identifying frequent itemsets. RSDCA minimizes search time since it refers to and compiles the items, or their occurrences, at a top level.

1.9 ORGANIZATION OF THE THESIS

The first chapter gives an overview of data mining techniques and discusses the role of association rules. The second chapter presents the literature survey, covering existing association rule methods in detail. The third chapter presents the closed frequent itemset mining algorithm, which uses a rapid search technique to mine closed frequent itemsets. The fourth chapter deals with the rapid search itemset mining algorithm, which is based on the pattern growth method for frequent pattern mining. The proposed Refined Search Divide and Conquer (Hybrid) Algorithm (RSDCA) for mining association rules is described in chapter five, where the application of the rapid search technique is also discussed in detail. Chapter six summarizes the results of the proposed algorithms on synthetic datasets. Chapter seven concludes the findings and discusses future enhancements.