BTP REPORT
EFFICIENT MINING OF EMERGING PATTERNS
K G Pavan Kumar (200701036)
G Vinod Kumar Naidu (200701025)

Knowledge Discovery in Databases:
The existence of vast repositories of data has led to the belief that technology can process this information in a manner useful to humans. The term established to categorise this type of process is Knowledge Discovery in Databases (KDD). It has been defined as consisting of five stages:
1. Selection: The process of collating data from a combination of relevant sources.
2. Pre-processing: Should the data obtained during the selection stage come from a variety of disparate sources, contradictions, omissions and errors need to be reconciled.
3. Transformation: A single uniform format must be applied to data selected from different areas. Other activities, such as discretisation of continuously valued data or normalisation of values, may also be necessary.
4. Data Mining: Given some specific objective, appropriate algorithms are used to process the data available following transformation.
5. Interpretation/Evaluation: The results of the previous stages need to be accessible to end users. This step defines a means of presenting the analysis to people. Methods include sophisticated visuals and simple, compact implication-oriented rules.

The fourth step, data mining, is our main concern. The types of algorithms available to carry out this task depend on the precise objectives of the KDD process that is initiated. In all cases, however, the fundamental aim of these algorithms is to extract or identify meaningful, useful or interesting patterns from data. They achieve this by constructing some model that describes or is representative of the data given as input. The model itself, its definition or purpose, will mean that some particular model criteria will need to be identified. Two types of data mining model are commonly distinguished.

Predictive Model:
A predictive model is one that is used to make predictions about data using information obtained and analysed earlier.

Descriptive Model:
A descriptive model is usually constructed to provide a way to ascertain relationships that generalise the data, or can be used as a way of succinctly summarising its most significant features.

Emerging Patterns:
In our work we mainly dealt with emerging patterns (EPs). Their definition was motivated in the context of providing a tool by which distinct sets of data could be contrasted. Consequently, they have been widely applied in the area of classification. Put simply, EPs are itemsets that occur in one dataset with a frequency that significantly exceeds their frequency in any other dataset. They are patterns that can be used to contrast or differentiate. If we are presented with disjoint classes of data, where each data point is associated with a particular "class label", EPs can provide knowledge as to what subsets of each class are unique. Thus, they are useful in providing discriminant features of one class of data against any other.

Current Approaches:

Apriori:
The basic concept is to generate itemsets in a bottom-up, small-cardinality-upwards manner, such that each satisfies the support constraint (i.e. is a frequent itemset). In each iteration, only itemsets that were found to be frequent in the previous iteration need be considered as bases from which new candidates can be enumerated and tested. This strategy is based on the apriori property, which states that all subsets of a frequent itemset are also frequent.

FP-tree:
The Frequent Pattern Tree (FP-tree) is a structure defined for storing database transactions. In order to mine frequent patterns from this structure, the FP-Growth algorithm was devised. The structure is made up of the tree itself and a header table; the header table is a device indexing all nodes in the tree that represent the same item. The structure is defined as follows: 1. The tree
contains a root node, labelled "null".
2. Each node is characterised by three attributes: item-name, count and node-link. The first indicates which item the node represents. The count attribute refers to the number of transactions whose insertion passed through the node. The node-link attribute points to the next node in the tree sharing the node's item-name; it points to nothing should no such node exist.
3. Every header table entry is made up of two attributes: item-name and head of node-link, which points to the first node in the tree that is labelled with that item-name.

Max Miner:
This algorithm finds maximal frequent itemsets by performing candidate generation as an ordered enumeration of sets over a finite item domain. The basis of the algorithm is to construct nodes in a tree such that each is labelled with a set of items called the head and a sequence of items known as the tail. A head and tail together form what is known as a candidate group. For some ordering of items, the tail items are simply those that succeed the lowest-ordered item appearing in the head.

EMERGING PATTERNS:
Association rules are useful for finding co-occurrence relationships in a dataset. This information is not, however, specifically designed for finding contrasts between two or more datasets. Finding such contrasts is a very intuitive way of thinking about solving the classification problem: should one be able to contrast concepts or classes effectively, by identifying the features (or combinations of features) that are unique to each, searching for these features in the objects one is classifying can help point out the best matching class.
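The contrast-based notion of an EP above can be made concrete with a short sketch. The function names and toy data below are illustrative (not from this report); an itemset is flagged as emerging when the ratio of its supports in the two classes exceeds some threshold, with a pattern that never occurs in the contrasting class having an infinite ratio:

```python
def support(itemset, dataset):
    """Fraction of transactions in `dataset` containing every item in `itemset`."""
    items = set(itemset)
    return sum(1 for t in dataset if items <= set(t)) / len(dataset)

def growth_rate(itemset, d_pos, d_neg):
    """Ratio of supports; infinite when the pattern never occurs in d_neg."""
    s_pos = support(itemset, d_pos)
    s_neg = support(itemset, d_neg)
    if s_neg == 0:
        return float('inf') if s_pos > 0 else 0.0
    return s_pos / s_neg

# Toy transactions for two contrasting classes.
positive = [{'a', 'b', 'c'}, {'a', 'b'}, {'a', 'c'}, {'a', 'b', 'c'}]
negative = [{'b', 'c'}, {'c'}, {'b'}, {'b', 'c'}]

# {'a'} occurs in every positive transaction and never in the negative
# class, so its growth rate is infinite: a strongly emerging pattern.
print(growth_rate({'a'}, positive, negative))
```

An itemset such as {'b'} here has equal support in both classes and so carries no discriminant information, which is exactly the distinction EPs are designed to capture.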
Dong & Li Border Differential Algorithm:
The mining procedure consists of three stages:
• finding the itemsets that satisfy some user-defined minimum support constraint in the positive dataset
• finding the itemsets that satisfy some user-defined maximum support constraint in the negative dataset
• performing a "difference" on these two sets

Any user mining EPs in this framework stipulates the kind of patterns they desire. Essentially, this permits one to seek itemsets having at least a support (where support is either relative or absolute) in the positive dataset and at most b support in the negative dataset, allowing contrasts between the positive dataset and the negative one to be identified. For a = 10% and b = 0.5%, the user is requesting all patterns that occur at least 10% of the time in the positive dataset but no more than 0.5% of the time in the negative dataset.

The first two stages of the mining procedure imply the finding of borders, which equates to finding maximal frequent itemsets (MFIs). For the positive border, we mine MFIs with reference to a; the negative border is computed with reference to b. The third step of the process was implemented by iterating through each positive instance and mining with reference to the negative dataset. Performing this operation is the function of Border-Diff. By aggregating the output from each call and minimising the accumulated results, we obtain the complete set of EPs for the positive dataset.

The Border-Diff operator has two input parameters:
• a reference set of itemsets, or negative instances, N
• a single positive instance, p

The objective is to find all subsets of p that are not contained in any element of N. We initially find all subsets of p that are not contained in at least one element of N. This is simply the set of differences Diffs = {p − ni, ∀ ni ∈ N}. Taking these differences, the generation is based on partial expansions of their cross-product. Each di ∈ Diffs is the EP that exists in p with reference to ni.
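The difference computation just described can be sketched in a few lines; the names here are illustrative, not taken from the report:

```python
# Sketch of the first step of Border-Diff: for a positive instance p and
# a set of negative instances N, compute Diffs = {p - ni, for each ni in N}.

def diffs(p, negatives):
    """Set difference between p and each negative instance."""
    return [frozenset(p) - frozenset(n) for n in negatives]

p = {1, 2, 3, 4}
N = [{1, 2}, {2, 3}, {1, 4}]
for d in diffs(p, N):
    print(sorted(d))
# Each difference d_i holds the items of p absent from n_i; any subset of p
# that intersects d_i cannot be contained in n_i.
```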
This can be generalised to find EPs with reference to any two ni, nj ∈ N. By taking any two differences di, dj ∈ Diffs, if the cross-product is computed and minimised, subsets of p are generated such that they:
• exist in neither ni nor nj
• are minimal

Computing the cross-product of three differences di, dj, dk and minimising the result therefore provides those minimal subsets of p not present in the negative instances ni, nj, nk. By extending this process to all di ∈ Diffs, the complete set of EPs with respect to p and N is found.

Approaches Followed: Tree-Based Emerging Pattern Mining

EP Mining Issues and Objectives:
Our work is focused on providing an improved method of completing all three stages listed above. We show that we can significantly improve completion of these tasks by once again looking towards a data partitioning heuristic. We construct partitions with two primary aims:
1. to minimise the number of itemsets present in each partition
2. to minimise the cardinality of itemsets contained in large partitions
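For reference, the partial cross-product expansion with minimisation that Border-Diff performs, as described earlier, can be sketched as follows. The seeding strategy and all names are our reconstruction under the assumption of non-empty differences, not the report's implementation:

```python
def minimise(itemsets):
    """Keep only minimal itemsets (drop any proper superset of another)."""
    sets = sorted(set(itemsets), key=len)
    minimal = []
    for s in sets:
        if not any(m < s for m in minimal):  # m < s: proper subset test
            minimal.append(s)
    return minimal

def border_diff(p, negatives):
    """Minimal subsets of p contained in no element of `negatives`."""
    ds = [frozenset(p) - frozenset(n) for n in negatives]
    # Seed with singletons from the first difference, then take the
    # minimised partial cross-product with each remaining difference.
    eps = [frozenset([x]) for x in ds[0]]
    for d in ds[1:]:
        expanded = []
        for s in eps:
            if s & d:                # already absent from this negative instance
                expanded.append(s)
            else:
                expanded.extend(s | {x} for x in d)
        eps = minimise(expanded)
    return eps

p = {1, 2, 3, 4}
N = [{1, 2}, {2, 3}, {1, 4}]
print(sorted(sorted(s) for s in border_diff(p, N)))
```

On this toy input the minimal subsets of p contained in no negative instance are {1, 3}, {2, 4} and {3, 4}; each intersects every difference, so it is missing from every ni, and dropping any item would leave it inside some ni.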