BTP REPORT
EFFICIENT MINING OF EMERGING PATTERNS
K G Pavan Kumar (200701036)
G Vinod Kumar Naidu(200701025)
Knowledge Discovery in Databases
The existence of vast repositories of data has led to the belief that technology
can process this information in a manner that will be useful to humans. The term
established to categorise this type of process is Knowledge Discovery in Databases (KDD).
It has been defined as consisting of five stages:
1. Selection: The process of collating data from a combination of relevant
sources.
2. Pre-processing: Should data obtained during the selection stage be from a
variety of disparate sources, contradictions, omissions and errors need to be
reconciled.
3. Transformation: A single uniform format must be applied to data selected
from different areas. Other activities such as discretisation of continuously
valued data or normalisation of values may also be necessary.
4. Data Mining: Given some specific objective, appropriate algorithms are used
to process the data available following transformation.
5. Interpretation/Evaluation: The results of the previous stages need to be
accessible to end users. This step defines a means of presenting the analysis
to people. Methods include sophisticated visuals and simple, compact,
implication-oriented rules.
The fourth step, data mining, is our main concern. The types of algorithms available to
actually carry out this task depend on the precise objectives of the KDD process that
is initiated. In all cases, however, the fundamental aim of these algorithms is to
extract or identify meaningful, useful or interesting patterns from data. They achieve
this by constructing some model that describes or is representative of the data given
as input. The model itself, its definition or purpose, will mean that some particular
model criteria will need to be identified.
Predictive Model: A predictive model is one that is used to make predictions
about data using information obtained and analysed earlier.
Descriptive Model: A descriptive model is usually constructed to provide a way to
ascertain relationships that generalise the data or can be used as a way of succinctly
summarising the most significant features.
Emerging Patterns:
In our work we mainly dealt with emerging patterns. Their definition was motivated in
the context of providing a tool by which distinct sets of data could be contrasted.
Consequently they have been widely applied in the area of classification.
Put simply, EPs are sets that occur in one set of data with a frequency that
significantly exceeds their frequency in any other dataset. They are patterns that can
be used to contrast or differentiate. If we are presented with disjoint classes of data,
where each data point is associated with a particular “class label”, EPs can provide
knowledge as to what subsets of each class are unique. Thus, they are useful in
providing discriminant features of one class of data against any other.
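This support-ratio view of EPs can be written down directly. The sketch below is illustrative, not from the report: the growth-rate threshold min_growth and the toy datasets are assumptions.

```python
# Sketch: an itemset is treated as an emerging pattern when its support in the
# target dataset exceeds its support in the background dataset by a growth-rate
# threshold (min_growth is an illustrative parameter, not from the report).

def support(itemset, dataset):
    """Fraction of transactions in `dataset` that contain `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in dataset if itemset <= set(t))
    return hits / len(dataset)

def is_emerging_pattern(itemset, target, background, min_growth=2.0):
    """True if the support of `itemset` in `target` is at least
    `min_growth` times its support in `background`."""
    s_pos = support(itemset, target)
    s_neg = support(itemset, background)
    if s_neg == 0:
        return s_pos > 0  # infinite growth rate: a "jumping" EP
    return s_pos / s_neg >= min_growth

positive = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}]
negative = [{"b", "c"}, {"c"}, {"b"}]
print(is_emerging_pattern({"a"}, positive, negative))  # True: "a" never occurs in negative
```

Here {"a"} discriminates the positive class because it appears in every positive transaction and in no negative one.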
Current Approaches:
Apriori:
The basic concept is to generate itemsets bottom-up, from small cardinalities
upwards, such that each satisfies the support constraint (frequent itemsets). In each
iteration, only itemsets that were found to be frequent in the previous iteration need
be considered as bases from which new candidates can be enumerated and tested.
This strategy is based on the apriori property, which states that all subsets of a
frequent itemset are also frequent itemsets.
FP-tree:
The Frequent Pattern Tree (FP-Tree) is a structure defined for storing database
transactions. In order to mine frequent patterns from this structure, the FP-Growth
algorithm was devised. The structure is made up of the tree itself and a header table.
The header table is a device indexing all nodes in the tree that represent the same
item. These structures are defined as follows:
1. The tree contains a root node, labelled “null”.
2. Each node is characterised by three attributes: item-name, count and node
link. The first indicates which item the node represents. The count attribute
refers to the number of transactions whose insertion meant passing through
the node. The node-link attribute points to the next node in the tree sharing
the node’s item-name; it points to nothing should no such node exist.
3. Every header table entry is made up of two attributes: item-name and head of
node-link, which points to the first node in the tree that is labelled with item-name.
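The three-part definition above can be sketched as a small data structure. Attribute names follow the report's terminology; the class names, insertion routine, and sample transactions are illustrative assumptions.

```python
# Sketch of the FP-Tree structure: a root labelled "null", nodes carrying
# item-name / count / node-link, and a header table pointing at the first
# node for each item (class and method names are illustrative).

class FPNode:
    def __init__(self, item, parent):
        self.item = item        # item-name ("null" for the root)
        self.count = 0          # transactions whose insertion passed through here
        self.node_link = None   # next node in the tree sharing this item-name
        self.parent = parent
        self.children = {}

class FPTree:
    def __init__(self):
        self.root = FPNode("null", None)
        self.header = {}        # item-name -> head of node-link chain

    def insert(self, transaction):
        """Insert one transaction, items assumed already in a fixed order."""
        node = self.root
        for item in transaction:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                # Append the new node to this item's node-link chain.
                if item not in self.header:
                    self.header[item] = child
                else:
                    link = self.header[item]
                    while link.node_link is not None:
                        link = link.node_link
                    link.node_link = child
            child.count += 1
            node = child

tree = FPTree()
for t in [["a", "b"], ["a", "c"], ["a", "b", "c"]]:
    tree.insert(t)
print(tree.root.children["a"].count)  # 3: all three transactions pass through "a"
```

Note how the two "c" nodes (one under "a", one under "b") end up chained together via node-link, which is exactly what FP-Growth traverses when mining.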
Max Miner:
This algorithm finds maximal frequent itemsets by performing candidate
generation as an ordered enumeration of sets over a finite item domain. The basis of
the algorithm is to construct nodes in a tree, such that each is labelled with a series
of items called the head, and a sequence of items known as the tail. A head and tail
form what is known as a candidate group. For some ordering of items, tail items are
simply those that succeed the greatest-ordered item appearing in the head.
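The head/tail split can be sketched as below; names are illustrative, and the tail is taken to hold the items that follow the greatest-ordered head item.

```python
# Sketch of Max-Miner's candidate groups: each node of the set-enumeration
# tree carries a head, plus a tail holding every item that succeeds the
# greatest-ordered head item.

def candidate_group(head, ordering):
    """Return (head, tail) for a node of the set-enumeration tree."""
    cutoff = max(ordering.index(i) for i in head)
    return head, ordering[cutoff + 1:]

order = ["a", "b", "c", "d", "e"]
print(candidate_group(["a", "c"], order))  # (['a', 'c'], ['d', 'e'])

# Expanding a group: each tail item i spawns a child with head h U {i}.
children = [candidate_group(["a", "c"] + [i], order) for i in ["d", "e"]]
print(children)  # [(['a', 'c', 'd'], ['e']), (['a', 'c', 'e'], [])]
```

A child with an empty tail cannot be expanded further, which is how the enumeration reaches maximal candidates.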
EMERGING PATTERNS:
Association rules are useful for finding co-occurrence relationships in a dataset.
This information is not specifically designed for finding contrasts between two or
more datasets though. Finding such contrasts is a very intuitive way of thinking about
solving the classification problem. Should one be able to contrast concepts or
classes effectively, by identifying the features (or combinations of features) that are
unique to each, searching for these features in the objects one is classifying can help
point out the best matching class.
Dong & Li Border Differential Algorithm:
• finding the item-sets that satisfy some user-defined minimum support
constraint in the positive dataset
• finding the item-sets that satisfy some user-defined maximum support constraint in
the negative dataset
• performing a “difference” on these two sets
Any user mining EPs in this framework stipulates the kind of patterns they
desire. Essentially, this permits one to seek itemsets having at least a support
(where support is either relative or absolute) in the positive dataset and at most b
support in the negative dataset. So, this allows contrasts between the positive
dataset and the negative one to be identified. For a = 10% and b = 0.5%, the user is
requesting all patterns that occur at least 10% of the time in the positive dataset but
no more than 0.5% in the negative dataset.
The first two stages of the mining procedure just stated imply the finding of
borders. This equates to finding maximal frequent itemsets (MFI). For the case of the
positive border, we mine MFI with regard to a. The negative border is computed with
reference to b.
The third step of the entire process was implemented by iterating through
each positive instance and mining with reference to the negative dataset. Performing
this operation is the function of Border-Diff. By aggregating the output from each call
and minimising the accumulated results we obtain the complete set of EPs for the
positive dataset.
The Border-Diff operator has two input parameters:
• A reference set of item-sets, or negative instances, N.
• A single positive instance, p.
The objective is to find all subsets of p that are not contained in any element of N.
We initially find all subsets of p that are not contained in at least one element of N.
This is simply the set of differences Diffs = {p − ni : ni ∈ N}. Taking these differences,
the generation is based on partial expansions of the cross-product. Each di ∈ Diffs is
the EP that exists in p with reference to ni. This can be generalised to find EPs with
reference to any two ni, nj ∈ N. By taking any two differences, di, dj ∈ Diffs, if the cross
product is computed and minimised, subsets of p are generated such that they:
• do not exist in ni and nj
• are minimal
Computing the cross-product of three differences di, dj , dk and minimising the
result therefore provides those minimal subsets of p not present in the negative
instances ni, nj, nk. By extending this process to all di ∈ Diffs, the complete set of EPs
with respect to p and N is found.
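The expansion just described can be sketched as an incremental cross-product with minimisation after each step. This is a simplification of Border-Diff; the function names and the toy data are illustrative assumptions.

```python
# Sketch of the Border-Diff expansion: start from the differences p - ni,
# then repeatedly cross the current minimal subsets with the next difference
# and minimise, so the result covers one more negative instance each round.

def minimise(itemsets):
    """Keep only itemsets that have no proper subset in the collection."""
    return {s for s in itemsets if not any(t < s for t in itemsets)}

def border_diff(p, negatives):
    """Minimal subsets of p contained in no element of `negatives`."""
    p = frozenset(p)
    diffs = [p - frozenset(n) for n in negatives]
    # Singletons drawn from the first difference are the minimal subsets
    # of p not contained in n1.
    result = {frozenset([x]) for x in diffs[0]}
    # Each further round extends every subset by one item of the next
    # difference, then discards non-minimal results.
    for d in diffs[1:]:
        result = minimise({s | frozenset([x]) for s in result for x in d})
    return result

eps = border_diff({1, 2, 3, 4}, [{1, 2}, {1, 3}])
print(sorted(sorted(s) for s in eps))  # [[2, 3], [4]]
```

In the example, {4} is contained in neither negative instance, and {2, 3} is the smallest set that escapes both {1, 2} and {1, 3}; every other escaping subset is a superset of one of these two.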
Approaches Followed:
Tree Based Emerging Pattern Mining
EP Mining Issues and Objectives:
Our work is focused on providing an improved method of completing all three of the
stages listed above. We show that we can significantly improve completion of these
tasks by once again looking towards a data partitioning heuristic.
We construct partitions with two primary aims:
1. To minimise the number of item-sets present in each partition
2. To minimise the cardinality of item-sets contained in large partitions