Download Benchmark Algorithms and Models of Frequent Itemset Mining over

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Cluster analysis wikipedia , lookup

Nonlinear dimensionality reduction wikipedia , lookup

Transcript
Global Journal of Computer Science and Technology
Software & Data Engineering
Volume 13 Issue 5 Version 1.0 Year 2013
Type: Double Blind Peer Reviewed International Research Journal
Publisher: Global Journals Inc. (USA)
Online ISSN: 0975-4172 & Print ISSN: 0975-4350
Benchmark Algorithms and Models of Frequent Itemset Mining
over Data Streams: Contemporary Affirmation of State of Art
By V. Sidda Reddy, Dr. T.V. Rao & Dr. A. Govardhan
K.L. University, India
Abstract - Data mining and knowledge discovery is an active research work and getting popular by
the day because it can be applied in different type of data like web click streams, sensor networks,
stock exchange data and time-series data and so on. Data streams are not devoid of research
problems. This is attributed to non-stop data arrival in numerous, swift, varying with time, erratic and
unrestricted data field. It is highly important to find the regular prototype in single pass data stream or
minor number of passes when making use of limited space of memory. In this survey the review on
the final progress in the study of regular model mining in data streams. Mining algorithms are talked
about at length and further research directions have been suggested.
Keywords : data stream mining; sliding window; training model; linear reparability; data mining,
frequent pattern; combinatorial approximation; single-pass algorithms; bit-sequence representation.
GJCST-C Classification : H.2.8
Benchmark Algorithms and Models of Frequent Itemset Mining over Data Streams Contemporary Affirmation of State of Art
Strictly as per the compliance and regulations of:
© 2013. V. Sidda Reddy, Dr. T.V. Rao & Dr. A. Govardhan. This is a research/review paper, distributed under the terms of the Creative
Commons Attribution-Noncommercial 3.0 Unported License http://creativecommons.org/licenses/by-nc/3.0/), permitting all noncommercial use, distribution, and reproduction inany medium, provided the original work is properly cited.
Benchmark Algorithms and Models of Frequent
Itemset Mining over Data Streams:
Contemporary Affirmation of State of Art
Keywords : data stream mining; sliding window; training
model; linear reparability; data mining, frequent pattern;
combinatorial approximation; single-pass algorithms;
bit-sequence representation.
F
I.
Introduction
requent item set mining [48] is popularly known to
be very essential in several crucial data mining
activities. These activities include clusters [20],
classifiers [46], sequences [62], correlations [14] and
associations [35]. Several studies that point to mining
frequent item sets on statistic databases and several
proficient algorithms are suggested.
In recent years, data streams become an active
research work in computer applications such as
database systems, distributed databases and data
mining. Data streams also importance to study online
mining of frequent item sets, which is much needed for
These applications include sensor networks, ebusiness and stock market analysis, trend analysis,
fraud detection in telecommunications, network traffic
analysis, and web log and click-stream mining. It is ever
more tedious to perform data mining and advanced
analysis on huge and rapidly arriving data streams to
capture remarkable development,
model and
exceptions. This is recognized to the fast appearance of
these new application domains.
Author α : Research scholar, JNTU, Hyderabad, AP, India.
E-mail : [email protected]
Author σ : Professor, School of Computing, K.L. University, AP, India.
E-mail : [email protected]
Author ρ : Professor and DE, JNTU, Hyderabad, AP, India.
E-mail : [email protected]
Data streams are continuous and high speed
flow of data items that come in an appropriate order.
These are quite different when compared to the data in
traditional static databases.
Apart from having data distributions that modify
with time, the data streams are unbounded, usually
come in high speed and are continuous [23].
Data stream is further divided into offline
streams and online streams. Normal bulk arrivals are
attributed to offline streams [13]. Making reports on web
log streams is regarded as mining offline data streams
this is due to the reports that are created on the log data
for a specific period of time. Backup devices or queries
on updates to warehouse form other examples for offline
streams. Queries on these streams can be permitted to
treated offline.
Online
streams
are
distinguished
by
instantaneously restructured data that appear once at a
time. Calculating the regularity estimation of the internet
packet streams is an application of mining online data
streams, since internet packet streams is an instant one
at a time packet process. Sensor data, network
measurements and stock tickers are the other online
data streams. Apart from keeping pace with the high
speed of online queries, these must also be developed
online. Immediately on their arrival, they need to be
processed. It is not possible to process bulk data for
mining data streams comes with a new set of questions.
Primarily, to keep the whole stream in the main memory
or in a secondary storage area it will be impractical. This
is because data is streamed non-stop and the quantity
of data is limitless. Secondarily it is not feasible to use
traditional mining techniques with multiple scans like
stored datasets. In data streams, data will be streamed
and flow with high speed only once. To keep pace with
high data arrival rate the mining streams will need quick,
real-time processing. The results will likely to be
available in a short time span. Also when the item sets
are combine it can intensify mining regular item sets
over streams in regards to both processing frequency
and memory consumption.
Due to these limitations, research work
performed on approximating mining results, with some
sensible assurance on the value of approximation.
© 2013 Global Journals Inc. (US)
1
Global Journal of Computer Science and Technology ( C
D ) Volume XIII Issue V Version I
Abstract - Data mining and knowledge discovery is an active
research work and getting popular by the day because it can
be applied in different type of data like web click streams,
sensor networks, stock exchange data and time-series data
and so on. Data streams are not devoid of research problems.
This is attributed to non-stop data arrival in numerous, swift,
varying with time, erratic and unrestricted data field. It is highly
important to find the regular prototype in single pass data
stream or minor number of passes when making use of limited
space of memory. In this survey the review on the final
progress in the study of regular model mining in data streams.
Mining algorithms are talked about at length and further
research directions have been suggested.
Year 2 013
V. Sidda Reddy α, Dr. T.V. Rao σ & Dr. A. Govardhan ρ
Benchmark Algorithms and Models of Frequent Itemset Mining over Data Streams: Contemporary
Affirmation of State of Art
Year 2 013
a) Distinguish between data streams and static data
Global Journal of Computer Science and Technology ( C
D ) Volume XIII Issue V Version I
2
Static data contains persistent data relations,
approach of querying is one-time, data set is passive
and randomly accessed, no role of response time on
result accuracy, update speed is minimal and
responses in passive mode.
Data streams contains transient data relations,
the querying process is continuous, data set with
continuous updates and sequentially accessed,
response time influences the performance and results
accuracy, update speed is maximal since response type
is active
II. Taxonomy of Issues in Mining Data
Streams
Regular item set issue is a daunting problem.
For instance, there can be a massive number of regular
item sets because of combinational explosion. The test
here lies in how effectively can we detail, find and stock
up these regular item sets. Due to their exclusive
features, these data streams have added many more
new challenges in regular item set mining.
Lately, in a couple of data sources, the data
generation has become rapid than before. This quick
production of non-stop data streams has questioned
storage, communication and computation capacity in
the computing systems. It is quite demanding to store,
mine and query these data sets. Pulling out knowledge
structures present in models and patterns in the
uncontrollable streams of information is what mining
data streams are all about.
The demands and the research problems faced
in the data stream mining is motivational. Several
demands concerned with this arena are explained here.
a) Handling the continuous flow of data
This problem is concerning the data
management. These high data rates cannot be handled
by the traditional database management systems. As a
result, it is unreasonable to keep all the information in
relentless media. In addition, it is very high-priced to
aimlessly check the data several times. The demanding
task here is to access the data item once and get
frequent item sets. To deal with these fluctuations it is
important to see to new indexing, storage and querying
techniques.
b)
Data reduction and synopsis construction
To comply with the earlier mentioned problems
the data stream analysis, classification, querying and
clustering applications will need some type of
specification techniques. These techniques will be used
to give out fairly accurate answers from huge data sets
generally by means of synopsis construction and data
reduction. This can be done by choosing a subset of
incoming data or by making use of sketching,
aggregation techniques, load shedding.
© 2013 Global Journals Inc. (US)
c) Minimizing energy consumption
In the resource-constrained environment a
massive quantity of data streams are produced. For
example: Sensor network. Devices here have short
battery life. Energy efficient designs are very important
because sending every generated stream to a central
site is energy inefficient in addition to its lack of
scalability problem.
d) Unbounded memory requirements
Data streams have unbound data but the
storage that can be used to find or preserve frequent
item sets is inadequate. Machine learning techniques
show the core basis of data mining algorithms. When
the analysis algorithm is implemented, it is important
that the machine learning methods have the information
in the memory. Because of this large number of
produced streams, it is extremely essential to design
space effective techniques that can have one look or
less over the incoming stream. The frequency of an item
set depends on time; this is another result of
unbounded data. It is very demanding to use
inadequate storage and find dynamic frequent item sets
from unbounded data.
e) Transferring data mining results
Representation of knowledge is a new important
research essential. The structures must be transferred to
the user after getting it from models and patterns from
data stream generators. Kargupta et al [6] have dealt
with this issue. He has used Fourier transformations to
capably transfer the mining results in a limited
bandwidth links.
f)
Modeling changes of result over time
In a couple of cases, the user is disinterested in
the mining data stream results. But how do these results
change over time. For instance, clusters formed by
changes in data stream, may symbolize the changes in
the dynamics of the coming stream. This change would
help many temporal-based analysis applications like
disaster recovery, emergency and video-based
surveillance.
g) Real-time response
There is a demand on response time as the
data stream applications are very time specific.
Algorithms that come slower than the data arriving rate
in constrained situations are of no use.
h) Visualization of data mining results
Research is still on in revelation of traditional
data mining results on a desktop. It is a real challenge to
see visualization in the small screens of the PDA. If a
businessman is seeing the results of data being
streamed and analyzed on his PDA, the results should
be so effective that it should facilitate him to take quick
decisions.
Benchmark Algorithms and Models of Frequent Itemset Mining over Data Streams: Contemporary
Affirmation of State of Art
a) Sampling
Data size can be lessened by using the random
sampling technique. This can be done while capturing
its important attributes. By using this form of
summarization in a data stream and other summation
can be built by using this example itself [2]. The data
size is mandatory to get unbiased sample of data. The
sampling approach is the simplest approach that has
periodic time intervals which gives nice way to dwindle
the data stream. This approach can have huge
information loss in the stream as the data rates change.
b) Sketching
Sketching is a process of constructing a
statistical synopsis of a data stream that makes use a
little amount of memory and has frequency moments.
c) Histograms
Histograms are synopsis structures that can
total the distribution value of the dataset. Tasks such as
data mining, approximate query answering and query
size estimation uses histograms.
d) Wavelets
These are methods for estimating data with a
known likelihood. Wavelets co-efficient are predictions of
a known signal (set of data values) into an orthogonal
set of source vectors.
e) Concept Drifts
This is a crucial field in data mining. We need to
study the judgment of swift and precise concept drift,
the successful use of concept drift acquisition, how to
save and serious use of concept and inclination of
concept drift.
f)
Sliding Window Models
The recent most data streams are examined in
the sliding window models. The recent data items and
summarized versions are examined in detail. Many
techniques have accepted by many techniques. A
preset newly generated data items, intended for data
mining are taken up and knowledge discovery is
performed on this. To make this synopsis structure
several other synopsis structures are conversed to the
entire data stream. It is assumed that the sliding window
model is inspired to use the most recent data in the data
stream [9]. Hence only a set history of the data stream is
taken into consideration for the analysis and processing.
g) Landmark-window models
Regular sets are worked out from adjacent
transactions in a stream which is known between a
particular transaction in the past called a landmark and
the existing transaction. While one steadily increases
with the arrival of new transactions, the other endpoint
(landmark) remains fixed. Transactions generally come
h) Damped window model
Compared to the preceding windows the most
new one have crucial importance in the damped
windows. Older transactions do not add much towards
the item set frequencies. For instance, the sliding
window model will compute the average. The
importance of data go slow exponentially into the past in
the damped window model.
Mining Frequent Item Set Over
Data Streams: a Contemporary AffirMation of the State of Art
IV.
A regular item set mining algorithm over the
data stream has been developed by Giannella et al [56].
Tilted windows have been used to estimate the regular
patterns for the newest transactions built on the fact that
the interest of the users lie on the newest transactions.
Frequent item sets are symbolized by a tree data
structure called as FP-stream, an increasing algorithm is
used to maintain this. The earlier historical data is made
use of to estimate the regular patterns increment by the
implemented algorithm. A statistical technique was
proposed by Laur et al [61] that is an unfair estimation
of the support of sequential patterns or recall as
selected by the user and restricts the degradation of the
different principle. Using statistical support the
conventional minimal support condition for checking
regular sequential patterns was replaced. The
theoretical foundations of the data stream analysis were
checked by Gaber M M [49]. He also checked the
mining data stream systems and techniques. The issues
in the streaming was outlined and contemplated upon.
Observation: Algorithms [56] [13] [6] [49] that
deal with mining frequent or sequential patterns on the
data streams spotlight on how to technically deal with
large historical data. There is no assurance that the
algorithms can be run in limited memory capacity, when
the data becomes huge and does not follow the running
time.
MOMENT, an algorithm was suggested by Chi
et al [63]. For continuous streaming data this showed
sliding window to mine closed regular itemset. CET an
internal data structure is used to check and store the
closed item sets and boundary nodes. Teng [4]
suggested a FTP-DS model that has sliding windows to
ease the heavy information volume brought in by the
continuous data streams. Data streams from the item
sets give in temporal patterns to the FTP-DS mines. The
sliding window data model is used by the FTP-DS an
approximate mining algorithm. Li et al [11] created a
DSM-FI technique to mine frequent item sets over the
whole historical stream information. A distinctive topdown frequent item set discovery scheme and compact
© 2013 Global Journals Inc. (US)
Year 2 013
Mining Data Streams
in batches for this model that is generally appropriate for
data streams.
3
Global Journal of Computer Science and Technology ( C
D ) Volume XIII Issue V Version I
III. Taxonomy of Data Preprocessing in
Year 2 013
Benchmark Algorithms and Models of Frequent Itemset Mining over Data Streams: Contemporary
Affirmation of State of Art
Global Journal of Computer Science and Technology ( C
D ) Volume XIII Issue V Version I
24
pattern representation is accepted by the DSM-FI.
Another algorithm on lossy counting was shown my
Manku [57] and Motwani [13]. For the very first time
mining, frequent item sets over the whole streaming
data was used.
Observation: Historical Meta patterns got
through data scanning are stored in internal data
structures. This is a familiar feature that is explained in
[4][11][63][13]. Additional procedures are created to remine the internal data structures when the user
instigates a request for mining results.
Using a data structure called FP-Stream to keep
historical knowledge such as item set frequencies
Giannella et al [53] proved an approximation algorithm
in random time intervals for mining frequent item sets.
Linear Potential Function (LPF) algorithm to estimate
propagation in Bayesian networks was shown by Zhang
and Zhang [32]. An algorithm for keeping up frequent
item sets in stream data in the supposition that every
transaction holds a weight that is related to its age,
permitting transactions on various ages to be dealt with
different approaches was developed by Chang and Lee
[38]. Stream an online algorithm was given by Asai et al
[65] targeted mining patterns from semi-structured
stream data like the XML data.
Observation: [53] [32] [38] [65] are the
methodologies that permit handling of huge or infinite
data streams in an estimated manner.
Cheung et al [42] [2] showed algorithms FUP
and FUP2 for augmenting the frequent item sets. Similar
algorithm was proposed by Thomas et al [12]. The
mentioned models benefit from the tie up between the
original database (DB) and incrementally changed
transactions (db) and they also assume batch updates.
Also known as Apriori algorithm FUP is a multiple-step
algorithm. ZIGZAG was an algorithm proposed by
Veloso et al [52] for mining regular item sets in
developing databases. ZIGZAG was later extended by
Otey et al [67] as parallel and distributed algorithms.
[47] [2] and [12] have a striking similarity with Zigzag
when both accelerate by using DB and db.
Observation: The main thing noted in the FUP is
that some regular item sets will remain regular and
some earlier irregular item sets will become regular
when db is added to DB. Such item sets are called as
winners. In addition some earlier regular item sets will
become irregular. Such item sets are called as losers.
The main method of FUP is to use the data in db to
separate some winners and losers and thereby
decrease the size of the candidate in Apriori algorithm.
The working of the Apriori algorithm heavily relies on the
size of the candidate set. Apriori’s working is improved
to a large extent by the FUP. Previous transactions from
the database can be removed by extending FUP to
FUP2. Hence FUP2 is used in the sliding-window data
model and FUP will deal only with accumulative data
model. FUP2 and the algorithm told in [12] are alike
© 2013 Global Journals Inc. (US)
except that supplementing the regular item sets a
negative border is retained. The regular item sets of db
are mined first in the algorithm. Also the regular item
sets in DB are updated in parallel. The regular item sets
in the restructured database are calculated with a
possible scan of the restructured database which is
based on the change of the regular item sets in DB, the
negative border in DB and the regular item sets in db.
The algorithm [12] works well as the changed database
is scanned at most once. Both accumulative and sliding
window can make use of algorithm [12]. The [47], [2]
and [12] come under the exact mining algorithm.
ZIGZAG comes with many dissimilar features. Zigzag
increases the speed of the support counting of common
item sets in the reconstructed database and it does not
find the regular item set in db itself. Hence with the
lowest support ZIGZAG can take care of the batch
update with random block size. It also takes in the
techniques given by the GENMAX algorithm [31] and in
every update it sustains maximum regular item sets.
Sometimes this is not enough, at that point of time a
second step is used in ZIGZAG. Here the restructured
database is checked to find all regular item set and their
supports.
An algorithm was created by Chi et al [16]
where closed regular item sets are mined over data
stream sliding windows. A condensed data structure
called a closed enumeration tree (CET) was brought in
to keep a vigorous selected set of item sets over the
sliding window. There is a boundary in the closet regular
item sets and the other item sets. The boundary
movements in the CET will show the concept drifts in
data streams. Sliding – window data model are the
exact mining algorithms used by both the ZIGZAG and
Moment.
1-pass algorithm, Count Sketch, was shown by
Charikar et al [37] that resend the most regular items
whose frequencies to convince an entry with high
probabilities. A randomized algorithm was developed
Manku et al [13] for upholding regular items over a data
stream for a time t, the regular items defined over the
whole data stream up to t.
Observation: No false negativity is promised by
these algorithms and restriction the error of the
calculated frequency. The Lossy Counting Algorithm is
further improved to handle regular item sets. A process
called as trie takes care of all regular item sets, this trie
is restructured by batches of communications in the
data stream. The algorithm Manku et al is attempt for a
tunable negotiation between error bounds and memory
usage. Accumulative data model is used by the Lossy
Counting, Sticky Sampling and Count Sketch.
An algorithm presented by Chang et al [38]
estDec, here frequency is defined by the age of a
function. During a random time intervals, data streams
mined by regular item sets, Giannella et al [56]
proposed this algorithm. FP stream is helpful in storing
Liu et al [64] recommended a methodology to
systematize and sum up the found association rules.
This methodology simplifies the rules and keeps stays
on exceptions to the generalization.
Observation: [66] focuses on holding the level
of items and the categorization of rules making use of a
hierarchical association rule collecting the groups the
created rules from the item space to the hierarchical
space. To find comprehensive association, simplified
methods [29] can be used over the history of
organization. Based on appealing measures [21] [71]
many research projects in this arena have looked upon
the rule ordering. For example, Li et al [21] suggested
rule position depending on assurance, sustenance and
the number of items on the left hand side to trim the
found rules.
In the past 10 years, a good number of
algorithms have been suggested on the mining regular
patterns in data streams. These can be divided into
three special categories, landmark based [4, 13, 1, 9]
damped or time decay based [39, 33] and a sliding
window based [26-34] algorithms. DSM-FI [11] is an
innovative algorithm which converts all operations into
minor transactions and is put in the synopsis data
structure known as the item-suffix regular item set forest
that is based on the prefix-tree. In order to get fairly
accurate results of regular patterns over the landmark
window the authors made use of Chernoff Bound [1].
Lattice structure was made use by Zhi-Jun et al this is
also known as the regular enumerate tree that is
separated into many equal classes of the accumulated
patterns with the similar transaction ids in a single class.
estDec was recommended by Change and Lee
reproduced the time decay model where every
transaction has an influence that lessens with age [39].
Algorithm [33] which is alike estDec, it is said that
mining increasing number of regular item set
instantaneously over data stream based on a damped
model.
Observation: The data that enters from a
particular time called landmark till the existing time is
taken into consideration in the landmark model. This
time can be the starting or the restarting time. Earlier
and existing transactions of the input stream are
considered similar. Regular models are further
separated into equivalent classes and only these regular
patterns symbolize the two borders of every class are
preserved, other regular models are trimmed.
Significance is given to data elements based on
their arrival order in the time decal model. This is done in
order to highlight the recently arrived data. A decay rate
is shown to lessen the effect of the previous transaction
in the set of regular pattern.
Numerous sliding window based algorithms
recommended for regular item set mining over data
streams. Raw transactions of sliding windows can be
stored using DS Tree [26] and CPS-Tree [74]. Fixed tree
© 2013 Global Journals Inc. (US)
5
Global Journal of Computer Science and Technology ( C
D ) Volume XIII Issue V Version I
and updating historic data for the regular item sets and
their occurrence with time and a period function is made
use of to update the entries so that the newest entries
are prejudiced more.
Observation: estDec and Giannella’s algorithms
are very prejudiced and are basically approximate
mining algorithms. An error level in Giannella’s algorithm
is wee bit different as it gives error levels for information
at different levels.
The numbers of patterns available have been
vast. Showing the pattern in a condensed form is being
worked upon. For example, Pei et al [19] and Zaki et al
[18] show well-organized Closet and Charm. An item set
is known as closed if all its supersets do not get the
support that it gets [30]. Fp-tree was another efficient
information structure that was coined by Han et al [57]
for efficiently saving transactions for a specified low
support threshold. A perfect algorithm known as FPgrowth for mining Fp-tree was suggested.
Observation: Mining data streams does not get
much support from the Fp-growth algorithm. In this case
it goes through each window and is prohibitively
expensive for big windows.
Windows
methodology
has
gained
a
considerable amount of interest from regular item sets in
data streams [43] [15] [7-9] [17] [59]. Therefore false
negative based approach for mining of regular item sets
over data streams has been proposed by Yu et al [1].
Lee at al [73] suggested the generation of k candidate
sets from (k-1) candidate sets without checking their
frequency. Jiang et al [17] recommended an algorithm
for increasing holding the closed regular item sets over
data streams. Extra passes over the data can be
avoided this way and also this will not conclude in too
many added candidate sets. Chi et al [63] advised the
Moment algorithm for having a closed regular item sets
over the sliding windows. When it comes to bigger slide
sizes, Moment does not really help. Increasing mining
on regular item sets is supported by Cats Tree [43] and
Can Tree [15]. A good amount of work has been done
by counting the candidate item sets more capably.
Observation: Park et al [8] suggested the hashbased counting method which was made use by several
before mentioned regular item sets algorithms [7-2, 18,
8]. On the contrary Brin et al [27] suggested an active
algorithm known as DIC for competently counting item
sets frequencies. Fp-tree data structures are made use
of by the fast verifiers in this model. The thought of
putting conditions is to get must faster delta
maintenance and counting patterns.
An
association
rule
generation
and
summarization is another important area of work. Won et
al [66] suggested a system for a hierarchical rule
generation driven by ontology. In addition, Kumar et al
[29] also suggested a single pass algorithm centered on
the hierarchy-aware counting and transaction pruning
for mining association rules, after the structure is given.
Year 2 013
Benchmark Algorithms and Models of Frequent Itemset Mining over Data Streams: Contemporary
Affirmation of State of Art
Year 2 013
Benchmark Algorithms and Models of Frequent Itemset Mining over Data Streams: Contemporary
Affirmation of State of Art
Global Journal of Computer Science and Technology ( C
D ) Volume XIII Issue V Version I
26
structure is used by DS Tree in canonical order of the
branches and CPS-Tree is recreated to keep a check on
the memory usage. Both these use the FP-Growth [418]
for mining as suggested for static databases. MFI-Trans
based on priori algorithm was recommended in [54].
Every regular item sets is mined over the newest window
of transactions. Bit string is made use of for all items to
keep its happening data in the window. A sliding based
algorithm has been recommended in [28]. This has a
window content that is kept vigorously using a set of
easy lists.
Lin et al [51] suggested a fresh technique for
mining regular patterns on time receptive sliding
window. Here window is divided into batches and
mining for this item set is done discretely. A limit in
frequent closed item set and other item set is kept; this
helps the Moment algorithm to locate the closed regular
item sets.
The regular item set in one pane of the window
are taken into consideration for further check to look out
for regular item sets in the entire window in SWIM [77] a
pane based algorithm. The union of these regular
patterns of all panes is maintained. It also increasing
updates support and trims irregular ones. Transactions
are stored as a prefix tree for each plane. The authors
devised algorithm for mining [4-10] constantly to
maintain the non-derivable regular item sets of the
sliding windows. Closed regular item sets and nonderivable can be viewed as a synopsis of all regular
item sets.
estWin algorithm was recommended by Chang
and Lee it locates the newest regular patterns adaptively
on transactional data streams making use of the sliding
window model. Monitoring lattice is made use of as a
prefix tree to check the set of regular item sets over a
data stream. Every node of the lattice will symbolize an
item set that can be created by using items kept in the
nodes in the way from the root to the node. All the data
is kept in the final node of the item set. A Less amount
and small support called as minimum important is used
by the algorithm to know the new regular item set to
approximate their support. The existing set of regular
item set is updates going to the related item set in the
monitoring lattice, when a novel transaction come from
the input data stream. With this noteworthy item sets are
known by second traversal of the related ways of the
monitoring lattice making use of a transaction. In order
to remove their effect when the algorithm expires keeps
the set of transaction of the window in the memory.
Connected ways of the monitoring lattice must be gone
through to delete the previous transaction from the
window. The algorithm trims irrelevant item sets from the
tree after reaching a stable number of transactions
making use of a method called as force pruning. When
all the subsets become irrelevant and are kept in the
tree an item set is put to the prefix tree. The support of
the novel important item set in the old transaction of the
© 2013 Global Journals Inc. (US)
window is estimated by the algorithm. The item set of
length k (k-item set) calculated support is alike to the
lowest support in all its subsets with length k-1.
Information like possible count 9pcnt), error (err), actual
count (acnt) and first transaction (mid) are available in
each node of the prefix tree. Pcnt is known as the
calculated frequency of the item set in the older
transaction of the window before the item set was put to
the prefix tree. Acnt is the checked frequency of the item
set post its insertion in the tree. Err is termed as the error
of the calculated support of the item set estimated using
the subsets. Id of transaction that results the item set to
be put in the tree is known as Mtid. The total of Acnt and
Pcnt gives support of an item set in the prefix tree. To
know more about data computing the probable count
and the error in support of important item sets and other
features of the estWin algorithm we can refer to [34]. On
putting the item set, fresh transactions come in,
previous transactions are removed and the error is
minimized. Due to the removal of previous transactions
the error will lessen.
New-Moment was used by H F Li [45] to
preserve a set of regular closed item sets in data
streams with transaction sensitive sliding window. Chi
recommended MOMENT [63] which was a usual
algorithm that can reduce the size of the data structure.
A new way was suggested by N Jiang [70] for mining
regular closed item sets over data streams. Compact
data structure was brought in by Y Chi [16], this is a
closed enumeration tree that keep a set of item sets that
are dynamically selected in a sliding window. There is a
border between the item sets and regularly closed item
sets. FPCFI-DS was an algorithm suggested by F J Ao
[10] for mining closed regular item sets in data streams.
This makes use of single-pass lexicon graphical order
FP-Tree-based algorithm that is merged with the
ordering policy to mine the closed regular item sets in
the primary window and the tree is updated for every
sliding window.
Another mining task for recommended by J Y
wang [T8-9] for mining top-k regular closed item sets
whose length was greater than min_l.
Observation: compared to the sliding window
based regular item set the data streams can be
categorized as two groups. [26] [74] [54] and [28] is the
first group. The window is kept above the newest
transactions and frequent patterns in the current window
are extracted by the mining process. This happens
when the user submits a request. Hence the intention
here is to keep the transactions making use of the low
memory and do the mining process competently. [51]
[9-8] [77] [41] and [34] is the second group where the
mining results are constantly updated by making use of
the fresh transaction to the window and previous
transactions are removed from the window. Hence quick
updation of the mining result and less memory usage
gives the desired result. As you can see the mining
V.
Conclusion
The deliberations above we observe that the
present mining strategies have an increasing and one
pass mining algorithm that are appropriate to mine data
streams. Very few of them tackle the concept drifting
problem. Fairly accurate results are obtained from these
algorithms as a result of large data streams and less
memory. Unlike traditional databases, here we cannot
keep the frequency counts of all item sets in the entire
data streams. Precise mining results are obtained by
some suggested algorithms by keeping up to a minute
subset of regular item sets from data streams, retaining
their precise frequency counts. Sliding window data
processing model helps in preserves only some section
of the regular item sets in the sliding windows. Different
other ways to maintain is maximal frequent item sets,
closed frequent item sets and short frequent item sets.
The existing stream data mining techniques
need users to describe one or more parameters before
its implementation. But many of these streams do not
tell how these parameters can be adjusted while they
run. Waiting for the mining algorithm to cease to reset
the parameters is not a good idea as it will take a long
time for the execution as the data is massive.
Some recommended techniques permit the
user to regulate some parameters online. There is no
assurance that these parameters are of key importance
and they might not be helpful in the mining scenario.
Auto adjustment of mining algorithms and users
adjusting online was also considered. The research in
this field is in gestation stage. To tackle all those
discussed issues in this paper would propel the process
of developing association rule mining applications in
data stream systems. Previous researches focused on
creating competent mining techniques. A new aspect of
data mining streams is necessary to find scalable
regular item set mining models with the following
factors.
1. Utility Mining
2. Weighted Supports
3. Multiple Supports
4. Transitional States
5. Temporal Validity
If many of these issues are yet too solved and a
perfect and efficient system has to be developed, it is
certain that in coming future data stream item set will be
very important in the business world. Focusing on this
we guesstimate more research in this direction.
REFERENCES RÉFÉRENCES REFERENCIAS
1. Information Sciences, 176(14), 2006, pp. 1986–
2015.
2. A General Incremental unquie for Maintaining
Discovered Association Rules. Proceedings of the
Fifth International Conference on Database Systems
for Advanced Ap- plications (DASFAA), Melbourne,
Australia, April 1, 4, 1997.
3. A Generic Local Algorithm for Mining Data Streams
in Large Distributed Systems
4. A regression-based temporal pattern mining
scheme for data streams. In: Proceedings of the
29th VLDB Conference (Berlin, Germany, 2003) 93–
104.
5. A Sketch-Based Architecture for Mining Frequent
Items and Item sets from Distributed Data Streams.
In: Proc. of the 11th IEEE/ACM International
Symposium on Cluster, Cloud and Grid Computing
(2011), pp. 245-253.
© 2013 Global Journals Inc. (US)
7
Global Journal of Computer Science and Technology ( C
D ) Volume XIII Issue V Version I
result with immediate effect the algorithms of the second
group are proven to be more useful. To give out quickly
the estimated algorithm that can find set of regular item
sets in the sliding window with elevated quality of result
is adequate. The first group [26-6] fails to adaptively
maintain and update the mining result. The result
becomes invalid when fresh transactions arrive from the
stream. This results in the re-execution of the mining
task. Conversely, relating a mining algorithm to the
entire window will require significant processing and
memory requirements specifically in big window size in
respect to algorithms of second group [28] [51] [9-8]
[77] and [410] where mining results are updated
adaptively. Compared to the high arrival rated of the
input streams the functioning of these algorithms are
poor.
In [69], [75] and [44] sequential pattern-mining
algorithms in data streams are presented. The issues
relating to the mining distributed data stream was dealt
in many areas of data mining such as mining regular
item sets [55], [5], clustering [24],[3] and association
rules [7], [50]. In contrast, sequential patterns of mining
in several data streams are dealt in [72] and [42].
Collective Data Mining (CDM) has been suggested
in [22].
Observation: The methods [69], [75] and [44]
cannot deal with the distributed streams, as
accumulating data streams in one location before giving
them out come with a high-priced cost and can invade
owner’s privacy. Old mining results will not be preserved
in MILE algorithm, which happens to be one-time
fashioned algorithm. Therefore, this might take
excessive amount of time in re-mining. In IAs pam
algorithm this issue has been done away with. It was
shown in [42] that increasingly mine across-streams
sequential type to hold the fresh mining results.
Nevertheless centralized setting strategy is made use
here and it depends on a sliding window at the same
working node to read the sample data streams. CDM
has a purpose to guarantee that incomplete models
made from local data at various sites are right. A global
model can be formed from these models.
Year 2 013
Benchmark Algorithms and Models of Frequent Itemset Mining over Data Streams: Contemporary
Affirmation of State of Art
Year 2 013
Benchmark Algorithms and Models of Frequent Itemset Mining over Data Streams: Contemporary
Affirmation of State of Art
Global Journal of Computer Science and Technology ( C
D ) Volume XIII Issue V Version I
28
Measures
for
Mining
6. Alternative
Interest
Associations. In IEEE TKDE, 15:57-69, 2003.
7. An Approach for Distributed Streams Mining Using
Combination of Naïve Bayes and Decision Trees
8. An effective hash-based algorithm for mining
association rules. In SIGMOD, pages 175-186,
1995.
9. An efficient algorithm for frequent itemset mining on
data streams, Proc. ICDM, 2006, pp. 474–491.
10. itemsets in data Streams. Proceedings of the IEEE
8th International Conference on Computer and
Information Technology, pp.37-42, 2008.
11. An efficient algorithm for mining frequent itemsets
over the entire history of data streams. In: 1st
International Workshop on Knowledge Discovery in
Data Streams (Pisa, Italy, 2004) 20–24.
12. An Efficient Algorithm for the Incremental Updation
of Association Rules in Large Databases.
Proceedings of the Third International Conference
on Knowledge Discovery and Data Mining (KDD97), Newport Beach, California, USA, August 14, 17,
1997.
13. Approximate frequency counts over data streams.
In: Proceedings of the 28th VLDB Conference (Hong
Kong, China, 2002).
14. Beyond Market Basket: Generalizing Association
Rules to Correlations. In Proc. of SIGMOD, 1997.
15. Cantree: A tree structure for efficient incremental
mining of frequent patterns. In ICDM, pages 274281, 2005.
16. Catch the Moment: Maintaining Closed Frequent
Itemsets in a Data Stream Sliding Win- dow.
Knowledge and Information Systems, 10(3):
265-294.
17. Cfi-stream: mining closed frequent itemsets in data
streams. In SIGKDD, pages 592-597, 2006.
18. CHARM: An efficient algorithm for closed itemset
mining. In SIAM, 2002.
19. CLOSET: An efficient algorithm for mining frequent
closed itemsets. In SIGMOD, pages 21-30, 2000.
20. Clustering by Pattern Similarity in Large Datasets. In
Proc. of SIGMOD, 2002.
21. CMAR: Accurate and efficient classification based
on multiple class-association rules. In ICDM, pages
369-376, 2001.
22. Collective Data Mining: A New Perspective towards
Distributed Data Mining. In: Advances in Distributed
and Parallel Knowledge Discovery, editied by H.
Kargupta and P. Chan, chapter, 5, AAAI Press / The
MIT Press (2000).
23. Data Streams and Histograms; ACM Symposium on
Theory of Computing; 2001.
24. Discovering Trend-Based Clusters in Spatially
Distributed Data Streams. International Workshop of
Mining Ubiquitous and Social Environments (2010),
pp 107-122.
© 2013 Global Journals Inc. (US)
25. Discovery of Frequent Episodes in Event
Sequences. In DMKD, 1:259-289, 1997.
26. DSTree: a tree structure for the mining of frequent
sets from data streams”, in Proceedings of IEEE
International Conference on Data Mining, 2006, pp.
928–932.
27. Dynamic itemset counting and implication rules for
market basket data. In SIGMOD, pages 255-264,
1997.
28. EclatDS: An Efficient Sliding Window Based
Frequent Pattern Mining Method for Data Streams”,
Intelligent Data Analysis, Vol. 15(4), 2011, pp.
571-587.
29. Efficient algorithm for hierarchical online mining of
association rules
30. Efficient mining of association rules using closed
itemset lattices. Information Systems, pages 25-46,
1999.
31. Efficiently Mining Maximal Frequent Itemsets.
Proceedings of the 2001 IEEE International
Conference on Data Mining, San Jose, California,
USA, November 29 , December 2, 2001.
32. Encoding probability propagation in belief networks
IEEE Transactions on Systems, Man and
Cybernetics (Part A) 32(4) (2002) 526–31.
33. est Max: Tracing Maximal Frequent Itemsets
Instantly over Online Transactional Data Streams”,
IEEE Transactions on Knowledge and Data
Engineering, 21 (10), 2009, pp. 1418-1431.
34. estWin: Online data stream mining of recent
frequent itemsets by sliding window method”,
Journal of Information Science, Vol. 31, 2005, pp.
76–90.
35. Fast Algorithms for Mining Association Rules.
36. Fast algorithms for mining association rules” in
Proceedings of International Conference on Very
Large Databases, 1994, pp. 487–499.
37. Finding Frequent Items in Data Streams.
Proceedings of the 2002 International Collo- quium
on Automata, Languages and Programming,
Malaga, Spain, July 8, 13, 2002.
38. Finding recent frequent itemsets adaptively over
online data streams. In: Proceedings of the 9th ACM
SIGKDD International Conference on Knowledge
Discovery
and
Data
Mining
(KDD2003)
(Washington, DC, 2003) 226–35.
39. Finding recently frequent itemsets adaptively over
online transactional data streams Frequent Itemset
Mining Implementations. In Proc. of the ICDM 2003
Workshop
on
Frequent
Itemset
Mining
Implementations, 2003.
40. Frequent pattern mining: current status and future
directions”, Data Mining and Knowledge Discovery,
Vol. 15, 2007, pp. 55–86.
41. Incremental Mining of Across-streams Sequential
Patterns in Multiple Data Streams. Journal of
Computers Vol. 6 (2011), pp.449-457.
59. Mining Sequential Patterns on Data Streams: a
Near-Optimal Statistical Approach.
60. Mining Sequential Patterns. In Proc. of IDCE, 1995.
61. Moment: maintaining closed frequent itemsets over
a stream sliding window. In: Rajeev Rastogi et al.
(eds) Proceedings of 4th IEEE International
Conference on Data Mining (Brighton, UK, 2004)
59–66.
62. Multi-level organization and summarization of the
discovered rules. In Knowledge Discovery and Data
Mining, pages 208-217, 2000.
63. Online algorithms for mining semi structured data
streams.
64. Ontology-driven
rule
generalization
and
categorization for market data. Data Engineering
Workshop, 2007 IEEE 23rd International Conference
on, pages 917-923, 2007.
65. Parallel and Distributed Methods for Incremental
Frequent Itemset Mining. IEEE Transactions on
Systems, Man, and Cybernetics, Part B, 34(6):
2439, 2450.
66. Querying and Mining Data Streams: You Only Get
One Look. In Tutorial of SIGMOD, 2002.
67. Random Sampling Over Data Streams for
Sequential Pattern Mining. In: Proc. of the 1st
European Workshop on Data Streams (2007), pp.
61-66.
68. Research issues in data stream association rule
mining. SIGMOD Record 35 (1), pp.14-19, 2006.
69. Selecting the right interestingness measure for
association patterns. In Knowledge Discovery and
Data Mining, pages 32-41, 2002.
70. Sequential Pattern Mining in Multiple Streams. In:
Proc. of the 5th IEEE International Conference on
Data Mining (2005).
71. Sliding window filtering: an efficient method for
incremental mining on a time-variant database.
Information Systems, pages 227-244, 2005.
72. Sliding window-based frequent pattern mining over
data streams”, Information Sciences, Vol. 179, 2009,
pp. 3843-3865.
73. Stream Sequential Pattern Mining with Precise Error
Bounds. In: Proc. of the IEEE International
Conference on Data Mining (2008), pp. 941-946.
74. The space complexity of approximating the
frequency moments. Compute. Syst. Sci.,
58(1):137–147, 1999.
75. Verifying and mining frequent patterns from large
windows over data streams”, in Proceedings of
International Conference on Data Engineering,
2008, pp. 179–188.
© 2013 Global Journals Inc. (US)
9
Global Journal of Computer Science and Technology ( C
D ) Volume XIII Issue V Version I
42. Incremental mining of frequent patterns without
candidate generation or support. In 7th Database
Engineering
and
Applications
Symposium
International Proceedings, pages 111-116, 2003.
43. Incremental Mining of Sequential Patterns over a
Stream Sliding Window. In: Proc. of the 6th IEEE
International Conference on Data Mining (2006),
pp.677-681.
44. Incremental updates of closed frequent itemsets
over continuous data streams. Expert Systems with
Applications, Vol.36, pp.2451-2458, 2009.
45. Integrating Classification and Association Rule
Mining. In Proc. of KDD, 1998.
46. Maintenance of Discovered Association Rules in
Large Databases: An Incremental Up - dating
Technique. Proceedings of the Twelfth International
Conference on Data Engineering, New Orleans,
Louisiana, USA, February 26 , March 1, 1996.
47. Mining Association Rules between Sets of Items in
Large Databases. In Proc. of SIGMOD, 1993.
48. Mining Data Streams: A Review, Mining Distributed
Evolving Data Streams using Fractal GP Ensembles.
In: Proc. of the 10th European Conference on
Genetic Programming (2007), pp. 160-169.
49. Mining frequent itemsets from data streams with a
time-sensitive sliding window”, in Proceedings of
SDM International Conference on Data Mining,
2005.
50. Mining Frequent Itemsets in Evolving Databases.
Proceedings of the Second SIAM International
Conference on Data Mining, Arlington, VA, USA,
April 11, 13, 2002.
51. Mining frequent itemsets over arbitrary time intervals
in data streams. In: Technical Report TR587
(Indiana University, IN 2003).
52. Mining frequent itemsets over data streams using
efficient window sliding techniques”, Expert
Systems with Applications, Vol. 36, 2009, pp.
1466–1477.
53. Mining Frequent Itemsets over Distributed Data
Streams by Continuously Maintaining a Global
Synopsis. Data Mining and Knowledge Discovery
Vol. 23 (2011), pp. 252-299.
54. Mining Frequent Patterns in Data Streams at
Multiple Time Granularities,
55. Mining frequent patterns without candidate
generation. In: SIGMOD’ 00 (ACM Press, Dallas, TX,
2000) 1–12.
56. Mining Frequent Patterns without Candidate
Generation: A Frequent-Pattern Tree Approach”,
Data Mining and Knowledge Discovery, Vol. 8,
2004, pp. 53-87.
57. Mining maximal frequent itemsets from data
streams. Information Science, pages 251-262, 2007
58. Mining non-derivable frequent itemsets over data
stream”, Data & Knowledge Engineering, Vol. 68,
2009, pp. 481-498.
Year 2 013
Benchmark Algorithms and Models of Frequent Itemset Mining over Data Streams: Contemporary
Affirmation of State of Art
Year 2 013
Benchmark Algorithms and Models of Frequent Itemset Mining over Data Streams: Contemporary
Affirmation of State of Art
Global Journal of Computer Science and Technology ( C
D ) Volume XIII Issue V Version I
10
2
This page is intentionally left blank
© 2013 Global Journals Inc. (US)