ORIENTAL JOURNAL OF
COMPUTER SCIENCE & TECHNOLOGY
An International Open Free Access, Peer Reviewed Research Journal
Published By: Oriental Scientific Publishing Co., India.
ISSN: 0974-6471
December 2012,
Vol. 5, No. (2):
Pgs. 277-281
www.computerscijournal.org
Performance Analysis of Data Mining Algorithms to Generate
Frequent itemset at Single and Multiple Levels
Md. IQBAL
Department of Computer Science & Engineering, Institute of Technology, Meerut (U.P.)
(Received: September 10, 2012; Accepted: September 18, 2012)
ABSTRACT
Knowledge Discovery and Data Mining are rapidly evolving areas of research at
the intersection of several disciplines, including statistics, databases, artificial intelligence,
visualization, and high-performance and parallel computing. Data mining is the core part of the
Knowledge Discovery in Databases (KDD) process. The KDD process consists of data selection, data
cleaning, data transformation, pattern searching (data mining) and pattern evaluation. Data mining
has been described as “the task of discovering interesting patterns from large amounts of data,
where the data can be stored in databases, data warehouses or other information repositories”.
Thus data mining is the extraction of implicit, previously unknown and potentially useful
information from the vast amount of data available in such data sets. Organizations in
business, science, medicine, academia and government collect such data. The problem is that not
enough human analysts are available who are skilled at translating all of this data into knowledge.
The development of next-generation databases and Management Information Systems (MIS) has
been empowered by data mining, which helps extract hidden, useful information and aims
at formulating knowledge for organizational decision making. Thus the goal of data mining and
knowledge discovery is to turn “data into knowledge”. Data mining is becoming more widespread
every day because it empowers organizations to uncover profitable patterns and trends in
their existing databases. Most organizations spend millions of dollars to collect megabytes and
terabytes of data but do not take advantage of the valuable information stored in it. Data mining
tools use different techniques and algorithms. The tasks of data mining are distinct because many
kinds of patterns exist in a large database. The techniques can be integrated or combined to deal with
a complicated problem residing in these large databases, and most data mining tools employ multiple
methods to deal with different kinds of data in different application areas. Based on the pattern one
is looking for, data mining tasks can be classified into summarization, classification,
clustering and association.
Key words: Data Mining, Management Information System, Artificial Intelligence, Performance
Analysis of Data Mining Algorithms to generate frequent itemset at Single and Multiple Levels.
INTRODUCTION
Data mining, also known as Knowledge
Discovery in Databases (KDD), has been well studied
for several decades. It has been described as “the
nontrivial extraction of implicit, previously unknown,
and potentially useful information from data and the
science of extracting useful information from large
data sets or databases.” In general, data mining is
the process of analyzing data from different
perspectives and summarizing it into useful
information. Technically, it is the process of
IQBAL, Orient. J. Comp. Sci. & Technol., Vol. 5(2), 277-281 (2012)
finding correlations or patterns among dozens of
fields in large relational databases.
Scopes And Relevance Of Study
Data mining refers to extracting or mining
knowledge from large amounts of data. The term is
actually a misnomer. The mining of
gold from rocks or sand is referred to as gold mining
rather than rock or sand mining. Thus, data mining
should more appropriately have been named
“knowledge mining from data,” which is sometimes
shortened to knowledge mining. Many other terms carry
a similar or slightly different meaning,
such as knowledge mining from data, knowledge
extraction, data/pattern analysis, data archaeology
and knowledge discovery from data (KDD).
Knowledge Discovery in Databases (KDD)
is defined as the nontrivial process of identifying
valid, novel, potentially useful, and ultimately
understandable patterns in data. Data mining is
the central part of the KDD process. Knowledge
discovery as a process consists of the following
sequence of steps, as shown in Fig. 1:
• Data integration (where multiple data sources may be combined)
• Data selection (where data relevant to the analysis task are retrieved from the database)
• Data preprocessing (to remove noise and inconsistent data)
• Data transformation (where data are transformed into forms appropriate for mining by performing summary or aggregation operations)
• Data mining (an essential process where intelligent methods are applied in order to extract data patterns)
• Pattern evaluation (to identify the truly interesting patterns representing knowledge, based on some measure)
• Knowledge presentation (where visualization and knowledge representation techniques are used to present the mined knowledge to the user)
Fig. 1: Knowledge Discovery in Database
(KDD) Process (Fayyad et al. 1996)
Data mining is the process of discovering
interesting knowledge from large amounts of data
stored in databases, data warehouses, or other
information repositories. Data mining is only one step
of the knowledge discovery process, involving the
application of discovery tools to find interesting
patterns in targeted data. The flow of the data mining
process is shown in Fig. 2.
A data mining session is usually an
interactive process of data mining query submission,
task analysis, data collection from the database,
interesting pattern search, and presentation of findings.
Fig. 2: Flow of the data mining process
Process For Mining The Data
An important concept is that building a
mining model is part of a larger process that includes
everything from defining the basic problem that the
model will solve, to deploying the model into a
working environment. This process can be defined
by using the following six basic steps.
• Defining the problem
• Preparing data
• Exploring data
• Building models
• Validating models
• Updating models
Objectives
Even though data mining has made
significant progress during the past decade, most
of the research has been devoted to developing effective
and efficient algorithms for extracting knowledge
from data. It is therefore difficult for
students, researchers and business users to get a
holistic view of this field; what they perceive is the
collection of algorithms and tools available. The
objective of this endeavor is to study the efficiency
and effectiveness of multiple-level algorithms,
which help in extracting specific information. In this
work a new method for mining multiple-level association
rules is introduced. The method uses:
1. Minimum support
2. Delta (factor for reducing support at lower levels)
3. Concept hierarchy
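To make the role of the first two parameters concrete, the following minimal sketch derives one minimum-support threshold per hierarchy level. The paper does not state the reduction formula, so the multiplicative rule used here (each lower level multiplies the previous threshold by delta), the function name `level_thresholds` and the sample values are assumptions made only for illustration.

```python
def level_thresholds(min_support, delta, levels):
    """Return one minimum-support threshold per hierarchy level.

    Assumption (not from the paper): support shrinks multiplicatively,
    so level 1 uses min_support and every lower level multiplies the
    previous threshold by delta, with 0 < delta <= 1.
    """
    if not 0 < delta <= 1:
        raise ValueError("delta must be in (0, 1]")
    if levels < 1:
        raise ValueError("levels must be at least 1")
    return [min_support * delta ** depth for depth in range(levels)]


# Hypothetical values: 40% support at the top level, halved at each level.
thresholds = level_thresholds(min_support=0.4, delta=0.5, levels=3)
print(thresholds)
```

With delta equal to 1 the same threshold applies at every level, which corresponds to the uniform-support case.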
Need Of Data Mining
The way in which companies interact with
their customers has changed dramatically over the
past few years. A customer’s continuing business is no
longer guaranteed. As a result, companies have found
that they need to understand their customers better and
to respond quickly to their wants and needs. In addition,
the time frame in which these responses need to be made
has been shrinking. It is no longer possible to wait until
the signs of customer dissatisfaction are obvious before
taking action. To succeed, companies must
be proactive and anticipate what a customer desires.
Data mining makes this possible by helping a company deliver:
• The right offer
• To the right person
• At the right time
• Through the right channel
The right offer means managing multiple
interactions with your customers, prioritizing what
the offer will be while making sure that irrelevant
offers are minimized. The right person means that
not all customers are cut from the same cloth. Your
interactions with them need to move towards highly
segmented marketing campaigns that target
individual wants and needs. The right time is a result
of the fact that interactions with customers now
happen on a continuous basis. This is significantly
different from the past, when quarterly mailings were
cutting-edge marketing. Finally, the right channel
means that you can interact with your customers in
a variety of ways (direct mail, email, telemarketing,
etc.). You need to make sure that you’re choosing
the most effective medium for a particular interaction.
Achieving this is hindered by problems with the
database repositories, notably that data volumes are too
large for classical analysis approaches:
Large number of records (10^8-10^12 bytes).
High-dimensional data (10^2-10^4 attributes).
How do you explore millions of records, tens or
hundreds of fields, and patterns?
Only a small portion (typically 5%-10%) of collected
data is ever analyzed.
Data that may never be analyzed continues
to be collected, at great expense, out of fear that
something which may prove important in the future
is missing.
The growth rate of data precludes the traditional “manually
intensive” approach.
Fig. 3: Example of concept Hierarchy
What can data mining do for us?
Identify our best prospects and then retain
them as customers: by concentrating marketing
efforts only on the best prospects we save time
and money, thus increasing the effectiveness of the
marketing operation.
Predict cross-sell opportunities and make
recommendations: whether we have a traditional or
web-based operation, we can help customers
quickly locate products of interest to them and
simultaneously increase the value of each
communication with the customer.
Learn parameters influencing trends in
sales and margins: one may think this can be done
with OLAP (Online Analytical Processing) tools. True,
OLAP can help prove a hypothesis, but only if we
know what questions to ask in the first place. In the
majority of cases we may have no clue as to what
combination of parameters influences our operation.
In these situations data mining is the only real option.
Segment markets and personalize
communications: there might be distinct groups of
customers, patients, or natural phenomena that
require different approaches in their handling. If we
have a broad customer range, we would need to
address teenagers in California and married
homeowners in Minnesota with different products
and messages in order to optimize a marketing
campaign.
Multiple-level association rules
Mining association rules at a single level, in
many cases, loses detailed information. Moreover, it
can reveal only general rules, without the ability to look
inside a rule. Data mining should therefore also support
mining association rules at multiple levels of
abstraction. For this purpose, every transaction
can be encoded based on dimensions and levels.
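One way to picture such an encoding is to replace every item in a transaction by its ancestor at a chosen level of the hierarchy. The sketch below is only an illustration: the two-level `HIERARCHY` table, the item names and the helper `encode_transaction` are hypothetical and do not come from the paper.

```python
# Hypothetical concept hierarchy: each item maps to its path from the
# general category (level 1) down to the item itself (level 2).
HIERARCHY = {
    "skim milk": ("milk", "skim milk"),
    "whole milk": ("milk", "whole milk"),
    "wheat bread": ("bread", "wheat bread"),
    "white bread": ("bread", "white bread"),
}


def encode_transaction(items, level):
    """Generalize every item in a transaction to the given level (1-based)."""
    return frozenset(HIERARCHY[item][level - 1] for item in items)


transaction = ["skim milk", "wheat bread", "white bread"]
for level in (1, 2):
    print(level, sorted(encode_transaction(transaction, level)))
```

At level 1 the two kinds of bread collapse into the single category "bread", which is why rules found at higher levels are more general.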
In multiple-level association rule mining,
the items in an itemset are characterized by using
a concept hierarchy, and mining occurs at multiple levels
of the hierarchy. At the lowest levels, it might be that no
rules match the constraints. At the highest levels,
rules can be extremely general. Generally, a top-down
approach is used, in which the support threshold
is either the same at every level or varies from level
to level (support is reduced going from higher to lower levels).
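The top-down idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the method evaluated in the paper: the hierarchy and transactions are invented, the threshold at each lower level is assumed to be the previous one multiplied by delta, itemsets are counted by brute force up to size two, and an item is examined at a lower level only if its parent was frequent one level up.

```python
from itertools import combinations

# Hypothetical two-level hierarchy: item -> (level-1 category, level-2 item).
HIERARCHY = {
    "skim milk": ("milk", "skim milk"),
    "whole milk": ("milk", "whole milk"),
    "wheat bread": ("bread", "wheat bread"),
    "white bread": ("bread", "white bread"),
}

TRANSACTIONS = [
    ["skim milk", "wheat bread"],
    ["skim milk", "white bread"],
    ["whole milk", "wheat bread"],
    ["skim milk", "wheat bread"],
    ["white bread"],
]


def frequent_itemsets(transactions, min_support, max_size=2):
    """Brute-force count of itemsets up to max_size; returns {itemset: support}."""
    n = len(transactions)
    counts = {}
    for t in transactions:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(t), size):
                key = frozenset(combo)
                counts[key] = counts.get(key, 0) + 1
    return {s: c / n for s, c in counts.items() if c / n >= min_support}


def mine_multilevel(transactions, min_support, delta, levels=2):
    """Mine level 1 first, then lower levels with a reduced threshold."""
    results = {}
    threshold = min_support
    frequent_parents = None  # no filtering at the top level
    for level in range(1, levels + 1):
        encoded = []
        for t in transactions:
            # Generalize each item to this level; below the top level,
            # skip items whose parent was not frequent one level up.
            kept = {
                HIERARCHY[i][level - 1]
                for i in t
                if frequent_parents is None
                or HIERARCHY[i][level - 2] in frequent_parents
            }
            encoded.append(kept)
        found = frequent_itemsets(encoded, threshold)
        results[level] = found
        frequent_parents = {i for s in found if len(s) == 1 for i in s}
        threshold *= delta  # assumed multiplicative reduction
    return results


for level, found in mine_multilevel(TRANSACTIONS, 0.6, 0.5).items():
    for itemset, support in sorted(found.items(), key=lambda kv: sorted(kv[0])):
        print(level, sorted(itemset), round(support, 2))
```

In this toy data the pair of skim milk and wheat bread appears in two of five transactions, so it is reported at level 2 only because the threshold there has dropped from 0.6 to 0.3.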
Conclusion & future work
Summarization is the abstraction or
generalization of data. This results in a smaller set,
which gives a general overview of the data, usually with
aggregated information. The summarization can go
to different abstraction levels and can be viewed from
different angles. Classification derives a function or
model that determines the class of an object based on its
attributes. A classification function or model is
constructed by analyzing the relationship between
the attributes and the classes of the objects in the training
set. Clustering identifies the classes, also called clusters or groups,
for a set of objects whose classes are unknown. The
objects are clustered so that intraclass similarities
are maximized and interclass similarities are
minimized. This is done based on criteria defined on
the attributes of the objects. Association is the degree of
relationship, involvement or connection among objects.
Such a connection is termed an association rule. An
association rule reveals the associative relationship
among objects at multiple levels. In this dissertation
iterative or noniterative database scanning is used for
finding frequent itemsets. The association rules are
derived from these frequent itemsets. There are
different algorithms for finding the
frequent itemsets. In this dissertation the emphasis
is on the generation of multiple-level association
rules.
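The two stages mentioned above, finding frequent itemsets and then deriving rules from them, can be illustrated with a small sketch. It uses an Apriori-style iterative scan (one pass over the data per itemset size), chosen here as a familiar example rather than as the algorithm evaluated in this paper. The confidence measure, the function names and the sample baskets are likewise assumptions introduced only for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Level-wise search: one pass over the data per itemset size."""
    n = len(transactions)
    transactions = [frozenset(t) for t in transactions]
    candidates = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    size = 1
    while candidates:
        # One database scan: count every candidate of the current size.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        level = {c: k / n for c, k in counts.items() if k / n >= min_support}
        frequent.update(level)
        # Join frequent itemsets into candidates one item larger, keeping
        # only those whose every smaller subset is itself frequent.
        size += 1
        joined = {a | b for a in level for b in level if len(a | b) == size}
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, size - 1))
        }
    return frequent


def derive_rules(frequent, min_confidence):
    """Split each frequent itemset into antecedent -> consequent rules."""
    rules = []
    for itemset, support in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                # confidence = support(itemset) / support(antecedent)
                confidence = support / frequent[lhs]
                if confidence >= min_confidence:
                    rules.append((lhs, itemset - lhs, support, confidence))
    return rules


# Hypothetical market baskets.
baskets = [
    ["milk", "bread"],
    ["milk", "bread", "butter"],
    ["bread", "butter"],
    ["milk", "butter"],
    ["milk", "bread", "butter"],
]
frequent = apriori(baskets, min_support=0.5)
for lhs, rhs, support, confidence in derive_rules(frequent, min_confidence=0.7):
    print(sorted(lhs), "->", sorted(rhs), round(support, 2), round(confidence, 2))
```

Each rule is reported with the support of the full itemset and the confidence of the rule, so rules can be filtered on either measure.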