CHAPTER 1
INTRODUCTION
1.1 DATA MINING
Data mining refers loosely to finding relevant information, or discovering
knowledge, from a large volume of data. Data mining attempts to discover
statistical rules and patterns automatically from stored data. However, it
differs from machine learning in that it deals with very large volumes of data
stored on disk [1]. Discovering knowledge from a large volume of data is not a
simple process; it is an iterative and interactive one.
Data mining should be “the nontrivial process of identifying valid,
novel, potentially useful, and ultimately comprehensible knowledge” from
databases; such knowledge can be useful in making crucial decisions [41].
Nontrivial - Means that rather than simple computations, complex processing
is required to uncover the patterns that are buried in the data.
Valid - The discovered patterns should hold for all data, including new data.
Novel - The discovered patterns should be innovative.
Useful - The organization should be able to act upon these patterns to become
more profitable / efficient.
Comprehensible - The new patterns should be understandable to the users and
add to their knowledge.
Steps in the knowledge discovery in databases (KDD) process
Data Cleaning: It is the process of removing noise and inconsistent data.
Data Integration: It is the process of combining data from multiple sources.
Data Selection: It is the process of retrieving relevant data from the database.
Data Transformation: In this process, data are transformed into forms
appropriate for mining by performing summary or aggregation operations.
Data Mining: It is an essential process in which intelligent methods are applied
to extract data patterns.
Pattern Evaluation: The patterns obtained in the data mining stage are
evaluated and converted into knowledge based on interestingness measures.
Knowledge Presentation: Visualization and knowledge representation
methods are used to present the mined knowledge to the user.
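The preprocessing steps above can be sketched on toy records; the data and field names below are hypothetical illustrations, not taken from this thesis:

```python
# A minimal sketch of KDD preprocessing: integration, cleaning,
# selection and transformation on hypothetical toy records.

raw_source_a = [{"id": 1, "age": 34, "spend": 120.0},
                {"id": 2, "age": None, "spend": 80.0}]   # missing value (noise)
raw_source_b = [{"id": 3, "age": 51, "spend": 310.0}]

# Data integration: combine data from multiple sources.
integrated = raw_source_a + raw_source_b

# Data cleaning: remove records with missing or inconsistent values.
cleaned = [r for r in integrated if r["age"] is not None]

# Data selection: retrieve only the fields relevant to the mining task.
selected = [(r["age"], r["spend"]) for r in cleaned]

# Data transformation: summarize/aggregate into a form suitable for mining.
total_spend = sum(s for _, s in selected)
print(total_spend)  # 430.0
```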
1.2 ARCHITECTURE OF DATA MINING
Data mining is the process of discovering knowledge from large amounts
of data stored either in very large databases or in other information repositories.
The major components of a data mining system are:
Database, Data Warehouse or Other Information Repository: This is a
single database or a collection of databases, data warehouses, flat files,
spreadsheets or other kinds of information repositories. Data cleaning and
integration methods may be performed on the data.
Database or Data Warehouse Server: This server fetches the relevant data
based on the data mining request.
Knowledge Base: This is the domain knowledge that guides the search or
assesses the interestingness of resulting patterns.
Data Mining Engine: This is essential to the data mining system and
ideally consists of a set of functional modules for tasks such as
characterization, association, classification, cluster analysis, evolution
analysis and outlier analysis.
[Figure: a layered architecture — the Graphical User Interface at the top,
then Pattern Evaluation, the Data Mining Engine (guided by the Knowledge
Base), and the Database or Data Warehouse Server, which draws on the World
Wide Web, databases, data warehouses and other repositories.]
Figure 1.1 Architecture of a typical data mining system
Pattern Evaluation Module: It interacts with the data mining modules to
focus the search towards interesting patterns. It may use interestingness
thresholds to filter out discovered patterns. Alternatively, the pattern
evaluation module may be integrated with the mining module.
Graphical User Interface: This module communicates between users and the
data mining system, allowing the user to interact with the system by
specifying a data mining task or query and to perform exploratory data mining
based on intermediate mining results. It also allows the user to browse
database and data warehouse schemas or data structures, evaluate mined
patterns and visualize the patterns in different forms such as maps and charts.
1.3 DATA MINING TECHNIQUES
Several techniques are used in data mining, depending on the type of
mining and data recovery operation required. These techniques analyze the data
in different ways. The most commonly used techniques are:
1. Neural Networks
Neural networks have the ability to derive meaning from complicated data
and can be used to extract patterns and discover relationships that are very
complex. A well-trained neural network can be considered an “expert” in the
category of information it has been given to examine; this expert can then be
used to supply projections for new situations of interest and answer “what if”
questions. Neural networks use a set of processing elements analogous to
neurons in the brain. These processing elements are interconnected in a
network that can identify patterns in data once it is exposed to them, i.e.,
the network learns from experience just as people do.
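As an illustration of a single processing element, the sketch below trains one perceptron-style neuron on the AND function; the learning rate, epoch count and training data are arbitrary choices for the example, not taken from this thesis:

```python
# One "processing element": weighted inputs, a threshold, and weight
# updates driven by the error signal, i.e. learning from experience.

def train_perceptron(samples, epochs=10, lr=1.0):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            out = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            err = target - out            # error drives the weight update
            w[0] += lr * err * x1
            w[1] += lr * err * x2
            b += lr * err
    return w, b

# Train on the AND function: output 1 only when both inputs are 1.
samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(samples)
outputs = [1 if w[0] * x1 + w[1] * x2 + b > 0 else 0 for (x1, x2), _ in samples]
print(outputs)  # [0, 0, 0, 1]
```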
2. Decision Tree
Decision trees are a simple knowledge representation; they classify
examples into a finite number of classes. The nodes are labeled with attribute
names, the edges with possible values of those attributes, and the leaves with
the different classes. These tree-shaped structures represent sets of decisions
which generate rules for the categorization of a dataset. Decision trees
produce rules that are mutually exclusive and collectively exhaustive with
respect to the training database. Particular decision tree methods include
Classification and Regression Trees (CART) and Chi-Square Automatic
Interaction Detection (CHAID).
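A minimal sketch of such a tree as nested dictionaries, with hypothetical weather attributes (not from this thesis): internal nodes carry attribute names, edges carry attribute values, and leaves carry classes.

```python
# A decision tree as nested dicts: internal nodes are labeled with
# attribute names, edges with attribute values, leaves with classes.

tree = {
    "outlook": {                      # node labeled with an attribute
        "sunny": {"humidity": {"high": "no", "normal": "yes"}},
        "overcast": "yes",            # leaf labeled with a class
        "rain": {"wind": {"strong": "no", "weak": "yes"}},
    }
}

def classify(tree, record):
    """Follow the edges matching the record's attribute values to a leaf."""
    if not isinstance(tree, dict):
        return tree                   # reached a class leaf
    attribute = next(iter(tree))
    value = record[attribute]
    return classify(tree[attribute][value], record)

print(classify(tree, {"outlook": "sunny", "humidity": "normal", "wind": "weak"}))
# yes
```

Note that each record reaches exactly one leaf, which is why the rules read off a decision tree are mutually exclusive and collectively exhaustive.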
3. Nearest Neighbor Method
A method that classifies each record in a dataset based on a combination of the
classes of the k most similar record(s) is called the k-nearest neighbor
technique. Nearest neighbor is a prediction technique quite similar to
clustering: its essence is that, in order to predict the value for one record,
one looks for records with similar predictor values in the historical database
and uses the prediction value from the record nearest to the unclassified record.
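A minimal sketch of k-nearest neighbor prediction on a toy historical database; the records, fields and distance measure are hypothetical choices for the example:

```python
# k-nearest neighbor: rank historical records by distance to the
# unclassified record and take the majority class of the k nearest.

def knn_predict(history, query, k=3):
    """Predict the majority class among the k records nearest to `query`."""
    ranked = sorted(history,
                    key=lambda rec: sum((a - b) ** 2 for a, b in zip(rec[0], query)))
    votes = [label for _, label in ranked[:k]]
    return max(set(votes), key=votes.count)

# Historical database: (predictor values, class) pairs.
history = [((1.0, 1.0), "low"), ((1.2, 0.8), "low"),
           ((5.0, 5.0), "high"), ((5.5, 4.5), "high"), ((4.8, 5.2), "high")]
print(knn_predict(history, (5.1, 5.0)))  # high
```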
4. Cluster Analysis
Cluster analysis is a mathematical tool for exploring the underlying
structure of data. Clustering is a method of grouping objects into
clusters such that objects in the same cluster are similar and objects
in different clusters are dissimilar. Objects can be described in terms of
dimensions or by their relationships with other objects. Clustering is sometimes
used to mean segmentation. Clustering and segmentation basically divide the
database so that each group is similar according to some rule or metric. Many
data mining applications make use of clustering by similarity, for
example to segment a client/customer base.
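Clustering by similarity can be sketched with a tiny one-dimensional k-means; the customer-spend values and starting centers below are hypothetical:

```python
# A two-cluster 1-D k-means: repeatedly assign each object to its
# nearest cluster center, then move each center to its cluster's mean.

def kmeans_1d(values, centers, rounds=10):
    clusters = [[] for _ in centers]
    for _ in range(rounds):
        # Assignment: each object joins the cluster with the nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update: move each center to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return clusters

spend = [10, 12, 11, 95, 100, 98]
print(kmeans_1d(spend, centers=[0.0, 50.0]))
# [[10, 12, 11], [95, 100, 98]]
```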
5. Association
Association is one of the most popular data mining techniques. It finds simple
correlations between two or more items, often of the same type, to identify
patterns. For example, when tracking people’s buying habits, it might identify
that a customer almost always buys butter when they buy bread.
6. Rule Induction
Rule induction is the process of extracting useful if-then rules
from data based on statistical and mathematical significance. Rule induction
on a database can be a massive undertaking in which all possible patterns are
systematically pulled out of the data and their accuracy and significance
calculated, telling users how strong a pattern is and how likely it is to occur
again.
7. Genetic Algorithms
Genetic algorithms are algorithms that dictate how populations of
organisms should be formed, evaluated and modified. A genetic algorithm is an
optimization technique that uses methods such as genetic combination,
mutation and natural selection in a design based on the concepts of evolution.
Genetic algorithms are often used on top of an existing data mining
technique such as neural networks or decision trees.
8. Data Visualization
Data visualization makes it possible for the analyst to gain a deeper, more
intuitive understanding of the data and as such works well alongside data
mining. Data mining allows the analyst to focus on certain patterns and trends
and explore them in depth using visualization. On its own, data visualization
can be overwhelmed by the volume of data in a database, but used in conjunction
with data mining it can help with exploration.
1.4 APPLICATIONS OF DATA MINING
Nowadays, many industries use electronic data repositories for
storing huge volumes of data. Extracting knowledge from data sources of this
size by hand is not viable for analysts who need it for better decision making.
Traditional techniques are insufficient to analyze these kinds of data. In
today’s world, data are collected and stored at enormous speeds, so it is
essential for industries to find special tools for storing and accessing these
databases. Data mining tools are such tools. They are applied to both
commercial and scientific data. Commercial data are mined to provide better
service to customers and to customize and pro-actively deliver services. The
tools help analysts extract and understand complex relationships, predict
future status and study interesting similarities in data.
Data mining is implemented in areas such as the retail market,
banking [127], fraud detection [169], insurance, transport
engineering [154], telecommunications, the stock market [133] [161], crime and
terrorism [37] [68], aircraft maintenance, geographical and spatial data,
software engineering [87], the healthcare industry [89], education [132], social
network analysis [66] and sports databases. Notably, data mining
techniques have been extended into the medical industry [89]. Medical databases
store large amounts of information about patients, the results of laboratory
tests, diagnosis status and treatments. Data mining techniques applied to these
databases discover relationships and patterns which are helpful in studying the
progression and management of diseases.
1.5 ASSOCIATION RULES
Association analysis is used to extract interesting correlations, frequent
patterns, associations or causal structures among sets of items or objects in
transaction databases, relational databases or other data repositories. In 1993,
Agrawal, Imielinski and Swami (AIS) developed an algorithm to discover
relationships or correlations between items based on interestingness
measures [1]. Since then, association analysis has become one of the most
widely used and studied techniques in data mining. The main aim of this
technique is the discovery of relationships and co-occurrences between items in
a dataset. The results of association analysis are represented in the form of
rules. Association rules are IF/THEN statements that are used to find
relationships between seemingly unrelated data in voluminous databases. An
association rule has two parts, an antecedent (if) and a consequent (then). An
antecedent is an item found in the data. A consequent is an item that is found
in combination with the antecedent [2].
A common example of this method is market basket analysis [1].
For instance: “customers who buy product A often also buy product B”. A
decision maker such as a shopper or a marketer can access a large volume of
historical data from which such rules have been extracted, and so more
confidently draw conclusions and make decisions that are well supported by the
data.
In general, association rule mining can be viewed as a two-step process:
1. Find all frequent itemsets: By definition, each of these itemsets
occurs at least as frequently as a predetermined minimum support
count, min_sup. Support: the percentage of transactions in D that contain
A∪B.
2. Generate strong association rules from the frequent itemsets: By
definition, these rules must satisfy minimum support and minimum
confidence. Confidence: the percentage of transactions in D containing A
that also contain B.
Support (A=>B) = P(A∪B)
Confidence (A=>B) = P(B|A)
Rules that satisfy both a minimum support threshold (min_sup)
and a minimum confidence threshold (min_conf) are called strong.
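These two measures can be computed directly over a toy transaction database; the items and transactions below are hypothetical, and P(A∪B) here denotes, as in the definitions above, the fraction of transactions that contain both A and B:

```python
# Support and confidence for the rule bread => butter over a toy
# transaction database D.

D = [{"bread", "butter", "milk"},
     {"bread", "butter"},
     {"bread", "jam"},
     {"milk", "eggs"}]

A, B = {"bread"}, {"butter"}

# Support(A => B) = P(A ∪ B): fraction of transactions containing A and B.
support = sum(1 for t in D if A | B <= t) / len(D)

# Confidence(A => B) = P(B | A): fraction of transactions with A that also have B.
contains_a = [t for t in D if A <= t]
confidence = sum(1 for t in contains_a if B <= t) / len(contains_a)

print(support, confidence)  # 0.5 0.6666666666666666
```

The rule bread => butter is "strong" whenever these values meet the chosen min_sup and min_conf thresholds.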
1.6 OBJECTIVE OF THE RESEARCH
Searching is the process of finding a particular item in a collection of
items, or in a very large database. Searches use indexes to reduce the
time taken to scan the entire data and use keys for storing values.
At retrieval time the key is referenced and decoded for rapid scanning. For
various reasons such algorithms are seldom applied to relational
databases, which store and manage very large volumes of data, even though
relational databases have the power and speed to traverse large volumes of
data within a short span of time. A divide and conquer algorithm can also be
used, recursively breaking a problem down into two or more sub-problems
until these become simple enough to be searched or solved directly.
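The divide and conquer idea behind indexed searching can be sketched with a binary search over a sorted list of keys, recursively halving the search space; the key values are hypothetical:

```python
# Divide and conquer search: binary search over a sorted index of keys,
# recursively halving the search space until the sub-problem is trivial.

def binary_search(keys, target, lo=0, hi=None):
    """Return the position of `target` in sorted `keys`, or -1."""
    if hi is None:
        hi = len(keys) - 1
    if lo > hi:                      # sub-problem is empty: not found
        return -1
    mid = (lo + hi) // 2             # divide the problem in two
    if keys[mid] == target:
        return mid                   # simple enough to solve directly
    if keys[mid] < target:
        return binary_search(keys, target, mid + 1, hi)   # right half
    return binary_search(keys, target, lo, mid - 1)       # left half

index = [3, 8, 15, 21, 42, 57, 91]
print(binary_search(index, 42))  # 4
print(binary_search(index, 10))  # -1
```

Each recursive call discards half of the remaining keys, which is what lets an indexed search avoid scanning the entire data.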
The aim of the Refined Search Divide and Conquer (Hybrid)
Algorithm (RSDCA) is to apply frequent itemset mining to Very Large
Databases. The thesis proposes that RSDCA achieves a reduction in the time
factor by using the indexes of the database with the help of divide and conquer
methods. RSDCA also seeks to reduce, in a generic way, the memory space used
while searching for frequent itemsets in databases.
1.7 SCOPE OF THE RESEARCH
Though earlier researchers have developed many algorithms and
methodologies, it is imperative to find new algorithms. It is often
impractical to mine all frequent itemsets on very
large datasets. Such algorithms use their own in-memory data structures for
retrieving datasets and computing the result, and these structures impose a
limitation on the size of data that can be processed. RDBMSs, however, provide
the benefit of their buffer management systems, so that users and applications
can be free from size considerations of the data. RDBMSs also give the
advantage of mining over very large datasets.
The proposed algorithm can be applied to any database; it further
explores new possibilities for generic searches executed on Very Large
Data Bases (VLDB). The algorithm shows the need for proper indexing, and
more particularly generic indexing, to improve the speed of queries. The
approach of this thesis enables the construction of a new refined divide and
conquer algorithm with a completely different perspective. In this approach the
divide and conquer method is applied to extract the desired frequent dataset
from the very large database to improve the speed of the queries. Also, the
user’s or application’s own in-memory data structure is enough for retrieving
the dataset and executing the queries.
1.8 CONTRIBUTION OF THE RESEARCH
The proposed Refined Search Divide and Conquer (Hybrid) Algorithm
(RSDCA) is used to access itemsets very quickly in large databases.
The search algorithm takes a problem as input, processes it within
its search space and returns a solution. The proposed RSDCA
uses a pattern growth method for mining frequent patterns. The
RSDCA algorithm rapidly reads transactions and updates support counts at the
same time. Based on closed frequent itemset mining, RSDCA specifies the order
of set enumeration by using constraints. The algorithm generates bases from the
frequent closed itemsets, discovers frequent itemsets and defines the exact
association rules with maximum confidence. RSDCA stresses the importance of
the data dictionary in relational databases and exhibits the power of the
RDBMS in identifying frequent itemsets. RSDCA minimizes search time since it
refers to and compiles the items or their occurrences at a top level.
1.9 ORGANIZATION OF THE THESIS
The first chapter overviews data mining techniques and discusses the
role of association rules.
The second chapter presents a detailed literature survey of existing
association rule methods.
The third chapter presents a closed frequent itemset mining algorithm,
which uses a rapid search technique to mine closed frequent itemsets.
The fourth chapter deals with the rapid search itemset mining
algorithm, which is based on the pattern growth method for frequent
pattern mining.
The proposed Refined Search Divide and Conquer (Hybrid) Algorithm
(RSDCA) for mining association rules is described in chapter five. An
application of the rapid search technique is also discussed in detail.
Chapter six summarizes the results of the proposed algorithms on
synthetic datasets.
Chapter seven concludes with the findings and discusses future
enhancements.