INTERNATIONAL PAPER CONFERENCE
DATA MINING CONCEPTS AND METHODS
IMPLEMENTED FOR KNOWLEDGE DISCOVERY IN
DATABASES
BY
NAGARATNA P. HEGDE, PROFESSOR
VASAVI ENGINEERING COLLEGE, HYDERABAD
&
B. VARIJA, ASSOCIATE PROFESSOR, RESEARCH SCHOLAR
NISHITHA DEGREE AND P.G. COLLEGE, NIZAMABAD
ABSTRACT
Data mining turns a large collection of data into knowledge: the automatic discovery of useful information from large data repositories. Mining can be applied to any kind of data as long as the data are meaningful for a target application. Common methods include mining frequent patterns and associations, classification, cluster analysis, and outlier detection. In particular, data mining draws upon ideas such as sampling, estimation, and hypothesis testing. In this paper, the basic concepts and methods used for mining data in a systematic manner are discussed.
I. INTRODUCTION
Data mining has made significant progress and covered a broad spectrum of applications since the 1980s. Data mining tasks can be categorized as predictive tasks and descriptive tasks. In data mining, association analysis is used to discover patterns that describe strongly associated features in the data, whereas cluster analysis finds groups of closely related observations. Outlier analysis identifies data objects that deviate significantly from the rest of the objects. Data mining is an essential process in which intelligent methods are applied to extract data patterns. Many people treat data mining as a synonym for another popularly used term, knowledge discovery from data, or KDD.
II. KNOWLEDGE DISCOVERY FROM DATA
Data mining is merely an essential step in the process of knowledge discovery. The knowledge discovery process is an iterative sequence of the following steps:
1. Data Cleaning: remove noise and inconsistent data.
2. Data Integration: combine multiple data sources.
3. Data Selection: retrieve the data relevant to the analysis task.
4. Data Transformation: transform the data into forms appropriate for mining, for example by performing summary or aggregation operations.
5. Data Mining: the essential process in which intelligent methods are applied to extract data patterns.
6. Pattern Evaluation: identify the truly interesting patterns representing knowledge.
7. Knowledge Presentation: use visualization and knowledge representation techniques to present the mined knowledge to users.
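Steps 1-4 of this sequence can be illustrated on a toy record set. The field names, records, and the summary rule below are invented for illustration and are not part of any standard:

```python
# A minimal sketch of KDD steps 1-4 on a toy record set.
# All field names and thresholds here are illustrative assumptions.

raw_records = [
    {"id": 1, "age": 25, "purchase": 120.0},
    {"id": 2, "age": None, "purchase": 80.0},   # missing value (noise)
    {"id": 3, "age": 41, "purchase": 200.0},
    {"id": 3, "age": 41, "purchase": 200.0},    # duplicate from a second source
]

# 1. Data cleaning: drop records with missing values.
cleaned = [r for r in raw_records if all(v is not None for v in r.values())]

# 2. Data integration: merge duplicate records coming from multiple sources.
integrated = list({r["id"]: r for r in cleaned}.values())

# 3. Data selection: keep only the attributes relevant to the task.
selected = [{"age": r["age"], "purchase": r["purchase"]} for r in integrated]

# 4. Data transformation: summarize purchases into coarse categories.
transformed = [
    {"age": r["age"], "spend": "high" if r["purchase"] >= 150 else "low"}
    for r in selected
]

print(transformed)
```

Steps 5-7 (mining, evaluation, and presentation) would then operate on `transformed`.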
III. CONCEPTS OF DATA MINING
Frequent Pattern Mining:
Frequent pattern mining is used for the discovery of interesting associations and correlations between item sets in transactional and relational databases. A typical example of frequent item set mining is market basket analysis. The process analyzes customer buying habits by finding associations between the different items that customers place in their shopping baskets.
Association Rules:
Let I = {I1, I2, I3, ..., Im} be a set of items, and let T ⊆ I be a transaction, associated with an identifier called a TID. An association rule is an implication of the form A => B, where A ⊂ I, B ⊂ I, A ≠ ∅, B ≠ ∅, and A ∩ B = ∅.
Support(A => B) = P(A ∪ B)
Classification:
Data classification can be considered a two-step process: a learning step and a classification step. The learning step, or training phase, is where a classification algorithm builds the classifier by analyzing, or learning from, a training set made up of database tuples and their associated class labels. The training data are analyzed by a classification algorithm, and test data are then used to estimate the accuracy of the classification rules. If the accuracy is considered acceptable, the rules can be applied to the classification of new data tuples.
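The two-step process can be sketched with a toy classifier. The 1-nearest-neighbour rule, tuples, and labels below are chosen only for illustration:

```python
# A minimal sketch of the two-step classification process using a
# 1-nearest-neighbour rule; the data and labels are made-up examples.

def train(training_set):
    # Learning step: here "building the classifier" is simply storing
    # the labelled training tuples (lazy learning).
    return list(training_set)

def classify(model, x):
    # Classification step: label x with the class of its nearest tuple.
    nearest = min(model, key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x)))
    return nearest[1]

# Training tuples: (attribute vector, class label).
training = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((5.0, 5.0), "B")]
model = train(training)

# Estimate accuracy on held-out test tuples before accepting the classifier.
test_set = [((0.9, 1.1), "A"), ((4.8, 5.2), "B")]
accuracy = sum(classify(model, x) == y for x, y in test_set) / len(test_set)
print(accuracy)  # → 1.0
```

Only if the estimated accuracy is acceptable would the classifier be applied to new, unlabelled tuples.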
Cluster Analysis:
Unlike classification, which uses class-labeled data sets, cluster analysis groups data objects without consulting class labels. The objects are clustered or grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity.
IV. DATA MINING APPLICATIONS
The various applications of data mining are given below:
Data Mining for Financial Data Analysis: Financial data collected in the banking and financial industry are often relatively complete, reliable, and of high quality.
Data Mining for Retail and Telecommunication Industries: Retail data mining can help identify customer buying behavior and discover customer shopping patterns and trends to improve the quality of customer service.
Data Mining in Science and Engineering: Vast amounts of data are collected from scientific domains, for example using sophisticated telescopes.
Data Mining for Intrusion Detection and Prevention: The majority of intrusion detection and prevention systems use either signature-based detection or anomaly-based detection.
Data Mining and Recommender Systems: Recommender systems help consumers by making product recommendations that are likely to be of interest to them, such as books, CDs, movies, restaurants, online news articles, and other services.
Outlier Analysis: Outlier analysis is different from handling noisy data. In general, outliers can be classified into three categories, namely global outliers, contextual outliers, and collective outliers.
V. DATA MINING METHODS
Association Rules
An association rule is of the form X => Y: if someone buys X, he also buys Y. Every association rule has a support and a confidence. Rules don't explain anything; they just point out hard facts in data volumes.
Support
"The support is the percentage of transactions that demonstrate the rule." An item set is called frequent if its support is equal to or greater than an agreed-upon minimal value, the support threshold.
Example: a database with transactions (customer_#: item_a1, item_a2, ...):
1: 1, 3, 5.
2: 1, 8, 14, 17, 12.
3: 4, 6, 8, 12, 9, 104.
4: 2, 1, 8.
Support {8, 12} = 2 (or 50%, 2 of 4 customers)
Support {1, 5} = 1 (or 25%, 1 of 4 customers)
Support {1} = 3 (or 75%, 3 of 4 customers)
If the threshold is 50%, then the item sets {8, 12} and {1} are called frequent.
Confidence
The confidence is the conditional probability that, given X present in a transaction, Y will also be present. By definition:
Confidence(X => Y) = support(X, Y) / support(X)
We should only consider rules derived from item sets with high support that also have high confidence: "A rule with low confidence is not meaningful."
Example: a database with transactions (customer_#: item_a1, item_a2, ...):
1: 3, 5, 8.
2: 2, 6, 8.
3: 1, 4, 7, 10.
4: 3, 8, 10.
5: 2, 5, 8.
6: 1, 5, 6.
7: 4, 5, 6, 8.
8: 2, 3, 4.
9: 1, 5, 7, 8.
10: 3, 8, 9, 10.
Conf({5} => {8})?
supp({5}) = 5, supp({8}) = 7, supp({5, 8}) = 4,
then conf({5} => {8}) = 4/5 = 0.8, or 80%.
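The worked example above can be reproduced directly in a few lines of Python:

```python
# Computing support and confidence for the 10-transaction example above.

transactions = [
    {3, 5, 8}, {2, 6, 8}, {1, 4, 7, 10}, {3, 8, 10}, {2, 5, 8},
    {1, 5, 6}, {4, 5, 6, 8}, {2, 3, 4}, {1, 5, 7, 8}, {3, 8, 9, 10},
]

def support(itemset):
    # Number of transactions that contain every item in the set.
    return sum(itemset <= t for t in transactions)

def confidence(x, y):
    # conf(X => Y) = supp(X ∪ Y) / supp(X)
    return support(x | y) / support(x)

print(support({5}), support({8}), support({5, 8}))  # → 5 7 4
print(confidence({5}, {8}))                          # → 0.8
```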
Here are the typical requirements of clustering in data mining:
Scalability - We need highly scalable algorithms to deal with large databases.
Ability to deal with different kinds of attributes - Algorithms should be capable of being applied to any kind of data, such as interval-based (numerical), categorical, and binary data.
Discovery of clusters with arbitrary shape - The clustering algorithm should be capable of detecting clusters of arbitrary shape and should not be bounded to only distance measures that tend to find spherical clusters of small size.
High dimensionality - The clustering algorithm should be able to handle not only low-dimensional data but also high-dimensional space.
Ability to deal with noisy data - Databases contain noisy, missing, or erroneous data. Some algorithms are sensitive to such data and may lead to poor-quality clusters.
Interpretability - The clustering results should be interpretable, comprehensible, and usable.
Partitioning Method
Suppose we are given a database of n objects; the partitioning method constructs k partitions of the data, where each partition represents a cluster and k ≤ n. It classifies the data into k groups, which satisfy the following requirements:
Each group contains at least one object.
Each object must belong to exactly one group.
For a given number of partitions (say k), the partitioning method creates an initial partitioning. Then it uses an iterative relocation technique to improve the partitioning by moving objects from one group to another.
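The iterative relocation of the partitioning method can be sketched in the style of k-means on made-up 1-D data; the points, seed centers, and choice of k = 2 are illustrative assumptions:

```python
# A minimal sketch of iterative relocation on 1-D data with k = 2
# (a k-means-style partitioning); the points and seeds are made up.

points = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]
centers = [1.0, 9.0]  # initial partitioning via two seed centers

for _ in range(10):  # relocate objects between groups until stable
    groups = [[], []]
    for p in points:
        # assign each object to the nearest center (exactly one group)
        nearest = min(range(2), key=lambda i: abs(p - centers[i]))
        groups[nearest].append(p)
    # recompute each center as the mean of its group
    centers = [sum(g) / len(g) for g in groups]

print(groups)   # two clusters
print(centers)  # relocated centers
```

Each pass reassigns objects to their nearest center and relocates the centers, satisfying the two partitioning requirements above (no empty group, each object in exactly one group).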
Clustering Methods
The clustering methods can be classified into the following categories:
Partitioning Method
Hierarchical Method
Density-based Method
Grid-based Method
Model-based Method
Constraint-based Method
Hierarchical Methods
This method creates a hierarchical decomposition of the given set of data objects. We can classify hierarchical methods on the basis of how the hierarchical decomposition is formed:
Agglomerative Approach
Divisive Approach
Agglomerative Approach
This approach is also known as the bottom-up approach. We start with each object forming a separate group and keep merging the objects or groups that are close to one another, until all of the groups are merged into one or until the termination condition holds.
Divisive Approach
This approach is also known as the top-down approach. We start with all of the objects in the same cluster. In each successive iteration, a cluster is split into smaller clusters. This continues until each object is in its own cluster or the termination condition holds.
Disadvantage
This method is rigid, i.e., once a merge or split is done, it can never be undone.
Here are two approaches that are used to improve the quality of hierarchical clustering:
Perform careful analysis of object linkages at each hierarchical partitioning.
Integrate hierarchical agglomeration by first using a hierarchical agglomerative algorithm to group objects into micro-clusters, and then performing macro-clustering on the micro-clusters.
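The bottom-up merging of the agglomerative approach can be sketched on made-up 1-D points with a single-linkage group distance; the data and the termination value k = 2 are illustrative assumptions:

```python
# A minimal sketch of the bottom-up (agglomerative) approach on 1-D
# points: repeatedly merge the two closest groups until k groups remain.

points = [1.0, 1.2, 5.0, 5.1, 9.0]
clusters = [[p] for p in points]  # start: each object is its own group

def single_link(a, b):
    # distance between groups = distance of their closest members
    return min(abs(x - y) for x in a for y in b)

k = 2  # termination condition: stop at k clusters
while len(clusters) > k:
    # find and merge the closest pair of groups
    i, j = min(
        ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
        key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
    )
    clusters[i] += clusters.pop(j)

print(clusters)
```

Note the rigidity mentioned above: once two groups are merged here, no later step can undo the merge.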
Density-based Method
This method is based on the notion of density. The basic idea is to continue growing a given cluster as long as the density in the neighborhood exceeds some threshold; i.e., for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points.
Grid-based Method
In this method the object space is quantized into a finite number of cells that form a grid structure. The major advantage of this method is fast processing time, which is dependent only on the number of cells in each dimension of the quantized space.
Model-based Method
In this method a model is hypothesized for each cluster, and the best fit of the data to the given model is found. This method locates clusters by clustering the density function, which reflects the spatial distribution of the data points. It also serves as a way of automatically determining the number of clusters based on standard statistics, taking outliers or noise into account, and therefore yields robust clustering methods.
Constraint-based Method
In this method the clustering is performed by the incorporation of user- or application-oriented constraints. A constraint refers to the user expectation or the properties of the desired clustering results. Constraints provide an interactive way of communicating with the clustering process. The constraints can be specified by the user or by the application requirement.
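The density-based idea of growing a cluster while its neighborhood stays dense can be sketched on made-up 1-D points, in the spirit of DBSCAN; the eps and min_pts values are illustrative choices:

```python
# A minimal sketch of density-based clustering in 1-D: grow a cluster
# from a point while its radius-eps neighbourhood holds at least
# min_pts points. Data, eps, and min_pts are illustrative assumptions.

points = [1.0, 1.1, 1.2, 1.3, 5.0, 9.0, 9.1, 9.2, 9.3]
eps, min_pts = 0.5, 3

def neighbours(p):
    return [q for q in points if abs(q - p) <= eps]

clusters, seen = [], set()
for p in points:
    if p in seen or len(neighbours(p)) < min_pts:
        continue  # not a dense (core) point, or already clustered
    # grow the cluster as long as the neighbourhood density exceeds the threshold
    cluster, frontier = set(), [p]
    while frontier:
        q = frontier.pop()
        if q in cluster:
            continue
        cluster.add(q)
        seen.add(q)
        if len(neighbours(q)) >= min_pts:
            frontier.extend(neighbours(q))
    clusters.append(sorted(cluster))

print(clusters)  # 5.0 is left out as a noise point / outlier
```

The isolated point 5.0 never reaches the density threshold, illustrating how density-based methods naturally separate noise from clusters of arbitrary shape.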
The Apriori Algorithm
Together with the introduction of the frequent set mining problem, the first algorithm to solve it, later denoted AIS, was also proposed. Shortly afterwards the algorithm was improved by R. Agrawal and R. Srikant and called Apriori. It is a seminal algorithm that uses an iterative approach known as a level-wise search, where k-item sets are used to explore (k+1)-item sets.
The Apriori algorithm:
Input:
§ D, a database of transactions;
§ min_sup, the minimum support count threshold.
Output: L, the frequent itemsets in D.
Procedure has_infrequent_subset(c: candidate k-itemset; Lk-1: frequent (k-1)-itemsets)
// use prior knowledge
(1) for each (k-1)-subset s of c
(2)   if s ∉ Lk-1 then
(3)     return TRUE;
(4) return FALSE;
Generating association rules from frequent itemsets.
Method:
(1) L1 = find_frequent_1-itemsets(D);
(2) for (k = 2; Lk-1 ≠ ∅; k++) {
(3)   Ck = apriori_gen(Lk-1);
(4)   for each transaction t ∈ D { // scan D for counts
(5)     Ct = subset(Ck, t); // get the subsets of t that are candidates
(6)     for each candidate c ∈ Ct
(7)       c.count++;
(8)   }
(9)   Lk = {c ∈ Ck | c.count ≥ min_sup}
(10) }
(11) return L = ∪k Lk;
Procedure apriori_gen(Lk-1: frequent (k-1)-itemsets)
(1) for each itemset l1 ∈ Lk-1
(2)   for each itemset l2 ∈ Lk-1
(3)     if (l1[1] = l2[1]) ∧ (l1[2] = l2[2]) ∧ ... ∧ (l1[k-2] = l2[k-2]) ∧ (l1[k-1] < l2[k-1]) then {
(4)       c = l1 ⋈ l2; // join step: generate candidates
(5)       if has_infrequent_subset(c, Lk-1) then
(6)         delete c; // prune step: remove unfruitful candidate
(7)       else add c to Ck;
(8) }
(9) return Ck;
Conclusion: In this paper we have discussed the data mining concepts and methods that are used to perform mining in organizations and industries, because nowadays mining data has become essential for most organizations.
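The level-wise Apriori search can also be sketched as runnable Python and applied to the 10-transaction example from the methods section. This is a simplified sketch: the candidate generation here unions pairs of frequent sets rather than performing the prefix-based join of apriori_gen, though the prune step is the same:

```python
# A runnable sketch of the level-wise Apriori search; the min_sup
# value of 4 (absolute count) is an illustrative choice.
from itertools import combinations

def apriori(transactions, min_sup):
    items = sorted({i for t in transactions for i in t})

    def support(c):
        return sum(set(c) <= t for t in transactions)

    # L1: find the frequent 1-itemsets
    L = [frozenset([i]) for i in items if support([i]) >= min_sup]
    frequent = list(L)
    k = 2
    while L:
        # simplified join step: union pairs of frequent (k-1)-itemsets
        candidates = {a | b for a in L for b in L if len(a | b) == k}
        # prune step: every (k-1)-subset must itself be frequent
        candidates = {
            c for c in candidates
            if all(frozenset(s) in set(L) for s in combinations(c, k - 1))
        }
        L = [c for c in candidates if support(c) >= min_sup]
        frequent += L
        k += 1
    return {tuple(sorted(c)) for c in frequent}

transactions = [
    {3, 5, 8}, {2, 6, 8}, {1, 4, 7, 10}, {3, 8, 10}, {2, 5, 8},
    {1, 5, 6}, {4, 5, 6, 8}, {2, 3, 4}, {1, 5, 7, 8}, {3, 8, 9, 10},
]
print(apriori(transactions, min_sup=4))  # → {(3,), (5,), (8,), (5, 8)}
```

With min_sup = 4 only items 3, 5, and 8 survive the first level, and {5, 8} (support 4) is the single frequent 2-itemset, consistent with the confidence example above.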
References:
1. Jiawei Han, Micheline Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2nd edition, 2006.
2. Agrawal R., Imielinski T., and Swami A. Mining association rules between sets of items in large databases. In Buneman P. and Jajodia S., editors, Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data.
3. Adrian Kügel and Enno Ohlebusch. A space efficient solution to the frequent string mining problem for many databases. Data Mining and Knowledge Discovery, 2008.