DESCRIPTION ABOUT PREDICTIVE AND DESCRIPTIVE DATA MINING
Sonali Guglani, Sunaina Bagga, Ankit Goyal
Assistant Professor, RIMT - MAEC, Mandi Gobindgarh
Abstract
A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site. Data warehouses are constructed through a process of data cleansing, data transformation, data integration, and data loading. Data mining refers to extracting or mining knowledge from large amounts of data. Data mining tasks can be classified into two categories: descriptive and predictive. Descriptive mining tasks characterize the general properties of the data in the database. Predictive mining tasks perform inference on the current data in order to make predictions. In this paper, we compare the algorithms used in predictive and descriptive data mining.
Keywords: Data mining, Descriptive algorithm, Predictive algorithm.
1. Introduction:
Data mining is the process of discovering useful information (i.e., patterns) underlying the data. Powerful techniques are needed to extract patterns from large data sets because traditional statistical tools are no longer efficient enough [1].
The architecture of a typical data mining system may
have the following major components:
1. Database, data warehouse, or other information repository. This is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and data integration techniques may be performed on the data.
2. Database or data warehouse server. The database or data warehouse server is responsible for fetching the relevant data, based on the user's data mining request.
3. Knowledge base. This is the domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies, used to organize attributes or attribute values into different levels of abstraction.
4. Data mining engine. This is essential to the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association analysis, classification, evolution and deviation analysis.
5. Pattern evaluation module. This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search towards interesting patterns. It may access interestingness thresholds stored in the knowledge base.
6. Graphical user interface. This module communicates between users and the data mining system, allowing the user to interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory data mining based on the intermediate data mining results. In addition, this component allows the user to browse database and data warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns in different forms.
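To make the interaction of these components concrete, the following Python sketch shows one possible (purely illustrative) arrangement of such modules; the class names, the toy data, and the characterize/filter methods are our own assumptions rather than part of any particular data mining system.

# Illustrative sketch only: the names below mirror the six components above,
# they are not an API of any real data mining system.

class Repository:
    """Component 1: holds the raw data (database, warehouse, files)."""
    def __init__(self, records):
        self.records = records

class Server:
    """Component 2: fetches the data relevant to a mining request."""
    def __init__(self, repository):
        self.repository = repository
    def fetch(self, predicate):
        return [r for r in self.repository.records if predicate(r)]

class KnowledgeBase:
    """Component 3: domain knowledge, e.g. an interestingness threshold."""
    def __init__(self, min_support=0.1):
        self.min_support = min_support

class MiningEngine:
    """Component 4: functional modules (here only a toy characterization)."""
    def characterize(self, data, attribute):
        counts = {}
        for row in data:
            counts[row[attribute]] = counts.get(row[attribute], 0) + 1
        return counts

class PatternEvaluator:
    """Component 5: keeps only patterns above the interestingness threshold."""
    def filter(self, counts, total, kb):
        return {k: v for k, v in counts.items() if v / total >= kb.min_support}

# Component 6, the graphical user interface, would drive these objects
# from user-specified mining queries.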
2. Different Types of Data Mining:
There are two types of data mining: descriptive and predictive.
3. Syntax for specifying the kind of knowledge to be mined:
The Mine Knowledge Specification statement is used to specify the kind of knowledge to be mined. In other words, it indicates the data mining functionality to be performed. Its syntax is defined below for characterization, discrimination, association, classification, and prediction.
1. Characterization.
<Mine Knowledge Specification> ::=
mine characteristics [as <pattern name>]
analyze <measure(s)>
This specifies that characteristic descriptions are to be mined. The analyze clause, when used for characterization, specifies aggregate measures, such as count, sum, or count% (percentage count, i.e., the percentage of tuples in the relevant data set with the specified characteristics). These measures are to be computed for each data characteristic found.
2. Discrimination.
<Mine Knowledge Specification> ::=
mine comparison [as <pattern name>]
for <target class> where <target condition>
{versus <contrast class_i> where <contrast condition_i>}
analyze <measure(s)>
This specifies that discriminant descriptions are to be mined. These descriptions compare a given target class of objects with one or more other contrasting classes. Hence, this kind of knowledge is referred to as a comparison. As for characterization, the analyze clause specifies aggregate measures, such as count, sum, or count%, to be computed and displayed for each description.
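As an illustration of the aggregate measures requested by the analyze clause (count, sum, and count%), consider the following Python sketch; the customer tuples, the attribute names, and the target/contrast classes are invented for the example.

# Toy data: each tuple describes one customer; all values are invented.
rows = [
    {"region": "Asia", "income": 30}, {"region": "Asia", "income": 45},
    {"region": "Europe", "income": 50}, {"region": "Asia", "income": 25},
]

# Characterization-style measures for the target class region == "Asia":
target = [r for r in rows if r["region"] == "Asia"]
count = len(target)                              # count
total_income = sum(r["income"] for r in target)  # sum
count_pct = 100.0 * count / len(rows)            # count% of the relevant data

# A discrimination-style comparison computes the same measures for a
# contrasting class, e.g. region == "Europe", and displays them side by side.
contrast = [r for r in rows if r["region"] == "Europe"]
print(count, total_income, count_pct, len(contrast))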
3. Association.
<Mine Knowledge Specification> ::=
mine associations [as <pattern name>]
[matching <metapattern>]
This specifies the mining of patterns of association.
When specifying association mining, the user has the
option of providing templates (also known as
metapatterns or metarules) with the matching clause.
The metapatterns can be used to focus the discovery
towards the patterns that match the given
metapatterns, thereby enforcing additional syntactic
constraints for the mining task. In addition to
providing syntactic constraints, the metapatterns
represent data hunches or hypotheses that the user
finds interesting for investigation. Mining with the
use of metapatterns, or metarule-guided mining,
allows additional flexibility for ad-hoc rule mining.
While metapatterns may be used in the mining of
other forms of knowledge, they are most useful for
association mining due to the vast number of
potentially generated associations.
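The following Python sketch gives a rough, self-contained illustration of metarule-guided association mining; the transactions, the support threshold, and the restriction to antecedents drawn from a small item set (standing in for the matching clause) are all assumptions made for the example, not part of DMQL.

from itertools import combinations

# Invented market-basket transactions.
transactions = [
    {"bread", "milk"}, {"bread", "butter"}, {"milk", "butter", "bread"},
    {"milk", "butter"}, {"bread", "milk"},
]
min_support = 0.4   # assumed interestingness threshold

# The "metapattern" here restricts mining to 2-item associations involving
# one of these items, mimicking the effect of the matching clause.
allowed_items = {"bread", "milk"}

items = set().union(*transactions)
n = len(transactions)
for a, b in combinations(sorted(items), 2):
    support = sum(1 for t in transactions if {a, b} <= t) / n
    if support >= min_support and (a in allowed_items or b in allowed_items):
        print(f"{{{a}}} => {{{b}}}  (support = {support:.2f})")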
4. Classification.
<Mine Knowledge Specification> ::=
mine classification [as <pattern name>]
analyze <classifying attribute or dimension>
This specifies that patterns for data classification are
to be mined. The analyze clause specifies that the
classification is performed according to the values of
<classifying attribute or dimension>. For categorical
attributes or dimensions, typically each value
represents a class (such as "Vancouver", "New York", "Chicago", and so on for the dimension location). For numeric attributes or dimensions, each class may be defined by a range of values (such as "20-39", "40-59", "60-89" for age). Classification provides a
concise framework which best describes the objects
in each class and distinguishes them from other
classes.
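To illustrate classification according to a classifying attribute, the sketch below builds a decision tree with scikit-learn; the age/income tuples, the credit_rating class labels, and the use of scikit-learn itself are assumptions for illustration only.

from sklearn.tree import DecisionTreeClassifier

# Invented training tuples: [age, income]; the classifying attribute is
# "credit_rating", whose categorical values act as the class labels.
X = [[25, 30], [35, 60], [45, 80], [22, 20], [52, 90], [40, 55]]
y = ["fair", "good", "excellent", "fair", "excellent", "good"]

model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# The learned tree is a concise description of each class that also
# distinguishes it from the other classes, as described above.
print(model.predict([[30, 50]]))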
5. Prediction.
<Mine Knowledge Specification> ::=
mine prediction [as <pattern name>]
analyze <prediction attribute or dimension>
{set <attribute or dimension_i> = <value_i>}
This DMQL syntax is for prediction. It specifies the
mining of missing or unknown continuous data
values, or of the data distribution, for the attribute or
dimension specified in the analyze clause. A
predictive model is constructed based on the analysis
of the values of the other attributes or dimensions
describing the data objects (tuples). The set clause
can be used to fix the values of these other attributes.
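As a rough analogue of this prediction task in Python, the sketch below fits a model for a continuous attribute from the other attributes and then fixes their values in the spirit of the set clause; the salary data, the attribute roles, and the use of scikit-learn's LinearRegression are illustrative assumptions.

from sklearn.linear_model import LinearRegression

# Invented tuples: [age, years_employed] -> salary (the prediction attribute).
X = [[25, 2], [30, 5], [40, 15], [50, 25], [35, 10]]
y = [30000, 40000, 60000, 80000, 50000]

model = LinearRegression().fit(X, y)

# Analogue of "set age = 45, years_employed = 20": fix the other attributes
# and predict the missing continuous value.
fixed_age, fixed_years = 45, 20
print(model.predict([[fixed_age, fixed_years]]))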
4. CLUSTERING TECHNIQUES
Different approaches to clustering data can be described with the help of a hierarchy [4] (other taxonomic representations of clustering methodology are possible); ours is based on the discussion in Jain and Dubes. At the top level, there is a distinction between hierarchical and partitional approaches (hierarchical methods produce a nested series of partitions, while partitional methods produce only one). This taxonomy must be supplemented by a discussion of cross-cutting issues that may (in principle) affect all of the different approaches regardless of their placement in the taxonomy.
—Agglomerative vs. divisive [3]: This aspect relates
to algorithmic structure and operation. An
agglomerative approach begins with each pattern in a
distinct (singleton) cluster, and successively merges
clusters together until a stopping criterion is satisfied.
A divisive method begins with all patterns in a single
cluster and performs splitting until a stopping
criterion is met [4].
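The agglomerative strategy can be sketched as follows; the 2-D points, the single-link (minimum) inter-cluster distance, and the stopping criterion of a fixed number of clusters are assumptions chosen for illustration.

# Single-link agglomerative clustering sketch on invented 2-D points.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
target_clusters = 2

def dist(p, q):
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

# Start with every pattern in its own (singleton) cluster ...
clusters = [[p] for p in points]

# ... and repeatedly merge the two closest clusters (single-link distance)
# until the stopping criterion (a fixed number of clusters) is met.
while len(clusters) > target_clusters:
    best = None
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            d = min(dist(p, q) for p in clusters[i] for q in clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
    _, i, j = best
    clusters[i] += clusters.pop(j)

print(clusters)

A divisive method would proceed in the opposite direction, starting from one cluster containing all patterns and splitting until the criterion is met.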
—Monothetic vs. polythetic: This aspect relates to the sequential or simultaneous use of features in the clustering process. Most algorithms are polythetic; that is, all features enter into the computation of distances between patterns, and decisions are based on those distances. A simple monothetic algorithm reported in Anderberg considers features sequentially to divide the given collection of patterns. This is illustrated in Figure 8. Here, the collection is divided into two groups using feature x1; the vertical broken line V is the separating line. Each of these clusters is further divided independently using feature x2, as depicted by the broken lines H1 and H2. The major problem with this algorithm is that it generates 2^d clusters, where d is the dimensionality of the patterns. For large values of d (d > 100 is typical in information retrieval applications [Salton 1991]), the number of clusters generated by this algorithm is so large that the data set is divided into uninterestingly small and fragmented clusters.
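The sketch below mimics this monothetic splitting on two invented features; for brevity a single x2 threshold is reused on both sides, whereas the algorithm described above chooses the second split independently within each group.

# Monothetic splitting sketch: features are used one at a time.
patterns = [(1, 8), (2, 2), (3, 7), (7, 1), (8, 9), (9, 3)]
x1_threshold, x2_threshold = 5, 5   # arbitrary split points (lines V, H1, H2)

clusters = {}
for x1, x2 in patterns:
    # First split on x1 alone, then split each side on x2 alone; with d
    # features this scheme yields up to 2**d clusters.
    key = (x1 > x1_threshold, x2 > x2_threshold)
    clusters.setdefault(key, []).append((x1, x2))

for key, members in sorted(clusters.items()):
    print(key, members)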
—Hard vs. fuzzy [4]: A hard clustering algorithm
allocates each pattern to a single cluster during its
operation and in its output. A fuzzy clustering
method assigns degrees of membership in several
clusters to each input pattern. A fuzzy clustering can
be converted to a hard clustering by assigning each
pattern to the cluster with the largest measure of
membership.
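That conversion from a fuzzy clustering to a hard one can be written in a few lines; the membership matrix below is invented and numpy is an assumed dependency.

import numpy as np

# Invented fuzzy memberships: rows are patterns, columns are clusters,
# each row sums to 1 (degrees of membership in several clusters).
memberships = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.4, 0.3, 0.3],
])

# Hard assignment: each pattern goes to the cluster with the largest
# membership value.
hard_labels = memberships.argmax(axis=1)
print(hard_labels)   # e.g. [0 1 0]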
—Deterministic vs. stochastic [4]: This issue is most relevant to partitional approaches designed to optimize a squared error function. This optimization can be accomplished using traditional techniques or through a random search of the state space consisting of all possible labelings.
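For reference, the squared error criterion that such partitional approaches optimize can be computed as follows; the points and the candidate labeling are invented for the example.

# Squared error of a candidate labeling: sum of squared distances of each
# pattern to the centroid of its assigned cluster.
points = [(0.0, 0.0), (1.0, 0.0), (9.0, 9.0), (10.0, 9.0)]
labels = [0, 0, 1, 1]          # one candidate labeling of the points

def squared_error(points, labels):
    error = 0.0
    for c in set(labels):
        members = [p for p, l in zip(points, labels) if l == c]
        cx = sum(p[0] for p in members) / len(members)
        cy = sum(p[1] for p in members) / len(members)
        error += sum((p[0] - cx) ** 2 + (p[1] - cy) ** 2 for p in members)
    return error

# A deterministic method refines one labeling; a stochastic method would
# sample many labelings and keep the one with the smallest error.
print(squared_error(points, labels))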
—Incremental vs. non-incremental [4]:
This issue arises when the pattern set to be clustered
is large, and constraints on execution time or memory
space affect the architecture of the algorithm. The
early history of clustering methodology does not
contain many examples of clustering algorithms
designed to work with large data sets, but the advent
of data mining has fostered the development of
clustering algorithms that minimize the number of
scans through the pattern set, reduce the number of
patterns examined during execution, or reduce the
size of data structures used in the algorithm’s
operations [4].
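A leader-style single-scan procedure illustrates the incremental idea (each pattern is examined exactly once and either joins an existing cluster or starts a new one); the distance threshold and points are assumptions, and this sketch is not one of the specific large-data algorithms cited above.

# Incremental (single-scan) clustering sketch in the style of the leader
# algorithm: each pattern is examined exactly once.
points = [(0, 0), (1, 1), (9, 9), (0, 2), (10, 8)]
threshold = 3.0    # assumed distance threshold

leaders = []       # one representative point per cluster
assignments = []

for p in points:
    for i, leader in enumerate(leaders):
        if ((p[0] - leader[0]) ** 2 + (p[1] - leader[1]) ** 2) ** 0.5 <= threshold:
            assignments.append(i)
            break
    else:
        leaders.append(p)          # start a new cluster with p as its leader
        assignments.append(len(leaders) - 1)

print(leaders, assignments)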
5. Conclusion
In this paper, we have presented the predictive and descriptive types of data mining [10]. We have seen that various algorithms fall under these categories, and we have described them in detail with the help of examples.
6. References
[1] J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2nd Edition, Morgan Kaufmann.
[2] B. Lam, Data Mining Clustering and Classification, Spring 2007, SJSU.
[3] K. Gerber, Data Mining at UVA, May 21-24, 2007, ITC Research Computing.
[4] K. S. Al-Sultan, "A tabu search approach to clustering problems", Pattern Recognition, 28.
[5] P. S. Bradley, U. Fayyad, C. Reina, "Scaling clustering algorithms to large databases", 4th Int. Conf. on Knowledge Discovery and Data Mining (KDD-98), AAAI Press, Aug. 1998.
[6] A. K. Jain, R. C. Dubes, Algorithms for Clustering Data, Prentice Hall.
[7] Data Mining Tool in Capital Market, May 21-24, 2007, International Journal "Information Theories & Applications", Vol. 15.
[8] R. Grossman, C. Kamath, V. Kumar, Data Mining for Scientific and Engineering Applications, June 2008.