DESCRIPTION ABOUT PREDICTIVE AND DESCRIPTIVE DATA MINING

Sonali Guglani1, Sunaina Bagga2, Ankit Goyal3
1,2 Assistant Professor, RIMT – MAEC, Mandi Gobindgarh

Abstract
A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site. Data warehouses are constructed through a process of data cleansing, data transformation, data integration, and data loading. Data mining refers to extracting or mining knowledge from large amounts of data. Data mining tasks can be classified into two categories: descriptive and predictive. Descriptive mining tasks characterize the general properties of the data in the database. Predictive mining tasks perform inference on the current data in order to make predictions. In this paper, we study and compare the algorithms used in predictive and descriptive data mining.

Keywords: Data mining, Descriptive algorithm, Predictive algorithm.

1. Introduction
Data mining is the process of discovering useful information (i.e., patterns) underlying the data. Powerful techniques are needed to extract patterns from large data sets because traditional statistical tools are no longer efficient enough [1]. The architecture of a typical data mining system may have the following major components:

1. Database, data warehouse, or other information repository. This is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and data integration techniques may be performed on the data.

2. Database or data warehouse server. The database or data warehouse server is responsible for fetching the relevant data, based on the user's data mining request.

3. Knowledge base. This is the domain knowledge that is used to guide the search or to evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies, used to organize attributes or attribute values into different levels of abstraction.

4. Data mining engine. This is essential to the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association analysis, classification, and evolution and deviation analysis.

5. Pattern evaluation module. This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search towards interesting patterns. It may access interestingness thresholds stored in the knowledge base.

6. Graphical user interface. This module communicates between users and the data mining system, allowing the user to interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory data mining based on intermediate data mining results. In addition, this component allows the user to browse database and data warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns in different forms.
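To make the interplay of these components more concrete, the following is a minimal, hypothetical sketch (not part of the original system description): a toy mining engine produces candidate patterns and a pattern evaluation step filters them using an interestingness threshold held in a knowledge base. All names, thresholds, and data are invented for illustration.

```python
# Illustrative sketch only: a knowledge base supplies an interestingness threshold,
# the mining engine produces candidate patterns, and the pattern evaluation module
# filters them. Every name and value here is hypothetical.
from dataclasses import dataclass

@dataclass
class KnowledgeBase:
    min_support: float = 0.6   # interestingness threshold stored in the knowledge base

def mining_engine(transactions):
    """Toy characterization: the fraction of transactions containing each item."""
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    n = len(transactions)
    return {item: c / n for item, c in counts.items()}   # item -> support

def pattern_evaluation(patterns, kb: KnowledgeBase):
    """Keep only patterns that meet the interestingness threshold."""
    return {p: s for p, s in patterns.items() if s >= kb.min_support}

if __name__ == "__main__":
    data = [("milk", "bread"), ("milk", "beer"), ("bread",), ("milk", "bread", "beer")]
    kb = KnowledgeBase(min_support=0.6)
    print(pattern_evaluation(mining_engine(data), kb))   # {'milk': 0.75, 'bread': 0.75}
```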
2. Different Types of Data Mining
There are two types of data mining: descriptive and predictive.

3. Syntax for Specifying the Kind of Knowledge to Be Mined
The Mine Knowledge Specification statement is used to specify the kind of knowledge to be mined; in other words, it indicates the data mining functionality to be performed. Its syntax is defined below for characterization, discrimination, association, classification, and prediction.

1. Characterization.
<Mine Knowledge Specification> ::= mine characteristics [as <pattern name>] analyze <measure(s)>
This specifies that characteristic descriptions are to be mined. The analyze clause, when used for characterization, specifies aggregate measures, such as count, sum, or count% (percentage count, i.e., the percentage of tuples in the relevant data set with the specified characteristics). These measures are to be computed for each data characteristic found (a brief illustrative sketch is given after this list).

2. Discrimination.
<Mine Knowledge Specification> ::= mine comparison [as <pattern name>] for <target class> where <target condition> {versus <contrast class_i> where <contrast condition_i>} analyze <measure(s)>
This specifies that discriminant descriptions are to be mined. These descriptions compare a given target class of objects with one or more other contrasting classes; hence, this kind of knowledge is referred to as a comparison. As for characterization, the analyze clause specifies aggregate measures, such as count, sum, or count%, to be computed and displayed for each description.

3. Association.
<Mine Knowledge Specification> ::= mine associations [as <pattern name>] [matching <metapattern>]
This specifies the mining of patterns of association. When specifying association mining, the user has the option of providing templates (also known as metapatterns or metarules) with the matching clause. The metapatterns can be used to focus the discovery towards the patterns that match them, thereby enforcing additional syntactic constraints for the mining task. In addition to providing syntactic constraints, the metapatterns represent data hunches or hypotheses that the user finds interesting to investigate. Mining with the use of metapatterns, or metarule-guided mining, allows additional flexibility for ad hoc rule mining. While metapatterns may be used in the mining of other forms of knowledge, they are most useful for association mining because of the vast number of associations that can potentially be generated.

4. Classification.
<Mine Knowledge Specification> ::= mine classification [as <pattern name>] analyze <classifying attribute or dimension>
This specifies that patterns for data classification are to be mined. The analyze clause specifies that the classification is performed according to the values of <classifying attribute or dimension>. For categorical attributes or dimensions, each value typically represents a class (such as "Vancouver", "New York", and "Chicago" for the dimension location). For numeric attributes or dimensions, each class may be defined by a range of values (such as "20-39", "40-59", and "60-89" for age). Classification provides a concise framework that best describes the objects in each class and distinguishes them from other classes (a brief illustrative sketch is given after this list).

5. Prediction.
<Mine Knowledge Specification> ::= mine prediction [as <pattern name>] analyze <prediction attribute or dimension> {set {<attribute or dimension_i> = <value_i>}}
This DMQL syntax is for prediction. It specifies the mining of missing or unknown continuous data values, or of the data distribution, for the attribute or dimension specified in the analyze clause. A predictive model is constructed based on the analysis of the values of the other attributes or dimensions describing the data objects (tuples). The set clause can be used to fix the values of these other attributes.
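As a concrete illustration of the descriptive side, the count% measure named in the characterization syntax above can be computed directly over a small relation. The sketch below is hypothetical: the customer table, its attributes, and its values are all invented, and the code only mimics what "analyze count%" asks for.

```python
# Hedged sketch: the count% measure (percentage of tuples with each value of an
# attribute) on a hypothetical customer relation. All values are invented.
from collections import Counter

customers = [
    {"location": "Vancouver", "age_group": "20-39"},
    {"location": "Vancouver", "age_group": "40-59"},
    {"location": "New York",  "age_group": "20-39"},
    {"location": "Chicago",   "age_group": "60-89"},
]

def count_percent(tuples, attribute):
    """Percentage of tuples in the relevant data set having each value of `attribute`."""
    counts = Counter(t[attribute] for t in tuples)
    total = len(tuples)
    return {value: 100.0 * c / total for value, c in counts.items()}

print(count_percent(customers, "location"))
# {'Vancouver': 50.0, 'New York': 25.0, 'Chicago': 25.0}
```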
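For the classification task, the text notes that a numeric dimension such as age may be divided into range-valued classes like "20-39", "40-59", and "60-89". The following hedged sketch assumes scikit-learn is available and uses invented features and data to train a simple decision tree that assigns objects to such range classes; it illustrates the task itself, not DMQL.

```python
# Hedged sketch of the classification task: a numeric dimension (age) is mapped to the
# range-valued classes mentioned in the text, and a decision tree learns to predict that
# class from other, invented attributes. Requires scikit-learn.
from sklearn.tree import DecisionTreeClassifier

def age_class(age):
    if 20 <= age <= 39:
        return "20-39"
    if 40 <= age <= 59:
        return "40-59"
    return "60-89"

# features: [income_in_thousands, years_as_customer]  (hypothetical)
X = [[30, 1], [35, 2], [60, 10], [65, 12], [40, 20], [45, 25]]
ages = [25, 31, 45, 52, 68, 74]
y = [age_class(a) for a in ages]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.predict([[33, 2]]))   # predicted class label, e.g. ['20-39']
```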
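For the prediction task, a predictive model is built from the other attributes, and the set clause fixes some of their values. The sketch below imitates this with an ordinary linear regression (an assumption; the paper does not prescribe a particular model): a continuous attribute is estimated while one describing attribute is held fixed. All attribute names and values are invented.

```python
# Hedged sketch of the prediction task: estimate a missing continuous value from the
# other describing attributes, holding one of them fixed (analogous to the `set` clause).
# Requires scikit-learn; data and attribute names are hypothetical.
from sklearn.linear_model import LinearRegression

# attributes: [age, years_as_customer] -> annual_spend (continuous, to be predicted)
X = [[25, 1], [32, 3], [41, 7], [50, 12], [63, 20]]
annual_spend = [1200.0, 1500.0, 2100.0, 2600.0, 3400.0]

model = LinearRegression().fit(X, annual_spend)

# "set years_as_customer = 5": fix one describing attribute, predict for new ages
fixed_years = 5
for age in (30, 45, 60):
    est = model.predict([[age, fixed_years]])[0]
    print(f"age={age}, years={fixed_years} -> estimated annual_spend of about {est:.0f}")
```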
4. Clustering Techniques
Different approaches to clustering data can be described with the help of a hierarchy [4] (other taxonomic representations of clustering methodology are possible); ours is based on the discussion in Jain and Dubes [6]. At the top level, there is a distinction between hierarchical and partitional approaches: hierarchical methods produce a nested series of partitions, while partitional methods produce only one (see the first sketch following this section). This taxonomy must be supplemented by a discussion of cross-cutting issues that may, in principle, affect all of the different approaches regardless of their placement in the taxonomy.

— Agglomerative vs. divisive [3]: This aspect relates to algorithmic structure and operation. An agglomerative approach begins with each pattern in a distinct (singleton) cluster and successively merges clusters together until a stopping criterion is satisfied. A divisive method begins with all patterns in a single cluster and performs splitting until a stopping criterion is met [4].

— Monothetic vs. polythetic: This aspect relates to the sequential or simultaneous use of features in the clustering process. Most algorithms are polythetic; that is, all features enter into the computation of distances between patterns, and decisions are based on those distances. A simple monothetic algorithm reported in Anderberg considers features sequentially to divide the given collection of patterns: the collection is first divided into two groups using feature x1, and each of these clusters is then divided independently using feature x2, and so on (a sketch of this scheme follows this section). The major problem with this algorithm is that it generates 2^d clusters, where d is the dimensionality of the patterns. For large values of d (d > 100 is typical in information retrieval applications [Salton 1991]), the number of clusters generated by this algorithm is so large that the data set is divided into uninterestingly small and fragmented clusters.

— Hard vs. fuzzy [4]: A hard clustering algorithm allocates each pattern to a single cluster during its operation and in its output. A fuzzy clustering method assigns degrees of membership in several clusters to each input pattern. A fuzzy clustering can be converted to a hard clustering by assigning each pattern to the cluster with the largest measure of membership (see the last sketch after this section).

— Deterministic vs. stochastic [4]: This issue is most relevant to partitional approaches designed to optimize a squared-error function. This optimization can be accomplished using traditional techniques or through a random search of the state space consisting of all possible labelings.

— Incremental vs. non-incremental [4]: This issue arises when the pattern set to be clustered is large, and constraints on execution time or memory space affect the architecture of the algorithm. The early history of clustering methodology does not contain many examples of clustering algorithms designed to work with large data sets, but the advent of data mining has fostered the development of clustering algorithms that minimize the number of scans through the pattern set, reduce the number of patterns examined during execution, or reduce the size of data structures used in the algorithm's operations [4].
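The hierarchical vs. partitional distinction mentioned at the start of this section can be seen in a few lines of code. The sketch below assumes numpy, scipy, and scikit-learn are available and uses invented two-dimensional data: the agglomerative method records a full merge history (a nested series of partitions) that can be cut at any level, while k-means returns a single partition obtained by optimizing a squared-error criterion.

```python
# Hedged sketch of hierarchical (agglomerative) vs. partitional (k-means) clustering
# on invented 2-D data. Requires numpy, scipy, and scikit-learn.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),
               rng.normal(3.0, 0.3, size=(20, 2))])

# Agglomerative: start from singletons and successively merge; Z is the full merge
# history, i.e., a nested series of partitions that can be cut at any level.
Z = linkage(X, method="average")
hard_labels = fcluster(Z, t=2, criterion="maxclust")   # cut the hierarchy at 2 clusters

# Partitional: optimize a squared-error criterion and return only one partition.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print("agglomerative cluster sizes:", np.bincount(hard_labels)[1:])
print("k-means cluster sizes:      ", np.bincount(kmeans_labels))
```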
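The monothetic scheme described above splits the collection on one feature at a time, so d features can yield up to 2^d clusters. The sketch below is a simplification (an assumption: each feature is split at its global median rather than separately within each group), but it is enough to show how quickly the number of cells grows with d.

```python
# Hedged sketch of monothetic splitting: one binary split per feature, hence up to
# 2**d cells for d features. Thresholds are global per-feature medians (a simplification).
import numpy as np

def monothetic_split(X):
    """Assign each pattern a binary code, one bit per feature (above/below its median)."""
    codes = (X > np.median(X, axis=0)).astype(int)        # n x d matrix of 0/1 bits
    labels = codes.dot(1 << np.arange(X.shape[1]))        # interpret each row as an integer
    return labels

rng = np.random.default_rng(1)
for d in (2, 5, 10):
    X = rng.normal(size=(200, d))
    n_cells = len(set(monothetic_split(X)))
    print(f"{d} features -> {n_cells} occupied cells (upper bound {2**d})")
```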
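Finally, the conversion from a fuzzy clustering to a hard one described under "hard vs. fuzzy" amounts to assigning each pattern to the cluster in which it has the largest membership. The membership matrix in the sketch below is invented for illustration.

```python
# Hedged sketch: converting a fuzzy clustering into a hard one by taking, for each
# pattern (row), the cluster with the largest degree of membership.
import numpy as np

# rows = patterns, columns = clusters; each row sums to 1 (degrees of membership)
memberships = np.array([
    [0.9, 0.1],
    [0.6, 0.4],
    [0.2, 0.8],
    [0.5, 0.5],   # ties go to the first cluster under argmax
])

hard_assignment = memberships.argmax(axis=1)
print(hard_assignment)   # [0 0 1 0]
```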
5. Conclusion
In this paper, we have presented the predictive and descriptive types of data mining [10]. We have seen that various algorithms fall under these two categories, and we have described them in detail with the help of examples.

6. References
[1] J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2nd ed., Morgan Kaufmann.
[2] B. Lam, Data Mining Clustering and Classification, SJSU, Spring 2007.
[3] K. Gerber, Data Mining at UVA, ITC Research Computing, May 21-24, 2007.
[4] K. S. Al-Sultan, "A tabu search approach to clustering problems," Pattern Recognition, vol. 28.
[5] P. S. Bradley, U. Fayyad, and C. Reina, "Scaling clustering algorithms to large databases," in Proc. 4th Int. Conf. on Knowledge Discovery and Data Mining (KDD-98), AAAI Press, Aug. 1998.
[6] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice Hall.
[7] "Data Mining Tool in Capital Market," International Journal "Information Theories & Applications," vol. 15, 2007.
[8] R. Grossman, C. Kamath, and V. Kumar, Data Mining for Scientific and Engineering Applications, June 2008.