DATA MINING CONCEPTS AND METHODS IMPLEMENTED FOR KNOWLEDGE DISCOVERY IN DATABASES

NAGARATNA P. HEGDE, Professor, Vasavi Engineering College, Hyderabad
B. VARIJA (Research Scholar), Associate Professor, Nishitha Degree and P.G. College, Nizamabad

ABSTRACT

Data mining turns a large collection of data into knowledge: it is the automatic discovery of useful information from large data repositories. Mining can be applied to any kind of data as long as the data are meaningful for a target application. Many methods are used for mining, such as mining frequent patterns, associations, classification, cluster analysis, and outlier detection. In particular, data mining draws upon ideas such as sampling, estimation, and hypothesis testing. In this paper, the basic concepts and methods used to mine data in a systematic manner are discussed.

I. INTRODUCTION

Data mining has made significant progress and covered a broad spectrum of applications since 1980. Data mining tasks can be categorized as predictive tasks and descriptive tasks. In data mining, association analysis is used to discover patterns that describe strongly associated features in the data, whereas cluster analysis finds groups of closely related observations. Outlier analysis identifies data objects that deviate significantly from the rest of the objects. Data mining is an essential process in which intelligent methods are applied to extract data patterns. Many people treat data mining as a synonym for another popularly used term, knowledge discovery from data, or KDD.

II. KNOWLEDGE DISCOVERY FROM DATA

Others view data mining as merely an essential step in the process of knowledge discovery. The knowledge discovery process is an iterative sequence of the following steps:
1. Data cleaning: to remove noise and inconsistent data.
2. Data integration: to combine data obtained from multiple data sources.
3. Data selection: to retrieve the data relevant to the analysis task.
4. Data transformation: to transform the data into forms appropriate for mining, for example by performing summary or aggregation operations.
5. Data mining: the essential process in which intelligent methods are applied to extract data patterns.
6. Pattern evaluation: to identify the truly interesting patterns representing knowledge.
7. Knowledge presentation: to present the mined knowledge to users using visualization and knowledge representation techniques.

III. CONCEPTS OF DATA MINING

Frequent pattern mining: Frequent pattern mining enables the discovery of interesting associations and correlations between item sets in transactional and relational databases. A typical application of frequent item set mining is market basket analysis. This process analyzes customer buying habits by finding associations between the different items that customers place in their shopping baskets.

Association rules: Let I = {I1, I2, ..., In} be a set of items, and let T be a transaction with T ⊆ I, associated with an identifier called a TID. An association rule is an implication of the form A => B, where A ⊂ I, B ⊂ I, A ≠ ∅, B ≠ ∅, and A ∩ B = ∅. The support of the rule is support(A => B) = P(A ∪ B), the probability that a transaction contains both A and B.

Classification: Data classification can be considered a two-step process consisting of a learning step and a classification step. The learning step, or training phase, is where a classification algorithm builds the classifier by analyzing, or "learning from", a training set made up of database tuples and their associated class labels. In the classification step, test data are used to estimate the accuracy of the classification rules. If the accuracy is considered acceptable, the rules can be applied to the classification of new data tuples.
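To make the two-step classification process concrete, here is a minimal sketch that builds a classifier from class-labeled training tuples and then estimates its accuracy on held-out test tuples. It assumes Python with scikit-learn available; the data set and classifier choice are illustrative, not prescribed by the paper.

```python
# Minimal sketch of the two-step classification process,
# assuming scikit-learn is installed; data and model are illustrative.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)  # tuples and their associated class labels

# Hold out a portion of the data so accuracy can be estimated later.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Learning step: build the classifier by learning from the training set.
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Classification step: estimate accuracy on the test data; if acceptable,
# the classifier can be applied to classify new, unlabeled tuples.
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```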
Cluster analysis: Unlike classification, which analyzes class-labeled data sets, cluster analysis works on data objects without consulting class labels. The objects are clustered, or grouped, based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity.

IV. DATA MINING APPLICATIONS

The various applications of data mining are given below.

Data mining for financial data analysis: Financial data collected in the banking and financial industry are often relatively complete, reliable, and of high quality.

Data mining for the retail and telecommunication industries: Retail data mining can help identify customer buying behavior and discover customer shopping patterns and trends to improve the quality of customer service.

Data mining in science and engineering: Scientific domains generate vast amounts of data, for example from sophisticated telescopes.

Data mining for intrusion detection and prevention: The majority of intrusion detection and prevention systems use either signature-based detection or anomaly-based detection.

Data mining and recommender systems: Recommender systems help consumers by making product recommendations that are likely to be of interest to them, such as books, CDs, movies, restaurants, online news articles, and other services.

Outlier analysis: Outliers are different from noisy data. In general, outliers can be classified into three categories, namely global outliers, contextual outliers, and collective outliers.

V. DATA MINING METHODS

Association Rules

An association rule is of the form X => Y: if someone buys X, he also buys Y. Rules do not explain anything; they just point out hard facts in data volumes. Every association rule has a support and a confidence, and we should only consider rules derived from item sets with high support that also have high confidence; "a rule with low confidence is not meaningful."

Support

"The support is the percentage of transactions that demonstrate the rule." Example: consider a database with transactions of the form (customer_#: item_a1, item_a2, ...):

1: 1, 3, 5.
2: 1, 8, 14, 17, 12.
3: 4, 6, 8, 12, 9, 104.
4: 2, 1, 8.

Support {8, 12} = 2 (or 50%, 2 of 4 customers)
Support {1, 5} = 1 (or 25%, 1 of 4 customers)
Support {1} = 3 (or 75%, 3 of 4 customers)

An item set is called frequent if its support is equal to or greater than an agreed-upon minimal value, the support threshold. Adding to the previous example: if the threshold is 50%, then the item sets {8, 12} and {1} are called frequent.

Confidence

The confidence is the conditional probability that, given X present in a transaction, Y will also be present. By definition:

confidence(X => Y) = support(X, Y) / support(X)

Example: consider a database with the following ten transactions:

1: 3, 5, 8.
2: 2, 6, 8.
3: 1, 4, 7, 10.
4: 3, 8, 10.
5: 2, 5, 8.
6: 1, 5, 6.
7: 4, 5, 6, 8.
8: 2, 3, 4.
9: 1, 5, 7, 8.
10: 3, 8, 9, 10.

What is conf({5} => {8})? Here supp({5}) = 5, supp({8}) = 7, and supp({5, 8}) = 4, so conf({5} => {8}) = 4/5 = 0.8, or 80%.
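These support and confidence computations are easy to reproduce programmatically. The following sketch, in plain Python (the helper function names are my own, not from the paper), recomputes the worked example above:

```python
# Recompute the support/confidence worked example in plain Python.
# The helper function names are illustrative, not from the paper.
transactions = [
    {3, 5, 8}, {2, 6, 8}, {1, 4, 7, 10}, {3, 8, 10}, {2, 5, 8},
    {1, 5, 6}, {4, 5, 6, 8}, {2, 3, 4}, {1, 5, 7, 8}, {3, 8, 9, 10},
]

def support(itemset, db):
    """Count the transactions that contain every item of the item set."""
    return sum(1 for t in db if itemset <= t)

def confidence(x, y, db):
    """conf(X => Y) = support(X union Y) / support(X)."""
    return support(x | y, db) / support(x, db)

print(support({5}, transactions))          # 5
print(support({8}, transactions))          # 7
print(support({5, 8}, transactions))       # 4
print(confidence({5}, {8}, transactions))  # 0.8
```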
Clustering

Here are the typical requirements of clustering in data mining:

Scalability: We need highly scalable algorithms to deal with large databases.
Ability to deal with different kinds of attributes: Algorithms should be capable of being applied to any kind of data, such as interval-based (numerical), categorical, and binary data.
Discovery of clusters with arbitrary shape: The clustering algorithm should be capable of detecting clusters of arbitrary shape and should not be bounded to distance measures that tend to find only spherical clusters of small size.
High dimensionality: The clustering algorithm should be able to handle not only low-dimensional data but also high-dimensional space.
Ability to deal with noisy data: Databases contain noisy, missing, or erroneous data. Some algorithms are sensitive to such data and may lead to poor-quality clusters.
Interpretability: The clustering results should be interpretable, comprehensible, and usable.

Clustering Methods

The clustering methods can be classified into the following categories: partitioning methods, hierarchical methods, density-based methods, grid-based methods, model-based methods, and constraint-based methods.

Partitioning Method

Suppose we are given a database of n objects; the partitioning method constructs k partitions of the data, where each partition represents a cluster and k ≤ n. That is, it classifies the data into k groups which satisfy the following requirements: each group contains at least one object, and each object must belong to exactly one group. For a given number of partitions (say k), the partitioning method creates an initial partitioning. It then uses an iterative relocation technique to improve the partitioning by moving objects from one group to another.

Hierarchical Methods

This method creates a hierarchical decomposition of the given set of data objects. We can classify hierarchical methods on the basis of how the hierarchical decomposition is formed, as follows.

Agglomerative approach: This approach is also known as the bottom-up approach. We start with each object forming a separate group and keep merging the objects or groups that are close to one another, until all of the groups are merged into one or the termination condition holds.

Divisive approach: This approach is also known as the top-down approach. We start with all of the objects in the same cluster; in each continuing iteration, a cluster is split into smaller clusters. This is done until each object is in its own cluster or the termination condition holds.

Disadvantage: This method is rigid, i.e., once a merge or split is done, it can never be undone. Two approaches are used to improve the quality of hierarchical clustering: (1) perform careful analysis of object linkages at each hierarchical partitioning, or (2) integrate hierarchical agglomeration with other techniques by first using a hierarchical agglomerative algorithm to group objects into micro-clusters, and then performing macro-clustering on the micro-clusters.

Density-based Method

This method is based on the notion of density. The basic idea is to continue growing a given cluster as long as the density in its neighborhood exceeds some threshold, i.e., for each data point within a given cluster, the neighborhood of a given radius has to contain at least a minimum number of points.

Grid-based Method

In this method, the objects together form a grid: the object space is quantized into a finite number of cells that form a grid structure. The advantage of this method is fast processing time, which is dependent only on the number of cells in each dimension of the quantized space.

Model-based Method

In this method, a model is hypothesized for each cluster in order to find the best fit of the data to the given model. This method locates the clusters by clustering the density function, which reflects the spatial distribution of the data points. It also serves as a way of automatically determining the number of clusters based on standard statistics, taking outliers or noise into account, and therefore yields robust clustering methods.

Constraint-based Method

In this method, clustering is performed by the incorporation of user- or application-oriented constraints. A constraint refers to the user expectations or the properties of the desired clustering results. Constraints provide an interactive way of communicating with the clustering process, and can be specified by the user or by the application requirements.
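To make the partitioning and agglomerative ideas above concrete, the following sketch clusters a handful of two-dimensional points with k-means (an iterative-relocation partitioning method) and with bottom-up agglomerative clustering. It assumes scikit-learn and NumPy are available; the toy data are illustrative only.

```python
# Sketch contrasting a partitioning method (k-means) with a bottom-up
# hierarchical method (agglomerative clustering), assuming scikit-learn.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],   # one dense group
              [8.0, 8.2], [8.1, 7.9], [7.9, 8.0]])  # another dense group

# Partitioning: fix k, build an initial partitioning, then iteratively
# relocate objects between groups (each object belongs to exactly one group).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("k-means labels:      ", kmeans.labels_)

# Agglomerative: start with each object in its own group and keep merging
# the closest groups until the termination condition (here, 2 clusters).
agglo = AgglomerativeClustering(n_clusters=2).fit(X)
print("agglomerative labels:", agglo.labels_)
```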
The Apriori Algorithm

Together with the introduction of the frequent set mining problem, the first algorithm to solve it was also proposed, later denoted as AIS. Shortly afterwards, the algorithm was improved by R. Agrawal and R. Srikant and called Apriori. It is a seminal algorithm, which uses an iterative approach known as a level-wise search, where frequent k-item sets are used to explore (k+1)-item sets.

The Apriori algorithm:

Input:
    D, a database of transactions;
    min_sup, the minimum support count threshold.
Output: L, the frequent item sets in D.
Method:
(1) L1 = find_frequent_1-itemsets(D);
(2) for (k = 2; Lk-1 ≠ ∅; k++) {
(3)     Ck = apriori_gen(Lk-1);
(4)     for each transaction t ∈ D { // scan D for counts
(5)         Ct = subset(Ck, t); // get the subsets of t that are candidates
(6)         for each candidate c ∈ Ct
(7)             c.count++;
(8)     }
(9)     Lk = {c ∈ Ck | c.count ≥ min_sup};
(10) }
(11) return L = ∪k Lk;

Procedure apriori_gen(Lk-1: frequent (k-1)-itemsets)
(1) for each itemset l1 ∈ Lk-1
(2)     for each itemset l2 ∈ Lk-1
(3)         if (l1[1] = l2[1]) ∧ (l1[2] = l2[2]) ∧ ... ∧ (l1[k-2] = l2[k-2]) ∧ (l1[k-1] < l2[k-1]) then {
(4)             c = l1 ⋈ l2; // join step: generate candidates
(5)             if has_infrequent_subset(c, Lk-1) then
(6)                 delete c; // prune step: remove unfruitful candidate
(7)             else add c to Ck;
(8)         }
(9) return Ck;

Procedure has_infrequent_subset(c: candidate k-itemset; Lk-1: frequent (k-1)-itemsets)
// use prior knowledge of the Apriori property
(1) for each (k-1)-subset s of c
(2)     if s ∉ Lk-1 then
(3)         return TRUE;
(4) return FALSE;

Once the frequent item sets have been found, generating association rules from them is straightforward: for each frequent item set l and each nonempty subset s of l, the rule s => (l − s) is output if its confidence meets the minimum confidence threshold.

Conclusion: In this paper we have discussed data mining concepts and the methods used to perform mining in organizations and industries, because nowadays mining data has become essential for most organizations.

References:
1. Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2nd edition, 2006.
2. R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In P. Buneman and S. Jajodia, editors, Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data.
3. Adrian Kügel and Enno Ohlebusch. A space efficient solution to the frequent string mining problem for many databases. Data Mining and Knowledge Discovery, 2008.