ORIENTAL JOURNAL OF COMPUTER SCIENCE & TECHNOLOGY
An International Open Free Access, Peer Reviewed Research Journal
Published By: Oriental Scientific Publishing Co., India.
ISSN: 0974-6471
December 2012, Vol. 5, No. (2): Pgs. 277-281
www.computerscijournal.org

Performance Analysis of Data Mining Algorithms to Generate Frequent itemset at Single and Multiple Levels

Md. IQBAL
Department of Computer Science & Engineering, Institute of Technology, Meerut (U.P.)
(Received: September 10, 2012; Accepted: September 18, 2012)

ABSTRACT

Knowledge Discovery and Data Mining are rapidly evolving areas of research at the intersection of several disciplines, including statistics, databases, artificial intelligence, visualization, and high-performance and parallel computing. Data mining is the core part of the Knowledge Discovery in Databases (KDD) process. The KDD process consists of data selection, data cleaning, data transformation, pattern searching (data mining) and pattern evaluation. Data mining itself has been described as “the task of discovering interesting patterns from large amounts of data, where the data can be stored in databases, data warehouses or other information repositories”. Thus data mining is the extraction of implicit, previously unknown, potentially useful information from the vast amount of data available in data sets (databases, data warehouses or other information repositories). People in organizations of many kinds, such as business, science, medicine, academia and government, collect such data. The problem is that there are not enough human analysts skilled at translating all of this data into knowledge. The development of next-generation databases and Management Information Systems (MIS) has been empowered by data mining, which helps extract hidden useful information and aims at formulating knowledge to support decision-making by the organization.
Thus the goal of data mining and knowledge discovery is to turn “data into knowledge”. Data mining is becoming more widespread every day because it empowers organizations to uncover profitable patterns and trends in their existing databases. Most organizations spend millions of dollars collecting megabytes and terabytes of data but do not take advantage of the valuable information stored in it. Data mining tools use different techniques and algorithms. The tasks of data mining are distinct because many kinds of patterns exist in a large database. These techniques can be integrated or combined to deal with the complicated problems that reside in such databases, and most data mining tools employ multiple methods to handle different kinds of data in different application areas. Based on the kind of pattern sought, data mining tasks can be classified into summarization, classification, clustering and association.

Key words: Data Mining, Management Information System, Artificial Intelligence, Performance Analysis of Data Mining Algorithms to generate frequent itemset at Single and Multiple Levels.

INTRODUCTION

Data mining, also known as Knowledge Discovery in Databases (KDD), has been well studied for several decades. It has been described as “the nontrivial extraction of implicit, previously unknown, and potentially useful information from data and the science of extracting useful information from large data sets or databases.” In general, data mining is the process of analyzing data from different perspectives and summarizing it into useful information. Technically, it is the process of finding correlations or patterns among dozens of fields in large relational databases.

Scope And Relevance Of Study

Data mining refers to extracting or mining knowledge from large amounts of data. The term is actually a misnomer.
Remember that the mining of gold from rocks or sand is referred to as gold mining rather than rock or sand mining. Thus, data mining should have been more appropriately named “knowledge mining from data,” which is sometimes known as knowledge mining. Many other terms carry a similar or slightly different meaning to data mining, such as knowledge mining from data, knowledge extraction, data/pattern analysis, data archaeology, and knowledge discovery from data or KDD. Knowledge Discovery in Databases (KDD) is defined as the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. Data mining is the central part of the KDD process.

Fig. 1: Knowledge Discovery in Database (KDD) Process (Fayyad et al. 1996)

Knowledge discovery as a process consists of the following sequence of steps, as shown in Fig. 1:
• Data integration (where multiple data sources may be combined)
• Data selection (where data relevant to the analysis task are retrieved from the database)
• Data preprocessing (to remove noise and inconsistent data)
• Data transformation (where data are transformed into forms appropriate for mining by performing summary or aggregation operations)
• Data mining (an essential process where intelligent methods are applied in order to extract data patterns)
• Pattern evaluation (to identify the truly interesting patterns representing knowledge, based on some measure)
• Knowledge presentation (where visualization and knowledge representation techniques are used to present the mined knowledge to the user)

Data mining is the process of discovering interesting knowledge from large amounts of data stored in databases, data warehouses, or other information repositories. It is only one step of the KDD process, involving the application of discovery tools to find interesting patterns in targeted data. The flow of the data mining process is shown in Fig. 2. A data mining session is usually an interactive process of data mining query submission, task analysis, data collection from the database, interesting pattern search, and presentation of findings.

Fig. 2: Flow of the data mining process

Process For Mining The Data

An important concept is that building a mining model is part of a larger process that includes everything from defining the basic problem that the model will solve to deploying the model into a working environment. This process can be defined by the following six basic steps:
• Defining the problem
• Preparing data
• Exploring data
• Building models
• Validating models
• Updating models

Objectives

Although data mining has made significant progress during the past decade, most of the research has been devoted to developing effective and efficient algorithms for extracting knowledge from data. It is difficult for students, researchers and business users to get a holistic view of the field, which tends to be perceived as a collection of available algorithms and tools. The objective of this endeavor is to study the efficiency and effectiveness of multiple-level algorithms, which help in extracting specific information. In this work a new method for multiple-level association rules is introduced. It relies on three inputs:
1. Minimum support
2. Delta (factor for reducing support at lower levels)
3. Concept hierarchy

Need Of Data Mining

The way in which companies interact with their customers has changed dramatically over the past few years. A customer’s continuing business is no longer guaranteed. As a result, companies have found that they need to understand their customers better and to respond quickly to their wants and needs. In addition, the time frame in which these responses need to be made has been shrinking. It is no longer possible to wait until the signs of customer dissatisfaction are obvious before action is taken. To succeed, companies must be proactive and anticipate what a customer desires. Data mining makes this possible by helping to deliver:
• The right offer
• To the right person
• At the right time
• Through the right channel

The right offer means managing multiple interactions with your customers, prioritizing what the offer will be while making sure that irrelevant offers are minimized. The right person means that not all customers are cut from the same cloth. Your interactions with them need to move towards highly segmented marketing campaigns that target individual wants and needs. The right time is a result of the fact that interactions with customers now happen on a continuous basis. This is significantly different from the past, when quarterly mailings were cutting-edge marketing. Finally, the right channel means that you can interact with your customers in a variety of ways (direct mail, email, telemarketing, etc.). You need to make sure that you are choosing the most effective medium for a particular interaction.

Database repositories, however, pose problems for this, chiefly that data volumes are too large for classical analysis approaches:
• Large number of records (10^8 to 10^12 bytes).
• High-dimensional data (10^2 to 10^4 attributes).
How do you explore millions of records, tens or hundreds of fields, and patterns?
Only a small portion (typically 5%-10%) of collected data is ever analyzed. Data that may never be analyzed continues to be collected, at great expense, out of fear that something which may prove important in the future is missing. The growth rate of data precludes the traditional “manually intensive” approach.

Fig. 3: Example of concept Hierarchy

What can data mining do for us?
• Identify our best prospects and then retain them as customers: by concentrating marketing efforts only on the best prospects we save time and money, thus increasing the effectiveness of the marketing operation.
• Predict cross-sell opportunities and make recommendations: whether we have a traditional or web-based operation, we can help customers quickly locate products of interest to them and simultaneously increase the value of each communication with the customer.
• Learn parameters influencing trends in sales and margins: one may think this can be done with OLAP (Online Analytical Processing) tools. True, OLAP can help prove a hypothesis, but only if we know what questions to ask in the first place. In the majority of cases we may have no clue about which combination of parameters influences our operation. In these situations data mining is the only real option.
• Segment markets and personalize communications: there might be distinct groups of customers, patients, or natural phenomena that require different approaches in their handling. If we have a broad customer range, we would need to address teenagers in California and married homeowners in Minnesota with different products and messages in order to optimize a marketing campaign.

Multiple-level association rules

Mining association rules at a single level, in many cases, loses detailed information. Besides, it can show only general rules, without the ability to look inside a rule.
Data mining should also support mining association rules at multiple levels of abstraction. In association rule mining, every transaction can be encoded based on dimensions and levels. In multiple-level association rule mining, the items in an itemset are characterized using a concept hierarchy, and mining occurs at multiple levels of that hierarchy. At the lowest levels, it may be that no rules satisfy the constraints; at the highest levels, rules can be extremely general. Generally, a top-down approach is used, in which the support threshold is either the same at every level or varies from level to level (support is reduced going from higher to lower levels).

Conclusion & future work

Summarization is the abstraction or generalization of data. It results in a smaller set that gives a general overview of the data, usually with aggregated information. Summarization can go to different abstraction levels and can be viewed from different angles.

Classification derives a function or model that determines the class of an object based on its attributes. A classification function or model is constructed by analyzing the relationship between the attributes and the classes of the objects in the training set.

Clustering identifies the classes, also called clusters or groups, for a set of objects whose classes are unknown. The objects are clustered so that intraclass similarities are maximized and interclass similarities are minimized, based on criteria defined on the attributes of the objects.

Association is the degree of relationship, involvement, or connection between objects. Such a connection is termed an association rule. An association rule reveals the associative relationship among objects at multiple levels. In this dissertation, iterative or non-iterative database scanning is used to find frequent itemsets. The association rules are derived from these frequent itemsets.
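The three inputs named under Objectives (minimum support, delta, and a concept hierarchy) can be tied together in a short sketch. This is only an illustration under stated assumptions, not the method evaluated in this paper: the hierarchy, the transactions, the function names, and the rule "threshold at level l = minimum support x delta^(l-1)" are all hypothetical choices made for the example. Each level is mined independently with an Apriori-style search, and the pruning of lower-level items whose ancestors are infrequent, which a full top-down method would normally apply, is omitted.

```python
from itertools import combinations

# Hypothetical two-level concept hierarchy: leaf item -> parent category.
HIERARCHY = {
    "skim milk": "milk",
    "whole milk": "milk",
    "white bread": "bread",
    "wheat bread": "bread",
}

# Hypothetical transactions recorded at the leaf (most specific) level.
TRANSACTIONS = [
    {"skim milk", "white bread"},
    {"whole milk", "wheat bread"},
    {"skim milk", "wheat bread"},
    {"whole milk"},
]


def generalize(transaction, level):
    """Encode a transaction at a hierarchy level (1 = category, 2 = leaf)."""
    if level == 1:
        return frozenset(HIERARCHY[item] for item in transaction)
    return frozenset(transaction)


def frequent_itemsets(transactions, min_support):
    """Apriori-style search; support is the fraction of transactions."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]
    result = {}
    k = 1
    while current:
        # One database scan per itemset size: count each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        frequent = {c: cnt / n for c, cnt in counts.items()
                    if cnt / n >= min_support}
        result.update(frequent)
        # Join frequent k-itemsets into (k+1)-candidates, then prune any
        # candidate that has an infrequent k-subset.
        k += 1
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == k}
        current = [c for c in candidates
                   if all(frozenset(s) in frequent
                          for s in combinations(c, k - 1))]
    return result


def multilevel_frequent_itemsets(transactions, min_support, delta, levels=2):
    """Mine each level top-down, reducing the threshold by delta per level."""
    out = {}
    for level in range(1, levels + 1):
        # Assumed reduction rule: multiply by delta once per level descended.
        threshold = min_support * delta ** (level - 1)
        encoded = [generalize(t, level) for t in transactions]
        out[level] = (threshold, frequent_itemsets(encoded, threshold))
    return out
```

Calling `multilevel_frequent_itemsets(TRANSACTIONS, min_support=0.5, delta=0.5)` returns, for each level, the threshold used and a mapping from each frequent itemset to its support. With these sample values the category level is mined at a threshold of 0.5 and the leaf level at 0.25, so item pairs that are too rare to pass the higher-level threshold can still surface at the lower level.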
There are different algorithms, which are used for finding the frequent itemsets. In this dissertation the emphasis is given on generation of multiple level association rules.