International Journal of Computer Trends and Technology (IJCTT) - Volume 4, Issue 5 – May 2013, ISSN: 2231-2803

A Survey Report on RFM Pattern Matching Using Efficient Multi-Set HPID3 Algorithm

Priyanka Rani, Nitin Mishra, Samidha Diwedi Sharma

Abstract— With increasing competition in the retail industry, the main focus of superstores is to classify valuable customers accurately and quickly among large volumes of data. This paper presents a survey report on the proposed HPID3 classification algorithm, which builds an innovative decision tree structure in order to predict future outcomes and proves better than the efficient classification techniques proposed in recent times. The proposed technique introduces the concept of Recency, Frequency and Monetary (RFM), which marketing investigators commonly use to develop marketing rules and strategies and to find important patterns. The conventional ID3 algorithm is modified by first horizontally splitting the sample RFM dataset and then applying mathematical calculations to construct the decision tree and predict future customer behaviour by pattern matching. The proposed algorithm significantly reduces the processing time, mean absolute error, relative absolute error, and the memory space required for processing the dataset and constructing the decision tree, and it works well for both small and large databases. The performance of the proposed algorithm is analysed on a standard RFM dataset and compared with the conventional ID3 algorithm using the Weka data mining tool, where the proposed technique gives better results.

Keywords— Data mining, ID3, HPID3, RFM, customer classification, decision tree

I. INTRODUCTION

Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data.
Exploiting large volumes of data for superior decision making by looking for interesting patterns has become a main task in today's business environment. Data classification is one of the most widely used technologies in data mining. Its main purpose is to build a classification model that maps each record in the database to a particular subclass. Classification is essential for organizing data and retrieving information correctly and rapidly. At present, the decision tree has become an important data mining method: a general function-approximation technique for data classification based on machine learning. Many methods exist for decision analysis, each with its own advantages and disadvantages. In machine learning, decision tree learning is one of the most popular techniques for making classification decisions in pattern recognition. The basic learning approach of a decision tree is a greedy algorithm that builds the tree recursively, top-down. As a classification method, the decision tree has the advantage of processing large amounts of data quickly and accurately. The ID3 algorithm is one of the most widely used mathematical algorithms for building decision trees. It was invented by J. Ross Quinlan and uses the information theory invented by Shannon. It builds the tree from the top down, with no backtracking. ID3 is an example of symbolic learning and rule induction. It is also a supervised learner, meaning it uses labelled examples such as a training dataset to make its decisions. The key feature of ID3 is choosing information gain as the criterion for testing attributes for classification and then expanding the branches of the decision tree recursively until the entire tree has been built. The main objective of this paper is to predict future customer behaviour from a customer purchasing R-F-M dataset using the HPID3 algorithm.
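The entropy and information-gain criterion at the heart of ID3 (formalized in Section III) can be sketched in a few lines. This is an illustrative sketch only; the toy customer records and attribute bands below are assumptions, not data from the paper:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Entropy reduction obtained by splitting the data on one attribute."""
    total = len(labels)
    remainder = 0.0
    for v in set(row[attr_index] for row in rows):
        subset = [lbl for row, lbl in zip(rows, labels)
                  if row[attr_index] == v]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder

# Toy records: (recency_band, frequency_band) -> will the customer buy again?
rows = [("low", "high"), ("low", "low"), ("high", "high"), ("high", "low")]
labels = ["yes", "yes", "yes", "no"]
print(information_gain(rows, labels, 0))  # gain of splitting on recency band
```

ID3 evaluates this gain for every candidate attribute and splits on the one with the largest value, then recurses on each branch.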
First, historical data is collected from which customer behaviours or customer purchasing patterns can be predicted, and the entire dataset is partitioned horizontally. Using the R-F-M attributes of the customer transaction dataset, classification rules are discovered by the HPID3 algorithm, which uses a divide-and-conquer strategy and prunes the decision tree to avoid overfitting. Entropy and information gain are calculated for each attribute in every partition. The information gains of the same attribute across the partitions are then added together to obtain the total information gain of that attribute over the whole training dataset. The attribute with the largest total information gain is selected as the root of the decision tree and its values become its branches. The information gain of the remaining attributes is then recalculated with respect to the selected attribute, and this process continues recursively until a conclusion is reached at a leaf node of the decision tree. In this way the proposed system minimizes the information content needed and the average depth of the generated decision tree.

II. RELATED WORK

In 2011 Wei Jianping did research on VIP customer classification rules based on the RFM model. The main objective of that paper is to identify the most important VIP customers among all customers, using data mining based on rough set theory. An RFM model is used in the classification of VIP customers, which achieves the simplification and extraction of rules. He concluded that the proposed work provides a new method for the fast determination of customer rank, which helps enterprises develop more feasible marketing programs for different customers [1]. In 2010 Hui-Chu Chang proposed an EL-RFM model (E-Learning RFM) to measure learners' learning motive force.
The model is an approximate RFM model whose aim is to measure students' learning motive forces so that the instructor can understand and analyse student behaviour. He concluded that the instructor can use the EL-RFM model to quantify learners' learning behaviour even when the learning environment is the Internet [2]. In 2010 Ya-Han Hu, Fan Wu and Tzu-Wei Yeh proposed a tree structure, named RFMP-tree, to compress and store an entire transactional database, and a new RFMP-growth algorithm to discover all RFM-patterns from the RFMP-tree. Experimental results showed that the proposed method not only significantly reduces the number of output patterns but also retains more meaningful results for users [3]. In 2011 Xiaojing Zhou, Zhuo Zhang and Yin Lu reviewed five kinds of customer segmentation (CS) methods in traditional market segmentation and analysed their efficiency and applicability: CS based on vital statistics, CS based on lifestyle, CS based on behaviour, CS based on the value of corporate profit, and CS based on profit [4]. In 2011 Chaohua Liu investigated the scope of customer value and put forward a specific definition of it. He analysed customer value according to current value, potential value and customer loyalty, and proposed a model that classifies customers into sixteen categories based on a Self-Organizing Map [5]. In 2010 Mei-Ping Xie and Wei-Ya Zhao proposed a decision tree model to analyse the main factors that affect customers' satisfaction degree, such as price, quality, marketing and promotion. The authors concluded that if a company wants to improve its customers' satisfaction degree, managers should try their best to improve the level of technology and the employees' professional skills, and innovate their products continuously to satisfy more customer needs [6].
In 2010 Mehdi Bizhani and Mohammad Jafar Tarokh defined a new RFM approach, based on the banking industry, to analyse transactional behaviour. It is used in both the preprocessing and segmentation phases; segmentation based on RFM helps achieve a better understanding of groups of customers. For segmentation, the ULVQ (unsupervised learning vector quantization) algorithm, together with thirteen clustering validation indices, is applied to find more exact and understandable results, and a framework is proposed to find exact and valuable segments. After calculating RFM scores for every POS, the results are analysed to find the interesting segments of shops, which helps the bank better plan its marketing and CRM activities; the results also give an alert about a considerable group of inactive POSs [7]. In 2010 Liu Yuxun and Xie Niuniu proposed a new decision tree algorithm based on attribute importance. The improved algorithm uses attribute importance to increase the information gain of attributes that have fewer values, and compares ID3 with the improved ID3 by an example [8]. In 2010 Liu Jiale and Du Huiying proposed a framework to analyse and calculate customer value as defined by the RFM model. By evaluating customer value, the authors concluded that identifying high-value customers is extremely important for airlines to make proper marketing strategies and realize precision marketing. Analysis of the experimental data shows that, compared with ID3, the improved ID3 algorithm has better classification accuracy and can produce more reasonable, more effective and more objective classification rules [9]. In 2010 the authors of [10] applied the ID3 algorithm to an intrusion detection system, where it produced false alarms and missed detections. To address this fault, an improved decision tree algorithm was proposed: the decision tree is created after the collected data has been classified correctly, and the resulting tree is not tall and has few branches.
Experimental results showed the effectiveness of the algorithm: the false alarm rate and miss rate decreased, the detection rate increased, and space consumption was reduced [10]. In 2011 Imas Sukaesih Sitanggang, Razali Yaakob, Norwati Mustapha and Ahmad Ainuddin B Nuruddin proposed a new spatial decision tree algorithm, based on ID3, for discrete features represented as points, lines and polygons. A new formula for spatial information gain is proposed, using spatial measures for point, line and polygon features. Empirical results demonstrated that the proposed algorithm can be used to join two spatial objects when constructing spatial decision trees on a small spatial dataset. The algorithm was applied to a real spatial dataset consisting of point and polygon features; the result is a spatial decision tree with 138 leaves and an accuracy of 74.72% [11]. In 2011 Jun-Hui Liu and Na Li proposed an optimized ID3 algorithm based on attribute importance and a convex function. Compared with the traditional ID3 algorithm, the optimized one has a higher average classification accuracy and fewer decision leaves, which reduces its complexity; the larger the dataset, the more obvious its efficiency and performance advantages [12]. In 2011 Xingwen Liu, Dianhong Wang, Liangxiao Jiang, Fenxiong Chen and Shengfeng Gan used an improved information gain, based on the dependency degree of condition attributes on the decision attribute, as a heuristic for selecting the optimal splitting attribute, in order to overcome the above-stated shortcoming of the traditional ID3 algorithm. Experiments showed that the tree size and classification accuracy of the decision trees generated by the improved algorithm are superior to those of the ID3 algorithm [13].
In 2011 Liu Zhongtao and Wang Hong reduced the decision tree's bias towards attributes with more values by optimizing the tree through a twice-computed information gain, improving the information-gain calculation of ID3 to reflect the interests of users. Their experiment showed that, compared with the traditional method, the optimized ID3 provides higher accuracy and counting speed; in addition, the resulting decision tree has a lower average number of leaves [14]. In 2010 I-Cheng, King-Jang Yang and Tao-Ming Ting used the Bernoulli sequence from probability theory to derive formulas that estimate the probability that a customer will buy at the next time, and the expected total number of times the customer will buy in the future. They introduced a comprehensive methodology to discover the knowledge needed for selecting direct-marketing targets from a database, leading to more efficient and accurate selection procedures than the existing ones. In the empirical part they examined a case study, a blood transfusion service, to show that the methodology has greater predictive accuracy than traditional RFM approaches [15]. In the RFM model, recency (R) is generally defined as the interval from the time of the latest consumption to the present, frequency (F) is the number of consumptions within a certain period, and monetary (M) is the amount of money spent within a certain period. An earlier study showed that customers with better R and F values are more likely to make a new trade with the enterprise; moreover, the bigger M is, the more likely the corresponding customers are to respond to the enterprise's products and services again [19].
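As an illustrative sketch only (the transaction fields and the reference date below are assumptions, not data from any of the surveyed papers), the three RFM measures just defined can be computed from a customer's transaction history like this:

```python
from datetime import date

# Hypothetical transaction log: (customer_id, purchase_date, amount).
transactions = [
    ("C1", date(2013, 4, 20), 35.0),
    ("C1", date(2013, 2, 11), 80.0),
    ("C2", date(2012, 12, 3), 15.0),
    ("C1", date(2013, 5, 1), 20.0),
]

def rfm(customer_id, txns, today):
    """Recency = days since the last purchase, Frequency = number of
    purchases, Monetary = total amount spent, for one customer."""
    own = [(d, amount) for cid, d, amount in txns if cid == customer_id]
    recency = min((today - d).days for d, _ in own)
    frequency = len(own)
    monetary = sum(amount for _, amount in own)
    return recency, frequency, monetary

print(rfm("C1", transactions, date(2013, 5, 10)))  # -> (9, 3, 135.0)
```

Note that a small recency value is desirable, while large frequency and monetary values are desirable; scoring schemes such as the quintile scheme of [17] normalize all three onto a common 1-5 scale before segmenting.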
While the three parameters are considered equally important in [17], they are unequally weighted in [18] due to the characteristics of the industry. In [17], each of the R, F and M dimensions is divided into five equal parts and customers are clustered into 125 groups according to their R, F and M values; consequently, the high-potential groups (or customers) can be easily identified. In [18], the RFM model is utilized in profitability evaluation and a weighted evaluation function is proposed. Decision trees have also been applied to spatial data. The algorithm in [21] considers not only the attributes of the object to be classified; however, it does not distinguish between thematic layers and takes into account only one spatial relationship. Building decision trees from spatial data was also proposed in [22]. The approach for spatial classification used in [22] is based on both (1) non-spatial properties of the classified objects and (2) attributes, predicates and functions describing spatial relations between the classified objects and other features located in their spatial proximity. Reference [23] discusses another spatial decision tree algorithm, namely SCART (Spatial Classification and Regression Trees), an extension of the CART method.
CART (Classification and Regression Trees), proposed by Breiman et al. in 1984, is one of the most commonly used systems for inducing decision trees for classification. SCART considers geographical data organized in thematic layers, together with their spatial relationships.

III. EXISTING ID3 ALGORITHM

The ID3 algorithm is one of the most widely used mathematical algorithms for building decision trees. It was invented by J. Ross Quinlan and uses the information theory invented by Shannon. It builds the tree from the top down, with no backtracking. The key feature of ID3 is choosing information gain as the criterion for testing attributes for classification and then expanding the branches of the decision tree recursively until the entire tree has been built.

Input: R, a set of attributes; C, the class attribute; S, a data set of tuples.
Output: the decision tree.
Procedure:
1. If R is empty then
2. Return a leaf having the most frequent class value in data set S.
3. Else if all tuples in S have the same class value then
4. Return a leaf with that specific class value.
5. Else
6. Determine the attribute A with the highest information gain in S.
7. Partition S into m parts S(a1), ..., S(am), where a1, ..., am are the distinct values of A.
8. Return a tree with root A and m branches labelled a1, ..., am, such that branch i contains ID3(R − {A}, C, S(ai)).
9. End if

Basic Concepts

Definition 1 (Information): Assume there are two classes, P and N, and let dataset S contain p elements of class P and n elements of class N. The amount of information needed to decide whether an arbitrary instance in S belongs to P or N is defined as

I(p, n) = −(p / (p + n)) log2(p / (p + n)) − (n / (p + n)) log2(n / (p + n))

Definition 2 (Entropy): Suppose A is an attribute having v different values {a1, a2, ..., av}. Using A, S can be divided into v subsets {S1, S2, ..., Sv}. If Si contains pi examples of P and ni examples of N, the entropy, i.e. the expected information needed to classify objects in all subtrees Si, is

E(A) = Σ (i = 1 to v) ((pi + ni) / (p + n)) I(pi, ni)

Definition 3 (Gain): The encoding information that would be gained by branching on attribute A is

GAIN(A) = I(p, n) − E(A)

IV. RECENCY, FREQUENCY, MONETARY

RFM is the most frequently adopted segmentation technique. It contains three measures (recency, frequency and monetary) and is used to analyse the behaviour of a customer and then make predictions based on the behaviour recorded in the database. Segmentation divides markets into groups with similar customer needs and characteristics that are likely to exhibit similar purchasing behaviours. Owing to the effectiveness of RFM in marketing, several data mining techniques have been carried out with RFM, most commonly (1) clustering and (2) classification. Clustering based on RFM attributes provides information about customers' actual marketing levels; classification based on RFM attributes provides valuable information for managers to predict future customer behaviour.

Definitions:
Recency: commonly defined as the time period from the last activity to now. A customer with a high recency score is more likely to repeat an action.
Frequency: commonly defined as the number of activities made in a certain time period. A higher frequency score indicates greater customer loyalty: the customer has strong demand for the product and is more likely to purchase repeatedly.
Monetary: commonly defined as the total amount of money spent during a specified period of time. A higher monetary score indicates a more important customer.

ADVANTAGES OF RFM METHODOLOGY

1.
It not only determines the existence of a pattern but also checks whether it conforms to the recency and monetary constraints.
2. It measures when people buy, how often they buy and how much they buy.
3. It allows marketers to test marketing actions on smaller segments of customers, and to direct larger actions only towards those customer segments predicted to respond profitably.
4. Companies can easily classify which of their customers are important and valuable.
5. It can generate profit for the company in a short time.
6. It acts as a common tool for developing marketing strategies.
7. It facilitates choosing which customers to target with an offer.
8. It helps in identifying significant and valuable customers.
9. It is used to identify customers and analyse customer profitability.
10. Firms can benefit greatly from adopting RFM, through increased response rates, lowered order costs and greater profit.

V. PROPOSED SYSTEM

First, the RFM dataset is collected, from which customer behaviours or customer purchasing patterns can be predicted, and the entire dataset is partitioned horizontally. Using the R-F-M attributes of the customer transaction dataset, classification rules are discovered by the HPID3 algorithm, which uses a divide-and-conquer strategy and prunes the decision tree to avoid overfitting. Entropy and information gain are calculated for each attribute in every partition. The information gains of the same attribute across the partitions are then added together to obtain the total information gain of that attribute over the whole training dataset. The attribute with the largest total information gain is selected as the root of the decision tree, and its values become its branches. The information gain of the remaining attributes is then recalculated with respect to the selected attribute, and this process continues recursively until a conclusion is reached at a leaf node of the decision tree.
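A minimal sketch of this horizontally partitioned gain computation (not the authors' implementation; it only shows how per-partition gains are summed to choose the root, and omits pruning and recursion) might look like this, where the toy RFM-banded data is an assumption:

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def gain(rows, labels, attr):
    """Information gain of splitting one partition on attribute `attr`."""
    total = len(labels)
    remainder = 0.0
    for v in set(row[attr] for row in rows):
        subset = [lbl for row, lbl in zip(rows, labels) if row[attr] == v]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder

def best_attribute(partitions, attrs):
    """HPID3-style root selection: sum each attribute's gain over all
    horizontal partitions and pick the attribute with the largest total."""
    totals = {a: sum(gain(rows, labels, a) for rows, labels in partitions)
              for a in attrs}
    return max(totals, key=totals.get)

# Toy RFM-banded data split horizontally into two partitions; each row is
# (R_band, F_band, M_band) and the class label is "buys" or "lapsed".
p1 = ([("hi", "hi", "hi"), ("lo", "hi", "hi")], ["buys", "lapsed"])
p2 = ([("hi", "lo", "hi"), ("lo", "lo", "lo")], ["buys", "lapsed"])
print(best_attribute([p1, p2], attrs=[0, 1, 2]))  # -> 0 (the R attribute)
```

Because each partition is processed independently before the totals are combined, the per-partition calculations can be kept small in memory, which is the source of the space and time savings the paper claims.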
In this way the proposed system minimizes the information content needed and the average depth of the generated decision tree.

Step 1: Data Preparation for RFM Modelling. The first step is to collect the RFM dataset from which customer behaviours, or customer purchasing patterns, can be predicted. Using the RFM model we can select the customers who are more likely to buy; in this way, marketing efficiency can be improved and profit maximized.

Step 2: Horizontally Partitioning the Dataset. After pre-processing, the dataset is partitioned horizontally into multiple partitions.

Step 3: Develop the Decision Tree Using the HPID3 Algorithm. Using the R-F-M attributes of the dataset, classification rules are discovered by the HPID3 algorithm, which uses a divide-and-conquer strategy and prunes the decision tree to avoid overfitting. It calculates the overall entropy and the information gains of all attributes, and the attribute with the highest information gain is chosen to make the decision. Thus, at each node of the tree, the horizontal partition decision tree chooses the attribute that most effectively splits the training data into subsets, with the best cut point, according to entropy and information gain.

Step 4: Targeting Segment Selection. The next step is to perform profit analysis by selecting the segment with the higher profit or higher response rate. Targeting segments are selected so as to maximize profit.

Step 5: Fetching RFM Values. Using the RFM model, RFM values are fetched and the customers who are more likely to buy can be selected.

Step 6: Pattern Matching. After fetching the RFM values, frequent patterns are obtained by sequentially matching them against the dataset.

Figure 1 (Phases of the proposed system): Collect historical data → Data Preprocessing → Horizontally partitioning the dataset → Develop Decision Tree using HPID3 algorithm → Targeting Segment Selection → Fetch RFM values → Applying pattern matching → frequent pattern.

The Proposed Algorithm:
1. Horizontally partition the dataset into multiple partitions P1, P2, ..., Pn.
2. Each partition contains the set R of attributes A1, A2, ..., AR.
3. C, the class attribute, contains c class values C1, C2, ..., Cc.
4. For each partition Pi, where i = 1 to n, do:
If R is empty then
Return a leaf node with a class value
Else if all transactions in T(Pi) have the same class then
Return a leaf node with that class value
Else
Calculate the expected information needed to classify the given sample for each partition Pi individually.
Calculate the entropy of each attribute (A1, A2, ..., AR) of each partition Pi.
Calculate the information gain of each attribute (A1, A2, ..., AR) of each partition Pi.
End if
End for
5. Calculate the total information gain of each attribute over all partitions (TotalInformationGain()).
6. ABestAttribute ← MaxInformationGain().
7. Let V1, V2, ..., Vm be the values of ABestAttribute. ABestAttribute partitions P1, P2, ..., Pn into m partitions each: P1(V1), P1(V2), ..., P1(Vm); P2(V1), P2(V2), ..., P2(Vm); ...; Pn(V1), Pn(V2), ..., Pn(Vm).
8. Return the tree whose root is labelled ABestAttribute and has m edges labelled V1, V2, ..., Vm, such that for every i the edge Vi goes to the corresponding subtree.
9. End.

VI. CONCLUSIONS

In this paper the concept of Recency, Frequency and Monetary (RFM) is introduced, which is usually used by marketing investigators to develop marketing strategies and to find important patterns. The conventional ID3 algorithm is modified by horizontally splitting the sample customer purchasing RFM dataset, and classification rules are then discovered to predict future customer behaviour by pattern matching. Both algorithms have been tested on the blood transfusion service centre RFM dataset. The experimental results show that the proposed HPID3 algorithm is more effective than the conventional ID3 in terms of accuracy and processing speed.

REFERENCES
[1] Wei Jianping, "Research on Customer Classification Rule Extraction based on RFM Model and Rough Set", IEEE, 2010.
[2] Mei-Ping Xie and Wei-Ya Zhao, "The Analysis of Customers' Satisfaction Degree Based On Decision Tree Model", IEEE, 2010.
[3] Xiaojing Zhou, Zhuo Zhang and Yin Lu, "Review of Customer Segmentation Method in CRM", IEEE, 2011.
[4] Liu Yuxun and Xie Niuniu, "Improved ID3 Algorithm", IEEE, 2010.
[5] Wei Jianping, "Research on VIP Customer Classification Rule Based on RFM Model", IEEE, 2011.
[6] Xingwen Liu, Dianhong Wang, Liangxiao Jiang, Fenxiong Chen and Shengfeng Gan, "A Novel Method for Inducing ID3 Decision Trees Based on Variable Precision Rough Set", IEEE, 2011.
[7] Imas Sukaesih Sitanggang, Razali Yaakob, Norwati Mustapha and Ahmad Ainuddin B Nuruddin, "An Extended ID3 Decision Tree Algorithm for Spatial Data", IEEE, 2011.
[8] Hnin Wint Khaing, "Data Mining based Fragmentation and Prediction of Medical Data", IEEE, 2011.
[9] Derya Birant, "Data Mining Using RFM Analysis", Dokuz Eylul University, Turkey.
[10] Hui-Chu Chang, "Developing EL-RFM Model for Quantification Learner's Learning Behavior in Distance Learning", Department of Electrical Engineering, National Taiwan University, IEEE, 2010.
[11] Ya-Han Hu, Fan Wu and Tzu-Wei Yeh, "Considering RFM-Values of Frequent Patterns in Transactional Databases", IEEE, 2010.
[12] Chaohua Liu, "Customer Segmentation and Evaluation Based On RFM, Cross-selling and Customer Loyalty", IEEE, 2011.
[13] Liu Jiale and Du Huiying, "Study on Airline Customer Value Evaluation Based on RFM Model", International Conference on Computer Design and Applications (ICCDA 2010), 2010.
[14] Zahra Tabaei and Mohammad Fathian, "Developing W-RFM Model for Customer Value: An Electronic Retailing Case Study", Department of Industrial Engineering, Iran University of Science and Technology.
[15] Chen Jin, Luo De-lin and Mu Fen-xiang, "An Improved ID3 Decision Tree Algorithm", Proceedings of the 2009 4th International Conference on Computer Science & Education, 2009.
[16] "Research and Improvement on ID3 Algorithm in Intrusion Detection System", Sixth International Conference on Natural Computation (ICNC 2010), IEEE, 2010.
[17] J. Miglautsch, "Thoughts on RFM scoring", Journal of Database Marketing, Vol. 6, 2000.
[18] C.-Y. Tsai and C.-C. Chiu, "A purchase-based market segmentation methodology", Expert Systems with Applications, Vol. 27, 2004.
[19] J. Wu and Z. Lin, "Research on Customer Segmentation Model by Clustering", Proceedings of the 7th ACM International Conference on Electronic Commerce (ICEC), 2005.
[20] C.-H. Cheng and Y.-S. Chen, "Classifying the segmentation of customer value via RFM model and RS theory", Expert Systems with Applications, Vol. 36, Issue 3, Part 1, April 2009.
[21] K. Zeitouni and C. Nadjim, "Spatial Decision Tree – Application to Traffic Risk Analysis", ACS/IEEE International Conference, IEEE, 2001.
[22] K. Koperski, J. Han and N. Stefanovic, "An efficient two-step method for classification of spatial data", Symposium on Spatial Data Handling, 1998.
[23] N. Chelghoum, Z. Karine and B. Azedine, "A Decision Tree for Multi-Layered Spatial Data", Symposium on Geospatial Theory, Processing and Applications, Ottawa, 2002.
[24] J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2nd ed., San Diego, USA: Morgan Kaufmann, 2006.