International Journal of Computer Application (2250-1797), Volume 5, No. 5, August 2015

Particle Swarm Optimization Based Association Rule Mining

Swati Shrivastava#1, Vikram Rajput#2
#1 M.Tech (CSE) Scholar, Dept. of CSE, LNCT Bhopal (M.P.), swati1989shrivastava@gmail.com
#2 Assistant Professor, Dept. of CSE, LNCT Bhopal (M.P.), rajputvikram022@gmail.com

ABSTRACT: Association rule mining is one of the most widely used and simplest techniques for finding frequent itemsets in large datasets. However, generating frequent itemsets from a large dataset with classical association rule mining is not efficient. Efficiency can be improved by using the particle swarm optimization (PSO) algorithm. PSO is a population-based evolutionary heuristic search method used for solving different combinatorial problems. In general, the rules generated by association rule mining do not consider negative occurrences of attributes; by applying the PSO algorithm over these rules, the system can also predict rules that contain negative attributes.

Keywords: Evolutionary algorithm, PSO, GSA, Association rule mining, Data Mining

Corresponding Author: Swati Shrivastava

INTRODUCTION:

Data mining is the process of processing large volumes of data (usually stored in a database), searching for patterns and relationships within that data. Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data: it allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Continuous innovations in computer processing power, disk storage, and statistical software are dramatically increasing the accuracy of analysis while driving down its cost.
Data mining is sorting through data to identify patterns and establish relationships [1] and [3-6].

Data: Data are any facts, numbers, or text that can be processed by a computer. Today, organizations are accumulating vast and growing amounts of data in different formats and different databases.

Information: The patterns, associations, or relationships among all this data can provide information. For example, analysis of retail point-of-sale transaction data can yield information on which products are selling.

Knowledge: Information can be converted into knowledge about historical patterns and future trends. For example, summary information on retail supermarket sales can be analyzed in light of promotional efforts to provide knowledge of consumer buying behavior.

In general, association rule mining can be viewed as a two-step process:

1. Find all frequent itemsets: by definition, each of these itemsets occurs at least as frequently as a predetermined minimum support count, min_sup.
2. Generate strong association rules from the frequent itemsets: by definition, these rules must satisfy minimum support and minimum confidence [1].

Most existing association rule mining algorithms were designed with market basket analysis in mind. In addition, the minimum support of the mined rules must be specified in advance, which often leads to either too many or too few rules; this negatively impacts the performance of the overall system [3].

II. What is Association Rule?

Association rule mining (Apriori) produces rules of the form:

IF NFC and RIA THEN BI
IF QC and VLSI THEN IPR

where NFC, RIA, QC, VLSI, BI, and IPR are among eight subjects from which a student has to choose four.
The rules above show that if a student takes NFC and RIA, then the probability is high that he will also choose BI; similarly, if he chooses QC and VLSI, the probability is high that he will take IPR. There is no limit on the number of antecedents in a rule, but there is a constraint on the number of consequents:

Number of consequents = 1

This limit does no harm: if a user wants to see the confidence value of a rule that contains more than one consequent [6], the same information can be obtained by taking two rules from our system, which allows their intersection.

Apriori algorithm

General process: Association rule generation is usually split into two separate steps:

• First, minimum support is applied to find all frequent itemsets in a database.
• Second, these frequent itemsets and the minimum confidence constraint are used to form rules.

While the second step is straightforward, the first step needs more attention. Finding all frequent itemsets in a database is difficult, since it involves searching all possible itemsets (item combinations). The set of possible itemsets is the power set over I and has size 2^n − 1 (excluding the empty set, which is not a valid itemset). Exploiting the anti-monotone (downward-closure) property of support, efficient algorithms can nevertheless find all frequent itemsets.

Apriori Algorithm Pseudocode

Apriori(T, minsupport) {
  // T is the transaction database
  L1 = {frequent 1-itemsets};
  for (k = 2; L(k-1) != ∅; k++) {
    Ck = candidates generated from L(k-1)
         // i.e., the join L(k-1) × L(k-1), eliminating any candidate
         // with a (k-1)-size subset that is not frequent
    for each transaction t in T {
      increment the count of all candidates in Ck that are contained in t
    }
    Lk = candidates in Ck with support >= minsupport
  } // end for
  return ∪k Lk;
}

As is common in association rule mining, given a set of itemsets (for instance, sets of retail transactions, each listing the individual items purchased), the algorithm attempts to find frequent subsets.

III. Problem Statement

Frequent itemset mining plays an important role in several data mining fields, such as association rules, warehousing, correlations, clustering of high-dimensional biological data, and classification. Given a dataset d that contains k items, the number of itemsets that could be generated is 2^k − 1, excluding the empty set. In order to find the frequent itemsets, the support of each itemset must be computed by scanning every transaction in the dataset. A brute-force approach for doing this is computationally expensive due to the exponential number of itemsets whose support counts must be determined.

Let I be a set of items. A set X = {i1, . . . , ik} ⊆ I is called an itemset, or a k-itemset if it contains k items. A transaction over I is a couple T = (tid, J), where tid is the transaction identifier and J ⊆ I is an itemset. A transaction T = (tid, J) is said to support an itemset X ⊆ I if X ⊆ J. A transaction database D over I is a set of transactions over I.

The cover of an itemset X in D consists of the set of transaction identifiers of transactions in D that support X:

cover(X, D) := {tid | (tid, J) ∈ D, X ⊆ J}

The support of an itemset X in D is the number of transactions in the cover of X in D:

support(X, D) := |cover(X, D)|

The frequency of an itemset X in D is the probability of X occurring in a transaction T ∈ D:

frequency(X, D) := P(X) = support(X, D) / |D|

Note that |D| = support({}, D). We omit D whenever it is clear from the context.
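The definitions and pseudocode above can be sketched in Python. This is a minimal illustration, not the paper's implementation; `cover` and `support` follow the definitions directly, the level-wise loop follows the pseudocode, and the toy transaction database is hypothetical.

```python
from itertools import combinations

def cover(X, D):
    """Transaction ids of the transactions in D that support itemset X."""
    return {tid for tid, items in D.items() if X <= items}

def support(X, D):
    """Number of transactions in the cover of X: support(X, D) = |cover(X, D)|."""
    return len(cover(X, D))

def apriori(D, min_support):
    """Level-wise frequent-itemset mining, as in the pseudocode above."""
    items = set().union(*D.values())
    # L1: frequent 1-itemsets
    Lk = {frozenset([i]) for i in items
          if support(frozenset([i]), D) >= min_support}
    frequent = set(Lk)
    k = 2
    while Lk:
        # candidate generation: join L(k-1) with itself ...
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # ... and prune candidates with an infrequent (k-1)-subset
        Ck = {c for c in Ck
              if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        # count supports by scanning the transactions
        Lk = {c for c in Ck if support(c, D) >= min_support}
        frequent |= Lk
        k += 1
    return frequent

# Hypothetical toy transaction database: tid -> itemset
D = {1: {"NFC", "RIA", "BI"}, 2: {"NFC", "RIA", "BI"},
     3: {"QC", "VLSI", "IPR"}, 4: {"NFC", "RIA"}}
print(apriori(D, min_support=2))
```

With `min_support = 2`, the sketch returns seven frequent itemsets, including {NFC, RIA, BI}, since that combination occurs in two of the four toy transactions.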
An itemset is called frequent if its support is no less than a given absolute minimal support threshold σ_abs, with 0 ≤ σ_abs ≤ |D|. When working with frequencies of itemsets instead of their supports, we use a relative minimal frequency threshold σ_rel, with 0 ≤ σ_rel ≤ 1. Obviously, σ_abs = ⌈σ_rel · |D|⌉.

IV. Particle Swarm Optimization

In PSO, a population of particles moves through the search space by following the current optimum particles and changing positions in order to find the optima. When a particle takes part of the population as its topological neighbours, the best value is a local best. The particles tend to move to good areas in the search space as information spreads through the swarm. Each particle is moved to a new position computed from a velocity that is updated at each time step t. The swarm in PSO is initialized by assigning each particle a uniformly random position in the search space. Velocities are initialized randomly in the range [v_min, v_max].

V. Proposed Methodology

Swarm behavior can be seen in bird flocks, fish schools, and insects such as mosquitoes and midges. Many animal groups such as fish schools and bird flocks clearly display structural order, with the behavior of the organisms so integrated that, even though they may change shape and direction, they appear to move as a single coherent entity [11]. The main principles of this collective behavior are:

• Homogeneity: every bird in the flock has the same behavior model. The flock moves without a leader, even though temporary leaders seem to appear.
• Locality: the motion of each bird is influenced only by its nearest flock mates. Vision is considered the most important sense for flock organization.
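The conversion between the relative and absolute thresholds above is a one-line computation; the example values below are illustrative.

```python
import math

def absolute_min_support(sigma_rel, num_transactions):
    """sigma_abs = ceil(sigma_rel * |D|), per the definition above."""
    return math.ceil(sigma_rel * num_transactions)

# e.g. a 70% relative threshold over a 55-transaction database
print(absolute_min_support(0.70, 55))  # -> 39
```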
• Collision Avoidance: avoid collisions with nearby flock mates.
• Velocity Matching: attempt to match velocity with nearby flock mates.
• Flock Centering: attempt to stay close to nearby flock mates.

According to the definition of association rule mining, the intersection of the item sets of an association rule X → Y must be empty: items that appear in item set X do not appear in item set Y, and vice versa. Hence, both the front and back partition points must be given for the purpose of chromosome encoding. The item set before the front partition point is called item set X, while the part between the front and back partition points is called item set Y. The chromosome encoding approach in this study is string encoding. Each value represents a different item name: item 1 is encoded as "1", item 2 as "2", and so on. The value of each item is encoded into a string-type chromosome in the corresponding order.

1. Fitness Value Calculation

The fitness value is utilized to evaluate the importance of each particle and comes from the fitness function:

Fitness(k) = confidence(k) · log(support(k) · length(k) + 1)

2. Population Generation

3. Search for the Best Particle

Each particle is moved to a new position computed from the velocity updated at each time step t. The new position is the sum of the previous position and the new velocity:

x(t + 1) = x(t) + v(t + 1)

The velocity update is performed as:

v(t + 1) = ω v(t) + φ1 rand(0, 1) (p(t) − x(t)) + φ2 rand(0, 1) (g(t) − x(t))

The parameter ω is called the inertia weight and controls the magnitude of the old velocity v(t) in the calculation of the new velocity, whereas φ1 and φ2 determine the significance of the personal best p(t) and the group best g(t), respectively. Furthermore, the velocity v_i at any time step of the algorithm is constrained by the parameter v_max.
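The fitness function and the per-particle updates above can be sketched as follows. The parameter defaults (ω, φ1, φ2, v_max) are illustrative assumptions, not values taken from the paper.

```python
import math
import random

def fitness(confidence, support, length):
    """Fitness(k) = confidence(k) * log(support(k) * length(k) + 1)."""
    return confidence * math.log(support * length + 1)

def update_particle(x, v, p_best, g_best,
                    w=0.7, phi1=1.5, phi2=1.5, v_max=4.0):
    """One PSO step for a single particle; each argument is a per-dimension list."""
    new_x, new_v = [], []
    for xi, vi, pi, gi in zip(x, v, p_best, g_best):
        # v(t+1) = w*v(t) + phi1*rand(0,1)*(p(t)-x(t)) + phi2*rand(0,1)*(g(t)-x(t))
        vel = (w * vi
               + phi1 * random.random() * (pi - xi)
               + phi2 * random.random() * (gi - xi))
        # constrain the velocity by v_max
        vel = max(-v_max, min(v_max, vel))
        new_v.append(vel)
        # x(t+1) = x(t) + v(t+1)
        new_x.append(xi + vel)
    return new_x, new_v
```

Note that the position update uses the freshly computed velocity, matching the equations above.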
The swarm in PSO is initialized by assigning each particle a uniformly random position in the search space. Velocities are initialized randomly in the range [v_min, v_max].

The main steps of the procedure are:

1: Initialize the population
2: repeat
3:   Calculate the fitness values of the particles
4:   Modify the best particles in the swarm
5:   Choose the best particle
6:   Calculate the velocities of the particles
7:   Update the particle positions
8: until the requirements are met

Particles' velocities on each dimension are clamped to a maximum velocity v_max, which is a parameter specified by the user. If the sum of accelerations would cause the velocity on a dimension to exceed v_max, then the velocity on that dimension is limited to v_max.

Figure 1: Flow diagram of the proposed methodology

VI. Result Analysis

The experiments use the standard dataset of an online American lottery system, in which the transaction ID is the index number and the lottery numbers are the item IDs. The file "lottery.txt" has a small number of entries (around 55): on the left-hand side are the transaction IDs, and on the right-hand side the number of values varies from four to six. The file "Testdata.txt" has 200 transaction IDs on the left-hand side, with entries varying from five to nine on the right-hand side. The third dataset is the "karate" dataset, which has thirty-four transaction IDs on the left-hand side and values varying from one to thirty-four on the right-hand side. The mining criteria parameters are minimum support, minimum confidence, and the size of the file to be mined. The comparison parameters for the mining results are mining time, time per rule, and number of rules generated. Support and confidence are measured and calculated by equations (1) and (2):

support(X, D) := |cover(X, D)|
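The eight steps above can be sketched as a generic PSO loop. This is a minimal, generic sketch: the objective function here is a hypothetical stand-in for the paper's rule-fitness evaluation, and all parameter values are assumptions.

```python
import random

def pso(objective, dim, n_particles=20, iters=100,
        w=0.7, c1=1.5, c2=1.5, lo=-5.0, hi=5.0, v_max=1.0):
    """Minimal PSO maximizing `objective` over [lo, hi]^dim (steps 1-8 above)."""
    # 1: initialize positions and velocities randomly
    X = [[random.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    V = [[random.uniform(-v_max, v_max) for _ in range(dim)] for _ in range(n_particles)]
    P = [x[:] for x in X]                  # personal bests
    pf = [objective(x) for x in X]         # 3: fitness values
    for _ in range(iters):                 # 2/8: repeat until requirements are met
        for i in range(n_particles):
            f = objective(X[i])            # 3: calculate fitness
            if f > pf[i]:                  # 4: modify the best particles
                pf[i], P[i] = f, X[i][:]
        # 5: choose the best particle in the swarm
        g = P[max(range(n_particles), key=lambda i: pf[i])][:]
        for i in range(n_particles):
            for d in range(dim):
                # 6: velocity update, clamped to [-v_max, v_max]
                v = (w * V[i][d]
                     + c1 * random.random() * (P[i][d] - X[i][d])
                     + c2 * random.random() * (g[d] - X[i][d]))
                V[i][d] = max(-v_max, min(v_max, v))
                X[i][d] += V[i][d]         # 7: update the particle positions
    return P[max(range(n_particles), key=lambda i: pf[i])]

# e.g. maximize -(x^2 + y^2); the best particle should approach the origin
best = pso(lambda x: -sum(xi * xi for xi in x), dim=2)
```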
(1)

The frequency of an itemset X in D is the probability of X occurring in a transaction T ∈ D. The confidence of a rule X → Y is:

Confidence(X → Y, D) := P(Y | X) = support(X ∪ Y, D) / support(X, D)    (2)

Table 1: Results with confidence 70 percent for the Test dataset

S.No.  Support  Mining Time  Time per entry  No. of rules
1      20       1.1679       0.005753        5
2      30       0.5979       0.002945        8
3      40       0.585168     0.002882        8

Table 2: Results with confidence 40 percent for the Lottery dataset

S.No.  Support  Mining Time  Time per entry  No. of rules
1      10       2.166        0.040331        4
2      15       2.08166      0.040331        No rules generated
3      20       2.0879       0.04991         No rules generated

Table 3: Results with confidence 70 percent for the Student dataset

S.No.  Support  Mining Time  Time per entry  No. of rules
1      10       2.08166      NA              5
2      20       2.79         NA              No rules generated
3      40       2.89         NA              No rules generated

Proposed methodology based result analysis

The Karate dataset with 70 percent confidence, imitated by the personal best of each element in the swarm with different fitness values:

Table 4: Results with confidence 70 percent for the Karate dataset

S.No.  Support  Mining Time  Time per entry  No. of rules
1      10       0.02366      1.040331        3
2      15       0.041775     1.089331        5
3      20       0.014307     0.04991         4

Refer to serial No. 1, with the individual support and confidence of each transaction itemset:

items33 -> items34 (29.4118%, 83.3333%)
items34 -> items33 (29.4118%, 58.8235%)
items2  -> items1  (20.5882%, 77.7778%)

Refer to serial No. 2, with the individual support and confidence of each transaction itemset:

items33 -> items34 (29.4118%, 83.3333%)
items34 -> items33 (29.4118%, 58.8235%)
items1  -> items2  (20.5882%, 43.75%)
items2  -> items1  (20.5882%, 77.7778%)
items3  -> items34 (17.6471%, 60%)

Refer to serial No. 3, with the individual support and confidence of each transaction itemset:

Rule (Support, Confidence)
items33 -> items34 (29.4118%, 83.3333%)
items34 -> items33 (29.4118%, 58.8235%)
items1  -> items2  (20.5882%, 43.75%)
items2  -> items1  (20.5882%, 77.7778%)

Comparison of the existing and proposed methodologies

Here the existing method and the proposed method are compared on the number of rules generated for the associativity of items. Table 5 shows that the existing method produces more rules than the proposed methodology; consequently, as shown in Table 6, the mining time is also lower for the proposed methodology.

Table 5: Comparison of the number of rules generated in mining

S.No.  Support  Existing methods  PSO (proposed) method
1      Sup=10   5                 3
2      Sup=15   8                 5
3      Sup=20   8                 4

Table 6: Comparison of mining time

S.No.  Support  Existing methods  PSO (proposed) method
1      Sup=10   1.1679            0.02366
2      Sup=15   0.5979            0.041775
3      Sup=20   0.585168          0.014307

Figure 2: Support vs. number of rules generated during mining

Figure 3: Support vs. mining time for rule generation

Figures 2 and 3 show the graphical comparison of the two methods. The first graph shows the number of generated rules. The number of rules is smaller for the proposed method because of the optimization property of PSO, so the processing time during rule mining from the given dataset is also smaller.
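The per-rule support and confidence percentages reported above (e.g. items33 -> items34 (29.4118%, 83.3333%)) can be computed directly from the definitions of support and confidence. The sketch below is an illustration with a hypothetical four-transaction database, not the karate data itself.

```python
def support_count(X, transactions):
    """Number of transactions containing every item in itemset X."""
    return sum(1 for t in transactions if X <= t)

def rule_metrics(antecedent, consequent, transactions):
    """Support and confidence of the rule X -> Y, as percentages.

    support(X -> Y)    = support(X u Y) / |D|
    confidence(X -> Y) = support(X u Y) / support(X)
    """
    n = len(transactions)
    both = support_count(antecedent | consequent, transactions)
    ante = support_count(antecedent, transactions)
    support_pct = 100.0 * both / n
    confidence_pct = 100.0 * both / ante if ante else 0.0
    return support_pct, confidence_pct

# Hypothetical toy data in the same spirit as the karate-itemset listings
T = [{"items33", "items34"}, {"items33", "items34"},
     {"items33"}, {"items34"}]
print(rule_metrics({"items33"}, {"items34"}, T))  # support 50.0%, confidence ~66.7%
```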
VII. Conclusion

If the application supports multiple minimum supports, the above work extends easily and a comparative study can be carried out. Data mining is the exploration of knowledge from the large sets of data generated as a result of various data processing activities, and frequent pattern mining is a very important task within it. The proposed method works well with static datasets, using the support count together with the particle swarm optimization concept. Mining streams, by contrast, requires fast, real-time processing in order to keep up with the high data arrival rate, with mining results expected to be available within a short response time.

VIII. References

[1] S. Nicholson, "The Bibliomining Process: Data Warehousing and Data Mining for Library Decision-Making", Information Technology and Libraries, 22(4):146-151, 2003.
[2] Jing Yang, Hao Wang, Sue-Gang Hu, Zhong-Hui Hu, "A New Classification Algorithm Based on Rough Set and Entropy", Proceedings of the Second International Conference on Machine Learning and Cybernetics, IEEE, 2003.
[3] De-Sheng Yin, Guo-Yin Wang, W. Wu, "A Self-Learning Algorithm for Decision Tree Pre-Pruning", Proceedings of the Third International Conference on Machine Learning and Cybernetics, IEEE, 2004.
[4] B. R. Yang, Knowledge Discovery Based on Inner Mechanism: Construction, Realization and Application, Elliott & Fitzpatrick Inc., USA, 2004.
[5] C. Giannella, J. Han, J. Pei, X. Yan, and P. S. Yu, "Mining frequent patterns in data streams at multiple time granularities", in Data Mining: Next Generation Challenges and Future Directions, AAAI/MIT Press, 2004.
[6] M. M. Gaber, A. Zaslavsky, and S. Krishnaswamy, "Mining data streams: a review", ACM SIGMOD Record, 34(1), 2005.
[7] Jiann-Cherng Shieh, Yung-Shun Lin, "Bibliomining User Behaviors in the Library", Journal of Educational Media & Library Sciences, 44(1):36-60, 2006.
[8] N. Jiang and L. Gruenwald, "Research issues in data stream association rule mining", ACM SIGMOD Record, 35(1):14-19, 2006.
[9] Gregory Buehrer, Srinivasan Parthasarathy, and Amol Ghoting, "Out-of-core frequent pattern mining on a commodity PC", in KDD '06: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 2006, pp. 86-95.
[10] Nele Dexters, Paul W. Purdom, and Dirk Van Gucht, "A probability analysis for candidate-based frequent itemset algorithms", in SAC '06: Proceedings of the 2006 ACM Symposium on Applied Computing, New York, NY, USA, ACM, 2006, pp. 541-545.
[11] C. C. Aggarwal, "On density based transforms for uncertain data mining", in Proceedings of the 23rd IEEE International Conference on Data Engineering (ICDE), Istanbul, Turkey, 2007, pp. 866-875.
[12] C. C. Aggarwal, Data Streams: Models and Algorithms, Springer, 2007.
[13] C. K. Chui, B. Kao, and E. Hung, "Mining frequent itemsets from uncertain data", in 11th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining (PAKDD 2007), Nanjing, China, 2007, pp. 47-58.
[14] N. Dalvi and D. Suciu, "Efficient query evaluation on probabilistic databases", The VLDB Journal, 16(4):523-544, 2007.
[15] Ehsan Valian, Shahram Mohanna and Saeed Tavakoli, "Improved Cuckoo Search Algorithm for Feedforward Neural Network Training", International Journal of Artificial Intelligence & Applications (IJAIA), 2(3), pp. 36-43, July 2011.
[16] S. Walton, O. Hassan, K. Morgan and M. R. Brown, "Modified Cuckoo Search: a new gradient free optimisation algorithm", Chaos, Solitons & Fractals, doi:10.1016/j.chaos.2011.06.004, 30 June 2011.
[17] D. Rodrigues, L. A. M. Pereira, A. N. Souza, C. C. Ramos, Xin-She Yang, "Binary Cuckoo Search: A Binary Cuckoo Search Algorithm for Feature Selection", IEEE International Symposium on Circuits and Systems (ISCAS), pp. 465-468, 2013.
[18] C. K.-S. Leung, C. L. Carmichael, and B. Hao, "Efficient mining of frequent patterns from uncertain data", in ICDMW '07: Proceedings of the Seventh IEEE International Conference on Data Mining Workshops, 2007, pp. 489-494.
[19] Weiming Hu, Ou Wu, Zhouyao Chen, Zhouyu Fu, S. Maybank, "Recognition of Pornographic Web Pages by Classifying Texts and Images", IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007.
[20] Jun Fang, Lei Guo, XiaoDong Wang, Ning Yang, "Ontology-Based Automatic Classification and Ranking for Web Documents", Fuzzy Systems and Knowledge Discovery (FSKD 2007), 2007.