Survey

Survey

Document related concepts

Transcript

DATE OF ACCEPTANCE: JUNE 21, 2014 DATE OF PUBLICATION: JUNE 26, 2014 ISSN: 2348-4098 VOLUME 02 ISSUE 05 JUNE- 2014 IMPROVINGTHEEFFICIENCYOFAPRIORIALGORITHMIN DATAMINING GURNEETKAUR Scholar,DepartmentofComputerScienceandApplications KurukshetraUniversity,Kurukshetra Email:[email protected] ABSTRACT Apriori algorithm is the classic algorithm of association rules, which finds the entire frequent item, sets. When this algorithm is applied to dense data the algorithm performancedeclinesduetothelargenumberoflongpatternsemerge.Inordertofind more valuable rules, this paper proposes an improved algorithm of association rules, the classical Apriori algorithm. Finally, the improved algorithm is verified, the results showthattheimprovedalgorithmisreasonableandeffective,canextractmorevaluable information. KEYWORDS‐Associationrulemining,Apriori,Candidateitemsets,DataMining 1. INTRODUCTION relational database, data warehouses Dataminingisoneofthemostdynamic emergingresearchesintoday’sdatabase technology.Toextractthevaluabledata from large databases, it is necessary to explore the databases efficiently. It is theanalysisstepoftheKDD(Knowledge Discovery and Data Mining) process. It is defined as the process of extracting interesting previously (non‐trivial, unknown implicit, and useful) information or patterns from large information repositories such as: etc.Thegoalofthedataminingprocess istoextractinformationfromadataset andtransformitintoanunderstandable structure for further use. The problem of mining association rules from transactional database was introduced in[9].Theconceptaimstofindfrequent patterns, interesting correlations, and associations among sets of items in the transaction databases or other data repositories.Associationrulesarebeing used widely in various areas such as telecommunication INTERNATIONAL JOURNAL OF SCIENCE, ENGINEERING AND TECHNOLOGY- www.ijset.in networks, drug 315 DATE OF ACCEPTANCE: JUNE 21, 2014 DATE OF PUBLICATION: JUNE 26, 2014 ISSN: 2348-4098 VOLUME 02 ISSUE 05 JUNE- 2014 analysis, risk and market management, Support(XY)/Total inventorycontroletc.[2] transactions. Associationrulearethestatementsthat confidence parts “Antecedent” and “Consequent”. Lift: is the ratio of the probability that X and Y occur together to the multiple of the two individual which is found in the database, and probabilities for X and Y. It is given consequent is the item that is found in antecedent[12]. = Support(XUY)/Support(X). consequent. Antecedent is that item combination with the first i.e. the Confidence:isdefinedaspercentage X that also contain Y. It is given as: any database. Association rule has two breadistheantecedentandbutteristhe of oftransactionindatabasecontaining find the relationship between data in For example {bread} => {butter}. Here number as:lift=Pr(X,Y)/Pr(X).Pr(Y). Conviction: Unlike lift, it measures the effect of the right‐hand‐side not Formal definition [3]: Let I = {i1,i2, …, beingtrue.Itinvertstheratioandis in} be a set of items. Let D be a set of given as : conviction = Pr(X).Pr(not task relevant data transactions where Y)/Pr(X,Y).[11] eachtransactionTisasetofitemssuch that T ⊆ I. A unique TID is associated 2. APRIORIALGORITHM with each transaction. Let A be a set of items.AtransactionTissaidtocontain Aprioriisthemostclassicalandfamous A if and only if A ⊆ T. An association algorithm for mining frequent patterns. rule is implication of the form A ⇒ B, Apriorialgorithmwasintroducedby[9]. whereA⊂I,B⊂I,andA∩B=null. The algorithm works on categorical attributes and employs bottom up The ARM technique is based on several threshold values which are explained below: Support: strategy. It is based on the Apriori property which is useful for trimming irrelevantdata.Itstatesthatanysubset is defined as the offrequentitem‐setsmustbefrequent. percentage/fraction of records that 2.1 DESCRIPTION OF THE TYPICAL contain XUY to the total number of APRIORIALGORITHM records in the database. It is given as: Support(XUY)= INTERNATIONAL JOURNAL OF SCIENCE, ENGINEERING AND TECHNOLOGY- www.ijset.in 316 DATE OF ACCEPTANCE: JUNE 21, 2014 DATE OF PUBLICATION: JUNE 26, 2014 ISSN: 2348-4098 VOLUME 02 ISSUE 05 JUNE- 2014 Apriori employs an iterative approach, sets are members of Ck. A scan of the where k‐item sets are used to explore database is required to determine the (k+1)‐itemsets.First,thesetoffrequent countofeachcandidateinCk.Thescan 1‐itemsets is found by scanning the of the database result in the formation database to count the occurrences for of Lk i.e. it contains all candidates each item, and then collecting those having a count no less than the items that satisfy minimum support minimumsupportcountarefrequentby defined. The result set obtained is definition, and therefore belong to Lk. denoted as L1. Now, L1 is used to find The Apriori property is used to reduce thesetoffrequent2‐itemsets,L2andso thesizeofCk,asfollows.Any(K‐1)‐item on, until no more frequent k‐item sets set that is not frequent cannot be a can be found. Thus the finding of each subsetofafrequentk‐itemset.Hence,if frequent k‐item set requires one full any(K‐1)‐subsetofthecandidatek‐item scan of the database. To improve the set is not in Lk‐1, then the candidate efficiency of frequent item sets level‐ item cannot be frequent either and so wise generation, an important property canbedeletedfromCk[5]. called the Apriori property, is used. It reduces the search space. Apriori property states that “All nonempty 3. RELATEDWORK subsetsofafrequentitemsetmustalso Rehab H. Alwa and Anasuya V. Patil be frequent”. Thus it is divided into a (2013) [1] described a novel approach two‐step process which is used to find to improve the Apriori algorithm the frequent item sets: join and prune throughthecreationofMatrix‐File.For actions. thispurposeMATLABisused,wherethe A)THEJOINSTEP database transactions are saved. Thus repeated scanning is avoided and A set of candidate k‐item set is particularrows&columnsareextracted generated by joining Lk‐1 with itself. and perform a function on that rather ThissetofcandidatesisdenotedasCk. than scanning entire database. The B)THEPRUNESTEP results are visualize using graphical form display and then interpreted. The The members of Ck may or may not be novel approach showed a very good frequent, but all of the frequent k‐item result in comparison to the traditional INTERNATIONAL JOURNAL OF SCIENCE, ENGINEERING AND TECHNOLOGY- www.ijset.in 317 DATE OF ACCEPTANCE: JUNE 21, 2014 DATE OF PUBLICATION: JUNE 26, 2014 ISSN: 2348-4098 VOLUME 02 ISSUE 05 JUNE- 2014 Apriori algorithm because there is a that is not frequent. The improved pruning process to those columns algorithm not only optimizes the whoseitemcountislessthanthevalue algorithm by reducing the size of the of minimum support. Hence the size of candidate set of k‐item‐sets, Ck, but it the Matrix reduces which saves a lot of also reduces the amount of I / O time, and an improvement in the speed spending by cutting down transaction is noticed by reducing the redundant records scanningofthedatabase. performance of Apriori algorithm is Jaio Yabing (2013) [4] proposed an improvedalgorithmofassociationrules, the classical Apriori algorithm. The in the database. The optimized so that we can mine association information from massive datafasterandbetter. improvedalgorithmisimplementedand Jose L. Balcazar (2013) [6] proposed a the results show that the improved measure to the confidence boost of a algorithm is reasonable and effective, rule. Acting as a balance to confidence can extract more value information. andsupport,ithelpstoobtainsmalland Although this process can decline the crispsetsofminedassociationrulesand numberofcandidateitemsetsinCkand solves the well‐known problem such as reduce time cost of data mining, the rules of negative correlation may pass price of it is pruning frequent item set, the confidence bound. He analyzed the whichcouldcostcertaintime.Fordense propertiesoftwoversionsofthenotion databases, this ofconfidenceboost,onebeinganatural algorithm is higher than Apriori generalization of the other. They algorithm. developed algorithms to filter rules the efficiency of JaishreeSinghetal(2013)[5]proposed an Improved Apriori algorithm which reduces the scanning time by cutting down not required transaction records as well as reduce the redundant generation of sub‐items during pruning thecandidateitem‐sets,whichcanform directly the set of frequent item‐sets andeliminatecandidatehavingasubset accordingtotheirconfidenceboost,and compared the given concept to some similar notions in the literature, and described the results of experimentation employing the new notions on standard benchmark datasets. He also described an open source association mining tool that embodies one of our variants of INTERNATIONAL JOURNAL OF SCIENCE, ENGINEERING AND TECHNOLOGY- www.ijset.in 318 DATE OF ACCEPTANCE: JUNE 21, 2014 DATE OF PUBLICATION: JUNE 26, 2014 ISSN: 2348-4098 VOLUME 02 ISSUE 05 JUNE- 2014 confidenceboostinsuchawaythatthe mostly improved Apriori algorithms data mining process does not require aimstogeneratelesscandidatesetsand the user to select any value for any yet get all frequent items. They parameter. concludedthatmanyimprovementsare Mohammed Al‐ Maolegi and Bassam Arkok (2013) [8] indicated the needed basically on pruning in Apriori toimproveefficiencyofalgorithm. limitation of the original Apriori Apriori Algorithm is the classical algorithm of wasting time for scanning algorithm for association rule mining. the whole database searching on the From the above reviews it has been frequent item‐sets, and presented an found that Apriori Algorithm is simple improvement on Apriori by reducing and easy to implement but still it has thatwastedtimedependingonscanning several drawbacks such as it requires onlysometransactions.Theyperformed multiplescanofthedatabase.Moreover experiments with several groups of for candidate generation process it transactions, and with severalvalues of takesmorememoryspaceandtime.The minimum support applied on the rules generated consist of items which original Apriori and the improved areirrelevant.Incaseoflargedatabases Apriori,andtheresultsshowedthatthe redundant improved Apriori reduces the time Moreover the Apriori Algorithm works consumed by 67.38% in comparison onlyonasinglevalueofconfidenceand withtheoriginalApriori,andmakesthe supportthroughoutthealgorithm.Thus improved Apriori algorithm more it is not suitable for dense databases. efficientandlesstimeconsuming. [13] RinaRavaletal(2013)[10]performeda survey on few good improved approachesofApriorialgorithmsuchas rules are generated. 4. PROPOSEDALGORITHM A. THEORYOFALGORITHM RecordFilterandIntersectionapproach, Classical Apriori algorithm generates Improvement size large number of candidate sets if frequency and trade list, Improvement database is large. And due to large by reducing candidate set and memory numberofrecordsindatabaseresultsin utilization, Improvement based on much more I/O cost. In this project, we frequency of items etc. They found that proposed an optimized method for based on set INTERNATIONAL JOURNAL OF SCIENCE, ENGINEERING AND TECHNOLOGY- www.ijset.in 319 DATE OF ACCEPTANCE: JUNE 21, 2014 DATE OF PUBLICATION: JUNE 26, 2014 ISSN: 2348-4098 VOLUME 02 ISSUE 05 JUNE- 2014 Apriori algorithm which reduces the sizeofdatabasealongwithreducingthe number of candidate item sets generated.Inourproposedmethod,we introduced an attribute named SizeOfTransaction (SOT), containing number of items in individual transaction in database. The deletion process of transaction in database will made according to the value of K. Depending on the value of K, algorithm 1)L1=find_frequent_1‐itemsets(D); 2)For(k=2;Lk‐1≠∅;k++){ 3)Prune1(Lk‐1) 4)Ck=apriori_gen(Lk‐1,min_sup); 5)foreachtransactiont∈D{ 6)Ct=subset(Ck,t); 7)foreachcandidatec∈Ct 8)c.count++; searches the same value for SOT in database. If value of SOT matches with 9)} value of K then those transactions are 10)Lk={c∈Ck|c.count≥min_sup}; deletedfromthedatabase.Forreducing the size of the candidate item sets 11)if(k>=2){ formed before the candidate item sets 12)delete_datavalue(D,Lk,Lk‐1); Ck are generated, further prune Lk‐1, count the timesofallitemsoccurredin Lk‐1, delete item sets with this number less than k‐1 in Lk‐1. In this way, the number of connecting items sets and the size of the database to be scanned willbedecreased,sothatthenumberof candidateitemswilldecline. B. PSEUDOCODEOFTHE ALGORITHM Input:D:Databaseoftransactions; min_sup:minimumsupportthreshold Output:L:frequentitemsetsinD 13)delete_datarow(D,Lk);} 14)} 15)returnL=UkLk; Procedureapriori_gen(Lk‐ 1:frequent(k‐1)‐itemsets) 1)foreachitemsetl1∈Lk‐1{ 2)foreachitemsetl2∈Lk‐1{ 3)if(l1[1]=l2[1])∧(l1[2]=l2[2]) ∧…∧(l1[k‐2]=l2[k‐2])∧(l1[k‐1]<l2 [k‐1])then{ Method: INTERNATIONAL JOURNAL OF SCIENCE, ENGINEERING AND TECHNOLOGY- www.ijset.in 320 DATE OF ACCEPTANCE: JUNE 21, 2014 DATE OF PUBLICATION: JUNE 26, 2014 ISSN: 2348-4098 VOLUME 02 ISSUE 05 JUNE- 2014 4)c=l1∞l2; 6)deletedatarow;} 5)foreachitemsetl1∈Lk‐1{ 7)} 6)foreachcandidatec∈Ck{ ProcedurePrune1(Lk‐1) 7)ifl1isthesubsetofcthen 1)forallitemsetsL1∈Lk‐1 8)c.num++;}}}}} 2)ifcount(L1)≤k‐1 9)C'k={c∈Ck|c.num=k}; 3)thendeleteallLjfromLk‐1 10)returnC'k; 4)returnL'k‐1//gobackanddelete Proceduredelete_datavalue itemsetswiththis (D:Database;Lk:frequent(k)‐itemsets; numberlessthank‐1. Lk‐1:frequent(k‐1)‐itemsets) 1)foreachitemseti∈Lk‐1andi∉Lk{ 2)foreachtransactiont∈D{ C. EXAMPLEOFTHEALGORITHM Let us consider a transaction database asshownintable1. TABLEI:Transactiondatabase 3)foreachdatavalue∈t{ 4)if(datavalue=i) TID ITEMS SOT 5)updatedatavalue=null; T1 A,B,D 3 6)}} T2 A,B,C,D 4 Proceduredelete_datarow(D: T3 A,B,E 3 DatabaseLk:frequent(k)‐itemsets) T4 B,E,F 3 1)foreachtransactiont∈D{ T5 A,B,D,F 4 2)foreachdatavalue∈t{ T6 A,E 2 3)if(datavalue!=nullanddatavalue!=0 T7 C 1 T8 E,F 2 ){ 4)datarow.count++;}} 5)if(datarow.count<k){ INTERNATIONAL JOURNAL OF SCIENCE, ENGINEERING AND TECHNOLOGY- www.ijset.in 321 DATE OF ACCEPTANCE: JUNE 21, 2014 DATE OF PUBLICATION: JUNE 26, 2014 ISSN: 2348-4098 VOLUME 02 ISSUE 05 JUNE- 2014 Suppose the minimum support count Step1:Firstly,theSOTcolumnisadded min_sup=2. The algorithm proceeds as tothedatabase. follows(fig1): Step2:Inthefirstiteration,eachitemis a member of candidate 1‐itemsets,C1. The algorithm simply scans the database to count the occurrences of eachitem. Step 3: This algorithm will then generate number of items in each transaction. We called this Size_Of _Transaction(SOT). Step4:Becauseofmin_sup=2,thesetof frequent 1‐itemset, L1 can be determined. It consistsofthecandidate 1‐itemset, satisfying the minimum supportthreshold. Step 5: As there are no items with support less than 2 nothing is deleted. Inaddition,whenL1isgenerated,now, thevalueofkis2,deletethoserecords of transaction having SOT=1 in D. And there won’t exist any elements of C2 in the records we find there is only one data in the T7. Thus after deleting the transaction we obtain transaction databaseD1. Step 6: To discover the set of frequent Figure1:Exampleofproposedalgorithm 2‐itemsets, L2, the algorithm uses the joinL1∞L1togenerateacandidateset of2‐itemsets,C2. INTERNATIONAL JOURNAL OF SCIENCE, ENGINEERING AND TECHNOLOGY- www.ijset.in 322 DATE OF ACCEPTANCE: JUNE 21, 2014 DATE OF PUBLICATION: JUNE 26, 2014 ISSN: 2348-4098 VOLUME 02 ISSUE 05 JUNE- 2014 Step 7: The transactions in D1 are Step 12: The transactions in D2 are scannedandthesupportcountandSOT scanned and the support count of each of each candidate item‐set in C2 is candidateitemsetinC3isaccumulated. accumulated. UseC3togenerateL3. Step 8: L2, the set of frequent 2‐ Step13:L3hasonlyone3‐itemsetsso itemsets is now then determined, that C4 =null. The algorithm will stop consistingofthosecandidate2‐itemsets andgiveoutallthefrequentitemsets. in C2 having minimum support greater orequaltothespecifiedvalue. Step 14: Algorithm will be generated forCkuntilCk+1becomesempty. Step 9: Now L2 is pruned by deleting the 5. ANALYTICAL ANALYSIS OF item sets with number of occurrences in Lk‐1 less than k‐1 to before candidate item sets occur i.e. itemsets {B,F}, {A,E}, {B,E}, {E,F} are deleted. THEPROPOSEDALGORITHM The basic thought of this proposed algorithm is similar with the existing apirori algorithm, that is it gets the Step10:AfterL2ispruned,wefindthat frequentitemsetL1whichhassupport the transactions T6 and T8 consists of level larger than or equal to the given onlytwoitems.Now,thevalueofkis2, level of the users support via scan the delete the transactions having SOT=2. databaseD.Theniteratetheprocessand And there won’t exist any elements of getL2，L3……Lk..Butstilltheproposed C3 in the records. Therefore, these algorithm differs from the existing records can be deleted and we obtain algorithm. transactiondatabaseD2. The size of the database is reduced by Step11:Todiscoverthesetoffrequent cutting the database using 3‐itemsets, L3, the algorithm uses the SizeofTransaction joinL2’∞L2’togenerateacandidateset optimizedalgorithmprunesLk‐1before of3‐itemsetsC3.Thereareanumberof Ck is generated. In other words, elements in C3. According to the frequentitemsetinLk‐1whichisgoing propertyofApriorialgorithm,C3needs toconnectarecounted.Accordingtothe ispruned. result obtained item sets with number attribute. the The less than k‐1 in Lk‐1 are deleted to INTERNATIONAL JOURNAL OF SCIENCE, ENGINEERING AND TECHNOLOGY- www.ijset.in 323 DATE OF ACCEPTANCE: JUNE 21, 2014 DATE OF PUBLICATION: JUNE 26, 2014 ISSN: 2348-4098 VOLUME 02 ISSUE 05 JUNE- 2014 decrease the number of the connecting frequent item‐set. The typical Apriori itemsetandremovalofsomeelements algorithm has performance bottleneck that does not satisfy the conditions. in the large data processing so that we Next the transactions whose value of need to optimize the existing algorithm SOT is less than k‐1 are removed from with variety of methods. Thus the the database. This decreases the optimizedalgorithmprunesLk‐1before possibility thus we generate Ck. Thus decreasing the declining the number of candidate item number of connecting item sets and sets in Ck, and reduces the number of removing the item‐sets which does not times to repeat the process and the satisfies the conditions. The improved number of transactions that need to be algorithm proposed optimizes the scanned from iterations. For large algorithm by reducing the size of the database, this algorithm can thus save candidate set Ck and also reducing the time and increase the efficiency of data I/O spend by cutting down transaction mining. This is what Apriori algorithm records in the database. The proposed doesnothave. algorithm also reduces the time to of combination, Although this process can decline the numberofcandidateitemsetsinCkand reduce time cost of data mining, the price of it is pruning frequent item set, whichcouldcostcertaintime.Fordense database such as, telecom, population, multinationalcompanies,census,etc.as repeat the process. For large database, this algorithm can save time cost and increase the efficiency of data mining. Thus the performance of Apriori algorithm is optimized so that we can mine association information from massivedatafasterandbetter. they consist of large amounts of long Although this improved algorithm has forms,theefficiencyofthisalgorithmis optimized the existing algorithm and is higherthanApriori. more efficient but it has overhead to 6. CONCLUSION manage the new database after every generationofLk.Tus,thereshouldexist some approach which requires less In this paper, Apriori algorithm is number of scans of database. Another improved based on the properties of solution might be division of large cutting database and pruning of the databaseamongprocessors. INTERNATIONAL JOURNAL OF SCIENCE, ENGINEERING AND TECHNOLOGY- www.ijset.in 324 DATE OF ACCEPTANCE: JUNE 21, 2014 DATE OF PUBLICATION: JUNE 26, 2014 ISSN: 2348-4098 VOLUME 02 ISSUE 05 JUNE- 2014 REFERENCES [1]. Communication Engineering, Vol. 2,No. A.RehabH.AlwaandB.Anasuya V Patil, “New Matrix Approach to Improve Apriori Algorithm” In International Journal of Computer Science and Network Solutions, Vol. 1 http://en.wikipedia.org/wiki/Da Ila Chandrakar and A. Mari Kirthima,“ASurveyOnAssociationRule Mining Algorithms”, In International Journal Of Mathematics and Computer Research,ISSN:2320‐7167,Vol1,Issue 10,PageNo.270‐272,November2013. [4]. JaishreeSingh,HariRamandDr. J.S. Sodhi, “Improving Efficiency of Apriori Algorithm Using Transaction Reduction” In International Journal of Scientific and Research Publications, Jiao Yabing, “Research of an Improved Apriori Algorithm in Data Mining Association Properties of the ConfidenceBoostofAssociationRules“ In ACM Transactions on Kmowledge November2013. Rules”, In Internatinal Journal of Computer and K.Vanitha and R. Santhi, “Using HashBasedAprioriAlgorithmtoReduce the Candidate 2‐ Itemsets for Mining AssociationRules“,InJournalofGlobal ResearchforComputerScience,Volume 2,No.5,April2011. [8]. Mohammed Al‐ Maolegi and Bassam Arkok, “ An Improved Apriori Algorithm for Association Rules” In International Journal on Natural Language Computing Vol. 3, No. 1, February2014. [9]. Volume3,Issue1,January2013. [5]. Jose L. Balcazar, “ Formal and Computational [7]. ta_mining [3]. [6]. Discovery from Data, Vol 7, No. 4, No4December2013. [2]. 1,January2013. RakeshAgrawal,R.,T.Imielinski, and A.Swami, “Mining association rules between sets of items in large databases” In Proceedings of the 1993 ACMSIGMODInternationalConference INTERNATIONAL JOURNAL OF SCIENCE, ENGINEERING AND TECHNOLOGY- www.ijset.in 325 DATE OF ACCEPTANCE: JUNE 21, 2014 DATE OF PUBLICATION: JUNE 26, 2014 ISSN: 2348-4098 VOLUME 02 ISSUE 05 JUNE- 2014 on Management of Data‐ SIGMOD ’93, Nanyang pp.207‐216,1993. Singapore,No.2003116,2003. RinaRawal,ProfIndrjeetRajput, Prof. Vinit kumar Gupta “Survey on [10]. Technological University, severalimprovedapriorialgorithms”In IOSR Journal of Computer Engineering, Volume9,April2013. [11]. Shweta and Kanwal Garg, “Mining Efficient Association Rules Through Apriori Algorithm Using AttributesAndComparativeAnalysisOf VariousAssociationRuleAlgorithms”In International Journal Of Advanced Research in Computer Science and SoftwareEngineeringVolume3,Issue6, June2013. [12]. Sotiris Kanellopoulas, Kotsiantis, Dimitris “Association Rules Mining: A Recent Overview”, GESTS InternationalTransactionsonComputer Science and Engineering, Vol. 329(1), pp.71‐82,2006. [13]. Qiankun Zhao and Sourav S. Bhowmick, “Association Rule Mining: A Survey”, Technical Report, CAIS, INTERNATIONAL JOURNAL OF SCIENCE, ENGINEERING AND TECHNOLOGY- www.ijset.in 326