International Journal of Computer Trends and Technology (IJCTT) – volume 10 number 4 – Apr 2014

Minimizing Spurious Patterns Using Association Rule Mining

Ruchi Goel, M.Tech (CSE), Department of Computer Science, Jamia Hamdard University, New Delhi, India
Dr. Parul Agarwal, Assistant Professor, Department of Computer Science, Jamia Hamdard University, New Delhi, India

ABSTRACT
Most clustering algorithms extract patterns that are of little interest. Such patterns consist of data items that belong to widely different support levels; items at different support levels have only weak associations between them, so the resulting patterns are uninteresting. The reason for this problem is that existing algorithms have no knowledge of the co-occurrence relationships between data items, and they cannot incorporate such knowledge, since doing so would conflict with their own goals. I propose a solution to this problem that extracts highly correlated, interesting patterns, known as maximal patterns, using a confidence measure. In this framework the data mining operation is performed not directly on the data set but on the highly correlated intensive patterns; this strategy also minimizes the effect of cross-support patterns. A minimum threshold value is used to regulate the intensive patterns.

Keywords: Asymmetric data set, Co-occurrence relation, Intensive patterns, Minimum threshold, Spurious patterns.

I. INTRODUCTION
Data sets normally consist of asymmetric data items. For example, a departmental store carries a wide range of commodities at the same price, but their significance varies from one commodity to another: some belong to the same support level while others belong to different support levels.
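The support levels mentioned above come from the standard notion of itemset support: the fraction of transactions that contain the itemset. A minimal Python sketch, using an assumed toy transaction list (not the paper's data):

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    items = set(itemset)
    return sum(items <= set(t) for t in transactions) / len(transactions)

# Toy transactions, assumed purely for illustration.
transactions = [
    {"chips", "cold drinks"},
    {"chips", "shampoo"},
    {"chips"},
    {"chips", "cold drinks"},
    {"cold drinks"},
]

print(support({"chips"}, transactions))             # 0.8
print(support({"shampoo"}, transactions))           # 0.2
print(support({"chips", "shampoo"}, transactions))  # 0.2
```

Here chips is a high-support item and shampoo a low-support item, which is exactly the asymmetry the paper is concerned with.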
Conventional clustering algorithms are ineffective on such asymmetric data sets. At a low threshold value they produce large numbers of spurious patterns consisting of weakly correlated data items, so a measure is needed that works even at low support values and removes spurious patterns, leaving intensive patterns. For example, a shopping mall may stock a large range of commodities whose prices vary significantly from one commodity to another, while some commodities belong to the same price level. Thus in a shopping mall there is a wide range of commodities belonging to different support levels, with only a few at the same support level. In such data sets, conventional clustering algorithms are not effective for mining associated patterns. Most clustering algorithms defined so far rely purely on a support-based pruning strategy, and this strategy proves ineffective on highly asymmetric data sets for the following two reasons:
1. If the value of the minimum threshold is very low, the number of spurious patterns among the extracted patterns increases. Such spurious patterns contain data items belonging to different support levels; they are called cross-support patterns, and the items they contain are weakly correlated with most of the other items in the pattern. For example, {chips, shampoo} is a cross-support pattern: chips is a data item with high support, while shampoo has quite low support compared to chips. Such items are weakly correlated, and the patterns containing them are therefore spurious. Besides this, a low minimum threshold also increases the computational and memory requirements substantially.
2.
On the other hand, if the value of the minimum threshold is very high, then many interesting patterns whose support is less than the threshold value may be missed, for example {chips, cold drinks}.

II. OBJECTIVE
I am going to reduce spurious patterns in asymmetric data sets using association rules. Minimizing such spurious patterns yields optimized patterns on which decision-making can be performed. So far, clustering algorithms cannot discover co-occurrence relationships among activities performed by a specific group or object, or between data items; they simply apply their own notions and reduce the number of data items by removing items that provide little power to classify instances. It therefore becomes tedious to perform decision-making on clusters containing asymmetric data items. My solution is to extract useful patterns at a low support value and, in turn, remove spurious patterns during mining. This generates patterns having a strong co-occurrence relationship between data items.

III. APPROACH FOR MINIMIZING SPURIOUS PATTERNS
Clustering is a process of assigning objects to clusters such that the objects or data items belonging to a given cluster have maximum similarity to each other, thus providing patterns for discovering knowledge that can be used for further decision-making. If the clustering algorithm moves some data items from one cluster to another, it becomes very tedious to gather knowledge from such clusters. Moreover, it is much easier to gather knowledge from well-understood patterns than to interpret the data items directly. Intensive clustering is an approach to solving this problem.
In this approach, patterns are preserved such that the data items belonging to a particular pattern always belong to a particular cluster. Intensive patterns [3] are patterns containing data items that have a high similarity, or co-occurrence relation, with each other: the presence of any data item in an intensive pattern strongly implies the presence of every other data item in the same pattern.

The intensive confidence (or i-confidence) of an itemset D = {d1, d2, d3, …, dm}, denoted i-conf(D), is a measure that reflects the overall co-occurrence relation among the items within the itemset. It is defined as min{conf(d1 -> d2, d3, …, dm), conf(d2 -> d1, d3, …, dm), …, conf(dm -> d1, d2, …, dm-1)}, where conf is the conventional definition of association-rule confidence. The scope of i-confidence can be understood with the following example. Consider an itemset D = {desktop, printer, antivirus}, and assume supp({desktop}) = 0.1, supp({printer}) = 0.1, supp({antivirus}) = 0.06, and supp({desktop, printer, antivirus}) = 0.06, where supp is the support of an itemset. Then
conf(desktop -> printer, antivirus) = supp({desktop, printer, antivirus}) / supp({desktop}) = 0.6
conf(printer -> desktop, antivirus) = supp({desktop, printer, antivirus}) / supp({printer}) = 0.6
conf(antivirus -> desktop, printer) = supp({desktop, printer, antivirus}) / supp({antivirus}) = 1
Hence i-conf(D) = min{0.6, 0.6, 1} = 0.6.
A candidate pattern over the itemset D is an intensive pattern if and only if i-conf(D) >= Tc, where Tc is the minimum threshold confidence provided by the user. Further, if for any intensive pattern there exist subsets of it that are also intensive patterns, those subset patterns should be removed from the set of all intensive patterns.
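The i-confidence computation above can be reproduced in a few lines of Python; the support table hard-codes the desktop/printer/antivirus figures from the example, and the function name is my own:

```python
def i_confidence(itemset, supp):
    """i-conf(D): the minimum over items d of conf(d -> D minus {d}),
    where each conf equals supp(D) / supp({d})."""
    joint = supp[frozenset(itemset)]
    return min(joint / supp[frozenset({d})] for d in itemset)

# Support values taken from the desktop/printer/antivirus example.
supp = {
    frozenset({"desktop"}): 0.1,
    frozenset({"printer"}): 0.1,
    frozenset({"antivirus"}): 0.06,
    frozenset({"desktop", "printer", "antivirus"}): 0.06,
}

print(round(i_confidence({"desktop", "printer", "antivirus"}, supp), 6))  # 0.6
```

The minimum is attained at desktop and printer (0.06/0.1 = 0.6), matching the worked example.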
The reason for this lies in the all-confidence property [2].

Properties of the I-confidence Measure
The i-confidence measure has four important properties: the anti-monotone property, the cross-support property, the strong co-occurrence relation property, and the all-confidence property.

1. Anti-Monotone
The i-confidence measure possesses the anti-monotone property. This property states that if the i-confidence of an itemset P is greater than the threshold value Tc, then the i-confidence of every subset of P is also greater than Tc. How the i-confidence measure uses this property can be explained with the following example. Suppose supp({desktop}) = 0.4, supp({printer}) = 0.6, supp({desktop, printer}) = 0.3, and the minimum i-confidence threshold is 0.6. Then the i-confidence of the candidate pattern {desktop, printer} is supp({desktop, printer}) / max{supp({desktop}), supp({printer})} = 0.3/0.6 = 0.5, which is less than the minimum i-confidence of 0.6. Thus the candidate pattern {desktop, printer} is not an intensive pattern. Moreover, all candidate patterns having {desktop, printer} as a subset are pruned; for example, {desktop, printer, TV} is not an intensive pattern. Note that the pruning here is done on the basis of the i-confidence threshold: if the threshold is reduced to 0.45, then {desktop, printer} becomes an intensive pattern.

2. Strong Co-occurrence Relation
The i-confidence measure also possesses the strong co-occurrence relation property among data sets. It ensures that all the data items contained in a pattern have a strong co-occurrence relation, that is, a strong association, with each other.
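The anti-monotone pruning under property 1 can be sketched as follows; the support values are assumed for illustration, and i-conf is computed via the equivalent max-support form:

```python
def i_conf(pattern, supp):
    """Equivalent max-support form: i-conf(P) = supp(P) / max over d of supp({d})."""
    return supp[pattern] / max(supp[frozenset({d})] for d in pattern)

# Assumed supports mirroring the desktop/printer pruning example.
supp = {
    frozenset({"desktop"}): 0.4,
    frozenset({"printer"}): 0.6,
    frozenset({"desktop", "printer"}): 0.3,
}

t_c = 0.6
candidate = frozenset({"desktop", "printer"})
value = i_conf(candidate, supp)
if value < t_c:
    # Anti-monotone: no superset of a failing pattern needs to be evaluated,
    # so a candidate such as {desktop, printer, TV} is pruned outright.
    print(f"pruned {sorted(candidate)} (i-conf = {value})")
```

With the threshold lowered to 0.45 the same candidate would pass, exactly as in the prose example.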
This can be understood with the following consideration. Suppose the i-confidence of an itemset D is 90%. Then if any data item belonging to D occurs in a transaction, there is at least a 90% chance that the remaining data items of D also occur in the same transaction.

3. Cross-Support Patterns
The i-confidence measure helps minimize cross-support patterns, which are precisely the spurious patterns. It is always very difficult to choose the right threshold value for mining a large collection of data. If we set a very high threshold, we may miss many interesting patterns. Conversely, if we set a very low threshold, finding the interesting associated patterns is also hard, for two reasons: first, the computational and memory requirements of existing analysis algorithms increase considerably; second, the number of extracted patterns increases substantially. I-confidence helps eliminate patterns consisting of uninteresting data items, and it involves no extra computational cost, since it depends only on the support values of the individual data items and their combinations. Formally, let Tc be the given threshold value and let P = {p1, p2, …, pn} be a pattern. P is a cross-support pattern with respect to Tc if there exist two data items p1 and p2 in P such that supp({p1}) / supp({p2}) < Tc, where 0 < Tc < 1.

4. All-Confidence
Omiecinski proposed the concept of all-confidence [2] as an alternative to support. All-confidence represents the minimum confidence over all the association rules extracted from an itemset, and it possesses the desirable anti-monotone property. For an itemset P = {p1, p2, …, pm}, the all-confidence is min{conf(A -> B) | A, B subsets of P, A ∪ B = P, A ∩ B = Ø}, which equals
supp({p1, p2, …, pm}) / max{supp({pk}) : 1 <= k <= m}.

IV. ALGORITHM FOR MINIMIZING SPURIOUS PATTERNS
Input
I: item set stored in a database containing a list of transactions with their items and corresponding supports
Min_threshold: minimum threshold value of i-confidence (provided by the user)
Variables
Intensive: intensive pattern set
Max_Intensive: maximal intensive pattern set
Functions
Intensive_Pattern_Evaluation(): evaluates the intensive pattern set
Max_Intensive_Pattern_Evaluation(): evaluates the maximal intensive pattern set
Method for extracting the maximal intensified patterns from I:
Intensive = Intensive_Pattern_Evaluation(I, Min_threshold)
{
1. Access the support value of each element in I.
2. Create candidate patterns with items belonging to different levels of support.
3. Prune candidate patterns on the basis of the anti-monotone property.
4. Prune candidate patterns on the basis of the cross-support property.
5. Return the intensive patterns (Intensive) with i-confidence > Min_threshold.
}
Max_Intensive = Max_Intensive_Pattern_Evaluation(Intensive)
{
1. Find an intensive pattern X' such that X' is a subset of Y, with both X', Y ∈ Intensive.
2. Set Intensive = Intensive – X'.
3. Repeat steps 1–2 until there exists no X' that is a subset of some Y with both belonging to Intensive.
4. Set Max_Intensive = Intensive.
}

V. RESULT AND DISCUSSION
Performance comparison of maximal intensive patterns and hyperclique patterns
To compare the performance of maximal intensive patterns and hyperclique patterns, we take a data set of transactions and run both algorithms on the same data with the same minimum threshold value. On execution, the number of interesting patterns generated as maximal intensive patterns is smaller than the number generated by the hyperclique pattern algorithm [4]. The reason is that the maximal intensive pattern algorithm uses the anti-monotone and all-confidence properties together with the strong co-occurrence and cross-support properties.

Consider Table 1, which lists various transactions with the items involved in each transaction.

Transaction_Id  Items
1   Bread, Butter
2   Butter
3   Coffee, Butter, Bread
4   Coffee, Milk
5   Bread, Butter, Milk
6   Coffee
7   Bread, Cookie
8   Coffee, Pickle
9   Bread, Sugar
10  Ketchup, Juice, Coffee, Egg
11  Bread, Juice, Pickle
12  Milk
13  Milk, Coffee, Sugar
14  Cookie, Chocolate
15  Chocolate, Milk
16  Biscuit, Milk
17  Bread, Biscuit, Milk
18  Milk, Coffee, Sugar
19  Cookie, Chocolate
20  Milk
Table 1: List of transactions with the items involved in each transaction.

Table 2 shows the interesting patterns generated as hyperclique patterns and as maximal intensive patterns; the minimum threshold confidence taken for each measure is 0.02. The number of interesting patterns generated as hyperclique patterns is 16, while the number generated as maximal intensive patterns is 11. Looking at interesting patterns number 6 and 13 generated as hyperclique patterns, it is found that both values are the same.
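The reduction from 16 hyperclique patterns to 11 maximal intensive patterns comes from removing duplicate and subset patterns, as Max_Intensive_Pattern_Evaluation does. A minimal sketch, with a few patterns assumed from Table 2:

```python
def maximal_patterns(patterns):
    """Drop duplicates and any pattern that is a strict subset of another,
    keeping only the maximal intensive patterns."""
    pats = {frozenset(p) for p in patterns}      # de-duplicates
    return {p for p in pats if not any(p < q for q in pats)}

# A few hyperclique-style patterns, assumed from Table 2 for illustration.
hyperclique = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},          # subset of the pattern above
    {"cookie", "chocolate"},
    {"cookie", "chocolate"},      # duplicate
    {"milk", "coffee", "sugar"},
]

maximal = maximal_patterns(hyperclique)
print(len(maximal))  # 3
```

Only the three maximal patterns survive; the subset {bread, butter} and the duplicate {cookie, chocolate} are eliminated, mirroring the difference between the two columns of Table 2.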
The reason is that when generating hyperclique patterns, the only requirement is that the h-confidence of a pattern be greater than the minimum threshold confidence, so identical candidate patterns can appear multiple times. On the other hand, among the interesting patterns generated as maximal intensive patterns, no pattern has a subset present in the list; this is due to the all-confidence property.

Interested Patterns   Hyperclique Patterns             Maximal Intensive Patterns
1    {ketchup, juice, coffee, egg}   {ketchup, juice, coffee, egg}
2    {bread, juice, pickle}          {bread, juice, pickle}
3    {chocolate, milk}               {chocolate, milk}
4    {bread, biscuit, milk}          {bread, biscuit, milk}
5    {milk, coffee, sugar}           {milk, coffee, sugar}
6    {cookie, chocolate}             {cookie, chocolate}
7    {coffee, butter, bread}         {coffee, butter, bread}
8    {bread, butter, milk}           {bread, butter, milk}
9    {bread, cookie}                 {bread, cookie}
10   {coffee, pickle}                {coffee, pickle}
11   {milk, bread, sugar}            {milk, bread, sugar}
12   {bread, butter}                 –
13   {cookie, chocolate}             –
14   {biscuit, milk}                 –
15   {coffee, milk}                  –
16   {milk, coffee, sugar}           –
Table 2: List of interesting patterns generated as hyperclique patterns and as maximal intensive patterns.

VI. CONCLUSION
I conclude that this algorithm is able to reduce the spurious patterns and generate maximal intensive patterns having a high co-occurrence relation among their items, based on the user-supplied threshold value and the properties described above. On these intensive patterns, the clustering process becomes considerably more efficient than with the existing mining algorithms.

REFERENCES
[1] Fisher, R.A.: The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics, vol. 7, 1936, pp. 179-188.
[2] Omiecinski, E.R.: Alternative Interest Measures for Mining Associations in Databases. IEEE Transactions on Knowledge and Data Engineering, 15(1):57-69, Jan/Feb 2003.
[3] Syed Zubair Ahmad Shah: Preceding Clustering by Pattern Preservation. VSRD-IJCSIT, Vol. 2 (8), 2012.
[4] H. Xiong, P. Tan, and V. Kumar: Mining Hyperclique Patterns with Confidence Pruning. Technical Report 03-006, Department of Computer Science, University of Minnesota, Jan 2003.