UP-Miner (Utility Pattern Miner) is a novel open-source, cross-platform toolbox that incorporates implementations of 13 state-of-the-art algorithms for 6 high utility pattern (HUP) mining technologies: high utility itemset (HUI) mining, concise high utility itemset (concise HUI) mining, top-k high utility itemset (top-k HUI) mining, quantitative high utility itemset (quantitative HUI) mining, high utility episode (HUE) mining and high utility sequential pattern (HUSP) mining. It is distributed under the GPL v3 license.

Such a toolbox is desirable for both academic and industrial purposes. For academics, it provides a rich library of implementations and supporting material, such as benchmark datasets, data generators and a user’s manual, so that other researchers can easily compare their own work against these implementations. A unified platform also makes experiments reproducible by other researchers. Industrial practitioners can use the high-performance algorithms incorporated in UP-Miner to efficiently discover different types of HUPs in real datasets for practical applications.

Chapter 1. Main Features of UP-Miner

First, UP-Miner takes three kinds of utility-based databases (i.e., transactional databases, complex event sequences, and sequence databases) into account and offers implementations of thirteen state-of-the-art algorithms for efficiently mining different types of HUPs, including HUIs [3, 6, 10, 11, 16], concise HUIs [17], top-k HUIs [22], quantitative HUIs [12, 19], HUEs [21] and HUSPs [4].

Second, the algorithms incorporated in UP-Miner are efficient and representative of the field of HUP mining. Most of the implementations are provided by the original authors to ensure code quality.
Third, UP-Miner offers a user-friendly graphical interface, a utility-based data processing module and a visualization module. Users can therefore access, process, analyze and visualize utility-based data without complicated manipulations.

Fourth, we integrate the above implementations and modules into a unified system architecture. All input and output files follow a uniform format, and the source code of UP-Miner is written in Java. It is therefore a cross-platform system that can be easily extended and reused in other Java programs.

Chapter 2. HUP Mining Technologies Integrated in UP-Miner

Table 1 shows all the HUP mining technologies integrated in UP-Miner and their main purposes.

Table 1. HUP mining technologies provided by UP-Miner and their purposes

HUP Mining Technology | Purpose
HUI Mining | To efficiently mine HUIs in transactional databases.
Concise HUI Mining | To efficiently mine a lossless and concise representation of HUIs in high dimensional datasets, such as dense microarray datasets and gene expression data.
Top-k HUI Mining | To efficiently mine the k itemsets having the highest utilities in transactional databases, where k is the number of desired itemsets specified by the user.
Quantitative HUI Mining | To efficiently mine HUIs carrying information about quantities in transactional databases.
HUE Mining | To efficiently mine ordered sets of events carrying high utility in complex event sequences, which can be applied to user behavior analysis, stock prediction, etc.
HUSP Mining | To efficiently mine ordered sets of items carrying high utility in sequence databases, which can be further extended to mine high utility mobile sequential patterns [14] and high utility web traversal patterns [15].

Chapter 3. Efficient Algorithms Integrated in UP-Miner

Table 2 shows all the implementations integrated in UP-Miner and their main characteristics.

Table 2. The algorithms incorporated in UP-Miner and their characteristics

Algorithm | Characteristics
Two-Phase [10] | The first two-phase HUI mining algorithm using a candidate generation-and-test methodology.
IHUP [3] | The first tree-based HUI mining algorithm using the pattern-growth methodology.
UP-Growth [16] | A state-of-the-art HUI mining algorithm that is commonly used for performance comparison.
HUI-Miner [11] | The first one-phase algorithm for mining HUIs in vertical databases.
FHM [6] | A state-of-the-art one-phase algorithm for mining HUIs in vertical databases.
FHN [5] | The first algorithm for mining HUIs with negative item values in vertical databases.
CHUD [17] | The first algorithm for mining concise HUIs from dense datasets.
DAHU [17] | The first algorithm for deriving all the HUIs from concise HUIs.
TKU [22] | The first algorithm for mining top-k HUIs without the need to set a minimum utility threshold.
HUQA [19] | The first algorithm for mining quantitative HUIs in horizontal databases.
VHUQI [12] | A state-of-the-art algorithm for mining quantitative HUIs in vertical databases.
UP-Span [21] | The first algorithm for mining HUEs in complex event sequences.
UtilitySpan [4] | The first algorithm for mining HUSPs in sequence databases using a pattern-growth approach.

Chapter 4. System Architecture of UP-Miner

The system architecture of UP-Miner consists of four major modules: (1) user interface module, (2) utility-based data processing module, (3) HUP mining algorithm library, and (4) visualization module, which are described below.

[Figure: block diagram. Components shown: utility-based databases (transactional database, complex event sequences, sequence databases); user interface module; utility-based data processing module (calculation of statistical information, item sorting, database transformation, database integration); HUP mining algorithm library (classical HUI, concise HUI, top-k HUI, quantitative, HUE and HUSP mining); visualization module (data, item and itemset visualization); discovered high utility patterns (all HUIs, concise HUIs, top-k HUIs, quantitative HUIs, all HUEs, all HUSPs).]

Fig. 1. System architecture of UP-Miner

4.1 User Interface Module

This module provides an easy-to-use graphical interface. Through it, users can import the three types of utility-based databases (i.e., transactional databases, complex event sequences and sequence databases), set parameters for the corresponding algorithms and review or save mining results.

4.2 Utility-based Data Processing Module

This module provides four functions for processing utility-based data: calculation of statistical information (CS), item sorting (IS), data transformation (DT) and data integration (DI). CS calculates statistical information about the imported database, including the total utility of the database [16], the number of distinct items, the average length of transactions and the maximum length of transactions. IS sorts items and their utilities within transactions. DT transforms a horizontal database into a vertical database and vice versa. DI integrates database records with items’ internal and external utilities for further mining.

4.3 HUP Mining Algorithm Library

This library offers implementations of thirteen state-of-the-art HUP algorithms covering six important technologies in HUP mining. Table 1 shows the variety of HUP mining technologies incorporated in UP-Miner and their main purposes. Table 2 shows the characteristics of the algorithms integrated in UP-Miner, where the implementations of UP-Growth, FHM, FHN, CHUD, DAHU, TKU, VHUQI and UP-Span are provided by the original authors. As shown in Table 2, the incorporated algorithms are innovative and representative of the field of HUP mining.
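To make the task solved by the HUI mining part of the library concrete, the following is a minimal sketch of the problem definition only: an itemset is a HUI when the sum of its utilities over all transactions that contain it reaches a user-specified minimum utility. The sketch enumerates every itemset by brute force, so it is not any of the algorithms in Table 2 and is not taken from the UP-Miner source code. The class name NaiveHuiSketch, its methods and the toy database are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical illustration class; it is not part of the UP-Miner code base.
class NaiveHuiSketch {

    // Utility of an itemset in one transaction (item -> utility): the sum of
    // its items' utilities if the transaction contains every item, else 0.
    static int utilityIn(Map<Integer, Integer> transaction, List<Integer> itemset) {
        int sum = 0;
        for (int item : itemset) {
            Integer utility = transaction.get(item);
            if (utility == null) {
                return 0; // the transaction does not contain the whole itemset
            }
            sum += utility;
        }
        return sum;
    }

    // Enumerates every non-empty itemset over the distinct items and keeps
    // those whose total utility is at least minUtil. Exponential in the
    // number of distinct items, so only suitable for toy inputs.
    static Map<List<Integer>, Integer> mine(List<Map<Integer, Integer>> db, int minUtil) {
        TreeSet<Integer> distinct = new TreeSet<>();
        for (Map<Integer, Integer> transaction : db) {
            distinct.addAll(transaction.keySet());
        }
        List<Integer> items = new ArrayList<>(distinct);
        Map<List<Integer>, Integer> result = new LinkedHashMap<>();
        for (long mask = 1; mask < (1L << items.size()); mask++) {
            List<Integer> itemset = new ArrayList<>();
            for (int i = 0; i < items.size(); i++) {
                if ((mask & (1L << i)) != 0) {
                    itemset.add(items.get(i));
                }
            }
            int total = 0;
            for (Map<Integer, Integer> transaction : db) {
                total += utilityIn(transaction, itemset);
            }
            if (total >= minUtil) {
                result.put(itemset, total);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Toy database (assumed values): each map is one transaction.
        List<Map<Integer, Integer>> db = List.of(
                Map.of(1, 5, 2, 10),
                Map.of(1, 5, 3, 2),
                Map.of(2, 10, 3, 4));
        System.out.println(mine(db, 15)); // prints {[2]=20, [1, 2]=15}
    }
}
```

In this toy database the itemset {1} has utility 10 and is not a HUI, while its superset {1, 2} has utility 15 and is one. Utility is therefore not anti-monotone, which is why the dedicated algorithms in Table 2 rely on their own pruning strategies rather than plain support-style pruning.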
4.4 Visualization Module

This module offers three visualization functions, namely data visualization (DV), item visualization (IV) and itemset visualization (SV). DV and IV visualize the distribution of transaction lengths and the distribution of item utility values, respectively. SV visualizes the distribution of itemset lengths and itemset utilities. For example, Fig. 5 shows a snapshot of the visualization of the distribution of item utility values.

Chapter 5. Utility-based Data Format

This chapter describes the data format of the input data files. UP-Miner supports three kinds of utility-based databases, namely transactional databases, complex event sequences, and sequence databases.

5.1 Utility-based Transactional Database

The HUP mining technologies [Mining High Utility Patterns], [Mining Top-k High Utility Patterns] and [Mining Concise High Utility Patterns] take as input a utility-based transactional database. Each line of the database consists of (1) a set of items (the first column), (2) the sum of the utilities (e.g., profit) of these items in this transaction (the second column), and (3) the utility of each item in this transaction (e.g., the profit of each item) (the third column). Note that the value in the second column of each line is the sum of the values in the third column. There is an example database named “HUI Mining - Example Database.txt” in the folder [Example Database].

5.2 Utility-based Complex Event Sequence

The input file of the HUE mining technology [Mining High Utility Episodes] is a utility-based complex event sequence. The corresponding example database, named “HUE Mining - Example Database.txt”, is in the folder [Example Database]. Each line of the database represents the information of a transaction at a time stamp. For example, the first and the second lines represent the information of transactions at time stamp 1 and time stamp 2, respectively.
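The three-column line layout of Section 5.1, together with the line-number-as-time-stamp convention just described, can be read as sketched below. The manual does not state the delimiters here, so the sketch assumes that the three columns are separated by colons and that the values inside a column are separated by spaces; check the example files before relying on this. The class name TransactionLineSketch and its methods are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; the ':' and space delimiters are assumptions and are
// not taken from the manual.
class TransactionLineSketch {

    // Parses "items:transactionUtility:itemUtilities" into an item -> utility
    // map and checks that the second column equals the sum of the third.
    static Map<Integer, Integer> parse(String line) {
        String[] columns = line.trim().split(":");
        if (columns.length != 3) {
            throw new IllegalArgumentException("expected 3 columns: " + line);
        }
        String[] items = columns[0].trim().split("\\s+");
        String[] utilities = columns[2].trim().split("\\s+");
        if (items.length != utilities.length) {
            throw new IllegalArgumentException("item/utility count mismatch: " + line);
        }
        int declaredSum = Integer.parseInt(columns[1].trim());
        Map<Integer, Integer> transaction = new LinkedHashMap<>();
        int sum = 0;
        for (int i = 0; i < items.length; i++) {
            int utility = Integer.parseInt(utilities[i]);
            transaction.put(Integer.parseInt(items[i]), utility);
            sum += utility;
        }
        if (sum != declaredSum) {
            throw new IllegalArgumentException("second column is not the sum: " + line);
        }
        return transaction;
    }

    // For a complex event sequence, line n is the transaction at time stamp n.
    static Map<Integer, Map<Integer, Integer>> readEventSequence(List<String> lines) {
        Map<Integer, Map<Integer, Integer>> byTimeStamp = new LinkedHashMap<>();
        for (int i = 0; i < lines.size(); i++) {
            byTimeStamp.put(i + 1, parse(lines.get(i)));
        }
        return byTimeStamp;
    }

    public static void main(String[] args) {
        // Made-up line: items 1, 2, 3 with utilities 5, 10, 2 (total 17).
        System.out.println(parse("1 2 3:17:5 10 2")); // prints {1=5, 2=10, 3=2}
    }
}
```

Checking the second column against the third while parsing catches malformed lines early, before they reach a mining algorithm.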
Each line of the database consists of (1) a set of items, (2) the sum of the utilities (e.g., the profit) of these items in this transaction, and (3) the utility of each item in this transaction (e.g., the profit of each item).

5.3 Utility-based Sequence Database

The input files of the HUSP mining technology [Mining High Utility Sequential Patterns] are a utility-based sequence database and a profit table. The corresponding example files, named “HUSP Mining - Example Database.txt” and “HUSP Mining - Example Profit Table.txt”, are in the folder [Example Database]. Each line in “HUSP Mining - Example Database.txt” represents a sequence of itemsets, where the itemsets are delimited by brackets. Each item in an itemset is followed by its quantity, with a colon between the two. Each line in “HUSP Mining - Example Profit Table.txt” gives the profit of one item. For example, the first value is the profit of item 1, the second value is the profit of item 2, and so on. The number of values in “HUSP Mining - Example Profit Table.txt” is equal to the number of distinct items.

Chapter 6. How to Use UP-Miner

To use UP-Miner, users need to perform the following actions: (1) choose the algorithm, (2) select the input file, (3) set the user-specified parameters, (4) set the output file name and (5) click the button “Run Algorithm”. The mining results are then shown in the text area of UP-Miner.

Chapter 7. Applications

With the rapid advancement of research on HUP mining, numerous applications in different domains have been proposed in recent years. In the following, we describe a few important applications.

7.1 Mobile Commerce

With the development of IoT (Internet of Things) technologies and sensor-enabled devices, such as smartphones, wireless networks and GPS devices, information about users’ locations and payment records can be acquired and integrated.
In such a scenario, HUP mining technology can be used to discover valuable user behaviors in mobile environments. For example, Shie et al. [14] have proposed a framework named high utility mobile sequential pattern mining for discovering associations between customers’ purchase behaviors and location trajectories in mobile environments. The discovered patterns can be used for location-based advertisements, navigational services, location-based recommendation systems, and many other applications essential to mobile commerce.

7.2 Web Mining

In web mining, users’ browsing and purchasing behaviors are recorded in web transaction logs. In such data, a user’s browsing time on a web page can be expressed as the internal utility of the page, and each web page may have a different importance depending on users’ preferences (i.e., external utility). Web site managers can use HUP mining technology to discover utility-based patterns in web transactions, such as high utility access patterns [2] and high utility traversal patterns [15]. In electronic commerce, the mined results can be used to improve website services, provide efficient access to related web pages, suggest navigation paths for traversing web pages, and improve the design of web pages.

7.3 Biomedicine

An important application of HUP mining in biomedicine is gene expression data analysis. In gene expression data, each row represents a set of genes and their expression levels (i.e., internal utility) under an experimental condition. Furthermore, each gene has a degree of importance for biological processes (i.e., external utility). Mining HUPs in such data can reveal interesting relationships between genes. For instance, Liu et al. [8] have applied HUP mining technology to gene expression analysis and found several novel gene regulations in time course comparative gene expression data. The mined results can help medical researchers develop new drugs for the treatment of diseases.
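The internal and external utilities used in the web mining and gene expression scenarios above combine in the way that is standard in the HUP literature: the utility of an item in a record is its internal utility multiplied by its external utility, and the utility of a record is the sum over its items. The following minimal sketch shows this for one browsing session. The class name WebUtilitySketch, the page names and all numbers are hypothetical.

```java
import java.util.Map;

// Hypothetical illustration; page names, times and weights are made up.
class WebUtilitySketch {

    // Utility of one session: for every visited page, browsing time in
    // seconds (internal utility) times the page's importance weight
    // (external utility), summed over the pages. Pages without a weight
    // contribute nothing.
    static int sessionUtility(Map<String, Integer> secondsOnPage,
                              Map<String, Integer> pageWeight) {
        int total = 0;
        for (Map.Entry<String, Integer> visit : secondsOnPage.entrySet()) {
            int weight = pageWeight.getOrDefault(visit.getKey(), 0);
            total += visit.getValue() * weight;
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Integer> session = Map.of("home", 30, "product", 120, "checkout", 45);
        Map<String, Integer> weights = Map.of("home", 1, "product", 2, "checkout", 4);
        // 30*1 + 120*2 + 45*4 = 450
        System.out.println(sessionUtility(session, weights)); // prints 450
    }
}
```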
References

[1] R. Agrawal and R. Srikant, “Fast Algorithms for Mining Association Rules,” Proc. of Int’l Conf. on Very Large Data Bases, pp. 487-499, 1994.
[2] C. F. Ahmed, S. K. Tanbeer and B. Jeong, “Mining High Utility Web Access Sequences in Dynamic Web Log Data,” Proc. of Int’l Conf. on Software Engineering Artificial Intelligence Networking and Parallel/Distributed Computing, pp. 76-81, 2010.
[3] C. F. Ahmed, S. K. Tanbeer, B. Jeong and Y. Lee, “Efficient Tree Structures for High Utility Pattern Mining in Incremental Databases,” IEEE Transactions on Knowledge and Data Engineering, Vol. 21, Issue 12, pp. 1708-1721, 2009.
[4] C. F. Ahmed, S. K. Tanbeer and B. Jeong, “A Novel Approach for Mining High-Utility Sequential Patterns in Sequence Databases,” ETRI Journal, Vol. 32, No. 5, pp. 676-686, 2010.
[5] P. Fournier-Viger, “FHN: Efficient Mining of High-Utility Itemsets with Negative Unit Profits,” Proc. of Int’l Conf. on Advanced Data Mining and Applications, pp. 16-29, 2014.
[6] P. Fournier-Viger, C. Wu, S. Zida and V. S. Tseng, “FHM: Faster High-Utility Itemset Mining Using Estimated Utility Co-occurrence Pruning,” Proc. of Int’l Symposium on Methodologies for Intelligent Systems, pp. 83-92, 2014.
[7] J. Han, J. Pei and Y. Yin, “Mining Frequent Patterns without Candidate Generation,” Proc. of the ACM SIGMOD Int’l Conf. on Management of Data, pp. 1-12, 2000.
[8] Y. Liu, C. Cheng and V. S. Tseng, “Mining Differential Top-k Co-expression Patterns from Time Course Comparative Gene Expression Datasets,” BMC Bioinformatics, 14:230, 2013.
[9] C. Lin, W. Gan, T. Hong and J. Pan, “Efficient Mining High-Utility Itemsets with Transaction Insertion,” Proc. of Int’l Conf. on Advanced Data Mining and Applications, pp. 44-56, 2014.
[10] Y. Liu, W. Liao and A. Choudhary, “A Fast High Utility Itemsets Mining Algorithm,” Proc. of the Utility-Based Data Mining Workshop, pp. 90-99, 2005.
[11] M. Liu and J. Qu, “Mining High Utility Itemsets without Candidate Generation,” Proc. of ACM Int’l Conf. on Information and Knowledge Management, pp. 55-64, 2012.
[12] C. H. Li, C. Wu and V. S. Tseng, “Efficient Vertical Mining of High Utility Quantitative Itemsets,” Proc. of Int’l Conf. on Granular Computing, pp. 155-160, 2014.
[13] Y. C. Lin, C. Wu and V. S. Tseng, “Mining High Utility Itemsets in Big Data,” Proc. of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 649-661, 2015.
[14] B. Shie, H. Hsiao and V. S. Tseng, “Efficient Algorithms for Discovering High Utility User Behavior Patterns in Mobile Commerce Environments,” Knowledge and Information Systems, Vol. 37, Issue 2, pp. 363-387, 2013.
[15] M. Thilagu and R. Nadarajan, “Efficiently Mining of Effective Web Traversal Patterns With Average Utility,” Proc. of Int’l Conf. on Communication, Computing, and Security, pp. 444-451, 2012.
[16] V. S. Tseng, B. Shie, C. Wu and P. S. Yu, “Efficient Algorithms for Mining High Utility Itemsets from Transactional Databases,” IEEE Transactions on Knowledge and Data Engineering, Vol. 25, Issue 8, pp. 1772-1786, 2013.
[17] V. S. Tseng, C. Wu, P. Fournier-Viger and P. S. Yu, “Efficient Algorithms for Mining the Concise and Lossless Representation of High Utility Itemsets,” IEEE Transactions on Knowledge and Data Engineering, Vol. 27, Issue 3, pp. 726-739, 2015.
[18] S. Yen, J. Gu and Y. Lee, “Mining Sequential Purchasing Behaviors from Customer Transaction Databases,” Proc. of Int’l Conf. on Systems, Man, and Cybernetics, pp. 2933-2938, 2013.
[19] S. Yen and Y. Lee, “Mining High Utility Quantitative Association Rules,” Proc. of Int’l Conf. on Data Warehousing and Knowledge Discovery, pp. 283-292, 2007.
[20] I. H. Witten and E. Frank, “Data Mining: Practical Machine Learning Tools and Techniques,” Morgan Kaufmann, 2005.
[21] C. Wu, P. Fournier-Viger, P. S. Yu and V. S. Tseng, “Mining High Utility Episodes in Complex Event Sequences,” Proc. of ACM SIGKDD Int’l Conf. on Knowledge Discovery and Data Mining, pp. 536-544, 2013.
[22] C. Wu, B. Shie, V. S. Tseng and P. S. Yu, “Mining Top-k High Utility Itemsets,” Proc. of ACM SIGKDD Int’l Conf. on Knowledge Discovery and Data Mining, pp. 78-86, 2012.
[23] FIMI Repository. Available at: http://fimi.cs.helsinki.fi/
[24] Intelligent DataBase System Laboratory. Available at: http://idb.csie.ncku.edu.tw/English/newwebpa/index2.php
[25] Illimine. Software available at: http://illimine.cs.uiuc.edu
[26] Knime. Software available at: http://www.knime.org/
[27] Mahout. Software available at: http://mahout.apache.org/

Biographical Sketch of the Authors

Vincent S. Tseng is currently a Professor at the Department of Computer Science, National Chiao Tung University. He also serves as the chair of the IEEE Computational Intelligence Society Tainan Chapter. He served as the president of the Taiwanese Association for Artificial Intelligence during 2011-2012 and as the director of the Institute of Medical Informatics of National Cheng Kung University (NCKU) from August 2008 to July 2011. From 2004 to 2007, he also served as the director of the Informatics Center in NCKU Hospital. Dr. Tseng has a wide variety of research interests covering data mining, big data, biomedical informatics, multimedia databases, and mobile and Web technologies. He has published more than 300 research papers in refereed journals and international conferences and holds 15 patents. He has been on the editorial board of a number of journals including IEEE Transactions on Knowledge and Data Engineering, IEEE Journal on Biomedical and Health Informatics, ACM Transactions on Knowledge Discovery from Data, etc. He has also served as chair or program committee member for a number of premier international conferences related to data engineering, artificial intelligence and computational intelligence, including KDD, ICDM, SDM, PAKDD, ICDE, CIKM, IJCAI, etc. He is also the recipient of the 2014 K. T. Li Breakthrough Award.

Cheng Wei Wu received the Ph.D. degree from the Department of Computer Science and Information Engineering, National Cheng Kung University, Taiwan, in 2015. He is currently a post-doctoral researcher in the College of Computer Science, National Chiao Tung University, Taiwan. His research interests include data mining, utility pattern mining, pattern discovery, machine learning and big data analytics.

Jun Han Lin is currently pursuing a Master’s degree at the Department of Computer Science and Information Engineering, National Cheng Kung University. His research interests include data mining, high utility pattern mining, Hadoop, Spark, and big data analytics.

Philippe Fournier-Viger is an assistant professor at the University of Moncton, Canada. He received the Ph.D. degree in Cognitive Computer Science from the University of Quebec in Montreal in 2010. His research interests include data mining, e-learning, intelligent tutoring systems, knowledge representation and cognitive modeling. He is the author of the popular SPMF data mining software.