
UP-Miner (Utility Pattern Miner) is an open-source, cross-platform toolbox that
incorporates implementations of 13 state-of-the-art algorithms for 6 high utility
pattern (HUP) mining technologies: high utility itemset (HUI) mining, concise high
utility itemset (concise HUI) mining, top-k high utility itemset (top-k HUI) mining,
quantitative high utility itemset (quantitative HUI) mining, high utility episode
(HUE) mining and high utility sequential pattern (HUSP) mining. It is distributed
under the GPL v3 license.
Such a toolbox is valuable for both academic and industrial purposes. For
academics, it provides a rich library of implementations and supporting materials,
such as benchmark datasets, data generators and a user’s manual, so that researchers
can easily compare their own work with these implementations. A unified platform
also makes experiments reproducible by other researchers. Industrial practitioners
can use the high-performance algorithms incorporated in UP-Miner to efficiently
discover different types of HUPs in real datasets for practical applications.
Chapter 1. Main Features of UP-Miner
- First, UP-Miner takes three kinds of utility-based databases (i.e., transactional
  databases, complex event sequences and sequence databases) into account and
  offers implementations of thirteen state-of-the-art algorithms for efficiently
  mining different types of HUPs, including HUIs [3, 6, 10, 11, 16], concise
  HUIs [17], top-k HUIs [22], quantitative HUIs [12, 19], HUEs [21] and HUSPs [4].
- Second, the algorithms incorporated in UP-Miner are efficient and representative
  of the field of HUP mining. Most implementations are provided by the original
  authors to ensure code quality.
- Third, UP-Miner offers a user-friendly graphical interface, a utility-based data
  processing module and a visualization module. Users can therefore access,
  process, analyze and visualize utility-based data without complicated
  manipulations.
- Fourth, the above implementations and modules are integrated into a unified
  system architecture. All input and output files follow a uniform format, and the
  source code of UP-Miner is written in Java. It is therefore a cross-platform
  system that can be easily extended and reused in other Java programs.
Chapter 2. HUP Mining Technologies Integrated in UP-Miner
Table 1 shows all the HUP mining technologies integrated in UP-Miner and their
main purposes.
Table 1. HUP mining technologies provided by UP-Miner and their purposes

HUP Mining Technology | Purpose
----------------------|--------
HUI Mining | To efficiently mine HUIs in transactional databases.
Concise HUI Mining | To efficiently mine a lossless and concise representation of HUIs in high dimensional datasets, such as dense microarray datasets and gene expression data.
Top-k HUI Mining | To efficiently mine the k itemsets having the highest utilities in transactional databases, where k is the number of desired itemsets specified by the users.
Quantitative HUI Mining | To efficiently mine HUIs carrying information about quantities in transactional databases.
HUE Mining | To efficiently mine ordered sets of events carrying high utility in complex event sequences, which can be applied to user behavior analysis, stock prediction, etc.
HUSP Mining | To efficiently mine ordered sets of items carrying high utility in sequence databases, which can be further extended to mine high utility mobile sequential patterns [2] and high utility web traversal patterns [15].
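
To make the purposes listed in Table 1 more concrete, the following minimal Java sketch computes itemset utilities by brute force on a tiny made-up database, then keeps the itemsets whose utility reaches a minimum utility threshold (HUIs) and selects the k itemsets with the highest utilities (top-k HUIs). It relies on the standard definition that the utility of an itemset is the sum of the utilities of its items over all transactions containing the itemset. The class BruteForceHui, its method names and the data are hypothetical and chosen for illustration; this exhaustive enumeration is not one of the algorithms shipped with UP-Miner and would not scale to real datasets.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Brute-force illustration only; NOT one of the algorithms integrated in UP-Miner.
final class BruteForceHui {

    // Utility of every itemset that occurs in the database.
    // items[t][i] has utility utils[t][i] in transaction t.
    // ASSUMPTION: items inside a transaction are distinct and sorted ascending.
    static Map<List<Integer>, Integer> itemsetUtilities(int[][] items, int[][] utils) {
        Map<List<Integer>, Integer> utility = new HashMap<>();
        for (int t = 0; t < items.length; t++) {
            int n = items[t].length;
            for (int mask = 1; mask < (1 << n); mask++) { // every non-empty subset
                List<Integer> itemset = new ArrayList<>();
                int u = 0;
                for (int i = 0; i < n; i++) {
                    if ((mask & (1 << i)) != 0) {
                        itemset.add(items[t][i]);
                        u += utils[t][i];
                    }
                }
                utility.merge(itemset, u, Integer::sum); // add this transaction's share
            }
        }
        return utility;
    }

    // Itemsets whose utility is at least minUtil, sorted by descending utility.
    static List<Map.Entry<List<Integer>, Integer>> highUtility(
            Map<List<Integer>, Integer> utility, int minUtil) {
        List<Map.Entry<List<Integer>, Integer>> result = new ArrayList<>();
        for (Map.Entry<List<Integer>, Integer> e : utility.entrySet()) {
            if (e.getValue() >= minUtil) {
                result.add(e);
            }
        }
        result.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        return result;
    }

    // The k itemsets with the highest utilities; no threshold is needed.
    static List<Map.Entry<List<Integer>, Integer>> topK(
            Map<List<Integer>, Integer> utility, int k) {
        List<Map.Entry<List<Integer>, Integer>> all = highUtility(utility, Integer.MIN_VALUE);
        return all.subList(0, Math.min(k, all.size()));
    }

    public static void main(String[] args) {
        // Three made-up transactions: {1,2,3}, {2,4} and {1,3} with per-item utilities.
        int[][] items = {{1, 2, 3}, {2, 4}, {1, 3}};
        int[][] utils = {{5, 10, 2}, {6, 6}, {5, 2}};
        Map<List<Integer>, Integer> utility = itemsetUtilities(items, utils);
        System.out.println("HUIs (minUtil = 14): " + highUtility(utility, 14));
        System.out.println("Top-2 itemsets: " + topK(utility, 2));
    }
}
```

On this toy data, itemset {1, 3} has utility 7 + 7 = 14 because it appears in the first and third transactions, so it qualifies as a HUI for a threshold of 14, while the top-2 query returns {1, 2, 3} (utility 17) and {2} (utility 16) without any threshold being set.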
Chapter 3. Efficient Algorithms Integrated in UP-Miner
Table 2 shows all the implementations integrated in UP-Miner and their main
characteristics.
Table 2. The algorithms incorporated in UP-Miner and their characteristics

Algorithm | Characteristics
----------|----------------
Two-Phase [10] | The first two-phase HUI mining algorithm using a candidate generation-and-test methodology.
IHUP [3] | The first tree-based HUI mining algorithm using the pattern-growth methodology.
UP-Growth [16] | A state-of-the-art HUI mining algorithm that is commonly used for performance comparison.
HUI-Miner [11] | The first one-phase algorithm for mining HUIs in vertical databases.
FHM [6] | A state-of-the-art one-phase algorithm for mining HUIs in vertical databases.
FHN [5] | The first algorithm for mining HUIs with negative item values in vertical databases.
CHUD [17] | The first algorithm for mining concise HUIs from dense datasets.
DAHU [17] | The first algorithm for deriving all the HUIs from concise HUIs.
TKU [22] | The first algorithm for mining top-k HUIs without the need to set a minimum utility threshold.
HUQA [19] | The first algorithm for mining quantitative HUIs in horizontal databases.
VHUQI [12] | A state-of-the-art algorithm for mining quantitative HUIs in vertical databases.
UP-Span [21] | The first algorithm for mining HUEs in complex event sequences.
UtilitySpan [4] | The first algorithm for mining HUSPs in sequence databases using a pattern-growth approach.
Fig. 1. System architecture of UP-Miner
[Figure components: utility-based databases (transactional databases, complex event sequences, sequence databases); user interface module; utility-based data processing module (calculation of statistical information, item sorting, database transformation, database integration); HUP mining algorithm library (classical HUI, concise HUI, top-k HUI, quantitative, HUE and HUSP mining); discovered high utility patterns (all HUIs, concise HUIs, top-k HUIs, quantitative HUIs, all HUEs, all HUSPs); visualization module (data, item and itemset visualization).]
Chapter 4. System Architecture of UP-Miner
The system architecture of UP-Miner consists of four major modules: (1) user
interface module, (2) utility-based data processing module, (3) HUP mining
algorithm library, and (4) visualization module, which are described below.
4.1 User Interface Module
This module provides an easy-to-use graphical interface to users. Through the
interface module, users can easily import three types of utility-based databases (i.e.,
transactional databases, complex event sequences and sequence databases), set
parameters for corresponding algorithms and review/save mining results.
4.2 Utility-based Data Processing Module
This module provides four functions for processing utility-based data: calculation
of statistical information (CS), item sorting (IS), data transformation (DT) and
data integration (DI). CS calculates statistical information about the imported
database, including the total utility of the database [16], the number of distinct
items, and the average and maximum lengths of transactions. IS sorts items and
their utilities within transactions. DT transforms a horizontal database into a
vertical database and vice versa. DI integrates database records with items’
internal and external utilities for further mining.
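
As an illustration of the CS and DT functions described above, here is a minimal Java sketch under stated assumptions. The class UtilityDataProcessing, its nested types and its method names are hypothetical and are not taken from the UP-Miner source code. In particular, the vertical layout used here (for each item, a list of transaction id and utility pairs) is an assumption; the layout actually used by UP-Miner may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch of the CS and DT functions; all names are illustrative.
final class UtilityDataProcessing {

    // One horizontal record: items[i] has utility utilities[i] in this transaction.
    static final class Transaction {
        final int[] items;
        final int[] utilities;

        Transaction(int[] items, int[] utilities) {
            this.items = items;
            this.utilities = utilities;
        }
    }

    // The statistics listed for CS in Section 4.2.
    static final class Stats {
        long totalUtility;
        int distinctItems;
        double averageLength;
        int maxLength;
    }

    // CS: one pass over the database.
    static Stats statistics(List<Transaction> db) {
        Stats s = new Stats();
        Set<Integer> seen = new HashSet<>();
        long totalLength = 0;
        for (Transaction t : db) {
            for (int i = 0; i < t.items.length; i++) {
                seen.add(t.items[i]);
                s.totalUtility += t.utilities[i];
            }
            totalLength += t.items.length;
            s.maxLength = Math.max(s.maxLength, t.items.length);
        }
        s.distinctItems = seen.size();
        s.averageLength = db.isEmpty() ? 0.0 : (double) totalLength / db.size();
        return s;
    }

    // DT, horizontal to vertical. ASSUMED layout: item -> list of {transaction id, utility}.
    static Map<Integer, List<int[]>> toVertical(List<Transaction> db) {
        Map<Integer, List<int[]>> vertical = new TreeMap<>();
        for (int tid = 0; tid < db.size(); tid++) {
            Transaction t = db.get(tid);
            for (int i = 0; i < t.items.length; i++) {
                vertical.computeIfAbsent(t.items[i], k -> new ArrayList<>())
                        .add(new int[] {tid, t.utilities[i]});
            }
        }
        return vertical;
    }

    public static void main(String[] args) {
        // Made-up data for illustration.
        List<Transaction> db = Arrays.asList(
                new Transaction(new int[] {1, 2, 3}, new int[] {5, 10, 2}),
                new Transaction(new int[] {2, 4}, new int[] {6, 6}),
                new Transaction(new int[] {1, 3}, new int[] {5, 2}));
        Stats s = statistics(db);
        System.out.println("total utility = " + s.totalUtility
                + ", distinct items = " + s.distinctItems
                + ", average length = " + s.averageLength
                + ", maximum length = " + s.maxLength);
        for (Map.Entry<Integer, List<int[]>> e : toVertical(db).entrySet()) {
            StringBuilder row = new StringBuilder("item " + e.getKey() + ":");
            for (int[] pair : e.getValue()) {
                row.append(" (tid ").append(pair[0]).append(", utility ").append(pair[1]).append(")");
            }
            System.out.println(row);
        }
    }
}
```

A vertical layout of this kind lets an algorithm obtain the utility of a single item by scanning only that item's list rather than the whole database.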
4.3 HUP Mining Algorithm Library
This library offers implementations of thirteen state-of-the-art HUP mining
algorithms covering six important HUP mining technologies. Table 1 lists the
HUP mining technologies incorporated in UP-Miner and their main purposes.
Table 2 lists the characteristics of the algorithms integrated in UP-Miner;
the implementations of UP-Growth, FHM, FHN, CHUD, DAHU, TKU, VHUQI and
UP-Span are provided by the original authors. As shown in Table 2, the
incorporated algorithms are innovative and representative of the field of HUP
mining.
4.4 Visualization Module
This module offers three visualization functions: data visualization (DV), item
visualization (IV) and itemset visualization (SV). DV and IV visualize the
distribution of transaction lengths and the distribution of item utility values,
respectively. SV visualizes the distributions of itemset lengths and itemset
utilities. For example, Fig. 5 shows a snapshot of the visualization of the
distribution of item utility values.
Chapter 5. Utility-based Data Format
This chapter describes the format of the input data files. UP-Miner supports
three kinds of utility-based databases, namely transactional databases, complex event
sequences, and sequence databases.
5.1 Utility-based Transactional Database
The HUP mining technologies [Mining High Utility Patterns], [Mining Top-k
High Utility Patterns] and [Mining Concise High Utility Patterns] take as input a
utility-based transactional database. Each line of the database consists of: (1) a set
of items (the first column), (2) the sum of the utilities (e.g., profit) of these items
in this transaction (the second column), and (3) the utility of each item in this
transaction (e.g., the profit of each item in this transaction) (the third column).
Note that the value in the second column of each line is the sum of the values in
the third column. An example database named “HUI Mining - Example
Database.txt” is provided in the folder [Example Database].
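
The three-column layout described above can be read and checked with a few lines of Java. The sketch below is hypothetical and is not UP-Miner code. The text does not state which characters separate the columns, so the sketch ASSUMES that columns are separated by a colon and that values inside a column are separated by whitespace (for example "1 3 4:9:2 4 3"); the example database file should be consulted for the delimiters actually used. The parser also enforces the rule that the second column equals the sum of the third column.

```java
import java.util.Arrays;

// Hypothetical helper, not part of UP-Miner: parses one line of a
// utility-based transactional database.
// ASSUMPTION: columns are separated by ':' and values inside a column by
// whitespace, e.g. "1 3 4:9:2 4 3". The real delimiters may differ.
final class UtilityTransaction {
    final int[] items;       // column 1: the items of the transaction
    final int totalUtility;  // column 2: sum of the item utilities
    final int[] utilities;   // column 3: utility of each item, in item order

    private UtilityTransaction(int[] items, int totalUtility, int[] utilities) {
        this.items = items;
        this.totalUtility = totalUtility;
        this.utilities = utilities;
    }

    static UtilityTransaction parse(String line) {
        String[] columns = line.trim().split(":");
        if (columns.length != 3) {
            throw new IllegalArgumentException("expected 3 columns: " + line);
        }
        int[] items = toInts(columns[0]);
        int total = Integer.parseInt(columns[1].trim());
        int[] utilities = toInts(columns[2]);
        if (items.length != utilities.length) {
            throw new IllegalArgumentException("one utility per item expected: " + line);
        }
        // Column 2 must equal the sum of column 3.
        if (Arrays.stream(utilities).sum() != total) {
            throw new IllegalArgumentException("column 2 is not the sum of column 3: " + line);
        }
        return new UtilityTransaction(items, total, utilities);
    }

    private static int[] toInts(String column) {
        return Arrays.stream(column.trim().split("\\s+"))
                .mapToInt(Integer::parseInt)
                .toArray();
    }

    public static void main(String[] args) {
        UtilityTransaction t = parse("1 3 4:9:2 4 3");
        System.out.println(t.items.length + " items, total utility " + t.totalUtility);
    }
}
```

Rejecting inconsistent lines at load time keeps a malformed input file from silently distorting the utilities reported by a mining run.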
5.2 Utility-based Complex Event Sequence
The input of the HUE mining technology [Mining High Utility Episodes] is a
utility-based complex event sequence. The corresponding example database,
named “HUE Mining - Example Database.txt”, is in the folder [Example
Database]. Each line of the database represents the information of a transaction at a
time stamp. For example, the first and the second lines represent the information of
transactions at time stamp 1 and time stamp 2, respectively. Each line of the database
consists of (1) a set of items, (2) the sum of the utilities (e.g., the profit) of these items
in this transaction, and (3) the utility of each item in this transaction (e.g., the profit
of each item in this transaction).
5.3 Utility-based Sequence Database
The input of the HUSP mining technology [Mining High Utility Sequential
Patterns] consists of a utility-based sequence database and a profit table. The
corresponding example files, named “HUSP Mining - Example Database.txt” and
“HUSP Mining - Example Profit Table.txt”, are in the folder [Example Database].
Each line in “HUSP Mining - Example Database.txt” represents a sequence of
itemsets, where each itemset is delimited by brackets. Each item in an itemset is
followed by its quantity, with a colon separating the two. Each line in “HUSP
Mining - Example Profit Table.txt” gives the profit of one item: the first
value is the profit of item 1, the second value is the profit of item 2,
and so on. The number of values in “HUSP Mining - Example Profit Table.txt” is
equal to the number of distinct items.
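
The following Java sketch shows how a sequence line and the profit table could be combined to obtain the utility of a sequence, taking the utility of an item occurrence as its quantity multiplied by the item's profit. The class UtilitySequence and its methods are hypothetical and are not UP-Miner code. The text does not say which kind of bracket is used or how item and quantity pairs are separated, so the sketch ASSUMES square brackets and whitespace between pairs (for example "[1:2 3:1][2:4]") and integer profits; the example files should be consulted for the exact syntax.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of UP-Miner.
final class UtilitySequence {

    // ASSUMPTION: each itemset is wrapped in square brackets and its
    // item:quantity pairs are separated by whitespace, e.g. "[1:2 3:1][2:4]".
    private static final Pattern ITEMSET = Pattern.compile("\\[([^\\]]*)\\]");

    // Parses one line into a list of itemsets; each element is {item, quantity}.
    static List<List<int[]>> parse(String line) {
        List<List<int[]>> sequence = new ArrayList<>();
        Matcher m = ITEMSET.matcher(line);
        while (m.find()) {
            List<int[]> itemset = new ArrayList<>();
            for (String pair : m.group(1).trim().split("\\s+")) {
                if (pair.isEmpty()) {
                    continue; // tolerate an empty itemset "[]"
                }
                String[] parts = pair.split(":");
                itemset.add(new int[] {Integer.parseInt(parts[0]), Integer.parseInt(parts[1])});
            }
            sequence.add(itemset);
        }
        return sequence;
    }

    // Utility of a whole sequence: sum of quantity * profit over all item occurrences.
    // profits[i] is the profit of item i + 1, following the profit table order.
    static long utility(List<List<int[]>> sequence, int[] profits) {
        long total = 0;
        for (List<int[]> itemset : sequence) {
            for (int[] itemAndQuantity : itemset) {
                total += (long) itemAndQuantity[1] * profits[itemAndQuantity[0] - 1];
            }
        }
        return total;
    }

    public static void main(String[] args) {
        int[] profits = {5, 3, 1}; // made-up profits of items 1, 2 and 3
        List<List<int[]>> sequence = parse("[1:2 3:1][2:4]");
        System.out.println("itemsets: " + sequence.size()
                + ", sequence utility: " + utility(sequence, profits));
    }
}
```

With the made-up profits 5, 3 and 1, the sequence above has utility 2 × 5 + 1 × 1 + 4 × 3 = 23.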
Chapter 6. How to Use UP-Miner
To use UP-Miner, users perform the following steps: (1) choose the algorithm,
(2) select the input file, (3) set the user-specified parameters, (4) set the
output file name and (5) click the “Run Algorithm” button. The mining results
are then shown in the text area of UP-Miner.
Chapter 7. Applications
With the rapid advancement of research on HUP mining, numerous applications
in different domains have been proposed in recent years. In the following, we describe
a few important applications.
7.1 Mobile Commerce
With the development of IoT (Internet of Things) technologies and
sensor-enabled devices, such as smartphones, wireless networks and GPS devices,
information about users’ locations and payment records can be acquired and
integrated. In such a scenario, HUP mining technology can be used to discover valuable
user behaviors in mobile environments. For example, Shie et al. [14] have proposed a
new framework named high utility mobile sequential pattern mining for discovering
associations between customers’ purchase behaviors and location trajectories in
mobile environments. Discovered patterns can be utilized for location-based
advertisements, navigational services, location-based recommendation systems, and
many other applications essential to mobile commerce.
7.2 Web Mining
In web mining, users’ browsing and purchasing behaviors are recorded in web
transactional logs. In such data, a user’s browsing time on a webpage can be
expressed as the internal utility of the web page and each web page may have a
different importance depending on users’ preference (i.e., external utility). Web site
managers can use HUP mining technology to discover utility-based patterns, such as
high utility access patterns [2] and high utility traversal patterns [15] in web
transactions. The mined results can be used in electronic commerce to improve
website services, provide efficient access to related web pages, offer navigation
suggestions for traversing web pages, and improve the design of web pages.
7.3 Biomedicine
An important application of HUP mining in biomedicine is gene expression data
analysis. In gene expression data, each row represents a set of genes and their
expression levels (i.e., internal utility) under an experimental condition. Furthermore,
each gene has a degree of importance for biological processes (i.e., external utility).
Mining HUPs in such data can discover interesting relationships between genes. For
instance, Liu et al. [8] have applied HUP mining technology to gene expression
analysis and successfully found several novel gene regulations from time course
comparative gene expression data. The mined results can help medical researchers to
develop new drugs for the treatment of diseases.
References
[1] R. Agrawal and R. Srikant, “Fast Algorithms for Mining Association Rules,” Proc.
of Int’l Conf. on Very Large Data Bases, pp. 487-499, 1994.
[2] C. F. Ahmed, S. K. Tanbeer and B. Jeong, “Mining High Utility Web Access
Sequences in Dynamic Web Log Data,” Proc. of Int’l Conf. on Software
Engineering Artificial Intelligence Networking and Parallel/Distributed
Computing, pp. 76-81, 2010.
[3] C. F. Ahmed, S. K. Tanbeer, B. Jeong and Y. Lee, “Efficient Tree Structures for
High Utility Pattern Mining in Incremental Databases,” IEEE Transactions on
Knowledge and Data Engineering, Vol. 21, Issue 12, pp. 1708-1721, 2009.
[4] C. F. Ahmed, S. K. Tanbeer and B. Jeong, “A Novel Approach for Mining
High-Utility Sequential Patterns in Sequence Databases,” ETRI Journal, Vol. 32,
No. 5, pp. 676-686, 2010.
[5] P. Fournier-Viger, “FHN: Efficient Mining of High-Utility Itemsets with Negative
Unit Profits,” Proc. of Int’l Conf. on Advanced Data Mining and Applications,
pp. 16-29, 2014.
[6] P. Fournier-Viger, C. Wu, S. Zida and V. S. Tseng, “FHM: Faster High-Utility
Itemset Mining Using Estimated Utility Co-occurrence Pruning,” Proc. of Int’l
Symposium on Methodologies for Intelligent Systems, pp. 83-92, 2014.
[7] J. Han, J. Pei and Y. Yin, “Mining Frequent Patterns without Candidate
Generation,” Proc. of the ACM SIGMOD Int’l Conf. on Management of Data, pp.
1-12, 2000.
[8] Y. Liu, C. Cheng and V. S. Tseng, “Mining Differential Top-k Co-expression
Patterns from Time Course Comparative Gene Expression Datasets,” BMC
Bioinformatics, 14:230, 2013.
[9] C. Lin, W. Gan, T. Hong and J. Pan, “Efficient Mining High-Utility Itemsets with
Transaction Insertion,” Proc. of Int’l Conf. on Advanced Data Mining and
Applications, pp. 44-56, 2014.
[10] Y. Liu, W. Liao and A. Choudhary, “A Fast High Utility Itemsets Mining
Algorithm,” Proc. of the Utility-Based Data Mining Workshop, pp. 90-99, 2005.
[11] M. Liu and J. Qu, “Mining High Utility Itemsets without Candidate Generation,”
Proc. of ACM Int’l Conf. on Information and Knowledge Management, pp.
55-64, 2012.
[12] C. H. Li, C. Wu and V. S. Tseng, “Efficient Vertical Mining of High Utility
Quantitative Itemsets,” Proc. of Int’l Conf. on Granular Computing, pp. 155-160,
2014.
[13] Y. C. Lin, C. Wu and V. S. Tseng, “Mining High Utility Itemsets in Big
Data,” Proc. of the Pacific-Asia Conference on Knowledge Discovery and Data
Mining, pp. 649-661, 2015.
[14] B. Shie, H. Hsiao and V. S. Tseng, “Efficient Algorithms for Discovering High
Utility User Behavior Patterns in Mobile Commerce Environments,” Knowledge
and Information Systems, Vol. 37, Issue 2, pp. 363-387, 2013.
[15] M. Thilagu and R. Nadarajan, “Efficiently Mining of Effective Web Traversal
Patterns With Average Utility,” Proc. of Int’l Conf. on Communication,
Computing, and Security, pp. 444-451, 2012.
[16] V. S. Tseng, B. Shie, C. Wu and P. S. Yu, “Efficient Algorithms for Mining
High Utility Itemsets from Transactional Databases,” IEEE Transactions on
Knowledge and Data Engineering, Vol. 25, Issue 8, pp. 1772-1786, 2013.
[17] V. S. Tseng, C. Wu, P. Fournier-Viger and P. S. Yu, “Efficient Algorithms for
Mining the Concise and Lossless Representation of High Utility Itemsets,” IEEE
Transactions on Knowledge and Data Engineering, Vol. 27, Issue 3, pp. 726-739,
2015.
[18] S. Yen, J. Gu and Y. Lee, “Mining Sequential Purchasing Behaviors from
Customer Transaction Databases,” Proc. of Int’l Conf. on Systems, Man, and
Cybernetics, pp. 2933-2938, 2013.
[19] S. Yen and Y. Lee, “Mining High Utility Quantitative Association Rules,” Proc.
of Int’l Conf. on Data Warehousing and Knowledge Discovery, pp. 283-292,
2007.
[20] I. H. Witten and E. Frank, “Data Mining: Practical Machine Learning Tools and
Techniques,” Morgan Kaufmann, 2005.
[21] C. Wu, P. Fournier-Viger, P. S. Yu and V. S. Tseng, “Mining High Utility
Episodes in Complex Event Sequences,” Proc. of ACM SIGKDD Int’l Conf. on
Knowledge Discovery and Data Mining, pp. 536-544, 2013.
[22] C. Wu, B. Shie, V. S. Tseng and P. S. Yu, “Mining Top-k High Utility
Itemsets,” Proc. of ACM SIGKDD Int’l Conf. on Knowledge Discovery and Data
Mining, pp. 78-86, 2012.
[23] FIMI Repository. Available at: http://fimi.cs.helsinki.fi/
[24] Intelligent DataBase System Laboratory.
Available at: http://idb.csie.ncku.edu.tw/English/newwebpa/index2.php
[25] Illimine. Software available at: http://illimine.cs.uiuc.edu
[26] Knime. Software available at: http://www.knime.org/
[27] Mahout. Software available at: http://mahout.apache.org/
Biographical Sketch of the Authors
Vincent S. Tseng is currently a Professor in the Department of Computer
Science at National Chiao Tung University. He also currently serves as the
chair of the IEEE Computational Intelligence Society Tainan Chapter. He
served as the president of the Taiwanese Association for Artificial Intelligence
during 2011-2012 and as the director of the Institute of Medical Informatics of
National Cheng Kung University (NCKU) between August 2008 and July 2011. Between
2004 and 2007, he also served as the director of the Informatics Center in NCKU
Hospital. Dr. Tseng has a wide variety of research interests covering data mining, big
data, biomedical informatics, multimedia databases, and mobile and Web technologies. He
has published more than 300 research papers in refereed journals and international
conferences and holds 15 patents. He has been on the editorial board of a number
of journals including IEEE Transactions on Knowledge and Data Engineering, IEEE
Journal on Biomedical and Health Informatics, ACM Transactions on Knowledge
Discovery from Data, etc. He has also served as chair or program
committee member for a number of premier international conferences related to data
engineering, artificial and computational intelligence, including KDD, ICDM, SDM,
PAKDD, ICDE, CIKM, IJCAI, etc. He is also the recipient of the 2014 K. T. Li
Breakthrough Award.
Cheng Wei Wu received the Ph.D. degree from the Department of Computer
Science and Information Engineering at National Cheng Kung
University, Taiwan, in 2015. He is currently a postdoctoral
researcher in the College of Computer Science, National Chiao Tung
University, Taiwan. His research interests include data mining, utility pattern mining,
pattern discovery, machine learning and big data analytics.
Jun Han Lin is currently pursuing a Master’s degree in the Department of
Computer Science and Information Engineering at National Cheng Kung
University. His research interests include data mining, high utility pattern
mining, Hadoop, Spark, and big data analytics.
Philippe Fournier-Viger is an assistant professor at the University of
Moncton, Canada. He received the Ph.D. degree in Cognitive Computer
Science from the University of Quebec in Montreal in 2010. His research
interests include data mining, e-learning, intelligent tutoring systems,
knowledge representation and cognitive modeling. He is the author of the popular
SPMF data mining software.