Final Year Project Report
School of Computer Science
University of Manchester
Data Mining
Mining association rules
Zhanat Zhanibekov
BSc Computing for Business Applications
Supervisor: Ilias Petrounias
5th May 2010
Abstract
Nowadays, massive amounts of data are generated from various sources, including industry, science and the internet. As the amount of information grows exponentially, there is a need to process it efficiently and extract valuable information using data mining technologies. The aim of data mining is to extract hidden, predictive and potentially useful patterns from large databases.9 The subject is becoming a very active research area, and many different methodologies have been produced to solve industrial and scientific problems.
The objective of this project is to research the field of association rules discovery and to describe the process of developing a software application which extracts "useful patterns" from large datasets using the Apriori algorithm. This paper discusses the software development process as well as the theoretical aspects of the project.
Acknowledgements
I would like to take this opportunity to thank my supervisor Ilias Petrounias for his assistance
and motivation throughout the project.
Also, I would like to thank my family and friends for their constant support.
Table of Contents
Abstract ..................................................................................................................................... 2
Acknowledgements .................................................................................................................... 3
Table of Figures ......................................................................................................................... 6
Chapter 1: Introduction .............................................................................................................. 7
1.1 Overview .............................................................................................................................. 7
1.2 Outline of the Problem ......................................................................................................... 7
1.3 Project aims and objectives .................................................................................................. 7
1.4 Existing Data Mining Systems ............................................................................................. 8
1.5 Report Structure ................................................................................................................... 9
Chapter 2: Background ............................................................................................................ 10
2.1 Overview ............................................................................................................................ 10
2.2 Data Mining Motivation ..................................................................................................... 10
2.3 Data Mining Definition ...................................................................................................... 10
2.4 Knowledge Discovery in Databases ................................................................................... 11
2.5 Data Mining Methods ........................................................................................................ 13
2.6 Data Mining Challenges ..................................................................................................... 14
2.7 Summary ............................................................................................................................ 14
Chapter 3: Research ................................................................................................................. 15
3.1 Overview ............................................................................................................................ 15
3.2 Association Rules Discovery ............................................................................................. 15
3.3 Problem Definition ............................................................................................................. 15
3.4 Association Rule Algorithm ............................................................................................... 16
3.5 Apriori Algorithm .............................................................................................................. 17
3.6 Rule Generation ................................................................................................................. 18
3.7 Apriori algorithm improvements ....................................................................................... 19
3.7.1 Hash-based techniques .................................................................................................... 19
3.7.2 Transaction reduction ...................................................................................................... 19
3.7.3 Sampling ......................................................................................................................... 19
3.7.4 Partitioning ...................................................................................................................... 19
3.7.5 Dynamic Itemset Counting ............................................................................................. 20
3.8 Advanced association rule techniques ............................................................................... 20
3.8.1 Generalized association rules ......................................................................................... 20
3.8.2 Multiple-Level Association Rules .................................................................................. 21
3.8.3 Temporal Association Rules ........................................................................................... 22
3.8.4 Quantitative Association Rules ....................................................................................... 23
3.9 Summary ............................................................................................................................ 23
Chapter 4: Requirements and Design ...................................................................................... 24
4.1 Overview ............................................................................................................................ 24
4.2 Software development methodology ................................................................................. 24
4.3 Requirements definition ..................................................................................................... 25
4.4 Use Cases ........................................................................................................................... 27
4.5 System Overview Diagram ................................................................................................ 28
4.6 Activity Diagram ............................................................................................................... 30
4.7 Graphical User Interface Design ........................................................................................ 30
4.8 Database Design ................................................................................................................. 32
4.9 Summary ............................................................................................................................ 33
Chapter 5: Implementation ...................................................................................................... 34
5.1 Overview ............................................................................................................................ 34
5.2 Implementation tools ......................................................................................................... 34
5.2.1 Programming language ................................................................................................... 34
5.2.2 Database language .......................................................................................................... 34
5.2.3 DBMS ............................................................................................................................. 34
5.2.4 Development environment tools ..................................................................................... 34
5.3 Data structure ..................................................................................................................... 35
5.4 Database Loader implementation ...................................................................................... 36
5.5 Algorithm implementation ................................................................................................. 38
5.6 Association rule mining example ....................................................................................... 41
5.7 I/O association rules operation ........................................................................................... 44
5.8 Analyzer tool ...................................................................................................................... 44
5.9 Summary ............................................................................................................................ 45
Chapter 6: Testing and Evaluation .......................................................................................... 46
6.1 Overview ............................................................................................................................ 46
6.2 Testing Methods ................................................................................................................. 46
6.3 Unit and Functional Testing .............................................................................................. 46
6.4 Performance Testing .......................................................................................................... 47
6.5 Integration and System Testing ......................................................................................... 50
6.6 Evaluation .......................................................................................................................... 51
6.6.1 Development evaluation ................................................................................................. 51
6.6.2 System evaluation ........................................................................................................... 52
6.6.3 Performance evaluation .................................................................................................. 52
6.7 Summary ............................................................................................................................ 52
Chapter 7: Conclusion ............................................................................................................. 53
7.1 Overview ............................................................................................................................ 53
7.2 Personal Experience ........................................................................................................... 53
7.3 Challenges .......................................................................................................................... 53
7.4 Further Improvements ........................................................................................................ 54
References ............................................................................................................................... 55
Table of Figures
Figure 1: WEKA data mining software.................................................................................. 9
Figure 2: Data mining findings ............................................................................................ 10
Figure 3: KDD process ........................................................................................................ 11
Figure 4: CRISP-DM process .............................................................................................. 13
Figure 5. Market Basket transaction table ............................................................................ 15
Figure 6: Frequent itemset generation (Apriori algorithm) ................................................... 17
Figure 7: Rule generation in Apriori Algorithm29 ................................................................ 18
Figure 8: Procedure ap-genrules(fk, H1)29 ............................................................................ 19
Figure 9: Concept hierarchy ................................................................................................ 20
Figure 10: Rational Unified Process .................................................................................... 25
Figure 11: Use case diagram for Data mining application .................................................... 28
Figure 12: System structure ................................................................................................. 29
Figure 13: Activity Diagram................................................................................................ 30
Figure 14: High fidelity prototype for GUI .......................................................................... 31
Figure 15: Transaction data in hash map.............................................................................. 35
Figure 16: Database connection ........................................................................................... 36
Figure 17: Database retrieves the tables ............................................................................... 37
Figure 18: Candidate selection process ................................................................................ 38
Figure 19: Algorithm processing ......................................................................................... 40
Figure 20: Generated rules displayed in the table ................................................................. 43
Figure 21: Saving results process ........................................................................................ 44
Figure 22: Comparing tool. ................................................................................................. 45
Figure 23: Relation of min support to the processing time ................................................... 49
Figure 24: Graphical representation of performance (items - candidate) .............................. 50
Chapter 1: Introduction
1.1 Overview
This chapter gives an introduction to the project, describing the main goals and objectives to be achieved. Moreover, it shows the outline of the project, briefly describing each part.
1.2 Outline of the Problem
Over the last 20 years the amount of digital information has increased greatly and continues to grow exponentially.1 It is very hard to calculate the exact amount of digital information in the world. Data is generated by sensors, the internet, phones, cameras and research laboratories, so it requires ever more storage space. Below are some examples involving the processing of huge amounts of information.
In Switzerland, the experiment at the Large Hadron Collider at CERN's particle-physics laboratory produces 40 terabytes per second, far more than can be stored in its data storage, so the scientists try to analyze the data on the fly and the rest is discarded.2
In astronomy, in 2000 the Sloan Digital Sky Survey telescope in New Mexico acquired more data in its first few weeks than had been collected in the whole previous history of astronomy; over ten years, the telescope produced approximately 140 terabytes of data. Moreover, from 2016 the Large Synoptic Survey Telescope in Chile will collect the same amount of data every five days.3
Regarding the business area, Wal-Mart is the biggest retailer in the USA and it handles more than 1 million transactions every hour, so more than 2.5 petabytes (1 PB = 2^50 bytes) of data are stored in Wal-Mart's databases.3
A study by the International Data Corporation (IDC) shows that about 1,200 exabytes (1 EB = 2^60 bytes) of digital information will be produced in 2010.4 The world has an unimaginably huge amount of data, which offers new challenges and opportunities. Data analysis may identify business trends, help to diagnose diseases, solve scientific problems and much more. On the other hand, privacy and security protection will be harder to manage, and extra storage and processing technologies will be required.
The main problem is how to make sense of this large amount of data: we produce a great deal of data but extract relatively little knowledge from it. The solution is to use special technologies and methods to generate knowledge from data. This technology is called "Data Mining" and was introduced in the early 1990s.1
1.3 Project aims and objectives
The main purpose of the project is to design and implement a software application which uses an association rule algorithm to mine data from a database. Generally, the system should perform the following operations: scanning the database for transactional data, then applying an association rule algorithm to the "raw data" to extract "potentially useful" rules, which can be used by business analysts.
The following objectives need to be achieved in order to develop the system:
1. The system should connect to any type of database containing a transactional table with a particular structure.
2. The graphical user interface should be intuitively simple to navigate and provide help for users who need it.
3. Ensure that the system can handle a large amount of data and process it in a relatively short period of time.
4. Provide the user with an option to choose the desired items for the data mining process.
5. Provide a feature to save the results of the mining process and to open the saved files to view the generated association rules.
6. Develop a function to analyze and compare the results produced by the system.
1.4 Existing Data Mining Systems
As far as existing data mining products are concerned, the data mining marketplace has grown significantly over the last decade. A Rexer Analytics survey found that the most popular areas of data mining are CRM/marketing, academia, finance and IT/telecom. Additionally, it found that the most widely used data mining techniques are regression, decision trees and cluster analysis.5
Below is the list of the top 10 most popular data mining products according to the KDnuggets survey6:
1. SPSS/ SPSS Clementine
2. Salford Systems CART/MARS/TreeNet/RF
3. Rapid Miner
4. SAS
5. Angoss Knowledge Studio / Knowledge Seeker
6. KXEN
7. Weka
8. R
9. MS SQL
10. MATLAB
In order to start designing the application, the author researched several existing data mining systems. Some of them had a great influence on the implementation decisions taken during development. For instance, the WEKA project was studied comprehensively, because it has a similar set of features and was implemented using the same tools.
WEKA (Waikato Environment for Knowledge Analysis) is open source machine learning software written in Java. It can perform several data mining tasks such as data pre-processing, clustering, association, classification and visualisation. Figure 1 illustrates a classification analysis task, visualising the result using a decision tree.
Figure 1: WEKA data mining software7
The main advantages of the WEKA application are:
- It is open source software under the GNU General Public Licence.
- It has a complete library of data pre-processing and mining techniques.
- It is portable and can run on any platform.
- It has a user-friendly graphical interface which is easy to use.8
1.5 Report Structure
Chapter 1: Introduction: This chapter gives an overview of the project, describing its objectives and purpose.
Chapter 2: Background: This chapter explains the concept of Knowledge Discovery in Databases. Moreover, it discusses different data mining techniques and various issues in the data mining process.
Chapter 3: Research: This chapter discusses association rule mining concepts, including the Apriori algorithm for generating large itemsets.
Chapter 4: Requirements and Design: Defines the main requirements of the system and describes the high-level design aspects of the project. It provides system diagrams and models to give a better overview of the system.
Chapter 5: Implementation: Shows interesting aspects of the implementation process, including the technologies that have been used, and discusses the most challenging parts of the system and their solutions.
Chapter 6: Testing and Evaluation: Discusses the testing process and provides an evaluation of the system, its performance and the overall development process.
Chapter 7: Conclusion: Shows what has been achieved and learnt during the project.
Chapter 2: Background
2.1 Overview
This chapter describes the role of data mining in business and outlines the KDD process and different data mining techniques.
2.2 Data Mining Motivation
Data mining is a powerful tool when it comes to large amounts of data. There are many reasons to apply data mining: it can reduce costs, increase revenue, improve customer and user experience, and so on.
Nowadays the business sector is very competitive, and companies have to use analytical and data mining technologies to take a leading position in their area. Moreover, customers have access to a greater amount of information about products on the internet and will go for the better product or service. The top retailer will be able to provide better customer service and earn more profit, as it has information about what customers are most likely to buy, as shown in Figure 2.
Figure 2: Data mining findings
Without data mining technologies, companies may not achieve the desired profit, as they may make irrelevant offers and promote unwanted products or services to their customers. As a result, customer satisfaction may fall and revenue may be reduced.
2.3 Data Mining Definition
As far as the term "data mining" is concerned, various definitions have been made; however, some common characteristics can be identified.
Generally, data mining is the extraction or discovery of hidden, potentially useful, previously unknown patterns and relationships from large data sets.9
By analogy, it can be compared with gold mining: the process of sifting through a large amount of ore to find valuable nuggets.
2.4 Knowledge Discovery in Databases
Data mining is part of a bigger process called knowledge discovery in databases (KDD). KDD has been defined as "the non-trivial process of identifying novel, potentially useful and ultimately understandable patterns in data."10
As Figure 3 shows, the KDD process is iterative and complex, involving various subprocesses and decisions made by developers. The process consists of several main steps: data selection, data preprocessing, data transformation, data mining and interpretation/evaluation of the results.
Figure 3: KDD process11
Figure 3 shows that the KDD process involves the following steps:12
1. Setting the application domain. This includes finding relevant prior knowledge and setting the purpose of the application.
2. Choosing the data set. At this stage, only data relevant to the specified task is selected; discovery will later be performed on this data set.
3. Data cleaning and preprocessing. All noisy and inconsistent data is removed from the data set at this stage. It also includes collecting the information needed to model the noise, choosing an approach for dealing with missing data fields, and solving DBMS problems such as data types, schema issues and the mapping of missing/unknown values.
4. Data reduction and projection. At this point, the data is formatted and transformed into a proper representation. Data transformation includes smoothing, aggregation, generalisation, normalisation and attribute/feature construction.
5. Selecting the data mining functionality. According to the goal of the application domain, a data mining function is chosen; typical examples are classification, association and clustering.
6. Selecting the data mining algorithm. Because the user may pursue a specific aim in the data mining process, a particular algorithm for extracting patterns should be selected. For instance, models for categorical data differ from those for numeric data.
7. Applying data mining. Searching for interesting patterns and retrieving the potentially useful results.
8. Interpretation and evaluation. At this stage, the retrieved information is represented in a human-readable format and visualised. Evaluation includes statistical validation and significance testing, quality checks by domain experts, and pilot surveys to check the accuracy of the model.
9. Using the discovered knowledge. Finally, the discovered knowledge is used to resolve business or scientific issues. The useful knowledge is also documented and compared with other results.
There is another model for the KDD process, developed by DaimlerChrysler, SPSS and NCR in 1996. It is called the Cross Industry Standard Process for Data Mining (CRISP-DM) and was created for data mining processes in the industrial sector.13 The main purpose of CRISP-DM is to make data mining projects faster and cheaper, as typical data mining projects exceed their budgets and miss their deadlines. Moreover, the availability and quality of the data have a direct effect on data mining performance. Therefore, the focus should be on data analysis requirements and software design in order to minimize the data mining effort.
Figure 4 illustrates the CRISP-DM process, which consists of six main phases; the arrows show the data flow between them. CRISP-DM is a highly iterative model and its phases do not follow a strict order, since it focuses on project objectives and user requirements. Movement between phases during the process is quite common, as it helps to refine and improve existing decisions.
The first phase is Business Understanding, where the business objectives are defined, the project is planned and the data mining task is identified. This is the most important and challenging part of the project, because a clear understanding of the problem produces better results.
The next stage is Data Understanding. This stage includes analyzing the data and applying advanced statistical methods to it. For instance, if data is retrieved from different sources, it needs to be integrated while dealing with inconsistencies, missing values and outliers. Once the data is understood, the Data Preparation phase begins, where raw data is transformed into a readable format. Then the Modeling phase starts, where the user can choose the data mining functionality and specify the mining algorithm.
After building and testing the model, we move to the Evaluation step. During this process, we decide whether or not to proceed with deployment of the model in the business application; in other words, we check how well it satisfies the business objectives. Finally, in the Deployment phase, all the results of the data mining project are integrated and documented.14
12
Figure 4: CRISP-DM process15
2.5 Data Mining Methods
There is a large number of data mining methods; we will discuss the most popular techniques: classification, regression, clustering, sequence analysis, dependency modelling and summarization.
Classification involves two stages. The first stage is supervised learning on a training set of data to build the model; the second stage is the classification of data according to that model. Common examples are decision trees, neural networks, Bayesian classification and the k-nearest neighbour classifier. Decision trees are a top-down approach to classification, in which all data is categorised into node and leaf categories. A neural network is a predictive approach, based on learning from a prepared data set and applying the learnt knowledge to a bigger data set. Nearest-neighbour classifiers learn the training set by identifying the similarities within a group and use the resulting data to process the test data.16
Regression applies formulas to the existing data and makes predictions based on the results. For instance, linear regression is the simplest form of regression: it uses the straight-line formula y = k*x + b and finds suitable values of k and b to predict the value of y, which depends on x.17
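To make this concrete, the following is a minimal Java sketch (illustrative only, not part of the project's implementation; all names are assumptions) of fitting y = k*x + b to a data set by ordinary least squares:

public class SimpleLinearRegression {
    // Fit y = k*x + b by ordinary least squares; returns {k, b}.
    public static double[] fit(double[] x, double[] y) {
        int n = x.length;
        double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
        for (int i = 0; i < n; i++) {
            sumX += x[i];
            sumY += y[i];
            sumXY += x[i] * y[i];
            sumXX += x[i] * x[i];
        }
        // Standard closed-form least-squares estimates for slope and intercept.
        double k = (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
        double b = (sumY - k * sumX) / n;
        return new double[] { k, b };
    }
}

Given the fitted values of k and b, a prediction for a new x is then simply k*x + b.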
Clustering groups data into one or more category classes, which are not predefined and must be created from the data. The category classes are built based on similarity metrics or probability density models. There are several types of clustering algorithms, such as hierarchical, partitional and density-based algorithms. A common characteristic of clustering algorithms is that they require the number of clusters to produce from the input data set to be specified before the algorithm starts.18
Sequence analysis produces sequential patterns. The main objective of this method is to find frequent sequences in data.19
Dependency modelling (or association rules) describes associations between variables in a large data set. Market basket analysis is the most popular example, where the technique is applied to discover business trends and promote products. Moreover, it is widely used in web usage mining and bioinformatics.20
Summarization (characterization or generalization) groups data into subsets and provides a compact description for each set. More advanced techniques include summary rules, multivariate visualization techniques and functional relationships between variables. For instance, this technique can be applied to automate reporting.21
2.6 Data Mining Challenges
1. Mining tasks and user interaction issues: These include issues related to mining knowledge at different granularities, knowledge representation and the application of domain knowledge:22
a. Incorporation of background knowledge
b. Mining various types of knowledge
c. Interactive mining
d. Removing noisy and incomplete data
e. Visualisation of the mining results
f. Evaluation of the process and the interestingness of the patterns
2. Performance issues: These refer to the scalability, processing speed and parallelization of data mining techniques. The algorithms should process large amounts of data in a short period of time. Due to large database sizes, the computational complexity of data mining algorithms and the broad distribution of data sources, distributed and parallel algorithms have been developed to achieve better performance.22
3. Database compatibility issues: These include all issues related to mining different types of data from heterogeneous and global database systems. A data mining system should be able to handle both relational and complex (hypertext, spatial, temporal or multimedia) types of data. It is impossible for one software application to effectively process all kinds of data; therefore, data mining tools should be specialised for particular data types. Moreover, as distributed and heterogeneous databases become more widely used, data mining applications must support discovery from different sources, integrating structured and semi-/un-structured information. For instance, web mining is a very challenging but potentially very profitable area of data mining.22
2.7 Summary
Due to the exponential growth of data collected from different sources (business, science, industry etc.), there is demand for effective data mining and analysis applications, and there are numerous challenges concerning the effectiveness of data mining.
Data mining is the extraction of knowledge from large amounts of data, and it is part of a bigger process called knowledge discovery in databases. Finally, there is a wide range of data mining techniques, which were designed for different purposes.
Chapter 3: Research
3.1 Overview
In this chapter we present a methodology called association analysis and discuss its main properties. Furthermore, the implementation of an association rule algorithm is demonstrated and corresponding examples are illustrated. Finally, different types of rules are discussed.
3.2
Association Rules Discovery
As far as Association Rules has been concerned, it is technique for discovering interesting
relationship from the large data set. Association rules or sets of frequent item sets can
represent relationship between items. For instance, the association between two items can be
shown as the rule {Bread}  {Butter}. It shows that there are strong dependencies between
Bread and Butter as the consumers more likely to buy these two items together.
Nowadays, this methodology has been applied in different areas, such as medical diagnosis,
web mining and telecommunication. In fact, one of the widely researched areas was “basket
market analysis”. 23
3.3
Problem Definition
Regarding the “Market Basket Analysis”, it is the modeling technique which is use
association rules for prediction customer purchase behavior. Customer put some set of items
on their basket during their shopping process and software can register what kinds of
products are bought together. Finally, Marketers can apply this information to manage
inventory (selective inventory control), to promote their products (positioning items on the
particular place) and to conduct marketing campaign (targeting specific customer categories
and increasing customer satisfaction).24
Figure 5: Market basket transaction table24
Figure 5 shows typical market basket transaction data, collected from a store's cash registers. As the figure shows, the table consists of two columns, TID and ITEMS: the former contains the unique identifier of each transaction; the latter holds the set of products bought together.
Let I = {i1, i2, …, in} be a set of n distinct items (literals) in the basket data, and let D = {t1, t2, …, tn} be a database of transactions, where each transaction T has a unique identifier and contains a set of items such that T ⊆ I. An association rule is an implication of the form X → Y, where X and Y are itemsets and X ∩ Y = ∅.25 Here X is called the antecedent and Y the consequent.
With respect to the measures of association rules, there are two essential measures, called support (S) and confidence (C).
Support (S) is the frequency with which the rule occurs in the given data set: how often X and Y appear together out of the total number (N) of transactions.25
Support: S(X → Y) = q(X ∪ Y)/N
Example: the support of the relationship Bread → Milk is 60%, the number of times they are bought together (3) divided by the total number of purchases (5 transactions).
Confidence (C) is the strength of the association: how often the items in Y appear in transactions that contain X.25
Confidence: C(X → Y) = q(X ∪ Y)/q(X)
Example: consider the rule Bread, Milk → Diapers. The confidence of the rule is 66%, as the combination of Bread, Milk and Diapers appears together 2 times, while the support count of Bread and Milk is 3; therefore the confidence is 2/3 ≈ 0.67.25
Lift (L) is the ratio between the confidence of the rule and the support of the itemset in the rule's consequent. The motivation is that a high-confidence rule may be misleading, since confidence ignores the support of the itemset in the rule consequent.
Lift: L(A → B) = C(A → B)/S(B)
Example: the confidence of Bread → Milk is 75% and the support of Milk is 80%, so the lift is 0.75/0.80 = 0.9375 (negative correlation). If the lift is less than 1 the correlation is negative; otherwise it is positive.
For binary variables the lift has the same value as the interest factor (I), the ratio of the observed support to that expected by chance:26
I(A, B) = S(A, B)/(S(A) × S(B))
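To illustrate how these measures can be computed in practice, here is a minimal Java sketch (illustrative only, not the project's actual code; the class name and sample data are assumptions, chosen to be consistent with the Bread/Milk/Diapers examples above):

import java.util.List;
import java.util.Set;

public class RuleMeasures {
    // Count the transactions that contain every item of the given itemset.
    static long count(List<Set<String>> db, Set<String> itemset) {
        return db.stream().filter(t -> t.containsAll(itemset)).count();
    }

    public static void main(String[] args) {
        List<Set<String>> db = List.of(
            Set.of("Bread", "Milk"),
            Set.of("Bread", "Diapers", "Beer", "Eggs"),
            Set.of("Milk", "Diapers", "Beer", "Cola"),
            Set.of("Bread", "Milk", "Diapers", "Beer"),
            Set.of("Bread", "Milk", "Diapers", "Cola"));

        Set<String> x = Set.of("Bread", "Milk");             // antecedent X
        Set<String> y = Set.of("Diapers");                   // consequent Y
        Set<String> xy = Set.of("Bread", "Milk", "Diapers"); // X union Y

        double n = db.size();
        double support = count(db, xy) / n;                        // q(X u Y)/N = 2/5
        double confidence = (double) count(db, xy) / count(db, x); // 2/3
        double lift = confidence / (count(db, y) / n);             // C(X->Y)/S(Y)
        System.out.printf("S=%.2f C=%.2f L=%.2f%n", support, confidence, lift);
    }
}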
3.4 Association Rule Algorithm
In association rule discovery, given a set of transactions T, the task is to discover all rules whose support (S) is greater than a minimum support threshold and whose confidence (C) is greater than a minimum confidence threshold.
The association rule algorithm consists of two steps:27
1. Frequent itemset generation: extract all itemsets that occur more frequently than the minimum support threshold. These itemsets are called frequent itemsets.
2. Rule generation: generate all high-confidence rules from the frequent itemsets produced in the first step. These rules are called "strong rules".
3.5 Apriori Algorithm
With regard to the implementation of frequent itemset generation, Apriori is an influential algorithm for learning association rules. It was introduced by Agrawal in 1994, and it pioneered the use of support-based pruning to deal with the exponential growth of the candidate itemsets.28
The pseudocode below demonstrates the frequent itemset generation process of the Apriori algorithm. Let Ck be the set of candidate k-itemsets and Fk the set of frequent k-itemsets.
Steps 1-2: At the beginning, the algorithm runs through the whole data set and counts each item's support. It then produces F1, the set of all frequent 1-itemsets.
Steps 3-5: Next, the algorithm iteratively produces new candidate k-itemsets based on the frequent (k-1)-itemsets of the previous iteration, using the apriori-gen(Fk-1) function to generate the candidates.
Steps 6-11: After that, the support of the candidates is calculated by a pass over the data set. The candidate itemsets in Ck contained in each transaction t are found by the subset function, subset(Ck, t).
Step 12: All candidates that do not satisfy the minimum support threshold (minsup) are removed, so that only frequent itemsets remain.
Steps 13-14: If no new frequent itemsets are produced (Fk = ∅), the algorithm terminates and all frequent itemsets are combined for the rule generation process.
1:  k = 1
2:  Fk = {i | i ∈ I ∧ q({i}) ≥ N × minsup}      // identify all frequent 1-itemsets
3:  repeat
4:      k = k + 1
5:      Ck = apriori-gen(Fk-1)                  // produce candidate itemsets
6:      for each transaction t ∈ T do
7:          Ct = subset(Ck, t)                  // find all candidates that are subsets of t
8:          for each candidate itemset c ∈ Ct do
9:              q(c) = q(c) + 1
10:         end for
11:     end for
12:     Fk = {c | c ∈ Ck ∧ q(c) ≥ N × minsup}   // generate the frequent k-itemsets
13: until Fk = ∅
14: Result = ∪ Fk
Figure 6: Frequent itemset generation (Apriori algorithm)29
Apriori-gen(Fk-1) is used to generate the candidate itemsets. This function consists of two steps:
1. Candidate generation: Ck is generated by joining Fk-1 with itself.
2. Candidate pruning: any candidate containing an infrequent (k-1)-subset is eliminated, as it cannot be a subset of a frequent k-itemset. This minimizes the number of candidate itemsets when support counting is performed.
Subset(Ck, t) is the support counting function: it counts the number of occurrences of every candidate in the database and is performed on all candidates that passed through apriori-gen(Fk-1). An effective way to implement this function is to enumerate the itemsets contained in each transaction and update the support value of the corresponding candidate itemsets.
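As a concrete illustration of the pseudocode in Figure 6, here is a small Java sketch (a simplified sketch only, not the project's actual implementation; the apriori-gen join is simplified to merging frequent (k-1)-itemsets that differ in a single item, followed by the subset-based pruning step):

import java.util.*;

public class MiniApriori {
    public static List<Set<Set<String>>> apriori(List<Set<String>> db, double minsup) {
        int n = db.size();
        List<Set<Set<String>>> result = new ArrayList<>();

        // Steps 1-2: find all frequent 1-itemsets F1.
        Map<Set<String>, Integer> counts = new HashMap<>();
        for (Set<String> t : db)
            for (String item : t)
                counts.merge(Set.of(item), 1, Integer::sum);
        Set<Set<String>> fk = prune(counts, n, minsup);

        // Steps 3-13: repeat until no new frequent itemsets are produced.
        while (!fk.isEmpty()) {
            result.add(fk);
            Set<Set<String>> candidates = aprioriGen(fk);   // step 5
            counts = new HashMap<>();
            for (Set<String> t : db)                        // steps 6-11: support counting
                for (Set<String> c : candidates)
                    if (t.containsAll(c))
                        counts.merge(c, 1, Integer::sum);
            fk = prune(counts, n, minsup);                  // step 12
        }
        return result;                                      // step 14: all Fk levels
    }

    // Keep only candidates whose support count reaches N * minsup.
    static Set<Set<String>> prune(Map<Set<String>, Integer> counts, int n, double minsup) {
        Set<Set<String>> frequent = new HashSet<>();
        for (Map.Entry<Set<String>, Integer> e : counts.entrySet())
            if (e.getValue() >= n * minsup)
                frequent.add(e.getKey());
        return frequent;
    }

    // Simplified apriori-gen: join pairs of frequent (k-1)-itemsets, then drop
    // candidates that contain an infrequent (k-1)-subset.
    static Set<Set<String>> aprioriGen(Set<Set<String>> fkMinus1) {
        Set<Set<String>> candidates = new HashSet<>();
        for (Set<String> a : fkMinus1)
            for (Set<String> b : fkMinus1) {
                Set<String> joined = new HashSet<>(a);
                joined.addAll(b);
                if (joined.size() == a.size() + 1)
                    candidates.add(joined);
            }
        candidates.removeIf(c -> {
            for (String item : c) {                 // every (k-1)-subset must be frequent
                Set<String> subset = new HashSet<>(c);
                subset.remove(item);
                if (!fkMinus1.contains(subset)) return true;
            }
            return false;
        });
        return candidates;
    }
}

For large data sets, the nested candidate loop above would be replaced by a proper subset(Ck, t) function, e.g. one based on a hash tree, as described above.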
3.6 Rule Generation
With respect to rule generation, the second task is to efficiently extract association rules from the frequent itemsets. The algorithm applies a level-wise technique for generating association rules: each rule is positioned at a particular level, whose index equals the number of items in the rule consequent.
Figure 7 shows the pseudocode for rule generation. Initially, it collects all the 1-item consequents and stores them in H1. It then calls ap-genrules(fk, H1), which is shown in Figure 8. Finally, it terminates after all rules have been generated.
1: for each frequent k-itemset fk, k ≥ 2 do
2:     H1 = {i | i ∈ fk}
3:     call ap-genrules(fk, H1)
4: end for
Figure 7: Rule generation in Apriori Algorithm29
Steps 1-2: store the size of the frequent itemset in k and the size of the rule consequent in m.
Step 3: the condition is that the size of the frequent itemset is greater than the size of the rule consequent plus one. If it is satisfied, rule generation continues; otherwise the algorithm ends.
Step 4: apriori-gen(Hm) is called to generate the new candidate consequents for the association rules; the method is similar to the one used in frequent itemset generation.
Steps 5-12: for every rule, the confidence (conf) is calculated by dividing the support values that were counted during frequent itemset generation.
Step 13: ap-genrules(fk, Hm+1) is called to generate rules for consequents of size m+1.
1:  k = |fk|   // frequent itemset size
2:  m = |Hm|   // rule consequent size
3:  if k > m + 1 then
4:      Hm+1 = apriori-gen(Hm)
5:      for each hm+1 ∈ Hm+1 do
6:          conf = q(fk) / q(fk - hm+1)
7:          if conf ≥ minconf then
8:              output the rule (fk - hm+1) → hm+1
9:          else
10:             delete hm+1 from Hm+1
11:         end if
12:     end for
13:     call ap-genrules(fk, Hm+1)
14: end if
Figure 8: Procedure ap-genrules(fk, H1)29
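A minimal Java sketch of the confidence check in steps 5-10 (illustrative only; it handles just the 1-item consequents of H1 and assumes a map q of support counts produced by the frequent itemset generation phase — all names are hypothetical):

import java.util.*;

public class MiniRuleGen {
    // Emit all rules with a 1-item consequent from the frequent itemset fk,
    // given the support counts q gathered during frequent itemset generation.
    static void genRules(Set<String> fk, Map<Set<String>, Integer> q, double minconf) {
        for (String h : fk) {
            Set<String> antecedent = new HashSet<>(fk);
            antecedent.remove(h);
            if (antecedent.isEmpty()) continue;
            double conf = (double) q.get(fk) / q.get(antecedent); // step 6
            if (conf >= minconf)                                   // steps 7-8
                System.out.printf("%s -> %s (conf=%.2f)%n", antecedent, h, conf);
            // In the full procedure, surviving consequents are merged by
            // apriori-gen into (m+1)-item consequents and ap-genrules recurses.
        }
    }
}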
3.7 Apriori algorithm improvements
3.7.1 Hash-based techniques
This method can greatly reduce the size of the candidate k-itemsets examined. During the database scan that generates the frequent 1-itemsets, the candidate 2-itemsets are also generated and stored (hashed) into the buckets of a hash table, and the corresponding bucket counts are incremented. If a 2-itemset's bucket count does not satisfy the minimum support, the itemset is eliminated from the candidate set.30
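A minimal sketch of the idea in Java (illustrative only; the hash function and bucket count are assumptions): while scanning transactions, every 2-itemset of each transaction is hashed into a bucket, and a candidate pair is kept only if its bucket count can still reach N × minsup:

import java.util.List;
import java.util.Objects;

public class HashPruning {
    // Hash a pair of items to a bucket, in canonical order so that (a, b)
    // and (b, a) always land in the same bucket.
    static int bucketOf(String a, String b, int numBuckets) {
        String lo = a.compareTo(b) <= 0 ? a : b;
        String hi = a.compareTo(b) <= 0 ? b : a;
        return Math.floorMod(Objects.hash(lo, hi), numBuckets);
    }

    // Count every 2-itemset of every transaction into the bucket array.
    static int[] buildBuckets(List<List<String>> db, int numBuckets) {
        int[] buckets = new int[numBuckets];
        for (List<String> t : db)
            for (int i = 0; i < t.size(); i++)
                for (int j = i + 1; j < t.size(); j++)
                    buckets[bucketOf(t.get(i), t.get(j), numBuckets)]++;
        return buckets;
    }

    // A candidate pair can only be frequent if its bucket count reaches N * minsup.
    static boolean mayBeFrequent(int[] buckets, String a, String b, int n, double minsup) {
        return buckets[bucketOf(a, b, buckets.length)] >= n * minsup;
    }
}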
3.7.2 Transaction reduction
This approach is based on decreasing the number of transactions processed in future iterations. If a transaction does not contain any frequent k-itemsets, it cannot contain any frequent (k+1)-itemsets; by this rule, such transactions can be removed from future iterations.30
3.7.3 Sampling
The sampling technique is used when the efficiency of the algorithm is the priority; for instance, it can be important for applications that mine huge datasets regularly. Initially, a sample data set S is selected from the provided data D, and the frequent itemsets are then generated from S rather than from D.
The results can be less accurate, because only S is searched for frequent itemsets and some globally frequent itemsets can be missed. The solution is to use a minimum support value lower than the given threshold to find the itemsets frequent locally in S (LS). The frequency of each itemset in LS is then computed against the rest of the database; this is used to check whether LS contains all the globally frequent itemsets. If some candidates were missed, a second scan is performed in order to find all frequent candidates. In the best case, when all frequent candidates are found, only one scan is required.30
3.7.4 Partitioning
The partitioning technique requires only two database scans in order to generate the frequent itemsets. The process consists of two stages:
1. The database of transactions (D) is split into n non-overlapping partitions. The minimum support count for a particular partition is the global minimum support fraction multiplied by the number of transactions in that partition. The frequent itemsets of each partition (the local frequent itemsets) are calculated and stored in a special data structure that records, for each itemset, the TIDs of the transactions containing it; as a result, the database is scanned only once in this stage.
2. In the second stage, the actual support of each local frequent itemset is checked against the whole database to identify the globally frequent itemsets. The database is scanned only once more, and each partition can be held in main memory.30
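A minimal Java sketch of this two-phase flow (illustrative only; localFrequentItemsets stands in for any in-memory miner, here reusing the MiniApriori sketch shown earlier):

import java.util.*;

public class PartitionMining {
    static Set<Set<String>> mine(List<Set<String>> db, double minsup, int parts) {
        int n = db.size();
        int chunk = (n + parts - 1) / parts;
        Set<Set<String>> globalCandidates = new HashSet<>();

        // Stage 1 (first scan): collect the local frequent itemsets of each partition.
        for (int start = 0; start < n; start += chunk)
            globalCandidates.addAll(
                localFrequentItemsets(db.subList(start, Math.min(start + chunk, n)), minsup));

        // Stage 2 (second scan): verify the actual global support of each candidate.
        Map<Set<String>, Integer> counts = new HashMap<>();
        for (Set<String> t : db)
            for (Set<String> c : globalCandidates)
                if (t.containsAll(c))
                    counts.merge(c, 1, Integer::sum);

        Set<Set<String>> frequent = new HashSet<>();
        counts.forEach((itemset, count) -> {
            if (count >= n * minsup) frequent.add(itemset);
        });
        return frequent;
    }

    // Any in-memory frequent itemset miner can be plugged in here.
    static Set<Set<String>> localFrequentItemsets(List<Set<String>> part, double minsup) {
        Set<Set<String>> all = new HashSet<>();
        for (Set<Set<String>> level : MiniApriori.apriori(part, minsup))
            all.addAll(level);
        return all;
    }
}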
3.7.5 Dynamic Itemset Counting
The idea of this technique is that new candidate itemsets can be added at any point during the database scan. Initially, the database is divided into blocks marked by start points. The support values are counted dynamically over the itemsets processed so far, and new candidates are added during the process once all of their subsets are known to be frequent.30
3.8 Advanced association rule techniques
This section discusses several association rule generation techniques that involve more complicated concepts than the basic rules.
3.8.1 Generalized association rules
This type of association rule uses a concept hierarchy that shows the set relationships between various elements, allowing rules to be generated at different levels. The definition of generalized association rules is almost the same as for regular association rules X → Y, with the added constraint that no item in Y may be an ancestor of any item in X.31
As an example, Figure 9 shows a partial concept hierarchy for clothes. It can be seen that white boots are a subtype of boots, and boots are a subcategory of shoes. The rule Boots → Shoe Cream has lower support and confidence than Shoes → Shoe Cream, because the number of transactions containing shoes is larger than the number containing boots. Likewise, Black Boots → Shoe Cream has lower support and confidence values than Shoes → Shoe Cream.
[Figure 9 shows a partial concept hierarchy: Clothes branches into Shoes (Slippers, Derbys, Boots - with White, Brown and Black subtypes), Jeans, Jackets, Shorts and T-Shirts (Crew Neck, Raglan).]
Figure 9: Concept hierarchy
There are several algorithms for generating generalized rules. For instance, transactions can be expanded by adding, for each item, all of its ancestors in the hierarchy.
3.8.2 Multiple-Level Association Rules
This type of association rule is a subtype of generalized association rules: each item has a set of relevant attributes, and these attributes represent a set of multiple-level concepts.30
Table 1 illustrates the transaction table used for multiple-level association rules. Here, computer peripheral items are described by the "Category", "Content" and "Brand" attributes, which represent the first-, second- and third-level concepts respectively. Therefore, for each item in the transactional database there is a set of domain values. An item in the database is described as an "HP laser printer" if its "category", "content" and "brand" columns contain the Printer, Laser and HP values correspondingly.
Table 1: Transaction table
Category | Content  | Brand
Printer  | Laser    | HP
Mouse    | Wireless | Apple
…        | …        | …
Notebook | 17 inch  | Sony Vaio
The main difference is that itemsets may be taken from any concept level in the hierarchy. Therefore, multiple-level association rules allow more specific and concrete knowledge to be discovered from the data.
The concept hierarchy can be traversed top-down, and frequent itemsets can be generated using a variation of the Apriori algorithm. After generation at level k, frequent itemsets can be produced at the next level (k+1). Furthermore, the frequent n-itemsets generated at the first level of the concept hierarchy are used as candidates for producing the frequent n-itemsets of their children at the lower levels.
In the table above, "Printer" is at the first concept level, "Laser printer" at the second level and "HP laser printer" at the third. A minimum confidence and support threshold is also specified for each level.30
3.8.3 Temporal Association Rules
Temporal association rules additionally involve the discovery of useful time-related rules. A temporal rule has the form <AR, TF>, where AR is an association rule implication A → B and TF is the temporal feature attached to AR.32 The temporal feature TF states that during each interval TP in f(TF), the existence of X in a database transaction implies the existence of Y in the same transaction:
1. AR has confidence C% in a particular time period TP, TP ∈ f(TF), if C% of the transactions in D(TP) that contain X also contain Y.
2. AR has support S% in a particular time period TP, TP ∈ f(TF), if S% of the transactions in D(TP) contain both X and Y.
3. AR has temporal feature TF with rate R% in the transaction database D if, during at least R% of the periods of f(TF), it satisfies the minimum confidence min_C% and the minimum support min_S%.33
Examples of temporal features are specific periods of time or calendar time expressions; for instance, "year*month(3-5)" can describe any spring period. The main challenge of implementing temporal association rules is that generating all the useful rules is very costly due to the two-dimensional solution space.
An example transactional table for temporal data mining is shown in Table 2 below. It consists of three columns: the transaction ID, the item names and the time when the transaction happened.
Table 2: Temporal transactional table
TID   | ITEMS         | Date
10001 | a, c, e       | <*,01,09>
10002 | a, d, e, f    | <*,02,09>
10003 | d, e, f       | <*,02,09>
10004 | a, d, e       | <*,05,09>
10005 | a, b, c, d, f | <*,05,09>
…     | …             | …
A practical business motivation for temporal association rules is seasonality: many supermarkets, for example, now dedicate aisles to the sale of seasonal products.
3.8.4 Quantitative Association Rules
Regarding quantitative association rules, this type combines both categorical and quantitative data. These rules contain continuous attributes, which may reveal potentially useful information in the business market.31
The main advantage of these rules is that they provide more detailed results, as they extract rules over a multi-dimensional solution space.34 For example, an internet survey revealed that "users whose annual salary is greater than $120K belong to the 45-60 age group". The data collected by the internet survey is illustrated in Table 3 below.35
Table 3: Quantitative transaction table35
Gender | Age | Annual Income | Hours/week using internet | Email account quantity | Privacy concerned
F      | 26  | 90K           | 20                        | 4                      | +
M      | 51  | 135K          | 10                        | 2                      | -
M      | 29  | 80K           | 10                        | 3                      | +
F      | 45  | 120K          | 15                        | 3                      | +
F      | 31  | 95K           | 20                        | 5                      | +
M      | 25  | 55K           | 25                        | 5                      | -
…      | …   | …             | …                         | …                      | …
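In practice, continuous attributes such as Age or Annual Income are usually discretized into intervals first, so that each (attribute, interval) pair can be treated as an ordinary boolean item by the standard algorithm. A minimal Java sketch of such preprocessing (illustrative only, using equal-width bins; the names and ranges are assumptions):

public class Discretizer {
    // Map a numeric value to an item label such as "Age=[34,49)" using
    // equal-width bins over the range [min, max].
    static String binLabel(String attribute, double value, double min, double max, int bins) {
        int bin = (int) ((value - min) / (max - min) * bins);
        if (bin == bins) bin--;                 // put the maximum value into the last bin
        double width = (max - min) / bins;
        double lo = min + bin * width;
        return String.format("%s=[%.0f,%.0f)", attribute, lo, lo + width);
    }

    public static void main(String[] args) {
        // Age 45 with range 18-80 and 4 bins becomes the item "Age=[34,49)".
        System.out.println(binLabel("Age", 45, 18, 80, 4));
    }
}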
3.9 Summary
In summary, the basic concepts of association analysis have been discussed. Moreover, the Apriori algorithm has been reviewed in detail and related examples have been provided.
Chapter 4: Requirements and Design
4.1 Overview
In this section, the specification and design of the application for mining association rules are described, and the process of capturing and defining requirements is explored. Different design solutions and the system structure are then illustrated to provide a simpler transition to the implementation phase.
4.2 Software development methodology
As far as the software development process is concerned, it describes the way a software system is designed, built and deployed.36 It is therefore important to select the right development process before the start of the project. There are many different types of models to choose from; however, the main criterion here was a flexible and open methodology.
The Unified Process (UP) is a well-known iterative methodology for building object-oriented software. The Rational Unified Process (RUP) is a refinement of the Unified Process and is currently widely used in industry. Moreover, RUP organizes best practices into a well-structured process description.
Since RUP is an iterative process, development is structured into a sequence of short, time-boxed mini-projects called iterations. An iteration lasts about three weeks and includes its own requirements, analysis, implementation and evaluation procedures.
The main advantages of the UP process include:
1. Flexibility, better productivity and early visible progress.
2. Research can be done within an iteration, so the development process can be improved.
3. Early mitigation of high risks (technical, usability, design and other issues), fewer project failures and a lower probability of defects.
4. Early feedback and user commitment result in the requirements matching stakeholders' needs more closely.
Figure 10: Rational Unified Process37
As can be seen from Figure 10, the Rational Unified Process has a cyclic form and consists of several stages:
1. Requirements stage - capturing the system requirements
2. Design stage - planning and designing the software structure
3. Implementation - coding and developing the system
4. Testing - examining the system and system evaluation
5. Deployment - deployment and production
Each of these steps is repeated iteratively until the final product release. Feedback is provided, and all material from the previous iteration is reviewed and refined during each iteration.
4.3 Requirements definition
Regarding the requirements analysis phase, it is vital to the success of the project. Requirements analysis is the process of identifying and documenting user expectations for a new software product.38 It discovers the functionality, performance, usability and other characteristics of the system.
Successful software is not merely "a program that works" but a program that meets the client's needs.39 Even if a program has the greatest features and does everything correctly, it can be classified as a failure if it does not meet the client's expectations. Hence, correct identification of the requirements greatly reduces the amount of work in the later phases. There are two types of requirements: functional and non-functional.
Non-functional requirements (NFR) define how the system must behave - the qualities of the system's functionality (examples: performance, availability, security, reliability, usability). It is essential for the system to be usable and accessible by business professionals, including those with visual impairments such as colour blindness. Furthermore, it is vital that the program is reliable and can store results in different formats.
The table below lists the non-functional requirements identified for the system. Each requirement has been ranked from 1 to 3, with 1 being the highest and 3 the lowest priority.
No    | Requirement                                                                                  | Priority
NFR1  | The system must be usable by business analysts (i.e. understandable to business professionals) | 1
NFR2  | The system must have good accessibility (e.g. large fonts, visible colours)                  | 1
NFR3  | The system must be reliable and must not crash constantly                                    | 1
NFR4  | The system must run on different platforms                                                   | 2
NFR5  | The system should be extensible for future updates                                           | 3
NFR6  | Data will be stored in a MySQL database                                                      | 1
NFR7  | The system will be written in a combination of Java and SQL                                  | 1
NFR8  | The system should process a large amount of data within an acceptable and reasonable period of time | 1
NFR9  | The system should allow data to be stored in various formats                                 | 1
NFR10 | The system should be designed in such a user-friendly and intuitive way that a novice user spends less than 10 minutes learning how the system operates | 3
NFR11 | The system should have a complete set of instructions for its use                            | 3
NFR12 | The results of the mining process should be presented in a clear and understandable way (e.g. tables, graphics) | 2
NFR13 | The design of the system should be consistent, user-friendly and informative for the novice user | 1
NFR14 | The results of the mining process should be accurate and reasonable                          | 1
NFR15 | The proposed system should be delivered within a 5-month period                              | 1
Functional requirements (FR) define the functions of the software or its components40 (examples: business rules, authentication, historical data).
Compared to NFRs, functional requirements are the top priority; they are supported by the non-functional requirements, which impose constraints on the design and implementation. Functional requirements specify the concrete results of the system. System functionality such as mining a transactional dataset and producing frequent itemsets are examples of functional requirements. Below is the list of all FRs and their priorities (1 - high, 2 - medium, 3 - low).
No   | Requirement                                                                                   | Priority
FR1  | The system must connect to various database systems when given their details (username, password, URL) | 1
FR2  | The system must retrieve the database and table names from the server to provide a link to the "raw" data on which the algorithm is performed | 1
FR3  | The system must retrieve the transactional data from the database and store it in the program's data structure | 1
FR4  | The system should perform the association rule data mining algorithm on the transactional data and extract the useful patterns from it | 1
FR5  | The system must store the results of the mining process in a file                             | 1
FR6  | The system must display the results of the mining algorithm in an appropriate format          | 1
FR7  | The system should provide feedback on the data mining process                                 | 2
FR8  | The system should allow the mining algorithm to run concurrently                              | 2
FR9  | The system should allow the results of the data mining process to be filtered and arranged    | 3
FR10 | The system should allow the results of the data mining process to be compared and analyzed    | 3
The primary functionality outlined in the functional requirements table is connecting to the database, performing the association rule algorithm and storing the results; the other functions are not core but desirable. Capturing the functional requirements requires specific techniques, as it is a crucial part of the software development process.
4.4 Use Cases
With respect to use cases, they are a widespread practice for identifying functional requirements. Use cases define a set of interactions between one or more external actors and the system.41 Moreover, they illustrate the system from the user's perspective. An actor is a participant that exists outside the system and engages in a series of actions with the system to achieve a particular goal.
Use case diagrams give a graphical representation of the functionality of the system; furthermore, the system context of the proposed system can be illustrated by a use case diagram. Figure 11 shows a high-level use case diagram based on the general functionality of the mining application. The blue box represents the system boundary, stick figures represent actors, horizontal ellipses indicate use cases, and solid lines show the associations between actors and use cases.
[Use case diagram: the "Data mining application" system box contains the use cases Connect to Database, Perform mining process, Save results, Open results and Compare result; the actors Business Analyst and Market Researcher are associated with them, with «uses» relationships between related use cases.]
Figure 11: Use case diagram for Data mining application
4.5
System Overview Diagram
During the design stage, decisions have to be taken on the system's structure and behavior. Various design solutions were reviewed for every feature proposed for the system; however, only the most interesting are presented here.
The system overview diagram is a high level representation of the application. It provides a simpler view of the system's structure and shows the interactions between the components of the system. Figure 12 illustrates the system context and its subsystems. This diagram helps in making decisions at the early stages of software development; such decisions can cover the functional, organizational and technological aspects of the design.
Initially, the "Database connector" provides access to the database and the "Database loader" retrieves the "raw data" from it. After that, the data mining algorithm starts to process the "raw data". It has two components, the "Frequent Itemset Generator" and the "Rule Generator": the former produces the frequent itemsets, the latter creates the rules from the generated itemsets. Next, the "Display result" component shows the "useful" patterns in various formats and sends them to the "File buffer", which performs read/write operations on a text file. Finally, the "Compare tool" can be used for analyzing and comparing data mining results.
[System structure diagram: the Database Connector links the Database to the Database Loader; the loaded data feeds the Frequent Itemset Generator and then the Rule Generator; results flow to the Display result component and to the File buffer (FILE), which the Compare tool reads from.]
Figure 12: System structure
4.6
Activity Diagram
Activity diagrams are graphical representations of workflows of activities, with support for choice, iteration and concurrency. They are used to describe business processes and the operational workflows of components in the system. Figure 13 shows the activity flow of the proposed application, split into user activities and system activities.
User activity → system activity:
1. Fill in the database details and press OK → connect to the database.
2. Select the database → show the list of databases.
3. Select a table → retrieve the table names.
4. Choose the candidates to process → retrieve the candidates from the table.
5. Select the minimum support → retrieve the selected candidates.
6. Select the minimum confidence → extract all frequent itemsets.
7. Click "Start algorithm" → generate the rules and display the result.
8. View the result in the table → display the table.
9. Save the result → write the result to an external file.
Figure 13: Activity Diagram
4.7
Graphical User Interface Design
The graphical user interface (GUI) is the part of the system with which the user directly interacts. The main goal is to enable effective interaction between the human and the software by providing operative control of the application. In order to design a usable and accessible interface, GUI best practices in this area were researched before the GUI was designed.
As a result, the "Ten usability heuristics" by Jakob Nielsen were taken as the basis for designing the user interface. The following ten user interface design principles were applied to the application:42
1. Visibility of system status - keep the user informed about system processes.
2. Match between system and the real world - words, phrases and labels must be familiar to the user.
3. User control and freedom - provide easier navigation for the user.
4. Consistency and standards - provide an intuitive design.
5. Error prevention - simple handling of errors.
6. Recognition rather than recall - provide instructions in a simple way.
7. Flexibility and efficiency of use - run several functions at one time.
8. Aesthetic and minimalist design - information should be provided where it is required.
9. Help users recognize and recover from errors - error messages should be provided in a plain format.
10. Help and documentation - provide help about the system.
The picture below demonstrates the high fidelity prototype of the mining application's user interface. To arrive at this interface, various low fidelity prototypes were sketched out by comparing different existing software interfaces. Finally, the different graphical features were analyzed and the best design solutions were adopted.
Figure 14: High fidelity prototype for GUI
From the figure above it can be seen that the system interface consists of four functional areas. The main emphasis of the GUI design was to make the data mining process effective and straightforward for business professionals. At the top of the window is the menu bar (indicated as 1), which contains the file management and help functions. On the left side there is the control panel (indexed as 4), which provides tools for the user to manipulate the data mining process. The bottom area contains a panel designed to give feedback on the data mining process and to inform the user about errors. Finally, in the center (indexed as 2) there is the display area, which shows the results of the mining process.
4.8
Database Design
As database technology develops, modern databases are capable of storing huge amounts of data. They can reach tera- or petabytes and have a tendency to handle even more data in the future.43 Therefore, data mining technologies must be able to deal with such amounts of data in a reasonable time. For this purpose, the "raw data" should be preprocessed and transformed into the required format.
For association rule analysis, data should be converted into a particular format, as the data structure also influences the speed of the data mining process. For instance, transactional data can be represented in a binary format, as illustrated in Table 4. The leftmost column shows the transaction number, which identifies the purchase of a particular customer. The other columns store the purchased items, each of which is treated as a binary variable.
The presence of an item in a transaction is marked as 1, whereas its absence is marked as 0. However, this is a simplistic view of market transactional data and can be applied only to a small number of items. Additionally, this view is not capable of storing supplementary data about items, such as the quantity of items sold and their cost.
Table 4: Normalized view of "Market basket data"
TID | Laptop HP550 | Windows 7 | Antivirus | Mouse | Laptop case
1001 | 1 | 1 | 0 | 1 | 0
1002 | 1 | 1 | 0 | 0 | 1
1003 | 1 | 1 | 0 | 0 | 1
1004 | 1 | 0 | 1 | 0 | 0
1005 | 0 | 0 | 1 | 0 | 1
As shown in Table 5, there is another representation of transactional data, which is currently used by modern data mining products in business. This data structure makes it possible for association rule mining tools to process a large number of products stored in just a few columns. Moreover, it can be used for more advanced analysis, such as temporal and quantitative association rule mining, because it can store the quantity, category and cost of products, the time of purchase and so on. In the table below it can be seen that the TID and ITEM columns together form a unique set; that is, the same transaction number (TID) corresponds to different product names (ITEM).
Table 5: Typical view of market basket data
TID | ITEM | Cost ($) | Amount
1001 | Laptop HP550 | 500 | 1
1001 | Windows 7 | 350 | 1
1001 | Mouse | 10 | 1
1002 | Laptop HP550 | 500 | 2
1002 | Windows 7 | 350 | 2
1002 | Laptop case | 25 | 2
1003 | Laptop HP550 | 500 | 1
The proposed system was required to handle large datasets so that it could be applied to real business data. The test data used for the application was expected to contain hundreds of thousands of records and tens of item types. Also, the software may be extended to handle temporal and quantitative association rules in a later development phase. Hence, the second model of transactional data was selected for implementation; a possible schema for it is sketched below.
4.9
Summary
In this chapter, the requirements analysis and the design of the proposed data mining system have been described. This smooths the transition to the implementation phase, as the critical requirements have been identified and a number of design decisions made.
Chapter 5: Implementation
5.1
Overview
This chapter highlights the important aspects of the system implementation, including the choice of technology, the algorithm implementation and other interesting implementation solutions. The main objective of this stage is to transform the design solutions into a working model.
5.2
Implementation tools
5.2.1 Programming language
Regarding the programming language, a variety of different languages was considered for the implementation. This is one of the key decisions in the development process, because using an unsuitable language can be demotivating when trying to write better software; at times, a wrong choice can ruin the entire software development effort.
Therefore, only a few languages were considered: Java, C/C++, Visual Basic and Python. Several factors were taken into account while selecting the right language, such as the level of expertise, the reference documentation and the development platform. Most of them offer similar features and some of them represent leading edge technology.
For the following reasons, Java was selected as the development language. It is mature in terms of implementation as well as its API. It is object-oriented and supports class loading, multithreading, garbage collection and database handling. Furthermore, the author had several years of experience in using Java.
5.2.2 Database language
For the database transactions, SQL (Structured Query Language) was selected. SQL is a database language used for organizing, managing and retrieving data from relational databases. Its main advantages are reliability, performance, scalability and standardization.44
5.2.3 DBMS
Regarding the database management system, MySQL was the primary choice because it has consistently fast performance, high reliability and a simple user interface.
5.2.4 Development environment tools
The main candidates for the development environment were NetBeans and Eclipse. NetBeans was preferred over Eclipse for several reasons: it has a more intuitive and easy-to-use interface, a sophisticated GUI builder and automatic integration of frameworks. Moreover, NetBeans 6.8 has improved performance compared to earlier versions.
5.3
Data structure
It is essential to use the right data structures so that the data can be processed efficiently while performing system operations. Initially, the transaction data is stored in the database as in Table 5. Since the application was designed to perform basic association rule mining, only the two left columns (TID and ITEM) need to be retrieved from the database into the relevant data structures.
Of the existing data structures, a hash map was the best option for storing the transactional data. The itemsets are therefore stored in a hash map, as shown in Figure 15.
TID (key) | ITEM (ArrayList)
1001 | Laptop, Windows 7, Mouse
1002 | Laptop, Windows 7, Case
1003 | Laptop, Windows 7, Case
1004 | Laptop, Antivirus
1005 | Antivirus, Case
Figure 15: Transaction data in hash map.
The transaction id is stored as the key and the corresponding itemset is stored in an array list. A HashMap is very much like a hash table, except that the HashMap structure is faster because it is unsynchronized (and therefore not thread-safe). An ArrayList has also been used while performing the Apriori algorithm, to store the frequent itemsets. As a result, the processing speed has been considerably increased.
5.4
Database Loader implementation
The process of retrieving data from the database can be time-consuming when dealing with huge amounts of data. The transactional table's values are added to a HashMap<K, V> from the database: the transaction number and item name are added as the key (K) and value (V) parameters respectively.
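A loader of this kind might look roughly as follows with plain JDBC. The class and column names are illustrative assumptions; the project's own loader may differ in detail.

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DatabaseLoader {
    // Loads the TID/ITEM pairs of a transaction table into a map of TID -> items.
    public static Map<Integer, List<String>> load(Connection con, String table)
            throws SQLException {
        Map<Integer, List<String>> transactions =
                new HashMap<Integer, List<String>>();
        Statement st = con.createStatement();
        ResultSet rs = st.executeQuery("SELECT TID, ITEM FROM " + table);
        while (rs.next()) {
            int tid = rs.getInt("TID");
            List<String> items = transactions.get(tid);
            if (items == null) {
                items = new ArrayList<String>();
                transactions.put(tid, items);
            }
            items.add(rs.getString("ITEM"));
        }
        rs.close();
        st.close();
        return transactions;
    }
}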
Figure 16: Database connection
As can be seen from Figure 16 above, the user needs to provide the username, password and URL of the database in order to connect to a particular database system. If the application cannot connect to the database, it shows an error message in the connection status area.
Otherwise, the database connection window is hidden and the application retrieves the database and table names, as shown in Figure 17.
Figure 17: Retrieving database and table names
When the user selects a particular database, the application automatically retrieves the corresponding list of table names. After selecting the table, the user needs to specify the candidates to be processed by the association rule algorithm.
Figure 18: Candidate selection process
From the picture above it can be seen that the user can define their own set of candidates to be processed by moving items from the "All candidates" field to the "Final candidates" field using the buttons placed between the fields. If the user has not selected any candidates, the system automatically processes all items.
5.5
Algorithm implementation
The pseudocode below shows the algorithm for generating frequent itemsets using the Apriori algorithm.45 The details have been discussed in Chapter 3.
Input values: database D, minimum support min_sup.
Output values: frequent itemsets F.
Table 6: Pseudocode for Frequent candidate generation
L1 = FIND_ALL_FREQUENT_1-ITEMSETS(D);
for (k = 2; L(k-1) is not empty; k++) {
    Ck = APRIORI_GENERATION(L(k-1), min_sup);
    for each transaction t ∈ D {
        Ct = subset(Ck, t);
        for each candidate c ∈ Ct
            c.count++;
    }
    Lk = { c ∈ Ck | c.count >= min_sup };
}
return L = ∪k Lk;
The following pseudocode illustrates the candidate generation procedure used above.
Table 7: Pseudocode for candidate generation
APRIORI_GENERATION(L(k-1), min_sup)
for each itemset l1 ∈ L(k-1)
    for each itemset l2 ∈ L(k-1)
        if (l1[1] = l2[1]) ∧ (l1[2] = l2[2]) ∧ ... ∧ (l1[k-2] = l2[k-2]) ∧ (l1[k-1] < l2[k-1]) then {
            c = l1 ∪ l2;
            if HAS_INFREQUENT_SUBSET(c, L(k-1)) then
                delete c;
            else add c to Ck;
        }
return Ck;

HAS_INFREQUENT_SUBSET(c, L(k-1))
for each (k-1)-subset s of c
    if s ∉ L(k-1) then
        return TRUE;
return FALSE;
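To make the pseudocode concrete, the following is a compact Java sketch of the frequent itemset phase. It is a simplified illustration of the pseudocode above, not the project's actual Apriori class: itemsets are plain sets of item names and the candidate counting scans the whole database at every level.

import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class AprioriSketch {

    // Returns all frequent itemsets of the transaction collection.
    public static List<Set<String>> frequentItemsets(
            Collection<List<String>> transactions, int minSupport) {
        List<Set<String>> frequent = new ArrayList<Set<String>>();

        // L1: count every single item and keep those meeting min_sup.
        Map<Set<String>, Integer> counts = new HashMap<Set<String>, Integer>();
        for (List<String> t : transactions) {
            for (String item : t) {
                Set<String> c = new TreeSet<String>();
                c.add(item);
                Integer n = counts.get(c);
                counts.put(c, n == null ? 1 : n + 1);
            }
        }
        List<Set<String>> level = prune(counts, minSupport);

        // Lk: join, prune by the Apriori property, count, keep frequent sets.
        while (!level.isEmpty()) {
            frequent.addAll(level);
            List<Set<String>> candidates = aprioriGen(level);
            counts = new HashMap<Set<String>, Integer>();
            for (List<String> t : transactions) {
                for (Set<String> c : candidates) {
                    if (t.containsAll(c)) {
                        Integer n = counts.get(c);
                        counts.put(c, n == null ? 1 : n + 1);
                    }
                }
            }
            level = prune(counts, minSupport);
        }
        return frequent;
    }

    // Join step: merge (k-1)-itemsets differing in exactly one item, then drop
    // any candidate with an infrequent (k-1)-subset (the Apriori property).
    private static List<Set<String>> aprioriGen(List<Set<String>> prev) {
        Set<Set<String>> candidates = new HashSet<Set<String>>();
        for (Set<String> a : prev) {
            for (Set<String> b : prev) {
                Set<String> joined = new TreeSet<String>(a);
                joined.addAll(b);
                if (joined.size() == a.size() + 1
                        && !hasInfrequentSubset(joined, prev)) {
                    candidates.add(joined);
                }
            }
        }
        return new ArrayList<Set<String>>(candidates);
    }

    private static boolean hasInfrequentSubset(Set<String> c, List<Set<String>> prev) {
        for (String item : c) {
            Set<String> subset = new TreeSet<String>(c);
            subset.remove(item);
            if (!prev.contains(subset)) {
                return true;
            }
        }
        return false;
    }

    // Keeps only the itemsets whose support count reaches min_sup.
    private static List<Set<String>> prune(Map<Set<String>, Integer> counts, int minSupport) {
        List<Set<String>> kept = new ArrayList<Set<String>>();
        for (Map.Entry<Set<String>, Integer> e : counts.entrySet()) {
            if (e.getValue() >= minSupport) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }
}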
In order to start the algorithm, the user has to specify the support and confidence thresholds. In the example below, the user selected min_sup = 22% and min_conf = 80% and started the algorithm. The application displays the frequent itemsets and the generated rules in the middle area.
Figure 19: Algorithm processing
Additionally, mining large datasets takes some time, so the user may perform another mining operation simultaneously. From the image below it can be seen that the user ran several algorithms at the same time; for instance, the test, tesco_database and z10_10000 transactional datasets were processed concurrently.
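Concurrent runs of this kind can be dispatched with the Java concurrency utilities; the sketch below shows the general idea. The dataset names follow the example above, while the mining call itself is a placeholder.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ConcurrentMining {
    public static void main(String[] args) {
        // One pool of worker threads; each mining run is independent.
        ExecutorService pool = Executors.newFixedThreadPool(3);
        for (final String dataset : new String[] {"test", "tesco_database", "z10_10000"}) {
            pool.execute(new Runnable() {
                public void run() {
                    // Placeholder for the real work: load the dataset and run Apriori on it.
                    System.out.println("Mining " + dataset + " on "
                            + Thread.currentThread().getName());
                }
            });
        }
        pool.shutdown();
    }
}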
5.6
Association mining rule example
This example illustrates how the algorithm actually proceeds. Consider a database consisting of 9 transactions, with a required minimum confidence of 80% and a minimum support count of 2 (22%). Initially, the Apriori algorithm is applied to find the frequent itemsets; afterwards, association rules are generated using the support and confidence thresholds.
Step 1: Generating frequent 1-itemsets.
At the beginning, the database is scanned for each item. All unique candidates are collected and their frequency of occurrence (support count) is calculated. Each candidate's support count is then compared with the minimum support threshold.
Database D:
TID | ITEMS
1001 | 1, 2, 5
1002 | 2, 4
1003 | 2, 3
1004 | 1, 2, 4
1005 | 1, 3
1006 | 2, 3
1007 | 1, 3
1008 | 1, 2, 3, 5
1009 | 1, 2, 3

Scanning D gives the candidate counts C1; comparing each count with the minimum support gives L1, and here every candidate qualifies, so C1 = L1:
Itemset | Support count
1 | 6
2 | 7
3 | 6
4 | 2
5 | 2
Step 2: Generate frequent 2-itemsets.
This step starts by generating the 2-itemset candidates by joining the frequent 1-itemsets from the previous step. Then each 2-itemset candidate's support count is calculated and compared with the minimum support. If the support count does not satisfy the minimum support, the candidate is removed and is not processed in the further steps. Therefore, only the frequent 2-itemsets are processed.
C2 (produced by joining L1 with itself), with the support count of each candidate calculated by scanning D:
Itemset | Support count
1, 2 | 4
1, 3 | 4
1, 4 | 1
1, 5 | 2
2, 3 | 4
2, 4 | 2
2, 5 | 2
3, 4 | 0
3, 5 | 1
4, 5 | 0

Eliminating all candidates that do not satisfy the minimum support gives L2:
Itemset | Support count
1, 2 | 4
1, 3 | 4
1, 5 | 2
2, 3 | 4
2, 4 | 2
2, 5 | 2
Step 3: Generate frequent 3-itemsets.
This phase starts by joining the frequent 2-itemsets into 3-itemset candidates. This process involves the Apriori property, whose main idea is that if an itemset is frequent, then all of its subsets must also be frequent.
C3 (generated from L2), with the support count of each candidate; comparing the counts with min_sup gives L3:
Itemset | Support count
1, 2, 3 | 2
1, 2, 5 | 2

Both candidates satisfy the minimum support, so L3 = C3.
Step 4: Generate frequent 4-itemsets.
Similarly to the third step, the algorithm generates a candidate set of 4-itemsets. According to the Apriori property, the result of the join, {1, 2, 3, 5}, is pruned because its subset {2, 3, 5} is not frequent. As no frequent 4-itemsets are generated, the Apriori algorithm terminates.
Step 5: Generate Association rules from frequent item sets.
At this stage the rule generation process starts. It generates high confidence rules from the frequent itemsets produced in the previous steps.
The frequent itemsets are {{I1}, {I2}, {I3}, {I4}, {I5}, {I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}, {I1,I2,I3}, {I1,I2,I5}}.
Let us extract the high confidence rules: for instance, take the itemset I = {I1, I2, I5}, whose non-empty proper subsets are {I1,I2}, {I1,I5}, {I2,I5}, {I1}, {I2}, {I5}. For each subset s, the confidence of the rule s → (I − s) is support(I) / support(s); for example, confidence(I1 ∧ I5 → I2) = 2/2 = 100%.
Results:
R1: I1 ∧ I2 → I5, confidence 50%
R2: I1 ∧ I5 → I2, confidence 100% (strong)
R3: I2 ∧ I5 → I1, confidence 100% (strong)
R4: I1 → I2 ∧ I5, confidence 33%
R5: I2 → I1 ∧ I5, confidence 29%
R6: I5 → I1 ∧ I2, confidence 100% (strong)
...
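Rules of this form can be generated by enumerating the non-empty proper subsets of each frequent itemset, as in the sketch below. The names are illustrative, and the support lookup is assumed to return the counts computed by the Apriori phase.

import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class RuleGeneration {

    // Assumed lookup into the support counts computed by the Apriori phase.
    interface SupportLookup {
        int count(Set<String> itemset);
    }

    // Emits all rules s -> (itemset \ s) whose confidence meets minConfidence.
    public static List<String> rules(Set<String> itemset, double minConfidence,
                                     SupportLookup supportOf) {
        List<String> result = new ArrayList<String>();
        for (Set<String> antecedent : nonEmptyProperSubsets(itemset)) {
            // confidence(s -> I \ s) = support(I) / support(s)
            double confidence = (double) supportOf.count(itemset)
                    / supportOf.count(antecedent);
            if (confidence >= minConfidence) {
                Set<String> consequent = new TreeSet<String>(itemset);
                consequent.removeAll(antecedent);
                result.add(antecedent + " -> " + consequent
                        + " (confidence " + Math.round(confidence * 100) + "%)");
            }
        }
        return result;
    }

    // All non-empty proper subsets, built by interpreting a bitmask over the items.
    private static List<Set<String>> nonEmptyProperSubsets(Set<String> itemset) {
        List<String> items = new ArrayList<String>(itemset);
        List<Set<String>> subsets = new ArrayList<Set<String>>();
        for (int mask = 1; mask < (1 << items.size()) - 1; mask++) {
            Set<String> s = new TreeSet<String>();
            for (int i = 0; i < items.size(); i++) {
                if ((mask & (1 << i)) != 0) {
                    s.add(items.get(i));
                }
            }
            subsets.add(s);
        }
        return subsets;
    }
}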
As can be seen in the results above, there are three strong rules with a confidence of 100%. The same set of items was processed in the application, which produced identical results, as shown in the image below. The generated rules can be viewed in a table, and the strong rules with high confidence are highlighted. The table consists of six columns: the number of the rule, item1 (antecedent), item2 (consequent), support, confidence and lift. The fact that the application shows the same results as the worked example above demonstrates that the algorithm works correctly in the software.
Figure 20: Generated rules displayed in the table
5.7
I/O association rules operation
As far as the input and output operations are concerned, this component provides the functionality to read and write the generated rules from/to an external file, so that they can be used outside the data mining software for further analysis and reporting.
For the I/O operations, the basic buffered input/output streams implemented by the Java API were used. Buffered input streams read data from a memory area (a buffer); likewise, buffered output streams write data to a buffer.46
The result is saved to a text file using a Java dialog window, in which the user can specify the directory and the name of the file to be saved.
Figure 21: Saving results process
The file is saved in the format shown below. This format allows the file to be opened and processed both by the implemented application and by any spreadsheet software.
bread, beer, 50.0, 40.0, 0.834
beer, bread, 66.67, 40.0, 0.834
diapers, beer, 75.0, 60.0, 1.25
beer, diapers, 100.0, 60.0, 1.25
milk, beer, 50.0, 40.0, 0.834
beer, milk, 66.67, 40.0, 0.834
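A write operation producing lines in this format might be sketched as follows. The field order is assumed from the sample above (antecedent, consequent, confidence, support, lift, inferred from the symmetric support values), and the class name is illustrative.

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.util.List;

public class RuleWriter {
    // Writes one comma-separated line per rule, matching the format shown above.
    public static void save(String path, List<String[]> rules) throws IOException {
        BufferedWriter out = new BufferedWriter(new FileWriter(path));
        try {
            for (String[] r : rules) {
                // r = {antecedent, consequent, confidence, support, lift}
                out.write(r[0] + ", " + r[1] + ", " + r[2] + ", " + r[3] + ", " + r[4]);
                out.newLine();
            }
        } finally {
            out.close();
        }
    }
}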
5.8
Analyzer tool
Regarding the comparator tool, it has been implemented to compare and analyse the association rules generated earlier. The main purpose of this function is to explore and identify potentially useful trends in the data mining results. Using this functionality, generated results can be opened and displayed side by side in tables, and the "strong rules" are highlighted to simplify the analysis process (Figure 22). Additionally, these results can be sorted and filtered by different parameters: minimum support, confidence, lift and item names. Finally, after performing this operation, the results can be merged and saved in an external text file.
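For illustration, filtering a list of loaded rules by such thresholds could be sketched as follows; the Rule fields mirror the result-table columns described earlier and are assumptions, not the project's actual classes.

import java.util.ArrayList;
import java.util.List;

public class RuleFilter {
    // Minimal rule record; the fields mirror the result-table columns.
    static class Rule {
        String antecedent;
        String consequent;
        double support;
        double confidence;
        double lift;
    }

    // Keeps only the rules that meet all three thresholds.
    public static List<Rule> filter(List<Rule> rules, double minSupport,
                                    double minConfidence, double minLift) {
        List<Rule> kept = new ArrayList<Rule>();
        for (Rule r : rules) {
            if (r.support >= minSupport
                    && r.confidence >= minConfidence
                    && r.lift >= minLift) {
                kept.add(r);
            }
        }
        return kept;
    }
}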
Figure 22: Comparing tool.
5.9
Summary
This chapter has presented the implementation stage of the software development, describing the languages used, the techniques applied and the decisions made. At the end, the mining algorithm was exemplified on a small amount of data.
Chapter 6: Testing and Evaluation
6.1 Overview
The testing stage is an important process and should be performed both during and after the implementation phase. Because the Unified Process was taken as the software development method, testing was performed after each of a series of time-boxed iterations. This section describes the various testing methods used to examine the data mining software application.
6.2 Testing Methods
Regarding testing methods, different types of tests should be performed to examine all components of the program. As the main concerns of a data mining application are the accuracy of the generated results and the performance of the mining process, the related set of tests should be performed on the application. Therefore, the following testing methods were carried out: unit testing, system testing, functional testing and performance testing.
6.3 Unit and Functional Testing
Functional testing (also known as acceptance testing) involves the user in the testing process to find out whether the application meets the user requirements and has all essential features functioning correctly. After each iteration of the Unified Process, the software application was tested and reviewed by users. They then gave feedback to the system developer, so some potentially serious system bugs were eliminated at the early stages.
The functional and non-functional requirements were provided by users, who then checked that the system met their expectations by filling in the form shown in Table 8. This is a black box type of testing, because the user has no knowledge of the internal implementation of the software.
Table 8: Functional testing form
Test ID | Test Type | Description | Desirable outcome | Actual outcome | Success
T1001 | Functional | Testing whether the system provides feedback and error messages to the user. | If the user performs a wrong operation (e.g. leaves a field empty), the system will show a message. | The system notifies the user in every case of an inappropriate action. | +
T1002 | Functional | Examine whether the system can run several algorithms simultaneously without any effect on performance. | The user runs the algorithm several times without cancelling the previous run; all results are generated without interference. | The system runs multiple algorithms using Java multithreading technology; the performance of the algorithm is not affected. | +
Unit testing is carried out to check whether a particular module or unit of code works correctly before it is integrated into larger modules. The main advantage of unit testing is the prevention of system defects at the early stages. In contrast to functional testing, a unit test is "white box" testing, as the developer has access to the code. Test code was written for particular components; examples of tested components include connecting to the database, retrieving the candidates and checking the filters. The database connection was validated in unit testing, as shown in Table 9.
Table 9: Unit testing
Test ID | Test Type | Description | Desirable outcome | Actual outcome | Success
T1011 | Unit | Testing whether the system can connect to any type of database; the various database details are hardcoded. | The system successfully connects to any database. | The system connected to the databases tested: MySQL, Heidi SQL and Oracle. | +
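A unit test for the connection component might look roughly like the sketch below, using JUnit. The connection details are placeholders, as in test T1011, and the class is an assumption about how such a check could be written rather than the project's actual test code.

import static org.junit.Assert.assertNotNull;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

import org.junit.Test;

public class DatabaseConnectionTest {
    // Hardcoded details, as in test T1011; the values are placeholders.
    private static final String URL = "jdbc:mysql://localhost:3306/mining";
    private static final String USER = "user";
    private static final String PASSWORD = "password";

    @Test
    public void connectsToDatabase() throws SQLException {
        Connection con = DriverManager.getConnection(URL, USER, PASSWORD);
        assertNotNull(con); // the driver either returns a live connection or throws
        con.close();
    }
}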
6.4 Performance Testing
The speed and efficiency of the data mining process are key characteristics of any mining software application: the amount of time the algorithm spends on mining should be reasonable. Therefore, the efficiency of the system can be evaluated by setting different values of minimum support and measuring the amount of time required to generate the association rules.
In order to evaluate the system performance, synthetic datasets were created. It was decided to generate several sets of data with different numbers of records, then increment the minimum support up to a specific point to analyse the effect of the changing support value on system performance. The test data has the same structure as the dataset used in the implementation phase; the table contains the two fields transaction id (TID) and item name (ITEM).
To get a comprehensive view of the system's mining performance, a test dataset D10000 of 10,000 records and its subset D5000 of 5,000 records were generated with the Spawner software47 and then preprocessed manually into the required format.
It was decided to test the data mining application using two different approaches. The first is to process each test dataset separately to analyse the maximum performance. The second is to run all test datasets simultaneously to check how the multithreading function affects performance. The datasets for simultaneous testing were named MD10000 (10,000 records) and MD5000 (5,000 records).
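Timings of this kind can be taken in a straightforward way, for example as sketched below; runAlgorithm is a placeholder for loading a dataset and running the mining algorithm on it.

public class PerformanceTimer {
    public static void main(String[] args) {
        long start = System.nanoTime();
        runAlgorithm(); // placeholder for a complete mining run
        long seconds = (System.nanoTime() - start) / 1000000000L;
        System.out.println("Processing time: " + seconds + " s");
    }

    private static void runAlgorithm() {
        // Stand-in for the real mining run on a test dataset.
    }
}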
Table 10 shows the relation between the minimum support value (%) and the amount of time (seconds) spent processing the algorithm. Initially, the minimum support was set at 10% and was then increased in steps of 5% up to 30%; the confidence was fixed at 50%. The runtimes for D10000 and D5000 differ: at smaller support values, processing 10,000 records takes more than 50% longer than processing 5,000 records, while at higher support thresholds the processing times of D10000 and D5000 converge.
This result can be explained by the fact that at higher support many candidates are pruned at an earlier stage of the Apriori algorithm, so the whole mining process runs much faster. On the other hand, it can be observed that simultaneous mining of 10,000 records requires approximately twice as much time as the single-task run; for the 5,000-record dataset the difference is insignificant.
Table 10: Calculating the system performance
Minimum Support (%) | D10000 (s) | D5000 (s) | MD10000 (s) | MD5000 (s)
10 | 27 | 11 | 44 | 12
15 | 18 | 6 | 38 | 11
20 | 8 | 4 | 27 | 10
25 | 5 | 3 | 25 | 7
30 | 4 | 2 | 18 | 5
Figure 23 presents the results from the table above graphically. It can be observed that on the large dataset (MD10000) the processing time decreases significantly, from 44 seconds (min_sup = 10%) to 18 seconds (min_sup = 30%), whereas on the smaller dataset (MD5000) there is only a slight decline, from 12 seconds to 5 seconds.
[Line chart: processing time (seconds) against minimum support (%) for the D10000, D5000, MD10000 and MD5000 datasets.]
Figure 23: Relation of min support to the processing time
As for the candidate-related performance experiment, the processing time for mining various numbers of items was measured. For this purpose, large transactional datasets of 141,272 records (LD1) and 54,000 records (LD2) were generated, so that the application was tested on realistic data.
Table 11: Relation between number of candidates and processing time
Number of candidates | Time (s) on LD1 | Time (s) on LD2
20 | 46 | 38
16 | 44 | 37
12 | 41 | 35
8 | 38 | 33
4 | 26 | 23
The table above provides the experimental results obtained by processing various numbers of unique items from the LD1 and LD2 datasets. It consists of three columns: the number of items involved in processing, the time spent processing 141,272 records and the time taken for 54,000 records. The support and confidence thresholds were set at 30% and 50% respectively, and the number of candidates was increased from 4 to 20 items in steps of 4 candidates per test. As Table 11 shows, there is only a minor difference between the processing times for LD1 and LD2. A possible explanation is that at the early stages the algorithm runs through all records to process the selected candidates.
The processing times can be compared in Figure 24. It can be seen that both LD1 and LD2 processing times gradually increase as the number of candidates rises.
[Line chart: processing time (seconds) against number of candidates (4 to 20) for the LD1 and LD2 datasets.]
Figure 24: Graphical representation of performance (items - candidate)
6.5 Integration and System Testing
After unit testing, the system should be examined by integration testing, which aims to expose bugs in the interfaces and interactions between integrated modules. Its result is the integrated system, ready for system testing. The main objective is to verify the functional, performance and reliability requirements defined for the main system components. While implementing the software, several components were integrated and tested; for example, the database connector component, the user interface component and the association rule algorithm were integrated into a single system, as shown in Table 12.
System testing is designed to evaluate the system's compliance with the predefined requirements. It explores how well the system executes its functions: the system tester examines the whole software in the context of the functional requirement specification.48 System testing is also expected to test beyond the bounds defined in the requirement specification; it generally includes usability testing, compatibility testing, reliability testing, regression testing and others.
Table 12: Integration and system testing form
Test ID | Test Type | Description | Desirable outcome | Actual outcome | Success
T1051 | Integration | Test whether the system retrieves items from the database correctly and displays them on the menu. Classes participating: Apriori.java, DBConnection.java. | The system retrieves all selected items; given a transaction table with the two columns TID and ITEM, it retrieves all the elements from the dataset. | The system successfully loads the transaction data from the database, performs the algorithm and produces the rules. | +
T1071 | System | Test whether the results of the data mining algorithm are correctly displayed in the table. | The results are displayed correctly in the table. | The system was tested on a small amount of data (6 items) and output accurate results. | +
6.6 Evaluation
As for the evaluation, it is important to assess both the development process as a whole and its product (the system). The evaluation was separated into three distinct categories: development process evaluation, system evaluation and performance evaluation.
6.6.1 Development process evaluation
As far as the software development process is concerned, the Rational Unified Process methodology was applied to develop the software application. Prior to this, a substantial investigation of the data mining field was conducted: each step of the KDD process was studied, different data mining techniques were described and some implementation issues were considered.
Next, in order to implement the mining algorithm, the association rule technique was researched in detail by looking at the different types of rules and the theoretical side of the problem. From the various association rule algorithms, Apriori was chosen for generating the frequent itemsets. This choice can be explained by the fact that Apriori is the fundamental association rule algorithm and was the first to manage the exponential growth of generated itemsets using support-based pruning. Having implemented the Apriori algorithm, I will be able to improve it or develop more advanced algorithms.
After the association mining research was completed, I started the development of the data mining application for retrieving association rules. As previously stated, the Rational Unified Process was chosen due to its flexibility and iterative approach. At the early stages, the requirements were captured and the system structure was outlined.
Next, the system's structure and behavior diagrams were drawn and a system prototype was developed. After defining the system context, the implementation technology was chosen and the coding process started. Finally, the system components were tested and evaluated. During each iteration, the system requirements, design and implementation decisions were reviewed and refined.
6.6.2 System evaluation
Regarding the system evaluation, software can be considered successful if it satisfies all user requirements. The data mining application has all the essential features needed for association rule analysis. The main functionality, such as loading datasets from the database, mining with the association rule algorithm, and displaying and writing results, has been developed. Furthermore, additional features such as comparing and filtering results have been implemented.
For the implementation, Java was used because it is an object-oriented, platform independent and simple language. For the DBMS, MySQL was used because it is fast and robust and has a good feature set. For the transaction operations, SQL was used due to its performance and reliability. Furthermore, the combination of these tools is well suited to the development, because I have considerable experience with each of them. Finally, the implemented system meets all critical requirements.
6.6.3 Performance evaluation
With respect to the performance of the application, measurements were taken during the testing stage. Two testing approaches were applied: a support-related test and a candidate-related test. The former used different minimum support values with the same number of candidates; the latter processed different numbers of candidates at the same support threshold.
From Tables 10 and 11, the findings of the experiments suggest that the data mining application works very quickly on small and medium databases. In contrast, the speed of the application on large datasets is noticeably slower. The main reason is that a large dataset has more items, which in turn increases the transaction width; the algorithm therefore spends more time scanning candidates, because more items are located under the same transaction id. Overall, the performance of the application is good.
6.7 Summary
In this section various testing techniques have been discussed. They helped to identify the correctness, completeness, quality and efficiency of the developed data mining software. Finally, the software was evaluated and feedback was provided.
Chapter 7: Conclusion
7.1
Overview
This chapter describes the challenges encountered and the set of desired features and improvements that could be implemented. Finally, the personal opinion and recommendations of the author are provided.
7.2
Personal Experience
This report has described the process of developing a data mining software application and the research made in the corresponding field. The subject of association rule discovery was selected because data mining software products are applied in many different areas, including business, science and medicine: such applications can uncover previously unknown, hidden and potentially useful information in the data. Personally, it was very interesting to develop such software and to see what results it can produce from raw data. Especially in our time, these kinds of applications are in demand, as large amounts of data are collected from various sources, such as industry and the internet.
Although the software development was not easy, it was a fascinating process for me. I enjoyed overcoming complex issues and taking decisions throughout the project. The project helped me to gain academic knowledge as well as practical skills: I have learnt the fundamentals of association rule discovery and become a specialist in the data mining field. This knowledge will certainly be useful in my future studies in Information Systems. Also, my coding skills in Java have improved, as I have learnt new techniques and algorithms. Personally, I have achieved great results in personal development, including time-management, decision-making and learning skills.
The main lesson learnt from the development process is that, in reality, it is very challenging to deliver a complete product in a limited time: even when most of the specifications are satisfied, there are always things that could be done to improve the system. Overall, the project was successful, as all compulsory requirements were met. Finally, the final year project was a unique and valuable experience for my future career.
7.3
Challenges
Although the essential goal was fulfilled, there were a number of challenges to deal with. The purpose of the project was to create a software product that deals with huge datasets, and the main challenge in this data mining field is to process large quantities of data efficiently. An inefficient implementation may result in exponential computational complexity of the algorithm, which would require an exponentially increasing amount of resources, such as processing time and computer memory. Therefore, I chose the Apriori algorithm, which deals with the exponential growth of generated itemsets, and I repeatedly tested the application on sample datasets to ensure the correctness and effectiveness of my program.
Another issue was choosing the right data structures for storing and processing frequent itemsets from databases. At the beginning, I planned to create my own data structure; however, after studying existing implementations, I decided to use HashMap and ArrayList from the Java API because of their high-quality implementation and high performance.
The next problem was running multiple algorithms simultaneously without affecting performance. It was partially resolved using the concurrency support provided by the Java platform.
7.4
Further Improvements
As for desired improvements, there are a number of features which could be implemented if more time were available. There were many data mining techniques to choose from, and I selected association rule discovery due to my interest in this area. With more time, I could research other mining methodologies (e.g. clustering, classification and regression); however, I did briefly study and describe them in the second chapter, and this research helped me make my choice in favour of association rules.
In Chapter 3, I described other types of association rule algorithms: generalized, quantitative, categorical and temporal association rules. The software could therefore be extended to support more sophisticated properties of the data, such as time, quantity, category and other attributes. If I had more time, I would explore more advanced implementations of association rule mining by looking at various methods for generating frequent itemsets; possible options include the FP-growth algorithm, ECLAT and Apriori-TID. As a result, the mining process could be more efficient and require less processing time.
Furthermore, I would also like to look at more complex algorithm features, such as multiple support measures and negative support values. Mining results would then offer more precise and "more interesting" rules to data analysts by eliminating trivial, irrelevant and misleading results, and the interpretation and evaluation stage would be simpler and quicker to conduct.
As for extra functionality of the developed system, there are several features which I would like to improve or add. First of all, the system could support all stages of the Knowledge Discovery Process, from data selection/preprocessing to the evaluation phase, so that data analysts could carry out the whole KDD cycle using only this software. Secondly, the system could be developed to read "raw data" from different data sources (CSV, MS Excel, XML etc.) and to export results into various data formats (databases, MS Excel or the Web); the system would be more flexible and could ease some pre/post-processing work.
Additionally, as part of the KDD process, association rules could be visualized as two- or three-dimensional graphs to make it easier for analysts to investigate the results; as an alternative, a table of results with interesting rules highlighted has been implemented. Next, I concluded that if I could develop my project from scratch, it would be better to create my own data structure for storing itemsets, which would bring more flexibility and greater control over the data. Also, I would choose C++ because of its higher performance. Finally, it would be better if the system could also operate on the web, so that there would be two versions: an online and a desktop application.
References
1. Dr. Osmar R. Zaiane (1999) "Introduction to data mining" [online] Available from: <http://webdocs.cs.ualberta.ca/~zaiane/courses/cmput690/slides/Chapter1/sld011.htm> [cited 25/02/10]
2. Kenneth Cukier (2010) "Data, data everywhere", The Economist, pp 1-3, a special report on managing information, Feb 27.
3. Kenneth Cukier (2010) "Data, data everywhere", The Economist, pp 3-4, a special report on managing information, Feb 27.
4. Kenneth Cukier (2010) "All too much", The Economist, pp 3, a special report on managing information, Feb 27.
5. Karl Rexeter (2009) "2009 Data Miner Survey" [online] Available from: <http://www.rexeranalytics.com/Data-Miner-Survey-Results-2009.html> [cited 28/04/10]
6. Karl Rexeter (2007) "2007 KDD Nuggets Survey" [online] Available from: <http://www.the-data-mine.com/bin/view/Software/MostPopularDataMiningSoftware> [cited 28/04/10]
7. "Project details for WEKA" (2010) [online] Available from: <http://mloss.org/media/screenshot_archive/weka_explorer_screenshot.png> [cited 28/04/10]
8. Wikipedia (2010) "WEKA (machine learning)" [online] Available from: <http://en.wikipedia.org/wiki/Weka_(machine_learning)> [cited 28/04/10]
9. Wikipedia (2010) "Data mining" [online] Available from: <http://en.wikipedia.org/wiki/Data_mining> [cited 28/04/10]
10. U. Fayad, G. Piatetsky-Shpiro, P. Smyth (1996) From data mining to knowledge discovery: An overview. In Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, Cambridge.
11. Rithm (2010) "Knowledge discovery in database" [online] Available from: <http://www.rithme.eu/?m=home&p=kdprocess&lang=en> [cited 28/04/10]
12. U. Fayad, G. Piatetsky-Shpiro, P. Smyth (1996) "The KDD Process for Extracting Useful Knowledge from Volumes of Data", Communications of the ACM, Vol. 39, No. 11
13. Cross Industry Standard Process for Data Mining (2010) "About CRISP-DM" [online] Available from: <http://www.crisp-dm.org/> [cited 28/04/10]
14. M.F. Hornick, E. Marcade, S. Venkayala (2007) "Java Data Mining: Strategy, Standard, and Practice", Elsevier Inc., pp 52-59
15. Aerlingus.com (2010) "AerLingus systems" [online] Available from: <http://student.dcu.ie/~czakanm2/ca596/asgn2datamining.html>
16. Dr. Osmar R. Zaiane (1999) "Principles of Knowledge Discovery in Databases" [online] Available from: <http://webdocs.cs.ualberta.ca/~zaiane/courses/cmput690/> [cited 28/04/10]
17. Mike Chapple, About.com (2010) "Regression" [online] Available from: <http://databases.about.com/od/datamining/g/regression.htm>
18. Wikipedia (2010) "Cluster Analysis" [online] Available from: <http://en.wikipedia.org/wiki/Cluster_analysis> [cited 28/04/10]
19. MSDN (2010) "Data Mining Algorithm" [online] Available from: <http://msdn.microsoft.com/en-us/library/ms175595.aspx> [cited 28/04/10]
20. Wikipedia (2010) "Association rule learning" [online] Available from: <http://en.wikipedia.org/wiki/Association_rule_learning> [cited 28/04/10]
21. M.H. Dunham (2003) "Data Mining: Introduction and advance topics", Pearson Education, Inc., pp 8.
22. M.H. Dunham (2003) "Data Mining: Introduction and advance topics", Pearson Education, Inc., pp 14-15.
23. Wikipedia (2010) "Association rule learning" [online] Available from: <http://en.wikipedia.org/wiki/Association_rule_learning> [cited 28/04/10]
24. P. Tan, M. Steinbach, V. Kumar (2006) "Introduction to data mining", Pearson Education, Inc., pp 327.
25. P. Tan, M. Steinbach, V. Kumar (2006) "Introduction to data mining", Pearson Education, Inc., pp 328-330
26. P. Tan, M. Steinbach, V. Kumar (2006) "Introduction to data mining", Pearson Education, Inc., pp 374
27. J. Han, M. Kamber (2001) "Data Mining: Concept and Technique", Academic Press, San Diego, pp 228
28. P. Tan, M. Steinbach, V. Kumar (2006) "Introduction to data mining", Pearson Education, Inc., pp 328
29. P. Tan, M. Steinbach, V. Kumar (2006) "Introduction to data mining", Pearson Education, Inc., pp 337-352
30. M.H. Dunham (2003) "Data Mining: Introduction and advance topics", Pearson Education, Inc., pp 236-238
31. M.H. Dunham (2003) "Data Mining: Introduction and advance topics", Pearson Education, Inc., pp 184-186
32. Ilias Petrounias, Xiaodong Chen (2000) "Discovering Temporal Association Rules: Algorithm, Language and System" [online] Available from: <http://www.computer.org/portal/web/csdl/doi/10.1109/ICDE.2000.839423> [cited 28/04/10]
33. Y. Kambayashi, Mukesh Mohania, A Min Tjoa (2001) Second International Conference on Data Warehousing and Knowledge Discovery, pp 329, Springer (London)
34. Y. Kambayashi, Mukesh Mohania, A Min Tjoa (2001) Second International Conference on Data Warehousing and Knowledge Discovery, pp 330, Springer (London)
35. P. Tan, M. Steinbach, V. Kumar (2006) "Introduction to data mining", Pearson Education, Inc., pp 419
36. G. Larman (2005) "Applying UML and Patterns", Pearson Education, Inc., pp 18.
37. Openia (2010) "Methodology" [online] Available from: <http://www2.openia.com/about/methodology> [cited 28/04/10]
38. SearchSoftwareQuality (2007) "requirements analysis" [online] Available from: <http://searchsoftwarequality.techtarget.com/sDefinition/0,,sid92_gci1248686,00.html> [cited 28/04/10]
39. Dr Siobhan Devlin (2010) "requirement analysis and definition" [online] Available from: <http://osiris.sunderland.ac.uk/~cs0sdv/CSE100/> [cited 28/04/10]
40. Lesson from History (2009) "Functional versus Non-Functional Requirements and Testing" [online] Available from: <http://www.lessons-from-history.com/node/83> [cited 28/04/10]
41. Ruth Malan and Dana Bredemeyer (2010) "Functional Requirements and Use Cases" [online] Available from: <https://docs.google.com/viewer?url=http://www.bredemeyer.com/pdf_files/functreq.pdf> [cited 28/04/10]
42. Jakob Nielsen (2010) "Ten Usability Heuristics" [online]
43. Eric Lain (2008) "ComputerWorld" [online] Available from: <http://www.computerworld.com/s/article/9087918/Size_matters_Yahoo_claims_2_petabyte_database_is_world_s_biggest_busiest> [cited 28/04/10]
44. YookStore (2010) "Why SQL?" [online] Available from: <http://www.yook.com/sql/> [cited 28/04/10]
45. J. Han, M. Kamber (2001) "Data Mining: Concept and Technique", Academic Press, San Diego, pp 235
46. The Java Tutorials (2010) "Buffered streams" [online] Available from: <http://java.sun.com/docs/books/tutorial/essential/io/buffers.html> [cited 28/04/10]
47. MySQL Forge (2010) "Spawner Data Generator" [online] Available from: <http://forge.mysql.com/projects/project.php?id=214> [cited 28/04/10]
48. Wikipedia (2010) "System Testing" [online] Available from: <http://en.wikipedia.org/wiki/System_testing> [cited 28/04/10]