Arisekola Akanbi
EVALUATION OF DATA MINING
Thesis
CENTRAL OSTROBOTHNIA UNIVERSITY OF APPLIED SCIENCES
Degree Programme in Information Technology
May 2011
Thesis Abstract
Department: Technology and Business
Date: May 2011
Author: Arisekola Akanbi
Degree Programme: Degree Programme in Information Technology
Thesis Topic: Evaluation of data mining
Instructor: Kauko Kolehmainen
Supervisor: Kauko Kolehmainen
Pages: 63 + APPENDIX
Developments in networking, processor and storage technologies have led to an increase in the amount of data flowing into organizations and to the creation of mega databases and data warehouses that handle the bulk of transactional data in digital form. This has created a pressing need to develop processes and tools to analyze such data explicitly, so as to extract valuable trends and correlations and generate interesting information that will yield knowledge from the data.
Data mining is the technology that meets the challenge of extracting knowledge from these vast data burdens. It provides a user-oriented approach to discovering novel hidden patterns in data. Disciplines ranging from machine learning, information retrieval and statistics to artificial intelligence have influenced the development of data mining. Given the geometric increase in data flow, we can expect ever more advanced and sophisticated information to lie hidden in datasets.
The goal of the thesis was to evaluate data mining in theory and in practice. An
overview of database systems, data warehousing, data mining goals,
applications and algorithms was carried out. It also involved reviewing data
mining tools. Microsoft SQL Server 2008, in conjunction with the Excel 2007 data mining add-ins, was used to demonstrate data mining tasks in practice, using data samples from the Microsoft AdventureWorks database and the Weka repository. In conclusion, the results of the tasks performed with the Microsoft Excel data mining add-ins revealed how reliable, easy and efficient data mining can be.
Key words: data, data mining, information and knowledge
TABLE OF CONTENTS
1 INTRODUCTION……………………………………………………………………....1
2 DATABASE SYSTEMS…....................................................................................3
2.1 Databases……………………………………………………………………….....4
2.2 Relationship between data mining and data warehousing……………..........6
2.3 Data warehousing.....…………………………………………………………….6
3 DATA MINING…………………………………………………………………………9
3.1 Brief history and evolution………………………………………………………..9
3.2 Knowledge discovery in databases…………………………………………….11
3.3 Knowledge discovery process models…………………………………………12
3.4 The need for data mining……………………………………………….............14
3.5 Data Mining Goals……………………………………………………………….15
3.6 Applications of data mining………………………………………………..........16
3.6.1 Marketing…………………………………………………………………......16
3.6.2 Supply chain visibility………………………………………………….........16
3.6.3 Geospatial decision making……………………………………………......17
3.6.4 Biomedicine and science application……………………………………..17
3.6.5 Manufacturing……………………………………………………………….17
3.6.6 Telecommunications and control…………………………………….........18
4 DISCOVERED KNOWLEDGE…………………………………………………......19
4.1 Association rules…………………………………………………………………19
4.1.1 Association rules on transactional data…………………………………....21
4.1.2 Multilevel association rules…………………………………………………22
4.2 Classification…………………………………………………………………….22
4.2.1 Decision tree…………………………………………………………….......23
4.3 Clustering…………………………………………………………………….......25
4.4 Data mining algorithms………………………………………………………....26
4.4.1 Naïve Bayes algorithm……………………………………………………...26
4.4.2 Apriori algorithm……………………………………………………………..27
4.4.3 Sampling algorithm…………………………………………………………28
4.4.4 Frequent-pattern tree algorithm…………………………………………...28
4.4.5 Partition algorithm…………………………………………………………..29
4.4.6 Regression………………………………………………………………….29
4.4.7 Neural networks…………………………………………………………….30
4.4.8 Genetic algorithm……………………………………………………..........30
5 APPLIED DATA MINING……………………………………………………………32
5.1 Data mining environment…………………………………………………….....32
5.2 Installing the SQL server………………………………………………………...34
5.3 Data mining add-ins for Microsoft Office 2007………………………………..34
5.4 Installing the add-ins for Excel 2007…………………………………………...35
5.5 Connecting to the analysis service…………………………………………….35
5.6 Effect of the add-ins……………………………………………………………..36
5.6.1 Analyze key Influencers…………………………………………………....38
5.6.2 Detect categories…………………………………………………………....39
5.6.3 Fill from example tool……………………………………………………….41
5.6.4 Forecast tool…………………………………………………………………43
5.6.5 Highlight exceptions tool…………………………………………………...44
5.6.6 Scenario analysis tool………………………………………………………45
5.6.7 Prediction calculator………………………………………………………...47
5.6.8 Shopping basket analysis………………………………………………….49
6 ANALYSIS SCENARIO AND RESULT…………………………………………..52
7 CONCLUSION………………………………………………………………………59
REFERENCES
APPENDIX
1 INTRODUCTION
It is an established fact that we live in an information technology driven society, where knowledge is an invaluable asset to any individual, organization or government. Companies are supplied with huge amounts of data on a daily basis, and they need to focus on refining these data so as to get the most important and useful information into their data warehouses. The need for a technology to help satisfy this quest for information has been on the research and development front for several years.
Data mining is a new technology which can be used to extract valuable information from the data warehouses and databases of companies and governments. It involves the extraction of hidden information from huge datasets. It helps in detecting anomalies in data and predicting future patterns and behaviour in a highly efficient way.
Data mining is implemented using tools, and the automated analysis provided by these tools goes beyond the evaluation of datasets to providing tangible clues that human experts would not have been able to detect, because they have never experienced or expected such patterns. Applying data mining makes it easier for companies and governments to make quality decisions from available data, decisions which would otherwise have taken much longer when based on human expertise alone.
Data mining techniques can be applied in a wide range of organizations, as long as they collect data, and there is a variety of data mining software available on the market today to help companies tackle decision-making problems and thereby overcome competition from other companies in the same business.
The goal of this thesis work is to evaluate data mining in theory and in practice, and the thesis could also be used for academic purposes. It gives an overview of database systems, data warehousing and data mining as a field, and tries out some of the data mining tools used to accomplish the process. Achieving this objective involves reviewing the main algorithms employed
during data mining by most data mining tools, carrying out some scenario analysis
to demonstrate the process using one or more data mining tools.
The tool used in this work is Microsoft SQL Server 2008 in conjunction with the Excel 2007 data mining add-ins. However, this tool only uses data that has already been collected and prepared, because it basically models and analyzes ready data. The other data mining tools tried were IBM Intelligent Miner, Tanagra data miner, and Weka.
The contents of this work start with an overview of database systems and databases, these being the root technology that led, by way of evolution, to data mining. There is then a brief review of data warehousing and its relation to data mining, since all useful data collected by organizations are kept there before being subjected to any further mining or analysis prior to decision making. There is an overview of data mining as a field, its evolution, what motivated its coming into existence, data mining objectives and the process of knowledge discovery in databases.
The knowledge discovered is then placed into classes based on the method used
and the outcome. The main algorithms employed during data mining process are
also analyzed, some having example citations and graphs to emphasize the
algorithm. Some of the numerous possible applications of data mining are also
discussed.
Applied data mining using the Microsoft Office Excel 2007 data mining add-ins is also discussed, ranging from the SQL server and the add-ins to the effect of the add-ins in the form of the tools and algorithms employed during the analysis of ready data.
To emphasize, the objective of this work is to evaluate data mining in theory and practice.
2 DATABASE SYSTEMS
In this chapter, an overview of database systems and their evolution, databases, data warehousing, and the relationship between data warehousing and data mining will be given.
An understanding of databases would be incomplete without some knowledge of the major aspects which constitute the building and framework of database systems, and these fields include structured query language (SQL), extensible markup language (XML), relational database concepts, object-oriented concepts, clients and servers, security, unified modeling language (UML), data warehousing, data mining and emerging applications. The relational database idea was put forward to separate the storage of data from its conceptual depiction, and hence provide a logical foundation for content storage. The birth of object-oriented programming languages brought about the idea of object-oriented databases, even though they are not as popular as relational databases today. (Elmasri & Navathe 2007, 23.)
Adding and retrieving information from databases is fundamentally achieved by the use of SQL, while interchanging data on the web was made possible and enhanced by publishing languages like Hypertext Markup Language (HTML) and XML. The data are kept on web servers so that other users can access them, and subsequent development and research into database management led to the realization of data warehousing and data mining, even though they had been applied before the names were coined. (Elmasri & Navathe 2007, 23.)
XML is a standard proposed by the World Wide Web Consortium. It helps users to describe and store structured or semi-structured data and to exchange data in a platform- and tool-independent way. It helps to implement and standardize communication between knowledge discovery and database systems. The Predictive Model Markup Language (PMML), a standard based on XML, has also been identified as enhancing interoperability among different mining tools and achieving integration with other applications ranging from database systems to decision support systems. (Cios, Pedrycz, Swiniarski & Kurgan 2007, 20.)
The Data Mining Group (DMG) developed PMML to represent analytic models. It is supported by leading business intelligence and analytics vendors such as IBM, SAS, MicroStrategy, Oracle and SAP. (Webopedia 2011.)
Database systems can be classified as OLTP (on-line transaction processing) systems and decision support systems, such as warehouses, on-line analytical processing (OLAP) and mining. Archives of data from OLTP feed decision support systems, which have the aim of learning from past instances. OLTP involves many short, update-intensive commands and is the main function of relational database management systems. (Mitra & Acharya 2006, 24.)
2.1 Databases
A database is a well-structured aggregation of data that are associated in a meaningful way and can be accessed in various logical ways by several users. Database systems are systems in which the translation and storage of data are of paramount value; the requirement that data remain optimally usable by many users over several years is what characterizes database systems. (Sumathi & Esakkirajan 2007, 2.)
It is sometimes abbreviated as db. It is a collection of organized data put in a way
that a computer program could quickly and easily select required parts of the data.
It can be presumed as an electronic filing system. (Webopedia 2011.)
A traditional database is organized into fields, records and files, where a field is a single piece of information, a record is a complete set of fields, and a file is a collection of records. A database management system is needed to access data or information in a database. (Webopedia 2011.) Graph 1 below is a simple representation of how information technology and databases have evolved over time towards data mining.
Data collection and database creation (1960s and earlier):
- Primitive file processing

Database management systems (1970s to early 1980s):
- Hierarchical and network database systems
- Relational database systems
- Data modeling tools: entity-relationship models, etc.
- Indexing and access methods: B-trees, hashing, etc.
- Query languages: SQL, etc.
- User interfaces, forms and reports
- Query processing and query optimization
- Transactions, concurrency control and recovery
- On-line transaction processing (OLTP)

Advanced database systems (mid-1980s to present):
- Advanced data models: extended relational, object-relational, etc.
- Advanced applications: spatial, temporal, multimedia, active, stream and sensor, scientific and engineering, knowledge based

Web-based databases (1990s to present):
- XML-based database systems
- Integration with information retrieval
- Data and information integration

Advanced data analysis: data warehousing and data mining (late 1980s to present):
- Data warehousing and knowledge discovery: generalization, classification, association, clustering, frequent pattern and structured pattern analysis, outlier analysis, trend and deviation analysis, etc.
- Advanced data mining applications: stream data mining, bio-data mining, time series analysis, text mining, web mining, intrusion detection, etc.
- Data mining and society: privacy-preserving data mining

New generation of integrated data and information systems (present to future)
GRAPH 1. Evolution of Database Technology (adapted from Han, Kamber & Pei
2005, 2)
2.2 Relationship between data mining and data warehousing
There has been explosive growth in database technology and in the amount of data collected. Developments in data collection methods, bar code usage and the computerization of transactions have provided us with enormous amounts of data. The huge size of the data and the great computation involved in knowledge discovery hamper the ability to analyze the readily available data in order to extract more intelligent and useful information, while data mining is all about enhancing decision making and predictions, or the process of data-driven extraction of not so obvious but useful information from large databases. (Sumathi & Sivanandam 2006, 5.)
Today's competitive marketplace challenges even the most successful companies to protect and retain their customer base, manage supplier partnerships and control costs while at the same time increasing their revenue. These and other information management processes are only achievable if information managers have accurate and reliable access to the data, which prompts the need for the creation of data warehouses. (Sumathi & Sivanandam 2006, 6.)
Interestingly, data warehousing provides online analytical processing tools for the interactive analysis of multidimensional data of varied granularities, which facilitates data mining; mining functions such as prediction, classification and association can be integrated with OLAP operations, thus enhancing the mining of knowledge. This buttresses the fact that data warehouses have become an increasingly important platform for data analysis and data mining. (Sumathi & Sivanandam 2006, 6.)
2.3 Data warehousing
A data warehouse is an enabled relational database system designed to support
very large databases at a significantly higher level of performance and
manageability. It is an environment and not a product. (Han et al. 2001, 39.)
It is an architectural construct of information that is difficult to reach or present in
traditional operational data stores. A data warehouse is also referred to as a
subject-oriented, integrated, time variant and non-volatile collection of data which
supports management decision making process. Subject-oriented depicts that all
tangible relevant data pertaining to a subject are collected and stored as a single
set in a useful format. (Han et al. 2001, 40.)
Integrated relates to the fact that data is being stored in a globally accepted style
with consistent naming trends, measurement, encoding structure and physical
features even when the underlying operational systems store the data differently.
Non-volatile simply implies that the data in a data warehouse is in a read-only state: once loaded it can be found and accessed in the warehouse but is not updated. Time-variant denotes the period over which the data has been available, because such data usually cover a long time span. (Han et al. 2001, 41.)
The process of constructing and using data warehouses is called data warehousing. Data warehouses comprise consolidated data from several sources, augmented with summary information and covering a long time period. They are much larger than other kinds of databases, with sizes ranging from several gigabytes to terabytes. Typical workloads involve ad hoc, fairly complex queries, and fast response times are important. (Ramakrishnan & Grehrke 2003, 679.)
OLAP, however, is a basic function of a data warehouse system. It focuses on data analysis and decision making based on the content of the data warehouse, and it is subject oriented, implying that it is organized around a certain main subject. It is built by integrating multiple, heterogeneous data sources like flat files, on-line transaction records and relational databases in a given format. (Mitra & Acharya 2006, 24.)
Data cleaning, integration and consolidation techniques are often employed to ensure consistency in nomenclature, encoding structures, attribute measures and much more among different data sources, and they can be viewed as an important preprocessing step for data mining. Data warehouses primarily provide information from a historical perspective. (Mitra & Acharya 2006, 26.)
Every key structure in the data warehouse contains some atom of time, explicitly or implicitly, even though the key of operational data may not contain the time atom. Data warehouses neither need to be updated operationally nor require transaction processing, recovery and concurrency control systems; all they need is the initial loading of the data and access to it. (Mitra & Acharya 2006, 25.)
3 DATA MINING
The term data mining could be perceived to have derived its name from the similarities between searching for valuable information in a large database and mining a mountain for a vein of valuable ore; both processes require either sifting through an immense amount of material or intelligently probing it to find where the value resides. (Lew & Mauch 2006, 16.)
Data mining, also termed knowledge discovery, is the computer-assisted process of digging through and analyzing enormous sets of data and then extracting the meaning of the data. Data mining tools predict behaviors and future trends, allowing proactive, knowledge-driven decisions to be made. They scour databases for hidden patterns and predictive information that experts may miss because it lies outside their expectations. (Lew & Mauch 2006, 16-17.)
The Gartner Group refers to data mining as "the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques". (Larose 2005, 2.)
3.1 Brief history and evolution
Management Information Systems (MIS) in the 1960s and Decision Support Systems (DSS) during the 1970s did a lot by providing large amounts of data for processing and execution, but unfortunately what was obtained from these systems was little more than raw data, and that was not enough to enhance business activities and decisions. (Mueller & Lemke 2003, 5.)
A summary of the trend of data mining evolution from data collection to data
mining stage could be seen in the table below.
TABLE 1. Steps in the evolution of data mining (adapted from Thearling 2010)
Evolutionary step: Data Collection (1960s)
Business question: "What was my total revenue in the last five years?"
Enabling technologies: computers, tapes, disks
Characteristics: retrospective, static data delivery

Evolutionary step: Data Access (1980s)
Business question: "What were unit sales in New England last March?"
Enabling technologies: relational databases (RDBMS), Structured Query Language (SQL), ODBC
Characteristics: retrospective, dynamic data delivery at record level

Evolutionary step: Data Warehousing & Decision Support (1990s)
Business question: "What were unit sales in New England last March? Drill down to Boston."
Enabling technologies: on-line analytic processing (OLAP), multidimensional databases, data warehouses
Characteristics: retrospective, dynamic data delivery at multiple levels

Evolutionary step: Data Mining (emerging today)
Business question: "What's likely to happen to Boston unit sales next month?"
Enabling technologies: advanced algorithms, multiprocessor computers, massive databases
Characteristics: prospective, proactive information delivery
Data mining can be viewed as having evolved from fields with a long history, though the term itself was introduced in the 1990s. It can be traced back to a family of origins which includes classical statistics, artificial intelligence, database systems, pattern recognition and machine learning. Basically, without statistics there would be no data mining, because statistics is the bedrock upon which most data mining techniques are based. (Data mining software 2011.)
Classical statistics studies data and data relationships, hence it plays a significant role at the heart of today's data mining tools. Artificial intelligence, on the other hand, was built upon heuristics as opposed to classical statistics; it applies human-thought-like processing to statistical problem domains. Machine learning can be perceived as a merger of artificial intelligence and advanced statistical analysis, since it lets computer programs learn about the data they process. (Data mining software 2011.)
It prompts them to make decisions based on the data studied, using statistics and advanced artificial intelligence heuristics and algorithms. Traditional query and report tools have been used to describe and extract what is in a database: the user forms a hypothesis about a relationship and verifies or discounts it with a series of queries against the data. (Data mining software 2011.)
3.2 Knowledge discovery in databases
Data will not fulfill its potential if it is not processed into information from which knowledge can be gained through further processing. Knowledge discovery involves a process that yields new knowledge; it details the sequence of steps (data mining included) that ought to be followed in order to discover knowledge in data, and each step is achieved using some software tools. It is the nontrivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data in databases. It involves several steps, and each attempts to discover some knowledge using some knowledge discovery method. (Han et al. 2005, 10.)
It encompasses the whole knowledge exploration process, ranging from data storage and access, analysis of large data sets with efficient and scalable algorithms, and the interpretation and visualization of results, to the modeling and support of human-machine interaction. (Han et al. 2005, 11.) A unique feature of the model is the definition of the input and output states of data, because the output of a step is used as the input of a subsequent step in the process, and at the end the output is the discovered knowledge, portrayed in terms of patterns, rules, classifications, associations, trends and statistical analysis. (Han et al. 2005, 11.)
3.3 Knowledge discovery process models
The knowledge discovery process has been placed into two main models called
the Fayyad et al (academic) model and the Anand and Buchner (industrial) model.
The Fayyad model is represented below:
(Figure: starting from the raw data, selection produces the target data, preprocessing the preprocessed data, and transformation the transformed data; data mining then extracts patterns, and interpretation/evaluation turns those patterns into knowledge.)
GRAPH 2. Knowledge discovery process (adapted from Fayyad, Piatetsky &
Smyth 1996)
Developing and understanding the application domain entails learning the relevant prior knowledge and the goals of the end user for the discovered knowledge. The next phase is creating a target data set, which involves querying the existing data to select the desired subset by choosing subsets of attributes and data points to be used for the task. (Han et al. 2005, 14.)
Data cleaning and preprocessing entails eradicating outliers, handling noise and missing values in the data, and accounting for time sequence information and known changes. It leads to data reduction and projection, which consists of finding valuable attributes by utilizing dimension reduction and transformation methods, and discovering invariant representations of the data. (Han et al. 2005, 15.)
Choosing the data mining task involves matching the relevant prior knowledge and the objectives of the user with a specific data mining method. Choosing the data mining algorithm basically involves selecting methods to search for patterns in the data and deciding which models and parameters of the method are appropriate. (Han et al. 2005, 15-16.)
Data mining, the next phase, involves generating patterns in a particular representation form, such as classification rules or decision trees. Consolidation of the discovered knowledge involves incorporating the knowledge into the performance system, and documenting and reporting it. (Han et al. 2005, 15-16.) The industrial model, also tagged the CRISP-DM knowledge discovery process, is summarized in the graphs below.
GRAPH 3. Phases of the CRISP-DM process model (adapted from CRISP 2011)
1.Business understanding
· Determination of business
objective
· Assessment of the situation
· Determination of data mining goal
· Generation of project plan
2. Data Understanding
· Collection of initial data
· Description of data
· Exploration of data
· Verification of data quality
3. Data preparation
· Data selection
· Data cleansing
· Construction of data
· Integration of data
· Formatting of data
4. Modeling
· Selection of modeling technique
· Generation of test design
· Models creation
· Generated models assessment
5. Evaluation
· Evaluation of results
· Process review
· Determination of next step
6. Deployment
· Plan deployment
· Plan monitoring and maintenance
· Generation of final report
· Review of the process
GRAPH 4. Details of CRISP process model (adapted from CRISP 2011)
3.4 The need for data mining
The achievement of digital revolution and the escalation of the internet have
brought about a great amount of multi-dimensional data in virtually all human
endeavor, and the data type ranges from text, image, audio, speech, hypertext,
15
graphics and video thus providing organizations with too many data, but the whole
data might not be useful if it does not provide a tangible unique information that
could be utilized in solving a problem. The quest to generate information from
existing data prompted the need for data mining. (Mitra & Acharya 2003, 2.)
The impressive development of data mining can be attributed to the conglomeration of several factors: the explosive growth in data collection, the storing of data in warehouses, thus enhancing the accessibility and reliability of databases, increased access to data from the internet and intranets, the strong quest to raise market share globally, the evolution of mining software, and the tremendous growth in computing power and storage capacity. (Larose 2005, 4.)
3.5 Data mining goals
Data mining is basically done with the aim of achieving certain objectives, ranging from classification, prediction and identification to optimization.

· Classification: this involves allocating data into classes or categories determined by combinations of criteria.
· Prediction: mining in this instance helps to single out features in the data and their tendencies over time.
· Identification: trends or patterns in the data can help to identify the existence of items, events or actions in a given scenario or case.
· Optimization: mining also facilitates optimizing the use of scarce resources, in turn maximizing output variables under constraint conditions. (Elmasri & Navathe 2007, 947.)
3.6 Applications of data mining
The traditional approach to data analysis for decision making used to involve furnishing domain experts with statistical modeling techniques so that they could develop hand-crafted solutions for specific problems. However, the influx of mega data having millions of rows and columns, the spontaneous construction and deployment of data-driven analytics, and the demand by users for easily readable and understandable results have prompted the inevitable need for data mining. (Sumathi & Sivanandam 2006, 166.) Data mining technologies are deployed in several decision-making scenarios in organizations. Their importance cannot be overemphasized, as they are applicable in several fields, some of which are discussed below.
3.6.1 Marketing
This involves the analysis of customer behavior and purchasing patterns, and the determination of market strategies, covering advertising, location, targeted mailing, the segmentation of customers, products and stores, and catalog design and advertisement strategy. (Elmasri & Navathe 2007, 970.)
3.6.2 Supply chain visibility
Companies have automated portions of their supply chain, enabling the collection of significant data about inventory, supplier performance, the logistics of materials and finished goods, material expenditures, and the accuracy of plans for order delivery. Data mining application also spans price optimization and workforce analysis in organizations. (Sumathi & Sivanandam 2006, 169.)
3.6.3 Geospatial decision making
In the climate data and earth ecosystem scenario, data mining supports the automatic extraction and analysis of interesting patterns, which involves modeling ecological data and designing efficient algorithms for finding spatiotemporal patterns in the form of tele-connection patterns, or recurring and persistent climate patterns. This is usually carried out using the clustering technique, which divides the data into meaningful groups, helping to automate the discovery of tele-connections. (Sumathi & Sivanandam 2006, 174.)
3.6.4 Biomedicine and science application
Biology used to be a field dominated by a "formulate hypothesis, conduct experiment, evaluate results" attitude, but under the impact of data mining it has evolved into a field with a "big science" attitude involving collecting and storing data, mining for new hypotheses, and then confirming them with data or supplemental experiments. (Sumathi & Sivanandam 2006, 170.)
It also includes the discovery of patterns in radiological images, the analysis of microarray (gene-chip) experimental data to cluster genes and relate them to symptoms or diseases, and the analysis of the side effects and effectiveness of certain drugs. (Sumathi & Sivanandam 2006, 171.)
3.6.5 Manufacturing
The application in this area relates to optimizing the resources used in the optimal design of manufacturing processes, and to product design based on customers' feedback. (Elmasri & Navathe 2007, 970.)
3.6.6 Telecommunications and control
Data mining is applied to the vast, high volume of data consisting of call records and other telecommunication-related data, which in turn is used in toll-fraud detection, consumer marketing and improving services. (Sumathi & Sivanandam 2006, 178.) Data mining is also applied in security operations and services, information analysis and delivery, text and web mining, banking and commercial applications as well as insurance. (Han et al. 2005, 456.)
4 DISCOVERED KNOWLEDGE
Data only becomes useful when it is converted into information, and it becomes paramount when some knowledge is gained from the generated information; this is the most vital phase of data handling in any setup that deals with decision making. The knowledge obtained can be inductive or deductive, where deductive knowledge deduces new information by applying pre-specified logical rules to some data. Inductive knowledge is the form of knowledge referred to where data mining is concerned, as it discovers new rules and patterns from the given data. (Elmasri & Navathe 2007, 948.) The knowledge acquired from data mining is classified in the forms below, though knowledge may also result from a combination of any of them:
· Association rules: these correlate the presence of a set of items with a range of values for another set of variables.
· Classification hierarchies: these aim at progressing from an existing set of transactions or actions to generate a hierarchy of classes.
· Sequential patterns: these seek some form of sequence in events or activities.
· Patterns within time series: this involves detecting similarities within positions of time series of data, that is, sets of data obtained at regular intervals.
· Clustering: this relates to the segmentation of a given collection of items or actions into sets of similar elements. (Elmasri & Navathe 2007, 949.)
4.1 Association rules
Association rule mining deals with unsupervised data, as it finds interesting associations, dependencies and relationships in vast item sets. These items are kept as transactions that can be created by an external process or fetched from data warehouses or relational databases. Due to the scalable nature of association rule algorithms and the ever-increasing size of accumulating data, the use of association rules for knowledge extraction is somewhat inevitable, as discovering interesting associations gives a source of information that is used for making decisions. (Han et al. 2005, 256.)
Association rules are applied in areas such as market-basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, data preprocessing, genomics, etc. Market-basket analysis is the most intuitive application of association rules, as it strives to analyze customer tendencies by finding associations between items purchased by customers. (Han et al. 2005, 264-270.)
An example of the application of association rules is shown in the graph below, where a sales transaction table is used to identify items which are often bought together, so as to be able to make cross-selling decisions during discount sales periods. (MacLennan, Tang & Crivat 2009, 7.) It simply shows that customers who buy milk also buy cheese, and customers who buy cheese may also buy wine; likewise customers buy either Coke or Pepsi together with juice, and the same applies to customers buying beer or wine, cake or donuts.
(Figure: the items milk, cake, beer, cheese, wine, Coke, Pepsi, juice, donut and beef, with arrows indicating which items tend to be bought together.)
GRAPH 4. Items association (adapted from MacLennan et al. 2009, 7)
4.1.1 Association rules on transactional data
Items are denoted as Boolean variables, while a collection of items is denoted as a Boolean vector; the vector is analyzed to determine which variables are frequently taken together by different users, in other words associated with each other. These co-occurrences are represented in association rules written as:

LHS => RHS [support, confidence]
The left-hand side (LHS) implies the right-hand side (RHS) with a given value of support and confidence. Support and confidence are used to determine the quality of a given rule in terms of its usefulness (strength) and certainty. Support denotes how many instances (transactions) from the data set that was used to generate the rule include the items from both the left-hand side and the right-hand side, while confidence expresses how many of the instances that include items from the left-hand side also include items from the right-hand side; both measures are usually expressed as percentages. (Han et al. 2005, 290-291.)
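To make the two measures concrete, the following short Python sketch computes support and confidence for a hypothetical rule {milk} => {cheese} over a handful of invented transactions; the item names merely echo the shopping example above and are not taken from any data set used in this thesis.

# Minimal sketch: computing support and confidence for the rule {milk} => {cheese}.
# The transactions below are illustrative, not taken from the thesis data.

transactions = [
    {"milk", "cheese", "bread"},
    {"milk", "cheese"},
    {"milk", "juice"},
    {"cheese", "wine"},
    {"milk", "cheese", "wine"},
]

lhs, rhs = {"milk"}, {"cheese"}

# Support: fraction of all transactions containing every item of LHS and RHS.
both = sum(1 for t in transactions if lhs <= t and rhs <= t)
support = both / len(transactions)

# Confidence: among transactions containing the LHS, the fraction that also contain the RHS.
lhs_count = sum(1 for t in transactions if lhs <= t)
confidence = both / lhs_count

print(f"support = {support:.0%}, confidence = {confidence:.0%}")  # support = 60%, confidence = 75%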
An association rule is interesting if it satisfies minimum values of confidence and support, which are stipulated by the user. Association rules are derived when data describe events that occur at the same time or in close proximity. The main kinds of association rules are single-dimensional and multidimensional, and both can be grouped as either Boolean or quantitative. (Han et al. 2005, 292.)
The Boolean case relates to the presence or absence of an event or item, while the quantitative case considers values which are partitioned into item intervals. Basically, the data to be used in any mining applying association rules ought to be given in transactional form, hence it should have a transaction identification (ID) and consistent information about all items. (Han et al. 2005, 292.)
4.1.2 Multilevel association rules
Finding association rules at low levels in cases where items form a hierarchy could
be difficult. Such association rules could be found at higher levels existing as
established knowledge. Multilevel association rules are created by performing a
top-down, iterative deepening search, in essence by first finding strong rules at the
high levels in the hierarchy, before searching for lower-level weaker rules. (Han et
al. 2005, 244.)
The main methods for multilevel association rules fall into the following classes. The uniform support based method involves the same minimum support threshold being applied to create rules at all levels of abstraction. The reduced support based method primarily fixes the shortcomings of the uniform support method, since every level of abstraction is given its own minimum support threshold, and the lower the level, the smaller the threshold. (Han et al. 2005, 244-245.)
The level-by-level independent method involves a breadth-first check being carried out, so that every node in the hierarchy is checked regardless of the frequency of its parent node. Level cross-filtering by single item means that items at a given level are checked only if their parents at the previous level are frequent. Level cross-filtering by k-itemset is a method in which a k-itemset at a given level is checked only if its parent k-itemset at the previous level is frequent. (Han et al. 2005, 246.)
4.2 Classification
Classification is the process of learning a model which describes different classes of data, where the classes are usually predetermined. It is also known as supervised learning, as it involves building a model which can be used to classify new data. The process begins with already classified data, usually called the training set, and each training record contains an attribute called the class label. (Elmasri & Navathe 2007, 961.) It is essentially the act of splitting up objects so that each one is assigned to one of a number of mutually exclusive and exhaustive categories known as classes. (Bramer 2007, 23.)
Classification basically involves finding a set of models or functions which describe and distinguish data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose label is unknown. Classification models may operate on rules involving decision trees, neural networks, mathematical rules or formulae, or probability. (Han et al. 2005, 280.)
A decision tree is a flow-chart like tree structure, where each node denotes some
test on an attribute value, each branch stands for an outcome of the test, while the
tree leaves represent classes or class distributions. (Han et al. 2005, 282.) The
approach of classification that uses probability theory to find the most likely of
classifications is known as the Naïve Bayes. (Bramer 2007, 24.)
4.2.1 Decision tree
A decision tree is a flow-chart-like tree figure used to represent the information in a given data set, where each internal node represents a test on an attribute, every branch denotes an outcome of the test, and the leaf nodes represent classes. The uppermost node in a tree is the root node. (Han et al. 2005, 284-286.)
A typical decision tree is shown below, where the tree represents magazine subscriptions by people, with information on their age, car type, number of children if any, and their subscription. Internal nodes are denoted by rectangles, leaf nodes by ovals. The table below gives sample data for the subscription.
TABLE 2. Sample training data for magazine subscription (adapted from Ye 2003,
5)
ID  Age  Car     Children  Subscription
1   23   Sedan   0         Yes
2   31   Sports  1         No
3   36   Sedan   1         No
4   25   Truck   2         No
5   30   Sports  0         No
6   36   Sedan   0         No
7   25   Sedan   0         Yes
8   36   Truck   1         No
9   30   Sedan   2         Yes
10  31   Sedan   1         Yes
11  45   Sedan   1         Yes
(Decision tree figure: the root node tests Age (<= 30 versus > 30); the subtrees then test Car Type and Number of Children, with Sedan and Sports/Truck branches leading to Yes or No leaf nodes for the subscription class.)
GRAPH 5. Decision tree for magazine subscription (adapted from Ye. 2003, 6)
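As an illustration only, the sketch below fits a decision tree to the eleven training records of TABLE 2 using the scikit-learn and pandas libraries (an assumption; these are not the tools used in the thesis). The learned tree will not necessarily match graph 5 exactly, but it shows how a class label is predicted for a new, unseen record.

# Sketch: fitting a decision tree classifier to the magazine-subscription data of TABLE 2.
# scikit-learn and pandas are assumed to be installed; they are not the tools used in the thesis itself.
from sklearn.tree import DecisionTreeClassifier, export_text
import pandas as pd

data = pd.DataFrame({
    "Age":          [23, 31, 36, 25, 30, 36, 25, 36, 30, 31, 45],
    "Car":          ["Sedan", "Sports", "Sedan", "Truck", "Sports", "Sedan",
                     "Sedan", "Truck", "Sedan", "Sedan", "Sedan"],
    "Children":     [0, 1, 1, 2, 0, 0, 0, 1, 2, 1, 1],
    "Subscription": ["Yes", "No", "No", "No", "No", "No", "Yes", "No", "Yes", "Yes", "Yes"],
})

# One-hot encode the categorical Car attribute so the tree can split on it.
X = pd.get_dummies(data[["Age", "Car", "Children"]])
y = data["Subscription"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))   # textual view of the learned tree

# Predict the class label for a new, previously unseen customer.
new_customer = pd.DataFrame([{"Age": 33, "Car": "Sedan", "Children": 1}])
new_X = pd.get_dummies(new_customer).reindex(columns=X.columns, fill_value=0)
print(tree.predict(new_X))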
4.3 Clustering
Clustering basically aims to place objects into groups such that records in a group are similar to each other and dissimilar to records in other groups, and these groups are said to be disjoint. Cluster analysis is also called segmentation or taxonomy analysis, as it tries to identify homogeneous subclasses of cases in a given population. (Elmasri & Navathe 2007, 964.) Some of the approaches to cluster analysis are discussed below.
Hierarchical clustering permits users to choose a definition of distance, then select a linking method for forming clusters, after which the number of clusters that best suits the data is estimated. This approach creates a representation of clusters in icicle plots and dendrograms. A dendrogram is defined as a binary tree with a distinguished root that has all the data items at its leaves. (Cios et al. 2007, 260.)
The k-means clustering method simply requires the user to indicate the number of clusters in advance, after which the algorithm estimates how to assign cases to the k clusters. k-means clustering is less computer-intensive, hence it is often preferred when data sets are large, say over a thousand cases, and it creates an ANOVA table showing the mean-square error. (Garson 2010.)
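A minimal k-means sketch in Python is given below; the age and income values are invented for illustration and scikit-learn is assumed to be available (it is not the tool used in the thesis).

# Sketch: k-means clustering of people by age and income, mirroring graph 6.
# The data points are invented for illustration.
from sklearn.cluster import KMeans
import numpy as np

# Columns: age, yearly income (in thousands).
people = np.array([
    [22, 18], [25, 21], [27, 24],      # younger, lower income
    [40, 55], [43, 60], [46, 58],      # middle-aged, mid income
    [58, 95], [61, 102], [65, 99],     # older, higher income
])

# The user must fix the number of clusters k in advance.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(people)

print(kmeans.labels_)            # cluster assignment of each person
print(kmeans.cluster_centers_)   # mean age/income of each cluster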
Two-step clustering generates pre-clusters and then clusters the pre-clusters using hierarchical methods. This approach handles very high volumes of data and has the largest array of output options, including variable plots. (Garson 2010.)
Graph 6 below is a diagrammatic representation of clustering, where people are placed into clusters based on their income levels and age.
(Figure: a scatter of people by age and income, grouped into cluster 1, cluster 2 and cluster 3.)
GRAPH 6. Output of clustering people on income basis (adapted from MacLennan
et al. 2009, 7)
4.4 Data mining algorithms
Data mining algorithms are the mechanisms which create the data mining model, which is the main phase of the data mining process. The algorithms are discussed under the subsequent subheadings.
4.4.1 Naïve Bayes algorithm
Collecting frequent item sets naively involves considering all possible item sets, computing their support, and checking whether it exceeds the minimum support threshold. Such a naive algorithm searches a huge number of item sets while scanning many transactions each time, making the number of tests that need to be conducted exponentially high, thus causing problems and excessive time consumption; this shortcoming created the need for more efficient algorithms. (Han et al. 2005, 296.)
An example demonstrating the algorithm involves having several tiny particles of two colors, red and green, as shown in the graph below. The particles are classified as either red or green, and based on the graph new cases can easily be classified to tell which class they belong to.
GRAPH 7. Naive Bayes classifier (adapted from Statsoft 2011)
It is obvious that there are twice as many green particles as red particles, hence when handling a new case which has not been seen before, it is twice as likely that the particle belongs to the green group rather than the red group. (Statsoft 2011.)
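The Bayesian reasoning behind the example can be sketched in a few lines of Python; the particle counts and the neighbourhood likelihoods below are illustrative assumptions, not values from the source.

# Sketch: Naive Bayes reasoning for the red/green particle example of graph 7.
# Counts and the neighbourhood likelihoods are illustrative assumptions.

total_green, total_red = 40, 20          # green particles are twice as numerous as red

# Prior probabilities taken from the class frequencies.
prior_green = total_green / (total_green + total_red)   # 2/3
prior_red   = total_red   / (total_green + total_red)   # 1/3

# Likelihoods: fraction of each class falling in the neighbourhood of the new particle
# (e.g. 1 of 40 green and 3 of 20 red particles lie near it).
likelihood_green = 1 / 40
likelihood_red   = 3 / 20

# The posterior is proportional to prior times likelihood (Bayes' rule).
score_green = prior_green * likelihood_green
score_red   = prior_red   * likelihood_red

print("green" if score_green > score_red else "red")    # -> "red" in this illustration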
4.4.2 Apriori algorithm
This algorithm applies prior knowledge of an important property of frequent item-sets. The Apriori property declares that all non-empty subsets of a frequent item-set must themselves be frequent; hence, where a given item-set is not frequent (it does not meet the minimum support threshold), no superset of this item-set can be frequent either, since a superset cannot occur more frequently than the original item-set. (Cios et al. 2007, 295.)
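A compact, illustrative Python sketch of the Apriori level-wise search is shown below; the transactions and the minimum support threshold are invented, and the code is not the implementation used by any of the tools mentioned in this thesis.

# Sketch: a compact Apriori-style search for frequent item-sets.
# Transactions and the minimum support threshold are illustrative.
from itertools import combinations

transactions = [
    {"milk", "cheese", "wine"},
    {"milk", "cheese"},
    {"milk", "juice"},
    {"cheese", "wine"},
    {"milk", "cheese", "juice"},
]
min_support = 2  # absolute support threshold

def support(itemset):
    return sum(1 for t in transactions if itemset <= t)

# Level 1: frequent single items.
items = {i for t in transactions for i in t}
frequent = [{i} for i in items if support({i}) >= min_support]
all_frequent = list(frequent)

# Level k: only candidates whose (k-1)-subsets are all frequent can be frequent
# (the Apriori property), so everything else is pruned before counting.
k = 2
while frequent:
    candidates = {frozenset(a | b) for a in frequent for b in frequent if len(a | b) == k}
    candidates = [c for c in candidates
                  if all(set(sub) in [set(f) for f in frequent] for sub in combinations(c, k - 1))]
    frequent = [set(c) for c in candidates if support(set(c)) >= min_support]
    all_frequent.extend(frequent)
    k += 1

print(all_frequent)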
4.4.3 Sampling algorithm
This algorithm is basically about taking a small sample of the main database of transactions and then establishing the frequent item-sets from the sample. Where such frequent item-sets form a superset of the frequent item-sets of the whole database, one can confirm the real frequent item-sets by scrutinizing the remainder of the database in order to determine the exact support values of the superset item-sets. This is basically a form of the Apriori algorithm, though with a lowered minimum support. (Elmasri & Navathe 2007, 952.)
A second scan of the database is usually required because of possible missed item-sets, and determining whether any item-sets were missed gave rise to the idea of the negative border. In relation to a set of frequent item-sets S and a set of items I, the negative border is the set of minimal item-sets contained in the power set of I but not in S; in a nutshell, the negative border of a set of frequent item-sets consists of the closest item-sets that could possibly be frequent. (Elmasri & Navathe 2007, 952.)
Consider an example having a set of items I = {A, B, C, D, E}, and let the combined frequent item-sets of size 1 to 3 be S = {{A}, {B}, {C}, {D}, {AB}, {AC}, {BC}, {AD}, {CD}, {ABC}}. Here the negative border is {{E}, {BD}, {ACD}}. The set {E} is the only 1-item-set not contained in S, {BD} is the only 2-item-set not in S whose 1-item-set subsets are, and {ACD} is the only 3-item-set whose 2-item-set subsets are all in S. The negative border is important since it is necessary to determine the support of the item-sets in the negative border to ensure that no large item-sets are missed. (Elmasri & Navathe 2007, 953.)
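The negative border of this very example can be computed with a short Python sketch; the sets I and S are taken from the text, while the brute-force enumeration itself is only an illustration (checking just the subsets one element smaller is sufficient here because S, being a collection of frequent item-sets, is downward closed).

# Sketch: computing the negative border for the example above.
# I and S are exactly the sets used in the text; the enumeration is illustrative.
from itertools import combinations

I = {"A", "B", "C", "D", "E"}
S = [{"A"}, {"B"}, {"C"}, {"D"}, {"A", "B"}, {"A", "C"}, {"B", "C"},
     {"A", "D"}, {"C", "D"}, {"A", "B", "C"}]
S_frozen = {frozenset(s) for s in S}

def in_S(itemset):
    # The empty set counts as trivially present in S.
    return len(itemset) == 0 or frozenset(itemset) in S_frozen

negative_border = []
for size in range(1, len(I) + 1):
    for candidate in combinations(sorted(I), size):
        cand = set(candidate)
        # Minimal item-sets not in S: the set itself is missing from S,
        # but every subset one element smaller is already in S.
        subsets_in_S = all(in_S(set(sub)) for sub in combinations(candidate, size - 1))
        if not in_S(cand) and subsets_in_S:
            negative_border.append(cand)

print(negative_border)   # the sets {E}, {B, D} and {A, C, D}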
4.4.4 Frequent-pattern tree algorithm
This algorithm came into being because the Apriori algorithm involves creating and testing a huge number of candidate item-sets; the frequent-pattern tree algorithm eliminates the creation of such large candidate sets. A compressed sample of the database is first created, based on the frequent-pattern tree; this tree keeps useful item-set information and gives an avenue for the efficient finding of frequent item-sets. The main mining process is divided into smaller tasks, each of which operates on a conditional frequent-pattern tree, which is a branch of the main tree. (Elmasri & Navathe 2007, 955.)
4.4.5 Partition algorithm
The partitioning algorithm operates by splitting the database into non-overlapping subsets, which are treated as separate databases, and all frequent item-sets for a partition, called local frequent item-sets, are created in one pass, after which the Apriori algorithm can be applied efficiently on each partition if it fits into primary memory. Partitions are chosen such that every partition can be accommodated in main memory, hence each is read only once. (Elmasri & Navathe 2007, 957.)
The main shortcoming of this algorithm is that the minimum support for each partition depends on the size of the partition rather than on the size of the main database for large item-sets. After the first scan, a union of the frequent item-sets from every partition is taken, forming the global candidate frequent item-sets for the whole database. The global candidate large item-sets found in the first scan are confirmed in the second scan, when their support is measured over the whole database. This algorithm can be implemented in a parallel or distributed manner for enhanced performance. (Elmasri & Navathe 2007, 957.)
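The two-scan partition idea can be sketched as follows in Python; the transactions, the partition split and the support threshold are invented for illustration, and the local mining step is deliberately a brute-force enumeration rather than a real Apriori implementation.

# Sketch: the two-scan partition idea for frequent item-sets.
# Transactions, partition split and thresholds are illustrative.
from itertools import combinations
from math import ceil

transactions = [
    {"milk", "cheese"}, {"milk", "juice"}, {"cheese", "wine"},
    {"milk", "cheese"}, {"milk", "cheese", "wine"}, {"juice", "wine"},
]
min_support_ratio = 0.5   # relative threshold applied to each partition and to the whole database

def frequent_itemsets(trans, min_count):
    """Brute-force local mining: enumerate every item-set over the items seen in trans."""
    found = set()
    items = sorted({i for t in trans for i in t})
    for size in range(1, len(items) + 1):
        for cand in combinations(items, size):
            if sum(1 for t in trans if set(cand) <= t) >= min_count:
                found.add(frozenset(cand))
    return found

# Scan 1: mine each partition separately; the union forms the global candidates.
partitions = [transactions[:3], transactions[3:]]
candidates = set()
for part in partitions:
    candidates |= frequent_itemsets(part, ceil(min_support_ratio * len(part)))

# Scan 2: confirm the candidates against the whole database.
global_min = ceil(min_support_ratio * len(transactions))
globally_frequent = [set(c) for c in candidates
                     if sum(1 for t in transactions if c <= t) >= global_min]
print(globally_frequent)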
4.4.6 Regression
Regression is a special application of the classification rule. If a classification rule is regarded as a function over the variables that maps these variables into a target class variable, the rule is called a regression rule. A common application of regression occurs when, in place of mapping a tuple of data from a relation to a specific class, the value of a variable is predicted based on the tuple itself. (Elmasri & Navathe 2007, 967.)
Regression involves smoothing data by fitting the data to a function. It can be linear or multiple: linear regression involves finding the best line to fit two variables so that one can be used to predict the other, while multiple regression deals with more than two variables. (Han et al. 2005, 321.) For example, where there is a single categorical predictor such as female or male, a legitimate regression analysis has been undertaken if one compares two income histograms, one for men and one for women. (Berk 2003.)
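A minimal linear regression sketch is shown below; the x and y values are invented, and NumPy's least-squares line fit stands in for the regression step.

# Sketch: simple linear regression, fitting y = a*x + b by least squares.
# The x/y values are invented for illustration.
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)         # e.g. years of experience
y = np.array([28, 33, 39, 44, 50, 55], dtype=float)   # e.g. income in thousands

# np.polyfit returns the slope and intercept of the best-fitting line.
slope, intercept = np.polyfit(x, y, deg=1)
predicted = slope * 7 + intercept    # predict y for an unseen x = 7

print(f"y ~= {slope:.2f} * x + {intercept:.2f}")
print(f"prediction for x = 7: {predicted:.1f}")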
4.4.7 Neural networks
This is a technique derived from artificial intelligence which uses general regression and provides an iterative method to implement it. It operates using a curve-fitting approach to infer a function from a given sample, and it is a learning approach which uses a test sample for initial learning and inference. Neural networks fall into two classes, namely supervised and unsupervised networks. Adaptive methods that try to reduce the output error are supervised learning methods, while unsupervised learning methods develop internal representations without sample outputs. (Elmasri & Navathe 2007, 968-969.)
It can be used where some information is known and one would like to infer some unknown information. An example is stock market prediction, where last week's and today's stock prices are known, but one wants to know tomorrow's stock prices. (Statsoft 2011.)
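As a small illustration of this idea, the sketch below trains a tiny neural network to predict the next value of an invented price series from the two previous values; scikit-learn is assumed to be available and is not the tool used in the thesis.

# Sketch: a small supervised neural network that predicts tomorrow's value
# from the two previous values, echoing the stock-price example.
# The price series is invented.
from sklearn.neural_network import MLPRegressor
import numpy as np

prices = np.array([10.0, 10.4, 10.2, 10.8, 11.0, 11.3, 11.1, 11.6, 11.9, 12.1])

# Build (last two prices) -> (next price) training pairs from the series.
X = np.array([prices[i:i + 2] for i in range(len(prices) - 2)])
y = prices[2:]

net = MLPRegressor(hidden_layer_sizes=(8,), max_iter=5000, random_state=0).fit(X, y)

tomorrow = net.predict([prices[-2:]])    # predict the next, unknown price
print(tomorrow)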
4.4.8 Genetic algorithm
Genetic algorithms are a class of randomized search procedures capable of adaptive and robust search over a large range of search space topologies. They were developed by John Holland in the 1960s. A genetic algorithm applies the techniques of evolution, dependent on the optimization of functions in artificial intelligence, to generate a solution: it develops a population of possible solutions to some problem domain, takes out the solutions that are better and recombines them to create a new set of solutions, and lastly uses the new solutions to replace the poorer of the original ones, after which the whole cycle is repeated. (Hill 2011.)
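A toy Python sketch of that generate-select-recombine cycle is shown below; the bit-string encoding, the fitness function and all the rates are illustrative choices.

# Sketch: a toy genetic algorithm maximizing the number of 1-bits in a bit string.
# Population size, rates and the fitness function are illustrative choices.
import random

random.seed(0)
GENOME_LEN, POP_SIZE, GENERATIONS = 20, 30, 40

def fitness(genome):
    return sum(genome)                       # illustrative objective: count of 1-bits

def crossover(a, b):
    cut = random.randint(1, GENOME_LEN - 1)  # single-point crossover
    return a[:cut] + b[cut:]

def mutate(genome, rate=0.02):
    return [1 - g if random.random() < rate else g for g in genome]

population = [[random.randint(0, 1) for _ in range(GENOME_LEN)] for _ in range(POP_SIZE)]

for _ in range(GENERATIONS):
    # Select the better half, recombine pairs of survivors, and let the
    # mutated offspring replace the poorer half of the population.
    population.sort(key=fitness, reverse=True)
    survivors = population[:POP_SIZE // 2]
    offspring = [mutate(crossover(random.choice(survivors), random.choice(survivors)))
                 for _ in range(POP_SIZE - len(survivors))]
    population = survivors + offspring

print(max(fitness(g) for g in population))   # close to GENOME_LEN after a few generations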
The solutions generated by genetic algorithms differ from those of other techniques because genetic algorithms use a set of solutions during each generation instead of a single solution. The set of solutions at the disposal of a generation represents the memory of the search done so far. The algorithm finds a near-optimal balance between knowledge gain and exploitation by manipulating encoded solutions. A genetic algorithm is a randomized algorithm, unlike many other algorithms, and its ability to solve problems in parallel makes it powerful in data mining. (Elmasri & Navathe 2007, 969.)
5 APPLIED DATA MINING
The process of evaluating data mining can only be complete after a practical demonstration is done. In trying to carry out a practical mining task, several other mining tools were tried in the course of this project, ranging from IBM Intelligent Miner, Estard miner and the SQL Server data warehouse to SQL Server used with Microsoft Excel 2007; but due to logistic and administrative limitations in using most of these tools, I chose to use SQL Server together with Microsoft Excel 2007 for the mining task, since it provides a trial version which grants much access with fewer administrative requirements.
(Figure: the data mining process cycle with the steps data acquisition, data preparation, modeling, validation and application.)
GRAPH 8. Data mining process (adapted from MacLennan et al. 2009, 188)
The graph above simply shows the steps involved in a simple data mining process. However, for the purposes of this thesis work the data acquisition and preparation phases were skipped, since ready data was obtained from a repository.
5.1 Data mining environment
To demonstrate the data mining process, the mining software chosen was Microsoft Excel 2007 used in conjunction with Microsoft SQL Server. The exact SQL Server version used was Microsoft SQL Server 2008, though earlier versions exist too. The server is Microsoft's enterprise-class database solution. It consists of four components, namely the database engine, Analysis Services, Integration Services and Reporting Services, and these four components work together to create business intelligence. (Ganas 2009, 2.)
The database engine basically facilitates the storage of data in tables and allows users to analyze the given data using commands in the SQL language. Considering the database engine from the business intelligence perspective, its primary function is the storage of collected data, and it has the capacity to store hundreds of gigabytes of data, which may also be termed a "data warehouse" or "data mart". (Ganas 2009, 2.)
The Analysis Services part of SQL Server is responsible for the analysis of data using Online Analytical Processing (OLAP) cubes and the data mining algorithms. A cube is fundamentally a pre-built pivot table. It is located on the server and stores the raw data, along with pre-calculated data, in a multidimensional format. The data in an OLAP cube can be accessed using an Excel pivot table. OLAP cubes are valuable since they make it easy and convenient for users to handle and analyze extremely large amounts of data. (Ganas 2009, 2.)
The other component of SQL Server is Integration Services, which basically extracts, transforms and loads data. Its primary purpose is to transfer data between different storage formats, as in an instance where it pulls data out of an Excel file and uploads it into a SQL Server table. It is also a data cleaning tool, which makes it really relevant, since dirty data can make it difficult to develop valid statistical models. One of the cleaning techniques built into Integration Services is fuzzy logic, which is applied to clean data by isolating and removing questionable values. (Ganas 2009, 2-3.)
Reporting Services is the fourth component of SQL Server; it is a web-based reporting tool. Basically it creates a web page where users can see reports that were generated using the data in the SQL Server tables. It also contains a web application known as Report Builder, which allows users to create ad hoc reports without knowing the SQL language. This facility enhances the easy transmission of business intelligence to several users. (Ganas 2009, 3.)
5.2 Installing the SQL server
After adhering to the installation requirements, the installation wizard was used, as it allows the user to specify which features are to be installed, the installation location and the administrator privileges. The wizard is also used to grant users access to the components of the server. (MacLennan et al. 2009, 16.)
5.3 Data mining add-ins for Microsoft Office 2007
The data mining add-ins for Microsoft Office 2007 enhance the exploitation of the full potential of SQL Server, hence an instance of the add-ins has to be installed on a machine which already has an SQL Server and all its components installed. The add-ins comprise three parts, namely the Table Analysis Tools, the Data Mining Client, and the Data Mining Templates for Visio. They also contain the Server Configuration Utility, which handles the details of configuring the add-ins and the connection to Analysis Services. (MacLennan et al. 2009, 16.)
The Data Mining Client for Excel 2007 allows users to build complex models in Excel while processing those models on a high-performance server running Analysis Services, and this reduces the time and effort required to extract and analyze information from ordinary raw data using the most powerful algorithms available. (Ganas 2009, 3.)
5.4 Installing the add-ins for Excel 2007
The component of the add-ins which is of utmost importance in this case of data mining is the Data Mining Client (which is basically installed on a computer), as it acts as a link between the user and the SQL Server running Analysis Services. The client architecture allows multiple users, each working on his own computer, to benefit from the power of a single Analysis Services server.
Installing the add-ins on a computer and specifying the name of the server running Analysis Services is relatively straightforward; a wizard guides users through the installation process, provided that the user has administrative rights to all aspects of the server and the add-ins. (MacLennan et al. 2009, 20.)
5.5 Connecting to the analysis service
On clicking the Analyze menu, there is also the option of connecting to an
Analysis Services instance, which must be done to facilitate a successful
evaluation of the given data. The connection button was labeled <No connection>.
On clicking it, the Analysis Services connection dialog box appeared, requesting
a server name and an instance name, and both were entered. The graph below shows
an instance of the connection phase.
By default, it is recommended that the Windows authentication button is used when
connecting to Analysis Services, because the service supports only Windows
authenticated clients. (MacLennan et al. 2009, 21.)
GRAPH 9. Connecting to analysis service
5.6 Effect of the add-ins
Upon completion of the installation of the add-ins, the data mining ribbon can be
readily seen on the menu of a launched Microsoft Office 2007 Excel spreadsheet.
The icons on the ribbon are divided in a logical and organized fashion that
mimics the order of tasks in a typical data mining process. They include data
preparation, which comprises explore data, clean data and partition data, as this
stage basically acquires and prepares data for analysis. (Mann 2007, 29.)
There is also the data modeling functionality, which offers a choice of
algorithms to use in the data mining. The accuracy and validation option aids in
testing and validating the mining model against real data before deploying the
model into production. The model usage option allows one to query and browse the
Analysis Services server for existing mining models. The management option
enables the management of the mining models, such as renaming, deleting,
clearing, reprocessing, exporting or importing of models. (Mann 2007, 29.)
The other icon on the worksheet is the connection tab, which facilitates
connection to a server; it was discussed in section 5.5. Once a connection has
been successfully made to the Analysis Services, if some data are added to the
spreadsheet and the whole or part of the data is selected and formatted as a
table using the “format as table” option from the menu, the data adopts the new
selected format and the “Table Tools” option appears as shown in the graph below;
it consists of the Analyze and Design options on the menu. (Mann 2007, 30.)
On clicking the Analyze button, eight of the functionalities of the Analysis
Services appear. They make it possible to perform tasks without the need to know
anything about the underlying data mining algorithms, and they include:
· Analyze Key Influencers
· Detect Categories
· Fill From Example
· Forecast
· Highlight Exceptions
· Scenario Analysis
· Prediction Calculator
· Shopping Basket Analysis
GRAPH 10. Formatting an Excel sheet
The other option, the Design tab, provides options to remove duplicates, convert
to range and summarize with a pivot table; it also provides the option of
exporting the formatted data.
GRAPH 11. The analysis ribbons
5.6.1 Analyze key influencers
When applied, the analyze key influencers tool brings out the relationship
between all other chosen columns of a given table and one specified column, and
then produces a report showing in detail which of the columns have a major
influence on the specified column and how that influence manifests itself. Its
implementation generates a temporary mining model using the Naïve Bayes
algorithm. (MacLennan et al. 2009, 22.)
Take for instance a table having columns for annual income, geographic location,
number of kids and purchases. On applying the tool and selecting the purchases
column, it would be possible to determine whether it is income that plays the
major role in the purchases an individual makes, or one of the other columns.
Upon specifying the target column, the analyze key influencers button is clicked,
and a dialog box pops up requesting the selection of the column to be analyzed
for key factors; there is also an option to restrict the other columns that could
be used for the analysis process. (MacLennan et al. 2009, 23.)
Once the selection has been made, the run button is clicked and within seconds, a
report is displayed such as in the graph below.
GRAPH 12. Key influencer report
A look at the graph above demonstrates that the income column has the greatest
influence over the weekly purchases of the people, as represented by the favors
column, although the number of kids also has considerable influence in a few
instances. (MacLennan et al. 2009, 23.)
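The intuition behind such a report can be sketched in a few lines of Python: for
each input column, compare the conditional rate of the target value with the
overall rate and treat the strongest deviation as that column's influence. This
is only an analogue of the Naïve Bayes based tool, and the table and column
names are hypothetical.

# A hand-rolled analogue of a key-influencer report: for every input column it
# compares P(target = "yes" | column = value) with the overall rate, the same
# intuition the Naive Bayes based tool relies on. Data and column names are made up.
import pandas as pd

data = pd.DataFrame({
    "income":   ["high", "high", "low", "low", "high", "low", "low", "high"],
    "kids":     [0, 1, 2, 3, 0, 2, 1, 0],
    "region":   ["N", "S", "N", "S", "N", "S", "N", "S"],
    "purchase": ["yes", "yes", "no", "no", "yes", "no", "yes", "yes"],
})

overall = (data["purchase"] == "yes").mean()
print(f"overall purchase rate: {overall:.2f}\n")

for column in ["income", "kids", "region"]:
    rates = data.groupby(column)["purchase"].apply(lambda s: (s == "yes").mean())
    lift = (rates - overall).abs().max()          # strongest deviation = influence
    print(f"{column:7s} influence score {lift:.2f}")
    print(rates.to_string(), "\n")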
5.6.2 Detect categories
Handling huge amounts of data can be cumbersome; hence it is more advisable and
convenient to regroup the data into smaller categories, so that the elements in
each category share enough similarities to be treated as alike. The tool applies
a clustering algorithm, thus making data analysis convenient and easy. After
analyzing the given data it detects groups, places the rows into those groups
based on their similarities, and also emphasizes the details which prompted the
category creation. (MacLennan et al. 2009, 29.)
Applying the detect categories functionality of the table analysis tools simply
involves selecting and formatting the given data so that the Analyze ribbon is
displayed; on clicking the detect categories button, a dialog box pops up. The
box displays the option of selecting the columns from the data that the user
would like to analyze, and the user can do so by checking or un-checking any
column of his choice. (MacLennan et al. 2009, 29.)
There is also the option of appending the detected category column to the
original Excel table, which is checked by default, and thirdly the option of
selecting the number of categories the user would like to have, which is set to
auto-detect by default. (MacLennan et al. 2009, 29-30.)
On clicking run in the dialog box, the process is completed within a few seconds
and the results are displayed. The displayed result, called the categories
report, has three parts, the first showing the created categories and the number
of rows in each, as shown in the graph below.
GRAPH 13. Categories created in category report
The second part of the category report shows the features of each category,
ranked by their relevance and importance. It is a table having four columns: the
first column holds the category name, the second the feature (the original column
name), the third its value, and the fourth the relative importance, showing how
significant the feature is to the created category. (MacLennan et al. 2009, 31.)
The graph below shows how the category characteristics look.
GRAPH 14. Category report characteristics
The third part of the category report is the category profiles. It appears as bar
charts showing the distribution of any of the original data characteristics over
all the generated categories. Each bar in the chart contains more than one color,
the segments denoting the proportion of rows with that characteristic in the
category. The color legend on the right hand side clearly portrays the proportion
of the feature in the category. The generated categories can be renamed according
to the user's wishes. (MacLennan et al. 2009, 33.)
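A minimal sketch of the same idea, assuming a small hypothetical customer table,
is shown below: rows are clustered with k-means, the detected category is
appended to the table, and each category is then characterized. This stands in
for the tool's behavior, not for the exact algorithm it runs on the server.

# An illustrative analogue of category detection: cluster rows with k-means and
# append the detected category to the table. This mirrors the idea of the tool,
# not its exact server-side algorithm; the data is hypothetical.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

customers = pd.DataFrame({
    "age":           [23, 25, 47, 52, 46, 56, 24, 50],
    "annual_income": [28000, 32000, 81000, 90000, 76000, 95000, 30000, 85000],
    "kids":          [0, 1, 2, 3, 2, 3, 0, 2],
})

# Scale the features so no single column dominates the distance measure.
features = StandardScaler().fit_transform(customers)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
customers["category"] = kmeans.fit_predict(features)

# Characterise each detected category, similar to the category report.
print(customers.groupby("category").mean().round(1))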
5.6.3 Fill from example tool
This data mining tool has an auto-fill capability, in the sense that it is able
to learn from any given example of data and automatically generate subsequent
data based on the trends and relationships in the example. It basically operates
on columns of the Excel spreadsheet, so long as two or more example values have
been given in the column. The reliability of the result of this tool depends
mainly on the amount of sample data or values given in the target column; hence,
the more sample data, the greater the reliability of the tool's result, and vice
versa. (MacLennan et al. 2009, 35.)
On selecting and formatting the given data, the table analysis tools appeared as
expected and the fill from example button was clicked. A dialog box came up,
showing the option of which column to select for the sample data task; the tool
most often suggests a likely column, but there was still the possibility to
choose a different target column from the suggested one. There is also the option
of choosing which columns to use for the analysis in conjunction with the
specified column. On clicking the run button, the process is completed, a pattern
report for the specified column is generated on a new Excel worksheet, and a new
column is appended at the end of the original sheet, showing the original and
newly generated complete column. (Brennan 2011.)
The generated report has four columns showing the original column names, their
values, whether they favorably impact the target column or not, and, in the last
column, the relative effect illustrated by horizontal bars.
GRAPH 15. Completed table after fill from example process
The results of the fill from example process can be refined by carrying out the
process again if the displayed result is not close to the expected one, and the
refining can be repeated several times until the desired result is obtained.
(MacLennan et al. 2009, 39.)
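The mechanism can be approximated in Python as follows: a model is trained on
the rows where the target column already has example values, and the remaining
rows are filled with its predictions. The data is hypothetical, and a decision
tree merely stands in for whatever model the add-in builds on the server.

# A rough analogue of "fill from example": learn from the rows where the target
# column already has a value and fill in the missing ones. Data is hypothetical.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

table = pd.DataFrame({
    "income":  [60, 20, 65, 25, 70, 22, 68, 24],      # thousands of euros
    "kids":    [2, 0, 3, 1, 2, 0, 3, 1],
    "segment": ["family", "single", "family", "single", None, None, None, None],
})

known = table[table["segment"].notna()]
unknown = table[table["segment"].isna()]

model = DecisionTreeClassifier(random_state=0)
model.fit(known[["income", "kids"]], known["segment"])

# Append the generated values next to the original column, as the tool does.
table.loc[unknown.index, "segment_filled"] = model.predict(unknown[["income", "kids"]])
table["segment_filled"] = table["segment_filled"].fillna(table["segment"])
print(table)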
5.6.4 Forecast tool
The forecast tool is able to recognize the trend that operates in a given series
and extrapolates the patterns, producing forecasts for the subsequent evolution
of the series. The main patterns discovered by the analysis include trends
(behavior of the series evolution), periodicity (consistency of event intervals)
and cross-correlations (relationships between values in different series).
(MacLennan et al. 2009, 40.)
Once the data has been formatted and the forecast button is clicked, the forecast
dialog box appears, displaying the option for selecting the columns to use in
prediction; there is also the option to specify the number of time units to be
forecasted. The other controls on the dialog box are the time stamp and the
periodicity drop-down boxes. On clicking the run button, the tool implements the
algorithm and within seconds the forecasted new values can be seen, highlighted
at the bottom of the columns in the table. There is also a graph which shows the
old series being analyzed in solid lines, while the forecasted evolution trend is
represented by broken lines. (MacLennan et al. 2009, 43.)
In the graph below, the number of time units chosen was five, so the forecasting
generated five new rows for each column.
GRAPH 16. Result generated after forecasting
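A very small sketch of the forecasting idea, assuming a made-up series, is given
below: a linear trend is fitted and extrapolated for five new time units. The
actual tool relies on a time-series algorithm on the server that also models
periodicity, so this is only the simplest possible analogue.

# A very small analogue of the forecast tool: fit a linear trend to a series and
# extrapolate five future values. The series values are made up.
import numpy as np

series = np.array([120.0, 132.0, 141.0, 155.0, 162.0, 174.0, 183.0, 196.0])
t = np.arange(len(series))

slope, intercept = np.polyfit(t, series, deg=1)     # fit a straight-line trend

future_t = np.arange(len(series), len(series) + 5)  # five new time units
forecast = slope * future_t + intercept

print("trend: value = %.1f * t + %.1f" % (slope, intercept))
print("forecasted values:", np.round(forecast, 1))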
5.6.5 Highlight exceptions tool
This tool detects anomalies in any given data. Any row in the given data table
that does not follow the pattern of the other rows is highlighted. These
discrepancies could be the result of mistakes during data entry or of the Excel
AutoFill feature. There can also be instances of correct data values which, since
they do not match the general pattern, are seen as anomalies and are hence of
much interest. The tool is a good cleaning tool, since it is able to detect and
replace such anomalies. (Ganas 2009, 8.)
On selecting and formatting the data in Excel, the analysis tools options are
displayed, and on clicking the highlight exceptions tool, a dialog box is shown,
providing the option of selecting the columns to be used for the analysis;
columns having unique values, such as an Id column, are usually unchecked by
default. Once the run button is clicked, the tool processes the data, after which
it highlights each row having an anomaly in a different color; it also generates
a new Excel sheet showing the report of the anomaly details. (MacLennan et al.
2009, 45.)
The graph below shows an example of a highlight exceptions process, showing how a
male skilled manual worker, having just one child and a commute distance of one
mile, is an anomaly.
GRAPH 17. Report of exceptions from a highlight exception process
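The exception-highlighting idea can be illustrated with a simple z-score test in
Python: rows whose values lie far from the column means are flagged. The add-in
scores each row against a mining model built on the server, so this is only a
stand-in, and the table and the 2.0 cut-off are assumptions.

# A simplified analogue of highlighting exceptions: flag rows whose values lie
# far from the column mean (z-score test). The data here is hypothetical.
import pandas as pd

rows = pd.DataFrame({
    "children":      [2, 1, 3, 2, 1, 2, 9, 2],
    "commute_miles": [5, 3, 8, 6, 4, 5, 1, 7],
    "yearly_income": [40, 38, 55, 47, 41, 44, 39, 52],   # thousands
})

z = (rows - rows.mean()) / rows.std(ddof=0)
rows["is_exception"] = (z.abs() > 2.0).any(axis=1)   # any column far off the pattern

print(rows)
print("\nflagged rows:")
print(rows[rows["is_exception"]])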
5.6.6 Scenario analysis tool
This tool is basically applied in sensitivity analysis of simulations. The “goal
seek” and “what-if” options of the tool demonstrate how model results behave in
response to changes in the input data. The goal seek option shows how some or all
of the input data need to be modified so as to attain a certain expected result;
for example, an insurance company may need to know what annual income a customer
should have in order to be considered a trustworthy customer. It is similar to
finding what the value in a column A should be so that column B would have a
value of C. (Ganas 2009, 7-8.)
The what-if option of the tool helps the user to know what the model result would
be upon altering one of the inputs. Hence it shows the effect of a change in an
input variable on the outcome. (Ganas 2009, 8.) The two options of the scenario
analysis can be performed either on a single row of a table or on the whole
table. The main principle involved in the process is showing how one or more
other columns affect a target column. This table analysis tool is different
because, unlike the others, it does not present its result on a separate
spreadsheet; rather, the result is displayed in the same dialog box where the
user indicated the target row and the modifier. (MacLennan et al. 2009, 59.)
On selecting and formatting the table, after which the table analysis tools are
displayed, the target row was selected and the scenario analysis tool arrow was
clicked; it showed the two options of goal seek and what-if. Once the goal seek
button was clicked, a dialog box popped up, showing options for indicating the
target column, the proposed state, and the column to be modified.
Once the three decisions were made and the run button clicked, the dialog box
shows a message of either success or failure, the likely value, and the
confidence, whether good or low. (MacLennan et al. 2009, 59.) The graph below
shows an outcome of the scenario analysis tool used on some given data, where the
target column was the “purchased bike” column, the required condition chosen was
yes, and the modified column was the commute distance, which was over 10 miles
but, after the goal-seeking process, was changed to a range of 0 to 1 miles.
(MacLennan et al. 2009, 60.)
GRAPH 18. Goal seek scenario analysis
The what-if option of the scenario analysis is similar in its operation, except
that, on clicking the option from the analysis tools, the dialog box that is
displayed has a scenario section for choosing which column to make changes to and
from what value, a what-happens section for choosing the column on which the
effect should be observed, and a third section where the user indicates whether
the task is for a single row or for the whole table. (MacLennan et al. 2009, 60.)
Once the options are set and the run button is clicked, the result appears in the
same dialog box. The same task can be carried out on the whole table by simply
indicating the whole-table option and the target column; the result will then be
a column at the right end of the table, showing the outcome for all rows.
(MacLennan et al. 2009, 60.)
The graph below shows an example of the what-if task performed with the children
column as the input variable being altered from 0 to 1 and the purchased bike
column as the target column. There was a success report, implying that on
changing the number of children an individual has from 0 to 1, he or she is more
likely to purchase a bike.
GRAPH 19. “What-if” scenario analysis
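Both mechanisms can be sketched on top of any trained model, as below: the
what-if step changes one input of a row and re-predicts, while the goal-seek
step searches candidate values of an input until the desired outcome is
predicted. The data, column names and the decision tree used here are
illustrative assumptions, not the add-in's server-side model.

# A sketch of the what-if and goal-seek mechanisms on top of a trained model.
# A decision tree stands in for the mining model on the server; the table,
# column names and values are hypothetical.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

data = pd.DataFrame({
    "children":       [0, 0, 1, 2, 1, 2, 3, 1],
    "commute_miles":  [12, 3, 2, 1, 14, 3, 2, 15],
    "purchased_bike": ["no", "no", "yes", "yes", "no", "yes", "yes", "no"],
})
features = ["children", "commute_miles"]
model = DecisionTreeClassifier(random_state=0).fit(data[features], data["purchased_bike"])

# What-if: alter one input of a row and observe the effect on the prediction.
row = pd.DataFrame({"children": [1], "commute_miles": [12]})
print("original prediction:", model.predict(row)[0])
print("what-if commute = 2:", model.predict(row.assign(commute_miles=2))[0])

# Goal seek: search an input until the desired outcome ("yes") is predicted.
base = pd.DataFrame({"children": [0], "commute_miles": [3]})
for kids in range(0, 6):
    if model.predict(base.assign(children=kids))[0] == "yes":
        print(f"goal seek: children = {kids} gives a 'yes' prediction")
        break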
5.6.7 Prediction calculator
This tool is an example of an end-user tool that integrates data mining
technology (Ganas 2009, 9). The tool is an easy and convenient device for making
predictions, and it does not necessarily need to be connected to any server in
order to function. It uses the logistic regression algorithm and helps in
determining the best possible conditions in order to minimize or eliminate the
consequences arising from wrong predictions, while maximizing the benefits
associated with making a correct prediction. It operates on the binary principle
of using only one of any two possible conditions. (MacLennan et al. 2009, 63.)
It is similar to the key influencers tool, except that it considers and displays
values depicting the impact of all columns, whether weak or strong, on the target
column, after which the total effect is obtained by summing up all the values
assigned to the other columns. (MacLennan et al. 2009, 64.) Once the data table
has been formatted, the prediction calculator option can be seen on the analyze
tools ribbon. On clicking the button, the dialog box appeared, showing the option
of indicating the target column; the next option was to indicate whether the
target is specified as a range of continuous values or as an exact value such as
yes or no, as in the sample data used for this work. (MacLennan et al. 2009, 64.)
The option selected was “yes”, as shown in the graph below, so as to see the
effect of the other columns on the purchased bike column. The other option on the
dialog box was for choosing the columns to be used for the analysis, other than
the one automatically chosen by the tool. The box also presents a choice of the
reports to be produced, ranging from the operational calculator and the
prediction calculator to the print-ready report. Once all inputs were selected,
the run button was clicked, and the reports were generated in three new
spreadsheets. (MacLennan et al. 2009, 65.)
GRAPH 20. Prediction calculator dialog box
Some sections of the generated prediction calculator spreadsheet can be modified
to achieve certain results. The outcomes of the prediction calculator are
categorized into four classes, namely the false positive cost, the false negative
cost, the true positive cost and the true negative cost, as shown in the graph
below.
GRAPH 21. Prediction calculator report
The prediction calculator report, being one of the generated outcomes of the
prediction process, is particularly important, since it contains three columns
showing the original column names, the highest attainable value and their
relative impact on the target column. It is in this report that a threshold is
generated to guide the user as to the optimum expected reliability of a
prediction. The values of the original columns can be modified to attain a
certain goal, but if the total of the attribute values is not equal to or over
the given threshold value, then the prediction is not reliable. (MacLennan et al.
2009, 66.)
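The scorecard idea can be sketched with a plain logistic regression in Python:
each attribute value receives a score (its coefficient), and a row is classified
by whether its summed score passes a threshold, here zero in log-odds terms. The
table and columns are hypothetical, and this is only an analogue of the report
the add-in produces.

# A sketch of the scorecard idea behind the prediction calculator: a logistic
# regression assigns each attribute value a score; a row is predicted positive
# when its summed scores pass a threshold. Data is hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression

data = pd.DataFrame({
    "marital_status": ["married", "single", "married", "single",
                       "married", "single", "married", "single"],
    "home_owner":     ["yes", "no", "yes", "no", "no", "no", "yes", "yes"],
    "purchased_bike": [1, 0, 1, 0, 1, 0, 1, 0],
})

X = pd.get_dummies(data[["marital_status", "home_owner"]]).astype(float)
model = LogisticRegression().fit(X, data["purchased_bike"])

# Per-attribute scores, analogous to the relative-impact column of the report.
scores = pd.Series(model.coef_[0], index=X.columns).round(2)
print(scores.sort_values(ascending=False))

# A new row is scored by summing its attribute scores plus the intercept;
# the decision threshold in log-odds terms is zero.
new_row = pd.get_dummies(pd.DataFrame(
    {"marital_status": ["married"], "home_owner": ["no"]}))
new_row = new_row.reindex(columns=X.columns, fill_value=0).astype(float)
total = (new_row.values @ model.coef_[0] + model.intercept_[0]).item()
print(f"total score {total:.2f} ->", "predict yes" if total >= 0 else "predict no")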
5.6.8 Shopping basket analysis
This analysis uses the association rules algorithm. Though its name implies that
it is applied to goods and services, it can also be applied in the medical field,
where such analyses are used to identify people with a likelihood of undiagnosed
health problems. Insurance companies also apply the algorithm in certain
situations. The application of the association algorithm generates an if-then
statement coupled with some degree of accuracy; for instance, if a customer buys
items x and y, then M percent of the time he will buy item z. (Ganas 2009, 9.)
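The support and confidence behind such if-then statements can be illustrated
with a brute-force Python sketch over a handful of made-up transactions; the
0.75 confidence and 0.4 support thresholds are arbitrary assumptions, and the
add-in itself runs its association rules algorithm on the server instead of
this naive enumeration.

# A minimal illustration of the if-then rules behind shopping basket analysis:
# brute-force support and confidence for one- and two-item antecedents.
# Transactions, items and thresholds are hypothetical.
from itertools import combinations

transactions = [
    {"road bike", "helmet", "bottle"},
    {"road bike", "helmet"},
    {"mountain bike", "helmet"},
    {"road bike", "helmet", "gloves"},
    {"bottle", "gloves"},
]

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

items = sorted(set().union(*transactions))
for size in (1, 2):
    for antecedent in combinations(items, size):
        for consequent in items:
            if consequent in antecedent:
                continue
            body = set(antecedent)
            if support(body) == 0:
                continue
            rule_support = support(body | {consequent})
            confidence = rule_support / support(body)
            if confidence >= 0.75 and rule_support >= 0.4:
                print(f"if {sorted(body)} then '{consequent}' "
                      f"(confidence {confidence:.0%})")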
On formatting the given data, the shopping basket analysis button was clicked and
a dialog box appeared, with options for choosing the transaction id from a
drop-down box, the item selection, and, optionally, the item value. Once the
input variables were right, the run button was clicked and the analysis was done
within a few seconds. The results of the process are reports generated on two
Excel spreadsheets, namely the shopping basket bundled items and the shopping
basket recommendations. The former lists the item combinations that were
purchased together most often, in ascending order, showing the frequency of
occurrence, the average value and the overall value of all transactions involving
such item bundles. (MacLennan et al. 2009, 76.)
The item bundles were presented in rows as shown in the graph below, where the
combination of road bikes and helmets occurred most often, as depicted by the
horizontal bar.
GRAPH 22. Shopping basket bundled items report
However, making reasonable use of the analysis tool also depends on the generated
shopping basket recommendations table, since it displays the best possible items
to combine with selected items, based on their sales and on the overall value of
the linked sales, as shown in the graph below.
GRAPH 23. Shopping basket recommendation
6 ANALYSIS SCENARIO AND RESULT
The data chosen to further demonstrate the role of data mining in decision making
with the Microsoft Excel data mining add-ins was obtained from the UCI machine
learning repository (Asuncion & Newman 2007). The main analysis tools in
Microsoft Excel were applied to the data, which had fifteen attributes as columns
and several hundred rows.
The results obtained are as shown in the following graphs.
GRAPH 24. Output of the Key Influencers tool on the data, showing marital status
as having a positive relative impact on a person belonging to the “less” class in
the data
GRAPH 25. Result of key influencers on “class”, showing the marital status “never
married” to positively favor a person belonging to the “less” class
GRAPH 26. Output of the detect category tool showing education having the
highest relative importance on the class of an individual
GRAPH 27. Bar chart showing the categories created after analysis of people
from the data, placing them into the “more” and “less” classes, represented by red
and blue colors as indicated on the right hand side of the charts
GRAPH 28. Report showing most exceptions existing in marital status in the data
GRAPH 29. Exceptions in the data being marked out
GRAPH 30. Output of the Fill from example task, indicated in the rightmost
column, showing the classes to which certain people would belong based on their
other attributes
GRAPH 31. Output of Forecasting of age, education and hours per week of
people, indicated at the lower 5 rows of the table
GRAPH 32. (cont.) Forecast of age, education level, and hours per week of
persons from given data after some given time period, based on their previous
hours per week work and other attributes
GRAPH 33. Report of the Prediction calculator taking “workclass” as the target
column to see its relative impact
GRAPH 34. Table showing the threshold of the prediction calculator, indicating
that if the sum of the relative impacts for a chosen person is less than 47, then
the prediction is false
7 CONCLUSION
Data mining is a technology which discovers and extracts hidden trends and
patterns in huge amounts of data. The thesis' goal was to evaluate data mining in
theory and in practice. Achieving this goal involved reviewing the stages of the
process and the algorithms employed, carrying out a practical task and analyzing
the results.
The various algorithms employed in discovering useful and relevant knowledge from
large data sources are basically simple and efficient. Most of the data mining
software available in the market employs most or almost all of the relevant
algorithms, and even though the tools use different instructions in handling and
manipulating them, the results of employing the same algorithm in different data
mining tools on similar data are often similar.
It is necessary to note that, due to the choice of analysis tool used in this
work, the early stages of data mining, which involve data collection, data
preparation, data selection and data transformation, were not carried out, since
the repository source of the data used had already undergone those processes.
However, the most relevant phase of data mining is the modeling phase, and that
is what the tool used basically does.
Attempts to collect real data from businesses and people nearby, based on certain
attributes, proved abortive, as people were reluctant to provide some information
for personal reasons. I also tried taking data from the Finnish statistics
website, but on trying to build a model that could be easily and successfully
analyzed using the Microsoft Excel analysis tools, the results generated were
almost meaningless, since most of the data were time related and there were only
insignificant changes in the statistics, which does not favor good data modeling.
After successfully using the repository data with the Microsoft Excel analysis
tools, several attempts to use the same data with some other data mining tools
were futile, as each mining tool has its own specification for the format in
which data ought to be for a successful mining or analysis task; due to time
constraints, only a little acquaintance was made with such tools.
However, based on the Microsoft Excel data mining add-ins used in this work, it
can be clearly seen that if an organization successfully collects and prepares
its usually large data in the right format to be analyzed by this tool, the
outcome of the analysis is clear and easy to understand, hence aiding any
decisions that need to be made. I am highly convinced that the efficiency and
accuracy of data mining as a process, using its tools and algorithms, compares
favorably with unaided human reasoning or skill, and that decisions made from the
output of a data mining task would have a high percentage of success.
REFERENCES
Andy, P. 2011. Available: http://www.the-datamine.com/bin/view/Software/AllDataMiningSoftware. Accessed 30 March 2011.
Asuncion, A. & Newman, D. 2007. UCI machine learning repository. Available:
http://archive.ics.uci.edu/ml/. Accessed 20 April 2011.
Berk, R. 2003. Data mining within a regression framework. Available:
http://preprints.stat.ucla.edu/371/regmine.pdf. Accessed 02 April 2011.
Bramer, M. 2007. Principles of data mining. London: Springer.
Cios, K. J., Pedrycz, W., Swiniarski, R. W.,& Kurgan, L. A. 2007. Data mining: A
Knowledge Discovery approach. New York: Springer.
CRISP 2011. CRISP (Cross Industry Standard Process for Data Mining).
Available: http://www.crisp-dm.org/CRISPWP-0800.pdf. Accessed 20 April 2011.
Data mining software 2011. A brief history of data mining. Available:
http://www.data-mining-software.com/data_mining_history.htm . Accessed 21
February 2011.
Data mining add-ins for Office 2007. Video by Brennan, M. 2011. Available:
http://www.microsoft.com/sqlserver/2008/en/us/data-mining-addins.aspx. Accessed
08 April 2011.
Elmasri, R. & Navathe, S. 2007. Fundamentals of database systems. New York:
Addison Wesley.
Fayyad, U., Piatesky-shapiro, G. & Smyth, P. 1996. Advances in knowledge
discovery and data mining. Menlo Park, California: American Association for
Artificial Intelligence (AAAI) Press.
Ganas, S. 2009. Data mining with predictive modeling with Excel 2007. Available:
http://www.casact.org/pubs/forum/10spforum/Ganas.pdf . Accessed 15 March
2011.
Garcia M.E.B. 2006. Mining your business in retail with IBM DB2 intelligent miner.
Available:
http://www.ibm.com/developerworks/data/library/tutorials/iminer/iminer.html.
Accessed 15 February 2011.
Garson, D. 2011. Cluster analysis. Available:
http://faculty.chass.ncsu.edu/garson/PA765/cluster.htm. Accessed 17 March
2011.
Han, J., Kamber, M. & Pei, J. 2001. Data mining: Concepts and techniques. San
Francisco: Morgan Kaufmann.
Han, J., Kamber, M. & Pei, J. 2005. Data mining: Concepts and techniques. (2nd
ed.). San Francisco: Morgan Kaufmann.
Hand, D., Mannila, H. & Smyth, P. 2001. Principles of data mining. Cambridge,
Massachusetts: MIT Press.
Hill, T. 2011. Genetic algorithms. Available:
http://www.pcai.com/web/ai_info/genetic_algorithms.html Accessed 18 March
2011.
Inmon, W. H. 2005. Building the data warehouse. Indianapolis: Wiley Publishing,
Inc.
Larose, T. D. 2004. Discovering knowledge in data: Introduction to data mining.
New Jersey: John Wiley & Sons Inc.
Lew , A. & Mauch, H. 2010. Dynamic programming: A computational tool (Studies
in Computational Intelligence). Berlin: Springer.
MacLennan, J., Tang, Z. & Crivat, B. 2009. Data mining with Microsoft SQL server
2008. Indianapolis: Wiley Publishing Inc.
Mann, A. T. 2007. Microsoft Office 2007 system business intelligence
integration. New York: Mann Publishing.
Mitra, S. & Acharya, T. 2003. Data mining: multimedia, soft computing and
bioinformatics. New Jersey: John Wiley & Sons, Inc.
Mueller, J. A. & Lemke, F. 1999. Self-organizing data mining: An intelligent
approach to extract knowledge from data. Berlin: Dresden.
Pal, N. & Jain, L. 2005. Advanced techniques in knowledge discovery and data
mining. London: Springer.
Ramakrishnan, R. & Gehrke, J. 2003. Database management systems. New York:
McGraw-Hill.
Sumathi, S. & Esakkirajan, S. 2007. Fundamentals of relational database
management systems. New York: Springer.
Sumathi, S. & Sivanandam, S. N. 2006. Introduction to data mining and its
application. New York: Springer .
Thearling, K. 2010. Introduction to data mining. Available:
http://www.thearling.com/text/dmwhite/dmwhite.htm. Accessed 20 April 2011.
Webopedia 2011. Database. Available:
http://www.webopedia.com/TERM/D/database.html. Accessed 21 April 2011.
Witten, I. H. & Frank, E. 2005. Data mining: Practical machine learning tools and
techniques. San Francisco: Morgan Kaufmann.
Ye, N. (ed.) 2003. Handbook of data mining. New Jersey: Lawrence Erlbaum
Associates.
APPENDIX 1/1
OTHER DATA MINING TOOLS
IBM DB2
Data mining using the IBM Intelligent Miner takes the perspective that the
process involves basic steps, which are: problem definition, data exploration,
data preparation, modeling, evaluation and deployment. In the data exploration
phase of data mining, the data is selected. The data tables and views were taken
into account and collected. The sample data that was used was related to some
transactions and purchases, and since the aim was to determine a customer's
behavior, there was a need to identify which tables or views in the database
contain all the information needed; the data was then used in generating models
in the subsequent phase. (Garcia 2003, 10.)
Demonstrating the data mining process using the IBM Intelligent Miner required
installation and configuration of the IBM DB2 InfoSphere Warehouse software,
containing the Intelligent Miner, scoring and visualization components. The
software contains the server that houses the database which is to be analyzed.
(Garcia 2003, 3.) Once the software had been installed, there was a need to
upload the data to the database, and that was done using the console window of
the installed application. The graph below shows a view of it. In this graph, the
name of the database being connected to is retail, and the server is DB2/NT
9.7.2.
APPENDIX 1/2
Console showing connection to database.
After the connection had been confirmed to be successful, the Control Center of
the installed IBM DB2 was used to view the tables in the database.
The graphs below show a view of the table of importance in the database.
A retail table view from DB2 database
APPENDIX 1/3
Articles Tables from DB2 control centre
View during DB2 data preparation phase.
APPENDIX 1/4
During the modeling phase, the association rules algorithm was implemented and
the support, confidence and lift were specified. The next phase was the
evaluation phase, in which the Intelligent Miner visualization application was
used to look at the results and to evaluate whether the model was good or not.
(Garcia 2003, 20.) The graph below shows the IM visualizer connection to the DB2
database.
IBM DB2 intelligent miner connection interface.
Upon launching the application, there was a need to supply the name of the
database, after which the connection was tested by clicking the connect button,
which requested a user id and password. Due to password and user limits in the
trial copies of the database and mining software, the data mining process using
the DB2 Intelligent Miner was not completed.
APPENDIX 1/5
Tanagra data mining tool
Basic view of user interface for Tanagra mining tool
View of Tanagra tool during data loading.
APPENDIX 1/6
Weka data mining tool
Weka startup environment
Weka tool interface on loading data
View of weka tool, showing file selection options
APPENDIX 1/7
Weka tool analysis display
Weka analysis output
Weka cluster analysis output
APPENDIX 1/8
Rapid miner interface screenshots