Understanding Data Mining
Data mining has become one of the latest trends in using data. Rod Newing
explains that it is a complex process which has been around for a long time.
Organisations world-wide are accumulating vast quantities of
electronic data as databases
become ever more pervasive. The recent
trend to implement a data warehouse
architecture is increasing the quality and
accessibility of data. This is all being
done at great cost, but the information
is only valuable if used effectively.
Users have been using query tools,
OLAP servers, Business Intelligence
tools, Enterprise Information Systems
and a wide range of other packaged
software to examine their data.
However, these tools either work
with summarised data or answer users' specific questions. The more numerate
analysts have recognised that there are
hidden patterns, relationships and rules
in their data which cannot be found by
using these traditional methods.
The answer is to use specialist software which harnesses advanced
mathematics to examine large volumes
of detailed data. This specialist group of
software has become known as "data
mining" or "knowledge discovery". Data
mining is defined as the process of extracting valid, previously unknown and
ultimately comprehensible information from large databases and using it
to make critical business decisions.
The name is derived from the process of sifting large amounts of ore to
discover nuggets of gold, just as the
software is able to sift large volumes of
data to find nuggets of information
which yield gold in the form of competitive advantage. The extracted
information can be used to do one or
more of the following:
● Provide an understanding of data relationships to end users.
● Form a prediction or classification model.
● Allow prediction of future trends based on past experience.
● Identify relationships between database records.
● Provide a summary of the database being mined.
With a query, the user knows what is in the database and knows what information to ask for, so they must already know what patterns exist. With data mining,
the software establishes the patterns
and relationships. It is possible to carry
out data mining operations using a
query tool, but the process is extremely
complex and would be prohibitively labour-intensive. Data mining software uses algorithms which have
automated most of the work involved.
Data mining differs from statistical
analysis in that the latter is used to verify
existing knowledge in order to prove a
known relationship. Most data mining
involves carrying out several different
operations using more than one technology, so it should be thought of as an
operation, rather than a product.
Data mining can be carried out on
any data file, from a spreadsheet to a
data warehouse. Transaction processing systems can be mined, and the
exercise can be used to generate
benefits which can help to justify the
considerable investment required to implement a data warehouse architecture.
Figure 1 outlines the major milestones in the evolution of Data Mining.

1960s - Data collection. Business question: "What was my total revenue in each of the last five years?" Enabling technologies: computers, tapes, disks. Characteristics: retrospective, static data delivery.
1980s - Data access. Business question: "What were unit sales in New England in March?" Enabling technologies: relational databases, SQL, ODBC. Characteristics: retrospective, dynamic data delivery at record level.
1990s - Data warehousing and decision support. Business question: "What were unit sales in New England in March? Drill down to Boston." Enabling technologies: On-Line Analytical Processing, data warehouses. Characteristics: retrospective, dynamic data delivery at multiple levels.
Now - Data mining. Business question: "What is likely to happen to Boston unit sales next month? Why?" Enabling technologies: advanced algorithms, multi-processor computers, massive databases. Characteristics: prospective, proactive information delivery.

Figure 1 - Milestones in the evolution of Data Mining.
Objectives
Data mining can achieve a number
of different objectives, using one or
more different technologies.
Prediction And Classification
This approach uses the historical
data in the database to predict future
behaviour. It creates a generalised description which characterises the
contents of the database by generating
an understandable model. It enables
the model to be applied to new data
sets in order to predict the behaviour
hidden in that data. For example, a
predictive model of existing customers
can be applied to potential customers
in order to identify those most likely to
purchase a particular product or service. This approach has traditionally used statistical techniques, but many automatic model-development techniques are emerging, often based on supervised induction.
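To make this concrete, here is a minimal sketch of such a predictive model using the open-source scikit-learn library (the library, the attribute names and all of the figures are my own invention, not something described in this article): a decision tree is induced from historical customers and then applied to prospects.

```python
# A minimal predictive/classification sketch using scikit-learn.
# The customer records and attribute names are invented for illustration.
from sklearn.tree import DecisionTreeClassifier, export_text

# Historical customers: [age, income (thousands), years_as_customer]
existing_customers = [
    [25, 18, 1], [47, 55, 9], [35, 42, 4], [52, 61, 12],
    [23, 21, 2], [44, 48, 7], [31, 30, 3], [58, 70, 15],
]
purchased = [0, 1, 0, 1, 0, 1, 0, 1]   # 1 = bought the product

# Induce a generalised model (here a decision tree) from the training set.
model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(existing_customers, purchased)

# The induced rules are comprehensible to the end user.
print(export_text(model, feature_names=["age", "income", "years"]))

# Apply the model to potential customers to predict likely purchasers.
prospects = [[29, 25, 0], [50, 58, 0]]
print(model.predict(prospects))        # e.g. [0 1]
```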
Analysing Links
Data mining can be used to establish
relationships between the records in
the database which would otherwise
be impossible to find because they cannot be predicted and so cannot be
found other than by accident. It is a
relatively recent technique which has become well known through shopping basket analysis, indicating the popular combinations purchased by retail
customers.
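As a rough sketch of the idea, the standard-library Python fragment below counts how often pairs of items appear together in a handful of invented baskets and reports the combinations with a high confidence factor; a real exercise would run over millions of EPOS records.

```python
# Toy shopping-basket analysis: support and confidence for item pairs.
# The baskets are invented; real analysis would read EPOS transaction data.
from itertools import combinations
from collections import Counter

baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"beer", "crisps"},
    {"bread", "milk"},
    {"beer", "crisps", "bread"},
]

pair_counts = Counter()
item_counts = Counter()
for basket in baskets:
    item_counts.update(basket)
    pair_counts.update(combinations(sorted(basket), 2))

n = len(baskets)
for (a, b), count in pair_counts.items():
    support = count / n                  # how often the pair occurs at all
    confidence = count / item_counts[a]  # "confidence factor": P(b given a)
    if confidence >= 0.6 and count >= 2:
        print(f"{a} -> {b}: support {support:.0%}, confidence {confidence:.0%}")
```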
Segmenting Databases
This is a form of sophisticated query
to identify common groups of records
within a database. It may be a technique in its own right or may be used
to prepare data for further processing.
The Process
There are four basic steps which need to be carried out in order to complete a data mining exercise.

Data Selection
The objective determines the type of information and the way it is organised. Only part of the data available from the source data file will be needed, so the relevant data must be identified. Noise and missing values may need to be addressed. It may also be preferable to sample the data required and mine the sample.

Data Transformation
Once it has been selected, the data may need to be transformed. For instance, neural networks require nominal values to be converted to numeric ones. Alternatively, derived attributes may need to be created by applying mathematical or logical operators, such as a ratio or logarithmic value.
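The fragment below sketches both kinds of transformation using the pandas and NumPy libraries (my choice of tools, with invented column names and values): a nominal attribute is converted into numeric indicator columns, and a ratio and a logarithmic attribute are derived.

```python
# Sketch of pre-mining data transformation with pandas.
# Column names and values are invented for illustration.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "region":  ["North", "South", "South", "West"],   # nominal attribute
    "revenue": [1200.0, 450.0, 980.0, 15000.0],
    "visits":  [10, 3, 8, 60],
})

# 1. Convert the nominal value to numeric indicator columns
#    (e.g. for a neural network that needs numeric inputs).
df = pd.get_dummies(df, columns=["region"])

# 2. Derive new attributes with mathematical operators.
df["revenue_per_visit"] = df["revenue"] / df["visits"]   # ratio
df["log_revenue"] = np.log(df["revenue"])                # logarithmic value

print(df)
```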
Applying Algorithms
One or more data mining techniques are carried out to try to extract the required information or meet the required objective. Some of the algorithms used are described in Figure 2.

Results Interpretation
The result of applying data mining algorithms will be tables of values or relationships. The user will have to look for interesting groupings of data and establish if there is any business value in them. They need to be analysed using a data visualisation (see Figure 3) or decision support tool. Visualisation helps the user to understand the data and identify patterns. If the objective is to produce a model, it must be validated and tested. It may be necessary to refine the data, repeating the sequence again. This process is often referred to as "data refining".

Techniques
There are a number of techniques for carrying out the data mining exercise.

Supervised Induction
Supervised induction automatically
creates a classification model from a set
of records, known as a "training set",
which may be the whole database or a
sample of data from it. The induced
model consists of generalised patterns
which can be used to classify new records. It can use neural networks or
decision trees, but the latter do not
work well with noisy data.
It produces high quality models,
even when data in the training set is
poor or incomplete. The result is more
accurate than that obtained using statistical methods, because it checks for
local patterns, whereas the latter work
across the entire database. The models
are easy for the user to understand. An
example would be a credit card analysis to discover the attributes of a good
credit risk in order to predict credit worthiness of applicants.

Detecting Deviations
This identifies unusual values which do not conform to the expected pattern. It is often a source of new knowledge since the results defy known logic. It is also used in fraud detection, where unusual values may represent an unauthorised transaction.
Neural Networks
Software which learns from training to identify patterns and construct a model. This model is then applied to larger data sets to predict its structures. It can also identify changes, which then become a notifiable event.

Decision Trees
Decision trees are tree-shaped structures which represent sets of decisions. They generate rules for classifying the data set, using algorithms such as ID3, Classification and Regression Trees ("CART") and Chi Square Automatic Interaction Detection ("CHAID").

Clustering Methods
In this method, artificial intelligence search techniques are used to identify subsets in a cluster. It uses software such as AQ11, UNIMEM and COBWEB.

Rule Induction
Rule induction involves the extraction of "if ... then ..." rules from data based on statistical significance. Examples are IBM's RMINI, and FOIL, which are in the public domain.

Genetic Algorithms
This is an optimisation technique which uses processes such as genetic combination, mutation and natural selection in a design based on the concepts of evolution.

Figure 2 - Data Mining Technologies.
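To give a feel for the last entry in Figure 2, here is a small self-contained genetic algorithm sketch (an invented toy problem, not taken from any of the products discussed here) that evolves a bit string towards a target using selection, crossover and mutation.

```python
# Minimal genetic algorithm: evolve a bit string to match a target pattern.
# Purely illustrative; real data mining GAs optimise model parameters or rules.
import random

random.seed(1)
TARGET = [1] * 20                                   # fitness peak: all ones
POP, GENERATIONS, MUTATION = 30, 60, 0.02

def fitness(ind):
    return sum(a == b for a, b in zip(ind, TARGET))

def crossover(a, b):                                # genetic combination
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(ind):
    return [bit ^ 1 if random.random() < MUTATION else bit for bit in ind]

population = [[random.randint(0, 1) for _ in TARGET] for _ in range(POP)]
for gen in range(GENERATIONS):
    population.sort(key=fitness, reverse=True)
    if fitness(population[0]) == len(TARGET):
        break
    parents = population[: POP // 2]                # natural selection
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(POP - len(parents))]
    population = parents + children

print(gen, fitness(population[0]), population[0])
```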
Association Discovery
This is a technique which identifies the affinities which exist among records. The output might find that 67% of records containing A, B and C also contain Y and Z. The percentage is known as the "confidence factor". An example of association discovery is market basket analysis.

Sequence Discovery
This is similar to association discovery, but works over time. It is frequently directed towards individual customers as a means of identifying their preferences. It detects buying patterns which occur in a sequence of related transactions. It is used for targeting direct mail.

Clustering
This technique is used to segment a
database into subsets of mutually exclusive groups. The members of each
group should be as close to each other
as possible and as far apart from other
groups as possible. The members of
each cluster should possess properties
which are interesting to the user. Data
visualisation techniques are then used
to examine each cluster to establish
which are useful or interesting.
It is less precise than other techniques because of redundant or
irrelevant data. The solution is for the
user to direct the software to ignore
subsets of attributes, assign weightings
to them or apply filters to the information. The importance of the attributes
themselves can be established using
statistical methods.
Clustering can also be used to provide data for other techniques, such as
supervised induction. Clusters can be
created using statistics, neural networks or unsupervised induction.
However, using statistical methods
makes it difficult to assign new records
to existing clusters, because of the difficulty of measuring and handling their deviation from those clusters.
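As an illustrative sketch, the fragment below segments a few invented customer records into mutually exclusive groups with the k-means algorithm from scikit-learn (one clustering method among many), then assigns a new record to its nearest cluster.

```python
# Sketch of clustering-based segmentation with scikit-learn's k-means.
# The customer figures (annual spend, visits per year) are invented.
from sklearn.cluster import KMeans

customers = [
    [200, 2], [250, 3], [220, 2],        # low spend, infrequent
    [1200, 25], [1500, 30], [1350, 28],  # high spend, frequent
    [700, 10], [650, 12],                # middle ground
]

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(customers)
print(labels)                       # cluster membership for each record

# The cluster centres summarise each segment for the user to interpret.
print(kmeans.cluster_centers_)

# New records can be assigned to the nearest existing cluster.
print(kmeans.predict([[1280, 26]]))
```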
Data visualisation provides the user with visual summaries of the results of
the data mining algorithms. This helps them to understand the results of the
data mining algorithms by communicating relationships in a way that rows
and columns cannot. It is interactive, allowing the user to filter or change the
information displayed. The user can also change the presentation method
used, such as from a histogram to a scatter chart.
Visualisation allows users to browse the data looking for unusual features.
It is good at identifying small meaningful sub-sets of data which defy
conventional wisdom. These "outliers" are anomalies which may be errors,
or genuine and valuable exceptions to established wisdom.
A wide range of advanced chart types can be used:
● Geographical maps, combined with histograms, colour coding, pie charts etc.
● Tree maps showing the hierarchy of a classified database.
● Rule visualisation.
● Trends.
● Scatter graphs.
● Heat maps.
These chart types are very advanced when compared with traditional
graphing tools and need powerful workstations. For instance, a five dimensional chart can be created by representing clusters on a three dimensional
scatter chart as a sphere. The size and colour of the sphere represent the
fourth and fifth dimensions.
The time dimension can be incorporated by "playing" the chart like a video.
The user can watch the movements in a multi-dimensional chart as it changes
with the elapsed time.
Figure 3 - Data visualisation.
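The script below is a rough illustration of the five-dimensional chart described in Figure 3, written with matplotlib (my choice of tool, not one of the products in this article): randomly generated points are plotted on a three-dimensional scatter chart, with marker size and colour carrying the fourth and fifth dimensions.

```python
# Sketch of a "five-dimensional" scatter chart: x, y, z position plus
# marker size and colour. All data is randomly generated for illustration.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x, y, z = rng.random(50), rng.random(50), rng.random(50)
size = rng.random(50) * 300        # fourth dimension -> sphere size
value = rng.random(50)             # fifth dimension -> colour

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
points = ax.scatter(x, y, z, s=size, c=value, cmap="viridis", alpha=0.7)
fig.colorbar(points, label="fifth dimension")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_zlabel("z")
plt.show()
```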
Supplier - Product - Contact Details
Angoss - Knowledge Seeker - http://www.angoss.com
Attar - XpertRule - http://www.attar.com
Brann Software - Viper - http://www.brannsoftware.co.uk
DataMind Corporation - Mine Your Own Business - http://www.datamindcorp.com
EDS - Dbintellect - http://www.dbintellect.com
IBM - Intelligent Miner, Intelligent Decision Server - http://www.software.ibm.com
Integral Solutions - Clementine - http://www.isl.co.uk
Right Information Systems - 4Thought - http://www.4thought.com
The SAS Institute - Neural Network Application, Insight, Spectraview, GIS - http://www.sas.com
Silicon Graphics - MineSet - http://www.sgi.com
SPSS - SPSS CHAID, Neural Connection, Professional Statistics etc - http://www.spss.com
Figure 4 - The Main Data Mining Products.
Supplier - Product - Tool - Contact Details
Cognos - PowerPlay - 4Thought, Knowledge Seeker - http://www.cognos.com
Comshare - Commander Decision - Own - http://www.comshare.com
NCR - Knowledge Discovery Workbench - Clementine - http://www.ncr.com
Holistic Systems - Holos - Own - http://www.holossys.com
Oracle - Express - Partners' - http://www.oracle.com
Pilot Software - Pilot Discovery Server - Own, based on the Thinking Machine - http://www.pilotsw.com
Planning Sciences - Gentia - Own, plus Intelligent Miner - http://www.gentium.com
Red Brick Systems - Red Brick Data Mine - Mine Your Own Business - http://www.redbrick.com
Figure 5 - Products incorporating data mining.
Applications
The importance of data mining has been recognised by information intensive industries which have large databases of customer transactions, such as banking, health care, insurance, marketing, retail and telecommunications.
One of the most well-known data
mining applications is market/shopping basket analysis. This involves
running an association discovery operation over Electronic Point Of Sale
(EPOS) data. It analyses the combinations of products purchased by
individual buyers to find dependencies. Until the recent arrival of
loyalty cards, it was the only way that supermarkets and high street stores had to understand who their customers are and how they behave.
Other common applications are for
promotion effectiveness, customer vulnerability analysis, cross-selling,
portfolio creation and fraud detection.
It is also used in healthcare, where it
can find relationships between patient
histories, illnesses and surgical operations. It is also used in manufacturing
processes to monitor quality and spot
machine wear.
In marketing, if an organisation
wants to cross-sell one product to another, it cannot target all customers,
because the volume may be too large.
Therefore it is necessary to mine the
database of existing customers to
identify patterns which describe the
characteristics of purchasers of the product. These patterns can then be
applied to the database of customers
who have not purchased the product to
segment and predict those who are
more likely to purchase the product.
These are then targeted in a very specific marketing campaign.
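A hedged sketch of that scoring step, again with scikit-learn on invented data: a model trained on existing purchasers assigns each non-purchaser a probability of buying, and only the highest-scoring prospects go into the campaign.

```python
# Sketch: score non-purchasers and target only the most likely buyers.
# Features, labels and the cut-off are invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

X_existing = np.array([[25, 1], [47, 9], [35, 4], [52, 12], [23, 2], [44, 7]])
bought = np.array([0, 1, 0, 1, 0, 1])             # did they buy the product?

model = LogisticRegression().fit(X_existing, bought)

# Customers who have not bought the product yet.
X_prospects = np.array([[29, 2], [50, 10], [40, 6], [22, 1]])
scores = model.predict_proba(X_prospects)[:, 1]   # probability of purchase

top = np.argsort(scores)[::-1][:2]                # target, say, the top two
print("Target customers:", top, "scores:", scores[top].round(2))
```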
Data mining is often used to predict
and identify people most likely to respond to direct mail. This reduces the
cost of mailing without affecting the
response rate. Organisations have found up to a twenty-fold decrease in costs compared with conventional approaches.
The data mining operation can also be
taken a step further by identifying clusters of the most profitable likely
customers, which may be different to
those most likely to respond.
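The tiny sketch below illustrates that distinction with invented figures: ranking prospects by expected profit (response probability multiplied by likely margin) can produce a different mailing list from ranking by response probability alone.

```python
# Rank prospects by response probability versus expected profit.
# All probabilities and margins are invented for illustration.
prospects = {
    "A": (0.30, 10.0),   # (probability of responding, expected margin)
    "B": (0.05, 400.0),
    "C": (0.25, 15.0),
    "D": (0.02, 900.0),
}

by_response = sorted(prospects, key=lambda k: prospects[k][0], reverse=True)
by_profit = sorted(prospects, key=lambda k: prospects[k][0] * prospects[k][1],
                   reverse=True)

print("Most likely to respond:", by_response)   # ['A', 'C', 'B', 'D']
print("Most profitable:       ", by_profit)     # ['B', 'D', 'C', 'A']
```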
Identifying exceptions can be just as
important as finding hidden patterns.
In fraud detection, credit card transactions are often analysed by a neural
network to identify unusual transactions which may indicate that the card
is not being used by its holder, even
before the loss is reported.
It is important to understand that a
particular data mining exercise may
use more than one stage and use several algorithms by passing the results
from one analysis to another. For instance, the user might produce
associations using a decision tree and
then pass the result to a neural network
to identify changes over time. Mining
elements can be combined in an infinite
variety of ways.
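As one hedged sketch of such a combination (invented data, and a different pairing of techniques from the article's example): a clustering pass segments the records, and the resulting cluster label is then fed into a supervised model as an extra attribute.

```python
# Combining techniques: pass the output of clustering into a classifier.
# Data and attribute names are invented for illustration.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((60, 2))                      # two invented customer attributes
y = (X[:, 0] + X[:, 1] > 1).astype(int)      # invented outcome to predict

# Stage 1: unsupervised clustering segments the records.
clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Stage 2: the cluster label becomes an extra input to supervised induction.
X_plus = np.column_stack([X, clusters])
model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_plus, y)
print("Training accuracy:", round(model.score(X_plus, y), 2))
```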
Software
For most organisations, the software needs to be scalable from a stand-alone PC to a parallel-processing server. This allows data mining operations to be carried out on desktop databases, relational or multi-dimensional data marts, transaction processing systems or enterprise data warehouses.

Because of the different techniques and technologies, the software needs to integrate various different algorithms into one product. Most vendors use several different ones and are writing further modules to expand the scope of their products. The software must present the data in an easy to understand manner so that users can assess its significance to the business. It may incorporate its own visualisation tools or work with third-party packages.

The software must incorporate filters to remove "noise", which is incorrect information or spurious relationships. For instance, the software shouldn't waste the user's time by reporting that 99.9% of married people have a spouse of the opposite gender!
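One simple way to implement such a filter, sketched below on invented rule statistics, is to discard rules whose lift is close to one (the items occur together no more often than chance would suggest) or whose confidence is so close to 100% that the rule merely states the obvious.

```python
# Filter out "noise": rules that are statistically trivial or state the obvious.
# The rule statistics below are invented for illustration.
rules = [
    # (rule text, confidence, lift)
    ("married -> has_spouse", 0.999, 1.0),
    ("nappies -> beer", 0.35, 2.4),
    ("bread -> milk", 0.52, 1.05),
]

interesting = [
    (text, conf, lift)
    for text, conf, lift in rules
    if lift > 1.2 and conf < 0.95     # drop chance-level and self-evident rules
]
print(interesting)                    # only "nappies -> beer" survives
```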
Software for data mining is available either direct from the authors or
through decision support vendors who
have embedded it into their own applications. IBM and the other vendors
have open Application Programming
Interfaces so that application builders
can add value to their decision support
software by driving a data mining engine from their own tools.
The Author
Rod Newing MBA FCA FInstD is
a specialist writer on Executive
Computing. He can be contacted
via email as [email protected].