Digging for Gold: Business Usage for Data Mining
Kim Foster, CoreTech Consulting Group, Inc., King of Prussia, PA
ABSTRACT
Current trends in data mining allow the business community to take advantage of sophisticated analytical techniques to assess future directions and manage strategic planning. Yet these tools and techniques are not being used to their full capacity by business managers to solve everyday business problems. Why? Perhaps this can be attributed to an underlying fear of the complex mathematical and statistical methods found within data mining and neural network models.

The purpose of this paper is to demystify the art of data mining by outlining practical examples of usage applicable to business managers and professionals. By focusing on the competitive advantage that can be obtained with data mining, the author hopes to provide a better understanding of the practical application of this type of data analysis.
INTRODUCTION
Storage and processing capabilities of technology have increased at a tremendous rate over the course of the last twenty years. The business community has found new ways to utilize this additional computing power to improve its competitive advantage in the marketplace.

One problem with this situation is that large numbers of databases are now distributed across systems within any given organization. Over time, information about customers, suppliers, and operations has become stored in many databases within silos of the organization. Information usage has become so specialized that latent relationships now exist between thousands of different data elements. Hence the conceptualization of the data warehouse. Data warehousing, in turn, opens new possibilities in terms of business intelligence and decision support solutions. One such solution is referred to as data mining.

DATA MINING
Data mining is simply the discovery of valuable, new information from a large collection of data. Or, as defined by the Gartner Group:

“Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques.”

What it is not:
1. Not complex queries where suspicions about relationships within the data already exist
2. Not validation of hypotheses
3. Not statistical tests using standard techniques

What it is:
Automated discovery of new facts and relationships within data

Think of it in terms of excavation. The business data represents the rocks and the data mining technique becomes an excavation tool, sifting through the vast quantities of raw data looking for valuable nuggets of gold - information critical in making a business successful.

The major steps in this process include:
[Figure: the data mining process, with stages labeled Preparation, Discovery, Analysis, and Reapplication]

Data Preparation
Data is selected, acquired, cleansed, and preprocessed under the guidance of a knowledgeable domain expert. Who is a domain expert? Not your typical programmer or system analyst, but someone who knows the business well enough to determine the critical 20% of information on which 80% of business decisions are based.

Technology Review and Selection
Identification of the best techniques and tools to utilize needs to be made based on:
• Business requirements
• Infrastructure constraints
• Size and location of data stores
• Data preparedness
• Availability of statistical/analytical expertise
• Average accuracy of overall results (tools)
• Training requirements
• Cost

Information Discovery
Automated models and techniques are applied to prepared data, compressing and transforming it to make it easy to identify any valuable, hidden information.

Discovery Analysis
The results are evaluated to determine whether or not additional knowledge was discovered, and the relative importance of the information is assessed. This is where decisions are made using information found in the mining process and where the most business benefit can be seen.

Reapplication
The techniques are redeployed against multiple data populations for validation and classification of results.

The model below is a simplistic representation of a standard data mining technique called a decision tree. The decision tree shows that multiple decisions can be made from different relationships between variables, depending on the outcome of information from the models.

[Figure: decision tree with nodes labeled Source Data, Model 1, Model 2, and Decision 1 through Decision 4]

SAS uses an effective method for data mining called SEMMA. SEMMA stands for: Sample, Explore, Modify, Model, and Assess. This process applies statistical techniques to go through selection and transformation of data that is considered predictive. It then builds models based on the results of the analysis and checks the models for accuracy. This is a proven method, effective in the application of successful mining techniques.
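
To make the five SEMMA steps concrete, the following is a minimal sketch in Python. It is not SAS code and does not represent how SAS implements SEMMA. The customer records, the field names (months_inactive, churned), the 70/30 split, and the one-split decision tree used in the Model step are all assumptions chosen for illustration.

```python
import random
import statistics

def semma(records, target, feature, seed=0):
    """Walk one feature through Sample, Explore, Modify, Model, Assess."""
    rng = random.Random(seed)

    # Sample: shuffle, then hold out 30% of the records for assessment.
    rows = records[:]
    rng.shuffle(rows)
    cut = int(len(rows) * 0.7)
    train, test = rows[:cut], rows[cut:]

    # Explore: summarize the feature on the training sample.
    values = [r[feature] for r in train]
    mean, spread = statistics.mean(values), statistics.pstdev(values)

    # Modify: standardize the feature so the split point is scale-free.
    def z(r):
        return (r[feature] - mean) / spread if spread else 0.0

    # Model: a one-split decision tree. Try each observed value as a
    # threshold and keep the one with the fewest training errors.
    def errors(threshold, data):
        return sum((z(r) > threshold) != r[target] for r in data)

    threshold = min((z(r) for r in train), key=lambda t: errors(t, train))

    # Assess: accuracy on the held-out records.
    accuracy = 1 - errors(threshold, test) / len(test)
    return threshold, accuracy

# Hypothetical customer records, invented for illustration only.
customers = [
    {"months_inactive": i % 12 + 1, "churned": i % 12 + 1 > 6}
    for i in range(60)
]
threshold, accuracy = semma(customers, "churned", "months_inactive")
print(f"split at z = {threshold:.2f}, holdout accuracy = {accuracy:.2f}")
```

A real project would explore many variables and compare several model types. The point of the sketch is only the order of the steps and the fact that accuracy is checked on data the model has not seen.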

There are different functions, increasing in complexity, where mining techniques are used to find latent information within variables that exist in common, very large data stores. A few functional requirements of mining techniques include:

⇒ Associations, Classifications, and Clustering
Used for risk assessment, market segmentation and targeting sales, as well as product reuse

⇒ Regression and Forecasting
Used for sales predictions, customer ranking, price and inventory models, product assessments

These functional requirements are based on the business need at hand. For example, if a retail store wanted to know which products should be marketed and advertised on sale at the same time, statistical models can be used to meet the need for association analysis.

Now, let’s look at some practical business applications for such data mining techniques.

BUSINESS APPROACHES
There are several areas within different industries where mining can be applied. Identified below are three basic business analysis needs that most organizations have:

Marketing
Mining can be used to improve customer retention rates by identifying customers ready to switch to other service or product providers. As information about customers is combined with information about products, significant opportunities can be realized. A company that is able to identify customer buying decisions over time will be able to use the best approach for obtaining consumer buy-in for the products and services it offers. It can also develop targeted marketing campaigns as well as identify profitable consumer markets.

Risk Analysis
Customers can be managed differently based on perceived risk. This is true for lending, insurance, health care, and even utilities. Modeling techniques can be used to classify the amount of risk associated with a customer or customer segment. This risk can also be tracked and adjusted over time. This information is valuable in providing guidelines for credit scoring stability, portfolio and product management, lending practices, and fraud assessment and detection.

Product Management
Techniques for matching product and part requirements are critical for product design reusability. Data gathered through sales and part maintenance records can be combined to identify where the need to increase product longevity exists.

Now, let us review a couple of case studies where mining techniques were used to meet business needs. Note that the results of these techniques will be covered in detail during the presentation of this paper.
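
As an illustration of the association analysis described earlier for the retail store example, the sketch below scores pairs of products by how often they are bought together. The baskets, product names, and the 0.3 support cutoff are hypothetical, and the three measures used (support, confidence, lift) are one common way to frame the problem rather than the method of any particular tool.

```python
from collections import Counter
from itertools import combinations

def association_rules(baskets, min_support=0.3):
    """Score product pairs by support, confidence, and lift."""
    n = len(baskets)
    item_counts = Counter(item for b in baskets for item in set(b))
    pair_counts = Counter(
        pair for b in baskets for pair in combinations(sorted(set(b)), 2)
    )
    rules = []
    for (a, b), together in pair_counts.items():
        support = together / n                    # share of baskets with both
        if support < min_support:
            continue
        confidence = together / item_counts[a]    # P(b in basket | a in basket)
        lift = confidence / (item_counts[b] / n)  # above 1: bought together more than chance
        rules.append((a, b, support, confidence, lift))
    # Strongest associations first.
    return sorted(rules, key=lambda r: r[4], reverse=True)

# Hypothetical point-of-sale baskets, invented for illustration only.
baskets = [
    ["chips", "salsa", "soda"],
    ["chips", "salsa"],
    ["chips", "soda"],
    ["bread", "milk"],
    ["bread", "milk", "soda"],
    ["chips", "salsa", "milk"],
]
rules = association_rules(baskets)
for a, b, support, confidence, lift in rules:
    print(f"{a} + {b}: support={support:.2f} "
          f"confidence={confidence:.2f} lift={lift:.2f}")
```

Pairs with a lift well above 1 are candidates for joint promotion. Pairs with a lift near 1 are bought together about as often as chance would suggest.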

STRIKING GOLD
CASE STUDY: Human Resource and Benefit Data Management
♦ Need for locating relationships within participant data - specifically trends to identify the need for reuse of benefit packages
♦ 200-300 clients with anywhere from 100 to 100,000 participants
♦ Multiple systems including human resource, benefits, pension, and health care servicing
♦ Multiple plans for each client
♦ Data preprocessed (calculations and deduplication process completed)
♦ No tools available
♦ Global data storage environment (over 1,000 tables housed world-wide)
♦ Clients consist of companies within different industries

Solution: Select and track ‘cradle to grave’ attributes; obtain measurements identifying data size and location as well as system infrastructure issues; select tools and techniques (primarily neural network models housed on NT); apply models to data based on functional need (clustering); reapply the model to revisit data cleansing issues; final review and analysis.

Critical Success Factors: Due to several revisits to preprocessed data, no assumptions can be made regarding data cleansing. The need for re-cleansing will always arise as discoveries are located through statistical and analytical processing. Always identify connectivity and platform issues up front, especially when data stores are at different locations (globally).

CASE STUDY: TELECOMMUNICATIONS
♦ Need for churn forecasting 6 months - 2 years prior to potential loss of customers
♦ Operational data stores house several hundred thousand records on subscribers collected and distributed on a real-time (one hour delay) basis
♦ Multiple data marts (Oracle, Access, Sybase, etc.) where historical data is stored
♦ Data partially preprocessed (calculations identified, however inconsistently used within different lines of business)
♦ SAS used for statistical analysis, Visual Basic for GUI reporting, no mining tools used to date

Solution: Functional need determined as time-series forecasting; weights applied to variables and prediction accuracy determined (5-10%); data run through models to determine historical trends (hourly, daily, weekly, monthly, etc.), then rerun to identify potential future trends. This application is then available for reuse on an ongoing basis.

Critical Success Factors: It can be difficult to determine how much historical information is required to apply to the models to identify the most accurate trend information. Several reapplications may be required for analysis purposes. It is extremely important to put as much applicable attribute data as possible in the model to ensure that predictions are accurate.
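
The two-pass approach in the telecommunications solution (run the data through a model to capture the historical trend, check its accuracy, then rerun to project future trends) can be sketched as follows. The paper does not say which forecasting models were used, so a simple least-squares trend line stands in here as an assumption. The monthly figures are invented for illustration.

```python
def fit_trend(series):
    """Least-squares line through (0, y0), (1, y1), ...: (intercept, slope)."""
    n = len(series)
    x_mean = (n - 1) / 2
    y_mean = sum(series) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(series))
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope

def forecast(series, horizon):
    """Extend the fitted trend `horizon` periods past the end of the series."""
    intercept, slope = fit_trend(series)
    start = len(series)
    return [intercept + slope * (start + h) for h in range(horizon)]

def mape(actual, predicted):
    """Mean absolute percentage error, returned as a fraction."""
    return sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical monthly counts of cancelled subscriptions (invented data).
monthly_churn = [120, 126, 131, 129, 138, 142, 147, 151, 149, 158, 163, 166]

# First pass: fit on the early history and score against held-out months.
history, holdout = monthly_churn[:9], monthly_churn[9:]
error = mape(holdout, forecast(history, len(holdout)))

# Second pass: rerun on all the data to project future periods.
next_quarter = forecast(monthly_churn, 3)
print(f"holdout error: {error:.1%}")
print("next three months:", [round(v) for v in next_quarter])
```

Repeating the first pass with different amounts of history is one way to approach the question raised under Critical Success Factors, namely how much historical information the model needs.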

DATA PRESENTATION
Once mining techniques have been applied, the results can be made available to different levels within the user community by applying additional business intelligence or decision support system solutions.

Results over time, geographical region, and by specific demographics can be turned into visual information for increased benefit to the business community. Drill down, query and reporting, and multidimensional capabilities can be applied to the discovery results to allow management to make effective decisions based on the results of data mining techniques.

Below is a graphical representation of the different decision support techniques that require increased levels of analysis functionality. Note that at the top of the pyramid is query and reporting, which can be applied by users without much effort or domain knowledge. Data mining is at the bottom. As it is the most intense process from an analytical perspective, it requires a significant amount of domain knowledge, input data, and high-level statistical modeling to do the job. The more effective the analysis, the greater the potential for locating valuable information.

[Figure: pyramid of decision support techniques, labeled Q&R (top), EIS, OLAP, GIS, and Data Mining (bottom)]

CONCLUSION
Now that technology-based storage capacities are at an all-time high, organizations have more information available to them than ever before. In fact, the quantity of information available exceeds any given organization’s ability to manage that information by an exponential amount. Traditional query and reporting tools no longer sufficiently meet the sophisticated analysis needs of today’s businesses. The more data we have, the less we know about the relationships between different variables within this data.

Therefore, we must look beyond these tools and techniques to processes that allow us to address the increasing amount of information available. Data mining offers a solution to this problem.

With an emphasis on the discovery of valuable information from large databases, data mining provides added value to the investment in the corporate data warehouse and provides business lines with valuable nuggets of information to help make them more competitive.

Most organizations do not realize that quite often they are wasting millions of dollars obtaining external information about their consumers and competition to gain market advantage, when the information is just sitting in their files waiting to be discovered.

REFERENCES
Raphaelian, G. and Strange, K. (1997), “Data Warehousing and Data Mining: Separating The Two,” Gartner Group, Inc., 1.
“Data Mining reveals the Diamonds in your database,” (1996), SAS Communications, 2Q96, 18.
USEFUL RESOURCES
Adriaans, Pieter and Zantinge, Dolf (1996),
Data Mining, New York: Addison-Wesley
Publishing Company, Inc.
Bigus, Joseph (1996), Data Mining with
Neural Networks: Solving Business Problems from Application Development to
Decision Support, New York: McGraw-Hill.
AUTHOR CONTACT
Kimberly A. Foster
Manager & Practice Leader
Enterprise Data Management
CoreTech Consulting Group, Inc.
1040 First Avenue, Suite 400
King of Prussia, PA 19406-1336
Phone: 800-877-9612 ext. 3542
Fax: 610-337-2333
email: [email protected]