IBM SPSS Data Mining Tips
A handy guide to help you save time and money as you plan and execute your data mining projects
Table of Contents
Introduction 4
What is data mining? 5
What types of data are used in data mining? 6
Data mining and predictive analytics 6
How is data mining different from OLAP? 7
How is data mining different from statistics? 7
Why use data mining? 8
What problems does data mining solve? 8
How does the data mining process work? 9
Data mining tips 10
Setting up for success 10
Following the phases of CRISP-DM 12
Business understanding 13
Data understanding 16
Data preparation 18
Should the data be balanced? 20
Modeling 20
Evaluation 24
Deployment 25
Selecting a data mining tool 27
About IBM Business Analytics 30
What makes us unique? 31
IBM SPSS products 32
Data mining 32
Statistical analysis 33
Survey and market research 34
Glossary 35

© Copyright IBM Corporation 2010

Introduction
Are you currently involved in a data mining project? Or are you perhaps considering undertaking a data mining project for the first time? Regardless of your level of experience, IBM SPSS Data Mining Tips will help you plan and execute your project.

This booklet is divided into two major sections. The first defines and outlines the data mining process, while the second suggests a staged approach to the data mining process and gives a number of tips to guide you through it. Please remember that these stages are not to be considered in isolation. A decision made at one stage may influence your work at other stages. Also, in some situations you may work on several stages simultaneously rather than sequentially.

After the tips, you’ll find a glossary of terms frequently used in data mining. These terms are boldfaced the first time they appear in the text.

As you read, you’ll see symbols that will help you better understand the information in this booklet.

This symbol indicates an example illustrating a particular tip.
This symbol directs you to more information on the Web.

Keep IBM SPSS Data Mining Tips by your side and use it to save effort, complete your project in a timely manner and produce positive, measurable results.

If you have questions about beginning or executing your data mining projects, please call us. We offer a variety of technology training and consulting programs that can help you.

If you have any data mining suggestions or ideas, contact your local office, or visit www.ibm.com/spss and then go to SPSS Developer Central.

A list of IBM SPSS products can be found on pages 32-35 of this booklet, and you can visit us online at www.ibm.com/spss to find out more about our data mining products.

What is data mining?
According to Gartner Inc., data mining is “the application of descriptive and predictive analytics (such as clustering, segmentation, estimation, prediction and affinity analysis) to support the marketing, sales or service functions.”

Data mining solves a common paradox: the more data you have, the more difficult and time-consuming it is to effectively analyze and draw meaning from it. What could be a gold mine often lies unexplored due to a lack of personnel, time or expertise. Data mining overcomes these difficulties because it uses a clear business orientation and powerful analytic technologies to quickly and thoroughly explore mountains of data and extract the valuable, usable information – the business insight – that you need.

What types of data are used in data mining?
Depending on the data mining problem, your project can incorporate data from a wide range of sources. In fact, data mining projects often benefit from using several different types of data, each of which gives additional insight into the area of study. Recent advances in analytics have led to two important new types of mining – text mining and web mining. While traditional data analysis has focused on numerical, structured data (as found in spreadsheets or flat-file databases), these two technologies open a new and rich vein of data – information known as unstructured data – from survey research, customer communications and log files from web servers.

Using multiple data sources, including both structured and unstructured sources, can add accuracy and depth to your results. For example, survey data can add valuable information about opinions and preferences, explaining why people act and behave as they do. Such attitudinal data can reveal psychographic or motivational desires that will never be discovered by analyzing fact-based transactional data.

Data mining and predictive analytics
Data mining uncovers patterns in data using advanced descriptive and predictive techniques. Predictive analytics combines these with decision optimization to determine which actions will drive the best outcomes. These recommendations, along with supporting information, are delivered to the people and systems that can take action. Data mining is at the heart of predictive analytics.

How is data mining different from OLAP?
Pivot tables and online analytical processing (OLAP) are important tools for understanding what has happened in the past. Data mining is a process for understanding what will happen in the future. Data mining uses predictive modeling, including statistics and machine-learning techniques such as neural networks, to predict what will happen next. For example, a query or report can tell you the total sales for the last month. OLAP goes deeper, telling you about sales by product for the last month. Data mining, however, tells you who is likely to buy your products next month. And for the best results, predictive analytics insights can be incorporated into a marketing campaign to determine, for example, how to deliver personalized offers that have the best likelihood of leading to sales.

How is data mining different from statistics?
Data mining doesn’t replace statistics. Statistics is more often concerned with confirming hypotheses, while data mining can help generate new ones. Statisticians frequently make inferences about large populations from a small sample, while data miners can often process an entire universe of observations. In fact, statistics is a good complement to data mining: traditional statistical techniques, such as regression, are used alongside data mining technologies, such as neural networks. Statistics is also used to validate data mining results.

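As a minimal sketch of the contrast drawn in the OLAP comparison, the Python snippet below computes the two backward-looking views mentioned there: a report-style total for the month and an OLAP-style breakdown by product. The `sales` records, customer names and amounts are invented for illustration and do not come from this booklet. The forward-looking question (who is likely to buy next month) needs a predictive model and is not attempted here.

```python
from collections import defaultdict

# Hypothetical transactions for last month: (customer, product, amount).
sales = [
    ("alice", "widget", 20.0),
    ("bob", "widget", 35.0),
    ("alice", "gadget", 15.0),
    ("carol", "gadget", 30.0),
]

# Query/report view: a single total for the whole month.
total_sales = sum(amount for _, _, amount in sales)

# OLAP-style view: the same measure broken down by one dimension (product).
sales_by_product = defaultdict(float)
for _, product, amount in sales:
    sales_by_product[product] += amount

print(total_sales)             # 100.0
print(dict(sales_by_product))  # {'widget': 55.0, 'gadget': 45.0}
```

Both results only summarize what has already happened; neither says anything about an individual customer's future behavior.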
Why use data mining?
Data mining empowers you to manage and change the future of your organization by providing an understanding of the past and the present and delivering accurate predictions. For example, data mining can tell you which prospects are likely to become profitable customers and which are most likely to respond to an offer. With this view of the future, you can increase your return on marketing investment (ROMI) by making the offer only to those prospects likely to respond and become valuable customers.

With data mining, you have a reliable guide to the future of your organization, and you have the power to make the right decisions right now. Decisions based on sound business insight – not on instinct or gut reactions – can deliver consistent results that keep you ahead of the competition.

What problems does data mining solve?
You can use data mining to solve almost any business or organizational problem that involves data, including:
• Increasing revenues from customers
• Understanding customer segments and preferences
• Identifying profitable customers and acquiring new ones
• Improving cross-selling and up-selling
• Retaining customers and increasing loyalty
• Employee retention
• Increasing ROMI and reducing marketing campaign costs
• Detecting fraud, waste and abuse
• Determining credit risks
• Increasing web site profitability
• Increasing retail store traffic and optimizing layouts for increased sales
• Monitoring business performance
• Student lifecycle management

How does the data mining process work?
IBM SPSS data mining products and services ensure timely and, above all, reliable results because they support the CRoss-Industry Standard Process for Data Mining (CRISP-DM). Created by industry experts, CRISP-DM provides step-by-step guidelines, tasks and objectives for every stage of the data mining process.

There are six phases in CRISP-DM:
• Business understanding – achieve a clear understanding of your business challenges
• Data understanding – determine what data is available to mine for answers
• Data preparation – prepare the data in a format appropriate for your questions
• Modeling – design data models to meet your requirements
• Evaluation – test your results against the goals of your project
• Deployment – make the results of the project available to decision makers

To learn more about CRISP-DM, visit www.crisp-dm.org.

Data mining tips
Setting up for success
Follow CRISP-DM
Using CRISP-DM to guide your data mining project helps to ensure a successful outcome. It is critical to follow a proven methodology – complex data mining technologies and large volumes of available data can overwhelm a project that is not firmly grounded in the problem you want to solve.

Begin with the end in mind
To be able to show a positive return on investment (ROI) at the end of the project, you must know how you will evaluate the results before you start (e.g., which business measures you should use; how these will be calculated or derived) and, most importantly, how the results will be used (i.e., how they will be deployed throughout the organization).

For example, suppose you want to identify the 20 percent of your subscribers who (following the Pareto Principle) will account for 70 to 80 percent of those who churn. Before you start, you should know how to translate this information into an expected revenue improvement based on sound assumptions about the cost of and response to your customer retention programs.

Or suppose you want to improve your ability to detect insurance fraud. How much improvement will be sufficient to justify the exercise? How strong do the models need to be? What will determine success (e.g., how much would you save if you identified ten additional cases of fraud)?

Manage expectations
Make sure that your project stakeholders know that data mining is not a magic wand that miraculously solves all business problems. Rather, it is a business process implemented by powerful computer software, and, as with any business process, the stakeholders need to propose a solvable problem and work with you to find the solution.

If you plan to segment customers for your marketing department, let them know the type of information they are likely to receive as a result of your project (e.g., “We’re using product information and demographic data, so we expect to provide segments based on age, income, etc., that will show the product mix favored by these customers.”).

Limit the scope of your initial project
Start with realistic objectives and schedules. When you do achieve success, move on to more complex projects.

For example, rather than attempting to immediately improve customer acquisition, cross-selling, up-selling and retention in every region, focus on a smaller, more realistic goal. Pick one that is quickly achievable, easily measurable and has an important impact on your organization. Initially, you should look for “low-hanging fruit” to establish that the process successfully delivers results. Then you can become more ambitious in the scale and scope of projects that you take on.

Identify a steering committee
A data mining project is a group effort. It requires business users who understand the
issues and the data, as well as people who understand analysis. In addition, those who own the data will need to provide access to it.

For example, you may need a data mining analyst, a database analyst and a marketing manager. These roles may fall into different functional areas whose goals do not align well with those of the project. So it’s important to find ways to encourage people to work together. Be aware that you may also need IT department support to provide access to the data.

Avoid the data dump
Always set up the business problem, define the project goals and get the support of the project group. If you simply begin analyzing a pile of data with no project structure, you will simply get lost in the data and waste time.

Don’t let the volume of data drive your project – focus on the business goal. You may not use all of your data – some may not be relevant to the project. You may even discover that your data is not sufficient to resolve your business problem: a large volume of data is no guarantee that you have the right data.

For example, recent information usually offers more accurate predictions for customers’ behaviors than volumes of historical data. More data is not always better, and just because you have it doesn’t mean you have to – or should – use it.

Following the phases of CRISP-DM
This section includes tips excerpted from the data mining guide CRISP-DM 1.0.

A detailed document, which expands on the information presented here and includes a user guide, can be downloaded from www.crisp-dm.org.

Business understanding
Know “who, what, when, where, why and how” from a business perspective
Develop a thorough understanding of the project parameters: the current business situation; the primary business objective of the project; the criteria for success; and who will determine the success of the project.

Create a deployment strategy
Think about how you want to use the results of the data mining project. For example:
• Will the results be used by specialists who don’t need to have the results interpreted?
• Will the results be used by a wide range of employees who need differing levels of interpretation?
• Will the results be deployed via a particular medium (online, paper, etc.) that requires a certain format?

Develop a maintenance strategy
How will you manage the data once the initial project is completed? If the project is part of an ongoing strategy, will you:
• Analyze new data periodically?
• Analyze new data in real time?

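Returning to the churn example under “Begin with the end in mind,” the arithmetic of translating a targeting result into an expected revenue improvement can be sketched in a few lines of Python. Only the 20 percent and 70 to 80 percent figures come from that example. The subscriber count, churn rate, save rate, customer value and contact cost below are assumptions chosen purely for illustration, and the function name `expected_retention_gain` is hypothetical.

```python
def expected_retention_gain(subscribers, churn_rate, targeted_share,
                            churner_capture, save_rate,
                            value_per_customer, cost_per_contact):
    """Return (customers_saved, net_gain) for a targeted retention campaign."""
    contacted = subscribers * targeted_share        # who receives the offer
    churners = subscribers * churn_rate             # who would leave anyway
    churners_reached = churners * churner_capture   # churners inside the target group
    customers_saved = churners_reached * save_rate  # churners the offer retains
    revenue_kept = customers_saved * value_per_customer
    campaign_cost = contacted * cost_per_contact
    return customers_saved, revenue_kept - campaign_cost

# Assumed inputs: target 20% of subscribers, who hold 75% of future churners.
saved, net = expected_retention_gain(
    subscribers=100_000, churn_rate=0.10, targeted_share=0.20,
    churner_capture=0.75, save_rate=0.20,
    value_per_customer=300.0, cost_per_contact=5.0)
print(round(saved), round(net))  # 1500 350000
```

Writing the calculation down before the project starts forces the assumptions about cost and response into the open, where stakeholders can challenge them.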
Assess the situation and inventory resources
Be sure to go over every aspect of the project in advance to ensure you have what you need for success:
• Personnel – project sponsor, business and technical experts
• Data sources – access to warehouse or operational data
• Computing resources – hardware, platforms
• Software – data mining and other relevant software

What are the project requirements?
List all of the requirements of the project:
• Schedule for completion
• Comprehensibility and quality of results
• Security
• Legal restrictions on data access

What assumptions are being made about the project?
List and clarify all of the assumptions you have made about:
• Data quality – accuracy, availability
• External factors – economic issues, competition, technical advances
• Internal factors – the business problem
• Models – is it necessary to understand, describe, or explain the models to senior management?

Under what constraints will the project operate?
Check and develop solutions for the following:
• General constraints – legal issues, budget, timing, resources
• Access rights to data sources – restrictions, necessary passwords
• Technical accessibility of data – operating systems, data management system, file or database format
• Accessibility of relevant knowledge
• Partner organizations

Does everyone speak the same language?
Make sure that everyone involved understands the terms and concepts that will be used throughout the project.

Facilitate interdepartmental understanding by creating a glossary of the business and technical terms that are specific or relevant to the project.

Translate business objectives into data mining tasks
Determine which data mining tasks you must complete to achieve your business objective. Define the data mining tasks using technical terms.

For example, the business goal “Increase catalog sales to existing customers” might translate into the data mining goal “Predict how many widgets customers will buy, given their purchases over the previous three years, relevant demographic information and the price of the item.”

Determine data mining success criteria
Using technical terms, describe which criteria must be met if the project is to be considered a success. For example, a successful model would be one that generated a specific level of predictive accuracy, or a propensity-to-purchase profile should produce a specific degree of lift.

Produce a project plan
Create a plan that outlines the steps you will take to achieve your data mining goals and meet your business objective. Assess which tools and techniques are available to enable you to complete your project.

Data understanding
Make sure the data is available
Gather all of the data you will need for your project. If your data will come from more than one source, make sure your data mining tools can integrate the data.

Survey data can add critical attitudinal insights to your models. A combination of behavioral and attitudinal data is best for comprehensive insight.

Up to 80 percent of your data may be hidden in text documents. Use a text mining tool to search these sources efficiently for valuable information.

Data collected from online activity can improve the quality and accuracy of your models. Use a web mining tool to add a deeper level of insight to your data mining project.

Try some exploratory data mining
Help data warehouse builders to set priorities by analyzing small amounts of data from multiple sources, and communicating any discoveries.

Does your data cover relevant attributes?
Ensure success by choosing data that best represents the behavior or situation you want to analyze. Do some preliminary brainstorming to generate a list of the relationships you might expect, then assess whether you have access to the right data to uncover the assumed patterns.

Describe existing data
Get a clear picture of your data by creating a report that describes data formats, the number of records and fields, field identities and other relevant features.

Check data quality
To prevent future problems, assess the quality of your data and make a plan for addressing any problems that are detected:
• Do the attribute names and the values they contain relate to one another?
• Are any attributes missing? Are there any blank fields?
• Check for multiple spellings of values to eliminate repetition
• Look for data that deviate from the norm and determine the causes

Review any attributes that show patterns that conflict with common sense (e.g., pregnant males).

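Several of the checks listed under “Check data quality” are easy to automate. The sketch below, in plain Python, counts blank fields, groups values that differ only by case or surrounding whitespace, and flags values far from the median. The `records`, the field names and the cut-off `k=5.0` are all invented for illustration, and the median-based outlier rule is one possible choice rather than a method this booklet prescribes.

```python
from collections import Counter
from statistics import median

# Hypothetical customer records; None or "" marks a blank field.
records = [
    {"id": 1, "state": "NY", "age": 34},
    {"id": 2, "state": "ny", "age": 41},
    {"id": 3, "state": "New York", "age": None},
    {"id": 4, "state": "CA", "age": 29},
    {"id": 5, "state": "", "age": 38},
    {"id": 6, "state": "CA", "age": 212},
]

def blank_counts(rows):
    """Count blank (None or empty-string) values per field."""
    counts = Counter()
    for row in rows:
        for field, value in row.items():
            if value is None or value == "":
                counts[field] += 1
    return dict(counts)

def spelling_variants(rows, field):
    """Group distinct raw values that differ only by case or whitespace."""
    groups = {}
    for row in rows:
        raw = row[field]
        if raw:
            groups.setdefault(raw.strip().lower(), set()).add(raw)
    return {key: sorted(vals) for key, vals in groups.items() if len(vals) > 1}

def outlier_ids(rows, field, k=5.0):
    """Flag rows whose value lies more than k median absolute deviations
    from the median (k is an arbitrary, assumed cut-off)."""
    values = [row[field] for row in rows if row[field] is not None]
    med = median(values)
    mad = median(abs(v - med) for v in values)
    return [row["id"] for row in rows
            if row[field] is not None and abs(row[field] - med) > k * mad]

print(blank_counts(records))                # {'age': 1, 'state': 1}
print(spelling_variants(records, "state"))  # {'ny': ['NY', 'ny']}
print(outlier_ids(records, "age"))          # [6]
```

Note that “New York” is not matched with “NY” here: true synonyms need a lookup table, which is exactly the kind of problem a data quality plan should record. A flagged value still needs a human to determine the cause.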
Exclude any irrelevant data (e.g., if you’re checking on home loan behavior, eliminate customers who have never owned a home).

Generate a data quality report
Check for duplicate data, potential data errors (e.g., customers are shown to have churned before they even became customers) and database fields that may contain invalid information.

Some fields may be irrelevant to your goals and don’t need to be cleaned. Track actions taken or not taken for those fields, and document your decisions because you may decide to use them later in the process.

Data preparation
Select your data
Decide what data to use for analysis and be clear about the reasons for your decisions. This involves:
• Performing significance and correlation tests to determine which fields to include
• Selecting data subsets
• Using sampling techniques to review small chunks of data for appropriateness
• Performing data reduction techniques (e.g., factor analysis) where appropriate

For richer and more accurate models, be sure to include non-traditional types of data, such as survey data, key concepts from customer communications, and data about online activity. Combining multiple types of data gives you a more complete picture of your customers and your organization.

Address data quality problems
To ensure reliable results, take the time to fix any data quality problems before you begin the analysis. Data quality activities may include:
• Determining how to deal with dirty data
• Addressing special values and their meaning – for example, a special value can be a default value used when a survey question is not answered or when data is shortened for space considerations (“2004” becomes “04”)
• Being careful about changes in data formats (e.g., zip codes treated as numeric values will lose leading zeroes when formatted)

Choose a flexible data construction tool
Make sure the data mining tools you choose are capable of manipulating the data according to project needs. Your tools should also allow you to add new fields as needed. Remember that data mining is a discovery-driven process – it’s impossible to know in advance where the data will take you.

Determine whether to create newly derived attributes
You may wish to create derived attributes for the following reasons:
• Due to your experience with the situation at hand, you know that a particular attribute is important to the data even though it doesn’t currently exist
• The modeling algorithm only handles certain data types; therefore, important information won’t be included unless it is recreated
• Modeling results reveal that relevant facts are not represented

Preliminary statistical analyses may indicate how
best to combine variables into ratios or new
groupings. Before you add derived attributes,
determine whether and how they will help the
modeling process.
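
To make the idea of derived attributes concrete, here is a small Python sketch that adds one ratio and one grouping to each record. The field names (`income`, `debt`, `age`), the sample values and the age band cut-offs are all hypothetical; which ratios or groupings are worth deriving depends on your own preliminary analysis.

```python
# Hypothetical customer fields; names and band cut-offs are illustrative only.
customers = [
    {"id": 1, "income": 40_000, "debt": 10_000, "age": 23},
    {"id": 2, "income": 85_000, "debt": 51_000, "age": 47},
    {"id": 3, "income": 0, "debt": 5_000, "age": 68},
]

def add_derived(row):
    """Return a copy of row with a ratio and a grouping added."""
    derived = dict(row)
    # Ratio: leave it undefined (None) rather than dividing by zero.
    derived["debt_to_income"] = (row["debt"] / row["income"]
                                 if row["income"] else None)
    # Grouping: collapse a numeric field into bands.
    if row["age"] < 30:
        derived["age_band"] = "under 30"
    elif row["age"] < 60:
        derived["age_band"] = "30-59"
    else:
        derived["age_band"] = "60+"
    return derived

prepared = [add_derived(c) for c in customers]
print(prepared[0]["debt_to_income"], prepared[0]["age_band"])  # 0.25 under 30
```

The original fields are kept alongside the derived ones, so you can still check later whether a new attribute actually helped the model.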
Consolidate information by merging data
When you join new tables to consolidate information, you may also want to generate new fields and aggregate values.

Make sure that your data mining tools can accommodate different types of data – such as survey, text, and Web data – from multiple sources without costly, time-consuming customization.

Do your data mining tools require data to be in a specific order?
If your data mining tools require that your records be in a particular order, you may need to sort your dataset at this stage.

Should the data be balanced?
Determine whether your modeling technique requires balanced data.

For example, direct mail campaigns often return information skewed toward “no response” – i.e., most observations are from non-responders. To predict positive responses accurately, however, some techniques may require you to have roughly equal numbers of positive and negative responses.

Modeling
Selecting modeling techniques
To match your data to the right modeling technique, check which assumptions each technique makes about data format and quality. In some cases, only one technique may be appropriate for your situation. Be sure to consider:
• Which techniques are appropriate for your problem
• Whether there are any “political” requirements (management expectations, understandability)
• Whether there are any constraints (unusual data characteristics, staff expertise, timing issues)
• Which techniques conform best to your deployment strategy

To ensure that you have the right technique for each model and situation, choose data mining tools that offer a wide range of techniques and modeling options. Better still are tools that allow multiple techniques to be selected and assessed simultaneously based on data types.

Test before you build
Before you create your final model, test the quality and validity of the techniques you plan to use. Create a test design that incorporates a training set, a test set and a validation set. Then build the model on the training set and assess its effectiveness with the test dataset.

Build your model
To create a model, run your modeling tool on the dataset you have prepared. Describe the result and assess its expected accuracy, effectiveness and potential shortcomings.

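The test design described under “Test before you build” and the balancing question raised under “Should the data be balanced?” can be combined in one short Python sketch. It shuffles a dataset into training, test and validation sets, then balances only the training set by random undersampling. The 60/20/20 proportions, the fixed seeds, the `mailing` data and the decision to leave the test and validation sets unbalanced are assumptions made for illustration, not recommendations from this booklet.

```python
import random

def split_dataset(rows, train=0.6, test=0.2, seed=0):
    """Shuffle rows and split them into training, test and validation sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_test = round(len(shuffled) * test)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_test],
            shuffled[n_train + n_test:])

def undersample(rows, label_key, seed=0):
    """Balance classes by randomly dropping rows from the larger classes."""
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    smallest = min(len(group) for group in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, smallest))
    return balanced

# Hypothetical direct mail data: 5% responders, 95% non-responders.
mailing = [{"id": i, "responded": i % 20 == 0} for i in range(1000)]

train_set, test_set, validation_set = split_dataset(mailing)
balanced_train = undersample(train_set, "responded")

print(len(train_set), len(test_set), len(validation_set))  # 600 200 200
```

Undersampling discards data, so it is only one of several ways to balance a dataset; whether balancing is needed at all depends on the modeling technique, as noted above.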
Create a detailed model report that lists the rules produced, the parameter settings used, the model’s behavior and interpretation, and any conclusions about patterns revealed in the data.

Use only attributes that will be available to the model and in the right state at the time of deployment.

For example, if you want to create a model that predicts the risk of losing customers within the next three months, build the model using data about customers who defected during the previous three months. Applied to current data, the model will then predict which customers may leave in the near future, allowing you time to take action to prevent them doing so.

Using induction to produce a rule
Rules are essentially parameters within which the data must fall in order to be considered. They are usually in an “if/then” format. Induction enables you to automatically choose which rules are most effective for obtaining specific results. For example, this is how rule induction can be used to create a set of rules for qualifying loan prospects:
• If employed for more than two years, then credit risk is good
• If older than 30, then credit risk is good
• If declared bankrupt at any time, then credit risk is bad

Test after you build
Make sure your model delivers results that will help you achieve your data mining goal.

Use lift and gains tables to show a model’s predictive ability.

Try several models to get the right fit
To improve model performance, try adding or removing fields or experimenting with different options. Remember the law of parsimony (Occam’s Razor): simple models may be better. Balance the strength or power of a model with its complexity: simple models often are easier to explain, easier to maintain, and may be less prone to degradation over time. Also, since each technique works slightly differently, try a variety of approaches (such as clustering and association) to find all of the relevant patterns.

Statistical models are good for:
Initial analysis – statistical analysis is useful in the early stages of a data mining project to gain an overview of the structure of the data. Developing a concise description of the characteristics of the data can help the group’s members to develop hypotheses and plan further analysis.

Propensity models are good for:
Predicting customer behavior – discovering who is most likely to purchase, most likely to churn, most likely to default on loans, and much more. Use this information to determine which customers and prospects offer the best long-term profitability.

Clustering is good for:
Finding natural groupings of cases that have the same characteristics – e.g., detecting fraud by using clustering to group similar cases of unusual credit card transactions.

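For readers who have not built one, the lift and gains table recommended under “Test after you build” can be sketched in plain Python. Cases are ranked by model score, cut into equal bins, and for each cumulative bin the table reports the share of all responders captured (gain) and how that compares with contacting the same share at random (lift). The function name `gains_table`, the five-bin layout and the `scored` data are hypothetical; real tools typically use deciles and far more cases.

```python
def gains_table(scored, n_bins=5):
    """Build a cumulative gains/lift table from (score, responded) pairs."""
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    total = len(ranked)
    total_pos = sum(1 for _, responded in ranked if responded)
    if total_pos == 0:
        raise ValueError("no positive cases: lift is undefined")
    table, cum_pos = [], 0
    for b in range(n_bins):
        start = b * total // n_bins
        end = (b + 1) * total // n_bins
        cum_pos += sum(1 for _, responded in ranked[start:end] if responded)
        contacted = end / total     # share of cases contacted so far
        gain = cum_pos / total_pos  # share of responders captured so far
        table.append({"contacted": contacted, "gain": gain,
                      "lift": gain / contacted})
    return table

# Hypothetical model scores with actual outcomes (1 = responded).
scored = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.5, 0),
          (0.4, 0), (0.3, 1), (0.2, 0), (0.1, 0), (0.05, 0)]

for row in gains_table(scored):
    print(f"{row['contacted']:.0%} contacted: "
          f"gain {row['gain']:.0%}, lift {row['lift']:.2f}")
```

In this toy data the top 20 percent of scores captures half of the responders, a lift of 2.5 over random selection. Lift always falls back to 1.0 once every case has been contacted.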
Association rules are good for:
Basket analysis – discovering which items are most likely to be purchased together. Use this information to improve cross-selling through catalog and store layout, recommendation engines, phone and direct mail offers, and more.

Evaluation
Evaluate your data mining results
Determine whether and how well the results delivered by a given model will help you achieve your organization’s goals. Is there any systematic reason why the model is deficient?

If time and resources are available, try testing the model or models in a limited real-world environment (e.g., at a single store or call center, or for a single product line) to see if it performs as expected.

Review the data mining process for any missing steps or overlooked tasks
When you have confirmed the quality and effectiveness of your results, review your work to determine whether you have missed any important steps or information.
• Was each stage of the data mining process necessary in retrospect?
• Was each stage executed as well as possible?

Determine next steps
Now is the time to determine whether the project is successful enough to move ahead to deployment. If not, take further steps to achieve satisfactory results. Keep in mind:
• The deployment potential of each result
• How the process could be improved
• Whether the resources exist for additional steps or repetitions of previous steps

Deployment
Create a deployment plan
Take the project results and synchronize them with your original goals and objectives to address your organizational issue most effectively:
• Summarize deployable models or software results
• Develop and evaluate alternative deployment plans
• Confirm how the results will be distributed to recipients
• Reconfirm how you will monitor the use of the results and measure the benefits
• Identify possible problems and pitfalls during deployment

Monitor and maintain your plan
Ensure the best use of your data mining results by creating a maintenance plan that addresses:
• What could change in the future that would affect the use of the results
• How to monitor accurate use of the results
• When, if necessary, to discontinue deployment or use of the results
• Criteria for renewing and refreshing models

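One way to turn “criteria for renewing and refreshing models” into something checkable is a simple rule that compares a deployed model's recent hit rate with the rate measured at evaluation time. The Python sketch below is an assumption-laden illustration: the function name `needs_refresh`, the 80 percent tolerance and the 30-case minimum are all invented here, and a real maintenance plan would set its own thresholds.

```python
def needs_refresh(baseline_rate, recent_outcomes, tolerance=0.8, min_cases=30):
    """Flag a deployed model for refresh when its recent hit rate drops.

    baseline_rate: hit rate measured when the model was evaluated.
    recent_outcomes: list of booleans, True where a prediction proved right.
    tolerance: fraction of the baseline the recent rate must reach (assumed).
    min_cases: minimum sample size before any judgement is made (assumed).
    """
    if len(recent_outcomes) < min_cases:
        return False  # too little evidence to judge either way
    recent_rate = sum(recent_outcomes) / len(recent_outcomes)
    return recent_rate < tolerance * baseline_rate

# Hypothetical monitoring windows against a 30% baseline hit rate.
steady = [True] * 12 + [False] * 28   # 30% hit rate over 40 cases
degraded = [True] * 6 + [False] * 34  # 15% hit rate over 40 cases

print(needs_refresh(0.30, steady))    # False
print(needs_refresh(0.30, degraded))  # True
```

A rule like this only signals that something has changed; deciding whether to retrain, rebuild or discontinue the model remains a business decision.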
Create a final report
Depending on your deployment plan, the report may be either a project summary or a final presentation of the data mining results. To create your final report:
• Identify which reports are needed (slides, management summary, etc.)
• Identify report recipients
• Outline the structure and content of the report
• Select which discoveries to include

Execute your deployment plan
Put your data mining results to optimal use by distributing them according to the deployment plan. Even the most brilliant discovery will not generate ROI if it isn’t used to improve your business. Shelf reports have very little current or future value.

Review the project
This is your opportunity to assess what went right, what didn’t, what the major accomplishments were and what improvements may be necessary. For a complete review, try the following:
• Interview all significant project members about their experiences
• Interview the end users of your data mining results about their experiences
• Document and analyze the specific data mining steps that you took
• Analyze how well the data mining goals were met
• Create recommendations for future projects

Selecting a data mining tool
The tips in this section are excerpted from the CRISP-DM document, Performing a data mining tool evaluation.

Look for tools with a proven record of solving the organizational problems that your project addresses
Choose tools that have been shown to be useful for solving problems within your industry and that have a successful track record in the business areas that you need to address.

Select tools that bridge business understanding and the technical aspects of data mining
Make sure that the steps used by the tools match the business needs of data mining. Ask: Do the tools present data mining concepts clearly?

Make sure your tools work with your existing data sources and formats
You will save time and money, and maximize your chances for reliable results, by choosing tools that can pull in and combine data from multiple sources and formats. This is particularly important if discoveries later in the data mining process lead you to add data from a new source.

Data mining tools that enable you to combine behavioral and attitudinal data, in the form of both structured and unstructured data, will deliver more accurate results and provide greater flexibility in terms of the types of data mining projects you’re able to undertake.

Choose tools with efficient, comprehensible data preparation steps
Save time and resources by choosing data mining tools that prepare
data efficiently (from initial stages through to model building) and
that present data preparation steps in an easy-to-understand way.
This enables project members with varying levels of expertise to
obtain effective results.
Make sure that your tools can automatically extract data
Avoid writing time-consuming manual queries by choosing tools that
can extract data automatically for the various data preparation steps.
Can the tools use the data and equipment you already have?
Choose data mining tools that can use your data where it exists today,
regardless of whether it is in databases or files, and that are
compatible with your existing analysis and visualization tools. You
don’t want to waste time and resources building another database
because you are unable to analyze the data you already have.
Can the tools build effective models in a reasonable time?
Look for tools that enable analysts to find the most effective models
quickly. The tool should support efficient building and testing of
multiple models and, ideally, also support automation to reduce the
time needed to carry out some of the more mundane aspects of data
mining.
Choose tools with a wide range of techniques
To ensure the best results, make sure your tools offer a wide range
of techniques or algorithms for visualization, classification,
clustering, association, and regression. For example, you might
discover that one technique works better than another for specific
types of data. Flexibility will enable you to try a number of
techniques to get accurate, effective results. The tools should also
be able to combine techniques in situations where that approach would
produce the best results.
Choose tools that deliver consistent, high-quality results
Get accurate results from your data with adaptable tools that perform
well in a variety of situations, rather than tools designed for a
specific type of data or situation. Your tools should be able to
manage any data that you may need to address your problem effectively.
Look for interactive exploration and visualization capabilities
Make it easy to explore and understand the data by choosing tools
that provide interactive visualization techniques. These allow you to
gain insights quickly by making changes within graphs and creating
new graphs based on different dimensions of the data.
What are the tools’ deployment capabilities?
It is critical to choose tools capable of integrating your results
into operational applications now and in the future. Also consider:
• Whether integration will be cost effective or whether it will
require additional time and money
• How easily the tools can update data mining results and what
additional investments, if any, are required
Assess the potential costs of ownership associated with the tools
Analyze the potential ROI for each tool:
• What will be the cost of ownership over the product’s lifetime,
including any additional software or services required by the tool?
• When can you expect a positive ROI?
• How long will it take to implement your data mining tool? Is it
designed for technical experts or can it accommodate users of varying
expertise? What training costs are involved now and in the future?
• Is the tool customizable for your particular users and business
needs? Can you save common processes and automate tasks?
About IBM Business Analytics
IBM Business Analytics software delivers complete, consistent and
accurate information that decision-makers trust to improve business
performance. A comprehensive portfolio of business intelligence,
predictive analytics, financial performance and strategy management,
and analytic applications provides clear, immediate and actionable
insights into current performance and the ability to predict future
outcomes. Combined with rich industry solutions, proven practices and
professional services, organizations of every size can drive the
highest productivity, confidently automate decisions and deliver
better results.
As part of this portfolio, IBM SPSS Predictive Analytics software
helps organizations predict future events and proactively act upon
that insight to drive better business outcomes. Commercial,
government and academic customers worldwide rely on IBM SPSS
technology as a competitive advantage in attracting, retaining and
growing customers, while reducing fraud and mitigating risk. By
incorporating IBM SPSS software into their daily operations,
organizations become predictive enterprises – able to direct and
automate decisions to meet business goals and achieve measurable
competitive advantage. For further information or to reach a
representative, visit www.ibm.com/spss.
What makes us unique?
For 40 years, we have been the clear leader in analytics technology.
Here are some of the reasons that customers have selected IBM SPSS
software to drive their decision making:
• A complete, 360° view – Our software enables you to develop
in-depth understanding by using all of your information, both
traditional structured data and unstructured data, for a 360° view of
your customers or constituents
• Easy integration with operational systems – IBM SPSS predictive
analytics technologies and products are designed to work well, both
independently and with other technologies or systems
• Open, standards-based architecture – IBM SPSS software follows
industry standards such as OLE DB for data access, XMLA for
data/format sharing, PMML for predictive model sharing, SSL for
Internet security management, and LDAP/Active Directory Services for
authentication and authorization, to name a few
• Faster return on your software investment – according to a recent
study by Nucleus Research, an independent analyst firm, 94 percent of
IBM SPSS customers achieve a positive return on investment within an
average payback period of just 10.7 months
• A lower total cost of ownership – IBM SPSS products are designed to
work with your existing technology infrastructure and staff
resources. We keep both your short- and long-term costs of ownership
low by providing open technology and flexible licensing options
IBM SPSS products
With IBM SPSS products, you can build a flexible analytics system
that enables you to both meet your needs today and achieve tomorrow’s
goals.
Data mining
IBM® SPSS® Modeler Professional – This product’s interactive data
mining process incorporates your valuable expertise at every step to
create powerful predictive models that address your specific
organizational issues.
• IBM® SPSS® Modeler Premium – Extract key concepts, sentiments, and
relationships from unstructured data, and convert them to structured
format for predictive modeling with IBM SPSS Modeler
• IBM® SPSS® Collaboration and Deployment Services – Centralize and
organize models and modeling processes, automate production and
deployment of results
Statistical analysis
IBM® SPSS® Statistics – IBM SPSS Statistics is a tightly integrated,
modular, full-featured product line supporting the entire analytical
process – from planning to data collection through data access and
management, analysis and reporting to deployment – and a critical
complement to the data mining process. Add the products below to
increase your analysis capabilities:
• IBM® SPSS® Advanced Statistics – Improve the accuracy of your
analyses and provide more dependable conclusions with procedures
designed to fit the inherent characteristics of your data
• IBM® SPSS® Custom Tables – Summarize and communicate results in a
presentation-ready tabular format, using a highly intuitive
drag-and-drop interface
•
IBM® SPSS® Regression – Apply more
sophisticated models for greater accuracy in
market research, medical research, financial
risk assessment, and many other areas
• IBM® SPSS® Text Analytics for Surveys – Categorize text responses
to open-ended survey questions so you can integrate them with your
quantitative survey data. IBM SPSS Text Analytics for Surveys
extracts key concepts from text for further analysis in IBM SPSS
Statistics or Microsoft Excel.
The IBM SPSS Statistics family of products includes a full range of
modules and stand-alone products. For a complete list, go to
www.ibm.com/statistics
Survey and market research
IBM® SPSS® Data Collection – Conduct both large-scale, multi-mode
research projects and smaller, one-of-a-kind surveys with this open,
scalable and customizable survey research platform. It includes
products for every step of the survey process, from creating survey
scripts to collecting and analyzing data and reporting the results.
Training and Services
IBM SPSS Training – We offer a full suite of data mining courses, as
well as product-specific training. Most courses are available at an
IBM SPSS facility or at your company site.
IBM SPSS Worldwide Services – Let our experienced consultants help
you determine which problems to address and how best to solve them.
IBM SPSS products are available for Microsoft® Windows, Apple® Mac®,
Linux®, UNIX and other platforms.
Glossary
Association: the process of discovering which events occur together
or are related. For example, use association techniques to determine
which products are often purchased together. Contrast with sequence
detection, which can be used to discover the order in which the
products were purchased.
Attitudinal data: data that relates to or is expressive of personal
attitudes or opinions. Attitudinal data is often gathered through
survey research such as responses to open-ended survey questions, and
analysis of textual communications such as customer emails.
Attribute: a property or characteristic of an entity; also known as a
variable or field.
Balanced data: if you have two or more categories of data to analyze,
each category should have an equal amount of data to simplify the
modeling process.
Behavioral data: data that relates to or reflects behavior or
actions. Behavioral data, often in the form of purchasing or
transactional data, is the type of data used most extensively in data
mining.
Churn: the process of customer attrition, a concern for many
industries, particularly telecommunications and financial services.
Classification: a process that identifies the group to which an
object belongs by examining characteristics of the object. In
classification, the groups are defined by an external criterion
(contrast with clustering). Commonly used techniques include decision
trees and neural networks.
Clustering: the process of grouping records based on similarity. For
example, an insurance company might use clustering to group customers
according to income, age, type of policy purchased or prior claims
history. Clustering divides a dataset so that records with similar
content are in the same group, and groups are as different as
possible from each other (contrast with classification).
Cross-Industry Standard Process for Data Mining (CRISP-DM): CRISP-DM
provides a structure for data mining projects, as well as guidance on
potential problems and their solutions. It comprises six phases:
business understanding, data understanding, data preparation,
modeling, evaluation and deployment.
Cross-selling: the practice of offering and selling additional
products or services to existing customers.
Data mining: the process of analyzing data to discover hidden
patterns and relationships that can help you manage and improve your
business.
Data warehouse: the database in which data is collected and stored
for analysis.
Decision trees: graphical, tree-like displays that clearly show
segments, patterns, and hierarchies in data.
Deployment: the distribution and use of results obtained from data
mining. Deployment ranges from reports to the use of models in
real-time environments such as call centers.
Derived attributes: new attributes that are constructed from one or
more existing attributes in the same record.
Dirty data: data that contains errors such as missing or incorrect
values. Dirty data is also referred to as noisy.
Field: also known as a variable or attribute, a field is a data space
allocated to a particular class of data or information. For example,
one data field may contain a customer’s first name; the next may
contain the customer’s last name. The columns in a spreadsheet are
equivalent to fields, while rows are equivalent to records.
Gains tables: measures of the effectiveness of a model that show the
difference between results obtained by the model and results obtained
without using the model under random normal conditions.
Lift charts: measures of model effectiveness that show the ratio
between results obtained using the model and results obtained without
using the model. The farther the lift lines from the baseline, the
more effective the model.
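As a rough illustration of the gains table and lift chart entries, the sketch below ranks records by model score and compares the share of responders captured with the share that random selection would be expected to capture. The function name, the bin count, and the example scores are hypothetical and are not taken from any IBM SPSS product. The gain column here is the cumulative share of responders, one common convention; the no-model baseline at each row is simply the fraction of records contacted.

```python
def gains_and_lift(scores, outcomes, n_bins=10):
    """Return one (fraction_contacted, gain, lift) tuple per bin.

    scores   -- model scores, higher meaning "more likely to respond"
    outcomes -- 1 for a responder, 0 otherwise (same order as scores)
    """
    if len(scores) != len(outcomes) or not scores:
        raise ValueError("scores and outcomes must be non-empty and equal in length")
    total_positives = sum(outcomes)
    if total_positives == 0:
        raise ValueError("at least one positive outcome is required")

    # Rank outcomes from the highest model score to the lowest.
    ranked = [o for _, o in sorted(zip(scores, outcomes),
                                   key=lambda pair: pair[0], reverse=True)]
    n = len(ranked)
    rows = []
    for b in range(1, n_bins + 1):
        cutoff = round(n * b / n_bins)        # records contacted so far
        if cutoff == 0:
            continue
        fraction = cutoff / n                 # baseline share without a model
        gain = sum(ranked[:cutoff]) / total_positives  # share captured with the model
        rows.append((fraction, gain, gain / fraction))  # lift = model / baseline
    return rows


# Hypothetical scores and outcomes, for illustration only.
example_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
example_outcomes = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
for fraction, gain, lift in gains_and_lift(example_scores, example_outcomes, n_bins=5):
    print(f"top {fraction:.0%}: gain {gain:.2f}, lift {lift:.2f}")
```

A lift above 1 in the first rows means the model finds responders faster than random selection would; by the last row every record has been contacted, so the lift returns to 1.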
Machine-learning techniques: a set of methods that enable a computer
to learn a specific task, such as decision making, estimation,
classification or prediction, without manual programming.
Model: a set of representative rules, behaviors, or characteristics
against which data are analyzed to find similarities. Descriptive
models are used to analyze past events. Predictive models are used to
discover what will happen in the future. With predictive models, data
miners can explore alternative scenarios to determine which actions
will produce the desired outcome.
Neural network: a model for predicting or classifying cases using a
complex mathematical scheme that simulates an abstract version of
brain cells. A neural network is trained by presenting it with a
large number of observed cases, one at a time, and allowing it to
update itself repeatedly until it learns the task.
Noise: data that contains errors such as missing or incorrect values,
or extraneous columns, is called noisy or dirty data.
Online analytical processing (OLAP): software that lets users analyze
many layers of current and historical data.
Pivot tables: interactive tables that enable users to get different
views of information by easily repositioning rows, columns and layers
of data.
Predictive analytics: a combination of advanced analytic techniques
and decision optimization. It uses historical information to make
predictions about future behavior and then delivers recommended
actions to the people and systems that can use them.
Predictive modeling: the process of creating models to predict future
activity, behavior or characteristics. For example, a predictive
model may show which customers are most likely to churn in the
future, based on the characteristics and actions of previous churners.
Query: a request sent to a database for information based on
specified characteristics or properties.
Record: a set of related data stored together. Also known as a row
(in spreadsheets) or a case (in statistics).
Regression: the process of discovering and predicting relationships
between two or more variables.
Report: the results of data analysis, distributed in a format that is
comprehensible to the recipient.
Return on investment (ROI): the value that is returned or obtained
from investments in technology, infrastructure, etc.
Return on Marketing Investment (ROMI): the value that is returned or
obtained from investments in marketing campaigns.
Rule induction: the process of automatically deriving decision-making
rules for predicting or classifying future cases from example cases.
Sequence detection: the process of discovering the order of events in
data. For example, use sequence detection to discover the order in
which customers purchase certain products. Contrast with association,
which reveals which products are purchased together.
Statistics: the mathematics of the collection, organization and
interpretation of numerical data.
Structured data: data, for example transactional data, in traditional
numerical formats. Structured data is often displayed in a tabular or
spreadsheet-like view.
Test set: a dataset independent of the training set, used to
fine-tune the estimates of the model parameters.
Text mining: the process of analyzing textual information – such as
documents, emails and call center transcripts – to extract relevant
concepts.
Training set: a dataset used to estimate or train a model.
Unstructured data: data in a text format or other non-numerical
format. Combining unstructured and structured data in your data
mining projects can help you produce more accurate, valuable results.
Up-selling: the practice of offering and selling to existing
customers products or services which are more profitable than those
they currently own or use.
Variable: any measured characteristic or attribute that differs for
different subjects.
Web mining: the process of analyzing data from online activities –
including pay-per-click advertising and other marketing campaigns –
to discover relevant patterns and important behavioral insights.
IBM has an enterprise network of distributors. To locate the office
nearest you, go to www.ibm.com/planetwide
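To make the glossary's training set and test set entries concrete, here is a minimal sketch of holding back part of a dataset so that the test records stay independent of the training records. The function name, the 30 percent test fraction, and the fixed seed are assumptions chosen for illustration; they are not settings from the booklet or from any IBM SPSS product.

```python
import random


def split_train_test(records, test_fraction=0.3, seed=0):
    """Shuffle the records and return (training_set, test_set)."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(records)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    # No record appears in both sets, so the test set stays independent.
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical customer IDs, for illustration only.
training_set, test_set = split_train_test(range(10), test_fraction=0.3, seed=1)
print(len(training_set), "training records,", len(test_set), "test records")
```

The model would be estimated on the training set only, with the test set kept aside for the checks described in its glossary entry.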
© Copyright IBM Corporation 2010
IBM Corporation
Route 100
Somers, NY 10589
US Government Users Restricted Rights - Use,
duplication or disclosure restricted by GSA ADP
Schedule Contract with IBM Corp.
Produced in the United States of America
May 2010
All Rights Reserved
IBM, the IBM logo, ibm.com, WebSphere, InfoSphere
and Cognos are trademarks or registered trademarks of
International Business Machines Corporation in the
United States, other countries, or both. If these and other
IBM trademarked terms are marked on their first
occurrence in this information with a trademark symbol
(® or TM), these symbols indicate U.S. registered or
common law trademarks owned by IBM at the time this
information was published. Such trademarks may also be
registered or common law trademarks in other countries.
A current list of IBM trademarks is available on the Web
at “Copyright and trademark information” at
www.ibm.com/legal/copytrade.shtml.
SPSS is a trademark of SPSS, Inc., an IBM Company,
registered in many jurisdictions worldwide.
Other company, product or service names may be
trademarks or service marks of others.
Please Recycle
Business Analytics software
YTM03001-USEN-00