IBM SPSS Data Mining Tips

A handy guide to help you save time and money as you plan and execute your data mining projects

Table of Contents
Introduction
What is data mining?
What types of data are used in data mining?
Data mining and predictive analytics
How is data mining different from OLAP?
How is data mining different from statistics?
Why use data mining?
What problems does data mining solve?
How does the data mining process work?
Data mining tips
Setting up for success
Following the phases of CRISP-DM
Business understanding
Data understanding
Data preparation
Should the data be balanced?
Modeling
Evaluation
Deployment
Selecting a data mining tool
About IBM Business Analytics
What makes us unique?
IBM SPSS products
Data mining
Statistical analysis
Survey and market research
Glossary

© Copyright IBM Corporation 2010

Introduction

Are you currently involved in a data mining project? Or are you perhaps considering undertaking a data mining project for the first time? Regardless of your level of experience, IBM SPSS Data Mining Tips will help you plan and execute your project.

If you have questions about beginning or executing your data mining projects, please call us. We offer a variety of technology training and consulting programs that can help you.

If you have any data mining suggestions or ideas, contact your local office, or visit www.ibm.com/spss and then go to SPSS Developer Central.

A list of IBM SPSS products can be found on pages 32-35 of this booklet, and you can visit us online at www.ibm.com/spss to find out more about our data mining products.

This booklet is divided into two major sections. The first defines and outlines the data mining process, while the second suggests a staged approach to the data mining process and gives a number of tips to guide you through it. Please
remember that these stages are not to be considered in isolation. A decision made at one stage may influence your work at other stages. Also, in some situations you may work on several stages simultaneously rather than sequentially.

After the tips, you’ll find a glossary of terms frequently used in data mining. These terms are boldfaced the first time they appear in the text.

As you read, you’ll see symbols that will help you better understand the information in this booklet. One symbol indicates an example illustrating a particular tip; another directs you to more information on the Web.

Keep IBM SPSS Data Mining Tips by your side and use it to save effort, complete your project in a timely manner and produce positive, measurable results.

What is data mining?

According to Gartner Inc., data mining is “the application of descriptive and predictive analytics (such as clustering, segmentation, estimation, prediction and affinity analysis) to support the marketing, sales or service functions.”

Data mining solves a common paradox: the more data you have, the more difficult and time-consuming it is to effectively analyze and draw meaning from it. What could be a gold mine often lies unexplored due to a lack of personnel, time or expertise. Data mining overcomes these difficulties because it uses a clear business orientation and powerful analytic technologies to quickly and thoroughly explore mountains of data and extract the valuable, usable information (the business insight) that you need.

What types of data are used in data mining?

Depending on the data mining problem, your project can incorporate data from a wide range of sources. In fact, data mining projects often benefit from using several different types of data, each of which gives additional insight into the area of study. Recent advances in analytics have led to two important new types of mining: text mining and web mining. While traditional data analysis has focused on numerical, structured data (as found in spreadsheets or flat-file databases), these two technologies open a new and rich vein of data, known as unstructured data, from survey research, customer communications and log files from web servers.

Using multiple data sources, including both structured and unstructured sources, can add accuracy and depth to your results. For example, survey data can add valuable information about opinions and preferences, explaining why people act and behave as they do. Such attitudinal data can reveal psychographic or motivational desires that will never be discovered by analyzing fact-based transactional data.

Data mining and predictive analytics

Data mining uncovers patterns in data using advanced descriptive and predictive techniques. Predictive analytics combines these with decision optimization to determine which actions will drive the best outcomes. These recommendations, along with supporting information, are delivered to the people and systems that can take action. Data mining is at the heart of predictive analytics.

How is data mining different from OLAP?

Pivot tables and online analytical processing (OLAP) are important tools for understanding what has happened in the past. Data mining is a process for understanding what will happen in the future. Data mining uses predictive modeling, including statistics and machine-learning techniques such as neural networks, to predict what will happen next. For example, a query or report can tell you the total sales for the last month. OLAP goes deeper, telling you about sales by product for the last month. Data mining, however, tells you who is likely to buy your products next month. And for the best results, predictive analytics insights can be incorporated into a marketing campaign to determine, for example, how to deliver personalized offers that have the best likelihood of leading to sales.

How is data mining different from statistics?

Data mining doesn’t replace statistics. Statistics is more often concerned with confirming hypotheses, while data mining can help generate new ones. Statisticians frequently make inferences about large populations from a small sample, while data miners can often process an entire universe of observations. In fact, statistics is a good complement to data mining: traditional statistical techniques, such as regression, are used alongside data mining technologies, such as neural networks. Statistics is also used to validate data mining results.

Why use data mining?

Data mining empowers you to manage and change the future of your organization by providing an understanding of the past and the present and delivering accurate predictions. For example, data mining can tell you which prospects are likely to become profitable customers and which are most likely to respond to an offer. With this view of the future, you can increase your return on marketing investment (ROMI) by making the offer only to those prospects likely to respond and become valuable customers.

With data mining, you have a reliable guide to the future of your organization, and you have the power to make the right decisions right now. Decisions based on sound business insight, not on instinct or gut reactions, can deliver consistent results that keep you ahead of the competition.

What problems does data mining solve?

You can use data mining to solve almost any business or organizational problem that involves data, including:
• Increasing revenues from customers
• Understanding customer segments and preferences
• Identifying profitable customers and acquiring new ones
• Improving cross-selling and up-selling
• Retaining customers and increasing loyalty
• Employee retention
• Increasing ROMI and reducing marketing campaign costs
• Detecting fraud, waste and abuse
• Determining credit risks
• Increasing web site profitability
• Increasing retail store traffic and optimizing layouts for increased sales
• Monitoring business performance
• Student lifecycle management

How does the data mining process work?

IBM SPSS data mining products and services ensure timely and, above all, reliable results because they support the CRoss-Industry Standard Process for Data Mining (CRISP-DM). Created by industry experts, CRISP-DM provides step-by-step guidelines, tasks and objectives for every stage of the data mining process.

There are six phases in CRISP-DM:
• Business understanding: achieve a clear understanding of your business challenges
• Data understanding: determine what data is available to mine for answers
• Data preparation: prepare the data in a format appropriate for your questions
• Modeling: design data models to meet your requirements
• Evaluation: test your results against the goals of your project
• Deployment: make the results of the project available to decision makers

To learn more about CRISP-DM, visit www.crisp-dm.org.

Data mining tips

Setting up for success

Follow CRISP-DM
Using CRISP-DM to guide your data mining project helps to ensure a successful outcome. It is critical to follow a proven methodology: complex data mining technologies and large volumes of available data can overwhelm a project that is not firmly grounded in the problem you want to solve.

Manage expectations
Make sure that your project stakeholders know that data mining is not a magic wand that miraculously solves all business problems. Rather, it is a business process implemented by powerful computer software, and, as with any business process, the stakeholders need to propose a solvable problem and work with you to find the solution.
If you plan to segment customers for your marketing department, let them know the type of information they are likely to receive as a result of your project. For example: “We’re using product information and demographic data, so we expect to provide segments based on age, income, etc., that will show the product mix favored by these customers.”

Limit the scope of your initial project
Start with realistic objectives and schedules. When you do achieve success, move on to more complex projects.

For example, rather than attempting to immediately improve customer acquisition, cross-selling, up-selling and retention in every region, focus on a smaller, more realistic goal. Pick one that is quickly achievable, easily measurable and has an important impact on your organization. Initially, you should look for “low-hanging fruit” to establish that the process successfully delivers results. Then you can become more ambitious in the scale and scope of projects that you take on.

Begin with the end in mind
To be able to show a positive return on investment (ROI) at the end of the project, you must know how you will evaluate the results before you start (e.g., which business measures you should use and how these will be calculated or derived) and, most importantly, how the results will be used (i.e., how they will be deployed throughout the organization).

For example, suppose you want to identify the 20 percent of your subscribers who (following the Pareto Principle) will account for 70 to 80 percent of those who churn. Before you start, you should know how to translate this information into an expected revenue improvement based on sound assumptions about the cost of and response to your customer retention programs.

Or suppose you want to improve your ability to detect insurance fraud. How much improvement will be sufficient to justify the exercise? How strong do the models need to be? What will determine success (e.g., how much would you save if you identified ten additional cases of fraud)?

Identify a steering committee
A data mining project is a group effort. It requires business users who understand the issues and the data, as well as people who understand analysis. In addition, those who own the data will need to provide access to it.

For example, you may need a data mining analyst, a database analyst and a marketing manager. These roles may fall into different functional areas whose goals do not align well with those of the project. So it’s important to find ways to encourage people to work together. Be aware that you may also need IT department support to provide access to the data.

Avoid the data dump
Always set up the business problem, define the project goals and get the support of the project group. If you simply begin analyzing a pile of data with no project structure, you will get lost in the data and waste time.

Don’t let the volume of data drive your project; focus on the business goal. You may not use all of your data, because some may not be relevant to the project. You may even discover that your data is not sufficient to resolve your business problem: a large volume of data is no guarantee that you have the right data.

For example, recent information usually offers more accurate predictions for customers’ behaviors than volumes of historical data. More data is not always better, and just because you have it doesn’t mean you have to, or should, use it.

Following the phases of CRISP-DM

This section includes tips excerpted from the data mining guide CRISP-DM 1.0. A detailed document, which expands on the information presented here and includes a user guide, can be downloaded from www.crisp-dm.org.

Business understanding

Know “who, what, when, where, why and how” from a business perspective
Develop a thorough understanding of the project parameters: the current business situation; the primary business objective of the project; the criteria for success; and who will determine the success of the project.

Create a deployment strategy
Think about how you want to use the results of the data mining project. For example:
• Will the results be used by specialists who don’t need to have the results interpreted?
• Will the results be used by a wide range of employees who need differing levels of interpretation?
• Will the results be deployed via a particular medium (online, paper, etc.) that requires a certain format?

Develop a maintenance strategy
How will you manage the data once the initial project is completed? If the project is part of an ongoing strategy, will you:
• Analyze new data periodically?
• Analyze new data in real time?

Assess the situation and inventory resources
Be sure to go over every aspect of the project in advance to ensure you have what you need for success:
• Personnel: project sponsor, business and technical experts
• Data sources: access to warehouse or operational data
• Computing resources: hardware, platforms
• Software: data mining and other relevant software
• Partner organizations

Under what constraints will the project operate?
Check and develop solutions for the following:
• General constraints: legal issues, budget, timing, resources
• Access rights to data sources: restrictions, necessary passwords
• Technical accessibility of data: operating systems, data management system, file or database format
• Accessibility of relevant knowledge

What are the project requirements?
List all of the requirements of the project:
• Schedule for completion
• Comprehensibility and quality of results
• Security
• Legal restrictions on data access

What assumptions are being made about the project?
List and clarify all of the assumptions you have made about:
• Data quality: accuracy, availability
• External factors: economic issues, competition, technical advances
• Internal factors: the business problem
• Models: is it necessary to understand, describe, or explain the models to senior management?

Does everyone speak the same language?
Make sure that everyone involved understands the terms and concepts that will be used throughout the project.

Facilitate interdepartmental understanding by creating a glossary of the business and technical terms that are specific or relevant to the project.

Translate business objectives into data mining tasks
Determine which data mining tasks you must complete to achieve your business objective. Define the data mining tasks using technical terms. For example, the business goal “Increase catalog sales to existing customers” might translate into the data mining goal “Predict how many widgets customers will buy, given their purchases over the previous three years, relevant demographic information and the price of the item.”

Determine data mining success criteria
Using technical terms, describe which criteria must be met if the project is to be considered a success. For example, a successful model would be one that generated a specific level of predictive accuracy, or a propensity-to-purchase profile should produce a specific degree of lift.

Produce a project plan
Create a plan that outlines the steps you will take to achieve your data mining goals and meet your business objective. Assess which tools and techniques are available to enable you to complete your project.

Try some exploratory data mining
Help data warehouse builders to set priorities by analyzing small amounts of data from multiple sources, and communicating any discoveries.

Does your data cover relevant attributes?
Ensure success by choosing data that best represents the behavior or situation you want to analyze. Do some preliminary brainstorming to generate a list of the relationships you might expect, then assess whether you have access to the right data to uncover the assumed patterns.
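The churn example under “Begin with the end in mind” asks you to translate a model target into an expected revenue improvement before the project starts. The sketch below shows one way that arithmetic might look. Every name and number in it (subscriber count, churn rate, save rate, customer value, offer cost) is a hypothetical assumption chosen for illustration, not a figure from this booklet; substitute estimates from your own retention programs.

```python
def expected_retention_gain(
    subscribers: int,
    churn_rate: float,
    targeted_share: float,
    churners_captured: float,
    save_rate: float,
    customer_value: float,
    offer_cost: float,
) -> dict:
    """Estimate the net benefit of sending a retention offer to the riskiest subscribers.

    All inputs are planning assumptions, not measured results.
    """
    expected_churners = subscribers * churn_rate
    targeted = subscribers * targeted_share
    # Churners that fall inside the targeted group (e.g. 75% of churners in the top 20%).
    churners_reached = expected_churners * churners_captured
    # Churners who stay because of the offer.
    customers_saved = churners_reached * save_rate
    revenue_retained = customers_saved * customer_value
    # Everyone in the targeted group receives the offer, churner or not.
    campaign_cost = targeted * offer_cost
    return {
        "targeted": targeted,
        "churners_reached": churners_reached,
        "customers_saved": customers_saved,
        "revenue_retained": revenue_retained,
        "campaign_cost": campaign_cost,
        "net_gain": revenue_retained - campaign_cost,
    }


# Hypothetical planning scenario: target the top 20% of subscribers by churn risk,
# assuming the model places 75% of all churners in that group.
result = expected_retention_gain(
    subscribers=100_000,
    churn_rate=0.10,
    targeted_share=0.20,
    churners_captured=0.75,
    save_rate=0.30,
    customer_value=400.0,
    offer_cost=10.0,
)
for name, value in result.items():
    print(f"{name}: {value:,.0f}")
```

Running the same function with a pessimistic and an optimistic set of assumptions gives you a range to agree on with stakeholders before any modeling begins.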
Data understanding

Make sure the data is available
Gather all of the data you will need for your project. If your data will come from more than one source, make sure your data mining tools can integrate the data.

Survey data can add critical attitudinal insights to your models. A combination of behavioral and attitudinal data is best for comprehensive insight.

Up to 80 percent of your data may be hidden in text documents. Use a text mining tool to search these sources efficiently for valuable information.

Data collected from online activity can improve the quality and accuracy of your models. Use a web mining tool to add a deeper level of insight to your data mining project.

Describe existing data
Get a clear picture of your data by creating a report that describes data formats, the number of records and fields, field identities and other relevant features.

Check data quality
To prevent future problems, assess the quality of your data and make a plan for addressing any problems that are detected:
• Do the attribute names and the values they contain relate to one another?
• Are any attributes missing? Are there any blank fields?
• Check for multiple spellings of values to eliminate repetition
• Look for data that deviate from the norm and determine the causes

Review any attributes that show patterns that conflict with common sense (e.g., pregnant males).

Exclude any irrelevant data. (For example, if you’re checking on home loan behavior, eliminate customers who have never owned a home.)

Generate a data quality report
Check for duplicate data, potential data errors (e.g., customers are shown to have churned before they even became customers) and database fields that may contain invalid information. Some fields may be irrelevant to your goals and don’t need to be cleaned. Track actions taken or not taken for those fields, and document your decisions because you may decide to use them later in the process.

Data preparation

Select your data
Decide what data to use for analysis and be clear about the reasons for your decisions. This involves:
• Performing significance and correlation tests to determine which fields to include
• Selecting data subsets
• Using sampling techniques to review small chunks of data for appropriateness
• Performing data reduction techniques (e.g., factor analysis) where appropriate

For richer and more accurate models, be sure to include non-traditional types of data, such as survey data, key concepts from customer communications, and data about online activity. Combining multiple types of data gives you a more complete picture of your customers and your organization.

Address data quality problems
To ensure reliable results, take the time to fix any data quality problems before you begin the analysis. Data quality activities may include:
• Determining how to deal with dirty data
• Addressing special values and their meaning: for example, a special value can be a default value used when a survey question is not answered or when data is shortened for space considerations (“2004” becomes “04”)
• Being careful about changes in data formats (e.g., Zip codes treated as numeric values will lose leading zeroes when formatted)

Choose a flexible data construction tool
Make sure the data mining tools you choose are capable of manipulating the data according to project needs. Your tools should also allow you to add new fields as needed. Remember that data mining is a discovery-driven process: it’s impossible to know in advance where the data will take you.

Determine whether to create newly derived attributes
You may wish to create derived attributes for the following reasons:
• Due to your experience with the situation at hand, you know that a particular attribute is important to the data even though it doesn’t currently exist
• The modeling algorithm only handles certain data types; therefore, important information won’t be included unless it is recreated
• Modeling results reveal that relevant facts are not represented

Preliminary statistical analyses may indicate how best to combine variables into ratios or new groupings. Before you add derived attributes, determine whether and how they will help the modeling process.

Consolidate information by merging data
When you join new tables to consolidate information, you may also want to generate new fields and aggregate values.

Make sure that your data mining tools can accommodate different types of data (such as survey, text, and Web data) from multiple sources without costly, time-consuming customization.

Do your data mining tools require data to be in a specific order?
If your data mining tools require that your records be in a particular order, you may need to sort your dataset at this stage.

Should the data be balanced?
Determine whether your modeling technique requires balanced data.

For example, direct mail campaigns often return information skewed toward “no response”; that is, most observations are from non-responders. To predict positive responses accurately, however, some techniques may require you to have roughly equal numbers of positive and negative responses.

Modeling

Selecting modeling techniques
To match your data to the right modeling technique, check which assumptions each technique makes about data format and quality. In some cases, only one technique may be appropriate for your situation. Be sure to consider:
• Which techniques are appropriate for your problem
• Whether there are any “political” requirements (management expectations, understandability)
• Whether there are any constraints (unusual data characteristics, staff expertise, timing issues)
• Which techniques conform best to your deployment strategy

To ensure that you have the right technique for each model and situation, choose data mining tools that offer a wide range of techniques and modeling options. Better still are tools that allow multiple techniques to be selected and assessed simultaneously based on data types.

Test before you build
Before you create your final model, test the quality and validity of the techniques you plan to use. Create a test design that incorporates a training set, a test set and a validation set. Then build the model on the training set and assess its effectiveness with the test dataset.

Build your model
To create a model, run your modeling tool on the dataset you have prepared. Describe the result and assess its expected accuracy, effectiveness and potential shortcomings.

Create a detailed model report that lists the rules produced, the parameter settings used, the model’s behavior and interpretation, and any conclusions about patterns revealed in the data.

Use only attributes that will be available to the model and in the right state at the time of deployment.

For example, if you want to create a model that predicts the risk of losing customers within the next three months, build the model using data about customers who defected during the previous three months. Applied to current data, the model will then predict which customers may leave in the near future, allowing you time to take action to prevent them doing so.

Use lift and gains tables to show a model’s predictive ability.

Try several models to get the right fit
To improve model performance, try adding or removing fields or experimenting with different options. Remember the law of parsimony (Occam’s Razor): simple models may be better. Balance the strength or power of a model with its complexity: simple models often are easier to explain, easier to maintain, and may be less prone to degradation over time. Also, since each technique works slightly differently, try a variety of approaches (such as clustering and association) to find all of the relevant patterns.
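The lift and gains tables mentioned above can be built by hand once each record has a model score and a known outcome. The following is a minimal sketch using only the Python standard library; the function name `gains_table`, the bin count and the sample scores are illustrative assumptions and do not reproduce the output of any particular IBM SPSS product.

```python
from typing import Dict, List, Tuple


def gains_table(scored: List[Tuple[float, int]], bins: int = 5) -> List[Dict[str, float]]:
    """Build a cumulative gains and lift table.

    `scored` holds (model_score, actual) pairs, where actual is 1 for a
    responder and 0 otherwise. Records are ranked by score, highest first.
    """
    total = len(scored)
    if total < bins:
        raise ValueError("need at least one record per bin")
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    total_responders = sum(actual for _, actual in ranked)
    if total_responders == 0:
        raise ValueError("no responders in the data")

    rows = []
    for b in range(1, bins + 1):
        cutoff = round(total * b / bins)  # records contacted so far
        captured = sum(actual for _, actual in ranked[:cutoff])
        share_contacted = cutoff / total
        gain = captured / total_responders  # share of all responders reached
        rows.append({
            "share_contacted": share_contacted,
            "gain": gain,
            "lift": gain / share_contacted,  # 1.0 means no better than random
        })
    return rows


# Hypothetical scored test set: ten prospects, four of whom responded.
sample = [
    (0.95, 1), (0.90, 1), (0.80, 1), (0.70, 0), (0.65, 1),
    (0.50, 0), (0.40, 0), (0.30, 0), (0.20, 0), (0.10, 0),
]
table = gains_table(sample, bins=5)
for row in table:
    print(f"{row['share_contacted']:.0%} contacted: "
          f"gain {row['gain']:.0%}, lift {row['lift']:.2f}")
```

In this made-up sample, contacting the top-scored 20 percent of prospects reaches half of all responders, a lift of 2.5 over random selection. Compute the table on the test set rather than the training set so that it reflects performance on unseen data.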
Statistical models are good for:
Initial analysis: statistical analysis is useful in the early stages of a data mining project to gain an overview of the structure of the data. Developing a concise description of the characteristics of the data can help the group’s members to develop hypotheses and plan further analysis.

Propensity models are good for:
Predicting customer behavior: discovering who is most likely to purchase, most likely to churn, most likely to default on loans, and much more. Use this information to determine which customers and prospects offer the best long-term profitability.

Clustering is good for:
Finding natural groupings of cases that have the same characteristics, e.g., detecting fraud by using clustering to group similar cases of unusual credit card transactions.

Association rules are good for:
Basket analysis: discovering which items are most likely to be purchased together. Use this information to improve cross-selling through catalog and store layout, recommendation engines, phone and direct mail offers, and more.

Using induction to produce a rule
Rules are essentially parameters within which the data must fall in order to be considered. They are usually in an “if/then” format. Induction enables you to automatically choose which rules are most effective for obtaining specific results. For example, this is how rule induction can be used to create a set of rules for qualifying loan prospects:
• If employed for more than two years, then credit risk is good
• If older than 30, then credit risk is good
• If declared bankrupt at any time, then credit risk is bad

Test after you build
Make sure your model delivers results that will help you achieve your data mining goal.

Evaluation

Evaluate your data mining results
Determine whether and how well the results delivered by a given model will help you achieve your organization’s goals. Is there any systematic reason why the model is deficient?

If time and resources are available, try testing the model or models in a limited real-world environment (e.g., at a single store or call center, or for a single product line) to see if it performs as expected.

Review the data mining process for any missing steps or overlooked tasks
When you have confirmed the quality and effectiveness of your results, review your work to determine whether you have missed any important steps or information.
• Was each stage of the data mining process necessary in retrospect?
• Was each stage executed as well as possible?

Determine next steps
Now is the time to determine whether the project is successful enough to move ahead to deployment. If not, take further steps to achieve satisfactory results. Keep in mind:
• The deployment potential of each result
• How the process could be improved
• Whether the resources exist for additional steps or repetitions of previous steps

Deployment

Create a deployment plan
Take the project results and synchronize them with your original goals and objectives to address your organizational issue most effectively:
• Summarize deployable models or software results
• Develop and evaluate alternative deployment plans
• Confirm how the results will be distributed to recipients
• Reconfirm how you will monitor the use of the results and measure the benefits
• Identify possible problems and pitfalls during deployment

Monitor and maintain your plan
Ensure the best use of your data mining results by creating a maintenance plan that addresses:
• What could change in the future that would affect the use of the results
• How to monitor accurate use of the results
• When, if necessary, to discontinue deployment or use of the results
• Criteria for renewing and refreshing models

Create a final report
Depending on your deployment plan, the report may be either a project summary or a final presentation of the data mining results.
To create your final report:
• Identify which reports are needed (slides, management summary, etc.)
• Identify report recipients
• Outline the structure and content of the report
• Select which discoveries to include

Execute your deployment plan
Put your data mining results to optimal use by distributing them according to the deployment plan. Even the most brilliant discovery will not generate ROI if it isn’t used to improve your business. Shelf reports have very little current or future value.

Review the project
This is your opportunity to assess what went right, what didn’t, what the major accomplishments were and what improvements may be necessary. For a complete review, try the following:
• Interview all significant project members about their experiences
• Interview the end users of your data mining results about their experiences
• Document and analyze the specific data mining steps that you took
• Analyze how well the data mining goals were met
• Create recommendations for future projects

Selecting a data mining tool

The tips in this section are excerpted from the CRISP-DM document, Performing a data mining tool evaluation.

Look for tools with a proven record of solving the organizational problems that your project addresses
Choose tools that have been shown to be useful for solving problems within your industry and that have a successful track record in the business areas that you need to address.

Select tools that bridge business understanding and the technical aspects of data mining
Make sure that the steps used by the tools match the business needs of data mining. Ask: Do the tools present data mining concepts clearly?

Make sure your tools work with your existing data sources and formats
You will save time and money, and maximize your chances for reliable results, by choosing tools that can pull in and combine data from multiple sources and formats. This is particularly important if discoveries later in the data mining process lead you to add data from a new source.

Data mining tools that enable you to combine behavioral and attitudinal data, in the form of both structured and unstructured data, will deliver more accurate results and provide greater flexibility in terms of the types of data mining projects you’re able to undertake.

Choose tools with efficient, comprehensible data preparation steps
Save time and resources by choosing data mining tools that prepare data efficiently (from initial stages through to model building) and that present data preparation steps in an easy-to-understand way. This enables project members with varying levels of expertise to obtain effective results.

Make sure that your tools can automatically extract data
Avoid writing time-consuming manual queries by choosing tools that can extract data automatically for the various data preparation steps.

Can the tools use the data and equipment you already have?
Choose data mining tools that can use your data where it exists today, regardless of whether it is in databases or files, and that are compatible with your existing analysis and visualization tools. You don’t want to waste time and resources building another database because you are unable to analyze the data you already have.

Can the tools build effective models in a reasonable time?
Look for tools that enable analysts to find the most effective models quickly. The tool should support efficient building and testing of multiple models and, ideally, also support automation to reduce the time needed to carry out some of the more mundane aspects of data mining.

Choose tools with a wide range of techniques
To ensure the best results, make sure your tools offer a wide range of techniques or algorithms for visualization, classification, clustering, association, and regression. For example, you might discover that one technique works better than another for specific types of data. Flexibility will enable you to try a number of techniques to get accurate, effective results. The tools should also be able to combine techniques in situations where that approach would produce the best results.

Choose tools that deliver consistent, high-quality results
Get accurate results from your data with adaptable tools that perform well in a variety of situations, rather than one designed for a specific type of data or situation. Your tools should be able to manage any data that you may need to address your problem effectively.

Look for interactive exploration and visualization capabilities
Make it easy to explore and understand the data by choosing tools that provide interactive visualization techniques. These allow you to gain insights quickly by making changes within graphs and creating new graphs based on different dimensions of the data.

What are the tools’ deployment capabilities?
It is critical to choose tools capable of integrating your results into operational applications now and in the future. Also consider:
• Whether integration will be cost effective or whether it will require additional time and money
• How easily the tools can update data mining results and what additional investments, if any, are required

Assess the potential costs of ownership associated with the tools
Analyze the potential ROI for each tool:
• What will be the cost of ownership over the product’s lifetime, including any additional software or services required by the tool?
• When can you expect a positive ROI?
• How long will it take to implement your data mining tool? Is it designed for technical experts or can it accommodate users of varying expertise? What training costs are involved now and in the future?
• Is the tool customizable for your particular users and business needs? Can you save common processes and automate tasks?

About IBM Business Analytics

IBM Business Analytics software delivers complete, consistent and accurate information that decision-makers trust to improve business performance. A comprehensive portfolio of business intelligence, predictive analytics, financial performance and strategy management, and analytic applications provides clear, immediate and actionable insights into current performance and the ability to predict future outcomes. Combined with rich industry solutions, proven practices and professional services, organizations of every size can drive the highest productivity, confidently automate decisions and deliver better results.

As part of this portfolio, IBM SPSS Predictive Analytics software helps organizations predict future events and proactively act upon that insight to drive better business outcomes. Commercial, government and academic customers worldwide rely on IBM SPSS technology as a competitive advantage in attracting, retaining and growing customers, while reducing fraud and mitigating risk. By incorporating IBM SPSS software into their daily operations, organizations become predictive enterprises, able to direct and automate decisions to meet business goals and achieve measurable competitive advantage. For further information or to reach a representative, visit www.ibm.com/spss.

What makes us unique?

For 40 years, we have been the clear leader in analytics technology. Here are some of the reasons that customers have selected IBM SPSS software to drive their decision making:
• A complete, 360° view: Our software enables you to develop in-depth understanding by using all of your information, both traditional structured data and unstructured data, for a 360° view of your customers or constituents
• Easy integration with operational systems: IBM SPSS predictive analytics technologies and products are designed to work well, both independently and with other technologies or systems
• Open, standards-based architecture – IBM SPSS software follows industry standards such as OLE DB for data access, XMLA for data/format sharing, PMML for predictive model sharing, SSL for Internet security management, and LDAP/Active Directory Services for authentication and authorization, to name a few
• Faster return on your software investment – according to a recent study by Nucleus Research, an independent analyst firm, 94 percent of IBM SPSS customers achieve a positive return on investment within an average payback period of just 10.7 months
• A lower total cost of ownership – IBM SPSS products are designed to work with your existing technology infrastructure and staff resources. We keep both your short- and long-term costs of ownership low by providing open technology and flexible licensing options

IBM SPSS products

With IBM SPSS products, you can build a flexible analytics system that enables you to both meet your needs today and achieve tomorrow’s goals.

Data mining
IBM® SPSS® Modeler Professional – This product’s interactive data mining process incorporates your valuable expertise at every step to create powerful predictive models that address your specific organizational issues.
• IBM® SPSS® Modeler Premium – Extract key concepts, sentiments, and relationships from unstructured data, and convert them to structured format for predictive modeling with IBM SPSS Modeler
• IBM® SPSS® Collaboration and Deployment Services – Centralize and organize models and modeling processes, automate production and deployment of results

Statistical analysis
IBM® SPSS® Statistics – IBM SPSS Statistics is a tightly integrated, modular, full-featured product line supporting the entire analytical process – from planning to data collection through data access and management, analysis and reporting to deployment – and a critical complement to the data mining process. Add the products below to increase your analysis capabilities:
• IBM® SPSS® Advanced Statistics – Improve the accuracy of your analyses and provide more dependable conclusions with procedures designed to fit the inherent characteristics of your data
• IBM® SPSS® Custom Tables – Summarize and communicate results in a presentation-ready tabular format, using a highly intuitive drag-and-drop interface
• IBM® SPSS® Regression – Apply more sophisticated models for greater accuracy in market research, medical research, financial risk assessment, and many other areas
• IBM® SPSS® Text Analytics for Surveys – Categorize text responses to open-ended survey questions so you can integrate them with your quantitative survey data. IBM SPSS Text Analytics for Surveys extracts key concepts from text for further analysis in IBM SPSS Statistics or Microsoft Excel.

The IBM SPSS Statistics family of products includes a full range of modules and stand-alone products. For a complete list, go to www.ibm.com/statistics

Survey and market research
IBM® SPSS® Data Collection – Conduct both large-scale, multi-mode research projects and smaller, one-of-a-kind surveys with this open, scalable and customizable survey research platform. It includes products for every step of the survey process, from creating survey scripts to collecting and analyzing data and reporting the results.

Training and Services
IBM SPSS Training – We offer a full suite of data mining courses, as well as product-specific training. Most courses are available at an IBM SPSS facility or at your company site.
IBM SPSS Worldwide Services – Let our experienced consultants help you determine which problems to address and how best to solve them.

IBM SPSS products are available for Microsoft® Windows, Apple® Mac®, Linux®, UNIX and other platforms.

Glossary

Association: the process of discovering which events occur together or are related. For example, use association techniques to determine which products are often purchased together. Contrast with sequence detection, which can be used to discover the order in which the products were purchased.

Attitudinal data: data that relates to or is expressive of personal attitudes or opinions. Attitudinal data is often gathered through survey research, such as responses to open-ended survey questions, and through analysis of textual communications such as customer emails.

Attribute: a property or characteristic of an entity; also known as a variable or field.

Balanced data: data in which each of the two or more categories to be analyzed contains an equal amount of data, which simplifies the modeling process.

Behavioral data: data that relates to or reflects behavior or actions. Behavioral data, often in the form of purchasing or transactional data, is the type of data used most extensively in data mining.

Churn: customer attrition, a concern for many industries, particularly telecommunications and financial services.

Classification: a process that identifies the group to which an object belongs by examining characteristics of the object. In classification, the groups are defined by an external criterion (contrast with clustering). Commonly used techniques include decision trees and neural networks.

Clustering: the process of grouping records based on similarity. For example, an insurance company might use clustering to group customers according to income, age, type of policy purchased or prior claims history. Clustering divides a dataset so that records with similar content are in the same group, and groups are as different as possible from each other (contrast with classification).

Cross-Industry Standard Process for Data Mining (CRISP-DM): CRISP-DM provides a structure for data mining projects, as well as guidance on potential problems and their solutions. It comprises six phases: business understanding, data understanding, data preparation, modeling, evaluation and deployment.

Cross-selling: the practice of offering and selling additional products or services to existing customers.

Data mining: the process of analyzing data to discover hidden patterns and relationships that can help you manage and improve your business.

Data warehouse: the database in which data is collected and stored for analysis.

Decision trees: graphical, tree-like displays that clearly show segments, patterns, and hierarchies in data.

Deployment: the distribution and use of results obtained from data mining. Deployment ranges from reports to the use of models in real-time environments such as call centers.

Derived attributes: new attributes that are constructed from one or more existing attributes in the same record.

Dirty data: data that contains errors such as missing or incorrect values. Dirty data is also referred to as noisy.

Field: also known as a variable or attribute, a field is a data space allocated to a particular class of data or information. For example, one data field may contain a customer’s first name; the next may contain the customer’s last name. The columns in a spreadsheet are equivalent to fields, while rows are equivalent to records.

Gains tables: measures of model effectiveness that show the difference between results obtained using the model and results obtained without using the model under random normal conditions.

Lift charts: measures of model effectiveness that show the ratio between results obtained using the model and results obtained without using the model. The farther the lift lines are from the baseline, the more effective the model.

Machine-learning techniques: a set of methods that enable a computer to learn a specific task – such as decision making, estimation, classification or prediction – without manual programming.

Model: a set of representative rules, behaviors, or characteristics against which data are analyzed to find similarities. Descriptive models are used to analyze past events. Predictive models are used to discover what will happen in the future. With predictive models, data miners can explore alternative scenarios to determine which actions will produce the desired outcome.

Neural network: a model for predicting or classifying cases using a complex mathematical scheme that simulates an abstract version of brain cells. A neural network is trained by presenting it with a large number of observed cases, one at a time, and allowing it to update itself repeatedly until it learns the task.

Noise: errors in data, such as missing or incorrect values or extraneous columns. Data that contains such errors is called noisy or dirty data.

Online analytical processing (OLAP): software that lets users analyze many layers of current and historical data.

Pivot tables: interactive tables that enable users to get different views of information by easily repositioning rows, columns and layers of data.

Predictive analytics: a combination of advanced analytic techniques and decision optimization. It uses historical information to make predictions about future behavior and then delivers recommended actions to the people and systems that can use them.

Predictive modeling: the process of creating models to predict future activity, behavior or characteristics. For example, a predictive model may show which customers are most likely to churn in the future, based on the characteristics and actions of previous churners.

Query: a request sent to a database for information based on specified characteristics or properties.

Record: a set of related data stored together. Also known as a row (in spreadsheets) or a case (in statistics).

Regression: the process of discovering and predicting relationships between two or more variables.

Report: the results of data analysis, distributed in a format that is comprehensible to the recipient.

Return on investment (ROI): the value that is returned or obtained from investments in technology, infrastructure, etc.

Return on Marketing Investment (ROMI): the value that is returned or obtained from investments in marketing campaigns.

Rule induction: the process of automatically deriving decision-making rules for predicting or classifying future cases from example cases.

Sequence detection: the process of discovering the order of events in data. For example, use sequence detection to discover the order in which customers purchase certain products. Contrast with association, which reveals which products are purchased together.

Statistics: the mathematics of the collection, organization and interpretation of numerical data.

Structured data: data, for example transactional data, in traditional numerical formats. Structured data is often displayed in a tabular or spreadsheet-like view.

Test set: a dataset independent of the training set, used to fine-tune the estimates of the model parameters.

Text mining: the process of analyzing textual information – such as documents, emails and call center transcripts – to extract relevant concepts.

Training set: a dataset used to estimate or train a model.

Unstructured data: data in a text format or other non-numerical format. Combining unstructured and structured data in your data mining projects can help you produce more accurate, valuable results.

Up-selling: the practice of offering and selling to existing customers products or services that are more profitable than those they currently own or use.

Variable: any measured characteristic or attribute that differs for different subjects.

Web mining: the process of analyzing data from online activities – including pay-per-click advertising and other marketing campaigns – to discover relevant patterns and important behavioral insights.

IBM has an enterprise network of distributors. To locate the office nearest you, go to www.ibm.com/planetwide

© Copyright IBM Corporation 2010

IBM Corporation
Route 100
Somers, NY 10589

US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

Produced in the United States of America
May 2010
All Rights Reserved

IBM, the IBM logo, ibm.com, WebSphere, InfoSphere and Cognos are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml.

SPSS is a trademark of SPSS, Inc., an IBM Company, registered in many jurisdictions worldwide.

Other company, product or service names may be trademarks or service marks of others.

Please Recycle

Business Analytics software
YTM03001-USEN-00
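Worked illustration (not part of the original booklet): the glossary describes lift as the ratio, and gains as the difference, between results obtained with a model and results obtained without it. The Python sketch below shows one way those two quantities might be computed from scored records. The function name lift_table, the dictionary keys and the toy data are all hypothetical, and treating "results without the model" as the number of responders expected from random selection is an assumption, not a definition taken from the booklet or from any IBM SPSS product.

```python
from typing import Dict, List, Sequence


def lift_table(scores: Sequence[float], outcomes: Sequence[int],
               n_bins: int = 4) -> List[Dict[str, float]]:
    """Cumulative gain and lift for records ranked by model score.

    outcomes holds 1 for a responder and 0 for a non-responder.
    Assumption: the "no model" baseline is random selection, so the
    expected responders in any slice equal the overall response rate
    times the slice size.
    """
    if len(scores) != len(outcomes) or not scores:
        raise ValueError("scores and outcomes must be non-empty and equal length")

    # Rank records from the highest to the lowest model score.
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    total = len(ranked)
    total_hits = sum(outcome for _, outcome in ranked)
    if total_hits == 0:
        raise ValueError("at least one responder is needed to compute lift")

    baseline_rate = total_hits / total  # response rate without the model

    rows = []
    for b in range(1, n_bins + 1):
        cutoff = round(total * b / n_bins)          # size of the top slice
        hits = sum(outcome for _, outcome in ranked[:cutoff])
        expected = baseline_rate * cutoff           # responders at random
        rows.append({
            "top_fraction": b / n_bins,
            "hits_with_model": hits,
            "hits_at_random": expected,
            "gain": hits - expected,                # difference (gains view)
            "lift": hits / expected,                # ratio (lift view)
        })
    return rows


if __name__ == "__main__":
    # Toy data: eight customers, four of whom responded.
    demo_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
    demo_outcomes = [1, 1, 0, 1, 0, 0, 1, 0]
    for row in lift_table(demo_scores, demo_outcomes):
        print(row)
```

On this toy data the top quarter of ranked customers contains two responders where random selection would be expected to find one, so the lift there is 2.0. By the full list the lift falls back to 1.0, the baseline.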
