Digging for Gold: Business Usage for Data Mining
Kim Foster, CoreTech Consulting Group, Inc., King of Prussia, PA

ABSTRACT

Current trends in data mining allow the business community to take advantage of sophisticated analytical techniques to assess future directions and manage strategic planning. Yet these tools and techniques are not being used to their full capacity by business managers to solve everyday business problems. Why? Perhaps this can be attributed to an underlying fear of the complex mathematical and statistical methods found within data mining and neural network models. The purpose of this paper is to demystify the art of data mining by outlining practical examples of usage applicable to business managers and professionals. By focusing on the competitive advantage that can be obtained with data mining, the author hopes to provide a better understanding of the practical application of this type of data analysis.

INTRODUCTION

Storage and processing capabilities of technology have increased at a tremendous rate over the course of the last twenty years. The business community has found new ways to utilize this additional computing power to improve its competitive advantage in the marketplace.

One problem with this situation is that large numbers of databases are now distributed across systems within any given organization. Over time, information about customers, suppliers, and operations has become stored in many databases within silos of the organization. Information usage has become so specialized that latent relationships now exist between thousands of different data elements. Hence the conceptualization of the data warehouse. Data warehousing, in turn, opens new possibilities in terms of business intelligence and decision support solutions. One such solution is referred to as data mining.

DATA MINING

Data mining is simply the discovery of valuable, new information from a large collection of data. Or, as defined by the Gartner Group:

Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques.

What it is not:
1. Not complex queries where suspicions about relationships within the data already exist
2. Not validation of hypotheses
3. Not statistical tests using standard techniques

What it is:
Automated discovery of new facts and relationships within data

Think of it in terms of excavation. The business data represents the rocks and the data mining technique becomes an excavation tool, sifting through the vast quantities of raw data looking for valuable nuggets of gold - information critical in making a business successful.

The major steps in this process include:

[Figure: process diagram; surviving labels: Preparation, Discovery, Analysis]

Data Preparation
Data is selected, acquired, cleansed, and preprocessed under the guidance of a knowledgeable domain expert. Who is a domain expert? Not your typical programmer or system analyst, but someone who knows the business well enough to determine the critical 20% of information on which 80% of business decisions are based.

Technology Review and Selection
Identification of the best techniques and tools to utilize needs to be made based on:
• Business requirements
• Infrastructure constraints
• Size and location of data stores
• Data preparedness
• Availability of statistical/analytical expertise
• Average accuracy of overall results (tools)
• Training requirements
• Cost

Information Discovery
Automated models and techniques are applied to prepared data, compressing and transforming it to make it easy to identify any valuable, hidden information.

Discovery Analysis
The results are evaluated to determine whether or not additional knowledge was discovered, and the relative importance of the information is assessed. This is where decisions are made using information found in the mining process and where the most business benefit can be seen.

Reapplication
The techniques are redeployed against multiple data populations for validation and classification of results.

The model below is a simplistic representation of a standard data mining technique called a decision tree. The decision tree shows that multiple decisions can be made from different relationships between variables, depending on the outcome of information from the models.

[Figure: decision tree diagram with nodes labeled Source Data, Model 1, Model 2, and Decisions 1 through 4]

SAS uses an effective method for data mining called SEMMA, which stands for Sample, Explore, Modify, Model, and Assess. This process applies statistical techniques to select and transform the data that is considered predictive. It then builds models based on the results of the analysis and checks the models for accuracy. This is a proven method, effective in the application of successful mining techniques.

There are different functions, increasing in complexity, where mining techniques are used to find latent information within variables that exist in common, very large data stores. A few functional requirements of mining techniques include:

⇒ Associations, Classifications, and Clustering
Used for risk assessment, market segmentation and targeting sales, as well as product reuse

⇒ Regression and Forecasting
Used for sales predictions, customer ranking, price and inventory models, product assessments

These functional requirements are based on the business need at hand. For example, if a retail store wanted to know which products should be marketed and advertised on sale at the same time, statistical models would be used to meet the need for association analysis.

Now, let’s look at some practical business applications for such data mining techniques.

BUSINESS APPROACHES

There are several areas within different industries where mining can be applied. Identified below are three basic business analysis needs that most organizations have:

Marketing
Mining can be used to improve customer retention rates by identifying customers ready to switch to other service or product providers. As information about customers is combined with information about products, significant opportunities arise. A company that is able to identify customer buying decisions over time will be able to use the best approach for obtaining consumer buy-in for the products and services it offers. It can also develop targeted marketing campaigns as well as identify profitable consumer markets.

Risk Analysis
Customers can be managed differently based on perceived risk. This is true for lending, insurance, health care, and even utilities. Modeling techniques can be used to classify the amount of risk associated with a customer or customer segment. This risk can also be tracked and adjusted over time. This information is valuable in providing guidelines for credit scoring stability, portfolio and product management, lending practices, and fraud assessment and detection.

Product Management
Using techniques for matching product and part requirements is critical for product design reusability. Data gathered through sales and part maintenance records can be combined to identify where the need to increase product longevity exists.

Now, let us review a couple of case studies where mining techniques were used to meet business needs. Note that the results of these techniques will be covered in detail during the presentation of this paper.

Always identify connectivity and platform issues up front, especially when data stores are at different locations (globally).
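To make the association analysis mentioned earlier for the retail example more concrete (which products to advertise on sale together), the following is a minimal Python sketch, not the method used in the case studies of this paper. It counts how often pairs of products appear in the same sales basket and reports support (the share of all baskets containing both products) and confidence (of the baskets containing the first product, the share that also contain the second). The baskets, product names, threshold, and the function name pair_rules are all hypothetical and invented for illustration; commercial mining tools use more scalable algorithms.

```python
from collections import Counter
from itertools import combinations


def pair_rules(transactions, min_support=0.5):
    """Return (a, b, support, confidence of a -> b) for frequent product pairs."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))          # ignore duplicates within a basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n                  # share of baskets holding both items
        if support >= min_support:
            # confidence: of the baskets with the first item,
            # the share that also hold the second
            rules.append((a, b, support, count / item_counts[a]))
            rules.append((b, a, support, count / item_counts[b]))
    # strongest rules first; ties broken alphabetically
    return sorted(rules, key=lambda r: (-r[3], r[0], r[1]))


# Hypothetical sales baskets, invented for illustration only.
baskets = [
    ["bread", "butter", "jam"],
    ["bread", "butter"],
    ["bread", "milk"],
    ["butter", "jam"],
]

rules = pair_rules(baskets, min_support=0.5)
for a, b, support, confidence in rules:
    print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```

In this invented data, every basket that contains jam also contains butter, so the rule "jam -> butter" ranks first. A retailer reading such output would still need a domain expert to judge whether a joint promotion makes business sense.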
STRIKING GOLD

CASE STUDY: Human Resource and Benefit Data Management
♦ Need for locating relationships within participant data - specifically trends to identify the need for reuse of benefit packages
♦ 200-300 clients with anywhere from 100 to 100,000 participants
♦ Multiple systems including human resource, benefits, pension, and health care servicing
♦ Multiple plans for each client
♦ Data preprocessed (calculations and deduplication process completed)
♦ No tools available
♦ Global data storage environment (over 1,000 tables housed world-wide)
♦ Clients consist of companies within different industries

Solution: Select and track ‘cradle to grave’ attributes; obtain measurements for identifying data size and location as well as system infrastructure issues; select tools and techniques (primarily neural network models housed on NT); apply models to data based on functional need (clustering); reapply the model to revisit data cleansing issues; final review and analysis.

Critical Success Factors: Due to several revisits to preprocessed data, no assumptions can be made regarding data cleansing. The need for re-cleansing will always arise as discoveries are located through statistical and analytical processing.

CASE STUDY: TELECOMMUNICATIONS
♦ Need for churn forecasting 6 months - 2 years prior to potential loss of customers
♦ Operational data stores house several hundred thousand records on subscribers, collected and distributed on a real-time (one hour delay) basis
♦ Multiple data marts (Oracle, Access, Sybase, etc.) where historical data is stored
♦ Data partially preprocessed (calculations identified but used inconsistently within different lines of business)
♦ SAS used for statistical analysis, Visual Basic for GUI reporting, no mining tools used to date

Solution: Functional need determined as time-series forecasting; weights applied to variables and prediction accuracy determined (5-10%); data run through models to determine historical trends (hourly, daily, weekly, monthly, etc.), then rerun to identify potential future trends. This application is then available for reuse on an ongoing basis.

Critical Success Factors: It can be difficult to determine how much historical information must be applied to the models to identify the most accurate trend information. Several reapplications may be required for analysis purposes. It is extremely important to put as much applicable attribute data as possible in the model to ensure that predictions are accurate.

DATA PRESENTATION

Once mining techniques have been applied, the results can be made available to different levels within the user community by applying additional business intelligence or decision support system solutions. Results over time, by geographical region, and by specific demographics can be turned into visual information for increased benefit to the business community. Drill down, query and reporting, and multidimensional capabilities can be applied to the discovery results to allow management to make effective decisions based on the results of data mining techniques.

Below is a graphical representation of the different decision support techniques that require increased levels of analysis functionality. Note that at the top of the pyramid is query and reporting, which can be applied by users without much effort or domain knowledge. Data mining is at the bottom. As it is the most intense process from an analytical perspective, it requires a significant amount of domain knowledge, input data, and high-level statistical modeling to do the job. The more effective the analysis, the greater the potential for locating valuable information.

[Figure: pyramid of decision support techniques, from top to bottom: Q&R, EIS, OLAP, GIS, Data Mining]

CONCLUSION

Now that technology-based storage capacities are at an all-time high, organizations have more information available to them than ever before. In fact, the quantity of information available exceeds any given organization’s ability to manage that information by an exponential amount. Traditional query and reporting tools are no longer sufficient to meet the sophisticated analysis needs of today’s businesses. The more data we have, the less we know about the relationships between different variables within this data.

Therefore, we must look beyond these tools and techniques to processes that allow us to address the increasing amount of information available. Data mining offers a solution to this problem.

With an emphasis on the discovery of valuable information from large databases, data mining provides added value to the investment in the corporate data warehouse and provides business lines with valuable nuggets of information to help make them more competitive. Most organizations do not realize that quite often they are wasting millions of dollars obtaining external information about their consumers and competition to gain market advantage, when the information is just sitting in their files waiting to be discovered.

REFERENCES

Raphaelian, G. and Strange, K. (1997), “Data Warehousing and Data Mining: Separating The Two,” Gartner Group, Inc., 1.

“Data Mining reveals the Diamonds in your database,” (1996), SAS Communications, 2Q96, 18.

USEFUL RESOURCES

Adriaans, Pieter and Zantinge, Dolf (1996), Data Mining, New York: Addison-Wesley Publishing Company, Inc.

Bigus, Joseph (1996), Data Mining with Neural Networks: Solving Business Problems from Application Development to Decision Support, New York: McGraw-Hill.

AUTHOR CONTACT

Kimberly A. Foster
Manager & Practice Leader
Enterprise Data Management
CoreTech Consulting Group, Inc.
1040 First Avenue, Suite 400
King of Prussia, PA 19406-1336
Phone: 800-877-9612 ext. 3542
Fax: 610-337-2333
email: [email protected]