CHAPTER 12 DATA MINING: KNOWING THE UNKNOWN

TEST YOUR UNDERSTANDING

1. Why is DM a process and not an end in itself? Explain.

Although DM can produce knowledge and discover new patterns, it is incapable of extracting meaning. Human intervention is still needed.

2. Describe the differences and similarities among DM, machine learning, and business intelligence. How are they related?

Business intelligence (BI) is a global term for all processes, techniques, and tools that support business decision making based on information technology. The approaches range from a simple spreadsheet to an advanced decision support system. Data mining is a component of BI. Its objective is to optimize the use of available data and reduce the risk of making wrong decisions. It is a business process concerned with finding understandable knowledge in very large real-world databases.

Statistics and machine learning are considered the analytical foundations upon which DM was developed. Machine learning (ML) focuses on making computers learn things for themselves. It automates the learning process, which is a crucial function in any intelligent system. Its methodology includes learning from examples, reinforcement learning, and supervised or unsupervised learning. ML is a scientific discipline considered to be a subfield of artificial intelligence.

3. “DM can be thought of as a form of advanced statistical techniques.” Do you agree with this statement? Why or why not?

DM is not simply a form of advanced statistical techniques. Although DM uses statistical techniques to discover hidden facts contained in databases and to find patterns and subtle relationships, its overall function is broader and more sophisticated, because it has to infer rules that allow the prediction of future results. Statistical techniques are therefore one of many tools that DM uses in performing its tasks.

4. “DM is a tool to develop intelligent systems.” Define intelligence, explain how systems could have intelligent behavior, and discuss this statement.

According to the Oxford dictionary, intelligence is the power to learn, understand, and know. This definition applies to humans. As the processing power of computers evolved, many scientists started to claim that computers could do anything human beings could do, sometimes better or faster. Turing defined intelligent behavior of a system as the ability to imitate humans perfectly. No machine is able to pass this test. However, machines can now perform some intelligent tasks that help humans solve their problems. DM, for example, can extract hidden patterns from large sets of data, a task that humans cannot achieve because of their poor computational efficiency. Even so, the knowledge that DM captures or discovers would remain useless without the direct intervention of humans to understand its meaning and take action.

5. Describe the differences between OLAP and DM. When would you use each tool?

OLAP: Online analytical processing tools give the user the capability to perform multidimensional analysis of the data. This approach uses computing power and graphical interfaces to manipulate data easily and quickly at the convenience of the user. The focus is on showing data along several dimensions. The manager should be able to drill down into the ultimate detail of a transaction and zoom up for a general view.

DM: Using a combination of machine learning, statistical analysis, modeling techniques, and database technology, data mining finds patterns and subtle relationships in data and infers rules that allow the prediction of future results. Typical applications include market segmentation, customer profiling, fraud detection, evaluation of retail promotions, and credit risk analysis.

6. What are the limitations of OLAP? How is DM able to overcome them?

OLAP has two limitations:
• It does not find patterns automatically.
• It does not have powerful analytical techniques.
DM overcomes these limitations by using a combination of machine learning, statistical analysis, modeling techniques, and database technology.

7. What is the role of DM in e-business?

DM applications for CRM are integrated with e-sales functions in order to create the customer-centric firm. DM applications are the first line in understanding the customer and an integral key to segmenting the market.

8. Describe, with examples, when you would use predictive DM and when you would use descriptive DM.

The goal of a descriptive DM task is to understand, explain, or discover relationships among data sets. It looks for similarity and dissimilarity in data. In contrast, a predictive task is concerned with future behavior and is time driven. Predicting company bankruptcy or customer response to a marketing campaign are examples of predictive DM.

9. Explain how DM is used in the health sector and in the telecommunications industry.

In the health-care business: Keeping pace with the rate of technological and medical advancement is a significant challenge. Cost is a constant issue in this ever-changing market. Early DM activities have focused on financially oriented applications. Predictive models have been applied to predict length of stay, total charges, and even mortality.

In the telecommunications industry: Keeping pace with the rate of technological change is a significant challenge to businesses throughout the industry. In addition, deregulation is changing the business landscape, resulting in competition from a wide range of service providers. Finding and retaining customers is important to telecommunications providers. In addition to customer profiling, applications for subscription fraud and credit applications are used throughout the industry. Concerns about privacy and security are likely to result in DM applications targeted to these areas.

10. Explain how companies are using DM to understand their customers’ behavior and predict their intentions.

Data mining (technologies and techniques for recognizing and tracking patterns within data) helps businesses sift through layers of seemingly unrelated data for meaningful relationships, so that they can anticipate, rather than simply react to, customer needs.

11. Describe the major pitfalls faced by companies when implementing DM solutions.

Data-mining project managers stumble across problems such as:
• Insufficient understanding of business needs
• Careless handling of data. Data mishandling errors include the following:
  • Over-quantifying data
  • Miscoding data
  • Analyzing without taking precautions against sampling errors
  • Loss of precision due to improper rounding of data values
  • Incorrectly handling missing values
• Improperly validating the data-mining model

KNOWLEDGE EXERCISES

1. Discuss what types of industries can best benefit from DM. Which ones cannot? Hint: Think of the ones having the most transactions and accessible data.

The financial services, health-care, and telecommunications industries are among those that can benefit most from DM, because they have many complicated transactions and their data is accessible through the Internet, data warehouses, or financial reports. Agriculture is one business that needs DM, but because of the lack of information and fluctuating data it is not benefiting from DM applications. Industries with similar products (e.g., ice cream, beverages) do not require DM because their transactions are limited and simple.

2. Statistical and DM applications both produce different results for management, even though they might use the same historical data. Discuss the similarities and differences in reporting capabilities.

The similarities between DM and statistical applications are that both depend on formulating hypotheses and testing them, both discover hidden associations, and both can find unexpected patterns. The differences are that in statistical applications the hypotheses are formulated manually, while in DM applications the hypotheses are generated automatically. DM applications also offer capabilities that statistical applications cannot provide, such as responding to extracted patterns, selecting the right actions, learning from past actions, and turning action into business value.

3. A large online bank needs to mine data coming from many sources, including marketing, accounting, and customer databases. Discuss the best way to collect and prepare multi-source data.

The best way to collect data for the bank is from a geographical database that includes a relational database for all the bank transactions (internal: purchasing, or external: relationships with clients) from different territories and geographical areas. Data warehouses are also suitable places for a large amount of data from various sources. The data preparation stage includes the following tasks: evaluating data quality, handling missing data, processing outliers, normalizing data, and quantifying data. This helps in understanding the importance of some variables and the irrelevance of others, which helps narrow the focus of the application.

4. Minetise.com is an Internet company specializing in online banner ads. The company is developing an application that customizes a banner according to a customer’s historic profile. Discuss how DM can be used to develop such an application.

To develop such an application, the company must go through the virtuous DM cycle, starting with business understanding: the company must identify its purpose for using the application, recognize the real benefit of such banners, and know what problems it is most likely to encounter. The application should define the profile of the customer. According to the profile, and based on historical data, a matching banner is identified.

5. Your manager is extremely worried about integration problems that might arise from implementing a DM application on your company’s SQL database. Some of the questions bothering him include the following: How will it integrate into the current computing environment? Will it work on our existing SQL database, or do we need anything else? How easily will the system work on our intranet? Discuss the problems and possible solutions to these questions. What other problems might your company face?

Analytical methods include querying and reporting data, data visualization, and data analysis. Statistics and machine learning, which depend primarily on SQL and other database applications, are considered to be the analytical foundations upon which DM was developed. DM applications provide a global approach that integrates the conventional tools into a whole process that leads to actionable knowledge. A DM application works directly on the SQL server and allows users to access information from different sources through client/server (intranet) or Web-based query systems.

Some of the questions that need to be addressed are:
• Will any SQL server work? Most of the new DM applications require the latest SQL server, and it can be installed easily.
• Do we need a special type of knowledge workers and users? DM can provide the right environment to satisfy the requirements of all types of knowledge workers.

6. Finance Trance is a stock brokerage firm. They are thinking of using DM in their customer services department. Suggest some uses and services they can offer. Also, discuss the DM tasks that are to be used.

Some of the services that can be offered are:
• Portfolio screening: Using DM applications, Finance Trance can offer its clients a high-standard portfolio by scanning different companies’ stock prices, dividends, historical earnings, etc., and building a portfolio from the best options. Neural networks are the appropriate DM technique for this service.
• Currency exchange market fluctuations: Finance Trance can provide clients with a forecast of future currency exchange prices, with the aim of an attractive return on investment. Neural networks are to be used for this service.
• Loan application processing: Using DM applications, applicants will learn of their status in a short time. A classification tree is the technique to be used here.

7. An online bookstore has asked your company to develop a DM application to recommend books to customers. Your manager wants you to analyze how the company works and see what data you can pull from their data warehouses. How would you go about understanding the business and data available before starting the project? What part does this fulfill in the overall project?

This is the first stage of the virtuous DM cycle; it covers business understanding and data preparation. First, we must determine the problems faced by the firm. This involves analyzing the company’s customer base, market share, historical data about sales and revenues, payment methods, and other factors. The data can be retrieved from the company’s own database of business transactions (money transfer, shipping, Web site registration, etc.). On completing this stage, we would have a clear idea of the important issues that the DM application must address.

8. How could a mobile phone company use DM to lower customer churn? Can it use DM to increase variables such as product development speed, marketing effort, or even customer retention?

DM can help a mobile phone company lower customer churn in various ways. One can develop a DM model that predicts which customers are more likely to renew their services and which are more likely to churn. The model draws on usage patterns and other important customer characteristics that can be used to identify satisfied and dissatisfied customers. It can identify the incentives to which customers respond best (more product features, extended guarantee period, etc.). Additionally, the model can identify other problems affecting customers’ loyalty and give recommendations on how to solve or avoid them.

9. During the data preparation stage, a supermarket omitted certain data fields that were later shown to have significant adverse effects on the overall DM application. Which stages of the DM process will be affected? At which stage could this problem have been detected? How do you think the problem was detected?

Omitting significant data will affect all the following stages: model building, action and decision, and evaluation. The omission will be detected at the model testing stage. At this stage, the model is put to the test using test criteria, and if it fails the test it is either rejected or its parameters are adjusted for further testing. The proper way to detect such problems is to go through individual records before mining the data, to get a feel for the information and to check that at least what we already know is still present.

10. Design a survey to glean trends from several companies that are planning to develop DM applications. This survey should help clarify the role of executive managers, the characteristics of the planned project, and the return expected from it.

This mini-project should help students understand how companies are planning for DM applications, who is making the decision, and why.

11. Conduct an in-depth case study with a company that has implemented a DM solution to identify the best practices and common pitfalls.

The assessment of a DM application should follow the DM development process step by step. One of the most important obstacles is the collection and validation of data. At each step students should identify the difficulties and understand how they were solved.

12. Carrier Corp. is using data mining to profile online customers and offer them cool deals on air conditioners and related products. By using services from WebMiner, Inc., the air-conditioning, heating, and refrigeration equipment maker has turned more Web visitors into buyers, increasing per-visitor revenue from $1.47 to $37.42.

Carrier, part of $26 billion United Technologies Corp., began selling air conditioners, air purifiers, and other products to consumers via the Web in 1999. However, it sold only about 3,500 units that year, says Paul Berman, global e-business manager at the Farmington, Connecticut, company. Not knowing just who its customers were and what they wanted was a big part of the problem. “We were looking for ways to raise awareness [of Carrier’s Web store] and convert Internet traffic to sales,” Berman says.

Last year, Carrier gave WebMiner a year’s worth of online sales data, plus a database of Web surfers who had signed up for an online sweepstakes the company ran in 1999. WebMiner combined that with third-party demographic data to develop profiles of Carrier’s online customers. The typical customer is young (30 to 37), Hispanic, and lives in an apartment in an East Coast urban area. WebMiner matched the profiles to ZIP codes and developed predictive models.

Since May, Carrier has enticed visitors to its Web site (www.buy.carrier.com) with discounts. When they type in their ZIP codes, WebMiner establishes a customer profile and pops up a window that offers appropriate products, such as multi-room air conditioners for suburbanites or compact models for apartment dwellers. “It’s the first time we’ve intelligently delivered data-driven promotions,” Berman says.

Online sales have exceeded 7,000 units this year, Berman says, compared with 10,000 units for all of last year. Carrier chose the WebMiner service because it was quick to implement and is relatively inexpensive: $10,000 for installation and a $5 fee to WebMiner for each unit sold, compared with 6-figure alternatives.

a. The DM application used by Carrier was one that was predictive in nature. Could a descriptive model also be used? How would you use it, and what outputs would you expect? Would they be of any use to Carrier?
b. What other data-driven promotions could Carrier come up with using other data mining techniques?
c. What manufacturing-driven applications can Carrier implement using data mining? Hint: How can it be used to forecast manufacturing defects?
d. What finance-driven applications can Carrier implement using data mining? Hint: How can Carrier use DM to distinguish on-time paying customers from doubtful ones?

SOURCE: Whiting 2001.

a. The only descriptive model that can be used is multiple regression, in which we develop a formula for the relationship between online sales on one hand and various variables on the other: "Y = a + bX1 + cX2 +…" This model predicts the dependent variable (sales volume) from the independent variables. Its limitation is that all independent variables must be quantified (average income, number of family members, etc.). It will therefore not be helpful for Carrier, because some important attributes cannot be quantified (place of residence, nationality, etc.).

b. By learning from a DM clustering model that its customer base is located on the East Coast, Carrier can place manufacturing facilities in locations that cover the largest possible area and reduce shipping costs.

c. One of the many applications for DM is quality inspection. Certain quality parameters can be entered in the application, and whenever the pattern changes, the defects can be identified immediately.

d. By entering historical data that includes customers who paid on time and others who defaulted, a DM model can be developed to identify the attributes of each type and to make predictions about the payment habits of new customers.

13. IAURIF, a French regional studies organization, needed to predict what mode of transportation Parisians would use, and why they would use it, from a large data set not originally collected for data mining. With Clementine’s rule-induction algorithms, IAURIF uncovered unexpected insights and proved the group’s first assumption, which was based solely on experience, to be untrue. Instead, Clementine’s rapid modeling environment revealed the most important travel factors and derived accurate results based on fact. Results were as follows:
• Accomplished more accurate traffic forecasting
• Improved transportation planning

Analyzing and predicting traffic flows and growth is a complex process. For IAURIF, this process started with an existing database of 400,000 records. These previously collected data, from a detailed Parisian transport survey, were not originally intended for data mining. That meant a more complex task right from the start, because IAURIF had to complete extensive preprocessing before it could begin data mining.

Armed with Clementine’s data manipulation capabilities, IAURIF began by grouping the 200 original fields under general headings, such as place of residence and socioeconomic class. Then, analysts selected a representative variable for each group of fields and ensured the groups were independent in their effect on transport mode. This important preprocessing enabled IAURIF to pinpoint 26 fields, a core set of relevant variables that would simplify and significantly help the group’s data-mining efforts.

IAURIF analysts then used Clementine’s rule-induction algorithms, which predicted a three-way variable: whether someone would walk, drive, or take public transportation for a specific journey. With Clementine’s powerful modeling techniques, analysts identified the factors behind each choice. Based on experience, IAURIF had first thought sociological factors, such as income and class, would combine with the journey’s purpose to be the most important causal factors. However, Clementine uncovered a very significant, and surprising, finding. The most important factors proved to be journey distance and trip time, not the factors the group had predicted on experience alone. To be sure, IAURIF proved Clementine’s high accuracy by testing results with a validation data set. In the end, Clementine and this new modeling process increased IAURIF’s ability to plan future transport.

a. Describe the case from the perspective of the 7-step DM process. Which parts were not covered? What recommendations do you have?
b. What difference would using online transaction data, instead of survey data, have on the overall IAURIF project? What suggestions would you give?
c. What other DM projects could you implement for IAURIF? Which DM techniques would you use? Why?

a. The DM seven-step methodology:

1. Business understanding: IAURIF needed to predict what mode of transportation Parisians would use, and why they would use it.
2. Data understanding: The process started with an existing database. Extensive preprocessing was completed before beginning data mining.
3. Data preparation: IAURIF began by grouping the 200 original fields under general headings. It selected a representative variable for each group of fields and ensured the groups were independent in their effect on transport mode. This preprocessing enabled IAURIF to pinpoint 26 fields, a core set of relevant variables that would simplify and significantly help the group’s data-mining efforts.
4. Data modeling: IAURIF used Clementine’s rule-induction algorithms, which predicted a three-way variable and then identified the factors behind each choice.
5. Analysis of the results: Clementine uncovered a very significant, and surprising, finding. The most important factors proved to be journey distance and trip time, not the factors the group had predicted on experience alone.
6. Knowledge assimilation: To be sure, IAURIF proved Clementine’s high accuracy by testing results with a validation data set.
7. Deployment evaluation

b. Using online transaction data would have saved time, provided a wider database, and made data processing easier. The use of surveys served effectively in covering all required data. In addition, surveys are less open to abuse by the targeted population than online transaction data. It might therefore be more effective to post online surveys prepared especially for DM purposes, with a feasible and attractive reward. This way the benefits of the two methods are combined, provided that this does not conflict with the time and cost constraints.

c. IAURIF could investigate the telecommunications industry or the health-care sector, where it can do descriptive analysis. Another DM application that IAURIF could carry out is a study of the educational level of residents of cities and rural areas, to determine the attributes affecting the educational level.
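
The modeling and validation steps described in exercise 13 (inducing rules that predict a three-way transport mode, then checking them against a validation data set) can be illustrated with a minimal sketch. It assumes the very simple 1R (One Rule) rule-induction algorithm rather than whatever Clementine actually ran, and the field names, value bands, and records are hypothetical toy data invented for illustration, not the IAURIF survey.

```python
from collections import Counter, defaultdict

# Hypothetical field names and toy records (assumptions, not IAURIF data).
# Each record: (distance band, trip-time band, income band) -> transport mode.
FIELDS = ("distance", "trip_time", "income")

TRAIN = [
    (("short", "brief", "low"), "walk"),
    (("short", "brief", "high"), "walk"),
    (("short", "moderate", "mid"), "walk"),
    (("medium", "moderate", "low"), "public"),
    (("medium", "moderate", "high"), "public"),
    (("medium", "long", "mid"), "public"),
    (("long", "long", "low"), "drive"),
    (("long", "long", "high"), "drive"),
    (("long", "moderate", "mid"), "drive"),
]

VALIDATION = [
    (("short", "brief", "mid"), "walk"),
    (("medium", "long", "low"), "public"),
    (("long", "long", "mid"), "drive"),
    (("medium", "moderate", "high"), "drive"),
]


def one_r(records, n_fields):
    """1R: for each field, map every value to its majority class, then keep
    the single field whose rules make the fewest errors on the training data."""
    default = Counter(label for _, label in records).most_common(1)[0][0]
    best = None
    for i in range(n_fields):
        counts = defaultdict(Counter)
        for features, label in records:
            counts[features[i]][label] += 1
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(rule[f[i]] != label for f, label in records)
        if best is None or errors < best[0]:
            best = (errors, i, rule)
    _, field, rule = best
    return field, rule, default


def predict(model, features):
    """Apply the chosen rule; fall back to the overall majority class
    when the field value was never seen in training."""
    field, rule, default = model
    return rule.get(features[field], default)


def accuracy(model, records):
    """Share of records whose predicted mode matches the recorded mode."""
    hits = sum(predict(model, f) == label for f, label in records)
    return hits / len(records)


model = one_r(TRAIN, len(FIELDS))
print("chosen field:", FIELDS[model[0]])
print("rules:", model[1])
print("validation accuracy:", accuracy(model, VALIDATION))
```

The toy records were deliberately built so that distance separates the three modes, so the sketch selects the distance field. This mirrors the shape of the case finding but is not evidence for it. Keeping the validation records out of training is what makes the accuracy figure a fair test of the induced rules.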