CHAPTER 12
DATA MINING: KNOWING THE UNKNOWN
TEST YOUR UNDERSTANDING
1.
Why is DM a process and not an end in itself? Explain.
Although DM can produce knowledge and discover new patterns, it cannot extract
meaning from them. Human intervention is still needed to interpret the results and act on them.
2.
Describe the differences and similarities among DM, machine learning, and business
intelligence. How are they related?
Business intelligence (BI) is a global term for all processes, techniques, and tools
that support business decision making based on information technology. The
approaches can range from a simple spreadsheet to an advanced decision support
system. Data mining is a component of BI.
The objective of data mining is to optimize the use of available data and reduce the
risk of making wrong decisions. Data mining is a business process concerned with
finding understandable knowledge from very large real-world databases. Statistics
and machine learning are considered to be the analytical foundations upon which DM
was developed.
Machine learning (ML) has focused on making computers learn things for
themselves. Machine learning is the automation of the learning process that is a
crucial function in any intelligent system. Its methodology includes learning from
examples, reinforcement learning, and supervised or unsupervised learning. ML is a
scientific discipline considered to be a sub-field of artificial intelligence.
3.
“DM can be thought of as a form of advanced statistical techniques.” Do you agree
with this statement? Why or why not?
DM is not a form of advanced statistical techniques. Although DM uses statistical
techniques to discover hidden facts contained in databases and to find patterns and subtle
relationships, its overall function is broader and more sophisticated, since it must also infer
rules that allow the prediction of future results. Statistical techniques are therefore one of
many tools that DM uses in performing its tasks.
4.
“DM is a tool to develop intelligent systems.” Define intelligence, explain how systems
could have intelligent behavior, and discuss this statement.
According to the Oxford dictionary, intelligence is the power to learn, understand, and
know. This definition applies to humans. With the evolution of the processing power of
computers, many scientists started to claim that computers could do anything human
beings could do, and sometimes better or faster. Turing defined intelligent behavior of a
system as the ability to imitate humans perfectly. No machine is able to
pass this test. However, machines can now perform some intelligent tasks that help
humans solve their problems. DM, for example, can extract hidden patterns from large
sets of data, a task that humans cannot perform efficiently because of their limited
computational capacity. Even so, the knowledge that DM captures or discovers would remain
useless without the direct intervention of humans to understand its meaning and take action.
5.
Describe the differences between OLAP and DM. When would you use each tool?
OLAP: Online analytical processing tools give the user the capability to perform
multidimensional analysis of the data. This approach uses computing power and
graphical interfaces to manipulate data easily and quickly at the convenience of the user.
The focus is showing data along several dimensions. The manager should be able to drill
down into the ultimate detail of a transaction and zoom up for a general view.
DM: Using a combination of machine learning, statistical analysis, modeling techniques, and
database technology, data mining finds patterns and subtle relationships in data and infers
rules that allow the prediction of future results. Typical applications include market
segmentation, customer profiling, fraud detection, evaluation of retail promotions, and
credit risk analysis.
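To make the OLAP side of this contrast concrete, the following is a minimal sketch of a roll-up and a drill-down over a small fact table. The sales records, the dimension names, and the `roll_up` helper are all invented for illustration and do not come from the chapter.

```python
from collections import defaultdict

# Hypothetical fact table: three dimensions and one measure per record.
SALES = [
    {"region": "East", "product": "A", "quarter": "Q1", "amount": 100},
    {"region": "East", "product": "B", "quarter": "Q1", "amount": 150},
    {"region": "West", "product": "A", "quarter": "Q1", "amount": 200},
    {"region": "East", "product": "A", "quarter": "Q2", "amount": 120},
    {"region": "West", "product": "B", "quarter": "Q2", "amount": 80},
]


def roll_up(records, dimensions):
    """Sum the 'amount' measure over every combination of the given dimensions."""
    totals = defaultdict(int)
    for rec in records:
        key = tuple(rec[d] for d in dimensions)
        totals[key] += rec["amount"]
    return dict(totals)


# General view: total sales per region.
by_region = roll_up(SALES, ["region"])
# Drill down: the same totals split further by quarter.
by_region_quarter = roll_up(SALES, ["region", "quarter"])
```

In this sketch the user decides which dimensions to view; nothing in the code looks for a pattern on its own, which is the gap that DM is meant to fill.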
6.
What are the limitations of OLAP? How is DM able to overcome them?
OLAP has two limitations:
• It does not find patterns automatically.
• It does not have powerful analytical techniques.
DM overcomes these limitations by using a combination of machine learning, statistical
analysis, modeling techniques, and database technology.
7.
What is the role of DM in e-business?
DM applications for CRM are integrated with e-sales functions, in order to create the
customer-centric firm. DM applications are the first line in understanding the customer
and an integral key to segmenting the market.
8.
Describe, with examples, when you would use predictive DM and when you would use
descriptive DM.
The goal of a descriptive DM task is to understand, explain, or discover relationships
among data sets. It looks for similarity and dissimilarity in data; segmenting customers into
groups with similar behavior is an example. In contrast, a predictive
task is concerned with future behavior and is therefore time driven. Predicting company
bankruptcy or customer response to a marketing campaign are examples of predictive DM.
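As a rough illustration of a descriptive task, the sketch below counts how often pairs of items appear together in the same transaction and reports each pair's support. The baskets and the `pair_support` helper are invented for this example. A predictive task would instead fit a model to labeled history, such as past bankruptcies.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions; each set is one customer's basket.
BASKETS = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "milk"},
]


def pair_support(baskets):
    """Return the fraction of baskets that contain each pair of items."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair is always stored in the same order.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}


support = pair_support(BASKETS)
```

Here `support[("bread", "milk")]` is 0.75 because three of the four baskets contain both items. The output describes relationships already present in the data and makes no claim about future behavior.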
9.
Explain how DM is used in the health sector and in the telecommunications industry.
In the health-care business: Keeping pace with the rate of technological and medical
advancement provides a significant challenge. Cost is a constant issue in this ever-changing
market. Early DM activities have focused on financially oriented applications.
Predictive models have been applied to predict length of stay, total charges, and even
mortality.
In the telecommunications industry: Keeping pace with the rate of technological
change provides a significant challenge to businesses throughout the telecommunications
industry. In addition to this, deregulation is changing the business landscape, resulting in
competition from a wide range of service providers. Finding and retaining customers is
important to telecommunications providers. In addition to customer profiling,
subscription-fraud detection and credit-application screening are used throughout the industry.
Concerns about privacy and security are likely to result in DM applications targeted to these areas.
10.
Explain how companies are using DM to understand their customers’ behavior and
predict their intentions.
Data mining (technologies and techniques for recognizing and tracking patterns within
data) helps businesses sift through layers of seemingly unrelated data for meaningful
relationships, so that they can anticipate, rather than simply react to, customer needs.
11.
Describe the major pitfalls faced by companies when implementing DM solutions.
Data-mining project managers commonly stumble over problems such as:
• Insufficient understanding of business needs
• Careless handling of data. Data-mishandling errors include the following:
    • Over-quantifying data
    • Miscoding data
    • Analyzing without taking precautions against sampling errors
    • Loss of precision due to improper rounding of data values
    • Incorrectly handling missing values
• Improperly validating the data-mining model
KNOWLEDGE EXERCISES
1.
Discuss what types of industries can best benefit from DM. Which ones cannot?
Hint: Think of the ones having the most transactions and accessible data.
The financial services, health-care, and telecommunications industries are among those
that can benefit most from DM, because they have many complicated
transactions and ready access to data through the Internet, data
warehouses, or financial reports. Agriculture is one business that needs DM,
but because of the lack of information and fluctuating data it is not benefiting
from DM applications. Industries whose products are similar to one another (e.g., ice cream,
beverages) do not require DM, because their transactions are limited and simple.
2.
Statistical and DM applications both produce different results for management, even
though they might use the same historical data. Discuss the similarities and differences
in reporting capabilities.
The similarities between DM and statistical applications are that both depend on
formulating hypotheses and testing them, both discover hidden associations, and both can
find unexpected patterns.
The differences are that in statistical applications the hypotheses are formulated manually,
whereas in DM applications the hypotheses are generated automatically. DM applications
also offer capabilities that statistical applications cannot provide, such as responding to
extracted patterns, selecting the right actions, learning from past actions, and turning
action into business value.
3.
A large online bank needs to mine data coming from many sources, including
marketing, accounting, and customer databases. Discuss the best way to collect and
prepare multi-source data.
The best way to collect data for the bank is through a geographical database that includes a
relational database of all the bank’s transactions (internal, such as purchasing, or external,
such as relationships with clients) from different territories and geographical areas.
Data warehouses are also suitable for storing large amounts of data from various
sources. The data preparation stage includes the following tasks: evaluating data quality,
handling missing data, processing outliers, normalizing data, and quantifying data. This
helps in understanding the importance of some variables and the irrelevance of others,
which in turn helps narrow the focus of the application.
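Three of the preparation tasks listed above (handling missing data, processing outliers, and normalizing) can be sketched in a few lines. This is only an illustration under stated assumptions: the account balances are invented, median imputation and clipping are just one possible choice for each task, and the clipping bounds are arbitrary.

```python
from statistics import median


def impute_missing(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def clip_outliers(values, low, high):
    """Pull values outside [low, high] back to the nearest bound."""
    return [min(max(v, low), high) for v in values]


def min_max_normalize(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical balances merged from several source systems.
balances = [1200, None, 800, 1000, 250000]
filled = impute_missing(balances)         # None becomes the median, 1100
clipped = clip_outliers(filled, 0, 5000)  # 250000 is capped at the assumed bound
normalized = min_max_normalize(clipped)
```

The order matters in this sketch: the outlier is clipped before normalizing, otherwise the single extreme balance would compress every other value toward zero.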
4.
Minetise.com is an Internet company specializing in online banner ads. The company
is developing an application that customizes a banner according to a customer’s
historic profile. Discuss how DM can be used to develop such an application.
To develop such an application, the company must go through the virtuous DM cycle,
starting with business understanding: it must identify its purpose for using the
application, understand the real benefit of such banners, and know what problems
it is most likely to encounter. The application should define the profile of the
customer. Based on that profile and on historical data, a matching banner is
identified.
5.
Your manager is extremely worried about integration problems that might arise from
implementing a DM application on your company’s SQL database. Some of the
questions bothering him include the following: How will it integrate into the current
computing environment? Will it work on our existing SQL database, or do we need
anything else? How easily will the system work on our intranet? Discuss the problems
and possible solutions to these questions. What other problems might your company
face?
Analytical methods include querying and reporting data, data visualization, and data
analysis. Statistics and machine learning, which depend primarily on SQL and
other database applications, are considered to be the analytical foundations upon which DM
was developed. DM applications provide a global approach that integrates the
conventional tools into a whole process that leads to actionable knowledge. A DM application
works directly on the SQL server and allows users to access information from different sources
through client/server (intranet) or Web-based query systems. Some of the questions that
need to be addressed are:
• Will any SQL server work? Most new DM applications require the latest
  SQL server, which can be installed easily.
• Do we need a special type of knowledge worker or user? DM can provide the
  right environment to satisfy the requirements of all types of knowledge workers.
6.
Finance Trance is a stock brokerage firm. They are thinking of using DM in their
customer services department. Suggest some uses and services they can offer. Also,
discuss the DM tasks that are to be used.
Some of the services that can be offered are:
• Portfolio screening: Using DM applications, Finance Trance can offer its clients a
  high-standard portfolio by scanning different companies’ stock prices,
  dividends, historical earnings, etc., and building a portfolio from the best options.
  Neural networks are the appropriate DM technique for this service.
• Currency exchange market fluctuations: Finance Trance can provide clients with a
  forecast of future currency exchange prices, which can help them pursue an attractive
  return on investment. Neural networks are to be used for this service.
• Loan application processing: Using DM applications, applicants will learn of their
  status in a short time. A classification tree is the technique to be used here.
7.
An online bookstore has asked your company to develop a DM application to
recommend books to customers. Your manager wants you to analyze how the company
works and see what data you can pull from their data warehouses. How would you go
about understanding the business and data available before starting the project? What
part does this fulfill in the overall project?
This work corresponds to the first stage of the virtuous DM cycle, “Business Understanding,”
and to data preparation. First, we must determine the problems faced by the firm. This
involves analyzing the company’s customer base, market share, historical data about
sales and revenues, payment methods, and other factors. The data can be retrieved from
the company’s own database of business transactions (money transfer, shipping, Web site
registration, etc.). By completing this stage, we would have a clear idea of the
important issues that the DM application must address.
8.
How could a mobile phone company use DM to lower customer churn? Can it use DM
to increase variables such as product development speed, marketing effort, or even
customer retention?
DM can help a mobile phone company lower customer churn in various ways. One can
develop a DM model that predicts which customers are more likely to renew their services
and which are more likely to churn. Such a model captures usage patterns and other important
customer characteristics that can be used to identify satisfied and dissatisfied customers. It can
identify the incentives to which customers respond best (more product features, an extended
guarantee period, etc.). Additionally, the model can reveal other problems affecting
customers’ loyalty and give recommendations on how to solve or avoid them.
9.
During the data preparation stage, a supermarket omitted certain data fields that were
later shown to have significant adverse effects on the overall DM application. Which
stages of the DM process will be affected? At which stage could this problem have been
detected? How do you think the problem was detected?
Omitting significant data will affect all the subsequent stages: model building, action and
decision, and evaluation. The problem would be detected at the model testing stage. At this
stage, the model is put to the test using test criteria, and if it fails the test it is either
rejected or its parameters are adjusted for further testing. The proper way to detect such
problems is to go through individual records before mining the data, to get a feel for the
information and to confirm that at least what we already know is still present in the data.
10.
Design a survey to glean trends from several companies that are planning to develop
DM applications. This survey should help clarify the role of executive managers, the
characteristics of the planned project, and the return expected from it.
This mini-project should help students understand how companies are planning for DM
applications, who is making the decision, and why.
11.
Conduct an in-depth case study with a company that has implemented a DM solution
to identify the best practices and common pitfalls.
The assessment of a DM application should follow the DM development process step by
step. One of the most important obstacles is the collection and validation of data. At
each step, students should identify the difficulties and understand how they were solved.
12.
Carrier Corp. is using data mining to profile online customers and offer them cool
deals on air conditioners and related products. By using services from WebMiner, Inc.,
the air-conditioning, heating, and refrigeration equipment maker has turned more
Web visitors into buyers, increasing per-visitor revenue from $1.47 to $37.42.
Carrier, part of $26 billion United Technologies Corp., began selling air conditioners,
air purifiers, and other products to consumers via the Web in 1999. However, it sold
only about 3,500 units that year, says Paul Berman, global e-business manager at the
Farmington, Connecticut, company. Not knowing just who its customers were and
what they wanted was a big part of the problem. “We were looking for ways to raise
awareness [of Carrier’s Web store] and convert Internet traffic to sales,” Berman says.
Last year, Carrier gave WebMiner a year’s worth of online sales data, plus a database
of Web surfers who had signed up for an online sweepstakes the company ran in 1999.
WebMiner combined that with third-party demographic data to develop profiles of
Carrier’s online customers. The typical customer is young (30 to 37), Hispanic, and
lives in an apartment in an East Coast urban area.
WebMiner matched the profiles to ZIP codes and developed predictive models. Since
May, Carrier has enticed visitors to its Web site (www.buy.carrier.com) with discounts.
When they type in their ZIP codes, WebMiner establishes a customer profile and pops
up a window that offers appropriate products, such as multi-room air conditioners for
suburbanites or compact models for apartment dwellers. “It’s the first time we’ve
intelligently delivered data-driven promotions,” Berman says.
Online sales have exceeded 7,000 units this year, Berman says, compared with 10,000
units for all of last year. Carrier chose the WebMiner service because it was quick to
implement and is relatively inexpensive—$10,000 for installation and a $5 fee to
WebMiner for each unit sold, compared with 6-figure alternatives.
a. The DM application used by Carrier was one that was predictive in nature. Could a
descriptive model also be used? How would you use it, and what outputs would you
expect? Would they be of any use to Carrier?
b. What other data-driven promotions could Carrier come up with using other data
mining techniques?
c. What manufacturing-driven applications can Carrier implement using data mining?
Hint: How can it be used to forecast manufacturing defects?
d. What finance-driven applications can Carrier implement using data mining?
Hint: How can Carrier use DM to distinguish on-time paying customers from doubtful
ones?
SOURCE: Whiting 2001.
a. The only descriptive model that can be used is multiple regression, in which we
develop a formula relating online sales to several explanatory
variables: Y = a + bX1 + cX2 + … This model predicts
the dependent variable (sales volume) from the independent variables. Its limitation
is that all independent variables must be quantified (average income, number of
family members, etc.). It would therefore not be helpful for Carrier, because some important
attributes cannot be quantified (place of residence, nationality, etc.).
b. Having learned from a DM clustering model that its customer base is located on the
East Coast, Carrier can place manufacturing facilities in locations that
cover the largest possible area and reduce shipping costs.
c. One of the many applications for DM is quality inspection. Certain quality parameters
can be entered in the application, and whenever the pattern changes, the defects can
be identified immediately.
d. By entering historical data that includes customers who paid on time and others who
defaulted, a DM model can be developed to identify the attributes of each type and to
make predictions about the payment habits of new customers.
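The idea in answer (d) can be sketched as follows: summarize each group of past customers by its average attribute values, then assign a new customer to the closer group. The two attributes, the sample records, and the nearest-centroid method itself are assumptions chosen for brevity; the chapter does not prescribe a technique for this task.

```python
from math import dist  # available in Python 3.8 and later

# Hypothetical history: (late payments last year, debt-to-income ratio), outcome
HISTORY = [
    ((0, 0.10), "on_time"),
    ((1, 0.20), "on_time"),
    ((0, 0.15), "on_time"),
    ((4, 0.60), "defaulted"),
    ((5, 0.55), "defaulted"),
    ((3, 0.70), "defaulted"),
]


def centroids(history):
    """Average the attribute vectors of each outcome class."""
    groups = {}
    for features, label in history:
        groups.setdefault(label, []).append(features)
    return {
        label: tuple(sum(col) / len(rows) for col in zip(*rows))
        for label, rows in groups.items()
    }


def predict(model, features):
    """Return the class whose centroid is nearest to the new customer."""
    return min(model, key=lambda label: dist(model[label], features))


model = centroids(HISTORY)
new_customer = predict(model, (0, 0.12))
risky_customer = predict(model, (4, 0.65))
```

Because the attributes are left on their raw scales, the count of late payments dominates the distance; a real application would normally normalize the attributes first.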
13.
IAURIF, a French regional studies organization, needed to predict what mode of
transportation Parisians would use—and why they would use it—from a large data set
not originally collected for data mining.
With Clementine’s rule-induction algorithms, IAURIF uncovered unexpected insights
and proved the group’s first assumption, which was based solely on experience, to be
untrue. Instead, Clementine’s rapid modeling environment revealed the most important
travel factors and derived accurate results based on fact.
Results were as follows:
• Accomplished more accurate traffic forecasting
• Improved transportation planning
Analyzing and predicting traffic flows and growth is a complex process. For IAURIF,
this process started with an existing database of 400,000 records. These previously
collected data, from a detailed Parisian transport survey, were not originally intended
for data mining. That meant a more complex task right from the start, because
IAURIF had to complete extensive preprocessing before it could begin data mining.
Armed with Clementine’s data manipulation capabilities, IAURIF began by grouping
the 200 original fields under general headings, such as place of residence and
socioeconomic class. Then, analysts selected a representative variable for each group
of fields and ensured the groups were independent of their effect on transport mode.
This important preprocessing enabled IAURIF to pinpoint 26 fields, a core set of
relevant variables that would simplify and significantly help the group’s data-mining
efforts.
IAURIF analysts then used Clementine’s rule-induction algorithms, which predicted a
three-way variable—whether someone would walk, drive, or take public transportation
for a specific journey. With Clementine’s powerful modeling techniques, analysts
identified the factors behind each choice. Based on experience, IAURIF had first
thought sociological factors, such as income and class, would combine with the
journey’s purpose to be the most important causal factors.
However, Clementine uncovered a very significant, and surprising, finding. The most
important factors proved to be journey distance and trip time—not factors the group
had predicted on experience alone. To be sure, IAURIF proved Clementine’s high
accuracy by testing results with a validation data set. In the end, Clementine and this
new modeling process increased IAURIF’s ability to plan future transport.
a. Describe the case from the perspective of the 7-step DM process. Which parts were
not covered? What recommendations do you have?
b. What difference would using online transaction data, instead of survey data, have on
the overall IAURIF project? What suggestions would you give?
c. What other DM projects could you implement for IAURIF? Which DM techniques
would you use? Why?
a. The DM seven-step methodology applied to the case:
1. Business understanding: IAURIF needed to predict what mode of transportation
   Parisians would use, and why they would use it.
2. Data understanding: The process started with an existing database. Extensive
   preprocessing was completed before data mining began.
3. Data preparation: IAURIF began by grouping the 200 original fields under general
   headings, selected a representative variable for each group of fields, and ensured the
   groups were independent of their effect on transport mode. This preprocessing enabled
   IAURIF to pinpoint 26 fields, a core set of relevant variables that would simplify and
   significantly help the group’s data-mining efforts.
4. Data modeling: Analysts used Clementine’s rule-induction algorithms, which predicted
   a three-way variable and then identified the factors behind each choice.
5. Analysis of the results: Clementine uncovered a very significant, and surprising,
   finding. The most important factors proved to be journey distance and trip time, not
   the factors the group had predicted on experience alone.
6. Knowledge assimilation: To be sure, IAURIF proved Clementine’s high accuracy by
   testing results with a validation data set.
7. Deployment evaluation: Not described in the case.
b. Using online transaction data would have saved time and provided a wider database and
easier data processing. The survey, however, served effectively in covering all the required
data. In addition, surveys are less open to abuse by the target population than
online transaction data.
It might therefore be more effective to post online surveys prepared especially for
DM purposes, with a feasible and attractive reward. This would combine the
benefits of the two methods, provided that it does not conflict
with time and cost constraints.
c. IAURIF could investigate the telecommunications industry or the health-care
sector, where it can perform descriptive analysis. Another DM application that IAURIF
could carry out is a study of the educational level of residents of cities and rural areas,
to determine the attributes affecting that level.
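The modeling and validation steps of the IAURIF case can be sketched with a very small rule-induction example. This is not Clementine’s algorithm: it uses the simple one-rule (1R) method, and the trip records, attribute names, and category values are invented, so the outcome only mimics the case by construction.

```python
from collections import Counter, defaultdict

# Hypothetical trips: attribute values and the transport mode chosen.
TRAIN = [
    ({"distance": "short", "income": "low"}, "walk"),
    ({"distance": "short", "income": "high"}, "walk"),
    ({"distance": "medium", "income": "low"}, "public"),
    ({"distance": "medium", "income": "high"}, "public"),
    ({"distance": "long", "income": "low"}, "drive"),
    ({"distance": "long", "income": "high"}, "drive"),
    ({"distance": "medium", "income": "high"}, "drive"),
]
# Held-out records used only to check the induced rules.
VALIDATION = [
    ({"distance": "short", "income": "high"}, "walk"),
    ({"distance": "long", "income": "low"}, "drive"),
    ({"distance": "medium", "income": "low"}, "public"),
    ({"distance": "medium", "income": "high"}, "drive"),
]


def one_rule(records):
    """Pick the single attribute whose value-to-majority-class rules fit best."""
    best = None
    for attr in records[0][0]:
        by_value = defaultdict(Counter)
        for features, label in records:
            by_value[features[attr]][label] += 1
        # One rule per attribute value: predict its most frequent class.
        rules = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        correct = sum(c[rules[v]] for v, c in by_value.items())
        if best is None or correct > best[2]:
            best = (attr, rules, correct)
    return best[0], best[1]


def accuracy(attr, rules, records):
    """Fraction of records whose class matches the rule for their attribute value."""
    hits = sum(rules.get(features[attr]) == label for features, label in records)
    return hits / len(records)


attribute, rules = one_rule(TRAIN)
validation_accuracy = accuracy(attribute, rules, VALIDATION)
```

With this invented data the method selects `distance` over `income`, and the rules classify three of the four validation records correctly. Checking the rules on records that were not used to build them is the same safeguard IAURIF applied with its validation data set.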