Data Mining: A hands-on approach for business professionals.
Data Mining: A hands on approach
By Robert Groth
Reviewed by Mervyn Ng
Introduction:
Data mining is essentially the process of knowledge discovery, which dates back to the dawn of time.
People attempted to mine data long before the term came into use. Data mining has seen a huge
rise in popularity since the early 1990s. It has been especially important in the financial sector,
and it is now a trend across business sectors in general. In summary, this book tries to explain the
current field of data mining and discusses some popular tools on the market that could be of use
to anyone who is considering data mining.
Until recently, data mining was largely an academic field and required computer systems that
were out of the reach of most business analysts. Over the past few years, several factors have
helped make data mining accessible to business professionals:
1. Cost of personal computing power has decreased.
2. Innovations in data mining methodologies are making it more powerful and easier to
understand.
3. Software vendors are making data mining available to the end user.
This is the author's first book devoted to business professionals, and it provides an easy
approach for them to learn data mining. It discusses how knowledge discovery is used in
different industries and the software that companies use, especially desktop software, so that the
book reaches the widest possible audience. The book also includes sample studies of specific
industries such as retail, banking and insurance.
Another interesting feature of this book is its hands-on approach. The book includes tutorials for
KnowledgeSeeker, NeuralWare and DataMind, so readers who own this software can learn to
use it while reading.
Next, I summarize the eight chapters of the book.
Chapter 1
The first chapter gives a brief definition of data mining and of its different types. The author
notes that the articles he has read describe about eight different types of data mining, and that
some data mining algorithms are more appropriate than others in certain fields. What really
matters, as the author puts it, is that business professionals choose the tool that best suits their
needs and that they can understand. They should also choose models that can be built in a timely
manner, which requires the data to have “good performance” attributes. This book covers three
fundamental approaches to data mining:
1. Classification studies or supervised learning
2. Clustering studies or unsupervised learning.
3. Visualization studies.
Classification studies set a clear goal and build an appropriate model from historical data.
Clustering studies group rows of data that share similar trends and patterns. Finally, visualization
studies are simply the graphical presentation of data, an approach used today in most query
tools. Representing data graphically often brings out points that the common user would not
normally see.
Chapter 1 also covers why data mining is used and describes some of its uses. In direct
marketing, data mining is used to find the people most likely to buy certain products, which
saves on marketing expenditure. Data mining is also used in trend analysis: understanding trends
in the marketplace can bring a strategic advantage to the company, since data mining can help
reduce costs and time to market. Another major use is forecasting financial markets; data mining
is used very extensively to model financial markets, and finance is one of the industries where
the technique is most used. The chapter then turns to how data is mined. The author identifies
five main steps in data mining: data manipulation, defining a study, reading the data and building
a model, understanding the model, and making predictions. He emphasizes the importance of
clean data, which essentially means data that is relevant to the area being analyzed and that is
consistent. When reading the data and building a model, “noise” (errors and anomalies) can
appear in the data mining process. Much work has gone into designing filters that soften the
impact of noise in data sets and improve the overall accuracy of the model. After an appropriate
model has been built, several aspects of it should be considered:
1. Model summary.
2. Data distribution.
3. Differentiation (An input should predict one outcome much better than others)
4. Validation (making predictions using an existing model and comparing results)
In the last part of this chapter, the author gives an overview of the different types of data mining
models. They include decision trees, genetic algorithms (a method of combinatorial optimization
based on processes in biological evolution), neural nets (built on the concept of an “artificial
neuron” that mimics the process of a neuron in the human brain) and hybrid models
(combinations of different modeling techniques, such as a single hybrid algorithm that makes use
of several of their features). The author gives an extensive definition of each, but within the
scope of this book review I will not go into too much detail.
Chapter 2:
Chapter 2 explains the data mining process in much greater detail, using examples and stepping
through the different stages of data mining. An interesting point is made about accessing data
warehouses. Data mining is often described as an aftermarket for data warehouses, not because
data mining requires a data warehouse but because taking the time to build such decision support
systems forces companies to bring all their disparate data together. An interesting trend in data
mining is the integration of data warehousing databases directly with data mining tools. Even if
data is not held in a data warehouse, it can be accessed directly from a relational,
transaction-based database through ODBC or the other connectivity standards that most
databases offer nowadays. Using relational databases instead of data warehouses for data mining
increases the chance of unclean data, which in turn increases the need for data preparation. The
chapter also raises data quality issues: data is rarely 100% clean, and data mining is at best as
good as the data it represents.
Defining a study is the second step in the data mining process. The scope of the study is very
important; it involves understanding the limits of a study, choosing good studies to perform,
determining the right elements to study and understanding sampling. The author goes on to
define the types of studies that can be done in data mining: profiling customer habits and
customer demographics, time dependence studies, retention management, risk forecasting,
profitability analysis, data trend analysis, employee studies and regional studies. Once a study
has been chosen, the data miner has to read the data and build the model; as mentioned, the
model must be both accurate and understandable. Finally, he talks briefly about the prediction
part of data mining. The process of prediction is straightforward: given a set of inputs, a
prediction is made about a certain outcome. Although the validation process also uses prediction,
it really compares predictions against known results in order to calculate an accuracy level. With
true prediction, the outcome to be predicted is not known.
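The difference between validation and true prediction can be made concrete with a small sketch. It is illustrative only: the held-out records, the `predict_risk` rule and its income threshold are hypothetical stand-ins and do not come from the book.

```python
# Hypothetical held-out records: (monthly_income, known_outcome).
# Knowing the outcome is what makes this validation rather than prediction.
HOLDOUT = [
    (1200, "high_risk"),
    (4500, "low_risk"),
    (800, "high_risk"),
    (3900, "low_risk"),
    (2100, "high_risk"),
]


def predict_risk(monthly_income, threshold=2000):
    """Toy stand-in for a trained model: a single income threshold."""
    return "high_risk" if monthly_income < threshold else "low_risk"


def validation_accuracy(records, model):
    """Fraction of records whose predicted outcome equals the known outcome."""
    hits = sum(1 for value, known in records if model(value) == known)
    return hits / len(records)


# Validation: compare predictions with outcomes that are already known.
accuracy = validation_accuracy(HOLDOUT, predict_risk)
print(f"validation accuracy: {accuracy:.2f}")

# True prediction: the outcome for this new input is not known in advance.
print(predict_risk(1500))
```

The same model function serves both purposes; only the availability of the known outcome differs.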
Chapter 3
This chapter provides insight into the data mining market as it stood when the book was written.
It covers trends, data mining vendors, visualization and data sources for mining. It mentions that
EIS and query vendors are integrating data mining with traditional query and decision support
tools. In the past, query and EIS tools required end users to formulate questions in order to get
interesting answers, an assumption-based process. Integrating data mining with query and EIS
tools enables a discovery-based process, in which an end user can be told the most interesting
things to look at and can then formulate questions based on the new information. OLAP vendors
have also been announcing their interest in including data mining tools in their products. The
author lists the data mining vendors that are out there; some examples are Angoss, Attar
Software, Business Objects, Cognos, DataMind Corporation and IBM. He mentions several
more that I will not include in this write-up. The next part of the chapter is visualization: since
pictures often represent data better than reports or numbers, data visualization is clearly yet
another way to mine data. Data visualization tools go well beyond two-dimensional data
mapping, and many visualization techniques that were only available on high-powered servers
are moving to the end-user market. The book also covers data sources for mining. The amount of
available information about what people buy, where they live, how much they earn and what
hobbies they have can be astonishing. This type of information, as the author puts it, not only
exists but is readily available. He mentions some vendors who sell that kind of information;
examples include Acxiom, CACI Marketing Systems, Claritas, Harte Hanks, AC Nielsen and the
Polk Company.
Chapter 4:
Chapter 4 gives a thorough analysis of the data mining software KnowledgeSeeker, which uses
the decision tree approach to data mining. The chapter aims to familiarize the reader with
decision trees and also to give a hands-on introduction to the software itself. The author mentions
that this software makes use of two well-known decision tree algorithms, CHAID and CART.
CHAID is used to study categorical data such as the states in a country or gender. CART, on the
other hand, works with continuous dependent variables such as monthly expenses. There are
many more decision tree algorithms, but KnowledgeSeeker uses only those two. The next part of
this chapter is a step-by-step tutorial of the software. It works through an example that profiles
people with low, normal and high blood pressure; the decision tree is then grown to include
information about the population of smokers and how smoking relates to blood pressure. A tool
such as KnowledgeSeeker can be used cross-country for such an experiment. Data can be
grouped in optimal ways, which is very useful for market segmentation studies. Decision trees
help you not only to discover brand new insights but also to confirm trends and patterns.
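As a rough illustration of how a CHAID-style tree might choose its next split, the sketch below computes the Pearson chi-square statistic for two candidate predictors against blood pressure class and keeps the stronger one. CHAID stands for Chi-squared Automatic Interaction Detection, but this is only a simplified sketch and not KnowledgeSeeker's implementation: the counts are invented, the "region" predictor is hypothetical, and steps such as category merging and significance testing are left out.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table given as rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if predictor and outcome were unrelated.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical counts; columns are low, normal and high blood pressure.
by_smoking = [
    [10, 30, 60],  # smokers
    [30, 50, 20],  # non-smokers
]
by_region = [
    [20, 40, 40],  # region A
    [20, 40, 40],  # region B
]

candidates = {"smoking": by_smoking, "region": by_region}
scores = {name: chi_square(table) for name, table in candidates.items()}

# The predictor with the largest statistic separates the outcome classes best.
best = max(scores, key=scores.get)
print(scores, best)
```

In this made-up data, smoking status changes the mix of blood pressure classes while region does not, so the tree would be grown on smoking first.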
Chapter 5
This chapter goes into detail on a different software package, DataMind, which focuses on
customer relations, retention and management applications for business. In other words,
marketing professionals can use the findings from the software to target their campaigns more
efficiently and to retain customers before they leave for a competitor. The software allows the
analysis of the large volumes of data found today in data warehouses. Its technology is built on
the concepts of impact, conjunctions and differentiation, which offer the ability both to
understand a model and to use it for prediction, so it is well suited to integrating model
understanding with prediction. It also has a feature called “The Agent Network Technology”,
which is very fast at building models. While going through the tutorials, I found the most
interesting part of DataMind to be the discovery views option, which can build three different
kinds of sub-reports (conjunctions, specific and irrelevant criteria, and impacts). These
sub-reports narrow down the amount of information, which helps the decision maker reach a
decision. The main advantage of this software is that it offers many different views of the
models being built, and the reports are in Excel or Word format, so they can be saved,
manipulated and printed.
Chapter 6
This chapter steps through the process of data mining with a leading software product that uses a
neural network approach. The software in question, Neural Works Predict, is quite distinctive in
its approach to making the product understandable to business professionals. Like DataMind,
Neural Works uses Excel as an interface to make users more comfortable. The software can also
be integrated into other applications written in C or Visual Basic. The author also describes
neural networks: essentially, they attempt to mimic the process of a neuron in a human brain,
with each link described as a processing element. These networks detect patterns in data,
generalize from the data and produce outcomes. Neural networks are especially noted for their
ability to predict complex processes. A processing element in a neural network processes data by
summarizing and transforming it using a series of mathematical functions. One processing
element is limited in ability, but when connected to form a system, the neurons or processing
elements create an intelligent system. That system can be retrained over thousands of iterations
to fit more closely the data it is trying to model. The next section gives a step-by-step
demonstration of how Neural Works Predict can be used. The part about training a neural
network was very interesting: processing elements are linked to inputs and outputs, and training
the network involves modifying the strength, or weight, of the connections from the inputs to the
outputs. The strength of a connection is increased or decreased according to its importance for
producing the proper outcome. This process uses a mathematical method for adjusting the
weights, called a learning rule. Training continues until the neural network produces outcome
values that match the known outcome values within a specified accuracy level, or until it
satisfies some other stopping criterion. This chapter gave the reader more insight into neural
network approaches to data mining and a basic understanding of Neural Works Predict.
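The training loop described above can be sketched with a single linear processing element. The review does not say which learning rule Neural Works Predict uses, so the sketch assumes the delta rule (a least-mean-squares update); the sample data, learning rate and tolerance are likewise hypothetical.

```python
def train_processing_element(samples, learning_rate=0.2, tolerance=0.01,
                             max_epochs=1000):
    """Train one linear processing element with the delta rule.

    Stops when every error seen in a full pass is within `tolerance`
    (the accuracy level) or after `max_epochs` passes (another stopping
    criterion). Returns (weights, bias, epochs_used, converged).
    """
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for epoch in range(1, max_epochs + 1):
        worst_error = 0.0
        for inputs, target in samples:
            output = bias + sum(w * x for w, x in zip(weights, inputs))
            error = target - output
            # Learning rule: strengthen or weaken each connection in
            # proportion to its input and to the remaining error.
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias += learning_rate * error
            worst_error = max(worst_error, abs(error))
        if worst_error < tolerance:
            return weights, bias, epoch, True
    return weights, bias, max_epochs, False


# Hypothetical training set whose known outcomes follow
# 0.5 * x1 + 0.25 * x2 + 0.1 exactly.
SAMPLES = [
    ((0.0, 0.0), 0.10),
    ((0.0, 1.0), 0.35),
    ((1.0, 0.0), 0.60),
    ((1.0, 1.0), 0.85),
]

weights, bias, epochs, converged = train_processing_element(SAMPLES)
print(weights, bias, epochs, converged)
```

A real network connects many such elements and usually adds a non-linear transfer function; the single element here only shows how weights are adjusted until the outcomes match the known values closely enough.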
Chapter 7:
Chapter 7 summarizes the typical industries in which companies use data mining as a tool for
making decisions. The industries covered are banking and finance, retail, healthcare and
telecommunications. Banking and finance have made extensive use of data mining to model and
predict credit fraud. It is also used in evaluating risk, in trend analysis, in analyzing profitability
and in marketing campaigns. The author also mentions that neural networks are used in the
financial markets to help with stock price forecasting, options trading, bond rating and portfolio
management. The chapter gives an example of a data mining application used for stock
prediction, NetProphet. The author also quotes that “Data mining is the most important
application in financial services in 1996” [1]. The retail sector also makes use of data mining
technology. The main driver is that retailers have to cope with slim margins and so must find
ways to deal with competitors. Early adoption of data warehousing has given retailers a better
opportunity to take advantage of data mining. The main retail applications of data mining are in
direct marketing. Direct mail is another area where data mining is widely used; almost all types
of retailers use direct marketing, and their main concern is to have information about customer
segmentation, which in data mining is a clustering problem. Healthcare is also discussed as an
area that is making good use of data mining. The healthcare sector is so extensive (it can be
divided into medical research, biotech and the pharmaceutical industry, for example) that data
mining can be useful in finding relevant information. The author gives various examples of data
mining software that has been used in the healthcare sector: NeuroMedical Systems, Vysis,
which makes use of neural networks, and KnowledgeSeeker, which is used at the Oxford
transplant center. The last part of the chapter covers the use of data mining in the
telecommunications industry. As before, the main driver for using data mining was to achieve
competitive advantage over competitors following the deregulation of the industry.
Telecommunication companies need to understand their customers, keep them and model
effective ways to market new products to them.
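Since customer segmentation is described above as a clustering problem, a minimal sketch may help. It uses k-means, which is one common clustering technique and not necessarily the one used by any tool in the book; the customer records (annual spend, store visits per month) and the choice of two segments are hypothetical.

```python
def k_means(points, k, max_iterations=100):
    """Plain k-means on tuples of numbers; returns (centroids, labels)."""
    # Deterministic start: the first k points serve as initial centroids.
    centroids = [tuple(float(v) for v in p) for p in points[:k]]
    labels = []
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2
                                  for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if not members:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
                continue
            new_centroids.append(
                tuple(sum(dim) / len(members) for dim in zip(*members)))
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return centroids, labels


# Hypothetical customers: (annual spend, store visits per month).
CUSTOMERS = [
    (100, 2), (120, 3), (110, 1),
    (900, 10), (950, 12), (1000, 11),
]

centroids, labels = k_means(CUSTOMERS, 2)
print(centroids)
print(labels)
```

On this toy data the customers fall into a low-spend and a high-spend segment. With real data the features would normally be rescaled first, so that a large-valued column such as spend does not dominate the distance.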
Chapter 8
This last chapter talks about enabling data mining through data warehouses. The biggest
challenge for business analysts using data mining is knowing how to extract, integrate, cleanse
and prepare the data in order to solve the most pressing business problems. The chapter discusses
how data warehouses are used in the process of data mining: although they do not always have to
be in place for data mining to occur, they do provide a methodology for data integration and
preparation. The author lists vendors that offer data warehouse design for companies; some
examples are Sybase and LogicWorks. A distinction is also drawn between a transactional
database system and a data warehouse. The author points out that most DBMSs today are
transactional and optimized for inserting and updating information, not for decision support.
Data warehouses, on the other hand, are built specifically for decision support and add many
fields of information that transactional systems would not have. In fact, data warehouses can
integrate multiple transactional database systems. The author next gives examples of data models
that include transactional data, chosen to help distinguish the types of information valuable for
data mining studies; the examples cover all four [2] specific industries mentioned above. The
author ends by making an interesting point: it is a fallacy that all data must be in place before
data mining can take place. Instead, the author says, the process of model building should start at
some point, and over time the models will get better as the business is better understood. Data
mining is not bent on finding the million-dollar piece of information but on building the
foundation to model how your business is doing.
[1] Bank Systems and Technology, 1996.
My Input
Personally, I think this book by Groth helped me gain more insight into data mining as a whole.
It strikes a good balance between technical background and business application by describing
the theory behind what goes on in data mining software. The book would be of great help as an
introduction to data mining, or to a business analyst who wants to decide which software would
be best suited to their company. It gives an intensive review of three software applications,
KnowledgeSeeker, DataMind and Neural Works Predict, each of which takes a different
approach to tackling data mining problems. I also found the detailed description of the current
marketplace and its trends enlightening, along with the categorized listings of data mining
vendors and related software. The book comes with a CD-ROM that contains demos of the
software mentioned in the book. Unfortunately, I had a hard time getting them to work on my
computer. After some research on the internet, I found out that the CD packaged with the book
only works with Office 95, so I could not really test the demos or get a real hands-on feel for the
software by trying the tutorials. Other than that, I think the book is an excellent introductory
resource for the topic. Its timely coverage of techniques, issues and trends should prove quite
worthwhile for business professionals evaluating the potential of data mining. This is especially
true these days: given the enormous amount of data available, proper analysis of it would not be
possible without a tool like data mining. Data mining can be of utmost importance for all types
of industries in the marketplace and would certainly give companies that use it an edge over
those that do not. I definitely think data mining is an area that should be studied more, and there
is huge potential for growth in it.
[2] Banking, healthcare, retail, telecommunications.