Data Mining: A Hands-On Approach for Business Professionals
Data Mining: A Hands-On Approach, by Robert Groth
Reviewed by Mervyn Ng

Introduction:

Data mining is essentially the process of knowledge discovery, which dates back to the dawn of time. People attempted to perform data mining even before the term came into use. Data mining has seen a huge rise in popularity since the early 1990s and has been especially important in the financial sector; these days it is a trend across business sectors in general. This book, in summary, tries to explain the current field of data mining and discusses some popular tools on the market that could be of use to anyone considering data mining.

Until recently, data mining was largely an academic field and required computer systems that were out of the reach of most business analysts. In the past few years, several factors have helped make data mining accessible to business professionals:

1. The cost of personal computing power has decreased.
2. Innovations in data mining methodologies are making it more powerful and easier to understand.
3. Software vendors are making data mining available to the end user.

This is the first book by the author; it is devoted to business professionals and gives them an easy approach to learning data mining. It discusses how knowledge discovery is used in different industries and the software used by companies, especially desktop software, so as to reach the widest possible audience. The book also includes sample studies of specific industries such as retail, banking and insurance.

Another interesting aspect of this book is the hands-on approach it offers readers. The book includes tutorials for KnowledgeSeeker, NeuralWare and DataMind, so readers who own this software can learn to use it while reading.
Next I will summarize the eight chapters of this book.

Chapter 1

The first chapter gives a brief definition of data mining and of its different types. The author notes that in articles he has read he found about eight different types of data mining, and that some data mining algorithms are more appropriate than others in certain fields. What really matters, as the author puts it, is that business professionals choose the tool that best suits their needs and that they can understand. They should also choose models that can be built in a timely manner, which requires the data to have “good performance” attributes.

This book covers three fundamental approaches to data mining:

1. Classification studies, or supervised learning.
2. Clustering studies, or unsupervised learning.
3. Visualization studies.

Classification studies set a clear goal and build an appropriate model for it from historical data. Clustering studies group rows of data that share similar trends and patterns. Finally, visualization studies are simply the graphical presentation of data, an approach used today in most query tools. Representing data graphically often brings out points that the common user would not normally see.

Chapter 1 also covers why data mining is used and describes some of its uses. In direct marketing, data mining is used to find the people most likely to buy certain products, which saves on marketing expenditure. Data mining is also used in trend analysis: understanding trends in the marketplace can bring a strategic advantage to the company, as data mining can help reduce costs and time to market.
The other major use of data mining is forecasting financial markets; data mining is used very extensively to model financial markets, and finance is one of the industries where the technique is most used.

The next topic of the chapter is how data is mined. The author mentions five main steps in data mining: data manipulation, defining a study, reading the data and building a model, understanding the model and, lastly, making predictions. He emphasizes the importance of clean data, which essentially means data that is relevant to the area being analyzed, so properties such as consistency should be observed. When reading the data and building a model, there may be “noise”, that is, errors and anomalies that appear in the data mining process. Much work has gone into designing filters to soften the impact of noise in data sets and to improve the overall accuracy of the model.

After an appropriate model has been built, several aspects of it should be considered:

1. Model summary.
2. Data distribution.
3. Differentiation (an input should predict one outcome much better than others).
4. Validation (making predictions using an existing model and comparing the results).

In the last part of the chapter the author gives an overview of the different types of data mining models. They include decision trees, genetic algorithms (a method of combinatorial optimization based on processes in biological evolution), neural nets (built on the concept of an “artificial neuron”, which mimics the process of a neuron in the human brain) and hybrid models (combinations that use different modeling techniques, such as a hybrid algorithm, which is a single algorithm that makes use of several features). The author gives an extensive definition of each, but given the scope of this book review I will not go into too much detail.
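As a rough illustration of the validation step listed above (making predictions with an existing model and comparing them with known results), the following Python sketch holds back part of a small, made-up data set, builds a deliberately trivial one-rule model on the rest, and measures accuracy on the held-back rows. The records, the field names and the one-rule model are all assumptions made for illustration; none of them come from the book or from the tools it covers.

```python
from collections import Counter, defaultdict

# Hypothetical historical records: (age_band, responded_to_mailing).
RECORDS = [
    ("young", "no"), ("young", "no"), ("young", "yes"),
    ("middle", "yes"), ("middle", "yes"), ("middle", "no"),
    ("senior", "yes"), ("senior", "yes"), ("senior", "yes"),
    ("young", "no"), ("middle", "yes"), ("senior", "no"),
]


def build_model(rows):
    """One-rule model: for each input value, remember its most common outcome."""
    outcomes = defaultdict(Counter)
    for value, outcome in rows:
        outcomes[value][outcome] += 1
    return {value: counts.most_common(1)[0][0] for value, counts in outcomes.items()}


def predict(model, value, default="no"):
    """Predict an outcome for one input value; unseen values get the default."""
    return model.get(value, default)


def validate(model, rows):
    """Compare predictions with known outcomes and return the share that match."""
    hits = sum(1 for value, outcome in rows if predict(model, value) == outcome)
    return hits / len(rows)


# Build the model on the first nine rows, then validate on the three held back.
training_rows = RECORDS[:9]
holdout_rows = RECORDS[9:]
model = build_model(training_rows)
accuracy = validate(model, holdout_rows)
print(model)
print(f"holdout accuracy: {accuracy:.2f}")
```

The point of holding rows back is that the accuracy figure then says something about rows the model has not seen, which is closer to how it would be used for real prediction.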
Chapter 2:

Chapter 2 explains the data mining process in much greater detail, using examples and stepping through the different stages of data mining. An interesting point is made about accessing data warehouses. Data mining is often mentioned as an aftermarket for data warehouses, not because data mining requires a data warehouse but because taking the time to build such decision support systems forces companies to undertake the task of bringing all their disparate data together. An interesting trend in data mining is the integration of data warehousing databases directly with data mining tools. Even if data is not in a data warehouse, it can be accessed directly from a relational, transaction-based database through ODBC or the other connectivity standards that most databases offer nowadays. Using relational databases instead of data warehouses for data mining increases the chance of unclean data, which in turn increases the need for more data preparation. Some data quality issues are also raised in this chapter: data is rarely 100% clean, and data mining is at best as good as the data it represents.

Defining a study is the second step in the data mining process. The scope of the study is very important; it involves understanding the limits of a study, choosing good studies to perform, determining the right elements to study and understanding sampling. The author goes on to define the types of studies that can be done in data mining: profiling customer habits and customer demographics, time dependence studies, retention management, risk forecasting, profitability analysis, data trend analysis, employee studies and regional studies.
After choosing a study, the data miner has to read the data and build the model; as mentioned, the model must be both accurate and understandable. Finally, the author talks a little about the prediction part of data mining. The process of prediction is straightforward: given a set of inputs, a prediction is made about a certain outcome. While the validation process also uses prediction, it really compares known results with the predictions made in order to calculate an accuracy level. With true prediction, the outcome to be predicted is not known.

Chapter 3

This chapter provides insight into the data mining market as it was when the book was written. It covers trends, data mining vendors, visualization and data sources for mining. It mentions that EIS and query vendors are integrating data mining with traditional query and decision support tools. In the past, query and EIS tools required end users to formulate questions in order to get interesting answers, an assumption-based process. Integrating data mining with query and EIS tools will enable a discovery-based process, in which an end user can be told the most interesting things to look at and then formulate questions based on the new information. OLAP vendors have also been announcing their interest in including data mining tools in their products. The author gives a list of data mining vendors; some examples are Angoss, Attar Software, Business Objects, Cognos, DataMind Corporation and IBM. He mentions several more that I will not include in this write-up.

The next part of the chapter is visualization. Since pictures often represent data better than reports or numbers, data visualization is clearly yet another way to mine data. Data visualization tools go well beyond two-dimensional data mapping, and many visualization techniques that were once available only on high-powered servers are moving into the end user market.
The book also covers data sources for mining. The information available about what people buy, where they live, how much they earn and what types of hobbies they have can be astonishing. As the author puts it, this type of information not only exists but is readily available. He mentions some vendors who sell it; examples include Acxiom, CACI Marketing Systems, Claritas, Harte Hanks, AC Nielsen and the Polk Company.

Chapter 4:

Chapter 4 gives a thorough analysis of the data mining software KnowledgeSeeker, which uses the decision tree approach to data mining. Part of the chapter familiarizes the reader with decision trees, and the rest gives a hands-on introduction to the software itself. The author mentions that the software makes use of two well-known decision tree algorithms, CHAID and CART. CHAID is used to study categorical data, such as the states in a country or gender. CART, on the other hand, works with continuous dependent variables such as monthly expenses. There are many more decision tree algorithms, but KnowledgeSeeker uses only those two.

The next part of the chapter is a step-by-step tutorial on the software. Its example profiles people with low, high and normal blood pressure; the decision tree is then grown to include information about the population of smokers and how smoking relates to blood pressure. A tool such as KnowledgeSeeker can be used cross-country for such an experiment. Data can be grouped in optimal ways, which is very useful for market segmentation studies. Decision trees help you not only to discover brand new insights but also to confirm new trends and patterns.

Chapter 5

This chapter goes into detail on a different software product, DataMind, which focuses on customer relations, retention and management applications for business.
In other words, marketing professionals can use the findings from the software to target their campaigns more efficiently and to retain customers before they leave for a competitor. The software allows for the analysis of the large volumes of data found today in data warehouses. The technology is built on the concepts of impact, conjunctions and differentiation, which offer the ability both to understand a model and to use it for prediction. This makes the technology better suited to integrating model understanding with prediction. It also has a feature called “The Agent Network Technology”, which is very fast at building models. While going through the tutorials, I found that the interesting part of DataMind was the discovery views option, which can build three different kinds of sub-reports (conjunctions, specific and irrelevant criteria, and impacts). These sub-reports narrow down the amount of information the decision maker needs in order to make a decision. The main advantage of the software is that it offers many different views of the models being built, and the reports are in Excel or Word format, so they can be saved, manipulated and printed.

Chapter 6

This chapter steps through the process of data mining with a leading software product that uses a neural network approach. The software in question, Neural Works Predict, is quite distinctive in how it makes the product understandable to business professionals. Like DataMind, Neural Works uses Excel as an interface to make users more comfortable. The software can also be integrated into other applications written in C or Visual Basic. The author also describes neural networks: essentially, they attempt to mimic the process of a neuron in the human brain, with each link described as a processing element. These networks detect patterns in data, generalize from those patterns and predict outcomes.
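To make the idea of a processing element more concrete, here is a minimal Python sketch of a single artificial neuron whose connection weights are nudged toward known outcomes. It is not how Neural Works Predict is implemented (the review gives no such detail); it is a textbook perceptron, and the logical AND function stands in as hypothetical training data. The names `activate`, `train` and `AND_SAMPLES` are illustrative only.

```python
def activate(weights, bias, inputs):
    """One processing element: weighted sum of the inputs, then a step function."""
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > 0 else 0


def train(samples, epochs=10, rate=1):
    """Perceptron learning rule: shift each weight in proportion to the error."""
    weights = [0, 0]
    bias = 0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - activate(weights, bias, inputs)
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
    return weights, bias


# Hypothetical training data: the logical AND of two binary inputs.
AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train(AND_SAMPLES)
predictions = [activate(weights, bias, inputs) for inputs, _ in AND_SAMPLES]
print(weights, bias, predictions)
```

A single element like this can only separate very simple patterns, which is why practical networks connect many of them together.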
An interesting strength of neural networks is that they are especially noted for their ability to predict complex processes. A processing element in a neural network processes data by summarizing and transforming it using a series of mathematical functions. One processing element is limited in ability, but when connected to form a system, the neurons or processing elements create an intelligent system. That system can be retrained over thousands of iterations to fit more closely the data it is trying to model.

The next section gives a step-by-step demonstration of how Neural Works Predict can be used. The part about training a neural network was very interesting. Processing elements are linked to inputs and outputs, and training the network involves modifying the strength, or weight, of the connections from the inputs to the outputs. The strength of a connection is increased or decreased according to its importance for producing the proper outcome. The mathematical method used to adjust the weights is called a learning rule. Training continues until the network produces outcome values that match the known outcome values within a specified accuracy level, or until it satisfies some other stopping criterion. This chapter gave the reader more insight into neural network approaches to data mining and a basic understanding of Neural Works Predict.

Chapter 7:

Chapter 7 summarizes the typical industries whose companies use data mining as a tool for making decisions. The industries covered are banking and finance, retail, healthcare and, of course, telecommunications. Banking and finance have made extensive use of data mining in modeling and predicting credit fraud.
It is also used in evaluating risk, in trend analysis, in analyzing profitability and in marketing campaigns. The author also mentions that neural networks are used in the financial markets to help with stock price forecasting, options trading, bond rating and portfolio management. The chapter gives an example of a data mining application used for stock prediction, NetProphet. The author also quotes the statement that “Data mining is the most important application in financial services in 1996”. [1]

The retail sector also makes use of data mining technology. Its main driver is that retailers have to cope with slim margins and so must find ways to deal with competitors. Early adoption of data warehousing has given retailers a better opportunity to take advantage of data mining. The main retail applications of data mining are in direct marketing. Direct mail is another area where data mining is widely used: almost all types of retailers use direct marketing, and their main concern is to have information about customer segmentation, which in data mining is a clustering problem.

Healthcare is also discussed as an area that is making good use of data mining. The sector is so extensive (it can be divided into, for example, medical research, biotech and the pharmaceutical industry) that data mining can be useful in finding relevant information. The author gives various examples of data mining software used in healthcare: NeuroMedical Systems; Vysis, which makes use of neural networks; and KnowledgeSeeker, which is used in the Oxford transplant center.

The last part of the chapter covers the use of data mining in the telecommunications industry. As before, the main driver was to achieve competitive advantage over competitors following the deregulation of the industry.
There is a need to understand customers, to keep them and to model effective ways of marketing new products to the customers of telecommunication companies.

Chapter 8

This last chapter talks about enabling data mining through data warehouses. The biggest challenge for business analysts using data mining is knowing how to extract, integrate, cleanse and prepare the data in order to solve the most pressing business problems. The chapter discusses how data warehouses are used in the data mining process: although they do not always have to be in place for data mining to occur, they do provide a methodology for data integration and preparation. The author lists vendors that offer data warehouse design for companies; some examples are Sybase and LogicWorks. A distinction is also drawn between a transactional database system and a data warehouse. The author points out that most DBMSs today are transactional and optimized for inserting and updating information, not for decision support. Data warehouses, on the other hand, are built specifically for decision support and add many fields of information that transactional systems would not have. In fact, data warehouses can integrate multiple transactional database systems.

[1] Bank Systems and Technology, 1996.

The author next gives examples of data models that include transactional data, deliberately chosen to help distinguish the types of information valuable for data mining studies; the examples cover all four [2] of the specific industries mentioned above. The author ends by making an interesting point: there is a fallacy about data mining, namely that all the data must be in place before it can begin. The author says instead that the process of model building should start at some point, and that over time the models will get better as the business is better understood.
Data mining is not bent on finding the million-dollar piece of information but on building the foundation to model how your business is doing.

My Input

Personally, I think this book by Groth gave me more insight into data mining as a whole. It strikes a good balance between technical background and business application by describing the theory of what goes on inside the data mining software. The book would be of great help as an introduction to data mining or to a business analyst who wants to decide which software would best suit their company. It offers an intensive review of three software applications, KnowledgeSeeker, DataMind and Neural Works Predict, each of which has a different method of tackling data mining problems. I also found the detailed description of the current marketplace and trends enlightening, along with the categorized listings of data mining vendors and related software.

Enclosed with the book is a CD-ROM containing demos of the software mentioned in the book. Unfortunately, I had a hard time getting them to work on my computer; after some research on the internet I found out that the CD packaged with the book only works with Office 95, so I could not really get a hands-on feel for the software by trying the tutorials. Other than that, I think the book is an excellent introductory resource on the topic. Its timely coverage of techniques, issues and trends should prove quite worthwhile for business professionals evaluating the potential of data mining. This is especially so these days: with the outstanding amount of data that is available, proper analysis of it would not be possible without a tool like data mining.
Data mining can be of the utmost importance for every type of industry in the marketplace and would certainly give companies that use it an added edge over those that do not. I definitely think data mining is still an area that should be studied more, and there is huge potential for growth in it.

[2] Banking, healthcare, retail, telecommunications.
