Data Mining: A Tool for Knowledge Discovery

COLLEGE OF MANAGEMENT IN TRENČÍN

USING DATA MINING AS A TOOL FOR DISCOVERING IMPORTANT KNOWLEDGE FOR COMPANIES

Bachelor Thesis

Study program: Knowledge Management
Workplace: College of Management, Bratislava
Thesis advisor: Tomáš Vanek
Consultant: Martina Česalová, M.S.C.S.

Trenčín 2010
Tomáš Vanek

Content

1. Introduction and Problem Statement
2. Review of Literature
3. Description of the Methodology
4. Data mining overview
4.1. Principles of Data Mining
4.1.1. Definitions of Data Mining
4.1.2. History of Data Mining
4.1.3. The Evolution and the Future of Data Mining
4.1.4. Disadvantages of Data Mining
4.2. Data Warehousing
4.3. Knowledge Discovery Process and Data Mining
4.3.2. CRISP-DM model
4.3.2.1. Business understanding
4.3.2.2. Data understanding
4.3.2.3. Data preparation
4.3.2.4. Modeling
4.3.2.5. Evaluation
4.3.2.6. Deployment
5. Practical Project – Data Mining in Banking Domain
5.1. Business Understanding
5.2. Data Understanding
5.3. Data Preparation
5.4. Modeling
5.5. Evaluation
5.6. Deployment
5.7. Project Conclusion
Thesis Conclusion
List of Pictures
List of Figures
Literature

List of abbreviations

CRISP-DM - Cross-Industry Standard Process for Data Mining

Acknowledgements

I would like to thank Martina Česalová, M.S.C.S., for her patience and advice during the writing of this thesis.

1. Introduction and Problem Statement

The power of information is a very important factor in today's businesses. The spread of information technology has meant that large amounts of data from many different areas are collected and stored. Data are stored every time a person accesses a web page, purchases a product, or makes a phone call. These data contain hidden information that can be very important. Data mining is a tool for analyzing such data and extracting useful, previously unknown, and interesting information. It is used mostly by companies that collect and store large amounts of data. Mining the data allows them to gain essential knowledge and use it to their benefit. Data mining thus represents quite a new and unique technology that can provide numerous advantages.

The objective of the thesis is to offer general information about the topic. The thesis consists of a theoretical and a practical part. Specifically, the theoretical part informs the reader about the basic principles of data mining.
It starts by explaining how the revolution in information technology forced, and still forces, scientists to develop data mining technology. The thesis then gives some general examples of why data mining can be considered a gold mine for some companies. To provide a balanced view of the technology, the thesis covers its advantages as well as its disadvantages. It also discusses the history of data mining and examines predictions for its future. The end of the theoretical part focuses on the six steps of a standardized model called CRISP-DM, which is used for data mining projects.

The practical part of the thesis proposes a data mining project applied in the financial sector. The goal of the project is to help a bank segment its customers by using data mining. The entire project is divided into the six steps of the CRISP-DM model. The project covers the business opportunity, describes the data used, introduces the model, and suggests how the proposed solution could be deployed. As a final result, the bank can use the segmentation to improve its decision making and to introduce new services.

2. Review of Literature

Various sources were used during the writing of the thesis, and great effort was made to use different types of sources. The collected information comes mainly from printed and internet sources.

The first, theoretical, part of the thesis is primarily based on two books. Most of the information is taken from the book “Introduction to Data Mining and its Applications” written by Dr. S. Sumathi and Dr. S.N. Sivanandam. Both are professors at College of Technology in India and therefore experts in their field. At the beginning, the book provides a very clear and general introduction to the field. The information in the book is presented at great length, so summarization has been used quite often.
The second book used in the theoretical part is “Data Mining - A Knowledge Discovery Approach”, written by four authors: Krzysztof J. Cios, Witold Pedrycz, Roman W. Swiniarski, and Lukasz A. Kurgan. All four authors work at different universities across the USA and Canada. As its name says, the book focuses mainly on knowledge discovery through data mining and is therefore very suitable for the thesis. A few internet sources have also been used. For example, to describe the business opportunities that data mining offers, a YouTube video by Dr. S. Srinath from the Indian Institute of Technology has been used.

In the second, practical, part of the thesis an internet source has been used to describe the credit scoring method. Information about credit scoring has been taken from the website myFICO. This website has been on the market since 2001 and deals primarily with credit risk scoring for the finance segment, so it can be considered a relevant source. To complete the practical part of the project, the book “Dobývání znalostí z databází”, which can be translated as “Gaining the Knowledge from Databases”, has been used. The book is written by Doc. Ing. Petr Berka, who works at the University of Economics in Prague. The practical part of the thesis could not have been done without this book because it deals not only with theory but also with practical demonstrations of data mining methods.

3. Description of the Methodology

The evaluation method has been used in the thesis. Many sources have been collected and analyzed to gain knowledge about data mining. The most important topics have then been researched again and presented in the thesis. Examples were used to highlight the importance of data mining and knowledge discovery in today’s competitive market environment.
Moreover, the theoretical knowledge gained was applied in the practical part of the thesis, which shows how data mining can be used in a banking environment.

4. Data mining overview

4.1. Principles of Data Mining

The enormous amount of data that is nowadays created, used, and stored on a daily basis has created demand for a new tool that can help analyze these massive data sets: a tool that turns stored data into useful knowledge that human beings can easily understand. Traditional techniques for analyzing data were very useful and solved many problems. These techniques mostly used statistics and could therefore extract only certain characteristics of the data. This limitation, and the need for a new data analysis tool, led scientists to start collecting ideas for a machine learning tool. This effort led to a new research area called data mining and later to a research area called data mining and knowledge discovery. None of this would have been possible without the computer revolution. (Sumathi & Sivanandam, 2006)

People have experienced a revolution in the availability of information, especially during the last decade, when the Internet and network-based systems allowed the global exchange of information. E-commerce businesses have experienced great growth, and companies have started to collect more and more electronic information. More importantly, technology and market opportunities led companies to collect and use the right data. This means that they started to analyze the collected data rather than collect it without further use. Soon many companies realized that “tracking, accounting for, and archiving the activities of an organization, this data can sometimes be a gold mine for strategic planning, which recent research and new businesses have only started to tap” (Sumathi & Sivanandam, 2006).
So, with support from scientists and demand from commercial domains, data mining has ideal conditions to grow and develop. (Sumathi & Sivanandam, 2006)

The data mining concept could not have grown so fast without database technology, which was widely used in the business environment with great success. Organizations started to create very large databases that reach capacities in the terabytes. These databases hold business data such as “consumer data, transaction histories, sales records, etc.” (Sumathi & Sivanandam, 2006) that are very likely to contain important and valuable information. This business information is of course hidden in the data and needs to be extracted. The extraction can be done successfully by using a proper mining method. (Sumathi & Sivanandam, 2006)

Data mining represents a promising tool that can be described as “the process of discovering meaningful new correlation, patterns, and trends by digging into (mining) large amounts of data stored in warehouse, using statistical, machine learning, artificial intelligence (AI), and data visualization techniques” (Sumathi & Sivanandam, 2006). Many industries already mine data, for example the aerospace, medical, and chemical industries, and because the technology is still quite new the number of industries is still increasing. This is not only because of its impact on science, but also because of its business value. (Sumathi & Sivanandam, 2006)

In terms of business value, data mining can be compared to a gold mine. From a business point of view, data mining can be quite a beneficial and unique asset. There are many benefits that data mining can have for a company or for a business in general. Let’s look at a few concrete examples that may motivate managers or business owners to invest in this technology.
Data mining can:
Influence decision making
Grow wealth
Help to analyze
Improve security

Decision making is an important process in running a company. Data mining can reveal patterns in historical data and can therefore lead to new knowledge. For example, by analyzing a company’s data, hidden patterns that repeat can be recognized. With this knowledge from the past, the company can learn something new and act accordingly. We can therefore say that data mining can influence decision making. This is very important because making strategic decisions is necessary for every company that wants to stay in today’s competitive market. (Srinath, 2008)

Making good decisions is also connected with growing wealth. If data mining helps a company make the right strategic decisions, it can logically also improve the company’s financial situation. Moreover, mining data grows the wealth of information that the company has. The information can be used in many different ways, for example in product development, marketing, or investment. So we can say that by using data mining a company gains important knowledge. The knowledge gained can later be transformed into strategic decisions that strengthen the company’s financial position and therefore grow its wealth. (Srinath, 2008)

As was mentioned, data mining can reveal patterns from history and therefore help to analyze trends. Trend analysis can be used, for example, in the stock market. By mining data, stock exchange companies can analyze the historical price of a stock and predict its future price. Risk analysis can also be very interesting for companies. Exploring and analyzing data helps companies that operate in the financial sector to evaluate customers. As will be proposed in the practical project, a bank can mine its data and divide good customers from bad ones.
The bank can therefore analyze the risks before offering any service to a particular customer. Overall, mined data can offer different kinds of information that can be used for analysis. (Sumathi & Sivanandam, 2006)

Lastly, data mining has only recently started to be used for maintaining security. It is quite a new field that involves mining data to discover activity that may be illegal. (Srinath, 2008) In 2008, data mining was successfully used to help uncover the biggest scandal in online gambling history. In short, a few poker players were accused of cheating on a poker site that was part of the Ultimate Bet network. Online poker players who had become victims of the cheaters used data mining to analyze the situation. They came to the conclusion that it is statistically almost impossible to win so much money in such a short time and contacted the company. It turned out that the cheaters had somehow bypassed the security systems and were able to see their opponents’ cards, which is a tremendous advantage in the game of poker. So in this case, known as the Ultimate Bet scandal, data mining helped to detect fraud and maintain security. (Brunker, 2008)

Obviously, these are just a few of the reasons why data mining benefits business owners and companies. There are more that are also very important. Other common uses of data mining that were not discussed are listed below with short descriptions:

Market segmentation: Finding characteristics that are common to customers who purchased the same or similar products.
Customer churn: Identifying customers who are likely to leave the current company and go to a different one.
Direct marketing: Identifying a specific group of customers and sending mail to them to achieve a high response rate.
Interactive marketing: Determining which information or product a customer was interested in when browsing a web page.
Analysis of market basket: Identifying products and services that have a high probability of being purchased together.
(Sumathi & Sivanandam, 2006)

4.1.1. Definitions of Data Mining

Data mining can be defined in many ways. The following definitions have been picked from different sources:

“Data mining is the efficient discovery of valuable, nonobvious information from a large collection of data.” (Sumathi & Sivanandam, 2006)

“The aim of data mining is to make sense of large amounts of mostly unsupervised data, in some domain.” (Cios, Pedrycz, Swiniarski, & Kurgan, 2007)

“…is the process of analyzing data from different perspectives and summarizing it into useful information” (Palace, 1996)

“It is the process of extracting previously unknown, valid, and actionable information from large databases and then using the information to make crucial business decisions.” (Sumathi & Sivanandam, 2006)

4.1.2. History of Data Mining

As was already mentioned, data mining is quite a young and groundbreaking tool that does not have a very long history. It has recently been the subject of many business and software magazines. Even though its importance is now widely recognized, a few years ago not many people were familiar with the term data mining. The term itself was first introduced in the 1990s. Data mining can be traced back to three family roots. (Data Mining Software, n.d.)

The most important root is statistics. Classical statistical concepts like “regression analysis, standard distribution, standard deviation, standard variance, discriminant analysis, cluster analysis, and confidence intervals” (Data Mining Software, n.d.) are used in data mining when studying data and the relationships within them. Even though today’s data mining uses more advanced analysis, we can still say that the core of data mining is built with the help of basic statistical tools and techniques. So without statistics, data mining would certainly not exist.
(Data Mining Software, n.d.)

The second root of data mining is artificial intelligence. Artificial intelligence basically applies human-thought-like processing to statistical problems. This of course requires computer processing power, so it was not practical until the early 1980s. In the early 1980s computers became widely accessible and people could buy processing power at quite reasonable prices. Later, as computers became faster and cheaper, data mining grew even faster. Supercomputers, with their great processing power, also made it possible to study and analyze large amounts of data. Overall, the biggest advantage of artificial intelligence was that it allowed data to be processed faster and more precisely than humans could. (Data Mining Software, n.d.)

The last root is the combination of statistics and artificial intelligence. This union is known as machine learning. Because computers became cheaper and faster in the 1980s and 1990s, machine learning evolved quickly. More applications were released because machine learning became more accessible than artificial intelligence had been. Machine learning is in fact considered an advancement of artificial intelligence. Its main advance is the ability of computer programs to learn about the data they study. This allows programs to make decisions based on the knowledge gained from the data, using statistics and advanced algorithms to achieve their goals. (Data Mining Software, n.d.)

In one sentence, the short history of data mining can be described “as the union of historical and recent developments in statistics, AI, and machine learning” (Data Mining Software, n.d.).

4.1.3. The Evolution and the Future of Data Mining

According to Dr. Sumathi and Dr. Sivanandam, the evolution of data mining was a natural process caused by the increased use of information technologies.
As a matter of fact, the growth of information technologies went along with growth in the amount of data being used. Logically, larger amounts of data had to be stored and analyzed. Traditional methods, such as creating queries and reports, could not handle large amounts of data, so data mining started to be developed and widely used. Data mining soon came to be considered a tool with great future potential. (Sumathi & Sivanandam, 2006)

The future of data mining can be described as very bright. As was already mentioned, the full potential of data mining is not yet used and the concept is still being developed. In the near future, data mining will penetrate more businesses. It will logically become a very profitable and valuable tool in many areas. Many markets could be heavily influenced by data mining, but probably the most significantly influenced will be the advertising market. Data mining will allow advertisers to explore unique niches, which would attract a wide range of new customers. Moreover, data mining will become available to the general public and easier to use. Not only experts in the field will be able to benefit from data mining; with user-friendly applications and tools, the technology would be as easy to use as e-mail. The general public might be able to find the lost numbers of classmates, or the best loan in the area, within a short period of time. (Sumathi & Sivanandam, 2006)

In the long term, data mining can do a lot for us. The changes and challenges are really exciting and groundbreaking. For example, by applying data mining in medicine, we might be able to discover new treatments and practices for illnesses that we cannot cure so far. (Sumathi & Sivanandam, 2006)

4.1.4. Disadvantages of Data Mining

It should now be clear that data mining is a very valuable tool that can offer quite unique benefits to companies operating in different businesses. Even though the technology cannot literally harm anyone, the purpose of this part is to identify possible drawbacks. At the moment, data mining does not have any primary disadvantage that would raise concerns among companies willing to invest in this technology. Some scientists and experts, however, have raised a few questions about disadvantages that may occur. In the future, the main disadvantages likely to be connected with data mining are privacy and security. (Chhay, 2005)

The technology boom has made privacy a major concern among people. Technology allows people to do everyday tasks more easily, quickly, and comfortably. But it is the same technology that makes people more sensitive about their privacy, because most of the technological tools in use are able to track and store a person’s private information. Whenever somebody makes a phone call, pays with a credit card, visits a web page, or books a flight ticket, data are collected. This kind of data is already stored in the databases of many companies. But what if all this information were collected together? Collecting all the data from different sources is the real concern. By analyzing these data it would be possible to tell a lot about individuals. Even though each country has different privacy rights, it is generally illegal to sell or exchange customers’ private data between organizations. However, this kind of transaction is hard to control. As Heng Chhay wrote, “…in 1998, CVS had sold their patient’s prescription purchases to a different company” (2005). Selling information about customers without their knowledge is definitely a violation of privacy.
(Chhay, 2005)

Security is another main issue that has occurred and will remain a disadvantage in the future. Companies collect information about customers, but many of them do not have appropriate security measures. There have therefore been many cases in which data were accessed and misused. For example, a company called Ford Motor Credit had to apologize to 13,000 of its customers because “their personal information including Social Security number, address, account number and payment history were accessed by hackers who broke into a database” (Chhay, 2005). As a result, the company lost its reputation. Companies should therefore always think about the safety of their data because underestimating security measures can lead to disaster. (Chhay, 2005)

4.2. Data Warehousing

Even though data warehousing may not seem to play an important role in data mining, the opposite is true. It is very important to cover and understand data warehousing concepts because data warehousing is closely connected with data mining. Data warehousing can be defined as “a process of centralized data management and retrieval” (Sumathi & Sivanandam, 2006). Like data mining, data warehousing is quite a new concept. It is important to know that a data warehouse is not software or hardware, but is better defined as an environment: one that allows companies or corporations to store their data in relational database systems. These systems are designed to deliver a high level of performance and to support large databases. To make this clear, we can say that data warehousing and data mining are two technologies that work very well together, because data warehousing provides the memory and data mining the intelligence. (Sumathi & Sivanandam, 2006)

Any organization that creates and stores a lot of data faces the problem of turning these data into valuable information.
This information is usually unknown, but it is present in the data that already exist and are stored. To extract information from the data and thus turn data into knowledge, certain steps need to be applied. For example, the data need to be stored in a certain form and organized so that mining can be applied. (Sumathi & Sivanandam, 2006)

The primary purpose of data warehousing is to allow end users to search for information that would support, for example, their strategic decision making. End users can access and interact with data warehouses through front-end tools. These access tools can be divided into five main groups:

“1. Data query and reporting tools
2. Application development tools
3. Executive information system (EIS) tools
4. Online analytical processing tools and
5. Data mining tools” (Sumathi & Sivanandam, 2006)

4.3. Knowledge Discovery Process and Data Mining

To understand the process of extracting valuable information from data stored in databases, the knowledge discovery process needs to be briefly explained. It can be described as “the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” (Cios, Pedrycz, Swiniarski, & Kurgan, 2007). So, what is the basic difference between data mining and the knowledge discovery process? Data mining is just one of the many steps that the knowledge discovery process covers. The basic knowledge discovery process can be seen in Figure 1 below.

Figure 1: Knowledge discovery process model
Source: Cios, Pedrycz, Swiniarski, & Kurgan, 2007

As can be seen, the model has an input that represents data and an output that represents knowledge. The input is defined as the data that are going to be analyzed. The type of data of course differs depending on the project.
However, the input data can typically include “numerical and nominal data stored in databases or flat files; images; video; semistructured data, such as XML or HTML” (Cios, Pedrycz, Swiniarski, & Kurgan, 2007). The collected data then go through a number of steps that are interconnected by feedback loops. The result, as can be seen in Figure 1, is the final knowledge. All in all, the knowledge discovery process can be defined as a process that helps to change data into useful knowledge by applying patterns and algorithms. (Cios, Pedrycz, Swiniarski, & Kurgan, 2007)

4.3.2. CRISP-DM model

A lot of effort has been made to create a model that defines the process and phases of data mining projects. One example is the model of Cabena et al. (Cios, Pedrycz, Swiniarski, & Kurgan, 2007), which consists of five steps and is supported by IBM. Another is the CRISP-DM model, which consists of six steps and has become the most popular and leading model. This model will therefore be explained in detail.

CRISP-DM stands for Cross-Industry Standard Process for Data Mining. It was introduced in the 1990s by a European consortium of companies as a free-to-use data mining model. (Hunter, 2009) CRISP-DM was developed to create a standard process for data mining projects, because data mining was quite new and nobody followed any particular process or guide when developing data mining projects. The process is very flexible and can be used in a variety of industries and with a variety of data mining software. The CRISP-DM process is very valuable because it makes data mining projects faster, cheaper, more efficient, and more reliable. The CRISP-DM model consists of six unique steps or phases, as can be seen below in Figure 2. (Cios, Pedrycz, Swiniarski, & Kurgan, 2007)

Figure 2: Cross-Industry Standard Process model
Source: Crisp-dm, n.d.

4.3.2.1. Business understanding

As can be seen in Figure 2, the Cross-Industry Standard Process model starts with business understanding. This first phase is very important because it defines the primary goals, that is, the main purpose of the whole data mining project. It must be specified what we want to know or learn from the available data that we are going to explore. It is also important to set what questions the project should answer and what business value the project holds. In the business understanding phase, the project goal must be set along with specific, measurable criteria for project success. This initial phase gives the whole project its direction. Without clearly defined objectives, the project can lose its direction and its initial purpose, and fail. (Hunter, 2009)

4.3.2.2. Data understanding

The second phase starts with collecting the data that already exist. Data understanding can also be described as familiarization with the data. This phase requires a domain expert, who explores interesting data and detects possible data problems. The data are explored according to the business needs specified in the previous phase, either through graphic visualization or through a statistical approach. Moreover, in this step the domain expert starts to look at basic relationships within the available data. As can be seen in Figure 2, business understanding and data understanding are interconnected. This interconnection exists because finding relationships in the data can change the business understanding. For example, we may find out that the data do not contain enough information to satisfy the primary goal set in business understanding. The goal then needs to be changed or replaced because we would not be able to achieve it.
In other words, during the first and second phase, the hypothesis and goals of the project are shaped into their final version (Hunter, 2009).

4.3.2.3. Data preparation

The data preparation phase is usually the most time consuming. In some cases it can take more than 80% of the project’s scheduled time. The time needed is usually influenced by the quality of the available data. If the raw data are messy, it can take a lot of time to sort them out. For example, some attributes and variables can be incorrect or missing. During data preparation, the final data set that is going to be used is created by selecting the needed data. Moreover, during this phase the data need to be cleaned into a form that is suitable for the purpose of the project (Hunter, 2009).

4.3.2.4. Modeling

In this phase, a wide range of modeling techniques is selected and used. Several models are applied to the same data mining problem and are later tuned for optimal output. As can be seen in Figure 2, the data preparation and modeling phases are interconnected. The interconnection exists because some models require a specific form of input data; therefore a step back into the previous phase is often necessary. After the data are cleaned and modified, the algorithms can be run again (Hunter, 2009). The modeling stage is divided into four parts: “selection of modeling technique(s), generation of test design, creation of models, and assessment of generated models” (Cios, Pedrycz, Swiniarski, & Kurgan, 2007).

4.3.2.5. Evaluation

After the model or models have been created, they need to be reviewed and the best model is chosen. The candidate models are evaluated against the project’s business objectives, and the chosen model or models need to satisfy the set goals. It is essential to find out whether all the business goals have been considered. Additionally, in this step it is important to decide how to use the model or collection of models (Hunter, 2009).

4.3.2.6. Deployment

By this final phase, the model that is suitable for the data mining project must be known. The deployment phase uses the chosen model to score the data. However, this phase may not be the last one. Sometimes it is better to step back to the third phase and add more data to achieve better results (Hunter, 2009). The deployment phase is divided into: “plan deployment, plan monitoring and maintenance, generation of final report, review of the process substeps” (Cios, Pedrycz, Swiniarski, & Kurgan, 2007).

5. Practical Project – Data Mining in Banking Domain

After covering the theory, it should be clear what data mining is and what the pros and cons of the tool are. However, to understand the issue better it is always good to apply theoretical knowledge in practice. Therefore, in this part the gained knowledge is going to be exercised in practice. The goal of the practical project is to apply data mining concepts and possibly solve some problem or fulfil some need.

As was previously mentioned, data mining can be used in many different areas. As a student of a business school, I have decided to apply data mining to the finance sector. To be specific, the practical project will deal with the banking domain. The primary goal of the project is to help a bank to improve its services. Specifically, the bank wants to use data mining to evaluate its customers. By using data mining, the existing data of the bank can be analyzed and used for the evaluation. The evaluation will help the bank to divide its customers into categories. This segmentation can be very beneficial for the bank because it can, for example, separate bad customers from good ones.

The data mining project will follow the CRISP-DM methodology that can be seen in Figure 2 and was described in the theoretical part of the thesis. This standardized model is ideal for the project, so the project will be divided into six main parts.
In the business understanding part, some of the data mining techniques that are used in the financial sector will be mentioned. Each technique will be briefly explored and explained, and the one that is suitable for the purpose of the project will be chosen. Most importantly, the business understanding part will state the clear goal of the project and its business value. In the second step, the data needed for the project will be identified as well as defined. In the data preparation phase, the attributes that are essential for the project will be listed. In the fourth, modeling phase, the model will be introduced. In the evaluation phase, the model will be evaluated, and the deployment phase will finally explain the actual options for how the bank can use the data mining solution. Lastly, the project will be summarized in the conclusion part.

In a real-life environment, the project would need three experts. The domain expert would be responsible for the business purpose, that is, for the business understanding and deployment parts. The data expert would take care of the data understanding and data preparation parts. The last expert would be the data mining expert, who would be responsible for creating the models and evaluating them.

5.1. Business Understanding

In the business understanding phase, the direction of the project needs to be set. First, however, it is crucial to explain a few data mining methods that are used in the financial sector. Then the one that is the most appropriate for the project will be chosen and the project goal can be set.

The first data mining method that the bank could use is the Customer Relationship Model. This model is used to measure customers’ response to a service or product. By scoring customers, the bank learns how successful the product or service is. The information can also be used to predict customers’ behavior.
For example, if the bank introduces a new service and the data show that the service is poorly used, the bank can assume that the service is not needed. Even though the service was considered important by the bank, the customers proved the exact opposite, and therefore the bank can predict that introducing similar services or products is not necessary. The implementation of the solution differs and is influenced by how the bank communicates with customers, which means that data can be gathered in different ways. For example, the bank can contact customers, or customers can contact the bank. This solution can definitely be used to improve customer care and therefore to increase the competitive advantage of the bank (Dass, 2006).

The second method used in the financial sector is called Risk Analysis. This method is mainly used to forecast factors that can influence the company. In our case, the bank can use historical data to make the right decisions. This method can lead to cost-effective running of the bank, and the predictions can help the bank to stay competitive in the marketplace. The method does not primarily deal with service improvement and customer care and therefore is not going to be used for the project (Dass, 2006).

Stock analysis and prediction is the third method that could be used. The method is mainly used in the stock market, but can also be used by banks that specialize in making investments. This method is not focused on the customer. The main idea is to make predictions about the future based on historical data. Stock analysis focuses on finding historical events that are likely to repeat. Such predictions are very valuable when predicting market trends and when deciding whether or when to buy a stock. However, there are still many factors that can influence the final prediction, such as financial crises or natural disasters, so a forecast cannot be considered 100 percent correct. The third method, stock analysis and prediction, generally does not fulfill the project criteria and therefore will not be used (Dass, 2006).

The last method that could be suitable for the project is the credit scoring method. This method has already been used in the financial sector for several years. It is primarily used by banks to evaluate customers. Through this evaluation, the bank can predict possible risks when lending money. This means that the bank, as the lender, uses credit formulas to analyze the borrower’s data. Because the system is not used only in the banking sector, credit formulas may differ. The information that is important for the bank can be seen in Figure 3. The pie chart shows the data that the bank has about the customer. The data are divided into five main categories: amounts owed, payment history, types of credit used, new credit, and length of credit history (myfico, 2009).

Figure 3: FICO Scores chart (pie chart: Payment History 35%, Amounts Owed 30%, Length of Credit History 15%, New Credit 10%, Types of Credit Used 10%)
Source: myfico, 2009

The percentages in Figure 3 reflect the importance of each kind of information. For example, the information about payment history is more important than the amounts owed. Each of the five categories will be explained to show why it is important. As can be seen in Figure 3, the most important factor that the bank considers is “payment history”, which represents 35 percent of the total. Payment history includes information about payments on accounts: whether the customer has or had a mortgage, whether he or she has credit cards, loans, etc. The second most important attribute, “amounts owed”, represents 30% of the pie chart.
This information includes the amount the customer owes on accounts, the number of accounts, credit limits on accounts, etc. “Length of credit history” accounts for 15 percent and includes, for example, the dates when the account(s) were opened and how often the account is used. Finally, the last two categories are each worth 10 percent. “New credit” holds information about recent accounts, including when they were opened, their credit limits and credit history. The last 10 percent of the pie chart is called “types of credit used” and includes data about the types of accounts the customer uses, for example whether he or she has credit card accounts or loan accounts (myfico, 2009).

The purpose of the practical project is to evaluate the customers of the bank. Based on the evaluation, segmentation can be done and used for service improvement. The credit scoring method is ideal for this purpose and therefore was chosen from all the mentioned solutions. The method can evaluate customers according to the bank’s criteria. By using the right data and a proper model, the bank would be able to decide whether a customer is worth lending money to. So, the project goal is to create an accurate model that would be able to do such segmentation. Creating the model would also lead to service improvement, which is the main business goal of the practical data mining project.

5.2. Data Understanding

In the data understanding phase, the main goal is to get familiar with the data that are available for the project and then to decide which data are interesting and suitable for the goal of the project. So in the data understanding part, the domain expert would collect the data from the bank and analyze them. The collected data contain very sensitive information about customers and the bank. Because the real data contain such information, they can be considered a part of the bank’s know-how.
Moreover, revealing private information about the bank’s customers would violate their privacy rights and damage the bank’s reputation. Therefore the data are not available to the general public. Because data from a real bank are not available for the project, common sense and assumptions will be used instead.

The bank stores a large amount of data, including information about employees, transactions, customers, etc. For the purpose of the project, the main focus is on the information about customers; other data that have no relationship with customers can be considered irrelevant. Information about a customer is collected when the customer asks for an account. This information includes name, address, phone number, the services that he/she needs, etc. So the data collection is done when the account is created, and the data are stored in the bank’s database.

The model of the bank’s database consists of many classes. The class diagram can be seen in Figure 4. The lines represent the relationships between classes and the symbols represent the type of each relationship. The relationships are the following: one customer can have one or many accounts, one account can have one or many transactions, and one customer can have one or many services and one or many loans.

Figure 4: Class diagram of bank’s database (classes: Customers, Account, Transactions, Services, Loan; one-to-many relationships from Customers to Account, Services and Loan, and from Account to Transactions)

After the familiarization, the domain expert concludes that the bank’s database is suitable for the project goal, meaning that it contains enough data to produce a valid result. So, the data understanding part was successful and data preparation can begin.

5.3. Data Preparation

The data preparation part covers the concrete attributes that will be important for the project. The data were collected from four tables. From the customers table, the personal information will be needed.
The most crucial attributes for the project from this table are age, income and employment. Then some attributes from the services table will be collected, as well as from the account table. For the purpose of the project, the most important attribute in the account table is the account balance. Lastly, the loan table contains an attribute called amount, which is also very important and specifies the amount of money the customer wants to borrow.

In the next step, all the important attributes from the mentioned tables need to be modified so that the decision tree knows how to process them. The list of the attributes, grouped into categories according to boundaries, can be seen below:

Personal information:
Gender: male/female
Marital status: divorced/separated/married/single/widowed
Age: young: 0 – 25; middle aged: 25 – 50; old: 50 – 67; retired: > 67
Annual income: low: 0 – 499; middle: 500 – 799; high: > 800
Employed: yes/no
Job position: employed/unemployed
Accommodation: own/rent/for free
Number of residents in the household: (number)
Number of children: (number)

Service information:
Number of credit cards: (number)
Insurance: yes/no
Internet banking: yes/no

Account information:
Monthly account balance: low: 0 – 249; middle: 250 – 999; high: > 1000
Credit history: credit never taken/all credit paid on time/delay in payments
Number of loans: (number)
Number of permanent transactions: (number)
Number of transactions: (number)

Loan information:
Type of the loan: house/student/combined/others
Purpose of the loan: house/car/equipment/investment/business/others
Amount: (number)
Monthly payments: (number)
Debtors: none/co-applicants/guarantor

5.4. Modeling

After the modification of the attributes, the process of modeling can begin. In the modeling phase, the decision model is created.
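Before the tree is built, the boundaries from the data preparation part can be expressed in code. The following is only a minimal sketch in Python: the names `INCOME_BINS`, `BALANCE_BINS` and `categorize` are hypothetical, and because the attribute list leaves the exact values 800 (income) and 1000 (balance) unassigned, the sketch assumes that these values already count as "high".

```python
from bisect import bisect_right

# Boundaries taken from the attribute list in the data preparation part.
# Assumption (not stated in the thesis): exactly 800 for income and
# exactly 1000 for balance fall into the "high" category.
LABELS = ["low", "middle", "high"]
INCOME_BINS = [500, 800]     # low: 0-499, middle: 500-799, high: from 800
BALANCE_BINS = [250, 1000]   # low: 0-249, middle: 250-999, high: from 1000


def categorize(value, edges):
    """Return the category label for a numeric value.

    bisect_right counts how many edges are <= value, which is exactly
    the index of the category the value belongs to.
    """
    return LABELS[bisect_right(edges, value)]


if __name__ == "__main__":
    print(categorize(650, INCOME_BINS))    # income of 650 is "middle"
    print(categorize(120, BALANCE_BINS))   # balance of 120 is "low"
```

Other numeric attributes from the list, such as age, could be grouped in the same way once the bank decides to which group each shared boundary value belongs.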
For the purpose of the project, a simple decision tree will be proposed to demonstrate the possible criteria that the bank can require. The data mining expert designed the decision tree that can be seen in Figure 5 according to three basic attributes: annual income, monthly account balance and employment.

Figure 5: Decision tree model (root: annual income; high: yes; middle/low: monthly account balance; balance high: yes; balance low: no; balance middle: employed; employed yes: yes; employed no: no)
Source: Berka, 2003

According to the proposed model, a customer asking for a loan would first be assessed by his or her annual income. As can be seen in the data preparation part, the attribute has been grouped according to boundaries into three categories: high, middle and low. As the decision tree shows, the customer will get the loan if he or she has a high income (more than €800). If not, the customer is assessed by the second attribute, which, as can be seen in Figure 5, is the monthly account balance. The same principle is applied here as well. The customer is given the loan if his or her balance is higher than €1000. If the customer’s balance is lower than €249, he or she is not suitable for the loan. If the balance is in the range €250 – €999, the customer is assessed according to the third and last attribute. This attribute is simple: the customer gets the loan if he or she is employed and is refused otherwise.

The proposed decision tree is very simple and easy to understand. Of course, the bank can easily change the requirements for the loan. For example, the bank can decrease or increase the required annual income. The decision tree can also be modified easily, and additional attributes that help to evaluate customers can be added. It all depends on the requirements given by the bank.

5.5. Evaluation

In the evaluation phase, the created model needs to be evaluated and tested.
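The decision tree from Figure 5 and a test of it on a sample can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the bank's implementation: the function names `approve_loan` and `accuracy` are hypothetical, the inputs are assumed to be the category labels "low", "middle" and "high" already, and the expected outcomes used for testing are assumed to come from the bank's own records.

```python
LEVELS = {"low", "middle", "high"}


def approve_loan(income_level, balance_level, employed):
    """Follow the tree in Figure 5 and return True ("yes") or False ("no")."""
    if income_level not in LEVELS or balance_level not in LEVELS:
        raise ValueError("levels must be 'low', 'middle' or 'high'")
    if income_level == "high":
        return True           # high annual income: loan approved
    if balance_level == "high":
        return True           # high monthly balance: loan approved
    if balance_level == "low":
        return False          # low monthly balance: loan refused
    return bool(employed)     # middle balance: employment decides


def accuracy(decisions, expected):
    """Share of model decisions that match the expected outcomes."""
    if not decisions or len(decisions) != len(expected):
        raise ValueError("need two non-empty lists of equal length")
    correct = sum(d == e for d, e in zip(decisions, expected))
    return correct / len(decisions)


if __name__ == "__main__":
    # A customer with middle income, middle balance and a job passes.
    print(approve_loan("middle", "middle", True))
```

Keeping the tree as a plain function makes each threshold decision visible, so a rule can be changed or an attribute added without touching the rest of the logic.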
The data mining model used in the practical project is based on the decision tree that was described in the modeling phase. The created model meets all the business objectives and goals of the project and therefore can be evaluated as suitable. To ensure that the model works properly, it needs to be tested. The data mining expert decided to test the model on a sample of 5000 customers. All customers in the sample will be evaluated according to the model and the results reviewed. If, for example, an error occurred with only 5 customers, the model would have an accuracy of approximately 99.9% (4995 correct decisions out of 5000).

5.6. Deployment

The proposed model based on the decision tree is expected to give the bank very high accuracy. So the next step, and the purpose of the deployment phase, is to apply the solution in the bank and to propose the changes that can be made. As a result, the domain expert suggests applying the platform in two basic ways:

1. Changing the process of decision making
2. Introducing and improving services

The first and most crucial improvement will help clerks in the bank to decide whether a loan should be given or not, that is, whether a person applying for the loan is worth lending money to. Clerks will use a platform that evaluates the borrower according to the data and identifies him/her as a “yes” or “no” customer. Each customer will need to be evaluated before the loan is given. So, the decision making process will be much easier and the possible mistakes made by clerks will be minimized. This will, of course, make the work of the bank’s employees much easier. However, if the platform marks the customer as not suitable for the loan, the clerk will always need to check whether the data are correct. If the customer does not pass, it is the clerk’s responsibility to find the reasons and explain them to the customer.
For example, the clerk can advise the customer to increase the account balance or decrease the amount of money borrowed. Moreover, if the customer does not, for instance, have any account balance or is unemployed, he/she fails completely. In this case, the clerk needs to explain that he/she does not fulfill the bank’s criteria and therefore the loan cannot be approved.

Secondly, the created model will support the introduction of new services and the improvement of services that the bank already offers. The proposed decision tree will be part of two applications that will be created for the bank. The first one is an internal application that will be used only by employees. The second application will be external and will be part of an online platform used by customers.

The internal application is very important because it will allow employees to use the model without any further knowledge about decision trees or data mining. The application needs to be written in a commonly used programming language. It is also important that the created program is compatible with the operating systems used in the bank. The program needs to be secured because it contains sensitive information and also includes the bank’s “know-how”. The primary users of the program will be clerks. We can assume that most of them probably have just basic computer skills, so the program should be user friendly and intuitive. That would allow clerks to work with the program without going through long and complicated training. The program would be used while dealing with a customer and would allow the bank to simplify the decision making process. Moreover, with such a sophisticated program, the clerks do not have to be very skilled or educated in the banking sector. So, with the help of the software the bank can introduce new services. For example, the bank can start a new customer line that would be available 24/7.
It would need one operator with good people skills who is trained to use the software. The customer line would be for customers who do not have time to go to the bank. They can simply call, give their information to the operator and find out whether they can get a loan or not. By introducing this service, the bank can attract a broader range of customers and therefore increase its revenue stream.

The external application would also attract more customers. It would simply be an online platform on the bank’s corporate website. The main purpose of the platform is to serve clients who do not want to, or cannot, go to the bank. The platform would allow customers to enter the required data into an online form similar to the one that can be seen in Picture 1.

Picture 1: Online form
Source: TrueCredit

The online form in Picture 1 is taken from the TrueCredit web site and is used only to demonstrate what kind of data such a form may include. For example, the data include the purpose of the loan that the customer applies for, as well as personal and contact information. As soon as the customer submits this form, he/she is automatically redirected to the next form, which asks for more detailed information about his/her credibility, for example the amount of money needed, annual income, number of children, whether the person is employed, etc. Then the data will be analyzed and the customer can get the result for the loan he/she applied for. With the growing popularity of the internet, this solution can definitely attract more customers and therefore lead to financial benefits.

5.7. Project Conclusion

To highlight the importance of the project, it is essential to mention that offering loans to the general public is still considered a core business of commercial banks.
Therefore loans can be considered one of the primary sources of revenue for commercial banks. Logically, this fact forces banks to develop their lending process and therefore improve their services. The process itself is quite simple. However, the decision making is primarily influenced by human judgment: traditionally, the borrower is evaluated by a clerk in the bank. Even if the clerk is highly trained, this approach can lead to mistakes. The problem can be solved by using a data mining approach with credit scoring formulas, as was proposed in the project. By summarizing all the information, the credit scoring method can evaluate the borrower and therefore decide whether he/she deserves to get a loan.

Finally, the proposed solution will improve the decision making process and therefore help the bank to decrease the risk of losing lent money. Moreover, the solution can positively influence the economic situation of the bank by introducing new services to customers.

Thesis Conclusion

The main goal of the thesis is to inform the reader about data mining technology and highlight its importance, and furthermore to propose a project in which the technology was applied and draw attention to its outcome.

The theoretical part covers general information about mining data. It starts by explaining the basic principles on which the technology works. It then describes how data mining was developed and continues by specifying four reasons why companies should consider investing in data mining technology. In the next part, the short history of the technology is highlighted. The thesis then mentions the evolution of data mining and makes some predictions about the near future. Because data mining technology is strongly connected with databases, some data warehousing concepts are covered. The thesis continues by explaining the difference between data mining and the knowledge discovery process.
Finally, the theoretical part of the thesis describes the six steps of the CRISP-DM model that was used in the practical part.

The practical part is dedicated to a data mining project focused on the financial sector. The goal of the project is to help a bank divide its customers into two groups according to their financial credibility. The first group represents customers who are suitable for a loan and the second group represents customers who are not. The project follows the CRISP-DM model and therefore consists of six main steps. To achieve the project’s goal, the credit scoring method was used. The outcome of the project is presented in the last, deployment part, which proposes two main ways in which the bank should use the created model. Firstly, the model should be used to improve and simplify the decision making process when the bank lends money. Secondly, the model should be used to improve the bank’s current services as well as to introduce new ones. To conclude, the practical part deals with a data mining project that in the end proposes an outcome which can help the bank to decrease risks and increase revenues.

Last but not least, the popularity of data mining is driven by the increasing amount of data being stored. This is primarily caused by the advancement of information technology, which allows data to be stored faster and more cheaply. Globalization and the wide spread of telecommunication technologies are among the reasons why data created by people around the world can be obtained quite easily. These are some of the many reasons why a demand naturally arose for a tool or technology that could translate these valuable data into helpful knowledge that can be easily understood.

As the thesis mentions, by using data mining companies are able to gain knowledge and therefore make better decisions, gain competitive advantage, or grow wealth.
Data mining and knowledge discovery can today seem like a complicated tool. However, further development will probably mean that it will be used more, not just by businesses but also by governments and ordinary people.

To conclude, data mining gives businesses a unique opportunity to extract information from data they already have but in a form that cannot be understood. Therefore, this opportunity should not be underrated. It should be considered a good investment, especially in today’s competitive market, where making the right decisions is the key to success.

USING DATA MINING AS A TOOL FOR DISCOVERING IMPORTANT KNOWLEDGE FOR COMPANIES

I, Tomas Vanek, do hereby irrevocably consent to and authorize the library of Vysoká škola manažmentu v Trenčíne to file the attached project and/or bachelor thesis USING DATA MINING AS A TOOL FOR DISCOVERING IMPORTANT KNOWLEDGE FOR COMPANIES and make such paper available for in-library use in all site locations.

For public access to digital form of the project/bachelor thesis on internet: I give my permission / I do not give my permission

I state at this time that the contents of this paper are my own work and all resources used are indicated.

(Signature)
(Date) 28.3.2010

List of Pictures

Picture 1: Online form

List of Figures

Figure 1: Knowledge discovery process model
Figure 2: Cross-Industry Standard Process model
Figure 3: FICO Scores chart
Figure 4: Class diagram of bank’s database
Figure 5: Decision tree model

Literature

Berka, P. (2003). Dobývání znalostí z databází [Gaining Knowledge from Databases]. Prague, Czech Republic: Academia.

Brunker, M. (2008). Poker site cheating plot a high-stakes whodunit. Retrieved November 5, 2009, from http://www.msnbc.msn.com/id/26563848/

Cios, K., Pedrycz, W., Swiniarski, R., & Kurgan, L. (2007). Data Mining: A Knowledge Discovery Approach. New York: Springer.

Chhay, H. (2005). Data mining. Retrieved November 5, 2009, from http://cseserv.engr.scu.edu/StudentWebPages/hchhay/hchhay_FinalPaper.htm#DISADVANTAGES

Crisp-dm. (n.d.). Process Model. Retrieved November 5, 2009, from http://www.crispdm.org/Process/index.htm

Dass, R. (2006). Data Mining in Banking and Finance: A Note for Bankers. Retrieved November 5, 2009, from http://www.iimahd.ernet.in/publications/data/Note%20on%20Data%20Mining%20&%20BI%20in%20Banking%20Sector.pdf

Data Mining Software. (n.d.). A Brief History of Data Mining. Retrieved November 5, 2009, from http://www.data-mining-software.com/data_mining_history.htm

Hunter, J. (2009). Data Mining Process using CRISP. Retrieved November 5, 2009, from http://www.youtube.com/watch?v=dJcmOe3_P0E

myfico. (2009). What’s in your FICO® score. Retrieved November 5, 2009, from http://www.myfico.com/CreditEducation/WhatsInYourScore.aspx

Palace, B. (1996). Data Mining: What is Data Mining?. Retrieved November 5, 2009, from http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamining.htm

Sumathi, S., & Sivanandam, S. (2006). Introduction to Data Mining and its Applications. New York: Springer.

Srinath, S. (2008). Data Mining and Knowledge Discovery. Retrieved November 5, 2009, from http://www.youtube.com/watch?v=m5c27rQtD2E

ABSTRAKT

Topic: Using data mining in companies as a tool for discovering information
Key words: data mining, databases, cross industry standard process, credit scoring.
Student: Tomáš Vanek
Thesis advisor: Martina Česalová, M.S.C.S.

The bachelor thesis deals with the basic principle of data mining as a tool for obtaining new information. The thesis consists of two main parts. The first part is theoretical and explains how data mining works and what information can be reached with it. It also describes the possible advantages and disadvantages of the tool. The practical side of the thesis builds on the theoretical part and is devoted to applying data mining to the banking sector. It consists of creating a project for a fictitious bank that needs to use customer segmentation when granting loans. The project consists of the six phases of the CRISP-DM model, with the main emphasis placed on the business essence of the proposed solution.

ABSTRACT

Topic: Using Data Mining as a Tool for Discovering Important Knowledge for Companies
Key words: Data Mining, Databases, Cross Industry Standard Process, Credit Scoring.
Student: Tomáš Vanek
Advisor: Martina Česalová, M.S.C.S.

The bachelor thesis covers the fundamental principles of data mining as a tool for knowledge discovery. The thesis consists of two main parts. The first part is theoretical and explains how data mining works and what kind of information it can reveal. Additionally, the first part of the thesis mentions the advantages and disadvantages of the tool. The second, practical part concentrates on applying data mining in the banking domain. A data mining project was therefore created that deals with customer segmentation, which helps a bank to estimate a customer’s financial credibility. The project follows the six steps of the CRISP-DM model, but the main focus is on the business aspect of the solution.