Data Mining: A Tool for Knowledge Discovery 0
COLLEGE OF MANAGEMENT IN TRENČÍN
USING DATA MINING AS A TOOL FOR DISCOVERING
IMPORTANT KNOWLEDGE FOR COMPANIES
2010
Tomáš Vanek
Bachelor Thesis
Study program:
Knowledge Management
Workplace:
College of Management, Bratislava
Thesis advisor:
Tomáš Vanek
Consultant:
Martina Česalová, M.S.C.S
Trenčín 2010
Tomáš Vanek
Content
1. Introduction and Problem Statement
2. Review of Literature
3. Description of the Methodology
4. Data mining overview
4.1. Principles of Data Mining
4.1.1. Definitions of Data Mining
4.1.2. History of Data Mining
4.1.3. The Evolution and the Future of Data Mining
4.1.4. Disadvantages of Data Mining
4.2. Data Warehousing
4.3. Knowledge Discovery Process and Data Mining
4.3.2. CRISP-DM model
4.3.2.1. Business understanding
4.3.2.2. Data understanding
4.3.2.3. Data preparation
4.3.2.4. Modeling
4.3.2.5. Evaluation
4.3.2.6. Deployment
5. Practical Project – Data Mining in Banking Domain
5.1. Business Understanding
5.2. Data Understanding
5.3. Data Preparation
5.4. Modeling
5.5. Evaluation
5.6. Deployment
5.7. Project Conclusion
Thesis Conclusion
List of Pictures
List of Figures
Literature
List of abbreviations
CRISP-DM - Cross-Industry Standard Process for Data Mining
Acknowledgements
I would like to thank Martina Česalová, M.S.C.S., for her patience and advice during
the writing of this thesis.
1. Introduction and Problem Statement
Information is a very important factor in today's businesses. The spread of
information technology means that large amounts of data from many different areas are
collected and stored. Data are stored every time a person accesses a web page, purchases
a product, or makes a phone call. These data contain hidden information that can be very
valuable.
Data mining is a tool for analyzing these data and extracting useful, previously
unknown and interesting information. It is used mostly by companies that collect and
store large amounts of data. Mining the data allows them to gain essential knowledge and
use it to their benefit. Data mining thus represents quite a new and unique technology
that can provide numerous advantages.
The objective of the thesis is to offer general information about the topic. The
thesis consists of a theoretical and a practical part. The theoretical part informs the
reader about the basic principles of data mining. It starts by explaining how the
revolution in information technology forced, and still forces, scientists to develop data
mining technology. The thesis then gives some general examples of why data mining can be
considered a gold mine for some companies. To provide a balanced view of the technology,
the thesis covers its disadvantages as well as its advantages. It also discusses the
history of data mining and examines predictions for its future. The end of the
theoretical part focuses on the six steps of the standardized model called CRISP-DM,
which is used for data mining projects.
The practical part of the thesis proposes a data mining project applied in the
financial sector. The goal of the project is to help a bank segment its customers by
using data mining. The project is divided into the six steps of the CRISP-DM model: it
covers the business opportunity, describes the data used, introduces the model and
suggests how the proposed solution could be deployed. As a final result, the bank can use
the segmentation to improve its decision making and to introduce new services.
2. Review of Literature
Various sources were used during the writing of the thesis, and great effort was
made to use different types of sources. The collected information comes mainly from
printed and internet sources.
The first, theoretical, part of the thesis is primarily based on two books. Most of
the information is taken from the book “Introduction to Data Mining and its
Applications” written by Dr. S. Sumathi and Dr. S.N. Sivanandam. Both are professors at
College of Technology in India and therefore experts in their field. The book opens with
a very clear and general introduction to the field. Its information is presented at
great length, so summarization was used quite often. The second book used in the
theoretical part is “Data Mining - A Knowledge Discovery Approach”, written by four
authors: Krzysztof J. Cios, Witold Pedrycz, Roman W. Swiniarski, and Lukasz A. Kurgan.
All four authors work at different universities across the USA and Canada. As its title
says, the book focuses mainly on knowledge discovery through data mining and is therefore
very suitable for the thesis. A few internet sources were also used. For example, to
describe the business opportunities that data mining offers, a YouTube video by
Dr. S. Srinath from the Indian Institute of Technology was used.
In the second, practical, part of the thesis an internet source was used to
describe the credit scoring method. Information about credit scoring was taken from the
website myFICO. This site has been online since 2001 and deals primarily with credit risk
scoring for the financial sector, so it can be considered a relevant source. To complete
the practical part of the project, the book “Dobývání znalostí z databází”, which can be
translated as “Gaining the Knowledge from Databases”, was used. The book is written by
Doc. Ing. Petr Berka, who works at the University of Economics in Prague. The practical
part of the thesis could not have been done without this book because it offers not only
theoretical information but also practical demonstrations of data mining methods.
3. Description of the Methodology
The evaluation method was used in the thesis. Many sources were collected and
analyzed to gain knowledge about data mining. After that, the most important topics were
researched again and presented in the thesis. Examples were used to highlight the
importance of data mining and knowledge discovery in today’s competitive market
environment. Moreover, the theoretical knowledge gained was applied in the practical part
of the thesis, which shows how data mining can be used in a banking environment.
4. Data mining overview
4.1. Principles of Data Mining
The enormous amount of data that is nowadays created, used and stored on a daily
basis has created demand for a new tool that could help to analyze these massive data: a
tool that turns stored data into useful knowledge that is easily understandable by human
beings. Traditional techniques for analyzing data were very useful and solved many
problems. These techniques mostly used statistics to analyze the data and therefore could
only extract certain data characteristics. This limitation, and the need for a new tool
for data analysis, led scientists to start collecting ideas for developing a machine
learning tool. This effort led to a new research area called data mining and later to a
research area called data mining and knowledge discovery. None of this would have been
possible without the computer revolution. (Sumathi & Sivanandam, 2006)
People have experienced a revolution in the availability of information, especially
during the last decade, when the Internet and network-based systems allowed the global
exchange of information. E-commerce businesses have experienced great growth and
companies have started to collect more and more electronic information. More importantly,
technology and market opportunities led companies to collect and use the right data: they
started to analyze the collected data rather than collect it without further use. Soon
many companies realized that “tracking, accounting for, and archiving the activities of
an organization, this data can sometimes be a gold mine for strategic planning, which
recent research and new businesses have only started to tap” (Sumathi & Sivanandam,
2006). With support from scientists and demand from commercial domains, data mining thus
had ideal conditions to grow and develop. (Sumathi & Sivanandam, 2006)
The concept of data mining could not have grown so fast without database
technology, which was already widely and successfully used in the business environment.
Organizations started to create very large databases with capacities in the terabytes.
These databases hold business data like “consumer data, transaction histories, sales
records, etc.” (Sumathi & Sivanandam, 2006) that very likely contain much important and
valuable information. This important business information is hidden in the data and needs
to be extracted. The extraction can be done successfully by using a proper mining method.
(Sumathi & Sivanandam, 2006)
Data mining represents a promising tool that can be described as “the process of
discovering meaningful new correlation, patterns, and trends by digging into (mining)
large amounts of data stored in warehouse, using statistical, machine learning, artificial
intelligence (AI), and data visualization techniques” (Sumathi & Sivanandam, 2006).
Many industries already use data mining, for example the aerospace, medical and chemical
industries, and because the technology is still quite new their number is still
increasing, not only for its impact on science but also for its business value. (Sumathi
& Sivanandam, 2006)
In terms of business value, data mining can literally be a gold mine. From a
business point of view, it can represent a beneficial and unique asset. Data mining can
bring many benefits to a company or to a business in general. A few concrete examples may
motivate managers or business owners to invest in this technology. Data mining can:
Influence decision making
Grow wealth
Help to analyze
Improve security
Decision making is an important process when running a company. Data mining can
reveal patterns in historical data and therefore lead to new knowledge. For example, by
analyzing a company’s data, hidden patterns that repeat can be recognized. With this
knowledge from the past, the company can learn something new and act accordingly. Data
mining can therefore influence decision making. This matters because making strategic
decisions is necessary for every company that wants to stay in today’s competitive
market. (Srinath, 2008)
Making good decisions is also connected with growing wealth. If data mining helps a
company make the right strategic decisions, it can also positively influence the
company’s financial situation. Moreover, mining data grows the wealth of information that
the company has. This information can be used in many different ways, for example in
product development, marketing or investment. By using data mining, a company gains
important knowledge, which can later be transformed into strategic decisions that
strengthen the financial portfolio of the company and therefore grow its wealth.
(Srinath, 2008)
As mentioned, data mining can reveal patterns from history and therefore help to
analyze trends. Trend analysis can be used, for example, in the stock market: by mining
data, stock exchange companies can analyze the historical price of a stock and predict
its future price. Risk analysis can also be very interesting for companies. Exploring and
analyzing data helps companies in the financial sector to evaluate customers. As will be
proposed in the practical project, a bank can mine its data to separate good customers
from bad ones and thus analyze the risks before offering a service to a particular
customer. Overall, mined data can offer different kinds of information that can be used
for analysis. (Sumathi & Sivanandam, 2006)
Lastly, data mining has recently started to be used for maintaining security. This
is quite a new field that involves mining data to discover activity that may be illegal.
(Srinath, 2008) In 2008, data mining was successfully used to help uncover the biggest
scandal in online gambling history. In short, a few poker players were accused of
cheating on a poker site that was part of the Ultimate Bet network. Online poker players
who had become victims of the cheaters used data mining to analyze the situation. They
concluded that it was statistically almost impossible to win so much money in such a
short time and contacted the company. It turned out that the cheaters had somehow
bypassed the security systems and were able to see their opponents’ cards, which is a
tremendous advantage in poker. In this case, known as the Ultimate Bet scandal, data
mining helped to detect fraud and maintain security. (Brunker, 2008)
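The reasoning that a result is "statistically almost impossible" can be illustrated with a
simple outlier check. The sketch below is only an illustration and not the analysis the
players actually performed, which the sources do not describe. The win rates, the
threshold of 4 standard deviations and the function names are all hypothetical.

```python
from statistics import mean, stdev

def z_score(value, baseline):
    """Number of standard deviations by which `value` differs from the baseline mean."""
    return (value - mean(baseline)) / stdev(baseline)

def is_suspicious(value, baseline, threshold=4.0):
    """Flag a value that lies far above what the baseline population achieves."""
    return z_score(value, baseline) > threshold

# Hypothetical win rates of ordinary players (arbitrary units per 100 hands).
baseline_win_rates = [-4.0, -1.5, 0.5, 2.0, 3.0, -2.5, 1.0, 4.5, -0.5, 1.5]

# Hypothetical win rate of the account under suspicion.
suspect_win_rate = 60.0

print(is_suspicious(suspect_win_rate, baseline_win_rates))
print(is_suspicious(3.0, baseline_win_rates))
```

The suspect is compared against the other players rather than against a sample that
includes the suspect, so that one extreme value does not inflate the standard deviation
it is measured with.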
Obviously, these are just a few of the ways in which data mining can benefit
business owners or companies. There are others that are also very important. Other common
uses of data mining that were not discussed above are listed below with a short
description:
Market segmentation: Finding characteristics that are common to
customers who purchased the same or similar products.
Customer churn: Identifying customers who are likely to leave the current
company and go to a different one.
Direct marketing: Identifying a specific group of customers and sending
mail to them to achieve a high response rate.
Interactive marketing: Determining which information or product a customer
was interested in when browsing a web page.
Analysis of market basket: Identifying products and services that have a high
probability of being purchased together. (Sumathi & Sivanandam, 2006)
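Market basket analysis, the last use listed above, can be sketched with two common
measures: support (the share of baskets that contain both items) and confidence (the
share of baskets containing one item that also contain the other). The following is a
minimal sketch with hypothetical baskets and function names, not a method taken from the
cited sources.

```python
from collections import Counter
from itertools import combinations

def pair_support(transactions):
    """Return, for every pair of items, the share of baskets containing both."""
    counts = Counter()
    for basket in transactions:
        # sorted(set(...)) makes each pair appear once, in a fixed order
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    total = len(transactions)
    return {pair: count / total for pair, count in counts.items()}

def confidence(transactions, antecedent, consequent):
    """Share of baskets with `antecedent` that also contain `consequent`."""
    with_antecedent = [b for b in transactions if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)

# Hypothetical purchase data: one list of products per customer visit.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "coffee"],
    ["bread", "butter", "coffee"],
    ["milk", "bread"],
]

support = pair_support(baskets)
print(support[("bread", "butter")])
print(confidence(baskets, "bread", "butter"))
```

Pairs with both high support and high confidence are candidates for products that are
likely to be purchased together.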
4.1.1. Definitions of Data Mining
Many different definitions of data mining exist. The following definitions have
been picked from different sources:
“Data mining is the efficient discovery of valuable, nonobvious information from a
large collection of data.” (Sumathi & Sivanandam, 2006)
“The aim of data mining is to make sense of large amounts of mostly unsupervised
data, in some domain.” (Cios, Pedrycz, Swiniarski, & Kurgan, 2007)
“…is the process of analyzing data from different perspectives and summarizing it
into useful information” (Palace, 1996)
“It is the process of extracting previously unknown, valid, and actionable
information from large databases and then using the information to make crucial
business decisions.” (Sumathi & Sivanandam, 2006)
4.1.2. History of Data Mining
As already mentioned, data mining is quite a young and groundbreaking tool that
does not have a very long history. It has recently been a subject in many business and
software magazines. Even though its importance is now widely recognized, a few years ago
not many people were familiar with the term data mining. The term itself was first
introduced in the 1990s. Data mining can be traced back to three family roots. (Data
Mining Software, n.d.)
The most important root is statistics. Classical statistical concepts like “regression
analysis, standard distribution, standard deviation, standard variance, discriminant analysis,
cluster analysis, and confidence intervals” (Data Mining Software, n.d.) are used in data
mining when studying data and their relationships. Even though today’s data mining uses
more advanced analysis, its core is still built on basic statistical tools and
techniques. Without statistics, data mining would certainly not exist. (Data Mining
Software, n.d.)
The second root of data mining is artificial intelligence. Artificial intelligence
applies human-like thinking to statistical problems. This of course requires computer
processing, so it could not be used until the early 1980s, when computers became very
accessible and people could buy processing power at quite reasonable prices. Later, as
computers became faster and cheaper, data mining grew even faster. Supercomputers also
made it possible to study and analyze large amounts of data thanks to their processing
power. Overall, the biggest advantage of artificial intelligence was that it allowed data
to be processed faster and more precisely than humans could. (Data Mining Software, n.d.)
The last root is the combination of statistics and artificial intelligence. This
union is known as machine learning. Because computers became cheaper and faster in the
1980s and 1990s, machine learning evolved quickly, and more applications were released as
computers became more accessible. Machine learning is considered an advancement of
artificial intelligence. Its main advance is the ability to make computer programs learn
about the data they study. This allows programs to make decisions based on the knowledge
gained from the data, achieving their goals by using statistics and advanced algorithms.
(Data Mining Software, n.d.)
In one sentence, the short history of data mining can be described “as the
union of historical and recent developments in statistics, AI, and machine learning” (Data
Mining Software, n.d.).
4.1.3. The Evolution and the Future of Data Mining
According to Dr. Sumathi and Dr. Sivanandam, the evolution of data mining was a
natural process caused by the increased use of information technologies. As a matter of
fact, the growth of information technologies went hand in hand with growth in the amount
of data being used. Larger amounts of data had to be stored and analyzed. Traditional
methods, such as creating queries and reports, could not cope with large amounts of data,
so data mining started to be developed and widely used. Data mining was soon considered a
tool with great future potential. (Sumathi & Sivanandam, 2006)
The future of data mining can be described as very bright. As already mentioned,
the full potential of data mining is not yet used and the concept of mining data is still
being developed. In the near future, data mining will penetrate more businesses and will
become a very profitable and valuable tool in many areas. Many markets could be heavily
influenced by data mining, but the most significant is probably the advertising market.
Data mining will allow advertisers to explore unique niches, which would attract a wide
range of new customers. Moreover, data mining will become available to the general public
and will be easier to use. Not only experts in the field would be able to benefit from
data mining; with user-friendly applications and tools, the technology would be as easy
to use as e-mail. The general public could, for example, find the lost numbers of
classmates or the best loan in the area within a short period of time. (Sumathi &
Sivanandam, 2006)
In the long term, data mining can do a lot for us. The changes and challenges are
exciting and groundbreaking. For example, by applying data mining in medicine, we could
possibly discover new treatments and practices for illnesses that we are not able to cure
so far. (Sumathi & Sivanandam, 2006)
4.1.4. Disadvantages of Data Mining
It should now be clear that data mining is a very valuable tool that can offer
unique benefits to companies in different businesses. Even though the technology itself
cannot literally harm anyone, the purpose of this part is to identify possible drawbacks.
At the moment, data mining does not have any primary disadvantage that would raise
concerns among companies willing to invest in this technology. Some scientists and
experts, however, have raised questions about disadvantages that may occur. In the
future, the main disadvantages likely to be connected with data mining are privacy and
security. (Chhay, 2005)
The technology boom has made privacy a major concern among people. Technology
allows people to do everyday tasks more easily, quickly and comfortably, but the same
technology makes people more sensitive about their privacy, because most technological
tools are able to track and store a person’s private information. Whenever somebody makes
a phone call, pays with a credit card, visits a web page, or books a flight ticket, data
are collected. This kind of data is already stored in the databases of many companies.
But what if all this information were collected together? Combining data from different
sources is the real concern, because analyzing such data would reveal a lot about
individuals. Even though each country has different privacy rights, it is generally
illegal for organizations to sell or exchange private information about customers.
However, this kind of transaction is hard to control. As Heng Chhay wrote, “…in 1998, CVS
had sold their patient’s prescription purchases to a different company” (2005). Selling
information about customers without their knowledge is definitely a violation of privacy.
(Chhay, 2005)
Security is another main issue that has occurred and will remain a disadvantage in
the future. Companies collect information about customers, but many of them do not have
appropriate security measures. There have therefore been many cases in which data were
accessed and misused. For example, a company called Ford Motor Credit had to apologize to
13,000 of its customers because “their personal information including Social Security
number, address, account number and payment history were accessed by hackers who
broke into a database” (Chhay, 2005). As a result, the company lost its reputation.
Companies should therefore always think about the safety of their data, because
underestimating security measures can lead to disaster. (Chhay, 2005)
4.2. Data Warehousing
Even though data warehousing may not seem to play an important role in data mining,
the opposite is true. It is important to understand data warehousing concepts because
data warehousing is closely connected with data mining. Data warehousing can be defined
as “a process of centralized data management and retrieval” (Sumathi & Sivanandam,
2006). Like data mining, data warehousing is quite a new concept. A data warehouse is not
software or hardware; it is better defined as an environment that allows companies or
corporations to store their data in relational database systems. These systems are
designed to deliver a high level of performance and to support large databases. Data
warehousing and data mining work very well together, because data warehousing provides
the memory and data mining the intelligence. (Sumathi & Sivanandam, 2006)
Any organization that creates and stores a lot of data faces the problem of turning
these data into valuable information. This information is usually unknown, but present in
the data already stored. To extract information from the data, and thus turn data into
knowledge, certain steps need to be applied. For example, the data need to be stored in a
certain form and organized so that mining can be applied. (Sumathi & Sivanandam, 2006)
The primary purpose of data warehousing is to allow end users to search for
information that would support, for example, their strategic decision making. End users
can access and interact with data warehouses through front-end tools. These access tools
can be divided into five main groups:
“1. Data query and reporting tools
2. Application development tools
3. Executive information system (EIS) tools
4. Online analytical preprocessing tools and
5. Data mining tools” (Sumathi & Sivanandam, 2006)
4.3. Knowledge Discovery Process and Data Mining
To understand how valuable information is extracted from data stored in databases,
the process of knowledge discovery needs to be briefly explained. The knowledge discovery
process can be described as “the nontrivial process of identifying valid, novel,
potentially useful, and ultimately understandable patterns in data” (Cios, Pedrycz,
Swiniarski, & Kurgan, 2007). So what is the basic difference between data mining and the
knowledge discovery process? Data mining is just one of the many steps that the knowledge
discovery process covers. The basic knowledge discovery process can be seen in Figure 1
below.
Figure 1: Knowledge discovery process model
Source: Cios, Pedrycz, Swiniarski, & Kurgan, 2007
As can be seen, the model has an input that represents data and an output that
represents knowledge. The input is the data that are going to be analyzed. The type of
data of course differs depending on the project, but the input can typically include
“numerical and nominal data stored in databases or flat files; images; video;
semistructured data, such as XML or HTML” (Cios, Pedrycz, Swiniarski, & Kurgan, 2007).
The collected data then go through a number of steps that are interconnected by feedback
loops. The result, as can be seen in Figure 1, is the final knowledge. All in all, the
knowledge discovery process can be defined as a process that helps to turn data into
useful knowledge by applying patterns and algorithms. (Cios, Pedrycz, Swiniarski, &
Kurgan, 2007)
4.3.2. CRISP-DM model
A lot of effort has been made to create a model that defines the process and phases
of data mining projects. One example is the model by Cabena et al. (Cios, Pedrycz,
Swiniarski, & Kurgan, 2007), which consists of five steps and is supported by IBM.
Another is the CRISP-DM model, which consists of six steps and has become the most
popular and leading model. This model will therefore be explained in detail.
CRISP-DM stands for Cross-Industry Standard Process for Data Mining. It was
introduced in the 1990s by a European consortium of companies as a free-to-use data
mining model. (Hunter, 2009) CRISP-DM was developed to create a standard process for data
mining projects, because data mining was quite new and nobody followed any particular
process or guide when developing such projects. The process is very flexible and can be
used in a variety of industries and with a variety of data mining software. CRISP-DM is
valuable because it makes data mining projects faster, cheaper, more efficient and more
reliable. The model consists of six phases, as can be seen below in Figure 2. (Cios,
Pedrycz, Swiniarski, & Kurgan, 2007)
Figure 2: Cross-Industry Standard Process model
Source: Crisp-dm, n.d.
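The six phases and the allowed movement between them can be sketched as a small state
model. The forward order follows the phases named in the thesis. The backward moves
(including the one between business understanding and data understanding) are an
assumption based on the commonly published CRISP-DM diagram, and all identifiers are
hypothetical.

```python
from enum import Enum

class Phase(Enum):
    BUSINESS_UNDERSTANDING = 1
    DATA_UNDERSTANDING = 2
    DATA_PREPARATION = 3
    MODELING = 4
    EVALUATION = 5
    DEPLOYMENT = 6

# Allowed moves between phases. Backward moves are assumed from the
# commonly published CRISP-DM diagram, not taken from the thesis text.
TRANSITIONS = {
    Phase.BUSINESS_UNDERSTANDING: {Phase.DATA_UNDERSTANDING},
    Phase.DATA_UNDERSTANDING: {Phase.BUSINESS_UNDERSTANDING, Phase.DATA_PREPARATION},
    Phase.DATA_PREPARATION: {Phase.MODELING},
    Phase.MODELING: {Phase.DATA_PREPARATION, Phase.EVALUATION},
    Phase.EVALUATION: {Phase.BUSINESS_UNDERSTANDING, Phase.DEPLOYMENT},
    Phase.DEPLOYMENT: set(),
}

def is_valid_path(path):
    """Check that every consecutive pair of phases is an allowed transition."""
    return all(nxt in TRANSITIONS[cur] for cur, nxt in zip(path, path[1:]))

# A project that runs straight through all six phases.
straight_through = list(Phase)
print(is_valid_path(straight_through))

# A project that revisits its goals after a first look at the data.
with_feedback = [
    Phase.BUSINESS_UNDERSTANDING,
    Phase.DATA_UNDERSTANDING,
    Phase.BUSINESS_UNDERSTANDING,
    Phase.DATA_UNDERSTANDING,
    Phase.DATA_PREPARATION,
]
print(is_valid_path(with_feedback))
```

Modeling the phases as explicit transitions makes the point that the process is not
strictly linear: a project may loop back before it moves on.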
4.3.2.1. Business understanding
As can be seen in Figure 2, the Cross-Industry Standard Process model starts with
business understanding. This first phase is very important because it defines the primary
goals, that is, the main purpose of the whole data mining project. It must specify what
we want to know or learn from the available data that we are going to explore. It is also
important to set which questions the project should answer and what business value the
project holds. In the business understanding phase, the project goal must be set and
project success must be defined in specifically measurable terms. This initial phase
gives the whole project its direction. Without clearly defined objectives, the project
can lose its direction, and with it its initial purpose, and fail. (Hunter, 2009)
4.3.2.2. Data understanding
The second phase starts with collecting already existing data. Data understanding can
also be described as familiarization with the data. This phase requires a domain expert,
who explores interesting data and detects possible data problems. The data are explored
according to the business needs specified in the previous phase, either by graphical
visualization or by a statistical approach. Moreover, in this step the domain expert
starts to look at basic relationships in the available data. As can be seen in Figure 2,
business understanding and data understanding are interconnected. This interconnection
exists because findings about relationships in the data can change the business
understanding. For example, we can find out that the data do not contain enough
information to satisfy the primary goal set in business understanding, so the goal needs
to be changed or replaced because we would not be able to achieve it. In other words,
during the first and second phases the hypothesis and goals of the project are brought
into their final version. (Hunter, 2009)
4.3.2.3. Data preparation
The data preparation phase is usually the most time-consuming. In some cases it can
take more than 80% of the project’s schedule. The time needed is usually influenced by
the quality of the available data. If the raw data are messy, it can take a lot of time
to sort them out. For example, some attributes and variables can be incorrect or
missing. During data preparation the final data set that is going to be used is created
by selecting the needed data. Moreover, during this phase the data need to be cleaned
into a form suitable for the purpose of the project. (Hunter, 2009)
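The selection and cleaning described above can be illustrated with a minimal sketch. The
record layout, the attribute names and the rule that negative numbers count as incorrect
are assumptions made only for this illustration; they do not come from the cited sources.

```python
# Hypothetical raw records; attribute names are assumptions for illustration.
raw_records = [
    {"name": "A", "age": 34, "income": 650},
    {"name": "B", "age": None, "income": 900},   # missing value
    {"name": "C", "age": 51, "income": -5},      # incorrect value
]


def prepare(records, required):
    """Select the required attributes and drop records that are unusable."""
    cleaned = []
    for record in records:
        # Selection: keep only the attributes needed for the project.
        row = {key: record.get(key) for key in required}
        # Cleaning: skip records with a missing attribute.
        if any(value is None for value in row.values()):
            continue
        # Cleaning: skip records with an obviously incorrect (negative) number.
        if any(isinstance(value, (int, float)) and value < 0 for value in row.values()):
            continue
        cleaned.append(row)
    return cleaned


dataset = prepare(raw_records, ["age", "income"])
```

In a real project the cleaning rules would come from the domain expert; dropping a record
is only one option, and repairing or imputing values is another.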
4.3.2.4. Modeling
In this phase, a wide range of modeling techniques is selected and used. Several models
are applied to the same data mining problem and later tuned for optimal output. As can
be seen in Figure 2, the data preparation and modeling phases are interconnected. The
interconnection exists because some models require a specific form of input data;
therefore a step back into the previous phase is often necessary. After the data are
cleaned and modified, the algorithms can be run again. (Hunter, 2009)
The modeling stage is divided into four parts:
“selection of modeling technique(s)
generation of test design
creation of models, and
assessment of generated models” (Cios, Pedrycz, Swiniarski, & Kurgan, 2007)
4.3.2.5. Evaluation
After the model or models have been created, they need to be reviewed and the best
model is chosen. The chosen model needs to be evaluated against the project’s business
objectives: it must satisfy the goals that were set. It is essential to find out whether
all the business goals have been considered. Additionally, in this step it is important
to decide how to use the model or collection of models. (Hunter, 2009)
4.3.2.6. Deployment
In this final phase, the model chosen for the data mining project must already be
known. The deployment phase uses the chosen model to score the data. However, this
phase need not be the last one. Sometimes it is better to step back to the third phase
and add more data to achieve better results. (Hunter, 2009)
The deployment phase is divided into:
“plan deployment
plan monitoring and maintenance
generation of final report
review of the process substeps” (Cios, Pedrycz, Swiniarski, & Kurgan, 2007)
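The six phases and the backward steps mentioned in this section can be sketched as a small
state model. This is only an illustration under stated assumptions: the names `Phase`,
`BACK_STEPS` and `next_phase` are hypothetical, and only the three backward steps named in
the text are modeled, not every arrow of Figure 2.

```python
from enum import Enum


class Phase(Enum):
    BUSINESS_UNDERSTANDING = 1
    DATA_UNDERSTANDING = 2
    DATA_PREPARATION = 3
    MODELING = 4
    EVALUATION = 5
    DEPLOYMENT = 6


# Backward steps described in the text: data understanding can change the
# business understanding, modeling can require more data preparation, and
# deployment can step back to the third phase to add more data.
BACK_STEPS = {
    Phase.DATA_UNDERSTANDING: Phase.BUSINESS_UNDERSTANDING,
    Phase.MODELING: Phase.DATA_PREPARATION,
    Phase.DEPLOYMENT: Phase.DATA_PREPARATION,
}


def next_phase(current, needs_rework=False):
    """Return the phase that follows `current`, or None when the project ends."""
    if needs_rework and current in BACK_STEPS:
        return BACK_STEPS[current]
    if current is Phase.DEPLOYMENT:
        return None
    return Phase(current.value + 1)
```

For example, `next_phase(Phase.MODELING)` gives the evaluation phase, while
`next_phase(Phase.MODELING, needs_rework=True)` returns to data preparation.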
5. Practical Project – Data Mining in Banking Domain
After covering the theory, it should be clear what data mining is and what the pros and
cons of the tool are. However, to understand the issue better it is always good to put
theoretical knowledge into practice. Therefore, in this part the knowledge gained is
applied in practice. The goal of the practical project is to apply data mining concepts
and possibly solve a problem or fill a need. As previously mentioned, data mining can be
used in many different areas. As a student of a business school, I have decided to apply
data mining to the finance sector. To be specific, the practical project deals with the
banking domain.
The primary goal of the project is to help a bank improve its services. Specifically,
the bank wants to use data mining to evaluate its customers. Using data mining, the
bank’s existing data can be analyzed and used for the evaluation. The evaluation will
help the bank to divide its customers into categories. This segmentation can be very
beneficial for the bank because it can, for example, separate bad customers from good
ones.
The data mining project follows the CRISP-DM methodology, which is shown in Figure 2
and was described in the theoretical part of the thesis. This standardized model is
ideal for the project, so the project is divided into six main parts. In the business
understanding part, some of the data mining techniques used in the financial sector are
mentioned and briefly explained, and the one suitable for the purpose of the project is
chosen. Most importantly, the business understanding part states the clear goal of the
project and its business value. In the second step, the data needed for the project are
identified and defined. In the data preparation phase, the attributes that are essential
for the project are listed. In the fourth phase, modeling, the model is introduced. In
the evaluation phase the model is evaluated, and the deployment phase finally explains
how the bank can actually use the data mining solution. Lastly, the project is
summarized in the conclusion.
In a real-life environment, the project would need three experts. The domain expert
would be responsible for the business purpose, that is, for the business understanding
and deployment parts. The data expert would take care of the data understanding and data
preparation parts. The third would be a data mining expert, responsible for creating the
models and evaluating them.
5.1. Business Understanding
In the business understanding phase the direction of the project needs to be set.
First, however, it is crucial to explain a few data mining methods that are used in the
financial sector. The one most appropriate for the project is then chosen and the
project goal is set.
The first data mining method that the bank could use is the Customer Relationship
Model. This model is used to measure customers’ response to a service or product. By
scoring customers, the bank knows how successful the product or service is. The
information can also predict customers’ behavior. For example, if the bank introduces a
new service and the data show that the service is poorly used, the bank can assume that
the service is not needed. Even though the bank considered the service important, the
customers proved the opposite, and the bank can therefore predict that introducing
similar services or products is not necessary. Implementation of the solution differs
and depends on how the bank communicates with its customers, which means that the data
can be gathered in different ways. For example, the bank can contact customers, or the
other way around. This solution can be used to improve customer care and therefore to
increase the competitive advantage of the bank. (Dass, 2006)
The second method used in the financial sector is called Risk Analysis. This method is
mainly used to forecast factors that can influence the company. In our case, the bank
can use historical data to make the right decisions. This method can lead to
cost-effective running of the bank, and its predictions can help the bank to stay
competitive in the marketplace. The method does not primarily deal with service
improvement and customer care and is therefore not going to be used for the project.
(Dass, 2006)
Stock analysis and prediction is the third method that could be used. The method is
mainly used in the stock market, but can also be used by banks that specialize in
investments. This method is not focused on the customer. The main idea is to make
predictions about the future based on historical data. Stock analysis focuses on finding
historical events that are likely to repeat. Such predictions are very valuable when
predicting market trends and when deciding whether or when to buy a stock. However, many
factors can still influence the final prediction, such as a financial crisis or natural
disasters, so a forecast cannot be considered 100 percent correct. The third method,
stock analysis and prediction, generally does not fulfill the project criteria and
therefore will not be used. (Dass, 2006)
The last method that could be suitable for the project is credit scoring. This method
has already been used in the financial sector for some years. It is primarily used by
banks to evaluate customers. Through this evaluation, the bank can predict the possible
risks of lending money. This means that the bank, as the lender, uses credit formulas to
analyze the borrower’s data. Because the system is not used only in the banking sector,
credit formulas may differ. The information important for the bank can be seen in
Figure 3. The pie chart shows the data that the bank has about the customer. The data
are divided into five main categories: amounts owed, payment history, types of credit
used, new credit, and length of credit history. (myfico, 2009)
Figure 3: FICO Scores chart
[Pie chart: Payment History 35%, Amounts Owed 30%, Length of Credit History 15%,
New Credit 10%, Types of Credit Used 10%]
Source: myfico, 2009
The percentages in Figure 3 reflect the importance of each piece of information. That
means, for example, that the payment history is more important than the amounts owed.
Each of the five categories is explained below to show why it is important.
As can be seen in Figure 3, the most important factor that the bank considers is
“payment history”, which accounts for 35 percent. Payment history includes information
about payments on accounts, for example whether the customer has or had a mortgage,
credit cards, loans etc. The second most important attribute, “amounts owed”, represents
30% of the pie chart. It includes the amount the customer owes on accounts, the number
of accounts, credit limits on accounts etc. “Length of credit history” accounts for 15
percent and includes, for example, the dates when the account(s) were opened and how
often the account is used. Finally, the last two categories are each worth 10 percent.
“New credit” holds information about recent accounts, including when they were opened,
credit limits, and credit history. The last 10 percent of the pie chart is “types of
credit used”, which covers the types of accounts the customer uses, for example credit
card accounts or loan accounts. (myfico, 2009)
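To show how the percentages in Figure 3 could act as weights, the following minimal sketch
combines five sub-scores into one number. It is an assumption made for illustration only:
the real FICO formula is not given in the source, and the 0 to 100 sub-score scale, the
dictionary keys and the function name `weighted_score` are hypothetical.

```python
# Category weights in percent, taken from Figure 3.
WEIGHTS = {
    "payment_history": 35,
    "amounts_owed": 30,
    "length_of_credit_history": 15,
    "new_credit": 10,
    "types_of_credit_used": 10,
}


def weighted_score(sub_scores):
    """Combine per-category sub-scores (assumed 0-100) into one weighted score."""
    total = sum(WEIGHTS[name] * sub_scores[name] for name in WEIGHTS)
    return total / 100


# Hypothetical customer: strong credit history length, weak mix of credit types.
example = {
    "payment_history": 80,
    "amounts_owed": 50,
    "length_of_credit_history": 100,
    "new_credit": 60,
    "types_of_credit_used": 40,
}
score = weighted_score(example)  # (2800 + 1500 + 1500 + 600 + 400) / 100
```

The sketch makes the ranking in the text concrete: a change of ten points in payment
history moves the result by 3.5, while the same change in new credit moves it by only 1.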
The purpose of the practical project is to evaluate the customers of the bank. Based on
the evaluation, segmentation can be done and used for service improvement. The credit
scoring method is ideal for this purpose and was therefore chosen from all the solutions
mentioned. The method can evaluate customers according to the bank’s criteria. With the
right data and a proper model, the bank would be able to decide whether a customer is
worth lending money to. So, the project goal is to create an accurate model that is able
to do such segmentation. Creating the model would also lead to service improvement,
which is the main business goal of the practical data mining project.
5.2. Data Understanding
In the data understanding phase the main goal is to get familiar with the data
available for the project and then to decide which data are interesting and suitable for
the goal of the project. So in the data understanding part, the domain expert would
collect the data from the bank and analyze them. The collected data contain very
sensitive information about customers and the bank. Because of this, the real data can
be considered a part of the bank’s know-how. Moreover, revealing private information
about the bank’s customers would violate their privacy rights as well as damage the
bank’s reputation. Therefore the data are not available to the general public, and data
from a real bank are not available for the project; common sense and assumptions are
used instead.
The bank stores a large amount of data, including information about employees,
transactions, customers etc. For the purpose of the project the main focus is on
information about customers; other data that have no relationship to customers can be
considered irrelevant. Information about a customer is collected when the customer asks
for an account. This information includes name, address, phone number, services that
he/she needs etc. So the data collection is done when the account is created, and the
data are stored in the bank’s database.
The model of the bank’s database consists of several classes. The class diagram can be
seen in Figure 4. The lines represent the relationships between classes and the symbols
represent the type of each relationship. The relationships are the following: one
customer can have one or many accounts, one account can have one or many transactions,
and one customer can have one or many services and one or many loans.
Figure 4: Class diagram of bank’s database
[Class diagram: Customers 1 to ∞ Account; Account 1 to ∞ Transactions;
Customers 1 to ∞ Services; Customers 1 to ∞ Loan]
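The one-to-many relationships of the class diagram can also be written down as a minimal
sketch. The class names follow Figure 4, but the individual fields such as `balance` or
`amount` are assumptions chosen for illustration, since the thesis does not list the
columns of the tables.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Transaction:
    amount: float


@dataclass
class Account:
    balance: float
    # One account can have one or many transactions.
    transactions: List[Transaction] = field(default_factory=list)


@dataclass
class Service:
    name: str


@dataclass
class Loan:
    amount: float


@dataclass
class Customer:
    name: str
    # One customer can have one or many accounts, services and loans.
    accounts: List[Account] = field(default_factory=list)
    services: List[Service] = field(default_factory=list)
    loans: List[Loan] = field(default_factory=list)


# Hypothetical example record.
customer = Customer(
    name="Example Customer",
    accounts=[Account(balance=420.0, transactions=[Transaction(-35.5)])],
    services=[Service("internet banking")],
    loans=[Loan(amount=5000.0)],
)
```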
After the familiarization, the domain expert concludes that the bank’s database is
suitable for the project goal, meaning that it contains enough data to produce a valid
result. So, the data understanding part was successful and data preparation can begin.
5.3. Data Preparation
The data preparation part covers the concrete attributes that are important for the
project. The data were collected from four tables. From the customers table the personal
information is needed; the most crucial attributes from this table are age, income and
employment. Some attributes are also collected from the services table and from the
account table. For the purpose of the project the most important attribute in the
account table is the account balance. Lastly, the loan table contains an attribute
called amount, which is also very important and specifies the amount of money the
customer wants to borrow.
In the next step all the important attributes in the mentioned tables need to be
modified so that the decision tree knows how to process them. The list of attributes,
grouped into bands according to boundaries, can be seen below:
Personal information:
Gender: Male/ Female
Marital status: Divorced/separated/married/single/widowed
Age:
young: 0 – 25
middle aged: 25 – 50
old: 50 – 67
retired: >67
Annual Income:
low: 0 - 499
middle: 500 – 799
high: > 800
Employed: yes/no
Job position: employed/unemployed
Accommodation: own/rent/for free
Number of residents in the household: (number)
Number of children: (number)
Service information:
Number of credit cards: (number)
Insurance: yes/no
Internet banking: yes/no
Account information:
Monthly account balance:
low: 0 – 249
middle: 250 – 999
high: >1000
Credit history: credit never taken/ all credit paid on time/ delay in payments
Number of loans: (number)
Number of permanent transactions: (number)
Number of transactions: (number)
Loan information:
Type of the loan: house/student/combined/others
Purpose of the loan: house/car/equipment/investment/business/others
Amount: (number)
Monthly payments: (number)
Debtors: none/co-applicants/guarantor
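The banding of numeric attributes listed above can be sketched as three small functions.
The listed ranges leave some boundary values open or overlapping (for example age 25, or
an income of exactly 800), so this sketch assumes that a boundary value belongs to the
upper band. That rule, and the function names, are assumptions for illustration only.

```python
def band_age(age):
    """Map age in years to the bands young / middle aged / old / retired."""
    if age < 25:
        return "young"
    if age < 50:
        return "middle aged"
    if age <= 67:
        return "old"
    return "retired"


def band_income(income):
    """Map annual income to low / middle / high."""
    if income < 500:
        return "low"
    if income < 800:
        return "middle"
    return "high"


def band_balance(balance):
    """Map monthly account balance to low / middle / high."""
    if balance < 250:
        return "low"
    if balance < 1000:
        return "middle"
    return "high"
```

With these functions a raw record such as age 34, income 650 and balance 1200 becomes
the categorical record middle aged, middle, high.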
5.4. Modeling
After the modification of the attributes the modeling process can begin. In the
modeling phase the decision model is created. For the purpose of the project a simple
decision tree is proposed to demonstrate the possible criteria that the bank can
require. The data mining expert designed the decision tree shown in Figure 5 according
to three basic attributes: annual income, monthly account balance and employment.
Figure 5: Decision tree model
[Decision tree: Annual income: High → Yes; Middle/Low → Monthly account balance.
Monthly account balance: High → Yes; Low → No; Middle → Employed.
Employed: Yes → Yes; No → No]
Source: Berka, 2003
According to the proposed model, a customer asking for a loan is first assessed on his
or her annual income. As can be seen in the data preparation part, the attribute has
been divided according to boundaries into three groups: high, middle and low. As the
decision tree shows, the customer gets the loan if he or she has a high income (more
than €800). If not, the customer is assessed on the second attribute, which, as can be
seen in Figure 5, is the monthly account balance. The same principle applies here. The
customer is given the loan if his or her balance is higher than €1000. If the customer’s
balance is lower than €249, he or she is not suitable for the loan. If the balance is in
the range €250 – €999, the customer is assessed on the third and last attribute. This
attribute is simple: the customer gets the loan if he or she is employed, and does not
get it otherwise.
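The decision tree of Figure 5 can be expressed as a short function. This is a minimal
sketch, not the bank's actual implementation: the function name `approve_loan` is
hypothetical, and because the text leaves the exact boundary values open (an income of
exactly €800, a balance of exactly €1000 or between €249 and €250), the sketch assumes
that a boundary value falls into the upper band.

```python
def approve_loan(annual_income, monthly_balance, employed):
    """Return True if the customer gets the loan under the proposed decision tree."""
    # First attribute: a high annual income is enough on its own.
    if annual_income >= 800:
        return True
    # Second attribute: a high monthly account balance is also enough.
    if monthly_balance >= 1000:
        return True
    # A low balance means the customer is not suitable for the loan.
    if monthly_balance < 250:
        return False
    # Middle balance: the third attribute, employment, decides.
    return employed
```

For example, a customer with an income of €600 and a balance of €500 gets the loan only
if he or she is employed, while an income of €900 leads to approval regardless of the
other two attributes.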
The proposed decision tree is very simple and easy to understand. Of course, the bank
can easily change the requirements for the loan. For example, the bank can decrease or
increase the required annual income. The decision tree itself can also be simply
modified: additional attributes that help to evaluate customers can be added. It all
depends on the requirements given by the bank.
5.5. Evaluation
In the evaluation phase, the created model needs to be evaluated and tested. The data
mining model used in the practical project is based on the decision tree described in
the modeling phase. The created model meets all the business objectives and goals of the
project and can therefore be evaluated as suitable. To ensure the model works properly,
it needs to be tested. The data mining expert decided to test the model on a sample of
5000 customers. All customers in the sample are evaluated according to the model and the
results are reviewed. If, for example, an error occurred with only 5 customers, the bank
could estimate the accuracy of the model at approximately 99.9%.
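The accuracy figure in the example follows from simple arithmetic, sketched below. The
function name `accuracy` is hypothetical, and the result is only an estimate for the
tested sample, not a guarantee for all future customers.

```python
def accuracy(sample_size, errors):
    """Share of correctly evaluated customers in the test sample."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return (sample_size - errors) / sample_size


# 5 errors in a sample of 5000 customers: (5000 - 5) / 5000 = 0.999
result = accuracy(5000, 5)
as_percent = f"{result:.1%}"
```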
5.6. Deployment
The proposed model based on the decision tree is expected to give the bank very high
accuracy. The next step, and the purpose of the deployment phase, is to apply the
solution in the bank and propose the changes that can be made. As a result, the domain
expert suggests applying the platform in two basic ways:
1. Changing the process of decision making
2. Introducing and improving services
The first and most crucial improvement will help clerks in the bank decide whether a
loan should be given or not, that is, whether a person applying for a loan is worth
lending money to. Clerks will use a platform that evaluates the borrower according to
the data and identifies him/her as a “yes” or “no” customer. Each customer will need to
be evaluated before a loan is given. The decision making process will thus be much
easier, and the possible mistakes made by clerks will be minimized. This will of course
make the work of the bank’s employees much easier. However, if the platform marks the
customer as not suitable for the loan, the clerk will always need to check whether the
data are correct. If the customer does not pass, it is the clerk’s responsibility to
find the reasons and explain them to the customer. For example, the clerk can advise the
customer to increase the account balance or decrease the amount of money borrowed.
Moreover, if the customer does not, for instance, have any account balance or is
unemployed, he/she fails completely. In this case, the clerk needs to explain that
he/she does not fulfill the bank’s criteria and therefore the loan cannot be approved.
Secondly, the created model will support the introduction of new services and the
improvement of services that the bank already offers. The proposed decision tree will be
part of two applications that will be created for the bank. The first is an internal
application that will be used only by employees. The second application will be
external: it will be part of an online platform used by customers.
The internal application is very important because it will allow employees to use the
model without any further knowledge of decision trees or data mining. The application
needs to be written in a commonly used programming language. It is also important that
the program is compatible with the operating systems used in the bank. The program needs
to be secured because it contains sensitive information and also includes the bank’s
“know-how”. The primary users of the program will be clerks. We can assume that most of
them have just basic computer skills, so the program should be user friendly and
intuitive. That would allow clerks to work with the program without going through long
and complicated training. The program would be used while dealing with a customer and
would allow the bank to simplify the decision making process. Moreover, with such a
sophisticated program, the clerks do not have to be very skilled or educated in the
banking sector. So, with the help of the software the bank can introduce new services.
For example, the bank can start a new customer line that would be available 24/7. It
would need one operator with good people skills who is trained to use the software. The
customer line would be for customers who do not have time to go to the bank. They can
simply call, give their information to the operator and find out whether they can get a
loan or not. By introducing this service, the bank can attract a broader range of
customers and therefore increase its revenue stream.
The external application would also attract more customers. It would simply be an
online platform on the bank’s corporate website. The main purpose of the platform is to
serve clients who do not want to, or cannot, go to the bank. The platform would allow
customers to enter the required data into an online form similar to the one that can be
seen in Picture 1.
Picture 1: Online form
Source: TrueCredit
The online form in Picture 1 is taken from the TrueCredit web site and is used just as
a practical demonstration of what kind of data it may include. For example, the data
include the purpose of the loan that the customer applies for, as well as personal and
contact information. As soon as the customer submits this form, he/she is automatically
redirected to the next form, which asks for more detailed information about his/her
credibility, for example the amount of money needed, annual income, number of children,
whether the person is employed etc. The data are then analyzed and the customer gets the
result for the loan he/she applied for. With the growing popularity of the internet this
solution can attract more customers and therefore lead to financial benefits.
5.7. Project Conclusion
To highlight the importance of the project it is essential to mention that offering
loans to the general public is still considered a core business of commercial banks.
Loans can therefore be considered one of their primary sources of revenue. Logically,
this fact forces banks to develop the lending process and improve their services. The
process itself is quite simple. However, the decision making is primarily influenced by
human judgment: traditionally, the borrower is evaluated by a clerk in the bank. Even if
the clerk is highly trained, this can lead to mistakes. The problem can be solved by
using a data mining approach with credit scoring formulas, as proposed in the project.
By summarizing all the information, the credit scoring method can evaluate the borrower
and decide whether he/she deserves to get a loan. Finally, the proposed solution will
improve the decision making process and help the bank to decrease the risk of losing the
money lent. Moreover, the solution can positively influence the economic situation of
the bank by introducing new services to customers.
Thesis Conclusion
The main goal of the thesis is to inform the reader about data mining technology and
highlight its importance, and furthermore to propose a project in which the technology
is applied and draw attention to its outcome.
The theoretical part covers general information about mining data. It starts by
explaining the basic principles on which the technology works. It then describes how
data mining was developed and specifies four reasons why companies should consider
investing in data mining technology. In the next part, the short history of the
technology is highlighted. The thesis then mentions the evolution of data mining and
makes some predictions about the near future. Because data mining technology is strongly
connected with databases, some data warehousing concepts are covered. The thesis
continues by explaining the difference between data mining and the knowledge discovery
process. Finally, the theoretical part describes the six steps of the CRISP-DM model
that was used in the practical part.
The practical part is dedicated to a data mining project focused on the financial
sector. The goal of the project is to help a bank divide its customers into two groups
according to their financial credibility. The first group represents customers who are
suitable for a loan and the second group represents customers who are not. The project
follows the CRISP-DM model and therefore consists of six main steps. To achieve the
project’s goal, the credit scoring method was used. The outcome of the project is
presented in the last, deployment part, which proposes two main ways in which the bank
should use the created model. Firstly, the model should be used to improve and simplify
the decision making process when the bank lends money. Secondly, the model should be
used to improve the bank’s current services as well as to introduce new ones. To
conclude, the practical part deals with a data mining project whose outcome can help the
bank to decrease risks and increase revenues.
Last but not least, the popularity of data mining is driven by the increasing amount of
data being stored. This is primarily caused by the advancement of information
technology, which allows data to be stored faster and more cheaply. Globalization and
the wide spread of telecommunication technologies are among the reasons why data created
by people around the world can be obtained quite easily. These are some of the many
reasons why demand naturally arose for a tool or technology that could translate these
valuable data into helpful knowledge that can be easily understood.
As the thesis mentions, by using data mining companies are able to gain knowledge and
therefore make better decisions, gain competitive advantage, or grow wealth. Data mining
and knowledge discovery can seem like a complicated tool today. However, further
development will probably lead to it being used more, not just by businesses but also by
governments and ordinary people. To conclude, data mining gives businesses a unique
opportunity to extract information from data they already have but in a form that cannot
be understood. Therefore, this opportunity should not be underrated. It should be
considered a good investment, especially in today’s competitive market, where making the
right decisions is the key to success.
USING DATA MINING AS A TOOL FOR DISCOVERING IMPORTANT
KNOWLEDGE FOR COMPANIES
I, Tomas Vanek, do hereby irrevocably consent to and authorize the library of Vysoká
škola manažmentu v Trenčíne to file the attached project and/or bachelor thesis USING
DATA MINING AS A TOOL FOR DISCOVERING IMPORTANT KNOWLEDGE FOR
COMPANIES and make such paper available for in-library use in all site locations.
For public access to digital form of the project/bachelor thesis on internet
I give my permission
I do not give my permission
I state at this time that the contents of this paper are my own work and all resources used
are indicated.
_______________________________________________________________ (Signature)
___________________28.3.2010________________________________________ (Date)
List of Pictures
Picture 1: Online form
List of Figures
Figure 1: Knowledge discovery process model
Figure 2: Cross-Industry Standard Process model
Figure 3: FICO Scores chart
Figure 4: Class diagram of bank’s database
Figure 5: Decision tree model
Literature
Berka, P. (2003). Dobývání znalostí z databází [Gaining Knowledge from Databases].
Prague, Czech Republic: Academia.
Brunker, M. (2008). Poker site cheating plot a high-stakes whodunit. Retrieved November
5, 2009, from http://www.msnbc.msn.com/id/26563848/
Cios, K., Pedrycz, W., Swiniarski, R., & Kurgan, L. (2007). Data Mining: A Knowledge
Discovery Approach. New York: Springer.
Chhay, H. (2005). Data mining. Retrieved November 5, 2009, from
http://cseserv.engr.scu.edu/StudentWebPages/hchhay/hchhay_FinalPaper.htm#DISADVANTAGES
Crisp-dm. (n.d.). Process Model. Retrieved November 5, 2009, from
http://www.crisp-dm.org/Process/index.htm
Dass, R. (2006). DATA MINING IN BANKING AND FINANCE: A NOTE FOR BANKERS.
Retrieved November 5, 2009, from
http://www.iimahd.ernet.in/publications/data/Note%20on%20Data%20Mining%20&%20BI%20in%20Banking%20Sector.pdf
Data Mining Software. (n.d.). A Brief History of Data Mining. Retrieved November 5,
2009, from http://www.data-mining-software.com/data_mining_history.htm
Hunter, J. (2009). Data Mining Process using CRISP. Retrieved November 5, 2009, from
http://www.youtube.com/watch?v=dJcmOe3_P0E
myfico. (2009). What’s in your FICO® score. Retrieved November 5, 2009, from
http://www.myfico.com/CreditEducation/WhatsInYourScore.aspx
Palace, B. (1996). Data Mining: What is Data Mining?. Retrieved November 5, 2009,
from http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamining.htm
Sumathi, S., & Sivanandam, S. (2006). Introduction to Data Mining and its Applications.
New York: Springer.
Srinath, S. (2008). Data Mining and Knowledge Discovery. Retrieved November 5, 2009,
from http://www.youtube.com/watch?v=m5c27rQtD2E
ABSTRAKT
Topic: Using data mining in companies as a tool for discovering information
Key words: data mining, databases, cross industry standard process, credit scoring.
Student: Tomáš Vanek
Thesis advisor: Martina Česalová, M.S.C.S.
The bachelor thesis deals with the basic principle of data mining as a tool for
obtaining new information. The thesis consists of two main parts. The first part is
theoretical and explains how data mining works and what information can be reached with
it. It also describes the possible advantages and disadvantages of the tool. The
practical side of the thesis builds on the theoretical part and is devoted to applying
data mining to the banking sector. It consists of creating a project for a fictitious
bank that needs to segment its customers when granting loans. The project consists of
the six phases of the CRISP-DM model, with the main emphasis placed on the business
substance of the proposed solution.
ABSTRACT
Topic: Using Data Mining as a Tool for Discovering Important Knowledge for
Companies
Key words: Data Mining, Databases, Cross Industry Standard Process, Credit
Scoring.
Student: Tomáš Vanek
Advisor: Martina Česalová, M.S.C.S.
The bachelor thesis covers the fundamental principles of data mining as a tool for
knowledge discovery. The thesis consists of two main parts. The first part is
theoretical and explains how data mining works and what kind of information it can
reveal. Additionally, the first part also mentions the advantages and disadvantages of
the tool. The second, practical part concentrates on applying data mining in the banking
domain. A data mining project was therefore created that deals with customer
segmentation, which helps the bank to estimate a customer’s financial credibility. The
project follows the six steps of the CRISP-DM model, but the main focus is on the
business aspect of the solution.