Enabling Decision Tree Intelligence in Materialized View
Submitted By
RIAZ AHMAD
(MS-IT Session 2006-2008)
A thesis submitted in partial fulfillment of the requirement for the degree of
Master of Science in Information Technology
In
Databases and Data Warehousing
Institute of Management Sciences, NWFP, Peshawar
Pakistan
October, 2008
DEDICATION
I dedicate this struggle of mine to my loving Parents and
to my sweet Alishba and Rohail.
Certificate of Originality
This is to certify that the report “Enabling Decision Tree Intelligence in Materialized
View”, submitted by the concerned student, meets the requirements for the Degree of
MASTER OF SCIENCE (Information Technology) at IM | Sciences. All the work done
is solely the effort of the student, and adequate credit is given to the work of others,
which is cited as reference material.
Supervisor:
Mr. Nafees-Ur-Rehman
Sign: __________________
External Examiner:
Name: ______________________
Designation: _________________
Affiliation: __________________
Sign: ___________________
Research Coordinator:
Mr. Nafees-Ur-Rehman
Sign: __________________
ACKNOWLEDGMENTS
I am indebted and grateful to all those people who have helped me in one way or another
during the course of my research project. In particular, I would like to acknowledge and
express my deepest gratitude to the following:
First and foremost, I would like to express my gratitude to Mr. Nafees-Ur-Rehman, my
supervisor, to whom I am greatly indebted, for giving me the opportunity to undertake this
research under his supervision and especially for his encouragement throughout the
course of the research.
Mr. Syed Akmal Shah, my classmate, for his support, kind suggestions and fruitful
discussions.
Lab administrators Mr. Mumtaaz and Mr. Saqib for their kind help and patience in
arranging everything I needed during my lab work for this dissertation.
Last but not least, I would like to thank my parents and the rest of my family for their
patience and constant encouragement.
Riaz Ahmad,
IM | Sciences Hayatabad
Peshawar, PAK.
Oct, 2008
List of Abbreviations
OLTP - Online Transaction Processing
DW - Data Warehouse
ERD - Entity Relationship Diagram
ODS - Operational Data Store
OLAP - Online Analytical Processing
DTS - Data Transformation Services
RDBMS - Relational Database Management System
MV - Materialized View
DM - Data Mining
KDD - Knowledge Discovery in Databases
DC - Data Cleaning
DS - Data Selection
DI - Data Integration
DT - Data Transformation
PE - Pattern Evaluation
KP - Knowledge Presentation
ID3 - Iterative Dichotomiser 3
List of Figures
Fig No   Figure Caption                                      Page No
2.1      Data warehouse users                                9
2.2      Dimensional Modeling                                11
5.1      KDD parts                                           31
5.2      Decision Tree                                       36
5.3      Decision Tree of the dataset                        41
5.4      Decision Tree Generated in SQL Server 2000          42
List of Tables
Table No   Table Title                                       Page No
2.1        Comparison between OLTP and OLAP Databases        8
5.1        Training Dataset for classification               38
5.2        Gain Information for Original set                 40
5.3        Gain Information for Rain subset                  40
5.4        Gain Information for Sunny subset                 41
5.5        Dependent Table                                   43
5.6        Resultant values in dependent table               44
Abstract
Data mining has attracted a great deal of attention in the information industry in recent
years due to the wide availability of huge amounts of data and the imminent need for
turning such data into useful information and knowledge. Unlike the conventional model,
where data is taken to the data mining system, this thesis proposes that mining algorithms
be placed inside data warehouse and database structures. This research is a step toward
integrating data mining with data warehouses and databases. In particular, the decision
tree is embedded in materialized views to reduce tree construction and data classification
time. Constructing a decision tree requires different calculations and computations to be
carried out repeatedly in a recursive manner. The entropy of the initial dataset is
calculated along with other values and stored in a new storage structure. These values are
referenced during tree construction whenever a new decision tree is required. The
calculations are performed once and updated whenever the source dataset is updated, and
their results are reused each time a new decision tree is constructed. This pre-calculation
reduces tree construction time because these values do not have to be recalculated.
Table of Contents
Topic Titles                                                                        Page No
Chapter 1...................................................................................................................... 1
Introduction ............................................................................................................... 1
1.1 Background ........................................................................................................... 1
1.2 Scope ....................................................................................................................... 2
1.3 Objective ................................................................................................................ 2
1.4 Summary of Chapters ......................................................................................... 2
Chapter 2...................................................................................................................... 5
Data Warehousing ................................................................................................... 5
2.1 Benefits of Data Warehousing........................................................................... 5
2.2 OLAP Data Characteristics ............................................................................... 6
2.2.1. Consolidated and Consistent.......................................................................... 6
2.2.2. Subject Oriented ............................................................................................. 6
2.2.3. Historical .......................................................................................................... 6
2.2.4. Read Only ........................................................................................................ 7
2.2.5. Granular .......................................................................................................... 7
2.3 Database VS Data Warehouse .......................................................................... 8
2.4 Data Warehouse Users ........................................................................................ 9
2.5 Developing a Data warehouse ........................................................................... 9
2.5.1 Identification and collection of information ................................................ 9
2.5.2 Dimensional Modeling Design .................................................................... 10
2.5.3 Develop an architecture containing an Operational Data Store (ODS) .......... 13
2.5.4 Design Relational Database and OLAP cubes. ......................................... 14
2.5.5 Develop Data warehouse maintenance applications ................................. 14
2.5.6 Develop Analysis application ...................................................................... 14
2.5.7 Test and install or organize the System ..................................................... 15
Chapter 3.................................................................................................................... 16
Materialized View .................................................................................................. 16
3.1 Materialized View in Different Environment .............................................. 17
3.1.1 Materialized Views for Distributed Computing.......................................... 17
3.1.2 Materialized Views for Mobile Computing ................................................. 17
3.2 The Need for Materialized Views ................................................................... 17
3.3 Uses of Materialized Views .............................................................................. 18
3.4 How Materialized Views Work....................................................................... 18
3.5 Types of Materialized View ............................................................................. 19
3.5.1 Types of Materialized view on the basis of Tables ...................................... 19
3.5.2 Some other Types of Materialized view ....................................................... 20
3.6 Advantages and Disadvantages ...................................................................... 21
3.6.1 Advantages.................................................................................................... 21
3.6.2 Disadvantages ............................................................................................... 21
3.7 Materialized View Refresh Methods. ............................................................ 21
3.8 Creating a Materialized View ...................................................................... 22
3.9 Indexed View in SQL Server 2000 ................................................... 22
3.9.1 Restrictions on Creating Indexed Views .................................................... 23
3.9.2 Create the Indexed View or Materialized view ......................................... 23
Chapter 4.................................................................................................................... 25
Integration of Data warehouse and Data Mining....................................... 25
4.1 Introduction ...................................................................................................... 25
4.2 Data Integration ............................................................................................... 26
4.3 Schema Integration ......................................................................................... 27
4.4 Redundancy ...................................................................................................... 28
4.5 Inconsistencies .................................................................................................. 28
Chapter 5.................................................................................................................... 30
Data Mining ............................................................................................................. 30
5.1 Data Mining Definition. .................................................................................... 30
5.2 Data Mining History.......................................................................................... 31
5.3 Data Mining Techniques .................................................................................. 32
5.4 Classification in Data Mining .......................................................................... 33
5.4.1 Classification .................................................................................................. 33
5.4.2 Related issues with classification .................................................................. 34
5.5 Decision Tree Technique for Classification ................................................. 35
5.5.1 Decision Tree .................................................................................................. 35
5.5.2 Generating classification rules from a decision tree ................................... 36
5.5.3 ID3 Algorithms ............................................................................................... 37
5.6 Partial Integration of Decision Tree in Material View .............................. 42
5.6.1 Classification Experiment ............................................................................. 43
5.6.2 Conclusion ...................................................................................................... 45
Appendix A ............................................................................................................... 46
Code Section.............................................................................................................. 46
Appendix B ............................................................................................................... 52
Application Interface ............................................................................................... 52
Appendix C ............................................................................................................... 58
References .................................................................................................................. 58
Chapter 1
Introduction
1.1 Background
A data warehouse is the enterprise-level repository of subject-oriented, time-variant,
historical data used for information retrieval and decision support. The DW stores
atomic and summary data. Decision-making means that the data warehouse is intended
for knowledge workers, those people who must analyze information provided by the
warehouse to make business decisions. A warehouse is not intended for day-to-day
transaction processing. Knowledge workers must access information to plan, forecast,
and make financial decisions. They are often people who are reasonably authoritative or
are in influential positions, such as financial controllers, business analysts, or department
managers.
Generally, data mining (sometimes called data or knowledge discovery) is the process of
analyzing data from different perspectives and summarizing it into useful information
that can be used to increase revenue, cut costs, or both. Data mining software is one of a
number of analytical tools for analyzing data. It allows users to analyze data from many
different dimensions or angles, categorize it, and summarize the relationships identified.
Technically, data mining is the process of finding correlations or patterns among dozens
of fields in large relational databases. Data mining comprises different techniques used
for classification, clustering, association and sequential pattern discovery.
Classification is a data mining (machine learning) technique used to predict group
membership for data instances. For example, you may wish to use classification to
predict whether the weather on a particular day will be “sunny”, “rainy” or “cloudy”.
Popular classification techniques include decision trees and neural networks.
Classification produces a function that maps a data item into one of several predefined
classes by taking a training dataset as input and building a model of the class attribute
based on the remaining attributes. The resulting model is then used to classify new data.
A decision tree is a classifier in the form of a tree structure. Decision trees are powerful
and popular tools for classification and prediction. The attractiveness of decision trees is
due to the fact that, in contrast to neural networks, decision trees represent rules. Rules
can readily be expressed so that humans can understand them, or even used directly in a
database access language like SQL so that records falling into a particular category may
be retrieved. There are a variety of algorithms for building decision trees. A decision tree
can be used to classify an example by starting at the root of the tree and moving through
it until a leaf node is reached, which provides the classification of the instance.
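As an informal illustration of how a tree branch translates into SQL, consider a hypothetical weather table with the kind of attributes used later in this thesis (outlook, humidity, play); the table and column names here are assumptions made for the example, not part of the actual dataset:

-- A single decision tree rule such as
--   IF outlook = 'sunny' AND humidity = 'high' THEN play = 'no'
-- can be turned into a query that retrieves every record falling into that leaf:
SELECT *
FROM weather_data
WHERE outlook = 'sunny'
  AND humidity = 'high';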
1.2 Scope
The ID3 algorithm is used behind the decision tree technique. It is a recursive process.
During each iteration, the following three steps occur before the test attribute is selected.
In the first step, ID3 calculates the entropy of the whole dataset; in the second step, it
calculates the entropy and gain of each input attribute in the dataset; and in the third step,
ID3 selects the attribute with the maximum gain for classifying the dataset. This
three-step process is very expensive with respect to time.
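For reference, the quantities referred to above are the entropy of a dataset S with c classes and the information gain of an attribute A; the notation follows the standard textbook formulation of ID3 rather than any specific listing in this thesis:

Entropy(S) = -\sum_{i=1}^{c} p_i \log_2 p_i

Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|}\, Entropy(S_v)

where p_i is the proportion of records in S belonging to class i, and S_v is the subset of S for which attribute A has value v. It is exactly these repeated entropy and gain evaluations that the pre-computation described in this thesis avoids.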
1.3 Objective
This research work has two main objectives. The first objective is to integrate data
mining intelligence into data warehouses and databases; here, I try to integrate the
decision tree technique with the materialized view. The second objective is to reduce the
time required for the construction of the classification tree. The computational process
for constructing the tree is highly complex and recursive in nature. It involves repeatedly
calculating various values, i.e. the entropy of the dataset and the entropy and gain values
of each input attribute in the dataset. Here, I have pre-computed the results required at
least for the selection and classification of the root node.
1.4 Summary of Chapters
Chapter 1, Introduction
Chapter 1 provides the background, scope and objectives of the thesis.
Chapter 2, Data Warehousing
This chapter defines the data warehouse and the benefits of data
warehousing. Furthermore, it compares data warehouses with databases
and, at the end, covers the data warehouse development process.
Chapter 3, Materialized or Indexed View
Chapter 3 covers the definition of the materialized view, the types of
MV and their uses, and discusses the advantages and disadvantages
along with the creation process.
Chapter 4, Data Warehouse and Data Mining Integration
Chapter 4 provides a detailed discussion of integration and describes
the different aspects of integrating data mining with databases, such as
data integration, schema integration, redundancy and inconsistencies.
Chapter 5, Data Mining
The last chapter discusses data mining, data mining techniques and
classification along with related issues. Furthermore, it discusses the
decision tree and its construction, along with the ID3 algorithm that is
used behind the decision tree.
Appendix A
Appendix A contains the code section, which is divided into two parts.
The first part contains the SQL Server 2000 code, while the second part
contains the Visual Basic .NET code that provides the interface for the
work performed in SQL Server 2000.
Appendix B
Appendix B provides the output of the practical work.
Appendix C
Appendix C contains references to the literature used in this thesis.
Chapter 2
Data Warehousing
Some people use the term data warehouse in a very general way. To them, any read-only
collection of accumulated historical data is called a data warehouse. A data warehouse is
a database specifically structured for query and analysis. A data warehouse typically
contains data representing the business history of an organization. Data is usually less
detailed and longer-lived than data from an online transaction processing (OLTP) system.
For example, a data warehouse may store daily order totals by customer over the past five
years, whereas an OLTP system would store every order processed but retain those
records for only a few months.
Some characteristics are common to all data warehouses:
- Data is collected from other sources; for example, an OLTP system.
- Data is made consistent prior to storage in the data warehouse.
- Data is summarized. Data warehouses usually do not retain as much detail as
transaction-oriented systems.
- Data is longer-lived. Transaction systems may retain data only until processing is
complete, whereas data warehouses may retain data for years.
- Data is stored in a format that is convenient for querying and analysis.
- Data is usually considered read only. [3]
2.1 Benefits of Data Warehousing
Companies build warehouses to help them make decisions and can use the information in
a warehouse to spot trends, buying patterns, and relationships. Once a company builds a
warehouse, company leaders have a consistent source for enterprise wide data that allows
for fast answers to queries. The analysis phase of building a data warehouse might
uncover previously hidden information that allows for better decisions. The following
are the advantages of a DW:
- The ability to access enterprise-wide data
- The ability to have consistent data
- The ability to perform analysis quickly
- Recognition of redundancy of effort
- Discovery of gaps in business knowledge or business processes
- Decreased administration costs
- Empowering all members of an enterprise by providing them with the information
necessary to perform effectively [3]
2.2 OLAP Data Characteristics
Data in a data warehouse has several attributes that differentiate it from data in a
standard, online transaction processing system (OLTP) [3].
2.2.1. Consolidated and Consistent
The terms consolidated and consistent have particular meanings in a data warehouse.
Consolidated means that the data is gathered from throughout the enterprise and stored in
a central location. Consistent means that all users will get the same results to the same
question, even if it is posed at different times. For example, the answer to the question,
"What were the total sales for January 1997?" will be consistent whether the question is
posed in 1997 or 2002.
2.2.2. Subject Oriented
Data in a warehouse should include only key business information. Often, data in (OLTP)
sources throughout the enterprise includes information that is not of use to decision
makers in the company. Only subject-oriented data should be moved into a warehouse.
Once in the warehouse, the data should be organized based on subject.
2.2.3. Historical
Data warehouse data is historical, which means that it does not change over time unless a
problem existed with the data at the source. Data in a warehouse represents a snapshot in
time, so a warehouse is accurate only to a certain point in the past. Data in a warehouse
often covers a long period of time; OLTP systems have only current or very recent data.
Data over a long period of time allows the analysis of trends over time, including
seasonal and long-term trends.
2.2.4. Read Only
Because data in a warehouse is historical, it is read only. Data in a warehouse changes
only if errors are found in the original source data because if data is updated after it is in a
warehouse, consistency is compromised. Because data in a warehouse will not be updated
or deleted, the warehouse can be structured to allow maximum speed and flexibility for
queries, such as an aggressive use of indexes.
2.2.5. Granular
Data in an OLTP system is stored with maximum detail. Data in a data warehouse does
not usually need to be stored with maximum detail. Instead, you can handle a certain
level of summarization, so the data is stored with more or less granularity. The key to
data warehouse design is to identify the appropriate level of summary. You can always
summarize up, but you cannot drill down through a summary without the lower-level
data.
2.3 Database VS Data Warehouse
The following table shows differences between OLTP and OLAP databases. [4]
Table 2.1 Comparison between OLTP and OLAP databases

Databases (OLTP):
- A database is a collection of related data, and a database system is a database and
database software together.
- Databases are transactional systems. Traditional databases support on-line transaction
processing (OLTP), which includes insertions, updates, and deletions. Multi-databases
provide access to disjoint and usually heterogeneous databases and are volatile.
- The ERD model is used for databases, and data in databases exists in normalized form.
- Optimized for a common set of transactions, usually adding or retrieving a single row
at a time per table.
- Designed for real-time business operations.
- Optimized for validation of incoming data during transactions; uses validation data
tables.
- Supports thousands of concurrent users.

Data Warehouses (OLAP):
- A data warehouse is also a collection of information as well as a supporting system.
- A data warehouse is frequently a nonvolatile store of integrated data from multiple
sources, processed for storage in a multidimensional model. Data warehouses also
support time-series and trend analysis, both of which require more historical data.
- Dimensional modeling is used for a DW, and data in a data warehouse is in
denormalized form.
- Optimized for bulk loads and large, complex, unpredictable queries that access many
rows per table.
- Designed for analysis of business measures by categories and attributes; a data
warehouse is typically optimized for a decision maker's access needs and designed
specifically to support efficient extraction, processing and presentation for analytic
and decision-making purposes.
- Loaded with consistent, valid data; requires no real-time validation.
- Supports few concurrent users relative to OLTP.
2.4 Data Warehouse Users
Data warehouse users can be divided into four categories: Statisticians, Knowledge
Workers, Information Consumers, and Executives. Each type makes up a portion of the
user population as illustrated in this diagram [5]
Figure 2.1 Data Warehouse User
2.5 Developing a Data warehouse
The following phases are necessary for the development of a data warehouse. These are
similar to those of most database projects.
1. Identification and collection of information.
2. Dimensional modeling design.
3. Develop an architecture containing an Operational Data Store (ODS).
4. Design the relational database and OLAP cubes.
5. Develop data warehouse maintenance applications.
6. Develop analysis applications.
7. Test and install or organize the system.
2.5.1 Identification and collection of information
First of all, understand the business before entering into discussions with users. Then
interview and work with the users, not the data. Learn about the needs of the users and
then turn these needs into project requirements. The data warehouse designer arranges or
selects the data which provides the suitable information. The most important part of the
discussion with users concerns their objectives and challenges, as well as how they make
business decisions. The business users should be tied to the design team during the
logical design process. They are the people who understand the meaning of the data.
After interviewing several users, find out from the experts what data exists and where it
resides, but only after you understand the basic business needs of the end users.
2.5.2 Dimensional Modeling Design
On the basis of the business requirements we can design the dimensional model. It must
address the business needs, the grain of detail, and which dimensions and facts to
include. The model should be designed for ease of access and maintenance and should be
able to adapt to future changes. It defines the relational databases that support the OLAP
cubes which provide immediate query results to analysts.
An OLTP system requires a normalized structure to minimize redundancy, provide
validation of input data, and support a high volume of fast transactions. A transaction
usually involves a single business event, such as placing an order or posting an invoice
payment. An OLTP model often looks like a spider web of hundreds or even thousands of
related tables. In contrast, a typical dimensional model uses a star or snowflake design
that is easy to understand and relate to business needs, supports simplified business
queries, and provides superior query performance by minimizing table joins. [5]
STAR SCHEMA IN DATA WAREHOUSE DESIGN
In the design of a data warehouse, the fundamental structure utilized in a relational
system is the star schema. This schema has become the design of choice because of its
compact and uncomplicated structure that facilitates the query responses on the data. This
schema is simple to understand and provides a good introduction to the framework of a
warehouse. [6]
Figure 2.2 Dimensional Modeling
Fact Tables
Each data warehouse or data mart includes one or more fact tables. Central to a star or
snowflake schema, a fact table captures the data that measures the organization's business
operations. Fact tables usually contain large numbers of rows, sometimes in the hundreds
of millions of records when they contain one or more years of history for a large
organization.
A key characteristic of a fact table is that it contains numerical data (facts) that can be
summarized to provide information about the history of the operation of the organization.
Each fact table also includes a multipart index that contains as foreign keys the primary
keys of related dimension tables, which contain the attributes of the fact records. Fact
tables should not contain descriptive information or any data other than the numerical
measurement fields and the index fields that relate the facts to corresponding entries in
the dimension tables.
In the FoodMart 2000 sample database provided with Microsoft® SQL Server™ 2000
Analysis Services, one fact table, sales_fact_1998, contains the following columns:
- product_id: foreign key for dimension table product.
- time_id: foreign key for dimension table time_by_day.
- customer_id: foreign key for dimension table customer.
- store_id: foreign key for dimension table store.
- store_sales: currency column containing the value of the sale.
- store_cost: currency column containing the cost to the store of the sale.
- unit_sales: numeric column containing the quantity sold.
In this fact table, each entry represents the sale of a specific product on a specific day to a
specific customer in accordance with a specific promotion at a specific store. The
business measurements captured are the value of the sale, the cost to the store, and the
quantity sold. The most useful measures to include in a fact table are numbers that are
additive. Additive measures allow summary information to be obtained by adding various
quantities of the measure, such as the sales of a specific item at a group of stores for a
particular time period. [4]
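To make the structure concrete, the following is a minimal sketch of how such a fact table could be declared in SQL. It is an illustrative assumption based on the column list above, not the actual FoodMart DDL; the data types are chosen only for the example:

-- Hypothetical star-schema fact table: each row is one sales line,
-- with foreign keys to the dimension tables and additive measures only.
CREATE TABLE sales_fact_1998 (
    product_id   INT      NOT NULL,   -- FK to product dimension
    time_id      INT      NOT NULL,   -- FK to time_by_day dimension
    customer_id  INT      NOT NULL,   -- FK to customer dimension
    store_id     INT      NOT NULL,   -- FK to store dimension
    store_sales  DECIMAL(10,2),       -- additive measure: sale value
    store_cost   DECIMAL(10,2),       -- additive measure: cost to the store
    unit_sales   INT                  -- additive measure: quantity sold
);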
Dimension Tables
Dimension tables contain attributes that describe fact records in the fact table. Some of
these attributes provide descriptive information; others are used to specify how fact table
data should be summarized to provide useful information to the analyst. Dimension tables
contain hierarchies of attributes that aid in summarization. Dimensional modeling
produces dimension tables in which each table contains fact attributes that are
independent of those in other dimensions. For example, a customer dimension table
contains data about customers, a product dimension table contains information about
products, and a store dimension table contains information about stores. Queries use
attributes in dimensions to specify a view into the fact information. [7] The records in a
dimension table establish one-to-many relationships with the fact table. For example,
there may be a number of sales to a single customer, or a number of sales of a single
product. The dimension table contains attributes associated with the dimension entry;
these attributes are rich and user-oriented textual details, such as product name or
customer name and address. [5]
Hierarchies
The data in a dimension is usually hierarchical in nature. Hierarchies are determined by
the business need to group and summarize data into usable information. For example, a
time dimension often contains the hierarchy elements: Year, Quarter, Month, Day, or
Quarter, Week, and Day. A dimension may contain multiple hierarchies – a time
dimension often contains both calendar and financial year hierarchies. Geography
hierarchy for sales points is: (Country, Region, State or Province, City, Store). [5].
Surrogate Keys
A critical part of data warehouse design is the creation and use of surrogate keys in
dimension tables. A surrogate key is the primary key for a dimension table and is
independent of any keys provided by source data systems. Surrogate keys provide the
means to maintain data warehouse information when dimensions change. Special keys
are used for date and time dimensions, but these keys differ from surrogate keys used for
other dimension tables [5].
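As a sketch of the idea (an assumed example using SQL Server syntax for the auto-generated key, not a table taken from the thesis schema), a dimension table typically carries a surrogate key generated inside the warehouse and kept separate from the key of the source system:

-- Hypothetical customer dimension: customer_key is the surrogate primary key,
-- while customer_id preserves the business key from the operational source system.
CREATE TABLE customer_dim (
    customer_key  INT IDENTITY(1,1) PRIMARY KEY,  -- surrogate key owned by the DW
    customer_id   VARCHAR(20) NOT NULL,           -- key from the source system
    customer_name VARCHAR(100),
    city          VARCHAR(50),
    country       VARCHAR(50)
);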
2.5.3 Develop an architecture containing an Operational Data Store (ODS)
The data warehouse architecture reflects the dimensional model developed to meet the
business requirements. Dimension design largely determines dimension table design, and
fact definitions determine fact table design. Data warehouse architectures must be
designed to accommodate ongoing data updates, and to allow for future expansion with
minimum impact on existing design. The historical nature of data warehouses means that
records almost never have to be deleted from tables except to correct errors. Errors in
source data are often detected in the extraction and transformation processes in the
staging area and are corrected before the data is loaded into the data warehouse database.
The dimensional model also lends itself to easy expansion. New dimension attributes and
new dimensions can be added, usually without affecting existing schemas other than by
extension. An entirely new schema can be added to a data warehouse without affecting
existing functionality. A new business subject area can be added by designing and
creating a fact table and any dimensions specific to the subject area. The Operational
Data Store (ODS) is an operational construct that has elements of both data warehouse
and a transaction system. Like a data warehouse, the ODS typically contains data
consolidated from multiple systems and grouped by subject area. Like a transaction
system, the ODS may be updated by business users, and contains relatively little
historical data. [5].
2.5.4 Design Relational Database and OLAP cubes.
In this phase, the star or snowflake schema is created in the relational database, surrogate
keys are defined and primary and foreign key relationships are established. Views,
indexes, and fact table partitions are also defined. OLAP cubes are designed that support
the needs of the users. [5].
2.5.5 Develop Data warehouse maintenance applications
The data maintenance applications, including extraction, transformation, and loading
processes, must be automated, often by specialized custom applications. Data
Transformation Services (DTS) in SQL Server 2000 is a powerful tool for defining many
transformations. [5].
2.5.6 Develop Analysis application
The applications that support data analysis by the data warehouse users are constructed in
this phase of data warehouse development. OLAP cubes and data mining models are
constructed using Analysis Services tools, and client access to analysis data is supported
by the Analysis Server. Other analysis applications, such as Excel PivotTables,
predefined reports, Web sites, and natural language applications using English Query, are
also developed. Specialized third-party analysis tools may also be acquired and installed.
2.5.7 Test and install or organize the System
It is important to involve users in the testing phase. After initial testing by development
and test groups, users should load the system with queries and use it the way they intend
to after the system is brought on line. Substantial user involvement in testing will provide
a significant number of benefits. Among the benefits are:
- Discrepancies can be found and corrected
- Users become familiar with the system
- Index tuning can be performed
It is important that users exercise the system during the test phase with the kinds of
queries they will be using in production. This can enable a considerable amount of
empirical index tuning to take place before the system comes online. Additional tuning
needs to take place after deployment, but starting with satisfactory performance is a key
to success. Users who have participated in the testing and have seen performance
continually improve as the system is exercised will be inclined to be supportive during
the initial deployment phase as early issues are discovered and addressed. [5].
Chapter 3
Materialized View
A materialized view or index view is a special type of summary table that is constructed
by aggregating one or more columns of data from a single table, or a series of tables that
are joined together. When queries are executed at an aggregation level satisfied by a
materialized view, the cost-based optimizer automatically rewrites the query to take
advantage of the most appropriate materialized view.
Materialized views can dramatically improve query performance, and significantly
decrease the load on the system. This is because materialized views require fewer logical
reads to satisfy the query than the same query running against the base tables.
Materialized views are a powerful feature that has been part of the Oracle RDBMS since
version 8.1. When they are effectively implemented across an entire data warehouse, the
total number of logical reads can be reduced by well over 90%. Although materialized
views are considered to be a data warehouse feature, they can also be employed in other
environments, including Operational Data Stores (ODS), data marts, and reporting tables
in OLTP environments, where end-users will perform rollup queries on the schema. [8]
Materialized views within the data warehouse are transparent to the end user or to the
database application.
In SQL Server 2000 and 2005, a view that has a unique clustered index is referred to as
an indexed view (MV in oracle). In the case of a non-indexed view, the portions of the
view necessary to solve the query are materialized at run time. Any computations such as
joins or aggregations are done during query execution for each query referencing the
view1. After a unique clustered index is created on the view, the view's result set is
materialized immediately and persisted in physical storage in the database, saving the
overhead of performing this costly operation at execution time. [10]
3.1 Materialized View in Different Environment
3.1.1 Materialized Views for Distributed Computing
In distributed environments, you can use materialized views to replicate data at
distributed sites and to synchronize updates done at those sites with conflict resolution
methods. The materialized views as replicas provide local access to data that otherwise
would have to be accessed from remote sites. Materialized views are also useful in
remote data marts. [9]
3.1.2 Materialized Views for Mobile Computing
You can also use materialized views to download a subset of data from central servers to
mobile clients, with periodic refreshes and updates between clients and the central
servers. [9]
3.1.3 Materialized View for data warehouses
In data warehouses, you can use materialized views to precompute and store aggregated
data such as the sum of sales. Materialized views in these environments are often referred
to as summaries, because they store summarized data. They can also be used to pre
compute joins with or without aggregations. A materialized view eliminates the overhead
associated with expensive joins and aggregations for a large or important class of queries.
[9]
3.2 The Need for Materialized Views
Use materialized views in data warehouses to increase the speed of queries on very large
databases. Queries to large databases often involve joins between tables, aggregations
such as SUM, or both. These operations are expensive in terms of time and processing
power. The type of materialized view you create determines how the materialized view is
refreshed and used by query rewrite.
Materialized views improve query performance by pre calculating expensive join and
aggregation operations on the database prior to execution and storing the results in the
database. The query optimizer automatically recognizes when an existing materialized
view can and should be used to satisfy a request. It then transparently rewrites the request
to use the materialized view. Queries go directly to the materialized view and not to the
underlying detail tables. In general, rewriting queries to use materialized views rather
than detail tables improves response. A materialized view can be partitioned, and you can
define a materialized view on a partitioned table. You can also define one or more
indexes on the materialized view. [9]
3.3 Uses of Materialized Views
This is relatively straightforward and is answered in a single word: performance. By
calculating the answers to the really hard questions ahead of time, we greatly reduce the
load on the machine and will experience:
- Fewer physical reads, since there is less data to scan through.
- Fewer writes, since we will not be sorting and aggregating as frequently.
- Decreased CPU consumption, since we will not be calculating aggregates and
functions on the data, as we will have already done that.
- Markedly faster response times, since our queries will return incredibly quickly when
a summary is used, as opposed to the details. This will be a function of the amount of
work we can avoid by using the materialized view.
Materialized views will increase your need for one resource: more permanently allocated
disk. We need extra storage space to accommodate the materialized views, of course, but
for the price of a little extra disk space, we gain a lot of benefit.
Materialized views work best in a read-only or read-intensive environment. They are not
designed for use in a high-end OLTP environment. They will add overhead to
modifications performed on the base tables in order to capture the changes. [11]
3.4 How Materialized Views Work
Materialized views may appear to be hard to work with at first. So, now that we can
create a materialized view and show that it works, what are the steps Oracle will
undertake to rewrite our queries? Normally, when QUERY_REWRITE_ENABLED is set
to FALSE, Oracle will take your SQL as is, parse it, and optimize it. With query rewrites
enabled, Oracle will insert an extra step into this process. After parsing, Oracle will
attempt to rewrite the query to access some materialized view, instead of the actual table
that it references. If it can perform a query rewrite, the rewritten query (or queries) is
parsed and then optimized along with the original query. The query plan with the lowest
cost from this set is chosen for execution. If it cannot rewrite the query, the original
parsed query is optimized and executed as normal. [11]
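As a small illustrative sketch of this behaviour (the view and table names are assumptions, not objects defined in this thesis), a summary materialized view created with query rewrite enabled can transparently answer a query written against the base table:

-- Enable query rewrite for the session (Oracle syntax).
ALTER SESSION SET QUERY_REWRITE_ENABLED = TRUE;

-- Hypothetical summary of a sales fact table.
CREATE MATERIALIZED VIEW sales_by_store_mv
ENABLE QUERY REWRITE
AS SELECT store_id, SUM(unit_sales) AS total_units
   FROM sales_fact_1998
   GROUP BY store_id;

-- This query references only the base table, but the optimizer may
-- rewrite it to read the much smaller sales_by_store_mv instead.
SELECT store_id, SUM(unit_sales)
FROM sales_fact_1998
GROUP BY store_id;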
3.5 Types of Materialized View
The following are the types of materialized view.
3.5.1 Types of Materialized view on the basis of Tables
There are different types of materialized views: the simple view and the complex view.
- A simple materialized view can be created only on the basis of a single table and does
not perform set operations, joins, or GROUP BY, e.g.
Create View inventory
As Select isbn, title, retail_price
From books
With Read only;
- A complex materialized view includes more than one table and can also perform set
operations, joins, or GROUP BY, e.g.
Create View balancedue
As Select customer#, order#, Sum(quantity * retail) Amtdue
From customers JOIN orders USING(customer#)
JOIN orderitems USING(order#)
JOIN books USING(isbn)
Group by customer#, order#;
3.5.2 Some other Types of Materialized view
The following are further types of Materialized view.
Read-only materialized view
You can make a materialized view read-only during creation by omitting the FOR UPDATE
clause. In addition, using read-only materialized views eliminates the possibility of a
materialized view introducing data conflicts at the master site or master materialized view
site, although this convenience means that updates cannot be made at the remote
materialized view site. The following is an example of a read-only materialized view:
[12]
CREATE MATERIALIZED VIEW hr.employees AS
SELECT * FROM hr.employees;
Updatable Materialized view
You can make a materialized view updatable during creation by including the FOR
UPDATE clause. For changes made to an updatable materialized view to be pushed back
to the master during refresh, the updatable materialized view must belong to a
materialized view group.
Updatable materialized views enable you to decrease the load on master sites because
users can make changes to the data at the materialized view site. The following is an
example of an updatable materialized view: [12]
CREATE MATERIALIZED VIEW hr.departments FOR UPDATE AS
SELECT * FROM hr.departments;
Writeable Materialized view
A writeable materialized view is one that is created using the FOR UPDATE clause but is
not part of a materialized view group. Users can perform DML operations on a writeable
materialized view, but if you refresh the materialized view, then these changes are not
pushed back to the master and the changes are lost in the materialized view itself.
Writeable materialized views are typically allowed wherever fast-refreshable read-only
materialized views are allowed. [12]
Conventional Materialized view
A conventional materialized view blindly materializes and maintains all rows of a view,
even rows that are never accessed. [14]
Dynamic Materialized view
A dynamic materialized view is a more flexible materialization strategy aimed at
reducing storage space and view maintenance costs. It selectively materializes only a
subset of rows, for example, the most frequently accessed rows. One or more control
tables are associated with the view and define which rows are currently materialized.
Dynamic materialized views greatly reduce storage requirements and maintenance costs
while achieving better query performance with improved buffer pool efficiency. [14]
3.6 Advantages and Disadvantages
3.6.1 Advantages
- Useful for summarizing, pre-computing, replicating and distributing data
- Faster access for expensive and complex joins
- Transparent to end-users
- MVs can be added/dropped without invalidating coded SQL [13]
3.6.2 Disadvantages
- Performance costs of maintaining the views
- Storage costs of maintaining the views [13]
3.7 Materialized View Refresh Methods.
The following types of refresh methods are supported by Oracle (a refresh example is
sketched after the list):
- Complete: build from scratch.
- Fast: only apply the data changes.
- Force: try a fast refresh; if that is not possible, do a complete refresh.
- Never: never refresh the materialized view. [15]
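As a brief sketch of how a refresh is requested on demand in Oracle (the view name store_sales_mv is the one created in the next section; 'C' asks for a complete refresh, 'F' for fast, and '?' for force):

-- Manually refresh a materialized view using the DBMS_MVIEW package.
BEGIN
  DBMS_MVIEW.REFRESH('store_sales_mv', 'C');
END;
/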
3.8 Creating a Materialized View
A materialized view can be created with the CREATE MATERIALIZED VIEW
statement or using Oracle Enterprise Manager. The following command creates the
materialized view store_sales_mv.
CREATE MATERIALIZED VIEW store_sales_mv
BUILD IMMEDIATE
REFRESH COMPLETE
ENABLE QUERY REWRITE
AS SELECT s.store_name,SUM(dollar_sales) AS sum_dollar_sales
FROM store s, fact f
WHERE f.store_key = s.store_key
GROUP BY s.store_name;
3.9 Indexed View in SQL Server 2000
An indexed view is a view that stores its result data so that it can be used later in
subsequent queries on that view. This means that the next time you query the view, SQL
Server does not have to go back to the underlying tables but instead gets the data from
the view's storage. If you have a query that is complicated and consumes a lot of time
and resources, it is better to store the result and simply read it the next time. SQL Server
2000 indexed views are similar to materialized views in Oracle: the result set is stored in
the database. Query performance can be dramatically enhanced using indexed views. An
indexed view is created by implementing a UNIQUE CLUSTERED index on the view.
The results of the view are stored in the leaf-level pages of the clustered index.
An indexed view automatically reflects modifications made to the data in the base tables
after the index is created, the same way an index created on a base table does. As
modifications are made to the data in the base tables, the data modifications are also
reflected in the data stored in the indexed view. The requirement that the clustered index
of the view be unique improves the efficiency with which SQL Server 2000 can find the
rows in the index that are affected by any data modification. The SQL Server 2000 query
optimizer automatically determines whether a given query will benefit from using an
indexed view. Create indexed views when the performance gain of improved speed in
retrieving results outweighs the increased maintenance cost, the underlying data is
infrequently updated, and queries perform a significant amount of joins and aggregations
that either process many rows or are performed frequently by many users.
3.9.1 Restrictions on Creating Indexed Views
Consider the following guidelines:
- The first index that you create on the view must be a UNIQUE CLUSTERED index.
- You must create the view with the SCHEMABINDING option.
- The view can reference base tables, but it cannot reference other views.
- You must use two-part names to reference tables.
3.9.2 Create the Indexed View or Materialized view
The following procedure is used for the creation of an indexed view in SQL Server 2000.
Before creating the materialized view in SQL Server 2000, set the following options.
SET NUMERIC_ROUNDABORT OFF
GO
SET ANSI_PADDING, ANSI_WARNINGS, CONCAT_NULL_YIELDS_NULL ON
GO
SET ARITHABORT, QUOTED_IDENTIFIER, ANSI_NULLS ON
GO
If exists (select name from sysobjects where name = 'scabbiesdata_view' and type = 'v')
Drop view scabbiesdata_view
Go
CREATE VIEW scabbiesdata_view
With schemabinding
AS
SELECT keycol, Age, Gender, residence, education, monthlyincome, scabbies_class
FROM dbo.scabbiesdata
GO
CREATE UNIQUE CLUSTERED INDEX scabbiesindex ON scabbiesdata_view(keycol)
GO
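Once the clustered index exists, the view can be read like a table. The query below is an assumed usage example, not part of the thesis code; the NOEXPAND hint tells SQL Server to read the indexed view's stored result set directly rather than expanding the view definition:

SELECT keycol, Age, Gender, scabbies_class
FROM dbo.scabbiesdata_view WITH (NOEXPAND)
GO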
Indexed views are needed in a data warehouse environment more than in an OLTP
environment, and in huge databases rather than in a table that has three records. We do
not want to use a materialized view for a query such as selecting the customers whose
location is New York; we want to use it for a complicated view that performs outer joins,
self-joins, unions and aggregation functions, yet all of those are not allowed in an indexed
view. The second point is the advantage the SQL Server model has over Oracle: the SQL
Server model is dynamic, which means that changed data in the underlying tables is
immediately reflected in the view.
Chapter 4
Integration of Data warehouse and Data Mining
4.1 Introduction
A data warehouse (DW) is a system that extracts, cleans, conforms, and delivers source
data into a dimensional data store and then supports and implements querying and
analysis for the purpose of decision making. Sophisticated OLAP tools, which facilitate
multidimensional analysis, are used. Business trends are identified using data mining
(DM) tools and applying complex business models.
A warehouse is not actually usable until the ETL process (extraction, transformation,
loading) has been completed. Data warehousing is the process of taking data from
loading) still needs to be completed. Data warehousing is the process of taking data from
legacy and transaction database systems and transforming it into organized information in
a user-friendly format to encourage data analysis and support fact-based business
decision-making. The process that involves transforming data from its original format to
a dimensional data store accounts for at least 70 percent of the time, effort, and expense
of most data warehouse projects.
As this is a very costly and critical part of a data warehouse implementation, there is a
variety of data extraction and data cleaning tools, and load and refresh utilities for the
DW. Different data mining techniques are used to facilitate the integration of data in the
DW [22]. Data mining is the essential process where intelligent methods are applied in
order to extract data patterns. SAS defines data mining as the process of selecting,
exploring, and modeling large amounts of data to uncover previously unknown patterns
for a business advantage. Data mining is the activity of extracting hidden information
(patterns and relationships) from large databases automatically, that is, without the
benefit of human intervention or initiative in the knowledge discovery process. Data
mining is the step in the process of knowledge discovery in databases that inputs
predominantly cleaned, transformed data, searches the data using algorithms, and outputs
patterns and relationships to the interpretation/evaluation step of the KDD process.
4.2 Data Integration
Integration is one of the most important characteristics of the data warehouse. Data is
fed from multiple disparate sources into the data warehouse. As the data is fed it is
converted, reformatted, summarized, and so forth. The result is that data—once it resides
in the data warehouse—has a single physical corporate image.
Many problems arise in this process. Designers of different applications made their
decisions over the years in different ways. In the past, when application designers built an
application, they never considered that the data they were operating on would ever have
to be integrated with other data. Such a consideration was only a wild theory.
Consequently, across multiple applications there is no application consistency in
encoding, naming conventions, physical attributes, measurement of attributes, and so
forth. Each application designer has had free rein to make his or her own design
decisions. The result is that any application is very different from any other application.
One simple example of lack of integration is data that is not encoded consistently, as
shown by the encoding of gender. In one application, gender is encoded as m or f. In
another, it is encoded as 0 or 1. As data passes to the data warehouse, the applications’
different values must be correctly deciphered and recoded with the proper value.
This consideration of consistency applies to all application design issues, such as naming
conventions, key structure, measurement of attributes, and physical characteristics of
data. Some of the same data exists in various places with different names, some data is
labeled the same way in different places, some data is all in the same place with the same
name but reflects a different measurement, and so on [22].
4.3 Schema Integration
The most important issue in data integration is schema integration. How can
equivalent real-world entities from multiple data sources be matched up? This is referred
to as the entity identification process. Terms may be given different interpretations at
different sources. For example, how can the data analyst be sure that customer_id in one
database and cust_number in another refer to the same entity?
Data mining algorithms can be used to discover the implicit information about the
semantics of the data structures of the information sources. Often, the exact meaning of
an attribute cannot be deduced from its name and data type. The task of reconstructing
the meaning of attributes would be optimally supported by dependency modeling using
data mining techniques and mapping this model against expert knowledge, e.g., business
models. Association rules are suited for this purpose. Other data mining techniques, e.g.,
classification tree and rule induction, and statistical methods, e.g., multivariate
regression, probabilistic networks, can also produce useful hypotheses in this context.
Data mining and statistical methods can be used to induce integrity constraint candidates
from the data. These include, for example, visualization methods to identify distributions
for finding domains of attributes or methods for dependency modeling. Other data mining
methods can find intervals of attribute values, which are rather compact and cover a high
percentage of the existing values.
Data mining methods can discover functional relationships between different databases
when they are not too complex. A linear regression method would discover the
corresponding conversion factors. If the type of functional dependency (linear, quadratic,
exponential etc.) is a priori not known, model search instead of parameter search has to
be applied [22].
4.4 Redundancy
Redundancy is another important issue. An attribute may be redundant if it can be
“derived” from another table, e.g. annual revenue. In addition to detecting redundancies
between attributes, duplication can be detected at the tuple level (e.g., where there are
two or more identical tuples for a given unique data entry case). Some redundancies can
be detected by correlation analysis. For example, given two attributes, such analysis can
measure how strongly one attribute implies the other, based on available data.
Inconsistencies in attribute or dimension naming can also cause redundancies in the
resulting data set [22].
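As a reminder of what such a correlation analysis computes (this is the standard formulation, not a formula given in this thesis), for two numeric attributes A and B over n tuples the correlation coefficient is

r_{A,B} = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{n\,\sigma_A \sigma_B}

where \bar{A} and \bar{B} are the means and \sigma_A and \sigma_B the standard deviations of A and B. A value of r_{A,B} close to +1 or -1 suggests that one attribute can largely be derived from the other and is therefore a candidate for removal as redundant.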
4.5 Inconsistencies
Since a data warehouse is used for decision-making, it is important that the data in the
warehouse are correct. However, since large volumes of data from multiple sources are
involved, there is a high probability of errors and anomalies in the data. Real-world data
tend to be incomplete, noisy and inconsistent. Data cleansing is a non-trivial task in data
warehouse environments. The main focus is the identification of missing or incorrect data
(noise) and conflicts between data of different sources and the correction of these
problems. Data cleansing routines attempt to fill in missing values, smooth out noise
while identifying outliers, and correct inconsistencies in the data. Some examples where
data cleaning becomes necessary are: inconsistent field lengths, inconsistent descriptions,
inconsistent value assignments, missing entries and violation of integrity constraints.
Typically, missing values are indicated by blank fields or special attribute values. One way
to handle such records is to replace the missing value with the mean or the most frequent
value, or with the value that is most common among similar objects. Simple transformation
rules can also be specified, e.g., "replace the string gender by sex". Missing values may
likewise be determined with regression, with inference-based tools using a Bayesian
formalism, or by decision tree induction [22].
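As an illustration of the most-frequent-value strategy just mentioned, the sketch below fills in missing entries of a single column. The table customers and its nullable gender column are assumptions introduced only for this example.

-- A minimal sketch: replace missing (NULL) gender values with the most frequent
-- non-missing value in a hypothetical customers table.
DECLARE @most_frequent varchar(10)
SELECT TOP 1 @most_frequent = gender
FROM customers
WHERE gender IS NOT NULL
GROUP BY gender
ORDER BY COUNT(*) DESC

UPDATE customers SET gender = @most_frequent WHERE gender IS NULL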
Although many data mining methodologies and systems have been developed in
recent years, we contend that, by and large, present mining models lack human
involvement, particularly in the form of guidance and user control. We believe that data
mining is most effective when the computer does what it does best, such as searching large
databases or counting. This division of labor is best achieved through constraint-based
mining, in which the user provides constraints that guide the search.
Mining can also be improved by employing a multidimensional, hierarchical view of the
data. Current data warehouse systems have provided a fertile ground for the systematic
development of this multidimensional mining. Together, constraint-based and
multidimensional techniques can provide an ad hoc, query-driven process that exploits
the semantics of data more effectively than current stand-alone data-mining systems. A
data-mining system should support efficient processing and optimization of mining
queries by providing a sophisticated mining-query optimizer.
Chapter 5
Data Mining
5.1 Data Mining Definition.
Data mining (DM) is defined as the process of discovering patterns in data. The process
must be automatic or (more usually) semiautomatic. The patterns discovered must be
meaningful in that they lead to some advantage, usually an economic advantage [16].
Put simply, data mining refers to extracting or mining knowledge from large
amounts of data. The term is actually a misnomer: remember that the mining of gold
from rocks or sand is referred to as gold mining rather than rock or sand mining. Thus,
data mining should more appropriately have been named "knowledge mining from data",
which is unfortunately somewhat long; it is also called, more briefly, knowledge mining.
There are many other terms with a similar meaning, such as knowledge mining from
databases, knowledge extraction, data/pattern analysis, data archaeology, and data
dredging. Many people treat data mining as a synonym for another popularly used term,
"Knowledge Discovery in Databases", or KDD. Alternatively, others view data mining as
simply an essential step in the process of
knowledge discovery in databases. Knowledge discovery as a process is depicted in
Figure 5.1, and consists of an iterative sequence of the following steps:
• Data cleaning (DC): to remove noise or irrelevant data.
• Data integration (DI): where multiple data sources may be combined.
• Data selection (DS): where data relevant to the analysis task are retrieved from the databases.
• Data transformation (DT): where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations.
• Data mining (DM): an essential process where intelligent methods are applied in order to extract data patterns.
• Pattern evaluation (PE): to identify the truly interesting patterns representing knowledge.
• Knowledge presentation (KP): where visualization and knowledge representation techniques are used to present the mined knowledge to the user [17].

Figure 5.1 Data mining as a process of knowledge discovery
5.2 Data Mining History
The past decade has seen an explosive growth in database technology and the amount of
data collected. Advances in data collection, use of bar codes in commercial outlets, and
the computerization of business transactions have flooded us with lots of data. We have
an unprecedented opportunity to analyze this data to extract more intelligent and useful
information, and to discover interesting, useful, and previously unknown patterns from
data. Due to the huge size of data and the amount of computation involved in knowledge
discovery, parallel processing is an essential component for any successful large-scale
data mining application. Data mining is concerned with finding hidden relationships
present in business data to allow businesses to make predictions for future use. It is the
process of data-driven extraction of not so obvious but useful information from large
databases. Data mining has emerged as a key business intelligence technology. The
explosive growth of stored data has generated an information glut, as the storage of data
alone does not bring about knowledge that can be used: (a) to improve business and
services and (b) to help develop new techniques and products. Data is the basic form of
information that needs to be managed, sifted, mined, and interpreted to create knowledge.
Discovering the patterns, trends, and anomalies in massive data is one of the grand
challenges of the Information Age. Data mining emerged in the late 1980s, made great
progress during the Information Age and in the 1990s, and will continue its fast
development in the years to come in this increasingly data-centric world. Data mining is a
multidisciplinary field drawing works from statistics, database technology, artificial
intelligence, pattern recognition, machine learning, information theory, knowledge
acquisition, information retrieval, high-performance computing, and data visualization.
The aim of data mining is to extract implicit, previously unknown and potentially useful
(or actionable) patterns from data [18].
5.3 Data Mining Techniques
Data mining consists of many up-to-date techniques such as classification (decision trees,
Naive Bayes classifier, k-nearest neighbor, neural networks), clustering (k-means,
hierarchical clustering, density-based clustering), and association (one-dimensional,
multidimensional, multilevel association, constraint-based association). Many years of
practice show that data mining is a process, and its successful application requires data
preprocessing (dimensionality reduction, cleaning, noise/outlier removal), post-processing
(understandability, summary, presentation), good understanding of problem
domains and domain expertise. Today’s competitive marketplace challenges even the
most successful companies to protect and retain their customer base, manage supplier
partnerships, and control costs while at the same time increasing their revenue. In a world
of accelerating change, competitive advantage will be defined by the ability to leverage
information to initiate effective business decisions before competition does. Hence in this
age of global competition accurate information plays a vital role in the insurance
business. Data is not merely a record of business operation – it helps in achieving
competitive advantages in the insurance sector. Thus, there is growing pressure on MIS
managers to provide information technology (IT) infrastructure to enable decision
support mechanism. This would be possible provided the decision makers have online
access to previous data. Therefore, there is a need for developing a data warehouse. Data
mining as a tool for customer relationship management has also proved to be a means of
controlling costs and increasing revenues. In the last decade, machine learning has come of
age through a number of ways such as neural networks, statistical pattern recognition,
fuzzy logic, and genetic algorithms. Among the most important applications for machine
learning are classification, recognition, prediction, and data mining. Classification and
recognition are very significant in a lot of domains such as multimedia, radar, sonar,
optical character recognition, speech recognition, vision, agriculture, and medicine [18].
5.4 Classification in Data Mining
We discuss only classification here because our selected topic is related to this
functionality.
Databases are rich with hidden information that can be used for making intelligent
business decisions. Classification is a form of data analysis that can be used to extract
models describing important data classes or to predict future data trends. Classification predicts
categorical labels (or discrete values). For example, a classification model may be built to
categorize bank loan applications as either safe or risky. Many classification methods
have been proposed by researchers in machine learning, expert systems, statistics, and
neurobiology. Most algorithms are memory resident, typically assuming a small data size.
Recent database mining research has built on such work, developing scalable
classification techniques capable of handling large, disk resident data. These techniques
often consider parallel and distributed processing.
There are different basic techniques for data classification such as decision tree induction,
Bayesian classification and Bayesian belief networks, and neural networks. Other
approaches to classification, such as k-nearest neighbor classifiers, case-based reasoning,
genetic algorithms, rough sets, and fuzzy logic techniques, are also used.
5.4.1 Classification
Data classification is a two step process. In the first step, a model is built describing a
predetermined set of data classes or concepts. The model is constructed by analyzing
database rows described by attributes. Each row is assumed to belong to a predefined
class, as determined by one of the attributes, called the class label attribute. In the context
of classification, data rows are also referred to as samples, examples, or objects. The data
rows analyzed to build the model collectively form the training data set. The individual
rows making up the training set are referred to as training samples and are randomly
selected from the sample population. Since the class label of each training sample is
provided, this step is also known as supervised learning (i.e., the learning of the model is
'supervised' in that it is told to which class each training sample belongs). Typically, the
learned model is represented in the form of classification rules, decision trees, or
mathematical formulae. In the second step, the model is used for classification. First, the
predictive accuracy of the model (or classifier) is estimated using a test set of
class-labeled samples. These samples are randomly selected and are independent of the
training samples.
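As a small illustration of this second step, estimating predictive accuracy amounts to counting agreements between the stored class label and the label assigned by the model. The table test_set and its class and predicted_class columns below are assumptions for the sketch.

-- A minimal sketch: estimated predictive accuracy (in percent) of a classifier
-- on a hypothetical test_set table holding the true and the predicted class.
SELECT 100.0 * SUM(CASE WHEN predicted_class = class THEN 1 ELSE 0 END) / COUNT(*)
       AS accuracy_percent
FROM test_set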
5.4.2 Related issues with classification
To prepare the data for classification, the following preprocessing steps may be applied to
the data in order to help improve the accuracy, efficiency, and scalability of the
classification process [19].
• Data cleaning. This refers to the preprocessing of data in order to remove or reduce
noise (by applying smoothing techniques, for example), and the treatment of missing
values (e.g., by replacing a missing value with the most commonly occurring value for
that attribute, or with the most probable value based on statistics). Although most
classification algorithms have some mechanisms for handling noisy or missing data, this
step can help reduce confusion during learning.
• Relevance analysis. Many of the attributes in the data may be irrelevant to the
classification task. For example, data recording the day of the week on which a bank loan
application was filed is unlikely to be relevant to the success of the application.
Furthermore, other attributes may be redundant. Hence, relevance analysis may be
performed on the data with the aim of removing any irrelevant or redundant attributes
from the learning process. In machine learning, this step is known as feature selection.
Including such attributes may otherwise slow down, and possibly mislead, the learning
step. Ideally, the time spent on relevance analysis, when added to the time spent on
learning from the resulting “reduced" feature subset, should be less than the time that
would have been spent on learning from the original set of features. Hence, such analysis
can help improve classification efficiency and scalability.
• Data transformation. The data can be generalized to higher-level concepts. Concept
hierarchies may be used for this purpose. This is particularly useful for continuous-valued
attributes. For example, numeric values for the attribute income may be generalized to
discrete ranges such as low, medium, and high. Similarly, nominal-valued attributes, like
street, can be generalized to higher-level concepts, like city. Since generalization
compresses the original training data, fewer input/output operations may be involved
during learning. The data may also be normalized, particularly when neural networks or
methods involving distance measurements are used in the learning step. Normalization
involves scaling all values for a given attribute so that they fall within a small specified
range, such as -1.0 to 1.0, or 0 to 1.0. In methods which use distance measurements, for
example, this would prevent attributes with initially large ranges (like, say income) from
outweighing attributes with initially smaller ranges (such as binary attributes).
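As an illustration of the normalization step just described, min-max scaling to the range 0 to 1.0 can be done with a single query. The table customers and its numeric income column are assumed names introduced only for this sketch.

-- A minimal sketch: min-max normalization of a numeric attribute to [0, 1].
SELECT c.customer_id,
       (CAST(c.income AS float) - m.min_income)
       / NULLIF(m.max_income - m.min_income, 0) AS income_scaled
FROM customers c
CROSS JOIN (SELECT MIN(CAST(income AS float)) AS min_income,
                   MAX(CAST(income AS float)) AS max_income
            FROM customers) m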
5.5 Decision Tree Technique for Classification
Of the different classification techniques above, we select only the decision tree technique
for our research area, in which we want to improve the performance of the ID3 algorithm
behind the decision tree technique. We first discuss the decision tree process below with a
training set and then discuss the steps of ID3 [19].
5.5.1 Decision Tree
A decision tree is a tree in which each branch node represents a choice between a number
of alternatives, and each leaf node represents a decision. Decision trees are commonly
used for gaining information for the purpose of decision making. A decision tree starts
with a root node on which users take actions. From this node, users split each node
recursively according to the decision tree learning algorithm. The final result is a
decision tree in which each branch represents a possible scenario of a decision and its
outcome [21].
A decision tree is a flow-chart-like tree structure, where each internal node denotes a test
on an attribute, each branch represents an outcome of the test, and leaf nodes represent
classes or class distributions. The topmost node in a tree is the root node. A typical
decision tree is shown in Figure 5.2. It represents the concept buys computer, that is, it
predicts whether or not a customer at All Electronics is likely to purchase a computer.
Internal nodes are denoted by rectangles, and leaf nodes are denoted by ovals. In order to
classify an unknown sample, the attribute values of the sample are tested against the
decision tree. A path is traced from the root to a leaf node which holds the class
prediction for that sample. Decision trees can easily be converted to classification rules.
Figure 5.2 a decision tree
5.5.2 Generating classification rules from a decision tree
The decision tree of Figure 5.2 can be converted to classification IF-THEN rules by
tracing the path from the root node to each leaf node in the tree. The rules extracted from
Figure 5.2 are [19]:
IF age = "<30" AND student = no THEN buys computer = no
IF age = "<30" AND student = yes THEN buys computer = yes
IF age = "30-40" THEN buys computer = yes
IF age = ">40" AND credit rating = excellent THEN buys computer = yes
IF age = ">40" AND credit rating = fair THEN buys computer = no
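The extracted rules can be applied directly in SQL to classify new, unlabeled samples. The sketch below is illustrative only: the table new_samples and its age, student and credit_rating columns are assumed names, and the rule outcomes simply follow the rules listed above.

-- A minimal sketch: applying the IF-THEN rules above as a CASE expression.
SELECT age, student, credit_rating,
       CASE
           WHEN age = '<30'  AND student = 'no'               THEN 'no'
           WHEN age = '<30'  AND student = 'yes'              THEN 'yes'
           WHEN age = '30-40'                                 THEN 'yes'
           WHEN age = '>40'  AND credit_rating = 'excellent'  THEN 'yes'
           WHEN age = '>40'  AND credit_rating = 'fair'       THEN 'no'
       END AS buys_computer
FROM new_samples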
5.5.3 ID3 Algorithms
• Originator of the ID3 algorithm
ID3 and its successors were developed by Ross Quinlan, who devised the approach while
working with Earl Hunt in the 1970s. He subsequently worked at Sydney University, the
Rand Corporation in California, UTS, back at Sydney University, and for several years at
UNSW. He now runs his own company, RuleQuest (www.rulequest.com) [20].
• Implementation of the ID3 algorithm
ID3 (Learning Set S, Attribute Set A, Attribute Values V)
Returns a decision tree.
Begin
    Load the learning set first and create the decision tree root node 'rootNode';
    add learning set S into rootNode as its subset.
    For rootNode, first compute Entropy(rootNode.subset).
    If Entropy(rootNode.subset) == 0, then
        rootNode.subset consists of records all with the same value for the categorical
        attribute; return a leaf node with decision attribute: attribute value.
    If Entropy(rootNode.subset) != 0, then
        compute the information gain for each attribute left (not yet used in splitting),
        find the attribute A with maximum Gain(S, A),
        create child nodes of this rootNode and add them to rootNode in the decision tree.
    For each child of rootNode, apply ID3(S, A, V) recursively until reaching a
    node that has entropy = 0 or a leaf node.
End ID3.
We implement the above ID3 algorithm on the following training dataset, which contains
14 records. Each record is called a sample and has a predefined class value. In the above
process, the entropy of the dataset and the gain value of each attribute are the important
quantities of the ID3 algorithm. The following mathematical formulae are used for the
calculation of entropy and gain.
Eq. 5.1 Entropy Equation:
Entropy(S) = - Σi p(i) log2 p(i), where the sum runs over all classes i.
Entropy is used to measure the homogeneity of S, where S is the set of training samples and
p(i) is the proportion (relative frequency) of samples in S that belong to class i.
Eq. 5.2 Information Gain Equation:
Gain(S, A) = Entropy(S) - Σv (|Sv| / |S|) × Entropy(Sv), where the sum runs over each
value v of attribute A and Sv is the subset of S for which A has value v.
This is the training dataset used in the classification process to generate the decision tree.
The set is denoted by S and is considered the root node. The root node dataset contains
fourteen records or rows. Each record is a sample along with a predefined value of the
class attribute.
Table 5.1 Training dataset used for classification

key_col  outlook   temperature  humidity  windy   class
1        sunny     hot          high      weak    no
2        sunny     hot          high      strong  no
3        overcast  hot          high      weak    yes
4        rain      mild         high      weak    yes
5        rain      cool         normal    weak    yes
6        rain      cool         normal    strong  no
7        overcast  cool         normal    strong  yes
8        sunny     mild         high      weak    no
9        sunny     cool         normal    weak    yes
10       rain      mild         normal    weak    yes
11       sunny     mild         normal    strong  yes
12       overcast  mild         high      strong  yes
13       overcast  hot          normal    weak    yes
14       rain      mild         high      strong  no
In the above dataset, the data describe a tennis player and the decision whether to play
tennis or not. The dataset contains four input attributes, outlook, temperature, humidity
and windy, while class is the predicate or class attribute. The outlook attribute contains
three values: sunny, overcast and rain. The temperature attribute also contains three
values: hot, mild and cool. The humidity attribute contains two values, high and normal,
while the last input attribute, windy, also contains two values, weak and strong. The
predicate attribute contains two values, yes and no, which define that there are only two
types of classes. The dataset contains fourteen records, and key_col is the primary key
column. To classify the above dataset we first require the entropy of the dataset, then
calculate the gain of each attribute of the dataset, and finally select the attribute with the
maximum gain among the input attributes, which is used for the classification.
• Entropy calculation for the dataset is shown below.
First determine the number of records with class value No, which is five (5), while there
are nine (9) records with class value Yes. The total number of records is fourteen (14).
Relative frequency of the No class: 5/14.
Relative frequency of the Yes class: 9/14.
The entropy of dataset S is calculated with the entropy formula above (Eq. 5.1):
Entropy(5, 9) = -5/14 log2 5/14 - 9/14 log2 9/14
             = 0.9403
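To connect this calculation to the SQL Server environment used later in this thesis, the same dataset entropy can be computed with a single query. This is only a minimal sketch: the table name training_set and its class column are assumptions matching Table 5.1, and SQL Server's natural-log LOG() is converted to base 2 by dividing by LOG(2.0).

-- A minimal sketch: entropy of the class attribute over a hypothetical
-- training_set table holding the records of Table 5.1.
DECLARE @total float, @entropyS float
SELECT @total = COUNT(*) FROM training_set
SELECT @entropyS = SUM(-1.0 * cnt / @total * LOG(1.0 * cnt / @total) / LOG(2.0))
FROM (SELECT class, COUNT(*) AS cnt FROM training_set GROUP BY class) c
SELECT @entropyS AS entropy_S    -- approximately 0.9403 for this dataset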
• Calculate the gain of each input attribute.
1. Calculate the gain of the outlook attribute. To calculate an attribute's gain, first check
the number of values of this attribute; the S dataset is then partitioned on the basis of each
value. The outlook attribute has three values (rain, overcast and sunny), so three subsets of
the S dataset are possible on the basis of the outlook attribute values.
I. The first subset S1 contains five (5) records, corresponding to the rain value of the outlook attribute.
II. The second subset S2 contains four (4) records, corresponding to the overcast value of the outlook attribute.
III. The third subset S3 contains five (5) records, corresponding to the sunny value.
Proportionality measure for S1: 5/14
Proportionality measure for S2: 4/14
Proportionality measure for S3: 5/14
• Calculate the entropy of each subset.
The first subset S1 contains three (3) Yes and two (2) No records, five (5) records in total:
Entropy(3, 2) = -3/5 log2 3/5 - 2/5 log2 2/5 = 0.971
The second subset S2 contains four (4) Yes records, four (4) records in total:
Entropy(4) = -4/4 log2 4/4 = 0
The third subset S3 contains three (3) No and two (2) Yes records, five (5) records in total:
Entropy(3, 2) = -3/5 log2 3/5 - 2/5 log2 2/5 = 0.971
• Calculate the attribute entropy.
The following formula is used:

Entropy_A(S) = Σv (|Sv| / |S|) × Entropy(Sv), summed over each value v of attribute A

Eq. 5.3 Attribute entropy equation

Entropy(S1, S2, S3) = |S1|/|S| × Entropy(S1) + |S2|/|S| × Entropy(S2) + |S3|/|S| × Entropy(S3)
Entropy(5, 4, 5), i.e. Entropy(outlook)
= 5/14 × 0.971 + 4/14 × 0 + 5/14 × 0.971
= 0.694
• Calculate the gain of the outlook attribute with the information gain formula given above.
Gain(S, A) = Entropy(S) - Entropy(outlook)
Gain(S, outlook) = 0.9403 - 0.694
= 0.2463
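The same reasoning can be expressed as one query over the assumed training_set table. The sketch below repeats the dataset entropy step so that it is self-contained; under the assumed names it should return approximately 0.2463 for the outlook attribute.

-- A minimal sketch: information gain of the outlook attribute, computed as
-- Entropy(S) minus the weighted entropies of the outlook subsets.
DECLARE @total float, @entropyS float
SELECT @total = COUNT(*) FROM training_set
SELECT @entropyS = SUM(-1.0 * cnt / @total * LOG(1.0 * cnt / @total) / LOG(2.0))
FROM (SELECT class, COUNT(*) AS cnt FROM training_set GROUP BY class) c

SELECT @entropyS - SUM((s.subtotal / @total) * s.subentropy) AS gain_outlook
FROM (
    SELECT v.outlook, v.subtotal,
           SUM(-1.0 * vc.cnt / v.subtotal * LOG(1.0 * vc.cnt / v.subtotal) / LOG(2.0)) AS subentropy
    FROM (SELECT outlook, class, CAST(COUNT(*) AS float) AS cnt
          FROM training_set GROUP BY outlook, class) vc
    JOIN (SELECT outlook, CAST(COUNT(*) AS float) AS subtotal
          FROM training_set GROUP BY outlook) v ON v.outlook = vc.outlook
    GROUP BY v.outlook, v.subtotal
) s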
The above three steps are repeated for the three remaining input attributes. The following
tables contain the gain of the attributes for the original set, the rain subset, and the sunny subset.
Table 5.2 Gain information for the original set

Attribute      Gain
Outlook        0.2463
Temperature    0.0292
Humidity       0.1518
Windy          0.0481

Table 5.3 Gain information for the rain subset

Attribute      Gain
Temperature    0.02
Humidity       0.02
Windy          0.971

Table 5.4 Gain information for the sunny subset

Attribute      Gain
Temperature    0.571
Humidity       0.971
Windy          0.02
• Select the maximum gain attribute for classification of the S dataset.
In Table 5.2 above, the attribute with the maximum gain value is outlook; its gain of
0.2463 is the highest value. After this step we can split the dataset into three different
subsets on the basis of the outlook attribute values, namely rain, overcast and sunny. The
classification is shown by the diagram in Figure 5.3.
by the following diagram in figure 5.3.
Outlook is Test
Attribute
Humidity
is Test
Attribute
Windy is Test
Attribute
Rain
Strong
Sunny
Overcast
Weak
normal
Figure 5.3 Decision tree of the above dataset.
41
High
ID3 is a recursive process that is repeated for each child subset. In every repetition, the
entropy of the new set and the gain of each attribute of the new set are calculated. The
recursive process is repeated until a single class is obtained or no input attributes remain
for the classification of the dataset. The following decision tree was generated in SQL
Server during the practical work of this dissertation.
Figure 5.4 Decision tree generated in SQL Server 2000
5.6 Partial Integration of Decision Tree in Materialized View
Decision tree algorithms are recursive in nature. At each iteration, apart from other tasks,
the next best attribute in the remaining attribute list is selected as a test attribute, and
then for each value of that attribute a branch is grown from the test node. Figuring out the
best attribute mostly involves a tremendous amount of calculation. For example,
considering the most basic algorithm, ID3, a whole lot of mathematics is carried out to
select the best attribute: ID3 uses information gain to select a split attribute, and the
entropy of an attribute is calculated using the formula given above. Carrying out these
calculations on a data set where the data is updated frequently will obviously affect the
efficiency of the algorithm.
A materialized view or indexed view is refreshed using various policies. But in order to
have the latest data in hand to construct a classification model, the materialized view must
be updated frequently. From a decision tree perspective, we know that even with a single
record update, the calculations have to be carried out again to present the exact
statistics for entropy and gain.
In our approach, we suggest creating a tabular structure in the data warehouse / database.
This table will contain the information values required for the construction of the
decision tree classifier. Moreover, this table will be dependent on the materialized view
where the dataset for the decision tree is stored.
The structure that we propose for this dependent table is as under:
Table 5.5 Dependent Table

Att_name       Comp_type       Result
...........    ............    ............
...........    ............    ............
...........    ............    ............
The values required for creating a decision tree using the ID3 and C4.5 algorithms will be
stored in this structure. Each time, no matter which class label is used, the initially
required values will be readily available to the algorithm. Instead of being calculated
from within the algorithm, these values will be available in the dependent tabular
structure associated with the materialized view.
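As a concrete illustration, the sketch below shows one possible realization of these two structures in SQL Server. The object names are illustrative only (Appendix A uses a dependent table named calculated_data with an additional keyattribute column), and indexed views additionally require specific SET options that are omitted here.

-- A hedged sketch: an indexed (materialized) view over the training data and the
-- dependent table that caches the entropy and gain values derived from it.
CREATE VIEW dbo.mv_training_set
WITH SCHEMABINDING
AS
SELECT key_col, outlook, temperature, humidity, windy, class
FROM dbo.training_set
GO
-- the unique clustered index is what materializes (indexes) the view in SQL Server
CREATE UNIQUE CLUSTERED INDEX idx_mv_training_set ON dbo.mv_training_set (key_col)
GO
CREATE TABLE dbo.dependent_table
(
    att_name   varchar(50) NOT NULL,   -- attribute the value refers to (or 'Class')
    comp_type  varchar(20) NOT NULL,   -- 'Entropy' or 'Gain'
    result     float       NOT NULL    -- pre-computed value
)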
5.6.1 Classification Experiment
Table 5.1 presents a training set of data tuples for classification. The class label attribute,
CLASS, has two distinct values (namely, {yes, no}); therefore, there are two distinct
classes (m=2). Let class C1 correspond to YES and class C2 correspond to NO. The
outlook, temperature, humidity and windy attributes are the input attributes. In the previous
approach, creating the decision tree required calculations of a heavy nature to select the best
attribute at the root node; computations are carried out at both the dataset level and the attribute level.
In our approach, this dataset will be stored in a materialized view, and the dependent
table will contain the statistical values required for the selection of the best attribute at
the root node. A significant efficiency gain is achieved, making the algorithm faster. An
important point is that a single dataset can be used to construct various classification
models; for this purpose a separate target / output class will be introduced. As soon as
the new class attribute is introduced in the materialized view, a function will be triggered
to redo the calculations according to the new target class values, and these values will be
updated / stored in the dependent table, providing the values required to select the best
attribute at the root node. The dependent table will look as follows:
Table 5.6 Resultant values in the dependent table

Attribute      Comp_type   Result
Class          Entropy     0.9403
Outlook        Entropy     0.6935
Outlook        Gain        0.2468
Temperature    Entropy     0.9111
Temperature    Gain        0.0292
Humidity       Entropy     0.7885
Humidity       Gain        0.1518
Windy          Entropy     0.8922
Windy          Gain        0.0481
The first row in this table contains the expected information required to split this dataset.
The subsequent rows contain attribute-wise Entropy and Gain in the corresponding
columns.
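With these values in place, the attribute selection step reduces to a simple lookup. The sketch below assumes the dependent table and column names used in Appendix A (calculated_data; an additional keyattribute filter can restrict the lookup to a particular dataset).

-- A minimal sketch: read the root split attribute directly from the dependent
-- table instead of recomputing entropy and gain over the whole training set.
SELECT TOP 1 Attribute, result AS gain
FROM calculated_data
WHERE comp_type = 'Gain'
ORDER BY result DESC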
The partial integration of the decision tree attribute selection measure with the
materialized view containing the training dataset for the construction of the classification
model will in no way affect the accuracy of the model. There is no change in the
intelligence approach; only the required values are stored, instead of being calculated at
run time in memory. However, this integration gives a jump start to the construction
of the classification model, enhancing the overall efficiency of the model.
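One possible refresh policy is an immediate refresh driven by a trigger, sketched below. The trigger simply re-runs the pre-calculation procedure of Appendix A whenever the training data changes; the table name training_set is an assumption, and a deferred or batch refresh policy could be chosen instead.

-- A hedged sketch of an immediate-refresh policy for the dependent table.
CREATE TRIGGER trg_refresh_dependent_table
ON dbo.training_set
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    -- Test_proc (Appendix A) recomputes the entropy and gain values and stores
    -- them in the pre-calculated structure for the named dataset.
    EXEC Test_proc 'training_set'
END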
5.6.2 Conclusion
Classification algorithms are memory resident, calculating various statistical values at
runtime. Storing these statistical values, even just for the selection of the best attribute at
the root node, greatly increases the performance of the classification algorithms. The
materialized view will hold the input training dataset, while these statistical values will be
stored in a dependent table. This table will be updated according to the policy chosen.
Modern data warehouses offer many methods to update the materialized view.
However, each time a new target class is introduced or new data is loaded into the
materialized view, the dependent table containing the statistical values will be updated
accordingly. The accuracy of the algorithm is in no way affected, neither positively nor
negatively. The significant improvement introduced is in efficiency, in the selection of the
root-level attribute.
Appendix A
Code Section
Test_proc is the procedure used to find the entropy of the original training set as well as
the entropy and gain of each input attribute in the training set. The results calculated by
Test_proc are stored inside the pre-calculated structure. These calculated gain values are
used for the classification at the root node to generate the decision tree.
ALTER
procedure Test_proc
@dataset varchar(20)
AS
Begin
declare @loopvar int,@flag int
declare @keyatt int,@datasetname varchar(40)
declare @newkeyvalue int, @get_key int
set @flag = 0
delete from current_datasetEntropy
declare datasetnames_table_cursor cursor
local static for select * from datasetnames_table
open datasetnames_table_cursor
if @@cursor_rows = 0
begin
insert into datasetnames_table values(1,@dataset)
exec Test_CalEntropy @dataset ,1,@flag
exec Test_CalAttGain @dataset ,1
end
else
begin
set @loopvar = 1
while @loopvar <= @@cursor_rows
begin
fetch next from datasetnames_table_cursor into @keyatt,@datasetname
if @datasetname = @dataset
begin
set @flag = 1
break
end
set @loopvar = @loopvar + 1
end
if @flag = 1
begin
exec Test_CalEntropy @dataset,@keyatt,@flag
end
else
begin
set @newkeyvalue = @@cursor_rows + 1
insert into datasetnames_table values(@newkeyvalue,@dataset)
exec Test_CalEntropy @dataset,@newkeyvalue,@flag
exec Test_CalAttGain @dataset,@newkeyvalue
end
end
close datasetnames_table_cursor
deallocate datasetnames_table_cursor
exec Test_searchkey @dataset,@get_key output
select Attribute,comp_type,result from calculated_data
where keyattribute = @get_key
End
Test_DecisionTree is another procedure, used to select the attribute with the maximum
gain value from the pre-calculated structure. On the basis of this attribute the original
training set is classified into subsets before the C_ID3 procedure is called.
ALTER procedure Test_DecisionTree (
@dataset varchar(50),
@minsplitsize int,
@rec_Entgainvalue decimal,
@rec_attributeNum int,
@rec_leavesize int
)
As
begin
declare @Entvalue float
declare @Selatt varchar(100)
declare @gainattribute varchar(80)
declare @gainvalue float
declare @str varchar(100)
declare @size int,@loopvar int
declare @sepattval varchar(100)
declare @predicate_var varchar(100)
declare @levelno1 int
declare @attvalue varchar(100)
declare @dsize int
declare @countattribute int
declare @rec_attribute varchar(50),@rec_value varchar(50),@rec_records
int,@rec_per float
declare @getkey int
set @countattribute = 0
exec Test_searchkey @dataset,@getkey output
exec Test_GetEntropy @getkey, @dataset,@Entvalue OUTPUT,@Selatt
OUTPUT,@attvalue OUTPUT
exec C_datasetsize @dataset,@dsize output
---------------------------------------------------------------------------
set @countattribute = @countattribute + 1
delete from Treedata
insert into Treedata values(0,@dataset,null,@dsize,100)
IF @dsize >= @minsplitsize
begin
IF @Entvalue = 0
begin
print @Selatt+':'+@attvalue
insert into Treedata values(1,@Selatt,@attvalue,@dsize,100)
end
else
begin
exec Test_maxgain @getkey,@gainattribute output,@gainvalue output
exec C_Getattribute_val @dataset,@gainattribute,@str output,@size output
IF @gainvalue >= @rec_Entgainvalue and @countattribute <= @rec_attributeNum
Begin
set @loopvar = 1
while @loopvar < = @size
begin
set @levelno1 = 1
exec C_sepattval @str,@sepattval output,@str output
print 'RULE *****
print 'Level '+cast(@levelno1 as varchar)+':' + upper(@gainattribute) + ':('+@sepattval+
') ' + cast(@gainvalue as varchar)
insert into Treedata values(@levelno1,@gainattribute,@sepattval,@dsize,100)
set @predicate_var = @gainattribute + ' = ' + "'" + @sepattval + "'"
set @levelno1 = @levelno1 + 1
exec C_ID3 @dataset ,@gainattribute,@gainvalue
,@predicate_var,0,@levelno1,@dsize,
@minsplitsize,@rec_Entgainvalue,@countattribute,
@rec_attributeNum,@rec_leavesize
set @loopvar = @loopvar + 1
end
End
Else
Begin
exec Maxrecordvalue @dataset,@rec_attribute output,@rec_value
output,@rec_records output,@rec_per output
insert into Treedata values(1,@rec_attribute,@rec_value,@dsize,@rec_per)
End
end
End
Else
begin
exec Maxrecordvalue @dataset,@rec_attribute output,@rec_value
output,@rec_records output,@rec_per output
insert into Treedata values(1,@rec_attribute,@rec_value,@dsize,@rec_per)
end
End
C_ID3 is the procedure that contains the recursive process. This procedure is used to
generate the rest of the decision tree after the classification at the root node.
ALTER
procedure C_ID3
(
@dataset varchar(50),@gainatt varchar(70),@gainval float,
@predicates varchar(200),@spacesize int,@levelno1 int,
@rec_datasetsize float ,@rec_minsplits int,@rec_Entgain decimal,
@rec_attcounter int,@rec_attributeNos int,@rec_leavessize int
)
As
begin
declare @Entvalue float, @Selatt varchar(200)
declare @gainattribute varchar(90),@sepattval varchar(70),@str varchar(70)
declare @gainvalue float, @getspaces varchar(100)
declare @spacelen int,@size int,@loopvar int,@reserve_levelno int
declare @pred_var varchar(200),@reserve_pred varchar(200)
declare @attvalue varchar(80), @datas int,@per float , @rec_flag int
declare @rec_attribute varchar(60),@rec_attvalue varchar(50),@rec_records int,
@rec_per float
declare @tablestr varchar(5000)
set @reserve_pred = @predicates
set @reserve_levelno = @levelno1
exec C_newsubset @dataset, @gainatt ,@predicates,@tablestr output
exec (@tablestr)
exec CalEntropydup 'temp'
exec CalAttGaindup 'temp'
exec C_datasetsize 'temp',@datas output
exec C_GetEntropydup 'temp',@Entvalue OUTPUT,@Selatt OUTPUT,@attvalue
Output
exec C_space @spacesize,@getspaces output,@spacelen output
set @spacesize = @spacelen
set @per = round(cast(@datas as float)/@rec_datasetsize * 100,2)
set @rec_attcounter = @rec_attcounter + 1
IF @datas >= @rec_minsplits and @datas >= @rec_leavessize
Begin
if @Entvalue = 0
begin
print @getspaces + 'level'+cast(@levelno1 as varchar)+':' + @Selatt+':('+ @attvalue+')'
insert into Treedata values(@levelno1,@Selatt,@attvalue,@datas,100)
end
Else
Begin
exec C_maxgaindup @gainattribute output,@gainvalue output
exec C_Getattribute_val 'temp',@gainattribute,@str output,@size output
set @loopvar = 1
if @gainvalue >= @rec_Entgain and @rec_attcounter < = @rec_attributeNos
Begin
while @loopvar < =@size
BEGIN
exec C_sepattval @str,@sepattval output,@str output
if @gainvalue != 0
begin
print @getspaces + 'Level '+cast(@levelno1 as varchar)+':' + upper(@gainattribute) +
':('+@sepattval+ ') ' + cast(@gainvalue as varchar)
insert into Treedata values(@levelno1,@gainattribute,@sepattval,@datas,@per)
end
set @pred_var = @gainattribute + ' = ' + "'" + @sepattval + "'"
set @predicates = @predicates + ' and ' +@pred_var
set @levelno1 = @levelno1 + 1
if @gainvalue !=0
begin
exec C_ID3 @dataset
,@gainattribute,@gainvalue,@predicates,@spacesize,@levelno1,@datas,@rec_minsplits
,@rec_Entgain,@rec_attcounter,@rec_attributeNos,@rec_leavessize
end
set @levelno1 = @reserve_levelno
set @pred_var =''
set @predicates = @reserve_pred
set @loopvar = @loopvar + 1
end
end
else
begin
exec Maxrecordvalue 'temp',@rec_attribute output,@rec_attvalue output,@rec_records
output,@rec_per output
insert into Treedata values(@levelno1,@rec_attribute,
@rec_attvalue,@datas,@rec_per)
end
end
end
else
begin
exec Maxrecordvalue 'temp',@rec_attribute output,@rec_attvalue
output,@rec_records output,@rec_per output
insert into Treedata values(@levelno1,@rec_attribute,@rec_attvalue,@datas
,@rec_per)
end
End
Appendix B
Application Interface
This is the main interface for accessing the SQL Server 2000 database. It contains the
training dataset and also the pre-calculated structures, which hold the entropy of the
dataset and the entropy and gain information of each attribute in the given dataset. It also
shows the decision tree of the given dataset. The decision tree is used for the classification
of the test dataset because it generates the different classification rules.
1. Dataset
Output of dataset:
2. Dataset
Output of dataset:
3. Dataset
Output of dataset:
5. Dataset
Output of dataset:
6. Dataset
Output of dataset:
Appendix C
References
[1] C. Imhoff, N. Galemmo and J. G. Geiger, Mastering Data Warehouse Design: Relational and Dimensional Techniques.
[2] Abhishek Sugandhi, Data Warehouse Design Considerations.
[3] SQL Server 7.0 Data Warehousing Training Kit, Microsoft.
[4] http://www.peterindia.net/DataWarehousingView.html
[5] SQL Server 2000 Resource Kit.
[6] Behrooz Seyed-Abbassi, Teaching Effective Methodologies to Design a Data Warehouse, University of North Florida, Jacksonville, Florida 32224, United States.
[7] http://technet.microsoft.com/en-us/library
[8] www.ioug.org/client_files/members/select_pdf/05q2/SelectQ205_Maresh.pdf
[9] http://www.cs.uvm.edu/oracle9doc/server.901/a90237/mv.htm#38255
[10] http://www.microsoft.com/technet/prodtechnol/sql/2005/impprfiv.mspx
[11] http://www.akadia.com/services/ora_materialized_views.html
[12] http://download.oracle.com/docs/cd/B10501_01/server.920/a96567/repmview.htm
[13] www.nocoug.org/download/2003-05/materialized_v.ppt
[14] http://ieeexplore.ieee.org/Xplore
[15] http://www.oracle.com/technology/products/oracle9i/daily/jul05.htm
[16] Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques.
[17] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Simon Fraser University.
[18] S. Sumathi and S. N. Sivanandam, Introduction to Data Mining and its Applications.
[19] Data Mining: Concepts and Techniques, Morgan Kaufmann.
[20] Induction of Decision Tree.html
[21] W. Peng, J. Chen and Haiping Zhou, An Implementation of ID3 Decision Tree Learning Algorithm, University of New South Wales, School of Computer Science & Engineering, Sydney, NSW 2032, Australia.
[22] Kalinka Mihaylova Kaloyanova, Improving Data Integration for Data Warehouse: A Data Mining Approach, University of Sofia.