University of Massachusetts Dartmouth
MULTIVARIATE DATA MINING USING
INDEXED K-MEANS.
A Thesis in
Computer Engineering
by
TR Satish Kumaar
Submitted in Partial Fulfillment of the Requirements
for the Degree of
Master of Science
University of Massachusetts, Dartmouth
January 2003
Copyright 2003 by TR Satish Kumaar
I grant the University of Massachusetts Dartmouth the non-exclusive right to use the work
for the purpose of making single copies of the work available to the public on a not-for-profit basis if the University's circulating copy is lost or destroyed.
____________________________________
TR Satish Kumaar
Date: ______________________________
We approve the thesis of TR Satish Kumaar:

Paul J. Fortier, Professor of Electrical and Computer Engineering, Thesis Advisor
Hong Liu, Professor of Electrical and Computer Engineering, Graduate Committee
Howard E. Michel, Assistant Professor of Electrical and Computer Engineering, Graduate Committee
Dr. Dayalan P. Kasilingam, Associate Professor, Graduate Program Director, Department of Electrical and Computer Engineering
Antonio H. Costa, Professor, Chairperson, Department of Electrical and Computer Engineering
Farhad Azadivar, Dean, College of Engineering
Richard J. Panofsky, Associate Vice Chancellor for Academic Affairs and Graduate Studies
ABSTRACT
Multivariate data mining using indexed k-means
By TR Satish Kumaar
Raw information grows at an ever-increasing rate, dictating a need for tools that turn such data into useful information and knowledge; this is where data mining comes into play. The knowledge gained can be used in applications ranging from business management, production control, and market analysis to engineering design and science exploration. There are many approaches to knowledge discovery, including association rules, decision trees, k-nearest neighbor, classification, clustering, and genetic algorithms.
The focus of this thesis is to mine a multivariate dataset using indexing within a k-means clustering algorithm to discover rules, and to compare the results with those of the ordinary k-means method in order to analyze and test them for accuracy and importance. The thesis tested whether a clustering method based on an indexed k-means algorithm is feasible for a multivariate dataset with static variables, and whether better information can be obtained with this process than with the regular k-means method. The indexed k-means method gives more precise and useful information than the regular k-means method; questions that are not answered by the k-means method are answered by the indexed k-means method. For example, this method can state that "in the month of January one can get Fish 120 at (42.25, 70.25) with a market value of 60.75 and a landed weight of 26.25 kg," whereas the regular k-means method produced no cluster for the month of January. The indexed k-means method also has an important practical implication: it reduces the computational power needed to cluster a huge dataset by replacing it with smaller datasets, without losing precious knowledge in the data, while giving more useful and precise information than the regular k-means method.
TABLE OF CONTENTS
ACKNOWLEDGEMENT ............................................................................. v
LIST OF FIGURES .......................................................................................vi
LIST OF TABLES ...................................................................................... viii
CHAPTER 1 INTRODUCTION.................................................................. 1
1.1 Emergence of data mining ................................................................................... 1
1.2 What is data mining? ............................................................................................. 1
1.2.1 Architecture of data mining ...................................................................... 3
1.2.2 Data mining versus Query tool ............................................................... 4
1.2.3 Data mining Functionalities ...................................................................... 4
1.2.4 Classification of data mining ..................................................................... 7
1.2.5 Practical Problems of data mining ........................................................... 8
1.2.6 Data mining issues and ethics ................................................................... 9
1.3 What is multivariate data mining? .................................................................... 11
1.3.1 When is multivariate analysis used? ....................................................... 12
1.4 What is K-mean mining?.................................................................................... 13
1.4.1 How does k-mean work? ......................................................................... 13
1.4.2 Why k-means is not enough? .................................................................. 13
1.5 Motivation ............................................................................................................. 13
1.5.1 How can we achieve indexed k-mean method? .................................. 15
1.6 Research contribution ......................................................................................... 15
1.6.1 Assumptions............................................................................................... 15
1.7 Thesis organization ............................................................................................. 16
CHAPTER 2 RELATED WORK ................................................................ 17
2.1 Data mining approach ........................................................................................ 17
2.2 Mining complex data in large data and information repositories ............... 19
2.3 Clustering Analysis .............................................................................................. 20
2.3.1 Partitioning methods ................................................................................ 24
2.3.1.1 k-means algorithm .......................................................................... 25
2.3.1.2 K-medoids method ........................................................................ 30
2.3.2 Hierarchical methods................................................................................ 33
2.3.3 Density-based methods ............................................................................ 34
2.3.4 Grid-based methods ................................................................................. 34
2.3.5 Model-based methods .............................................................................. 35
2.3.6 EM algorithm ............................................................................................. 35
2.4 Indexed based algorithms .................................................................................. 38
2.5 Which techniques to use for which tasks ........................................................ 39
2.6 Multidimensional data model ............................................................................ 41
CHAPTER 3 ALGORITHM AND DATASET .......................................... 43
3.1 Different forms of knowledge .......................................................................... 43
3.2 Getting started...................................................................................................... 45
3.3 KDD Process ....................................................................................................... 46
3.3.1 Data selection ............................................................................................. 46
3.3.2 Cleaning....................................................................................................... 47
3.3.3 Enrichment................................................................................................. 48
3.3.4 Coding ......................................................................................................... 49
3.3.5 Data mining ................................................................................................ 49
3.3.6 Reporting .................................................................................................... 50
3.4 Data Sources ......................................................................................................... 50
3.4.1 Data modeling............................................................................................ 51
3.4.2 Preprocessing ............................................................................................. 53
3.4.3 Data cleaning .............................................................................................. 54
3.5 Algorithm for indexed k-means ........................................................................ 55
3.6 Discovery of interesting patterns ...................................................... 58
3.6.1 Interestingness measure ........................................................................... 58
3.7 Presentations and visualization of discovered patterns ................................ 60
3.8 Implementation Tools and software ................................................................ 61
CHAPTER 4 RESULTS AND ANALYSIS ................................................ 62
4.1 Experimental Results .......................................................................................... 62
4.1.1 Output for indexed k-means ................................................................... 62
4.1.2 Output for k-means .................................................................................. 71
4.2 Analyses ................................................................................................................. 72
4.2.1 Interpreting the patterns found ..............................................................73
4.2.2 Testing and Performance .......................................................................74
4.2.3 Comparison between k-means and indexed k-means ........................75
4.3 Discussion ............................................................................................................. 76
CHAPTER 5 FUTURE WORK................................................................... 77
5.1 Conclusion ............................................................................................................ 78
5.2 Future Work ......................................................................................................... 78
5.3 Research Directions ............................................................................................ 79
APPENDIX A SOURCE CODE ................................................................. 81
BIBLIOGRAPHY ......................................................................................... 94
ACKNOWLEDGMENTS
The author wishes to express appreciation to:
My supervisor, Dr. Paul J. Fortier, whose ideas and comments made this thesis possible
The members of my committee, for their time and interest
My parents, for their unconditional love and support
LIST OF FIGURES
Number
Page
Fig 2.1: Flowchart for K-means Algorithm.................................................................. 29
Fig 3.1: Flowchart for Indexed k-means Algorithm ................................................... 57
Fig 4.1.1: 3-Dimensional Figure of Landed_kg, Percentage of Occurrence and
Month .................................................................................................................................. 63
Fig 4.1.2: 3-Dimensional Figure of Latitude Degree, Longitude Degree and Fish
ID ......................................................................................................................................... 63
Fig 4.1.3: 3-Dimensional Figure of Latitude Degree, Longitude Degree and
Landed Weight of fish ...................................................................................................... 64
Fig 4.1.4: 3-Dimensional Figure of Latitude Degree, Longitude Degree and
Market Values of Fish ...................................................................................................... 64
Fig 4.1.5: 3-Dimensional Figure of Latitude Degree, Longitude Degree and
Month .................................................................................................................................. 65
Fig 4.1.6: 3-Dimensional Figure of Market Values, Fish ID and Percentage of
Occurrence of Fish ........................................................................................................... 65
Fig 4.1.7: 3-Dimensional Figure of Market values, Landed Kg and Fish ID ........ 66
Fig 4.1.8: 3-Dimensional Figure of Market values, Landed Kg and Month .......... 66
Fig 4.1.9: 3-Dimensional Figure of Market values, Landed Kg and Percentage of
Occurrence of Fish ........................................................................................................... 67
Fig 4.1.10: 3-Dimensional Figure of Market values, Month and Percentage of
Occurrence of Fish ........................................................................................................... 67
Fig 4.1.11: 3-Dimensional Figure of Month, Landed KG and Fish Id................... 68
Fig 4.1.12: 3-Dimensional Figure of Month, Market Values and Fish Id .............. 68
Fig 4.1.13: 3-Dimensional Figure of Percentage, Month and Fish Id ..................... 69
Fig 4.1.14: 3-Dimensional Figure of Latitude degree, Longitude degree and
Percentage of Occurrence ............................................................................................... 69
Fig 4.1.15: 3-Dimensional Figure of Fish Id, Landed KG and Percentage of
Occurrence ......................................................................................................................... 70
LIST OF TABLES
Number
Page
Table 2.1: Techniques and Tasks. .................................................................................. 39
Table 4.1: Output of K-means........................................................................................ 71
Table 4.2: Analysis of Indexed K-means ...................................................................... 72
Chapter 1
INTRODUCTION
1.1 Emergence of data mining
In one of his short stories, The Library of Babel, the Argentine writer Jorge
Luis Borges describes an infinite library consisting of an endless network of
rooms with bookshelves. Although most of the books have no meaning and carry
unintelligible titles like 'Axaxaxas mlo', people wander through the library until they die,
and scholars develop wild hypotheses that somewhere in the library there must be
a central catalog; or that all the books that one could possibly think of must be
somewhere in the library. None of these hypotheses can be verified because the
library contains an infinite amount of data but no information [1]. The library of
Babel may be interpreted as an interesting but cruel metaphor for the situation in
which modern humans find themselves: we live in an expanding universe in
which there is too much data (a growth driven by the mechanical production of texts) and too little information. The development of
new techniques to find the required information from huge amounts of data is
one of the main challenges for software developers today [1].
1.2 What is data mining?
Against this background, a great interest is being shown in the new field of 'data
mining' or KDD (knowledge discovery in databases). KDD is like mining, where
enormous quantities of debris have to be removed before diamonds or gold can
be found. Similarly, with a computer, one can automatically find the one 'information diamond' among the tons of data debris in one's database.
It was proposed at the first international KDD conference in Montreal in 1995 that the term 'KDD' be employed to describe the whole multi-disciplinary process of extracting knowledge from data, where knowledge here means relationships and patterns between data elements, and that the term 'data mining' be reserved for the discovery stage of the KDD process [1].
Knowledge discovery as a process consists of the following steps:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis are retrieved from the database)
4. Data transformation (where data are transformed or consolidated into forms appropriate for mining)
5. Data mining (the process of applying intelligent methods to extract data patterns)
6. Pattern evaluation (to identify the interesting patterns representing knowledge)
7. Knowledge presentation (where visualization and knowledge representation techniques are used to present the mined knowledge to the user)
1.2.1 Architecture of data mining
The architecture of a typical data mining system has the following components:
- Database, data warehouse, or other information repository
- Database or data warehouse server: responsible for fetching the relevant data, based on the user's data mining request
- Knowledge base: the domain knowledge that is used to guide the search or to evaluate the interestingness of resulting patterns
- Data mining engine: a set of functional modules for tasks such as characterization, association, classification, cluster analysis, and evolution and deviation analysis
- Pattern evaluation module: employs interestingness measures and interacts with the data mining modules in order to focus the search towards interesting patterns
- Graphical user interface: communicates between users and the data mining system, allowing the user to interact with the system by specifying a query or task, to supply information that helps focus the search, and to visualize the discovered patterns in different forms
1.2.2 Data mining versus query tools
Query tools and data mining tools are complementary. Normal queries can answer questions like "who bought which product on which date?", while data mining tools can answer questions like "what are the most important trends in customer behavior?", which are much harder to answer using SQL [1]. Of course, these questions could be answered with SQL by a process of trial and error, but it could take days or months to find an optimal segmentation of a large database, whereas a machine-learning algorithm can find the answer automatically in a much shorter time. Once the data mining tool has found a segmentation, you can use your query environment again to query and analyze the profiles found.
One could say that if you know exactly what you are looking for, use SQL; but if you know only vaguely what you are looking for, turn to data mining [1].
1.2.3 Data mining Functionalities
Data mining functionalities are used to specify the kind of patterns to be found in data mining tasks. Data mining tasks can be classified into two categories: descriptive and predictive. Descriptive mining tasks characterize the general properties of the data in the database. Predictive mining tasks perform inference on the current data in order to make predictions.
Data mining functionalities, and the patterns they can discover, are as follows:
(a) Concept/class description (characterization and discrimination): Data can be associated with classes or concepts. Descriptions of a class or concept in summarized, concise, and yet precise terms are called class/concept descriptions [2]. These descriptions can be derived via (1) data characterization, (2) data discrimination, or (3) both data characterization and discrimination.
(b) Data characterization is a summarization of the general characteristics or features of a target class of data. The output of data characterization can be presented in various forms of charts and tables. The resulting descriptions can also be presented as generalized relations or in rule form (characteristic rules).
(c) Data discrimination is a comparison of the general features of the target class with those of one or more contrasting classes. Discrimination descriptions include comparative measures that help distinguish between the target and contrasting classes; when expressed in rule form they are referred to as discriminant rules.
(d) Association analysis: The discovery of association rules showing attribute-value conditions that occur frequently together in a given set of data. It is widely used for market basket or transaction data analysis.
(e) Classification and prediction: The process of finding a set of models that describe and distinguish data classes or concepts, for the purpose of using the model to predict the class of objects whose class label is unknown. The derived model can be presented in various forms, such as classification (IF-THEN) rules, decision trees, mathematical formulae, or neural networks. Classification can also be used for predicting the class label of data objects.
(f) Cluster analysis: Analyzes data objects without consulting a known class label. The objects are clustered or grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity. Each cluster that is formed can be viewed as a class of objects, from which rules can be derived.
(g) Outlier analysis: A database may contain data objects that do not comply with the general behavior or model of the data; these data objects are outliers. Most data mining methods discard outliers as noise or exceptions, but in some applications, such as fraud detection, the rare events can be more interesting than the more regularly occurring ones.
(h) Evolution analysis: Describes and models regularities or trends for objects whose behavior changes over time. Although this may include characterization, discrimination, association, classification, or clustering of time-related data, distinct features of such an analysis include time-series data analysis, sequence or periodicity pattern matching, and similarity-based data analysis.
1.2.4 Classification of data mining
Diverse disciplines contribute to data mining; hence data mining research is expected to generate a large variety of data mining systems. A clear classification is therefore needed to help users identify the systems that best match their needs. Data mining systems can be categorized according to various criteria, as follows.
(a) Classification according to the kinds of databases mined: systems can be classified according to different criteria such as the data models, the types of data, or the applications involved.
(b) Classification according to the kinds of knowledge mined: systems can be categorized based on the kind of knowledge mined, i.e., the data mining functionalities.
(c) Classification according to the kinds of techniques utilized: systems can be described according to the degree of user interaction (e.g., autonomous, interactive exploratory, or query-driven systems) or by the methods of data analysis employed (e.g., database- or data-warehouse-oriented techniques).
(d) Classification according to the applications adapted: different applications such as finance, telecommunications, DNA analysis, stock markets, and e-mail require the integration of application-specific methods.
1.2.5 Practical problems of data mining
A lot of data mining projects get bogged down in a forest of problems such as:
- Lack of long-term vision: "what do we want from our files in the future?"
- Not all files are up to date: data vary greatly in quality
- Struggles between departments: they may not want to give up their data
- Poor cooperation from the electronic data processing department: "just give us the queries and we will find the information you want"
- Legal and privacy restrictions: some data cannot be used for legal reasons
- Files that are hard to connect for technical reasons: there is a discrepancy between a hierarchical and a relational database, or data models are not up to date
- Timing problems: files can be compiled centrally, but with a six-month delay
- Interpretation problems: the meaning or usage of the data is unknown
1.2.6 Data mining issues and ethics
The usage of data, particularly data about people has serious ethical implications,
and practitioners of data mining techniques must act responsibly by making
themselves aware of the ethical issues that surround their particular application.
When applied to people, data mining is frequently used to determine who gets the
loan, the special offer, and so on. Certain kinds of discrimination, such as racial, sexual,
or religious discrimination, are not only unethical but also illegal. However, the situation
is complex because it depends on the application. Using such information for
medical diagnosis is certainly ethical, but using the same information when
mining loan payment behavior is not. Even when sensitive information is
discarded, there is a risk that models will be built that rely on variables that can
be shown to substitute for racial or sexual characteristics. For example, people
frequently live in areas that are associated with particular ethnic identities, and so
using an area code in a data mining study runs the risk of building models that are
based on race even though racial information has been explicitly excluded from
the data.
1.3 What is multivariate data mining?
Multivariate data can be defined as a set of entities E, where the ith element of E
consists of a vector with 'n' variables. Each variable may be independent or
interdependent with one or more of the other variables.
An n-dimensional dataset E comprises elements Ei = (xi1, xi2, ..., xin). Each observation xij may be independent of or interdependent on one or more of the other observations. Observations may be discrete or continuous in nature, or may take on nominal values.
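For concreteness, a single element Ei of such a dataset could be represented in code as follows. This is only an illustrative sketch (the class is not part of the thesis software); all variables are stored as numeric values, with nominal variables assumed to have been coded numerically beforehand.

```java
/**
 * A minimal representation of one element Ei = (xi1, xi2, ..., xin) of an
 * n-dimensional multivariate dataset.
 */
public class Observation {
    private final double[] values;          // xi1 ... xin

    public Observation(double... values) {
        this.values = values.clone();
    }

    public int dimension() {
        return values.length;
    }

    public double get(int j) {              // j = 0 ... n-1
        return values[j];
    }
}
```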
Multivariate data is difficult to visualize effectively because of:
- Dimensional constraints: it is difficult to visualize data in more than three dimensions
- Size of the data set and occlusion: data patterns are difficult to find
- Saturation: data visualization is difficult
- Scarcity: too few data points to find patterns
Examples of the types of multivariate data:
- Data with a physical interpretation, such as geographical data
- A sequence of time-varying information, such as stock prices
Multivariate data analysis can be used for any table of data; even a table with a few rows and many columns can be converted into a few meaningful plots that display the real information in the data in a way that is easy to understand.
Typical applications:
- Quality control and quality optimization (food, beverages, paints, drugs)
- Process optimization and process control
- Development and optimization of measurement methods
- Prospecting for oil, ore, water, minerals, etc.
- Classification of bacteria, viruses, tissues, and other medical specimens
- Analysis of economic and administrative tables
- Design of new drugs
1.3.1 When is multivariate analysis used?
"Variate" refers to variables, and "multi" means several or many. Multivariate analysis is appropriate whenever the data consist of two or more variables observed for a number of individuals. The result is often called a "data set." It is customary to envision a data set as being composed of rows and columns. The rows pertain to each observation, such as each person or each completed questionnaire in a large survey. The columns pertain to each variable, such as a response or an observed characteristic for each person.
Rows: records; individuals; cases; respondents; subjects; patients; etc.
Columns: fields; variables; characteristics; responses; etc.
Data sets can be immense; a single study may have a sample size of 1,000 respondents, each answering 100 questions. Here the data set would be 1,000 by 100, or 100,000 cells of data. Hence, the need for summarization is evident. Simple univariate or bivariate statistics are not enough: if an average were computed for each variable, 100 means would result, and if all pairwise correlations were computed, there would be close to 5,000 separate values.
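These counts follow directly from the dimensions of the survey; as a small worked calculation (not part of the original text):

1000 \times 100 = 100{,}000 \text{ cells}, \qquad 100 \text{ means}, \qquad \binom{100}{2} = \frac{100 \cdot 99}{2} = 4950 \approx 5000 \text{ pairwise correlations}.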
Cluster analysis might yield five clusters. Multiple regression could identify six significant predictor variables. Multiple discriminant analysis perhaps would find seven significant variables, and so on. It should be evident that parsimony can be achieved by using multivariate techniques when analyzing most data sets. Another reason for using multivariate techniques is that they automatically assess the significance of all linear combinations of the observed variables.
1.4 What is k-means mining?
The k-means algorithm takes the input parameter k and partitions a set of n objects into k clusters so that the resulting intracluster similarity is high but the intercluster similarity is low. Cluster similarity is measured with respect to the mean value of the objects in a cluster, which can be viewed as the cluster's center of gravity.
1.4.1 How does k-means work?
The k-means algorithm proceeds as follows. First, it randomly selects k of the objects, each of which initially represents a cluster mean or center. Each of the remaining objects is assigned to the cluster to which it is the most similar, based on the distance between the object and the cluster mean (typically, the squared-error criterion is used). The algorithm then computes the new mean for each cluster. This process iterates until the criterion function converges.
1.4.2 Why k-means is not enough?
The k-means method can be applied only when the mean of a cluster is defined.
This may not be the case in some applications, such as when data with categorical
attributes are involved. The necessity for users to specify k, the number of
clusters, in advance can be seen as a disadvantage. The K-means method is not
suitable for discovering clusters with nonconvex shapes or clusters of very
different size. Moreover, it is sensitive to noise and outlier data points since a
small number of such data can substantially influence the mean value.
1.5 Motivation
To get a sense of how adding multivariate data mining can enrich a pattern sequence, let us look at the same areas in which general sequential patterns are useful. Quality control is one of the areas in which multivariate data mining is used.
For example, eight different properties are measured on products as part of quality control before delivery. You have a table with these eight values measured on one hundred and fourteen product samples from the last year. The quality manager wants to answer questions like: Are there any trends? Are the eight properties related to each other, and if so, how? Was there a difference when the new production process was started six months ago? Is there any relation between the product quality and the values of the sixteen process variables? Can we improve the process? Do we have to measure all eight properties to guarantee good products?
In effect, this is achieved by using data mining algorithms such as k-means clustering. K-means clustering consists of distance calculations between cluster centroids and patterns. As the number of patterns and the number of centroids increase, the time needed to complete the computations increases. This computational load requires high-performance computers and/or algorithmic improvements.
My research proposes a method that combines the k-means algorithm with indexes. This provides a better-localized result, with less computation and without any loss of information. In the above example, we can index the dataset by product and then run the k-means algorithm over each product to find meaningful information. The thesis investigated whether an indexed k-means method is possible for a multivariate dataset with static variables, whether better knowledge can be acquired using this process than with the regular k-means method, and whether the algorithm holds up in practice.
1.5.1 How can we achieve the indexed k-means method?
The dataset was indexed using a static variable with a small number of discrete values; for each value of that static variable, the corresponding subset of the dataset was extracted and the regular k-means algorithm (implemented in Java) was run over it to find the information in the dataset. For comparison, the whole dataset was taken without indexing and the k-means algorithm was run over it.
1.6 Research Contribution
This thesis addresses a problem that has not been looked at before, namely how to combine the k-means algorithm with indexing. The thesis proposes to use this indexed k-means algorithm and to compare the information gathered with that obtained by the k-means method.
1.6.1 Assumptions
1. We assume that knowledge is stored in only 5 attributes or dimensions, even though the algorithm supports more dimensions. With more dimensions, more computation is required.
2. We also assume that within each unordered dimension, only one value may be present in a database record. For example, if the additional dimension refers to the fish ID, then there cannot be two different fish IDs associated with a record.
3. Noise and outlier data points have been discarded in the preprocessing of the dataset, since a small number of such points can substantially influence the mean value. These data points may or may not contain useful information.
4. Datasets from a particular year have been taken to reduce the computations. However, some knowledge may have been lost in that process.
1.7 Thesis organization
This thesis is organized as follows. In Chapter 2, related work is discussed, including other approaches for finding knowledge and research done in the area of multivariate data analysis. Included here is an in-depth discussion of the two approaches, k-means clustering and index-based data mining, on which our proposed indexed k-means algorithm is based. Chapter 3 explains how these two approaches are integrated to form the new algorithm, and also presents the comparison algorithm, k-means. Chapter 4 shows the results of the knowledge analysis and possible optimizations. Chapter 5 concludes with a look at the future directions of this research.
Chapter 2
RELATED WORK
2.1 Data mining approaches
Data mining is a young interdisciplinary field, drawing from areas such as
database systems, data warehousing, statistics, machine learning, data
visualization, information retrieval, and high performance computing [4]. Other
contributing areas include neural networks, pattern recognition, spatial data
analysis, image databases, signal processing, probabilistic graph theory, and
inductive logic programming. Data mining needs the integration of approaches
from multiple disciplines [4].
Large sets of data analysis methods have been developed in statistics. Machine learning has also contributed substantially to classification and induction problems. Neural networks have shown their effectiveness in classification, prediction, and clustering analysis tasks. However, with increasingly large amounts of data stored in databases for data mining, these methods face challenges of efficiency and scalability. Efficient data structures, indexing, and data-accessing techniques developed in database research contribute to high-performance data mining. Many existing data analysis methods need to be re-examined, and set-oriented, scalable algorithms should be developed for effective data mining [4].
Another difference between traditional data analysis and data mining is that
traditional data analysis is assumption-driven in the sense that a hypothesis is
formed and validated against the data, whereas data mining in contrast is
discovery-driven in the sense that patterns are automatically extracted from data,
which requires substantial search efforts [4]. Therefore, high performance
computing will play an important role in data mining. Parallel, distributed, and
incremental data mining methods should be developed, and parallel computer
architectures and other high performance computing techniques should also be
explored in data mining.
Human eyes identify patterns and regularities in data sets or data mining results.
Data and knowledge visualization is an effective approach for the presentation of
data and knowledge, exploratory data analysis, and interactive data mining.
Data mining in data warehouses is one step beyond on-line analytical processing (OLAP) of data warehouse data [3]. By integrating OLAP and data cube technologies, an on-line analytical mining mechanism contributes to interactive mining of multiple abstraction spaces of data cubes.
2.2 Mining complex data in large data and information repositories [4]
Data mining is not confined to relational, transactional, and data warehouse data.
There are high demands for mining spatial, text, multimedia and time-series data,
and mining complex, heterogeneous, semi-structured and unstructured data,
including the web-based information repositories [5,6].
Complex data may require advanced data mining techniques. For example, for
object-oriented and object-relational databases, object-cube based generalization
techniques can be developed for handling complex structured objects, methods,
class/subclass hierarchies, etc. Mining can then be performed on the multidimensional abstraction spaces provided by object-cubes.
A spatial database stores spatial data, which represent points, lines, and regions, as well as non-spatial data, which represent other properties of spatial objects and their non-spatial relationships. A spatial data cube can be constructed that consists of both spatial and non-spatial dimensions and/or measures. Since a spatial measure may represent a group of aggregations that can produce a great number of aggregated spatial objects, it is impossible to pre-compute and store all such spatial aggregations. Therefore, selective materialization of aggregated spatial objects is a good tradeoff between storage space and online computation time [4].
Spatial data mining can be performed in a spatial data cube as well as directly in a spatial database. A multi-tier computation technique can be adopted in spatial data mining to reduce spatial computation. For example, when mining spatial association rules, one can first apply rough spatial computations, such as the minimal bounding rectangle method, to filter out most of the sets of spatial objects (e.g., those that are not spatially close enough), and then apply relatively costly, refined spatial computations only to the set of promising candidates.
Text analysis methods and content-based image retrieval techniques play an
important role in mining text and multimedia data, respectively. These techniques
can be integrated with data cube and data mining techniques for effective mining
of such types of data.
It is challenging to mine knowledge from the World-Wide-Web because of the
huge amount of unstructured and semi-structured data. However, Web access
patterns can be mined from the preprocessed and cleaned Web log records; hot
Web sites can be identified based on their access frequencies and the number of
links pointing to the corresponding sites.
2.3 Clustering Analysis
Cluster analysis identifies clusters embedded in the data, where a cluster is a collection of data objects that are "similar" to one another. Similarity can be expressed by distance functions, specified by users or experts. A good clustering method produces high-quality clusters, ensuring that the inter-cluster similarity is low and the intra-cluster similarity is high. For example, one may cluster the houses in an area according to their house category and geographical location.
Unlike classification, clustering and unsupervised learning do not rely on predefined classes and class-labeled training examples. For this reason, clustering is a form of learning by observation, rather than learning by examples.
Conceptual clustering groups objects to form a class, described by a concept.
This differs from conventional clustering, which measures similarity based on
geometric distance. Conceptual clustering has two functions: (1) discovers the
appropriate classes; (2) forms descriptions for each class, as in classification. The
guideline of striving for high intraclass and low interclass similarity still applies.
Data mining research has been focused on high quality and scalable clustering
methods for large databases and multidimensional data warehouses.
An example of clustering is what most people perform when they do laundry: grouping permanent press, dry cleaning, whites, and brightly colored clothes, which is useful because the items in each group have similar characteristics. It turns out that these clusters share important common attributes in the way they behave when washed. Clustering appears straightforward but is, of course, difficult to do well, and clusters are often more dynamic than such fixed groupings.
An intuition for the nearest-neighbor prediction algorithm comes from looking at the people in a neighborhood: it may be noticed that, in general, they all have somewhat similar incomes, although there may still be a wide variety of incomes even among your closest neighbors. The nearest-neighbor prediction algorithm works in much the same way, except that nearness in a database may be defined by a variety of factors; it also performs quite well in terms of automation, because many of the algorithms are robust with respect to dirty and missing data.
The nearest neighbor prediction algorithm simply stated is as follows: "Objects
that are 'near' each other will also have similar prediction values. Thus, if you
know the prediction value of one of the objects, you can predict it from its
nearest neighbors [7]."
The typical requirements of clustering in data mining are:
- Scalability: clustering on a sample of a given large data set may lead to biased results, so highly scalable clustering algorithms are needed
- Ability to deal with different types of attributes: many algorithms are designed to cluster interval-based (numerical) data, but applications may require clustering other types of data, such as binary, categorical (nominal), and ordinal data, or mixtures of data types
- Discovery of clusters with arbitrary shape: clusters can be of any shape, so it is important to develop algorithms that can detect clusters of arbitrary shape
- Minimal requirements for domain knowledge to determine input parameters: the clustering results can be sensitive to input parameters
- Ability to deal with noise: some clustering algorithms are sensitive to missing, unknown, outlier, or erroneous data, which can lead to clusters of poor quality
- Insensitivity to the order of input records: some clustering algorithms are sensitive to the order of the input data
- High dimensionality: it is challenging to cluster data objects in high-dimensional space, especially considering that such data can be very sparse and highly skewed
- Constraint-based clustering: applications may need to perform clustering under various kinds of constraints; a challenging task is to find groups of data with good clustering behavior that satisfy specified constraints
- Interpretability and usability: clustering needs to be tied to specific semantic interpretations and applications, and it is important to study how an application goal may influence the selection of clustering methods
There are many clustering techniques, organized into the following categories: partitioning, hierarchical, density-based, grid-based, and model-based methods. Clustering can also be used for outlier detection.
2.3.1 Partitioning methods
Given a database of n objects or data tuples, a partitioning method constructs k partitions of the data, where each partition represents a cluster and k <= n. That is, it classifies the data into k groups, which together satisfy the following requirements: (1) each group must contain at least one object; (2) each object must belong to exactly one group.
Given k, the number of partitions to construct, a partitioning method creates an initial partitioning. It then uses an iterative relocation technique that attempts to improve the partitioning by moving objects from one group to another.
To achieve global optimality, partitioning-based clustering would require the exhaustive enumeration of all of the possible partitions. Instead, most applications adopt one of two popular heuristic methods: (1) the k-means algorithm, where each cluster is represented by the mean value of the objects in the cluster; (2) the k-medoids algorithm, where each cluster is represented by one of the objects located near the center of the cluster. These heuristic clustering methods work well for finding spherical-shaped clusters in small to medium-sized databases. To find clusters with complex shapes and to cluster very large data sets, partitioning-based methods need to be extended.
2.3.1.1 k-means algorithm
Given a database of n objects and k, the number of clusters to form, a partitioning algorithm organizes the objects into k partitions (k <= n), where each partition represents a cluster. The clusters are formed to optimize an objective partitioning criterion, often called a similarity function, such as distance, so that the objects within a cluster are "similar," whereas the objects of different clusters are "dissimilar" in terms of the database attributes.
Algorithm: k-means, partitioning based on the mean value of the objects in the cluster.
Input: the number of clusters k and a database containing n objects.
Output: a set of k clusters that minimizes the squared-error criterion.
Method:
(1) Arbitrarily choose k objects as the initial cluster centers
(2) Repeat
(3) (Re)assign each object to the cluster to which the object is the most similar, based on the mean value of the objects in the cluster
(4) Update the cluster means, i.e., calculate the mean value of the objects for each cluster
(5) Until no change occurs
The k-means algorithm takes the input parameter, k and partitions a set of n
objects into k clusters so that it results in high intracluster and low intercluster
similarity. Cluster similarity is measured in regard to the mean value of the objects
in a cluster, which can be viewed as the cluster's center of gravity.
"How does the k-means algorithm work?" The k-means algorithm proceeds as
follows. First, it randomly selects k of the objects, each of which initially
represents a cluster mean or center. For each of the remaining objects, an object
is assigned to the cluster to which it is the most similar, based on the distance
between the object and the cluster mean. It then computes the new mean for
each cluster. This process iterates until the criterion function converges. Typically,
the squared-error criterion is used, which is defined as
E = Eki=1 ki=1 pCi|p-mi|2,
Where E is the sum of square-error for all the objects in the database, p is the
point in space representing a given object, and mi is the mean of cluster Ci (both
p and mi are multidimensional). This criterion tries to make the resulting k
clusters as compact and as separate as possible.
The algorithm attempts to determine k partitions that minimize the squared-error
function. It works well when the clusters are compact clouds that are rather well
separated from one another. The method is relatively scalable and efficient in
processing large data sets because the computational complexity of the algorithm
is O (nkt), where n is the total number of objects, k is the number of clusters, and
t is the number of iterations. Normally, k <<n and t<<n. The method often
terminates at a local optimum.
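To make the loop concrete, the following is a minimal, self-contained Java sketch of the procedure just described. It is illustrative only and is not the thesis implementation of Appendix A; squared Euclidean distance is assumed as the dissimilarity measure, and an empty cluster simply keeps its previous mean.

```java
import java.util.Random;

public class KMeansSketch {

    /** Clusters n points into k groups; returns the cluster index assigned to each point. */
    public static int[] cluster(double[][] points, int k, Random rnd) {
        int n = points.length, d = points[0].length;
        double[][] means = new double[k][];
        // (1) Arbitrarily choose k objects as the initial cluster centers.
        for (int c = 0; c < k; c++) {
            means[c] = points[rnd.nextInt(n)].clone();
        }
        int[] assignment = new int[n];
        boolean changed = true;
        while (changed) {                       // (2) repeat ... (5) until no change occurs
            changed = false;
            // (3) (Re)assign each object to the cluster whose mean is nearest.
            for (int i = 0; i < n; i++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (dist2(points[i], means[c]) < dist2(points[i], means[best])) best = c;
                }
                if (assignment[i] != best) { assignment[i] = best; changed = true; }
            }
            // (4) Update the cluster means.
            double[][] sums = new double[k][d];
            int[] counts = new int[k];
            for (int i = 0; i < n; i++) {
                counts[assignment[i]]++;
                for (int j = 0; j < d; j++) sums[assignment[i]][j] += points[i][j];
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) {
                    for (int j = 0; j < d; j++) means[c][j] = sums[c][j] / counts[c];
                }
            }
        }
        return assignment;
    }

    /** Squared Euclidean distance |p - m|^2, the per-object term of the criterion E. */
    static double dist2(double[] p, double[] m) {
        double s = 0.0;
        for (int j = 0; j < p.length; j++) {
            double diff = p[j] - m[j];
            s += diff * diff;
        }
        return s;
    }
}
```

Each iteration performs O(nk) distance computations, consistent with the overall O(nkt) complexity noted above.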
The k-means method, however, can be applied only when the mean of a cluster is
defined. This may not be the case in some applications, such as when data with
categorical attributes are involved. The necessity for users to specify k (number of
clusters) in advance can be seen as a disadvantage. The k-means method is not
suitable for discovering clusters with non-convex shapes or clusters of very
different size. Moreover, it is sensitive to noise and outlier data points since a
small number of such data can substantially influence the mean value.
There are a few variants of the k-means method. These differ in the selection of
the initial k means, the calculation of dissimilarity, and strategies for calculating
cluster means. An interesting strategy that often yields good results is to first
apply a hierarchical agglomeration algorithm to determine the number of clusters,
find initial clusters, and then use iterative relocation to improve them.
Another variant to k-means is the k-modes method, which extends the k-means
paradigm to cluster categorical data by replacing the means of clusters with
modes, using new dissimilarity measures to deal with categorical objects and a
frequency-based method to update the modes of clusters. The k-means and k-modes methods can be integrated to cluster data with mixed numeric and categorical values, resulting in the k-prototypes method.
"How can we make the k-means algorithm more scalable?" A recent effort on
scaling the k-means algorithm is based on the idea of identifying three kinds of
regions in data: regions that are compressible, regions that must be maintained in
main memory, and regions that are discardable. An object is compressible if it is
not discardable but belongs to a tight sub cluster. A data structure known as a
clustering feature is used to summarize objects that have been discarded or
compressed. If an object is neither discardable nor compressible, then it should
be retained in main memory. To achieve scalability, the iterative clustering
algorithm only includes the clustering features of the compressible objects and
the objects that must be retained in main memory, thereby turning a secondary-memory-based algorithm into a main-memory-based algorithm.
[Flowchart: Start → initialize the centers of the k clusters → evaluate the cluster assignment of the vectors → compute new cluster centers with respect to the new cluster assignment → evaluate the new cluster assignments → if the new assignments differ from the previous ones, repeat; otherwise stop.]
Fig 2.1 Flowchart for k-means Algorithm
2.3.1.2 K-medoids method
The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of data. "How might the algorithm be modified to diminish such sensitivity?" Instead of taking the mean value of the objects in a cluster as a reference point, the medoid can be used, which is the most centrally located object in a cluster. Thus the partitioning method can still be performed, based on the principle of minimizing the sum of the dissimilarities between each object and its corresponding reference point. This forms the basis of the k-medoids method.
The strategy of the k-medoids clustering algorithm is to find k clusters in n objects by first arbitrarily choosing a representative object (the medoid) for each cluster. Each remaining object is clustered with the medoid to which it is the most similar. The strategy then iteratively replaces one of the medoids by one of the non-medoids as long as the quality of the resulting clustering is improved. This quality is estimated using a cost function that measures the average dissimilarity between an object and the medoid of its cluster. To determine whether a non-medoid object, o_random, is a good replacement for a current medoid, o_j, the following four cases are examined for each of the non-medoid objects p.
Case 1: p currently belongs to medoid o_j. If o_j is replaced by o_random as a medoid and p is closest to one of the other medoids o_i, i ≠ j, then p is reassigned to o_i.
Case 2: p currently belongs to medoid o_j. If o_j is replaced by o_random as a medoid and p is closest to o_random, then p is reassigned to o_random.
Case 3: p currently belongs to medoid o_i, i ≠ j. If o_j is replaced by o_random as a medoid and p is still closest to o_i, then the assignment does not change.
Case 4: p currently belongs to medoid o_i, i ≠ j. If o_j is replaced by o_random as a medoid and p is closest to o_random, then p is reassigned to o_random.
Each time a reassignment occurs, the difference in squared error E contributes to the cost function, which calculates the change in squared-error value when a current medoid is replaced by a non-medoid object. The total cost of swapping is the sum of the costs incurred by all non-medoid objects. If the total cost is negative, then o_j is replaced with o_random, since the actual squared error would be reduced. If the total cost is positive, the current medoid o_j is considered acceptable and nothing is changed.
Algorithm: k-medoids, partitioning based on medoid or central objects.
Input: the number of clusters k and a database containing n objects.
Output: a set of k clusters that minimizes the sum of the dissimilarities of all the objects to their nearest medoid.
Method:
(1) Arbitrarily choose k objects as the initial medoids
(2) Repeat
(3) Assign each remaining object to the cluster with the nearest medoid
(4) Randomly select a non-medoid object, o_random
(5) Compute the total cost, S, of swapping o_j with o_random
(6) If S < 0 then swap o_j with o_random to form the new set of k medoids
(7) Until no change occurs
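As an illustration of step (5), the sketch below computes the total swapping cost S for one candidate replacement by summing, over all objects, the change in dissimilarity to the nearest medoid caused by the swap; this covers the four reassignment cases described above. It is a schematic Java sketch rather than the PAM implementation, with squared Euclidean distance assumed as the dissimilarity measure and hypothetical class and method names.

```java
public class KMedoidsSwap {

    /**
     * Total cost S of replacing the medoid at position j with a candidate
     * non-medoid object.  A negative S means the swap improves the clustering
     * and should be accepted.
     */
    public static double swapCost(double[][] objects, double[][] medoids, int j, double[] candidate) {
        double total = 0.0;
        for (double[] p : objects) {
            // Dissimilarity to the nearest medoid before the swap ...
            double before = Double.POSITIVE_INFINITY;
            for (double[] m : medoids) {
                before = Math.min(before, dissimilarity(p, m));
            }
            // ... and after medoid j has been replaced by the candidate.
            double after = dissimilarity(p, candidate);
            for (int i = 0; i < medoids.length; i++) {
                if (i != j) after = Math.min(after, dissimilarity(p, medoids[i]));
            }
            total += after - before;
        }
        return total;
    }

    /** Squared Euclidean distance, used here as the dissimilarity measure. */
    static double dissimilarity(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return s;
    }
}
```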
PAM (Partitioning Around Medoids) was one of the first k-medoids algorithms introduced. It attempts to determine k partitions for n objects. After an initial random selection of k medoids, the algorithm repeatedly tries to make a better choice of medoids. All of the possible pairs of objects are analyzed, where one object in each pair is considered a medoid and the other is not. The quality of the resulting clustering is calculated for each such combination. A medoid, o_j, is replaced with the object causing the greatest reduction in squared error. The set of best objects for each cluster in one iteration forms the medoids for the next iteration. For large values of n and k, such computation becomes very costly.
"Which method is more robust-k-means or k-medoids?" The k-medoids method
is more robust than k-means in the presence of noise and outliers because a
32
medoid is less influenced by outliers or other extreme values than a mean.
However, its processing is more costly than the k-means method. Both methods
require the user to specify k, the number of clusters.
2.3.2 Hierarchical methods
A hierarchical method creates a hierarchical decomposition of the given set of data objects. A hierarchical method can be classified as being either agglomerative or divisive, based on how the hierarchical decomposition is formed. The agglomerative approach, also called the bottom-up approach, starts with each object forming a separate group and successively merges the objects or groups that are close to one another. The divisive approach, also called the top-down approach, starts with all of the objects in the same cluster; in each successive iteration, a cluster is split into smaller clusters, until eventually each object is in its own cluster, or until a termination condition holds.
Hierarchical methods suffer from the fact that once a step (merge or split) is done, it can never be undone. This rigidity is useful in that it leads to smaller computation costs, since the method does not have to worry about a combinatorial number of different choices. However, such techniques cannot correct erroneous decisions. There are two approaches to improving the quality of hierarchical clustering: (1) perform careful analysis of object "linkages" at each hierarchical partitioning, as in CURE and Chameleon; (2) integrate hierarchical agglomeration and iterative relocation by first using a hierarchical agglomerative algorithm and then refining the result using iterative relocation, as in BIRCH.
2.3.3 Density-based methods
Most partitioning methods cluster objects based on the distance between objects.
Such methods can find only spherical-shaped clusters and encounter difficulty in
discovering clusters of arbitrary shapes. Clustering methods have been developed
based on the notion of density. The general idea is to continue growing the given
cluster as long as the density (number of objects) in the "neighborhood" exceeds
some threshold; that is, for each object within a given cluster, the neighborhood
of a given radius will contain at least a minimum number of points. Such a
method can be used to filter out noise and discover clusters of arbitrary shape.
DBSCAN is a typical density-based method that grows clusters according to
a density threshold. OPTICS is a density-based method that computes an
augmented clustering ordering for automatic and interactive cluster analysis.
2.3.4 Grid-based methods
Grid-based methods quantize the object space into a finite number of cells that
form a grid structure. All of the clustering operations are performed on the grid
structure (quantized space). The main advantage of this approach is its fast
processing time, which is typically independent of the number of data objects and
dependent only on the number of cells in each dimension in the quantized space.
STING is a typical example of a grid-based method. CLIQUE and Wave-Cluster
are two clustering algorithms that are both grid and density based.
2.3.5 Model-based methods:
Model-based methods hypothesize a model for each of the clusters and find the
best fit of the data to the given model. A model-based algorithm may locate
clusters by constructing a density function that reflects the spatial distribution of
the data points. This also leads to a way of automatically determining the number
of clusters based on standard statistics, taking "noise" or outliers into account,
and thus yields robust clustering methods.
Some clustering algorithms integrate the ideas of several clustering methods, so
that it is sometimes difficult to classify a given algorithm as uniquely belonging to
only one clustering method category. Furthermore, some applications may have
clustering criteria that require the integration of several clustering techniques.
2.3.6 EM algorithm
The EM (Expectation Maximization) algorithm extends the k-means paradigm in
a different way. Instead of assigning each object to a dedicated cluster, it assigns
each object to a cluster according to a weight representing the probability of
membership. In other words, there are no strict boundaries between clusters.
Therefore, new means are computed based on weighted measures.
In this setting, we know neither the distribution that each training instance came
from nor the parameters of the mixture model. So we adopt the same procedure
used for the k-means clustering algorithm, and iterate: guess initial values for the
parameters (for a two-cluster, one-attribute mixture these are the two means, the
two standard deviations, and the mixing probability, five parameters in all), use
them to calculate the cluster probabilities for each instance, use these
probabilities to re-estimate the parameters, and repeat. This is the EM algorithm.
The first step, the calculation of the cluster probabilities (which are the
"expected" class values), is the "expectation"; the second, the calculation of the
distribution parameters, is the "maximization" of the likelihood of the
distributions given the data.
Adjustments must be made to the parameter estimation equations to account for
the fact that it is only cluster probabilities, not the clusters that are known for
each instance. These probabilities act like weights. If wi is the probability that
instance i belongs to cluster A, then the mean μA and variance σA² (or standard
deviation σA) of cluster A are

μA = (w1x1 + w2x2 + ... + wnxn) / (w1 + w2 + ... + wn)

σA² = (w1(x1 − μA)² + w2(x2 − μA)² + ... + wn(xn − μA)²) / (w1 + w2 + ... + wn)
where the xi are all of the instances, not just those belonging to cluster A. Now
consider how to terminate the iteration. The k-means algorithm stops when the
classes of the instances do not change from one iteration to the next, which means
that a "fixed point" has been reached. The EM algorithm converges toward a fixed
point but never actually gets there. Despite that, we can see how close it is by
calculating the overall likelihood that the data came from these distributions,
given the values of the parameters. This overall likelihood is obtained by multiplying the
probabilities of the individual instances i:

∏i (pA Pr[xi | A] + pB Pr[xi | B])
The probabilities given cluster A or B are determined from the normal
distribution function f(x; μ, σ). This overall likelihood is a measure of the
"goodness" of the clustering, and it increases at each iteration of the EM
algorithm. Again, there is a technical difficulty with equating the probability of a
particular value of x with f(x; μ, σ); in this case the effect does not disappear,
because no probability normalization operation is applied. The upshot is that the
likelihood expression is not a probability and does not necessarily lie between
zero and one; nevertheless, its magnitude still reflects the quality of the clustering.
In practice, logarithms are used: the implementation sums the logs of the
individual components, avoiding all the multiplications. The overall conclusion
still holds: you should iterate until the increase in log-likelihood becomes negligible.
For example, a practical implementation might iterate until the difference
between successive values of the log-likelihood is less than 10^-10 for ten
successive iterations. The log-likelihood may increase very sharply over the first
few iterations and then converge quickly to a point that is virtually stationary.
Although the EM algorithm is guaranteed to converge to a maximum, this is a
local maximum and may not necessarily be the same as the global maximum. For
a better chance of obtaining the global maximum, the whole procedure should be
repeated several times, with different initial guesses for the parameter values. The
overall log-likelihood figure can then be used to compare the different final
configurations obtained: just choose the largest of the local maxima.
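The following sketch shows one EM iteration for the two-cluster, one-attribute mixture described above (a hedged illustration: the variable names and the normal-density helper are assumptions, not code from this thesis). The E-step computes the weight wi = Pr[A | xi]; the M-step recomputes the weighted means, standard deviations, and the mixing probability; the log-likelihood is accumulated for the termination test.

class EmSketch {
    // One EM iteration for a two-component, one-dimensional Gaussian mixture.
    // params = { muA, sdA, muB, sdB, pA }; returns the log-likelihood under the
    // parameters that were in effect when the iteration started.
    static double emStep(double[] x, double[] params) {
        double muA = params[0], sdA = params[1], muB = params[2], sdB = params[3], pA = params[4];
        int n = x.length;
        double[] w = new double[n];        // w[i] = Pr[cluster A | x[i]]  (the "expectation" step)
        double logLik = 0.0;
        for (int i = 0; i < n; i++) {
            double a = pA * density(x[i], muA, sdA);
            double b = (1.0 - pA) * density(x[i], muB, sdB);
            w[i] = a / (a + b);
            logLik += Math.log(a + b);     // sum logs instead of multiplying probabilities
        }
        // "Maximization" step: weighted re-estimates of the parameters.
        double sumW = 0.0, meanA = 0.0, meanB = 0.0;
        for (int i = 0; i < n; i++) { sumW += w[i]; meanA += w[i] * x[i]; meanB += (1 - w[i]) * x[i]; }
        meanA /= sumW;
        meanB /= (n - sumW);
        double varA = 0.0, varB = 0.0;
        for (int i = 0; i < n; i++) {
            varA += w[i] * (x[i] - meanA) * (x[i] - meanA);
            varB += (1 - w[i]) * (x[i] - meanB) * (x[i] - meanB);
        }
        params[0] = meanA; params[1] = Math.sqrt(varA / sumW);
        params[2] = meanB; params[3] = Math.sqrt(varB / (n - sumW));
        params[4] = sumW / n;
        return logLik;                     // iterate until the increase becomes negligible
    }

    static double density(double x, double mu, double sd) {
        double z = (x - mu) / sd;
        return Math.exp(-0.5 * z * z) / (sd * Math.sqrt(2.0 * Math.PI));
    }
}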
2.4 Index-based algorithms:
Given a data set, the index-based algorithm uses multidimensional indexing
structures, such as R-trees or k-d trees, to search for neighbors of each object o
within radius d around that object. Let M be the maximum number of objects
within the d-neighborhood of an outlier. Therefore, once M + 1 neighbors of
object o are found, it is clear that o is not an outlier. This algorithm has a worst-case
complexity of O(k·n²), where k is the dimensionality and n is the number of
objects in the data set. The index-based algorithm scales well as k increases.
However, this complexity evaluation takes only the search time into account even
though the task of building an index, in itself, can be computationally intensive.
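A naive sketch of the neighbor-counting test described above (an illustration only; a real index-based method would answer the range query with an R-tree or k-d tree rather than a linear scan): the search for an object's d-neighborhood stops as soon as M + 1 neighbors have been found, since the object can then no longer be an outlier.

class OutlierSketch {
    // Returns true when object o has at most M other objects within radius d,
    // i.e. o is a distance-based outlier. The scan stops early at M + 1 neighbors.
    static boolean isOutlier(double[][] data, int o, double d, int M) {
        int neighbors = 0;
        for (int j = 0; j < data.length; j++) {
            if (j == o) continue;
            if (distance(data[o], data[j]) <= d) {
                neighbors++;
                if (neighbors > M) return false;   // M + 1 neighbors found: not an outlier
            }
        }
        return true;
    }

    static double distance(double[] a, double[] b) {
        double s = 0.0;
        for (int k = 0; k < a.length; k++) s += (a[k] - b[k]) * (a[k] - b[k]);
        return Math.sqrt(s);
    }
}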
2.5 Which Techniques to Use for Which Tasks [8]

Table 2.1 maps data mining techniques to the tasks for which each is suited. The
techniques compared are standard statistics, market basket analysis, memory-based
reasoning, genetic algorithms, cluster detection, link analysis, decision trees, and
neural networks; the tasks are classification, estimation, prediction, affinity
grouping, clustering, and description.

Table 2.1 Techniques and Tasks.
The choice of data mining technique to apply at a given point in the cycle
depends on the particular data mining task to be accomplished and on the data
available for analysis. The approach to selecting a data mining technique has two
steps:
• Translate the business problem to be addressed into a series of data mining tasks.
• Understand the nature of the available data in terms of the content and types of
the data fields and the structure of the relationships between records.
The data mining approach is mostly influenced by the following data characteristics:
• A preponderance of categorical variables
• A preponderance of numeric variables
• A large number of fields (independent variables) per record
• Multiple target fields (dependent variables)
• Variable-length records
• Time-ordered data
• Free-text data
2.6 Multidimensional Data Model
The multidimensional data model exists in the form of a data cube that allows data
to be modeled and viewed in multiple dimensions; it is defined by dimensions and facts.
Dimensions are the perspectives or entities according to which an organization
wants to keep records, like time, item, branch, and location in a sales store. Each
dimension may have a table associated with it, called a dimension table, which
further describes the dimension.
A multidimensional data model is typically organized around a central theme, like
sales, for instance. A fact table represents this theme, where facts are numerical
measures. The fact table contains the names of the facts, or measures, as well as
keys to each of the related dimension tables. Multidimensional models exist in the
form of a star schema, a snowflake schema, or a fact constellation schema.
Star Schema: The most common modeling paradigm is the star schema, in
which the data warehouse contains (1) a large central table (fact table) containing
the bulk of the data, with no redundancy, and (2) a set of smaller attendant
dimension tables, one for each dimension. The schema graph resembles a starburst,
with the dimension tables displayed in a radial pattern around the central fact
table.
Snowflake schema: The snowflake schema is a variant of the star schema
model, where some dimension tables are normalized, thereby splitting the data
into additional tables. The resulting schema graph forms a snowflake shape. The
difference between the snowflake and star schema models is that the dimension
tables of the snowflake model may be kept in normalized form to reduce
redundancies. Such a table is easy to maintain and saves storage space, because a
large dimension table can become enormous when the dimensional structure is
included as columns. However, the space saved is negligible in comparison to the
magnitude of the fact table, and the snowflake structure reduces the effectiveness of
browsing since more joins are needed to execute a query. Consequently, system
performance may be adversely impacted. Hence, the snowflake schema is
not as popular as the star schema in data warehousing design.
Fact constellation: Sophisticated applications may require multiple fact tables to
share dimension tables. This kind of schema can be viewed as a collection of
stars, hence is called a galaxy schema or a fact constellation.
A data warehouse collects information about subjects that span the entire
organization, such as customers, items, sales, assets, and personnel, and thus its
scope is enterprise-wide. For data warehouses, the fact constellation schema is
commonly used, since it can model multiple, interrelated subjects. A data mart, on
the other hand, is a departmental subset of the data warehouse that focuses on
selected subjects, and thus its scope is department-wide. For data marts, the star
or snowflake schemas are commonly used, since both are geared towards modeling
single subjects, although the star schema is more popular and efficient.
Chapter 3
ALGORITHM AND DATASET
This chapter discusses the design and implementation of indexed k-means
clustering on the fisheries database. For this purpose, the k-means clustering
technique is implemented. First, the dataset on which the clustering algorithm
operates is studied; second, the implementation of the indexed k-means and
k-means algorithms is discussed.
3.1: Different forms of knowledge:
The key issue in KDD is to realize that there is more information hidden in the
data than you are able to distinguish at first sight. In data mining we have four
different types of knowledge that can be extracted from the data:
1. Shallow knowledge: information that can be easily retrieved from databases
using a query tool such as Structured Query Language (SQL).
2. Multi-dimensional knowledge: information that can be analyzed using online
analytical processing (OLAP) tools. With OLAP tools you have the ability to
rapidly explore all sorts of clusterings and different orderings of the data, but it is
important to realize that most of the things you can do with an OLAP tool can
also be done using SQL. The advantage of OLAP tools is that they are optimized
for this kind of search and analysis; however, OLAP is not as powerful as data
mining, since it cannot search for optimal solutions.
3. Hidden knowledge: information that can be found relatively easily by using
pattern recognition or machine-learning algorithms. Again, one could use SQL to
find these patterns, but this would probably prove extremely time-consuming. A
pattern recognition algorithm can find regularities in a database in minutes or at
most a couple of hours, whereas it could take months to achieve the same result
using SQL.
4. Deep knowledge: information that is stored in the database but can only be
located if there is a clue that indicates where to look. Hidden knowledge lies in a
search space that a search algorithm can explore effectively; deep knowledge lies
in a search space containing only a tiny local optimum, with no indication of
where it sits, so a search algorithm could roam around indefinitely without
achieving any significant result. An example is encrypted information stored in a
database: it is almost impossible to decipher an encrypted message without the
key, which indicates that, for the present at any rate, there is a limit to what one
can learn.
3.2 Getting started:
The starting point for any data mining activity is the formation of a specific
information requirement related to a specific action, i.e., what do we want to
know and what do we want to do with this knowledge? Data mining is pointless
unless the finding of the knowledge is followed up with the appropriate actions.
A data-mining environment can be realized on many different levels using several
different techniques; the following list gives an indication of the steps that should
be taken to start a KDD project:
1. Make a list of requirements. For what purpose would a KDD environment be
realized? What are the criteria of success? How will success be measured?
2. Make an overview of existing hardware and software: networks, databases,
applications, servers, and so on.
3. Evaluate the quality of the available data. For what purpose was it collected?
4. Make an inventory of the available databases, both internal and external.
5. Is a data warehouse in existence? What kind of data is available? Can we zoom
in on details of operational data?
6. Formulate the knowledge that the organization needs, both now and in the
future, in order to be able to function optimally.
7. Identify the groups of knowledge workers or decision makers who are to apply
the results. What kinds of decisions will they need to take? Which patterns are
useful to them and which are not, both now and in the future?
8. Analyze whether the knowledge found can actually be used by the
organization. It is useless to distill client profiles from mailing files if, for technical
reasons, the mailing department cannot handle the selections found.
9. List the processes and transformations these databases have to go through
before they can be used.
3.3 KDD process:
The knowledge discovery process consists of six stages: data selection, cleaning,
enrichment, coding, data mining, and reporting.
3.3.1 Data Selection:
Once you have formulated your information requirements, the next logical step is
to collect and select the data you need. In most cases, this data will be stored in
operational databases used by the information systems in the organization.
However, gathering this information in a centralized database is not always an
easy task since it may involve low-level conversion of data, such as from flat file
to relational tables. The operational data used in different parts of the
organization varies in quality. Some databases are updated on a day-to-day basis;
others may contain information that dates back several years.
Therefore a data warehouse is an important aspect of the data mining process.
Although it is not essential to have a good data warehouse in operation to set up
a KDD activity, it is very helpful. A data warehouse presents a stable and reliable
environment in which to collect operational data.
3.3.2 Cleaning:
Once data is collected, the next stage is cleaning. Because the amount of pollution
in a dataset is not easily detectable, it is a good idea to examine the data first in
order to obtain a feeling for the possibilities, which is difficult with a large dataset.
When databases are very large, it is advisable to select some random samples and
analyze them to get a rough idea of what one can expect. For example, in an
organization the date of birth of a person may be stored correctly while the age
field is not. Before a data mining operation, one has to clean the data as much as
possible, and in most cases this can be done automatically. It is not realistic,
however, to expect to remove all the pollution in advance, since some anomalies in
the data will only be discovered during the data mining process itself. Checking
domain consistency needs to be carried out by programs that have deep semantic
knowledge of the attributes being checked. Most forms of pollution are produced
by the way the data is gathered in the field; removing this kind of pollution will
almost always involve re-engineering the business process.
3.3.3 Enrichment:
Once the data is cleaned, the next step is to enrich it. Additional databases may be
available on a commercial basis; these can provide information on a variety of
subjects, including demographic data such as the average prices of houses and
cars, the types of insurance people hold, and so on.
Matching the information from bought-in databases with the company's own
database can be difficult. A well-known problem is the reconstruction of family
relationships in databases: a company may buy a database containing
demographic data on people living in certain areas, but this information is of value
only if the family relationships between the individuals in the database can be
reconstructed. In a relational environment, this information can simply be joined
with the original data.
3.3.4 Coding:
By means of selection and projection in SQL, the data can be manipulated to obtain
a clean target table. Sometimes pollution in the data can be removed simply by
filtering out the polluted records. Suppose that some information concerning car
or house ownership is missing for some individuals in the database; if this lack of
information is distributed randomly over the database, removing those records
will not affect the type of clusters formed, so this can be done safely. On the
other hand, it is possible that some causal connection exists between the lack of
certain information and a certain type of customer, especially in situations where
fraud could be involved, such as in insurance records: some customers might
deliberately have given wrong information in order to obtain insurance coverage
for which they would not otherwise be eligible. Obviously, in such cases, removing
information will certainly affect the type of patterns found.
3.3.5 Data mining:
Data mining has three main task areas: knowledge engineering, classification, and
problem solving. There is no single best machine-learning or pattern recognition
technique; different tasks presuppose different kinds of techniques. A KDD
environment therefore supports these different types of techniques; such an
environment is called hybrid. The selection of a data mining algorithm depends
on the quality of the input and of the output, as well as on performance. The
efficiency of an algorithm lies both in its learning stage and in the actual
application stage.
3.3.6 Reporting:
The reporting stage combines two different functions:
• Analyzing the results of the pattern recognition algorithms
• Applying the results of the pattern recognition algorithms to data
The purpose is not only to inspect what has been learned, but also to apply the
classification and segmentation information that has been gathered. In many
cases, reporting can be done using traditional query tools for databases.
Nowadays, however, various new data visualization techniques are emerging,
ranging from simple scatter diagrams showing different clusters in a two-dimensional
way to complex interactive environments that enable us to fly over
landscapes containing information about data sets.
3.4 Data Sources:
One of the key steps in Knowledge Discovery in Databases [9] is to create a
suitable target data set for the data mining tasks. Data is stored in the Computer
Engineering Department (UMass Dartmouth) fisheries database; this in turn was
collected from NOAA. The database has the information about the whole
country regarding the available fish types, dates, location of fishing trips, along
with the quantities and market values of the fish caught.
The tables trip, subtrip, and subtrip_fish were identified as being of interest and
were examined. From these tables, attributes relevant to the data mining process
were extracted: fish_id, Landed_kg, Mkt_values, Lat_deg, Long_deg, and date.
After these fields were obtained from the tables, a data source containing all this
information was created as a view (containing 2656731 records). To reduce the
computation required for the clustering process, the dataset for the year 1989
(containing 275258 records) was extracted, while the other years' records were
ignored. For the analysis, data in the area around the Gulf of Maine were
considered; the dataset was therefore reduced to the Gulf of Maine by limiting
lat_deg and long_deg (41.5 to 45 N, 71 W to 65 W), which left 152606 records.
3.4.1 Data Modeling
A model is a description of the original historical database from which it was
built; this can be successfully applied to new data in order to make predictions
about missing values or to make statements about expected values [7]. A data
model captures a pattern in the database that describes an important aspect of the
data. It does not describe the entire database. In the hypothesis-testing style of
data mining, a mental model is what one starts with, but there is still a step that
must be taken before this mental model can be put to the test. This is where the
skills of a good analyst, who understands the available data and is well versed in
the design of decision-support database queries, statistical packages and data
mining tools, come in.
The thesis uses a data model that is multidimensional in character and, at the
same time, gives a better understanding of the dataset, which can be used for
practical purposes. Hence, the fields fish_id, landed_kg, Mkt_values, Lat_deg,
Long_deg and date were selected. This is a multidimensional dataset, but it can be
broken into the following domains.
Domain A: Fish ID - static - This provides the information regarding the types
of fish that can be harvested at some point or time depending on other domains.
Domain B: landed_kg, Mkt_values - Variables - This gives a measure to quantify
the output, and can calculate the profitability of the results.
Domain C: Lat_deg, Long_deg - static - geographical - This domain supplies the
locations for the whole dataset, hence providing a better understanding of the
fisheries.
Domain D: Month - static - partial inference of environment - This domain
captures an environmental factor that has to be taken into consideration for the
data mining task. Knowing the month of the year can help determine whether the
climate will be hot or cold, which is generally taken into consideration by
fishermen.
Domains A, C and D have indexing capabilities. Domain D was chosen, as it is
the most useful and relevant: indexing is practical for Domain D because it has
only 12 distinct values, while the other domains have many more; using them as
the index may not result in a realistic output.
The thesis models the data with rules of the form: IF month = 1 THEN cluster
(Lat_deg, Long_deg, fish_Id, Mkt_values, Landed_kg). One possible answer
would be: for the month of January, at a particular point (Lat_deg, Long_deg),
there is a good chance of finding a particular fish_id with a certain landed_kg and
market value. By knowing the month of the year, fishermen can go to that
particular spot and find this fish with the indicated landed weight and market
value.
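As a hedged sketch of how such a month-indexed subset could be pulled from the view before clustering (the attribute names follow the fields listed above, but the month column name, the query text, and the helper method are assumptions, not the thesis code):

import java.sql.*;

class IndexedExtractSketch {
    // Fetch the records for one index value (e.g. month = 1 for January) so that
    // they can be clustered separately. The column trip_month is hypothetical.
    static java.util.List<double[]> fetchMonth(Connection conn, int month) throws SQLException {
        String sql = "select lat_deg, long_deg, fish_id, mkt_values, landed_kg "
                   + "from thesis_data where trip_month = ?";
        java.util.List<double[]> rows = new java.util.ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setInt(1, month);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    rows.add(new double[] { rs.getDouble(1), rs.getDouble(2),
                                            rs.getDouble(3), rs.getDouble(4), rs.getDouble(5) });
                }
            }
        }
        return rows;   // this subset is then clustered with an ordinary k-means pass
    }
}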
3.4.2 Preprocessing
There are a number of data preprocessing techniques. While data cleaning can be
applied to remove noise and correct inconsistencies in the data, data integration
merges data from multiple sources into a coherent data store, such as a data
warehouse or a data cube. Data transformations, such as normalization, may be
applied to the data. Data reduction can reduce the data size by aggregating,
eliminating redundant features, or clustering, for instance. These data processing
techniques, when applied prior to mining, can substantially improve the overall
quality of the patterns mined and/or the time required for the actual mining.
Manual segregation of data based on indices is required at this stage to allow
processing of data on specific indices.
3.4.3 Data Cleaning:
Missing values: The database did not have many missing values, hence not much
cleaning was necessary in the dataset.
Noisy data: Noise is a random error or variance in a measured variable. Given the
numeric attributes mkt_values and landed_kg, how can the data be smoothed to
remove the noise? Normalization was applied to these attributes so that their
values fall within a small specified range, such as 0.0 to 1.0. The number of
records in the different ranges was then counted; the major portion of the data
was found to lie between 0.0 and 0.01, and the very small number of records in
the range (0.01, 1) was deleted. The ranges were reduced in this way to find a
best fit, and the remaining outlying records were deleted. This decreased the
number of records for the data mining task to 122083 and reduced the date
variable to a month variable.
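A small sketch of the kind of min-max scaling described above (an illustration only; the method names and the cut-off handling are assumptions, not the thesis code):

class NormalizeSketch {
    // Min-max normalization of one numeric attribute (e.g. mkt_values or landed_kg)
    // into the range [0.0, 1.0].
    static double[] minMax(double[] values) {
        double min = values[0], max = values[0];
        for (double v : values) { min = Math.min(min, v); max = Math.max(max, v); }
        double[] scaled = new double[values.length];
        double range = max - min;
        for (int i = 0; i < values.length; i++) {
            scaled[i] = range == 0 ? 0.0 : (values[i] - min) / range;
        }
        return scaled;
    }

    // Count how many scaled values fall below a cut-off such as 0.01; the small
    // number of records above the cut-off were treated as outlying and removed.
    static int countBelow(double[] scaled, double cutoff) {
        int count = 0;
        for (double v : scaled) if (v < cutoff) count++;
        return count;
    }
}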
3.5 Algorithm for indexed k-means
The algorithm is an integration of indexing and k-means clustering. It can also be
called an associated clustering technique, for the result is of the kind: if month =
February, then one gets a cluster of information.
Definition:
The method initially takes the dataset for one value of the indexed column and a
number of components of the population equal to the final required number of
clusters for that indexed value. In this step, the initial cluster centers are chosen so
that the points are mutually farthest apart. Next, the method examines each
component in the population and assigns it to one of the clusters depending on
the minimum distance. The centroid's position is recalculated every time a
component is added to the cluster, and this continues until all the components are
grouped into the final required number of clusters, grouped by indexed field.
Input: The number of clusters ki, where i is the indexed field (month), and a
database containing n objects.
Output: A set of ki clusters that minimizes the squared-error criterion
Algorithm:
(1) Divide the dataset into multiple datasets in such a way that each dataset has a
common domain value (index).
(2) Arbitrarily choose ki objects as initial cluster centers.
(3) (Re)assign each object to the cluster to which the object is the most similar,
based on the mean value of the objects in the cluster.
(4) Update the cluster means, i.e., calculate the mean value of the objects for each
cluster.
(5) Repeat steps (3) and (4) until no change.
(6) Repeat steps (2) through (5) for each unique index field.
[Flowchart: select an index column for clustering; extract the dataset for the
indexed column; initialize the centers of the k clusters; evaluate the cluster
assignment of the vectors; compute new cluster centers with respect to the new
cluster assignment; evaluate the new cluster assignments; if the new assignments
differ from the previous ones, repeat; otherwise continue with the next index
field, stopping when no index fields remain.]
Fig. 3.1 Flowchart for Indexed K-means Algorithm
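The following sketch (a simplified illustration, not the program in Appendix A) expresses the algorithm and flowchart above in code: the records are partitioned by the index value, and an ordinary k-means pass is run on each partition.

import java.util.*;

class IndexedKMeansSketch {
    // Indexed k-means: partition the records by the index attribute (here, month),
    // then run ordinary k-means on each partition separately.
    static Map<Integer, double[][]> indexedKMeans(Map<Integer, List<double[]>> byMonth, int k) {
        Map<Integer, double[][]> centersPerMonth = new TreeMap<>();
        for (Map.Entry<Integer, List<double[]>> e : byMonth.entrySet()) {
            centersPerMonth.put(e.getKey(), kMeans(e.getValue(), k));
        }
        return centersPerMonth;
    }

    // Plain k-means on one subset: assign each object to the nearest center,
    // recompute the centers as cluster means, and repeat until nothing changes.
    static double[][] kMeans(List<double[]> data, int k) {
        int dim = data.get(0).length;
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) centers[c] = data.get(c).clone();   // arbitrary initial centers
        int[] assign = new int[data.size()];
        boolean changed = true;
        while (changed) {
            changed = false;
            for (int i = 0; i < data.size(); i++) {                     // assignment step
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double d = distance(data.get(i), centers[c]);
                    if (d < bestDist) { bestDist = d; best = c; }
                }
                if (assign[i] != best) { assign[i] = best; changed = true; }
            }
            double[][] sums = new double[k][dim];                       // update step
            int[] counts = new int[k];
            for (int i = 0; i < data.size(); i++) {
                counts[assign[i]]++;
                for (int j = 0; j < dim; j++) sums[assign[i]][j] += data.get(i)[j];
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue;                           // an empty cluster keeps its old center
                for (int j = 0; j < dim; j++) centers[c][j] = sums[c][j] / counts[c];
            }
        }
        return centers;
    }

    static double distance(double[] a, double[] b) {
        double s = 0.0;
        for (int j = 0; j < a.length; j++) s += (a[j] - b[j]) * (a[j] - b[j]);
        return Math.sqrt(s);
    }
}

The design point is that each monthly subset is clustered independently, so a month with relatively few records still produces its own ki clusters instead of being absorbed into clusters dominated by busier months.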
3.6 Discovery of Interesting patterns:
A data mining system can uncover thousands of patterns. Many of the patterns
discovered may be uninteresting to the given user, representing common
knowledge or lacking novelty. Several challenges remain regarding the
development of techniques to assess the interestingness of discovered patterns,
particularly with regards to subjective measures that estimate the value of patterns
with respect to a given user class, based on user beliefs or expectations. The use
of interestingness measures to guide the discovery process and reduce the search
space is another active area of research.
3.6.1 Interestingness measures:
Although specification of the task-relevant data and of the kind of knowledge to
be mined may substantially reduce the number of patterns generated, a data
mining process may still generate a large number of patterns. Typically, only a
small fraction of these patterns will actually be of interest to the given user. Thus,
users need a way to further confine the number of uninteresting patterns produced
by the process. This can be achieved by specifying interestingness measures that
estimate the simplicity, certainty, utility and novelty of patterns.
Some objective measures of pattern interestingness are based on the structure of
patterns and the statistics underlying them. In general, each measure is associated
with a threshold that can be controlled by the user. Rules that do not meet the
threshold are considered uninteresting, and hence are not presented to the user as
knowledge.
Simplicity: A factor contributing to the interestingness of a pattern is the
pattern’s overall simplicity for human comprehension. Objective measures of
pattern simplicity can be viewed as functions of the pattern structure, defined in
terms of the pattern size in bits or the number of attributes or operators
appearing in the pattern.
Rule length, for instance, is a simplicity measure. For rules expressed in
conjunctive normal form (i.e., as a set of conjunctive predicates), rule length is
typically defined as the number of conjuncts in the rule.
Certainty: Each discovered pattern should have a measure of certainty associated
with it that assesses the validity or “trustworthiness” of the pattern. A certainty
measure for clustering rules is the percentage of occurrence. Given a task-relevant
set of data tuples, the confidence of the cluster is defined as

Confidence(cluster) = (number of tuples containing the cluster) / (total number of tuples)
Utility: The potential usefulness of a pattern is a factor defining its
interestingness. It can be estimated by a utility function, such as support. The
support of a clustering pattern refers to the percentage of task-relevant data
tuples for which the pattern is true.
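As a small hypothetical illustration of these measures (the membership-by-radius test below is an assumption, not a definition from the thesis), the same fraction can be read as the confidence of a clustering rule or, over the task-relevant tuples, as its support:

class MeasureSketch {
    // Fraction of the task-relevant tuples that fall inside a discovered cluster,
    // taking a tuple as "inside" when it lies within a given radius of the center.
    static double confidence(double[][] tuples, double[] center, double radius) {
        int matching = 0;
        for (double[] t : tuples) {
            if (distance(t, center) <= radius) matching++;
        }
        return (double) matching / tuples.length;   // tuples containing the cluster / total tuples
    }

    static double distance(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }
}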
Novelty: Novel patterns are those that contribute new information or increased
performance to the given pattern set. For example, a data exception may be
considered novel in that it differs from the data expected based on a statistical
model or user beliefs. Another strategy for detecting novelty is to remove
redundant patterns. If a discovered rule can be implied by another rule that is
already in the knowledge base or in the derived rule set, then either rule should be
reexamined in order to remove the potential redundancy.
3.7 Presentations and Visualization of Discovered Patterns
For data mining to be effective, data mining systems should be able to display the
discovered patterns in multiple forms, such as rules, tables, cross tabulations, pie
or bar charts, decision trees, cubes, or other visual representations. Allowing the
visualization of discovered patterns in various forms can help users with different
backgrounds to identify patterns of interest and to interact or guide the system in
further discovery. A user should be able to specify the forms of presentation to
be used for displaying the discovered patterns.
The use of concept hierarchies plays an important role in aiding the user to
visualize the discovered patterns. Mining with concept hierarchies allows the
representation of discovered knowledge in high-level concepts, which may be
more understandable to users than rules expressed in terms of primitive or raw
data, such as functional or multivalued dependency rules, or integrity constraints.
Furthermore, data mining systems should employ concept hierarchies to
implement drill-down and roll-up operations, so that users may inspect
discovered patterns at multiple levels of abstraction. In addition, pivoting, slicing
and dicing operations aid the user in viewing generalized data and knowledge
from different perspectives. A data mining system should provide such
interactive operations for any dimension, as well as for individual values of each
dimension.
3.8 Implementation Tools and software
For a successful data mining implementation, a variety of tools were used to aid
the thesis investigation. Java was used as the programming language to implement
the k-means algorithm. The datasets were residing on the Oracle 9i database. MS
Access was used as the database for a small test dataset, to test the Java program.
Open Database Connectivity (ODBC) data sources were employed to access data
from a variety of database management systems (Oracle). JDBC was utilized to
access the database from the Java programming language. SQL was used to
communicate with the database. Toad was used for data extraction and for
executing queries. WEKA and SPSS were applied for cross-verifying the results
obtained by the clustering program. Chartist-Pro was employed to generate
flowcharts. Matlab was used to create a 3-dimensional graph of the output.
Windows 2000 Server and Sun Solaris 5.8 were used as the operating systems.
Experiments were finally run on a standard PC with an x86-based processor.
Chapter 4
RESULTS AND ANALYSES
This chapter discusses the results of the indexed k-means and k-means clustering
techniques that were applied to the fisheries’ database. The results were plotted
in a graph. The patterns were studied and analyzed.
4.1 Experimental Results:
4.1.1 Output for Indexed k-means:
The indexed k-means clustering technique was applied to the fisheries database,
and the outputs were tabulated and plotted in 3-dimensional scatter plots as
shown below. The output had more than 3 dimensions; since it is not possible to
show a graphical output for more than 3 dimensions, the outputs were plotted as
a combination of different graphs, each plot giving information according to the
user's requirements.
4.1.2 Output for k-means:
K-means algorithm clustering techniques were used on the fisheries’ database
without indexing. The outputs were tabulated as shown below.
Fish ID   Lat Deg   Long Deg   Market Values   Landed KG   Month
147       43.0833   69.75      223.565         79.536      9
120       41.9167   67.5833    432.724         174.549     10
96        43.25     70.25      49.234          28.895      9
124       42.25     70.25      64.289          27.025      6
12        43.4167   70.0833    90.674          30.761      12
81        43.4167   68.5833    234.942         125.793     1
153       41.75     69.75      281.423         137.360     3
512       42.5833   70.4167    71.812          50.767      5
81        42.75     70.25      107.275         62.107      10
81        43.25     68.75      288.038         154.691     7
81        42.25     70.25      155.564         117.831     6
123       42.25     70.25      91.283          44.449      5
122       43.25     70.0833    394.811         99.386      3
124       43.25     69.25      396.156         140.946     11
147       41.5833   67.0833    484.324         179.784     1
153       42.4167   70.25      100.355         49.554      1
12        42.25     70.25      59.297          23.191      1
123       41.75     70.25      184.124         81.466      6
269       43.4167   69.9167    151.868         91.582      11
120       41.75     69.75      105.3           38.068      2
122       43.75     68.25      233.755         75.082      10
124       43.75     69.75      163.612         69.004      1
120       42.25     70.25      46.595          19.190      3
269       41.75     69.75      108.031         60.321      4
122       43.25     70.25      203.809         70.699      1
4.2 Analyses:
The indexed k-means algorithm was implemented on the fisheries department
dataset, and the following results were observed, arranged by month: the fishes
found, the yield points, the most probable area, the most probable catch, and the
most profitable catch.
January
Fishes: 81, 123, 120, 12 (Atlantic cod, Yellowtail Flounder, Winter Flounder, Angler)
Yield points: (42.25, 70.25), (43.25, 70.25)
Area: 41.5833 to 43.4167 & 66.9167 to 70.25
Most probable: 120 - Winter Flounder at (42.25, 70.25) with a market value of 60.75 and landed weight of 26.25 kg
Profitability: 122 - Witch Flounder at (43.0833, 68.0833) with a market value to landed weight ratio of 791.27/159.43

February
Fishes: 124, 81, 12, 120 (Plaice Flounder, Atlantic cod, Angler, Winter Flounder)
Yield points: (43.25, 70.25), (42.25, 70.25)
Area: 41.5833 to 43.25 & 67.0833 to 70.25
Most probable: 124 - Plaice Flounder at (43.25, 70.25) with a market value of 69.25 and landed weight of 24.33 kg
Profitability: 122 - Witch Flounder at (42.5833, 70.0833) with a market value to landed weight ratio of 484.8/110.43

March
Fishes: 120, 81, 512, 122, 124 (Winter Flounder, Atlantic cod, Wolfish, Witch Flounder, Plaice Flounder)
Yield points: (43.25, 70.25), (42.25, 70.25)
Area: 41.75 to 43.75 & 67.25 to 70.75
Most probable: 512 - Wolfish at (43.25, 70.25) with a market value of 21.63 and landed weight of around 10.36 kg
Profitability: 122 - Witch Flounder at (43.75, 67.75) with a market value to landed weight ratio of 575/122.5

April
Fishes: 122, 81, 124, 120, 269 (Witch Flounder, Atlantic cod, Flounder, Winter Flounder, Pollock)
Yield points: (43.25, 70.25), (42.25, 70.25)
Area: 41.75 to 44.25 & 67.75 to 70.25
Most probable: 81 - Atlantic cod at (42.25, 70.25) with a market value of 37.36 and landed weight of around 29.42 kg
Profitability: 122 - Witch Flounder at (43.25, 70.25) with a market value to landed weight ratio of 168.07/50.93

May
Fishes: 122, 81, 120 (Witch Flounder, Atlantic cod, Winter Flounder)
Yield points: (43.25, 70.25), (42.25, 70.25)
Area: 41.58 to 43.25 & 69.25 to 70.75
Most probable: 120 - Winter Flounder at (42.25, 70.25) with a market value of 17.21 and landed weight of 9.47 kg
Profitability: 12 - Angler at (42.25, 70.25) with a market value to landed weight ratio of 46.6/16.5

June
Fishes: 81, 122, 123 (Atlantic cod, Witch Flounder, Yellowtail Flounder)
Yield points: (43.25, 70.25), (41.75, 69.75), (42.25, 70.25)
Area: 41.75 to 43.75 & 67.0833 to 70.25
Most probable: 81 - Atlantic cod at (41.75, 69.75) with a market value of 57.8 and landed weight of around 38.78 kg
Profitability: 122 - Witch Flounder at (43.25, 70.25) with a market value to landed weight ratio of 308.05/84.7

July
Fishes: 120, 122, 81 (Winter Flounder, Witch Flounder, Atlantic cod)
Yield points: (43.25, 70.25), (41.75, 69.75), (42.25, 70.25)
Area: 41.75 to 43.25 & 67.25 to 70.75
Most probable: 122 - Witch Flounder at (43.25, 70.25) with a market value of 135.27 and landed weight of around 47.22 kg
Profitability: 122 - Witch Flounder at (43.0833, 69.75) with a market value to landed weight ratio of 149.7/35.56

August
Fishes: 81, 122, 124 (Atlantic cod, Witch Flounder, Flounder)
Yield points: (41.75, 69.75), (42.25, 70.25)
Area: 41.5833 to 44.25 & 68.25 to 70.4167
Most probable: 122 - Witch Flounder at (42.25, 70.25) with a market value of 32.78 and landed weight of around 10.65 kg
Profitability: 122 - Witch Flounder at (43.25, 69.4167) with a market value to landed weight ratio of 591.44/158.72

September
Fishes: 81, 122, 124 (Atlantic cod, Witch Flounder, Flounder)
Yield points: (43.25, 70.25), (42.25, 70.25)
Area: 41.5833 to 43.75 & 68.4167 to 70.75
Most probable: 512 - Wolfish at (41.75, 69.75) with a market value of 81.54 and landed weight of around 38.14 kg
Profitability: 12 - Angler at (43.25, 70.25) with a market value to landed weight ratio of 141.61/32.3

October
Fishes: 12, 81, 96, 269 (Angler, Atlantic cod, Cusk, Pollock)
Yield points: (42.25, 69.75), (43.25, 70.25), (42.25, 70.25)
Area: 41.75 to 43.75 & 68.25 to 70.75
Most probable: 12 - Angler at (42.25, 70.25) with a market value of 45.2 and landed weight of around 16.5 kg
Profitability: 122 - Witch Flounder at (42.25, 68.25) with a market value to landed weight ratio of 424.24/117.43

November
Fishes: 12, 81, 120 (Angler, Atlantic cod, Winter Flounder)
Yield points: (43.25, 70.25), (42.25, 69.75), (41.75, 70.25), (42.25, 70.25)
Area: 41.5833 to 43.25 & 66.75 to 70.4167
Most probable: 81 - Atlantic cod at (42.25, 70.25) and (43.25, 70.25) with market values of 25.6 and 115.6 and landed weights of 14.8 kg and 60.6 kg
Profitability: 122 - Witch Flounder at (43.4167, 70.0833) with a market value to landed weight ratio of 77.84/21

December
Fishes: 81, 120, 123, 124 (Atlantic cod, Winter Flounder, Yellowtail Flounder, Flounder)
Yield points: (43.25, 70.25), (41.75, 70.25), (42.25, 70.25)
Area: 41.75 to 43.4167 & 67.25 to 70.25
Most probable: 81 - Atlantic cod at (41.75, 70.25) with a market value of 103.16 and landed weight of around 50.82 kg
Profitability: 122 - Witch Flounder at (42.75, 69.75) with a market value to landed weight ratio of 155/37
4.2.1 Interpreting the patterns found:
What do these patterns mean? They mean that in a given month, at a particular
point or points, a particular fish is likely to be found; the profitability of the catch
can also be inferred from the results. The indexed k-means method proves to be
very informative for making decisions on concentrated clusters, as well as for
bringing out inconspicuous patterns that are not discerned easily by other
clustering techniques, and it gives an all-encompassing picture of the dataset. For
each month, the latitude and longitude coordinates provide an indication of what
kind of fish is available during that month of the year. More specifically, the
month of January has very few records in the dataset; even so, it is possible to see
which types of fish can be found at which points. This would have been left out
by the simple k-means method, which depends on the available quantities for
each month and hence may not form a cluster for that month at all. The result
helps in deciding when and where to fish. For example, for the month of January,
the Atlantic cod, Yellowtail Flounder, Winter Flounder and Angler will mostly be
found at the points (42.25, 70.25) and (43.25, 70.25), and the most probable area
to fish will be the area covered by the latitude degrees 41.5833 to 43.4167 and the
longitude degrees 66.9167 to 70.25. We also found that Winter Flounder is the
most probable fish to be found at (42.25, 70.25), with a market value of 60.75 and
a landed weight of 26.25. One can also infer that the most profitable fish will be
the Witch Flounder found at (43.0833, 68.0833), with a market value to weight
ratio of 791.27/159.43.
The month of the year as well as the latitude and longitude degrees are physically
static and are known by the user. The other variables and the relations shown in
the results tell the user when and where to fish.
4.2.2 Testing and Limitations:
The confidence level of the results can be discovered by issuing a simple query
against the whole dataset and finding the probability of occurrence. The thesis
found that on average 15% of the archived records matched the results found,
which gives the support and the confidence of the patterns. Sometimes the
knowledge gained by the algorithm had no matching records in the archive,
which would seem to mean that the knowledge gained for that instance is of no
consequence; after careful analysis, however, the knowledge gained in such an
instance turns out to be true deep knowledge. For example, in the month of
January the most profitable fish is the Witch Flounder found at (43.0833,
68.0833). When tested against the archived records, the analysis revealed that this
result holds good for the years 1985 and 1989, yet it could not be matched with
the records for the years 1983, 1984, 1986, 1987 or 1988. After careful analysis,
the conclusion reached was that no fishing was done at that point in that month
during those years; without this knowledge, a profitable opportunity would have
been missed.
The limitation of the results is that the knowledge may depend on more
parameters than were modeled. Physical parameters might affect the results, such
as the lack of fishing at a spot in a particular year, the type of vessels used for
fishing, fishing technology, environmental effects such as El Niño, or man-made
interferences; these limit the performance of the knowledge gained.
4.2.3 Comparison between k-means and indexed k-means:
Computation is reduced for indexed k-means because the dataset is broken into
smaller datasets for the knowledge discovery process without loss of information,
whereas knowledge is lost by the plain k-means method: as in the case stated
above, a column could not be isolated according to the needs of the application.
When k-means was asked what knowledge is available in the given dataset for the
month of August, no answer was available; the indexed k-means method, however,
was able to answer the question. Both methods require the user to specify the
initial number of clusters, and both are unsuitable for discovering clusters with
nonconvex shapes or clusters of very different sizes. Indexed k-means is less
sensitive to noise and outlier data points than k-means, because erroneous data
may be eliminated, as it is a more focused study than regular k-means.
4.3 Discussion
The nearest-neighbor instance-based learning is simple and often works very well.
Instance-based learning is time consuming for a dataset of realistic size because
the entire training data must be scanned to classify each test instance.
Another problem with instance-based methods is that the database can easily
become corrupted by noisy exemplars. One solution is to adopt the k-nearest
neighbor strategy. However, computation time inevitably increases. Another way
of proofing the database against noise is to choose the exemplars that are added
to it selectively and judiciously.
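A minimal sketch of the k-nearest-neighbor strategy mentioned above (assumed data structures; not code from this thesis): the k closest training instances vote, so a single noisy exemplar is less likely to decide the class.

import java.util.*;

class KnnSketch {
    // Classify a query instance by majority vote among its k nearest training instances.
    static int classify(double[][] train, int[] labels, double[] query, int k) {
        Integer[] order = new Integer[train.length];
        for (int i = 0; i < train.length; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble(i -> distance(train[i], query)));
        Map<Integer, Integer> votes = new HashMap<>();
        for (int i = 0; i < k && i < train.length; i++) {
            votes.merge(labels[order[i]], 1, Integer::sum);   // one vote per neighbor
        }
        int best = -1, bestVotes = -1;
        for (Map.Entry<Integer, Integer> e : votes.entrySet()) {
            if (e.getValue() > bestVotes) { bestVotes = e.getValue(); best = e.getKey(); }
        }
        return best;
    }

    static double distance(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }
}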
The nearest-neighbor method originated many decades ago; statisticians analyzed
k-nearest-neighbor schemes in the early 1950s. If the number of training instances
is large, it makes intuitive sense to use more than a single nearest neighbor, but
clearly this is dangerous if there are not many instances. It can be shown that
when k and the number n of instances both become infinite in such a way that
k/n → 0, the probability of error approaches the theoretical minimum for the
dataset. The nearest-neighbor method was adopted as a classification scheme in
the early 1960s and has been widely used in the field of pattern recognition for
over three decades.
Chapter 5
FUTURE WORK
There have been many data mining systems developed in recent years. This trend
of research and development on data mining is expected to flourish because the
huge amounts of data that have been collected in databases and the necessity of
understanding and making good use of such data in decision making have served
as the driving forces in data mining.
The diversity of data, data mining tasks, and data mining approaches poses many
challenging research issues on data mining. The design of data mining languages,
the development of efficient and effective data mining methods and systems, the
construction of interactive and integrated data mining environments, and the
application of data mining techniques to solving large application problems are
important tasks for data mining researchers and for data mining system and
application developers.
Moreover, with the fast computerization of society, the social impact of data
mining should not be under-estimated. When a large amount of interrelated data
is effectively analyzed from a different perspective, it can pose threats to the goal
of protecting data security and guarding against the invasion of privacy. It is a
challenging task to develop effective techniques for preventing the disclosure of
sensitive information in data mining, especially as the use of data mining systems
is rapidly increasing in domains ranging from business analysis, customer analysis,
to medicine and government.
5.1 Conclusion:
A very simple and elegant method for conducting more informative clustering has
been discussed in this research. In addition, a descriptive method of performing
the indexed k-means algorithm has been provided, and its advantages over other
clustering techniques have been discussed. A lot of research remains in the area
of automating the indexing in a manner that is less time-consuming.
5.2 Future Work
The diversity of data, data mining tasks, and data mining approaches poses many
challenging research issues in data mining. The design and development of
efficient data mining methods and systems and the construction of interactive
and integrated data mining techniques to solve large application problems are
important tasks for data mining researchers.
Web mining will become one of the most important and flourishing subfields of
data mining. The future of data mining lies in application exploration, where it
spreads to different application areas, resulting in the development of more
application-specific data mining systems. Scalable data mining methods will be
needed to handle huge amounts of data efficiently and interactively, and data
mining will be further integrated with database, data warehouse and Web
database systems. Standardization of a data mining language will facilitate the
systematic development of data mining solutions, and the study and development
of visual data mining will facilitate the adoption of data mining as a tool for data
analysis. Methods to mine complex types of data, such as multimedia and
time-series data, where research is aimed towards the integration of data mining
methods with existing data analysis techniques, are also required. Above all, data
mining should ensure privacy protection and information security while
facilitating proper information access and mining.
5.3 Research Directions
In this thesis, a lot of research remains in the area of automating the indexing of
the clusters so as to reduce the time involved in the process. The indexed
approach can be extended to other data mining clustering techniques: the same
principle can be used in different algorithms, such as the EM algorithm or
k-modes, to arrive at more localized results than the regular clustering methods.
It can also be used in application-specific data mining systems. For example,
medical expert systems give an opinion about what kind of sickness a person has;
by using this indexed method, we can come to closer solutions. When a patient
goes to a dentist for tooth pain, the doctor examines him and inspects the visible
symptoms to check for cavities; this becomes a static variable. But there can be
many types of cavities, so the doctor may ask more questions leading to a general
idea of the reason for the pain, assuming that the doctor does not have a new
x-ray at the initial consultation. In machine terms, instead of clustering the whole
dataset and finding a variable that explains the reason behind the pain, the index
field (the static variable) can be given, and the clusters found within that field
(i.e., tooth pain).
Appendix A
SOURCE CODE
Source Code for k-means:
import java.sql.*;
import java.io.*;
import java.util.*;
import java.lang.Math;
class fish1{
/* initializing static variables*/
static int k=0; static double min; static double a[][] = new double[122130][5];
static double clus[][] = new double[24][5]; // arrays sized for up to 24 clusters (k = 24 in main)
static double group[][]= new double[24][];
static double comp[][] = new double[24][];
static double b[][] = new double[12050][5];
static double c[][] = new double[12050][];
static double cl[][] = new double[12050][];
static double dis[] = new double[24];
static double count[] = new double[24];
static String newval[] = new String[24];
static String old[] = new String[24];
static Vector vec=null;
/* Method to get data values from database. */
public static Vector getdb()
{
Connection conn = null;
ResultSet rset= null;
Vector v= null;
ResultSetMetaData rsetmd;
String a[]= new String[5];
int nCols=0;
int cc=0;
try{
Class.forName ("sun.jdbc.odbc.JdbcOdbcDriver");
System.out.println ("I am inside class");
conn =DriverManager.getConnection("jdbc:odbc:testoracle","g_str","ora4st");
System.out.println ("connection created");
Statement stmt = conn.createStatement();
String query = "select fish_id, lat_deg, long_deg, mkt_values, landed_kg from
thesis_data"; // for indexing we use where statement.
rset = stmt.executeQuery(query);
rsetmd = rset.getMetaData();
v = new Vector();
while (rset.next ())
{
a[0]= rset.getString(1);
a[1]= rset.getString(2);
a[2]= rset.getString(3);
a[3]= rset.getString(4);
a[4]= rset.getString(5);
String x = a[0]+","+a[1]+","+a[2]+","+a[3]+","+a[4];
System.out.println("rows values"+x);
v.add(x);
cc++;
}//while
rset.close();
stmt.close();
conn.close();
}//try
catch(Exception e){e.printStackTrace();}
System.out.println("rows count"+ cc);
return v;
}//closing method getdb.
/*/method called from main method which call getdb to retrieval data values
and stores in double array b */
public static double[][] dataval()
{
int i=0;
int d;
String s[] = new String[12050];
vec =fish1.getdb();
for(d=0;d<vec.size();d++)
{ s[d]=(String)vec.get(d);}
for(int a=0;a<vec.size();a++)
{
int t=0;
StringTokenizer st = new StringTokenizer((String)vec.get(a),",");
while (st.hasMoreTokens()){
b[a][t] = Double.parseDouble(st.nextToken());
t++;
}//while
}//for
return b;
}
/* method called from main to initialize clusters.*/
public static double[][] getInitClus(double c[][],int i)
{
for(int s=0;s<i;s++)
{
cl[s] = c[s];
}
return cl;
}
/* main method */
public static void main (String args[]){
int z;
// k's value represents & should be changed for the no. of clusters.
int k=24;
int x=k;
System.out.println("inside main");
//calling dataval method to init data values to array a.
a=fish1.dataval();
//calling getinit to initialize clusters.
clus=fish1.getInitClus(a,k);
int nrows = vec.size();
do{
//assigning old values to new values
System.out.println(" i am an old value");
for(int r=0;r<k;r++)
{
comp[r]=clus[r];
old[r]=comp[r][0]+","+comp[r][1]+","+comp[r][2]+","+ comp[r][3] +","+comp[r][4];
System.out.println(old[r]);
}
/*initialize count with k ;used calculate the mean of the
grouped values. use 5 */
for (int f=0;f<k;f++){ count[f]=1.0;}
/* while condition loop to group the minimum distance values */
while( x < nrows)
{
// calculating distance use k 5;
for(int w=0;w<k;w++)
{
dis[w]=Math.sqrt(Math.pow((clus[w][0] - a[x][0]),2.0)
+ Math.pow( (clus[w][1] - a[x][1]),2.0)
+ Math.pow( (clus[w][2] - a[x][2]),2.0)
+ Math.pow( (clus[w][3] - a[x][3]),2.0)
+ Math.pow( (clus[w][4] - a[x][4]),2.0) );
}//for
// getting minimum distance value
double w = dis[0];
for (int i=1;i<k;i++) //use k 6
{ if (w<dis[i]) min = w ;
else min=dis[i];
w = min;
}//for
/*grouping adding cluster values with other related values
based on minimum distance values*/
z=0;
for(int c=0;c<k;c++)//use k 7
{ if(dis[c] == w)
{
count[c]++;
z=c;
group[c]=clus[c];//use k 8
}//if
}//for
group[z][0]=(group[z][0] + a[x][0]);
group[z][1]=(group[z][1] + a[x][1]);
group[z][2]=(group[z][2] + a[x][2]);
group[z][3]=(group[z][3] + a[x][3]);
group[z][4]=(group[z][4] + a[x][4]);
// incrementing for the next value to be added
x++;
}//while
// System.out.println(" i am after old[r]");
/*Getting mean and comparing the old values and new values in string */
for(int q=0;q<k;q++)// use k 11
{
for(int y=0;y<5;y++)
{
clus[q][y]=(group[q][y]/count[q]);
}
newval[q]=clus[q][0]+","+clus[q][1]+","+clus[q][2]+","+clus[q][3]+","+clus[q][4];
// System.out.println(" new values"+ newval[q]);
}//for
//string comparing
System.out.println("checking while at end of the grouping");
System .out.println("no, not yet.........");
x=k;//use k 12
}while( !(old[0].equals(newval[0])) & !(old[1].equals(newval[1])) &
!(old[2].equals(newval[2])) & !(old[3].equals(newval[3])) &
!(old[4].equals(newval[4])) & !(old[5].equals(newval[5])) &
!(old[6].equals(newval[6])) & !(old[7].equals(newval[7])) &
!(old[8].equals(newval[8])) & !(old[9].equals(newval[9])) &
!(old[10].equals(newval[10])) & !(old[11].equals(newval[11])) &
!(old[12].equals(newval[12])) & !(old[13].equals(newval[13])) &
!(old[14].equals(newval[14])) & !(old[15].equals(newval[15])) &
!(old[16].equals(newval[16])) & !(old[17].equals(newval[17])) &
!(old[18].equals(newval[18])) & !(old[19].equals(newval[19])) &
!(old[20].equals(newval[20])) & !(old[21].equals(newval[21])) &
!(old[22].equals(newval[22])) & !(old[23].equals(newval[23])) );
System.out.println("checking while at end of the grouping");
System.out.println("yep..i am done buddy ,here is your best values");
//printing the final values of clusters
for (int i=0;i<k;i++){ System.out.println(newval[i]); }//for
}//end of main method
}//class
References:
[1] Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. From Data Mining to
Knowledge Discovery: An Overview. In Advances in Knowledge Discovery and Data
Mining, G. Piatetsky-Shapiro and J.Frawley, editors, AAAI Press, Menlo Park, CA,
1996.
[2] Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. The KDD Process for
Extracting Useful Knowledge from Volumes of Data. In Communications of the
ACM – Data Mining and Knowledge Discovery in Databases, pages 27-34, 1996.
[3] Web surpasses one billion documents.
http://www.inktomi.com/new/press/billion.html.
[4] Cooley, R., Mobasher, B., and Srivastava, J. Web mining: Information and
patterns discovery on the World Wide Web. In Proceedings of the ninth IEEE
International Conference on Tools with Artificial Intelligence, pages 558–567, Newport
Beach, CA, 1997.
[5] Craven, M., DiPasquo, D., Freitag, D., McCallum, A., Mitchell, T., Nigam, K.,
and
Slattery, S. Learning to extract symbolic knowledge from the World Wide Web.
In Proceedings of the fifteenth National Conference on Artificial Intelligence, pages 509–516,
Madison, WI, 1998.
[6] Chakrabarti, S., Dom, B. E., Gibson, D., Kleinberg, J., Kumar, R., Raghavan,
P., Rajagopalan, S., and Tomkins, A. S. Mining the link structure of the world
wide web. IEEE Computer, 32(8): 60–67, 1999.
[7] Borges, J. and Levene, M. Data mining of user navigation patterns. In Masand,
B. and Spliliopoulou, M., editors, Web Usage Mining, To appear in Lecture Notes
in Artificial Intelligence (LNAI 1836). Springer Verlag, Berlin, 2000.
[1] Pieter Adriaans and Dolf Zantinge. Data Mining. Addison-Wesley, 1996.
[2] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques. Morgan
Kaufmann, 2000.
[3] S. Chaudhuri and U. Dayal. An Overview of Data Warehousing and OLAP
Technology. ACM SIGMOD Record, 26(1):65-74, 1997.
[4] J. Han. "Data Mining", in J. Urban and P. Dasgupta (eds.), Encyclopedia of
Distributed Computing, Kluwer Academic Publishers, 1999.
[5] M. S. Chen, J. Han, and P. S. Yu. Data Mining: An Overview from a Database
Perspective. IEEE Transactions on Knowledge and Data Engineering, 8(6):866-883, 1996.
[6] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (eds.).
Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
[7] Alex Berson, Stephen Smith, and Kurt Thearling. Building Data Mining
Applications for CRM.
[8] Michael J. A. Berry and Gordon Linoff. Data Mining Techniques for Marketing, Sales,
and Customer Support, p. 415. John Wiley & Sons, Inc., 1997. ISBN 0-471-17980-9.
[9] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. From Data Mining to Knowledge
Discovery: An Overview. In Proceedings of ACM KDD, 1994.