Advances in Computer Vision and Pattern Recognition
T. Ravindra Babu
M. Narasimha Murty
S.V. Subrahmanya
Compression Schemes for Mining Large Datasets
A Machine Learning Perspective

Advances in Computer Vision and Pattern Recognition

For further volumes:
www.springer.com/series/4205

T. Ravindra Babu · M. Narasimha Murty · S.V. Subrahmanya

Compression Schemes for Mining Large Datasets
A Machine Learning Perspective
T. Ravindra Babu
Infosys Technologies Ltd.
Bangalore, India
S.V. Subrahmanya
Infosys Technologies Ltd.
Bangalore, India
M. Narasimha Murty
Indian Institute of Science
Bangalore, India
Series Editors
Prof. Sameer Singh
Rail Vision Europe Ltd.
Castle Donington
Leicestershire, UK
Dr. Sing Bing Kang
Interactive Visual Media Group
Microsoft Research
Redmond, WA, USA
ISSN 2191-6586
ISSN 2191-6594 (electronic)
Advances in Computer Vision and Pattern Recognition
ISBN 978-1-4471-5606-2
ISBN 978-1-4471-5607-9 (eBook)
DOI 10.1007/978-1-4471-5607-9
Springer London Heidelberg New York Dordrecht
Library of Congress Control Number: 2013954523
© Springer-Verlag London 2013
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection
with reviews or scholarly analysis or material supplied specifically for the purpose of being entered
and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of
this publication or parts thereof is permitted only under the provisions of the Copyright Law of the
Publisher’s location, in its current version, and permission for use must always be obtained from Springer.
Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations
are liable to prosecution under the respective Copyright Law.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any
errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect
to the material contained herein.
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Preface
We come across a number of celebrated textbooks on Data Mining covering multiple aspects of the topic since its early development, such as those on databases, pattern recognition, soft computing, etc. We did not find any consolidated work on data mining in the compression domain. The book took shape from this realization. Our work relates to this area of data mining with a focus on compaction. We present
schemes that work in compression domain and demonstrate their working on one or
more practical datasets in each case. In this process, we cover important data mining
paradigms. This is intended to provide a practitioners’ view point of compression
schemes in data mining. The work presented is based on the authors’ work on related
areas over the last few years. We organized each chapter to contain context setting,
background work as part of discussion, proposed algorithm and scheme, implementation intricacies, experimentation by implementing the scheme on a large dataset,
and discussion of results. At the end of each chapter, as part of bibliographic notes,
we discuss relevant literature and directions for further study.
Data Mining focuses on efficient algorithms to generate abstraction from large
datasets. The objective of these algorithms is to find interesting patterns for further
use with the least number of visits to the entire dataset, the ideal being a single visit. Similarly, since the data sizes are large, effort is made in arriving at a much smaller subset of the original dataset that is representative of the entire data and contains attributes characterizing the data. The ability to generate an abstraction from a small
representative set of patterns and features that is as accurate as the one obtained with the entire dataset leads to efficiency in terms of both space and time. Important
data mining paradigms include clustering, classification, association rule mining,
etc. We present a discussion on data mining paradigms in Chap. 2.
In our present work, in addition to data mining paradigms discussed in Chap. 2,
we also focus on another paradigm, viz., the ability to generate abstraction in the
compressed domain without having to decompress. Such a compression would lead
to less storage and a reduced computation cost. In the book, we consider both
lossy and nonlossy compression schemes. In Chap. 3, we present a nonlossy compression scheme based on run-length encoding of patterns with binary-valued features. The scheme is also applicable to floating-point-valued features that are suitably quantized to binary values. The chapter presents an algorithm that computes
the dissimilarity in the compressed domain directly. Theoretical notes are provided
for the work. We present applications of the scheme in multiple domains.
It is interesting to explore whether, when one is prepared to lose some part of the pattern representation, we obtain better generalization and compaction. We examine
this aspect in Chap. 4. The work in the chapter exploits the concept of minimum
feature or item-support. The concept of support relates to the conventional association rule framework. We consider patterns as sequences, form subsequences of short
length, and identify and eliminate repeating subsequences. We represent the pattern
by those unique subsequences leading to significant compaction. Such unique subsequences are further reduced by replacing less frequent unique subsequences by more
frequent subsequences, thereby achieving further compaction. We demonstrate the
working of the scheme on large handwritten digit data.
Pattern clustering can be construed as compaction of data. Feature selection also
reduces dimensionality, thereby resulting in pattern compression. It is interesting to
explore whether they can be simultaneously achieved. We examine this in Chap. 5.
We consider an efficient clustering scheme that requires a single database visit to
generate prototypes. We consider a lossy compression scheme for feature reduction. We also examine whether there is a preferred order of prototype selection and feature selection in achieving compaction, as well as good classification accuracy on unseen patterns. We examine multiple combinations of such sequencing.
We demonstrate working of the scheme on handwritten digit data and intrusion detection data.
Domain knowledge forms an important input for efficient compaction. Such
knowledge could either be provided by a human expert or generated through an
appropriate preliminary statistical analysis. In Chap. 6, we exploit domain knowledge obtained both by expert inference and through statistical analysis and classify
a 10-class dataset through a proposed decision tree of depth 4. We make use of 2-class classifiers, AdaBoost and Support Vector Machine, to demonstrate the working of such a scheme.
Dimensionality reduction leads to compaction. With algorithms such as run-length encoded compression, it is instructive to study whether one can achieve efficiency in obtaining an optimal feature set that provides high classification accuracy.
In Chap. 7, we discuss concepts and methods of feature selection and extraction.
We propose an efficient implementation of simple genetic algorithms by integrating
compressed data classification and frequent features. We provide insightful discussion on the sensitivity of various genetic operators and frequent-item support on the
final selection of optimal feature set.
Divide-and-conquer has been one important direction to deal with large datasets.
With the reducing cost and increasing ability to collect and store enormous amounts of data, we have massive databases at our disposal from which to make sense of the data and generate abstractions of potential business value. The term Big Data has become synonymous with streaming multisource data such as numerical data, messages, and audio and video data. There is an increasing need to process such data in real or near-real time and generate business value in the process. In Chap. 8,
we propose schemes that exploit multiagent systems to solve these problems. We
discuss concepts of big data, MapReduce, PageRank, agents, and multiagent systems before proposing multiagent systems to solve big data problems.
The authors would like to express their sincere gratitude to their respective families for their cooperation.
T. Ravindra Babu and S.V. Subrahmanya are grateful to Infosys Limited for providing an excellent research environment in the Education and Research Unit (E&R)
that enabled them to carry out academic and applied research resulting in articles
and books.
T. Ravindra Babu would like to express his sincere thanks to his family members Padma, Ramya, Kishore, and Rahul for their encouragement and support. He dedicates his contribution to the work to the fond memory of his parents Butchiramaiah and Ramasitamma. M. Narasimha Murty would like to acknowledge the support of his parents. S.V. Subrahmanya would like to thank his wife D.R. Sudha for her patient support. The authors would like to record their sincere appreciation of the Springer team, Wayne Wheeler and Simon Rees, for their support and encouragement.
Bangalore, India
T. Ravindra Babu
M. Narasimha Murty
S.V. Subrahmanya
Contents

1  Introduction
   1.1  Data Mining and Data Compression
        1.1.1  Data Mining Tasks
        1.1.2  Data Compression
        1.1.3  Compression Using Data Mining Tasks
   1.2  Organization
        1.2.1  Data Mining Tasks
        1.2.2  Abstraction in Nonlossy Compression Domain
        1.2.3  Lossy Compression Scheme and Dimensionality Reduction
        1.2.4  Compaction Through Simultaneous Prototype and Feature Selection
        1.2.5  Use of Domain Knowledge in Data Compaction
        1.2.6  Compression Through Dimensionality Reduction
        1.2.7  Big Data, Multiagent Systems, and Abstraction
   1.3  Summary
   1.4  Bibliographical Notes
   References

2  Data Mining Paradigms
   2.1  Introduction
   2.2  Clustering
        2.2.1  Clustering Algorithms
        2.2.2  Single-Link Algorithm
        2.2.3  k-Means Algorithm
   2.3  Classification
   2.4  Association Rule Mining
        2.4.1  Frequent Itemsets
        2.4.2  Association Rules
   2.5  Mining Large Datasets
        2.5.1  Possible Solutions
        2.5.2  Clustering
        2.5.3  Classification
        2.5.4  Frequent Itemset Mining
   2.6  Summary
   2.7  Bibliographic Notes
   References

3  Run-Length-Encoded Compression Scheme
   3.1  Introduction
   3.2  Compression Domain for Large Datasets
   3.3  Run-Length-Encoded Compression Scheme
        3.3.1  Discussion on Relevant Terms
        3.3.2  Important Properties and Algorithm
   3.4  Experimental Results
        3.4.1  Application to Handwritten Digit Data
        3.4.2  Application to Genetic Algorithms
        3.4.3  Some Applicable Scenarios in Data Mining
   3.5  Invariance of VC Dimension in the Original and the Compressed Forms
   3.6  Minimum Description Length
   3.7  Summary
   3.8  Bibliographic Notes
   References

4  Dimensionality Reduction by Subsequence Pruning
   4.1  Introduction
   4.2  Lossy Data Compression for Clustering and Classification
   4.3  Background and Terminology
   4.4  Preliminary Data Analysis
        4.4.1  Huffman Coding and Lossy Compression
        4.4.2  Analysis of Subsequences and Their Frequency in a Class
   4.5  Proposed Scheme
        4.5.1  Initialization
        4.5.2  Frequent Item Generation
        4.5.3  Generation of Coded Training Data
        4.5.4  Subsequence Identification and Frequency Computation
        4.5.5  Pruning of Subsequences
        4.5.6  Generation of Encoded Test Data
        4.5.7  Classification Using Dissimilarity Based on Rough Set Concept
        4.5.8  Classification Using k-Nearest Neighbor Classifier
   4.6  Implementation of the Proposed Scheme
        4.6.1  Choice of Parameters
        4.6.2  Frequent Items and Subsequences
        4.6.3  Compressed Data and Pruning of Subsequences
        4.6.4  Generation of Compressed Training and Test Data
   4.7  Experimental Results
   4.8  Summary
   4.9  Bibliographic Notes
   References

5  Data Compaction Through Simultaneous Selection of Prototypes and Features
   5.1  Introduction
   5.2  Prototype Selection, Feature Selection, and Data Compaction
        5.2.1  Data Compression Through Prototype and Feature Selection
   5.3  Background Material
        5.3.1  Computation of Frequent Features
        5.3.2  Distinct Subsequences
        5.3.3  Impact of Support on Distinct Subsequences
        5.3.4  Computation of Leaders
        5.3.5  Classification of Validation Data
   5.4  Preliminary Analysis
   5.5  Proposed Approaches
        5.5.1  Patterns with Frequent Items Only
        5.5.2  Cluster Representatives Only
        5.5.3  Frequent Items Followed by Clustering
        5.5.4  Clustering Followed by Frequent Items
   5.6  Implementation and Experimentation
        5.6.1  Handwritten Digit Data
        5.6.2  Intrusion Detection Data
        5.6.3  Simultaneous Selection of Patterns and Features
   5.7  Summary
   5.8  Bibliographic Notes
   References

6  Domain Knowledge-Based Compaction
   6.1  Introduction
   6.2  Multicategory Classification
   6.3  Support Vector Machine (SVM)
   6.4  Adaptive Boosting
        6.4.1  Adaptive Boosting on Prototypes for Data Mining Applications
   6.5  Decision Trees
   6.6  Preliminary Analysis Leading to Domain Knowledge
        6.6.1  Analytical View
        6.6.2  Numerical Analysis
        6.6.3  Confusion Matrix
   6.7  Proposed Method
        6.7.1  Knowledge-Based (KB) Tree
   6.8  Experimentation and Results
        6.8.1  Experiments Using SVM
        6.8.2  Experiments Using AdaBoost
        6.8.3  Results with AdaBoost on Benchmark Data
   6.9  Summary
   6.10 Bibliographic Notes
   References

7  Optimal Dimensionality Reduction
   7.1  Introduction
   7.2  Feature Selection
        7.2.1  Based on Feature Ranking
        7.2.2  Ranking Features
   7.3  Feature Extraction
        7.3.1  Performance
   7.4  Efficient Approaches to Large-Scale Feature Selection Using Genetic Algorithms
        7.4.1  An Overview of Genetic Algorithms
        7.4.2  Proposed Schemes
        7.4.3  Preliminary Analysis
        7.4.4  Experimental Results
        7.4.5  Summary
   7.5  Bibliographical Notes
   References

8  Big Data Abstraction Through Multiagent Systems
   8.1  Introduction
   8.2  Big Data
   8.3  Conventional Massive Data Systems
        8.3.1  Map-Reduce
        8.3.2  PageRank
   8.4  Big Data and Data Mining
   8.5  Multiagent Systems
        8.5.1  Agent Mining Interaction
        8.5.2  Big Data Analytics
   8.6  Proposed Multiagent Systems
        8.6.1  Multiagent System for Data Reduction
        8.6.2  Multiagent System for Attribute Reduction
        8.6.3  Multiagent System for Heterogeneous Data Access
        8.6.4  Multiagent System for Agile Processing
   8.7  Summary
   8.8  Bibliographic Notes
   References

Appendix  Intrusion Detection Dataset—Binary Representation
   A.1  Data Description and Preliminary Analysis
   A.2  Bibliographic Notes
   References

Glossary

Index
Acronyms

AdaBoost      Adaptive Boosting
BIRCH         Balanced Iterative Reducing and Clustering using Hierarchies
CA            Classification Accuracy
CART          Classification and Regression Trees
CF            Clustering Feature
CLARA         CLustering LARge Applications
CLARANS       Clustering Large Applications based on RANdomized Search
CNF           Conjunctive Normal Form
CNN           Condensed Nearest Neighbor
CS            Compression Scheme
DFS           Distributed File System
DNF           Disjunctive Normal Form
DTC           Decision Tree Classifier
EDW           Enterprise Data Warehouse
ERM           Expected Risk Minimization
FPTree        Frequent Pattern Tree
FS            Fisher Score
GA            Genetic Algorithm
GFS           Google File System
HDFS          Hadoop Distributed File System
HW            Handwritten
KB            Knowledge-Based
KDD           Knowledge Discovery from Databases
kNNC          k-Nearest-Neighbor Classifier
MAD Analysis  Magnetic, Agile, and Deep Analysis
MDL           Minimum Description Length
MI            Mutual Information
ML            Machine Learning
NNC           Nearest-Neighbor Classifier
NMF           Nonnegative Matrix Factorization
PAM           Partition Around Medoids
PCA           Principal Component Analysis
PCF           Pure Conjunctive Form
RLE           Run-Length Encoded
RP            Random Projections
SA            Simulated Annealing
SBS           Sequential Backward Selection
SBFS          Sequential Backward Floating Selection
SFFS          Sequential Forward Floating Selection
SFS           Sequential Forward Selection
SGA           Simple Genetic Algorithm
SSGA          Steady State Genetic Algorithm
SVM           Support Vector Machine
TS            Taboo Search
VC            Vapnik–Chervonenkis
Chapter 1
Introduction
In this book, we deal with data mining and compression; specifically, we deal with
using several data mining tasks directly on the compressed data.
1.1 Data Mining and Data Compression
Data mining is concerned with generating an abstraction of the input dataset using
a mining task.
1.1.1 Data Mining Tasks
Important data mining tasks are:
1. Clustering. Clustering is the process of grouping data points so that points in
each group or cluster are more similar to each other than to points belonging to different clusters. Each resulting cluster is abstracted using one or more
representative patterns. So, clustering is some kind of compression where details of the data are ignored and only cluster representatives are used in further
processing or decision making.
2. Classification. In classification a labeled training dataset is used to learn a model
or classifier. This learnt model is used to label a test (unlabeled) pattern; this
process is called classification.
3. Dimensionality Reduction. A majority of the classification and clustering algorithms fail to produce expected results in dealing with high-dimensional datasets.
Also, computational requirements in the form of time and space can increase
enormously with dimensionality. This prompts reduction of the dimensionality
of the dataset; it is reduced either by using feature selection or feature extraction.
In feature selection, an appropriate subset of features is selected, and in feature
extraction, a subset in some transformed space is selected.
4. Regression or Function Prediction. Here a functional form for variable y is learnt
(where y = f (X)) from given pairs (X, y); the learnt function is used to predict
the values of y for new values of X. This problem may be viewed as a generalization of the classification problem. In classification, the number of class labels
is finite, whereas in the regression setting, y can take infinitely many values, typically,
y ∈ R.
5. Association Rule Mining. Even though it is of relatively recent origin, it is the task that launched the field and is responsible for bringing visibility
to the area of data mining. In association rule mining, we are interested in finding
out how frequently two subsets of items are associated.
1.1.2 Data Compression
Another important topic in this book is data compression. A compression scheme
CS may be viewed as a function from the set of patterns X to a set of compressed patterns X′. It may be viewed as

CS : X ⇒ X′.

Specifically, CS(x) = x′ for x ∈ X and x′ ∈ X′. In a more general setting, we may view CS as giving output x′ using x and some knowledge structure or a dictionary K. So, CS(x, K) = x′ for x ∈ X and x′ ∈ X′. Sometimes, a dictionary
is used in compressing and uncompressing the data. Schemes for compressing data
are the following:
• Lossless Schemes. These schemes are such that CS(x) = x′ and there is an inverse CS−1 such that CS−1(x′) = x. For example, consider a binary string 00001111 (x) as an input; the corresponding run-length-coded string is 44 (x′), where the first 4 corresponds to a run of 4 zeros, and the second 4 corresponds to a run of 4 ones. Also, from the run-length-coded string 44 we can get back the input string 00001111. Note that such a representation is lossless as we get x′ from x using run-length encoding and x from x′ using decoding. (A small code sketch of this encode–decode pair follows this list.)
• Lossy Schemes. In a lossy compression scheme, it is not possible in general to get back the original data point x from the compressed pattern x′. Pattern recognition and data mining are areas in which there are plenty of examples where lossy compression schemes are used.
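As a concrete illustration of the lossless case, the sketch below run-length encodes a binary string and decodes it back. It is a minimal Python sketch consistent with the 00001111 → 44 example above (run lengths of alternating symbols, zeros first); the function names are ours, not the book's.

```python
def rle_encode(bits):
    """Run-length encode a binary string as a list of run lengths.

    Runs alternate starting with zeros; if the string starts with '1',
    the first run length is 0.  '00001111' -> [4, 4].
    """
    runs = []
    current, count = '0', 0
    for b in bits:
        if b == current:
            count += 1
        else:
            runs.append(count)
            current, count = b, 1
    runs.append(count)
    return runs


def rle_decode(runs):
    """Invert rle_encode: [4, 4] -> '00001111'."""
    out, bit = [], '0'
    for r in runs:
        out.append(bit * r)
        bit = '1' if bit == '0' else '0'
    return ''.join(out)


assert rle_encode('00001111') == [4, 4]
assert rle_decode(rle_encode('00001111')) == '00001111'
```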
We show some example compression schemes in Fig. 1.1.
1.1.3 Compression Using Data Mining Tasks
Among the lossy compression schemes, we consider the data mining tasks. Each of them may be viewed as a compression scheme, as follows:
• Association rule mining deals with generating frequently co-occurring items/patterns from the given data. It ignores the infrequent items. Rules of association are
Fig. 1.1 Compression schemes
generated from the frequent itemsets. So, association rules in general cannot be
used to obtain the original input data points provided.
• Clustering is lossy because the output of clustering is a collection of cluster representatives. From the cluster representatives we cannot get back the original data
points. For example, in K-means clustering, each cluster is represented by the
centroid of the data points in it; it is not possible to get back the original data
points from the centroids.
• Classification is lossy as the models learnt from the training data cannot be used
to reproduce the input data points. For example, in the case of Support Vector
Machines, a subset of the training patterns called support vectors are used to get
the classifier; it is not possible to generate the input data points from the support
vectors.
• Dimensionality reduction schemes can ignore some of the input features. So, they
are lossy because it is not possible to get the training patterns back from the
dimensionality-reduced ones.
So, each of the mining tasks is lossy in terms of its output obtained from the given
data. In addition, in this book, we deal with data mining tasks working on compressed data, not the original data. We consider data compression schemes that
could be either lossy or nonlossy. Some of the nonlossy data compression schemes
are also shown in Fig. 1.1. These include run-length coding, Huffman coding, and
the zip utility used by the operating systems.
1.2 Organization
Material in this book is organized as follows.
1.2.1 Data Mining Tasks
We briefly discuss some data mining tasks. We provide a detailed discussion in
Chap. 2.
The data mining tasks considered are the following.
• Clustering. Clustering algorithms generate either a hard or soft partition of the
input dataset. Hard clustering algorithms are either partitional or hierarchical.
Partitional algorithms generate a single partition of the dataset. The number of all
possible partitions of a set of n points into K clusters can be shown to be equal to
$$\frac{1}{K!}\sum_{i=1}^{K}(-1)^{K-i}\binom{K}{i}\, i^{n}.$$

(A short numerical check of this count appears after this list of tasks.)
So, exhaustive enumeration of all possible partitions of a dataset could be prohibitively expensive. For example, even for a small dataset of 19 patterns to be
partitioned into four clusters, we may have to consider around 11,259,666,000
partitions. In order to reduce the computational load, each of the clustering algorithms restricts these possibilities by selecting an appropriate subset of the set
of all possible K-partitions. In Chap. 2, we consider two partitional algorithms
for clustering. One of them is the K-means algorithm, which is the most popular clustering algorithm; the other is the leader clustering algorithm, which is the
simplest possible algorithm for partitional clustering.
A hierarchical clustering algorithm generates a hierarchy of partitions; partitions at different levels of the hierarchy are of different sizes. We describe the
single-link algorithm, which has been classically used in a variety of areas including numerical taxonomy. Another hierarchical algorithm discussed is BIRCH,
which is a very efficient hierarchical algorithm. Both leader and BIRCH are efficient as they need to scan the dataset only once to generate the clusters.
• Classification. We describe two classifiers in Chap. 2. Nearest-neighbor classifier
is the simplest classifier in terms of learning. In fact, it does not learn a model;
it employs all the training data points to label a test pattern. Even though it has
no training time requirement, it can take a long time for labeling a test pattern
if the training dataset is large in size. Its performance deteriorates as the dimensionality of the data points increases; also, it is sensitive to noise in the training
data. A popular variant is the K-nearest-neighbor classifier (KNNC), which labels a test pattern based on labels of K nearest neighbors of the test pattern. Even
though KNNC is robust to noise, it can fail to perform well in high-dimensional
spaces. Also, it takes a longer time to classify a test pattern.
Another efficient and state-of-the-art classifier is based on Support Vector Machines (SVMs) and is popularly used in two-class problems. An SVM learns a
subset of the set of training patterns, called the set of support vectors. These correspond to patterns falling on two parallel hyperplanes; these planes, called the
support planes, are separated by a maximum margin. One can design the classifier using the support vectors. The decision boundary separating patterns from
the two classes is located between the two support planes, one per each class. It is
commonly used in high-dimensional spaces, and it classifies a test pattern using
a single dot product computation.
• Association rule mining. A popular scheme for finding frequent itemsets and association rules based on them is Apriori. This was the first association rule mining
algorithm; perhaps, it is responsible for the emergence of the area of data mining
itself. Even though it originated in market-basket analysis, it can also be used in
other pattern classification and clustering applications. We use it in the classification of hand-written digits in the book. We describe the Apriori algorithm in
Chap. 2.
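The partition-count formula quoted under the clustering task above can be checked numerically. The sketch below evaluates the sum for n = 19 and K = 4; the helper name is ours, and the result agrees with the roughly 11.26 billion partitions mentioned above.

```python
from math import comb, factorial

def num_partitions(n, k):
    """Number of ways to partition n points into k nonempty clusters
    (the Stirling number of the second kind), via the formula quoted above."""
    return sum((-1) ** (k - i) * comb(k, i) * i ** n
               for i in range(1, k + 1)) // factorial(k)

print(num_partitions(19, 4))   # about 1.126e10, in line with the count quoted above
```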
Naturally, in data mining, we need to analyze large-scale datasets; in Chap. 2, we
discuss three different schemes for dealing with large datasets. These include:
1. Incremental Mining. Here, we use the abstraction A_K and the (K+1)th point X_{K+1} to generate the abstraction A_{K+1}, where A_K is the abstraction generated after examining the first K points. It is useful in stream data mining; in big data analytics, it deals with the velocity in the three-V model. (A minimal sketch of such an incremental update follows this list.)
2. Divide-and-Conquer Approach: It is a popular scheme used in designing efficient
algorithms. Also, the popular and state-of-the-art Map-Reduce scheme is based
on this strategy. It is associated with dealing volume requirements in the three-V
model.
3. Mining based on an intermediate representation: Here an abstraction is learnt
based on accessing the dataset once or twice; this abstraction is an intermediate
representation. Once an intermediate representation is available, the mining is
performed on this abstraction rather than on the dataset, which reduces the computational burden. This scheme also is associated with the volume feature of the
three-V model.
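As indicated in item 1 above, an abstraction can often be updated from a new point alone. A minimal sketch, assuming the abstraction A_K is simply the centroid of the first K points; no earlier point is revisited when X_{K+1} arrives.

```python
def update_centroid(centroid, k, new_point):
    """Incrementally update the mean of the first k points when the
    (k+1)-th point arrives; no earlier point is re-read."""
    return [(k * c + x) / (k + 1) for c, x in zip(centroid, new_point)]

# Stream the points one at a time.
stream = [(1.0, 1.0), (2.0, 2.0), (1.0, 2.0)]
centroid, k = list(stream[0]), 1
for point in stream[1:]:
    centroid = update_centroid(centroid, k, point)
    k += 1
print(centroid)   # [1.33..., 1.66...]
```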
1.2.2 Abstraction in Nonlossy Compression Domain
In Chap. 3, we provide a nonlossy compression scheme and ability to cluster and
classify data in the compressed domain without having to uncompress.
The scheme employs run-length coding of binary patterns. So, it is useful in dealing with either binary input patterns or even numerical vectors that could be viewed
as binary sequences. Specifically, it considers handwritten digits that could be represented as binary patterns and compresses the strings using run-length coding. Now
the compressed patterns are input to a KNNC for classification. It requires a definition of the distance d between a pair of run-length-coded strings to use the KNNC
on the compressed data.
It is shown that the distance d(x, y) between two binary strings x and y and the modified distance d′(x′, y′) between the corresponding run-length-coded (compressed) strings x′ and y′ are equal; that is, d(x, y) = d′(x′, y′). It is shown that the
KNNC using the modified distance on the compressed strings reduces the space and
time requirements by a factor of more than 3 compared to the application of KNNC
on the given original (uncompressed) data.
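The precise distance d′ is defined in Chap. 3. As a hedged illustration of computing dissimilarity without decompressing, the sketch below computes the Hamming distance between two equal-length binary strings directly from their run-length encodings (the alternating-runs representation from the earlier sketch is assumed).

```python
def hamming_from_runs(runs_x, runs_y):
    """Hamming distance between two equal-length binary strings given
    only their run-length encodings (alternating runs, zeros first)."""
    i = j = 0                              # index of the current run in each string
    rem_x, rem_y = runs_x[0], runs_y[0]    # bits left in the current runs
    bit_x, bit_y = 0, 0                    # value of the current runs
    dist = 0
    while i < len(runs_x) and j < len(runs_y):
        step = min(rem_x, rem_y)           # span over which both bits stay constant
        if bit_x != bit_y:
            dist += step
        rem_x -= step
        rem_y -= step
        if rem_x == 0:                     # advance to the next run of x
            i += 1
            if i < len(runs_x):
                rem_x, bit_x = runs_x[i], 1 - bit_x
        if rem_y == 0:                     # advance to the next run of y
            j += 1
            if j < len(runs_y):
                rem_y, bit_y = runs_y[j], 1 - bit_y
    return dist

# '00001111' vs '00111100' -> encodings [4, 4] and [2, 4, 2]; they differ in 4 bits.
print(hamming_from_runs([4, 4], [2, 4, 2]))   # 4
```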
Such a scheme can be used in a number of applications that involve dissimilarity
computation in patterns with binary-valued features. It should be noted that even
real-valued features can be quantized into binary-valued features by specifying appropriate range and scale factors. Our earlier experience of such conversion on the intrusion detection dataset is that it does not affect the accuracy. In this chapter, we
provide an application of the scheme in classification of handwritten digit data and
compare the improvement obtained in size as well as computation time. The second application is related to efficient implementation of genetic algorithms. Genetic algorithms
are robust methods to obtain near-optimal solutions. The compression scheme can
be gainfully employed in situations where the evaluation function in Genetic Algorithms is the classification accuracy of the nearest-neighbor classifier (NNC). NNC
involves computation of dissimilarity a number of times depending on the size of
training data or prototype pattern set as well as test data size. The method can be
used for optimal prototype and feature selection. We discuss an indicative example.
The Vapnik–Chervonenkis (VC) dimension characterizes the complexity of
a class of classifiers. It is important to control the VC dimension to improve the performance of a classifier. Here, we show that the VC dimension is not affected by
using the classifier on compressed data.
1.2.3 Lossy Compression Scheme and Dimensionality Reduction
We propose a lossy compression scheme in Chap. 4. Such compressed data can
be used in both clustering and classification. The proposed scheme compresses the
given data by using frequent items and then considering distinct subsequences. Once
the training data is compressed using this scheme, it is also required to appropriately
deal with test data; it is possible that some of the subsequences present in the test
data are absent in the training data summary. One of the successful schemes employed to deal with this issue is based on replacing a subsequence in the test data by
its nearest neighbor in the training data.
The pruning and transformation scheme employed in achieving compression reduces the dataset size significantly. However, the classification accuracy improves
because of the possible generalization resulting due to compressed representation.
It is possible to integrate rough set theory to put a threshold on the dissimilarity
between a test pattern and a training pattern represented in the compressed form. If
the distance is below a threshold, then the test pattern is assumed to be in the lower
approximation (proper core region) of the class of the training data; otherwise, it is
placed in the upper approximation (possible reject region).
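A minimal sketch of this thresholding idea, with a placeholder distance function and threshold rather than the ones used in Chap. 4: the test pattern gets the label of its nearest training pattern when that pattern is close enough, and is otherwise marked for the possible-reject region.

```python
def classify_with_threshold(test, train, labels, dist, threshold):
    """Assign the label of the nearest training pattern if it lies within
    `threshold` (lower approximation); otherwise return None to mark the
    pattern for the possible-reject (upper approximation) region."""
    best = min(range(len(train)), key=lambda i: dist(test, train[i]))
    if dist(test, train[best]) <= threshold:
        return labels[best]
    return None

# Example with plain Hamming distance on binary tuples (illustrative values only).
hamming = lambda a, b: sum(x != y for x, y in zip(a, b))
print(classify_with_threshold((0, 1, 1), [(0, 1, 0), (1, 1, 1)],
                              ['A', 'B'], hamming, threshold=1))   # 'A'
```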
1.2.4 Compaction Through Simultaneous Prototype and Feature
Selection
Simultaneous selection of prototypical patterns and features is considered in
Chap. 5. Here data compression is achieved by ignoring some of the rows and
columns in the data matrix; the rows correspond to patterns, and the columns are
features in the data matrix. Some of the important directions explored in this chapter
are:
• The impact of compression based on frequent items and subsequences on prototype selection.
• The representativeness of features selected using data obtained based on frequent
items with a high support value.
• The role of clustering and frequent item generation in lossy data compression and
how the classifier is affected by the representation; it is possible to use clustering
followed by frequent item set generation or frequent item set generation followed
by clustering. Both schemes are explored in evaluating the resulting simultaneous
prototype and feature selection. Here the leader clustering algorithm is used for
prototype selection and frequent itemset-based approaches are used for feature
selection.
1.2.5 Use of Domain Knowledge in Data Compaction
Domain knowledge-based compaction is provided in Chap. 6. We make use of domain knowledge of the data under consideration to design efficient pattern classification schemes. We design a domain knowledge-based decision tree of depth 4
that can classify 10-category data with high accuracy. The classification approaches
based on support vector machines and AdaBoost are used.
We carry out preliminary analysis on datasets and demonstrate deriving domain
knowledge from the data and from a human expert. In order that the classification
would be carried out on representative patterns and not on complete data, we make
use of the condensed nearest-neighbor approach and the leader clustering algorithm.
We demonstrate working of the proposed schemes on large datasets and public
domain machine learning datasets.
1.2.6 Compression Through Dimensionality Reduction
Optimal dimensionality reduction for lossy data compression is discussed in
Chap. 7. Here both feature selection and feature extraction schemes are described.
In feature selection, both sequential selection schemes and genetic algorithm (GA)
based schemes are discussed. In sequential selection, features are selected one after the other based on some ranking scheme; here each of the remaining features
is ranked based on their performance along with the already selected features using some validation data. These sequential schemes are greedy in nature and do
not guarantee globally optimal selection. It is possible to show that the GA-based
schemes are globally optimal under some conditions; however, most practical
implementations may not be able to exploit this global optimality.
Two popular schemes for feature selection are based on Fisher’s score and Mutual information (MI). Fisher’s score could be used to select features that can assume
continuous values, whereas the MI-based scheme is the most successful for selecting features that are discrete or categorical; it has been used in selecting features in
classification of documents where the given set of features is very large.
Another popular set of feature selection schemes employ performance of the
classifiers on selected feature subsets. Most popularly used classifiers in such feature
selection include the NNC, SVM, and Decision Tree classifier. Some of the popular
feature extraction schemes are:
• Principal Component Analysis (PCA). Here the extracted features are linear combinations of the given features. The signal processing community has successfully
used PCA-based compression in image and speech data reconstruction. It has
also been used by search engines for capturing semantic similarity between the
query and the documents.
• Nonnegative Matrix Factorization (NMF). Most of the data one typically uses are
nonnegative. In such cases, it is possible to use NMF to reduce the dimensionality.
This reduction in dimensionality is helpful in building effective classifiers to work
on the reduced-dimensional data even though the given data is high-dimensional.
• Random projections (RP). It is another scheme that extracts features that are linear
combinations of the given features; the weights used in the linear combinations
are random values here (a short sketch follows this list).
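Of the three extraction schemes listed above, random projection is the simplest to sketch: each extracted feature is a random linear combination of the original features. The dimensions and the Gaussian weights below are illustrative choices, not taken from the book.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 100, 200, 20              # patterns, original and reduced dimensionality
X = rng.random((n, d))              # stand-in for the data matrix

# Random projection: each extracted feature is a random linear
# combination of the original features.
R = rng.normal(0.0, 1.0 / np.sqrt(m), size=(d, m))
X_reduced = X @ R
print(X_reduced.shape)              # (100, 20)
```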
In this chapter, it is also shown how to exploit GAs in large-scale feature selection, and the proposed scheme is demonstrated using the handwritten digit data. A problem with a feature vector of about 200 features is considered for obtaining an optimal subset of features. The implementation integrates frequent features and genetic algorithms and brings out the sensitivity of genetic operators in achieving the optimal set. It is shown in practice how the choice of the probability of initialization of the population, which is not often discussed in the literature, impacts the size of the final set of features when the other control parameters remain the same.
1.2.7 Big Data, Multiagent Systems, and Abstraction
Chapter 8 contains ways to generate abstraction from massive datasets. Big data is
characterized by large volumes of heterogeneous types of datasets that need to be
processed to generate abstraction efficiently. Equivalently, big data is characterized
by three v’s, viz., volume, variety, and velocity. Occasionally, the importance of
value is articulated through another v. Big data analytics is multidisciplinary with a
host of topics such as machine learning, statistics, parallel processing, algorithms,
data visualization, etc. The contents include discussion on big data and related topics
such as conventional methods of analyzing big data, MapReduce, PageRank, agents,
and multiagent systems. A detailed discussion on agents and multiagent systems is
provided. Case studies for generating abstraction with big data using multiagent
systems are provided.
1.3 Summary
In this chapter, we have provided a brief introduction to data compression and mining compressed data. It is possible to use all the data mining tasks on the compressed
data directly. We have then described how the material is organized into chapters.
Most of the popular and state-of-the-art mining algorithms are covered in detail in
the subsequent chapters. Various schemes considered and proposed are applied on
two datasets, handwritten digit dataset and the network intrusion detection dataset.
Details of the intrusion detection dataset are provided in Appendix.
1.4 Bibliographical Notes
A detailed description of the bibliography is presented at the end of each chapter, and notes on the bibliography are provided in the respective chapters. This
book deals with data mining and data compression. There is no major effort so
far in dealing with the application of data mining algorithms directly on the compressed data. Some of the important books on compression are by Sayood (2000)
and Salomon et al. (2009). An early book on Data Mining was by Hand et al.
(2001). For a good introduction to data mining, a good source is the book by
Tan et al. (2005). A detailed description of various data mining tasks is given
by Han et al. (2011). The book by Witten et al. (2011) discusses various practical issues and shows how to use the Weka machine learning workbench developed by the authors. One of the recent books is by Rajaraman and Ullman
(2011).
Some of the important journals on data mining are:
1. IEEE Transactions on Knowledge and Data Engineering (IEEE TKDE).
2. ACM Transactions on Knowledge Discovery from Data (ACM TKDD).
3. Data Mining and Knowledge Discovery (DMKD).
Some of the important conferences on this topic are:
1.
2.
3.
4.
Knowledge Discovery and Data Mining (KDD).
International Conference on Data Engineering (ICDE).
IEEE International Conference on Data Mining (ICDM).
SIAM International Conference on Data Mining (SDM).
References
J. Han, M. Kamber, J. Pei, Data Mining: Concepts and Techniques, 3rd edn. (Morgan Kaufmann,
San Mateo, 2011)
D.J. Hand, H. Mannila, P. Smyth, Principles of Data Mining (MIT Press, Cambridge, 2001)
A. Rajaraman, J.D. Ullman, Mining Massive Datasets (Cambridge University Press, Cambridge,
2011)
D. Salomon, G. Motta, D. Bryant, Handbook of Data Compression (Springer, Berlin, 2009)
K. Sayood, Introduction to Data Compression, 2nd edn. (Morgan Kaufmann, San Mateo, 2000)
P.-N. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining (Pearson, Upper Saddle River,
2005)
I.H. Witten, E. Frank, M.A. Hall, Data Mining: Practical Machine Learning Tools and Techniques,
3rd edn. (Morgan Kaufmann, San Mateo, 2011)
Chapter 2
Data Mining Paradigms
2.1 Introduction
In data mining, the size of the dataset involved is large. It is convenient to visualize
such a dataset as a matrix of size n × d, where n is the number of data points, and d
is the number of features. Typically, it is possible that either n or d or both are large.
In mining such datasets, important issues are:
• The dataset cannot be accommodated in the main memory of the machine. So, we
need to store the data on a secondary storage medium like a disk and transfer the
data in parts to the main memory for processing; such an activity could be time-consuming. Because disk access can be more expensive compared to accessing
the data from the memory, the number of database scans is an important parameter. So, when we analyze data mining algorithms, it is important to consider the
number of database scans required.
• The dimensionality of the data can be very large. In such a case, several of the
conventional algorithms that use Euclidean-distance-like metrics to characterize proximity between a pair of patterns may not play a meaningful role in
such high-dimensional spaces where the data is sparsely distributed. So, different
techniques to deal with such high-dimensional datasets become important.
• Three important data mining tasks are:
1. Clustering. Here a collection of patterns is partitioned into two or more clusters. Typically, clusters of patterns are represented using cluster representatives; a centroid of the points in the cluster is one of the most popularly used
cluster representatives. Typically, a partition or a clustering is represented by
k representatives, where k is the number of clusters; such a process leads to
lossy data compression. Instead of dealing with all the n data points in the
collection, one can just use the k cluster representatives (where k ≪ n in the
data mining context) for further decision making.
2. Classification. In classification, a machine learning algorithm is used on a
given collection of training data to obtain an appropriate abstraction of the
dataset. Decision trees and probability distributions of points in various classes
Fig. 2.1 Clustering
are examples of such abstractions. These abstractions are used to classify a test
pattern.
3. Association Rule Mining. This activity has played a major role in giving a distinct status to the field of data mining itself. By convention, an association rule
is an implication of the form A → B, where A and B are two disjoint itemsets. It was initiated in the context of market-basket analysis to characterize
how frequently items in A are bought along with items in B. However, generically it is possible to view classification and clustering rules also as association
rules.
In order to run these tasks on large datasets, it is important to consider techniques
that could lead to scalable mining algorithms. Before we examine these techniques, we briefly consider some of the popular algorithms for carrying out these
data mining tasks.
2.2 Clustering
Clustering is the process of partitioning a set of patterns into cohesive groups or
clusters. Such a process is carried out so that intra-cluster patterns are similar and
inter-cluster patterns are dissimilar. This is illustrated using a set of two-dimensional
points shown in Fig. 2.1. There are three clusters in this figure, and patterns are
represented as two-dimensional points. The Euclidean distance between a pair of
points belonging to the same cluster is smaller than that between any two points
chosen from different clusters.
The Euclidean distance between two points X and Y in the p-dimensional space,
where x_i and y_i are the ith components of X and Y, respectively, is given by

$$d(X, Y) = \left( \sum_{i=1}^{p} (x_i - y_i)^2 \right)^{1/2}.$$
Fig. 2.2 Representing clusters
This notion characterizes similarity; the intra-cluster distance (similarity) is small
(high), and the inter-cluster distance (similarity) is large (low). There could be other
ways of characterizing similarity.
Clustering is useful in generating data abstraction. The process of data abstraction may be explained using Fig. 2.2. There are two dense clusters; the first has 22
points, and the second has 9 points. Further, there is a singleton cluster in the figure.
Here, a cluster of points is represented by its centroid or its leader. The centroid
stands for the sample mean of the points in the cluster, and it need not coincide
with any one of the input points as indicated in the figure. There is another point in
the figure, which is far off from any of the other points, and it belongs to the third
cluster. This could be an outlier. Typically, these outliers are ignored, and each of
the remaining clusters is represented by one or more points, called the cluster representatives, to achieve the abstraction. The most popular cluster representative is its
centroid.
Here, if each cluster is represented by its centroid, then there is a reduction in the
dataset size. One can use only the two centroids for further decision making. For
example, in order to classify a test pattern using the nearest-neighbor classifier, one
requires 32 distance computations if all the data points are used. However, using
the two centroids requires just two distance computations to compute the nearest
centroid of the test pattern. It is possible that classifiers using the cluster centroids
can be optimal under some conditions. The above discussion illustrates the role of
clustering in lossy data compression.
2.2.1 Clustering Algorithms
Typically, a grouping of patterns is meaningful when the within-group similarity
is high and the between-group similarity is low. This may be illustrated using the
groupings of the seven two-dimensional points shown in Fig. 2.3.
Algorithms for clustering can be broadly grouped into hierarchical and partitional categories. A hierarchical scheme forms a nested grouping of patterns,
Fig. 2.3 A clustering of the two-dimensional points
whereas a partitional algorithm generates a single partition of the set of patterns. This is illustrated using the two-dimensional data set consisting of points
labeled A : (1, 1)t , B : (2, 2)t , C : (1, 2)t , D : (6, 2)t , E : (6.9, 2)t , F : (6, 6)t , and
G : (6.8, 6)t in Fig. 2.3. This figure depicts seven patterns in three clusters. Typically, a partitional algorithm would produce the three clusters shown in Fig. 2.3.
A hierarchical algorithm would result in a dendrogram representing the nested
grouping of patterns and similarity levels at which groupings change.
In this section, we describe two popular clustering algorithms; one of them is
hierarchical, and the other is partitional.
2.2.2 Single-Link Algorithm
This algorithm is a bottom-up hierarchical algorithm starting with n singleton clusters when there are n data points to be clustered. It keeps on merging smaller clusters
to form bigger clusters based on the minimum distance between two clusters. The specific algorithm is described below, followed by a minimal code sketch.
Single-Link Algorithm
1. Input: n data points; Output: A dendrogram depicting the hierarchy.
2. Form the n × n proximity matrix by using the Euclidean distance between all
pairs of points. Assign each point to a separate cluster; this step results in n
singleton clusters.
3. Merge a pair of most similar clusters to form a bigger cluster. The distance between two clusters C_i and C_j to be merged is given by

   $$\mathrm{Distance}(C_i, C_j) = \min_{X \in C_i,\, Y \in C_j} d(X, Y).$$
4. Repeat step 3 till the partition of required size is obtained; a k-partition is obtained if the number of clusters k is given; otherwise, merging continues till a
single cluster of all the n points is obtained.
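A minimal sketch of the above steps on the seven points of Fig. 2.3, using a naive search over cluster pairs (adequate for small n; the point coordinates follow the text).

```python
from math import dist   # Euclidean distance, Python 3.8+

points = {'A': (1, 1), 'B': (2, 2), 'C': (1, 2), 'D': (6, 2),
          'E': (6.9, 2), 'F': (6, 6), 'G': (6.8, 6)}

def single_link(points, k):
    """Merge the two closest clusters (single-link distance) until k remain."""
    clusters = [{name} for name in points]          # n singleton clusters
    while len(clusters) > k:
        # Find the pair of clusters with the smallest single-link distance.
        i, j = min(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda p: min(dist(points[x], points[y])
                                     for x in clusters[p[0]]
                                     for y in clusters[p[1]]))
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters

print(single_link(points, 3))   # three clusters: {A, B, C}, {D, E}, {F, G}
```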
We illustrate the single-link algorithm using the data shown in Fig. 2.3. The proximity matrix showing the Euclidean distance between each pair of patterns is shown
Table 2.1 Distance matrix

      A     B     C     D     E     F     G
A    0.0   1.4   1.0   5.1   6.0   7.0   7.6
B    1.4   0.0   1.0   4.0   4.9   5.6   6.3
C    1.0   1.0   0.0   5.0   5.9   6.4   7.0
D    5.1   4.0   5.0   0.0   0.9   4.0   4.1
E    6.0   4.9   5.9   0.9   0.0   4.1   4.0
F    7.0   5.6   6.4   4.0   4.1   0.0   0.8
G    7.6   6.3   7.0   4.1   4.0   0.8   0.0
Fig. 2.4 The dendrogram obtained using the single-link algorithm
in Table 2.1. A dendrogram of the seven points in Fig. 2.3 (obtained from the single-link algorithm) is shown in Fig. 2.4. Note that there are seven leaves with each leaf
corresponding to a singleton cluster in the tree structure. The smallest distance between a pair of such clusters is 0.8, which leads to merging {F} and {G} to form
{F, G}. Next merger leads to {D, E} based on a distance of 0.9 units. This is followed by merging {B} and {C}, then {A} and {B, C} at a distance of 1 unit each.
At this point we have three clusters. By merging clusters further we get ultimately a
single cluster as shown in the figure. The dendrogram can be broken at different levels to yield different clusterings of the data. The partition of three clusters obtained
using the dendrogram is the same as the partition shown in Fig. 2.3. A major issue with the hierarchical algorithm is that computation and storage of the proximity
matrix requires O(n^2) time and space.
2.2.3 k-Means Algorithm
The k-means algorithm is the most popular clustering algorithm. It is a partitional
clustering algorithm and produces clusters by optimizing a criterion function. The
most acceptable criterion function is the squared-error criterion as it can be used
to generate compact clusters. The k-means algorithm is the most successfully used
squared-error clustering algorithm. The k-means algorithm is popular because it
is easy to implement and its time complexity is O(n), where n is the number of
patterns. We give a description of the k-means algorithm below.
Fig. 2.5 An optimal clustering of the points
k-Means Algorithm
1. Select k initial centroids. One possibility is to select k out of the n points randomly as the initial centroids. Each of them represents a cluster.
2. Assign each of the remaining n − k points to one of these k clusters; a pattern is
assigned to a cluster if the centroid of the cluster is the nearest, among all the k
centroids, to the pattern.
3. Update the centroids of the clusters based on the assignment of the patterns.
4. Assign each of the n patterns to the nearest cluster using the current set of centroids.
5. Repeat steps 3 and 4 till there is no change in the assignment of points in two
successive iterations.
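A minimal Python sketch of these five steps is given below; it assumes squared Euclidean distance and a user-supplied choice of initial centroids, and it is meant only to make the assignment and update steps concrete.

    # A minimal k-means sketch: alternate assignment and centroid update
    # until the assignment of points no longer changes.
    def k_means(points, initial_centroids):
        centroids = [list(c) for c in initial_centroids]
        assignment = None
        while True:
            new_assignment = [min(range(len(centroids)),
                                  key=lambda j: sum((a - b) ** 2
                                                    for a, b in zip(x, centroids[j])))
                              for x in points]
            if new_assignment == assignment:     # no change in two successive iterations
                return centroids, assignment
            assignment = new_assignment
            for j in range(len(centroids)):
                members = [x for x, c in zip(points, assignment) if c == j]
                if members:                      # recompute the centroid as the mean
                    centroids[j] = [sum(col) / len(members) for col in zip(*members)]

    # Example on four illustrative two-dimensional points with k = 2.
    pts = [(1.0, 1.0), (2.0, 2.0), (6.0, 2.0), (7.0, 2.0)]
    print(k_means(pts, initial_centroids=[pts[0], pts[2]]))

The squared-error value of Eq. (2.1) below can be computed from the returned assignment to compare different initializations.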
An important feature of this algorithm is that it is sensitive to the selection of the
initial centroids and may converge to a local minimum of the squared-error criterion function value if the initial partition is not properly chosen. The squared-error
criterion function is given by
$$\sum_{i=1}^{k} \sum_{X \in C_i} \|X - \text{centroid}_i\|^2. \qquad (2.1)$$
We illustrate the k-means algorithm using the dataset shown in Fig. 2.3. If we
consider A, D, and F as the initial centroids, then the resulting partition is shown in
Fig. 2.5. For this optimal partition, the centroids of the three clusters are:
• centroid1: (1.33, 1.66)t ; centroid2: (6.45, 2)t ; centroid3: (6.4, 2)t .
• The corresponding value of the squared error is around 2 units.
The popularity of the k-means algorithm may be attributed to its simplicity. It
requires O(n) time as it computes nk distances in each pass and the number of
passes may be assumed to be a constant. Also, the number of clusters k is a constant.
Further, it needs to store k centroids in the memory. So, the space requirement is also
small.
However, it is possible that the algorithm generates a nonoptimal partition by
choosing A, B, and C as the initial centroids as depicted in Fig. 2.6. In this case, the
three centroids are:
Fig. 2.6 A nonoptimal clustering of the two-dimensional points
• centroid1: (1, 1)t ; centroid2: (1.5, 2)t ; centroid3: (6.4, 4)t .
• The corresponding squared error value is around 17 units.
2.3 Classification
There are a variety of classifiers. Typically, a set of labeled patterns is used to
classify an unlabeled test pattern. Classification involves labeling a test pattern; in
the process, either the labeled training dataset is directly used, or an abstraction
or model learnt from the training dataset is used. Typically, classifiers learnt from
the training dataset are categorized as either generative or discriminative. The Bayes
classifier is a well-known generative model where a test pattern X is classified or assigned to class Ci , based on the a posteriori probabilities P (Cj /X) for j = 1, . . . , C
if P (Ci /X) ≥ P (Cj /X) for all j.
These posterior probabilities are obtained through the Bayes rule from the prior probabilities and the probability distributions of patterns in each of the classes. It is possible
to show that the Bayes classifier is optimal; it can minimize the average probability
of error. Support Vector Machine (SVM) is a popular discriminative classifier, and
it learns a weight vector W and a threshold b from the training patterns from two
classes. It assigns the test pattern X to class C1 (positive class) if W t X + b ≥ 0,
else it assigns X to class C2 (negative class).
The Nearest-Neighbor Classifier (NNC) is a simple and popular classifier;
it classifies the test pattern by using the training patterns directly. An important
property of the NNC is that its error rate is less than twice the error rate of the Bayes
classifier when the number of training patterns is asymptotically large. We briefly
describe the NNC, which employs the nearest-neighbor rule for classification.
Nearest-Neighbor Classifier (NNC)
Input: A training set X = {(X1 , C 1 ), (X2 , C 2 ), . . . , (Xn , C n )} and a test pattern X.
Note that Xi , i = 1, . . . , n, and X are some p-dimensional patterns. Further, C i ∈
{C1 , C2 , . . . , CC } where Ci is the ith class label.
Table 2.2 Data matrix

Pattern ID   feature1   feature2   feature3   feature4   Class label
X1           1.0        1.0        1.0        1.0        C1
X2           6.0        6.0        6.0        6.0        C2
X3           7.0        7.0        7.0        7.0        C2
X4           1.0        1.0        2.0        2.0        C1
X5           1.0        2.0        2.0        2.0        C1
X6           7.0        7.0        6.0        6.0        C2
X7           1.0        2.0        2.0        1.0        C1
X8           6.0        6.0        7.0        7.0        C2
Output: Class label for the test pattern X.
Decision: Assign X to class C i if d(X, Xi ) = minj d(X, Xj ).
We illustrate the NNC using the four-dimensional dataset shown in Table 2.2.
There are eight patterns, X1 , . . . , X8 , from two classes C1 and C2 , four patterns from
each class. The patterns are four-dimensional, and the dimensions are characterized
by feature1, feature2, feature3, and feature4, respectively. In addition to the four
features, there is an additional column that provides the class label of each pattern.
Let the test pattern X = (2.0, 2.0, 2.0, 2.0)t . The Euclidean distances between
X and each of the eight patterns are given by
d(X, X1 ) = 2.0;
d(X, X2 ) = 8.0;
d(X, X3 ) = 10.0;
d(X, X4 ) = 1.41;
d(X, X5 ) = 1.0;
d(X, X6 ) = 9.05;
d(X, X7 ) = 1.41;
d(X, X8 ) = 9.05.
So, the Nearest Neighbor (NN) of X is X5 because d(X, X5 ) is the smallest (it is
1.0) among all the eight distances. So, NN(X) = X5 , and the class label assigned
to X is the class label of X5 , which is C1 here, which means that X is assigned to
class C1 . Note that NNC requires eight distances to be calculated in this example. In
general, if there are n training patterns, then the number of distances to be calculated
to classify a test pattern is O(n).
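The calculation above can be reproduced with a few lines of code; the following sketch (illustrative only) classifies the test pattern using the training data of Table 2.2.

    import math

    # Training data of Table 2.2 as (pattern, class label) pairs.
    training = [
        ((1.0, 1.0, 1.0, 1.0), 'C1'), ((6.0, 6.0, 6.0, 6.0), 'C2'),
        ((7.0, 7.0, 7.0, 7.0), 'C2'), ((1.0, 1.0, 2.0, 2.0), 'C1'),
        ((1.0, 2.0, 2.0, 2.0), 'C1'), ((7.0, 7.0, 6.0, 6.0), 'C2'),
        ((1.0, 2.0, 2.0, 1.0), 'C1'), ((6.0, 6.0, 7.0, 7.0), 'C2'),
    ]

    def nnc(test, data):
        # Nearest-neighbor rule: return the label of the closest training pattern.
        nearest = min(data, key=lambda item: math.dist(test, item[0]))
        return nearest[1]

    X = (2.0, 2.0, 2.0, 2.0)
    print(nnc(X, training))   # prints C1, since X5 = (1, 2, 2, 2) is the nearest neighbor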
The nearest-neighbor classifier is popular because:
1. It is easy to understand and implement.
2. There is no learning or training phase; it uses the whole training data to classify
the test pattern.
3. Unlike the Bayes classifier, it does not require the probability structure of the
classes.
4. It shows good performance. If the optimal (Bayes) accuracy is 99.99 %, then with a large
training dataset, it can give at least 99.98 % accuracy.
Even though it is popular, there are some negative aspects. They include:
1. It is sensitive to noise; if the NN(X) is erroneously labeled, then X will be misclassified.
2. It needs to store the entire training data; further, it needs to compute the distances
between the test pattern and each of the training patterns. So, the computational
requirements can be large.
3. The distance between a pair of points may not be meaningful in high-dimensional
spaces. It is known that, as the dimensionality increases, the distance between a
point X and its nearest neighbor tends toward the distance between X and its
farthest neighbor. As a consequence, NNC may perform poorly in the context of
high-dimensional spaces.
Some of the possible solutions to the above problems are:
1. In order to tolerate noise, a modification to NNC is popularly used; it is called the
k-Nearest Neighbor Classifier (kNNC). Instead of deciding the class label of X
using the class label of the NN(X), X is labeled using the class labels of k nearest
neighbors of X. In the case of kNNC, the class label of X is the label of the class
that is the most frequent among the class labels of the k nearest neighbors. In
other words, X is assigned to the class to which majority of its k nearest neighbors belong; the value of k is to be fixed appropriately. In the example dataset
shown in Table 2.2, the three nearest neighbors of X = (2.0, 2.0, 2.0, 2.0)t are
X5 , X4 , and X7 . All the three neighbors are from class C1 ; so X is assigned to
class C1 .
2. NNC requires O(n) time to compute the n distances, and also it requires O(n)
space. It is possible to reduce the effort by compressing the training data. There
are several algorithms for performing this compression; we consider here a
scheme based on clustering. We cluster the n patterns into k clusters using the
k-means algorithm and use the k resulting centroids instead of the n training patterns. Labeling the centroids is done by using the majority class label in each
cluster.
By clustering the example dataset shown in Table 2.2 using the k-means algorithm, with a value of k = 2, we get the following clusters:
• Cluster1: {X1 , X4 , X5 , X7 } – Centroid: (1.0, 1.5, 1.75, 1.5)t
• Cluster2: {X2 , X3 , X6 , X8 } – Centroid: (6.5, 6.5, 6.5, 6.5)t
Note that Cluster1 contains four patterns from C1 and Cluster2 has the four patterns from C2 . So, by using these two representatives instead of the eight training
patterns, the number of distance computations and memory requirements will
reduce. Specifically, the centroid of Cluster1 is nearer to X than the centroid of
Cluster2. So, X is assigned to C1 using two distance computations (a sketch of this centroid-based scheme appears after this list).
3. In order to reduce the dimensionality, several feature selection/extraction techniques are used. We use a feature set partitioning scheme that we explain in detail
in the sequel.
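As an illustration of the second remedy above, the sketch below condenses the training data of Table 2.2 into two labeled centroids and classifies the test pattern using only those prototypes; the cluster memberships are taken from the text, and any k-means routine could be used to produce them.

    import math
    from collections import Counter

    # Clusters of the Table 2.2 patterns produced by k-means with k = 2
    # (memberships as given in the text; any clustering routine could produce them).
    clusters = [
        [((1.0, 1.0, 1.0, 1.0), 'C1'), ((1.0, 1.0, 2.0, 2.0), 'C1'),
         ((1.0, 2.0, 2.0, 2.0), 'C1'), ((1.0, 2.0, 2.0, 1.0), 'C1')],
        [((6.0, 6.0, 6.0, 6.0), 'C2'), ((7.0, 7.0, 7.0, 7.0), 'C2'),
         ((7.0, 7.0, 6.0, 6.0), 'C2'), ((6.0, 6.0, 7.0, 7.0), 'C2')],
    ]

    # Replace each cluster by its centroid, labeled with the majority class label.
    prototypes = []
    for cluster in clusters:
        points = [p for p, _ in cluster]
        centroid = tuple(sum(col) / len(points) for col in zip(*points))
        label = Counter(lbl for _, lbl in cluster).most_common(1)[0][0]
        prototypes.append((centroid, label))

    def classify(test):
        return min(prototypes, key=lambda pr: math.dist(test, pr[0]))[1]

    print(prototypes)
    print(classify((2.0, 2.0, 2.0, 2.0)))   # C1, using only two distance computations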
Another important classifier is based on Support Vector Machine. We consider it
next.
Support Vector Machine The support vector machine (SVM) is a very popular classifier. Some of the important properties of the SVM-based classification
are:
• The SVM classifier is a discriminative classifier. It can be used to discriminate
between two classes. Intrinsically, it supports binary classification.
• It obtains a linear discriminant function of the form W t X + b from the training
data. Here, W is called the weight vector of the same size as the data points, and
b is a scalar. Learning the SVM classifier amounts to obtaining the values of W
and b from the training data.
• It is ideally associated with a binary classification problem. Typically, one of them
is called the negative class, and the other is called the positive class.
• If X is from the positive class, then W t X + b > 0, and if X is from the negative
class, then W t X + b < 0.
• It finds the parameters W and b so that the margin between the two classes is
maximized.
• It identifies a subset of the training patterns, which are called support vectors.
These support vectors lie on parallel hyperplanes; negative and positive hyperplanes correspond respectively to the negative and positive classes. A point X on
the negative hyperplane satisfies W t X + b = −1, and similarly, a point X on the
positive hyperplane satisfies W t X + b = 1.
• The margin between the two support planes is maximized in the process of finding out W and b. In other words, the normal distance between the support planes W t X + b = −1 and W t X + b = 1 is maximized. This distance is 2/‖W‖. It is maximized using the constraints that every pattern X from the positive class satisfies W t X + b ≥ +1 and every pattern X from the negative class satisfies W t X + b ≤ −1. Instead of maximizing the margin, we minimize its inverse. This may be viewed as the constrained optimization problem

$$\min_{W}\ \|W\|^2 \quad \text{s.t.} \quad y_i \bigl(W^t X_i + b\bigr) \ge 1, \quad i = 1, 2, \ldots, n,$$

where yi = 1 if Xi is in the positive class and yi = −1 if Xi is in the negative class.
• The Lagrangian for the optimization problem is

$$L(W, b) = \frac{1}{2}\|W\|^2 - \sum_{i=1}^{n} \alpha_i \bigl[ y_i \bigl(W^t X_i + b\bigr) - 1 \bigr].$$

In order to minimize the Lagrangian, we take the derivative with respect to b and the gradient with respect to W, and equating them to 0, we get αi 's that satisfy αi ≥ 0 and

$$\sum_{i=1}^{q} \alpha_i y_i = 0,$$

where q is the number of support vectors, and W is given by

$$W = \sum_{i=1}^{q} \alpha_i y_i X_i.$$
• It is possible to view the decision boundary as W t X + b = 0 and W is orthogonal
to the decision boundary.
We illustrate the working of the SVM using an example in the two-dimensional
space. Let us consider two points, X1 = (2, 1)t from the negative class and X2 =
(6, 3)t from the positive class. We have the following:
• Using α1 y1 + α2 y2 = 0 and observing that y1 = −1 and y2 = 1, we get α1 = α2 .
So, we use α instead of α1 or α2 .
• As a consequence, W = −αX1 + αX2 = (4α, 2α)t .
• We know that W t X1 + b = −1 and W t X2 + b = 1; substituting the values of W ,
X1 , and X2 , we get
8α + 2α + b = −1,
24α + 6α + b = 1.
• By solving the above, we get 20α = 2, or α = 1/10, from which and from one of the above equations we get b = −2.
• From W = (4α, 2α)t and α = 1/10, we get W = (2/5, 1/5)t .
• In this simple example, we have started with two support vectors in the two-dimensional case. So, it was easy to solve for the αs. In general, there are efficient schemes for finding these values.
• If we consider a point X = (x1 , x2 )t on the line x2 = −2x1 + 5, for example, the point (1, 3)t , then W t (1, 3)t − 2 = −1 as W = (2/5, 1/5)t . This line is the support line for the points in the negative class. In a higher-dimensional space, it is a hyperplane.
• In a similar manner, any point on the parallel line x2 = −2x1 + 15, for example, (5, 5)t , satisfies the property that W t (5, 5)t − 2 = 1, and this parallel line is the support line for the positive class. Again, in a higher-dimensional space, it becomes a hyperplane parallel to the negative class plane.
• Note that the decision boundary is given by

$$\Bigl(\tfrac{2}{5}, \tfrac{1}{5}\Bigr) X - 2 = 0.$$

So, the decision boundary (2/5)x1 + (1/5)x2 − 2 = 0 lies exactly in the middle of the two support lines and is parallel to both. Note that (4, 2)t is located on the decision boundary.
• A point (7, 6)t is in the positive class as W t (7, 6)t − 2 = 2 > 0. Similarly,
W t (1, 1)t − 2 = −1.4 < 0; so, (1, 1)t is in the negative class.
• We have discussed what is known as the linear SVM. If the two classes are linearly separable, then the linear SVM is sufficient.
• If the classes are not linearly separable, then we map the points to a high-dimensional space with the hope of finding linear separability in the new space. Fortunately, one can implicitly make computations in the high-dimensional space
without having to work explicitly in it. It is possible by using a class of kernel
functions that characterize similarity between patterns.
• However, in large-scale applications involving high-dimensional data like in text
mining, linear SVMs are used by default for their simplicity in training.
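The arithmetic of the two-point example can be verified with a short script; the sketch below solves for α and b from the support-vector equations of the example and evaluates the resulting decision function. It mirrors the hand calculation only and is not a general SVM solver.

    # Two support vectors from the worked example: X1 (negative), X2 (positive).
    X1, X2 = (2.0, 1.0), (6.0, 3.0)

    # With alpha1 = alpha2 = alpha, W = alpha * (X2 - X1) = (4*alpha, 2*alpha).
    # Substituting W into W.X1 + b = -1 and W.X2 + b = 1 gives
    #   10*alpha + b = -1  and  30*alpha + b = 1, so 20*alpha = 2.
    alpha = 2.0 / 20.0
    W = (4 * alpha, 2 * alpha)                          # (2/5, 1/5)
    b = -1 - (W[0] * X1[0] + W[1] * X1[1])              # b = -2

    def decision(x):
        # The sign of W.x + b decides the class (>= 0 means the positive class).
        return W[0] * x[0] + W[1] * x[1] + b

    print(W, b)              # (0.4, 0.2) -2.0
    print(decision((7, 6)))  #  2.0 -> positive class
    print(decision((1, 1)))  # -1.4 -> negative class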
2.4 Association Rule Mining
This is an activity that is not a part of either pattern recognition or machine learning
conventionally. An association rule is an implication of the form A → B, where A
and B are disjoint itemsets; A is called the antecedent, and B is called the consequent. This activity became popular in the context of market-basket analysis, where one is concerned with the set of items available in a supermarket and the
transactions made by various customers. In such a context, an association rule
provides information about two sets of items that are frequently bought together;
this supports strategic decisions that may have a positive commercial impact, such as displaying the related items on appropriate shelves (avoiding congestion) or offering incentives to customers on some products/items.
Some of the features of the association rule mining activity are:
1. The rule A → B is not like the conventional implication used in a classical logic,
for example, the propositional logic. Here, the rule does not guarantee the purchase of items in B in the same transaction where items in A are bought; it
depicts a kind of frequent association between A and B in terms of buying patterns.
2. It is assumed that there is a global set of items I ; in the case of market-basket
analysis, I is the set of all items/product lines available for sale in a supermarket.
Note that A and B are disjoint subsets of I . So, if the cardinality of I is d, then
the number of all possible rules is O(3^d); this is because an item in I can be
a part of A or B or none of the two and there are d items. In order to reduce
the mining effort, only a subset of the rules that are based on frequently bought
items is examined.
3. Popularly, the quantity of an item bought is not used; it is important to consider
whether an item is bought in a transaction or not. For example, if a customer buys
1.2 kilograms of Sugar, 3 loaves of Bread, and a tin of Jam in the same transaction,
then the corresponding transaction is represented as {Sugar, Bread, Jam}. Such
a representation helps in viewing a transaction as a subset of I .
4. In order to mine useful rules, only rules of the form A → B, where A and B
are subsets of frequent itemsets, are explored. So, it is important to consider
algorithms for frequent itemset mining. Once all the frequent itemsets are mined,
it is required to obtain the corresponding association rules.
Table 2.3 Transaction data

Transaction   Itemset
t1            {a, c, d, e}
t2            {a, d, e}
t3            {b, d, e}
t4            {a, b, c}
t5            {a, b, c, d}
t6            {a, b, d}
t7            {a, d}
2.4.1 Frequent Itemsets
A transaction t is a subset of the set of items I . An itemset X is a subset of a
transaction t if all the items in X have been bought in t. If T is a set of transactions
where T = {t1 , t2 , . . . , tn }, then the support-set of X is given by
Support-set(X) = {ti | X is a subset of ti }.
The support of X is given by the cardinality of Support-set(X) or |Support-set(X)|.
An itemset X is a frequent itemset if Support(X) ≥ Minsup, where Minsup is a
user-provided threshold.
We explain the notion of frequent itemset using the transaction data shown in
Table 2.3.
Some of the itemsets with their supports corresponding to the data in Table 2.3
are:
• Support({a, b, c}) = 2; Support({a, d}) = 5;
• Support({b, d}) = 3; Support({a, c}) = 3.
If we use a Minsup value of 4, then the itemset {a, d} is frequent. Further, {a, b, c}
is not frequent; we call such itemsets infrequent. There is a systematic way of enumerating all the frequent itemsets; this is done by an algorithm called Apriori. This
algorithm enumerates a relevant subset of the itemsets for examining whether they
are frequent or not. It is based on the following observations.
1. Any subset of a frequent itemset is frequent. This is because if A and B are
two itemsets such that A is a subset of B, then Support(A) ≥ Support(B) because
Support-set(B) ⊆ Support-set(A). For example, knowing that itemset {a, d} is
frequent, we can infer that the itemsets {a} and {d} are frequent. Note that in
the data shown in Table 2.3, Support({a}) = 6 and Support({d}) = 6 and both
exceed the Minsup value.
2. Any superset of an infrequent itemset is infrequent. If A and B are two itemsets
such that A is a superset of B, then Support(A) ≤ Support(B). In the example,
{a, c} is infrequent; one of its supersets {a, c, d} is also infrequent. Note that
Support({a, c, d}) = 2 and it is less than the Minsup value.
Table 2.4 Printed characters of 1

0 0 1        1 0 0
0 0 1        1 0 0
0 0 1        1 0 0
2.4.1.1 Apriori Algorithm
The Apriori algorithm iterates over two steps to generate all the frequent itemsets
from a transaction dataset. Each iteration requires a database scan. These two steps
are as follows.
• Generating Candidate itemsets of size k. These itemsets are obtained by looking
at frequent itemsets of size k − 1.
• Generating Frequent itemsets of size k. This is achieved by scanning the transaction database once to check whether a candidate of size k is frequent or not.
It starts with the empty set (φ), which is frequent because the empty set is a subset
of every transaction. So, Support(φ) = |T |, where T is the set of transactions.
Note that φ is a size 0 itemset as there are no items in it. It then generates candidate
itemsets of size 1; we call such itemsets 1-itemsets. Note that every 1-itemset is
a candidate. In the example data shown in Table 2.3, the candidate 1-itemsets are
{a}, {b}, {c}, {d}, {e}. Now it scans the database once to obtain the supports of these
1-itemsets. The supports are:
Support({a}) = 6;  Support({b}) = 4;  Support({c}) = 3;  Support({d}) = 6;  Support({e}) = 3.
Using a Minsup value of 4, we can observe that frequent 1-itemsets are {a}, {b},
and {d}. From these frequent 1-itemsets we generate candidate 2-itemsets. The candidates are {a, b}, {a, d}, and {b, d}. Note that the other 2-itemsets need not be considered as candidates because they are supersets of infrequent itemsets and hence
cannot be frequent. For example, {a, c} is infrequent because {c} is infrequent.
A second database scan is used to find the support values of these candidates. The
supports are Support({a, b}) = 3, Support({a, d}) = 5, and Support({b, d}) = 3.
So, only {a, d} is a frequent 2-itemset, and there cannot be any candidates of size 3.
For example, {a, b, d} is not frequent because {a, b} is infrequent.
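A compact sketch of the level-wise Apriori idea, run on the transactions of Table 2.3 with a Minsup value of 4, is given below; it is an illustration rather than an optimized implementation.

    from itertools import combinations

    transactions = [
        {'a', 'c', 'd', 'e'}, {'a', 'd', 'e'}, {'b', 'd', 'e'}, {'a', 'b', 'c'},
        {'a', 'b', 'c', 'd'}, {'a', 'b', 'd'}, {'a', 'd'},
    ]
    MINSUP = 4

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Level 1: every 1-itemset is a candidate; keep the frequent ones.
    items = sorted({i for t in transactions for i in t})
    levels = [{frozenset([i]) for i in items if support(frozenset([i])) >= MINSUP}]

    k = 2
    while levels[-1]:
        prev = levels[-1]
        # Candidate generation: size-k unions of frequent (k-1)-itemsets whose
        # every (k-1)-subset is frequent (the Apriori property).
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # One database scan per level to retain only the frequent candidates.
        levels.append({c for c in candidates if support(c) >= MINSUP})
        k += 1

    for size, itemsets in enumerate(levels, start=1):
        print(size, [sorted(s) for s in itemsets])
    # size 1: {a}, {b}, {d};  size 2: {a, d};  size 3: none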
It is important to note that transactions need not be associated with supermarket
buying patterns only. It is possible to view a wide variety of patterns as transactions.
For example, consider printed characters of size 3 × 3 corresponding to character 1
shown in Table 2.4; there are two 1s. In the left-side matrix, the 1 occupies the third
column; in the right-side matrix, it occupies the first column.
By labeling the locations in such 3 × 3 matrices using 1 to 9 in a row-major fashion,
the two patterns may be viewed as transactions based on the 9 items. Specifically,
the transactions are t1 : {3, 6, 9} and t2 : {1, 4, 7}, where t1 corresponds to the left-side pattern, and t2 corresponds to the right-side pattern in Table 2.4. Let us call the
left side 1 as Type1 1, and the right side 1 as Type2 1.
Table 2.5 Transactions for characters of 1

TID   1  2  3  4  5  6  7  8  9   Class
t1    1  0  0  1  0  0  1  0  0   Type1 1
t2    1  0  0  1  1  0  1  0  0   Type1 1
t3    1  0  0  1  0  0  1  1  0   Type1 1
t4    0  0  1  0  0  1  0  0  1   Type2 1
t5    0  0  1  0  0  1  0  1  1   Type2 1
t6    0  0  1  0  1  1  0  0  1   Type2 1
So, it is possible to represent data based on categorical features using transactions
and mine them to obtain frequent patterns. For example, with a small amount of
noise, we can have transaction data corresponding to these 1s as shown in Table 2.5.
There are six transactions, each of them corresponding to a 1. By using a Minsup
value of 3, we get the frequent itemset {1, 4, 7} for Type1 1 and the frequent itemset
{3, 6, 9} for Type2 1. Naturally, subsets of these frequent itemsets are also frequent.
2.4.2 Association Rules
In association rule mining there are two important phases:
1. Generating Frequent Itemsets. This requires one or more dataset scans. Based on
the discussion in the previous subsection, Apriori requires k + 1 dataset scans if
the largest frequent itemset is of size k.
2. Obtaining Association Rules. This step generates association rules based on frequent itemsets. Once frequent itemsets are obtained from the transaction dataset,
association rules can be obtained without any more dataset scans, provided that
the support of each of the frequent itemsets is stored. So, this step is computationally simpler.
If X is a frequent itemset, then rules of the form A → B where A ⊂ X and
B = X − A are considered. Such a rule is accepted if the confidence of the rule
exceeds a user-specified confidence value called Minconf . The confidence of a rule
A → B is defined as
$$\text{Confidence}(A \rightarrow B) = \frac{\text{Support}(A \cup B)}{\text{Support}(A)}.$$
So, if the support values of all the frequent itemsets are stored, then it is possible to
compute the confidence value of a rule without scanning the dataset.
For example, in the dataset shown in Table 2.3, {a, d} is a frequent itemset. So,
there are two possible association rules. They are:
1. {a} → {d}; its confidence is 5/6.
2. {d} → {a}; its confidence is 5/6.
So, if the Minconf value is 0.5, then both these rules satisfy the confidence threshold.
In the case of character data shown in Table 2.5, it is appropriate to consider rules
of the form:
• {1, 4, 7} → Type1 1
• {3, 6, 9} → Type2 1
Typically, the antecedent of such an association rule or a classification rule is a
disjunction of one or more maximally frequent itemsets. A frequent itemset A is
maximal if there is no frequent itemset B such that A is a subset of B. This illustrates
the role of frequent itemsets in classification.
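The following sketch shows how rules and their confidences can be obtained from stored supports alone, using the frequent itemset {a, d} of Table 2.3 as an example; the stored support values are taken from the text.

    from itertools import combinations

    # Supports recorded while mining Table 2.3; no further dataset scan is needed.
    supports = {frozenset('a'): 6, frozenset('d'): 6, frozenset('ad'): 5}
    MINCONF = 0.5

    def rules_from(itemset):
        itemset = frozenset(itemset)
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                consequent = itemset - antecedent
                confidence = supports[itemset] / supports[antecedent]
                if confidence >= MINCONF:
                    yield sorted(antecedent), sorted(consequent), confidence

    for A, B, conf in rules_from('ad'):
        print(A, '->', B, 'confidence = %.2f' % conf)
    # ['a'] -> ['d'] and ['d'] -> ['a'], each with confidence 0.83 (= 5/6)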
2.5 Mining Large Datasets
There are several applications where the size of the pattern matrix is large. By large,
we mean that the entire pattern matrix cannot be accommodated in the main memory of the computer. So, we store the input data on a secondary storage medium
like the disk and transfer the data in parts to the main memory for processing. For
example, a transaction database of a supermarket chain may consist of trillions of
transactions, and each transaction is a sparse vector of a very high dimensionality;
the dimensionality depends on the number of product-lines. Similarly, in a network
intrusion detection application, the number of connections could be prohibitively
large, and the number of packets to be analyzed or classified could be even larger.
Another application is the clustering of click-streams; this forms an important part
of web usage mining. Other applications include genome sequence mining, where
the dimensionality could be running into millions, social network analysis, text mining, and biometrics.
An objective way of characterizing largeness of a data set is by specifying bounds
on the number of patterns and features present. For example, a data set having more
than a billion patterns and/or more than a million features is large. However, such a
characterization is not universally acceptable and is bound to change with the developments in technology. For example, in the 1960s, “large” meant several hundreds
of patterns. So, it is good to consider a more pragmatic characterization; large data
sets are those that may not fit the main memory of the computer; so, largeness of the
data varies with the technological developments. Such large data sets are typically
stored on a disk, and each point in the set is accessed from the disk based on processing needs. Note that disk access can be several orders of magnitude slower than
memory access; this property has remained intact even though memory and disk sizes
have changed over time. So, characterizing largeness using this
property could be more meaningful.
The above discussion motivates the need for integrating various algorithmic design techniques along with the existing mining algorithms so that they can handle
large data sets. Here, we provide an exhaustive set of design techniques that are useful in this context. More specifically, we offer a unifying framework that is helpful
in categorizing algorithms for mining large data sets; further, it provides scope for
designing novel efficient mining algorithms.
2.5.1 Possible Solutions
It is important that the mining algorithms that work with large data sets should scale
up well. Algorithms having nonlinear time and space complexities are ruled out.
Even algorithms requiring linear time and space may not be feasible if the number of
dataset scans is large. Based on these observations, it is possible to list the following
solutions for mining large data sets.
1. Incremental Mining. The basis of incremental mining is that the data is considered sequentially and the data points are processed step by step. In most of the
incremental mining algorithms, a small dataset is used to generate an abstraction.
New points are processed to update the abstraction currently available without
examining the previously seen data points. Also, it is important that the abstraction
generated is as small as possible. Such a scheme helps in mining very
large-scale datasets.
We can characterize incremental mining formally as follows. Let
X = {(X1 , θ1 , t1 ), (X2 , θ2 , t2 ), . . . , (Xn , θn , tn )}
be the set of n patterns, each represented as a triple, where Xi is the ith pattern, θi is the class label of Xi , and ti is the time-stamp associated with Xi so
that ti < tj if i < j . In incremental mining, as the data is considered sequentially, in a particular order, we may attach time stamps t1 , t2 , . . . , tn with the patterns X1 , X2 , . . . , Xn . Let Ak represent the abstraction generated using the first
k patterns, and An represent the abstraction obtained after all the n patterns are
processed. Further, in incremental mining, Ak+1 is obtained using Ak and Xk+1
only.
2. Divide-and-Conquer Approach. Divide-and-conquer is a well-known algorithm
design strategy. It has been used in designing several efficient algorithms, including efficient data mining algorithms. A notable development in this direction is the
Map-Reduce framework, which is popular in a variety of data mining applications including text mining.
3. Mining based on an Intermediate Abstraction. The idea here is to use one or two
database scans to obtain a compact representation of the dataset. Such a representation may fit into main memory. Further processing is based on this abstraction,
and it does not require any more dataset scans. For example, as discussed in
the previous section, once frequent itemsets are obtained using a small number
of database scans, association rules can be obtained without any more database
scans.
In the rest of the section, we examine how these three techniques are used in
Clustering, Classification, and Association Rule Mining.
2.5.2 Clustering
2.5.2.1 Incremental Clustering
The basis of incremental clustering is that the data is considered sequentially and the
patterns are processed step by step. In most of the incremental clustering algorithms,
one of the patterns in the data set (usually the first pattern) is selected to form an
initial cluster. Each of the remaining points is assigned to one of the existing clusters
or may be used to form a new cluster based on some criterion. Here, a new data item
is assigned to a cluster without affecting the existing clusters significantly.
The abstraction Ak varies from algorithm to algorithm, and it can take different
forms. One popular scheme is when Ak is a set of prototypes or cluster representatives. The leader clustering algorithm is a well-known member of this category.
It is described below.
Leader Clustering Algorithm
Input: The dataset to be clustered and a Threshold value T provided by the user.
Output: A partition of the dataset such that patterns in each cluster are within a
sphere of radius T .
1. Set k = 1. Assign the first data point X1 to cluster Ck . Set the leader of Ck to be
Lk = X 1 .
2. Assign the next data point X to one of the existing clusters or to a new cluster. This assignment is done based on some similarity between the data point
and the existing leaders. Specifically, assign the data point X to cluster Cj if
d(X, Lj ) < T ; if there are more than one Cj satisfying the threshold requirement, then assign X to one of these clusters arbitrarily. If there is no Cj such that
d(X, Lj ) < T , then increment k, assign X to Ck , and set X to be Lk .
3. Repeat step 2 till all the data points are assigned to clusters.
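A direct transcription of these steps is shown below; the data points are those of Table 2.6, and the threshold value T = 4 is chosen only for illustration.

    import math

    def leader_clustering(points, threshold):
        # Single database scan; each cluster is represented by its leader.
        leaders, clusters = [], []
        for x in points:
            for j, leader in enumerate(leaders):
                if math.dist(x, leader) < threshold:   # first leader within T wins
                    clusters[j].append(x)
                    break
            else:                                      # no leader is close enough
                leaders.append(x)                      # x becomes a new leader
                clusters.append([x])
        return leaders, clusters

    # The two-dimensional points of Table 2.6 with an illustrative threshold T = 4.
    data = [(1, 1), (6, 3), (2, 2), (7, 4), (9, 8), (9, 11), (14, 2), (13, 3)]
    leaders, clusters = leader_clustering(data, threshold=4)
    print(leaders)     # [(1, 1), (6, 3), (9, 8), (14, 2)]
    print(clusters)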
BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies
BIRCH may be viewed as a hierarchical version of the leader algorithm with some
additional representational features to handle large-scale data. It constructs a data
structure called the Cluster Feature tree (CF tree), which represents each cluster
compactly using a vector called Cluster Feature (CF). We explain these notions
using the dataset shown in Table 2.6.
• Clustering Feature (CF). Let us consider the cluster of 2 points, {(1, 1)t , (2, 2)t }.
The CF vector is three-dimensional and is ⟨2, (3, 3), (5, 5)⟩, where the three components of the vector are as follows:
1. The first component is the number of elements in the cluster, which is 2 here.
2. The second component is the linear sum of all the points (vectors) in the cluster, which is (3, 3) (= (1 + 2, 1 + 2)) in this example.
3. The third component, squared sum, is the sum of squares of the components
of the points in the cluster; here it is (5, 5) (= (1^2 + 2^2 , 1^2 + 2^2 )).
Table 2.6 A two-dimensional dataset

Pattern number   feature1   feature2
1                1          1
2                6          3
3                2          2
4                7          4
5                9          8
6                9          11
7                14         2
8                13         3
• Merging Clusters. A major flexibility offered by representing clusters using CF
vectors is that it is very easy to merge two or more clusters. For example if ni is
the number of elements in Ci , lsi is the linear sum, and ssi is the squared sum,
then
CF vector of cluster Ci is ⟨ni , lsi , ssi ⟩ and
CF vector of cluster Cj is ⟨nj , lsj , ssj ⟩, then
CF vector of the cluster obtained by merging Ci and Cj is
⟨ni + nj , lsi + lsj , ssi + ssj ⟩.
• Computing Cluster Parameters. Another important property of the CF representation is that several statistics associated with the corresponding cluster can be
obtained easily using it. A statistic is a function of the samples in the cluster. For
example, if a cluster C = {X 1 , X 2 , . . . , X q }, then
$$\text{Centroid of } C = \text{Centroid}_C = \frac{\sum_{j=1}^{q} X_j}{q} = \frac{ls}{q},$$

$$\text{Radius of } C = R = \left( \frac{\sum_{i=1}^{q} (X_i - \text{Centroid}_C)^2}{q} \right)^{1/2} = \left( \frac{ss}{q} - \frac{2\,ls^2}{q^2} + \frac{ls^2}{q^2} \right)^{1/2}.$$
• At the leaf node level, each cluster is controlled by a user-provided threshold. If
T is the threshold, then all the points in the cluster lie in a sphere of radius T .
As the clusters are merged to form clusters at a previous level, one can use the
merging property of the CF vectors.
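The CF operations described above can be sketched as follows; the code builds the CF vector of the cluster {(1, 1), (2, 2)}, merges two CF vectors by component-wise addition, and recovers the centroid and radius from a CF vector alone (an illustration, not the BIRCH implementation).

    import math

    def cf(points):
        # Cluster Feature of a set of 2-D points: (count, linear sum, squared sum).
        n = len(points)
        ls = tuple(sum(p[i] for p in points) for i in range(2))
        ss = tuple(sum(p[i] ** 2 for p in points) for i in range(2))
        return n, ls, ss

    def merge(cf1, cf2):
        # Merging two clusters needs only component-wise addition of their CFs.
        (n1, ls1, ss1), (n2, ls2, ss2) = cf1, cf2
        return (n1 + n2,
                tuple(a + b for a, b in zip(ls1, ls2)),
                tuple(a + b for a, b in zip(ss1, ss2)))

    def centroid_and_radius(cf_vec):
        n, ls, ss = cf_vec
        centroid = tuple(l / n for l in ls)
        # Mean squared deviation of the members from the centroid, from the CF alone.
        variance = sum(s / n - (l / n) ** 2 for s, l in zip(ss, ls))
        return centroid, math.sqrt(variance)

    c1 = cf([(1, 1), (2, 2)])              # (2, (3, 3), (5, 5)), as in the text
    c2 = cf([(6, 3), (7, 4)])
    print(c1, merge(c1, c2))
    print(centroid_and_radius(c1))         # centroid (1.5, 1.5), radius ~ 0.707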
We show the CF-tree generated using the data shown in Table 2.6 in Fig. 2.7. By
inserting the first pattern we get the CF vector ⟨1, (1, 1), (1, 1)⟩. When two patterns
are within a threshold of two units, we put them in the same cluster at the leaf level;
for example, (1, 1)t and (2, 2)t are placed in the same cluster at the leaf node as
shown in the figure. Here we consider nodes that can store two clusters at each
Fig. 2.7 Insertion of the first three patterns
level. By inserting all the eight patterns we get the CF-tree shown in Fig. 2.8. Some
of the important characteristics of the incremental algorithms are:
1. They require one database scan to generate the clustering of the data. Each pattern is examined only once in the process. In the case of leader clustering algorithm, the clustering is represented by a set of leaders. If the threshold value is
small, then a larger number of clusters are generated. Similarly, if the threshold
value is large, then the number of clusters is small.
2. BIRCH generates a CF-tree using a single database scan. Such an abstraction
captures clusters in a hierarchical manner. Merging two smaller clusters to form
a bigger cluster is very easy by using the merging property of the corresponding
CF vectors.
3. The parameters controlling the size of the CF-tree are the number of clusters
stored in each node of the tree and the threshold value used at the leaf node to fix
the size of the clusters.
4. Order-independence is an important property of clustering algorithms. An algorithm is order-independent if it generates the same partition for any order in
which the data is presented. Otherwise it is order-dependent.
Fig. 2.8 CF-tree for the data
Fig. 2.9 Order-dependence of leader algorithm
Unfortunately, incremental algorithms can be order-dependent. This may be illustrated using an
example shown in Fig. 2.9.
By choosing the order in different ways, we get different partitions in terms of
both the number and size of clusters. For example, by choosing the three points
labelled X1 , X2 , X3 in that order as shown in the left part of the figure, we get four
clusters irrespective of the order in which the points X4 , X5 , X6 are processed.
Similarly, by selecting the centrally located points X4 and X5 as shown in the right
part of the figure as the first two points in the order, we get two clusters irrespective
of the order of the remaining four points.
2.5.2.2 Divide-and-Conquer Clustering
Conventionally, designers of clustering algorithms tacitly assume that the data sets
fit the main memory. This assumption does not hold when the data sets are large. In
such a situation, it makes sense to consider data in parts and cluster each part independently and obtain the corresponding clusters and their representatives. Once we
have obtained the cluster representatives for each part, we can cluster these representatives appropriately and realize the clusters corresponding to the entire data set.
If two or more representatives from different parts are assigned to some cluster C,
then assign the patterns in the corresponding clusters (of these representatives) to C.
Specifically, this may be achieved using a two-level clustering scheme depicted in
Fig. 2.10. There are n patterns in the data set. All these patterns are stored on a disk.
Each part or block of size n/p patterns is considered for clustering at a time. These n/p
data points are clustered in the main memory into k clusters using some clustering
algorithm. Clustering these p parts can be done either sequentially or in parallel;
the number of clusters corresponding to all the p blocks is pk as there are
k clusters in each block. So, we will have pk cluster representatives. By clustering
these pk cluster representatives using the same or a different clustering algorithm
into k clusters, we can realize a clustering of the entire data set as stated earlier.
It is possible to extend this algorithm to any number of levels. More levels are
required if the data set size is very large and the main memory size is small. If the single-link algorithm is used to cluster the data at both levels, then we have the following
number of distance computations. We consider the number of distances as distance
computations form a major part of the computation requirements.
Fig. 2.10 Divide-and-conquer approach to clustering
• One-level Algorithm. It does not employ divide-and-conquer. It is the conventional single-link algorithm applied on n data points, which makes n(n − 1)/2 distance computations.
• The Two-level Algorithm. It requires:
– In each block at the first level, there are n/p points. So, the number of distance computations in each block is (n/2p)(n/p − 1).
– There are p blocks at the first level. So, the total number of distances at the first level is (n/2)(n/p − 1).
– There are pk representatives at the second level. So, the number of distances computed at the second level is pk(pk − 1)/2.
– So, the total number of distances for the two-level divide-and-conquer algorithm is (n/2)(n/p − 1) + pk(pk − 1)/2.
• A Comparison. The numbers of distances computed by the conventional single-link and two-level algorithms are shown in Table 2.7 for different values of n, k, and p.
So, there is a great reduction in both time and space requirements if the two-level
algorithm is used. Also, the divide-and-conquer algorithm facilitates clustering very
large datasets.
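The two-level scheme can be sketched as shown below. The sketch uses a plain k-means routine as the base clustering algorithm at both levels and synthetic two-dimensional data; it illustrates the data flow only and is not an out-of-core implementation.

    import random

    def k_means(points, k, iters=20, seed=0):
        # Plain k-means, used here as the base clustering routine at both levels.
        centroids = random.Random(seed).sample(points, k)
        for _ in range(iters):
            groups = [[] for _ in range(k)]
            for x in points:
                j = min(range(k), key=lambda c: sum((a - b) ** 2
                                                    for a, b in zip(x, centroids[c])))
                groups[j].append(x)
            centroids = [tuple(sum(col) / len(g) for col in zip(*g)) if g else centroids[j]
                         for j, g in enumerate(groups)]
        return centroids

    def two_level_clustering(points, p, k):
        # Level 1: split the data into p blocks and cluster each block into k clusters.
        size = len(points) // p
        blocks = [points[i * size:(i + 1) * size] for i in range(p)]
        representatives = []
        for block in blocks:
            representatives.extend(k_means(block, k))      # k centroids per block
        # Level 2: cluster the p*k representatives into the final k clusters.
        return k_means(representatives, k)

    # Illustrative run on synthetic two-dimensional points around three centers.
    data = [(random.gauss(cx, 0.5), random.gauss(cy, 0.5))
            for cx, cy in [(0, 0), (5, 5), (10, 0)] for _ in range(100)]
    random.shuffle(data)
    print(two_level_clustering(data, p=3, k=3))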
Table 2.7 Number of distances computed

No. of data    No. of       No. of         One-level     Two-level
points (n)     blocks (p)   clusters (k)   algorithm     algorithm
100            2            5              4950          2495
500            20           5              124,750       15,900
1000           20           5              499,500       29,450
10,000         100          5              49,995,000    619,750
2.5.2.3 Clustering Based on an Intermediate Representation
The basic idea here is to generate an abstraction by scanning the dataset once or
twice and then use the abstraction, not the original data, for further processing. In
order to illustrate the working of this category of algorithms, we use the dataset
shown in Table 2.5. We use a database scan to find frequent 1-itemsets. Using a
Minsup value of 3, we get the following frequent itemsets: {1}, {4}, {7}, {3}, {6},
and {9}; all these items have a support value of 3. We perform one more database
scan to construct a tree using only the frequent items. First, we consider transaction
t1 and insert it into a tree as shown in Fig. 2.11. Here, we consider only the frequent
items present in t1 ; these are 1, 4, and 7. So, these are inserted into the tree by having
one node for each item present. The item numbers are indicated inside the nodes;
in addition, the count values are also indicated along with the item numbers. For
example, in Fig. 2.11(a), 1 : 1, 4 : 1, and 7 : 1 indicate that items 1, 4, and 7 are
present in the transaction. Next, we consider t2 , which has the same items as t1 , and
so we simply increment the counts as shown in Fig. 2.11(b). After examining all
the six transactions, we get the tree shown in Fig. 2.11. In the process, we need to
create new branches and nodes appropriately as we encounter new transactions. For
example, after considering t4 , we have items 3, 6, and 9 present in it, which prompts
us to start a new branch with nodes for the items 3, 6, and 9. At this point, the counts
on the right branch of the tree for these items are 3 : 1, 6 : 1, and 9 : 1, respectively.
It is possible to store the items in a transaction in any order, but we used the item
numbers in increasing order.
Note that the two branches of the tree, which is called Frequent-Pattern tree or
FP-tree, correspond to two different clusters; here each cluster corresponds to a
different class of 1s. Some of the important features of this class of algorithms are:
1. They require only two scans of the database. This is because each data item
is examined only twice. An abstraction is generated, and it is used for further
processing. Centroids, leaders, and FP-tree are some example abstractions.
2. The intermediate representation is useful in other important mining tasks like association rule mining, clustering, and classification. For example, the FP-tree has
been successfully used in association rule mining, clustering, and classification.
3. Typically, the space required by the intermediate representation could be much
smaller than the space required by the entire data set. So, it is possible to store it
in a compact manner in the main memory.
Fig. 2.11 A tree structure for the character patterns
There are several other types of intermediate representations. Some of them are:
• It is possible to reduce the computational requirements of clustering by using a
random subset of the dataset.
• An important and not systematically pursued direction is to use a compression
scheme to reduce the time and memory required to store the data. The compression scheme may be lossy or nonlossy. The compressed data is then used for further processing. This direction will be examined in great detail in the rest of the book.
2.5.3 Classification
It is also possible to exploit the three paradigms in classification. We discuss these
directions next.
2.5.3.1 Incremental Classification
Most of the classifiers can be suitably altered to handle incremental classification.
We can easily modify the NNC to perform incremental classification. This can be
done by incrementally updating the nearest neighbor of the test pattern. The specific
incremental algorithm for NNC is:
1. Let Ak be the nearest neighbor of the test pattern X after examining training
patterns X1 , X2 , . . . , Xk .
2. Next, when Xk+1 is encountered, we update the nearest neighbor of X to get
Ak+1 using Ak and Xk+1 .
3. Repeat step 2 till An is obtained.
We illustrate it using the dataset shown in Table 2.2. Consider the test pattern
X = (2.0, 2.0, 2.0, 2.0)t and X1 , X2 , X3 , X4 . The nearest neighbor of X out of
these four points is X4 , which is at a distance of 1.414 units; so, A4 is X4 . Now, if
we encounter X5 , then A5 gets updated, and it is X5 because d(X, X5 ) = 1.0, and
it is smaller than d(X, A4 ), which is 1.414. Proceeding further in this manner, we
note that A8 is X5 ; so, X is assigned to C1 as the class label of X5 is C1 .
In a similar manner, it is possible to visualize an incremental version of the
kNNC. For example, the three nearest neighbors of X after examining the first four
patterns in Table 2.2 are X4 , X1 , and X2 . Now if we encounter X5 , then the three
nearest neighbors are X5 , X4 , and X1 . After examining all the eight patterns, we
get the three nearest neighbors of X to be X5 , X4 , and X7 . All the three neighbors
are from class C1 ; so, we assign X to C1 .
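A minimal sketch of this incremental update is given below; the abstraction Ak kept in memory is just the current nearest neighbor of the test pattern and its distance, and the training patterns of Table 2.2 are treated as a stream.

    import math

    def incremental_nn(stream, test):
        # Keep only the current nearest neighbor (and its distance) as the abstraction Ak.
        best = None                              # (distance, pattern, label)
        for pattern, label in stream:
            d = math.dist(test, pattern)
            if best is None or d < best[0]:
                best = (d, pattern, label)       # A(k+1) from Ak and the new pattern
        return best

    # Training patterns of Table 2.2 arriving one at a time.
    stream = [
        ((1.0, 1.0, 1.0, 1.0), 'C1'), ((6.0, 6.0, 6.0, 6.0), 'C2'),
        ((7.0, 7.0, 7.0, 7.0), 'C2'), ((1.0, 1.0, 2.0, 2.0), 'C1'),
        ((1.0, 2.0, 2.0, 2.0), 'C1'), ((7.0, 7.0, 6.0, 6.0), 'C2'),
        ((1.0, 2.0, 2.0, 1.0), 'C1'), ((6.0, 6.0, 7.0, 7.0), 'C2'),
    ]
    distance, neighbor, label = incremental_nn(stream, (2.0, 2.0, 2.0, 2.0))
    print(neighbor, label)    # (1.0, 2.0, 2.0, 2.0) C1, i.e., X5 from class C1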
2.5.3.2 Divide-and-Conquer Classification
It is also possible to exploit the divide-and-conquer paradigm in classification. Even
though it is possible to use it along with a variety of classifiers, we consider it in the
context of N NC. It is possible to use the division across either rows or columns of
the data matrix.
• Division across the rows. It may be described as follows:
1. Let the n rows of the data matrix be partitioned into p blocks, where there are
n/p rows in each block.
2. Obtain the nearest neighbor of the test pattern X in each block using n/p distances. Let the nearest neighbor of X in the ith block be X i , and its distance
from X be di .
3. Let dj be the minimum of the values d1 , d2 , . . . , dp . Then NN(X) = X j .
Ties may be arbitrarily broken.
Note that computations in steps 2 and 3 can be parallelized to a large extent.
We illustrate this algorithm using the data shown in Table 2.2. Let us consider
two (p = 2) blocks such that:
– Block1 = {X1 , X2 , X3 , X4 };
– Block2 = {X5 , X6 , X7 , X8 }.
Now for the test pattern X = (2.0, 2.0, 2.0, 2.0)t , the nearest neighbors in the
two blocks are X 1 = X4 and X 2 = X5 . Note that their distances from X are
d1 = 1.414 and d2 = 1.0, respectively. So, X 2 , which is equal to X5 , is the nearest neighbor of X as the distance d2 is the smaller of the two. It is possible to
consider unequal-size partitions also and still obtain the nearest neighbor. Also,
it is possible to have a divide-and-conquer kNNC using a variant of the above
algorithm.
• Division among the columns. An interesting situation emerges when the columns
are grouped together. It can lead to novel pattern generation or pattern synthesis.
The specific algorithm is given below:
1. Divide the number of features d into p blocks, where each block has pd features. Consider data corresponding to each of these blocks in each of the
classes.
2. Divide the test pattern X into p blocks; let the corresponding subpatterns be
X 1 , X 2 , . . . , X p , respectively.
3. Find the nearest neighbor of each X i for i = 1, 2, . . . , p from the corresponding ith block of each class.
4. For each class separately, concatenate the nearest subpatterns obtained for the corresponding subpatterns of the test pattern. Among these concatenated patterns, obtain the one nearest to X; assign the class label of the nearest
concatenated pattern to X.
We explain the working of this scheme using the example data shown in Table 2.2 and the test pattern X = (2.0, 2.0, 2.0, 2.0)t . Let p = 2. Let the two
feature set blocks be
– Block1 = {feature1, feature2};
– Block2 = {feature3, feature4}.
Correspondingly, the test pattern has two blocks, X 1 = (2.0, 2.0)t and X 2 =
(2.0, 2.0)t . The training data after partitioning into two feature blocks and reorganizing so that all the patterns in a class are put together is shown in Table 2.8.
Note that, for X 1 , the nearest neighbor from C1 can be either the first subpattern
of X5 , which is denoted by X51 , or X71 ; we resolve the tie in favor of the first
pattern, which is X51 as X5 appears before X7 in the table. Further, the nearest
subpattern from C2 for X 1 is X21 . Similarly, for the second subpattern X 2 of X,
the nearest neighbors from C1 and C2 respectively are X42 and X22 . Now, concatenating the nearest subpatterns from the two classes, we have
– C1 – X51 : X42 , which is (1.0, 2.0, 2.0, 2.0)t ;
– C2 – X21 : X22 , which is (6.0, 6.0, 6.0, 6.0)t .
Out of these two patterns, the pattern from C1 is nearer to X than the pattern
from C2 , the corresponding distances being 1.0 and 8.0, respectively. So, we assign X to C1 .
There are some important points to be considered here:
1. In the above example, both the concatenated patterns are already present in
the data. However, it is possible that novel patterns are generated by concatenating the nearest subpatterns. For example, consider the test pattern
Y = (1.0, 2.0, 1.0, 1.0)t . In this case, the nearest subpatterns from C1 and
C2 for Y 1 = (1.0, 2.0)t and Y 2 = (1.0, 1.0)t are given below:
– The nearest neighbors of Y 1 from C1 and C2 respectively are X51 and X21 .
– The nearest neighbors of Y 2 from C1 and C2 respectively are X12 and X22 .
– Concatenating the nearest subpatterns from C1 , we get (1.0, 2.0, 1.0, 1.0)t .
– Concatenating the nearest subpatterns from C2 , we get (6.0, 6.0, 6.0, 6.0)t .
So, Y is classified as belonging to C1 because the concatenated pattern
(1.0, 2.0, 1.0, 1.0)t is closer to Y than the pattern from C2 . Note that in this
case, the concatenated pattern is the novel pattern (1.0, 2.0, 1.0, 1.0)t , which
is not a part of the training data from C1 . So, this scheme has the potential
to generate novel patterns from each of the classes and use them in decision
making. In general, if there are p blocks and ni patterns in class Ci , the size of the space
of all possible concatenated patterns in the class is ni^p , which can be much
larger than ni .
2. Even though the effective search space size or number of patterns examined
from the ith class is ni^p , the actual effort involved in finding the nearest concatenated pattern is of O(ni p), which is linear.
3. There is no need to compute the distance between X and concatenated nearest
subpatterns from each class separately if an appropriate distance function is
used. For example, if we use the squared Euclidean distance, then the distance
between the test pattern X and the concatenated subpatterns from a class is the
sum of the distances between the corresponding subpatterns. Specifically,
$$d^2\bigl(X, CN_i(X)\bigr) = \sum_{j=1}^{p} d^2\bigl(X^j, NN_i(X^j)\bigr),$$
where CN i (X) is the concatenated nearest subpattern of X j s from class Ci ,
and NN i (X j ) is the nearest subpattern of X j from Ci . For example,
– The nearest subpattern of X 1 from C1 is X51 , and that of X 2 is X42 .
– The corresponding squared Euclidean distances are d 2 (X 1 , X51 ) = 1.0 and
d 2 (X 2 , X42 ) = 0.0.
– So, the distance between X and the concatenated pattern (1.0, 2.0, 2.0, 2.0)t
is 1.0 + 0.0 = 1.0.
– Similarly, for C2 , the nearest subpatterns of X 1 and X 2 are X21 and X22 ,
respectively.
– The corresponding distances are d 2 (X 1 , X21 ) = 32 and d 2 (X 2 , X22 ) = 32.
So, d 2 (X, CN2 (X)) = 32 + 32 = 64.
4. It is possible to extend this partition-based scheme to the kNNC.
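The distance decomposition in the last point makes the column-division scheme straightforward to code. The sketch below (illustrative only) splits the features of the Table 2.2 data into two blocks and classifies a test pattern by summing, over blocks, the squared distance to the nearest subpattern of each class.

    def feature_partition_nnc(class_data, test, p):
        # class_data maps a class label to its list of patterns;
        # the features are split into p equal-sized blocks.
        d = len(test)
        bounds = [(j * d // p, (j + 1) * d // p) for j in range(p)]
        best_label, best_dist = None, None
        for label, patterns in class_data.items():
            # Squared distance to the concatenated nearest subpatterns of this class
            # = sum over blocks of the squared distance to the nearest subpattern.
            total = sum(min(sum((test[i] - x[i]) ** 2 for i in range(lo, hi))
                            for x in patterns)
                        for lo, hi in bounds)
            if best_dist is None or total < best_dist:
                best_label, best_dist = label, total
        return best_label, best_dist

    class_data = {
        'C1': [(1.0, 1.0, 1.0, 1.0), (1.0, 1.0, 2.0, 2.0),
               (1.0, 2.0, 2.0, 2.0), (1.0, 2.0, 2.0, 1.0)],
        'C2': [(6.0, 6.0, 6.0, 6.0), (7.0, 7.0, 7.0, 7.0),
               (7.0, 7.0, 6.0, 6.0), (6.0, 6.0, 7.0, 7.0)],
    }
    print(feature_partition_nnc(class_data, (2.0, 2.0, 2.0, 2.0), p=2))
    # ('C1', 1.0), matching the hand calculation above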
2.5.3.3 Classification Based on Intermediate Abstraction
Here we also consider the NNC. There could be different intermediate representations possible. Some of them are:
Table 2.8 Reorganized data matrix

Pattern ID   feature1   feature2   feature3   feature4   Class label
X1           1.0        1.0        1.0        1.0        C1
X4           1.0        1.0        2.0        2.0        C1
X5           1.0        2.0        2.0        2.0        C1
X7           1.0        2.0        2.0        1.0        C1
X2           6.0        6.0        6.0        6.0        C2
X3           7.0        7.0        7.0        7.0        C2
X6           7.0        7.0        6.0        6.0        C2
X8           6.0        6.0        7.0        7.0        C2
1. Clustering-based. Cluster the training data and use the cluster representatives as
the intermediate abstraction. Clustering could be carried out in each class separately. The resulting clusters may be interpreted as subclasses of the respective
classes. For example, consider the two-class four-dimensional dataset shown in
Table 2.2. By clustering the data in each class separately using the k-means algorithm with k = 2 we get the following centroids:
• C1 . By selecting X1 and X5 as the initial centroids, the clusters obtained using the k-means algorithm are C11 = {X1 } and C12 = {X4 , X5 , X7 }, and the
centroids of these clusters are (1.0, 1.0, 1.0, 1.0)t and (1.0, 1.66, 2.0, 1.66)t ,
respectively. Here, C11 and C12 are the first and second clusters obtained by
grouping data in C1 .
• C2 . By selecting X2 and X3 as the initial centroids, using the k-means algorithm, we get the clusters C21 = {X2 , X6 , X8 } and C22 = {X3 }, and the
respective centroids are (6.33, 6.33, 6.33, 6.33)t and (7.0, 7.0, 7.0, 7.0)t .
• Classification of X. Using the four centroids, two from each class, instead
of using all the eight training points, we classify the test pattern X =
(2.0, 2.0, 2.0, 2.0)t . The distances between X and these four centroids are
d(X, C11 ) = 2.0, d(X, C12 ) = 1.22, d(X, C21 ) = 8.66, and d(X, C22 ) =
10.0. So, X is closer to C12 , which is a cluster (or a subclass) in C1 ; as a
consequence, X is assigned to C1 .
2. FP-Tree based Abstraction. Here we consider using an abstraction based on frequent itemsets in classification. For example, consider the transaction dataset
shown in Table 2.5 and the corresponding FP-tree structure shown in Fig. 2.11(c).
Such an abstraction can be used in classification. The data in the table has two
classes, Type1 1 and Type2 1. Now consider a test pattern, which is a noisy
version of Type1 1 given by (1, 0, 1, 1, 0, 0, 1, 0, 0)t , which means the corresponding itemset using frequent items with Minsup value of 3 is {1, 3, 4, 7}. This
pattern aligns better with the left branch of the tree in Fig. 2.11 than the right
branch. So, we assign it to Type1 1 as the left branch represents the Type1 1 class. In
the process of alignment, we find out the nearest branch in the tree in terms of
the common items present in the branch and in the transaction.
Table 2.9 Transaction data for incremental mining

Transaction   Itemset given    Itemset in frequency order
t1            {a, c, d, e}     {a, d}
t2            {a, d, e}        {a, d}
t3            {b, d, e}        {d, b}
t4            {a, c}           {a}
t5            {a, b, c, d}     {a, d, b}
t6            {a, b, d}        {a, d, b}
t7            {a, b, d}        {a, d, b}
2.5.4 Frequent Itemset Mining
In association rule mining, an important and time-consuming step is frequent itemset
generation. So, we consider frequent itemset mining here.
2.5.4.1 Incremental Frequent Itemset Mining
There are incremental algorithms for frequent itemset mining. They do not follow the incremental mining definition given earlier. They may require an additional
database scan. We discuss the incremental algorithm next.
1. Consider a block of m transactions, Block1, to find the frequent itemsets. Store
the frequent itemsets along with their supports. If an itemset is infrequent, but all
its subsets are frequent, then it is a border set. Obtain the set of such border sets.
Let F1 and B1 be the frequent and border sets from Block1.
2. Now let the database be extended by adding a block, Block2, of transactions.
Find the frequent itemsets and border set in Block2. Let them be F2 and B2 .
3. We update the frequent itemsets as follows:
• If an itemset is present in both F1 and F2 , then it is frequent.
• If an itemset is infrequent in both the blocks, then it is infrequent.
• If an itemset is frequent in F1 but not in F2 , it can be eliminated by using the
support values.
• Itemsets absent in F1 but frequent in the union of the two blocks can be obtained by using the notion of promoted border. This happens when an itemset
that is a border set in Block1 becomes frequent in the union of the two blocks.
If such a thing happens, then additional candidates are generated and tested
using another database scan.
We illustrate the algorithm using the dataset shown in Table 2.9 and a Minsup value
of 4; note that the second column in the table gives the transactions, and a support threshold of 2 is used within each block. Let Block1
consist of the first four transactions, that is, from t1 to t4 . The various sets along
with the frequencies are:
• F1 = {a : 3}, {c : 2}, {d : 3}, {e : 3}, {a, c : 2}, {a, d : 2}, {a, e : 2},
{d, e : 3}, {a, d, e : 2}.
• B1 = {b : 1}, {c, d : 1}, {c, e : 1}.
Now we encounter the incremental portion or Block2 consisting of remaining
three transactions from Table 2.9. For this part, the sets F2 and B2 are:
• F2 = {a : 3}, {b : 3}, {d : 3}, {a, b : 3}, {b, d : 3}, {a, d : 3}, {a, b, d : 3}.
• B2 = {c : 1}.
Now we know from F1 and F2 that {a : 6}, {d : 6}, and {a, d : 5} are present in
both F1 and F2 . So, they are frequent. Further note that {b : 4}, a border set in
Block1 gets promoted to become frequent. So, we add it to the frequent itemsets.
We also need to consider {a, b}, {b, d}, and {a, b, d}, which may become frequent. However, {b, c} and {b, e} need not be considered because {c} and {e} are
infrequent. Now we need to make a scan of the database to decide that {a, b : 4}
and {b, d : 4} are frequent, but not {a, b, d : 3}.
2.5.4.2 Divide-and-Conquer Frequent Itemset Mining
The divide-and-conquer strategy has been used in mining frequent itemsets. The
specific algorithm is as follows.
Input: Transaction Data Matrix and Minsup value
Output: Frequent Itemsets
1. Divide the transaction data into p blocks so that each block has n/p transactions.
2. Obtain frequent itemsets in each of the blocks. Let Fi be the set of frequent
itemsets in the ith block.
3. Take the union of all the frequent itemsets; let it be F . That means F = F1 ∪ F2 ∪ · · · ∪ Fp .
4. Use one more database scan to find the supports of itemsets in F . Those satisfying the Minsup threshold are the frequent itemsets. Collect them in Ffinal , which
is the set of all the frequent itemsets.
Some of the features of this algorithm are as follows:
1. The most important feature is that if an itemset is infrequent in all the p blocks,
then it cannot be frequent.
2. This is a two-level algorithm, and it considers only those itemsets that are frequent at the first level for the possibility of being frequent.
3. The worst-case scenario emerges when, at the end of the first level, all the itemsets
are members of F . This can happen for datasets in which the transactions are
dense or nonsparse. In such a case, an FP-tree that stores the itemsets in a compact
manner can be used.
We explain this algorithm using the data shown in Table 2.9. Let us consider two
blocks and Minsup value of 4, which means a value of 2 in each block.
• Let Block1 = {t1 , t2 , t3 , t4 }; Block2 = {t5 , t6 , t7 }.
• F1 = {a}, {c}, {d}, {e}, {a, c}, {a, d}, {a, e}, {d, e}, {a, d, e}.
• F2 = {a}, {b}, {d}, {a, b}, {b, d}, {a, d}, {a, b, d}.
• F = {a}, {b}, {c}, {d}, {e}, {a, b}, {a, c}, {a, d}, {a, e}, {b, d}, {d, e}, {a, b, d}.
• We examine the elements of F using one more dataset scan to get Ffinal :
Ffinal = {a : 6}, {b : 4}, {d : 6}, {a, b : 4}, {a, d : 5}, {b, d : 4}.
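The procedure above can be sketched compactly as follows (Python); the function names and the proportional per-block threshold are illustrative assumptions.

# Sketch: divide-and-conquer frequent itemset mining over p blocks.
from itertools import combinations

def block_frequent(block, minsup):
    # Brute-force per-block miner (same idea as the earlier sketch).
    items = sorted(set().union(*block))
    out = {}
    for k in range(1, len(items) + 1):
        for c in combinations(items, k):
            s = frozenset(c)
            count = sum(1 for t in block if s <= t)
            if count >= minsup:
                out[s] = count
    return out

def divide_and_conquer_mine(transactions, p, minsup):
    n = len(transactions)
    size = (n + p - 1) // p                     # roughly n/p per block
    blocks = [transactions[i:i + size] for i in range(0, n, size)]
    F = set()                                   # union of per-block results
    for b in blocks:
        F |= set(block_frequent(b, max(1, minsup * len(b) // n)))
    # One more scan of the full data keeps only globally frequent itemsets.
    F_final = {}
    for s in F:
        count = sum(1 for t in transactions if s <= t)
        if count >= minsup:
            F_final[s] = count
    return F_final

With the seven transactions of Table 2.9, p = 2, and Minsup = 4, F_final should match the set Ffinal listed above (the intermediate per-block sets may differ slightly because of the proportional thresholds).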
2.5.4.3 Intermediate Abstraction for Frequent Itemset Mining
It is possible to read the database once or twice and produce an abstraction and
use this abstraction for obtaining frequent itemsets. The most popular abstraction
in this context is the Frequent Pattern Tree or FP-tree. It is constructed using two
database scans. It has been used in Clustering and Classification. However, it was
originally proposed for obtaining the frequent itemsets. The detailed algorithm for
constructing an FP-tree is given below:
Input: Transaction Database and Minsup.
Output: FP-tree.
1. Scan the dataset once to get the frequent 1-itemsets using the Minsup value.
2. Scan the database once more and in each transaction ignore the infrequent items
and insert the remaining part of the transaction in decreasing order of support
of the items. Also maintain the frequency counts along with items such that if
multiple transactions share the same subsets of items, then they are inserted into
the same branch of the tree as shown in Fig. 2.11. The frequency counts of the
items in the branch are updated appropriately instead of storing them in multiple
branches.
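A minimal sketch of these two scans is given below (Python); the node structure and names are illustrative and only show how shared prefixes accumulate counts (no header or index structure is maintained here).

# Sketch: building an FP-tree-like prefix tree in two database scans.
class FPNode:
    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}                      # item -> FPNode

def build_fp_tree(transactions, minsup):
    # Scan 1: frequent 1-itemsets and their supports.
    support = {}
    for t in transactions:
        for i in t:
            support[i] = support.get(i, 0) + 1
    frequent = {i: c for i, c in support.items() if c >= minsup}
    # Scan 2: insert each transaction with infrequent items removed and the
    # remaining items in decreasing order of support (ties broken lexically).
    root = FPNode()
    for t in transactions:
        ordered = sorted((i for i in t if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for i in ordered:
            node = node.children.setdefault(i, FPNode(i))
            node.count += 1                     # shared branches update counts
    return root, frequent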
Construction of the FP-tree was discussed using the data shown in Table 2.5
and Fig. 2.11. By examining the FP-tree shown in Fig. 2.11(c), it is possible to
show that {1, 4, 7} and {3, 6, 9} are the two maximal frequent itemsets. Each of
them corresponds to a type of 1 (character 1), and also each itemset shares a branch
in the tree. Once the tree is obtained, frequent itemsets are found by going through
the tree in a bottom-up manner. It starts with a suffix based on less frequent items
present in the tree. This is efficiently done using an index structure.
We illustrate the frequent itemset mining using the data shown in Table 2.9. The
corresponding FP-tree is shown in Fig. 2.12. Some of the details related to the construction of the tree and finding frequent itemsets are:
• The frequent 1-itemsets are {a : 6}, {d : 6}, and {b : 4} by using a value of 4 for
Minsup and data in Table 2.9.
• We rewrite the transactions using the frequency order and Minsup information.
Infrequent items are deleted, and frequent items are ordered in decreasing order
of the frequency. Ties are broken based on lexicographic order. The modified
transactions are shown in column 3 of the table.
• By inserting the modified transactions, we get the FP-tree shown in Fig. 2.12.
Fig. 2.12 An example FP-tree
• In order to mine the frequent itemsets from the tree, we start with the least frequent among the frequent items, which is b in this case, and mine for all the itemsets from the tree with b as the suffix. For this, we consider the FP-tree above the
item b as shown by the curved line segment. Item d occurs in both the branches
with frequencies 5 and 1, respectively. However, in terms of co-occurrence along
with b, which has a frequency of 3 in the left branch, we need to consider a frequency of 3 for d and a. This is because they co-occurred with b in only three transactions. Similarly, from the right branch we know that b and d co-occurred
once. From this we get the frequencies of {b, d} and {a, b} to be 3 from the left
branch; in addition, from the right branch we get a frequency of 1 for {b, d}. This
means that the cumulative frequency of {b, d} is 4, and so it is frequent, but not
{a, b} with a frequency of 3 using the value of 4 for Minsup.
• Next, we consider item d that appears after b in the bottom-up order of frequency.
Note that d has a frequency of 6, and by using it as the suffix we get the itemset
{a, d}, which has a frequency of 5 from the left branch, and so it is frequent.
• Finally, we consider a, which has a frequency of 6, and so the itemset {a} is
frequent.
• Based on the above-mentioned conditional mining, we get the following frequent
itemsets: {a : 6}, {d : 6}, {a, d : 5}, {b : 4}, and {b, d : 4}.
2.6 Summary
Data mining deals with large-scale datasets, which may not fit into the main memory. So, the data is stored on secondary storage and transferred in parts into the
memory based on need. Multiple scans of such large databases can be prohibitive
in terms of computation time. So, in order to perform some of the data mining
tasks like clustering, classification, and frequent itemset mining, it is important to
have some scalable approaches. Specifically, schemes requiring a small number of
database scans are important. In this chapter, conventional algorithms used for data
mining were discussed first.
There are three different directions for dealing with large-scale datasets. These
are based on incremental mining, divide-and-conquer approaches, and an intermediate representation. In an incremental algorithm, each data point is processed only
once; so, a single database scan is required for mining. Divide-and-conquer is a
well-known algorithm design strategy, and it can be exploited in the context of the
data mining tasks including clustering, classification, and frequent itemset mining.
The third direction deals with generating an intermediate representation by scanning
the database once or twice and uses the abstraction, instead of the data, for further
processing. Tree structures like CF-tree and FP-tree can be good examples of intermediate representations. Such trees can be built from the data very efficiently.
2.7 Bibliographic Notes
Important data mining tools including clustering, classification, and association rule
mining are discussed in Pujari (2001). A good discussion on clustering is provided
in the books by Anderberg (1973) and Jain and Dubes (1988). They discuss the
single-link algorithm and the k-means algorithm in a detailed manner. Analysis of these
algorithms is provided in Jain et al. (1999). The k-means algorithm was originally
proposed by MacQueen (1967). Initial seed selection is an important step in the
k-means algorithm. Babu and Murty (1993) use genetic algorithms for initial seed
selection. Arthur and Vassilvitskii (2007) presented a probabilistic seed selection
scheme. The single-link algorithm was proposed by Sneath (1957). An analysis of
the convergence properties of the k-means algorithm is provided by Selim and Ismail (1984). Using the k-means step in genetic algorithm-based clustering, which
converges to the global optimum, is proposed and discussed by Krishna and Murty
(1999).
An authoritative treatment on classification is provided in the popular book by
Duda et al. (2000). A comprehensive treatment of the nearest-neighbor classifiers is
provided by Dasarathy (1990). Prototype selection is important in reducing the computational effort of the nearest-neighbor classifier. Ravindra Babu and Murty (2001)
study prototype selection using genetic algorithms. Jain and Chandrasekaran (1982)
discuss the problems associated with dimensionality and sample size. Sun et al.
(2013) propose a feature selection based on dynamic weights for classification. The
problems associated with computing nearest neighbors in high-dimensional spaces
are discussed by François et al. (2007) and Radovanović et al. (2009). Even though
Vapnik (1998) is the proponent of SVMs, they were popularized by the tutorial paper by Burges (1998).
Apriori algorithm for efficient mining of frequent itemsets and association rules
was introduced by Agrawal and Srikant (1994). The FP-tree for mining frequent
itemsets without candidate generation was proposed by Han et al. (2000). Compression of frequent itemsets using clustering was carried out by Xin et al. (2005).
Ananthanarayana et al. (2003) use a variant of the FP-tree, which can be built using one database scan. The role of frequent itemsets in clustering was examined by
Fung (2002). Yin and Han (2003) use frequent itemsets in classification. The role of
discriminative frequent patterns in classification is analyzed by Cheng et al. (2007).
The survey paper by Berkhin (2002) discusses a variety of clustering algorithms
and approaches that can handle large datasets. Different paradigms for clustering
large datasets were presented by Murty (2002). The book by Xu and Wunsch (2009)
on clustering offers a good discussion on clustering large datasets. A major problem with distance-based clustering and classification algorithms is that discrimination becomes difficult in high-dimensional spaces. Clustering paradigms for high-dimensional data are discussed by Kriegel et al. (2009). The Leader algorithm for
incremental data clustering is described in Spath (1980). BIRCH is an incremental
hierarchical platform for clustering, and it is proposed by Zhang (1997). Vijaya et al.
(2005) propose another efficient hierarchical clustering algorithm based on leaders.
Efficient clustering using frequent itemsets was presented by Ananthanarayana et al.
(2001). Murty and Krishna (1980) propose a divide-and-conquer framework for efficient clustering. Guha et al. (2003) proposed a divide-and-conquer algorithm for
clustering stream data. Ng and Han (1994) propose two efficient randomized algorithms in the context of partitioning around medoids.
Viswanath et al. (2004) use a divide-and-conquer strategy on the columns of the
data matrix to improve the performance of the kNNC. Fan et al. (2008) have developed a library, called LIBLINEAR, for dealing with large-scale classification using
logistic regression and linear SVMs. Yu et al. (2003) use CF-tree based clustering
in training linear SVM classifiers efficiently. Asharaf et al. (2006) use a modified
version of the CF-tree for training kernel SVMs. Ravindra Babu et al. (2007) have
reported results on KNNC using run-length-coded data. Random forests proposed
by Breiman (2001) is one of the promising classifiers to deal with high-dimensional
datasets.
The book by Han et al. (2012) provides a wider and state-of-the-art coverage of
several data mining tasks and applications. Topic analysis has become a popular activity after the proposal of latent Dirichlet allocation by Blei (2012). Yin et al. (2012)
combine community detection with topic modeling in analyzing latent communities.
In text mining and information retrieval, Wikipedia is used (Hu et al. (2009)) as an
external knowledge source. Currently, there is a growing interest in analyzing Big
Data (Russom (2011)) and Map-Reduce (Pavlo et al. (2009)) framework to deal with
large datasets.
References
R. Agrawal, R. Srikant, Fast algorithms for mining association rules, in Proceedings of International Conference on VLDB (1994)
V.S. Ananthanarayana, M.N. Murty, D.K. Subramanian, Efficient clustering of large data sets.
Pattern Recognit. 34(12), 2561–2563 (2001)
V.S. Ananthanarayana, M.N. Murty, D.K. Subramanian, Tree structure for efficient data mining
using rough sets. Pattern Recognit. Lett. 24(6), 851–862 (2003)
M.R. Anderberg, Cluster Analysis for Applications (Academic Press, New York, 1973)
D. Arthur, S. Vassilvitskii, K-means++: the advantages of careful seeding, in Proceedings of ACMSODA (2007)
S. Asharaf, S.K. Shevade, M.N. Murty, Scalable non-linear support vector machine using hierarchical clustering, in ICPR, vol. 1 (2006) pp. 908–911
G.P. Babu, M.N. Murty, A near-optimal initial seed value selection for k-means algorithm using
genetic algorithm. Pattern Recognit. Lett. 14(10) 763–769 (1993)
P. Berkhin, Survey of clustering data mining techniques. Technical Report, Accrue Software, San
Jose, CA (2002)
D.M. Blei, Introduction to probabilistic topic models. Commun. ACM 55(4), 77–84 (2012)
L. Breiman, Random forests. Mach. Learn. 45(1), 5–32 (2001)
C.J.C. Burges, A tutorial on support vector machines for pattern recognition. Data Min. Knowl.
Discov. 2, 121–168 (1998)
H. Cheng, X. Yan, J. Han, C.-W. Hsu, Discriminative frequent pattern analysis for effective classification, in Proceedings of ICDE (2007)
B.V. Dasarathy, Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques (IEEE Press,
Los Alamitos, 1990)
R.O. Duda, P.E. Hart, D.J. Stork, Pattern Classification (Wiley-Interscience, New York, 2000)
R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, C.-J. Lin, LIBLINEAR: a library for large linear
classification. J. Mach. Learn. Res. 9, 1871–1874 (2008)
D. François, V. Wertz, M. Verleysen, The concentration of fractional distances. IEEE Trans. Knowl.
Data Eng. 19(7), 873–885 (2007)
B.C.M. Fung, Hierarchical document clustering using frequent itemsets. M.Sc. Thesis, Simon
Fraser University (2002)
S. Guha, A. Meyerson, N. Mishra, R. Motwani, L. O’Callaghan, Clustering data streams: theory
and practice. IEEE Trans. Knowl. Data Eng. 15(3), 515–528 (2003)
J. Han, J. Pei, Y. Yin, Mining frequent patterns without candidate generation, in Proc. of ACMSIGMOD (2000)
J. Han, M. Kamber, J. Pei, Data Mining—Concepts and Techniques (Morgan-Kauffman, San Mateo, 2012)
X. Hu, X. Zhang, C. Lu, E.K. Park, X. Zhou, Exploiting Wikipedia as external knowledge for
document clustering, in ACM SIGKDD, KDD (2009)
A.K. Jain, B. Chandrasekaran, Dimensionality and sample size considerations in pattern recognition practice, in Handbook of Statistics, ed. by P.R. Krishnaiah, L. Kanal (1982), pp. 835–855
A.K. Jain, R.C. Dubes, Algorithms for Clustering Data (Prentice-Hall, Englewood Cliffs, 1988)
A.K. Jain, M.N. Murty, P.J. Flynn, Data clustering: a review. ACM Comput. Surv. 31(3), 264–323
(1999)
H.-P Kriegel, P. Kroeger, A. Zimek, Clustering high-dimensional data: a survey on subspace clustering, pattern-based clustering and correlation clustering. ACM Trans. Knowl. Discov. Data
3(1), 1–58 (2009)
K. Krishna, M.N. Murty, Genetic k-means algorithm. IEEE Trans. Syst. Man Cybern., Part B,
Cybern. 29(3), 433–439 (1999)
J. MacQueen, Some methods for classification and analysis of multivariate observations, in Proceedings of the Fifth Berkeley Symposium (1967)
M.N. Murty, Clustering large data sets, in Soft Computing Approach to Pattern Recognition and
Image Processing, ed. by A. Ghosh, S.K. Pal (World-Scientific, Singapore, 2002), pp. 41–63
M.N. Murty, G. Krishna, A computationally efficient technique for data-clustering. Pattern Recognit. 12(3), 153–158 (1980)
R.T. Ng, J. Han, Efficient and effective clustering methods for spatial data mining, in Proc. of the
VLDB Conference (1994)
A. Pavlo, E. Paulson, A. Rasin, D.J. Abadi, D.J. Dewit, S. Madden, M. Stonebraker, A comparison
of approaches to large-scale data analysis, in Proceedings of ACM SIGMOD (2009)
A.K. Pujari, Data Mining Techniques (Universities Press, Hyderabad, 2001)
M. Radovanović, A. Nanopoulos, M. Ivanović, Nearest neighbors in high-dimensional data: the
emergence and influence of hubs, in Proceedings of ICML (2009)
T. Ravindra Babu, M.N. Murty, Comparison of genetic algorithm based prototype selection
schemes. Pattern Recognit. 34(2), 523–525 (2001)
T. Ravindra Babu, M.N. Murty, V.K. Agrawal, Classification of run-length encoded binary data.
Pattern Recognit. 40(1), 321–323 (2007)
P. Russom, Big data analytics. TDWI Research Report (2011)
S.Z. Selim, M.A. Ismail, K-means-type algorithms: a generalized convergence theorem and characterization of local optimality. IEEE Trans. Pattern Anal. Mach. Intell. 6(1), 81–87 (1984)
P. Sneath, The applications of computers to taxonomy. J. Gen. Microbiol. 17(2), 201–226 (1957)
H. Spath, Cluster Analysis Algorithms for Data Reduction and Classification (Ellis Horwood,
Chichester, 1980)
X. Sun, Y. Liu, M. Xu, H. Chen, J. Han, K. Wang, Feature selection using dynamic weights for
classification. Knowl.-Based Syst. 37, 541–549 (2013)
V.N. Vapnik, Statistical Learning Theory (Wiley, New York, 1998)
P.A. Vijaya, M.N. Murty, D.K. Subramanian, Leaders–subleaders: an efficient hierarchical clustering algorithm for large data sets. Pattern Recognit. Lett. 25(4), 505–513 (2005)
P. Viswanath, M.N. Murty, S. Bhatnagar, Fusion of multiple approximate nearest neighbor classifiers for fast and efficient classification. Inf. Fusion 5(4), 239–250 (2004)
D. Xin, J. Han, X. Yan, H. Cheng, Mining compressed frequent-pattern sets, in Proceedings of
VLDB Conference (2005)
R. Xu, D.C. Wunsch II, Clustering (IEEE Press/Wiley, Los Alamitos/New York, 2009)
X. Yin, J. Han, CPAR: classification based on predictive association rules, in Proceedings of SDM
(2003)
Z. Yin, L. Cao, Q. Gu, J. Han, Latent community topic analysis: integration of community discovery with topic modeling. ACM Trans. Intell. Syst. Technol. 3(4), 63:1–63:23 (2012).
H. Yu, J. Yang, J. Han, Classifying large data sets using SVM with hierarchical clusters, in Proc.
of ACM SIGKDD (KDD) (2003)
T. Zhang, Data clustering for very large datasets plus applications. Ph.D. Thesis, University of
Wisconsin–Madison (1997)
Chapter 3
Run-Length-Encoded Compression Scheme
3.1 Introduction
Data Mining deals with a large number of patterns of high dimension. While dealing
with such data, a number of factors become important such as size of data, dimensionality of each pattern, number of scans of database, storage of entire data, storage
of derived summary of information, computations involved on entire data that lead
to summary of information, etc. In the current chapter, we propose compression algorithms that work on patterns with binary-valued features. However, the algorithms
are also applicable to floating-point-valued features, provided that these are appropriately quantized into a binary-valued feature set.
Conventional methods of data reduction include clustering, sampling, use of sufficient statistics or other derived information from the data. For clustering and classification of such large data, the computational effort and storage space required
would be prohibitively large. In structural representation of patterns, string matching is carried out using an edit distance and the longest common subsequence. One
possibility of dealing with such patterns is to represent them as runs. Efficient algorithms exist to compute approximate and exact edit distances of run-length-encoded
strings. In the current chapter, we focus on a numerical similarity measure.
We propose a novel idea of compressing the binary data and carry out clustering
and classification directly on such compressed data. We use run-length encoding
and demonstrate that such compression reduces both storage space and computation
time. The work is directly applicable to mining of large-scale business transactions.
The major contribution of the idea is a scheme that achieves a number of goals at once: reduced storage space, computation of the distance function on run-length-encoded binary patterns without having to decompress them, the same classification accuracy as obtained using the original uncompressed data, and significantly reduced processing time.
We discuss theoretical foundations for such an idea, as well as practical implementation results.
3.2 Compression Domain for Large Datasets
Data Mining deals with large datasets. The datasets can be formal databases, data
warehouses, flat files, etc. From the Pattern Recognition perspective, a “Large
Dataset” can be defined as a set of patterns that are not amenable for in-memory
storage and operations.
Large Dataset Let n and d represent the number of patterns and the number of
features, respectively. Largeness of data can be defined as one of the following.
• n is small, and d is large
• n is large, and d is small
• n is large, and d is large.
Algorithms needing multiple data scans result in high processing requirements.
This motivates one to look for conventional and hybrid algorithms that are efficient
in storage and computations.
Viewing from Pattern Recognition perspective, a “large dataset” contains a large
number of patterns, with each pattern being characterized by a large number of
attributes or features. For example, in case of large transaction data consisting of
a large number of items, transactions are treated as patterns with items serving as
features. The largeness of datasets would make direct use of many conventional
iterative Pattern Recognition algorithms for clustering and classification unwieldy.
Generation of data abstraction by means of clustering was earlier successfully
carried out by various researchers. In order to scale up clustering algorithms, intelligent compression techniques were developed which generate sufficient statistics in
the form of clustering features from a large input data. Such statistics were further
used for clustering. Earlier work on a scalable clustering framework identified the regions of the database that must be kept in memory, the regions that are compressible, and the regions that can be discarded. The literature also describes a method known as "squashing," which consists of three steps: grouping the large input data into mutually exclusive groups, computing low-order moments within each group, and generating pseudo-data. Such data can be further used for clustering. Another important research contribution was the use of a novel frequent pattern
tree structure for storing compressed, crucial information about frequent patterns.
Against this background, we attempt to represent the data in a compressed or compact way in a lossless manner and to carry out clustering and classification directly on the compressed data. One such scheme, considered in this chapter, is a compact data representation on which clustering and classification are carried out directly. The compact and original representations are in one-to-one correspondence. We call a compact data representation lossless when the uncompressed data regenerated from it matches the original data exactly, and lossy otherwise.
We illustrate working of the proposed ideas on handwritten digit data.
Handwritten Digit Dataset Handwritten digit data considered for illustration
consists of 100,030 labeled 192-feature binary patterns. The data consists of 10
categories, viz., 0 to 9. Of this entire data, 66,700 patterns, equally divided into 10
categories, are considered as training patterns, and 33,330 as test patterns, with approximately 3330 patterns per class. We present some sample handwritten patterns.
Each 192-feature pattern is represented as a 16 × 12 matrix. The patterns in the
figure represent nonzero features. They are indicative of the zero and nonzero feature combination, leading to varying run sequences of zeroes and ones.
3.3 Run-Length-Encoded Compression Scheme
We discuss a scheme that compresses binary data as run lengths. A novel algorithm
that computes dissimilarity in the compressed domain is presented. We begin by
defining related terms, which in turn are used in describing the algorithm.
3.3.1 Discussion on Relevant Terms
1. Run. In any ordered sequence of elements of two kinds, each maximal subsequence of elements of like-kind is called a run.
For example, sequences 111 and 0000 are runs of 1s and 0s, respectively.
2. Run Length. The number of continuous elements of same kind is defined as run
length.
For example, the sequence of binary numbers 1110000 has run lengths of 3
and 4 of 1s and 0s, respectively.
3. Run-String. A complete sequence of runs in a pattern is called a run-string.
For example, the run-string of the pattern 111000011000 is “3 4 2 3”.
4. Length of Run-String. The length of a run-string is defined as the sum of runs in
the string. For example, the length of the run-string “3 4 2 3” is 12 (3 + 4 + 2 + 3).
5. Run Dimension. The number of individual runs in a run-string is defined as the
run dimension. For example, the pattern 111000011000 has the run-string of
“3 4 2 3”. Its length is 12, and the run dimension is 4.
6. Compressed Data Representation. Every pattern is assumed to start with a run of 1s; if the pattern actually begins with 0s, the length of this first run of 1s is recorded as 0. The Compressed Data Representation (CDR) thus consists of runs, always starting with a run of 1s. For example, the Compressed Data Representations of 110011 and 001100 are "2 2 2" and "0 2 2 2", respectively.
Table 3.1 Illustrations of compressed data representation

Sl. No.   Pattern         Compressed data representation   Length of the   Run
                          of the run-string                run-string      dimension
1         111100110011    4 2 2 2 2                        12              5
2         011111001110    0 1 5 2 3 1                      12              6
3         111111111111    12                               12              1
4         000000000000    0 12                             12              2
5         101010101010    1 1 1 1 1 1 1 1 1 1 1 1          12              12
6         010101010101    0 1 1 1 1 1 1 1 1 1 1 1 1        12              13
7. Decompression of Compressed Data Representation. Based on the definitions
above, the Compressed Data Representation of a given pattern can be expanded
to its original form. The Decompressed representation of CDR of a pattern consists of expanding the run-string form of the pattern into binary data.
For example, consider the pattern 001000. The CDR of the pattern is "0 2 1 3". Conversely, the CDR "0 2 1 3" implies that the original pattern starts with 0s; the number of leading 0s is 2, followed by one 1 and then three 0s, i.e., 001000.
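The mapping between a binary pattern and its CDR can be sketched as follows (Python; the function names are ours):

# Sketch: Compressed Data Representation (CDR) of a binary pattern ('0'/'1'
# string) and its inverse. The CDR always begins with the count of leading 1s.
def compress(pattern):
    runs, current, count = [], '1', 0
    for bit in pattern:
        if bit == current:
            count += 1
        else:
            runs.append(count)
            current, count = bit, 1
    runs.append(count)
    return runs

def decompress(runs):
    bits, current = [], '1'
    for r in runs:
        bits.append(current * r)
        current = '0' if current == '1' else '1'
    return ''.join(bits)

For example, compress('001000') returns [0, 2, 1, 3], and decompress([0, 2, 1, 3]) returns '001000'.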
Below, we state and elaborate the properties that will be used in Algorithm 3.1.
Subsequently, we prove lemmas that are based on the proposed representation.
The definitions are explained through a few illustrations in Table 3.1.
3.3.2 Important Properties and Algorithm
Property 3.1 For a constant length of input binary patterns, the run dimension
across different patterns need not be the same.
Every given pattern under study consists of a constant number of features. The
features in the current context are binary. The data to be classified consists of intra- and inter-pattern variations. The variations in the patterns in terms of shape, size,
orientation, etc. result in varying lengths of runs of 0s followed by 1s and vice
versa. Hence, the run dimension across different patterns need not be the same. This
is illustrated by the following example. Consider two patterns with equal number of
features, 11001100 and 11110001. The corresponding run-strings are "2 2 2 2" and "4 3 1".
The run dimensions of the two patterns are 4 and 3, respectively.
Property 3.2 The sum of runs of a pattern is equal to the total number of bits
present in the binary pattern.
The property follows from definitions 3 and 4 in Sect. 3.3.1. It is true irrespective
of whether the pattern starts with 0 or nonzero feature value. The following example
illustrates this fact. Consider a pattern having m binary features. Let the features
consist of p continuous sequences of like kind, alternating between 1s and 0s and
leading to a run sequence of q1 , q2 , . . . , qp .
Bit string: bm−1 . . . b1 b0
Run string: q1 q2 . . . qp
Then q1 + q2 + · · · + qp = m.
Property 3.3 In a CDR, counted from left to right over positions 1, 2, 3, . . . , the entries at odd positions 1, 3, 5, . . . represent the numbers of continuous 1s, and the entries at even positions 2, 4, 6, . . . represent the numbers of continuous 0s. This follows from definition 6
of Sect. 3.3.1.
Property 3.4 The run dimension of a pattern can, at most, be one more than the
run-string length.
We provide an Algorithm 3.1 for computation of dissimilarity between two compressed patterns directly in the compressed domain with the help of run-strings.
To start with, all the patterns are converted to Compressed Data Representation.
Let C1 [1 . . . m1 ] and C2 [1 . . . m2 ] represent any two patterns in Compressed Data
Representation form. We briefly discuss the algorithm. In Step 1 of the algorithm,
we read the patterns in their compressed form, C1 [1 . . . m1 ] and C2 [1 . . . m2 ], with
m1 and m2 being the lengths of the two compressed patterns considered. It should
be noted that m1 and m2 need not be equal, which most often is the case, even
when both the patterns belong to the same class. In Step 2, we initialize the
runs corresponding to compressed forms C1 [·] and C2 [·], viz., R1 and R2 , to the
first runs in each of the compressed patterns and set counters runlencounter1 and
runlencounter2 to 1. In Steps 4 to 7, we compute the difference between R1 and R2 ,
iteratively, till one of them is reduced to zero. As soon as one of them is reduced to zero, the next element of C1 [1 . . . m1 ] or C2 [1 . . . m2 ] is considered based
on which of them is reduced to zero. The distance is incremented by the minimum of current values of R1 and R2 in Step 6 whenever the difference between
counters runlencounter1 and runlencounter2 is odd. It should be noted that when
|runlencounter1 − runlencounter2 | is odd, the corresponding runs are of unlike kind,
viz., 0s and 1s. In Step 7, runlencounter1 and runlencounter2 are appropriately reset. Step 9 returns the Manhattan distance between the two patterns. The while-loop
is terminated when runlencounter1 exceeds m1 or runlencounter2 exceeds m2 .
Algorithm 3.1 (Computation of Distance between Compressed Patterns)
Step 1: Read Compressed Pattern-1 in array C1 [1 . . . m1 ] and Compressed
Pattern-2 in array C2 [1 . . . m2 ]
Step 2: Initialize
runlencounter1 and runlencounter2 to 1
R1 = C1 [runlencounter1 ], R2 = C2 [runlencounter2 ]
distance = 0
Step 3: WHILE-BEGIN(from Step-4 to Step-8)
Step 4:
If R1 = 0
(a) increment runlencounter1 by 1,
(b) if runlencounter1 > m1 , go to Step 9 (BREAK),
(c) load C1 [runlencounter1 ] in R1
Step 5:
If R2 = 0
(a) increment runlencounter2 by 1,
(b) if runlencounter2 > m2 , go to Step 9 (BREAK),
(c) load C2 [runlencounter2 ] in R2
Step 6:
If |runlencounter1 − runlencounter2 | is odd, increment distance by
min(R1 ,R2 )
Step 7:
If R1 ≥ R2
(a) Subtract R2 from R1 ,
(b) Set R2 = 0
Else
(a) Subtract R1 from R2 ,
(b) Set R1 = 0
Step 8: WHILE-END
Step 9: Return distance
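A direct transcription of Algorithm 3.1 is sketched below (Python); it is our own rendering of the steps above and is checked only against the small example that follows.

# Sketch: Manhattan (Hamming) distance computed directly on two patterns in
# Compressed Data Representation, following Algorithm 3.1.
def compressed_distance(C1, C2):
    m1, m2 = len(C1), len(C2)
    i1 = i2 = 1                        # runlencounter1, runlencounter2
    R1, R2 = C1[0], C2[0]              # first runs (Step 2)
    distance = 0
    while True:
        if R1 == 0:                    # Step 4: next run of pattern 1
            i1 += 1
            if i1 > m1:
                break
            R1 = C1[i1 - 1]
        if R2 == 0:                    # Step 5: next run of pattern 2
            i2 += 1
            if i2 > m2:
                break
            R2 = C2[i2 - 1]
        if abs(i1 - i2) % 2 == 1:      # Step 6: runs of unlike kind overlap
            distance += min(R1, R2)
        if R1 >= R2:                   # Step 7: consume the shorter run
            R1, R2 = R1 - R2, 0
        else:
            R2, R1 = R2 - R1, 0
    return distance                    # Step 9

# For the patterns 10110111 and 01101101 of the example that follows,
# compressed_distance([1, 1, 2, 1, 3], [0, 1, 2, 1, 2, 1, 1]) returns 5.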
The computation is illustrated through an example. Consider two patterns,
[10110111] and [01101101]. The Manhattan distance between the two patterns in
their original, uncompressed form is 5. By definition 6 of Sect. 3.3.1, the Compressed Data Representations of these patterns, respectively, are [1 1 2 1 3] and
[0 1 2 1 2 1 1]. At Step 2, m1 = 5, m2 = 7, R1 = 1, R2 = 0, runlencounter1 = 1,
runlencounter2 = 1. The following is the computation path with values at the end
of various steps.
(a) Step 1: C1 = [1 1 2 1 3], C2 = [0 1 2 1 2 1 1]
Step 2: runlencounter2 = 1, runlencounter1 = 1, R1 = 1, R2 = 0, distance = 0
(b) Step 5: runlencounter1 = 1, runlencounter2 = 2, R1 = 1, R2 = 1
Step 6: counter-difference is odd, distance = 0 + 1 = 1
Step 7: R1 = 0, R2 = 0
(c) Step 4: runlencounter1 = 2, runlencounter2 = 2, R1 = 1, R2 = 0
Step 5: runlencounter1 = 2, runlencounter2 = 3, R1 = 1, R2 = 2
Step 6: counter-difference is odd, distance = 1 + 1 = 2
Step 7: R1 = 0, R2 = 1
(d) Step 4: runlencounter1 = 3, runlencounter2 = 3, R1 = 2, R2 = 1
Step 7: R1 = 1, R2 = 0
(e) Step 5: runlencounter1 = 3, runlencounter2 = 4, R1 = 1, R2 = 1
Step 6: counter-difference is odd, distance = 2 + 1 = 3
Step 7: R1 = 0, R2 = 0
(f) Step 4: runlencounter1 = 4, runlencounter2 = 4, R1 = 1, R2 = 0
Step 5: runlencounter1 = 4, runlencounter2 = 5, R1 = 1, R2 = 2
Step 6: counter-difference is odd, distance = 3 + 1 = 4
Step 7: R1 = 0, R2 = 1
(g) Step 4: runlencounter1 = 5, runlencounter2 = 5, R1 = 3, R2 = 1
Step 7: R1 = 2, R2 = 0
(h) Step 5: runlencounter1 = 5, runlencounter2 = 6, R1 = 2, R2 = 1
Step 6: counter-difference is odd, distance = 4 + 1 = 5
Step 7: R1 = 1, R2 = 0
(i) Step 5: runlencounter1 = 5, runlencounter2 = 7, R1 = 1, R2 = 1
Step 7: R1 = 0, R2 = 0
(j) Step 4: runlencounter1 exceeds m1 ; Step 9: return distance as 5.
By definition, f is a function from χ A into χ B if for every element of χ A , there is an
assigned unique element of χ B , where χ A is the domain, and χ B is the range of the
function. The function is one-to-one if different elements of the domain χ A have
distinct images. The function is onto if each element of χ B is the image of some
element of χ A . The function is bijective if it is one-to-one and onto. A function is
invertible if and only if it is bijective.
Lemma 3.1 Let χ A and χ B represent original and compressed data representations. Then f : χ A −→ χ B is a function, and it is invertible.
Proof For every element of original data in χ A , viz., every original pattern, there
is a unique element in χ B , viz., compressed representation. Each of the images is
distinct. Hence, the function is one-to-one and onto and hence bijective. Alternately,
consider a mapping from χ B to χ A . Every compressed data representation leads to
a unique element of the original data. Hence, the function is invertible. Specifically,
note that |χ B | = |χ A |.
Lemma 3.2 Let χ A and χ B represent original and compressed data representations. Let (xa , ya ) and (xb , yb ) denote arbitrary patterns represented in χ A and χ B
representations, respectively. Let the length of the strings in A be n. Then,
d(xa , ya ) = d′ (xb , yb ),
where d represents the Manhattan distance function between original data points,
and d′ represents the Manhattan distance computation procedure, based on Algorithm 3.1, between compressed data points.
Proof The proof is based on mathematical induction on n. For n = 1, each position
of xa and ya consists of either 0 or 1. The corresponding run-string by definition 6
is either 01 or 1, respectively.
Case a: Each of xa and ya is equal to 0. Then d(xa , ya ) = 0.
The corresponding run-strings of xb and yb are equal to 01.
By Algorithm 3.1, d′ (xb , yb ) = 0.
Case b: Each of xa and ya is equal to 1. Then d(xa , ya ) = 0.
The corresponding run-strings of xb and yb are equal to 1.
By Algorithm 3.1, d′ (xb , yb ) = 0.
Case c: xa = 1 and ya = 0. Then d(xa , ya ) = 1.
The corresponding run-strings of xb and yb are equal to 1 and 01.
By Algorithm 3.1, d′ (xb , yb ) = 1.
Case d: xa = 0, ya = 1, and d′ (xb , yb ) = 1. The proof is the same as given in Case c.
Let the lemma be true for n = k, for some k ≥ 1.
For n = k + 1, the additional bit (feature) is either 0 or 1. With this additional bit, the d-function contributes either 0 or 1 depending on whether the (k + 1)st bits of the two patterns are alike or different, resulting in an additional distance of 0 or 1. In
case of χ B , a bit matching with the previous bits will lead to incrementing last run
by 1 or creation of an additional run. With all previous bits in case of χ A and all
previous runs in case of χ B remaining unchanged, this leads to the situation where
the run dimension is incremented by 1 or only incrementing the run-size by 1. This
leads to the condition of Case a to Case d as discussed for n = 1. Thus the lemma
is proved. The original and compressed data representations, χ A and χ B , provide
stable ordering, i.e., the distances take the same values, and hence the same relative order, in representation χ B as they do in representation χ A .
Corollary 3.1 The representations χ A and χ B provide stable ordering.
Proof Consider an arbitrary pattern xa in χ A . Let the corresponding pattern in χ B be xb . The ordered distances between xa and every other pattern of χ A are (x 1 , x 2 , . . . , x k ). By Lemma 3.2, the distances d and d′ provide the same values for the equivalent patterns of χ A and χ B . Thus, the ordered distances between xb and the corresponding patterns in χ B are also (x 1 , x 2 , . . . , x k ). Hence, the representations χ A and χ B provide stable ordering.
Corollary 3.2 Classification Accuracy of kNNC for any valid k in both the schemes
χ A and χ B is the same.
Proof By Corollary 3.1, it is clear that the representations χ A and χ B provide stable ordering. Thus, the classification accuracy based on kNNC computed using representation χ A and representation χ B is the same.
The Minkowski metric for d-dimensional patterns a and b is defined as
Lq (a, b) = ( Σ_{i=1}^{d} |ai − bi |^q )^{1/q} .
This is also referred to as the Lq norm. L1 and L2 norms are called the Manhattan and Euclidean distances, respectively. The Hamming distance is defined as the
number of places where two vectors differ.
Lemma 3.3 The L1 norm and Hamming distances coincide for patterns with
binary-valued features.
Lemma 3.4 The L1 norm computation is more expensive than that of the Hamming
distance.
Proof It is clear from the above discussion and from Lemma 3.3 that, although the results coincide, because of the additional mathematical operation of finding the absolute
value of the difference in L1 norm, it is more expensive than computation of the
Hamming distance.
Lemma 3.5 Hamming distance is equal to the squared Euclidean distance in the
case of patterns with binary-valued features.
In view of Lemmas 3.3, 3.4, and 3.5, we consider the Hamming distance as a dissimilarity measure for HW data used in the current work.
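The coincidence of these measures on binary-valued features is easy to verify numerically; the snippet below is only an illustrative check (Python).

# Sketch: for binary-valued features, L1, Hamming, and squared Euclidean agree.
def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def hamming(a, b):
    return sum(1 for x, y in zip(a, b) if x != y)

def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

a = [1, 0, 1, 1, 0, 1, 1, 1]
b = [0, 1, 1, 0, 1, 1, 0, 1]
assert l1(a, b) == hamming(a, b) == squared_euclidean(a, b) == 5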
3.4 Experimental Results
We consider multiple scenarios where the proposed algorithm can be applied such as
classification of handwritten digit data, genetic algorithms, and artificial spacecraft
health data.
3.4.1 Application to Handwritten Digit Data
The algorithm is applied to 10 % of the considered handwritten digit data. We carried out experiments in two stages in order to demonstrate (a) the nonlossy compression nature of the algorithm and (b) the savings in processing time. In stage 1, the data
is compressed and decompressed. The decompressed data is found to match the original data exactly, both in content and size. Table 3.2 provides statistics
of class-wise runs. Columns 2 and 3 contain arithmetic mean and standard deviation of the run dimension. The maximum run length in the class of any of 1s or 0s
is given in Column 4. The range, a measure of dispersion, of the set of values is
defined as the difference between maximum and minimum of values in the set. The
range of run dimension is given in Column 5. Column 2 contains the measure of
central tendency, and Columns 3 and 5 contain the measures of dispersion. It can be
seen from the table that 3σ limits based on sample statistics of any class is much
less than the number of features of the original pattern. Figure 3.1 contains statistics
of number of runs for class label “0” for about 660 patterns. The figure indicates
variation in the number of runs for different patterns. It can be observed from the
figure that even for patterns belonging to same class, there is a significant variability
in the number of runs. The patterns are randomly ordered, and hence the diagram
does not demonstrate any secular trend among the patterns.
In stage 2, both the original and compressed data are subjected, independently, to
the k-Nearest-Neighbor Classifier (kNNC) for different k values from 1 to 20. The
results are provided in Fig. 3.2. Here d is used on the original data, and d′ is used on the compressed data. The classification accuracies computed with the original dataset and those computed in the compressed domain are compared; they match exactly. This clearly indicates that the compression did not lead to any loss of information.
Fig. 3.1 Run statistics of class label “0” of 10 % data
Table 3.2 Class-wise run statistics

Class     Average class-wise   Standard    Max. run          Range of run
label     run dimension        deviation   length in class   dimension
(1)       (2)                  (3)         (4)               (5)
0         52.8                 4.19        11                30
1         35.0                 0.16        35                4
2         41.6                 5.06        38                32
3         39.4                 3.68        12                20
4         45.8                 4.30        11                24
5         38.7                 3.59        55                25
6         45.0                 6.02        11                38
7         39.0                 3.87        12                20
8         46.7                 5.03        11                30
9         43.1                 4.04        11                28
The CPU times taken on a single-processor computer are presented.
The results are provided in Table 3.3. The CPU time provided in the table refers
to the difference of time obtained through system calls at the start and end of the
execution of the program. With kNNC, the best accuracy of 92.47 % is obtained for
k = 7.
It can be observed from Table 3.3 that the training data and test data sizes
are reduced by about three times after applying the proposed algorithm, and the
CPU time requirement is reduced by about 5 times.
Fig. 3.2 Classification Accuracy with different values of k using kNNC
Table 3.3 Data size and processing times

Description of data                Training data   Test data    CPU time (sec) of kNNC
Original data as features          2,574,620       1,286,538    527.37
Compressed data in terms of runs   865,791         432,453      106.83
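The way the classifier is driven on the compressed data can be sketched as follows (Python); the data layout is illustrative, and compressed_distance refers to the sketch of Algorithm 3.1 given in Sect. 3.3.2, not to the authors' implementation.

# Sketch: kNNC operating directly on Compressed Data Representations.
# Assumes compressed_distance() from the Algorithm 3.1 sketch is available.
from collections import Counter

def knn_classify(test_cdr, train, k):
    # train: list of (cdr, label) pairs; test_cdr: one compressed pattern.
    neighbours = sorted(train,
                        key=lambda p: compressed_distance(test_cdr, p[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]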
3.4.2 Application to Genetic Algorithms
We present an overview of genetic algorithms before providing application of the
proposed scheme to genetic algorithms.
3.4.2.1 Genetic Algorithms
Genetic algorithms are randomized search algorithms for finding an optimal solution
to an objective function. They are inspired by natural evolution and natural genetics.
The algorithm simultaneously explores a population of possible solutions through
generations with the help of genetic operators. There exist many variants of genetic
algorithms. We discuss Simple Genetic Algorithm in the current subsection. The
genetic operators used in Simple Genetic Algorithm are the following.
• Selection
• Cross-over
• Mutation
An important step in finding a solution is to encode a given problem for which an
optimal solution is required. The solution is found in the encoded space. A common method is to encode a candidate solution as a binary string of length l, with a decimal mapping from the string to the various parameters over which the objective function is optimized. Consider a population of p such strings. The value of the objective function, or a fitness function, is computed
as a function of these parameters and evaluated for each string. The following is an
example of a population of strings with p = 4 and l = 20. The strings are initialized
randomly.
1: 01010011010011101011
2: 01101010010001000100
3: 01011001110101001001
4: 11010110100010100101
Next, the next generation of the population is computed using the above genetic operators. We first discuss selection. There are a number of approaches to select highly fit individuals from one generation to the next. One such method is proportionate selection, where, based on fitness values, more copies of highly fit individuals from the previous generation are carried forward to the next generation. This ensures survival of the fittest individuals. As a second step, the selected individuals are subjected to cross-over. We briefly discuss single-point cross-over; there are alternative approaches
to cross-over known as uniform cross-over, 2-point crossover, etc. The cross-over
operation is performed on a pair of individuals, choosing them based on the probability of cross-over. Consider two strings randomly. Choose a location of cross-over
within strings randomly between 1 and l − 1. In order to illustrate cross-over, let the
location be 8, counting from 0. The genetic material between 0 and 8 is interchanged
between the two strings to generate two new strings in the following manner.
Strings before cross-over operation:
1: 01010011010011101011
3: 01011001110101001001
Strings after cross-over operation:
1: 01010011010101001001
3: 01011001110011101011
It can be noticed in the above schematic that a contiguous segment is exchanged between the chosen pair of strings. Cross-over helps in exploring newer solutions. Mutation
refers to flipping the string value between 0 and 1. The operation is performed based
on the probability of mutation. The following is an example of mutation operation
performed at randomly chosen location, say, 11.
Initial string:
01011001110011101011
String after Mutation operation: 01011001110111101011
It can be observed in the above schematic that at location 11, the bit value is flipped
from 0 to 1. This provides an occasional ability to explore a new solution, especially when the search is otherwise not producing new candidates. In summary, a genetic algorithm is characterized
by the following set of key constituents.
• Encoding mechanism of solutions
• Probability of cross-over, Pc . Experimentally, it is chosen to be around 0.9
• Probability of mutation, Pm . It is usually kept small, since a large value can reduce the search to a random walk over solutions
• Probability of Initialization, Pi , can be optionally chosen as a parameter. It dictates the solution space to be explored
• Termination criterion for the convergence to a near-optimal solution
• Appropriate mechanism for selection, cross-over, and mutation
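A minimal sketch of the Simple Genetic Algorithm loop built from these constituents is shown below (Python); the fitness function, parameter values, and termination by a fixed number of generations are placeholder assumptions.

# Sketch: Simple Genetic Algorithm with proportionate selection, single-point
# cross-over, and bit-flip mutation; fitness and parameters are placeholders.
import random

def simple_ga(fitness, l=20, pop_size=4, pc=0.9, pm=0.01, generations=100):
    pop = [[random.randint(0, 1) for _ in range(l)] for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(s) for s in pop]
        weights = [s + 1e-9 for s in scores]    # avoid all-zero weights
        # Proportionate selection: fitter strings get more copies.
        pop = [random.choices(pop, weights=weights, k=1)[0][:]
               for _ in range(pop_size)]
        # Single-point cross-over on consecutive pairs.
        for i in range(0, pop_size - 1, 2):
            if random.random() < pc:
                cut = random.randint(1, l - 1)
                pop[i][cut:], pop[i + 1][cut:] = pop[i + 1][cut:], pop[i][cut:]
        # Bit-flip mutation.
        for s in pop:
            for j in range(l):
                if random.random() < pm:
                    s[j] ^= 1
    return max(pop, key=fitness)

# Toy usage: maximize the number of 1s; in the applications discussed next the
# fitness could instead be, e.g., classification accuracy on compressed data.
best = simple_ga(fitness=sum, l=20, pop_size=10, generations=50)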
3.4.2.2 Application
Usually, it takes a large number of generations to converge to an optimal or a near-optimal solution with genetic algorithms. The computational expense is dominated
by evaluation of the fitness function. A large population size requires more time
for evaluation of each string at every generation. Consider a case where the fitness function is the classification accuracy computed over a large set of high-dimensional patterns. The features either are binary-valued or are mapped to binary
values. The algorithm is directly applicable to such a scenario, leading to significant
saving in computation time in arriving at convergence. Some applications of the
scheme are optimal feature selection where the string represents complete pattern
with each bit representing presence or absence of a feature, and optimal prototype
selection where the string is encoded as a parameter such as a distance threshold for
a leader clustering algorithm that leads to optimal number of clusters, etc.
3.4.3 Some Applicable Scenarios in Data Mining
In scenarios dealing with large data such as classification of transaction-type data
or anomaly detection based on Spacecraft Health Keeping (HK) data, the proposed
scheme provides significant improvement in (a) storage of the data in their compressed form and (b) classification of the compressed data directly. The HW data
can be represented as business transaction data consisting of transaction-wise item
purchase status as illustrated in the current chapter. It is clear from the presentation that the efficiency of the algorithm increases with sparseness of the data. In the
following subsection, a scheme is proposed where data summarization or anomaly
detection of Spacecraft HK data is presented.
3.4.3.1 A Model for Application to Storage of Spacecraft HK Data
It is a common practice in many Space Organizations to store the spacecraft health
data for the entire mission life. Although it is possible to store the data by compressing it through conventional methods, further analysis or operations on the data then require decompression, resulting in additional computational effort. Moreover, lossy compression might result in some loss of information.
The advantage of the proposed scheme can be summarized as below.
• The data compression through the scheme is lossless. Thus, data can be stored in
compressed form, reducing storage requirements
• Data analysis involving dissimilarity computation such as clustering and classification can make use of the proposed algorithm for dissimilarity computation
directly between compressed patterns
3.4.3.2 Application to Anomaly Detection
Consider a remote sensing spacecraft carrying an optical imaging payload. The time
period during which a camera is switched on is called the duration of payload operation. In order to monitor a parameter, say, current variation (amp) during a payload
operation, one strips out the relevant bytes from digital HK-data. The profile of the
parameter is obtained by plotting the parameter against time. After appropriate preprocessing and normalization to fit to common pattern size, the choice of features
can be either a set of sample statistics, such as moments, autocorrelation peaks, standard deviations, and spectral peaks, or forming a pattern for structural matching. In
case of structural matching, the profile can be digitized with appropriate quantization such that all points of inflexion are present. For example, a profile containing,
say, k peaks can be digitized into m rows and n columns. The choice of m and n is
problem dependent. Such a structure consists of binary values indicating the presence or absence of the profile in a given cell, similar to HW data. However, it should
be noted that with reducing value of m and n, the new form of data becomes more
and more lossy. Data-dependent analysis helps arriving at optimal m and n. Thus,
the data in real numbers is reduced in terms of binary data. The data is compressed
by the above scheme and stored for mission life. Data summarization by means of
clustering or anomaly detection by means of classification can make use of the above
compressed data directly. This forms a direct application of the proposed scheme.
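One way to realize this digitization is sketched below (Python); the grid size, the normalization, and the one-cell-per-column marking rule are our own assumptions, since the text leaves these choices problem dependent.

# Sketch: digitize a sampled parameter profile into an m x n binary grid.
def digitize_profile(samples, m, n):
    lo, hi = min(samples), max(samples)
    span = (hi - lo) or 1.0
    grid = [[0] * n for _ in range(m)]
    for j in range(n):
        s = samples[j * len(samples) // n]      # one sample per time bin
        row = min(m - 1, int((s - lo) / span * m))
        grid[m - 1 - row][j] = 1                # row 0 at the top of the grid
    return grid

Each such grid is then a binary pattern that can be compressed and compared exactly like the handwritten digit data.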
3.5 Invariance of VC Dimension in the Original and
the Compressed Forms
The Vapnik–Chervonenkis (VC) dimension provides a general measure of complexity and gives associated bounds on learnability. Statistical learning theory (SLT)
describes statistical estimation with finite training data. VC theory takes sample
size into account and provides quantitative description of the trade-off between the
model complexity and information available through finite training data. The SLT is
built upon the concepts of VC Entropy, VC dimension, and empirical risk minimization principle. We present the following definitions prior to proposing a theorem. Let
f (X, ω) be a class of approximating functions indexed by abstract parameter ω with
respect to a finite training dataset X. ω can be scalar, vector, or matrix belonging to
a set of parameters Ω.
1. Risk Functional. Given a finite sample (xi , yi ) of size n, L(y, f (x, ω)) is the
loss or discrepancy between output produced by the system and the learning
machine for a given point x. The expected value of loss or discrepancy is called
a risk functional, R(ω), which supposes the knowledge of the probability density
function of the population from which the above sample is drawn. R(ω∗ ) is the
unknown “True Risk Functional.”
2. Empirical Risk and ERM. It is the arithmetic average of loss over the training
data. Empirical Risk Minimization (ERM) is an inductive learning principle.
A general property necessary for any inductive principle is asymptotic consistency. It requires that the estimates provided by ERM should converge to true
values as the number of training data sample size grows large. Learning theory
helps to formulate conditions for consistency of the ERM principle.
3. Consistency of ERM principle. For bounded loss functions, the ERM principle
is consistent iff the empirical risk converges uniformly to the true risk in the
following sense:
lim_{n→∞} P{ sup_ω (R ∗ (ω) − Remp (ω)) > ε } = 0, ∀ε > 0.
Here P indicates probability, Remp (ω) the empirical risk for sample of size n,
and R ∗ (ω) is the true risk for the same parameter values, ω. It indicates that any
analysis of ERM principle must be a “worst-case analysis.”
Consider a class of indicator functions Q(z, ω), ω ∈ Ω, and a given sample
Zn = {zi , i = 1, 2, . . . , n}. The diversity of a set of functions with respect to a
given sample can be measured by the number of different dichotomies, N (Zn ),
that can be implemented on the sample using the functions Q(z, ω).
4. VC entropy. The random entropy is defined as H (Zn ) = ln N (Zn ), which is a
random variable. Averaging the random entropy over all possible samples of size
n generated from distribution F (z) gives
H (n) = E[ln N (Zn )];
H (n) is the VC entropy of the set of indicator functions on a sample of size n.
The VC Entropy is a measure of the expected diversity of a set of indicator functions with respect to a sample of a given size, generated from an unknown distribution.
5. Growth function. The growth function is defined as the maximum number of dichotomies that can be induced on a sample of size n using the indicator functions
Q(z, ω) from a given set:
G(n) = ln max_{Zn} N (Zn ),
where maximum is taken over all possible samples of size n regardless of the
distribution. The growth function depends only on the set of functions Q(z, ω)
and provides an upper bound for the distribution-dependent entropy. A necessary
and sufficient condition for consistency of the ERM principle is
lim_{n→∞} H (n)/n = 0.
However, it uses the notion of VC entropy defined in terms of unknown distribution, and the convergence of the empirical risk to the true risk may be very slow.
The asymptotic rate of convergence is called fast if for any m > m0 and c > 0, the following exponential bound holds:
P{R(ω) − R(ω∗ ) > ε} < e^(−cmε²).
Statistical learning theory provides a distribution-independent necessary and sufficient condition for consistency of ERM and fast convergence, viz.,
lim_{n→∞} G(n)/n = 0.
The growth function is either linear or bounded by a logarithmic function of the
number of samples n. The VC dimension is that value of n (= h) at which the
growth starts to slow down. When the value is finite, then for large samples, the
growth function does not grow linearly. It is bounded by a logarithmic function,
viz.,
G(n) ≤ h(1 + ln(n/h)).
If the bound is linear for any n, G(n) = n ln 2, then the VC-dimension for the
set of indicator functions is infinite, and hence no valid generalization is possible. The VC dimension is explained in terms of shattering. If h samples can
be separated by a set of indicator functions in all 2^h possible ways, then this set of samples is said to be shattered by the set of functions, and there do not exist h + 1 samples that can be shattered by a set of functions. For binary partitions of size n, N (Zn ) = 2^n and G(n) ≤ n ln 2. Let Xn denote the valuations on Bn , with |Xn | = 2^n and Xn identified with {0, 1}^n . A Boolean function on Bn is a mapping f : Xn −→ {0, 1}. Thus, a Boolean function assigns the label 0 or 1 to each assignment of truth values of the Boolean variables. There exist 2^(2^n) Boolean functions on Bn . A Boolean formula is a legal string containing the 2n literals b1 , . . . , bn , ¬b1 , . . . , ¬bn , the connectives ∨ (or) and ∧ (and), and the parenthesis symbols.
6. Pure Conjunctive Form. An expression of the form
b1 ∧ b2 ∧ · · · ∧ bn
is called a Pure Conjunctive Form (PCF).
7. Pure Disjunctive Form. An expression of the form
b1 ∨ b2 ∨ · · · ∨ bn
is called a Pure Disjunctive Form.
8. Conjunctive Normal Form. A conjunction of several “clauses” each of which is
a disjunction of some literals is called a Conjunctive Normal Form (CNF). For
example, (b1 ∨ b2 ∨ ¬b5 ) ∧ (¬b1 ∨ b6 ∨ b7 ) ∧ (b2 ∨ b5 ) is a CNF.
9. Disjunctive Normal Form. A disjunction of several “clauses” each of which is
a conjunction of some literals is called a Disjunctive Normal Form (DNF). For
example, (b1 ∧ ¬b3 ∧ b5) ∨ (¬b2 ∧ b4 ∧ b6 ∧ b7 ) ∨ (b4 ∧ b6 ∧ ¬b7 ) is a DNF.
The HW data in its original form is represented as DNF.
Theorem 3.1 Suppose that C is a class of concepts satisfying measurability conditions.
1. C is uniformly learnable if and only if the VC dimension of C is finite.
2. If the VC dimension of C is d < ∞, then
(a) For 0 < ε < 1 and δ > 0, the algorithm learns a given hypothesis if the sample size is at least
max( (4/ε) ln(2/δ), (8d/ε) ln(13/ε) ).
(b) For 0 < ε < 1/2 and δ > 0, the algorithm learns a given hypothesis only if the sample size is greater than
max( ((1 − ε)/ε) ln(1/δ), d(1 − 2(ε(1 − δ) + δ)) ).
Theorem 3.1 allows computation of bounds on sample complexity. Also, it shows
that one needs to compute limits on the VC dimension of a given learner to understand the sample complexity of the problem of learning from examples.
Theorem 3.2 The VC dimension in both Original and Run-Length-Encoded (RLE)
forms of the given data is the same.
Proof The proposed scheme forms a nonlossy compression scheme. It is shown
in Sects. 3.3 and 3.4 that dissimilarity computation between any two patterns in
both the forms of the data provides the same value. Thus, learning through kNNC provides the same k-nearest neighbors and the same classification accuracy. The number of dichotomies generated, and thereby the VC dimension, is the same in either case.
3.6 Minimum Description Length
The notion of algorithmic complexity refers to the characterization of the randomness of a data set. The algorithmic complexity is defined as the length of the shortest binary code describing the given data. The data samples are random if they cannot be compressed significantly. The Minimum Description Length (MDL) is a tool for inductive inference based on Kolmogorov's characterization of randomness. The MDL is the sum of the code length of the data based on the considered model, L(model), and an error term specifying how the actual data differ from the model prediction, of code length L(data/model). Hence, the total code length, l, of such a code for representing the binary string of the data output is

l = L(model) + L(data/model).
The coefficient of compression for this string is

K = l/n.
Applying the MDL principle in the current context, we represent L(model) as the number of bits required to store the pattern. Let L(data/model), which represents the prediction error, be e. With the original data,

l = 192 · k1 + e,

where k1 is the number of bits required to store a given feature value, and the corresponding compression for this string is

K(original-model) = (192 · k1)/n + e/n.

As demonstrated earlier in the chapter, the run-length-encoded data provides compression. The compression can be seen in terms of the number of features. The maximum number of features in any of the patterns is 55 (Table 3.2), which occurred for class 5. It is clearly demonstrated that, with the proposed Algorithm 3.1, the classification error has remained the same. Thus, with the compressed data, considering k2 bits to store each feature value in the compressed form,

l ≤ 55 · k2 + e.

The corresponding compression for this string is

K(RLE-model) ≤ (55 · k2)/n + e/n,

which clearly shows a significant reduction in the compression ratio in the best case and the same value as for the original data in the worst case. The following theorem formalizes the concept.
Theorem 3.3 The MDL of the compressed data is less than or equal to the MDL of
the original data.
Proof Let k1 be the number of bits required for storing a feature value of the original data. In the current example of HW data, it is equal to 1 bit, thus requiring 192 · k1 = 192 bits to store each pattern. In the case of compressed data, depending on the worst-case run length, say p, we need log(p) bits to store it. Let k2 = log(p). Since the compression is nonlossy, the proposed algorithm, as discussed in Sects. 3.3 and 3.4, leaves the classification error unchanged. Thus, the second term of the MDL is the same in either case. The MDL of the compressed data is better than that of the original data as long as k2 ≤ k1. In the worst case of
alternating binary feature values, k2 = k1 , making the MDLs of both sets of data
equal.
3.7 Summary
We consider patterns with binary-valued features. The data is compressed by means
of runs. A novel method of computing dissimilarity in the compressed domain is
proposed. This results in significant reduction in space and time. The process of
compression and decompression is invertible. The concept of computing dissimilarity in the compressed domain is successfully applied to large handwritten digit data. Other application areas, such as solutions based on genetic algorithms and conventional data mining approaches, are discussed. The classification of the data in its original form and in its compressed form results in the same accuracy. The results demonstrate the advantage of the procedure, viz., an improvement in classification time by a factor of five.
The algorithm has a linear time complexity. The work will have pragmatic impact
on Data Mining applications, large data clustering, and related areas.
3.8 Bibliographic Notes
Approaches to data reduction include clustering (Jain et al. 1999) and sampling
(Han et al. 2012). Some approaches to sufficient statistics or data-derived information are discussed by Tian et al. (1996), DuMouchel et al. (2002), Bradley et al. (1998), Breuing et al. (2000), Fung (2002), Mitra et al. (2000), and Girolami and He (2003). Algorithms to compute approximate and exact edit distances of run-length-encoded strings are discussed by Makinen et al. (2003). Marques de Sa (2001) and
Duda and Hart (1973) provide detailed discussions on clustering, classification, and
distance metrics that are referred to in the current chapter. The works by Hastie
and Tibshirani (1998), Cherkassky and Mulier (1998), Vapnik (1999), Vidyasagar
(1997), Vapnik and Chervonenkis (1991, 1968), Rissanen (1978), and Blumer et al.
(1989) contain theoretical preliminaries on the VC dimension, minimum description length, etc. Discussions on the notion of algorithmic complexity can be found in
Kolmogorov (1965), Chaitin (1966), Cherkassky and Mulier (1998), and Vapnik
(1999).
The proposed algorithm, as discussed by Ravindra Babu et al. (2007), is directly
applicable to mining of large-scale data transactions. Mining association rules for
large datasets can be found in Agrawal et al. (1993). A detailed account for genetic
algorithms can be found in Goldberg (1989).
References
R. Agrawal, T. Imielinski, A. Swami, Mining association rules between sets of items in large
databases, in Proc. 1993 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD’93) (1993),
pp. 266–271
A. Blumer, A. Ehrenfeucht, D. Haussler, M.K. Warmuth, Learnability and the Vapnik–
Chervonenkis dimension. J. Assoc. Comput. Mach. 36(4), 929–965 (1989)
P. Bradley, U.M. Fayyad, C. Reina, Scaling clustering algorithms to large databases, in Proceedings
of 4th Intl. Conf. on Knowledge Discovery and Data Mining (AAAI Press, New York, 1998),
pp. 9–15
M.M. Breuing, H.P. Kriegel, J. Sander, Fast hierarchical clustering based on compressed data and
OPTICS, in Proc. 4th European Conf. on Principles and Practice of Knowledge Discovery in
Databases (PKDD), vol. 1910 (2000)
G.J. Chaitin, On the length of programs for computing finite binary sequences. J. Assoc. Comput.
Mach. 13, 547–569 (1966)
V. Cherkassky, F. Mulier, Learning from Data (Wiley, New York, 1998)
R.O. Duda, P.E. Hart, Pattern Classification and Scene Analysis (Wiley, New York, 1973)
W. DuMouchel, C. Volinsky, T. Johnson, C. Cortes, D. Pregibon, Squashing flat files flatter, in
Proc. 5th Intl. Conf. on Knowledge Discovery and Data Mining, San Diego, CA (AAAI Press,
New York, 2002)
B.C.M. Fung, Hierarchical document clustering using frequent itemsets. M.Sc. Thesis, Simon
Fraser University (2002)
M. Girolami, C. He, Probability density estimation from optimally condensed data samples. IEEE
Trans. Pattern Anal. Mach. Intell. 25(10), 1253–1264 (2003)
D.E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning (Addison-Wesley, Reading, 1989)
J. Han, M. Kamber, J. Pei, Data Mining—Concepts and Techniques (Morgan-Kauffman, New
York, 2012)
T. Hastie, R. Tibshirani, Classification by pairwise coupling. Ann. Stat. 26(2) (1998)
A.K. Jain, M.N. Murty, P. Flynn, Data clustering: a review. ACM Comput. Surv., 32(3) (1999)
A.N. Kolmogorov, Three approaches to the quantitative definitions of information. Probl. Inf.
Transm. 1(1), 1–7 (1965)
V. Makinen, G. Navarro, E. Ukkonen, Approximate matching of run-length compressed strings.
Algorithmica 35(4), 347–369 (2003)
J.P. Marques de Sa, Pattern Recognition—Concepts, Methods and Applications (Springer, Berlin,
2001)
P. Mitra, C.A. Murthy, S.K. Pal, Data condensation in large databases by incremental learning
with support vector machines, in Proc. 15th International Conference on Pattern Recognition
(ICPR’00), vol. 2 (2000)
T. Ravindra Babu, M. Narasimha Murty, V.K. Agrawal, Classification of run-length encoded binary
strings. Pattern Recognit. 40(1), 321–323 (2007)
J. Rissanen, Modeling by shortest data description. Automatica 14, 465–471 (1978)
Z. Tian, R. Raghu, L. Micon, BIRCH: an efficient data clustering method for very large databases,
in Proceedings of ACM SIGMOD International Conference of Management of Data (1996)
V. Vapnik, Statistical Learning Theory, 2nd edn. (Wiley, New York, 1999)
V. Vapnik, A.Ya. Chervonenkis, On the Uniform Convergence of Relative Frequencies of Events to
Their Probabilities. Dokl. Akad. Nauk, vol. 181 (1968) (Engl. Transl.: Sov. Math. Dokl.)
V. Vapnik, A.Ya. Chervonenkis, The necessary and sufficient conditions for the consistency of the
method of empirical risk minimization. Pattern Recognit. Image Anal. 1, 284–305 (1991) (Engl.
Transl.)
M. Vidyasagar, A Theory of Learning and Generalization (Springer, Berlin, 1997)
Chapter 4
Dimensionality Reduction by Subsequence Pruning
4.1 Introduction
In Chap. 3, we discussed one approach to dealing with large data. In that approach, we compress the given large dataset and work in the compressed domain to generate an abstraction. This is essentially a nonlossy compression scheme.
In the present chapter, we explore the possibility of allowing some loss of data in the given large dataset while still being able to generate an abstraction that is nearly as accurate as the one obtained with the original dataset. In the proposed scheme, we make use of the concepts of frequent itemsets and support to compress the data and carry out classification of such compressed data. The compression, in the current work, forms a lossy scenario. We demonstrate that such a scheme significantly reduces the storage requirement without any significant loss in classification accuracy.
In this chapter, we initially discuss the motivation for the current activity. Subsequently, we discuss the basic methodology in detail. Preliminary data analysis provides insights into the data and directions for proper quantization of a given dataset. We elaborate the proposed lossy compression scheme and the compression achieved at various levels. We demonstrate the working of the scheme on a handwritten digit dataset.
4.2 Lossy Data Compression for Clustering and Classification
Classification of high-dimensional, large datasets is a challenging task in Data Mining. Clustering and classification of large data have been in the focus of research in recent years, especially in the context of data mining. The largeness of the data poses challenges such as minimizing the number of scans of a database stored in secondary memory and data summarization, apart from the issues related to the clustering and classification algorithms themselves, viz., scalability, high dimensionality, speed, prediction accuracy, etc.
Data compression has been one of the enabling technologies for multimedia communication. Based on the requirements of reconstruction, data compression is divided into two broad classes, viz., lossless compression and lossy compression.
For optimal performance, the nature of the data and the context influence the choice of the compression technique. In practice, many data compression techniques are in use. Huffman coding is based on the frequency of input characters. Instead of attempting to compress a binary code as in Chap. 3, a fixed number of binary features of each pattern is grouped or blocked together. One of the objectives of the current work is to
compress and then classify the data. Thus, it is necessary that the compressed form
of the training and test data should be tractable and should require minimum space
for storage and fast processing time. Such a scheme should be amenable for further
operations on data such as dissimilarity computation in its compressed form. We
propose a lossy compression scheme that satisfies these requirements.
We make use of two concepts of association rule mining, viz., support and frequent itemset. We show that the use of frequent items that exceed a given support
will avoid less frequent input features and provide better abstraction.
The proposed scheme summarizes the given data in a single scan, initially as frequent items and subsequently in the form of distinct subsequences. Less frequent
subsequences are further pruned by their nearest neighbors that are more frequent.
This leads to a compact or compressed representation of data, resulting in significant compression of input data. Test data requires additional mapping since some
of the subsequences found in the test data would have got pruned in subsequence
generation in the training dataset. This could lead to inconsistency between the two
encoded datasets during dissimilarity computation. We discuss the need and modalities of transforming the test data in Sect. 4.5.6. The test data is classified in its compressed form. The classification of data directly in the compressed form provides a
major advantage. The lossy compression leads to highly accurate classification of
test data because of possible improvement in generalization.
Classification based on the rough set approach, with reference to a dissimilarity limit on the test patterns, is carried out. The data thus reduced requires significantly less storage as compared to rough set-based schemes, with similar classification accuracy.
4.3 Background and Terminology
Consider a training data set consisting of n patterns. Let each pattern consist of d
binary valued features. Let ε be the minimum support for any feature of a pattern
to be considered for the study. We formally discuss the terms used further in the
current chapter.
1. Support. The support of a feature, in the current work, is defined as the actual number of patterns in which the feature is present. The minimum support is referred to as ε.
2. Sequence. Consider a set of integer numbers, {S1, S2, . . .}, and let this set be denoted by J. A sequence is a function from J to J′, where J and J′ are two sets of positive integers.
3. Subsequence. Let S = {s_n}, n = 1, 2, . . . , ∞, be a sequence of integer numbers, and let S′ = {s_i}, i = 1, 2, . . . , ∞, be a subsequence of the sequence of positive integers. The composite function S ∘ S′ is called a subsequence of S. For example, for i ∈ J, we have S′(i) = s_i and (S ∘ S′)(i) = S(S′(i)) = S(s_i) = s_{s_i}, and hence S ∘ S′ = (s_{s_i})_{i=1}^{∞}.
4. Length of a subsequence. Let S be a subsequence. The number of elements of
the subsequence is referred to as the length of a subsequence, r.
5. Block, Block Length. We define a finite number of binary digits as a block. The
number of such digits in a block, b, is called the block length.
6. Value of a Block. The decimal equivalent of a block is the value of block, v.
7. Minimum frequency for pruning. When subsequences are formed, in order to prune their number, we aim to replace those less frequent subsequences that remain below a chosen frequency threshold. This threshold is referred to as the minimum frequency, ψ.
8. Dissimilarity threshold for replacing subsequences. While replacing less frequent subsequences, we replace a subsequence by a neighbor that lies within a certain distance. This distance is referred to as the dissimilarity threshold, η. The parameter controls the fineness of the neighborhood for subsequence replacement.
Table 4.1 provides the list of parameters used in the current work. In the current
implementation, all the parameters are integers.
We illustrate the above concepts through the following examples.
Illustration 4.1 (Sequence, subsequence, and blocks) Consider a pattern with binary features as {0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0}. The sequence is represented by
000000111000. 00000011 represents a subsequence of length 8. (0000), (0011),
(1000) represent blocks of length 4, each with the corresponding values of block as
0, 3, 8.
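A minimal sketch of this blocking step in Python is given below; the helper name block_values is a hypothetical illustration, not the authors' implementation. It splits a binary feature vector into fixed-length blocks and returns the decimal value of each block.

def block_values(features, b):
    # Split a binary feature vector into blocks of length b and return
    # the decimal value of each block.
    assert len(features) % b == 0, "feature length must be a multiple of b"
    values = []
    for i in range(0, len(features), b):
        value = 0
        for bit in features[i:i + b]:
            value = 2 * value + bit      # decimal equivalent of the block
        values.append(value)
    return values

# The pattern of Illustration 4.1: blocks of length 4 give the values 0, 3, 8.
pattern = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0]
print(block_values(pattern, 4))          # [0, 3, 8]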
Illustration 4.2 (Frequent itemsets) Consider five patterns with six features each, numbered from 1 to 6. The concepts of itemsets and support are presented in Table 4.2. In the table, each row contains a pattern, and each column indicates the presence (1) or absence (0) of the corresponding feature. The last row contains the column-wise sum, which indicates the support of the corresponding feature or item.
The support of each feature is obtained by counting the number of nonzero values. The feature-wise supports are {3, 2, 3, 1, 4, 3}. With the help of support values,
frequent features corresponding to different threshold values can be identified.
The frequent feature sets, or equivalently itemsets, for the minimum support thresholds of 2, 3, 4, and 5 are presented in Table 4.3. Each row consists of the itemset for one minimum support value. For example, row 2 consists of the itemset that has a minimum support of 3; equivalently, each of the items in the set {1, 3, 5, 6} has a minimum support of 3.
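The support computation of Illustration 4.2 can be sketched as follows; this is a small illustrative Python fragment with hypothetical function names, not the authors' implementation. It counts, for every feature, the number of patterns containing it and lists the features meeting a given minimum support.

def feature_supports(patterns):
    # Column-wise sums: the support of each feature (item).
    d = len(patterns[0])
    return [sum(p[j] for p in patterns) for j in range(d)]

def frequent_items(patterns, min_support):
    # 1-based feature indices whose support meets the minimum support.
    return [j + 1 for j, s in enumerate(feature_supports(patterns)) if s >= min_support]

# Patterns of Table 4.2 (rows 1-5).
patterns = [
    [1, 1, 0, 0, 1, 0],
    [0, 1, 1, 0, 1, 1],
    [0, 0, 1, 0, 1, 1],
    [1, 0, 0, 0, 0, 1],
    [1, 0, 1, 1, 1, 0],
]
print(feature_supports(patterns))      # [3, 2, 3, 1, 4, 3]
print(frequent_items(patterns, 3))     # [1, 3, 5, 6]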
Table 4.1 List of parameters used

  Parameter  Description
  n          Number of patterns or transactions
  d          Number of features or items prior to identification of frequent items
  b          Number of binary features that makes one block
  q          Number of blocks in a pattern
  v          Value of block
  r          Length of a subsequence
  ε          Minimum support
  ψ          Minimum frequency for pruning a subsequence
  η          Dissimilarity threshold for identifying nearest neighbor to a subsequence

Table 4.2 Itemsets and support

  Sl. No.   Itemset
  1         110010
  2         011011
  3         001011
  4         100001
  5         101110
  Support   3 2 3 1 4 3

Table 4.3 Frequent itemsets

  Minimum support   Itemset
  2                 {1, 2, 3, 5, 6}
  3                 {1, 3, 5, 6}
  4                 {5}
  5                 –
Illustration 4.3 (Impact of minimum support on distinct subsequences) We continue with the data in Table 4.2 and Table 4.3. Consider block lengths of 2. Compute the values of the blocks for each of the patterns and form subsequences of
length 3 each as (3, 0, 2), (1, 2, 3), (0, 2, 3), (2, 0, 1), and (2, 3, 2). Here, all the subsequences are nonrepeating, and they are all distinct. By considering only frequent
items with, say, ε ≥ 3, the patterns with frequent features (items) are (100010),
(001011), (001011), (100001), and (101010). Here we observe from Table 4.3 that
for ε ≥ 3, frequent items are {1, 3, 5, 6}, which are reflected in patterns formed with
corresponding frequent items.
Table 4.4 Distinct subsequences with minimum support, ε ≥ 0 (parameters: n = 5, d = 6, b = 2; no. of patterns = 5; no. of distinct subsequences = 5)

  Sl. No.   Patterns   Blocks of length 2   Values of blocks   Subsequences of length 3
  (1)       (2)        (3)                  (4)                (5)
  1         110010     11, 00, 10           3, 0, 2            3, 0, 2
  2         011011     01, 10, 11           1, 2, 3            1, 2, 3
  3         001011     00, 10, 11           0, 2, 3            0, 2, 3
  4         100001     10, 00, 01           2, 0, 1            2, 0, 1
  5         101110     10, 11, 10           2, 3, 2            2, 3, 2

Table 4.5 Distinct subsequences with minimum support, ε ≥ 3 (parameters: n = 5, d = 6, b = 2; no. of patterns = 5; no. of distinct subsequences = 4)

  Sl. No.   Patterns   Blocks of length 2   Values of blocks   Subsequences of length 3
  1         100010     10, 00, 10           2, 0, 2            2, 0, 2
  2         001011     00, 10, 11           0, 2, 3            0, 2, 3
  3         001011     00, 10, 11           0, 2, 3            0, 2, 3
  4         100001     10, 00, 01           2, 0, 1            2, 0, 1
  5         101010     10, 10, 10           2, 2, 2            2, 2, 2
At this stage, the set of subsequences is (2, 0, 2), (0, 2, 3), (0, 2, 3), (2, 0, 1), and (2, 2, 2). Since (0, 2, 3) repeats twice, the set of distinct subsequences is (2, 0, 2), (0, 2, 3), (2, 0, 1), and (2, 2, 2). Consider Table 4.4. It consists of five patterns arranged row-wise. Column 2 lists the features of each pattern. Column 3 contains the blocks of length 2 obtained from each pattern. Column 4 contains the decimal equivalents of the blocks. Column 5 shows the values of the blocks arranged as a subsequence of length 3. Observe the reduction in the number of distinct subsequences from 5 to 4 with the minimum support increasing from 2 to 3. Tables 4.4, 4.5, and 4.6 summarize the concepts discussed in the example.
It can be noticed from Table 4.5 that, since {0, 2, 3} repeats two times, the distinct subsequences are as given below.
{2, 0, 2}, {0, 2, 3}, {2, 0, 1}, {2, 2, 2} for ε ≥ 3.
In case of ε ≥ 4 as shown in Table 4.6, the distinct subsequence is {0, 0, 2} alone.
Illustration 4.4 (Distinct subsequences and dissimilarity table) Consider the subsequences in Table 4.4. Table 4.7 contains the distinct subsequences and their frequencies. They are numbered sequentially, and subsequent to this identification the subsequences are referred to by this unique serial number. The distances are stored in an upper triangular matrix for easy access. Table 4.8 contains an example of subsequence numbers and inter-subsequence distances corresponding to the data given in Table 4.7.
Table 4.6 Distinct subsequences with minimum support, ε ≥ 4 (parameters: n = 5, d = 6, b = 2; no. of patterns = 5; no. of distinct subsequences = 1)

  Sl. No.   Patterns   Blocks of length 2   Values of blocks   Subsequences of length 3
  1         000010     00, 00, 10           0, 0, 2            0, 0, 2
  2         000010     00, 00, 10           0, 0, 2            0, 0, 2
  3         000010     00, 00, 10           0, 0, 2            0, 0, 2
  4         000000     00, 00, 00           0, 0, 0            0, 0, 0
  5         000010     00, 00, 10           0, 0, 2            0, 0, 2

Table 4.7 Distinct subsequences and corresponding support

  Sl. No.   Subsequence   No. of repetitions
  1         2, 0, 2       1
  2         0, 2, 3       2
  3         2, 0, 1       1
  4         2, 2, 2       1

Table 4.8 Dissimilarity table of distinct subsequences in terms of Euclidean distance

  Sl. No.   1   2   3     4
  1         0   3   1     2
  2         –   0   √12   √5
  3         –   –   0     √5
  4         –   –   –     0
Illustration 4.5 (Pruning of distinct subsequences) Consider the data in Table 4.2. We observed in Sect. 4.3 that the number of distinct subsequences decreases with increasing threshold. In Tables 4.4, 4.5, and 4.6, we see the lists of distinct subsequences with the minimum threshold increasing from 0 to 4.
To illustrate the use of ψ and η, we consider subsequences of hypothetical data as
given in Table 4.9. The table contains distinct subsequences and their corresponding
frequencies. In order to prune the number of distinct subsequences, we propose to
replace all the subsequences that occur with frequency less than or equal to 2, i.e.,
ψ = 2. Now from the table, {2, 0, 1} should be replaced by its nearest neighbor.
Table 4.8 consists of subsequence numbers 1 to 4, both shown in rows and columns.
Each cell corresponding to subsequence numbers indicates dissimilarity between
subsequences i and j , say, each ranging from 1 to 4. From Table 4.8, we notice that
the nearest neighbor of {2, 0, 1} is {2, 0, 2}, which is at a distance of 1. Thus, here
η = 1.
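A small sketch of this pruning step is given below (illustrative Python with hypothetical names; not the authors' implementation). Every subsequence whose frequency is at most ψ is replaced by its nearest frequent subsequence, provided that neighbor lies within the dissimilarity threshold η.

from math import dist  # Euclidean distance, available since Python 3.8

def prune_subsequences(freq, psi, eta):
    # freq: dict mapping a subsequence (tuple) to its frequency.
    # Returns a mapping from every subsequence to its (possibly replaced) representative.
    frequent = [s for s, f in freq.items() if f > psi]
    mapping = {}
    for s, f in freq.items():
        if f > psi:
            mapping[s] = s                          # retained as-is
            continue
        nn = min(frequent, key=lambda t: dist(s, t))
        mapping[s] = nn if dist(s, nn) <= eta else s
    return mapping

# Hypothetical data of Table 4.9, with psi = 2 and eta = 1.
freq = {(2, 0, 2): 6, (0, 2, 3): 4, (2, 0, 1): 2, (2, 2, 2): 10}
print(prune_subsequences(freq, psi=2, eta=1))
# (2, 0, 1) is mapped to its nearest neighbor (2, 0, 2), at distance 1.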
Table 4.9 Hypothetical distinct subsequences and support

  Sl. No.   Subsequence   No. of repetitions
  1         2, 0, 2       6
  2         0, 2, 3       4
  3         2, 0, 1       2
  4         2, 2, 2       10

Illustration 4.6 (Nearest neighbors for mapping previously unseen subsequences) In Illustration 4.5, we prune known subsequences that are less frequent by replacing them with their nearest neighbors. We now consider unseen subsequences, such as those generated
from a test pattern. Since it is possible that a subsequence generated from a test
pattern is not seen in the training patterns, we assign it to a nearest neighbor among
the subsequences of the training patterns. This helps in assigning the same unique subsequence id to each such subsequence in a test pattern. It is experimentally seen that such an assignment does not adversely affect the classification accuracy.
Illustration 4.7 (Classifying a test pattern transformed to subsequences) For a chosen block size and subsequence length, subsequences are formed from the training data. Only those subsequences that are frequent, based on ψ, are retained. Among the retained subsequences, unique subsequences are identified and numbered. Table 4.8 provides the distances between any two such unique subsequences. For each test pattern, subsequences are formed, and the corresponding unique ids are assigned. It is possible that some of the subsequences in a test pattern were not seen earlier; in that case, we assign the nearest neighbor from among the pruned subsequences of the training dataset. The dissimilarity between the test pattern and each training pattern is then quickly computed by accessing the values from the dissimilarity table.
With this background of parameter definition, we describe the proposed scheme
in the following section.
4.4 Preliminary Data Analysis
We implement the proposed method on a large Handwritten digit dataset. The current section provides a brief description of the data and preliminary analysis carried
out.
We consider a large handwritten digit dataset, which consists of 10 classes, 0 to 9.
Each pattern consists of a 16 × 12 matrix, which is equal to 192 binary-valued
features. The total number of patterns is 100,030, which are divided into 66,700
training and 33,330 test patterns. For demonstration of the proposed scheme, we
consider 10 % of this dataset.
Table 4.10 provides basic statistics on the training data. Column 1 of the table contains the class label. The arithmetic mean of the number of nonzero features over a large sample of training patterns within each class is provided in column 2. Column 3 contains the class-wise standard deviation of the number of nonzero features. Column 4 indicates that there is at least one pattern that contained the recorded
Table 4.10 Basic statistics on the number of nonzero features in the training data

  Class label   Mean   Standard deviation   Minimum number   Maximum number
  (1)           (2)    (3)                  (4)              (5)
  0             66.4   10.8                 38               121
  1             29.8   5.1                  17               55
  2             64.4   10.5                 35               102
  3             59.8   10.3                 33               108
  4             52.9   9.3                  24               89
  5             61.3   10.0                 32               101
  6             58.4   8.5                  34               97
  7             47.3   7.7                  28               87
  8             67.4   11.4                 36               114
  9             55.6   8.4                  31               86
Fig. 4.1 A set of typical and atypical patterns of handwritten data
minimum number of features. Similarly, column 5 corresponds to a pattern that contained the maximum number among all the patterns within a class. The table brings out the complexity and variability in the given handwritten digit data. For example, for class label 0, column 4 indicates that there is at least one pattern of zero that contains just 38 nonzero features, and the corresponding column 5 shows that there is at least one pattern with 121 nonzero features. Apart from this, the orientations and shapes of different digit patterns also vary significantly. Although the statistics indicate that the number of nonzero features is less than 192, the features of individual patterns physically occupy different feature locations, thus making all 192 locations relevant for representing a pattern. This further adds to the challenge of classifying such patterns.
The data consists of 10 classes with an equal number of training patterns per class. Typical and atypical patterns of the given training data are shown in Fig. 4.1.
4.4.1 Huffman Coding and Lossy Compression
We consider handwritten digit data and analyze the Huffman coding for nonlossy
and lossy compression.
4.4.1.1 Analysis with 6-Bit Blocks
Consider the handwritten (HW) digit data consisting of 10 classes of 192-feature labeled data. It is readable in matrix form, where each pattern is represented as a 16 × 12 matrix. In this matrix form, consider 6 bits as a block, taken contiguously. Thus, each pattern consists of 32 blocks. Each block is typically represented by a character code, such as a, b, etc.; in the current context, the blocks are labeled 1, 2, . . . , 32. With 6 bits, each block can assume values from 0 to 63. We propose the following scheme for analysis (an illustrative code sketch of the coding steps appears after the list).
1. Consider 6-bit blocks of the data
2. Compute the frequency of each of the values from 0 to 63 in the entire training
data.
3. Present the item-value (0 to 63) and the corresponding frequency for generating
complete binary tree, which eventually provides Huffman codes.
4. Generate the Huffman code for each item-value of the training data.
5. Compare the size for its possible improvement with the original data.
6. Find the dissimilarity in the compressed state itself.
7. Compare the CPU time and storage requirements.
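Steps 2 to 4 of the above scheme can be sketched as follows; this is an illustrative Python fragment using the standard heapq module, not the authors' implementation. It counts the frequency of each block value over the (toy) training data and builds the corresponding Huffman codes.

import heapq
from collections import Counter
from itertools import count

def huffman_codes(values):
    # values: iterable of block values (0..63 for 6-bit blocks).
    freq = Counter(values)
    tiebreak = count()                 # unique counter so dicts are never compared
    heap = [(f, next(tiebreak), {v: ""}) for v, f in freq.items()]
    heapq.heapify(heap)
    if len(heap) == 1:                 # degenerate case: only one distinct value
        return {v: "0" for v in heap[0][2]}
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {v: "0" + code for v, code in c1.items()}
        merged.update({v: "1" + code for v, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]                  # mapping: block value -> Huffman code

# Toy input: 6-bit block values of a few rows (hypothetical, for illustration only).
blocks = [0, 56, 0, 56, 0, 60, 1, 60, 3, 12, 6, 28, 12, 24, 13, 48, 15, 0]
codes = huffman_codes(blocks)
print(codes[0])                        # the most frequent value gets a short code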
Table 4.11 contains the Huffman codes for 6-bit blocks. In the table, column 1 corresponds to the decimal equivalent value and is referred to as "Label". Column 2 consists of the number of occurrences of such a value, and column 3 provides the corresponding Huffman code. In order to accommodate all codes in a single table, we continue placing the data in columns 4 to 6.
The entire training data is reduced to the form of decimal equivalent values of 6-bit blocks. Table 4.12 provides such sample encoded data. From the table, similarity among the patterns of the same class can be noticed in terms of the values of 6-bit
blocks. Table 4.13 contains the Huffman codes of a sample of training patterns.
The following are important observations from the exercise.
1. The space savings of Huffman coding with 6-bit coding is about 14 % only.
2. Both original training data and the compressed training data are binary.
4.4.1.2 Analysis with 4-Bit Blocking
4-bit blocking results in 16 possibilities, viz., 0–15. Table 4.16 contains the results.
Table 4.14 contains 4-bit block coded training data. Here again, one can notice similarity among the patterns of the same class in terms of values of blocks and also
intra-pattern repetition of subsequences such as {0, 3, 8}.
The following are important observations.
1. Huffman coding of 4-bit blocking provides a space savings of 25 %.
2. Basic input and output data are binary.
Table 4.11 Huffman coding of values 1 to 64 for entire training data
Label
Frequency
Code (leaf-to-root)
Label
Frequency
Code (leaf-to-root)
(1)
(2)
(3)
(4)
(5)
(6)
0
40,063
00
32
21,011
1111
1
15,980
1101
33
23
1010001000101
2
2176
1110011
34
31
011000100010
3
17,435
1011
35
22
0010001000101
4
2676
110010
36
100
11001000101
11110100010
5
378
6
8067
7
12,993
8
3840
011000101
38
82
00011
39
9
0001
40
375
110101
41
1
111000101
44
465
110001000101
45
2
010100010
46
51
11010110100010
101000101
10001011011100010
9
385
10
45
100110011
11
279
12
9104
10111
47
8
13
1364
0000101
48
23,845
14
2915
100101
49
53
00000100010
15
8550
00111
50
17
0011011100010
16
3833
010101
51
48
001001000101
17
75
18
7
19
20
21
1
22
23
24
11,710
25
346
1001011011100010
101001000101
01010110100010
110
01110100010
52
82
01011100010
101011011100010
54
86
00001000101
67
00110100010
55
31
111000100010
13
0001000100010
56
13,122
00001011011100010
57
54
10000100010
15
1001000100010
58
11
11011011100010
44
111011100010
59
37
110110100010
1010
60
4074
111100010
61
15
101000100010
62
1047
10110011
100100010
63
433
000110011
26
28
27
261
28
2561
29
162
30
670
31
2262
1001
010011
0010110100010
010010
0011100010
01100010
000010
4.4.1.3 Huffman Coding and Run-Length Compression
The binary code generated using the Huffman code can be subjected to run-length
coding as discussed in Chap. 3. This has the advantage of applying classification in
Table 4.12 Sample of 6-bit block coded training data
Label
Coded pattern
0
0 56 0 56 0 56 0 56 0 60 0 60 1 60 1 60 3 12 6 28 6 28 12 24 12 24 13 48 13 48 15 0
0
0 56 0 56 3 56 3 56 7 12 6 12 6 12 12 12 12 12 12 12 4 12 4 12 6 24 7 56 7 56 3 48
1
0 32 0 32 0 32 0 32 0 32 0 32 0 32 1 32 1 32 1 32 0 32 0 32 0 32 0 32 0 32 0 32
1
0 16 0 16 0 48 0 48 1 32 1 32 1 32 1 32 3 0 3 0 2 0 2 0 6 0 6 0 6 0 6 0
2
6 0 6 0 15 32 15 32 15 32 15 32 1 32 1 32 3 0 3 0 3 0 31 50 31 50 31 62 31 62 24 6
2
15 0 15 0 29 0 29 0 27 0 27 0 7 0 7 0 14 6 62 12 62 12 59 28 59 28 51 48 51 48 1 32
Table 4.13 Huffman codes corresponding to 6-bit block sample patterns
Label
Huffman code
0
0001001100010011000100110001001100
0000000011010001101000101110111000
1101001000011010010101111010101111
01000001010010010
0
0101000010100100100010100111000000
1001100010011101101001110110100110
0011011100011101110001110111101111
0111101111011110111101111100101011
1100101011100011101000010100110001
0100111011001001000101
1
0011110011110011110011110011110011
1100111111011111110111111101111100
1111001111001111001111001111001111
1
0001010100010101000010010001010000
1001000101110111111101111111011111
1101111110110010110011100110011100
11000001100000110000011000001100
its compressed form, which is subject to interpretation of the Huffman code. The
following are salient statistics.
1. No. of original input features: 6670 · 192 = 1,280,640.
2. No. of features after Huffman coding (6-bit blocks): 948,401.
3. No. of features after post-run-length coding: 473,231.
4.4.1.4 Lossy Compression: Assigning Longer Huffman Codes to Nearest Neighbors
When Huffman coding is carried out over the given data, the code length corresponding to more frequent features is shorter, and that corresponding to less frequent features is longer. If the features with longer codes are assigned the shorter codes of their nearest neighbors (NN), the amount of storage required for Huffman coding comes down further. The current exercise is aimed at such a possible reduction.

Table 4.14 Sample of 4-bit blocked coded training data

  Label  Coded pattern
  0      0 3 8  0 3 8  0 3 8  0 3 8  0 3 12  0 3 12  0 7 12  0 7 12  0 12 12  1 9 12  1 9 12  3 1 8  3 1 8  3 7 0  3 7 0  3 12 0
  0      0 3 8  0 3 8  0 15 8  0 15 8  1 12 12  1 8 12  1 8 12  3 0 12  3 0 12  3 0 12  1 0 12  1 0 12  1 9 8  1 15 8  1 15 8  0 15 0
  1      0 2 0  0 2 0  0 2 0  0 2 0  0 2 0  0 2 0  0 2 0  0 6 0  0 6 0  0 6 0  0 2 0  0 2 0  0 2 0  0 2 0  0 2 0  0 2 0
  1      0 1 0  0 1 0  0 3 0  0 3 0  0 6 0  0 6 0  0 6 0  0 6 0  0 12 0  0 12 0  0 8 0  0 8 0  1 8 0  1 8 0  1 8 0  1 8 0
The procedure can be summarized as follows.
• Generate a Huffman code for 4-bit row-wise blocks for both the training (6670) patterns and the test patterns (3333).
• Consider long patterns as shown in Table 4.16.
• Assign longer-length patterns to such NNs where the deviation is not large, by means of equivalence class mapping.
• Regenerate the training and test data with the newly mapped 4-bit codes.
• Classify the patterns at the decimal-value level, by means of a table look-up matrix.
• Use k-Nearest-Neighbor Classifier (kNNC) for k = 1 to 20.
• In the exercise, we experimented with different combinations of assignment, viz., assignment options 1 to 6, as provided in columns 5 to 10, respectively. For example, consider column 9 in Table 4.16. The values in the column indicate by which of the label codes the current label code is replaced. For example, the code for label 4, viz., 1110101, is replaced by the shorter code corresponding to label 3, viz., 1101. Similarly, the code for label 13, viz., 10111111, is replaced by the code for label 12, viz., 1010, leading to a shorter code. The last two rows of the table indicate the classification accuracy with kNNC after making such assignments and the space savings for each such combination.
This provides a view of assigning a given pattern having a longer code to a pattern having a shorter code, thereby leading to lossy compression while achieving a good classification accuracy, such as 93.8 % in case of the option in column 4. It should also be noted, in case of option 5, which is provided in column 10, that although the compression achieved is 32.8 %, the classification accuracy is reduced to 87 %. We make use of this concept further in the proposed algorithm.
4.4.2 Analysis of Subsequences and Their Frequency in a Class
The previous exercises focused on forming blocks of an appropriate number of bits and computing the frequency of such blocks across all the training patterns. The current exercise considers a sequence of decimal values of 4-bit blocks. We identify repetitions of such sequences across all the training patterns.
In the current subsection, we consider the patterns belonging to class label 0 and list out all possible subsequences and their frequencies. Based on the list, we bring out some important observations and the compression achieved. The following are some of the observations.
• The 192 features of a pattern make reading sense when arranged as a 16 × 12 matrix. Thus, with 4-bit blocks, the 12 bits of a row lead to 3 block values.
• Identify all occurrences of the same combination across all 667 × 16 row sequences. Table 4.15 contains the summary of results. The table lists the subsequences and the corresponding frequencies. From the table, one can notice that the frequencies range from 1 to 543, with several subsequences repeating 100 times or more. This brings out an important aspect:
(a) although the patterns belong to the same class, there exist intra-class dissimilarities, and, importantly,
(b) in spite of such intra-class variation at the pattern level, there exist significant subsequence-wise similarities.
• The 10,672 (667 × 16) subsequences get reduced to 251 repeating subsequences, with frequencies ranging from 1 to 543.
• It should be observed that Huffman coding of such combinations is not useful. Earlier, with 16 distinct values, the maximum code length of the least repeating combination was 9 bits. If all the 251 combinations are treated as separate codes, the corresponding Huffman codes would be very long. Alternatively, it should be noted that the current representation itself could be considered a compression scheme that compresses 10,672 nondistinct combinations of codes into 251 subsequences.
• Following the second argument of the above point, the given training and test data are encoded into this set of subsequences. The test data is classified at the coded level using a look-up table of dissimilarities.
4.4.2.1 Analysis of Repeating Subsequences for One Class
In the current exercise, the focus is on finding repeating sequences across rows. A maximum of three rows is considered. This brings out the correlation among successive rows. Here too, the training data of class 0 alone is considered. The following procedure is followed.
• Consider a sequence of 3 codes that correspond to three 4-bit blocks, i.e., one
row.
Table 4.15 Statistics of repeating subsequence and its frequency
Subseq.
Freq. Subseq.
Freq. Subseq.
{0, 3, 8}
70
{0, 3, 12}
22
{0, 7, 12}
{3, 1, 8}
543
{3, 7, 0}
103
{3, 0, 12}
289
{1, 0, 12}
38
{0, 12, 0}
175
{1, 15, 0}
{1, 14, 0}
387
{1, 12, 0}
{2, 3, 8}
14
{2, 0, 3}
10
{1, 15, 12}
{12, 1, 8}
{0, 8, 0}
29
{14, 0, 14}
Freq. Subseq.
Freq. Subseq.
Freq.
40
{0, 12, 12}
32
{1, 8, 12}
{3, 12, 0}
85
{0, 15, 8}
278
{1, 12, 12}
66
{1, 8, 8}
505
{1, 15, 8}
261
{0, 15, 0}
409
532
{2, 1, 8}
150
{3, 1, 0}
89
{3, 15, 0}
288
156
{1, 6, 0}
10
{3, 3, 0}
169
{2, 3, 0}
75
{2, 1, 0}
44
{1, 1, 0}
90
{0, 14, 0}
332
{1, 11, 8}
41
{3, 0, 3}
84
{3, 12, 12}
18
{3, 12, 8}
21
{0, 7, 8}
179
58
{3, 12, 6}
3
{7, 0, 6}
15
{6, 0, 6}
40
{12, 0, 12}
6
{14, 7, 0}
5
{7, 14, 0}
48
{0, 2, 0}
47
{0, 7, 0}
241
143
{3, 15, 8}
188
12
{7, 1, 12}
74
3
{7, 15, 8}
47
{3, 8, 8}
125
{3, 6, 0}
30
{0, 3, 0}
{7, 15, 12}
149
14
9
{14, 1, 12}
{2, 0, 8}
92
{6, 0, 12}
169
{6, 0, 8}
53
{6, 1, 8}
116
{3, 14, 0}
219
{3, 1, 12}
59
{0, 12, 8}
154
{1, 0, 8}
33
{1, 1, 8}
65
{2, 7, 0}
15
{3, 8, 12}
113
{6, 3, 8}
28
{7, 7, 0}
7
{1, 8, 0}
148
{1, 3, 0}
87
{1, 0, 3}
24
{7, 0, 12}
58
{7, 1, 8}
41
{6, 3, 0}
16
{0, 6, 8}
12
{0, 8, 3}
4
{1, 8, 3}
53
{3, 0, 8}
136
{0, 8, 8}
54
{7, 15, 0}
45
{6, 0, 3}
21
{6, 0, 14}
12
{3, 15, 14}
3
{1, 15, 14}
11
{0, 14, 8}
25
{0, 6, 0}
159
{3, 11, 8}
62
{6, 1, 14}
3
{7, 3, 8}
59
{1, 0, 0}
15
{6, 15, 12}
2
{3, 0, 14}
10
{12, 0, 6}
26
{6, 7, 8}
14
{1, 11, 0}
203
{0, 14, 3}
8
{1, 12, 3}
21
{3, 8, 3}
11
{2, 0, 12}
24
{0, 14, 12}
22
{0, 6, 3}
8
{0, 3, 3}
5
{0, 12, 3}
25
{0, 8, 12}
1
{0, 1, 8}
44
{6, 1, 12}
133
{7, 7, 8}
16
{0, 7, 3}
4
{6, 7, 0}
11
{7, 8, 0}
10
{3, 8, 6}
15
{15, 0, 12}
{15, 3, 12}
2
{15, 15, 0}
5
{7, 12, 0}
14
{1, 7, 0}
11
{3, 7, 8}
12
{2, 0, 2}
5
{3, 0, 2}
3
{1, 12, 2}
6
{0, 7, 14}
4
{1, 14, 12}
20
{7, 8, 12}
7
{7, 3, 12}
10
{0, 15, 12}
83
{3, 8, 0}
30
{7, 0, 8}
3
{15, 8, 0}
3
5
{3, 3, 8}
101
{1, 2, 0}
9
{2, 0, 6}
3
{3, 0, 6}
21
{1, 12, 8}
117
{6, 0, 2}
5
{12, 0, 2}
7
{12, 1, 12}
12
{3, 15, 12}
{1, 3, 8}
45
{7, 12, 6}
3
4
{0, 1, 12}
11
{0, 3, 14}
7
{0, 7, 6}
2
{0, 6, 2}
1
{0, 12, 2}
3
{1, 8, 6}
11
{3, 2, 0}
7
{6, 15, 0}
7
{7, 8, 8}
8
{7, 11, 8}
13
{1, 7, 12}
2
{0, 0, 8}
4
{2, 7, 8}
2
{0, 1, 0}
14
{0, 11, 0}
15
{0, 6, 6}
2
{1, 12, 6}
5
{2, 6, 0}
10
{2, 12, 0}
5
{2, 15, 0}
2
{3, 0, 0}
10
{2, 1, 12}
5
{7, 0, 14}
21
{14, 0, 12}
13
{14, 3, 12}
2
{3, 11, 12}
6
{6, 7, 12}
2
{7, 0, 0}
3
{14, 0, 6}
10
{14, 1, 8}
4
{0, 15, 14}
5
{0, 14, 6}
3
{3, 8, 14}
5
{1, 14, 8}
6
{8, 0, 6}
3
{0, 6, 14}
1
{15, 0, 2}
1
{15, 15, 8}
6
{15, 14, 0}
3
{3, 11, 0}
28
{0, 1, 14}
2
{0, 7, 2}
3
{1, 14, 6}
2
{3, 2, 3}
2
{6, 2, 6}
1
{14, 2, 14}
2
{12, 3, 12}
4
{12, 3, 0}
2
{15, 12, 0}
3
{0, 15, 3}
2
{6, 14, 0}
2
Table 4.15 (Continued)
Subseq.
Freq. Subseq.
Freq. Subseq.
Freq. Subseq.
Freq. Subseq.
Freq.
{1, 8, 2}
4
{0, 15, 6}
2
{1, 12, 14}
4
{6, 3, 12}
2
{7, 12, 12}
3
{0, 12, 6}
4
{0, 12, 14}
1
{7, 11, 12}
6
{3, 11, 14}
2
{7, 15, 14}
8
{7, 8, 14}
6
{0, 0, 12}
4
{0, 6, 12}
1
{1814}
2
{7, 8, 6}
1
{7, 12, 14}
3
{15, 1, 12}
2
{6, 1, 0}
2
{2, 8, 8}
3
{6, 8, 12}
1
{6, 6, 0}
1
{12, 0, 14}
8
{11, 0, 12}
2
{12, 0, 3}
2
{14, 7, 8}
3
{3, 12, 3}
4
{15, 15, 14}
2
{1, 1, 12}
2
{7, 1, 14}
5
{15, 11, 6}
2
{3, 3, 12}
5
{14, 3, 8}
1
{12, 1, 14}
1
{14, 3, 14}
2
{15, 15, 6}
1
{0, 2, 3}
2
{0, 14, 2}
2
{15, 0, 14}
1
{14, 0, 2}
1
{1, 0, 2}
4
{3, 8, 2}
2
{7, 14, 14}
2
{7, 1, 0}
2
{3, 12, 14}
2
{7, 7, 12}
2
{12, 3, 8}
3
{12, 7, 0}
1
{3, 3, 14}
2
{3, 14, 12}
3
{2, 2, 0}
2
• Identify three consecutive repetitions.
• Repeat the above to find (a) two consecutive repetitions, (b) a match of the sequence with the same sequence at any other place, and (c) a single occurrence of the combination.
• The combinations are tabulated.
• The total number of occurrences matched the number of input subsequences. Thus, the implementation is validated. The following are the statistics.
1. No. of three-consecutive subsequences = 101 (3018 after multiplying with frequency)
2. No. of two-consecutive subsequences = 225 (6288)
3. No. of matches of a subsequence with one at any other place (not consecutive) = 115 (1260)
4. No. of single occurrences = 106 (106)
5. Observe that 3018 + 6288 + 1260 + 106 = 10,672 = 667 × 16.
6. Thus there are in all 101 + 225 + 115 + 106 = 547 combinations. Here the subsequences need not be distinct, since they include those repeating once, two times, and three times.
As part of the preliminary analysis, optimal feature selection using the Steady-State Genetic Algorithm, considering the entire training data together, is carried out. The number of optimal features is found to be 106 out of 192, providing a classification accuracy of 92 %. This leads to a total data reduction of about 45 %.
4.5 Proposed Scheme
We propose a scheme that makes use of terms defined in Sect. 4.3. Figure 4.2 provides a broad outline of the scheme in three major stages. The first stage combines
domain knowledge of the data and preliminary data analysis to arrive at various
Table 4.16 Huffman coding and neighboring pattern assignment
Label
Frequency
Huffman code
Binary
representation
Assignment options
1
2
3
4
5
6
(1)
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
(10)
0
131,618
0
0000
0
0
0
0
0
0
1
32,874
001
0001
1
1
1
1
1
1
2
5374
011101
0010
2
1
2
2
2
2
3
29,514
1011
0011
3
3
3
3
3
3
4
3750
5
246
6
1010111
0100
4
3
4
4
4
3
001111101
0101
5
6
6
6
6
6
12,774
00111
0110
6
6
6
6
6
6
7
10,175
110111
0111
7
8
7
7
7
7
8
32,658
1111
1000
8
8
8
8
8
8
9
3125
0010111
1001
9
8
9
9
8
8
10
345
101111101
1010
10
12
11
11
11
11
11
2806
0111101
1010
11
12
11
11
11
11
12
20,404
13
2281
0101
1100
12
12
12
12
12
12
11111101
1101
13
12
13
12
12
12
14
15
10,495
01101
1110
14
15
14
14
14
14
21,721
0011
1111
15
15
15
15
15
15
Classification accuracy (%) kNNC with k = 5
93.75
87
93.7
93.8
93.3
92.5
Savings in space (%)
25.9
26.1
26.8
27.5
28.4
32.8
Fig. 4.2 Proposed scheme
parameters. The second stage consists of frequent item identification and pruning,
and the third stage consists of test data encoding and classification of the data. The
scheme is described through the following steps.
• Initialization
• Frequent item generation
• Generation of encoded training data
• Subsequence identification and frequency generation
• Pruning of subsequences
• Generation of encoded test data
• Classification using the distance-based rough set concept
• Classification using kNNC
We elaborate each of the above steps in the following subsections.
4.5.1 Initialization
In the given data, the number of training patterns, n, and the number of features, d, are known. Based on a priori domain knowledge of the data and through preliminary analysis of the training data, the following parameters are initialized to nonzero values.
• minimum support, ε, for frequent item generation,
• block length, b,
• minimum frequency for subsequence pruning, ψ, and
• dissimilarity threshold for identifying nearest neighbors to the pruned subsequences, η.
4.5.2 Frequent Item Generation
The input data encountered in practice, such as sales transaction data, contains features that are not frequent. They can equivalently be considered noisy when the objective is robust pattern classification. Also, the number of nonzero features differs from pattern to pattern. While generating an abstraction of the entire data, it is necessary to smooth out this noisy behavior, which otherwise may lead to an improper abstraction. This can be visualized by considering data such as handwritten digit data or sales transaction data. With such datasets in focus, the support of each feature across the training data is computed. The items whose support is above a chosen value are considered for the study. It should be noted here that the feature dimension, d, is kept unchanged, even though a few features get eliminated across all the patterns with the chosen ε. For example, for n = 8 and d = 8, the following sample sets (A) and (B) represent patterns in their original form and the corresponding frequent items with ε = 3, respectively.
Set A: (a) 11011010 (b) 10110110 (c) 11001001 (d) 01011010
(e) 01110100 (f) 11101011 (g) 11011010 (h) 00101000
Set B: (a) 11011010 (b) 10110010 (c) 11001001 (d) 01011010
(e) 01110000 (f) 11101010 (g) 11011010 (h) 00101000
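A minimal sketch of this step in Python is given below (not the authors' code; the helper name is hypothetical). Features whose support falls below ε are zeroed out in every pattern, while the feature dimension d is left unchanged.

def keep_frequent_items(patterns, eps):
    # Zero out features whose support (number of patterns containing them)
    # is below the minimum support eps; the dimension d is unchanged.
    d = len(patterns[0])
    support = [sum(p[j] for p in patterns) for j in range(d)]
    return [[p[j] if support[j] >= eps else 0 for j in range(d)] for p in patterns]

# Set A of the example above, with eps = 3; features 6 and 8 (supports of 2)
# fall below the minimum support and are zeroed in every pattern.
set_a = [[1,1,0,1,1,0,1,0], [1,0,1,1,0,1,1,0], [1,1,0,0,1,0,0,1], [0,1,0,1,1,0,1,0],
         [0,1,1,1,0,1,0,0], [1,1,1,0,1,0,1,1], [1,1,0,1,1,0,1,0], [0,0,1,0,1,0,0,0]]
set_b = keep_frequent_items(set_a, 3)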
4.5.3 Generation of Coded Training Data
At this stage, the training data consists only of frequent items. Considering b binary
items at a time, a decimal equivalent value is computed. The value of b is a result
of preliminary analysis or as obtained from domain knowledge. The value of d is an
integral multiple of b. The d binary features of each pattern are now represented as
q decimal values, where d = q · b.
Consider Set B of the above example for illustrating coded data generation.
Set B: (a) 11011010 (b) 10110010 (c) 11001001 (d) 01011010
(e) 01110000 (f) 11101010 (g) 11011010 (h) 00101000
With b = 2 and q = 4, the corresponding coded training data is given below.
Code Data: (a) 3122 (b) 2302 (c) 3021 (d) 1122
(e) 1300 (f) 3222 (g) 3122 (h) 0220
4.5.4 Subsequence Identification and Frequency Computation
The sequence of decimal values corresponding to a pattern is in turn grouped into
subsequences of decimal values. Some examples of such datasets are large sales
transaction datasets, where such similarity among the patterns is possible. We arrive
at the length of a subsequence based on a preliminary analysis of the training data. The length of a subsequence is a trade-off between representativeness and compactness. Every r decimal values form a subsequence. We compute the frequency of each unique subsequence; not all subsequences identified are unique. The number of distinct subsequences depends on ε: with increasing ε, the number of distinct subsequences reduces.
Continuing with the example from the previous section, the following is the set of ordered subsequences with their corresponding frequencies for r = 2.
22:4  31:2  02:2  23:1  30:1  21:1  11:1  13:1  00:1  32:1  20:1
By mapping these subsequences to unique id’s, we can rewrite ordered subsequences and their frequencies as follows.
1:4  2:2  3:2  4:1  5:1  6:1  7:1  8:1
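A small Python sketch of this step (illustrative only, with hypothetical function names): the coded patterns are cut into subsequences of length r, the frequency of every distinct subsequence is counted, and the distinct subsequences are mapped to unique ids in descending order of frequency.

from collections import Counter

def subsequences(coded_pattern, r):
    # Cut a coded pattern (a list of block values) into subsequences of length r.
    return [tuple(coded_pattern[i:i + r]) for i in range(0, len(coded_pattern), r)]

def subsequence_table(coded_patterns, r):
    # Frequency of every distinct subsequence, and ids assigned in
    # descending order of frequency (1 = most frequent).
    freq = Counter(s for p in coded_patterns for s in subsequences(p, r))
    ids = {s: i + 1 for i, (s, _) in enumerate(freq.most_common())}
    return freq, ids

# Coded data of Sect. 4.5.3, with r = 2.
coded = [[3,1,2,2], [2,3,0,2], [3,0,2,1], [1,1,2,2],
         [1,3,0,0], [3,2,2,2], [3,1,2,2], [0,2,2,0]]
freq, ids = subsequence_table(coded, 2)
print(freq[(2, 2)])   # 4: the subsequence 22 occurs four times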
4.5.5 Pruning of Subsequences
Consider the subsequences generated and compute the number of occurrences of each unique subsequence. We term this the frequency of a subsequence. Arrange the subsequences in descending order of frequency.
In order to retain frequent subsequences, all subsequences whose frequency is
less than ψ are identified. Each less frequent subsequence is replaced by its nearest
neighbor from the frequent subsequences. However, the nearest subsequence should
remain within a prechosen dissimilarity threshold, η. For meaningful generalization,
the value is chosen to be small.
We notice that the compression is achieved at two levels: first, by discarding the items that remain below a chosen support, ε, and generating distinct subsequences; second, by reducing the number of distinct subsequences through the choice of ψ and η. A reduction in the number of features reduces the VC dimension.
Let the total number of distinct subsequences in the data be m1. The number of distinct subsequences that remain after this step depends on the values of ψ and ε. All
the remaining distinct subsequences are numbered, say, from 1 to m2 . It should be
noted that m2 ≤ m1 for all ψ > 0. At the end of this step, the training data consists
of just m2 unique id’s, numbered from 1 to m2 .
Continuing with the example discussed in Sect. 4.5.4, with ψ = 2, the given distinct subsequences are replaced by their nearest neighbors. The following list demonstrates the assignment of subsequences to their more frequent nearest neighbors, subject to ψ. The bold entries represent the nearest neighbors of a replaced subsequence for a chosen value of η. This leads to a lossy form of compression.
22 31 02 23:22 30:31 21:22 11:31 13:31 00:02 32:31 20:22
4.5.6 Generation of Encoded Test Data
The dataset under study is divided into mutually exclusive training and test datasets.
By the choice of ε and ψ , the training data is reduced to a finite number of distinct
subsequences.
The test dataset too passes through a transformation before the activity of pattern classification. Proceeding on similar lines as in Sect. 4.3, b-bit decimal codes are generated for the test data. This results in a set of subsequences. It should be noted here that the
minimum support, ε, is not made use of explicitly.
However, it is likely that:
1. many of the subsequences in the test data are unlikely to be present in the ordered
subsequences of the training data, and
2. some of the previously discarded subsequences could be available in the test data.
Such a subsequence makes the dissimilarity computation between a training pattern and a test pattern, both represented in terms of subsequences, difficult. In view of this, at the time of classification of the test data, each new subsequence found in a test pattern is replaced by its nearest neighbor from the set of m2 subsequences generated using the training set. However, in this case, η is computed as post facto information.
4.5.7 Classification Using Dissimilarity Based on Rough Set
Concept
Rough set theory is used here for classification. A given class Ω is approximated, in rough set terminology, by two sets, viz., ΩL, the lower approximation of Ω, and ΩU, the upper approximation of Ω. ΩL consists of samples that are certain to belong to Ω. ΩU consists of samples that cannot be described as not belonging to Ω. Here the decision rule is based on a dissimilarity threshold. ΩU contains the training patterns that are neighbors by means of ordered distances, without any limit on the dissimilarity. ΩL contains the training patterns that lie below the chosen dissimilarity threshold. We classify the patterns falling within the lower approximation unambiguously. We reject those patterns that fall between the lower and upper approximations as unclassifiable.
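A hedged sketch of this decision rule in Python follows (illustrative; the single dissimilarity threshold and the helper names are assumptions, not the authors' code). A test pattern is accepted and labeled only if its nearest training pattern lies within the lower-approximation threshold; otherwise it is rejected as unclassifiable.

def rough_set_classify(test_pattern, training, dissimilarity, threshold):
    # training: list of (pattern, label); dissimilarity: a function of two patterns.
    # Patterns within the threshold fall in the lower approximation and are
    # classified unambiguously; the rest are rejected as unclassifiable.
    distances = [(dissimilarity(test_pattern, p), label) for p, label in training]
    d_min, label = min(distances)
    if d_min <= threshold:
        return label          # inside the lower approximation
    return None               # between lower and upper approximation: rejected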
We discuss the procedure for computing the dissimilarity between compressed patterns in the following section.
4.5.7.1 Dissimilarity Computation Between Compressed Patterns
Dissimilarity computation between compressed patterns, in the classification
schemes, viz., the current and the method discussed in Sect. 4.5, is based on the
unique identities of the subsequences.
For a chosen block length of 4-bits or 3-bits, all possible decimal codes are known
a priori. Every subsequence would then consist of known decimal codes. Thus, in
order to compute dissimilarity between two subsequences, storing an upper triangular matrix containing distances between all possible pairs of decimal codes would
be sufficient. For example, in case of 4-bit blocks, the range of decimal codes is 0
to 15. The size of the dissimilarity matrix is 16 × 16. Out of these 256 values, only
136 values corresponding to the upper triangular matrix are sufficient to compute
the dissimilarity between subsequences. In summary, the dissimilarity computation
between a training and a test pattern is simplified in the following ways.
• First, with b-bit encoding, the pattern consists of q blocks, where q = d/b. Thus, it requires only q < d comparisons.
• Second, by considering only frequent subsequences, the number of distinct subsequences further reduces, thereby reducing the number of comparisons further,
say, c, where c < q < d.
• Third, dissimilarity between two subsequences is carried out by simple table lookup.
Since the data is inherently binary, the Hamming distance is used for dissimilarity
computation.
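A minimal Python sketch of this table look-up is given below, assuming 4-bit blocks and the Hamming distance between block values; the names are illustrative and not the authors' code.

def hamming_4bit(u, v):
    # Hamming distance between two 4-bit block values (0..15).
    return bin(u ^ v).count("1")

# Precomputed 16 x 16 dissimilarity table; only the upper triangle is really needed.
TABLE = [[hamming_4bit(u, v) for v in range(16)] for u in range(16)]

def pattern_dissimilarity(p, q):
    # p, q: encoded patterns given as equal-length lists of block values.
    return sum(TABLE[u][v] for u, v in zip(p, q))

print(pattern_dissimilarity([0, 3, 8, 0, 3, 8], [0, 2, 0, 0, 2, 0]))   # 4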
4.5.8 Classification Using k-Nearest Neighbor Classifier
In this approach, each compressed test pattern consists of pruned subsequences with reference to the values of ε and ψ considered for generating the compressed training data. The dissimilarity of each test pattern with all training patterns is computed. The first k neighbors are identified based on the dissimilarity values. Based on majority voting, the test pattern is assigned a class label. The classification accuracy depends on the value of k.
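A short sketch of the kNNC step on the compressed patterns follows (illustrative Python; the dissimilarity function is assumed to be the table-based one sketched above).

from collections import Counter

def knn_classify(test_pattern, training, dissimilarity, k):
    # training: list of (encoded_pattern, label).
    # The k nearest training patterns vote; the majority label wins.
    neighbours = sorted(training, key=lambda item: dissimilarity(test_pattern, item[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]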
4.6 Implementation of the Proposed Scheme
The current section discusses each step of the proposed scheme in the context of the considered data.
4.6.1 Choice of Parameters
Parameters depend on the nature of the data. They are identified experimentally. The parameters considered are the minimum support for frequent item generation (ε), the minimum frequency for pruning of subsequences (ψ), and the maximum dissimilarity limit (η) for assigning nearest neighbors. For example, the value of η is identified as 3. The experiments are conducted on a large random sample from the training data.
For the number of bits forming a block, b, two lengths, 3 and 4, are considered. The 4-bit block values of two typical training patterns of the classes with labels 0 and 1, respectively, are provided in Table 4.17. There are 48 decimal equivalent codes (block values) for each pattern. In the table, space is left between successive subsequences of length 3 in order to indicate the row-wise separation in the 16 × 12-bit pattern. It may be noted that the maximum value of a 4-bit block is 15. Also, the similarity among different rows of a pattern, in terms of subsequences, should be noted.
Table 4.18 contains a sample of 3-bit coded training data. There are 64 block values for each pattern. In the table, space is left between successive sets of 4 block values, indicating the row-wise separation. It can be seen that, in any given subsequence, the maximum value of a block is 7, which is indicative of 3-bit coding. The similarity among the subsequences should be taken note of.
In the current implementation, after carrying out an elaborate set of exercises, 4-bit blocks are chosen. However, a brief mention of 3-bit blocks is made wherever necessary in order to bring home the subtlety of the concepts.
Table 4.17 Sample of 4-bit coded training data

  Label  Data
  0      0 3 8  0 3 8  0 3 8  0 3 8  0 3 12  0 3 12  0 7 12  0 7 12  0 12 12  1 9 12  1 9 12  3 1 8  3 1 8  3 7 0  3 7 0  3 12 0
  1      0 2 0  0 2 0  0 2 0  0 2 0  0 2 0  0 2 0  0 2 0  0 6 0  0 6 0  0 6 0  0 2 0  0 2 0  0 2 0  0 2 0  0 2 0  0 2 0

Table 4.18 Sample of 3-bit coded training data

  Label  Data
  0      0 0 7 0  0 0 7 0  0 0 7 0  0 0 7 0  0 0 7 4  0 0 7 4  0 1 7 4  0 1 7 4  0 3 1 4  0 6 3 4  0 6 3 4  1 4 3 0  1 4 3 0  1 5 6 0  1 5 6 0  1 7 0 0
  1      0 0 4 0  0 0 4 0  0 0 4 0  0 0 4 0  0 0 4 0  0 0 4 0  0 0 4 0  0 1 4 0  0 1 4 0  0 1 4 0  0 0 4 0  0 0 4 0  0 0 4 0  0 0 4 0  0 0 4 0  0 0 4 0
4.6.2 Frequent Items and Subsequences
The minimum support value, ε, is changed starting from 1. For example, from Table 4.17 and Table 4.18, observe the repeating subsequences. In Table 4.17, the first pattern contains the following unique subsequences of length 3 (r = 3), viz., (0, 3, 8), (0, 3, 12), (0, 7, 12), (0, 12, 12), (1, 9, 12), (3, 1, 8), (3, 7, 0), (3, 12, 0) with respective frequencies of repetition 4, 2, 2, 1, 2, 2, 2, 1. In Table 4.18, observe the subsequences of length 4 (r = 4), viz., (0, 0, 7, 0), (0, 0, 7, 4), (0, 1, 7, 4), (0, 3, 1, 4), (0, 6, 3, 4), (1, 4, 3, 0), (1, 5, 6, 0), (1, 7, 0, 0) with respective frequencies of repetition 4, 2, 2, 1, 2, 2, 2, 1. Increasing this value results in a smaller number of distinct subsequences.
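The distinct subsequences and their frequencies of repetition referred to above can be collected with a few lines of code; the sketch below assumes each training pattern has already been encoded as a list of subsequences, as in the earlier sketch, and is only an illustration.

```python
from collections import Counter

def subsequence_frequencies(training_patterns):
    """Frequency (number of occurrences) of every distinct subsequence over
    the training data; each pattern is a list of subsequence tuples."""
    counts = Counter()
    for pattern in training_patterns:
        counts.update(pattern)
    return counts

# For the first pattern of Table 4.17, for instance, (0, 3, 8) has frequency 4.
```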
The choice of the minimum support value, ε, influences the number of distinct
subsequences. Figure 4.3 depicts reduction in the number of distinct subsequences
with increasing support value, ε. With support ε = 0, the number of distinct subsequences is 690. Also, observe from the figure that at ε = 50, the number of distinct
subsequences is 543. Compare these 543 distinct subsequences of 4-bit encoded values with the total number of such encoded values in the training data, viz., 6670 · (192/4) = 6670 · 48 = 320,160.
With grouping of subsequences of length 3, the number of distinct subsequences
becomes 690 in the original data, which is further reduced by the choices of ε, η, and ψ.
It should be observed from the figure that the number of distinct subsequences
reduced from 690 at an input support value of 1 to 395 with an input support value
of 100. A further discussion on the impact of increasing support is provided in Sect. 4.7.
Fig. 4.3 Distinct subsequences as functions of support value (ε)
4.6.3 Compressed Data and Pruning of Subsequences
The distinct subsequences are numbered in the descending order of their frequency.
This forms the compressed data. Table 4.19 contains typical compressed training
data for arbitrarily chosen patterns from each of the classes 0–9.
The subsequences are pruned further by discarding infrequent subsequences.
This is carried out by choosing the value of ψ. A larger ψ reduces the number
of distinct subsequences. Figure 4.4 contains the distinct subsequences and the corresponding frequencies (ψ) for a minimum support value of 50, i.e., ε = 50. Observe
that the maximum subsequence number is 543, and its corresponding frequency
is 1.
Figure 4.5 shows the effect of the frequency limit on pruning after replacing
pruned subsequences by their nearest neighbor (NN). The data is generated for a
specific value of support, ε = 50.
Considering Figs. 4.4 and 4.5, the number of distinct subsequences for ε = 50
is 543. If we eliminate subsequences of frequency ψ = 1, the number of distinct
subsequences reduces to 452. Since such subsequences get eliminated from the
training data, they are replaced by their nearest-neighbor subsequences, subject to
the dissimilarity limit of η = 2. After elimination, the distinct subsequences are
renumbered. For example, as shown in Fig. 4.5, the 452 distinct subsequences are reduced to 106 distinct subsequences with increasing value of ψ. This forms the input for compressed training and test data.
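The numbering, ψ-based pruning, and nearest-neighbor replacement described above may be sketched as follows; it reuses the frequency counts and the subsequence dissimilarity from the earlier sketches, and the handling of a pruned subsequence with no neighbor within η is an assumption made only for illustration.

```python
def prune_and_renumber(counts, psi, eta):
    """Keep subsequences with frequency > psi, number them in descending
    order of frequency, and map every pruned subsequence to its nearest
    kept subsequence if that neighbor lies within Hamming distance eta."""
    kept = sorted((s for s, f in counts.items() if f > psi),
                  key=lambda s: -counts[s])
    index = {s: i + 1 for i, s in enumerate(kept)}      # 1-based numbering

    replacement = {}
    for s in counts:
        if s in index:
            continue
        nearest = min(kept, key=lambda t: subsequence_dissimilarity(s, t))
        if subsequence_dissimilarity(s, nearest) <= eta:
            replacement[s] = nearest
    return index, replacement
```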
Table 4.19 Sample training data in terms of subsequence numbers

Label  Unique codes of a pattern
0      23 23 23 23 51 51 39 39 82 98 98 25 25 43 43 30
0      23 23 13 13 87 57 57 46 46 46 116 116 22 9 9 6
0      3 3 5 5 5 5 22 22 25 58 58 56 56 14 14 11
0      18 18 11 85 29 29 59 112 58 58 114 58 25 25 68 5
1      10 10 10 10 10 10 10 1 1 1 10 10 10 10 10 10
1      17 17 2 2 1 1 1 1 3 3 28 28 8 8 8 8
1      15 15 1 1 1 1 15 15 15 3 3 3 3 3 3 3
1      10 10 1 1 1 1 1 1 1 3 3 3 3 28 28 28
2      8 8 19 19 19 19 1 1 3 3 3 508 508 86 86 139
2      30 30 236 236 188 188 18 18 195 398 398 566 566 442 442 1 2
2      24 24 34 34 8 8 34 34 80 80 190 190 202 202 60 60
2      6 6 5 5 6 6 12 12 13 13 9 9 9 9 18 18
3      6 6 66 66 12 12 12 12 58 58 25 25 5 5 7 7
3      45 45 2 2 2 2 6 6 6 4 4 21 21 21 21 35
3      20 20 13 13 4 4 12 12 6 6 2 2 14 14 30 30
3      12 12 13 13 32 4 4 9 5 5 83 83 2 14 14 19
4      4 4 4 4 25 25 29 29 16 14 14 1 1 1 1 15
4      55 55 55 55 70 70 32 32 14 14 14 14 10 10 1 1
4      2 2 2 47 100 100 100 100 11 11 6 1 1 1 15 15
4      55 55 21 4 32 37 48 25 14 14 11 10 10 1 1 1
5      35 35 9 9 34 60 60 42 2 2 26 26 49 49 49 78
5      95 95 81 51 51 1 13 13 122 21 21 137 67 67 14 18
5      55 55 51 51 6 11 11 34 30 30 3 3 3 30 30 24
5      4 4 35 35 11 11 18 18 1 1 274 274 42 42 60 60
6      4 4 23 23 12 1 1 1 13 13 89 89 89 13 13 12
6      24 24 36 8 8 36 36 24 112 156 212 212 522 522 177 33
6      8 8 8 8 24 29 29 16 132 132 25 25 56 5 5 3
6      15 15 3 3 8 8 8 8 43 16 16 155 155 14 14 14
7      5 5 5 2 2 10 7 7 6 15 15 3 3 3 28 28
7      3 3 13 13 13 53 53 4 2 2 12 12 1 7 7 3
7      13 13 50 50 2 2 1 1 15 15 3 3 28 28 28 28
7      11 11 11 5 5 2 2 1 1 1 7 7 3 3 28 28
8      5 5 49 49 159 159 49 49 49 162 162 74 74 49 49 14
8      20 20 13 13 13 64 26 26 6 1 1 6 32 32 13 6
8      12 12 13 13 64 64 6 6 7 7 7 11 11 154 154 30
8      6 6 5 5 26 26 11 11 7 5 5 26 26 5 5 3
9      11 11 5 5 29 25 25 58 14 14 5 5 4 4 4 4
9      2 2 13 13 32 48 48 9 9 9 4 4 4 4 4 4
9      12 12 12 12 66 66 6 6 6 6 2 2 2 2 2 2
9      5 5 16 40 40 112 112 14 14 2 17 17 4 4 4 4
Fig. 4.4 Distinct subsequences and their frequencies
4.6.4 Generation of Compressed Training and Test Data
Based on the pruned distinct subsequence list, say, 1 to 106, the training data is regenerated by replacing the distinct subsequences with these mapped numbers. As discussed in the previous subsection, each subsequence of the training data that is not available in the distinct subsequence list is replaced with its nearest neighbor among the distinct subsequences, within the dissimilarity threshold η = 2. It should
be noted here that in the considered datasets, η remained within 2 for both test and
training data. After generating compressed training data in the above manner, compressed test data is generated. It should be noted that test data is represented as 4-bit
blocks only. No other operations are carried out on test data.
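A sketch of this regeneration step is given below; it builds on the hypothetical index of the previous sketch and, as stated above, relies on every unknown subsequence of the considered data having a neighbor within η.

```python
def compress_pattern(pattern, index, eta=2):
    """Map a pattern (list of subsequences) to subsequence numbers; a
    subsequence not in the pruned list is replaced by its nearest kept
    subsequence, provided it lies within the dissimilarity threshold eta."""
    kept = list(index)
    compressed = []
    for s in pattern:
        if s not in index:
            nearest = min(kept, key=lambda t: subsequence_dissimilarity(s, t))
            # Assumption: in the considered data a neighbor within eta exists.
            if subsequence_dissimilarity(s, nearest) <= eta:
                s = nearest
            else:
                continue
        compressed.append(index[s])
    return compressed

# Training and test patterns are compressed the same way; test patterns are
# only converted to 4-bit blocks and mapped, with no other operation applied.
```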
4.7 Experimental Results
A number of case studies were carried out by varying the values of ε, ψ, and η.
With the rough set approach, the best classification accuracy obtained is 94.13 %,
by classifying 94.54 % of the test patterns and rejecting 182 out of 3333 test patterns
as unclassified.
The results obtained by kNNC approach are provided in Fig. 4.6. It should be further noted from the figure that 543 distinct subsequences at ε = 50 are reduced to 70
at ε = 220, without drastically reducing the classification accuracy. The best classification accuracy obtained using kNNC is 93.3 % for ε = 50, ψ = 3, and η = 1. It
Fig. 4.5 Distinct subsequences as functions of minimum frequency parameter (ψ )
made use of 452 distinct subsequences. However, it should be noted that the classification accuracy is not significantly different even for increasing ψ. For example, with
106 distinct subsequences, the classification accuracy obtained is 92.92 %.
4.8 Summary
Large handwritten digit data is compressed by means of a novel method in two
stages, first by applying the limit on support value and subsequently on the frequency of so-generated subsequences. In terms of subsequences of 4-bit blocks, the
method reduced the original 690 distinct subsequences, obtained without constraints on support and frequency, to 106 subsequences. The classification accuracy improved
as compared with the original data. Further, this can be seen as effective feature
reduction.
The scheme integrates supervised classification, frequent itemsets, compression,
rough sets, and kNNC classification. Lossy compression of data obtained through
such a scheme leads to a significant compaction of the data from 1,280,640 bits to 12-bit strings numbering 106. Classification accuracy is computed in the compressed
domain directly. With kNNC, the accuracy is 93.3 % and with rough set approach,
it is 94.13 %. We term this a hybrid learning methodology, as the activity combines
more than one learning technique.
Fig. 4.6 Classification accuracy for ε = 50 and varying ψ
It should be noted that the classification accuracy obtained here with lossy
compression is higher than what is obtained with the original data set, 92.47 %,
using kNNC for k = 7 as discussed in Chap. 3.
The parameter values of ε, ψ, and η are data dependent. With reduction in the
number of patterns, the VC dimension reduces, provided that the NNC accuracy is
not affected. Perhaps, similar conclusions can be drawn under Probably Approximately Correct (PAC) learning framework with the help of Disjunctive Normal
Forms as defined in Chap. 3.
4.9 Bibliographic Notes
The work makes use of the concepts of frequent itemsets and support. They are discussed in detail by Agrawal et al. (1993) and Han et al. (2000). Vector quantization
is discussed by Gray and Neuhoff (1998). An exhaustive account of rough set concepts is provided in Deogun et al. (1994) and Pawlak et al. (1995). Fundamentals of clustering can be found in Jain and Dubes (1988). Discussions on challenges in large-data clustering and classification are found in Ghosh (2003) and Jain et al. (1999). Data compression, which has been the enabling technology for multimedia communication, is discussed in detail in Sayood (2000) and Lelewer and Hirshberg (1987). Discussions on the VC dimension and classification accuracy can be
found in Karacah and Krim (2002). Probably Approximately Correct (PAC) learning framework is provided by Valiant (1984). Application of the proposed algorithm
can be found in Ravindra Babu et al. (2004) and an extension of the same work can
be found in Ravindra Babu et al. (2012). Mobahi et al. (2011) propose a method
to segment images by texture and boundary compression on Minimum Description
Length principle. Talu and Türkoğlu (2011) suggest a lossless compression algorithm that makes use of novel encoding by means of characteristic vectors and a
standard lossy compression algorithm. Definitions of a sequence and subsequence
can be found in Goldberg (1978).
References
R. Agrawal, T. Imielinski, A. Swamy, Mining association rules between sets of items in large
databases, in Proc. 1993 ACM-SIGMOD International Conference on Management of Data
(SIGMOD’93), (1993), pp. 266–271
J.S. Deogun, V.V. Raghavan, H. Sever, Rough set based classification methods for extended decision tables, in Proc. of Intl. Workshop on Rough Sets and Soft Computing, (1994), pp. 302–309
J. Ghosh, Scalable clustering, in The Handbook of Data Mining, ed. N. Ye (Lawrence Erlbaum
Assoc., Mahwah, 2003), pp. 247–278. Chapter 10
R.R. Goldberg, Methods of Real Analysis. 1st edn. (Oxford and IBH, New Delhi, 1978)
R.M. Gray, D.L. Neuhoff, Quantization. IEEE Trans. Inf. Theory, 44(6), 1–63 (1998)
J. Han, J. Pei, Y. Yin, Mining frequent patterns without candidate generation, in Proc. of ACM SIGMOD International Conference of Management of Data (SIGMOD’00), Dallas, Texas (2000),
pp. 1–12
A.K. Jain, R.C. Dubes, Algorithms for Clustering Data (Prentice Hall, Englewood Cliffs, 1988)
A.K. Jain, M. Narasimha Murty, P.J. Flynn, Data clustering: a review. ACM Comput. Surv. 31(3),
264–323 (1999)
B. Karacah, H. Krim, Fast minimization of structural risk by nearest neighbor rule. IEEE Trans.
Neural Netw. 14(1), 127–137 (2002)
D.A. Lelewer, D.S. Hirshberg, Data compression. ACM Comput. Surv. 9, 261–296 (1987)
H. Mobahi, S. Rao, A. Yang, S. Sastry, Y. Ma, Segmentation of natural images by texture and
boundary compression. Int. J. Comput. Vis. 95, 86–98 (2011)
Z. Pawlak, J. Grzymala-Busse, R. Slowinksi, W. Ziarko, Rough sets, Commun. ACM, 38, 89–95
(1995).
T. Ravindra Babu, M. Narasimha Murty, V.K. Agrawal, Hybrid learning scheme for data mining applications, in Proc. of Fourth Intl. Conf. on Hybrid Intelligent Systems (IEEE Computer
Society, 2004), pp. 266–271. doi:10.1109/ICHIS.2004.56
T. Ravindra Babu, M. Narasimha Murty, S.V. Subrahmanya, Quantization based sequence generation and subsequence pruning for data mining applications, in Pattern Discovery Using Sequence Data Mining: Applications and Studies, ed. by P. Kumar, P. Krishna, S. Raju (Information Science Reference, Hershey, 2012), pp. 94–110
K. Sayood, Introduction to Data Compression, 1st edn. (Morgan Kaufmann, San Mateo, 2000)
M.F. Talu, I. Türkoğlu, Hybrid lossless compression method for binary images. IU, J. Elect. Electron. Eng. 11(2) (2011)
L.G. Valiant, A theory of the learnable. Commun. ACM 27(11), 1134–1142 (1984)
Chapter 5
Data Compaction Through Simultaneous
Selection of Prototypes and Features
5.1 Introduction
In Chap. 4, we presented a novel scheme for data reduction by subsequence generation and pruning. We extend the concept in the current work to select prototypes
and features together from large data.
Given a large dataset, it is always interesting to explore whether one can generate
an abstraction with a subset or representative set of patterns drawn from the original
dataset that is at least as accurate as the original data. Such a representative dataset
forms prototypes. Drawing a random sample and resorting to pattern clustering are
some of the approaches to generate a set of prototypes from a large dataset.
In the given dataset, when each pattern is represented by a large set of features,
it is efficient to operate on a subset of features. Such a feature set should be representative. This forms the problem of feature selection. Some methods of selecting
optimal or ideal subset of features are through optimization methods. We explore
the option of frequent-item support to generate a representative feature subset.
In this process, we examine the following aspects.
• Effect of frequent items on prototype selection
• Effect of support-based frequent items on feature selection and evaluation of their
representativeness
• Impact of sequencing of clustering and frequent item generation on classification
• Combining clustering and frequent item generation resulting in simultaneous selection of patterns and features
The chapter is organized as follows. In Sect. 5.2, we provide a brief overview of prototype selection, feature selection, and the resultant data compaction. Section 5.3 contains background material necessary for appreciating the proposed methods. Section 5.4
contains preliminary data analysis that provides insights into prototype and feature
selection. Section 5.5 contains a discussion on the approaches proposed in this work.
Implementation of proposed schemes and experimentation is discussed in Sect. 5.6.
The work is summarized in Sect. 5.7. Section 5.8 contains bibliographic notes.
5.2 Prototype Selection, Feature Selection, and Data Compaction
In a broad sense, any method that incorporates information from training samples
in the design of a classifier employs learning. When the dataset is large and each
pattern is characterized by a high-dimensional feature vector, abstraction or classification becomes an arduous task. With large dimensionality of the feature space, the number of samples needed grows exponentially. This limitation is known as the Curse of Dimensionality. For superior performance of a classifier, one aims to minimize the Generalization
Error, which is defined as the error rate on unseen or test patterns.
While dealing with high-dimensional large data, for the sake of abstraction generation and scalability of algorithms, one resorts to either dimensionality reduction or
data reduction or both. Approaches to data reduction include clustering, sampling,
data squashing, etc. In clustering, this is achieved by means of considering cluster representatives, either members of the original data, such as leaders and medoids, or derived representatives, such as centroids.
Sampling schemes involve simple random sampling with and without replacement
or stratified random sampling. Data squashing involves a form of lossy compression where pseudo data is generated from the original data through different steps.
A number of other approaches like BIRCH generate a summary of original data that
is necessary for further use.
Prototype selection by using Partition Around Medoids (PAM), Clustering
LARge Applications (CLARA), and Leader was earlier reported. It was shown in
the literature that classification performance of Leader clustering algorithm is better
than that of PAM and CLARA. Further, the computation time for Leader algorithm
is much less when compared to PAM and CLARA for the same data set. The computational complexities of PAM and CLARA are O(c(n − c)²) and O(cs² + c(n − c)), respectively, where n is the size of the data, s is the sample size, and c is the number
of clusters. Leader has linear complexity. Although CLARA can handle larger data
than PAM, its efficiency depends on the sample size and unbiasedness.
Medoids, PAM, CLARA, and CLARANS
In prototype selection, a prototype is a member of the original dataset. Leaders and medoids are members of the dataset. This is in contrast with the k-means algorithm, where the centroids need not be members of the original dataset.
A Medoid is defined as that cluster member for which the average dissimilarity with every other member of the same cluster is minimal. The advantage
of medoids over k-means is that they are robust to outliers. Partition Around
Medoids (PAM) finds k-medoids by passing through two stages, viz., the build
phase and the swap phase. After finding the k medoids, clusters are formed by assigning each remaining member to its nearest medoid. Medoids are the most centrally located members within a cluster. Each iteration has complexity O(k(n − k)²), thus making it computationally expensive for large values of n and k, where n is the
number of patterns, and k is number of clusters. CLARA (Clustering LARge
Applications) uses a sampling-based method for scaling for large data. It uses
the PAM method on a random sample drawn from the original dataset. With
multiple samples from the entire dataset, it compares the average minimum
dissimilarity of all objects in the entire dataset. The sample size suggested is
40 + 2k. However, in practice, CLARA requires a large amount of time when k is greater than 100. It is observed that CLARA does not always generate good prototypes, and it is computationally expensive for large datasets, with a complexity of O(kc² + k(n − k)), where c is the sample size. CLARANS
(Clustering Large Applications based on RANdomized Search) combines the
sampling technique with PAM. CLARANS replaces the build part of PAM
by random selection of objects. Where CLARA considers a fixed sample size
at every stage, CLARANS introduces randomness in sample selection. Once
a set of objects is selected, a new object is selected when a preset value of
local minimum and maximum neighbors is searched using the swap phase.
CLARANS generates better cluster representatives than PAM and CLARA.
The computational complexity of CLARANS is O(n²).
The other schemes for prototype selection include support vector machine
(SVM) and Genetic Algorithm (GA) based schemes. The SVMs are known to be
expensive as they take O(n³) time. The GA-based schemes need multiple scans of the
dataset, which could be prohibitive when large datasets are processed.
As an illustration of prototype selection, we compare two algorithms that require
a single database scan, viz., the Condensed Nearest-Neighbor (CNN) rule and the Leader clustering algorithm.
The outline of CNN is provided in Algorithm 5.1. The CNN starts with the first
sample as a selected point (BIN-2). Subsequently, using the selected pattern, the
other patterns are classified. The first incorrectly classified sample is included as
an additional selected point. Likewise with selected patterns, all other patterns are
classified to generate a final set of representative patterns.
Algorithm 5.1 (Condensed Nearest Neighbor rule)
Step 1: Set two bins called BIN-1 and BIN-2. The first sample is placed in BIN-2.
Step 2: The second sample is classified by the NN rule, using current contents of
BIN-2 as reference set. If the second sample is classified correctly, it is
placed in BIN-1; otherwise, it is placed in BIN-2.
Step 3: Proceeding in this manner, the ith sample is classified by the current contents of BIN-2. If classified correctly, it is placed in BIN-1; otherwise, it is
placed in BIN-2.
Step 4: After one passes through the original sample set, the procedure continues
to loop through BIN-1 until termination in one of the following ways
(a) The BIN-1 is exhausted with all its members transferred to BIN-2, or
(b) One complete pass is made through BIN-1 with no transfers to BIN-2.
Step 5: The final contents of BIN-2 are used as reference points for the NN rule;
the contents of BIN-1 are discarded.
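A compact Python sketch of Algorithm 5.1 is given below; the 1-NN rule here uses the Hamming distance for binary patterns, and the data layout (lists of 0/1 feature vectors with class labels) is an assumption made for illustration.

```python
def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def nn_label(x, reference):
    # reference: list of (pattern, label); label of the nearest reference pattern.
    return min(reference, key=lambda pl: hamming(x, pl[0]))[1]

def condensed_nn(patterns, labels):
    """Condensed Nearest Neighbor rule: BIN-2 accumulates the selected
    prototypes, BIN-1 holds the samples classified correctly so far."""
    bin2 = [(patterns[0], labels[0])]
    bin1 = []
    for x, y in zip(patterns[1:], labels[1:]):
        (bin1 if nn_label(x, bin2) == y else bin2).append((x, y))
    transferred = True
    while transferred and bin1:          # loop over BIN-1 until no transfers
        transferred = False
        remaining = []
        for x, y in bin1:
            if nn_label(x, bin2) == y:
                remaining.append((x, y))
            else:
                bin2.append((x, y))
                transferred = True
        bin1 = remaining
    return bin2                          # final prototypes; BIN-1 is discarded
```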
The Leader clustering algorithm is provided in Sect. 2.5.2.1. Discussions related
to Leader algorithm are provided in Sect. 5.3.4.
A comparative study is conducted between CNN and Leader by providing all the
6670 patterns as training data and 3333 patterns as test data for classifying them with
the help of Nearest-Neighbor Classifier (NNC). Table 5.1 provides the results. In the
table, Classification Accuracy is represented as CA. CPU Time refers to the processing time computed on Pentium III 500 MHz computer as time elapsed between the
first and last computations. The table provides a comparison between both methods.
It demonstrates the effect of threshold on the number of (a) prototypes selected,
(b) CA, and (c) processing time. A finite set of thresholds is chosen to demonstrate
the effect of the distance threshold. It should be noted that for binary patterns, the Hamming and Euclidean distances provide equivalent information. At the same time, the Hamming distance reduces computation time, as it avoids computing squared deviations and the square root. Hence, we choose the Hamming distance as the dissimilarity measure.
The exercises indicate that compared to Leader algorithm, CNN requires more
time for obtaining the same classification accuracy. But CNN provides fewer prototypes, and the set is fixed for a chosen order of the input data. The Leader algorithm offers a way of improving the classification accuracy by means of threshold
value-based prototype selection and thus provides a greater flexibility to operate
with. In view of this and based on the earlier comparative study with PAM and
CLARA, Leader is considered for prototype selection in this study. We use the NNC
as the classifier. In order to achieve efficient classification, we use the set of prototypes obtained using the Leader algorithm. Our scheme offers flexibility to select
different sizes of prototype sets.
Dimensionality reduction is achieved through either feature selection or feature
extraction. In feature selection, dimensionality is reduced by removing redundant features, typically through optimal feature selection using deterministic and random search
algorithms. Some of the conventional algorithms include feature selection by individual merit basis, branch-and-bound algorithm, sequential forward and backward
selection, plus l–take away r algorithm, max–min feature selection, etc. Feature extraction methods utilize all the information contained in feature space to obtain a
transformed space resulting in lower dimension.
Considering these philosophical and historical notes, in order to obtain generalization and regularization, we examine a large handwritten digit dataset in terms of feature selection and data reduction and whether there exists an equivalence between the two.
Four different approaches are presented, and the results of the exercises are provided to drive home the issues involved. We classify large handwritten digit data
by combining dimensionality reduction and prototype selection. The compactness
achieved by dimensionality reduction is indicated by means of the number of combinations of distinct subsequences. A detailed study of subsequence-based lossy
compression is presented in Chap. 4. The concepts of frequent items and the Leader clustering algorithm are used in the work.
The handwritten digit data set considered for the study consists of 10 classes
of nearly equal number of patterns. The data consisting of about 10,000 labeled
Table 5.1 Comparison between CNN and Leader

Method   Distance threshold   No. of prototypes   CA (%)   CPU time (sec)
CNN      –                    1610                86.77    942.76
Leader   5                    6149                91.24    1171.81
Leader   10                   5581                91.27    1066.54
Leader   15                   4399                90.40    896.84
Leader   18                   3564                90.20    735.53
Leader   20                   3057                88.03    655.44
Leader   22                   2542                87.04    559.52
Leader   25                   1892                84.88    434.00
Leader   27                   1526                81.70    363.00
patterns is divided into training and test patterns in the approximate ratio of 67 % and 33 %. About 7 % of the total dataset is used as validation data, and it is taken out of the training dataset. Each pattern consists of 192 binary features. The number of
patterns per class is nearly equal.
5.2.1 Data Compression Through Prototype and Feature Selection
In Chap. 4, we observed that increasing frequent-item support till a certain value
leads to data compaction without resulting in significant reduction in classification
accuracy. We explore whether such a compaction would lead to selection of better
prototypes than selection without such a compaction. Similarly, we study whether
activity that leads to feature selection would result in a better representative feature
set. We propose to evaluate both these activities through classification of unseen
data.
5.2.1.1 Feature Selection Through Frequent Features
In this chapter, we examine whether frequent-item support helps in arriving at such
a discriminative feature set. We explore selecting such a feature set with varying support values. We evaluate each such selected set through classifying unseen patterns.
5.2.1.2 Prototype Selection for Data Reduction
The use of representative patterns in place of original dataset reduces the input data
size. We make use of an efficient pattern clustering scheme known as leader clustering, which is discussed in Sect. 2.5.2.1. The algorithm generates clustering in
a single database scan. The leaders form cluster representatives. The clustering algorithm requires an optimal value of dissimilarity threshold. Since such a value is
data dependent, a random sample from the input data is used to arrive at the threshold. Each cluster is formed with reference to a leader. The leaders are retained, and
the remaining patterns are discarded. The representativeness of leaders is evaluated
with the help of pattern classification of unseen patterns with the help of the set of
leaders.
5.2.1.3 Sequencing Feature Selection and Prototype Selection for
Generalization
Prototype selection and feature selection reduce the data size and dimensionality, respectively. It is educative to examine whether feature selection using frequent items
followed by prototype selection, or vice versa, would have any impact on classification accuracy. We experiment with both these orderings to evaluate their relative performance.
5.2.1.4 Class-Wise Data vs Entire Data
Given a multiclass labeled dataset, we examine the relative performance of considering the dataset class-wise and a single large set of multiclass data. We observe
from Fig. 5.4 and Table 5.6 that patterns belonging to different class labels require
different numbers of effective features to represent the pattern. Identifying a class-wise feature set or pattern set is likely to be a better representative of the class. On
the contrary, it is interesting to examine whether there could be a common threshold
for prototype selection and common support threshold for selecting a feature set to
represent the entire dataset.
5.3 Background Material
Consider training data containing n patterns, each having d features. Let ε and ζ
be the minimum support for considering any feature for the study and the distance threshold for selecting prototypes, respectively. For continuity of notation,
we follow the same terminology as provided in Table 4.1. Also, the terms defined
in Sect. 4.3 are valid in the current work too. Additional terms are provided below.
1. Leader. Leaders are cluster representatives obtained by using Leader Clustering
algorithm.
2. Distance Threshold for clustering (ζ ). It is the threshold value of the distance
used for computing leaders.
Illustration 5.1 (Leaders, choice of first leader, and impact of threshold) In order
to illustrate computation of leaders and impact of the threshold on the number
Table 5.2 Transactions and items

Transaction No.   Item 1   Item 2   Item 3   Item 4   Item 5   Item 6
1                 1        1        0        0        1        0
2                 0        1        1        0        1        1
3                 0        0        1        0        1        1
4                 1        0        0        0        0        1
5                 1        0        1        1        1        0
of leaders, we consider UCI-ML dataset on iris. We demonstrate the concepts on
iris-versicolor data. We consider the petal length and width as two features per
pattern. In applying the Leader algorithm, we consider the Euclidean distance as
a dissimilarity measure. To start with, we consider a distance threshold (ζ) of
1.4 cm and consider the first pattern as leader 1. The result is shown in Fig. 5.1.
The figure contains two clusters with respective cluster members shown as different symbols. Leaders are shown with superscribed square symbols. In order
to demonstrate the order dependence of leaders, we consider the same distance
threshold and select pattern no. 16 as the first leader. As shown in Fig. 5.2, we
still obtain two clusters with location of different first leader and different number of cluster members. As a third example, we consider the distance threshold
of 0.5 cm. We obtain seven clusters as shown in Fig. 5.3 . Note that the leaders
are shown with a superscribed square. When we consider a large threshold of,
say, 5.0 cm, we obtain a single cluster, which essentially is a scatter plot of all
patterns.
3. Transaction. A transaction is represented using the set of items that could possibly be purchased. In any given transaction, all or a subset of the items could be purchased. Thus, a transaction indicates the presence or absence of each item.
This is analogous to a pattern with presence or absence of binary-valued features.
Illustration 5.2 (Transaction or Binary-Valued Pattern) Consider a transaction
with six items. We represent an item bought as “1” or not bought as “0”. We
represent five transactions with the corresponding itemsets in Table 5.2. For example, in transaction 3, items 3, 5, and 6 are purchased.
The leader clustering algorithm is explained in Sect. 2.5.2.1. We use (a) pattern
and transaction and (b) item and feature interchangeably in the current work. The
following subsections describe some of the important concepts used in explaining
the proposed method. As compared to k-means clustering, the leader clustering algorithm identifies prototypes in one data scan and does not involve iteration. However, leaders are order dependent. It is possible that we arrive at different sets of
leaders depending on the choice of the first leader. Further, the centroid in k-means
Fig. 5.1 Leader clustering with a threshold of 1.4 cm on Iris-versicolor dataset. First leader is
selected as pattern sl no. 1
Fig. 5.2 Leader clustering with a threshold of 1.4 cm on Iris-versicolor dataset. First leader is
selected as pattern sl no. 16
Fig. 5.3 Leader clustering with a threshold of 0.5 cm on Iris-versicolor dataset. First leader is
selected as pattern sl no. 16. The data is grouped into seven clusters
most often does not coincide with one of the input patterns. On the contrary, leaders
are one of the patterns.
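Since the Leader algorithm is used throughout this chapter, a minimal single-scan sketch is included here; it follows the description referred to in Sect. 2.5.2.1, while the default distance function and the data layout are illustrative assumptions.

```python
def leader_clustering(patterns, zeta,
                      distance=lambda x, y: sum(a != b for a, b in zip(x, y))):
    """Single-scan Leader clustering: the first pattern becomes a leader;
    every subsequent pattern joins the first leader within distance zeta,
    otherwise it becomes a new leader (hence the order dependence)."""
    leaders, assignment = [], []
    for x in patterns:
        for i, ldr in enumerate(leaders):
            if distance(x, ldr) <= zeta:
                assignment.append(i)
                break
        else:                               # no leader within zeta
            leaders.append(x)
            assignment.append(len(leaders) - 1)
    return leaders, assignment

# The Euclidean distance can be passed for data such as the iris example;
# the Hamming distance (default above) suits the binary digit data.
```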
5.3.1 Computation of Frequent Features
In the current section, we describe an experimental setup wherein we examine which
of the features helps discrimination using frequent-item support. This is done by
counting the number of occurrences of each feature in the training data. If the number is less than a given support threshold ε, the value of the feature is set to be
absent in all the patterns. After identifying the features that have support less than
ε as infrequent, the training data set is modified to contain “frequent features” only.
As noted earlier, the value of ε is a trade-off between the minimal description of the
data under study and the maximal compactness that could be achieved. The actual
value depends on the size of the training data, such as class-wise data of 600 patterns each or the full data of 6000 patterns.
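A sketch of this frequent-feature computation is given below; representing the training data as a list of binary feature vectors is an assumption made for illustration.

```python
def frequent_feature_mask(train, epsilon):
    """For each of the d binary features, report whether its support (the
    number of training patterns in which it equals 1) is at least epsilon."""
    d = len(train[0])
    support = [sum(pattern[j] for pattern in train) for j in range(d)]
    return [s >= epsilon for s in support]

def keep_frequent_features(data, mask):
    """Set infrequent features to 0 in every pattern; the number of features
    and the number of patterns remain unchanged."""
    return [[v if keep else 0 for v, keep in zip(pattern, mask)]
            for pattern in data]
```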
To illustrate the concept, we consider one arbitrary pattern from each class and
display each such pattern with frequent features having supports of 1, 100, 200,
300, and 400. Figure 5.4 contains each of those patterns. Each of the support values
considered in the example is out of 600 patterns. It is interesting to note that although
with increasing support a pattern becomes less decipherable to the human eye as
shown in the figure, it is sufficient for the machine to classify it correctly. Later in
the current chapter, we demonstrate the advantage of this concept.
Fig. 5.4 Sample patterns with frequent features having support 1, 100, 200, 300, and 400
5.3.2 Distinct Subsequences
The concepts of sequence, subsequence, and length of a subsequence are used in
the context of demonstrating compactness of a pattern, as discussed in Chap. 4. For
example, consider the pattern containing binary features, {01110110110100011011
0010 . . .}. Considering a block of length 4 the pattern can be written as {0111 0110
1101 0001 1011 0010 . . .}. The corresponding values of blocks, which are decimal equivalents of the 4-bit blocks, are {7, 6, 13, 1, 11, 2, . . .}. When arranged as
a 16 × 12 pattern matrix, each row of the matrix would contain three blocks, each
of length of 4 bits, as {(7, 6, 13), (1, 11, 2), . . .}. Let each set of three such codes
form a subsequence, e.g., {(7, 6, 13)}. In the training set, all such distinct subsequences are counted. Original data of 6000 training patterns consists of 6000 · 192
features. When arranged as subsequences, the corresponding number of distinct subsequences is 690.
We count the frequency, i.e., the number of occurrences, of each of the subsequences. Subsequently, the subsequences are ordered in descending order of their frequency and are sequentially numbered for internal use. For example, the first two most frequent distinct subsequences, {(0, 6, 0)} and {(0, 3, 0)}, are repeated 8642 times and 6447 times, respectively. As the minimum support value, ε, is increased, some of the binary feature values are set to zero. This leads to a reduction in the number of distinct subsequences, and we show later that it also provides a better generalization.
5.3.3 Impact of Support on Distinct Subsequences
As discussed in Sect. 5.3.2, with increasing ε, the number of distinct subsequences
reduces. For example, consider the pattern {(1101 1010 1011 1100 1010 1011. . . )}.
The corresponding 4-bit block values are {(13, 10, 11), (12, 10, 11), . . .}. Suppose that with the chosen support, the feature number 4 in the considered pattern is absent. This would make the pattern {(110010101011110010101011,. . . )}.
Thus, the original distinct subsequences {(13, 10, 11), (12, 10, 11), . . .} reduce to
{(12, 10, 11), (12, 10, 11), . . .}, where {(13, 10, 11)} is replaced by {(12, 10, 11)}.
This results in reduction in the number of distinct subsequences.
5.3.4 Computation of Leaders
The Leader computation algorithm is described in Sect. 2.5.2 of Chap. 2. The leaders
are considered as prototypes, and they alone are used further, either for classification
or for computing frequent items, depending on the adopted approach. This forms
data reduction.
5.3.5 Classification of Validation Data
The algorithm is tested against the validation data using k-Nearest-Neighbor Classifier (kNNC). Each time, prototypes alone are used to classify test patterns. Different
approaches are followed to generate prototypes. Depending on the approach, the
prototypes are either “in their original form” or in a “new form with reduced number of features.” The schemes are discussed in the following section.
5.4 Preliminary Analysis
We carry out elaborate preliminary analysis in arriving at various parameters and
also to study the sensitivity of such parameters.
Table 5.3 contains the results of preliminary experiments considering the training dataset as 6670 patterns, which combines the training and validation data. The exercises provide insights on the choice of thresholds, the number of patterns per class, the reduced set of training patterns, and the classification accuracy with the test data. It can be noticed from the table that as the distance threshold (ζ) increases, the number of prototypes reduces, and that the classification reaches the best accuracy for an optimal set of thresholds, beyond which it starts reducing. The table consists of class-wise
thresholds for different discrete choices. One such threshold set that provides the
best classification accuracy is {3.5, 3.5, 3.5, 3.8, 3.5, 3.7, 3.5, 3.5, 3.5, 3.5}, and the
accuracy is 94.0 %.
Table 5.4 provides the results on impact of various class-wise support thresholds
for a chosen set of leader distance threshold values. We consider a class-wise distance threshold of 3.5 for prototype selection using the Leader clustering algorithm
for the study. Column 1 of the table contains the class-wise support values, column
2 consists of the totals of prototypes, which are the sums of class-wise prototypes
generated with a common distance threshold of 3.5, and column 3 consists of classification accuracies using kNNC. The table reveals an interesting aspect: when only the patterns with frequent features are selected, the number of representative patterns also reduces.
We study the sensitivity of support in terms of the number of distinct subsequences. We consider all 6670 patterns and apply the support threshold. We
present the numbers of distinct subsequences and evaluate the patterns with reduced
Table 5.3 Experiments with leader clustering with class-wise thresholds and prototypes

Class-wise distance threshold   Prototypes per class (classes 0–9)                      #Prototypes   CA (%)
4.0                             412  21  580  545  500  606  398  243  528  374         4207          92.53
3.0                             609  74  663  657  649  661  686  549  642  624         5764          93.61
3.2                             561  49  647  640  629  657  577  452  628  565         5405          93.61
3.4                             542  42  641  630  612  653  548  409  606  536         5219          93.85
3.6                             506  34  628  606  593  648  516  363  593  497         4984          93.88
3.7                             418  30  616  589  567  640  489  322  570  463         4764          93.82
Table 5.4 Experiments with support value for a common set of prototype selection

Support threshold   No. of prototypes   Classification accuracy (%)
5                   4981                93.82
10                  4974                93.82
15                  4967                93.61
20                  4962                93.58
25                  4948                93.67
30                  4935                93.61
35                  4928                93.49
40                  4915                93.7
45                  4899                93.55
50                  4887                93.79
55                  4875                93.76
numbers of distinct features using kNNC. Table 5.5 contains the results. In the table,
column 1 contains the support threshold for the features. Column 2 contains the corresponding distinct subsequences. The classification accuracy of validation patterns
with the considered distinct subsequences is provided in column 3.
Table 5.5 Distinct subsequences and classification accuracy for varying support and constant set of input patterns

Support threshold   Distinct subsequences   Classification accuracy (%)
0                   690                     92.47
15                  648                     92.35
25                  599                     92.53
45                  553                     92.56
55                  533                     92.89
70                  490                     92.29
80                  468                     92.26
90                  422                     92.20
100                 395                     92.32
5.5 Proposed Approaches
With the background of the discussion provided in Sect. 5.2, we propose the following four approaches.
• Patterns with frequent features only, considered in both class-wise and combined
sets of multi-class data
• Cluster representatives only in both class-wise and entire datasets
• Frequent item selection followed by prototype selection in both class-wise and
entire datasets
• Prototype selection followed by frequent items in both class-wise and combined
datasets
In the following subsections, we elaborate each approach.
5.5.1 Patterns with Frequent Items Only
In this approach, we consider the entire training data. For a chosen ε, we form patterns
containing frequent features only. With the training data containing frequent features, we classify validation patterns using kNNC. By varying ε, the Classification
Accuracy (CA) is computed. The value of ε that provides the best CA is identified,
and results are tabulated. The entire exercise is repeated considering class-wise data
and class-wise support as well as full data. It should be noted that the support value
depends on data size. The procedure can be summarized as follows.
Class-Wise Data
• Consider class-wise data of 600 patterns per class.
• By changing support in small discrete steps carry out the following steps.
1. Compute frequent features.
2. Consider class-wise training data with frequent features and combine them to
form full training dataset of 6000 patterns. It should be noted here that the
prototype set is not changed here.
3. Classify validation patterns and record CA.
4. Compute the number of distinct subsequences.
Full Data
• Consider full training dataset of 6000 patterns containing all 10 classes together.
• By changing support in small discrete steps carry out the following steps.
1. Compute frequent features.
2. Consider full training data with frequent features.
3. Classify validation patterns and record CA.
4. Compute the number of distinct subsequences.
5.5.2 Cluster Representatives Only
In this approach, we consider training data and use Leader clustering algorithm to
identify leaders. The leaders form prototypes. Use the set of prototypes to classify
validation data. For computing leaders, we change the distance threshold value, ζ ,
in small steps. The training data is considered class-wise and as a full dataset separately. The procedure can be summarized as follows.
Class-Wise Data
• Consider class-wise data of 600 patterns per class.
• By changing the distance threshold in small discrete values carry out the following steps.
1. Compute class-wise leaders that form prototypes.
2. Consider class-wise prototypes and combine them to form a full training
dataset of prototypes.
3. Classify validation patterns and record CA.
Full Data
• Consider a full training dataset of 6000 patterns containing all 10 classes together.
• By changing the threshold ζ in small discrete values carry out the following steps.
1. Compute leaders over entire data.
2. Consider prototypes corresponding to the full training data.
3. Classify validation patterns and record CA.
5.5.3 Frequent Items Followed by Clustering
In the current approach, to start with, the frequent items are identified for different
support values, ε. The training data at this stage contains only frequent features. The
training data is subjected to leader clustering to identify cluster representatives. The
prototypes thus formed are used to classify validation data. The data is considered
class-wise and as full data. The procedure is summarized as follows.
Class-Wise Data
• Consider class-wise data of 600 patterns per class.
• By changing support threshold, ε, in small discrete values carry out the following
steps.
1. Compute class-wise frequent features.
2. Combine class-wise patterns having frequent features to form the training
dataset.
3. Compute prototypes for different values of ζ to identify prototype patterns that
contain frequent features.
4. Classify validation data patterns.
5. Compute the number of distinct subsequences.
Full Data
• Consider a full training dataset of 6000 patterns containing all 10 classes together.
• By changing the support threshold, ε, in small discrete values carry out the following steps.
1. Compute frequent features.
2. Carry out clustering and identify prototypes for different values of ζ.
3. Classify validation patterns and record CA.
4. Compute the number of distinct subsequences.
5.5.4 Clustering Followed by Frequent Items
In the current approach, the clustering is carried out for different distance threshold values, ζ, as the first step. The training data at this stage contains only prototypes.
Frequent features in the prototypes are identified. The prototypes thus formed with
a pruned set of features are used to classify validation data. The data is considered
class-wise and as full data. The procedure is summarized as follows.
Class-Wise Data
• Consider class-wise data of 600 patterns per class.
• By changing distance thresholds, ζ , in small discrete values carry out the following steps.
1. Compute leaders.
2. Combine all class data to form the training dataset.
3. Compute frequent items among all the leaders for different values of ζ.
4. Use the training data so generated to classify validation data patterns.
5. Compute the number of distinct subsequences.
Full Data
• Consider a full training dataset of 6000 patterns containing all 10 classes together.
• By changing the distance threshold, ζ , in small discrete values carry out the following steps.
1. Compute leaders.
2. Compute frequent items among the leaders.
3. Classify validation patterns and record CA.
4. Compute the number of distinct subsequences.
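The two orderings of Sects. 5.5.3 and 5.5.4 can be sketched as short pipelines built from the earlier hypothetical helpers (frequent_feature_mask, keep_frequent_features, and leader_clustering); the composition below is an illustration of the procedures and not the authors' implementation.

```python
def frequent_then_leaders(train, labels, epsilon, zeta):
    """Sect. 5.5.3: frequent-feature selection followed by class-wise
    Leader clustering of the reduced patterns."""
    mask = frequent_feature_mask(train, epsilon)
    reduced = keep_frequent_features(train, mask)
    prototypes, proto_labels = [], []
    for c in sorted(set(labels)):                      # class-wise clustering
        class_patterns = [p for p, y in zip(reduced, labels) if y == c]
        leaders, _ = leader_clustering(class_patterns, zeta)
        prototypes.extend(leaders)
        proto_labels.extend([c] * len(leaders))
    return prototypes, proto_labels

def leaders_then_frequent(train, labels, epsilon, zeta):
    """Sect. 5.5.4: class-wise Leader clustering first, then frequent-feature
    selection computed over the leaders only."""
    prototypes, proto_labels = [], []
    for c in sorted(set(labels)):
        class_patterns = [p for p, y in zip(train, labels) if y == c]
        leaders, _ = leader_clustering(class_patterns, zeta)
        prototypes.extend(leaders)
        proto_labels.extend([c] * len(leaders))
    mask = frequent_feature_mask(prototypes, epsilon)
    return keep_frequent_features(prototypes, mask), proto_labels
```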
5.6 Implementation and Experimentation
The proposed schemes are demonstrated on two types of datasets. In Sect. 5.6.1,
we implement them on handwritten digit data that has binary-valued features. In
Sect. 5.6.2, we consider intrusion detection data provided under KDDCUP’99 challenge. The data consists of floating-point-valued features with different ranges for
each feature. The data is described in Appendix. Each of the features has a different
range of values. We appropriately quantize them to convert the considered data into
binary data. We implement prototype selection and simultaneous prototype and feature selection. The second dataset is considered to demonstrate the applicability of
the schemes on different types of data. We base further discussions and conclusions
primarily in Sect. 5.6.1.
5.6.1 Handwritten Digit Data
The training dataset of 6000 patterns in their original form provides a CA of 91.79 % with kNNC for k = 9. Elaborate experimentation is carried out by changing the values of ε and ζ in small discrete steps. In all the experiments, the sensitivity of parameters such as ε and ζ is examined starting from the smallest possible values till performance degradation is observed, moving in small steps within the quantization error. The ranges mentioned during the presentation of results and tables
are representative of this experimentation. While presenting the results, the number of distinct subsequences is highlighted along with Classification Accuracy. The
number of distinct subsequences indicates compactness achieved in the data during the particular experiment. A summary of results is provided in Table 5.7. Column 1 of the table contains the sequence number of the approach. Column 2 consists of approaches such as feature selection, prototype selection, and their relative
sequencing. Column 3 consists of data on whether the approach is experimented
on class-wise or entire dataset. Column 4 contains the support threshold used for
the activity, and column 5 contains the distance threshold for leader clustering.
Column 6 contains the number of prototypes, and column 7 contains the number of distinct subsequences. The number of distinct sequences is derived based
on the distinct features. As the distinct number of features reduces, the number
of distinct subsequences reduces leading to compaction. Columns 8 and 9 contain the classification accuracy corresponding to validation and test datasets. It
can be observed from the table that for approaches 1 and 2, the number of prototypes is unaffected since the focus is on feature selection. Further, for frequent
item support-based feature selection, the number of distinct subsequences reduced
to 361 when the full data is considered compared to 507 of that of class-wise
data.
We provide an approach-wise observation summary below.
In Approach 1, we consider all patterns (ζ = 0). We initially consider class-wise
patterns, and by varying the support value (ε) from 0 to 200 we identify frequent items (features). As part of another exploration, we consider the full dataset
and change the support (ε) from 0 to 2000. Thus, the number of effective items (or
features) gets reduced per pattern in both cases, which in turn results in reduction
in the number of distinct subsequences. It should however be noted here that in the
case of class-wise data, the set of frequent features is different for different classes,
and for full data, the frequent feature set is common for the entire dataset. On validation dataset, the best classification accuracy is obtained with 507 out of 669 distinct
subsequences for class-wise data and with 450 out of 669 distinct subsequences
for the full dataset. The classification accuracies (CA) with test data for class-wise
data and entire dataset are 92.32 % and 92.05 %, respectively. Figure 5.5 contains
actual reduction in number of distinct subsequences with increasing threshold on
the entire data. However, beyond an ε value of 450, the CA deteriorates as the loss of feature information further affects the ability to discriminate between patterns. Table 5.6 indicates the reduction obtained in the number of features. Observe from the table that the reduction
is significant. Column 1 of the table contains the class labels. Column 2 contains
the numbers of nonzero features for each class. For a support threshold value of
160, the number of nonzero features reduced significantly as shown in column 3.
The percentage reduction in the number of features is provided in column 4. It can be observed that the maximum reduction in the number of features, 60.6 %, occurred for class label 7, and the least reduction, 42.2 %, occurred for class label 8.
In Approach 2, only prototypes are considered (ε = 0). The distance threshold
values, ζ , are changed from 0.0 to 5.0 in both class-wise data and the entire data,
and prototypes are computed using the Leader clustering algorithm. For the best
case with validation data, the CAs with test data are 93.31 % and 92.26 % for
class-wise and full data, respectively. Note that in this approach, the number of
distinct subsequences among all leaders, viz., 669, remain as in the original data.
Fig. 5.5 Number of distinct subsequences vs. increasing support on full data
Table 5.6 Results of Feature Selection using Support of 160

Class label   Non-zero features among 600 training patterns (max. 192)   Frequent features with ε = 160   Reduction in number of features
0             176                                                        95                                46.0 %
1             103                                                        44                                57.3 %
2             186                                                        101                               45.7 %
3             174                                                        87                                50.0 %
4             170                                                        85                                50.0 %
5             181                                                        96                                47.0 %
6             171                                                        90                                47.4 %
7             170                                                        67                                60.6 %
8             173                                                        100                               42.2 %
9             175                                                        80                                54.3 %
With ε = 0, the reduction in the number of leaders with increasing threshold is presented in Fig. 5.6. With increasing distance threshold, the number of clusters
and thereby the number of cluster representatives, viz., leaders, reduces. In the limit
Table 5.7 Results with each approach

Sl. No.   Approach                         Description       Support threshold (ε)   Leader threshold (ζ)   No. of prototypes   Distinct subseq.   CA with valdn data   CA with test data
1         Feature Selection (FeaSel)       Class-wise data   160                      0                      6000                507                92.52 %              92.32 %
2         Feature Selection (FeaSel)       Full data         450                      0                      6000                361                92.09 %              92.05 %
3         Prototype Selection (ProtoSel)   Class-wise data   0                        3.1                    5064                669                93.14 %              93.31 %
4         Prototype Selection (ProtoSel)   Full data         0                        3.1                    5041                669                93.13 %              92.26 %
5         FeaSel followed by ProtoSel      Class-wise data   40                       3.1                    5027                542                93.43 %              93.52 %
6         FeaSel followed by ProtoSel      Full data         190                      3.1                    5059                433                93.58 %              93.34 %
7         ProtoSel followed by FeaSel      Class-wise data   180                      3.1                    5064                433                93.58 %              93.34 %
8         ProtoSel followed by FeaSel      Full data         300                      3.1                    5041                367                93.58 %              93.52 %
of the largest distance, there would be a single cluster, and in the case of the distance
threshold of 0.0, the number of clusters equals the number of training patterns.
In Approach 3, frequent items are first computed on the original data and then
followed by clustering on the patterns containing frequent features only. Frequent
features are computed by changing ε in steps of 5 in the range 0 to 200. As a next
step, prototypes are computed for each case using the Leader clustering algorithm.
For computing leaders, the distance threshold values (ζ ) are changed from 0 to 5.0
in steps of 1.0. The data thus arrived at is tested on validation dataset. The parameter
set that provided the best classification on the validation dataset is used to compute
the classification accuracy on the test dataset. The corresponding classification accuracies with class-wise data and full dataset are 93.52 % and 93.34 %, respectively.
It should be noted that the respective numbers of distinct subsequences of these two
cases are 542 and 433. Figure 5.7 presents the classification accuracy with increasing support values in Approach 3.
In Approach 4, as a first step, prototypes are computed using clustering, and it is
followed by frequent item computation. The classification accuracies with test data
Fig. 5.6 Change in the number of leaders with increasing distance threshold
corresponding to best cases with validation data for class-wise and full datasets are
93.34 % and 93.52 %, respectively. Because of prototype selection, there is also a
reduction in number of prototypes from original 6000 to 5064 and 5041 for each of
these two cases. The numbers of distinct subsequences in these cases are 433 and
367, respectively.
The number of distinct subsequences indicates compactness achieved. Further,
even with a good amount of reduction in the number of distinct subsequences, there is
no significant reduction in Classification Accuracy (CA). This can be observed from
Fig. 5.8, corresponding to Approach 4 with full training data. The figure displays CA
for various values of support considering entire data and a distance threshold of 3.1.
Observe that the CA reaches its maximum at a support of 300.
In the end, we present another interesting result. For a chosen distance threshold,
the change in class-wise support does not further affect the number of leaders significantly. Figure 5.9 demonstrates this fact. In the figure, at the support ε = 0, the
number of leaders corresponding to a distance threshold (ζ ) of 3.1 is 5064. Subsequently, frequent features are identified using the threshold marked on the X-axis,
viz., 0 to 250, and then leaders are computed on such data. Observe that at support of
250, the number of leaders is 5058. Thus, this does not affect the number of leaders
significantly. However, the number of distinct subsequences reduces with increasing
support value.
Fig. 5.7 Classification accuracy as a function of support value (ζ = 3.1)
Fig. 5.8 Effect of support on classification accuracy
Fig. 5.9 Change in the number of leaders with increasing support
5.6.2 Intrusion Detection Data
The objective of this subsection is to illustrate the applicability of the schemes
on different types of data. Appendix contains a description of intrusion detection
dataset that is part of the KDDCUP’99 challenge. The data, which consists of floating-point-valued features, is quantized into binary data. The procedure is discussed in
the Appendix.
5.6.2.1 Prototype Selection
The objective of the exercises in the current section is to identify a subset of original
data as data representatives or prototypes. The Leader clustering algorithm is used to
identify the prototypes. The input dataset consists of five classes of equal number of
patterns. For each of the classes, we identify cluster representatives independently.
We combine them to form training data that consists of prototypes only. With the
help of this training dataset, we classify 411,029 test patterns. The experiments are
conducted with different dissimilarity thresholds.
Figures 5.10, 5.11, 5.12, 5.13, and 5.14 contain the results of prototype selection for varying distance thresholds. Table 5.8 contains the sizes of datasets for three different distance
thresholds, viz., 100, 50, and 40. We notice that the number of representative patterns reduces with increasing distance thresholds. The results are provided in Table 5.9. From the table we observe that the cost is minimum with threshold set of
5.6 Implementation and Experimentation
117
Fig. 5.10 Results of prototype selection for the category “normal”
Table 5.8 Case study details
Case
No.
Normal
u2r
Thr
1
100
2
50
3
40
14,891
Pats
dos
r2l
probe
Thr
Pats
Thr
Pats
Thr
Pats
Thr
Pats
3751
100
33
2
7551
2
715
2
1000
10,850
10
48
2
7551
1
895
1
1331
100
48
2
7551
1
895
1
1331
(40, 100, 2, 1, 1). We further make use of the datasets mentioned in Table 5.8 in the
subsequent sections.
Algorithm 5.2 (Prototype Selection Algorithm)
Step 1: Compute class-wise leaders in each of the classes (normal, u2r, dos, r2l,
probe) (Figs. 5.10, 5.11, 5.12, 5.13, 5.14).
Step 2: Combine the class-wise leaders to form training data.
Step 3: Classify test data.
Step 4: Repeat the exercises with different distance thresholds. The Euclidean distance is used in all these exercises.
Case study details are provided in Table 5.8. The results are provided in Table 5.9.
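A minimal sketch of Algorithm 5.2 is given below; it assumes the patterns are held in NumPy arrays and uses illustrative function names (leaders, classwise_prototypes, nn_classify) that are not taken from the original implementation.

```python
import numpy as np

def leaders(X, threshold):
    """Leader clustering: scan the patterns once and keep a pattern as a new
    leader if it is farther than `threshold` from every existing leader."""
    reps = []
    for x in X:
        if all(np.linalg.norm(x - l) > threshold for l in reps):
            reps.append(x)
    return np.array(reps)

def classwise_prototypes(X, y, thresholds):
    """Step 1 and Step 2: compute leaders independently for each class and
    combine them into a reduced training set."""
    protos, labels = [], []
    for c, t in thresholds.items():
        reps = leaders(X[y == c], t)
        protos.append(reps)
        labels.extend([c] * len(reps))
    return np.vstack(protos), np.array(labels)

def nn_classify(train_X, train_y, test_X):
    """Step 3: nearest-neighbor classification with the Euclidean distance."""
    preds = []
    for x in test_X:
        d = np.linalg.norm(train_X - x, axis=1)
        preds.append(train_y[np.argmin(d)])
    return np.array(preds)

# Step 4 amounts to repeating classwise_prototypes/nn_classify for different
# threshold sets, e.g. {"normal": 40, "u2r": 100, "dos": 2, "r2l": 1, "probe": 1}.
```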
Fig. 5.11 Results of prototype selection for category “u2r”
Fig. 5.12 Results of prototype selection for category “dos”
Fig. 5.13 Results of prototype selection for category “r2l”
Fig. 5.14 Results of prototype selection for category “probe”
Table 5.9 Results with prototypes

Case No. | Training data size | CA (%) | Cost
1        | 13,050             | 91.66  | 0.164046
2        | 20,660             | 91.91  | 0.159271
3        | 24,702             | 91.89  | 0.158952
5.6.3 Simultaneous Selection of Patterns and Features
We noted previously in Table A.8 that not all features in the data are frequent. We make use of this fact to examine whether, by considering only the frequent features, we can classify the test patterns better. This is done in two ways.
In the first method, we first find prototypes and then find frequent features within the prototypes, which we term "Leaders followed by Frequent features." In the second method, we consider frequent features in the entire data and then identify prototypes, which we term "Frequent features followed by Leaders." The overall
algorithm is provided in Algorithm 5.3.
Algorithm 5.3 (Simultaneous Selection of Patterns and Features)
Step 1: Compute the support of each of the features across the given data.
Step 2: For a given support threshold, identify the features whose support exceeds the threshold. Term them frequent features.
Step 3: Eliminate the infrequent features from both the training and test data by setting the corresponding feature values to 0.0. The number of patterns remains the same.
Step 4: Classify test patterns and compute the cost.
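The following sketch illustrates Steps 1–3 of Algorithm 5.3 for binary-valued data; the support is taken here as the fraction of training patterns in which a feature is non-zero, and all names are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def frequent_feature_mask(train_X, min_support):
    """Steps 1 and 2: compute the support of each feature (the fraction of
    patterns in which it is non-zero) and mark the features whose support
    meets the given threshold as frequent."""
    support = (train_X > 0).mean(axis=0)
    return support >= min_support

def mask_infrequent(X, keep):
    """Step 3: set infrequent feature values to 0.0; the number of patterns
    and the dimensionality stay the same, only the values change."""
    Xm = X.copy().astype(float)
    Xm[:, ~keep] = 0.0
    return Xm

# Example: keep features present in at least 10 % of the training patterns,
# then classify the masked test data in Step 4 and compute the cost.
# keep = frequent_feature_mask(train_X, 0.10)
# train_m, test_m = mask_infrequent(train_X, keep), mask_infrequent(test_X, keep)
```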
5.6.3.1 Leaders Followed by Frequent Features
The cost of assigning a wrong label to a pattern is not the same across different classes. For example, the cost of assigning a pattern from class "normal" to class "u2r" is 2, from "normal" to "dos" is 2, and from "normal" to "probe" is 1. Since the cost matrix is not symmetric, the cost of assigning "u2r" to "normal" is not the same as that of "normal" to "u2r".
The prototype set mentioned in Case 3 of Table 5.8 is considered for this study. Frequent features are obtained from the dataset with the help of a minimum item support. With a support of 0 %, all features are retained. The number of effective features reduces from 38 to 21 and 18 with supports of 10 % and 20 %, respectively. Table 5.10 contains the results. It is interesting to observe that the reduction in the number of features improves the classification accuracy with NNC at a support of 10 %; the accuracy is slightly lower at a support of 20 %. Similarly, the classification cost improves at a support of 10 % and worsens at a support of 20 %. In summary, reducing the number of features while classifying a large number of patterns reduces storage space and computation time. It also leads to an increase in classification accuracy and a reduction in assignment cost as long as the representativeness of the data is preserved.

Table 5.10 Results with frequent item support on Case 3 (247,072 patterns)

Support | No. of features | CA      | Cost
0 %     | 38              | 91.89 % | 0.1589
10 %    | 21              | 91.95 % | 0.1576
20 %    | 18              | 91.84 % | 0.1602
5.6.3.2 Frequent Feature Identification Followed by Leaders
In this case, the entire training data is considered, and the support of each feature is computed. When a support of 5 % is applied to the entire data, the number of features reduces to 22; with 10 %, it reduces to 17. Figure 5.15 summarizes the results.
The leader computation is restricted to the data containing features with 10 % support. Table 5.11 contains the results. Observe from the table that the exercise corresponding to a distance threshold of 20.0 provides the least cost, with the classification accuracy nearly unchanged for thresholds 5.0–20.0.
Fig. 5.15 Support vs number of features
Table 5.11 Results on original data having features with 10 % support

Distance threshold | No. of leaders | CA      | Cost
5.0                | 17,508         | 91.83 % | 0.1588
10.0               | 15,749         | 91.85 % | 0.1586
20.0               | 15,023         | 91.83 % | 0.1585
50.0               | 9669           | 84.60 % | 0.2990
100.0              | 3479           | 82.97 % | 0.3300
5.7 Summary
With the objective of handling the large data classification problem efficiently, we examined the usefulness of prototype selection and feature selection individually and in combination. During the process, the multicategory training data is considered both as a single multicategory dataset and class-wise. We also examined the effectiveness of the sequencing of prototype selection and feature selection. Feature selection using frequent item support has been studied. We consider kNNC for classifying the given large, high-dimensional handwritten data. Elaborate experimentation has been carried out, and the results are presented.
The contribution of the data compaction scheme is to show that through combinations of prototype selection and feature selection we obtain better Classification Accuracy than by considering the two activities independently. This amounts to the simultaneous selection of patterns and features. Clustering the "data containing frequent features" has provided a good amount of compactness for classification, from 669 distinct subsequences to 367 subsequences, viz., a 45 % reduction. Such compactness did not result in a reduction in classification accuracy.
It is clear that frequent feature selection leads to a reduction in the number of features. We have shown through experiments that such a reduction improves the classification accuracy from 91.79 % to 92.52 % when the data is considered class-wise for feature selection. Prototype selection excludes redundant patterns, leading to an improvement of the CA to 93.14 % when the data is considered class-wise. It should be noted that when the data is considered class-wise, the selection of ψ or ζ is relevant to the training data within the class only. The combination scheme where features and prototypes are selected simultaneously provided the best classification accuracy. Further, Scheme 8 (Approach 4) provided both a compaction of 45 % and the best classification accuracy of 93.58 %.
Similar observations can be made with intrusion detection data.
The following is the summary of work. The numerical data corresponds to the
experiments with handwritten digit data.
• Frequent feature computation leads to feature selection. This leads to a compaction of patterns by 45.9 %, characterized by the number of distinct subsequences, with a CA of 92.32 %.
• Prototype selection leads to data reduction. The classification accuracy obtained is superior to that obtained using frequent features alone. However, the number of distinct subsequences is the same as in the original data, viz., 669.
• The class-wise method of frequent feature selection followed by prototype selection provided the best classification accuracy of 93.52 %. Similarly, complete data-based prototype selection followed by feature selection provides the same best accuracy. The corresponding compaction achieved in terms of the number of features is 35.3 %.
• Clustering followed by frequent feature selection provides the best compaction of 45.1 % as compared to the original 669 distinct subsequences while providing a classification accuracy of 93.58 %.
5.8 Bibliographic Notes
Duda et al. (2000) provide a discussion on pattern and feature selection and on pattern classification approaches. Kittler (1986) provides a discussion on feature selection and extraction. The importance of a simple model is emphasized by Domingos (1998). A discussion on dimensionality and data reduction approaches can be found in Jain et al. (1999) and Pal and Mitra (2004). Data squashing is discussed by DuMouchel et al. (2002). Sampling schemes for prototype selection can be found in Pal and Mitra (2004) and Han et al. (2012). Zhang et al. (1996) provide the data summarization scheme known as BIRCH. Bradley et al. (1998) provide a discussion on scaling clustering algorithms to large data. Descriptions of PAM, CLARA, and CLARANS (Clustering Large Applications using RANdomized Search) can be found in Kaufman and Rousseeuw (1989). A discussion on the Leader algorithm is provided in Spath (1980). A comparison of prototype selection methods using genetic algorithms is provided in Ravindra et al. (2001). Burges (1998) provides a detailed discussion on support vector machines, which in turn can be used to select prototypes. Hart (1968) provides the Condensed Nearest-Neighbor algorithm. Agrawal and Srikant (1994) propose the concept of support in their seminal work on association rule mining. Ravindra et al. (2005) discuss the simultaneous selection of prototypes and features. Lossy compression using the concepts of frequent item support and distinct subsequences is provided by Ravindra et al. (2004). The dataset used for illustrating the computation of leaders is taken from the UCI-ML repository (2013).
References
R. Agrawal, R. Srikant, Fast algorithms for mining association rules, in Proceedings of International Conference on VLDB (1994)
P. Bradley, U.M. Fayyad, C. Reina, Scaling clustering algorithms to large databases, in Proceedings
of 4th Intl. Conf. on Knowledge Discovery and Data Mining (AAAI Press, New York, 1998),
pp. 9–15
124
5
Data Compaction Through Simultaneous Selection of Prototypes
C.J.C. Burges, A tutorial on support vector machines for pattern recognition. Data Min. Knowl.
Discov. 2(2), 121–167 (1998).
P. Domingos, Occam’s two razors: the sharp and the blunt, in Proc. of 4th Intl. Conference on
Knowledge Discovery and Data Mining (KDD’98), ed. by R. Agrawal, P. Stolorz (AAAI Press,
New York, 1998), pp. 37–43
W. DuMouchel, C. Volinksy, T. Johnson, C. Cortez, D. Pregibon, Squashing flat files flatter, in
Proc. 5th Intl. Conf. on Knowledge Discovery and Data Mining, San Diego, CA (AAAI Press,
New York, 2002)
R.O. Duda, P.E. Hart, D.J. Stork, Pattern Classification (Wiley-Interscience, New York, 2000)
J. Han, M. Kamber, J. Pei, Data Mining—Concepts and Techniques (Morgan-Kauffman, New
York, 2012)
P.E. Hart, The condensed nearest neighbor rule. IEEE Trans. Inf. Theory IT-14, 515–516 (1968)
A.K. Jain, M.N. Murty, P. Flynn, Data clustering: a review. ACM Comput. Surv. 32(3) (1999)
J. Kittler, Feature selection and extraction, in Handbook of Pattern Recognition and Image Proc.,
ed. by T.Y. Young, K.S. Fu. (Academic Press, San Diego, 1986), pp. 59–83
L. Kaufman, P.J. Rousseeuw, Finding Groups in Data—An Introduction to Cluster Analysis (Wiley,
New York, 1989)
S.K. Pal, P. Mitra, Pattern Recognition Algorithms for Data Mining (Chapman & Hall/CRC, London/Boca Raton, 2004)
T. Ravindra Babu, M. Narasimha Murty, V.K. Agrawal, Hybrid learning scheme for data mining applications, in Proc. of Fourth Intl. Conf. on Hybrid Intelligent Systems (IEEE Computer
Society, Los Alamitos, 2004), pp. 266–271. doi:10.1109/ICHIS.2004.56
T. Ravindra Babu, M. Narasimha Murty, V.K. Agrawal, On simultaneous selection of prototypes
and features in large data, in Proceedings of the First International Conference on Pattern
Recognition and Machine Intelligence. Lecture Notes in Computer Science, vol. 3776 (Springer,
Berlin, 2005), pp. 595–600
T. Ravindra Babu, M. Narasimha Murty, Comparison of genetic algorithm based prototype selection schemes. Pattern Recognit. 34(2), 523–525 (2001)
H. Spath, Cluster Analysis Algorithms for Data Reduction and Classification (Ellis Horwood,
Chichester, 1980)
T. Zhang, R. Ramakrishnan, M. Livny, BIRCH: an efficient data clustering method for very large
databases, in Proceedings of the ACM SIGMOD International Conference on Management of
Data (SIGMOD’96) (1996), pp. 103–114
Iris dataset (2013) http://archive.isc.uci.edu/ml/datasets/Iris. Accessed on 18 April 2013
Chapter 6
Domain Knowledge-Based Compaction
6.1 Introduction
With large datasets, it is difficult to choose good structures in order to control the complexity of models unless prior knowledge of the data exists. With the objective of classifying large datasets, one can achieve significant compaction, and thereby performance improvement, by integrating prior knowledge. The prior knowledge may relate to the nature of the data, the discrimination function, or the learning hypothesis.
The prior or domain knowledge is either provided by a domain expert or is derived through rigorous data analysis, as we demonstrate in the current work. The No-Free-Lunch theorem emphasizes that, in the absence of assumptions, no learning algorithm can be preferred over another. Assumptions or knowledge about the domain are therefore important in designing classification algorithms. We stress the importance of deriving such domain knowledge through preliminary data analysis so as to automate the process of classifying multiclass data using binary classifiers. We exploit this aspect in the present work.
In the current chapter, we consider binary classifiers. Various approaches exist
for multiclass classification such as one-vs-one and one-vs-rest, with each of the
approaches requiring many comparisons for determining the category of a pattern.
We propose to exploit domain knowledge on the data in labeling the 10-category
patterns through a novel decision tree of depth 4. We apply such a scheme to
classify the patterns using support vector machines (SVM) and adaptive boosting
(AdaBoost). The overall classification accuracy thus obtained is shown to be better
than the previously reported values on the same data. The proposed method also
integrates clustering-based reduction of the original large data.
Major contributions of the work are the following.
• Exploiting domain knowledge in devising a multicategory tree classifier with a
depth of just 4.
• Use of SVMs with appropriate kernels.
• Use of AdaBoost.
• Employing clustering methods and CNN to obtain representative patterns.
• Integrating representative patterns, domain knowledge, and SVMs or AdaBoost to obtain a classification accuracy better than reported earlier, with less design and classification time than applying the same scheme on the full data.
The chapter is organized as follows. Section 6.2 provides a brief discussion on
different schemes of multicategory data classification using a 2-class classifier. Section 6.3 contains an overview of SVM. A brief discussion on the AdaBoost method
is provided in Sect. 6.4. An overview of decision trees is provided in Sect. 6.5. Section 6.6 contains preliminary analysis on the data that helps to extract the domain
knowledge on data. The proposed method is provided in Sect. 6.7. Experimental
results using support vector machines and AdaBoost are provided in Sect. 6.8. Section 6.9 contains the summary of the work. A discussion on material for further
study and related literature is provided in Sect. 6.10.
6.2 Multicategory Classification
Two-class classification rules using SVMs or AdaBoost are easier to learn. In the case of multicategory data with c classes, the classification is done using multiple binary classification stages, and the results are combined to provide the overall classification accuracy. The approaches can be classified into one-vs-rest, one-vs-one, and error-correcting codes. In the case of a one-vs-rest decision, we consider the training samples from each class as positive and the rest as negative; we assign to a test pattern the class that wins the most comparisons. In the case of one-vs-one, $c(c-1)/2$ binary classifications are carried out, considering two classes at a time, and we assign to the test pattern the class that receives the largest number of votes. In the case of pairwise coupling, the output of a binary classifier is interpreted as the posterior probability of the positive class, and a test pattern is assigned the class with the largest posterior probability. Error-correcting code methods adapt coding matrices. A multicategory problem is thus reduced in the literature to multiple two-category problems through a variety of approaches. Figures 6.1 and 6.2 contain examples of one-vs-one and one-vs-rest classification.
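The two decompositions can be sketched as follows, assuming a generic train_binary routine that fits a 2-class classifier and returns its decision (score) function; the code is illustrative and not tied to any particular classifier.

```python
import numpy as np
from itertools import combinations

def one_vs_rest(train_binary, X, y, classes, x_test):
    """Train one binary classifier per class (class c vs. the rest) and
    assign the class whose classifier gives the largest score."""
    scores = {c: train_binary(X, np.where(y == c, 1, -1))(x_test) for c in classes}
    return max(scores, key=scores.get)

def one_vs_one(train_binary, X, y, classes, x_test):
    """Train c(c-1)/2 pairwise classifiers and assign the class that
    collects the largest number of votes."""
    votes = {c: 0 for c in classes}
    for a, b in combinations(classes, 2):
        mask = (y == a) | (y == b)
        clf = train_binary(X[mask], np.where(y[mask] == a, 1, -1))
        votes[a if clf(x_test) > 0 else b] += 1
    return max(votes, key=votes.get)
```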
6.3 Support Vector Machine (SVM)
The current section provides a quick overview of support vector machine-related material.
Let $(x_i, y_i)$, $i = 1, 2, \ldots, m$, represent the patterns to be classified, where $x \in \mathbb{R}^p$ and $y$ is the category, $+1$ or $-1$. This leads to a learning machine that maps $x \longrightarrow y$, or $x \longrightarrow f(x, \alpha)$. A choice of $\alpha$ leads to a trained machine. The risk, or expectation of the test error, for a trained machine is

$$R(\alpha) = \int \tfrac{1}{2}\,\bigl|y - f(x, \alpha)\bigr|\,dP(x, y). \tag{6.1}$$
Fig. 6.1 One-vs-one classification. The first subfigure on the top-left contains the five patterns under consideration

Fig. 6.2 One-vs-rest classification
Suppose that the training patterns satisfy the following constraints:

$$w^T x_i + b \ge +1 \quad \text{for } y_i = +1, \tag{6.2}$$
$$w^T x_i + b \le -1 \quad \text{for } y_i = -1. \tag{6.3}$$

Consider the points with equality in (6.2) and (6.3). These points lie on the hyperplanes $w^T x + b = +1$ and $w^T x + b = -1$; such points are called Support Vectors. The hyperplanes are parallel, with a distance $2/\|w\|$ between them, called the margin.
The above problem can be restated as a convex quadratic programming problem for maximizing the margin, which is equivalent to the following minimization problem:

$$\min_{w,b}\ \phi(w) = \frac{1}{2}\|w\|^2 \quad \text{subject to } y_i\bigl(w^T x_i + b\bigr) \ge 1,\ i = 1, 2, \ldots, l. \tag{6.4}$$
By introducing positive Lagrangian multipliers $\lambda_i$, $i = 1, 2, \ldots, m$, the solution of the optimization problem provides the support vectors, corresponding to $\lambda_i > 0$. The decision function is given by

$$f(x) = \operatorname{sign}\Biggl(\sum_{i=1}^{m} y_i \lambda_i^{*}\, x^T x_i + b^{*}\Biggr). \tag{6.5}$$
In the case of nonlinear decision functions, we map the data to a higher-dimensional feature space and construct a separating hyperplane in that space. The mapping is given by the following equations:

$$X \longrightarrow H, \qquad x \longrightarrow \phi(x).$$

The decision functions are given by

$$f(x) = \operatorname{sign}\bigl(\phi(x)^T w^{*} + b^{*}\bigr), \tag{6.6}$$
$$f(x) = \operatorname{sign}\Biggl(\sum_{i=1}^{m} y_i \lambda_i^{*}\, \phi(x)^T \phi(x_i) + b^{*}\Biggr). \tag{6.7}$$

Consider a kernel function, Eq. (6.8), which constructs an optimal separating hyperplane in the space $H$ without explicitly performing calculations in this space:

$$K(x, z) = \phi(x)^T \phi(z). \tag{6.8}$$

With this, the decision function given by (6.7) can be rewritten as

$$f(x) = \operatorname{sign}\Biggl(\sum_{i=1}^{m} y_i \lambda_i^{*}\, K(x, x_i) + b^{*}\Biggr). \tag{6.9}$$
In the case of nonseparable data, we introduce a vector of slack variables $(\xi_1, \xi_2, \ldots, \xi_m)^T$ that measure the amount of violation of the constraints. The problem is restated as follows:

$$\min_{w,b}\ \Phi\bigl(w, b, (\xi_1, \xi_2, \ldots, \xi_m)\bigr) = \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{m}\xi_i^{k}$$
$$\text{subject to } y_i\bigl(w^T \phi(x_i) + b\bigr) \ge 1 - \xi_i, \quad \xi_i \ge 0,\ i = 1, 2, \ldots, m. \tag{6.10}$$
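As an illustration, the decision function of Eq. (6.9) can be evaluated directly once the support vectors, the multipliers λ*, and the bias b* are available from a solver or an SVM package; the polynomial kernel form used below, (x^T z + 1)^d, is one common choice and is an assumption rather than the book's specification.

```python
import numpy as np

def polynomial_kernel(x, z, degree=3):
    # One common polynomial kernel: K(x, z) = (x^T z + 1)^degree
    return (np.dot(x, z) + 1.0) ** degree

def svm_decision(x, support_vectors, sv_labels, lambdas, b, kernel=polynomial_kernel):
    """Evaluate Eq. (6.9): f(x) = sign( sum_i y_i * lambda_i * K(x, x_i) + b )."""
    s = sum(y_i * l_i * kernel(x, x_i)
            for x_i, y_i, l_i in zip(support_vectors, sv_labels, lambdas))
    return 1 if s + b >= 0 else -1
```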
6.4 Adaptive Boosting
Boosting is a general method for improving the accuracy of a learning algorithm. It makes use of a base learning algorithm with an accuracy of at least 50 %. It adds a new component classifier at each of multiple stages, forming an ensemble. A decision rule based on the ensemble provides a CA higher than that provided by a single use of the base learning algorithm.
AdaBoost makes use of a weak learner that classifies a pattern with accuracy at
least better than chance. The training data for the weak or base learner is selected
according to some chosen weight distribution. Such chosen data is used to label the
patterns. To begin with, equal weights are assigned to each of the training patterns.
At every following stage, the classification error in the previous iteration is used.
The update of the weights for the subsequent iteration is such that the weights of
incorrectly classified examples are appropriately increased so that the weak learner
at the next stage is forced to focus on the hard examples in the training set. The final
classification is based on a weighted linear combination of stage-wise assignment. It
was theoretically shown that with a learning algorithm slightly better than random,
training error drops exponentially.
The AdaBoost algorithm in its original form is provided in Algorithm 6.1. An
outline of implementation aspects of the AdaBoost algorithm is provided in Algorithm 6.2.
6.4.1 Adaptive Boosting on Prototypes for Data Mining
Applications
When the data is large, repeated application of the algorithm on entire data is expensive in terms of computation and storage. In view of this, we consider prototypes of
the large data. In the current work we propose a scheme that incorporates AdaBoost
on prototypes generated by the Leader clustering algorithm on handwritten digit
data consisting of 10 classes.
Algorithm 6.1 (AdaBoost Algorithm with leader prototypes)
Step 1: Consider n patterns (p) and corresponding labels (l) as (pi , li ), i =
1, 2, . . . , n. The labels li take values 1 or −1. Initialize pattern-wise weights
to W1 (i) = n1 .
Step 2: For each iteration j = 1, . . . , m, carry out the following: Train weak learner
using distribution Wj . Consider training patterns according to the weight
distribution, Wj . Compute leaders.
Step 3: Weak learner finds a weak hypothesis, hj , that maps input patterns to labels
{−1, 1} for the given weight distribution Wj using leaders.
Step 4: The error εj in the weak hypothesis hj is the probability of misclassifying
a new pattern, as the patterns are chosen randomly according to the distribution Wj .
Step 5: Compute $\delta_j = 0.5\,\ln\bigl(\frac{1-\varepsilon_j}{\varepsilon_j}\bigr)$.
Step 6: Generate a new weight distribution: $W_{j+1}(i) = \frac{W_j(i)}{S_j}\,e^{-\delta_j}$ if pattern $p_i$ is correctly classified, and $W_{j+1}(i) = \frac{W_j(i)}{S_j}\,e^{\delta_j}$ if pattern $p_i$ is misclassified. Combining the two cases, $W_{j+1}(i) = \frac{W_j(i)\exp(-\delta_j\, l_i\, h_j(p_i))}{S_j}$, where $S_j$ is the normalization factor such that $\sum_{i=1}^{n} W_{j+1}(i) = 1$.
Step 7: Output the final hypothesis, $H = \operatorname{sign}\bigl(\sum_{j=1}^{m} \delta_j h_j\bigr)$.
kNNC is the base learning algorithm. The procedure also consists of a novel
multiclass classification scheme based on knowledge-based multiple 2-class classifications. The procedure provides the best Classification Accuracy obtained so far
on the considered data.
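A compact sketch of the weight-update loop of Algorithm 6.1 is given below; weak_learner stands for any base learner (e.g., kNNC trained on leaders of the weighted sample) and is an assumed placeholder, not the authors' implementation.

```python
import numpy as np

def adaboost(X, labels, weak_learner, m_iterations):
    """labels must be +1/-1.  weak_learner(X_sample, y_sample) returns a
    function h(x) -> +1/-1 (e.g. kNNC built on leaders of the sample)."""
    n = len(X)
    W = np.full(n, 1.0 / n)                     # Step 1: uniform weights
    hypotheses, deltas = [], []
    for _ in range(m_iterations):
        idx = np.random.choice(n, size=n, p=W)  # Step 2: sample by weights
        h = weak_learner(X[idx], labels[idx])
        preds = np.array([h(x) for x in X])
        eps = W[preds != labels].sum()          # Step 4: weighted error
        if eps <= 0 or eps >= 0.5:              # stop if no longer a weak learner
            break
        delta = 0.5 * np.log((1 - eps) / eps)   # Step 5
        W *= np.exp(-delta * labels * preds)    # Step 6: reweight and normalize
        W /= W.sum()
        hypotheses.append(h)
        deltas.append(delta)
    # Step 7: weighted majority vote of the stage-wise hypotheses
    return lambda x: np.sign(sum(d * h(x) for d, h in zip(deltas, hypotheses)))
```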
Algorithm 6.2 (AdaBoost Implementation)
Step 1: Consider 2-class training dataset with labels +1 and −1. To begin with,
assign equal weights to each pattern. For each of the iterations i = 1, . . . , n,
carry out steps 2 to 5.
Step 2: Classify training data using component classifier, viz., kNNC. This forms a
weak hypothesis hi that maps each pattern x to labels +1 or −1. Compute
the error in the ith iteration, the sum of weights of misclassified patterns.
Step 3: Compute the parameter $\alpha_i = 0.5\,\ln\bigl(\frac{1-\varepsilon_i}{\varepsilon_i}\bigr)$.
Step 4: With the help of αi , update the weights of patterns such that the weight of
misclassified pattern is increased, so that the subsequent iteration focusses
on those patterns. Normalize the weights so as to make a distribution after
updating all the weights.
Step 5: Select patterns according to the weight distribution.
Step 6: Compute the final hypothesis as a weighted majority of the n weak hypotheses, $H = \operatorname{sign}\bigl(\sum_{i=1}^{n} \alpha_i h_i(x)\bigr)$.
From the above discussion it is clear that the choice of the base learning algorithm and the amount of training data provided for learning influence the efficiency. As the number of iterations increases, the processing time becomes significant.
While applying AdaBoost to the current data, one option is to use the entire data at each iteration, subject to the weight distribution. This would obviously consume a large amount of processing time. At this stage, the following three considerations emerge.
• Reducing the training time would make the use of AdaBoost more efficient.
• Efficiency also depends on the nature of the base learning algorithm.
• For the data under consideration, inherent characteristics of the data may help in designing an efficient algorithm that brings the domain knowledge of the data into use.
These considerations lead to an efficient multistage algorithm.
6.5 Decision Trees
The pattern recognition methods discussed so far in the book are based on some
measure of dissimilarity among the patterns. In the current section, we discuss decision trees which fall, in a broad sense, into the category of nonmetric methods.
When one has a set of categorical variables at their disposal, such as queries, a decision tree is the best way of representing them. The method has the advantages of easy interpretability and applicability to a variety of feature types such as categorical, string, and numerical data. Leaf nodes of a decision tree contain the outcome or class label, and all nonleaf nodes contain splits that test the value of an expression of the attributes to
reach a decision that leads to the leaf nodes. Decision trees help represent the rules present in the data.

Fig. 6.3 Axis-parallel split. The first subfigure on the top-left contains the original dataset. The second and third subfigures contain the first axis-parallel split based on x1 and the second axis-parallel split based on x2, respectively. The decision tree is shown in the fourth subfigure

Fig. 6.4 Oblique split. The first subfigure on the top-left contains the original dataset. The second subfigure indicates an oblique split based on a function of x1 and x2. The decision tree is shown in the third subfigure

When a splitting rule depends on a single attribute at each internal node, it is termed a univariate split. An example of a univariate split is the
axis-parallel split. When a linear combination of attributes is used as a splitting rule,
it is achieved through linear or oblique splits. Finding an optimal oblique split is
an active research area. Figures 6.3 and 6.4 contain examples of axis-parallel and
oblique splits.
Some frequently used methods of creating decision trees are known as ID3, C4.5,
and CART (Classification and regression trees). The disadvantages of the decision
tree methods are the possibility of over-fitting and large design time.
In the current work, we propose a decision tree classifier where at every nonleaf
node, a decision is made based on a binary classifier, viz., support vector machine
or AdaBoost.
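As a small illustration of an axis-parallel (univariate) split of the kind shown in Fig. 6.3, the following sketch searches a single feature and threshold minimizing the misclassification count for two-class data; it is illustrative only and is not the construction used by ID3, C4.5, or CART.

```python
import numpy as np

def best_axis_parallel_split(X, y):
    """Return (feature index, threshold, errors) for the univariate split
    that minimizes the number of misclassified patterns when each side of
    the split is labeled by its majority class (y in {+1, -1})."""
    best_err, best_j, best_t = len(y) + 1, None, None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            err = (min((left == 1).sum(), (left == -1).sum()) +
                   min((right == 1).sum(), (right == -1).sum()))
            if err < best_err:
                best_err, best_j, best_t = err, j, t
    return best_j, best_t, best_err
```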
6.6 Preliminary Analysis Leading to Domain Knowledge
With the intention of designing an optimal decision tree, we analyze the given handwritten digit data to extract domain information. We approach this in the following three ways.
Fig. 6.5 Cluster analysis of data. The data containing 10 classes of handwritten data is clustered
into two groups using k-means clustering. The subplots correspond to class-wise numbers of patterns belonging to clusters 1 and 2, respectively
• Analytical view of the data to identify possible similarities.
• Numerical analysis by resorting to statistical analysis and clustering the data
to find out patterns belonging to different classes that could be grouped together.
• Resorting to classification of a large random sample from the given data and making observations from the confusion matrix.
6.6.1 Analytical View
A physical view of the handwritten digits indicates that the digits {0, 3, 5, 6, 8} share significant similarity among themselves. Similarly, {1, 2, 4, 7, 9} are alike. Since the patterns are handwritten, it is found that 7 and 9 appear quite similar because the upper loop of 9 is most often not completed while writing. Similar observations can be made for {0, 6} (when the loop of zero is incomplete at the bottom), {8, 3, 5}, {3, 5}, {4, 9}, {1, 2, 7}, and {1, 7}. In view of this, they form a natural grouping, especially considering that they are handwritten digits.
Fig. 6.6 Box-plot of nonzero features in full pattern. The figure contains class-wise statistics of
nonzero features of patterns. The box-plot helps in finding similarity among the classes in terms of
measures of central tendency and measures of dispersion
6.6.2 Numerical Analysis
As part of the numerical analysis, we consider a training dataset of 6670 patterns consisting of an equal number of patterns from each class. We use the k-means clustering algorithm with k = 2 to find two distinct groups. We find that the clustering results in two groups of patterns containing 3035 and 3635 patterns, respectively. We provide the class-wise number of patterns for each cluster in Fig. 6.5. The analysis segregates some digits such as 0, 1, 3, 6, and 7 more crisply, and other digits such as 4, 5, and 6 show a dominant membership in one of the two clusters. Some digits such as 2 and 9 are divided almost equally between the two clusters.
Subsequently, we studied the nonzero features of the training patterns. Figure 6.6 contains a box-plot of the statistics of nonzero features of the class-wise patterns when the full pattern is considered. On closer observation, we notice that the digits are not symmetric. Since each pattern is a matrix of features with 16 rows and 12 columns, we studied the patterns by dividing them into top and bottom halves and also into left and right halves. This study provides significant insights. The box-plots of the top and bottom halves are provided in Fig. 6.7, and those of the left and right halves are provided in Fig. 6.8. From the figures we can observe similarity between {0, 8}, {3, 5}, and {2, 4} in the top-half box-plots and among {1, 7, 9} in the bottom-half box-plots. Interestingly, {1, 6} display similarity in the top-half analysis since, in handwritten form, there is no significant difference in the top halves of these digits. {3, 6} are similar as seen from the statistics on the complete patterns. Similar inferences can be drawn from the left- and right-half patterns.
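The kind of cluster analysis behind Fig. 6.5 can be reproduced along the following lines; the use of scikit-learn's KMeans is an assumption, since the original tooling is not stated.

```python
import numpy as np
from sklearn.cluster import KMeans

def classwise_cluster_counts(X, y, k=2, seed=0):
    """Cluster all patterns into k groups and count, for every class label,
    how many of its patterns fall in each cluster (cf. Fig. 6.5)."""
    assignments = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(X)
    counts = {}
    for c in np.unique(y):
        counts[c] = np.bincount(assignments[y == c], minlength=k)
    return counts  # e.g. counts[4] -> array([n_in_cluster_0, n_in_cluster_1])
```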
Fig. 6.7 Box-plot of nonzero features in top and bottom halves of patterns. Each pattern is divided
into two halves known as top and bottom halves. Depending on the handwritten digit, there is similarity in the corresponding halves. This helps in grouping the classes for devising knowledge-based
tree. It also provides a view of complexity in classifying such data
Fig. 6.8 Box-plot of nonzero features in left and right halves of patterns. In this case, the data is
divided into left and right halves, and the statistics on such halves are computed and presented as
box-plots. The plots help notice similarities among halves of distinct classes
We can notice from the figures and inferences that the groups of classes are similar to those observed in Sect. 6.6.1.
6.6.3 Confusion Matrix
A confusion matrix provides a glimpse of the numbers of correctly and incorrectly classified patterns resulting from a classification experiment. In the matrix, each column indicates occurrences of the predicted class, and each row indicates occurrences of the actual class.
We carried out a number of case studies by considering different sets of representative patterns as training datasets along with a set of test patterns, and examined the resulting confusion matrices.
Table 6.1 Confusion matrix corresponding to a classification accuracy of 76.06 %

Table 6.2 Confusion matrix corresponding to a classification accuracy of 86.92 %
In most cases, we notice that misclassification occurred between {1, 7}, {7, 9}, {3, 5}, {3, 8}, {5, 8}, {0, 6}, {4, 9}, {1, 2}, etc. Most of these misclassifications can be visualized from the nature of handwritten patterns. This again leads to groups of classes similar to those detailed in the previous two subsections. Two sample confusion matrices, obtained by classifying 3333 test patterns based on prototypes generated using genetic algorithms during initial generations in pursuit of optimal prototypes, are provided in Tables 6.1 and 6.2. The chosen cases, which were part of the initial stages of the experiment before a best set was identified, have classification accuracies of 76.058 % and 86.92 %, respectively, using a nearest-neighbor classifier. To elaborate the contents of the matrix in Table 6.1, consider the row corresponding to label "4": 179 of the 333 patterns are labeled correctly as "4", 36 are classified as "1", 2 as "6", 4 as "7", and 112 as "9". The column corresponding to class "4" contains the numbers of patterns of other classes labeled as "4".
The above three subsections indicate analyses that help extract domain knowledge. It should, however, be noted that they supplement an analyst's comprehension of the data and may not necessarily provide a final grouping in a crisp manner in every such attempt. In summary, the analysis leads to a grouping of classes. This is discussed in detail in the following section on the proposed scheme.

Table 6.3 List of parameters used

Parameter | Description
n         | Number of patterns or transactions
d         | Number of features or items
l         | Number of leaders
s         | Number of support vectors
6.7 Proposed Method
Consider a training dataset consisting of n patterns. Let each pattern consist of d binary features. In order to reduce the data size, two approaches are followed in combination to compute representative patterns, viz., leaders and condensed nearest neighbors. The Leader clustering algorithm (Spath 1980) is used to obtain pattern representatives in terms of leaders. The Condensed Nearest-Neighbor (CNN) algorithm identifies a set of representative patterns that classifies every other pattern within the training dataset correctly. The computation of prototypes using CNN is discussed in detail in Sect. 5.2. The union of the CNN set and the leaders is used to generate representative patterns for the application of SVM, and leaders alone are used for AdaBoost. Each leader is considered a cluster representative. The
number of leaders depends on the threshold value. Let l be the number of leaders.
The number of support vectors is represented by s. Table 6.3 contains the summary
of parameters.
The algorithm is provided in Algorithm 6.3.
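A minimal sketch of the CNN computation (Hart 1968) referred to above is given below, assuming Euclidean distance and NumPy arrays; details of the original implementation may differ. The union of this set with the class-wise leaders then forms the representative set used with SVM.

```python
import numpy as np

def condensed_nn(X, y):
    """Hart's CNN: grow a store of prototypes until every training pattern
    is correctly classified by its nearest neighbor in the store."""
    store_idx = [0]                     # seed the store with one pattern
    changed = True
    while changed:
        changed = False
        for i in range(len(X)):
            d = np.linalg.norm(X[store_idx] - X[i], axis=1)
            nearest = store_idx[int(np.argmin(d))]
            if y[nearest] != y[i]:      # misclassified -> add to the store
                store_idx.append(i)
                changed = True
    return X[store_idx], y[store_idx]
```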
6.7.1 Knowledge-Based (KB) Tree
Based on the preliminary data analysis discussed in Sect. 6.6, we notice that the input data can be divided at the first stage into two groups, Set 1: {0, 3, 5, 6, 8} and Set 2: {1, 2, 4, 7, 9}. Thus, given a test pattern, a decision is made at this stage to classify the given digit into Set 1 vs Set 2. Subsequently, based on the same analysis, decisions are made between (0, 6) vs (3, 5, 8), 0 vs 6, 8 vs (3, 5), and 3 vs 5; and between (4, 9) vs (1, 2, 7), 4 vs 9, 2 vs (1, 7), and 1 vs 7. The corresponding decision tree is presented
in Fig. 6.9.
Algorithm 6.3 (Classification algorithm based on the Knowledge-Based Tree)
Step 1: Carry out preliminary analysis to group proximal classes using the Knowledge-Based (KB) Tree shown in Fig. 6.9.
Fig. 6.9 Knowledge-based multicategory tree classifier
Step 2: Compute condensed nearest neighbors of the data.
Step 3: Compute leaders with a prechosen distance threshold, using Leader clustering algorithm.
Step 4: Combine the CNNs and leaders to form a union. The new set forms the set
of representatives.
Step 5: As a first step for classification, divide the 10-class training dataset into two classes, with (0, 3, 5, 6, 8) as +1 and the remaining training patterns as −1. Compute support vectors for classifying these sets.
Step 6: Divide the patterns of each of the 5-class sets into two sets successively, till a leaf node contains a single class; classify using a binary classifier and preserve the corresponding set of support vectors. Observe that in all this requires 9 comparisons. At every stage, include only those patterns that are correctly classified.
Step 7: Given a test pattern, based on respective sets of support vectors, classify
into two classes, +1 or −1. Once it reaches a leaf, compare the label of
the test pattern with that assigned by the classifier. Compute classification
accuracy.
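The routing of a test pattern through the tree of Fig. 6.9 can be sketched as follows; the Node structure and function names are illustrative, and each node's clf stands for the binary classifier (SVM or AdaBoost) trained for that split.

```python
# Each nonleaf node holds a trained binary classifier `clf` that returns a
# positive value for the left group and a negative value for the right
# group, plus the two child nodes; leaves hold a single digit label.
class Node:
    def __init__(self, clf=None, left=None, right=None, label=None):
        self.clf, self.left, self.right, self.label = clf, left, right, label

def classify(node, x):
    """Route x down the tree; at most 4 binary decisions are needed to
    reach a leaf (Fig. 6.9)."""
    while node.label is None:
        node = node.left if node.clf(x) > 0 else node.right
    return node.label

# Top of the tree: (0,3,5,6,8) vs (1,2,4,7,9); then (0,6) vs (3,5,8),
# 0 vs 6, 8 vs (3,5), 3 vs 5 on the left branch, and (4,9) vs (1,2,7),
# 4 vs 9, 2 vs (1,7), 1 vs 7 on the right branch.
```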
6.8 Experimentation and Results
We consider labeled handwritten (HW) digit data and UCI-ML databases for the study. The handwritten data consists of 10,003 handwritten digit patterns labeled from 0 to 9. Out of these patterns, datasets of sizes 6000, 670, and 3333 are identified for training, validation, and testing, respectively. Each pattern is characterized by 192 binary features. The numbers of patterns per class are nearly equal in each of the above three datasets.
Preliminary analysis of the data, such as clustering and computation of measures of central tendency and dispersion, suggests similarity among the classes (1, 2, 4, 7, 9) and (0, 3, 5, 6, 8) and further among (1, 2, 7), (4, 9), (0, 6), (5, 3, 8), etc. These observations are further used in designing a multiclass classifier based on multiple 2-class classifiers.
We conduct experiments with SVM and AdaBoost classifiers using the KB Tree.
6.8.1 Experiments Using SVM
Based on preliminary analysis and domain knowledge of HW data, we combine
training data belonging to different classes as shown in Fig. 6.9. Observe from the
figure that the procedure involves at most 4 comparisons, where each leaf of the tree
contains a single class.
SVM light is used for computing support vectors.
1. Generation of Representative Patterns. Consider n training patterns.
• Using the CNN approach, arrive at a set A of n1 representative patterns.
• With class-wise distance thresholds ε, compute the set of leaders, B. Let the number of leaders be n2; n1 < n and n2 < n.
• The set of representative patterns is C = A ∪ B, with k (< n) patterns.
2. Multiple binary classification. Based on the preliminary analysis and domain knowledge, combine similar classes, thereby dividing the multicategory classification into multiple binary classification problems, as shown in Fig. 6.9. Observe that the different stages are marked as (1) to (5d). During training, at every binary branching, the set of considered labeled patterns is classified into two classes. For example, at (2a), the labeled patterns of classes (0, 3, 5, 6, 8) are classified into the mapped label +1 corresponding to labels (0, 6) at (3a) and −1 corresponding to labels (3, 5, 8) at (3b). Similarly, the patterns at (3a) are classified into +1 corresponding to label (0) at (4a) and −1 corresponding to label (6) at (4b).
Table 6.4 provides the results of experiments with SVM using different kernels. The support vectors and related parameters generated at each stage are preserved for classification.
3. Computation of overall CA. When a test pattern is presented at stage 1, the corresponding model generated during training classifies it through various stages. For example, a test pattern with label 7, if correctly classified, reaches stage 5d. A pattern presented at stage 1 gets classified into one of the stages (4a, 4b, 4c, 5a, 5b, 4e, 4f, 4g, 5c, 5d) with the help of the support vector sets at the various stages, 1 to 5d, generated earlier in Step 2. Thus, at the end, by comparing the labels of the classified patterns at the leaves with their expected labels, the number of correctly classified patterns is obtained. This is referred to as the "Overall CA".
Elaborate experimentation is carried out. The Hamming distance measure is considered. CNN results in 1611 representatives. In the case of leaders, after an exhaustive experimentation with different values of ε, a distance threshold value of 3.5 is chosen for all classes, except for the class with label 1, for which an ε of 2.5 is considered. Together, the number of prototypes generated by CNN and leaders is 4800.

Table 6.4 Experiments with SVM

Case                               | Kernel     | Degree | CA (%)
(0, 3, 5, 6, 8) vs (1, 2, 4, 7, 9) | Gaussian   | –      | 98.20
(0, 6) vs (3, 5, 8)                | Polynomial | 3      | 98.84
0 vs 6                             | Polynomial | 3      | 99.54
8 vs (3, 5)                        | Polynomial | 3      | 97.71
3 vs 5                             | Polynomial | 4      | 96.07
(4, 9) vs (1, 2, 7)                | Polynomial | 4      | 98.78
4 vs 9                             | Polynomial | 2      | 96.92
2 vs (1, 7)                        | Polynomial | 2      | 99.59
1 vs 7                             | Polynomial | 2      | 99.70

Table 6.5 contains the CA obtained with the test data at every stage with the corresponding mapped labels of +1 and −1. In the table, sets 1 and 2 correspond to (0, 3, 5, 6, 8) and (1, 2, 4, 7, 9), respectively; a Gaussian kernel is used at the first stage, and "Degree" refers to the degree of the polynomial kernel. The overall CA with 4800 representative patterns is 94.75 %, which is better than the reported value on the same dataset. With the full training dataset of 6670 patterns, the overall CA is the same as above, which indicates that the proposed procedure captures all the support vectors required for classification. The CPU training times, computed in seconds on a PIII 500 MHz machine, for the proposed method with the reduced data and with the full data are 143 seconds and 288 seconds, respectively. The corresponding testing times are 113.02 seconds and 145.85 seconds. In other words, the proposed method requires only about 50 % of the training time and 77 % of the testing time as compared to the case with the full data. In summary, the proposed Knowledge-Based Multicategory Tree Classifier provides a higher CA than other schemes using the same data, requires a smaller number of comparisons than existing SVM-based multicategory classification methods, and requires less training and testing time than that with the full data.
Table 6.5 Results (RBF kernel at level 1 and polynomial kernels at all other levels of the tree)

Case   | set 1 vs set 2 | (0, 6) vs (3, 5, 8) | 0 vs 6 | 8 vs (3, 5) | 3 vs 5 | (4, 9) vs (1, 2, 7) | 4 vs 9 | 2 vs (1, 7) | 1 vs 7
Degree | –              | 3                   | 3      | 3           | 4      | 4                   | 2      | 2           | 2
CA (%) | 98.2           | 98.8                | 99.5   | 97.7        | 96.1   | 98.8                | 96.9   | 99.6        | 99.7
Table 6.6 List of parameters

Parameter | Description
n         | No. of training patterns
k         | No. of features per pattern
pi        | ith training pattern
hj        | Hypothesis at iteration j
H         | Final hypothesis
εj        | Error after iteration j
αj        | Parameter derived from εj, viz., $0.5\,\ln\bigl(\frac{1-\varepsilon_j}{\varepsilon_j}\bigr)$
Wj        | Weight distribution at iteration j
m         | Maximum number of iterations
ζ         | Distance threshold for Leader clustering
6.8.2 Experiments Using AdaBoost
6.8.2.1 Prototype Selection
Prototypes are selected using the Leader clustering algorithm as discussed in
Sect. 2.5.2.1. The leader clustering algorithm begins with any arbitrary pattern in
the training data as the first leader. Subsequently, the patterns that lie within a prechosen distance threshold are considered to be part of the cluster represented by a
given leader. As the comparison progresses, a new pattern that lies outside the distance threshold is considered as the next leader. The algorithm continues till all the
training patterns are examined. It is clear that a small threshold would result in a
large number of leaders and a large threshold value would lead to a single cluster.
An optimal threshold is experimentally determined.
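The experimental determination of the threshold can be sketched as a simple sweep over candidate values of ζ, as below; build_prototypes and classify are assumed placeholders for the class-wise leader computation and the kNNC step, respectively.

```python
import numpy as np

def select_threshold(train_X, train_y, val_X, val_y,
                     candidate_thresholds, build_prototypes, classify):
    """Sweep candidate distance thresholds, build leader prototypes for each,
    and keep the threshold giving the best validation accuracy."""
    best_t, best_ca = None, -1.0
    for t in candidate_thresholds:
        P, Py = build_prototypes(train_X, train_y, t)   # e.g. class-wise leaders
        preds = classify(P, Py, val_X)                  # e.g. kNNC on the prototypes
        ca = np.mean(preds == val_y)
        if ca > best_ca:
            best_t, best_ca = t, ca
    return best_t, best_ca

# Example sweep, mirroring the threshold values reported in Tables 6.8-6.10:
# best_t, best_ca = select_threshold(X, y, Xv, yv,
#                                    [2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0],
#                                    build_prototypes, classify)
```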
6.8.2.2 Parameters Used in Adaptive Boosting
The list of parameters used in the algorithm is provided in Table 6.6.
Elaborate experimentation is carried out for different values of the distance threshold (ζ) for computing prototypes using the Leader clustering algorithm. kNNC is the classifier at each stage, and the classification is carried out for different values of k. The classification at every stage is binary. From every stage, only those test patterns correctly classified at that stage are passed to the subsequent stage as input test patterns. For example, consider a test pattern with label "5". The pattern, if classified correctly, passes through the following stages:
1. (0, 3, 5, 6, 8) vs (1, 2, 4, 7, 9),
2. (0, 6) vs (3, 5, 8),
3. 8 vs (3, 5),
4. 3 vs 5.
Table 6.7 Results with AdaBoost

Case description    | Value of k in kNNC | Distance threshold (ζ) | Classification accuracy (%)
Set 1 vs Set 2      | 5                  | 3.2                    | 98.3
(4, 9) vs (1, 2, 7) | 5                  | 3.0                    | 98.1
4 vs 9              | 8                  | 3.4                    | 96.6
2 vs (1, 7)         | 3                  | 3.0                    | 99.7
1 vs 7              | 1                  | 3.0                    | 99.5
(0, 6) vs (3, 5, 8) | 5                  | 4.0                    | 99.0
0 vs 6              | 10                 | 4.0                    | 99.4
8 vs (3, 5)         | 8                  | 2.5                    | 97.6
3 vs 5              | 3                  | 3.4                    | 96.0
The experiments are conducted on the validation dataset. The set of parameters (ζ and k) that lead to the best classification accuracy on the validation dataset is used for verification with the test dataset. Table 6.7 contains the results. In the table, set 1 corresponds to (0, 3, 5, 6, 8), and set 2 corresponds to (1, 2, 4, 7, 9). We notice that the overall classification accuracy depends on the number of misclassifications at each stage. In Fig. 6.9, the leaf nodes contain the class-wise correctly classified patterns; they are denoted in italics. The "overall CA" is 94.48 %, which is better than the previously reported value on the same data. For k = 1, kNNC (i.e., NNC) on the full training data of 6000 patterns against the test data provides a CA of 90.73 %. The best accuracy of 92.26 % is obtained for k = 5 of the kNNC. The "overall CA" obtained through the proposed scheme is thus better than the CA obtained with NNC and kNNC on the complete dataset. A decision tree of depth 4 is used to classify the 10-class patterns. This is a significant improvement over one-against-all and one-vs-one multiclass classification schemes.
With an increase of the distance threshold for leader clustering, viz., ζ, the number of prototypes reduces. For example, for ζ = 3.2, for the input dataset for classification of set 1 vs set 2, the number of prototypes reduces by 20 % without adversely affecting the classification accuracy. Secondly, it is important to note that as we approach the leaf nodes, the number of training patterns reduces. For example, we start with 6000 patterns at the root of the decision tree, and at the stage of 0 vs 6, the number of patterns reduces to 1200. Another interesting observation is that the number of prototypes at the stage {0 vs 6} is 748 for the distance threshold ζ = 4.0, a reduction of 38 %.
Tables 6.8, 6.9, and 6.10 contain the results at some stages of the KB Tree classification.
6.8.3 Results with AdaBoost on Benchmark Data
The proposed algorithm is applied to three different datasets other than the above-mentioned HW data, viz., WINE, THYROID, and SONAR. The data is obtained from the UCI Repository.
Table 6.8 Results on AdaBoost for set 1 vs set 2

Distance threshold | Ave. num. of prototypes per iteration | Ave. CA with training data | CA with validation data
2.0                | 1623                                  | 97.50                      | 97.60
2.5                | 1568                                  | 97.60                      | 97.60
3.0                | 1523                                  | 97.30                      | 97.68
3.5                | 1335                                  | 97.21                      | 97.38
4.0                | 1222                                  | 96.86                      | 97.12
4.5                | 861                                   | 95.95                      | 96.33
5.0                | 658                                   | 93.96                      | 95.36
Table 6.9 Results on AdaBoost for 0 vs rest of set 1

Distance threshold | Ave. num. of prototypes per iteration | Ave. CA with training data | CA with validation data
2.0                | 1288                                  | 99.19                      | 98.58
2.5                | 1277                                  | 99.13                      | 98.50
3.0                | 1255                                  | 99.14                      | 98.43
3.5                | 1174                                  | 99.18                      | 98.43
4.0                | 1006                                  | 98.90                      | 98.28
4.5                | 706                                   | 97.88                      | 98.20
5.0                | 536                                   | 97.28                      | 98.28
Table 6.10 Results on AdaBoost for 1 vs rest of set 2

Distance threshold | Ave. num. of prototypes per iteration | Ave. CA with training data | CA with validation data
2.0                | 1148                                  | 99.32                      | 98.43
2.5                | 1054                                  | 99.42                      | 98.65
3.0                | 966                                   | 99.39                      | 98.80
3.5                | 827                                   | 99.22                      | 98.80
4.0                | 645                                   | 98.25                      | 98.65
4.5                | 417                                   | 97.40                      | 98.05
5.0                | 317                                   | 91.73                      | 98.28
Table 6.11 consists of details on each of the benchmark datasets and the CA (Classification Accuracy) obtained using the current method.
The proposed algorithm is applied to the data. The patterns of all the considered datasets contain numerical features. The values are normalized to zero mean and unit standard deviation. The nearest-neighbor classifier, with the Euclidean distance as the dissimilarity measure, is used.
Table 6.11 Details on benchmark data

Name of dataset | Training data size | Test data size | Number of features | Number of classes
WINE            | 100                | 78             | 13                 | 3
THYROID         | 3772               | 3428           | 21                 | 3
SONAR           | 104                | 104            | 60                 | 2

Table 6.12 Results with benchmark data

Name of the dataset | Case description | Dist. threshold | Average num. of leaders | CA (%)
WINE                | 1 vs non-1       | 3.0             | 23                      | 98.72 %
WINE                | 2 vs non-2       | 1.5             | 43                      | 93.59 %
WINE                | 3 vs non-3       | 3.7             | 8                       | 98.72 %
THYROID             | 1 vs non-1       | 2.0             | 261                     | 98.83 %
THYROID             | 2 vs non-2       | 3.7             | 104                     | 94.31 %
THYROID             | 3 vs non-3       | 3.0             | 156                     | 93.84 %
SONAR               | 0 vs non-0       | 4.0             | 65                      | 95.19 %
The Classification Accuracies (CA) obtained with NNC on the entire WINE, THYROID, and SONAR datasets are 92.31 %, 93.26 %, and 95.19 %, respectively.
We notice from Table 6.12 that for the first two datasets, WINE and THYROID, the average CAs obtained, viz., 97.01 % and 95.66 %, respectively, are better than those of NNC with the entire dataset. These accuracies are obtained with prototypes numbering about 25 and 174, as compared to the full data sizes of 100 and 3772, respectively. In the case of the third dataset, viz., SONAR, the CA obtained is the same as that of NNC, but with fewer patterns: the average number of prototypes, 65, is less than the original 104 patterns.
6.9 Summary
We devise a knowledge-based tree that exploits domain knowledge on the data. The
tree enables us to classify 10-category handwritten data through just 4 comparisons.
For applications on large data, we consider representative patterns in place of the
complete dataset. Without loss of generality, representative patterns are considered
as the union of condensed nearest neighbors and leaders for application with support vector machines and leaders alone with AdaBoost. They form the prototypes
of the original data. The representatives are considered as the training data. The tree decomposes the multicategory classification into multiple 2-class classifications. Extensive experimentation is carried out using SVMs with different kernels and polynomial degrees, as well as different distance thresholds for AdaBoost. The
best classification accuracy obtained using validation data is identified. Using the
models thus obtained, test data is classified. The CA obtained with the approach is
94.75 % with SVM, and it is 94.48 % with AdaBoost with kNNC as a component
classifier. The obtained result is better than the reported result on the same data in
the literature.
The scheme is a novel way of dealing with large data where prototypes are identified, domain knowledge is used in identifying the number of comparisons required,
and support vectors are computed to classify the given multicategory data with high
accuracy.
6.10 Bibliographic Notes
A support vector machine is a widely used classification method. Foundations and detailed discussions of the method can be found in Vapnik (1999), Scholkopf and Smola (2002), Breuing and Buxton (2001), and Burges (1998). It has been applied to a variety of problems that include face detection by Osuna et al. (1997), face recognition by Guo et al. (2000), and handwritten digit recognition by Scholkopf et al. (1995) and Decoste and Scholkopf (2002). The SVM light software by Joachims (1999) helps in the application of support vector machines to various problems. Dong and Krzyzak (2005) emphasize the use of prior knowledge on the data to choose good structures in order to control the complexity of models for large datasets. Fung and Mangasarian (2005), Platt et al. (2000), Tax and Duin (2002), Allwein et al. (2000), and Hastie and Tibshirani (1998) provide insights on multiclass classification. Rifkin and Klautau (2004) and Milgram et al. (2006) contain useful discussions on one-vs-one and one-vs-all approaches. Murthy (1998) provides a detailed discussion on decision trees. The Leader clustering algorithm is discussed in Spath (1980). Hart (1968) proposed the Condensed Nearest-Neighbor approach. Duda et al. (2000) provide insightful discussions on the No-Free-Lunch Theorem, decision trees, Adaptive Boosting (AdaBoost), and classification approaches. Freund and Schapire (1997, 1999) and Schapire (1990, 1999, 2002) discuss boosting, boosting C4.5, comparing C4.5 with boosting stumps, weak learnability, and the AdaBoost algorithm. The WINE, THYROID, and SONAR data sets1 are obtained from the UCI-ML database. Ravindra et al. (2004) carried out work on AdaBoost for classification of large handwritten digit data.
References
E.L. Allwein, R.E. Schapire, Y. Singer, Reducing multiclass to binary: a unifying approach to
Chapter 7
Optimal Dimensionality Reduction
7.1 Introduction
In data mining applications, one encounters a large number of high-dimensional patterns. It is observed that not all features contribute to generating an abstraction, and an optimal subset of features is often sufficient both for representation and for classification of unseen patterns. Feature selection refers to the activity of identifying a subset of features that helps in classifying unseen patterns well.
For large datasets, even repeated simple operations, such as computing the distance between two binary-valued patterns, consume a significant amount of computation time. Reduction in the number of patterns through prototype selection based on large-data clustering, dimensionality reduction through optimal selection of feature subsets, and optimal feature extraction are some of the approaches that help in improving efficiency.
In data mining, a dataset may be viewed as a matrix of size n × d, where n is
the number of data points, and d is the number of features. In mining such datasets,
the dimensionality of the data can be very large; the associated problems are the
following.
• If the dimensionality d is large, then building classifiers and clustering algorithms
on such datasets can be difficult. The reasons are as follows.
1. If the dimensionality increases, then the computational resource requirement
for mining also increases; dimensionality affects both time and space requirements.
2. Typically, both classification and clustering algorithms that use Euclidean-distance-like metrics to characterize the similarity between a pair of patterns may overfit. Further, it becomes difficult to discriminate between patterns based on distances in high-dimensional spaces where the data is sparsely distributed. The specific issue is that, as the dimensionality increases, it is difficult to distinguish the nearest and farthest neighbors of a point
X based on their distances from X; these distances will be almost the
same.
3. There could be situations, specifically in areas like medical informatics, where
the number of data points n is small relative to the number of features d. It
is not uncommon to have hundreds of data points and millions of features in
some applications. In such cases, it is again important to reduce the dimensionality.
• Dimensionality reduction is achieved using one of the following approaches.
1. Feature Selection. Here
– We are given a set of features, FD = {f 1 , f 2 , . . . , f D }, which characterize
the patterns X1 , X2 , . . . , Xn .
– So each pattern is a D-dimensional vector. Feature selection involves
selecting a set F of d features, where d < D, F ⊂ FD , and F =
{f1 , f2 , . . . , fd }. That is, each fi is some f j ∈ FD .
– Selecting such a subset F of FD is done by using some heuristic or by
optimizing a criterion function. Primarily, there are two different schemes
for feature selection. These are the following.
(a) Filter methods. These employ schemes to select features without using the classifiers directly in the process; for example, features may be
ranked based on correlation with the class labels.
(b) Wrapper methods. Here, features are selected by using a classifier in the
process. Classifiers based on the nearest-neighbor rule, decision trees,
support vector machines, and naïve Bayes rule are used in feature selection; here features are selected based on accuracy of the resulting
classifier using these selected features.
2. Feature Extraction. It may be viewed as selection of features in a transformed space. Each feature extracted may be viewed as either a linear or a
nonlinear combination of the original features. For example, if $F_D$ is the given set of $D$ features, then the reduced set of extracted features is $F = \{f_1, f_2, \ldots, f_d\}$, where $d < D$ and
$$f_i = \sum_{j=1}^{D} \alpha_j f^j$$
with real numbers $\alpha_j$ and $f^j \in F_D$; so the new features are linear combinations of the given features. Here we consider feature extraction based on linear combinations only; note that feature selection is a special case of feature extraction in which, for each extracted feature, all but one of the $\alpha_j$ are zero. These feature extraction schemes may be characterized as follows:
(a) Deterministic. Here αj are deterministic quantities obtained from the data.
(b) Stochastic. In these cases, αj are randomly chosen real numbers.
We discuss both feature selection and extraction schemes in the remaining parts
of the chapter.
7.2 Feature Selection
In feature selection, we need to rank either the features or subsets of features to
ultimately select a subset of features. There are different schemes for ranking.
7.2.1 Based on Feature Ranking
Using this scheme, feature selection is achieved based on the following algorithm.
1. Rank individual features using some scheme. As mentioned earlier, let FD be the
set of given features f 1 , f 2 , . . . , f D . Let the features ranked using the scheme
be f1 > f2 > · · · > fd > fd+1 > · · · > fD−1 > fD , which means that f1 is the
best feature followed by f2 , then f3 , and so on.
2. Select a subset of d features as follows.
(a) Consider the set of features {f1 , f2 , . . . , fd }.
(b) Select the best feature f1; select the next feature to be fj, where fj is such that {f1, fj} > {f1, fi} for all i ≠ j, i = 2, . . . , D. Repeat till the required
number (d) of features are selected. Note that here once we select a feature,
for example, f1 , then we have it in the final set of selected features. Any
feature to be included in the current set of selected features is ranked based
on how it performs jointly with the already selected ones. This is called the
Sequential Forward Selection (SFS) scheme. In this case, there is no way to
delete a feature that is already selected.
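The SFS step above can be written compactly. The following is a minimal Python sketch, assuming a caller-supplied scoring function score(subset) (a hypothetical name; in practice it could be the validation accuracy of a classifier trained on that subset):

```python
def sfs(all_features, d, score):
    """Sequential Forward Selection: greedily add the feature that,
    jointly with the already selected ones, maximizes score()."""
    selected = []
    remaining = list(all_features)
    while len(selected) < d and remaining:
        # evaluate every candidate together with the current selection
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected
```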
In the above schemes, we ranked features for inclusion in the set. In a symmetric
manner, we can also rank features for possible deletion. Here, the general scheme is
as follows.
1. Consider the possibility of deleting f i , i = 1, 2, . . . , D from FD . Let the resulting set be F −i . Let the ranking of the resulting sets be F1 > F2 > · · · > FD ,
where for each Fi , there is an F −j such that Fi = F −j .
2. We select F1 which has D − 1 features; then, recursively, we keep eliminating
one feature at a time till we ultimately get a set of required number (d) of features.
This scheme is called the Sequential Backward Selection (SBS) scheme. Some of
the properties of the SBS are the following.
1. Once a feature f is deleted from a set of size l (> d), this feature cannot appear again later, that is, in sets of size less than l.
2. It is useful when d is close to D; for example, when selecting 90 features from a
set of 100 features. It is inefficient to use this scheme for selecting 10 out of 100
features; SFS is better in such cases.
Both SFS and SBS are greedy schemes and do not guarantee good performance.
The most attractive alternatives in this direction are floating search schemes. These
schemes permit nonmonotonic behavior. For example, the Sequential Forward
Floating Selection (SFFS) scheme permits us to delete features that have been added
earlier. Similarly, the Sequential Backward Floating Selection (SBFS) permits inclusion of features that were discarded earlier. We explain the working of the SFFS
next:
1. Let FD be a given set of D features; let k = 0 and Fk = ∅.
2. Addition based on SFS. We use SFS to select a feature to be added. Let fi be
the best feature along with the (already selected) features in F . Then update F
to include fi , that is, set k = k + 1 and Fk = Fk−1 ∪ {fi }.
3. Conditional deletion. Delete a feature fj from Fk if Fk \ {fj } > Fk−1 .
4. Repeat steps 2 and 3 to get a feature subset of size d.
SFFS permits deletion of a feature that was added earlier. Let F0 = φ. Let fi be the
best feature that is added to generate F1 = {fi }. Let the next best (along with fi )
be fj ; so, F2 = {fi , fj }. Note that deleting fi or fj from F2 will not be possible.
Let the next feature to be added be fl, making F3 = {fi, fj, fl}. Now we can delete
fi from F3 if {fj , fl } is the best subset of size 2. So, by adding and conditionally
deleting, we could delete a feature (fi here) that was added before. Such a nonmonotonic behavior is also exhibited by SBFS; in SBFS, we conditionally add features
that were deleted earlier.
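A minimal sketch of SFFS along the lines just described, reusing the hypothetical score(subset) function from the SFS sketch above; deletions are attempted only for subsets of size three or more, matching the discussion of F2:

```python
def sffs(all_features, d, score):
    """Sequential Forward Floating Selection: an SFS-style addition
    followed by conditional deletions whenever dropping a feature
    beats the best subset recorded at that size."""
    selected = []
    best_of_size = {}                       # size -> (score, subset)
    while len(selected) < d:
        # forward step (SFS): add the best feature w.r.t. score()
        remaining = [f for f in all_features if f not in selected]
        f_add = max(remaining, key=lambda f: score(selected + [f]))
        selected = selected + [f_add]
        best_of_size[len(selected)] = (score(selected), list(selected))
        # conditional backward step: delete while it strictly helps
        while len(selected) > 2:
            f_del = max(selected,
                        key=lambda f: score([g for g in selected if g != f]))
            reduced = [g for g in selected if g != f_del]
            if score(reduced) > best_of_size[len(reduced)][0]:
                selected = reduced
                best_of_size[len(reduced)] = (score(reduced), list(reduced))
            else:
                break
    return best_of_size[d][1]
```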
7.2.2 Ranking Features
Ranking may be achieved by examining the association between feature value and
the class label and/or classification accuracy. These may be realized as follows:
1. Filter Methods. These are schemes in which the features are ranked based on
some function of the values assumed by the individual features. Two popular
parameters used in this category are as given below.
(a) Fisher’s score. It is based on the separation between the means of two classes
with respect to sum of the variances for each feature. In a two-class situation,
the Fisher score FS is
$$\mathrm{FS}(f_j) = \frac{(\mu_1(j) - \mu_2(j))^2}{\sigma_1(j)^2 + \sigma_2(j)^2},$$
where $\mu_i(j)$ is the sample mean value of feature $f_j$ in class $i = 1, 2$, and $\sigma_i(j)$ is the sample standard deviation of $f_j$ for class $i = 1, 2$. For a multiclass problem, it is of the form
$$\mathrm{FS}(f_j) = \frac{\sum_{i=1}^{C} n_i \,(\mu_i(j) - \mu(j))^2}{\sum_{i=1}^{C} n_i \,\sigma_i(j)^2},$$
where $n_i$ is the number of patterns in class $i$, $\mu(j)$ is the overall sample mean of feature $f_j$, and $\mu_i(j)$ and $\sigma_i(j)$ are respectively the sample mean and sample standard deviation of feature $f_j$ in class $i$. (A short code sketch of these filter scores appears at the end of this section.)
(b) Mutual Information. Mutual information (MI) gives information that one
random variable gives about another. MI is very popular in selecting important words/terms in classifying documents; here MI measures the amount
of information the presence or absence of term t provides about classifying
a document to a class c. MI of feature fi , MI(fi ), is given by
$$\mathrm{MI}(f_i) = \sum_{i,j \in \{0,1\}} \frac{n_{ij}}{n} \log_2 \frac{n\, n_{ij}}{\left(\sum_{l \in \{0,1\}} n_{il}\right)\left(\sum_{l \in \{0,1\}} n_{lj}\right)},$$
where $n_{ij}$ is the number of documents in which the term indicator takes value $i$ and the class indicator takes value $j$, and $n$ is the total number of documents.
2. Wrapper Methods. Here, one may use the classification accuracy of a classifier to
rank the features. One may use any of the standard classifiers based on a feature
and compute the classification accuracy using training and validation datasets.
A majority of the classifiers have been used in this manner. Some of them are the
following.
• Nearest-Neighbor Classifier (NNC). Let FD be the set of features, and STrain
and SValidate be the sets of training and validation data, respectively. Then the
ranking algorithm is as follows:
– For each feature fi ∈ FD , i = 1, 2, . . . , D, compute classification performance using NNC, that is, find out the number of correctly classified patterns from SValidate by obtaining the nearest neighbor, from STrain , of each
validation pattern. Let ni be the number of correctly classified patterns using
fi only.
– Rank the features based on the ranking of ni s; fj > fk (fj is superior to fk )
if nj > nk (nj is larger than nk ). Resolve ties arbitrarily. It is possible to rank
subsets of features also using this scheme.
• Decision Tree Classifier (DTC). Here the ranking is done by building a one-level decision tree classifier corresponding to each feature. The specific ranking algorithm is as follows:
– Build a decision tree classifier DTCi based on feature fi and the training
dataset STrain for i = 1, 2, . . . , D. Obtain the number of patterns correctly
classified from SValidate using DTCi , i = 1, 2, . . . , D; let the number of patterns obtained using DTCi be ni .
– Rank the features based on the ni s; fj is superior to fk if nj > nk .
• Support Vector Machine (SVM). Like NNC and DTC, in the case of SVM, we
also rank features or sets of features by training an SVM using the feature(s)
on STrain . We can again rank features by using the classification accuracy on
the validation dataset.
3. It is possible to use several other classifiers also in a similar manner to rank
features or sets of features. There is another possible way to use classifiers in
feature selection. Here, features could be directly selected by using a trained
classifier; these are called embedded schemes for feature selection. We give some
of the popular ones.
(a) Decision tree based. Decision tree classifier learning algorithms build a decision tree using the training data. The resulting tree structure inherently
captures the relevant features. One can exploit this structure to rank features
as follows:
• Build a decision tree DT using the training data STrain . Each node in the
decision tree is associated with a feature.
• Use the Breadth First Search (BFS) to order the features used in the decision tree. Let the output of the BFS be f 1 , f 2 , . . . , f d . This ordering
gives a ranking of the features in terms of their importance.
(b) Support Vector Machine (SVM) based. In a two-class scenario, learning an
SVM from the training data involves obtaining a weight vector W and a
threshold weight b such that W t Xi + b < 0 if Xi is from the negative class
and W t Xi + b > 0 if Xi is from the positive class. Here, W and Xi are D-dimensional vectors. Let W = (w1, w2, . . . , wD)t; naturally, each wi indicates the importance of the feature fi. It is possible to view the entries of W
as weights of the corresponding features. This is achieved using the SVM.
Specifically,
• Use an SVM learning algorithm on the training data STrain to obtain the
weight vector W and threshold (or bias) b.
• Sort the elements of W based on their magnitude; if wi is negative, then fi
contributes to the negative class, and if wi is positive, then fi contributes
to the positive class. So, the importance of feature fi is characterized by
the magnitude of wi .
• Now rank features based on the sorted order: fj is superior to fk if
|wj | > |wk |.
(c) Stochastic Search based. Here the candidate feature subsets are generated
using a stochastic search technique like Genetic Algorithms (GAs), Tabu
Search (TS), or Simulated Annealing (SA). These possible solutions are
evaluated using different classifiers using classification accuracy on a validation set to rank the solutions. The best solution (feature subset) is chosen.
NNC is one of the popular classifiers in this context.
In Sect. 7.4, we provide a detailed case study of feature selection using GAs.
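The filter scores of Sect. 7.2.2 can be computed directly from the data. Below is a small Python sketch, assuming a two-class problem with binary patterns stored in a NumPy array X of shape (n, D) and labels y in {0, 1}; the array names and the small variance guard are our own assumptions, not part of the original formulation:

```python
import numpy as np

def fisher_scores(X, y):
    """Two-class Fisher score per feature: squared separation of the
    class means divided by the sum of the per-class variances."""
    X0, X1 = X[y == 0], X[y == 1]
    num = (X0.mean(axis=0) - X1.mean(axis=0)) ** 2
    den = X0.var(axis=0) + X1.var(axis=0) + 1e-12   # guard against zero variance
    return num / den

def mutual_information(X, y):
    """MI between each binary feature and the binary class label,
    estimated from the 2x2 contingency counts n_ij."""
    n = len(y)
    mi = np.zeros(X.shape[1])
    for k in range(X.shape[1]):
        total = 0.0
        for i in (0, 1):                 # feature value
            for j in (0, 1):             # class value
                nij = np.sum((X[:, k] == i) & (y == j))
                ni_dot = np.sum(X[:, k] == i)
                ndot_j = np.sum(y == j)
                if nij > 0:
                    total += (nij / n) * np.log2(n * nij / (ni_dot * ndot_j))
        mi[k] = total
    return mi

# rank features, best first, by either score
# ranking = np.argsort(-fisher_scores(X, y))
```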
7.3 Feature Extraction
Feature extraction deals with obtaining new features that are linear combinations of
the given features. There are several well-known schemes; some of the popular ones
are the following.
1. Principal Component Analysis (PCA). The basic idea here is to consider the direction in which the data exhibits maximum variance; this is called the first principal component. In a similar manner, the second, third, and successive orthogonal directions, up to d of them, are considered based on the maximum remaining variance at each step. It is possible
to show that the resulting directions are eigenvectors of the covariance matrix
of the data; also, it corresponds to minimizing some deviation (error) between
the original data in the D space and the projected data (corresponding to the d
principal components) in the d space.
Let the data matrix of size n × D be A where there are n data points and each
is a point in a D-dimensional space. If the data is assumed to be normalized to be zero-mean, then the sample covariance matrix is proportional to $A^tA$ and is of size $D \times D$; the related Gram matrix $AA^t$ is of size $n \times n$. Both matrices are symmetric, so their eigenvalues are real, assuming that $A$ has real entries. Further, it is possible to show that the eigenvalues of $AA^t$ and $A^tA$ are the same except for some extra zero eigenvalues, which are $|n - D|$ in number.
The eigenvectors and eigenvalues of AAt are characterized by
$$AA^t X_i = \lambda_i X_i.$$
Similarly, the corresponding eigenvectors and eigenvalues of $A^tA$ are given by
$$(A^tA)(A^t X_i) = \lambda_i (A^t X_i).$$
Typically, singular value decomposition (SVD) of the matrix A is used to compute the eigenvectors of the matrix AAt . Then the top d eigenvectors (corresponding to the largest d eigenvalues) are used to represent the n patterns in the
d-dimensional space.
2. Nonnegative Matrix Factorization (NMF). This is based on the assumption that
the data matrix A is a nonnegative real matrix. We approximate the n × D matrix A by a product of two nonnegative matrices B (of size n × K) and C (of size K × D). This is achieved by solving the optimization problem
$$\min_{B \ge 0,\, C \ge 0} \; f(B, C) = \frac{1}{2}\,\|A - BC\|_F^2,$$
where the cost function is the square of the Frobenius norm (entry-wise difference) between A and BC. A difficulty associated with this approach is that when
only A is known, but neither B nor C is known, then the optimization problem
is nonconvex and is not guaranteed to give the globally optimal solution.
However, once we get a decomposition of A into a product of B and C, then
we have a dimensionality reduction as obtained in B (of size n × K); each of the
n data points is represented using K features. Typically, K ≪ D, and so there is
a dimensionality reduction.
3. Random Projections (RP). Both PCA and NMF may be viewed as deterministic
schemes. However, it is possible to get linear combinations of features using random weights; a random projection scheme typically may be viewed as belonging
to extracting new features using randomly weighted linear combinations of the
given D features. This may be expressed as
$$B = AR,$$
where R is a D × K matrix with random entries; typically, K ≪ D, so that B may
be viewed as a lower-dimensional representation of A. An important property of RP is that, under some conditions, pairwise distances are approximately preserved: if X and Y are points in the D-dimensional space and X′ and Y′ are the corresponding points in the K-dimensional space, then ‖X′ − Y′‖² approximates ‖X − Y‖². This means that the Euclidean distances between pairs of points are preserved.
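A minimal NumPy sketch of the two extraction schemes discussed above, PCA via the SVD and a Gaussian random projection; the toy data, the scaling of R by 1/sqrt(K), and the distance check at the end are illustrative assumptions rather than prescriptions from the text:

```python
import numpy as np

def pca_project(A, d):
    """Project the n x D data matrix A onto its top-d principal
    components, computed through the SVD of the centered data."""
    A = A - A.mean(axis=0)                       # zero-mean the data
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    return A @ Vt[:d].T                          # n x d representation

def random_project(A, K, seed=0):
    """Random projection B = A R with a D x K Gaussian matrix R,
    scaled so that pairwise distances are approximately preserved."""
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((A.shape[1], K)) / np.sqrt(K)
    return A @ R

# illustrative check of approximate distance preservation
A = np.random.default_rng(1).random((100, 500))
B = random_project(A, K=50)
print(np.linalg.norm(A[0] - A[1]), np.linalg.norm(B[0] - B[1]))
```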
7.3.1 Performance
It is observed based on experimental studies that PCA performed better than RP on
a variety of datasets. Specifically, on the OCR dataset, PCA-based SVM classifier
gave 92 % accuracy using the RBF kernel; using the same classifier on the OCR
data RP-based feature set gave an accuracy of 88.5 %.
In the following section, we discuss two efficient approaches to feature selection using genetic algorithms.
7.4 Efficient Approaches to Large-Scale Feature Selection Using
Genetic Algorithms
On many practical datasets, it is observed that prototype patterns or representative
feature subsets, or both together, provide better classification performance as compared to using the entire dataset with all the features. Such subsets also help in reducing the classification cost. The pattern recognition literature is replete with feature selection approaches. In this section, we propose to obtain optimal dimensionality reduction using Genetic Algorithms through efficient classification of OCR patterns. The efficiency is achieved by resorting to nonlossy compression of patterns and classifying them in the compressed domain itself. We further examine combining frequent-item-support-based feature reduction for possible improvement in classification accuracy.
Through experiments, we demonstrate that the proposed approaches result in an
optimal feature subset that, in turn, results in improved classification accuracy and
processing time as compared to conventional processing.
In the present work, we propose algorithms that integrate the following aspects.
• Run-length compression of data and classification in the compressed domain.
• Optimal feature selection using genetic algorithms.
• Domain knowledge of the data under consideration.
• Identification of frequent features and their impact on classification accuracy combined with genetic algorithms.
The section is organized in the following manner. Section 7.4.1 provides an
overview of genetic algorithms. Proposed schemes are provided in Sect. 7.4.2. Section 7.4.3 contains preliminary analysis of the dataset considered to demonstrate
working of the algorithm. Experiments and results are discussed in Sect. 7.4.4. The
work is summarized in Sect. 7.4.5.
7.4.1 An Overview of Genetic Algorithms
Genetic algorithms are search and optimization methods based on the mechanisms
of natural genetics and evolution; they draw their analogy from competition and survival of the fittest in Nature. GAs have an advantage over conventional optimization methods in finding a global or near-global optimal solution while avoiding local optima. Over the years, their applications have spread rapidly to almost all engineering disciplines. Since their introduction, a number of developments and variants have matured into topics of their own, such as multiobjective genetic algorithms, interactive
genetic algorithms, etc. In the current section, we briefly discuss the basic concepts
with a focus on the implementation of a simple genetic algorithm (SGA) and a few applications. A brief discussion on SGA can be found in Chap. 3. The discussion
provided in the present section forms the background to subsequent material.
SGA is characterized by the following.
• Population of chromosomes or binary strings of finite length.
• Fitness function and problem encoding mechanism.
• Selection of individual strings.
• Genetic operators, viz., cross-over and mutation.
• Termination and other control mechanisms.
It should be noted that each of these topics has been studied in depth in the research literature. Since the current section is intended only to make the discussion self-contained, with a focus on implementation aspects, interested readers are directed to the references listed at the end of the chapter. We also intentionally avoid discussion
on other evolutionary algorithms.
Objective Function. SGA is intended to find optimal set of parameters that optimize
a function. For example, find a set of parameters, x1 , x2 , . . . , xn , that maximizes
a function f (x1 , x2 , . . . , xn ).
Chromosomes. A bit-string or chromosome consists of a set of finite number of
bits, l, called the length of the chromosome. Bit-string encoding is a classical method adopted by researchers. The chromosomes are used to encode parameters that represent a solution to the optimization problem. Alternative encoding mechanisms include Gray code, floating-point representation, etc. SGA
makes use of a population of chromosomes with a finite population size, C.
Each bit of the bit-string is called allele in genetic terms. Both the terms are
used interchangeably in the literature.
Encoding Mechanism and Fitness Function. We find an optimal value of f (x1 , x2 ,
. . . , xn ) through the set of parameters x1 , x2 , . . . , xn . The value of f (·) is called
156
7 Optimal Dimensionality Reduction
the fitness function. Given the values of x1 , x2 , . . . , xn , the fitness can be computed. We encode the chromosome to represent the set of the parameters. This
forms the key step of a GA. Encoding depends on the nature of the optimization
problem. The following are two examples of encoding mechanisms. It should be
noted that the mechanisms are problem dependent, and one can find novel ways
of encoding a given problem.
Example 1. Suppose that we need to select a subset of features out of a group of
features that represent a pattern. The chromosome length is considered equal to
the total number of features in the pattern, and each bit of the chromosome represents whether the corresponding feature is considered. The fitness function in
this case can be the classification accuracy based on the selected set of features.
Example 2. Suppose that we need to find values of two parameters that minimize
(maximize) a given function, where the parameters assume real values. The chromosome is divided into two parts representing the two parameters. The number of bits needed to cover the expected range of real values of each parameter, at the desired resolution, determines the corresponding segment lengths, viz., l1 and l2. The length of the chromosome is then l1 + l2.
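For Example 2, a hypothetical linear decoding of each chromosome segment into a real parameter value may look as follows (the 8-bit segment lengths and the range [0, 10] are assumptions made only for illustration):

```python
def decode_segment(bits, lo, hi):
    """Map an l-bit string to a real value in [lo, hi] by linear scaling."""
    l = len(bits)
    return lo + int(bits, 2) / (2 ** l - 1) * (hi - lo)

# a 16-bit chromosome split into two 8-bit parameters x1, x2 in [0, 10]
chromosome = "1100110000110101"
x1 = decode_segment(chromosome[:8], 0.0, 10.0)   # 8.0
x2 = decode_segment(chromosome[8:], 0.0, 10.0)   # about 2.08
```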
Selection Mechanism. Selection refers to carrying individual chromosomes from the previous generation to the next generation of evolution while giving emphasis to highly fit individuals in the current generation. There are many selection schemes used in practice. For example, the Roulette wheel selection scheme assigns each chromosome a sector of a roulette wheel such that the angle subtended by the sector is proportional to its fitness. This ensures that more copies of highly
fit individuals move on to the next generation. Many alternate approaches for
selection mechanisms are used in practice.
Crossover. Pairs of individuals, s1 and s2, are chosen at random from the population and are subjected to crossover. Crossover takes place when the prechosen probability of crossover, Pc, exceeds a generated random number in the range [0,1]. In the “single-point crossover” scheme, a position, say k, within the chromosome is chosen at random from the numbers 1, 2, . . . , (l − 1) with equal probability. Crossover takes place at k, resulting in two new offspring containing alleles
from 0 to k of s1 and from (k + 1) to l of s2 for offspring 1 and from 0 to k of s2
and from (k + 1) to l of s1 for offspring 2. The operation is depicted in Fig. 7.1.
The other crossover schemes include two-point crossover, uniform crossover,
etc.
Mutation. Mutation of a bit consists of changing it from 0 to 1 or vice versa based on the probability of mutation, Pm. This provides better exploration of the solution space by restoring genetic material that could possibly be lost through generations. The operation consists of generating a random number in the range [0,1]; if the random number is less than Pm, the bit is mutated. The bit position to be mutated is chosen uniformly at random from the l positions. A higher value for
Pm causes more frequent disruption. The operation is depicted in Fig. 7.2.
Termination. Many criteria exist for termination of the algorithm. Some approaches
are (a) when there is no significant improvement in the fitness value, (b) a limit
on number of iterations, etc.
Fig. 7.1 Crossover operation
Fig. 7.2 Mutation operation
Control Parameters. The choice of population size C and the values of Pc and Pm
affect the solution and speed of convergence. Although large population size
assures the convergence, it increases computation time. The choice of these parameters is problem dependent. We demonstrate the effect of their variability
in Sect. 7.4.4. Adaptive schemes for choosing the values of Pc and Pm show
improvement on final fitness value.
SGA. With the above background, we briefly discuss working of a Simple Genetic
Algorithm as given below. After encoding the parameters of an optimization
problem, consider n chromosomes, each of length l. Initialize the population
with a probability of initialization, PI . With PI = 0, all the alleles are considered
for each chromosome, and with PI = 1, none are considered. Thus, as the value
of PI varies from 0 to 1, more alleles with value 0 are expected, thereby resulting
in lesser number of features getting selected for the chromosome. In Sect. 7.4.4,
we demonstrate the effect of variation of PI and provide a discussion. As the
next step, we evaluate the objective function to obtain the fitness value of each chromosome.
Till convergence based on the set criteria is obtained, for each iteration, select the population for the next generation and perform crossover (Pc) and mutation (Pm) operations to obtain new offspring. Compute the fitness function for the new population.
Simple Genetic Algorithm
{
  Step 1: Initialize population containing ‘C’ strings of length ‘l’, each with probability of initialization, PI;
  Step 2: Compute fitness of each chromosome;
  while termination criterion not met
  {
    Step 3: Select population for the next generation;
    Step 4: Perform crossover based on Pc and mutation based on Pm;
    Step 5: Compute fitness of each updated chromosome;
  }
}
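The pseudocode above can be fleshed out as follows. This is a minimal Python sketch under our own assumptions: a chromosome is a binary feature mask, the fitness function fitness_fn is supplied by the caller (for example, the validation accuracy of NNC on the selected features), selection is by roulette wheel, crossover is single-point, and a gene is initialized to 1 with probability 1 − PI so that PI = 0 selects all features, as described earlier:

```python
import random

def init_population(C, l, PI):
    # gene = 1 (feature selected) with probability 1 - PI
    return [[1 if random.random() > PI else 0 for _ in range(l)]
            for _ in range(C)]

def roulette_select(pop, fit):
    """Roulette-wheel selection: probability proportional to fitness."""
    r = random.uniform(0, sum(fit))
    acc = 0.0
    for chrom, f in zip(pop, fit):
        acc += f
        if acc >= r:
            return chrom
    return pop[-1]

def crossover(s1, s2, Pc):
    """Single-point crossover, applied with probability Pc."""
    if random.random() < Pc:
        k = random.randint(1, len(s1) - 1)
        return s1[:k] + s2[k:], s2[:k] + s1[k:]
    return s1[:], s2[:]

def mutate(s, Pm):
    """Flip each bit independently with probability Pm."""
    return [1 - b if random.random() < Pm else b for b in s]

def sga(fitness_fn, C=40, l=192, PI=0.2, Pc=0.99, Pm=0.001, generations=40):
    pop = init_population(C, l, PI)
    for _ in range(generations):
        fit = [fitness_fn(chrom) for chrom in pop]
        new_pop = []
        while len(new_pop) < C:
            p1, p2 = roulette_select(pop, fit), roulette_select(pop, fit)
            c1, c2 = crossover(p1, p2, Pc)
            new_pop += [mutate(c1, Pm), mutate(c2, Pm)]
        pop = new_pop[:C]
    fit = [fitness_fn(chrom) for chrom in pop]
    return max(zip(fit, pop), key=lambda t: t[0])   # (best fitness, chromosome)
```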
7.4.1.1 Steady-State Genetic Algorithm (SSGA)
In the general framework of Genetic Algorithms, we choose the entire feature set of a
pattern as a chromosome. Since the features are in binary form, they indicate the
presence or absence of the corresponding feature in a pattern. The genetic operators
of Selection, Cross-over, and Mutation are used, with the corresponding probability of initialization (PI), probability of cross-over (Pc), and probability of mutation (Pm). Like in the case of SGA, the given dataset is divided into training, validation, and test data. Classification accuracy on validation data using NNC forms the fitness function. Table 7.1 contains the terminology used in this chapter.
In the case of SSGA, we retain a chosen percentage of highly fit individuals from generation to generation, thereby preventing the loss of such individuals over the generations. This retained fraction is termed the generation gap. Thus, SSGA permits larger Pm values as compared to SGA.
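A sketch of the corresponding steady-state replacement step, assuming the crossover and mutate helpers from the SGA sketch in the previous subsection; the 40 % retention below matches the generation gap used later in Algorithm 7.1:

```python
import random

def steady_state_step(pop, fit, Pc, Pm, retain_fraction=0.4):
    """One SSGA generation: keep the most fit retain_fraction of the
    population unchanged and fill the rest with offspring obtained by
    crossover and mutation of randomly chosen current individuals.
    Reuses crossover() and mutate() from the SGA sketch above."""
    ranked = sorted(zip(fit, pop), key=lambda t: t[0], reverse=True)
    n_keep = int(retain_fraction * len(pop))
    new_pop = [chrom for _, chrom in ranked[:n_keep]]
    while len(new_pop) < len(pop):
        p1, p2 = random.sample(pop, 2)        # randomly selected parents
        c1, c2 = crossover(p1, p2, Pc)
        new_pop += [mutate(c1, Pm), mutate(c2, Pm)]
    return new_pop[:len(pop)]
```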
7.4.2 Proposed Schemes
We propose the algorithms shown in Algorithm 7.1 and Algorithm 7.2 for the study.
Algorithm 7.1 integrates run-length compression of data, classification of compressed data, SSGA, and knowledge acquired through preliminary analysis with
a generation gap of 40 %. Algorithm 7.2 integrates the concept of frequent features
in addition to GA-based optimal feature selection.
Algorithm 7.1 (Algorithm for Feature Selection using Compressed Data Classification and Genetic Algorithms)
Step 1: Consider a population of ‘C’ chromosomes, with each chromosome consisting of ‘l’ features. Initialize each chromosome by setting each feature to ‘1’ (selected) with a given probability of initialization, PI.
Step 2: For each chromosome in the population,
(a) Consider those selected features in the chromosome
(b) With the selected features in training and validation data sets, compress
the data
(c) Compute classification accuracy of validation data directly using the
compressed form. The classification accuracy forms the fitness function
(d) Record the number of alleles, classification accuracy for each chromosome, and generation-wise average fitness value.
Step 3: In computing next generation of chromosomes, carry out the following
steps
(a) sort the chromosomes in the descending order of their fitness
(b) preserve 40 % of highly fit individuals for the next generation
(c) the remaining 60 % of the next population are obtained by subjecting randomly selected individuals from current population to cross-over
and mutation with respective probabilities Pc and Pm .
Step 4: Repeat Steps 2 and 3 till there is no significant change in the average fitness
between successive generations.
In the framework of optimal feature selection using genetic algorithms, each
chromosome is considered to represent the entire candidate feature set. The population containing C chromosomes is initialized in Step 1. Since the features are binary valued, the initialization is carried out by setting a feature to “1” with a given
probability of initialization, PI . Based on the binary value 1 or 0 of an allele, the
corresponding feature is considered either selected or not, respectively.
In Step 2, for each initialized chromosome, the original training and validation data are updated to contain only the selected features. The data is compressed using the run-length compression algorithm. The validation data is classified in its compressed
form, and the average classification accuracy is recorded.
In Step 3, the subsequent population is generated. The best 40 % of the current population are preserved. The remaining 60 % are generated by subjecting the entire current population to the genetic operators of selection, single-point cross-over, and mutation with preselected probabilities. The termination criterion is based on the percentage change of fitness between two successive generations.
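The fitness computation of Step 2 may be sketched as follows, under our own assumptions about the data layout: X_train and X_val are binary NumPy arrays, y_train and y_val are the labels, and the chromosome is a 0/1 feature mask. For brevity the sketch classifies dense masked rows with the Manhattan distance; in the actual scheme the masked rows are run-length compressed and the same distance is computed directly on the compressed form (see the sketch in Sect. 7.4.2.1 below):

```python
import numpy as np

def fitness(chromosome, X_train, y_train, X_val, y_val):
    """Fitness = NNC accuracy on validation data using only the
    features selected by the chromosome (Step 2 of Algorithm 7.1)."""
    mask = np.asarray(chromosome, dtype=bool)
    Xt = X_train[:, mask].astype(int)
    Xv = X_val[:, mask].astype(int)
    correct = 0
    for x, label in zip(Xv, y_val):
        d = np.abs(Xt - x).sum(axis=1)   # Manhattan distance to all training rows
        if y_train[np.argmin(d)] == label:
            correct += 1
    return correct / len(y_val)
```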
Elaborate experimentation is carried out by changing the population initialization procedure, such as (a) a preselected population, (b) preselecting some features as unused, and (c) initialization using the probability of initialization, and by varying the values of the
probabilities of selection, cross-over, mutation, etc. The nature of exercises and the
results are discussed in the following section.
Genetic Algorithms (GAs) are well studied for feature selection and feature extraction. We restrict our study to feature selection. Given a feature set of size C, the problem of dimensionality reduction can be defined as arriving at a subset of the original feature set of dimension d < C such that the best classification accuracy is obtained. The single dominant computational block is the evaluation of the fitness function; if this can be speeded up, the overall scheme is sped up. In order to achieve this, we
propose to compress the training and validation data and compute the classification
accuracy directly on the compressed data without having to uncompress.
In the current section, before discussing the proposed procedure, we present compressed data classification and Steady-State Genetic Algorithm for feature selection
in the following subsections.
7.4.2.1 Compressed Data Classification
We make use of the algorithm discussed in Chap. 3 to compress input binary data
and operate directly on the compressed data without decompressing for classification using runs. This forms a total nonlossy compression–decompression scenario.
It is possible to perform this when classification is achieved with the help of the
Manhattan distance function. The distance function on the compressed data results
in the same classification accuracy as that obtained on the original data as shown
in Chap. 3. The compression algorithm is applied on large data, and it is noticed to
reduce processing requirements significantly.
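A minimal sketch of this idea, with our own function names: a binary pattern is stored as run lengths that alternate starting with a run of 1s (length 0 if the pattern starts with 0, as described in Sect. 7.4.3.2), and the Manhattan distance is accumulated by walking two run lists in parallel without decompressing:

```python
def rle_encode(bits):
    """Run-length encode a binary sequence; the first run counts 1s
    (and has length 0 if the sequence starts with 0)."""
    runs, current, count = [], 1, 0
    for b in bits:
        if b == current:
            count += 1
        else:
            runs.append(count)
            current, count = 1 - current, 1
    runs.append(count)
    return runs

def rle_manhattan(runs_a, runs_b):
    """Manhattan (Hamming, for binary data) distance computed directly
    on two run-length encodings of equal-length patterns."""
    i = j = 0
    rem_a, rem_b = runs_a[0], runs_b[0]
    val_a = val_b = 1                  # even-indexed runs hold 1s
    dist = 0
    while i < len(runs_a) and j < len(runs_b):
        step = min(rem_a, rem_b)
        if val_a != val_b:
            dist += step               # mismatching segment contributes its length
        rem_a -= step
        rem_b -= step
        if rem_a == 0:
            i += 1
            if i < len(runs_a):
                rem_a, val_a = runs_a[i], 1 - val_a
        if rem_b == 0:
            j += 1
            if j < len(runs_b):
                rem_b, val_b = runs_b[j], 1 - val_b
    return dist

# the distance on compressed patterns equals the distance on the originals
a, b = [1, 1, 0, 1], [1, 0, 1, 1]
assert rle_manhattan(rle_encode(a), rle_encode(b)) == 2
```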
7.4.2.2 Frequent Features
Although Genetic Algorithms provide an optimal feature subset, it is interesting to explore whether the input set of features can be reduced by simpler means. The frequent-pattern approach, as discussed in Sects. 2.4.1 and 4.5.2, provides frequently used features, which could possibly help discrimination too. A binary-valued pattern can be considered as a transaction, with each item representing the presence or absence of a feature. The support of an item can be defined as the percentage of transactions in
the given database that contain the item. We make use of the concept of support
in identifying the feature set that is frequent above a chosen threshold. This results
in reduction in the number of features that need to be explored for an optimal set.
In Sect. 7.4.3, as part of preliminary analysis on the considered data, we demonstrate this aspect. Figure 7.3 demonstrates the concept of support. The support and
percentage-support are used equivalently in the present chapter.
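A small sketch of support-based feature filtering under the same assumptions as before (binary patterns in a NumPy array X); the threshold value in the usage comment is illustrative, since the chapter reports support both as a count and as a percentage:

```python
import numpy as np

def frequent_features(X, epsilon):
    """Indices of features whose support (fraction of patterns in
    which the binary feature equals 1) is at least epsilon."""
    support = X.mean(axis=0)          # empirical probability of each feature
    return np.where(support >= epsilon)[0]

# e.g., keep only features present in at least 1 % of the training patterns
# kept = frequent_features(X_train, 0.01)
```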
Algorithms 7.1 and 7.2 are studied in detail in the following sections.
Algorithm 7.2 (Optimal Feature Selection using Genetic Algorithms combined
with frequent features)
Step 1: Identify frequent features based on a chosen support threshold.
Fig. 7.3 The figure depicts the concepts of transaction, items, and support
Step 2: Consider only those frequent features for further exploration.
Step 3: All steps of Algorithm 7.1.
We briefly elaborate each of the steps along with results of preliminary analysis.
7.4.3 Preliminary Analysis
Preliminary analysis brings out insights into the data and forms the domain
knowledge. The analysis primarily consists of computation of measures of central
tendency and dispersion, feature occupancy of patterns, class-wise variability, and
inter-class similarities. The results of the analysis help in choosing appropriate parameters and forming the experimental setup.
We consider 10-class handwritten digit data consisting of 10,000 192-featured
patterns. Each digit is formed as a 16 × 12 matrix with binary-valued features. The
data is divided into three mutually exclusive sets for training, validation, and testing.
In order to find optimal feature selection, it is useful to understand the basic statistics on the number of nonzero features in the training data. Although care is taken
while forming the handwritten dataset in terms of centering and depicting all variations of the collected digits through the pattern matrix, it is possible that some regions within the 16 × 12 matrix are not fully utilized, depending on the class label. Figure 7.4 contains these details. The topmost plot depicts class-wise details of the average number of nonzero features. It can be seen that the digits 0, 2, and 8 contain about 68 nonzero features each, with digit 1 requiring the least number of nonzero features, about 30, for representation.
Fig. 7.4 Statistics of features in the training dataset
Fig. 7.5 The figure contains nine 3-featured patterns occupying different feature locations in a 3 × 3 pattern representation. It can be observed that all locations are occupied cumulatively at the end of 9 sample patterns
The middle plot of Fig. 7.4 indicates the standard deviation of the number of
nonzero features, indicating comparatively a larger dispersion for the digits 0, 2, 3, 5,
and 8. The third plot in the figure provides an interesting view of the occupancy of feature locations within a digit class. Considering the digit 0, although, on the average, 68 nonzero features suffice to represent the digit, about 175 of the 192 feature locations are occupied by one training pattern or another. Similar observations can be seen
for other digits too.
When the objective is to find an optimal subset of features, this provides a glimpse of the complexity involved. Figure 7.5 illustrates this point: although
the average number of features per pattern is small, all the feature locations can be
occupied at least once. We consider a pattern such as handwritten digit “1” in a 3 × 3
pattern-representation. The average number of features needed to represent the digit
is 3. It can be noted here that all the feature locations are occupied after passing
through 9 patterns.
Fig. 7.6 The figure contains patterns with features excluded at minimum support thresholds of 13, 21, 52, and 70. The excluded feature regions are depicted as the gray and black portions; the remaining region corresponds to the feature set retained for exploration
7.4.3.1 Redundancy of Features Vis-a-Vis Support
We make use of the concept of support, as discussed in Fig. 7.3 and Sect. 7.4.2.2 to
identify the features that occur above a prechosen support threshold. We compute
the empirical probability of each feature. We vary the support threshold to find the set of frequent features. We will later examine experimentally whether the excluded features have an impact on feature selection. Figure 7.6 contains an image of a 192-featured pattern with excluded features corresponding to various support thresholds. It should be noted that, in this case, the low-support features occur on the edges of the pattern. As the support threshold is increased, the representability of the pattern is affected.
7.4.3.2 Data Compression and Statistics
The considered patterns consist of binary-valued features. The data is compressed
using the run-length compression scheme discussed in Chap. 3. The scheme consists of the following steps.
• Consider each pattern.
• Form runs of continuous occurrence of each feature. For ease of dissimilarity
computation, consider each pattern as starting with a feature value of 1, so that
the first run corresponds to number of 1s. In case the first feature of the pattern
is 0, the corresponding length would be 0.
The compression results in an unequal number of runs for different patterns, as shown in
Fig. 7.7. The dissimilarity computation in the compressed domain is based on the
work in Chap. 3.
7.4.4 Experimental Results
Experimentation is planned to explore each of the parameters of Table 7.1 in order
to arrive at a minimal set of features that provides the best classification accuracy.
We initially study the choice of the probabilities of initialization, cross-over, and mutation based on a few generations of execution of the genetic algorithm.
Fig. 7.7 Statistics of runs in compressed patterns. For each class label, the vertical bar indicates
the range of number of runs in the patterns. For example, for class label “0”, the compressed
image length ranges from 34 to 67. The discontinuities indicate that there are no patterns that
have compressed lengths of 36 to 39. The figure provides range of compressed pattern lengths
corresponding to the original pattern length of 192 for all the patterns
Table 7.1 Terminology

Term   Description
C      Population size
t      No. of generations
l      Length of chromosome
PI     Probability of initialization
Pc     Probability of cross-over
Pm     Probability of mutation
ε      Support threshold
After choosing appropriate values of these three probabilities, we proceed with feature selection. We also bring out a comparison of computation time with and without compression. All the exercises are carried out with run-length-encoded nonlossy
compression, and classification is performed in the compressed domain directly.
7.4.4.1 Choice of Probabilities
In order to choose appropriate values for probabilities of cross-over, mutation, and
initialization, exercises are carried out using the proposed algorithm for 10–15 generations.
Fig. 7.8 Result of genetic algorithms after 10 generations on sensitivity of the probabilities of
initialization, cross-over, and mutation. The two plots in each case indicate the number of features
for best chromosome across 10 generations and the corresponding fitness value
For these exercises, we consider the complete set containing 192 features.
Figure 7.8 contains the results of these exercises. The objective of the study is to
obtain a subset of features that provides a reasonable classification accuracy.
Choice of Probability of Initialization (PI). A feature is included when the random number generated for it exceeds the probability of initialization (PI). As PI increases, the number of selected features reduces. When PI = 0, all features are considered for the study. The classification accuracy of the corresponding best-fit chromosome reduces as PI increases, since the representability reduces in view of the reduced number of features.
Fig. 7.9 The figure depicts the impact of the choice of PI. The X-axis represents the number of features, and the Y-axis represents the classification accuracy. Popsize (C) = 40, No. of generations (n) = 20, Pc = 0.99, Pm = 0.001, Gengap = 20 %. The figures are counted column-wise: for the figures in column 1, the values of PI are 0.2 and 0.3; in column 2, 0.4 and 0.5; and in column 3, 0.6. kNNC is used for classification
Based on the above results, PI is chosen as 0.2.
We present a novel data visualization to demonstrate the trend of results with
changing value of various parameters, say, ε. Here we consider all the fitness
values and plot them as a scatter plot. It forms a cloud of results. With varying
parameter value, the cloud changes its position in both X and Y axis directions.
Figure 7.9 indicates variation in the results with changing value of PI while
keeping the remaining parameter set constant. It can be seen from the figure that
with increase of probability of initialization, the average classification accuracy
changed from nearly 90 % to 82 % as the number of features varied from 160
to about 70. It should also be noted that the points disperse as the number of
selected features reduces.
Probability of Cross-over (Pc ). Pc is studied for the values between 0.8 and 0.99.
The recombination operator provides new offspring from two parent chromosomes. Pc is usually chosen to have a relatively high value, above 0.8. It
can be seen from Fig. 7.8 that as Pc increases, the classification accuracy improves. Interestingly, the corresponding number of features also reduces; Pc for
the study is chosen as 0.99.
Probability of Mutation (Pm ). It provides exploration by occasional flipping of the
allele. However, a higher value of Pm can lead to random behavior. It is studied for values from 0.0001 to 0.5. As Fig. 7.8 suggests, a steady increase in the classification accuracy is not assured as Pm increases.
Fig. 7.10 The figure contains the optimal feature sets represented in the pattern. The numbers of features preset (excluded) based on frequent-item support are 13, 21, and 52. The corresponding best feature sets, as shown above, provided classification accuracies of 88.3 %, 88.5 %, and 88.8 %, respectively, with validation data and 88.03 %, 87.3 %, and 87.97 % with test data. This is an example of small feature sets providing relatively high classification accuracy
There is no consistent trend in the number of features either. For the current study, Pm is chosen as 0.001. However, SSGA ensures the retention of a few highly fit individuals across generations.
7.4.4.2 Experiments with Complete Feature Set
The complete feature set of 192 features is considered for the experiments as the initial
set for optimal feature selection. The number of generations for each run of SSGA is
greater than 40. The best results of the exercises in terms of classification accuracy
(CA) are summarized below.
With the complete dataset and all 192 features, the CA values with the validation and test datasets
are 80.80 % and 90.34 %, respectively. With 175 features, the best CA of 90.85 % is
obtained with the validation dataset. The corresponding CA with test data is 90.40 %
with NNC and 91.60 % with kNNC with k = 5. It can be observed that this result is
better than the one obtained with the complete feature set. This emphasizes the fact that an optimal feature set that is a proper subset of the complete feature set can provide a higher CA. A similar observation can be made from Table 7.2 too.
7.4.4.3 Experiments with A Priori Excluded Features
Experiments are carried out starting with the possibly redundant features identified earlier through Fig. 7.6, which are used as a template for excluding those features across the entire dataset. We consider the first three cases, with
13, 21, and 52 features excluded to explore an optimal feature set that provides the
best classification accuracy. However, we allow mutations to take place even for the excluded features. The optimal feature sets obtained are presented in Fig. 7.10. Interestingly, these three cases respectively have feature sets of sizes 118, 104, and 93.
Fig. 7.11 Popsize (c) = 60, No. of generations (n) = 40, Pc = 0.99, Pm = 0.001, Gengap = 30,
PI = 0.1. The figures are counted column-wise. The support values and the corresponding number
of features are shown in parenthesis. Plots in column 1 correspond to 0(0), 0(0) with PI = 0.2,
0.001 % (23), and 0.003 % (38); in column 2, they are 0.004 % (47), 0.006 % (53), 0.01 % (60),
and 0.011 % (64). kNNC is the classifier
Table 7.2 Feature selection using GAs with minimum support-based feature exclusion

Minimum support   Classification accuracy   Classification accuracy   Optimal set   Feature
                  with validation data      with test data            of features   reduction
13                88.3 %                    88.0 %                    118           38.5 %
21                88.5 %                    87.1 %                    104           45.8 %
52                88.8 %                    88.0 %                    93            51.6 %
The corresponding classification accuracies with the validation dataset are 88.3 %, 88.5 %, and 88.8 %. The classification accuracies with the test dataset are 88.0 %, 87.13 %, and 88.0 %. The reduction in feature set sizes as compared to the original 192 features is significant: 38.5 %, 45.8 %, and 51.6 %, respectively. The results are summarized in
Table 7.2.
The impact of minimum support on obtaining a minimal feature set with the best classification accuracy is studied. For each considered value of minimum support, features having support less than the chosen threshold are excluded from both the training and test data. The search for the best set is carried out with the remaining features. Figures 7.11 and 7.12 contain the variation in classification accuracy across all generations of fitness evaluations for varying support values. In both figures, the images are arranged column-wise.
Fig. 7.12 Cases correspond to feature selection with kNNC classifier. Popsize (c) = 60, No. of
generations n = 40, Pc = 0.99, Pm = 0.001, Gengap = 30, PI = 0.1. The figures are counted
column-wise. For figures in column 1, the support values and corresponding number of features
are 0.017 % (71), 0.019 % (76), 0.032 % (81), 0.045 % (85), and in column 2, they are 0.055 %
(91), 0.06 % (95), 0.064 % (102). kNNC is the classifier
In Fig. 7.11, the image in column 1 corresponds to all the features. From the figures and the above analysis, the following observations and inferences can be drawn.
• Increasing the minimum support leads to the exclusion of an increasing number of features.
• It can be noted from both figures that, as the number of features reduces, the cloud of results remains nearly invariant with respect to classification accuracy (along the Y-axis) up to a reduction of 85 features; the classification accuracy remains around 90 %. Beyond that, the classification accuracy is affected, although not significantly, till the reduction reaches 102 features. Subsequently, the reduction in accuracy is drastic.
• The number of optimal features that provides good classification accuracy shows a significant reduction with increasing support value: it goes from 155–180 for complete feature-set exploration down to 60–80 when the reduction reaches 102 features.
• Interestingly, the results shown in the figures indicate that there are a significant number of redundant features that do not really contribute to the discrimination of patterns.
• Frequent patterns help feature reduction; equivalently, by increasing the support we tend to exclude less discriminative features.
• The pressure on the random search to explore the best features reduces with increasing support.
Table 7.3 Improvement in CPU time due to the proposed algorithm on an Intel Core-2 Duo processor

Nature of data            CPU time
With uncompressed data    11,428.849 sec
With compressed data      6495.940 sec
7.4.4.4 Impact of Compressed Pattern Classification
Data compression using run lengths, as proposed in Chap. 3, is nonlossy, and this was also shown theoretically. The experiments are repeated, and it is found that the classification accuracy remains the same.
Another important aspect is the improvement in CPU time. The CPU times taken with the compressed and the original datasets after 16 generations of SSGA are compared. It is found that, on an Intel Core-2 Duo processor, processing the compressed data improves the CPU time to the tune of 43 %. The times are provided in Table 7.3.
7.4.5 Summary
Feature selection aims to achieve a certain objective, such as (a) optimizing an evaluation measure like the classification accuracy on unseen patterns, (b) satisfying a certain restriction on the evaluation measure, or (c) finding the best trade-off between the size of the feature subset and the value of its evaluation measure. When the size of the initial feature set exceeds 20, the process forms a large-scale feature selection problem. With d features, the search space has size 2^d. The problem becomes further complex when (a) the number of patterns is huge, (b) the data contains multicategory patterns, and (c) the number of
features is much larger than 20. We provided an overview of feature selection and
feature extraction methods. We presented a case study on optimal feature selection
using genetic algorithms along with providing a discussion on genetic algorithms.
In the case study, we focused on efficient methods of large-scale feature selection
that provide a significant improvement in the computation time while providing the
classification accuracy at least as good as that obtained with the complete feature set. The proposed methods are applied to feature selection on large datasets and demonstrate that the computation time improves by almost 50 % as compared to the conventional approach.
We integrate the following aspects in the current work.
• Feature selection of high-dimensional large datasets using Genetic Algorithms.
• Domain knowledge of the data under study obtained through preliminary analysis.
• Run-length compression of data.
• Classification of compressed data directly in the compressed domain.
Further, from the discussions it is clear that:
• Floating sequential selection schemes permit deletion of already added features
in the forward search; so they perform better.
• Mutual Information and Fisher’s score are important and popular in filter selection.
• PCA is superior to Random Projections; NMF can get stuck in a locally optimal
solution.
• Genetic Algorithms combined with frequent features lead to significant reduction
in the number of features and also improve the computation time.
7.5 Bibliographical Notes
Duda et al. (2001) provide an overview of feature selection. A good discussion on
the design and applicability of distance functions in high-dimensional spaces can be
found in Hsu and Chen (2009). Efficient and effective floating sequential schemes
for feature selection are discussed in Pudil et al. (1994) and Somol et al. (1999). Various schemes including the ones based on Fisher’s score and Mutual Information are
considered by Punya Murthy and Narasimha Murty (2012). A good introduction to
NMF is provided by Lee and Seung (1999). An authoritative coverage on Random
Projections is given by Menon (2007). A well-known reference on PCA is given
by Jolliffe (1986). Cover and Van Camenhout (1977) contains demonstration of the
need for exhaustive search for optimal feature selection. Goldberg (1989), Davis and
Mitchell (1991), and Man et al. (1996) provide a detailed account of genetic algorithms, including issues in implementation. Siedlecki and Sklansky (1989) demonstrate superiority of solution using Genetic Algorithms as compared to an exhaustive search, sequential search and branch-bound with the help of a 30-dimensional
dataset. Several variants of Genetic Algorithms are used for feature selection on
different types of data, such as works by Siedlecki and Sklansky (1989), Yang and
Honavar (1998), Kimura et al. (2009), Punch et al. (1993), Raymer et al. (1997),
etc. Raymer et al. (2000) focus on feature extraction for dimensionality reduction
using genetic algorithms. Oliveira et al. (2001) demonstrate feature selection using
a simple genetic algorithm and an iterative genetic algorithm. Greenhalgh and Marshall
(2000) discuss convergence criteria for genetic algorithms. Comparison of genetic
algorithm-based prototype selection schemes was provided by Ravindra Babu and
Narasimha Murty (2001). Raymer et al. (1997) and Ravindra Babu et al. (2005)
demonstrate simultaneous selection of feature and prototypes. Run-length-encoded
compression and dissimilarity computation in the compressed domain are provided
in Ravindra Babu et al. (2007). A utility of frequent item support for feature selection was demonstrated in Ravindra Babu et al. (2005). Cheng et al. (2007) argue that
frequent features help discrimination.
References
H. Cheng, X. Yan, J. Han, C.-W. Hsu, Discriminative frequent pattern analysis for effective classification, in 23rd Intl. Conf. for Data Engineering (2007), pp. 525–716
T.M. Cover, J.M. Van Camenhout, On the possible orderings in the measurement selection problem. IEEE Trans. Syst. Man Cybern. 7(9), 657–661 (1977)
L.D. Davis, M. Mitchell, Handbook of Genetic Algorithms (Van Nostrand Reinhold, New York,
1991)
R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification (Wiley, New York, 2001)
D.E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning (Addison-Wesley, Reading, 1989)
D. Greenhalgh, S. Marshall, Convergence criteria for genetic algorithms. SIAM J. Comput. 30(1), 269–282 (2000)
C.-M. Hsu, M.-S. Chen, On the design and applicability of distance functions in high-dimensional
data space. IEEE Trans. Knowl. Data Eng. 21, 523–536 (2009)
I.T. Jolliffe, Principal Component Analysis (Springer, New York, 1986)
Y. Kimura, A. Suzuki, K. Odaka, Feature selection for character recognition using genetic algorithm, in Fourth Intl. Conf. on Innovative Computing, Information and Control (ICICIC) (2009),
pp. 401–404
D.D. Lee, H. Seung, Learning the parts of objects by non-negative matrix factorization. Nature
401, 788–791 (1999)
K.F. Man, K.S. Tang, S. Kwong, Genetic algorithms: concepts. IEEE Trans. Ind. Electron. 43(5),
519–534 (1996)
A.K. Menon, Random projections and applications to dimensionality reduction. B.Sc. (Hons.) Thesis, School of Info. Technologies, University of Sydney, Australia (2007)
L.S. Oliveira, N. Benahmed, R. Sabourin, F. Bortolozzi, C.Y. Suen, Feature subset selection using
genetic algorithms for handwritten digit recognition, in Computer Graphics and Image Processing (2001), pp. 362–369
P. Pudil, J. Novovicova, J. Kittler, Floating search methods in feature selection. Pattern Recognit.
Lett. 15, 1119–1125 (1994)
W.F. Punch, E.D. Goodman, M. Pei, L.C. Shun, P. Hovland, R. Enbody, Further research on feature
selection and classification using genetic algorithms, in ICGA (1993), pp. 557–564
C. Punya Murthy, M. Narasimha Murty, Discriminative feature selection for document classification, in Proceedings of ICONIP (2012)
T. Ravindra Babu, M. Narasimha Murty, V.K. Agrawal, On simultaneous selection of prototypes
and features in large data pattern recognition, in LNCS, vol. 3776 (Springer, Berlin, 2005), pp.
595–600
T. Ravindra Babu, M. Narasimha Murty, Comparison of genetic algorithm based prototype selection schemes. Pattern Recognit. 34(2), 523–525 (2001)
T. Ravindra Babu, M. Narasimha Murty, V.K. Agrawal, Classification of run-length encoded binary
data. Pattern Recognit. 40(1), 321–323 (2007)
M.L. Raymer, W.F. Punch, E.D. Goodman, P.C. Sanschagrin, L.A. Kuhn, Simultaneous feature
extraction and selection using a masking genetic algorithm, in Proc. 7th Intl. Conf. on Genetic
Algorithms (1997)
M.L. Raymer, W.F. Punch, E.D. Goodman, L.A. Kuhn, A.K. Jain, Dimensionality reduction using
genetic algorithms. IEEE Trans. Evol. Comput. 4(2), 164–171 (2000)
W. Siedlecki, J. Sklansky, A note on genetic algorithms for large-scale feature selection. Pattern
Recognit. Lett. 10, 335–347 (1989)
P. Somol, P. Pudil, J. Novovicova, P. Paclik, Adaptive floating search methods in feature selection.
Pattern Recognit. Lett. 20, 1157–1163 (1999)
J. Yang, V. Honavar, Feature subset selection using a genetic algorithm. Intell. Syst. Appl. 13(2),
44–49 (1998)
Chapter 8
Big Data Abstraction Through Multiagent
Systems
8.1 Introduction
Big Data is proving to be a new paradigm, after data mining, in large or massive data analytics. With the increasing ability to store large volumes of data every second, the need to make sense of the data for summarization and business exploitation is steadily increasing. The data emanates from customer records, pervasive sensors, the tendency to keep every data item for potential subsequent analysis, security paranoia, etc. The Big Data theme is gaining importance especially because large volumes of data in a variety of formats are found to be related and need to be processed in conjunction with each other. Large databases, which are conventionally built on predefined schemas, are not directly usable. There are arguments in the literature both for and against the use of the Map-Reduce algorithm as compared to massively parallel databases; such databases are built by many commercial players.
Agent-mining interaction is gaining importance in the research community for solving massive data problems in a divide-and-conquer manner. The interaction is mutual: agents drive data mining, and data mining in turn supports agents. We discuss these issues in more detail in this chapter.
We propose to solve Big Data analytics problems through multiagent systems.
We propose a few problem-solving schemes. In Sect. 8.2, we provide an overview of Big Data and the challenges it offers to the research community. Section 8.3 discusses large data problems as solved by conventional systems. Section 8.4 contains a discussion on the overlap between Big Data and data mining. A discussion on multiagent
systems is provided in Sect. 8.5. Section 8.6 contains proposed multiagent systems
for abstraction generation with Big Data.
8.2 Big Data
Big data is marked by voluminous heterogeneous datasets that need to be accessed
and processed in real time to generate abstraction. Such an abstraction is valuable
for scientific or business decisions depending on the nature of the data. These attributes are conventionally termed the three v's, namely volume, velocity, and variety. Some experts add an additional v, known as value. Big data analytics has also led to a new interdisciplinary topic, called data science, which combines statistics, machine learning, natural language processing, visualization, and data mining. Terms associated with data science are data products and data services.
The need for Big Data analytics or abstraction arose from the increasing ability to sense and store data, the omnipresence of data, and the ability to see the business potential of such datasets. Some examples are the trails of data that one leaves while browsing web pages, tweeting opinions, using social media channels, or visiting multiple stores to purchase varieties of items; scientific data such as genome sequencing, astronomy, oceanography, and clinical data; and applications such as drug re-purposing. Researchers propose the MAD (Magnetic, Agile, and Deep) analysis practice for Big Data and self-tuning systems such as Starfish for Hadoop, a popular big data system. These scenarios lead to a demand for increasing agility in data access and processing and to the need to accept multiple data sources and generate sophisticated analytics. The need for such analytics in turn calls for the development and use of more efficient machine learning algorithms and statistical analyses that integrate parallel processing of data. Some conventional pattern recognition algorithms and statistical methods need to be strengthened in these directions.
The Map-Reduce algorithm and its variants play a pivotal role in Big Data applications.
8.3 Conventional Massive Data Systems
Conventionally, an “Enterprise Data Warehouse (EDW)” is the source for large data analysis. Business intelligence software bases its analysis on this data and generates insights by querying the EDW. The EDW system is a centralized data resource for analytics. The EDW is marked by systematic data integration with a well-defined schema, permitting only predefined structures of data for storage and analysis. This should be contrasted with heterogeneous datasets such as unstructured data of clicks, text such as twitter messages, images, voice data, etc., semi-structured data such as xml or rss-feeds, and combinations of them. Parallel SQL database management systems (DBMS) provide a solution for large data systems. For the sake of completeness, to name a few, some commercial parallel DBMS systems are Teradata, Aster Data, Netezza, Vertica, Oracle, etc.
8.3.1 Map-Reduce
The concept of Map-Reduce revolutionized the ability to carry out large computations on computing clusters instead of a single supercomputer. The concept makes use of the divide-and-conquer approach in an abstract sense. The system consists of multiple simple computing elements, termed compute nodes, networked together through a gigabit network. A Distributed File System (DFS) is used to suit cluster computing. Some examples of operational distributed file systems are the Google File System (GFS), the Hadoop Distributed File System (HDFS), and CloudStore. A conceptual Map-Reduce system is shown in Fig. 8.1. As depicted, a Map-Reduce system takes an input task that is divided into multiple Map tasks. The tasks in turn lead to output tasks through an intermediate processing stage. The system is under the control of a master controller, which ensures an optimal allocation of tasks and fault tolerance. The entire activity is under the control of a user program.

Fig. 8.1 Map-reduce system. The figure depicts the broad stages of the input task and the phases of map, reduce, and output. The activity is under the control of a master controller. The user has overall control of the programming system
The system achieves programming parallelism in solving complex problems.
Such a computing system is designed to take care of hardware failures at any stage
of computation, either at the level of computing nodes or at the control stage. Multiple extensions of such systems have emerged over a period of time. It should, however, be noted that Map-Reduce systems are suited for large datasets that are not frequently modified. The bibliographic notes contain a brief discussion on some epoch-making
contributions in these directions.
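To make the map and reduce phases concrete, the following is a minimal, single-machine sketch of the Map-Reduce idea on a word-count task; the function names and the in-memory grouping step that stands in for the shuffle are illustrative assumptions and not the API of any particular system.

```python
from collections import defaultdict

def map_phase(document):
    # Map task: emit (key, value) pairs; here, (word, 1) for every word.
    for word in document.split():
        yield word.lower(), 1

def reduce_phase(key, values):
    # Reduce task: aggregate all values that share the same key.
    return key, sum(values)

def map_reduce(documents):
    # A master controller would distribute map tasks across compute nodes;
    # here the "shuffle" that groups intermediate pairs by key is an in-memory dict.
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in map_phase(doc):
            grouped[key].append(value)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

if __name__ == "__main__":
    docs = ["big data needs map reduce", "map tasks feed reduce tasks"]
    print(map_reduce(docs))   # e.g. {'big': 1, 'map': 2, 'reduce': 2, 'tasks': 2, ...}
```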
8.3.2 PageRank
An epoch-making contribution to evaluating the relative importance of web pages returned for a search query is the PageRank scheme. The PageRank is a real number between 0 and 1; the higher the value, the higher the relevance of the page. PageRank is determined by simulated random web surfers who execute random walks on the web pages, coming across certain nodes more often than others. Intuitively, PageRank considers the pages visited more often as more relevant. However, in order to circumvent deliberate attempts to spam with terms and so make PageRank invalid, the relevance of a page is judged not only by the content of the page but also by the terms in or near the links directed to the page.
Implementation of the PageRank scheme takes care of large-scale representation of the transition matrix of the web; efficient alternatives for the matrix–vector multiplication, including the use of Map-Reduce; and methods to handle dead ends and spider traps, the latter through taxation. A dead end is a web page that has no outgoing links. It affects the PageRank computation, driving the value toward zero for dead-end pages and also for the few pages that link only to dead ends. This is handled by recursively removing the nodes that are dead ends as the computation progresses. A spider trap is a condition where web pages within a finite set of nodes link only to each other, without outlinks, so that the PageRank computation concentrates on those finite nodes only. This condition is handled through taxation: a parameter that lies between 0 and 1 gives a random surfer a small probability of leaving the web, while an equivalent number of new random surfers is introduced. Some of the relevant terminology includes topic-sensitive PageRank, biased random walks, and spam farms. Topic-sensitive PageRank is, essentially, PageRank biased toward a set of web pages, known as a teleport set, to suit the user's interest through biased random walks. Link spam refers to a deliberate, unethical effort to increase the PageRank of certain pages. This is tackled by TrustRank, which is designed to lower the rank of spam pages, and by spam mass, a relative rank measure used to identify possible spam pages.
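The taxation idea can be made concrete with a small sketch. The following computes PageRank by power iteration with a taxation (teleport) parameter beta; the four-page link structure is a made-up example, and a production implementation would use the sparse, possibly Map-Reduce based, matrix–vector multiplication mentioned above.

```python
import numpy as np

def pagerank(links, beta=0.85, iterations=50):
    """links[j] lists the pages that page j points to; beta is the taxation parameter."""
    n = len(links)
    # Column-stochastic transition matrix M: M[i, j] = 1/out-degree(j) if j links to i.
    M = np.zeros((n, n))
    for j, outs in enumerate(links):
        for i in outs:
            M[i, j] = 1.0 / len(outs)
    v = np.full(n, 1.0 / n)            # start with a uniform surfer distribution
    for _ in range(iterations):
        # With probability beta follow a link; with probability 1 - beta teleport,
        # which keeps spider traps from absorbing all of the PageRank.
        v = beta * (M @ v) + (1 - beta) / n
    return v

if __name__ == "__main__":
    # Hypothetical 4-page web; every page has at least one outlink (no dead ends).
    links = [[1, 2], [0, 2], [0, 3], [2, 3]]
    print(pagerank(links))
```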
Apart from the use of the PageRank algorithm, each search engine should use a proprietary set of parameters, including weighting parameters, to optimize its performance and query relevance.
8.4 Big Data and Data Mining
Big Data is marked by the need for accessing voluminous, multiple types of datasets,
and processing them in real or near-real time. Underlying the entire activity are data
mining and statistical methods, especially in dealing with large datasets in terms
of summarization and visualization, ability to process or generate abstraction in
real time, and integrating heterogeneous datasets. In the current work, we are not
focusing on other important research areas of Big Data such as parallel processing,
Map-Reduce, distributed systems, query processing, etc.
Thus, Big Data offers newer challenges in the above terms to data mining approaches. The formal interaction between Big Data and Data Mining is beginning
to develop into areas such as mining massive datasets.
8.5 Multiagent Systems
Agents refer to computational entities that are autonomous, understand the environment, interact with other agents or humans, and act reactively and proactively in achieving an objective. Agents are termed intelligent when they can achieve the objective by optimizing their own performance given the environment and objective. When more than one agent is involved in accomplishing a task with all the previously discussed attributes, we call such a system a multiagent system.
Example 8.1 An example of agents is footballers playing on a field. With the common objective of scoring a goal against the opposition, each of the players acts autonomously, collaborates, and proactively and reactively tackles the ball to seize the initiative in achieving the objective.
Example 8.2 A face detection system can be designed as a multiagent system. Face detection can be defined as detecting a human face in a given image. Some of the challenges faced by the activity are background clutter, illumination variation, background matching the skin color of a person, partial occlusion, pose, etc. A multiagent face detection system consists of agents, each capable of carrying out its activity autonomously and sharing its outcome with the other agents. For example, an agent carrying out skin-color detection shares the regions containing skin and skin-color-like artifacts. A second agent may detect the size and rotation of the face about the axis coming out of the image plane through ellipse fitting. A third agent carries out template matching of the face in the given region. A combiner agent combines the results to finally localize the face.
Data mining and multiagent systems are both interdisciplinary. Multiagent systems encompass multiple disciplines such as artificial intelligence, sociology, and philosophy. With recent developments, they include many other disciplines, among them data mining.
8.5.1 Agent Mining Interaction
With clearly defined behavior for each agent, multiagent systems are ideally suited for data mining and big data applications. Suppose that an algorithm that computes prototypes takes polynomial time, say O(n^k) for n patterns. Given a dataset, we assign the task to an agent. The time taken to generate prototypes from the entire dataset by a single agent is much larger than that taken after dividing the dataset into p subsets of sizes n1, n2, ..., np and assigning each subset to an autonomous agent. In other words, O((n1 + n2 + ··· + np)^k) > O(n1^k + n2^k + ··· + np^k) for k > 1. This is a case of agents supporting data mining. Alternatively, clustering of agents is an example of data mining supporting agents. The literature is replete with examples of both these aspects of agent-mining interaction.
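The gain implied by this inequality is easy to quantify, as in the following sketch; the values p = 10 equal subsets and k = 2 are assumptions chosen purely for illustration, and the sketch ignores the cost of combining the prototypes produced by the individual agents.

```python
def single_agent_cost(sizes, k):
    # One agent processes the whole dataset: O((n1 + ... + np)^k)
    return sum(sizes) ** k

def multi_agent_cost(sizes, k):
    # p autonomous agents, one per subset: O(n1^k + ... + np^k)
    return sum(n ** k for n in sizes)

if __name__ == "__main__":
    sizes = [10_000] * 10     # assumed: p = 10 subsets of 10,000 patterns each
    k = 2                     # assumed: quadratic prototype-selection algorithm
    print(single_agent_cost(sizes, k))   # 10,000,000,000
    print(multi_agent_cost(sizes, k))    # 1,000,000,000, a tenfold reduction
```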
The agent mining interaction can take place at many levels such as interface,
performance, social, infrastructure, etc.
8.5.2 Big Data Analytics
Analytics with Big Data is equivalently called Big Data analytics, Advanced Analytics, Exploratory Analytics, or Discovery Analytics. The business literature uses these terms synonymously.
The challenges in big data analytics are data sizes reaching exabytes, data available in a distributed manner as against centralized data sources, semi-structured and unstructured datasets, streaming data, flat data schemes as compared to pre-defined models, complex schemas containing inter-relationships, near-real-time and batch processing requirements, less dependence on SQL, continuous data updates, etc. The analysis methods that are required to be suitably improved for massive parallel processing are multiagent systems, data mining methods, statistical methods, large data visualization, natural language processing, text mining, graph methods, instantiation approaches to streaming data, etc. Data preprocessing challenges include integration of multiple data types, integrity checks, outlier handling, and missing data issues. Commercial implementations of big data analytics will have to integrate cloud services and the Map-Reduce paradigm.
8.6 Proposed Multiagent Systems
Multiagent systems are suitable for distributed data mining applications. We provide a divide-and-conquer approach to generate abstraction in big data. We provide a few examples of such systems for generating abstraction on large data. The proposed schemes relate to data reduction in terms of identifying representative patterns, reduction in the number of attributes/features, analytics in large datasets, heterogeneous dataset access and integration, and agile data processing.
The schemes are practical and have been implemented earlier. We briefly discuss results for some of the schemes.
8.6.1 Multiagent System for Data Reduction
In massive datasets, the need for reducing the data for further analysis and inference is pivotal. However, the nature of the data in such heterogeneous datasets need not be uniform across the datasets. Some datasets could inherently form clusters of hyper-spherical nature, some could be curvilinear in high dimensions, etc. A single clustering algorithm alone would not be able to capture representative patterns in each such case. For example, for dataset 1 we use partitional clustering method-1, for dataset 2 partitional clustering method-2, for dataset 3 the hierarchical clustering method, etc., as those methods are best suited to the nature of the datasets. Figure 8.2 contains a proposed scheme for a multiagent system for data reduction. The proposed method addresses each of the three v's, viz., volume, variety, and velocity of big data.

Fig. 8.2 Multiagent system for prototype selection in big data. In the figure, each clustering agent corresponds to a different clustering algorithm

In the figure, we indicate different clustering algorithms to access the datasets. It should be noted here that, based on a preliminary analysis of a sample of each dataset, an appropriate clustering algorithm is chosen for prototype selection for the corresponding dataset. The evaluation of the selected prototypes is carried out by an evaluation agent for each combination of dataset and clustering algorithm. An example of an evaluation agent is classification of a test dataset, which is a subset independent of the training dataset.
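A minimal sketch of the scheme of Fig. 8.2 follows. Here one clustering agent uses leader clustering with its own distance threshold, and the evaluation agent uses nearest-neighbor classification on an independent subset; the two-class Gaussian data, the threshold, and the agent classes are assumptions for illustration rather than a prescribed implementation.

```python
import numpy as np

class ClusteringAgent:
    """Selects prototypes from its own dataset using leader clustering."""
    def __init__(self, threshold):
        self.threshold = threshold

    def select_prototypes(self, data):
        leaders = []
        for x in data:
            # A pattern becomes a new leader if it is far from all current leaders.
            if all(np.linalg.norm(x - l) > self.threshold for l in leaders):
                leaders.append(x)
        return np.array(leaders)

class EvaluationAgent:
    """Evaluates prototypes by 1-NN classification of an independent test subset."""
    def accuracy(self, prototypes, proto_labels, test, test_labels):
        pred = [proto_labels[np.argmin(np.linalg.norm(prototypes - t, axis=1))]
                for t in test]
        return np.mean(np.array(pred) == test_labels)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Assumed toy dataset: two Gaussian classes in 2-D.
    X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(4, 1, (200, 2))])
    y = np.array([0] * 200 + [1] * 200)
    idx = rng.permutation(400)
    X, y = X[idx], y[idx]
    train, test = slice(0, 300), slice(300, 400)

    agent = ClusteringAgent(threshold=1.5)
    protos = agent.select_prototypes(X[train])
    # Label each prototype by its nearest training pattern (a leader is itself a training pattern).
    labels = np.array([y[train][np.argmin(np.linalg.norm(X[train] - p, axis=1))]
                       for p in protos])
    print(EvaluationAgent().accuracy(protos, labels, X[test], y[test]))
```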
8.6.2 Multiagent System for Attribute Reduction
We use the term attribute reduction synonymously with feature reduction. Here again, the methods of feature selection or feature extraction depend on the nature of the data. The scheme is similar to the one discussed in Fig. 8.2, where the clustering agent is replaced by a feature reduction or feature extraction agent.

Fig. 8.3 KDD framework for data abstraction. Multiple activities encompass each box. The dotted line is further expanded separately

Alternatively, feature selection and reduction can be achieved sequentially by the addition of another set of agents at a layer below the clustering agents in the figure.
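As an illustration of such agents, the sketch below pairs two hypothetical feature-reduction agents, a variance-based filter agent and a PCA-based extraction agent, each exposing the same reduce interface so that an evaluation agent of the kind sketched in Sect. 8.6.1 could compare them; the methods and the number of retained features are assumptions for illustration.

```python
import numpy as np

class FilterAgent:
    """Feature-selection agent: keeps the m features with the largest variance."""
    def __init__(self, m):
        self.m = m
    def reduce(self, X):
        keep = np.argsort(X.var(axis=0))[-self.m:]
        return X[:, keep]

class PCAAgent:
    """Feature-extraction agent: projects the data onto its top m principal components."""
    def __init__(self, m):
        self.m = m
    def reduce(self, X):
        Xc = X - X.mean(axis=0)
        # Right singular vectors of the centered data, largest singular values first.
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        return Xc @ Vt[: self.m].T

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 10))   # assumed toy data: 100 patterns, 10 features
    for agent in (FilterAgent(m=3), PCAAgent(m=3)):
        print(type(agent).__name__, agent.reduce(X).shape)   # both give (100, 3)
```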
8.6.3 Multiagent System for Heterogeneous Data Access
One major objective of Big Data is the ability to access and process multiple types of data, such as text messages, numerical and categorical data, images, audio messages, etc., and to integrate them for further use such as generating business intelligence from them. It is an acknowledged fact that data access from different formats consumes a significant amount of time for an experimental researcher. For an operational system, it is always advantageous to have such a multiagent system in place. Given that each of these heterogeneous datasets relates to the same theme, the participating agents need to interact with each other and share information among themselves. Figure 8.3 contains data analytics in the conventional Knowledge Discovery from Databases (KDD) framework. The figure contains the three broad stages of the KDD process. The first block contains the substages of data access, data selection, and generation of target data for preprocessing, where preprocessing each data type includes cleansing. The second block corresponds to the substages of data transformation that make the data amenable for further processing. The third block corresponds to the application of machine learning, statistics, and data mining algorithms that generate the final data abstraction.

Fig. 8.4 Multiagent system for data access and preprocessing. The objective is to provide a framework where different streams of data are accessed and preprocessed by autonomous agents, which also cooperate with fellow agents in generating integrated data. The data thus provided is further processed to make it amenable for application of data mining algorithms

Figure 8.4 contains a multiagent system for heterogeneous data processing. The system is depicted in three layers. Layer 1 contains different data streams that are processed by autonomous agents. Four types are shown to indicate the variety of datasets. Many other data types, such as semi-structured data following xml-like standards, are assumed to be represented in this layer. In layer 2, the processing methods depend on the data type and the inherent characteristics of the data; the methods are data-dependent. While processing the data, the agents cooperate with each other. Although in the figure the horizontal arrows indicate exchange of information between agents adjacent to each other, the exchange happens among all the agents; they are depicted as shown for brevity. The preprocessed information is then aggregated by another agent, which makes it amenable for further processing.
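The layered organization of Fig. 8.4 can be sketched as follows. The stream-specific agents and the aggregator are hypothetical: each layer-2 agent preprocesses one data type into a common dictionary form, and the layer-3 aggregator merges the outputs that share an identifier, standing in for the cooperation indicated by the horizontal arrows.

```python
class TextAgent:
    # Layer-2 agent for text streams: lower-case and tokenize a message.
    def preprocess(self, record):
        return {"id": record["id"], "tokens": record["text"].lower().split()}

class NumericAgent:
    # Layer-2 agent for numeric streams: scale readings to the unit interval.
    def preprocess(self, record):
        values = record["values"]
        hi = max(values) or 1.0
        return {"id": record["id"], "scaled": [v / hi for v in values]}

class AggregatorAgent:
    # Layer-3 agent: merges the preprocessed outputs that refer to the same entity.
    def integrate(self, outputs):
        merged = {}
        for out in outputs:
            merged.setdefault(out["id"], {}).update(out)
        return merged

if __name__ == "__main__":
    text_stream = [{"id": 7, "text": "Service outage reported"}]
    numeric_stream = [{"id": 7, "values": [3.0, 9.0, 6.0]}]
    outputs = [TextAgent().preprocess(r) for r in text_stream]
    outputs += [NumericAgent().preprocess(r) for r in numeric_stream]
    print(AggregatorAgent().integrate(outputs))
```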
8.6.4 Multiagent System for Agile Processing
The proposed system for agile processing is part of the Data Mining block of Fig. 8.3. The system corresponds to the velocity aspect of Big Data. The processing in big data can be real-time, near-real-time, or batch processing. We briefly discuss some of the options for such processing. The need for agility is emphasized in view of large volumes of data, where conventional schemes may not provide the insights at such speeds. The following are some such options.
• Pattern clustering to reduce the dataset meaningfully through some validation and operate only on such a reduced set to generate an abstraction of the entire data.
• Focus on important attributes by removing redundant features.
• Compress the data in some form and operate directly on such compressed datasets, as sketched after this list.
• Improve the efficiency of algorithms through massive parallel processing and Map-Reduce algorithms.
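For the compressed-domain option mentioned in the list, a minimal sketch follows: binary patterns are stored as run lengths of alternating values (starting with 0s), and the Hamming distance between two patterns is computed directly on the run-length representation without decompressing. The encoding convention is an assumption made here for illustration and is not necessarily the exact scheme used in the earlier chapters.

```python
def run_length_encode(bits):
    # Store a binary pattern as run lengths of alternating values, starting with 0s.
    runs, current, count = [], 0, 0
    for b in bits:
        if b == current:
            count += 1
        else:
            runs.append(count)
            current, count = b, 1
    runs.append(count)
    return runs

def hamming_compressed(runs_a, runs_b):
    # Walk both run lists in step; whenever the current bit values differ,
    # the overlapping stretch contributes its full length to the distance.
    i = j = 0
    rem_a, rem_b = runs_a[0], runs_b[0]
    bit_a = bit_b = 0
    dist = 0
    while i < len(runs_a) and j < len(runs_b):
        step = min(rem_a, rem_b)
        if bit_a != bit_b:
            dist += step
        rem_a -= step
        rem_b -= step
        if rem_a == 0:
            i += 1
            if i < len(runs_a):
                rem_a, bit_a = runs_a[i], 1 - bit_a
        if rem_b == 0:
            j += 1
            if j < len(runs_b):
                rem_b, bit_b = runs_b[j], 1 - bit_b
    return dist

if __name__ == "__main__":
    x = [0, 0, 1, 1, 1, 0, 1, 0]
    y = [0, 1, 1, 0, 1, 0, 1, 1]
    print(hamming_compressed(run_length_encode(x), run_length_encode(y)))  # 3
    print(sum(a != b for a, b in zip(x, y)))                               # 3, as a check
```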
8.7 Summary
In the present chapter, we discussed the big data paradigm and its relationship with data mining. We discussed related terminology such as agents, multiagent systems, massively parallel databases, etc. We proposed to solve big data problems using multiagent systems and provided a few cases of such systems. The systems are indicative.
8.8 Bibliographic Notes
Big Data has been emerging as a research and scientific topic in the peer-reviewed literature in recent years. Cohen et al. (2009) discuss new practices for Big Data analysis, called magnetic, agile, and deep (MAD) analysis. The authors contrast the big data scenario with the Enterprise Data Warehouse and bring out many insights into new practices for big data analytics. Loukides (2011) discusses data science and related topics in the context of Big Data. Russom (2011) provides an overview of Big Data Analytics based on a survey of industry practitioners and discusses current and recommended best practices. Zikopoulos et al. (2011) provide a useful discussion and insights into big data terminology. Halevi et al. (2012) provide an overview of various aspects of big data and its trends. There are multiple commercial big data systems such as Hadoop. Dean and Ghemawat (2004) provide the Map-Reduce algorithm. An insightful discussion on the Map-Reduce algorithm and PageRank can be found in Rajaraman and Ullman (2012). The PageRank scheme was originally discussed by Brin and Page (1998) and Page et al. (1999). An insightful discussion on PageRank computation can be found in Manning et al. (2008); the work also makes interesting comments on the limitations of data mining. A discussion by Herodotou et al. (2011) on a proposal for automatic tuning of Hadoop provides insights into challenges in big data systems. A case for parallel database management systems (DBMS) against Map-Reduce for large-scale data analysis is made in the work by Pavlo et al. (2009). A contrasting view on the superiority of Map-Reduce over parallel DBMS is provided by Abouzeid et al. (2009). Patil (2012) discusses data products and data science aspects in his work. Weiss (2000) provides an extensive overview of multiagent systems; the edited work contains theoretical and practical aspects of multiagent systems. Ferber (1999) provides an in-depth account of various characteristics of multiagent systems. Cao et al. (2007) discuss agent-mining integration for financial services. Tozicka et al. (2007) suggest a framework for agent-based machine learning and data mining. A proposal and implementation of a multiagent system as a divide-and-conquer approach for large data clustering and feature selection are provided by Ravindra Babu et al. (2007). Ravindra Babu et al. (2010) propose a large-data clustering scheme for data mining applications. Agogino and Tumer (2006) and Tozicka et al. (2007) form examples of agents supporting data mining. Gurruzzo and Rosaci (2008) and Wooldridge and Jennings (1994) form examples of data mining supporting agents. Fayyad et al. (1996) provide an early overview of Data Mining.
References
A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Silberschatz, A. Rasin, HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads, in VLDB’09,
France (2009)
A. Agogino, K. Tumer, Efficient agent-based clustering ensembles, in AAMAS’06 (2006), pp.
1079–1086
S. Brin, L. Page, The anatomy of a large-scale hypertextual Web search engine. Comput. Netw.
ISDN Syst. 30, 107–117 (1998)
L. Cao, C. Zhang, F-Trade: an agent-mining symbiont for financial services, in AAMAS’07, Hawaii,
USA (2007)
J. Cohen, B. Dolan, M. Dunlap, MAD skills: new analysis practices for big data, in VLDB’09,
(2009), pp. 1481–1492
J. Dean, S. Ghemawat, MapReduce: simplified data processing on large clusters, in OSDI’04: 6th
Symposium on Operating Systems Design and Implementation (2004), pp. 137–149
U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy, Advances in Knowledge Discovery
and Data Mining (AAAI Press/MIT Press, Menlo Park/Cambridge, 1996)
J. Ferber, Multi-agent Systems: An Introduction to Distributed Artificial Intelligence (Addison-Wesley, Reading, 1999)
S. Gurruzzo, D. Rosaci, Agent clustering based on semantic negotiation. ACM Trans. Auton.
Adapt. Syst. 3(2), 7:1–7:40 (2008)
G. Halevi, Special Issue on Big Data. Research Trends, vol. 30 (Elsevier, Amsterdam, 2012)
H. Herodotou, H. Lim, G. Luo, N. Borisov, L. Dong, F.B. Cetin, S. Babu, Starfish: a self-tuning
system for big data analytics, in 5th Biennial Conference on Innovative Data Systems Research
(CIDR’11) (USA, 2011), pp. 261–272
M. Loukides, What is data science, O'Reilly Media, Inc., CA (2011). http://radar.oreilly.com/r2/release-2-0-11.html/
C.D. Manning, P. Raghavan, H. Schutze, Introduction to Information Retrieval (Cambridge University Press, Cambridge, 2008)
L. Page, S. Brin, R. Motwani, T. Winograd, The PageRank citation ranking: bringing order to the
Web. Technical Report. Stanford InfoLab (1999)
D.J. Patil, Data Jujitsu: the art of turning data into product, O'Reilly Media (2012)
A. Pavlo, E. Paulson, A. Rasin, D.J. Abadi, D.J. DeWitt, S. Madden, M. Stonebraker, A comparison
of approaches to large-scale data analysis, in SIGMOD’09 (2009)
A. Rajaraman, J.D. Ullman, Mining of Massive Datasets (Cambridge University Press, Cambridge,
2012)
T. Ravindra Babu, M. Narasimha Murty, S.V. Subrahmanya, Multiagent systems for large data
clustering, in Data Mining and Multi-agent Integration, ed. by L. Cao (Springer, Berlin, 2007),
pp. 219–238. Chapter 15
T. Ravindra Babu, M. Narasimha Murty, S.V. Subrahmanya, Multiagent based large data clustering
scheme for data mining applications, in Active Media Technology. ed. by A. An et al. LNCS,
vol. 6335 (Springer, Berlin, 2010), pp. 116–127
P. Russom, Big data analytics. TDWI Best Practices Report, Fourth Quarter (2011)
J. Tozicka, M. Rovatsos, M. Pechoucek, A framework for agent-based distributed machine learning and data mining, in Autonomous Agents and Multi-agent Systems (ACM Press, New York,
2007). Article No. 96
G. Weiss (ed.), Multiagent Systems: A Modern Approach to Distributed Artificial Intelligence (MIT
Press, Cambridge, 2000)
M. Wooldridge, N.R. Jennings, Towards a theory of cooperative problem solving, in Proc. of Workshop on Distributed Software Agents and Applications, Denmark (1994), pp. 40–53
P.C. Zikopoulos, C. Eaton, D. deRoos, T. Deutsch, G. Lapis, Understanding Big Data: Analytics
for Enterprise Class Hadoop and Streaming Data (McGraw Hill, Cambridge, 2011)
Appendix
Intrusion Detection Dataset—Binary
Representation
Network Intrusion Detection Data was used during the KDD-Cup99 contest. Even the 10 %-dataset can be considered large, as it consists of 805,049 patterns, each of which is characterized by 38 features. We use this dataset in the present study and hereafter refer to it as the “full dataset”. We apply the algorithms and methods developed so far to this dataset and demonstrate their efficient working. With this, we aim to drive home the generality of the developed algorithms.
The appendix contains the data description and a preliminary analysis.
A.1 Data Description and Preliminary Analysis
The Intrusion Detection dataset (10 % data) that was used during the KDD-Cup99 contest is considered for the study. The data relates to access of a computer network by authorized and unauthorized users. Access by unauthorized users is termed intrusion. Different costs of misclassification are attached to assigning a pattern belonging to one class to any other class. The challenge lies in detecting intrusions belonging to different classes accurately while minimizing the cost of misclassification. Further, whereas the feature values in the data used in the earlier chapters were binary, the current dataset assumes floating-point values.
The training data consists of 41 features. Three of the features are binary attributes, and the remaining are floating-point numerical values. For effective use of these binary attributes along with the other numerical features, the attributes need to be assigned proper weights based on domain knowledge. Arbitrary weightages could adversely affect classification results. In view of this, only 38 features are considered for the study. On further analysis, it is observed that the values of two of the 38 features in the considered 10 %-dataset are always zero, effectively suggesting exclusion of these two features (features numbered 16 and 17, counting from feature 0). The
training data consists of 311,029 patterns, and the test data consists of 494,020 patterns. They are tabulated in Table A.1. A closer observation reveals that not all features are frequent, which is also brought out in the preliminary analysis. We make use of this fact during the experiments.

Table A.1 Attack types in training data

Description      No. of patterns   No. of attack types   No. of features
Training data    311,029           23                    38
Test data        494,020           42                    38

Table A.2 Attack types in training data

Class    No. of types   Attack types
normal   1              normal
dos      6              back, land, neptune, pod, smurf, teardrop
u2r      4              buffer-overflow, loadmodule, perl, rootkit
r2l      8              ftp-write, guess-password, imap, multihop, phf, spy, warezclient, warezmaster
probe    4              ipsweep, nmap, portsweep, satan

Table A.3 Additional attack types in test data

snmpgetattack, processtable, mailbomb, snmpguess, named, sendmail, httptunnel, apache2, worm, sqlattack, ps, saint, xterm, xlock, upstorm, mscan, xsnoop

Table A.4 Assignment of unknown attack types using domain knowledge

Class    Attack type
dos      processtable, mailbomb, apache2, upstorm
u2r      sqlattack, ps, xterm
r2l      snmpgetattack, snmpguess, named, sendmail, httptunnel, worm, xlock, xsnoop
probe    saint, mscan
The training data consists of 23 attack types, which form 4 broad classes. The list is provided in Table A.2. As noted earlier in Table A.1, the test data contains more classes than the training data; the additional attack types are provided in Table A.3. Since the classification of test data depends on learning from the training data, the unknown attack types (or classes) in the test data have to be assigned to one of the a priori known classes of the training data. This is carried out in two ways, viz., (a) assigning unknown attack types to one of the known types by nearest-neighbor assignment within the test data, or (b) assigning with the help of domain knowledge. Independent exercises are carried out to assign the unknown classes by both methods. The results obtained by the two methods differ significantly. In view of this, assignments based on domain
knowledge are considered, and the test data is formed accordingly. Table A.4 contains the assigned types based on domain knowledge. One important observation that can be made from the mismatch between the NN assignment and Table A.4 is that the class boundaries overlap, which leads to difficulty in classification. Table A.5 contains the class-wise distribution of training data. Table A.6 provides the class-wise distribution of test data based on the domain knowledge assignment.

Table A.5 Class-wise numbers of patterns in training data of 494,020 patterns

Class    Class-label   No. of patterns
normal   0             97,277
u2r      1             52
dos      2             391,458
r2l      3             1126
probe    4             4107

Table A.6 Class-wise distribution of test data based on domain knowledge

Class    Class-label   No. of patterns
normal   0             60,593
u2r      1             70
dos      2             229,853
r2l      3             16,347
probe    4             4166

Table A.7 Cost matrix

Class type   normal   u2r   dos   r2l   probe
normal       0        2     2     2     1
u2r          3        0     2     2     2
dos          2        2     0     2     1
r2l          4        2     2     0     2
probe        1        2     2     2     0
In classifying the data, each wrong pattern assignment is assigned a cost. The cost
matrix is provided in Table A.7. Observe from the table that the cost of assigning a
pattern to a wrong class is not uniform. For example, the cost of assigning a pattern
belonging to class “u2r” to “normal” is 3. Its cost is more than that of assigning a
pattern from “u2r” to “dos”, say.
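One way an overall figure of merit could be computed from such per-assignment costs is sketched below: the average cost is the sum of confusion-matrix counts weighted by the cost matrix of Table A.7, divided by the number of test patterns. The confusion-matrix counts in the example are made up and are not results reported in this book.

```python
import numpy as np

# Cost matrix of Table A.7; rows are true classes, columns are assigned classes,
# in the order normal, u2r, dos, r2l, probe.
COST = np.array([[0, 2, 2, 2, 1],
                 [3, 0, 2, 2, 2],
                 [2, 2, 0, 2, 1],
                 [4, 2, 2, 0, 2],
                 [1, 2, 2, 2, 0]])

def average_cost(confusion):
    """confusion[i, j] = number of patterns of true class i assigned to class j."""
    return (confusion * COST).sum() / confusion.sum()

if __name__ == "__main__":
    # Hypothetical confusion matrix for a small test set of 1,000 patterns.
    confusion = np.array([[580, 0, 10, 5, 5],
                          [2, 1, 0, 0, 0],
                          [20, 0, 330, 0, 5],
                          [15, 0, 0, 10, 0],
                          [3, 0, 2, 0, 12]])
    print(round(average_cost(confusion), 4))
```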
Feature-wise statistics of training data are provided in Table A.8. The table contains a number of interesting statistics. They can be summarized below.
• Ranges of mean values (Column 2) of different features are different.
• Standard deviation (Column 3), which is a measure of dispersion, is different for different feature values.
• Minimum value of each feature is 0.0 (Column 4).
• Maximum values of different features are different (Column 5).
• Feature-wise support is different for different features (Column 8). The support is defined here as the number of times a feature assumed a nonzero value in the training data.
• If the real values are to be mapped to integers, the numbers of bits required along with the corresponding resolution are provided in Columns 6 and 7.

Table A.8 Feature-wise statistics

Feature No. (1)   Mean value (2)   SD (3)            Min (4)   Max (5)        Bits (VQ) (6)   Resoln (494021) (7)   Suprt (8)
1                 47.979302        707.745756        0         58,329         16              1.4e−5                12,350
2                 3025.609608      988,217.066787    0         693,375,616    30              6.0e−10               378,679
3                 868.529016       33,039.967815     0         5,155,468      23              7.32e−8               85,762
4                 0.000045         0.006673          0         1              4               0.06                  22
5                 0.006433         0.134805          0         3.0            4               0.06                  1238
6                 0.000014         0.005510          0         3.0            4               0.06                  3192
7                 0.034519         0.782102          0         30.0           5               0.03                  63
8                 0.000152         0.015520          0         5.0            6               0.08                  63
9                 0.148245         0.355342          0         1.0            4               0.06                  73,236
10                0.010212         1.798324          0         884.0          10              9.8e−4                2224
11                0.000111         0.010551          0         1.0            4               0.06                  55
12                0.000036         0.007793          0         2.0            4               0.12                  12
13                0.011352         2.012716          0         993.0          10              9.8e−4                585
14                0.001083         0.096416          0         28.0           6               3.2e−2                265
15                0.000109         0.011020          0         2.0            4               0.12                  51
16                0.001008         0.036482          0         8.0            5               0.04                  454
17                0.0              0.0               0         0.0            0               0                     0
18                0.0              0.0               0         0.0            0               0                     0
19                0.001387         0.037211          0         1.0            4               0.06                  685
20                332.285690       213.147196        0         511.0          9               2.0e−3                494,019
21                292.906542       246.322585        0         511.0          9               2.0e−3                494,019
22                0.176687         0.380717          0         1.0            4               0.06                  89,234
23                0.176609         0.381016          0         1.0            4               0.06                  88,335
24                0.057433         0.231623          0         1.0            4               0.06                  29,073
25                0.057719         0.232147          0         1.0            4               0.06                  29,701
26                0.791547         0.388189          0         1.0            4               0.06                  490,394
27                0.020982         0.082205          0         1.0            4               0.06                  112,000
28                0.028998         0.142403          0         1.0            4               0.06                  34,644
29                232.470786       64.745286         0         255.0          8               3.9e−3                494,019
30                188.666186       106.040032        0         255.0          8               3.9e−3                494,019
31                0.753782         0.410779          0         1.0            4               0.06                  482,553
32                0.030906         0.109259          0         1.0            4               0.06                  146,990
33                0.601937         0.481308          0         1.0            4               0.06                  351,162
34                0.006684         0.042134          0         1.0            4               0.06                  52,133
35                0.176754         0.380593          0         1.0            4               0.06                  94,211
36                0.176443         0.380919          0         1.0            4               0.06                  93,076
37                0.058118         0.230589          0         1.0            4               0.06                  35,229
38                0.057412         0.230140          0         1.0            4               0.06                  341,260

Table A.9 Accuracy of winner and runner-up of KDD-Cup99

Class    Winner   Runner-up
normal   99.5     99.4
dos      97.1     97.5
r2l      8.4      7.3
u2r      13.2     11.8
probe    83.3     84.5
Cost     0.2331   0.2356
The observations made are used later in the current chapter through various
sections.
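A minimal sketch of how the support and bit counts of Table A.8 could be computed from a pattern matrix is given below; the small random matrix stands in for the actual training data, and the rule used for the bit count (bits needed to index the distinct values of a feature) is an assumption for illustration.

```python
import math
import numpy as np

def feature_statistics(X):
    """Per-feature mean, SD, max, bits to index distinct values, and support (nonzero count)."""
    stats = []
    for j in range(X.shape[1]):
        column = X[:, j]
        support = int(np.count_nonzero(column))
        distinct = len(np.unique(column))
        bits = math.ceil(math.log2(distinct)) if distinct > 1 else 0
        stats.append((j + 1, column.mean(), column.std(), column.max(), bits, support))
    return stats

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    # Assumed stand-in for the intrusion-detection training data: mostly-zero features.
    X = rng.integers(0, 4, size=(1000, 5)) * (rng.random((1000, 5)) < 0.3)
    for row in feature_statistics(X):
        print(row)
```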
Further, the dissimilarity measure plays an important role. The range of values for any feature within a class or across the classes is large. The values assumed by different features within a pattern also vary widely. This scenario suggests the use of the Euclidean and Mahalanobis distance measures. We applied both measures while carrying out exercises on samples drawn from the original dataset. Based on the study on the random samples, the Euclidean distance measure provided better classification accuracy. We made use of the Euclidean measure subsequently.
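The comparison described above can be prototyped as in the following sketch; the two-class Gaussian sample with correlated features is a made-up stand-in for the random samples drawn from the intrusion data, and the 1-NN implementation is deliberately simple.

```python
import numpy as np

def nn_accuracy(train_X, train_y, test_X, test_y, metric):
    # 1-NN classification: assign each test pattern the label of its nearest training pattern.
    preds = [train_y[np.argmin([metric(t, x) for x in train_X])] for t in test_X]
    return np.mean(np.array(preds) == test_y)

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    # Assumed toy sample: two classes with correlated features and unequal scales.
    cov = np.array([[4.0, 1.5], [1.5, 1.0]])
    X = np.vstack([rng.multivariate_normal([0, 0], cov, 150),
                   rng.multivariate_normal([2, 1], cov, 150)])
    y = np.array([0] * 150 + [1] * 150)
    idx = rng.permutation(300)
    X, y = X[idx], y[idx]
    train, test = slice(0, 200), slice(200, 300)

    VI = np.linalg.inv(np.cov(X[train].T))          # inverse covariance for Mahalanobis
    euclid = lambda a, b: np.linalg.norm(a - b)
    mahal = lambda a, b: float(np.sqrt((a - b) @ VI @ (a - b)))

    print("Euclidean  :", nn_accuracy(X[train], y[train], X[test], y[test], euclid))
    print("Mahalanobis:", nn_accuracy(X[train], y[train], X[test], y[test], mahal))
```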
We classified the test patterns with the complete dataset. With full data, NNC provided a classification accuracy of 92.11 %. The corresponding classification cost is 0.254086. This result is useful for comparing possible improvements with the algorithms proposed in the book.
Results reported during KDD-Cup99 are provided in Table A.9.
A.2 Bibliographic Notes
KDD-Cup99 Data (1999) contains the 10 % and full datasets provided during the KDD-Cup challenge in 1999.
References
KDD-Cup99 Data, http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html (1999)
Glossary
b Number of binary features per block, 69
c Sample size, 96
d Number of features, 48, 69
H Final hypothesis in AdaBoost, 140
k Number of clusters, 11, 96
n Number of patterns, 11, 48, 69
Pc Probability of cross-over, 59
Pi Probability of initialization, 59
Pm Probability of mutation, 59
q Number of blocks per pattern, 69
R ∗ (ω) True risk, 61
Remp (ω) Empirical risk, 61
r Length of subsequence, 69
ΩL Lower approximation of class, Ω, 86
ΩU Upper approximation of class, Ω, 86
v Value of block, 69
X Set of patterns, 2
ε Minimum support, 69
εj Error after each iteration in AdaBoost, 140
η Dissimilarity threshold for identifying nearest neighbor to a subsequence, 69
ψ Minimum frequency for pruning a subsequence, 69
ζ Distance threshold for computing leaders in leader clustering algorithm, 100,
140
hi ith hypothesis at iteration j in AdaBoost, 140
X Set of compressed patterns, 2
Index
A
AdaBoost, 125, 126, 128–131
Agent, 177
mining interaction, 177
Agent-mining, 173
Anomaly detection, 59
Antecedent, 22
Appendix, 185
Apriori algorithm, 23
Association rule, 25
mining, 2, 4, 12, 22, 39, 68, 123
Apriori algorithm, 4
Average probability of error, 17
Axis-parallel split, 131
B
Bayes classifier, 17
Bayes rule, 17
Big data, 8, 173
analytics, 174
Big data analytics, 173
Bijective function, 53
Binary classifier, 20, 125
Binary pattern, 5
Binary string, 155
BIRCH, 28, 123
Block
length, 69
value of, 69
Boosting, 128
Breadth first search (BFS), 152
Business intelligence, 174
C
Candidate itemset, 24
CART, 131
Central tendency, 161
Centroid, 13, 96, 101
CLARA, 96, 123
CLARANS, 123
Class label, 17
Classification
AdaBoost, see AdaBoost
binary, 126, 138, 140
binary classifier, 125
decision tree, see Decision tree
definition, 1
divide and conquer, 35
incremental, 34
intermediate abstraction, 37
kNNC, see k-nearest neighbor classifier
multicategory, 126
NNC, see Nearest neighbor classifier
one vs all, 144
one vs one, 126, 144
one vs rest, 126
rough set, 86
SVM, see machine at Support vector
Classification accuracy, 6, 98, 110, 122, 129,
156
Classification algorithm, 1
Cluster feature tree, 28
Cluster representative, 1, 11, 13, 96, 100, 108,
112, 136
Clustering, 4, 95, 96
algorithms, 13
CLARA, see CLARA
CLARANS, see CLARANS
CNN, see Condensed nearest neighbor
(CNN)
definition, 1
hierarchical, 13
incremental, 28
intermediate representation, 33
k-Means, 101
leader, 28
PAM, see PAM
partitional, 13
Clustering algorithm, 1, 97
Clustering feature, 28
Clustering method, 126
CNN, 126
Compact data representation, 48
Compressed data
distance computation, 51
representation, 49
Compressed data representation, 49, 50
Compressed pattern, 86
Compressed training data, 91
Compressing data, 2
Compression
Huffman coding, 68, 74, 76, 77
lossless, 2, 76
run length, 49, 170
lossy, 2, 77
run length, 76
Computation time, 47, 121
Computational requirement, 1
Condensed nearest neighbor (CNN), 7, 97, 136
Confidence, 25
Confusion matrix, 134
Consequent, 22
Constraint, 127
Convex quadratic programming, 127
Cost matrix, 120
Criterion function, 15
Crossover, 155
Curse of dimensionality, 96
D
Data abstraction, 13, 48
Data analysis, 73, 125
Data compaction, 96
Data compression, 2
Data matrix, 6, 35
Data mining, 2, 173
association rules, see mining at Association
rule
feature selection, 7, see also, selection at
Feature, 59, 98–100
prototype selection, 7, 28, 43, 59, 95, 97, 99, 100, 129, 140, 147, 179
Data mining algorithms, 11
Data science, 174
Data squashing, 96
Data structure, 28
Dataset
hand written digits, 55, 75, 98, 110, 137,
138
intrusion detection, 110, 116, 122, 185
UCI-ML, 101, 123, 137, 144
Decision boundary, 21
Decision function, 128
Decision making, 1, 11
Decision rule, 128
Decision tree, 11, 130, 151, 152
Decision tree classifier, 8
Dendrogram, 14
Dictionary, 2
Dimensionality reduction, 3, 7, 96, 98, 147,
153, 154, 160, 171
Discriminative classifier, 17, 20
Discriminative model, 17
Dispersion, 161
Dissimilarity computation, 86
Distance threshold, 116, 139
Distinct subsequences, 104, 110
Divide and conquer, 5, 27, 31, 173, 175
Document classification, 8
Domain, 53
Domain knowledge, 125, 126, 131
Dot product, 4
E
Edit distance, 47
Efficient algorithm, 130
Efficient hierarchical algorithm, 4
Efficient mining algorithms, 27
Embedded scheme, 151
Empirical risk, 61
Encoding mechanism, 155
Ensemble, 128
Error-rate, 17
Euclidean distance, 11, 14, 54, 98
Exhaustive enumeration, 4
Expected test error, 126
F
Face detection, 177
Farthest neighbor, 19
Feature
extraction, 148, 152
principal component analysis, 152, 153
random projection, 153
selection, 148, 149
genetic algorithm, 154
ranking, 149
ranking features, 150
sequential backward floating selection,
150
sequential backward selection, 149
sequential forward floating selection,
150
sequential forward selection, 149
stochastic search, 152
wrapper methods, 151
Feature extraction
definition, 1
Feature selection, 95
definition, 1
Filter methods, 148
Fisher’s score, 7, 150
Fitness function, 155
Frequent features, 8, 99
feature selection, 160
Frequent item, 7, 23, 69, 83, 95, 100, 107
support, 99
Frequent item set, 95
Frequent items, 113
Frequent-pattern tree (FP-tree), 33, 48
Function, 53
G
Generalization error, 96
Generation gap, 158
Generative model, 17
Genetic algorithm, 97, 123, 152, 154, 171
crossover, 156
probability, 166
mutation, 156
probability, 166
selection, 156
simple (SGA), 155, 157
steady state (SSGA), 158
Genetic algorithms (GAs), 6, 57
Genetic operators, 155
Global optimum, 7, 155
Growth function, 61
H
Hadoop, 175
Hamming distance, 54, 98, 138
Handwritten digit data, 6
Hard partition, 4
Heterogeneous data, 174
Hierarchical clustering algorithm
definition, 4
High-dimensional, 4, 67, 96
High-dimensional dataset, 1
High-dimensional space, 19, 22
Hybrid algorithms, 48
Hyperplane, 21, 127
I
Improvement in generalization, 68
Incremental mining, 5, 27
Infrequent item, 2
Initial centroids, 16
Inter-cluster distance, 13
Intermediate abstraction, 27
Intermediate representation, 5
Intra-cluster distance, 13
K
k-means algorithm, 15
K-means clustering, 3
k-nearest neighbor classifier, 19, 54, 55,
105–107, 122, 129, 141, 144, 166,
169
K-nearest neighbor classifier (KNNC), 4, 78,
87
k-partition, 14
Kernel function, 128
KNNC, 83
Knowledge structure, 2
L
Lq norm, 54
Labelled training dataset, 1
Lagrange multiplier, 127
Lagrangian, 20
Large dataset, 12, 97
Large-scale dataset, 5
Leader, 13, 96, 100
clustering
algorithm, 100
Leader clustering, 140
Leader clustering algorithm, 7
Learn a classifier, 1
Learn a model, 1
Learning algorithm, 125, 128
Learning machine, 126
Linear discriminant function, 20
Linear SVM, 21
Linearly separable, 21
Local minimum, 16
Longest common subsequence, 47
Lossy compression, 11, 13, 96
Lower approximation, 6, 86
M
Machine learning, 7, 8, 22
Machine learning algorithm, 11
Manhattan distance, 51–53, 160
Map-reduce, 5, 173
MapReduce, 174
Massive data, 173
Maximizing margin, 127
Maximum margin, 20
Minimum description length, 63
Minimum frequency, 69
Minimum support, 71
Mining compressed data, 9
Minsup, 23
Multi-class classification, 129
Multiagent system, 173, 177, 179
agile processing, 181
attribute reduction, 179
data reduction, 178
heterogeneous data access, 180
Multiagent systems, 8
Multiclass classification, 125
Multiple data scans, 48
Mutation, 155
Mutual information (MI), 7, 151
N
Nearest neighbor, 4
Nearest neighbor classifier (NNC), 4, 17
feature selection by wrapper methods, 151
Negative class, 17
NNC, 8
No free lunch theorem, 125
Noise, 19
Non-linear decision boundary, 128
Non-negative matrix factorization, 8
Nonlossy compression, 5
Number of database scans, 11
Number of dataset scans, 27
Number of representatives, 116
Numerical taxonomy, 4
O
Objective function, 155
Oblique split, 131
One-to-one function, 53
Onto function, 53
Optimal decision tree, 131
Optimization problem, 156
Order dependence, 101
Outlier, 13
P
PageRank, 176
dead ends, 176
link spam, 176
MapReduce, 176
spam mass, 176
spider traps, 176
teleport, 176
topic-sensitive, 176
TrustRank, 176
PAM, 96
Parallel hyperplanes, 20
Partitional algorithms, 4
Pattern classification, 100
Pattern clustering, 95
Pattern matrix, 26
Pattern recognition, 2, 22, 130
Pattern synthesis, 36
Patterns
representative, 136
Population of chromosomes, 155
Positive class, 17
Posterior probability, 17, 126
Prediction accuracy, 67
Principal component analysis, 8
Prior probabilities, 17
Probability distribution, 11
Probability of crossover, 156
Probability of mutation, 156
Prototype selection, 96, 99
Prototypes
CNN, 136
leader, 129
Proximity between a pair of patterns, 11
Proximity matrix, 14
Pruning of subsequences, 83
R
Random number, 156
Random projections, 8
Random surfer, 176
Range, 53
Regression, 2
logistic, 44
Regression trees, 131
Representative pattern, 1, 7, 97, 138
Risk, 126
Robust to noise, 4
Rough set, 6, 86
Rough set based scheme, 68
Roulette wheel selection, 156
Run
dimension, 49
length
encoded compression, 49
encoding, 47
string, 49
length, 49
Run length, 49
Run-length coding, 5
S
Sampling, 96
Scalability, 67
Scalability of algorithms, 96
Scalable mining algorithms, 12
Scan the dataset, 4
Secondary storage, 11
Selection, 155
feature, 96
prototype, 96
Selection mechanism, 156
Selection of prototypes and features, 95
Semi-structured data, 174, 178
Sequence, 68
Sequence mining, 26
Set of prototypes, 95
Single database scan, 97, 100
Single-link algorithm, 4, 14
Singleton cluster, 13
Soft partition, 4
SONAR, 141
Space organization, 59
Spacecraft health data, 55
Squared Euclidean distance, 37
Squared-error criterion, 15
Squashing, 48
State-of-the-art classifier, 4
Storage space, 47, 121
Subsequence, 6, 69, 83
distinct, 70
length of, 69
Subset of items, 2
Sufficient statistics, 47
Support, 68
minimum, 68, 83, 85, 87, 88, 100, 104,
163, 168
Support vector, 3, 20, 127, 136
machine, 20, 97, 131
Support vector machine (SVM), 4, 17, 125,
126
feature selection by wrapper methods, 151
Survival of the fittest, 155
SVM, 8
T
Termination condition, 155
Test pattern, 1, 12, 17, 126
Text mining, 27
Threshold, 17
Threshold value, 28
THYROID, 141
Training dataset, 136
Training phase, 18
Training samples, 126
Training set, 17
Tree
CF, 28
decision, see Decision tree
knowledge based (KBTree), 136
Tree classifier, 125
U
UCI Repository, 141
Uncompressing the data, 2
Unstructured data, 174, 178
Upper approximation, 6, 86
V
Variety, 174
VC dimension, 6, 60
VC entropy, 61
Velocity, 174
Volume, 174
W
Weak learner, 128
Weight vector, 17
WINE, 141
Wrapper methods, 148