Boris Mirkin
School of Computer Science and Information Systems, Birkbeck University of London UK
Computational Intelligence: Correlation, Summarization and
Visualization
Contents:
0. Introduction:
01. Computational intelligence.
02. Data visualization.
03. Case study problems.
1. Summarizing and visualizing a single feature: 1D analysis
1.1. Distribution
1.2. Centre and spread: definitions, properties, integral perspectives
Project 1.1 Analysis of a multimodal distribution
1.3. Confidence and computational experiment
Project 1.2 Data mining with a confidence interval: Bootstrap
Project 1.3 K-fold cross-validation
1.4. Modeling uncertainty: Interval and fuzzy values.
2. Correlating and visualizing two features: 2D analysis
2.1. Both quantitative:
Linear regression and residual variance
Correlation coefficient
Project 2.1. 2D analysis, linear regression and bootstrapping
Estimating non-linear regression
Project 2.2 Non-linear regression versus its linearized version: evolutionary
algorithm for estimation
2.2. Nominal and quantitative: table regression and correlation ratio
2.3. Both categorical:
Contingency table
Quetélet indices
Chi-squared correlation index and its visualization
3. Correlation with decision rules
3.1. Linear regression
3.2. Discriminant function and SVM
3.3. Decision trees
4. Learning neural networks for prediction
4.1. Artificial neuron and perceptron
4.2. Multilayer network
4.3. Back-propagation algorithm
5. Summarization with Principal Component Analysis
5.1. One component
5.2. Principal Components and SVD
5.3. Popular applications
5.4. Correspondence analysis
5.5. Self-associative neural nets.
6. K-Means clustering
6.1. Batch and incremental K-Means
6.2. Anomalous pattern and iK-Means
6.3. Experimentally determining K
6.4. Evolutionary approaches to K-Means
Genetic algorithms
Evolutionary algorithms
Particle swarm optimization
6.5. Extensions of K-Means
Fuzzy
Kohonen’s Self-Organizing Map (SOM)
7. Structuring and visualizing similarity data
7.1. Hierarchical clustering
7.2. MST and Single-linkage clustering
7.3. Additive clusters
Appendix:
A.1 Basics of multivariate entities
A.2 Basic optimization
A.3 Basic MatLab
Introduction
0.1. Computational Intelligence from Data
This is a unique course in that it looks at data from the inside rather than from the outside, which is the
conventional perspective in Computer Science.
The term Computational Intelligence has emerged recently and means different things to different
people. Some think that the term should cover only those just emerging techniques that relate to neural
networks, evolutionary computation and fuzzy systems. They believe that Computational
Intelligence has nothing to do with Machine Learning and Statistics because of the difference in
methods. Some add that Computational Intelligence should involve a biological underpinning. Still
others, including the author, think that a scientific discipline should be defined by a set of problems
rather than by a set of techniques; the problems can be addressed by various techniques, with no
predefined constraints on them. In this respect, the definition following that by Engelbrecht (2002), p.
4, deserves attention:
"Computational Intelligence – the study of adaptive mechanisms to enable or facilitate intelligent
behaviour in complex and changing environments and, specifically, to learn or adapt to new situations,
to generalize, abstract, discover and associate."
This seems an adequate definition. However, it is currently by far too wide, because the study of
adaptive mechanisms and actions is still in a rather embryonic state. Yet the aspects related to
learning, relating and discovering patterns in data have by now been developed into a discernible set of
approaches, sometimes supported with sound theoretical grounding, and we thus confine ourselves
to studying data-driven computational intelligence models and methods for enhancing knowledge
of the domain of interest. The texts by D. Poole, A. Mackworth and R. Goebel (2001), A. Engelbrecht
(2002), and Avi Kumar (2005) have rather different agendas, which explains the current effort.
Key concepts in this course are those related to computational and structural aspects in the input data,
output knowledge and methods and structures for relating them:
- Data:
Typically, in sciences and in statistics, a problem comes first, and then the investigator turns to
data that might be useful in the problem. In computational intelligence, it is also the case, but a
problem can be very broad: look at this data set - what sense can we make of it? This is more
reminiscent of a traveller's view of the world than of a scientist's. The traveller deals
with whatever occurs on their way. Helping the traveller make sense of data is the task of
computational intelligence. Rather than attending to individual problems, computational
intelligence focuses on learning patterns. It should be pointed out that this view differs much from
that accepted in the sciences, including classical statistics, in which the main goal is to identify a
pre-specified model of the world and data is but a vehicle for achieving this goal.
Any data set comprises two parts, metadata and data entries. Metadata may involve
names for the entities and their features. Depending on the origin, entities may be alternatively
but synonymously referred to as individuals, objects, cases, instances, or observations. Entity
features may be synonymously referred to as variables, attributes, states, or characters.
Depending on the way they are assigned to entities, the features can be of elementary structure
[e.g., age, sex, or income of individuals] or complex structure [e.g., a picture, or statement, or a
cardiogram]. Metadata may involve relations between entities or other relevant information,
which we are not going to deal with further on.
- Knowledge:
Knowledge is a complex concept, not quite well understood yet, related to understanding
things. Structurally, knowledge can be thought of as a set of categories and statements of
relation between them. Categories are aggregations of similar entities such as apples or plums
or more general categories such as fruit comprising apples, plums, etc. When created over data
objects or features these are referred to as clusters or factors, respectively. Statements of
relation between categories express regularities relating different categories. These can be of
causal or correlational character. We say that two features correlate when the co-occurrence of
specific patterns in their values is observed as, for instance, when a feature’s value tends to be
the square of the other feature. The observation of a correlation pattern is thought to be a prerequisite to further inventing a theoretical framework from which the correlation follows. It is
useful to distinguish between quantitative correlations such as functional dependencies between
features and categorical ones expressed conceptually, for example, as logical production rules
or more complex structures such as decision trees. These may be used for both understanding
and prediction. In industrial applications, which have been the driving force for Computational
Intelligence so far, the latter is by far more important. Moreover, the prediction problem is
much easier to make sense of operationally, so the sciences have so far paid much more attention to
it. The notion of understanding, meanwhile, remains very vague.
We are going to study methods for enhancing knowledge by producing rules for finding either
(a) Correlation of features (As) or
(b) Summarization of entities or features (Ag),
each in either of two ways, quantitative (Q) and categorical (C).
A rule involves a postulated mathematical structure whose parameters are to be learnt from the data.
We will be dealing most with the following mathematical structures in the rules:
- linear combination of features;
- neural network mapping a set of input features into a set of target features;
- decision tree built over a set of features;
- partition of the entity set into a number of non-overlapping clusters.
A fitting method relies on a computational model involving a function scoring the adequacy of the
mathematical structure underlying the rule – a criterion, and, typically, visualization aids.
The criterion measures either the deviation from the target (to be minimised) or fitness to the target (to
be maximised). Currently available computational approaches to optimise the criterion can be
partitioned in three major groups:
- global optimisation, computationally feasible sometimes for linear quantitative and simple
discrete structures;
- local improvement using such general approaches as:
o gradient descent
o alternating optimization
o greedy neighbourhood search
- evolution of a population, an approach relying on relatively recent advances in
computing capabilities, of which the following will be used in some problems:
o genetic algorithms
o evolutionary algorithms
o particle swarm optimization
It should be pointed out that currently there is no systematic description of all possible combinations of
problems, data types, mathematical structures, criteria, and fitting methods available. Here we rather
focus on the generic and better explored problems in each of the four groups that can be safely claimed
as being prototypical within the groups:
                Summarization (Ag)               Correlation (Co)
Quantitative    Principal component analysis     Regression analysis
Categorical     Cluster analysis                 Pattern recognition (supervised classification)
These methods have emerged in different frameworks and usually are considered as unrelated.
However, they are related in the context of computational intelligence. Moreover, they can be unified
by the so-called least-squares criterion that will be accepted for all main methods described in this text.
In fact, the criterion will be part of a unifying, data-recovery, perspective. The data recovery approach
involves three stages: (1) fitting a model to the data (sometimes referred to as "coding"), (2) deriving
data from the model in the format of the data used to build the model (sometimes referred to as
"decoding"), and (3) looking at the discrepancies between the observed data and those recovered from
the model. The smaller the discrepancies, the better the fit, which gives a natural model fitting
criterion.
At least three different levels of studying a computational intelligence approach can be distinguished.
One can be interested in learning the approach on the level of concepts only: what is it for, why
should it be applied at all, etc. A somewhat more practically oriented take is that of
an information system or tool that can be utilised without any knowledge beyond the structure of its input
and output. A more technically oriented way is to study the method involved and its
properties. The comparative advantages and disadvantages of these three levels are as follows.
              Pro                       Con
Concepts      Awareness                 Superficial
Systems       Usable now; simple        Short-term; stupid
Techniques    Workable; extendable      Technical; boring
Many in the Computer Sciences rely on Systems, assuming that good methods have been put in there
already. Indeed, with new data streams from new hardware devices being developed time and
again, such issues as data capture, security, maintenance and distribution, which are way beyond intelligent
data analysis techniques, can be quite urgent. Unfortunately, in many respects the intelligence of
currently available "intelligent methods" is rather superficial and may lead to wrong results and
decisions.
Consider, for instance, a very popular concept, the power law – many say that in unconstrained social
processes such as those on Web networks this law, expressed as y = a*x^(-b) where x and y are related
features and a, b > 0 are constants, dominates: the number of people who read news stories on the web
decays with time according to a power law, as does the distribution of page requests on a web-site according to their
popularity, the distribution of website interconnections, etc. According to a very popular recipe, to fit a
power law (that is, to estimate a and b from the data), one needs to fit the logarithm of the power-law
equation, that is, log(y) = c - b*log(x) where c = log(a), which is much easier to fit because it is linear.
Therefore, this recipe advises: take logarithms of x and y first and then use any popular linear
regression program to find the constants. This recipe does work well when the regularity is observed
with no noise, which is impossible in social processes because of the many factors affecting them. If
the data is not that exact, the recipe may lead to big errors. For example, I generated x (between 0 and
10) and y related to x by the power law y = 2*x^1.07, which can be interpreted as growth at a rate
of approximately 7% per time moment, with added Gaussian noise whose standard deviation is 2.
The recipe above led to estimates of a=3.08 and b=0.8, suggesting that the process does not grow with x
but rather decays. In contrast, when I applied an evolutionary optimization method, which will be
introduced later, I obtained realistic estimates of a=2.03 and b=1.076.
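To see the effect concretely, here is a minimal MATLAB/Octave sketch, not the author's code, that generates data of this kind and contrasts the log-log recipe with a direct least-squares fit of y = a*x^b (here b is the exponent itself); fminsearch is used as a simple stand-in for the evolutionary method mentioned above, and all variable names are illustrative.

```matlab
% A sketch (not the author's code) contrasting the log-log recipe with a
% direct least-squares fit of y = a*x^b on noisy power-law data.
rng(1);                                  % for reproducibility
x = 0.5 + 9.5*rand(200,1);               % predictor values in (0.5, 10)
y = 2*x.^1.07 + 2*randn(200,1);          % power law plus Gaussian noise of sd 2

% Recipe 1: take logarithms and fit a straight line (only y > 0 can be used,
% and the noise biases the estimates).
ok = y > 0;
p  = polyfit(log(x(ok)), log(y(ok)), 1);
a_log = exp(p(2));  b_log = p(1);

% Recipe 2: minimise the squared error of the original model directly;
% fminsearch stands in here for the evolutionary method used in the text.
sse = @(t) sum((y - t(1)*x.^t(2)).^2);   % t = [a b]
t   = fminsearch(sse, [1 1]);
a_direct = t(1);  b_direct = t(2);

fprintf('log-log fit: a = %.2f, b = %.2f\n', a_log, b_log);
fprintf('direct fit:  a = %.2f, b = %.2f\n', a_direct, b_direct);
```

On noisy data the two recipes can disagree substantially, which is the point of the example.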
This is a relatively simple example, in which a correct procedure can be used. However, in more
complex situations of clustering or categorization, the very idea of a correct method is rather
debatable; at least, the methods in existing systems can be, and frequently are, of rather poor quality.
One may compare the situation with receiving the services of an untrained medical doctor or car
driver; the results could be as devastating. This is why it is important to study not only the How's but
also the What's and Why's of Computational Intelligence, which are addressed in this course by focusing
on Concepts and Techniques rather than Systems. In a typical case, the exposition follows the
structure of a data analysis application and comprises the following seven steps:
(i) formulating a specific data-related problem, then
(ii) developing a model and
(iii) a method that are going to be used for advancing into the problem, then
(iv) application of the method to the data, sometimes preceded by
(v) data standardization and sometimes followed by
(vi) adjustment of the solution to the nature of the substantive problem, and - last but not least -
(vii) interpretation and conclusions.
0.2. Visualization
0.2.1. General.
Visualization can be a by-product of the model and/or method, or it can be utilized by itself. The
concept of visualization usually relates to human cognitive abilities, which are not well understood.
At this point, we are not able to discuss the structures of visual image streams such as those in a movie or
video. Nor can one reflect, in a computationally meaningful way, on the art of painting or photography,
whose goals relate to deep-down impressions and emotions.
We are going to be concerned with presenting data as maps or diagrams or objects on a digital screen
in such a way that relations between data entities or features are reflected in distances or connections,
or other visual relations, between their images. Among more or less distinct visualization goals,
beyond sheer presentation that appeals to the cognitive domination of visual over other senses, we can
distinguish between:
A. Highlighting
B. Integrating different aspects
C. Narrating
D. Manipulating
Of these, manipulating visual images of entities, such as in computer games, seems an interesting area
yet to be developed in the framework of Computational Intelligence. The other three will be briefly
discussed and illustrated in the remainder of this section.
0.2.2. Highlighting
To visually highlight this or that feature of an image, one should somehow distort the original
dimensions. A good example is the London tube scheme by H. Beck (1933), on which he greatly
enlarged the proportions of the central London part to make it more legible. Such a gross
distortion, for a long while totally rejected by the authorities, is now a standard for metro maps
worldwide (see Figure 0.2.1).
Figure 0.2.1. A fragment of London Tube map made after H. Beck; the central part is highlighted by
disproportionate scaling.
This line of thinking has been worked on in geography for centuries, since mapping the Earth's
global surface to a flat sheet is impossible to do exactly. Various proxy criteria have been proposed,
leading to interesting highlights such as those presented in Figure 0.2.2 (Fuller's projection) and Figure
0.2.3 (August's projection); see http://en.wikipedia.org/wiki/ for more.

Figure 0.2.2. The Fuller Projection, or Dymaxion Map, solves the problem of displaying spherical
data on the flat surface of a polyhedron using a low-distortion transformation. Landmasses are presented
without interruption -- the map's sinuses do not cut into the land area at any point.
Figure 0.2.3. A conformal map: the angle between any two lines on the sphere is the same between their
projected counterparts on the map; in particular, each parallel crosses meridians at right angles; and also, scale
at any point is the same in all directions.
More recently this idea was applied by Rao and Card (1994) to table data (see Figure 0.2.4); more on
this can be found in the volume by Card, Mackinlay and Shneiderman (1999).
Figure 0.2.4. The Table Lens machine: highlighting a few rows and columns by enlarging them.
It should be noted that disproportionate highlighting may lead to effects bordering on visual
cheating. This is especially apparent when relative proportions are visualized through proportions
between areas, as in Figure 0.2.5. An unintended effect of the picture is that the decline by half is
presented visually by the area of the doctor's body, which is just one fourth of the initial size. This
grossly biases the message.

Figure 0.2.5. A decline in the relative number of general-practitioner doctors in California in the 1970s is
conveniently visualized using 1D size-related, not 2D area-related, scaling of a picture of a doctor.
Figure 0.2.6. Another unintended distortion: a newspaper's report (July 2005) is visualized with bars
that grow from the 500,000 mark rather than 0.

Another typical case of unintentional cheating is when relative proportions are visualized using
bars that start not at 0 but at an arbitrary mark, as is the case in Figure 0.2.6, in which a newspaper's
legitimate satisfaction with its success is visualized using bars that begin at the 500,000 mark rather than
0. Another mistake is that the difference between the bars' heights in the picture is much greater than
the reported 220,000. Altogether, the rival's circulation bar is less than half as tall, while the real
circulation is lower by just 25%.
0.2.3. Integrating different aspects
Figure 0.2.7. An image of Con Edison company’s power grid on a PC screen according to website
http://www.avs.com/software/soft_b/openviz/conedison.html.
Bringing different features of a phenomenon to a visual presentation can make life easier indeed.
Figure 0.2.7 represents an image that an energy company utilizes for real time managing, control and
repair of its energy network stretching over the island of Manhattan (New York, USA). Operators can
view the application on their desktop PCs and monitor the grid and repair problems when they arise by
rerouting power or sending a crew out to repair a device on site. This makes “manipulation and
utilisation of data in ways that were previously not possible,” according to the company’s website.
Bringing features together can be useful for less immediate insights too. A popular story of Dr. John
Snow's fight against an outbreak of cholera in Soho, London, in 1854 is based on the real fact that,
two weeks into the outbreak, Dr. Snow went over all the houses in the vicinity and marked each of
them on his map with as many ticks as cholera deaths had occurred there (a scheme of a fragment
of Dr. Snow's map is in Figure 0.2.8). The ticks were densest around a water pump, which convinced
Dr. Snow that the pump was the source of the cholera. (In fact, he had served in India, which disposed
him to the idea of the role of water flows in the transmission of the disease.) He discussed
his findings with the priest of the local parish, who then had the handle of the pump removed, after which
the deaths stopped.

Figure 0.2.8. A scheme of a fragment of Dr. Snow's map demonstrating that most deaths
(labelled by circles) occurred near the water pump he was dealing with.

This is all true. But there is more to the story. The deaths did stop - but because too few people
remained in the district, not because of the removal: the handle was ordered to be put back the very next
day after it was removed. Moreover, the borough council refused to accept Dr. Snow's water-pump
theory because it was inconsistent with the theory of the time, that cholera spread through stench
in the air rather than water. More people died in Soho in the next cholera outbreak a decade later. The
water-pump theory was not accepted until much later, when the germ theory became established.
The story is instructive in that a data-based conclusion needs a plausible explanation to be accepted.
Figure 0.2.9. Product decision tree for the Company data in Table 0.1: the nodes split on Sector
(Not Retail (Ind./Util.) versus Retail) and on ECommerce (No versus Yes), and the leaves are labelled
by Products A, B and C.
The diagram in Figure 0.2.9 visualizes relations between features in the Company data (Table 0.1) as a
decision tree characterizing their products. For example, the left-hand branch distinctly describes
Product A by combining the "Not retail" and "No e-commerce" edges. One more visual image, Figure
0.2.10, depicts relations between confusion patterns of decimal numerals drawn over a rectangle's edges
and their descriptions in terms of combinations of the edges of the rectangle with which they are drawn. A
description may combine both edge presence and absence to distinctively characterise its pattern,
whereas a profile comprises the edges that are present in all elements of its pattern. The confusion patterns
are derived from psychological data (see Mirkin 2005).

Figure 0.2.10. Confusion patterns for numerals, drawn using rectangle edges, their descriptions in
terms of edges present/absent, and profiles showing maximal common edges.
0.2.4. Narrating a story
In a situation in which the features involved have a temporal or spatial aspect, integrating them in one
visual image may lead to the narration of a story, with its starting and ending dates. Such a story is told
of a military campaign (Napoleon invading Russia in 1812) as presented in Figure 0.2.11. It shows a map
of Russia, with Napoleon's army trajectory drawn forth in white and back in black, so that time
unfolds within this static image. The trajectory's width shows the army's strength over time, steadily
declining on a dramatic scale.

Figure 0.2.11. The white band represents the trajectory of Napoleon's army moving to the East and
the black band shows it moving to the West, the line width being proportional to the army's strength.
All the images presented can be considered illustrations of a principle accepted further on. According
to this principle, to visualize data, one needs to specify first a “ground” image, such as a map or grid or
coordinate plane, which is supposed to be well known to the user. Visualization, as a computational
device, can be defined as mapping data to the ground image in such a way that the analysed properties
of the data are reflected in properties of the image. Of the goals considered, integration of data will be
a priority, since no temporal aspect is considered here.
0.3. Case study problems
Case 0.3.1: Companies
Table 0.1. Companies characterized by mixed scale features; the first three companies making product
A, the next three making product B, and the last two product C.
Company name   Income, $mln   SharP, $   NSup   EC    Sector
Aversiona      19.0           43.7       2      No    Utility
Antyops        29.4           36.0       3      No    Utility
Astonite       23.9           38.0       3      No    Industrial
Bayermart      18.4           27.9       2      Yes   Utility
Breaktops      25.7           22.3       3      Yes   Industrial
Bumchista      12.1           16.9       2      Yes   Industrial
Civiok         23.9           30.2       4      Yes   Retail
Cyberdam       27.2           58.0       5      Yes   Retail
There are five features in Table 0.1.:
1) Income, $ Mln;
2) SharP - share price, $;
3) NSup - Number of principal suppliers;
4) ECommerce - Yes or No depending on the usage of e-commerce in the firm;
5) Sector - The sector of the economy: (a) Retail, (b) Utility, and (c) Industrial.
Examples of computational intelligence problems related to this data set:
- How to map companies to the screen with their similarity reflected in distances on the plane?
(Summarization)
[Q: Do you think that the following statement is true? “There is no information on the company
products within the table”. A. You should, since no feature “Product” is present in the table, and the
separating lines are not part of the data.]
- Would clustering of companies reflect the product? What features would be involved then?
(Summarization)
- Can rules be derived to make an attribution of the product for another company, coming outside of
the table? (Correlation)
- Is there any relation between the structural features and market related features? (Correlation.)
An issue related to Table 0.1 is that not all of its entries are quantitative. Specifically, there are three
conventional types of features in it:
- Quantitative, that is, such that the averaging of its values is meaningful. In the Table 0.1, these
are: Income, SharePrice and NSup;
- Binary, that is, admitting one of two answers, Yes or No: this is EC;
- Nominal, that is, with a few disjoint not ordered categories, such as Sector in Table 0.1.
Most models and methods presented here require quantitative data only. The two non-quantitative
feature types, binary and nominal, can be pre-processed into a quantitative format as follows.
A binary feature can be recoded into 1/0 format by substituting 1 for “Yes” and 0 for “No”. Then the
recoded feature can be considered quantitative, because its averaging is meaningful: the average value
is equal to the proportion of unities, that is, the frequency of “Yes” in the original feature.
A nominal feature is first enveloped into a set of binary “Yes”/”No” features corresponding to
individual categories. In Table 0.1, binary features yielded by categories of feature “Sector” are:
Is it Retail? Is it Utility? Is it Industrial?
They are phrased as questions so that each admits a "Yes" or "No" answer. These binary features can then be
converted to the quantitative format as advised above, by recoding "Yes" as 1 and "No" as 0.
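As a small illustration, the following MATLAB/Octave sketch (variable names are illustrative, not from the text) recodes the EC and Sector features of Table 0.1 in exactly this way; the average of the recoded binary feature equals the proportion of "Yes" answers.

```matlab
% Recoding the non-quantitative Company features (names are illustrative).
sector = {'Utility';'Utility';'Industrial';'Utility'; ...
          'Industrial';'Industrial';'Retail';'Retail'};
ec     = {'No';'No';'No';'Yes';'Yes';'Yes';'Yes';'Yes'};

ec01   = double(strcmp(ec, 'Yes'));            % binary EC feature -> 1/0
util   = double(strcmp(sector, 'Utility'));    % three binary columns
indu   = double(strcmp(sector, 'Industrial')); % standing for the nominal
retail = double(strcmp(sector, 'Retail'));     % Sector categories

mean(ec01)   % 0.625, the proportion of "Yes" answers in EC
```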
Table 0.2 Data from Table 0.1 converted to the quantitative format.
Code   Income   SharP   NSup   EC   Util   Indu   Retail
1      19.0     43.7    2      0    1      0      0
2      29.4     36.0    3      0    1      0      0
3      23.9     38.0    3      0    0      1      0
4      18.4     27.9    2      1    1      0      0
5      25.7     22.3    3      1    0      1      0
6      12.1     16.9    2      1    0      1      0
7      23.9     30.2    4      1    0      0      1
8      27.2     58.0    5      1    0      0      1
Case 0.3.2: Iris data set
Figure 0.1. Sepal and petal in an Iris flower.
This popular data set describes 150 Iris specimens, representing three taxa of Iris flowers, I Iris setosa
(diploid), II Iris versicolor (tetraploid) and III Iris virginica (hexaploid), 50 specimens from each.
Each specimen is measured on four morphological variables: sepal length (w1), sepal width (w2), petal
length (w3), and petal width (w4) (see Figure 0.1).
Table 0.3. Iris data: 150 Iris specimens measured over four features each.
#
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
I Iris setosa
w1
5.1
4.4
4.4
5.0
5.1
4.9
5.0
4.6
5.0
4.8
4.8
5.0
5.1
5.0
5.1
4.9
5.3
4.3
5.5
4.8
5.2
4.8
4.9
4.6
5.7
5.7
4.8
5.2
4.7
4.5
5.4
5.0
4.6
5.4
5.0
5.4
4.6
5.1
5.8
5.4
5.0
5.4
5.1
4.4
5.5
5.1
4.7
4.9
5.2
5.1
w2
3.5
3.2
3.0
3.5
3.8
3.1
3.2
3.2
3.3
3.4
3.0
3.5
3.3
3.4
3.8
3.0
3.7
3.0
3.5
3.4
3.4
3.1
3.6
3.1
4.4
3.8
3.0
4.1
3.2
2.3
3.4
3.0
3.4
3.9
3.6
3.9
3.6
3.8
4.0
3.7
3.4
3.4
3.7
2.9
4.2
3.4
3.2
3.1
3.5
3.5
w3
1.4
1.3
1.3
1.6
1.6
1.5
1.2
1.4
1.4
1.9
1.4
1.3
1.7
1.5
1.9
1.4
1.5
1.1
1.3
1.6
1.4
1.6
1.4
1.5
1.5
1.7
1.4
1.5
1.6
1.3
1.7
1.6
1.4
1.3
1.4
1.7
1.0
1.5
1.2
1.5
1.6
1.5
1.5
1.4
1.4
1.5
1.3
1.5
1.5
1.4
w4
0.3
0.2
0.2
0.6
0.2
0.2
0.2
0.2
0.2
0.2
0.1
0.3
0.5
0.2
0.4
0.2
0.2
0.1
0.2
0.2
0.2
0.2
0.1
0.2
0.4
0.3
0.3
0.1
0.2
0.3
0.2
0.2
0.3
0.4
0.2
0.4
0.2
0.3
0.2
0.2
0.4
0.4
0.4
0.2
0.2
0.2
0.2
0.1
0.2
0.2
II Iris versicolor
III Iris virginica
w1
6.4
5.5
5.7
5.7
5.6
7.0
6.8
6.1
4.9
5.8
5.8
5.5
6.7
5.7
6.7
5.5
5.1
6.6
5.0
6.9
5.0
5.6
5.6
5.8
6.3
6.1
5.9
6.0
5.6
6.7
6.2
5.9
6.3
6.0
5.6
6.2
6.0
6.5
5.7
6.1
5.5
5.5
5.4
6.3
5.2
6.4
6.6
5.7
6.1
6.0
w1
6.3
6.7
7.2
7.7
7.2
7.4
7.6
7.7
6.2
7.7
6.8
6.4
5.7
6.9
5.9
6.3
5.8
6.3
6.0
7.2
6.2
6.9
6.7
6.4
5.8
6.1
6.0
6.4
5.8
6.9
6.7
7.7
6.3
6.5
7.9
6.1
6.4
6.3
4.9
6.8
7.1
6.7
6.3
6.5
6.5
7.3
6.7
5.6
6.4
6.5
w2
3.2
2.4
2.9
3.0
2.9
3.2
2.8
2.8
2.4
2.7
2.6
2.4
3.0
2.8
3.1
2.3
2.5
2.9
2.3
3.1
2.0
3.0
3.0
2.7
2.3
3.0
3.0
2.7
2.5
3.1
2.2
3.2
2.5
2.9
2.7
2.9
3.4
2.8
2.8
2.9
2.5
2.6
3.0
3.3
2.7
2.9
3.0
2.6
2.8
2.2
w3
4.5
3.8
4.2
4.2
3.6
4.7
4.8
4.7
3.3
3.9
4.0
3.7
5.0
4.1
4.4
4.0
3.0
4.6
3.3
4.9
3.5
4.5
4.1
4.1
4.4
4.6
4.2
5.1
3.9
4.7
4.5
4.8
4.9
4.5
4.2
4.3
4.5
4.6
4.5
4.7
4.0
4.4
4.5
4.7
3.9
4.3
4.4
3.5
4.0
4.0
w4
1.5
1.1
1.3
1.2
1.3
1.4
1.4
1.2
1.0
1.2
1.2
1.0
1.7
1.3
1.4
1.3
1.1
1.3
1.0
1.5
1.0
1.5
1.3
1.0
1.3
1.4
1.5
1.6
1.1
1.5
1.5
1.8
1.5
1.5
1.3
1.3
1.6
1.5
1.3
1.4
1.3
1.2
1.5
1.6
1.4
1.3
1.4
1.0
1.3
1.0
w2
3.3
3.3
3.6
3.8
3.0
2.8
3.0
2.8
3.4
3.0
3.0
2.7
2.5
3.1
3.0
3.4
2.7
2.7
3.0
3.2
2.8
3.1
3.1
3.1
2.7
3.0
2.2
3.2
2.8
3.2
3.0
2.6
2.8
3.0
3.8
2.6
2.8
2.5
2.5
3.2
3.0
3.3
2.9
3.0
3.0
2.9
2.5
2.8
2.8
3.2
w3
6.0
5.7
6.1
6.7
5.8
6.1
6.6
6.7
5.4
6.1
5.5
5.3
5.0
5.1
5.1
5.6
5.1
4.9
4.8
6.0
4.8
5.4
5.6
5.5
5.1
4.9
5.0
5.3
5.1
5.7
5.2
6.9
5.1
5.2
6.4
5.6
5.6
5.0
4.5
5.9
5.9
5.7
5.6
5.5
5.8
6.3
5.8
4.9
5.6
5.1
w4
2.5
2.1
2.5
2.2
1.6
1.9
2.1
2.0
2.3
2.3
2.1
1.9
2.0
2.3
1.8
2.4
1.9
1.8
1.8
1.8
1.8
2.1
2.4
1.8
1.9
1.8
1.5
2.3
2.4
2.3
2.3
2.3
1.5
2.0
2.0
1.4
2.1
1.9
1.7
2.3
2.1
2.5
1.8
1.8
2.2
1.8
1.8
2.0
2.2
2.0
The taxa are defined by the genotype whereas the features are of the appearance (phenotype). The
question arises whether the taxa can be described, and indeed predicted, in terms of the features or not.
It is well known from previous studies that taxa II and III are not well separated in the variable space.
Some non-linear machine learning techniques such as Neural Nets \cite{Ha99} can tackle the problem
and produce a decent decision rule involving a non-linear transformation of the features. Unfortunately,
rules derived with Neural Nets are not comprehensible to humans and, thus, cannot be used for
interpretation and description. The human mind needs a somewhat less artificial logic, capable of
reproducing and extending botanists' observations, such as that the petal area, roughly expressed by the
product of w3 and w4, provides much better resolution than the original linear sizes. Other
problems that are of interest: (a) visualise the data, and (b) build a predictor of sepal sizes from the
petal sizes.
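As a hedged illustration of the petal-area observation, the following sketch assumes MATLAB's Statistics Toolbox copy of the Iris data (load fisheriris, which supplies meas and species); it is not part of the text.

```matlab
% Sketch: the petal-area transformation w3*w4 mentioned above.
% Assumes the Statistics Toolbox sample data set is available.
load fisheriris                     % meas: 150x4 = [w1 w2 w3 w4], species: taxon labels
petalArea = meas(:,3).*meas(:,4);   % rough petal area
grpstats(petalArea, species)        % group means separate the taxa better than w3 or w4 alone
```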
Case 0.3.3: West Country Market towns
In Table 0.4 a set of Market towns in the West Country, England, is presented along with features
characterising population and social infrastructure. For the purposes of social monitoring, it is good to
have a small number of groups representing clusters of similar towns. In the Table, the
towns are sorted according to their population sizes. One can see that 21 towns have fewer than 4,000
residents. The number 4,000 is taken as a divider since it is round and, more importantly, there is a gap
of more than thirteen hundred residents between Kingskerswell (3672 inhabitants) and the next in the list,
Looe (5022 inhabitants). The next big gap occurs after Liskeard (7044 inhabitants), separating the nine
middle-sized towns from two larger town groups containing six and nine towns respectively. The
divider between the latter groups is taken between Tavistock (10222) and Bodmin (12553). In this
way, we get three or four groups of towns for monitoring. But is this enough, regarding the other
features available? Are the resident groups homogeneous enough for the purposes of monitoring?
As further computations will show, the numbers of services on average do follow the town sizes, but
the set (as well as the complete set of about thirteen hundred England Market towns) is much better
represented with seven somewhat different clusters: large towns of about 17-20,000 inhabitants, two
clusters of medium sized towns (8-10,000 inhabitants), three clusters of small towns (about 5,000
inhabitants), and a cluster of very small settlements with about 2,500 inhabitants. Each of the three
small town clusters is characterized by the presence of a facility, which is absent in two others: a Farm
market, a Hospital and a Swimming pool, respectively, which may be considered not quite important.
Then the only difference between clusters and the grouping over town resident numbers would be
different dividing points. However, one should not forget that the number of residents has been
selected by us because of our knowledge that this is the feature highly affecting all the other features of
town life. In the absence of such knowledge, the population size should come as an important feature
after, not prior to, the computation.
The data in Table 0.4 involve the following 12 features as observed in the census 1991:
Pop   - Population resident
PSch  - Primary schools
Doct  - General Practitioners
Hosp  - Hospitals
Bank  - Banks
Sstor - Superstores
Petr  - Petrol stations
DIY   - Do It Yourself shops
Swim  - Swimming pools
Post  - Post offices
CAB   - Citizen Advice Bureaus
FMar  - Farmer markets
Table 0.4. Data of West Country England Market Towns 1991.
Town
Pop
Mullion
2040
So Brent
2087
St Just
2092
St Columb
2119
Nanpean
2230
Gunnislake 2236
Mevagissey 2272
Ipplepen
2275
Be Alston
2362
Lostwithiel 2452
St Columb
2458
Padstow
2460
Perranporth 2611
Bugle 2695 2
Buckfastle 2786
St Agnes
2899
Porthleven 3123
Callington 3511
Horrabridge 3609
Ashburton
3660
Kingskers
3672
Looe
5022
Kingsbridge 5258
Wadebridge 5291
Dartmouth
5676
Launceston 6466
Totnes
6929
Penryn
7027
Hayle
7034
Liskeard
7044
Torpoint
8238
Helston
8505
St Blazey
8837
Ivybridge
9179
St Ives
10092
Tavistock 10222
Bodmin
12553
Saltash
14139
Brixham
15865
Newquay
17390
Truro
18966
Penzance
19709
Falmouth
20297
St Austell 21622
Newton Abb 23801
PSch
1
1
1
1
2
2
1
1
1
2
1
1
1
0
2
1
1
1
1
1
1
1
2
1
2
4
2
3
4
2
2
3
5
5
4
5
5
4
7
4
9
10
6
7
13
Doct
0
1
0
0
1
1
1
1
0
1
0
0
1
0
1
1
0
1
1
0
0
1
1
1
0
1
1
1
0
2
3
1
2
1
3
3
2
2
3
4
3
4
4
4
4
Hosp
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
1
0
0
0
1
0
1
2
0
1
0
0
0
1
1
1
1
1
1
1
1
2
1
Bank Sstor Petr
2
1
2
2
0
1
1
0
1
2
0
3
1
0
1
2
1
3
2
2
0
2
7
5
4
8
7
2
2
6
3
7
1
3
7
7
6
4
5
12
19
12
11
14
13
0
1
1
1
0
0
0
0
1
0
1
0
1
1
2
1
1
1
1
1
1
1
1
3
4
4
2
4
2
2
2
2
1
1
2
3
3
2
5
5
4
7
3
6
4
1
0
1
1
0
1
0
1
0
1
3
0
2
0
2
1
0
1
1
2
2
1
2
1
1
4
1
1
2
3
1
3
4
4
2
3
5
3
3
4
5
5
2
4
7
DIY
Swim
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
1
0
0
2
1
0
3
1
0
0
0
0
0
0
0
0
0
0
0
0
0
2
1
0
0
1
0
1
0
1
0
1
0
1
1
0
0
1
0
1
0
0
0
2
1
1
2
1
2
1
1
1
1
Post CAB
1
1
1
1
2
3
1
1
1
1
2
1
2
0
1
2
1
1
2
1
1
3
1
1
2
3
4
3
2
2
2
1
4
1
4
3
2
3
5
5
7
7
9
8
7
FMar
0
0
0
1
0
0
0
0
0
0
0
1
0
0
1
0
0
0
0
1
0
1
1
1
1
1
0
1
1
2
1
1
0
1
1
1
1
1
1
1
1
2
1
1
2
0
0
0
0
0
0
0
0
0
1
0
0
0
1
0
0
0
0
0
0
0
1
0
1
0
1
0
0
0
0
1
0
0
0
1
0
0
0
0
1
0
0
1
0
Case 0.3.4: Student data
In Table 0.5, a fictitious data set is presented imitating data on Birkbeck University of London part-time
students pursuing a Master's degree in Computer Science. The data refer to a hundred students
along with six features, three of which are personal characteristics (1. Occupation (Occ):
either Information Technology (IT) or Business Administration (BA) or anything else (AN); 2. Age, in
years; 3. Number of children (Chi)) and three of which are their marks for courses in Software and
Programming (SEn), Object-Oriented Programming (OOP), and Computational Intelligence (CI).
Related questions are:
- Whether the students’ marks are affected by the personal features;
- Are there any patterns in marks, especially in relation to occupation?
Table 0.5. Student data in two columns.
Occ
Age
Chi
SEn
OOP
CI
Occ
Age
Chi
SEn
OOP
CI
IT
IT
IT
IT
IT
IT
IT
IT
IT
IT
IT
IT
IT
IT
IT
IT
IT
IT
IT
IT
IT
IT
IT
IT
IT
IT
IT
IT
IT
IT
IT
IT
IT
IT
IT
BA
BA
BA
BA
BA
BA
BA
BA
BA
BA
BA
BA
BA
BA
BA
28
35
25
29
39
34
24
37
33
23
24
32
33
27
32
29
21
21
26
20
28
34
22
21
32
32
20
20
24
32
21
27
33
34
34
36
35
36
37
42
30
28
38
49
50
34
31
49
33
43
0
0
0
1
0
0
0
1
1
1
1
0
0
1
1
0
0
0
1
1
1
1
0
1
1
0
1
1
1
0
1
1
0
1
0
2
2
1
1
2
3
1
1
2
2
2
2
3
1
0
41
57
61
69
63
62
53
59
64
43
68
67
58
48
66
55
62
53
69
42
57
49
66
50
60
42
51
55
53
57
58
43
67
63
64
86
79
55
59
76
72
48
49
59
65
69
90
75
61
69
66
56
72
73
52
83
86
65
64
85
89
98
74
94
73
90
91
59
70
76
85
78
73
72
55
72
69
66
92
87
97
78
52
80
90
54
72
44
69
61
71
55
75
50
56
42
55
52
61
62
90
60
79
72
88
80
60
69
58
90
65
53
81
87
62
61
88
56
89
79
85
59
69
54
85
73
64
66
86
66
54
59
53
74
56
68
60
57
45
68
46
65
61
44
59
59
61
42
60
42
BA
BA
BA
BA
BA
BA
BA
BA
BA
BA
BA
BA
BA
BA
BA
BA
BA
BA
BA
AN
AN
AN
AN
AN
AN
AN
AN
AN
AN
AN
AN
AN
AN
AN
AN
AN
AN
AN
AN
AN
AN
AN
AN
AN
AN
AN
AN
AN
AN
AN
51
44
49
27
30
47
38
49
45
44
36
31
31
32
38
48
39
47
39
23
34
33
31
25
40
41
42
34
37
24
34
41
47
28
28
46
27
44
47
27
27
21
22
39
26
45
25
25
50
33
2
3
3
2
1
0
2
1
0
2
3
2
3
3
0
1
2
1
2
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
1
0
75
53
86
93
75
46
86
76
80
50
66
64
53
87
87
68
93
52
88
54
46
51
59
51
41
44
40
47
45
47
50
37
43
50
39
51
41
50
48
47
49
59
44
45
43
45
42
45
48
53
73
43
39
58
74
36
70
36
56
43
64
45
72
40
56
71
73
48
52
50
33
38
45
41
61
43
56
69
50
68
63
67
35
62
66
36
35
61
59
56
60
57
65
41
47
39
31
33
64
44
57
60
62
62
70
36
47
66
47
72
62
38
38
35
44
56
53
63
58
41
25
51
35
53
22
44
58
32
56
24
23
29
57
23
31
60
28
40
32
47
58
51
47
25
24
21
32
53
59
21
17
Case 0.3.5: Intrusion attack data
With the growing range and scope of computer networks, their security becomes an issue of urgency.
An attack on a network results in its malfunctioning, the simplest case of which is the denial of service. The
denial of service is caused by an intruder who makes some computing or memory resource too busy or
too full to handle legitimate requests. It can also deny access to a machine. Two of the denial-of-service
attacks are known as apache2 and smurf. An apache2 intrusion attacks the very popular free
software/open source web server Apache and results in denying services to a client that sends a
request with many http headers. The smurf attack acts by echo-flooding a victim, via an intermediary that
may be the victim itself. The attacking machine may send a single spoofed packet to the broadcast
address of some network so that every machine on that network responds by sending a packet to
the victim machine. In fact, the attacker sends a stream of icmp 'ECHO' requests to the broadcast
address of many subnets; this results in a stream of 'ECHO' replies that flood the victim. Other types
of attack include user-to-root attacks and remote-to-local attacks. Some internet protocols are liable to
specific types of attack, as just described above for icmp (Internet Control Message Protocol), which
relates to network functioning; other protocols such as tcp (Transmission Control Protocol) or udp
(User Datagram Protocol) supplement the conventional ip (Internet Protocol) and may be subject to many
other types of intrusion attacks.
A probe intrusion looking for flaws in the networking might precede an attack. A powerful probe
tool is SAINT, the Security Administrator's Integrated Network Tool, which uses a thorough
deterministic protocol to scan various network services. Intrusion detection systems collect
information on anomalies and other patterns of communication, such as compromised user accounts and
unusual login behaviour.
The data set Intrusion consists of a hundred communication packets along with some of their features,
sampled from the file publicly available on the web \cite{St00}. The features reflect the packet as well as the
activities of its source:
1 - protocol-type, which can be either tcp or icmp or udp (nominal feature),
2 - BySD, the number of data bytes from source to destination,
3 - SHCo, the number of connections to the same host as the current one in the past two seconds,
4 - SSCo, the number of connections to the same service as the current one in the past two seconds,
5 - SEr, the rate of connections (per cent in SHCo) that have SYN errors,
6 - REr , the rate of connections (per cent in SHCo) that have REJ errors,
7 – Attack, the type of attack (apache, saint, smurf as explained above, and no attack (norm)).
Of the hundred entities in the set, the first 23 have been attacked by apache2, the consecutive packets
24 to 79 are normal, the eleven entities 80 to 90 bear data on a saint probe, and the last ten, 91 to 100,
reflect the smurf attack.
Table 0.6. Intrusion data.
Prot
tcp
tcp
tcp
tcp
tcp
tcp
tcp
tcp
tcp
tcp
tcp
tcp
BySD
62344
60884
59424
59424
59424
75484
76944
59424
57964
59424
0
0
SHCo
16
17
18
19
20
21
22
23
24
25
40
41
SSCo
16
17
18
19
20
21
22
23
24
25
40
41
SEr
0
0.06
0.06
0.05
0.05
0.05
0.05
0.04
0.04
0.04
1
1
REr
0.94
0.88
0.89
0.89
0.9
0.9
0.91
0.91
0.92
0.92
0
0
Attack
apach
apach
apach
apach
apach
apach
apach
apach
apach
apach
apach
apach
Prot
tcp
tcp
tcp
udp
udp
udp
udp
udp
udp
udp
udp
udp
BySD
287
308
284
105
105
105
105
105
44
44
42
105
SHCo
14
1
5
2
2
2
2
2
3
6
5
2
SSCo
14
1
5
2
2
2
2
2
8
11
8
2
SEr
0
0
0
0
0
0
0
0
0
0
0
0
REr
0
0
0
0
0
0
0
0
0
0
0
0
Attack
norm
norm
norm
norm
norm
norm
norm
norm
norm
norm
norm
norm
tcp
tcp
tcp
tcp
tcp
tcp
tcp
tcp
tcp
tcp
tcp
tcp
tcp
tcp
tcp
tcp
tcp
tcp
tcp
tcp
tcp
tcp
tcp
tcp
tcp
tcp
tcp
tcp
tcp
tcp
tcp
tcp
tcp
tcp
Tcp
Tcp
Tcp
Tcp
0
0
0
0
0
0
0
0
0
0
0
258
316
287
380
298
285
284
314
303
325
232
295
293
305
348
309
293
277
296
286
311
305
295
511
239
5
288
42
43
44
45
46
47
48
49
40
41
42
5
13
7
3
2
10
20
8
18
28
1
4
13
1
4
6
8
1
13
3
5
9
11
1
12
1
4
42
43
44
45
46
47
48
49
40
41
42
5
14
7
3
2
10
20
8
18
28
1
4
14
8
4
6
8
8
14
6
5
15
25
4
14
1
4
1
1
1
1
1
1
1
1
0.62
0.63
0.64
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0.35
0.34
0.33
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
apach
apach
apach
apach
apach
apach
apach
apach
apach
apach
apach
norm
norm
norm
norm
norm
norm
norm
norm
norm
norm
norm
norm
norm
norm
norm
norm
norm
norm
norm
norm
norm
norm
norm
norm
norm
norm
norm
udp
udp
udp
udp
udp
udp
udp
udp
udp
udp
udp
udp
udp
udp
udp
udp
udp
tcp
tcp
tcp
tcp
tcp
tcp
tcp
tcp
tcp
tcp
tcp
icmp
icmp
icmp
icmp
icmp
icmp
icmp
icmp
icmp
icmp
105
42
105
105
44
105
105
44
105
105
45
45
105
34
105
105
105
0
0
0
0
0
0
0
0
0
0
0
1032
1032
1032
1032
1032
1032
1032
1032
1032
1032
2
2
1
1
2
1
1
3
1
1
3
3
1
5
1
1
1
482
482
482
482
482
482
482
482
482
483
510
509
510
510
511
511
494
509
509
510
511
2
3
1
1
4
1
1
14
1
1
6
6
1
9
1
1
1
1
1
1
1
1
1
1
1
1
1
1
509
510
510
511
511
494
509
509
510
511
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0.05
0.05
0.05
0.05
0.05
0.05
0.06
0.06
0.06
0.06
0.04
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0.95
0.95
0.95
0.95
0.95
0.95
0.94
0.94
0.94
0.94
0.96
0
0
0
0
0
0
0
0
0
0
norm
norm
norm
norm
norm
norm
norm
norm
norm
norm
norm
norm
norm
norm
norm
norm
norm
saint
saint
saint
saint
saint
saint
saint
saint
saint
saint
saint
smurf
smurf
smurf
smurf
smurf
smurf
smurf
smurf
smurf
smurf
1. Aggregating and visualizing a single feature: 1D analysis
Before considering summarization and correlation problems with multidimensional data, let us take a
look at them on the simplest levels possible: one feature for summarization and two features for
correlation. This also will provide us with a stock of useful concepts for nD analysis.
1D data is a set of entities represented by one feature, categorical or quantitative. Let us first consider
the quantitative case. With N entities numbered from i=1, 2, …., N, data is a set of numbers x1,…,xN.
This set will be denoted X={x1,…,xN}.
Distribution (density, histogram):
The distribution is the most comprehensive, and quite impressive for the eye, way of summarization. On
the plane, one draws an x axis and the feature range boundaries, that is, X's minimum a and maximum
b. The range interval is then divided into a number of non-overlapping equal-sized sub-intervals, bins.
To produce n bins, one needs n-1 dividers at points a+k(b-a)/n (k = 1, 2, …, n-1). In fact, the same
formula works for k = 0 and k = n, giving the boundaries a as x0 and b as xn, which is useful for
the next operation, counting the number of entities Nk falling in each of the bins k = 1, 2, ..., n. Note
that bin k has a+(k-1)(b-a)/n and a+k(b-a)/n as its left and right boundaries, respectively. One of
them should be excluded from the bin so that the bins do not overlap. These counts Nk, k = 1,
2, ..., n, constitute the distribution of the feature. A histogram is a visual representation of the
distribution obtained by drawing a rectangle of height Nk over each bin k, k = 1, 2, ..., n (see Figures 1.2 and
1.3). Note that the distribution depends on the choice of the number of bins.
Q. Why are the bins not to overlap? A. So that each entity falls into exactly one bin, and the total of all
counts Nk remains N.
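A minimal MATLAB/Octave sketch of this bin construction follows (the small data set and the number of bins n are illustrative; MATLAB's histcounts function implements the counting):

```matlab
% Build n equal-sized bins over the range [a, b] and count entities per bin.
x = [1 1 5 3 4 1 2];           % any 1D feature; this is the example set used below
n = 4;                         % chosen number of bins
a = min(x);  b = max(x);
edges = a + (0:n)*(b - a)/n;   % boundaries a + k(b-a)/n, k = 0, 1, ..., n
Nk = histcounts(x, edges)      % the distribution N1, ..., Nn (last bin includes b)
bar(Nk)                        % draw the histogram
```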
Figure 1.1. With two bins on the range [a, b], the divider is the mid-range (a+b)/2.
In Figures 1.2 and 1.3, the two most typical types of histograms are presented. The former corresponds to
the so-called power/Pareto law, approximating a function p(x) ≈ a/x. This type is frequent in social
systems. According to numerous empirical studies, such features as wealth, group size, productivity
and the like are all distributed according to a power law, so that very few individual entities create or
possess huge amounts of wealth or members, whereas very many individual entities are left with
virtually nothing. However, they are all important parts of the same system, with the have-nots creating
the environment in which the lucky few can thrive.
Figure 1.2. A power-law type distribution; bin counts 880, 640, 480, 260, 140, with frequencies
36.7%, 26.7%, 20.0%, 10.8%, 5.8%.
Another type, which is frequent in physical systems, is presented in Figure 1.3. This type of
histogram approximates the so-called normal, or Gaussian, law p(x) ≈ exp[-(x-a)²/(2σ²)]. Distributions
of measurement errors and, in general, of features resulting from many small random effects are thought to be
Gaussian, which can be formally proven within the mathematical framework of probability theory.
The parameters of this distribution, a and σ, have natural meanings, with a expressing the expectation,
or mean, and σ² the variance, which naturally translate into terms of the empirical distributions
introduced below.
Figure 1.3. A Gaussian-type distribution (bell curve); bin counts 700, 550, 500, 350, 300, with
frequencies 29.2%, 22.9%, 20.8%, 14.6%, 12.5%.
Q: Consider distributions of sizes in Iris data and Population and Bank in Market Town data at
different bin numbers. Can you tell for each of them, which of the two types it is similar to?
Another popular visualization of distributions is the pie-chart, in which the proportions of the counts Nk
are expressed by the sizes of the sectored slices of a round pie (see Figure 1.4).
As one can see, these two types of visualization provide for the perception of two different aspects of the
distribution: the former shows the actual envelope of the distribution along the x axis, whereas the
latter caters for the relative sizes of the distribution chunks falling into different bins. There are a dozen
more formats for visualizing distributions, such as bubble, doughnut and radar charts, easily
available in the Microsoft Excel spreadsheet.
Figure 1.4. Pie-chart for the histogram of Figure 1.3, with slices of 28%, 23%, 21%, 15% and 13%.
For categorical features, there is no need to define bins: the categories themselves play the role of
bins. Assuming that the categories form a set V, Nv and pv are defined, respectively, as the
number and the proportion of entities falling in category v ∈ V.
Further aggregates
Further summarization of the data leads to presenting all the variety with just two real numbers, one
expressing the distribution's location, its "central" point, while the other represents the distribution's
variation, or spread. We review the most popular characteristics of both.
Center:
C1. Mean of set x1, x2,…, xN is defined as the arithmetic average:
c = (x1 + x2 + … + xN) / N     (1.1)
Example: For set 1, 1, 5, 3, 4, 1, 2, mean is (1+1+5+3+4+1+2)/7=17/7=2.42857…
This is as close an approximation to the numbers as one can get with a single value (in the least-squares
sense explained below). However, the mean is not stable against outliers. This is why it is a good idea
to remove a couple of observations on both extremes of the data range, the minimum and the maximum,
before computing the mean.
Example: There are no outliers in the previous example. Still, if we remove one minimum value
and one maximum value from the set, it becomes 1, 3, 4, 1, 2. The mean of the corrected set is
(1+3+4+1+2)/5 = 11/5 = 2.2, not a big change.
C2. Median of set x1, x2,…, xN is defined as follows. First, pre-process the data by sorting them,
either in descending or ascending order, so that xs1 ≤ xs2 ≤ … ≤ xsN, assuming the ascending
order, where xsi is the i-th element of the ordered series. If N is odd, that is, N = 2n+1 for a whole number n,
then the median m is the middle element of the order, that is, m = xs,n+1. Otherwise, N is even, so
that N = 2n for some integer n > 0; the median is usually defined in this case as the middle of the interval
between the two elements in the middle, xsn and xs,n+1, that is, m = (xsn + xs,n+1)/2.
Example: For the set from the previous example, 1, 1, 5, 3, 4, 1, 2, its sorted version is 1, 1, 1, 2, 3, 4,
5. The median is equal to the element in the middle, which is 2. This is rather far away from the mean,
2.43, which evidences that the distribution is biased towards the left end, the smaller entities.
The more symmetric a distribution is, the closer its mean and median are.
As seems rather obvious, the median is very stable against outliers: the values on the extremes just do
not affect the middle of the sorted set.
C3. Midrange mr is the middle of the range, mr=(Max(xi) + Min(xi))/2.
This corresponds to the mean of a flat distribution, in which all bins are equally likely. In contrast to
the mean and median, the midrange depends only on the range, not on the distribution. It is obviously
highly sensitive to outliers, that is, changes of the maximum and/or minimum values of the sample.
Example: For the previous set, Max(xi) = 5 and Min(xi) = 1, which makes mr = (5+1)/2 = 3 – the value of
the mean in the case when all values in the range are equally likely.
C4. Mode is the most likely bin, which obviously depends on the bin size. In Figure 1.2, the mode
is the bin on the left side; in Figure 1.3, the bin in the middle. This is a rather local characteristic of the
distribution, and it is not going to be used often in this text.
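A quick MATLAB/Octave check of the three centre measures on the example set used above (the values match the worked examples):

```matlab
x  = [1 1 5 3 4 1 2];
c  = mean(x)              % 2.4286..., the arithmetic average (1.1)
m  = median(x)            % 2, the middle of the sorted set 1 1 1 2 3 4 5
mr = (max(x) + min(x))/2  % 3, the midrange
```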
Spread:
Each of the characteristics of spread below is, to an extent, parallel to that of the center under the same
number.
S1. Standard deviation s
This conventionally called "standard" value s is rather unconventional from the intuitive point of view.
It is defined as the square root of the variance, s = √Variance, where the variance s² is the mean of the
squared deviations of the X values from their mean, that is, the sum of the squared errors (xi - c)²,
i = 1, 2, …, N, divided by N, where c is the mean defined in (1.1):

s² = [(x1 - c)² + (x2 - c)² + … + (xN - c)²] / N     (1.2)
The choice of the square root of the average squares for aggregating the deviations of observations
from their center comes from the least-squares approach, which will be explained later on pp. ……
In many packages, especially those of a statistical flavour, the divisor in (1.2) is taken to be N-1 rather than
N. This is because in mathematical statistics the divisor is N only in the case when c is pre-specified or
given by an oracle, that is, derived not from the data but from domain knowledge, as, for example,
in coin throwing, where it is assumed from symmetry that the mean proportion of tails should be ½. If,
otherwise, c is derived from the data, such as the mean (1.1), then the divisor, according to
mathematical statistics, must be N-1, because equation (1.1) deriving c from the data is a relation
imposed on the N observed values, thus decreasing the number of degrees of freedom from N to N-1. We explain
the view of mathematical statistics in greater detail later on page ….. .
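This distinction is mirrored in MATLAB/Octave's own var function, which may serve as a quick check (not from the text):

```matlab
x = [1 1 5 3 4 1 2];
var(x, 1)   % divisor N:   variance as in (1.2), c treated as given
var(x)      % divisor N-1: the default, c estimated from the data
```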
S2. Absolute deviation sm
The absolute deviation is defined as the mean of absolute deviations from the center, |xi – m|, which in
this case is usually taken to be median m rather than mean c:
sm = [|x1 - m| + |x2 - m| + … + |xN - m|] / N     (1.3)
S3. Range r
This is probably the simplest measure, just the length of the interval covered by the data X, r = Max(xi) - Min(xi). Obviously, it may be rather unstable and biased by outliers.
It can be proven that both the standard deviation and the absolute deviation are at most half of the
range (Mirkin 2005).
S4. Quantile range (histogram's extremes cut off)
This measure is a stable version of the range, which should be used at really large N values. A
proportion p, typically within the range 0.1-25%, is specified. The 2p-quantile range utilises the
upper p-quantile, which is a value xp of X such that the proportion of entities with values larger than xp
is p. Similarly, the lower p-quantile px is defined: the proportion of entities with values less than px is p. The
2p-quantile range is the interval between the two p-quantiles, stretched up according to the proportion of
entities taken out, (xp - px)/(1-2p). It can be used as a rather stable characteristic of the actual range of
X, short of random and wild outliers at both extremes. The value of p should be taken rather small,
say 0.05% at N of the order of 100,000: then xp cuts off the 50 largest and px the 50 smallest values of X. This
measure is still rather unusual at present, as are data of such large sizes.
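The spread measures S1-S3 can be checked on the same example set with a few MATLAB/Octave lines (a sketch, not from the text):

```matlab
x  = [1 1 5 3 4 1 2];
s  = sqrt(mean((x - mean(x)).^2))   % standard deviation with divisor N, as in (1.2): about 1.50
sm = mean(abs(x - median(x)))       % absolute deviation about the median, as in (1.3): about 1.29
r  = max(x) - min(x)                % range: 4
```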
Parameters of the Gaussian distribution
In classical mathematical statistics, the set X = {x1, x2,…, xN} is usually considered a random sample from
a population defined by a probabilistic distribution with density f(x), in which each element xi is
sampled independently of the others. This involves the assumption that each observation xi is
modelled by the distribution f(xi), so that the mean's model is the average of the distributions f(xi). The
population analogues of the mean and variance are defined over f(x) so that, obviously, the mean (1.1),
the median and the midrange are unbiased estimates of the population mean. Moreover, the variance of
the mean is N times less than the population variance, so that the standard deviation of the mean tends to
decrease as √N when N grows.
If we further assume that the population's probabilistic distribution is Gaussian N(μ, σ) with density
function

f(u, μ, σ) = C exp{-(u - μ)² / (2σ²)},     (1.4)

then c in (1.1) is an estimate of μ and s in (1.2) an estimate of σ in f(u, μ, σ). These parameters are the
population analogues of the concepts of the mean and variance, so that, for example, μ = ∫ u f(u, μ, σ) du,
where the integral is taken over the entire axis u. To be unbiased, s² must have N-1 as its divisor if c
from (1.1) stands in place of μ in the formula for s².
In some real life situations, the assumption that X is an independent random sample from the same
distribution seems rather adequate. However, in most real-world databases and multivariate samplings
this assumption is far from realistic.
Centre and spread in integral perspectives
The concepts of the center and spread can be formulated within the same perspective, such as those of
approximation, data recovery and probabilistic statistics, which are presented here in turn.
1.
Approximation perspective
Given a series X={x1,…,xN}, define the centre as a minimizing the average distance
D(X,a)=[d(x1,a)+d(x2,a)+…+d(xN,a)]/N
(1.5)
The following statements can be proven mathematically.
If d(x, a) = |x − a|² in (1.5), then the solution a is the mean (1.1), and D(X, a) the variance (1.2).
If d(x, a) = |x − a| in (1.5), then the solution a (centre) is the median, and D(X, a) the absolute deviation.
If D(X, a) is defined not by the sum but by the maximum distance, D(X, a) = max(d(x1, a), d(x2, a), …, d(xN, a)), then the midrange is the solution, for each of the d(x, a) specified, either |x − a|² or |x − a|.
These three properties explain the parallels between the centers C1, C2 and C3 and the corresponding spread evaluations S1, S2, S3: each of the centers minimizes its corresponding measure of spread.
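These statements are easy to check computationally. A minimal MatLab sketch (the data series and variable names are illustrative only):

>> x = [1 2 2 3 7 10 15];                       % an arbitrary data series
>> c = mean(x); m = median(x); mr = (max(x)+min(x))/2;   % the three centres C1-C3
>> D2 = @(a) mean((x-a).^2);                    % average squared distance, cf. S1
>> D1 = @(a) mean(abs(x-a));                    % average absolute distance, cf. S2
>> Dm = @(a) max(abs(x-a));                     % maximum distance, cf. S3
>> [D2(c) D2(m) D2(mr)]                         % the smallest value is at the mean c
>> [D1(c) D1(m) D1(mr)]                         % the smallest value is at the median m
>> [Dm(c) Dm(m) Dm(mr)]                         % the smallest value is at the midrange mr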
2. Data recovery perspective
In the data recovery perspective, it is assumed that the observed values are but noisy realisations of an
unknown value a. This is reflected in the form of a model like
xi = a + ei, for all i=1,2,…, N
(1.6)
in which ei are additive errors to be minimised. In classical mathematical statistics, errors are usually
modelled as random values independently drawn from the same distribution. This suggests a good
degree of knowledge and stability of the process. In many real world applications the model (1.6) itself
cannot be considered adequate enough, which also generates errors. When no assumptions on the
nature of errors are made, they are frequently referred to as residuals.
One cannot minimize all the residuals in (1.6) simultaneously. Thus, an integral criterion should be
formulated to embrace all the residuals. Most popular are:
(1) Least-squares criterion L2=e12+ e22 +…+ eN2 ; its minimisation over unknown a is equivalent to
the task of minimizing the average squared distance, thus leading to the mean, optimal a=c.
(2) Least-modules criterion L1=|e1|+|e2|+…+ |eN|; its minimisation over unknown a is equivalent to
the task of minimizing the average absolute deviation leading to the median, optimal a=m.
(3) Least-maximum criterion L= max(|e1|, |e2|, … |eN|); its minimisation over unknown a is
equivalent to the task of minimizing the maximum deviation leading to the midrange, optimal a=mr.
Formulations (1)-(3) may look just as trivial reformulations of the special cases of the approximation
criterion (1.5). This, however, is not exactly so. The equation (1.6) allows for a decomposition of the
data scatter involving the corresponding data recovery criterion.
This is rather straightforward for the least-squares criterion L2, whose minimal value, at a = c, is L2 = (x1 − c)² + (x2 − c)² + … + (xN − c)². With a little algebra, this becomes L2 = x1² + x2² + … + xN² − 2c(x1 + x2 + … + xN) + Nc² = x1² + x2² + … + xN² − Nc² = T(X) − Nc², where T(X) is the quadratic data scatter defined as T(X) = x1² + x2² + … + xN².
This leads to the equation T(X) = Nc² + L2, decomposing the data scatter in two parts: that explained by the model (1.6), Nc², and that unexplained, L2. Since the data scatter is constant, minimizing L2 is equivalent to maximizing Nc². The decomposition of the data scatter allows one to measure the adequacy of model (1.6) by the relative value L2/T(X). Similar decompositions can be derived for the least-modules criterion L1 and other criteria (see Mirkin 1996).
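The decomposition can be verified numerically; a minimal sketch over arbitrary data (names are illustrative):

>> x = [2 4 4 7 13]';                           % arbitrary data
>> N = length(x); c = mean(x);
>> T = sum(x.^2);                               % data scatter T(X)
>> L2 = sum((x-c).^2);                          % unexplained part at a = c
>> [T, N*c^2 + L2]                              % the two entries coincide: T(X) = N*c^2 + L2
>> L2/T                                         % relative unexplained value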
3. Probabilistic perspective
Consider the set X as a random independent sample from a population with, for the sake of simplicity, a Gaussian probabilistic density function f(x) = C*exp{−(x − μ)²/(2σ²)}, where μ and σ² are unknown parameters and C = (2πσ²)^(−½). The likelihood of randomly getting xi then is C*exp{−(xi − μ)²/(2σ²)}. The likelihood of the entire sample X is the product of these values, because they have been assumed to be independent of each other; that is, the likelihood is L(X) = ∏i∈I C*exp{−(xi − μ)²/(2σ²)} = C^N*exp{−Σi∈I (xi − μ)²/(2σ²)}. One may even go further and express L(X) as L(X) = exp{N*ln(C) − Σi∈I (xi − μ)²/(2σ²)}, where ln is the natural logarithm (over base e).
A well established approach in mathematical statistics, the principle of maximum likelihood, claims that the values of μ and σ² best fitting the data X are those at which the likelihood L(X) or, equivalently, its logarithm ln(L(X)), reaches its maximum. Because of the derived formula for L(X), it is easy to see that the maximum of ln(L) = N*ln(C) − Σi∈I (xi − μ)²/(2σ²) is reached at the μ that provides for the minimum of the expression in the exponent, E = Σi∈I (xi − μ)². This shows that the least-squares criterion follows from the assumption that the sample is randomly drawn from a Gaussian population.
Likewise, the optimal σ² minimizes the part of ln(L) depending on it, g(σ²) = −N*ln(σ²)/2 − Σi∈I (xi − μ)²/(2σ²). It is not difficult to prove that the optimal σ² can be found from the first-order optimality condition for g(σ²). Let us take the derivative of the function over σ² and equate it to 0: dg/d(σ²) = −N/(2σ²) + Σi∈I (xi − μ)²/(2(σ²)²) = 0. This equation leads to σ² = Σi∈I (xi − μ)²/N, which means that the variance is the maximum likelihood estimate of this parameter of the Gaussian distribution.
In situations in which the data can be plausibly assumed to randomly come from a Gaussian
distribution, the derivation above justifies the use of the mean and variance as the only theoretically
valid estimates of the data center and spread. The Gaussian distribution has been proven to
approximate well situations in which there are many small independent random effects adding to each
other. However, in many cases the assumption of normality is highly unrealistic, which does not
necessarily lead to rejection of the concepts of the mean and dispersion – they still may be utilised
within the other perspectives above.
Q. Consider a multiplicative model for the error, xi = a(1 + ei), assuming that errors are proportional to the values. Can you find or define what centre a is fitting to the data? A. Consider the least-squares approach. According to this approach, the fit should minimize the summary squared error. Every error can be expressed, from the model, as ei = xi/a − 1 = (xi − a)/a. Thus the criterion can be expressed as L2 = e1² + e2² + … + eN² = (x1/a − 1)² + (x2/a − 1)² + … + (xN/a − 1)². Applying the first-order optimality condition, let us take the derivative of L2 over a and equate it to zero. The derivative is equal to L2′ = −(2/a³)Σi(xi − a)xi. Assuming the optimal value of a is not zero, the first-order condition can be expressed as Σi(xi − a)xi = 0, that is, a = Σi xi²/Σi xi = (Σi xi²/N)/(Σi xi/N). The denominator here is but the mean, c, whereas the numerator can be expressed through the variance s², because of the equation s² = Σi xi²/N − (Σi xi/N)², which is not difficult to prove. With a little algebraic manipulation, the least-squares fit can be expressed as a = s²/c + c. Curiously, the variance-to-mean ratio, equal to a − c according to this derivation, is considered in statistics a good relative estimate of the spread, though for different reasons.
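A short numerical check of this derivation (a sketch over an arbitrary positive data vector):

>> x = [3 5 6 9 12]';                           % arbitrary positive data
>> a = sum(x.^2)/sum(x);                        % least-squares fit of the multiplicative model
>> c = mean(x); s2 = mean(x.^2) - c^2;          % mean and variance
>> [a, s2/c + c]                                % the two values coincide
>> L2 = @(t) sum((x/t - 1).^2);                 % the criterion as a function of the centre
>> [L2(a) L2(a-0.1) L2(a+0.1)]                  % L2 is smallest at the derived a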
Binary features
A feature admitting only two, either “Yes” or “No”, values is conventionally considered Boolean in
Computer Sciences, thus relating to Boolean algebra with its “True” and “False” statement
evaluations. In this course, we code these values by numerals 1, for “Yes”, and 0, for “No”, and use
quantitative operations on them, referring to this type of features as binary ones.
Our interpretation of a two-valued categorical feature as a quantitative one stems from the fact that any
numerical recoding, 0 to  and 1 to , uses just two scaling parameters, that can be one to one
associated with the conventional quantitative scale transformations, the shift of the origin () and
rescaling factor ( - ).
The mean of a 1/0 coded binary feature is the number of ones related to N, that is, the proportion p of its “Yes” values. The median m is 1 if p > 0.5, m = 0 if p < 0.5, and m = 0.5 when p = 0.5 exactly, which can happen only when N is even. The midrange is always ½ for a binary feature. The mode is either 1 or 0 depending on whether p > 0.5 or not (same as the median).
To compute the variance of a binary feature, whose mean is c = p, sum up Np items (1 − p)² and N(1 − p) items p², which altogether leads to s² = p(1 − p). Accordingly, the standard deviation is the square root of p(1 − p). Obviously, this is maximal when p = 0.5, that is, when both binary values are equally likely. The range is always 1. The absolute deviation, in the case when p < 0.5 so that the median m = 0, comprises Np items that are 1 and N(1 − p) items that are 0, so that sm = p. When p > 0.5, m = 1 and the number of unity distances is N(1 − p), leading to sm = 1 − p. That means that, in general, sm = min(p, 1 − p), which is never greater than c.
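These formulas are easy to verify over any 0/1 vector; a minimal sketch:

>> b01 = [1 0 0 1 1 0 0 0 0 1]';                % a binary feature with p = 0.4
>> p = mean(b01);                               % proportion of ones
>> [var(b01,1), p*(1-p)]                        % the variance (with divisor N) equals p(1-p)
>> [mean(abs(b01 - median(b01))), min(p,1-p)]   % the absolute deviation equals min(p, 1-p)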
There are some probabilistic underpinnings to these. Two models are popular, due to Bernoulli and to Laplace. Given p, 0 ≤ p ≤ 1, the Bernoulli model assumes that every xi is either 1, with probability p, or 0, with probability 1 − p. The Laplace model suggests that, among the N binary numerals, a random pN are unities and (1 − p)N are zeros. Both models yield the same mathematical expectation, p. However, their variances differ: the Bernoulli distribution’s variance is p(1 − p), whereas the Laplace distribution’s variance is p, which is obviously greater for all positive p.
There is a rather natural, though somewhat less recognised, relation between quantitative and binary features: the variance of a quantitative feature is never greater than that of a corresponding binary feature. To explicate this, assume the interval [0,1] to be the range of the data X = {x1, …, xN}. Assume that the mean c divides the interval in such a way that a proportion p of the data is greater than or equal to c, whereas the proportion of those smaller than c is 1 − p. The question then is: given p, at what distribution of X is the variance, or its square root the standard deviation, maximized?
Let X be any distribution within the interval [0,1] with its mean at some interior point c. According to the assumption, there are N(1 − p) observations between 0 and c. Obviously, the variance can only increase if we move each of these points to the border, 0. Similarly, the variance will only increase if we push each of the Np points between c and 1 into 1. That means that the variance p(1 − p) of a binary variable with N(1 − p) zero and Np unity values is the maximum, at the given p.
We have proven the following:
A binary variable whose distribution is (p, 1 − p) has the maximum variance, and standard deviation, among all quantitative variables of the same range with a proportion p of entries at or above their average.
This implies that no variable over the range [0,1] has its variance greater than the maximum ¼ reached
by a binary variable at p=0.5. The standard deviation of this binary variable is ½, which is just half of
the range.
The binary variables also have the maximum absolute deviation among the variables of the same
range, which can be proven similarly.
Categorical features with disjoint categories
Sometimes categories by themselves have no quantitative meaning, so that the only comparison they
admit is of being equal or not-equal to each other. Moreover, a categorical feature such as Occupation
in Students data or Protocol in Intrusion data, partitions the entity set so that each entity falls in one
and only one category. Categorical features of this type are sometimes referred to as nominal.
If a nominal feature has L categories l = 1, …, L, its distribution is characterized by the numbers N1, N2, …, NL of entities that fall in each of the categories. Because of the partitioning property, these numbers sum up to the total number of entities, N1 + N2 + … + NL = N. The category frequencies, defined as pl = Nl/N, l = 1, 2, …, L, sum up to unity.
Since categories are non-ordered, categorical feature distributions are better visualized by pie-charts
than by histograms.
The concepts of centrality, except for the mode, are not applicable to categorical feature distributions. Spread is also not quite applicable here. However, the variation, or diversity, of the distribution (p1, p2, …, pL) can be measured. Two indexes are rather popular: the Gini index, or qualitative variance, and the entropy.
The Gini index can be introduced as the average error of the proportional prediction rule. The proportional prediction rule requires predicting each category l, l = 1, 2, …, L, randomly according to the distribution (pl), so that l is predicted in Npl cases out of N. The average error of the predictions of l in this case is equal to 1 − pl, which makes the index equal to:
G = Σl pl(1 − pl) = 1 − Σl pl² , where the sum runs over l = 1, 2, …, L.
This is also the summary variation of the L binary variables corresponding to the categories l = 1, 2, …, L; such a variable answers the question “Does the object fall into category l?”
Entropy is the average value of the quantity of information in each category l as measured by −log(pl), thus defined as

H = −Σl pl log(pl), the sum again running over l = 1, 2, …, L.
This is not too far away from the qualitative variance because, for pl close to 1, −log(pl) = 1 − pl + o(1 − pl), as is known from calculus (see Figure 1.5).
A unifying general formula for the variance of a nominal variable has been suggested:

Sλ = (1 − Σl pl^λ)/(λ − 1)

This obviously leads to the qualitative variance at λ = 2, and to the entropy as λ tends to 1.
Figure 1.5. Graphs of the functions f(p) = 1 − p, involved in the Gini index (straight line), and f(p) = −log(p), involved in the entropy.
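Both indexes, as well as the unifying formula, are one-line computations in MatLab over the vector of category frequencies; a small sketch with illustrative counts:

>> counts = [50 30 20];                         % numbers of entities in L = 3 categories
>> p = counts/sum(counts);                      % category frequencies pl
>> G = 1 - sum(p.^2)                            % Gini index, the qualitative variance
>> H = -sum(p.*log(p))                          % entropy (natural logarithm)
>> lambda = 1.000001;                           % the unifying index approaches H as lambda tends to 1
>> S = (1 - sum(p.^lambda))/(lambda - 1)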
Project 1.1. Analysis of a multimodal distribution
Let us take a look at the distributions of OOP and CI marks at the Student data. Assuming that the data
file of Table 0.4 is stored as Data\studn.dat, the corresponding MatLab commands can be as follows:
>> a=load('Data\studn.dat');
>> oop=a(:,7); %column of OOP mark
>> coi=a(:,8); %column of CI mark
>> subplot(1,2,1); hist(oop);
>> subplot(1,2,2); hist(coi);
With ten bins used in MatLab by default, the histograms are shown in Figure 1.6.
Figure 1.6. Histograms of the distributions of marks for OOP (on the left) and for CI (on the right) from the Students data.
The histogram on the left looks to have three humps, that is, it is three-modal. Typically, a homogeneous sample should have a uni-modal distribution, to allow interpretation of the feature as its modal value with random deviations from it. The fact that there are three modes on the OOP mark histogram requires an explanation. For example, one may hypothesize that the modes can be explained by the presence of three different occupations of students in the data, so that the IT occupation should lead to higher marks than the BA occupation, for which marks should still be higher than those at the AN occupation.
To test this hypothesis, one needs to compare the distributions of OOP marks for each of the occupations. To make the distributions comparable, we need to specify an array of boundaries defining 10 bins to be used for each of the samples. This array, b, can be computed as follows:
>> r=max(oop)-min(oop);for i=1:11;b(i)=min(oop)+(i-1)*r/10;end;
Now we are ready to produce comparable distributions for each of the occupations with MatLab
command histc:
>> for ii=1:3;li=find(a(:,ii)==1);hp(:,ii)=histc(oop(li),b);end;
This generates a list, li, of student indexes corresponding to each of the three occupations presented by
the three binary columns, ii=1:3. Matrix hp represents the three distributions in its three columns.
Obviously, the total distribution of OOP, presented on the left of Figure 1.6 is the sum of these three
columns. To visualise the distributions, one may use “bar” command in MatLab:
>> bar(hp);
which produces bar histograms for each of the three occupations (see Figure 1.7). One can see that the histograms differ indeed and concur with the hypothesis: IT concentrates in the top seven bins and shares the top three bins with no other occupation. The other two occupations overlap more, though AN still takes over at the leftmost, worst-mark, positions.
Q. What would happen if array b is not specified once for all but the histogram is drawn by default for
each of the sub-samples? A. The 10 default bins depend on the data range, which may be different at
different sub-samples; if so, the histograms will be incomparable.
Figure 1.7. Histograms of OOP marks for each of three occupations, IT, BA and AN, each presented
with bars filled in according to the legend.
There can be other hypotheses as well, such as that the modes come from different age groups. To test
that, one should define the age group boundaries first.
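For instance, a possible sketch along the same lines as above, in which the age-group boundaries 25 and 30 are purely illustrative and would have to be justified by the user:

>> age=a(:,4);                                  % Age column of the Students data array
>> g1=find(age<25); g2=find(age>=25 & age<30); g3=find(age>=30);   % three tentative age groups
>> ha=zeros(length(b),3);
>> ha(:,1)=histc(oop(g1),b); ha(:,2)=histc(oop(g2),b); ha(:,3)=histc(oop(g3),b);
>> bar(ha)                                      % compare the OOP mark distributions across the age groups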
Project 1.2. Data mining with a confidence interval: Bootstrap
The data file short.dat is a 50 × 3 array whose columns are samples of the three data types described in Table 1.1:
Data type       Mean      Standard deviation
                          Real value    Per cent of mean, %
Normal          10.27        1.76             17.18
Two-modal       16.92        4.97             29.38
Power law      289.74      914.50            315.63

Table 1.1. Aggregate characteristics of the columns of the short.dat array.
The Normal column is in fact a sample from a Gaussian N(10,2) that has 10 as its mean and 2 as its standard deviation. The other two are the Two-modal and Power law samples. Their 30-bin histograms are shown on the left-hand sides of Figures 1.8, 1.9 and 1.10. Even from the aggregate data in Table 1.1 one can see that the average of the Power law sample does not make much sense, because its standard deviation is more than three times greater than the average.
Many statisticians would question the validity of the characteristics in Table 1.1, not because of the distribution shapes (which would be a justifiable source of concern for at least two of the three distributions) but because of the insufficiency of the samples. Are the 50 entities available a good representation of the entire population indeed? To address these concerns, mathematical statistics has worked out principles based on the assumption that the sampled entities come randomly and independently from a possibly unknown but stationary probabilistic distribution. Mathematical reasoning then allows one, in reasonably well-defined situations, to arrive at a theoretical distribution of an aggregate index such as the mean, so that the distribution may lead to some confidence boundaries for the index. Typically, one would obtain the boundaries of an interval within which 95% of the population falls, according to the derived distribution. For instance, when the distribution is normal, the 95% confidence interval is defined by its mean plus/minus 1.96 times the standard deviation. Thus, for the first column data, the theoretically derived 95% confidence interval will be 10 ± 1.96*2 = 10 ± 3.92, that is, (6.08, 13.92), if the true parameters of the distribution are known, or 10.27 ± 1.96*1.76 = 10.27 ± 3.45, that is, (6.82, 13.72), at the observed parameters in Table 1.1. The difference is negligible, especially if one takes into account that the 95% confidence level is a rather arbitrary convention. In probabilistic statistics, the so-called Student’s distribution is used to make up for the fact that the sample-estimated standard deviation is used instead of the exact one, but that distribution differs little from the Gaussian distribution when the data contain more than several hundred entities.
In most real-life applications the shape of the underlying distribution is unknown and, moreover, the distribution is not stationary. The theoretically defined confidence boundaries are then of little value. This is why a question arises whether any confidence boundaries can be derived computationally, by re-sampling the data at hand rather than by imposing some debatable assumptions. Several approaches to the computational validation of sample-based results have been developed. One of the most popular is bootstrapping, which will be used here in its basic, “non-parametric and pivotal” format (as defined in Carpenter and Bithell 2000).
Bootstrapping is based on a pre-specified number, say 1000, of random trials.
A trial involves N entities drawn randomly, with replacement, from the entity set; note that N is the size of the entity set. Since the re-sampling goes with replacement, some entities may be drawn two or more times, so that some others are bound to be left out. Recalling that e = 2.7182818… is the base of the natural logarithm, it is not difficult to see that, on average, only approximately (e − 1)/e = 63.2% of the entities get selected into a trial sample. Indeed, at each random drawing of an entity from a set of N, the probability of a given entity not being drawn is 1 − 1/N, so that the approximate proportion of entities never selected in N draws is (1 − 1/N)^N ≈ 1/e = 1/2.71828 ≈ 36.8% of the total number of entities. For instance, in a bootstrap trial over 15 entities, the following numbers have been drawn: 8, 11, 7, 5, 3, 3, 11, 5, 9, 3, 11, 6, 13, 13, 9, so that seven entities have been left out of the trial while several others entered in multiple copies.
Figure 1.8. The histograms of a 50 strong sample from a Gaussian distribution (on the left) and
its mean’s bootstrap values (on the right): all falling between 9.7 and 10.1.
A trial set of N randomly drawn entity indices (some of them, as explained, coinciding) is assigned the corresponding row data values from the original data table, so that coinciding entities get identical rows. Then the method under consideration, currently “computing the mean”, is applied to this trial data to produce the trial result. After a number of trials, the user gets enough results to represent them with a histogram and derive confidence boundaries for the mean’s estimate.
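In MatLab, such a trial loop for the mean takes only a few lines. A minimal sketch, assuming the data are loaded from short.dat (the file location and variable names are illustrative):

>> d=load('Data\short.dat');                    % assumed location of the short.dat array
>> N=size(d,1); ntrials=1000;
>> bmeans=zeros(ntrials,size(d,2));
>> for k=1:ntrials; ra=ceil(N*rand(N,1));       % N entity indices drawn with replacement
bmeans(k,:)=mean(d(ra,:));                      % trial means of the three columns
end
>> mean(bmeans), std(bmeans)                    % bootstrap means and their standard deviations
>> hist(bmeans(:,1),30)                         % histogram of the bootstrap means for the first column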
The bootstrap distributions, after 1000 trials, are presented in Figures 1.8, 1.9 and 1.10 on the right
hand side. We can see very clearly that the estimate in the case of Gaussian data, Figure 1.8, is more
precise: all 100% of the bootstrap mean values fall in the interval between 9.47 and 11.12, which is a
much more precise estimate of the mean than in the original distribution, both in terms of the interval
boundaries and confidence. There is theoretical evidence, presented by E. Bradley (1993), supporting
the view that the bootstrap can produce somewhat tighter confidence boundaries for the sample’s mean
than the theoretical analysis based on the original sample. In our case, we can see (Table 1.2) that
indeed, with the means almost unvaried, the standard deviations have been drastically reduced.
Data type       Mean      Standard deviation
                          Value         Per cent of mean, %
Normal          10.27        0.25              2.46
Two-mode        16.94        0.69              4.05
Power law      287.54      124.38             43.26

Table 1.2. Aggregate characteristics of the results of 1000 bootstrap trials over the short.dat array.
Figure 1.9. The histograms of a 50 strong sample from a Two-mode distribution (on the left) and its
mean’s bootstrap values (on the right).
Unfortunately, the bootstrap results are not that helpful in analysing the other two distributions: as can be seen in our example, they show rather decent boundaries for both of the means, the Two-modal and the Power law ones, while, in many applications, the mean of either of these two distributions may be considered meaningless. It is a matter of applying other data analysis methods, such as clustering, to produce more homogeneous sub-samples whose distributions would be more similar to that of a Gaussian.
Figure 1.10. The histograms of a 1000 strong sample from a Power law distribution (on the left) and
its mean’s bootstrap values (on the right): all falling between 260 and 560.
Project 1.3. K-fold cross-validation
Another set of validation techniques utilises randomly splitting the entity set in two parts of pre-specified sizes, the so-called training and testing parts, so that the method’s results obtained on the training part are compared with the data of the testing part. To guarantee that each of the entities gets into a training/testing sample an equal number of times, the so-called cross-validation methods have been developed.
The K-fold cross-validation works as follows. Randomly split the entity set in K parts Q(k), k = 1, …, K, of equal sizes¹. Typically, K is taken as 2, 5 or 10. In a loop over k, each part Q(k) is taken as the test set while the rest serves as the training set. The data analysis method under consideration is run over the training set (“training phase”) and its result is applied to the test set. The average score over all the test sets constitutes the K-fold cross-validation estimate of the method’s quality.
The case when K is equal to the number of entities N is especially popular. It was introduced under the term “jack-knife”, but currently the term “leave-one-out” is used as better reflecting the method: N trials are run, each over the entire set with just one entity removed from the training and used for testing.
Let us apply the 10-fold cross-validation method to the problem of evaluating the means of the three data sets. First, let us create a partition of our 1000-strong entity set in 10 non-overlapping classes, a hundred entities each, with entities randomly assigned to the partition classes. This can be done by randomly putting entities one by one in each of the 10 initially empty buckets. Or, one can take a random permutation of the entity indices and then divide the permuted series in 10 chunks, 100-strong each. For each class Q(k) of the 10 classes (k = 1, 2, …, 10), we calculate the averages of the variables on the complementary 900-strong entity set, and use these averages for calculating the quadratic deviations on the class Q(k); that is, the deviations are taken from the training averages, not from the averages of class Q(k) itself. In this way, we test the averages found on the complementary training set.
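A minimal MatLab sketch of this computation, assuming the 1000 × 3 data set of this project is stored in a matrix d (the names are illustrative):

>> N=size(d,1); K=10;
>> perm=randperm(N);                            % a random permutation of the entity indices
>> cvdev=zeros(K,size(d,2));
>> for k=1:K; test=perm((k-1)*N/K+1:k*N/K);     % the k-th hundred-strong test class Q(k)
train=setdiff(1:N,test);                        % the complementary 900-strong training set
m=mean(d(train,:));                             % averages found on the training set
cvdev(k,:)=sqrt(mean((d(test,:)-repmat(m,length(test),1)).^2));  % deviations on Q(k) from the training averages
end
>> mean(cvdev)                                  % the 10-fold cross-validation estimates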
Data type                 Normal    Two-modal    Power law
Standard deviation:
   On set                  1.94        5.27       1744.31
   10-fold cr.-val.        1.94        5.27       1649.98

Table 1.3. Quadratic deviations from the means computed on the entity set as is and by using 10-fold cross-validation.
The results are presented in Table 1.3. The values found on the original distribution and with 10-fold cross-validation are similar. Does this mean that there is no need to apply the method? No: for more complex data analysis methods, the results may differ indeed. Also, whereas the ten quadratic deviations calculated on the ten test sets for the Gaussian and Two-modal data are very similar to each other, those on the Power law data set differ drastically, ranging from 391.60 to 2471.03.
1.4. Modelling uncertainty: Intervals and fuzzy sets
Intervals and fuzzy sets are used to reflect uncertainty in data. When dealing with complex systems, a feature value often cannot be determined precisely, even for such a relatively stable and homogeneous dimension as the population size of a country. The so-called “linguistic variables” (Zadeh 1970) express categories or concepts in terms of some quantitative measures, such as the concept of “normal temperature” or “normal weight of an individual”. The latter can be expressed with the Body Mass Index, BMI (the ratio of the weight, in kg, to the height, in metres, squared): the normal weight corresponds to the BMI interval [20, 25]; those with BMI > 25 are considered overweight (those with BMI > 30 are officially recognized as obese) and those with BMI < 20 underweight. In this example, the natural boundaries of a category are expressed as an interval.
¹ To do this, one may start from all sets Q(k) being empty and repeatedly run a loop over k = 1:K in such a way that, at each step, a random entity is drawn from the entity set (with no replacement!) and put into the current Q(k); the process halts when no entities remain outside the sets Q(k).
Figure 1.11. A trapezoidal membership function μA(x) expressing the concept of normal body mass index: a positive degree of membership is assigned to each point within the interval [18, 27] and, moreover, the points between 22 and 24 certainly belong to the set.
A more flexible description can be achieved with the so-called fuzzy set A expressed by the membership function μA(x) defined, in the example of Figure 1.11, as:

μA(x) = 0,              if x ≤ 18 or x ≥ 27;
μA(x) = 0.25x − 4.5,    if 18 ≤ x ≤ 22;
μA(x) = 1,              if 22 ≤ x ≤ 24;
μA(x) = −x/3 + 9,       if 24 ≤ x ≤ 27.

This function says that the normal weight does not occur outside of the BMI interval [18, 27]. Moreover, the concept applies in full, with membership 1, only within the BMI interval [22, 24]. There are “grey” areas expressed with the slopes on the left and the right so that, say, a person with BMI = 20 will have the membership value μA(20) = 0.25*20 − 4.5 = 0.5, and the membership of a person with BMI = 26.1 will be μA(26.1) = −26.1/3 + 9 = −8.7 + 9 = 0.3.
In fact, a membership function may have any shape; the only requirement is that there must be at least one interval (or a single point) at which the function reaches the value 1, its maximum. A fuzzy set formed with straight lines, as in Figure 1.11, is referred to as a trapezoidal fuzzy set. Such a set can be represented by four points on the axis x, (a, b, c, d), such that μA(x) = 0 outside the outer interval [a, d] and μA(x) = 1 inside the inner interval [b, c], with straight lines connecting the points (a, 0) and (b, 1), as well as (c, 1) and (d, 0) (see Figure 1.11).
Figure 1.12. A triangular fuzzy set for the normal-weight BMI, with base interval [18, 27] and peak at 22.
An interval (a, b) can be equivalently represented by a trapezoidal fuzzy set (a, a, b, b) in which all points of (a, b) have their membership value equal to 1.
The so-called triangular fuzzy sets are also popular. A triangular fuzzy set A is represented by an ordered triplet (a, b, c) such that μA(x) = 0 outside the interval [a, c] and μA(x) = 1 only at x = b, with the values of μA(x) in between given by the straight lines between the points (a, 0) and (b, 1) and between (c, 0) and (b, 1) on the Cartesian plane, see Figure 1.12.
The fuzzy sets presented in Figures 1.11 and 1.12 are not equal to each other: two fuzzy sets A and B are equal only if μA(x) = μB(x) for every x, not just outside of the base interval.
A fuzzy set should not be confused with a probabilistic distribution such as a histogram: there is no probabilistic mechanism nor frequencies behind a membership function, just an expression of the extent to which a concept is applicable. A conventional, crisp set S can be specified as a fuzzy set whose membership function μS admits only the values 0 or 1 and never those in between; thus, μS(x) = 1 if x ∈ S and μS(x) = 0 otherwise.
There are a number of operations with fuzzy sets imitating those with “crisp” sets, first of all, the set-theoretic complement, union and intersection.
The complement of a fuzzy set A is the fuzzy set B such that μB(x) = 1 − μA(x). The union of two fuzzy sets A and B is the fuzzy set, denoted A∪B, whose membership function is defined as μA∪B(x) = max(μA(x), μB(x)). Similarly, the intersection of two fuzzy sets A and B is the fuzzy set, denoted A∩B, whose membership function is defined as μA∩B(x) = min(μA(x), μB(x)).
It is easy to prove that these operations are indeed equivalent to the corresponding set-theoretic operations when performed over crisp membership functions.
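These definitions are straightforward to code. A minimal MatLab sketch for the trapezoidal set of Figure 1.11 and the triangular set (18, 22, 27) of Figure 1.12; the helper name trapmu is made up for this illustration:

>> trapmu = @(x,a,b,c,d) max(0, min(min((x-a)/(b-a),1), (d-x)/(d-c)));  % trapezoidal membership (a,b,c,d)
>> muA = @(x) trapmu(x,18,22,24,27);            % the normal-BMI set of Figure 1.11
>> muB = @(x) max(0, min((x-18)/4, (27-x)/5));  % the triangular set (18,22,27) of Figure 1.12
>> [muA(20), muA(26.1)]                         % 0.50 and 0.30, as computed in the text
>> x = 16:0.5:29;
>> muUnion = max(muA(x), muB(x));               % membership of the union of A and B
>> muInter = min(muA(x), muB(x));               % membership of the intersection of A and B
>> plot(x, muA(x), x, muB(x), x, muInter)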
Questions:
1. Draw the membership function of the fuzzy set A of Figure 1.11.
2. What is the union of the fuzzy sets presented in Figures 1.11 and 1.12?
3. What is the intersection of the fuzzy sets presented in Figures 1.11 and 1.12?
4. Draw the membership function of the union of two triangular fuzzy sets represented by the triplets (2,4,6), for A, and (3,5,7), for B. What is the membership function of their intersection?
5. What type of function is the membership function of the intersection of two triangular fuzzy sets? Of two trapezoidal fuzzy sets? Does it always represent a fuzzy set?
Central fuzzy set
The conventional centre and spread concepts can be extended to intervals and fuzzy sets. Let us consider an extension of the concept of the average to triangular fuzzy sets using the least-squares data recovery approach.
Given a set of triangular fuzzy sets A1, A2, …, AN, the central triangular set A can be defined by a triplet (a, b, c) that approximates the triplets (ai, bi, ci), i = 1, 2, …, N. The central triplet can be defined by the condition that it minimises the average squared difference

L(a, b, c) = (Σi (ai − a)² + Σi (bi − b)² + Σi (ci − c)²)/(3N)

Since the criterion L is additive over the triplet’s elements, the optimal solution is analogous to that in the conventional case: the optimal a is the mean of a1, a2, …, aN, and the optimal b and c are the means of the bi and ci, respectively.
Q. Prove that the average of the ai indeed minimizes L. A. Let us take the derivative of L over a: ∂L/∂a = −2Σi(ai − a)/(3N). The first-order optimality condition, ∂L/∂a = 0, has the average as its solution, as described.
Q. Explore the concepts of central trapezoidal fuzzy set and central interval in an analogous way.
Questions
1. What is the bin size in the example of Figure 1.13?
Figure 1.13. The range [a, b] = [2, 12] divided in five bins.
2. Correlation and visualization in 2D
Two features can be of interest if there is an assumption or hypothesis or just gut feeling that they are
related in such a way that certain changes in one of them tend to co-occur with some changes in the
other. Then the relation – if proven to exist – can be used in various ways, of which typically
discernible are (i) those related to prediction of values of one variable from those of the other and (ii)
adding the relation to the knowledge of the domain by interpreting and explaining it in terms of the
existing knowledge. Goal (ii) is treated in the discipline of knowledge bases as part of the so-called
inferential approach, in which all relations are assumed to have been expressed as logical predicates
and treated accordingly; this will not be described here. We concentrate on the so-called inductive
approach related to less formal analysis of what type of information the data can provide with respect
to goals (i) and (ii). Typically, the feature whose values are predicted is referred to as the target
variable while the other as the input variable. Examples of goal (i) are: prediction of an intrusion attack
of a certain type (Intrusion data) or prediction of exam mark (Student data) or prediction of the number
of Primary schools in a town whose population is known (Market town data). One may ask: why
bother, since all the numbers are already in the file? Indeed, they are. But in the prediction problem, the data are just a small sample of observations and serve only as a training ground for devising a decision rule for predicting the behaviour of other, yet unobserved, entities. As to goal (ii), the data are just idle empirical sparkles, not necessarily noticeable unless they are shaped into a decision rule.
The mathematical structure of the problem differs depending on the type of feature scales involved, which leads us to consider three cases: (1) both features are quantitative, (2) the target feature is quantitative and the input feature categorical, and (3) both features are categorical. We leave out the case when the target feature is categorical and the input feature is quantitative, because nothing specific to this task has been developed so far.
2.1. Both features are quantitative
In the situation in which both features are quantitative, the following three concepts are popular: scatter plot, regression, and correlation.
Scatter plot is a presentation of entities as 2D points in the plane of two pre-specified features. On the
left-hand side of Figure 2.1, a scatter-plot of Market towns over features PopResident (Axis x) and
PSchools (Axis y) is presented.
Figure 2.1. Scatter plot of PopRes versus PSchools in Market town data. The right hand graph
includes a regression line of PSchools over PopRes.
If one can think that these two features are related by a linear equation y = ax + b, where a and b are constant coefficients (parameters), then these parameters a and b, referred to as the slope and intercept, respectively, can be found by minimizing the inconsistencies of the equation over the 45 towns in the data set. Indeed, it sounds rather unlikely that, by adjusting just two parameters, we could make every one of the towns satisfy the equation exactly.
Why would one need that? For the purposes of description and prediction. The prediction goal obviously relates to other towns that are not in Table 0.3: given PopRes (x) at a town, predict its PSchools (y). It would be useful if we could not only predict but also evaluate the reliability of the prediction.
Let us consider this as a general problem: present the correlation between y and x, using their values at a number N of entities (x1, y1), (x2, y2), …, (xN, yN), in the form of the equation

y = a*x + b                                    (2.1)

Obviously, on the entities i = 1, 2, …, N equation (2.1) will have some errors, so that it can be rewritten as

yi = a*xi + b + ei,  (i = 1, 2, …, N)          (2.2)

where the ei are referred to as errors or residuals. Then the problem is to determine the two parameters, a and b, in such a way that the residuals are least-squares minimized, that is, the summary squared error

L(a, b) = Σi ei² = Σi (yi − a*xi − b)²         (2.3)
reaches its minimum over all possible a and b. This minimization problem is easy to solve with elementary calculus tools.
Indeed, L(a, b) is a convex, bowl-shaped function of a and b. Therefore, its minimum corresponds to the point at which both partial derivatives of L(a, b) are zero (the first-order optimality condition):

∂L/∂a = 0  and  ∂L/∂b = 0
Leaving the finding of the derivatives to the reader as an exercise, let us focus on the unique solution, a in (2.4) and b in (2.6):

a = ρ σ(y)/σ(x)                                (2.4)

where

ρ = [Σi (xi − mx)(yi − my)] / [N σ(x) σ(y)]    (2.5)

is the so-called correlation coefficient and mx, my are the means of the xi and yi, respectively;

b = my − a*mx                                  (2.6)

By substituting these optimal a and b into (2.3), one can express the minimum criterion value as

Lm(a, b) = N σ²(y)(1 − ρ²)                     (2.7)

It should be noticed that equation (2.1) is referred to as the linear regression of y over x, the index ρ in (2.4) and (2.5) as the correlation coefficient, its square ρ² in (2.7) as the determination coefficient, and the minimum criterion value Lm in (2.7) as the unexplained variance.
Correlation coefficient and its properties
The meaning of the coefficients of correlation and determination is provided by equations (2.3)-(2.7). Specifically:
* The determination coefficient ρ² is the relative decrease of the variance of y after its linear relation to x has been taken into account (from (2.7)).
* The correlation coefficient ρ ranges between −1 and 1, because ρ² is between 0 and 1, as follows from the fact that Lm ≥ 0 in (2.7), being a sum of squares, see (2.3). The closer ρ is to either 1 or −1, the smaller are the residuals in the regression equation. For example, ρ = 0.9 implies that the unexplained variance Lm of y is 1 − ρ² = 19% of its original value.
* The slope a is proportional to ρ according to (2.4); a is positive or negative depending on the sign of ρ. If ρ = 0, the slope is 0: y and x are then referred to as not correlated. Being not correlated does not mean “no relation”; it means just “no linear relation” between them, while another functional relation, such as a quadratic one, may exist, as shown in Figure 2.2.
Figure 2.2. Three scatter-plots corresponding to a zero or almost zero correlation coefficient ρ; the case on the left: no correlation between x and y; the case in the middle: a non-random quadratic relation y = (x − 2)² + 5; the case on the right: two symmetric linear relations, y = 2x − 5 and y = −2x + 3, each holding at a half of the entities.
* The correlation coefficient ρ does not change under shifting and rescaling of x and/or y, which can be seen from equation (2.5). Formula (2.5) becomes especially simple if the so-called z-normalisation has been applied to both x and y. To z-normalize a feature, its mean m is subtracted from all the values and the results are divided by the standard deviation σ:

x′i = (xi − mx)/σ(x)  and  y′i = (yi − my)/σ(y),  i = 1, 2, …, N

Then formula (2.5) can be rewritten as

ρ = Σi x′i y′i / N = (x′, y′)/N                 (2.5′)

where (x′, y′) denotes the inner product of the vectors x′ = (x′i) and y′ = (y′i).
* One of the fundamental discoveries by K. Pearson was an interpretation of the correlation coefficient in terms of the bivariate Gaussian distribution. A generic formula for the density function of this distribution, in the case in which the features have been pre-processed by the z-normalization described above, is

f(u, Σ) = C*exp{−uᵀΣ⁻¹u/2}                      (2.8)

where u = (x, y)ᵀ is the two-dimensional vector of random values of the two variables x and y under consideration, and Σ is the so-called correlation matrix

Σ = | 1  ρ |
    | ρ  1 |

In this formula, ρ is a parameter with a very clear geometric meaning. Consider, on the Cartesian (x, y) plane, the sets of points making the function f(u, Σ) in (2.8) constant. Such a set makes uᵀΣ⁻¹u constant too. That means that a constant-density set of points (x, y) must satisfy the equation x² − 2ρxy + y² = const. This defines a well-known quadratic curve, the ellipse. At ρ = 0 it becomes the equation of a circle, x² + y² = const, and the more ρ differs from 0, the more elongated the ellipse is, so that at ρ = ±1 the ellipse degenerates into lines parallel to the bisector y = ±x, because the left-hand part of the equation becomes a full square, x² ∓ 2xy + y² = (y ∓ x)² = const. The size of the ellipse is proportional to the constant: the greater the constant, the greater the size.
A striking fact is that the correlation coefficient (2.5) is a sample-based estimate of the parameter ρ in the Gaussian density function (2.8), under the conventional assumption that the sample points (xi, yi) are drawn from a Gaussian population randomly and independently.
This fact is the base of a long standing controversy. Some say that the usage of the correlation
coefficient is justified only when one is sure that their sample is taken randomly and independently
from a Gaussian distribution. This seems somewhat unfounded. Indeed, the usage of the coefficient for
estimating the density function is justified only when the function is Gaussian, true. However, when
trying to linearly represent one variable through the other, the coefficient has a very different meaning
having nothing to do with Gaussian distributions, as expressed above with equations (2.4)-(2.7).
Q. Find the derivatives of L over a and b and solve the first-order optimality conditions.
Q. Derive the optimal value of L in (2.7) for the optimal a and b.
Q. Prove, or find in the literature, that the linear equation indeed corresponds to a straight line of which a is the slope and b the intercept.
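As a computational complement, here is a small MatLab check of formulas (2.4)-(2.7) and of the inner-product expression (2.5′) on simulated data (a sketch; the names are illustrative):

>> x=randn(100,1); y=2*x+randn(100,1);          % some linearly related data with noise
>> N=length(x); mx=mean(x); my=mean(y); sx=std(x,1); sy=std(y,1);
>> rho = sum((x-mx).*(y-my))/(N*sx*sy);         % formula (2.5)
>> a = rho*sy/sx; b = my - a*mx;                % formulas (2.4) and (2.6)
>> L = sum((y - a*x - b).^2);                   % criterion (2.3) at the optimum
>> Lm = N*sy^2*(1-rho^2);                       % formula (2.7)
>> [L, Lm]                                      % the two values coincide
>> xz=(x-mx)/sx; yz=(y-my)/sy;                  % z-normalisation
>> [rho, sum(xz.*yz)/N]                         % formula (2.5'): the same value again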
Project 2.1. 2D analysis, linear regression and bootstrapping
Let us take the Students data table as a 100 × 8 array a in MatLab, pick any two features of interest and plot the entities as points on the Cartesian plane formed by the features. For instance, take Age as x and the Computational Intelligence mark as y:
>> x=a(:,4); % Age is 4-th column of array "a"
>> y=a(:,8); % CI score is in 8-th column of "a"
Then student 1 (first row) will be presented by point with coordinates x=28 and y=90 corresponding to
the student’s age and CI mark, respectively. To plot them all, use command:
>> plot(x,y,'k.')
% k refers to black colour, “.” dot graphics; 'mp' stands for magenta pentagram; see others by using
"help plot"
Unfortunately, this gives a very tight presentation: some points are on the borders of the drawing. To
make borders stretched out, one needs to change the axis:
>> d=axis; axis(1.2*d-10);
This transformation is presented in Figure 2.3 on the right. To make both plots presented on the same
figure, use "subplot" command of MatLab:
>> subplot(1,2,1)
>> plot(x,y,'k.');
>> subplot(1,2,2)
>> plot(x,y,'k.');
>> d=axis; axis(1.2*d-10);
Command subplot(1,2,1) creates one row consisting of two windows for plots and puts the follow-up
plot into the 1st window (that on the left).
Figure 2.3: Scatter plot of features “Age” and “CI score”; the display on the right is a rescaled version
of that on the left.
Whichever presentation is taken, no regularity can be seen on Figure 2.3 at all. Let's try then whether
anything better can be seen for different occupations. To do this, one needs to handle entity sets for
each occupation separately:
>> o1=find(a(:,1)==1); % set of indices for IT
>> o2=find(a(:,2)==1); % set of indices for BA
>> o3=find(a(:,3)==1); % set of indices for AN
>> x1=x(o1);y1=y(o1); % the features x and y at IT students
>> x2=x(o2);y2=y(o2); % the features at BA students
>> x3=x(o3);y3=y(o3); % the features at AN students
Now we are in a position to put, first, all the three together, and then each of these three separately
(again with the command "subplot", but this time with four windows organized in a two-by-two
format, see Figure 2.4).
>> subplot(2,2,1); plot(x1,y1, '*b',x2,y2,'pm',x3,y3,'.k');% all the three plotted
>> d=axis; axis(1.2*d-10);
>> subplot(2,2,2); plot(x1,y1, '*b'); % IT plotted with blue stars
>> d=axis; axis(1.2*d-10);
>> subplot(2,2,3); plot(x2,y2,'pm'); % BA plotted with magenta pentagrams
>> d=axis; axis(1.2*d-10);
>> subplot(2,2,4); plot(x3,y3,'.k'); % AN plotted with black dots
>> d=axis; axis(1.2*d-10);
Of the three occupation groups, a potential relation can be seen only in the AN group: it is likely that the “the greater the age, the lower the mark” regularity holds in this group (black dots in the bottom-right display of Figure 2.4). To check this, let us utilise linear regression.
Figure 2.4. Joint and individual displays of the scatter-plots for the occupation categories (IT star, BA
pentagrams, AN dots).
Regression is a technique invented by F. Galton and K. Pearson to explicate the correlation between x
and y as a linear function (that is, a straight line on the plot), y = slope*x + intercept where slope and
intercept are constants, the former expressing the change in y when x is added by 1 and the latter the
level of y at x=0. The best possible values of slope and intercept (that is, those minimising the average
square difference between real y's and those found as slope*x+intercept) are expressed in MatLab,
according to formulas (2.4)-(2.6), as follows:
>> slope = rho*std(y)/std(x); intercept = mean(y) - slope*mean(x);
Here "rho" is the Pearson correlation coefficient between x and y (2.5) that can be determined with
MatLab operation "corrcoef". Since we are interested in group AN only, we apply it to AN-related
values x3 and y3:
>> cc=corrcoef(x3,y3)
leading to table cc =
1.0000 -0.7082
-0.7082 1.0000
in which the off-diagonal entries are the rho, which can be picked up with the command
Then the general formula applies to pair (x3,y3):
>> slope = rho*std(y3)/std(x3); % this produces slope =-1.33;
>> intercept = mean(y3) - slope*mean(x3); % this produces intercept = 98.2;
thus leading to the linear regression y3 = 98.2 − 1.33*x3, stating that every year added to the age decreases, on average, the mark by 1.33, so that aging by 3 years would lead to a loss of 4 marks.
To check whether the equation is good, one may compare the real values for three selected students from the AN group with those derived using the equation:
>> ii=[80 81 82];
>> x(ii); %the ages
>> y(ii); %the marks
>> yy=slope*x(ii)+intercept; % the marks derived from the age
which yields the following results:
x     24     34     41
y     62     30     39
yy    66.3   53.0   43.7
One can see that the error for the second student, 53 (predicted) – 30 (real) = 23, is rather high, which
reflects the fact that the mark of this student contradicts the general regularity: s/he is younger than the
third student but has a lower mark.
Altogether, the regression equation explains rho^2=0.50=50% of the total variance of y3 – not too
much.
Let us take a look at the reliability of the regression equation with bootstrapping, the popular
computational experiment technique for validating data analysis results that was introduced in Chapter
1. The computational power allows for experimentation on the spot, with the real data, rather than with
theoretical probabilistic distributions, which are not necessarily adequate to the data.
Bootstrapping is based on a pre-specified number of random trials. In the case of the data of the 31 AN students, each trial begins with randomly selecting a student 31 times, with replacement, so that the same entity can be selected several times whereas some other entities may never be selected in a trial. (As shown above, on average only about 63% of the entities get selected into a sample.) A sample consists of 31
students because this is the number of elements in the set under consideration. The sample of 31
students (some of them, as explained, coincide) is assigned with their data values according to the
original data table so that coinciding students get identical feature values. Then a data analysis method
under consideration, currently "linear regression", applies to this data sample to produce the trial result.
After a number of such trials the user gets enough data to see how well they correspond to the original
results.
To do a trial as described, one can use the following MatLab command:
>> ra=ceil(31*rand(31,1));
% rand(31,1) produces a column of 31 random real numbers, between 0 and 1 each. Multiplying this
% by 31 stretches the numbers to be between 0 and 31, and "ceil" rounds them up to integers.
The values of x and y on the group can be assigned by using equations:
>>xr=x3(ra);yr=y3(ra);
after which formulas above apply to compute the rho, slope and intercept.
To do this a number (5000, in this case) of times, one runs a loop:
>> for k=1:5000; ra=ceil(31*rand(31,1));
xr=x3(ra);yr=y3(ra);
cc=corrcoef(xr,yr);rhr(k)=cc(1,2);
sl(k)=rhr(k)*std(yr)/std(xr); inte(k)=mean(yr)-sl(k)*mean(xr);
end
% the results are stored in 5000-strong columns rhr (correlations), sl (slopes) and inte (intercepts)
Now we can check the mean and standard deviation of the obtained distributions. Commands
>>mean(sl); std(sl)
produce values -1.33 and 0.24. That means that the original value of slope=-1.33 is confirmed with the
bootstrapping, but now we have its standard deviation, 0.24, as well. Similarly mean/std values for the
intercept and rho are computed. They are, respectively, 98.2 / 9.0 and -0.704 / 0.095.
We can plot the 5000 values found as 30-bin histograms (see Figure 2.5):
>> subplot(1,2,1); hist(sl,30)
>> subplot(1,2,2); hist(inte,30)
Further commands:
>>slh=hist(sl,30); slf=find(slh>=70); sum(slh(slf));
show that 4736 out of 5000 trials fall into just 18 of the 30 histogram bins, labelled from 7 to 23. To
determine the boundaries of this area, one finds
>>slbinsize=(max(sl)-min(sl))/30;
>>slleftbound= min(sl)+6*slbinsize
>>slrightbound=max(sl)-7*slbinsize
which produces −1.80 and −0.86 as the left and right boundaries for the slope; these hold for 4736/5000 = 94.7% of the trials.
Similar computations, with
>> inh=hist(inte,30); inff=find(inh>60); sum(inh(inff))
will find the left and right boundaries for the intercept at 95.1% of the trials (by leaving out 8 bins on
the left and 5 bins on the right): 81.7 to 117.4.
Figure 2.5. 30-bin histograms of the slope (left) and intercept (right) after 5000 bootstrapping trials.
This all can be visualized by, first, defining the three regression lines (with inleftbound and inrightbound denoting the intercept boundaries found above) with
>> y3reg=slope*x3+intercept;
>> y3regleft=slleftbound*x3+inleftbound;
>> y3regright=slrightbound*x3+inrightbound;
and then plotting the four sets onto the same figure (Figure 2.6):
>> plot(x3,y3,'*k',x3,y3reg,'k',x3,y3regleft,'r',x3,y3regright,'r')
% x3,y3,'*k' presents the student data as black stars; x3,y3reg,'k' presents the fitted regression line in black
% x3,y3regleft,'r' and x3,y3regright,'r' present the boundary regressions with red lines
The red lines in Figure 2.6 show the limits of the regression line for 95% of the trials.
Figure 2.6. Regression of CI score over Age (black line) within occupation category AN with
boundaries covering 95% of potential biases due to sample fluctuations.
Non-linear correlations
In many domains the correlation between features is not necessarily linear. For example, in economics,
processes related to inflation over time are modelled as the exponential ones; similar thinking applies
to the processes of growth in biology; variables describing climatic conditions obviously have a cyclic
character; etc. Consider, for example, an exponential function y=a*exp(b*x) where x is predictor and
y predicted variables whereas a and b are unknown but constant coefficients. Given the values of xi
and yi on a number of observed entities i= 1,…, N, the exponent regression problem can again be
formulated as the problem of minimising the summary error squared over all possible pairs of
coefficients a and b. Given some a and b, the summary error squared is calculated as
E = [y1 − a*exp(b*x1)]² + [y2 − a*exp(b*x2)]² + … + [yN − a*exp(b*xN)]² = Σi [yi − a*exp(b*xi)]²        (2.9)
There is no method that would straightforwardly lead to a globally optimal solution of the problem of minimising E in (2.9), because it is the sum of many exponential functions. This is why, conventionally, the exponential regression is fitted by transforming it to a linear regression problem. Indeed, by taking the logarithm of both parts of the equation y = a*exp(b*x), we obtain the equivalent equation ln(y) = ln(a) + b*x. This equation has the linear format z = α*x + β, where z = ln(y), α = b and β = ln(a). By fitting the linear regression equation to the given data xi and zi = ln(yi), to find the optimal α and β, we can feed the coefficients back into the original exponential equation by taking a = exp(β) and b = α. This strategy seems especially suitable since the logarithm of a variable typically is much smoother, so that the linear fit is easier to achieve.
There is one “but” here, too. The issue is that the fact that α and β are optimal in the linear regression problem does not imply that the values of a and b found this way minimise the error E. Moreover, almost certainly they are not optimal and indeed can be rather far away from the optimal values, to which the exponent in a = exp(β) can contribute dramatically, as can be seen in the following project.
Project 2.2. Non-linear regression versus its linearized version: evolutionary
algorithm for estimation
Let us consider an illustrative example involving variables x and y defined over a period of 20 time
moments as follows.
Table 2.1. Data of investment at time moments from 0.10 to 2.00.
x  0.10 0.20 0.30 0.40 0.50 0.60 0.70 0.80 0.90 1.00 1.10 1.20 1.30 1.40 1.50 1.60 1.70 1.80 1.90 2.00
y  1.30 1.82 2.03 4.29 3.30 3.90 3.84 4.24 4.23 6.50 6.93 7.23 7.91 9.27 9.45 11.18 12.48 12.51 15.40 15.91
Variable x can be thought of as related to time, whereas y may represent the value of an investment. In fact, the components of x are the numbers from 1 to 20 divided by 10, and y is obtained from them in MatLab according to the formula y=2*exp(1.04*x)+0.6*randn, where randn is the normal (Gaussian) random variable with mathematical expectation 0 and variance 1. The average growth of the investment according to these data can be expressed as the 19-th root of the ratio y20/y1, that is, 1.14; 14% per period!
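The linearized fit can be carried out with polyfit; a minimal sketch, assuming x and y hold the data of Table 2.1 as vectors:

>> z = log(y);                                  % logarithms of the target values
>> pc = polyfit(x,z,1);                         % linear fit z = alpha*x + beta; pc = [alpha beta]
>> b = pc(1); a = exp(pc(2));                   % back to the exponential form y = a*exp(b*x)
>> E = sum((y - a*exp(b*x)).^2)                 % the squared error (2.9) of the linearized fit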
The strategy of reducing the exponential equation to the linear one produces the values 1.1969 and 0.4986 for α and β, respectively, which leads to a = 1.6465 and b = 1.1969 according to the formulas above. As we can see, these differ from the original a = 2 and b = 1.04 by the order of 15–20%. The value of the squared error here is E = 13.90.
Figure 2.7. Plot of the original pair (x,y) in which y is a noisy exponential function of x (on the left)
and plot of the pair (x,z) in which z=ln(y). The plot on the right looks somewhat straighter indeed,
though the correlation coefficients are rather similar, 0.970 for the plot on the left and 0.973 for the
plot on the right.
It only remains to solve the original problem of minimising E in (2.9) and see whether this leads to better estimates. One obviously can apply here local algorithms such as steepest descent. Also, the evolutionary approach can be applied. This approach involves a population of admissible solutions evolving according to some rules. The rules include: (a) random changes from generation to generation, and (b) elite control. After a number of generations, the best solution among those observed is reported as the outcome.
To start the evolution, we first choose the population size, p, and randomly generate a population of p admissible solutions, pairs (a, b), and evaluate how well each of them fits the data. The best of them is recorded separately, and the record is updated from iteration to iteration (elite control). An iteration consists of moving the population in a random direction by adding randomly generated values. In the beginning, a box large enough to contain the optimal solution is defined, so that the population is kept in the box and not driven away. This very simple process is quite effective for this type of problem. A MatLab program, nlrm.m, implementing this approach is posted in the Appendix (see page … ). It is supplied with comments and could be used as is, or in a modified form, for other non-linear fitting problems.
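Purely for illustration, a stripped-down sketch of an evolutionary search of this kind might look as follows; this is not the nlrm.m code, the box boundaries and step size are assumptions, and x and y are the data of Table 2.1:

>> pop=50; T=5000;                              % population size and number of generations
>> lo=[0 0]; hi=[5 3];                          % a box assumed large enough to contain (a,b)
>> P=repmat(lo,pop,1)+rand(pop,2).*repmat(hi-lo,pop,1);   % initial random population of pairs (a,b)
>> err=@(a,b) sum((y-a*exp(b*x)).^2);           % criterion E in (2.9)
>> best=P(1,:); beste=err(best(1),best(2));
>> for t=1:T; P=P+0.05*randn(pop,2);            % random move of the whole population
P=max(min(P,repmat(hi,pop,1)),repmat(lo,pop,1));          % keep the population inside the box
for i=1:pop; e=err(P(i,1),P(i,2));
if e<beste; best=P(i,:); beste=e; end; end      % elite control: remember the best solution seen so far
end
>> best, beste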
The program nlrm.m found a solution of a = 1.9908 and b = 1.0573. These are within 1–2% of the original values a = 2 and b = 1.04. The summary squared error here is E = 7.45, which is far less than that found with the linearization strategy. The two exponential regressions found with the different strategies are presented in Figure 2.8. One can see that the linearized version has a much steeper exponent, which becomes visible at the later periods.
2.2. Mixed Scale Case: a nominal feature versus a quantitative one
Consider a categorical feature x defined on the same entities as a quantitative feature y, such as Occupation and Age in the Students data set. The within-category distributions of y can be used to investigate the correlation between x and y. The distributions can be visualized by using just their ranges as follows: present the categories as equal-size bins on the x axis, draw two lines parallel to the x axis to present the minimum and maximum values of y (over the entire data set), and then present the within-category ranges of y as shown in Figure 2.9.
Figure 2.8. Two fitting exponents are shown, with stars and dots, for the data in Project 2.2.
[Figure: Age axis from 20 to 51; Occupation categories IT, BA, AN.]
Figure 2.9. Graphic presentation of the within-category ranges of Age in the Students data.
Figure 2.10. In a situation of ideal correlation, with zero within-category variances, knowledge of the
Occupation category would provide an exact prediction of the Age within it.
The correlation between x and y is higher when the within-category spreads are smaller, because the smaller the spread within an x-category, the more precise the prediction of y for it. Figure 2.10 illustrates the ideal case of a perfect correlation: all within-category y-values coincide, leading to an exact prediction of Age once Occupation is known.
Figure 2.11 presents another extreme, when knowledge of an Occupation category does not lead to a
better prediction of Age than when the Occupation is unknown.
A simple statistical model extending that for the mean will be referred to as table regression. The table
regression of quantitative y over categorical x comprises three columns corresponding to:
(1) Category of x
(2) Within category mean of y
(3) Within category standard deviation of y
The number of rows in the table regression thus corresponds to the number of x-categories; there
should be a marginal row as well, with the mean and standard deviation of y on the entire entity set.
Figure 2.11. Wide within-category distributions: the case of full variance within the categories, in which the knowledge of Occupation gives no information about Age.
Consider, for example, the table regression of Age (quantitative target) over Occupation (categorical predictor) in Table 2.2. It suggests that if we know the Occupation, for instance IT, then we can safely predict the Age to be 28.2 within a margin of plus/minus 5.6 years. Without knowing the Occupation category, we could only say that the age is on average 33.7 plus/minus 8.5, a less precise assessment.
Table 2.2. Table regression of Age over Occupation in Students data.

Occupation   Age Mean   Age StD
IT             28.2       5.6
BA             39.3       7.3
AN             33.7       8.7
Total          33.7       8.5
The table can be visualized in a manner similar to Figures 2.9-2.11, this time presenting the within
category averages by horizontal lines and the standard deviations by vertical strips (see Figure 2.12).
Figure 2.12. Table regression visualized with the within-category averages and standard deviations
represented by the position of solid horizontal lines and vertical line sizes, respectively. The dashed
line’s position represents the overall average (grand mean).
One more way of visualizing a categorical/quantitative correlation is the so-called box-plot. The within-category spread is expressed here with a quantile (percentile) box rather than with the standard deviation. First, a quantile level should be defined, for instance 40% (20% off each of the top and bottom extremes), which means that we show the within-category range over only the middle 60% of its contents. In the category IT, Age ranges between 20 and 39, but if we sort it and remove the 7 entities of maximal Age and the 7 entities of minimal Age (there are 35 students in IT, so that 7 makes 20% exactly), then the Age range over the remaining 60% is from 22 to 33. Similarly, the 60% Age range is from 32 to 47 for BA, and from 25 to 44 for AN. These are presented by the box heights in Figure 2.13. The whiskers reflect the 100% within-category ranges, which are the intervals [20,39], [27,51] and [21,50], respectively.
Figure 2.13. Box-plot of the relationship between Occupation and Age with 20% quantiles; the box heights reflect the within-category 60% Age ranges, the whiskers show the total ranges, and the within-box horizontal lines show the within-category averages.
The box-plot has proved useful in studies of quantitative features too: one of the features is partitioned into a number of bins that are then treated as categories.
Correlation ratio
Let us consider one more table regression, this time of the OOProgramming mark over Occupation (Table 2.3).
Table 2.3. Table regression of OOProg over Occupation.

Occupation   OOP Mean   OOP StD
IT             76.1       12.9
BA             56.7       12.3
AN             50.7       12.4
Total          61.6       16.5
A natural question emerges: in which of the tables, 2.2 or 2.3, is the correlation greater?
This can be addressed with an integral characteristic of the table, the correlation ratio (the determination coefficient for the table regression). To define this index, denote by k a category of x, by Sk the set of i∈I such that xi=k, by pk = |Sk|/|I| its proportion, and by σ²k the variance of y within Sk. Then we first calculate the average within-category variance:
σ²w = Σk pk σ²k
The correlation ratio, usually denoted by η², shows the drop of the variance of y from σ² (the variance of y as is) to σ²w (the average variance of y when the nominal x is taken into account):
η² = 1 - σ²w/σ²                     (2.10)
Properties:
- The range of η² is between 0 and 1.
- η² = 1 when all σ²k are zero, that is, when y is constant within each group.
- η² = 0 when the σ²k are of the order of σ², so that knowing the category of x does not reduce the variance of y at all.
In fact, the correlation ratio emerges as the square-error criterion in the following data recovery model. Find a set of values ck minimising the "residual variance", that is, the average squared error L = Σi∈I ei²/N, where ei = yi - ck according to the equations
yi = ck + ei   for all i∈Sk                     (2.11)
where Sk denotes the set of entities falling in category k of x. These equations underlie the table regression and are sometimes referred to as piecewise regression. It is not difficult to prove that the optimal ck is the within-category average of y over Sk, which implies that the minimum value of L is equal to σ²w defined above. The correlation ratio thus shows the relative drop in the variance of y when y is predicted according to model (2.11).
The correlation ratios in our example are:
Occupation/Age        28.1%
Occupation/OOProg     42.3%
which shows that the correlation between Occupation and the OOProgramming mark is greater than that between Occupation and Age.
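A minimal MatLab sketch of how the correlation ratio (2.10) can be computed may be helpful here; it assumes that the quantitative feature is held in a column vector y and the categories of x in a vector g of the same length (for instance, Age and Occupation coded 1, 2, 3).

cats = unique(g);
N = length(y);
s2w = 0;                                 % average within-category variance
for c = 1:length(cats)
    yk = y(g == cats(c));                % y values within category k
    pk = length(yk)/N;                   % proportion p_k of the category
    s2w = s2w + pk*var(yk, 1);           % var(.,1) divides by |Sk|, as in the text
end
eta2 = 1 - s2w/var(y, 1);                % correlation ratio (2.10)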
2.3. Case of two nominal features
Consider two sets of disjoint categories: l=1,…,L (for example, occupation) and k=1,…,K (family or housing type). Each makes a classification of the entity set; they are crossed to see the correlation. Take a pair of categories (k,l) and count the number of entities that fall in both. The (k,l) co-occurrence count is denoted by Nkl. Obviously, these counts sum to N. A table housing these counts Nkl, or their relative values, the frequencies pkl = Nkl/N, is referred to as a contingency table, or cross-classification.
Example: Partition the Market town set into four classes according to the number of Banks and Building Societies (Ba): Ba ≥ 10 (10+), 10 > Ba ≥ 4 (4+), 4 > Ba ≥ 2 (2+), and Ba = 0 or 1 (1-); these will be the Banking type categories. Cross-classify this partition with FM, Farmers' Market (yes/no), as shown in Table 2.4.
Table 2.4. Cross-classification of the Ba-related partition with FM.

FarmMarket   10+   4+   2+   1-   Total
Yes            2    5    1    1       9
No             4    7   13   12      36
Total          6   12   14   13      45
The same contingency data converted to frequencies are presented in Table 2.5.
Table 2.5. BA/FM cross-classification frequencies, per cent.

FM       10+      4+      2+      1-     Total
Yes      4.44   11.11    2.22    2.22      20
No       8.89   15.56   28.89   26.67      80
Total   13.33   26.67   31.11   28.89     100
The totals, that is, the within-row sums Nk+ = Σl Nkl and the within-column sums N+l = Σk Nkl (as well as their frequency counterparts), are referred to as marginals (because they are typed on the margins of the contingency table).
Another example: the contingency table for features “Protocol-type” and “Attack type” (Table 2.6).
Table 2.6. Protocol/Attack contingency table for Intrusion data.

Category   apache   saint   surf   norm   Total
Tcp            23      11      0     30      64
Udp             0       0      0     26      26
Icmp            0       0     10      0      10
Total          23      11     10     56     100
A contingency table can be used for assessment of correlation between two category sets. A conceptual association may exist if a row k has all its entries (the margins aside) equal to 0 except for one. Such are the rows "Udp" and "Icmp" in Table 2.6. In this case, we have a perfect match between the row category k and the column l in which the only non-zero count occurs. No other combination (k,l') with l' different from l is possible, according to the table; the zeros tell this. In such a situation, one may claim that, subject to the sample, k implies l, so that k occurs only together with l. According to Table 2.6, the udp protocol implies "norm", the no-attack situation, whereas the icmp protocol implies the "surf" attack. The latter, in fact, amounts to the equivalence between "icmp" and "surf", because there is no other non-zero entry in the "surf" column, so that "surf" implies "icmp" as well. In contrast, "udp" and "norm" are not equivalent, because "norm" may occur at another protocol, "tcp", too.
A similar situation might have occurred in Table 2.4. Imagine, for example, that in the row "Yes" of Table 2.4 the last two entries were 0 rather than 1. This would imply that a Farmers' Market may occur only in a town with 4 or more Banks. A logical implication, that is, a production rule, "If BA is 4 or more, then a Farmers' Market must be present", could then be derived from the table. One may try taking this path and cleaning the data of the smaller entries, and the corresponding entities, so as not to obscure the pattern of correlation. Look, for example, at Table 2.7, which expresses, with no exception, a very simple conceptual statement: "A town has a Farmers' Market if and only if the number of Banks in it is 4 or greater". However nice the rule may be, let us not forget the exceptions: there are 13 towns, almost 30% of the sample, that have been removed as not fitting. If this is acknowledged, Table 2.7 should not be dismissed as mere data doctoring, though an issue remains to be addressed: could a different conclusion be reached with other removals? There are better ways of computationally producing production rules, typically by bringing other features into consideration rather than by subjective entity removals; this is an important activity in machine learning and knowledge discovery.
Table 2.7. BA/FM cross-classification cleaned of 13 towns, to sharpen the view.

FMarket   10+   4+   2+   1-   Total
Yes         2    5    0    0       7
No          0    0   13   12      25
Total       2    5   13   12      32
Quetelet index
There is another strategy for visualisation of correlation patterns in contingency tables, without
removal of not-fitting entities. This strategy involves an index for assessing correlation between
individual categories. Let us consider correlation between the presence of a Farmer’s Market and the
category “10 or more Banks” according to data in Table 2.5. We can see that their joint
probability/frequency is the entry in the corresponding row and column: P(Ba=10+ &
FM=Yes)=4.44% (joint probability/frequency/rate). Of the 20% entities that fall in the row “Yes”, this
makes P(Ba=10+ / FM=Yes) =0.0444/0.20= 0.222 =22.2%. Such a ratio is referred to as the
conditional probability/rate.
Is this high or low? A founding father of statistics, A. Quetelet (Belgium, 1832), suggested that this
question can be addressed by comparing the conditional rate with the average probability of the
category “Ba=10+”, which is P(Ba=10+)=13.33%. Let us therefore compute the (relative) Quetelet
index q:
q(Ba=10+/ FM=Yes) = [P(Ba=10+/FM=Yes) - P(Ba=10+)] / P(Ba=10+) =[0.2222 – 0.1333] / 0.1333
= 0.6667 = 66.7%.
That means that condition “FM=Yes” raises the frequency of the Bank category by 66.7%.
In fact, such an evaluation is frequently used in everyday statistics. For example, consider the risk of getting a serious illness l, say tuberculosis, which may be 0.1% in a given region. Take a condition k such as "Bad housing" and count the rate of tuberculosis under this condition, say, 0.5%. Then the Quetelet index q(l/k) = (0.5-0.1)/0.1 = 400%, showing that bad housing raises the rate to five times the average, an increase of 400%!
The general definition of the Quetelet index is given by the following formula:
q(l/k) = [P(l/k) - P(l)]/P(l)                     (2.10)
where P denotes the probability or frequency and, in our context, can be computed as follows: P(l) = N+l/N, P(k) = Nk+/N, P(l/k) = Nkl/Nk+. That is, the Quetelet index measures the correlation between categories k and l as the relative change of the probability of l when k is taken into account.
With a little algebra, one can derive a simpler expression:
q(l/k) = [Nkl/Nk+ - N+l/N]/(N+l/N) = Nkl N/(Nk+ N+l) - 1 = pkl/(pk+ p+l) - 1                     (2.11)
Applying (2.11) to Table 2.4, we obtain the Quetelet index values presented in Table 2.8. By highlighting the positive values in it, we obtain the same pattern as on the cleaned data, but this time in a somewhat more realistic guise. Specifically, one can see that the "Yes" FM category provides for a strong increase in probabilities, whereas the "No" category leads to much weaker changes.
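The computation behind Table 2.8 can be sketched in MatLab as follows; this is only an illustration, assuming that the co-occurrence counts are held in a matrix Nkl (for Table 2.4, its rows are the FM categories and its columns the Ba categories).

N   = sum(Nkl(:));           % total number of entities
pkl = Nkl/N;                 % relative frequencies p_kl
pk  = sum(pkl, 2);           % row marginals p_k+
pl  = sum(pkl, 1);           % column marginals p_+l
q   = pkl./(pk*pl) - 1;      % Quetelet indices q(l/k), formula (2.11)
disp(100*q)                  % as percentages, cf. Table 2.8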
Table 2.8. BA/FM cross-classification Quetelet coefficients, % (positive entries highlighted).

FMarket      10+        4+       2+       1-
Yes        66.67    108.33   -64.29   -61.54
No        -16.67    -27.08    16.07    15.38
Quetelet coefficients for Table 2.6 are presented in Table 2.9; positive entries highlighted in bold.
Table 2.9. Quetelet indices for the Protocol/Attack contingency Table 2.6, %.

Category    apache      saint       surf       norm
Tcp          56.25      56.25    -100.00     -16.29
Udp        -100.00    -100.00    -100.00      78.57
Icmp       -100.00    -100.00     900.00    -100.00
Q. Can any logical production rules be derived from the columns of Table 2.6? A. Yes: both the apache and saint attacks may occur at the tcp protocol only.
Pearson’s chi-squared decomposed over Quetelet indexes
This visualization can be extended to a more theoretically sound presentation.
Let us define the summary Quetelet correlation index Q as the sum of pair-wise Quetelet indexes
weighted by their frequencies/probabilities:
Q = Σk,l pkl q(l/k) = Σk,l pkl²/(pk+ p+l) - 1                     (2.12)
The right-hand expression for Q in (2.12) can be obtained by putting expression (2.11) instead of
q(l/k). This expression is very popular in the statistical analysis of contingency data. In fact, this is
another formula for the Pearson chi-squared correlation coefficient proposed by K. Pearson (1901) in a
very different context – as a measure of deviation of the contingency table entries from the statistical
independence.
To explain this in more detail, let us first introduce the concept of statistical independence. Sets of the
k and l categories are said to be statistically independent if pkl = pk+ p+l for all k and l. Obviously, such
a condition is hard to fulfil in reality. K. Pearson suggested using relative squared errors to measure the
deviations. Specifically, he introduced the chi-squared coefficient:
X² = N Σk,l (pkl - pk+ p+l)²/(pk+ p+l) = N[Σk,l pkl²/(pk+ p+l) - 1]                     (2.13)
The middle expression in (2.13) is the well-known definition of the chi-squared coefficient, and the right-hand one allows us to see that X² = NQ, according to (2.12).
The popularity of X² in statistics rests on the theorem proven by Pearson: if the contingency table is based on an independent sample of entities drawn from a population in which statistical independence holds (so that all deviations are due to random sampling only), then the probability distribution of X² converges, as N tends to infinity, to the chi-squared distribution introduced by Pearson for similar analyses. The chi-squared distribution is defined as the distribution of the sum of squares of several independent standard Gaussian variables.
This theorem may be of no interest to computational intelligence, because the latter draws on data that are not necessarily random. However, Pearson's chi-squared is the most popular index for scoring correlation in contingency tables, and the equation X² = NQ gives it further support. According to this equation, X² also has a very different meaning, that of the averaged Quetelet coefficient, which has nothing to do with statistical independence and everything to do with correlation between categories. To make the underlying correlation concept clearer, let us take a look at the range of possible values.
It can be proven that, at K ≤ L (the number of row categories does not exceed the number of column categories), the index Q = X²/N ranges between 0 and K - 1. It reaches 0 if there is statistical independence at all pairs (k,l), so that all q(l/k) = 0, and it reaches K - 1 if each column l contains only one non-zero entry pk(l)l, which is then equal to p+l. The latter can be interpreted as the logical implication l → k(l).
The representation of NQ = X² as the sum of the terms N pkl q(l/k) allows for visualization of the chi-squared correlation terms within the contingency table format, such as that presented in Table 2.10.
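Continuing the Quetelet-index sketch above, the decomposition of NQ = X² into the terms N·pkl·q(l/k) can be computed as follows (the variables N, pkl and q are assumed to be those already defined there).

terms = N*pkl.*q;            % contributions N*p_kl*q(l/k), cf. Table 2.10
X2 = sum(terms(:));          % Pearson chi-squared, equal to N*Q by (2.13)
disp(terms); disp(X2)        % for Table 2.4 this gives X2 of about 6.86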
Table 2.10. BA/FM chi-squared (NQ = 6.86) and its decomposition according to (2.12), (2.13)².

FMarket     10+      4+      2+      1-     Total
Yes        1.33    5.41   -0.64   -0.62      5.43
No        -0.67   -1.90    2.09    1.85      1.37
Total      0.67    3.51    1.45    1.23      6.86
The entry 5.41, highlighted in red, contributes so much to X² = 6.86 that it is perhaps the only single item deserving further investigation.
Q. In Table 2.10, all marginal values, the sums of rows and columns, are positive, in spite of the fact that many within-table entries are negative. Is this due to the specifics of the distribution in Table 2.4, or is it a general property? A. A general property: it can be proven that the within-row and within-column sums of the elements pkl q(l/k) are always non-negative.
Task: Find a similar decomposition of chi-squared for OOPmarks/Occupation in Student data. (Hint:
First, categorize quantitative feature OOPmarks somehow: you may use equal bins, or conventional
boundary points such as 35, 65 and 75, or any other considerations.)
² Bold font for the positive items; red for exceptional contributions.
3. Learning correlations
3.1. General
The problem can be stated as follows. Given N pairs (xi, ui), observed at entities i = 1, …, N, in which the xi = (xi1,…,xip) are predictor/input vectors (dimension p) and the ui = (ui1,…,uiq) are target/output vectors (dimension q), build a decision rule
û = F(x)                     (3.1)
such that the difference between the computed û and the observed target vector u, given x, is minimal over a class Φ of admissible rules F. A rule F is referred to as a classifier if the target is categorical and as a regression if the target is quantitative. In what follows we consider only the case q=1, except for Chapter 4, which is devoted to neural network learning.
In the problems of linear regression or linear discrimination, F is required to be linear. The classes Φ of quadratic regression and discrimination are defined similarly.
Why (and how) should one restrict the class of admissible rules F? A big question, with no good answers. Take a look at the 2D regression problem in which pairs (x,u) are observed at N entities, as shown in Figure 3.1.
Figure 3.1. Possible analytic expressions of the correlation between x and u according to the observed data points (black circles).
The N=7 points in Figure 3.1 can be fitted exactly by a polynomial of 6th order, u = p(x) = a0 + a1x + a2x² + a3x³ + a4x⁴ + a5x⁵ + a6x⁶. Indeed, the 7 points give 7 equations ui = p(xi) (i=1,…,7) that, in a typical case, determine the 7 coefficients ak of p(x) exactly.
However, the polynomial p(x), on whose graph all the observations lie, has no predictive power: beyond the observed range, the curve may take either course (like those shown) depending on small changes in the data. Typically, such over-fitted functions produce very poor predictions on newly added observations. The blue straight line fits none of the points but expresses a simple and very robust tendency and should be preferred, because it summarises the data far more economically: it is defined by only two parameters, the slope and intercept, whereas the polynomial involves as many parameters as there are data items.
If there is no domain knowledge to motivate a choice, it is hard to tell what class Φ of rules F to use.
One set of answers relates to the so-called Occam's razor. William Ockham (c. 1285–1349) said: "Entities should not be multiplied unnecessarily." ("All things being equal, the simplest explanation tends to be the best one.") This is usually interpreted as the "principle of maximum parsimony, i.e., economy," which is used when there is nothing better available. In the format of the so-called "minimum description length" principle, this approach can be meaningfully applied to problems of estimating parameters of statistical distributions (see Rissanen 2007). A somewhat wider, and perhaps more appropriate, explication of Occam's razor is proposed in Vapnik (2006). In a slightly modified form, to avoid differing terminologies, it says: "Find an admissible decision rule with the smallest number of free parameters that explains the observed facts" (Vapnik 2006, p. 448). However, even in this format, the principle gives no guidance about how to choose an adequate functional form. For example, which of two functions, f(x) = ax^b or g(x) = a·log(x+b), both with two parameters a and b, should be preferred as a summarisation tool?
Another set of answers, not incompatible with the former ones, relates to the so-called falsifiability principle of K. Popper, which can be expressed as follows: "Explain the facts by using such an admissible decision rule which is easiest to falsify" (Vapnik 2006, p. 451). In philosophy, to falsify a theory one needs to give an example that contradicts it. The falsifiability of a decision rule can be formulated in terms of the so-called VC-complexity, a measure of the complexity of classes Φ of decision rules: the smaller the VC-complexity, the greater the falsifiability. Let us define VC-complexity for the more intuitive case of a categorical target. Different combinations of target categories can be labelled by different labels u1, u2, …, uK, so that a classifier F is bound to predict labels uk, k=1,2,…,K. A set of classifiers Φ is said to shatter the sample of N pairs (xi, ui), where the ui are just codes of the labels uk, if for any possible assignment of the labels a classifier F exists in Φ such that F reproduces those labels. The VC-complexity of a correlation problem is the maximum number of entities that can be shattered by the admissible classifiers.
Consider, for example, the target being just a 0/1 category and the input features being binary as well, with the set of admissible decision rules consisting of all possible dichotomy partitions. Then the VC-complexity of the problem equals the maximum dimension of a binary cube that is covered by the data source. As an example, let us take a look at the set of eleven entities and four binary input features shown below.
#    v1   v2   v3   v4
1     1    1    1    1
2     1    0    1    1
3     1    0    1    0
4     0    1    1    1
5     0    1    0    0
6     0    0    1    1
7     0    0    1    0
8     1    1    0    1
9     1    1    0    0
10    1    0    0    1
11    0    1    0    1
The VC-complexity of the problem in this case is 2, because there exist two columns, for example v1 and v2, containing all four rows of a two-dimensional binary cube. However, there are no three columns containing all eight rows of a three-dimensional binary cube (from 000 to 111). If we are certain that this property holds for other possible data points from the source, even those not present in the sample, then we should utilise only relatively simple classifiers of VC-complexity 2.
The VC-complexity is an important characteristic of a correlation problem, especially within the probabilistic machine learning paradigm. Under the conventional conditions of independent random sampling of the data, a learned classifier "with probability a% will be b% accurate", where b depends not only on a, but also on the sample size and the VC-complexity.
To specify a learning problem, one should specify assumptions regarding a number of constituents, including:
(i) Data flow
Two modes are usually considered:
(i1) Incremental (adaptive) mode: entities are assumed to arrive one by one, so that the rule is updated incrementally. This type of data flow implies that the fitting algorithm must be incremental too; steepest descent and evolutionary approaches are most suitable for this.
(i2) Batch mode: the whole entity set is available for learning at once, so that the rule can be found in one go.
(ii) Type of rule
A rule involves a postulated mathematical structure whose parameters are to be learnt from the data. The mathematical structures considered further are:
- linear combination of features;
- neural network mapping a set of input features into a set of target features;
- decision tree built over a set of features;
- partition of the entity set into a number of non-overlapping clusters.
(iii) Type of target
Two types are usually considered: quantitative and categorical. In the former case, equation (3.1) is usually referred to as a regression; in the latter case, as a decision rule, and the learning problem is referred to as one of "classification" or, sometimes, "pattern recognition".
(iv) Criterion
The criterion of the quality of fitting depends on the situation in which the learning task is formulated. The most popular criteria are: maximum likelihood (in a probabilistic model of data generation), least squares (the data recovery approach) and error counts. Many operational criteria using error counts can be equivalently reformulated in terms of least squares and maximum likelihood. According to the least-squares criterion, the difference between u and û is measured with the squared error
E = <u - û, u - û> = <u - F(x), u - F(x)>                     (3.2)
which is to be minimised over all admissible F.
3.2. Linear regression
Consider the feature Post expressing the number of post offices in Market towns (Table 0.4 on p. 16-17) and try to relate it to other features in the table. It obviously relates to the population. For example, towns with a population of 15,000 and greater are those, and only those, where the number of post offices is 5 or greater. This correlation, however, is not good enough to give much guidance in predicting Post from Population. For example, in the seven towns whose population is between 8,000 and 10,000, any number of post offices from 1 to 4 may occur, according to the table. This could be attributed to the effects of services such as a bank or a hospital present in a town. Let us specify a set of features in Table 0.4 that can be thought of as affecting the feature Post: in addition to Population, say, PSch - Primary schools, Doct - General Practitioners, Hosp - Hospitals, Banks, Sstor - Superstores, and Petr - Petrol stations, seven features altogether, which constitute the set of input variables (predictors) {x1, x2, …, xp} with p=7.
What we want is to establish a linear relation between this set and the target feature Post, which has the general denotation u in the formulation of section 3.1. A linear relation is an equation representing u as a weighted sum of the features xi plus a constant intercept, in which the weights can be any reals, not necessarily positive. If the relation is supported by the data, it can be used for various purposes such as analysis, prediction and planning.
This can be formulated as a specific case of the correlation learning problem in which there is just one quantitative target variable u. The rule F in (3.1) is assumed to be linear:
u = w1*x1 + w2*x2 + … + wp*xp + w0
where w0, w1, …, wp are unknown weights³, the parameters of the model.
For any entity i = 1, 2, …, N, the rule-computed value of u,
ûi = w1*xi1 + w2*xi2 + … + wp*xip + w0,
differs from the observed one by di = |ûi - ui|, which may be zero, when the prediction is exact, or not, when it is not. To find w1, w2, …, wp, w0, one can minimise
D² = Σi di² = Σi (ui - w1*xi1 - w2*xi2 - … - wp*xip - w0)²                     (3.3)
over all possible parameter vectors w = (w0, w1, …, wp).
To make the problem treatable in terms of multidimensional spaces, a fictitious feature x0 is introduced such that all its values are 1: xi0 = 1 for all i = 1, 2, …, N. Then criterion D² can be expressed as D² = Σi (ui - <w,xi>)² using the inner products <w,xi>, where w = (w0, w1, …, wp) and xi = (xi0, xi1, …, xip) are (p+1)-dimensional vectors, sometimes referred to as having been augmented (by adding the fictitious unity feature x0), of which all the xi are known while w is unknown. From now on, the unity feature x0 is assumed to be part of the data matrix X in all correlation learning problems.
The quantity D² is but the squared Euclidean distance between the N-dimensional target feature column u = (ui) and the vector û = Xw whose components are ûi = <w,xi>. Here X is the N x (p+1) matrix whose rows are the xi (augmented with the component xi0 = 1, thus being (p+1)-dimensional), so that Xw is the matrix algebra product of X and w. Vectors defined as Xw for all possible w's form a (p+1)-dimensional vector space, referred to as the X-span.
Thus the problem of minimising (3.3) can be reformulated as follows: given the target vector u, find its projection û in the X-span space. The global solution to this problem is well known; it is provided by a matrix PX applied to u:
û = PX u                     (3.4)
where PX is the so-called orthogonal projection operator, of size N x N, defined as
PX = X(XᵀX)⁻¹Xᵀ                     (3.5)
so that û = X(XᵀX)⁻¹Xᵀu and w = (XᵀX)⁻¹Xᵀu.
Matrix PX projects every N-dimensional vector u to its nearest match in the (p+1)-dimensional X-span space. The inverse (XᵀX)⁻¹ does not exist if the rank of X, as may happen, is less than the number of columns in X, p+1, that is, if the matrix XᵀX is singular or, equivalently, the dimension of the X-span is less than p+1. In this case the so-called pseudo-inverse matrix (XᵀX)⁺ can be used instead.
Table 3.1. Weight coefficients of input features at Post Office as target variable for Market towns data.

Feature        Weight
POP_RES        0.0002
PSchools       0.1982
Doctors        0.2623
Hospitals     -0.2659
Banks          0.0770
Superstores    0.0028
Petrol        -0.3894
Intercept      0.5784

³ Symbol * is used to denote multiplication when convenient.
In our example of seven Market town features used for linearly relating to the Post Office feature, the vector w of weight coefficients found with the formula above is presented in Table 3.1.
Each weight coefficient shows how much the target variable would change, on average, if the corresponding feature were increased by one. One can see that increasing the population by a thousand would give a similar effect to adding a primary school, about 0.2, which may seem absurd in this example since the Post office variable can take only integer values. Moreover, the linear function format should not trick the decision maker into thinking that different input features can be increased independently: the features are obviously not independent, so that an increase of, say, the population will lead to a corresponding addition of new schools for the additional children. Still, the weights show the relative effects of the features: according to Table 3.1, adding a doctor's surgery to a town would lead to the maximum possible increase in post offices. Yet the largest value is assigned to the intercept. What could this mean: the number of post offices in an empty town with no population, hospitals or petrol stations? Certainly not. The intercept expresses that part of the target variable which is relatively independent of the features taken into account.
It should be pointed out that the weight values relate not just to the feature concepts but to the specific scales in which the features are measured. A change of scale, say 10-fold, would result in a corresponding, inverse, change of the weight (due to the linearity of the regression equation). This is why, in statistics, the relative weights are considered for the scales expressed in units of the standard deviation. To find them, one should multiply the weight for the current scale by the feature's standard deviation (see Table 3.2).

Table 3.2. Rescaled weight coefficients of input features at Post Office as target variable for Market towns.

Feature       Weights in natural scales, w   Standard deviations, s   Weights in standardized scales, w*s
POP_RES                0.0002                       6193.2                        1.3889
PSchools               0.1982                          2.7344                     0.5419
Doctors                0.2623                          1.3019                     0.3414
Hospitals             -0.2659                          0.5800                    -0.1542
Banks                  0.0770                          4.3840                     0.3376
Superstores            0.0028                          1.7242                     0.0048
Petrol                -0.3894                          1.6370                    -0.6375
Intercept              0.5784                          0                          0
Amazingly, we can also see negative effects, specifically of the features Petrol and Hospitals on the target variable. This can be an artefact related to duplication of features: one can think of Hospitals as duplicating Doctors, and Petrol as duplicating Superstores. Thus, before jumping to conclusions, one should check whether the minuses disappear when the duplicates are removed from the feature set. As Table 3.3 shows, not in this case: the negative weights remain, though they change slightly, as do the other weights. This illustrates that the interpretation of linear regression coefficients should be cautious and restrained.
Table 3.3. Weight coefficients for reduced set of features at Post Office as target variable for Market
towns data.
Feature      Weight
POP_RES       0.0003
PSchools      0.1823
Hospitals    -0.3167
Banks         0.0818
Petrol       -0.4072
Intercept     0.5898
The quality of the approximation is evaluated by the minimum value of D² in (3.3) averaged over the number of entities and related to the variance of the target variable. Its complement to 1, the determination coefficient, is defined by the equation
ρ² = 1 - D²/(Nσ²(u))                     (3.6)
The determination coefficient shows the proportion of the variance of u explained by the linear regression. Its square root, ρ, is referred to as the coefficient of multiple correlation between u and X = {x0, x1, x2, …, xp}.
In our example, the determination coefficient is ρ² = 0.83, that is, the seven features explain 83% of the variance of the Post Office feature, and the multiple correlation is ρ = 0.91. Curiously, the reduced set of five features (see Table 3.3) contributes almost the same, 82.4% of the variance of the target variable. This may make one wonder whether just the single Population feature could be enough for the regression. This can be tested with the 2D method described in section 2.1 or with the nD method of this section. According to the latter, one should use a matrix X with two columns: one the Population variable, the other the fictitious variable of all ones. This immediately leads to the slope 0.0003 and intercept 0.4015, though with a somewhat reduced determination coefficient, ρ² = 0.78 in this case. From the prediction point of view this may be all right, but the ultimately reduced set of features loses on interpretation.
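For illustration, the determination coefficient (3.6) can be obtained in MatLab from the fitted values of the previous sketch as follows (u and uhat are assumed to be available; var(u,1) is the variance with the 1/N normalisation).

D2   = sum((u - uhat).^2);             % minimum value of criterion (3.3)
rho2 = 1 - D2/(length(u)*var(u, 1));   % determination coefficient (3.6)
rho  = sqrt(rho2);                     % coefficient of multiple correlation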
3.3. Linear discrimination
The linear discrimination problem can be stated as follows. Let a set of N entities in the feature space X = {x0, x1, x2, …, xp} be partitioned into two classes, sometimes referred to as patterns, a "yes" class and a "no" class, such as, for instance, a set of banking customers in which a typically very small subset of fraudsters constitutes the "yes" class and the rest constitute the "no" class. The problem is to find a function u = f(x0, x1, x2, …, xp) that discriminates the two classes in such a way that u is positive for all entities in the "yes" class and negative for all entities in the "no" class. When the discriminant function is assumed to be linear, so that u = w1*x1 + w2*x2 + … + wp*xp + w0 at constant w0, w1, …, wp, the problem is one of linear discrimination. It differs from linear regression only in that the target values ui here are binary, either "yes" or "no", so that this is a classification rather than a regression problem.
To make it quantitative, define ui=1 if i belongs to the “yes” class and ui= -1 if i belongs to the “no”
class. The intercept w0 is referred to, in the context of the discrimination/classification problem, as
bias. In Figure 3.2, entities (x1, x2, u) are presented by stars, at u=1, and circles, at u=-1. Vector w represents the set of coefficients of a linear classifier; the dashed line represents the set of all x's that are orthogonal to w, <w,x> = 0, that is, the separating hyperplane. Figure 3.2 shows a relatively rare situation in which the two patterns can be separated by a hyperplane – the linearly separable case.
Figure 3.2. A geometric illustration of the separating hyperplane between the circle and star classes.
A linear classifier is defined by a vector w so that if ûi= <w,xi> >0, predict ůi=1; if ûi = <w,xi> < 0,
predict ůi= -1; that is, ůi = sign(<w,xi>) . (Here the sign function is utilized as defined by the
condition that sign(a)=1 when a > 0, =-1 when a < 0, and =0 when a = 0.)
Discriminant analysis DA and Bayesian decision rule
To find an appropriate w, even in the case when the "yes" and "no" classes are linearly separable, various criteria can be utilised. The most straightforward classifier is defined by the least-squares criterion of minimising (3.3). This produces
w = (XᵀX)⁻¹Xᵀu                     (3.7)
The inverse matrix may not exist, which is the case when the rank of X is less than the number of columns. If this is the case, then the so-called pseudo-inverse is utilised, which is not a big deal computationally: in MatLab, for example, one just puts pinv(X'*X) instead of inv(X'*X).
Solution (3.7) has two properties related to the so-called Bayesian decision rule in statistics. According
to Bayes, all relevant knowledge of the world is known to the decision maker in the form of
probability distributions; whatever data may occur afterwards, they may change the probabilities –
hence the difference between prior probabilities and posterior, data updated, probabilities. Specifically,
assume that, in the world, p1 and p2 are probabilities of two states of the world corresponding to the
“positive” and “negative” classes; p1 and p2 are positive and sum up to unity. Assume furthermore that
there are two probability density functions, f1(x1, x2, …, xp) and f2(x1, x2, …, xp), defining the
generation of observed points x={x1, x2, …, xp} corresponding to entities. [We drop for the moment
the fictitious variable x0 in this explanation.] If an x = {x1, x2, …, xp} is actually observed, then, according to the well-known Bayes theorem from elementary probability theory, the posterior probabilities of the "positive" and "negative" classes become
p(1 | x) = p1 f1(x)/f(x)  and  p(2 | x) = p2 f2(x)/f(x)                     (3.8)
where f(x) = p1 f1(x) + p2 f2(x).
Expressions (3.8) are the posterior class probabilities; the density values fi(x) entering them are referred to as likelihoods. They are used in an important approach of mathematical statistics, maximum likelihood, according to which the parameters of the underlying distributions are assumed to have values maximising the likelihood of the observed data.
For our purposes, one expresses the proportion of errors as 1 - p(1 | x), if it is decided that the class of x is "positive", or 1 - p(2 | x) otherwise. To minimise the errors, one therefore decides that the class is "positive" if
p(1 | x) > p(2 | x)  or, equivalently,  f1(x)/f2(x) > p2/p1                     (3.9)
and "negative" if the reverse holds. This rule is referred to as the Bayesian decision rule. Under these assumptions, there is no way to get fewer errors on average. The Bayesian rule can be expressed via the function B(x) = p(1 | x) - p(2 | x), so that B(x) > 0 corresponds to class 1 and B(x) < 0 to class 2, with the equation B(x) = 0 defining the separating surface – the set of points separating the areas of minimum-error decisions.
It appears that the summary squared difference between the least-squares linear decision rule function <w,x> and the Bayesian function B(x) is minimised over all possible w by the least-squares solution (Duda, Hart and Stork 2001, pp. 243-245). Moreover, the least-squares linear decision rule coincides with the Bayesian rule if the class probability distributions f1(x) and f2(x) are Gaussian with the same covariance matrix, that is, are expressed by the formula
fi(x) = exp[-(x - μi)ᵀ Σ⁻¹(x - μi)/2] / [(2π)^p |Σ|]^(1/2)                     (3.10)
where μi is the central point and Σ the p x p covariance matrix of the Gaussian distribution. Moreover, in this case the optimal w = Σ⁻¹(μ1 - μ2) (see Duda, Hart and Stork 2001, pp. 36-40).
This Gaussian is a most popular density function (Figure… ).
Note that formula (3.7) leads to an infinite number of possible solutions. A slightly different criterion, of minimising the ratio of the "within-class error" to the "out-of-class error", was proposed by R. Fisher (1936), as described by Duda, Hart and Stork (2001). Fisher's criterion, in fact, can be expressed via the least-squares criterion if the output vector u is changed to uf as follows: put N/N1 for the components of the first class, instead of +1, and -N/N2 for the entities of the second class, instead of -1. Then the optimal w (3.7) at u = uf is the solution to Fisher's discriminant criterion (see Duda, Hart and Stork 2001, pp. 242-243).
In spite of its good theoretical properties, the least-squares solution is not necessarily the best one for a specific data configuration. In fact, it may fail to separate the positives from the negatives even if they are linearly separable. Consider the following example.
Let there be 14 two-dimensional points presented in Table 3.4 and displayed in Figure 3.4 (a). Points 1, 2, 3, 4 and 6 belong to the positive class (dots in Figure 3.4), the others to the negative class (stars in Figure 3.4). Another set has been obtained by adding to each of the components a random number generated from the normal distribution with zero mean and standard deviation 0.2; it is also presented in Table 3.4 and displayed in Figure 3.4 (b). The class assignment of the perturbed points is the same.
Table 3.4. X-y coordinates of 14 points as given originally and perturbed with a white noise of
standard deviation 0.2, that is, generated from the Gaussian distribution N(0,0.2).
Entity #   Original x   Original y   Perturbed x   Perturbed y
1              3.00         0.00          2.93         -0.03
2              3.00         1.00          2.83          0.91
3              3.50         1.00          3.60          0.98
4              3.50         0.00          3.80          0.31
5              4.00         1.00          3.89          0.88
6              1.50         4.00          1.33          3.73
7              2.00         4.00          1.95          4.09
8              2.00         5.00          2.13          4.82
9              2.00         4.50          1.83          4.51
10             1.50         5.00          1.26          4.87
11             2.00         4.00          1.98          4.11
12             2.00         5.00          1.99          5.11
13             2.00         4.50          2.10          4.46
14             1.50         5.00          1.38          4.59
The optimal vectors w found according to formula (3.7) are presented in Table 3.5, along with the coefficients of the separating, dotted, line in Figure 3.4 (d).
Table 3.5. Coefficients of the straight lines in Figure 3.4.

                            Coefficient at x   Coefficient at y   Intercept
LSE at Original data            -1.2422            -0.8270          5.2857
LSE at Perturbed data           -0.8124            -0.7020          3.8023
Dotted at Perturbed data        -0.8497            -0.7020          3.7846
Q. Why are only 10 points shown in Figure 3.4 (b)? A. Because points 11-14 coincide with points 7-10. Q. What would change if we removed the last four points, so that only points 1-10 were left? A. The least-squares solution would be separating again. Q. Is it possible that Fisher's separation criterion also fails in a linearly separable situation? A. I think yes.
Figure 3.4. Plots (a) and (b) represent the original and perturbed data sets. The least-squares optimal separating line is added in plots (c) and (d), shown solid. Entity 5 falls into the "dot" class according to the solid line in plot (d); a genuinely separating line is shown dotted in plot (d).
Support vector machine SVM criterion
Another criterion puts the separating hyperplane in the middle of an interval drawn through the closest points of the two patterns. This criterion produces what is referred to as the support vector machine, since it heavily relies on the points (the support vectors) involved in drawing the separating hyperplane (shown by circles in Figure 3.3).
The difference between the least-squares discriminant hyperplane and the support vector machine hyperplane stems from the differences in their criteria. The latter is based on the borderline objects only, whereas the former takes all entities into account, so that the further away an entity is, the more it may affect the solution, because of the quadratic nature of the least-squares criterion. Some may argue that both borderline and far-away entities can be rather randomly represented in the sample under investigation, so that neither should be taken into account: it is the "core" entities of the patterns that should be separated; however, no such approach has been taken in the literature so far.
Figure 3.3. The support vector machine based separation hyperplane, shown as solid line, along with
the borderline points (support vectors) defining it, shown with circles.
Kernels
Situations in which the patterns are linearly separable are very rare; in real data, the patterns are typically well intermingled. To tackle these typical situations, the data are nonlinearly transformed into a much higher dimensional space in which, because of both the nonlinearity and the high dimension, the patterns may become linearly separable. The transformation is performed only virtually, because what really matters is just the inner products between the transformed entities. The inner products in the transformed space can be computed with so-called kernel functions. It is convenient to define a kernel function over vectors x = (xv) and y = (yv) through the squared Euclidean distance d²(x,y) = (x1-y1)² + … + (xV-yV)², because the resulting matrices are positive definite. Arguably, the most popular is the Gaussian kernel (3.7):
K(x,y) = exp(-d²(x,y))                     (3.7)
Q. What is the VC-dimension of the linear discrimination problem at p=2 (two input features)? A. It is 3, because any three points in general position can be shattered by a line, but there are 4-point configurations that cannot be shattered using linear separators. Take, for instance, a rectangle whose vertices joined by a diagonal are labelled "+" while the two others are labelled "-": no line can reproduce that labelling.
3.4. Decision Trees
A decision tree is a structure used for prediction of quantitative features (regression tree) or nominal features (classification tree).
Each node corresponds to a subset of entities (the root to the set of all entities I), and its children correspond to the parts of that subset defined by a single predictor feature x.
Each terminal node is assigned an individual target feature value u.
Example: Product-defined clusters of eight Companies.
[Tree diagram: the root splits on Sector (Util/Ind vs Retail); one branch is a leaf labelled C, the other splits on EC (No vs Yes) into leaves A and B.]
Figure 1. Decision tree for three product-based classes of Companies defined by categorical features.
Decision trees:
Advantages: interpretability; computational efficiency.
Drawbacks: simplistic; imprecise.
[Tree diagram: splits on NSup (< 4 vs 4 or more) and ShaP (> 30 vs < 30) lead to the leaves A, B and C.]
Figure 2. Decision tree for three product-defined classes of Companies defined by quantitative features.
Algorithm: Take a node and a feature value(s) and split the corresponding subset accordingly
Issues (classification tree):
Stop: whether any node should be split at all.
Select: which node of the tree to split, and by which feature.
Score: chi-squared (CHAID in SPSS), entropy (C4.5), change of the Gini coefficient (CART).
Assign: what target class k to assign to a terminal node x. Conventionally, the class k* at which p(k/x) is maximised over k.
I suggest: this is ok when p(k) is about 10%-30%. Otherwise, use a comparison between p(k/x) and p(k). Specifically:
(i) if p(k) is of the order of 50%, the absolute Quetelet index a(k/x) = p(k/x) - p(k) should be used;
(ii) if p(k) is of the order of 1% or less, the relative Quetelet index q(k/x) = [p(k/x) - p(k)]/p(k) should be employed.
4. Correlation: Learning neural networks
4.1. Steepest descent and perceptron for the square error minimisation
The machine learning paradigm is based on the assumption that a learning device adapts itself
incrementally by facing entities one by one. This means that the full sample is never known to the
device so that global solutions, such as the projection (3.5), are not applicable. In such a situation an
optimization algorithm that processes entities one by one should be applied. Such is the gradient
method, also referred to as the steepest descent. With respect to the problem of building a linear
classifier to minimise the square error (3.3) the algorithm can be stated as follows.
Steepest descent for the problem of linear discriminant analysis
0. Initialise the weights w randomly.
1. For each training instance (xi, ui):
a. Compute grad(Ei(w)), where Ei(w) is the part of criterion E in (3.3) related to the instance:
Ei = (ui - w1*xi1 - w2*xi2 - … - wp*xip - w0*xi0)²
Obviously, the t-th component of the gradient is ∂Ei/∂wt = -2(ui - ûi)xit, t = 0, 1, …, p.
b. Update the weights w according to the equation
w(new) = w - μ grad(Ei(w))
so that
wt(new) = wt + μ(ui - ûi)xit
(Here μ is put rather than 2μ because it is an arbitrary factor anyway.)
2. If w(new) ≈ w(old), stop; otherwise go to step 1 with w = w(new).
This process is proven to converge provided that the gradient step μ is set correctly: it should not be too large, so that the minimum point is not jumped over, nor too small, so that the updates do not fade out while w is still far from the minimum.
Perceptron
In the nineteen-fifties, F. Rosenblatt proposed the following modification of the steepest descent, referred to as the perceptron.
Perceptron algorithm
0. Initialise the weights w randomly or to zero.
1. For each training instance (xi, ui):
a. compute ůi = sign(<w,xi>);
b. if ůi ≠ ui, update the weights w according to the equation
w(new) = w(old) + μ(ui - ůi)xi
where μ, 0 < μ < 1, is the so-called learning rate.
2. Stopping rule: w(new) = w(old), that is, no updates occur during a full pass through the instances.
The perceptron is a slightly modified form of the conventional gradient minimisation algorithm: the partial derivative of Ei with respect to wt is equal to -2(ui - ûi)xit, which is similar to the quantity used in the perceptron learning rule, -2(ui - ůi)xit. The innovation was to change the continuous ûi to the discrete ůi = sign(ûi) in the process of steepest descent. (The logic is the same as in the binary linear discriminant rule above.)
However, further mathematical analysis shows that the perceptron can be considered not just an analogue but a gradient minimisation algorithm in its own right, for a slightly different error function: the sum of the absolute values of the errors rather than of their squares!
The perceptron is proven to converge to a separating w when the patterns are linearly separable.
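A minimal MatLab sketch of the perceptron algorithm above, assuming an augmented data matrix X and a column u of +1/-1 labels that are linearly separable, might be:

mu = 0.5;                          % learning rate
w  = zeros(size(X,2), 1);          % initial weights
changed = true;
while changed                      % repeat passes until no update occurs
    changed = false;
    for i = 1:size(X,1)
        ui_hat = sign(X(i,:)*w);
        if ui_hat ~= u(i)
            w = w + mu*(u(i) - ui_hat)*X(i,:)';   % perceptron update
            changed = true;
        end
    end
end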
4. 2. Artificial neuron
Figure 4.1. Scheme of a neuron cell.
A linear classifier can be considered a model of the neuron cell in a living organism. A neuron cell
fires an output when its summary input becomes higher than a threshold. Dendrite brings signal in,
axon passes it out, and the firing occurs via synapse, a gap between neurons, that makes the threshold
(see Figure 4.1).
The decision rule ůi =sign(ûi) can be interpreted in terms of an artificial neuron as follows: features xi
are input signals, weights wt are the wiring (axon) features, the bias w0 – the firing threshold, and
sign() – the neuron activation function. This way, the perceptron can be considered one of the first
examples of nature-inspired computation.
Figure 4.2. A scheme of an artificial neuron.
An artificial neuron consists of a set of inputs (corresponding to the x-features), wiring weights, and an activation function involving a firing threshold. Two popular activation functions, besides the sign function ůi = sign(ûi), are the linear activation function, ůi = ûi (we considered it when discussing the steepest descent), and the sigmoid activation function ůi = s(ûi), where
s(x) = (1 + e⁻ˣ)⁻¹                     (4.1)
is a smooth analogue of the sign function, except that its output is between 0 and 1, not -1 and 1 (see Figure 4.3 (b)). To imitate the perceptron with its sign(x) output, between -1 and 1, we first double the output interval and then subtract 1:
th(x) = 2s(x) - 1 = 2(1 + e⁻ˣ)⁻¹ - 1                     (4.1′)
This function, illustrated in Figure 4.3 (c), is usually referred to as the hyperbolic tangent. In contrast to the sigmoid s(x), the hyperbolic tangent th(x) is symmetric: th(-x) = -th(x), like sign(x), which can be useful in some contexts.
Figure 4.3. Graphs of sign (a), sigmoid (b) and hyperbolic tangent (c) functions.
The sigmoid activation functions have nice mathematical properties; they are not only smooth, but their derivatives can be expressed through the functions themselves. Specifically,
s′(x) = ((1 + e⁻ˣ)⁻¹)′ = (-1)(1 + e⁻ˣ)⁻²(-1)e⁻ˣ = s(x)(1 - s(x)),                     (4.2)
th′(x) = [2s(x) - 1]′ = 2s′(x) = 2s(x)(1 - s(x)) = (1 + th(x))(1 - th(x))/2                     (4.2′)
4.3. Learning with multi-layer neural nets
4.3.0. A case problem. The Iris features come in pairs: the size (length and width) of the petals (features 1, 2) and that of the sepals (features 3, 4). It is likely that the sepal sizes and petal sizes are related.
/advanced/ml/Data/iris.dat, 150 x 4.
Consider, for any Iris specimen xi = (xi1, xi2, xi3, xi4), i = 1,…,150, x = (xi3, xi4) (sepal) as input and u = (xi1, xi2) (petal) as output. Find F such that u ≈ F(x).
4.3.1. One-hidden-layer NN
Build F as a neural network of three layers:
(a) an input layer that accepts x = (xi3, xi4) and the bias x0 = 1 (see the previous lecture);
(b) an output layer producing the estimate û of the output u = (xi1, xi2); and
(c) an intermediate, hidden, layer to allow more flexibility in the space of feasible functions F (hidden because it is not seen from the outside).
This structure (Figure 4.4) is generic in NN theory; it has been proven, for instance, that such a structure can exactly learn any subset of the set of entities. Moreover, any pre-specified u = F(x) can be approximated with such a one-hidden-layer network if the number of hidden neurons is large enough (Tsybenko 1989).
[Figure 4.4: a three-layer feed-forward network. Input layer I (linear): x1, x2 and x0 = 1, indexed by i. Hidden layer II (sigmoid): neurons II1, II2, II3, indexed by j, connected to the inputs by weights wij. Output layer III (linear): û1, û2, indexed by k, connected to the hidden neurons by weights vjk.]
Figure 4.4. A feed-forward network with 2 input and 2 output features (no feedback loops). Layers:
input (I, indexed by i), output (III, indexed by k) and Hidden (II, indexed by j).
The weights from layer I to layer II form the 3x3 matrix W = (wij), i = I1, I2, I3, j = II1, II2, II3. The weights from layer II to layer III form the 3x2 matrix V = (vjk), j = II1, II2, II3, k = III1, III2. Layers I and III are assumed to apply the identity (linear) transformation; the hidden layer (II) applies the symmetrised sigmoid th(x) of (4.1′).
4.3.2. Formula for the NN transformation F:
Node j of the hidden layer II:
Input: zj = w1j*x1 + w2j*x2 + w3j*x3, which is the j-th component of the vector z = Σi xi*wij = x*W, where x is the 1x3 input vector and W = (wij) is the 3x3 weight matrix.
Output: th(zj), j = 1, 2, 3, where th is function (4.1′).
Node k of the output layer III:
Output = Input = Σj vjk*th(zj), which is the k-th component of the matrix product û = th(z)*V.
Thus, the NN in Figure 4.4 transforms the input x into the output û as
û = th(x*W)*V                     (4.3)
If the matrices W and V are known, (4.3) expresses, and computes, the unknown function u = F(x) in terms of th, W and V.
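For illustration, the transformation (4.3) can be coded in MatLab in a few lines; x, W and V are assumed to be a 1x3 augmented input and the 3x3 and 3x2 weight matrices of Figure 4.4.

th   = @(z) 2./(1 + exp(-z)) - 1;   % symmetrised sigmoid (4.1')
z    = x*W;                         % hidden-layer inputs
uhat = th(z)*V;                     % network output, a 1x2 vector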
4.3.3. Learning problem
Find weight matrices W and V minimising the squared difference between observed u and û found
with (4.3),
E=d(u,û) = <u - th(x*W)*V, u - th(x*W)*V >,
(4.4)
over the training entity set.
4.4. Learning weights with error back propagation
4.4.1. Updating formula.
In NN applications, learning the weights W and V minimising E is done with back-propagation, which imitates the gradient descent. It runs iterations of updating V and W, each based on the data of one entity (in our case, one of the 150 Iris specimens), with the input values in x = (xi) and the output values in u = (uk). An update moves V and W in the anti-gradient direction:
V(new) = V(old) - μ gV,    W(new) = W(old) - μ gW                     (4.5)
where μ is the learning rate (step size) and gV, gW are the parts of the gradient of the error function E in (4.4) related to the matrices V and W.
Specifically, the error function is
E = [(u1 - û1)² + (u2 - û2)²]/2                     (4.6)
where e1 = u1 - û1 and e2 = u2 - û2 are the differences between the actual and predicted outputs. There are two items in E, one for each of the two outputs; the more outputs, the more items. The division by 2 is made to avoid a factor of 2 in the derivatives of E.
The equations for learning V and W can be written component-wise:
vjk(new) = vjk(old) - μ ∂E/∂vjk,    wij(new) = wij(old) - μ ∂E/∂wij    (i∈I, j∈II, k∈III)                     (4.5′)
To make these computable, let us express the derivatives explicitly; first those closer to the output, over vjk:
∂E/∂vjk = -(uk - ûk) ∂ûk/∂vjk.
The derivative ∂ûk/∂vjk = th(zj), since ûk = Σj th(zj)vjk. Thus,
∂E/∂vjk = -(uk - ûk)th(zj).                     (4.7)
The derivative ∂E/∂wij refers to the next layer, of W, which requires more chain derivatives. Specifically,
∂E/∂wij = Σk[-(uk - ûk) ∂ûk/∂wij].
Since ûk = Σj th(Σi xi wij)vjk, this can be expressed as
∂ûk/∂wij = vjk th′(Σi xi wij) xi.
The derivative th′(z) can be expressed according to (4.2′), which leads to the following final expression for the partial derivatives:
∂E/∂wij = -Σk[(uk - ûk)vjk](1 + th(zj))(1 - th(zj))xi/2                     (4.8)
Equations (4.5), (4.7) and (4.8) lead to the following rule for processing an instance in the back-propagation algorithm (see 4.4.2).
4.4.2. Instance Processing:
1. Forward computation (of the output û and error). Given matrices V and W, upon receiving
an instance (x,u), the estimate û of vector u is computed according to the neural network as
formalised in equation (4.3), and the error e = u – û is calculated.
2. Error back-propagation (for estimation of the gradient elements). Each neuron receives the relevant error estimate, which is
−ek = −(uk − ûk)
from (4.7) for output neurons k (k = III1, III2), or
−Σk[(uk − ûk) vjk], from (4.8), for hidden neurons j (j = II1, II2, II3) [the latter can be seen as the sum of errors arriving from the output neurons according to the corresponding synapse weights]. These are used to adjust the derivative (4.7), or (4.8), by multiplying it by its local data depending on the input signal, which is th(zj) for neuron k's source j in (4.7), and th′(zj) xi for neuron j's source i in (4.8).
3. Weights update. Matrices V and W are updated according to formula (4.5’).
What is nice in this procedure is that the computation can be done locally, so that every neuron
processes only the data that are available to this neuron, first from the input layer, then backwards,
from the output layer. In particular, the algorithm does not change if the number of hidden neurons is changed from h = 3, as in Figure 4.4, to any other integer h = 1, 2, …; nor does it change if the number of inputs and/or outputs is changed.
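For concreteness, here is a minimal MATLAB sketch of processing one instance according to steps 1-3. The variable names (x, u, W, V, eta) and the concrete sigmoid are assumptions for illustration, not the original nnn.m code.
% One instance of back-propagation (steps 1-3), as a sketch.
% Assumed: x is a 1x3 input (with bias), u is a 1x2 target, W is 3x3, V is 3x2,
% eta is the learning rate; th has derivative th'(z) = (1 + th(z)).*(1 - th(z))/2.
th  = @(z) (1 - exp(-z)) ./ (1 + exp(-z));
z   = x*W;                       % hidden inputs, 1x3
h   = th(z);                     % hidden outputs, 1x3
u_hat = h*V;                     % forward computation, equation (4.3)
e   = u - u_hat;                 % output error, 1x2
gV  = -h' * e;                   % gradient part for V, equation (4.7)
d   = (e*V') .* (1 + h) .* (1 - h) / 2;   % back-propagated error at hidden nodes
gW  = -x' * d;                   % gradient part for W, equation (4.8)
V   = V - eta*gV;                % weight updates, equation (4.5')
W   = W - eta*gW;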
The procedure of 4.4.2 can easily be extended to any feed-forward network, however many hidden layers it may have. Steps 1-3 are performed for all available entities in a random order, which constitutes an epoch. A number of epochs are executed until the matrices V and W stabilise. Typically, even a thousand epochs may not be enough for V and W to stabilise. Since, in practical calculations, this may take ages to achieve, other stopping criteria can be utilised:
(i) the difference between the average values (over iterations within an epoch) of the error function (4.6) becomes smaller than a pre-specified threshold, such as 0.0001;
(ii) the number of epochs performed reaches a pre-specified threshold, such as 10,000.
4.5. Error back-propagation algorithm (for a data set available as a whole, "offline")
Finally, one can formulate the error back-propagation algorithm as follows.
A. Initialise the weight matrices W = (wij) and V = (vjk) using the random normal distribution N(0,1), with mean 0 and variance 1.
B. Choose the data standardisation option, amounting to a selection of the shift and scale coefficients, av and bv, for each feature v, so that every data entry xiv is transformed to yiv = (xiv − av)/bv (see section 4.6 below).
C. Formulate the Halt criterion as explained above and run a loop over epochs.
D. Randomise the order of entities within an epoch and run a loop of the Instance Processing of 4.4.2 in that order.
E. If the Halt criterion is met, end the computation and output the results: W, V, û, e, and E. Otherwise, execute D again.
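A minimal MATLAB sketch of the epoch loop A-E might look as follows; the matrix names Y (standardised inputs with a bias column), U (targets) and all thresholds are assumptions for illustration only.
% Sketch of the offline error back-propagation loop (steps A-E).
% Assumed: Y is N x 3 standardised input (with bias), U is N x 2 targets,
% 3 hidden neurons, eta the learning rate.
[N, ~] = size(Y);
W = randn(3, 3);  V = randn(3, 2);             % step A: random N(0,1) weights
th = @(z) (1 - exp(-z)) ./ (1 + exp(-z));
maxEpochs = 10000;  tol = 0.0001;  Eprev = Inf;
for epoch = 1:maxEpochs                        % step C: loop over epochs
    order = randperm(N);                       % step D: random entity order
    Ecur = 0;
    for i = order                              % instance processing, 4.4.2
        x = Y(i,:);  u = U(i,:);
        z = x*W;  hh = th(z);  e = u - hh*V;
        d = (e*V') .* (1 + hh) .* (1 - hh) / 2;
        V = V + eta * (hh' * e);               % i.e. V - eta*gV with gV = -hh'*e
        W = W + eta * (x' * d);
        Ecur = Ecur + (e*e')/2;
    end
    if abs(Eprev - Ecur)/N < tol, break; end   % step E: halt criterion (i)
    Eprev = Ecur;
end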
4.6. Data standardisation for NN learning
Due to the specifics of the binary target variables and of activation functions, such as th(x) and sign(x), which have −1 and 1 as their boundaries, the data in the NN context are frequently pre-processed so that every feature's range is between −1 and 1, with midrange 0. To achieve this, take bv equal to the half-range, bv = (Mv − mv)/2, and the shift coefficient av equal to the mid-range, av = (Mv + mv)/2. Here Mv denotes the maximum and mv the minimum of feature v. Then transform all feature entries by first subtracting av from each and then dividing the results by bv.
The practice of digital computation shows that it is a good idea to further expand the ranges into a [−10, 10] interval by afterwards multiplying all entries by 10: in this range, floating-point numbers lead to smaller computation errors than when they are closer to 0.
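A minimal MATLAB sketch of this pre-processing step (the matrix name X is an assumption):
% Rescale every column of X to the range [-10, 10] with midrange 0 (section 4.6).
Mv = max(X);  mv = min(X);           % column maxima and minima
av = (Mv + mv)/2;                    % shift: mid-range
bv = (Mv - mv)/2;                    % scale: half-range
Y  = 10 * (X - repmat(av, size(X,1), 1)) ./ repmat(bv, size(X,1), 1);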
Project 4.1: One hidden layer NN for predicting Iris/Student data
Let us develop a Matlab code for learning NN weights with the back propagation algorithm according
to the structure of Figure 4.4. Two parameters of the algorithm, the number of neurons in the hidden
layer and the learning rate, will be input parameters. The output, in this case, should be the level of
error achieved and the weight matrices V and W.
The code should include the following steps:
1. Loading data from subdirectory Data. According to the task, this can be either iris.dat or studn.dat.
2. Normalizing the data to [-10,10] scale according to the formulas in section 4.6.
3. Preparing the input and output (target) sub-matrices after a decision has been made on which features fall into the former and which into the latter category. In the case of the Iris data, for example, the target can be predicting the petal data (features w3 and w4) from the sepal measurements (features w1 and w2). In the case of the Students data, the target can be the students' marks on all three subjects (CI, SP and OOP), whereas the other variables (occupation categories, age and number of children) are the input.
4. Initializing the network with random (0,1) normally distributed values and setting a loop over
epochs with the counter at zero.
5. Organizing a loop over the entities in a random order; here the Matlab command randperm(n) for
making a random permutation of integers 1, 2,…, n can be used.
6. Forward pass: given an entity, the output is calculated, as well as the error, using the current V, W
and activation functions. We take here the symmetric sigmoid (4.1’) as the activation function.
7. Error back-propagation: computing the gradient elements for V and W according to formulas (4.7) and (4.8).
8. Weights V and W update with the gradients computed and learning rate accepted as the input.
9. Halt-condition including both the level of precision, say 0.01, and a threshold to the number of
epochs, say, 5,000. After either is reached the programme halts.
A Matlab code, nnn.m, including all nine steps is in Appendix 3.
At the Iris data, this program leads to the average errors at each of the output variables presented in Table 4.1, at different numbers of hidden neurons h. Note that the feature ranges are equal to 20 here, so that the relative average error is about 7% of the range.

h     |e1|    |e2|
3     1.07    1.77
6     0.99    1.68
10    0.97    1.63

Table 4.1. Absolute error values in the predicted petal dimensions with the full Iris data after 5,000 epochs.
The number of parameters in matrices W and V here is 3h, in W, plus 2h, in V. One can see that the increase in h does bring some improvement – but not that great!
For the Students data, this program leads to the average errors in predicting the student marks over the three subjects presented in Table 4.2, at different numbers of hidden neurons h:

h     |e1|    |e2|    |e3|    # param.
3     2.65    3.16    3.17    27
6     2.29    3.03    2.75    54
10    2.17    3.00    2.64    90

Table 4.2. Absolute error values in the predicted student marks over all three subjects, with the full Student data after 5,000 epochs.
Home-work:
1. Find the values of E for the errors reported in the tables above.
2. Take a look at what happens if the data are not normalised.
3. Take a look at what happens if the learning rate is increased, or decreased, ten times.
4. Extend the table above for different numbers of hidden neurons.
5. Try petal sizes as input with sepal sizes as output.
6. Try predicting only one size/mark over all input variables.
7. Modify this code to involve the sigmoid activation function.
8. Find a way to improve the convergence of the process, for instance, with adaptive changes in
the step size values.
Back-propagation should be executed with a re-sampling scheme, such as k-fold cross-validation, to provide estimates of the variation of the results with respect to changes in the data.
Q. The derivatives of the sigmoid (1) or hyperbolic tangent (2) functions appear to be simple polynomials of themselves:
s′(x) = [(1 + e^(−x))^(−1)]′ = (−1)(1 + e^(−x))^(−2) (e^(−x))(−1) = (1 + e^(−x))^(−2) e^(−x) = s(x)(1 − s(x))
5. Learning summarizations
5.1. General
5.1.1. Decision structures
Popular decision structures used for data aggregating are the same as those used for data associating
and include the following:
(a) Partition of the entity set.
A partition S = {S1, S2, …, SK} of the N-element entity set I into a set of non-empty, non-overlapping clusters may model a typology or a categorical feature reflecting within-cluster similarities between objects. When the data of the entities are feature based, such a partition is frequently accompanied by a set c = {c1, c2, …, cK} of cluster centroids, each centroid ck being considered a "typical" representative of cluster Sk (k = 1, 2, …, K). This means that, on the aggregate level, the original entities are substituted by clusters Sk represented by the vectors ck. That is, each entity i∈Sk is represented by ck. This can be expressed formally by using the concept of a decoder. A decoder, in this context, is a mapping from the set of clusters to the set of entities allowing recovery of the original data from the aggregates, with a loss of information of course. If the data set is represented by an N×V matrix X = (xiv), where i∈I are entities and v = 1, 2, …, V are features, then a decoder of the clustering (S, c) can be expressed as c1v zi1 + c2v zi2 + … + cKv ziK ≈ xiv, where zk = (zik) is the N-dimensional membership vector of cluster Sk defined by the condition that zik = 1 if i∈Sk and zik = 0 otherwise. Indeed, for every i∈I, there is only one non-zero item in the sum, so that the sum in fact equals ckv for that cluster Sk to which i belongs. Obviously, the closer the centroids fit the data, the better the clusters represent the data.
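A minimal MATLAB sketch of this decoder, under assumed names Z (the N×K binary membership matrix), C (the K×V matrix with the centroids in its rows) and X (the data matrix):
% Decoder for a partition: every entity is replaced by its cluster's centroid.
% Z is an N x K 0/1 membership matrix (one 1 per row); C is K x V centroids.
Xhat = Z * C;                        % N x V matrix approximating the data matrix X
loss = sum(sum((X - Xhat).^2));      % squared recovery error (information loss)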
(b) Representative vectors.
Sometimes centroids, called representative, quantizing or learning vectors, alone are considered
to represent the data. This is based on the implicit application of the principle which is called the
minimum distance rule, in clustering, or Voronoi diagram, in computational geometry. Given a set of
points c = {c1, c2, …, cK} in the feature space, the minimum distance rule assigns every entity i∈I, and in fact every point in the space, to that ck (k = 1, 2, …, K) to which it is the closest. In this way, the set c is assigned a partition S, which relates to the structure (a) just discussed.
Given c = {c1, c2, …, cK} in the feature space, let us refer to the set of points that are closer to ck than to any other as the gravity area G(ck). It is not difficult to prove that if the distance utilised is Euclidean, then the gravity areas are convex, that is, for any x, y ∈ G(ck), the straight-line segment between them also belongs to G(ck). Indeed, for any rival ck', consider the set of points Gk' that are closer to ck than to ck'. It is known that this set Gk' is but a half-space defined by a hyperplane <x, gk'> = fk' which is orthogonal to the interval between ck and ck' at its midpoint. Obviously, G(ck) is the intersection of the sets Gk' over all k' ≠ k. Then the gravity area G(ck) is convex, since each half-space Gk' is convex, and the intersection of a finite set of convex sets is convex too. The gravity areas G(ck) of three representative points on the plane are illustrated in Figure 5.1 using thick solid lines.
Figure 5.1. Voronoi diagram on the plane: three representative vectors, the stars, along with the
triangle of dashed lines between them and solid lines being perpendiculars to the triangle side middle
points. The boundaries between the representative vectors gravity areas are highlighted.
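A minimal MATLAB sketch of the minimum distance rule, with assumed names X (N×V data) and C (K×V representative vectors):
% Minimum distance rule: assign every entity to its nearest representative.
% X is N x V, C is K x V; S(i) is the index of the closest row of C.
[N, V] = size(X);  K = size(C, 1);
D = zeros(N, K);
for k = 1:K
    D(:,k) = sum((X - repmat(C(k,:), N, 1)).^2, 2);  % squared Euclidean distances
end
[~, S] = min(D, [], 2);    % cluster labels implied by the Voronoi partition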
(c) Feature transformation
Figure 5.2 presents a number of adults represented by their height and weight measurements; those overweight (pentagons) are separated from those of normal weight (pentagrams) by the dashed line expressing a common-sense maxim of slim-bodied individuals that "the weight in kg should not be greater than the height in cm short of a hundred".
[Figure 5.2 here: scatter of Height (cm) versus Weight (kg) with the dashed separating line.]
Figure 5.2. A transformed feature, y = Height − Weight − 100 (dashed line), to separate pentagons from pentagrams.
in the matters of keeping weight normal, the single variable HWH=Height-Weight-100 alone should
stand, with a much better resolution, than the two original features, Height and Weight. This example
shows that a decision rule is but an aggregate feature.
Having specified a number of linear – or not – transformations zw = fw(x1, x2, …, xV), w = 1, …, W (typically, W is supposed to be much greater than V, though this may not be necessary in some contexts), one needs a decoder to recover the original data from the aggregates. A linear decoder can be specified by assuming a set of coefficients cv = (cvw) such that each linear combination <cv, z> = cv1 z1 + cv2 z2 + … + cvW zW = cv1 f1(x1, x2, …, xV) + cv2 f2(x1, x2, …, xV) + … + cvW fW(x1, x2, …, xV) can stand for the original variable v, v = 1, 2, …, V. In matrix terms this can be expressed as follows. Denote by Z = (ziw) the N×W matrix of values of the aggregate variables zw on the set of entities, and by C = (cvw) the matrix whose rows are the vectors cv. Then the N×V matrix X′ = ZC^T is supposed to be the decoder of the original data matrix X. An obvious quality criterion for both the decoder and the transformation of the variables is the similarity between X′ and X: the closer X′ is to X, the better the transformed features reflect the original data. But this cannot be the only criterion, because it is not enough to specify the values of the transformation and decoder coefficients. Indeed, assume that we have found the best possible transformation and decoder, leading to a very good data recovery matrix X′. Then Z* = ZA with decoder C* = CA, where A = (aww′) is an orthogonal W×W matrix such that AA^T = I (I being the identity matrix), will produce the data recovery matrix X* = Z*C*^T coinciding with X′. Indeed, X* = Z*C*^T = ZAA^TC^T = ZC^T = X′. Additional principles may involve requirements on the transformed features coming from both internal criteria, such as Occam's razor, and external criteria, such as the need to separate pre-specified patterns like that in Figure 5.2.
(d) Neural network: a non-linear transformation built from linear-like elements (in general, we do not know in advance what the non-linearity may be).
5.1.2. Least squares criterion
Criteria for finding the summary structures, and criteria for judging them sound:
- error
- stability
- interpretability
Given N vectors forming a matrix X = {(xi)} of features observed at entities i = 1, …, N, so that xi = (xi1, …, xip), and a target set of aggregates U with a decoder D: U → Rᵖ, build an aggregate
û = F(X), û ∈ U,
such that the error, which is the difference between the decoded data D(û) computed from û and the observed data X, is minimal over the class of admissible rules F. More explicitly, one assumes that
X = D(û) + E
(5.1)
where E is the matrix of residual values, usually referred to as errors. The smaller the errors, the better the summarization û. According to the most popular, least-squares, approach, the errors are minimised by minimising the summary squared error
E² = <X − D(û), X − D(û)> = <X − D(F(X)), X − D(F(X))>
(5.2)
with respect to all admissible F and D.
Expression (5.2) can be further decomposed into
E² = <X, X> − 2<X, D(û)> + <D(û), D(û)>.
In many situations, such as Principal component analysis and K-Means clustering described later, the set of all possible decodings D(F(X)) forms a linear subspace. In this case, the multidimensional points X, D(û) and 0 form a "right-angle triangle", so that <X, D(û)> = <D(û), D(û)> and expression (5.2) becomes a multivariate analogue of the Pythagorean equation relating the squares of the hypotenuse, X, and the sides, D(û) and E:
<X, X> = <D(û), D(û)> + E²
(5.3)
or, on the level of matrix entries,
Σ_{i∈I} Σ_{v∈V} x_iv² = Σ_{i∈I} Σ_{v∈V} d_iv² + Σ_{i∈I} Σ_{v∈V} e_iv²
(5.3′)
We consider here that the data is an N×V matrix X = (xiv) – a set of rows/entities xi (i = 1, …, N) or a set of columns/features xv (v = 1, …, V). The item on the left in (5.3′) is usually referred to as the data scatter and denoted by T(X),
T(X) = Σ_{i∈I} Σ_{v∈V} x_iv²
(5.4)
Why "scatter"? Because T(X) is the sum of Euclidean squared distances from all entities to 0. T(X) is the sum of entity contributions, the squared distances d(xi, 0) (i = 1, …, N), or of feature contributions, the sums tv = Σ_{i∈I} x_iv². In the case when the average cv has been subtracted from all values of the column v, tv = Nσv², N times the variance.
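A minimal MATLAB illustration of the data scatter (5.4) and of the decomposition (5.3′); X is an assumed data matrix and Xhat its decoded approximation:
% Data scatter T(X) and the Pythagorean decomposition (5.3').
T    = sum(sum(X.^2));            % data scatter (5.4)
E2   = sum(sum((X - Xhat).^2));   % squared residuals
Dhat = sum(sum(Xhat.^2));         % scatter of the decoded part
% When the decodings form a linear subspace (PCA, K-Means) and the fit is
% least-squares, T equals Dhat + E2.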
5.1.3. Data standardization
The decomposition (5.3) shows that the least-squares criterion in the problem of summarization highly depends on the feature scales, so that the solution may be strongly affected by scale changes. This was not the case in the correlation problems, at least with only one target feature, because the least squares there were just that feature's errors, thus expressed in the same scale.
To somehow balance the contributions of features to the data scatter, one conventionally applies the operation of standardisation. This operation applies only to quantitative features. Thus, when the data contain categorical features, one first expresses them in a quantitative format. This can be done by considering each category as a feature on its own – sometimes referred to as a dummy variable – quantitatively recoding it by assigning its "Yes" value with 1 and its "No" value with 0.
Example 5.1. Consider the Company data set:

Company name   Income, $mln   SharP, $   NSup   EC    Sector
Aversi         19.0           43.7       2      No    Utility
Antyos         29.4           36.0       3      No    Utility
Astonite       23.9           38.0       3      No    Industrial
Bayermart      18.4           27.9       2      Yes   Utility
Breaktops      25.7           22.3       3      Yes   Industrial
Bumchist       12.1           16.9       2      Yes   Industrial
Civok          23.9           30.2       4      Yes   Retail
Cyberdam       27.2           58.0       5      Yes   Retail

It contains two categorical variables: EC, with categories Yes/No, and Sector, with categories Utility, Industrial and Retail. The former feature, EC, in fact represents just one category, "Using E-Commerce", and can be recoded as such by substituting 1 for Yes and 0 for No. The other feature, Sector, has three categories, each of which should be substituted by a dummy variable. To do this, we just put three category features: (i) Is it the Utility sector?, (ii) Is it the Industrial sector?, and (iii) Is it the Retail sector?, each admitting Yes or No values, respectively substituted by 1 and 0. This leads to the following quantitative table.
Table 5.2. Quantitatively recoded Company data table.

Company name   Income   SharP   NSup   EC   Utility   Industrial   Retail
Aversi         19.0     43.7    2      0    1         0            0
Antyos         29.4     36.0    3      0    1         0            0
Astonite       23.9     38.0    3      0    0         1            0
Bayermart      18.4     27.9    2      1    1         0            0
Breaktops      25.7     22.3    3      1    0         1            0
Bumchist       12.1     16.9    2      1    0         1            0
Civok          23.9     30.2    4      1    0         0            1
Cyberdam       27.2     58.0    5      1    0         0            1
Standardisation – a shift of the origin and rescaling to make the features comparable:
Yiv = (Xiv − Av)/Bv
where X is the original data, Y the standardized data, i an entity and v a feature; Av is the shift of the origin, typically the average, and Bv is the rescaling factor, traditionally the standard deviation (from the statistics perspective), though the range may be better (from the CI perspective).
In particular, when a nominal feature is represented by 3 binary features corresponding to its 3 categories, the feature's summary contribution increases 3 times, so this should be compensated by further dividing the entries by the square root of 3. Why the square root, not just 3? Because the contribution to the data scatter involves the entries squared.
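A minimal MATLAB sketch of this standardisation over the recoded Company matrix; the matrix name Xq and the vector ncat of category counts are assumptions made for the example:
% Standardise: subtract column means, divide by ranges, and additionally
% divide each dummy column by sqrt(number of categories of its feature).
% Xq is the 8 x 7 matrix of Table 5.2; ncat = [1 1 1 1 3 3 3] says the last
% three columns code one 3-category feature.
av = mean(Xq);
bv = max(Xq) - min(Xq);
Y  = (Xq - repmat(av, size(Xq,1), 1)) ./ repmat(bv .* sqrt(ncat), size(Xq,1), 1);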
A typical data set: visualising the 7-dimensional Company data on a 2D screen (to be explained later, as part of PCA/SVD).
[Three scatter plots of the eight companies here:]
- No normalization (Bv = 1);
- z-scoring;
- Recommended (to be explained later): normalising by range*#categories.
The clusters, much blurred in the first two plots, are clearly seen in the last one.
The data with the averages subtracted and normalized by range*sqrt(number-of-categories):

      Income   SharP   NSup    EC      Utility  Industrial  Retail
e1    -0.20     0.23   -0.33   -0.63    0.36    -0.22       -0.14
e2     0.40     0.05    0      -0.63    0.36    -0.22       -0.14
e3     0.08     0.09    0      -0.63   -0.22     0.36       -0.14
e4    -0.23    -0.15   -0.33    0.38    0.36    -0.22       -0.14
e5     0.19    -0.29    0       0.38   -0.22     0.36       -0.14
e6    -0.60    -0.42   -0.33    0.38   -0.22     0.36       -0.14
e7     0.08    -0.10    0.33    0.38   -0.22    -0.22        0.43
e8     0.27     0.58    0.67    0.38   -0.22    -0.22        0.43

Note: there are only two distinct values in each of the four columns on the right – why?
Note: the entries within every column sum up to 0 – why?
Every row represents an entity as a 7-dimensional vector/point
e1=(-0.20, 0.23, -0.33, -0.63, 0.36, -0.22, -0.14),
a 1 x 7 matrix (array)
Every column represents a feature/category as an 8-dimensional vector/point:
LS = (-0.20, 0.40, 0.08, -0.23, 0.19, -0.60, 0.08, 0.27)^T,
an 8 × 1 matrix (column array); its transpose,
LS^T = (-0.20, 0.40, 0.08, -0.23, 0.19, -0.60, 0.08, 0.27),
is a 1 × 8 row.
5.2. Principal component analysis
The method of principal component analysis (PCA) emerged in the research on inherited talent by F. Galton, first of all to measure talent. It is currently one of the most popular methods for data summarization and visualization. The mathematical structure and properties of the method are based on the so-called singular value decomposition of data matrices (SVD); this is why in many publications the terms PCA and SVD are used as synonyms. In publications in the UK and USA, though, the term PCA frequently refers only to a technique for the analysis of the covariance/correlation matrix, by extracting the most contributing linear combinations of features, which utilizes no specific data model and thus is considered purely heuristic. In fact, this method is equivalent to the genuine data-model-based method presented below.
Here is a list of problems in which the method is useful:
- Scoring students' abilities over marks on different subjects (F. Galton)
- Scoring quality of life in different cities based on their scorings over different aspects (housing, transportation, catering, pollution, etc.)
- Scoring different parts of a big company or government over their performances
- Visualizing documents and keywords with respect to their similarities
- Visualizing a set of cereals and their taste characteristics for developing a new cereal product concept
- Visualizing multidimensional entities in 1D or 2D space
Quiz: What could be a purpose to aggregate the features in the Market towns’ data?
1. Model (for measuring talent – F. Galton):
Having students' marks xiv (i – student, v – discipline) observed, find hidden student ability scores zi and discipline loadings cv such that xiv ≈ zi cv, which can be explicated, by using residuals eiv, as
xiv = cv zi + eiv
(1)
where the residuals are minimised with the least-squares criterion
L² = Σ_{i∈I} Σ_{v∈V} e_iv² = Σ_{i∈I} Σ_{v∈V} (x_iv − c_v z_i)²
This is a problem of approximating the data matrix with a rank-one matrix: N*M observed data entries are converted into N hidden scores + M loadings (with N = 1000 and M = 100, N*M = 100,000 whereas N + M = 1,100).
A matrix of rank 1 is the product of two vectors; for example,
a=[1 4 2 0.5]';
b=[2 3 5]';
Here A' denotes matrix A transposed, so that the vectors a and b are considered columns rather than rows. The matrix whose elements are products of the components of a and b, with * being the matrix product, is:
a*b' =
     2     3     5
     8    12    20
     4     6    10
     1   1.5   2.5
The defining feature of this matrix: all rows are proportional to each other, and all columns are proportional to each other. (See more detail in any course on linear algebra or matrix analysis.)
Curiously, the condition of statistical independence (within contingency data) can be reformulated
as the contingency table being of rank 1.
Solution:
The model in (1) has a flaw from the technical point of view: its solution cannot be defined uniquely! Indeed, assume that we have got the talent score zi for student i and the loading cv for subject v, to produce zi cv as the estimate of the student's mark on the subject. However, the same estimate will be produced if we halve the talent score while simultaneously doubling the loading: zi cv = (zi/2)(2cv). Any other divisor/multiplier would, obviously, do the same.
To remedy this, fix the norms of the vectors z and c, for instance, to be equal to 1, and treat the multiplicative effect of the two of them as a real μ. Then we must put μ z*i c*v instead of zi cv, where z* and c* are the normed versions of z and c, and μ is their multiplicative effect. A vector x = (x1, …, xH) is said to be normed if its length is 1, that is, x1² + x2² + … + xH² = 1. After the optimal μ, z* and c* are determined, we can return to the talent score z and loading c with the formulas z = μ^(1/2) z*, c = μ^(1/2) c*.
Then the first-order optimality conditions imply that the least-squares solution to (1) satisfies the equations
X^T z* = μ c*   and   X c* = μ z*
(2)
where μ is the maximum singular value of X. We refer to a triple (μ, z*, c*), consisting of a real μ and two vectors, c* (of size M×1) and z* (of size N×1), as a singular triple for X if it satisfies (2); μ is the singular value and z*, c* the singular vectors.
To understand (2) in full, one needs to know the definition of the product of an N×M matrix by an M×1 vector: the matrix is a set of M columns, and the product is the sum of these columns weighted by the components of the M×1 vector.
The equation on the right of (2) yields an important corollary of the solution:
(A) z is a linear combination of the columns of X weighted by c's components: c's components are feature weights in the score z.
Another property is the Pythagorean decomposition
T(X) = μ² + L²
(3)
where T(X) is the data scatter (5.4). (3) implies:
(B) The value μ² expresses the proportion of the data scatter explained by the principal component z.
This can be further extended into a set of K different ability factors, with students and subjects differently scored over them:
xiv = Σ_{k=1}^{K} c_kv z_ik + e_iv,
(4)
with a similar decomposition of the data scatter
T(X) = μ1² + μ2² + … + μK² + L²
To fit (4), some mathematics. A triple (μ, c*, z*) satisfying (2) is referred to as a singular triple of matrix X, with μ the singular value and c*, z* the singular vectors corresponding to μ.
For any matrix X, the number of non-zero singular values μ1, μ2, …, μr is equal to the rank r of X; r ≤ min(M, N). If μk ≠ μl, then the corresponding normed singular vectors zk and zl are unique and orthogonal, as are ck and cl (k, l = 1, …, r); otherwise, they are not unique but can always be selected to be orthogonal. The matrix X admits the following singular-value decomposition (SVD):
x_iv = Σ_{k=1}^{r} μ_k c*_kv z*_ik,
(5)
or, in terms of vectors and matrices,
X = Σ_{k=1}^{r} μ_k z*_k c*_k^T = ZSC^T
(5′)
where Z is the N×r matrix with columns z*k, C is the M×r matrix with columns c*k, and S is the r×r diagonal matrix with the entries μk on the diagonal and all other entries zero.
This implies that the least-squares fitting of the PCA model (4) is given by the K singular triples with the largest singular values, together with the decomposition of the data scatter presented above.
Equations (2) imply that μ² and c* satisfy
X^T X c* = μ² c*,
(6)
that is, c* is the eigenvector of the square M×M matrix X^T X corresponding to its maximum eigenvalue λ = μ².
The matrix X^T X, divided by N, has an interesting statistical interpretation if all columns of X have been centred (mean-subtracted) and normed (std-normalised): its elements are the correlation coefficients between the corresponding variables. (Note how a bivariate concept, the correlation coefficient, is carried through to multivariate data.) If the columns have not been normed, the matrix A = X^T X/N is referred to as the covariance matrix; its diagonal elements are the column variances. Since the eigenvectors of the square matrix A are mutually orthogonal, it can be decomposed over them as
A = Σ_{k=1}^{r} λ_k c*_k c*_k^T = CΛC^T
(7)
which can be derived from (5′); Λ is the diagonal r×r matrix with A's eigenvalues λk = μk². Equation (7) is referred to as the spectral decomposition of A, the eigenvalues λk constituting the spectrum of A.
Similarly, μ² and z* are the maximum eigenvalue and corresponding eigenvector of the matrix XX^T of row-by-row inner products.
Quiz: could you write equations defining μ² and z*, analogous to those for μ² and c*? (Tip: matrix XX^T should be involved.)
Footnote: In the English-language literature, PCA is usually introduced not via the model (1) but rather in terms of the derivative properties (A) and (B), leading to equations (6) for finding the loadings first and then the scores with equations (2). In this setting, the scatter of the matrix X^T X can be used for evaluating the fit; it equals the sum of the squares of the eigenvalues λk, that is, Σk μk⁴!
2. Method:
SVD decomposition: MatLab’s svd.m function
[Z, S, C]=svd(Y);
where Y is data matrix X after standardisation (input)
Output (idealised):
Z – N  r matrix of r factor score columns (normed)
C – M  r matrix of corresponding discipline loading columns (normed)
S – r  r diagonal matrix of corresponding singular values sorted in the descending order
r – matrix Y’s rank
A matrix of rank 1 is the product of two vectors (see the example with a = [1 4 2 0.5]' and b = [2 3 5]' above).
Matrix of rank r – sum of r rank one matrices
Application
a. Data selection and pre-processing into a flat file: any of our files would do.
b. Data standardisation:
   i. Shifting the origin: typically needed to put the origin in the middle of the data cloud.
   ii. Scaling the features: typically needed to balance the feature contributions, such as in the Market towns data; in the Student marks data it is not needed (same scales); use ranges rather than standard deviations.
c. Computation: as it stands (see visual.m in \ml).
d. Post-processing: depends on the application domain:
   - can be just visualization; do it;
   - if re-standardization is needed, from z to a combined scale f = b*z + a. To find the two reals b and a, we need two points to set the scale, chosen subjectively depending on the goals. In the Student marks data, we may want f = 0 when all marks are zero and f = 100 when all marks are 100. This leads to the two equations 0 = b*0 + a and 100 = b*100*Σv cv + a, which imply a = 0 and b = 1/Σv cv.
e. Interpretation and drawing conclusions:
   - the number of principal components retained depends on their contributions;
   - a principal component's meaning depends on the components of c.
PCA: talent score model and actually finding it
1. Principal Component: Analytical expression
Problem: Given subject marks of 20 students
Math  Phys  Chem  Lang  History
 40    37    35    33    38
 96    93    84    85    96
 96    90    83    84    97
 97    90    85    85    94
 97    90    83    84    98
 96    90    84    83    95
 70    67    50    39    42
 64    63    67    76    89
 95    91    81    85    98
 21    19    17    17    21
 64    61    66    78    92
 63    63    67    77    90
 61    62    65    76    90
 62    61    67    78    92
 19    17    18    17    17
 98    90    85    83    97
 40    38    33    34    37
 73    66    49    39    45
 71    65    49    40    44
 72    65    49    41    42
find their talent scores zi approximating the data.
The optimal normed talent vector is a linear combination (weighted sum) of the different subject marks, z = Xc/μ, where X is the data matrix, μ the maximum singular value and c the corresponding normed singular vector (z⁴ is the other normed singular vector corresponding to μ); that is,
z = c′1 xM + c′2 xP + c′3 xC + c′4 xL + c′5 xH,   where c′v = cv/μ.
(*)
The expression (*) has two meanings:
(a) it is a PCA-derived relation between the 20 talent scores and the marks in the 20 rows of matrix X;
(b) it is a general relation between the features, talent and subject marks, that can be straightforwardly used for wider purposes, such as assigning a talent score to a student outside of the sample. (Quiz: Do you know how to do that? A: Just put the student's marks into (*) and calculate the talent score.)
2. Principal component: Geometric expression
Can we visualize the principal component (*)? Yes. Unfortunately, this cannot be done straightforwardly in the space of the 6 variables (z, xM, xP, xC, xL, xH) involved, because (*) corresponds in this space to a hyperplane rather than a line.
However, the subject loading vector c = (c1, c2, c3, c4, c5), or the normed vector c′ = (c′1, c′2, c′3, c′4, c′5) itself, or any other proportional vector μc, can be used for that, in the data feature space of 5D vectors x = (xM, xP, xC, xL, xH). These define a straight line through the origin 0 = (0, 0, 0, 0, 0) and c. What is the meaning of this line?
⁴ Note that vectors here are boldfaced whereas scalars are not.
[Figure 1 here: the line through 0 and c, with points μc marked at μ > 1, at 0 < μ < 1 and at μ < 0.]
Figure 1. The line through 0 and c in the M-dimensional feature space is comprised of the points μc at different values of μ.
Consider all talent score points z = (z1, …, zN) that are normed, that is, satisfy the equation <z, z> = 1, i.e., z^T z = 1, i.e., z1² + … + zN² = 1: they form a sphere of radius 1 in the N-dimensional "entity" space (Fig. 2(a)). The image of these points in the feature space, defined by applying the data matrix as c = X^T z, forms a skewed sphere, an ellipsoid, in the feature space. The longest axis of this ellipsoid corresponds to the maximum μ, that is, the first singular value of X.
[Figure 2 here: the unit sphere in the entity space (a) and its ellipsoidal image in the feature space (b), with the longest semi-axis μ1 c1.]
Figure 2. Sphere z^T z = 1 in the entity space (a) and its image, the ellipsoid c = X^T z, in the feature space (b). The first component, c1, corresponds to the maximal axis of the ellipsoid, with its length equal to 2μ1.
[Indeed, the first singular value μ1 and the corresponding normed singular vectors c1, z1 satisfy the equation Xc1 = μ1 z1 and thus, transposing, c1^T X^T = μ1 z1^T. Multiplying the latter by the former from the right, one gets c1^T X^T X c1 = μ1², because z1^T z1 = 1 since z1 is normed.]
3. Principal component: Direction and data points
What does the longest axis have to do with the data? The direction of the longest axis of the data ellipsoid minimises the summary distances (Euclidean squared) from the data points to their projections on the line (see Fig. 3), so that this axis is the best possible 1D representation of the data. This property extends to all subspaces generated by the first principal components: the first two PCs make the plane best representing the data, the first three make the 3D space best representing the data, etc.
[Figure 3 here.]
Figure 3. The direction of the longest axis of the data ellipsoid minimises the summary distances (Euclidean squared) from the data points to their projections on the line.
Why should matrix X be centered (by subtracting the within-column average from all elements of each column)? To better cover the data structure! (See Fig. 4.)
[Figure 4 here: the same data before and after centering, with the PC directions shown.]
Figure 4. Effect of centering the data set on PCA: (a) data not centered, (b) same data after centering; the longer blue line corresponds to the direction of the first PC, the shorter one to the direction of the second PC (necessarily orthogonal).
Eigenface: An application related to face analysis. Quiz: Learn what it is and what it has to do with
PCA by yourself (from web).
Latent semantic analysis: An application to document analysis using document-to-keyword data (and
applying equations (2) to include new data). Quiz: Learn what it is and what it has to do with PCA by
yourself (from web).
Correspondence analysis: An extension of PCA to contingency tables, taking into account the data specifics (it is meaningful to sum up entries across the table). See section 5.1.4 in my book as well as on the web.
Questions
Example
Applied to the 20 × 5 Student marks data given above.
For matrix X, the 5D vector of grand means is (69.75, 65.90, 60.85, 61.70, 70.70), to be subtracted from all entity vectors.
The data lead to two principal components contributing, respectively, 91.56% and 8.33% to the data scatter, thus leaving only 0.11% to the remaining three principal components, which amount to noise. To interpret these components, find their corresponding loadings (in transposed form):
c1 = [ 0.42 0.41 0.41 0.45 0.53]
c2 = [-0.59 -0.47 -0.02 0.38 0.53]
The first one shows that weights of all features are almost equal to each other, except for History
whose weight is about 20% greater. The first principal component thus expresses the general ability.
The second has sciences’ loadings negative, which shows that this component may correspond to arts
abilities.
Quiz: Why (a) is the first component all positive and (b) is half of the second component negative? A: (a) All features are positively correlated; (b) the second must be orthogonal to the first.
Quiz: compare the first component's score with that of the average scoring. The vector of average scores rounded to integers is
37 91 90 90 90 90 54 72 90 19 72 72 71 72 18 91 36 54 54 54
Tip: To compare, compute the correlation coefficient.
Q. Assume that there is a hidden feature z, assigning a value zi to each student i = 1, …, 100, that can alone, along with the feature "loadings" c = (cAge, cSP, cOO, cCI), explain all the 100×4 = 400 entries in the array X = xm = x(:,ii) obtained from stud.dat at the 4 features in ii, so that each of the entries can be represented, approximately, as the product of a corresponding z-value and a feature-specific loading: xiv ≈ zi cv.
The matrix X^T X mentioned above can be expressed as follows:
>> xtx=xms'*xms
xtx = 7.5107 1.2987 -2.4989 -2.8167
1.2987 6.0918 0.4335 -0.1543
-2.4989 0.4335 6.0207 1.6660
-2.8167 -0.1543 1.6660 4.7729
As xtx is proportional to the feature covariance matrix, one should notice a not quite straightforward character of the data, manifesting itself in the fact that features 1 and 2, while positively correlated with each other, relate to feature 3 in different ways.
All four singular values and vectors can be found in Matlab with operation of Singular value
decomposition (SVD):
>> [Z,S,C]=svd(xms);
Here the matrix Z is a 100×4 array whose columns are singular z vectors, C is a 4×4 array whose columns are singular c vectors, and the first four rows of the array S form a 4×4 diagonal matrix whose diagonal entries are the corresponding singular values μ, which can be shaped with the command
>> mu=S(1:4,1:4);
The following topics can be covered with this:
- Visualisation
- Evaluation
- Interpretation
We consider them in turn.
To visualise the data onto a 2D plane, we need just two principal components, that can be defined by
using the first and second singular values and corresponding z vectors:
>> x1=Z(:,1)*sqrt(mu(1,1));
>> x2=Z(:,2)*sqrt(mu(2,2));
These can be seen as a 2D plot, on which groups of entities falling in categories such as
Occupation:AN (entities from 70 to 100) and Occupation:IT (entities 1 to 35) can be highlighted:
>> subplot(1,2,1), plot(x1,x2,'k.');%Fig. 8, picture on the left
>> subplot(1,2,2), plot(x1,x2,'k.', x1(1:35),x2(1:35),'b^', x1(70:100),x2(70:100),'ro');
Figure 8. Scatter plot of the student data 4D (Age, SP marks, OO marks, CI marks) row points on the plane of the two first principal components, after they have been centred and rescaled in file xms. Curiously, students of occupations AN (circled) and IT (triangled) occupy contiguous regions, top and left, respectively, of the plane, as can be seen in the right-hand picture.
To evaluate how well the data are approximated by the PC plane, according to equation (3) one needs
to assess the summary contribution of the first two singular values squared in the total data scatter. To
get the squares one can multiply matrix mu by itself and then see the proportion of the first two values
in the total:
>> la=mu*mu
la =
   11.1889        0        0        0
         0   6.4820        0        0
         0        0   3.8269        0
         0        0        0   2.8982
>> 100*[la(1,1)+la(2,2)]/sum(sum(la))
ans = 72.43
This shows the PC plane takes 72.43% of the data scatter, which is not that bad for this type of data.
To interpret the results one should use the standardised coefficients c in expression (2) that come with
svd command into columns of matrix C. The two columns, in transposed form, are:
>> interp=C(:,1:2)'
interp =
0.7316 0.1587 -0.4858 -0.4511 %first singular vector
-0.1405 -0.8933 -0.4165 -0.0937 %second singular vector
The coefficients straightforwardly show how much a principal component is affected by a feature.
The first component is positively related to Age and negatively to the OO and CI marks; on average, it increases as Age increases and the OO/CI marks decrease. The second component increases when the SP and OO marks decrease (this can obviously be reverted by swapping all minuses for pluses). Thus, the first component can be interpreted as "age-related Computer Science deterrence" and the second as "dislike of programming issues". Then the triangle and circle patterns on the right of Figure 8 show that IT labourers are on the minimum side of the age-related CS deterrence, whereas AN occupations are high on the second component's scale.
Component retention
Although two or three principal components are sufficient for the purposes of visualization, the issue of automatically determining the "right" number of components has attracted the attention of researchers. R. Cangelosi and A. Goriely (2007), Component retention in principal component analysis with application to cDNA microarray data, Biology Direct, 2:2, review twelve rules for choosing the number of principal components and, rather expectedly, note that none of them was better than the others in their experiments with generated and real data sets.
Is there a right number of components? This question is irrelevant if the user's goal is visualization: just two or three components, depending on the dimension of the screen. However, the question does arise when one wants to determine the "real" dimensionality of the data. A number of rules of thumb have been proposed, of which a dozen were tested on real and simulated data by Cangelosi and Goriely (2007). The simplest rules, such as:
(i) stop at the component whose contribution is less than the average contribution or, better, than 70% of the average contribution, and
(ii) take the largest contributing components so that their summary contribution reaches 80%,
did remarkably well.
Rule (i) can be slightly modified with the Anomalous Pattern method described below (see section …… ). According to this method, the average contribution is first subtracted from all the contributions. Let us denote the resulting values, in descending order, by c1, c2, …, cr, where r is the rank of the data matrix, and compute, for every n = 1, 2, …, r, the squared sum of the first n of these values, C(n). Then the rule says that the number of components to retain is defined as the first n at which c²ₙ₊₁ is greater than 2C(n)(cₙ₊₁ − 1/2n), which is obviously guaranteed if cₙ₊₁ < 1/2n, because the expression is negative then.
Clustering: K-Means partitioning
Clustering is a set of methods for finding and describing cohesive groups in data, typically as "compact" clusters of entities in the feature space.
[Figure 1 here: three example data patterns, (a)-(c).]
Figure 1. A clear cluster structure at (a); data clouds with no visible structure at (b) and (c).
Finding clusters is half the job; describing them is the other half. Duality of knowledge: cluster contents – extension; cluster description – intension. If a cluster is clear-cut, it is easy to describe; if not, not.
[Figure 2 here: two clusters on the plane, one of them describable by the box a1 < x < a2 & b1 < y < b2.]
Figure 2. The yellow cluster on the right: a1 < x < a2 & b1 < y < b2. For the yellow cluster on the left, any such box description gives both false positive and false negative errors!
Example of a good cluster structure: W. Jevons (1835-1882), updated in Mirkin 1996. Pluto does not fit in the two clusters of planets: it started a new cluster recently, in September 2006.
K-Means: cluster and centroid updates
Entities are presented as multidimensional points.
0. Put K hypothetical centroids (seeds).
1. Assign points to the centroids using the Minimum distance rule.
2. Put the centroids at the gravity centres of the clusters thus obtained.
3. Iterate 1 and 2 until convergence.
[Diagrams here: the data points (*) and centroids (@) before and after the cluster-update and centroid-update steps.]
Quiz: What is a gravity centre? (A: Given a set of points, its mean point.)
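A minimal MATLAB sketch of batch K-Means under these rules; the names Y (standardised data) and K are assumptions, and ties and empty clusters are not handled:
% Batch K-Means: alternate the Minimum distance rule and centroid updates.
% Y is N x M standardised data; K is the number of clusters.
[N, M] = size(Y);
seeds = randperm(N); C = Y(seeds(1:K), :);     % step 0: K entities as seeds
S = zeros(N,1);
while true
    D = zeros(N, K);                           % step 1: assign by minimum distance
    for k = 1:K
        D(:,k) = sum((Y - repmat(C(k,:), N, 1)).^2, 2);
    end
    [~, Snew] = min(D, [], 2);
    if isequal(Snew, S), break; end            % step 3: stop when assignment is stable
    S = Snew;
    for k = 1:K                                % step 2: centroids as gravity centres
        C(k,:) = mean(Y(S == k, :), 1);
    end
end
W = sum(min(D, [], 2));                        % criterion W(S,c): summary distances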
Example (with the Masterpieces data)
Range-standardised matrix (with the three category features Ob, Pe and Di additionally rescaled by dividing them by √3):
A. K-Means at K = 3 with initial seeds at entities 2, 5, 7
0. Entities 2, 5 and 7 are chosen as centroids.
1. Minimum distance rule: calculate the distance (Euclidean squared) from each centroid to each entity i and assign i to the closest centroid (highlighted with bold font):
This produces the clusters 1-2-3, 4-5-6, and 7-8.
2. Centroids update: calculate the centroids of the clusters (standardised values):
 .095  -.215   .179   .124  -.111  -.625   .168
 .024  -.144  -.286  -.222   .375  -.024   .168
-.144   .243   .500   .375  -.216  -.216   .433
Bold font highlighted the outstanding values.
3. Check whether the new centroids coincide with those from the previous iteration. If not, go to 1. The next iteration produces the same clusters 1-2-3, 4-5-6, and 7-8.
4. At step 2, with the clusters the same, the centroids are the same as well. Output the result.
The within-cluster means are given in both versions: real (upper row) and standardised (lower row). The interpretation goes along the lines highlighted above: cluster 1 lacks SC, cluster 2 has small LS, cluster 3 has D in excess, etc.
B. K-Means at K = 3 with initial seeds at entities 1, 2, 3:
The within-column minima (bold-faced in the distance table) lead to the clusters 1-4, 2, and 3-5-6-7-8. After updating the centroids, the next iteration leads to the clusters 1-4-6, 2, 3-5-7-8, which remain stable at the next iteration. A disastrous result.
Quiz: Do this by yourself. Try also K = 2 with seeds at 1 and 8.
K-Means' features:
Positive:
- Models typology building
- Computationally effective
- Can be incremental, one entity at a time
Negative:
- No advice on:
  o Data pre-processing
  o Number of clusters
  o Initial setting
- Instability of results
- Insufficient interpretation aids
[Figure 3 here.]
Figure 3. Example for the initial-setting and instability items above: two clusterings of a four-point set with K-Means – intuitive (right) and counter-intuitive (left); red stars as centroids.
Other:
- Convex cluster shapes (a body S is referred to as convex if, with every two points x and y in S, S also contains the entire straight-line interval between x and y).
Quiz: Why convex? (Hint: A half-space is convex, and the intersection of convex bodies is convex too.)
K-Means criterion W(S, c):
Denote the partition S = {S1, …, SK}; an entity/row yi = (yiv) (i = 1, …, N); cluster Sk's centroid ck = (ckv) (k = 1, …, K); v = 1, …, M are the feature indices.
W(S, c) is the summary of within-cluster entity-to-centroid distances (Euclidean squared), to be minimised:
W(S, c) = Σ_{k=1}^{K} Σ_{i∈Sk} Σ_{v=1}^{M} (y_iv − c_kv)² = Σ_{k=1}^{K} Σ_{i∈Sk} d(y_i, c_k)
Figure 4. The distances (blue intervals) in criterion W(S,c).
Quiz: How many distances are in W(S, c)? (A: The number of entities N.) Does this number depend on
the number of clusters K? (A: No.) Does the latter imply: the greater the K, the less W(S, c)? (A: Yes.)
Why?
K-Means is alternating minimisation for W(S,c). Convergence guaranteed. (Quiz: Why?)
Quiz: Demonstrate that, for the Masterpieces data, the value W(S,c) at the author-based partition {1-2-3, 4-5-6, 7-8} is lower than at the partition {1-4-6, 2, 3-5-7-8} found from seeds 1, 2 and 3.
Quiz: Assume d(yi, ck) in W(S, c) is city-block distance rather than Euclidean squared. Could K-Means
be adjusted to make it alternating minimisation algorithm for the modified W(S,c)? (A: Yes, just use
the city-block distance through, as well as within cluster median points rather than gravity centres.)
Would this make any difference?
PCA model extended to K-Means clustering
Remember? The data yiv are modelled as summary contributions of talent factors k with products ckv zik, ckv being feature v's loading and zik factor k's score. Consider now ckv to be cluster k's centroid component and zik the belongingness/membership value: given cluster Sk, zik = 1 if i∈Sk and zik = 0 if i∉Sk. Formula (4′) then says that, for any cluster Sk and any entity i∈Sk, yiv is equal to ckv up to the residuals eiv to be minimised – a data recovery model, a rather simplistic one:
y_iv = Σ_{k=1}^{K} c_kv z_ik + e_iv
(4′)
The least-squares criterion L² for (4′) is exactly W(S, c)!
Moreover, the same data scatter decomposition holds:
T(Y) = μ1² + μ2² + … + μK² + L²
with μk² being analogous to the YY^T eigenvalues:
μk² = zk^T YY^T zk / zk^T zk = Σv ckv² |Sk|
(8)
Quiz: What is the difference between PCA model (4) and clustering model (4’)?
Why are all these technical details? Because they lead to:
o Data standardisation
o Algorithms:
  - Spectral clustering
  - Anomalous pattern
  - iK-Means
o Additional interpretation aids:
  - Correlation with features
  - Scatter decomposition (ScaD) table
o Data standardisation
Because of the data scatter decomposition, standardisation in clustering can be done on the basis of
data scatter, as advised for PCA:
- pre-process data into quantitative format,
- subtract within-column means,
- divide by the range.
o Algorithms
  - Anomalous pattern
The PCA strategy: one cluster at a time. From (8), find a cluster S maximising its contribution to the data scatter T(Y):
μ² = z^T YY^T z / z^T z = Σv cv² |S| = d(0, c)|S|
(9)
– the distance to 0 weighted by the size |S|.
One-cluster clustering with the Anomalous Pattern cluster:
1. Put the seed c = (cv) at the entity furthest from 0.
2. Cluster update: take S to consist of all entities that are closer to c than to 0.
3. Centroid update: take the gravity centre of S as c.
4. Reiterate 2 and 3 until convergence.
This is similar to 2-Means, except that the anomalous centroid is the only one to change. One-by-one clusters can be extracted, with the contributions (9) showing the cluster saliencies: an incomplete clustering.
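A minimal MATLAB sketch of the Anomalous Pattern iterations over a pre-standardised matrix Y (the name is an assumption):
% Anomalous Pattern: one cluster grown against the reference point 0.
[N, M] = size(Y);
[~, imax] = max(sum(Y.^2, 2));    % 1. seed: the entity furthest from 0
c = Y(imax, :);
S = [];
while true
    d_c = sum((Y - repmat(c, N, 1)).^2, 2);    % distances to the seed
    d_0 = sum(Y.^2, 2);                        % distances to the origin
    Snew = find(d_c < d_0);                    % 2. entities closer to c than to 0
    if isequal(Snew, S), break; end
    S = Snew;
    c = mean(Y(S, :), 1);                      % 3. centroid update
end
salience = sum(c.^2) * length(S);              % contribution (9): d(0,c)*|S|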
  - iK-Means (K-Means with intelligent initialisation)
1. Extract clusters one by one using the Anomalous Pattern method.
2. Specify at what size s (s = 1, 2, …) a cluster should be discarded, and remove all AP clusters of size s or less.
3. Take K to be the number of remaining clusters and use their centroids as the initial seeds.
This method has shown its superiority over a number of other methods in our experiments with generated data.
  - Spectral clustering (a recent, very popular technique)
1. Find the eigenvectors z¹, z², …, z^K of the N×N matrix YY^T. (This can be done fast, for instance, by finding the eigenvectors ck of the M×M matrix Y^T Y: in many applications, M ≈ 50 while N ≈ 100,000. Then the zk are found from the ck by using formula (2) above.)
2. Given an optimal PC z^k, find Sk as the set of indices corresponding to the largest components of z^k.
Not necessarily optimal. Can be applied in other settings, such as distances.
Alternating minimisation algorithm for f(x, y): a sequence y0, x1, y1, x2, y2, …, yt, xt, … Find x minimising f(x, y0); take this as x1. Given x1, find y minimising f(x1, y); take it as y1. Reiterate until convergence.
K-Means, with the Euclidean squared distance and gravity centres, is alternating minimisation for W(S, c). Convergence is guaranteed. (Quiz: Why? A: The criterion can only decrease at each step, so that no partition can re-appear, and the number of possible partitions is finite.)
Issues to address:
o Data pre-processing
o Number of clusters
o Initial setting
o Insufficient interpretation aids
The most popular approach to tackling the issue of initialisation in K-Means:
1. Take a range of K, say from K = 4 to K = 25.
2. For each K, run K-Means clustering a hundred (or five hundred) times starting from random seeds (centroids); typically, K random entities are sampled as seeds. Select the best result, with its criterion value WK.
3. Compare the WK's at different K's and select that K, and its respective result, at which WK "jumps".
Of several approaches to formalising the "jump" concept, the best in my experiments – in terms of K, not of clustering – was Hartigan's rule, utilising the measure
HK = (WK/WK+1 − 1)(N − K − 1)
Hartigan's rule of thumb: start at K = 1 and halt at the K at which HK becomes less than 10.
I have different advice: iK-Means.
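A minimal MATLAB sketch of this multiple-runs approach with Hartigan's rule; it assumes the Statistics Toolbox function kmeans is available, and Y is the standardised data matrix:
% Choose K by running K-Means many times per K and applying Hartigan's rule.
N = size(Y, 1);  Kmax = 25;  W = zeros(Kmax, 1);
for K = 1:Kmax
    [~, ~, sumd] = kmeans(Y, K, 'Replicates', 100, 'EmptyAction', 'singleton');
    W(K) = sum(sumd);                % best W(S,c) over 100 random starts
end
H = (W(1:Kmax-1) ./ W(2:Kmax) - 1) .* (N - (1:Kmax-1)' - 1);
Kchosen = find(H < 10, 1);           % Hartigan's rule: first K with H_K < 10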
The contribution of a feature set F, as explained by the cluster partition S = {S1, …, SK}, to the data scatter is
Contr(S, F) = Σ_{v∈F} Σ_{k=1}^{K} c_kv² |Sk|
(10)
and it is proportional to:
- the correlation ratio η², if F is quantitative;
- a contingency coefficient between the cluster partition S and F, if F is nominal, that is:
  - the Pearson chi-squared (if Poisson-normalised),
  - the Goodman-Kruskal tau-b (if range-normalised).
Indeed, after standardisation of a nominal F, cluster k's centroid components over v∈F are
ckv = (pkv/pk − pv)/Bv
where pv is the frequency of v, pk the proportion of entities in Sk, pkv the proportion of entities falling in both v and Sk, and Bv the normalising scale coefficient. Then the summary contribution (10) is
Contr(S, F) = N Σ_{k,v} (p_kv − p_k p_v)² / (p_k B_v²)
At Bv = 1 (range normalisation) this is tau-b, the change in the error of proportional prediction of the v's, and at Bv² = pv (Poisson normalisation) it is the Pearson chi-squared.
This leads to: (a) conceptual clustering and (b) standardising suggestions.
  - Scatter decomposition (ScaD) table:
The cluster-specific feature contributions ckv²|Sk| are put in a table with rows corresponding to clusters and columns to features.
Table 1. ScaD for the Authorship clusters at the Masterpieces data.
The two rows just above the bottom show the summary explained contributions of the clustering to the features and their complements to the total feature contributions (bottom line), the unexplained parts. The contributions follow from the algebra, but there is a geometric intuition in them as well (see Figure 5 below).
[Figure 5 here: a group of points with its centroid versus the grand mean on the (x, y) plane.]
Figure 5. The contributions of features x and y in the group of blue-circled points are proportional to the squared differences between their values at the grand mean (red star) and at the within-group mean, the centroid (yellow star). The x-difference is much greater; thus the group can be separated from the rest along x much more easily than along y.
Large contributions are highlighted in Table 1 with symbols related to the authors: Dickens, Twain and Tolstoy. The highlighted features exclusively describe the author clusters conceptually:
C. Dickens: SCon = No
M. Twain: LenD < 28
L. Tolstoy: NChar > 3 or Presentat = Direct
which is not necessarily so at other data sets.
A deeper optimum with nature-inspired approaches
GA for K-Means clustering
Recall that K-Means clustering is a method for finding a partition of a given set of N entities, represented by the rows yi = (yi1, …, yiV) (i = 1, …, N) of a data matrix Y = (yiv), into K clusters Sk (k = 1, …, K) with centres (centroids) ck = (ck1, …, ckV) defined as the means of the within-cluster rows. This method alternatingly minimises the summary within-cluster distance
W = W(S, c) = Σ_{k=1}^{K} Σ_{i∈Sk} d(yi, ck)   (*)
where d(yi, ck) = (yi1 − ck1)² + (yi2 − ck2)² + … + (yiV − ckV)² is the Euclidean distance squared between entity i and its cluster's centroid.
To apply the GA approach, define the concept of a chromosome. Let the chromosome representing a partition S = {S1, …, SK} be the string of cluster labels assigned to the entities in the order i = 1, …, N. If, for instance, N = 8 and the entities are e1, e2, e3, e4, e5, e6, e7, e8, then the string 12333112 represents the partition S with three classes, S1 = {e1, e6, e7}, S2 = {e2, e8}, and S3 = {e3, e4, e5}, which can easily be seen from the diagram:
e1 e2 e3 e4 e5 e6 e7 e8
 1  2  3  3  3  1  1  2
A string of N numbers is considered “illegal” if some numbers between 1 and K are absent from it (so
that the corresponding classes are empty). Such an illegal string in the example above would be
11333111: it makes class S2 empty.
A GA for minimising the function W:
0. Initial setting. Fix the population size P, and even integer (no rules exist for this), and
randomly generate P legal strings s1,..,sP of K integers 1 ,…, K. For each of the strings, define
corresponding clusters, calculate their centroids as gravity centres and the value of criterion,
W(s1), …, W(sp), according to formula (*).
1. Selection. Select P/2 pairs of strings to mate; each of the pairs is to produce two “children”
strings. The mating pairs usually are selected randomly (with replacement, so that the same
string can form both parents in a pair). To mimic Darwin’s “survival of the fittest”, the
probability of selection of string st (t=1,…,P) should reflect its fitness value W(st). Since the
fitness is greater for the smaller W value, some make the probability inversely proportional to
W(st) (see Murthy, Chowdhury, 1996) and some to the difference between a rather large
number and W(st) (see Yi Lu et al. 2004).
(I would suggest making it proportional to the explained part of the data scatter defined above,
which may lead to different results.)
2. Cross-over. For each of the mating pairs, generate a random number r between 0 and 1. If r is
smaller than a pre-specified probability p (typically, p is taken about 0.7-0.8), then perform a
crossover; otherwise the mates themselves are considered the result. A (single-point) crossover
of strings sa=a1a2…aN and sb=b1b2…bN is performed as follows. A random number n
between 1 and N-1 is selected and the strings are crossed over to produce children
a1a2…anb(n+1)…bN and b1b2…bna(n+1)…aN. If a child is “illegal” (like, for instance,
strings a=11133222 and b=32123311 crossed over at n=4 to produce a’=11133311 and
b’=32123222; a’ is illegal here), then various policies can be pursued. Some authors suggest
repeating the crossover operation until a legal pair is produced; others allow illegal chromosomes
but assign them a lower selection probability.
3. Mutation. Mutation is a random alteration of a character in a chromosome. This provides a
mechanism for jumping to different ravines of the minimised function. Every character in every
string is subject to mutation with a low probability q which can be constant or inversely
proportional to the distance between the corresponding entity and corresponding centroid.
4. Elitist survival. This strategy suggests storing the best fitting chromosome separately. After
the crossover and mutations have been done, find fitness values for the new generation of
chromosomes. Check whether the worst of them is better than the stored record or not. If not, put the
record chromosome into the population instead of the worst one. Then find the record for the
population thus obtained.
5. Stopping condition. Check the stop condition (typically, a limit on the number of iterations). If
this doesn’t hold, go to 1; otherwise, halt.
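As an illustration of Step 2, here is a minimal MatLab sketch of the single-point crossover with a legality check for label strings; the function name gacross and the repair policy of re-drawing the cut point are my own assumptions, not part of the published algorithms.
% single-point crossover of two label strings sa, sb (1 x N vectors of labels 1..K)
function [ca, cb] = gacross(sa, sb, K, pcross)
N = length(sa);
ca = sa; cb = sb;                             % by default, the mates themselves are the result
if rand < pcross                              % crossover happens with probability pcross
    for attempt = 1:20                        % re-draw the cut point if a child is illegal
        n = randi(N-1);                       % single cut point between 1 and N-1
        ta = [sa(1:n) sb(n+1:N)];
        tb = [sb(1:n) sa(n+1:N)];
        if length(unique(ta)) == K && length(unique(tb)) == K   % all K labels present?
            ca = ta; cb = tb;
            break;
        end
    end
end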
Yi Lu et al. (Bioinformatics, 2004) note that such a GA works much faster if, after Step 3 (Mutation),
the labels are re-assigned according to the Minimum distance rule; they apply this instead of the
elitist survival.
A shortcoming of this GA: the chromosomes are long (their size is N). Can this be overcome? Yes, by
using centroids rather than a partition to represent a clustering.
A set of K centroids c1=(c11,…c1v,…,c1M),…,cK=(cK1,…,cKv,…, cKM) can be considered a
sequence of K*M numbers, thus a string whose size does not depend on N. Another advantage is that
such a string can be changed softly, in a quantitative manner, by adding or subtracting a small amount
rather than by switching to another symbol.
Evolutionary K-Means.
The chromosome is represented by the set of K centroids c1, c2, …, cK, which can be considered a string
of K*M real (“float”) numbers. In contrast to the previous representation, the length of the string here
does not depend on the number of entities, which can be an advantage when the number of entities is
massive. Each centroid in the string is analogous to a gene in the chromosome.
The crossover of two centroid strings c and c’, each of length K*M, is performed at a randomly
selected place n, 1 <= n < K*M, exactly as in the genetic algorithm above. Chromosomes c and c’
exchange the portions lying to the right of the n-th component to produce two offspring. This means that,
in fact, only one of the centroids changes in each offspring chromosome.
The process of mutation, according to Bandyopadhyay and Maulik (2002), is organised as follows.
Given the fitness W values of all the chromosomes, let minW and maxW denote their minimum and
maximum respectively. For each chromosome, its radius R is defined as a proportion of maxW
reached at it: R=(W-minW)/(maxW-minW). When the denominator is 0, that is, if minW = maxW,
define each radius as 1. Here, W is the fitness value of the chromosome under consideration. Then the
mutation intensity δ is generated randomly in the interval between –R and +R.
Let minxv and maxxv denote the minimum and maximum values in the data set along feature v
(v=1,…, M). Then every v-th component xv of the chromosome’s centroid changes to
xv + δ(maxxv – xv) if δ >= 0 (an increase), or
xv + δ(xv – minxv) otherwise (a decrease).
The perturbation leaves chromosomes within the hyper-rectangle defined by the boundaries minxv and
maxxv. Please note that the best chromosome, at which W = minW, does not change in this process.
Elitism is maintained in the process as well.
The algorithm follows the scheme outlined for the genetic algorithm.
On the basis of limited experimentation, this algorithm is reported to outperform the previous one (the
GA) many times over in terms of the speed of convergence.
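A minimal MatLab sketch of the mutation just described, assuming the chromosome is stored as a 1 x (K*M) vector c of centroid components and that minx and maxx hold the per-component data bounds (the feature bounds repeated K times); the function name evmutate and the use of a single δ per chromosome are illustrative readings of the description above.
% mutate a centroid-string chromosome c, given its fitness W and the population's minW, maxW
function c = evmutate(c, W, minW, maxW, minx, maxx)
if maxW > minW
    R = (W - minW)/(maxW - minW);    % radius: 0 for the best chromosome, 1 for the worst
else
    R = 1;                           % if minW = maxW, every radius is set to 1
end
delta = (2*rand - 1)*R;              % mutation intensity, uniform on [-R, +R]
if delta >= 0
    c = c + delta*(maxx - c);        % increase: shift towards the upper bounds
else
    c = c + delta*(c - minx);        % decrease: shift towards the lower bounds (delta < 0)
end
Note that the best chromosome gets R = 0, hence delta = 0, and indeed does not change.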
Differential evolution and K-Means
This process is very similar to those previously described, except that here the crossover and mutation
are merged together in the following rather tricky way.
An offspring chromosome is created for every chromosome j in the population (j=1, …, P) as follows.
(You remember, a chromosome is a set of K centroids here.) Three other chromosomes, k, l and m, are
taken randomly from the population. Then, for every component (gene) x.j of the chromosome j, a
uniformly random value r between 0 and 1 is drawn. This value is compared to a pre-specified
probability p (somewhere between 0.5 and 0.8). If r > p, the component goes to the offspring
unchanged. Otherwise, this component is substituted by a combination of the corresponding
components of the three other chromosomes: x.m + λ(x.k – x.l), where λ is a small scaling parameter.
After the offspring’s fitness is evaluated, it substitutes chromosome j if its fitness is better; otherwise,
j remains as is, and the process applies to the next chromosome.
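A minimal MatLab sketch of this offspring construction, assuming the chromosomes are stored as the rows of a P x L array pop (L = K*M); the names pop, lambda, p and deoffspring are illustrative assumptions.
% differential-evolution offspring for chromosome j of population pop (P x L array)
function child = deoffspring(pop, j, p, lambda)
[P, L] = size(pop);
others = setdiff(1:P, j);
rp = others(randperm(length(others)));       % three distinct chromosomes other than j
k = rp(1); l = rp(2); m = rp(3);
child = pop(j, :);
for v = 1:L
    if rand <= p                             % with probability p, recombine this component
        child(v) = pop(m, v) + lambda*(pop(k, v) - pop(l, v));
    end
end
The offspring would then be evaluated and kept only if its fitness is better than that of chromosome j.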
Krink and Paterlini (2005) claim that this method outperforms the others in K-Means clustering.
Particle swarm optimisation and K-Means
This is a very different method. The population members here are not crossbred, nor do they mutate.
They just move randomly by drifting in the directions of the best places visited, individually and
socially. This can be done because they are vectors of reals. Because of the change, the genetic
metaphor is abandoned here, and the elements are referred to as particles rather than chromosomes,
and the set of them as a swarm rather than a population.
Each particle comprises:
- a position vector x that is an admissible solution to the problem in question (such as the KM
centroids vector in the evolution algorithm for K-Means above),
- the evaluation of its fitness f(x) (such as the summary distance W in formula (*)),
- a velocity vector z of the same dimension as x, and
- the record of the best position b reached by the particle so far (the last two are a new feature!).
The swarm best position bg is determined as the best among all the individual best positions b.
At iteration t (t=0,1,…) the next iteration position is defined as
x(t+1) = x(t) + z(t+1)
with the velocity vector z(t+1) computed as
z(t+1) = z(t) + α(b – x(t)) + β(bg – x(t))
where
- α and β are uniformly distributed random numbers (typically within the interval between 0 and 2, so that on average they are about unity),
- the item α(b – x(t)) is referred to as the cognitive component, and
- the item β(bg – x(t)) as the social component of the process.
Initial values x(0) and z(0) are generated randomly within the manifold of admissible values.
In some implementations, the swarm best position bg is replaced by a local best position bl, defined
over the particle’s neighbours only; here the neighbourhood topology takes effect. There are reports
that the local best position works especially well, in terms of the optimum reached, when it is based
on just two Euclidean neighbours.
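A minimal MatLab sketch of one particle’s update under the rules above; it is generic and not yet specialised to K-Means, which is left to the question below. Here x, z, b and bg are row vectors of the same dimension, and drawing α and β uniformly from [0, 2] is the typical choice mentioned above.
% one particle-swarm update for a particle with position x, velocity z,
% personal best b and swarm (or local) best bg
function [x, z] = psostep(x, z, b, bg)
alpha = 2*rand;                          % cognitive multiplier, uniform on [0, 2]
beta  = 2*rand;                          % social multiplier, uniform on [0, 2]
z = z + alpha*(b - x) + beta*(bg - x);   % velocity update
x = x + z;                               % position update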
Question: Formulate a particle swarm optimisation algorithm for K-Means clustering.
Other Clustering
- Fuzzy K-Means
- Kohonen’s Self-Organising Map (SOM)
- Hierarchical clustering
  - Agglomerative algorithm
  - Ward’s criterion
  - Single linkage clustering
- Minimum Spanning Tree
  - Prim’s algorithm
  - Application to Single Linkage Clustering
Fuzzy clustering
Conventional (crisp) clustering: cluster k (k=1,…,K)
Centroid ck=(ck1,…, ckv,…, ckM) (M features)
Membership zk=(z1k,…, zik,…, zNk) (N entities)
If zik = 1, entity i belongs to cluster k; if zik = 0, it does not.
Clusters form a partition of the entity set (every i belongs to one and only one cluster):
for every i, Σk zik = 1
Fuzzy clustering: cluster k (k=1,…,K)
Centroid ck=(ck1,…, ckv,…, ckM) (M features)
Membership zk=(z1k,…, zik,…, zNk) (N entities)
0 ≤ zik ≤ 1, the extent of belongingness of i to cluster k
Clusters form a fuzzy partition of the entity set (the summary belongingness is unity):
for every i, Σk zik = 1
Having been put into the bilinear PCA model, as K-Means has been, fuzzy cluster memberships form
a rather weird model in which centroids are not average but rather extreme points in their clusters
(Mirkin, Satarov 1990, Nascimento 2005).
An empirically convenient criterion is
F({ck, zk}) = Σk=1,…,K Σi=1,…,N zikβ d(yi, ck)                    (1)
where d( , ) is the squared Euclidean distance; it leads to a convenient fuzzy version of K-Means. The value β affects the fuzziness of the optimal solution: at β = 1 the optimal memberships are proven to be crisp, and the larger the β, the ‘smoother’ the membership. Conveniently, β is taken to be 2.
At each iteration, given the current set of centroids {ck}, the memberships are updated and then the centroids are recomputed:
Membership update:
zik = 1 / Σk'=1,…,K [d(yi, ck)/d(yi, ck')]1/(β–1)                    (2)
Centroids update:
ckv = Σi=1,…,N zikβ yiv / Σi'=1,…,N zi'kβ                    (3)
Since equations (2) and (3) are the first-order optimality conditions for (1), convergence is guaranteed.
This method is sometimes referred to as c-means clustering (Bezdek, 1999).
Meaning of criterion (1): F = Σi F(i), the summary belongingness F(i) of points i to the cluster-structured data, with F(i) proportional, at β = 2, to the harmonic average of the distances d(yi, ck) (Stanforth, Mirkin, Kolossov, 2005).
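A minimal MatLab sketch of iterating updates (2) and (3) at β = 2, assuming Y is an N x M data matrix and C a K x M matrix of initial centroids; the function name and the fixed number of iterations are illustrative.
% fuzzy K-Means (c-means) iterations at beta = 2
function [Z, C] = fuzzykm(Y, C, niter)
[N, M] = size(Y); K = size(C, 1);
for it = 1:niter
    % membership update (2): at beta = 2 the exponent 1/(beta-1) equals 1
    D = zeros(N, K);
    for k = 1:K
        diff = Y - ones(N, 1)*C(k, :);
        D(:, k) = sum(diff.^2, 2);               % squared Euclidean distances to centroid k
    end
    D = max(D, eps);                             % guard against zero distances
    Z = (1./D)./(sum(1./D, 2)*ones(1, K));       % memberships, summing to 1 over clusters
    % centroid update (3): weighted means with weights zik^2
    W2 = Z.^2;
    C = (W2'*Y)./(sum(W2, 1)'*ones(1, M));
end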
Contours of the membership function for about 14,000 IDBS Guildford chemical compounds clustered with iK-Means into 41 clusters (a); note the undesirable cusps in (b), which scores membership using only the nearest cluster’s centroid.
Kohonen’s Self Organising Maps SOM
Given an N x M data matrix Y with entities yi (i=1,…,N), build:
- a grid of r rows and c columns,
- a grid neighbourhood associated with each grid point ek (k=1,…,rc),
- reference vectors mk (k=1,…,rc) in the M-dimensional feature space.
Figure 1. SOM grid; grid points e1 and e2 are shown along with possible neighbourhood patterns (in black and blue).
Start: Reference points mk are thrown in randomly (in Kohonen’s earlier work); the current advice is to take
them as the centroids after a run of K-Means at K=rc.
Then data points are iteratively associated with the mk. In the end, the data points associated with each mk are
visualised at the grid point ek. Originally, SOM iterations were formulated in terms of single
entities arriving (incremental mode), but later a straight (batch) version was developed.
Figure 2. A pattern of final SOM structure using entity labels of geometrical shapes.
Straight SOM:
0. Initial setting. Select r and c for the grid and initialise the model vectors mk (k=1,...,rc) in the feature space.
1. Neighbourhood update. For each grid node ek, with a pre-defined neighbourhood Ek, collect the list It of entities most resembling the model mt, for each et ∈ Ek.
2. Seeds update. For each node ek, define the new mk as the average of all entities yi with i ∈ It for some et ∈ Ek.
3. Stop-condition. Halt if the new mk-s are close enough to the previous ones (or after a pre-specified number of iterations). Otherwise go to 1.
Similar to Straight K-Means except that (see also the sketch after this list):
(a) the number K=rc of model vectors is large and has nothing to do with the final clusters, which emerge visually as grid clusters;
(b) averaging is over the grid neighbourhood, not over a feature-space neighbourhood;
(c) there are no interpretation rules.
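A minimal MatLab sketch of one Straight SOM pass under the scheme above, assuming the models are the rows of an rc x M matrix Mod, the data are the rows of Y, and the grid neighbourhood is supplied as an rc x rc 0/1 matrix Nb with Nb(k,t)=1 if et belongs to Ek; these names and the neighbourhood encoding are illustrative assumptions.
% one Straight SOM update pass
function Mod = somstep(Y, Mod, Nb)
[N, M] = size(Y); rc = size(Mod, 1);
% assign each entity to its most resembling model (nearest by squared Euclidean distance)
nearest = zeros(N, 1);
for i = 1:N
    d = sum((Mod - ones(rc, 1)*Y(i, :)).^2, 2);
    [dummy, nearest(i)] = min(d);
end
% seeds update: average over the grid neighbourhood of each node
for k = 1:rc
    members = find(Nb(k, nearest) == 1);    % entities assigned to some node et in Ek
    if ~isempty(members)
        Mod(k, :) = mean(Y(members, :), 1);
    end
end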
Hierarchic clustering
Data as a cluster hierarchy (rooted tree)
(a)
conceptual structures (taxonomy, ontology)
(b)
real processes (evolution, genealogy)
[Tree over the eight masterpieces OT, DS, GE, TS, HF, YA, WP, AK, with Dickens and Tolstoy+Twain branches marked.]
Figure 3. A cluster hierarchy of Masterpieces data: nested node clusters, each comprising a set of
leaves. Cutting the tree at a certain height can lead to a partition (3 clusters here).
Two types of hierarchic clustering:
- Divisive (splitting top-to-bottom)
- Agglomerative (merging bottom-up)
Agglomerative clustering algorithm
0. Start: N singleton clusters Si={i}, i ∈ I, and an N x N distance (or similarity) matrix D=(d(i,j)), i,j ∈ I.
1. Find: Find the minimum distance d(i*,j*) in D.
2. Merge: Combine clusters Si* and Sj* into a merged cluster Si*j* = Si* ∪ Sj*; remove rows/columns i* and j* from D; put there a new row/column i*j* for the merged cluster Si*j*, with newly computed distances between Si*j* and the other clusters.
3. Draw and check: Draw the merger in a tree drawing such as Figure 3 and check whether the number of clusters is greater than 1. If yes, go to 1; if no, halt.
Distances between Si*j* and other clusters: nearest neighbour/single linkage (minimum distance
between cluster elements), furthest neighbour (maximum distance between cluster elements),
average neighbour (average distance between cluster elements).
Ward (1963): the distance between Si and Sj is the increase in the summary within-cluster variance, W(S, c)/N, after the merger:
wd(Si, Sj) = Ni*Nj/(Ni + Nj) * d(ci, cj)                    (4)
where Ni and Nj are the numbers of entities in Si and Sj, and d(ci, cj) is the squared Euclidean distance between the clusters’ centroids.
Ward distance in the agglomerative algorithm (Step 2):
wd(Si*j*, Sk) = [(Ni* + Nk)wd(Si*, Sk) + (Nj* + Nk)wd(Sj*, Sk) – Nk wd(Si*, Sj*)]/(Ni* + Nj* + Nk)                    (5)
Ward is computationally intensive because of Step 1.
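A minimal MatLab sketch of formula (4), computing the Ward distance from two clusters’ sizes and centroids; the function name is illustrative.
% Ward distance (4) between clusters of sizes Ni, Nj with centroids ci, cj (row vectors)
function w = warddist(Ni, Nj, ci, cj)
d = sum((ci - cj).^2);           % squared Euclidean distance between the centroids
w = Ni*Nj/(Ni + Nj)*d;
At each agglomeration step, formula (5) allows updating these distances without recomputing centroids.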
Single linkage / nearest neighbour clustering can catch elongated formations, in contrast to K-Means/Ward.
Figure 4. Two clusterings of the same data set obtained with different criteria: Ward/K-Means (a) and Single linkage (b).
Single linkage can be made computationally easier by using the Minimum Spanning Tree (MST).
Minimum/Maximum Spanning Tree: a weighted-graph concept.
Graph G: nodes i ∈ I, edges {i,j} with weights w(i,j)
Tree T: a sub-graph of G with no cycles
Spanning tree: a tree T covering all nodes I of G
Tree weight: the sum of the weights w(i,j) over {i,j} ∈ T
Minimum spanning tree (MST): a spanning tree of minimum weight
[Figure 1: a weighted graph on the nodes A, B, C, D, E, F, G; its edge weights can be read from the caption below and from the Prim example that follows.]
Figure 1. Sub-graph {{A,B},{A,C},{B,C}} is a cycle, not a tree; {{A,B}, {A,C}, {B,F}} is a tree,
{{A,B}, {A,C}, {B, D}, {B,E}, {B, F}, {B,G}} is a spanning tree of weight 3+2+3+4+2+4=18.
Spanning tree {{A,C}, {C,D}, {D, G}, {G,F}, {F,B}, {F,E}} is of minimum weight 2+2+3+1+2+3 =
13, thus an MST (highlighted).
Prim’s algorithm for finding an MST
1. Initialisation. Start with T consisting of an arbitrary node i ∈ I.
2. Tree update. Find j* ∈ I–T minimising w(i,j) over all i ∈ T and j ∈ I–T. Add j* and the edge {i,j*} to T.
3. Stop-condition. If I–T is empty, halt and output tree T. Otherwise go to 2.
Example: Find MST T in graph of Figure 1 starting from T={F}.
1st iteration: Take the minimum of w(F,j) over all j ≠ F; this is w(F,G)=1. Now T={F,G}, along with the corresponding edge (see Figure 2 a).
Figure 2. Three iterations of Prim’s algorithm (a, b, c) and the completed MST (d) for the graph of Figure 1.
2nd iteration: find j among {A, B, C, D, E} such that w(j,F) or w(j,G) is minimum. This is obviously w(B,F)=2, which adds B to T along with the edge {B,F} (see Figure 2 b).
3rd iteration: find j among {A, C, D, E} such that w(j,F), w(j,G) or w(j,B) is minimum. There are several candidates of weight 3 (specifically, the edges {A,B}, {B,D}, {F,E}, {G,D} and {G,A}); let us take {F,E}, thus adding E to T, so that T becomes {B, E, F, G} along with the added edge (see Figure 2 c).
Next iterations: w(G,D)=3 adds D; w(C,D)=2 adds C to T along with the edge {C,D}; the only remaining entity, A, obviously has the minimum 2-weight edge to T, {A,C}. This completes the MST (Figure 2 d).
Prim’s algorithm is greedy and thus computationally efficient, yet it builds a globally optimal MST. It operates with node (entity) sets. There is another, also greedy, approach due to J. Kruskal, which builds an MST using edge sets.
But what of its computational prowess? Is it not similar to Ward’s algorithm in needing to find a minimum link after each step?
The difference: Prim’s operates with the original weights only, whereas Ward’s changes them at each
step. Thus Prim’s can store information of nearest neighbours (NNs) for each of the nodes in the
beginning and use it at later steps.
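A minimal MatLab sketch of Prim’s algorithm for a symmetric weight matrix with Inf where there is no edge; the output lists the MST edges with their weights. The function name and the Inf convention are illustrative assumptions.
% Prim's algorithm on an n x n symmetric weight matrix W (Inf where there is no edge)
function edges = prim(W)
n = size(W, 1);
inT = false(1, n); inT(1) = true;            % start from node 1
edges = zeros(n-1, 3);                       % each row stores [i j w(i,j)]
for step = 1:n-1
    best = Inf; bi = 0; bj = 0;
    for i = find(inT)                        % nodes already in the tree
        for j = find(~inT)                   % nodes not yet in the tree
            if W(i, j) < best
                best = W(i, j); bi = i; bj = j;
            end
        end
    end
    inT(bj) = true;                          % add the closest outside node
    edges(step, :) = [bi bj best];
end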
What does the MST have to do with Single-linkage clustering? It has been mathematically proven that Single-linkage clusters are parts of an MST built over the entities as nodes with the between-entity distances as weights.
A Single-linkage divisive clustering: build an MST for the distance matrix and then sequentially cut it into clusters over a maximum-weight edge, or over all maximum-weight edges simultaneously (Figure 7 a, b; a sketch of the cutting step follows Figure 7).
[Figure 7, panels (a) and (b): trees over the leaves F, G, B, E, A, C, D.]
Figure 7. A binary tree (a) and the natural tree (b) for Single-linkage divisive clustering using the MST presented in Figure 2 d.
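A minimal MatLab sketch of the cutting step, assuming the MST is given as the edge list produced by the prim sketch above: removing the single maximum-weight edge splits the node set into two single-linkage clusters, whose labels are found by propagating connectivity over the remaining edges. The helper name mstcut is illustrative.
% split an MST (edge list with rows [i j w]) over n nodes into two single-linkage clusters
function labels = mstcut(edges, n)
[dummy, imax] = max(edges(:, 3));            % index of the heaviest edge
kept = edges([1:imax-1, imax+1:end], :);     % all MST edges except the heaviest
labels = 1:n;                                % start with each node in its own class
for e = 1:size(kept, 1)
    a = labels(kept(e, 1)); b = labels(kept(e, 2));
    labels(labels == b) = a;                 % merge the two classes joined by this edge
end
u = unique(labels);
for t = 1:length(u)
    labels(labels == u(t)) = t;              % relabel the two resulting classes as 1 and 2
end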
Appendix
A1 Vector mathematics
nD vector spaces: basic algebra and geometry
Summation defined as
e1=(-0.20, 0.23, -0.33, -0.63, 0.36, -0.22, -0.14)
+
e2=( 0.40, 0.05, 0, -0.63, 0.36, -0.22, -0.14)
____________________________________________________________
e1+e2=( 0.20, 0.28, -0.33, -1.26, 0.72, -0.44, -0.28)
Subtraction defined as
e1=(-0.20, 0.23, -0.33, -0.63, 0.36, -0.22, -0.14)
e2=( 0.40, 0.05, 0, -0.63, 0.36, -0.22, -0.14)
____________________________________________________________
e1-e2=(-0.60, 0.18, -0.33, 0, 0, 0, 0)
Multiplication by a real defined as
2e1 = (-0.40, 0.46, -0.66, -1.26, 0.72, -0.44, -0.28)
10e1=(-2.00, 2.30, -3.30, -6.30, 3.60, -2.20, -1.40)
Quiz: Could you illustrate the geometrical meaning of the set of all a*e1 (for any a)?
Geometry:
[2D illustration on the axes LS and LD of the vectors e1, e2=(0.4, 0.05), e1+e2=(0.2, 0.28) and e1–e2=(–0.6, 0.18).]
Distance
Euclidean distance r(e1,e2) := sqrt(sum((e1-e2).*(e1-e2)))
e1=(-0.20, 0.23, -0.33, -0.63, 0.36, -0.22, -0.14)
-
e2=( 0.40, 0.05, 0, -0.63, 0.36, -0.22, -0.14)
____________________________________________________________
e1-e2=(-0.60, 0.18, -0.33, 0, 0, 0, 0)
____________________________________________________________
(e1-e2).*(e1-e2)=( 0.36, 0.03, 0.11, 0, 0, 0, 0)
d(e1,e2)=sum((e1-e2).*(e1-e2))= 0.36+0.03+0.11+0+0+0+0 = 0.50
r(e1,e2)=sqrt(d(e1,e2))=sqrt(0.50)= 0.71
The Pythagorean theorem is behind this:
[Figure: points x1=(x11,x12) and x2=(x21,x22) in the plane, with the legs a and b and hypotenuse c of the right triangle they form.]
c2 = a2 + b2
d(x1,x2) = (x12 – x22)2 + (x21 – x11)2
Other distances:
Manhattan/City-block m(x1,x2)=|x12-x22|+|x21-x11|
Chebyshev/L∞ ch(x1,x2)=max(|x12-x22|, |x21-x11|)
Quiz: Extend these to nD
Quiz: Characterise the sets of points that lie within distance 1 from a given point, say the origin, for the cases when the distance is (i) Euclidean squared, (ii) Manhattan, (iii) Chebyshev.
Inner product
Inner product <e1,e2> := sum(e1.*e2)
e1=(-0.20, 0.23, -0.33, -0.63, 0.36, -0.22, -0.14)
.*
e2=( 0.40, 0.05, 0, -0.63, 0.36, -0.22, -0.14)
____________________________________________________________
e1.*e2=(-0.08, 0.01, 0, 0.39, 0.13, 0.05, 0.02)
____________________________________________________________
<e1,e2>=sum(e1.*e2)= -0.08+0.01+0+0.39+0.13+0.05+0.02 = 0.52
Relation between (Euclidean squared) distance d(e1,e2) and inner product <e1,e2>:
d(e1,e2) = <e1-e2, e1-e2> = <e1,e1> + <e2,e2> – 2<e1,e2>
Especially simple if <e1,e2>=0:
d(e1,e2) = <e1,e1> + <e2,e2>
– like (in fact, as) the Pythagorean theorem.
Points/vectors e1 and e2 satisfying <e1,e2>=0 are referred to as orthogonal (why?)
The square root of the inner product of a vector by itself, sqrt(<e,e>), is referred to as e’s
norm – the distance from 0 (analogous to length in nD)
Matrix of rank 1: product of two vectors; for example
a=[1 4 2 0.5]’;
b=[2 3 5]’;
Here ’ denotes transposition, so that the vectors a and b are considered columns rather than rows.
The matrix whose elements are the products of the components of a and b, with * being the matrix product, is:
a*b’ =
  2     3     5
  8    12    20
  4     6    10
  1   1.5   2.5
The defining feature of this matrix: all rows are proportional to each other; all columns are
proportional to each other.
(See more detail any course in linear algebra or matrix analysis.)
This matrix XTX, divided by N, has an interesting statistical interpretation if all columns of X have
been centred (mean-subtracted) and normed (std-normalised): its elements are correlation coefficients
between corresponding variables. (Note how a bi-variate concept is carried through to multivariate
data.) If the columns have not been normed, the matrix A=XTX /N is referred to as covariance matrix;
its diagonal elements are the column variances. Since the eigenvectors of the square matrix A are mutually orthogonal, A can be decomposed over them as
A = Σk=1,…,r λk c*k c*kT = CΛCT                    (7)
which can be derived from (5’); Λ is the diagonal r × r matrix of A’s eigenvalues λk = μk2. Equation (7) is referred to as the spectral decomposition of A, the eigenvalues λk constituting the spectrum of A.
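A minimal MatLab sketch checking decomposition (7) numerically on a small random centred data matrix; the sizes and names are illustrative.
% numerical check of the spectral decomposition A = C*Lambda*C'
X = randn(20, 4);
X = X - ones(20, 1)*mean(X);          % centre the columns
A = X'*X/20;                          % covariance matrix
[C, Lambda] = eig(A);                 % columns of C are eigenvectors, Lambda is diagonal
disp(norm(A - C*Lambda*C'))           % should be numerically zero
disp(norm(C'*C - eye(4)))             % the eigenvectors are mutually orthogonal and normed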
Optimisation algorithms
Alternating minimisation algorithm for f(x,y):
a sequence y0, x1, y1, x2, y2,… yt, xt.
Find x minimising f(x,y0); take this as x1. Given x1, find y minimising f(x1,y), take it as y1. Reiterate
until convergence.
Gradient optimisation (the steepest ascent/descent, or hill-climbing) of any function f(z) of a
multidimensional variable z: given an initial state z=z0, do a sequence of iterations to move to a better
z location. Each iteration updates z-value:
z(new) = z(old) ± μ*grad(f(z(old)))                    (2)
where + applies if f is maximised and – if it is minimised. Here:
- grad(f(z)) stands for the vector of partial derivatives of f with respect to the components of z. It is known from calculus that grad(f(z)) shows the direction of the steepest rise of function f at point z, so that –grad(f(z)) shows the steepest descent direction.
- the step-size value μ controls the length of the change of z in (2); it should be small (to guarantee no over-jumping), but not too small (to guarantee progress when grad(f(z(old))) becomes small; indeed, grad(f(z(old))) = 0 if z(old) is an optimum).
Q: What is the gradient of the function f(x1,x2) = x12 + x22? Of f(x1,x2) = (x1–1)2 + 3*(x2–4)2? Of f(z1,z2) = 3*z12 + (1–z2)4?
A: (2x1, 2x2); (2*(x1–1), 6*(x2–4)); (6*z1, –4*(1–z2)3).
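A minimal MatLab sketch of the steepest descent iteration (2) for the second quiz function f(x1,x2) = (x1–1)2 + 3*(x2–4)2, whose gradient is given in the answer above; the step size and iteration count are illustrative choices.
% steepest descent for f(x1,x2) = (x1-1)^2 + 3*(x2-4)^2
z = [0 0];                           % initial point
mu = 0.1;                            % step size
for t = 1:100
    g = [2*(z(1)-1), 6*(z(2)-4)];    % gradient at the current point
    z = z - mu*g;                    % descent step (minus sign, since we minimise)
end
disp(z)                              % approaches the minimiser (1, 4)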
Genetic algorithms (GA)
A population comprising a number, P, of structured entities, called chromosomes, typically strings
(sometimes trees, depending on the data structure), evolves imitating the following biological
mechanisms:
1. Selection
2. Cross-over
3. Mutation
These mechanisms apply to carry on the population from the current iteration to the next one. The
optimised criterion is referred to as fitness function. The initial population is selected, typically,
randomly. The evolution stops when the population’s fitness doesn’t change anymore or when a prespecified threshold to the number of iterations is reached.
An extension of GA approach:
Evolutionary algorithms
Evolutionary algorithms are similar to genetic algorithms in that a population evolves, but they may differ in mechanism: as far as I can see, the string representation may be abandoned here, as well as the crossover.
Example. Minimising the function f(x) = sin(2πx)·exp(–x) over the range [0,2].
Look at the following MatLab program eva.m.
% -------------------------- evolutionary optimisation of a scalar function
function [soli, funi]=eva
p=12;        % population size
lb=0; rb=2;  % the boundaries of the range
feas=(rb-lb)*rand(p,1)+lb;   % population within the range
flag=1;      % looping variable
count=0;     % number of iterations
iter=1000;   % limit to the number of iterations
%------------------------------ initial evaluation
vv=f(feas);
[funi, ini]=min(vv);
soli=feas(ini)   % initial x
funi             % initial f
si=0.5;          % mutation intensity
%------------- evolution loop
while flag==1
    count=count+1;
    feas=feas+si*randn(p,1);               % mutation
    feas=max(lb,feas); feas=min(rb,feas);  % keeping the population in [lb,rb]
    vec=f(feas);
    [fun, in]=min(vec);    % best value of f(x) in the current population
    sol=feas(in);          % corresponding x
    [wf,wi]=max(vec);      % worst value and its index
    wun=feas(wi);          % the worst chromosome
    %--------- elitist survival (slightly eugenic) -------
    if wf>funi
        feas(wi)=soli;
        vec(wi)=funi;
    end
    if rem(count,100)==0   % display every 100th iteration
        disp([soli funi]);
    end
    if fun < funi          % maintaining the best
        soli=sol;
        funi=fun;
    end
    if (count>=iter)
        flag=0;
    end
end
% ---------------------- computing the function y=sin(2*pi*x)*exp(-x)
function y=f(x)
for ii=1:length(x)
    a=exp(-x(ii));
    b=sin(2*pi*x(ii));
    y(ii)=a*b;
end
return;
This program finds the optimum rather fast indeed!
MatLab: A programming environment for user-friendly and fast manipulation and analysis of
data
Introduction
The working place within the computer’s memory is up to the user. A recommended option:
- a directory with user-made MatLab codes, say Codes, and two or more subdirectories, Data and Results, in which data and results are stored respectively.
MatLab can then be brought to the working directory with traditional MS-DOS or UNIX style commands such as cd <Path_To_Working_Directory>; MatLab then remembers this path.
MatLab is organised as a set of packages, each in its own directory, consisting of program files with the extension .m. A few data-handling programs are in the Codes directory.
"Help" command allows seeing names of the packages as well as of individual program files; the latter
are operations that can be executed within MatLab. Example: Command “help” shows a bunch of
packages, “matlab\datafun” among them; command “help datafun” displays a number of operations
such as “max – largest component”; command “help max” explains the operation in detail.
Work with files
A data file should be organised as an entity-to-feature data table: rows correspond to entities, columns
to features (see stud.dat and stud.var). Such a data structure is referred to as a 2D array or matrix; 1d
arrays correspond to solitary entities or features. This is one of MatLab data formats. The array
format works on the principle of a chess-board: its (i,k)-th element is the element in i-th row k-th
column. Array's defining feature is that every row has the same number of columns.
To load such a file one may use a command from package "iofun". A simple one is "load":
>> a=load('Data\stud.dat');
%symbol "%" is used for comments: MatLab interpreter doesn’t read lines beginning with “%”.
% "a" is a place to put the data (variable); ";" should stand at the end of an instruction;
% stud.dat is a 100x8 file of 100 part-time students with 8 features:
% 3 binary for Occupation, Age, NumberChildren, and scores over three disciplines (in file stud.var)
Names are handled as strings, delimited by the ' symbol (no space within a string is permitted). The entity/feature
names may vary in length and thus cannot be handled in the array format.
To handle them, another data format is used: the cell. Round braces (parentheses) are used for arrays, curly
braces for cells: a(i,:) - array's a i-th row, b{i} -cell's b i-th element, which can be a string, a number,
an array, or a cell. There can be other data structures as well (video, audio,...).
>> b=readlist('Data\stud.var'); % list of names of stud.dat features
If one wants to work with only three of the six features, say "Age", "Children" and
“OOProgramming_Score", one must put their indices together into a named 1d array:
>> ii=[4 5 7]
% no semicolon in the end to display ii on screen as a row; to make ii a column, semicolons are used
>> newa=a(:,ii); %new data array
>> newb=b(ii); %new feature set
A similar operation extracts a subset of entities. If, for instance, we want to limit our attention to
only those students who received 60 or more at "OOProgramming", we first find their indices with
command "find":
>> jj=find(a(:,7)>=60);
% jj is the set of the students defined in find()
% a(:,7) is the seventh column of a
Now we can restrict "a" to the rows in "jj":
>> al=a(jj,:); % partial data: the better-performing students
% nlrm.m, evolutionary fitting of a nonlinear regression function y=f(x,a,b)
% x is the predictor, y the target, a and b the regression parameters to be fitted
function [a,b,funi,residvar]=nlrm(xt,yt)
% in this version the regression equation is y=a*exp(b*x), which is
% reflected only in the subroutine 'delta' at the bottom, computing the
% value of the summary squared error;
% funi is the error's best value;
% residvar is its proportion to the sum of squares of the y entries
ll=length(xt);
if ll~=length(yt)
    disp('Something is wrong with the data');
    pause;
end
%---------------- playing with the data range to define the rectangle on which
%---------------- the populations are grown
mix=min(xt); maix=max(xt);
miy=min(yt); maiy=max(yt);
lb=-max(maix,maiy); rb=-lb;   % boundaries on the feasible solutions, taken as the
% maximum range of the raw data; should be ok, given the model
%---------------- organisation of the iterations, iter is the limit to their number
p=40;                         % population size
feas=(rb-lb)*rand(p,2)+lb;    % generated population of p pairs of coefficients within the range
flag=1;
count=0;
iter=10000;
%---------------- evaluation of the initially generated population
for ii=1:p
    vv(ii)=delta(feas(ii,:),xt,yt);
end
[funi, ini]=min(vv);
soli=feas(ini,:)   % initial coefficients
funi               % initial error
si=0.5;            % step of change
%---------------- evolution of the population
while flag==1
    count=count+1;
    feas=feas+si*randn(p,2);                   % mutation added with step si
    for ii=1:p
        feas(ii,:)=max([[lb lb];feas(ii,:)]);
        feas(ii,:)=min([[rb rb];feas(ii,:)]);  % keeping the population within the range
        vec(ii)=delta(feas(ii,:),xt,yt);       % evaluation
    end
    [fun, in]=min(vec);   % best approximation value
    sol=feas(in,:);       % corresponding parameters
    [wf,wi]=max(vec);     % worst case
    if wf>funi            % changing the worst for the best of the previous generation
        feas(wi,:)=soli;
        vec(wi)=funi;
    end
    if fun < funi
        soli=sol;
        funi=fun;
    end
    if (count>=iter)
        flag=0;
    end
    residvar=funi/sum(yt.*yt);
    %------------ screen the results of every 500th iteration
    if rem(count,500)==0
        disp([soli funi residvar]);
    end
end
a=soli(1);
b=soli(2);
%-------- computing the quality of the approximation y=a*exp(b*x)
function errorsq=delta(tt,x,y)
a=tt(1);
b=tt(2);
errorsq=0;
for ii=1:length(x)
    yp(ii)=a*exp(b*x(ii));    % this function can be changed if a different model is assumed
    errorsq=errorsq+(y(ii)-yp(ii))^2;
end
return;
% nnn.m for learning a set of features from a data set
% with a neural net with a single hidden layer,
% with the symmetric sigmoid (hyperbolic tangent) in the hidden layer,
% and data normalisation to the [-10,10] interval
function [V,W,mede]=nnn(hiddenn,muin)
% hiddenn - the number of neurons in the hidden layer
% muin - the learning rate, should be of the order of 0.0001 or less
% V, W - the wiring coefficients learnt
% mede - vector of absolute values of the errors in the output features
%--------------1. loading the data ----------------------
da=load('Data\studn.dat');   % this is where the data file is put!!!
% da=load('Data\iris.dat');  % this would be for the iris data
[n,m]=size(da);
%--------------2. normalising to the [-10,10] scale ------
mr=max(da);
ml=min(da);
ra=mr-ml;
ba=mr+ml;
tda=2*da-ones(n,1)*ba;
dan=tda./(ones(n,1)*ra);
dan=10*dan;
%--------------3. preparing the input and output (target) ------
ip=[1:5];    % here is the list of indexes of the input features!!!
% ip=[1:2];  % only two input features in the case of iris
ic=length(ip);
op=[6:8];    % here is the list of indexes of the output features!!!
% op=[3:4];  % output iris features
oc=length(op);
output=dan(:,op);   % target features
input=dan(:,ip);    % input features
input(:,ic+1)=10;   % bias component
%--------------4. initialising the network ------
h=hiddenn;          % the number of hidden neurons!!!
W=randn(ic+1,h);    % initialising the w weights
V=randn(h,oc);      % initialising the v weights
W0=W;
V0=V;
count=0;   % counter of epochs
stopp=0;   % stop-condition to change
while(stopp==0)
    mede=zeros(1,oc);   % mean errors after an epoch
    %-----------5. cycling over the entities in a random order
    ror=randperm(n);
    for ii=1:n
        x=input(ror(ii),:);    % current instance's input
        u=output(ror(ii),:);   % current instance's output
        %-------6. forward pass (to calculate the response) ------
        ow=x*W;
        o1=1+exp(-ow);
        oow=ones(1,h)./o1;
        oow=2*oow-1;   % symmetric sigmoid output of the hidden layer
        ov=oow*V;      % output of the output layer
        err=u-ov;      % the error
        mede=mede+abs(err)/n;
        %-------7. error back-propagation ------
        gV=-oow'*err;             % gradient matrix for V
        t1=V*err';                % error propagated to the hidden layer
        t2=(1-oow).*(1+oow)/2;    % the derivative of the symmetric sigmoid
        t3=t2.*t1';               % error multiplied by that derivative
        gW=-x'*t3;                % gradient matrix for W
        %-------8. weights update ------
        mu=muin;   % the learning rate from the input!!!
        V=V-mu*gV;
        W=W-mu*gW;
    end;
    %-----------9. stop-condition ------
    count=count+1;
    ss=mean(mede);
    if ss<0.01 | count>=10000
        stopp=1;
    end;
    if rem(count,500)==0
        count
        mede
    end
end;
Reading
B. Mirkin (2005), Clustering for Data Mining, Chapman & Hall/CRC, ISBN 1-58488-534-3.
A.P. Engelbrecht (2002) Computational Intelligence, John Wiley & Sons, ISBN 0-470-84870-7.
Supplementary reading
H. Abdi, D. Valentin, B. Edelman (1999) Neural Networks, Series: Quantitative Applications in the Social Sciences, 124, Sage Publications, London, ISBN 0-7619-1440-4.
M. Berthold, D. Hand (1999), Intelligent Data Analysis, Springer-Verlag, ISBN 3540658084.
S.K.Card, J.D. Mackinlay, B. Shneiderman (1999) Readings in Information Visualization:
Using Vision to Think, Morgan Kaufmann Publishers, San Francisco, Ca, ISBN 1-55860-533-9.
A.C. Davison, D.V. Hinkley (2005) Bootstrap Methods and Their Application, Cambridge
University Press (7th printing).
R.O. Duda, P.E. Hart, D.G. Stork (2001) Pattern Classification, Wiley-Interscience, ISBN 0-471-05669-3.
S. S. Haykin (1999), Neural Networks (2nd ed), Prentice Hall, ISBN 0132733501.
R. Spence (2001), Information Visualization, ACM Press, ISBN 0-201-59626-1.
T. Soukup, I. Davidson (2002) Visual Data Mining, Wiley Publishers, ISBN 0-471-14999-3
V. Vapnik (2006) Estimation of Dependences Based on Empirical Data, Springer Science+Business Media, 2nd edition.
A. Webb (2002) Statistical Pattern Recognition, Wiley, ISBN-0-470-84514-7.
Articles
R. Cangelosi, A. Goriely (2007) Component retention in principal component analysis with
application to cDNA microarray data, Biology Direct, 2:2, http://www.biolgy-direct.com/content/2/1/2.
J. Carpenter, J. Bithell (2000) Bootstrap confidence intervals: when, which, what? A practical
guide for medical statisticians, Statistics in Medicine, 19, 1141-1164.
G.W. Furnas (1981) The FISHEYE view: A new look at structured files, a technical report, reprinted in S.K. Card, J.D. Mackinlay, B. Shneiderman (1999) Readings in Information Visualization: Using Vision to Think, Morgan Kaufmann Publishers, San Francisco, CA, 350-367.
Y.K. Leung and M.D. Apperley (1994) A review and taxonomy of distortion-oriented presentation
techniques, In S.K.Card, J.D. Mackinlay, B. Shneiderman (1999) Readings in Information
Visualization: Using Vision to Think, Morgan Kaufmann Publishers, San Francisco, Ca, 350-367.