Boris Mirkin
School of Computer Science and Information Systems, Birkbeck University of London, UK

Computational Intelligence: Correlation, Summarization and Visualization

Contents:

0. Introduction
   0.1. Computational intelligence
   0.2. Data visualization
   0.3. Case study problems
1. Summarizing and visualizing a single feature: 1D analysis
   1.1. Distribution
   1.2. Centre and spread: definitions, properties, integral perspectives
        Project 1.1. Analysis of a multimodal distribution
   1.3. Confidence and computational experiment
        Project 1.2. Data mining with a confidence interval: Bootstrap
        Project 1.3. K-fold cross-validation
   1.4. Modeling uncertainty: Interval and fuzzy values
2. Correlating and visualizing two features: 2D analysis
   2.1. Both quantitative:
        Linear regression and residual variance
        Correlation coefficient
        Project 2.1. 2D analysis, linear regression and bootstrapping
        Estimating non-linear regression
        Project 2.2. Non-linear regression versus its linearized version: evolutionary algorithm for estimation
   2.2. Nominal and quantitative: table regression and correlation ratio
   2.3. Both categorical:
        Contingency table
        Quetélet indices
        Chi-squared correlation index and its visualization
3. Correlation with decision rules
   3.1. Linear regression
   3.2. Discriminant function and SVM
   3.3. Decision trees
4. Learning neural networks for prediction
   4.1. Artificial neuron and perceptron
   4.2. Multilayer network
   4.3. Back-propagation algorithm
5. Summarization with Principal Component Analysis
   5.1. One component
   5.2. Principal Components and SVD
   5.3. Popular applications
   5.4. Correspondence analysis
   5.5. Self-associative neural nets
6. K-Means clustering
   6.1. Batch and incremental K-Means
   6.2. Anomalous pattern and iK-Means
   6.3. Experimentally determining K
   6.4. Evolutionary approaches to K-Means
        Genetic algorithms
        Evolutionary algorithms
        Particle swarm optimization
   6.5. Extensions of K-Means
        Fuzzy
        Kohonen’s Self-Organizing Map (SOM)
7.
Structuring and visualizing similarity data
   7.1. Hierarchical clustering
   7.2. MST and single-linkage clustering
   7.3. Additive clusters
Appendix:
   A.1. Basics of multivariate entities
   A.2. Basic optimization
   A.3. Basic MatLab

Introduction

0.1. Computational Intelligence from Data

This is a unique course in that it looks at data from the inside rather than from the outside, which is the conventional perspective in Computer Science. The term Computational Intelligence has emerged recently and means different things to different people. Some think that the term should cover only those emerging techniques that relate to neural networks, evolutionary computation and fuzzy systems. They believe that Computational Intelligence has nothing to do with Machine Learning and Statistics because of the difference in methods. Some add that Computational Intelligence should involve a biological underpinning. Still others, including the author, think that a scientific discipline should be defined by a set of problems rather than by a set of techniques; the problems can be addressed by various techniques, with no predefined constraints on them.

In this respect, the definition following that by Engelbrecht (2002), p. 4, deserves attention: “Computational Intelligence – the study of adaptive mechanisms to enable or facilitate intelligent behaviour in complex and changing environments and, specifically, to learn or adapt to new situations, to generalize, abstract, discover and associate.” This seems an adequate definition. However, at present it is far too wide, because the study of adaptive mechanisms or actions is still in a rather embryonic state.
Yet the aspects related to learning, relating and discovering patterns in data have by now developed into a discernible set of approaches, sometimes supported by sound theoretical grounding. We will thus confine ourselves to studying data-driven computational intelligence models and methods aimed at enhancing knowledge of the domain of interest. The texts by D. Poole, A. Mackworth and R. Goebel (2001), A. Engelbrecht (2002), and Avi Kumar (2005) have rather different agendas, which explains the current effort.

Key concepts in this course are those related to computational and structural aspects of the input data, the output knowledge, and the methods and structures for relating them:

- Data: Typically, in the sciences and in statistics, a problem comes first, and then the investigator turns to data that might be useful for the problem. In computational intelligence this is also the case, but a problem can be very broad: look at this data set; what sense can we make of it? This is more reminiscent of a traveller’s view of the world than that of a scientist. The traveller deals with what occurs on their way. Helping the traveller make sense of data is the task of computational intelligence. Rather than attending to individual problems, computational intelligence focuses on learning patterns. It should be pointed out that this view differs much from that accepted in the sciences, including classical statistics, in which the main goal is to identify a pre-specified model of the world and data is but a vehicle for achieving this goal.

Any data set comprises two parts, metadata and data entries. Metadata may involve names for the entities and their features. Depending on the origin, entities may be alternatively but synonymously referred to as individuals, objects, cases, instances, or observations. Entity features may be synonymously referred to as variables, attributes, states, or characters.
Depending on the way they are assigned to entities, the features can be of elementary structure [e.g., age, sex, or income of individuals] or complex structure [e.g., a picture, a statement, or a cardiogram]. Metadata may involve relations between entities or other relevant information, which we are not going to deal with further on.

- Knowledge: Knowledge is a complex concept, not quite well understood yet, related to understanding things. Structurally, knowledge can be thought of as a set of categories and statements of relation between them. Categories are aggregations of similar entities, such as apples or plums, or more general categories such as fruit, comprising apples, plums, etc. When created over data objects or features, these are referred to as clusters or factors, respectively. Statements of relation between categories express regularities relating different categories. These can be of causal or correlation character. We say that two features correlate when the co-occurrence of specific patterns in their values is observed as, for instance, when one feature’s value tends to be the square of the other’s. The observance of a correlation pattern is thought to be a prerequisite to further inventing a theoretical framework from which the correlation follows.

It is useful to distinguish between quantitative correlations, such as functional dependencies between features, and categorical ones expressed conceptually, for example, as logical production rules or more complex structures such as decision trees. These may be used for both understanding and prediction. In industrial applications, which have been the driving force for Computational Intelligence so far, the latter is by far more important. Moreover, the prediction problem is much easier to make sense of operationally, so that the sciences have so far paid much attention to it. The notion of understanding, meanwhile, remains very vague.
We are going to study methods for enhancing knowledge by producing rules for finding either (a) Correlation of features (As) or (b) Summarization of entities or features (Ag), each in either of two ways, quantitative (Q) and categorical (C). A rule involves a postulated mathematical structure whose parameters are to be learnt from the data. We will be dealing mostly with the following mathematical structures in the rules:
- linear combination of features;
- neural network mapping a set of input features into a set of target features;
- decision tree built over a set of features;
- partition of the entity set into a number of non-overlapping clusters.

A fitting method relies on a computational model involving a function scoring the adequacy of the mathematical structure underlying the rule (a criterion) and, typically, visualization aids. The criterion measures either the deviation from the target (to be minimised) or fitness to the target (to be maximised). Currently available computational approaches to optimising the criterion can be partitioned into three major groups:
- global optimisation, computationally feasible sometimes for linear quantitative and simple discrete structures;
- local improvement using such general approaches as:
  o gradient descent
  o alternating optimization
  o greedy neighbourhood search
- evolution of population, an approach involving relatively recent advancements in computing capabilities, of which the following will be used in some problems:
  o genetic algorithms
  o evolutionary algorithms
  o particle swarm optimization

It should be pointed out that currently there is no systematic description of all possible combinations of problems, data types, mathematical structures, criteria, and fitting methods available.
Here we rather focus on the generic and better explored problems in each of the four groups, which can be safely claimed to be prototypical within the groups:

                      Quantitative                    Categorical
Ag (summarization)    Principal component analysis    Cluster analysis
Co (correlation)      Regression analysis             Pattern recognition, supervised classification

These methods have emerged in different frameworks and are usually considered unrelated. However, they are related in the context of computational intelligence. Moreover, they can be unified by the so-called least-squares criterion, which will be accepted for all main methods described in this text. In fact, the criterion will be part of a unifying, data-recovery, perspective. The data recovery approach involves three stages: (1) fitting a model to the data (sometimes referred to as “coding”), (2) deriving data from the model in the format of the data used to build the model (sometimes referred to as “decoding”), and (3) looking at the discrepancies between the observed data and those recovered from the model. The smaller the discrepancies, the better the fit, which gives a natural model fitting criterion.

At least three different levels of studying a computational intelligence approach can be distinguished. One can be interested in learning the approach on the level of concepts only: what it is for, why it should be applied at all, etc. A somewhat more practically oriented approach would be that of an information system or tool that can be utilised without any knowledge beyond the structure of input and output. A more technically oriented way would be studying the method involved and its properties. Comparative advantages and disadvantages of these three levels are as follows.

              Pro                     Con
Concepts      Awareness               Superficial
Systems       Usable now, Simple      Short-term, Stupid
Techniques    Workable, Extendable    Technical, Boring

Many in Computer Sciences rely on Systems assuming that good methods have been put in there already.
Indeed, with new data streams from new hardware devices being developed time and again, such issues as data capture, security, maintenance and distribution, which are way beyond intelligent data analysis techniques, can be much more urgent. Unfortunately, in many aspects, the intelligence of currently available “intelligent methods” is rather superficial and may lead to wrong results and decisions.

Consider, for instance, a very popular concept, the power law. Many say that in unconstrained social processes, such as those on the Web networks, this law, expressed as y = a*x^(-b) where x and y are related features and a, b > 0 are constants, dominates: the number of people who read news stories on the web decays with time as a power law, and so do the distribution of page requests on a website according to their popularity, the distribution of website interconnections, etc. According to a very popular recipe, to fit a power law (that is, to estimate a and b from the data), one needs to fit the logarithm of the power-law equation, that is, log(y) = c - b*log(x) where c = log(a), which is much easier to fit because it is linear. Therefore, this recipe advises: take logarithms of x and y first and then use any popular linear regression program to find the constants. This recipe does work well when the regularity is observed with no noise, which is impossible in social processes because too many factors affect them. If the data is not that exact, the recipe may lead to big errors.

For example, I generated x (between 0 and 10) and y as related by the power law y = 2*x^1.07, which can be interpreted as growth at a rate of approximately 7% per time moment, with added Gaussian noise whose standard deviation is 2. The recipe above led to estimates of a=3.08 and b=0.8, suggesting that the process does not grow with x but rather decays. In contrast, when I applied an evolutionary optimization method, which will be introduced later, I obtained realistic estimates of a=2.03 and b=1.076.
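The contrast between the two ways of fitting can be tried out with a short script. The sketch below is only an illustration under stated assumptions: it writes the law as y = a*x^b, as in the generated example; the x grid, the sample size of 50 and the random seed are arbitrary choices; and the direct fit is a plain scan over b with a closed-form a, used here as a simple stand-in for the evolutionary method mentioned in the text, not a reproduction of it. The function names are hypothetical. The numbers it prints will differ from the estimates quoted above because the data are not the author's.

```python
import numpy as np

def fit_power_law_loglinear(x, y):
    """Popular recipe: least squares on log(y) = log(a) + b*log(x)."""
    keep = (x > 0) & (y > 0)              # logarithms need positive values
    slope, intercept = np.polyfit(np.log(x[keep]), np.log(y[keep]), 1)
    return np.exp(intercept), slope       # (a, b)

def fit_power_law_direct(x, y, b_grid=None):
    """Least squares on the original scale: minimise sum((y - a*x**b)**2).

    For a fixed b the optimal a has a closed form, so a scan over b suffices.
    The grid of b values is an assumption of this sketch.
    """
    if b_grid is None:
        b_grid = np.linspace(0.0, 3.0, 3001)
    best = None
    for b in b_grid:
        z = x ** b
        a = np.dot(y, z) / np.dot(z, z)   # best a for this b
        err = np.sum((y - a * z) ** 2)
        if best is None or err < best[0]:
            best = (err, a, b)
    return best[1], best[2]               # (a, b)

# Synthetic data in the spirit of the example: y = 2*x^1.07 plus N(0, 2^2) noise.
x = np.linspace(0.2, 10.0, 50)            # assumed grid within (0, 10)
y_clean = 2.0 * x ** 1.07
rng = np.random.default_rng(0)            # assumed seed
y_noisy = y_clean + rng.normal(0.0, 2.0, size=x.size)

a_log, b_log = fit_power_law_loglinear(x, y_noisy)
a_dir, b_dir = fit_power_law_direct(x, y_noisy)
print(f"log-linear recipe: a={a_log:.3f}, b={b_log:.3f}")
print(f"direct fit:        a={a_dir:.3f}, b={b_dir:.3f}")
```

Note that the log-linear recipe has to discard observations with non-positive y, which additive noise produces at small x; this alone shows that the logarithmic transformation changes the problem being solved once noise is present.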
This is a relatively simple example, in which a correct procedure can be used. However, in more complex situations of clustering or categorization, the very idea of a correct method seems rather debatable; at least, methods in the existing systems can be, and frequently are, of rather poor quality. One may compare the situation here with that of getting the services of an untrained medical doctor or car driver; the results could be just as devastating. This is why it is important to study not only the Hows but also the Whats and Whys of Computational Intelligence, which are addressed in this course by focusing on Concepts and Techniques rather than Systems.

In a typical case, the exposition follows the structure of a data analysis application and comprises the following seven steps:
(i) formulating a specific data-related problem, then
(ii) developing a model and
(iii) a method that are going to be used for advancing into the problem, then
(iv) application of the method to the data, sometimes preceded by
(v) data standardization, and sometimes followed by
(vi) adjustment of the solution to the nature of the substantive problem, and, last but not least,
(vii) interpretation and conclusion.

0.2. Visualization

0.2.1. General

Visualization can be a by-product of the model and/or method, or it can be utilized by itself. The concept of visualization usually relates to human cognitive abilities, which are not well understood. At this moment, we are not able to discuss the structures of visual image streams such as those in a movie or video. Nor can one reflect, in a computationally meaningful way, on the art of painting or photography, whose goals relate to deep-down impressions and emotions. We are going to be concerned with presenting data as maps or diagrams or objects on a digital screen in such a way that relations between data entities or features are reflected in distances or connections, or other visual relations, between their images.
Among more or less distinct visualization goals, beyond sheer presentation that appeals to the cognitive domination of visual over other senses, we can distinguish between:
A. Highlighting
B. Integrating different aspects
C. Narrating
D. Manipulating

Of these, manipulating visual images of entities, such as in computer games, seems an interesting area yet to be developed in the framework of Computational Intelligence. The other three will be briefly discussed and illustrated in the remainder of this section.

0.2.2. Highlighting

To visually highlight this or that feature of an image, one should somehow distort the original dimensions. A good example is the London tube scheme by H. Beck (1931), on which he greatly enlarged the proportions of the central part of London to make it easier to see. Such a gross distortion, for a long while totally rejected by the authorities, is now a standard for metro maps worldwide (see Figure 0.2.1).

Figure 0.2.1. A fragment of London Tube map made after H. Beck; the central part is highlighted by disproportionate scaling.

This line of thinking has been worked on in geography for centuries, since the mapping of the Earth's global surface to a flat sheet is impossible to do exactly. Various proxy criteria have been proposed, leading to interesting highlights such as those presented in Figure 0.2.2 (Fuller’s projection) and Figure 0.2.3 (August’s projection); see website http://en.wikipedia.org/wiki/ for more.

Figure 0.2.2. The Fuller Projection, or Dymaxion Map, solves the problem of displaying spherical data on a flat surface of a polyhedron using a low-distortion transformation. Landmasses are presented without interruption: the map's sinuses do not cut into the land area at any point.

Figure 0.2.3.
A conformal map: the angle between any two lines on the sphere is the same between their projected counterparts on the map; in particular, each parallel crosses meridians at right angles; and also, scale at any point is the same in all directions.

More recently this idea was applied by Rao and Card (1994) to table data (see Figure 0.2.4); more on this can be found in the volume by Card, Mackinlay and Shneiderman (1999).

Figure 0.2.4. The Table Lens machine: highlighting a few rows and columns by enlarging them.

It should be noted that disproportionate highlighting may lead to effects bordering on visual cheating. This is especially apparent when relative proportions are visualized through proportions between areas, as in Figure 0.2.5. An unintended effect of the picture is that the decline by half is presented visually by the area of the doctor’s body, which is just one fourth of the initial size: when both the height and the width of a picture are halved, its area shrinks to a quarter. This grossly biases the message.

Figure 0.2.5. A decline in relative numbers of general practitioner doctors in California in the 1970s is conveniently visualized using 1D size-, not 2D area-related, scaling of a picture of a doctor.

Another typical case of unintentional cheating is when the relative proportions are visualized using bars that start not at 0 but at an arbitrary mark, as is the case in Figure 0.2.6, in which a newspaper’s legitimate satisfaction with its success is visualized using bars that begin at the 500,000 mark rather than 0. Another mistake is that the difference between the bars’ heights in the picture is much greater than the reported 220,000. Altogether, the rival’s circulation bar is less than half as tall, while its real circulation is lower by just 25%.

Figure 0.2.6. Another unintended distortion: a newspaper’s report (July 2005) is visualized with bars that grow from mark 500,000 rather than 0.

0.2.3. Integrating different aspects

Figure 0.2.7.
An image of Con Edison company’s power grid on a PC screen according to website http://www.avs.com/software/soft_b/openviz/conedison.html.

Bringing different features of a phenomenon to a visual presentation can make life easier indeed. Figure 0.2.7 represents an image that an energy company utilizes for real-time management, control and repair of its energy network stretching over the island of Manhattan (New York, USA). Operators can view the application on their desktop PCs, monitor the grid, and repair problems when they arise by rerouting power or sending a crew out to repair a device on site. This allows “manipulation and utilisation of data in ways that were previously not possible,” according to the company’s website.

Bringing features together can be useful for less immediate insights too. A popular story of Dr. John Snow’s fight against an outbreak of cholera in Soho, London, 1854, is based on the real fact that, two weeks into the outbreak, Dr. Snow went over all houses in the vicinity and made as many ticks at each of them on his map as there had been deaths from cholera there (a scheme of a fragment of Dr. Snow’s map is in Figure 0.2.8). The ticks were densest around a water pump, which convinced Dr. Snow that the pump was the cholera source. (In fact, he had served in India to become disposed to the idea of the role of water flows in the transmission of the disease.)

Figure 0.2.8. A scheme of a fragment of Dr. Snow’s map demonstrating that indeed most deaths (labelled by circles) have occurred near the water pump he was dealing with.

He discussed his findings with the priest of the local parish, who then removed the handle of the pump, after which deaths stopped. This is all true. But there is more to this story. The deaths did stop, but because too few people remained in the district, not because of the removal: the handle was ordered to be put back the very next day after it was removed. Moreover, the borough council refused to accept Dr.
Snow’s water pump theory because of its inconsistency with the theory of the time that cholera progressed through stench in the air rather than water. More people died in Soho in the next cholera outbreak a decade later. The water pump theory was not accepted until much later, when the microbe theory was developed. The story is instructive in that a data-based conclusion needs a plausible explanation to be accepted.

[Figure 0.2.9 diagram: the root node Sector splits into Retail, leading to Product C, and Not Retail (Ind./Util.), leading to node ECom, which splits into No, leading to Product A, and Yes, leading to Product B.]

Figure 0.2.9. Product decision tree for the Company data in Table 0.1.

The diagram in Figure 0.2.9 visualizes relations between features in the Company data (Table 0.1) as a decision tree to characterize their products. For example, the left-hand branch distinctly describes Product A by combining the “Not retail” and “No e-commerce” edges.

One more visual image (Figure 0.2.10) depicts relations between confusion patterns of decimal numerals drawn over a rectangle’s edges and their descriptions in terms of combinations of edges of the rectangle with which they are drawn. A description may combine both edge presence and absence to distinctively characterise its pattern, whereas a profile comprises edges that are present in all elements of its pattern. The confusion patterns are derived from psychological data (see Mirkin 2005).

[Figure 0.2.10 panels: Patterns, Profiles, Descriptions; legend: Absence, Presence.]

Figure 0.2.10. Confusion patterns for numerals, drawn using rectangle edges, their descriptions in terms of edges present/absent, and profiles showing maximal common edges.

0.2.4. Narrating a story

In a situation in which the features involved have a temporal or spatial aspect, integrating them in one visual image may lead to a narrative of a story, with its starting and ending dates. Such a story is told of a military campaign (Napoleon invading Russia in 1812) as presented in Figure 0.2.11. It shows a map

Figure 0.2.11.
The white band represents the trajectory of Napoleon’s army moving to the East and the black band shows it moving to the West, the line width being proportional to the army’s strength.

of Russia, with Napoleon’s army trajectory drawn forth, in white, and back, in black, so that time is embedded in this static image. The trajectory’s width shows the army’s strength steadily declining in time on a dramatic scale.

All the images presented can be considered illustrations of a principle accepted further on. According to this principle, to visualize data, one needs to specify first a “ground” image, such as a map or grid or coordinate plane, which is supposed to be well known to the user. Visualization, as a computational device, can be defined as mapping data to the ground image in such a way that the analysed properties of the data are reflected in properties of the image. Of the goals considered, integration of data will be a priority, since no temporal aspect is considered here.

0.3. Case study problems

Case 0.3.1: Companies

Table 0.1. Companies characterized by mixed scale features; the first three companies making product A, the next three making product B, and the last two product C.

Company name   Income, $mln   SharP, $   NSup   EC    Sector
Aversiona      19.0           43.7       2      No    Utility
Antyops        29.4           36.0       3      No    Utility
Astonite       23.9           38.0       3      No    Industrial
---------------------------------------------------------------
Bayermart      18.4           27.9       2      Yes   Utility
Breaktops      25.7           22.3       3      Yes   Industrial
Bumchista      12.1           16.9       2      Yes   Industrial
---------------------------------------------------------------
Civiok         23.9           30.2       4      Yes   Retail
Cyberdam       27.2           58.0       5      Yes   Retail

There are five features in Table 0.1:
1) Income, $ mln;
2) SharP - share price, $;
3) NSup - number of principal suppliers;
4) EC - Yes or No depending on the usage of e-commerce in the firm;
5) Sector - the sector of the economy: (a) Retail, (b) Utility, and (c) Industrial.

Examples of computational intelligence problems related to this data set:
- How to map companies to the screen with their similarity reflected in distances on the plane?
(Summarization) [Q: Do you think that the following statement is true? “There is no information on the company products within the table”. A. You should, since no feature “Product” is present in the table, and the separating lines are not part of the data.]
- Would clustering of companies reflect the product? What features would be involved then? (Summarization)
- Can rules be derived to attribute the product to another company, coming from outside of the table? (Correlation)
- Is there any relation between the structural features and the market-related features? (Correlation)

An issue related to Table 0.1 is that not all of its entries are quantitative. Specifically, there are three conventional types of features in it:
- Quantitative, that is, such that the averaging of its values is meaningful. In Table 0.1, these are Income, SharP and NSup;
- Binary, that is, admitting one of two answers, Yes or No: this is EC;
- Nominal, that is, with a few disjoint unordered categories, such as Sector in Table 0.1.

Most models and methods presented here require quantitative data only. The two non-quantitative feature types, binary and nominal, can be pre-processed into a quantitative format as follows. A binary feature can be recoded into 1/0 format by substituting 1 for “Yes” and 0 for “No”. Then the recoded feature can be considered quantitative, because its averaging is meaningful: the average value is equal to the proportion of unities, that is, the frequency of “Yes” in the original feature. A nominal feature is first converted into a set of binary “Yes”/“No” features corresponding to individual categories. In Table 0.1, the binary features yielded by categories of feature “Sector” are: Is it Retail? Is it Utility? Is it Industrial? They are put as questions so that the answer to each is “Yes” or “No”. These binary features can now be converted to the quantitative format as advised above, by recoding 1 for “Yes” and 0 for “No”.
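The recoding just described can be sketched in a few lines of Python. This is a minimal illustration, not part of the original course material: the helper names recode_binary and recode_nominal are hypothetical, and only the EC and Sector columns of Table 0.1 are used.

```python
# EC and Sector values of the eight companies, copied from Table 0.1.
companies = [
    ("Aversiona", "No",  "Utility"),
    ("Antyops",   "No",  "Utility"),
    ("Astonite",  "No",  "Industrial"),
    ("Bayermart", "Yes", "Utility"),
    ("Breaktops", "Yes", "Industrial"),
    ("Bumchista", "Yes", "Industrial"),
    ("Civiok",    "Yes", "Retail"),
    ("Cyberdam",  "Yes", "Retail"),
]

def recode_binary(values, yes="Yes"):
    """Substitute 1 for the 'Yes' answer and 0 for anything else."""
    return [1 if v == yes else 0 for v in values]

def recode_nominal(values, categories):
    """One 1/0 column per category: 'Is it <category>?'."""
    return {c: [1 if v == c else 0 for v in values] for c in categories}

ec = recode_binary([row[1] for row in companies])
sector = recode_nominal([row[2] for row in companies],
                        ["Utility", "Industrial", "Retail"])

print("EC        ", ec)
for category, column in sector.items():
    print(f"{category:<10}", column)

# The mean of a 1/0 feature is the frequency of "Yes" in the original feature.
print("Proportion of e-commerce users:", sum(ec) / len(ec))
```

The mean of the recoded EC column is 5/8 = 0.625, the share of companies answering “Yes”, which is what makes averaging meaningful for a recoded binary feature.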
Table 0.2. Data from Table 0.1 converted to the quantitative format.

Code     1      2      3      4      5      6      7      8
Income   19.0   29.4   23.9   18.4   25.7   12.1   23.9   27.2
SharP    43.7   36.0   38.0   27.9   22.3   16.9   30.2   58.0
NSup     2      3      3      2      3      2      4      5
EC       0      0      0      1      1      1      1      1
Util     1      1      0      1      0      0      0      0
Indu     0      0      1      0      1      1      0      0
Retail   0      0      0      0      0      0      1      1

Case 0.3.2: Iris data set

Figure 0.1. Sepal and petal in an Iris flower.

This popular data set describes 150 Iris specimens, representing three taxa of Iris flowers, I Iris setosa (diploid), II Iris versicolor (tetraploid) and III Iris virginica (hexaploid), 50 specimens from each. Each specimen is measured on four morphological variables: sepal length (w1), sepal width (w2), petal length (w3), and petal width (w4) (see Figure 0.1).

Table 0.3. Iris data: 150 Iris specimens measured over four features each. Each row lists one feature over the 50 specimens of a taxon.

# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50

I Iris setosa
w1 5.1 4.4 4.4 5.0 5.1 4.9 5.0 4.6 5.0 4.8 4.8 5.0 5.1 5.0 5.1 4.9 5.3 4.3 5.5 4.8 5.2 4.8 4.9 4.6 5.7 5.7 4.8 5.2 4.7 4.5 5.4 5.0 4.6 5.4 5.0 5.4 4.6 5.1 5.8 5.4 5.0 5.4 5.1 4.4 5.5 5.1 4.7 4.9 5.2 5.1
w2 3.5 3.2 3.0 3.5 3.8 3.1 3.2 3.2 3.3 3.4 3.0 3.5 3.3 3.4 3.8 3.0 3.7 3.0 3.5 3.4 3.4 3.1 3.6 3.1 4.4 3.8 3.0 4.1 3.2 2.3 3.4 3.0 3.4 3.9 3.6 3.9 3.6 3.8 4.0 3.7 3.4 3.4 3.7 2.9 4.2 3.4 3.2 3.1 3.5 3.5
w3 1.4 1.3 1.3 1.6 1.6 1.5 1.2 1.4 1.4 1.9 1.4 1.3 1.7 1.5 1.9 1.4 1.5 1.1 1.3 1.6 1.4 1.6 1.4 1.5 1.5 1.7 1.4 1.5 1.6 1.3 1.7 1.6 1.4 1.3 1.4 1.7 1.0 1.5 1.2 1.5 1.6 1.5 1.5 1.4 1.4 1.5 1.3 1.5 1.5 1.4
w4 0.3 0.2 0.2 0.6 0.2 0.2 0.2 0.2 0.2 0.2 0.1 0.3 0.5 0.2 0.4 0.2 0.2 0.1 0.2 0.2 0.2 0.2 0.1 0.2 0.4 0.3 0.3 0.1 0.2 0.3 0.2 0.2 0.3 0.4 0.2 0.4 0.2 0.3 0.2 0.2 0.4 0.4 0.4 0.2 0.2 0.2 0.2 0.1 0.2 0.2

II Iris versicolor and III Iris virginica (rows follow in the order: versicolor w1, virginica w1, versicolor w2, w3, w4, virginica w2, w3, w4)
w1 6.4 5.5 5.7 5.7 5.6 7.0 6.8 6.1 4.9 5.8 5.8 5.5 6.7 5.7 6.7 5.5 5.1 6.6 5.0 6.9 5.0 5.6 5.6 5.8 6.3 6.1 5.9 6.0 5.6 6.7 6.2 5.9 6.3 6.0 5.6 6.2 6.0 6.5 5.7 6.1 5.5 5.5 5.4
6.3 5.2 6.4 6.6 5.7 6.1 6.0 w1 6.3 6.7 7.2 7.7 7.2 7.4 7.6 7.7 6.2 7.7 6.8 6.4 5.7 6.9 5.9 6.3 5.8 6.3 6.0 7.2 6.2 6.9 6.7 6.4 5.8 6.1 6.0 6.4 5.8 6.9 6.7 7.7 6.3 6.5 7.9 6.1 6.4 6.3 4.9 6.8 7.1 6.7 6.3 6.5 6.5 7.3 6.7 5.6 6.4 6.5 w2 3.2 2.4 2.9 3.0 2.9 3.2 2.8 2.8 2.4 2.7 2.6 2.4 3.0 2.8 3.1 2.3 2.5 2.9 2.3 3.1 2.0 3.0 3.0 2.7 2.3 3.0 3.0 2.7 2.5 3.1 2.2 3.2 2.5 2.9 2.7 2.9 3.4 2.8 2.8 2.9 2.5 2.6 3.0 3.3 2.7 2.9 3.0 2.6 2.8 2.2 w3 4.5 3.8 4.2 4.2 3.6 4.7 4.8 4.7 3.3 3.9 4.0 3.7 5.0 4.1 4.4 4.0 3.0 4.6 3.3 4.9 3.5 4.5 4.1 4.1 4.4 4.6 4.2 5.1 3.9 4.7 4.5 4.8 4.9 4.5 4.2 4.3 4.5 4.6 4.5 4.7 4.0 4.4 4.5 4.7 3.9 4.3 4.4 3.5 4.0 4.0 w4 1.5 1.1 1.3 1.2 1.3 1.4 1.4 1.2 1.0 1.2 1.2 1.0 1.7 1.3 1.4 1.3 1.1 1.3 1.0 1.5 1.0 1.5 1.3 1.0 1.3 1.4 1.5 1.6 1.1 1.5 1.5 1.8 1.5 1.5 1.3 1.3 1.6 1.5 1.3 1.4 1.3 1.2 1.5 1.6 1.4 1.3 1.4 1.0 1.3 1.0 w2 3.3 3.3 3.6 3.8 3.0 2.8 3.0 2.8 3.4 3.0 3.0 2.7 2.5 3.1 3.0 3.4 2.7 2.7 3.0 3.2 2.8 3.1 3.1 3.1 2.7 3.0 2.2 3.2 2.8 3.2 3.0 2.6 2.8 3.0 3.8 2.6 2.8 2.5 2.5 3.2 3.0 3.3 2.9 3.0 3.0 2.9 2.5 2.8 2.8 3.2 w3 6.0 5.7 6.1 6.7 5.8 6.1 6.6 6.7 5.4 6.1 5.5 5.3 5.0 5.1 5.1 5.6 5.1 4.9 4.8 6.0 4.8 5.4 5.6 5.5 5.1 4.9 5.0 5.3 5.1 5.7 5.2 6.9 5.1 5.2 6.4 5.6 5.6 5.0 4.5 5.9 5.9 5.7 5.6 5.5 5.8 6.3 5.8 4.9 5.6 5.1 w4 2.5 2.1 2.5 2.2 1.6 1.9 2.1 2.0 2.3 2.3 2.1 1.9 2.0 2.3 1.8 2.4 1.9 1.8 1.8 1.8 1.8 2.1 2.4 1.8 1.9 1.8 1.5 2.3 2.4 2.3 2.3 2.3 1.5 2.0 2.0 1.4 2.1 1.9 1.7 2.3 2.1 2.5 1.8 1.8 2.2 1.8 1.8 2.0 2.2 2.0 14 The taxa are defined by the genotype whereas the features are of the appearance (phenotype). The question arises whether the taxa can be described, and indeed predicted, in terms of the features or not. It is well known from previous studies that taxa II and III are not well separated in the variable space. Some non-linear machine learning techniques such as Neural Nets \cite{Ha99} can tackle the problem and produce a decent decision rule involving non-linear transformation of the features. 
Unfortunately, rules derived with Neural Nets are not comprehensible to humans and, thus, cannot be used for interpretation and description. The human mind needs a somewhat less artificial logic that is capable of reproducing and extending botanists' observations, such as that the petal area, roughly expressed by the product of w3 and w4, provides much better resolution than the original linear sizes. Other problems that are of interest: (a) visualise the data, and (b) build a predictor of sepal sizes from the petal sizes.

Case 0.3.3: West Country Market towns

In Table 0.4, a set of Market towns in the West Country, England, is presented along with features characterising population and social infrastructure. For the purposes of social monitoring, it is good to have a smaller number of towns that are representative of clusters of similar towns. In the Table, the towns are sorted according to their population sizes. One can see that 21 towns have fewer than 4,000 residents. The number 4,000 is taken as a divider since it is round and, more importantly, there is a gap of more than thirteen hundred residents between Kingskerswell (3672 inhabitants) and the next in the list, Looe (5022 inhabitants). The next big gap occurs after Liskeard (7044 inhabitants), separating the nine middle-sized towns from two groups of larger towns containing six and nine towns, respectively. The divider between the latter groups is taken between Tavistock (10222) and Bodmin (12553). In this way, we get three or four groups of towns for monitoring. But is this enough, regarding the other features available? Are the resident groups homogeneous enough for the purposes of monitoring?
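The grouping by population described above can be reproduced with a short script. The sketch below takes the Pop column of Table 0.4; the divider 4000 comes from the text, whereas the values 7500 and 11000 are assumptions of this sketch, chosen only to fall inside the gaps after Liskeard and after Tavistock that the text names.

```python
from bisect import bisect_right

# Resident population of the 45 towns (column Pop of Table 0.4), sorted.
populations = [
    2040, 2087, 2092, 2119, 2230, 2236, 2272, 2275, 2362, 2452, 2458,
    2460, 2611, 2695, 2786, 2899, 3123, 3511, 3609, 3660, 3672,
    5022, 5258, 5291, 5676, 6466, 6929, 7027, 7034, 7044,
    8238, 8505, 8837, 9179, 10092, 10222,
    12553, 14139, 15865, 17390, 18966, 19709, 20297, 21622, 23801,
]

# 4000 is the divider named in the text; 7500 and 11000 are assumed values
# placed inside the gaps after Liskeard (7044) and Tavistock (10222).
dividers = [4000, 7500, 11000]

# Group index of a town = number of dividers not exceeding its population.
group_sizes = [0] * (len(dividers) + 1)
for pop in populations:
    group_sizes[bisect_right(dividers, pop)] += 1
print("Group sizes:", group_sizes)

# Gaps between consecutive towns, to see where the dividers sit.
gaps = {low: high - low for low, high in zip(populations, populations[1:])}
for town_pop in (3672, 7044, 10222):
    print(f"Gap after the town with {town_pop} residents: {gaps[town_pop]}")
```

Printing all the gaps rather than just these three shows several other gaps of comparable size among the larger towns, which illustrates why the text goes on to ask whether a grouping by population alone is enough.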
As further computations will show, the numbers of services on average do follow the town sizes, but the set (as well as the complete set of about thirteen hundred England Market towns) is much better represented with seven somewhat different clusters: large towns of about 17-20,000 inhabitants, two clusters of medium-sized towns (8-10,000 inhabitants), three clusters of small towns (about 5,000 inhabitants), and a cluster of very small settlements with about 2,500 inhabitants. Each of the three small-town clusters is characterized by the presence of a facility that is absent in the two others: a Farm market, a Hospital and a Swimming pool, respectively. These differences may be considered not quite important, in which case the only difference between the clusters and the grouping over town resident numbers would be the different dividing points. However, one should not forget that the number of residents has been selected by us because of our knowledge that this is the feature highly affecting all the other features of town life. In the absence of such knowledge, the population size should come out as an important feature after, not prior to, the computation.

The data in Table 0.4 involve the following 12 features as observed in the 1991 census:

Pop - Population resident
PSch - Primary schools
Doct - General Practitioners
Hosp - Hospitals
Bank - Banks
Sstor - Superstores
Petr - Petrol stations
DIY - Do It Yourself shops
Swim - Swimming pools
Post - Post offices
CAB - Citizen Advice Bureaus
FMar - Farmer markets

Table 0.4. Data of West Country England Market Towns 1991.
Town Pop Mullion 2040 So Brent 2087 St Just 2092 St Columb 2119 Nanpean 2230 Gunnislake 2236 Mevagissey 2272 Ipplepen 2275 Be Alston 2362 Lostwithiel 2452 St Columb 2458 Padstow 2460 Perranporth 2611 Bugle 2695 2 Buckfastle 2786 St Agnes 2899 Porthleven 3123 Callington 3511 Horrabridge 3609 Ashburton 3660 Kingskers 3672 Looe 5022 Kingsbridge 5258 Wadebridge 5291 Dartmouth 5676 Launceston 6466 Totnes 6929 Penryn 7027 Hayle 7034 Liskeard 7044 Torpoint 8238 Helston 8505 St Blazey 8837 Ivybridge 9179 St Ives 10092 Tavistock 10222 Bodmin 12553 Saltash 14139 Brixham 15865 Newquay 17390 Truro 18966 Penzance 19709 Falmouth 20297 St Austell 21622 Newton Abb 23801 PSch 1 1 1 1 2 2 1 1 1 2 1 1 1 0 2 1 1 1 1 1 1 1 2 1 2 4 2 3 4 2 2 3 5 5 4 5 5 4 7 4 9 10 6 7 13 Doct 0 1 0 0 1 1 1 1 0 1 0 0 1 0 1 1 0 1 1 0 0 1 1 1 0 1 1 1 0 2 3 1 2 1 3 3 2 2 3 4 3 4 4 4 4 Hosp 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 0 1 2 0 1 0 0 0 1 1 1 1 1 1 1 1 2 1 Bank Sstor Petr 2 1 2 2 0 1 1 0 1 2 0 3 1 0 1 2 1 3 2 2 0 2 7 5 4 8 7 2 2 6 3 7 1 3 7 7 6 4 5 12 19 12 11 14 13 0 1 1 1 0 0 0 0 1 0 1 0 1 1 2 1 1 1 1 1 1 1 1 3 4 4 2 4 2 2 2 2 1 1 2 3 3 2 5 5 4 7 3 6 4 1 0 1 1 0 1 0 1 0 1 3 0 2 0 2 1 0 1 1 2 2 1 2 1 1 4 1 1 2 3 1 3 4 4 2 3 5 3 3 4 5 5 2 4 7 DIY Swim 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 2 1 0 3 1 0 0 0 0 0 0 0 0 0 0 0 0 0 2 1 0 0 1 0 1 0 1 0 1 0 1 1 0 0 1 0 1 0 0 0 2 1 1 2 1 2 1 1 1 1 Post CAB 1 1 1 1 2 3 1 1 1 1 2 1 2 0 1 2 1 1 2 1 1 3 1 1 2 3 4 3 2 2 2 1 4 1 4 3 2 3 5 5 7 7 9 8 7 FMar 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 1 1 1 1 1 0 1 1 2 1 1 0 1 1 1 1 1 1 1 1 2 1 1 2 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 1 0 Case 0.4. Student data In Table 0.5, a fictitious data set is presented imitating data of Birkbeck University of London parttime students pursuing Master’s degree in Computer Sciences. 
This data refer to a hundred students along with six features, three of which are personal characteristics (1. Occupation (Occ): either Information technology (IT) or Business Administration (BA) or anything else (AN); 2. Age, in years; 3. Number of children (Chi)) and three are their marks over courses in Software and Programming (SEn), Object-Oriented Programming (OOP), and Computational Intelligence (CI). 16 Related questions are: - Whether the students’ marks are affected by the personal features; - Are there any patterns in marks, especially in relation to occupation? Table 0.5. Student data in two columns. Occ Age Chi SEn OOP CI Occ Age Chi SEn OOP CI IT IT IT IT IT IT IT IT IT IT IT IT IT IT IT IT IT IT IT IT IT IT IT IT IT IT IT IT IT IT IT IT IT IT IT BA BA BA BA BA BA BA BA BA BA BA BA BA BA BA 28 35 25 29 39 34 24 37 33 23 24 32 33 27 32 29 21 21 26 20 28 34 22 21 32 32 20 20 24 32 21 27 33 34 34 36 35 36 37 42 30 28 38 49 50 34 31 49 33 43 0 0 0 1 0 0 0 1 1 1 1 0 0 1 1 0 0 0 1 1 1 1 0 1 1 0 1 1 1 0 1 1 0 1 0 2 2 1 1 2 3 1 1 2 2 2 2 3 1 0 41 57 61 69 63 62 53 59 64 43 68 67 58 48 66 55 62 53 69 42 57 49 66 50 60 42 51 55 53 57 58 43 67 63 64 86 79 55 59 76 72 48 49 59 65 69 90 75 61 69 66 56 72 73 52 83 86 65 64 85 89 98 74 94 73 90 91 59 70 76 85 78 73 72 55 72 69 66 92 87 97 78 52 80 90 54 72 44 69 61 71 55 75 50 56 42 55 52 61 62 90 60 79 72 88 80 60 69 58 90 65 53 81 87 62 61 88 56 89 79 85 59 69 54 85 73 64 66 86 66 54 59 53 74 56 68 60 57 45 68 46 65 61 44 59 59 61 42 60 42 BA BA BA BA BA BA BA BA BA BA BA BA BA BA BA BA BA BA BA AN AN AN AN AN AN AN AN AN AN AN AN AN AN AN AN AN AN AN AN AN AN AN AN AN AN AN AN AN AN AN 51 44 49 27 30 47 38 49 45 44 36 31 31 32 38 48 39 47 39 23 34 33 31 25 40 41 42 34 37 24 34 41 47 28 28 46 27 44 47 27 27 21 22 39 26 45 25 25 50 33 2 3 3 2 1 0 2 1 0 2 3 2 3 3 0 1 2 1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 75 53 86 93 75 46 86 76 80 50 66 64 53 87 87 68 93 52 88 54 46 51 59 51 41 44 40 47 
45 47 50 37 43 50 39 51 41 50 48 47 49 59 44 45 43 45 42 45 48 53 73 43 39 58 74 36 70 36 56 43 64 45 72 40 56 71 73 48 52 50 33 38 45 41 61 43 56 69 50 68 63 67 35 62 66 36 35 61 59 56 60 57 65 41 47 39 31 33 64 44 57 60 62 62 70 36 47 66 47 72 62 38 38 35 44 56 53 63 58 41 25 51 35 53 22 44 58 32 56 24 23 29 57 23 31 60 28 40 32 47 58 51 47 25 24 21 32 53 59 21

Case 0.5. Intrusion attack data.

With the growing range and scope of computer networks, their security becomes an issue of urgency. An attack on a network results in its malfunctioning, the simplest form of which is the denial of service. The denial of service is caused by an intruder who makes some computing or memory resource too busy or too full to handle legitimate requests. Also, it can deny access to a machine. Two of the denial-of-service attacks are known as apache2 and smurf. An apache2 intrusion attacks a very popular free software/open source web server, Apache, and results in denying services to a client that sends a request with many http headers. The smurf acts by echoing a victim's mail, via an intermediary that may be the victim itself. The attacking machine may send a single spoofed packet to the broadcast address of some network so that every machine on that network would respond by sending a packet to the victim machine. In fact, the attacker sends a stream of icmp 'ECHO' requests to the broadcast address of many subnets; this results in a stream of 'ECHO' replies that flood the victim. Other types of attack include user-to-root attacks and remote-to-local attacks. Some internet protocols are liable to specific types of attack, as just described above for icmp (Internet Control Message Protocol), which relates to network functioning; other protocols such as tcp (Transmission Control Protocol) or udp (User Datagram Protocol) supplement the conventional ip (Internet Protocol) and may be subject to many other types of intrusion attacks.
A probe intrusion looking for flaws in the networking might precede an attack. A powerful probe software is SAINT, the Security Administrator's Integrated Network Tool, which uses a thorough deterministic protocol to scan various network services. Intrusion detection systems collect information on anomalies and other patterns of communication such as compromised user accounts and unusual login behaviour.

The data set Intrusion consists of a hundred communication packets along with some of their features, sampled from a file publicly available on the web \cite{St00}. The features reflect the packet as well as the activities of its source:

1 - protocol-type, which can be either tcp or icmp or udp (nominal feature),
2 - BySD, the number of data bytes from source to destination,
3 - SHCo, the number of connections to the same host as the current one in the past two seconds,
4 - SSCo, the number of connections to the same service as the current one in the past two seconds,
5 - SEr, the rate of connections (per cent in SHCo) that have SYN errors,
6 - REr, the rate of connections (per cent in SHCo) that have REJ errors,
7 - Attack, the type of attack (apache2, saint, smurf as explained above, and no attack (norm)).

Of the hundred entities in the set, the first 23 have been attacked by apache2, the consecutive packets 24 to 79 are normal, the eleven entities 80 to 90 bear data on a saint probe, and the last ten, 91 to 100, reflect the smurf attack.

Table 0.6. Intrusion data.
Prot tcp tcp tcp tcp tcp tcp tcp tcp tcp tcp tcp tcp BySD 62344 60884 59424 59424 59424 75484 76944 59424 57964 59424 0 0 SHCo 16 17 18 19 20 21 22 23 24 25 40 41 SSCo 16 17 18 19 20 21 22 23 24 25 40 41 SEr 0 0.06 0.06 0.05 0.05 0.05 0.05 0.04 0.04 0.04 1 1 REr 0.94 0.88 0.89 0.89 0.9 0.9 0.91 0.91 0.92 0.92 0 0 Attack apach apach apach apach apach apach apach apach apach apach apach apach Prot tcp tcp tcp udp udp udp udp udp udp udp udp udp BySD 287 308 284 105 105 105 105 105 44 44 42 105 SHCo 14 1 5 2 2 2 2 2 3 6 5 2 SSCo 14 1 5 2 2 2 2 2 8 11 8 2 SEr 0 0 0 0 0 0 0 0 0 0 0 0 REr 0 0 0 0 0 0 0 0 0 0 0 0 Attack norm norm norm norm norm norm norm norm norm norm norm norm 18 tcp tcp tcp tcp tcp tcp tcp tcp tcp tcp tcp tcp tcp tcp tcp tcp tcp tcp tcp tcp tcp tcp tcp tcp tcp tcp tcp tcp tcp tcp tcp tcp tcp tcp Tcp Tcp Tcp Tcp 0 0 0 0 0 0 0 0 0 0 0 258 316 287 380 298 285 284 314 303 325 232 295 293 305 348 309 293 277 296 286 311 305 295 511 239 5 288 42 43 44 45 46 47 48 49 40 41 42 5 13 7 3 2 10 20 8 18 28 1 4 13 1 4 6 8 1 13 3 5 9 11 1 12 1 4 42 43 44 45 46 47 48 49 40 41 42 5 14 7 3 2 10 20 8 18 28 1 4 14 8 4 6 8 8 14 6 5 15 25 4 14 1 4 1 1 1 1 1 1 1 1 0.62 0.63 0.64 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 Aggregating 0 0 0 0 0 0 0 0 0.35 0.34 0.33 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 apach apach apach apach apach apach apach apach apach apach apach norm norm norm norm norm norm norm norm norm norm norm norm norm norm norm norm norm norm norm norm norm norm norm norm norm norm norm udp udp udp udp udp udp udp udp udp udp udp udp udp udp udp udp udp tcp tcp tcp tcp tcp tcp tcp tcp tcp tcp tcp icmp icmp icmp icmp icmp icmp icmp icmp icmp icmp 105 42 105 105 44 105 105 44 105 105 45 45 105 34 105 105 105 0 0 0 0 0 0 0 0 0 0 0 1032 1032 1032 1032 1032 1032 1032 1032 1032 1032 2 2 1 1 2 1 1 3 1 1 3 3 1 5 1 1 1 482 482 482 482 482 482 482 482 482 483 510 509 510 510 511 511 494 509 509 510 511 2 3 1 1 4 1 1 14 1 1 6 6 1 9 1 1 1 1 1 1 1 
1 1 1 1 1 1 1 509 510 510 511 511 494 509 509 510 511 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.05 0.05 0.05 0.05 0.05 0.05 0.06 0.06 0.06 0.06 0.04 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.95 0.95 0.95 0.95 0.95 0.95 0.94 0.94 0.94 0.94 0.96 0 0 0 0 0 0 0 0 0 0 norm norm norm norm norm norm norm norm norm norm norm norm norm norm norm norm norm saint saint saint saint saint saint saint saint saint saint saint smurf smurf smurf smurf smurf smurf smurf smurf smurf smurf

1. Aggregating and visualizing a single feature: 1D analysis

Before considering summarization and correlation problems with multidimensional data, let us take a look at them on the simplest levels possible: one feature for summarization and two features for correlation. This will also provide us with a stock of useful concepts for nD analysis.

1D data is a set of entities represented by one feature, categorical or quantitative. Let us first consider the quantitative case. With N entities numbered i = 1, 2, …, N, the data is a set of numbers $x_1, \dots, x_N$. This set will be denoted $X = \{x_1, \dots, x_N\}$.

Distribution (density, histogram)

The distribution is the most comprehensive, and quite impressive for the eye, way of summarization. On the plane, one draws an x axis and the feature range boundaries, that is, X's minimum a and maximum b. The range interval is then divided into a number of non-overlapping equal-sized sub-intervals, bins. To produce n bins, one needs n-1 dividers at points $a + k(b-a)/n$ (k = 1, 2, …, n-1). In fact, the same formula works for k=0 and k=n, leading to the boundaries a as $x_0$ and b as $x_n$, which is useful for the next operation, counting the number of entities $N_k$ falling in each of the bins k = 1, 2, ..., n. Note that bin k has $a + (k-1)(b-a)/n$ and $a + k(b-a)/n$ as its left and right boundaries, respectively. One of them should be excluded from the bin so that the bins do not overlap. These counts, $N_k$, k = 1, 2, ..., n, constitute the distribution of the feature.
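The counting procedure just described can be sketched in code. The function below is a minimal illustration, not a reference implementation: its name is made up, and it assumes one particular convention for the boundaries, namely that each bin includes its left boundary and excludes its right one, except for the last bin, which also takes the maximum b.

```python
def bin_counts(values, n_bins):
    """Return the counts N_1, ..., N_n for n equal-sized bins over [a, b].

    Assumed convention: a bin contains its left boundary but not its right
    one; the maximum b is assigned to the last bin.
    """
    a, b = min(values), max(values)
    if a == b:
        raise ValueError("all values coincide; the range has zero length")
    width = (b - a) / n_bins
    counts = [0] * n_bins
    for x in values:
        k = int((x - a) / width)   # 0-based bin index
        k = min(k, n_bins - 1)     # put the maximum b into the last bin
        counts[k] += 1
    return counts


sample = [1, 1, 5, 3, 4, 1, 2]     # a small illustrative data set
print(bin_counts(sample, 2))       # counts for two bins
print(bin_counts(sample, 4))       # counts for four bins
```

Whatever the number of bins, the counts sum to N, which is the point of keeping the bins non-overlapping.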
A histogram is a visual representation of the distribution, drawn as a rectangle of height $N_k$ over each bin k, k = 1, 2, ..., n (see Figures 1.2 and 1.3). Note that the distribution depends on the choice of the number of bins.

Q. Why should the bins not overlap?
A. So that each entity falls in only one bin, and the total of all counts $N_k$ remains N.

Figure 1.1. With two bins on the range, the divider is the mid-range (a+b)/2. [Axis with marks at a, (a+b)/2 and b.]

In Figures 1.2 and 1.3, the two most typical types of histograms are presented. The former corresponds to the so-called power, or Pareto, law, approximating the function $p(x) \approx a/x$. This type is frequent in social systems. According to numerous empirical studies, such features as wealth, group size, productivity and the like are all distributed according to a power law, so that very few individual entities create or possess huge amounts of wealth or members, whereas very many individual entities are left with virtually nothing. However, they all are important parts of the same system, with the have-nots creating the environment in which the lucky few can thrive.

Figure 1.2. A power type distribution. [Histogram with bin counts 880, 640, 480, 260, 140 and frequencies 36.7%, 26.7%, 20.0%, 10.8%, 5.8%.]

Another type, which is frequent in physical systems, is presented in Figure 1.3. This type of histogram approximates the so-called normal, or Gaussian, law $p(x) \approx \exp[-(x-a)^2/2\sigma^2]$. Distributions of measurement errors and, in general, of features resulting from many small random effects are thought to be Gaussian, which can be formally proven within the mathematical framework of probability theory. The parameters of this distribution, $a$ and $\sigma$, have natural meanings, with $a$ expressing the expectation, or mean, and $\sigma^2$ the variance, which naturally translate into terms of the empirical distributions, as introduced below.

Figure 1.3. Gaussian type distribution (bell curve). [Histogram with bin counts 700, 550, 500, 350, 300 and frequencies 29.2%, 22.9%, 20.8%, 14.6%, 12.5%.]
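The percentages shown alongside the counts in Figures 1.2 and 1.3 are just the relative frequencies $N_k/N$. The short sketch below recomputes them from the counts; the lists are copied from the figures in the order in which the numbers are printed, which is not necessarily the left-to-right order of the bins, and the function name is chosen only for this example.

```python
def frequencies_percent(counts):
    """Relative frequencies N_k / N, in per cent, rounded to one decimal."""
    total = sum(counts)
    return [round(100 * n_k / total, 1) for n_k in counts]


power_counts = [880, 640, 480, 260, 140]      # counts shown in Figure 1.2
gaussian_counts = [700, 550, 500, 350, 300]   # counts shown in Figure 1.3

print(sum(power_counts), frequencies_percent(power_counts))
print(sum(gaussian_counts), frequencies_percent(gaussian_counts))
```

Both histograms summarize N = 2400 entities, and the recomputed frequencies agree with the percentages printed in the figures.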
Q: Consider the distributions of sizes in the Iris data and of Population and Bank in the Market Town data at different bin numbers. Can you tell for each of them which of the two types it is similar to?

Another popular visualization of distributions is known as a pie-chart, in which the proportions of the counts $N_k$ are expressed by the sizes of sectored slices of a round pie (see Figure 1.4). As one can see, these two types of visualization provide for the perception of two different aspects of the distribution: the histogram shows the actual shape of the distribution along the axis x, whereas the pie-chart caters for the relative sizes of the distribution chunks falling into different bins. There are a dozen more formats for visualizing distributions, such as bubble, doughnut and radar charts, easily available in the Microsoft Excel spreadsheet.

Figure 1.4. Pie-chart for the histogram of Figure 1.3. [Slices labelled 13%, 15%, 21%, 23% and 28%.]

For categorical features, there is no need to define bins: the categories themselves play the role of bins. Assuming that the categories comprise a set V, $N_v$ and $p_v$ are defined as, respectively, the number and the proportion of entities falling in category $v \in V$.

Further aggregates

Further summarization of the data leads to presenting all the variety with just two real numbers, one expressing the distribution's location, its "central" point, whereas the other represents the distribution's variation, the spread. We review some of the most popular characteristics of both.

Center:

C1. Mean of set $x_1, x_2, \dots, x_N$ is defined as the arithmetic average:

$$c = \sum_{i=1}^{N} x_i / N \tag{1.1}$$

Example: For set 1, 1, 5, 3, 4, 1, 2, the mean is (1+1+5+3+4+1+2)/7 = 17/7 = 2.42857…

This is as close an approximation to the numbers as one can get. However, the mean is not stable enough against outliers. This is why it is a good idea to remove a couple of observations on both extremes of the data range, the minimum and maximum, before computing the mean.

Example: There are no outliers in the previous example.
However, if we remove one minimum value and one maximum value from the set, it becomes 1, 3, 4, 1, 2. The mean of the corrected set is (1+3+4+1+2)/5 = 11/5 = 2.2, not a big change.

C2. Median of set $x_1, x_2, \dots, x_N$ is defined as follows. First, pre-process the data by sorting them in either descending or ascending order, so that $x_{s_1} \le x_{s_2} \le \dots \le x_{s_N}$ assuming the ascending order, where $s_i$ is the index of the i-th element of the ordered series. If N is odd, that is, N = 2n+1 for a whole number n, then the median m is equal to the middle element in the order, that is, $m = x_{s_{n+1}}$. Otherwise, N is even, so that N = 2n for some integer n > 0; the median is usually defined in this case as the middle of the interval between the two elements in the middle, $x_{s_n}$ and $x_{s_{n+1}}$, that is, $m = (x_{s_n} + x_{s_{n+1}})/2$.

Example: For the set from the previous example, 1, 1, 5, 3, 4, 1, 2, the sorted version is 1, 1, 1, 2, 3, 4, 5. The median is equal to the element in the middle, which is 2. This is rather far away from the mean, 2.43, which shows that the distribution is biased towards the left end, the smaller values.

The more symmetric a distribution is, the closer its mean and median are. As seems rather obvious, the median is very stable against outliers: the values on the extremes just do not affect the middle of the sorted set.

C3. Midrange mr is the middle of the range, $mr = (\max_i x_i + \min_i x_i)/2$. This corresponds to the mean of a flat distribution, in which all bins are equally likely. In contrast to the mean and median, the midrange depends only on the range, not on the distribution. It is obviously highly sensitive to outliers, that is, to changes of the maximum and/or minimum values of the sample.

Example: In the previous set, $\max_i x_i = 5$ and $\min_i x_i = 1$, which makes mr = (5+1)/2 = 3, the value of the mean in the case when all values in the range are equally likely.

C4. Mode is the most likely bin, which obviously depends on the bin size.
In Figure 1.2, the mode is the bin on the left side; in Figure 1.3, the bin in the middle. This is a rather local characteristic of the distribution, and it is not going to be used often in this text.

Spread:

Each of the characteristics of spread below is, to an extent, parallel to that of the center under the same number.

S1. Standard deviation s

This value, conventionally called standard, is rather unconventional from the intuitive point of view. It is defined as the square root of the variance, $s = \sqrt{s^2}$, where the variance $s^2$ is the mean of the squared deviations of X values from their mean, that is, the sum of square errors $(x_i - c)^2$, i = 1, 2, …, N, divided by N, where c is the mean defined in (1.1):

$$s^2 = \sum_{i=1}^{N} (x_i - c)^2 / N \tag{1.2}$$

The choice of the square root of the average squares for aggregating the deviations of observations from their center comes from the least-squares approach, which will be explained later on pp. ……

In many packages, especially of a statistical flavour, the divisor in (1.2) is taken to be N-1 rather than N. This is because in mathematical statistics the divisor is N only in the case when c is pre-specified or given by an oracle, that is, derived not from the data but from the domain knowledge; for example, in coin throwing it is assumed from symmetry that the mean proportion of tails should be ½. If, otherwise, c is derived from the data, as is the mean value (1.1), then the divisor, according to mathematical statistics, must be N-1, because the equation (1.1) deriving c from the data is a relation imposed on the N observed values, thus decreasing the number of degrees of freedom from N to N-1. We explain the view of mathematical statistics in greater detail later on page ….. .

S2. Absolute deviation sm

The absolute deviation is defined as the mean of the absolute deviations from the center, $|x_i - m|$, where the center in this case is usually taken to be the median m rather than the mean c:

$$s_m = \sum_{i=1}^{N} |x_i - m| / N \tag{1.3}$$

S3. Range r

This is probably the simplest measure, just the length of the interval covered by data X, $r = \max_i x_i - \min_i x_i$. Obviously, it may be rather unstable and biased by outliers. It can be proven that both the standard deviation and the absolute deviation are at most half the range (Mirkin 2005).

S4. Quantile range (histogram's extremes cut out)

This measure is a stable version of the range, which should be used at really large N values. A proportion p, typically within the range of 0.1-25%, is specified. The 2p-quantile range utilises the upper p-quantile, which is a value $x_p$ of X such that the proportion of entities with values larger than $x_p$ is p. The lower p-quantile, ${}_p x$, is defined similarly: the proportion of entities with values less than ${}_p x$ is p. The 2p-quantile range is the interval between the two p-quantiles, stretched according to the proportion of entities taken out, $(x_p - {}_p x)/(1-2p)$. It can be used as a rather stable characteristic of the actual range of X, free of random and wild outliers on both extremes. The value of p should be taken rather small, say 0.05% at N of the order of 100,000: then $x_p$ cuts off the 50 largest and ${}_p x$ the 50 smallest values of X. This measure is still rather unusual, as are data sets of such large sizes.

Parameters of the Gaussian distribution

In classical mathematical statistics, the set $X = \{x_1, x_2, \dots, x_N\}$ is usually considered a random sample from a population defined by a probabilistic distribution with density f(x), in which each element $x_i$ is sampled independently from the others. This involves an assumption that each observation $x_i$ is modelled by the distribution $f(x_i)$, so that the mean's model is the average of the distributions $f(x_i)$. The population analogues to the mean and variance are defined over f(x) so that, obviously, the mean (1.1), the median and the midrange are unbiased estimates of the population mean.
Moreover, the variance of the mean is N times less than the population variance, so that the standard deviation of the mean decreases as $1/\sqrt{N}$ when N grows.

If we further assume that the population's probabilistic distribution is Gaussian $N(\mu, \sigma)$ with the density function

$$f(u, \mu, \sigma) = C \exp\{-(u - \mu)^2 / 2\sigma^2\}, \tag{1.4}$$

then c in (1.1) is an estimate of $\mu$ and s in (1.2) an estimate of $\sigma$ in $f(u, \mu, \sigma)$. These parameters amount to the population analogues of the concepts of mean and variance so that, for example, $\mu = \int u f(u, \mu, \sigma)\,du$, where the integral is taken over the entire axis u. To be unbiased, $s^2$ must have N-1 as its divisor if c in (1.1) stands in place of $\mu$ in the formula for $s^2$.

In some real-life situations, the assumption that X is an independent random sample from the same distribution seems rather adequate. However, in most real-world databases and multivariate samplings this assumption is far from realistic.

Centre and spread in integral perspectives

The concepts of the center and spread can each be formulated within one and the same perspective, such as those of approximation, data recovery and probabilistic statistics, which are presented here in turn.

1. Approximation perspective

Given a series $X = \{x_1, \dots, x_N\}$, define the centre as the value $a$ minimizing the average distance

$$D(X,a) = [d(x_1,a) + d(x_2,a) + \dots + d(x_N,a)]/N \tag{1.5}$$

The following statements can be proven mathematically.

If $d(x,a) = |x-a|^2$ in (1.5), then the solution a is the mean (1.1), and D(X,a) is the variance (1.2).

If $d(x,a) = |x-a|$ in (1.5), then the solution a (centre) is the median, and D(X,a) is the absolute deviation.

If D(X,a) is defined not by the sum but by the maximum distance, $D(X,a) = \max(d(x_1,a), d(x_2,a), \dots, d(x_N,a))$, then the midrange is the solution, for each of the d(x,a) specified, either $|x-a|^2$ or $|x-a|$.

These three properties explain the parallels between the centers C1, C2 and C3 and the corresponding spread evaluations S1, S2 and S3: each of the centers minimizes its corresponding measure of spread.

2. Data recovery perspective

In the data recovery perspective, it is assumed that the observed values are but noisy realisations of an unknown value a. This is reflected in the form of a model like

$$x_i = a + e_i, \quad \text{for all } i = 1, 2, \dots, N \tag{1.6}$$

in which $e_i$ are additive errors to be minimised.

In classical mathematical statistics, errors are usually modelled as random values independently drawn from the same distribution. This suggests a good degree of knowledge and stability of the process. In many real-world applications the model (1.6) itself cannot be considered adequate enough, which also generates errors. When no assumptions on the nature of the errors are made, they are frequently referred to as residuals.

One cannot minimize all the residuals in (1.6) simultaneously. Thus, an integral criterion should be formulated to embrace all the residuals. The most popular are:

(1) Least-squares criterion $L_2 = e_1^2 + e_2^2 + \dots + e_N^2$; its minimisation over unknown a is equivalent to the task of minimizing the average squared distance, thus leading to the mean, optimal a = c.

(2) Least-modules criterion $L_1 = |e_1| + |e_2| + \dots + |e_N|$; its minimisation over unknown a is equivalent to the task of minimizing the average absolute deviation, leading to the median, optimal a = m.

(3) Least-maximum criterion $L_\infty = \max(|e_1|, |e_2|, \dots, |e_N|)$; its minimisation over unknown a is equivalent to the task of minimizing the maximum deviation, leading to the midrange, optimal a = mr.

Formulations (1)-(3) may look like trivial reformulations of the special cases of the approximation criterion (1.5). This, however, is not exactly so. The equation (1.6) allows for a decomposition of the data scatter involving the corresponding data recovery criterion. This is rather straightforward for the least-squares criterion $L_2$, whose minimal value, at a = c, is $L_2 = (x_1-c)^2 + (x_2-c)^2 + \dots + (x_N-c)^2$.
With a little algebra, this becomes

$$L_2 = x_1^2 + x_2^2 + \dots + x_N^2 - 2c(x_1 + x_2 + \dots + x_N) + Nc^2 = x_1^2 + x_2^2 + \dots + x_N^2 - Nc^2 = T(X) - Nc^2,$$

where T(X) is the quadratic data scatter defined as $T(X) = x_1^2 + x_2^2 + \dots + x_N^2$; the middle step uses $x_1 + x_2 + \dots + x_N = Nc$. This leads to the equation $T(X) = Nc^2 + L_2$ decomposing the data scatter into two parts: that explained by the model (1.6), $Nc^2$, and that unexplained, $L_2$. Since the data scatter is constant, minimizing $L_2$ is equivalent to maximizing $Nc^2$. The decomposition of the data scatter allows one to measure the adequacy of model (1.6) by the relative value $L_2/T(X)$. Similar decompositions can be derived for the least-modules criterion $L_1$ and other criteria (see Mirkin 1996).

3. Probabilistic perspective

Consider that the set X is a random independent sample from a population with a probabilistic density function that is, for the sake of simplicity, Gaussian, $f(x) = C\exp\{-(x - \mu)^2 / 2\sigma^2\}$, where $\mu$ and $\sigma^2$ are unknown parameters and $C = (2\pi\sigma^2)^{-1/2}$. The likelihood of randomly getting $x_i$ will then be $C\exp\{-(x_i - \mu)^2 / 2\sigma^2\}$. The likelihood of the entire sample X will be the product of these values, because they have been assumed to be independent of each other, that is,

$$L(X) = \prod_{i \in I} C\exp\{-(x_i - \mu)^2 / 2\sigma^2\} = C^N \exp\{-\sum_{i \in I} (x_i - \mu)^2 / 2\sigma^2\}.$$

One may even go further and express L(X) as

$$L(X) = \exp\{N\ln(C) - \sum_{i \in I} (x_i - \mu)^2 / 2\sigma^2\},$$

where ln is the natural logarithm (over base e). A well-established approach in mathematical statistics, the principle of maximum likelihood, claims that the values of $\mu$ and $\sigma^2$ best fitting the data X are those at which the likelihood L(X) or, equivalently, its logarithm ln(L(X)) reaches its maximum. From the derived formula for L(X), it is easy to see that the maximum of $\ln(L) = N\ln(C) - \sum_{i \in I} (x_i - \mu)^2 / 2\sigma^2$ is reached at the $\mu$ that provides for the minimum of the sum in the exponent, $E = \sum_{i \in I} (x_i - \mu)^2$. This shows that the least-squares criterion follows from the assumption that the sample is randomly drawn from a Gaussian population.
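The statements above lend themselves to a quick numerical check. The sketch below is an illustration only: it uses the small example set 1, 1, 5, 3, 4, 1, 2, compares each centre with a grid of candidate values of a (the grid and all function names are choices made for this example), verifies the scatter decomposition, and confirms that, at a fixed $\sigma^2$, the Gaussian log-likelihood is largest at the mean.

```python
import math
from statistics import mean, median

xs = [1, 1, 5, 3, 4, 1, 2]
N = len(xs)

c = mean(xs)                      # mean, 17/7
m = median(xs)                    # median, 2
mr = (max(xs) + min(xs)) / 2      # midrange, 3.0


def L2(a):
    """Least-squares criterion for a candidate centre a."""
    return sum((x - a) ** 2 for x in xs)


def L1(a):
    """Least-modules criterion for a candidate centre a."""
    return sum(abs(x - a) for x in xs)


def Linf(a):
    """Least-maximum criterion for a candidate centre a."""
    return max(abs(x - a) for x in xs)


def log_likelihood(mu, sigma2):
    """Logarithm of the Gaussian likelihood of the sample xs."""
    return -0.5 * N * math.log(2 * math.pi * sigma2) - L2(mu) / (2 * sigma2)


# Candidate centres 0.00, 0.01, ..., 6.00 (an arbitrary grid for the check).
grid = [k / 100 for k in range(601)]

mean_minimizes_L2 = all(L2(c) <= L2(a) for a in grid)
median_minimizes_L1 = all(L1(m) <= L1(a) for a in grid)
midrange_minimizes_Linf = all(Linf(mr) <= Linf(a) for a in grid)
mean_maximizes_likelihood = all(
    log_likelihood(c, 1.0) >= log_likelihood(a, 1.0) for a in grid
)

T = sum(x * x for x in xs)        # quadratic data scatter T(X)
explained = N * c * c             # part explained by model (1.6)
unexplained = L2(c)               # minimal value of L2

print(mean_minimizes_L2, median_minimizes_L1, midrange_minimizes_Linf)
print(mean_maximizes_likelihood)
print(T, explained + unexplained)
```

A grid search proves nothing, of course; it only illustrates on one data set what the derivations establish in general.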
Likewise, the optimal $\sigma^2$ maximizes the part of ln(L) depending on it, $g(\sigma^2) = -N\ln(\sigma^2)/2 - \sum_{i \in I} (x_i - \mu)^2 / 2\sigma^2$. It is not difficult to prove that the optimal $\sigma^2$ can be found from the first-order optimality condition for $g(\sigma^2)$. Let us take the derivative of the function over $\sigma^2$ and equate it to 0:

$$dg/d(\sigma^2) = -N/(2\sigma^2) + \sum_{i \in I} (x_i - \mu)^2 / 2(\sigma^2)^2 = 0.$$

This equation leads to $\sigma^2 = \sum_{i \in I} (x_i - \mu)^2 / N$, which means that the variance is the maximum likelihood estimate of the parameter $\sigma^2$ in the Gaussian distribution.

In situations in which the data can be plausibly assumed to come randomly from a Gaussian distribution, the derivation above justifies the use of the mean and variance as the only theoretically valid estimates of the data center and spread. The Gaussian distribution has been proven to approximate well those situations in which there are many small independent random effects adding to each other. However, in many cases the assumption of normality is highly unrealistic, which does not necessarily lead to rejection of the concepts of the mean and dispersion: they still may be utilised within the other perspectives above.

Q. Consider a multiplicative model for the error, $x_i = a(1 + e_i)$, assuming that errors are proportional to the values. Can you find or define what centre a fits the data?

A. Consider the least-squares approach. According to this approach, the fit should minimize the sum of the squared errors. Every error can be expressed, from the model, as $e_i = x_i/a - 1 = (x_i - a)/a$. Thus the criterion can be expressed as $L_2 = e_1^2 + e_2^2 + \dots + e_N^2 = (x_1/a - 1)^2 + (x_2/a - 1)^2 + \dots + (x_N/a - 1)^2$. Applying the first-order optimality condition, let us take the derivative of $L_2$ over a and equate it to zero. The derivative is equal to $L_2' = -(2/a^3)\sum_i (x_i - a)x_i$. Assuming the optimal value of a is not zero, the first-order condition can be expressed as $\sum_i (x_i - a)x_i = 0$, that is, $a = \sum_i x_i^2 / \sum_i x_i = (\sum_i x_i^2/N)/(\sum_i x_i/N)$.
The denominator here is but the mean, c, whereas the numerator can be expressed through the variance $s^2$ because of the equation $s^2 = \sum_i x_i^2/N - c^2$, which is not difficult to prove. With a little algebraic manipulation, the least-squares fit can be expressed as $a = (s^2 + c^2)/c = s^2/c + c$. Curiously, the variance-to-mean ratio, equal to $a - c$ according to this derivation, is considered in statistics a good relative estimate of the spread because of different reasoning.

Binary features

A feature admitting only two values, either "Yes" or "No", is conventionally considered Boolean in Computer Science, thus relating to Boolean algebra with its "True" and "False" statement evaluations. In this course, we code these values by the numerals 1, for "Yes", and 0, for "No", and use quantitative operations on them, referring to this type of features as binary ones. Our interpretation of a two-valued categorical feature as a quantitative one stems from the fact that any numerical recoding, 0 to $\alpha$ and 1 to $\beta$, uses just two scaling parameters, which can be associated one-to-one with the conventional quantitative scale transformations, the shift of the origin ($\alpha$) and the rescaling factor ($\beta - \alpha$).

The mean of a 1/0 coded binary feature is the number of ones divided by N, that is, the proportion p of its "Yes" values. The median m is 1 if p > 0.5; m = 0 if p < 0.5, and m = 0.5 when p = 0.5 exactly, which requires N to be even. The midrange is always ½ for a binary feature. The mode is either 1 or 0 depending on whether p > 0.5 or not (same as the median).

To compute the variance of a binary feature, whose mean is c = p, sum up Np items $(1-p)^2$ and N(1-p) items $p^2$ and divide by N, which altogether leads to $s^2 = p(1-p)$. Accordingly, the standard deviation is the square root of p(1-p). Obviously, this is maximum when p = 0.5, that is, when both binary values are equally likely. The range is always 1. The absolute deviation, in the case when p < 0.5 so that the median m = 0, comprises Np items that are 1 and N(1-p) items that are 0, so that $s_m = p$.
When p>0.5, m=1 and the number of unity distances is N(1-p), leading to $s_m = 1-p$. That means that, in general, $s_m = \min(p, 1-p)$, which never exceeds the mean c=p.

There are some probabilistic underpinnings to these. Two models are popular, by Bernoulli and by Laplace. Given p, $0 \le p \le 1$, the Bernoulli model assumes that every $x_i$ is either 1, with probability p, or 0, with probability 1-p. The Laplace model suggests that, among the N binary numerals, random pN are unities and (1-p)N are zeros. Both models yield the same mathematical expectation, p. However, their variances differ: the Bernoulli distribution's variance is p(1-p) whereas the Laplace distribution's variance is p, which is obviously greater for all positive p.

There is a rather natural, though somewhat less recognised, relation between quantitative and binary features: the variance of a quantitative feature never exceeds that of the corresponding binary feature. To explicate this, assume the interval [0,1] to be the range of data $X=\{x_1,\dots,x_N\}$. Assume that the mean c divides the interval in such a way that a proportion p of the data is greater than or equal to c, whereas the proportion of those smaller than c is 1-p. The question then is this: given p, at what distribution of X is the variance or its square root, the standard deviation, maximized?

Let X be any given distribution within the interval [0,1] with its mean at some interior point c. According to the assumption, there are N(1-p) observations between 0 and c. Obviously, the variance can only increase if we move each of these points to the border, 0. Similarly, the variance will only increase if we push each of the Np points between c and 1 into 1. That means that the variance p(1-p) of a binary variable with N(1-p) zero and Np unity values is the maximum at the given p.
We have proven the following: A binary variable whose distribution is (p, 1-p) has the maximum variance, and the maximum standard deviation, among all quantitative variables of the same range with a proportion p of entries above the average.

This implies that no variable over the range [0,1] has its variance greater than the maximum 1/4 reached by a binary variable at p=0.5. The standard deviation of this binary variable is 1/2, which is just half of the range. The binary variables also have the maximum absolute deviation among the variables of the same range, which can be proven similarly.

Categorical features with disjoint categories

Sometimes categories by themselves have no quantitative meaning, so that the only comparison they admit is of being equal or not equal to each other. Moreover, a categorical feature, such as Occupation in Students data or Protocol in Intrusion data, partitions the entity set so that each entity falls in one and only one category. Categorical features of this type are sometimes referred to as nominal.

If a nominal feature has L categories l=1,…,L, its distribution is characterized by the amounts $N_1, N_2, \dots, N_L$ of entities that fall in each of the categories. Because of the partitioning property these numbers sum up to the total number of entities, $N_1 + N_2 + \dots + N_L = N$. The category frequencies, defined as $p_l = N_l/N$ (l=1, 2, …, L), sum up to unity.

Since categories are non-ordered, categorical feature distributions are better visualized by pie-charts than by histograms.

The concepts of centrality, except for the mode, are not applicable to categorical feature distributions. Spread here is also not quite applicable. However, the variation, or diversity, of the distribution $(p_1, p_2, \dots, p_L)$ can be measured. There are two rather popular indexes: the Gini index, or qualitative variance, and entropy.

The Gini index can be introduced as the average error of the proportional prediction rule.
The proportional prediction rule requires predicting each category l, l=1,2,…,L, randomly with the distribution $(p_l)$, so that l is predicted in $Np_l$ cases of N. The average error of predictions of l in this case is equal to $1-p_l$, which makes the index equal to:

$$G = \sum_{l=1}^{L} p_l(1-p_l) = 1 - \sum_{l=1}^{L} p_l^2$$

This is also the summary variance of the L binary variables corresponding to categories l=1, 2, …, L; such a variable answers the question "Does the object fall into category l?"

Entropy is the average value of the quantity of information in each category l as measured by $-\log(p_l)$, thus defined as

$$H = -\sum_{l=1}^{L} p_l \log p_l$$

This is not too far away from the qualitative variance because, for p close to 1, $-\log(p) = (1-p) + o(1-p)$, as is known from calculus (see Figure 1.5). A unifying general formula for the variance of a nominal variable has been suggested:

$$S_\lambda = \Big(1 - \sum_{l=1}^{L} p_l^{\lambda}\Big)/(\lambda - 1)$$

This obviously leads to the qualitative variance at $\lambda=2$, and to entropy at $\lambda$ tending to 1.

Figure 1.5. Graphs of the functions f(p)=1-p, involved in the Gini index (straight line), and information f(p)=-log(p).

Project 1.1. Analysis of a multimodal distribution

Let us take a look at the distributions of OOP and CI marks in the Student data. Assuming that the data file of Table 0.4 is stored as Data\studn.dat, the corresponding MatLab commands can be as follows:
>> a=load('Data\studn.dat');
>> oop=a(:,7); %column of OOP mark
>> coi=a(:,8); %column of CI mark
>> subplot(1,2,1); hist(oop);
>> subplot(1,2,2); hist(coi);
With the ten bins used in MatLab by default, the histograms are shown in Figure 1.6.

Figure 1.6. Histograms of the distributions of marks for OOP (on the left) and for CI (on the right) from Students data.

The histogram on the left appears to have three humps, that is, it is three-modal. Typically, a homogeneous sample should have a uni-modal distribution, to allow interpretation of the feature as its modal value with random deviations from it.
The fact that there are three modes on the OOP mark histogram requires an explanation. For example, one may hypothesize that the modes can be explained by the presence of three different occupations of students in the data, so that the IT occupation should lead to higher marks than the BA occupation, for which marks should still be higher than those at the AN occupation.

To test this hypothesis, one needs to compare the distributions of OOP marks for each of the occupations. To make the distributions comparable, we need to specify an array with the boundaries between 10 bins that can be used for each of the samples. This array, b, can be computed as follows:
>> r=max(oop)-min(oop);for i=1:11;b(i)=min(oop)+(i-1)*r/10;end;
Now we are ready to produce comparable distributions for each of the occupations with the MatLab command histc:
>> for ii=1:3;li=find(a(:,ii)==1);hp(:,ii)=histc(oop(li),b);end;
This generates a list, li, of student indexes corresponding to each of the three occupations presented by the three binary columns, ii=1:3. Matrix hp represents the three distributions in its three columns. Obviously, the total distribution of OOP, presented on the left of Figure 1.6, is the sum of these three columns. To visualise the distributions, one may use the "bar" command in MatLab:
>> bar(hp);
which produces bar histograms for each of the three occupations (see Figure 1.7). One can see that the histograms do differ and concur with the hypothesis: IT concentrates in the top seven bins and shares the top three bins with no other occupation. The other two occupations overlap more, though AN still takes over on the leftmost positions, those of the worst marks.

Q. What would happen if array b is not specified once for all but the histogram is drawn by default for each of the sub-samples?
A. The 10 default bins depend on the data range, which may be different at different sub-samples; if so, the histograms will be incomparable.

Figure 1.7.
Histograms of OOP marks for each of three occupations, IT, BA and AN, each presented with bars filled in according to the legend.

There can be other hypotheses as well, such as that the modes come from different age groups. To test that, one should define the age group boundaries first.

Project 1.2 Data mining with a confidence interval: Bootstrap

The data file short.dat is a 50 x 3 array whose columns are samples of three data types described in Table 1.1:

Data type    Mean      Standard deviation
                       Real value    Per cent of mean, %
Normal       10.27     1.76          17.18
Two-modal    16.92     4.97          29.38
Power law    289.74    914.50        315.63

Table 1.1. Aggregate characteristics of the columns of the short.dat array

The normal data is in fact a sample from a Gaussian N(10,2) that has 10 as its mean and 2 as its standard deviation. The other two are Two-modal and Power law samples. Their 30-bin histograms are on the left-hand sides of Figures 1.8, 1.9, and 1.10. Even with the aggregate data in Table 1.1 one can see that the average of Power law does not make much sense, because its standard deviation is more than three times greater than the average.

Many statisticians would question the validity of the characteristics in Table 1.1 not because of the distribution shapes, which would be a justifiable source of concern for at least two of the three distributions, but because of the insufficiency of the samples. Are the 50 entities available indeed a good representation of the entire population? To address these concerns, Mathematical Statistics has worked out principles based on the assumption that the sampled entities come randomly and independently from a possibly unknown, but stationary, probabilistic distribution. The mathematical reasoning then allows one, in reasonably well-defined situations, to arrive at a theoretical distribution of an aggregate index such as the mean, so that the distribution may lead to some confidence boundaries for the index.
Typically, one would obtain the boundaries of an interval in which 95% of the population falls, according to the derived distribution. For instance, when the distribution is normal, the 95% confidence interval is defined by its mean plus/minus 1.96 times the standard deviation. Thus, for the first column data, the theoretically derived 95% confidence interval will be $10 \pm 1.96 \cdot 2 = 10 \pm 3.92$, that is, (6.08, 13.92) (if the true parameters of the distribution are known), or $10.27 \pm 1.96 \cdot 1.76 = 10.27 \pm 3.45$, that is, (6.82, 13.72) (at the observed parameters in Table 1.1). The difference is negligible, especially if one takes into account that the 95% confidence is a very much arbitrary notion. In probabilistic statistics, the so-called Student's distribution is used to make up for the fact that the sample-estimated standard deviation value is used instead of the exact one, but that distribution differs little from the Gaussian distribution when the data contain more than several hundred entities.

In most real-life applications the shape of the underlying distribution is unknown and, moreover, the distribution is not stationary. The theoretically defined confidence boundaries are of little value then. This is why a question arises whether any confidence boundaries can be derived computationally by re-sampling the data at hand rather than by imposing some debatable assumptions. Several approaches to the computational validation of sample-based results have been developed. One of the most popular is bootstrapping, which will be used here in its basic, "non-parametric and pivotal format" (as defined in Carpenter and Bithell 2000).

Bootstrapping is based on a pre-specified number, say 1000, of random trials. A trial involves randomly drawing N entities, with replacement, from the entity set. Note that N is the size of the entity set. Since re-sampling goes with replacement, some entities may be drawn two or more times so that some others are bound to be left behind.
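The trial procedure just described can be sketched in MatLab as follows. This is only an illustration under stated assumptions: the sample x is synthetic (50 values generated from N(10,2) as a stand-in for the Normal column of short.dat, which is not reproduced here), all variable names (x, T, bm, frac, idx, ci) are hypothetical, and the boundaries at the end use the simple "mean plus/minus 1.96 standard deviations" rule applied to the bootstrap values of the mean.

```matlab
% Synthetic stand-in for the Normal column of short.dat (assumption)
N = 50;
x = 10 + 2*randn(N,1);

T = 1000;               % pre-specified number of bootstrap trials
bm = zeros(T,1);        % bootstrap values of the mean, one per trial
frac = zeros(T,1);      % share of distinct entities drawn in each trial
for t = 1:T
    idx = randi(N, N, 1);               % N entity indices drawn with replacement
    bm(t) = mean(x(idx));               % the method under validation: the mean
    frac(t) = length(unique(idx))/N;    % proportion of entities that got selected
end

mb = mean(bm);  sb = std(bm);           % aggregate characteristics over the trials
ci = [mb - 1.96*sb, mb + 1.96*sb];      % 95% boundaries for the mean (assumed rule)
fprintf('bootstrap mean %.2f, std %.2f, interval (%.2f, %.2f)\n', mb, sb, ci(1), ci(2));
fprintf('average share of distinct entities per trial: %.3f\n', mean(frac));
```

The vector bm can then be passed to hist to display the distribution of the bootstrap values of the mean.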
Recalling that e=2.7182818… is the natural logarithm base, it is not difficult to see that, on average, only approximately (e-1)/e=63.2% of the entities get selected into a trial sample. Indeed, at each random drawing of an entity from a set of N, the probability of a given entity not being drawn is 1-1/N, so that the approximate proportion of entities never selected in N draws is $(1-1/N)^N \approx 1/e = 1/2.71828 \approx 36.8\%$ of the total number of entities. For instance, in a bootstrap trial of 15 entities, the following numbers have been drawn: 8, 11, 7, 5, 3, 3, 11, 5, 9, 3, 11, 6, 13, 13, 9, so that seven entities have been left out of the trial while several others are present in multiple copies.

Figure 1.8. The histograms of a 50 strong sample from a Gaussian distribution (on the left) and its mean's bootstrap values (on the right): all falling between 9.7 and 10.1.

A trial set of a thousand randomly drawn entity indices (some of them, as explained, coincide) is assigned the corresponding row data values from the original data table, so that coinciding entities get identical rows. Then the method under consideration, currently "computing the mean", is applied to this trial data to produce the trial result. After a number of trials, the user gets enough results to represent them with a histogram and derive confidence boundaries for the mean's estimate.

The bootstrap distributions, after 1000 trials, are presented on the right-hand sides of Figures 1.8, 1.9 and 1.10. We can see very clearly that the estimate in the case of Gaussian data, Figure 1.8, is more precise: all 100% of the bootstrap mean values fall in the interval between 9.47 and 11.12, which is a much more precise estimate of the mean than in the original distribution, both in terms of the interval boundaries and confidence. There is theoretical evidence, presented by E.
Bradley (1993), supporting the view that the bootstrap can produce somewhat tighter confidence boundaries for the sample's mean than the theoretical analysis based on the original sample. In our case, we can see (Table 1.2) that indeed, with the means almost unvaried, the standard deviations have been drastically reduced.

Data type    Mean      Standard deviation
                       Value     Per cent of mean, %
Normal       10.27     0.25      2.46
Two-mode     16.94     0.69      4.05
Power law    287.54    124.38    43.26

Table 1.2. Aggregate characteristics of the results of 1000 bootstrap trials over the short.dat array.

Figure 1.9. The histograms of a 50 strong sample from a Two-mode distribution (on the left) and its mean's bootstrap values (on the right).

Unfortunately, the bootstrap results are not that helpful in analysing the other two distributions: as can be seen in our example, it shows rather decent boundaries for both of the means, the Two-modal and Power law ones, while, in many applications, the mean of either of these two distributions may be considered meaningless. It is a matter of applying other data analysis methods, such as clustering, to produce more homogeneous sub-samples whose distributions would be more similar to that of a Gaussian.

Figure 1.10. The histograms of a 1000 strong sample from a Power law distribution (on the left) and its mean's bootstrap values (on the right): all falling between 260 and 560.

Project 1.3 K-fold cross validation

Another set of validation techniques utilises randomly splitting the entity set in two parts of pre-specified sizes, the so-called training and testing parts, so that the method's results obtained for the training part are compared with the data on the testing part. To guarantee that each of the entities gets into a training/testing sample an equal number of times, the so-called cross-validation methods have been developed.

The so-called K-fold cross validation works as follows. Randomly split the entity set in K parts Q(k), k=1,…,K, of equal sizes (see footnote 1).
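One way to obtain such a random split is sketched below in MatLab: a random permutation of the entity indices is cut into K chunks of equal size. The names N, K, perm, Q, test and train are hypothetical, the sizes are purely illustrative, and N is assumed to be divisible by K.

```matlab
N = 1000;  K = 10;             % illustrative sizes (assumption): N divisible by K
perm = randperm(N);            % random permutation of the entity indices 1..N
Q = reshape(perm, N/K, K);     % column k holds the indices of part Q(k)

k = 3;                         % any part can play the role of the test set
test  = Q(:,k);                % indices of the test set
train = Q(:,[1:k-1, k+1:K]);   % the remaining K-1 parts
train = train(:);              % indices of the train set, as one column
```

Storing the parts as columns of one matrix keeps the later loop over k=1:K simple, at the price of requiring equal part sizes.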
Typically, K is taken as 2 or 5 or 10. In a loop over k, each part Q(k) is taken as the test set while the rest is the train set. The data analysis method under consideration is run over the train set ("training phase") with its result applied to the test set. The average score over all the test sets constitutes a K-fold cross-validation estimate of the method's quality.

The case when K is equal to the number of entities N is especially popular. It was introduced under the term "jack-knife", but currently the term "leave-one-out" is used as better reflecting the method: N trials are run over the entire set, each with just one entity removed from the training.

Let us apply the 10-fold cross-validation method to the problem of evaluation of the means of the three data sets. First, let us create a partition of our 1000 strong entity set in 10 non-overlapping classes, a hundred entities each, by randomly assigning entities to the partition classes. This can be done by randomly putting entities one by one in each of the 10 initially empty buckets. Or, one can take a random permutation of the entity indices and then divide the permuted series in 10 chunks, 100 strong each. For each class Q(k) of the 10 classes (k=1,2,…,10), we calculate the averages of the variables on the complementary 900 strong entity set, and use these averages, not the averages of class Q(k), for calculating the quadratic deviations from them on the class Q(k). In this way, we test the averages found on the complementary training set.

Data type    Standard deviation
             On set     10-fold cr.-val.
Normal       1.94       1.94
Two-modal    5.27       5.27
Power law    1744.31    1649.98

Table 1.3. Quadratic deviations from the means computed on the entity set as is and by using 10-fold cross validation.

The results are presented in Table 1.3. The values found at the original distribution and with 10-fold cross validation are similar. Does this mean that there is no need in applying the method?
No, for more complex data analysis methods, results may indeed differ. Also, whereas the ten quadratic deviations calculated on the ten test sets for the Gaussian and Two-modal data are very similar to each other, those at the Power law data set differ drastically, ranging from 391.60 to 2471.03.

Footnote 1. To do this, one may start from all sets Q(k) being empty and repeatedly run a loop over k=1:K in such a way that at each step, a random entity is drawn from the entity set (with no replacement!) and put into the current Q(k); the process halts when no entities remain out of Q(k).

1.4. Modelling uncertainty: Intervals and fuzzy sets

Intervals and fuzzy sets are used to reflect uncertainty in data. When dealing with complex systems, the feature value cannot be determined precisely, even for such a relatively stable and homogeneous dimension as the population size of a country. The so-called "linguistic variables" (Zadeh 1970) express categories or concepts in terms of some quantitative measures, such as the concept of "normal temperature" or "normal weight of an individual". The latter can be expressed with the Body Mass Index BMI (the ratio of the weight, in kg, to the height, in metres, squared) as the BMI interval [20, 25]; those with BMI > 25 are considered overweight (those with BMI > 30 are officially recognized as obese) and those with BMI < 20 underweight. In this example, the natural boundaries of a category are expressed as an interval.

Figure 1.11. A trapezoidal membership function expressing the concept of normal body mass index; a positive degree of membership is assigned to each point within the interval [18, 27] and, moreover, those between 22 and 24 certainly belong to the set.
A more flexible description can be achieved with the so-called fuzzy set A expressed by the membership function $\mu_A(x)$ defined, on the example of Figure 1.11, as:

$$\mu_A(x) = \begin{cases} 0 & \text{if } x \le 18 \text{ or } x \ge 27 \\ 0.25x - 4.5 & \text{if } 18 \le x \le 22 \\ 1 & \text{if } 22 \le x \le 24 \\ -x/3 + 9 & \text{if } 24 \le x \le 27 \end{cases}$$

This function says that the normal weight does not occur outside of the BMI interval [18, 27]. Moreover, the concept applies in full, with the membership 1, only within the BMI interval [22, 24]. There are "grey" areas expressed with the slopes on the left and the right so that, say, a person with BMI=20 will have the membership value $\mu_A(20) = 0.25 \cdot 20 - 4.5 = 0.5$, and the membership of one with BMI=26.1 will be $\mu_A(26.1) = -26.1/3 + 9 = -8.7 + 9 = 0.3$.

In fact, a membership function may have any shape; the only requirement is that there must be at least one interval (or a point) at which the function reaches the value 1, which is its maximum value. A fuzzy set formed with straight lines as on Figure 1.11 is referred to as a trapezoidal fuzzy set. Such a set can be represented by four points on the axis x, (a,b,c,d), such that $\mu_A(x) = 0$ outside the outer interval [a,d] and $\mu_A(x) = 1$ inside the inner interval [b,c], with the straight lines connecting points (a,0) and (b,1) as well as (c,1) and (d,0) (see Figure 1.11).

Figure 1.12. A triangular fuzzy set for the normal weight BMI.

An interval (a, b) can be equivalently represented by a trapezoidal fuzzy set (a, a, b, b) in which all points of (a, b) have their membership value equal to 1.

The so-called triangular fuzzy sets are also popular. A triangular fuzzy set A is represented by an ordered triplet (a,b,c) so that $\mu_A(x) = 0$ outside the interval [a,c] and $\mu_A(x) = 1$ only at x=b, with the values of $\mu_A(x)$ in between represented by the straight lines between points (a,0) and (b,1) and between (c,0) and (b,1) on the Cartesian plane, see Figure 1.12.
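The trapezoidal and triangular membership functions described above can be sketched in MatLab as two anonymous functions. This is an illustration only: the names trapmf4 and trimf3 are hypothetical (they are not library functions), and the sketch assumes a<b and c<d so that neither slope is vertical. The triplet (18, 22, 27) for the triangular set is an assumption read off the axis labels of Figure 1.12.

```matlab
% Trapezoidal membership for the quadruple (a,b,c,d), assuming a<b<=c<d:
% the rising slope, the plateau at 1 and the falling slope, cut off at 0
trapmf4 = @(x,a,b,c,d) max(0, min(min((x-a)/(b-a), 1), (d-x)/(d-c)));

% Triangular set (a,b,c) as the special case whose core is the single point b
trimf3 = @(x,a,b,c) trapmf4(x, a, b, b, c);

mu20  = trapmf4(20,   18, 22, 24, 27);   % the BMI=20 case from the text
mu261 = trapmf4(26.1, 18, 22, 24, 27);   % the BMI=26.1 case from the text
mu23  = trapmf4(23,   18, 22, 24, 27);   % a point inside the core [22, 24]
mt    = trimf3([18 20 22 27], 18, 22, 27);   % several points of the triangular set
```

Both functions accept a vector x, so a call such as plot(17:0.1:28, trapmf4(17:0.1:28,18,22,24,27)) would draw the shape of Figure 1.11.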
Fuzzy sets presented on Figures 1.11 and 1.12 are not equal to each other: only those fuzzy sets A and B are equal for which $\mu_A(x) = \mu_B(x)$ for every x, not just outside of the base interval.

A fuzzy set should not be confused with a probabilistic distribution such as a histogram: there is neither a probabilistic mechanism nor frequencies behind a membership function, just an expression of the extent to which a concept is applicable.

A conventional, crisp set S can be specified as a fuzzy set whose membership function admits only values 0 or 1 and never those between; thus, $\mu_S(x) = 1$ if $x \in S$ and $\mu_S(x) = 0$ otherwise.

There are a number of specific operations with fuzzy sets imitating those with the "crisp" sets, first of all, set-theoretic complement, union and intersection. The complement of a fuzzy set A is the fuzzy set B such that $\mu_B(x) = 1 - \mu_A(x)$. The union of two fuzzy sets, A and B, is a fuzzy set denoted by $A \cup B$ whose membership function is defined as $\mu_{A\cup B}(x) = \max(\mu_A(x), \mu_B(x))$. Similarly, the intersection of two fuzzy sets, A and B, is a fuzzy set denoted by $A \cap B$ whose membership function is defined as $\mu_{A\cap B}(x) = \min(\mu_A(x), \mu_B(x))$. It is easy to prove that these operations are indeed equivalent to the corresponding set-theoretic operations when performed over crisp membership functions.

Questions:
1. Draw the membership function of fuzzy set A on Figure 1.11.
2. What is the union of the fuzzy sets presented in Figures 1.11 and 1.12?
3. What is the intersection of the fuzzy sets presented in Figures 1.11 and 1.12?
4. Draw the membership function of the union of two triangular fuzzy sets represented by triplets (2,4,6), for A, and (3,5,7), for B. What is the membership function of their intersection?
5. What type of a function is the membership function of the intersection of two triangular fuzzy sets? Of two trapezoidal fuzzy sets? Does it always represent a fuzzy set?

Central fuzzy set

The conventional centre and spread concepts can be extended to intervals and fuzzy sets.
Let us consider an extension of the concept of average to triangular fuzzy sets using the least-squares data recovery approach. Given a set of triangular fuzzy sets $A_1, A_2, \dots, A_N$, the central triangular set A can be defined by a triplet (a, b, c) that approximates the triplets $(a_i, b_i, c_i)$, i = 1, 2, …, N. The central triplet can be defined by the condition that it minimises the average squared difference, $L(a,b,c) = (\sum_i (a_i-a)^2 + \sum_i (b_i-b)^2 + \sum_i (c_i-c)^2)/(3N)$. Since the criterion L is additive over the triplet's elements, the optimal solution is analogous to that in the conventional case: the optimal a is the mean of $a_1, a_2, \dots, a_N$, and the optimal b and c are the means of $b_i$ and $c_i$, respectively.

Q. Prove that the average of $a_i$ indeed minimizes L.
A. Let us take the derivative of L over a: $\partial L/\partial a = -2\sum_i (a_i-a)/(3N)$. The first-order optimality condition, $\partial L/\partial a = 0$, has the average as its solution.

Q. Explore the concepts of central trapezoidal fuzzy set and central interval in an analogous way.

Questions
1. What is the bin size in the example of Figure 1.13?

Figure 1.13. Range [2,12], with a=2 and b=12, divided in five bins.

2. Correlation and visualization in 2D

Two features can be of interest if there is an assumption or hypothesis or just a gut feeling that they are related in such a way that certain changes in one of them tend to co-occur with some changes in the other. Then the relation, if proven to exist, can be used in various ways, of which two are typically discernible: (i) prediction of the values of one variable from those of the other, and (ii) adding the relation to the knowledge of the domain by interpreting and explaining it in terms of the existing knowledge. Goal (ii) is treated in the discipline of knowledge bases as part of the so-called inferential approach, in which all relations are assumed to have been expressed as logical predicates and treated accordingly; this will not be described here.
We concentrate on the so-called inductive approach, related to a less formal analysis of what type of information the data can provide with respect to goals (i) and (ii). Typically, the feature whose values are predicted is referred to as the target variable and the other as the input variable. Examples of goal (i) are: prediction of an intrusion attack of a certain type (Intrusion data), prediction of an exam mark (Student data), or prediction of the number of Primary schools in a town whose population is known (Market town data). One may ask: why bother, since all the numbers are already in the file? Indeed, they are. But in the prediction problem, the data are just a small sample of observations and serve only as a training ground for devising a decision rule for predicting the behaviour of other, yet unobserved, entities. As to goal (ii), the data are just idle empirical sparkles, not necessarily noticeable unless they are shaped into a decision rule.

The mathematical structure of the problem differs depending on the type of feature scales involved, which leads us to considering three possible cases: (1) both features are quantitative, (2) the target feature is quantitative and the input feature categorical, and (3) both features are categorical. We leave out the case when the target feature is categorical and the input feature is quantitative, because nothing specific to the task has been developed for this case so far.

2.1. Both features are quantitative

In the situation in which both features are quantitative, the three following concepts are popular: scatter plot, regression, and correlation.

A scatter plot is a presentation of entities as 2D points in the plane of two pre-specified features. On the left-hand side of Figure 2.1, a scatter-plot of Market towns over the features PopResident (axis x) and PSchools (axis y) is presented.

Figure 2.1. Scatter plot of PopRes versus PSchools in Market town data. The right-hand graph includes a regression line of PSchools over PopRes.
If one can think that these two features are related by a linear equation y=ax+b, where a and b are some constant coefficients (parameters), then these parameters a and b, referred to as the slope and intercept, respectively, can be found by minimizing the inconsistencies of the equation on the 45 towns in the data set. Indeed, it sounds rather unlikely that by adjusting just the two parameters we could make each of the towns satisfy the equation. Why would one need that? For the purposes of description and prediction. The prediction goal is obviously related to other towns that are not in Table 0.3: given PopRes (x) at a town, predict its PSchools (y). It would be useful if we could not only predict but also evaluate the reliability of the prediction.

Let us consider this as a general problem: present the correlation between y and x, using their values at a number N of entities $(x_1,y_1), (x_2,y_2), \dots, (x_N,y_N)$, in the form of the equation

$$y = a x + b \qquad (2.1)$$

Obviously, on the entities i=1,2,…,N equation (2.1) will have some errors, so that it can be rewritten as

$$y_i = a x_i + b + e_i, \quad i=1,2,\dots,N \qquad (2.2)$$

where $e_i$ are referred to as errors or residuals. Then the problem is of determining the two parameters, a and b, in such a way that the residuals are least-squares minimized, that is, the summary squared error

$$L(a,b) = \sum_i e_i^2 = \sum_i (y_i - a x_i - b)^2 \qquad (2.3)$$

reaches its minimum over all possible a and b. This minimization problem is easy to solve with elementary calculus tools. Indeed, L(a,b) is a "bottom down" (convex) parabolic function of a and b.
Therefore, its minimum corresponds to the point at which both partial derivatives of L(a,b) are zero (the first-order optimality condition): $\partial L/\partial a = 0$ and $\partial L/\partial b = 0$.

Leaving the finding of the derivatives to the reader as an exercise, let us focus on the unique solution, a in (2.4) and b in (2.6):

$$a = \rho\,\sigma(y)/\sigma(x) \qquad (2.4)$$

where

$$\rho = \Big[\sum_i (x_i - m_x)(y_i - m_y)\Big] / [N\sigma(x)\sigma(y)] \qquad (2.5)$$

is the so-called correlation coefficient and $m_x$, $m_y$ are the means of $x_i$, $y_i$, respectively;

$$b = m_y - a\,m_x \qquad (2.6)$$

By substituting these optimal a and b into (2.3), one can express the minimum criterion value as

$$L_m(a,b) = N\sigma^2(y)(1-\rho^2) \qquad (2.7)$$

It should be noticed that equation (2.1) is referred to as the linear regression of y over x, the index $\rho$ in (2.4) and (2.5) as the correlation coefficient, its square $\rho^2$ in (2.7) as the determination coefficient, and the minimum criterion value $L_m$ in (2.7) as the unexplained variance.

Correlation coefficient and its properties

The meaning of the coefficients of correlation and determination is provided by equations (2.3)-(2.7). Specifically,

* Determination coefficient $\rho^2$ is the proportion by which the variance of y decreases after its linear correlation with x has been taken into account (from (2.7)).

* Correlation coefficient $\rho$ ranges between -1 and 1, because $\rho^2$ is between 0 and 1, as follows from the fact that $L_m \ge 0$ in (2.7) because it is a sum of squares, see (2.3). The closer $\rho$ is to either 1 or -1, the smaller are the residuals in the regression equation. For example, $\rho=0.9$ implies that y's unexplained variance $L_m$ is $1-\rho^2$ = 19% of the original value.

* The slope a is proportional to $\rho$ according to (2.4); a is positive or negative depending on the sign of $\rho$. If $\rho=0$, the slope is 0: y and x are then referred to as not correlated. Being not correlated does not mean "no relation"; it means just "no linear relation" between them, though another functional relation, such as a quadratic one, may exist, as shown on Figure 2.2.

Figure 2.2.
Three scatter-plots corresponding to a zero or almost zero correlation coefficient $\rho$; the case on the left: no correlation between x and y; the case in the middle: a non-random quadratic relation $y=(x-2)^2+5$; the case on the right: two symmetric linear relations, y=2x-5 and y=-2x+3, each holding at a half of the entities.

* The correlation coefficient $\rho$ does not change under shifting and rescaling of x and/or y, which can be seen from equation (2.5). Its formula (2.5) becomes especially simple if the so-called z-normalisation has been applied to both x and y. To z-normalize a feature, its mean m is subtracted from all the values and the results are divided by the standard deviation $\sigma$:

$$x'_i = (x_i - m_x)/\sigma(x) \quad \text{and} \quad y'_i = (y_i - m_y)/\sigma(y), \quad i=1,2,\dots,N$$

Then formula (2.5) can be rewritten as

$$\rho = \sum_i x'_i y'_i / N = (x', y')/N \qquad (2.5')$$

where (x',y') denotes the inner product of vectors $x'=(x'_i)$ and $y'=(y'_i)$.

* One of the fundamental discoveries by K. Pearson was an interpretation of the correlation coefficient in terms of a bivariate Gaussian distribution. A generic formula for the density function of this distribution, in the case in which features are pre-processed using the z-normalization described above, is

$$f(u, \Sigma) = C \exp\{-u^T \Sigma^{-1} u/2\} \qquad (2.8)$$

where $u = (x, y)^T$ is the two-dimensional vector of random values of the two variables x and y under consideration and $\Sigma$ is the so-called correlation matrix

$$\Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$$

In this formula $\rho$ is a parameter with a very clear geometric meaning. Consider, on the Cartesian (x,y) plane, sets of points making the function $f(u, \Sigma)$ in (2.8) constant. Such a set makes $u^T \Sigma^{-1} u$ constant too. That means that a constant density set of points (x, y) must satisfy the equation $x^2 - 2\rho xy + y^2 = \text{const}$. This defines a well-known quadratic curve, the ellipse.
At $\rho=0$ it becomes the equation of a circle, $x^2 + y^2 = \text{const}$, and the more $\rho$ differs from 0, the more skewed the ellipse is, so that at $\rho = \pm 1$ the ellipse becomes a bisector line $y = \pm x + b$, because the left part of the equation makes a full square, $x^2 \mp 2xy + y^2 = \text{const}$, that is, $(y \mp x)^2 = \text{const}$. The size of the ellipse is proportional to the constant: the greater the constant, the greater the size.

A striking fact is that the correlation coefficient (2.5) is a sample-based estimate of the parameter $\rho$ in the Gaussian density function (2.8) under the conventional assumption that the sample points $(y_i, x_i)$ are drawn from a Gaussian population randomly and independently.

This fact is the base of a long-standing controversy. Some say that the usage of the correlation coefficient is justified only when one is sure that the sample is taken randomly and independently from a Gaussian distribution. This seems somewhat unfounded. True, the usage of the coefficient for estimating the density function is justified only when the function is Gaussian. However, when trying to linearly represent one variable through the other, the coefficient has a very different meaning that has nothing to do with Gaussian distributions, as expressed above with equations (2.4)-(2.7).

Q. Find the derivatives of L over a and b and solve the first-order optimality conditions.
Q. Derive the optimal value of L in (2.7) for the optimal a and b.
Q. Prove, or find in the literature, that the linear equation indeed corresponds to a straight line of which a is the slope and b the intercept.

Project 2.1. 2D analysis, linear regression and bootstrapping

Let us take the Students data table as a 100 x 8 array a in MatLab, pick any two features of interest and plot entities as points on the Cartesian plane formed by the features.
For instance, take Age as x and Computational Intelligence mark as y:

>> x=a(:,4); % Age is 4-th column of array "a"
>> y=a(:,8); % CI score is in 8-th column of "a"

Then student 1 (first row) will be presented by the point with coordinates x=28 and y=90 corresponding to the student's age and CI mark, respectively. To plot them all, use command:

>> plot(x,y,'k.') % k refers to black colour, "." to dot graphics; 'mp' stands for magenta pentagram; see others by using "help plot"

Unfortunately, this gives a very tight presentation: some points are on the borders of the drawing. To stretch the borders out, one needs to change the axis:

>> d=axis; axis(1.2*d-10);

This transformation is presented in Figure 2.3 on the right. To present both plots on the same figure, use the "subplot" command of MatLab:

>> subplot(1,2,1)
>> plot(x,y,'k.');
>> subplot(1,2,2)
>> plot(x,y,'k.');
>> d=axis; axis(1.2*d-10);

Command subplot(1,2,1) creates one row consisting of two windows for plots and puts the follow-up plot into the 1st window (that on the left).

Figure 2.3: Scatter plot of features "Age" and "CI score"; the display on the right is a rescaled version of that on the left.

Whichever presentation is taken, no regularity can be seen in Figure 2.3 at all. Let us check whether anything better can be seen for different occupations. To do this, one needs to handle the entity sets for each occupation separately:

>> o1=find(a(:,1)==1); % set of indices for IT
>> o2=find(a(:,2)==1); % set of indices for BA
>> o3=find(a(:,3)==1); % set of indices for AN
>> x1=x(o1);y1=y(o1); % the features x and y at IT students
>> x2=x(o2);y2=y(o2); % the features at BA students
>> x3=x(o3);y3=y(o3); % the features at AN students

Now we are in a position to plot, first, all the three together, and then each of these three separately (again with the command "subplot", but this time with four windows organized in a two-by-two format, see Figure 2.4).
>> subplot(2,2,1); plot(x1,y1,'*b',x2,y2,'pm',x3,y3,'.k'); % all the three plotted
>> d=axis; axis(1.2*d-10);
>> subplot(2,2,2); plot(x1,y1,'*b'); % IT plotted with blue stars
>> d=axis; axis(1.2*d-10);
>> subplot(2,2,3); plot(x2,y2,'pm'); % BA plotted with magenta pentagrams
>> d=axis; axis(1.2*d-10);
>> subplot(2,2,4); plot(x3,y3,'.k'); % AN plotted with black dots
>> d=axis; axis(1.2*d-10);

Of the three occupation groups, some potential relation can be seen only in the AN group: it is likely that "the greater the age the lower the mark" regularity holds in this group (black dots in the bottom right of Figure 2.4). To check this, let us utilise the linear regression.

Figure 2.4. Joint and individual displays of the scatter-plots for the occupation categories (IT stars, BA pentagrams, AN dots).

Regression is a technique invented by F. Galton and K. Pearson to explicate the correlation between x and y as a linear function (that is, a straight line on the plot),

y = slope*x + intercept

where slope and intercept are constants, the former expressing the change in y when x is increased by 1 and the latter the level of y at x=0. The best possible values of slope and intercept (that is, those minimising the average squared difference between the real y's and those found as slope*x+intercept) are expressed in MatLab, according to formulas (2.4)-(2.6), as follows:

>> slope = rho*std(y)/std(x); intercept = mean(y) - slope*mean(x);

Here "rho" is the Pearson correlation coefficient between x and y (2.5) that can be determined with the MatLab operation "corrcoef".
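The slope and intercept formulas above can be tried out on a small example. The sketch below uses hypothetical values for x and y (they are not taken from the Students table) and cross-checks the result against MatLab's own least-squares fit, "polyfit"; it also illustrates that the unexplained share of the variance of y equals 1 - rho^2.

```matlab
% Hypothetical data, for illustration only (not the Students table)
x = [20 24 27 31 35 38 42 46 50]';   % e.g. ages
y = [88 80 71 69 60 62 47 45 33]';   % e.g. marks

cc  = corrcoef(x, y);                % 2-by-2 correlation matrix
rho = cc(1,2);                       % Pearson correlation coefficient

slope     = rho*std(y)/std(x);       % regression slope, as in the text
intercept = mean(y) - slope*mean(x); % regression intercept

% Cross-check with the degree-1 least-squares polynomial fit
p = polyfit(x, y, 1);                % p(1) is the slope, p(2) the intercept

% Share of the variance of y left unexplained by the regression line
res = y - (slope*x + intercept);
unexplained = sum(res.^2)/sum((y - mean(y)).^2);   % equals 1 - rho^2

fprintf('slope %.4f (polyfit %.4f), intercept %.4f (polyfit %.4f)\n', ...
        slope, p(1), intercept, p(2));
fprintf('unexplained share %.4f, 1 - rho^2 = %.4f\n', unexplained, 1 - rho^2);
```

The two pairs of printed values should coincide up to rounding errors, because both computations minimise the same average squared error.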
Since we are interested in group AN only, we apply it to the AN-related values x3 and y3:

>> cc=corrcoef(x3,y3)

leading to table

cc =
    1.0000   -0.7082
   -0.7082    1.0000

in which either off-diagonal entry is the rho, which can be picked up with command

>> rho=cc(1,2);

Then the general formula applies to the pair (x3,y3):

>> slope = rho*std(y3)/std(x3); % this produces slope = -1.33
>> intercept = mean(y3) - slope*mean(x3); % this produces intercept = 98.2

thus leading to the linear regression y3 = 98.2 - 1.33*x3. It states that every year added to the age, in general, decreases the mark by 1.33, so that ageing by 3 years would lead to a loss of 4 marks.

To check whether the equation is good, one may compare the real values for three students, numbers 80, 81 and 82, all belonging to AN, with those derived using the equation:

>> ii=[80 81 82];
>> x(ii) % the ages
>> y(ii) % the marks
>> yy=slope*x(ii)+intercept % the marks derived from the age

which yields the following results:

x    24    34    41
y    62    30    39
yy   66.3  53.0  43.7

One can see that the error for the second student, 53 (predicted) - 30 (real) = 23, is rather high, which reflects the fact that the mark of this student contradicts the general regularity: s/he is younger than the third student but has a lower mark. Altogether, the regression equation explains rho^2=0.50=50% of the total variance of y3, which is not too much.

Let us take a look at the reliability of the regression equation with bootstrapping, the popular computational experiment technique for validating data analysis results that was introduced in Chapter 1. The computational power allows for experimentation on the spot, with the real data, rather than with theoretical probabilistic distributions, which are not necessarily adequate to the data. Bootstrapping is based on a pre-specified number of random trials.
In the case of the data of 31 AN students, each trial begins with randomly selecting a student 31 times, with replacement, so that the same entity can be selected several times whereas some other entities may never be selected in a trial. (As shown above, on average only about 63% of the entities get selected into the sample.) A sample consists of 31 students because this is the number of elements in the set under consideration. The sample of 31 students (some of them, as explained, coincide) is assigned with their data values according to the original data table so that coinciding students get identical feature values. Then the data analysis method under consideration, currently "linear regression", is applied to this data sample to produce the trial result. After a number of such trials the user gets enough data to see how well they correspond to the original results.

To do a trial as described, one can use the following MatLab command:

>> ra=ceil(31*rand(31,1));
% rand(31,1) produces a column of 31 random real numbers, between 0 and 1 each. Multiplying this
% by 31 stretches the numbers to be between 0 and 31, and "ceil" rounds them up to integers from 1 to 31.

The values of x and y on the sampled group can be assigned by using equations:

>> xr=x3(ra); yr=y3(ra);

after which the formulas above apply to compute the rho, slope and intercept. To do this a number (5000, in this case) of times, one runs a loop:

>> for k=1:5000
     ra=ceil(31*rand(31,1));
     xr=x3(ra); yr=y3(ra);
     cc=corrcoef(xr,yr); rhr(k)=cc(1,2);
     sl(k)=rhr(k)*std(yr)/std(xr);
     inte(k)=mean(yr)-sl(k)*mean(xr);
   end
% the results are stored in 5000-strong arrays rhr (correlations), sl (slopes) and inte (intercepts)

Now we can check the mean and standard deviation of the obtained distributions. Commands

>> mean(sl), std(sl)

produce values -1.33 and 0.24. That means that the original value of slope=-1.33 is confirmed with the bootstrapping, but now we have its standard deviation, 0.24, as well.
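The bootstrap loop above can be tried without the Students table. The following is a minimal sketch under stated assumptions: the 31 (x, y) pairs are generated synthetically, the names ntrials, lo and hi are introduced here for illustration, and the interval is taken by sorting the slopes and cutting 2.5% at each end (the percentile method, a swapped-in alternative to counting histogram bins).

```matlab
rng(1);                              % fix the random seed so that the run is repeatable
n = 31;                              % sample size, as for the AN group
x = 20 + 30*rand(n,1);               % hypothetical ages (synthetic, not the real data)
y = 100 - 1.3*x + 10*randn(n,1);     % hypothetical marks with Gaussian noise

ntrials = 5000;                      % number of bootstrap trials
sl   = zeros(ntrials,1);             % slopes found in the trials
inte = zeros(ntrials,1);             % intercepts found in the trials
for k = 1:ntrials
    ra = randi(n, n, 1);             % n indices drawn from 1..n with replacement
    xr = x(ra); yr = y(ra);          % resampled data
    cc = corrcoef(xr, yr);
    sl(k)   = cc(1,2)*std(yr)/std(xr);
    inte(k) = mean(yr) - sl(k)*mean(xr);
end

% Percentile interval: sort the slopes and cut off 2.5% of the trials at each end
ss = sort(sl);
lo = ss(round(0.025*ntrials));       % lower bound of the central 95% of trials
hi = ss(round(0.975*ntrials));       % upper bound of the central 95% of trials

fprintf('slope: mean %.3f, std %.3f, 95%% interval [%.3f, %.3f]\n', ...
        mean(sl), std(sl), lo, hi);
```

Here randi(n,n,1) plays the same role as ceil(31*rand(31,1)) in the text; the printed numbers depend on the synthetic data and are not those of the AN group.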
Similarly, the mean/std values for the intercept and rho are computed. They are, respectively, 98.2 / 9.0 and -0.704 / 0.095.

We can plot the 5000 values found as 30-bin histograms (see Figure 2.5):

>> subplot(1,2,1); hist(sl,30)
>> subplot(1,2,2); hist(inte,30)

Further commands:

>> slh=hist(sl,30); slf=find(slh>=70); sum(slh(slf))

show that 4736 out of 5000 trials fall into just 17 of the 30 histogram bins, labelled from 7 to 23. To determine the boundaries of this area, one finds

>> slbinsize=(max(sl)-min(sl))/30;
>> slleftbound=min(sl)+6*slbinsize
>> slrightbound=max(sl)-7*slbinsize

which produces -1.80 and -0.86 as the left and right boundaries for the slope that hold for 4736/5000=94.7% of the trials.

Similar computations, with

>> inh=hist(inte,30); inff=find(inh>60); sum(inh(inff))
>> inbinsize=(max(inte)-min(inte))/30;
>> inleftbound=min(inte)+8*inbinsize
>> inrightbound=max(inte)-5*inbinsize

will find the left and right boundaries for the intercept at 95.1% of the trials (by leaving out 8 bins on the left and 5 bins on the right): 81.7 to 117.4.

Figure 2.5. 30-bin histograms of the slope (left) and intercept (right) after 5000 bootstrapping trials.

This all can be visualized by, first, defining the three regression lines with

>> y3reg=slope*x3+intercept;
>> y3regleft=slleftbound*x3+inleftbound;
>> y3regright=slrightbound*x3+inrightbound;

and then plotting the four sets onto the same figure, Figure 2.6:

>> plot(x3,y3,'*k',x3,y3reg,'k',x3,y3regleft,'r',x3,y3regright,'r')
% x3,y3,'*k' presents student data as black stars; x3,y3reg,'k' presents the real regression line in black
% x3,y3regleft,'r' and x3,y3regright,'r' present the boundary regressions with red lines

The red lines in Figure 2.6 show the limits of the regression line for 95% of trials.

Figure 2.6. Regression of CI score over Age (black line) within occupation category AN with boundaries covering 95% of potential biases due to sample fluctuations.

Non-linear correlations

In many domains the correlation between features is not necessarily linear.
For example, in economics, processes related to inflation over time are modelled as exponential ones; similar thinking applies to the processes of growth in biology; variables describing climatic conditions obviously have a cyclic character; etc. Consider, for example, an exponential function y=a*exp(b*x), where x is the predictor and y the predicted variable, whereas a and b are unknown but constant coefficients. Given the values of $x_i$ and $y_i$ on a number of observed entities i = 1,…, N, the exponential regression problem can again be formulated as the problem of minimising the summary squared error over all possible pairs of coefficients a and b. Given some a and b, the summary squared error is calculated as

$E = [y_1 - a\exp(b x_1)]^2 + [y_2 - a\exp(b x_2)]^2 + \ldots + [y_N - a\exp(b x_N)]^2 = \sum_i [y_i - a\exp(b x_i)]^2$   (2.9)

There is no method that would straightforwardly lead to a globally optimal solution of the problem of minimisation of E in (2.9) because it is the sum of many exponential functions. This is why the exponential regression is conventionally fitted by transforming it into a linear regression problem. Indeed, by taking the logarithm of both parts of equation y=a*exp(b*x), we obtain an equivalent equation ln(y)=ln(a)+b*x. This equation has the linear equation format, $z=\alpha x+\beta$, where z=ln(y), $\alpha=b$ and $\beta=\ln(a)$. By fitting the linear regression equation with the given data $x_i$ and $z_i=\ln(y_i)$ to find optimal $\alpha$ and $\beta$, we can feed the coefficients back into the original exponential equation by taking them as $a=\exp(\beta)$ and $b=\alpha$. This strategy seems especially suitable since the logarithm of a variable is typically much smoother so that the linear fit is easier to achieve. There is one "but" here, too. The issue is that the fact that $\alpha$ and $\beta$ are optimal in the linear regression problem does not necessarily imply that the values of a and b found this way are those minimising the error E.
Moreover, almost certainly they are not optimal and indeed can be rather far away from the optimal values, to which the exponent in $a=\exp(\beta)$ can contribute dramatically, as can be seen in the following project.

Project 2.2. Non-linear regression versus its linearized version: evolutionary algorithm for estimation

Let us consider an illustrative example involving variables x and y defined over a period of 20 time moments as follows.

Table 2.1. Data of investment at time moments from 0.10 to 2.00.

x   0.10  0.20  0.30  0.40  0.50  0.60  0.70  0.80  0.90  1.00
y   1.30  1.82  2.03  4.29  3.30  3.90  3.84  4.24  4.23  6.50

x   1.10  1.20  1.30  1.40  1.50  1.60  1.70  1.80  1.90  2.00
y   6.93  7.23  7.91  9.27  9.45  11.18 12.48 12.51 15.40 15.91

Variable x can be thought of as related to the time whereas y may represent the value of an investment. In fact, the components of x are numbers from 1 to 20 divided by 10, and y is obtained from them in MatLab according to formula y=2*exp(1.04*x)+0.6*randn where randn is the normal (Gaussian) random variable with the mathematical expectation 0 and variance 1. The average growth of the investment according to these data can be expressed as the 19th root of the ratio $y_{20}/y_1$, that is, 1.14: 14% per period!

The strategy of reduction of the exponential equation to the linear equation produces values 1.1969 and 0.4986 for $\alpha$ and $\beta$, respectively, which leads to a=1.6465 and b=1.1969 according to the formulas above. As we can see, these differ from the original a=2 and b=1.04 by the order of 15-20%. The value of the squared error here is E=13.90.

Figure 2.7. Plot of the original pair (x,y) in which y is a noisy exponential function of x (on the left) and plot of the pair (x,z) in which z=ln(y) (on the right). The plot on the right looks somewhat straighter indeed, though the correlation coefficients are rather similar, 0.970 for the plot on the left and 0.973 for the plot on the right.
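The linearization strategy of this project can be sketched in a few MatLab lines. The data are typed in from Table 2.1 as printed; the variable names alpha, beta and E follow the notation of the text. Because the printed table is rounded to two decimals, the output may deviate somewhat from the values quoted above, so this is an illustration under that assumption rather than an exact reproduction.

```matlab
% Data of Table 2.1 as printed (rounded to two decimals)
x = (1:20)'/10;
y = [1.30 1.82 2.03 4.29 3.30 3.90 3.84 4.24 4.23 6.50 ...
     6.93 7.23 7.91 9.27 9.45 11.18 12.48 12.51 15.40 15.91]';

z = log(y);                          % natural logarithm: z = ln(a) + b*x

% Linear regression of z over x, with the same formulas as in Project 2.1
cc    = corrcoef(x, z);
alpha = cc(1,2)*std(z)/std(x);       % slope of the linearized equation
beta  = mean(z) - alpha*mean(x);     % intercept of the linearized equation

% Feed the coefficients back into the exponential equation
a = exp(beta);
b = alpha;

% Summary squared error (2.9) of the exponential equation on the original data
E = sum((y - a*exp(b*x)).^2);

fprintf('alpha = %.4f, beta = %.4f, a = %.4f, b = %.4f, E = %.2f\n', ...
        alpha, beta, a, b, E);
```

Note that E is evaluated on the original scale of y, not on the logarithms; this is the quantity that the linearized solution does not minimise.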
It only remains to solve the original problem of minimising E in (2.9) and see whether this leads to better estimates. One obviously can apply here local algorithms such as the steepest descent. Also, the evolutionary approach can be applied. This approach involves a population of admissible solutions evolving according to some rules. The rules include: (a) random changes from generation to generation, and (b) elite control. After a number of generations, the best solution among those observed is reported as the outcome.

To start the evolution, we first choose the population size, p, randomly generate a population of p admissible solutions, pairs (a,b), and evaluate how well they fit the data. The best of them is recorded separately, and the record is updated from iteration to iteration (elite control). An iteration consists of moving the population in a random direction by adding randomly generated values. In the beginning, a box, large enough to contain the optimal solution, is defined, so that the population is kept in the box and not driven away. This very simple process is quite effective for this type of problem.

A MatLab program, nlrm.m, implementing this approach is posted in the Appendix (see page … ). It is supplied with comments and can be used as is or in a modified form for other non-linear fitting problems.

The program nlrm.m found a solution of a=1.9908 and b=1.0573. These are within 1-2% of the original values a=2 and b=1.04. The summary squared error here is E=7.45, which is by far less than that found with the linearization strategy.

The two exponential regressions found with the different strategies are presented in Figure 2.8. One can see that the linearized version has a much steeper exponent, which becomes visible at later periods.

2.2. Mixed Scale Case: a nominal feature versus a quantitative one

Consider a categorical feature x on the same entities as a quantitative feature y, such as Occupation and Age in the Students data set. The within-category distributions of y can be used to investigate the correlation between x and y. The distributions can be visualized by using just ranges as follows: present categories with equal-size bins on the x axis, draw two lines parallel to the x axis to present the minimum and maximum values of y (in the entire data set), and then present the within-category ranges of y as shown in Figure 2.9.

Figure 2.8. Two fitting exponents are shown, with stars and dots, for the data in Project 2.2.

Figure 2.9. Graphic presentation of within-category ranges of Age in the Students data (horizontal axis: Occupation, with categories IT, BA, AN; vertical axis: Age, from 20 to 51).

Figure 2.10. In a situation of ideal correlation, with zero within-category variances, knowledge of the Occupation category would provide an exact prediction of the Age within it (same axes as in Figure 2.9).

The correlation between x and y is higher when the within-category spreads are smaller because the smaller the spread within an x-category, the more precise is the prediction of y at it. Figure 2.10 illustrates an ideal case of a perfect correlation: all within-category y-values are the same, leading to an exact prediction of Age when Occupation is known. Figure 2.11 presents another extreme, when knowledge of an Occupation category does not lead to a better prediction of Age than when the Occupation is unknown.

A simple statistical model extending that for the mean will be referred to as table regression.
The table regression of quantitative y over categorical x comprises three columns corresponding to:
(1) Category of x
(2) Within-category mean of y
(3) Within-category standard deviation of y

The number of rows in the table regression thus corresponds to the number of x-categories; there should be a marginal row as well, with the mean and standard deviation of y on the entire entity set.

Figure 2.11. Wide within-category distributions: the case of full variance within categories in which the knowledge of Occupation would give no information on Age (horizontal axis: Occupation, with categories IT, BA, AN; vertical axis: Age, from 20 to 51).

Consider, for example, the table regression of Age (quantitative target) over Occupation (categorical predictor) in Table 2.2. It suggests that if we know the Occupation, for instance, IT, then we can safely predict the Age as being 28.2 within the margin of plus/minus 5.6 years. Without knowing the Occupation category, we could only say that the age is on average 33.7 plus/minus 8.5, a less precise assessment.

Table 2.2. Table regression of Age over Occupation in Students data.

Occupation   Age Mean   Age StD
IT           28.2       5.6
BA           39.3       7.3
AN           33.7       8.7
Total        33.7       8.5

The table can be visualized in a manner similar to Figures 2.9-2.11, this time presenting the within-category averages by horizontal lines and the standard deviations by vertical strips (see Figure 2.12).

Figure 2.12. Table regression visualized with the within-category averages and standard deviations represented by the position of solid horizontal lines and vertical line sizes, respectively. The dashed line's position represents the overall average (grand mean). (Horizontal axis: Occupation, with categories IT, BA, AN; vertical axis: Age, from 20 to 51.)

One more way of visualization of categorical/quantitative correlation is the so-called box-plot. The within-category spread is expressed here with a quantile (percentile) box rather than with the standard deviation.
First, a quantile level should be defined such as, for instance, 40%, which means that we are going to show the within-category range over only 60% of its contents by removing 20% off each of its top and bottom extremes. At the category IT, Age ranges between 20 and 39, but if we sort it and remove 7 entities of maximal Age and 7 entities of minimal Age (there are 35 students in IT so that 7 makes 20% exactly), then the Age range on the remaining 60% is from 22 to 33. Similarly, the Age 60% range is from 32 to 47 on BA, and from 25 to 44 on AN. These are presented with the box heights in Figure 2.13. The whiskers reflect the 100% within-category ranges, which are intervals [20,39], [27,51] and [21,50], respectively.

Figure 2.13. Box-plot of the relationship between Occupation and Age with 20% quintiles; the box heights reflect the Age within-category 60% ranges, whiskers show the total ranges. Within-box horizontal lines show the within-category averages. (Horizontal axis: Occupation, with categories IT, BA, AN; vertical axis: Age, from 20 to 51.)

The box-plot has proved useful in studies of pairs of quantitative features too: one of the features is partitioned into a number of bins that are then treated as categories.

Correlation ratio

Let us consider one more table regression, this time of the OOProgramming mark over Occupation (Table 2.3).

Table 2.3. Table regression OOProg/Occupation.

Occupation   OOP Mean   OOP StD
IT           76.1       12.9
BA           56.7       12.3
AN           50.7       12.4
Total        61.6       16.5

A natural question emerges: in which of the tables is the correlation greater, 2.2 or 2.3? This can be addressed with an integral characteristic of the table, the correlation ratio (the determination coefficient for the table regression). To define this index, denote by k a category of x, by $S_k$ the set of $i \in I$ such that $x_i=k$, by $p_k = |S_k|/|I|$ its proportion, and by $\sigma_k^2$ the variance of y within $S_k$.
Then we first calculate the average within-category variance:

$\sigma_w^2 = \sum_k p_k \sigma_k^2$

The correlation ratio, usually denoted by $\eta^2$, shows the drop of the variance of y from $\sigma^2$ (the variance of y as is) to $\sigma_w^2$ (the average variance of y when the nominal x is taken into account):

$\eta^2 = 1 - \sigma_w^2/\sigma^2$   (2.10)

Properties:
- The range of $\eta^2$ is between 0 and 1
- Correlation ratio $\eta^2 = 1$ when all $\sigma_k^2$ are zero (that is, when y is constant within each group)
- Correlation ratio $\eta^2 = 0$ when all $\sigma_k^2$ are of the order of $\sigma^2$

In fact, the correlation ratio emerges as the square-error related criterion in the following data recovery model. Find a set of $c_k$ such that the "residual variance", that is, the average squared error $L = \sum_{i \in I} e_i^2 / N$, is minimized, where $e_i = y_i - c_k$ according to equations

$y_i = c_k + e_i$ for all $i \in S_k$   (2.11)

where $S_k$ denotes the set of entities falling in category k of x. These equations underlie the table regression and are sometimes referred to as the piece-wise regression. It is not difficult to prove that the optimal $c_k$ is the within-category k average of y, which implies that the minimum value of L is equal to $\sigma_w^2$ defined above. The correlation ratio shows the relative drop in the variance of y when it is predicted according to model (2.11).

Correlation ratios in our example are:

Occupation/Age      28.1%
Occupation/OOProg   42.3%

which shows that the correlation between the Occupation and OOProgramming mark is greater than that between the former and Age.

2.3. Case of two nominal features

Consider two sets of disjoint categories: l=1,…,L (for example, occupation) and k=1,…,K (family or housing type). Each makes a classification; they are crossed to see the correlation. Combine a pair of categories $(k,l) \in K \times L$ and count the number of entities that fall in both. The (k,l) co-occurrence count is denoted by $N_{kl}$. Obviously, these counts sum up to N. A table housing these counts, $N_{kl}$, or their relative values, frequencies $p_{kl} = N_{kl}/N$, is referred to as a contingency table or just cross-classification.
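Counting the co-occurrences can be done in MatLab with the built-in function "accumarray". The sketch below assumes that each feature is coded by category numbers; the two short vectors k and l are hypothetical codes for 10 entities, not data from the case studies.

```matlab
% Hypothetical category codes for N = 10 entities
k = [1 1 2 2 2 3 3 1 2 3]';          % first feature, K = 3 categories
l = [1 2 1 1 2 2 2 1 1 2]';          % second feature, L = 2 categories

% Nkl(k,l) counts the entities falling in category k and category l
Nkl = accumarray([k l], 1);          % K-by-L table of co-occurrence counts

N   = sum(Nkl(:));                   % the counts sum up to the number of entities
pkl = Nkl/N;                         % relative frequencies

disp(Nkl)
disp(pkl)
```

Each row of [k l] is used as a pair of subscripts, and the 1s are summed up within the cells, which is exactly the (k,l) co-occurrence count.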
Example: Partition the Market town set in four classes according to the number of Banks and Building Societies (Ba): Ba ≥ 10 (10+), 10 > Ba ≥ 4 (4+), 4 > Ba ≥ 2 (2+), Ba = 0 or 1 (1-); these will be the Banking type categories. Cross-classify this with FM (Farmers' Market: yes/no), see Table 2.4.

Table 2.4. Cross classification of the Ba related partition with FM.

             Number of Banks/Building societies
FarmMarket   10+   4+    2+    1-    Total
Yes          2     5     1     1     9
No           4     7     13    12    36
Total        6     12    14    13    45

The same contingency data converted to frequencies are presented in Table 2.5.

Table 2.5. BA/FM cross classification frequencies, per cent.

FM | Ba   10+     4+      2+      1-      Total
Yes       4.44    11.11   2.22    2.22    20
No        8.89    15.56   28.89   26.67   80
Total     13.33   26.67   31.11   28.89   100

The totals, that is, the within-row sums $N_{k+} = \sum_l N_{kl}$ and within-column sums $N_{+l} = \sum_k N_{kl}$ (as well as their frequency counterparts) are referred to as marginals (because they are typed on the margins of the contingency table).

Another example: the contingency table for features "Protocol-type" and "Attack type" (Table 2.6).

Table 2.6. Protocol/Attack contingency table for Intrusion data.

Category   Tcp   Udp   Icmp   Total
apache     23    0     0      23
saint      11    0     0      11
surf       0     0     10     10
norm       30    26    0      56
Total      64    26    10     100

A contingency table can be used for the assessment of correlations between two category sets. A conceptual association may exist if a row or a column, k, has all its entries (forget the margins), except for one, equal to 0. Such are the columns "Udp" and "Icmp" in Table 2.6. In this case, we have a perfect match between the category k and that category l at which the only non-zero count has occurred. No other combination (k,l') with l' different from l is possible, according to the table; the zeros tell this. In such a situation, one may claim that, subject to the sample, k implies l so that k occurs only when l does. According to Table 2.6, the udp protocol implies "norm", the no attack situation, whereas the icmp protocol implies the "surf" attack.
The latter, in fact, amounts to the equivalence between "icmp" and "surf", because there is no other non-zero entry for "surf" so that "surf" implies "icmp" as well. In contrast, "udp" and "norm" are not equivalent because "norm" may occur at another protocol, "tcp", too.

A similar situation might have occurred in Table 2.4. Imagine, for example, that in row "Yes" of Table 2.4 the two last entries are 0s, not 1s. This would imply that a Farmers' Market may occur only in a town with 4 or more Banks. A logical implication, that is, a production rule, "If a Farmers' Market is present, then BA is 4 or more", could then be derived from the Table.

One may try taking this path and cleaning the data of smaller entries and the corresponding entities so as not to obscure our vision of the pattern of correlation. Look at, for example, Table 2.7 that expresses, with no exception, a very simple conceptual statement: "A town has a Farmers' Market if and only if the number of Banks in it is 4 or greater". However nice the rule may be, let us not forget the exceptions: there are 13 towns, almost 30% of the sample, that have been removed as those not fitting. If this is acknowledged, Table 2.7 should not be attributed to just data doctoring; though an issue to address remains: could a different conclusion be reached with other removals? Some try finding better ways for computationally producing production rules, typically by adding other features into consideration rather than by subjective entity removals, which is an important activity in machine learning and knowledge discovery.

Table 2.7. BA/FM cross classification cleaned of 13 towns, to sharpen the view.

           Number of Banks/Build. Societies
FMarket    10+   4+    2+    1-    Total
Yes        2     5     0     0     7
No         0     0     13    12    25
Total      2     5     13    12    32

Quetelet index

There is another strategy for visualisation of correlation patterns in contingency tables, without removal of not-fitting entities.
This strategy involves an index for assessing correlation between individual categories. Let us consider the correlation between the presence of a Farmers' Market and the category "10 or more Banks" according to the data in Table 2.5. We can see that their joint probability/frequency is the entry in the corresponding row and column: P(Ba=10+ & FM=Yes)=4.44% (joint probability/frequency/rate). Of the 20% of entities that fall in the row "Yes", this makes P(Ba=10+ / FM=Yes) = 0.0444/0.20 = 0.222 = 22.2%. Such a ratio is referred to as the conditional probability/rate. Is this high or low? A founding father of statistics, A. Quetelet (Belgium, 1832), suggested that this question can be addressed by comparing the conditional rate with the average probability of the category "Ba=10+", which is P(Ba=10+)=13.33%. Let us therefore compute the (relative) Quetelet index q:

q(Ba=10+ / FM=Yes) = [P(Ba=10+ / FM=Yes) - P(Ba=10+)] / P(Ba=10+) = [0.2222 - 0.1333] / 0.1333 = 0.6667 = 66.7%.

That means that the condition "FM=Yes" raises the frequency of the Bank category by 66.7%. In fact, such an evaluation is frequently used in everyday statistics. For example, consider the risk of getting a serious illness l, say tuberculosis, which may be 0.1% in a given region. Take a condition k such as "Bad housing" and count the rate of tuberculosis under this condition, say, 0.5%. Then the Quetelet index is q(l/k)=(0.5-0.1)/0.1=400%, showing that the rate rises by 400%, that is, becomes five times greater!

The general definition of the Quetelet index is given in the following formula:

$q(l/k) = [P(l/k) - P(l)] / P(l)$   (2.10)

where P denotes the probability or frequency and, in our context, can be computed as follows: $P(l) = N_{+l}/N$, $P(k) = N_{k+}/N$, $P(l/k) = N_{kl}/N_{k+}$. That is, the Quetelet index measures the correlation between categories k and l as the relative change of the probability of l when k is taken into account.
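The definition can be applied to all cells of a contingency table at once. The sketch below takes the counts of Table 2.4 (rows k: FM = Yes, No; columns l: Ba = 10+, 4+, 2+, 1-) and computes q(l/k) directly from the conditional and marginal frequencies; the variable names are chosen here for illustration.

```matlab
% Counts of Table 2.4: rows are FM = Yes, No; columns are Ba = 10+, 4+, 2+, 1-
Nkl = [2 5  1  1;
       4 7 13 12];
[K, L] = size(Nkl);
N = sum(Nkl(:));                          % 45 towns

Pl  = sum(Nkl, 1)/N;                      % P(l): column frequencies, 1-by-L
Plk = Nkl ./ repmat(sum(Nkl, 2), 1, L);   % P(l/k) = Nkl/Nk+, row-wise conditional rates

% Quetelet index: relative change of P(l) when category k is taken into account
q = (Plk - repmat(Pl, K, 1)) ./ repmat(Pl, K, 1);

disp(round(10000*q)/100)                  % the indices in per cent, two decimals
```

The first entry, q(1,1), is the value 66.67% derived above for Ba=10+ given FM=Yes; repmat is used instead of implicit expansion so that the code also runs in older MatLab versions.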
With a little algebra, one can derive a simpler expression

$q(l/k) = [N_{kl}/N_{k+} - N_{+l}/N] / (N_{+l}/N) = N_{kl} N/(N_{k+} N_{+l}) - 1 = p_{kl}/(p_{k+} p_{+l}) - 1$   (2.11)

Applying (2.11) to Table 2.4 we obtain the Quetelet index values presented in Table 2.8. By highlighting positive values in it, we obtain the same pattern as on the cleaned data, but this time in a somewhat more realistic guise. Specifically, one can see that the "Yes" FM category provides for a strong increase in probabilities, whereas the "No" category leads to much weaker changes.

Table 2.8. BA/FM cross classification Quetelet coefficients, % (positive entries highlighted).

FMarket   Yes      No
10+       66.67    -16.67
4+        108.33   -27.08
2+        -64.29   16.07
1-        -61.54   15.38

Quetelet coefficients for Table 2.6 are presented in Table 2.9; positive entries are highlighted in bold.

Table 2.9. Quetelet indices for the Protocol/Attack contingency Table 2.6, %.

Category   Tcp       Udp       Icmp
apache     56.25     -100.00   -100.00
saint      56.25     -100.00   -100.00
surf       -100.00   -100.00   900.00
norm       -16.29    78.57     -100.00

Q. Can any logical production rules be derived from the rows of Table 2.6?
A. Yes, both apache and saint attacks may occur at the tcp protocol only.

Pearson's chi-squared decomposed over Quetelet indexes

This visualization can be extended to a more theoretically sound presentation. Let us define the summary Quetelet correlation index Q as the sum of pair-wise Quetelet indexes weighted by their frequencies/probabilities:

$Q = \sum_{k,l} p_{kl}\, q(l/k) = \sum_{k,l} p_{kl}^2/(p_{k+} p_{+l}) - 1$   (2.12)

The right-hand expression for Q in (2.12) can be obtained by substituting expression (2.11) for q(l/k). This expression is very popular in the statistical analysis of contingency data. In fact, this is another formula for the Pearson chi-squared correlation coefficient proposed by K. Pearson (1901) in a very different context: as a measure of deviation of the contingency table entries from statistical independence.
To explain this in more detail, let us first introduce the concept of statistical independence. Sets of the k and l categories are said to be statistically independent if $p_{kl} = p_{k+} p_{+l}$ for all k and l. Obviously, such a condition is hard to fulfil in reality. K. Pearson suggested using relative squared errors to measure the deviations. Specifically, he introduced the chi-squared coefficient:

$X^2 = N \sum_{k,l} (p_{kl} - p_{k+} p_{+l})^2/(p_{k+} p_{+l}) = N \left[ \sum_{k,l} p_{kl}^2/(p_{k+} p_{+l}) - 1 \right]$   (2.13)

The equation in the middle is well known and allows us to see that $X^2 = NQ$, according to (2.12). The popularity of $X^2$ in statistics rests on the theorem proven by Pearson: if the contingency table is based on an independent sample of entities drawn from a population in which the statistical independence holds (so that all deviations are due to just random sampling), then the probabilistic distribution of $X^2$ converges to the chi-squared distribution (when N tends to infinity) introduced by Pearson for similar analyses. The chi-squared distribution is defined as the distribution of the sum of squares of several independent standard Gaussian variables.

This theorem may be of no interest to computational intelligence, because the latter draws on data that are not necessarily random. However, Pearson's chi-squared is a most popular index for scoring correlation in contingency tables, and the equation $X^2 = NQ$ may give some further support to it. According to this equation, $X^2$ also has a very different meaning, that of the averaged Quetelet coefficient, which has nothing to do with the statistical independence and has everything to do with the correlation between categories.

To make the underlying correlation concept clearer, let us take a look at the range of possible values of $X^2$. It can be proven that, at K ≤ L, that is, when the number of rows does not exceed the number of columns, $Q = X^2/N$ ranges between 0 and K - 1.
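The identity $X^2 = NQ$ can be checked numerically. The sketch below does so on the counts of Table 2.4, computing the chi-squared coefficient in its classical form (2.13) and the summary Quetelet index (2.12) separately; the matrix named "contrib" holds the items $N p_{kl} q(l/k)$ that sum up to $X^2$. All variable names are introduced here for illustration.

```matlab
% Counts of Table 2.4: rows are FM = Yes, No; columns are Ba = 10+, 4+, 2+, 1-
Nkl = [2 5  1  1;
       4 7 13 12];
N = sum(Nkl(:));
p = Nkl/N;                           % frequencies p_kl

E = sum(p, 2)*sum(p, 1);             % products p_k+ * p_+l (independence values)
q = p./E - 1;                        % Quetelet indices, formula (2.11)

Q  = sum(sum(p.*q));                 % summary Quetelet index (2.12)
X2 = N*sum(sum((p - E).^2 ./ E));    % Pearson chi-squared, classical form (2.13)

contrib = N*p.*q;                    % cell-wise items of the decomposition of X2

fprintf('X2 = %.4f, N*Q = %.4f\n', X2, N*Q);
disp(round(100*contrib)/100)         % contributions, two decimals
```

The two printed numbers should coincide, and Q itself stays within the range from 0 to min(K,L)-1, which is 1 for this two-row table.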
It reaches 0 if there is statistical independence at all pairs (k,l) so that all $q_{kl}=0$, and it reaches K - 1 if each column l contains only one non-zero entry, $p_{k(l)l}$, which is thus equal to $p_{+l}$. The latter can be interpreted as a logical implication l → k(l).

Representation of $NQ = X^2$ as the sum of the $N p_{kl}\, q(l/k)$ terms allows for visualization of the chi-squared correlation terms within the contingency table format, such as that presented in Table 2.10.

Table 2.10. BA/FM chi-squared (NQ = 6.86) and its decomposition according to (2.12), (2.13). (Bold font for the positive items; red for exceptional contributions.)

FMarket   Yes     No      Total
10+       1.33    -0.67   0.67
4+        5.41    -1.90   3.51
2+        -0.64   2.09    1.45
1-        -0.62   1.85    1.23
Total     5.49    1.37    6.86

The entry 5.41, highlighted in red, contributes so much to $X^2=6.86$ that perhaps it is the only single item deserving to be considered for further investigation.

Q. In Table 2.10, all marginal values, the sums of rows and columns, are positive, in spite of the fact that many within-table entries are negative. Is this just due to specifics of the distribution in Table 2.4 or a general property?
A. A general property. It can be proven that the within-row or within-column sums of the elements $p_{kl}\, q(l/k)$ cannot be negative.

Task: Find a similar decomposition of chi-squared for OOPmarks/Occupation in the Students data. (Hint: First, categorize the quantitative feature OOPmarks somehow: you may use equal bins, or conventional boundary points such as 35, 65 and 75, or any other considerations.)

3. Learning correlations

3.1. General

The problem can be stated as follows. Given N pairs $(x_i, u_i)$ (observed at entities i = 1, …, N) in which $x_i$ are predictor/input vectors $x_i=(x_{i1},\ldots,x_{ip})$ (dimension p) and $u_i = (u_{i1},\ldots,u_{iq})$ are target/output vectors (dimension q), build a decision rule

$\hat{u} = F(x)$   (3.1)

such that the difference between the computed $\hat{u}$ and the observed target vector u, given x, is minimal over the class of admissible rules F.
A rule F is referred to as a classifier if the target is categorical and as a regression if the target is quantitative. In what follows, we consider only the case of q=1, except for Chapter 4, devoted to neural network learning. In the problems of linear regression or linear discrimination F is required to be linear. Classes of quadratic regression and discrimination are similarly defined.

Why (and how) should one restrict the class of admissible rules F? A big question, no good answers. Take a look at the 2D regression problem: pairs (x,u) are observed at N entities shown on Figure 3.1.

[Figure 3.1. Possible analytic expressions of correlation between x and u according to observed data points (black circles). Horizontal axis: x; vertical axis: u.]

The N=7 points on Figure 3.1 can be exactly fitted by a polynomial of the 6th order, u = p(x) = a0 + a1x + a2x² + a3x³ + a4x⁴ + a5x⁵ + a6x⁶. Indeed, the 7 points give 7 equations ui = p(xi) (i = 1,…,7) to exactly determine, in a typical case, the 7 coefficients ak of p(x). However, the polynomial p(x), on whose graph all observations lie, has no predictive power: beyond the range, the curve may go either course (like those shown) depending on small changes in the data. Typically, such over-fitted functions produce very poor predictions on newly added observations. The blue straight line fits none of the points but expresses a simple and very robust tendency and should be preferred because it summarises the data much more deeply: it is defined by only two parameters, slope and intercept, whereas the polynomial line involves as many parameters as there are data items. If there is no domain knowledge motivation, it is hard to tell what class of Fs to use.

One set of answers relates to the so-called Occam's razor. William Ockham (c. 1285–1349) said: "Entities should not be multiplied unnecessarily." ("All things being equal, the simplest explanation tends to be the best one.") This is usually interpreted as the "Principle of maximum parsimony, i.e., economy," which is used when there is nothing better available. In the format of the so-called "Minimum description length" principle, this approach can be meaningfully applied to problems of estimation of parameters of statistical distributions (see Rissanen 2007). A somewhat wider, and perhaps more appropriate, explication of Occam's razor is proposed in Vapnik (2006). In a slightly modified form, to avoid different terminologies, it says: "Find an admissible decision rule with the smallest number of free parameters such that explains the observed facts" (Vapnik 2006, p. 448). However, even in this format, the principle gives no guidance about how to choose an adequate functional form. For example, which of two functions, f(x) = ax^b or g(x) = a log(x+b), both with two parameters a and b, should be preferred as a summarisation tool?

Another set of answers, not incompatible with the former ones, relates to the so-called falsifiability principle by K. Popper, which can be expressed as follows: "Explain the facts by using such an admissible decision rule which is easiest to falsify" (Vapnik 2006, p. 451). In philosophy, to falsify a theory one needs to give an example that contradicts it. Falsifiability of a decision rule can be formulated in terms of the so-called VC-complexity, a measure of complexity of classes of decision rules: the smaller the VC-complexity, the greater the falsifiability.

Let us define VC-complexity for the more intuitive case of a categorical target. Different combinations of target categories can be labelled by different labels, u1, u2, …, uK, so that a classifier F is bound to predict labels uk, k=1,2,…,K.
A set of classifiers is said to shatter the sample of N pairs (xi, ui), where ui's are just codes of labels uk, if for any possible assignment of the labels, a classifier F exists in the set such that F reproduces those labels. The VC-complexity of a correlation problem is the maximum number of entities that can be shattered by the admissible classifiers.

Consider, for example, the target being just a 0/1 category and input features being binary as well, with the set of admissible decision rules consisting of all possible dichotomy partitions. Then the VC-complexity of the problem will be equal to the maximum dimension of the binary cube which is covered by the data source. As an example, let us take a look at the set of eleven entities and four binary input features shown below.

#     1  2  3  4  5  6  7  8  9  10  11
v1    1  1  1  0  0  0  0  1  1  1   0
v2    1  0  0  1  1  0  0  1  1  0   1
v3    1  1  1  1  0  1  1  0  0  0   0
v4    1  1  0  1  0  1  0  1  0  1   1

The VC-complexity of the problem in this case is 2 because there exist two columns, for example, v1 and v2, containing all four rows of a two-dimensional binary cube. However, there are no three columns containing all eight rows of a three-dimensional binary cube (from 000 to 111). If we are certain that this property holds for other possible data points from the source, even those not necessarily present in the sample, then we should utilize only relatively simple classifiers of VC-complexity 2.

The VC-complexity is an important characteristic of a correlation problem, especially within the probabilistic machine learning paradigm.
Under conventional conditions of independent random sampling of the data, an accurate classifier "with probability a% will be b% accurate, where b depends not only on a, but also on the sample size and VC-complexity."

To specify a learning problem one should specify assumptions regarding a number of constituents including:

(i) Data flow. Two modes are usually considered:
   (i1) Incremental (adaptive) mode: entities are assumed to arrive one by one so that the rule is updated incrementally. This type of data flow implies that the fitting algorithm must be incremental, too; steepest descent and evolutionary approaches are most suitable for this.
   (i2) Batch mode: the entire entity set is available for learning immediately so that the rule can be found at once.

(ii) Type of rule. A rule involves a postulated mathematical structure whose parameters are to be learnt from the data. The mathematical structures considered further are:
   - linear combination of features
   - neural network mapping a set of input features into a set of target features
   - decision tree built over a set of features
   - partition of the entity set into a number of non-overlapping clusters

(iii) Type of target. Two types are usually considered: quantitative and categorical. In the former case, equation (3.1) is usually referred to as regression; in the latter case, as a decision rule, and the learning problem is referred to as that of "classification" or, sometimes, "pattern recognition".

(iv) Criterion. The criterion of the quality of fitting depends on the situation in which the learning task is formulated. The most popular criteria are: maximum likelihood (in a probabilistic model of data generation), least-squares (data recovery approach) and error counts. Many operational criteria using error counts can be equivalently reformulated in terms of the least-squares and maximum likelihood.
According to the least-squares criterion, the difference between u and û is measured with the squared error,

E = <u - û, u - û> = <u - F(x), u - F(x)>     (3.2)

which is to be minimised over all admissible F.

3.2. Linear regression

Consider feature Post expressing the number of post offices in Market Towns (Table 0.4 on p. 16-17) and try to relate it to other features in the table. It obviously relates to the population. For example, towns with population of 15,000 and greater are those and only those where the number of post offices is 5 or greater. This correlation, however, is not strong enough to give us more guidance in predicting Post from the Population. For example, at the seven towns whose population is from 8,000 to 10,000 any number of post offices from 1 to 4 may occur, according to the table. This could be attributed to effects of services such as a bank or hospital present at the towns. Let us specify a set of features in Table 0.4 that can be thought of as affecting the feature Post, to include, in addition to Population, say, PSch - Primary schools, Doct - General Practitioners, Hosp - Hospitals, Banks, Sstor - Superstores, and Petr - Petrol Stations, seven features altogether, that constitute the set of input variables (predictors) {x1, x2, …, xp} at p=7.

What we want is to establish a linear relation between this set and the target feature Post, which has the general denotation u in the formulation of section 3.1. A linear relation is an equation representing u as a weighted sum of features xi plus a constant intercept, in which the weights can be any reals, not necessarily positive. If the relation is supported by the data, it can be used for various purposes such as analysis, prediction and planning.

This can be formulated as a specific case of the correlation learning problem in which there is just one quantitative target variable u. The rule F in (3.1) is assumed to be linear:

u = w1*x1+w2*x2+…+wp*xp+w0

where w0, w1,…, wp are unknown weights [3], parameters of the model.
For any entity i = 1, 2, …, N, the rule-computed value of u,

ûi = w1*xi1+w2*xi2+…+wp*xip+w0,

differs from the observed one by di = |ûi – ui|, which is zero when the prediction is exact and positive otherwise. To find w1, w2, …, wp, w0, one can minimize

$$D^2 = \sum_i d_i^2 = \sum_i (u_i - w_1x_{i1} - w_2x_{i2} - \dots - w_px_{ip} - w_0)^2 \qquad (3.3)$$

over all possible parameter vectors w = (w0,w1,…,wp).

To make the problem treatable in terms of multidimensional spaces, a fictitious feature x0 is introduced such that all its values are 1: xi0 = 1 for all i = 1, 2, …, N. Then criterion D² can be expressed as $D^2 = \sum_i (u_i - \langle w, x_i\rangle)^2$ using the inner products <w,xi>, where w = (w0, w1,…,wp) and xi = (xi0, xi1, …, xip) are (p+1)-dimensional vectors, sometimes referred to as having been augmented (by adding the fictitious unity feature x0), of which all xi are known while w is unknown. From now on, the unity feature x0 is assumed to be part of data matrix X in all correlation learning problems.

The criterion D² is but the Euclidean distance squared between the N-dimensional target feature column u = (ui) and vector û = Xw whose components are ûi = <w,xi>. Here X is the N x (p+1) matrix whose rows are xi (augmented with the component xi0 = 1, thus being (p+1)-dimensional), so that Xw is the matrix algebra product of X and w. Vectors defined as Xw for all possible w's form a vector space, referred to as X-span, which is (p+1)-dimensional when the columns of X are linearly independent. Thus the problem of minimization of (3.3) can be reformulated as follows: given target vector u, find its projection û in the X-span space. The global solution to this problem is well-known; it is provided by a matrix $P_X$ applied to u:

$$\hat{u} = P_X u \qquad (3.4)$$

where $P_X$ is the so-called orthogonal projection operator, of size N x N, defined as

$$P_X = X(X^TX)^{-1}X^T \qquad (3.5)$$

so that $\hat{u} = X(X^TX)^{-1}X^Tu$ and $w = (X^TX)^{-1}X^Tu$. Matrix $P_X$ projects every N-dimensional vector u to its nearest match in the (p+1)-dimensional X-span space.
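The formula w = (XᵀX)⁻¹Xᵀu can be tried out numerically. The sketch below, in plain Python with no libraries, forms the normal equations (XᵀX)w = Xᵀu and solves them by Gaussian elimination. Since the Market towns data of Table 0.4 are not reproduced here, the check uses the 14 original points of Table 3.4 (section 3.3), with the target coded as +1 for entities 1, 2, 3, 4, 6 and -1 for the others; the resulting weights can be compared with the "LSE at Original data" column of Table 3.5. The helper names `solve` and `least_squares` are illustrative assumptions, and the code assumes XᵀX is non-singular.

```python
def solve(a, b):
    """Solve the square system a w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]      # augmented matrix [a | b]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for j in range(c, n + 1):
                m[r][j] -= f * m[c][j]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):                        # back substitution
        w[r] = (m[r][n] - sum(m[r][j] * w[j] for j in range(r + 1, n))) / m[r][r]
    return w


def least_squares(rows, u):
    """Return w = (X^T X)^(-1) X^T u; each row is augmented with x0 = 1 first."""
    x = [[1.0] + list(r) for r in rows]
    k = len(x[0])
    xtx = [[sum(xi[s] * xi[t] for xi in x) for t in range(k)] for s in range(k)]
    xtu = [sum(xi[s] * ui for xi, ui in zip(x, u)) for s in range(k)]
    return solve(xtx, xtu)


# Original (x, y) coordinates of the 14 entities of Table 3.4.
points = [(3, 0), (3, 1), (3.5, 1), (3.5, 0), (4, 1), (1.5, 4), (2, 4),
          (2, 5), (2, 4.5), (1.5, 5), (2, 4), (2, 5), (2, 4.5), (1.5, 5)]
u = [1, 1, 1, 1, -1, 1] + [-1] * 8        # +1 for entities 1-4 and 6, -1 otherwise
w0, w1, w2 = least_squares(points, u)     # intercept, weight at x, weight at y
print(round(w0, 4), round(w1, 4), round(w2, 4))
```

Solving the normal equations directly is chosen here only for transparency; when XᵀX is singular or nearly so, a pseudo-inverse or a QR-based solver would be the safer route.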
The inverse $(X^TX)^{-1}$ does not exist if the rank of X, as may happen, is less than the number of columns in X, p+1, that is, if matrix $X^TX$ is singular or, equivalently, the dimension of X-span is less than p+1. In this case, the so-called pseudo-inverse matrix $(X^TX)^{+}$ can be used instead.

[3] Symbol * is used to denote multiplication when convenient.

In our example of seven Market town features to be used for linearly relating to the Post Office feature, the vector w of weight coefficients found with the formula above is as presented in Table 3.1.

Table 3.1. Weight coefficients of input features at Post Office as target variable for Market towns data.

Feature        Weight
POP_RES         0.0002
PSchools        0.1982
Doctors         0.2623
Hospitals      -0.2659
Banks           0.0770
Superstores     0.0028
Petrol         -0.3894
Intercept       0.5784

Each weight coefficient shows how much the target variable would change on average if the corresponding feature is increased by one. One can see that increasing population by a thousand would give a similar effect as adding a primary school, about 0.2, which may seem absurd in the example as the Post office variable can have only integer values. Moreover, the linear function format should not trick the decision maker into thinking that different input features can be increased independently: the features are obviously not independent, so that an increase of, say, the population will lead to adding new schools for the additional children. Still, the weights show relative effects of the features: according to Table 3.1, adding a doctor's surgery in a town would lead to the maximum possible increase in post offices. Still, the maximum value is assigned to the intercept. What may this mean: the number of post offices in an empty town with no population, hospitals or petrol stations? Certainly not. The intercept expresses that part of the target variable which is relatively independent of the features taken into account.

It should be pointed out that the weight values are relative not just to feature concepts but to the specific scales in which the features are measured. A change of the scale, say 10-fold, would result in a corresponding, inverse, change of the weight (due to the linearity of the regression equation). This is why in statistics the relative weights are considered for the scales expressed in units of the standard deviation. To find them, one should multiply the weight for the current scale by the feature's standard deviation (see Table 3.2).

Table 3.2. Rescaled weight coefficients of input features at Post Office as target variable for Market towns.

Feature        Weight in natural scale, w   Standard deviation, s   Weight in standardized scale, w*s
POP_RES         0.0002                      6193.2                   1.3889
PSchools        0.1982                      2.7344                   0.5419
Doctors         0.2623                      1.3019                   0.3414
Hospitals      -0.2659                      0.5800                  -0.1542
Banks           0.0770                      4.3840                   0.3376
Superstores     0.0028                      1.7242                   0.0048
Petrol         -0.3894                      1.6370                  -0.6375
Intercept       0.5784                      0                        0

Amazingly, we can also see negative effects, specifically of features Petrol and Hospitals, on the target variable. This can be an artefact related to duplication of features; one can think of Hospitals being a duplicate of Doctors and Petrol of Superstores. Thus, before jumping to conclusions, one should check whether the minus disappears if the duplicates are removed from the set of features. As Table 3.3 shows, not in this case: the negative weights remain, though they have changed slightly, as have the other weights. This illustrates that the interpretation of linear regression coefficients should be cautious and restrained.

Table 3.3. Weight coefficients for reduced set of features at Post Office as target variable for Market towns data.
Feature        Weight
POP_RES         0.0003
PSchools        0.1823
Hospitals      -0.3167
Banks           0.0818
Petrol         -0.4072
Intercept       0.5898

The quality of approximation is evaluated by the minimum value D² in (3.3) averaged over the number of entities and related to the variance of the target variable. Its complement to 1, the determination coefficient, is defined by the equation

$$\rho^2 = 1 - \frac{D^2}{N\sigma^2(u)} \qquad (3.6)$$

The determination coefficient shows the proportion of the variance of u explained by the linear regression. Its square root, ρ, is referred to as the coefficient of multiple correlation between u and X = {x0, x1, x2, …, xp}.

In our example, the determination coefficient is ρ² = 0.83, that is, the seven features explain 83% of the variance of the Post Office feature, and the multiple correlation is ρ = 0.91. Curiously, the reduced set of five features (see Table 3.3) contributes almost the same, 82.4% of the variance of the target variable. This may make one wonder whether just the one Population feature could be enough for doing the regression. This can be tested with the 2D method described in section 2.1 or with the nD method of this section. According to the latter, one should use a matrix X with two columns, one the Population variable, the other the fictitious variable of all ones. This immediately leads to the slope 0.0003 and intercept 0.4015, though with a somewhat reduced determination coefficient, which is ρ² = 0.78 in this case. From the prediction point of view this may be all right, but the ultimately reduced set of features loses on interpretation.

3.3. Linear discrimination

The linear discrimination problem can be stated as follows. Let a set of N entities in the feature space X = {x0, x1, x2, …, xp} be partitioned into two classes, sometimes referred to as patterns, a "yes" class and a "no" class, such as, for instance, a set of banking customers in which a, typically very small, subset of fraudsters constitutes the "yes" class and that of the others the "no" class.
The problem is to find a function u = f(x0, x1, x2, …, xp) that would discriminate the two classes in such a way that u is positive for all entities in the "yes" class and negative for all the entities in the "no" class. When the discriminant function is assumed to be linear, so that u = w1*x1+w2*x2+…+wp*xp+w0 at constant w0, w1, …, wp, the problem is that of linear discrimination. It differs from linear regression only in that the target values ui here are binary, either "yes" or "no", so that this is a classification, rather than regression, problem. To make it quantitative, define ui = 1 if i belongs to the "yes" class and ui = -1 if i belongs to the "no" class. The intercept w0 is referred to, in the context of the discrimination/classification problem, as bias.

On Figure 3.2 entities (x1, x2, u) are presented by stars *, at u = 1, and circles, at u = -1. Vector w represents a set of coefficients of a linear classifier; the dashed line represents the set of all x's that are orthogonal to w, <w,x> = 0, which is the separating hyperplane.

[Figure 3.2. A geometric illustration of the separating hyperplane between circle and star classes.]

Figure 3.2 shows a relatively rare situation in which the two patterns can be separated by a hyperplane: the linear separability case.

A linear classifier is defined by a vector w so that if ûi = <w,xi> > 0, predict ůi = 1; if ûi = <w,xi> < 0, predict ůi = -1; that is, ůi = sign(<w,xi>). (Here the sign function is utilized as defined by the condition that sign(a) = 1 when a > 0, = -1 when a < 0, and = 0 when a = 0.)

Discriminant analysis DA and Bayesian decision rule

To find an appropriate w, even in the case when the "yes" and "no" classes are linearly separable, various criteria can be utilized. A most straightforward classifier is defined by the least-squares criterion of minimizing (3.3). This produces

$$w = (X^TX)^{-1}X^Tu \qquad (3.7)$$

The inverse matrix may not necessarily exist, which is the case when the rank of X is less than the number of columns.
If this is the case, then the so-called pseudo-inverse is utilized, which is not a big deal computationally. For example, in MatLab, one just puts pinv(X'*X) instead of inv(X'*X).

Solution (3.7) has two properties related to the so-called Bayesian decision rule in statistics. According to Bayes, all relevant knowledge of the world is known to the decision maker in the form of probability distributions; whatever data may occur afterwards, they may change the probabilities, hence the difference between prior probabilities and posterior, data-updated, probabilities. Specifically, assume that p1 and p2 are probabilities of two states of the world corresponding to the "positive" and "negative" classes; p1 and p2 are positive and sum up to unity. Assume furthermore that there are two probability density functions, f1(x1, x2, …, xp) and f2(x1, x2, …, xp), defining the generation of observed points x = {x1, x2, …, xp} corresponding to entities. [We drop for the moment the fictitious variable x0 in this explanation.] If an x = {x1, x2, …, xp} is actually observed, then, according to the well-known Bayes theorem from elementary probability theory, the posterior probabilities of the "positive" and "negative" classes change to

p(1 | x) = p1f1(x)/f(x) and p(2 | x) = p2f2(x)/f(x)     (3.8)

where f(x) = p1f1(x) + p2f2(x). Expressions (3.8) are referred to as likelihoods; they are used in an important approach of mathematical statistics, the maximum likelihood. According to this approach, parameters of the underlying distributions are assumed to have values maximising the likelihood of the observed data.

For our purposes, one expresses the proportion of errors as 1 - p(1 | x), if it is decided that the class of x is "positive", or 1 - p(2 | x), otherwise. To minimize the errors, one therefore needs to decide that the class is "positive" if p(1 | x) > p(2 | x) or, equivalently,

f1(x)/f2(x) > p2/p1     (3.9)

or "negative", if the reverse holds.
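Expressions (3.8) and rule (3.9) can be illustrated with a minimal sketch for a single feature. The priors (0.3 and 0.7) and the two unit-variance Gaussian densities centred at 2 and 0 are hypothetical values chosen only for illustration, as are the function names.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a one-dimensional Gaussian distribution."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))


def posteriors(x, p1, p2, f1, f2):
    """Posterior probabilities (3.8) of classes 1 and 2 given observation x."""
    a, b = p1 * f1(x), p2 * f2(x)
    return a / (a + b), b / (a + b)


def bayes_decide(x, p1, p2, f1, f2):
    """Rule (3.9): 'positive' if f1(x)/f2(x) > p2/p1, 'negative' otherwise."""
    return "positive" if p1 * f1(x) > p2 * f2(x) else "negative"


# Hypothetical priors and class densities (assumptions, not from the text).
p1, p2 = 0.3, 0.7
f1 = lambda x: gaussian_pdf(x, 2.0, 1.0)   # "positive" class
f2 = lambda x: gaussian_pdf(x, 0.0, 1.0)   # "negative" class

for x in (1.0, 2.0):
    print(x, posteriors(x, p1, p2, f1, f2), bayes_decide(x, p1, p2, f1, f2))
```

In this setting the midpoint x = 1, at which the two densities are equal, goes to the "negative" class because of its larger prior, whereas x = 2 goes to the "positive" class.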
This rule is referred to as the Bayesian decision rule. Under these assumptions, there is no way to get fewer errors on average. The Bayesian rule can be expressed via the function B(x) = p(1 | x) - p(2 | x) so that B(x) > 0 corresponds to class 1 and B(x) < 0 to class 2, with equation B(x) = 0 expressing the separating surface, the set of points separating the areas of minimum error decisions.

It turns out that the least-squares solution w minimizes, over all possible w, the summary squared difference between the linear decision rule function <w,x> and the Bayesian function B(x) (Duda, Hart, Stork, p. 243-245). Moreover, the least-squares linear decision rule is the Bayesian function B(x) if the class probability distributions f1(x) and f2(x) are Gaussian with the same covariance matrix, that is, are expressed with the formula

$$f_i(x) = \frac{\exp[-(x-\mu_i)^T\Sigma^{-1}(x-\mu_i)/2]}{[(2\pi)^p|\Sigma|]^{1/2}} \qquad (3.10)$$

where $\mu_i$ is the central point and $\Sigma$ the p x p covariance matrix of the Gaussian distribution. Moreover, in this case the optimal $w = \Sigma^{-1}(\mu_1 - \mu_2)$ (see Duda, Hart, Stork, p. 36-40). The Gaussian is a most popular density function (Figure… ).

Note that formula (3.7) leads to an infinite number of possible solutions.

A slightly different criterion, that of minimizing the ratio of the "within-class error" over the "out-of-class error", was proposed by R. Fisher (1936), as described by Duda, Hart and Stork 2001. Fisher's criterion, in fact, can be expressed with the least-squares criterion if the output vector u is changed for uf as follows: put N/N1 for the components of the first class, instead of +1, and put -N/N2 for the entities of the second class, instead of -1. Then the optimal w (3.7) at u = uf is a solution to Fisher's discriminant criterion (see Duda, Hart, Stork, 2001, pp. 242-243).

In spite of its good theoretical properties, the least-squares solution is not necessarily the best one at a specific data configuration. In fact, it may fail to separate the positives from the negatives even if they are linearly separable. Consider the following example.
Let there be 14 two-dimensional points presented in Table 3.4 (left-hand columns) and displayed in Figure 3.4 (a). Points 1, 2, 3, 4, and 6 belong to the positive class (dots on Figure 3.4), the others to the negative class (stars on Figure 3.4). Another set has been obtained by adding to each of the components a random number drawn from the normal distribution with zero mean and standard deviation 0.2; it is presented in the right-hand columns of Table 3.4 and in Figure 3.4 (b). The assignment of the perturbed points to classes is assumed to be the same.

Table 3.4. X-y coordinates of 14 points as given originally and perturbed with a white noise of standard deviation 0.2, that is, generated from the Gaussian distribution N(0,0.2).

Entity #    Original data      Perturbed data
            x       y          x       y
1           3.00    0.00       2.93   -0.03
2           3.00    1.00       2.83    0.91
3           3.50    1.00       3.60    0.98
4           3.50    0.00       3.80    0.31
5           4.00    1.00       3.89    0.88
6           1.50    4.00       1.33    3.73
7           2.00    4.00       1.95    4.09
8           2.00    5.00       2.13    4.82
9           2.00    4.50       1.83    4.51
10          1.50    5.00       1.26    4.87
11          2.00    4.00       1.98    4.11
12          2.00    5.00       1.99    5.11
13          2.00    4.50       2.10    4.46
14          1.50    5.00       1.38    4.59

The optimal vectors w according to formula (3.7) are presented in Table 3.5, along with that for the separating, dotted, line in Figure 3.4 (d).

Table 3.5. Coefficients of straight lines on Figure 3.4.

Coefficients at   LSE at Original data   LSE at Perturbed data   Dotted at Perturbed data
x                 -1.2422                -0.8124                 -0.8497
y                 -0.8270                -0.7020                 -0.7020
Intercept          5.2857                 3.8023                  3.7846

Q. Why are only 10 points shown on Figure 3.4 (b)?
A. Because points 11-14 are the same as 7-10.

Q. What would change if we remove the last four points so that only points 1-10 are left?
A. The least-squares solution will be separating again.

Q. Would it be possible that Fisher's separation criterion also leads to a failure in a linearly separable situation?
A. I think yes.

Figure 3.4. Figures (a) and (b) represent the original and perturbed data sets. The least squares optimal separating line is added in Figures (c) and (d), shown by solid.
Entity 5 falls into the "dot" class according to the solid line in Figure (d); a real separating line is shown dotted (Figure (d)).

Support vector machine SVM criterion

Another criterion would put the separating hyperplane just in the middle of an interval drawn through the closest points of the different patterns. This criterion produces what is referred to as the support vector machine, since it heavily relies on the points (support vectors) involved in the drawing of the separating hyperplane (shown by circles on Figure 3.3). The difference between the least-squares discriminant hyperplane and the support vector machine hyperplane stems from the differences in their criteria. The latter is based on the borderline objects only, whereas the former takes into account all entities, so that the further away an entity is, the more it may affect the solution, because of the quadratic nature of the least-squares criterion. Some may argue that both borderline and far-away entities can be rather randomly represented in the sample under investigation, so that neither should be taken into account: it is the "core" entities of patterns that should be separated. However, no such approach has been taken in the literature so far.

[Figure 3.3. The support vector machine based separation hyperplane, shown as solid line, along with the borderline points (support vectors) defining it, shown with circles.]

Kernels

Situations in which patterns are linearly separable are very rare; in real data, patterns are typically well intermingled. To tackle these typical situations, the data are nonlinearly transformed into a much higher dimensional space in which, because of both nonlinearity and high dimension, the patterns may be linearly separable. The transformation is performed only virtually, because what really matters is just the inner products between the transformed entities. The inner products in the transformed space can be computed with so-called kernel functions.
It is convenient to define a kernel function over vectors x = (xv) and y = (yv) through the squared Euclidean distance d²(x,y) = (x1-y1)² + … + (xV-yV)², because the resulting kernels form positive definite matrices. Arguably, the most popular is the Gaussian kernel:

$$K(x,y) = \exp(-d^2(x,y)) \qquad (3.11)$$

Q. What is the VC-dimension of the linear discrimination problem at p=2 (two input features)?
A. 3, because each of three points can be separated from the others by a line, but there can be such 4-point configurations that cannot be shattered using linear separators. Take, for instance, a rectangle whose vertices joined by a diagonal are labelled by "+" while the two others by "-": no line can reproduce that.

3.4. Decision Trees

This is a structure used for prediction of quantitative features (regression tree) or nominal features (classification tree). Each node corresponds to a subset of entities (the root to the set of all entities I), and its children are the subset's parts defined by a single predictor feature x. Each terminal node is assigned an individual target feature value u.

Example: Product-defined clusters of eight Companies

[Figure 1. Decision tree for three product based classes of Companies defined by categorical features. Node labels recovered from the figure: Sector (Util/Ind, Retail) and EC (No, Yes); leaves A, B, C.]

[Figure 2. Decision tree for three product-defined classes of Companies defined by quantitative features. Node labels recovered from the figure: NSup (<4, 4 or more) and ShaP (> 30, < 30); leaves A, B, C.]

Decision trees:
Advantages
- Interpretability
- Computation efficiency
Drawbacks
- Simplistic
- Imprecise

Algorithm: Take a node and a feature value(s) and split the corresponding subset accordingly.

Issues (classification tree):
Stop: Whether any node should be split at all.
Select: Which node of the tree and by which feature to split.
Score: Chi-squared (CHAID in SPSS), Entropy (C4.5), Change of Gini coefficient (CART).
Assign: What target class k to assign to a terminal node x. Conventionally, k* at which p(k/x) is maximised over k. I suggest: This is ok when p(k) is about 10%-30%.
Otherwise, use comparison between p(k/x) and p(k). Specifically, (i) if p(k) is of the order of 50%, the absolute Quetelet index a(k/x) = p(k/x) - p(k) should be used; (ii) if p(k) is of the order of 1% or less, the relative Quetelet index q(k/x) = [p(k/x) - p(k)]/p(k) should be employed.

4. Correlation: Learning neural networks

4.1. Steepest descent and perceptron for the square error minimisation

The machine learning paradigm is based on the assumption that a learning device adapts itself incrementally by facing entities one by one. This means that the full sample is never known to the device, so that global solutions, such as the projection (3.5), are not applicable. In such a situation an optimization algorithm that processes entities one by one should be applied. Such is the gradient method, also referred to as the steepest descent. With respect to the problem of building a linear classifier to minimise the square error (3.3), the algorithm can be stated as follows.

Steepest descent for the problem of linear discriminant analysis
0. Initialise weights w randomly.
1. For each training instance (xi, ui):
   a. Compute grad(Ei(w)), where Ei(w) is the part of criterion (3.3) related to the instance: $E_i = (u_i - w_1x_{i1} - w_2x_{i2} - \dots - w_px_{ip} - w_0x_{i0})^2$. Obviously, the t-th component of the gradient is $\partial E_i/\partial w_t = -2(u_i - \hat{u}_i)x_{it}$, t = 0, 1, …, p.
   b. Update weights w according to equation w(new) = w - μ grad(Ei(w)), so that wt(new) = wt + μ(ui - ûi)xit. (Here μ is put rather than 2μ because it is an arbitrary number anyway.)
2. If w(new) ≈ w(old), stop; otherwise go to step 1 with w = w(new).

This process is proven to converge provided that the gradient step μ is correctly set: it should not be so big that the minimum point is jumped over, nor so small that the point wt is still being updated when the difference becomes small.

Perceptron

In the nineteen-fifties, F. Rosenblatt proposed the following modification of the steepest descent, referred to as perceptron.

Perceptron algorithm
0. Initialise weights w randomly or to zero.
1. For each training instance (xi, ui):
   a. compute ůi = sign(<w,xi>)
   b. if ůi ≠ ui, update weights w according to equation w(new) = w(old) + μ(ui - ůi)xi, where μ, 0 < μ < 1, is the so-called learning rate.
2. Stopping rule: w(new) ≈ w(old).

Perceptron is a slightly modified form of the conventional gradient minimization algorithm: the partial derivative of Ei with respect to wt is equal to -2(ui - ûi)xit, which is similar to that used in the perceptron learning rule, -2(ui - ůi)xit. The innovation was to change the continuous ûi for the discrete ůi = sign(ûi) in the process of steepest descent. (The logic is the same as in the binary linear discriminant rule above.) However, further mathematical analysis shows that the perceptron can be considered not just an analogue but a gradient minimization algorithm in its own right, for a slightly different error function: the summary absolute values of the errors rather than their squares. Perceptron is proven to converge to the optimal w when the patterns are linearly separable.

4.2. Artificial neuron

[Figure 4.1. Scheme of a neuron cell.]

A linear classifier can be considered a model of the neuron cell in a living organism. A neuron cell fires an output when its summary input becomes higher than a threshold. A dendrite brings the signal in, the axon passes it out, and the firing occurs via a synapse, a gap between neurons, that makes the threshold (see Figure 4.1).

The decision rule ůi = sign(ûi) can be interpreted in terms of an artificial neuron as follows: features xi are input signals, weights wt are the wiring (axon) features, the bias w0 is the firing threshold, and sign() is the neuron activation function. This way, the perceptron can be considered one of the first examples of nature-inspired computation.

[Figure 4.2. A scheme of an artificial neuron.]

An artificial neuron consists of: a set of inputs (corresponding to x-features), wiring weights, and an activation function involving a firing threshold.
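The perceptron algorithm of section 4.1 can be sketched in a few lines of plain Python. The function names, the learning rate of 0.5 and the cap on the number of passes are illustrative assumptions. The data are the 14 original points of Table 3.4, which are linearly separable, so the process is expected to stop with no misclassified entity.

```python
def sign(a):
    """Sign function as defined in section 3.3: 1, -1 or 0."""
    return (a > 0) - (a < 0)


def perceptron(rows, u, rate=0.5, max_epochs=100000):
    """Perceptron learning on rows augmented with the unity feature x0 = 1."""
    x = [[1.0] + list(r) for r in rows]
    w = [0.0] * len(x[0])                       # step 0: initialise weights to zero
    for _ in range(max_epochs):
        changed = False
        for xi, ui in zip(x, u):                # step 1: entities one by one
            pred = sign(sum(wt * xt for wt, xt in zip(w, xi)))
            if pred != ui:                      # update only on a wrong prediction
                w = [wt + rate * (ui - pred) * xt for wt, xt in zip(w, xi)]
                changed = True
        if not changed:                         # step 2: w(new) = w(old), stop
            break
    return w


# Original (x, y) coordinates of Table 3.4; +1 for entities 1-4 and 6, -1 otherwise.
points = [(3, 0), (3, 1), (3.5, 1), (3.5, 0), (4, 1), (1.5, 4), (2, 4),
          (2, 5), (2, 4.5), (1.5, 5), (2, 4), (2, 5), (2, 4.5), (1.5, 5)]
u = [1, 1, 1, 1, -1, 1] + [-1] * 8
w = perceptron(points, u)
predicted = [sign(w[0] + w[1] * px + w[2] * py) for px, py in points]
print(w, predicted == u)
```

Because the stopping rule only requires that a full pass leaves w unchanged, the weights found are just one of many separating solutions and depend on the order in which the entities are presented.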
Two popular activation functions, besides the sign function $\mathring{u}_i = \mathrm{sign}(\hat{u}_i)$, are the linear activation function, $\mathring{u}_i = \hat{u}_i$ (we considered it when discussing the steepest descent), and the sigmoid activation function $\mathring{u}_i = s(\hat{u}_i)$, where
\[ s(x) = (1 + e^{-x})^{-1} \qquad (4.1) \]
is a smooth analogue of the sign function, except that its output is between 0 and 1, not -1 and 1 (see Figure 4.3 (b)). To imitate the perceptron, with its sign(x) output between -1 and 1, we first double the output interval and then subtract 1:
\[ \mathrm{th}(x) = 2 s(x) - 1 = 2 (1 + e^{-x})^{-1} - 1 . \qquad (4.1') \]
This function, illustrated in Figure 4.3 (c), is usually referred to as the hyperbolic tangent (strictly speaking, $2s(x) - 1 = \tanh(x/2)$, the hyperbolic tangent of the halved argument). In contrast to the sigmoid s(x), th(x) is symmetric about the origin: th(-x) = -th(x), like sign(x), which can be useful in some contexts.

Figure 4.3. Graphs of sign (a), sigmoid (b) and hyperbolic tangent (c) functions.

The sigmoid activation functions have nice mathematical properties: they are not only smooth, but their derivatives can be expressed through the functions themselves. Specifically,
\[ s'(x) = ((1 + e^{-x})^{-1})' = (-1)(1 + e^{-x})^{-2} (-1) e^{-x} = s(x)(1 - s(x)), \qquad (4.2) \]
\[ \mathrm{th}'(x) = [2 s(x) - 1]' = 2 s'(x) = 2 s(x)(1 - s(x)) = (1 + \mathrm{th}(x))(1 - \mathrm{th}(x))/2 . \qquad (4.2') \]

4.3. Learning with multi-layer neural nets

4.3.0. A case problem

Iris features come in pairs: the size (length and width) of sepals (features 1, 2) and that of petals (features 3, 4), the same numbering as in Project 4.1 below. It is likely that the sepal sizes and petal sizes are related.

Data: /advanced/ml/Data/iris.dat, a 150 x 4 matrix.

For any Iris specimen $x_i = (x_{i1}, x_{i2}, x_{i3}, x_{i4})$, i = 1, …, 150, consider $x = (x_{i1}, x_{i2})$ (sepal) as input and $u = (x_{i3}, x_{i4})$ (petal) as output. Find F such that $u \approx F(x)$.

4.3.1. One-hidden-layer NN

Build F as a neural network of three layers:
(a) an input layer that accepts $x = (x_{i1}, x_{i2})$ and the bias $x_0 = 1$ (see the previous lecture),
(b) an output layer producing the estimate û for the output $u = (x_{i3}, x_{i4})$, and
(c) an intermediate, hidden, layer to allow more flexibility in the space of feasible functions F (hidden, because it is not seen from the outside).

This structure (Figure 4.4) is generic in NN theory; it has been proven, for instance, that such a structure can exactly learn any subset of the set of entities. Moreover, any pre-specified continuous u = F(x) can be approximated with such a one-hidden-layer network if the number of hidden neurons is large enough (Cybenko 1989).

Figure 4.4. A feed-forward network with 2 input and 2 output features (no feedback loops). Layers: input (I, linear, indexed by i, with nodes I1, I2 for x1, x2 and I3 for the bias x0 = 1), hidden (II, sigmoid, indexed by j) and output (III, linear, indexed by k, producing û1 and û2). Weights I to II form the 3x3 matrix W = (wij), i = I1, I2, I3, j = II1, II2, II3. Weights II to III form the 3x2 matrix V = (vjk), j = II1, II2, II3, k = III1, III2.

4.3.2. Formula for the NN transformation F

Layers I and III are assumed to give the identical (linear) transformation; the hidden layer (II) applies the symmetrised sigmoid th(x) of (4.1').

Node j of the hidden layer II:
Input: $z_j = w_{1j} x_1 + w_{2j} x_2 + w_{3j} x_3$, which is the j-th component of the vector $z = \sum_i x_i w_{ij} = xW$, where x is the 1x3 input vector (its third component, x3, is the bias input x0 = 1 of node I3) and W = (wij) is the 3x3 weight matrix.
Output: $\mathrm{th}(z_j)$, j = 1, 2, 3, where th is the function (4.1').

Node k of the output layer III:
Output = Input, $\sum_j v_{jk}\, \mathrm{th}(z_j)$, which is the k-th component of the matrix product $\hat{u} = \mathrm{th}(z) V$.
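The node-by-node formulas of 4.3.2 can be sketched as a forward pass in Python. This is only an illustration: the weight values and the input vector are made up, and the names `th` and `forward` are hypothetical rather than taken from the course code.

```python
import math


def th(x):
    """Symmetric sigmoid (4.1'): th(x) = 2 / (1 + exp(-x)) - 1."""
    return 2.0 / (1.0 + math.exp(-x)) - 1.0


def forward(x, W, V):
    """Return the hidden outputs th(z) and the network output u_hat = th(x*W)*V.

    x is a list of inputs including the bias, W is len(x) by h, V is h by len(u).
    """
    h = len(W[0])
    z = [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(h)]
    hidden = [th(zj) for zj in z]
    u_hat = [sum(hidden[j] * V[j][k] for j in range(h)) for k in range(len(V[0]))]
    return hidden, u_hat


# Hypothetical weights for a net with 3 inputs (two features plus bias),
# 3 hidden neurons and 2 outputs, as in Figure 4.4.
W = [[0.5, -0.2, 0.1],
     [0.3, 0.8, -0.5],
     [0.0, 0.1, 0.2]]
V = [[1.0, -1.0],
     [0.5, 0.5],
     [-0.3, 0.2]]
hidden, u_hat = forward([0.4, -0.6, 1.0], W, V)
```

Because the output layer is linear, the components of u_hat are not confined to (-1, 1), whereas every hidden output is.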
Thus, the NN in Figure 4.4 transforms input x into output û as
\[ \hat{u} = \mathrm{th}(xW)\, V . \qquad (4.3) \]
If matrices W and V are known, (4.3) expresses, and computes, the unknown function u = F(x) in terms of th, W and V.

4.3.3. Learning problem

Find weight matrices W and V minimising the squared difference between the observed u and the û found with (4.3),
\[ E = d(u, \hat{u}) = \langle u - \mathrm{th}(xW)V,\; u - \mathrm{th}(xW)V \rangle , \qquad (4.4) \]
over the training entity set.

4.4. Learning weights with error back propagation

4.4.1. Updating formula

In NN applications, learning the weights W and V that minimise E is done with back-propagation, which imitates the gradient descent. It runs iterations of updating V and W, each based on the data of one entity (in our case, one of the 150 Iris specimens), with the input values in x = (xi) and output values in u = (uk). An update moves V and W in the anti-gradient direction:
\[ V(\mathrm{new}) = V(\mathrm{old}) - \eta\, g_V, \qquad W(\mathrm{new}) = W(\mathrm{old}) - \eta\, g_W , \qquad (4.5) \]
where $\eta$ is the learning rate (step size) and $g_V$, $g_W$ are the parts of the gradient of the error function E in (4.4) related to matrices V and W. Specifically, the error function is
\[ E = [(u_1 - \hat{u}_1)^2 + (u_2 - \hat{u}_2)^2]/2 , \qquad (4.6) \]
where $e_1 = u_1 - \hat{u}_1$ and $e_2 = u_2 - \hat{u}_2$ are the differences between the actual and predicted outputs. There are two items in E, one for each of the two outputs; the more outputs, the more items. The division by 2 is made to avoid the factor 2 in the derivatives of E. The equations for learning V and W can be written component-wise:
\[ v_{jk}(\mathrm{new}) = v_{jk}(\mathrm{old}) - \eta\, \partial E/\partial v_{jk}, \qquad w_{ij}(\mathrm{new}) = w_{ij}(\mathrm{old}) - \eta\, \partial E/\partial w_{ij} \quad (i \in I,\ j \in II,\ k \in III). \qquad (4.5') \]
To make these computable, let us express the derivatives explicitly; first those closer to the output, over $v_{jk}$:
\[ \partial E/\partial v_{jk} = -(u_k - \hat{u}_k)\, \partial \hat{u}_k/\partial v_{jk} . \]
The derivative $\partial \hat{u}_k/\partial v_{jk} = \mathrm{th}(z_j)$, since $\hat{u}_k = \sum_j \mathrm{th}(z_j)\, v_{jk}$. Thus,
\[ \partial E/\partial v_{jk} = -(u_k - \hat{u}_k)\, \mathrm{th}(z_j) . \qquad (4.7) \]
The derivative $\partial E/\partial w_{ij}$ refers to the next layer, that of W, which requires more chain derivatives. Specifically,
\[ \partial E/\partial w_{ij} = \sum_k [-(u_k - \hat{u}_k)\, \partial \hat{u}_k/\partial w_{ij}] . \]
Since $\hat{u}_k = \sum_j \mathrm{th}(\sum_i x_i w_{ij})\, v_{jk}$, this can be expressed as $\partial \hat{u}_k/\partial w_{ij} = v_{jk}\, \mathrm{th}'(\sum_i x_i w_{ij})\, x_i$.
The derivative th'(z) can be expressed according to (4.2'), which leads to the following final expression for the partial derivatives:
\[ \partial E/\partial w_{ij} = -\sum_k [(u_k - \hat{u}_k)\, v_{jk}]\, (1 + \mathrm{th}(z_j))(1 - \mathrm{th}(z_j))\, x_i / 2 . \qquad (4.8) \]
Equations (4.5), (4.7) and (4.8) lead to the following rule for processing an instance in the back-propagation algorithm (see 4.4.2).

4.4.2. Instance processing

1. Forward computation (of the output û and the error). Given matrices V and W, upon receiving an instance (x, u), the estimate û of vector u is computed according to the neural network as formalised in equation (4.3), and the error e = u - û is calculated.

2. Error back-propagation (for estimation of the gradient elements). Each neuron receives the relevant error estimate, which is $-e_k = -(u_k - \hat{u}_k)$, from (4.7), for output neurons k (k = III1, III2), or $-\sum_k [(u_k - \hat{u}_k)\, v_{jk}]$, from (4.8), for hidden neurons j (j = II1, II2, II3); the latter can be seen as the sum of the errors arriving from the output neurons, weighted by the corresponding synapse weights. These are used to obtain the derivative (4.7), or (4.8), by multiplying the error estimate by the neuron's local data depending on the input signal, which is $\mathrm{th}(z_j)$ for neuron k's source j in (4.7), and $\mathrm{th}'(z_j)\, x_i$ for neuron j's source i in (4.8).

3. Weights update. Matrices V and W are updated according to formula (4.5').

What is nice in this procedure is that the computation can be done locally, so that every neuron processes only the data that are available to this neuron, first from the input layer, then backwards, from the output layer. In particular, the algorithm does not change if the number of hidden neurons is changed from h = 3, as in Figure 4.4, to any other integer h = 1, 2, …, nor does it change if the number of inputs and/or outputs is changed. The procedure 4.4.2 can be easily extended to any feed-forward network, however many hidden layers it may have.

Steps 1-3 are performed for all available entities in a random order, which constitutes an epoch.
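Steps 1-3 of the instance processing can be sketched as follows in Python. This is a hedged illustration rather than the course's nnn.m: the function names, the fixed random seed, the single made-up instance and the learning rate 0.1 are all assumptions chosen for the example.

```python
import math
import random


def th(x):
    """Symmetric sigmoid (4.1')."""
    return 2.0 / (1.0 + math.exp(-x)) - 1.0


def forward(x, W, V):
    """Step 1: hidden outputs t = th(x*W) and estimate u_hat = t*V, as in (4.3)."""
    t = [th(sum(x[i] * W[i][j] for i in range(len(x)))) for j in range(len(V))]
    u_hat = [sum(t[j] * V[j][k] for j in range(len(V))) for k in range(len(V[0]))]
    return t, u_hat


def gradients(x, u, W, V):
    """Step 2: error E of (4.6) and its gradients over V (4.7) and W (4.8)."""
    t, u_hat = forward(x, W, V)
    e = [uk - uhk for uk, uhk in zip(u, u_hat)]
    g_V = [[-ek * tj for ek in e] for tj in t]                        # (4.7)
    # Error arriving at hidden neuron j: sum over k of e_k * v_jk.
    back = [sum(ek * vjk for ek, vjk in zip(e, V[j])) for j in range(len(V))]
    g_W = [[-back[j] * (1 + t[j]) * (1 - t[j]) * xi / 2               # (4.8)
            for j in range(len(V))] for xi in x]
    return sum(ek * ek for ek in e) / 2, g_V, g_W


def process_instance(x, u, W, V, rate):
    """Steps 1-3 for one instance; W and V are updated in place, E is returned."""
    error, g_V, g_W = gradients(x, u, W, V)
    for j, row in enumerate(g_V):                                     # (4.5')
        for k, g in enumerate(row):
            V[j][k] -= rate * g
    for i, row in enumerate(g_W):
        for j, g in enumerate(row):
            W[i][j] -= rate * g
    return error


rng = random.Random(0)                        # fixed seed, for reproducibility only
W = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(3)]   # N(0,1) start
V = [[rng.gauss(0, 1) for _ in range(2)] for _ in range(3)]
x, u = [0.4, -0.6, 1.0], [0.5, -0.2]          # hypothetical standardised instance
errors = [process_instance(x, u, W, V, rate=0.1) for _ in range(200)]
```

Both gradients are computed before either matrix is changed, so the update of W uses the same V that produced the error. Repeating the update on a single instance, as done here, only demonstrates that E goes down; real training cycles over all entities in random order.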
Thus, a number of epochs are executed until the matrices V and W are stabilised. Typically, one or even a thousand epochs is not enough for V and W to stabilise. Since, in practical calculations, this may take ages to achieve, other stopping criteria can be utilised:
(i) the difference between the average values (over the iterations within an epoch) of the error function (4.6) in consecutive epochs becomes smaller than a pre-specified threshold, such as 0.0001;
(ii) the number of epochs performed reaches a pre-specified threshold, such as 10,000.

4.5. Error back propagation algorithm (for a data set available as a whole, "offline")

Finally, one can formulate the error back propagation algorithm as follows.
A. Initialise weight matrices W = (wij) and V = (vjk) by using the random normal distribution N(0,1), with the mean at 0 and the variance 1.
B. Choose the data standardisation option, which amounts to selecting the shift and scale coefficients, av and bv, for each feature v, so that every data entry, xiv, is transformed to yiv = (xiv - av)/bv (see section 4.6 below).
C. Formulate the Halt-criterion as explained above and run a loop over epochs.
D. Randomise the order of entities within an epoch and run a loop of the Instance Processing of 4.4.2 in that order.
E. If the Halt-criterion is met, end the computation and output the results: W, V, û, e, and E. Otherwise, execute D again.

4.6. Data standardisation for NN learning

Due to the specifics of the binary target variables and activation functions, such as th(x) and sign(x), which have -1 and 1 as their boundaries, the data in the NN context are frequently pre-processed so that every feature ranges between -1 and 1, with the mid-range at 0. To achieve this, take the scale coefficient bv equal to the half-range, bv = (Mv - mv)/2, and the shift coefficient av equal to the mid-range, av = (Mv + mv)/2. Here Mv denotes the maximum and mv the minimum of feature v. Then transform all feature entries by first subtracting av from each and then dividing the results by bv.
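The mid-range and half-range standardisation just described can be sketched in Python as below; the function name and the sample values are hypothetical, and the guard against a constant feature is an addition not discussed in the text.

```python
def standardise_by_range(column):
    """Map a feature to [-1, 1]: y = (x - a) / b with mid-range a and half-range b."""
    high, low = max(column), min(column)
    a = (high + low) / 2          # shift coefficient: the mid-range
    b = (high - low) / 2          # scale coefficient: the half-range
    if b == 0:
        raise ValueError("constant feature: the half-range is zero")
    return [(x - a) / b for x in column]


petal_lengths = [1.4, 4.7, 6.9, 1.0, 5.1]     # hypothetical sample values
scaled = standardise_by_range(petal_lengths)
```

After the transformation the minimum of the feature maps to -1 and the maximum to 1, up to rounding.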
The practice of digital computation shows that it is a good idea to further expand the ranges into a [-10, 10] interval by afterwards multiplying all entries by 10: in this range, the numbers as represented in a computer lead to smaller computation errors than if they are closer to 0.

Project 4.1: One hidden layer NN for predicting Iris/Student data

Let us develop a Matlab code for learning NN weights with the back propagation algorithm according to the structure of Figure 4.4. Two parameters of the algorithm, the number of neurons in the hidden layer and the learning rate, will be input parameters. The output, in this case, should be the level of error achieved and the weight matrices V and W. The code should include the following steps:
1. Loading data from subdirectory Data. According to the task, this can be either iris.dat or studn.dat.
2. Normalising the data to the [-10, 10] scale according to the formulas in section 4.6.
3. Preparing input and output (target) sub-matrices after the decision has been made on which features fall in the former and which in the latter category. In the case of Iris data, for example, the target can be predicting the petal data (features w3 and w4) from the sepal measurements (features w1 and w2). In the case of Students data, the target can be the students' marks on all three subjects (CI, SP and OOP), with the other variables (occupation categories, age and number of children) as input.
4. Initialising the network with random N(0,1) normally distributed weights and setting a loop over epochs with the counter at zero.
5. Organising a loop over the entities in a random order; here the Matlab command randperm(n) for making a random permutation of integers 1, 2, …, n can be used.
6. Forward pass: given an entity, the output is calculated, as well as the error, using the current V, W and activation functions. We take here the symmetric sigmoid (4.1') as the activation function.
7. Error back-propagation: computing the gradient vectors for V and W according to formulas (4.7) and (4.8).
8. Updating weights V and W with the gradients computed and the learning rate accepted as the input.
9. Halt-condition including both the level of precision, say 0.01, and a threshold on the number of epochs, say 5,000. After either is reached the programme halts.

A Matlab code, nnn.m, including all nine steps is in Appendix 3. On the Iris data, this program leads to the average absolute errors at each of the output variables presented in Table 4.1 for different numbers of hidden neurons h. Note that the feature ranges are equal to 20 here, so that the relative average error is about 7% of the range.

h      |e1|    |e2|
3      1.07    1.77
6      0.99    1.68
10     0.97    1.63

Table 4.1. Absolute error values in the predicted petal dimensions with full Iris data after 5,000 epochs.

The number of parameters in matrices W and V here is 3h, in W, plus 2h, in V. One can see that the increase in h does bring some improvement, but not that great!

For the Students data, this program leads to the average errors in predicting student marks over three subjects presented in Table 4.2 for different numbers of hidden neurons h:

h      |e1|    |e2|    |e3|    # param.
3      2.65    3.16    3.17    27
6      2.29    3.03    2.75    54
10     2.17    3.00    2.64    90

Table 4.2. Absolute error values in the predicted student marks over all three subjects, with full Student data after 5,000 epochs.

Home-work:
1. Find the values of E for the errors reported in the tables above.
2. Take a look at what happens if the data are not normalised.
3. Take a look at what happens if the learning rate is increased, or decreased, ten times.
4. Extend the tables above for different numbers of hidden neurons.
5. Try petal sizes as input with sepal sizes as output.
6. Try predicting only one size/mark over all input variables.
7. Modify this code to involve the sigmoid activation function.
8. Find a way to improve the convergence of the process, for instance, with adaptive changes in the step size values.

Back propagation should be executed with a re-sampling scheme, such as the k-fold cross-validation, to provide estimates of the variation of the results with regard to changes in the data.

Q. The derivatives of the sigmoid (4.1) and hyperbolic tangent (4.1') functions appear to be simple polynomials of the functions themselves:
\[ s'(x) = [(1 + e^{-x})^{-1}]' = (-1)(1 + e^{-x})^{-2} (e^{-x})' = (-1)(1 + e^{-x})^{-2} e^{-x} (-1) = (1 + e^{-x})^{-2} e^{-x} = s(x)(1 - s(x)) . \]

5. Learning summarizations

5.1. General

5.1.1. Decision structures

Popular decision structures used for data aggregating are the same as those used for data associating and include the following.

(a) Partition of the entity set. A partition S = {S1, S2, …, SK} of the N-element entity set I into a set of non-empty non-overlapping clusters may model a typology or a categorical feature reflecting within-cluster similarities between objects. When the data of the entities are feature based, such a partition is frequently accompanied with a set c = {c1, c2, …, cK} of cluster centroids, each centroid ck being considered a "typical" representative of cluster Sk (k = 1, 2, …, K). This means that, on the aggregate level, the original entities are substituted by clusters Sk represented by vectors ck. That is, each entity $i \in S_k$ is represented by ck. This can be expressed formally by using the concept of decoder. A decoder, in this context, is a mapping from the set of clusters to the set of entities that allows recovering the original data from the aggregates, with a loss of information of course. If the data set is represented by an N×V matrix X = (xiv), where $i \in I$ are entities and v = 1, 2, …, V are features, then a decoder of clustering (S, c) can be expressed as
\[ c_{1v} z_{i1} + c_{2v} z_{i2} + \dots + c_{Kv} z_{iK} \approx x_{iv} , \]
where $z_k = (z_{ik})$ is the N-dimensional membership vector of cluster Sk defined by the condition that $z_{ik} = 1$ if $i \in S_k$ and $z_{ik} = 0$ otherwise.
Indeed, for every $i \in I$, there is only one item in the sum that is not zero, so that the sum, in fact, represents ck for that cluster Sk to which i belongs. Obviously, the closer the centroids fit the data, the better the clusters represent the data.

(b) Representative vectors. Sometimes the centroids alone, called representative, quantizing or learning vectors, are considered to represent the data. This is based on the implicit application of the principle that is called the minimum distance rule, in clustering, or the Voronoi diagram, in computational geometry. Given a set of points c = {c1, c2, …, cK} in the feature space, the minimum distance rule assigns every entity $i \in I$, and in fact every point in the space, to that ck (k = 1, 2, …, K) to which it is the closest. In this way, the set c is assigned a partition S, which relates to the structure (a) just discussed.

Given c = {c1, c2, …, cK} in the feature space, let us refer to the set of points that are closer to ck than to any other element of c as the gravity area G(ck). It is not difficult to prove that if the distance utilised is Euclidean, then the gravity areas are convex, that is, for any $x, y \in G(c_k)$, the straight segment between them also belongs to G(ck). Indeed, for any rival ck', consider the set of points Gk' that are closer to ck than to ck'. It is known that this set Gk' is but a half-space defined by the hyperplane $\langle x, g_{k'} \rangle = f_{k'}$, which is orthogonal to the segment joining ck and ck' and passes through its midpoint. Obviously, G(ck) is the intersection of the sets Gk' over all $k' \neq k$. Then the gravity area G(ck) is convex, since each half-space Gk' is convex and the intersection of a finite set of convex sets is convex too. The gravity areas G(ck) of three representative points on the plane are illustrated in Figure 5.1 using thick solid lines.

Figure 5.1. Voronoi diagram on the plane: three representative vectors, the stars, along with the triangle of dashed lines between them and solid lines being perpendiculars to the triangle sides at their middle points. The boundaries between the gravity areas of the representative vectors are highlighted.

(c) Feature transformation

Figure 5.2 presents a number of adults represented by their height and weight measurements; those overweight (pentagons) are separated from those of normal weight (pentagrams) by the dashed line expressing a common-sense maxim of slim-bodied individuals that "the weight in kg should not be greater than the height in cm short of a hundred".

Figure 5.2. A transformed feature, y = Height - Weight - 100 (dashed line), to separate pentagons from pentagrams. Axes: Height, cm (horizontal); Weight, kg (vertical).

In fact, the wisdom can be rephrased as stating that, in the matters of keeping weight normal, the single variable HWH = Height - Weight - 100 alone can stand, with a much better resolution, in place of the two original features, Height and Weight. This example shows that a decision rule is but an aggregate feature.

Having specified a number of linear, or not, transformations $z_w = f_w(x_1, x_2, \dots, x_V)$, w = 1, …, W (typically, W is supposed to be much smaller than V, though this may not be necessary in some contexts), one needs a decoder to recover the original data from the aggregates. A linear decoder can be specified by assuming a set of coefficients $c_v = (c_{vw})$ such that each linear combination
\[ \langle c_v, z \rangle = c_{v1} z_1 + c_{v2} z_2 + \dots + c_{vW} z_W = c_{v1} f_1(x_1, \dots, x_V) + c_{v2} f_2(x_1, \dots, x_V) + \dots + c_{vW} f_W(x_1, \dots, x_V) \]
can stand for the original variable v, v = 1, 2, …, V. In matrix terms this can be expressed as follows. Denote by Z = (ziw) the N×W matrix of values of the aggregate variables zw on the set of entities and by C = (cvw) the matrix whose rows are the vectors cv. Then the N×V matrix $X' = ZC^T$ is supposed to be the decoder of the original data matrix X. An obvious quality criterion for both the decoder and the transformation of the variables is the similarity between X' and X: the closer X' is to X, the better the transformed features reflect the original data.
But this cannot be the only criterion, because it is not enough to specify the values of the transformation and decoder coefficients. Indeed, assume that we have found the best possible transformation and decoder, leading to a very good data recovery matrix X'. Then Z* = ZA with decoder C* = CA, where A = (aww') is an orthogonal W×W matrix, so that $AA^T = I$ where I is the identity matrix, will produce the data recovery matrix $X^* = Z^* C^{*T}$ coinciding with X'. Indeed, $X^* = Z^* C^{*T} = Z A A^T C^T = Z C^T = X'$. Additional principles may involve requirements on the transformed features coming from both internal criteria, such as Occam's razor, and external criteria, such as the need to separate pre-specified patterns like those in Figure 5.2.

(d) Neural network: a non-linear, though linear-like, transformation (in general, we do not know what the non-linearity may be).

5.1.2. Least squares criterion

Criteria are needed both for finding the aggregates and for judging whether they are sound:
- error
- stability
- interpretability

Given N vectors forming a matrix X = {(xi)} of features observed at entities i = 1, …, N, so that xi = (xi1, …, xip), and a target set of aggregates U with a decoder $D: U \to R^p$, build an aggregate û = F(X), $\hat{u} \in U$, such that the error, which is the difference between the decoded data D(û) computed from û and the observed data X, is minimal over the class of admissible rules F. More explicitly, one assumes that
\[ X = D(\hat{u}) + E , \qquad (5.1) \]
where E is the matrix of residual values usually referred to as errors. The smaller the errors, the better the summarization û. According to the most popular, least-squares, approach, the errors can be minimised by minimising the summary squared error,
\[ E^2 = \langle X - D(\hat{u}),\, X - D(\hat{u}) \rangle = \langle X - D(F(X)),\, X - D(F(X)) \rangle , \qquad (5.2) \]
with respect to all admissible Fs and Ds. Expression (5.2) can be further decomposed into
\[ E^2 = \langle X, X \rangle - 2 \langle X, D(\hat{u}) \rangle + \langle D(\hat{u}), D(\hat{u}) \rangle . \]
In many situations, such as the Principal component analysis and K-Means clustering described later, the set of all possible decodings D(F(X)) forms a linear subspace.
In this case, the multidimensional points X, D(û) and 0 form a "right-angle triangle", so that $\langle X, D(\hat{u}) \rangle = \langle D(\hat{u}), D(\hat{u}) \rangle$ and expression (5.2) becomes a multivariate analogue of the Pythagorean equation relating the squares of the hypotenuse, X, and the sides, D(û) and E:
\[ \langle X, X \rangle = \langle D(\hat{u}), D(\hat{u}) \rangle + E^2 , \qquad (5.3) \]
or, on the level of matrix entries,
\[ \sum_{i \in I} \sum_{v \in V} x_{iv}^2 = \sum_{i \in I} \sum_{v \in V} d_{iv}^2 + \sum_{i \in I} \sum_{v \in V} e_{iv}^2 . \qquad (5.3') \]
We consider here that the data is an N x V matrix X = (xiv), a set of rows/entities xi (i = 1, …, N) or a set of columns/features xv (v = 1, …, V). The item on the left in (5.3') is usually referred to as the data scatter and denoted by T(X),
\[ T(X) = \sum_{i \in I} \sum_{v \in V} x_{iv}^2 . \qquad (5.4) \]
Why "scatter"? Because T(X) is the sum of the squared Euclidean distances from all entities to 0. T(X) is the sum of the entity contributions, the squared distances d(xi, 0) (i = 1, …, N), or of the feature contributions, the sums $t_v = \sum_{i \in I} x_{iv}^2$. In the case when the average cv has been subtracted from all values of column v, $t_v = N \sigma_v^2$, that is, N times the variance.

5.1.3. Data standardization

The decomposition (5.3) shows that the least-squares criterion in the problem of summarization depends heavily on the feature scales, so that the solution may be highly affected by scale changes. This was not the case in the correlation problems, at least with only one target feature, because the least squares were, in fact, just that feature's errors, thus expressed in the same scale.

To balance the contributions of the features to the data scatter, one conventionally applies the operation of standardisation. This operation applies to quantitative features only. Thus, when the data contain categorical features, these must first be expressed in a quantitative format. This can be done by considering each category as a feature on its own, sometimes referred to as a dummy variable, and recoding it quantitatively by assigning 1 to its "Yes" value and 0 to its "No" value.

Example 5.1. Consider the Company data set:

Company name   Income, $mln   SharP, $   NSup   EC    Sector
Aversi         19.0           43.7       2      No    Utility
Antyos         29.4           36.0       3      No    Utility
Astonite       23.9           38.0       3      No    Industrial
Bayermart      18.4           27.9       2      Yes   Utility
Breaktops      25.7           22.3       3      Yes   Industrial
Bumchist       12.1           16.9       2      Yes   Industrial
Civok          23.9           30.2       4      Yes   Retail
Cyberdam       27.2           58.0       5      Yes   Retail

It contains two categorical variables: EC, with categories Yes/No, and Sector, with categories Utility, Industrial and Retail. The former feature, EC, in fact represents just one category, "Using E-Commerce", and can be recoded as such by substituting 1 for Yes and 0 for No. The other feature, Sector, has three categories, each of which should be substituted by a dummy variable. To do this, we just introduce three category features: (i) Is it Utility sector?, (ii) Is it Industrial sector?, and (iii) Is it Retail sector?, each admitting Yes or No values, substituted by 1 and 0 respectively. This leads us to the following quantitative table.

Table 5.2. Quantitatively recoded Company data table.

Company name   Income   SharP   NSup   EC   Utility   Industrial   Retail
Aversi         19.0     43.7    2      0    1         0            0
Antyos         29.4     36.0    3      0    1         0            0
Astonite       23.9     38.0    3      0    0         1            0
Bayermart      18.4     27.9    2      1    1         0            0
Breaktops      25.7     22.3    3      1    0         1            0
Bumchist       12.1     16.9    2      1    0         1            0
Civok          23.9     30.2    4      1    0         0            1
Cyberdam       27.2     58.0    5      1    0         0            1

Standardisation: shift of the origin and rescaling to make features comparable,
\[ Y_{iv} = (X_{iv} - A_v)/B_v , \]
where
X is the original data,
Y is the standardised data,
i is an entity,
v is a feature,
Av is the shift of the origin, typically the average,
Bv is the rescaling factor, traditionally the standard deviation (from the statistics perspective), though the range may be better (from the CI perspective).

In particular, when a nominal feature is represented with 3 binary features corresponding to its 3 categories, the feature's summary contribution increases 3 times, which should be made up for by further dividing the entries by the square root of 3. Why the square root, not just 3? Because the contribution to the data scatter involves all the entries squared.

A typical data set

Visualising a 7-dimensional data set on a 2D screen (to be explained later, as part of PCA/SVD). Three scatter-plots of the eight companies are shown, one per standardisation option:
- No normalisation (Bv = 1).
- z-scoring.
- Recommended (to be explained later): normalising by range*sqrt(#categories). The clusters, much blurred in the previous figures, are clearly seen here.

The data with the averages subtracted and normalised by range*sqrt(number-of-categories):

      Income   SharP    NSup     EC       Utility   Industrial   Retail
e1    -0.20     0.23    -0.33    -0.63     0.36     -0.22        -0.14
e2     0.40     0.05     0       -0.63     0.36     -0.22        -0.14
e3     0.08     0.09     0       -0.63    -0.22      0.36        -0.14
e4    -0.23    -0.15    -0.33     0.38     0.36     -0.22        -0.14
e5     0.19    -0.29     0        0.38    -0.22      0.36        -0.14
e6    -0.60    -0.42    -0.33     0.38    -0.22      0.36        -0.14
e7     0.08    -0.10     0.33     0.38    -0.22     -0.22         0.43
e8     0.27     0.58     0.67     0.38    -0.22     -0.22         0.43

Note: there are only two values in each of the four columns on the right. Why?
Note: the entries within every column sum up to 0. Why?

Every row represents an entity as a 7-dimensional vector/point, for example e1 = (-0.20, 0.23, -0.33, -0.63, 0.36, -0.22, -0.14), a 1 x 7 matrix (array).
Every column represents a feature/category as an 8-dimensional vector/point, for example the first column, LS = (-0.20, 0.40, 0.08, -0.23, 0.19, -0.60, 0.08, 0.27)^T, an 8 x 1 matrix (array), or, its transpose, the 1 x 8 row LS^T = (-0.20, 0.40, 0.08, -0.23, 0.19, -0.60, 0.08, 0.27).

5.2. Principal component analysis

The method of principal component analysis (PCA) emerged in the research of inherited talent by F. Galton, first of all to measure talent. It is currently one of the most popular methods for data summarization and visualization. The mathematical structure and properties of the method are based on the so-called singular value decomposition (SVD) of data matrices; this is why in many publications the terms PCA and SVD are used as synonyms.
In the publications in the UK and USA, though, the term PCA frequently refers only to a technique for the analysis of the covariance/correlation matrix by extracting the most contributing linear combinations of features, which utilises no specific data model and thus is considered purely heuristic. In fact, this technique is equivalent to fitting a genuine data model, and it is that model that should be associated with the method.

Here is a list of problems in which the method is useful:
- Scoring students' abilities over marks on different subjects (F. Galton)
- Scoring quality of life in different cities based on their scorings over different aspects (housing, transportation, catering, pollution, etc.)
- Scoring different parts of a big company or government over their performances
- Visualizing documents and keywords with respect to their similarities
- Visualizing a set of cereals and their taste characteristics for developing a new cereal product concept
- Visualizing multidimensional entities in 1D or 2D space

Quiz: What could be a purpose of aggregating the features in the Market towns' data?

1. Model (for measuring talent, F. Galton)

Having students' marks xiv (i - student, v - discipline) observed, find the students' hidden ability scores zi and discipline loadings cv such that $x_{iv} \approx z_i c_v$, which can be made explicit, by using residuals eiv, as
\[ x_{iv} = c_v z_i + e_{iv} , \qquad (1) \]
where the residuals are minimised with the least squares criterion
\[ L^2 = \sum_{i \in I} \sum_{v \in V} e_{iv}^2 = \sum_{i \in I} \sum_{v \in V} (x_{iv} - c_v z_i)^2 . \]
This is a problem of approximating the data matrix with a matrix of rank one: N*M observed data entries are converted into N hidden scores plus M loadings (with N = 1000 and M = 100, N*M = 100,000 whereas N + M = 1,100).

Matrix of rank 1: a product of two vectors; for example, a=[1 4 2 0.5]'; b=[2 3 5]'; Here A' is matrix A transposed, so that vectors a and b are considered columns rather than rows.
The matrix whose elements are the products of the components of a and b, with the product * being the so-called matrix product, is presented below:

         2     3     5
a*b' =   8    12    20
         4     6    10
         1    1.5   2.5

The defining feature of this matrix: all rows are proportional to each other; all columns are proportional to each other. (For more detail, see any course in linear algebra or matrix analysis.) Curiously, the condition of statistical independence (within contingency data) can be reformulated as the contingency table being of rank 1.

Solution: The model in (1) has a flaw from the technical point of view: its solution cannot be defined uniquely! Indeed, assume that we have got the talent score zi for student i and the loading cv at subject v, to produce zi cv as the estimate of the student's mark at the subject. However, the same estimate will be produced if we halve the talent score while simultaneously doubling the loading: $z_i c_v = (z_i/2)(2 c_v)$. Any other divisor/multiplier would, obviously, do the same. To remedy this, fix the norms of vectors z and c, for instance, to be equal to 1, and treat the multiplicative effect of the two of them as a real $\mu$. Then we must put $\mu z^*_i c^*_v$ instead of $z_i c_v$, where z* and c* are the normed versions of z and c, and $\mu$ is their multiplicative effect. A vector x = (x1, …, xH) is said to be normed if its length is 1, that is, $x_1^2 + x_2^2 + \dots + x_H^2 = 1$. After the optimal $\mu$, z* and c* are determined, we can return to the talent score z and loading c with the formulas $z = \mu^{1/2} z^*$, $c = \mu^{1/2} c^*$.

Then the first-order optimality conditions imply that the least-squares solution to (1) satisfies the equations
\[ X^T z^* = \mu c^* \quad \text{and} \quad X c^* = \mu z^* , \qquad (2) \]
where $\mu$ is the maximum singular value of X. We refer to a triple ($\mu$, z*, c*) consisting of a real, $\mu$, and two vectors, c* (size M x 1) and z* (size N x 1), as a singular triple for X if it satisfies (2); $\mu$ is the singular value and z*, c* are the singular vectors.
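Equations (2) suggest a simple way to approximate the first singular triple: alternate the two products and re-norm the vectors each time. The Python sketch below is an illustration only, with hypothetical names; it uses the rank-one matrix a*b' from the example above and assumes the starting vector is not orthogonal to c*, so that no product turns into a zero vector.

```python
import math


def first_singular_triple(X, iterations=200):
    """Alternate z = X c and c = X^T z, norming each time, to approximate (mu, z*, c*)."""
    n, m = len(X), len(X[0])
    c = [1.0 / math.sqrt(m)] * m                       # normed starting vector
    mu = 0.0
    for _ in range(iterations):
        z = [sum(X[i][v] * c[v] for v in range(m)) for i in range(n)]
        norm_z = math.sqrt(sum(zi * zi for zi in z))
        z = [zi / norm_z for zi in z]                  # z* proportional to X c*
        c = [sum(X[i][v] * z[i] for i in range(n)) for v in range(m)]
        mu = math.sqrt(sum(cv * cv for cv in c))       # X^T z* = mu c*
        c = [cv / mu for cv in c]
    return mu, z, c


a = [1, 4, 2, 0.5]
b = [2, 3, 5]
X = [[ai * bv for bv in b] for ai in a]                # the rank-one matrix a*b'
mu, z_star, c_star = first_singular_triple(X)
scatter = sum(x * x for row in X for x in row)         # data scatter T(X)
```

For a matrix of rank one the single singular triple recovers the matrix exactly, so the data scatter T(X) coincides with the squared singular value.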
To understand (2) in full, one needs to know the definition of the product of an N x M matrix by an M x 1 vector: the matrix is a set of M columns, and the product is the sum of these columns weighted by the components of the M x 1 vector.

The equation on the right in (2) gives an important corollary to the solution:
(A) z is a linear combination of the columns of X weighted by c's components: c's components are the feature weights in the score z.

Another property is the Pythagorean decomposition
\[ T(X) = \mu^2 + L^2 , \qquad (3) \]
where T(X) is the data scatter (5.4). Equation (3) implies:
(B) The value $\mu^2$ expresses the part of the data scatter explained by the principal component z; its proportion is $\mu^2 / T(X)$.

This can be further extended to a set of K different ability factors, with students and subjects scored differently over them:
\[ x_{iv} = \sum_{k=1}^{K} c_{kv} z_{ik} + e_{iv} , \qquad (4) \]
with a similar decomposition of the data scatter,
\[ T(X) = \mu_1^2 + \mu_2^2 + \dots + \mu_K^2 + L^2 . \]

To fit (4), some mathematics is needed. Recall that a triple ($\mu$, c*, z*) satisfying (2) is referred to as a singular triple of matrix X, with $\mu$ the singular value and c*, z* the singular vectors corresponding to $\mu$. For any matrix X, the number of its positive singular values $\mu_1, \mu_2, \dots, \mu_r$ is finite and equal to the rank r of X; $r \le \min(M, N)$. If $\mu_k \neq \mu_l$, then the corresponding normed singular vectors zk and zl are unique (up to the sign) and orthogonal, as are ck and cl (k, l = 1, …, r); otherwise, they are not unique but can always be selected to be orthogonal. The matrix X admits the following singular-value decomposition (SVD), in terms of vectors and matrices:
\[ X = \sum_{k=1}^{r} \mu_k z^*_k c^{*T}_k = Z S C^T , \qquad (5') \]
where Z is the N×r matrix with columns z*k, C is the M×r matrix with columns c*k, and S is the r×r diagonal matrix with entries $\mu_k$ on the diagonal and all other entries zero. This implies that the least-squares fit of the PCA model (4) is given by the K maximal singular triples, with the decomposition of the data scatter presented above.
Equations (2) imply that $\mu^2$ and $c^*$ satisfy

$X^T X c^* = \mu^2 c^*$,   (6)

that is, $c^*$ is the eigenvector of the square M x M matrix $X^TX$ corresponding to its maximum eigenvalue $\lambda = \mu^2$. This matrix $X^TX$, divided by N, has an interesting statistical interpretation if all columns of X have been centred (mean-subtracted) and normed (std-normalised): its elements are correlation coefficients between the corresponding variables. (Note how a bivariate concept, the correlation coefficient, is carried through to multivariate data.) If the columns have not been normed, the matrix $X^TX/N$ is referred to as the covariance matrix; its diagonal elements are column variances. Similarly, $\mu^2$ and $z^*$ are the maximum eigenvalue and the corresponding eigenvector of the matrix of row-by-row inner products $XX^T$.

Quiz: could you write equations defining $\mu^2$ and $z^*$ (analogous to those for $\mu^2$ and $c^*$)? (Tip: matrix $XX^T$ should be involved.)

Footnote: In the English-language literature, PCA is introduced not via the model (1) but rather in terms of the derivative properties (A) and (B), leading to equations (6) for finding the loadings first and then the scores with equations (2). In this, the scatter of matrix $X^TX$ can be used for evaluation of the fit; it equals the sum of the squares of the eigenvalues $\lambda_k$, that is, $\sum_k \mu_k^4$!

In terms of individual entries, the SVD (5') reads

$x_{iv} = \sum_{k=1}^{r} \mu_k c^*_{kv} z^*_{ik}$.   (5)
Denote $A = X^TX/N$, the covariance matrix of the centred data (the correlation matrix if the columns have also been normed). Since eigenvectors of the square symmetric matrix A are mutually orthogonal, it can be decomposed over them as

$A = \sum_{k=1}^{r} \lambda_k c^*_k c^{*T}_k = C \Lambda C^T$   (7)

which can be derived from (5'); $\Lambda$ is the diagonal r × r matrix of A's eigenvalues $\lambda_k = \mu_k^2/N$ (for $X^TX$ itself, $\lambda_k = \mu_k^2$). Equation (7) is referred to as the spectral decomposition of A, the eigenvalues $\lambda_k$ constituting the spectrum of A.

2. Method: SVD decomposition
MatLab's svd.m function:
[Z, S, C]=svd(Y);
where Y is the data matrix X after standardisation (input).
Output (idealised):
Z - N × r matrix of r factor score columns (normed)
C - M × r matrix of corresponding discipline loading columns (normed)
S - r × r diagonal matrix of corresponding singular values sorted in descending order
r - matrix Y's rank

Matrix of rank 1: product of two vectors a=[1 4 2 0.5]; b=[2 3 5];

           2     3     5
a*b' =     8    12    20
           4     6    10
           1     1.5   2.5

(Here a and b are treated as column vectors; for the row vectors as typed above, the MATLAB command producing this matrix is a'*b.)
Matrix of rank r: a sum of r rank-one matrices.

Application
a. Data selection and pre-processing into a flat file: any of our files would do
b. Data standardisation:
   i. Shifting the origin: typically needed to put the origin in the middle of the data cloud
   ii. Scaling the features: typically needed to balance the feature contributions, such as in Market towns data
      - in Student marks: not needed (same scales);
      - use ranges rather than standard deviations
c. Computation: as it stands (see visual.m in \ml)
d.
Post-processing: depends on the application domain:
   - can be just visualization; do it
   - if re-standardization is needed from z to f, such as rescaling f=bz+a. To find two reals, b and a, we need two points to scale subjectively, depending on the goals. In Student marks, we may want f = 0 when all marks are zero and f = 100 when all marks are 100. This would lead to the following two equations: 0=b*0+a and 100=b*100*Sum cv+a, which imply that a=0 and b=1/Sum cv.
e. Interpretation and drawing conclusions
   - Number of the principal components: depending on the contributions
   - A principal component's meaning depends on c's components

PCA: talent score model and actually finding it

1. Principal Component: Analytical expression
Problem: Given subject marks of 20 students

 Math  Phys  Chem  Lang  History
  40    37    35    33    38
  96    93    84    85    96
  96    90    83    84    97
  97    90    85    85    94
  97    90    83    84    98
  96    90    84    83    95
  70    67    50    39    42
  64    63    67    76    89
  95    91    81    85    98
  21    19    17    17    21
  64    61    66    78    92
  63    63    67    77    90
  61    62    65    76    90
  62    61    67    78    92
  19    17    18    17    17
  98    90    85    83    97
  40    38    33    34    37
  73    66    49    39    45
  71    65    49    40    44
  72    65    49    41    42

find their talent scores $z_i$ approximating the data.

The optimal normed talent vector is a linear combination (weighted sum) of different subject marks, $z = Xc/\mu$, where X is the data matrix, $\mu$ the maximum singular value and c the corresponding normed singular vector (z is another normed singular vector corresponding to $\mu$; see footnote 4), that is,

$z = \tilde c_1 x_M + \tilde c_2 x_P + \tilde c_3 x_C + \tilde c_4 x_L + \tilde c_5 x_H$, where $\tilde c_v = c_v/\mu$.   (*)

The expression (*) has two meanings:
(a) It is a PCA-derived relation between 20 talent scores and marks in 20 rows of matrix X;
(b) It is a general relation between the features, talent and subject marks, that can be straightforwardly used for wider purposes such as assigning a talent score to a student outside of the sample. (Quiz: Do you know how to do that? A: Just put the student's marks into (*) and calculate the talent score.)

2.
Principal component: Geometric expression
Can we visualize the principal component (*)? Yes. Unfortunately, this cannot be done straightforwardly, in the space of the 6 variables (z, xM, xP, xC, xL, xH) involved, because (*) in this space corresponds to a hyperplane rather than a line. However, the subject loading vector $\tilde c=(\tilde c_1, \tilde c_2, \tilde c_3, \tilde c_4, \tilde c_5)$, or the normed vector $c=(c_1, c_2, c_3, c_4, c_5)$ itself, or any other proportional vector $\alpha c$, can be used for that, in the data feature space of 5D vectors x=(xM, xP, xC, xL, xH). These define a straight line through the origin, 0=(0,0,0,0,0), and c. What is the meaning of this line?

Footnote 4: Note that vectors here are boldfaced whereas scalars are not.

Figure 1. Line through 0 and c in the M-dimensional feature space is comprised of points $\alpha c$ at different $\alpha$'s (the figure marks $\alpha c$ at $\alpha>1$, at $0<\alpha<1$ and at $\alpha<0$).

Consider all talent score points $z=(z_1,\dots,z_N)$ that are normed, that is, satisfy $z^Tz=1$, that is, $z_1^2+\dots+z_N^2=1$: they form a sphere of radius 1 in the N-dimensional "entity" space (Fig. 2 (a)). The image of these points in the feature space, defined by applying the data matrix, $X^Tz$, forms a skewed sphere, an ellipsoid, in the feature space, consisting of points $\mu c$ where c is normed. The longest axis of this ellipsoid corresponds to the maximum $\mu$, that is, the first singular value of X.

Figure 2. Sphere $z^Tz=1$ in the entity space (a) and its image, ellipsoid $\mu c=X^Tz$, in the feature space (b). The first component, $c_1$, corresponds to the maximal axis of the ellipsoid, with its length equal to $2\mu_1$.

[Indeed, the first singular value $\mu_1$ and the corresponding normed singular vectors $c_1$, $z_1$ satisfy equation $Xc_1=\mu_1 z_1$ and, thus, its transpose, $c_1^TX^T=\mu_1 z_1^T$. Multiplying the latter by the former from the right, one gets $c_1^TX^TXc_1=\mu_1^2$, because $z_1^Tz_1=1$ since $z_1$ is normed.]

3. Principal component: Direction and data points
What does the longest axis have to do with the data?
The direction of the longest axis of the data ellipsoid minimises the summary distances (Euclidean squared) from data points to their projections on the line (see Fig. 3), so that the axis is the best possible 1D representation of the data. This property extends to all subspaces generated by the first principal components: the first two PCs make a plane best representing the data, the first three make a 3D space best representing the data, etc.

Figure 3. The direction of the longest axis of the data ellipsoid minimises the summary distances (Euclidean squared) from data points to their projections on the line.

Why should matrix X be centred (by subtracting the within-column average from all elements in each column), then? To better cover the data structure! (See Fig. 4.)

Figure 4. Effect of centring the data set on PCA: (a) data not centred, (b) same data after centring; the longer blue line corresponds to the direction of the first PC, the shorter one to the direction of the second PC (necessarily orthogonal).

Eigenface: An application related to face analysis. Quiz: Learn what it is and what it has to do with PCA by yourself (from the web).

Latent semantic analysis: An application to document analysis using document-to-keyword data (and applying equations (2) to include new data). Quiz: Learn what it is and what it has to do with PCA by yourself (from the web).

Correspondence analysis: Extension of PCA to contingency tables taking into account the data specifics (meaningful summing up of entries across the table). See section 5.1.4 in my book as well as on the web.
Questions

Example: applied to the Student marks file

 Math  Phys  Chem  Lang  History
  40    37    35    33    38
  96    93    84    85    96
  96    90    83    84    97
  97    90    85    85    94
  97    90    83    84    98
  96    90    84    83    95
  70    67    50    39    42
  64    63    67    76    89
  95    91    81    85    98
  21    19    17    17    21
  64    61    66    78    92
  63    63    67    77    90
  61    62    65    76    90
  62    61    67    78    92
  19    17    18    17    17
  98    90    85    83    97
  40    38    33    34    37
  73    66    49    39    45
  71    65    49    40    44
  72    65    49    41    42

For matrix X, the 5D vector of averages (grand means),
69.75 65.90 60.85 61.70 70.70
is to be subtracted from all entity vectors.

The loadings (in a transposed form): the data lead to two principal components contributing, respectively, 91.56% and 8.33% to the data scatter, thus leaving only 0.11% to the remaining three principal components, which amount to noise. To interpret these components, find their corresponding loadings:
c1 = [ 0.42  0.41  0.41  0.45  0.53]
c2 = [-0.59 -0.47 -0.02  0.38  0.53]
The first one shows that the weights of all features are almost equal to each other, except for History, whose weight is about 20% greater. The first principal component thus expresses the general ability. The second has the sciences' loadings negative, which shows that this component may correspond to arts abilities.

Quiz: Why is (a) the first all positive and (b) the second half negative? A: (a) All features are positively correlated, (b) the second must be orthogonal to the first.

Quiz: compare the first component's score with that of the average scoring. The vector of average scores rounded to integers, in the row order of the table above, is
37 91 90 90 90 90 54 72 90 19 72 72 71 72 18 91 36 54 54 54
Tip: To compare, compute the correlation coefficient.

Q.
Assume that there is a hidden feature z, assigning a value $z_i$ to each student i=1,…,100, that can alone, along with feature "loadings" c=(cAge, cSP, cOO, cCI), explain all the 100x4=400 entries in array X=xm=x(:,ii) obtained from stud.dat at the 4 features in ii, so that each of the entries can be represented, approximately, as the product of a corresponding z-value and a feature-specific loading: $x_{iv} \approx c_v z_i$.

The matrix $X^TX$ mentioned in property 4 above can be expressed as follows:
>> xtx=xms'*xms
xtx =
    7.5107    1.2987   -2.4989   -2.8167
    1.2987    6.0918    0.4335   -0.1543
   -2.4989    0.4335    6.0207    1.6660
   -2.8167   -0.1543    1.6660    4.7729
As xtx is proportional to the feature covariance matrix, one should notice a not quite straightforward character of the data manifesting itself in the fact that features 1 and 2, being co-related positively, have different relations to feature 3.

All four singular values and vectors can be found in Matlab with the operation of singular value decomposition (SVD):
>> [Z,S,C]=svd(xms);
Here matrix Z is a 100x100 array whose first four columns are the singular z vectors, C is a 4x4 array whose columns are the singular c vectors, and the first four rows of the 100x4 array S form a 4x4 diagonal matrix whose diagonal entries are the corresponding singular values, which can be shaped with the command
>> mu=S(1:4,1:4);

These are subjects that can be covered with this:
- Visualisation
- Evaluation
- Interpretation
We consider them in turn.

To visualise the data on a 2D plane, we need just two principal components, which can be defined by using the first and second singular values and the corresponding z vectors:
>> x1= Z(:,1)*sqrt(mu(1,1));
>> x2= Z(:,2)*sqrt(mu(2,2));
These can be seen as a 2D plot, on which groups of entities falling in categories such as Occupation:AN (entities from 70 to 100) and Occupation:IT (entities 1 to 35) can be highlighted:
>> subplot(1,2,1), plot(x1,x2,'k.'); % Fig. 8, picture on the left
>> subplot(1,2,2), plot(x1,x2,'k.', x1(1:35),x2(1:35),'b^', x1(70:100),x2(70:100),'ro');

Figure 8.
Scatter plot of student data 4D (Age, SP marks, OO marks, CI marks) row points on the plane of the two first principal components, after they have been centred and rescaled in file xms. Curiously, students of occupations AN (circled) and IT (triangled) occupy contiguous regions, top and left, respectively, of the plane, as can be seen on the right-hand picture.

To evaluate how well the data are approximated by the PC plane, according to equation (3) one needs to assess the summary contribution of the first two singular values squared to the total data scatter. To get the squares one can multiply matrix mu by itself and then see the proportion of the first two values in the total:
>> la=mu*mu
la =
   11.1889         0         0         0
         0    6.4820         0         0
         0         0    3.8269         0
         0         0         0    2.8982
>> 100*[la(1,1)+la(2,2)]/sum(sum(la))
ans = 72.43
This shows the PC plane takes 72.43% of the data scatter, which is not that bad for this type of data.

To interpret the results one should use the standardised coefficients c in expression (2) that come with the svd command as columns of matrix C. The first two columns, in transposed form, are:
>> interp=C(:,1:2)'
interp =
    0.7316    0.1587   -0.4858   -0.4511   %first singular vector
   -0.1405   -0.8933   -0.4165   -0.0937   %second singular vector
The coefficients straightforwardly show how much a principal component is affected by a feature.

The first component is positively related to Age and negatively to OO and CI marks; on average, it increases with Age increased and OO/CI marks decreased. The second component increases when SP and OO marks decrease (this obviously can be reverted by swapping all minuses for pluses). Thus, the first component can be interpreted as "age-related Computer science deterrence" and the second as "dislike of programming issues". Then the triangle and circle patterns on the right of Figure 8 show that IT labourers are on the minimum side of the age-related CS deterrence, whereas AN occupations are high on the second component scale.
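The evaluation step above amounts to simple arithmetic over the squared singular values. A minimal sketch in plain Python, using the four diagonal entries of la as printed above (the variable names are chosen here for illustration only):

```python
# Squared singular values: the diagonal of matrix la printed above
squared_singular_values = [11.1889, 6.4820, 3.8269, 2.8982]

# By (3), the data scatter is the sum of all squared singular values
scatter = sum(squared_singular_values)

# Contribution of each principal component to the data scatter, per cent
contributions = [100 * v / scatter for v in squared_singular_values]

# Share of the data scatter taken by the plane of the first two components
plane_share = contributions[0] + contributions[1]

print([round(c, 2) for c in contributions])
print(round(plane_share, 2))
```

The individual contributions show that neither of the two remaining components is negligible here, in contrast to the Student marks example.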
Component retention
Although two or three principal components are sufficient for the purposes of visualization, the issue of automatically determining the "right" number of components has attracted the attention of researchers. R. Cangelosi and A. Goriely (2007), Component retention in principal component analysis with application to cDNA microarray data, Biology Direct, 2:2, review twelve rules for choosing the number of principal components and, rather expectedly, note that none of them was better than the others in their experiments with generated and real data sets.

Is there a right number of components? This question is irrelevant if the user's goal is visualization: just two or three components, depending on the dimension of the screen. However, some consider the question when one wants to determine the "real" dimensionality of the data. Among the rules of thumb tested by Cangelosi and Goriely (2007) on real and simulated data, the simplest ones, such as (i) stop at that component whose contribution is less than the average contribution or, better, 70% of the average contribution, and (ii) take the largest contributing components so that their summary contribution reaches 80%, did remarkably well.

Rule (i) can be slightly modified with the Anomalous Pattern method described below (see section ……). According to this method, the average contribution is first subtracted from all the contributions. Let us denote the resulting values, in descending order, by $c_1, c_2, \dots, c_r$, where r is the rank of the data matrix, and compute, for every n=1,2,…,r, the squared sum of the first n of these values, C(n). Then the rule says that the number of components to retain is defined as the first n at which $c_{n+1}^2$ is greater than $2C(n)(c_{n+1} - 1/2n)$, which is obviously guaranteed if $c_{n+1} < 1/2n$ because the expression is negative then.
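The two simple rules (i) and (ii) are easy to put into code. The sketch below is an illustration in plain Python; the function names and the reading of "stop at that component" as "retain the components preceding it" are assumptions made here. It is applied to the squared singular values of the student data from the example above.

```python
def retain_by_average(contribs, factor=1.0):
    """Rule (i): retain leading components until one falls below
    factor * (average contribution). contribs must be in descending order."""
    threshold = factor * sum(contribs) / len(contribs)
    kept = 0
    for c in contribs:
        if c < threshold:
            break
        kept += 1
    return kept

def retain_by_cumulative(contribs, target=80.0):
    """Rule (ii): smallest number of leading components whose summary
    contribution reaches target per cent of the total."""
    total, running = sum(contribs), 0.0
    for n, c in enumerate(contribs, start=1):
        running += c
        if 100 * running / total >= target:
            return n
    return len(contribs)

# Squared singular values of the standardised student data (from the example)
contribs = [11.1889, 6.4820, 3.8269, 2.8982]

n_average = retain_by_average(contribs)            # threshold: the average
n_average_70 = retain_by_average(contribs, 0.7)    # threshold: 70% of average
n_cumulative = retain_by_cumulative(contribs)      # 80% of the data scatter
print(n_average, n_average_70, n_cumulative)
```

On these data both versions of rule (i) retain two components, whereas rule (ii) asks for three, since the first two reach only 72.43% of the data scatter.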
Clustering: K-Means partitioning
Clustering is a set of methods for finding and describing cohesive groups in data, typically, as "compact" clusters of entities in the feature space.

Some data patterns:
Figure 1. A clear cluster structure at (a); data clouds with no visible structure at (b) and (c).

Finding clusters is half the job; describing them is another half.
Duality of knowledge: cluster contents - extension; cluster description - intension.
If a cluster is clear-cut, it is easy to describe; if not, not.

Figure 2. Yellow cluster on the right: a1 < x < a2 & b1 < y < b2. Yellow cluster on the left: both false positive and false negative errors!

Example of a good cluster structure: W. Jevons (1835-1882), updated in Mirkin 1996. Pluto doesn't fit in the two clusters of planets: started a new cluster recently, September 2006.

K-Means: clusters update and centroids update
Entities are presented as multidimensional points (*)
0. Put K hypothetical centroids (seeds)
1. Assign points to the centroids using the Minimum distance rule
2. Put centroids in gravity centres of thus obtained clusters
3. Iterate 1. and 2. until convergence

Quiz: What is a gravity centre? (A: Given a set of points, its mean point)

Example (with Companies data)
Range standardised matrix (with additionally rescaled three category features, Ob, Pe, and Di, by dividing them over 3):

A. K-Means at K=3 and initial seeds at entities 2, 5, 7
0. Entities 2, 5, and 7 are chosen as centroids.
1.
Minimum distance rule: calculate the distance (Euclidean squared) from the centroids to each entity i and assign i to the closest centroid (highlighted with bold font). This produces clusters 1-2-3, 4-5-6, and 7-8.
2. Centroids update: calculate centroids of the clusters (rows: clusters; columns: the seven standardised features):

 cluster 1-2-3    .095    .124    .168   -.286   -.024    .243   -.216
 cluster 4-5-6   -.215   -.111    .024   -.222    .168    .500   -.216
 cluster 7-8      .179   -.625   -.144    .375   -.144    .375    .433

Bold font highlights outstanding values.
3. Check whether the new centroids coincide with those from the previous iteration. If not, go to 1: the same clusters 1-2-3, 4-5-6, and 7-8.
4. At 2, with clusters the same, centroids are the same as well. Output the result. Within-cluster means are in both versions: real (upper row) and standardised (lower row). The interpretation goes along the lines highlighted above: cluster 1 lacks SC, cluster 2 has small LS, cluster 3 has D in excess, etc.

B. K-Means at K=3 and initial seeds at entities 1, 2, 3:
Within-column minima are bold-faced, leading to clusters 1-4, 2, and 3-5-6-7-8. After updating the centroids, the next iteration will lead to clusters 1-4-6, 2, 3-5-7-8 that will remain stable at the next iteration. A disastrous result.

Quiz: Do this by yourself. Try also K=2 with seeds at 1 and 8.

K-Means' features:
Positive:
- Models typology building
- Computationally effective
- Can be incremental, one entity at a time
Negative:
- No advice on:
   o Data pre-processing
   o Number of clusters
   o Initial setting
- Instability of results
- Insufficient interpretation aids

Figure 3. Example to the two red-highlighted items above: two clusterings of a four-point set with K-Means, intuitive (right) and counter-intuitive (left); red stars as centroids.

Other:
- Convex cluster shapes (A body S is referred to as convex if, with every two points x and y in S, S also contains the entire interval of the straight line between x and y.)
Quiz: Why convex? (Hint: A semi-space is convex, and the intersection of convex bodies is convex too.)
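The batch K-Means steps 0 to 3 used in examples A and B can be sketched as follows. This is a minimal illustration in plain Python on a hypothetical six-point data set, not on the Companies data; the names sq_dist and k_means, and the policy of leaving the centroid of an emptied cluster unchanged, are assumptions made here.

```python
def sq_dist(a, b):
    """Euclidean squared distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, seeds, max_iter=100):
    """Batch K-Means from the given seeds.

    Returns (labels, centroids, w), where w is the summary within-cluster
    distance to centroids (Euclidean squared).
    """
    centroids = [list(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        # Step 1: Minimum distance rule
        new_labels = [min(range(len(centroids)),
                          key=lambda k: sq_dist(p, centroids[k]))
                      for p in points]
        if new_labels == labels:        # Step 3: clusters unchanged, so stop
            break
        labels = new_labels
        # Step 2: put centroids into gravity centres of the clusters
        for k in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == k]
            if members:                 # assumption: empty cluster keeps its centroid
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    w = sum(sq_dist(p, centroids[l]) for p, l in zip(points, labels))
    return labels, centroids, w

# Hypothetical data: two well-separated groups of three points each
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids, w = k_means(points, seeds=[points[0], points[3]])
print(labels, centroids, w)
```

Running the same function from other seeds, for instance two entities of the same group, is an easy way to see the dependence on initialisation listed among the negative features.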
K-Means criterion W(S, c): Denote partition $S=\{S_1,\dots,S_K\}$; an entity/row $y_i=(y_{iv})$ (i=1,…,N); cluster $S_k$'s centroid $c_k=(c_{kv})$ (k=1,…,K); v=1,…,M are feature indices. W(S, c) is the summary within-cluster entity-to-centroid distance (Euclidean squared), to be minimised:

$W(S,c) = \sum_{k=1}^{K} \sum_{i \in S_k} \sum_{v=1}^{M} (y_{iv} - c_{kv})^2 = \sum_{k=1}^{K} \sum_{i \in S_k} d(y_i, c_k)$

Figure 4. The distances (blue intervals) in criterion W(S,c).

Quiz: How many distances are in W(S, c)? (A: The number of entities N.) Does this number depend on the number of clusters K? (A: No.) Does the latter imply: the greater the K, the less W(S, c)? (A: Yes.) Why?

K-Means is alternating minimisation for W(S,c). Convergence guaranteed. (Quiz: Why?)

Quiz: Demonstrate that, at Masterpieces data, value W(S,c) at the author-based partition {1-2-3, 4-5-6, 7-8} is lower than at partition {1-4-6, 2, 3-5-7-8} found at seeds 1, 2 and 3.

Quiz: Assume d(yi, ck) in W(S, c) is the city-block distance rather than Euclidean squared. Could K-Means be adjusted to make it an alternating minimisation algorithm for the modified W(S,c)? (A: Yes, just use the city-block distance throughout, as well as within-cluster median points rather than gravity centres.) Would this make any difference?

PCA Model extended to K-Means clustering
Remember? Data $y_{iv}$ are modelled as summary contributions of talent factors k with products $c_{kv}z_{ik}$, $c_{kv}$ being feature v's loading and $z_{ik}$ factor k's score. Consider now $c_{kv}$ to be cluster k's centroid and $z_{ik}$ the belongingness/membership function: given cluster $S_k$, $z_{ik}=1$ if $i \in S_k$ and $z_{ik}=0$ if $i \notin S_k$. Formula (4'): for any cluster $S_k$ and any entity $i \in S_k$, $y_{iv}$ is equal to $c_{kv}$ up to residuals $e_{iv}$ to be minimised: a data recovery model, a rather simplistic one.

$y_{iv} = \sum_{k=1}^{K} c_{kv} z_{ik} + e_{iv}$,   (4')

Least-squares criterion $L^2$ for (4'): $L^2 = W(S, c)$!
Moreover, the same data scatter decomposition holds:

$T(Y) = \mu_1^2 + \mu_2^2 + \dots + \mu_K^2 + L^2$

with $\mu_k^2$ being analogous to the $YY^T$ eigenvalues:

$\mu_k^2 = z_k^T YY^T z_k / z_k^T z_k = \sum_v c_{kv}^2 |S_k|$   (8)

Quiz: What is the difference between PCA model (4) and clustering model (4')?

What are all these technical details for?
o Data standardisation
o Algorithms
   - Spectral clustering
   - Anomalous pattern
   - iK-Means
o Additional interpretation aids
   - Correlation with features
   - Scatter decomposition ScaD table

o Data standardisation
Because of the data scatter decomposition, standardisation in clustering can be done on the basis of data scatter, as advised for PCA:
- pre-process data into quantitative format,
- subtract within-column means,
- divide by the range.

o Algorithms
Anomalous pattern
PCA strategy: one cluster at a time. From (8), find a cluster S maximising its contribution to the data scatter T(Y):

$\mu^2 = z^T YY^T z / z^T z = \sum_v c_v^2 |S| = d(0,c)|S|$   (9)

that is, the distance to 0 weighted by the size |S|.

One cluster clustering with Anomalous Pattern (example cluster: Tom Sawyer)
1. Put seed c=(cv) into an entity furthest from 0.
2. Cluster update: Take S to consist of all entities that are closer to c than to 0.
3. Centroid update: Take the gravity centre of S as c.
4. Reiterate 2 and 3 until convergence.
- Similar to 2-means except that the anomalous centroid is the only one to change.
Clusters can be extracted one by one, with contributions (9) showing cluster saliencies: incomplete clustering.

iK-Means (K-Means with intelligent initialisation)
1. Extract clusters one-by-one using Anomalous pattern.
2. Specify at what size s (s=1, 2,…) a cluster should be discarded, and remove all AP clusters of size s or less.
3. Take K to be the number of remaining clusters and use their centroids as initial seeds.
This method has shown its superiority over a number of other methods in our experiments with generated data.

Spectral clustering (a recent very popular technique)
1. Find eigenvectors $z^1, z^2, \dots, z^K$ of the N×N matrix $YY^T$.
(This can be done fast, for instance, by finding eigenvectors $c^k$ of the M×M matrix $Y^TY$: in many applications, M is of the order of 50 while N is of the order of 100,000. Then $z^k$ are found from $c^k$ by using formula (2) from the previous lecture.)
2. Given an optimal PC $z^k$, find $S_k$ as the set of indices corresponding to the largest components of $z^k$.
Not necessarily optimal. Can be applied in other settings such as distances.

Alternating minimisation algorithm for f(x,y): a sequence y0, x1, y1, x2, y2, …, xt, yt, … Find x minimising f(x,y0); take this as x1. Given x1, find y minimising f(x1,y); take it as y1. Reiterate until convergence.

K-Means, with Euclidean distance squared and gravity centres, is alternating minimisation for W(S,c). Convergence guaranteed. (Quiz: Why? A: The criterion can only decrease at steps, so that no partition can re-appear, and the number of possible partitions is finite.)

Issues to address:
o Data pre-processing
o Number of clusters
o Initial setting
o Insufficient interpretation aids

Most popular approach to tackle the issue of initialisation in K-Means
1. Take a range of K, say, from K=4 to K=25.
2. For each of the K, run K-Means clustering a (five) hundred times starting from random seeds (centroids); typically, K random entities are sampled as seeds. Select the best result, with its W criterion value $W_K$.
3. Compare the $W_K$ at different K and select that K, and its respective result, at which $W_K$ "jumps".
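Steps 1 and 2 of this multiple-run approach can be sketched as follows, in plain Python. The data set, the number of restarts (20 rather than a hundred), the range of K and the function names are all assumptions made for this illustration; the K-Means routine is a minimal batch version with Euclidean squared distance.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, seeds, max_iter=100):
    """Minimal batch K-Means; returns (labels, centroids, W)."""
    centroids = [list(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(len(centroids)),
                          key=lambda k: sq_dist(p, centroids[k]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for k in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == k]
            if members:                 # assumption: empty cluster keeps its centroid
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    w = sum(sq_dist(p, centroids[l]) for p, l in zip(points, labels))
    return labels, centroids, w

def best_w_by_k(points, k_values, restarts=20, rng=None):
    """For each K, run K-Means from random entity seeds and keep the least W."""
    rng = rng or random.Random(0)       # fixed seed for reproducibility
    return {k: min(k_means(points, rng.sample(points, k))[2]
                   for _ in range(restarts))
            for k in k_values}

# Hypothetical data: three groups of three points each
points = [(0, 0), (0, 1), (1, 0),
          (10, 0), (10, 1), (11, 0),
          (0, 10), (0, 11), (1, 10)]
w_by_k = best_w_by_k(points, [1, 2, 3, 4])
for k in sorted(w_by_k):
    print(k, round(w_by_k[k], 3))
```

The printed values of W at consecutive K are what step 3 compares when looking for a "jump".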
Of several approaches to formalising the "jump" concept, in my experiments the best, in terms of K rather than clustering, was Hartigan's rule utilising the measure

$H_K = [W_K / W_{K+1} - 1](N-K-1)$

Hartigan's rule of thumb: Start at K=1, halt at the K at which $H_K$ becomes less than 10.
I have a different advice: iK-Means.

Contribution of feature F, as explained by cluster partition $S=\{S_1,\dots,S_K\}$, to the data scatter,

$Contr(S,F) = \sum_{v \in F} \sum_{k=1}^{K} c_{kv}^2 |S_k|$   (10)

is proportional to:
- the correlation ratio $\eta^2$ if F is quantitative;
- a contingency coefficient between cluster partition S and F, if F is nominal, that is:
   - Pearson chi-square (if Poisson normalised)
   - Goodman-Kruskal tau-b (range normalised)

Indeed, after standardisation of a nominal F, cluster k centroid's components over $v \in F$ are $c_{kv} = (p_{kv}/p_k - p_v)/B_v$, where $p_v$ is the frequency of v, $p_k$ the proportion of entities in $S_k$, $p_{kv}$ the proportion of entities falling in both v and $S_k$, and $B_v$ the normalising scale coefficient. Then the summary contribution (10) is

$Contr(S,F) = N \sum_{k,v} (p_{kv} - p_k p_v)^2 / (p_k B_v^2)$

At $B_v=1$ (range) this is tau-b, the change of the error of proportional prediction of v's, and at $B_v^2=p_v$ (Poisson normalised) it is Pearson chi-squared.
Leads to: (a) conceptual clustering and (b) standardising suggestions.

Scatter decomposition ScaD table: cluster-specific feature contributions $c_{kv}^2|S_k|$ in a table with rows corresponding to clusters, columns to features.

Table 1. ScaD for Authorship clusters at Masterpieces data. Two rows just above the bottom show summary explained contributions of the clustering to features and their complements to the total feature contributions (bottom line), the unexplained parts.

The contributions follow from the algebra, but there is a geometric intuition in them as well (see Figure 5 below).

Figure 5. Contributions of features x and y in the group of blue-circled points are proportional to squared differences between their values at the grand mean (red star) and within-group mean, centroid (yellow star).
The x-difference is much greater; thus the group can be separated from the rest along x much more easily than along y.

Large contributions are highlighted in Table 1 with symbols related to the authors (Dickens, Twain and Tolstoy). Highlighted features exclusively describe author clusters conceptually:
C. Dickens: SCon = No
M. Twain: LenD < 28
L. Tolstoy: NChar > 3 or Presentat.Direct,
which is not necessarily so at other data sets.

A deeper optimum with nature-inspired approaches

GA for K-Means clustering
Recall that K-Means clustering is a method for finding a partition of a given set of N entities, represented by rows $y_i=(y_{i1},\dots,y_{iV})$ (i=1,…,N) of data matrix $Y=(y_{iv})$, in K clusters $S_k$ (k=1,…,K) with centres (centroids) $c_k=(c_{k1},\dots,c_{kV})$ defined as the means of within-cluster rows. This method alternatingly minimises the summary within-cluster distance

$W = W(S,c) = \sum_{k=1}^{K} \sum_{i \in S_k} d(y_i, c_k)$   (*)

where $d(y_i,c_k) = (y_{i1}-c_{k1})^2 + (y_{i2}-c_{k2})^2 + \dots + (y_{iV}-c_{kV})^2$ is the Euclidean distance squared between entity i and its cluster's centroid.

To apply the GA approach, define the concept of chromosome. Let the chromosome representing partition $S=\{S_1,\dots,S_K\}$ be the string of cluster labels assigned to entities in the order i=1,…,N. If, for instance, N=8, and the entities are e1, e2, e3, e4, e5, e6, e7, e8, then the string 12333112 represents partition S with three classes, S1={e1, e6, e7}, S2={e2, e8}, and S3={e3, e4, e5}, which can be easily seen from the diagram

e1 e2 e3 e4 e5 e6 e7 e8
 1  2  3  3  3  1  1  2

A string of N numbers is considered "illegal" if some numbers between 1 and K are absent from it (so that the corresponding classes are empty). Such an illegal string in the example above would be 11333111: it makes class S2 empty.

A GA for minimising the function W:
0. Initial setting. Fix the population size P, an even integer (no rules exist for this), and randomly generate P legal strings s1,…,sP of length N composed of the integers 1,…,K.
For each of the strings, define the corresponding clusters, calculate their centroids as gravity centres and the value of the criterion, W(s1), …, W(sP), according to formula (*).
1. Selection. Select P/2 pairs of strings to mate; each of the pairs is to produce two "children" strings. The mating pairs usually are selected randomly (with replacement, so that the same string can form both parents in a pair). To mimic Darwin's "survival of the fittest", the probability of selection of string st (t=1,…,P) should reflect its fitness value W(st). Since the fitness is greater for the smaller W value, some make the probability inversely proportional to W(st) (see Murthy, Chowdhury, 1996) and some make it proportional to the difference between a rather large number and W(st) (see Yi Lu et al. 2004). (I would suggest making it proportional to the explained part of the data scatter defined above, which may lead to different results.)
2. Cross-over. For each of the mating pairs, generate a random number r between 0 and 1. If r is smaller than a pre-specified probability p (typically, p is taken about 0.7-0.8), then perform a crossover; otherwise the mates themselves are considered the result. A (single-point) crossover of strings sa=a1a2…aN and sb=b1b2…bN is performed as follows. A random number n between 1 and N-1 is selected and the strings are crossed over to produce children a1a2…anb(n+1)…bN and b1b2…bna(n+1)…aN. If a child is "illegal" (like, for instance, strings a=11133222 and b=32123311 crossed over at n=4 to produce a'=11133311 and b'=32123222; a' is illegal here), then various policies can be pursued. Some authors suggest that the crossover operation be repeated until a legal pair is produced. Some say illegal chromosomes are ok, just that they must be assigned a lesser probability of selection.
3. Mutation. Mutation is a random alteration of a character in a chromosome. This provides a mechanism for jumping to different ravines of the minimised function.
Every character in every string is subject to mutation with a low probability q, which can be constant or inversely proportional to the distance between the corresponding entity and the corresponding centroid.

4. Elitist survival. This strategy suggests storing the best fitting chromosome (the record) separately. After the crossover and mutations have been done, find fitness values for the new generation of chromosomes. Check whether the worst of them is better than the record or not. If not, put the record chromosome into the population instead of the worst one. Then find the record for the population thus obtained.

5. Stopping condition. Check the stop condition (typically, a limit on the number of iterations). If it does not hold, go to 1; otherwise, halt.

Yi Lu et al. (Bioinformatics, 2004) note that such a GA works much faster if, after step 3 (Mutation) is done, the labels are changed according to the Minimum distance rule. They apply this instead of the elitist survival.

Shortcoming of the GA algorithm: long chromosomes (of size N). Can it be overcome? Yes, by using centroids rather than the partition to represent a clustering. A set of K centroids c1 = (c11, …, c1v, …, c1M), …, cK = (cK1, …, cKv, …, cKM) can be considered a sequence of K*M numbers, thus a string whose size does not depend on N. Another advantage: such a string can be changed softly, in a quantitative manner, by adding or subtracting a small change rather than by switching to another symbol.

Evolutionary K-Means

The chromosome is represented by the set of K centroids c1, c2, …, cK, which can be considered a string of K*M real ("float") numbers. In contrast to the previous representation, the length of the string here does not depend on the number of entities, which can be an advantage when the number of entities is massive. Each centroid in the string is analogous to a gene in the chromosome.
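The two chromosome encodings discussed above can be compared in a short Python sketch. It decodes the label string 12333112 from the text, checks legality, reproduces the crossover example a = 11133222, b = 32123311 at n = 4, and flattens the centroids into a string of K*M reals. The function names and the toy two-feature data set are assumptions made for this illustration only; they do not come from the source.

```python
def clusters_from_labels(labels, k):
    """Decode a label string into K lists of entity indices (0-based)."""
    clusters = [[] for _ in range(k)]
    for i, lab in enumerate(labels):
        clusters[lab - 1].append(i)
    return clusters


def is_legal(labels, k):
    """A string is legal if every label 1..K occurs in it at least once."""
    return set(labels) == set(range(1, k + 1))


def centroids(data, labels, k):
    """Within-cluster means; assumes the label string is legal."""
    n_features = len(data[0])
    result = []
    for members in clusters_from_labels(labels, k):
        result.append([sum(data[i][v] for i in members) / len(members)
                       for v in range(n_features)])
    return result


def crossover(a, b, n):
    """Single-point crossover after position n (1 <= n < len(a))."""
    return a[:n] + b[n:], b[:n] + a[n:]


def to_centroid_chromosome(cents):
    """Flatten K centroids of M features each into one string of K*M reals."""
    return [x for c in cents for x in c]


# Hypothetical toy data: 8 entities, 2 features (not from the source).
data = [(1.0, 1.0), (5.0, 5.0), (9.0, 1.0), (9.0, 2.0),
        (9.0, 3.0), (1.0, 2.0), (1.0, 3.0), (5.0, 7.0)]
labels = [1, 2, 3, 3, 3, 1, 1, 2]            # the string 12333112

print(clusters_from_labels(labels, 3))        # entity indices per cluster
print(is_legal([1, 1, 3, 3, 3, 1, 1, 1], 3))  # 11333111 leaves S2 empty

child_a, child_b = crossover([1, 1, 1, 3, 3, 2, 2, 2],
                             [3, 2, 1, 2, 3, 3, 1, 1], 4)
print(child_a, is_legal(child_a, 3))
print(child_b, is_legal(child_b, 3))

# Length K*M = 6, whatever the number of entities N.
print(to_centroid_chromosome(centroids(data, labels, 3)))
```

The label string grows with N, whereas the centroid string has length K*M regardless of N, which is the point made in the paragraph above.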
The crossover of two centroid strings c and c', each of length K*M, is performed at a randomly selected place n, 1 <= n < K*M, exactly as in the genetic algorithm above. Chromosomes c and c' exchange the portions lying to the right of the n-th component to produce two offspring. This means that, in fact, at most one centroid in each offspring is new, the one containing the cut point; all the other centroids are copied whole from one of the parents.

The process of mutation, according to Bandyopadhyay and Maulik (2002), is organised as follows. Given the fitness values W of all the chromosomes, let minW and maxW denote their minimum and maximum, respectively. For each chromosome, its radius R is defined as a proportion of maxW reached at it: R = (W - minW)/(maxW - minW), where W is the fitness value of the chromosome under consideration. When the denominator is 0, that is, if minW = maxW, define each radius as 1. Then the mutation intensity δ is generated randomly in the interval between -R and +R. Let minxv and maxxv denote the minimum and maximum values in the data set along feature v (v = 1, …, M). Then every v-th component xv of the chromosome's centroid changes to

xv + δ(maxxv - xv) if δ >= 0 (increase), or
xv + δ(xv - minxv) otherwise (decrease).

The perturbation leaves chromosomes within the hyper-rectangle defined by the boundaries minxv and maxxv. Please note that the best chromosome, at which W = minW, does not change in this process. Elitism is maintained in the process as well. The algorithm follows the scheme outlined for the genetic algorithm.

Based on limited experimentation, this algorithm is said to converge many times faster than the previous one, the GA.

Differential evolution and K-Means

This process is very similar to those previously described, except that here the crossover and mutation are merged together in the following rather tricky way.

An offspring chromosome is created for every chromosome j in the population (j = 1, …, P) as follows. (Remember that a chromosome is a set of K centroids here.)
Three other chromosomes, k, l and m, are taken randomly from the population. Then, for every component (gene) x.j of chromosome j, a uniformly random value r between 0 and 1 is drawn. This value is compared to a pre-specified probability p (typically between 0.5 and 0.8). If r > p, the component goes to the offspring unchanged. Otherwise, the component is substituted by the linear combination of the three other chromosomes, x.m + α(x.k - x.l), where α is a small scaling parameter. After the offspring's fitness is evaluated, it substitutes chromosome j if its fitness is better; otherwise, j remains as is and the process applies to the next chromosome.

Krink and Paterlini (2005) claim that this method outperforms the others in K-Means clustering.

Particle swarm optimisation and K-Means

This is a very different method. The population members here are neither crossbred nor mutated. They just move randomly, drifting in the directions of the best places visited, individually and socially. This can be done because they are vectors of reals. Because of the change, the genetic metaphor is abandoned here: the elements are referred to as particles rather than chromosomes, and the set of them as a swarm rather than a population.

Each particle comprises:
- a position vector x that is an admissible solution to the problem in question (such as the vector of K-Means centroids in the evolutionary algorithm for K-Means above),
- the evaluation of its fitness f(x) (such as the summary distance W in formula (*)),
- a velocity vector z of the same dimension as x, and
- the record of the best position b reached by the particle so far
(the last two are a new feature!).

The swarm best position bg is determined as the best among all the individual best positions b.
At iteration t (t = 0, 1, …) the next position is defined as

x(t+1) = x(t) + z(t+1)

with the velocity vector z(t+1) computed as

z(t+1) = z(t) + α(b - x(t)) + β(bg - x(t))

where
- α and β are uniformly distributed random numbers (typically within the interval between 0 and 2, so that each is about 1 on average),
- item α(b - x(t)) is referred to as the cognitive component and
- item β(bg - x(t)) as the social component of the process.

Initial values x(0) and z(0) are generated randomly within the manifold of admissible values.

In some implementations, the group best position bg is replaced by a local best position bl that is defined by the particle's neighbours only. This is where the neighbourhood topology takes effect. There is a report that the local best position works especially well, in terms of the optimality reached, when it is based on just two Euclidean neighbours.

Question: Formulate a particle swarm optimisation algorithm for K-Means clustering.

Other Clustering
o Fuzzy K-Means
o Kohonen's Self Organising Map (SOM)
o Hierarchical clustering: Agglomerative algorithm, Ward's criterion, Single linkage clustering
o Minimum Spanning Tree: Prim's algorithm
o Application to Single Linkage Clustering

Fuzzy clustering

Conventional (crisp): cluster k (k = 1, …, K)
- Centroid ck = (ck1, …, ckv, …, ckM) (M features)
- Membership zk = (z1k, …, zik, …, zNk) (N entities)
- If zik = 1, i belongs to cluster k; if zik = 0, i does not
- Clusters form a partition of the entity set (every i belongs to one and only one cluster): for every i, Σk zik = 1

Fuzzy: cluster k (k = 1, …, K)
- Centroid ck = (ck1, …, ckv, …, ckM) (M features)
- Membership zk = (z1k, …, zik, …, zNk) (N entities)
- 0 ≤ zik ≤ 1, the extent of belongingness of i to cluster k
- Clusters form a fuzzy partition of the entity set (summary belongingness is unity): for every i, Σk zik = 1

Having been put into the bilinear PCA model, as K-Means has been, fuzzy cluster memberships form a rather weird model in which centroids are not average but rather extreme points in their clusters (Mirkin,
Satarov 1990, Nascimento 2005).

An empirically convenient criterion,

$$F(\{c_k, z_k\}) = \sum_{k=1}^{K} \sum_{i=1}^{N} z_{ik}^{\alpha}\, d(y_i, c_k) \qquad (1)$$

where d( , ) is the Euclidean squared distance, leads to a convenient fuzzy version of K-Means. The value α affects the fuzziness of the optimal solution: at α = 1, the optimal memberships are proven to be crisp; the larger the α, the 'smoother' the membership. Conveniently, α is taken to be α = 2.

At each iteration, the memberships and the set of centroids {ck} are transformed.

Membership update:

$$z_{ik} = 1 \Big/ \sum_{k'=1}^{K} \left[ \frac{d(y_i, c_k)}{d(y_i, c_{k'})} \right]^{\frac{1}{\alpha - 1}} \qquad (2)$$

Centroids update:

$$c_{kv} = \sum_{i=1}^{N} z_{ik}^{\alpha}\, y_{iv} \Big/ \sum_{i'=1}^{N} z_{i'k}^{\alpha} \qquad (3)$$

Since equations (2) and (3) are the first-order optimality conditions for (1), convergence is guaranteed. This method is sometimes referred to as c-means clustering (Bezdek, 1999).

Meaning of criterion (1): F = Σi F(i), the summary belongingness F(i) of points i to the cluster-structured data, F(i) being equal to the harmonic average of the memberships at α = 2 (Stanforth, Mirkin, Kolossov, 2005).

[Figure, panels (a) and (b): Contours for the membership function at about 14000 IDBS Guildford chemical compounds clustered with iK-Means in 41 clusters (a); note the undesirable cusps in (b), which scores membership using only the nearest cluster's centroid.]

Kohonen's Self Organising Maps (SOM)

Given an N x M data matrix Y of entities yi (i = 1, …, N), build:
- a grid of r rows and c columns,
- a grid neighbourhood associated with each grid point ek (k = 1, …, rc),
- reference vectors mk (k = 1, …, rc) in the M-dimensional feature space.

Figure 1. SOM grid; grid points e1 and e2 are shown along with possible neighbourhood patterns (in black and blue).

Start: Reference points mk are thrown randomly (in Kohonen's earlier work); current advice: take them as centroids after a run of K-Means at K = rc. Then data points are iteratively associated with mk. In the end, the data points associated with each mk are visualised at the grid point ek.
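The SOM ingredients listed above (a grid, a neighbourhood for each grid point, and reference vectors with which data points are associated) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the up/down/left/right neighbourhood pattern, the function names, and the toy reference vectors and data are all hypothetical, and the reference vectors are set by hand rather than obtained from a K-Means run.

```python
def grid_neighbourhood(k, rows, cols):
    """Grid point k (0-based, row by row) plus its up/down/left/right
    neighbours. This pattern is an assumption; other patterns are possible."""
    r, c = divmod(k, cols)
    result = [k]
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        rr, cc = r + dr, c + dc
        if 0 <= rr < rows and 0 <= cc < cols:
            result.append(rr * cols + cc)
    return sorted(result)


def sq_dist(y, m):
    """Euclidean distance squared between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(y, m))


def associate(data, refs):
    """For each reference vector m_k, list the entities nearest to it."""
    lists = [[] for _ in refs]
    for i, y in enumerate(data):
        k = min(range(len(refs)), key=lambda t: sq_dist(y, refs[t]))
        lists[k].append(i)
    return lists


# Hypothetical 2 x 2 grid with hand-set reference vectors in a 2D feature space.
rows, cols = 2, 2
refs = [(0.0, 0.0), (0.0, 10.0), (10.0, 0.0), (10.0, 10.0)]
data = [(1.0, 1.0), (0.0, 9.0), (9.0, 1.0), (11.0, 10.0), (2.0, 0.0), (8.0, 9.0)]

for k in range(rows * cols):
    print("grid point", k, "neighbourhood", grid_neighbourhood(k, rows, cols))
print(associate(data, refs))   # entity indices shown at each grid point
```

The lists returned by `associate` are what would be displayed at the grid points; the neighbourhoods only matter once the reference vectors start being updated.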
Originally, SOM iterations were formulated in terms of single entities arriving (incremental mode), but later a straight version was found.

Figure 2. A pattern of final SOM structure using entity labels of geometrical shapes.

Straight SOM:
0. Initial setting. Select r and c for the grid and initialise model vectors mk (k = 1, ..., rc) in the feature space.
2. Neighbourhood update. For each grid node ek, with a pre-defined neighbourhood Ek, collect the list It of entities most resembling the model mt for each et ∈ Ek.
3. Seeds update. For each node ek, define the new mk as the average of all entities yi with i ∈ It for some ek ∈ Et.
4. Stop-condition. Halt if the new mk are close enough to the previous ones (or after a pre-specified number of iterations). Otherwise go to 2.

Similar to Straight K-Means except for:
(a) the number K = rc of model vectors is large and has nothing to do with the final clusters, which come visually as grid clusters;
(b) averaging is over the grid neighbourhood, not a feature space neighbourhood;
(c) there are no interpretation rules.

Hierarchic clustering

Data as a cluster hierarchy (rooted tree):
(a) conceptual structures (taxonomy, ontology)
(b) real processes (evolution, genealogy)

[Figure 3: tree with node clusters Tolstoy+Twain and Dickens over leaves OT, DS, GE, TS, HF, YA, WP, AK.]

Figure 3. A cluster hierarchy of Masterpieces data: nested node clusters, each comprising a set of leaves. Cutting the tree at a certain height can lead to a partition (3 clusters here).

Two types of hierarchic clustering:
- Divisive (splitting top-to-bottom)
- Agglomerative (merging bottom-up)

Agglomerative clustering algorithm
0. Start: N singleton clusters Si = {i}, i ∈ I, and an N x N distance (or similarity) matrix D = (d(i,j)), i, j ∈ I.
1. Find: Find the minimum distance d(i*,j*) in D.
2. Merge: Combine clusters Si* and Sj* into a united cluster Si*j* = Si* ∪ Sj*; remove rows/columns i* and j* from D; put there a new row/column i*j* for the united cluster Si*j* with newly computed distances between Si*j* and the other clusters.
3.
Draw and check: Draw the merging in a tree drawing such as Figure 3 and check whether the number of clusters is greater than 1. If yes, go to 1. If no, halt.

Distances between Si*j* and other clusters:
- nearest neighbour/single linkage (minimum distance between cluster elements),
- furthest neighbour (maximum distance between cluster elements),
- average neighbour (average distance between cluster elements).

Ward (1963): the distance between Si and Sj is the increase in the summary within-cluster variance, W(c, S)/N, after the merger:

wd(Si, Sj) = Ni*Nj/(Ni + Nj) * d(ci, cj)   (4)

where Ni and Nj are the numbers of entities in Si and Sj, and d(ci, cj) is the Euclidean distance squared between the clusters' centroids.

Ward distance in the agglomerative algorithm (Step 2):

$$wd(S_{i^*j^*}, S_k) = \frac{(N_{i^*} + N_k)\, wd(S_{i^*}, S_k) + (N_{j^*} + N_k)\, wd(S_{j^*}, S_k) - N_k\, wd(S_{i^*}, S_{j^*})}{N_{i^*} + N_{j^*} + N_k} \qquad (5)$$

Ward is computationally intensive because of Step 1.

Single linkage/nearest neighbour clustering can catch elongated formations, in contrast to K-Means/Ward.

Figure 4. Two clusters at the same data set obtained by using different criteria: Ward/K-Means (a) and Single linkage (b).

Single linkage can be made computationally easier by using the Minimum Spanning Tree (MST).

Minimum/Maximum Spanning Tree: a weighted graph concept
- Graph G: nodes i ∈ I, edges {i,j} with weights w(i,j)
- Tree T: a connected sub-graph of G with no cycles
- Spanning tree: a tree T covering all nodes I of G
- Tree weight: the sum of weights w(i,j) over {i,j} ∈ T
- Minimum spanning tree (MST): a spanning tree of minimum weight

[Figure: weighted graph on nodes A, B, C, D, E, F, G with edge weights between 1 and 4.]

Figure 1. Sub-graph {{A,B}, {A,C}, {B,C}} is a cycle, not a tree; {{A,B}, {A,C}, {B,F}} is a tree; {{A,B}, {A,C}, {B,D}, {B,E}, {B,F}, {B,G}} is a spanning tree of weight 3+2+3+4+2+4 = 18. Spanning tree {{A,C}, {C,D}, {D,G}, {G,F}, {F,B}, {F,E}} is of minimum weight 2+2+3+1+2+3 = 13, thus an MST (highlighted).

Prim's algorithm for finding an MST
1. Initialization. Start with T consisting of any i ∈ I.
2. Tree update.
Find j* ∈ I-T minimizing w(i,j) over all i ∈ T and j ∈ I-T. Add j* and the edge (i,j*) to T.
3. Stop-condition. If I-T is empty, halt and output tree T. Otherwise go to 2.

Example: Find an MST T in the graph of Figure 1 starting from T = {F}.

1st iteration: Take the minimum of w(F,j) over all j ≠ F; this is w(F,G) = 1. Now T = {F,G} along with the corresponding edge (see Figure 6 a).

Figure 6. Three iterations of Prim's algorithm (a, b, c) and a completed MST (d) for the graph in Figure 1.

2nd iteration: find j among {A, B, C, D, E} such that w(j,F) or w(j,G) is the minimum. It is obviously w(B,F) = 2. This adds B to T along with edge {B,F} (Figure 6 b).

3rd iteration: find j among {A, C, D, E} such that w(j,F) or w(j,G) or w(j,B) is the minimum. There are many candidates of weight 3 (specifically, edges {A,B}, {B,D}, {F,E}, {G,D} and {G,A}), of which let us take {F,E}, thus adding E to T so that T becomes {B, E, F, G} along with the added edge (see Figure 6 c).

Next iterations: w(G,D) adds D with weight 3; w(C,D) = 2 adds C to T along with edge {C,D}; the only remaining entity, A, obviously has the minimum-weight edge to T, {A,C}, of weight 2. This completes an MST (Figure 6 d).

Prim's algorithm is greedy, thus computationally efficient. It builds a globally optimal MST by using node (entity) sets. There is another, also greedy, approach by J. Kruskal that builds an MST by using edge sets.

But how efficient is Prim's algorithm computationally? Is it not similar to Ward's algorithm in the need to find a minimum link after each step? The difference: Prim's operates with the original weights only, whereas Ward's changes them at each step. Thus Prim's can store the nearest neighbour (NN) information for each of the nodes at the beginning and use it at later steps.

What does an MST have to do with Single-linkage clustering? It is mathematically proven that Single-linkage clusters are parts of an MST over entities as nodes and between-entity distances as weights.
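Prim's algorithm described above can be sketched in Python as follows. The graph uses only those edges whose weights are stated explicitly in the text; any further edges of the original figure are not recoverable here and are left out, which is an assumption of this example. The function name and the tie-breaking rule (smallest weight, then alphabetical order) are also choices of this sketch, and the graph is assumed to be connected.

```python
def prim_mst(nodes, weights, start):
    """Grow a spanning tree from `start`, each time adding the lightest
    edge that leaves the current tree. Assumes a connected graph."""
    adj = {v: {} for v in nodes}
    for (i, j), w in weights.items():
        adj[i][j] = w
        adj[j][i] = w
    in_tree = {start}
    tree_edges = []
    while len(in_tree) < len(nodes):
        # Lightest edge (i, j) with i inside the tree and j outside it;
        # ties are broken by node names, an arbitrary choice.
        w, i, j = min((adj[i][j], i, j)
                      for i in in_tree for j in adj[i] if j not in in_tree)
        in_tree.add(j)
        tree_edges.append((i, j, w))
    return tree_edges


nodes = "ABCDEFG"
# Only the edge weights quoted in the text are used here.
weights = {("A", "B"): 3, ("A", "C"): 2, ("A", "G"): 3, ("B", "D"): 3,
           ("B", "E"): 4, ("B", "F"): 2, ("B", "G"): 4, ("C", "D"): 2,
           ("D", "G"): 3, ("E", "F"): 3, ("F", "G"): 1}

tree = prim_mst(nodes, weights, "F")
for i, j, w in tree:
    print(i, j, w)
print("total weight:", sum(w for _, _, w in tree))
```

Because several edges share the weight 3, the tree found may differ from the one drawn in the text when ties are broken differently; the total weight is the same for every MST of a graph.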
A Single-linkage divisive clustering: make an MST for the distance matrix and then sequentially cut it in two clusters over a maximum weight edge, or over all maximum weight edges simultaneously (Figure 7 a, b).

Figure 7. A binary tree (a) and the natural tree (b) for Single linkage divisive clustering using the MST presented in Figure 6 d.

Appendix

A1 Vector mathematics

nD vector spaces: basic algebra and geometry

Summation defined as

   e1 = (-0.20, 0.23, -0.33, -0.63, 0.36, -0.22, -0.14)
 + e2 = ( 0.40, 0.05,  0,    -0.63, 0.36, -0.22, -0.14)
 ____________________________________________________________
e1+e2 = ( 0.20, 0.28, -0.33, -1.26, 0.72, -0.44, -0.28)

Subtraction defined as

   e1 = (-0.20, 0.23, -0.33, -0.63, 0.36, -0.22, -0.14)
 - e2 = ( 0.40, 0.05,  0,    -0.63, 0.36, -0.22, -0.14)
 ____________________________________________________________
e1-e2 = (-0.60, 0.18, -0.33,  0,    0,     0,     0   )

Multiplication by a real defined as

 2e1 = (-0.40, 0.46, -0.66, -1.26, 0.72, -0.44, -0.28)
10e1 = (-2.00, 2.30, -3.30, -6.30, 3.60, -2.20, -1.40)

Quiz: Could you illustrate the geometrical meaning of the set of all a*e1 (for any a)?
Geometry:

[Figure: vectors e1, e2 = (.4, .05), e1+e2 = (.2, .28) and e1-e2 = (-.6, .18) plotted by their first two components, LS (horizontal axis) and LD (vertical axis).]

Distance

Euclidean distance r(e1,e2) := sqrt(sum((e1-e2).*(e1-e2)))

             e1 = (-0.20, 0.23, -0.33, -0.63, 0.36, -0.22, -0.14)
           - e2 = ( 0.40, 0.05,  0,    -0.63, 0.36, -0.22, -0.14)
 ____________________________________________________________
          e1-e2 = (-0.60, 0.18, -0.33,  0,    0,     0,     0   )
 ____________________________________________________________
(e1-e2).*(e1-e2) = ( 0.36, 0.03,  0.11,  0,    0,     0,     0   )

d(e1,e2) = sum((e1-e2).*(e1-e2)) = 0.36+0.03+0.11+0+0+0+0 = 0.50
r(e1,e2) = sqrt(d(e1,e2)) = sqrt(0.50) = 0.71

The Pythagorean theorem is behind this:

[Figure: right triangle with vertices x1 = (x11, x12) and x2 = (x21, x22), legs a and b parallel to the axes, and hypotenuse c.]

c^2 = a^2 + b^2
d(x1,x2) = (x12-x22)^2 + (x21-x11)^2

Other distances:
- Manhattan/City-block: m(x1,x2) = |x12-x22| + |x21-x11|
- Chebyshev/L∞: ch(x1,x2) = max(|x12-x22|, |x21-x11|)

Quiz: Extend these to nD.

Quiz: Characterise the sets of points that lie within distance 1 from a given point, say the origin, for the cases when the distance is (i) Euclidean squared, (ii) Manhattan, (iii) Chebyshev's.

Inner product

Inner product <e1,e2> := sum(e1.*e2)

    e1 = (-0.20, 0.23, -0.33, -0.63, 0.36, -0.22, -0.14)
  * e2 = ( 0.40, 0.05,  0,    -0.63, 0.36, -0.22, -0.14)
 ____________________________________________________________
e1.*e2 = (-0.08, 0.01,  0,     0.40, 0.13,  0.05,  0.02)
 ____________________________________________________________
<e1,e2> = sum(e1.*e2) = -0.08+0.01+0+0.40+0.13+0.05+0.02 = 0.53

Relation between the (Euclidean squared) distance d(e1,e2) and the inner product <e1,e2>:

d(e1,e2) = <e1-e2, e1-e2> = <e1,e1> + <e2,e2> - 2<e1,e2>

It is especially simple if <e1,e2> = 0:

d(e1,e2) = <e1,e1> + <e2,e2>

which is like (in fact, is) the Pythagorean theorem.

Points/vectors e1 and e2 satisfying <e1,e2> = 0 are referred to as orthogonal (why?)
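The worked arithmetic above, including the relation d(e1,e2) = <e1,e1> + <e2,e2> - 2<e1,e2>, can be checked numerically with a short Python sketch. The vectors e1 and e2 are those of the text; the function names are chosen here for illustration, and the nD forms of the Manhattan and Chebyshev distances are the usual component-wise extensions.

```python
import math

e1 = (-0.20, 0.23, -0.33, -0.63, 0.36, -0.22, -0.14)
e2 = (0.40, 0.05, 0.00, -0.63, 0.36, -0.22, -0.14)


def inner(x, y):
    """Inner product: the sum of component-wise products."""
    return sum(a * b for a, b in zip(x, y))


def sq_euclid(x, y):
    """Euclidean distance squared, d(x, y)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def manhattan(x, y):
    """City-block distance: the sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def chebyshev(x, y):
    """L-infinity distance: the largest absolute difference."""
    return max(abs(a - b) for a, b in zip(x, y))


d = sq_euclid(e1, e2)
print("d(e1,e2)  =", round(d, 2))
print("r(e1,e2)  =", round(math.sqrt(d), 2))
print("<e1,e2>   =", round(inner(e1, e2), 2))
print("Manhattan =", round(manhattan(e1, e2), 2))
print("Chebyshev =", round(chebyshev(e1, e2), 2))

# The relation between squared distance and inner product.
rhs = inner(e1, e1) + inner(e2, e2) - 2 * inner(e1, e2)
print("identity holds:", math.isclose(d, rhs))
```

Working with unrounded components, the script gives values that agree with the hand computation to two decimals.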
The square root of the inner product of a vector by itself, sqrt(<e,e>), is referred to as e's norm: the distance from 0 (analogous to length in nD).

Matrix of rank 1: a product of two vectors; for example

a=[1 4 2 0.5]'; b=[2 3 5]';

Here A' is matrix A transposed, so that vectors a and b are considered columns rather than rows. The matrix whose elements are products of the components of a and b, with the product * being the so-called matrix product, is presented below:

        2    3    5
a*b' =  8   12   20
        4    6   10
        1   1.5  2.5

The defining feature of this matrix: all rows are proportional to each other; all columns are proportional to each other. (For more detail, see any course in linear algebra or matrix analysis.)

The matrix X^T X, divided by N, has an interesting statistical interpretation if all columns of X have been centred (mean-subtracted) and normed (std-normalised): its elements are correlation coefficients between the corresponding variables. (Note how a bi-variate concept is carried through to multivariate data.) If the columns have not been normed, the matrix A = X^T X / N is referred to as the covariance matrix; its diagonal elements are column variances. Since eigen-vectors of the symmetric square matrix A are mutually orthogonal, it can be decomposed over them as

$$A = \sum_{k=1}^{r} \lambda_k\, c^*_k c^{*T}_k = C \Lambda C^T \qquad (7)$$

which can be derived from (5'); Λ is the diagonal r × r matrix of A's eigen-values λk = μk^2. Equation (7) is referred to as the spectral decomposition of A, the eigen-values λk constituting the spectrum of A.

Optimisation algorithms

Alternating minimisation algorithm for f(x,y): a sequence y0, x1, y1, x2, y2, …, xt, yt. Find x minimising f(x,y0); take this as x1. Given x1, find y minimising f(x1,y); take it as y1. Reiterate until convergence.

Gradient optimisation (the steepest ascent/descent, or hill-climbing) of any function f(z) of a multidimensional variable z: given an initial state z = z0, do a sequence of iterations to move to a better z location.
Each iteration updates the z-value:

z(new) = z(old) ± μ*grad(f(z(old)))   (2)

where + applies if f is maximised, and -, if minimised. Here
· grad(f(z)) stands for the vector of partial derivatives of f with respect to the components of z. It is known from calculus that the vector grad(f(z)) shows the direction of the steepest rise of function f at point z; accordingly, -grad(f(z)) shows the steepest descent direction.
· the value μ controls the length of the change of z in (2) and should be small (to guarantee not over-jumping), but not too small (to guarantee changes when grad(f(z(old))) becomes too small; indeed grad(f(z(old))) = 0 if z(old) is an optimum).

Q: What is the gradient of function f(x1,x2) = x1^2 + x2^2? Of function f(x1,x2) = (x1-1)^2 + 3*(x2-4)^2? Of function f(z1,z2) = 3*z1^2 + (1-z2)^4?
A: (2*x1, 2*x2); (2*(x1-1), 6*(x2-4)); (6*z1, -4*(1-z2)^3).

Genetic algorithms (GA)

A population comprising a number, P, of structured entities, called chromosomes, typically strings (sometimes trees, depending on the data structure), evolves imitating the following biological mechanisms:
1. Selection
2. Cross-over
3. Mutation
These mechanisms apply to carry on the population from the current iteration to the next one. The optimised criterion is referred to as the fitness function. The initial population is, typically, selected randomly. The evolution stops when the population's fitness does not change anymore or when a pre-specified limit on the number of iterations is reached.

An extension of the GA approach: Evolutionary algorithms

Evolutionary algorithms are similar to genetic algorithms in the aspect of an evolving population, but may differ in their mechanism: as far as I can see, the string representation may be abandoned here, as well as the crossover.

Example. Minimising function f(x) = sin(2πx)e^(-x) in the range [0,2]. Look at the following MatLab program eva.m.
% -------------------------- evolutionary optimisation of a scalar function
function [soli, funi]=eva;
p=12; % population size
lb=0; rb=2; % the boundaries of the range
feas=(rb-lb)*rand(p,1)+lb; % population within the range
flag=1; % looping variable
count=0; % number of iterations
iter=1000; % limit to the number of iterations
%------------------------------ initial evaluation
funp=0;
vv=f(feas);
[funi, ini]=min(vv);
soli=feas(ini) % initial x
funi % initial f
si=0.5; % mutation intensity
%------------- evolution loop
while flag==1
  count=count+1;
  feas=feas+si*randn(p,1); % mutation
  feas=max(lb,feas);
  feas=min(rb,feas); % keeping the population in [lb,rb]
  vec=f(feas);
  [fun, in]=min(vec); % best record of the current population f(x)
  sol=feas(in); % corresponding x
  [wf,wi]=max(vec);
  wun=feas(wi);
  %--------- elitist survival (slightly eugenic) --------
  if wf>funi
    feas(wi)=soli;
    vec(wi)=funi;
  end
  if rem(count,100)==0 % display
    %funp=funi;
    disp([soli funi]);
  end
  if fun < funi % maintaining the best
    soli=sol;
    funi=fun;
  end
  if (count>=iter)
    flag=0;
  end
end
% ---------------------- computing the function y=sin(2*pi*x)*exp(-x)
function y=f(x)
for ii=1:length(x)
  a=exp(-x(ii));
  b=sin(2*pi*x(ii));
  y(ii)=a*b;
end
return;

This program finds the optimum rather fast indeed!

Particle swarm optimisation

This is a very different method. The population members here are neither crossbred nor mutated. They just move randomly, drifting in the directions of the best places visited, individually and socially. This can be done because they are vectors of reals. Because of the change, the genetic metaphor is abandoned here: the elements are referred to as particles rather than chromosomes, and the set of them as a swarm rather than a population.
Each particle comprises:
- a position vector x that is an admissible solution to the problem in question (such as the vector of K-Means centroids in the evolutionary algorithm for K-Means above),
- the evaluation of its fitness f(x) (such as the summary distance W in formula (*)),
- a velocity vector z of the same dimension as x, and
- the record of the best position b reached by the particle so far
(the last two are a new feature!).

The swarm best position bg is determined as the best among all the individual best positions b. At iteration t (t = 0, 1, …) the next position is defined as

x(t+1) = x(t) + z(t+1)

with the velocity vector z(t+1) computed as

z(t+1) = z(t) + α(b - x(t)) + β(bg - x(t))

where
- α and β are uniformly distributed random numbers (typically within the interval between 0 and 2, so that each is about 1 on average),
- item α(b - x(t)) is referred to as the cognitive component and
- item β(bg - x(t)) as the social component of the process.

Initial values x(0) and z(0) are generated randomly within the manifold of admissible values.

In some implementations, the group best position bg is replaced by a local best position bl that is defined by the particle's neighbours only. This is where the neighbourhood topology takes effect. There is a report that the local best position works especially well, in terms of the optimality reached, when it is based on just two Euclidean neighbours.

MatLab: A programming environment for user-friendly and fast manipulation and analysis of data

Introduction

The working place within a processor's memory is up to the user. A recommended option:
- a directory with user-made MatLab codes, say Codes, and two or more subdirectories, Data and Results, in which data and results are stored, respectively.

MatLab can then be brought to the working directory with traditional MSDOS or UNIX based commands such as cd <Path_To_Working_Directory>. MatLab then remembers this path.
MatLab is organised as a set of packages, each in its own directory, consisting of program files with the extension .m. A few data handling programmes are in the Code directory.

The "help" command shows the names of the packages as well as of individual program files; the latter are operations that can be executed within MatLab. Example: command "help" shows a bunch of packages, "matlab\datafun" among them; command "help datafun" displays a number of operations such as "max - largest component"; command "help max" explains the operation in detail.

Work with files

A data file should be organised as an entity-to-feature data table: rows correspond to entities, columns to features (see stud.dat and stud.var). Such a data structure is referred to as a 2D array or matrix; 1D arrays correspond to solitary entities or features. This is one of the MatLab data formats. The array format works on the principle of a chess-board: its (i,k)-th element is the element in the i-th row and k-th column. An array's defining feature is that every row has the same number of columns.

To load such a file one may use a command from package "iofun". A simple one is "load":

>> a=load('Data\stud.dat');
% symbol "%" is used for comments: the MatLab interpreter doesn't read lines beginning with "%".
% "a" is a place to put the data (variable); ";" should stand at the end of an instruction;
% stud.dat is a 100x8 file of 100 part-time students with 8 features:
% 3 binary for Occupation, Age, NumberChildren, and scores over three disciplines (in file stud.var)

Names are handled as strings, delimited by the ' symbol (no spaces are permitted in a string). The entity/feature name sizes may vary and thus cannot be handled in the array format. To handle them, another data format is used: the cell. Round braces (parentheses) are used for arrays, curly braces for cells: a(i,:) is array a's i-th row, b{i} is cell b's i-th element, which can be a string, a number, an array, or a cell.
There can be other data structures as well (video, audio, ...).

>> b=readlist('Data\stud.var'); % list of names of stud.dat features

If one wants to work with only three of the six features, say "Age", "Children" and "OOProgramming_Score", one must put together their indices into a named 1D array:

>> ii=[4 5 7] % no semicolon at the end, to display ii on screen as a row; to make ii a column, semicolons are used
>> newa=a(:,ii); % new data array
>> newb=b(ii); % new feature set

A similar command applies to a subset of entities. If, for instance, we want to limit our attention to only those students who received 60 or more at "OOProgramming", we first find their indices with command "find":

>> jj=find(a(:,7)>=60); % jj is the set of the students defined in find()
% a(:,7) is the seventh column of a

Now we can apply "a" to "jj":

>> al=a(jj,:); % partial data of the better-performing students

% nlrm.m, evolutionary fitting of a nonlinear regression function y=f(x,a,b)
% x is predictor, y is target, a,b - regression parameters to be fitted
function [a,b, funi,residvar]=nlrm(xt,yt);
% in this version the regression equation is y=a*exp(bx), which is
% reflected only in the subroutine 'delta' at the bottom for computing
% the value of the summary error squared
% funi is the error's best value
% residvar is its proportion to the sum of squares of y entries
ll=length(xt);
if ll~=length(yt)
  disp('Something wrong is with data');
  pause;
end
%---------------- playing with the data range to define the rectangle at
%-------- which populations are grown
mix=min(xt); maix=max(xt);
miy=min(yt); maiy=max(yt);
lb=-max(maix,maiy); rb=-lb; % the boundaries on the feasible solutions
% taken to be max range of the raw data, should be ok, given the model
%------------- organisation of the iterations, iter the limit to their number
p=40; % population size
feas=(rb-lb)*rand(p,2)+lb; % generated population of p pairs of coefficients within the range
flag=1;
count=0;
iter=10000;
%---------- evaluation of the initially
% generated population
funp=0;
for ii=1:p
  vv(ii)=delta(feas(ii,:),xt,yt);
end
[funi, ini]=min(vv);
soli=feas(ini,:) % initial coefficients
funi % initial error
si=0.5; % step of change
%------------- evolution of the population
while flag==1
  count=count+1;
  feas=feas+si*randn(p,2); % mutation added with step si
  for ii=1:p
    feas(ii,:)=max([[lb lb];feas(ii,:)]);
    feas(ii,:)=min([[rb rb];feas(ii,:)]); % keeping the population within the range
    vec(ii)=delta(feas(ii,:),xt,yt); % evaluation
  end
  [fun, in]=min(vec); % best approximation value
  sol=feas(in,:); % corresponding parameters
  [wf,wi]=max(vec);
  wun=feas(wi,:); % worst case
  if wf>funi
    feas(wi,:)=soli;
    vec(wi)=funi; % changing the worst for the best of the previous generation
  end
  if fun < funi
    soli=sol;
    funi=fun;
  end
  if (count>=iter)
    flag=0;
  end
  residvar=funi/sum(yt.*yt);
  %------------ screen the results of every 500th iteration
  if rem(count,500)==0
    %funp=funi;
    disp([soli funi residvar]);
  end
end
a=soli(1);
b=soli(2);
%-------- computing the quality of the approximation y=a*exp(bx)
function errorsq=delta(tt,x,y)
a=tt(1);
b=tt(2);
errorsq=0;
for ii=1:length(x)
  yp(ii)=a*exp(b*x(ii)); % this function can be changed if a different model is assumed
  errorsq=errorsq+(y(ii)-yp(ii))^2;
end
return;

% nnn.m for learning a set of features from a data set
% with a neural net with a single hidden layer
% with the symmetric sigmoid (hyperbolic tangent) in the hidden layer
% and data normalisation to [-10,10] interval
function [V,W, mede]=nnn(hiddenn,muin)
% hiddenn - number of neurons in the hidden layer
% muin - the learning rate, should be of order of 0.0001 or less
% V, W - wiring coefficients learnt
% mede - vector of absolute values of errors in output features
%-------------- 1. loading data ---------------------
da=load('Data\studn.dat'); % this is where the data file is put!!!
% da=load('Data\iris.dat'); % this will be for iris data
[n,m]=size(da);
%------- 2. normalizing to [-10,10] scale ---------------------
mr=max(da);
ml=min(da);
ra=mr-ml;
ba=mr+ml;
tda=2*da-ones(n,1)*ba;
dan=tda./(ones(n,1)*ra);
dan=10*dan;
%------------- 3. preparing input and output (target) --------
ip=[1:5]; % here is the list of indexes of input features!!!
%ip=[1:2]; % only two input features in the case of iris
ic=length(ip);
op=[6:8]; % here is the list of indexes of output features!!!
%op=[3:4]; % output iris features
oc=length(op);
output=dan(:,op); % target features file
input=dan(:,ip); % input features file
input(:,ic+1)=10; % bias component
%----------------- 4. initialising the network ---------------------
h=hiddenn; % the number of hidden neurons!!!
W=randn(ic+1,h); % initialising w weights
V=randn(h,oc); % initialising v weights
W0=W;
V0=V;
count=0; % counter of epochs
stopp=0; % stop-condition to change
%pause(3);
while(stopp==0)
  mede=zeros(1,oc); % mean errors after an epoch
  %---------------- 5. cycling over entities in a random order
  ror=randperm(n);
  for ii=1:n
    x=input(ror(ii),:); % current instance's input
    u=output(ror(ii),:); % current instance's output
    %--------------- 6. forward pass (to calculate response ru) ------
    ow=x*W;
    o1=1+exp(-ow);
    oow=ones(1,h)./o1;
    oow=2*oow-1; % symmetric sigmoid output of the hidden layer
    ov=oow*V; % output of the output layer
    err=u-ov; % the error
    mede=mede+abs(err)/n;
    %------------ 7. error back-propagation --------------------------
    gV=-oow'*err; % gradient vector for matrix V
    t1=V*err'; % error propagated to the hidden layer
    t2=(1-oow).*(1+oow)/2; % the derivative
    t3=t2.*t1'; % error multiplied by the th's derivative
    gW=-x'*t3; % gradient vector for matrix W
    %---------------- 8. weights update -----------------------
    mu=muin; % the learning rate from the input!!!
    V=V-mu*gV;
    W=W-mu*gW;
  end;
  %------------------ 9.
stop-condition -------------------------count=count+1; ss=mean(mede); if ss<0.01|count>=10000 stopp=1; end; mede; if rem(count,500)==0 count 124 mede end end; Reading B. Mirkin (2005), Clustering for Data Mining, Chapman & Hall/CRC, ISBN 1-58488-534-3. A.P. Engelbrecht (2002) Computational Intelligence, John Wiley & Sons, ISBN 0-470-84870-7. Supplementary reading H. Abdi, D. Valentin, B. Edelman (1999) Neural Networks, Series: Quantitative Applications in the Social Sciences, 124, Sage Publications, London, ISBN 0 -7619-1440-4. M. Berthold, D. Hand (1999), Intelligent Data Analysis, Springer-Verlag, ISBN 3540658084. S.K.Card, J.D. Mackinlay, B. Shneiderman (1999) Readings in Information Visualization: Using Vision to Think, Morgan Kaufmann Publishers, San Francisco, Ca, ISBN 1-55860-533-9. A.C. Davison, D.V. Hinkley (2005) Bootstrap Methods and Their Application, Cambridge University Press (7th printing). R.O. Duda, P.E. Hart, D.G. Stork (2001) Pattern Classification, Wiley-Interscience, ISBN 0471-05669-3 S. S. Haykin (1999), Neural Networks (2nd ed), Prentice Hall, ISBN 0132733501. R. Spence (2001), Information Visualization, ACM Press, ISBN 0-201-59626-1. T. Soukup, I. Davidson (2002) Visual Data Mining, Wiley Publishers, ISBN 0-471-14999-3 V. Vapnik (2006) Estimation of Dependences Based on Empirical Data, Springer Science + Business Media Inc., 2d edition. A. Webb (2002) Statistical Pattern Recognition, Wiley, ISBN-0-470-84514-7. Articles R. Cangelosi, A. Goriely (2007) Component retention in principal component analysis with application to cDNA microarray data, Biology Direct, 2:2, http://www.biolgy-direct.com/content/2/1/2. J. Carpenter, J. Bithell (2000) Bootstrap confidence intervals: when, which, what? A practical guide for medical statisticians, Statistics in Medicine, 19, 1141-1164. G. W. Furnas (1981) The FISHEYE View: A new look at structured files, A technical report, in In S.K.Card, J.D. Mackinlay, B. 
Shneiderman (1999) Readings in Information Visualization: Using Vision to Think, Morgan Kaufmann Publishers, San Francisco, CA, 350-367.
Y.K. Leung and M.D. Apperley (1994) A review and taxonomy of distortion-oriented presentation techniques, in S.K. Card, J.D. Mackinlay, B. Shneiderman (1999) Readings in Information Visualization: Using Vision to Think, Morgan Kaufmann Publishers, San Francisco, CA, 350-367.