Introduction to Probability and Statistics: notes for a short course

Jonathan G. Campbell
Department of Computing, Letterkenny Institute of Technology, Co. Donegal, Ireland.
Email: jonathan dot campbell (at) gmail.com, [email protected]
URL: http://www.jgcampbell.com/stats/stats.pdf
Report No: jc/09/0004/r
Revision 0.3, 18th August 2009

Contents

1 Introduction
  1.1 Purpose and Scope
  1.2 Why use R?
  1.3 Relevant textbooks and web sources
    1.3.1 General Books on Probability and Statistics
    1.3.2 Books on R and Statistics using R
    1.3.3 Bayesian Statistics
    1.3.4 Web Links
  1.4 Outline

2 Simple Data Analysis and Visualisation and Introduction to R
  2.1 Introduction
    2.1.1 Installation of R
    2.1.2 Running R
  2.2 Visualisation and Exploratory Data Analysis

3 Averages
  3.1 Introduction
  3.2 Arithmetic Mean
    3.2.1 Arithmetic Mean using Frequencies
  3.3 Median
  3.4 Mode
  3.5 Other Means

4 Measures of Data Variability
  4.1 Introduction
  4.2 Variance and Standard Deviation
    4.2.1 Equalising the means
    4.2.2 Variability and spread
    4.2.3 Variance and Standard Deviation
  4.3 Standard Scores and Normalising Marks
    4.3.1 Standard Scores

5 Probability and Random Variables
  5.1 Introduction
  5.2 Basic Probability and Random Variables
    5.2.1 Introduction
    5.2.2 Probability and Events
    5.2.3 A Point on Terminology
    5.2.4 Probability of Non-disjoint Events
    5.2.5 Finite Sample Spaces
  5.3 Random Variables
  5.4 Computing probabilities
  5.5 Enumerating more complex events and sample spaces
    5.5.1 Multiplication of outcomes
    5.5.2 Addition of outcomes
    5.5.3 Permutations
    5.5.4 Combinations
  5.6 Conditional Probability
    5.6.1 Venn diagrams
    5.6.2 Probability Trees
    5.6.3 Joint Probability
  5.7 Bayes' Rule
  5.8 Independent Events
  5.9 Betting and Odds
  5.10 Classical versus Bayesian Interpretations of Probability

6 One Dimensional Random Variables
  6.1 Introduction
    6.1.1 Definition: Random Variable
    6.1.2 Probability associated with a Random Variable
  6.2 Probability Mass Function (pmf) of a Discrete r.v.
  6.3 Some Discrete Random Variables
    6.3.1 Point Mass Distribution
    6.3.2 Discrete Uniform Distribution
    6.3.3 Bernoulli Distribution
    6.3.4 Binomial Distribution
    6.3.5 Geometric Distribution
    6.3.6 Poisson Distribution
  6.4 Some Continuous Random Variables
    6.4.1 Probability Density Function (PDF)
    6.4.2 Cumulative Distribution Function (cdf)
    6.4.3 Uniform Distribution
    6.4.4 Normal (Gaussian) Distribution
    6.4.5 Exponential Distribution
    6.4.6 Gamma Distribution
    6.4.7 Beta Distribution
    6.4.8 Student t Distribution
    6.4.9 Cauchy Distribution
    6.4.10 Chi-squared Distribution
  6.5 Range spaces — terminology
  6.6 Parameters

7 Two- and Multi-Dimensional Random Variables
  7.1 Introduction
  7.2 Probability Function of a Discrete Two-dimensional r.v.
  7.3 PDF of a Continuous Two-dimensional r.v.
  7.4 Marginal Probability Distributions
  7.5 Conditional Probability Distributions
  7.6 Independent Random Variables
  7.7 Two-dimensional (Bivariate) Normal Distribution

8 Characterisations of Random Variables
  8.1 Introduction
  8.2 Expected Value (Mean) of a Random Variable
  8.3 Variance of a Random Variable
  8.4 Expectations in Two-dimensions
    8.4.1 Mean
    8.4.2 Covariance

9 The Normal Distribution
  9.1 Introduction
  9.2 Cumulative Distribution Function (cdf)
  9.3 Normal Cdf
  9.4 Using the Normal Cdf
  9.5 Sum of Independent Normal Random Variables
  9.6 Differences of Normal Random Variables
  9.7 Linear Transformations of Normal Random Variables
  9.8 The Central Limit Theorem

10 Statistical Inference
  10.1 Introduction

11 Statistical Estimation
  11.1 Introduction
  11.2 Populations and Samples
  11.3 Estimating the Mean
  11.4 Estimating the Standard Deviation
  11.5 Sampling Distributions
    11.5.1 Sampling Distribution of the mean
    11.5.2 Sampling Distribution for Estimates of the Standard Deviation
  11.6 Confidence Intervals

12 Hypothesis Testing
  12.1 Introduction

13 Sampling
  13.1 Introduction

14 Classification and Pattern Recognition
  14.1 Introduction

15 Simple Classifier Methods
  15.1 Thresholding for one-dimensional data
  15.2 Linear separating lines/planes for two-dimensions
  15.3 Nearest mean classifier
  15.4 Normal form of the separating line, projections, and linear discriminants
  15.5 Projection and linear discriminant
  15.6 Projections and linear discriminants in p dimensions
  15.7 Template Matching and Discriminants
  15.8 Nearest neighbour methods

16 Statistical Classifier Methods
  16.1 One-dimensional classification revisited
  16.2 Bayes' Rule for the Inversion of Conditional Probabilities
  16.3 Parametric Methods
  16.4 Discriminants based on Normal Density
  16.5 Bayes-Gauss Classifier – Special Cases
    16.5.1 Equal and Diagonal Covariances
    16.5.2 Equal but General Covariances
  16.6 Least square error trained classifier
  16.7 Generalised linear discriminant function

17 Linear Discriminant Analysis and Principal Components Analysis
  17.1 Principal Components Analysis
  17.2 Fisher's Linear Discriminant Analysis

18 Neural Network Methods
  18.1 Neurons for Boolean Functions
  18.2 Three-layer neural network for arbitrarily complex decision regions
  18.3 Sigmoid activation functions

19 Unsupervised Classification (Clustering)

20 Regression
  20.1 Linear Regression

A Basic Mathematical Notation
  A.1 Sets
    A.1.1 Set Definition and Membership
    A.1.2 Important Number Sets
    A.1.3 Set Operations
    A.1.4 Venn Diagrams
  A.2 Iterated Summation and Product Notation
  A.3 Iterated Union and Intersection
  A.4 Cartesian Product Sets

B Matrices and Linear Algebra
  B.1 Introduction
  B.2 Linear Simultaneous Equations
  B.3 Vectors and Matrices
  B.4 Basic Matrix Arithmetic
    B.4.1 Matrix Multiplication
    B.4.2 Multiplication by a Scalar
    B.4.3 Addition
  B.5 Special Matrices
    B.5.1 Identity Matrix
    B.5.2 Orthogonal Matrix
    B.5.3 Diagonal
    B.5.4 Transpose of a Matrix
  B.6 Inverse Matrix
  B.7 Multidimensional (Multivariate) Random Variables

Chapter 1
Introduction

1.1 Purpose and Scope

This report is written as the basis for a short course on statistics to be presented to postgraduate students at Letterkenny Institute of Technology.

The notes have a mixed objective. I started writing a set of notes based on the traditional approach to probability and statistics, namely: basic probability, up to and including conditional probability, independence, and Bayes' Law; then some one-dimensional discrete and continuous distributions and some of their properties; and then on to sampling, parameter estimation, point estimates, confidence intervals, and hypothesis testing. However, after discussion with someone who knows potential consumers of the course, I was persuaded to start with a gentler introduction. Hence I start off with simple visualisation, then look at averages (central tendency), then variance, and then return to the main line.

As I say, the notes have a mixed objective.
One objective is as notes for a gentle introduction to statistics; another is to include a set of reference results that one would refer to during a course. That is, a course presenter might not want to spend time on the details of, for example, the Binomial distribution, or even full details of the Normal, but it would be useful for students to have access to some of these details without having to consult one or more textbooks.

When I give a course, I may give attendees a printout of all the notes — including an outline of the objective of the course and the plan of coverage, mentioning the chapters that will be used. Or, alternatively, I may do a specialised printout that includes only the chapters to be covered. The notes you see here include everything.

1.2 Why use R?

Let me quote from the R website http://www.r-project.org/:

    R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R.

    R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible. The S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity.

    One of R's strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor design choices in graphics, but the user retains full control.
When I have to choose a software package for teaching or for practical use (I mean generally: it could be a development system for a programming language, a computer games engine, a statistics package, ...), I look primarily at the following criteria:

• Is it easily available, i.e. is it already installed on our laboratory machines, or is it easy (and cheap) to acquire? R does well on this criterion — it is free to download and install, see 2.1.1.

• Is it well supported by textbooks and online documentation? Again, R does well. In the past ten years, and increasingly in the last five, a great many top-class books have appeared on R and on particular statistical techniques using R; see 1.3. I notice that books that used to have just numerical examples now, in recent editions, give R examples. There is a top-class mailing list supported by volunteers of the highest calibre: https://stat.ethz.ch/mailman/listinfo/r-help Via that mailing list, I have received assistance from world-class statisticians.

• Is it widely used? Yes.

1.3 Relevant textbooks and web sources

1.3.1 General Books on Probability and Statistics

These notes are mostly based on (Meyer 1966), which was used for a college course on statistics that I attended; (Wasserman 2004), which is a good summary of all the statistics you might ever need, but is not an introduction; (Griffiths 2009) and (Milton 2009), which are excellent introductions though very wordy; (Crawley 2005); and (Spiegel & Stephens 2008). The latter, (Spiegel & Stephens 2008), has plenty of examples, including some on the use of the Excel spreadsheet.

(Dytham 2009) seems to be a good introduction for biologists, and the more advanced (Quinn & Keough 2002) receives a lot of recommendations. Hacking's book (Hacking 2001) is maybe a good introduction to probability and the philosophy and practice of probabilistic inference.
The bibliography contains books from my collection which I may have used in some small way and/or which may be useful to users of these notes.

1.3.2 Books on R and Statistics using R

Crawley may be the best general book (Crawley 2005); for bio-scientists it has the advantage that Crawley's research area is bio-science. Venables and Ripley's MASS (Venables & Ripley 2002) is top class — note, do not be confused by the title Modern Applied Statistics with S; R is an open-source version of S (and S-Plus) and the book covers any differences, which are minimal. Maindonald (Maindonald & Braun 2007) is good for R graphics; R code for all his diagrams is available online (free). Matloff's R for Programmers (Matloff 2008) has the advantage that it is available online. See also the extensive list at http://www.r-project.org/doc/bib/R-books.html

1.3.3 Bayesian Statistics

Not that we'll be emphasising the Bayesian approach. (Sivia 2006) (best introduction to Bayesian statistics), (MacKay 2002), (Lee 2004).

1.3.4 Web Links

• General: http://www.jgcampbell.com/links/stats.html;
• R: http://www.r-project.org/.

1.4 Outline

Chapter 5 gives an introduction to probability; if you want to understand basic statistics you must have a basic understanding of probability — however, we note that probability is to a great extent common sense. Before starting you should have a quick run through Appendix A, just to familiarise yourself with basic mathematical notation; the mathematical notation used is no more than shorthand, and it would be difficult to write these notes without employing it; in addition, you will encounter similar shorthand in books and research papers.

Chapter 2 gives a very brief introduction to simple statistical techniques and visualisation, and to the statistical package R.

Chapter 3 gives a brief introduction to averages, or what statisticians call central tendency.
Chapter 4 introduces methods of describing data variability, most notably variance and standard deviation.

Chapter 6 introduces random variables and lists the common one-dimensional probability distributions.

Chapter 7 gives a brief introduction to multivariate random variables and some distributions. Note that Appendix B gives a gentle introduction to the vector and matrix mathematics which are necessary in multivariate statistics.

Chapter 8 discusses important characteristics of random variables, such as mean and variance.

Chapter 9 gives specialised treatment to the normal distribution — in view of its importance in applications.

Chapter 10 introduces statistical inference, that is, how we can infer properties of a population from statistics derived from a sample. One aspect of statistical inference is parameter estimation; Chapter 11 introduces point estimation and confidence interval estimation. Hypothesis testing is strongly related to estimation; Chapter 12 gives an introduction to hypothesis testing.

Chapter 13 discusses some of the intricacies of sampling.

As of 2009-08-18 this is work in progress and will remain so for the foreseeable future.

Chapter 2
Simple Data Analysis and Visualisation and Introduction to R

2.1 Introduction

The objectives of this chapter are to give a very brief introduction to simple statistical techniques and visualisation and to the statistical package R.

2.1.1 Installation of R

Click on http://www.r-project.org/ and find the Download link. For Windows users there is an exe file which does everything. You may need Administrator rights on your machine; contact Computer Services as necessary. Linux users are probably best advised to rely on the installer of their particular Linux distribution.

2.1.2 Running R

Start R by clicking on the R desktop icon. R will open up a window with something like the following in it.
R version 2.7.1 (2008-06-23)
Copyright (C) 2008 The R Foundation for Statistical Computing
ISBN 3-900051-07-0

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

>

The > is R asking you to enter something, as on a calculator; R can operate as a simple calculator, but of course we are interested in its use as a powerful statistical calculator.

> 2 + 3
[1] 5
> sqrt(26)
[1] 5.09902
> 3^4
[1] 81
>

For the remainder of this chapter we'll look at a significant example involving visualisation and exploratory data analysis on a data set.

2.2 Visualisation and Exploratory Data Analysis

We're going to read in some examination result data and analyse them. The file exam.txt contains data as follows:

exam
65
60
47
... etc., 66 results in total

The name of the column is exam and we tell R to pay attention to that. In what follows, # is a comment symbol and R ignores anything after the # until the end of the line. Anything after > is something that you typed — a request to R. If something appears without a >, that is an R response.

> ex <- read.table("exam.txt", header = T)
> attach(ex)
> exam    # print 'exam' data on the screen
 [1] 65 60 47 43 51 32 62 71  0 56 52 59 15 49 54 67 44  2 47 61 45 95 62 80 46
[26] 52 61 12 62 69 78 62 48 56 56 58 60  0 48 71 50 90 51 53  5 51 63 35 39 10
[51] 57 53 20 54 22 44 53 52 25 60 55 39 30 53 67 50
>

That printout is quite uninformative; for example, you have no idea what the maximum is, nor the range, nor have you even a rough idea of what the average mark is, etc. Let us look at a histogram.

> hist(exam)

And we get Figure 2.1. Often, like me here, you want to save the diagram to a file so that you can include it in a report. Here is how to do that; vis1-1.pdf is a filename that I made up.
> pdf("vis1-1.pdf", onefile=FALSE, height=8, width=6, pointsize=8, paper="special")
> hist(exam)
> devoff()
Error: could not find function "devoff"   # R complaining ...
> dev.off()   # do this to finalise and close the file
              # if you don't, it's like forgetting to save in a wordprocessor.
>

[Figure 2.1: Histogram of exam marks.]

Let us see what the average mark is, and the range of marks:

> mean(exam)
[1] 49.07576
> range(exam)
[1]  0 95
>

We could have used:

> length(exam)
[1] 66              # 66 results in 'exam'
> sum(exam)/length(exam)
[1] 49.07576

Let us see the data in sorted order — a good deal more informative than unsorted:

> sort(exam)
 [1]  0  0  2  5 10 12 15 20 22 25 30 32 35 39 39 43 44 44 45 46 47 47 48 48 49
[26] 50 50 51 51 51 52 52 52 53 53 53 53 54 54 55 56 56 56 57 58 59 60 60 60 61
[51] 61 62 62 62 62 63 65 67 67 69 71 71 78 80 90 95
>

Now read in corresponding continuous assessment (CA) marks (coursework); they came from a spreadsheet so there's a load of digits after the decimal point, and that makes the data even more incomprehensible, so we use round to round them to the nearest integer. It looks like the CA marks are more generous than the exam. marks, and mean(ca) confirms this, as does the histogram in Figure 2.2.
> cw <- read.table("ca.txt", header = T)
> attach(cw)
> ca
 [1] 91.34390 85.54622 72.65543 63.10473 73.22074 50.99642 85.69151 97.06528
 [9] 18.58191 83.30836 78.78221 77.68898 21.07860 76.04457 76.56793 86.90106
[17] 61.70048 16.28892 69.57387 83.08058 74.19594 97.12300 81.58833 98.12345
[25] 60.17263 79.49133 89.35610 27.89478 98.06673 92.34510 96.19500 88.69131
[33] 69.70333 85.23094 86.99767 82.89807 77.35877 15.12655 72.41332 90.07670
[41] 75.20815 97.17500 65.78075 70.29256 14.20315 73.02363 87.38178 52.74194
[49] 60.66164 20.05529 78.16085 73.58862 34.07182 78.03601 39.31353 69.57565
[57] 77.53929 77.20521 52.67979 89.10232 76.78222 54.16873 40.23080 81.09443
[65] 89.12518 67.58763
> car = round(ca)
> car
 [1] 91 86 73 63 73 51 86 97 19 83 79 78 21 76 77 87 62 16 70 83 74 97 82 98 60
[26] 79 89 28 98 92 96 89 70 85 87 83 77 15 72 90 75 97 66 70 14 73 87 53 61 20
[51] 78 74 34 78 39 70 78 77 53 89 77 54 40 81 89 68
>
> sort(car)
 [1] 14 15 16 19 20 21 28 34 39 40 51 53 53 54 60 61 62 63 66 68 70 70 70 70 72
[26] 73 73 73 74 74 75 76 77 77 77 77 78 78 78 78 79 79 81 82 83 83 83 85 86 86
[51] 87 87 87 89 89 89 89 90 91 92 96 97 97 97 98 98
>
> mean(ca)
[1] 70.10692
>
> hist(ca)
# and save another one to a file
> pdf("vis1-ca.pdf", onefile=FALSE, height=4, width=6, pointsize=8, paper="special")
> hist(ca)
> dev.off()

[Figure 2.2: Histogram of CA marks.]

Boxplots are another way of examining a data set. Figure 2.3 shows boxplots for the examination and CA results. The construction of the boxplot is as follows: (a) the heavy line across the interior of the box corresponds to the median value (see Chapter 3); (b) the bottom and top of the box correspond to, respectively, the lower quartile and the upper quartile, i.e. 25% of the data are below the lower quartile and 25% are above the upper quartile (or, if you like, 75% are below it).
The so-called whiskers show the smallest and largest values — excluding boxplot's interpretation of outliers. The outliers are then shown as single points.

Quartile is a specialisation of the general term quantile, see Chapter 4. In Chapters 9, 11 and 12, we'll come across, for example, 5% and 95% quantiles. The median is the centre of the data, i.e. as many of the data are above the median as are below it; see Chapter 3.

To determine what are outliers, boxplot uses a rule of thumb: points lying more than 1.5 times the interquartile range beyond the ends of the box (the quartiles) are labelled as outliers.

[Figure 2.3: Boxplot of: left, examination marks; right, CA marks.]

How to look at the two data sets together? There must be a way of superimposing one histogram on another, but I haven't found that yet. So let us display a two-dimensional scatter plot of the two data sets, see Figure 2.4.

> library(lattice)   # first we must load a library that has 'xyplot' in it
> xyplot(exam ~ ca)

[Figure 2.4: Scatter plot of Exam. marks versus CA marks.]

Someone says "those CA and exam. marks look quite correlated; I wonder how accurately we could have predicted the exam. results using the CA?". This is regression territory — and given that Figure 2.4 shows a sort of straight-line relationship, we'll try linear regression: your old friend y = mx + c, or in this case exam = m × ca + c, though it is more usual to use a, b: exam = a + b × ca. a is the intercept, where the fitted straight line meets the y-axis at x = 0, and b is the slope.

> fitres = lm(exam ~ ca)
> summary(fitres)

Call:
lm(formula = exam ~ ca)

Residuals:
     Min       1Q   Median       3Q      Max
-10.9697  -3.1181  -0.7405   3.1036  22.8368

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -10.83639    2.21002  -4.903 6.77e-06 ***
ca            0.85458    0.03002  28.469  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.482 on 64 degrees of freedom
Multiple R-squared: 0.9268, Adjusted R-squared: 0.9257
F-statistic: 810.5 on 1 and 64 DF, p-value: < 2.2e-16
>

R prints a lot of information that we'll find out about in Chapter 20; for now all we need to know are a = -10.83639 (intercept) and b = 0.85458 (coefficient multiplying ca), i.e. the fitted line is exam = -10.83639 + 0.85458 × ca. Figure 2.5 shows the results of the straight-line fitting.

[Figure 2.5: Straight line fitting Exam. marks versus CA marks.]

Finally, we can save all those commands:

> savehistory("20090508-3.txt")
# which we could load again at a later time with
> loadhistory("20090508-3.txt")
# but in any case, when you use q() to quit, R will offer you the
# option of saving, and these saved commands will be loaded the
# next time you run R.
> q()
Save workspace image? [y/n/c]: y

That's enough for an introduction.

Chapter 3
Averages

3.1 Introduction

This chapter gives a brief introduction to "averages", or what statisticians call central tendency. These are often, but not always, useful in summarising a set of data, especially when we wish to compare the data set with another. There are some pitfalls in using the common-or-garden average and we will note some of these.

3.2 Arithmetic Mean

The most familiar average value is the arithmetic mean, i.e. sum the values and divide by the number of data. Just to get used to some mathematical notation, see A.2, we'll write this as you'll see it in textbooks (the data are x_i, i = 1, ..., n):

    x̄ = (1/n) Σ_{i=1}^{n} x_i.                    (3.1)

R-Example 1. As before, we'll read the data and print them.
This time they are already sorted, so much easier to read, even in list form.

> ex2 = read.table("exam2.txt", header = T)
> attach(ex2)
> exam2
 [1] 43 43 43 44 46 48 48 50 51 53 53 53 55 56 56 57 57 58 58 59 59 59 60 60 60
[26] 61 62 62 64 69

We can compute the mean by summing and dividing, see below, but not unexpectedly, R has a function mean that does it for us.

> sum(exam2)
[1] 1647
> length(exam2)
[1] 30
> sum(exam2)/length(exam2)
[1] 54.9
> mean(exam2)
[1] 54.9
>

In spite of its simplicity, it is possible to compute the arithmetic mean wrongly.

R-Example 2. The following are a set of homework marks, marked out of 10. We read the data in and print them. Then we produce a summarising table, marks versus frequency, which tells us that we have three students with a mark of four (4), three with five, six with six, etc.

> df.homew <- read.table("hw.txt", header = T)
> attach(df.homew)
> hw
 [1] 6 8 5 7 6 5 6 4 6 5 8 4 8 8 7 6 6 7 4 7
> table(hw)
hw
4 5 6 7 8   # marks
3 3 6 4 4   # frequencies

If we were not using a computer, we might think that we have a quick way to compute the mean: we have just five distinct marks, namely 4 5 6 7 8, so we'll take the average of those, 4 + 5 + 6 + 7 + 8 = 30, so mean = 30/5 = 6. But R thinks differently:

> mean(hw)
[1] 6.15

The method we used works only if the frequencies are the same for each mark; it would be a rare fluke if this were the case. But we'll pursue the matter further, because (a) computing an arithmetic mean using a frequency table — done properly — can be a (correct) shortcut if you have a lot of numbers and just a calculator or pencil and paper; (b) using frequencies prepares the ground for topics covered in later chapters.

3.2.1 Arithmetic Mean using Frequencies

We'll rewrite the table, now calling the data (marks) x; we'll label them with i so that we have x_i, i = 1 ... n, and n = 5.
> table(hw)
hw
i=   1  2  3  4  5
------------------
x_i  4  5  6  7  8   # marks
f_i  3  3  6  4  4   # frequencies
>

If we want to use the frequency table, we have to replace eqn. 3.1 with

    x̄ = Σ_{i=1}^{n} f_i x_i / Σ_{i=1}^{n} f_i.        (3.2)

Applying eqn. 3.2 to our frequency table above gives (3×4 + 3×5 + 6×6 + 4×7 + 4×8)/(3 + 3 + 6 + 4 + 4) = (12 + 15 + 36 + 28 + 32)/20 = 123/20 = 6.15.

If we look at the sum divided by number calculation in R, we see that the frequency calculation ends up with not only the same result, but the same division,

> length(hw)
[1] 20
> sum(hw)
[1] 123
> sum(hw)/length(hw)
[1] 6.15

If you look at the sum of f_i × x_i you will see that it is the same as 4 + 4 + 4 + 5 + ... + 8 + 8 + 8 + 8; the sorted hw marks are below:

> sort(hw)
 [1] 4 4 4 5 5 5 6 6 6 6 6 6 7 7 7 7 8 8 8 8

And the sum of the frequencies is 20, i.e. the number of data.

3.3 Median

Sometimes neither the mean nor the mode gives us what we would expect from a central value. Look at the following speed data (speed of cars at a speed check). Here the mean, 37.1, is well off the centre; and that offset is caused by an outlier, the 75. The offset would be a lot worse if the outlier was 1000 — not likely in the case of speeds, but outliers of this magnitude are possible in the case of some measurement systems. A common example is a mineralisation survey taken across an area of land. For the sake of argument, assume that we are looking for zinc. A sample that coincides with the dumping of an old bucket will produce a huge outlier. Now if we want to produce contour plots based on smoothed values (averages over regions), then mean smoothing will show a (false) hot-spot, while median smoothing will not.

> sp = read.table("cars.txt", header = T)
> attach(sp)
> speed
[1] 25 31 33 31 30 35 75
> mean(speed)
[1] 37.14286

The median gives the true central value. If we sort the speeds, we see that the central value (the fourth) is 31. median gives the same result.
> sort(speed)
[1] 25 30 31 31 33 35 75
> median(speed)
[1] 31
> speed[4]
[1] 31

In the example above there are seven values, so the central one is the fourth; if we had an even number of values, we would take the average of the two central values. It can be said that the median is a measure of central tendency that is robust against outliers.

3.4 Mode

Sometimes the mean does not give us what we would expect from a central value; for example, in the homework example, the mean (6.15) gives us a value that appears nowhere in the original data; that's normally not a big deal, but it suggests the mode as a possible "average value". The mode is the most frequent value, i.e. obtained from a frequency table or from a histogram, Figure 3.1.

> table(hw)
hw
x_i  4  5  6  7  8   # marks
f_i  3  3  6  4  4   # frequencies

Figure 3.1: Histogram of hw.

Multimodal Data Now that we've mentioned the mode, we'd better take the opportunity of warning about multi-modal data. File hw2.txt contains data which has two peaks in its histogram, Figure 3.2.

> df.homew2 <- read.table("hw2.txt", header = T)
> attach(df.homew2)
> sort(hw2)
 [1] 3 4 4 4 4 5 7 8 8 8 8 8 9
> hist(hw2)
> mean(hw2)
[1] 6.153846

Figure 3.2: Histogram of hw2 — multimodal.

We can calculate the mean, but does it convey much about the centre of the data? No, and using the mean as such may be quite misleading. For example, an average of 6.15 may indicate that the homework was, on average, completed satisfactorily; however, in fact, we had two sets of results, one good, one poor, and the average of 6.15 adequately represents neither. Multimodality is pretty obvious in that small and one-dimensional data set. In much larger data sets and especially in multidimensional data, multimodality may be difficult to detect.
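These notes do everything in R; purely as a cross-check of the arithmetic in the mean, median and mode examples above, here is the same computation sketched in Python (the variable names are mine):

```python
from collections import Counter
from statistics import mean, median

# homework marks from hw.txt (section 3.2)
hw = [6, 8, 5, 7, 6, 5, 6, 4, 6, 5, 8, 4, 8, 8, 7, 6, 6, 7, 4, 7]

freq = Counter(hw)  # frequency table: mark -> count, like R's table(hw)

# eqn. 3.2: frequency-weighted mean = sum(f_i * x_i) / sum(f_i)
weighted_mean = sum(f * x for x, f in freq.items()) / sum(freq.values())

print(mean(hw))                    # 6.15, same answer as eqn. 3.1
print(weighted_mean)               # 6.15, same answer via eqn. 3.2
print(median(hw))                  # 6 -- average of the two central sorted values
print(freq.most_common(1)[0][0])   # 6, the mode (the most frequent mark)
```

As in the R session, the naive "average of the five distinct marks" (6) and the correct frequency-weighted mean (6.15) differ, because the frequencies are unequal.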
Much later, in Chapter 19, we'll look at methods for separating multimodal data into different classes or clusters.

3.5 Other Means

Read up in (Crawley 2005) on: geometric mean and harmonic mean.

Chapter 4
Measures of Data Variability

4.1 Introduction

This chapter introduces methods of describing data variability, most notably variance and standard deviation.

4.2 Variance and Standard Deviation

We are now going to work through an example based on two examination results, exam3 and exam4, see below.

> df.exam3 = read.table("exam3.txt", header = T)
> attach(df.exam3)
> df.exam4 = read.table("exam4.txt", header = T)
> attach(df.exam4)
> exam3
 [1] 68 70 71 72 72 73 73 73 74 75 75 75 75 75 76 76 76 76 76 77 77 78 78 80 82
> exam4
 [1] 43 43 43 44 46 48 48 50 51 53 53 53 55 56 56 57 57 58 58 59 59 59 60 60 60
[26] 61 62 62 64 69 73
>

We are going to assume that these examinations are from two optional modules that final year BSc Honours students can take, that is, students take one or other of these modules and not both. Final Honours classifications depend on these results; but we can see already that the students who took exam3 are at an advantage; except for one, they all achieved first class honours in that examination. If we assume that the exam3 students are equally as capable as the exam4 students, then can we correct the imbalance? Before you start to be incredulous, this technique was practised at a well-known university where I worked.

First of all let us look at the histograms, Figure 4.1, and the box-plots, Figure 4.2.

> hist(exam3)
> hist(exam4)

Figure 4.1: Histograms of exam3 and exam4.

> boxplot(exam3)
> boxplot(exam4)

Figure 4.2: Boxplots of exam3 and exam4.

The means confirm the difference.
> mean(exam3)
[1] 74.92
> mean(exam4)
[1] 55.48387
>
> diff <- mean(exam3) - mean(exam4)
> diff
[1] 19.43613
>

4.2.1 Equalising the means

Can we shift one of the means so that the two data sets have the same mean?

> diff
[1] 19.43613
> exam4new <- round(exam4 + diff)
> exam4new
 [1] 62 62 62 63 65 67 67 69 70 72 72 72 74 75 75 76 76 77 77 78 78 78 79 79 79
[26] 80 81 81 83 88 92
> fpdfsmall()
> hist(exam4new)

Figure 4.3: Histograms of exam3 and exam4 shifted by 19.

4.2.2 Variability and spread

That is a bit better, but there remains a greater spread in exam4new (mean shifted). Can we quantify spread? range gives us the minimum and maximum, but we would like one number.

> range(exam3)
[1] 68 82
> range(exam4new)
[1] 62 92
>

From our experience with the mean, maybe we can take the mean (expected value) of deviations from the means,

> mean(exam3 - mean(exam3))
[1] -1.705372e-15   # effectively zero
> mean(exam4new - mean(exam4new))
[1] -4.586385e-16

Not much good; from the definition of the mean we should have known in advance that these means (or sums) of deviations would be zero — the negative deviations cancel the positive. So let us square the deviations before averaging,

> mean((exam4new - mean(exam4new))^2)
[1] 53.6691
> mean((exam3 - mean(exam3))^2)
[1] 9.0336

We can achieve the same using sum and length,

> sum((exam3 - mean(exam3))^2)/length(exam3)
[1] 9.0336

4.2.3 Variance and Standard Deviation

The variance, which is the expected value of the squared deviations from the mean, is the built-in function to use (var in R), see eqn. 4.1,

    Var(X) = E[(X − µ)²] = (1/n) Σ_{i=1}^{n} (x_i − µ)².        (4.1)

> var(exam3)
[1] 9.41
> var(exam4new)
[1] 55.45806

Immediately, we see that it is not an illusion that the variability of exam4new is much greater than that of exam3.
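As a cross-check of the deviations-from-the-mean arithmetic outside R, the following Python sketch (variable names mine) reproduces the exam3 figures; the last line anticipates why R's var reports 9.41 rather than 9.0336:

```python
from statistics import mean, pvariance, variance

# exam3 marks from section 4.2
exam3 = [68, 70, 71, 72, 72, 73, 73, 73, 74, 75, 75, 75, 75, 75,
         76, 76, 76, 76, 76, 77, 77, 78, 78, 80, 82]

m = mean(exam3)

# eqn. 4.1 by hand: mean of the squared deviations (divide by n)
pop_var = sum((x - m) ** 2 for x in exam3) / len(exam3)

print(m)                   # 74.92
print(pop_var)             # 9.0336, matching mean((exam3 - mean(exam3))^2) in R
print(pvariance(exam3))    # 9.0336 again, the library's divide-by-n variance
print(variance(exam3))     # 9.41 -- this one divides by n - 1, like R's var
```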
Note that the variance as calculated by var is slightly different from that calculated using mean — we'll return to that below.

The variance values, since they are sums of squares, give us a measure of squared variability; that can be hard to interpret and use; what we want is the square-root of the variance, or the standard deviation (sd in R), see eqn. 4.2,

    σ_X = SD(X) = √Var(X).        (4.2)

> sqrt(var(exam4new))
[1] 7.447017
> sqrt(var(exam3))
[1] 3.067572
> sd(exam4new)
[1] 7.447017
> sd(exam3)
[1] 3.067572
>

Variance different from mean of squared deviations? We return to the problem of the variance being different from the mean of squared deviations. The clue is given below,

> sum((exam3 - mean(exam3))^2)/length(exam3)
[1] 9.0336
> sum((exam3 - mean(exam3))^2)/(length(exam3) - 1)
[1] 9.41

In fact, rather than eqn. 4.1, this particular implementation of var computes what is called the sample variance using eqn. 4.3,

    Var(X) = 1/(n − 1) Σ_{i=1}^{n} (x_i − x̄)².        (4.3)

This gives an unbiased estimate of the variance.

4.3 Standard Scores and Normalising Marks

We now return to our desire to manipulate (fairly) the two data sets, exam3, exam4, such that students in each class have roughly the same opportunity; see section 4.2.1 where we equalised the means, but where we noted that the difference in variability remained a problem.

4.3.1 Standard Scores

The normal way to equalise data sets like these (the proper term is either standardise or normalise) is to use the standard score as in

    X_ss = (X − µ)/σ.        (4.4)

Eqn. 4.4 gives a set of scores with mean zero and standard deviation one, µ_ss = 0, σ_ss = 1. Thus, if we apply eqn. 4.4 to the two sets of marks, using the mean and standard deviation of each, we get two sets of marks with the same mean (0) and the same spread (standard deviation 1). That is fine for purely comparison purposes, but what if we need marks to publish? What we are going to do is: (i) use eqn.
4.4 to standardise the scores; then (ii) multiply by whatever (new) standard deviation, call it σ_new, that we require; finally, (iii) add the (new) mean that we require. The whole operation is given in eqn. 4.5,

    X_new = ((X_old − µ_old)/σ_old) × σ_new + µ_new.        (4.5)

We'll now apply this to exam4, i.e. we want to make exam4 as close as possible to exam3 (in terms of mean and standard deviation).

> sd3 <- sd(exam3)
> sd3
[1] 3.067572
> m3 <- mean(exam3)
> sd4 <- sd(exam4)
> sd4
[1] 7.447017
> m4 <- mean(exam4)
> m4
[1] 55.48387
> m3
[1] 74.92
> exam4new = round(((exam4 - m4)/sd4)*sd3 + m3)
> exam4new
 [1] 70 70 70 70 71 72 72 73 73 74 74 74 75 75 75 76 76 76 76 76 76 76 77 77 77
[26] 77 78 78 78 80 82
> mean(exam3)
[1] 74.92
> mean(exam4new)
[1] 74.96774   # difference due to rounding
> sd(exam3)
[1] 3.067572
> sd(exam4new)
[1] 2.99426    # difference due to rounding
>

And let us compare the histograms in Figure 4.4.

Figure 4.4: Histograms of exam3 and exam4new (exam4 equalised with exam3).

Chapter 5
Probability and Random Variables

5.1 Introduction

This chapter gathers together some basic definitions, symbols and terminology to do with probability, random variables, and random processes; the topics are chosen according to their applicability to basic statistics for bio-scientists, as well as pattern recognition, image processing and data compression. We will use some of the notation from Appendix A; you should have a quick look at that first. We emphasise that such notation is merely shorthand for common-sense concepts which would otherwise be confusing and long-winded if written in English.

5.2 Basic Probability and Random Variables

5.2.1 Introduction

Let there be a set of outcomes to an experiment {ω_1, ω_2, ..., ω_n} = Ω, where, to each ω_i, we associate a probability p_i.
The definition of probability includes the following constraints:

    0 ≤ p_i ≤ 1,        (5.1)

    Σ_{i=1}^{n} p_i = 1.        (5.2)

The above simple definition of probability over outcomes is satisfactory for simple applications, but for many applications we need to extend it to apply to subsets of Ω. We could call the outcomes above elementary events, i.e. indivisible events, and we could call the subsets below composite, i.e. they are a composition of one or more outcomes.

Ω is often called the sample space, i.e. as defined above, the set of all possible outcomes of the experiment. Elements of Ω are called outcomes, sample outcomes, or realisations. One of the problems of learning probability and statistics is the confusion caused by the multiplicity of terms for the same concept. In addition, different fields of study, e.g. bio-science, engineering, social science, ... have their own terminology.

Example 1 Six sided dice. Ω = {i | i ∈ {1, ..., 6}} = {1, 2, ..., 6}.

Example 2 Toss two six sided dice. Ω = {(i, j) | i, j ∈ {1, ..., 6}} = {(1, 1), (1, 2), ..., (1, 6), (2, 1), ..., (6, 6)}.

Example 3 Two sided coin. Ω = {H, T}. Outcomes need not be numbers.

5.2.2 Probability and Events

Let there be subsets of Ω called events, with a general event a_i; the set of all a_i's is A. We define a probability measure P on A; P(a) is a number and satisfies the following axioms:

    P(a) ≥ 0,        (5.3)

    P(Ω) = 1  (certain event, something happens).        (5.4)

If a_1, a_2, ... are disjoint, i.e. a_i ∩ a_j = ∅ for all i, j, i ≠ j, then

    P(∪_{i=1}^{∞} a_i) = Σ_{i=1}^{∞} P(a_i).        (5.5)

Disjoint (subsets) is another term for mutually exclusive, i.e. they cannot possibly happen together. ∩ denotes set intersection, i.e. in eqn. 5.5 we are requiring that there is no overlap between any of the subsets, and ∪ denotes union. Put simply, eqn. 5.5 says that probabilities add for events that do not overlap. ∅ denotes the empty set.

There is a fourth axiom, a corollary of eqns. 5.4 and 5.5,

    P(∅) = 0  (impossible event).        (5.6)
Example 4 Six sided dice. Ω = {1, 2, ..., 6}. Let a be the event score greater than three; i.e. a = {4, 5, 6}.

Example 5 Toss two six sided dice. Ω = {(i, j) | i, j ∈ {1, ..., 6}}. Let a be the event score less than four. Then a = {(1, 1), (1, 2), (2, 1)}.

Partition When {a_1 ∪ a_2 ∪ ... ∪ a_n} = Ω and a_1, a_2, ..., a_n are disjoint, we say that {a_1, a_2, ..., a_n} form a partition of Ω.

5.2.3 A Point on Terminology

Above we have P(a_i) for the probability that the outcome is in set a_i. "The outcome is in set a_i" is what is called a proposition. A proposition is a sentence which may be true or false — but only one or the other and not in between. We should note that in most textbooks, and later in these notes, the arguments of probability functions P(.) will be propositions, e.g. P(A) means the probability that A will occur, or that A will be true. Then, when we write P(AB) or P(A, B) (they mean the same), we mean the probability of A and B both being true; logical and.

Not or set complement We may want to talk about the probability that A will be false, i.e. the probability that the outcome will be in the complement set to A, i.e. any of the outcomes (in Ω) but not in A's set. Not A is denoted Ā. We now can write a further axiom,

    P(Ā) = 1 − P(A).        (5.7)

Example 6 Six sided dice. Ω = {1, 2, ..., 6}. Let A = {1, 2, 3, 4}, so Ā = {5, 6}. P(Ā) = 1 − P(A) = 1 − 4/6 = 2/6 = 1/3.

5.2.4 Probability of Non-disjoint Events

We saw in eqn. 5.5 that to compute the probability of two disjoint events you can add probabilities. For events A and B that are not necessarily disjoint (there may be overlap), we can write

    P(A ∪ B) = P(A) + P(B) − P(AB).        (5.8)

Example 7 Six sided dice. Ω = {1, 2, ..., 6}. Let A = {1, 2, 3, 4} and B = {4, 5}; so A ∪ B = {1, 2, 3, 4, 5} and A ∩ B = {4}. P(A ∪ B) = P(A) + P(B) − P(A ∩ B) = 4/6 + 2/6 − 1/6 = 5/6, and we can see that, computed directly, P(A ∪ B) = P({1, 2, 3, 4, 5}) = 5/6.

We note that eqn. 5.8 collapses to eqn.
5.5 when AB is false (no overlap, the two cannot be true together), because of eqn. 5.6, i.e. P(∅) = 0, and P(A ∪ B) = P(A) + P(B) − P(∅) = P(A) + P(B) − 0 = P(A) + P(B).

5.2.5 Finite Sample Spaces

In Example 1 we could identify and list all possible outcomes and we have a finite sample space. On the other hand, if the outcome was a weight, for example of a precipitate, then we could not list all possible weights and we would have an infinite sample space.

5.3 Random Variables

If, to every outcome, ω, of an experiment, we assign a number, X(ω), X is called a random variable (r.v.). X is a function over the set Ω = {ω_1, ω_2, ...} of outcomes; if the range of X is the real numbers or some subset of them, X is a continuous r.v.; if the range of X is some integer set, then X is a discrete r.v. Chapter 6 contains an extensive discussion of random variables and an introduction to probability distributions.

5.4 Computing probabilities

We have already done this in examples, but we need to formalise a bit. The number of elements in a (finite) set, say a, is called its cardinality and written |a|.

Example 8 Six sided dice. Ω = {1, 2, ..., 6}, |Ω| = 6. Let a = {4, 5, 6}, |a| = 3.

If the outcomes are equally likely (which {1, 2, ..., 6} are), then we can compute the probability of an event a as the ratio:

    P(a) = |a| / |Ω|.        (5.9)

Example 9 Six sided dice. Ω = {1, 2, ..., 6}, |Ω| = 6. Let a = {4, 5, 6}, |a| = 3, so P(a) = |a|/|Ω| = 3/6 = 1/2.

5.5 Enumerating more complex events and sample spaces

We see above P(a) = |a|/|Ω|. But |a| or |Ω| may not be simple to enumerate or count.

5.5.1 Multiplication of outcomes

Let an event correspond to the combined outcomes of two experiments performed in sequence. Let the first have n_1 outcomes and the second n_2 outcomes. Any of the n_1 outcomes of the first may be followed by any of the n_2 outcomes of the second, so the number of outcomes in the combined experiment is n_1 × n_2.
Example 10 Toss two six sided dice in sequence (but the result is the same if we throw them together). n_1 = |Ω_1| = 6, n_2 = |Ω_2| = 6, so, for the combined experiment, |Ω| = n_1 × n_2 = 36, which we can also compute by counting the elements in Ω = {(i, j) | i, j ∈ {1, ..., 6}}.

5.5.2 Addition of outcomes

Suppose again that we have two experiments. Let the first have n_1 outcomes and the second n_2 outcomes. This time we perform the first experiment or the second, but not both, and which of them gets performed is chosen randomly; how many outcomes? We have the n_1 outcomes of the first, or the n_2 outcomes of the second, so the total number of outcomes in the combined experiment is n_1 + n_2.

Example 11 Toss one six sided dice or toss a two sided coin. n_1 = |Ω_1| = 6, n_2 = |Ω_2| = 2, so, for the combined experiment, |Ω| = n_1 + n_2 = 8, which we can also compute by counting the elements in Ω = {1, 2, 3, 4, 5, 6, H, T}.

5.5.3 Permutations

Suppose we have n items and we wish to place them in a sequence — just any sequence, not ordered according to size or any other attribute. How many ways to do this? The first position may be filled by any of the n items; the second position may be filled by any of the remaining n − 1 items, and so on, so that the number of possible different sequences (orderings) is

    n(n − 1)(n − 2) ... 1 = n!  (n-factorial).        (5.10)

Suppose now we have n items and we wish to choose any r of them and place these in a sequence. How many ways to do this? The first position may be filled by any of the n items; the second position may be filled by any of the remaining n − 1 items, and so on until we have r in the sequence. The number of possible different sequences (orderings) is

    n(n − 1)(n − 2) ... (n − r + 1) = n!/(n − r)! = nPr.        (5.11)

nPr is the name for the number of permutations of r from n.
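The counting formulae above are easy to check mechanically; a small Python sketch (the notes themselves use R, so this is just a cross-check), using 3 items chosen from 6 as an arbitrary example:

```python
import math

n, r = 6, 3

# eqn. 5.10: the number of orderings of all n items is n!
print(math.factorial(n))    # 720

# eqn. 5.11: nPr = n! / (n - r)!
npr = math.factorial(n) // math.factorial(n - r)
print(npr)                  # 120 = 6 * 5 * 4
print(math.perm(n, r))      # 120 again, via the library function
```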
5.5.4 Combinations

Suppose again we have n items and we wish to choose any r of them, but we do not need to place the r in a sequence. How many ways nCr to do this? We can appeal to eqns. 5.11 and 5.10: each unordered choice of r items corresponds to r! orderings, so

    n!/(n − r)! = nCr × (number of ways of permuting the r) = r! × nCr,

which leads to

    nCr = n! / (r!(n − r)!) = (n choose r).        (5.12)

5.6 Conditional Probability

Example 12 Ω = {1, 2, 3, 4, 5, 6}. I throw the dice. What is the probability of getting greater-than-three, P(> 3)? Let A be greater-than-three so that A = {4, 5, 6}, and the cardinality of this set is n_A = |A| = 3, and n_dice = |Ω| = 6, see section 5.4; there are three possibilities greater-than-three, so P(A) = P(> 3) = n_A/n_dice = 3/6 = 1/2.

Now, I have a peek and I tell you that we have an odd number; let us call this event B (odd). What now is the probability of A (> 3)? The probability surely has changed, because the only possibilities now are odd = {1, 3, 5}. Within this set, 5 is the only (one) possibility that satisfies greater-than-three, so, forgetting about any ideas we had before, we say that the conditional probability of greater-than-three, given that we already know that an odd number has occurred, is 1/3; i.e. the probability has dropped from 1/2 to 1/3 based on the information that an odd number has occurred. We write this P(> 3 | odd), the conditional probability of > 3 conditional on the fact that we already know that an odd number has occurred. This is conditional probability; we computed the probability of A conditional on B, P(A|B).

5.6.1 Venn diagrams

Venn diagrams, see section A.1.4, can be used to think about conditional probabilities such as the one in Example 12. Here Ω = {1, 2, 3, 4, 5, 6} corresponds to the universal set (the set of all possibilities). Once we have been told that the number is odd, we can reduce our sample space to the set odd; then odd ∩ (> 3) = {5}.
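The set manipulations of Example 12 can be reproduced by brute-force enumeration; a Python sketch (a cross-check of the counting, not part of the notes' R material):

```python
import math
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}            # sample space for one six sided dice
A = {x for x in omega if x > 3}       # greater-than-three = {4, 5, 6}
B = {x for x in omega if x % 2 == 1}  # odd = {1, 3, 5}

print(math.comb(6, 3))                # 20 = 6!/(3! 3!), eqn. 5.12 with n=6, r=3
print(Fraction(len(A), len(omega)))   # 1/2 -- P(>3) by eqn. 5.9
# conditional probability by counting within the reduced sample space odd:
print(Fraction(len(A & B), len(B)))   # 1/3 -- P(>3 | odd)
print(A & B)                          # {5}, the Venn intersection odd ∩ (>3)
```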
Example 13 If after hearing first that we have an odd number, then secondly we are told that greater-than-three has occurred, we are then asked (a) what is the probability of a six? (b) what is the probability of a five? Think about it: once we have the two pieces of information, odd, then greater-than-three, the possibilities are very greatly reduced. To what?

Figure 5.1: Dice: (a) universal set; (b) sets odd, even; (c) sets (> 3) and (<= 3) superimposed to show that, for example, odd & (> 3) = (set odd) ∩ (set > 3) = {1, 3, 5} ∩ {4, 5, 6} = {5}.

5.6.2 Probability Trees

Probability trees, see (Griffiths 2009, p. 158), are another way to think graphically about conditional probability. In mathematics, trees can grow sideways or even upside down. Figure 5.2 shows a probability tree for Example 12. When we split into branches as in Figure 5.2, any branching must represent all possibilities; in this case we first have odd and even; if we call odd B, we have even = not-odd = B̄. In the diagram we have no bar symbol, so we use B' = B̄. Next we have (> 3) and (<= 3). Thus, at any branching the probabilities in the branches must sum to one. The diagram shows how to compute joint probabilities using conditional probabilities and the probability of the conditioning event, for example P(> 3 & odd) = P(> 3 | odd) × P(odd). Figure 5.3 shows a general probability tree.

The following may help us to think about conditional probability and joint probability. Think of the tree as having probability flowing in its branches.
We start off at the root with all the probability (one, 1); proportions of the probability flow into the first set of branches (the proportions sum to one); following one of those branches, at the next branching point we split the remaining probability into proportions that again sum to one (it is just the proportions that sum to one: if there is, for example, 0.4 flowing into the branching point, and the proportions are 0.4, 0.4, 0.2 — a three-way branch — then we will have probability flows of 0.16, 0.16, 0.08). And so on.

Figure 5.2: Probability tree for the dice example. We start off on the left with the root and everything possible. Then we split into branches odd and even. Next we split odd into (> 3) and (<= 3); same for the even branch.

Figure 5.3: Probability tree.

Symbolically, and referring to Figure 5.3: if we have proportion P(B) in a branch, and that then splits into proportions P(A|B) and P(Ā|B) (these (relative) proportions again sum to one, but their total probability sums to whatever flowed into the branching point), then the P(A|B) branch must carry an absolute amount of probability equal to P(A|B) × P(B), and this is P(AB).
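The flows in the dice tree of Figure 5.2 can be tabulated numerically, multiplying each conditional probability by the probability of its conditioning event, as the tree prescribes; a Python sketch (names mine, exact fractions for clarity):

```python
from fractions import Fraction as F

# branch probabilities for the dice tree of Figure 5.2
P_odd = F(1, 2)
P_even = F(1, 2)
P_gt3_given_odd = F(1, 3)    # only 5 among {1, 3, 5}
P_gt3_given_even = F(2, 3)   # 4 and 6 among {2, 4, 6}

# one product per leaf of the tree: P(A & B) = P(A | B) * P(B)
leaves = {
    ">3 & odd":   P_gt3_given_odd * P_odd,          # 1/6
    "<=3 & odd":  (1 - P_gt3_given_odd) * P_odd,    # 1/3
    ">3 & even":  P_gt3_given_even * P_even,        # 1/3
    "<=3 & even": (1 - P_gt3_given_even) * P_even,  # 1/6
}

print(leaves)
print(sum(leaves.values()))   # 1 -- all the probability flows out to the leaves
```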
Formula for Conditional Probability We now give the formula for computing conditional probabilities,

    P(A|B) = P(AB) / P(B),        (5.13)

provided that P(B) > 0. Alternatively, as in Figure 5.3,

    P(AB) = P(A|B) P(B).        (5.14)

5.6.3 Joint Probability

P(AB) is the joint probability of A and B happening together. Sometimes we write P(AB), sometimes P(A & B), sometimes P(A and B), and sometimes, using set notation, P(A ∩ B).

5.7 Bayes' Rule

If we reverse the conditionality in eqn. 5.13 and note that P(AB) = P(BA), we have

    P(B|A) = P(AB) / P(A),        (5.15)

leading to

    P(A) P(B|A) = P(AB),        (5.16)

and eqn. 5.13 gives us

    P(B) P(A|B) = P(AB),        (5.17)

so that

    P(A) P(B|A) = P(B) P(A|B),        (5.18)

leading to Bayes' rule:

    P(A|B) = P(A) P(B|A) / P(B).        (5.19)

Eqn. 5.19 allows us to invert or reverse the conditionality.

Example 14 Let A be has disease-X; let B be has swollen ankles. From a sample of former disease-X patients, we can estimate P(B|A); say it is P(B|A) = 0.3. Let us assume that we also know the proportion of the general population that have swollen ankles, P(B) = 0.01. Also we assume that we have the incidence of disease-X in the general population, P(A) = 0.005. Eqn. 5.19 allows us to compute the probability that the patient has disease-X given that the swollen ankles symptom (B) is present, P(A|B). Of course, in general, P(A|B) ≠ P(B|A).

    P(A|B) = P(A) P(B|A) / P(B) = 0.005 × 0.3 / 0.01 = 0.15.        (5.20)

Bayes' rule may be written in a more general manner. First we need a result called the law of total probability. Let A_1, A_2, ..., A_n be a partition of Ω (see section 5.2.2 for a definition of partition); then

    P(B) = Σ_{i=1}^{n} P(B|A_i) P(A_i).        (5.21)

We write the more general form of Bayes' rule as

    P(A_i|B) = P(B|A_i) P(A_i) / Σ_{j=1}^{n} P(B|A_j) P(A_j).        (5.22)

Let us return to Example 14 and apply eqn. 5.22.
When we said "proportion of the general population that have swollen ankles, P(B) = 0.01", we strictly meant the probability of swollen ankles among people with disease-X together with those without disease-X = 0.01. We can restate the problem with A_1 = has disease-X and A_2 = has not disease-X, so that they form a partition of the general population. Assume that we now have P(B|A_2) = 0.01 (i.e. we are changing the story slightly to associate this probability with people who do not have disease-X) and, as before, P(B|A_1) = 0.3; we need also P(A_1) = 0.005, as before. What is P(A_2)? It is P(Ā_1) (the probability that a general person does not have disease-X), and this is 1 − P(A_1) = 0.995. Eqn. 5.21 now gives a revised figure for P(B),

    P(B) = Σ_i P(B|A_i) P(A_i) = P(B|A_1) P(A_1) + P(B|A_2) P(A_2) = 0.3 × 0.005 + 0.01 × 0.995 = 0.01145,

and we can rework Example 14 (or use eqn. 5.22),

    P(A_1|B) = P(A_1) P(B|A_1) / P(B) = 0.005 × 0.3 / 0.01145 = 0.131.

5.8 Independent Events

We have already discussed disjoint events, i.e. events which cannot occur simultaneously; thus, for disjoint events A, B, A ∩ B = ∅. Consequently, we can state that P(A|B) = 0 (if B has occurred, A cannot).

At the opposite extreme, let A ⊂ B, i.e. A is a subset of B, and if A has occurred, then so must B, with certainty, so in this case P(B|A) = 1.

Example 15 Ω = {1, 2, 3, 4, 5, 6}. Let B = {2, 4, 6} (even number) and A = {6}. If we know that a 6 has been thrown (A has occurred), what is P(B|A)? The answer is 1 — we know that 6 is even, so B is a sure thing — in punter parlance :-).

But there are cases where A and B are totally unrelated — they are independent events.

Example 16 Throw a dice (1) and toss a coin (2). Ω_1 = {1, 2, 3, 4, 5, 6}, Ω_2 = {H, T} and the combined sample space Ω = {(1, H), (1, T), (2, H), ..., (6, H), (6, T)} and |Ω| = 12. Let A = {4, 6} and B = {H}, so that AB = {(4, H), (6, H)} (two out of 12 equally likely events), so P(AB) = 1/6. Also P(A) = 1/3, P(B) = 1/2. From eqn.
5.13 we have

    P(B|A) = P(AB)/P(A) = (1/6)/(1/3) = 1/2.

Because the result of the dice throw is unrelated to the result of the coin toss, we are not surprised to find that P(B|A) = P(B) = 1/2. This leads us to a more general definition of independent events,

    P(B|A) = P(B) = P(AB)/P(A),

so that A and B are independent events if and only if

    P(AB) = P(A) P(B).        (5.23)

5.9 Betting and Odds

In circumstances where the terms have meaning, the probability of A can be computed as the ratio of the number of equal probability events favourable to A, n_A, versus the total number of equal probability events, n_T,

    P(A) = n_A / n_T.        (5.24)

Odds, on the other hand, are computed as the ratio of the number of equal probability events favourable to A, n_A, versus the number of equal probability events unfavourable to A, n_Ā,

    O(A) = n_A / n_Ā.        (5.25)

Thus, the probability of a 1 on the throw of a dice is 1/6, whilst the odds are 1/5; bookmakers express this as five-to-one against. The probability for any number less than five (1–4) would be 4/6, whilst the odds are 4/2 = 2/1; bookmakers express this as two-to-one on.

You can calculate probability from odds using

    P(A) = O(A) / (1 + O(A)).        (5.26)

Thus, for any number less than five (1–4) on a dice throw, P(A) = O(A)/(1 + O(A)) = (2/1)/(1 + 2/1) = 2/3.

You can calculate odds from probability using

    O(A) = P(A) / (1 − P(A)),        (5.27)

that is, the ratio of probability-for (favourable) to probability-against (unfavourable). Thus, for a one on a dice throw, O(A) = (1/6)/(1 − 1/6) = 1/5.

Bookmakers odds and probabilities Bookmakers' "probabilities" do not add to 1, unlike proper probabilities, which add to one over all possible events, see eqn. 5.2. Let's say we have four horses, each with an equal probability of winning, P(A_i) = 1/4, for i = 1, 2, 3, 4. We would expect odds of

    O(A_i) = (1/4)/(1 − 1/4) = 1/3,

or three-to-one against. But the bookmaker has to make a living, and not just provide a mutual service for his punters.
In this case, if four punters bet 10 Euro on each horse (the bookie gets 40 Euro), one punter gets paid 30 Euro plus his stake returned = 40 Euro, and the bookie makes nothing for his work. The bookie is likely to give odds of something like two-to-one against, O′(A) = 1/2, and, computing probabilities, we find

P′(A) = O′(A)/(1 + O′(A)) = (1/2)/(1 + 1/2) = 1/3,

and the sum of the "probabilities" is 4/3. In this amended case, if four punters bet 10 Euro on each horse (the bookie gets 40 Euro), one punter gets paid 20 Euro plus his stake returned = 30 Euro, and the bookie makes 10 Euro.

5.10 Classical versus Bayesian Interpretations of Probability

In many books and discussions you will see a distinction made between the classical and the Bayesian interpretation of probability; in this context the term frequentist may be used as a synonym for classical. As an interpretation of probability, the term Bayesian has little to do with Bayes' rule, section 5.7 — that is, until we get to statistical inference, Chapter 10. Broadly speaking, Bayesians interpret probability as belief; frequentists interpret probability as relative frequency.

Bayesian (belief) interpretation Take the case of the tossed (fair) dice. If you were asked to rate, on a scale of [0, 1], your belief that 2 will be the outcome, you would, I hope, agree that the probability is 1/6; for an even number of dots: 2/6 = 1/2; and for any number 1–6 — a sure thing — the probability is 1. Here 0 corresponds to complete disbelief and 1 to complete belief.

Relative frequency interpretation The frequentist says that the probability of 2 is the relative frequency with which 2 occurs in a large number of hypothetical throws. Let us then run an experiment involving a large number (n = 600) of throws, and let yi = the count of each Xi obtained. We might expect to obtain something like y1 = 95, y2 = 110, y3 = 90, y4 = 97, y5 = 105, y6 = 103.
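An experiment of this kind is easy to simulate. A short R sketch (the seed value is arbitrary, used only so the run is reproducible):

```r
## Simulate 600 throws of a fair dice and tabulate the counts y_i.
set.seed(1)                                   # arbitrary seed, for reproducibility
throws <- sample(1:6, size = 600, replace = TRUE)
y <- table(throws)    # counts for each face, analogous to y1, ..., y6
y / 600               # relative frequencies; each should be near 1/6 = 0.167
```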
We then use p̂(i) = yi/n; the hat, ˆ, indicates that p̂(i) is an approximation to p(i); however, p̂(i) → p(i) as n → ∞. We have

p̂(i) = yi/n = {95/600, 110/600, 90/600, 97/600, 105/600, 103/600} = {0.158, 0.183, 0.15, 0.162, 0.175, 0.172}.

The correct value is p(i) = 1/6 = 0.1667. The errors above are not a real indictment of the frequentist method; a thought experiment allows us to reason that p(i) = 1/6. On the other hand, when you want to bet on a football match and would like to estimate the probability and hence the odds, it makes no sense to think of an infinity of matches.

Chapter 6

One Dimensional Random Variables

6.1 Introduction

We have already introduced the notion of a random variable in section 5.3, i.e. where we associate a number with the outcome of an experiment governed by probability. In most cases, your (scientific) data will already be numerical, but it nonetheless remains worthwhile to be cognisant of the details of probability and sample space described in Chapter 5. In some of the examples in Chapter 5, namely those involving the dice, the outcome already is a number, i.e. {1, . . . , 6}; in some considerations, this number is more a label than a number, but in any case, the association of a number with the outcome is trivial. In the coin example we had {H, T}; in this case we could use the association {H → 1, T → 0}.

6.1.1 Definition: Random Variable

If, to every outcome, ω, of an experiment, we assign a number, X(ω), X is called a random variable (r.v.). X is a function over the set Ω = {ω1, ω2, . . .} of outcomes; if the range of X is the real numbers or some subset of them, X is a continuous r.v.; if the range of X is some integer set, then X is a discrete r.v. The space of all possible values of X is called the range space of X, RX.

In discussing random variables we label the r.v. with an upper case letter, e.g. X, but particular values of it are labelled with lower case, e.g. x, or xi.
Example 17 Toss two coins. Ω = {TT, TH, HT, HH}. Let a r.v. X be defined as the number of heads in the outcome, i.e. {TT → 0, TH → 1, HT → 1, HH → 2}. Notice that two outcomes map to the same number (1); this is not a problem or a mistake. RX = {0, 1, 2}.

6.1.2 Probability associated with a Random Variable

Suppose we have an event B with respect to a range space RX. Let the event A with respect to Ω be defined as

A = {ω ∈ Ω | X(ω) ∈ B}.   (6.1)

Then A and B are equivalent events and we can carry the definitions and equations of Chapter 5 over to random variables.

Example 18 Two coins as in Example 17. Examples of equivalent events are: A = {TT}, B = {0}; A = {TH, HT}, B = {1}; A = {HH}, B = {2}.

In the case of eqn. 6.1, we can say

P(B) = P(A).   (6.2)

Example 19 Two coins as in Example 18. A = {TT}, P(A) = 1/4, B = {0}, P(B = 0) = 1/4; A = {TH, HT}, P(A) = 1/2, B = {1}, P(B = 1) = 1/2; A = {HH}, P(A) = 1/4, B = {2}, P(B = 2) = 1/4.

6.2 Probability Mass Function (pmf) of a Discrete r.v.

Let a r.v. X have a range space RX = {x1, x2, . . . , xn}. We denote the probability of a particular value X = xi as pX(xi) = P(X = xi). The probabilities pX(xi), i = 1, 2, . . . , n, in keeping with eqns. 5.3 and 5.4, must satisfy

pX(xi) ≥ 0, i = 1, 2, . . . , n,   (6.3)

Σ_{i=1}^n pX(xi) = 1.   (6.4)

pX is called the probability function or the probability mass function of the r.v. X. We'll attempt to standardise on probability mass function and its abbreviation pmf. We use the shorthand X ∼ pX to state that the r.v. X has pmf pX. Often, where there is no ambiguity, you will find the subscript X omitted — pX(x) → p(x).

6.3 Some Discrete Random Variables

This section identifies and describes the pmfs of some commonly occurring discrete random variables.

6.3.1 Point Mass Distribution

If X can take on only one value, a, it has a point mass distribution at a; X ∼ δa.

pX(x) = 1, for x = a, and 0 elsewhere.
(6.5)

6.3.2 Discrete Uniform Distribution

X has a discrete uniform distribution on {1, . . . , k}, X ∼ U(1, k), if

pX(x) = 1/k, for x = 1, . . . , k; and 0 elsewhere.   (6.6)

Example 20 Lottery machine, k balls. First draw, X ∼ U(1, k).

6.3.3 Bernoulli Distribution

Let X be the result of a (binary outcome) experiment with probability p of one outcome, X = 1, say, and 1 − p for the other, X = 0; for example a coin flip. There is overuse of the symbol p here, but we need to keep to standard notation; context should resolve any ambiguities between the parameter p = P(X = 1) and the pmf pX(x).

pX(x) = p^x (1 − p)^(1−x), for x ∈ {0, 1}.   (6.7)

6.3.4 Binomial Distribution

Repeat the experiment above (Bernoulli distribution — coin flip) n times and let X be the number of 1s (e.g. heads) obtained.

pX(x) = (n choose x) p^x (1 − p)^(n−x), for x ∈ {0, 1, . . . , n}; 0 otherwise.   (6.8)

Where does the (n choose x) come from? We have already introduced it in eqn. 5.12; it is the number of ways of selecting x items from n. The probability of the x 1s is p^x and the probability of the n − x 0s is (1 − p)^(n−x); the flips are independent, so we can multiply the probabilities to get p^x (1 − p)^(n−x). However, there are (n choose x) = n!/(x!(n−x)!) possible ways of getting the x 1s.

Take n = 3; the sample space is Ω = {TTT, TTH, THT, THH, HTT, HTH, HHT, HHH} and the event corresponding to x = 2 (two heads, any two heads) is A = {THH, HTH, HHT}, i.e. there are three outcomes that give two heads:

(n choose x) = (3 choose 2) = 3!/(2!1!) = 6/2 = 3.

6.3.5 Geometric Distribution

X has a geometric distribution with parameter p, X ∼ Geom(p), p ∈ (0, 1), if

P(X = k) = p(1 − p)^(k−1), k = 1, 2, . . . .   (6.9)

Example 21 Distribution of the number of coin flips until the first head.

6.3.6 Poisson Distribution

X has a Poisson distribution with parameter λ, X ∼ Poisson(λ), if

pX(x) = e^(−λ) λ^x / x!, x ≥ 0.   (6.10)

Example 22
Distribution of rare events like traffic accidents; there can be long periods of inactivity, but clumping of events is possible, e.g. waiting a long time for a town bus and then three arrive in quick succession!

6.4 Some Continuous Random Variables

This section identifies and describes the probability density functions of some commonly occurring continuous random variables. First we must introduce a continuous alternative to the probability mass function.

6.4.1 Probability Density Function (PDF)

When we discussed discrete r.v.'s we let X have a range space RX = {x1, x2, . . . , xn}; the number of values in the range space was countable. Let the range space be RX = {0, 0.01, 0.02, . . . , 0.99, 1.0}; this is still a discrete r.v. But what if RX = [0, 1], i.e. all real numbers in the range 0 to 1? A number of problems arise, the chief of which are:

• the random variable is now continuous, i.e. the elements of the range space are not countable;

• the probability of any particular value of the r.v. is in fact zero. Example: you buy 0.5-kg of cheese in Tesco; what is the chance of it being exactly 0.5-kg? Zero. The same goes for the weight of a product of a chemical experiment.

Hence we cannot use probability mass functions. We must now use a different probability function called a probability density function (pdf). A pdf, over a range space RX, must satisfy (c.f. eqns. 6.3 and 6.4 for discrete r.v.'s)

fX(x) ≥ 0, all x ∈ RX,   (6.11)

∫_{RX} fX(x) dx = 1.   (6.12)

We emphasise that fX(x) is not a probability, but fX(x) dx is. If you want to speak of a probability over a continuous r.v. you must state something like: the probability that X is in the range a to b, inclusive, is P(a ≤ X ≤ b), i.e. ∫_a^b fX(x) dx.

The term probability density function is used (in contrast to probability mass function, for discrete r.v.'s) because, with a continuous r.v., you simply cannot pick a value X = x, say, and state P(X = x), which is in fact zero.
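To see the distinction in practice, the following R sketch (using the standard Normal as an example density; the quoted numbers are approximate) obtains probabilities by integrating a pdf:

```r
## A density value is not a probability; probabilities come from integrating
## the pdf. Here we use the standard Normal density dnorm as an example.
f <- dnorm
integrate(f, -Inf, Inf)       # total probability: 1 (to numerical accuracy)
integrate(f, 0, 1)            # P(0 <= X <= 1), roughly 0.34
dnorm(0, mean = 0, sd = 0.1)  # about 3.99: a density value can exceed 1
```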
Discrete probability mass versus continuous probability density

Think of a ruler upon which we place (stick with Blu-tack) ball bearings of various sizes along its length; the ball bearings represent discrete masses and we can state that we have a mass mi at ruling xi; we can also compute the total mass as Σ_i mi. Now think of a rod of varying diameter laid along the ruler; we cannot pick a point x and say that the mass at precisely that point is m(x), but we can say that the mass in a little length, x to x + Δx, is d(x)Δx, where d is the mass per unit length at x (the density). In this case we can compute the total mass as ∫_length d(x) dx.

6.4.2 Cumulative Distribution Function (cdf)

Many textbooks base their treatment of continuous r.v.'s on the cumulative distribution function (cdf); the cdf does give a probability.

FX(x) = P(X ≤ x),   (6.13)

FX(x) = ∫_{−∞}^x fX(u) du.   (6.14)

6.4.3 Uniform Distribution

X has a uniform distribution on [a, b], X ∼ Uniform(a, b), if

fX(x) = 1/(b − a), for x ∈ [a, b]; 0 otherwise.   (6.15)

The cumulative distribution function (cdf) is

FX(x) = 0 for x < a; (x − a)/(b − a) for x ∈ [a, b]; 1 for x > b.   (6.16)

6.4.4 Normal (Gaussian) Distribution

X has a Normal (Gaussian) distribution with parameters µ and σ, X ∼ N(µ, σ), if

fX(x) = (1/(σ√(2π))) exp(−(1/2)((x − µ)/σ)²), −∞ < x < ∞.   (6.17)

The Normal distribution is often used to model measurements taken in the presence of error or noise. If the true value of a variable X is µ, then the measurement (random) variable is distributed as N(µ, σ), where σ (the standard deviation) is a measure of the 'size' of the errors.

We say X has a standard Normal distribution if µ = 0 and σ = 1; standard Normal r.v.'s are typically denoted by Z; Z ∼ N(0, 1). The cdf for Z is denoted by Φ(z); although there is no closed formula for Φ(z), it is tabulated. In the days before widespread use of computers, tables such as those for Φ(z) were of great importance to those involved in statistics and statistical inference.
Nowadays statistics packages and even some calculators will compute Φ(z) for you, or even remove the necessity by calculating the thing that required Φ(z) as an intermediate value.

If X ∼ N(µ, σ) then Z = (X − µ)/σ ∼ N(0, 1). Conversely, if Z ∼ N(0, 1) then X = σZ + µ ∼ N(µ, σ). Also, if X ∼ N(µ, σ) and Y = aX + b, then Y ∼ N(aµ + b, |a|σ).

6.4.5 Exponential Distribution

X has an Exponential distribution with parameter β, β > 0, X ∼ Exp(β), if

fX(x) = (1/β) exp(−x/β).   (6.18)

The Exponential distribution is used to model the waiting times between infrequent events, c.f. the Poisson distribution, see section 6.3.6.

6.4.6 Gamma Distribution

X has a Gamma distribution with parameters α, β; α, β > 0, X ∼ Gamma(α, β), if

fX(x) = (1/(β^α Γ(α))) x^(α−1) exp(−x/β), x > 0.   (6.19)

The Gamma function, for parameter α > 0, is given by

Γ(α) = ∫_0^∞ y^(α−1) e^(−y) dy.   (6.20)

The Exponential distribution is Gamma with parameter α = 1, Gamma(1, β).

6.4.7 Beta Distribution

X has a Beta distribution with parameters α, β; α, β > 0, X ∼ Beta(α, β), if

fX(x) = (Γ(α + β)/(Γ(α)Γ(β))) x^(α−1) (1 − x)^(β−1), 0 < x < 1.   (6.21)

6.4.8 Student t Distribution

X has a Student t distribution (or just t distribution) with ν degrees of freedom, X ∼ tν, if

fX(x) = (Γ((ν + 1)/2)/(√(νπ) Γ(ν/2))) (1 + x²/ν)^(−(ν+1)/2).   (6.22)

6.4.9 Cauchy Distribution

The Cauchy distribution, X ∼ Cauchy, is a special case of the t distribution with ν = 1,

fX(x) = 1/(π(1 + x²)).   (6.23)

6.4.10 Chi-squared Distribution

X has a χ² distribution with n degrees of freedom, X ∼ χ²_n, if

fX(x) = (1/(Γ(n/2) 2^(n/2))) x^((n/2)−1) e^(−x/2), x > 0.   (6.24)

6.5 Range spaces — terminology

In discussing discrete r.v.'s we mentioned, for example, a range space RX = {x1, x2, . . . , xn}. If the range space is all the integers, we could use the common symbol RX = Z. If the range space is all the real numbers, we could use the common symbol RX = R. If the range space is a subset of R, we use, for example, RX = [0, 1] to state that the r.v. can be 0 to 1 inclusive.
For a discrete (integer) subset we use, for example, {1, 2, . . . , 10}.

6.6 Parameters

In discussing the Binomial distribution, eqn. 6.8, and the Normal, eqn. 6.17, repeated below,

pX(x) = (n choose x) p^x (1 − p)^(n−x), for x ∈ {0, 1, . . . , n}; 0 otherwise,

fX(x) = (1/(σ√(2π))) exp(−(1/2)((x − µ)/σ)²), −∞ < x < ∞,

we note that p for the Binomial, and µ, σ for the Normal, completely specify the distributions. We call these parameters, and we will see distributions written as, for example, fX(x; θ1, θ2), where θ is a common symbol for a parameter. A lot of practical statistics involves parameter estimation, where, for example, we may have a set (sample) of data x1, x2, . . . , xn, which we know to be drawn from a population with distribution fX(x; θ1, θ2), and we want to compute an estimate θ̂1 of θ1.

Chapter 7

Two- and Multi-Dimensional Random Variables

7.1 Introduction

Chapter 6 introduced one dimensional random variables and certain well known distributions. Both discrete and continuous r.v.'s were covered. In many cases, your (scientific) data will consist not just of single numbers, for example, the weight of a chemical in a mixture, but of two or more numbers. If the numbers correspond to independent events, see section 5.8, it may be possible or desirable to treat them separately as individual one-dimensional r.v.'s, but, generally, you will want to treat pairs or triples or multiple numbers together. In section 5.6 and eqn. 5.13 we introduced the notion of the probability of two events happening together, P(AB), the joint probability of A and B. Here we first introduce two-dimensional r.v.'s and then go on to generalise to multi-dimensional r.v.'s.

Range spaces — terminology for two and more dimensions

See section 6.5, where we introduced some symbols and terminology used in describing range spaces for one-dimensional r.v.'s.
If we have a two-dimensional continuous random variable — a pair (X, Y) — each member of which can take on any real value, we say that the range space is R × R; for general multi-dimensions, say p dimensions, where the random variable is a random vector, we use R^p. For subsets of R, we use, for example, [0, 1] × [0, 1] and [0, 1]^p. The term for a combination (product) of sets such as [0, 1] × [0, 1] is Cartesian product.

Two-dimensional (Bivariate) Random Variables

If, to every outcome, ω, of an experiment, we assign two numbers, X(ω), Y(ω), then (X, Y) is called a two-dimensional random variable. As in one dimension, we have discrete and continuous two-dimensional random variables; the term random vector is also used, especially when there are more than two dimensions.

Much of what we present here is just a two-dimensional analogue of what was covered in Chapter 6. Also, what is described here in terms of two dimensions transfers immediately to multiple dimensions.

7.2 Probability Function of a Discrete Two-dimensional r.v.

By analogy with eqns. 6.3 and 6.4 for one dimension, we have pX,Y(xi, yj) = P(X = xi, Y = yj) (or just p(xi, yj)) and it must satisfy the following:

p(xi, yj) ≥ 0, i = 1, 2, . . . , n; j = 1, 2, . . . , m,   (7.1)

Σ_{j=1}^m Σ_{i=1}^n p(xi, yj) = 1.   (7.2)

As with one-d., pX,Y, or just p, is called the probability function or the joint probability function for the r.v. (X, Y).

Example 23 From (Meyer 1966, p. 85). There are two production lines; the first has a capacity to produce up to five items in a day; its actual production is a random variable X; the second has a capacity to produce up to three items in a day and its actual production is a random variable Y. The pair of random variables is the two-dimensional random variable (X, Y) and the joint probability function is given in Table 7.1. Each entry represents P(X = xi, Y = yj); so p(2, 3) = 0.04. Such a table could be estimated by noting (X, Y) over a large number of days.
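A joint probability table such as Table 7.1 can be held as an R matrix and its properties checked numerically. A sketch, with the entries taken from Table 7.1 (rows Y = 0..3, columns X = 0..5):

```r
## Table 7.1 held as a matrix: rows are Y = 0..3, columns are X = 0..5.
## matrix() fills column by column, so each group of four is one X column.
p <- matrix(c(0.00, 0.01, 0.01, 0.01,   # column X = 0
              0.01, 0.02, 0.03, 0.02,   # column X = 1
              0.03, 0.04, 0.05, 0.04,   # column X = 2
              0.05, 0.05, 0.05, 0.06,   # column X = 3
              0.07, 0.06, 0.05, 0.06,   # column X = 4
              0.09, 0.08, 0.06, 0.05),  # column X = 5
            nrow = 4)
sum(p)    # 1: the entries sum to one, requirement eqn. 7.2
p[4, 3]   # P(X = 2, Y = 3) = 0.04 (row Y = 3, column X = 2)
```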
        X
 Y       0     1     2     3     4     5
 0     0.00  0.01  0.03  0.05  0.07  0.09
 1     0.01  0.02  0.04  0.05  0.06  0.08
 2     0.01  0.03  0.05  0.05  0.05  0.06
 3     0.01  0.02  0.04  0.06  0.06  0.05

Table 7.1: Example of a two-dimensional probability function

We can verify that the table does represent a proper probability function in that requirement eqn. 7.1 is satisfied, and, by summing over all entries, that requirement eqn. 7.2 is satisfied — the entries sum to 1.

7.3 PDF of a Continuous Two-dimensional r.v.

By analogy with eqns. 6.11 and 6.12 for one dimension, we have the (joint) PDF f(x, y) and it must satisfy the following:

f(x, y) ≥ 0, all (x, y) ∈ R × R,   (7.3)

∫_{−∞}^∞ ∫_{−∞}^∞ f(x, y) dx dy = 1.   (7.4)

We emphasise again that f(x, y) is not a probability, but f(x, y) dx dy is.

7.4 Marginal Probability Distributions

Example 24 Suppose in Example 23 (Table 7.1) we want to compute the probability functions for X and Y on their own. These are called marginal probability functions. The marginal probability function for X is given by

pX(xi) = P(X = xi) = P(X = xi, Y = y1, or . . . , or X = xi, Y = ym) = Σ_{j=1}^m p(xi, yj).   (7.5)

Similarly, the marginal probability function for Y is given by

pY(yj) = Σ_{i=1}^n p(xi, yj).

Table 7.2 shows the corresponding sums.

        X
 Y       0     1     2     3     4     5    Sum
 0     0.00  0.01  0.03  0.05  0.07  0.09   0.25
 1     0.01  0.02  0.04  0.05  0.06  0.08   0.26
 2     0.01  0.03  0.05  0.05  0.05  0.06   0.25
 3     0.01  0.02  0.04  0.06  0.06  0.05   0.24
 Sum   0.03  0.08  0.16  0.21  0.24  0.28   1.00

Table 7.2: Example of marginal probability functions

We can verify that the sums corresponding to pX(xi) and pY(yj) do represent proper probability functions in that requirement eqn. 6.3 is satisfied, and, by summing the marginals, that requirement eqn. 6.4 is satisfied — both sets of marginals sum to 1.

For continuous random variables, we can state the equivalent equation for marginal pdfs:

fX(x) = ∫_Y fX,Y(x, y) dy.   (7.6)

7.5 Conditional Probability Distributions

In section 5.6 we introduced conditional probability, i.e.
the probability of an event B when we know that event A has occurred:

P(B|A) = P(AB)/P(A).   (7.7)

We can do the same for probability functions.

Example 25 Suppose in Example 24 (Table 7.2) we want to compute the conditional probability P(X = 2|Y = 1). Applying eqn. 7.7 we have

P(X = 2|Y = 1) = P(X = 2, Y = 1)/P(Y = 1) = 0.04/0.26 = 0.154.

We can give general rules, noting that p(xi), q(yj) are the marginal probability functions given by eqn. 7.5:

p(xi|yj) = p(xi, yj)/q(yj), if q(yj) > 0,   (7.8)

p(yj|xi) = p(xi, yj)/p(xi), if p(xi) > 0.   (7.9)

We can give similar general rules for continuous random variables, noting that g(x), h(y) are the marginal pdfs given by eqn. 7.6:

f(x|y) = f(x, y)/h(y), if h(y) > 0,   (7.10)

f(y|x) = f(x, y)/g(x), if g(x) > 0.   (7.11)

7.6 Independent Random Variables

We can define the notion of independent random variables using the definition of independent events given in section 5.8; we had: A and B are independent events if and only if

P(AB) = P(A)P(B).   (7.12)

(The occurrence of event A in no way influences the occurrence of B, and vice-versa.)

Independent Discrete Random Variables

Given the two-d. discrete random variable (X, Y), X and Y are said to be independent if and only if

p(xi, yj) = q(xi) r(yj),   (7.13)

where q(xi), r(yj) are the marginal probability functions given by eqn. 7.5.

Independent Continuous Random Variables

Similarly, given the two-d. continuous random variable (X, Y), X and Y are said to be independent if and only if

f(x, y) = g(x) h(y),   (7.14)

where g(x), h(y) are the marginal pdfs.

7.7 Two-dimensional (Bivariate) Normal Distribution

We can extend the one-d. Normal (Gaussian) distribution to two-d.:

f(x, y) = (1/(2π σx σy √(1 − ρ²))) exp{−(1/(2(1 − ρ²))) [((x − µx)/σx)² − 2ρ (x − µx)(y − µy)/(σx σy) + ((y − µy)/σy)²]},   (7.15)

for −∞ < x < ∞, −∞ < y < ∞. Before you start protesting that eqn.
7.15 is incomprehensible: (i) it isn't, and I can explain it; (ii) there is a much better way of handling multivariate random variables, which is better even for two-d. See Chapter B and section B.7.

Chapter 8

Characterisations of Random Variables

8.1 Introduction

We introduced the notion of a random variable in Chapters 6 and 7. We identified probability functions (for discrete r.v.'s) and probability density functions for some commonly occurring r.v.'s. Here we identify and define some parameters (numbers) that characterise some aspects of r.v. distributions. Generally, the expected value or expectation of some function of the r.v. is found useful, and the expected value of the r.v. itself (the mean) is first amongst these.

8.2 Expected Value (Mean) of a Random Variable

The expected value of a r.v. X, or expectation, or mean, is the average value of X.

Definition: Expected Value, Discrete r.v. Discrete r.v., range space RX = {x1, . . . , xN}; probability mass function p(xi) = P(X = xi). The expected value or expectation, E(X), or mean of X is given by

E(X) = µX = Σ_{i=1}^N xi p(xi).   (8.1)

Continuous r.v., range space RX = R; probability density function f(x). The expected value or expectation, E(X), or mean of X is given by

E(X) = µX = ∫_R x f(x) dx.   (8.2)

Example 26 Toss two coins as in Example 18. X = number of heads. A = {TT}, P(A) = 1/4, P(X = 0) = 1/4; A = {TH, HT}, P(A) = 1/2, P(X = 1) = 1/2; A = {HH}, P(A) = 1/4, P(X = 2) = 1/4.

E(X) = µX = Σ_{i=1}^N xi p(xi) = 0 × 1/4 + 1 × 1/2 + 2 × 1/4 = 0 + 0.5 + 0.5 = 1.

Example 27 Toss a dice and take X = the number of dots obtained; p(xi) = 1/6, i = 1, . . . , 6.

E(X) = µX = Σ_{i=1}^N xi p(xi) = (1/6) Σ_{i=1}^6 xi = 21/6 = 3.5.   (8.3)

Note that in Example 27, µX = 3.5 is not one of the possible values of X. It is useful, particularly in two-d.
cases, to think of µX as the centre of mass, where p(xi) is a mass and xi is a position along a lever arm; µX is the position to place the fulcrum in order to achieve a balance.

Aside — Sample Averages In later chapters we will encounter samples and sample averages. By sample we mean that we run an experiment and take some example values, say n of them, of the r.v.: x1, x2, . . . , xn. Here we use n for the size of the sample rather than N as in eqn. 8.1, and note that the range space RX = {x1, . . . , xN} denotes the population, rather than a sample of it. Then we can compute a sample mean, X̄ (pronounced x-bar), as

X̄ = (1/n) Σ_{i=1}^n xi.   (8.4)

That is, compute the average as we learned in early arithmetic. Ordinarily, we'll make a strong distinction between the sample mean and the true mean.

But let us consider the case of a large sample of dice throws, say n = 600. Let yi = the count of each Xi obtained. We might expect to obtain something like y1 = 95, y2 = 110, y3 = 90, y4 = 97, y5 = 105, y6 = 103, so that for eqn. 8.4, Σ xi = 95 × 1 + 110 × 2 + 90 × 3 + 97 × 4 + 105 × 5 + 103 × 6 = 2116, and X̄ = 2116/600 ≈ 3.53. If we look more carefully at eqn. 8.4 for this example, we can interpret it as a sample version of eqn. 8.1:

X̄ = Σ_{i=1}^6 (1/n) yi xi = Σ_{i=1}^6 (yi/n) xi,   (8.5)

and, comparing with eqn. 8.1, we have yi/n in place of p(xi); we note that yi/n = p̄(xi) = {95/600, 110/600, 90/600, 97/600, 105/600, 103/600} = {0.158, 0.183, 0.15, 0.162, 0.175, 0.172}, i.e. we have sample estimates of the probability mass function, which are in error. The error, X̄ ≠ µX, is due to the errors in the p̄(xi). Generally, as n → ∞, p̄(xi) → p(xi) and X̄ → µX.

Definition: Expected Value of a Function of a r.v. The expected value, E(r(X)), of a function Y = r(X) of X is given by

E(Y) = E(r(X)) = Σ_{i=1}^N r(xi) p(xi).   (8.6)

Example 28 Let us use a dice as a one-number slot-machine (one-armed bandit).
We pay 5c to play and the machine pays whatever number comes up (1–6); thus our payout for each play is xi − 5. What is the expected value of the payout? (Think: play for an hour, 1000 plays, inserting 5000c; what do we expect to win or lose?)

E(Y) = Σ_{i=1}^N r(xi) p(xi) = Σ_{i=1}^6 (xi − 5)(1/6) = −4/6 − 3/6 − 2/6 − 1/6 + 0/6 + 1/6 = −9/6 = −1.5.

That is, we lose on average 1.5c for every play and would lose 1500c in 1000 plays. (Maybe better than the average slot-machine?)

Expected values for two and higher dimensions Eqns. 8.1 and 8.2 carry over to two and more dimensions. Discrete r.v., range space RX,Y = {x1, . . . , xN} × {y1, . . . , yM}; probability mass function p(xi, yj) = P(X = xi, Y = yj). The expected value or expectation, E[(X, Y)], or mean of the pair (X, Y), is given by

E[(X, Y)] = µX,Y = (µX, µY) = Σ_{i=1}^N Σ_{j=1}^M (xi, yj) p(xi, yj).   (8.7)

And similarly for two-d. (and multidimensional) continuous r.v.'s, where multiple integrals replace single integrals.

Useful facts For X1, . . . , Xn random variables and constants a1, . . . , an,

E(Σ_i ai Xi) = Σ_i ai E(Xi).   (8.8)

For X1, . . . , Xn independent random variables,

E(Π_{i=1}^n Xi) = Π_{i=1}^n E(Xi).   (8.9)

8.3 Variance of a Random Variable

Variance gives the spread of a distribution. The variance is the expected value (mean value) of the squared deviation from the mean.

Definition: Variance Discrete r.v., range space RX = {x1, . . . , xN}; probability mass function p(xi), mean µX. The variance is given by

V(X) = σ² = E[(X − µX)²] = Σ_{i=1}^N (xi − µX)² p(xi).   (8.10)

Continuous r.v.:

V(X) = σ² = E[(X − µX)²] = ∫_R (x − µX)² f(x) dx.   (8.11)

The following formula is sometimes useful:

V(X) = E(X²) − (E(X))² = E(X²) − µX².   (8.12)

Aside — Sample Variance Eqn. 8.4 gives the sample mean of a random variable; the sample variance is given by
s² = (1/(n − 1)) Σ_{i=1}^n (xi − X̄)².   (8.13)

You may wonder about the (n − 1) instead of n; if we divided by n, the estimate would be biased.

Standard Deviation Standard deviation: σX = √(V(X)).

Useful facts about variance For constants a, b,

V(aX + b) = a²V(X).   (8.14)

For X1, . . . , Xn independent random variables,

V(Σ_{i=1}^n Xi) = Σ_{i=1}^n V(Xi).   (8.15)

If X1, . . . , Xn are independent and identically distributed (IID) random variables with µ = E(X), σ² = V(X), then

E(X̄) = µ, V(X̄) = σ²/n, E(s²) = σ².   (8.16)

8.4 Expectations in Two Dimensions

8.4.1 Mean

Two-d. discrete r.v., range space RX,Y = {x1, . . . , xN} × {y1, . . . , yM}; probability mass function p(xi, yj). The expected value or expectation, E[(X, Y)], or mean of (X, Y), is given by

E[(X, Y)] = µX,Y = Σ_{j=1}^M Σ_{i=1}^N (xi, yj) p(xi, yj).   (8.17)

Similarly for a continuous r.v. — a double integral replaces the summation, and a pdf replaces the probability mass function.

8.4.2 Covariance

Let X, Y be r.v.'s with means µX, µY and standard deviations σX, σY. The covariance between X and Y is defined as

Cov(X, Y) = E[(X − µX)(Y − µY)].   (8.18)

Note that Cov(X, Y) = Cov(Y, X). The correlation between X and Y is defined as

ρX,Y = Cov(X, Y)/(σX σY).   (8.19)

Chapter 9

The Normal Distribution

9.1 Introduction

Here we introduce some uses of the Normal distribution, eqn. 6.17. The Normal distribution can be used as a model or approximate model in so many cases that a large amount of mathematics has been built up around it. Note: we use Normal (capitalised) to distinguish from the word normal (expected, typical) and because most other distribution names are capitalised. The probability density function (pdf) is given by

fX(x) = (1/(σ√(2π))) exp(−(1/2)((x − µ)/σ)²), −∞ < x < ∞.   (9.1)

We say X ∼ N(µ, σ); note: some writers use X ∼ N(µ, σ²), i.e. they use the variance for the second parameter of N; we will attempt to standardise on N(µ, σ).
It is well worth checking carefully when reading books and papers; there can be a great difference between σ and σ²! Because the pdf is different for each µ, σ, it is convenient to create a standardised Normal in which µ = 0, σ = 1. We standardise the r.v. X as follows: first we shift to zero mean, and then we divide by σ to obtain unit standard deviation,

Z = (X − µ)/σ.   (9.2)

When we standardise X, we obtain Z = (X − µ)/σ ∼ N(0, 1), and eqn. 9.1 becomes eqn. 9.3,

fZ(z) = (1/√(2π)) exp(−z²/2).   (9.3)

The pdf for N(0, 1) is shown in Figure 9.1. As you can see, most of the probability is located in −3 < Z < 3; between these limits we have probability 0.9974, i.e. P(−3 < Z < 3) = 0.9974; that is, if we have a random variable Z, we can be pretty sure it will fall between these limits; you may have heard the term three-sigma used to denote nearly all occurrences. Likewise P(−1.96 < Z < 1.96) = 0.95, so that the probability outside these limits is 0.05 or 5%.

R-Example 3 The following R code computes and plots Figure 9.1.

> z = seq(-6, 6, length = 200)
> pdf = dnorm(z, 0, 1)  ## dnorm for d(ensity) normal
> plot(z, pdf, type = "l", lwd = 3)

9.2 Cumulative Distribution Function (cdf)

As we indicated in section 6.4.2, the pdf does not represent a probability but a probability density; the numbers we refer to above, for example P(−1.96 < Z < 1.96) = 0.95, are obtained by integration,

P(−1.96 < Z < 1.96) = 0.95 = ∫_{−1.96}^{1.96} fX(x) dx.   (9.4)

However, for the Normal distribution, there is no easy way to compute ∫_a^b fX(x) dx, which is where the cdf comes in; we recall that the cdf is given by eqns. 9.5 and 9.6,

FZ(z) = P(Z ≤ z),   (9.5)

Φ(z) = FZ(z) = ∫_{−∞}^z fZ(u) du = ∫_{−∞}^z (1/√(2π)) exp(−u²/2) du.   (9.6)

Because it is so commonly used, the standardised Normal cdf gets its own symbol, Φ(z). Φ(z) is plotted in Figure 9.2, which was created using the code in R-Example 4.

R-Example 4 The following R code computes and plots Figure 9.2.
> z = seq(-6, 6, length = 200)
> cdf = pnorm(z, 0, 1)  ## pnorm for p(robability) normal
> plot(z, cdf, type = "l", lwd = 3)
> ### add these if you want a figure for a report
> pdf("normcdf.pdf", onefile = FALSE, height = 4, width = 4, pointsize = 8, paper = "special")
> plot(z, cdf, type = "l", lwd = 3)
> dev.off()  ### necessary to flush the diagram into the file "normcdf.pdf"

Following the discussion above on how most of the probability is located between −3 < Z < 3, we are not surprised to see that Φ(z) is close to zero at z = −3; it rises to 0.5 at z = 0 (one half of the probability is below 0, the other half above 0) and then flattens out at z = 3, after which there is almost no probability for the integral to add in.

Figure 9.1: Standardised Normal distribution, N(0, 1), probability density function (pdf).

Figure 9.2: Normal cumulative distribution function (cdf).

9.3 Normal Cdf

Traditionally, statistics books and books of tables contained tabulations of the Normal cdf, Φ(z). We will see below how these tables are used. However, because most statistics is now conducted using software packages, tables may be less frequently used, and may be less commonly encountered in textbooks.

R-Example 5 The following R code computes Table 9.1.

> z = seq(-4, 4, length = 9)
> cdf = pnorm(z, 0, 1)
> z
[1] -4 -3 -2 -1  0  1  2  3  4
> cdf
[1] 3.167124e-05 1.349898e-03 2.275013e-02 1.586553e-01 5.000000e-01
[6] 8.413447e-01 9.772499e-01 9.986501e-01 9.999683e-01

 z      -4       -3        -2        -1     0    1     2      3      4
 Φ(z)   3.2e-05  1.35e-03  2.28e-02  0.159  0.5  0.84  0.977  0.999  0.99997

Table 9.1: Φ(z) for z = −4 to +4.

What does Φ(z = −2) = 2.28 × 10^(−2) = 0.0228 mean? Referring to Figure 9.1, it means that the amount of probability to the left of Z = −2 is 0.0228, i.e. as indicated by eqn. 9.5. Owing to the symmetry of Figure 9.1, we can state that the amount of probability to the right of Z = +2 is also 0.0228. Hence the probability P(Z < −2 or Z > +2) = 2 × 0.0228 = 0.0456 or 4.56%.
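The tail probabilities above can be confirmed directly in R with pnorm, and inverted with qnorm:

```r
## Two-tailed probabilities from the standard Normal cdf, and the inverse cdf.
2 * pnorm(-2)      # P(Z < -2 or Z > 2), about 0.0455
2 * pnorm(-1.96)   # about 0.05, the 5% figure used repeatedly later
qnorm(0.025)       # inverse cdf: the z with Phi(z) = 0.025, about -1.96
```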
If we move a little closer to the mean, we get P(Z < −1.96 or Z > +1.96) = 2 × 0.025 = 0.05 or 5%. This 5% quantile pair (±1.96) is used a lot in statistics. If P(Z < −1.96 or Z > +1.96) = 0.05, then P(−1.96 < Z < +1.96) = 0.95.

In a similar way, we can determine that P(Z < −1 or Z > +1) = 2 × 0.159 = 0.318; that is, a standard Normal random variable Z falls outside plus or minus one standard deviation of the mean 31.8% of the time. The 0.159 number is used below in Example 29.

9.4 Using the Normal Cdf

Example 29 Suppose we have a manufacturing process which takes fixed quantities of raw materials A (1000 g) and B (500 g) which react together to produce a product C in the form of a solid cake. The weights of the cakes, X, are monitored, and those below a certain weight are set aside as B-grade. The manufacturer of the machine gives the expected value of the yield as E(X) = 165 grams with a variance of 9, and has determined that the yield follows the Normal distribution; that is, µ_X = 165, σ_X = √9 = 3 and X ∼ N(165, 3). We have decided that cakes below 162 grams should be marked as B-grade.

What is the probability that a randomly selected output will be less than 162 grams? We have no tables for N(165, 3), but we do have for N(0, 1), that is, the cdf for the standardised Normal, Φ(z).

Solution. (i) First we standardise using eqn. 9.2, Z = (X − µ)/σ = (X − 165)/3, in which case the standardised weight corresponding to 162 is Z_162 = (162 − 165)/3 = −1.

(ii) The probability that Z < Z_162 is just Φ(Z_162) = Φ(−1), and we can read that from Table 9.1: the probability is 0.159, so 15.9% of the output will be B-grade.

(iii) Or, we can use R.

> pnorm(-1, 0, 1)   ## here explicitly giving mu and sigma
[1] 0.1586553
> pnorm(-1)         ## if none given, R assumes mu = 0, sigma = 1
[1] 0.1586553

(iv) We can even let R handle the standardisation.

> pnorm(162, 165, 3)   ## here explicitly giving mu and sigma
[1] 0.1586553

Normal distribution appropriate? In Example 29 there can be an immediate objection to the Normal model: X can never be less than zero, but N(165, 3) will have a value greater than zero (but very, very small) for X < 0. In defence, we can argue that the value will be negligibly small, so that use of the Normal model should not introduce significant errors. If we had a weight with E(X) = 4, V(X) = 9, σ = 3, then we would have to question the Normal model.

9.5 Sum of Independent Normal Random Variables

If X1 ∼ N(µ1, σ1) and X2 ∼ N(µ2, σ2) are independent random variables, then

X = X1 + X2 ∼ N(µ, σ), (9.7)

where µ = µ1 + µ2 and Var(X) = σ² = σ1² + σ2². Add the means, add the variances; note: not add the standard deviations. For example, if X1 ∼ N(4, 3) and X2 ∼ N(5, 4), then X1 + X2 ∼ N(9, 5), since σ² = 9 + 16 = 25 and σ = 5.

Eqn. 9.7 generalises to give the distribution of a sum of n independent observations of the same random variable. If Xi ∼ N(µ, σ),

X = X1 + X2 + ... + Xn = Σ_{i=1}^{n} Xi ∼ N(nµ, √n σ). (9.8)

That is, add n means, and add n variances, so that σ_sum = √(n Var(X)) = √n σ. For example, the sum of n = 9 independent observations of X ∼ N(10, 2) is distributed N(90, 6), since √9 × 2 = 6.

9.6 Differences of Normal Random Variables

If X1 ∼ N(µ1, σ1) and X2 ∼ N(µ2, σ2) are independent, then

X = X1 − X2 ∼ N(µ, σ), (9.9)

where µ = µ1 − µ2 and Var(X) = σ² = σ1² + σ2². Take the difference of the means and add the variances (not the difference of the variances). For example, if X1 ∼ N(4, 3) and X2 ∼ N(5, 4), then X1 − X2 ∼ N(−1, 5): the variances still add.

9.7 Linear Transformations of Normal Random Variables

If X ∼ N(µ, σ),

Y = aX + b ∼ N(aµ + b, aσ). (9.10)

For example, if X ∼ N(10, 2), then Y = 3X + 1 ∼ N(31, 6).

9.8 The Central Limit Theorem

Why is the Normal distribution (a) so common and (b) so popular amongst statisticians? First, the Central Limit Theorem (CLT) states, roughly speaking, that if a random variable has been created by summing a large number of (independent) random variables, then the sum will have an approximately Normal distribution. Second, it is popular not just because of its common occurrence but because mathematics involving the distribution, eqn.
9.1 and its multivariate counterpart, is in many cases rather easy, or at least a good deal easier than mathematics involving some other distributions.

A compact statement of the CLT, from (Wasserman 2004), is as follows. Let X1, X2, ..., Xn be independent and identically distributed r.v.'s with mean µ and standard deviation σ. Let X̄n = (1/n) Σ_{i=1}^{n} Xi. Then, as n → ∞,

Zn = (X̄n − µ)/√(Var(X̄n)) = (X̄n − µ)/(σ/√n) → Z, (9.11)

where Z ∼ N(0, 1).

Chapter 10 Statistical Inference

10.1 Introduction

We use the Normal distribution, eqn. 6.17, repeated here, to introduce statistical inference,

f_X(x) = (1/(σ√(2π))) exp(−(1/2)((x − µ)/σ)²), −∞ < x < ∞. (10.1)

We may write f_X as f_X(x; µ, σ) or f_X(x; θ1, θ2), where θ1, θ2 are parameters. We may think of a family of Normal distributions, N, parametrised or labelled or indexed by θ1, θ2. Let us say we have performed an experiment and have collected a sample of random variables X, x1, x2, ..., xn; we assume that X ∼ N(µ, σ) but we do not know one or other (or both) of the parameters.

Point Estimation Parameter estimation is concerned with estimating parameters. A point estimate for, say, µ is an approximate value µ̂ computed from the sample. Typically, in addition to the estimate µ̂, we give some qualifications, such as the variance of the estimate, that is, an indication of how variable we think µ̂ might be if we repeated the experiment a number of times.

Interval Estimation An interval estimate (set estimate, confidence interval) for, say, µ is an interval [µ1, µ2] computed from the sample which we claim to contain the real µ. Typically, we give some indication of how plausible the interval is, in the form of some sort of probability value.

Hypothesis Testing A typical hypothesis testing example is when a scientist needs to test the efficacy of a new method. An experiment is performed where there are two methods, M1 and M2.
Often, M1 is a control (say the old method) and M2 is the new method whose efficacy we wish to test. Let us keep the hypothesis simple by assuming that we wish to test whether M2 will give a better yield than M1.

Chapter 11 Statistical Estimation

11.1 Introduction

When we state, for example, X ∼ f_X(x; θ1, θ2), we indicate that the distribution depends on parameters θ1, θ2. For example, we may think of a family of Normal distributions, N(θ1, θ2), parametrised or labelled or indexed by θ1 = µ, θ2 = σ.

11.2 Populations and Samples

When we quote values of parameters, for example the mean and standard deviation of a Normally distributed r.v., X ∼ N(µ, σ), we are talking about population parameters. Let us collect a sample of random variables X, x1, x2, ..., xn; we assume that X ∼ N(µ, σ) but we do not know either of the parameters. We must estimate them, and an obvious first attempt is to use the sample mean and standard deviation.

Note the difference: population versus sample. A population includes all possible random variables; a sample contains, well, a sample taken from the population. If you wanted a quick estimate of the mean salary of lecturers in the college, you could ask a number of lecturers you know and take the average of that sample. The Human Resources Department could give you an exact figure, because they have the data for the (complete) population of N lecturers. They would compute the true population parameters as

µ = (1/N) Σ_{i=1}^{N} xi, (11.1)

σ² = (1/N) Σ_{i=1}^{N} (xi − µ)². (11.2)

You could imagine that the larger your sample, the better the sample mean would approximate the population mean.

Random Sample However, apart from being a small sample, the lecturers you know could introduce another source of inaccuracy, namely that the sample is not random, and so it may contain a bias due to the fact that, for example, the lecturers in your sample tend to be younger.
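As a small illustration of the population/sample distinction, here is an R sketch with made-up salary figures (all numbers hypothetical):

```r
set.seed(1)
## hypothetical population of N = 200 lecturer salaries
salaries <- rnorm(200, mean = 50000, sd = 5000)
mu <- mean(salaries)    ## exact population mean, as HR could compute it
## a random sample of n = 10: each lecturer equally likely to be chosen
xbar <- mean(sample(salaries, 10))
c(mu, xbar)    ## the sample mean approximates, but rarely equals, mu
```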
By random sample we mean that each member of the population has an equal chance of being sampled. Achieving a random sample is not always easy; see Chapter 13.

11.3 Estimating the Mean

A point estimate for, say, µ is an approximate value µ̂ computed from the sample. Typically, in addition to the estimate µ̂, we give some qualifications, such as the variance of the estimate, that is, an indication of how variable we think µ̂ might be if we repeated the experiment a number of times. The hat symbol, θ̂, is used to indicate that we have an estimate of θ.

The most obvious estimate for µ is to copy eqn. 11.1, noting that we use capital N for the size of the population and lower-case n for the size of the sample,

µ̂ = x̄ = (1/n) Σ_{i=1}^{n} xi. (11.3)

In this context the bar, as in x̄ (x bar), indicates mean or average. For example, for the sample 2, 4, 6, 8, we have x̄ = 20/4 = 5.

11.4 Estimating the Standard Deviation

The "best" estimate for σ is less obvious, and eqn. 11.2 is modified slightly to

σ̂² = s² = (1/(n − 1)) Σ_{i=1}^{n} (xi − x̄)². (11.4)

Thus, we not only replace µ by its estimate, x̄, we divide by n − 1 instead of n. It is usual to use s² to denote sample variance. The reason for the n − 1 is that dividing by n would generally lead to a systematic underestimate, a so-called bias. This may be discussed in a later chapter; (reference it if we do). For example, for the sample 2, 4, 6, 8 with x̄ = 5, s² = ((−3)² + (−1)² + 1² + 3²)/3 = 20/3 ≈ 6.67.

11.5 Sampling Distributions

11.5.1 Sampling Distribution of the Mean

The estimate of the mean given by eqn. 11.3 is itself a random variable; we can imagine taking m samples, each of size n, and each of these yielding an x̄j for j = 1, 2, ..., m. Then

E(x̄) = µ, (11.5)

Var(x̄) = σ²/n. (11.6)

Therefore, the standard deviation of the estimate of the mean is σ/√n. We already encountered this in section 9.5 and eqn. 9.8. Both eqns. 11.5 and 11.6 are rather comforting: (a) the expected value of x̄ is µ; (b) the standard deviation of x̄ is σ/√n, that is, as n increases the standard deviation decreases, and will decrease to zero as n → ∞.
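Eqns. 11.5 and 11.6 can be checked by simulation; the following R sketch (parameters chosen arbitrarily) draws m = 1000 samples of size n = 9 and examines the resulting sample means:

```r
set.seed(2)
mu <- 10; sigma <- 1; n <- 9; m <- 1000
## m sample means, each computed from a fresh sample of size n
xbars <- replicate(m, mean(rnorm(n, mu, sigma)))
mean(xbars)   ## close to mu = 10, eqn. 11.5
var(xbars)    ## close to sigma^2 / n = 1/9, eqn. 11.6
```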
Finally, we can state that the sampling distribution of µ̂ is N(µ, σ/√n). This means that if we conduct a number of sample experiments (take a sample of n Xs and compute the mean x̄), then x̄ will be found to have a Normal distribution centred on the true mean µ. We note emphatically that we do not know µ.

In the first part of the discussion below, we assume that σ² is known. However, this is typically untrue, and we must use an estimate for the standard deviation, as in eqn. 11.4.

Figure 11.1 (Maindonald & Braun 2007, p. 103) shows two sampling distributions for a random variable X which has µ_X = 10, σ = 1; Figure 11.1(a) shows the sampling distribution for a sample size of n = 4, while Figure 11.1(b) shows the sampling distribution for a sample size of n = 9; the distribution of X, corresponding to a sample size of n = 1, is shown for comparison.

The useful formula now is, including standardisation: if the estimator for µ (unknown) is x̄ and σ is known, then

(x̄ − µ)/(σ/√n) ∼ N(0, 1). (11.7)

On the other hand, if σ is unknown and we must replace σ with an estimate, s, see eqn. 11.4, then

(x̄ − µ)/(s/√n) ∼ t_{n−1}, (11.8)

where t_{n−1} is the Student t distribution with n − 1 degrees of freedom; see section 6.22. As with N(0, 1), we have tables for the t distribution.

Figure 11.1: (a) Sampling distribution for a sample size of n = 4; (b) sampling distribution for a sample size of n = 9; the distribution of X, corresponding to a sample size of n = 1, is shown for comparison.

11.5.2 Sampling Distribution for Estimates of the Standard Deviation

If the estimator for σ (unknown) is s, see eqn. 11.4, and µ is also unknown, with estimate x̄, then

(n − 1)s²/σ² = Σ_{i=1}^{n} (xi − x̄)²/σ² ∼ χ²_{n−1}, (11.9)

where χ²_n is the Chi-squared distribution with n degrees of freedom; see section 6.4.10. As with N(0, 1) and t_ν, we have tables for the χ²_n distribution.
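A simulation sketch of eqn. 11.9 in R (the sample size and σ below are arbitrary): we repeatedly compute (n − 1)s²/σ² from Normal samples and compare the result with the χ² distribution with n − 1 degrees of freedom.

```r
set.seed(3)
n <- 10; sigma <- 2; m <- 2000
## m simulated values of (n - 1) s^2 / sigma^2
w <- replicate(m, (n - 1) * var(rnorm(n, 0, sigma)) / sigma^2)
mean(w)   ## close to n - 1 = 9, the mean of a chi-squared r.v.
## simulated 95th percentile versus the chi-squared table value
c(quantile(w, 0.95), qchisq(0.95, n - 1))
```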
11.6 Confidence Intervals

In section 11.5.1 we established that the distribution of the sample mean is x̄ ∼ N(µ, σ/√n) or, equivalently (eqn. 11.7), (x̄ − µ)/(σ/√n) ∼ N(0, 1). This tells us that the estimate has a distribution that is centred on the mean, that the expected value of the estimate is the mean, and that the distribution will have a standard deviation (spread) of σ/√n.

Thus, referring to Figure 11.1(a), we can say that the mean of x̄4 is µ, the true mean, which we do not know, and that different samples would vary between about 1.5σ above and below the true mean. Hence, if the true mean is 10, as in the diagram, and we kept repeating our sampling experiment, we would expect the estimate x̄4 to vary between about 8.5 and 11.5. On the other hand, if we used sample size n = 9, we would expect the estimate x̄9 to vary between about 9.0 and 11.0; see Figure 11.1(b).

The previous few sentences should suggest that we should be able to give a plausible interval estimate, such as "we estimate that the mean is between 9 and 11", together with a probability for that assertion, e.g. about 0.95, as discussed in section 9.3 for P(−1.96 < Z < +1.96). But unfortunately we cannot, for we do not know the true mean.

What can we say? Well, for example, that P(−1.96 < (x̄ − µ)/(σ/√n) < +1.96) = 0.95. Still not much good, for we do not know µ, and we must be satisfied with the less useful statement that the estimate x̄ is within plus-or-minus 1.96 × σ/√n of µ, with a probability of 0.95.

More explanation may be needed. What if x̄ is at one of these extremes, namely µ − 1.96 × σ/√n? This would correspond to about 9 in Figure 11.1(a). We can then say that x̄ + 1.96 × σ/√n just about reaches up to µ. If we repeat the sampling, this will happen with a probability 1 − 0.025, i.e. the amount of probability up to Z = −1.96 is 0.025. Similarly, take the case that x̄ is at the other extreme, namely µ + 1.96 × σ/√n; this would correspond to about 11 in Figure 11.1(a).
We can now say that x̄ − 1.96 × σ/√n just about reaches down to µ. If we repeat the sampling, this will happen with a probability 1 − 0.025 (recall the symmetry argument in section 9.3).

Consequently, if we take x̄ ± 1.96 × σ/√n, we can say that this interval will capture µ with probability 0.95. This allows us to construct a confidence interval which we can claim contains µ; that is, we compute not µ̂, but (L, U), an interval between (L)ower and (U)pper limits which we believe contains µ. In the case of confidence (probability) 0.95 = 95%, we can compute

(L, U) = (x̄ − 1.96 × σ/√n, x̄ + 1.96 × σ/√n). (11.10)

Summary on Point Estimation and Confidence Interval for the Mean when Variance Known

Refer to Figure 11.1, part (b) of which is based on a sample size of n = 9.

• If we take a point estimate for the mean, it will be distributed according to the narrow distribution; i.e., if the true mean is 10, our estimate can be anywhere between 9 and 11.

• If we decide to give an interval estimate, we need to decide on a confidence (probability); the wider the interval, the greater the confidence we can have in it, but a huge interval with confidence of 100% is not much use to anyone. The usual confidence that is chosen is 95%.

• We would like to be able to look at Figure 11.1(b) and say that our interval for the mean is 9 to 11 with confidence 95% (based on the diagram this is approximate; 10 − 1.96 × 0.5 to 10 + 1.96 × 0.5 are the precise values for 95%). But we cannot make a statement like the latter, for we do not know that µ = 10.

• The best we can do is (a) take our estimate, x̄; (b) place a distribution like that in Figure 11.1(b) about it; (c) compute the x̄ ± 1.96 × σ/√n (≈ 2 × σ/√n) interval, eqn. 11.10. This allows us to state: if we repeated our sampling a large number of times, and we computed eqn. 11.10 each time (getting a different interval), then 95% of these intervals would contain the true mean µ.

Excel-Example 1 Need Excel example here.
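The notes above call for an Excel example; for now, here is an R sketch of eqn. 11.10 using made-up data, with σ assumed known:

```r
set.seed(4)
sigma <- 1
x <- rnorm(9, mean = 10, sd = sigma)   ## a sample of n = 9
n <- length(x); xbar <- mean(x)
## 95% confidence interval for mu, eqn. 11.10
c(L = xbar - 1.96 * sigma / sqrt(n),
  U = xbar + 1.96 * sigma / sqrt(n))
## in repeated sampling, 95% of such intervals would contain mu = 10
```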
Need section on the t-distribution and the small-sample sampling distribution for the mean with standard deviation unknown.

Chapter 12 Hypothesis Testing

12.1 Introduction

In Chapter 11 we discussed estimation of parameters, both point estimates and interval estimates (with a confidence value attached). This chapter is also based on sampling theory, but here we are interested in decisions rather than estimates. For example, based on the occurrences of heads and tails in a sample of n = 10 tosses of a coin, we might wish to come to a decision on whether the coin is fair. We might want to decide whether application of a new fertiliser really does increase cropping yield, based on samples involving (i) the current fertiliser and (ii) the new one.

The hypothesis testing technique involves the postulation of a hypothesis (an assumption, a statement about population distributions or their parameters) and then designing an experiment which will yield a sample upon which we can decide, based on the sample data, whether the hypothesis is true.

A typical hypothesis test is as follows. We make a hypothesis that a random variable is distributed according to f_X(x), e.g. X ∼ N(µ, σ), where we assume that σ is known. We identify a null hypothesis, H0: µ = µ0, and an alternative hypothesis, HA: µ > µ0. We compute a test statistic (a sample estimate with sample size n), for example µ̂ = X̄n, and reject H0 if X̄n > c, where c is some constant to be determined; X̄n > c is the critical region; X̄n ≤ c is called the acceptance region. The greater we make c, the greater the significance level of the test X̄n > c. We can set c using the same considerations we used in setting confidence levels for a confidence interval in section 11.6. As in eqn. 11.7, we know that

Z = (X̄n − µ)/(σ/√n) ∼ N(0, 1), (12.1)

so that we can use Φ(z) to choose a c′ such that P(Z > c′) = P((X̄n − µ)/(σ/√n) > c′) = P(X̄n > c′σ/√n + µ) = 0.025, say, for a 2.5% significance level.
(I've chosen 2.5% = 0.025 because it corresponds to a cutoff point (Z = 1.96) that we have already encountered.)

That is, Z > c′ would occur only 2.5% of the time if H0 is true; in other words, the critical region stretches from c′ to the right of it. The acceptance region stretches to the left of c′, i.e. including everywhere that X̄n ≤ c, where c = c′σ/√n + µ. Recalling P(Z > +1.96) = 0.025, we can set c′ = 1.96 for a significance level of 0.025. The latter corresponds to a one-sided test. The standard Normal pdf and the relevant critical region are shown in Figure 12.1 (Maindonald & Braun 2007, p. 106).

Figure 12.1: One-sided hypothesis test, significance level = 0.025; critical region is shaded to the right of 1.96. For a two-sided test with significance level = 0.05, we include in the critical region also the marked region to the left of −1.96.

Let us keep the original null hypothesis, H0: µ = µ0, and now choose a different alternative hypothesis, namely HA: µ ≠ µ0. A suitable acceptance region for this might be cl < X̄n < ch, with the critical (rejection) region being all points below cl and all points above ch. If we now choose a significance level of 0.05, we arrive at the familiar P(Z < −1.96 or Z > +1.96) = 0.05; that is, if we have µ = µ0, then values of Z < −1.96 or Z > +1.96, i.e. (X̄n − µ)/(σ/√n) < −1.96 or (X̄n − µ)/(σ/√n) > +1.96, should occur only 5% of the time, and this is a sufficiently significant deviation for us to reject the null hypothesis. This is a two-sided test.

The significance level, usually denoted α, corresponds to the probability of rejecting H0 when H0 is true; that is, the extreme values in the critical region could occur, but with a small probability, α. Table 12.1 shows the possible outcomes of the hypothesis test.

              H0 true                   HA true
Accept H0     correct                   Type 2 error, prob. β
Reject H0     Type 1 error, prob. α     correct

Table 12.1: Outcomes of a hypothesis test.

Chapter 13 Sampling

13.1 Introduction

To be completed.
Chapter 14 Classification and Pattern Recognition

14.1 Introduction

The terms classification and pattern recognition are used almost synonymously; statisticians tend to favour classification, while engineers tend to use pattern recognition. This chapter merely introduces the concepts; Chapters 15, 16, 17, 18 and 19 fill in the details. These chapters are a reworking of some of the basic pattern recognition and neural network material covered in (Campbell 2005), (Campbell & Murtagh 1998) and (Campbell 2000).

We define/summarise a pattern recognition system using the block diagram in Figure 14.1.

Figure 14.1: Pattern recognition system; input x, sensed data, a tuple of p measurements; output ω (omega), a class label.

Typically, textbooks distinguish between supervised classification and unsupervised classification.

Supervised classification Supervised (trained) classification may be posed as a prediction problem, rather like regression; here the prediction involves class labels. We have a set of examples, a sample, which we call training data, XT = {xi, ωi}_{i=1}^{n}. We learn population parameters from the sample of x's.

Warning: in some classification and pattern recognition literature, the term sample takes on a different meaning from the standard statistical term. A statistical sample means a set of random vectors taken from a population; in the pattern recognition literature a sample may mean a single random vector, so that a statistical sample has to be termed a set of samples.

x is the pattern vector; of course, in certain situations x is a simple scalar. ω is the class label, ω ∈ Ω = {ω1, ..., ωc}. Then, given an unseen pattern x (a random vector), we predict ω. In general, x = (x0 x1 ... x_{p−1})^T, a p-dimensional vector; T denotes transposition.

Unsupervised classification Unsupervised classification is more of an exploratory data analysis technique than is supervised classification.
In this case we have a set of patterns (random vectors) XT = {xi}_{i=1}^{n}, and we want to explore structure in the set; for example, are they clustered, thereby suggesting that the clusters identify a number of classes? Clustering involves assigning class labels to the XT = {xi}_{i=1}^{n} based not on training data but on proximity of the x's, or some other criterion.

Chapter 15 Simple Classifier Methods

15.1 Thresholding for one-dimensional data

Let us assume that we want to classify a chemical product, for example fake pharmaceutical drugs, according to the results of a chemical analysis. The analysis data comprise a vector x, where x1 might be the percentage mass of component 1, x2 of component 2, etc. The label ω might be the country of origin, and it is this that we want to predict, given the results x from an analysis of a newly seized batch. For the moment, we'll assume just two classes, ω0 and ω1; two-class problems are easy to describe, yet extension to n-class problems is easy.

In our simplistic recognition system we require to recognise two sources, country 0 and country 1, ω0 and ω1. We start off with two components, x = (x1 x2)^T. As described in Chapter 14, we have earlier obtained examples of the drug from both countries, XT = {xi, ωi}_{i=1}^{n}, i.e. we have training data, or a sample.

Let us see whether we can recognise using component 1 alone (x1). Figure 15.1 shows some (training) data. We see that a threshold (T) set at about x1 = 2.8 is the best we can do; the classification algorithm is:

ω = 1 when x1 ≥ T, (15.1)
  = 0 otherwise. (15.2)

Use of histograms, see Figure 15.2, might be a more methodical way of determining the threshold, T. If enough training data were available, n → ∞, the histograms h0(x1), h1(x1), properly normalised, would approach probability densities p0(x1), p1(x1), more properly called class conditional probability densities (pdfs), p(x1 | ω), ω = 0, 1; see Figure 15.3.
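A hedged R sketch of the thresholding idea, with simulated component-1 data (the class means, spreads, and threshold below are invented for illustration, not read off Figures 15.1-15.3):

```r
set.seed(5)
x1.c0 <- rnorm(100, mean = 2.0, sd = 0.4)   ## class omega_0 training data
x1.c1 <- rnorm(100, mean = 3.6, sd = 0.4)   ## class omega_1 training data
## overlaid histograms, in the manner of Figure 15.2
hist(x1.c0, xlim = c(0, 6), main = "h0(x1) and h1(x1)", xlab = "x1")
hist(x1.c1, add = TRUE)
## threshold classifier, eqns. 15.1 and 15.2
T <- 2.8
omega <- function(x1) ifelse(x1 >= T, 1, 0)
## training error rate of the threshold rule
mean(c(omega(x1.c0) != 0, omega(x1.c1) != 1))
```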
When the random vector is three-dimensional (p = 3) or more, it becomes impossible to estimate the pdfs using histogram binning: there are a great many bins, and most of them contain no data. In such cases it is usual to assume a distribution family, for example Normal, and to represent the class conditional pdfs using parameters estimated from a sample (training data; estimation = training); see Chapter 11. The use of explicitly statistical methods is described in Chapter 16, but for now we'll try some intuitive methods; as you will see, though, we are never far from statistics.

Figure 15.1: Component 1, x1.

Figure 15.2: Histogram of component 1, x1.

Figure 15.3: Class conditional pdfs.

15.2 Linear separating lines/planes for two dimensions

Since there is overlap in the component-1, x1, measurement, let us use the two components, x = (x1 x2)^T, i.e. (component-1, component-2). Figure 15.4 shows a scatter plot of these data (the sample). The dotted line shows that the data are separable by a straight line; it intercepts the axes at x1 = 4.5 and x2 = 6.

Figure 15.4: Two dimensions, scatter plot.

Apart from plotting the data and drawing the line, how could we derive the separating line from the data? (Thinking of a computer program.)

15.3 Nearest mean classifier

First we estimate the class conditional means, µ0 = E(x|ω = ω0) and µ1 = E(x|ω = ω1). Figure 15.5 shows the line joining the class means and the perpendicular bisector of this line; the perpendicular bisector turns out to be the separating line.
We can derive the equation of the separating line using the fact that points on it are equidistant from both means, µ0 and µ1, and expand using Pythagoras's theorem:

|x − µ0|² = |x − µ1|², (15.3)

(x1 − µ01)² + (x2 − µ02)² = (x1 − µ11)² + (x2 − µ12)². (15.4)

We eventually obtain

(µ01 − µ11)x1 + (µ02 − µ12)x2 − (1/2)(µ01² + µ02² − µ11² − µ12²) = 0, (15.5)

which is of the form

b1x1 + b2x2 − b0 = 0. (15.6)

In Figure 15.5, µ01 = 4, µ02 = 3, µ11 = 2, µ12 = 1.5; with these values, eqn. 15.6 becomes

4x1 + 3x2 − 18.75 = 0, (15.7)

which intercepts the x1 axis at 18.75/4 ≈ 4.7 and the x2 axis at 18.75/3 = 6.25.

Figure 15.5: Two-dimensional scatter plot showing means and separating line.

15.4 Normal form of the separating line, projections, and linear discriminants

Eqn. 15.6 becomes more interesting and useful in its normal form,

a1x1 + a2x2 − a0 = 0, (15.8)

where a1² + a2² = 1; eqn. 15.8 can be obtained from eqn. 15.6 by dividing across by √(b1² + b2²).

Figure 15.6 shows interpretations of the normal form of the straight line equation, eqn. 15.8. The coefficients of the unit vector normal to the line are n = (a1 a2)^T, and a0 is the perpendicular distance from the line to the origin. Incidentally, the components correspond to the direction cosines of n = (a1 a2)^T = (cos θ  sin θ)^T. Thus (Foley, van Dam, Feiner, Hughes & Phillips 1994), n corresponds to one row of a (frame) rotation matrix; in other words, see below, section 15.5, the dot product of the vector expression of a point with n corresponds to projection onto n. (Note that cos(π/2 − θ) = sin θ.)

Figure 15.6: Normal form of a straight line, interpretations.
Also as shown in Figure 15.6, points x = (x1 x2)^T on the side of the line to which n = (a1 a2)^T points have a1x1 + a2x2 − a0 > 0, whilst points on the other side have a1x1 + a2x2 − a0 < 0; as we know, points on the line have a1x1 + a2x2 − a0 = 0.

15.5 Projection and linear discriminant

We know that a1x1 + a2x2 = a^T x, the dot product of n = (a1 a2)^T and x, represents the projection of points x onto n, yielding the scalar value along n, with a0 fixing the origin. This is plausible: projecting onto n yields optimum separability. Such a projection,

g(x) = a1x1 + a2x2, (15.9)

is called a linear discriminant; now we can adapt eqn. 15.2:

ω = 0 when g(x) > a0, (15.10)
  = 1 when g(x) < a0, (15.11)
  = tie when g(x) = a0. (15.12)

Linear discriminants are often written as

g(x) = a1x1 + a2x2 − a0, (15.13)

whence eqns. 15.10 to 15.12 become

ω = 0 when g(x) > 0, (15.14)
  = 1 when g(x) < 0, (15.15)
  = tie when g(x) = 0. (15.16)

15.6 Projections and linear discriminants in p dimensions

Eqn. 15.13 readily generalises to p dimensions: n is a unit vector in p-dimensional space, normal to the (p − 1)-dimensional separating hyperplane. For example, when p = 3, n is the unit vector normal to the separating plane. Other important projections used in pattern recognition are Principal Components Analysis (PCA) and Fisher's Linear Discriminant Analysis (LDA); see Chapter 17.

15.7 Template Matching and Discriminants

An intuitive (but well-founded) classification method is that of template matching or correlation matching. Here we have perfect or average examples of the classes stored in vectors {zj}_{j=1}^{c}, one for each class. Without loss of generality, we assume that all vectors are normalised to unit length. Classification of a newly arrived vector x entails computing its template/correlation match with all c templates,

x^T zj; (15.17)

the class ω is chosen as the j corresponding to the maximum of eqn. 15.17.
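The nearest mean rule of section 15.3 is easy to sketch in R; the class means below are those of Figure 15.5, but the simulated training data and spreads are assumptions for illustration:

```r
set.seed(6)
## training data scattered about the Figure 15.5 means (spreads invented)
x0 <- cbind(rnorm(50, 4, 0.5), rnorm(50, 3.0, 0.5))   ## class omega_0
x1 <- cbind(rnorm(50, 2, 0.5), rnorm(50, 1.5, 0.5))   ## class omega_1
mu0 <- colMeans(x0)   ## estimated class conditional mean, E(x | omega_0)
mu1 <- colMeans(x1)
## assign x to the class whose mean is nearer (squared Euclidean distance)
nearest.mean <- function(x) {
  if (sum((x - mu0)^2) <= sum((x - mu1)^2)) 0 else 1
}
nearest.mean(c(4.2, 3.1))   ## a point near mu0, so classified as class 0
```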
Yet again we see that classification involves a dot product, a projection, and a linear discriminant.

15.8 Nearest neighbour methods

Obviously, we may not always have the linear separability of Figure 15.5. One non-parametric method is to go beyond the nearest mean, see eqn. 15.4, and compute the nearest neighbour in the entire training data set, deciding the class according to the class of that nearest neighbour. A variation is k-nearest neighbour, where a vote is taken over the classes of the k nearest neighbours.

Chapter 16 Statistical Classifier Methods

16.1 One-dimensional classification revisited

Recall Figure 15.3, repeated here as Figure 16.1.

Figure 16.1: Class conditional densities.

We have class conditional pdfs p(x1 | ω), ω = 0, 1; given a newly arrived x1′, we might decide on its class according to the maximum class conditional pdf at x1′, i.e. set a threshold T where p(x1 | 0) and p(x1 | 1) cross; see Figure 16.1.

This is not completely correct. What we want is the probability of each class (its posterior probability) based on the evidence supplied by the data, combined with any prior evidence. In what follows, P(ωi|x) is the posterior probability or a posteriori probability of class ωi given the observation x; P(ωi) is the prior probability or a priori probability. We use upper case P(.) for discrete probabilities, whilst lower case p(.) denotes probability densities.

Bayes' Rule Recall Bayes' rule from eqn. 5.22, repeated here,

P(Ai|B) = P(B|Ai)P(Ai) / Σ_{i=1}^{n} P(B|Ai)P(Ai). (16.1)

This says that the posterior probability of Ai given B (conditional on B having occurred) is the product of the conditional probability of B given Ai and the prior probability of Ai, all divided by P(B) = Σ_{i=1}^{n} P(B|Ai)P(Ai).

We can rewrite eqn. 16.1 in terms of our random variable x (= B) and our classes ω0, ω1 (= Ai, i = 0, 1) to get

P(ωi|x) = P(x|ωi)P(ωi) / Σ_{i=0}^{1} P(x|ωi)P(ωi). (16.2)
P(ωi|x) is the posterior probability of class ωi given that our analysis has yielded x; P(ωi) is the prior probability; if we have no prior preference, then P(ω0) = 0.5, P(ω1) = 0.5. Eqn. 16.2 forms a Bayes decision rule: compute the two posterior probabilities and take the class which has the maximum.

Let the Bayes decision rule be represented by a function g(.) of the feature vector x:

g(x) = argmax_{ωj ∈ Ω} [P(ωj | x)]. (16.3)

To show that the Bayes decision rule, eqn. 16.3, achieves the minimum probability of error, we compute the probability of error conditional on the feature vector x (the conditional risk) associated with it:

R(g(x) = ωj | x) = Σ_{k=1, k≠j}^{c} P(ωk | x). (16.4)

That is to say, for the point x we compute the sum of the posterior probabilities of all the c − 1 classes not chosen. Since Ω = {ω1, ..., ωc} form a partition (they are mutually exclusive and exhaustive) and the {P(ωk|x)}_{k=1}^{c} are probabilities and so sum to unity, eqn. 16.4 reduces to:

R(g(x) = ωj) = 1 − P(ωj | x). (16.5)

It immediately follows that, to minimise R(g(x) = ωj), we maximise P(ωj | x), thus establishing the optimality of eqn. 16.3. The problem now is to determine P(ω | x), which brings us to Bayes' rule.

16.2 Bayes' Rule for the Inversion of Conditional Probabilities

****[Needs tidying and made compatible with previous section.]

From the definition of conditional probability, we have

p(ω, x) = P(ω | x)p(x), (16.6)

and, owing to the fact that the events in a joint probability are interchangeable, we can equate the joint probabilities:

p(ω, x) = p(x, ω) = p(x | ω)P(ω). (16.7)

Therefore, equating the right-hand sides of these equations and rearranging, we arrive at Bayes' rule for the posterior probability P(ω | x):

P(ω | x) = p(x | ω)P(ω) / p(x). (16.8)

P(ω) expresses our belief that ω will occur, prior to any observation. If we have no prior knowledge, we can assume equal priors for each class: P(ω1) = P(ω2) = ...
Although we avoid further discussion here, we note that the matter of choice of prior probabilities is the subject of considerable discussion, especially in the literature on Bayesian inference, see, for example, (Sivia 1996).

p(x) is the unconditional probability density of x, and can be obtained by summing the conditional densities:

    p(x) = Σ_{j=1}^c p(x | ω_j) P(ω_j).    (16.9)

Thus, to solve eqn. 16.8, it remains to estimate the conditional densities.

16.3 Parametric Methods

Where we can assume that the densities follow a particular form, for example Gaussian, the density estimation problem is reduced to that of estimation of parameters. The multivariate normal density, see section B.7, p-dimensional, is given by:

    p(x | ω_j) = (1 / ((2π)^{p/2} |K_j|^{1/2})) exp[−(1/2)(x − µ_j)^T K_j^{-1} (x − µ_j)].    (16.10)

p(x | ω_j) is completely specified by µ_j, the p-dimensional mean vector, and K_j, the corresponding p × p covariance matrix:

    µ_j = E[x]_{ω=ω_j},    (16.11)

    K_j = E[(x − µ_j)(x − µ_j)^T]_{ω=ω_j}.    (16.12)

The respective maximum likelihood estimates are:

    µ̂_j = (1/N_j) Σ_{n=1}^{N_j} x_n,    (16.13)

and,

    K̂_j = (1/(N_j − 1)) Σ_{n=1}^{N_j} (x_n − µ̂_j)(x_n − µ̂_j)^T,    (16.14)

where we have separated the training data X_T = {x_n, ω_n}_{n=1}^N into sets according to class.

16.4 Discriminants based on Normal Density

We may write eqn. 16.8 as a discriminant function:

    g_j(x) = P(ω_j | x) = p(x | ω_j) P(ω_j) / p(x),    (16.15)

so that classification, eqn. 16.3, becomes a matter of assigning x to class ω_j if,

    g_j(x) > g_k(x), ∀ k ≠ j.    (16.16)

Since p(x), the denominator of eqn. 16.15, is the same for all g_j(x), and since eqn. 16.16 involves comparison only, we may rewrite eqn. 16.15 as

    g_j(x) = p(x | ω_j) P(ω_j).    (16.17)

We may derive a further possible discriminant by taking the logarithm of eqn. 16.17 — since the logarithm is a monotonically increasing function, applying it preserves the relative order of its arguments:

    g_j(x) = log p(x | ω_j) + log P(ω_j).    (16.18)

In the multivariate Gaussian case, eqn.
16.18 becomes (Duda & Hart 1973):

    g_j(x) = −(1/2)(x − µ_j)^T K_j^{-1} (x − µ_j) − (p/2) log 2π − (1/2) log|K_j| + log P(ω_j).    (16.19)

Henceforth, we refer to eqn. 16.19 as the Bayes-Gauss classifier.

The multivariate normal (Gaussian) density provides a good characterisation of pattern (vector) distribution where we can model the generation of patterns as ideal pattern plus measurement noise; for an instance of a measured vector x_n from class ω_j:

    x_n = µ_j + e_n,    (16.20)

where e_n ∼ N(0, K_j), that is, the noise covariance is class dependent.

16.5 Bayes-Gauss Classifier – Special Cases

(Duda & Hart 1973, pp. 26–31) Revealing comparisons with the other learning paradigms which play an important role in these notes are made possible if we examine particular forms of noise covariance in which the Bayes-Gauss classifier reduces to certain interesting limiting forms:

• Equal and Diagonal Covariances (K_j = σ²I, ∀j, where I is the unit matrix); in this case certain important equivalences with eqn. 16.19 can be demonstrated:
  – Nearest mean classifier;
  – Linear discriminant;
  – Template matching;
  – Matched filter;
  – Single layer neural network classifier.
• Equal but Non-diagonal Covariance Matrices:
  – Nearest mean classifier using Mahalanobis distance; and, as in the case of diagonal covariance,
  – Linear discriminant function;
  – Single layer neural network.

16.5.1 Equal and Diagonal Covariances

When each class has the same covariance matrix, and these are diagonal, we have K_j = σ²I, so that K_j^{-1} = (1/σ²)I. Since the covariance matrices are equal, we can eliminate the (1/2) log|K_j| term; the (p/2) log 2π term is constant in any case; thus, with the consequent simplification of (x − µ_j)^T K_j^{-1} (x − µ_j), eqn. 16.19 may be rewritten:

    g_j(x) = −(1/(2σ²)) (x − µ_j)^T (x − µ_j) + log P(ω_j)    (16.21)
           = −(1/(2σ²)) ‖x − µ_j‖² + log P(ω_j).    (16.22)

Nearest mean classifier   If we assume equal prior probabilities P(ω_j), the second term in eqn.
16.22 may be eliminated for comparison purposes and we are left with a nearest mean classifier.

Linear discriminant   If we further expand the squared distance term, we have,

    g_j(x) = −(1/(2σ²)) (x^T x − 2µ_j^T x + µ_j^T µ_j) + log P(ω_j),    (16.23)

which, since x^T x is the same for all classes, may be rewritten as a linear discriminant:

    g_j(x) = w_{j0} + w_j^T x    (16.24)

where

    w_{j0} = −(1/(2σ²)) µ_j^T µ_j + log P(ω_j),    (16.25)

and

    w_j = (1/σ²) µ_j.    (16.26)

Template matching   In this latter form the Bayes-Gauss classifier may be seen to be performing template matching or correlation matching, where w_j = constant × µ_j; that is, the prototypical pattern for class j, the mean µ_j, is the template.

Matched filter   In radar and communications systems a matched filter detector is an optimum detector of (subsequence) signals, for example, communication symbols. If the vector x is written as a time series (a digital signal), x[n], n = 0, 1, . . ., then the matched filter for each signal j may be implemented as a convolution:

    y_j[n] = x[n] ∘ h_j[n] = Σ_{m=0}^{N−1} x[n − m] h_j[m],    (16.27)

where the kernel h_j[.] is a time reversed template — that is, at each time instant, the correlation between h_j[.] and the last N samples of x[.] is computed. Provided some threshold is exceeded, the signal achieving the maximum correlation is detected.

Single Layer Neural Network   If we restrict the problem to two classes, we can write the classification rule as:

    g(x) = g_1(x) − g_2(x),   decide ω_1 if g(x) ≥ 0, otherwise ω_2,    (16.28)
    g(x) = w_0 + w^T x,    (16.29)

where w_0 = −(1/(2σ²))(µ_1^T µ_1 − µ_2^T µ_2) + log(P(ω_1)/P(ω_2)) and w = (1/σ²)(µ_1 − µ_2). In other words, eqn. 16.29 implements a linear combination, adds a bias, and thresholds the result — that is, a single layer neural network with a hard-limit activation function. (Duda & Hart 1973) further demonstrate that eqn. 16.22 implements a hyper-plane partitioning of the feature space.

16.5.2 Equal but General Covariances

When each class has the same covariance matrix, K, eqn.
16.19 reduces to:

    g_j(x) = −(1/2)(x − µ_j)^T K^{-1} (x − µ_j) + log P(ω_j).    (16.30)

Nearest Mean Classifier, Mahalanobis Distance   If we have equal prior probabilities P(ω_j), we arrive at a nearest mean classifier where the distance calculation is weighted. The Mahalanobis distance (x − µ_j)^T K^{-1} (x − µ_j) effectively weights contributions according to inverse variance. Points of equal Mahalanobis distance correspond to points of equal conditional density p(x | ω_j).

Linear Discriminant   Eqn. 16.30 may be rewritten as a linear discriminant, see also section 15.5:

    g_j(x) = w_{j0} + w_j^T x    (16.31)

where

    w_{j0} = −(1/2) µ_j^T K^{-1} µ_j + log P(ω_j),    (16.32)

and

    w_j = K^{-1} µ_j.    (16.33)

Weighted template matching, matched filter   In this latter form the Bayes-Gauss classifier may be seen to be performing weighted template matching.

Single Layer Neural Network   As for the diagonal covariance matrix, it can easily be demonstrated that, for two classes, eqns. 16.31–16.33 may be implemented by a single neuron. The only difference from eqn. 16.29 is that the non-bias weight vector, instead of being a simple difference between means, is now weighted by the inverse of the covariance matrix.

16.6 Least square error trained classifier

We can formulate the problem of classification as a least-square-error problem. If we require the classifier to output a class membership indicator ∈ [0, 1] for each class, we can write:

    d = f(x)    (16.34)

where d = (d_1, d_2, . . . , d_c)^T is the c-dimensional vector of class indicators and x, as usual, the p-dimensional feature vector. We can express individual class membership indicators as:

    d_j = b_0 + Σ_{i=1}^p b_i x_i + e.    (16.35)

In order to continue the analysis we need to refer to the theory of linear regression, see Chapter 20. We repeat eqn. 20.12 here:

    B̂ = (X^T X)^{-1} X^T Y.    (16.36)

X^T Y is a (p + 1) × c matrix, and B̂ is a (p + 1) × c matrix of coefficients — that is, one column of p + 1 coefficients for each class. Eqn.
16.36 defines the training algorithm of our classifier. Application of the classifier to an (augmented) feature vector x may be expressed as:

    ŷ = B̂^T x.    (16.37)

It remains to find the maximum of the c components of ŷ. In the two-class case, this least-square-error training algorithm yields an identical discriminant to Fisher's linear discriminant (Duda & Hart 1973). Fisher's linear discriminant is described in Chapter 17.

16.7 Generalised linear discriminant function

Eqn. 15.13 may be adapted to cope with any function(s) of the features x_i; we can define a new feature vector x′ where:

    x′_k = f_k(x).    (16.38)

In the pattern recognition literature, the solution of eqn. 16.38, involving now the vector x′, is called the generalised linear discriminant function (Duda & Hart 1973). It is desirable to escape from the fixed model of eqn. 16.38: the form of the f_k(x) must be known in advance. Multilayer perceptron (MLP) neural networks provide such a solution. We have already shown the correspondence between the linear model, eqn. 20.8, and a single layer neural network with a single output node and linear activation function. An MLP with appropriate nonlinear activation functions, e.g. sigmoid, provides a model-free and arbitrarily non-linear solution to learning the mapping between x and y (Bishop 1995).

Chapter 17

Linear Discriminant Analysis and Principal Components Analysis

17.1 Principal Components Analysis

Principal component analysis (PCA), also called the Karhunen-Loève transform (Duda, Hart & Stork 2000), is a linear transformation which maps a p-dimensional feature vector x ∈ R^p to another vector y ∈ R^p, where the transformation is optimised such that the components of y contain maximum information in a least-square-error sense. In other words, if we take the first q ≤ p components (y′ ∈ R^q), then, using the inverse transformation, we can reproduce x with minimum error.
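This least-square-error reconstruction property can be sketched numerically. The toy data below are invented for illustration, and NumPy's `eigh` routine is one of several ways to obtain the eigenvectors; nothing here is prescribed by the text.

```python
# PCA sketch: the eigenvectors of the sample covariance give a basis in
# which truncating to the leading q components reconstructs x with
# minimum mean squared error. Toy data, invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
# 200 points stretched strongly along one direction, so that one
# principal component dominates. The sample matrix is mean-centred.
A = rng.normal(size=(200, 2)) @ np.array([[3.0, 1.0], [0.0, 0.3]])
A = A - A.mean(axis=0)

C = A.T @ A / len(A)                 # sample covariance, as in eqn. 17.7
evals, U = np.linalg.eigh(C)         # eigh returns ascending eigenvalues
order = np.argsort(evals)[::-1]      # re-order: largest eigenvalue first
evals, U = evals[order], U[:, order]

q = 1                                # keep only the first principal component
Y = A @ U[:, :q]                     # forward transform
A_rec = Y @ U[:, :q].T               # inverse transform (reconstruction)

# The mean squared residual equals the discarded eigenvalue (variance).
err = np.mean(np.sum((A - A_rec) ** 2, axis=1))
print(err)
print(evals)
```

The printed residual matches the smaller eigenvalue, which is the "minimum error" claim of the text in concrete form.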
Yet another view is that the first few components of y contain most of the variance, that is, in those components, the transformation stretches the data maximally apart. It is this that makes PCA good for visualisation of the data in two dimensions, i.e. the first two principal components give an optimum view of the spread of the data. We note, however, that unlike linear discriminant analysis, see section 17.2, PCA does not take account of class labels. Hence it is typically more useful as a visualisation of the inherent variability of the data.

In general x can be represented, without error, by the following expansion:

    x = Uy = Σ_{i=1}^p y_i u_i    (17.1)

where y_i is the ith component of y and

    U = (u_1, u_2, . . . , u_p)    (17.3)

is an orthonormal matrix:

    u_j^T u_k = δ_{jk} = 1 when j = k; otherwise 0.    (17.4)

If we truncate the expansion at i = q,

    x′ = U_q y = Σ_{i=1}^q y_i u_i,    (17.5)

we obtain a least square error approximation of x, i.e.

    |x − x′| = minimum.    (17.6)

The optimum transformation matrix U turns out to be the eigenvector matrix of the sample covariance matrix C:

    C = (1/N) A^T A,    (17.7)

where A is the N × p sample matrix, and

    U C U^T = Λ,    (17.8)

the diagonal matrix of eigenvalues.

17.2 Fisher's Linear Discriminant Analysis

In contrast with PCA (see section 17.1), linear discriminant analysis (LDA) transforms the data to provide optimal class separability (Duda et al. 2000) (Fisher 1936). Fisher's original LDA, for two-class data, is obtained as follows. We introduce a linear discriminant u (a p-dimensional vector of weights — the weights are very similar to the weights used in neural networks) which, via a dot product, maps a feature vector x to a scalar,

    y = u^T x.    (17.9)

u is optimised to maximise simultaneously (a) the separability of the classes (between-class separability), and (b) the clustering together of same class data (within-class clustering). Mathematically, this criterion can be expressed as the ratio

    J(u) = (u^T S_B u) / (u^T S_W u),    (17.10)
where S_B is the between-class covariance,

    S_B = (m_1 − m_2)(m_1 − m_2)^T,    (17.11)

and

    S_W = C_1 + C_2,    (17.12)

the sum of the class-conditional covariance matrices, see section 17.1; m_1 and m_2 are the class means. u is now computed as:

    u = S_W^{-1} (m_1 − m_2).    (17.13)

There are other formulations of LDA (Duda et al. 2000) (Venables & Ripley 2002), particularly extensions from two-class to multi-class data. In addition, there are extensions (Duda et al. 2000) (Venables & Ripley 2002) which form a second discriminant, orthogonal to the first, which optimises the separability and clustering criteria subject to the orthogonality constraint. The second dimension/discriminant is useful in allowing the data to be viewed as a two-dimensional scatter plot.

Chapter 18

Neural Network Methods

Here we show that a single neuron implements a linear discriminant (and hence also implements a separating hyperplane). Then we proceed to a discussion which indicates that a neural network comprising three processing layers can implement any arbitrarily complex decision region.

Recall eqn. 15.12, with a_i → w_i, and now (arbitrarily) allocating discriminant value zero to class 0,

    g(x) = Σ_{i=1}^p w_i x_i − w_0   { ≤ 0: ω = 0;  > 0: ω = 1 }.    (18.1)

Figure 18.1 shows a single artificial neuron which implements precisely eqn. 18.1.

[Figure 18.1: Single neuron: inputs x_1, . . . , x_p with weights w_1, . . . , w_p, and a bias input +1 with weight w_0.]

The signal flows into the neuron (circle) are weighted; the neuron receives w_i x_i. The neuron sums and applies a hard limit (output = 1 when sum > 0, otherwise 0). Later we will introduce a sigmoid activation function (softer transition) instead of the hard limit. The threshold term in the linear discriminant (a_0 in eqn. 15.13) is provided by w_0 × +1. Another interpretation of bias, useful in mathematical analysis of neural networks, see section 16.6, is to represent it by a constant component, +1, as the zeroth component of the augmented feature vector.
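A minimal sketch of a hard-limit neuron of this kind follows; the particular weights used are arbitrary illustrative values, not anything from the text.

```python
# Hard-limit neuron: weighted sum of inputs minus the threshold w0,
# output 1 if the result is positive, otherwise 0 (cf. eqn. 18.1).
def neuron(x, w, w0):
    s = sum(wi * xi for wi, xi in zip(w, x)) - w0
    return 1 if s > 0 else 0

# Arbitrary example weights: the separating line x1 + x2 = 1 in 2-D.
w, w0 = [1.0, 1.0], 1.0
print(neuron([0.9, 0.9], w, w0))   # above the line: class 1
print(neuron([0.1, 0.2], w, w0))   # below the line: class 0
```

Points on one side of the line w·x = w0 yield output 1, points on the other side output 0, which is exactly the hyperplane partitioning described above.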
Just to re-emphasise the linear boundary nature of linear discriminants (and hence neural networks), examine the two-dimensional case,

    w_1 x_1 + w_2 x_2 − w_0   { ≤ 0: ω = 0;  > 0: ω = 1 }.    (18.2)

The boundary between classes, given by w_1 x_1 + w_2 x_2 − w_0 = 0, is a straight line with x_1-axis intercept at w_0/w_1 and x_2-axis intercept at w_0/w_2, see Figure 18.2.

[Figure 18.2: Separating line implemented by a two-input neuron.]

18.1 Neurons for Boolean Functions

A neuron with weights w_0 = −0.5 and w_1 = w_2 = 0.35 implements a Boolean AND:

x1  x2  AND(x1,x2)  Neuron summation                      Hard-limit (>0?)
--  --  ----------  ------------------------------------  ----------------
0   0   0           sum = -0.5 + 0.35x1 + 0.35x2 = -0.5    => output = 0
1   0   0           sum = -0.5 + 0.35x1 + 0.35x2 = -0.15   => output = 0
0   1   0           sum = -0.5 + 0.35x1 + 0.35x2 = -0.15   => output = 0
1   1   1           sum = -0.5 + 0.35x1 + 0.35x2 = +0.2    => output = 1

Similarly, a neuron with weights w_0 = −0.25 and w_1 = w_2 = 0.35 implements a Boolean OR. Figure 18.3 shows the x_1-x_2-plane representation of AND, OR, and XOR (exclusive-or).

[Figure 18.3: AND, OR, XOR in the x_1-x_2 plane.]

It is noted that XOR cannot be implemented by a single neuron; in fact it requires two layers. Two layers were a big problem in the first wave of neural network research in the 1960s, when it was not known how to train more than one layer.

18.2 Three-layer neural network for arbitrarily complex decision regions

The purpose of this section is to give an intuitive argument as to why three processing layers can implement an arbitrarily complex decision region. Figure 18.4 shows such a decision region in two dimensions. As shown in the figure, each 'island' of class 1 may be delineated using a series of boundaries, d11, d12, d13, d14 and d21, d22, d23, d24. Figure 18.5 shows a three-layer network which can implement this decision region.
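The AND and OR neurons of section 18.1, using the weights quoted there, can be checked directly, and composed in the AND-then-OR fashion just described. The `island` helper is an illustrative stand-in for a second-layer AND over first-layer hyperplane decisions such as d11, . . . , d14.

```python
# AND and OR neurons of section 18.1 (bias weight w0 on a constant +1
# input; hard limit: output 1 when the sum > 0).
def neuron(x1, x2, w0, w1, w2):
    return 1 if (w0 + w1 * x1 + w2 * x2) > 0 else 0

def AND(x1, x2):
    return neuron(x1, x2, -0.5, 0.35, 0.35)

def OR(x1, x2):
    return neuron(x1, x2, -0.25, 0.35, 0.35)

# Section 18.2 layering: AND together hyperplane decisions to detect
# membership of one 'island', then OR the island detectors.
def island(d_list):
    out = d_list[0]
    for d in d_list[1:]:
        out = AND(out, d)
    return out

print([AND(a, b) for a in (0, 1) for b in (0, 1)])   # AND truth table
print([OR(a, b) for a in (0, 1) for b in (0, 1)])    # OR truth table
print(OR(island([1, 1, 1]), island([1, 0, 1])))      # inside island 1 only
```

The last line shows the whole three-layer idea in miniature: the point satisfies all boundaries of island 1 but not all of island 2, and the output OR still assigns it class 1.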
First, just as before, input neurons implement separating lines (hyperplanes), d11, etc. Next, in layer 2, we AND together the decisions from the separating hyperplanes to obtain the decisions 'in island 1', 'in island 2'. Finally, in the output layer, we OR together the latter decisions; thus we can construct an arbitrarily complex partitioning.

[Figure 18.4: Complex decision region required: two 'islands' of class 1 in a background of class 0, each island delineated by four separating lines, d11–d14 and d21–d24.]

Of course, this is merely an intuitive argument. A three layer neural network trained with backpropagation or some other technique might well achieve the partitioning in quite a different manner.

18.3 Sigmoid activation functions

If a neural network is to be trained using backpropagation or a similar technique, hard limit activation functions cause problems (associated with differentiation). Sigmoid activation functions are used instead. A sigmoid activation function corresponding to the hard limit progresses from output value 0 at −∞, passes through value 0.5 at 0, and flattens out at value 1 at +∞.

[Figure 18.5: Three-layer neural network implementing an arbitrarily complex decision region: a first layer of hyperplane neurons (d11, . . . , d24), a second layer of AND neurons, and an OR output neuron.]

Chapter 19

Unsupervised Classification (Clustering)

Chapter 20

Regression

20.1 Linear Regression

The simplest linear model, y = mx + c, of school mathematics, is given by:

    y = b_0 + b_1 x + e,    (20.1)

which shows the dependence of the dependent variable y on the independent variable x. In other words, y is a linear function of x and the observation is subject to noise, e; e is assumed to be a zero-mean random process. Strictly eqn.
20.1 is affine, since b_0 is included, but common usage dictates the use of linear. Taking the nth observation of (x, y), we have (Beck & Arnold 1977, p. 133):

    y_n = b_0 + b_1 x_n + e_n.    (20.2)

Least square error estimators for b_0 and b_1, namely b̂_0 and b̂_1, may be obtained from a set of paired observations {x_n, y_n}_{n=1}^N by minimising the sum of squared residuals:

    S = Σ_{n=1}^N r_n² = Σ_{n=1}^N (y_n − ŷ_n)²,    (20.3)

    S = Σ_{n=1}^N (y_n − b_0 − b_1 x_n)².    (20.4)

Minimising with respect to b_0 and b_1, and replacing these with their estimators, b̂_0 and b̂_1, gives the familiar result:

    b̂_1 = [N Σ y_n x_n − (Σ y_n)(Σ x_n)] / [N Σ x_n² − (Σ x_n)²],    (20.5)

    b̂_0 = (Σ y_n)/N − b̂_1 (Σ x_n)/N.    (20.6)

The validity of these estimates does not depend on the distribution of the errors e_n; that is, an assumption of Gaussianity is not essential. On the other hand, all the simplest estimation procedures, including eqns. 20.5 and 20.6, assume the x_n to be error free, and that the error e_n is associated with y_n.

In the case where y, still one-dimensional, is a function of many independent variables — p in our usual formulation of p-dimensional feature vectors — eqn. 20.2 becomes:

    y_n = b_0 + Σ_{i=1}^p b_i x_{in} + e_n    (20.7)

where x_{in} is the i-th component of the n-th feature vector. Eqn. 20.7 can be written compactly as:

    y_n = x_n^T b + e_n    (20.8)

where b = (b_0, b_1, . . . , b_p)^T is a (p + 1)-dimensional vector of coefficients, and x_n = (1, x_{1n}, x_{2n}, . . . , x_{pn})^T is the augmented feature vector. The constant 1 in the augmented vector corresponds to the coefficient b_0; that is, it is the so-called bias term of neural networks, see sections 15.5 and 18. All N observation equations may now be collected together:

    y = Xb + e    (20.9)

where y = (y_1, y_2, . . . , y_N)^T is the N × 1 vector of observations of the dependent variable, and e = (e_1, e_2, . . . , e_N)^T. X is the N × (p + 1) matrix formed by N rows of p + 1 independent variables. Now, the sum of squared residuals, eqn.
20.3, becomes:

    S = (y − Xb̂)^T (y − Xb̂).    (20.10)

Minimising with respect to b — just as eqn. 20.3 was minimised with respect to b_0 and b_1 — leads to a solution for b̂ (Beck & Arnold 1977, p. 235):

    b̂ = (X^T X)^{-1} X^T y.    (20.11)

The jk-th element of the (p + 1) × (p + 1) matrix X^T X is Σ_{n=1}^N x_{nj} x_{nk}, in other words, just N times the jk-th element of the autocorrelation matrix, R, of the vector of independent variables x estimated from the N sample vectors.

If we have multiple dependent variables (y), in this case c of them, we can replace y in eqn. 20.11 with an appropriate N × c matrix Y formed by N rows each of c observations. Now, eqn. 20.11 becomes:

    B̂ = (X^T X)^{-1} X^T Y.    (20.12)

X^T Y is a (p + 1) × c matrix, and B̂ is a (p + 1) × c matrix of coefficients.

Eqn. 20.12 has one significant weakness: it depends on the condition of the matrix X^T X. As with any autocorrelation or auto-covariance matrix, good conditioning cannot be guaranteed; for example, linearly dependent features will render the matrix singular. In fact, there is an elegant indirect implementation of eqn. 20.12 involving the singular value decomposition (SVD) (Press, Flannery, Teukolsky & Vetterling 1992), (Golub & Van Loan 1989). The Widrow-Hoff iterative gradient-descent training procedure (Widrow & Lehr 1990), developed in the early 1960s, tackles the problem in a different manner.

Bibliography

Beck, J. & Arnold, K. (1977). Parameter Estimation in Engineering and Science, John Wiley & Sons, New York.

Berger, J. (1985). Statistical Decision Theory and Bayesian Analysis, 2nd edn, Springer Verlag.

Berry, D. (1996). Statistics — a Bayesian Perspective, Duxbury Press.

Bishop, C. (1995). Neural Networks for Pattern Recognition, Oxford University Press, Oxford, U.K.

Boslaugh, S. & Watters, P. A. (2008). Statistics in a Nutshell, O'Reilly.

Campbell, J. (2000). Fuzzy Logic and Neural Network Techniques in Data Analysis, PhD thesis, University of Ulster.

Campbell, J. (2005).
Lecture notes on pattern recognition and image processing, Technical report, Letterkenny Institute of Technology. http://www.jgcampbell.com/ip/pr.pdf (accessed 2009-05-01).

Campbell, J. & Murtagh, F. (1998). Image processing and pattern recognition, Technical report, Computer Science, Queen's University Belfast. Available at: http://www.jgcampbell.com/ip (accessed 2009-05-01).

Casella, G. & Berger, R. (2001). Statistical Inference, 2nd edn, McGraw-Hill.

Crawley, M. J. (2005). Statistics: an introduction using R, John Wiley. Good introduction to statistics using R.

Duda, R. & Hart, P. (1973). Pattern Classification and Scene Analysis, Wiley-Interscience, New York.

Duda, R., Hart, P. & Stork, D. (2000). Pattern Classification, Wiley-Interscience.

Duntsch, I. & Gediga, G. (2000). Sets, Relations, Functions, Methodos Publishers. Available via http://www.cosc.brocku.ca/~duentsch/papers/methprimer1.html (accessed 2009-04-30).

Dytham, C. (2009). Choosing and Using Statistics: A Biologist's Guide, 2nd edn, Blackwell Publishing. ISBN-13: 978-1-4051-0243-8.

Feller, W. (1968). An Introduction to Probability Theory and its Applications, volume 1, 3rd edn, John Wiley & Sons, New York.

Fisher, R. (1936). The use of multiple measurements in taxonomic problems, Annals of Eugenics 7: 179–188.

Foley, J., van Dam, A., Feiner, S., Hughes, J. & Phillips, R. (1994). Introduction to Computer Graphics, Addison Wesley.

Frey, B. (2006). Statistics Hacks, O'Reilly.

Gelman, A., Carlin, J., Stern, H. & Rubin, D. (1995). Bayesian Data Analysis, Chapman and Hall.

Gelman, A. & Nolan, D. (2002). Teaching statistics: a bag of tricks, Oxford University Press.

Golub, G. & Van Loan, C. (1989). Matrix Computations, 2nd edn, Johns Hopkins University Press, Baltimore.

Griffiths, D. (2009). Head First Statistics, O'Reilly. ISBN-10: 0596527586. Excellent introduction.

Hacking, I. (2001). An Introduction to Probability and Inductive Logic, Oxford University Press.

Hastie, T., Tibshirani, R. & Friedman, J. (2001).
The Elements of Statistical Learning, Springer.

Hsu, H. (1997). Theory and Problems of Probability, Random Variables, and Random Processes (Schaum's Outlines), McGraw-Hill.

Jaynes, E. T. (2003). Probability Theory: The Logic of Science (G. L. Bretthorst, ed.), Cambridge University Press. Jaynes was one of the chief advocates of the Bayesian method.

Jeffreys, H. (1961/1998). Theory of Probability, 3rd edn, Oxford University Press (Oxford Classics Series – 1998), Oxford, U.K.

Larson, H. (1982). Introduction to Probability and Statistical Inference, 3rd edn, John Wiley.

Lee, P. M. (2004). Bayesian Statistics: an introduction, 3rd edn, Arnold. Reputedly one of the best introductions to Bayesian statistics; contains examples in R.

MacKay, D. J. C. (2002). Information Theory, Inference and Learning Algorithms, Cambridge University Press. MacKay is a major advocate of Bayesian methods.

Maindonald, J. & Braun, J. (2007). Data Analysis and Graphics Using R: an example-based approach, 2nd edn, Cambridge University Press, Cambridge, U.K. ISBN: 978-0-521-86116-8; good R examples, including graphics.

Matloff, N. (2008). R for programmers, Technical report, University of California, Davis. http://heather.cs.ucdavis.edu/~matloff/R/RProg.pdf (accessed 2009-04-25).

Meyer, P. L. (1966). Introductory Probability and Statistical Applications, Addison-Wesley, Reading, MA. Excellent introduction, but now out of print.

Milton, M. (2009). Head First Data Analysis: A learner's guide to big numbers, statistics, and good decisions, O'Reilly. ISBN-10: 0596153937. Another excellent introduction. Uses R.

Murtagh, F. (2005). Correspondence Analysis and Data Coding with Java and R, Chapman and Hall/CRC Press.

O'Hagan, A. (1994). Kendall's Advanced Theory of Statistics, Vol. 2B, Bayesian Inference, Edward Arnold.

Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems (revised second printing), Morgan Kaufmann, San Francisco, CA.

Press, W., Flannery, B., Teukolsky, S. & Vetterling, W. (1992).
Numerical Recipes in C, 2nd edn, Cambridge University Press, Cambridge, UK.

Quinn, G. P. & Keough, M. J. (2002). Experimental Design and Data Analysis for Biologists, Cambridge University Press. ISBN-13: 978-0521009768.

Ripley, B. (1996). Pattern Recognition and Neural Networks, Cambridge University Press, Cambridge, U.K.

Rosenkrantz, R. D. (ed.) (1983). E.T. Jaynes. Papers on Probability, Statistics and Statistical Physics, Kluwer, Dordrecht.

Salsburg, D. (2001). The Lady Tasting Tea: How Statistics Revolutionized Science in the 20th Century, W.H. Freeman. Great introduction to the origins of statistics.

Silvey, S. (1975). Statistical Inference, Chapman and Hall.

Sivia, D. (1996). Data Analysis, A Bayesian Tutorial, Oxford University Press, Oxford, U.K.

Sivia, D. (2006). Data Analysis, A Bayesian Tutorial, 2nd edn, Oxford University Press. Best introduction to Bayesian inference there is.

Spiegel, M. R., Schiller, J. & Srinivasan, R. A. (2009). Theory and Problems of Probability and Statistics (Schaum's Outlines), 3rd edn, McGraw-Hill.

Spiegel, M. R. & Stephens, L. J. (2008). Statistics (Schaum's Outlines), 4th edn, McGraw-Hill. Highly recommended; if you have to buy one book, this is the one; has examples using a few packages, most notably Excel.

Taylor, P. (2008). Probability (manuscript notes on mathematical foundations), Technical report, University of Manchester. http://www.paultaylor.eu/tripos/Probability.pdf (accessed 2009-04-25).

Therrien, C. (1989). Decision, Estimation, and Classification, John Wiley and Sons, Chichester, UK.

Tisted, R. (1988). Elements of Statistical Computing, Chapman and Hall/CRC Press.

Venables, W. & Ripley, B. (2000). S Programming, Springer-Verlag.

Venables, W. & Ripley, B. (2002). Modern Applied Statistics with S, 4th edn, Springer-Verlag. Highly recommended for learning R (R is a free version of S).

Wasserman, L. (2004). All of Statistics: a concise course in statistical inference, Springer Verlag, New York, NY.
ISBN: 0-387-40272-1; top class encyclopedic reference.

Widrow, B. & Lehr, M. (1990). 30 Years of Adaptive Neural Networks, Proc. IEEE 78(9): 1415–1442.

Appendix A

Basic Mathematical Notation

The notation described here is merely shorthand for common sense concepts which would otherwise be confusing and long-winded if written in English. Casual familiarity with the most important items will also allow you to read papers using statistics without becoming confused. The online book Sets, Relations, Functions (Duntsch & Gediga 2000) is an ideal introduction; we take these notes from that book.

A.1 Sets

A.1.1 Set Definition and Membership

A set is a very basic mathematical entity and hence is a bit hard to define. Let's say that a set is a collection of objects; there cannot be repetition (duplication) of objects. We can specify a set by writing all its members within curly brackets, { }.

Example 30   Six-sided dice, set of possible faces (identified by the number of spots); call the set D. We can write D as D = {1, 2, 3, 4, 5, 6}. When there is an obvious sequence, we can write D = {1, 2, . . . , 6}.

Sometimes we specify a rule for making the set; we have, for example, the trivial rule-generated set D = {i | i ∈ {1, . . . , 6}} = {1, . . . , 6}; the set of even numbers between 1 and 6 is given by D_even = {i | i ∈ {1, . . . , 6} and i even} = {2, 4, 6}.

We use the membership symbol ∈ to state that an object is a member of a set, for example, 1 ∈ {1, 2, 3}; we can state non-membership by ∉, for example, 6 ∉ {1, 2, 3}.

There is no ordering of position in a set: {1, 2, 3} and {2, 3, 1} represent the same set. If there is repetition, it is understood that the repeated elements have no effect, so that {1, 2, 3} and {2, 3, 1, 1, 2} represent the same set.

A.1.2 Important Number Sets

• Natural numbers: N = {0, 1, 2, . . .}.
• Positive natural numbers: N+ = {1, 2, . . .}.
• Integers: Z = {. . . , −2, −1, 0, 1, 2, . . .}.
• Real numbers: R.
A.1.3 Set Operations

• Intersection. The set formed by the intersection of sets A, B is written C = A ∩ B = {x : x ∈ A and x ∈ B}.

Example 31   A = {1, 2, 3, 4}, B = {3, 4, 5}, A ∩ B = {3, 4}.

• Union. The set formed by the union of sets A, B is written C = A ∪ B = {x : x ∈ A or x ∈ B}, where "or" means inclusive or, that is, a or b means either a or b, or both.

Example 32   A = {1, 2, 3, 4}, B = {3, 4, 5}, A ∪ B = {1, 2, 3, 4, 5}.

• Set difference. The set formed by the difference of sets A, B is written C = A \ B = {x : x ∈ A and x ∉ B}. That is, remove any members of B.

Example 33   A = {1, 2, 3, 4}, B = {3, 4}, A \ B = {1, 2}.

• Set complement (with respect to a universal set, U). Ā = {x : x ∉ A and x ∈ U}.

Example 34   U = {1, 2, 3, 4, 5, 6}; A = {3, 4, 5}, Ā = {1, 2, 6}.

Comment. In case the notion of a universal set causes difficulty: the universal set depends on the problem at hand; when talking about a class of students, U would be the set of all students in the class. You might have A as the set of all students (in that class — in that universal set) from County Donegal; then Ā is the set of all students from outside County Donegal — that is, not from County Donegal.

A.1.4 Venn Diagrams

Set operations such as intersection, union, difference and complement are often illustrated using Venn diagrams such as those shown in Figure A.1.
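The examples above can be reproduced with Python's built-in set type, whose operators mirror the notation of this section; a small sketch using the sets of Examples 31–34:

```python
# Set operations of section A.1.3 using Python's built-in sets.
A = {1, 2, 3, 4}
B = {3, 4, 5}
U = {1, 2, 3, 4, 5, 6}           # universal set for the complement example

print(A & B)                     # intersection: {3, 4}
print(A | B)                     # union: {1, 2, 3, 4, 5}
print(A - B)                     # set difference: {1, 2}
print(U - {3, 4, 5})             # complement of {3, 4, 5} w.r.t. U: {1, 2, 6}
```

Note that the complement has no dedicated operator: it is just set difference from the universal set, exactly as in the definition above.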
[Figure A.1: Set operations illustrated using Venn diagrams: (a) intersection of A, B; (b) union of A, B (all shaded area); (c) complement of A with respect to the universal set U.]

Subset   When a set A has no members, or some or all of the members of B, but no more, we say that A is a subset of B: A ⊆ B.

Example 35   B = {1, 2, 3, 4, 5, 6}; A = {3, 4, 5}, so that A ⊆ B.

Equality of sets   When a set A has the same members as B, or each is empty, we say that they are equal: A = B. Another way of looking at this is: if A ⊆ B and B ⊆ A, then A = B.

Empty Set   If a set contains no members, we call it the empty set; symbol ∅.

Cardinality of a Set   The number of elements in a set A is called its cardinality and written |A|.

Example 36   A = {1, 2, . . . , 6}, |A| = 6. B = {John, Mary, Jean}, |B| = 3.

Power Set   (Probably not necessary for basic probability.) Given a set A, the power set of A, P(A), is the set of all subsets of A; |P(A)| = 2^|A|. Notice that you can have a set of sets, for example, the set of all classes in the computing department.

Example 37   A = {a, b, c}, |A| = 3. P(A) = {∅, {c}, {b}, {a}, {b, c}, {a, c}, {a, b}, {a, b, c}}. Verify that |P(A)| = 2^|A| = 2³ = 8.

Finite and Infinite Sets   Roughly speaking, if |A| = n where n is some number we can identify, then we say that A is a finite set.
Most of the sets in our examples are finite sets; otherwise the set is infinite. N, Z, R are infinite sets. An example of a finite set of integers is A = {1, 2, . . . , n}; in contrast, an infinite set of integers would be written A = {1, 2, . . .}, which means A = {1, 2, . . . , ∞}.

Disjoint Sets We say that A1, A2, . . . are disjoint if Ai ∩ Aj = ∅, ∀ i, j, i ≠ j. (∀ denotes "for all".)

A.2 Iterated Summation and Product Notation

If we want to write down the operation of summing the numbers from 1 to 6, we could write s = 1 + 2 + 3 + 4 + 5 + 6 or s = 1 + 2 + · · · + 6. But this becomes tedious or impossible for larger lists. Hence we have the summation notation $s = \sum_{i=1}^{6} i$.

Similarly, if we want to write down the operation of multiplying together all the numbers from 1 to 6, we use the product notation $p = \prod_{i=1}^{6} i$.

A.3 Iterated Union and Intersection

If we want to write down the operation of taking the union (see section A.1.3) of a list of sets A1 to A6, we could write B = A1 ∪ A2 ∪ · · · ∪ A6. But this becomes tedious or impossible for larger lists. Similar to the summation notation we have $B = \bigcup_{i=1}^{6} A_i$. For intersection we have $B = \bigcap_{i=1}^{6} A_i$.

A.4 Cartesian Product Sets

Quite often we need to make new sets by making pairs (or triples, or n-tuples) from existing sets.

Example 38 Let B = {1, 2, 3, 4, 5, 6} be the set of outcomes from throwing a six-sided die and A = {H, T} the set of outcomes of a coin toss. If we perform an experiment where we throw the die and toss the coin and we want to describe the set of all possible pairs C = {(1, H), (1, T), (2, H), . . . , (6, H), (6, T)}, we call set C the Cartesian product of A and B.

The Cartesian product of A and B is written A × B. The cardinality of A × B is |A × B| = |A| × |B|. So in Example 38, we have |A × B| = |A| × |B| = 2 × 6 = 12.

Note: pairs such as (1, H), (1, T), or generally n-tuples, enclosed in round brackets ( ), are not sets.
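The iterated sum, iterated product, and Cartesian product all translate directly into standard-library Python; a quick sketch:

```python
from itertools import product
from math import prod

s = sum(range(1, 7))    # sum_{i=1}^{6} i  = 1 + 2 + ... + 6 = 21
p = prod(range(1, 7))   # prod_{i=1}^{6} i = 6! = 720

A = ['H', 'T']                 # coin outcomes
B = [1, 2, 3, 4, 5, 6]         # die outcomes
C = list(product(A, B))        # Cartesian product A x B, as ordered pairs
print(s, p, len(C))            # -> 21 720 12
```

Note `len(C) == len(A) * len(B)`, the cardinality rule |A × B| = |A| × |B| from section A.4.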
Appendix B Matrices and Linear Algebra

B.1 Introduction

In Chapters 7 and 8 we introduce two-dimensional random variables, that is, pairs of random variables which, for one reason or another, we want to treat as pairs rather than separately. Much of what we do in one dimension generalises to two and, more generally, to multiple dimensions; likewise from two dimensions to multiple dimensions.

B.2 Linear Simultaneous Equations

Eqns. B.1 and B.2 are a pair of linear simultaneous equations:

y1 = 3x1 + 1x2, (B.1)
y2 = 2x1 + 4x2. (B.2)

Practically, these equations could express the following. Price of an apple = x1, price of an orange = x2 (both unknown). Person A buys 3 apples and 1 orange and the total bill is 5c (y1). Person B buys 2 apples and 4 oranges and the total bill is 10c (y2). Now, what is x1, the price of an apple, and x2, the price of an orange? We want to solve for the unknowns x1, x2. Matrix algebra gives us a nice technique for solving such problems, see section B.6, but first we'll see how to solve it without matrices.

Substitute y1 = 5 and y2 = 10 into eqns. B.1 and B.2:

5 = 3x1 + 1x2, (B.3)
10 = 2x1 + 4x2. (B.4)

Eqn. B.3 gives x2 = 5 − 3x1, which, substituted into eqn. B.4, gives:

10 = 2x1 + 4(5 − 3x1),
10 = 2x1 + 20 − 12x1,
−10 = −10x1,
x1 = 1.

Now, substitute x1 = 1 into eqn. B.3: 5 = 3 + x2, so x2 = 2. We have determined our unknowns x1 = 1 and x2 = 2.

Ex. Substitute x1 = 1 and x2 = 2 into eqns. B.3 and B.4 to check the correctness of the result.

B.3 Vectors and Matrices

Eqns. B.1 and B.2 can be written in matrix form as

y = Ax, (B.5)

where A is a 2-row × 2-column matrix, $A = \begin{pmatrix} 3 & 1 \\ 2 & 4 \end{pmatrix}$, y is a one-column, two-row matrix representing a tuple, which we will from now on call a vector, $y = \begin{pmatrix} y_1 \\ y_2 \end{pmatrix}$, and x is another one-column, two-row matrix, $x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$.

Vectors We could be extra careful and continue to call objects like x and y tuples.
But everyone in the statistical world uses the term vector for tuple, and, because we are using vector and matrix arithmetic and algebra, this gives another reason to use vector. A vector is nothing more than an ordered collection of one-dimensional variables; however, vector and matrix mathematics have been developed to allow us to do mathematics on vectors without having to deal with each of the elements of (X1, X2, . . . , Xp) separately. It will rarely be helpful to think of these vectors as being like the vectors of physics, having magnitude and direction; but it is often helpful to think of two-dimensional vectors as representing points in a Euclidean plane, and to think of general multidimensional vectors (p dimensions, say) as representing points in p-dimensional space.

Generally, a system of m equations in n variables x1, x2, . . . , xn,

$y_1 = a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n$
$y_2 = a_{21}x_1 + a_{22}x_2 + \cdots + a_{2n}x_n$
$\vdots$
$y_r = a_{r1}x_1 + a_{r2}x_2 + \cdots + a_{rn}x_n$
$\vdots$
$y_m = a_{m1}x_1 + a_{m2}x_2 + \cdots + a_{mn}x_n$ (B.6)

can be written in matrix form as

y = Ax, (B.7)

where y is an m × 1 vector, $y = (y_1, y_2, \ldots, y_m)^t$, x is an n × 1 vector, $x = (x_1, x_2, \ldots, x_n)^t$, and A is an m-row × n-column matrix,

$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & & a_{rc} & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}$.

That is, the matrix A is a rectangular array of numbers whose element in row r, column c is a_{rc} (rows are horizontal, think rows of teeth; columns are vertical). The matrix A is said to be m × n, i.e. m rows, n columns.

Eqn. B.7 can be interpreted as the definition of a function which takes n arguments (x1, x2, . . . , xn) and returns m values (y1, y2, . . . , ym). Such a function is also called a transformation: it transforms n-dimensional vectors to m-dimensional vectors. Such equations are linear transformations because there are no terms in $x_r^2$ or higher, only in $x_r = x_r^1$, and no constant terms like 5 ($5x_r^0 = 5 \times 1 = 5$).

Uses of Vectors and Transformations in Statistics Instead of denoting a two-d.
random variable as (X, Y), it is much more convenient to denote it as the vector $X = \begin{pmatrix} X_1 = X \\ X_2 = Y \end{pmatrix}$. This is particularly true when we get to larger dimensions, when equations like eqn. 7.15 get enormous or impossible.

Why transformations? In other places, we have used combinations of random variables such as U = aX + bY; and we might also have V = cX + dY. Thus, we create a new two-d. random variable (U, V) using linear combinations of (X, Y); we transform (X, Y) to yield (U, V). This can be neatly expressed using matrix notation: y is a 2 × 1 vector, $y = \begin{pmatrix} U \\ V \end{pmatrix}$, x is a 2 × 1 vector, $x = \begin{pmatrix} X \\ Y \end{pmatrix}$, and A is a 2-row × 2-column matrix,

$A = \begin{pmatrix} a_{11} = a & a_{12} = b \\ a_{21} = c & a_{22} = d \end{pmatrix}$.

The larger equation above allows us to create an m-dimensional random variable, y, as the linear combination of the n random variables in the n-dimensional vector x.

B.4 Basic Matrix Arithmetic

B.4.1 Matrix Multiplication

We may multiply two matrices A, m × n, and B, q × p, as long as n = q. Such a multiplication produces an m × p result. Thus,

$\underset{m \times p}{C} = \underset{m \times n}{A}\,\underset{n \times p}{B}$. (B.8)

Method: the element at the rth row and cth column of C is the product (sum of component-wise products) of the rth row of A with the cth column of B.

[A pictorial diagram appears here: the m × n matrix A, with row r running across, times the n × p matrix B, with column c running down, gives the m × p matrix C.]

For C = AB with

$A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}, \quad B = \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix}$,

the product is

$C = \begin{pmatrix} a_{11}b_{11} + a_{12}b_{21} & a_{11}b_{12} + a_{12}b_{22} \\ a_{21}b_{11} + a_{22}b_{21} & a_{21}b_{12} + a_{22}b_{22} \end{pmatrix}$.

Example. Consider Eqn. B.7, y = Ax. The product of A (m × n) and x (n × 1) is y1 = a11 x1 + a12 x2 + · · · + a1n xn, . . . , ym = am1 x1 + am2 x2 + · · · + amn xn. In summation notation, $y_r = \sum_{c=1}^{n} a_{rc} x_c$. The product is (m × n) × (n × 1), so the result is (m × 1), which checks okay, for y is (m × 1).
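The row-by-column rule is only a few lines of code; a minimal sketch using plain Python lists (no libraries; the name `mat_mul` is just illustrative):

```python
def mat_mul(A, B):
    """C = A B, where A is m x n and B is n x p, giving C m x p (eqn B.8).
    C[r][c] is the sum of component-wise products of row r of A
    with column c of B."""
    m, n, p = len(A), len(B), len(B[0])
    assert all(len(row) == n for row in A), "inner dimensions must agree"
    return [[sum(A[r][k] * B[k][c] for k in range(n)) for c in range(p)]
            for r in range(m)]

# y = Ax from section B.2, with x written as an n x 1 (here 2 x 1) matrix:
A = [[3, 1],
     [2, 4]]
y = mat_mul(A, [[1], [2]])
print(y)   # -> [[5], [10]]
```

Note how the matrix-vector product of eqn. B.7 is just the special case p = 1, recovering the bills y1 = 5 and y2 = 10 from the apples-and-oranges example.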
B.4.2 Multiplication by a Scalar

As with vectors (when represented as components), we simply multiply each component by the scalar:

$c \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} = \begin{pmatrix} ca_{11} & ca_{12} \\ ca_{21} & ca_{22} \end{pmatrix}$.

B.4.3 Addition

As with vectors (when represented as components), we add component-wise:

$\begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} + \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix} = \begin{pmatrix} a_{11}+b_{11} & a_{12}+b_{12} \\ a_{21}+b_{21} & a_{22}+b_{22} \end{pmatrix}$.

Clearly, the matrices must be the same size, i.e. row and column dimensions must be equal.

B.5 Special Matrices

B.5.1 Identity Matrix

$I = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$,

i.e. it produces no transformation effect. Thus, IA = A. We can define the matrix inverse as follows: if AB = I then B = A^{-1}, see section B.6.

B.5.2 Orthogonal Matrix

A matrix which satisfies the property $AA^t = I$, i.e. the transpose of the matrix is its inverse, see section B.6. Another way of viewing orthogonality in matrices: for each row of the matrix $(a_{r1}\, a_{r2}\, \ldots\, a_{rn})$, the scalar product with itself is 1, and with all other rows, 0. I.e. $\sum_{c=1}^{n} a_{rc} a_{pc} = 1$ for r = p, and = 0 otherwise.

B.5.3 Diagonal

$A = \begin{pmatrix} S_x & 0 \\ 0 & S_y \end{pmatrix}$

is diagonal, i.e. the only non-zero elements are on the diagonal. The inverse of the diagonal matrix $\begin{pmatrix} a_{11} & 0 \\ 0 & a_{22} \end{pmatrix}$ is $\begin{pmatrix} 1/a_{11} & 0 \\ 0 & 1/a_{22} \end{pmatrix}$.

B.5.4 Transpose of a Matrix

$A^t$, spoken 'A-transpose'. If

$A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}$

then

$A^t = \begin{pmatrix} a_{11} & a_{21} \\ a_{12} & a_{22} \end{pmatrix}$,

i.e. replace column 1 with row 1, etc. The transpose is sometimes written $A^T$ or $A'$.

B.6 Inverse Matrix

Only for square matrices (m = n). Consider again Eqns. B.1 and B.2: y1 = 3x1 + 1x2, y2 = 2x1 + 4x2, i.e. y = Ax with

$A = \begin{pmatrix} 3 & 1 \\ 2 & 4 \end{pmatrix}$.

Apply this to $x = \begin{pmatrix} 1 \\ 2 \end{pmatrix}$ to get y1 = 3 × 1 + 1 × 2 = 5, y2 = 2 × 1 + 4 × 2 = 10.

What if you know y = (5 10)^t and you want to retrieve x = (x1 x2)^t? In other words, can matrices help us solve for x1, x2 as we did in section B.2? The answer is yes. Find the inverse of A, written A^{-1}, and then apply the inverse transformation to y, that is, multiply y by the inverse of the matrix:

x = A^{-1} y.
(B.9)

In the case of a 2 × 2 matrix

$A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}$,

$A^{-1} = \frac{1}{|A|} \begin{pmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{pmatrix}$, (B.10)

where the determinant of the array A is $|A| = a_{11}a_{22} - a_{12}a_{21}$. If |A| = 0, then A is not invertible; it is singular. Inverse matrices give us the equivalent of division: if |A| = 0, attempting to find the inverse is the equivalent of calculating 1/0.

Thus for

$A = \begin{pmatrix} 3 & 1 \\ 2 & 4 \end{pmatrix}$

we have |A| = 3 × 4 − 2 × 1 = 10, so

$A^{-1} = \frac{1}{10} \begin{pmatrix} 4 & -1 \\ -2 & 3 \end{pmatrix} = \begin{pmatrix} 0.4 & -0.1 \\ -0.2 & 0.3 \end{pmatrix}$.

Therefore, applying A^{-1} to $y = \begin{pmatrix} 5 \\ 10 \end{pmatrix}$, we find:

$A^{-1}y = \begin{pmatrix} 0.4 & -0.1 \\ -0.2 & 0.3 \end{pmatrix} \begin{pmatrix} 5 \\ 10 \end{pmatrix} = \begin{pmatrix} 5 \times 0.4 + 10 \times -0.1 \\ 5 \times -0.2 + 10 \times 0.3 \end{pmatrix} = \begin{pmatrix} 1 \\ 2 \end{pmatrix}$,

which is the answer we got in section B.2. In fact, in section B.2 what we did was something very similar to how one inverts a matrix in a computer program.

B.7 Multidimensional (Multivariate) Random Variables

We can now generalise two-d. random variables to p dimensions by extending (X, Y) to (X1, X2, . . . , Xp). It is usual to call the p-dimensional (multivariate) random variable a random vector and to use vector notation: X = (X1, X2, . . . , Xp). The multivariate Normal pdf, p-dimensional, is given by:

$f(\mathbf{x}) = \frac{1}{(2\pi)^{p/2}|K|^{1/2}} \exp\left[-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T K^{-1} (\mathbf{x} - \boldsymbol{\mu})\right]$. (B.11)
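The 2 × 2 inverse formula of eqn. B.10, and the orthogonal-matrix property of section B.5.2 (transpose equals inverse), can both be checked with a few lines of code. A sketch in plain Python (the names `transpose` and `inverse_2x2` are ours; the rotation matrix R is an assumed example of an orthogonal matrix):

```python
import math

def transpose(A):
    """A^t: row r of A becomes column r of A^t."""
    return [list(col) for col in zip(*A)]

def inverse_2x2(A):
    """A^{-1} = (1/|A|) [[a22, -a12], [-a21, a11]], per eqn. B.10."""
    (a11, a12), (a21, a22) = A
    det = a11 * a22 - a12 * a21       # |A| = a11*a22 - a12*a21
    if det == 0:
        raise ValueError("matrix is singular (|A| = 0), cannot invert")
    return [[a22 / det, -a12 / det],
            [-a21 / det, a11 / det]]

# Recover x from y = (5, 10)^t, as in sections B.2 and B.6:
Ainv = inverse_2x2([[3, 1], [2, 4]])     # [[0.4, -0.1], [-0.2, 0.3]]
x = [sum(a * y for a, y in zip(row, [5, 10])) for row in Ainv]
print(x)   # -> [1.0, 2.0], the apple and orange prices

# For an orthogonal matrix (here a rotation) the transpose IS the inverse:
t = math.pi / 6
R = [[math.cos(t), -math.sin(t)],
     [math.sin(t),  math.cos(t)]]
assert all(math.isclose(r, s, abs_tol=1e-12)
           for ri, si in zip(inverse_2x2(R), transpose(R))
           for r, s in zip(ri, si))
```

The singular case (|A| = 0) raises an error, mirroring the "equivalent of calculating 1/0" remark above.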