Pattern Recognition: Mathematics Review
Instructor: Hamid R. Rabiee
Teaching Assistants: Mohammad H. Rohban & Jafar Muhammadi
Sharif University of Technology, Department of Computer Engineering, Pattern Recognition Course
http://ce.sharif.edu/courses/85-86/1/ce725/

Probability Space

A probability space is a triple $(\Omega, \mathcal{F}, P)$:
- $\Omega$ is a nonempty set whose elements are sometimes known as outcomes or states of nature.
- $\mathcal{F}$ is a set whose elements are called events; the events are subsets of $\Omega$, and $\mathcal{F}$ should be a Borel field.
- $P$ is the probability measure.
Fact: $P(\Omega) = 1$.

Random Variables

A random variable is a function (a mapping) from the set of possible outcomes of an experiment to an interval of real (or complex) numbers:
$$X : F \to I, \qquad F \subseteq \Omega,\ \ I \subseteq \mathbb{R}, \qquad X(b) = r$$
so each outcome $b$ is mapped to a point $r$ on the real line.

Examples:
- Mapping the faces of a die to the first six natural numbers.
- Mapping the height of a man to the real interval $(0, 3]$ (meters, or some other unit).
- Mapping success in an exam to the discrete interval $[0, 20]$ quantized in steps of 0.1.

Random Variables (Cont'd)

Random variables may be:
- Discrete: a die, a coin, the grade of a course, etc.
- Continuous: temperature, humidity, length, etc.
They may also be real or complex.

Density/Distribution Functions

Probability Mass Function (PMF):
- Defined for discrete random variables.
- A summation of impulses; the magnitude of each impulse represents the probability of occurrence of that outcome.
- PMF values are probabilities.
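As a small illustration of a PMF, here is a sketch in plain Python (the variable names `pmf`, `total`, and `p_even` are my own, not from the slides): the fair-die mass function, the check that it sums to one, and an event probability obtained by summing impulse magnitudes.

```python
from fractions import Fraction

# PMF of a fair six-sided die: each face carries an impulse of magnitude 1/6.
pmf = {face: Fraction(1, 6) for face in range(1, 7)}

# A valid PMF must sum to exactly one over all outcomes.
total = sum(pmf.values())

# Probability of an event = sum of the PMF over the outcomes in the event.
p_even = sum(pmf[face] for face in (2, 4, 6))
```

Using exact `Fraction` arithmetic avoids any floating-point doubt about the probabilities summing to one.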
Example I: rolling a fair die. $P_X(i) = 1/6$ for $i = 1, \dots, 6$.

Example II: the summation of two fair dice. $P_X$ is defined on $X \in \{2, 3, \dots, 12\}$, rising from $1/36$ at $X = 2$ to $6/36$ at $X = 7$ and falling back to $1/36$ at $X = 12$.
Note: the summation of all probabilities should be equal to ONE. (Why?)

Probability Density Function (PDF):
- Defined for continuous random variables.
- The probability of occurrence of $x_0 \in \left(x - \frac{dx}{2},\ x + \frac{dx}{2}\right)$ is $P_X(x)\,dx$.

Cumulative Distribution Function (CDF):
- Defined for both continuous and discrete random variables.
- Can be defined as the integration of the PDF:
$$\mathrm{CDF}(x) = F_X(x) = P(X \le x) = \int_{-\infty}^{x} f_X(t)\,dt$$
Some CDF properties:
- Non-decreasing
- Right continuous
- $F(-\infty) = 0$
- $F(+\infty) = 1$

Famous Density Functions

Some famous masses and densities:
- Uniform density:
$$f(x) = \frac{1}{a}\left(U(x - \mathrm{begin}) - U(x - \mathrm{end})\right)$$
where $U$ is the unit step and $a = \mathrm{end} - \mathrm{begin}$ is the length of the interval.
- Gaussian (normal) density:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} = N(\mu, \sigma)$$
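The two-dice PMF and its discrete CDF can be built by direct enumeration. This is a sketch with names of my choosing (`pmf`, `cdf`); it verifies the note above that all probabilities sum to one, and that the CDF is a non-decreasing running sum ending at 1.

```python
from fractions import Fraction
from itertools import product

# PMF of the sum of two fair dice, built by enumerating all 36 equally
# likely outcome pairs.
pmf = {}
for a, b in product(range(1, 7), repeat=2):
    pmf[a + b] = pmf.get(a + b, Fraction(0)) + Fraction(1, 36)

# Discrete CDF: F(x) = P(X <= x), a running sum of the PMF.
cdf, running = {}, Fraction(0)
for s in sorted(pmf):
    running += pmf[s]
    cdf[s] = running
```

The peak at 7 (six ways out of 36) and the endpoint $F(12) = 1$ fall out of the enumeration.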
- Exponential density:
$$f(x) = \lambda e^{-\lambda x}\, U(x) = \begin{cases} \lambda e^{-\lambda x} & x \ge 0 \\ 0 & x < 0 \end{cases}$$

Joint/Conditional Distributions

Joint probability distribution function:
$$F_{X,Y}(x, y) = P(X \le x \text{ and } Y \le y) = \int_{-\infty}^{x}\int_{-\infty}^{y} f_{X,Y}(t, s)\,ds\,dt$$

Example I: In a rolling-fair-die experiment, represent the outcome as a 3-bit digital number "$xyz$":
$$1 \to 001,\quad 2 \to 010,\quad 3 \to 011,\quad 4 \to 100,\quad 5 \to 101,\quad 6 \to 110$$
Then for the bits $x$ and $y$:
$$f_{X,Y}(x, y) = \begin{cases} 1/6 & x = 0,\ y = 0 \\ 1/3 & x = 0,\ y = 1 \\ 1/3 & x = 1,\ y = 0 \\ 1/6 & x = 1,\ y = 1 \\ 0 & \text{otherwise} \end{cases}$$

Example II: Two jointly normal random variables:
$$f_{X,Y}(x, y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-r^2}} \exp\!\left(-\frac{1}{2(1-r^2)}\left[\frac{(x-\mu_x)^2}{\sigma_x^2} + \frac{(y-\mu_y)^2}{\sigma_y^2} - \frac{2r(x-\mu_x)(y-\mu_y)}{\sigma_x\sigma_y}\right]\right)$$
What is "$r$"?

Independent events (strong axiom):
$$f_{X,Y}(x, y) = f_X(x)\, f_Y(y)$$

Obtaining one-variable (marginal) density functions:
$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy, \qquad f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dx$$
Distribution functions can be obtained just from the density functions. (How?)

Conditional Density Function

The probability of occurrence of an event if another event is observed (we know what "$Y$" is):
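Example I above can be checked by brute force. This sketch (my own variable names `joint`, `fx`) tabulates the joint PMF of the first two bits of the 3-bit encoding and recovers a marginal by summing over the other variable, exactly as the marginal-density formulas prescribe.

```python
from fractions import Fraction
from collections import defaultdict

# Encode die faces 1..6 as 3-bit numbers "xyz" and tabulate the joint
# PMF of the first two bits (x, y), as in Example I.
joint = defaultdict(Fraction)   # Fraction() == 0
for face in range(1, 7):
    x, y = (face >> 2) & 1, (face >> 1) & 1
    joint[(x, y)] += Fraction(1, 6)

# Marginal of X: sum the joint PMF over all values of Y.
fx = {v: sum(p for (x, _), p in joint.items() if x == v) for v in (0, 1)}
```

The tabulated values match the case analysis on the slide: $1/6$, $1/3$, $1/3$, $1/6$.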
$$f_{X\mid Y}(x \mid y) = \frac{f_{X,Y}(x, y)}{f_Y(y)}$$

Bayes' rule:
$$f_{X\mid Y}(x \mid y) = \frac{f_{Y\mid X}(y \mid x)\, f_X(x)}{\int_{-\infty}^{\infty} f_{Y\mid X}(y \mid t)\, f_X(t)\,dt}$$

Example I: Rolling a fair die, let $X$ be the event "the outcome is an even number" and $Y$ the event "the outcome is a prime number". Then
$$P(X \mid Y) = \frac{P(X, Y)}{P(Y)} = \frac{1/6}{1/2} = \frac{1}{3}$$

Example II: Jointly normal (Gaussian) random variables:
$$f_{X\mid Y}(x \mid y) = \frac{1}{\sqrt{2\pi}\,\sigma_x\sqrt{1-r^2}} \exp\!\left(-\frac{1}{2(1-r^2)\,\sigma_x^2}\left(x - \mu_x - r\,\frac{\sigma_x}{\sigma_y}(y - \mu_y)\right)^2\right)$$

Conditional Distribution Function:
$$F_{X\mid Y}(x \mid y) = P(X \le x \text{ while } Y = y) = \int_{-\infty}^{x} f_{X\mid Y}(t \mid y)\,dt = \frac{\int_{-\infty}^{x} f_{X,Y}(t, y)\,dt}{\int_{-\infty}^{\infty} f_{X,Y}(t, y)\,dt}$$
Note that "$y$" is a constant during the integration.

Independent random variables:
$$f_{X\mid Y}(x \mid y) = \frac{f_{X,Y}(x, y)}{f_Y(y)} = \frac{f_X(x)\,f_Y(y)}{f_Y(y)} = f_X(x)$$

Distribution Measures

The most basic types of descriptors for distributions include:
- Mean
- Variance & standard deviation
- Covariance
- Correlation coefficient

Expected Value

Expected value (population mean value):
$$E[X] = \sum_x x\,f(x) = \int_{-\infty}^{\infty} x\,f(x)\,dx$$
(the sum for discrete RVs, the integral for continuous RVs).

Properties of the expected value:
- The expected value of a constant is the constant itself:
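The die example of conditional probability can be computed mechanically: condition by intersecting events and dividing probabilities. A minimal sketch (the helper `p` and the event names are mine):

```python
from fractions import Fraction

# Equally likely outcomes of one fair die.
outcomes = set(range(1, 7))
prime = {2, 3, 5}   # Y: the outcome is prime
even = {2, 4, 6}    # X: the outcome is even

def p(event):
    # Probability of an event under the uniform distribution on outcomes.
    return Fraction(len(event & outcomes), len(outcomes))

# P(X | Y) = P(X, Y) / P(Y)
p_even_given_prime = p(even & prime) / p(prime)
```

Only 2 is both even and prime, so the conditional probability is $(1/6)/(1/2) = 1/3$, matching the slide.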
$$E[b] = b$$
- If $a$ and $b$ are constants, then $E[aX + b] = a\,E[X] + b$.
- If $X$ and $Y$ are independent RVs, then $E[XY] = E[X]\cdot E[Y]$.
- If $X$ is an RV with PDF $f(x)$ and $g(X)$ is any function of $X$, then
$$E[g(X)] = \sum_x g(x)\,f(x) \ \ \text{(discrete)}, \qquad E[g(X)] = \int_{-\infty}^{\infty} g(x)\,f(x)\,dx \ \ \text{(continuous)}$$

Variance & Standard Deviation

Let $X$ be an RV with $E(X) = \mu$. The distribution, or spread, of the $X$ values around the expected value can be measured by the variance ($\sigma_X$ is the standard deviation of $X$):
$$\mathrm{var}(X) = \sigma_X^2 = E\!\left[(X - \mu)^2\right] = \sum_x (x - \mu)^2 f(x) = E(X^2) - \mu^2 = E(X^2) - [E(X)]^2$$

The variance properties:
- The variance of a constant is zero.
- If $a$ and $b$ are constants, then $\mathrm{var}(aX + b) = a^2\,\mathrm{var}(X)$.
- If $X$ and $Y$ are independent RVs, then $\mathrm{var}(X + Y) = \mathrm{var}(X) + \mathrm{var}(Y)$ and $\mathrm{var}(X - Y) = \mathrm{var}(X) + \mathrm{var}(Y)$.

Covariance

The covariance of two RVs, $X$ and $Y$, is a measurement of the nature of the association between the two:
$$\mathrm{Cov}(X, Y) = E[(X - \mu_x)(Y - \mu_y)] = E(XY) - \mu_x\mu_y$$

Properties of covariance:
- If $X$ and $Y$ are independent RVs, then $\mathrm{Cov}(X, Y) = E(XY) - \mu_x\mu_y = E(X)E(Y) - \mu_x\mu_y = 0$.
- If $a$, $b$, $c$, and $d$ are constants, then $\mathrm{Cov}(a + bX,\ c + dY) = bd\,\mathrm{Cov}(X, Y)$.
- If $X$ is an RV, then $\mathrm{Cov}(X, X) = E(X^2) - \mu_x^2 = \mathrm{Var}(X)$.

Covariance value:
- $\mathrm{Cov}(X, Y)$ large and positive: a strong positive relationship between the two.
- $\mathrm{Cov}(X, Y)$ large and negative: a strong negative relationship between the two.
- $\mathrm{Cov}(X, Y) = 0$: no (linear) relationship between the two.
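The expectation and variance identities above can be verified numerically on a fair die. This is a sketch with names of my own (`E`, `var`, `shifted`); it checks $\mathrm{Var}(X) = E(X^2) - [E(X)]^2$ and $\mathrm{var}(aX + b) = a^2\,\mathrm{var}(X)$.

```python
# Fair die: outcomes 1..6, each with probability 1/6.
xs = [1, 2, 3, 4, 5, 6]
p = 1 / 6

def E(vals):
    # Expected value of a discrete RV taking values `vals` uniformly.
    return sum(v * p for v in vals)

mean = E(xs)                                 # E[X] = 3.5
var = E([x * x for x in xs]) - mean ** 2     # Var(X) = E[X^2] - E[X]^2 = 35/12

# Affine transform aX + b: variance scales by a^2, the shift b drops out.
a, b = 2, 5
shifted = [a * x + b for x in xs]
var_shifted = E([v * v for v in shifted]) - E(shifted) ** 2
```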
Variance of Correlated Variables

$$\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\,\mathrm{Cov}(X, Y)$$
$$\mathrm{Var}(X - Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) - 2\,\mathrm{Cov}(X, Y)$$
$$\mathrm{Var}(X + Y + Z) = \mathrm{Var}(X) + \mathrm{Var}(Y) + \mathrm{Var}(Z) + 2\,\mathrm{Cov}(X, Y) + 2\,\mathrm{Cov}(X, Z) + 2\,\mathrm{Cov}(Y, Z)$$

Covariance Matrices

If $\mathbf{x}$ is an $n$-dimensional RV, the covariance matrix is defined as
$$\Sigma = E\!\left[(\mathbf{x} - \boldsymbol{\mu}_x)(\mathbf{x} - \boldsymbol{\mu}_x)^t\right]$$
whose $ij$-th element $\sigma_{ij}$ is the covariance of $x_i$ and $x_j$:
$$\sigma_{ij} = \mathrm{Cov}(x_i, x_j) = E[(x_i - \mu_i)(x_j - \mu_j)], \qquad i, j = 1, \dots, d$$
so that
$$\Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1d} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{d1} & \sigma_{d2} & \cdots & \sigma_{dd} \end{pmatrix}$$

Covariance Matrices (Cont'd)

Properties of the covariance matrix:
- If the variables are statistically independent, the covariances are zero, and the covariance matrix is diagonal:
$$\Sigma = \mathrm{diag}(\sigma_1^2, \dots, \sigma_d^2), \qquad \Sigma^{-1} = \mathrm{diag}(1/\sigma_1^2, \dots, 1/\sigma_d^2)$$
- Noting that the determinant of $\Sigma$ is then just the product of the variances, we can write
$$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}}\, e^{-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^t \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})}$$
This is the general form of a multivariate normal density function, where the covariance matrix is no longer required to be diagonal.

Correlation Coefficient

Correlation: knowing about a random variable "$X$", how much information will we gain about the other random variable "$Y$"?
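A sample covariance matrix can be assembled elementwise from the definition $\sigma_{ij} = \mathrm{Cov}(x_i, x_j)$. A minimal 2-D sketch (the toy data and names `cov`, `S` are mine); since $y = 2x$ exactly, the matrix should come out symmetric with $\sigma_{yy} = 4\sigma_{xx}$.

```python
# Toy 2-D data with y exactly equal to 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

def cov(u, mu, v, mv):
    # Sample covariance of paired observations u, v (divisor n - 1).
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (n - 1)

# 2x2 covariance matrix: S[i][j] = Cov(variable_i, variable_j).
S = [[cov(xs, mx, xs, mx), cov(xs, mx, ys, my)],
     [cov(ys, my, xs, mx), cov(ys, my, ys, my)]]
```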
The population correlation coefficient is defined as
$$\rho = \frac{\mathrm{cov}(X, Y)}{\sigma_x \sigma_y}$$
The correlation coefficient is a measure of linear association between two variables and lies between $-1$ and $+1$:
- $-1$ indicates perfect negative association.
- $+1$ indicates perfect positive association.

Moments

The $n$th-order moment of an RV $X$ is the expected value of $X^n$:
$$M_n = E(X^n)$$
Normalized form (central moment):
$$M_n = E\!\left[(X - \mu_X)^n\right]$$
The mean is the first moment; the second moment equals the variance plus the square of the mean.

Moments (Cont'd)

Third central moment: a measure of asymmetry, often referred to as skewness:
$$s = \int_{-\infty}^{\infty} (x - \mu)^3 f(x)\,dx$$
If the density is symmetric, $s = 0$.

Fourth central moment: a measure of flatness, often referred to as kurtosis:
$$k = \int_{-\infty}^{\infty} (x - \mu)^4 f(x)\,dx$$
[Figure: densities with small kurtosis vs. large kurtosis.]

Sample Measures

Sample: a random selection of items from a lot or population in order to evaluate the characteristics of the lot or population.
- Sample mean: $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} x_i$
- Sample variance: $S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$
- Sample covariance: $\mathrm{Cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$
- Sample correlation: $\mathrm{Corr} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})/(n - 1)}{S_x S_y}$
- 3rd central moment: $\frac{\sum_{i=1}^{n} (X_i - \bar{X})^3}{n - 1}$
- 4th central moment: $\frac{\sum_{i=1}^{n} (X_i - \bar{X})^4}{n - 1}$

More on Gaussian Distribution

The Gaussian distribution function:
- Sometimes called "normal" or "bell shaped".
- Perhaps the most used distribution in all of science.
- Fully defined by 2 parameters.
- 95% of its area is within $2\sigma$.
- Normal distributions range from minus infinity to plus infinity.
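The sample formulas above translate directly into code. A sketch with my own toy data and names (`corr`, `skew`): a perfectly linear data set should give a sample correlation of exactly $+1$, and a sample that is symmetric about its mean should have a zero third central moment.

```python
import math

# Perfectly linear data: y = 2x + 1, so the correlation should be +1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0 * x + 1.0 for x in xs]
n = len(xs)

mx, my = sum(xs) / n, sum(ys) / n
sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))
sy = math.sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))

# Sample correlation: sample covariance divided by Sx * Sy.
corr = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / ((n - 1) * sx * sy)

# Third central moment: xs is symmetric about 3, so no skew.
skew = sum((x - mx) ** 3 for x in xs) / (n - 1)
```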
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} = N(\mu, \sigma)$$
[Figure: a Gaussian with $\mu = 0$ and $\sigma = 1$; the shaded area under the curve gives the probability.]

More on Gaussian Distribution (Cont'd)

[Figures: normal distributions with the same variance but different means, and normal distributions with the same means but different variances.]

Standard Normal Distribution

A normal distribution whose mean is zero and standard deviation is one is called the standard normal distribution:
$$f(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}x^2}$$
Any normal distribution can be converted to a standard normal distribution with simple algebra; this makes calculations much easier:
$$Z = \frac{X - \mu}{\sigma}$$

Central Limit Theorem

Why is the Gaussian PDF so applicable? The Central Limit Theorem.
Illustrating the CLT:
a) 5000 random numbers
b) 5000 pairs ($r_1 + r_2$) of random numbers
c) 5000 triplets ($r_1 + r_2 + r_3$) of random numbers
d) 5000 12-plets ($r_1 + r_2 + \dots + r_{12}$) of random numbers

The Central Limit Theorem says that we can use the standard normal approximation of the sampling distribution regardless of our data: we don't need to know the underlying probability distribution of the data to make use of sample statistics. This only holds in samples which are large enough to use the CLT's "large sample properties." So, how large is large enough? Some books will tell you 30 is large enough.
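The Z-score conversion and the "95% within $2\sigma$" claim can be checked with the standard normal CDF, which is expressible via the error function as $\Phi(z) = \tfrac{1}{2}\left(1 + \mathrm{erf}(z/\sqrt{2})\right)$. A sketch (function names `phi` and `z_score` are my own):

```python
import math

# Standard normal CDF via the error function.
def phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Z-score: map a value from N(mu, sigma) onto the standard normal.
def z_score(x, mu, sigma):
    return (x - mu) / sigma

# "95% of the area is within 2 sigma": P(-2 <= Z <= 2) ~ 0.954.
within_2sigma = phi(2.0) - phi(-2.0)
```

The exact figure is about 0.9545, which is what the slide's rounded "95%" refers to.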
Note: the CLT does not say "in large samples the data is distributed normally."

Two Normal-like Distributions

Student's t distribution:
$$f(t) = \frac{\Gamma\!\left((v+1)/2\right)}{\sqrt{v\pi}\,\Gamma(v/2)} \left(1 + \frac{t^2}{v}\right)^{-(v+1)/2}$$

Chi-squared distribution:
$$f(\chi^2) = \frac{1}{\Gamma(v/2)\,2^{v/2}} \left(\chi^2\right)^{(v/2)-1} e^{-\chi^2/2}$$

Distances

Each clustering problem is based on some kind of "distance" between points.
- A Euclidean space has some number of real-valued dimensions and "dense" points, and there is a notion of the "average" of two points. A Euclidean distance is based on the locations of points in such a space.
- A non-Euclidean distance is based on properties of points, but not their "location" in a space.

Distance Matrices

Once a distance measure is defined, we can calculate the distance between objects. These objects could be individual observations, groups of observations (samples), or populations of observations. For $N$ objects, we then have a symmetric distance matrix $D$ whose elements are the distances between objects $i$ and $j$.

Axioms of a Distance Measure

$d$ is a distance measure if it is a function from pairs of points to the reals such that:
1. $d(x, y) \ge 0$
2. $d(x, y) = 0$ iff $x = y$
3. $d(x, y) = d(y, x)$
4. $d(x, y) \le d(x, z) + d(z, y)$ (the triangle inequality)
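The 12-plet illustration of the CLT is easy to reproduce: sums of 12 independent $U(0,1)$ draws have mean $12 \times 0.5 = 6$ and variance $12 \times 1/12 = 1$, so the empirical histogram approaches $N(6, 1)$. A simulation sketch (the seed and variable names are my own):

```python
import random
import statistics

# CLT demo: 5000 sums of 12 uniform(0, 1) draws, approximately N(6, 1).
random.seed(0)
sums = [sum(random.random() for _ in range(12)) for _ in range(5000)]

m = statistics.mean(sums)   # should be close to 6
s = statistics.stdev(sums)  # should be close to 1
```

Note that this demonstrates the distribution of *sums* approaching normality, not that the underlying uniform data becomes normal, which is exactly the distinction the note above makes.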
For the purpose of clustering, the distance (similarity) is sometimes not required to be a metric:
- No triangle inequality
- No symmetry

Distance Measures

- Euclidean distance (L2 norm)
- L1, L∞ norms
- Mahalanobis distance

Euclidean Distance (L2 Norm)

A possible distance measure for spaces equipped with a Euclidean metric:
$$d_{ij} = \sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})^2}$$

Multivariate distances between populations: calculate the Euclidean distance between two "points" defined by the multivariate means of two populations based on $p$ variables:
$$d_{ij} = \sqrt{\sum_{k=1}^{p} (\mu_{ik} - \mu_{jk})^2}$$
This does not take into account differences among populations in within-population variability, nor correlations among variables.

Euclidean Distance (L1, L∞ Norm)

- L1 norm: the sum of the differences in each dimension; the Manhattan distance, i.e. the distance if you had to travel along coordinates only.
- L∞ norm: $d(x, y)$ = the maximum of the differences between $x$ and $y$ in any dimension.
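The three norms can be sketched in a few lines of plain Python, using the lecture's example points $x = (5, 5)$ and $y = (9, 8)$; the function names are my own.

```python
import math

def l2(x, y):
    # Euclidean (L2) distance: sqrt of the sum of squared differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def l1(x, y):
    # Manhattan (L1) distance: sum of absolute differences.
    return sum(abs(a - b) for a, b in zip(x, y))

def linf(x, y):
    # L-infinity distance: the largest absolute difference in any dimension.
    return max(abs(a - b) for a, b in zip(x, y))

x, y = (5, 5), (9, 8)
```

For these points the differences are 4 and 3, so L1 gives 7, L∞ gives 4, and L2 gives the 3-4-5 value 5.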
Example: for $x = (5, 5)$ and $y = (9, 8)$:
- L1 norm: $\mathrm{dist}(x, y) = 4 + 3 = 7$
- L∞ norm: $\mathrm{dist}(x, y) = \max(4, 3) = 4$

Mahalanobis Distance

Distances are computed based on the means, variances, and covariances for each of $g$ samples (populations) based on $p$ variables. The Mahalanobis distance "weights" the contribution of each pair of variables by the inverse of their covariance:
$$D_{ij}^2 = (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)^T S^{-1} (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)$$
where $\boldsymbol{\mu}_i$, $\boldsymbol{\mu}_j$ are the sample means and $S$ is the covariance matrix between samples.
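For the 2-D case, the Mahalanobis distance can be written out explicitly, inverting the $2 \times 2$ covariance matrix by hand. A sketch with my own names (`mahalanobis_sq`) and toy inputs; with an identity covariance it should reduce to the squared Euclidean distance, and scaling one variable's variance down-weights that dimension.

```python
def mahalanobis_sq(mu_i, mu_j, S):
    # D^2 = (mu_i - mu_j)^T S^-1 (mu_i - mu_j), for 2-D means and a 2x2 S.
    d0, d1 = mu_i[0] - mu_j[0], mu_i[1] - mu_j[1]
    # Inverse of a 2x2 matrix: adj(S) / det(S).
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    inv = [[S[1][1] / det, -S[0][1] / det],
           [-S[1][0] / det, S[0][0] / det]]
    return (d0 * (inv[0][0] * d0 + inv[0][1] * d1)
            + d1 * (inv[1][0] * d0 + inv[1][1] * d1))

# With identity covariance, Mahalanobis = squared Euclidean distance.
identity = [[1.0, 0.0], [0.0, 1.0]]
d2 = mahalanobis_sq((9.0, 8.0), (5.0, 5.0), identity)
```

With variance 4 in the first dimension, the same pair of means gives $4^2/4 + 3^2 = 13$ instead of 25, showing how high-variance directions contribute less.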