Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
PCA: Lecture 2 • Recap of PCA: what it does, how to do it • Details of PCA • presentation of results • terminology • scaling • truncation of PCs • interpretation of PCs • Rotation of PCs • Singular Value Decomposition (SVD) • Applications • Exploratory data analysis (EDA) • Data compression • PC regression / statistical prediction • etc… • Extensions to PCA and Some of its Relatives • Extended EOF (EEOF), Singular spectrum analysis (SSA), Canonical Correlation Analysis (CCA), Principal Oscillation Patterns (POP), Independent Component Analysis (ISA) PCA Recap • PCA reduces a correlated dataset to a dataset containing fewer new variables by axis rotation • The new variables are linear combinations of the original ones and are uncorrelated • The PCs are the new variables (or axes) which summarize several of the original variables Step 1: Organize the data (what are the variables, what are the samples?) Step 2: Calculate the covariance matrix (how do the variables co-vary?) Step 3: Calculate the eigenvectors and eigenvalues of the covariance matrix Step 4: Calculate the PCs (project the data onto the eigenvectors) Step 5: Choose a subset of PCs Step 6: Interpretation, data reconstruction, Presentation of PCA Results PCA results in; • a set of eigenvectors which are ordered according to the amount of variance of the original dataset that they explain • the dataset projected onto the PCs (for time series data these are the PC time series) • normally we are only concerned with the top few PCs PCA Terminology PCA terminology can be confusing – varied background and use. • PCs also known Empirical Orthogonal Functions (EOFs) • Spatial patterns: principal components, principal component loadings, EOFs • Time series: EOF time series, expansion coefficient time series, principal component time series (or even principal components!) Eigenvectors, em Eigenvector elements, ek,m Principal components um Principal component elements ui,m EOFs Loadings Empirical orthogonal variables Scores Modes of variation Coefficients Amplitudes Pattern vectors Pattern Coefficients Expansion coefficients Principal axes Empirical orthogonal weights Coefficients Principal vectors Proper functions Principal directions Scaling and Normalization of PCs • The eigenvectors are usually scaled in some way but there are many scaling conventions – confusing! • Usually, the eigenvectors are scaled to unit length ||em|| = 1, but remember the eigenvectors can be any length to satisfy the eigen decomposition equation (they just need to point in the right direction). • Sometimes it is useful to use an alternative scaling though Note: • Note that if we scale the eigenvectors em by a constant, then the magnitudes of the PCs, um also get scaled by the same factor as um = emT x • When using anomalies x’, E[x’] = 0, E[uscaled] = 0 • But variances will change by c2, where c is the scaling constant In geosciences: • Usually the PCs are represented as dimensionless maps, often normalized with max = 1 or 100 • Another way to represent a PC (EOF) is by calculating the correlation map between the expansion coefficients associated with the eigenvector, and the original data. Scaling and Normalization of PCs Some common scalings and their impacts: Eigenvector scaling E[um] Var[um] Corr[um,xk] Corr[um,zk] ||em|| = 1 0 λm ek,m(λm)1/2 / sk ek,m(λ)1/2 ||em|| = (λ)1/2 0 λm2 ek,m/sk ek,m ||em|| = (λ)-1/2 0 1 ek,m λm/sk ek,m λm Eigenvector elements are more interpretable in terms of the relationship between the PCs and the original data. Each eigenvector element ek,m = correlation ru,z between the mth PC um and the kth standardized variable zk. All PCs have equal unit variance, which can be useful in the detection of outliers Things to look out for PCA knows nothing about the spatial distribution of data • if the geographic distribution of data is not uniform (e.g. irregularly spaced stations or lat-lon grids) then data dense regions may be over-represented and data-sparse regions will be under-represented • For lat-lon data, high latitude data will be over represented • Some fixes: • for lat-lon data, multiply the data by sqrt(cos(lat)) • for irregularly spaced data (stations, lat-lon) interpolate to equal area grid • Data fields at different resolution or different domain • additionally need to rescale to equalize the sum of variances in each field Domain Size Effects and Buell Patterns • if the scale of the variability in the data is larger than the size of the domain being analyzed • this can lead to spurious PC patterns that are a result of the eigen decomposition and not the data! Truncating PCs: • Remember: many datasets are correlated (spatially, across variables, …) which implies that there is redundant information in the dataset. • Therefore it is possible to capture most of the variance by considering only the most important PCs • i.e. using M < K of the PCs um. If M << K then we get data compression. Mathematically: • The original PC analysis formula is u = ET x • If we truncate, then x(Kx1) = E(KxM) u(Mx1) where u is truncated to only the first M PCs and E is truncated to only the first M eigenvectors (columns) • Remember this is an approximation. Question: • what is the balance between data compression and data loss? There is no universal answer, but there are some selection rules that may help. Truncation Rules (subjective) 1) Retain enough PCs to represent a sufficient fraction of the total variance: M R m 1 2 m R 2 crit where R 2 m m K k 1 100% k But what is Rcrit? 70% < Rcrit <= 90%? 2) Look at the shape of the eigenvalue-PC number graph (scree diagram or eigenvalue spectra) • no guarantee that the data will show a nice separation Truncation point, M PC Number Truncation Rules (subjective) 3) Look at the size of the last retained component (how small can it be?): T retain λ if m K K s k 1 k ,k Where sk,k is the sample variance of the kth element of x, T is a threshold parameter. T = 1 is called Kaiser’s rule which compares the eigenvalues to the amount of joint variance reflected in the average eigenvalue. T = 0.7 is recommended by some. 4) Broken stick model: based on the expected length of the mth longest piece of a randomly broken unit length segment. K 1 T (m) K 1 j m j Here T = T(m) and the truncation is made at the smallest m for which (1) is not satisfied, according to the threshold in (2). Truncation Rules (objective) • Rule N: based on the assumption that the unwanted data are just random noise and that these can be identified by comparing to the distribution of eigenvalues from random data. • repeatedly generate sets of vectors of independent Gaussian random numbers with the same dimensions as the original data • compute the eigenvalues of the covariance matrix • scale the eigenvalues to match the original data eigenvalues, e.g. sum(random) = sum(data) • compare each of the data eigenvalues with the emipircal distribution of curves generated from random data • If the eigenvalue is larger than 95% of these, then retain this component 95th percentile 5th percentile PC Number Problems: • temporal correlation in the data • non-Gaussian data (bootstrap instead) • the tests for each eigenvalue is correlated with the tests on the other eigenvalues Interpretation of PCs • By construction PCs constitute directions of variability with no particular amplitude • Therefore if e is a PC, so is αe for any α nonzero scalar • For convenience, however, they are chosen to be unitary • Also by construction PCs are stationary structures, i.e. they do not evolve in time (or across samples) • The PC series attached to the corresponding PC (projection of the data onto the PC) provides the sign and the overall amplitude of the PC as a function of time • This provides a simplified representation of the state of the field at that time along that PC. • NOTE: when PCs are nondegenerate (eigenvalues are unique) they can be studied individually. When they are degenerate, however, the separation between them becomes problematic despite being orthogonal and their PCs uncorrelated. Physical Interpretation of PCs • Although PCs represent directions (or patterns) that (successively) explain most of the observed variability, their interpretation is, however, not always simple. • Remember they are just mathematical constructs chosen to represent the variance as efficiently as possible and to be orthogonal to each other • Physical interpretability especially can be controversial because physical modes are not necessarily orthogonal • The constraints imposed upon PCs are purely geometric and hence can be non-physical • Furthermore, the PC structure tends to be domain dependent and this adds to the difficulty of their physical interpretability. Some suggestions to see whether the PCs have some physical meaning: 1. Is the variance explained by the PC more than what you would expect if the data had no structure (is it white noise or red noise)? 2. Do you have an a priori reason for expecting the structures that you find? 3. How robust are the structures to the choice of domain? If you change the domain (geographic, number of parameters) do the structures change significantly? 4. How robust are the structures to the sample used? • One way around this is rotated PCs. Physical Interpretation of PCs Rotated PCs (or REOF) • Rotated PCs is a technique simply based on rotating PCs that attempts to overcome some of the shortcomings of PCA such as physical interpretability • If you look at the spatial patterns of PCs there is a temptation to ascribe some physical meaning to them – BUT this is not always a good interpretation because the orthogonality constraint on the eigenvectors can mean that the 2nd and 3rd PCs bear no resemblence to the physical mechanisms that drive the data. • The 1st PC represents the most important mode of variability or physical process but it may include aspects of other correlated modes and processes. • Rotated PCs helps with the physical interpretation by rotating the PCs to another coordinate set, usually on the first M PCs. • The objective is to alleviate the orthogonality constraint and obtain simple structures that can be interpreted physically. • A simple structure is found when a large number of the elements of the rotated vectors are near zero and a few of the remaining elements correspond to elements that are not zero in the other rotated vectors. • The new variables are called rotated PCs Rotated PCs: a simple example x2 e1 x1 x2 e1 + + e2 - + x1 x2 e1 + + e2 - + x1 x2 e1 + + e2 - + Unrotated eigenvectors e2 x1 x2 e2 Orthogonally rotated eigenvectors e1 x1 e2 x2 e1 Obliquely rotated eigenvectors x1 Rotated PCs: how to do it Rotated eigenvectors are a linear transformation of a subset of M of the original K eigenvectors: ~ E E ( K M ) T ( K M ) ( M M ) Where T is the rotation matrix. If T is orthogonal then it is an orthogonal rotation, otherwise it is an oblique rotation. Many different ways of defining T but the most popular is VARIMAX rotation where the elements of T are chosen to maximize: K *4 1 K *2 ek .m ek .m K k 1 m 1 k 1 M 2 In other words we are trying to maximize the sum of variances of the squared rotated eigenvector elements which tends to move them towards their max or min values (0 or ±1) a simple structure ek*,m e~k ,m ~2 ek ,m m 1 M 2 are the scaled versions of the rotated eigenvector elements Rotated PCs Continued Summary: • Varimax rotation attempts to simplify the structure of the patterns by tending the loadings towards zero, or ±1. • Rotated PCs therefore yield localised or simple structures BUT there are drawbacks: • how do we choose the criterion? • how many PCs should be retained? • variance is spread among rotated components and eigenvectors loss their orthogonality. • this yields patterns that may not constitute the main source of variation Singular Value Decomposition (SVD) What if very large dimensional data? • e.g., datasets with many dimensions (D ≥ 104) Problem: • Covariance matrix is size (D x D) • D = 104 then |cov| = 108 Singular Value Decomposition to the rescue! • pretty efficient algorithms available, including Matlab SVD • some implementations find just top N eigenvectors Singular Value Decomposition (SVD) • Singular value decomposition is a general decomposition method that decomposes and n x m matrix X into the form X = U S VT U is an n x n orthonormal matrix V is an m x m orthonormal matrix S is a diagonal n x m matrix with p elements down the diagonal Terminology: • The diagonal elements of S are called the singular values of the matrix. • The columns of the matrices U, V contain the (left and right) singular vectors of X. Note that if we have p singular values, and p is < n and m then some of the singular vectors are redundant, and X = Up Sp VpT Singular Value Decomposition (SVD) Relationship between PCA and SVD: • If the matrix X is square and invertible then U = V and S is the diagonal matrix containing the eigenvalues • Now, for X with means removed, R = cov(X) = XTX • Normal PCA gives R = E L ET • But if we first do SVD on X then form R we get: R = XTX = (USVT)T(USVT) = VSTUTUSVT = VSTSVT • So E = V, and L = STS • The relationship between the eigenvalues λi of R and the singular values si of X is λi = si2 • The column vectors of V contain the eigenvectors for XTX and the column vectors of the matrix U contain the eigenvectors for XXT Singular Value Decomposition (SVD) X= U S VT mxn mxn nxn nxn • Data X, one row per data point • U contains the left singular vectors which are the eigenvectors for XXT • S is diagonal and sk2 is the kth largest eigenvalue •The rows of VT (right singular vectors) are unit length eigenvectors of XTX Singular Value Decomposition (SVD) In MATLAB: • Data matrix X (each row is a map, each column a time series or list of samples) • remove the mean and detrend F = detrend(X,0) • Find the eigenvectors and singular values [U, S, V] = svd(F) • The eigenvectors are in the matrix V and the squared diagonal values of S are the eigenvalues of the covariance matrix R = FTF • To check this, compare the result with R = FTF, and [V,L] = eig(R) • Calculate the PC series: PCi = F * V(:,i) Applications of PCA Application: Exploratory Data Analysis Exploratory data analysis (biplots and basis vectors): • graphing high dimensional data is difficult • instead graph the data projected onto the first 2 PCs (scatter plot or biplot) • may reveal clustering, coherency in time series • can also show all variables simultaneously by projecting basis vectors onto the two leading PCs: e1Tbk and e2Tbk where b1=[1,0,0,…,0], b2=[0,1,0,…,0], … p1 p2 e2 t1 e1 t2 Effects of Flooding and Drought on Water Quality in Gulf Coastal Plain Streams in Georgia Application: Data Reduction/Prefiltering Spatial correlation maps based on PCA (EOFs) Figure: A comparison between correlation maps between and , based on EOFs (a) and conventional calculations (b). The confidence limit for panel b was estimated using the an MC-test on the point of maximum correlation. Note, that the confidence limits are generally higher for the conventional method, whereas only the lowest correlations in panel a are insignificant Application: Data Reduction/Prefiltering Data Reduction • compute the EOFs and retain a few of the first leading EOFs (neofs << nt), • we can then compress the size of the data from n x x ny x nt to nx x ny x neofs + (nt + 1) * neofs • with a minimal loss of information (filters away much of the small-scale noise). • If we have a data record, such as before, stored as 100 time slices on a 50 x 30 grid and we retain the 10 leading EOFs, then • the data size can be reduced from 150,000 numbers to just 16,010 numbers and • still account for about 90% of the variance in the data Computational Savings Also, the EOFs can save computation time since there are only neofs independent numbers. • correlation analysis can be applied to the neofs PCs, weighted by the EOF patterns and their variance, instead of the time series from nx x ny points. • calculation of confidence intervals, including a geographical distribution of limits • MC testing with EOFs • Maps of linear trends can be calculated simply • Regression analysis Application: PC Regression/Statistical Prediction BUT A prediction model could be a regression of sea-surface temperatures (SSTs) onto rainfall: • With over 16,000 grid boxes of SSTs, there is a very good chance of finding a strong correlation only by chance. • With SSTs strongly correlated with temperatures in many different parts of the globe, there is a high chance of large errors in the regression parameters. Application: PC Regression/Statistical Prediction Problems with multiple regression: A set of predictor variables with strong mutual correlations can give unstable regression parameters Multiplicity - the problem when the results of multiple independent significant tests are jointly evaluated OR high likelihood of getting a strong correlation just by chance Multicolinearity - Predictors may be strongly correlated. Nino3.4Mar 0 0.761 Nino3.4Feb Nino3.4Mar 0 0.628 Nino3.4Jan Nino3.4Mar 0 1.216 Nino3.4Feb 0.395 Nino3.4 Jan Application: PC Regression/Statistical Prediction Using PCs in multiple regression: • first transform the predictors to their principal components (correlations are zero) • uncorrelated predictors can be added or removed during testing without affecting the contributions of other component predictors • when forcasting: predictors are the PC amplitudes y(t) = a1u1(t) + a2u2(t) + … + amum(t)