NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA
Lecture 2. Exploratory Data Analysis

EXPLORATORY DATA ANALYSIS
Types of variables
Simple diagrams
Summary statistics
  (i) Location
  (ii) Dispersion
  (iii) Skewness and kurtosis
Transformations
Density estimation
Graphical display
  (i) Univariate data
  (ii) Bivariate and multivariate data
Outliers
Leverage and influence
Software

TYPES OF VARIABLES
1) discrete, e.g. counts
2) continuous, e.g. pH, elevation
Both are random variables or variates, with random variation.

TABULAR PRESENTATION
Raw data
Frequency tables

Value   Range       Frequency   Cumulative frequency   % CF
0       0 - 0.99    3           3                      2
1       1 - 1.99    8           11                     6
2       2 - 2.99    3           14                     11
...     ...         ...         ...                    ...

SIMPLE DIAGRAMS
Dot diagram
Line diagram or profile
Histogram (n/10 bins)
Frequency graph or cumulative frequency graph
[Figure: example diagrams, with panels labelled "discrete or continuous variables", "continuous variables" and "discrete variables"]

HISTOGRAM BIN WIDTH
Wand (1997) Amer. Statistician 51, 59-64
[Figure: Histograms of the British Incomes Data based on (a) the bin width $\hat{h}_2$, (b) the bin width $\hat{h}_0$, and (c) the S-PLUS default bin width]

Optimal solution:

$$\hat{h}_2 = \left[\frac{6}{-\hat{\psi}_2(g_{21})\, n}\right]^{1/3}$$

where $g_{21}$ is a band-width parameter and $\hat{\psi}_2$ is a "normal scale" estimator. The solution for $\hat{\psi}_2$ and $g_{21}$ is iterative, to optimise a function: the MEAN INTEGRATED SQUARED ERROR.

Normal-scale bin width:

$$\hat{h}_0 = 3.49\, \hat{\sigma}\, n^{-1/3}$$

where $\hat{\sigma}$ is the standard deviation.

S-PLUS default bin width:

$$\hat{h} = \frac{\text{range of data}}{1 + \log_2 n}$$

where n = sample size.

Histogram Bin Width
In R, a good option for histogram bin width is given by the Freedman-Diaconis rule, which gives the number of bins as:

$$\left\lceil \frac{n^{1/3}(\max - \min)}{2\,(Q_3 - Q_1)} \right\rceil$$

where n is the number of observations, max - min is the range of the data, and Q3 - Q1 is the inter-quartile range. The brackets represent the ceiling, which means that you round up to the next integer, thereby avoiding 4.2 bins!
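The three bin-width rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, quartiles are computed with the "inclusive" interpolation method of the standard library (other quartile definitions give slightly different results), and the standard deviation uses the n - 1 divisor.

```python
import math
import statistics


def freedman_diaconis_bins(values):
    """Number of bins: ceil(n^(1/3) * (max - min) / (2 * (Q3 - Q1)))."""
    n = len(values)
    # Assumption: "inclusive" quartiles; the slides do not fix a definition.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    data_range = max(values) - min(values)
    return math.ceil(n ** (1 / 3) * data_range / (2 * (q3 - q1)))


def normal_scale_width(values):
    """Bin width 3.49 * sd * n^(-1/3) (the h0 rule above)."""
    return 3.49 * statistics.stdev(values) * len(values) ** (-1 / 3)


def log2_rule_width(values):
    """Bin width range / (1 + log2 n) (the default rule above)."""
    return (max(values) - min(values)) / (1 + math.log2(len(values)))


if __name__ == "__main__":
    example = list(range(1, 101))  # made-up data: the integers 1..100
    print(freedman_diaconis_bins(example))
    print(normal_scale_width(example))
    print(log2_rule_width(example))
```

The first function returns a number of bins, whereas the other two return a bin width; dividing the data range by a width and rounding up gives a comparable bin count.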
SUMMARY STATISTICS

(A) Measures of location: 'typical value'

(1) Arithmetic mean

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

(2) Weighted mean

$$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$

(3) Mode: 'most frequent' value
(4) Median: 'middle value'. Robust statistic.
(5) Trimmed mean: 1 or 2 extreme observations at both tails deleted
(6) Geometric mean

$$\log GM = \frac{1}{n}\sum_{i=1}^{n} \log x_i \qquad GM = \sqrt[n]{x_1 x_2 x_3 \cdots x_n} = \operatorname{antilog}\left(\frac{1}{n}\sum_{i=1}^{n} \log x_i\right)$$

(B) Measures of dispersion

(1) Range
A: 13.99 14.15 14.28 13.93 14.30 14.13   range = 0.37
B: 14.12 14.10 14.15 14.11 14.17 14.17   range = 0.07
B has smaller scatter than A: 'better precision'.
Precision: random error, scatter (replicates)
Accuracy: systematic bias

(2) Interquartile range ('percentiles'): the quartiles Q1, Q2 and Q3 divide the ordered data into four groups, each holding 25% of the observations.

(3) Mean absolute deviation (mean absolute difference)

$$\frac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}|$$

Ignore negative signs. Example with n = 4:

x    |x - mean|
2    2
1    3
5    1
8    4
mean = 4, sum of absolute deviations = 10, 10/n = 2.5

(4) Variance and standard deviation

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 \qquad SD = s = \sqrt{s^2}$$

Variance = mean of squares of deviations from the mean. SD = root mean square value.

(5) Coefficient of variation

$$CV = \frac{s}{\bar{x}} \times 100$$

Relative standard deviation; percentage relative SD (independent of units).

(6) Standard error of mean

$$SEM = \sqrt{\frac{s^2}{n}}$$

(C) Measures of skewness and kurtosis

Skewness: measure of how one tail of the curve is drawn out
Kurtosis: measure of peakedness of the curve
g1 = skewness measure, g2 = kurtosis measure ("moment statistics")

$$\text{Central moment} = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^r$$

r = 1: deviation from mean = 0
r = 2: variance
r = 3: skewness, $g_1 = \frac{1}{n s^3}\sum (x_i - \bar{x})^3$ [third central moment divided by sd^3]
r = 4: kurtosis, $g_2 = \frac{1}{n s^4}\sum (x_i - \bar{x})^4 - 3$

Skewness and kurtosis
negative g1: skewness to left
positive g1: skewness to right
negative g2: platykurtosis (flatter peak, lighter tails)
positive g2: leptokurtosis (taller peak, heavier tails)

DATA TRANSFORMATIONS
(1) Comparability
(2) Better fit to model

[Figure: normal distribution, frequency against x, marked with the mean and sd]

Data centring: deviations from mean

$$x_i^* = x_i - \bar{x}$$

Data standardisation: zero mean, unit variance

$$x_i^* = \frac{x_i - \bar{x}}{sd}$$

or scaling by the range

$$x_i^* = \frac{x_i}{\text{range}}$$

Normal distribution: mean ± 1 sd covers about 68% of values, mean ± 2 sd about 95% of values.
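The location, dispersion and moment statistics defined above, together with standardisation, can be sketched in Python as follows. This is a minimal illustration using only the standard library; the function names are hypothetical, and s is the standard deviation with the n - 1 divisor, as in the variance formula above. The replicate sets A and B are the ones listed under "Range".

```python
import math
import statistics


def summary(values):
    """Summary statistics following the formulas in the text."""
    n = len(values)
    mean = statistics.fmean(values)
    s = statistics.stdev(values)  # sample SD, divisor n - 1
    deviations = [x - mean for x in values]
    return {
        "mean": mean,
        "median": statistics.median(values),
        "geometric_mean": statistics.geometric_mean(values),  # needs x > 0
        "range": max(values) - min(values),
        "mean_abs_dev": sum(abs(d) for d in deviations) / n,
        "variance": s ** 2,
        "sd": s,
        "cv_percent": 100 * s / mean,
        "sem": s / math.sqrt(n),
        "g1": sum(d ** 3 for d in deviations) / (n * s ** 3),
        "g2": sum(d ** 4 for d in deviations) / (n * s ** 4) - 3,
    }


def standardise(values):
    """Centre on the mean and divide by the SD (zero mean, unit variance)."""
    mean = statistics.fmean(values)
    s = statistics.stdev(values)
    return [(x - mean) / s for x in values]


if __name__ == "__main__":
    a = [13.99, 14.15, 14.28, 13.93, 14.30, 14.13]
    b = [14.12, 14.10, 14.15, 14.11, 14.17, 14.17]
    for name, data in (("A", a), ("B", b)):
        stats = summary(data)
        print(name, round(stats["range"], 2), round(stats["sd"], 3))
```

Comparing the printed range and SD for A and B shows the same ordering of scatter by both measures.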
Often find distributions skewed to the right (positive g1): log-normal distribution.

LOG-NORMAL DISTRIBUTION PROPERTIES
Geometric mean = median of the log-normal distribution
Antilog of the mean of the log values = geometric mean
SD of the log values ≈ CV of the original values, if sd ≤ 0.5
If SD is larger:

$$CV = \sqrt{\exp(S^2) - 1}$$

where S is the SD of the (natural) log values.

How to decide whether to log transform?
(1) Look at histograms. Right skewed (positive g1): log transform.
(2) If sd > mean, or the maximum value of the variable is more than 20 times the smallest value: log x_i or log (x_i + 1).
(3) Improves normality.
(4) Gives less weight to 'dominants'. VARIANCE STABILISING.
(5) Reflects linear response of many species to log of chemical variables, i.e. log response over certain ranges.
(6) In regression need normally distributed random errors. Log transformation.

NORMAL AND LOG-NORMAL DISTRIBUTIONS

                         Normal              Log-normal
Effects                  Additive            Multiplicative
Shape                    Symmetric           Skewed
Mean                     x̄, arithmetic       x̄*, geometric
Standard deviation       s, additive         s*, multiplicative
Measure of dispersion    cv = s/x̄            s*
Confidence interval
  68.3%                  x̄ ± s               x̄* ×/ s*
  95.5%                  x̄ ± 2s              x̄* ×/ (s*)^2
  99.7%                  x̄ ± 3s              x̄* ×/ (s*)^3

×/ = times / divide (cf. ± plus / minus); cv = coefficient of variation

METHODS FOR DESCRIBING LOG-NORMAL DISTRIBUTIONS
Graphical methods: frequency plots, histograms, box plots
Parameters: of the logarithm of x (mean, median, standard deviation, variance); skewness and kurtosis of x
Problems: What logarithm base to use? Parameters are not on the scale of the original data.
Log-normal distributions appear to be very common in the real world.
Limpert, E. et al. 2001 BioScience 51 (5), 342-352

DATA TRANSFORMATIONS
(1) Biological data
- Stabilise variances
- Dampen effects of very abundant taxa
Choices:
- No transformation
- Square root
- Log (y + 1)
- % data: square root
- Counts: log (y + 1)
(2) Environmental variables: skewed to right, log-normal distribution.
If SD > mean, or the maximum value of x is more than 20 times the smallest, use a log (x + c) transformation, where c is a constant, usually 1.
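The rule of thumb above (SD > mean, or maximum more than 20 times the minimum) and the log (x + c) transformation can be sketched as follows. The function names are hypothetical, and base-10 logarithms are an assumption made here for illustration, since the slides leave the choice of base open. The last function evaluates the log-normal CV formula and assumes S is the SD of natural-log values.

```python
import math
import statistics


def suggests_log_transform(values):
    """Rule of thumb: SD > mean, or max > 20 x min."""
    sd = statistics.stdev(values)
    mean = statistics.fmean(values)
    return sd > mean or max(values) > 20 * min(values)


def log_transform(values, c=1.0):
    """log10(x + c); c = 1 is the usual constant. Base 10 is an assumption."""
    return [math.log10(x + c) for x in values]


def cv_from_log_sd(s):
    """CV of the original values from S, the SD of natural-log values."""
    return math.sqrt(math.exp(s ** 2) - 1)


if __name__ == "__main__":
    skewed = [1, 2, 3, 50]        # made-up, strongly right-skewed values
    if suggests_log_transform(skewed):
        print(log_transform(skewed))
    print(cv_from_log_sd(0.5))    # close to 0.5, as the approximation states
```

For S = 0.5 the exact CV is only a little above 0.5, which is why the SD of the log values is a usable stand-in for the CV when S is small.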
Other transformations:
(1) square root: $\sqrt{x}$
(2) cubic root: $\sqrt[3]{x}$
(3) fourth root: $\sqrt[4]{x}$
(4) log2: $\log_2(x + 1)$
(5) logp: $\log_p(x + 1)$
(6) Box-Cox transformation: most appropriate value for the exponent λ

$$x^* = \frac{x^{\lambda} - 1}{\lambda} \quad \text{where } \lambda \neq 0 \qquad x^* = \log x \quad \text{where } \lambda = 0$$

λ = 1: no transformation
λ = 0.5: square root
λ = -1: reciprocal transformation
λ = 0: log transformation

If x = 0.0, add 0.5 or 1.0 as a constant. Can also solve for the best estimate of the constant to add. Can calculate confidence limits for λ. If these include 1, no need for a transformation!

TRANSFOR

DENSITY ESTIMATION
A useful alternative to histograms is non-parametric density estimation, which results in a smoothing of the histogram. The kernel-density estimate at the value x of a variable X is given by

$$\hat{f}(x) = \frac{1}{nb}\sum_{j=1}^{n} K\left(\frac{x - x_j}{b}\right)$$

where x_j are the n observations of X, K is a kernel function (such as the normal density), and b is a bandwidth parameter influencing the amount of smoothing. Small bandwidths produce rough density estimates, whereas large bandwidths produce smoother estimates.

Note that the histogram has been scaled to the density estimates, not the raw frequencies.

Multiple approaches:
1. Histogram with density scaling (areas of histogram bars sum to 1)
2. Density estimation (default bandwidth) (thick line)
3. Density estimation (half the default bandwidth) (thin line)
4. One-dimensional scatterplot ("rugplot") to show the distribution of observations at the bottom
Fox, 2002

QUANTILE-QUANTILE PLOTS
Quantile-quantile (Q-Q) plots are useful tools for determining whether data are normally distributed. They show the relationship between the distribution of a variable and a reference or theoretical distribution. A Q-Q plot shows the relationship between the ordered data and the corresponding quantiles of the reference (in our case, normal) distribution. If the data are normally distributed, they should plot on a straight line through the 1st and 3rd quartiles. If there is a break in slope of the plotted points, the data deviate from the reference distribution.
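The construction of a normal Q-Q plot described above can be sketched without any plotting library: pair the ordered data with standard normal quantiles, and compute the reference line through the 1st and 3rd quartiles. The function names are hypothetical, the plotting positions (i - 0.5)/n are one common convention assumed here, and the quartiles use the standard library's "inclusive" method.

```python
from statistics import NormalDist, quantiles


def normal_qq_points(values):
    """Pairs (theoretical normal quantile, ordered observation)."""
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, SD 1
    # Assumption: plotting positions (i - 0.5) / n for i = 1..n.
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


def quartile_line(values):
    """Slope and intercept of the line through the 1st and 3rd quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    z1 = NormalDist().inv_cdf(0.25)
    z3 = NormalDist().inv_cdf(0.75)
    slope = (q3 - q1) / (z3 - z1)
    intercept = q1 - slope * z1
    return slope, intercept


if __name__ == "__main__":
    data = [4.1, 5.0, 5.2, 5.9, 6.3, 6.8, 7.4, 9.9]  # made-up values
    slope, intercept = quartile_line(data)
    for z, x in normal_qq_points(data):
        # Departure of each point from the reference line
        print(round(z, 3), x, round(x - (intercept + slope * z), 3))
```

For data that follow a normal distribution, the slope of the reference line estimates the standard deviation and the intercept estimates the mean.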
Note that quantiles are divisions of a frequency or probability distribution into equal, ordered subgroups (e.g. quartiles (4 parts) or percentiles (100 parts)).

EXPLORATORY DATA ANALYSIS: GRAPHICAL DISPLAY
J.W. Tukey

Univariate data

(1) Stem-and-leaf displays, including "back-to-back" displays
[Figure: example stem-and-leaf display with stems 5 to 9 and their leaves, for values such as 55, 62, 73, 78, 79, 81]

(2) Box-and-whisker plots (box plots)
95% CI around the median: $\text{median} \pm 1.58\,(Q_3 - Q_1)/\sqrt{n}$
[Figure: box plot showing the median and quartiles]

(3) Hanging histograms

Variations of box plots: McGill et al. Amer. Stat. 32, 12-16
Useful to label extreme points. Fox, 2002

Box plots for samples of more than ten wing lengths of adult male red-winged blackbirds taken in winter at 12 localities in the southern United States, in order of generally increasing latitude. From James et al. (1984a). Box plots give the median, the range, and upper and lower quartiles of the data.

Useful to apply several approaches: EDA tools.

Bivariate and multivariate data

Simple scatter plot
[Figure: scatter plot of x2 against x1]

SCATTERPLOT MATRIX. The data are measurements of ozone, solar radiation, temperature, and wind speed on 111 days. Thus the measurements are 111 points in a four-dimensional space. The graphical method in this figure is a scatterplot matrix: all pairwise scatterplots of the variables are aligned into a matrix with shared scales.

Triangular arrangement of all pairwise scatter plots for four variables. Variables describe length and width of sepals and petals for 150 iris plants, comprising 3 species of 50 plants.

Three-dimensional perspective view for the first three variables of the iris data. Plants of the three species are coded A, B and C.

Can explore a scatter-plot by adding box-plots for each variable, adding a simple linear regression line, adding a smoother (LOWESS, see Lecture 5), and labelling particular points. Fox, 2002

Categorical variables can be encoded in a plot by using different symbols or colours for each category (e.g.
type of occupation) and smoothers fitted for each category. Fox, 2002
bc = blue collar, prof = professional, wc = white collar

Jittering scatter-plots
Discrete quantitative variables usually result in uninformative scatter-plots (e.g. education (years) and vocabulary (score on 0-10 scale)). Only 21 distinct education values and 11 scores, so only 21 x 11 = 231 plotting positions. Jittering adds a small random quantity to each value to try to separate overplotted points. Can vary the amount of jittering and also plot a smoother. Fox, 2002

Bivariate density estimation and scatter-plots
Large data-sets and weak relationships between variables. Improve the plot by jittering and making symbols smaller, and apply a bivariate kernel-density estimate plus regression line and LOWESS smoother. Fox, 2002

[Figure: scatterplot matrix with groups labelled coal-fired power station and oil-fired power station; diagonal = density estimate for each variable]

The Bagplot: A Bivariate Boxplot
Peter J. Rousseeuw, The American Statistician, November 1999, Vol. 53, No. 4, 382

Car weight and engine displacement of 60 cars.

Part (a) shows the concentrations of cholesterol and triglycerides in the plasma of 320 patients. In part (b) logarithms are taken of both variables.

Part (a) shows the altitudinal range and abundance of butterflies. In part (b) the logarithm of the abundance is plotted.

Bagplot matrix of the three-dimensional aquifer data with 85 data points.

Conditioning plots (Co-plots)
Focus on the relationship between the response and a predictor variable, holding other predictors constant at particular values, i.e. conditionally fixing the values of other predictors: 'statistical control'. Co-plots provide graphical statistical control.

Focus on a particular predictor and set each other predictor to a relatively narrow range (if quantitative) or to a specific value (if categorical). Subranges for a quantitative predictor are typically set to overlap (called "shingles") rather than to partition the data into disjoint subsets ("bins").
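The difference between overlapping "shingles" and disjoint "bins" can be sketched as follows. This is a simplified illustration, not the algorithm of any particular package: it assumes equal-width intervals in which adjacent intervals share a fixed fraction of their width, and the function names and the `overlap` parameter are hypothetical.

```python
def shingles(low, high, number, overlap=0.5):
    """Equal-width intervals covering [low, high].

    Adjacent intervals share the fraction `overlap` of their width;
    overlap = 0 gives disjoint bins.
    """
    # Width w satisfies: w + (number - 1) * w * (1 - overlap) = high - low
    width = (high - low) / (1 + (number - 1) * (1 - overlap))
    step = width * (1 - overlap)
    return [(low + i * step, low + i * step + width) for i in range(number)]


def members(values, interval):
    """Observations falling inside one interval (ends included)."""
    lo, hi = interval
    return [v for v in values if lo <= v <= hi]


if __name__ == "__main__":
    ages = [18, 22, 30, 41, 47, 55, 63, 70]  # made-up conditioning variable
    for interval in shingles(min(ages), max(ages), number=3, overlap=0.5):
        print(interval, members(ages, interval))
```

With overlap greater than zero, an observation near an interval boundary appears in two panels of the co-plot, which smooths the transition between neighbouring panels.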
For each combination of values of the conditioning predictors, construct a scatter-plot to show the response to the focal predictor and arrange the plots in an array. Can condition on more than one predictor (e.g. age, gender).

Six overlapping age classes, two genders (male upper, female lower), LOWESS, and linear fits. Fox, 2002

EDA and Data-Transformations
Try to linearise non-linear relationships by trial-and-error. Mosteller & Tukey's 'bulging rule': when the bulge points down, transform y down the ladder of powers and roots; when the bulge points up, transform y up; when the bulge points left, transform x down; when the bulge points right, transform x up. Fox, 2002

Infant mortality rate and GDP per capita for 193 countries.
Bulge points down and to the left: try powers and roots.
Log transformation is linearising, and the variables become more symmetric. Fox, 2002

Simple multivariate data

Profiles, Stars, Glyphs, Faces, and Boxes of Percentages of Republican Votes in Six Presidential Elections in Six Southern States. The Circles in the Stars Are Drawn at 50%. The Assignment of Variables to Facial Features in the Faces is: 1932 - Shape of Face; 1936 - Length of Nose; 1940 - Curvature of Mouth; 1960 - Width of Mouth; 1964 - Slant of Eyes; 1968 - Length of Eyebrows.

Three types of shape for representing multivariate data. In these examples glyphs, stars and faces represent five, six and twelve (!) variables respectively.

Frequency of the six commonest species on the Park Grass plots using star displays.
Labelled polygon plot
Polygon plots
Chernoff faces

CHERNOFF

American city crime data

City           Murder/Manslaughter   Rape   Robbery   Assault   Burglary   Larceny   Auto theft
Atlanta        16.5                  24.8   106       147       1112       905       494
Boston         4.2                   13.3   122       90        982        669       954
Chicago        11.6                  24.7   340       242       808        609       645
Dallas         18.1                  34.2   184       293       1668       901       602
Denver         6.9                   41.5   173       191       1534       1368      780
Detroit        13                    35.7   477       220       1566       1183      788
Hartford       2.5                   8.8    68        103       1017       724       468
Honolulu       3.6                   12.7   42        28        1457       1102      637
Houston        16.8                  26.6   289       186       1509       787       697
Kansas City    10.8                  43.2   255       226       1494       955       765
Los Angeles    9.7                   51.8   286       355       1902       1386      862
New Orleans    10.3                  39.7   266       283       1056       1036      776
New York       9.4                   19.4   522       267       1674       1392      848
Portland       5                     23     157       144       1530       1281      488
Tucson         5.1                   22.9   85        148       1206       756       483
Washington     1.5                   27.6   524       217       1494       1003      739

Faces representation of city crime data: 1. Atlanta, 2. Boston, 3. Chicago, 4. Dallas, 5. Denver, 6. Detroit, 7. Hartford, 8. Honolulu, 9. Houston, 10. Kansas City, 11. Los Angeles, 12. New Orleans, 13. New York, 14. Portland, 15. Tucson, 16. Washington

CHERNOFF

Occurrence of seven vegetation groups at sites on cliffs of Snowdonia, from soils containing differing amounts of available phosphate and exchangeable calcium. The size of the circles indicates the relative abundance of the vegetation.

Percentage of Republican Votes in Presidential Elections in Six Southern States in the Years 1932-1940, 1960-68.

State            1932   1936   1940   1960   1964   1968
Missouri         35     38     48     50     36     45
Maryland         36     37     41     46     35     42
Kentucky         40     40     42     54     36     44
Louisiana        7      11     14     29     57     23
Mississippi      4      3      4      25     87     14
South Carolina   2      1      4      49     59     39

A) Schematic representation of the hierarchical clustering of years by complete link of Republican vote data in six southern states. The numbers at the far left denote distances between clusters. B) Tree for Missouri computed according to decisions (i) - (v).

Trees for Republican vote data in six southern states.
Tree of yearly yields of 15 transportation companies with all variables labelled.

Tree of yearly yields of 15 transportation companies 1953-1977.

Complex multivariate data

Andrews (1972) FOURIER PLOTS
Transform each multivariate observation into a function and plot it:

$$f_x(t) = \frac{x_1}{\sqrt{2}} + x_2 \sin t + x_3 \cos t + x_4 \sin 2t + x_5 \cos 2t + \dots$$

where the data are [x1, x2, x3, x4, x5, ... xm].
Plot over the range -π ≤ t ≤ π.
Each object is a curve. The function preserves distances between objects. Similar objects will be plotted close together.

MULTPLOT

Andrews' plot for artificial data.

Andrews' plots for all twenty-two Indian tribes.

OTHER TYPES OF GRAPHICAL DISPLAY

Dieldrin residues in the livers of 227 kestrels and barn owls found dead during 1970-1973. Each bird is represented by a point on the map. (Reproduced with permission from Institute of Terrestrial Ecology Annual Report for 1974.)

Map of aerial density of Sitobion avenae, 11-17 June 1984, produced using the SYMAP program. Darker areas represent higher densities on a logarithmic scale (×3 intervals). Numbers on the map indicate positions of suction traps and their respective catch sizes (log3). (Reproduced with permission from Woiwod and Tatchell, 1984.)

Contour map of the aerial density (using logarithmic intervals) of the hop aphid Phorodon humuli, 28 September to 2 October 1983, produced by the program SURFACE II. Suction trap sites are marked with a +. (Reproduced with permission from Fig. 3 of Woiwod and Tatchell, 1984.)

Three-dimensional perspective view of the aphid densities obtained using SURFACE II. (Reproduced from Woiwod and Tatchell, 1984.)

THE POWER OF GRAPHICAL DATA DISPLAY. Visualization provides insight that cannot be appreciated by any other approach to learning from data. On this graph, the top left panel displays monthly average CO2 concentrations from Mauna Loa, Hawaii. The remaining panels show frequency components of variation in the data. The heights of the five bars on the right sides of the panels portray the same changes in ppm on the five vertical scales.
OUTLIERS
Identification of 'outliers' or 'rogues'.
"Observation which is, in some sense, inconsistent with the rest of the observations in the data-set. An observation can be an outlier due to the response variable(s) or any one or more of the predictor variables having values outside their expected limits."
Identify not for rejection at this stage but for investigation and evaluation. Possible causes: incorrect measurement, incorrect data entry, transcription or recording error.

LEVERAGE: potential for influence resulting from unusual values, particularly of predictor variables.
INFLUENCE: an observation is influential if its deletion substantially changes the results.
The concept of an outlier is model dependent.

LEVERAGE MEASURES
Generalised distance of observation i plus 1/n:

$$d_i^2 = (x_i - \bar{x})' S^{-1} (x_i - \bar{x}) + \frac{1}{n}$$

Measures how extreme observation i is from the mean vector $\bar{x}$ of the complete sample. If the leverage of an observation is more than three times the average leverage, the observation has high leverage. Need to check it and try to explain why it has high leverage.

Alternatively, the leverage of observation i (h_i) equals the i-th diagonal element of the hat matrix H:

$$H = X (X'X)^{-1} X'$$

where X is the n x k matrix of x values (k is the number of parameters in the model) and H is an n x n square matrix.
[Hat matrix so called because it puts the "hat on Y": Ŷ = HY, where Ŷ and Y are n x 1 vectors of predicted and observed Y values.]

d_i^2: two or more response variables (e.g. CANOCO)
h_i: one response variable (e.g. linear or multiple regression)

Leverage ranges from 1/n to 1.
Sample mean: h̄ = k/n
Size-adjusted cut-off: h_i > 2k/n (ca. extreme 5%)

Maximum (h_i):
Max (h_i) ≤ 0.2          Safe
0.2 < Max (h_i) ≤ 0.5    Risky
Max (h_i) > 0.5          Avoid if possible
k = number of parameters

As h_i approaches 1, observation i may completely control the model.
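For the special case of a straight-line regression (intercept plus one predictor, so k = 2), the diagonal of the hat matrix has the closed form h_i = 1/n + (x_i - x̄)² / Σ(x_j - x̄)², which makes the leverage rules above easy to sketch without matrix algebra. The function names and the example data below are hypothetical.

```python
def leverages(x):
    """Hat-matrix diagonal for a straight-line fit (intercept + slope)."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]


def classify_max_leverage(max_h):
    """Safe / risky / avoid bands for the largest leverage."""
    if max_h <= 0.2:
        return "safe"
    if max_h <= 0.5:
        return "risky"
    return "avoid if possible"


if __name__ == "__main__":
    x = [1, 2, 3, 4, 5, 20]          # made-up predictor with one extreme value
    k = 2                            # parameters: intercept and slope
    h = leverages(x)
    cutoff = 2 * k / len(x)          # size-adjusted cut-off 2k/n
    flagged = [i for i, hi in enumerate(h) if hi > cutoff]
    print([round(hi, 3) for hi in h])
    print(flagged, classify_max_leverage(max(h)))
```

The leverages always sum to k, so their mean is k/n, matching the sample mean quoted above; leverage depends on the x values only, not on the response.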
INFLUENCE MEASURES

DFBETAS: change in a regression coefficient, in standard-error units, if observation i is deleted

$$DFBETAS_{ik} = \frac{b_k - b_{k(i)}}{s_{e(i)} / \sqrt{RSS_k}}$$

b_k = slope of regression
b_k(i) = slope when i deleted
s_e(i) = residual standard deviation when i deleted
RSS_k = residual sum of squares when i not deleted

If DFBETAS_ik > 0, case i pulls b_k up.
If DFBETAS_ik < 0, case i pulls b_k down.
If |DFBETAS_ik| > 2/√n, influential case.
Identifies the influence of observations on individual regression coefficients of the model: "LOCAL".

COOK'S D
Assesses the impact of observations on all regression coefficients: "GLOBAL".

$$D_i = \frac{z_i^2\, h_i}{k\,(1 - h_i)}$$

z_i = standardised residual
h_i = leverage measure from H
k = number of parameters

D_i > 1: observation influential
D_i > 4/n (size adjusted): observation influential

High leverage: potential outlier
Low influence: 'good' outlier, non-discordant outlier
High influence: 'bad' outlier, discordant outlier

'Good' (left) and 'bad' (right) outliers: 'bad' outliers influence the slope (artificial data).

                          'Good' outlier (left)          'Bad' outlier (right)
Leverage h_i              0.34                           0.34
Influence DFBETAS_i       0.06                           -9.1
                          (much less than 2/√n = 0.2)    (much more than 2/√n = 0.2)
Summary                   High leverage, low influence   High leverage, high influence
                          Non-discordant outlier         Discordant outlier

Leverage depends on x values only. Here h_i = 0.34 is 'risky' (between 0.2 and 0.5) and well above the size-adjusted cut-off of 2k/n = 4/100 = 0.04.

Robust leverage vs. robust residuals plot

NEVER FORGET THE GRAPH!
"What is the use of a book, thought Alice, without pictures"

SOFTWARE FOR EXPLORATORY DATA ANALYSIS
R and S-PLUS
MINITAB
SYSTAT
AXUM
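The influence measures above can be sketched for a straight-line regression by refitting the line with each observation deleted in turn. This is a minimal illustration on made-up data, not the output of any of the packages listed. Two assumptions are made: k = 2 (intercept and slope), and for the slope the scaling term in DFBETAS is taken as s_e(i) divided by the square root of the sum of squares of x about its mean. All function names are hypothetical.

```python
import math


def fit_line(x, y):
    """Least-squares line; returns slope, residuals, residual SD, mean x, Sxx."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    return slope, residuals, s, mean_x, sxx


def influence(x, y):
    """Per-observation (leverage, Cook's D, DFBETAS for the slope)."""
    n, k = len(x), 2
    slope, residuals, s, mean_x, sxx = fit_line(x, y)
    out = []
    for i in range(n):
        h = 1 / n + (x[i] - mean_x) ** 2 / sxx
        z = residuals[i] / (s * math.sqrt(1 - h))     # standardised residual
        cooks_d = z * z * h / (k * (1 - h))
        # Refit without observation i (lists assumed, so slicing works)
        slope_del, _, s_del, _, _ = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        dfbetas = (slope - slope_del) / (s_del / math.sqrt(sxx))
        out.append((h, cooks_d, dfbetas))
    return out


if __name__ == "__main__":
    noise = [0.3, -0.3, 0.2, -0.2, 0.0, 0.0, -0.2, 0.2, -0.3, 0.3]
    x = list(range(1, 11)) + [30]
    base = [2 * xi + e for xi, e in zip(x, noise)]
    for label, y_last in (("good", 60.0), ("bad", 0.0)):
        h, d, dfb = influence(x, base + [y_last])[-1]
        print(label, round(h, 2), round(d, 2), round(dfb, 2), 2 / math.sqrt(len(x)))
```

In both runs the extreme point has the same leverage, because leverage depends on x only; only the point that departs from the trend of the other observations produces a large Cook's D and DFBETAS.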