```
NUMERICAL ANALYSIS OF BIOLOGICAL AND ENVIRONMENTAL DATA
Lecture 2. Exploratory Data Analysis
EXPLORATORY DATA ANALYSIS
Types of variables
Simple diagrams
Summary statistics
(i) Location
(ii) Dispersion
(iii) Skewness and kurtosis
Transformations
Density estimation
Graphical display
(i) Univariate data
(ii) Bivariate and multivariate data
Outliers
Leverage and influence
Software
TYPES OF VARIABLES
1) discrete, e.g. counts
2) continuous, e.g. pH, elevation
Both are random variables or variates, with random variation.
TABULAR PRESENTATION
Raw data
Frequency tables
Value (discrete)   Range (continuous)   Frequency   Cumulative frequency   % CF
0                  0 - 0.99             3           3                      2
1                  1 - 1.99             8           11                     6
2                  2 - 2.99             3           14                     11
...                ...                  ...         ...                    ...
SIMPLE DIAGRAMS
DISCRETE OR CONTINUOUS VARIABLES
Dot diagram
Line diagram or profile
Histogram
n/10 bins
Frequency graph or cumulative frequency graph
CONTINUOUS VARIABLES
DISCRETE VARIABLES
HISTOGRAM BIN WIDTH
[Figure: Histograms of the British Incomes Data based on (a) the bin width $\hat{h}_2$, (b) the bin width $\hat{h}_0$, and (c) the S-PLUS default bin width.]
Wand (1997) Amer. Statistician 51, 59-64
Optimal solution
$\hat{h}_2 = \left[ \frac{6}{-\hat{\psi}_2(g_{21})\, n} \right]^{1/3}$
where $g_{21}$ is a bandwidth parameter and $\hat{\psi}_2$ is a "normal scale" estimator.
Solution of $\hat{\psi}_2$ and $g_{21}$ is iterative, to optimise a function: MEAN INTEGRATED SQUARED ERROR
$\hat{h}_0 = 3.49\, \hat{\sigma}\, n^{-1/3}$, where $\hat{\sigma}$ is the standard deviation
$\hat{h} = \frac{\text{range of data}}{1 + \log_2 n}$
n = sample size
Histogram Bin Width
In R, a good option for histogram bin width is given by the Freedman-Diaconis rule, which sets the number of bins to
$\left\lceil \frac{n^{1/3} (\max - \min)}{2 (Q_3 - Q_1)} \right\rceil$
where n is the number of observations, max - min is the range of the data, and Q3 - Q1 is the inter-quartile range. The brackets represent the ceiling, which means that you round up to the next integer, thereby avoiding 4.2 bins!
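A rough worked illustration of the Freedman-Diaconis rule, using invented numbers (not taken from any data set in this lecture): suppose n = 64, max - min = 21 and Q3 - Q1 = 10. Then
$n^{1/3} = 64^{1/3} = 4$
$\frac{n^{1/3} (\max - \min)}{2 (Q_3 - Q_1)} = \frac{4 \times 21}{2 \times 10} = 4.2$
$\lceil 4.2 \rceil = 5$
so the rule would suggest 5 bins for these hypothetical data.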
Exploratory Data Analysis
1. Summary Statistics
(A) Measures of location ('typical value')
(1) Arithmetic mean
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
(2) Weighted mean
$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
(3) Mode: 'most frequent' value
(4) Median: 'middle value'. Robust statistic
(5) Trimmed mean: 1 or 2 extreme observations at both tails deleted
(6) Geometric mean
$\log GM = \frac{1}{n} \sum_{i=1}^{n} \log x_i$
$GM = \sqrt[n]{x_1 x_2 x_3 \cdots x_n} = \text{antilog}\left( \frac{1}{n} \sum_{i=1}^{n} \log x_i \right)$
(B) Measures of dispersion
(1) Range
A: 13.99  14.15  14.28  13.93  14.30  14.13    range A = 0.37
B: 14.12  14.10  14.15  14.11  14.17  14.17    range B = 0.07
B smaller scatter than A: 'better precision'
Precision: random error, scatter (replicates)
Accuracy: systematic bias
(2) Interquartile range ('percentiles')
Quartiles Q1, Q2 and Q3 divide the ordered data into four groups, each containing 25% of the values
(3) Mean absolute deviation (mean absolute difference)
$\frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}|$ (ignore negative signs)
Example: x = 1, 2, 5, 8; $\bar{x} = 4$; $|x_i - \bar{x}|$ = 3, 2, 1, 4; sum = 10; 10/n = 2.5
(B) Measures of dispersion (cont.)
(4) Variance and standard deviation
$s^2 = \frac{1}{n-1} \sum (x_i - \bar{x})^2$
$SD = s = \sqrt{s^2}$
Variance = mean of squares of deviation from mean
SD = root mean square value
(5) Coefficient of variation
$CV = \frac{s}{\bar{x}} \times 100$
Relative standard deviation; percentage relative SD (independent of units)
(6) Standard error of mean
$SEM = \sqrt{\frac{s^2}{n}}$
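As a sketch of how these dispersion measures fit together, the six replicate values labelled A in the range example (13.99, 14.15, 14.28, 13.93, 14.30, 14.13) can be worked through by hand; the intermediate values below are rounded:
$\bar{x} = 84.78 / 6 = 14.13$
$\sum (x_i - \bar{x})^2 = 0.0196 + 0.0004 + 0.0225 + 0.0400 + 0.0289 + 0 = 0.1114$
$s^2 = 0.1114 / 5 = 0.0223$
$s = \sqrt{0.0223} \approx 0.149$
$CV = (0.149 / 14.13) \times 100 \approx 1.06\%$
$SEM = \sqrt{0.0223 / 6} \approx 0.061$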
(C) Measures of skewness and kurtosis
Skewness - measure of how one tail of curve is drawn out
Kurtosis - measure of peakedness of curve
g1 skewness measure, g2 kurtosis measure: "moment statistics"
Central moment = $\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^r$
r = 1: deviation from mean = 0
r = 2: variance
r = 3: g1 skewness, $g_1 = \frac{1}{n s^3} \sum (x_i - \bar{x})^3$ [third central moment divided by sd^3]
r = 4: g2 kurtosis, $g_2 = \frac{1}{n s^4} \sum (x_i - \bar{x})^4 - 3$
Skewness and kurtosis
negative g1: skewness to left
positive g1: skewness to right
negative g2: platykurtosis (flatter, larger tails)
positive g2: leptokurtosis (taller, few tails)
DATA TRANSFORMATIONS
(1) Comparability
(2) Better fit to model
Comparability
Data centring - deviations from mean
$x_i^* = x_i - \bar{x}$
Data standardisation
$x_i^* = (x_i - \bar{x}) / sd$ - zero mean, unit variance
$x_i^* = x_i / \text{range}$
Better fit
Normal distribution (frequency curve described by $\bar{x}$ and sd)
mean ± 1 sd = 68% of values
mean ± 2 sd = 95% of values
Often find data skewed to right (positive g1): log-normal distribution
LOG-NORMAL DISTRIBUTION PROPERTIES
Geometric mean = median of log-normal distribution
Antilog of the mean of log values = geometric mean
SD of log values ≈ CV of original values if sd ≤ 0.5
If SD larger, $CV = \sqrt{\exp(s^2) - 1}$, where s is the SD of the (natural) log values
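A short numerical check of the relationship between the SD of the log values and the CV of the original values, assuming natural logarithms and using invented values of s:
For s = 0.3: $CV = \sqrt{\exp(0.3^2) - 1} = \sqrt{\exp(0.09) - 1} = \sqrt{0.0942} \approx 0.307$, close to s itself.
For s = 1.0: $CV = \sqrt{\exp(1) - 1} = \sqrt{1.718} \approx 1.311$, clearly larger than s.
This suggests why the approximation SD(log values) ≈ CV is only used when the SD is small.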
How to decide whether to log transform?
(1) Look at histograms. If right skewed (positive g1), log transform.
(2) If sd > mean, or the maximum value of the variable is more than 20 times the smallest value, use log xi or log (xi + 1).
(3) Improves normality.
(4) Gives less weight to 'dominants'. VARIANCE STABILISING
(5) Reflects linear response of many species to log of chemical variables, i.e. log response over certain ranges.
(6) In regression, need normally distributed random errors; a log transformation may help.
NORMAL AND LOG-NORMAL DISTRIBUTIONS
                        Normal            Log-Normal
Effects                 Additive          Multiplicative
Shape                   Symmetric         Skewed
Mean                    x̄, arithmetic     x̄*, geometric
Standard deviation      s, additive       s*, multiplicative
Measure of dispersion   cv = s/x̄          s*
Confidence interval
  68.3%                 x̄ ± s             x̄* ×/ s*
  95.5%                 x̄ ± 2s            x̄* ×/ (s*)^2
  99.7%                 x̄ ± 3s            x̄* ×/ (s*)^3
×/ = times / divide (cf. ± plus / minus); cv = coefficient of variation
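To illustrate the "times / divide" intervals, take a hypothetical log-normal variable with geometric mean x̄* = 100 and multiplicative standard deviation s* = 2 (values invented for the example):
68.3% interval: 100 / 2 = 50 to 100 × 2 = 200
95.5% interval: 100 / 2^2 = 25 to 100 × 2^2 = 400
99.7% interval: 100 / 2^3 = 12.5 to 100 × 2^3 = 800
Unlike x̄ ± s, these intervals are asymmetric on the original scale and can never extend below zero.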
METHODS FOR DESCRIBING LOG-NORMAL
DISTRIBUTIONS
Graphical methods
Frequency plots, histograms, box plots
Parameters
Logarithm of x
Mean
Median
Standard deviation
Variance
Skewness and kurtosis of x
Problems
What logarithm base to use?
Parameters are not on the scale of the original data
Log-normal distributions appear to be very common in the real world
Limpert, E, et al. 2001 BioScience 51 (5), 342-352
DATA TRANSFORMATIONS
(1) Biological data
- Stabilise variances
- Dampen effects of very abundant taxa
Choices
- No transformation
- Square root
- Log (y + 1)
- % data: square root
- Counts: log (y + 1)
(2) Environmental variables
Often skewed to right, with a log-normal distribution
If SD > mean or maximum value of x > 20 times the smallest, use log (x + c) transformation where c is a constant, usually 1.
Other transformations:
(1) square root: $\sqrt{x}$
(2) cubic root: $\sqrt[3]{x}$
(3) fourth root: $\sqrt[4]{x}$
(4) log2: log2 (x + 1)
(5) logp: logp (x + 1)
(6) Box-Cox transformation - most appropriate value for exponent λ
$x^* = \frac{x^{\lambda} - 1}{\lambda}$ where λ ≠ 0
$x^* = \log x$ where λ = 0
If λ = 1: no transformation
λ = 0.5: square root
λ = -1: reciprocal transformation
λ = 0: log transformation
If x = 0.0, add 0.5 or 1.0 as constant
Can also solve for best estimate of constant to add
Can calculate confidence limits for λ.
If these include 1, no need for a transformation!
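A small hand calculation may help show how the Box-Cox family, $x^* = (x^{\lambda} - 1)/\lambda$ for λ ≠ 0 and $x^* = \log x$ for λ = 0, behaves. The value x = 9 is invented for illustration and natural logarithms are assumed:
λ = 1: $x^* = (9 - 1)/1 = 8$ (a shift only, so the shape of the distribution is unchanged)
λ = 0.5: $x^* = (\sqrt{9} - 1)/0.5 = 4$
λ = 0: $x^* = \log 9 \approx 2.197$
λ = -1: $x^* = (1/9 - 1)/(-1) \approx 0.889$
Lower values of λ pull large values in progressively more strongly.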
TRANSFOR
DENSITY ESTIMATION
A useful alternative to histograms is non-parametric density estimation
which results in a smoothing of the histogram.
The kernel-density estimate at the value x of a variable X is given by
$\hat{f}(x) = \frac{1}{nb} \sum_{j=1}^{n} K\left( \frac{x - x_j}{b} \right)$
where $x_j$ are the n observations of X, K is a kernel function (such as the normal density), and b is a bandwidth parameter influencing the amount of smoothing. Small bandwidths produce rough density estimates, whereas large bandwidths produce smoother estimates.
Note that the histogram has been scaled to the density estimates, not the raw frequencies.
Multiple approaches
1. Histogram with density scaling (areas of histogram bars sum to 1)
2. Density estimation (default) (thick line)
3. Density estimation (half the default bin-width) (thin line)
4. One-dimensional scatterplot ("rugplot") to show distribution of observations at the bottom
Fox, 2002
QUANTILE-QUANTILE PLOTS
Quantile-quantile (Q-Q) plots are
useful tools for determining if data
are normally distributed. They show
the relationship between the
distribution of a variable and a
reference or theoretical distribution.
Q-Q plot shows the relationship
between the ordered data and the
corresponding quantiles of the
reference (in our case, normal)
distribution.
If the data are normally distributed, they should plot on a straight line through
the 1st and 3rd quartiles. If there is a break in slope of the plotted points, the
data deviate from the reference distribution.
Note that quantiles are divisions of a frequency or probability distribution into
equal, ordered subgroups (e.g. quartiles (4 parts) or percentiles (100 parts)).
EXPLORATORY DATA ANALYSIS
GRAPHICAL DISPLAY
J.W. Tukey
Univariate data
(1) Stem-and-leaf displays
[Figure: example stem-and-leaf display with stems 5 to 9 and their leaves, and a "back-to-back" stem-and-leaf display]
(2) Box-and-whisker plots - box plots
95% CI around median: $\text{Median} \pm 1.58\,(Q_3 - Q_1)/\sqrt{n}$, where $Q_3 - Q_1$ is the inter-quartile range
(3) Hanging histograms
Variations of box plots: McGill et al. Amer. Stat. 32, 12-16
Useful to label extreme points
Fox, 2002
Box plots for samples of more than ten wing lengths of adult male winged blackbirds taken in winter at 12 localities in the southern United States, and in order of generally increasing latitude. From James et al. (1984a). Box plots give the median, the range, and upper and lower quartiles of the data.
Useful to apply several EDA tools
Bivariate and multivariate data
Simple scatter plot
[Figure: simple scatter plot of x2 against x1]
SCATTERPLOT MATRIX. The data are measurements of ozone, solar radiation,
temperature, and wind speed on 111 days. Thus the measurements are 111 points in
a four-dimensional space. The graphical method in this figure is a scatterplot matrix:
all pairwise scatterplots of the variables are aligned into a matrix with shared scales.
Triangular arrangement of all pairwise
scatter plots for four variables. Variables
describe length and width of sepals and
petals for 150 iris plants, comprising 3
species of 50 plants.
Three-dimensional perspective view for
the first three variables of the iris data.
Plants of the three species are coded A,B
and C.
Can add a linear regression line, add a smoother (LOWESS – see Lecture 5), and label particular points.
Fox, 2002
Categorical variables can be encoded in a plot by using different symbols or
colours for each category (e.g. type of occupation) and smoothers fitted for
each category.
Fox, 2002
bc = blue collar, prof = professional, wc = white collar
Jittering scatter-plots
Discrete quantitative variables usually result in uninformative scatter-plots (e.g. education (years) and vocabulary (score on 0-10 scale)).
Only 21 distinct education values and 11 scores, so only 21 x 11 = 231 plotting positions.
Jittering adds a small random quantity to each value to try to separate over-plotted points. Can vary the amount of jittering and also plot a smoother.
Fox, 2002
Bivariate density estimation and scatter-plots
Large data-sets and weak
relationships between variables.
Improve plot by jittering and
making symbols smaller and apply
bivariate kernel-density estimate
plus regression line and LOWESS
smoother.
Fox, 2002
[Figure legend: separate symbols for coal-fired and oil-fired power stations. Diagonal = density estimate for each variable.]
The Bagplot: A Bivariate Boxplot
Peter J. Rousseeuw
The American Statistician November 1999, Vol. 53, No. 4, 382
Car weight and engine displacement of 60 cars.
Part (a) shows the concentrations of
cholesterol and triglycerides in the
plasma of 320 patients. In part (b)
logarithms are taken of both
variables.
Part (a) shows the altitudinal range
and abundance of butterflies. In part
(b) the logarithm of the abundance is
plotted.
Bagplot matrix of the three-dimensional aquifer data with
85 data points.
Conditioning plots (Co-plots)
Focus on relationship between response and a predictor variable,
holding other predictors constant at particular values – conditionally
fixing the values of other predictors. 'Statistical control'
Co-plots provide graphical statistical control.
Focus on particular predictor and set each other predictor to a
relatively narrow range (if quantitative) or to a specific value (if
categorical). Subranges for a quantitative predictor are typically set
to overlap (called "shingles") rather than to partition data into
disjoint subsets ("bins").
For each combination of values of the conditioning predictors, construct
scatter-plot to show response to the local predictor and arrange the
plots in an array.
Can condition on more than one predictor (e.g. age, gender).
Six overlapping age
classes, two genders
(male upper, female
lower), LOWESS, and
linear fits
Fox, 2002
EDA and Data-Transformations
Try to linearise non-linear relationships by trial-and-error.
Mosteller & Tukey's 'bulging rule'.
When the bulge points down, transform y down the ladder of powers and roots;
when the bulge points up, transform y up;
when the bulge points left, transform x down;
when the bulge points right, transform x up.
Fox, 2002
Infant mortality rate and GDP per capita for 193 countries
Points down and to left,
try powers and roots
Log transformation
linearising, variables
more symmetric
Fox, 2002
Simple multivariate data
Profiles, Stars, Glyphs, Faces, and Boxes of Percentages of Republican Votes in Six Presidential
Elections in Six Southern States. The circles in the Stars Are Drawn at 50%. The Assignment of
Variables to Facial Features in the Faces is: 1932 – Shape of Face; 1936 – Length of nose; 1940 –
Curvature of Mouth; 1960 – Width of Mouth; 1964 – Slant of Eyes; 1968 – Length of Eyebrows
Three types of shape for representing multivariate data. In these
examples glyph, stars and faces represent five, six and twelve (!)
variables respectively.
Frequency of the six commonest species
on the Park Grass plots using star displays.
Labelled polygon plot
Polygon plots
Chernoff faces
CHERNOFF
American city crime data
No.  City          Murder/Manslaughter  Rape  Robbery  Assault  Burglary  Larceny  Auto theft
1.   Atlanta       16.5                 24.8  106      147      1112      905      494
2.   Boston        4.2                  13.3  122      90       982       669      954
3.   Chicago       11.6                 24.7  340      242      808       609      645
4.   Dallas        18.1                 34.2  184      293      1668      901      602
5.   Denver        6.9                  41.5  173      191      1534      1368     780
6.   Detroit       13                   35.7  477      220      1566      1183     788
7.   Hartford      2.5                  8.8   68       103      1017      724      468
8.   Honolulu      3.6                  12.7  42       28       1457      1102     637
9.   Houston       16.8                 26.6  289      186      1509      787      697
10.  Kansas City   10.8                 43.2  255      226      1494      955      765
11.  Los Angeles   9.7                  51.8  286      355      1902      1386     862
12.  New Orleans   10.3                 39.7  266      283      1056      1036     776
13.  New York      9.4                  19.4  522      267      1674      1392     848
14.  Portland      5                    23    157      144      1530      1281     488
15.  Tucson        5.1                  22.9  85       148      1206      756      483
16.  Washington    1.5                  27.6  524      217      1494      1003     739
Faces representation of
city crime data
CHERNOFF
Occurrence of seven vegetation groups at sites on cliffs of
Snowdonia, from soils containing differing amounts of
available phosphate and exchangeable calcium. The size of
circles indicates the relative abundance of the vegetation.
Percentage of Republican Votes in Presidential Elections in six Southern States in the Years 1932-1940, 1960-68.
                 1932  1936  1940  1960  1964  1968
Missouri           35    38    48    50    36    45
Maryland           36    37    41    46    35    42
Kentucky           40    40    42    54    36    44
Louisiana           7    11    14    29    57    23
Mississippi         4     3     4    25    87    14
South Carolina      2     1     4    49    59    39
A) Schematic representation of the hierarchical clustering of years by complete link of republican vote data in six southern states. The numbers at the far left denote distances between clusters.
B) Tree for Missouri computed according to decisions (i) – (v)
Trees for republican vote data in
six southern states.
Tree of yearly yields of 15
transportation companies with all
variables labelled
Tree of yearly yields of 15
transportation companies 1953-1977
Complex multivariate data
Andrews (1972)
FOURIER PLOTS
Plot multivariate data into a function.
$f_x(t) = x_1/\sqrt{2} + x_2 \sin(t) + x_3 \cos(t) + x_4 \sin(2t) + x_5 \cos(2t) + \dots$
where data are [x1, x2, x3, x4, x5 ... xm]
Plot over range -π ≤ t ≤ π
Each object is a curve. Function preserves distances between
objects. Similar objects will be plotted close together.
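As a minimal illustration of the Andrews function $f_x(t) = x_1/\sqrt{2} + x_2 \sin(t) + x_3 \cos(t) + \dots$, consider a hypothetical object with only three variables, x = [1, 2, 3] (values invented for the example):
$f_x(t) = 1/\sqrt{2} + 2 \sin(t) + 3 \cos(t)$
$f_x(0) = 0.707 + 0 + 3 = 3.707$
$f_x(\pi/2) = 0.707 + 2 + 0 = 2.707$
$f_x(-\pi/2) = 0.707 - 2 + 0 = -1.293$
Evaluating the function at many values of t between -π and π traces out the curve for that object; a second object with similar x values would give a curve lying close to it.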
MULTPLOT
Andrews' plot for artificial data
Andrews’ plots for all twenty-two Indian tribes.
OTHER TYPES OF GRAPHICAL DISPLAY
Dieldrin residues in the
livers of 227 kestrels and
during 1970-1973. Each
bird is represented by a
point on the map.
(Reproduced with
permission from Institute
of Terrestrial Ecology
Annual Report for 1974).
Map of aerial density of
Sitobion avenae, 11-17 June
1984 produced using the
SYMAP program. Darker areas
represent higher densities on
a logarithmic scale (×3
intervals). Numbers on map
indicate positions of suction
traps and their respective
catch sizes (log3).
(Reproduced with permission
from Woiwod and Tatchell,
1984.)
Contour map of the aerial
density (using logarithmic
intervals) of the hop aphid
Phorodon humuli 28
September to 2 October
1983, produced by the
program SURFACE II.
Suction trap sites are
marked with a +.
(Reproduced with
permission from Fig. 3 of
Woiwod and Tatchell,
1984)
Three dimensional perspective view of the aphid
densities obtained using SURFACE II. (Reproduced
from Woiwod and Tatchell, 1984)
THE POWER OF GRAPHICAL
DATA DISPLAY. Visualization
provides insight that cannot
be appreciated by any other
approach to learning from
data. On this graph, the top
left panel displays monthly
average CO2 concentrations
from Mauna Loa, Hawaii.
The remaining panels show
frequency components of
variation in the data. The
heights of the five bars on
the right sides of the panels
portray the same changes in
ppm on the five vertical
scales.
OUTLIERS
Identification of ‘outliers’ or ‘rogues’.
“Observation which is, in some sense, inconsistent with the rest of the
observations in the data-set. An observation can be an outlier due to
the response variable(s) or any one or more of the predictor variables
having values outside their expected limits.”
Identify not for rejection at this stage but for investigation and
evaluation.
? Incorrect measurement, incorrect data entry, transcription or
recording error.
LEVERAGE
Potential for influence resulting from unusual values,
particularly of predictor variables
INFLUENCE Observation is influential if its deletion substantially
changes the results
Concept of outlier is model dependent.
LEVERAGE MEASURES
Generalised distance of observation i plus 1/n:
$d_i^2 = (x_i - \bar{x})' S^{-1} (x_i - \bar{x}) + 1/n$
Measures how extreme observation i is from the mean vector $\bar{x}$ of the complete sample.
If leverage of an observation is more than three times the average leverage, observation has high leverage. Need to check it and try to explain why it has high leverage.
Alternatively, leverage of observation i (hi) equals the i-th diagonal element of hat matrix H
$H = X (X'X)^{-1} X'$
where X is the n x k matrix of x values (k is the number of parameters in the model) and H is an n x n square matrix.
[Hat matrix so called because it puts "hat on Y":
$\hat{Y} = HY$
where $\hat{Y}$ and Y are n x 1 vectors of predicted and observed Y values]
di2 - two or more response variables (e.g. CANOCO)
hi - one response variable (e.g. linear or multiple regression)
Leverage ranges from 1/n to 1
Sample mean of hi = k/n
Size-adjusted cut-off: hi ≥ 2k/n (ca. extreme 5%)
Maximum (hi):
Max (hi) ≤ 0.2: Safe
0.2 < Max (hi) ≤ 0.5: Risky
Max (hi) > 0.5: Avoid if possible
k = number of parameters
As hi approaches 1, observation i may completely control the model.
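A hand-worked sketch of leverage for the simplest case, a straight-line regression with an intercept (k = 2), where the diagonal of the hat matrix reduces to $h_i = 1/n + (x_i - \bar{x})^2 / \sum_j (x_j - \bar{x})^2$. The five x values are invented for illustration:
x = 1, 2, 3, 4, 10; n = 5; $\bar{x} = 4$; $\sum_j (x_j - \bar{x})^2 = 9 + 4 + 1 + 0 + 36 = 50$
$h_1 = 0.2 + 9/50 = 0.38$
$h_2 = 0.2 + 4/50 = 0.28$
$h_3 = 0.2 + 1/50 = 0.22$
$h_4 = 0.2 + 0/50 = 0.20$ (the minimum, 1/n, for an observation at the mean)
$h_5 = 0.2 + 36/50 = 0.92$
The leverages sum to 2 = k, so their mean is k/n = 0.4. Only the observation at x = 10 exceeds the size-adjusted cut-off 2k/n = 0.8, and with h = 0.92 it falls in the "avoid if possible" class.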
INFLUENCE MEASURES
DFBETAS - change in regression coefficient k, in standard-error units, if observation i is deleted
$DFBETAS_{ik} = \frac{b_k - b_{k(i)}}{\text{scaling term}}$
where $b_k$ is the slope of regression, $b_{k(i)}$ is the slope when i is deleted, and the scaling term is computed from the residual standard deviation when i is deleted and the residual sum of squares when i is not deleted.
If DFBETASik > 0, case i pulls bk up; if < 0, case i pulls bk down.
If |DFBETASik| ≥ 2/√n: influential case
DFBETAS identifies influence of observations on individual regression coefficients of the model: "LOCAL"
COOK'S D
COOK'S D assesses impact of observations on regression coefficients: "GLOBAL"
$D_i = \frac{z_i^2 h_i}{k (1 - h_i)}$
where $z_i$ = standardised residual, $h_i$ = leverage measure from H, k = number of parameters
If Di > 1 or Di ≥ 4/n: observation influential
High leverage - potential outlier
Low influence - good outlier, non-discordant outlier
High influence - discordant outlier
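To show how leverage and residual size combine in Cook's D, $D_i = z_i^2 h_i / (k (1 - h_i))$, here is a hand calculation with invented values: a standardised residual z = 2, leverage h = 0.34 and k = 2 parameters:
$D_i = \frac{2^2 \times 0.34}{2 \times (1 - 0.34)} = \frac{1.36}{1.32} \approx 1.03$
With the same leverage but a small residual, z = 0.2:
$D_i = \frac{0.2^2 \times 0.34}{2 \times 0.66} = \frac{0.0136}{1.32} \approx 0.010$
The first hypothetical case would be flagged as influential (Di > 1); the second has high leverage but negligible influence.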
the slope (artificial data)
Leverage (depends on x values only)
hi = 0.34 in both cases ('risky' (between 0.2 and 0.5) and well above size-adjusted cut-off of 2k/n = 4/100 = 0.04)
Influence
DFBETASi = 0.06 (much less than 2/√n = 0.2): high leverage, low influence; 'good' outlier, non-discordant outlier
DFBETASi = -9.1 (magnitude much more than 2/√n = 0.2): high leverage, high influence