Chap 10: Summarizing Data
10.1: INTRO: Univariate/multivariate data (random
samples or batches) can be described with procedures
that reveal their structure via graphical displays
(empirical CDFs, histograms, ...), which are to data
what PMFs and PDFs are to random variables.
Numerical summaries (location and spread measures),
the effects of outliers on these measures, and
graphical summaries (boxplots) will be investigated.
10.2: CDF-based methods
10.2.1: The Empirical CDF (ECDF)
The ECDF is the data analogue of the CDF of a
random variable. It is a graphical display that
conveniently summarizes data sets:
$F_n(x) = \frac{\#\{x_i \le x\}}{n}$, where $x_1, \dots, x_n$ is a batch of numbers.
Equivalently,
$F_n(x) = \frac{1}{n}\sum_{i=1}^{n} I_{(-\infty,x]}(X_i)$, where $X_1, \dots, X_n$ is a random sample
and $I_A(t) = \begin{cases} 1, & t \in A \\ 0, & t \notin A \end{cases}$ is an indicator function.
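As a concrete illustration (my own sketch, not from the text), the ECDF formula above takes only a few lines of Python; the helper name `ecdf` is an assumption:

```python
import bisect

def ecdf(data):
    """Return F_n, the empirical CDF of the batch `data`:
    F_n(x) = #{x_i <= x} / n."""
    xs = sorted(data)
    n = len(xs)

    def F_n(x):
        # bisect_right counts how many sorted values are <= x
        return bisect.bisect_right(xs, x) / n

    return F_n

F = ecdf([1, 2, 3, 4])
print(F(2.5))   # 2 of the 4 values are <= 2.5, so F_n(2.5) = 0.5
```

Sorting once up front makes each evaluation an O(log n) binary search.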
The Empirical CDF (cont'd)
The random variables $I_{(-\infty,x]}(X_i)$ are independent
Bernoulli random variables:
$I_{(-\infty,x]}(X_i) = \begin{cases} 1 & \text{with probability } F(x) \\ 0 & \text{with probability } 1 - F(x) \end{cases}$
$nF_n(x) = \sum_{i=1}^{n} I_{(-\infty,x]}(X_i) \sim \mathrm{Bin}(n, F(x))$
$\Rightarrow E[F_n(x)] = F(x)$ and $\mathrm{Var}[F_n(x)] = \frac{F(x)(1-F(x))}{n}$
$F_n$ is an unbiased estimate of $F$.
$\lim_{n\to\infty} \mathrm{Var}[F_n(x)] = 0$, and $F_n(x)$ has a maximum variance at the median of $F$.
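A quick simulation sketch (my own, not from the text) checks these sampling properties: for U(0,1) data at x = 0.5 (the median), nF_n(x) ~ Bin(n, 0.5), so E[F_n(x)] = 0.5 and Var[F_n(x)] = 0.25/n:

```python
import random

random.seed(0)
n, reps = 50, 2000
vals = []
for _ in range(reps):
    sample = [random.random() for _ in range(n)]    # U(0,1), so F(0.5) = 0.5
    vals.append(sum(u <= 0.5 for u in sample) / n)  # one realization of F_n(0.5)

mean = sum(vals) / reps
var = sum((v - mean) ** 2 for v in vals) / reps
# Theory: E[F_n(0.5)] = 0.5 and Var[F_n(0.5)] = 0.5 * 0.5 / 50 = 0.005
print(round(mean, 3), round(var, 4))
```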
10.2.2: The Survival Function
In medical or reliability studies, data sometimes
consist of times of failure or death; it then
becomes more convenient to work with the survival
function
$S(t) = 1 - F(t)$, where $T$ is a random variable with CDF $F$,
rather than the CDF. The sample survival function
(ESF) gives the proportion of the data greater
than $t$ and is given by:
$S_n(t) = 1 - F_n(t)$
Survival plots (plots of the ESF) may be used to
provide information about the hazard function,
which may be thought of as the instantaneous rate
of mortality for an individual alive at time $t$ and is
defined to be:
$h(t) = \frac{f(t)}{1 - F(t)} = -\frac{d}{dt}\log\left(1 - F(t)\right) = -\frac{d}{dt}\log S(t)$
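The ESF, and the constant hazard of the exponential dist'n, can be illustrated with a short Python sketch (my own construction; the helper name `esf` is an assumption):

```python
import bisect
import math

def esf(data):
    """Empirical survival function S_n(t) = 1 - F_n(t):
    the proportion of the data greater than t."""
    xs = sorted(data)
    n = len(xs)
    return lambda t: 1 - bisect.bisect_right(xs, t) / n

S_n = esf([2, 5, 7, 9])
print(S_n(6))   # 2 of 4 values exceed 6, so S_n(6) = 0.5

# For T ~ Exp(lam): f(t)/(1 - F(t)) = lam, a constant hazard,
# consistent with h(t) = -d/dt log S(t) since log S(t) = -lam * t.
lam = 2.0
f = lambda t: lam * math.exp(-lam * t)
S = lambda t: math.exp(-lam * t)
print(abs(f(1.3) / S(1.3) - lam) < 1e-12)   # True
```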
The Survival Function (cont'd)
From page 149, the $\delta$-method to first order:
$\mathrm{Var}[g(X)] \approx \mathrm{Var}(X)\,[g'(\mu_X)]^2$
$\Rightarrow \mathrm{Var}\{\log(1 - F_n(t))\} \approx \mathrm{Var}[F_n(t)] \cdot \left(\frac{1}{1 - F(t)}\right)^2 = \frac{F(t)}{n\left(1 - F(t)\right)}$
which expresses how extremely unreliable (huge
variance for large values of $t$) the empirical
log-survival function is.
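A simulation sketch (my own) checks this delta-method approximation for Exp(1) data at t = ln 2, where F(t) = 0.5 and the formula predicts Var ≈ 0.5/(n · 0.5) = 1/n:

```python
import math
import random

random.seed(1)
n, reps = 100, 2000
t = math.log(2)          # for Exp(1), F(t) = 0.5 here
vals = []
for _ in range(reps):
    sample = [random.expovariate(1.0) for _ in range(n)]
    S_n = sum(x > t for x in sample) / n
    vals.append(math.log(S_n))     # empirical log-survival at t

m = sum(vals) / reps
var = sum((v - m) ** 2 for v in vals) / reps
# Delta-method prediction: F(t) / (n * (1 - F(t))) = 0.5 / (100 * 0.5) = 0.01
print(round(var, 4))
```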
10.2.3: Q-Q Plots (quantile-quantile plots)
A Q-Q plot is useful for comparing CDFs: it plots the
quantiles of one dist'n versus the quantiles of the
other dist'n.
Additive treatment effect: $X \sim F$ vs $Y \sim G$ with $Y = X + h$
$\Rightarrow G(y) = F(y - h)$ and $y_p = x_p + h$
The Q-Q plot is a straight line with slope $= 1$ and y-intercept $= h$.
Multiplicative treatment effect: $X \sim F$ vs $Y \sim G$ with $Y = cX$ for $c > 0$
$\Rightarrow G(y) = F\!\left(\frac{y}{c}\right)$ and $y_p = c\,x_p$
The Q-Q plot is a straight line with slope $= c$ and y-intercept $= 0$.
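A tiny numerical sketch (my own) confirms the slope/intercept claims by pairing order statistics, the empirical analogue of pairing quantiles:

```python
xs = sorted([1.2, 0.5, 3.1, 2.0, 0.9])

# Multiplicative effect Y = cX with c = 3: points (x_(i), y_(i)) lie on y = 3x
ys_mult = sorted(3 * v for v in xs)
print(all(abs(y - 3 * x) < 1e-12 for x, y in zip(xs, ys_mult)))    # True

# Additive effect Y = X + h with h = 2.5: points lie on y = x + 2.5
ys_add = sorted(v + 2.5 for v in xs)
print(all(abs(y - (x + 2.5)) < 1e-12 for x, y in zip(xs, ys_add)))  # True
```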
10.3: Histograms, Density Curves
& Stem-and-Leaf Plots
Kernel PDF estimate:
$X_1, \dots, X_n \stackrel{iid}{\sim} f$, $f$ = PDF.
The estimating function of $f$ is $f_h$ = kernel PDF estimate.
Let $w_h(x) = \frac{1}{h} w\!\left(\frac{x}{h}\right)$ be a smooth weight function.
Then $f_h(x) = \frac{1}{n}\sum_{i=1}^{n} w_h(x - X_i) = \frac{1}{nh}\sum_{i=1}^{n} w\!\left(\frac{x - X_i}{h}\right)$
$h$ = bandwidth parameter that controls the smoothness of $f_h$;
$h$ plays the role of the bin width of a histogram.
Choose a 'reasonable' $h$: not too big, not too small!
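The kernel formula can be sketched in a few lines of Python (my own; a Gaussian weight w is assumed, which the text does not specify):

```python
import math

def kde(data, h):
    """Kernel PDF estimate f_h(x) = (1/(n*h)) * sum_i w((x - X_i)/h)
    with the Gaussian weight w(u) = exp(-u^2/2) / sqrt(2*pi)."""
    n = len(data)
    w = lambda u: math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)
    return lambda x: sum(w((x - xi) / h) for xi in data) / (n * h)

f_h = kde([0.0, 1.0, 2.0], h=0.5)
# f_h is a genuine density: it integrates to 1 (approximated on a fine grid)
total = sum(f_h(-5 + 0.01 * k) * 0.01 for k in range(1200))
print(round(total, 3))
```

Re-running with a larger or smaller `h` shows the over-smoothing/under-smoothing trade-off directly.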
10.4: Location Measures
10.4.1: The Arithmetic Mean $\bar x = \frac{1}{n}\sum_{i=1}^{n} x_i$ is sensitive to
outliers (not robust).
10.4.2: The Median $\tilde x$ is a robust measure of location.
10.4.3: The Trimmed Mean is another robust location
measure; $0.1 \le \alpha \le 0.2$ is highly recommended.
Step 1: order the data set.
Step 2: discard the lowest $\alpha \cdot 100\%$ and the highest $\alpha \cdot 100\%$.
Step 3: take the arithmetic mean of the remaining data.
Step 4: $\bar x_\alpha = \frac{1}{n - 2[n\alpha]}\sum_{i=[n\alpha]+1}^{n-[n\alpha]} x_{(i)}$ is the $\alpha \cdot 100\%$ trimmed mean.
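The four steps above translate directly into Python (a minimal sketch; the function name is my own):

```python
def trimmed_mean(data, alpha=0.1):
    """alpha-trimmed mean: sort, drop the lowest and highest
    [n*alpha] observations, and average the rest."""
    xs = sorted(data)                 # Step 1
    n = len(xs)
    k = int(n * alpha)                # [n*alpha]
    kept = xs[k:n - k]                # Steps 2-3
    return sum(kept) / len(kept)      # Step 4

print(trimmed_mean([1, 2, 3, 4, 100], alpha=0.2))  # drops 1 and 100 -> 3.0
```

Note how the single outlier 100 is discarded, whereas the ordinary mean would be 22.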
Location Measures (cont'd)
The trimmed mean (discard only a certain number of
the observations) is introduced as a natural
compromise between the mean (discard no
observations) and the median (discard all but 1 or 2
observations).
Another compromise between $\bar x$ and $\tilde x$ was proposed
by Huber (1981), who suggested minimizing
$\sum_{i=1}^{n} \Psi\!\left(\frac{X_i - \mu}{\sigma}\right)$ with respect to $\mu$, where $\Psi(x)$ is to be given,
or solving $\sum_{i=1}^{n} \psi\!\left(\frac{X_i - \mu}{\sigma}\right) = 0$ for $\mu$, where $\psi = \Psi'$
(its solution will be called an M-estimate).
10.4.4: M-Estimates (Huber, 1964)
Minimize $\sum_{i=1}^{n} \Psi\!\left(\frac{X_i - \mu}{\sigma}\right)$, where
$\Psi(x) = \begin{cases} \frac{1}{2}x^2 & \text{if } |x| \le k \\ k|x| - \frac{1}{2}k^2 & \text{if } |x| > k \end{cases}$
$\Psi$ is proportional to $x^2$ inside $[-k, k]$;
$\Psi(x)$ replaces the parabolic arcs by straight lines outside $[-k, k]$.
Big $k \Rightarrow \hat\mu$ comes closer to the mean $\bar x$; $\Psi(x) = \frac{1}{2}x^2$ or $\psi(x) = x$.
Small $k \Rightarrow \hat\mu$ comes closer to the median $\tilde x$; $\Psi(x) = k|x|$ or $\psi(x) = k\,\mathrm{sgn}(x)$.
$k = \infty$ corresponds to the mean $\bar x$ and $k = 0$ corresponds to the median $\tilde x$.
$k = \frac{3}{2}$ protects against outliers (observations more than $\frac{3}{2}\sigma$ away from the center)
and is suggested as a "moderate" compromise.
10.4.4: M-Estimates (cont'd)
M-estimates coincide with MLEs because:
Minimizing $\sum_{i=1}^{n} \Psi\!\left(\frac{X_i - \mu}{\sigma}\right)$ wrt $\mu$ $\Leftrightarrow$ Maximizing $\prod_{i=1}^{n} f\!\left(\frac{X_i - \mu}{\sigma}\right)$ wrt $\mu$
when we use the $\Psi$ function $\Psi(x) = -\log f(x)$, with $X_1, X_2, \dots, X_n \stackrel{iid}{\sim} f$.
The computation of an M-estimate is a nonlinear
minimization problem that must be solved using an
iterative method (such as Newton-Raphson, ...).
Such a minimizer is unique for convex $\Psi$ functions. Here,
we assume that $\sigma$ is known; but in practice, a robust
estimate of $\sigma$ (to be seen in Section 10.5) should be
used instead.
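As a concrete (if simplified) sketch of the computation, one can solve the estimating equation by bisection rather than Newton-Raphson, since the sum of Huber's ψ terms is monotone decreasing in μ (my own implementation; σ is fixed at an assumed known value):

```python
def huber_m_estimate(data, k=1.5, sigma=1.0, tol=1e-8):
    """Solve sum_i psi((x_i - mu)/sigma) = 0 for mu by bisection,
    with Huber's psi(x) = max(-k, min(k, x))."""
    psi = lambda x: max(-k, min(k, x))
    g = lambda mu: sum(psi((x - mu) / sigma) for x in data)
    lo, hi = min(data), max(data)     # g(lo) >= 0 >= g(hi)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Symmetric data: the M-estimate agrees with the mean and the median
print(round(huber_m_estimate([1.0, 2.0, 3.0]), 6))   # 2.0
# Huge k: every residual is unclipped, so the M-estimate equals the mean
print(round(huber_m_estimate([0.0, 0.0, 10.0], k=1000.0), 4))
```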
10.4.5: Comparison of Location Estimates
Among the location estimates introduced in this
section, which one is the best? Not easy!
For a symmetric underlying dist'n, all 4 statistics
(sample mean, sample median, alpha-trimmed
mean, and M-estimate) estimate the center of
symmetry.
For a non-symmetric underlying dist'n, these 4 statistics
estimate 4 different pop'n parameters, namely the pop'n
mean, pop'n median, pop'n trimmed mean, and a
functional of the CDF defined by way of the weight
function $\Psi$.
Idea: run some simulations; compute more than one
estimate of location and pick the winner.
10.4.6: Estimating Variability of Location Estimates
by the Bootstrap
Using a computer, we can generate (simulate) a large
number B of samples of size n from a common known
dist'n F. From each sample, we compute the value of
the location estimate $\hat\theta$.
The empirical dist'n of the resulting values $\theta_1^*, \theta_2^*, \dots, \theta_B^*$
is a good approximation (for large B) to the dist'n
function of $\hat\theta$. Unfortunately, F is NOT known in
general. Just plug in the empirical CDF $F_n$ for F and
bootstrap (= resample from $F_n$):
$F_n$ is a discrete PMF with the same probability $\frac{1}{n}$
for each observed value $x_1, x_2, \dots, x_n$.
10.4.6: Bootstrap (cont'd)
A sample of size n from $F_n$ is a sample of size n drawn
with replacement from the observed data $x_1, \dots, x_n$;
it produces $\theta_b^*$ $(b = 1, \dots, B)$. Thus,
$s_{\hat\theta} = \sqrt{\frac{1}{B}\sum_{b=1}^{B} \left(\theta_b^* - \bar\theta^*\right)^2}$
where $\bar\theta^* = \frac{1}{B}\sum_{b=1}^{B} \theta_b^*$ is the mean of the $\theta_b^*$ $(b = 1, 2, \dots, B)$.
Read example A on page 368.
The bootstrap dist'n can be used to form an approximate CI
and to test hypotheses.
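The estimate $s_{\hat\theta}$ above can be computed in a few lines (my own sketch; the sample median is used as the location estimate, and the data values are invented for illustration):

```python
import random
import statistics

random.seed(0)
data = [3.2, 1.5, 4.8, 2.2, 5.1, 3.9, 2.7, 4.4, 1.9, 3.5]
B = 1000

thetas = []
for _ in range(B):
    # a bootstrap sample: n draws with replacement from F_n
    resample = [random.choice(data) for _ in data]
    thetas.append(statistics.median(resample))

theta_bar = sum(thetas) / B
s_theta = (sum((t - theta_bar) ** 2 for t in thetas) / B) ** 0.5
print(round(s_theta, 3))   # bootstrap SE of the sample median
```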
10.5: Measures of Dispersion
A measure of dispersion (scale) gives a numerical
indication of the "scatteredness" of a batch of
numbers. The most common measure of dispersion
is the sample standard deviation
$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} \left(X_i - \bar X\right)^2}$
Like the sample mean, the sample standard deviation
is NOT robust (sensitive to outliers).
Two simple robust measures of dispersion are the
IQR (interquartile range) and the MAD (median
absolute deviation from the median).
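Both robust measures take a line or two each in Python (a sketch; note that `statistics.quantiles` uses one common quartile convention among several):

```python
import statistics

def iqr(data):
    """Interquartile range Q3 - Q1."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    return q3 - q1

def mad(data):
    """Median absolute deviation from the median."""
    m = statistics.median(data)
    return statistics.median(abs(x - m) for x in data)

# Unlike the SD, the MAD barely moves when an outlier is introduced
print(mad([1, 2, 3, 4, 5]), mad([1, 2, 3, 4, 500]))   # 1 1
print(iqr([1, 2, 3, 4, 5]))
```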
10.6: Box Plots
Tukey invented a graphical display (the boxplot)
that indicates the center of a data set
(median), the spread of the data (IQR), and
the presence of possible outliers.
The boxplot also gives an indication of the
symmetry/asymmetry (skewness) of the
dist'n of data values.
Later, we will see how boxplots can be
effectively used to compare batches of
numbers.
10.7: Conclusion
Several graphical tools were introduced in
this chapter as methods of presenting and
summarizing data. Some aspects of the
sampling dist'ns of these summaries
(assuming a stochastic model for the data)
were discussed.
Bootstrap methods (approximating a
sampling dist'n and its functionals) were also
revisited.
Parametric Bootstrap:
Example: Estimating a population mean
It is known that explosives used in mining leave a crater
that is circular in shape with a diameter that follows an
exponential dist'n $F(x) = 1 - e^{-x/\theta}$, $x \ge 0$. Suppose a
new form of explosive is tested. The sample crater
diameters (cm) are as follows:
121 847 591 510 440 205 3110 142 65 1062
211 269 115 586 983 115 162 70 565 114
$\Rightarrow \bar x = 514.15$ (sample mean) and $s = 685.60$ (sample SD).
It would be inappropriate to use $\bar x \pm t_{0.95}\,\frac{s}{\sqrt{n}} = (249.07, 779.23)$
as a 90% CI for the pop'n mean via the t-curve (df = 19)
Parametric Bootstrap: (cont'd)
because such a CI is based on the normality
assumption for the parent pop'n.
The parametric bootstrap replaces the exponential
pop'n dist'n F with unknown mean $\mu$ by the known
exponential dist'n $F^*$ with mean $\mu^* = \bar x = 514.15$.
Then resamples of size n = 20 are drawn from this
surrogate pop'n. Using Minitab, we can generate
B = 1000 such samples of size n = 20 and compute
the sample mean of each of these B samples. A
bootstrap CI can be obtained by trimming off 5%
from each tail. Thus, a parametric bootstrap 90% CI
is given by:
(50th smallest = 332.51, 951st largest = 726.45)
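The Minitab computation described above can be reproduced with a short Python sketch (my own; the exact endpoints will differ from (332.51, 726.45) because they depend on the random draws):

```python
import random

random.seed(0)
diameters = [121, 847, 591, 510, 440, 205, 3110, 142, 65, 1062,
             211, 269, 115, 586, 983, 115, 162, 70, 565, 114]
n = len(diameters)
mu_star = sum(diameters) / n          # 514.15, the plugged-in exponential mean

B = 1000
means = sorted(
    sum(random.expovariate(1 / mu_star) for _ in range(n)) / n
    for _ in range(B)
)
# Trim 5% from each tail: 50th smallest and 951st largest of the 1000 means
ci_90 = (means[49], means[950])
print(round(ci_90[0], 2), round(ci_90[1], 2))
```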
Non-Parametric Bootstrap:
If we do not assume that we are sampling from a
normal pop'n or a pop'n of some other specified shape,
then we must extract all the information about the
pop'n from the sample itself.
Nonparametric bootstrapping builds a sampling
dist'n for our estimate by drawing samples
with replacement from our original (raw) data.
Thus, a nonparametric bootstrap 90% CI for $\mu$ is
obtained by taking the 5th and 95th percentiles of $\hat\mu$
among these resamples.
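For the same crater data, the nonparametric version (my own sketch) only changes how the resamples are drawn: with replacement from the raw data rather than from a fitted exponential:

```python
import random

random.seed(0)
diameters = [121, 847, 591, 510, 440, 205, 3110, 142, 65, 1062,
             211, 269, 115, 586, 983, 115, 162, 70, 565, 114]
n = len(diameters)

B = 1000
means = sorted(
    sum(random.choice(diameters) for _ in range(n)) / n   # resample from F_n
    for _ in range(B)
)
ci_90 = (means[49], means[950])    # 5th and 95th percentiles
print(round(ci_90[0], 2), round(ci_90[1], 2))
```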