Estimation of the Values below the Detection Limit by
Regression Techniques
Gayland Ridley, Merck & Co. Inc., Blue Bell, PA
Shi-Tao Yeh, EDP/Temps & CONTRACT SERVICES, Bala Cynwyd, PA
Abstract
An environmental or laboratory data set may contain observations
that are very small or near zero. These samples are measured, but the
measuring devices or procedures used are unable to detect low
concentrations, so the analytical laboratories may report them as
below the "detection limit" (DL). The data report forms from the
laboratories may indicate:
o not detected,
o less than a specified detection limit, or
o zero.
The most common techniques for handling values below the DL are the
substitution and deletion techniques. Deletion techniques delete the
observations below the DL from the data set, so that only the
observations above the DL are used. Substitution techniques use zero,
half of the DL, or the DL itself as the value for any observation
below the DL.
Both deletion and substitution techniques bias the parameter
estimates. This produces inaccurate descriptive statistics and
incorrect statistical inferences.
This paper provides two regression methods, regression order
statistics and a log probability method, with SAS® code for
estimating the values below the DL.
Regression Order Statistics Method
1.1 Theoretical Background
The regression order statistics method converts the observations
into a rank-ordered series, then regresses it on the corresponding
values of the inverse cumulative normal distribution function.
Observations are ranked from smallest to largest, with values below
the DL treated as the smallest values.
Let
$X_i$ = the i-th ranked observation,
$X_1$ = the smallest value,
$X_{k+1}$ = the smallest detectable value (the k observations ranked
below it are below the DL), and
$X_n$ = the largest value.
If the observations comprising the sample are randomly drawn from
the population, the ordered data values divide the underlying
probability density function into equal areas. Thus, an estimated
plotting position on an appropriate coordinate system can be
calculated for each point such that the data above the DL fall on a
straight line.
Let
$A_i = F^{-1}[(i - 3/8)/(n + 1/4)]$
where $F^{-1}[x]$ is the inverse cumulative normal distribution
function.
The method regresses $X_i$ on $A_i$ to estimate a and b in the
equation
$X_i = a + b A_i + e_i$.
The mean of the distribution for the whole data set is estimated by
a and the standard deviation is estimated by b.
1.2 Computation Procedure
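As a rough illustration of the calculation described in Section 1.1, the following Python sketch (not the authors' SAS macro) ranks the detected values, computes the normal scores $A_i$, fits $X_i = a + b A_i$ by ordinary least squares on the detected ranks, and predicts the below-DL values at ranks 1 to k. The function name `ros_fill`, the use of `None` to mark a below-DL value, and the choice to hand out the estimates in order of appearance are assumptions made for the example. The `sample` list is transcribed from the x column of the sample file in Display 1.

```python
from statistics import NormalDist


def ros_fill(values):
    """Fill None entries (below-DL values) by regression order statistics.

    Hypothetical helper for illustration; returns (a, b, filled_values).
    """
    n = len(values)
    detected = sorted(v for v in values if v is not None)
    k = n - len(detected)  # number of below-DL observations

    # Normal scores A_i = F^{-1}[(i - 3/8)/(n + 1/4)] for ranks i = 1..n
    scores = [NormalDist().inv_cdf((i - 0.375) / (n + 0.25))
              for i in range(1, n + 1)]

    # Detected values occupy ranks k+1..n; fit X = a + b*A by least squares
    a_det = scores[k:]
    mean_a = sum(a_det) / len(a_det)
    mean_x = sum(detected) / len(detected)
    b = (sum((s - mean_a) * (x - mean_x) for s, x in zip(a_det, detected))
         / sum((s - mean_a) ** 2 for s in a_det))
    a = mean_x - b * mean_a

    # Predict the below-DL values at ranks 1..k and assign them to the
    # None entries in order of appearance (an assumption of this sketch)
    estimates = (a + b * s for s in scores[:k])
    filled = [next(estimates) if v is None else v for v in values]
    return a, b, filled


# x values of the sample data; None marks the two below-DL readings
sample = [None, 7, 10, 13, 9, 11, 10, 10, 13, 12,
          9, 10, 12, 10, 16, 8, None, 16, 8, 12]
a, b, filled = ros_fill(sample)
print("mean estimate a:", round(a, 3))
print("std. dev. estimate b:", round(b, 3))
print("estimates:", [round(v, 2) for v, s in zip(filled, sample) if s is None])
```

With this input the two estimates should come out near 4.8 and 6.2, which is in line with the values 5 and 6 flagged in Display 2 once they are rounded to the nearest unit.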
Figure 1 shows the computation procedure.
[Figure 1: flowchart with the following steps]
o Retrieve the data set and the variable with missing values.
o Assign the numbers of missing values, non-missing values, and
total observations to global macro variables.
o Sort the observations and compute F = (I - 3/8)/(N + 1/4) and
Z = PROBIT(F).
o Perform the regression analysis and retrieve the predicted values.
o Assign the predicted values to the missing values and flag them.
o Print the new data set.
Figure 1 Flowchart of Computation Steps
1.3 SAS Macro Module
The first two arguments of the following macro module are the user's
input SAS data set name fname and the variable name vname that holds
the below DL values to be estimated (coded as SAS missing values).
The third argument roff is the user's specified roundoff unit. The
module calls the SAS ROUND function to round the estimated values to
the nearest roundoff unit.
%macro reg1(fname,vname,roff);
/* computes # of missing values and
   # of obs. in the data file */
proc univariate data=&fname noprint;
var &vname;
output out=p1 nmiss=nmiss n=nomiss;
data p1;
set p1;
n = nmiss + nomiss;
call symput('nmiss',nmiss);
call symput('nomiss',nomiss);
call symput('n',n);
%global nmiss nomiss n;
/* creates an order list */
data f2;
do i = 1 to &n;
j = i;
output;
end;
data f1(keep=j &vname);
set &fname; set f2;
proc sort data=f1; by &vname;
/* computes z scores */
data f3;
set f1; set f2(keep=i);
f = ((i - 3/8)/(&n + 1/4));
z = probit(f);
/* fits a regression line */
proc reg data=f3 noprint;
model &vname = z;
output out=p1 p=xhat;
run;
data f3(keep=j &vname flag);
set f3; set p1;
length flag $ 13;
xhat = round(xhat,&roff);
if &vname = . then do;
flag = '<= ESTIMATED';
&vname = xhat;
end;
/* restores the original observation order before merging */
proc sort data=f3; by j;
data &fname;
set &fname; set f3;
proc print data=&fname noobs;
run;
%mend reg1;
Figure 2 SAS Macro Module
1.4 Sample Input and How to Use the Module
Display 1 shows a sample data file with file name of SUGI.DAT.
 y   x
 1   .
 2   7
 3  10
 4  13
 5   9
 6  11
 7  10
 8  10
 9  13
10  12
11   9
12  10
13  12
14  10
15  16
16   8
17   .
18  16
19   8
20  12
Display 1 Sample Data
The following example utilizes the sample data set and illustrates
how to use the macro module.
data sugi;
infile 'sugi.dat';
input y x;
run;
%reg1(sugi,x,1.)
run;
Figure 3 SAS Program Example
Display 2 shows the output from the example SAS program.
 Y   X  FLAG
 1   5  <= ESTIMATED
 2   7
 3  10
 4  13
 5   9
 6  11
 7  10
 8  10
 9  13
10  12
11   9
12  10
13  12
14  10
15  16
16   8
17   6  <= ESTIMATED
18  16
19   8
20  12
Display 2 Output from Example Program
Log Probability Method
2.1 Theoretical Background
Gilliom and Helsel conducted an experiment in which samples were
generated from a wide range of parent distributions, with the DL at
varying levels, to evaluate eight different methods for estimating
distribution parameters [1] [3]. They found that the most robust
method for minimizing error in the estimated distribution parameters
was the log-probability regression method. This method uses the
observed data above the DL, with the missing values extrapolated, to
compute the distribution parameters, assuming that the below DL
observations follow a lognormal distribution over the zero-to-DL
range.
2.2 Computation Procedure
The estimation steps are as follows:
o performs the log transformation on the observations above the DL,
o computes their z scores, where
$z = F^{-1}[(i - 3/8)/(n + 1/4)]$,
o fits a regression line to the log transformed observations and
their z scores,
o predicts the below DL observations from the regression line,
o backtransforms all values to arithmetic units.
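The five steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the paper's SAS implementation: the function name `log_probability_fill` is hypothetical, `None` marks a below-DL value, the estimates are assigned to the `None` entries in order of appearance, and the rounding to a roundoff unit is left out. The `sample` list is transcribed from the x column of the sample data file.

```python
import math
from statistics import NormalDist


def log_probability_fill(values):
    """Fill None entries (below-DL values) by log-probability regression.

    Hypothetical helper for illustration only.
    """
    n = len(values)

    # Step 1: log-transform the observations above the DL (ranked ascending)
    logs = sorted(math.log(v) for v in values if v is not None)
    k = n - len(logs)  # number of below-DL observations

    # Step 2: z scores z = F^{-1}[(i - 3/8)/(n + 1/4)] for ranks i = 1..n
    z = [NormalDist().inv_cdf((i - 0.375) / (n + 0.25))
         for i in range(1, n + 1)]

    # Step 3: least-squares line through (z, log x) on detected ranks k+1..n
    z_det = z[k:]
    mean_z = sum(z_det) / len(z_det)
    mean_l = sum(logs) / len(logs)
    slope = (sum((zi - mean_z) * (li - mean_l) for zi, li in zip(z_det, logs))
             / sum((zi - mean_z) ** 2 for zi in z_det))
    intercept = mean_l - slope * mean_z

    # Steps 4 and 5: predict at ranks 1..k, then back-transform with exp
    estimates = (math.exp(intercept + slope * zi) for zi in z[:k])
    return [next(estimates) if v is None else v for v in values]


# x values of the sample data; None marks the two below-DL readings
sample = [None, 7, 10, 13, 9, 11, 10, 10, 13, 12,
          9, 10, 12, 10, 16, 8, None, 16, 8, 12]
filled = log_probability_fill(sample)
print([round(v, 2) for v, s in zip(filled, sample) if s is None])
```

Because the regression is fitted on the log scale and the predictions are exponentiated, the estimates are always positive, which a straight-line fit on the arithmetic scale does not guarantee.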
2.3 How to Use the Module and Output Sample
The following SAS program illustrates how to perform log probability
regression. It produces the SAS output shown in Display 3.
data sugi(drop=x);
set sugi;
lx = log(x);
%reg1(sugi,lx,.001)
run;
data sugi(drop=lx);
set sugi;
x = exp(lx);
run;
proc print data=sugi noobs;
var y x flag;
run;
Figure 4 SAS Program for Log Probability Method
 Y        X  FLAG
 1   6.1227  <= ESTIMATED
 2   7.0000
 3  10.0000
 4  13.0000
 5   9.0000
 6  11.0000
 7  10.0000
 8  10.0000
 9  13.0000
10  12.0000
11   9.0000
12  10.0000
13  12.0000
14  10.0000
15  16.0000
16   8.0000
17   6.9379  <= ESTIMATED
18  16.0000
19   8.0000
20  12.0000
Display 3 Output from Log Probability Method
Conclusion
This paper provides SAS modules to estimate the values below the DL.
Two regression methods, rank order and log probability, with
computation procedures and SAS macro code are presented and
discussed. The log probability regression method is considered the
most robust method for minimizing error in missing value estimates.
The module in this paper provides a computational tool for users to
estimate the values below the DL.
References
[1] Gilliom, R. J. and Helsel, D. R. (1986) "Estimation of
Distribution Parameters for Censored Trace Level Water Quality Data.
2. Verification and Applications". Water Resources Res., 22(2),
147-155.
[2] Helsel, D. R. (1990) "Less than Obvious. Statistical Treatment of
Data below the Detection Limit". Environ. Sci. Technol., 24(12),
1766-1774.
[3] Helsel, D. R. and Gilliom, R. J. (1986) "Estimation of
Distributional Parameters for Censored Trace Level Water Quality
Data. 1. Estimation Techniques". Water Resources Res., 22(2),
135-146.
[4] Newman, M. C., D. Greene, Dixon, P. M., Looney, B. B., and
Segal, C. (1992) Uncensor V3.0. Savannah River Ecology Laboratory,
Aiken, SC, 1-15.
[5] Newman, M. C., Dixon, P. M., Looney, B. B., and Pinder, J. E.
III. (1989) "Estimating Mean and Variance for Environmental Samples
with below Detection Limit Observations". Water Resources Bull.,
25(4), 905-916.
SAS is a registered trademark or trademark of SAS
Institute Inc. in the USA and other countries.
® indicates USA registration.