Download Analyzing Survival or Reliability Data

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Probability wikipedia , lookup

History of statistics wikipedia , lookup

Statistics wikipedia , lookup

Transcript
Analyzing Survival or Reliability Data
In this demo we consider the analysis of lifetime data. In biological
or medical applications, this is known as survival analysis, and the
times may represent the survival time of an organism or the time until
a disease is cured. In engineering applications, this is known as
reliability analysis, and the times may represent the time to failure
of a piece of equipment.
To demonstrate how to use MATLAB® and the Statistics Toolbox™ to
analyze lifetime data, we'll look at an application in modeling the
time to failure of a throttle from an automobile fuel injection
system.
Special Properties of Lifetime Data
Some features of lifetime data distinguish them other types of data.
First, the lifetimes are always positive values, usually representing
time. Second, some lifetimes may not be observed exactly, so that they
are known only to be larger than some value. Third, the distributions
and analysis techniques that are commonly used are fairly specific to
lifetime data
Let's simulate the results of testing 100 throttles until failure.
We'll generate data that might be observed if most throttles had a
fairly long lifetime, but a small percentage tended to fail very early
rand('state',1);
lifetime = [wblrnd(15000,3,90,1); wblrnd(1500,3,10,1)];
In this example, assume that we are testing the throttles under
stressful conditions, so that each hour of testing is equivalent to
100 hours of actual use in the field. For pragmatic reasons, it's
often the case that reliability tests are stopped after a fixed amount
of time. For this example, we will use 140 hours, equivalent to a
total of 14,000 hours of real service. Some items fail during the
test, while others survive the entire 140 hours. In a real test, the
times for the latter would be recorded as 14,000, and we mimic this in
the simulated data. It is also common practice to sort the failure
times.
T = 14000;
1
obstime = sort(min(T, lifetime));
We know that any throttles that survive the test will fail eventually,
but the test is not long enough to observe their actual time to
failure. Their lifetimes are only known to be greater than 14,000
hours. These values are said to be censored. This plot shows that
about 40% of our data are censored at 14,000
failed = obstime(obstime<T); nfailed = length(failed);
survived = obstime(obstime==T); nsurvived = length(survived);
censored = (obstime >= T);
plot([zeros(size(obstime)),obstime]', repmat(1:length(obstime),2,1),
...
'Color','b','LineStyle','-')
line([T;3e4], repmat(nfailed+(1:nsurvived), 2, 1),
'Color','b','LineStyle',':');
line([T;T], [0;nfailed+nsurvived],'Color','k','LineStyle','-')
text(T,30,'<--Unknown survival time past here')
xlabel('Survival time'); ylabel('Observation number')
100
90
80
Observation number
70
60
50
40
30
<--Unknown survival time past here
20
10
0
0
0.5
1
1.5
Survival time
Ways of Looking at Distributions
2
2
2.5
3
4
x 10
Before we examine the distribution of the data, let's consider
different ways of looking at a probability distribution.
A probability density function (PDF) indicates the relative
probability of failure at different times.
A survivor function gives the probability of survival as a function of
time, and is simply one minus the cumulative distribution function (1CDF).
The hazard rate gives the instantaneous probability of failure given
survival to a given time. It is the PDF divided by the survivor
function. In this example the hazard rates turn out to be increasing,
meaning the items are more susceptible to failure as time passes
(aging).
A probability plot is a re-scaled CDF, and is used to compare data to
a fitted distribution.
Here are examples of those four plot types, using the Weibull
distribution to illustrate. The Weibull is a common distribution for
modeling lifetime data.
x = linspace(1,30000);
subplot(2,2,1);
plot(x,wblpdf(x,14000,2),x,wblpdf(x,18000,2),x,wblpdf(x,14000,1.1))
title('Prob. Density Fcn')
subplot(2,2,2);
plot(x,1-wblcdf(x,14000,2),x,1-wblcdf(x,18000,2),x,1-wblcdf(x,14000,1.1))
title('Survivor Fcn')
subplot(2,2,3);
wblhaz = @(x,a,b) (wblpdf(x,a,b) ./ (1-wblcdf(x,a,b)));
plot(x,wblhaz(x,14000,2),x,wblhaz(x,18000,2),x,wblhaz(x,14000,1.1))
title('Hazard Rate Fcn')
subplot(2,2,4);
probplot('weibull',wblrnd(14000,2,40,1))
title('Probability Plot')
3
8
-5
x 10 Prob. Density Fcn
Survivor Fcn
1
6
4
0.5
2
0
0
1
2
0
3
0
1
2
4
x 10
x 10
-4
4
x 10 Hazard Rate Fcn
Probability Plot
Probability
3
2
1
0
0.999
0.99
0.95
0.9
0.75
0.5
0.25
0.1
0.05
0.01
0.005
0
1
2
3
3
4
10
x 10
-5
x 10 Prob. Density Fcn
5
10
Data
4
8
3
4
10
Survivor Fcn
1
6
4
0.5
2
0
0
1
2
0
3
0
1
2
4
x 10
x 10
-4
4
x 10 Hazard Rate Fcn
Probability Plot
Probability
3
2
1
0
3
4
0.999
0.99
0.95
0.9
0.75
0.5
0.25
0.1
0.05
0.01
0.005
0
1
2
3
3
10
4
x 10
4
4
10
Data
5
10
Fitting a Weibull Distribution
The Weibull distribution is a generalization of the exponential
distribution. If lifetimes follow an exponential distribution, then
they have a constant hazard rate. This means that they do not age, in
the sense that the probability of observing a failure in an interval,
given survival to the start of that interval, doesn't depend on where
the interval starts. A Weibull distribution has a hazard rate that may
increase or decrease.
Other distributions used for modeling lifetime data include the
lognormal, gamma, and Birnbaum-Saunders distributions.
We will plot the empirical cumulative distribution function of our
data, showing the proportion failing up to each possible survival
time. The dotted curves give 95% confidence intervals for these
probabilities.
subplot(1,1,1);
[empF,x,empFlo,empFup] = ecdf(obstime,'censoring',censored);
stairs(x,empF);
hold on;
stairs(x,empFlo,':'); stairs(x,empFup,':');
hold off
xlabel('Time'); ylabel('Proportion failed'); title('Empirical CDF')
5
Empirical CDF
0.7
0.6
Proportion failed
0.5
0.4
0.3
0.2
0.1
0
0
2000
4000
6000
8000
10000
12000
14000
Time
This plot shows, for instance, that the proportion failing by time
4,000 is about 12%, and a 95% confidence bound for the probability of
failure by this time is from 6% to 18%. Notice that because our test
only ran 14,000 hours, the empirical CDF only allows us to compute
failure probabilities out to that limit. About 40% of the data were
censored at 14,000, and so the empirical CDF only rises to about 0.60,
instead of 1.0.
The Weibull distribution is often a good model for equipment failure.
The function wblfit fits the Weibull distribution to data, including
data with censoring. After computing parameter estimates, we'll
evaluate the CDF for the fitted Weibull model, using those estimates.
Because the CDF values are based on estimated parameters, we'll
compute confidence bounds for them.
paramEsts = wblfit(obstime,'censoring',censored);
[nlogl,paramCov] = wbllike(paramEsts,obstime,censored);
xx = linspace(1,2*T,500);
[wblF,wblFlo,wblFup] = wblcdf(xx,paramEsts(1),paramEsts(2),paramCov);
We can superimpose plots of the empirical CDF and the fitted CDF, to
judge how well the Weibull distribution models the throttle
reliability data.
6
stairs(x,empF);
hold on
handles = plot(xx,wblF,'r-',xx,wblFlo,'r:',xx,wblFup,'r:');
hold off
xlabel('Time'); ylabel('Fitted failure probability'); title('Weibull
Model vs. Empirical')
Weibull Model vs. Empirical
1
0.9
Fitted failure probability
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
0
0.5
1
1.5
Time
2
2.5
3
4
x 10
Notice that the Weibull model allows us to project out and compute
failure probabilities for times beyond the end of the test. However,
it appears the fitted curve does not match our data well. We have too
many early failures before time 2,000 compared with what the Weibull
model would predict, and as a result, too few for times between about
7,000 and about 13,000. This is not surprising -- recall that we
generated data with just this sort of behavior.
Adding a Smooth Nonparametric Estimate
The pre-defined functions provided with the Statistics Toolbox don't
include any distributions that have an excess of early failures like
this. Instead, we might want to draw a smooth, nonparametric curve
through the empirical CDF, using the function ksdensity. We'll remove
the confidence bands for the Weibull CDF, and add two curves, one with
the default smoothing parameter, and one with a smoothing parameter
1/3 the default value. The smaller smoothing parameter makes the curve
follow the data more closely.
7
delete(handles(2:end))
[npF,ignore,u] =
ksdensity(obstime,xx,'cens',censored,'function','cdf');
line(xx,npF,'Color','g');
npF3 =
ksdensity(obstime,xx,'cens',censored,'function','cdf','width',u/3);
line(xx,npF3,'Color','m');
xlim([0 1.3*T])
title('Weibull and Nonparametric Models vs. Empirical')
legend('Empirical','Fitted Weibull','Nonparametric,
default','Nonparametric, 1/3 default', ...
'location','northwest');
The nonparametric estimate with the smaller smoothing parameter
matches the data well. However, just as for the empirical CDF, it is
not possible to extrapolate the nonparametric model beyond the end of
the test -- the estimated CDF levels off above the last observation.
Let's compute the hazard rate for this nonparametric fit and plot it
over the range of the data.
hazrate = ksdensity(obstime,xx,'cens',censored,'width',u/3) ./ (1npF3);
plot(xx,hazrate)
title('Hazard Rate for Nonparametric Model')
xlim([0 T])
8