Outlier Study in Dataset
Data that do not comply with the general behavior or model of the data are known as outliers. These data objects are grossly different from the other objects in the cluster or data set and are highly inconsistent with the overall trend. The figures below provide an example: Figure 1 shows a dataset without any outlier, and Figure 2 shows the same dataset with one additional outlier data point. The presence of the outlier data point completely changes the regression curve.
Figure 1: Dataset with no outliers
Figure 2: Same dataset with an outlier and the changed regression curve
There are many data mining techniques for handling outliers in data sets; this task is called outlier mining. However, an outlier might carry some importance within the dataset, and by eliminating the outlier data we might lose information about the whole data set.
Applications of outlier detection:
- Detecting fraud through unusual credit card usage
- Detecting fraudulent use of another person's telephone code
- Determining targets for customized marketing
- Finding unusual responses to various medical treatments
- Detecting the malfunction of a data recorder while it is working
Outlier mining:
- Define the data that could be considered outliers.
With a regression model used for data modeling, the analysis of residuals provides information for estimating the extremeness of each data point. Outliers are difficult to find in time-series data.
- Determine an efficient method to mine the outliers so defined.
Data visualization is a good method of outlier determination; it falters only when the dataset contains many categorical attributes. Computer-based methods for finding outliers are widely used, such as the statistical approach, the distance-based approach, and the deviation-based approach.
Statistical approach:
- Inherent alternative distribution: Arrange the data as a histogram. An exceedingly long tail suggests the presence of outliers.
A scatter plot is also a very good way to spot outliers. The figures above are a good example of finding outliers by the statistical method: the scatter plots were drawn using the whole dataset, and only one data point lay very far from the group, so the outlier was very easy to recognize.
Some researchers also determine outliers by fixing a confidence level while analyzing the dataset: data lying outside the chosen confidence band around the regression line (say 5%) are considered outliers. A small sketch of this residual-based test follows the list below.
- Mixture alternative distribution:
- Slippage alternative distribution:
- Block procedure:
- Consecutive procedure:
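As a minimal sketch of the residual-and-confidence-level idea above (the 5% two-sided level is implemented here as a ±1.96-sigma cutoff on standardized residuals under a normality assumption; the cutoff and the data are illustrative choices, not from the text):

```python
import numpy as np

def regression_outliers(x, y, z_thresh=1.96):
    """Flag points whose residual from a least-squares line is extreme.

    z_thresh = 1.96 approximates a two-sided 5% level under a
    normality assumption (an illustrative choice).
    """
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    z = residuals / residuals.std(ddof=1)   # residuals have zero mean
    return np.abs(z) > z_thresh

# Illustrative data: a linear trend with one injected outlier
rng = np.random.default_rng(1)
x = np.arange(12.0)
y = 3.0 * x + rng.normal(0.0, 1.0, size=x.size)
y[7] += 15.0
print(np.nonzero(regression_outliers(x, y))[0])  # expected: [7]
```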
Distance based approach:
- Index based algorithm: In this approach, multi-dimensional indexing structures such as R-trees or k-d trees are used to search for the neighbors of each object. While searching for the neighbors of an object o within radius d, if M + 1 neighbors are found (where M is the maximum number of objects allowed in the d-neighborhood of an outlier), then o is not considered an outlier. For higher-dimensional data sets this proves to be very time consuming. A sketch of this test follows the list.
- Nested loop algorithm: It follows the same procedure as the index based approach; however, it divides the memory buffer into two halves and tries to minimize the number of input/output operations, thus reducing the computational time for determining the outliers.
- Cell-based algorithm: This approach reduces the problems the index based approach faces when determining outliers in higher-dimensional data sets.
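A minimal sketch of the index based test (using scipy's cKDTree as the k-d tree; the radius d, the cutoff M, and the data are illustrative choices):

```python
import numpy as np
from scipy.spatial import cKDTree

def distance_based_outliers(points, d, M):
    """Indices of objects with at most M neighbors within radius d;
    an object with M + 1 or more neighbors is not an outlier.
    The object itself is excluded from its own neighbor count."""
    tree = cKDTree(points)
    neighbor_counts = np.array(
        [len(tree.query_ball_point(p, r=d)) - 1 for p in points]
    )
    return np.nonzero(neighbor_counts <= M)[0]

rng = np.random.default_rng(0)
cluster = rng.normal(0.0, 1.0, size=(50, 2))      # one dense cluster
data = np.vstack([cluster, [[8.0, 8.0]]])         # plus one distant point
print(distance_based_outliers(data, d=2.0, M=3))  # expected: [50]
```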
Deviation based outlier detection:
This approach uses neither statistical tests nor distance-based measures to identify exceptional objects. Instead, it identifies outliers by examining the main characteristics of the objects in a group; objects that deviate from this description are considered outliers. It is performed using two techniques.
- Sequential exception technique: By exploiting the implicit redundancy of the data, a human can distinguish the outlier or misfit data. Given a set S of n objects, the technique builds a number of sequential subsets; by examining these subsets, we can determine the outliers. The analysis relies on the following notions (a simplified sketch follows the list):
o Exception set
o Dissimilarity function
o Cardinality function
o Smoothing factor
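A deliberately simplified sketch of the sequential exception idea: here each single element is a candidate exception set, the dissimilarity function is the variance, and the smoothing factor is the cardinality of the remainder times the drop in dissimilarity when the candidate is removed. These concrete choices are mine, for illustration only:

```python
import numpy as np

def sequential_exception(values):
    """Return the index whose removal yields the largest smoothing
    factor SF(I) = card(S - I) * (D(S) - D(S - I)), using variance as
    the dissimilarity function D and single-element exception sets."""
    values = np.asarray(values, dtype=float)
    n = values.size
    total_dissim = values.var()
    best_idx, best_sf = -1, -np.inf
    for i in range(n):
        rest = np.delete(values, i)
        sf = (n - 1) * (total_dissim - rest.var())  # smoothing factor
        if sf > best_sf:
            best_idx, best_sf = i, sf
    return best_idx, best_sf

print(sequential_exception([2.1, 2.0, 1.9, 2.2, 9.5]))  # index 4 wins
```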
OLAP data cube technique:
Other useful methods of determining the outliers in practical datasets:
Concentration residuals:
When the optimum number of factors for the model has been determined, the predicted concentrations of each training sample from the cross-validation rotation with the selected factor model can be used for outlier detection. The difference between the actual and predicted concentrations for a sample is known as the concentration residual.
When the calibration calculations are performed, the model attempts to account for all the variations in the training data. Therefore, the prediction error of most of the samples should be approximately the same. Samples that have significantly larger concentration residuals than the rest of the training set are known as concentration outliers.
This is related to the BPNN predictive model we developed for yield prediction: in some cases a sample's prediction error differs widely from the actual-versus-predicted variation of all the other samples, and we can therefore call that sample an outlier in the dataset.
Figure 3: A plot of the concentration residual of the predicted hydroxyl number versus the sample number for a cross-validation of a PLS model built from 35 FT-NIR spectra. Note that sample 31 has a significantly different residual than the remainder of the data set, indicating that it is probably an outlier.
This type of outlier generally arises when the experimenter either makes a mistake in creating the calibration mixtures or errs in the analysis of the sample by the primary calibration technique used to generate the calibration concentration values. Another possibility, which occurs frequently, is a transcription error: the analyst simply types in the wrong concentration value when building the computerized training set.
Looking at Figure 3 above, it is easy to see that sample #31 is significantly different from the rest of the training set and is most likely a concentration outlier. However, outliers in most data sets will not be as obvious as this. While the human eye is excellent at discerning patterns in data, visual inspection is not always a valid basis for a decision of this type. What is really needed is a mathematical way to accurately determine the likelihood that a sample is really an outlier.
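One simple mathematical screen of the kind called for above: flag any sample whose concentration residual lies more than 3 standard deviations from the mean residual. The 3-sigma cutoff is an illustrative convention, not a value from the text:

```python
import numpy as np

def concentration_outliers(actual, predicted, z_thresh=3.0):
    """Flag samples whose concentration residual (actual - predicted,
    e.g. from a cross-validation) is far larger than the rest."""
    residuals = np.asarray(actual) - np.asarray(predicted)
    z = (residuals - residuals.mean()) / residuals.std(ddof=1)
    return np.nonzero(np.abs(z) > z_thresh)[0]
```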
Spectral residuals:
Another powerful tool in seeking out outlier samples is the spectral residual. Similar to
looking for concentration outliers, spectral outliers are detected by using a model for which the
optimum number of factors has been determined by a cross validation.
When each sample is predicted, a set of scores is found that best fits the model loading
vectors to the sample spectrum. By using the calculated scores and the calibration loading vectors, a
new model-reconstructed spectrum can be calculated. The spectral residual is the difference between this reconstructed spectrum and the actual prediction spectrum and is calculated as

$$R_{spec} = \sum_{i=1}^{p} \left( A_{orig,i} - A_{pred,i} \right)^{2}$$

where p is the number of wavelengths (data points) in the spectrum, A_orig are the original spectrum absorbances, and A_pred are the model-predicted spectrum absorbances.
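A minimal sketch of the calculation, assuming a linear factor model with orthonormal loading vectors so that the best-fit scores are simple projections (the matrix shapes and names are mine):

```python
import numpy as np

def spectral_residuals(A, loadings):
    """Per-sample spectral residual: project each spectrum onto the
    model loading vectors, reconstruct it from the scores, and sum
    the squared differences over the p wavelengths.

    A        : (n, p) matrix of sample spectra (absorbances)
    loadings : (f, p) matrix of loading vectors (orthonormal rows)
    """
    scores = A @ loadings.T        # best-fit scores for each sample
    A_pred = scores @ loadings     # model-reconstructed spectra
    return np.sum((A - A_pred) ** 2, axis=1)
```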
Figure 4: A plot of the spectral residual versus sample number for a cross-validation of a PLS model of Research Octane Number (RON) built from 57 NIR spectra of gasoline. Notice that sample number 45 has a significantly different residual than the remainder of the set, indicating that it is a possible outlier.
As with concentration residuals, samples that have significantly higher spectral residuals
than the rest of the training set may be outliers. Spectral outliers can be caused by many different
factors including inconsistent sample handling, changes in the performance of the instrument, or
anything that contributes to a significant change in the spectrum of a given sample.
Cluster analysis:
There are other methods of outlier detection that are more abstract but equally valuable. Cluster analysis is a method used to look for samples whose scores are inconsistent with those of the other samples in the training set. In this technique, the scores of one loading vector are plotted versus the scores of another vector for every sample in the training set. If all the samples in the training set are similar in composition and calibration value, the data points will tend to "cluster" about some mean value. If a sample point lies significantly outside this cluster, the ratio of the two factor scores for this sample is inconsistent with the other spectra in the training set, and the sample may be an outlier.
There is, however, one exception: samples that lie at the ends of the calibration concentration range
(i.e., the sample contains the highest or the lowest concentration of a constituent) can be expected to
lie at the extreme limits of the cluster. An extreme sample will sometimes appear as an outlier, even
though it may not be one at all.
Figure 5: The Mahalanobis distance of a point is measured from the mean point of the cluster (indicated by X). Unlike an absolute distance measurement, it takes into account the "shape" of the cluster. Although points A and B appear to be equidistant from the mean, in terms of Mahalanobis distance A is much closer and therefore more likely to be a member of the cluster.
Once again, it is desirable to have a more statistical measure of a sample's potential to be an outlier than simple visual inspection. For score clusters, one can use the Mahalanobis distance. This is calculated as the distance of the potential outlier sample point from the mean of all the remaining points in the cluster. The distance is scaled by the range of variation of the cluster in all dimensions, so a probability weight can be assigned to the sample in terms of standard deviations. Any sample that lies more than 3 standard deviations from the mean can be considered suspicious. A small sketch of the calculation follows.
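A minimal sketch of the score-cluster test. For simplicity, this version measures each point against the mean and covariance of all points, including the candidate itself, rather than leaving each point out as described above:

```python
import numpy as np

def mahalanobis_outliers(scores, threshold=3.0):
    """Flag samples whose Mahalanobis distance from the cluster mean
    exceeds `threshold`; each row of `scores` is one sample."""
    diff = scores - scores.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(scores, rowvar=False))
    # d_i = sqrt((x_i - mu)^T  Sigma^{-1}  (x_i - mu))
    d = np.sqrt(np.einsum('ij,jk,ik->i', diff, cov_inv, diff))
    return np.nonzero(d > threshold)[0]
```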
Leverage and Studentized T-Test:
Another useful plot for identifying outliers is a plot of the Studentized concentration residuals versus the leverage value for each sample in the training set. The leverage value gives a measure of how important an individual training sample is to the overall model, while the Studentized residuals indicate how well the sample's predicted concentration is in line with its leverage. If a sample has a very high leverage compared to the rest of the training set, it is not necessarily an outlier; it could simply be a sample at the high or low end of the concentration range. However, if a sample has both a high leverage and a Studentized residual that is very different from the rest of the data set, it is most likely an outlier and can be eliminated.
Figure 6: A Leverage vs. Studentized Residual plot of the same cross-validation prediction of the model from Figure 4 above. Note that both the Studentized concentration residual and the leverage of sample #45 are significantly larger than for the remainder of the training set. This is further confirmation that this sample is an outlier.
Sample leverages are calculated from the factor scores in PCA/PCR and PLS models. It is a relatively simple calculation:

$$H = S\,(S^{T}S)^{-1}S^{T}, \qquad h_{i} = H_{ii}$$

where S is the n-by-f matrix of sample scores and H is the resulting n-by-n square matrix. As before, n is the number of samples in the training set and f is the number of factors in the model; the subscript i is the sample number in the training set. Note that the individual sample leverages h_i are the diagonal elements of the "Hat" matrix H.
The Studentized residual is then calculated by:

$$t_{i} = \frac{Cr_{i}}{s\,\sqrt{1 - h_{i}}}$$

where Cr_i are the concentration residuals of every sample in the training set, s is their standard deviation, and h_i is the sample's leverage.
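A short sketch of both calculations. The scale factor s is taken here as the sample standard deviation of the concentration residuals, one common convention; the source does not spell it out:

```python
import numpy as np

def leverages(S):
    """Diagonal of the Hat matrix H = S (S^T S)^-1 S^T, where S is
    the n-by-f matrix of sample scores."""
    H = S @ np.linalg.inv(S.T @ S) @ S.T
    return np.diag(H)

def studentized_residuals(Cr, h):
    """t_i = Cr_i / (s * sqrt(1 - h_i)) for concentration residuals
    Cr and leverages h; s is the residuals' standard deviation."""
    s = np.std(Cr, ddof=1)
    return Cr / (s * np.sqrt(1.0 - h))
```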