Outlier Study in Dataset
Data that do not comply with the general behavior or model of the data are known as outliers. These data objects are grossly different from the other objects in the cluster or data set and are highly inconsistent with the overall trend. The figures below provide an example: Figure 1 shows a dataset without any outlier, and Figure 2 shows the same dataset with one additional outlier data point. The presence of the outlier data point completely changes the regression curve.
Figure 1: Dataset with no outliers
Figure 2: Same dataset with an outlier and the changed regression curve
There are many data mining techniques for handling outliers in data sets; this task is called outlier mining. However, an outlier might carry some importance within the dataset, and by eliminating the outlier data we might lose information about the whole data set.
Applications of outlier detection:
- Detecting fraud through unusual credit card usage
- Detecting fraudulent use of another person's telephone code
- Determining targets for customized marketing
- Finding unusual responses to various medical treatments
- Detecting the malfunction of a data recorder while it is working
Outlier mining:
- Define the data that could be considered outliers.
With a regression model used for data modeling, the analysis of residuals provides information for estimating the extremeness of each data point. Outliers are difficult to find in time-series data.
- Determine an efficient method to mine the outliers so defined.
Data visualization is a good method of outlier determination; it falters only when the dataset contains many categorical attributes. Computer-based methods for finding outliers are widely used, such as the statistical approach, the distance-based approach, and the deviation-based approach.
Statistical approach:
- Inherent alternative distribution: Arrange the data as a histogram. An exceedingly long tail suggests the presence of outliers.
A scatter plot is also a very good way to spot outliers. The figures above are a good example of finding outliers by the statistical method: the scatter plots were drawn using the whole dataset, and only one data point lay very far from the group, so the outlier was very easy to recognize.
Some researchers also determine outliers by fixing a confidence level while analyzing the dataset: data lying outside the chosen confidence band around the regression line (say 5%) are considered outliers. A small sketch of this residual-based test follows the list below.
- Mixture alternative distribution:
- Slippage alternative distribution:
- Block procedure:
- Consecutive procedure:
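As a minimal sketch of the residual-and-confidence-level idea above (the 5% two-sided level is implemented here as a ±1.96-sigma cutoff on standardized residuals under a normality assumption; the cutoff and the data are illustrative choices, not from the text):

```python
import numpy as np

def regression_outliers(x, y, z_thresh=1.96):
    """Flag points whose residual from a least-squares line is extreme.

    z_thresh = 1.96 approximates a two-sided 5% level under a
    normality assumption (an illustrative choice).
    """
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    z = residuals / residuals.std(ddof=1)   # residuals have zero mean
    return np.abs(z) > z_thresh

# Illustrative data: a linear trend with one injected outlier
rng = np.random.default_rng(1)
x = np.arange(12.0)
y = 3.0 * x + rng.normal(0.0, 1.0, size=x.size)
y[7] += 15.0
print(np.nonzero(regression_outliers(x, y))[0])  # expected: [7]
```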
Distance based approach:
- Index based algorithm: In this approach, multi-dimensional indexing structures such as R-trees or k-d trees are used to search for the neighbors of each object. While searching for the neighbors of an object o within radius d, if M + 1 neighbors are found (where M is the maximum number of objects allowed in the d-neighborhood of an outlier), then o is not considered an outlier. For higher-dimensional data sets this proves to be very time consuming. A sketch of this test follows the list.
- Nested loop algorithm: It follows the same procedure as the index based approach; however, it divides the memory buffer into two halves and tries to minimize the number of input/output operations, thus reducing the computational time for determining the outliers.
- Cell-based algorithm: This approach reduces the problems the index based approach faces when determining outliers in higher-dimensional data sets.
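A minimal sketch of the index based test (using scipy's cKDTree as the k-d tree; the radius d, the cutoff M, and the data are illustrative choices):

```python
import numpy as np
from scipy.spatial import cKDTree

def distance_based_outliers(points, d, M):
    """Indices of objects with at most M neighbors within radius d;
    an object with M + 1 or more neighbors is not an outlier.
    The object itself is excluded from its own neighbor count."""
    tree = cKDTree(points)
    neighbor_counts = np.array(
        [len(tree.query_ball_point(p, r=d)) - 1 for p in points]
    )
    return np.nonzero(neighbor_counts <= M)[0]

rng = np.random.default_rng(0)
cluster = rng.normal(0.0, 1.0, size=(50, 2))      # one dense cluster
data = np.vstack([cluster, [[8.0, 8.0]]])         # plus one distant point
print(distance_based_outliers(data, d=2.0, M=3))  # expected: [50]
```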
Deviation based outlier detection:
This approach uses neither statistical tests nor distance-based measures to identify exceptional objects. Instead, it identifies outliers by examining the main characteristics of the objects in a group; objects that deviate from this description are considered outliers. It is performed using two techniques.
- Sequential exception technique: By exploiting the implicit redundancy of the data, a human can distinguish the outlier or misfit data. Given a set S of n objects, the technique builds a number of sequential subsets; by examining these subsets, we can determine the outliers. The analysis relies on the following notions (a simplified sketch follows the list):
o Exception set
o Dissimilarity function
o Cardinality function
o Smoothing factor
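A deliberately simplified sketch of the sequential exception idea: here each single element is a candidate exception set, the dissimilarity function is the variance, and the smoothing factor is the cardinality of the remainder times the drop in dissimilarity when the candidate is removed. These concrete choices are mine, for illustration only:

```python
import numpy as np

def sequential_exception(values):
    """Return the index whose removal yields the largest smoothing
    factor SF(I) = card(S - I) * (D(S) - D(S - I)), using variance as
    the dissimilarity function D and single-element exception sets."""
    values = np.asarray(values, dtype=float)
    n = values.size
    total_dissim = values.var()
    best_idx, best_sf = -1, -np.inf
    for i in range(n):
        rest = np.delete(values, i)
        sf = (n - 1) * (total_dissim - rest.var())  # smoothing factor
        if sf > best_sf:
            best_idx, best_sf = i, sf
    return best_idx, best_sf

print(sequential_exception([2.1, 2.0, 1.9, 2.2, 9.5]))  # index 4 wins
```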
OLAP data cube technique:
Other useful methods of determining the outliers in practical datasets:
Concentration residuals:
When the optimum number of factors for the model has been determined, the predicted concentrations of each training sample from the cross-validation rotation with the selected factor model can be used for outlier detection. The difference between the actual and predicted concentrations for a sample is known as the concentration residual.
When the calibration calculations are performed, the model attempts to account for all the variations in the training data. Therefore, the prediction error of most of the samples should be approximately the same. Samples that have significantly larger concentration residuals than the rest of the training set are known as concentration outliers.
This is related to the BPNN predictive model we developed for yield prediction: in some cases a sample's prediction error differs widely from the actual-versus-predicted variation of all the other samples, and we can therefore call that sample an outlier in the dataset.
Figure 3: A plot of the concentration residual of the predicted hydroxyl number versus the sample number for a cross-validation of a PLS model built from 35 FT-NIR spectra. Note that sample 31 has a significantly different residual than the remainder of the data set, indicating that it is probably an outlier.
This type of outlier generally arises when the experimenter either makes a mistake in creating the calibration mixtures or errs in the analysis of the sample by the primary calibration technique used to generate the calibration concentration values. Another possibility, which occurs frequently, is a transcription error: the analyst simply types in the wrong concentration value when building the computerized training set.
Looking at Figure 3 above, it is easy to see that sample #31 is significantly different from the rest of the training set and is most likely a concentration outlier. However, outliers in most data sets will not be as obvious as this. While the human eye is excellent at discerning patterns in data, visual inspection is not always a valid basis for a decision of this type. What is really needed is a mathematical way to accurately determine the likelihood that a sample is really an outlier.
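One simple mathematical screen of the kind called for above: flag any sample whose concentration residual lies more than 3 standard deviations from the mean residual. The 3-sigma cutoff is an illustrative convention, not a value from the text:

```python
import numpy as np

def concentration_outliers(actual, predicted, z_thresh=3.0):
    """Flag samples whose concentration residual (actual - predicted,
    e.g. from a cross-validation) is far larger than the rest."""
    residuals = np.asarray(actual) - np.asarray(predicted)
    z = (residuals - residuals.mean()) / residuals.std(ddof=1)
    return np.nonzero(np.abs(z) > z_thresh)[0]
```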
Spectral residuals:
Another powerful tool in seeking out outlier samples is the spectral residual. Similar to
looking for concentration outliers, spectral outliers are detected by using a model for which the
optimum number of factors has been determined by a cross validation.
When each sample is predicted, a set of scores is found that best fits the model loading
vectors to the sample spectrum. By using the calculated scores and the calibration loading vectors, a
new model-reconstructed spectrum can be calculated. The spectral residual is the difference between this reconstructed spectrum and the actual prediction spectrum and is calculated as

$$R_{spec} = \sum_{i=1}^{p} \left( A_{orig,i} - A_{pred,i} \right)^{2}$$

where p is the number of wavelengths (data points) in the spectrum, A_orig are the original spectrum absorbances, and A_pred are the model-predicted spectrum absorbances.
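A minimal sketch of the calculation, assuming a linear factor model with orthonormal loading vectors so that the best-fit scores are simple projections (the matrix shapes and names are mine):

```python
import numpy as np

def spectral_residuals(A, loadings):
    """Per-sample spectral residual: project each spectrum onto the
    model loading vectors, reconstruct it from the scores, and sum
    the squared differences over the p wavelengths.

    A        : (n, p) matrix of sample spectra (absorbances)
    loadings : (f, p) matrix of loading vectors (orthonormal rows)
    """
    scores = A @ loadings.T        # best-fit scores for each sample
    A_pred = scores @ loadings     # model-reconstructed spectra
    return np.sum((A - A_pred) ** 2, axis=1)
```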
Figure 4: A plot of the spectral residual versus sample number for a cross-validation of a PLS model of Research Octane Number (RON) built from 57 NIR spectra of gasoline. Notice that sample number 45 has a significantly different residual than the remainder of the set, indicating that it is a possible outlier.
As with concentration residuals, samples that have significantly higher spectral residuals
than the rest of the training set may be outliers. Spectral outliers can be caused by many different
factors including inconsistent sample handling, changes in the performance of the instrument, or
anything that contributes to a significant change in the spectrum of a given sample.
Cluster analysis:
There are other methods of outlier detection that are more abstract but equally valuable. Cluster analysis is a method used to look for samples whose scores are inconsistent with those of the other samples in the training set. In this technique, the scores of one loading vector are plotted versus the scores of another vector for every sample in the training set. If all the samples in the training set are similar in composition and calibration value, the data points will tend to "cluster" about some mean value. If a sample point lies significantly outside this cluster, the ratio of the two factor scores for this sample is inconsistent with the other spectra in the training set, and the sample may be an outlier.
There is, however, one exception: samples that lie at the ends of the calibration concentration range
(i.e., the sample contains the highest or the lowest concentration of a constituent) can be expected to
lie at the extreme limits of the cluster. An extreme sample will sometimes appear as an outlier, even
though it may not be one at all.
Figure 5: The Mahalanobis distance of a point is measured from the mean point of the cluster (indicated by X). Unlike an absolute distance measurement, it takes into account the "shape" of the cluster. Although points A and B appear to be equidistant from the mean, in terms of Mahalanobis distance A is much closer and therefore more likely to be a member of the cluster.
Once again, it is desirable to have a more statistical measure of a sample's potential to be an outlier than simple visual inspection. For score clusters, one can use the Mahalanobis distance. This is calculated as the distance of the potential outlier sample point from the mean of all the remaining points in the cluster. The distance is scaled by the range of variation of the cluster in all dimensions, so a probability weight can be assigned to the sample in terms of standard deviations. Any sample that lies more than 3 standard deviations from the mean can be considered suspicious. A small sketch of the calculation follows.
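A minimal sketch of the score-cluster test. For simplicity, this version measures each point against the mean and covariance of all points, including the candidate itself, rather than leaving each point out as described above:

```python
import numpy as np

def mahalanobis_outliers(scores, threshold=3.0):
    """Flag samples whose Mahalanobis distance from the cluster mean
    exceeds `threshold`; each row of `scores` is one sample."""
    diff = scores - scores.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(scores, rowvar=False))
    # d_i = sqrt((x_i - mu)^T  Sigma^{-1}  (x_i - mu))
    d = np.sqrt(np.einsum('ij,jk,ik->i', diff, cov_inv, diff))
    return np.nonzero(d > threshold)[0]
```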
Leverage and Studentized T-Test:
Another useful plot for identifying outliers is a plot of the Studentized concentration residuals versus the leverage value for each sample in the training set. The leverage value gives a measure of how important an individual training sample is to the overall model, while the Studentized residuals indicate how well the sample's predicted concentration is in line with its leverage. If a sample has a very high leverage compared to the rest of the training set, it is not necessarily an outlier; it could simply be a sample at the high or low end of the concentration range. However, if a sample has both a high leverage and a Studentized residual that is very different from the rest of the data set, it is most likely an outlier and can be eliminated.
Figure 6: A Leverage vs. Studentized Residual plot of the same cross-validation prediction of the model from Figure 4 above. Note that both the Studentized concentration residual and the leverage of sample #45 are significantly larger than for the remainder of the training set. This is further confirmation that this sample is an outlier.
Sample leverages are calculated from the factor scores in PCA/PCR and PLS models. It is a relatively simple calculation:

$$H = S\,(S^{T}S)^{-1}S^{T}, \qquad h_{i} = H_{ii}$$

where S is the n-by-f matrix of sample scores and H is the resulting n-by-n square matrix. As before, n is the number of samples in the training set and f is the number of factors in the model; the subscript i is the sample number in the training set. Note that the individual sample leverages h_i are the diagonal elements of the "Hat" matrix H.
The Studentized residual is then calculated by:

$$t_{i} = \frac{Cr_{i}}{s\,\sqrt{1 - h_{i}}}$$

where Cr_i are the concentration residuals of every sample in the training set, s is their standard deviation, and h_i is the sample's leverage.
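A short sketch of both calculations. The scale factor s is taken here as the sample standard deviation of the concentration residuals, one common convention; the source does not spell it out:

```python
import numpy as np

def leverages(S):
    """Diagonal of the Hat matrix H = S (S^T S)^-1 S^T, where S is
    the n-by-f matrix of sample scores."""
    H = S @ np.linalg.inv(S.T @ S) @ S.T
    return np.diag(H)

def studentized_residuals(Cr, h):
    """t_i = Cr_i / (s * sqrt(1 - h_i)) for concentration residuals
    Cr and leverages h; s is the residuals' standard deviation."""
    s = np.std(Cr, ddof=1)
    return Cr / (s * np.sqrt(1.0 - h))
```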