Anomaly Detection via Online Oversampling Principal Component Analysis
ABSTRACT
Anomaly detection has been an important research topic in data mining and
machine learning. Many real-world applications, such as intrusion or credit card
fraud detection, require an effective and efficient framework to identify deviated
data instances. However, most anomaly detection methods are typically
implemented in batch mode, and thus cannot be easily extended to large-scale
problems without sacrificing computation and memory efficiency. In this paper,
we propose an online over-sampling principal component analysis (osPCA)
algorithm to address this problem, and we aim at detecting the presence of outliers
in a large amount of data via an online updating technique. Unlike prior PCA-based
approaches, we do not store the entire data matrix or covariance matrix, and
thus our approach is especially of interest in online or large-scale problems. By
over-sampling the target instance and extracting the principal direction of the data,
the proposed osPCA allows us to determine the anomalousness of the target instance
according to the variation of the resulting dominant eigenvector. Since our osPCA
does not need to perform eigen analysis explicitly, the proposed framework is favored
for online applications with computation or memory limitations. Compared
with the well-known power method for PCA and other popular anomaly detection
algorithms, our experimental results verify the feasibility of our proposed method
in terms of both accuracy and efficiency.
Existing System
Statistical approaches assume that the data follow some standard or
predetermined distribution, and this type of approach aims to find the outliers
which deviate from such distributions. For distance-based methods, the distances
between each data point of interest and its neighbors are calculated; if the result is
above some predetermined threshold, the target instance is considered an
outlier. One representative of this type of approach is the density-based
local outlier factor (LOF), which measures the outlierness of each data instance. Based
on the local density of each data instance, the LOF determines the degree of
outlierness, which provides suspicious ranking scores for all samples. The most
important property of the LOF is its ability to estimate the local data structure via
density estimation, which allows users to identify outliers that are sheltered under
a global data structure.
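To make the distance-based idea concrete, the following Java sketch (a simplification, not the full LOF formulation) scores each point by its average distance to its k nearest neighbours and flags points whose score exceeds a user-chosen threshold. The class name, the toy data, the value of k, and the threshold are illustrative assumptions only.

import java.util.Arrays;

public class DistanceOutlier {

    // Euclidean distance between two points.
    static double dist(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return Math.sqrt(s);
    }

    // Average distance from data[idx] to its k nearest neighbours.
    static double knnScore(double[][] data, int idx, int k) {
        double[] d = new double[data.length - 1];
        int m = 0;
        for (int j = 0; j < data.length; j++) {
            if (j != idx) d[m++] = dist(data[idx], data[j]);
        }
        Arrays.sort(d);
        double sum = 0.0;
        for (int i = 0; i < k; i++) sum += d[i];
        return sum / k;
    }

    public static void main(String[] args) {
        // Four points form a dense group; the last one is deviated.
        double[][] data = {
            {1.0, 1.0}, {1.1, 0.9}, {0.9, 1.1}, {1.0, 1.2}, {5.0, 5.0}
        };
        int k = 2;                  // number of neighbours (assumption)
        double threshold = 1.0;     // application-dependent threshold (assumption)
        for (int i = 0; i < data.length; i++) {
            double score = knnScore(data, i, k);
            System.out.println("point " + i + " score " + score
                    + (score > threshold ? "  -> outlier" : ""));
        }
    }
}

In this toy example the first four points form a dense group, so only the fifth, deviated point should receive a large score and be flagged.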
Problems with the existing system:
Most distribution models are assumed to be univariate, and thus the lack of
robustness for multidimensional data is a concern. Moreover, since these
methods are typically implemented directly in the original data space, their
solution models might suffer from the noise present in the data.
Proposed System
PCA is a well-known unsupervised dimension reduction method which
determines the principal directions of the data distribution. Obtaining these
directions normally requires constructing the data covariance matrix and computing
its dominant eigenvectors, which is computationally expensive for high-dimensional
or large-scale data and would prohibit the use of our proposed framework in
real-world large-scale applications. Although the well-known power method is able
to produce approximated PCA solutions, it requires the storage of the covariance
matrix and cannot be easily extended to applications with streaming data or online
settings. Therefore, we present an online updating technique for our osPCA. This
updating technique allows us to efficiently calculate the approximated dominant
eigenvector without performing eigen analysis or storing the data covariance matrix.
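To clarify the over-sampling idea, the following Java sketch implements it in batch form: it places an extra over-sampling weight on the target instance, recomputes the dominant principal direction with the power method, and scores the target by st = 1 - |cos(u, u~)|, i.e. by how far the over-sampled direction rotates away from the original one. This sketch stores the covariance matrix for readability and therefore does not reproduce the storage-free online update of the proposed osPCA; the class name, the toy data, the over-sampling ratio of 0.1, and the iteration count are assumptions for illustration.

public class OversamplePCA {

    // Covariance of the rows of data, with extra weight w placed on the row at
    // index target (w = 0 gives plain PCA on the original data).
    static double[][] covariance(double[][] data, int target, double w) {
        int n = data.length, d = data[0].length;
        double total = n + w;
        double[] mean = new double[d];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < d; j++)
                mean[j] += data[i][j] * (i == target ? 1.0 + w : 1.0);
        for (int j = 0; j < d; j++) mean[j] /= total;

        double[][] cov = new double[d][d];
        for (int i = 0; i < n; i++) {
            double weight = (i == target ? 1.0 + w : 1.0);
            for (int a = 0; a < d; a++)
                for (int b = 0; b < d; b++)
                    cov[a][b] += weight * (data[i][a] - mean[a]) * (data[i][b] - mean[b]);
        }
        for (int a = 0; a < d; a++)
            for (int b = 0; b < d; b++)
                cov[a][b] /= total;
        return cov;
    }

    // Power method: dominant eigenvector of a symmetric matrix.
    static double[] powerMethod(double[][] m, int iters) {
        int d = m.length;
        double[] v = new double[d];
        v[0] = 1.0;
        for (int it = 0; it < iters; it++) {
            double[] next = new double[d];
            for (int a = 0; a < d; a++)
                for (int b = 0; b < d; b++)
                    next[a] += m[a][b] * v[b];
            double norm = 0.0;
            for (double x : next) norm += x * x;
            norm = Math.sqrt(norm);
            for (int a = 0; a < d; a++) next[a] /= norm;
            v = next;
        }
        return v;
    }

    // st = 1 - |cos(u, uTilde)|: large when over-sampling the target instance
    // rotates the dominant direction, i.e. when the target deviates from the rest.
    static double score(double[][] data, int target, double ratio, int iters) {
        double w = ratio * data.length;   // over-sampling weight
        double[] u = powerMethod(covariance(data, target, 0.0), iters);
        double[] uTilde = powerMethod(covariance(data, target, w), iters);
        double dot = 0.0;
        for (int j = 0; j < u.length; j++) dot += u[j] * uTilde[j];
        return 1.0 - Math.abs(dot);
    }

    public static void main(String[] args) {
        // Five roughly collinear points plus one deviated instance (last row).
        double[][] data = {
            {1.0, 1.0}, {1.1, 0.9}, {2.0, 2.1}, {3.0, 2.9}, {4.1, 4.0}, {0.5, 6.0}
        };
        for (int i = 0; i < data.length; i++)
            System.out.printf("instance %d  st = %.4f%n", i, score(data, i, 0.1, 50));
    }
}

Since the first five rows are roughly collinear while the last row deviates from them, over-sampling the last row should perturb the principal direction the most and yield the largest st value.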
ADVANTAGES:
1. The required computational costs and memory requirements are significantly reduced.
2. Our method is especially preferable in online, streaming-data, or large-scale problems.
IMPLEMENTATION
Implementation is the stage of the project when the theoretical design is
turned into a working system. It can therefore be considered the most critical
stage in achieving a successful new system and in giving the user confidence
that the new system will work and be effective.
The implementation stage involves careful planning, investigation of the
existing system and its constraints on implementation, design of methods to
achieve the changeover, and evaluation of those changeover methods.
Main Modules:
1. User Module:
In this module, users are given authentication and security controls for accessing
the details maintained by the system. Before accessing or searching the details, a
user must have an account; otherwise, the user must register first.
2. Computation or Memory Limitations:
In this module, the system does not need to store all the details in the database;
the interface simply passes the information from the client to the server, so no
extra memory space is required for storing the data.
3. Clustering:
The training data are selected only under our own assumptions, so there is a
possibility that some outlier data are treated as normal data by the previous method
because of the chosen training set. The clustering method is used to solve this
problem: clusters are formed from the input data instances, and the outlier
calculation is then applied within each cluster to find the outliers more exactly.
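A minimal sketch of this per-cluster step is shown below. It assumes that cluster labels have already been produced by some clustering step (for example k-means, not shown here) and reuses the hypothetical OversamplePCA.score method from the earlier sketch; the class name and the threshold are illustrative assumptions.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ClusterOutlier {

    // Flag outliers cluster by cluster; labels[i] is the cluster id of data[i].
    static boolean[] detect(double[][] data, int[] labels, double threshold) {
        boolean[] outlier = new boolean[data.length];

        // Group instance indices by their cluster label.
        Map<Integer, List<Integer>> byCluster = new HashMap<>();
        for (int i = 0; i < data.length; i++)
            byCluster.computeIfAbsent(labels[i], k -> new ArrayList<>()).add(i);

        // Apply the outlier calculation inside each cluster separately.
        for (List<Integer> members : byCluster.values()) {
            if (members.size() < 3) continue;   // too small to score against itself
            double[][] cluster = new double[members.size()][];
            for (int m = 0; m < members.size(); m++) cluster[m] = data[members.get(m)];
            for (int m = 0; m < members.size(); m++) {
                double st = OversamplePCA.score(cluster, m, 0.1, 50);
                if (st > threshold) outlier[members.get(m)] = true;
            }
        }
        return outlier;
    }
}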
4. Anomaly Detection:
This module detects the outlierness of the user input. When the user gives an
input to the system, the system calculates the st value for the new instance and
then compares it with the threshold value calculated earlier. If the st value of the
new data instance is above the threshold, the input is identified as an outlier and
is discarded by the system. Otherwise, it is considered a normal data instance, and
the PCA value for that particular data instance is updated accordingly.
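The decision rule described above can be sketched as follows, again reusing the hypothetical OversamplePCA.score from the earlier sketch; the class and method names, the over-sampling ratio, and the threshold are assumptions for illustration only.

import java.util.Arrays;

public class AnomalyDetector {

    // Score a new instance x against the current normal data; discard it when its
    // st value exceeds the threshold, otherwise accept it for future updates.
    static double[][] handleNewInstance(double[][] normalData, double[] x, double threshold) {
        double[][] extended = Arrays.copyOf(normalData, normalData.length + 1);
        extended[normalData.length] = x;
        double st = OversamplePCA.score(extended, normalData.length, 0.1, 50);

        if (st > threshold) {
            System.out.println("outlier (st = " + st + "), instance discarded");
            return normalData;   // model left unchanged
        }
        System.out.println("normal (st = " + st + "), instance accepted");
        return extended;         // accepted instance joins the data for later estimates
    }
}

If the instance is accepted, the returned array (including the new instance) would be used for the next principal-direction estimate, which matches the module's description of updating the PCA value for normal data.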
System Configuration:

H/W System Configuration:

Processor      -  Pentium III
Speed          -  1.1 GHz
RAM            -  256 MB (min)
Hard Disk      -  20 GB
Floppy Drive   -  1.44 MB
Key Board      -  Standard Windows Keyboard
Mouse          -  Two or Three Button Mouse
Monitor        -  SVGA
S/W System Configuration:

Operating System       :  Windows 95/98/2000/XP
Application Server     :  Tomcat 5.0/6.x
Front End              :  HTML, Java, JSP
Scripts                :  JavaScript
Server-side Script     :  Java Server Pages
Database               :  MySQL 5.0
Database Connectivity  :  JDBC