ABCDE – Alarm Basic Correlations Discovery Environment
Oliver Jukić
Virovitica College
Virovitica, Republic of Croatia
[email protected]
Marijan Kunštić
University of Zagreb
Faculty of Electrical Engineering and Computing
Zagreb, Republic of Croatia
[email protected]
Abstract – Alarms generated by a telecommunication network
are processed by network personnel who are required to
respond within a reasonable time interval. When a global
network problem occurs, it is represented as a sequence of
alarms coming from one or more network elements.
That sequence is typically not recognized as a global problem,
or the presence of a global problem is detected, but not its real
nature. The reason is the huge number of generated alarms
“bombarding” the operator. Automatic recognition of
network problems is therefore very useful for network
monitoring processes. Automatic recognition and detection can
be done by simple IF-THEN correlation rules applied to the
incoming alarm stream. The difficulty lies in recognizing
potential correlation rule candidates. In our previous work, we
identified an implementation of the Apriori algorithm as a
potential improvement in correlation rule detection. This
paper describes an architecture proposal for the Alarm Basic
Correlations Discovery Environment and opens a discussion of
some implementation aspects.
Keywords: Network problem, alarm correlation, correlation
rule
I. INTRODUCTION
When we talk about the customer experience of
service quality, fault management is one of the most
relevant network management functional areas. Fault
management primarily covers the detection, isolation and
correction of unusual operational behavior of the
telecommunication network and its environment [6].
Typically, alarms from the whole network are delivered to
the network operation and management center, where
they are processed by a network operator. In that case, we
talk about centralized fault or network management.
After a problem appears, the network generates a
large number of unsolicited events, called alarms, carrying
information about the malfunction. The network
operator’s reaction time depends on many factors. One of
the most important tasks is to recognize the problem’s
root cause by correlating incoming alarms. Alarms can be
correlated by their starting/ending time (when did the alarm
start/end?), location (where did the alarm happen?),
probable cause (what is the nature of the alarm?) or by
another criterion, i.e. another alarm attribute.
Correlation engines rely on predefined correlation rules,
which correlate alarms by the mentioned criteria and are
usually written in an IF-THEN manner. In this paper, we will
refer to that kind of rules as high-level correlation rules.
The most challenging task is to create appropriate correlation
rules, based on knowledge of network equipment and structure.
Sometimes correlation rules are not created because there are
too few network operation personnel or because knowledge of
correlation rule creation is lacking, even when correlation
tools do exist. Hence, correlation capabilities are not used
as much as possible in network operation centers. This fact
has an implicit impact on the service quality delivered to
customers.
High-level correlation rules recognize network
problems by correlating incoming alarms. But it is not enough
to create high-level correlation rules only. Alarm filtration
plays a great role in alarm reduction. Alarm filtration
should be performed before alarm correlation in order to
eliminate irrelevant alarms. For instance, during a
scheduled maintenance action, it is reasonable to ignore
alarms from the network elements under maintenance.
Filtration increases the efficiency of alarm correlation,
while the total number of alarms presented to the network
operator decreases. An operator who has to cope with fewer
alarms can work more reliably and efficiently.
Besides high-level correlation rules, there are a number of
typical patterns that can be recognized at the low level. For
instance, alarms coming from a certain network element
within the same time interval can be treated as “multiple”
alarms. Some network problems are presented as “jittering”
of an alarm. In that case the first alarm indicates the
problem, and all other “jittering” alarms can be “hidden”
behind the first alarm. Implementation of low-level
correlations will also decrease the total number of presented
alarms.
Network problems manifest themselves as an alarm
sequence. Since network problems repeat more or less
frequently, processing of alarm sequences from the alarm
history can be a good basis for the creation of correlation
rules that will be used in the future, when the same problem
appears again.
Commercial network management tools usually have the
capability to perform alarm data correlation. Correlation
rules are loaded as input for the alarm correlation process.
That is, the tools are only a framework; the correlation
knowledge built into them has to be supplied. In order to be
built in, that knowledge must first exist in a human’s mental
picture. One of the axioms of this paper is that the presence
of human beings is irreplaceable in the process of correlation
rule detection. However, automation of the analysis of
previous alarm streams is welcome.
In our previous work, we proposed the creation of
correlation rules from historical data. The main theoretical
concepts have been described in [1], [2] and [7], and they
are not the subject of this paper. Rather, we focus on
the potential architecture of a (basic) alarm correlation
rules discovery system.
This paper gives an architecture proposal for ABCDE –
Alarm Basic Correlations Discovery Environment.
The environment is already partially implemented,
while complete implementation and integration into a
telecommunication operator’s network management center
will be the subject of future work. Some important aspects
of ABCDE have been tested using alarm data obtained from a
real telecommunication network.
II. ALARM BASIC CORRELATIONS DISCOVERY
ENVIRONMENT ARCHITECTURE
A. ABCDE architecture overview
The basic ABCDE architecture is shown in Fig. 1.
Fig. 1. Basic ABCDE architecture
Incoming network alarms are generated by the
telecommunication network. Alarms are consumed and
processed by the alarm processing engine, which performs
alarm filtration as well as low-level and high-level
correlation. Processed alarms are presented to the network
operator through the alarm surveillance GUI. The alarm
processing engine uses correlation and filtration rules stored
in a database, while incoming alarms are stored in the alarm
data warehouse. A logical inventory database containing data
about network interconnections can be used for more efficient
alarm correlation. Logical inventory data can also be used to
enrich incoming alarm data, tying relevant inventory
information to the alarm data (for instance, a “friendly”
alarm location name). The alarm processing engine is not the
focus of this paper, since a number of commercial tools are
able to perform alarm processing functions.
The alarm data warehouse is a database containing all raw
alarm history data as well as correlated alarm history data
for a certain time period, predefined by the operator (e.g. 2
years). The alarm data warehouse is the starting point for the
discovery and analysis of typical correlations in historical
alarm data, in order to include them in the correlation and
filtration rules database.
The correlation and filtration rules database contains data
about correlations and filtrations to be performed in real
time by the alarm processing engine. Rules in this
database are proposed by the correlation discovery and
analysis module. This module can be used for the discovery of
new potential rules by running a data mining algorithm on
historical alarm data. It can also be used for the analysis
and evaluation of potential rule candidates, by executing a
rule on a sample of historical alarm data.
The filtration part of the correlation discovery and analysis
module discovers and evaluates potential filter patterns. Not
all incoming alarms are relevant for further processing.
Alarm classification and filtration are described in detail in
[11] and will not be discussed further here. Filtering
is also not always statically tied to a predefined, concrete
network element; rather, it can change dynamically,
based on certain circumstances in the network, such as a
scheduled maintenance procedure on some network
elements.
After filtration is done on historical alarm data, low-level
correlation discovery and evaluation can be performed. This
is primarily related to the discovery of general patterns,
such as alarm overlapping or alarm jittering.
High-level correlation deals with concrete alarm
patterns coming from specific network elements. At this
stage, raw alarm clusters are detected first. An alarm cluster
is a set of alarms received from the network within a certain
time interval bounded by cluster borders. Namely, we
detect “long enough” time periods without alarms. Those
periods are considered cluster borders. All alarms situated
between two cluster borders belong to the same cluster [2].
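The border rule just described can be illustrated with a minimal sketch. It assumes that a "period without alarms" means a period in which no alarm is active, and that the border length is a tunable parameter; the `Alarm` fields, the function name and the `min_gap` value are hypothetical and are not taken from the ABCDE implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    location: str   # network element that raised the alarm
    cause: str      # probable cause
    start: float    # start time in seconds
    end: float      # end time in seconds


def detect_raw_clusters(alarms, min_gap):
    """Split alarms into clusters separated by silent periods >= min_gap."""
    clusters = []
    current = []
    busy_until = None  # latest end time seen in the current cluster
    for alarm in sorted(alarms, key=lambda a: a.start):
        if current and alarm.start - busy_until >= min_gap:
            clusters.append(current)  # a cluster border was crossed
            current = []
            busy_until = None
        current.append(alarm)
        busy_until = alarm.end if busy_until is None else max(busy_until, alarm.end)
    if current:
        clusters.append(current)
    return clusters


# Invented sample history: three alarms close together, one much later.
history = [
    Alarm("BTS-1", "link failure", 0, 40),
    Alarm("BTS-2", "link failure", 10, 50),
    Alarm("BSC-1", "signalling loss", 55, 90),
    Alarm("BTS-7", "power problem", 1000, 1030),
]
clusters = detect_raw_clusters(history, min_gap=300)
cluster_locations = [[a.location for a in c] for c in clusters]
print(cluster_locations)
```

With a 300-second border, the first three alarms fall into one raw cluster and the late alarm forms a cluster of its own.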
A cluster is the input for the Apriori algorithm, but
in order to improve algorithm performance, we have
proposed using logical network inventory data to split
raw clusters into smaller parts containing alarms from
interconnected alarm locations only. In that case, all
interconnections are taken into consideration while
creating alarm clusters: the total number of clusters
increases, while the average number of alarms in one cluster
decreases. This will drastically improve the performance of
the data mining algorithm.
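One possible reading of the splitting step is sketched below. It assumes that two alarms stay in the same sub-cluster when their locations are linked directly or through other alarming locations, i.e. it takes connected components of the inventory graph restricted to the cluster. The function name, the data layout and the sample links are hypothetical.

```python
def split_cluster(cluster, links):
    """Split one raw cluster into groups of alarms from interconnected locations.

    cluster: list of (location, alarm_type) pairs
    links:   set of (loc_a, loc_b) tuples with loc_a < loc_b (logical inventory)
    """
    locations = {loc for loc, _ in cluster}
    group_of = {}   # location -> index of its group
    groups = []
    for root in sorted(locations):
        if root in group_of:
            continue
        # Depth-first search over links whose both ends alarmed in this cluster.
        members, stack = set(), [root]
        while stack:
            loc = stack.pop()
            if loc in members:
                continue
            members.add(loc)
            stack.extend(
                other for other in locations
                if other not in members and tuple(sorted((loc, other))) in links
            )
        for loc in members:
            group_of[loc] = len(groups)
        groups.append([])
    # Keep the original alarm order inside each group.
    for loc, alarm_type in cluster:
        groups[group_of[loc]].append((loc, alarm_type))
    return groups


# Invented inventory: two base stations attached to one controller.
inventory = {("BSC-1", "BTS-1"), ("BSC-1", "BTS-2")}
raw_cluster = [
    ("BTS-1", "link failure"),
    ("BTS-9", "power problem"),
    ("BSC-1", "signalling loss"),
    ("BTS-2", "link failure"),
]
parts = split_cluster(raw_cluster, inventory)
print(parts)
```

Here the unrelated BTS-9 alarm is separated from the three alarms around BSC-1, so the mining step sees two smaller clusters instead of one larger one.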
Logical inventory data should be obtained from the network
operator. However, if it is not obtainable, there is a
proposed technique for extracting logical inventory data from
the alarm history. It was described in [7], and it is not the
primary focus of this paper. However, it is denoted in Fig. 1
by the Logical inventory block.
When clusters are generated, the Apriori algorithm is
performed. The final result is a number of alarm
sequences that occurred frequently in the past. Those
sequences are potential high-level correlation rule
candidates for future alarm processing. The criterion for
accepting those candidates can be rule frequency, but
a rule can also be accepted based on a network expert’s
opinion.
B. Low-level correlations
After alarm filtration is performed, low-level correlations
are performed. Low-level correlations are not related
to concrete network elements or alarm types; rather, we
discover general alarm behavior patterns.
A typical behavior is alarm jittering: for some reason, a
certain network element may jitter between the alarming and
non-alarming state. It is presented to the network operator
as a number of (short) alarms with short periods
between the end of one alarm and the start of the next.
We will refer to a sequence of jittering alarms as “chained”
alarms.
Another such behavior is related to alarm overlapping.
Generally, two alarms can be overlapped completely or
partially, or not overlapped at all. Even in the last case,
the time interval between the two alarms plays a great role:
Fig. 2. Alarm overlapping patterns (panels a to g)
At low-level correlation, completely and partially
overlapped alarms (Fig. 2 a, b, c, d, e) coming from the
same network element (and, optionally, with the same
probable cause) can be considered as one alarm with
value-added information “attached” to it: the number of alarms
lying behind it.
Alarms that are not overlapped have an important parameter
related to them: the time between the end of the first alarm
and the start of the second alarm. If that time is short
enough, the two alarms can be treated as only one alarm,
ignoring the end of the first and the start of the second
alarm.
Combining those two typical patterns, and removing all
hidden alarms from the operator’s GUI, increases the number of
reduced alarms. In Fig. 3, there are 8 alarms coming
from the same network element within a certain time period.
Some of those alarms are overlapped, while some are not.
In the case of the non-overlapped alarms, the time interval
between them is short enough. The “alarm storm” can hence be
replaced by only one alarm with value-added information,
removing as many as 7 alarms from the operator’s graphical
interface:
Fig. 3. Reduction of alarms by low-level correlations
C. High-level correlations: raw-cluster detection
After filtration and low-level correlation processing, the
incoming alarm stream is “clustered”: alarm clusters
containing alarms potentially belonging to the same
network problem are detected. Alarm cluster detection
is described in [2]. The important thing is that the alarm
clusters are divided by time intervals without alarms.
D. High-level correlations: cluster splitting
Typically, a network problem is represented by a
number of alarms coming from one or more network
elements. If the alarms are coming from more than one
network element, it is reasonable to expect that the network
elements are interconnected. If we have a logical inventory
database at our disposal (i.e., a database where information
about network element interconnections is stored), we can
try to include it in the discovery environment. How? We
can consider only the clusters containing alarms from
interconnected network elements.
Since a logical inventory database is not always
available, there is a possibility to “generate” it, based on
the alarm historical data. In that case, we will first analyze
alarms by their location only. After that analysis we will
have information about the most frequent points of
interconnection. This data can be stored in a logical
inventory database (using a predefined threshold) and can
be used in the cluster splitting process in the future. This
concept is described in [7].
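The idea of “generating” inventory data can be illustrated as follows. This is only a sketch under our own assumptions and not the procedure of [7]: it assumes that two locations are treated as interconnected when they appear together in at least `threshold` historical clusters. The function name, the threshold value and the sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def infer_links(clusters, threshold):
    """Guess interconnections from how often two locations alarm together.

    clusters:  iterable of collections of alarm locations, one per cluster
    threshold: minimum number of shared clusters for a pair to count as linked
    """
    pair_counts = Counter()
    for locations in clusters:
        # Count each unordered pair of distinct locations once per cluster.
        for pair in combinations(sorted(set(locations)), 2):
            pair_counts[pair] += 1
    return {pair for pair, n in pair_counts.items() if n >= threshold}


# Invented history: BTS-1 and BSC-1 alarm together in three clusters.
location_history = [
    ["BTS-1", "BSC-1"],
    ["BTS-1", "BSC-1", "BTS-9"],
    ["BTS-2", "BSC-1"],
    ["BTS-1", "BSC-1"],
]
inferred = infer_links(location_history, threshold=3)
print(inferred)
```

Only the pair that co-occurs often enough survives the threshold; pairs seen together once are treated as coincidence.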
E. High-level correlations: Apriori algorithm
The mining of association rules is potentially very
interesting for detection of specific alarm “clusters” that can
represent a global network problem. What was the original
motivation for researching association rules? Let us
imagine a supermarket serving a huge number of customers
every day. The supermarket manager is responsible for all
business aspects, including special offers and promotions.
For instance, the manager can decide to offer a discount on
chips to every customer buying 6 beers. The previously
mentioned special offer seems to be very logical, based on
our daily experience. However, there are a number of such
association rules that cannot be perceived by casual
observation. Hence, the manager is forced to analyze the
supermarket’s transaction data (i.e., customer receipt
archive or database) – to examine customer behavior while
purchasing products. The result of such analysis is a set of
typical association rules describing how often items are
purchased together. For instance, rule “Beer ⇒ Chips
(80%)” states that four of five customers buying beer are
also buying chips [3]. That result can be useful for business
decisions related to marketing, pricing and product
promotion.
We have considered our alarms as products purchased in
a supermarket, and alarm clusters as baskets from a specific
customer. Hence we have decided to use the Apriori
algorithm in order to find and recognize specific alarm
sequences – potential correlation rules for the future [2].
The Apriori algorithm itself is described in a number of
papers such as [3].
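To make the basket analogy concrete, the sketch below runs a textbook level-wise Apriori search over alarm clusters treated as transactions. It is an illustration only, not the ABCDE implementation: it finds frequent sets of alarm types and ignores their order, and the alarm names and the `min_support` value are invented.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: count} for itemsets found in >= min_support transactions."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_support}

    items = {frozenset([i]) for t in transactions for i in t}
    frequent = count(items)
    result = dict(frequent)
    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: unite frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        frequent = count(candidates)
        result.update(frequent)
        k += 1
    return result


# Each cluster is one "basket" of alarm types (invented data).
alarm_clusters = [
    {"LINK_DOWN", "SYNC_LOSS", "POWER"},
    {"LINK_DOWN", "SYNC_LOSS"},
    {"LINK_DOWN", "SYNC_LOSS", "DOOR_OPEN"},
    {"POWER", "DOOR_OPEN"},
]
frequent_sets = apriori(alarm_clusters, min_support=2)
pair = frozenset({"LINK_DOWN", "SYNC_LOSS"})
# Confidence of LINK_DOWN => SYNC_LOSS: support(pair) / support(LINK_DOWN).
confidence = frequent_sets[pair] / frequent_sets[frozenset({"LINK_DOWN"})]
print(frequent_sets[pair], confidence)
```

In this toy history the pair LINK_DOWN and SYNC_LOSS occurs in three of four clusters, which would make it a rule candidate to be judged by a network expert.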
The final result of high-level correlations is the creation
of a correlation rules database. Rules are structured in an
IF-THEN manner. It means that the alarm processing
engine receives the incoming alarm stream and matches
incoming patterns against the existing patterns in the
correlation rules database. When a pattern is matched, a new
alarm is generated containing information about the real
network root-cause problem.
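The matching step can be sketched as below. The rule structure, the names and the choice to hide the matched alarms behind the newly generated one are assumptions made for illustration; the actual rule format used by the alarm processing engine is not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    if_alarms: frozenset   # alarms that must all be present (IF part)
    then_alarm: str        # root-cause alarm to raise (THEN part)


def apply_rules(window, rules):
    """Replace matched alarm patterns in a window by their root-cause alarm."""
    remaining = set(window)
    raised = []
    for rule in rules:
        if rule.if_alarms <= remaining:
            remaining -= rule.if_alarms      # hide the correlated alarms
            raised.append(rule.then_alarm)   # present the root cause instead
    return raised + sorted(remaining)


# Invented rule and alarm window.
rules = [
    Rule(
        frozenset({"BTS-1 link failure", "BTS-2 link failure"}),
        "BSC-1 transmission problem",
    ),
]
window = ["BTS-1 link failure", "BTS-2 link failure", "BTS-9 door open"]
presented = apply_rules(window, rules)
print(presented)
```

The operator would then see two alarms instead of three: the generated root-cause alarm and the unrelated one.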
III. IMPLEMENTATION ASPECTS AND
EXPERIMENTAL RESULTS
A. Programming languages and techniques
ABCDE components are developed in the C and C++
programming languages, as parts of a complex application.
The central application component is an executable file that
loads different dynamic-link libraries (DLLs).
Every part is implemented as a separate DLL, which
allows separate components to be upgraded without
disturbing the general application structure.
For database access we have used Open Database
Connectivity (ODBC), with all data stored in an MS SQL Server
database. We have used the standard MFC database
classes, but other techniques could be used as well.
B. Experimental results
The experimental proof of concept was done on a real alarm
data sample, obtained from a GSM operator in the region.
The data covers a one-month period, dating from November 2002.
However, even such old data is useful for this experimental
work.
The data came from the access network of a GSM system. Base
stations are connected to Base Station Controllers via a
multiplexing transmission system. In this case, connections
were realized using microwave radio transmission links.
Hence we have the opportunity to find really interesting
patterns, potentially caused by severe weather conditions
affecting transmission performance.
The total number of incoming alarms for processing was
36639. At low-level correlation, we evaluated the
number of reduced alarms for two typical patterns:
overlapped and chained alarms. As a parameter, we
tuned the number of seconds between two chained alarms.
The final result of the experiment was the number of reduced
alarms. If we consider that every overlapped and/or chained
sequence can be replaced with one alarm with value-added
information “attached” to it, the discovery of a sequence of
length N means a reduction of (N-1) alarms:
TABLE I
NUMBER AND PERCENTAGE OF REDUCED ALARMS
AFTER LOW-LEVEL CORRELATIONS
                            Number of alarms       %
Total                            36639          100.00
Reduced (30 s interval)          23983           65.46
Reduced (45 s interval)          24869           67.88
Reduced (60 s interval)          25408           69.35
Reduced (120 s interval)         26110           71.26
According to the obtained results, we decided to fix the
time interval between two chained alarms at 30 seconds.
In that case, the number of reduced alarms, lying
“under” the bearer alarm, was 23983, or 65.46 %.
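The counting behind these figures can be sketched for a single network element. The code is a simplified illustration, not the evaluated implementation: it assumes alarms are given as (start, end) pairs in seconds and that a pause of at most `max_gap` seconds chains two alarms; the function name and the sample times are invented.

```python
def reduce_low_level(alarms, max_gap):
    """Merge overlapped and chained alarms of one network element.

    alarms:  list of (start, end) times in seconds
    max_gap: longest pause in seconds that still chains two alarms together
    Returns a list of (start, end, hidden) bearer alarms, where hidden is
    the number of alarms lying under the bearer.
    """
    bearers = []
    for start, end in sorted(alarms):
        if bearers and start - bearers[-1][1] <= max_gap:
            # Overlapped (negative gap) or chained (short gap): extend the bearer.
            b_start, b_end, hidden = bearers[-1]
            bearers[-1] = (b_start, max(b_end, end), hidden + 1)
        else:
            bearers.append((start, end, 0))
    return bearers


# Eight invented alarms from one element, some overlapped, some not.
storm = [(0, 20), (5, 15), (10, 40), (50, 60),
         (70, 95), (90, 100), (120, 130), (125, 140)]
merged = reduce_low_level(storm, max_gap=30)
reduced = sum(hidden for _, _, hidden in merged)
print(merged, reduced)
```

With a 30-second interval the whole sequence of N = 8 alarms collapses into one bearer alarm, a reduction of N - 1 = 7; a shorter interval leaves more bearer alarms and hides fewer.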
After the evaluation of overlapped and chained alarms, a
number of alarms could be filtered out due to their
“self-solving” nature. Alarms that are short enough and are
not chained to other alarms can be treated as self-solving
alarms. Self-solving alarms can be extracted from the set of
alarms by their duration attribute. We used 30 seconds as the
maximum duration of a self-solving alarm. We performed
low-level correlation first, in order to discover chained
short alarms before looking for self-solving ones.
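A minimal sketch of this filter follows. It assumes that low-level correlation has already produced bearer alarms as (start, end, hidden) tuples, where hidden counts the alarms merged into the bearer; this layout, the function name and the sample values are assumptions of the sketch.

```python
def split_self_solving(bearers, max_duration):
    """Separate short, unchained alarms from the rest.

    bearers:      list of (start, end, hidden) tuples in seconds
    max_duration: longest duration in seconds of a self-solving alarm
    """
    self_solving, kept = [], []
    for start, end, hidden in bearers:
        # Only alarms with nothing merged into them can be self-solving.
        if hidden == 0 and end - start <= max_duration:
            self_solving.append((start, end, hidden))
        else:
            kept.append((start, end, hidden))
    return self_solving, kept


# Invented bearer alarms: one storm, one short single alarm,
# one long single alarm, one short chain.
sample = [(0, 140, 7), (500, 512, 0), (900, 1000, 0), (2000, 2025, 1)]
short_alarms, others = split_self_solving(sample, max_duration=30)
print(short_alarms, others)
```

The short chain at 2000 s is kept even though it lasts under 30 seconds, which is why the chained alarms have to be discovered before this filter runs.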
The number of self-solving alarms was 5008.
Together with the alarms reduced by low-level correlation, we
detected 28991 potentially reduced alarms. This is
79.12 % of the total number of alarms. The conclusion is that
filtration together with low-level correlations can decrease
the number of alarms by a large percentage, almost 80 % in
this case.
Finally, after the number of alarms was reduced, we have
7648 alarms as input for the high-level correlations discovery
module.
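The reported counts can be cross-checked with a short calculation; the symbol names are introduced here only for illustration.

```latex
\begin{align*}
N_{\text{reduced}} &= 23983 + 5008 = 28991,\\
\frac{N_{\text{reduced}}}{N_{\text{total}}} &= \frac{28991}{36639} \approx 0.7913,\\
N_{\text{input}} &= N_{\text{total}} - N_{\text{reduced}} = 36639 - 28991 = 7648.
\end{align*}
```

The ratio agrees with the reported percentage up to the last digit: rounding gives 79.13 %, so the quoted 79.12 % appears to be truncated rather than rounded.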
This number can be reduced further if we discover
frequently repeated alarm sequences and replace each of them
by one alarm. For that purpose, we have used the Apriori
algorithm, as discussed in our previous work. However, after
sequences are detected, it is necessary to “judge” which
sequence is relevant for the future and which is not. One
criterion can be how frequently an alarm sequence appears.
Also, some sequences can be very relevant even if they
are not repeated very frequently. ABCDE can be used for the
discovery and statistical processing of alarm sequences,
while the final decision should be made by a human operator.
According to our previous and other related work [12], the
reduction rate at high-level correlations can be rather high,
up to 70%. Using the test data sample and finding several
alarm sequences confirmed by network experts, the reduction
rate was 25.41 %.
By interviewing network personnel working with real
alarm data every day in network operation and
management centers, we have captured their attitude to the
alarm correlation process: high-level correlations are very
important, but reduction of the total number of alarms at the
low level and good filtration discipline are even more
important to them. The reason is that high-level correlations
and problem root causes can be detected by network personnel
if the total number of alarms is reduced to a reasonable
number of relevant alarms that can be tracked by a network
operator. The experimental results presented here are
intended to help network operators in line with that attitude.
C. System performance
Filtration and evaluation of low-level correlations on the
test data is not time-consuming. Processing of 36639
alarms took around 14 seconds at low-level correlations,
while filtration is even less demanding.
However, discovery of high-level correlations using data
mining algorithms can be time-consuming. Hence we have
introduced the logical inventory database in order to
eliminate obviously unrelated alarms from the algorithms.
Discovering high-level correlations is a rather
challenging task. With the reduction rate obtained at the
low level, high-level correlations discovery is freed
from irrelevant alarms, which opens great opportunities for
introducing more data mining techniques with good
performance.
IV. CONCLUSION
In this paper we continue the research into the potential
usage of the Apriori algorithm in fault
management started in [2], [11] and improved by
introducing logical inventory data into typical alarm sequence
detection processes [7].
However, by interviewing a number of network personnel, we
have detected their attitude regarding the importance of
low-level correlations and filtration.
Since our final goal is the real implementation of our
proposed concepts in a telecommunication network, we
presented a potential architecture of the Alarm Basic
Correlations Discovery Environment. A significant part of it
is related to filtrations and low-level correlations.
Low-level correlations together with filtrations reduced the
number of presented alarms by almost 80%. The remaining 20%
of alarms were the input for the discovery of high-level
correlations. High-level correlations (alarm sequences) can be
detected in a rather simple way; the most important question
is which correlations are useful. Here we need the assistance
of human network operators.
The Alarm Basic Correlations Discovery Environment should
be used in a telecom operator’s network operation center, and
its final goal is an improved network problem detection
process leading to better reaction times to problems. In that
case, network users will not perceive existing network
problems as service degradation.
Further research efforts should be invested into the full
implementation of the proposed architecture, and into
improving and introducing new data mining techniques for
high-level correlations discovery as well as typical patterns
that can be used for low-level correlations and filtrations.
REFERENCES
[1] Kunštić, M., O. Jukić and M. Bagić, “Definition of formal
infrastructure for perception of intelligent agents as problem
solvers”, Proceedings on International Conference on
Software, Telecommunications and Computer Networks,
Nikola Rožić and Dinko Begušić (ed.), Split, 2002.
[2] Jukić, O., M. Kunštić, “Network problems frequency
detection using Apriori algorithm”, Proceedings of the 32nd
International Convention MIPRO 2009, Golubić S. et al.
(ed.), pp. 77-81, Opatija, Republic of Croatia, 2009.
[3] Goethals, B., “Survey on frequent pattern mining”,
Department of Computer Science, University of Helsinki,
Finland, 2009.
[4] Agrawal R., T. Imielinski and A.N. Swami, “Mining
association rules between sets of items in large databases”,
Proceedings of the 1993 ACM SIGMOD International
Conference on Management of Data, P. Buneman and S.
Jajodia (ed.), ACM Press, 1993.
[5] Kowalski, R., Logic for problem solving, North Holland,
New York, 1979.
[6] Udupa, K.D., TMN – Telecommunications Management
Network, McGraw-Hill Telecommunications, New York,
1999.
[7] Jukić, O., M. Kunštić, “Logical inventory database
integration into network problems frequency detection
process”, Proceedings of the 10th International Conference
on Telecommunications CONTEL 2009, Podnar Žarko,
Ivana; Boris, Vrdoljak (ed.), pp. 361-365, Zagreb, Republic
of Croatia, 2009.
[8] Burns, L., J.L. Hellerstein, S. Ma, D.J. Taylor, C.S. Perng,
D.A. Robenhorst, “Toward Discovery of Event Correlation
Rules”, IBM T.J. Watson Research Center, Hawthorne, New
York, USA.
[9] ITU-T, Recommendation X.733: Alarm Reporting Function,
Geneva, 1992.
[10] Garofalakis, M., R. Rastogi, “Data mining meets network
management – The Nemesis project”, Bell Laboratories,
USA, 2001.
[11] Jukić, O., M. Špoljarić, V. Halusek, “Low-level alarm
filtration based on alarm classification”, Proceedings
of the 51st International Symposium ELMAR 2009, Grgić,
Mislav et al. (ed.), pp. 143-146, Zadar, Republic of
Croatia, 2009.
[12] Costa, R., N. Cachulo, P. Cortez, “An Intelligent Alarm
Management System for Large-Scale Telecommunication
Companies”, EPIA 2009, L. Seabra Lopes et al. (ed.), pp.
386-399, Berlin, 2009.