ABCDE – Alarm Basic Correlations Discovery Environment

Oliver Jukić
Virovitica College
Virovitica, Republic of Croatia
[email protected]

Marijan Kunštić
University of Zagreb, Faculty of Electrical Engineering and Computing
Zagreb, Republic of Croatia
[email protected]

Abstract – Alarms generated by a telecommunication network are processed by network personnel, who are required to respond within a reasonable time interval. When a global network problem occurs, it manifests itself as a sequence of alarms coming from one or more network elements. That sequence is typically not recognized as a global problem, or the presence of a global problem is detected but not its real nature. The reason is the huge number of generated alarms "bombarding" the operator. Automatic recognition of network problems is therefore very useful for network monitoring processes. It can be done by simple IF-THEN correlation rules applied to the incoming alarm stream. The difficulty lies in recognizing potential correlation rule candidates. In our previous work, we identified an implementation of the Apriori algorithm as a potential improvement in correlation rule detection. This paper describes an architecture proposal for the Alarm Basic Correlations Discovery Environment and opens a discussion of some implementation aspects.

Keywords: Network problem, alarm correlation, correlation rule

I. INTRODUCTION

From the point of view of the customer's experience of service quality, fault management is one of the most relevant network management functional areas. Fault management primarily covers the detection, isolation and correction of unusual operational behavior of the telecommunication network and its environment [6]. Typically, alarms from the whole network are delivered to the network operation and management center, where they are processed by a network operator. In that case, we talk about centralized fault management or centralized network management.
When a problem appears, the network generates a large number of unsolicited events carrying information about the malfunction; these events are called alarms. The network operator's reaction time depends on many factors. One of the most important is the ability to recognize the root cause of the problem by correlating incoming alarms. Alarms can be correlated by their starting or ending time (when did the alarm start or end?), location (where did the alarm happen?), probable cause (what is the nature of the alarm?) or by any other alarm attribute. Correlation engines rely on predefined correlation rules that correlate alarms by these criteria and are usually written in an IF-THEN manner. In this paper, we refer to rules of that kind as high-level correlation rules.

The most challenging task is to create appropriate correlation rules, based on knowledge of the network equipment and structure. Sometimes correlation rules are not created because network operation staff are too few or lack the knowledge needed to create them, even when correlation tools do exist. Hence, correlation capabilities are not used as much as they could be in network operation centers. This has an implicit impact on the service quality delivered to customers.

High-level correlation rules recognize network problems by correlating incoming alarms. However, high-level correlation rules alone are not enough. Alarm filtration plays a great role in alarm reduction. It should be performed before alarm correlation in order to eliminate irrelevant alarms. For instance, during a scheduled maintenance action it is reasonable to ignore alarms from the network elements under maintenance. Filtration increases the efficiency of alarm correlation and decreases the total number of alarms presented to the network operator. An operator who has to cope with fewer alarms can work more reliably and efficiently.

Besides high-level correlation rules, there are a number of typical patterns that can be recognized at the low level.
For instance, alarms coming from a certain network element within the same time interval can be treated as "multiple" alarms. Some network problems appear as "jittering" of an alarm. In that case the first alarm indicates the problem, and all other "jittering" alarms can be "hidden" behind the first one. Implementing low-level correlations therefore also decreases the total number of presented alarms.

Network problems manifest themselves as alarm sequences. Since network problems repeat more or less frequently, processing alarm sequences from the alarm history can be a good basis for creating correlation rules that will be used in the future, when the same problem appears again.

Commercial network management tools usually have the capability to perform alarm data correlation. Correlation rules are loaded as input for the alarm correlation process. The tools, however, are only a framework; the correlation knowledge built into them has to be supplied. In order to be built in, that knowledge must first exist in a human's mental picture. One of the axioms of this paper is that the presence of human beings is irreplaceable in the process of correlation rule detection. Nevertheless, automating the analysis of previous alarm streams is welcome.

In our previous work, we proposed the creation of correlation rules from historical data. The main theoretical concepts are described in [1], [2] and [7] and are not the subject of this paper. Rather, we focus on the potential architecture of (basic) alarm correlation rule discovery systems. This paper gives an architecture proposal for ABCDE – Alarm Basic Correlations Discovery Environment. The environment is already partially implemented, while complete implementation and integration into a telecommunication operator's network management center will be the subject of future work. Some important aspects of ABCDE have been tested using alarm data obtained from a real telecommunication network.

II. ALARM BASIC CORRELATIONS DISCOVERY ENVIRONMENT ARCHITECTURE

A. ABCDE architecture overview

The basic ABCDE architecture is shown in figure 1:

Fig. 1. Basic ABCDE architecture

Incoming network alarms are generated by the telecommunication network. Alarms are consumed and processed by the alarm processing engine, which performs alarm filtration as well as low-level and high-level correlation. Processed alarms are presented to the network operator through the alarm surveillance GUI. The alarm processing engine uses correlation and filtration rules stored in a database, while incoming alarms are stored in the alarm data warehouse. A logical inventory database containing data about network interconnections can be used for more efficient alarm correlation. Logical inventory data can also be used to enrich incoming alarm data by tying relevant inventory information to it (for instance, a "friendly" alarm location name). The alarm processing engine is not the focus of this paper, since a number of commercial tools are able to perform alarm processing functions.

The alarm data warehouse is a database containing all raw alarm history data as well as correlated alarm history data for a certain time period predefined by the operator (e.g. 2 years). It is the starting point for the discovery and analysis of typical correlations in historical alarm data, with the aim of including them in the correlation and filtration rules database.

The correlation and filtration rules database contains data about the correlations and filtrations to be performed in real time by the alarm processing engine. Rules in this database are proposed by the Correlation discovery and analysis module. This module can be used to discover new potential rules by running a data mining algorithm on historical alarm data. It can also be used to analyze and evaluate potential rule candidates by executing the rules on a sample of historical alarm data.
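The evaluation of a rule candidate on a sample of historical alarm data could look roughly like the following minimal sketch. It is an illustration under stated assumptions, not the ABCDE implementation: the `Rule` class, the `evaluate_rule` function, the representation of an alarm as a (location, probable cause) pair and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical IF-THEN rule: IF all if_alarms are present THEN report then_alarm."""
    if_alarms: frozenset
    then_alarm: str


def evaluate_rule(rule, history):
    """Replay one candidate rule over historical alarm groups.

    history is a list of sets; each set holds the (location, cause) pairs
    that were observed together. Returns (matches, alarms_saved).
    """
    matches = 0
    saved = 0
    for group in history:
        if rule.if_alarms <= group:  # every IF alarm occurs in this group
            matches += 1
            # the matched alarms would be replaced by one root-cause alarm
            saved += len(rule.if_alarms) - 1
    return matches, saved


# Invented sample data, for illustration only.
history = [
    {("BTS-1", "link down"), ("BTS-2", "link down"), ("BSC-1", "sync loss")},
    {("BTS-1", "link down"), ("BTS-2", "link down")},
    {("BTS-3", "power fail")},
]
rule = Rule(
    if_alarms=frozenset({("BTS-1", "link down"), ("BTS-2", "link down")}),
    then_alarm="microwave hop failure",
)
matches, saved = evaluate_rule(rule, history)
print(matches, saved)  # prints "2 2"
```

Counting both the number of matches and the number of alarms that the rule would hide gives an expert two simple figures to weigh when deciding whether to accept the candidate.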
The filtration part of the Correlation discovery and analysis module discovers and evaluates potential filter patterns. Not all incoming alarms are relevant for further processing. Alarm classification and filtration are described in detail in [11] and are not discussed further here. Filtering is not always statically tied to a predefined, concrete network element; it can also change dynamically, depending on circumstances in the network such as a scheduled maintenance procedure on some network elements.

After filtration has been done on historical alarm data, low-level correlation discovery and evaluation can be performed. This is primarily related to the discovery of general patterns, such as alarm overlapping or alarm jittering.

High-level correlation copes with concrete alarm patterns coming from specific network elements. At this stage, raw alarm clusters are detected first. An alarm cluster is a set of alarms received from the network within a time interval bounded by cluster borders. We detect "long enough" time periods without alarms and consider those periods to be cluster borders. All alarms situated between two cluster borders belong to the same cluster [2]. A cluster is the input for the Apriori algorithm. In order to improve the algorithm's performance, we have proposed using logical network inventory data to split raw clusters into smaller parts that contain only alarms from interconnected alarm locations. In that case, all interconnections are taken into consideration while creating alarm clusters: the total number of clusters increases, while the average number of alarms per cluster decreases. This considerably improves the execution performance of the data mining algorithm.

Logical inventory data should be obtained from the network operator. If it cannot be obtained, there is a proposed technique for extracting logical inventory data from the alarm history.
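The raw-cluster detection and inventory-based splitting described above can be sketched as follows. This is a simplified illustration, not the procedure from [2] or [7]: it assumes that an alarm is reduced to a (start time in seconds, location) pair, that a cluster border is any gap of at least `quiet_gap` seconds between consecutive alarm start times, and that the logical inventory is a plain list of location pairs. The function names and sample values are hypothetical.

```python
from collections import defaultdict


def detect_raw_clusters(alarms, quiet_gap):
    """Group (timestamp, location) alarms; a quiet period >= quiet_gap is a border."""
    clusters = []
    current = []
    last_time = None
    for ts, loc in sorted(alarms):
        if last_time is not None and ts - last_time >= quiet_gap:
            clusters.append(current)  # border found, close the cluster
            current = []
        current.append((ts, loc))
        last_time = ts
    if current:
        clusters.append(current)
    return clusters


def split_cluster(cluster, links):
    """Split a raw cluster into parts whose locations are interconnected.

    links is an iterable of (loc_a, loc_b) pairs taken from the logical
    inventory. Parts are the connected components of the locations that
    actually occur in the cluster.
    """
    locations = {loc for _, loc in cluster}
    neighbours = defaultdict(set)
    for a, b in links:
        if a in locations and b in locations:
            neighbours[a].add(b)
            neighbours[b].add(a)
    component = {}
    for start in sorted(locations):
        if start in component:
            continue
        component[start] = start
        stack = [start]
        while stack:  # depth-first walk over interconnected locations
            node = stack.pop()
            for nxt in neighbours[node]:
                if nxt not in component:
                    component[nxt] = start
                    stack.append(nxt)
    parts = defaultdict(list)
    for ts, loc in cluster:
        parts[component[loc]].append((ts, loc))
    return [parts[key] for key in sorted(parts)]


# Invented sample data, for illustration only.
alarms = [(0, "BTS-1"), (5, "BSC-1"), (8, "BTS-9"), (400, "BTS-1")]
links = [("BTS-1", "BSC-1")]
raw = detect_raw_clusters(alarms, quiet_gap=300)
print(len(raw))                      # prints 2
print(split_cluster(raw[0], links))  # two parts: BTS-1 with BSC-1, and BTS-9 alone
```

In this sketch an alarm from a location with no interconnection to the others ends up in a part of its own; whether such single-location parts are kept or discarded before mining is a design decision left open here.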
The technique for extracting logical inventory data from the alarm history is described in [7] and is not the primary focus of this paper; it is, however, denoted in figure 1 by the Logical inventory block.

When the clusters have been generated, the Apriori algorithm is run. The final result is a number of alarm sequences that occurred frequently in the past. Those sequences are potential high-level correlation rule candidates for future alarm processing. The criterion for accepting a candidate can be rule frequency, but a rule can also be accepted on the basis of a network expert's opinion.

B. Low-level correlations

After alarm filtration has been performed, low-level correlations follow. Low-level correlations are not related to concrete network elements or alarm types; rather, we want to discover general alarm behavior patterns.

A typical behavior is alarm jittering: for some reason, a network element may jitter between the alarming and non-alarming state. The network operator sees this as a number of (short) alarms with short periods between the end of one alarm and the start of the next. We refer to a sequence of jittering alarms as "chained" alarms.

Another such behavior is alarm overlapping. In general, two alarms can overlap completely, overlap partially, or not overlap at all. Even in the last case, the time interval between the two alarms plays a great role:

Fig. 2. Alarm overlapping patterns (panels a to g)

At low-level correlation, completely and partially overlapped alarms (fig. 2 a, b, c, d, e) coming from the same network element (and, optionally, with the same probable cause) can be considered as one alarm with value-added information attached to it: the number of alarms lying behind it.

Alarms that do not overlap have an important parameter associated with them: the time between the end of the first alarm and the start of the second. If that time is short enough, the two alarms can be treated as a single alarm, ignoring the end of the first and the start of the second.

By combining those two typical patterns and removing all hidden alarms from the operator's GUI, the number of reduced alarms can be increased. In figure 3, there are 8 alarms coming from the same network element within a certain time period. Some of those alarms overlap, while some do not. For the alarms that do not overlap, the time interval between them is short enough. The "alarm storm" can hence be replaced by a single alarm with value-added information, removing as many as 7 alarms from the operator's graphical interface:

Fig. 3. Reduction of alarms by low-level correlations

C. High-level correlations: raw-cluster detection

After filtration and low-level correlation processing, the incoming alarm stream is "clustered": alarm clusters containing alarms that potentially belong to the same network problem are detected. Alarm cluster detection is described in [2]. The important point is that alarm clusters are separated by time intervals without alarms.

D. High-level correlations: cluster splitting

Typically, a network problem is represented by a number of alarms coming from one or more network elements. If the alarms come from more than one network element, it is reasonable to expect that those network elements are interconnected. If we have a logical inventory database at our disposal (i.e., a database where information about network element interconnections is stored), we can try to include it in the discovery environment. How? We can consider only the clusters containing alarms from interconnected network elements.

Since a logical inventory database is not always available, it is possible to "generate" it from the historical alarm data. In that case, we first analyze alarms by their location only. After that analysis we have information about the most frequent points of interconnection.
This data can be stored in a logical inventory database (using a predefined threshold) and used in the cluster splitting process in the future. This concept is described in [7].

E. High-level correlations: Apriori algorithm

The mining of association rules is potentially very interesting for the detection of specific alarm "clusters" that can represent a global network problem. What was the original motivation for researching association rules? Let us imagine a supermarket serving a huge number of customers every day. The supermarket manager is responsible for all business aspects, including special offers and promotions. For instance, the manager can decide to launch a discount on chips for every customer buying 6 beers. Such a special offer seems very logical, based on our daily experience. However, there are many such association rules that cannot be perceived by casual observation. Hence, the manager is forced to analyze the supermarket's transaction data (i.e., the customer receipt archive or database) in order to examine customer behavior while purchasing products. The result of such an analysis is a set of typical association rules describing how often items are purchased together. For instance, the rule "Beer ⇒ Chips (80%)" states that four out of five customers buying beer also buy chips [3]. That result can be useful for business decisions related to marketing, pricing and product promotion.

We have treated our alarms as products purchased in a supermarket, and alarm clusters as the baskets of individual customers. Hence we have decided to use the Apriori algorithm to find and recognize specific alarm sequences, which are potential correlation rules for the future [2]. The Apriori algorithm itself is described in a number of papers, such as [3].

The final result of high-level correlations is the creation of a correlation rules database. Rules are structured in an IF-THEN manner.
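The basket analogy can be made concrete with a compact, textbook-style Apriori sketch. It is not the implementation used in ABCDE: it treats each alarm cluster as an unordered set of alarm identifiers (so the order of alarms within a sequence is ignored), and the function names, the identifier format and the sample clusters are assumptions made for illustration.

```python
from itertools import combinations


def apriori(baskets, min_support):
    """Return {itemset: count} for every itemset found in >= min_support baskets."""
    baskets = [frozenset(b) for b in baskets]
    candidates = {frozenset([item]) for b in baskets for item in b}
    frequent = {}
    k = 1
    while candidates:
        # count in how many baskets each candidate occurs
        counts = {c: sum(1 for b in baskets if c <= b) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # join step: unions of frequent (k-1)-itemsets that have exactly k items
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


def confidence(frequent, antecedent, consequent):
    """Confidence of antecedent => consequent, from the frequent itemset counts."""
    a = frozenset(antecedent)
    return frequent[a | frozenset(consequent)] / frequent[a]


# Five invented alarm clusters; four of them contain both alarms.
clusters = [
    {"BTS-1:link down", "BSC-1:sync loss"},
    {"BTS-1:link down", "BSC-1:sync loss"},
    {"BTS-1:link down", "BSC-1:sync loss", "BTS-7:door open"},
    {"BTS-1:link down", "BSC-1:sync loss"},
    {"BTS-1:link down"},
]
frequent = apriori(clusters, min_support=2)
print(confidence(frequent, {"BTS-1:link down"}, {"BSC-1:sync loss"}))  # prints 0.8
```

The sample is chosen so that the resulting confidence mirrors the "Beer ⇒ Chips (80%)" rule: the consequent alarm appears in four of the five clusters that contain the antecedent alarm.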
With rules structured in an IF-THEN manner, the alarm processing engine matches patterns in the incoming alarm stream against the existing patterns in the correlation rules database. When a pattern is matched, a new alarm is generated containing information about the real root cause of the network problem.

III. IMPLEMENTATION ASPECTS AND EXPERIMENTAL RESULTS

A. Programming languages and techniques

ABCDE components are developed in the C and C++ programming languages, as parts of a complex application. The central application component is an executable file that uses a number of dynamic-link libraries (DLLs). Every part is implemented as a separate DLL, which allows individual components to be upgraded without disturbing the general application structure. For database access we have used Open Database Connectivity (ODBC) through the standard MFC classes, with all data stored in an MS SQL Server database; any other database access technique could be used instead.

B. Experimental results

The experimental proof of concept was done on a real alarm data sample obtained from one GSM operator in the region. The data covers a period of one month and dates from November 2002. Even data this old is useful for this experimental work. The data came from the access network of a GSM system. Base stations are connected to Base Station Controllers via a multiplexing transmission system; in this case, the connections were realized using microwave radio transmission links. Hence we had the opportunity to find really interesting patterns, potentially caused by severe weather conditions affecting transmission performance.

The total number of incoming alarms for processing was 36639. At low-level correlation, we evaluated the number of reduced alarms for two typical patterns: overlapped and chained alarms. As a parameter, we tuned the number of seconds allowed between two chained alarms. The final result of the experiment was the number of reduced alarms.
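The two low-level patterns evaluated here, overlapped and chained alarms, can both be handled by one interval-merging pass, as in the minimal sketch below. It is an assumption-laden illustration rather than the ABCDE code: alarms are taken to be (start, end) pairs in seconds from a single network element, a gap exactly equal to `max_gap` is treated as "short enough", and the function name and the eight sample intervals are invented.

```python
def merge_low_level(alarms, max_gap):
    """Merge overlapped and chained alarms from one network element.

    alarms is a list of (start, end) pairs in seconds. Two alarms are merged
    when they overlap or when the gap between the end of the current bearer
    alarm and the start of the next alarm is at most max_gap seconds.
    Returns (start, end, count) bearer alarms, where count is the number of
    original alarms represented by the bearer.
    """
    merged = []
    for start, end in sorted(alarms):
        if merged and start - merged[-1][1] <= max_gap:
            prev_start, prev_end, count = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), count + 1)
        else:
            merged.append((start, end, 1))
    return merged


# Eight invented alarms: some overlap, the others are separated by short gaps.
storm = [(0, 60), (30, 90), (100, 130), (150, 160),
         (170, 300), (200, 250), (310, 320), (345, 400)]
bearers = merge_low_level(storm, max_gap=30)
reduced = sum(count - 1 for _, _, count in bearers)
print(bearers)  # prints [(0, 400, 8)]
print(reduced)  # prints 7
```

Here `max_gap` plays the role of the tuned parameter, the number of seconds between two chained alarms, and each bearer alarm with count N hides N-1 alarms from the operator.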
If every overlapped and/or chained sequence is replaced by one alarm with value-added information attached to it, discovering a sequence of length N means a reduction of (N-1) alarms:

TABLE I
NUMBER AND PERCENTAGE OF REDUCED ALARMS AFTER LOW-LEVEL CORRELATIONS

                             Number of alarms    %
Total                        36639               100.00
Reduced (30 s interval)      23983               65.46
Reduced (45 s interval)      24869               67.88
Reduced (60 s interval)      25408               69.35
Reduced (120 s interval)     26110               71.26

Based on these results, we decided to fix the time interval between two chained alarms at 30 seconds. In that case, the number of reduced alarms lying "under" a bearer alarm was 23983, or 65.46 %.

After the evaluation of overlapped and chained alarms, a further number of alarms could be filtered out because of their "self-solving" nature. Alarms that are short enough and are not chained to other alarms can be treated as self-solving alarms. Self-solving alarms can be extracted from the set of alarms by their duration attribute. We used 30 seconds as the maximum duration of a self-solving alarm. We performed low-level correlation first so that short alarms belonging to a chain were recognized as chained rather than counted as self-solving. The number of self-solving alarms was 5008. Together with the alarms reduced by low-level correlation, we detected 28991 potentially reduced alarms, which is 79.12 % of the total number of alarms. The conclusion is that filtration together with low-level correlations can decrease the number of alarms by a large percentage, almost 80 % in this case.

Finally, after this reduction, 7648 alarms remained as input for the high-level correlations discovery module. This number can be reduced further if we discover frequently repeated alarm sequences and replace each of them by one alarm. For that purpose we used the Apriori algorithm, as discussed in our previous work. However, after sequences are detected, it is necessary to "judge" which sequences are relevant for the future and which are not. One criterion can be the frequency with which an alarm sequence appears.
Also, some sequences can be very relevant even if they are not repeated very frequently. ABCDE can be used for the discovery and statistical processing of alarm sequences, while the final decision should be made by a human operator. According to our previous work and other related work [12], the reduction rate at high-level correlation can be rather high, up to 70 %. Using the test data sample and several alarm sequences confirmed by network experts, the reduction rate was 25.41 %.

By interviewing network personnel who work with real alarm data every day in network operation and management centers, we have captured their attitude to the alarm correlation process: high-level correlations are very important, but reducing the total number of alarms at the low level and good filtration discipline are even more important to them. The reason is that high-level correlations and problem root causes can be detected by network personnel themselves if the total number of alarms is reduced to a reasonable number of relevant alarms that a network operator can track. The experimental results presented here are intended to help network operators in line with that attitude.

C. System performance

Filtration and evaluation of low-level correlations on the test data is not time-consuming. Processing 36639 alarms took around 14 seconds at low-level correlation, while filtration is even less demanding. However, the discovery of high-level correlations using data mining algorithms can be time-consuming. Hence we have introduced the logical inventory database in order to exclude obviously unrelated alarms from the algorithms.

Discovering high-level correlations is a rather challenging task. With the reduction rate obtained at the low level, high-level correlation discovery is freed from irrelevant alarms, which opens up good opportunities for introducing more data mining techniques with good performance.

IV. CONCLUSION

In this paper we continue the research into the potential usage of the Apriori algorithm in fault management that was started in [2], [11] and improved by introducing logical inventory data into typical alarm sequence detection processes [7]. However, by interviewing a number of network personnel, we have learned how important low-level correlations and filtration are to them.

Since our final goal is the real implementation of our proposed concepts in a telecommunication network, we have presented a potential architecture of the Alarm Basic Correlations Discovery Environment. A significant part of it is related to filtration and low-level correlations. Low-level correlations together with filtration reduced the number of presented alarms by up to 80 %. The remaining 20 % of alarms were the input for the discovery of high-level correlations. High-level correlations (alarm sequences) can be detected in a rather simple way; the most important question is which correlations are useful. Here we need the assistance of human network operators.

The Alarm Basic Correlations Discovery Environment should be used in a telecom operator's network operation center. Its final goal is an improved network problem detection process, leading to better reaction times to problems. In that case, network users will not perceive existing network problems as service degradation.

Further research efforts should be invested in the full implementation of the proposed architecture, and in improving and introducing new data mining techniques for the discovery of high-level correlations as well as of typical patterns that can be used for low-level correlations and filtration.

REFERENCES

[1] Kunštić, M., O. Jukić and M. Bagić, "Definition of formal infrastructure for perception of intelligent agents as problem solvers", Proceedings on International Conference on Software, Telecommunications and Computer Networks, Nikola Rožić and Dinko Begušić (ed.), Split, 2002.
[2] Jukić, O., M. Kunštić, "Network problems frequency detection using Apriori algorithm", Proceedings of the 32nd International Convention MIPRO 2009, Golubić S. et al. (ed.), pp. 77-81, Opatija, Republic of Croatia, 2009.
[3] Goethals, B., "Survey on frequent pattern mining", Department of Computer Science, University of Helsinki, Finland, 2009.
[4] Agrawal, R., T. Imielinski and A.N. Swami, "Mining association rules between sets of items in large databases", Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, P. Buneman and S. Jajodia (ed.), ACM Press, 1993.
[5] Kowalski, R., Logic for problem solving, North Holland, New York, 1979.
[6] Udupa, K.D., TMN – Telecommunications Management Network, McGraw-Hill Telecommunications, New York, 1999.
[7] Jukić, O., M. Kunštić, "Logical inventory database integration into network problems frequency detection process", Proceedings of the 10th International Conference on Telecommunications CONTEL 2009, Podnar Žarko, Ivana; Boris, Vrdoljak (ed.), pp. 361-365, Zagreb, Republic of Croatia, 2009.
[8] Burns, L., J.L. Hellerstein, S. Ma, D.J. Taylor, C.S. Perng, D.A. Robenhorst, "Toward Discovery of Event Correlation Rules", IBM T.J. Watson Research Center, Hawthorne, New York, USA.
[9] ITU-T, Recommendation X.733: Alarm Reporting Function, Geneva, 1992.
[10] Garofalakis, M., R. Rastogi, "Data mining meets network management – The Nemesis project", Bell Laboratories, USA, 2001.
[11] Jukić, O., M. Špoljarić, V. Halusek, "Low-level alarm filtration based on alarm classification", Proceedings of the 51st International Symposium ELMAR 2009, Grgić, Mislav et al. (ed.), pp. 143-146, Zadar, Republic of Croatia, 2009.
[12] Costa, R., N. Cachulo, P. Cortez, "An Intelligent Alarm Management System for Large-Scale Telecommunication Companies", EPIA 2009, L. Seabra Lopes et al. (ed.), pp. 386-399, Berlin, 2009.