
Automation of Mobile Radio Network Performance and Fault Management
(Matkapuhelinradioverkon suorituskyvyn- ja vianhallinnan automatisointi)

A thesis submitted in partial fulfilment of the requirements for the degree of Master of Science
Espoo 28.2.2007
Helsinki University of Technology, Department of Electrical and Communications Engineering
Author: Magnus Wallström [email protected]
Supervisor: Timo Korhonen [email protected]
Instructor: Mikko Lamberg, MSc (Tech), Nokia Networks [email protected]
2007-02-28 / Magnus Wallström

Contents
• Introduction and background
• Literature review:
  • Architecture of Mobile Radio Access Networks
  • State of the art in management of mobile networks as defined by 3GPP
  • Performance management data
  • Functionality scenarios in UTRAN
• Methods of the practical study
• Results:
  • Results (I/IX): Current PM and FM organisation and process
  • Results (II/IX): Example on current PM and FM process (1/3)
  • Results (III/IX): Example on current PM and FM process (2/3)
  • Results (IV/IX): Example on current PM and FM process (3/3)
  • Results (V/IX): Analysis of the current organisation and process
  • Results (VI/IX): Problems in current PM and FM process, interrelationship and why-why diagrams of the process problems
  • Results (VII/IX): Summary of the analysis
  • Results (VIII/IX): Solution for automated investigation
  • Results (IX/IX): Implementation of the solution
• Conclusions of the thesis
• References
2007-02-14 / Magnus Wallström

Key concepts
• 3GPP = Third Generation Partnership Project, a project that aims to develop the GSM and UMTS specifications in cooperation with vendors, operators and standardisation organisations.
• Fault management = Functions that enable the detection and location of failures in the network and the scheduling of repairs. 3GPP specifies the requirements of the concept.
• Mobile radio access network = A network that provides wireless access to users through a radio interface and allows mobile users to move between coverage areas without losing the connection (handover).
• Performance management = Functions that enable performance measurements of network services. 3GPP specifies the requirements of the concept.

Introduction and background
How can the productivity of UTRAN performance management investigations be enhanced?
• Research area: Mobile Radio Network Performance and Fault Management
• Research questions:
  • What is a mobile radio access network and how is it managed?
  • What is the current performance management set-up in the organisation under study?
  • What is the organisation and communication structure?
  • What is the process to tackle performance problems in UTRAN?
  • What are the problems of the current set-up?
  • What could be solutions to the root problems found in the current set-up?
• Scope:
  • Limited to the European 3G mobile radio network, UTRAN (UMTS Terrestrial Radio Access Network)
  • Other major mobile radio network technologies are GERAN (GSM radio network), WiMAX (WMAN) and WiFi (WLAN)

Architecture of Mobile Radio Access Networks
• General architecture
  • UE – User Equipment
    • Consists of Mobile Equipment (ME) and a Subscriber Identity Module (SIM), with which the end user accesses the mobile network
  • RAN – Radio Access Network
    • Currently the most popular mobile RANs are UTRAN and GERAN
    • Other radio access technologies are LTE, WiMAX and WiFi
  • CN – Core Network
    • All RANs are attached to a CN that provides switching and access to services in the PSTN and any IP network
  • OSS – Operations Support System
    • All parts of the mobile network may be managed by a centralised system
  [Figure: general architecture of a mobile network. UE, RAN (UTRAN, GERAN), CN, PSTN, IP network and OSS; interfaces Uu / Um / .., Iu / A / Gb / .. and O&M]
• UTRAN architecture
  • Network elements
    • RNC – Radio Network Controller
    • Node B, also known as BTS – Base Transceiver Station
    • A – ATM transmission nodes
  • Interfaces
    • IuCS: RNC to Circuit Switched Core Network (voice and video calls)
    • IuPS: RNC to Packet Switched Core Network (data calls)
    • Iur: RNC to RNC
    • Iub: RNC to BTS
    • Uu: BTS to UE
    • O&M: OSS to any network element: RNC, BTS, ATM nodes and CN elements (MSC, HLR, SGSN, GGSN etc.)
  [Figure: UTRAN architecture. UE, BTSs (B), ATM nodes (A), RNCs, CN (CS and PS) and OSS; interfaces Uu, Iub, Iur, Iu (CS/PS) and O&M]

State of the art in management of mobile networks as defined by 3GPP
• Network management areas relevant to RAN technical support
  • Performance Management (PM)
    • Keeps track of the network performance status and analyses the effects of configuration changes in the network [3GPP TS 32.101]
    • Is based on measurements that are continuously recorded in the network elements
  • Fault Management (FM)
    • Consists of fault detection, fault localisation, fault reporting, fault correction and fault repair [3GPP TS 32.111-1]
    • Is based mainly on alarms and system logs that the network elements produce
  • Software Management (SWM)
    • Covers software request management, installation, customer feedback and software fault management, i.e. detecting software faults and finding resolutions to them. This duty is close to, and overlaps with, fault management (FM) [3GPP TS 32.101]
  • Configuration Management (CM)
    • Controls the operational parameters of network elements [3GPP TS 32.600]
• Process of applying the network management:
  1. Performance is monitored (PM)
  2. Faults are localised (FM)
  3. Depending on the type of failure (parameter or software?):
     3a. Configuration is changed (CM), or
     3b. Software defect(s) are corrected (SWM)
  4. The performance is monitored again (back to step 1)

Functionality scenarios in UTRAN
• Control plane, i.e. signalling, on the RRC (Radio Resource Control) connection
  • Major purpose: set up and release a call
• User plane, i.e. the traffic, on RAB (Radio Access Bearer) connections
  • Major purpose: define the QoS class of the call:
    • Conversational class, RT (Real Time), applications: CS voice and video calls
    • Streaming class, RT, applications: CS streaming video
    • Interactive class, NRT (Non RT), applications: PS (Packet Switched) web browsing
    • Background class, NRT, applications: emails, MMS (Multimedia Messaging Service)
• Signalling scenarios:
  • MTC (Mobile Terminated Call) scenario
    1. Paging: the RNC sends an “RRC Paging Type 1” message to the Uu interface.
    2. RRC connection setup: the paged UE responds by starting the radio control connection establishment procedure. (1) The UE sends an “RRC Connection Request” message to the RNC (the “RRC Connection Setup Attempt” counter is updated). (2) The RNC tries to allocate radio resources (BTS) and, if successful, responds with an “RRC: Connection Setup” message (the “RRC Connection Setup Complete” counter is updated). (3) Finally the UE responds with an “RRC: Connection Setup Complete” message (the “RRC Access Complete” counter is updated).
    3. Transaction reasoning: the RNC and the CN negotiate the transaction type.
    4. Authentication and security procedure: the UMTS subscriber and the network authenticate each other, and other security mechanisms are activated.
    5. RAB setup for transaction: the actual communication resources for the transaction are allocated.
    6. Transaction: the UE has an active user plane bearer connection across the whole UMTS network.
    7. RAB release for transaction clearing: network resources related to the transaction are released, i.e. all the active RAB connections of the UE are released.
    8. RRC connection release: the radio control connection between the UE and the UTRAN is released.
  • Mobility (handover scenario):
    1. Measurement: the UE sends a radio-link measurement report to the RNC.
    2. Decision: the final decision to make a handover is made in the RNC by the RRM handover control algorithms.
       The decision is based on the handover criteria and algorithm parameters.
    3. Execution: handover signalling between e.g. the UE and the RNC, and radio resource allocation e.g. in the BTS.

Performance management data
• Performance counters
  • UTRAN collects thousands of counters that count occurrences of specific events
  • E.g. RRC Setup Attempts, RRC Setup Completes, RRC Setup Attempt Failure RNC, RRC Setup Attempt Failure BTS etc.
• KPI (Key Performance Indicator)
  • Most often calculated from performance counters as a relative percentage value
  • Relative (%) KPIs are comparable between networks of different sizes; absolute values are not, because the amount of traffic varies
  • Form: KPI = (a formula of performance counters)
  • Examples:
    • RRC_Acc% = “RRC access complete ratio” = “RRC Access Completes” / “RRC Setup Attempts”
    • CSSR, Call Setup Success Rate (voice call) = RRC_Acc% * (RAB_voice_attempts - RAB_voice_failures) / RAB_attempts
    • CCSR, Call Completion Success Rate (voice call) = (RAB_active_voice_successful_completes) / (RAB_active_voice_failures + RAB_active_voice_successful_completes)

Methods of the practical study
• Based on the UCD (User Centered Design) process and framework
• Chronologically the practical study had three phases:
  I. Study and define the current process and organisation
     a. Study: interview
     b. Study: focus group
     c. Study: contextual enquiry
  II. Analyse the current set-up
     a. Analysis: brainstorming
     b. Analysis: affinity diagram
     c. Analysis: double teams
     d. Analysis: interrelationship diagram
     e. Analysis: why-why analysis
  III. Develop an enhanced process
     a. Solution: brainstorming
     b. Solution: SWOT analysis
     c. Solution: UML diagrams

Results (I/IX): Current PM and FM organisation and process
• Organisation-wise
  • Technical support is the communicator between the local customer contact teams and the product line R&D organisation.
  • Technical support investigates and analyses the performance degradations and makes decisions to fix them in co-operation with R&D.
  • R&D’s responsibility is to develop corrections to the system if no other solution is effective.
  • Local teams communicate the performance status of the customer networks to the technical support.
  [Figure: organisation. Customers A, B and C; local teams A and B; Technical Support; R&D]
• The process follows the three phases of the root-cause analysis methodology:
  • Investigation (maps to PM [3GPP])
  • Analysis (maps to FM [3GPP])
  • Decision (maps to SWM and CM [3GPP])
• Each phase of the process has deliverables that are utilised in the later phases.
• Process steps:
  1. Get top level PM data
  2. Find performance dips
  3. Get detail level PM data
  4. Find problematic sites
  5. Gather logs, alarms etc.
  6. Analyse the logs and other detail level data that was gathered during the investigation
  7. Generate solutions
  [Figure: process phases Investigation (PM), Analysis (FM) and Decision (CM, SWM)]

Results (II/IX): Example on current PM and FM process (1/3)
1. Get KPIs and failure counters for the required top object (i.e. RNC)
  • Achieved by using a reporting tool that collects the needed counters from the OSS measurement database and calculates the KPI values from the counters. Manual post-processing of the data produces the graphical output.
  • Call setup performance: at hour 11 the CSSR KPI had poor values. The phenomenon partly continued during the following hour.
  • Retainability: high drop call ratio at hour 16. The counter diagram verifies that the drop in CCSR is due to a high number of RAB active failures.
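The KPI calculation that the reporting tool performs in step 1 can be sketched roughly as follows. This is a minimal illustration, not the tool's actual implementation: the class and field names are hypothetical and do not correspond to real OSS counter identifiers, the CSSR denominator is assumed to be the voice RAB attempts, and CCSR is computed as the share of successfully completed active voice RABs so that a high value means good retainability.

```python
import math
from dataclasses import dataclass


@dataclass
class HourlyCounters:
    """Counters for one object (e.g. an RNC) and one measurement hour.

    All field names are hypothetical, chosen only for this sketch.
    """
    rrc_setup_attempts: int
    rrc_access_completes: int
    rab_voice_attempts: int
    rab_voice_failures: int
    rab_active_voice_failures: int
    rab_active_voice_completes: int


def ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or NaN when there was no traffic."""
    return numerator / denominator if denominator else math.nan


def rrc_acc(c: HourlyCounters) -> float:
    """RRC access complete ratio as a fraction between 0 and 1."""
    return ratio(c.rrc_access_completes, c.rrc_setup_attempts)


def cssr(c: HourlyCounters) -> float:
    """Call Setup Success Rate for voice calls, in percent."""
    rab_success = ratio(c.rab_voice_attempts - c.rab_voice_failures,
                        c.rab_voice_attempts)
    return 100.0 * rrc_acc(c) * rab_success


def ccsr(c: HourlyCounters) -> float:
    """Call Completion Success Rate for voice calls, in percent."""
    ended = c.rab_active_voice_failures + c.rab_active_voice_completes
    return 100.0 * ratio(c.rab_active_voice_completes, ended)


# Made-up counter values for a single RNC hour.
sample = HourlyCounters(
    rrc_setup_attempts=1000,
    rrc_access_completes=950,
    rab_voice_attempts=900,
    rab_voice_failures=18,
    rab_active_voice_failures=10,
    rab_active_voice_completes=490,
)
```

Returning NaN for hours without traffic is one possible design choice; it keeps an empty measurement period from being reported as a 0 % KPI and mistaken for a dip.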
[Figure: KPIs. CSSR and CCSR [%] per hour for 2006/12/23, hours 00 to 23, plotted together with RRC Connection Setup Attempts]
[Figure: Failure counters per hour for 2006/12/23, hours 00 to 23. RRC setup failures, RRC access failures, RAB setup failures, RAB access failures, RAB active failures and RRC active failures, plotted together with RRC setup attempts]

2. Find measurement periods where there is a dip in performance.

Results (III/IX): Example on current PM and FM process (2/3)
3. Get the KPIs and failure counters on BTS level.
  • This can be achieved using the same reporting tool as in the first phase. The output is an extensive list of all the BTSs under one RNC, with all measurement periods and counters for each BTS.
4. Find the network elements that are causing the performance dip.
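Steps 2 and 4 are done by manual post-processing in the current process. As a rough sketch of what automating them could look like, the fragment below flags hours in which an RNC-level KPI falls below a threshold and then ranks the cells of such an hour by failure count. The function names, the threshold of 90 % and the RNC-level KPI values are assumptions made for the illustration; the per-cell rows reuse the RRC setup attempt and failure values of the hour-11 example.

```python
from typing import NamedTuple


class CellSample(NamedTuple):
    hour: str
    cell_id: int
    attempts: int
    failures: int


def find_dip_hours(kpi_by_hour: dict, threshold: float) -> list:
    """Step 2: return the hours, in order, whose KPI is below the threshold."""
    return [hour for hour, kpi in sorted(kpi_by_hour.items()) if kpi < threshold]


def rank_contributors(samples: list, hour: str, top_n: int = 3) -> list:
    """Step 4: return the top_n cells of one hour, sorted by failure count."""
    in_hour = [s for s in samples if s.hour == hour]
    return sorted(in_hour, key=lambda s: s.failures, reverse=True)[:top_n]


# Hypothetical RNC-level CSSR values (percent) for three hours.
rnc_cssr = {
    "2006/12/23 10": 97.9,
    "2006/12/23 11": 61.0,
    "2006/12/23 12": 88.5,
}

# Per-cell RRC setup attempts and failures for hour 11.
cells = [
    CellSample("2006/12/23 11", 1041, 1345, 1132),
    CellSample("2006/12/23 11", 1042, 1032, 820),
    CellSample("2006/12/23 11", 1043, 602, 452),
    CellSample("2006/12/23 11", 691, 185, 69),
    CellSample("2006/12/23 11", 693, 99, 24),
    CellSample("2006/12/23 11", 2011, 84, 19),
]

dip_hours = find_dip_hours(rnc_cssr, threshold=90.0)
worst_cells = rank_contributors(cells, "2006/12/23 11")
```

Ranking by absolute failure count rather than by percentage favours cells that carry enough traffic to move the RNC-level KPI; a cell with a poor ratio but only a handful of attempts would otherwise dominate the list.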
• After post-processing the data, the results are lists of the BTSs that are the main contributors to the performance dips.

Call setup dip at hour 11:

time           RNC id  BTS id  Cell id  CSSR   CCSR   RRC setup attempts  RRC setup failures
2006/12/23 11  1       104     1041     15.02  99.01  1345                1132
2006/12/23 11  1       104     1042     19.19  96.97  1032                820
2006/12/23 11  1       104     1043     22.26  97.76  602                 452
2006/12/23 11  1       69      691      56.76  99.05  185                 69
2006/12/23 11  1       69      693      59.60  96.61  99                  24
2006/12/23 11  1       201     2011     64.29  96.30  84                  19

Retainability dip at hour 16:

time           RNC id  BTS id  Cell id  CSSR   CCSR   RRC setup attempts  RAB active failures
2006/12/23 16  1       123     1231     98.46  15.80  1234                1023
2006/12/23 16  1       123     1232     98.45  24.70  1032                765
2006/12/23 16  1       123     1233     97.51  45.32  602                 321
2006/12/23 16  1       69      691      96.62  91.40  385                 32
2006/12/23 16  1       69      693      97.66  91.62  342                 28

[Remaining failure counter columns in both tables (RRC access, RAB setup, RAB access and RRC active failures, plus RAB active failures at hour 11 and RRC setup failures at hour 16): all values between 1 and 8]

Results (IV/IX): Example on current PM and FM process (3/3)
5. Gather the system logs for those network elements that are the main contributors to the RNC performance dip.
  • Achieved by connecting to the network element’s O&M unit, either with manual command procedures or with a tool that automates the procedure. The log files are usually in binary format, so they need to be opened with a parser or converted to text before the analysis can take place.
6. Analyse the detailed data.
  • The format of the data is vendor specific, i.e. not defined in public specifications, so no general guidance can be given for the analysis itself.
  • Highly dependent on individual system specialists who can handle the versatile analysis and produce reliable results.
  • In this context the analysis can be treated as a black box whose input is system data, i.e. logs, parameters, alarms, counters and KPIs, and whose output is a set of root causes for the performance problem that occurred.
7. Generate a solution to the root cause.
  • Needs the presence of a skilled system specialist. Depending on the type of solution, finding a working one might need a trial-and-error approach.
  • Before the solution is applied to a live network, it is tested in a test bed of the vendor. Some network operators also have test beds of their own, on which they verify the solutions, e.g. SW corrections, before they are installed in the live network.

Results (V/IX): Analysis of the current organisation and process
• Main problems:
  • Problems in current organisation operation
    • 7.2.1 High travel costs
    • 7.2.2 Troubleshooting poorly controlled
  • Problems in current PM and FM process
    • 7.3.2 NE logs not available for performance dips
    • 7.3.3 Alarms not mapped to performance dips
    • 7.3.4 Configuration data not available for performance dips
    • 7.3.5 Internal failures not distinguished from external causes
    • 7.3.6 Investigation is time consuming

Results (VI/IX): Problems in current PM and FM process, interrelationship and why-why diagrams of the process problems
[Figure: interrelationship and why-why diagrams. Nodes: Internal failures not distinguished from external; No reference points; Investigation is time-consuming; Manual work; No consistent set of tools for performance management and troubleshooting available; Alarms not mapped; Alarms not collected; NE logs not available; NE logs are too large to be saved continuously for a long time; Performance data is not available to base decision to gather logs; CM data not available; Too much traffic per RNC; RNC CPU and memory capacity too small for continuous monitoring; Lack of competence]

Results (VII/IX): Summary of the analysis
• The analysis set two general requirements for the solution:
  • Support fault management analysis conducted by system specialists. The solution should be able to collect relevant fault management (FM) data, i.e. NE logs, configuration data and alarms, for troubleshooting. The relevance of the FM data is evaluated on the basis of the performance measurement data, which may be collected either from the OSS or from the RNC.
  • Support general performance reporting conducted by operator and vendor performance management bodies. The solution should produce scalable reports of the performance measurement data. Reports should present the performance data both for the whole network and for individual network elements, down to the level of a single cell. Other statistical requirements are aggregation over time and the ability to average the data.

Results (VIII/IX): Solution for automated investigation
INPUT:
• “Connection to a live network”. The developed solution requires a working connection to the network, either remote or on-site. This avoids restricting the parts of the network from which the data can be gathered, i.e. the OSS, the NEs or other databases in the network.
[Figure: “the solution”, automated investigation of the network performance, placed in the Investigation (PM) phase ahead of Analysis (FM) and Decision (CM, SWM). Input: connection to a live network. Outputs: system log files for the failures that have occurred in the live network; overview report of the live network performance]
OUTPUT:
• System log files and other detail data for the failures that have occurred in the live network. The root-cause analysis phase utilises this data to make decisions.
• Overview reporting of the network performance that can be utilised in reporting the status of the network to company management and to the customer, i.e. the network operator.
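The reporting requirement from the summary of the analysis (the same data shown for the whole network, per network element and per cell) can be illustrated with a small aggregation sketch. It is an assumption of this example, not a statement about the implemented tool, that the raw counters are summed first and the ratio is computed afterwards. The field names are hypothetical, and the rows reuse the RRC setup attempt and failure values of five cells from the hour-11 example.

```python
from collections import defaultdict

# One row per cell and measurement period; field names are hypothetical.
ROWS = [
    {"rnc": 1, "bts": 104, "cell": 1041, "attempts": 1345, "failures": 1132},
    {"rnc": 1, "bts": 104, "cell": 1042, "attempts": 1032, "failures": 820},
    {"rnc": 1, "bts": 104, "cell": 1043, "attempts": 602, "failures": 452},
    {"rnc": 1, "bts": 69, "cell": 691, "attempts": 185, "failures": 69},
    {"rnc": 1, "bts": 69, "cell": 693, "attempts": 99, "failures": 24},
]


def success_rate_by(rows, level):
    """Sum counters per aggregation level, then compute a success rate in %.

    `level` is a tuple of field names, e.g. ("rnc",) or ("rnc", "bts").
    Groups without any attempts are left out of the result.
    """
    totals = defaultdict(lambda: [0, 0])
    for row in rows:
        key = tuple(row[name] for name in level)
        totals[key][0] += row["attempts"]
        totals[key][1] += row["failures"]
    return {
        key: 100.0 * (attempts - failures) / attempts
        for key, (attempts, failures) in totals.items()
        if attempts > 0
    }


per_cell = success_rate_by(ROWS, ("rnc", "bts", "cell"))
per_bts = success_rate_by(ROWS, ("rnc", "bts"))
per_rnc = success_rate_by(ROWS, ("rnc",))
```

Summing the counters before dividing weights every cell by its traffic, whereas averaging per-cell percentages would give a nearly idle cell the same influence as a busy one. Aggregation over time would work the same way if each row also carried a period field that is included in, or left out of, `level`.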
Results (IX/IX): Implementation of the solution
• The distributed system consists of five separate applications:
  • RNC monitor
  • RNC static performance data fetcher
  • OSS data fetcher
  • Processor & Report (application)
  • Report (server)
[Figure: the solution components attached to the UTRAN architecture. Sol1: OSS data fetcher (at the OSS); Sol2: RNC monitor/data fetcher; Sol3: Log & alarm fetcher; Sol4: Processor; Sol5: Report]

Conclusions of the thesis
• Summary of the thesis: the thesis studied practical problems of mobile radio network management.
• Conclusion: UTRAN vendor technical support requires a distributed system of troubleshooting tools to enhance its troubleshooting processes.
  • The purpose of the troubleshooting tools is to enhance the performance investigation by automating the gathering of performance data and other relevant network behaviour data for the time periods in which the network suffers from low performance.
• The reasoning behind the solution is based on:
  • The study of the current troubleshooting set-up:
    • Organisation: vendor home base technical support that is a link between the local teams, which are located at the operated networks, and the vendor R&D. On special occasions, e.g. a new product release or an emergency situation in a network, the organisation may adjust itself by temporarily transferring system specialists to work locally at the operated network.
    • Process: the practical performance and fault management process consists of three phases: investigation, analysis and decision.
  • The analysis of the current set-up:
    • Currently the main problem is the inefficiency of the first phase, i.e. investigation, in the performance and fault management process.
• Generalisation of the results
  • The same principles are applicable to the performance and fault management of other radio networks (e.g. GERAN).
  • Utilising an OSS in data gathering makes the solution more portable to other radio network systems.
    • An OSS typically uses relational SQL databases. Different radio networks have different performance indicators, so the same tools may be used after the SQL queries have been modified, which is a straightforward process.
• Future work
  • The scope was limited to investigation. The complex analysis phase also has demanding development needs.
  • The technical support organisation requires product processes to manage the development and maintenance of the troubleshooting tools.

References
• Standards and Technical Specifications
  • 3GPP: GSM, 3G and LTE
  • IEEE: WiFi and WiMAX
• Commercial material
  • Nokia Multiradio: http://www.nokia.com/NOKIA_COM_1/Microsites/NokiaWorld/Press/Multiradio_Press_Backgrounder.pdf
  • Cisco WiMAX: http://www.cisco.com/en/US/netsol/ns616/networking_solutions_customer_profile0900aecd80334a23.html
• Previous theses
  • Kujala, Kimmo (2006) Expert System for Mobile Network Troubleshooting. Thesis. Diplomityö, TKK / Sähkö- ja tietoliikennetekniikan osasto, 2006. 72 p.
    • An attempt to build an automated fault analysis tool system. The result of the thesis was that automated analysis is still unreliable!
  • Utriainen, Juha (2004) UTRAN Operation System Security. Thesis. Diplomityö, TKK / Sähkö- ja tietoliikennetekniikan osasto, 2004. 64 p.
    • Gives a good overview of UTRAN O&M (Operation and Maintenance)
• Handbooks
  • Kaaranen, Heikki (2005) UMTS Networks – Architecture, Mobility and Services. Second Edition. John Wiley & Sons. ISBN: 0470011033
  • Nielsen, Jakob (1993) Usability Engineering. Boston: Academic Press, 1993.
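As an illustration of the generalisation point in the conclusions, namely that an OSS-based tool can be ported to another radio network mainly by changing its SQL queries, the sketch below runs a KPI query against an in-memory SQLite database that stands in for an OSS measurement database. The table name, column names, threshold and inserted rows are all hypothetical; a real OSS schema will differ.

```python
import sqlite3

# Only this query would need to change for another radio technology;
# the table and column names here are invented for the example.
KPI_QUERY = """
SELECT period,
       cell_id,
       100.0 * SUM(access_completes) / SUM(setup_attempts) AS rrc_acc
FROM rrc_counters
GROUP BY period, cell_id
HAVING SUM(setup_attempts) > 0
   AND 100.0 * SUM(access_completes) / SUM(setup_attempts) < :threshold
ORDER BY rrc_acc
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory stand-in for the OSS measurement database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE rrc_counters ("
        "period TEXT, cell_id INTEGER, "
        "setup_attempts INTEGER, access_completes INTEGER)"
    )
    conn.executemany(
        "INSERT INTO rrc_counters VALUES (?, ?, ?, ?)",
        [
            ("2006/12/23 11", 1041, 1345, 213),  # made-up counter values
            ("2006/12/23 11", 691, 185, 116),
            ("2006/12/23 11", 2012, 500, 495),
        ],
    )
    return conn


def low_performance_cells(conn: sqlite3.Connection, threshold: float) -> list:
    """Return (period, cell_id, rrc_acc) rows whose KPI is below threshold."""
    return conn.execute(KPI_QUERY, {"threshold": threshold}).fetchall()


demo_conn = build_demo_db()
dips = low_performance_cells(demo_conn, threshold=90.0)
```

Keeping the query text separate from the fetching code is what makes the porting step small: the Python side only assumes that each result row carries a period, an object identifier and a KPI value.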
