The HAS Architecture: A Highly Available and Scalable Cluster Architecture for Web Servers

Ibrahim Haddad

A Thesis in the Department of Computer Science and Software Engineering

Presented in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy at Concordia University

Montréal, Québec, Canada

March 2006

© Ibrahim Haddad, 2006

CONCORDIA UNIVERSITY
SCHOOL OF GRADUATE STUDIES

This is to certify that the thesis prepared

By: _______________________________

Entitled: _______________________________

and submitted in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY (Computer Science and Software Engineering)

complies with the regulations of the University and meets the accepted standards with respect to originality and quality.

Signed by the final examining committee:

__________________________________________ Chair
__________________________________________ External Examiner
__________________________________________ External to Program
__________________________________________ Examiner
__________________________________________ Examiner
__________________________________________ Thesis Supervisor

Approved by _________________________________________ Chair of Department or Graduate Program Director

____________ 2006 _______________________ Dr. Nabil Esmail, Dean, Faculty of Engineering and Computer Science

Abstract

The HAS Architecture: A Highly Available and Scalable Cluster Architecture for Web Servers

Ibrahim Haddad, Ph.D.
Concordia University, 2006

This dissertation proposes a novel architecture, called the HAS architecture, for scalable and highly available web server clusters. The prototype of the Highly Available and Scalable Web Server Architecture was validated for scalability and high availability. It provides non-stop service and is able to maintain a baseline performance of approximately 1,000 requests per second per processor for up to 16 traffic processors in the cluster, achieving close to linear scalability. The architecture supports dynamic traffic distribution using a lightweight distribution scheme, and supports connection synchronization to ensure that web connections survive software or hardware failures. Furthermore, the architecture supports different redundancy models and high availability capabilities, such as Ethernet and NFS redundancy, that contribute to increasing the availability of the service and to eliminating single points of failure. This dissertation presents current methods for scaling web servers, discusses their limitations, and investigates how clustering technologies can help overcome some of these challenges and enable the design of scalable web servers based on a cluster of workstations. It examines various ongoing research projects in academia and industry that are investigating scalable and highly available architectures for web servers. It discusses their scope and architecture, provides a critical analysis of their work, and presents their advantages, drawbacks, and contributions to this dissertation. The proposed Highly Available and Scalable Web Server Architecture builds on current knowledge, and provides contributions in areas such as scalability, availability, performance, traffic distribution, and cluster representation.
Acknowledgments

The work that has gone into this thesis has been thoroughly enjoyable, largely because of the interaction that I have had with my supervisors and colleagues. I would like to express my gratitude to my supervisor, Professor Greg Butler, whose expertise, understanding, and patience added considerably to my graduate experience. I appreciate his vast knowledge and skills in many areas, and his encouragement, which provided me with much support, guidance, and constructive criticism. I would like to thank the other members of my committee, Professor J. William Atwood, Dr. Ferhat Khendek, and Professor Thiruvengadam Radhakrishnan, for the assistance they provided at all levels of the project. The feedback I received from members of my committee, as early as my doctoral proposal, was very important and influenced the direction of the work. I also would like to acknowledge the support I received from Ericsson Research, which granted me unlimited access to its remarkable research lab in Montréal, Canada. I would also like to express my gratitude to my wife, parents, brother, and sister for their love, encouragement, and support.

Ibrahim Haddad
March 2006

Table of Contents

Abstract
Acknowledgments
Table of Contents
List of Figures
List of Tables

Chapter 1 Introduction and Motivation
1.1 Internet and Web Servers
1.2 The Need for Scalability
1.3 Web Servers Overview
1.4 Properties of Internet and Web Applications
1.5 Study Objectives
1.6 Scope of the Study
1.7 Thesis Contributions
1.8 Dissertation Roadmap

Chapter 2 Background and Related Work
2.1 Cluster Computing
2.2 SMP versus Clusters
2.3 Cluster Software Components
2.4 Cluster Hardware Components
2.5 Benefits of Clustering Technologies
2.6 The OSI Layer Clustering Techniques
2.7 Clustering Web Servers
2.8 Scalability in Internet and Web Servers
2.9 Overview of Related Work
2.10 Related Work: In-depth Examination

Chapter 3 Preparatory Work
3.1 Early Work
3.2 Description of the Prototyped Web Cluster
3.3 Benchmarking Environment
3.4 Web Server Performance
3.5 LVS Traffic Distribution Methods
3.6 Benchmarking Scenarios
3.7 Apache Performance Test Results
3.8 Tomcat Performance Test Results
3.9 Scalability Results
3.10 Discussion
3.11 Contributions of the Preparatory Work

Chapter 4 The Architecture of the Highly Available and Scalable Web Server Cluster
4.1 Architectural Requirements
4.2 Overview of the Challenges
4.3 The HAS Architecture
4.4 HAS Architecture Components
4.5 HAS Architecture Tiers
4.6 Characteristics of the HAS Cluster Architecture
4.7 Availability and Single Points of Failure
4.8 Overview of Redundancy Models
4.9 HA Tier Redundancy Models
4.10 SSA Tier Redundancy Models
4.11 Storage Tier Redundancy Models
4.12 Redundancy Model Choices
4.13 The States of a HAS Cluster Node
4.14 Example Deployment of a HAS Cluster
4.15 The Physical View of the HAS Architecture
4.16 The Physical Storage Model of the HAS Architecture
4.17 Types and Characteristics of the HAS Cluster Nodes
4.18 Local Network Access
4.19 Master Nodes Heartbeat
4.20 Traffic Nodes Heartbeat using the LDirectord Module
4.21 CVIP: A Cluster Virtual IP Interface for the HAS Architecture
4.22 Connection Synchronization
4.23 Traffic Management
4.24 Access to External Networks and the Internet
4.25 Ethernet Redundancy
4.26 Dependencies and Interactions between Software Components
4.27 Scenario View of the Architecture
4.28 Network Configuration with IPv6

Chapter 5 Architecture Validation
5.1 Introduction
5.2 Validation of Performance and Scalability
5.3 The Benchmarked HAS Architecture Configurations
5.4 Test-0: Experiments with One Standalone Traffic Node
5.5 Test-1: Experiments with a 4-node HAS Cluster
5.6 Test-2: Experiments with a 6-node HAS Cluster
5.7 Test-3: Experiments with a 10-node HAS Cluster
5.8 Test-4: Experiments with an 18-node HAS Cluster
5.9 Scalability Charts
5.10 Validation of High Availability
5.11 HA-OSCAR Architecture: Modeling and Availability Prediction
5.12 Impact of the HAS Architecture on Open Source
5.13 HA-OSCAR versus Beowulf Architecture
5.14 The HA-OSCAR Architecture versus the HAS Architecture
5.15 HAS Architecture Impact on Industry

Chapter 6 Contributions, Future Work, and Conclusion
6.1 Contributions
6.2 Future Work
6.3 Conclusion

Bibliography
Glossary

List of Figures

Figure 1: Web server components
Figure 2: Request handling inside a web server
Figure 3: Analysis of a web request
Figure 4: The SMP architecture
Figure 5: The MPP architecture
Figure 6: Generic cluster architecture
Figure 7: Cluster architectures with and without shared disks
Figure 8: A cluster node stack
Figure 9: The L4/2 clustering model
Figure 10: Traffic flow in an L4/2 based cluster
Figure 11: The L4/3 clustering model
Figure 12: The traffic flow in an L4/3 based cluster
Figure 13: The process of content-based dispatching – L7 clustering model
Figure 14: A web server cluster
Figure 15: Using a router to hide the web cluster
Figure 16: Hierarchical redirection-based web server architecture
Figure 17: Redirection mechanism for HTTP requests
Figure 18: The web farm architecture with the dispatcher as the central component
Figure 19: The SWEB architecture
Figure 20: The functional modules of a SWEB scheduler in a single processor
Figure 21: The LSMAC implementation
Figure 22: The LSNAT implementation
Figure 23: The architecture of the IP sprayer
Figure 24: The architecture with the HACC smart router
Figure 25: The two-tier server architecture
Figure 26: The flow of the web server router
Figure 27: The architecture of the prototyped web cluster
Figure 28: The architecture of the WebBench benchmarking tool
Figure 29: The architecture of the LVS NAT method
Figure 30: The architecture of the LVS DR method
Figure 31: Benchmarking results of NAT versus DR
Figure 32: Benchmarking results of the Apache web server running on a single processor
Figure 33: Apache reaching a peak of 5,903 KB/s before the Ethernet driver crashes
Figure 34: Benchmarking results of Apache on one processor – post Ethernet driver update
Figure 35: Results of a two-processor cluster (requests per second)
Figure 36: Results of a four-processor cluster (requests per second)
Figure 37: Results of an eight-processor cluster (requests per second)
Figure 38: Results of Tomcat running on two processors (requests per second)
Figure 39: Results of a four-processor cluster running Tomcat (requests per second)
Figure 40: Results of an eight-processor cluster running Tomcat (requests per second)
Figure 41: Scalability chart for clusters consisting of up to 12 nodes running Apache
Figure 42: Scalability chart for clusters consisting of up to 12 nodes running Tomcat
Figure 43: The HAS architecture
Figure 44: Built-in redundancy at different layers of the HAS architecture
Figure 45: The process of the network adapter swap
Figure 47: The 1+1 active/standby redundancy model
Figure 48: Illustration of the failure of the active node
Figure 49: The 1+1 active/active redundancy model
Figure 50: The N+M and N-way redundancy models
Figure 51: The N+M redundancy model with support for state replication
Figure 52: The N+M redundancy model, after the failure of an active node
Figure 53: The redundancy models at the physical level of the HAS architecture
Figure 54: The state diagram of the state of a HAS cluster node
Figure 55: The state diagram including the standby state
Figure 56: A HAS cluster using the HA NFS implementation
Figure 57: The HA-OSCAR prototype with dual active/standby head nodes
Figure 58: The physical view of the HAS architecture
Figure 59: The no-shared storage model
Figure 60: The HAS storage model using a distributed file system
Figure 61: The NFS server redundancy mechanism
Figure 62: DRBD disk replication for two nodes in the 1+1 active/standby redundancy model
Figure 63: A HAS cluster with two specialized storage nodes
Figure 64: The master node stack
Figure 65: The traffic node stack
Figure 66: The redundant LAN connections within the HAS architecture
Figure 67: The topology of the heartbeat Ethernet broadcast
Figure 68: The CVIP generic configuration
Figure 69: Level of distribution
Figure 70: Network termination concept
Figure 71: The CVIP framework
Figure 72: Step 1 – Connection Synchronization
Figure 73: Step 2 – Connection Synchronization
Figure 74: Step 3 – Connection Synchronization
Figure 75: Step 4 – Connection Synchronization
Figure 76: Peer-to-peer approach
Figure 77: The CPU information available in /proc/cpuinfo
Figure 78: The memory information available in /proc/meminfo
Figure 79: Example list of traffic nodes and their load index
Figure 80: Illustration of the interaction between the traffic client and the traffic manager
Figure 81: The direct routing approach – traffic nodes reply directly to web clients
Figure 82: The restricted access approach – traffic nodes reply to master nodes, which in turn reply to the web clients
Figure 83: The dependencies and interconnections of the HAS architecture system software
Figure 84: The sequence diagram of a successful request with one active master node
Figure 85: The sequence diagram of a successful request with two active master nodes
Figure 86: A traffic node reporting its load index to the traffic manager
Figure 87: A traffic node joining the HAS cluster
Figure 88: The boot process of a diskless node
Figure 89: The boot process of a traffic node with disk – no software upgrades are performed
Figure 90: The process of rebuilding a node with disk
Figure 91: The process of upgrading the kernel and application server on a traffic node
Figure 92: The sequence diagram of upgrading the hardware on a master node
Figure 93: The sequence diagram of a master node becoming unavailable
Figure 94: The NFS synchronization occurs when a master node becomes unavailable
Figure 95: The sequence diagram of a traffic node becoming unavailable
Figure 96: The scenario assumes that node C has lost network connectivity
Figure 97: The scenario of an Ethernet port becoming unavailable
Figure 98: The sequence diagram of a traffic node leaving the HAS cluster
Figure 99: The LDirectord restarting an application process
Figure 100: The network becomes unavailable
Figure 101: The sequence diagram of the IPv6 autoconfiguration process
Figure 102: A functional HAS cluster supporting IPv4 and IPv6
Figure 103: A screen capture of the WebBench software showing 379 connected clients
Figure 104: The network setup inside the benchmarking lab
Figure 105: The benchmarked HAS cluster configurations showing Test-[1..4]
Figure 106: The results of benchmarking a standalone processor – transactions per second
Figure 107: The throughput benchmarking results of a standalone processor
Figure 108: The number of failed requests per second on a standalone processor
Figure 109: The number of successful requests per second on a HAS cluster with four nodes
Figure 110: The throughput results (KB/s) on a HAS cluster with four nodes
Figure 111: The number of failed requests per second on a HAS cluster with four nodes
Figure 112: The number of successful requests per second on a HAS cluster with six nodes
Figure 113: The throughput results (KB/s) on a HAS cluster with six nodes
Figure 114: The number of failed requests per second on a HAS cluster with six nodes
Figure 115: The number of successful requests per second on a HAS cluster with 10 nodes
Figure 116: The throughput results (KB/s) on a HAS cluster with 10 nodes
Figure 117: The number of successful requests per second on a HAS cluster with 18 nodes
Figure 118: The throughput results (KB/s) on a HAS cluster with 18 nodes
Figure 119: The results of benchmarking the HAS architecture prototype
Figure 120: The scalability chart of the HAS architecture prototype
Figure 121: The possible connectivity failure points
Figure 122: The tested setup for data redundancy
Figure 123: The modeled HA-OSCAR architecture, showing the three sub-models
Figure 124: A screen shot of the SPNP modeling tool
Figure 125: System instantaneous availabilities
Figure 126: Availability improvement analysis of HA-OSCAR versus the Beowulf architecture
Figure 127: The architecture of a Beowulf cluster
Figure 128: The architecture of HA-OSCAR
Figure 129: The CGL cluster architecture based on the HAS architecture
Figure 130: The contributions of the HAS architecture
Figure 131: The untested configurations of the HAS architecture
Figure 133: The architecture logical view with specialized nodes

List of Tables

Table 1: Classification of clusters by usage and functionality
Table 2: Characteristics of SMP and cluster systems
Table 3: Expected service availability per industry type
Table 4: Advantages and drawbacks of clustering techniques operating at the OSI layer
Table 5: Web performance metrics
Table 6: The results of benchmarking with Apache
Table 7: The results of benchmarking with Tomcat
Table 8: The possible redundancy models per each tier of the HAS architecture
Table 9: The supported redundancy models per each tier in the HAS architecture prototype
Table 10: The performance results of one standalone processor running the Apache web server
Table 11: The results of benchmarking a four-node HAS cluster
Table 12: The results of benchmarking a HAS cluster with six nodes
Table 13: The results of benchmarking a HAS cluster with 10 nodes
Table 14: The summary of the benchmarking results of the HAS architecture prototype
Table 15: Input parameters for the HA-OSCAR model
Table 16: System availability for different configurations
Table 17: The changes made to the Linux kernel to support NFS redundancy

Chapter 1 Introduction and Motivation

1.1 Internet and Web Servers

An Internet server is a server that provides services to users over the Internet using the client/server model.
We differentiate between the various flavors of Internet servers by the type of services they offer, the characteristics of their workload, their degree of high availability, security levels, performance, throughput, and response times. A web server is one specialization of Internet servers. A web server is a client/server program that uses the Hypertext Transfer Protocol (HTTP) to serve static and dynamic content to web users. We use web servers as a case study throughout this dissertation.

In recent years, interest in and deployment of scalable and highly available Internet servers have increased rapidly because of the wide potential such systems offer. This progress has been made possible by advances in network, software, and computer technologies. However, there are still many challenges to resolve. Scalability is one of the biggest challenges facing Internet servers that provide interactive services to a large user base, and it presents itself as a crucial factor in the success or failure of an online service. The scalability of an Internet server refers to its ability to retain performance levels when adding additional resources, without requiring architectural or technology changes and without imposing additional overhead. For instance, a web server scales linearly if it continues to be available and functional at consistent speeds as the number of users and requests grows very large. As long as the web server continues to provide consistent performance in the face of rising demand, it is scalable. Suppose a web server can serve up to 1,000 requests per second with a single processor (Section 3.7). If we double the number of processors, then each processor in the cluster should theoretically maintain 1,000 requests per second, with the cluster as a whole serving 2,000 requests per second, achieving linear scalability.

The common strategy for measuring server scalability is to measure throughput as the number of users or the traffic increases, and to identify important trends. For instance, we measure the throughput of the server with 100 concurrent transactions, then with 1,000, and then with 10,000 transactions. We then examine how throughput changes and observe how it compares with linear scalability. This comparison gives us a measure of the scalability of the architecture.

To measure scalability, we need to benchmark and measure the performance of the web server. Benchmarks operate on a real system or a working prototype. A benchmark is a publicly defined procedure designed to evaluate the performance of a web server system using a well-defined and standardized workload model. A web server benchmark generates a controlled stream of web requests that reproduces, as accurately as possible, the characteristics of real traffic patterns, and reports the results using standard metrics. The main goals of benchmarking are: to measure the performance and scalability of request dispatching algorithms and request routing mechanisms, to assess the impact of changes in the information system such as the system architecture and the distribution of content and services, and to help tune the system configuration. Furthermore, benchmarking helps evaluate the system capacity and response time with respect to an existing, expected, or standardized workload.
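To make the comparison with linear scalability concrete, the following Python sketch computes a scalability efficiency figure from a set of throughput measurements. The numbers are hypothetical placeholders, not results from Chapter 3 or Chapter 5; only the method, dividing measured throughput by the linear-scaling ideal, reflects the procedure described above.

    # Hypothetical throughput measurements: processors -> requests/second.
    # The values are illustrative only.
    measured = {1: 1000, 2: 1950, 4: 3800, 8: 7200, 16: 13500}

    baseline = measured[1]  # single-processor throughput

    for processors, throughput in sorted(measured.items()):
        ideal = baseline * processors    # linear-scalability target
        efficiency = throughput / ideal  # 1.0 means perfectly linear
        print(f"{processors:2d} processors: {throughput:6d} req/s "
              f"(ideal {ideal:6d}, efficiency {efficiency:.2f})")

An efficiency that stays near 1.0 as processors are added indicates close-to-linear scalability; a falling efficiency reveals the degradation trend the benchmarks look for.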
The common metrics for web server performance include throughput, connection rate, request rate, reply rate, error rate, connect time, latency time, and transfer time. Section 3.4 discusses the metrics for web server performance.

1.2 The Need for Scalability

The explosive growth of the Internet in the last few years has given rise to a vast range of new online services. Current Internet services span a diverse range of categories that require significant computational and I/O resources to process each request. Furthermore, the exponential growth of the Internet population is placing unprecedented demands upon the scalability and robustness of these services [1][2]. Yahoo!, for instance, receives over 1.2 billion page views a day [3], while AOL's web caches service over 10 billion hits daily [4].

Internet services have become critical both for driving large businesses and for personal productivity. Global enterprises are increasingly dependent upon Internet-based applications for e-commerce, supply chain management, human resources, and financial accounting, while many individuals consider e-mail and web access to be indispensable. This growing dependence upon Internet services underscores the importance of their availability, scalability, and ability to handle large loads. Popular web sites such as eBay [5], Excite [6], and E*Trade [7] have, at times, experienced costly and high-profile outages during periods of high load. As more people rely upon the Internet for managing financial accounts, paying bills, and potentially even voting in elections, it is increasingly important that these services are available at all times, perform well under high demand, and are robust enough to accommodate rapid changes in load. Furthermore, the variations in load experienced by web servers intensify the challenges of building scalable and highly available web servers. It is not uncommon to experience more than 100-fold increases in demand when a web site becomes popular [8]. When the terrorist attacks on New York City and Washington DC occurred on September 11, 2001, Internet news services reached unprecedented levels of demand. CNN.com, for instance, experienced a two-and-a-half-hour outage with load exceeding 20 times the expected peak [8]. Although the site team managed to grow the server farm by a factor of five by borrowing machines from other sites, this arrangement was not sufficient to deliver adequate service during the load spike. CNN.com came back online only after replacing the front page with a text-only summary in order to reduce the load [9]. Web sites are also subject to sophisticated denial-of-service attacks, often launched simultaneously from thousands of servers, which can knock a service out of commission. Denial-of-service attacks have had a major impact on the performance of sites such as Yahoo! and whitehouse.gov [10]. The number of concurrent sessions and hits per day to Internet sites translates into a large number of I/O and network requests, placing enormous demands on underlying resources.

1.3 Web Servers Overview

Web servers are a specialization of servers providing services over the Internet. The following subsections describe the functions of a web server, present the HTTP [11] request and reply cycle, and discuss protocol performance issues.

1.3.1 Web Server Definition

A web server, sometimes called an HTTP server, is a program that responds to an incoming TCP connection and provides a service to the requester.
Its primary purpose is to provide data using HTTP [11]. There are many variants of web servers for different operating system platforms. Figure 1 illustrates the three core components of a web server: the web server software, a computer system with a connection to the Internet, and the information or documents that are available for serving. In this dissertation, the term web server refers to the whole entity: computer platform, server software, and documents.

Figure 1: Web server components

1.3.2 Functions of a Web Server

The primary function of a web server is to service requests made over HTTP. The web server receives a request asking for a specific resource, a document for instance, and then returns the resource as a response to the web client. A web server consists of several internal components, each responsible for imparting certain functionality to the whole system. These components work in tandem, as well as with other external components. The web server software is composed of a stream manager, an HTTP server, and a path resolver. These internal components interact with external components such as the common gateway interface (CGI) [12] and the file system. The file system, where the data resides, is the main source of information for a web server. The CGI provides an interface where the use of scripts empowers the server to compute request-specific information; it allows web clients to view documents created at run time.

The data flow in the server, from the time a request enters the server until the server sends the output back to the client, follows a cycle called the data flow cycle. Figure 2 illustrates how a web server handles an incoming web client request. The stream manager is the first point of contact for a web client with the web server. When a client sends a request to a server (1), it docks on to the stream manager. A client might reference a file in its request, and as a result the web server returns that file to the client. A client can also request a program, a CGI script for instance, in which case the web server launches that program and returns the resulting output to the client. The stream manager decodes the request (2) and pushes the request to the HTTP server module. The HTTP server is the layer between the server and the underlying operating system. It provides the necessary functionality to gain access to the file system, either by connecting directly through system calls or through the CGI. The HTTP server decodes the path of the request from the uniform resource locator (URL) of the request (3) using the path resolver. Next, the HTTP server authenticates the user using the access list, which stores all the authorized users. On a successful authentication (4), the HTTP server accesses the file system (5). Depending on the request type, the web server either retrieves a resource from the file system or executes a new program in a separate process. In both cases, the result of the operation is written on an output stream and passed on to the client via the stream manager (6).
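The data flow cycle can be made concrete with a deliberately minimal Python sketch of a web server. It collapses the stream manager, path resolver, and HTTP server into a few lines, replaces the access list with a simple path check, and omits CGI, concurrency, and logging; the document root and port are assumptions of this example, not part of any architecture discussed in this dissertation.

    import os
    import socket

    DOC_ROOT = "/var/www/html"  # assumed document root for the sketch

    def handle(conn):
        request = conn.recv(4096).decode("latin-1")        # (1)-(2) receive and decode
        method, url, _ = request.split("\r\n")[0].split()  # parse the request line
        # (3) path resolver: map the URL onto the file system
        path = os.path.normpath(os.path.join(DOC_ROOT, url.lstrip("/")))
        if not path.startswith(DOC_ROOT):                  # (4) crude access check
            conn.sendall(b"HTTP/1.0 403 Forbidden\r\n\r\n")
        elif method == "GET" and os.path.isfile(path):
            with open(path, "rb") as f:                    # (5) file system access
                body = f.read()
            header = b"HTTP/1.0 200 OK\r\nContent-Length: %d\r\n\r\n" % len(body)
            conn.sendall(header + body)                    # (6) reply on the output stream
        else:
            conn.sendall(b"HTTP/1.0 404 Not Found\r\n\r\n")
        conn.close()

    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("", 8080))  # port 8080 rather than the default port 80 (no root privileges needed)
    srv.listen(5)
    while True:
        conn, _ = srv.accept()  # (1) the client docks on to the listener
        handle(conn)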
Figure 2: Request handling inside a web server

Figure 3: Analysis of a web request

Figure 3 illustrates the two main phases that a web request goes through outside the web server: the lookup phase, which includes steps (1), (2), and (3), and the request phase, which includes steps (4) and (5). When the user requests a web site from the browser in the form of a URL, the request arrives (1) at the local domain name system (DNS) server, which consults (2) the authoritative DNS server responsible for the requested web site. The local DNS server then sends back (3) the IP address of the web server hosting the requested web site to the client. The client requests (4) the document from the web server using the web server's IP address, and the web server responds (5) to the client with the requested web document.

Web servers should cope with numerous incoming requests using minimal system resources. They have to multitask to deal with more than one request at a time. They provide mechanisms to control access authorization and to ensure that incoming requests are not a threat to the host system on which the web server software runs. In addition, web servers respond to error messages they receive, negotiate a style and language of response with the client, and in some cases run as a proxy server. Web servers generate logs of all connections for statistics and security purposes.

1.3.3 The HTTP Request and Reply Cycle

Web servers follow the client/server model, where a client sends a request to the server and the server replies to the client. This cycle is called the HTTP request and reply cycle. This section illustrates the steps that take place in an HTTP request and response cycle. The web server, running as a background process, listens on a specific port for requests addressed to it; the default port where web servers listen for incoming connections is port 80. An HTTP request arrives from a client at the web server. The request includes the information needed by the web server to decide what to do with the request. The web server reads the request and parses it to determine how to handle it. The web server then tries to fulfill the request. If the client is requesting a document, the web server locates the document on disk. Each web server has a repository of data that each request accesses. If there is only one copy of the data, then all concurrently processed requests funnel through the one database of documents. The web server sends a response to the client, which in this case is the requested document. Finally, the web server cleans up by closing open files and terminating the network connection. This cycle illustrates how a web server handles one request per connection. With the newer version of the protocol, HTTP/1.1, a persistent connection is set up between the web client and the web server; it stays open and allows the web client to send multiple requests. The operating system completes several parts of the HTTP request and reply cycle, such as creating a new process to handle an incoming request, reading from the network, looking up the requested document, reading the document from disk, and writing the document onto the network.
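The two phases of Figure 3 can be reproduced with Python standard library calls, as in the sketch below. The host name is a placeholder; the single resolver call hides steps (1) through (3), which the local and authoritative DNS servers carry out on the client's behalf.

    import socket
    import http.client

    host = "www.example.com"  # placeholder for www.website.com in Figure 3

    # Lookup phase, steps (1)-(3): the resolver queries the local DNS server,
    # which consults the authoritative server for the requested domain.
    ip = socket.gethostbyname(host)
    print(f"{host} resolves to {ip}")

    # Request phase, steps (4)-(5): connect to the web server's IP address
    # and request a document over HTTP.
    conn = http.client.HTTPConnection(ip, 80, timeout=10)
    conn.request("GET", "/", headers={"Host": host})  # Host header selects the site
    resp = conn.getresponse()
    print(resp.status, resp.reason, len(resp.read()), "bytes")
    conn.close()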
1.3.4 Web Server Requirements

This section discusses the requirements of a highly available and scalable web server that relate to this dissertation, such as minimal response time, fast processing of requests, high availability, and reliability.

- Minimal response time: A crucial factor in the success of a web service is the response time experienced by the web user. The server has to minimize response times to meet the expectations of the users. Research on a wide variety of web systems has demonstrated that users require response times of less than one second when moving from one page to another while browsing the web [13]. Traditional human factors research into response times confirmed the need for response times faster than one second. The general guidelines regarding response times have been about the same for almost thirty years [14][15]; a measurement sketch appears after this list of requirements:
  - 0.1 second is the approximate limit for having the user feel that the system is reacting instantaneously, meaning no special feedback is necessary to display the result.
  - 1.0 second is the approximate limit for the user's flow of thought to remain uninterrupted, even if the user notices the delay. Normally, no special feedback is necessary during delays of more than 0.1 second and less than 1.0 second, but the user does lose the feeling of operating directly on the data.
  - 10 seconds is the approximate limit for keeping the attention of the user focused on the dialogue. For longer delays, users want to perform other tasks while waiting for the computer to finish, so they expect some feedback indicating when the computer expects to finish. Feedback during the delay is especially important if the response time is likely to be highly variable, since users do not know what to expect.

- Fast processing capability: In a distributed environment, several factors introduce delays that reduce the speed at which a web request is processed and at which traffic is distributed among the servers within the web cluster. Therefore, a fast and lightweight traffic distribution mechanism can help speed up the processing of web requests.

- Reliability and availability: Web servers need to be reliable and highly available. As the number of users and the volume of data handled increase, it becomes difficult to guarantee web server reliability and availability. Therefore, web servers should deploy hardware and software fault-tolerance and redundancy mechanisms to ensure reliability, to prevent single points of failure, and to maintain availability in case of a hardware or software failure.

- Ability to sustain a guaranteed number of connections: This requirement entails that the web server maintain a minimum number of connections per second and process these connections simultaneously. The ability to sustain a guaranteed number of connections, also described as maintaining the base performance, has a direct effect on the total number of requests the web server can process at any point in time.

- High storage capacity: Web servers provide I/O storage capacity for the data and variety of information they host. In addition, with the increased demand for multimedia data, fast data retrieval is an essential requirement.

- Cost effectiveness: An important requirement governing the future of web servers is their cost effectiveness. The cost per transaction increases as the number of transactions increases, and as a result, cost becomes an important factor governing which server architecture and software to use in any given deployment.
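The response-time guidelines above can be checked directly against a running server. The following sketch, with a placeholder host, times a complete request (connect, request, and transfer) and classifies the result against the 0.1, 1.0, and 10 second thresholds quoted from [14][15].

    import time
    import http.client

    def classify(seconds):
        # Thresholds from the human-factors guidelines cited above [14][15].
        if seconds <= 0.1:
            return "feels instantaneous"
        if seconds <= 1.0:
            return "flow of thought uninterrupted"
        if seconds <= 10.0:
            return "attention kept, but delay noticed"
        return "user likely to turn to other tasks"

    host = "www.example.com"  # placeholder target for the sketch
    start = time.monotonic()
    conn = http.client.HTTPConnection(host, 80, timeout=15)
    conn.request("GET", "/")
    conn.getresponse().read()  # include transfer time in the measurement
    elapsed = time.monotonic() - start
    conn.close()
    print(f"response time {elapsed:.3f}s: {classify(elapsed)}")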
Designing a high-performance and scalable web and Internet server is a challenging task. This dissertation aims to understand what causes scalability problems in web server clusters and explores how we can scale a web server cluster. The dissertation focuses on the design of a next-generation cluster architecture that meets the requirements discussed above. The architecture needs to scale linearly for up to 16 processors and to support service availability and reliability. The architecture will inherently meet other requirements such as better cluster resource utilization and the ability to handle different types of traffic.

1.4 Properties of Internet and Web Applications

To scale the performance of a distributed web server, system designers need to understand its structure and purpose, as well as the characteristics of the provided services and data, for these all affect the architecture and the scaling methods. The three fundamental properties of web applications, also applicable to web servers, are massive concurrency demands, an increasing trend towards complex content, and a need to be extremely robust to load. The following subsections discuss each of these properties.

1.4.1 High Concurrency

The growth in popularity and functionality of Internet and web services has been astounding. While the World Wide Web itself is growing in size, with recent estimates anywhere between 1 billion and 2.5 billion unique documents, the number of users on the web is also growing at a staggering rate [16][17]. In April 2002, Nielsen NetRatings estimated that there were over 422 million Internet users worldwide [18]. Consequently, Internet and web applications need to support unprecedented concurrency demands, and these demands are increasing over time.

1.4.2 Dynamic Content

In recent years, the use of dynamic content generation has become more widespread. Requests for dynamic content require significant amounts of computation and I/O to generate a response. A dynamic Internet service involves a range of systems, including web servers, middle-tier application servers, and back-end databases, to process each request. The processing for each request might include encryption and decryption, server-side scripting, database access, or access to legacy systems such as mainframes. Such dynamic services require more resources. The content generated by a dynamic web service is often not amenable to caching, so each request demands a large amount of server resources. In addition, the resource demands for a given service are difficult to predict. Apart from placing new demands on networks, this growth in content requires Internet and web services to dedicate resources to store and serve vast amounts of data.

1.4.3 Robustness to Load

The demand for Internet services can spike rapidly and unexpectedly, causing the peak load to reach a much higher magnitude than the average load [5][6][7][8]. Given the vast user population on the Internet, a site can suddenly become flooded with a surge of requests that far exceeds its ability to deliver service. A common approach to dealing with heavy load is to over-provision resources. However, over-provisioning is not feasible when the ratio of peak to average load is very high; it may not be practical to add 10 or 100 times the number of machines needed to support the average load case. This approach also neglects cost and system administration issues.
Since we cannot expect Internet services to scale to peak demand, it is critical that we design services that can accommodate high load. When the demand on a service exceeds its capacity, the service should not overcommit its resources and degrade in a way that makes all clients suffer. Rather, the service needs to be aware of overload conditions and attempt to adapt to them, by degrading the quality of service delivered to clients or by predictably shedding load, such as by giving users some indication that the service is saturated. It is far better for an overloaded service to inform users of the overload than to silently drop requests.

1.5 Study Objectives

The goal of this study is to design an architecture for a highly available and scalable web server cluster. The main motivations for designing such an architecture are to increase system capacity, availability, and scalability, and to increase the rate at which the server accepts and processes incoming connections. The architecture has to be flexible and modular, capable of scaling linearly as we add more processors, while maintaining the baseline performance and without affecting the availability of the service. As a result, as we double the number of processors in the web cluster, we expect to double the number of requests per second served by the cluster while maintaining the level of throughput per processor. The emphasis of the study is therefore on scalability through clustering, effective resource utilization, and efficient distribution of user requests among the cluster nodes through traffic differentiation. The highly available web server cluster should also adapt itself dynamically to different numbers of users and amounts of data, with minimal effect on performance.

The goal of the architecture is to achieve maximum scalability and performance levels for up to 16 processors, with each processor in the cluster handling N requests per second, where N refers to the baseline performance, a threshold of over 1,000 requests per second (Section 3.7). The limit is set to 16 processors for two main reasons. First, in our initial experiments with web clusters (Chapter 3), we demonstrated that web clusters suffer from scalability problems starting with small clusters that consist of four processors. Our experimental work confirmed performance degradation and loss of scalability as we increased the number of nodes in a clustered web server from four to eight, 10, and then 12 nodes. The second reason for choosing 16 processors as the upper limit is based on the findings of our literature review (Section 2.10). Sections 2.9 and 2.10 present and discuss ongoing projects that focus on the scalability and performance of web servers from an architectural point of view, experimenting with web clusters that consist of eight processors or fewer. In addition, with an upper limit of 16 processors, we can demonstrate and validate the scalability of the architecture in the lab without resorting to building a theoretical model and simulating it.

1.6 Scope of the Study

This section defines the scope of the study with respect to architecture, the types of servers covered, the types of web site content, the architectural model, scalability, and high availability. The scope covers the architecture and its parameters, and excludes external factors such as the performance of file systems, networks, and operating systems.
1.6.1 Goal

The goal of this study is to propose an architecture for scalable and highly available web server clusters. The architecture provides the following properties: fast access, linear scalability for up to 16 processors, architecture transparency, high availability, and robustness of the offered services.

1.6.2 Internet Servers

This dissertation investigates web servers as a specific case of an Internet server. A web server is a program that uses the client/server model and HTTP to serve files and documents to web users.

1.6.3 Web Site Content

For experimental purposes, we limit the experimental work on web server scalability testing to static web site content, which does not include dynamic transactions or web searches. In spite of this self-imposed limitation, we supported dynamic test suites in our benchmarking experiments for the proposed architecture. WebBench [19], the benchmarking tool used in the benchmarks, has the facilities to include dynamic testing as part of its features.

1.6.4 Architecture and Servers

We aim to reduce architecture complexity by providing a generic architecture that can handle massive concurrency demands and deal gracefully with large variations in load. The scope of the proposed architecture is limited to systems that follow the client/server model and run applications characterized by short transactions, short response times, a thin control path, and static data delivery. This dissertation does not address multimedia servers, streams, sessions, states, or application servers. In addition, it does not address, nor try to fix, problems with networking protocols. High performance computing (HPC) is also not in the scope of the study. HPC is a branch of computer science that concentrates on developing software to run on supercomputers. HPC research focuses on developing parallel processing algorithms that divide a large computational task into small pieces so that separate processors can execute them simultaneously. Architectures in this category focus on maximizing compute performance for floating point operations. This branch of computing is unrelated to the dissertation. The architecture targets servers providing services over the Internet with the characteristics previously mentioned. The architecture applies to systems with short response times such as, but not exclusively, web servers, Authentication Authorization and Accounting (AAA) servers, Policy servers, Home Location Register (HLR) servers, and Service Control Point (SCP) servers, without specialized extensions needed at the architectural level.

1.6.5 Scalability

Existing server scaling methods rely on adding more hardware, upgrading processors and memory, or distributing the incoming load and traffic by distributing users or data across several servers. These schemes, discussed in Chapter 2, suffer from different shortcomings, which can cause uneven load distribution, create bottlenecks, and obstruct the scalability of a system. These scaling methods are out of our scope and we do not aim to improve them. Our goal with the architecture is to achieve scalability through clustering, where we dynamically direct incoming web requests to the appropriate cluster node, and scale the number of serving nodes with as little overhead or drop in baseline performance as possible. Therefore, our focus is scalability while maintaining a high throughput. The cluster needs to be able to distribute the application load across N cluster nodes with linear or close-to-linear scalability.
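To state the linear scalability target precisely, let T(1) denote the baseline throughput of a single processor (roughly 1,000 requests per second in our experiments) and T(n) the throughput of an n-node cluster. The notation below is introduced here only to make the target concrete; it is not used elsewhere in the dissertation:

    T(n) = n × T(1)              (ideal linear scaling)
    E(n) = T(n) / (n × T(1))     (scaling efficiency)

Ideal linear scaling corresponds to E(n) = 1 for n up to 16; the closer E(n) stays to 1 as nodes are added, the closer the cluster is to linear scalability.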
1.6.6 High Availability

This work targets an architecture for highly available (HA) web servers. From this perspective, the scope of this dissertation covers techniques that allow us to embed high availability capabilities at the architecture level, ensuring high levels of availability with the least performance degradation. The goal of high availability is to maximize service availability for the end users. There are two subcategories: HA stateless, with no saved state information, and HA stateful, with state information that allows the web application to maintain sessions across a failover. Our scope focuses on HA stateless web applications, although we can apply the same principles to HA stateful web applications.

1.7 Thesis Contributions

This dissertation proposes a highly available and scalable web server architecture. It combines the use of clustering technologies, high availability mechanisms, and an efficient and dynamic traffic distribution mechanism to handle additional requests and maintain scalability across cluster nodes as the web traffic increases. We evaluate and validate the architecture through performance, scalability, and HA testing, to demonstrate its robustness, its scalability for up to 16 processors, and its availability. This dissertation offers contributions in the areas of cluster architecture, scalability, traffic management, high availability, and the single cluster interface, in addition to contributions to the open source community and the industry.

Architecture for highly available and scalable web clusters: The architecture consists of multiple connected server elements that allow the use of server building blocks to achieve high availability, high performance, and high scalability. Chapter 4 presents the HAS architecture.

Scalability and capacity: We are able to scale the architecture and serve more traffic by adding processors to the web cluster without affecting the serviceability or the uptime of the provided services. Chapter 5 presents and discusses the results of benchmarking the prototyped cluster.

Traffic management: We designed and prototyped a dynamic traffic distribution mechanism that consists of a lightweight load monitoring mechanism that reports the load on the cluster nodes to the scheduler, and a scheduler that distributes incoming traffic based on the distribution policy. Furthermore, the traffic management scheme incorporates a keep-alive mechanism, through which the scheduler becomes aware when a traffic node is unavailable and stops forwarding traffic to it. Section 4.23 discusses the traffic management scheme.

High availability: We achieve HA by supporting different redundancy models on all tiers of the architecture. Sections 4.7, 4.8, and 4.19 discuss our contributions in the area of HA.

A transparent, single cluster IP interface: The cluster IP interface is an IP stack distributed among a number of processors that allows access to a cluster of processors via a single IP address. It provides a single entry point to the cluster by hiding the complexity of the cluster, and it provides address location transparency, allowing a resource in the cluster to be addressed without knowing or specifying the processor location. Section 4.21 presents this contribution.

Application availability: The architecture provides the capabilities to monitor the health of the application server running on the traffic nodes, and to dynamically exclude a node from the cluster in the event the application process fails.
Section 4.20 discusses these capabilities. Furthermore, with connection synchronization between the two master nodes, in the event of failure of a master node, the standby node is able to continue serving the established connections. Section 4.22 discusses connection synchronization.

Contribution to open source: This work has resulted in several contributions to the HA-OSCAR project [21], whose architecture is based on the HAS architecture. Section 5.12 discusses these contributions.

Contributions to the industry: The Carrier Grade industry initiative [26] at the Open Source Development Labs [27] has adopted the HAS architecture as the base standard architecture for carrier grade clusters running telecommunication applications. Section 5.15 discusses this contribution.

Other contributions include benchmarking current solutions, providing enhancements to their capabilities, adding functionality to existing system software, and providing best practices for building benchmarking environments for large-scale systems.

1.8 Dissertation Roadmap

The dissertation consists of six chapters. Chapter 1 provides the necessary background on Internet and web servers, discusses scalability challenges, and presents the objectives and scope of the study. Chapter 2 focuses on three main topics: clustering technologies, scalability challenges, and related work. The clustering section sets the ground for all cluster-related definitions and approaches to building clustered web servers. It presents clustering technologies and the benefits resulting from using these technologies to design and build Internet and web servers. The chapter introduces software and hardware clustering technologies, their advantages and drawbacks, and discusses our experience prototyping a highly available and scalable clustered web server platform. The chapter also describes the need for scalable servers, discusses how users experience scalability issues, and examines the resulting performance problems. The last part of the chapter presents our literature review in the area of scaling Internet and web servers. It presents a survey of academic and industry research projects, and discusses their focus areas, results, and contributions. It also presents the contributions of these projects to this dissertation, which help us achieve our goal of a scalable and highly available web server platform. Chapter 3 summarizes the technical preparatory work we completed in the laboratory prior to designing the Highly Available and Scalable (HAS) architecture. The chapter describes the prototyped web cluster that uses existing components and mechanisms. It also describes the benchmarking environment we built specifically to test the performance and scalability of web clusters, and presents the benchmarking results of the tests we conducted on the prototyped cluster. Chapter 4 focuses on describing and discussing the HAS architecture. It presents the architecture, its components, and their characteristics. The chapter then discusses the conceptual, physical, and scenario architecture views, the supported redundancy models, the traffic distribution scheme, and the dependencies between the various components. It also covers the architecture characteristics as related to eliminating single points of failure. Chapter 5 presents the validation of the architecture and illustrates how it scales for up to 16 processors without performance degradation. The validation covers two aspects: scalability and availability.
The chapter presents the results of the benchmarking tests we conducted on the HAS architecture prototype. It also presents the results of experiments we conducted to test the availability features in a HAS cluster. Chapter 6 presents the contributions and future work in the areas of scalability and performance of Internet and web servers.

Chapter 2 Background and Related Work

2.1 Cluster Computing

Cluster computing has become increasingly popular for a number of reasons, which include the low cost and high performance of commodity-off-the-shelf (COTS) hardware, and the availability of rapidly maturing proprietary and open source software components to support high performance and high availability applications. Furthermore, high-speed networking and improved microprocessor performance have positioned networks of workstations as a more appealing vehicle for distributed computing compared to Symmetric Multiprocessor (SMP) and Massively Parallel Processor (MPP) systems. As a result, clusters built using COTS hardware and software are playing a major role in redefining the concept of supercomputing and are becoming a popular alternative to SMP and MPP systems. However, several key challenges remain in the areas of performance, availability, manageability, and scalability.

A web server is an example of an application that requires more computing power than a single computer can provide, and is therefore a candidate to run on a computer cluster. A viable and cost-effective web cluster solution consists of connecting multiple computers together and coordinating their efforts to serve web traffic. The resulting system is a distributed web server that responds to incoming requests and processes them on multiple processors. The following subsections introduce the three general classes of distributed systems (SMP, MPP, and clusters), compare their advantages and drawbacks, and identify the model that is most suitable for hosting web servers.

2.1.1 Symmetric Multiprocessors (SMP)

An SMP machine consists of a tightly coupled series of identical processors operating on a single shared bank of memory. There are no multiple memories, input and output (I/O) systems, or operating systems. SMP systems are shared-everything systems, where each processor has access to the shared memory and all of the attached devices and peripherals of the system, can perform I/O operations, and can interrupt other processors [28][29]. Figure 4 illustrates the architecture of an SMP system, in which identical processors, each with its own cache, connect through a system bus to the shared memory and I/O. SMP systems have a master processor at boot time; the operating system then starts up the remaining processors and manages access to the shared resources among all the processors. A single copy of the operating system is in charge of all the processors. SMP systems available on the market (at the time of writing) do not exceed 16 processors, with configurations available in two, four, eight, and 16 processors.

Figure 4: The SMP architecture

SMP systems are not scalable because all nodes have to access the same resources. In addition, SMP systems have a limit on the number of processors they can have. They require considerable investments in upgrades, and an entire replacement of the system to accommodate a larger capacity. Furthermore, an SMP system runs a single copy of the operating system, where all processors share the same copy of the operating system data.
If one processor becomes unavailable because of either a hardware or a software error, it leaves locks unlocked, data structures in partially updated states, and, potentially, I/O devices in partially initialized states. As a result, the entire system becomes unavailable on account of a single processor. In addition, SMP architectures are not highly available. SMP systems have several single points of failure (cache, memory, processor, bus); if one subsystem becomes unavailable, it brings the system down and makes the service unavailable to the end users.

2.1.2 Massively Parallel Processors (MPP)

As we add more processors to SMP systems, along with more hardware such as crossbar switches to make memory access orthogonal, these parallel processors become Massively Parallel Processors. Figure 5 illustrates an MPP system that consists of several processing elements interconnected through a high-speed interconnection network. Each node has its own processor(s) and memory, and runs a separate copy of the operating system [28]. The key distinction between MPP and SMP systems lies in the use of fully distributed memory: in an MPP system, each processor is self-contained, with its own cache and memory chips.

Figure 5: The MPP architecture

Another distinct characteristic of MPP systems is the job scheduling subsystem; job scheduling is achieved through a single run queue. MPP systems tackle one very large computational problem at a time and serve to solve HPC problems. In addition, MPP systems suffer from the same issues as SMP systems in the areas of scalability, single points of failure, and the impact on high availability, as well as the need to shut down the system to perform either software or hardware upgrades.

2.1.3 Computer Clusters

In his book "In Search of Clusters", Greg Pfister defines a cluster as a parallel or distributed system consisting of a collection of interconnected whole computers that appear as a single, unified computing resource [28]. In this dissertation, we consider a cluster to be a group of separate computers that are interconnected and used as a single computing entity to provide a service or run an application, for the purposes of scalability, high availability, or high performance computing. Our definition of a cluster is complementary to Pfister's definition. We both agree that a cluster consists of a number of independent computers that appear as a single computing entity to the end users. End users are not aware that they are interacting with a cluster, nor are they aware of where the application is running.

Figure 6 illustrates the generic cluster architecture, which consists of multiple standalone nodes connected through redundant links, with a single entry point connecting the cluster to the Internet and its users.

Figure 6: Generic cluster architecture

Cluster nodes interconnect in different ways. Figure 7 illustrates two common variations. In the first variation, Figure 7-A, cluster nodes share a common disk repository; in the second variation, Figure 7-B, cluster nodes do not share common resources and use their own local disks for storage.
Figure 7: Cluster architectures with and without shared disks (7-A: shared storage; 7-B: individual node disks)

The phrase single, unified computing resource in Greg Pfister's definition of a cluster invokes a wide variety of possible applications and uses, and is deliberately vague in describing the services provided by the cluster. At one end of the spectrum, a cluster is nothing more than a collection of whole computers available for use by a sophisticated distributed application. At the other end, the cluster creates an environment where existing non-distributed programs can benefit from increased availability, because of cluster-wide fault masking, and from increased performance, because of the increased computing capacity.

A cluster is a group of independent COTS servers interconnected through a network. The servers, called cluster nodes, appear as a single system, and they share access to cluster resources such as shared disks, network file systems, and the network. A network interconnects all the nodes in a cluster and is separate from the cluster's external environment, such as the local Intranet or the Internet. The interconnection network employs local area network or system area network technology.

Clusters can be highly available because of the built-in redundancy that prevents the presence of a single point of failure (SPOF). As a result, failures are contained within a single node. Monitoring software continually runs checks, by sending signals called heartbeats, to ensure that each cluster node and the application running on it are up and available. If these signals stop, the system software initiates a failover to recover from the failure. The presumably dead or unavailable system or application is then isolated from I/O access, disks, and other resources such as access to the network; furthermore, incoming traffic is redirected to other available nodes within the cluster. As for performance, clusters allow the possibility to add nodes and scale up the performance, capacity, and throughput of the cluster as the number of users or the traffic increases.

Figure 8: A cluster node stack

Figure 8 illustrates a cluster node stack at a high level. The stack consists of the node hardware, the interconnect technology and protocol, the operating system, the middleware or system software, and the application servers. At the lowest level is the node hardware. One level up from the node level is the interconnect technology (such as Ethernet), followed by the interconnect protocol. Up one level from the interconnect protocol is the operating system, followed by the system software, which provides all of the support functionality. Finally, at the top level is the applications layer.

We utilize clusters in many modes including, but not limited to, high performance computing, high capacity or throughput, scalability, and high availability. Table 1 presents the different types of clusters depending on their functionality and the types of applications they host.
High Performance Computing (HPC)
- Goal: maximize floating point computation performance.
- Description: many nodes working together on a single compute problem; performance is measured as the number of floating point operations (FLOP) per second.
- Examples: Beowulf clusters such as MOSIX [30][31], Rocks [32], OSCAR [33][34][35], and Ganglia [36].

Scalability
- Goal: maximize throughput and performance.
- Description: many nodes working on similar tasks, distributed in a defined fashion based on system load characteristics; adding more nodes to the cluster increases its capacity; clusters in this category can be network oriented (throughput) or data oriented (transactions); performance is measured as throughput in terms of KB/s.
- Examples: the Linux Virtual Server [23] and TurboLinux [37], in addition to commercial database products.

High Availability (HA)
- Goal: maximize service availability.
- Description: redundancy and failover for fault tolerance of services; supports stateless and stateful applications; availability is measured as the percentage of the time the system is up and providing service; Section 4.7 presents the formula for calculating availability.
- Examples: the HA-OSCAR project [21], in addition to commercial clustering products.

Server Consolidation
- Goal: maximize ease of management of multiple computing resources.
- Description: also called Single System Image (SSI); provides central management of cluster resources and treats the cluster as a single unit.
- Examples: the OpenSSI project [38], the OpenGFS project [39], and the Oracle Cluster File System [40].

Table 1: Classification of clusters by usage and functionality

Clustering for scalability (the second category in Table 1) focuses on distributing web traffic among cluster nodes using distribution algorithms such as round robin DNS. Clustering for high availability (the third category in Table 1) relies on redundant servers to ensure that critical applications remain available if a cluster node fails. There are two methods for failover solutions: software-based failover solutions, discussed in Section 2.7.1.2, and hardware-based failover devices, discussed in Section 2.7.1.1. Software-based failover detects when a server has failed and automatically redirects new incoming HTTP requests to the cluster members that are available. Hardware-based failover devices have limited built-in intelligence and require an administrator's intervention when they detect a failure. Many of the available clustering products fit into more than one of the above categories. For instance, some products include both failover and load-balancing components. In addition, SSI products that fit into the server consolidation category (the fourth category in Table 1) provide certain HA failover capabilities. Our goal with this dissertation is a cluster architecture that targets both scalability and high availability.

2.2 SMP versus Clusters

Table 2 summarizes the comparison between SMP and cluster architectures.
SMP systems:
- Scaling: limited scaling capabilities; accommodating a larger capacity requires a complete upgrade of the system.
- High availability: not highly available; single points of failure in hardware and the operating system.
- System management: a single system running a single image of the operating system.

Clusters:
- Scaling: virtually unlimited scaling by adding nodes to the cluster.
- High availability: can be configured to have no SPOF through redundancy of key cluster components, although several challenges remain to achieve continuous availability of service.
- System management: a multi-node system, where each node runs its own copy of the operating system and application; allows flexibility in configuration.

Table 2: Characteristics of SMP and cluster systems

SMP systems have limited scalability, while clusters have virtually unlimited scaling capabilities, since we can always continue to add more nodes to the cluster. As for high availability, an SMP system has several single points of failure, and a single error can lead to system downtime; this contrasts with a cluster, where functionalities are redundant and spread across multiple cluster nodes. As for management, an SMP system is a single system, while a cluster is composed of several nodes, some of which can be SMP machines.

2.3 Cluster Software Components

We classify the software components that comprise the environment of a commodity cluster into four major categories: the operating system that runs on each of the cluster nodes, the application execution environment such as libraries and debuggers, the cluster installation infrastructure, and the cluster services components such as cluster membership, storage, management, traffic distribution, and application services. The critical software components include the cluster membership, storage, fault management, and traffic distribution services. The cluster membership service includes functions to recognize and manage the nodes' membership in the cluster. The cluster storage service includes the replication and retrieval of cluster configuration and application data. The fault management service includes functions to recognize hardware and software faults, along with recovery mechanisms. The traffic distribution service includes functions to distribute the incoming traffic across the nodes in the cluster.

2.4 Cluster Hardware Components

The key components of a commodity cluster are the nodes performing the computing and the dedicated interconnection network providing data communication among the nodes. A cluster node differs from an MPP node in that a cluster node is an operational standalone computing system. A cluster node integrates several key subsystems that include the processor, memory, storage, external interfaces, and network interfaces.

2.5 Benefits of Clustering Technologies

Computer clusters provide several advantages, including high availability, scalability, high performance compared to single server architectures, rapid response to technology improvements, manageability of multiple servers as a single server, transparency, and flexibility of configuration. The following subsections present these advantages and benefits of clusters.

2.5.1 High Availability

High availability (HA) refers to the availability of resources in a computer system [41]. We achieve HA through redundant hardware, specialized software, or both [41][42][43]. With clusters, we can provide service continuity by isolating or reducing the impact of a failure in a node, resource, or device through redundancy and failover techniques.
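Availability percentages map directly to allowed downtime per year. The short Python sketch below is illustrative only; it simply evaluates downtime = (1 - availability) × one year, using a 365-day year, which reproduces the figures listed in Table 3.

    # Convert an availability percentage into annual downtime.
    # Assumes a 365-day year, matching the figures in Table 3.
    SECONDS_PER_YEAR = 365 * 24 * 60 * 60

    def annual_downtime_seconds(availability_percent: float) -> float:
        return (1.0 - availability_percent / 100.0) * SECONDS_PER_YEAR

    for nines in [90.0, 99.0, 99.9, 99.99, 99.999, 99.9999]:
        seconds = annual_downtime_seconds(nines)
        print(f"{nines}% available -> {seconds / 60:.1f} minutes of downtime per year")

    # 99.999% ("five nines") yields about 5.26 minutes,
    # i.e. 5 minutes and 15 seconds per year.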
Table 3 presents the various levels of HA, the annual downtime, and the typical deployment areas for various classes of systems [44].

Nines  Availability  Downtime per year      Example areas for deployment
1      90.00%        36 days, 12 hours      Personal clients
2      99.00%        87 hours, 36 minutes   Entry-level businesses
3      99.90%        8 hours, 46 minutes    ISPs, mainstream businesses
4      99.99%        52 minutes, 33 seconds Data centers
5      99.999%       5 minutes, 15 seconds  Telecom systems, medical, banking
6      99.9999%      31.5 seconds           Military defense, carrier grade routers

Table 3: Expected service availability per industry type

It is important not only that a service be down for no more than N minutes a year, but also that the length of outages be short enough, and the frequency of outages low enough, that the end user does not perceive them as a problem. Therefore, the goal is to have a small number of failures and a prompt recovery time. This concept is termed Service Availability, meaning that whatever services the user wants are available in a way that meets the user's expectations.

2.5.2 Scalability

Clusters provide the means to reach high levels of scalability by expanding the capacity of a cluster in terms of processors, memory, storage, or other resources, to support user and traffic growth [1].

2.5.3 High Performance

We can achieve better performance, characterized by improved processing speed, by using clusters and cluster-wide resources instead of the resources of a standalone server.

2.5.4 Rapid Response to Technology Improvements

Commodity clusters are best able to track technology improvements and respond rapidly to new offerings. Clusters benefit from the latest mass-market technology as it becomes available at low cost. As new devices, including processors, memory, disks, and network interfaces, become available on the market, we can integrate them into cluster nodes, allowing clusters to be the first class of parallel systems to benefit from such advances. Similarly, clusters benefit from the latest operating system and networking features as they become available.

2.5.5 Manageability

Clusters require a management layer that allows us to manage all cluster nodes as a single entity [28]. Such cluster management facilities help reduce system management costs. A significant number of cluster management software packages exist, almost all of them originating from research projects and now adopted by commercial vendors.

2.5.6 Cost Efficient Solutions

Clusters take advantage of COTS hardware, which allows a better price-to-performance ratio compared to a dedicated parallel supercomputer. In addition, the availability of open source operating systems, system software, development tools, and applications has contributed to the provisioning of cost-effective clusters built with open source software.

2.5.7 Expandability and Upgradeability

We can expand clusters by adding more nodes, disk storage, and memory, as necessary. As hardware, software, operating system, and network upgrades become available, we can upgrade a cluster node independently of the others; this presents a major advantage over SMP and MPP systems.

2.5.8 Transparency

The SSI layer represents the nodes that make up the cluster as a single server. It allows users to use a cluster easily and effectively without knowledge of the underlying system architecture or the number of nodes inside the cluster. This transparency frees the end user from having to know where an application runs.
2.5.9 Flexibility of Configuration

Clusters allow a flexibility of configuration that is not available with conventional SMP and MPP systems. The number of cluster nodes, the memory capacity per node, the number of processors per node, and the interconnect topology are all parameters of the cluster structure that may be specified in fine detail on a per-system basis without incurring additional cost. Furthermore, we can modify the cluster structure or augment it over time as need and opportunity dictate. This expanded control over the cluster structure benefits not only the end user but the cluster vendor as well, yielding a wide array of system capabilities and cost tradeoffs to meet customer demands.

2.6 The OSI Layer Clustering Techniques

The following subsections present the clustering techniques that operate at OSI layer two (data link layer), OSI layer three (network layer), and OSI layer seven (application layer).

2.6.1 L4/2 Clustering

A level 4 web switch works at the TCP/IP level. Figure 9 illustrates the L4/2 clustering model: requests reach the servers through the dispatcher, while replies travel directly back to the clients. The same server machine must process all packets pertaining to the same connection, so the web switch maintains a binding table to associate each session with its assigned server.

Figure 9: The L4/2 clustering model

In L4/2-based clusters, the dispatcher and all the servers in the cluster share the cluster network-layer address using primary and secondary IP addresses. While the primary address of the dispatcher is the same as the cluster address, each cluster server is configured with the cluster address as a secondary address, using either interface aliasing or by changing the address of the loopback device on the cluster servers. The nearest gateway is configured such that all packets arriving for the cluster address are addressed to the dispatcher at layer two, using a static Address Resolution Protocol (ARP) cache entry. If the packet received corresponds to a TCP/IP connection initiation, the dispatcher selects one of the servers in the server pool to service the request (Figure 9). The selection of the server to respond to the incoming request relies on a traffic distribution algorithm such as round robin. When an incoming request arrives at the dispatcher, the dispatcher creates an entry in a connection map that includes information such as the origin of the connection and the chosen cluster server. The layer two destination address is then rewritten to the hardware address of the chosen cluster server, and the frame is placed back on the network. If the incoming packet is not a connection initiation, the dispatcher examines its connection map to determine whether the packet belongs to a currently established connection. If it does, the dispatcher rewrites the layer two destination address to the address of the cluster server previously selected, and forwards the packet to the cluster server as before. In the event that the received packet does not correspond to an established connection and is not a connection initiation packet, the dispatcher drops it.

Figure 10 illustrates the traffic flow in an L4/2 clustered environment [45]. A web client sends an HTTP packet (1) with A as the destination IP address. The immediate router sends the packet to the dispatcher at IP address A (2).
Based on the traffic distribution algorithm and the session table, the dispatcher decides which back-end server will handle this packet, for instance Server 2, and sends the packet to Server 2 by changing the MAC address of the packet to Server 2's MAC address and forwarding it (3). Server 2 accepts the packet and replies directly to the web client (4).

Figure 10: Traffic flow in an L4/2 based cluster

L4/2 clustering has a performance advantage over L4/3 clustering because of the downstream bias of web transactions. Since the network address of the cluster server to which the packet is delivered is identical to the one the web client used originally in the request packet, the cluster server handling that connection may respond directly to the client rather than through the dispatcher. As a result, the dispatcher processes only the incoming data stream, which is a fraction of the entire transaction. Moreover, the dispatcher does not need to re-compute expensive integrity codes (such as the IP checksums) in software, since only layer two parameters are modified. Therefore, the two parameters that limit the scalability of the cluster are the network bandwidth and the sustainable request rate of the dispatcher, since the incoming request stream is the only portion of the transaction actually processed by the dispatcher.

One restriction on L4/2 clustering is that the dispatcher must have a direct physical connection to all network segments that house servers (due to layer two frame addressing). This contrasts with L4/3 clustering (Section 2.6.2), where the server may be anywhere on any network, with the sole constraint that all client-to-server and server-to-client traffic must pass through the dispatcher. In practice, this restriction on L4/2 clustering has little appreciable impact, since servers in a cluster are likely to be connected via a single high-speed LAN. Among the research and commercial products implementing layer two clustering are ONE-IP, developed at Bell Laboratories [46], IBM's eNetwork Dispatcher [47], and LSMAC from the University of Nebraska-Lincoln (Section 2.10.4).
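To make the L4/2 forwarding logic concrete, the following sketch is a simplified illustration under assumed names (the frame representation and field names are invented; this is not code from any of the products above). It shows the connection-map handling described in this section: connection initiations get a server assigned by round robin, packets of established connections follow the map, and everything else is dropped. Only the layer two destination changes; IP headers and checksums are untouched.

    import itertools

    class L42Dispatcher:
        """Illustrative L4/2 dispatcher: forwards frames by rewriting
        only the layer two (MAC) destination address."""

        def __init__(self, server_macs):
            self.servers = itertools.cycle(server_macs)   # round robin policy
            self.connection_map = {}  # (client_ip, client_port) -> server MAC

        def dispatch(self, frame):
            key = (frame["src_ip"], frame["src_port"])
            if frame["tcp_flags"] == "SYN":           # connection initiation
                server_mac = next(self.servers)
                self.connection_map[key] = server_mac
            elif key in self.connection_map:          # established connection
                server_mac = self.connection_map[key]
            else:                                     # unknown packet: drop it
                return None
            frame["dst_mac"] = server_mac             # layer two rewrite only
            return frame

A teardown handler for FIN/RST packets, which would remove map entries, is omitted; the reply path never appears here because, as noted above, the servers answer the clients directly.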
2.6.2 L4/3 Clustering

In L4/3-based clusters, the dispatcher appears as a single host to the web clients and acts as a gateway for the cluster servers, as Figure 11 illustrates; both requests and replies pass through it. The web client sends traffic to the address of the web cluster, and the traffic arrives at the dispatcher. If the packet received corresponds to a TCP/IP connection initiation, the dispatcher selects one of the servers in the server pool to service the request.

Figure 11: The L4/3 clustering model

Similar to L4/2 clustering, the selection of the cluster server relies on a traffic distribution algorithm. The dispatcher then creates an entry in the connection map noting the origin of the connection, the chosen server, and other relevant information. However, unlike the L4/2 approach, the dispatcher rewrites the destination IP address of the packet to the address of the cluster server selected to service the request. Furthermore, the dispatcher recalculates any integrity codes affected, such as packet checksums, cyclic redundancy checks, or error correction checks. The dispatcher then sends the modified packet to the cluster server corresponding to the new destination address of the packet. If the incoming web client traffic is not a connection initiation, the dispatcher examines its connection map to determine whether it belongs to a currently established connection. If it does, the dispatcher rewrites the destination address to the server previously selected, re-computes the checksums, and forwards the packet to the cluster server as described earlier. In the event that the packet does not correspond to an established connection and is not a connection initiation packet, the dispatcher drops the packet. The traffic sent from the cluster servers to the web clients travels through the dispatcher, since the source address on the response packets is the address of the particular server that serviced the request, not the cluster address. The dispatcher rewrites the source address to the cluster address, re-computes the integrity codes, and forwards the packet to the web client.

Figure 12 illustrates the traffic flow in an L4/3 clustered environment [45]. A web client sends an HTTP packet with A as the destination IP address (1). The immediate router sends the packet to the dispatcher (2), since the dispatcher machine is the owner of the IP address A. Based on the traffic distribution algorithm and the session table, the dispatcher decides to forward this packet to the back-end server, Server 2. The dispatcher then rewrites the destination IP address as B2, recalculates the IP and TCP checksums, and sends the packet to B2 (3). Server 2 accepts the packet and replies to the client via the dispatcher (4), which the back-end server sees as a gateway. The dispatcher rewrites the source IP address of the reply packet as A, recalculates the IP and TCP checksums, and sends the packet to the web client (5).

Figure 12: The traffic flow in an L4/3 based cluster

RFC 2391, Load Sharing using IP Network Address Translation, presents the L4/3 clustering approach [48]. LSNAT from the University of Nebraska-Lincoln provides a non-kernel-space implementation of the L4/3 clustering approach [49]; Section 2.10.4 discusses the project and the implementation.

L4/2 clustering theoretically outperforms L4/3 clustering because of the overhead imposed on L4/3 clustering by the necessary integrity code recalculation, coupled with the fact that all traffic must flow through the dispatcher; as a result, an L4/3 dispatcher processes more traffic than an L4/2 dispatcher does. Therefore, the total data throughput of the dispatcher limits the scalability of the system more than the sustainable request rate does.
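The integrity-code recalculation that burdens an L4/3 dispatcher is itself straightforward; the cost comes from performing it for every packet in both directions. As a rough illustration (the standard ones'-complement IPv4 header checksum defined in RFC 791; the function names are ours), rewriting an address forces the checksum to be recomputed over the modified header:

    def ip_header_checksum(header: bytes) -> int:
        """Ones'-complement sum of 16-bit words, per RFC 791.
        The checksum field itself must be zeroed before calling."""
        if len(header) % 2:
            header += b"\x00"
        total = 0
        for i in range(0, len(header), 2):
            total += (header[i] << 8) | header[i + 1]
            total = (total & 0xFFFF) + (total >> 16)   # fold carries back in
        return ~total & 0xFFFF

    # An L4/3 dispatcher rewrites the destination IP (bytes 16-19 of the
    # IPv4 header), zeroes the checksum field (bytes 10-11), and recomputes:
    def rewrite_destination(header: bytearray, new_dst: bytes) -> None:
        header[16:20] = new_dst
        header[10:12] = b"\x00\x00"
        checksum = ip_header_checksum(bytes(header))
        header[10:12] = checksum.to_bytes(2, "big")

A real dispatcher must also update the TCP checksum, which covers a pseudo-header that includes the IP addresses; incremental-update techniques (RFC 1624) reduce, but do not eliminate, this per-packet cost.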
2.6.3 L7 Clustering

A level 7 web switch works at the application level. The web switch establishes a connection with the web client and inspects the HTTP request content to decide about dispatching. The L7 clustering technique is also known as content-based dispatching, since it operates based on the contents of the client request. The Locality-Aware Request Distribution (LARD) dispatcher, developed by researchers at Rice University, is an example of L7 clustering. LARD partitions a web document tree into disjoint sub-trees. The dispatcher then allocates to each server in the cluster one of these sub-trees to serve. As such, LARD provides content-based dispatching as the dispatcher receives web client requests.

Figure 13: The process of content-based dispatching (the L7 clustering model)

Figure 13 presents an overview of the processing with the L7 clustering approach [45]. In the figure, Server 1 processes requests of type a, while Server 2 processes requests of types b and c. The dispatcher separates the incoming stream of requests into two streams: one for Server 1 with the requests of type a, and one for Server 2 with the requests of types b and c. As requests arrive from clients for the web cluster, the dispatcher accepts the connection and the request. It then classifies the requested document and dispatches the request to the appropriate server. The dispatching of requests requires support from a modified kernel that enables the connection handoff protocol. After establishing the connection, identifying the request, and choosing the cluster server, the dispatcher informs the cluster server of the status of the network connection; the cluster server takes over that connection and communicates directly with the web client. Following this approach, LARD allows the file system cache of each cluster server to cache a separate part of the web tree, rather than having to cache the entire tree, as is the case with L4/2 and L4/3 clustering. Additionally, it is possible to have specialized server nodes where, for instance, dynamically generated content is offloaded to special compute servers while other requests are dispatched to servers with less processing power. LARD requires modifications to the operating system on the servers to support the TCP handoff protocol.

2.6.4 Discussion of the OSI Layer Clustering Techniques

We broadly classify transparent server clustering into three categories: L4/2, L4/3, and L7. Table 4 summarizes these technologies and highlights their advantages and disadvantages.

L4/2:
- Mechanism: link-layer address translation.
- Flows: incoming only.
- HA and fault tolerance: varies; several single points of failure.
- Restrictions: incoming traffic passes through the dispatcher.
- Bottleneck: connection dispatching.
- Client information available: at the TCP/IP level.

L4/3:
- Mechanism: network address translation.
- Flows: incoming and outgoing.
- HA and fault tolerance: varies; several single points of failure.
- Restrictions: the dispatcher lies between client and server; all incoming and outgoing traffic passes through the dispatcher.
- Bottleneck: integrity code calculations.
- Client information available: at the TCP/IP level.

L7:
- Mechanism: content-based routing.
- Flows: varies.
- HA and fault tolerance: varies; several single points of failure.
- Restrictions: incoming traffic passes through the dispatcher.
- Bottleneck: connection dispatching and dispatcher complexity.
- Client information available: TCP/IP information and HTTP header content.

Table 4: Advantages and drawbacks of clustering techniques operating at the OSI layer

Each of the approaches creates bottlenecks that limit scalability and presents several single points of failure. For L4/2 dispatchers, system performance is constrained by the ability of the dispatcher to set up, look up, and tear down connection map entries; the most telling performance metric is the sustainable request rate. The limitation of L4/3 dispatchers is their ability to rewrite packets and recalculate the checksums for the massive numbers of packets they process; therefore, the most telling performance metric is the throughput of the dispatcher. Lastly, the L7 clustering approach has limitations related to the complexity of the content-based routing algorithm and the size of the dispatcher's cache.
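The content-based routing that distinguishes L7 clustering can be sketched in a few lines. The example below is illustrative only; the partition map and server names are invented, and real systems such as LARD add TCP handoff and dynamic repartitioning on top of this idea. Each request is routed by the sub-tree its URL path falls under, so each server's cache holds a disjoint part of the document tree:

    # Illustrative L7 (content-based) dispatch: partition the document
    # tree into disjoint sub-trees and pin each sub-tree to one server.
    PARTITION = {
        "/images/": "server-1",   # hypothetical assignment
        "/docs/":   "server-2",
        "/cgi/":    "server-3",   # e.g. dynamic content on a compute node
    }
    DEFAULT_SERVER = "server-1"

    def choose_server(url_path: str) -> str:
        for prefix, server in PARTITION.items():
            if url_path.startswith(prefix):
                return server
        return DEFAULT_SERVER

    assert choose_server("/docs/thesis.html") == "server-2"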
2.7 Clustering Web Servers

Clustering for web servers is a technique in which we group together two or more web servers to collectively accommodate increases in load and provide system redundancy. Figure 14 illustrates an example of a web cluster that consists of five cluster nodes. The nodes share access to a common network, which connects them to the public Internet. The figure does not specify whether the nodes share access to a common disk repository or have private disk storage. Web clients are the online users of the system; they connect to the Internet and request data from the web server cluster, which consists of multiple nodes each running an independent copy of the web server software. We accomplish clustering using software, hardware, or a combination of both.

Figure 14: A web server cluster

The following subsections explore the software and hardware techniques used to build web clusters.

2.7.1.1 Hardware Based Clustering Solutions

Figure 15 illustrates the hardware clustering approach using a network device called the packet router. The packet router sits in front of a number of web servers and directs incoming HTTP requests to available web servers in the cluster. The web server software running on the cluster nodes has access to the same storage repository. The packet router distributes traffic to the cluster servers based on a predefined distribution policy such as round robin. The router device, in conjunction with the web servers, comprises a virtual server.

Load balancing switches, such as the Cisco LocalDirector [50], redirect TCP requests to servers belonging to a cluster. The LocalDirector provides a traffic distribution service by presenting a common virtual IP address to the web clients and then forwarding their incoming requests to an available node within the web cluster. Using heartbeat algorithms, these switches frequently ping the servers in the cluster to ensure they are still available, making them a better solution than basic round robin DNS [51][83].

Figure 15: Using a router to hide the web cluster

Hardware-based clustering solutions use routers to provide a single IP interface to the cluster and to distribute traffic among the cluster nodes. These solutions are a proven technology; they are neither complicated nor complex by design. However, they have certain limitations, such as limited intelligence, unawareness of the applications running on the cluster nodes, and the presence of a SPOF.

Limited intelligence: Packet routers can load balance in a round robin fashion, and some can detect failures, automatically remove failed servers from a cluster, and redirect traffic to other nodes. However, these routers are not fully intelligent network devices. They do not provide application-aware traffic distribution. While they can redirect requests upon discovering a failure, they do not allow configuring redirection thresholds for individual servers in a cluster, and therefore they are unable to manage load to prevent failures.

Lack of dynamism: A router cannot measure the performance of a web application server or make an intelligent decision on where to route a request based on the load of the cluster node and its hardware characteristics.

Single point of failure: The packet router constitutes a SPOF for the entire cluster. If the router fails, the cluster is not accessible to end users and the service becomes unavailable.

2.7.1.2 Software Clustering Solutions

Software solutions are another approach to clustering.
A number of different clustering packages exist, ranging from various platform implementations of the Domain Name System (DNS) to virtual server implementations that present a cluster of servers to the outside world as a single server. Section 2.6 presented the OSI layer clustering techniques. This section presents and discusses other software clustering techniques, in addition to some implementations of the OSI layer techniques.

One method of software clustering is round robin DNS, which involves setting up the DNS server of the web site to return the set of all the IP addresses of the servers in the cluster in a different order on each successive request [51][52]. The client forwards the request to the first IP address in the returned list. Consequently, successive requests arrive at different servers in the cluster, and the traffic is distributed across the cluster servers. This distribution scheme has some disadvantages. First, the DNS server has no way of knowing if one or more of the servers in the cluster is overloaded or out of service, or if the application on a server is still up and running. Second, the servers inside the cluster can have different hardware configurations (processor, memory, network bandwidth, and disk), and the DNS server is not aware of the heterogeneous nature of the nodes inside the cluster. As a result, load is not assigned to the server nodes based on their potential capacity, which leads to overloading some nodes and under-using others. While round robin DNS is a popular choice because of its relative simplicity and low implementation cost, it does not demonstrate intelligence in distributing the traffic load or in reacting to node failures. Therefore, it does not help prevent server overloads and eventual failures for sites with high traffic.
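The rotation itself is the entire mechanism; a DNS server configured for round robin simply shifts the address list between answers, along the lines of the sketch below (a toy illustration with made-up addresses, not an actual DNS server implementation):

    from collections import deque

    class RoundRobinDNS:
        """Toy round robin DNS: returns the full address set,
        rotated by one position on each successive query."""

        def __init__(self, addresses):
            self.addresses = deque(addresses)

        def resolve(self, name: str):
            answer = list(self.addresses)
            self.addresses.rotate(-1)   # the next query sees a shifted order
            return answer

    dns = RoundRobinDNS(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
    print(dns.resolve("www.example.com"))  # ['10.0.0.1', '10.0.0.2', '10.0.0.3']
    print(dns.resolve("www.example.com"))  # ['10.0.0.2', '10.0.0.3', '10.0.0.1']

Clients take the first address in the answer, so each rotation moves a different server to the front; nothing in the scheme reflects server load or health, which is precisely the weakness noted above.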
One example of a commercial clustering solution is the Microsoft Cluster Server (MSCS) [53]. MSCS is a built-in feature of Windows NT Server, Enterprise Edition. It supports the connection of two servers into a cluster for higher availability and easier manageability of data and applications. It supports the active/standby redundancy model, in which two cloned systems provide redundancy for one another. MSCS can automatically detect and recover from server or application failures. It can be used to move server workload to balance utilization and to allow scheduled maintenance without downtime. The initial release of MSCS supports clusters with two servers. Microsoft Corporation announced its plans to release MSCS Phase 2 in the third quarter of 2006; Phase 2 promises to support larger clusters and to include enhanced services to simplify the creation of highly scalable, cluster-aware applications [54]. The current version of MSCS suffers from scalability issues, as it only supports two servers, which require upgrading as the traffic increases.

The Linux Virtual Server (LVS) is an open source project that aims to provide a high performance and highly available software clustering implementation for Linux [23]. It implements layer 4 switching in the Linux kernel, providing a virtual server layer built on a cluster of real servers and allowing TCP and UDP sessions to be load balanced among multiple real servers. A virtual service is specified by an IP address, a port, and a protocol. The front end of the real servers is a load balancer, which schedules requests to the different servers and makes the parallel services of the cluster appear as a virtual service on a single IP address. The architecture of the cluster is transparent to end users, who only see the address of the virtual server. LVS is available in three different implementations [55]: Network Address Translation (NAT), Direct Routing (DR), and IP tunneling. Sections 3.5.1, 3.5.2, and 3.5.3 present and discuss the NAT, DR, and IP tunneling methods, respectively. We have experimented with both the NAT and DR methods; Section 3.5.4 presents the benchmarking results comparing the performance of both methods.

Each of these techniques for providing a virtual interface to a web cluster has its own advantages and disadvantages. Based on our lab experiments discussed in Chapter 3, we concluded that the common disadvantage among these schemes is their limited scalability (Figure 31). When the traffic load increases, the load balancer becomes a bottleneck for the whole cluster, and the local director crashes under heavy load or stops accepting new incoming requests. In both cases, the local director replies very slowly to ongoing requests.

Software clustering solutions have three main advantages that make them a better alternative to hardware clustering solutions: flexibility, intelligence, and availability. First, software clustering solutions can augment existing hardware devices, thereby providing a more robust traffic distribution and failover solution; additionally, by integrating hardware with software, an organization diminishes, if not eliminates, losses on capital expenditures it has already made. Second, they provide a level of intelligence that enables preventive traffic distribution measures that minimize the chance of servers becoming unavailable. In the event that a server becomes overloaded or actually fails, some software can automatically detect the problem and reroute HTTP requests to other nodes in the cluster. Third, with software clustering solutions, we can support high availability capabilities to avoid single points of failure. An individual server failure does not affect service availability, since functionalities and failover capabilities are distributed among the cluster servers.

However, we need to consider several issues when evaluating cluster software solutions, mainly the differences among feature sets, the platform constraints, and the HA and scaling capabilities. Software clustering solutions differ in their capabilities and features, such as their support for automatic failure detection, notification, and recovery. Some solutions have significantly delayed failure detection; others allow the configuration of load thresholds to enable preventive measures. In addition, they can support different redundancy models such as 1+1 active/standby, 1+1 active/active, N+M, and N-way. Therefore, we need to determine the requirements for scalability and failover and pick the solution accordingly. Furthermore, software solutions have limited platform compatibility; they are available to run on specific operating systems or computing environments. Finally, the capability of a clustering solution to scale is important: some solutions are restricted to four, eight, or 16 nodes, and therefore have scaling limitations.

2.8 Scalability in Internet and Web Servers

Scalability is the ability of an application server, such as a web server, to grow to support a large number of concurrent users without performance degradation.
Generally, scalability refers to how well a hardware or software system can adapt to increased demands. Perfect scalability is linear: if we double the number of processors in the system, we expect the system to serve double the number of requests per second it normally serves. We consider this optimum performance. Unfortunately, this is not the case in real systems.

As the number of online services continues to grow, along with the number of Internet users, Internet and web servers have to meet new requirements in the areas of availability, scalability, performance, reliability, and security. They have to cope with the explosive growth of the Internet [1][2] and continuously increasing traffic, and meet all expectations in terms of stringent requirements in those areas. One additional capability is support for geographical mirroring for increased high availability, especially when providing critical services such as electronic banking and stock brokering. However, these servers experience scalability problems: they are slow; they are continuously hacked and attacked by malicious users; and at times they are not available for service, making businesses liable to losing money when users are not able to access the services [56][57][58].

As such, scalability presents itself as a crucial factor for the success or failure of online services, and it is certainly one important challenge faced when designing servers that provide interactive services to a wide clientele. Many factors can negatively affect the scalability of systems [59]. The first common factor is the growth of the user base, which causes serious capacity problems for servers that can only serve a certain number of transactions per second. If the server is not able to cope with the increased number of users and traffic, the server starts rejecting requests. A second key factor negatively affecting the scalability of servers is the number and size of data objects; in particular, the size of audio and video files strains the network and I/O capacity, causing scalability problems. The increasing amount of accessible data makes data search, access, and management more difficult, which causes processing problems and eventually leads to rejecting incoming requests. Finally, non-uniform request distribution imposes strain on the servers and the network at certain times of the day or for certain requested data. These factors can cause servers to suffer from bottlenecks and to run out of network, processing, and I/O resources.

2.8.1 Scalability in Telecommunication Servers

Telecommunication servers provide a wide range of IP applications for mobile phones. These servers require scaling capabilities to support an increasing number of subscribers and services. They are subject to some of the most rigorous requirements in terms of reliability, performance, availability, and scalability [58][59]. They aim to provide five nines availability, which translates into a maximum of five minutes and fifteen seconds of downtime per year, including downtime associated with the operating system, software upgrades, and hardware maintenance. These expectations place an unprecedented burden on telecommunication equipment manufacturers to ensure that all the elements needed to support a service are functioning whenever a user requests that particular service. While achieving 100 percent uptime is desirable, it is difficult to achieve [60]. A fault-tolerant system needs to function correctly given the small but practically inevitable presence of a fault in the system.
Supporting five nines service availability depends on the near-flawless interaction of applications, operating system, management middleware, and hardware, as well as on environmental and operational factors. The requirements of telecommunication servers are getting more complex as the telecom industry moves towards an always connected, always online paradigm [61] with a new suite of third generation interactive and multimedia services. These servers suffer from scalability problems as the number of mobile subscribers increases at a fast pace [1]. To cope with the increased number of users and traffic, mobile operators are resorting to upgrading servers or buying new servers with more processing power [59], a process that has proved to be expensive and iterative. According to Ericsson Research, the growth rate of mobile subscribers in 2004 was approximately 500,000 users per day [62]. This raises the question of whether the Mobile Internet servers and the applications running on those servers will be able to cope with such growth. When the servers are not able to cope with increased traffic, the result is failure to meet the high expectations of paying customers, who expect services to be available at all times with acceptable performance levels [63][64], and failure to meet and manage service level agreements. Service level agreements dictate the percentage of the time services will be available, the number of users that can be served simultaneously, specific performance benchmarks to which actual performance will be periodically compared, and access availability. If ISPs, for instance, are not able to cope with the increasing number of users, they will break their service level agreements, causing them to lose money and potentially lose customers. Similarly, mobile operators face substantial financial losses if their servers are not available to their subscribers.

2.8.2 How Users Experience Scalability

The mechanisms behind the scalability of a system are transparent to end users, who experience the capacity of a system in various ways, such as how fast and accurately the system responds to their requests. The user expects to receive the response in an acceptable time with no errors [13][14][15]. Therefore, the server needs to adapt to different numbers of users and amounts of data without resource problems or performance bottlenecks. We can measure these qualities via the response time, availability, and reliability of the server.

2.8.2.1 Response Time

The response time is the interval between the moment a user submits a request and the moment the user receives an answer from the server. Total response time includes the time to connect, the time to process the request on the server, and the time to transmit the response back to the client:

Total response time = connect time + process time + response transmit time

When throughput is low, the response transmit time is insignificant. However, as throughput approaches the limit of network bandwidth, the server has to wait for bandwidth to become available before it can transmit the response. The response time in a distributed system consists of all the delays created at the source site, in the network, and at the receiver site. The possible reasons for the delays and their length depend on the system components and the characteristics of the transport media. The response time consists of the delays in both directions.
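This decomposition can be observed directly. The Python sketch below, offered only as an illustration with a hypothetical host name, measures the connect time separately from the processing and transmit time for a single HTTP request:

    import time
    import http.client

    start = time.monotonic()
    conn = http.client.HTTPConnection("www.example.com", 80, timeout=10)
    conn.connect()                    # connect time ends here
    connected = time.monotonic()
    conn.request("GET", "/")
    body = conn.getresponse().read()  # server processing plus response transmit
    finished = time.monotonic()
    conn.close()

    print(f"connect time:            {connected - start:.3f} s")
    print(f"process + transmit time: {finished - connected:.3f} s")
    print(f"total response time:     {finished - start:.3f} s")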
2.8.2.2 Availability and Reliability

The availability of a service depends on the reliability of the server and network components providing the service, and on the system architecture. Availability is the percentage of time the system has been available. Reliability, on the other hand, is the ability of the server to respond accurately to requests without errors. The system designer needs to design and build the system so that the failure of one server or network link does not cause the service to become unavailable. We can avoid such situations by duplicating the service on several servers and having redundant network routes to them. In such a system, the failure of a server or network link only means a loss of capacity, and the system keeps working. Since web servers offer interactive services, we expect them to be available to service requests. When the service is not available, the system needs to recover from faults with minimal degradation of service.

2.8.3 Current Scalability Methods for Web Servers

Web servers need to be able to scale and maintain high availability, reliability, and performance as the number of simultaneous users and the amount of traffic increase. There are two types of solutions to the overloading problem of web servers. The first, called the single server solution, consists of upgrading the server to a higher performance server. However, the new server becomes overloaded as requests increase, so administrators have to upgrade it again; the upgrading process is complex and the cost is high. The second, called the multi-server solution, consists of building a scalable server on a cluster of servers using clustering techniques. When the load increases, we can add more servers into the cluster to meet the demand; however, it is not as easy as it sounds. In reality, as we add more servers into the cluster, the total number of transactions per server declines.

The widely deployed scaling methods for clustered web servers are round robin DNS and a packet routing device that distributes incoming traffic.

2.8.3.1 Round Robin DNS

Round robin DNS maps a single server name to different IP addresses in a round robin manner. Following this approach, round robin DNS distributes the traffic among the servers and maps clients to different servers in the cluster. The round robin DNS method is not very efficient. Because of client-side caching and the hierarchical DNS system, it easily leads to dynamic load imbalance among the servers, making it unrealistic for a server to handle its peak load. It is difficult to choose the Time-To-Live (TTL) value of a name mapping: with small values, the DNS server becomes a bottleneck, and with high values, the dynamic load imbalance gets worse. Even when the TTL value is set to zero, the scheduling granularity is per host; different users' access patterns may lead to dynamic load imbalance, because some users pull many pages from the site, while others just surf a few pages and go away. Furthermore, round robin DNS is not a reliable mechanism. When a cluster node fails, the clients whose name mapping points to that node discover that the node is down; the problem persists even if the users press the reload or refresh button in their web browsers.
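As an illustration, the DNS zone fragment below, with hypothetical documentation addresses and a 60-second TTL, maps one name to three servers; typical name servers rotate the order of the returned A records across queries:

    www  60  IN  A  192.0.2.11
    www  60  IN  A  192.0.2.12
    www  60  IN  A  192.0.2.13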
2.8.3.2 Load Balancer Network Device

A different scaling method is to use a network device called the load balancer to distribute incoming traffic among the servers. A load balancer can be implemented either in software, such as the LVS, or in hardware, such as the Cisco LocalDirector [50]. These solutions generally forward HTTP requests to the web servers inside the cluster, receive the result, and forward it to the clients. The servers collectively constitute a web cluster that appears to end users as a virtual server made visible through a single IP address. The scheduling granularity is per connection, which enables a sound traffic distribution among the servers. We can achieve traffic distribution at two levels: application level and IP level. Chapter 3 expands on the software and hardware distribution methods and discusses the results of our benchmark testing.

2.8.4 Principles of Scalable Architecture

This section discusses the principles of scalable architectures. After presenting the HAS architecture in Chapter 4, we discuss how the HAS architecture design meets the architectural scaling principles presented in this section. We can characterize applications by their consumption of four primary system resources: processor, memory, file system bandwidth, and network bandwidth. We can achieve scalability by simultaneously optimizing the consumption of these resources and designing an architecture that can grow modularly by adding more resources. Several design principles are required to design scalable systems. The list includes divide and conquer, asynchrony, encapsulation, concurrency, and parsimony [65]. Each of these principles presents a concept that is important in its own right when designing a scalable system. There are also tensions between these principles; we can sometimes apply one principle at the cost of another. The root of a solid system design is to strike the right balance among these principles. In the following subsections, we present each of these principles.

2.8.4.1 Divide and Conquer

The divide and conquer principle indicates that the system needs to be partitioned into relatively small sub-systems, each carrying out some well-focused function [65]. This permits deployments that can leverage multiple hardware platforms or simply separate processes or threads, thereby dispersing the load in the system and enabling various forms of load balancing (or traffic distribution) and tuning. The divide and conquer principle varies slightly from the concept of modularization, as it addresses the partitioning of both code and data, and it can approach the problem from either side.

2.8.4.2 Asynchrony

The asynchrony principle means that the system carries out work based on available resources [65]. Synchronization constrains a system under load because application components cannot process work in an arbitrary order, even if resources exist to do so. Asynchrony decouples functions and lets the system schedule resources more freely and thus potentially more completely. This principle allows us to implement strategies that effectively deal with stress conditions such as peak load.

2.8.4.3 Encapsulation

The encapsulation principle is the concept of building the system using loosely coupled components, with little or no dependence among components [65]. This principle often, but not always, correlates with asynchrony. Highly asynchronous systems tend to have well encapsulated components and vice versa. Loose coupling means that components can pursue work without waiting for work from others.
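To make the asynchrony and encapsulation principles concrete, the short Python sketch below decouples request intake from request processing through a queue; the component names are hypothetical, and the point is only that neither side waits on the other:

    import queue
    import threading

    work = queue.Queue()  # the queue is the only coupling between the components

    def worker():
        # Processes work whenever it is available, independently of intake
        while True:
            request = work.get()
            if request is None:  # sentinel: no more work
                break
            print(f"processed {request}")

    t = threading.Thread(target=worker)
    t.start()
    for request in ["req-1", "req-2", "req-3"]:  # intake proceeds at its own pace
        work.put(request)
    work.put(None)
    t.join()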
2.8.4.4 Concurrency

The concurrency principle means that there are many moving parts in a system, and the goal is to split the activities across hardware, processes, and threads [65]. Concurrency aids scalability by ensuring that the maximum possible work is active at all times, and it addresses system load by spawning new resources on demand within predefined limits. Concurrency also maps directly to the ability to scale by rolling in new hardware. The more concurrency applications exploit, the better the possibilities to expand by adding new hardware.

2.8.4.5 Parsimony

The parsimony principle indicates that the designer of the system needs to be economical in what he or she designs [65]. Each line of code and each piece of state information has a cost, and, collectively, the costs can increase exponentially. A developer has to ensure that the implementation is as efficient and lightweight as possible. Paying attention to thousands of micro details in a design and implementation can eventually pay off at the macro level with improved system throughput. Parsimony also means that designers use scarce or expensive resources carefully. No matter what design principle a developer applies, a parsimonious implementation is appropriate. Some examples include algorithms, I/O, and transactions. Parsimony ensures that algorithms are suited to the task, since several small inefficiencies can add up and degrade performance. Furthermore, performing I/O is one of the more expensive operations in a system, and we need to keep I/O activities to the bare minimum. Moreover, transactions constrain access to costly resources by imposing locks that prohibit read or write operations. Applications should work outside of transactions whenever feasible and exit each transaction in the shortest time possible.

2.8.5 Strategies for Achieving Scalability

Section 2.8.4 presented the five principles of scalable architectures. This section presents design strategies to achieve a scalable architecture.

2.8.5.1 System Partitioning

A notable characteristic of a scalable system is its ability to balance the load. As the system scales up to meet demand, it should do so by maximizing resource utilization. In the case of a clustered web server, the web traffic is distributed among the cluster nodes. Partitioning breaks system software into domain components with well-bounded functionality and a clear interface. Each component defines part of the architectural conceptual model and a group of functional blocks and connectors. Ultimately, the goal is to partition the solution space into appropriate domain components that map onto the system topology in a scalable manner. The principles that apply in this strategy include divide and conquer, asynchrony, encapsulation, and concurrency.

2.8.5.2 Service Based Layered Architecture

By making services available to a client's application layer, service-based architectures offer an opportunity to share a single component across many different systems. Having a service front end allows a component to offer different access rights and visibility to different clients. The principles that apply in this strategy include divide and conquer and encapsulation.

2.8.5.3 Keeping it Simple

Donald A. Norman warns in the preface of his book "The Design of Everyday Things": "Rule of thumb: if you think something is clever and sophisticated, beware — it is probably self-indulgence" [66]. The software development cycle's elaboration phase is an iterative endeavor.
In accordance with Occam's razor [65], when faced with two design approaches, choose the simpler one. If the simpler solution proves inadequate to the purpose, consider the more complicated counterpart in the project's later stages.

2.9 Overview of Related Work

Server scalability is a recognized research area in both academia and industry, and it is an essential factor in client/server dominated network environments. Researchers around the world are investigating clusters and commodity hardware as an alternative to expensive specialized hardware for building scalable Internet and web servers. Although the area of cluster computing is relatively new, there is an abundance of research projects in the areas of cluster computing and scalable cluster-based servers. Section 2.10 examines six projects that are close to our work in terms of scope and goal, discusses their approaches, advantages, and drawbacks, discusses their prototypes and implementations, and presents the lessons we learned from their experiences. This section serves as a brief overview of some of the other surveyed work that helped us get a better understanding of the areas of research related to scalable web servers.

In [63], the authors present and discuss a method for traffic distribution inside a cluster of web servers that uses distributed packet rewriting. In [67], the authors address strategies for designing a scalable architecture: divide and conquer, asynchrony, encapsulation, concurrency, and parsimony. The authors argue that the strategies presented are not comprehensive; however, they represent critical strategies for server-side scaling. In [68], the authors propose a new scheduling policy, called multi-class round robin (MC-RR), for web switches operating at layer 7 of the open system interconnection (OSI) protocol stack to route requests reaching the web cluster. The authors demonstrate through a set of simulation experiments that MC-RR is more effective than round robin for web sites providing highly dynamic services. In [69], the authors describe their architecture for a web server designed to cope with the ongoing increase of Internet requirements. The proposed architecture addresses the need for a powerful data management system to cope with the increase in the complexity of users' requests, and a caching mechanism to reduce the amount of redundant traffic. The authors extend the architecture with a caching system that builds an adaptive hierarchy of caches for web servers, which allows them to keep up with the changes in the traffic generated by the applications they are running. In [70], the authors present a comparison of different traffic distribution methods for HTTP traffic in scalable web clusters. The authors present a classification framework for the different load balancing methods and compare their performance. In particular, they discuss the rotating name server method in comparison with an alternative load balancing method based on remapping requests and responses in the network. Their results demonstrate that remapping requests and responses yields better results than the rotating name server method.

The researchers at the Korea Advanced Institute of Science and Technology have developed an adaptive load balancing method that changes the number of scheduling entities according to the workload [71]. It behaves exactly like a dispatcher-based scheme under low or intermediate workload, taking advantage of fine-grained load balancing.
When the dispatcher is overloaded, the DNS servers distribute the dispatching jobs to other entities such as back-end servers. In this way, they relieve the hot spot at the dispatcher. Based on simulation results, they demonstrated that the adaptive dispatching method improves the overall performance under realistic workload simulation. In [72], the authors present and evaluate an implementation of a prototype scalable web server consisting of a balanced cluster of hosts that collectively accept and service TCP connections. The host IP addresses are advertised using the round robin DNS technique, allowing any host to receive requests from a client. They use a low-overhead technique called distributed packet rewriting (DPR) to redirect TCP connections. Each host keeps information about the remaining hosts in the system. Their performance measurements suggest that their prototype outperforms round robin DNS. However, their benchmarking was limited to a five-node cluster, where each node reached a peak of 632 requests per second, compared to the over 1,000 requests per second per node achieved with our early prototype (Section 3.7). In [45], the authors discuss clustering as a preferred technique to build scalable web servers. The authors examine early products and a sample of contemporary commercial offerings in the field of transparent web server clustering. They broadly classify transparent server clustering into three categories: L4/2, L4/3, and L7 clustering, and discuss their advantages and disadvantages. In [73], the authors present their two implementations for traffic manipulation inside a web cluster: MAC-based dispatching (LSMAC) and IP-based dispatching (LSNAT). The authors discuss their results, and the advantages and disadvantages of both methods. Section 2.10.4 discusses those approaches. The researchers from Lucent Technologies and the University of Texas at Austin present in [74] their architecture for a scalable web cluster. The distributed architecture consists of independent servers sharing the load through a round robin traffic distribution mechanism. In [75], the authors present optimizations to the NCSA httpd server [76] to make it more scalable and allow it to serve more requests.

2.10 Related Work: In-depth Examination

This section discusses six projects that share the common goal of increasing the performance and scalability of web clusters. These projects had different focus areas, such as traffic distribution algorithms, new architectures, and presenting the cluster as a single server through a virtual IP layer. This section examines these research projects, presents their respective areas of research and their architectures, highlights their status and plans, and discusses the contributions of their research to our work. The works discussed are the following:

- "Redirection-based Web Server Architecture" at the University of Texas (Austin): The goal of this project is to design and prototype a redirection-based hierarchical architecture that eliminates bottlenecks in the cluster and allows the administrator to add hardware seamlessly to handle increased traffic [77]. Section 2.10.1 discusses this project.

- "Scalable Policies for Scalable Web Clusters" at the University of Roma: The goal of the project is to provide scalable scheduling policies for web clusters [68][78]. Section 2.10.2 discusses this project.
- "The Scalable Web Server (SWEB)" at the University of California (Santa Barbara): The project investigates the issues involved in developing a scalable web server on a cluster of workstations. The objective is to strengthen the processing capabilities of such servers by utilizing the power of clustered computers to match the huge demand in simultaneous access requests from the Internet [78]. Section 2.10.3 discusses this project.

- "LSMAC and LSNAT": The project at the University of Nebraska-Lincoln investigates server responsiveness and scalability in clustered systems and client/server network environments [79]. The project focuses on different server infrastructures to provide a single entry into the cluster and traffic distribution among the cluster nodes [73]. Section 2.10.4 examines the project and its results.

- "Harvard Array of Clustered Computers (HACC)": The HACC project aims to design and prototype a cluster architecture for scalable web servers [81]. The focus of the project is on a technology called the "IP Sprayer", a router component that sits between the Internet and the cluster and is responsible for traffic distribution among the nodes of the cluster [82]. Section 2.10.5 discusses this project.

- "IBM Scalable and Highly Available Web Server": This project investigates scalable and highly available web clusters. The goal of the project is to develop a scalable web cluster that will host web services on IBM proprietary SP-2 and RS/6000 systems [83]. Section 2.10.6 discusses this project.

2.10.1 Hierarchical Redirection-Based Web Server Architecture

The University of Texas at Austin and Bell Labs are collaborating on a research project to design a scalable web cluster architecture. The authors confirm that a distributed architecture consisting of independent servers sharing the load is most appropriate for implementing a scalable web server [76]. Their study of currently available approaches, such as RR DNS, demonstrates that these approaches do not scale effectively because they lead to bottlenecks in different parts of the system. This section presents their redirection-based hierarchical server architecture, discusses its advantages and drawbacks, and presents its contributions to our work. The study proposes a redirection-based hierarchical server architecture that includes two levels of servers: redirection servers and normal HTTP servers. The administrator of the cluster partitions the data and stores it on different cluster nodes. The redirection servers distribute the requests of the web users to the corresponding HTTP servers.

Figure 16: Hierarchical redirection-based web server architecture

Figure 16 illustrates the architecture of the hierarchical redirection-based web server approach. Each HTTP server stores a portion of the data available at the site. The round robin DNS distributes the load among the redirection servers [75]. The redirection servers in turn redirect the requests to the HTTP servers where a subset of the data resides. The redirection mechanism is part of the HTTP protocol, and it is completely transparent to the user. The browser automatically recognizes the redirection message, derives the new URL from it, and connects to the new server to fetch the file. The original goal of the redirection mechanism supported in HTTP was to facilitate moving files from one server to another. When a client uses an old URL from its cache or from a bookmark, and the file referenced by the old URL has moved to a new server, the old server returns a redirection message, which contains the new URL.
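On the wire, such a redirection is an ordinary HTTP status exchange. The fragment below is a sketch with hypothetical host names: the client requests a document, receives a redirection response carrying the new location, and transparently repeats the request against the designated server:

    GET /stocks/quote.html HTTP/1.0
    Host: www.example.com

    HTTP/1.0 302 Moved Temporarily
    Location: http://server1.example.com/stocks/quote.html

The browser then issues the same GET to server1.example.com without any user intervention.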
The cluster administrator partitions the documents stored at the site among the different servers based on their content. For instance, server 1 could store stock price data, while server 2 stores weather information and server 3 stores movie clips and reviews. All requests for stock quotes are directed to server 1, while requests for weather information are directed to server 2. It is possible to implement the architecture described with server software modifications. However, in order to provide more flexibility in load balancing and additional reliability, there is a need to replicate contents on multiple servers. Implementing data replication requires modifying the data structure containing the mapping information. If data is replicated, a logical file name is mapped to multiple URLs on different servers. In this case, the redirection server has to choose one of the servers containing the relevant data. Intelligent strategies for choosing the servers can be implemented to better balance the load among the HTTP servers. Many approaches are possible, including round robin and weighted round robin.

Figure 17: Redirection mechanism for HTTP requests

Figure 17 illustrates the steps a web request goes through until the client gets a response back from the HTTP server. The web user types a web request into the web browser (1). The DNS server resolves the address and returns the IP address of the server, which in this case is the address of the redirection server (2). When the request arrives at the redirection server (3), it is examined and forwarded to the appropriate HTTP server (4, 5). The HTTP server processes the request and replies to the web client (6). The authors implement load balancing by having each HTTP server report its load periodically to a load monitoring coordinator. If the load on a particular server exceeds a certain threshold, the load balancing procedure is triggered. Some portions of the content on the overloaded server are then moved to another server with a lower load. Next, the redirection information is updated in all redirection servers to reflect the data move. The authors implemented a prototype of the redirection-based server architecture using one redirection server and three HTTP servers. Measurements using the WebStone [86] benchmark demonstrate that the throughput scales up with the number of machines added. Measurements of connection times to various sites on the Internet indicate that the additional connection to the redirection server accounts for approximately a 20% increase in latency [74]. This architecture is implemented using COTS hardware and server software. Web clients see a single logical web server without knowing the actual location of the data or the number of servers currently providing the service. However, the administrator of the system partitions the document store among the available cluster nodes using tedious and mechanical procedures, and the architecture does not provide dynamic load balancing; rather, it requires the intervention of a system administrator to move data to different servers and to manually update the redirection rules.
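To make the redirection rules concrete, the following Python sketch, using hypothetical names and URLs, maps each logical file name to the URLs of its replicas and picks a replica in round robin order, in the spirit of the strategies mentioned above:

    import itertools

    # Logical file name -> URLs of the servers holding a replica (hypothetical)
    replicas = {
        "/stocks/quote.html": itertools.cycle([
            "http://server1.example.com/stocks/quote.html",
            "http://server3.example.com/stocks/quote.html",
        ]),
        "/weather/today.html": itertools.cycle([
            "http://server2.example.com/weather/today.html",
        ]),
    }

    def redirect_target(logical_name):
        # Returns the URL to place in the Location header of the 302 reply
        return next(replicas[logical_name])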
One important characteristic of the implementation is the size of the mapping table. The HTTP server stores the redirection information in a table that is created when the server is started and is kept in main memory. This mapping table is searched on every access to the redirection server. If the table grows too large, it increases the processing time spent searching in the redirection server. The architecture assumes that all HTTP servers have disk storage, which is not very realistic, as many real deployments take advantage of diskless nodes and network storage. The maintenance and update of all copies of the data is difficult. In addition, web requests require an additional connection between the redirection server and the HTTP server.

Other drawbacks of the architecture include the lack of redundancy at the main redirection server. The authors did not focus on incorporating high availability capabilities within the architecture. In addition, since the architecture assumes a single redirection server, there was no effort to investigate a single IP interface to hide all the redirection servers. As a result, the redirection server poses a SPOF and limits the performance and scalability of the architecture. Furthermore, the authors did not investigate the scaling limitations of the architecture. Overall, the architecture promises a limited level of scalability.

We classify the main inputs from this project into four essential points. First, the research provided us with confirmation that a distributed architecture is the right way to proceed. A distributed architecture allows us to add more servers to handle increases in traffic in a transparent fashion. The second input is the concept of specialization. Although very limited in this study, node specialization can be beneficial, where different nodes within the same cluster handle different traffic depending on the application running on the cluster nodes. The third input to our work relates to load balancing and moving data between servers. The redirection architecture achieves load balancing by manually moving data to different servers and then updating the redirection information stored on the redirection server. This load balancing scheme is an interesting concept for small configurations; however, it is not practical for large web clusters, and we do not consider this approach for our architecture. The fourth input to our work is the need for a dynamic traffic distribution mechanism that is efficient and lightweight.

2.10.2 Scheduling Policies for Scalable Web Clusters

The researchers at the University of Roma are investigating the design of efficient and scalable scheduling algorithms for web dispatchers. The research investigates clusters as a scalable web server platform. The researchers argue that one of the main goals for scalability of a distributed web system is the availability of a mechanism that optimally balances the load over the server nodes [79]. Therefore, the project focuses on traditional and new algorithms that allow scalability of web server farms receiving peak traffic. The project recognizes that clustered systems are leading architectures for building web sites that require guaranteed scalable services when the number of users grows exponentially. The project defines a web farm as a web site that uses two or more servers housed together in a single location to handle requests [79].
Figure 18: The web farm architecture with the dispatcher as the central component

Figure 18 presents the architecture of the web cluster, with n servers connected to the same local network and providing service to incoming requests. The dispatcher server connects to the same network as the cluster servers, provides an entry point to the web cluster, and retains the transparency of the distributed architecture for the users [84]. The dispatcher receives the incoming HTTP requests and distributes them to the back-end cluster servers. Although web clusters consist of several servers, all servers use one site hostname to provide a single interface to all users. Moreover, to have a mechanism that controls the totality of the requests reaching the site and to mask the service distribution among multiple back-end servers, the web server farm provides a single virtual IP address that corresponds to the address of the front-end server(s). This entity is the dispatcher, which acts as a centralized global scheduler that receives incoming requests and routes them among the back-end servers of the web cluster. To distribute the load among the web servers, the dispatcher identifies each server in the web cluster uniquely through a private address. The researchers argue that the dispatcher cannot use highly sophisticated algorithms for traffic distribution because it has to take fast decisions for hundreds of requests per second. Static algorithms are the fastest solution because they do not rely on the current state of the system at the time of making the distribution decision. Dynamic distribution algorithms have the potential to outperform static algorithms by using some state information to help dispatching decisions. However, they require a mechanism that collects, transmits, and analyzes that information, thereby incurring overheads.

The research project considered three scheduling policies that the dispatcher can execute [84]: random (RAN), round robin (RR), and weighted round robin (WRR). The project does not consider sophisticated traffic distribution algorithms, to prevent the dispatcher from becoming the primary bottleneck of the web farm. Based on modeling simulations, the project observed that bursts of arrivals and skewed service times alone do not motivate the use of sophisticated global scheduling algorithms. Instead, an important feature to consider in the choice of the dispatching algorithm is the type of services provided by the web site. If the dispatcher mechanism has full control over client requests, and clients require HTML pages or submit light queries to a database, system scalability is achieved even without sophisticated scheduling algorithms. In these instances, straightforward static policies are as effective as their more complex dynamic counterparts. Scheduling based on dynamic state information appears to be necessary only on sites where the cost of the majority of client requests is three or more orders of magnitude higher than that of serving a static HTML page with some embedded objects. The project observes that for web sites characterized by a large percentage of static information, a static dispatching policy such as round robin provides satisfactory performance and load balancing. Their interpretation of this result is that a light-to-medium load is implicitly balanced by the fully controlled circular assignment among the server nodes that is guaranteed by the dispatcher of the web farm. When the workload characteristics change significantly, so that very long services dominate, the system requires dynamic routing algorithms such as WRR to achieve a uniform distribution of the workload and a more scalable web site. In high traffic web sites, dynamic policies thus become a necessity.
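The three policies are simple to state precisely. The Python sketch below, with hypothetical server names and weights, shows the RAN, RR, and WRR assignment rules a dispatcher could apply to each incoming request:

    import random
    import itertools

    servers = ["s1", "s2", "s3"]
    weights = {"s1": 3, "s2": 1, "s3": 1}  # hypothetical relative capacities

    def ran():
        # Random: pick any server, uniformly
        return random.choice(servers)

    rr_cycle = itertools.cycle(servers)
    def rr():
        # Round robin: strict circular assignment
        return next(rr_cycle)

    wrr_cycle = itertools.cycle(
        [s for s in servers for _ in range(weights[s])])
    def wrr():
        # Weighted round robin: each server appears in the cycle
        # in proportion to its weight
        return next(wrr_cycle)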
The researchers did not prototype the architecture into a real system and run benchmarking tests on it to validate its performance, scalability, and high availability. In addition, the project did not design or prototype new traffic distribution algorithms for web servers; instead, it relied on existing distribution algorithms such as DNS routing and the RAN, RR, and WRR policies. The architecture presents several single points of failure. In the event of a dispatcher failure, the cluster becomes unreachable. Furthermore, if a cluster node becomes unavailable, there is no mechanism in place to notify the dispatcher of the failure of individual nodes. Moreover, the dispatcher presents a bottleneck to the cluster when under a heavy load of traffic.

The main input from this project is that dynamic routing algorithms are a core technology to achieve a uniform distribution of the workload and a scalable web cluster. The key is in the simplicity of the dynamic scheduling algorithms.

2.10.3 Scalable World Wide Web Server

The Scalable Web Server (SWEB) project grew out of the needs of the Alexandria Digital Library (ADL) project [85] at the University of California at Santa Barbara [78], which has the potential to become a bottleneck in delivering digitized documents over the high speed Internet. For web-based network information systems such as digital libraries, the servers involve much more intensive I/O and heterogeneous processor activities. The SWEB project investigates the issues involved in developing a scalable web server on a cluster of workstations and parallel machines [78]. The objective of the project is to strengthen the processing capabilities of such servers by utilizing the power of clustered computers to match the huge demand in simultaneous access requests from the Internet. The project aims to demonstrate how to utilize inexpensive commodity networks, heterogeneous workstations, and disks to build a scalable web server, and to develop dynamic scheduling algorithms that exploit task and I/O parallelism and adapt to run time changes in system resource load and availability. The scheduling component of the system actively monitors the usage of processors, I/O channels, and the interconnection network to distribute HTTP requests effectively across cluster nodes.

Figure 19: The SWEB architecture

Figure 19 illustrates the SWEB architecture. The DNS routes the user requests to the SWEB processors using round robin distribution. The DNS assigns the requests without consulting the dynamically changing system load information. Each processor in the SWEB architecture contains a scheduler, and the SWEB processors collaborate with each other to exchange system load information. After the DNS sends a request to a processor, the scheduler on that processor decides whether to process the request or assign it to another SWEB processor.
The architecture uses URL redirection to achieve re-assignment. The SWEB architecture does not allow SWEB servers to redirect HTTP requests more than once, to avoid the ping-pong effect.

Figure 20: The functional modules of a SWEB scheduler in a single processor. The scheduler accepts a request, chooses a server for it, and either reroutes the request or handles it locally; the httpd is supported by a broker that chooses a server for each request, an oracle that characterizes requests, and a loadd daemon that manages distributed load information.

Figure 20 illustrates the functional structure of the SWEB scheduler. The SWEB scheduler contains an HTTP daemon based on the source code of the NCSA httpd [76] for handling HTTP requests, in addition to the broker module that determines the best possible processor to handle a given request. The broker consults with two other modules, the oracle module and the loadd module. The oracle module is a miniature expert system, which uses a user-supplied table that characterizes the processor and disk demands for a particular task. The loadd module is responsible for updating the system processor, network, and disk load information periodically (every 2 to 3 seconds), and for marking as unavailable the processors that have not responded within the time limit. When a processor leaves or joins the resource pool, the loadd module is aware of the change as long as the processor is on the original list of processors that was set up by the administrator of the SWEB system.

The SWEB architecture investigates several concepts. It supports a limited flavor of dynamism by monitoring the processor and disk usage on processors. The loadd module collects processor and disk usage information and feeds this information back to the broker to make better distribution decisions. The drawback of this mechanism is that it does not report available memory as part of the metrics, which is as important as the processor information; instead, it reports local disk information for an architecture that relies on a network file system for storage. The SWEB architecture does not provide high availability features, making it vulnerable to single points of failure. The oracle module expects as input from the administrator a list of processors in the SWEB system and the processor and disk demands for a particular task. It is not able to collect this information automatically. As a result, the administrator of the cluster must intervene every time a processor is added or removed. The SWEB implementation modified the source code of the web server and created two additional software modules [78]. The implementation is not flexible and does not allow the usage of those modules outside the SWEB-specific architecture. The researchers benchmarked the SWEB architecture built using a maximum of four processors with an in-house benchmarking tool, rather than a standardized tool such as WebBench with a standardized workload. The results of the tests demonstrate a maximum of 76 requests per second for a 1 KB request size, and 11 requests per second for a 1.5 MB request size, which ranks low compared to our initial benchmarking results (Section 3.7). The project contributes to our work by showing how to actively monitor the usage of processors, I/O channels, and the network load. This information allows us to distribute HTTP requests effectively across cluster nodes. Furthermore, the concept of a web cluster without master nodes, where the cluster nodes themselves provide the services master nodes usually provide, is a very interesting concept.
2.10.4 University of Nebraska-Lincoln: LSMAC and LSNAT

This research project recognizes that server responsiveness and scalability are important in client/server network environments [13]. The researchers consider clusters that use commodity hardware as an alternative to expensive specialized hardware for building scalable web servers. The project investigates different server infrastructures, namely MAC-based dispatching (LSMAC) and IP-based dispatching (LSNAT), and focuses on the single IP interface approach [44]. The resulting implementations are the LSMAC and LSNAT implementations, respectively. LSMAC dispatches each incoming packet by directly modifying its media access control (MAC) addresses.

Figure 21: The LSMAC implementation

Figure 21 presents the LSMAC approach [80]. A client sends an HTTP packet (1) with A as the destination IP address. The immediate router sends the packet to the dispatcher at IP address A (2). Based on the load sharing algorithm and the session table, the dispatcher decides that this packet should be handled by a back-end server, Server 2, and sends the packet to Server 2 by changing the MAC address and forwarding it (3). Server 2 accepts the packet and replies directly to the client (4).

Figure 22: The LSNAT implementation

Figure 22 illustrates the LSNAT approach [73][80]. The LSNAT implementation follows RFC 2391 [48]. A client sends an HTTP packet (1) with A as the destination IP address. The immediate router sends the packet to the dispatcher (2) on A, since the dispatcher machine is assigned the IP address A. Based on the load sharing algorithm and the session table, the dispatcher decides that this packet should be handled by a back-end server, Server 2. It then rewrites the destination IP address as that of Server 2, recalculates the IP and TCP checksums, and sends the packet to Server 2 (3). Server 2 accepts the packet (4) and replies to the client via the dispatcher, which the back-end servers see as a gateway. The dispatcher rewrites the source IP address of the replying packet as A, recalculates the IP and TCP checksums, and sends the packet to the client (5).

The dispatcher in both approaches, LSMAC and LSNAT, is not highly available and presents a SPOF that can lead to service discontinuity. The work did not focus on providing high availability capabilities; therefore, in the event of a node failure, the failed node continues to receive traffic. Moreover, the architecture does not support scaling the number of servers: the largest setup tested was a cluster of four nodes, and the authors did not demonstrate the scaling capabilities of the proposed architecture beyond four nodes [73][80]. The performance measurements were performed using the benchmarking tool WebStone [86]. The LSMAC implementation running on a four-node cluster averaged 425 transactions per second per traffic node. The LSNAT implementation running on a four-node cluster averaged 200 transactions per second per traffic node [79]. Furthermore, the architecture does not provide adaptive, optimized distribution. The dispatcher takes into consideration neither the load of the traffic nodes nor their heterogeneous nature to optimize its traffic distribution. It assumes that all the nodes have the same hardware characteristics, such as the same processor speed and memory capacity.
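In essence, an LSNAT-style dispatcher keeps a session table and rewrites addresses in both directions. The Python sketch below is a simplification with hypothetical addresses and field names; a real implementation operates on raw packets and must also recalculate the IP and TCP checksums after each rewrite:

    import itertools

    VIRTUAL_IP = "10.0.0.1"                                  # public address A
    servers = ["192.168.1.2", "192.168.1.3", "192.168.1.4"]  # private back ends
    rr = itertools.cycle(servers)
    sessions = {}  # (client_ip, client_port) -> chosen back-end server

    def inbound(packet):
        # Client-to-server direction: pick a back end for new sessions,
        # then rewrite the destination address
        key = (packet["src_ip"], packet["src_port"])
        if key not in sessions:
            sessions[key] = next(rr)
        packet["dst_ip"] = sessions[key]
        return packet

    def outbound(packet):
        # Server-to-client direction: restore the virtual IP as the source,
        # so the cluster appears to the client as a single server
        packet["src_ip"] = VIRTUAL_IP
        return packet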
2.10.5 Harvard Array of Clustered Computers (HACC)

The goal of the HACC project is to design and develop a cluster architecture for scalable and cost effective web servers [81][82]. In [81], the authors discuss the approach that places a router called an IP Sprayer between the Internet and a cluster of web servers. Figure 23 illustrates the architecture of the web cluster with the IP Sprayer. The IP Sprayer is responsible for distributing incoming web traffic evenly between the nodes of the cluster. A number of commercial products, such as the Cisco LocalDirector [50] and the F5 Networks BIG-IP [87], employ this approach to distribute web site requests to a collection of machines, typically in a round robin fashion. The HACC architecture focuses on locality enhancements by dividing the document store among the cluster nodes, and on dynamic traffic distribution. Rather than distributing requests in a round robin fashion, the HACC smart router distributes requests so as to enhance the inherent locality of the requested documents in the server cluster.

Figure 23: The architecture of the IP sprayer

Figure 24: The architecture with the HACC smart router

Figure 24 illustrates the concept of the HACC smart router. Instead of being responsible for the entire working set, each node in the cluster is responsible for only a fraction of the document store. The size of the working set of each node decreases each time we add a node to the cluster, resulting in a more efficient use of resources per node. The smart router uses an adaptive scheme to tune the load presented to each node in the cluster based on that node's capacity, so that it can assign each node a fair share of the load. The idea of HACC bears some resemblance to the affinity based scheduling schemes for shared memory multiprocessor systems [88][89], which schedule a task on a processor where relevant data already resides.

2.10.5.1 HACC Implementation

The main challenge in realizing the potential of the HACC design is building the smart router, and within the smart router, designing the adaptive algorithms that direct requests to the cluster nodes based on the locality properties and capacity of the nodes [81]. The smart router implementation consists of two layers: the low smart router (LSR) and the high smart router (HSR). The LSR corresponds to the low-level kernel-resident part of the system, and the HSR implements the high-level user-mode brain of the system. The authors conceived this partitioning to create a separation of mechanism and policy, with the mechanism implemented in the LSR and the policy implemented in the HSR.

The Low Smart Router: The LSR encapsulates the networking functionality. It is responsible for TCP/IP connection setup and termination, for forwarding requests to cluster nodes, and for forwarding the results back to clients. The LSR listens on the web server port for a connection request. When the LSR receives a connection request, TCP passes a buffer to the LSR containing the HTTP request. The HSR extracts and copies the URL from the request. The LSR queues all data from this incoming request and waits for the HSR to indicate which cluster node should handle the request.
When the HSR identifies the node, the LSR establishes a connection with it and forwards the queued data over this connection. The LSR continues to ferry data between the client and the cluster node serving the request until either side closes the connection.

The High Smart Router: The HSR monitors the state of the document store, the nodes in the cluster, and the properties of the documents passing through the LSR. It uses this information to decide how to distribute requests over the HACC cluster nodes. The HSR maintains a tree that models the structure of the document store. Leaves in the tree represent documents, and internal nodes represent directories. As the HSR processes requests, it annotates the tree with information about the document store to be applied in load balancing. This information can include node assignment, document sizes, request latency for a given document, and, in general, sufficient information to make an intelligent decision about which node in the cluster should handle the next document request. When a request for a particular file is received for the first time, the HSR adds nodes representing the file and newly reached directories to its model of the document store, initializing the file's node with its server assignment. In the current prototype, incoming new documents are assigned to the least loaded server node. After the first request for a document, subsequent requests go to the same server and thus improve the locality of references.

Dynamic Load Balancing: Dynamic load balancing is implemented using Windows NT's performance data helper (PDH) interface [90]. The PDH interface allows collecting a machine's performance statistics remotely. When the smart router initializes, it spawns a performance monitoring thread that collects performance data from each cluster node at a fixed interval. The HSR then uses the performance data for load balancing in two ways. First, it identifies the least loaded node and assigns new requests to it. Second, when a node becomes overloaded, the HSR tries to offload a portion of the documents for which the overloaded node is responsible to the least loaded node.

2.10.5.2 HACC Evaluation and Lessons Learned

There are several drawbacks to the HACC architecture. The architecture is vulnerable to a SPOF, has limited performance and scalability, and suffers from various challenges in aspects such as data distribution, in addition to specific implementation issues. Firstly, the smart router presents itself as a SPOF: if the smart router becomes unavailable because of a software or hardware error, the HACC cluster becomes unusable. Secondly, the performance and scalability of the HACC cluster with the smart router are limited. The benchmarking tests demonstrate that the prototype of the smart router is capable of handling between 400 and 500 requests per second (for requests of size 8 KB) [81]. These results are half the performance that our early prototype achieved in its worst-case scenario, using the round robin distribution scheme (Section 3.7). The limited performance of the smart router suggests that the HACC cluster cannot scale, because of the limitation imposed by the bottleneck at the smart router level. In addition, the prototype of the HACC architecture consists of one node acting as the smart router and three nodes acting as cluster nodes, suggesting a limited configuration.
Furthermore, the HACC architecture relies on distributing data among the different cluster nodes, and this distribution cannot be performed while the HACC cluster is in operation. This approach is not realistic for systems that are in operation and require an online update to their data store. In addition, the tree-structured name space only works when the structure of the document store is hierarchical. Moreover, the Keep-Alive feature of HTTP poses some potential problems for the smart router: if the Keep-Alive option is enabled, the browser reuses the TCP connection for subsequent requests that the smart router does not intercept, which interferes with the traffic distribution decisions of the smart router.

2.10.6 IBM Scalable and Highly Available Web Server

IBM Research is investigating the concept of a scalable and highly available web server that offers web services via a Scalable Parallel (SP-2) system, a cluster of RS/6000 workstations. The goal is to support a large number of concurrent users, high bandwidth, real time multimedia delivery, fine-grained traffic distribution, and high availability. The server will provide support for large multimedia files such as audio and video, real time access to video data with high access bandwidth, fine-grained traffic distribution across nodes, as well as efficient back-end database access. The project focuses on providing efficient traffic distribution mechanisms and high availability features. The server achieves traffic distribution by striping data objects across the back-end nodes and disks. It achieves high availability by detecting node failures and reconfiguring the system appropriately. However, there is no mention of the time needed to detect a failure and to recover.

Figure 25: The two-tier server architecture

Figure 25 illustrates the architecture of the web cluster. The architecture consists of a group of nodes connected by a fast interconnect. Each node in the cluster has a local disk array attached to it. The disks of a node can either maintain a local copy of the web documents or share it among nodes. The nodes of the cluster are of two types: front-end (delivery) nodes and back-end (storage) nodes. Round robin DNS is used to distribute incoming requests from the external network to the front-end nodes, which also run httpd daemons. The logical front-end node then forwards the required command to the back-end nodes that hold the data (document), using a shared file system. Next, the httpd daemons send the results to the front-end nodes through the switch, and the results are then transmitted to the user. The front-end nodes run the web daemons and connect to the external network. To balance the load among them, clients spread the load across nodes using RR DNS [51]. All the front-end nodes are assigned a single logical name, and the RR DNS maps the name to multiple IP addresses.

Figure 26: The flow of the web server router

Figure 26 illustrates another approach for achieving traffic distribution. One or more nodes of the cluster serve as TCP routers, forwarding client requests to the different front-end nodes in the cluster in round robin order. The name and IP address of the router are public, while the addresses of the other nodes in the cluster are private.
If there is more than one router node, a single name is used and the round robin DNS maps the name to the multiple routers. The flow of the web server router (Figure 26) is as follows. When a client sends requests to the router node (1), the router node forwards (2) all packets belonging to a particular TCP connection to one of the front-end server nodes. The router can use different algorithms to select which node to route to, or use a round robin scheme. The server nodes reply directly to the client (3) without going through the router; however, the server nodes change the source address on the packets sent back to the client to be that of the router node. The back-end nodes host the shared file system used by the front-end nodes to access the data.

There are several main drawbacks preventing the architecture from achieving a scalable and highly available web cluster: limited traffic distribution performance, limited scalability, lack of high availability capabilities, the presence of several SPOFs, and the lack of a dynamic feedback mechanism. The architecture relies on round robin DNS to distribute traffic among server nodes. The scheme is static, does not adjust based on the load of the cluster nodes, and does not accommodate the heterogeneous nature of the cluster nodes. The authors proposed an improved traffic distribution mechanism [83] that involves changing packet headers but still relies on round robin DNS to distribute traffic among router nodes. The concept was prototyped with four front-end nodes and four back-end nodes. The project did not demonstrate whether the architecture is capable of scaling beyond four traffic nodes, or whether failures at the node level are observed and accommodated dynamically. The architecture does not provide features that allow service continuity. The switch shown in Figure 25 and Figure 26 is a SPOF. The network file system where the data resides is also vulnerable to failures and presents another SPOF. Furthermore, the architecture does not support a dynamic feedback loop that allows the router to forward traffic depending on the capabilities of each traffic node.

2.10.7 Discussion

The surveyed projects share some common results and conclusions. Current server architectures do not provide the scalability needed to handle large traffic volumes and a large number of web users. There seems to be a consensus among the surveyed literature on the need to design a new server architecture that is able to meet our vision for next generation Internet servers. A distributed architecture that consists of independent servers sharing the load is more appropriate than single server architectures for implementing a scalable web server. Surveyed projects such as [63], [64], [65], [68], [72], [91], [92], [93], [94], and [95] discuss how using clustering technologies helps increase the performance and scalability of the web server. Clustering is the dominant technology that will help us achieve better scalability and higher performance. The focus is not on clustering as a technology; rather, the focus is on using this technology as a means to achieve scalability, high capacity, and high availability. Several of the surveyed projects, such as [68], [79], [81], and [82], focused on providing efficient traffic distribution mechanisms and high availability features that enable continuous service availability. However, looking at current benchmarking results, adding high availability and fault tolerance features negatively affects the performance and scalability of the cluster.
Traffic distribution is an essential aspect of achieving a highly scalable platform. Hardware-based traffic distribution solutions are not scalable; they constitute both a performance bottleneck and a SPOF. There is a clear need for a software single virtual IP layer that hides the cluster nodes and makes them transparent to end users. There is also a need for an unsophisticated design and, consequently, a simple implementation. An uncomplicated design allows a smooth integration of many components into a well-defined architecture. The benefits are a lightweight design and a faster and more robust system. One interesting observation is that the surveyed works, with the exception of the HACC project [81], use the Linux operating system to prototype and implement Internet and web servers.

A highly available and scalable web cluster requires a smart and efficient traffic distribution mechanism that distributes incoming traffic from the cluster interface to the least busy nodes in the cluster, based on a dynamic feedback loop. The distribution mechanism and the cluster interface should present neither a bottleneck nor a SPOF. To scale such a cluster, we would like the ability to add nodes to the cluster without disruption of the provided services, and the capability to increase the number of nodes to meet traffic demands while achieving close to linear scalability.

Chapter 3
Preparatory Work

This chapter describes the preparatory work conducted as part of the early investigations. It describes the prototyped web server cluster, presents the benchmarking environment, and reports the benchmarking results.

3.1 Early Work

As part of the early experimental work, I studied the HTTP protocol [96] and the Apache web server as case studies [97][98]. These studies resulted in two technical reports, [99] and [100]. Apache is the most popular web server software in the world, running over 60% of the web servers [101], and is licensed under the GNU General Public License (GPL) [102], which grants users full access to the source code. We divided the experimental work into two tasks: understanding and building a web server, and implementing a performance measurement tool that is able to execute performance and scalability tests on web servers. The first task involved understanding the workings of web servers, in addition to setting up a web server at the research lab in Concordia. This task was a necessary step leading to the study of web server design and performance. This work resulted in three published technical reports: [96], [99], and [100]. The second task consisted of designing and implementing the X-Based Web Servers Performance Tool (XWPT), which generates web traffic to a web server and collects performance metrics such as the number of successful requests per second and the throughput [103].

3.2 Description of the Prototyped Web Cluster

The goal of this exercise was to provide a test platform for a highly available and scalable web server cluster, and to benchmark a running web cluster to learn how the cluster scales, rather than simulating traffic against a model. Figure 27 illustrates the prototyped web cluster. The cluster consists of 14 Pentium III processors running at 500 MHz with 512 MB of RAM each. The cluster processors (also called nodes) run the Linux operating system [22]. Each server has six Ethernet ports: two onboard ports and four ports offered on an auxiliary CompactPCI card for redundancy purposes.
Eight of the processors have three SCSI disks each, with a capacity of 54 GB in total; the other eight processors are diskless. All processors have access to a common storage volume provided by the master nodes. Although we used NFS, we also experimented with the Parallel Virtual File System (PVFS) to provide a shared disk space among all nodes. The PVFS supports high performance I/O over the web [91][92][105][106]. To provide high availability, we implemented redundant Ethernet connections and redundant network file systems, in addition to software RAID [104]. The LVS [23] provides a single IP interface for the cluster and provides an HTTP traffic distribution mechanism among the servers in the cluster. As for the web server software, we ran Apache release 2.08 [24] and Tomcat release 3.1 [25] on the traffic nodes. Other studies and benchmarking we performed in this area include [99], [100], [103], [107], and [108].

Figure 27: The architecture of the prototyped web cluster (the LVS virtual IP interface provides a single entry to the cluster; two master nodes provide storage, HA NFS, and cluster services to 12 traffic nodes over two redundant LANs)

As for the local network, all cluster nodes are interconnected using redundant dedicated links. Each node connects to two networks through two Ethernet ports over two redundant network switches. External network access is restricted to master nodes unless traffic nodes require direct access, which is configurable for some scenarios such as direct routing. In Figure 27, a dotted connection indicates that the processor connects to LAN 1, and a solid connection indicates that the processor connects to LAN 2.

The components and parameters of the prototyped system include two master nodes and 12 traffic nodes, a traffic distribution mechanism, a storage sub-system, and local and external networks. The master nodes provide a single entry point to the system through the virtual IP address, receiving incoming web traffic and distributing it to the traffic nodes using the traffic distribution algorithm. The master nodes provide cluster-wide services such as I/O services through NFS, DHCP, and NTP. They respond to incoming web traffic and distribute it among the traffic nodes. The traffic nodes run the Apache web server, and their primary responsibility is to respond to web requests. They rely on master nodes for cluster services. We used the LVS in different configurations to distribute incoming traffic to the traffic nodes. Section 3.5 discusses the network address translation and direct routing methods, and demonstrates their capabilities.

3.3 Benchmarking Environment

We used web server benchmarking to evaluate the performance and scalability of the prototyped web cluster. Web benchmarking is a combination of standardized tests that consist of a mechanism to generate a controlled stream of web requests to the cluster, with standard metrics to report results. The following subsections describe the hardware and software components of the benchmarking environment.

3.3.1 Hardware Benchmarking Environment

We used 16 Intel Celeron 500 MHz machines, called web client machines, to generate web traffic to the prototype web cluster illustrated in Figure 27. Each of the web client machines is equipped with 512 MB of RAM and runs Windows NT.
In addition, the benchmarking environment requires a controller machine that is responsible for collecting and compiling the test results from all the web client machines. The controller machine has a Pentium III processor running at 500 MHz with 512 MB of RAM.

3.3.2 Software Benchmarking Environment

The web client machines generate traffic using WebBench [19], a freeware benchmarking tool that simulates web browsers. When a web client receives a response from the web server, WebBench records the information associated with the response and immediately sends another request to the server. Figure 28 presents the WebBench architecture. WebBench runs on a controller machine running the controller program and one or more client machines, each running the client program. The controller provides the means to set up, start, stop, and monitor the WebBench tests. It is also responsible for gathering and analyzing the data reported by the clients. The web client machines execute the WebBench tests and send requests to the web server.

WebBench supports up to 1,000 web clients generating traffic to a target web server. WebBench stresses a web server by using a number of test systems to request URLs. We can configure each WebBench test system to use multiple threads to make simultaneous web server requests. By using multiple threads per test system, it is possible to generate a large load on a web server and stress it to its limit with a reasonable number of test systems. Each WebBench test thread sends an HTTP request to the web server and waits for the reply. When the reply comes back, the test thread immediately makes a new HTTP request. WebBench makes peak performance measurements that illustrate the limitations of a web server platform.

Figure 28: The architecture of the WebBench benchmarking tool (a controller machine gathers results and generates statistics; the client machines generate web traffic to the target clustered system)

The workload tree provided by WebBench contains the test files the WebBench clients access when we execute a test suite. The WebBench workload tree is the result of studying real-world sites such as Microsoft, USA Today, and the Internet Movie Database. The tree uses multiple directories and different directory depths. It contains over 6,200 static pages and executable test files. WebBench provides static (STATIC.TST) and dynamic (WBSSL.TST) test suites. The static suites use HTML and GIF files. The dynamic suites use applications that run on the server. WebBench keeps all the transaction information at run time and uses this information to compute the final metrics presented when the tests are completed. The standard test suites of WebBench begin with one client and add clients incrementally until they reach a maximum of 60 clients (per single client machine). WebBench provides numerous standardized test suites. For our testing purposes, we executed a mix of the STATIC.TST (90%) and WBSSL.TST (10%) tests. Each run of this combination of test suites takes on average two and a half hours.
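To make the behavior of a WebBench test thread concrete, the following sketch implements the same request loop: send an HTTP request, wait for the reply, and immediately send another, while counting successful requests and bytes transferred. This is a minimal illustration, not WebBench itself; the target URL, thread count, and run length are placeholder assumptions.

import threading, time, urllib.request

def test_thread(urls, stop, counters, lock):
    # Each thread mimics a WebBench test thread: request, wait for the
    # reply, immediately request again.
    i = 0
    while not stop.is_set():
        try:
            with urllib.request.urlopen(urls[i % len(urls)], timeout=5) as r:
                body = r.read()
            with lock:
                counters["requests"] += 1
                counters["bytes"] += len(body)
        except OSError:
            with lock:
                counters["errors"] += 1
        i += 1

urls = ["http://cluster-vip/index.html"]        # placeholder workload tree
counters = {"requests": 0, "bytes": 0, "errors": 0}
lock, stop = threading.Lock(), threading.Event()
threads = [threading.Thread(target=test_thread,
                            args=(urls, stop, counters, lock))
           for _ in range(8)]                   # 8 concurrent test threads
start = time.time()
for t in threads:
    t.start()
time.sleep(10)                                  # measurement interval
stop.set()
for t in threads:
    t.join()
elapsed = time.time() - start
print(f"{counters['requests'] / elapsed:.0f} requests/s, "
      f"{counters['bytes'] / elapsed / 1024:.0f} KB/s, "
      f"{counters['errors']} errors")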
3.4 Web Server Performance

In considering the performance of a web server, we should pay special regard to its software, operating system, and hardware environment, because each of these factors can dramatically influence the results. In a distributed web server, this environment is complicated further by the presence of multiple components, which require connection handoffs, process activations, and request dispatching. A complete performance evaluation of all layers and components of a distributed web server system is very complex, if not impossible. Hence, a benchmarking study needs to define its goals and scope clearly. In our case, the goal is to evaluate the end-to-end performance of a web cluster. Our main interest does not extend to the hardware and operating system, which in many cases are given.

Web server performance refers to the efficiency of a server when responding to user requests according to defined benchmarks. Many factors affect server performance, such as application design and construction, database connectivity, network capacity and bandwidth, and hardware server resources. In addition, the number of concurrent connections to the web server has a direct impact on its performance. Therefore, the performance objectives include two dimensions: the speed of a single user's transaction and the amount of performance degradation related to an increasing number of concurrent connections.

Throughput: The rate at which data is sent through the network, expressed in Kbytes per second (KB/s).
Connection rate: The number of connections per second.
Request rate: The number of client requests per second.
Reply rate: The number of server responses per second.
Error rate: The percentage of errors of a given type.
DNS lookup time: The time to translate the hostname into the IP address.
Connect time: The time interval between sending the initial SYN and the establishment of the TCP connection.
Latency time: The time interval between sending the last byte of a client request and the receipt of the first byte of the corresponding response; in a network, latency, a synonym for delay, expresses how much time it takes for a packet of data to get from one designated point to another.
Transfer time: The time interval between the receipt of the first response byte and the last response byte.
Web object response time: The sum of latency time and transfer time.
Web page response time: The sum of the web object response times pertaining to a single web page, plus the connect time.
Session time: The sum of all web page response times and user think times in a user session.

Table 5: Web performance metrics

Table 5 presents the common metrics for web system performance. In reporting the results of the benchmarking tests, we report the connection rate, the number of successful transactions per second, and the throughput in KB/s.
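The compound metrics at the bottom of Table 5 are simple sums of the timing components. The small illustration below (with made-up timing values) shows how they combine:

# Compound metrics from Table 5; all timing values are invented.
def web_object_response_time(latency, transfer):
    return latency + transfer           # per object: latency + transfer

def web_page_response_time(connect, object_times):
    return connect + sum(object_times)  # page: connect + all its objects

def session_time(page_times, think_times):
    return sum(page_times) + sum(think_times)

objects = [web_object_response_time(0.05, 0.10),
           web_object_response_time(0.04, 0.02)]
page = web_page_response_time(0.02, objects)
print(page)                                    # approximately 0.23 seconds
print(session_time([page, 0.5], [2.0, 3.0]))   # approximately 5.73 seconds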
3.5 LVS Traffic Distribution Methods

The LVS is an open source project to cluster many servers together into one virtual server. It implements layer-four switching in the Linux kernel, which allows the distribution of TCP and UDP sessions between multiple real servers. The real servers, also called traffic nodes, interconnect through either a local area network or a geographically dispersed wide area network. The front-end of the real servers is the LVS director. The LVS director is the load balancing engine, and it runs on the master cluster processor(s). It provides IP-level traffic distribution to make the parallel services of the cluster appear as a virtual service on a single IP address. All requests come to the front-end LVS director, owner of the virtual IP address, and the LVS director then distributes the traffic among the real servers. LVS provides three implementations of its IP load-balancing techniques, based on NAT, IP tunneling, and DR. We experimented with both the NAT and DR methods. We did not test the IP tunneling method for two reasons: the implementation of the IP tunneling method is still experimental, and it does not offer an added advantage over the DR method. The following subsections discuss all three methods, with a focus on NAT and DR, and present the results of benchmarking a cluster with a NAT and then a DR configuration.

3.5.1 LVS via NAT

Network address translation relies on manipulating the headers of Internet protocols appropriately so that clients believe they are contacting one IP address, and servers at different IP addresses believe the clients are contacting them directly. LVS uses this NAT feature to build a virtual server; parallel services at different IP addresses can appear as a virtual service on a single IP address via NAT. A hub or a switch interconnects the master node(s), which run LVS, and the real servers. The real servers usually run the same service and provide access to the same contents, available on a shared storage device through a distributed file system.

Figure 29 illustrates the process of address translation. The user first accesses the service provided by the server cluster (1). The request packet arrives at the load balancer through the external IP address. The load balancer examines the packet's destination address and port number (2). If they match a virtual server service (according to the virtual server rule table), the scheduling algorithm, round robin by default, chooses a real server from the cluster to serve the request and adds the connection to the hash table, which records all established connections. The load balancer rewrites the destination address and port of the packet to match those of the chosen real server, and forwards the packet to the real server. The real server processes the request (3) and returns the reply to the load balancer. When an incoming packet belongs to an established connection recorded in the hash table, the load balancer rewrites and forwards the packet to the chosen server. When the reply packets come back from the real server to the load balancer, the load balancer rewrites the source address and port of the packets (4) to those of the virtual service, and submits the response back to the client (5). The LVS removes the connection record from the hash table when the connection terminates or times out.

Figure 29: The architecture of the LVS NAT method (the load balancer receives the requests (1), schedules and rewrites the packets (2) for the real servers, which process the requests (3); the load balancer rewrites the replies (4) and returns them to the users (5))
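The rule-table match, scheduling step, and address rewriting just described can be summarized in a short sketch. This is an illustration of the logic only; the real LVS director performs these steps on IP packets inside the kernel, and the addresses and field names below are invented.

from itertools import cycle

# Invented addresses: one virtual service, two real servers.
VIRTUAL = ("192.168.1.100", 80)
real_servers = cycle([("10.0.0.1", 80), ("10.0.0.2", 80)])
conn_table = {}   # (client ip, client port) -> chosen real server

def inbound(packet):
    """Rewrite a client-to-cluster packet toward a real server (steps 2-3)."""
    if (packet["dst"], packet["dport"]) != VIRTUAL:
        return packet                        # not a virtual service: pass through
    key = (packet["src"], packet["sport"])
    server = conn_table.setdefault(key, next(real_servers))  # round robin
    packet["dst"], packet["dport"] = server  # rewrite destination
    return packet

def outbound(packet):
    """Rewrite a real-server reply so it appears to come from the VIP (step 4)."""
    packet["src"], packet["sport"] = VIRTUAL
    return packet

pkt = {"src": "1.2.3.4", "sport": 5000,
       "dst": "192.168.1.100", "dport": 80}
print(inbound(pkt))   # destination rewritten to ('10.0.0.1', 80)

Under the DR method discussed next, the inbound step would forward the packet to the chosen server without rewriting it, and the outbound step would disappear entirely, since the real servers reply to clients directly.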
3.5.2 LVS via DR

The direct routing method allows the real servers and the load balancer server(s) to share the virtual IP address. The load balancer has an interface configured with the virtual IP address, which it uses to accept request packets and route them directly to the chosen real server. Figure 30 illustrates the architecture of the LVS following the direct routing method. When a user accesses a virtual service provided by the server cluster (1), the packet destined for the virtual IP address arrives at the load balancer. The load balancer examines (2) the packet's destination address and port. If it matches a virtual service, the scheduling algorithm chooses a real server (3) from the cluster to serve the request and adds the connection to the hash table that records connections. Next, the load balancer forwards the request to the chosen real server. If new incoming packets belong to this connection and the chosen server exists in the hash table, the load balancer routes the packets directly to the real server. When the real server receives the forwarded packet, it finds that the packet is addressed to its alias interface or to a local socket, so it processes the request (4) and returns the result directly to the user (5). The LVS removes the connection record from the hash table when the connection terminates or times out.

Figure 30: The architecture of the LVS DR method (the Linux director examines the packet destination (2) and forwards the request over the internal network to a real server (3), which processes the request (4) and replies directly to the user (5))

3.5.3 LVS via IP Tunneling

IP tunneling allows packets addressed to an IP address to be redirected to another address, possibly on a different network. In the context of layer-four switching, the behavior is very similar to that of the DR method, except that forwarded packets are encapsulated in an IP packet, rather than just manipulating the Ethernet frame. The main advantage of using tunneling is that the real servers (i.e., traffic nodes) can be on a different network. We did not experiment with the IP tunneling method because of the unstable status of the implementation, and because it does not provide additional capabilities over the DR method. However, we present it for completeness.

3.5.4 Direct Routing versus Network Address Translation

We performed a test to benchmark the same web cluster using the NAT approach for traffic distribution and then using the direct routing approach. With direct routing, the LVS load balancer distributes traffic to the cluster processors, which in turn reply directly to the web clients, without going through address translation. We ran the same test on the same cluster running Apache, using the different traffic distribution mechanisms. Figure 31 presents the results in terms of successful requests per second. Following the NAT approach, the load balancer node saturates at approximately 2,000 requests per second, compared to 4,700 requests per second with the DR method.

Figure 31: Benchmarking results of NAT versus DR (successful requests per second, from 1 to 60 clients, for LVS direct routing and LVS network address translation)

In both tests, the bottleneck occurs at the load balancer node, which was unable to accept more traffic and distribute it to the traffic servers. Instead, the LVS director was rejecting incoming connections, resulting in unsuccessful requests. This test demonstrates that the DR approach is more efficient than the NAT approach and allows better performance and scalability. In addition, it demonstrates the bottleneck at the director level of the LVS.

3.6 Benchmarking Scenarios

The goal of the benchmarking activities was to demonstrate performance and scalability issues in clustered web servers, to uncover these issues, and to demonstrate where and when bottlenecks occur and when the system stops scaling. We carried out six benchmark tests, starting with one processor and scaling up to 12 processors.
The benchmark of a single server demonstrates the limitations of a single server. We also carried out benchmarks to demonstrate the scalability limitations of the NAT and DR methods. We performed two sets of benchmarking tests on the prototyped cluster (Figure 27): a benchmark of a single processor to determine the baseline performance, and several benchmarks while scaling the prototyped cluster from two to 12 processors. We performed the first benchmarking test on a single processor running the Apache web server software. The goal of this test is to reveal the limitations of a single server. We use the results of this test to define the baseline performance and use it as a basis for comparison. For the second set of tests, we scaled the cluster to two, four, six, eight, 10, and 12 traffic nodes, each node running its own copy of the Apache web server software. The benchmarking tool, WebBench, generates the web traffic to the LVS IP layer of the cluster, which runs on the master nodes. Since the LVS DR method is more efficient and scalable than its NAT equivalent (Figure 31), we used it as the distribution mechanism for the incoming traffic. The LVS director performs traffic distribution among all the cluster nodes. The goal of this test is to identify the performance and scaling trend as we add more processors to the cluster.

3.7 Apache Performance Test Results

We conducted benchmarking tests on one processor running Apache to determine the baseline performance on a single processor. Figure 32 presents the number of web clients on the x-axis and the number of successful requests per second on the y-axis.

Figure 32: Benchmarking results of the Apache web server running on a single processor (requests per second versus number of clients)

Figure 33: Apache reaching a peak throughput of 5,903 KB/s before the Ethernet driver crashes

Apache served 1,053 requests per second before it suddenly stopped servicing incoming requests; as far as the WebBench tool could tell, Apache had crashed (Figure 32). We initially thought that the Apache server had crashed under heavy load. However, that was not the case. The Apache server process was still running when we logged locally into the machine. It turned out that the Ethernet device driver had crashed, causing the processor to disconnect from the network and become unreachable. Figure 33 illustrates the throughput achieved on one processor (5,903 KB/s) before the processor disconnected from the network due to the device driver crash. We investigated the device driver problem, fixed it, and made the updated source code publicly available. We did not face the driver crash problem in further testing. Figure 34 presents the benchmarking results of Apache on a single processor after fixing the device driver problem. Apache served an average of 1,043 requests per second.
Figure 34: Benchmarking results of Apache on one processor after the Ethernet driver update (requests per second versus number of clients)

Next, we set up the cluster in several configurations with two, four, six, eight, 10, and 12 processors, and we performed the benchmarking tests. In these benchmarks, the LVS was forwarding the HTTP traffic to the traffic nodes following the DR distribution method. Figure 35 presents the results of the benchmarking test we performed on a cluster with two processors running Apache. The average number of requests per second per processor is 945.

Figure 35: Results of a two-processor cluster (requests per second)

Figure 36: Results of a four-processor cluster (requests per second)

Figure 36 presents the results of the benchmarking test we performed on a cluster with four processors running Apache. The average number of requests per second per processor is 1,003.

Figure 37: Results of an eight-processor cluster (requests per second)

Figure 37 presents the results of the benchmarking test we performed on a cluster with eight processors running Apache. The average number of requests per second per processor is 892. Table 6 presents the results of the Apache benchmarking for all the cluster configurations, including the single standalone node.

Processors in the cluster    Maximum requests per second    Transactions per second per processor
1                            1053                           1053
2                            1890                           945
4                            4012                           1003
6                            5847                           974
8                            7140                           892
10                           7640                           764
12                           8230                           685

Table 6: The results of benchmarking with Apache

For each cluster configuration, we present the maximum performance achieved by the cluster in terms of requests per second; the third column presents the average number of requests per second per processor. As we add more processors to the cluster, the number of requests per second per processor decreases (Table 6, third column). Section 3.9 discusses the scalability of the prototyped web cluster.

3.8 Tomcat Performance Test Results

Similarly, we conducted the benchmarking tests with Tomcat running on one, two, four, six, eight, 10, and 12 processors. Figure 38 presents the results of the benchmarking test we performed on a cluster with two processors running Tomcat. The average number of requests per second per processor is 76.
Figure 38: Results of Tomcat running on two processors (requests per second)

Figure 39 presents the results of the benchmarking test we performed on a system with four processors running Tomcat. The average number of requests per second per processor is 75. Figure 40 presents the results of the benchmarking test we performed on a system with eight processors running Tomcat. The average number of requests per second per processor is 71.

Figure 39: Results of a four-processor cluster running Tomcat (requests per second)

Figure 40: Results of an eight-processor cluster running Tomcat (requests per second)

Table 7 presents the results of testing the prototyped cluster running the Tomcat application server. For each cluster configuration, we present the number of processors in the cluster, the maximum performance achieved by the cluster in terms of requests per second, and the average number of requests per second per cluster processor.

Number of processors in the cluster    Cluster maximum requests per second    Transactions per second per processor
1                                      81                                     81
2                                      152                                    76
4                                      300                                    75
6                                      438                                    73
8                                      568                                    71
10                                     700                                    70
12                                     804                                    67

Table 7: The results of benchmarking with Tomcat

3.9 Scalability Results

We collected the benchmarking results achieved with both Apache (Table 6) and Tomcat (Table 7) for clusters with one, two, four, six, eight, 10, and 12 processors. For each testing scenario, we recorded the maximum number of requests per second that each configuration can service. When we divide this number by the number of processors, we get the maximum number of requests that each processor can process per second in each configuration. Figure 41 and Figure 42 illustrate the transaction capability per processor plotted against the cluster size. In both figures, the curve is not flat; as we add more processors to the cluster, the number of successful transactions per second per processor drops. In the case of Apache, when the cluster scales from a single processor to 12 processors, the number of successful transactions per second per processor drops by approximately 35%. In the case of Tomcat, when the cluster scales from a single processor to 12 processors, the number of successful transactions per second per processor drops by approximately 18%. In both cases, with Apache and Tomcat, as we add more processors into the cluster, we experience a decrease in performance and scalability. The more processors we have in the cluster, the less performance we get per processor.
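The per-processor figures follow directly from dividing each configuration's peak rate by its processor count. The short computation below reproduces the reported drops from the data in Tables 6 and 7 (the Tomcat figure rounds to 17-18% depending on how the baseline is rounded):

# Reproducing the per-processor scalability figures from Tables 6 and 7.
apache = {1: 1053, 2: 1890, 4: 4012, 6: 5847, 8: 7140, 10: 7640, 12: 8230}
tomcat = {1: 81, 2: 152, 4: 300, 6: 438, 8: 568, 10: 700, 12: 804}

for name, results in (("Apache", apache), ("Tomcat", tomcat)):
    per_cpu = {n: total / n for n, total in results.items()}
    drop = (per_cpu[12] - per_cpu[1]) / per_cpu[1]
    print(f"{name}: {per_cpu[1]:.0f} -> {per_cpu[12]:.0f} "
          f"requests/s per processor ({drop:.0%})")

# Output:
# Apache: 1053 -> 686 requests/s per processor (-35%)
# Tomcat: 81 -> 67 requests/s per processor (-17%)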
Figure 41: Scalability chart for clusters of up to 12 nodes running Apache (transactions per second per processor versus number of processors in the cluster)

Figure 42: Scalability chart for clusters of up to 12 nodes running Tomcat (transactions per second per processor versus number of processors in the cluster)

3.10 Discussion

Web servers have a limited capacity for serving incoming requests. In the case of Apache, the capacity limit is around 1,000 requests per second when running on a single processor. Beyond this threshold, the server starts rejecting incoming requests. We have demonstrated non-linear scalability with clusters of up to 12 nodes running the Apache and Tomcat web servers. In the case of Apache, for instance, when we scale the cluster from a single processor to 12 processors, the number of successful requests per second per processor drops from 1,053 to 685, down by 35%. These results represent major performance degradation. Ideally, as we add more processors to the cluster, we would like to achieve linear scalability and maintain the baseline performance of 1,000 requests per second per processor.

We experimented with the NAT and DR traffic distribution approaches. The NAT approach, although widely used, has limited performance and scalability compared to the DR approach, as demonstrated in Section 3.5.4. Our results demonstrate that the cluster stops scaling because of bottlenecks at the master node level and because of inefficiencies in the traffic distribution mechanism. We plan to propose an enhanced method based on the DR approach for our highly available and scalable architecture. The planned improvements include a daemon running on all traffic nodes that reports the node load to the distribution mechanism, allowing it to perform a more efficient and dynamic distribution.

We experienced problems with the Ethernet device drivers crashing under high traffic load (at a throughput of 5,903 KB/s; Figure 33). We improved the device driver code and, as a result, it is now able to sustain a higher throughput. We contributed the improvements back to the Ethernet card provider (ZNYX Networks) and to the open source community.

As for the benchmarking tests, it would have helped to include other metrics, such as processor utilization as well as file system and disk performance metrics, to provide more insight into bottlenecks. These metrics are not available in WebBench, and therefore the only way to obtain them is to either implement a separate tool ourselves or use an existing tool. We acknowledge that the performance of the network file system has an effect on the total performance of the system, since the network file system hosts the web documents shared between all traffic nodes. However, the performance of the file system is out of our scope.

3.11 Contributions of the Preparatory Work

We examined current ways of solving scalability challenges and demonstrated scalability problems in a real system by building a web cluster and benchmarking it for performance and scalability. Many factors differentiate our early experimental work from others.
We did not rely on simulation models to define our system and benchmark it. Instead, we followed a systematic approach, building the web cluster using existing system components and following best practices. In many instances, we contributed system software, such as Ethernet and NFS redundancy, and introduced enhancements to existing implementations. Similarly, we built our benchmarking environment from scratch and did not rely on simulation models to obtain performance and scalability results. This approach gave us much flexibility and allowed us to test many different configurations in a real-world setting.

One unique aspect of our experiments, which we did not see in related work (Sections 2.9 and 2.10), is the scale of the benchmarking environment and the tests we conducted. Our early prototyped cluster consisted of 12 processors, and our benchmarking environment consisted of 17 machines. Surveyed projects were limited in their resources and were not able to demonstrate the negative scalability effects we experienced when reaching 12 processors, simply because they only tested with up to eight processors. In addition, other works relied on simulation to get a feeling of how their architecture would perform. In contrast, we performed our benchmarking using an industry-standardized tool and workload. The benchmarking tool, WebBench, uses standardized workloads and is capable of generating more traffic and compiling more test results and metrics than the other tools used in the surveyed work, such as SURGE [109], S-Clients [110], WebStone [111], SPECweb99 [112], and TPC-W [113]. A paper comparing these benchmarking tools is available from [114]. Furthermore, the entire server environment uses open source technologies, allowing us access to the source code and granting the freedom to modify it and introduce changes to suit our needs.

Chapter 4
The Architecture of the Highly Available and Scalable Web Server Cluster

This chapter describes the architecture of the highly available and scalable (HAS) web server cluster. The chapter reviews the initial requirements discussed in previous chapters, and then presents the HAS cluster architecture, its tiers, and its characteristics. It discusses the architecture components, presents how they interact with each other, illustrates the supported redundancy models, and discusses the various types of cluster nodes and their characteristics. Furthermore, the chapter includes examples of sample deployments of the HAS architecture, as well as case studies that demonstrate how the architecture scales to support increased traffic. The chapter also addresses the subject of fault tolerance and the high availability features. It discusses the traffic management scheme responsible for dynamic traffic distribution, and discusses the cluster virtual IP interface, which presents the HAS architecture as a single entity to the outside world. The chapter concludes with the scenario view of the HAS architecture and examines several use cases.

4.1 Architectural Requirements

The architecture provides the infrastructure that allows the platform to meet the requirements of Internet and web servers running in environments that require high availability and high performance.
In our context, the architecture requirements for a highly available and scalable web cluster fall into the following categories: scalability, minimal response time, reliable storage, continuous service availability, high performance, high reliability and fault tolerance, and support for the new Internet Protocol version 6 (IPv6).

4.1.1 Scalability

The expansion of the user base requires scaling the capacity of the infrastructure to keep it in line with demand. With a single server, however, this means upgrading the server vertically by adding more memory or replacing the processor(s) with faster ones. Each upgrade brings the service down for the duration of the upgrade, which is not desirable. In some cases, the upgrade involves a full deployment on entirely new hardware, resulting in extended downtime. Clustering an application allows us to scale it horizontally by adding more servers to the service cluster without necessitating service downtime. However, even when utilizing clustering techniques to scale, the performance gain is not linear.

4.1.2 Minimal Response Time

Achieving a minimal response time is a crucial factor for the success of distributed web servers. Many studies argue that 0.1 second is about the limit for having the user feel that the system is reacting instantaneously [13][14][15]. Web users expect the system to process their requests and to provide responses quickly and with high data access rates. Therefore, the server needs to minimize response times to live up to the expectations of the users. The response time consists of the connect time, the process time, and the response transit time. The goal is to minimize all three parameters, resulting in a faster total response time.

4.1.3 Reliable, High Capacity and High Performance Storage

The web server has to provide reliable, high performance, and highly available storage. Storage is essential to save and maintain application data, and to provide fast data retrieval.

4.1.4 Continuous Service Availability

Web servers have to provide a connection that is highly available and reliable. The larger the number of users and the greater the volume of data the server handles, the more difficult it becomes to guarantee the availability of the service under high loads [17]. All web server components need to be highly available and redundant; otherwise, they present a SPOF. While there are many definitions of high availability, in this context it is the ability of the web cluster to provide continuous service even in the event of failures of individual components.

4.1.5 High Performance

The web server has to sustain a guaranteed number of requests per second. It needs to maximize resource utilization and serve the maximum number of transactions per second that a single web server node can handle. Section 3.7 presents the baseline performance a web server should sustain under a high traffic load.

4.1.6 High Reliability and Fault Tolerance

High reliability and fault tolerance are essential characteristics for highly available web servers that provide continuous online services. Several challenges arise in web clusters in relation to high reliability and fault tolerance, such as preventing software or hardware failures from interrupting the provided service.

4.1.7 Supporting IPv6

IPv6 is the next generation Internet protocol, designed by the Internet Engineering Task Force (IETF) as a replacement for the Internet protocol version 4 (IPv4).
Despite its age, IPv4 has been remarkably resilient; however, it is beginning to show problems in various areas. Its most visible shortcoming is the growing shortage of IPv4 addresses needed by the growing number of devices connecting to the Internet. Other limitations include quality of service, security, auto-configuration, and mobility aspects. To maintain future competitiveness and to transition to an IPv6-only Internet, web clusters need to support IPv6 at the operating system and application layers.

4.2 Overview of the Challenges

There are many challenges to address in order to achieve high availability and linear scalability for web clusters. The following subsections discuss the challenges in the specified areas, and provide background on the problems we need to address with the HAS architecture.

4.2.1 High Performance

High performance, in terms of high throughput and a maximized number of requests per second, is an important characteristic of web servers. The challenges in this area are to improve response time and provide the maximum possible throughput in terms of KB/s. Section 3.7 establishes the baseline performance of a single processor, ranging between 1,000 and 1,050 requests per second. Our goal with the HAS cluster is to maintain this level of performance and throughput as we scale the number of processors in the cluster.

4.2.2 Continuous Service Availability

In an environment providing critical Internet and web services, it is necessary to provide non-stop operation and eliminate single points of failure. A SPOF exists if a cluster component providing a cluster function fails, leading to service discontinuity. Potential single points of failure in a cluster include the following cluster components: master and traffic nodes, cluster software components, cluster support services, networks and network adapters, and storage nodes and disks. The HAS architecture supports various redundancy capabilities and provides functionalities to eliminate single points of failure. Section 4.7 discusses these capabilities. Furthermore, we need the capability of monitoring the availability of the web server application running on the traffic nodes and ensuring that the application is up and running. Otherwise, the master nodes will forward traffic to a node that has an unresponsive web server application. Section 4.20 discusses this functionality.

4.2.3 Cluster Single Interface

One of the challenges with clusters that consist of multiple nodes is representing the cluster nodes as a single entity to the outside world. For instance, one common scenario is to have thousands of users requesting web pages over the network from a web cluster that consists of multiple web server nodes. This creates the necessity for a cluster single interface that is scalable, to support a high number of connections, and highly available, so that it does not present a bottleneck or a SPOF. As a result, the challenge is the ability to allow a virtually infinite number of clients to reach a virtually infinite number of web servers presented as a single virtual IP address. Section 4.21 addresses this challenge.

4.2.4 Efficient and Dynamic Traffic Distribution

A cluster consists of multiple nodes of heterogeneous hardware receiving thousands of requests per second. The challenge in this area is to distribute the incoming web traffic efficiently among all cluster nodes, taking into consideration the characteristics of each node. Section 4.23 addresses this challenge and discusses the solution adopted by the HAS architecture; a simplified sketch of the idea follows.
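As an illustration of the principle only (not the actual HAS implementation described in Section 4.23), a dispatcher can keep the most recent load report per node and forward each new connection to the least loaded node that has reported recently. The node names, the timeout value, and the load scale below are invented:

import time

class Dispatcher:
    """Illustrative load-aware dispatcher with a dynamic feedback loop."""

    def __init__(self, timeout=5.0):
        self.timeout = timeout
        self.nodes = {}                  # node -> (load_index, last_report)

    def report(self, node, load_index):
        """Called whenever a traffic node reports its current load."""
        self.nodes[node] = (load_index, time.time())

    def pick(self):
        """Choose the least-loaded node among those reporting recently."""
        now = time.time()
        alive = {n: load for n, (load, t) in self.nodes.items()
                 if now - t < self.timeout}   # drop silent/unresponsive nodes
        if not alive:
            raise RuntimeError("no available traffic nodes")
        return min(alive, key=alive.get)

d = Dispatcher()
d.report("node1", 0.35); d.report("node2", 0.80); d.report("node3", 0.10)
print(d.pick())                          # -> node3, the least loaded node

Note how a node that stops reporting its load silently drops out of the eligible set after the timeout, which is the same behavior the HAS traffic manager exhibits toward unresponsive traffic nodes.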
4.2.5 Support for IPv6

The main challenge in this area is to support IPv6 across the architecture and all of its software components, such as the traffic distribution mechanism, the cluster virtual interface, the application servers, and the Ethernet redundancy daemons. In addition, we need to support IPv6 at the kernel level on all cluster nodes, and support IPv6 routing, the domain name system, and name lookup. Section 4.28 presents the support for IPv6 in the HAS architecture.

4.3 The HAS Architecture

The HAS architecture consists of a collection of loosely coupled computing elements, referred to as nodes or processors, that form what appears to users as a single highly available web cluster. There are no shared resources between nodes, with the exception of storage and access to networks. The HAS architecture allows the addition of nodes to the cluster to accommodate increased traffic, without performance degradation, maintaining the level of performance for up to 16 processors (Chapter 5).

Figure 43 illustrates the conceptual model of the architecture, showing the three tiers of the architecture and the software components inside each tier. In addition, it shows the supported redundancy models per tier. For instance, the high availability (HA) tier supports the 1+1 redundancy model (active/active and active/standby) and can be expanded to support the N+M redundancy model, where N nodes are active and M nodes are standby. Similarly, the scalability and service availability (SSA) tier supports the N-way redundancy model, where all traffic nodes are active and servicing requests. This tier can also be expanded to support the N+M redundancy model. Sections 4.9, 4.10, and 4.11 discuss the supported redundancy models.

The HAS architecture is composed of three logical tiers: the high availability (HA) tier, the scalability and service availability (SSA) tier, and the storage tier. This section presents the architecture tiers at a high level.

The high availability tier: This tier consists of front-end systems called master nodes. Master nodes provide an entry point to the cluster, acting as dispatchers, and provide cluster services for all HAS cluster nodes. They forward incoming web traffic to the traffic nodes in the SSA tier according to the scheduling algorithm. Section 4.5.1 covers the characteristics of this tier. Section 4.17.1 presents the characteristics of the master nodes. Section 4.9 discusses the supported redundancy models of the HA tier.

The scalability and service availability tier: This tier consists of traffic nodes that run application servers. In the event that all servers are overloaded, the cluster administrator can add more nodes to this tier to handle the increased workload. As the number of nodes in this tier increases, the cluster throughput increases and the cluster is able to respond to more traffic. Section 4.6.2 describes the characteristics of this tier. Section 4.17.2 presents the characteristics of the traffic nodes. Section 4.10 discusses the supported redundancy models of the SSA tier.

The storage tier: This tier consists of nodes that provide storage services for all cluster nodes, so that web servers share the same set of content. Section 4.6.3 describes the characteristics of this tier, and Section 4.17.3 describes the characteristics of the storage nodes. Section 4.11 presents the supported redundancy models of the storage tier.
The HAS cluster prototype did not utilize specialized storage nodes. Instead, it utilized a contributed extension to NFS to support HA storage.

Figure 43: The HAS architecture. The figure shows the HA tier (an active primary master node and a hot-standby master node following the 1+1 redundancy model, expandable to N+M), the SSA tier (active traffic nodes 1 through N following the N-way redundancy model, expandable to N+M with standby traffic nodes), the optional storage tier (redundant shared storage with mirrored disks), the two cluster communication paths (CCP1 on LAN 1 and CCP2 on LAN 2), and the external components: end users, the management console, and redundant image servers. Legend: CVIP = cluster virtual IP layer; TM = traffic manager; TCD = traffic client daemon; CCM = cluster configuration manager; RCM = redundancy configuration manager; EthD = Ethernet redundancy daemon; RAD = IPv6 router advertisement daemon; HBD = heartbeat mechanism; LDirectord = Linux director daemon; NTP = network time protocol; NFS = network file server; DHCP = dynamic host configuration protocol.

4.4 HAS Architecture Components

Each of the HAS architecture tiers consists of several nodes, and each node runs specific software components. A software component (or system software) is a stand-alone set of code that provides service either to users or to other system software. A component can be internal to the cluster and represent a set of resources contained on the cluster's physical nodes; a component can also be external to the cluster and represent a set of resources that are external to the cluster's physical nodes. Components can be either software or hardware components. Core components are essential to the operation of the cluster. Optional components, on the other hand, are used depending on the usage and deployment model of the HAS cluster. The HAS architecture is flexible and allows administrators of the cluster to add their own software components. The following subsections present the components of the HAS architecture, categorize the components as internal or external, discuss their capabilities, functions, input, output, and interfaces, and describe how they interact with each other.

4.4.1 Architecture Internal Components

The internal components of the architecture include the master nodes, traffic nodes, storage nodes, routers, and local networks. Master nodes are members of the HA tier. They provide a key entry point into the system through the cluster virtual IP interface, and act as a dispatcher, forwarding incoming traffic from the cluster virtual IP interface to the traffic nodes located in the SSA tier. Moreover, master nodes provide cluster services to the cluster nodes, such as DHCP, NTP, TFTP, and NFS. Section 4.17.1 discusses the characteristics of master nodes. Traffic nodes reside in the SSA tier of the HAS architecture. Traffic nodes run application servers, such as a web server, and are responsible for replying to client requests. Section 4.17.2 discusses the characteristics of traffic nodes.
Storage nodes are located in the storage tier and provide HA shared storage using multi-node access to redundant mirrored storage. Storage nodes are optional. If the administrators of the HAS cluster choose not to deploy specialized storage nodes, they can instead use a modified implementation of NFS that provides HA capabilities. The architecture supports two local networks, also called cluster communication paths, to provide connectivity between all cluster nodes. Each of the networks connects to a different router.

4.4.2 Architecture External Components

External components to the cluster represent a set of resources that are external to the cluster's physical nodes. They include the external networks to which the cluster is connected, the management console through which we administer the cluster, and the web users. External networks connect the cluster nodes to the Internet or the outside world. The management console is an external component to the cluster through which the cluster administrator logs in and performs cluster management operations. Web users, also called web clients, are the service requesters. A user can be a human being, an external device, or another computer system. Image servers are optional external components of the HAS architecture. We can use master nodes to provide the functionalities of the image servers (DHCP and TFTP services for booting and installing traffic nodes). However, we recommend dedicating the resources of master nodes to serving incoming traffic.

4.4.3 Architecture Software Components

The architecture consists of several system software components. Some of the components are essential to the operation of the cluster and run by default. However, we expect that across all possible deployments of highly available web clusters, there are trade-offs between available functions, acceptable levels of system complexity, security needs, and administrator preferences. The HAS system software components include: the traffic client daemon (TCD), the cluster virtual IP interface (the routed process), the traffic manager daemon (TM), the IPv6 router advertisement daemon, the DHCP daemon, the NTP daemon, the HA NFS server daemon, the TFTP daemon, the heartbeat service (HBD), the Linux director daemon (LDirectord), the saru module (connection synchronization manager), and the Ethernet redundancy daemon (EthD).

The traffic client daemon is core system software that runs on all traffic nodes. It computes the load_index of a traffic node and reports it to the traffic managers running on the master nodes. The traffic client daemon allows the traffic manager running on the master nodes to provide efficient traffic distribution based on the load of each traffic node. The metrics collected by the traffic client daemon are the processor load and the memory usage. The implementation of the traffic client supports IPv6. Section 4.23.5 discusses the traffic client daemon; a sketch of its reporting loop follows.
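The sketch below illustrates such a reporting loop on Linux. The 50/50 weighting of processor load and memory usage, the message format, and the manager address and port are our own assumptions, not the actual TCD protocol:

import socket, time

MANAGER = ("master-node", 4242)            # placeholder address and port

def load_index():
    """Combine processor load and memory usage into one load index."""
    with open("/proc/loadavg") as f:
        cpu = float(f.read().split()[0])   # 1-minute load average
    mem = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":")
            mem[key] = int(value.split()[0])
    mem_used = 1.0 - mem["MemFree"] / mem["MemTotal"]
    return 0.5 * cpu + 0.5 * mem_used      # illustrative weights only

# Report the load index to the traffic manager once per second over UDP.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)  # AF_INET6 for IPv6
while True:
    sock.sendto(f"{load_index():.3f}".encode(), MANAGER)
    time.sleep(1)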
The cluster virtual IP interface (CVIP) is a core system component that runs on the master nodes in the HA tier of the HAS architecture. The CVIP masks the internals of the HAS architecture and makes it appear as a single server to external users, who are not aware of cluster internals such as how many nodes exist or where the applications run. The CVIP allows a virtually infinite number of clients to reach a virtually infinite number of servers presented as a single virtual IP address, without impact on client or server applications. The CVIP operates at the IP level, enabling applications that run on top of IP to take advantage of the transparency it provides. The CVIP supports IPv6 and is capable of handling incoming IPv6 traffic. Section 4.21 discusses the cluster virtual IP interface.

The traffic manager daemon (TM) is a core system component that runs on the cluster master nodes. The traffic manager receives the load index of the traffic nodes from the traffic client daemons, maintains the list of available traffic nodes and their load indices, and executes the distribution of traffic to the traffic nodes based on the distribution policy defined in its configuration file. The current traffic manager implementation supports round robin and the HAS distribution; however, the traffic manager can support additional policies. The traffic manager supports IPv6. Section 4.23.3 discusses the traffic manager.

The cluster configuration manager (CCM) is system software that manages all the configuration files controlling the operation of the HAS architecture software components. It provides a centralized single access point for editing and managing all the configuration files. For the purpose of this work, we did not implement the cluster configuration manager; however, it is high priority future work. At the time of publication, with the HAS architecture prototype, we maintain the configuration files of the various software components on the network file system.

The redundancy configuration manager (RCM) is responsible for switching the redundancy configuration of each cluster tier from one redundancy configuration to another, such as from 1+1 active/standby to 1+1 active/active. It is also responsible for switching service between components when the cluster tiers follow the N+M redundancy model. Therefore, it has to be aware of the active nodes in the cluster and their corresponding standby nodes. For the purpose of this work, we did not implement the redundancy configuration manager. Section 6.2.4 discusses the RCM as a future work item.

The IPv6 router advertisement daemon is an optional system software component that is used only when the HAS architecture needs to support IPv6. It offers automatic IPv6 configuration of network interfaces for all cluster server nodes. It ensures that all the HAS cluster nodes can communicate with each other and with network elements outside the HAS architecture over IPv6.

The cluster administrator has the option of using the DHCP daemon (an optional server service) to configure IPv4 addresses on the cluster server nodes. The NTP service is a required system service used to synchronize the time on all cluster server nodes. It is essential to the operation of other software components that rely on time stamps to verify whether a node is in service. Alternatively, we could use a time synchronization service provided by an external server located on the Internet; however, this poses security risks and we do not recommend it.

As for storage, we provide an enhanced implementation of the Linux kernel NFS server that supports NFS redundancy and eliminates the NFS server as a SPOF. Section 4.16 discusses the storage models and the various available possibilities.

The TFTP service daemon is an optional software component used in collaboration with the DHCP service to provide the functionalities of an image server. The image server provides an initial kernel and ramdisk image for diskless server nodes within the HAS system. The TFTP daemon supports IPv6 and is capable of receiving requests to download kernel and ramdisk images over IPv6.
The TFTP daemon supports IPv6 and is capable of receiving requests to download kernel and ramdisk images over IPv6.

The heartbeat service (HBD) runs on master nodes and sends heartbeat packets across the network to the other instances of heartbeat (running on other master nodes) as keep-alive messages. When the standby master node no longer receives heartbeat packets, it assumes that the active master node is dead and takes over the active role. The heartbeat mechanism is a contribution from the Linux-HA project [20]. We have contributed enhancements to the heartbeat service to accommodate the HAS architecture requirements. Section 4.19 discusses the heartbeat service and its integration with the HAS architecture.

The Linux director daemon (LDirectord) is responsible for monitoring the availability of the web server application running on the traffic nodes by connecting to them, making an HTTP request, and checking the result. If the LDirectord module discovers that the web server application is not available on a traffic node, it communicates with the traffic manager to ensure that the traffic manager does not forward traffic to that specific traffic node. Section 4.20 presents the functionalities of the LDirectord.
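The following sketch captures the essence of such an application-level check: open a TCP connection to the web server on a traffic node, issue a minimal HTTP request, and treat anything other than a 200 status as a failure. The probed address, the port, and the request path are illustrative; the actual LDirectord check and its reporting to the traffic manager are driven by its configuration file.

/* A minimal sketch of an LDirectord-style HTTP health check. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Returns 1 if the web server at the given address answers with 200 OK. */
static int http_alive(const char *addr)
{
    const char *req = "GET / HTTP/1.0\r\nConnection: close\r\n\r\n";
    char buf[256];
    int ok = 0;
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in srv = { .sin_family = AF_INET,
                               .sin_port = htons(80) };
    inet_pton(AF_INET, addr, &srv.sin_addr);

    if (connect(sock, (struct sockaddr *)&srv, sizeof(srv)) == 0 &&
        send(sock, req, strlen(req), 0) > 0) {
        ssize_t n = recv(sock, buf, sizeof(buf) - 1, 0);
        if (n > 0) {
            buf[n] = '\0';
            /* e.g. "HTTP/1.1 200 OK": any 200 status counts as healthy. */
            ok = strstr(buf, " 200 ") != NULL;
        }
    }
    close(sock);
    return ok;
}

int main(void)
{
    /* Hypothetical traffic node address; on failure, the real daemon
     * notifies the traffic manager to stop forwarding traffic. */
    const char *node = "10.0.0.11";
    printf("%s is %s\n", node, http_alive(node) ? "healthy" : "down");
    return 0;
}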
4.5 HAS Architecture Tiers

The following subsections describe the three HAS architecture tiers illustrated in Figure 43: the high availability tier, the scalability and service availability tier, and the storage tier.

4.5.1 The High Availability Tier

The HA tier consists of master nodes that act as dispatchers for the SSA tier; the role of a master node is similar to that of a connection manager or dispatcher. The HA tier does not tolerate service downtime. If the master nodes are not available, the traffic nodes in the SSA tier become unreachable and, as a result, the HAS cluster cannot accept incoming traffic. The primary functions of the nodes in this tier are to handle incoming traffic and distribute it to the traffic nodes located in the SSA tier, and to provide cluster infrastructure services to all cluster nodes. The HA tier consists of two nodes configured following the 1+1 active/standby redundancy model. The architecture supports extending the redundancy model of this tier to the 1+1 active/active redundancy model, in which the master nodes share servicing incoming traffic to avoid bottlenecks at the HA tier level. Another possible extension is the support of the N-way and N+M redundancy models, which allow the HA tier to scale the number of master nodes one at a time; however, this requires a complex implementation and is not yet supported. Section 4.8 describes the supported redundancy models. Furthermore, the master nodes provide cluster-wide services to nodes located in the SSA and storage tiers. The HA tier controls the activity in the SSA tier, since it forwards incoming requests to the traffic nodes. Therefore, the HA tier needs to determine the status of traffic nodes and be able to reliably communicate with each traffic node. The HA tier uses traffic managers to receive load information from traffic nodes. Section 4.6.1 presents the characteristics of the HA tier. Section 4.8 discusses the redundancy models supported by this tier.

4.5.2 The Scalability and Service Availability Tier

The scalability and service availability (SSA) tier consists of traffic nodes that run web server applications. The main task of the traffic nodes is to receive, process, and reply to incoming requests. This tier provides a redundant application execution cluster. In the event that the traffic manager daemon running on the master nodes ceases to receive load notifications from the traffic client daemon on a traffic node, the traffic manager stops forwarding traffic to that specific traffic node. If the traffic node becomes unresponsive (after a timeout limit defined in the configuration file), the traffic manager declares the traffic node unavailable and stops forwarding traffic to it. Section 4.27.9 discusses this use case and presents its sequence diagram. The supported redundancy model is the N-way model: all nodes are active and there are no standby nodes. As such, redundancy is at the node level. Section 4.6.2 presents the characteristics of the SSA tier. Section 4.10 discusses the redundancy models supported by this tier.

4.5.3 The Storage Tier

The storage tier consists of specialized nodes that provide shared storage for the cluster. This tier differs from the other two tiers in that it is optional: it is not required if the master nodes provide shared storage through a distributed file system. Since storage is not the focus of this dissertation, we do not explore this area in depth. Instead, our interest is limited to providing highly available storage and access to storage, independently of the storage technique. For instance, to avoid single points of failure, we provide redundant access paths from the cluster nodes to the shared storage. Our efforts in this area include the contribution of an HA extension to the network file system (NFS) server. In the modified version of NFS, the file system supports two redundant servers, and the failure of one server is transparent to users. Master nodes can provide the storage service using the highly available NFS implementation, which requires a modified implementation of the mount program. Cluster nodes use the modified mount program to simultaneously mount two network file servers at the same mount point. Section 4.6.3 presents the characteristics of the storage tier. Section 4.11 presents the redundancy models supported by this tier.

4.6 Characteristics of the HAS Cluster Architecture

The HAS architecture is the combination of the HA, SSA, and storage tiers. The architecture inherits the characteristics of all tiers, discussed in Sections 4.6.1, 4.6.2, and 4.6.3. The HAS architecture does not require specialized hardware or commercial software components; instead, the architecture is implementable using COTS hardware and open source software components. The following subsections discuss the characteristics of each architecture tier.

4.6.1 Characteristics of the HA Tier

The HA tier consists of two master nodes that provide cluster services to traffic nodes and direct incoming requests to the SSA tier. The HA tier has the following characteristics:

- No single point of failure: If a master node fails and becomes unavailable, the standby master node takes over in a transparent fashion. Section 4.27.8 presents the sequence diagram of this scenario. The HA tier provides node-level redundancy. All software components are redundant and available on both master nodes. If one master node fails, the other master node handles incoming traffic and provides cluster services to cluster nodes. Section 4.8 discusses the redundancy model of the HA tier.
- Sensible repair and replacement model: The architecture allows the upgrade or replacement of failed components without affecting service availability. For instance, if the active master node requires a processor upgrade, the administrator gracefully switches control from the active master node to the standby node; the active master becomes standby and the standby master becomes active. The cluster continues to provide services without associated downtime and without performance penalties, with the exception that load sharing among master nodes (1+1 active/active) is not possible since only one master node is available. With such a model, upgrades do not require cluster or service downtime, as is the case with SMP or MPP systems. Instead, only the affected node is involved, and the service continues to be available to end users.

- Shared storage: Master nodes require private disk storage to maintain the configuration files for the services they provide to all cluster nodes. Application data, on the other hand, is stored on shared storage. If a traffic node fails, the application data is still available to the surviving traffic nodes. Therefore, all application data is available exclusively on redundant highly available storage and is not dependent on the serving nodes.

- Master node failure detection: The heartbeat mechanism running on master nodes ensures that the standby master node detects the failure of the active master node within a delay of 200 ms. Section 4.19 describes how the architecture handles the failure of a master node.

- Number of nodes in the high availability tier: The current implementation of the HA tier supports two master nodes. In this model, the standby node is available to allow a quick transition when the active node fails, or for load-sharing purposes. We can add more master nodes to provide additional protection against multiple failures. The HA tier can scale from two to N master nodes. Although the addition of nodes would provide additional protection and higher capacity, it introduces significant complexity to the implementation of the system software. The current HAS architecture prototype supports the 1+1 redundancy model in both active/standby and active/active modes. Section 4.8 describes the supported redundancy models.

- Cluster Virtual IP Interface (CVIP): The cluster virtual IP interface is a transparent layer that presents the cluster as a single entity to the outside world, whether to users of the cluster or to other systems. Users of the system are not aware that the web system consists of multiple nodes, and they do not know where the applications run. Section 4.21 describes the CVIP interface.

4.6.2 Characteristics of the SSA Tier

The SSA tier consists of multiple independent traffic nodes, each running a copy of the Apache web server software (version 2.0.35) and servicing incoming requests. Theoretically, there are no limitations on the number of traffic nodes in this tier: the number of nodes can be N, where N ≥ 2 to ensure that traffic nodes do not constitute a SPOF. The HAS architecture prototype consisted of 2 master nodes and 16 traffic nodes. Redundancy in the SSA tier is at the node level. Requests coming from the HA tier are assigned to traffic nodes based on the traffic distribution algorithm (Section 4.23). Traffic nodes reply directly to the web clients, eliminating a possible bottleneck at the master nodes.
The characteristics of the SSA tier are the following:

- Node availability: Single node availability is less important in this tier, since there are N traffic nodes available to serve requests. If one traffic node becomes unavailable, the master node removes the failed traffic node from its list of available traffic nodes and directs traffic to the available traffic nodes. Section 4.27.9 presents the scenario of a failing traffic node.

- Traffic node failure detection: In the same way the heartbeat mechanism checks the health of the master nodes in the HA tier, the HAS architecture supports mechanisms to verify that a traffic node is up and running and that the web server application is providing service to web clients. We use two methods simultaneously to ensure that traffic nodes are healthy and that the web server application is up and running. The first check is a continuous communication between the traffic manager running on the master node and the traffic client running on the traffic node (Section 4.23). The second check is an application check that ensures that the web server application is up and running (Section 4.20).

- Support for diskless nodes: The SSA tier supports diskless traffic nodes. Traffic nodes are not required to have local disks; they can rely on image servers for booting and for downloading a kernel and disk image into their memory.

- Seamless upgrades without service interruption: With the HAS architecture, it is possible to upgrade the operating system and the application software without disturbing service availability. We provide the mechanisms to upgrade the kernel and application automatically through an image server, by rebooting the nodes. Upon reboot, the traffic node downloads an updated kernel and the new version of the ramdisk (if the node is diskless) or a new disk image (if the node has a disk) from the image server. Section 4.27.5 presents this upgrade scenario with its sequence diagram.

- Hosting application servers: Traffic nodes run application servers that can be stateful or stateless. If an application requires state information, the application saves the state information on the shared storage and makes it available to all cluster nodes.

4.6.3 Characteristics of the Storage Tier

Data-oriented applications require that storage and access to storage be available at all times. To ensure reliability and high availability, the architecture does not allow application data to be stored on local node storage devices. Application servers maintain their persistent data on the shared storage nodes located in the storage tier, or on shared storage managed by master nodes through a redundant network file server (Section 4.16.2). Since storage techniques are outside the scope of the thesis, we do not provide full coverage of this area. However, the architecture is flexible enough to support multiple ways of providing storage, through specialized storage nodes, software RAID, and distributed file systems. The supported storage models include:

- Specialized storage nodes use techniques such as storage area networks or network-attached storage to provide highly available shared storage over a network to a large number of users.

- RAID comprises established techniques to achieve data reliability through redundant copies of the data. Since we are using COTS components, software RAID is our choice for providing redundant and reliable data storage.
In addition, software RAID does not require specialized hardware; therefore, we can use it with master nodes when the master nodes provide shared storage through NFS.

- Distributed file systems allow us to combine all the disk storage on multiple cluster nodes under one virtual file system.

4.7 Availability and Single Points of Failure

The architecture provides facilities to enable fast error detection and provides correction procedures to recover when possible. One of our goals is to increase service availability. Availability is a measure of the time that a server is functioning normally. We calculate availability as

A = (MTBF / (MTBF + MTTR)) * 100,

where A is the percentage of availability, MTBF is the mean time between failures, and MTTR is the mean time to repair or resolve a particular problem. According to the formula, we calculate availability A as the percentage of uptime for a given period, taking into account the time the system requires to recover from unplanned failures and planned upgrades. For example, an MTBF of 10,000 hours combined with an MTTR of 1 hour yields A = (10,000 / 10,001) * 100, or approximately 99.99 percent availability. As MTTR approaches zero, the availability percentage A increases towards 100 percent. As the MTBF value increases, MTTR has less impact on A. Following this formula, there are two possible ways to increase availability: increasing MTBF and decreasing MTTR. Increasing MTBF involves improving the quality or robustness of the software and using redundancy to remove single points of failure. As for decreasing MTTR, our focus in the implementation of the system software is to streamline and accelerate fail-over, respond quickly to fault conditions, and make faults more granular in time and scope, allowing many short faults rather than a smaller number of long ones and limiting the scope of faults to smaller components. To increase MTBF across the HAS architecture components, we need to avoid a SPOF. The following subsections discuss eliminating SPOFs at the level of master nodes, traffic nodes, application servers, networks and network interfaces, and storage nodes.

The HAS architecture supports fault tolerance through features such as hot-standby data replication to enable node failure recovery, storage mirroring to enable disk fault recovery, and LAN redundancy to enable network failure recovery. The topology of the architecture enables failure tolerance because of the various built-in redundancies within all layers of the HAS architecture. Figure 44 illustrates the supported redundancy at the different layers of the HAS architecture. The cluster virtual IP interface (1) provides a transparent layer that hides the internals of the cluster. We can add or remove master nodes from the cluster without interruption to the services (2). Each cluster node has two connections to the network (3), ensuring network connectivity. Many factors contribute towards achieving network and connection availability, such as the availability of redundant routers and switches, redundant network connections, and redundant Ethernet cards. We contributed an Ethernet redundancy mechanism to ensure high availability for network connections. As for traffic nodes (4), redundancy is at the node level, allowing us to add and remove traffic nodes transparently and without service interruption. We can guarantee service availability by providing multiple instances of the application running on multiple redundant traffic nodes.
The HAS architecture supports storage redundancy (5) through a customized HA implementation of the NFS server; alternatively, we can use redundant specialized storage nodes.

Figure 44: Built-in redundancy at the different layers of the HAS architecture ((1) cluster virtual IP interface to the outside world; (2) redundancy of the nodes providing cluster services; (3) network redundancy; (4) traffic node redundancy; (5) storage redundancy, including NFS redundancy and RAID 5)

The following subsections discuss eliminating SPOFs at each of the HAS architecture layers.

4.7.1 Eliminating Master Nodes as a Single Point of Failure

The HA tier of the HAS architecture consists of two master nodes that follow the 1+1 redundancy model. In the 1+1 active/standby redundancy model (Section 4.27.1.1), if the active master node is unavailable and does not respond to the heartbeat messages, the standby master node declares it dead and assumes the active state. Section 4.27.8 discusses this scenario. If the master nodes follow the 1+1 active/active redundancy model (Section 4.27.1.2), failures in one master node are transparent to end users and do not affect service availability.

4.7.2 Eliminating Applications as a Single Point of Failure

The primary goal of the HAS architecture is to provide a highly available environment for web server applications. Each traffic node runs a copy of the web server. These applications represent a SPOF, since in the event that the application crashes, the service on that traffic node becomes unavailable. To ensure the availability of these applications, the HAS architecture prototype supports two mechanisms. The first mechanism is a health check between the traffic manager and the traffic client. Section 4.23.5 discusses this health check mechanism, and Section 4.27.11 presents the sequence diagram of this scenario. The second mechanism is an application health check to ensure that the application is up and running. If the application is not available, the health check mechanism notifies the traffic manager, which removes the traffic node from its list of available nodes; as a result, the traffic manager stops forwarding traffic to it. Sections 4.27.9 and 4.27.12 present the sequence diagrams of this scenario.

4.7.3 Eliminating Network Adapters as a Single Point of Failure

The Ethernet redundancy daemon, a system software component, handles network adapter failures. Figure 45 illustrates the Ethernet adapter swapping process. When a network adapter (Ethernet card) fails, the Ethernet redundancy daemon swaps the roles of the active and standby adapters on that node. The failure of the active adapter is transparent, with a delay of less than 350 ms while the system switches to the standby Ethernet adapter.

Figure 45: The process of the network adapter swap (the active Ethernet adapter provides the connection to the network, while the standby adapter is hidden from applications and known only to the Ethernet redundancy daemon; when the active adapter fails, the daemon designates the former standby adapter as the new active adapter)
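As an illustration of the swap, the sketch below polls the carrier state of the active adapter and, when the link is lost, moves the node's address to the standby adapter and exchanges the two roles. The interface names, the address, the use of /sys/class/net and ifconfig, and the 100 ms polling period are assumptions made for the sake of the example; the actual daemon implements its own detection and swap logic.

/* A minimal sketch of the adapter-swap idea behind EthD (illustrative only). */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Returns 1 when the interface reports an active link. */
static int carrier_up(const char *ifname)
{
    char path[80];
    int state;
    FILE *f;
    snprintf(path, sizeof(path), "/sys/class/net/%s/carrier", ifname);
    f = fopen(path, "r");
    if (!f) return 0;
    state = fgetc(f);
    fclose(f);
    return state == '1';
}

int main(void)
{
    const char *active = "eth0", *standby = "eth1";  /* assumed names */
    char cmd[128];

    for (;;) {
        if (!carrier_up(active)) {
            /* Swap roles: move the node address to the standby adapter. */
            snprintf(cmd, sizeof(cmd), "ifconfig %s down", active);
            system(cmd);
            snprintf(cmd, sizeof(cmd),
                     "ifconfig %s 10.0.0.11 netmask 255.255.255.0 up",
                     standby);          /* 10.0.0.11 is a placeholder */
            system(cmd);
            const char *tmp = active; active = standby; standby = tmp;
        }
        usleep(100 * 1000);  /* poll every 100 ms, within the 350 ms budget */
    }
}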
4.7.4 Eliminating Storage as a Single Point of Failure

Shared storage is a possible SPOF in the cluster if it relies on a single storage server. Section 4.16 discusses the physical storage model, which covers eliminating storage as a SPOF using both a custom implementation of a highly available network file system (Section 4.16.2) and redundant specialized storage nodes (Section 4.16.3).

4.8 Overview of Redundancy Models

The following subsections examine the various redundancy models: the 1+1 two-node redundancy model, the N+M and N-way redundancy models, and the no-redundancy model.

4.8.1 The 1+1 Redundancy Model

There are two types of the 1+1 redundancy model (also called the two-node redundancy model): the active/standby, also called the asymmetric model, and the active/active, or symmetric, redundancy model [115]. With the 1+1 active/standby redundancy model, one cluster node is active performing critical work, while the other node is a dedicated standby, ready to take over should the active master node fail. In the 1+1 active/active redundancy model, both nodes are active and doing critical work. Should either node fail, the surviving node steps in to service the load of the failed node until the failed node is back in service.

4.8.2 The N+M Redundancy Model

In the N+M redundancy model, the cluster tier supports N active nodes and M standby nodes. If an active node fails, a standby node takes over the active role. If M=1, the model is the N+1 redundancy model, where the "+1" node is a standby, ready to become active when one of the active nodes fails. The N+1 redundancy model is applicable to services that do not require a high level of availability, because an N+1 cluster cannot provide a standby node for every active node. N+1 cannot perform hot standby because the standby node never knows which active node will fail. One advantage of N+M over N+1 is that N+M (with M ≥ 2) provides higher availability should more than one node fail, without affecting the performance and throughput of the cluster.

4.8.3 The N-way Redundancy Model

In the N-way redundancy model, all N nodes are active. Redundancy is at the node level.

4.8.4 The "No Redundancy" Model

The "no redundancy" model is used with non-critical services, when the failure of a component does not cause a severe impact on the overall system. Following this model, an active component does not have a standby ready to take over if the active component fails. We list this model for completeness.

4.9 HA Tier Redundancy Models

The HA tier supports three redundancy models: the 1+1 active/standby, the 1+1 active/active, and the N-way redundancy model. The current prototype of the HA tier supports the 1+1 redundancy model in both configurations: active/standby and active/active. Support for the N+M and N-way redundancy models is a future work item.

4.9.1 The HA Tier Active/Standby Redundancy Model

Figure 46 illustrates the 1+1 active/standby redundancy model supported by the HA tier, which consists of two master nodes, one active and the second standby. The active master node hosts active processes, and the standby node hosts standby processes that are ready to take over when the primary node fails. The active/standby model enables fast recovery upon the failure of the active node.
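The standby side of this model reduces to a simple rule: as long as heartbeat packets keep arriving, stay standby; if none arrive within the detection window, declare the peer dead and assume the active role. The sketch below illustrates the rule with a UDP socket and the 200 ms detection delay mentioned earlier; the real heartbeat service exchanges richer messages and performs the actual takeover of the cluster virtual IP interface.

/* A minimal sketch of the standby side of the heartbeat protocol. */
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <poll.h>

#define HB_PORT  694    /* conventional heartbeat port */
#define DEADTIME 200    /* ms without heartbeats before failover */

int main(void)
{
    char buf[128];
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_addr.s_addr = htonl(INADDR_ANY),
                                .sin_port = htons(HB_PORT) };
    bind(sock, (struct sockaddr *)&addr, sizeof(addr));

    for (;;) {
        struct pollfd pfd = { .fd = sock, .events = POLLIN };
        if (poll(&pfd, 1, DEADTIME) > 0) {
            recv(sock, buf, sizeof(buf), 0);   /* peer is alive */
        } else {
            /* No heartbeat within DEADTIME: assume the active role.
             * Here the real daemon acquires the cluster virtual IP
             * and promotes the local services. */
            printf("active master presumed dead; taking over\n");
            break;
        }
    }
    return 0;
}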
Figure 46: The 1+1 active/standby redundancy model

Figure 47: Illustration of the failure of the active node

Figure 47 illustrates the 1+1 active/standby pair after the failover has completed. The active/standby redundancy model supports connection synchronization between the two master nodes. Section 4.22 discusses connection synchronization. The HA tier can transition from the 1+1 active/standby to the 1+1 active/active redundancy model through the redundancy configuration manager, which is responsible for switching from one redundancy model to another. The 1+1 active/standby redundancy model provides high availability; however, it requires a master node to sit idle waiting for the active node to fail so it can take over. The active/standby model leads to a waste of resources and limits the capacity of the HA tier. The 1+1 active/active redundancy model, discussed in the following section, addresses this problem by allowing the two master nodes to be active and to serve incoming requests for the same virtual service.

4.9.2 The HA Tier Active/Active Redundancy Model

The active/standby redundancy model offers one way of providing high availability. However, we would like to take advantage of the standby node and not have it sit idle. By moving to the active/active redundancy model, both nodes in this tier are active, resulting in increased cluster throughput. The active/active model allows two nodes to load balance the incoming connections for the same virtual service at the same time. It provides a solution where both master nodes service incoming traffic and distribute it to the traffic nodes; as a result, we increase the capacity of the whole cluster while still maintaining the high availability of the master nodes. Figure 48 illustrates the 1+1 active/active redundancy model, which supports load sharing between the two master nodes in addition to concurrent access to the shared storage. If the HAS cluster experiences additional traffic, we can extend the number of nodes in this tier and support the N-way redundancy model, where all nodes in the tier are active and provide service. However, increasing the number of nodes beyond two increases the complexity of the implementation. The proposed method of providing active/active in the HA tier of a HAS cluster is to have all master nodes configured with the same Ethernet hardware address (MAC) and IP address. Originally, the support of active/active was not planned as part of the HAS architecture prototype; instead, the HAS prototype supported active/standby, and the benchmarking tests (Chapter 5) were performed on the HAS prototype with the HA tier configured in the active/standby redundancy model. The support for the active/active redundancy model came at a much later stage.
Figure 48: The 1+1 active/active redundancy model

4.9.2.1 The Role of the Saru Module

When the HA tier is in the active/active redundancy model, each master node has the same hardware address (MAC) and IP address, and the challenge becomes how to distribute incoming connections between the two active master nodes. For that purpose, we have enhanced the saru module [116], an open source system software component, to run in coordination with heartbeat on each of the master nodes and to be responsible for dividing the incoming connections between the two master nodes. The heartbeat daemon provides a mechanism to determine which master nodes are available, and the saru module uses this information to divide the space of all possible incoming connections between the two available master nodes. If each saru module divided the blocks in the result space independently, an inconsistent state could arise between the master nodes, where some blocks are allocated to more than one master node or some blocks are not allocated at all. Therefore, one approach to solving this problem is to have a saru master node and a saru slave node. When the master nodes boot, the saru module starts and elects a master node to perform the allocations. The elected master node divides blocks of source or destination ports or addresses between the two active master nodes. The notion of a master and a slave node is only an internal convenience for the saru module; both master nodes are active, and both continue to service incoming requests and distribute them to the traffic nodes.
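A minimal sketch of the block-division idea follows: hash each connection's source address and port into a fixed number of blocks, and let a master node accept only the connections that fall into blocks it owns. The block count, the hash, and the even/odd allocation are illustrative; in the saru module, the elected master performs the actual allocation and communicates it to its peer.

/* A minimal sketch of connection splitting between two active masters. */
#include <stdint.h>
#include <stdio.h>

#define NUM_BLOCKS 64

/* my_blocks[b] is 1 when this master node owns block b; the elected
 * saru master divides the blocks between the two active nodes. */
static uint8_t my_blocks[NUM_BLOCKS];

static int block_of(uint32_t saddr, uint16_t sport)
{
    return (saddr ^ sport) % NUM_BLOCKS;
}

/* Both masters see every packet (same MAC and IP); only the owner of
 * the block handles the connection, its peer silently drops it. */
static int accept_connection(uint32_t saddr, uint16_t sport)
{
    return my_blocks[block_of(saddr, sport)];
}

int main(void)
{
    /* Example allocation: this node takes the even blocks and its
     * peer takes the odd ones. */
    for (int b = 0; b < NUM_BLOCKS; b++)
        my_blocks[b] = (b % 2 == 0);

    uint32_t saddr = 0xC0A80042;   /* 192.168.0.66 */
    uint16_t sport = 40001;
    printf("block %d -> %s here\n", block_of(saddr, sport),
           accept_connection(saddr, sport) ? "accepted" : "dropped");
    return 0;
}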
4.10 SSA Tier Redundancy Models

The SSA tier supports the N+M and the N-way redundancy models. In the N+M model, N is the number of active traffic nodes hosting the active web server application, and M is the number of standby traffic nodes. When M=0, it is the N-way redundancy model, where all traffic nodes are active. Following the N-way redundancy model, traffic nodes operate without standby nodes. Upon the failure of an active traffic node, the traffic manager running on the master node removes the failed traffic node from its list of available traffic nodes (Section 4.27.9) and redirects traffic to the available traffic nodes. Figure 49 illustrates the N+M and the N-way redundancy models. The state information of the web server application running on the active nodes is saved (1) on the HA shared storage (2). When the application running on the active node fails, the application on the standby node accesses the saved state information on the shared storage (3) and provides continuous service.

Figure 49: The N+M and N-way redundancy models ((1) state information is written to shared storage; (2) state information is available on shared storage; (3) traffic nodes have access to the state information)

In Figure 50, the state information of the application running on the active node is saved (1) on the HA shared storage.

Figure 50: The N+M redundancy model with support for state replication

In Figure 51, when the application running on the active node fails, the application on the standby node accesses the saved state information on the shared storage and provides continuous service.

Figure 51: The N+M redundancy model, after the failure of an active node

Supporting applications that require maintaining state information is not within the scope of this dissertation. However, the WebBench tool provides dynamic test suites. Therefore, in our testing, we used a combination of static (STATIC.TST) and dynamic (WBSSL.TST) test suites. The static test suite contains over 6,200 static pages and executable test files. The dynamic test suites use applications that run on the server and require maintaining state. Although supporting applications with state is not in the scope of the work, the HAS architecture still handles them.

4.11 Storage Tier Redundancy Models

Although storage is outside the scope of our work, the redundancy models of the storage tier depend on the physical storage model described in Section 4.16.

4.12 Redundancy Model Choices

Figure 52 identifies the various redundancy models supported for master, traffic, and storage nodes.

Figure 52: The redundancy models at the physical level of the HAS architecture (master nodes: m = 2, 1+1 redundancy model, active/standby or active/active; traffic nodes: t ≥ 2, redundancy at the node level, N active or N active and M standby, where N ≥ 2 and M ≥ 0; storage: either s = 2 redundant specialized nodes or the contributed HA NFS implementation; local networks: l = 2; routers: r = 2)

Master nodes follow the 1+1 redundancy model. The HA tier hosts two master nodes that interact with each other following the active/standby model or the active/active (load sharing) model. Traffic nodes follow one of two redundancy models: N+M (N active and M standby) or N-way (all nodes are active). In the N+M active/standby redundancy model, N is the number of active traffic nodes available to service requests; we need at least two active traffic nodes, N ≥ 2. M is the number of standby traffic nodes, available to replace an active traffic node as soon as it becomes unavailable. The N-way redundancy model is the N+M redundancy model with M = 0. In the N-way redundancy model, all traffic nodes are in the active mode and servicing requests, with no standby traffic nodes. When a traffic node becomes unavailable, the traffic manager stops sending traffic to the unavailable node and redistributes incoming traffic among the remaining available traffic nodes.
However, when standby nodes are available, the throughput of the cluster does not suffer from the loss of a traffic node, since a standby node takes over for the unavailable traffic node. As for the storage tier, the redundancy model depends on various possibilities, ranging from hosting data on the master nodes to having separate and redundant nodes that are responsible for providing storage to the cluster. The redundancy configuration manager is responsible for switching from one redundancy model to another. For the purpose of this work, we did not implement the redundancy configuration manager (Section 6.2.4); rather, we relied on restarting the cluster nodes with modified configuration files when we wanted to experiment with a different redundancy model.

Table 8 provides a summary of the possible redundancy models. The HAS architecture allows the support of all redundancy models; supporting them is an implementation issue.

Tier         | 1+1 Active/Standby | 1+1 Active/Active | N+M | N-way (N+M, with M=0) | No redundancy
HA tier      | X                  | X                 | X   | X                     | X (one master node)
SSA tier     | X                  | X                 | X   | X                     | X (one traffic node)
Storage tier | X                  | X                 | X   | X                     | X (one storage node)

Table 8: The possible redundancy models per each tier of the HAS architecture

Table 9 illustrates the implemented redundancy models of the HAS architecture prototype. At the HA tier, both the 1+1 active/standby and the 1+1 active/active redundancy models are supported. At the SSA tier, the HAS architecture supports the N-way redundancy model. The storage tier supports the 1+1 active/active redundancy model.

Tier         | 1+1 Active/Standby | 1+1 Active/Active | N+M | N-way | No redundancy
HA tier      | X                  | X                 | -   | -     | X (one master node)
SSA tier     | -                  | -                 | -   | X     | X (one traffic node)
Storage tier | -                  | X                 | -   | -     | X (one NFS server)

Table 9: The supported redundancy models per each tier in the HAS architecture prototype

4.13 The States of a HAS Cluster Node

A state transition diagram illustrates how a cluster node transitions from one defined state to another as a result of certain events. We represent the states of the nodes as circles, and we label the transitions between the states with the events or failures that changed the state of the node from one state to another. A node starts in an initial state, represented by the closed circle, and can end up in a final state, represented by the bordered circle. A cluster node can be in one of four HA states: active, standby, in-transition, and out-of-cluster. We identify two types of error conditions: recoverable and unrecoverable errors. Recoverable errors, such as the failure of an Ethernet adapter, do not qualify as error conditions that require the node to change state, because such a failure is recoverable. However, other failures, such as a kernel crash, are unrecoverable and as a result require a change of state. Figure 53 presents the HA states of a node, discarding the standby state for simplicity purposes. We assume that the initial state of a node is active (1). When in the active state, the node is ready to service incoming requests or provide cluster services. When the active node faces a hardware or software problem forcing it to stop service, it transitions (2)(3) to the out-of-cluster state. In this state, the node is not a member of the cluster; it does not receive traffic (traffic node) or provide cluster services to other nodes (master node). When the problem is fixed, the node transitions back (4)(5) and becomes active again. A node is in-transition when it is changing state.
If the node is changing to an active state (5), then after the completion of the transition, the node starts receiving incoming requests.

Figure 53: The state diagram of a HAS cluster node

Figure 54 represents the state diagram after we expand it to include the standby state, in which the node is not currently providing service but is prepared to take over the active state. This scenario is only applicable to nodes in the HA tier, which supports the 1+1 active/standby redundancy model. When the node is in the active, in-transition, or standby state and it encounters software or hardware problems, it becomes unstable and will no longer be a member of the HAS cluster. Its state becomes out-of-cluster, and it is no longer available to service traffic. If the transition is from active to standby, the node stops receiving new requests and providing services, but keeps providing service to ongoing requests until their termination, when possible; otherwise, ongoing requests are terminated. The system software components that manage the state transitions are the traffic manager and the heartbeat daemon running on the master nodes, and the traffic client and the LDirectord running on the traffic nodes.

Figure 54: The state diagram including the standby state
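The states and transitions above can be summarized in a few lines of code. The sketch below encodes the four HA states and the two rules that matter: an unrecoverable error sends any state to out-of-cluster, and every entry into the active or standby state passes through the in-transition state. The function names and the driver in main are illustrative.

/* A minimal sketch of the HA states of a HAS cluster node. */
#include <stdio.h>

enum ha_state { ACTIVE, STANDBY, IN_TRANSITION, OUT_OF_CLUSTER };

struct node {
    enum ha_state state;
    enum ha_state target;   /* meaningful while IN_TRANSITION */
};

/* An unrecoverable error (e.g. a kernel crash) removes the node from
 * the cluster regardless of its current state. */
static void on_error(struct node *n) { n->state = OUT_OF_CLUSTER; }

/* Begin a transition toward a new state; the node stops accepting new
 * requests while it changes state. */
static void begin_transition(struct node *n, enum ha_state target)
{
    n->state = IN_TRANSITION;
    n->target = target;
}

/* Complete the transition; a node entering ACTIVE starts receiving
 * incoming requests again. */
static void end_transition(struct node *n) { n->state = n->target; }

int main(void)
{
    struct node n = { .state = OUT_OF_CLUSTER };
    begin_transition(&n, ACTIVE);   /* problem fixed, rejoining cluster */
    end_transition(&n);
    on_error(&n);                   /* kernel crash: out-of-cluster again */
    printf("final state: %d (3 = out-of-cluster)\n", n.state);
    return 0;
}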
4.14 Example Deployment of a HAS Cluster

Figure 55 illustrates an example prototype of the HAS architecture using a distributed file system to provide shared storage among all cluster nodes. In this example, two master nodes (A and B) provide storage using the contributed implementation of the highly available NFS (Section 4.16.2). The HA tier consists of two master nodes in the 1+1 active/standby redundancy model: Node A is the active node and Node B is the standby node. The SSA tier consists of four traffic nodes, each with its own local storage. The SSA tier follows the N-way redundancy model, where all traffic nodes are active. There are two redundant local area networks, LAN 1 and LAN 2, connected to two redundant routers. The minimal prototype of the HAS architecture requires two master nodes and two traffic nodes. In the event that the incoming web traffic increases, we add more traffic nodes to the SSA tier. If we scale the number of traffic nodes, no other support nodes are required, and no changes are required for either the master nodes or the current traffic nodes.

Figure 55: A HAS cluster using the HA NFS implementation

Figure 56 illustrates the hardware configuration of an HA-OSCAR cluster [117]. The system consists of a primary server, a standby server, two LAN connections, and multiple compute clients, where all the compute clients have homogeneous hardware.

Figure 56: The HA-OSCAR prototype with dual active/standby head nodes

A server is responsible for serving users' requests and distributing the requests to specified clients. A compute client is dedicated to computation [118]. Each server has three network interface cards; one interface card connects to the Internet with a public network address, and the other two connect to a private LAN, which consists of a primary Ethernet LAN and a standby LAN. Each LAN consists of network interface cards, a switch, and network wires, and provides communication between servers and clients, and between the primary server and the standby server. The primary server provides the services and processes all the users' requests. The standby server activates its services and waits to take over when a failure of the primary server is detected. Heartbeat messages are transmitted periodically across the Ethernet LAN between the two servers and serve as health detection for the primary server. When a primary server failure occurs, the heartbeat detection on the standby server no longer receives any response message from the primary server. After a prescribed time, the standby server takes over the alias IP address of the primary server, and control of the cluster transfers from the primary server to the standby server. From then on, users' requests are processed on the standby server. From the user's point of view, the transfer is almost seamless, except for the short prescribed time. The failed primary server is repaired after the standby server takes over control. Once the repair is completed, the primary server activates its services, takes over the alias IP address, and begins to process users' requests. The standby server releases its alias IP address and goes back to its initial state. At a regular interval, the running server polls all the LAN components specified in the cluster configuration file, including the primary LAN cards, the standby LAN cards, and the switches. Network connection failures are detected in the following manner. The standby LAN interface is assigned to be the poller. The polling interface sends packet messages to all other interfaces on the LAN and receives packets back from all other interfaces on the LAN. If an interface cannot receive or send a message, the numerical count of packets sent and received on that interface does not increment for a period of time, after which the interface is considered down. When the primary LAN goes down, the standby LAN takes over the network connection. When the primary LAN is repaired, it takes the connection back from the standby LAN. Whenever a client fails while the cluster is in operation, the cluster undergoes a reconfiguration to remove the corresponding failing client node or to admit a new node into the cluster. This process is referred to as a cluster state transition. The HA-OSCAR cluster uses a quorum voting scheme to maintain the system performance requirement, where the quorum Q is the minimum number of functioning clients required for an HPC system. We consider a system with N clients and assume that each client is assigned one vote. The minimum number of votes required for a quorum is given by (N+2)/2 [119]. For example, a system with N = 16 clients requires at least (16+2)/2 = 9 functioning clients to maintain a quorum. Whenever the total number of votes contributed by all the functioning clients falls below the quorum value, the system suspends operation.
Upon the availability of a sufficient number of clients to satisfy the quorum, a cluster resumption process takes place and brings the system back to an operational state.

4.15 The Physical View of the HAS Architecture

The physical model specifies the architecture topology in terms of the hardware architecture and the organization of physical resources (processors, network, and storage elements). It also describes the hardware layers and takes into account the system's non-functional requirements, such as system connection availability, reliability and fault tolerance, performance (in terms of throughput), and scalability. The HAS architecture consists of a collection of interconnected nodes presented to users as a single, unified computing resource. Nodes are independent compute entities that physically reside on a common network. The software executes on a network of computers. We map the various elements identified in the logical, process, and development views onto the various cluster nodes that constitute the HAS architecture. The mapping of the software components to the nodes is highly flexible and has minimal impact on the source code itself. We can deploy different physical configurations depending on deployment and testing purposes. Figure 57 illustrates the physical view of the architecture, which consists of the following resources: master nodes, traffic nodes, redundant storage nodes, redundant LANs, one router per LAN, and two redundant image servers. The HA tier consists of at least two redundant master nodes that accept incoming traffic from the Internet through a virtual network interface that presents the cluster as a single entity, and that provide cluster infrastructure services to all HAS cluster nodes. The HA tier does not tolerate service downtime, because if the master node (acting as dispatcher) goes down, the traffic nodes are unable to receive traffic to service web clients. The HA tier controls the activity in the SSA tier and therefore needs to be able to determine the health of traffic nodes and their load. The SSA tier consists of a farm of independent and possibly diskless traffic nodes. The number of nodes in this tier scales from two to N nodes. If a node in this tier fails, the dispatcher stops forwarding traffic to the failed node and forwards incoming traffic to the remaining available traffic nodes. The storage tier consists of at least two redundant specialized storage nodes. The HAS architecture requires the availability of two routers (or switches) to provide a highly available and reliable communication path. An image server is a machine that holds the operating system and ramdisk images of the cluster nodes. This machine (two for redundancy purposes) is responsible for propagating the images over the network to the cluster nodes every time there is an upgrade or a new node joins the cluster. Master nodes can provide the functionality of the image server; however, for large deployments this might slow down the performance of the master nodes. Image servers are external and optional components of the architecture.
Figure 57: The physical view of the HAS architecture

We divide the cluster components into the following functional units: master nodes, traffic nodes, storage nodes, local networks, external networks, network paths, and routers. The m master nodes form the HA tier of a HAS architecture and implement the 1+1 redundancy model; the number of master nodes is m = 2. These nodes can be in the active/standby or the active/active mode. When m > 2, the redundancy model becomes the N+M model; however, we did not implement this redundancy model in the HAS prototype. The t traffic nodes are located in the SSA tier, where t ≥ 2. If t = 1, then there is a single traffic node that constitutes a SPOF. Let s indicate the number of storage nodes. If s = 0, the cluster does not include specialized storage nodes; instead, master nodes in the HA tier provide shared storage using a highly available distributed file system. The HA file system uses the disk space available on the master nodes to host application data. When s ≥ 2, at least two specialized nodes provide storage, and we introduce the notion of d shared disks, where d ≥ 2 x s; the d shared disks are the total number of shared disks in the cluster. The l local networks provide connectivity between cluster nodes. For redundancy purposes, l ≥ 2 to provide redundant network paths. However, this depends on two parameters: the number of routers r available (r ≥ 2, one router per network path) and the number of network interfaces eth available on each node (one eth interface per network path). The HAS architecture requires a minimum of two Ethernet cards per cluster node; therefore eth ≥ 2. The cluster can be connected to outside networks, identified as e, where e ≥ 1 to recognize that the cluster is connected to at least one external network.
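The constraints of this physical model can be stated compactly as a validity check over a deployment description, as in the following sketch; the structure and the example values (which mirror the prototype's 2 master and 16 traffic nodes) are illustrative.

/* A minimal sketch encoding the constraints of the HAS physical model. */
#include <stdio.h>

struct has_config {
    int m;    /* master nodes */
    int t;    /* traffic nodes */
    int s;    /* specialized storage nodes (0 = HA NFS on masters) */
    int d;    /* shared disks */
    int l;    /* local networks */
    int r;    /* routers, one per network path */
    int eth;  /* Ethernet interfaces per node */
    int e;    /* external networks */
};

static int has_config_valid(const struct has_config *c)
{
    return c->m == 2                   /* 1+1 redundancy in the HA tier */
        && c->t >= 2                   /* a single traffic node is a SPOF */
        && (c->s == 0 || c->s >= 2)    /* storage nodes, if any, are redundant */
        && (c->s == 0 || c->d >= 2 * c->s)
        && c->l >= 2 && c->r >= 2      /* redundant network paths */
        && c->eth >= 2                 /* two Ethernet cards per node */
        && c->e >= 1;                  /* at least one external network */
}

int main(void)
{
    struct has_config proto = { .m = 2, .t = 16, .s = 0, .d = 0,
                                .l = 2, .r = 2, .eth = 2, .e = 1 };
    printf("prototype configuration %s\n",
           has_config_valid(&proto) ? "satisfies the model"
                                    : "violates the model");
    return 0;
}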
4.16 The Physical Storage Model of the HAS Architecture

The storage model of the HAS architecture aims to meet two essential requirements. The first requirement is for the storage to be highly available, ensuring that data is always available to cluster users and applications. This requirement enables tolerance of a single storage unit failure, as mirrored data is hosted on at least two physical units; it also allows node failure tolerance, as mirrored data can be accessed from different nodes. The second requirement is to have high throughput and minimal access delay. Based on these requirements, we propose three possible physical storage access models for the HAS architecture: using the HA implementation of NFS, using specialized shared storage, or using local node storage. The following subsections explore these storage access methods and discuss their architectures and redundancy models.

4.16.1 Without Shared Storage

Figure 58 illustrates the no-shared storage model. This model offers a limited number of advantages; however, we list it for completeness. Following this model, cluster nodes use their local storage to store configuration and application data. While the no-shared model offers low cost, it comes at the expense of not being able to use diskless nodes. In addition, application data is not replicated, since state information is saved on each node where the transaction took place. If a node becomes unavailable, the data that resides on this node becomes unavailable too. As a result, the no-shared storage model is limited and poses restrictions on cluster nodes and their applications.

Figure 58: The no-shared storage model

However, it is worth mentioning that other research projects (Section 2.10.5) have adopted this model as their preferred way of handling data and dividing it across multiple traffic nodes [82]. Following their architectures, a traffic node receives a connection only if it has the data stored locally.

4.16.2 Shared Storage with Distributed File Systems

Distributed file systems enable file system access from a client through the network. This model enables cluster node location transparency, as each cluster node has direct access to the file system data, and it is suitable for applications with high storage capacity and access performance requirements. Figure 59 illustrates the physical model of the HAS architecture using shared storage. In this example, cluster nodes rely on the networked storage available through the master nodes. Master nodes run the modified implementation of the NFS server (Section 6.1.9), in which two NFS servers provide redundant shared storage to cluster nodes. In the event of a failure of one of the NFS servers, the other NFS server continues to provide access to data. We contributed this mechanism together with a modified implementation of the mount program (Section 6.1.9) that allows us to mount two NFS servers at the same location, where the client data resides.

Figure 59: The HAS storage model using a distributed file system

Figure 60 illustrates how the HAS architecture achieves NFS server redundancy. In Figure 60-A, master-a is the name of the Master Node A server, and master-b is the name of the Master Node B server. Both master nodes run the modified HA version of the network file system server. Using the modified mount program, we mount a common storage repository on both master nodes:

% mount -t nfs master-a,master-b:/mnt/CommonNFS

Figure 60: The NFS server redundancy mechanism (A: viewed from outside the cluster, one storage repository is available on two NFS servers; B: changes in content are synchronized with the rsync utility)

When the rsync utility detects a change in the contents, it performs the synchronization to ensure that both repositories are identical. If the NFS server on master-a becomes unavailable, data requests to /mnt/CommonNFS are not disturbed, because the secondary NFS server on master-b is still running and hosting the /mnt/CommonNFS network file system.

4.16.2.1 Synchronization of Shared Storage

The rsync utility is open source software that provides incremental file transfer between two sets of files across a network connection, using an efficient checksum-search algorithm [120]. It provides a method for bringing remote files into sync by sending just the differences in the files across the network link [121]. The rsync utility can update whole directory trees and file systems; preserves symbolic links, hard links, file ownership, permissions, devices, and times; and uses pipelining of file transfers to minimize latency costs. It uses ssh or rsh for communication, and it can run in daemon mode, listening on a socket, which is used for public file distribution. We used the rsync utility to synchronize data on both NFS servers running on the two master nodes in the HA tier of the HAS architecture.
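For illustration, assuming the shared repository of Figure 60 and that master-b is reachable from master-a over ssh, one plausible synchronization run, invoked periodically on the primary, would be:

% rsync -az --delete /mnt/CommonNFS/ master-b:/mnt/CommonNFS/

Here -a preserves permissions, ownership, links, and timestamps, -z compresses the transferred differences, and --delete removes files on the secondary that no longer exist on the primary; the exact options and invocation schedule used in the prototype are assumptions.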
4.16.2.2 Disk Replication Block Device

DRBD (Distributed Replicated Block Device) is an alternative to rsync for providing highly available storage. DRBD is system software that acts as a block device. It operates by mirroring a whole block device via the network, providing synchronization between the data storage on both master nodes [122]. DRBD takes over the data, writes it to the local disk, and sends it to the other storage server. In our context, DRBD is responsible for carrying out the synchronization between the two independently running NFS servers on two separate nodes. Figure 61 illustrates the data replication with DRBD. If the active node fails, heartbeat switches the secondary device into the primary state and starts the application there. If the failed node becomes available again, it becomes the new secondary node and synchronizes its NFS content with the primary master node. This synchronization takes place as a background process and does not interrupt the service.

Figure 61: DRBD disk replication for two nodes in the 1+1 active/standby redundancy model

The DRBD utility provides intelligent resynchronization: it only resynchronizes those parts of the device that have changed, which results in less synchronization time. It grants read-write access to only one node at a time, which is sufficient for the usual fail-over HA cluster. The drawback of the DRBD approach is that it does not work when we have two active nodes, because of possibly multiple writes to the same block. If more than one node concurrently modifies the distributed devices, we face the problem of deciding which part of the device is up to date on which node, and which blocks we need to resynchronize in which direction.

4.16.3 Storage with Specialized Storage Nodes

The HAS architecture allows the addition of specialized storage nodes that provide shared storage to the cluster nodes. Figure 62 illustrates the physical model of the architecture using a storage area network provided by two redundant specialized storage nodes. Following this model, storage for application data is hosted on both of the specialized storage nodes. The traffic nodes can use their private disks to maintain configuration and system files. We did not experiment with providing storage using specialized storage nodes.
Figure 62: A HAS cluster with two specialized storage nodes

4.17 Types and Characteristics of the HAS Cluster Nodes

A cluster is a dynamic entity consisting of a number of nodes configured to be part of the cluster. Nodes can join or leave the cluster at any time. Sections 4.27.3 and 4.27.8 describe the sequence diagrams for a node joining and leaving our HAS architecture prototype. A cluster node is the logical representation of a physical node. A node is an independent compute entity that is a member of a cluster and resides on a common network. It communicates with all the cluster nodes through two local networks, enabling tolerance against a single local network failure. A cluster node can be a single-processor, dual-processor, or SMP machine. Each cluster node is usually characterized by an independent copy of the operating system environment; however, some cluster nodes share a single boot image from a central, shared disk storage unit or an image server. There are three types of nodes in the HAS architecture: master, traffic, and storage nodes. The following sub-sections present the characteristics of each node type and discuss its responsibilities and roles within the cluster.

4.17.1 Master Nodes

Master nodes reside in the HA tier of the HAS architecture. They have access to persistent storage and, depending on the specific deployment, provide shared disk access to traffic nodes and host cluster configuration, management, and application data. Master nodes have local disk storage, unlike traffic nodes, which can be diskless. Figure 63 illustrates the software and hardware stack of a master node in the HAS architecture.

Figure 63: The master node stack (NTP, DHCP, CVIP, TM, saru, Ethd, and HBD, on top of the operating system (Linux), the interconnect protocol (IPv4 and IPv6), the interconnect technology (Ethernet, TCP/UDP/IP), and the processors)

Master nodes provide an IP layer abstraction that hides all cluster nodes and provides transparency towards the end user. Master nodes have a direct connection to external networks. They do not run server applications; instead, they receive incoming traffic through the cluster virtual IP interface and distribute it to the traffic nodes using the traffic manager and a dynamic distribution mechanism (Section 4.23). Master nodes provide cluster-wide services for the traffic nodes, such as a DHCP server, IPv6 router advertisement, time synchronization, an image server, and a network file server. Master nodes run a redundant and synchronized copy of the DHCP server; DHCP is a communications protocol that allows network administrators to centrally manage and automate the assignment of IP addresses. The configuration files of this service are available on the HA shared storage. The router advertisement daemon (radvd) [123] runs on the master nodes and sends router advertisement messages to the local Ethernet LANs, periodically and when requested by a node sending a router solicitation message. These messages are specified by RFC 2461 [124], Neighbor Discovery for IP Version 6, and are required for IPv6 stateless autoconfiguration. The time synchronization server, running on the master nodes, is responsible for maintaining a synchronized system time. In addition, master nodes provide the functionalities of an image server.
When the SSA tier consists of diskless traffic nodes, there is a need for an image server to provide operating system images, application images, and configuration files. The image server propagates this data to each node in the cluster and solves the problem of coordinating operating system and application patches by putting in place, and enforcing, policies that allow operating system and software installation and upgrade on multiple machines in a synchronized and coordinated fashion. In addition, master nodes provide shared storage via a modified, highly available version of the network file server; we also prototyped a modified mount program that allows cluster nodes to mount multiple servers over the same mount point. Master nodes can optionally provide these services.

4.17.2 Traffic Nodes

Traffic nodes reside in the SSA tier of the HAS architecture. Traffic nodes can be of two types: with disk or diskless. Traffic nodes rely on cluster storage facilities for application data. Diskless traffic nodes rely on the image server to provide them with the operating system and application images they need to run, as well as all necessary configuration data. Figure 64 illustrates the software and hardware stack of a traffic node in the HAS architecture.

Figure 64: The traffic node stack (the Apache web server software, with TCD, LDirectord, and Ethd, on top of the operating system (Linux), the interconnect protocol (IPv4 and IPv6), the interconnect technology (Ethernet, TCP/UDP/IP), and the processors)

Traffic nodes run the Apache web server application. They reply to incoming requests forwarded to them by the traffic manager running on the master nodes. Each traffic node runs a copy of the traffic client, LDirectord, and the Ethernet redundancy daemon. Traffic nodes rely on cluster storage to access application data and configuration files, as well as for cluster services such as DHCP, FTP, NTP, and NFS. Traffic nodes have the option to boot from the local disk (for nodes with disks), the network (two networks, for redundancy purposes), flash disk (for CompactPCI architectures), CD-ROM, DVD-ROM, or floppy. The default booting method is through the network. Traffic nodes also run the NTP client daemon, which continually keeps the system time in step with the master nodes. With the HAS prototype, we experimented with booting traffic nodes from the local disk, the network, and flash disk, which we mostly used for troubleshooting purposes.

4.17.3 Storage Nodes

Cluster storage nodes provide storage that is accessible to all cluster nodes. Section 4.16 presents the physical storage model of the HAS architecture.

4.18 Local Network Access

Cluster nodes are interconnected using redundant dedicated links or local area networks (LANs). All nodes in the cluster are visible to each other. We assume, following the physical model, that all cluster nodes are one routing hop from each other. We achieve redundancy by using two local networks to interconnect all the nodes of the cluster. All nodes have two Ethernet adapters, each connected to a separate LAN. Figure 65 illustrates the local network access model, where each cluster node connects to both LAN1 and LAN2; LAN1 and LAN2 are separate local area networks with their own subnet or domain.
Figure 65: The redundant LAN connections within the HAS architecture

This connectivity model ensures highly available access to the network and prevents the network from being a SPOF. The HAS architecture supports both Internet Protocols, IPv4 and IPv6. Supporting IPv4 does not imply additional implementation considerations. However, supporting IPv6 requires a router advertisement daemon that is responsible for the automatic configuration of IPv6 Ethernet interfaces. The router advertisement daemon also acts as an IPv6 router, sending router advertisement messages, specified by RFC 2461 [124], to a local Ethernet LAN, periodically and when requested by a node sending a router solicitation message. These messages are required for IPv6 stateless autoconfiguration. As a result, in the event we need to reconfigure network addressing for cluster nodes, this is achievable in a transparent fashion and without disturbance to the service provided to end users.

4.19 Master Nodes Heartbeat

It is necessary to detect when master nodes fail and when they become available again. Heartbeat is the system software that provides this functionality in the HAS architecture prototype; it is internal system software of the prototype. Heartbeat is an open source project that provides system software running on both master nodes over multiple network paths for redundancy purposes [125]. Its goal is to ensure that master nodes are alive by sending heartbeat packets to each other, as defined in its configuration file. The heartbeat component is a low-level component that monitors the presence and health of master nodes in the HA tier of a HAS architecture by sending heartbeat packets across the network to the other instances of heartbeat running on the other master nodes, as a form of keep-alive message. The heartbeat program takes the approach that the keep-alive messages it sends are a specific case of the more general cluster communications service [126]. In this sense, it treats joining the communication channel as joining the cluster, and leaving the communication channel as leaving the cluster. Heartbeat itself acts similarly to a cluster-wide init daemon, making sure each of the services it manages is running at all times. When a master node stops receiving the heartbeat packets, it assumes that the other master node has died; as a result, the services the primary master node was providing are failed over to the standby master node. Figure 66 illustrates the heartbeat topology. The HA tier consists of two master nodes following the 1+1 redundancy model. Heartbeat supports both configurations of the 1+1 redundancy model: active/standby and active/active. Each master node sends heartbeat and administrative messages to the other master node as broadcasts.

Figure 66: The topology of the heartbeat Ethernet broadcast

With heartbeat, master nodes are able to coordinate their roles (active and standby) and track each other's availability. Heartbeat is discussed in [20], [127], and [128]. A sample configuration of the kind it uses is sketched below.
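To make the mechanism concrete, the following is a minimal heartbeat configuration file (/etc/ha.d/ha.cf) for the 1+1 model. The directives are standard heartbeat directives, but the node names, interfaces, and timer values are illustrative assumptions, not the exact values used in the prototype:

# Send a heartbeat packet every 2 seconds
keepalive 2
# Declare the peer dead after 10 seconds of silence
deadtime 10
# Broadcast heartbeats over both redundant LANs
bcast eth0 eth1
# Fail back to the original active node when it returns
auto_failback on
# The two master nodes participating in the HA tier
node master-a
node master-b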
4.20 Traffic Nodes Heartbeat using the LDirectord Module

In the same way the heartbeat mechanism (Section 4.19) checks the availability of the master nodes in the HA tier, we need a mechanism to verify that the traffic nodes in the SSA tier are up, running, and providing service to web clients. We use two methods to ensure that traffic nodes are available and providing services. The first method is the keep-alive mechanism integrated into the traffic distribution scheme discussed in Section 4.23. Each traffic node reports its load to the traffic managers running on the master nodes every x seconds (x is a configuration parameter). This continuous communication ensures that the traffic managers are aware of all available traffic nodes. In the event that this communication is disrupted, the traffic manager removes the traffic node from its list of available traffic nodes after a pre-defined timeout. This communication ensures that the traffic client is alive and that it is reporting the load index of the traffic node to the traffic manager. Section 4.23 discusses this mechanism.

The second method of traffic node heartbeat is an application check, whose goal is to ensure that the application server running on the traffic node is available and running. The application check relies on the Linux Director daemon (LDirectord) to monitor the health of the applications running on the traffic nodes. Each traffic node runs a copy of the LDirectord daemon. The LDirectord daemon performs a connect check of the services on the traffic nodes by connecting to them and making an HTTP request to the communication port where the service is running. This check ensures that a connection to the web server application can be opened. When the application check fails, LDirectord connects to the traffic manager and sets the load index of that specific traffic node to zero. As a result, existing connections to the traffic node may continue; however, the traffic manager stops forwarding new connections to it. Section 4.27.12 discusses this scenario. This method is also useful for gracefully taking a traffic node offline.

4.20.1 Sample LDirectord Configuration

The LDirectord module loads its configuration from the ldirectord.cf configuration file, which contains the configuration options. An example configuration file is presented below. It corresponds to a virtual web server available at address 192.68.69.30 on port 80, with round robin distribution between two nodes, 142.133.69.33 and 142.133.69.34. A sketch of the negotiate check it performs is shown after the listing.

1  # Global Directives
2  Checktimeout = 10
3  Checkinterval = 2
4  Autoreload = no
5  Logfile = "local0"
6  Quiescent = yes
7
8  # Virtual Server for HTTP
9  Virtual = 192.68.69.30:80
10 Fallback = 127.0.0.1:80
11 Real = 142.133.69.33:80 masq
12 Real = 142.133.69.34:80 masq
13 Service = http
14 Request = "index.html"
15 Receive = "Home Page"
16 Scheduler = rr
17 Protocol = tcp
18 Checktype = negotiate
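The Checktype = negotiate directive above means that LDirectord does more than open a TCP connection: it requests a known page and verifies the response body. A minimal sketch of such a check might look as follows; the host, port, page, and expected string mirror the sample configuration, and the function name is ours:

#!/usr/bin/env python
"""Illustrative sketch of an LDirectord-style "negotiate" check:
fetch a known page from a real server and verify its content."""
import http.client

def negotiate_check(host, port=80, request="/index.html", receive="Home Page"):
    """Return True if the web server answers the request and the
    response body contains the expected string, False otherwise."""
    try:
        conn = http.client.HTTPConnection(host, port, timeout=10)
        conn.request("GET", request)
        response = conn.getresponse()
        body = response.read().decode("utf-8", errors="replace")
        conn.close()
        return response.status == 200 and receive in body
    except OSError:
        # Connection refused or timed out: the application is down.
        return False

# Values taken from the sample ldirectord.cf above.
if not negotiate_check("142.133.69.33"):
    print("check failed: set this traffic node's load index to zero")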
Once the LDirectord module starts, the virtual server kernel table is populated. The capture below uses the ipvsadm command line tool to display the kernel's table. The ipvsadm command is used to set up, maintain, or inspect the virtual server table in the Linux kernel. The listing shows the virtual service, with the virtual address on port 80, and the two hosts providing this virtual service.

% ipvsadm -L -n
IP Virtual Server version 1.0.7 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port        Forward  Weight  ActiveConn  InActConn
TCP  192.68.69.30:80 rr
  -> 142.133.69.33:80          Masq     1       0           0
  -> 142.133.69.34:80          Masq     1       0           0
  -> 127.0.0.1:80              Local    0       0           0

By default, the LDirectord module uses the quiescent feature to add and remove traffic nodes. When a traffic node is to be removed from the virtual service, its weight is set to zero and it remains part of the virtual service. As such, existing connections to the traffic node may continue, but the traffic node is not allocated any new connections. This mechanism is particularly useful for gracefully taking real servers offline. This behavior can be changed, so that the real server is removed from the virtual service, by setting the global configuration option quiescent=no.

4.21 CVIP: A Cluster Virtual IP Interface for the HAS Architecture

One of the challenges is to present the cluster as a single entity to end users. To overcome the disadvantages of the existing solutions presented in Sections 2.9 and 2.10, the HAS architecture provides a transparent and scalable interface between the Internet and the cluster, called the cluster virtual IP (CVIP) interface. It is a fault-tolerant and scalable interface between the Internet and the HAS architecture that provides a single entry point to the HAS cluster. It should have no impact on web clients, applications, or the existing network infrastructure. It should not present a SPOF, should have minimal impact on performance, and should have minimal impact in case of failure. In addition, it should be scalable, allowing a large number of transactions (virtually an unlimited number) without becoming a bottleneck, and allowing the number of master nodes and traffic nodes to increase independently. Furthermore, it should support IPv4 and IPv6, and be application independent. Two essential requirements are that the CVIP interface should provide a single entry point to the cluster, and that it should support multiple VIP addresses.

As a result, the CVIP is a fault-tolerant and scalable method of interfacing a plurality of application servers running on the HAS cluster. The CVIP framework includes a plurality of network terminations (master servers in the HAS architecture) that receive incoming data packets from the Internet, and a plurality of forwarding processes (traffic managers) that are associated with the network terminations. The following sub-sections present the CVIP framework and discuss its architecture and the various concepts behind it.

4.21.1 The Basic Architecture

Figure 67 illustrates the generic configuration of the CVIP interface and the CVIP framework. It presents a three-layer architecture that consists of the server application software (i.e., the web server software) running on the traffic nodes, the master nodes that provide IP services, and the network cards on the master nodes connecting the CVIP interface to external networks and the Internet. We achieve network bandwidth scalability by increasing the number of network terminations, and capacity scalability by increasing the number of servers in the SSA tier.

Figure 67: The CVIP generic configuration (server application software on the traffic nodes; IP servers on the master/front-end nodes; network cards on the master nodes, visible to the Internet)

Figure 68 illustrates the levels of distribution within the HAS cluster using the CVIP as the interface towards the outside networks.
There exist two distribution points: the network termination, which distributes packets based on IP address to the correct master (or front-end) node, and the traffic manager, which distributes network connections to the applications on the traffic nodes. Section 4.21.1.1 discusses the network termination concept.

Figure 68: Levels of distribution (distribution of IP packets from the external world to the network termination front ends, and from the network terminations to the web or application servers)

4.21.1.1 The Concept of Network Termination

Figure 69 illustrates the concept of a network termination. A network termination is the last stop on the network for a given connection (data packet) before it is forwarded for processing inside the HAS architecture. The network termination of a web request packet is the network card interface on the master node in the HA tier of a HAS cluster. Each network termination carries its own IP address and can be addressed directly. These addresses are published via RIP/OSPF/BGP to the routers, indicating them as gateway addresses for the CVIP address; this means that routers see each termination just as another router. In the 1+1 active/standby model, one front-end (master) node is the owner of the virtual IP address; in the 1+1 active/active model, each master node claims to be an IP router for the CVIP address. More front ends can be added at runtime, and the OSPF protocol is used to monitor router links.

Figure 69: The network termination concept

4.21.2 The CVIP as a Framework

The CVIP is an interfacing method and a framework that provides high fault tolerance and close-to-linear scalability of the servers and the network interfaces. The framework is transparent to the web clients and to the HAS cluster servers, and has minimal impact on the surrounding network infrastructure. The CVIP operates with applications that run on top of IP. Figure 70 illustrates the CVIP framework. The CVIP interface receives incoming data packets from the Internet or a packet data network and passes them to the traffic manager (TMD) running on an IP server (i.e., a master node) in the HAS architecture. The IP server then forwards the request to an already existing connection, or a traffic distribution mechanism is invoked to determine the traffic node to which the request will be sent. On a master node, the traffic manager is the entity responsible for forwarding traffic to the application servers running on traffic nodes in the SSA tier of the HAS architecture, based on a defined traffic distribution mechanism that the traffic manager executes. The outgoing traffic takes a different path.
The reply to a request goes directly from the application server running on the traffic node to the web client that issued the original request.

Figure 70: The CVIP framework

4.21.3 Advantages of the CVIP

The following sub-sections present the advantages of the CVIP, such as a single point of entry to the cluster, scalability and high availability, transparency, and support for multiple application servers.

4.21.3.1 Transparency and single entry point to the cluster

The CVIP is transparent to the applications running on traffic nodes, and web clients are not aware of it. The CVIP supports multiple protocols, and applications that use TCP, UDP, and raw IP sockets are able to use it transparently. The CVIP hides the cluster internals from the users and makes the cluster visible to the outside world as a single entity through a virtual IP address. To the outside world, the service is available through a certain web address, namely the virtual cluster IP address, which masks the IP addresses of the master nodes behind it. It allows access to a cluster of processors via a single IP address. We can also define a number of CVIP addresses in a system and achieve communication between web clients and applications using different CVIP addresses in the cluster.

4.21.3.2 Scalable

The CVIP offers a unique scalability advantage. We can increase the number of network terminations, master nodes, or traffic nodes independently, without affecting how the cluster is presented to the outside world. With the CVIP, we can cluster multiple servers to use the same virtual IP address and port numbers over a number of processors to share the load. As we add new nodes to the HA and SSA tiers, we increase the capacity of the system and its scalability. The number of clients or servers using the virtual IP address is not limited. The framework is scalable, and we can add more servers to increase the system capacity. In addition, although we have only presented HTTP servers, the applications on top may include any server application that runs on IP, such as an FTP server for file transfer.

4.21.3.3 Fault tolerance

The interface hides errors that take place on the HAS nodes, and it is always available. Any crash of the application server is transparent to end users. In the event the web server software crashes, only 1/N of the ongoing transactions are lost (and only if there is no connection synchronization between master nodes). All ongoing transactions can be saved and the state information preserved.

4.21.3.4 Availability

Since the CVIP is supported on multiple servers, it does not constitute a SPOF. In the HAS architecture prototype, the CVIP was provided on two master nodes in the HA tier. If one master node crashes, the web clients and web servers are not affected.

4.21.3.5 Dynamic connection distribution

The framework supports a traffic distribution mechanism, discussed in Section 4.23, that provides web clients with access to application servers running on traffic nodes in a transparent way. When the CVIP receives incoming traffic, it is possible to choose between two distribution algorithms: the round robin distribution mechanism, or the HAS distribution mechanism that provides dynamic, high performance distribution (discussed in Section 4.23).
4.21.3.6 Support for multiple application servers

Since the CVIP interface operates at the IP level and is transparent to the application servers running on traffic nodes, it is independent of the type of traffic it accepts and forwards. As a result, with the CVIP, the HAS architecture supports all types of application servers that work at the IP level.

4.21.4 CVIP Discussion

When the HA tier is configured for the 1+1 active/standby redundancy model, incoming connections to the HAS cluster arrive at the active master node, owner of the CVIP address. When the HA tier is configured for the 1+1 active/active redundancy model, incoming connections arrive at the elected master node that owns the CVIP address, and are then distributed among the master nodes using the saru module. With the CVIP framework, we can start as many web servers as needed to handle increased traffic, with the entire set of web servers serving the same virtual IP address. The routed process includes a local routing table that contains a list of IP addresses that can be used to reach specific client IP addresses. The routed process contains information that is global to all processors, but this information is also available on each processor through local instances of the routed process. The cluster virtual IP interface operates at the IP level, enabling applications that run on top of IP to transparently access application servers running on traffic nodes, and to gain scalability and high fault tolerance. Whenever we add a new network card to a master node, we define its own local IP address. Through the OSPF routing protocol, the rest of the IP network is informed that this network interface can be used to receive CVIP traffic.

4.22 Connection Synchronization

When the HA tier follows the 1+1 active/standby redundancy model, the cluster achieves a higher availability than a single-node tier, since there are two master nodes, one active and one standby. When the active master node fails, the standby master node automatically takes over the IP address of the virtual service, and the cluster continues to function. However, when a fail-over occurs on the active master node, connections that are in progress terminate, because the standby master node does not know anything about them. To improve this situation and prevent the loss of ongoing connections, we need the capability of synchronizing the information about ongoing connections between the active master node and the standby master node. As a result, the cluster can minimize, and even eliminate, lost connections caused by the failure of an active master node. When the information about ongoing connections is synchronized between master nodes, then if the standby master node becomes the active master node, it retains the information about the currently established and active connections; as a result, the new active master node continues to forward their packets to the traffic nodes in the SSA tier.

4.22.1 The Challenge of Connection Synchronization

When a master node receives a packet for a new connection, it allocates the connection to a traffic node. This allocation is effected by allocating an ip_vs_conn structure in the Linux kernel, which stores the source address and port, the address and port of the virtual service, and the traffic node address and port of the connection. Each time a subsequent packet for this connection is received, this structure is looked up, and the packet is forwarded accordingly. A simplified sketch of such a connection record is shown below.
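The following is a minimal sketch of the information such a connection record carries, mirroring the fields of the ip_vs_conn structure described above. The class name and the table layout are ours; this illustrates the concept rather than the kernel code:

"""Illustrative sketch of the per-connection state a master node keeps.
Field names mirror the description of ip_vs_conn above; the class
itself is ours, not the Linux kernel structure."""
from dataclasses import dataclass

@dataclass(frozen=True)
class ConnectionEntry:
    client_addr: str   # source address of the web client
    client_port: int   # source port of the web client
    virtual_addr: str  # address of the virtual service (the CVIP)
    virtual_port: int  # port of the virtual service
    real_addr: str     # traffic node serving this connection
    real_port: int     # port on the traffic node

# The connection table, keyed by the client and virtual endpoints, so a
# subsequent packet can be looked up and forwarded to the same traffic node.
connections = {}

entry = ConnectionEntry("10.0.0.7", 43210, "192.68.69.30", 80,
                        "142.133.69.33", 80)
connections[("10.0.0.7", 43210, "192.68.69.30", 80)] = entry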
The ip_vs_conn structure is also used to reverse the translation process that occurs when NAT is used to forward a packet, and to store persistence information. Persistence is a feature whereby subsequent connections from the same web user are forwarded to the same traffic node. When a fail-over occurs, the new master node does not have the ip_vs_conn structures for the active connections. Therefore, when a packet is received for one of these connections, the new master node does not know which traffic node to send it to. As a result, the connection breaks and the web user needs to reconnect. By synchronizing the ip_vs_conn structures between the master nodes, this situation can be avoided, and connections can continue after a master node fails over.

4.22.2 The Master/Slave Approach to Connection Synchronization

The connection synchronization code relies on a sync-master/sync-slave setup, in which the sync-master sends synchronization information and the sync-slave listens. The following example illustrates this scenario [129]. There are two master nodes, master-A and master-B. Master-A is the sync-master and the active master node; master-B is the sync-slave and the standby master node. In step 1 (Figure 71), the web user opens connection-1. Master node A receives this connection, forwards it to a traffic node, and synchronizes it to the sync-slave, master node B.

Figure 71: Step 1 - Connection synchronization

In step 2 (Figure 72), a fail-over occurs and master node B becomes the active master node. Connection-1 is able to continue, because the connection synchronization took place in step 1.

Figure 72: Step 2 - Connection synchronization

The master/slave implementation of connection synchronization works with two master nodes: the active master node sends synchronization information for connections to the standby master node, and the standby master node receives the information and updates its connection table accordingly. The synchronization of a connection takes place once the number of its packets passes a predefined threshold, and then at a configurable frequency of packets. The synchronization information for connections is added to a queue and periodically flushed. The synchronization information for up to 50 connections can be packed into a single packet, which is sent to the standby master node using multicast. A kernel thread, started through an init script, is responsible for sending and receiving synchronization information between the active and standby master nodes.

4.22.3 Drawbacks of the Master/Slave Approach

After experimenting with the master/slave approach for synchronizing connections, we realized that it suffers from a drawback that appears when the new active master node (previously the standby) fails and the current standby node (previously the active) becomes active again. To illustrate this drawback, we continue the example of connection synchronization from the previous section. In step 3 (Figure 73), a web user opens connection-2. Master node B receives this connection and forwards it to a traffic node. Connection synchronization does not take place, because master node B is a sync-slave.
Figure 73: Step 3 - Connection synchronization

In step 4 (Figure 74), another fail-over takes place and master node A is again the active master node. Connection-2 is unable to continue, because it was not synchronized.

Figure 74: Step 4 - Connection synchronization

4.22.4 Alternative Approach: Peer-to-Peer Connection Synchronization

The master/slave approach to connection synchronization is not optimal, and therefore a different approach is needed to eliminate the drawback of the existing implementation. Figure 75 illustrates the peer-to-peer approach. In this approach, each master node sends synchronization information for the connections it is handling to the other master node. Therefore, in the scenario illustrated in Figure 74, connections are synchronized from master node B to master node A, and connection-2 is able to continue after the second fail-over, when master node A becomes the active master node again. The implementation of this approach is a future work item.

Figure 75: The peer-to-peer approach

4.23 Traffic Management

Traffic management is an important aspect of the HAS architecture that contributes to building a highly available and scalable web cluster. This section presents the requirements for flexible and scalable traffic distribution inside a web cluster. We discuss the needs of the HAS architecture for a traffic management scheme and present the HAS architecture solution, which consists of a traffic manager that runs on master nodes and enforces a distribution policy, and a traffic client that runs on traffic nodes and reports the load index of its traffic node, providing a dynamic feedback cycle.

4.23.1 Background and Requirements

Traffic distribution is the process of distributing network traffic across a set of server nodes to achieve better resource utilization, greater scalability, and high availability. Scalability is an important factor because it ensures a rapid response to each network request regardless of the load. Availability ensures that the service continues to run despite failures of individual server nodes. Traffic distribution can be a passive or an active technology, depending on the specific implementation. Some traffic distribution schemes do not modify network requests; they pass them verbatim to one of the cluster nodes and return the response verbatim to the client. Other schemes, such as NAT, change the request headers and force the response to go through an intermediate server before it reaches the end user. There are many aspects of a traffic distribution implementation that we considered when designing and prototyping the HAS architecture traffic distribution scheme. These aspects include optimizing response time, coping with failures of individual traffic nodes, supporting heterogeneous traffic nodes, ensuring that the distribution service remains available, supporting session persistence, and ensuring transparency, so that users are not aware of where the application is hosted, nor whether the server is a cluster or a single server.
Our survey of similar work (Sections 2.9 and 2.10) identified that the added performance from complex algorithms is negligible. The recommendation was to focus on a distribution algorithm that is uncomplicated, has a low overhead, and minimizes serialized computing steps to allow for faster execution. Scalable web server clusters require three core components: a scheduling mechanism, a scheduling algorithm, and an executor. The scheduling mechanism directs clients' requests to the best web server. The scheduling algorithm defines the best web server to handle the specific request. The executor carries out the scheduling algorithm using the scheduling mechanism. The following sub-sections present these three core components in the HAS architecture.

4.23.2 Traffic Management in the HAS Architecture

Traffic management is a core technology that enables better scalability in the HAS architecture. It consists of three elements that work together to achieve efficient and dynamic traffic distribution among the cluster's traffic nodes: the traffic manager running on the master nodes, the traffic client daemons running on the traffic nodes, and the traffic distribution policy. Traffic nodes continuously report their availability and their load index to the traffic managers. The traffic manager distributes incoming traffic based on the distribution policy. When a request, such as one for a particular page of a website, arrives at a master node, the traffic manager forwards the request to the appropriate traffic node, depending on the policy defined in the traffic manager's configuration file. This scheme follows the LVS direct routing method, in which the master nodes and the traffic nodes have one of their interfaces physically linked by a hub or a switch. It is also the same approach as the L4/2 clustering method (Section 2.6.1). When a user accesses a virtual service provided by the server cluster and the packet destined for the virtual IP address arrives, the traffic manager forwards it to the appropriate traffic node, which services the request and replies directly to the user.

4.23.3 Traffic Manager

The traffic manager runs on the master nodes. It receives load announcements from the traffic client daemons running on the traffic nodes and updates its internal list of traffic nodes and their loads. It maintains a list of available traffic nodes and their load indexes, and distributes incoming traffic to the available traffic nodes based on their load index. The traffic manager requires a configuration file that lists the addresses of all traffic nodes, the traffic distribution policy, the port of communication, the timeout limit, and the addresses of the master nodes.

4.23.3.1 Sample Traffic Manager Configuration File

This section presents a sample configuration file of a traffic manager daemon.
1  # TMD Configuration File
2  # List of master nodes
3  master1 <IP Address of Master Node 1>
4  master2 <IP Address of Master Node 2>
5
6  # Active traffic nodes
7  # List the IP addresses of active traffic nodes
8  traffic1 <IP Address of Traffic Node 1>
9  traffic2 <IP Address of Traffic Node 2>
10 traffic3 <IP Address of Traffic Node 3>
11 traffic4 <IP Address of Traffic Node 4>
12
13 # Distribution mechanism (either HAS or RR)
14 distribution HAS
15
16 # Standby traffic nodes - used in the N+M redundancy model
17 # If the redundancy model is not N+M, then leave empty
18 # List the IP addresses of standby traffic nodes
19 <IP Address of Standby Traffic Node 1>
20 <IP Address of Standby Traffic Node 2>
21 <IP Address of Standby Traffic Node M>
22
23 # Port number
24 port <port_number>
25
26 # Reporting errors - needed for troubleshooting purposes
27 ErrorLog <full_path_to_error_log_file>
28
29 # The Timeout option specifies the amount of time in ms the TMD
30 # waits to receive load info from the TCD. Timeout should be greater
31 # than the update frequency of the TCD.
32 # See the TCD configuration file for TCD specific information
33 timeout <timeout_value>

4.23.4 The Proc File System

The /proc file system is a real-time, memory-resident file system that tracks the processes running on the machine and the state of the system, and maintains highly dynamic data on the state of the operating system. The information in the /proc file system is continuously updated to match the current state of the operating system. The contents of the /proc file system are used by many utilities, which read the data from a particular /proc directory and display it. The traffic client uses two parameters from /proc to compute the load index of the traffic node: the processor speed and the free memory. The /proc/cpuinfo file provides information about the processor, such as its type, make, model, cache size, and processor speed in BogoMIPS [128]. The BogoMIPS parameter is an internal representation of the processor speed in the Linux kernel. Figure 76 illustrates the contents of the /proc/cpuinfo file at a given moment in time; the bogomips line is the parameter used to compute the load index of the traffic node. The processor speed is a constant parameter; therefore, we read the /proc/cpuinfo file only once, when the TC starts.

% more /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 13
model name      : Intel(R) Pentium(R) M processor 1.70GHz
stepping        : 6
cpu MHz         : 598.186
cache size      : 2048 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr mce cx8 sep mtrr pge mca cmov pat clflush dts acpi mmx fxsr sse sse2 ss tm pbe est tm2
bogomips        : 1185.43

Figure 76: The CPU information available in /proc/cpuinfo

The /proc/meminfo file reports a large amount of valuable information about RAM usage. It contains information about the system's memory usage, such as the current state of the physical RAM in the system, including a full breakdown of total, used, free, shared, buffered, and cached memory utilization in kilobytes, in addition to information on swap space. Figure 77 illustrates the contents of the /proc/meminfo file at a given moment in time; the MemFree line is the parameter used to compute the load index of the traffic node. Since MemFree is a dynamic parameter, it is read from the /proc/meminfo file every time the TC calculates the load index.
% more /proc/meminfo
MemTotal:        775116 kB
MemFree:           6880 kB
Buffers:          98748 kB
Cached:          305572 kB
SwapCached:        2780 kB
Active:          300348 kB
Inactive:        286064 kB
HighTotal:            0 kB
HighFree:             0 kB
LowTotal:        775116 kB
LowFree:           6880 kB
SwapTotal:      1044184 kB
SwapFree:       1040300 kB
Dirty:               16 kB
Writeback:            0 kB
Mapped:          237756 kB
Slab:            171064 kB
Committed_AS:    403120 kB
PageTables:        1768 kB
VmallocTotal:    245752 kB
VmallocUsed:      11892 kB
VmallocChunk:    232352 kB
HugePages_Total:      0
HugePages_Free:       0

Figure 77: The memory information available in /proc/meminfo

4.23.5 Traffic Client

The traffic client is a daemon process that runs on each traffic node. It collects processor and memory information from the /proc file system and reports it to the traffic managers running on the master nodes. The implementation of the traffic client requires a configuration file (Section 4.23.5.1) that lists the addresses of the master nodes, the port of communication, the timeout limits, and the logging directive. In the event of a failure of the traffic client daemon, the traffic manager does not receive a load notification and, after a timeout, removes the traffic node from its list of available traffic nodes. As a result, no disruption of service results from the failure of the traffic client daemon. Section 4.27.9 discusses this scenario.

4.23.5.1 Sample Traffic Client Configuration File

This section presents a sample configuration file of a traffic client daemon.

1  # TC Configuration File
2  # List of master nodes to which the TC daemon reports load
3  master1 <IP Address of Master Node 1>
4  master2 <IP Address of Master Node 2>
5
6  # Port # to connect to at master - this port number can be anywhere
7  # between 1024 and 49151. We can also use ports 49152 through 65535
8  port <port_number>
9
10 # Frequency of load updates in ms
11 updates <frequency_of_updates>
12
13 # Reporting errors -- needed for troubleshooting purposes
14 ErrorLog <full_path_to_error_log_file>
15
16 # Amount of RAM in the node with the least RAM in the cluster
17 RAM <num_of_ram>

4.23.6 Characteristics of the Traffic Management Scheme

The traffic management scheme supports static distribution with round robin, which does not take the load of the traffic nodes into consideration. It also supports dynamic distribution, configurable to support various intervals of load updates. The HAS cluster can include computers with heterogeneous hardware, such as different processor speeds and memory capacities. The traffic distribution mechanism is aware of the variation in computing power, since the processor speed, reported in BogoMIPS, and the free memory, reported in MB, are factored in by the traffic manager when forwarding traffic to the nodes. As a result, the master node assigns traffic to the traffic nodes based on their load index. The traffic manager uses the following formula to calculate the load index of each traffic node:

Load_a = (CPU_a * RAM_a) / RAM_x

The formula consists of the following variables:
• CPU_a is the BogoMIPS representation of the processor speed of traffic node a;
• RAM_a is the amount of free RAM, in MB, of traffic node a; and
• RAM_x is the total amount of RAM, in MB, of traffic node x, where traffic node x is the node with the least amount of RAM among all traffic nodes. The configuration file of the traffic client specifies this parameter (Section 4.23.5.1).

The result is the relative load of traffic node a. A sketch of the computation, as performed on the traffic client side, is shown below.
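A minimal sketch of how the traffic client could derive this load index from /proc follows. The function names are ours, and the reporting side is reduced to a plain UDP datagram; the prototype's wire format is not specified here, so that part, along with the addresses and port, is an assumption:

#!/usr/bin/env python
"""Illustrative sketch of the traffic client's load-index computation.
Reads BogoMIPS once from /proc/cpuinfo, then free memory from
/proc/meminfo on every cycle. Addresses, port, and the wire format
used to report to the traffic manager are assumptions."""
import socket
import time

RAM_X_MB = 256  # total RAM of the smallest node (the RAM directive in the TC file)

def read_bogomips():
    # Constant for the lifetime of the node; read once at startup.
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.lower().startswith("bogomips"):
                return float(line.split(":")[1])
    raise RuntimeError("no bogomips entry in /proc/cpuinfo")

def read_free_mb():
    # Dynamic; read on every reporting cycle.
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemFree:"):
                return int(line.split()[1]) / 1024.0  # kB -> MB
    raise RuntimeError("no MemFree entry in /proc/meminfo")

cpu_a = read_bogomips()
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
while True:
    load_index = cpu_a * read_free_mb() / RAM_X_MB  # Load_a = CPU_a * RAM_a / RAM_x
    # Report the pair (traffic_node_IP, load_index) to both traffic managers.
    for master in ("192.168.1.1", "192.168.1.2"):   # assumed master addresses
        sock.sendto(f"192.168.1.100,{load_index:.0f}".encode(), (master, 9000))
    time.sleep(1)  # the 'updates' frequency from the TC configuration file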
Let us consider an example of how the formula is applied. Consider a traffic node with a Pentium M processor running at 1.7 GHz, with 748 MB of RAM, of which 512 MB (524,288 kB) are free. The BogoMIPS processor speed is 3358.72. Let us also assume that the machine with the lowest amount of RAM in the cluster has 256 MB (262,144 kB) of RAM. The load of the traffic node, as reported by the traffic client daemon to the traffic manager, is therefore:

Load_a = (3358.72 * 524,288) / 262,144 ≈ 6,717

The traffic manager maintains a list of nodes and their loads. Figure 78 illustrates a case example of a HAS cluster that consists of eight traffic nodes.

Node IP Address    Load Index
192.168.1.100      4484   (least busy traffic node)
192.168.1.101      4397
192.168.1.102      4353
192.168.1.103      4295
192.168.1.104      4235
192.168.1.105      4190
192.168.1.106      4097
192.168.1.107      3973   (most busy traffic node)

Figure 78: Example list of traffic nodes and their load index

When the traffic manager receives an incoming request, it examines the list of nodes and forwards the request to the least busy node on the list. The list of traffic nodes is a sorted linked list, which allows us to maintain an ordered list of nodes without having to know ahead of time how many nodes we will be adding. To build this data structure, we used two class modules: one for the list head and another for the items in the list. As we add nodes to the list, the code finds the correct place to insert them and adjusts the links around the new nodes accordingly; a sketch of this bookkeeping appears after the example of operation below.

4.23.7 Example of Operation

This section describes a normal scenario of a traffic client reporting its load to a traffic manager. Figure 79 illustrates this scenario.

Figure 79: Illustration of the interaction between the traffic client and the traffic manager

(1) The traffic client reads the /proc file system: it retrieves the BogoMIPS processor speed from /proc/cpuinfo, and then retrieves the amount of free memory from /proc/meminfo.
(2) The traffic client computes the node load index based on the formula in Section 4.23.6.
(3) The traffic client reports the load index to the traffic manager as a string that consists of a pair of parameters: the traffic node IP address and the load index (traffic_node_IP, load_index).
(4) The traffic manager receives the load index and updates its internal list of traffic nodes to reflect the new load_index of the traffic_node_IP.
(5) For illustration purposes, we assume that an incoming request reaches the virtual interface.
(6) The routed daemon forwards the request to the traffic manager. The traffic manager examines the list of traffic nodes and chooses a traffic node as the target for this request.
(7) The traffic manager forwards the request to the traffic node.
(8) The web server running on the traffic node receives the request. It reads from the distributed storage and fetches the requested document.
(9) The web server sends the requested document directly to the web user.

This scenario also illustrates several drawbacks, such as the number of steps involved and the communication overhead between the system software components. One of our future work items is to reduce the number of system software components, resulting in less communication overhead.
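The following sketch shows the bookkeeping described in Section 4.23.6: a table of (load index, node) entries kept sorted as load reports arrive, with the least busy node always taken from the head. It uses an ordinary sorted list rather than the prototype's two class modules, so the structure and names are illustrative assumptions:

"""Illustrative sketch of the traffic manager's node table: entries kept
sorted so the least busy node (highest load index) is always at the
head. Names and structure are ours, not the prototype's."""
import bisect

class NodeTable:
    def __init__(self):
        self._entries = []  # sorted so the least busy node comes first

    def update(self, node_ip, load_index):
        """Record a fresh load report, re-inserting the node in sorted order."""
        self._entries = [(l, n) for (l, n) in self._entries if n != node_ip]
        # Negate the index so bisect's ascending order puts the node with
        # the highest load index (the least busy one) at the front.
        bisect.insort(self._entries, (-load_index, node_ip))

    def remove(self, node_ip):
        """Drop a node that timed out or whose application check failed."""
        self._entries = [(l, n) for (l, n) in self._entries if n != node_ip]

    def least_busy(self):
        """Target for the next request under the HAS distribution policy."""
        return self._entries[0][1] if self._entries else None

table = NodeTable()
table.update("192.168.1.100", 4484)
table.update("192.168.1.107", 3973)
assert table.least_busy() == "192.168.1.100"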
4.23.8 Support for Other Distribution Algorithms

The current traffic manager prototype supports the basic RR (round robin) distribution and the HAS distribution. The distribution method is specified in the configuration file of the traffic manager (Section 4.23.3.1):

# Distribution mechanism (HAS or RR, or others)
distribution <distribution-scheme>

As a result, when the traffic manager starts, it enforces the specified traffic distribution method. If the distribution method is RR, the traffic clients running on the traffic nodes continue to report their load; however, the list of traffic nodes is not kept sorted by load. Instead, the table remains static and reflects the list of traffic nodes available in the HAS cluster. The traffic manager rotates through the table and distributes incoming traffic in a RR fashion to the traffic nodes on its list. Furthermore, the current model can support additional distribution methods.

4.23.9 Discussion of the HAS Architecture Traffic Management

The traffic management scheme is lightweight and configurable, and it is able to cope with failures of individual traffic nodes. It supports a cluster that consists of heterogeneous nodes with different processor speeds and memory capacities. If a master node fails, its traffic manager becomes unavailable; in this case, the standby master node takes over, and the standby traffic manager becomes responsible for distributing incoming traffic. The scheme is transparent to web clients, who are unaware that the applications run on a cluster of nodes.

4.24 Access to External Networks and the Internet

The two classical methods of access to external networks are the direct access method and the restricted access method. In the direct access method, traffic nodes reply to web clients directly. In the restricted access method, traffic nodes forward their responses to one of the master nodes, which then rewrites the response headers and forwards the responses to the web clients. With the latter method, access to external networks is restricted by the master nodes in the HA tier, which is achieved using forwarding, filtering, and masquerading mechanisms; as a result, master nodes monitor and filter all accesses to the outside world. The HAS architecture supports both methods; the choice is architecture independent and depends on the configuration. Based on the survey of similar work (Sections 2.9 and 2.10), the direct access method is the most efficient traffic distribution method and helps improve the system's scalability. This is the model supported by the HAS cluster prototype, and the one with which we performed our benchmarking tests. Figure 80 illustrates the direct access scenario. When a request arrives at the cluster (1), the master node examines it (2), decides where the request should be forwarded, and forwards it (3) to the appropriate traffic node. The traffic node processes the request and replies directly (4) to the web client. Furthermore, the HAS architecture supports the restricted access method, as this is implementation specific.

Figure 80: The direct routing approach - traffic nodes reply directly to web clients

Figure 81 illustrates the scenario of restricted access, which requires re-writing of IP packets.
When a request arrives at the cluster (1), the master node examines it (2), decides where the request should be forwarded, rewrites the packets, and forwards them (3) to the appropriate traffic node. The traffic node processes the request and replies (4) to the master node. The master node then re-writes the packets (5) and sends the final reply to the web client (6).

Figure 81: The restricted access approach - traffic nodes reply to master nodes, which in turn reply to the web clients

4.25 Ethernet Redundancy

The Ethernet redundancy daemon (erd) runs after the route configuration data has been installed, following boot-up. In order to have link redundancy, it is necessary to configure the paired ports (eth0 and eth1) with the same IP and MAC addresses. The erd program starts by fetching the configuration of all routes and storing it in a table (loaded from /proc/net/route). The table is used to compose route deletion commands for all routes bound to the secondary link (eth1). The route configuration for the primary link (eth0) is then copied to the secondary link (eth1) with a higher metric (meaning a lower priority). The primary link (eth0) is then polled every x milliseconds (x is a configurable parameter in the configuration file). When the primary link fails, all routing information for the link is deleted, causing traffic to be switched to the secondary link. Upon link restoration, the cached routing information for the primary link is restored, forcing a traffic switchback.

4.25.1 Command Line Usage

The command has the following usage syntax:

% erd [eth0 eth1]

The example shown above indicates that eth0 has eth1 as its backup link. If we do not specify parameters on the command line, it defaults to the equivalent of "erd eth0 eth1". We automated this command on system startup.

4.25.2 Encountered Issues

The servers in our prototype use Tulip Ethernet cards. We patched the tulip.c driver to make the MAC addresses of ports 0 and 1 identical. Alternatively, we were able to achieve the same result by issuing the following commands (on Linux) to set the MAC address of an Ethernet port:

% ifconfig eth[X] down
% ifconfig eth[X] hw ether AA:BB:CC:DD:EE:FF
% ifconfig eth[X] up

We also modified the source code of the Ethernet device driver tulip.c to toggle the RUNNING bit in the dev->flags variable, which allows ifconfig to present the state of an Ethernet link. The state of the RUNNING bit for the primary link is accessed by erd via the ioctl system call. A sketch of the fail-over loop appears below.
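The following is a minimal sketch of the erd polling loop described above. It reads the link state from the modern /sys/class/net carrier file rather than through ioctl, and drives the route switch with the ip command, so both of those choices, as well as the polling interval, are assumptions rather than the prototype's implementation:

#!/usr/bin/env python
"""Illustrative sketch of the erd fail-over loop: poll the primary link
and move traffic to the backup link when the carrier is lost. Uses
/sys/class/net and the ip command as modern stand-ins for the
prototype's ioctl-based polling and route table manipulation."""
import subprocess
import time

PRIMARY, BACKUP = "eth0", "eth1"
POLL_MS = 500  # the configurable polling interval, in milliseconds

def link_up(iface):
    # "1" means the interface has carrier, i.e., the link is up.
    try:
        with open(f"/sys/class/net/{iface}/carrier") as f:
            return f.read().strip() == "1"
    except OSError:
        return False

active = PRIMARY
while True:
    if active == PRIMARY and not link_up(PRIMARY):
        # Primary failed: delete its routes so traffic uses the backup
        # link, whose copied routes carry a higher metric.
        subprocess.call(["ip", "route", "flush", "dev", PRIMARY])
        active = BACKUP
    elif active == BACKUP and link_up(PRIMARY):
        # Primary restored: reinstall its cached routes to force the
        # switchback. (Route re-installation details omitted here.)
        active = PRIMARY
    time.sleep(POLL_MS / 1000.0)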
4.26 Dependencies and Interactions between Software Components

The HAS architecture consists of many software components (Section 4.4) that depend on each other to provide service. When the work initially started, our goal was to minimize the number of dependencies in order to reduce the possibility of cascading failures. However, as the work progressed and the architecture grew to support additional functionalities, the number of system components grew as well, and their relations became more complex. Figure 82 illustrates the dependencies and interconnections of the various system software components within the HAS architecture. As a future work item, we plan to consolidate the functionalities of these system software components into fewer components. The following sub-sections discuss the dependencies among these software components and describe the nature of their interactions.

Figure 82: The dependencies and interconnections of the HAS architecture system software

4.26.1 Traffic Manager and the CVIP

The CVIP interface is the point of entry to the HAS cluster. All incoming connections arrive at the CVIP and are forwarded to the traffic manager running on a master node in the HA tier, which decides which traffic node is to service the incoming request. Therefore, the traffic manager depends on the CVIP to receive incoming requests.

4.26.2 The Saru Module and the Heartbeat Daemon

When the HA tier is in the active/active redundancy model, the saru module runs in coordination with heartbeat on each of the master nodes. The saru module is responsible for dividing the incoming connections between the two master nodes. The heartbeat daemon provides a mechanism to determine which master nodes are available, and the saru module uses this information to divide the space of all possible incoming connections between the two active master nodes.

4.26.3 Traffic Manager and Traffic Client

Traffic managers running on master nodes depend on the traffic client daemons running on traffic nodes to provide traffic distribution based on the load index of the traffic nodes. If a traffic client daemon does not send the load index to the traffic managers, the traffic managers are not able to distribute traffic accordingly. Therefore, we need to ensure that traffic managers are aware when a traffic node becomes unavailable, when a traffic client daemon fails, and when the application fails on a traffic node.

4.26.4 Heartbeat and CVIP

The active master node owns the address of the virtual service. If the active master node fails, the standby master node becomes the active node (notified by the heartbeat mechanism) and the owner of the virtual service. Therefore, there is a dependency between the heartbeat mechanism and the cluster virtual IP interface.

4.26.5 Heartbeat Instances

Each master node runs a copy of the heartbeat daemon. The heartbeat instances running on the master nodes communicate with each other to ensure that they are available. If the heartbeat on the standby master node does not receive a reply from the heartbeat instance running on the active master node, it declares the active master node dead and assumes the active state. The cron utility monitors the heartbeat daemon locally on each master node; when the heartbeat daemon fails, the cron utility restarts it automatically.

4.26.6 LDirectord and Traffic Manager

The LDirectord module communicates with the traffic manager to ensure that the traffic manager does not send web traffic to a traffic node that hosts a failing web server application. The LDirectord module reports to the traffic manager the IP address of the traffic node it is running on, with the load_index pre-set to the value of 0 (zero). Section 4.27.12 presents this scenario.

4.26.7 The LDirectord Module and the Traffic Client

The traffic client reports the load of the traffic node to the traffic manager. We also use this mechanism as a health check between the traffic client and the traffic manager.
4.26.6 LDirectord and Traffic Manager
The LDirectord module communicates with the traffic manager to ensure that the traffic manager does not send web traffic to a traffic node that hosts a failing web server application. The LDirectord module reports to the traffic manager the IP address of the traffic node it is running on, with the load_index preset to the value of 0 (zero). Section 4.27.12 presents this scenario.

4.26.7 The LDirectord Module and the Traffic Client
The traffic client reports the load of the traffic node to the traffic manager. We also use this mechanism as a health check between the traffic client and the traffic manager. If the traffic client fails to report within a specific time and the timeout is exceeded, the traffic manager considers the traffic node unavailable and removes it from its list of active nodes. However, there is a scenario in which the traffic client keeps reporting the load index to the traffic manager while being unaware that the application the node hosts has failed. The role of the LDirectord module is to ensure that the traffic client does not update the load_index while the application is not responsive. Therefore, the LDirectord sets the traffic client load_index_report_flag to 0. When the load_index_report_flag = 0, the traffic client stops reporting its load to the traffic manager. When the application check performed by LDirectord returns positive (the application is up and running), the LDirectord sets the load_index_report_flag to 1. As a result, the traffic client resumes reporting its load index to the traffic manager, and the traffic manager adds the traffic node back to its list of available nodes.

4.26.8 The Saru Module and Traffic Manager
The traffic manager can be a SPOF in the 1+1 active/active redundancy model, and therefore we need a mechanism to ensure the continuous availability of the traffic manager functions. The saru module is the entity that ensures that, in the event a traffic manager fails, the cluster virtual interface stops sending incoming traffic to it and instead sends traffic to the available traffic manager on the other master node. The saru module is active on one master node when the nodes in the HA tier follow the 1+1 active/active redundancy model. It is responsible for evenly dividing traffic between the active master nodes using the round robin algorithm. Let us assume that the saru module is running on active master node 1. When a web request arrives at the saru module from the routed process, the request is within master node 1. The saru module forwards it to the appropriate traffic manager according to the round robin algorithm. Based on this, the traffic manager depends on the saru module to receive traffic.

4.26.9 The Saru Module and the routed Process
The saru module relies on the routed process to receive incoming traffic from the cluster virtual IP interface.

4.26.10 Traffic Client and the /proc File System
The traffic client daemon depends on the /proc file system to retrieve memory and processor usage, which are the metrics needed to compute the load index of a traffic node.
4.27 Scenario View of the Architecture
The scenario view consists of a small subset of important scenarios – instances of use cases – that illustrate how the elements of the HAS architecture work together. For each scenario, we describe the corresponding sequences of interactions between objects and between processes. This view acts as a driver to help us validate and illustrate the architecture design, both on paper and as the starting point for the benchmarking tests of the architectural prototype. The following subsections examine these scenarios:
- Normal scenario: The request arrives at the CVIP. The traffic manager forwards it to a traffic node, and the traffic node replies to the web client.
- Traffic client daemon reporting its load: This scenario presents how a traffic node computes and reports its load index to the traffic managers running on master nodes.
- Adding a traffic node to the SSA tier: This scenario examines the steps involved in adding a traffic node to the SSA tier.
- Boot process of a traffic node: This scenario examines the boot process of a traffic node. There are two variations of this scenario, depending on whether the traffic node is diskless or has a local disk.
- Upgrading the operating system and application server software: This scenario describes the generic process of upgrading the kernel and applications on a HAS cluster node.
- Upgrading the hardware of a master node: This scenario presents the process of upgrading a hardware component on a master node.
- Boot process of a master node: This scenario examines the boot process of a master node.
- Master node becomes unavailable: This scenario illustrates what happens when a master node becomes unavailable.
- Traffic node becomes unavailable: In some cases, a traffic node can become unavailable because of a hardware or software error. This scenario illustrates how the cluster reacts to the unresponsiveness of a traffic node.
- Ethernet port becomes unavailable: A cluster node can face networking problems because of Ethernet card or Ethernet driver issues. This scenario examines how a HAS cluster node reacts when it faces Ethernet problems.
- Traffic node leaving the cluster: When a traffic node is not available to serve traffic, the traffic manager disconnects it from the cluster. This scenario illustrates how a traffic node leaves the cluster.
- Application server process dies on a traffic node: When the application becomes unresponsive, it stops serving traffic. This scenario examines how to recover from such a situation.
- Network becomes unavailable: This scenario presents the chain of events that take place when the network to which the cluster is connected becomes unavailable.

4.27.1 Normal Scenario
The normal scenario describes how the HAS cluster successfully processes an incoming web request. The processing of the request varies depending on the redundancy model at the HA tier. Therefore, this scenario examines two variations: the HA tier configured for the 1+1 active/standby redundancy model, and the HA tier configured for the 1+1 active/active redundancy model.

4.27.1.1 Normal Scenario, 1+1 Active/Standby
Figure 83 illustrates the sequence diagram of a successful request. This scenario assumes that the nodes in the HA tier follow the 1+1 active/standby redundancy model. The web user issues a request for a specific service that the cluster provides through its virtual interface (1). The CVIP interface receives the request and forwards it to the traffic manager running on the active master node (2). The traffic manager locates the target node from its list of available traffic nodes (3) and forwards the request to the appropriate traffic node based on the traffic distribution algorithm (4). The traffic node receives the request, processes it, and returns the reply directly to the web user (5).

Figure 83: The sequence diagram of a successful request with one active master node
4.27.1.2 Normal Scenario, 1+1 Active/Active
Figure 84 illustrates the sequence diagram of an incoming request that arrives at the HA tier of the HAS cluster with two active master nodes. This scenario assumes that the nodes in the HA tier follow the 1+1 active/active redundancy model. The web user accesses a service hosted on the web server (1). The request arrives at the cluster IP interface (2). Master node 1 is elected by the saru module to own the CVIP interface and receive incoming requests from the CVIP. Therefore, the request arrives at master node 1 (3). The saru module, which is responsible for evenly dividing traffic between the active master nodes using the round robin algorithm, forwards the request to the traffic manager running on master node 2 (4). The traffic manager on master node 2 checks (5) for the best available traffic node from its list of available traffic nodes based on the distribution algorithm, and then forwards the request to it (6). The web server application receives the request, processes it, and replies directly to the user (7).

Figure 84: The sequence diagram of a successful request with two active master nodes

4.27.2 Traffic Node Reporting Load Status
Figure 85 illustrates the sequence diagram of a traffic node reporting its load information to the traffic managers running on master nodes. This scenario assumes that each traffic node runs a copy of the traffic client daemon. When the traffic client daemon starts, it loads its configuration from the /etc/tcd.conf configuration file. As a result, it obtains the IP addresses of the master nodes to which it reports its load, the port number, the frequency of the load index updates, and the location where it logs errors. The traffic client daemon then reads the processor and memory information from the /proc file system (1), computes the load index (Section 4.23.5) as

Load_index = (CPU * free_RAM) / lowest_RAM

and reports it (2)(3) to the traffic managers running on the master nodes. The traffic managers receive the load index of the traffic node and update their lists of available traffic nodes and their load indexes (4)(5). At this point, the traffic node is on the list of available traffic nodes, and the traffic managers start forwarding incoming traffic to it. The frequency of reporting the load of the traffic node is defined in the traffic client configuration file. Due to the repetitive nature of this activity, and since the processor information is static, unlike the memory information, the processor information is read from the /proc file system only the first time the traffic client starts.

Figure 85: A traffic node reporting its load index to the traffic manager
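The following shell sketch approximates this computation. It is a simplified stand-in for the traffic client: the /proc fields used, the lowest_RAM constant, and the logging step that replaces the actual report to the traffic managers are all assumptions.

#!/bin/sh
# Sketch: compute a load index from /proc, in the spirit of
#   Load_index = (CPU * free_RAM) / lowest_RAM
# lowest_RAM is the smallest memory configuration in the cluster
# (assumed here to be 512 MB, expressed in KB).
LOWEST_RAM=524288
# Processor speed in MHz; read once, since it is static.
CPU=`awk -F: '/MHz/ {print int($2); exit}' /proc/cpuinfo`
while true; do
    # Currently free memory in KB.
    FREE_RAM=`awk '/MemFree/ {print $2; exit}' /proc/meminfo`
    LOAD_INDEX=$(( CPU * FREE_RAM / LOWEST_RAM ))
    # The real traffic client would now send the pair
    # (traffic_node_IP, load_index) to the traffic managers;
    # here we only log the value locally.
    logger "tcd sketch: load_index=$LOAD_INDEX"
    sleep 5   # reporting frequency (configurable in /etc/tcd.conf)
done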
4.27.3 Adding a Traffic Node to the HAS Cluster
The HAS architecture allows the addition of traffic nodes to the cluster, transparently and dynamically, in response to increased traffic. The HAS cluster administrator configures the traffic node to boot from the LAN. The traffic node boots and downloads a kernel image and a ramdisk (Section 4.27.4.1). The ramdisk includes three software components that are started automatically: the Ethernet redundancy daemon, the LDirectord, and the traffic client. Figure 86 illustrates the scenario of a traffic node joining the HAS cluster. When the traffic client starts, it reads the processor speed and available memory from the /proc file system and computes the load index (1). The traffic client then reports its load index to the traffic managers running on the master nodes in the HA tier, as a pair of traffic node IP address and load index (2)(3). The traffic managers receive the notification and update their lists of available traffic nodes (4)(5). The traffic node is then ready to service incoming requests, and the traffic managers start forwarding traffic to the new traffic node.

Figure 86: A traffic node joining the HAS cluster

4.27.4 Boot Process of a Traffic Node
There are two types of traffic nodes, diskless nodes and nodes with a local disk, and each type boots differently. The following subsections describe both booting methods, but first we examine how the cluster nodes boot in general. The nodes in the cluster boot from the network. When the nodes boot, they broadcast their Ethernet MAC address looking for a DHCP server. The DHCP server, running on the master nodes and configured to listen for specific MAC addresses, responds with the correct IP address for each node. Alternately, the DHCP server responds to any broadcast on its physical network with IP information from a designated range of IP addresses. The nodes receive the network information they need to configure their interfaces: IP address, gateway, netmask, domain name, the IP address of the image and boot server, and the name of the boot file. Next, the nodes download a kernel image from the boot server, which can also be the master node. The boot server responds by sending a network loader to the client node; the network loader reads the network boot kernel sent from the boot server into local memory and transfers control to it. The kernel initially mounts the root file system as read-only, then re-mounts it as read/write, mounts the other file systems, and starts the init process. The init process brings up the customized Linux services for the node, and the node is then fully booted with all initial processes started.
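When setting up such an environment, it can be useful to verify by hand that the boot server answers TFTP requests before rebooting any nodes. The following sketch assumes a hypothetical boot server at 192.168.1.1 and bootloader and image file names matching those used in the next subsection; the actual names depend on the site configuration.

#!/bin/sh
# Sketch: manually fetch the boot files from the image server to verify
# that the TFTP service works. The server address and file names are
# hypothetical.
tftp 192.168.1.1 <<EOF
get pxelinux.0
get diskless_node
get diskless_node_ramdisk
quit
EOF
ls -l pxelinux.0 diskless_node diskless_node_ramdisk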
4.27.4.1 Boot Process of a Diskless Traffic Node
Figure 87 illustrates the process of adding a diskless node to a running HAS cluster.

Figure 87: The boot process of a diskless node

1. Ensure that the MAC address of the NIC on the diskless node is associated with a traffic node and configured on the master nodes as a diskless traffic node. The notion of diskless is important, since the traffic node will download a kernel and a ramdisk image from the image server. Traffic nodes have their BIOS configured to perform a network boot. When the administrator starts a traffic node, the PXE client that resides in the NIC ROM sends a DHCP_DISCOVER message.
2. The DHCP server, running on the master node, sends the IP address for the node, along with the address of the TFTP server and the name of the PXE bootloader file that the diskless traffic node should download.
3. The NIC PXE client then uses TFTP to download the PXE bootloader.
4. The diskless traffic node receives the kernel image (diskless_node) and boots with it.
5. Next, the diskless traffic node sends a TFTP request to download a ramdisk.
6. The image server sends the ramdisk (diskless_node_ramdisk) to the diskless traffic node, which downloads and executes it.

When the diskless traffic node executes the ramdisk, the traffic client daemon starts and periodically reports the load of the node to the master nodes. The traffic manager, running on the master nodes, adds the traffic node to its list of available traffic nodes and starts forwarding traffic to it.

4.27.4.2 Boot Process of a Traffic Node with Disk
There are two scenarios for adding a node with a local disk to the HAS cluster. The scenarios differ in whether the traffic node already has the latest versions of the operating system and the application server, or requires an upgrade of either one. Figure 88 illustrates the sequence diagram of a traffic node with disk booting from the network when no upgrade is needed. If the traffic node requires an upgrade of either the operating system or the application server, the traffic node downloads the updated operating system image and then a ramdisk image that includes the newer version of the web server application.

Figure 88: The boot process of a traffic node with disk – no software upgrades are performed

Figure 89 illustrates the process of upgrading the ramdisk on a traffic node. To rebuild a traffic node or upgrade the operating system and/or the ramdisk image, we re-point the symbolic link in the DHCP configuration to execute a specific script, which results in the desired upgrade. At boot time, the DHCP server checks whether the traffic node requires an upgrade and, if so, executes the corresponding script.

Figure 89: The process of rebuilding a node with disk
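The symbolic-link mechanism can be sketched as follows. The directory layout, script names, and the MAC-address-based naming convention are assumptions for illustration; the prototype's actual file names may differ.

#!/bin/sh
# Sketch: re-point the per-node boot action before rebooting the node.
# Paths, script names, and the MAC-based link name are hypothetical.
cd /tftpboot/actions
# Default action: boot the locally installed system.
ln -sf boot_local.sh node-00:11:22:33:44:55
# To force a restage with a new kernel and ramdisk on the next boot:
ln -sf restage_with_upgrade.sh node-00:11:22:33:44:55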
4.27.5 Upgrading Operating System and Application Server
The architecture promises service continuity even during upgrades of the operating system and/or the application servers running on the traffic nodes. Figure 90 illustrates the events that take place when we initiate an upgrade of both the kernel and the application server on a specific traffic node: the administrator reboots the node, which is flagged for a new stage; the image server uploads the new kernel image, the node boots with it and signals the new kernel version; the image server then uploads the new version of the application server, and the node starts it. While a traffic node is undergoing a software upgrade, traffic arriving at the cluster is forwarded by the traffic manager to the remaining available traffic nodes. Following this model, we can upgrade the software stack on the cluster nodes without service downtime.

Figure 90: The process of upgrading the kernel and application server on a traffic node

4.27.6 Upgrading Hardware on Master Node
Figure 91 illustrates the events that take place when a master node undergoes a hardware upgrade. The system administrator shuts down (1) master node 1, which goes offline (2). The heartbeat service on master node 1 is no longer available (3) to reply to the heartbeat messages sent to it from the heartbeat instance running on master node 2. After a timeout (specified in the heartbeat configuration file), master node 2 becomes the active master node (4) and the owner of the virtual services offered by the cluster. As a result, all incoming traffic arrives at the only available master node (5). The traffic manager on that master node forwards traffic to the appropriate traffic nodes, and responses are sent back to the web users.

Figure 91: The sequence diagram of upgrading the hardware on a master node

4.27.7 Master Node Boot Process
The process of booting a master node is different from that of a traffic node: the cluster administrator configures the master nodes, while the master nodes manage the traffic nodes. The cluster administrator installs all the software packages and tunes the configuration files for all system software components running on the master nodes.
The required components include the network time protocol server, the cluster IP interface, the Ethernet redundancy daemon, the traffic manager, the IPv6 router advertisement daemon, the heartbeat daemon, the DHCP server, and the redundant NFS server daemon.

4.27.8 Master Node Becomes Unavailable
Master nodes supervise each other's availability using heartbeat. With heartbeat, the failure of a master node is detected within a delay of 200 ms. One common failure scenario is a master node becoming unavailable because of a hardware problem or an operating system crash, and it is critical to have a mechanism in place to deal with such a challenge. Figure 92 illustrates the sequence of events when master node 1 becomes unavailable. When master node 1 becomes unavailable (1), the heartbeat instance running on master node 2 stops receiving keep-alive messages from master node 1; after a configurable timeout (2), master node 2 becomes the active master node (3) and the owner of the cluster virtual interface (4)(5). There is no impact on storage because the content of the storage system is duplicated and synchronized on both master nodes (Section 4.7.4).

Figure 92: The sequence diagram of a master node becoming unavailable

Figure 93 illustrates the sequence diagram of synchronizing storage when one of the master nodes fails. The redundant NFS servers are mounted on the traffic nodes using the modified mount command, which allows mounting two servers towards the same mount target. When master node 1 dies, the NFS implementation detects the failure of the NFS server on master node 1; when master node 1 comes back online, the NFS implementation detects that its NFS server is available again, and the two servers synchronize using rsync.

Figure 93: The NFS synchronization that occurs when a master node becomes unavailable

When the initially active master node becomes available again, there is no need for a switchback between the two master nodes: the returning node acts as a hot standby for the currently active master node. As future work, we would like the standby master node to switch to a load-sharing mode (1+1 active/active), helping the active master node direct traffic to the traffic nodes once the active master node reaches a pre-defined threshold limit. When master node 1 becomes available again (3), its NFS server is restarted; it mounts the storage and re-synchronizes its local content with master node 2 using the rsync utility.
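The re-synchronization of the returning master node's copy of the shared content might look like the following one-line sketch; the host name and the exported directory are assumptions.

#!/bin/sh
# Sketch: re-sync the NFS-exported document tree on the returning master
# node from the currently active master node. Host and path names are
# hypothetical.
rsync -avz --delete master-node-2:/export/www/ /export/www/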
4.27.9 Traffic Node Becomes Unavailable
Traffic nodes, like master nodes, can face problems and become unavailable to serve incoming traffic. The HAS cluster architecture overcomes this challenge by supporting node-level redundancy. Figure 94 illustrates the scenario in which a traffic node becomes unavailable.

Figure 94: The sequence diagram of a traffic node becoming unavailable

When a traffic node becomes unavailable (1), the traffic client daemon running on that node becomes unavailable as well and does not report the load index to the master nodes. As a result, the traffic manager daemons do not receive the load index from the traffic node (2). After a timeout, the traffic managers stop waiting for the load index, declare the node unavailable, and remove it from their lists of available traffic nodes (3). If the traffic node becomes available again (4), the traffic client daemon resumes reporting the load index to the traffic managers running on the master nodes (5). When a traffic manager receives the load index from the traffic node, this is an indication that the node is up and ready to provide service; the traffic manager then adds the traffic node (6) back to its list of available traffic nodes. A traffic node is declared unavailable if it does not send its load statistics to the master nodes within a specific, configurable time.

Figure 95 illustrates a traffic node losing network connectivity. The scenario assumes that traffic node C has lost network connectivity and, as a result, is no longer a member of the HAS cluster. The traffic manager forwards incoming traffic to the other remaining traffic nodes.

Figure 95: The scenario assumes that node C has lost network connectivity

4.27.10 Ethernet Port Becomes Unavailable
All cluster nodes are equipped with two Ethernet cards. The Ethernet redundancy daemon, running on each cluster node, is responsible for detecting when an Ethernet interface becomes unavailable and for activating the second Ethernet interface. Figure 96 illustrates this scenario. When Ethernet port 1 becomes unavailable (1), the Ethernet redundancy daemon detects the failure after a timeout (2) and activates Ethernet port 2 with the same MAC and IP addresses as Ethernet port 1 (3). From this point on, all communication goes through Ethernet port 2.

Figure 96: The scenario of an Ethernet port becoming unavailable

4.27.11 Traffic Node Leaving the Cluster
If a traffic node does not report its load index to the traffic managers running on the master nodes, the latter remove the traffic node from their lists of available traffic nodes and stop forwarding traffic to it. Figure 97 illustrates how a traffic node leaves the cluster.
Figure 97: The sequence diagram of a traffic node leaving the HAS cluster

When the traffic managers stop receiving the messages in which the traffic node reports its load index (1)(2), then after a defined timeout they remove the node from their lists of available traffic nodes (3). The scenario of a traffic node leaving the HAS cluster is similar to the scenario of a traffic node becoming unavailable, presented in Section 4.27.9.

4.27.12 Application Server Process Dies
The availability of the service depends on the availability of the application process (in our case, the Apache web server). If the application crashes and becomes unavailable, we need a mechanism to detect the failure and restart the application. This is important because a master node cannot detect such a failure in the traffic nodes, and as a result the traffic manager continues to forward incoming traffic to a traffic node that hosts a failing application. In addition, we need to maintain consistency between the traffic manager and the traffic client. Our approach to overcoming this challenge requires two pieces of system software: the cron operating system facility and the LDirectord module. The cron utility is a UNIX system daemon that executes commands or scripts as scheduled by the system administrator. Through cron, the operating system monitors the application and restarts it when it crashes. However, if the application crashes between cron checks, the traffic manager continues to forward requests to the web server running on the traffic node, and new and ongoing requests fail because the web server is down. Figure 98 illustrates our initial approach to addressing this challenge.
(1) The LDirectord performs an application check by making an HTTP request to the application and checking the result. The check, performed every x milliseconds (x is a configurable parameter), ensures that we can open an HTTP connection to the service on the traffic node.
(2) The LDirectord receives the result of the check. A positive result indicates that the application is up and running, and no action is required from LDirectord. A negative result indicates that the web server did not respond as expected.
(3) If the check is negative, the LDirectord connects to the traffic manager and reports that the application on the traffic node is not available; otherwise, the traffic manager would continue to forward requests to the traffic node hosting the unresponsive application. The LDirectord delivers the pair (traffic_node_IP, load_index) to the traffic manager, with the load_index = 0. A load_index of 0 ensures that the traffic managers stop forwarding traffic to the specific traffic node.
(4) The traffic manager updates its list of available traffic nodes and stops forwarding traffic to the traffic node.
(5) The LDirectord needs to ensure that the traffic client does not update the load_index while the application is not responsive, so it sets the load_index_report_flag to 0.
(6) When the load_index_report_flag = 0, the traffic client stops reporting its load to the traffic managers.
(7) On the next loop cycle, the LDirectord checks whether the application is still unresponsive. If the application is still not available, no action is required from LDirectord.
(8) If the application check returns positive, the LDirectord connects to the traffic manager and resets the load_index_report_flag to 1. When the load_index_report_flag = 1, the traffic client resumes reporting its load to the traffic manager.
(9) The traffic client reports the new load_index, which overwrites the 0 value.
(10) The traffic manager updates its list of available traffic nodes and starts forwarding traffic to the traffic node.

Figure 98: The LDirectord restarting an application process
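The LDirectord check loop can be approximated by the following sketch. It is a simplified stand-in for the real LDirectord logic: the node address, check URL, check interval, and the flag file standing in for the load_index_report_flag are all assumptions.

#!/bin/sh
# Sketch of an LDirectord-style HTTP health check (illustrative only).
# The node address, URL, and flag file are hypothetical.
NODE_IP=192.168.1.10
URL="http://$NODE_IP/index.html"
FLAG=/var/run/load_index_report_flag
while true; do
    if wget -q -O /dev/null "$URL"; then
        # Application responded: allow the traffic client to report its load.
        echo 1 > $FLAG
    else
        # Application failed: report (NODE_IP, 0) to the traffic manager
        # and stop the traffic client from reporting its load.
        logger "ldirectord sketch: application on $NODE_IP is down, load_index=0"
        echo 0 > $FLAG
    fi
    sleep 2   # check interval (configurable)
done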
4.27.13 Network Becomes Unavailable
In the event that one network becomes unavailable, the HAS cluster needs to survive the failure and switch traffic to the redundant available network. Figure 99 illustrates this scenario. The user sends a request (1) that is forwarded to Ethernet port 1 of the traffic node (2), and the reply is sent back (4). When the router becomes unavailable (3), Ethernet port 1 experiences a timeout on its reply (5). At this point, Ethernet port 1 uses its secondary route: the reply is re-sent through switch/router 2 (6), and the user receives it (7). We could use the heartbeat mechanism to monitor the availability of the routers; however, since routers are outside our scope, we do not pursue how to use heartbeat to discover and recover from router failures.

Figure 99: The network becomes unavailable

4.28 Network Configuration with IPv6
By default, cluster administrators configure the network settings on all cluster nodes with IPv4, through either static configuration or a DHCP server. However, with clusters consisting of tens or hundreds of nodes, these methods are labor intensive and prone to errors. Designers of network protocols have long recognized the difficulty of installing and configuring TCP/IP networks, and over the years they have devised solutions to overcome these pitfalls. Their latest outcome is a newly designed version of the IP protocol, IPv6. One of IPv6's useful features is its auto-configuration ability: it does not require a stateful configuration protocol such as DHCP. Hosts, in our case cluster nodes, can use router discovery to determine the addresses of routers and other configuration parameters. The router advertisement message also includes an indication of whether the host should use a stateful address configuration protocol.

There are two types of auto-configuration. Stateless configuration requires the receipt of router advertisement messages; these messages include stateless address prefixes and preclude the use of a stateful address configuration protocol. Stateful configuration uses a stateful address configuration protocol, such as DHCPv6, to obtain addresses and other configuration options. A host uses stateful address configuration when it receives router advertisement messages that do not include address prefixes and that require the use of a stateful address configuration protocol; a host also uses a stateful address configuration protocol when there are no routers present on the local link. By default, an IPv6 host can configure a link-local address for each interface. The main idea behind IPv6 autoconfiguration is the ability of a host to configure its network settings without manual intervention. Autoconfiguration requires the routers of the local network to run a program that answers the autoconfiguration requests of the hosts. The radvd (Router ADVertisement Daemon) provides this functionality: it listens to router solicitations and answers with router advertisements.

Figure 100: The sequence diagram of the IPv6 autoconfiguration process

Figure 100 illustrates the auto-configuration process. This scenario assumes that the router advertisement daemon is started on at least one master node, and that the cluster nodes support the IPv6 protocol at the operating system level, including its auto-configuration feature. The node starts (1). As the node boots, it generates its link-local address (2). The node sends a router solicitation message (3). The router advertisement daemon receives the router solicitation message from the cluster node and replies with a router advertisement (4), specifying the subnet prefix, lifetimes, the default router, and all other configuration parameters. Based on the received information, the cluster node generates its own IP address (5) and verifies its usability by performing the Duplicate Address Detection procedure. As a result, the cluster node has fully configured its Ethernet interfaces for IPv6.

4.28.1 Changes Required to Support IPv6
Many changes are required on the cluster nodes to support IPv6. Some of the changes are common to all cluster nodes; others are limited to the master nodes. In addition to supporting IPv6, we need to add some missing IPv6 network functionalities. The following subsections describe the changes made to the HAS architecture prototype to support a dual IPv4 and IPv6 environment (Figure 101). We have conducted many experiments and tests that are documented in [131].

4.28.1.1 Changes Needed on All Nodes
The single update required on all cluster nodes is support for the IPv6 protocol in the Linux kernel. We demonstrate how to enable IPv6 support in the kernel in [132].
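A quick way to verify that a node's kernel supports IPv6 and that autoconfiguration produced an address is sketched below; the interface name is an assumption, and the commands are standard Linux utilities rather than part of the HAS prototype.

#!/bin/sh
# Sketch: verify IPv6 support and autoconfiguration on a cluster node.
# Load the IPv6 module if it is not built into the kernel.
modprobe ipv6 2>/dev/null
# If IPv6 is supported, this file lists the IPv6 addresses per interface.
cat /proc/net/if_inet6
# Show the link-local and autoconfigured addresses on eth0 (assumed name).
ip -6 addr show dev eth0
# Probe the all-routers multicast group to check for an advertising router.
ping6 -c 2 -I eth0 ff02::2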
4.28.1.2 Specific Changes to Master Nodes
Several changes are needed on the master nodes to fully support IPv6 and to provide all functionalities and services over IPv6. The core services that need to support IPv6 on these nodes include NTP, DHCP, the cluster virtual IP interface, and all other IP services provided by the master nodes.

4.28.1.3 Added Functionalities to the Cluster
For the general cluster infrastructure, we need to provide support for IPv6 DNS and IPv6 firewalling. Figure 101 presents our prototyped cluster in an environment that supports IPv4 and IPv6. In the illustrated environment, incoming traffic to the cluster can arrive over either IPv4 or IPv6. The cluster virtual IP interface receives the traffic and forwards it within the cluster. The upstream provider is the authority that provides an IPv6 address range to the router that manages the assignment of IP addresses within the cluster. The IPv6 DNS server is responsible for IPv6 address resolution. The IPv6-in-IPv4 tunnel in Figure 101 enables IPv6 hosts and routers to connect with other IPv6 hosts and routers over the existing IPv4 Internet. IPv6 tunneling encapsulates IPv6 datagrams within IPv4 packets; the encapsulated packets travel across the IPv4 Internet until they reach their destination host or router, where the IPv6-aware host or router decapsulates the IPv6 datagrams and forwards them as needed. IPv6 tunneling eases IPv6 deployment by maintaining compatibility with the large existing base of IPv4 hosts and routers.

Figure 101: A functional HAS cluster supporting IPv4 and IPv6
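On Linux, such an IPv6-in-IPv4 tunnel can be sketched with the iproute2 tools as follows; the endpoint addresses and the IPv6 prefixes are hypothetical, and the exact configuration depends on the upstream provider.

#!/bin/sh
# Sketch: create an IPv6-in-IPv4 (sit) tunnel between two IPv4 endpoints.
# All addresses below are hypothetical.
LOCAL_V4=192.0.2.1          # this router's IPv4 address
REMOTE_V4=192.0.2.2         # remote tunnel endpoint
ip tunnel add sit1 mode sit local $LOCAL_V4 remote $REMOTE_V4 ttl 64
ip link set sit1 up
# Assign an IPv6 address to our end of the tunnel.
ip -6 addr add 2001:db8::1/64 dev sit1
# Route IPv6 traffic for the remote prefix through the tunnel.
ip -6 route add 2001:db8:1::/64 dev sit1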
Chapter 5
Architecture Validation

5.1 Introduction
The initial goal of this work was to propose an architecture that allows web clusters to scale up to 16 nodes while maintaining the baseline performance of each individual cluster node. Validating the architecture is an important activity that allows us to determine and verify whether the architecture meets our initial requirements. Network and telecom equipment providers use the professional services of specialized validation test centers to test and validate their products. This chapter presents three types of validation of the HAS architecture. The first is the validation of scalability and high availability: it presents the benchmarking results that demonstrate the ability to scale the HAS architecture up to 18 nodes (2 master nodes and 16 traffic nodes) while maintaining the baseline performance across all traffic nodes, as well as the results of the HA testing that validate the HA capabilities. The second is the external validation by open source projects; it describes the impact of the work on the HA-OSCAR project. The third is the adoption of the architecture by industry as the base architecture for communication platforms that run telecom applications providing mission-critical services.

5.2 Validation of Performance and Scalability
The benchmarking environment we built to measure the performance and scalability of the HAS architecture prototype is similar to the benchmarking environment presented in Section 3.3; however, in testing the HAS architecture prototype, we used more client machines to generate traffic and tested with a newer version of the WebBench software, version 5.0. To generate web traffic, we used 31 web client machines powered by Intel Pentium III and Pentium Celeron processors. Each client machine runs a copy of the WebBench benchmarking software. In addition to the 31 web client machines, we used one Pentium IV machine as the WebBench controller to manage the testing and to collect and compile the test results from all the web client machines. WebBench is a benchmark program that measures the performance of web servers. WebBench uses PC clients to send requests for standardized workloads to the web server; the workload is a combination of static files and dynamic executables that run in order to produce the data the server returns to the client. The client machines simulate web browsers. When the server replies to a client request, the client records information such as how long the server took and how much data it returned, and then sends a new request. When the test ends, WebBench calculates two overall server scores, requests per second and throughput in bytes per second, as well as individual client scores. WebBench maintains all the transaction information at run time and uses this information to compute the final metrics presented when the tests are completed.

Figure 102: A screen capture of the WebBench software showing 379 connected clients

Figure 102 is a screen capture from the WebBench controller showing 379 connected clients on the client machines, ready to generate traffic. The benchmarking tests took place at the Ericsson Research lab in Montréal, Canada. Although the lab connects to the Ericsson intranet, our LAN segment is isolated from the rest of the Ericsson network, and therefore our measurement conditions are under well-defined control. Figure 103 illustrates the network setup in the lab. The client computers run WebBench to generate web traffic, with one computer running WebBench as the test manager. These computers connect to a fiber-capable Cisco switch (c2948g) through 100 Mb/s links, and the Cisco switch connects to the HAS cluster through a 1 Gb/s fiber link. We conducted most benchmarking tests over IPv4, with some additional tests conducted over IPv6. The results demonstrate that we are able to achieve results similar to those obtained with IPv4, however with a slight decrease in performance [131].

Figure 103: The network setup inside the benchmarking lab

Over IPv6, we experienced a decrease in the number of successful transactions per second per processor ranging between 2% and 4% [133]. We believe that this is a direct result of the immaturity of the IPv6 networking stack compared to the mature IPv4 networking stack.

5.3 The Benchmarked HAS Architecture Configurations
The HAS architecture prototype consists of 18 nodes; each has a Pentium III processor and 512 MB of RAM. We configured the HAS cluster prototype as follows:
- Two master nodes providing a single IP interface, cluster services, and storage service through a redundant HA NFS implementation. The master nodes follow the 1+1 active/standby redundancy model. They run redundant copies of NTP, RAD, DHCP, and TM. In addition, each master node runs an instance of heartbeat, the CVIP, and the Ethernet redundancy daemon.
- Sixteen traffic nodes that provide service to web clients. Each traffic node runs a copy of the Apache web server, version 2.0.35, which serves documents available on the shared storage. Each traffic node also runs a copy of the traffic client daemon, which reports the node load to the traffic manager daemons running on the master nodes.

Figure 104 illustrates the benchmarked configurations of the HAS architecture prototype. Between benchmarking tests, all nodes in the HAS cluster were rebooted and initialized to ensure that we started from a fresh install; the same applies to all the web client machines and the WebBench controller machine.
Figure 104: The benchmarked HAS cluster configurations showing Test-1 through Test-4

I performed the following four benchmarking tests:
- Test-1 generates traffic to a HAS cluster that consists of two master nodes and two traffic nodes. This is the minimal deployment of a HAS cluster.
- Test-2 generates traffic to a HAS cluster that consists of two master nodes and four traffic nodes. This configuration has double the number of traffic nodes of Test-1.
- Test-3 generates traffic to a 10-processor HAS cluster that consists of two master nodes and eight traffic nodes. This configuration has double the number of traffic nodes of Test-2.
- Test-4 generates traffic to a HAS cluster that consists of two master nodes and 16 traffic nodes. This configuration has double the number of traffic nodes of Test-3.

In addition to testing the HAS architecture prototype, I performed a test on a single machine with a Pentium III processor and 512 MB of RAM. The goal of this test, called Test-0, is to establish the baseline performance and to expose the performance limitations of a single node.

5.4 Test-0: Experiments with One Standalone Traffic Node
This test consists of generating web traffic to a single standalone server running the Apache web server software, revealing the performance limitations of a single node. We use the results of this test to define the baseline performance. Apache 2.0.35 was running on this node, with the NFS server hosting the document repository running on the same network segment. Table 10 presents the results of the benchmark with a single server node. The results of Test-0 are consistent with the tests conducted in 2002 and 2003 with an older version of Apache (Section 3.7). The main lesson to learn from this benchmark is that the maximum capacity of a standalone server is an average of 1,033 requests per second. If the server receives requests over its baseline capacity, it becomes overloaded and unable to respond to all of them; hence the high number of failed requests shown in the table below. Table 10 presents the number of clients generating web traffic, the number of requests per second the server completed, and the throughput. WebBench generates this table automatically as it collects the results of the benchmarking test.
Number of Clients   Requests Per Second   Throughput (Bytes/Sec)   Throughput (KBytes/Sec)   Errors (Connection + Transfer)
1_client                  143                  825609                     806                         0
4_clients                 567                 3249304                    3173                         0
8_clients                 917                 5240358                    5118                         0
12_clients               1012                 5769305                    5634                         0
16_clients               1036                 5924397                    5786                       212
20_clients               1036                 6044917                    5903                       456
24_clients               1038                 6058103                    5916                       692
28_clients               1040                 6063940                    5922                       932
32_clients               1037                 6046431                    5905                      1012
36_clients               1037                 6052267                    5910                      1252
40_clients               1037                 6049699                    5908                      1484
44_clients               1029                 6008823                    5868                      1736
48_clients               1032                 6020502                    5879                      1983
52_clients               1033                 6032181                    5891                      2237

Table 10: The performance results of one standalone processor running the Apache web server

Figure 105: The results of benchmarking a standalone processor – requests per second

In Figure 105, we plot the results from Table 10: the number of clients versus the number of requests per second. We notice that when we reach 16 clients, Apache is unable to process additional incoming web requests and the scalability curve levels off. Even though we increase the number of clients generating traffic, the application server has reached its maximum capacity and is unable to process more requests. From this exercise, we conclude that the maximum number of requests per second we can achieve with a single processor is 1,035; we use this number to measure how our cluster scales as we add more processors. Figure 106 presents the throughput achieved with one processor. We plot the results from Table 10, the number of clients versus the throughput in KB/s; the maximum throughput possible with a single processor averages around 5,800 KB/s. In addition, WebBench provides statistics about failed requests: Table 10 lists the number of clients generating traffic and the number of failed requests. Apache starts rejecting incoming requests when we reach 16 simultaneous WebBench clients generating over 1,300 requests per second.

Figure 106: The throughput benchmarking results of a standalone processor

Figure 107: The number of failed requests per second on a standalone processor

Figure 107 illustrates the curve of successful requests per second combined with the curve of failed requests per second. As we increase the number of clients generating traffic to the processor, the number of failed requests increases. Based on the benchmarks with a single node, we can draw two main conclusions. The first is that a single processor can process up to one thousand requests per second before it reaches its threshold.
The second conclusion is that, after reaching the threshold, the application server starts rejecting incoming requests.

5.5 Test-1: Experiments with a 4-nodes HAS Cluster
This test consists of generating web traffic to the cluster consisting of two master nodes (configured in the active/standby redundancy model) and two traffic nodes. Table 11 presents the results of the benchmarking test: the number of clients generating web traffic, the number of requests per second the servers completed, and the resulting throughput. The WebBench controller machine generates this table automatically as it collects the results of the benchmarking tests from all the WebBench client machines.

Number of Clients   Requests Per Second   Throughput (Bytes/Sec)   Throughput (KBytes/Sec)   Errors (Connection + Transfer)
1_client                  178                 1102472                    1077                         0
4_clients                 723                 4388988                    4286                         0
8_clients                1458                 9269198                    9052                         0
12_clients               1744                10986836                   10729                         0
16_clients               1976                12455528                   12164                        97
20_clients               2030                12795912                   12496                       317
24_clients               2034                12821126                   12521                       547
28_clients               2043                12871553                   12570                       792
32_clients               2056                12953497                   12650                       991
36_clients               2062                12997621                   12693                      1203
40_clients               2087                13155206                   12847                      1466
44_clients               2089                13167813                   12859                      1689
48_clients               2086                13146174                   12838                      1973
52_clients               2089                13165089                   12857                      2217

Table 11: The results of benchmarking a four-node HAS cluster

The traffic nodes in the HAS cluster start rejecting new incoming requests when WebBench reaches 2,073 generated requests per second (1,976 successful requests versus 97 failed requests). As WebBench adds more web clients to generate traffic, we notice an increase in failed requests, while the number of successful requests remains almost constant, ranging between 2,030 and 2,089 requests per second.

Figure 108: The number of successful requests per second on a HAS cluster with four nodes

Figure 108 illustrates the results with a 4-processor cluster: the curve of the number of transactions per second versus the number of clients. Figure 109 shows the throughput curve, illustrating the achieved throughput per second with a 4-processor cluster as we increase the number of client machines generating traffic to the cluster.

Figure 109: The throughput results (KB/s) on a HAS cluster with four nodes

Figure 110: The number of failed requests per second on a HAS cluster with four nodes

Figure 110 shows the curve of successful requests per second combined with the curve of failed requests per second.
As we increase the number of clients generating traffic to the cluster, the number of failed requests increases.

5.6 Test-2: Experiments with a 6-nodes HAS Cluster
This test consists of generating web traffic to a HAS cluster made up of two master nodes and four traffic nodes. Table 12 presents the results of the benchmarking test: the number of client machines generating traffic, the number of successful requests per second, the throughput achieved, and the number of failed requests per second caused by either connection or transfer errors.

Number of Clients   Requests Per Second   Throughput (Bytes/Sec)   Throughput (KBytes/Sec)   Errors (Connection + Transfer)
1_client                  176                 1101321                    1075                         0
4_clients                 719                 4588988                    4481                         0
8_clients                1460                 8693081                    8489                         0
12_clients               1752                10986836                   10729                         0
16_clients               2014                12695058                   12398                         0
20_clients               2350                14813003                   14466                         0
24_clients               2711                17088532                   16688                         0
28_clients               3013                18985857                   18541                         0
32_clients               3405                21463095                   20960                         0
36_clients               3725                23473882                   22924                         0
40_clients               4031                25421634                   24826                         0
44_clients               4101                26329324                   25712                       189
48_clients               4152                26662688                   26038                       243
52_clients               4163                26694742                   26069                       473
56_clients               4167                26790905                   26163                       784
60_clients               4171                26810137                   26182                       981
64_clients               4171                26803711                   26175                      1193
68_clients               4170                26805003                   26177                      1390
72_clients               4168                26793795                   26166                      1667
74_clients               4194                26959538                   26328                      1983
78_clients               4189                26927397                   26296                      2312
82_clients               4220                27126669                   26491                      2633
86_clients               4168                26792407                   26164                      2987
90_clients               4170                26805263                   26177                      3271
94_clients               4168                26792407                   26164                      3592

Table 12: The results of benchmarking a HAS cluster with six nodes

The maximum number of successful requests per second is 4,220, and the maximum throughput reached is 26,491 KB/s.

Figure 111: The number of successful requests per second on a HAS cluster with six nodes

Figure 112: The throughput results (KB/s) on a HAS cluster with six nodes

Figure 113: The number of failed requests per second on a HAS cluster with six nodes

Figure 113 presents the curve of successful requests per second combined with the curve of failed requests per second. As we increase the number of clients generating traffic to the cluster, the number of failed requests increases.

5.7 Test-3: Experiments with a 10-nodes HAS Cluster
This test consists of generating web traffic to a HAS cluster made up of two master nodes and eight traffic nodes. Table 13 presents the results of the benchmarking test.
The maximum number of successful transactions per second is 8,316, and the maximum throughput achieved is 51,073 KB/s. The client mixes are reproduced below in the order reported by the WebBench controller.

Number of Clients   Requests Per Second   Throughput (Bytes/Sec)   Throughput (KBytes/Sec)
1_client            173                   1125072                  1099
4_clients           725                   4377974                  4275
8_clients           1440                  9258378                  9041
12_clients          1753                  10978510                 10721
16_clients          2015                  12694844                 12397
20_clients          2349                  14797572                 14451
24_clients          2729                  17194114                 16791
28_clients          3022                  19007587                 18562
32_clients          3339                  21372529                 20872
36_clients          3660                  23498460                 22948
40_clients          4042                  25416830                 24821
44_clients          4340                  26272234                 25656
48_clients          4560                  27561631                 26916
52_clients          4800                  29637244                 28943
56_clients          5090                  31580773                 30841
60_clients          5352                  33656387                 32868
64_clients          5674                  35681682                 34845
68_clients          5930                  37298145                 36424
72_clients          6324                  39770012                 38838
76_clients          6641                  41770148                 40791
80_clients          6910                  43462088                 42443
84_clients          7211                  45550281                 44483
88_clients          7460                  46789359                 45693
92_clients          7680                  48292606                 47161
96_clients          7871                  49192039                 48039
100_clients         8052                  50833660                 49642
104_clients         8158                  51116698                 49919
100_clients         8209                  51311680                 50109
104_clients         8278                  52060159                 50840
92_clients          8293                  52154505                 50932
96_clients          8310                  52267720                 51043
100_clients         8312                  52280300                 51055
104_clients         8307                  52248851                 51024
108_clients         8316                  52299169                 51073
112_clients         8313                  52280300                 51055
114_clients         8311                  52274010                 51049
118_clients         8310                  52261431                 51037
122_clients         8306                  52242561                 51018
126_clients         8319                  52318038                 51092
130_clients         8302                  52217403                 50994
134_clients         8308                  52255141                 51030
138_clients         8311                  52267720                 51043
142_clients         8312                  52280300                 51055

Table 13: The results of benchmarking a HAS cluster with 10 nodes

Figure 114 presents the performance curve illustrating the number of successful requests per second achieved with a HAS cluster that consists of two master nodes and eight traffic nodes. The master nodes are in the 1+1 active/standby model, and the traffic nodes follow the N-way redundancy model, where all traffic nodes are active. Figure 115 shows the throughput curve of the 10-processor HAS cluster.

Figure 114: The number of successful requests per second on a HAS cluster with 10 nodes

Figure 115: The throughput results (KB/s) on a HAS cluster with 10 nodes

5.8 Test-4: Experiments with an 18-Node HAS Cluster

This test consists of generating web traffic to a HAS cluster made up of two master nodes and 16 traffic nodes. This is the largest test we conducted: 18 nodes in the HAS cluster and 32 machines in the benchmarking environment, 31 of which generate traffic.
Figure 116 presents the number of successful transactions per second achieved with the 18-processor HAS cluster.

Figure 116: The number of successful requests per second on a HAS cluster with 18 nodes

In this test, the HAS cluster with 18 nodes achieved 16,001 successful requests per second, an average of 1,000 successful requests per second per traffic node in the HAS cluster.

Figure 117: The throughput results (KB/s) on a HAS cluster with 18 nodes

5.9 Scalability Charts

Table 14 presents the summary of the benchmarking results we collected from testing HAS clusters consisting of two, four, eight, and 16 traffic nodes, respectively; the table also includes the results of benchmarking a single standalone traffic node, which provides the baseline performance data.

Traffic Nodes in the Cluster             1      2      4      8      16
Total Transactions                       1032   2068   4143   8143   16001
Average Transactions per Traffic Node    1032   1034   1036   1017   1000

Table 14: The summary of the benchmarking results of the HAS architecture prototype

For each testing scenario, we recorded the maximum number of requests per second that each configuration supported. Dividing this number by the number of processors gives the maximum number of requests that each processor can serve per second in each configuration. Table 14 presents the number of successful transactions per traffic node: the total transactions row is the total number of successful transactions of the full HAS cluster as reported by WebBench, and the average transactions per traffic node row is the average number of successful transactions served by a single traffic node in the HAS cluster.

Figure 118: The results of benchmarking the HAS architecture prototype (total transactions in the HAS cluster and average transactions per traffic node, for one, two, four, eight, and 16 traffic nodes)

Figure 118 presents the scalability of the prototyped HAS cluster architecture. Starting with one processor, we established the baseline performance to be 1,032 requests per second. Next, we set up the HAS prototype and performed benchmarking tests as we scaled the number of traffic nodes from two to 16. The HAS cluster maintained approximately 1,000 requests per second per traffic node.
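The per-node averages in Table 14 and the deviation from the single-node baseline follow directly from the totals. For the largest configuration, a quick check gives

\[
\frac{16001}{16} \approx 1000 \ \text{requests/s per node}, \qquad \frac{1000 - 1032}{1032} \times 100\% \approx -3.1\%,
\]

which is the scaling loss discussed next.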
As we scaled by adding more traffic nodes to the SSA tier, we lost approximately 3.1% of the baseline performance per traffic node, as defined in Section 5.4. These results represent a considerable improvement over the 40% decrease in performance experienced with a cluster built using existing software that uses traditional methods of scaling (Section 3.9).

Figure 119: The scalability chart of the HAS architecture prototype (average transactions per second per traffic node versus the number of processors in the cluster)

Figure 119 illustrates the scalability chart of the HAS architecture prototype. The results demonstrate close to linear scalability as we increased the number of traffic nodes up to 16: we were able to scale from a standalone single node serving an average maximum of 1,032 requests per second to 18 nodes in the HAS cluster (16 of them serving traffic) with an average of 1,000 requests per second per traffic node. The scaling is achieved with a decrease of approximately 3.1% in performance compared to the baseline.

5.10 Validation of High Availability

The validation of HA requires specialized testing and poses new challenges. In a real-world situation, validating HA requires special monitoring and reporting tools, and running the system for months under a controlled stream of traffic while invoking error cases. The monitoring and reporting tools would capture the details of the system under test and produce a result matrix. However, since we did not have access to specialized HA testing tools, we performed basic test scenarios to ensure that the claimed HA support is provided and works as described. An important feature of a highly available system is its ability to continue providing service even when certain cluster sub-systems fail. Our testing strategy was based on provoking common faults and observing how each fault affects the service and whether it leads to service downtime. In our case, the system is the HAS architecture prototype, and the cluster nodes run web servers that provide service to web users. If there is service downtime, users do not get replies to their web requests. The following sub-sections discuss the high availability testing for the HAS cluster and cover testing the connectivity (Ethernet connection and routers), data availability (redundant NFS server), master node availability, and traffic node availability.

5.10.1 Experiments with Connectivity Availability

Figure 120 illustrates the experiments we performed to test the high availability features of the HAS architecture.

Figure 120: The possible connectivity failure points. The diagram shows a traffic node and a master node, each running the Ethernet redundancy daemon (EthD) with two Ethernet cards, the TCD, TMD, and HBD daemons, and connections through Router 1/LAN 1 and Router 2/LAN 2; the numbers 1 through 5 mark the experiments.

The failures tested include Ethernet daemon failure, failure of an Ethernet adapter, router failure, discontinued communication between the traffic manager and the traffic client daemon, and discontinued communication between the heartbeat instances running on the master nodes.
The experiments consisted of provoking a failure to observe how the HAS cluster reacts to it, how it affects the service provided, and how the cluster recovers.

5.10.1.1 Ethernet Daemon Becomes Unavailable

For this type of failure, there is no differentiation between a failure on a master node and one on a traffic node. If we shut down the Ethernet daemon on a cluster node (Figure 120, experiment 1), there is no direct effect on the service provided. The one negative impact occurs if the primary Ethernet card (Ethernet Card 1) subsequently fails: the Ethernet redundancy daemon is not running to detect the failure and switch to the backup Ethernet card (Ethernet Card 2). As a result, the node becomes isolated from the cluster and the traffic client daemon is not able to connect to the traffic manager. The traffic manager declares the traffic node unavailable and removes it from its list of available traffic nodes. However, this is an unusual failure scenario in which both the primary Ethernet adapter and the Ethernet redundancy daemon become unavailable on the same node.

5.10.1.2 Ethernet Card is Unresponsive

As in the previous case, it makes no difference whether the failure occurs on a master node or a traffic node. If the Ethernet card becomes unresponsive and unavailable (Figure 120, experiment 2), the Ethernet redundancy daemon detects the failure (within a range of 350 ms to 400 ms), considers the Ethernet card out of order, and starts the network adapter swap (Section 4.7.3). As a result, the Ethernet redundancy daemon designates the standby Ethernet card (Ethernet Card 2) as the primary adapter. Section 6.1.9.3 covers the workings of this scenario. If the unresponsive Ethernet card receives traffic while it is down, or while the transition to the second Ethernet card is not yet complete, new incoming connections stall and ongoing connections are lost.

5.10.1.3 Router Failure

If the router fails (Figure 120, experiment 3), we consider the failure a network problem; it is beyond the scope of the HAS cluster architecture.

5.10.1.4 Discontinued Communication between the TM and the TC

This test case examines the scenario where we disrupt the communication between the TM and the TC (Figure 120, experiment 4). As a result, the TM stops receiving load alerts from the TC. After a predefined timeout, the TM removes the traffic node from its list of available traffic nodes and stops forwarding traffic to it. When we restore communication between the TM and the TC, the TC resumes sending its load messages to the TM; the TM then adds the traffic node back to its list of available nodes and starts forwarding traffic to it. Section 4.27.9 examines a similar scenario.
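The timeout behavior observed in this experiment can be summarized in a few lines. The sketch below is a minimal illustration of the bookkeeping, not the prototype's source; the three-second timeout, the map layout, and the function names are assumptions.

    # Illustrative TM-side keep-alive bookkeeping: traffic nodes that stop
    # sending load reports age out and no longer receive forwarded traffic.
    import time

    TIMEOUT = 3.0    # seconds without a load report before removal (assumed)
    nodes = {}       # node address -> (time of last report, load index)

    def on_load_report(addr, load_index):
        """Called for every TC load message; (re)admits the node."""
        nodes[addr] = (time.time(), load_index)

    def available_nodes():
        """Prune silent nodes, then return the rest, least loaded first."""
        now = time.time()
        for addr in [a for a, (t, _) in nodes.items() if now - t > TIMEOUT]:
            del nodes[addr]
        return sorted(nodes, key=lambda a: nodes[a][1])

Restoring communication re-admits the node automatically, since the next load report simply re-creates its entry, which matches the recovery behavior described above.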
5.10.1.5 Discontinued Communication between the Heartbeat Instances

This test case examines what happens when we disconnect the communication between the heartbeat instances running on the master nodes (Figure 120, experiment 5). The outcome depends on whether the disconnected node is the active master node or the standby master node. If the heartbeat instance running on the active master node becomes unavailable, the heartbeat instance running on the standby node stops receiving the keep-alive message. As a result, the heartbeat instance on the standby node declares the primary master node unavailable, making the standby master node the active master node and the owner of the virtual service. The new master node starts receiving traffic within a delay of 200 ms. If the heartbeat instance running on the standby node becomes unavailable, the heartbeat instance running on the active master node stops receiving the keep-alive message. The cluster does not undergo a reconfiguration; however, the HA tier of the HAS cluster becomes vulnerable to a SPOF, since only one master node is available and it is in the active state.

5.10.2 Experiments with Data Availability

The availability of data is essential in a clustered environment where data resides on shared storage and multiple clients access it. In the HAS architecture prototype, the data is duplicated and available on two NFS servers through a special implementation of the NFS server code and an updated implementation of the mount program. Our testing methodology uses a direct approach. Based on Figure 121, two components can affect data availability: the availability of the NFS server daemons and the availability of the master nodes. Master nodes are redundant, and the failure of one master node does not affect the availability of data or access to the provided service. The only scenario that could lead to data unavailability is the NFS server daemons running on the master nodes crashing; for this reason, we implemented redundancy in the NFS server code. The two test cases we experimented with are shutting down the NFS server daemon on a master node and disconnecting the master node from the network. In both scenarios, there was no interruption to the service provided; instead, there was a delay ranging between 450 ms and 700 ms in receiving the requested document.

Figure 121: The tested setup for data redundancy (master nodes A and B on LAN 1 and LAN 2, each with a local disk mounted at /mnt/CommonNFS, running the primary and secondary NFS server daemons respectively)

5.10.3 Experiments with Traffic Node Availability

Traffic nodes follow the N redundancy model: if one node fails and becomes unavailable, the traffic manager forwards traffic to the other available traffic nodes. The failures tested include connectivity problems, hardware and software problems, and TC problems.

Connectivity problems: Section 5.10.1 discusses the validation of connectivity high availability.

Unknown hardware or software problems leading to abnormal node unavailability: If the traffic node goes to a halt state (power off), the traffic manager declares that node unavailable (by taking it off its list of available traffic nodes), since the traffic client daemon becomes unreachable and unable to report node status to the traffic manager. Section 4.27.9 presents this scenario.

TC problem: If the traffic client daemon becomes unavailable because of a software or hardware problem, the daemon does not report the node availability and status to the traffic managers, and the traffic node stops receiving traffic. Section 4.27.9 discusses a similar scenario.

5.10.4 Experiments with Connection Synchronization

To test the connection synchronization mechanism, we open a connection to the virtual service while the master node is active. Then we cause a fail-over by powering down the master node. At this point, the connection stalls. Once the CVIP address has failed over to the standby master node, the connection continues. Streaming is a useful way to test this, as streaming connections are by nature open for a long time. Furthermore, streaming provides intuitive feedback, as the video pauses and then continues.
Notably, increasing the buffer size of the streaming client software eliminates the pause.

5.11 HA-OSCAR Architecture: Modeling and Availability Prediction

The HA-OSCAR project team evaluated the HA-OSCAR architecture and its system failure and recovery behavior using the Stochastic Reward Nets (SRN) modeling approach, with the goal of measuring the availability improvements introduced by the HA-OSCAR architecture. This section discusses the Stochastic Reward Nets modeling approach, the Stochastic Petri Net Package (SPNP), the modeled architecture, and the results.

5.11.1 Stochastic Reward Nets and Stochastic Petri Net Package

The HA-OSCAR project team built an SRN model to simulate the failure-repair behavior of the HA-OSCAR cluster, and to calculate and predict its availability. The goal of the SRN model was to analyze the availability of the HA-OSCAR cluster based on a range of predefined assumptions and parameters. SRNs are used in availability evaluation and prediction for complicated systems when the time-dependent behavior is of interest; the team applied them to aid the automated generation and solution of the underlying large and complex Markov chains [134]. The Stochastic Petri Net Package (SPNP) allows the specification of SRN models and the computation of steady-state and transient-state solutions. The HA-OSCAR team used the SPNP to build the HA-OSCAR SRN model and predict the availability of the system [135]. SPNP [136] is a modeling tool designed for SPN models [137], used for performance, dependability (reliability, availability, safety), and performability analysis of complex systems. It uses efficient and numerically stable algorithms to solve input models based on the theory of SRNs. The SRN models are described in the SPNP input language, CSPL (C-based SPN Language), an extension of the C programming language with additional constructs that facilitate easy description of SPN models. Alternatively, if the user does not want to describe the model in CSPL, a graphical user interface is available to specify all the characteristics as well as the parameters of the solution method chosen to solve the model [138].

5.11.2 Building the HA-OSCAR Model

The HA-OSCAR project team used the SRN approach to develop a failure-repair behavior model for the system and to determine the high availability of the cluster. They utilized the SPNP to build and solve the HA-OSCAR SRN model. Figure 122 illustrates how the behavior of the HA-OSCAR cluster is divided into three sub-models: the server, network connection, and client sub-models. The HA-OSCAR team used the SRN model to study the availability of the HA-OSCAR cluster through the example illustrated in Figure 122.

Figure 122: The modeled HA-OSCAR architecture, showing the three sub-models (server, network connection, and clients)

Figure 123 shows a screen shot of the SPNP modeling tool. The HA-OSCAR team also studied the overall cluster uptime and the impact of different polling interval sizes in the fault monitoring mechanism.

Figure 123: A screen shot of the SPNP modeling tool

Input Parameter                                    Numerical Value
Mean time to primary server failure, 1/λps         5,000 hrs
Mean time to primary server repair, 1/µpsr         4 hrs
Mean time to takeover primary server, 1/µst        30 sec
Mean time to standby server failure, 1/λss         5,000 hrs
Mean time to standby server repair, 1/µssr         4 hrs
Mean time to primary LAN failure, 1/λpl            10,000 hrs
Mean time to primary LAN repair, 1/µplr            1 hr
Mean time to takeover primary LAN, 1/µlt           30 sec
Mean time to standby LAN failure, 1/λsl            10,000 hrs
Mean time to standby LAN repair, 1/µslr            1 hr
Mean time to client permanent failure, 1/λp        2,000 hrs
Mean time to client intermittent failure, 1/λi     1,000 hrs
Mean time to system reboot, 1/µsrb                 15 min
Mean time to client reboot, 1/µcrb                 5 min
Mean time to client reconfiguration, 1/µrc         1 min
Mean time to client repair, 1/µrp                  4 hrs
Permanent failure coverage factor, cp              0.95
Intermittent failure coverage factor, ci           0.95

Table 15: Input parameters for the HA-OSCAR model

Table 15 presents the input parameter values used in the computation of the measurements. These values are for illustrative purposes only. The model makes several assumptions. First, the HA-OSCAR cluster functions properly only if either the primary master node or the standby master node is functioning. Second, at least one LAN, either the primary LAN or the standby LAN, must be functioning properly. Third, the quorum for the compute nodes must be present; the quorum value Q denotes the minimum number of nodes required for the system to keep functioning. For example, in the third row of Table 16, we have an 8-node system (N = 8), and at least 5 nodes (Q = 5) must be alive to keep the system working. The final assumption is that the system is not undergoing any rebooting or reconfiguration. The availability of the cluster at time t is computed as the expected instantaneous reward rate E[X(t)] at time t, and its general expression is

\[
E[X(t)] = \sum_{k \in \tau} r_k \, \pi_k(t)
\]

where r_k represents the reward rate assigned to state k of the SRN, τ is the set of tangible markings, and π_k(t) is the probability of being in marking k at time t [138][139].

5.11.3 The Results

The results of the HA-OSCAR availability prediction study are available in [140] and [141]. The results demonstrate that the component redundancy in HA-OSCAR is effective in improving the cluster system availability. Table 16 presents the steady-state cluster availability and the mean cluster down time per year for the different HA-OSCAR configurations.

System Configuration (N)   Quorum Value (Q)   System Availability (A)   Mean cluster down time per year (t, minutes)
4                          3                  0.999933475091            34.9654921704
6                          4                  0.999933335485            35.0388690840
8                          5                  0.999933335205            35.0390162520
16                         9                  0.999933335204            35.0390167776

Table 16: System availability for different configurations

The system availabilities of the various configurations are very close to one another. Once the quorum voting mechanism is introduced in the client sub-model, the system availability is not sensitive to changes in the client configuration: when we add more clients to improve the system performance (N increases in the first column) while keeping Q at N/2 + 1, the availability of the system remains almost unchanged, as the third column shows.
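The mean cluster down time column follows directly from the steady-state availability once it is expressed in minutes per year. For the four-node configuration:

\[
t = (1 - A) \times 525{,}600 \ \text{min/yr} = (1 - 0.999933475091) \times 525{,}600 \approx 34.97 \ \text{min/yr},
\]

which matches the 34.9654921704 entry in Table 16 and confirms the unit of the last column.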
Figure 124 illustrates the instantaneous availabilities of the system when it has eight clients and the quorum is five. The modeling and availability measurements using the SPNP yielded the calculated instantaneous availabilities of the system and its parameters.

Figure 124: System instantaneous availabilities

Figure 125 illustrates the total availability (including planned and unplanned downtime) improvement analysis of the HA-OSCAR architecture versus the single-head-node Beowulf architecture [140]. The results show a steady-state system availability of 99.9968%, compared to 92.387% for a Beowulf cluster with a single head node [140]. Additional benefits include higher serviceability, such as the ability to incrementally upgrade and hot-swap cluster nodes, the operating system, services, applications, and hardware; this further reduces planned downtime, which benefits the overall aggregate performance.

Figure 125: Availability improvement analysis of HA-OSCAR versus the Beowulf architecture. The underlying data points, as a function of the mean time to failure, are:

Mean time to failure (hrs)    1000       2000       4000       6000       8000       10000
Beowulf availability          0.905797   0.915751   0.920810   0.922509   0.923361   0.923873
HA-OSCAR availability         0.999684   0.999896   0.999951   0.999962   0.999966   0.999968

Model assumptions: scheduled downtime = 200 hrs; nodal MTTR = 24 hrs; failover time = 10 s; during maintenance on the head node, the standby node acts as primary.

5.11.4 Discussion

The HA-OSCAR architecture proof-of-concept implementation and the experimental and analysis results suggest that the HA-OSCAR architecture offers a significant enhancement and a promising solution for providing a highly available Beowulf-class cluster architecture [140][142][143]. The availability of the experimental system improves substantially, from 92.387% to 99.9968%. The polling interval for failure detection exhibits a linear relationship to the total cluster availability.

5.12 Impact of the HAS Architecture on Open Source

In 2001, I participated in the Open Cluster Group [144] and proposed the creation of a working group whose goal was to design and prototype a highly available clustering stack for clusters that run mission-critical applications. In 2002, the Open Cluster Group announced the creation of the HA-OSCAR (High Availability Open Source Cluster Application Resource) working group [21], which consisted of Dr. Stephen S. Scott (Oak Ridge National Lab), Dr. Chokchai Leangsuksun (Louisiana Technical University), and the author of this thesis. As a result, we started the open source HA-OSCAR project to provide the combined power of high availability and high performance computing based on the initial HAS architecture. The goal of the HA-OSCAR project is to enhance a Beowulf cluster system for mission-critical applications, to achieve high availability and eliminate single points of failure, and to incorporate self-healing mechanisms, failure detection and recovery, and automatic failover and failback. On March 23, 2004, the HA-OSCAR group announced the HA-OSCAR 1.0 release, which generated over 5,000 downloads within the first 24 hours of the announcement. It provides an installation wizard and a web-based administration tool that allow a user to create and configure a multi-head Beowulf cluster. Furthermore, the HA-OSCAR 1.0 release supports high availability capabilities for Linux Beowulf clusters. To achieve high availability, the HA-OSCAR architecture adopts component redundancy to eliminate SPOFs, especially at the head node.
The HA-OSCAR architecture also incorporates a self-healing mechanism, failure detection and recovery, and automatic failover and failback [146]. In addition, it includes a default set of monitoring services to ensure that critical services, hardware components, and important resources are always available at the control node.

5.13 HA-OSCAR versus Beowulf Architecture

A Beowulf cluster consists of two node types: a head node server and multiple identical client nodes. The server, or head node, is responsible for serving user requests and distributing them to clients via scheduling and queuing software. The clients, or compute nodes, are dedicated to computation.

Figure 126: The architecture of a Beowulf cluster (clients reach the head node, which connects through a router to the compute nodes)

Figure 126 illustrates the architecture of a Beowulf cluster. The single-head-node architecture of the Beowulf cluster is prone to a single point of failure, as is the cluster communication path: an outage of either can render the entire cluster unusable. There are various techniques for implementing a cluster architecture with high availability, including active/active, active/hot standby, and active/cold standby. In the active/active technique, both head nodes simultaneously provide services to external requests; if one head node goes down, the other node takes over total control. A hot-standby head node, on the other hand, monitors system health and only takes over control if there is an outage at the primary head node. The cold-standby architecture is similar to hot standby, except that the backup head node is activated from a cold start. The key effort focused on simplicity by supporting self-cloning of the cluster master node (redundancy and automatic failover). While the aforementioned failover concepts are not new, HA-OSCAR's effortless installation and combined HA and HPC architecture are unique, and its 1.0 release is the first known field-grade HA Beowulf cluster release [34]. The HA-OSCAR experimental and analysis results, discussed in Section 5.11, suggest a significant improvement in availability from the dual-head architecture [145]. Figure 127 illustrates the HA-OSCAR architecture. The HA-OSCAR architecture deploys duplicate master nodes to offer server redundancy, following the active/standby approach, where one primary master node is active and the second master node is standby [35]. Each node in the HA-OSCAR architecture has two network interface cards (NICs): one has a public network address, and the other is on a private local network. The HA-OSCAR project uses the SystemImager [147] utility for building and storing system images, as well as providing a backup for disaster recovery purposes. The HA-OSCAR 1.0 release supports high availability capabilities for Linux Beowulf clusters. It provides a graphical installation wizard and a web-based administration tool that allow the administrator of the HA-OSCAR cluster to create and configure a multi-head Beowulf cluster. In addition, HA-OSCAR includes a default set of monitoring services to ensure that critical services, hardware components, and certain resources are always available at the master node. The current version of HA-OSCAR, 1.0, supports active/standby for the head nodes.
Figure 127: The architecture of HA-OSCAR (users reach the primary and standby head nodes over the public network; heartbeat links the head nodes; optional reliable storage and redundant image servers sit outside the cluster; redundant routers and network connections lead to the compute nodes)

5.14 The HA-OSCAR Architecture versus the HAS Architecture

When we established the HA-OSCAR project in 2002, the HA-OSCAR architecture was based on the HAS architecture. Both architectures target high availability at the master node level, aim to provide redundancy to reduce single points of failure, and support redundant network connections. However, since the applications that run on the HA-OSCAR architecture have different characteristics from the applications running on a HAS architecture, the two architectures, while sharing the same base, have grown to have different characteristics.

5.14.1 Target Applications

The HAS architecture and the HA-OSCAR architecture target different types of applications. The HA-OSCAR architecture aims at HPC applications, which focus on dividing programs into smaller pieces that execute simultaneously on separate processors within the cluster. The HAS architecture, on the other hand, focuses on client/server applications that run over the web and are characterized by short transactions, short response times, a thin control path, and static data delivery.

5.14.2 Intensity of Jobs

The HAS architecture supports thousands of transactions per second, while the HA-OSCAR architecture, targeted at HPC applications, supports large computing jobs that are divided into smaller jobs executed on the cluster compute nodes. The HA-OSCAR architecture receives fewer jobs, but the jobs run for longer periods.

5.14.3 Homogeneous versus Heterogeneous Node Hardware

The HA-OSCAR architecture assumes that all cluster nodes are homogeneous, with the same hardware characteristics. Heterogeneous hardware is possible; however, it does not allow HA-OSCAR to take advantage of the faster nodes in the cluster. In contrast, the HAS architecture accommodates differences in the hardware configurations of the nodes and fully uses all node resources, largely because of its dynamic feedback mechanism.

5.14.4 Dynamic Feedback Mechanism

The HAS architecture supports a dynamic feedback mechanism that allows traffic nodes to report their load to the master nodes, guiding the executor of the traffic distribution policy to make the best possible distribution at any given moment. Furthermore, this mechanism acts as a keep-alive message: whenever a master node does not receive this message from a traffic node within a pre-defined timeout, the master node stops forwarding requests to that traffic node.

5.14.5 Traffic Distribution versus Load Balancing

Since the types of applications supported by the HA-OSCAR architecture differ from those supported by the HAS architecture, the underlying distribution mechanisms are distinct. In the HA-OSCAR architecture, distribution is a load balancing function over a single job divided into smaller jobs and distributed across compute nodes. In the HAS architecture, the cluster receives thousands of requests per second, and the traffic manager must decide where to forward these incoming requests among the available traffic nodes.

5.14.6 Failure Discovery and Recovery Mechanisms

Both the HA-OSCAR architecture and the HAS architecture support failure detection and recovery mechanisms.
However, these mechanisms target different system components, with different failure detection and recovery times. Failure recovery in HA-OSCAR takes between 3 and 5 seconds [148], compared to a failure recovery ranging between 200 ms and 700 ms in the HAS cluster, depending on the type of failure.

5.14.7 Support for the Redundancy Models

The HAS architecture supports the 1+1 active/active redundancy model at the HA tier. This is a roadmap feature for the HA-OSCAR architecture.

5.14.8 Connection Synchronization

The HAS architecture supports connection synchronization at the HA tier; we discuss this feature in Section 4.22. The HA-OSCAR architecture does not provide such capabilities.

5.14.9 High Availability of Traffic Nodes

The availability of traffic nodes (also called compute nodes) in the HA-OSCAR architecture is less critical than the availability of traffic nodes in the HAS architecture.

5.14.10 Application Keep-alive Mechanism

The HAS architecture supports a keep-alive mechanism for applications running on traffic nodes. In the event that an application becomes unavailable, the LDirectord module notifies the executor of the traffic distribution policy (the TM), and as a result, the TM stops forwarding traffic to the traffic node with the unavailable application.

5.14.11 Other Differentiations

HA-OSCAR deploys a software utility called SystemImager to provide its installation infrastructure. In addition, the HA-OSCAR installation wizard uses a graphical user interface to install and set up the HA-OSCAR cluster. In the HAS architecture prototype, the installation infrastructure was built from scratch, and it does not offer a graphical user interface to guide the administrator in installing a HAS cluster.

5.15 HAS Architecture Impact on Industry

The Open Source Development Labs (OSDL) is a not-for-profit organization founded in 2000 by IT and telecommunication companies to accelerate the growth and adoption of Linux-based platforms and standardized platform architectures. The Carrier Grade Linux (CGL) initiative at OSDL aims to standardize the architecture of telecommunication servers and enhance the Linux operating system for such platforms. The CGL Working Group has identified three main categories of application areas into which it expects the majority of applications implemented on CGL platforms to fall: gateways, signaling servers, and management servers, each with different characteristics. A gateway processes a large number of small requests that it receives and transmits over a large number of physical interfaces; gateways must perform in a timely manner, close to hard real time. Signaling servers require soft real-time response capabilities and manage tens of thousands of simultaneous connections; a signaling server application is context-switch and memory intensive because of the quick switching and the capacity required to manage large numbers of connections. Management applications are data and communication intensive; their response time requirements are less stringent than those of signaling and gateway applications.

Figure 128: The CGL cluster architecture based on the HAS architecture

The OSDL released version 2.0 of the Carrier Grade specifications in October 2003. Version 2.0 of the specifications introduced support for clustering requirements, and the cluster architecture is based on the work presented in this dissertation.
Figure 128 illustrates the CGL architecture, which is based on the HAS architecture. In June 2005, the OSDL released version 3.1 of the specification. The Carrier Grade architecture is a standard for the types of communication applications presented earlier.

Chapter 6 Contributions, Future Work, and Conclusion

This chapter presents the contributions of the work, the future work, and the conclusions.

6.1 Contributions

The initial goal of this dissertation was to design and prototype the technology needed to demonstrate the feasibility of a web cluster architecture that is highly available and able to scale linearly for up to 16 processors to meet increased web traffic.

Figure 129: The contributions of the HAS architecture. The figure groups the contributions as follows:
- HAS architecture: a building block approach for scalable and HA web clusters; software components that can be reused in other environments and for other architectures; support for different redundancy models; dynamic addition of traffic nodes as traffic increases; the infrastructure for cluster membership, cluster storage, fault management, recovery mechanisms, and traffic distribution; capable of maintaining baseline performance per node for up to 16 traffic nodes.
- Lightweight and efficient traffic distribution: the traffic client daemon, the traffic manager daemon, the HAS distribution algorithm, the keep-alive mechanism, and support for heterogeneous cluster nodes.
- Application availability (LDirectord): enhancements to the existing LDirectord implementation, and adaptations to integrate LDirectord with the HAS architecture and to communicate with the TC and TM.
- Single entry point to the HAS cluster (cluster IP interface): concept and prototype, with full transparency.
- Connection synchronization (saru): adaptations and integration with the HAS prototype, and improved integration with heartbeat.
- Heartbeat (HBD): rewrote parts of the implementation to decrease failure detection time, with adaptations to integrate with the HAS architecture.
- HA Ethernet connection: the Ethernet redundancy daemon, and several fixes to the Ethernet device driver that made it more robust and less prone to failures under heavy load.
- Highly available network file system: support for HA NFS in the Linux kernel, and a modified mount program to support mounting dual redundant NFS servers.

We achieved our goal with the HAS architecture, which supports continuous service through its high availability capabilities and provides close to linear scalability through the combination of several elements, including efficient traffic distribution, the cluster virtual IP layer, and the connection synchronization mechanism. Figure 129 illustrates the contributions, grouped into distinct areas: application availability, network availability, data availability, master node availability, connection synchronization, the single cluster IP interface, and traffic distribution. Since the HAS architecture prototype follows the building block approach, these contributions can be reused in different environments outside of the HAS architecture and can function completely independently outside of a cluster environment. The HAS architecture is based on loosely coupled nodes and provides a building block approach for designing and implementing software components that can be reused for other environments and architectures.
It provides the infrastructure for cluster membership, cluster storage, fault management, recovery mechanisms, and traffic distribution. It supports various redundancy models for each tier of the architecture and allows seamless software and hardware upgrades without interruption of service. In addition, the HAS architecture is able to maintain the baseline performance for up to 18 cluster nodes (16 traffic nodes), validating close to linear scaling. The HAS architecture integrates these contributions within a framework that allows us to build scalable and highly available web clusters. The following sub-sections examine these contributions.

6.1.1 Highly Available and Scalable Architecture

We have achieved our goal of a highly available and scalable system architecture that provides close to linear scalability for up to 16 nodes. The HAS architecture is a generic cluster-based architecture targeted at web servers. A HAS cluster consists of multiple loosely coupled nodes connected through a high-speed network. The loosely coupled cluster model is a basic clustering technique that is suitable for web servers and similar types of application servers. We do not exclude specializations; however, we can handle them in the future to meet the requirements of specialized applications.

6.1.2 Scalability and Capacity

The HAS architecture is characterized by the ability to add server nodes to the cluster without affecting serviceability or uptime. When we experience an increase in traffic, we can add more traffic nodes to the SSA tier and gain additional capacity. The benchmarking results (Section 5.2) demonstrate that the HAS architecture prototype is able to sustain the baseline performance per processor as we increase the number of processors in the HAS cluster, with a minimal impact of approximately -3.1% on performance (Section 5.9) for up to 16 traffic processors. Figure 119 illustrates the close to linear scalability achieved with the HAS architecture prototype, which consists of two master nodes and 16 traffic nodes. In addition, because the SSA tier of the HAS architecture supports the N-way redundancy model, the architecture does not force us to deploy traffic nodes in pairs. As a result, we can deploy exactly the right number of traffic nodes to meet our traffic demands without having traffic nodes sitting idle. We are also able to scale each tier of the architecture independently of the other tiers.

6.1.3 Support for Heterogeneous Hardware

The HAS cluster architecture supports heterogeneous hardware. Typically, clusters are deployed with homogeneous nodes that have the same hardware configurations. A HAS cluster can consist of traffic nodes with varying processor speeds and RAM capacities and still achieve efficient resource utilization, largely because of the efficient and lightweight traffic management scheme.

6.1.4 Lightweight and Efficient Traffic Distribution

Among the surveyed works, the SWEB project (Section 2.10.3) and the Hierarchical Redirection-Based Web Server Architecture (Section 2.10.1) considered using a load feedback mechanism to monitor the load of their cluster nodes. The other surveyed works proposed mechanisms and distribution schemes that do not take into consideration the capacity of each traffic node. For the HAS architecture, there was a need to either implement continuous load monitoring or enhance an available traffic distribution implementation with load monitoring capability. The HAS traffic distribution scheme monitors the load of the traffic nodes using multiple metrics, and uses this information to distribute incoming traffic among the traffic nodes (Section 4.23.2). The scheme enables the traffic manager to detect and avoid situations where nodes in the cluster operate inefficiently because of overloading; it enables the cluster to operate at or near full capacity in overload situations, in contrast to conventional cluster architectures, which are subject to congestion and collapse under overload. In addition, the HAS traffic distribution scheme integrates a keep-alive mechanism that allows master nodes to know when a traffic node is available for service and when it is unavailable because of software or hardware problems. Section 4.23 presents the HAS traffic management scheme. The scheme is scalable (Section 5.2) and does not present a performance bottleneck for up to 16 traffic nodes. It can be extended to support additional distribution algorithms, to monitor additional system resources (such as network bandwidth), and to support the concept of cluster sub-zones (discussed as a future work item).
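To make the idea concrete, the sketch below shows one way a distribution policy can turn TC-reported load indices into a weighted choice of traffic node, so that lightly loaded (or more capable) nodes receive proportionally more of the incoming requests. It is an illustration only, under assumed node names and a 0-to-100 load scale; the actual HAS distribution algorithm is the one described in Section 4.23.

    # Illustrative load-weighted node selection: spare capacity becomes the
    # selection weight, so heterogeneous or lightly loaded nodes receive
    # proportionally more requests. Names and the 0-100 scale are assumed.
    import random

    def pick_node(load_index):
        """load_index: node -> reported load (0 = idle, 100 = saturated)."""
        weights = {n: max(1, 100 - load) for n, load in load_index.items()}
        total = sum(weights.values())
        r = random.uniform(0, total)
        for node, weight in weights.items():
            r -= weight
            if r <= 0:
                return node

    loads = {"traffic1": 20, "traffic2": 55, "traffic3": 90}
    print(pick_node(loads))  # traffic1 is chosen ~8x more often than traffic3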
6.1.5 High Availability

The architecture tiers support two essential redundancy models: the 1+1 (active/standby and active/active) and the N-way redundancy models. As a result, the HAS cluster architecture achieves high availability through redundancy at various levels of the architecture: network, processors, application servers, and data storage. This allows us to reconfigure network settings and upgrade hardware, software, and the operating system without service downtime. Other contributions in this area include mechanisms to detect and recover from Ethernet failures, master node failures, NFS failures, and application (web server software) failures.

6.1.6 Cluster Single IP Interface

One of the challenges of clustering is the ability to present the cluster nodes to the outside world as a single entity without creating a bottleneck that would affect performance. The proposed approach is transparent to web clients and allows them to reach an unlimited number of servers (Section 4.21). The proposed CVIP approach provides many advantages for a single cluster interface: transparency, scalability (it supports a high load of traffic), availability, fault tolerance, and dynamic connection distribution.

6.1.7 Continuous Service

The ability to synchronize connections provides continuous service even in the event of software or hardware failures. Section 4.22.2 discusses the master/slave approach to providing connection synchronization between the master nodes in the HA tier of the HAS cluster. The approach allows ongoing connections to continue in progress even when the active master node fails.

6.1.8 Community and Industry Contributions

This work has been a significant contributor to industry standards and to the open source community through the HA-OSCAR project. The HA-OSCAR architecture has remarkably influenced thinking in high performance computing: researchers and system designers of HPC cluster-based systems use the HA-OSCAR architecture as a concept prototype for a highly available HPC cluster. The HA-OSCAR project is the first prototype of an architecture that provides high availability for a high performance computing platform.
The first release of HA-OSCAR generated over five thousand downloads within the first 24 hours of its release, followed by thousands more in the weeks and months after. This public rush is an indication of the community of users who are using, testing, and deploying the HA-OSCAR architecture for their specific needs. It is also important to note that there is an active community of users on the HA-OSCAR project discussion board and mailing list. The HA-OSCAR architecture is based on the HAS architecture and provides an open source implementation that is freely available for download, with a substantial user community. Section 5.15 described our contributions to the Carrier Grade specifications, which define an architectural model for telecommunication platforms providing voice and data communication services. The Carrier Grade architecture model is an industry standard largely based on the HAS architecture.

6.1.9 Source Code Contributions

In prototyping the HAS architecture, several software components were developed and other existing software was enhanced and modified to suit our needs. From the beginning, the objective was to keep these software components generic, scalable in terms of performance and future feature additions, and as independent of each other as possible. For some of the software we reached that goal; for the rest we did not, and we plan further improvements as discussed in the following sub-sections. We did not target the implementations at a specific cluster deployment model, such as a web cluster; rather, we sought to maintain independence from any specific deployment case. In order to build the HAS web cluster, redundancy features for the network file system and the Ethernet connections were implemented at the kernel level and in user space in the Linux operating system. A part of the work behind this thesis involved writing the source code for the various system components. The following sub-sections describe the source code contributions and their state of development.

6.1.9.1 Redundant NFS Server

It was essential to have this functionality in place to enable highly available shared storage that is accessible to all cluster nodes without a SPOF. As a result, in the event that the primary NFS server running on the active master node fails, the secondary NFS server running on the standby node continues to provide storage access without service discontinuity, transparently to end users. The rsync utility provides the synchronization between the two NFS servers. Table 17 lists the Linux kernel files modified to support NFS redundancy.

Modified Linux Kernel Source File       Description of Changes
/usr/src/linux/fs/nfs/inode.c           Added support for NFS redundancy
/usr/src/linux/net/sunrpc/sched.c       Raise the timeout flag when there is one
/usr/src/linux/net/sunrpc/clnt.c        Raise the timeout flag when there is one

Table 17: The changes made to the Linux kernel to support NFS redundancy

The implementation of the HA NFS server is stable; however, it requires upgrading to the latest stable Linux kernel release, version 2.6. Furthermore, the HA NFS implementation requires a new implementation of the mount program to support mounting multi-host NFS servers instead of a single file server mount. This functionality is provided: in the new mount program, the addresses of the two redundant NFS servers are passed as parameters to the mount program, and then to the kernel.
The new command line for mounting two NFS servers looks as follows (shown here with the local mount point, /mnt/CommonNFS in the setup of Figure 121, as the final argument):

    % mount -t nfs server1,server2:/nfs_mnt_point /mnt/CommonNFS

6.1.9.2 Heartbeat Mechanism

We reused the heartbeat mechanism initially developed by the Linux-HA project after making several adaptations to the source code and rewriting parts of the original code; these optimizations decreased the failure detection delay from 200 ms to 150 ms. We expect that further optimizations and rewrites could bring the failure detection time down from 150 ms to 100 ms. The limitation of the current implementation is that it supports only two nodes. As future work, we would like to investigate using the heartbeat mechanism with more than two nodes in the HA tier and supporting the N-way redundancy model.

6.1.9.3 Redundant Ethernet Connections

The goal of this feature is to maintain Ethernet connectivity at the server node level. The implemented Ethernet redundancy daemon monitors the link status of the primary Ethernet port. On link down, the route of the first port is deleted, and incoming traffic is directed to the second Ethernet port. When the link comes up again, the daemon waits to make sure the connection does not drop again, and then switches back to the primary Ethernet port. In addition to this contribution, smaller supporting contributions included fixes and rewrites of the Ethernet device driver; all of these are now integrated in the original Ethernet device driver code in the Linux kernel. Further improvements to the current implementation include stabilizing the source code and optimizing the performance of the Ethernet daemon, including the failure detection time of the Ethernet driver. In addition, the source code of the Ethernet redundancy daemon is to be upgraded to run on the latest release of the Linux kernel, version 2.6.

6.1.9.4 Traffic Manager Daemon

The traffic manager runs on master nodes. It receives load announcements from the traffic client daemons running on the traffic nodes, computes the rank of each node, and updates its internal list of available traffic nodes and their loads. The traffic manager is also responsible for distributing incoming traffic to the available traffic nodes based on their loads. The current implementation of the traffic manager could use several improvements, including further testing and stabilization of the source code and optimization of insertions and updates in the list of traffic nodes and their load indices. As future work, we would like to investigate the possibility of merging the functionality of the saru module (connection synchronization) into the traffic manager.

6.1.9.5 Traffic Client Daemon

The traffic client is a daemon process that runs on each traffic node. It collects processor and memory information from the /proc file system, computes the load index, and communicates it to the traffic managers running on the master nodes. The current implementation of the traffic client is stable because of the extensive testing it underwent. As future work, we plan to investigate a better representation of the load index, for instance by adding a network bandwidth parameter to the load index computation. In addition, we would like to investigate the possibility of merging the functionality of the LDirectord module into the traffic client, resulting in one system software component that reports the node load index and monitors the health of the application server running on the traffic node.
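As an illustration of the kind of sampling described above, the sketch below reads processor and memory figures from the /proc file system and folds them into a single load index that is reported periodically over UDP. The metric weights, the report format, and the master node address and port are assumptions for the example, not the prototype's actual values.

    # Sketch of a TC-style load sampler: read CPU load and free memory from
    # /proc and reduce them to one load index reported to the TM. The
    # 0.7/0.3 weights and the ("master1", 9000) endpoint are assumptions.
    import socket, time

    def load_index():
        with open("/proc/loadavg") as f:
            one_min = float(f.read().split()[0])        # 1-minute run queue
        meminfo = {}
        with open("/proc/meminfo") as f:
            for line in f:
                key, value = line.split(":")
                meminfo[key] = int(value.split()[0])    # values in kB
        mem_used = 1 - meminfo["MemFree"] / meminfo["MemTotal"]
        # Cap the run-queue term so the index stays in [0, 1].
        return 0.7 * min(one_min, 1.0) + 0.3 * mem_used

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:                                         # report to the TM
        sock.sendto(f"{load_index():.3f}".encode(), ("master1", 9000))
        time.sleep(1)                                   # reporting period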
6.1.9.6 Cluster Virtual IP Interface (CVIP)

The CVIP interface is a cluster virtual IP interface that presents the HAS cluster as a single entity to the outside world, making all nodes inside the cluster transparent to end users. Section 4.21 discusses the CVIP interface; Sections 6.2.7 and 6.2.8 discuss the future work items for the CVIP.

6.1.9.7 LDirectord

The improvements and adaptations to the LDirectord module include capabilities to connect to the traffic manager and the traffic client. The current implementation is not fully optimized; rather, it is a working prototype that requires further testing and stabilization. As future work, we would like to minimize the number of sequential steps to improve performance, and investigate the possibility of integrating the LDirectord module with the traffic client running on the traffic node.

6.1.9.8 Connection Synchronization

The improvements and adaptations to the saru module include improved capabilities to connect and talk to the heartbeat mechanism. The current implementation of the saru module relies on the redundancy relationship between the master nodes. As future work, we plan to migrate to the peer-to-peer approach discussed in Section 4.22.4.

6.1.10 Other Contributions

The work has also contributed several papers, technical reports, tutorials, and presentations, in addition to contributions to existing projects and to industry.

- Referenced papers, technical reports, and white papers: [30], [96], [99], [100], [103], [107], [108], [117], [146], [145], [149], [150], [151], [152], [153], [154], [155], [156], [157], [158], [159], [160], [161], [162], [163], and [164].

- Conference presentations and tutorials: [165], [166], [167], [168], [169], [170], [171], [172], [173], [174], [175], [176], [177], [178], [179], [180], [181], and [182].

- In the process of evaluating existing works, we benchmarked existing solutions and provided direct input to the specific projects. A major advantage we had was the availability of a test lab that allowed us to perform large-scale testing for performance and scalability.

- We ported the Apache and Tomcat web servers to support IPv6 and performed benchmarking tests to compare against benchmarking tests of Apache and Tomcat running over IPv4.

- As part of the dissertation, we needed a flexible cluster installation infrastructure that would help us build and set up clusters within hours instead of days, and that would accommodate nodes with disks as well as diskless nodes with network boot. This infrastructure did not exist, and we had to design and build it from scratch. This cluster installation infrastructure is now being used at the Ericsson Research lab in Montréal, Canada.

- Other contributions include the influence of the work on industry. The Carrier Grade Linux specifications are industry standards with a defined architecture for telecom platforms and applications running on telecom servers in mission-critical environments. The Carrier Grade architecture relies on the work proposed in this thesis, with small modifications to accommodate specific types of telecommunication applications. The author of this thesis is publicly recognized as a contributor to the Carrier Grade specifications.
6.2 Future Work
The following sub-sections propose several areas that would extend and improve the current work. Aside from the maintenance activities that would improve the current HAS prototype, the areas of interest range from running additional benchmarking tests to implementing new system components and providing new functionalities. In addition, investigating linear scalability beyond 16 nodes is a high-priority item. The top two future work items are the redundancy configuration manager and the cluster configuration manager.

6.2.1 Support Linear Scalability Beyond 16 Nodes
With the HAS architecture, we were able to reach near-linear scalability for up to 16 traffic nodes (Section 5.9). The next major goal is to investigate how to accomplish higher levels of scalability for clusters consisting of 32 nodes and beyond. One of the important further investigations in this area relates to the scaling overhead of the cluster. In the HAS cluster architecture, the scaling overhead has been minimal, but this was verified only up to 16 nodes. As the cluster scales in the number of nodes, many potential bottlenecks affect the level of performance, such as storage, communication, application servers, and the traffic distribution mechanism. The goal of this activity is to investigate the sources of bottlenecks and explore solutions.

6.2.2 Traffic Client Implementation
In a future version, we plan to allow the traffic client to receive notification that a master node is unavailable directly from the heartbeat mechanism. With such information, the traffic client stops reporting its load index to a master node that is unavailable, minimizing its communication overhead and resulting in less network traffic.

6.2.3 Additional Benchmarking Tests
Because of limitations such as timing and access to lab hardware, we were not able to conduct more benchmarking in the lab against the HAS prototype. As future work, we would like to establish a larger benchmarking environment that consists of more test machines to generate additional traffic into the HAS cluster. Furthermore, we would like to benchmark the HAS cluster with the master nodes in the HA tier in load-sharing mode, the 1+1 active/active model (Figure 130).

Figure 130: The untested configurations of the HAS architecture: master nodes in 1+1 active/standby or 1+1 active/active mode, traffic nodes in N active or N active/M standby mode, and storage provided either by HA NFS or by specialized storage nodes.

We expect to achieve higher performance levels when both master nodes receive incoming traffic and forward it to the traffic nodes. Furthermore, we would like to benchmark the HAS architecture prototype using specialized storage nodes and compare the results to those obtained when the HA NFS implementation provides storage. These tests will give us insight into the most efficient storage solution.

6.2.4 Redundancy Configuration Manager
The current prototype of the HAS architecture supports neither dynamic changes to the redundancy configurations nor transitions from one redundancy configuration to another. Such a feature is very useful: when the nodes reach a certain pre-defined threshold, the redundancy configuration manager would, for example, transition the HA tier from the 1+1 active/standby model to the 1+1 active/active model, allowing both master nodes to share and service incoming traffic. Such a transition in the current HAS architecture prototype requires stopping all services on the master nodes, updating the configuration files, and restarting all software components running on the master nodes. The redundancy configuration manager would be the entity responsible for switching the redundancy configuration of the cluster tiers from one redundancy model to another. For instance, when the SSA tier is in the N+M redundancy model, the redundancy configuration manager would be responsible for activating a standby traffic node when a traffic node becomes unavailable. As such, the configuration manager should be aware of the active traffic nodes, the states of their components, and their corresponding standby traffic nodes.
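As a sketch of the idea, the fragment below models the redundancy configurations as an enumeration and shows the N+M case: on a failure notification, the manager marks the failed node and activates a standby. This is a hypothetical illustration of future work, not HAS code; the type names, the node table, and the activation hook are invented for the example.

/* redundancy_mgr_sketch.c - hypothetical sketch of the proposed
 * redundancy configuration manager; not part of the HAS prototype. */
#include <stdio.h>

enum redundancy_model {          /* models named in the thesis */
    ACTIVE_STANDBY_1_1,
    ACTIVE_ACTIVE_1_1,
    N_PLUS_M
};

enum node_state { NODE_ACTIVE, NODE_STANDBY, NODE_FAILED };

struct traffic_node {
    int id;
    enum node_state state;
};

/* Assumed hook that brings a standby node into service (e.g. by
 * telling the traffic manager to start dispatching requests to it). */
static void activate(struct traffic_node *n)
{
    n->state = NODE_ACTIVE;
    printf("activating standby node %d\n", n->id);
}

/* In the N+M model, replace a failed active node with a standby one. */
static void handle_failure(enum redundancy_model model,
                           struct traffic_node *nodes, int count, int failed_id)
{
    if (model != N_PLUS_M)
        return; /* the 1+1 models are handled by the HA tier itself */
    for (int i = 0; i < count; i++) {
        if (nodes[i].id == failed_id)
            nodes[i].state = NODE_FAILED;
    }
    for (int i = 0; i < count; i++) {
        if (nodes[i].state == NODE_STANDBY) {
            activate(&nodes[i]);
            return;
        }
    }
    fprintf(stderr, "no standby node left to activate\n");
}

int main(void)
{
    struct traffic_node nodes[] = {
        { 1, NODE_ACTIVE }, { 2, NODE_ACTIVE }, { 3, NODE_STANDBY }
    };
    handle_failure(N_PLUS_M, nodes, 3, /* failed node */ 2);
    return 0;
}

In a real manager, activate() would coordinate with the traffic manager and the heartbeat mechanism rather than print a message, and a transition between the 1+1 models would additionally rewrite the configuration files that the current prototype requires to be edited by hand.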
6.2.5 Cluster Configuration Manager
The current HAS cluster prototype does not provide a central location for managing the cluster configuration files required by all the system software. As a result, the process of locating and editing all the configuration files needed to run the cluster is a daunting experience. There is a need for a central entity, preferably with a user-friendly graphical user interface, that manages all the configuration files controlling the operation of the various software components. This entity, a cluster configuration manager, would be a one-stop facility that enables easy and centralized configuration.

6.2.6 Monitoring Load of Master Nodes
We would like to expand the current HAS implementation to include a logging mechanism that monitors and reports the load of the master nodes, such as processor usage, free memory, and network bandwidth. We can then use this information to determine the impact of incoming traffic on the performance and scalability of the master nodes. Such information can also help us identify the node subsystems that suffer from bottlenecks, which is particularly useful in a controlled lab environment when running benchmarking tests and generating thousands of requests per second.
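The sketch below shows one way such a logging mechanism could sample processor usage on a master node: it reads the aggregate jiffy counters from /proc/stat twice and derives the busy percentage from the deltas. It is a minimal sketch under simplifying assumptions; the five-second interval is arbitrary, and a complete logger would also record free memory and network bandwidth and write to a log file instead of standard output.

/* master_load_log_sketch.c - illustrative load logger for a master node.
 * Derives CPU utilization from /proc/stat jiffy deltas; a real logger
 * would also record free memory and network bandwidth. */
#include <stdio.h>
#include <unistd.h>

struct cpu_sample { unsigned long long busy, total; };

/* Read the aggregate "cpu" line: user nice system idle iowait irq softirq. */
static int read_cpu(struct cpu_sample *s)
{
    unsigned long long v[7] = {0};
    FILE *f = fopen("/proc/stat", "r");
    if (!f)
        return -1;
    int n = fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu",
                   &v[0], &v[1], &v[2], &v[3], &v[4], &v[5], &v[6]);
    fclose(f);
    if (n < 4)
        return -1;
    s->total = 0;
    for (int i = 0; i < 7; i++)
        s->total += v[i];
    s->busy = s->total - v[3] - v[4];  /* everything but idle and iowait */
    return 0;
}

int main(void)
{
    struct cpu_sample a, b;
    for (;;) {
        if (read_cpu(&a) < 0)
            return 1;
        sleep(5);                      /* assumed sampling interval */
        if (read_cpu(&b) < 0)
            return 1;
        unsigned long long dt = b.total - a.total;
        double busy = dt ? 100.0 * (double)(b.busy - a.busy) / (double)dt : 0.0;
        /* In the envisioned mechanism, this line would go to a log file. */
        printf("cpu busy: %5.1f%%\n", busy);
    }
}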
6.2.7 Merging the Functionalities of the CVIP and the Traffic Management Scheme
One possibility for further investigation is to couple the functionalities of the cluster virtual IP interface with the traffic management scheme. With the current implementation, incoming traffic arrives at the cluster through the cluster virtual IP interface and is then handled by the traffic manager before it reaches its final destination on one of the traffic nodes. A future enhancement is to eliminate the traffic management scheme and incorporate traffic distribution within the CVIP interface. Eliminating the traffic manager daemon and integrating its functionalities into the CVIP would result in increased performance and faster response time, since it removes one serialized step in handling an incoming request. In short, our proposal is to combine the functionalities of the cluster virtual IP interface and the traffic distribution mechanism, eliminating one traffic forwarding step between the web users and the application server running the web server.

6.2.8 Eliminating the Need for Master Nodes
In the quest for maximum performance, we need to eliminate as many pre-processing steps as possible to speed up the process of replying to web clients. One interesting investigation is the possibility of eliminating the need for master nodes and distributing their functionalities among the traffic nodes. With that model, we eliminate a step of pre-processing the request; however, we create additional challenges, such as how to perform efficient traffic distribution and provide cluster services. One interesting approach is to elect traffic nodes to perform selected master node functions based on well-defined criteria.

6.2.9 Supporting Cluster Zones and Specialized Traffic Nodes
A cluster can host several applications that provide a range of services for the cluster users. This requirement influences the way we manage the cluster and implies the need for node specialization, where specific nodes that constitute a cluster zone are dedicated to task-specific services. The concept of a cluster zone is as follows: since the cluster is composed of multiple nodes, we divide the cluster into sub-clusters, called cluster zones or simply zones. Each zone provides a specific type of service through defined cluster nodes and, as a result, receives a specific type of traffic to those nodes. We have two main challenges in this area: the first is to provide the virtualization of the cluster zones, and the second is the ability to migrate cluster nodes dynamically between several zones based on traffic trends.

Figure 131: The architecture logical view with specialized nodes: HTTP, FTP, streaming, and storage nodes behind the cluster VIP, shown in three stages as nodes move between zones.

Figure 131 illustrates the concept of cluster zones. The cluster in the figure consists of three zones: one provides HTTP service, another provides FTP service, and the third provides streaming service. In (1), the FTP cluster zone is receiving traffic, while some nodes in the HTTP cluster zone sit idle due to low traffic. The traffic manager running on the master node disconnects (2) two nodes from the HTTP cluster zone and transitions (3) them into the FTP cluster zone to accommodate the increase in FTP traffic. There are several possible areas of investigation, such as defining cluster zones as logical entities in a larger cluster, dynamically selecting the node(s) to become part of a specialized cluster zone, transitioning the nodes into their new zone, and investigating queuing theories suitable for such usage models.

6.2.10 Peer-to-Peer Connection Synchronization
The peer-to-peer approach to connection synchronization is an improvement over the master/slave approach. In the peer-to-peer approach, each master node sends synchronization information for the connections it is handling to the other master node. In the event that a master node fails, the connections are synchronized, and they are able to continue even after multiple failovers.
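The sketch below makes the peer-to-peer exchange concrete: each master node multicasts small records describing the connections it handles and applies the records it hears from its peer, so either node can take over the other's connections after a failover. This is an illustration only, not the saru module's wire protocol; the record layout, the multicast group, and the port are assumptions.

/* conn_sync_sketch.c - illustrative peer-to-peer connection
 * synchronization between two master nodes; not the saru protocol.
 * Each peer multicasts the connections it handles and applies the
 * entries it receives. */
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define SYNC_GROUP "239.0.0.81"   /* assumed multicast group */
#define SYNC_PORT  8848           /* assumed port */

struct conn_entry {               /* one synchronized connection */
    uint32_t saddr, daddr;        /* client and virtual addresses */
    uint16_t sport, dport;
    uint8_t  state;               /* e.g. established, closing */
};

int main(void)
{
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in grp = { .sin_family = AF_INET,
                               .sin_port = htons(SYNC_PORT) };
    inet_pton(AF_INET, SYNC_GROUP, &grp.sin_addr);

    /* Join the group so we also hear the peer's announcements. */
    struct sockaddr_in local = { .sin_family = AF_INET,
                                 .sin_port = htons(SYNC_PORT),
                                 .sin_addr.s_addr = htonl(INADDR_ANY) };
    bind(sock, (struct sockaddr *)&local, sizeof(local));
    struct ip_mreq mreq = { .imr_interface.s_addr = htonl(INADDR_ANY) };
    inet_pton(AF_INET, SYNC_GROUP, &mreq.imr_multiaddr);
    setsockopt(sock, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));

    /* Announce one connection this master is handling (example values). */
    struct conn_entry e = { .sport = htons(51000), .dport = htons(80),
                            .state = 1 };
    inet_pton(AF_INET, "10.0.0.7", &e.saddr);   /* client */
    inet_pton(AF_INET, "10.0.0.1", &e.daddr);   /* cluster VIP */
    sendto(sock, &e, sizeof(e), 0, (struct sockaddr *)&grp, sizeof(grp));

    /* Apply whatever arrives (on a single test host, multicast loopback
     * means this may be our own announcement; here we just print it). */
    struct conn_entry in;
    if (recv(sock, &in, sizeof(in), 0) == (ssize_t)sizeof(in))
        printf("peer connection: state %u, ports %u -> %u\n",
               in.state, ntohs(in.sport), ntohs(in.dport));
    close(sock);
    return 0;
}

Because both peers send and receive symmetrically, neither node is a dedicated synchronization master, which is precisely what removes the master/slave asymmetry described above.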
6.2.11 Eliminating Traffic Manager as a SPOF
The traffic managers running on the master nodes in the HA tier present a SPOF. In the event that the traffic manager fails on the active node, the traffic clients (running on the traffic nodes) continue reporting their load indexes to the failed traffic manager until they receive a timeout.

6.2.12 Eliminating Saru Module as a SPOF
The saru module is the system software that runs on the master nodes. It distributes incoming traffic between the two master nodes (when in the 1+1 active/active model) following the round-robin method. The saru module presents a SPOF: if the saru module fails on the master node, the routed process is not able to send it incoming requests, and the cluster comes to a halt. The solution to this problem is to make the routed process aware that the second master node also has an inactive copy of the saru module running on it. When the routed process fails to send requests to the saru module running on master node 1, it should send them to the saru module running on master node 2.
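A sketch of the proposed fallback logic follows. It sends each request to the saru instance on master node 1 and, if delivery fails, retries against the inactive copy on master node 2. The addresses, the port, and the use of a plain TCP connection as the delivery channel are assumptions made for the example; the actual interaction between the routed process and the saru module may differ.

/* saru_failover_sketch.c - illustrative fallback between two saru
 * instances; the addresses and TCP transport are assumptions. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Try to deliver one request to the given address; returns 0 on success. */
static int deliver(const char *addr, const void *req, size_t len)
{
    int sock = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in dst = { .sin_family = AF_INET,
                               .sin_port = htons(9090) }; /* assumed port */
    inet_pton(AF_INET, addr, &dst.sin_addr);
    if (connect(sock, (struct sockaddr *)&dst, sizeof(dst)) < 0 ||
        write(sock, req, len) != (ssize_t)len) {
        close(sock);
        return -1;
    }
    close(sock);
    return 0;
}

int main(void)
{
    /* Assumed addresses of the saru instances on the two master nodes. */
    const char *primary = "192.168.1.1", *secondary = "192.168.1.2";
    const char *request = "REQ 1";

    if (deliver(primary, request, strlen(request)) == 0) {
        printf("request handled by master node 1\n");
    } else if (deliver(secondary, request, strlen(request)) == 0) {
        /* Primary saru failed: fall back to the copy on master node 2. */
        printf("failed over to master node 2\n");
    } else {
        fprintf(stderr, "both saru instances unreachable\n");
        return 1;
    }
    return 0;
}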
6.2.13 Failure of LDirectord
The goal of this future work item is to improve the current implementation of LDirectord and make it resilient to failures.

6.3 Conclusion
This dissertation covers a range of technologies for highly available and scalable web clusters. It addresses the challenges of designing a scalable and highly available web server architecture that is flexible, component-based, reliable, and robust under heavy loads. The first chapter provides a background on Internet and web servers and scalability challenges, and presents the objectives and scope of the study. The second chapter looks at clustering technologies, scalability challenges, and related work. We examine clustering technologies and techniques for designing and building Internet and web servers. We argue that the traditional standalone server architecture fails to address the scalability and high availability needs of large-scale Internet and web servers. We introduce software and hardware clustering technologies with their advantages and drawbacks, and discuss our experience prototyping a highly available and scalable clustered web server platform. We present and discuss the various ongoing research projects in industry and academia, their focus areas, results, and contributions. In the third chapter, the thesis summarizes the preparatory technical work with the prototyped web cluster that uses existing components and mechanisms. Chapter four presents and discusses the HAS architecture: its components and their characteristics, the elimination of single points of failure, the conceptual, physical, and scenario architecture views, the redundancy models, the cluster virtual interface, and the traffic distribution scheme. The HAS architecture consists of a network of server nodes connected over highly available networks. A virtual IP interface provides a single point of entry to the cluster. The software and hardware components of the architecture do not present a SPOF. The HAS cluster architecture supports multiple redundancy models for each of its tiers, allowing operators to choose the best redundancy model for each specific deployment scenario. The HAS architecture manages incoming traffic from web clients through a lightweight, efficient, and dynamic traffic distribution scheme that takes into consideration the capacity of each traffic node. This approach has proven to be an effective method of distributing traffic, based on the performance testing we conducted. In chapter five, we validate the scalability of the architecture. Our results demonstrate that the HAS architecture is able to reach close to linear scaling for up to 16 processors and to attain high performance levels with robust behavior under heavy load. In addition, the chapter presents the results of the availability validation, which tested the availability features of a HAS architecture. The final chapter presents the contributions and the future work.

The HAS architecture brings together aspects of high availability, concurrency, dynamic resource management, and scalability into a coherent framework. Our experience with, and evaluation of, the architecture demonstrate that the approach is an effective way to build robust, highly available, and scalable web clusters. We have developed an operational prototype based on the HAS architecture; the prototype focused on building a proof of concept for the HAS architecture consisting of the necessary system software components. The HAS architecture relies on the integration of many system components into a well-defined, generic cluster platform. It provides the infrastructure for cluster membership, to recognize and manage the nodes' membership in the cluster; a cluster storage service; a fault management service, to recognize hardware and software faults and apply recovery mechanisms; and a traffic distribution service, to distribute the incoming traffic across the nodes in the cluster. The HAS architecture represents a new design point for large-scale Internet and web servers that supports scalability, high availability, and high performance.

Bibliography
[1] K. Coffman, A. Odlyzko, The Growth Rate of the Internet, Technical Report, First Monday, Volume 3, Number 10, October 1998, http://www.firstmonday.dk/issues/issue3_10/coffman
[2] E. Brynjolfsson, B. Kahin, Understanding the Digital Economy: Data, Tool, and Research, MIT Press, October 2000
[3] Yahoo! Fourth Quarter 2001 Financial Report, http://docs.yahoo.com/docs/pr/4q01pr.html
[4] America Online, Press Data Points, http://corp.aol.com/press/press_datapoints.html
[5] K. McNaughton, Is eBay too popular?, CNET News.com, March 1, 1999, http://news.cnet.com/news/0-1007-200-339371.html
[6] E. Hansen, Email outage takes toll on Excite@Home, CNET News.com, June 28, 2000, http://news.cnet.com/news/0-1005-200-2167721.html
[7] Bloomberg News, E*Trade hit by class-action suit, CNET News.com, February 9, 1999, http://news.cnet.com/news/0-1007-200-338547.html
[8] W. LeFebvre, Facing a World Crisis, Invited talk at the 15th USENIX LISA System Administration Conference, San Diego, California, USA, December 2-7, 2001
[9] British Broadcasting Corporation, Net surge for news sites, September 2001, http://news.bbc.co.uk/hi/english/sci/tech/newsid_1538000/1538149.stm
[10] R. Lemos, Web worm targets White House, CNET News.com, July 2001, http://news.com.com/2100-1001-270272.html
[11] The Hyper Text Transfer Protocol Standardization at the W3C, http://www.w3.org/Protocols
[12] The Common Gateway Interface, http://www.w3.org/CGI
[13] J. Nielsen, The Need for Speed, Technical Report, March 1997, http://www.useit.com/alertbox/9703a.html
[14] R. B. Miller, Response Time in Man-Computer Conversational Transactions, Proceedings of the FIPS Fall Joint Computer Conference, Volume 33, 1968, pp. 267-277
[15] J. Nielsen, Usability Engineering, Morgan Kaufmann, San Francisco, 1994
[16] Inktomi Corporation, Web surpasses one billion documents, Press Release, January 2000, http://www.inktomi.com/new/press/2000/billion.html
[17] A. T. Saracevic, Quantifying the Internet, San Francisco Examiner, November 5, 2000, http://www.sfgate.com
[18] Nielsen/NetRatings, Nielsen/NetRatings finds strong global growth in monthly Internet sessions and time spent online between April 2001 and April 2002, June 2002, http://www.nielsen-netratings.com/pr/pr_020610_global.pdf
[19] The WebBench Tool, http://www.etestinglabs.com/benchmarks/webbench/webbench.asp
[20] The Linux-HA Project, http://www.linux-ha.org
[21] The HA-OSCAR Project, http://xcr.cenit.latech.edu/ha-oscar
[22] The Linux Operating System, http://www.kernel.org
[23] The Linux Virtual Server Project, http://www.linuxvirtualserver.org
[24] The Apache Project, http://httpd.apache.org
[25] The Tomcat Project, http://Jakarta.apache.org/tomcat
[26] The Carrier Grade Linux Initiative, http://www.osdl.org/lab_activities/carrier_grade_linux
[27] The Open Source Development Lab, http://www.osdl.org
[28] G. Pfister, In Search of Clusters, Second Edition, Prentice Hall PTR, 1998
[29] The Open Group, The UNIX® Operating System: A Robust, Standardized Foundation for Cluster Architectures, White Paper, June 2001, http://www.unix.org/whitepapers/cluster.htm
[30] I. Haddad, E. Paquin, MOSIX: A Load Balancing Solution for Linux Clusters, Linux Journal, May 2001
[31] The MOSIX Project, http://www.mosix.org/
[32] The ROCKS Clustering Package, http://www.rocksclusters.org
[33] Open Source Cluster Application Resources, http://oscar.openclustergroup.org
[34] M. J. Brim, T. G. Mattson, and S. L. Scott, OSCAR: Open Source Cluster Application Resources, Ottawa Linux Symposium 2001, Ottawa, Canada, July 2001
[35] J. Hsieh, T. Leng, and Y. C. Fang, OSCAR: A Turnkey Solution for Cluster Computing, Dell Power Solutions, Issue 1, 2001, pp. 138-140
[36] Ganglia Distributed Monitoring Tool, http://ganglia.sourceforge.net
[37] TurboLinux Cluster Server, http://www.turbolinux.com/products/middleware/tlcs8.html
[38] OpenSSI Clustering Solution, http://openssi.org/cgi-bin/view?page=openssi.html
[39] OpenGFS Clustered File System, http://opengfs.sourceforge.net
[40] Oracle Cluster File System, http://oss.oracle.com/projects/ocfs/
[41] IEEE Task Force on Cluster Computing High Availability, http://www.clustercomputing.org
[42] Advanced Research on Internet E-Servers, http://www.linux.ericsson.ca
[43] Trillium Digital Systems, Distributed Fault-Tolerant and High-Availability Systems, White Paper, http://www.trillium.com
[44] Intel Corporation, http://www.intel.com
[45] X. Gan, T. Schroeder, S. Goddard, and B. Ramamurthy, LSMAC vs. LSNAT: Scalable Cluster-based Web Servers, IEEE Cluster Computing, November 2000, pp. 175-185
[46] O. Damani, P. Chung, Y. Huang, C. Kintala, and Y. M. Wang, ONE-IP: Techniques for Hosting a Service on a Cluster of Machines, IEEE Computer Networks, Volume 29, Numbers 8-13, September 1997, pp. 1019-1027
[47] G. Hunt, G. S. Goldszmidt, R. P. King, and R. Mukherjee, Network Dispatcher: A Connection Router for Scalable Internet Services, IEEE Computer Networks, Volume 30, Numbers 1-7, April 1999, pp. 347-357
[48] RFC 2391, Load Sharing using IP Network Address Translation (LSNAT), http://www.faqs.org/rfcs/rfc2391.html
[49] X. Gan, T. Schroeder, S. Goddard, and B. Ramamurthy, LSMAC and LSNAT: Two Approaches for Cluster-Based Scalable Web Servers, IEEE International Conference on Communications, June 2000, pp. 1164-1168
[50] Cisco Local Director, http://www.cisco.com/warp/public/cc/pd/cxsr/400/index.shtml
[51] DNS RFCs, http://www.dns.net/dnsrd/rfc
[52] The Public DNS Service, http://soa.granitecanyon.com
[53] The Microsoft Cluster Server, http://www.microsoft.com/ntserver/support/faqs/Clusterfaq_deploy.asp
[54] The Microsoft Cluster Server FAQ, http://www.microsoft.com/ntserver/support/faqs/Clustering_faq.asp
[55] S. Horman, Linux Virtual Server Tutorial, http://www.ultramonkey.org/papers/lvs_tutorial
[56] M. Williams, EBay, Amazon, Buy.com hit by Internet attacks, Network World, February 9, 2000, http://www.nwfusion.com/news/2000/0209attack.html
[57] G. Sandoval and T. Wolverton, Leading Web Sites Under Attack, News.com, February 9, 2000, http://news.cnet.com/news/0-1007-200-1545348.html
[58] Trillium Digital Systems, Distributed Fault-Tolerant/High-Availability Systems, White Paper, http://www.trillium.com
[59] D. LaLiberte, and A. Braverman, A Protocol for Scalable Group and Public Annotations, Computer Networks and ISDN Systems, Volume 27, Number 6, January 1995, pp. 911-918
[60] IMEX Research High Availability Report, http://www.imexresearch.com
[61] The Mobile Internet, http://www.ericsson.com/mobileinternet
[62] H. Soliman, IPv6 in 3G, Technical Report, Ericsson Research, 2001, http://www.ericsson.com/technology
[63] L. Aversa, and A. Bestavros, Load Balancing a Cluster of Web Servers Using Distributed Packet Rewriting, Proceedings of the IEEE International Performance Conference, Phoenix, Arizona, USA, February 2000, pp. 24-29
[64] S. N. Budiarto, and S. Nishio, MASEMS: A Scalable and Extensible Multimedia Server, The 1999 International Symposium on Database Applications in Non-Traditional Environments, Kyoto, Japan, November 1999, pp. 28-30
[65] C. Roe, and S. Gonik, Server-Side Design Principles for Scalable Internet Systems, IEEE Software, Volume 19, Number 2, March/April 2002, pp. 34-41
[66] D. Norman, The Design of Everyday Things, Double-Day, New York, 1998
[67] D. Dias, W. Kish, R. Mukherjee, and R. Tewari, A Scalable and Highly Available Web Server, Proceedings of the Forty-First IEEE Computer Society International Conference: Technologies for the Information Superhighway, Santa Clara, California, USA, February 25-28, 1996, pp. 85-92
[68] E. Casalicchio, and S. Tucci, Static and Dynamic Scheduling Algorithms for Scalable Web Server Farm, IEEE Network 2001, pp. 368-376
[69] A. S. Z. Belloum, E. C. Kaletas, A. W. Van Halderen, H. Afsarmanesh, and A. J. H. Peddemors, A Scalable Web Server Architecture, IEEE World Wide Web Journal, Volume 5, Number 1, 2002, pp. 5-23
[70] H. Bryhni, E. Klovning, and O. Kure, A Comparison of Load Balancing Techniques for Scalable Web Servers, IEEE Network, July/August 2000, pp. 58-64
[71] D. Kim, C. H. Park, and D. Park, Request Rate Adaptive Dispatching Architecture for Scalable Internet Server, IEEE International Conference on Cluster Computing, Chemnitz, 2000, pp. 289-296
[72] L. Aversa, and A. Bestavros, Load Balancing a Cluster of Web Servers Using Distributed Packet Rewriting, Proceedings of the 2000 IEEE International Performance, Computing, and Communications Conference, February 2000, pp. 24-29
[73] B. Ramamurthy, LSMAC vs. LSNAT: Scalable Cluster-based Web Servers, Seminar presented at Rice University, October 23, 2000, http://www-ece.rice.edu/ece/colloq/00-01/Oct23br-00.html
[74] A. N. Murad, and H. Liu, Scalable Web Server Architectures, Technical Report BL0314500-961216TM, Bell Labs, Lucent Technologies, December 1996
[75] E. D. Katz, M. Butler, and M. McGrath, A Scalable HTTP Server: The NCSA Prototype, Proceedings of the 1st International WWW Conference, Geneva, Switzerland, May 25-27, 1994, pp. 155-164
[76] The NCSA HTTP Server, http://hoohoo.ncsa.uiuc.edu
[77] D. Kim, C. H. Park, and D. Park, Request Rate Adaptive Dispatching Architecture for Scalable Internet Server, IEEE Network 2000, pp. 289-296
[78] D. Anderson, T. Yang, V. Holmedahl, and O. Jbarra, SWEB: Towards a Scalable World Wide Web Server on Multicomputers, Proceedings of the 10th International Parallel Processing Symposium, Honolulu, Hawaii, USA, April 15-19, 1996, pp. 850-856
[79] E. Casalicchio, and M. Colajanni, Scalable Web Clusters with Static and Dynamic Contents, Proceedings of the IEEE Conference on Cluster Computing, Chemnitz, Germany, November 28 - December 1, 2000, pp. 170-177
[80] X. Gan, T. Schroeder, S. Goddard, and B. Ramamurthy, LSMAC and LSNAT: Two Approaches for Cluster-based Scalable Web Servers, Proceedings of the 2000 IEEE International Conference on Communications, New Orleans, USA, June 18-22, 2000, pp. 1164-1168
[81] X. Zhang, M. Barrientos, B. Chen, and M. Seltzer, HACC: An Architecture for Cluster-Based Web Servers, Proceedings of the 3rd USENIX Windows NT Symposium, Seattle, Washington, USA, July 12-15, 1999, pp. 155-164
[82] The HACC Web Server Project, http://www.eecs.harvard.edu/~cxzhang/projects/hacc
[83] The IBM RS/6000, http://www.rs6000.ibm.com
[84] G. D. H. Hunt, G. S. Goldszmidt, R. P. King, and R. Mukherjee, Network Dispatcher: A Connection Router for Scalable Internet Services, Proceedings of the 7th International World Wide Web Conference, Brisbane, Australia, April 1998, pp. 347-357
[85] The Alexandria Digital Library, http://alexandria.sdc.ucsb.edu
[86] WebStone Web Server Benchmarking Tool, http://www.mindcraft.com/webstone
[87] F5 Networks Big IP, http://www.f5.com/f5products/bigip
[88] A. Tucker, and A. Gupta, Process Control and Scheduling Issues for Multiprogrammed Shared-Memory Multiprocessors, Proceedings of the 12th Symposium on Operating Systems Principles, ACM, Litchfield Park, Arizona, USA, December 1989, pp. 159-166
[89] M. Squillante, and E. Lazowska, Using Processor-Cache Affinity Information in Shared-Memory Multiprocessors Scheduling, IEEE Transactions on Parallel and Distributed Systems, Volume 4, Number 2, February 1993, pp. 131-134
[90] Microsoft Developer Network Platform SDK, Performance Data Helper, Microsoft, July 1998
[91] W. B. Ligon III, and R. Ross, Server-Side Scheduling in Cluster Parallel I/O Systems, The Calculateurs Parallèles Journal, October 2001
[92] W. B. Ligon III, and R. Ross, PVFS: Parallel Virtual File System, Beowulf Cluster Computing with Linux, MIT Press, November 2001, pp. 391-430
[93] B. Nishio, and S. Nishio, MASEMS: A Scalable and Extensible Multimedia Server, IEEE Network 2000, pp. 443-450
[94] A. Mourad, and H. Liu, Scalable Web Server Architectures, Proceedings of the IEEE International Symposium on Computers and Communications, Alexandria, Egypt, July 1997, pp. 12-16
[95] M. Andreolini, V. Cardellini, and M. Colajanni, Benchmarking, Models and Tools for Distributed Web-Server Systems, Proceedings of Performance 2002, Rome, Italy, July 24-26, 2002, pp. 208-235
[96] I. Haddad, An Overview of the HTTP, http://www.cs.concordia.ca/~i_haddad/phd.html
[97] P. Wainwright, Professional Apache, Wrox Press Inc., 1999
[98] B. Laurie, P. Laurie, and R. Denn, Apache: The Definitive Guide, O'Reilly & Associates, 1999
[99] I. Haddad, Apache User Authentication, Linux Journal, October 2000
[100] I. Haddad, Apache 2.0 Internals, Linux Journal, August 2001
[101] Netcraft Web Server Survey, http://Netcraft.com/survey
[102] The GNU General Public License, http://www.gnu.org/copyleft/gpl.html
[103] I. Haddad, W. Hassan, and L. Tao, XWPT: An X-based Web Servers Performance Tool, The 18th International Conference on Applied Informatics, Innsbruck, Austria, February 2000, pp. 50-55
[104] The Software RAID How-to, http://www.tldp.org/HOWTO/Software-RAID-HOWTO.html
[105] The Parallel Virtual File System, http://www.parl.clemson.edu/pvfs
[106] A. Ching, A. Choudhary, W. Liao, R. Ross, and W. Gropp, Noncontiguous I/O through PVFS, Proceedings of the 2002 IEEE International Conference on Cluster Computing, Chicago, Illinois, USA, September 23-26, 2002, pp. 405-414
[107] I. Haddad, and M. Pourzandi, Open Source Web Servers Performance on Carrier-Class Linux Clusters, Linux Journal, April 2001, pp. 84-90
[108] I. Haddad, PVFS: A Parallel Virtual File System for Linux Clusters, Linux Journal, December 2000
[109] P. Barford, and M. Crovella, Generating Representative Web Workloads for Network and Server Performance Evaluation, Proceedings of the ACM Sigmetrics Conference, Madison, Wisconsin, USA, June 1998, pp. 151-160
[110] The S-Client, http://www.cs.rice.edu/CS/Systems/Web-measurement/paper/node9.html
[111] WebStone Web Server Benchmarking Tool, http://www.mindcraft.com/webstone
[112] SPEC Web Server Benchmarking Tool, http://www.spec.org/osg/web99
[113] TCP Westwood, http://www.cs.ucla.edu/NRL/hpi/tcpw/
[114] M. Andreolini, V. Cardellini, and M. Colajanni, Benchmarking Models and Tools for Distributed Web-Server Systems, Proceedings of Performance 2002, Rome, Italy, July 24-26, 2002, pp. 208-235
[115] E. Marcus, and H. Sten, BluePrints for High Availability: Designing Resilient Distributed Systems, Wiley, 2000
[116] S. Horman, Connection Synchronization, http://www.ultramonkey.org/papers/conn_sync
[117] I. Haddad, C. Leangsuksun, R. Libby, and S. Scott, HA-OSCAR: Towards Non-stop Services in High End and Grid Computing Environments, Poster Presentation at the Fifth Los Alamos Computer Science Institute Symposium, New Mexico, USA, October 12-14, 2004
[118] The Open Cluster Group, How to Install an OSCAR Cluster, Technical Report, November 3, 2005, http://oscar.openclustergroup.org/public/docs/oscar4.2/oscar4.2-install.pdf
[119] A. S. Tanenbaum, and M. S. Van, Distributed Systems: Principles and Paradigms, Prentice Hall, July 2001, pp. 371-375
[120] The rsync utility, http://samba.anu.edu.au/rsync
[121] The rsync algorithm, http://dp.samba.org/rsync/tech_report/node2.html
[122] The DRBD Tool, http://www.drbd.org
[123] The Router Advertisement Daemon Project, http://v6web.litech.org/radvd/
[124] RFC 2461, Neighbor Discovery for IP Version 6, http://www.faqs.org/rfcs/rfc2461.html
[125] The Linux-HA Project, http://www.linux-ha.org
[126] L. Marowsky-Brée, A New Cluster Resource Manager for Heartbeat, UKUUG LISA/Winter Conference on High Availability and Reliability, Bournemouth, UK, February 2004
[127] A. L. Robertson, The Evolution of the Linux-HA Project, UKUUG LISA/Winter Conference on High Availability and Reliability, Bournemouth, UK, February 25-26, 2004
[128] A. L. Robertson, Linux-HA Heartbeat Design, Proceedings of the 4th International Linux Showcase and Conference, Atlanta, October 10-14, 2000
[129] S. Horman, Connection Synchronisation (TCP Fail-Over), Technical Paper, November 2003
[130] The BogoMIPS Representation, http://www.tldp.org/HOWTO/BogoMips/x78.html
[131] D. Gordon, and I. Haddad, Apache Talking IPv6, Linux Journal, January 2003
[132] I. Haddad, The Future of IP Is Now, LinuxWorld Magazine, February 2004
[133] I. Haddad, IPv6 on Linux: Ongoing Development Effort and Tutorial, Linux User and Developer, June 2003
[134] D. Gamerman, Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference, Chapman & Hall/CRC Press, Boca Raton, Florida, USA, 1997
[135] G. Ciardo, and P. Darondeau, Applications and Theory of Petri Nets 2005, 26th International Conference, ICATPN 2005, Miami, USA, June 20-25, 2005
[136] G. Ciardo, J. Muppala, and K. Trivedi, SPNP: Stochastic Petri Net Package, Proceedings of the International Workshop on Petri Nets and Performance Models, IEEE Computer Society Press, Los Alamitos, California, USA, December 1989, pp. 142-150
[137] H. Choi, Markov Regenerative Stochastic Petri Nets, Computer Performance Evaluation, Vienna, 1994, pp. 337-357
[138] C. Hirel, R. Sahner, X. Zang, and K. S. Trivedi, Reliability and Performability Modeling using SHARPE 2000, Computer Performance Evaluation/TOOLS 2000, Schaumburg, USA, March 2000, pp. 345-349
[139] C. Hirel, B. Tuffin, and K. S. Trivedi, SPNP: Stochastic Petri Nets Version 6.0, Computer Performance Evaluation/TOOLS 2000, Schaumburg, USA, March 2000, pp. 354-357
[140] C. Leangsuksun, L. Shen, T. Liu, H. Song, and S. Scott, Availability Prediction and Modeling of High Availability OSCAR Cluster, IEEE International Conference on Cluster Computing, Hong Kong, China, December 2-4, 2003, pp. 227-230
[141] C. Leangsuksun, Highly Reliable Linux HPC Clusters: Self-awareness Approach, Proceedings of the International Symposium on Parallel Architectures, Algorithms, and Networks (I-SPAN), Hong Kong, China, May 2004
[142] C. Leangsuksun, L. Shen, T. Liu, H. Song, and S. Scott, Dependability Prediction of High Availability OSCAR Cluster Server, The 2003 International Conference on Parallel and Distributed Processing Techniques and Applications, Las Vegas, Nevada, USA, 2003, pp. 23-26
[143] C. Leangsuksun, L. Shen, T. Lui, and S. L. Scott, Achieving High Availability and Performance Computing with an HA-OSCAR Cluster, Future Generation Computer Systems, Volume 21, Number 1, January 2005, pp. 597-606
[144] The Open Cluster Group, http://www.openclustergroup.org
[145] C. Leangsuksun, L. Shen, H. Song, S. Scott, and I. Haddad, The Modeling and Dependability Analysis of High Availability OSCAR Cluster System, The 17th Annual International Symposium on High Performance Computing Systems and Applications, Sherbrooke, Québec, Canada, May 11-14, 2003
[146] I. Haddad, C. Leangsuksun, R. Libby, T. Liu, Y. Liu, and S. Scott, Highly Reliable Linux HPC Clusters: Self-awareness Approach, Proceedings of the 2nd International Symposium on Parallel and Distributed Processing and Applications, Hong Kong, China, December 13-15, 2004, pp. 217-222
[147] The SystemImager utility, http://www.systemimager.org
[148] I. Haddad, and C. Leangsuksun, Building Highly Available HPC Clusters with HA-OSCAR, Tutorial Presentation, The 6th LCI International Conference on Clusters: The HPC Revolution 2005, Chapel Hill, NC, USA, April 2005
[149] I. Haddad, and G. Butler, Experimental Studies of Scalability in Clustered Web Systems, Proceedings of the International Parallel and Distributed Processing Symposium 2004, Santa Fe, New Mexico, USA, April 2004
[150] I. Haddad, Keeping up with Carrier Grade, Linux Journal, August 2004
[151] I. Haddad, Carrier Grade Server Requirements, Linux User and Developer, August 2004
[152] I. Haddad, Moving Towards Open Platforms, LinuxWorld Magazine, May 2004
[153] I. Haddad, Linux Gains Momentum in Telecom, LinuxWorld Magazine, May 2004
[154] I. Haddad, OSDL Carrier Grade Linux, O'Reilly Network, April 2004
[155] I. Haddad, CGL Platforms: Characteristics and Development Efforts, Euro-Par 2003, Klagenfurt, Austria, August 2003
[156] I. Haddad, C. Leangsuksun, M. Pourzandi, and A. Tikotekar, Feasibility Study and Early Experimental Results Toward Cluster Survivability, Proceedings of Cluster Computing and Grid 2005, Cardiff, UK, May 9-12, 2005
[157] D. Gordon, and I. Haddad, Building an IPv6 DNS Server Node, Linux Journal, October 2003
[158] I. Haddad, C. Leangsuksun, M. Pourzandi, and A. Tikotekar, Experimental Results in Survivability of Secure Clusters, Proceedings of the 6th International Conference on Linux Clusters, Chapel Hill, NC, USA, April 2005
[159] I. Haddad, IPv6: The Essentials You Must Know, Linux User and Developer, May 2003
[160] I. Haddad, Using Freenet6 Service to Connect to the IPv6 Internet, Linux User and Developer, July 2003
[161] I. Haddad, C. Leangsuksun, R. Libby, T. Liu, Y. Liu, and S. L. Scott, High-Availability and Performance Clusters: Staging Strategy, Self-Healing Mechanisms, and Availability Analysis, Proceedings of the IEEE Cluster Conference 2004, San Diego, USA, September 20-23, 2004
[162] I. Haddad, Streaming Video on Linux over IPv6, Linux User and Developer, August 2003
[163] I. Haddad, Voice over IPv6 on Linux, Linux User and Developer, September 2003
[164] I. Haddad, NAT-PT: IPv4/IPv6 and IPv6/IPv4 Address Translation, Linux User and Developer, October 2003
[165] I. Haddad, Design and Implementation of HA Linux Clusters, IEEE Cluster 2001, Newport Beach, USA, October 8-11, 2001
[166] I. Haddad, Designing Large Scale Benchmarking Environments, ACM Sigmetrics 2002, Marina Del Rey, USA, June 2002
[167] I. Haddad, Supporting IPv6 on Linux Clusters, IEEE Cluster 2002, Chicago, USA, September 2002
[168] I. Haddad, IPv6: Characteristics and Ongoing Research, Internetworking 2003, San Jose, USA, June 2003
[169] I. Haddad, CGL Platforms: Characteristics and Development Efforts, Euro-Par 2003, Klagenfurt, Austria, August 2003
[170] I. Haddad, Carrier Grade Linux: Status and Ongoing Work, Real World Linux 2004, Toronto, Canada, April 2004
[171] I. Haddad, Carrier Grade Platforms: Characteristics and Ongoing Efforts, ICETE 2004, Setúbal, Portugal, August 2004
[172] I. Haddad, and C. Leangsuksun, Building HA/HPC Clusters with HA-OSCAR, Tutorial Presentation at the IEEE Cluster Conference, San Diego, USA, September 2004
[173] I. Haddad, C. Leangsuksun, and S. Scott, Towards Highly Available, Scalable, and Secure HPC Clusters with HA-OSCAR, The 6th International Conference on Linux Clusters, Chapel Hill, NC, USA, April 2005
[174] I. Haddad, and S. Scott, HA Linux Clusters: Towards Platforms Providing Continuous Service, Linux Symposium, Ottawa, Canada, July 2005
[175] I. Haddad, and C. Leangsuksun, HA-OSCAR: Highly Available Linux Cluster at your Fingertips, IEEE Cluster 2005, Boston, USA, September 2005
[176] I. Haddad, HA Linux Clusters, Open Cluster Group 2001, Illinois, USA, March 2001
[177] I. Haddad, Combining HA and HPC, Open Cluster Group 2002, Montréal, Canada, June 2002
[178] I. Haddad, Towards Carrier Grade Linux Platforms, USENIX 2004, Boston, USA, June 2004
[179] I. Haddad, Towards Unified Clustering Infrastructure, Linux World Expo and Conference, San Francisco, USA, August 2005
[180] I. Haddad, Carrier Grade Linux: Status and Ongoing Work, Real World Linux 2004, Toronto, Canada, April 2004
[181] I. Haddad, Carrier Grade Platforms: Characteristics and Ongoing Efforts, ICETE 2004, Setúbal, Portugal, August 2004
[182] C. Leangsuksun, A Failure Predictive and Policy-Based High Availability Strategy for Linux High Performance Computing Cluster, The 5th LCI International Conference on Linux Clusters: The HPC Revolution 2004, Austin, USA, May 18-20, 2004
[183] TechTarget, Dictionary for Internet and Computer Technologies, http://whatis.techtarget.com

Glossary
The definitions of the terms appearing in this glossary are referenced from [183].

3G: 3G is short for third-generation wireless, and refers to near-future developments in personal and business wireless technology, especially mobile communications.

AAA: Authentication, authorization, and accounting (AAA) is a term for a framework for intelligently controlling access to computer resources, enforcing policies, auditing usage, and providing the information necessary to bill for services. These combined processes are considered important for effective network management and security. Authentication, authorization, and accounting services are often provided by a dedicated AAA server, a program that performs these functions. A current standard by which network access servers interface with the AAA server is the Remote Authentication Dial-In User Service (RADIUS).

Active Node: A node that is providing (or capable of providing) a service.

Active/active: A redundancy configuration where all servers in the cluster run their own applications but are also ready to take over for a failed server if needed.

Active/standby: A redundancy configuration where one server is running the application while another server in the cluster is idle but ready to take over if needed.

ASP: An application service provider (ASP) is a company that offers individuals or enterprises access over the Internet to applications and related services that would otherwise have to be located in their own personal or enterprise computers.

Availability: Availability is the amount of time that a system or service is provided in relation to the amount of time the system or service is not provided. Availability is commonly expressed as a percentage.

C: C is a structured, procedural programming language that has been widely used for both operating systems and applications and that has had a wide following in the academic community.

C++: C++ is an object-oriented programming language.

CGI: The common gateway interface (CGI) is a standard way for a web server to pass a web user's request to an application program and to receive data back to forward to the user.

Client/server: Client/server describes the relationship between two computer programs in which one program, the client, makes a service request from another program, the server, which fulfills the request.

Cluster: A collection of cluster nodes that may change dynamically as nodes join or leave the cluster; two or more computer nodes used as a single computing entity to provide a service or run an application for the purpose of high availability, scalability, and distribution of tasks.

COTS: Commercial off-the-shelf describes ready-made products that can easily be obtained.
CMS: A Cluster Management System (CMS) is a management layer that allows the whole cluster to be managed as a single entity.

Daemon: A program that runs continuously in the background, until activated by a particular event. A daemon can constantly query for requests or await direct action from a user or other process.

DRAM: Dynamic random access memory (DRAM) is the most common random access memory (RAM) for personal computers and workstations.

DRBD: Distributed Replicated Block Device.

DNS: The domain name system (DNS) is the way that Internet domain names are located and translated into IP addresses. A domain name is a meaningful and easy-to-remember "handle" for an Internet address.

DIMM: A DIMM (dual in-line memory module) is a double SIMM (single in-line memory module). Like a SIMM, it is a module containing one or several random access memory (RAM) chips on a small circuit board with pins that connect it to the computer motherboard.

EIDE: Enhanced (sometimes "Expanded") IDE is a standard electronic interface between a computer and its mass storage drives. EIDE's enhancements to Integrated Drive Electronics (IDE) make it possible to address a hard disk larger than 528 Mbytes. EIDE also provides faster access to the hard drive, support for Direct Memory Access (DMA), and support for additional drives.

Failover: The ability to automatically switch a service or capability to a redundant node, system, or network upon the failure or abnormal termination of the currently active node, system, or network.

Failure: The inability of a system or system component to perform a required function within specified limits. A failure may be produced when a fault is encountered. Examples of failures include invalid data being provided, slow response time, and the inability of a service to take a request. Causes of failure can be hardware, firmware, software, network, or anything else that interrupts the service.

Five nines: Five nines is measured as 99.999% service availability. It is equivalent to 5 minutes a year of total planned and unplanned downtime of the service provided by the system.

FTP: File Transfer Protocol (FTP) is a standard Internet protocol that defines one way of exchanging files between computers on the Internet.

Gateways: Gateways are bridges between two different technologies or administration domains. A media gateway performs the critical function of converting voice messages from a native telecommunications time-division-multiplexed network to an Internet protocol packet-switched network.

High Availability: The state of a system having a very high ratio of service uptime compared to service downtime. Highly available systems are typically rated in terms of the number of nines, such as five nines or six nines.

HLR: The Home Location Register (HLR) is the main database of permanent subscriber information for a mobile network.

HTML: Hypertext Markup Language (HTML) is the set of markup symbols or codes inserted in a file intended for display on a World Wide Web browser page. The markup tells the web browser how to display a web page's words and images for the user.

HTTP: Hypertext Transfer Protocol (HTTP) is the set of rules for exchanging files (text, graphic images, sound, video, and other multimedia files) on the World Wide Web.
HTTP-NG: Hypertext Transfer Protocol - Next Generation.

Internet: The Internet is a worldwide system of computer networks, a network of networks in which users at any one computer can, if they have permission, get information from any other computer (and sometimes talk directly to users at other computers). It was conceived by the Advanced Research Projects Agency (ARPA) of the U.S. government in 1969 and was first known as the ARPANET. The Internet is a public, cooperative, and self-sustaining facility accessible to hundreds of millions of people worldwide. Physically, the Internet uses a portion of the total resources of the currently existing public telecommunication networks.

IP: The Internet Protocol (IP) is the method or protocol by which data is sent from one computer to another on the Internet.

iptables: iptables is a Linux command used to set up, maintain, and inspect the tables of IP packet filter rules in the Linux kernel. There are several different tables that may be defined, and each table contains a number of built-in chains and may contain user-defined chains. Each chain is a list of rules that can match a set of packets; each rule specifies what to do with a packet that matches. This is called a target, which may be a jump to a user-defined chain in the same table.

IPv6: Internet Protocol Version 6 (IPv6) is the latest version of the Internet Protocol. IPv6 is a set of specifications from the Internet Engineering Task Force (IETF) that was designed as an evolutionary set of improvements to the current IP Version 4.

ISDN: Integrated Services Digital Network (ISDN) is a set of standards for digital transmission over ordinary telephone copper wire as well as over other media.

I/O: I/O describes any operation, program, or device that transfers data to or from a computer.

ISP: An Internet service provider (ISP) is a company that provides individuals and other companies access to the Internet and other related services such as web site building and virtual hosting.

ISV: Independent Software Vendor.

LAN: A local area network (LAN) is a group of computers and associated devices that share a common communications line or wireless link and typically share the resources of a single processor or server within a small geographic area.

Management Server: Management servers handle traditional network management operations, as well as service and customer management. These servers provide services such as the Home Location Register and Visitor Location Register (for wireless networks) or customer information, such as personal preferences including the features the customer is authorized to use.

MPP: Massively Parallel Processors (MPP) is the coordinated processing of a program by multiple processors that work on different parts of the program, with each processor using its own operating system and memory. Typically, MPP processors communicate using some messaging interface.

MP3: MP3 (MPEG-1 Audio Layer-3) is a standard technology and format for compressing a sound sequence into a very small file (about one-twelfth the size of the original file) while preserving the original level of sound quality when it is played.

MSS: Maximum Segment Size (MSS).

MTTF: Mean Time To Failure (MTTF) is the interval in time during which the system can provide service without failure.

MTTR: Mean Time To Repair (MTTR) is the interval in time it takes to resume service after a failure has been experienced.
NAS: Network-attached storage (NAS) is hard disk storage that is set up with its own network address rather than being attached to the department computer that is serving applications to a network's workstation users.

NAT: NAT (Network Address Translation) is the translation of an Internet Protocol address (IP address) used within one network to a different IP address known within another network. One network is designated the inside network and the other is the outside.

Network: A connection of nodes which facilitates communication among them. Usually, the connected nodes in a network use a well-defined network protocol to communicate with each other.

Network Protocols: Rules for determining the format and transmission of data. Examples of network protocols include TCP/IP and UDP.

NIC: A network interface card (NIC) is a computer circuit board or card that is installed in a computer so that it can be connected to a network.

Node: A single computer unit, in a network, that runs with one instance of a real or virtual operating system.

NTP: Network Time Protocol (NTP) is a protocol that is used to synchronize computer clock times in a network of computers.

OSI: The Open System Interconnection model defines a networking framework for implementing protocols in seven layers. Control is passed from one layer to the next, starting at the application layer in one station, proceeding to the bottom layer, over the channel to the next station, and back up the hierarchy.

Performance: The efficiency of a system while performing tasks. Performance characteristics include the total throughput of an operation and its impact on a system. The combination of these characteristics determines the total number of activities that can be accomplished over a given amount of time.

Perl: Perl is a script programming language that is similar in syntax to the C language and that includes a number of popular Unix facilities such as sed, awk, and tr.

PCI: PCI (Peripheral Component Interconnect) is an interconnection system between a microprocessor and attached devices in which expansion slots are spaced closely for high-speed operation. Using PCI, a computer can support new PCI cards while continuing to support Industry Standard Architecture (ISA) expansion cards, an older standard.

PDA: Personal digital assistant (PDA) is a term for any small mobile hand-held device that provides computing and information storage and retrieval capabilities for personal or business use, often for keeping schedule calendars and address book information handy.

Proxy Server: A computer network service that allows clients to make indirect network connections to other network services. A client connects to the proxy server and then requests a connection, file, or other resource available on a different server. The proxy provides the resource either by connecting to the specified server or by serving it from a cache. In some cases, the proxy may alter the client's request or the server's response for various purposes.

RAID: Redundant array of independent disks (RAID) is a way of storing the same data in different places (thus, redundantly) on multiple hard disks.

RAMDISK: A RamDisk is a portion of memory that is allocated to be used as a hard disk partition.

RAS: Reliability, availability, and serviceability.

Recovery: To return a failing component, node, or system to a working state. A failing component can be a hardware or a software component of a node or network.
Recovery can also be initiated to work around a fault that has been detected, ultimately restoring the service.

RTT: Round-Trip Time (RTT) is the time required for a network communication to travel from the source to the destination and back. RTT is used by routing algorithms to aid in calculating optimal routes.

SAN: A Storage Area Network (SAN) is a high-speed special-purpose network (or sub-network) that interconnects different kinds of data storage devices with associated data servers on behalf of a larger network of users.

SCP: A Service Control Point (SCP) server is an entity in the intelligent network that implements the service control function: an operation that affects the recording, processing, transmission, or interpretation of data.

SCSI: The Small Computer System Interface (SCSI) is a set of ANSI standard electronic interfaces that allow personal computers to communicate with peripheral hardware such as disk drives, tape drives, CD-ROM drives, printers, and scanners faster and more flexibly than previous interfaces.

Service: A set of functions provided by a computer system. Examples of telco services include media gateway, signaling, or soft switch types of applications. Some general examples of services include web-based or database transaction types of applications.

Session: A series of consecutive page requests to the web server from the same user.

Signaling Servers: Signaling servers handle call control, session control, and radio resource control. A signaling server handles the routing and maintains the status of calls over the network. It takes the requests of user agents who want to connect to other user agents and routes them to the appropriate signaling server.

SLA: A Service Level Agreement (SLA) is a contract between a network service provider and a customer that specifies, usually in measurable terms, what services the network service provider will furnish.

SPOF: Single point of failure (SPOF) is any component or communication path within a computer system that would result in an interruption of the service if it failed.

SMP: Symmetric Multiprocessors (SMP) is the processing of programs by multiple processors that share a common operating system and memory. The processors share memory and the I/O bus or data path. A single copy of the operating system is in charge of all the processors.

SSI: Single System Image (SSI) is a form of distributed computing in which, by using a common interface, multiple networks, distributed databases, or servers appear to the user as one system. In SSI systems, all nodes share the operating system environment in the system.

Six nines: Six nines is measured as 99.9999% service availability. It is equivalent to 30 seconds a year of total planned and unplanned downtime of the service provided by the system.

Standby: Not currently providing service but prepared to take over the active state.

System: A computer system that consists of one computer node or many nodes connected via a computer network mechanism.

Switch-over: The term switch-over designates circumstances where the cluster moves the active state of a particular component or node from one component or node to another, after the failure of the active component or node. Switch-over operations are usually the consequence of administrative operations or the escalation of recovery procedures.

Tcl: Tcl is an interpreted script language developed by Dr. John Ousterhout at the University of California, Berkeley, and now developed and maintained by Sun Laboratories.
TCP: TCP (Transmission Control Protocol) is a set of rules (protocol) used with the Internet Protocol (IP) to send data in the form of message units between computers over the Internet. While IP takes care of handling the actual delivery of the data, TCP takes care of keeping track of the individual units of data (called packets) that a message is divided into for efficient routing through the Internet.

TFTP: Trivial File Transfer Protocol (TFTP) is an Internet software utility for transferring files that is simpler to use than the File Transfer Protocol (FTP) but less capable. It is used where user authentication and directory visibility are not required.

TTL: Time-to-live (TTL) is a value in an Internet Protocol (IP) packet that tells a network router whether the packet has been in the network too long and should be discarded.

QoS: Quality of Service (QoS) is the idea that transmission rates, error rates, and other characteristics can be measured, improved, and, to some extent, guaranteed in advance.

URI: To paraphrase the World Wide Web Consortium, Internet space is inhabited by many points of content. A URI (Uniform Resource Identifier; pronounced YEW-AHR-EYE) is the way you identify any of those points of content, whether it be a page of text, a video or sound clip, a still or animated image, or a program. The most common form of URI is the web page address, which is a particular form or subset of URI called a Uniform Resource Locator (URL).

USB: USB (Universal Serial Bus) is a plug-and-play interface between a computer and add-on devices (such as audio players, joysticks, keyboards, telephones, scanners, and printers). With USB, a new device can be added to your computer without having to add an adapter card or even having to turn the computer off.

User: An external entity that acquires service from a computer system. It can be a human being, an external device, or another computer system.

Web Service: Web services are loosely coupled software components delivered over Internet standard technologies. A web service can also be defined as a self-contained, modular application that can be described, published, located, and invoked over the web.