Lessons from Giant-Scale Services
IEEE Internet Computing, Vol. 5, No. 4, July/August 2001
Eric A. Brewer, University of California, Berkeley, and Inktomi Corporation
Presentation: Ηλίας Τσιγαρίδας (M484)

Examples of Giant-Scale Services
- AOL, Microsoft Network, Yahoo, eBay, CNN, instant messaging, Napster, and many more
- The demand: they must always be available, despite their scale, growth rate, and the rapid evolution of content and features

Article Characteristics
- An "experience" article: principles and approaches, with no literature pointers and no quantitative evaluation, focusing on high-level design
- The reasons: it is a new area, and the information is proprietary in nature

Article Scope
- Look at the basic model of giant-scale services
- Focus on the challenges of high availability, evolution, and growth
- Give principles for the above
- Simplify the design of large systems

Basic Model (General)
- The "infrastructure services": Internet-based systems that provide instant messaging, wireless services, and so on
- We discuss: single-site, single-owner, well-connected clusters, perhaps part of a larger service
- We do not discuss: wide-area issues (network partitioning, low or discontinuous bandwidth, multiple administrative domains), service monitoring, network QoS, security, logging and log analysis, or DBMSs
- We focus on: high availability (replication, degradation, disaster tolerance) and online evolution
- The scope is bridging the gap between the basic building blocks of giant-scale services and the real-world scalability and availability they require

Basic Model (Advantages)
- Access anywhere, anytime: the infrastructure is ubiquitous, so you can access the service from home, work, an airport, and so on
- Availability via multiple devices: the infrastructure handles the processing (most of it, at least), and users access the service via set-top boxes, network computers, smart phones, and so on; this offers more functionality for a given cost and battery life
- Groupware support: centralizing data from many users allows groupware applications such as calendars, teleconferencing systems, and so on
- Lower overall cost: overall cost is hard to measure, but infrastructure services have an advantage over designs based on stand-alone devices through high utilization; centralized administration also reduces cost, though this is harder to quantify
- Simplified service updates: updates without physical distribution; the most powerful long-term advantage

Basic Model (Assumptions)
- The service provider has limited control over the clients and the IP network
- Queries drive the service
- Read-only queries greatly outnumber update queries
- Giant-scale services use CLUSTERS

Basic Model (Components)
- Clients, such as Web browsers, initiate the queries
- IP network: the public Internet or a private network; provides access to the service
- Load manager: provides indirection between the service's external name and the servers' physical names (IP addresses) and balances the load; proxies or firewalls may sit before it
- Servers: combine CPU, memory, and disks into an easy-to-replicate unit
- Persistent data store: a replicated or partitioned database spread across the servers; optionally external DBMSs or RAID storage
- Backplane (optional): handles inter-server traffic
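The indirection that the load manager provides, one external service name mapped to many physical servers, can be sketched as a toy fault-aware round-robin router. This is only an illustration with hypothetical IP addresses; a real service would use round-robin DNS, layer-4/7 switches, or smart clients rather than in-process Python:

```python
import itertools

class LoadManager:
    """Toy load manager: maps one external service name to many
    physical servers and skips servers that health checks have
    marked down."""

    def __init__(self, servers):
        self.servers = list(servers)        # physical names (IP addresses)
        self.up = set(self.servers)         # currently healthy servers
        self._rr = itertools.cycle(self.servers)

    def mark_down(self, server):
        self.up.discard(server)

    def mark_up(self, server):
        if server in self.servers:
            self.up.add(server)

    def route(self, query):
        """Round-robin over the healthy servers only, like
        round-robin DNS made fault-aware."""
        if not self.up:
            raise RuntimeError("service unavailable: no healthy servers")
        for server in self._rr:
            if server in self.up:
                return server

lm = LoadManager(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
lm.mark_down("10.0.0.2")
targets = [lm.route(q) for q in range(4)]
# the failed server "10.0.0.2" is never chosen while it is down
```

Clients only ever see the external name; which physical server answers can change per query and per failure, which is exactly the indirection the load-manager component exists to provide.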
Basic Model (Load Management)
- Round-robin DNS
- "Layer-4" switches: understand TCP and port numbers
- "Layer-7" switches: parse URLs
- Custom "front-end" nodes: act like service-specific "layer-7" routers
- Smart clients: include the clients themselves in the load balancing, e.g., via an alternative DNS or name server
- Two opposite approaches: the simple Web farm and the search-engine cluster [figures]

High Availability
- Expected to be like the telephone, rail, or water systems
- General features: extreme symmetry, no people, few cables, no external disks, no monitors
- Inktomi, in addition: manages the cluster offline and limits temperature and power variations

High Availability (Metrics)
- uptime = (MTBF − MTTR) / MTBF
- yield = queries completed / queries offered
- harvest = data available / complete data

High Availability (DQ Principle)
- Data per query × queries per second ≈ constant
- The system's overall capacity has some particular physical bottleneck, e.g., total I/O bandwidth or total seeks per second: the total amount of data that can be moved per second
- The DQ value is measurable and tunable: it rises with added nodes and software optimization, and falls with faults
- Focus on the relative DQ value, not the absolute one: define the DQ value of your system; normally it scales linearly with the number of nodes
- To analyze the impact of faults, focus on how the DQ reduction influences the three metrics
- Applies only to data-intensive sites

High Availability (Replication vs. Partitioning)
- Example: a 2-node cluster with one node down
  - Replication: 100% harvest, 50% yield; DQ drops by 50% (maintain D, reduce Q)
  - Partitioning: 50% harvest, 100% yield; DQ drops by 50% (reduce D, maintain Q)
- Replication wins if the bandwidth is the same: the extra cost is in bandwidth, not in disks, and recovery is easy [figure]
- We might also use partial replication and randomization

High Availability (Graceful Degradation)
- We cannot avoid saturation, because:
  - the peak-to-average load ratio is 1.6:1 to 6:1, and it is expensive to build capacity above the normal peak
  - single-event bursts occur (e.g., online ticket sales for special events)
  - faults such as power failures or natural disasters substantially reduce the overall DQ, and the remaining nodes become saturated
- So we MUST have mechanisms for degradation
- The DQ principle gives us the options:
  - limit Q (capacity) to maintain D: admission control (AC) reduces Q
  - focus on harvest: reduce D on dynamic databases, e.g., cut the effective database to half (a new approach), which reduces D and increases Q
  - or do both
- More sophisticated techniques:
  - cost-based AC: estimate each query's cost and reduce the data per query, augmenting Q
  - priority- (or value-) based AC: drop low-valued queries, e.g., execute a stock trade within 60 s or the user pays no commission
  - reduced data freshness: reducing freshness reduces the work per query, increasing yield at the expense of harvest

High Availability (Disaster Tolerance)
- A combination of managing replicas and graceful degradation
- How many locations? How many replicas in each location?
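The harvest and yield arithmetic behind these replica-management decisions can be checked with a small calculation. This is a sketch that assumes data and load are spread uniformly across the nodes; the helper names are mine, not the article's:

```python
def uptime(mtbf, mttr):
    """Availability by time: uptime = (MTBF - MTTR) / MTBF."""
    return (mtbf - mttr) / mtbf

def replication_vs_partitioning(nodes, failed):
    """Harvest and yield for a uniform n-node cluster with `failed`
    nodes down: replicas keep all the data but lose capacity,
    partitions keep capacity but lose data."""
    alive = (nodes - failed) / nodes
    replicated = {"harvest": 1.0, "yield": alive}   # maintain D, reduce Q
    partitioned = {"harvest": alive, "yield": 1.0}  # reduce D, maintain Q
    return replicated, partitioned

rep, part = replication_vs_partitioning(nodes=2, failed=1)
# rep:  100% harvest, 50% yield; part: 50% harvest, 100% yield.
# Either way the cluster's DQ value drops by the same 50%.
```

The symmetry is the point of the DQ principle: a lost node removes a fixed slice of DQ, and the design choice is only whether that loss shows up in D (harvest) or in Q (yield).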
- Load management: a "layer-4" switch does not help with the loss of a whole cluster; smart clients are the solution

Online Evolution and Growth
- We must plan for continuous growth and frequent functionality updates
- Maintenance and upgrades are controlled failures
- The total DQ value lost is ΔDQ = n · u · (average DQ per node) = DQ · u, where n is the number of nodes and u is the time each node needs for the online evolution
- Three approaches, illustrated for a 4-node cluster [figure]

Conclusions: the basic lessons learned
- Get the basics right: a professional data center, layer-7 switches, symmetry
- Decide on your availability metrics: everyone must agree on the goals, and harvest and yield matter more than uptime
- Focus on MTTR at least as much as on MTBF: improving MTTR is easier and has the same impact
- Understand load redirection during faults: data replication is insufficient; you also need excess DQ
- Graceful degradation is a critical part: use DQ analysis on all upgrades, with intelligent admission control and dynamic database reduction, and plan capacity
- Automate upgrades as much as possible, and have a fast, simple way to return to the older version

Final Statement
- Smart clients could simplify all of the above
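The ΔDQ formula for online evolution can be checked with hypothetical numbers (a 4-node cluster, 1000 DQ units per node, 2 hours of upgrade time per node; the figures are illustrative, not from the article):

```python
def delta_dq(n_nodes, u_time, dq_per_node):
    """Total DQ lost during an online upgrade:
    dDQ = n * u * (average DQ per node) = DQ * u."""
    cluster_dq = n_nodes * dq_per_node       # DQ value of the whole cluster
    loss = n_nodes * u_time * dq_per_node    # first form of the formula...
    assert loss == cluster_dq * u_time       # ...equals the second form
    return loss

# 4 nodes, 1000 DQ units each, 2 hours of upgrade work per node
loss = delta_dq(n_nodes=4, u_time=2, dq_per_node=1000)
# 8000 DQ-hours are lost however the upgrade work is scheduled; the
# three upgrade approaches differ only in how that fixed loss is
# spread over time, not in its total.
```

This is why the slide calls upgrades controlled failures: the DQ loss is a known quantity you schedule, rather than an outage you react to.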