Some Claims about Giant-Scale Systems
Prof. Eric A. Brewer, UC Berkeley
Vanguard Conference, May 5, 2003

Five Claims
- How to build a giant-scale system: scalability is easy (availability is hard)
- Services are King (infrastructure centric)
- P2P is cool but overrated
- XML doesn't help much
- Need new IT for the Third World

Claim 1: Availability >> Scalability

Key New Problems
- Unknown but large growth
  - Incremental & absolute scalability
  - 1000's of components
- Must be truly highly available
  - Hot swap everything (no recovery time)
  - $6M/hour for stocks, $300K/hour for retailers
  - Graceful degradation under faults & saturation
- Constant evolution (internet time)
  - Software will be buggy
  - Hardware will fail
  - These can't be emergencies...

Typical Cluster
[Diagram: a typical cluster configuration]

Scalability is EASY
- Just add more nodes...
  - unless you want HA
  - or you want to change the system...
- Hard parts: availability, overload, evolution

Step 1: Basic Architecture
[Diagram: clients reach a single-site server over the IP network through a load manager; server nodes share an optional backplane and a persistent data store]

Step 2: Divide Persistent Data
- Transactional (expensive, must be correct)
- Supporting data
  - HTML, images, applets, etc.
  - Read mostly
  - Publish in snapshots (normally stale)
- Session state
  - Persistent across connections
  - Limited lifetime
  - Can be lost (rarely)

Step 3: Basic Availability
a) Depend on layer-7 switches
  - Isolate external names (IP address, port) from specific machines
b) Automatic detection of problems
  - Node-level checks (e.g. memory footprint)
  - (Remote) app-level checks
c) Focus on MTTR
  - Easier to fix & test than MTBF

Step 4: Overload Strategy
- Goal: degrade service (some) to allow more capacity during overload
- Examples:
  - Simpler pages (less dynamic, smaller size)
  - Fewer options (just the basics)
- "Overload mode" must kick in automatically
  - Relatively easy to detect

Step 5: Disaster Tolerance
a) Pick a few locations
  - Independent failure
b) Dynamic redirection of load
  - Best: client-side control
  - Next: switch all traffic (long routes)
  - Worst: DNS remapping (takes a while)
c) Target site will get overloaded -- but you have overload handling

Step 6: Online Evolution
- Goal: rapid evolution without downtime
a) Publishing model
  - Decouple development from the live system
  - Atomic push of content
  - Automatic revert if trouble arises
b) Three methods (next slide)

Evolution: Three Approaches
- Flash upgrade
  - Fast reboot into new version
  - Focus on MTTR (< 10 sec)
  - Reduces yield (and uptime)
- Rolling upgrade
  - Upgrade nodes one at a time in a "wave"
  - Temporary 1/n harvest reduction, 100% yield
  - Requires co-existing versions
- "Big Flip" (next slide)

The Big Flip
Steps:
1) Take down 1/2 the nodes
2) Upgrade that half
3) Flip the "active half" (site upgraded)
4) Upgrade the second half
5) Return to 100%
- Avoids mixed versions (!)
  - Can replace schema, protocols, ...
- Twice used to change physical location

The Hat Trick
Merge the handling of:
- Disaster tolerance
- Online evolution
- Overload handling
The first two reduce capacity, which then kicks in overload handling (!) (a sketch of this merged logic follows below)
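A minimal sketch of the hat trick, assuming a toy Python model of the site: disaster tolerance and online evolution both remove nodes, and the same automatic overload check absorbs either event. The Site class, its node counts, and the per-node rates are illustrative assumptions, not details from the talk.

```python
# Minimal sketch (illustrative names/numbers): the "hat trick" merged logic.

class Site:
    """A single-site service: capacity is live nodes times per-node rate."""

    def __init__(self, total_nodes: int, qps_per_node: int):
        self.total_nodes = total_nodes
        self.live_nodes = total_nodes
        self.qps_per_node = qps_per_node  # queries/sec each node can serve

    def capacity(self) -> int:
        return self.live_nodes * self.qps_per_node

    def mode(self, offered_qps: int) -> str:
        # "Overload mode" kicks in automatically and is easy to detect:
        # offered load exceeds what the live nodes can serve. It does not
        # matter whether nodes are gone due to a disaster or an upgrade.
        if offered_qps > self.capacity():
            return "overload mode: simpler pages, just the basics"
        return "normal mode: full dynamic pages"

site = Site(total_nodes=100, qps_per_node=50)  # 5000 qps total capacity
site.live_nodes -= 50   # big flip (or a lost data center): half the nodes
print(site.mode(offered_qps=4000))  # -> overload mode absorbs the loss
```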
Claim 2: Services are King

Coming of The Infrastructure
[Chart: penetration of the telephone, the PC, and non-PC Internet devices, 1996-2000]

Infrastructure Services
- Much lower cost & more functionality
  - Longer battery life
  - Much simpler devices
  - Faster access: surfing is 3-7 times faster, graphics look good
- Data is in the infrastructure
  - Can lose the device
  - Enables groupware
  - Can update/access from home or work
  - Phone book on the web, not in the phone
  - Can use a real PC & keyboard

Transformation Examples
- Tailor content for each user & device
[Images: distillation examples with compression ratios of 6.8x, 65x, and 10x, including a distilled text page beginning "1.2 The Remote Queue Model: We introduce Remote Queues (RQ), ..."]

Infrastructure Services (2)
- Much cheaper overall cost (20x?)
  - Device utilization = 4%, infrastructure = 80%
  - Admin & support costs also decrease
- "Super Convergence" (all -> IP)
  - View PowerPoint slides with teleconference
  - Integrated cell phone, pager, web/email access
  - Map, driving directions, location-based services
- Can upgrade/add services in place!
  - Devices last longer and grow in usefulness
  - Easy to deploy new services => new revenue

Internet Phases (prediction)
1) Internet as New Media
  - HTML, basic search
2) Consumer Services (today)
  - Shopping, travel, Yahoo!, eBay, tickets, ...
3) Industrial Services
  - XML, micropayments, spot markets
4) Everything in the Infrastructure
  - Store your data in the infrastructure
  - Access anytime/anywhere

Claim 3: P2P is cool but overrated

P2P Services? Not soon...
Challenges:
- Untrusted nodes !!
- Network partitions
- Much harder to understand behavior
- Harder to upgrade
- Relatively few advantages...

Better: Smart Clients
- Mostly helps with:
  - Load balancing
  - Disaster tolerance
  - Overload
- Can also offload work from servers
- Can also personalize results
  - E.g. mix search results locally
  - Can include private data

Claim 4: XML doesn't help much...

The Problem
- Need services for computers to use
  - HTML only works for people
  - Sites depend on human interpretation of ambiguous content
- "Scraping" content is BAD
  - Very error prone
  - No strategy for evolution
- XML doesn't solve any of these issues!
  - At best: RPC with an extensible schema

Why it is hard...
- The real problem is *social*
  - What do the fields mean? Who gets to decide?
  - Two sides still need to agree on a schema
- Doesn't make evolution better...
  - Can you ignore stuff you don't understand?
  - When can a field change? Consequences?
  - At least need a versioning system... (see the sketch below)
- XML can mislead us to ignore/postpone the real issues!
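To make the versioning point concrete, here is a hedged sketch of the minimum agreement two sides need: an explicit schema version plus a shared rule for unknown fields. The message shape, field names, and major/minor policy are invented for illustration; the talk prescribes no particular scheme.

```python
# Hedged sketch: explicit schema versioning for machine-readable messages.
# Field names and the major/minor rule are invented for illustration.

SUPPORTED_MAJOR = 2  # both sides agreed on the meaning of version-2 fields

def parse_order(msg: dict) -> dict:
    major = int(msg["schema_version"].split(".")[0])
    if major != SUPPORTED_MAJOR:
        # A major-version bump means field meanings may have changed, so
        # "ignore what you don't understand" is no longer safe.
        raise ValueError(f"unsupported schema version {msg['schema_version']}")
    known = {"schema_version", "item", "quantity"}
    # By prior agreement, minor versions only *add* fields, so unknown
    # fields may be skipped; that rule is the social contract, not XML.
    return {k: v for k, v in msg.items() if k in known}

order = {"schema_version": "2.1", "item": "disk", "quantity": 4, "rush": True}
print(parse_order(order))  # "rush" (added in 2.1) is safely ignored
```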
Claim 5: New IT for the Third World
- Plug for a new area...
- Bridging the IT gap is the only long-term path to global stability
- Convergence makes it possible:
  - 802.11 wireless ($5/chipset)
  - Systems on a chip (cost, power)
  - Infrastructure services (cost, power)
- Goal: 10-100x reduction in overall cost and power

Five Claims
- How to build a giant-scale system: scalability is easy (availability is hard)
- Services are King (infrastructure centric)
- P2P is cool but overrated
- XML doesn't help much
- Need new IT for the Third World

Backup Slides

Refinement
- Retrieve part of a distilled object at higher quality
- Distilled image (by 60x); zoom in to original resolution

The CAP Theorem
[Diagram: triangle of Consistency, Availability, and tolerance to network Partitions]
- Theorem: you can have at most two of these properties for any shared-data system

Forfeit Partitions
- Examples: single-site databases, cluster databases, LDAP, the xFS file system
- Traits: 2-phase commit, cache validation protocols

Forfeit Availability
- Examples: distributed databases, distributed locking, majority protocols
- Traits: pessimistic locking, make minority partitions unavailable

Forfeit Consistency
- Examples: Coda, web caching, DNS
- Traits: expirations/leases, conflict resolution, optimistic

These Tradeoffs are Real
- The whole space is useful
- Real internet systems are a careful mixture of ACID and BASE subsystems
  - We use ACID for user profiles and logging (for revenue)
- But there is almost no work in this area
- Symptom of a deeper problem: the systems and database communities are separate but overlapping (with distinct vocabulary)

CAP Take Homes
- Can have consistency & availability within a cluster (foundation of Ninja), but it is still hard in practice
- OS/networking good at BASE/availability, but terrible at consistency
- Databases better at C than Availability
- Wide-area databases can't have both
- Disconnected clients can't have both
- All systems are probabilistic...

The DQ Principle
- Data/query * queries/sec = constant = DQ
  - For a given node and a given app/OS release
- A fault can reduce the capacity (Q), the completeness (D), or both
- Faults reduce this constant linearly (at best)

Harvest & Yield
- Yield: fraction of answered queries
  - Related to uptime but measured by queries, not by time
  - Drop 1 out of 10 connections => 90% yield
  - At full utilization: yield ~ capacity ~ Q
- Harvest: fraction of the complete result
  - Reflects that some of the data may be missing due to faults
  - Replication: maintain D under faults
- DQ corollary: harvest * yield ~ constant
  - ACID => choose 100% harvest (reduce Q but keep 100% D)

Harvest Options (see the sketch below)
1) Ignore lost nodes
  - RPC gives up
  - Forfeit a small part of the database
  - Reduce D, keep Q
2) Pair up nodes
  - RPC tries the alternate
  - Survives one fault per pair
  - Reduce Q, keep D
3) n-member replica groups
  - Decide when you care...
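The first two harvest options can be made concrete with a small sketch of the DQ bookkeeping, using only the definitions from the Harvest & Yield slide (yield = fraction of queries answered, harvest = fraction of the complete result, and yield ~ Q at full utilization). The node counts are illustrative.

```python
# Illustrative DQ bookkeeping for harvest options 1 and 2 (100 nodes, 1 lost).

def dq_after_fault(nodes_total: int, nodes_lost: int, strategy: str):
    alive = (nodes_total - nodes_lost) / nodes_total
    if strategy == "ignore_lost":   # option 1: forfeit data -> reduce D, keep Q
        harvest, yld = alive, 1.0
    elif strategy == "pair_up":     # option 2: keep D; at full utilization the
        harvest, yld = 1.0, alive   # lost capacity shows up as lost yield
    else:
        raise ValueError(strategy)
    return harvest, yld

for strategy in ("ignore_lost", "pair_up"):
    h, y = dq_after_fault(100, 1, strategy)
    print(f"{strategy}: harvest={h:.2f} yield={y:.2f} product={h * y:.2f}")
# Either way, one fault costs ~1% of DQ: harvest * yield stays ~0.99.
```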
Replica Groups
[Diagram: RAID-style replica groups]
With n members:
- Each fault reduces Q by 1/n
- D stable until the nth fault
- Added load is 1/(n-1) per fault
  - n=2 => double load or 50% capacity
  - n=4 => 133% load or 75% capacity
  - The "load redirection problem"
- Disaster tolerance: better have > 3 mirrors
(worked arithmetic in the sketch at the end)

Graceful Degradation
- Goal: smooth decrease in harvest/yield proportional to faults
- Saturation will occur
  - We know DQ drops linearly
  - High peak/average ratios...
  - Must reduce harvest or yield (or both)
  - Must do admission control!!!
- One answer: reduce D dynamically
  - Disaster => redirect load, then reduce D to compensate for the extra load

Thinking Probabilistically
- Maximize symmetry
  - SPMD + simple replication schemes
- Make faults independent
  - Requires thought
  - Avoid cascading errors/faults
  - Understand redirected load
  - KISS
- Use randomness
  - Makes the worst case and the average case the same
  - Ex: Inktomi spreads data & queries randomly
  - Node loss implies a random 1% harvest reduction

Server Pollution
- Can't fix all memory leaks
  - Third-party software leaks memory and sockets
  - So does the OS sometimes
  - Some failures tie up local resources
- Solution: planned periodic "bounce"
  - Not worth the stress to do any better
  - Bounce time is less than 10 seconds
  - Nice to remove load first...

Key New Problems
- Unknown but large growth
  - Incremental & absolute scalability
  - 1000's of components
- Must be truly highly available
  - Hot swap everything (no recovery time allowed)
  - No "night"
  - Graceful degradation under faults & saturation
- Constant evolution (internet time)
  - Software will be buggy
  - Hardware will fail
  - These can't be emergencies...

Conclusions
- Parallel programming is very relevant, except...
  - It historically avoids availability
  - No notion of online evolution
  - Limited notions of graceful degradation (checkpointing)
  - Best for CPU-bound tasks
- Must think probabilistically about everything
  - No such thing as a 100% working system
  - No such thing as 100% fault tolerance
  - Partial results are often OK (and better than none)
  - Capacity * Completeness == Constant

Partial Checklist
- What is shared? (namespace, schema?)
- What kind of state in each boundary?
- How would you evolve an API?
- Lifetime of references? Expiration impact?
- Graceful degradation as modules go down?
- External persistent names?
- Consistency semantics and boundary?

The Move to Clusters
- No single machine can handle the load; the only other solution is clusters
- Cluster advantages:
  - Cost: about 50% cheaper per CPU
  - Availability: possible to build HA systems
  - Incremental growth: add nodes as needed
  - Replace whole nodes (easier)

Goals
- Sheer scale
  - Handle 100M users, going toward 1B
  - Largest: AOL Web Cache, 12B hits/day
- High availability
  - Large cost for downtime: $250K per hour for online retailers, $6M per hour for stock brokers
- Disaster tolerance?
- Overload handling
- System evolution
- Decentralization?
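As noted under Replica Groups above, here is a worked version of that slide's arithmetic: with n members, one fault removes 1/n of capacity (Q) and redirects an added 1/(n-1) load onto each survivor. Only the formulas on the slide are used; the loop simply reproduces its n=2 and n=4 numbers.

```python
# Worked arithmetic from the "Replica Groups" slide, for one fault.

def after_one_fault(n: int):
    capacity_left = (n - 1) / n      # each fault reduces Q by 1/n
    survivor_load = 1 + 1 / (n - 1)  # redirected load per surviving member
    return capacity_left, survivor_load

for n in (2, 4):
    cap, load = after_one_fault(n)
    print(f"n={n}: capacity {cap:.0%}, load per survivor {load:.0%}")
# n=2: capacity 50%, load 200% (double load)
# n=4: capacity 75%, load 133%
```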