The Datacenter Needs an Operating System
Matei Zaharia, Benjamin Hindman, Andy Konwinski, Ali Ghodsi, Anthony Joseph, Randy Katz, Scott Shenker, Ion Stoica

Background
• Clusters of commodity servers have become a major computing platform in industry and academia
• Driven by data volumes outpacing the processing capabilities of single machines
• Democratized by cloud computing

Background
• Some have declared that “the datacenter is the new computer”
• Claim: this new computer increasingly needs an operating system
• Not necessarily a new host OS, but a common software layer that manages resources and provides shared services for the whole datacenter, like an OS does for one host

Why Datacenters Need an OS
• Growing number of applications
  – Parallel processing systems: MapReduce, Dryad, Pregel, Percolator, Dremel, MR Online
  – Storage systems: GFS, BigTable, Dynamo, SCADS
  – Web apps and supporting services
• Growing number of users
  – 200+ for Facebook’s Hadoop data warehouse, running near-interactive ad hoc queries

What Operating Systems Provide
• Resource sharing across applications & users
• Data sharing between programs
• Programming abstractions (e.g. threads, IPC)
• Debugging facilities (e.g. ptrace, gdb)
Result: OSes enable a highly interoperable software ecosystem that we now take for granted

An Analogy
• Today, a scientist analyzing data on a single machine can pipe it through a variety of tools, write new tools that interface with these through standard APIs, and trace across the stack
• In the future, the scientist should be able to fire up a cloud on EC2 and do the same thing:
  – Intermix a variety of apps & programming models
  – Write new parallel programs that talk to these
  – Get a unified interface for managing the cluster
  – Debug and trace across all these components

Today’s Datacenter OS
• Hadoop MapReduce as common execution and resource sharing platform
• Hadoop InputFormat API for data sharing
• Abstractions for productivity programmers, but not for system builders
• Very challenging to debug across all the layers

Tomorrow’s Datacenter OS
• Resource sharing:
  – Lower-level interfaces for fine-grained sharing (Mesos is a first step in this direction)
  – Optimization for a variety of metrics (e.g. energy)
  – Integration with network scheduling mechanisms (e.g. Seawall [NSDI ’11], NOX, Orchestra)

Tomorrow’s Datacenter OS
• Data sharing:
  – Standard interfaces for cluster file systems, key-value stores, etc.
  – In-memory data sharing (e.g. Spark, DFS cache), and a unified system to manage this memory
  – Streaming data abstractions (analogous to pipes)
  – Lineage instead of replication for reliability (RDDs)

Tomorrow’s Datacenter OS
• Programming abstractions:
  – Tools that can be used to build the next MapReduce / BigTable in a week (e.g. BOOM)
  – Efficient implementations of communication primitives (e.g. shuffle, broadcast)
  – New distributed programming models

Tomorrow’s Datacenter OS
• Debugging facilities:
  – Tracing and debugging tools that work across the cluster software stack (e.g. X-Trace, Dapper)
  – Replay debugging that takes advantage of limited languages / computational models
  – Unified monitoring infrastructure and APIs

Putting it All Together
• A successful datacenter OS might let users:
  – Build a Hadoop-like software stack in a week using the OS’s abstractions, while gaining other benefits (e.g. cross-stack replay debugging)
  – Share data efficiently between independently developed programming models and applications
  – Understand cluster behavior without having to log into individual nodes
  – Dynamically share the cluster with other users

Conclusion
• Datacenters need an OS-like software stack for the same reasons single computers did: manageability, efficiency & programmability
• An OS is already emerging in an ad-hoc way
• Researchers can help by taking a long-term approach towards these problems

How Researchers can Help
• Focus on paradigms, not performance
  – Industry is tackling performance but lacks luxury to take long-term view towards abstractions
• Explore clean-slate approaches
  – Likelier to have impact here than in a “real” OS because datacenter software changes quickly!
• Bring cluster computing to non-experts
  – Much harder and more rewarding than big users
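
Illustration: lineage instead of replication
The data sharing slide mentions "lineage instead of replication for reliability (RDDs)". The idea can be sketched in a few lines: rather than keeping extra copies of a derived dataset, each dataset remembers its parent and the transformation that produced it, so lost in-memory data can be recomputed. This is only a minimal single-process sketch. The `Dataset` class and its field names are hypothetical, invented for illustration, and are not the Spark or RDD API.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Dataset:
    """A dataset that remembers how it was derived (its lineage). Hypothetical."""
    parent: Optional["Dataset"]
    transform: Optional[Callable[[List[int]], List[int]]]
    source: Optional[List[int]] = None  # only set for the root dataset
    cache: Optional[List[int]] = None   # in-memory copy; may be lost

    def compute(self) -> List[int]:
        # Serve from memory while the cached copy is still present.
        if self.cache is not None:
            return self.cache
        # Otherwise rebuild from lineage: read the source, or re-apply
        # the transformation to the (possibly recomputed) parent.
        if self.parent is None:
            data = list(self.source)
        else:
            data = self.transform(self.parent.compute())
        self.cache = data
        return data


root = Dataset(parent=None, transform=None, source=[1, 2, 3, 4])
doubled = Dataset(parent=root, transform=lambda xs: [2 * x for x in xs])
evens = Dataset(parent=doubled,
                transform=lambda xs: [x for x in xs if x % 4 == 0])

first = evens.compute()

# Simulate a failure that wipes the in-memory copies of derived data.
doubled.cache = None
evens.cache = None

# No replica exists; the result is rebuilt by replaying the lineage.
recovered = evens.compute()
```

The trade-off this sketch is meant to show: recovery costs recomputation time instead of the storage and network traffic of keeping replicas, which only works when transformations are deterministic.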