Telegraph
Endeavour Retreat 2000
Joe Hellerstein

Roadmap
• Motivation & Goals
• Application Scenarios
• Quickie core technology overview
  – Adaptive dataflow
  – Event-based storage manager
  – Come hear more about these tonight/tomorrow!
• Status and Plans
  – Dataflow infrastructure & apps
  – Storage manager?

Motivations
• Global data federation
  – All the data is online – what are we waiting for?
  – The plumbing is coming
    • XML/HTTP, XML/WAP, etc. give lowest-common-denominator communication
    • But how do you flow, summarize, query, and analyze data robustly over many sources in the wide area?
• Ubiquitous computing: more than clients
  – Sensors and their data feeds are key
    • Smart dust, biomedical (MEMS) sensors
    • Each consumer good records its (mis)use – disposable computing
    • Video from surveillance cameras, broadcasts, etc.
• A huge data flood is a-comin’!
  – Will it capsize the good ship Endeavour?

Initial Telegraph Goals
• Unify data access & dataflow apps
  – Commercial wrappers exist for most info sources
  – Most info-centric apps can be cast as dataflow
  – The data flood needs a big dataflow manager!
  – Goal: a robust, adaptive dataflow engine
• Unify storage
  – Currently lots of disparate data stores
    • Databases, files, email servers (and HTTP access on these)
  – Goal: a single, clean storage manager that can serve:
    • DB records & semantics
    • Files and “semantics”
    • Email folders, calendars, etc., and semantics

Challenge for Dataflow: Volatility!
• Federated query processors
  – À la Cohera, IBM DataJoiner
  – No control over stats, performance, administration
• Large cluster systems “scaling out”
  – No control over “system balance”
• User “CONTROL” of running dataflows
  – Long-running dataflow apps are interactive
  – No control over user interaction
• Sensor nets
  – No control over anything!
• Telegraph: a dataflow engine for these environments

The Data Flood: Main Features
• What does it look like?
  – Never ends: interactivity is required
    • Online, controllable algorithms for all tasks!
  – Big: data reduction/aggregation is key
  – Volatile: devices and networks at this scale will not behave nicely

The Telegraph Dataflow Engine
• Key technologies
  – Interactive control
    • Interactivity with early answers and examples
    • Online aggregation for data reduction
  – Dataflow programming via paths/iterators (a minimal iterator sketch follows the Eddies slide)
    • Elevate query-processing frameworks out of DBMSs
    • Long tradition of static optimization here – suggestive, but not sufficient for volatile environments
  – Continuously adaptive flow optimization
    • Massively parallel, adaptive dataflow
    • Rivers and Eddies

Static Query Plans
• Volatile environments like sensors need to adapt at a much finer grain

Continuous Adaptivity: Eddies
• How to order and reorder operators over time
  – Based on performance and economic/administrative feedback
• Vs. River:
  – River optimizes each operator “horizontally”
  – Eddies optimize a pipeline “vertically”
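The “dataflow programming via paths/iterators” bullet above refers to the pull-based iterator convention (the Volcano model) that query engines have long used: every operator exposes open/next/close and composes into plan trees. A minimal sketch, with interface and class names of my own choosing rather than Telegraph’s actual ones:

// Volcano-style pull iterator: operators compose as trees, each pulling
// tuples from its children on demand.
interface TupleIterator {
    void open();
    int[] next();    // returns the next tuple, or null when exhausted
    void close();
}

// A selection operator layered over any child iterator.
class Select implements TupleIterator {
    private final TupleIterator child;
    private final java.util.function.Predicate<int[]> pred;

    Select(TupleIterator child, java.util.function.Predicate<int[]> pred) {
        this.child = child;
        this.pred = pred;
    }
    public void open()  { child.open(); }
    public void close() { child.close(); }
    public int[] next() {
        // Pull from the child until a tuple satisfies the predicate.
        for (int[] t = child.next(); t != null; t = child.next())
            if (pred.test(t)) return t;
        return null;
    }
}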
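To make the eddy’s routing idea concrete, here is a minimal Java sketch of per-tuple operator reordering with a lottery-style ticket policy, in the spirit of the SIGMOD 2000 eddies work. The Operator interface, int[] tuples, and the reward rule are simplifying assumptions, not Telegraph’s actual classes; the real eddy also carries per-tuple “ready” and “done” bits so that joins, not just selections, can be routed.

import java.util.Arrays;
import java.util.Random;

// One commutable predicate in the pipeline.
interface Operator {
    boolean apply(int[] tuple);      // true = tuple survives this predicate
}

class Eddy {
    private final Operator[] ops;
    private final int[] tickets;     // per-operator reward (lottery scheduling)
    private final Random rng = new Random();

    Eddy(Operator[] ops) {
        this.ops = ops;
        this.tickets = new int[ops.length];
        Arrays.fill(tickets, 1);     // start with no preference
    }

    // Route one tuple; returns true iff it passes every operator.
    boolean route(int[] tuple) {
        boolean[] done = new boolean[ops.length];
        for (int remaining = ops.length; remaining > 0; remaining--) {
            int next = lottery(done);
            if (!ops[next].apply(tuple)) {
                tickets[next]++;     // reward: dropping tuples early saves work
                return false;
            }
            done[next] = true;
        }
        return true;
    }

    // Pick a not-yet-applied operator with probability proportional to tickets,
    // so selective, cheap operators drift toward the front of the order.
    private int lottery(boolean[] done) {
        int total = 0;
        for (int i = 0; i < ops.length; i++) if (!done[i]) total += tickets[i];
        int draw = rng.nextInt(total);
        for (int i = 0; i < ops.length; i++) {
            if (done[i]) continue;
            draw -= tickets[i];
            if (draw < 0) return i;
        }
        throw new IllegalStateException("unreachable");
    }
}

Because the order is re-drawn for every tuple, the pipeline adapts continuously as operator costs and selectivities drift – exactly the fine grain that static plans cannot reach.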
Unifying Storage
• Storage management is buried inside specific systems
• Elevate and expose the core services & semantic options
  – Layout/indexing
  – Concurrent access/modification
  – Recovery
• Design for clustered environments
  – Replicate for reliability (tie-ins with Ninja)
  – Cluster options: your RAM vs. my disk
  – Events & state machines for scalability
    • Unify eventflow and dataflow?
    • Share optimization lessons?

Status: Adaptive Dataflow
• Initial Eddy results promising, well received (SIGMOD 2000)
• Finishing Telegraph v0 in Java/Jaguar
  – Prototype now running
• Demo service to go live on the web this summer
  – Analysis queries over web sites
  – We’ve picked a provocative app to go live with (stay tuned!)
  – Incorporates the Ninja “path” project for caching
  – Goal: Telegraph is to “facts and figures” as search engines are to “documents”
• Longer-term goals:
  – Formalize & optimize Eddy/River scheduling policies
  – Study the HCI/systems/statistics issues in interaction
  – Crawl the “dark matter” of the web
  – Attack streams from sensors
    • Sequence queries and mining, data reduction, browsing, etc.

Status: Unified Storage Manager
• Prototype implementation in Java/Jaguar
  – ACID transactions + (non-ACID) Java file access
  – Robust enough to get TPC-W numbers
  – Events/state machines vs. threads (sketched below)
    • Echoes the Gribble/Welsh results: better than threads under load, but Java complicates detailed measurement
• Time to re-evaluate the importance of this part
  – Interest? More mindshare is in the dataflow infrastructure.
  – Vs. tuning an off-the-shelf solution (e.g. Berkeley DB)?
  – Goal? Unified lessons about dataflow/eventflow optimization on clusters.
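As a rough illustration of the events-and-state-machines structure, the sketch below drives many concurrent requests through a single-threaded event loop; each request advances a small state machine instead of occupying a blocked thread, which is why this style holds up better than thread-per-request under load. The Request type, the three stages, and the scheduling here are assumptions made for illustration, not the storage manager’s actual interfaces.

import java.util.ArrayDeque;
import java.util.Queue;

class EventLoop {
    enum State { PARSE, FETCH_PAGE, REPLY, DONE }

    static final class Request {
        State state = State.PARSE;
        final int id;
        Request(int id) { this.id = id; }
    }

    private final Queue<Request> ready = new ArrayDeque<>();

    void submit(Request r) { ready.add(r); }

    // One thread drains the queue; each handler runs briefly and re-enqueues,
    // so thousands of in-flight requests need no per-request thread stack.
    void run() {
        while (!ready.isEmpty()) {
            Request r = ready.poll();
            switch (r.state) {
                case PARSE:      r.state = State.FETCH_PAGE; ready.add(r); break;
                case FETCH_PAGE: r.state = State.REPLY;      ready.add(r); break; // real code parks here until disk I/O completes
                case REPLY:      r.state = State.DONE;       break;
                default:         break;
            }
        }
    }

    public static void main(String[] args) {
        EventLoop loop = new EventLoop();
        for (int i = 0; i < 3; i++) loop.submit(new Request(i));
        loop.run();
    }
}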
Integration with Rest of Endeavour
• Give
  – Be the dataflow backbone for diverse “clients”
    • Our own Telegraph apps (federated dataflow, sensors)
    • Replication/delivery dataflow engine for OceanStore
    • Scalable infrastructure for tacit-information mining algorithms?
    • Pipes for the next version of Iceberg?
  – Telegraph Storage Manager provides storage (transactional or otherwise) for OceanStore? Ninja?
• Take
  – OceanStore to manage distributed metadata, security
  – Leverage protocols out of TinyOS for sensors
  – Partner with Ninja to manage local metadata?
  – Work with GUIR on interacting with streams?

More Info
• People:
  – Joe Hellerstein, Mike Franklin, Eric Brewer, Christos Papadimitriou
  – Sirish Chandrasekaran, Amol Deshpande, Kris Hildrum, Sam Madden, Vijayshankar Raman, Mehul Shah
• Software:
  – http://telegraph.cs.berkeley.edu coming soon
  – ABC interactive data analysis/cleansing at http://control.cs.berkeley.edu
• Papers:
  – See http://db.cs.berkeley.edu/telegraph

Extra slides for backup

Connectivity & Heterogeneity
• Lots of folks are working on data-format translation and parsing
  – We will borrow, not build
  – Currently using JDBC & Cohera Net Query
    • Commercial tool, donated by Cohera Corp.
    • Gateways XML/HTML (via HTTP) to ODBC/JDBC
  – We may write “Teletalk” gateways from sensors
• Heterogeneity – never a simple problem
  – The Control project developed an interactive, online data transformation tool: ABC

CONTROL: Continuous Output and Navigation Technology with Refinement On Line
• Data-intensive jobs are long-running. How do we give early answers and interactivity?
  – Online interactivity over feeds
    • Pipelining “online” operators, the data “juggle”
  – Online data-correlation algorithms: ripple joins, online mining and aggregation
  – Statistical estimators, and their performance implications (a running-estimate sketch follows the backup slides)
    • Deliver data to satisfy statistical goals
• Appreciate the interplay of massive data processing, statistics, and HCI

“Of all men’s miseries, the bitterest is this: to know so much and have control over nothing.” – Herodotus

Performance Regime for CONTROL
• New “greedy” performance regime
  – Maximize the first derivative of the user-happiness function
[Chart: answer quality vs. time – CONTROL climbs toward 100% immediately, while the traditional regime delivers nothing until completion]

River
• We built the world’s fastest sorting machine
  – On the “NOW”: 100 Sun workstations + SAN
  – But it only beat the record under ideal conditions!
• River: performance adaptivity for dataflows on clusters
  – Simplifies management and programming
  – Perfect for sensor-based streams

Declarative Dataflow: NOT New
• Database systems have been doing this for years
  – Translate declarative queries into an efficient dataflow plan
  – “Query optimization” considers:
    • Alternate data sources (“access methods”)
    • Alternate implementations of operators
    • Multiple orders of operators
    • A space of alternatives defined by transformation rules
    • Estimated costs and “data rates”, used to search the space
• But in a very static way!
  – Gather statistics once a week
  – Optimize the query at submission time
  – Run a fixed plan for the life of the query
• And these ideas are ripe to elevate out of DBMSs
  – Outside of DBMSs, the world is very volatile
  – There are surely going to be lessons “outside the box”

Competitive Eddies
[Diagram: an eddy routing tuples from sources R1–R3 and S1–S3 among competing operator implementations – σ via index1, σ via index2, and a block hash join]

Potter’s Wheel / Anomaly Detection
[Screenshot slides]

The Data Flood is Real
[Chart: disk sales in petabytes vs. Moore’s Law, 1988–2000]
Source: J. Porter, Disk/Trend, Inc. http://www.disktrend.com/pdf/portrpkg.pdf

Disk Appetite, cont.
• Greg Papadopoulos, CTO of Sun:
  – Disk sales are doubling every 9 months
    • Note: that only counts the data we’re saving!
• Translation:
  – Time to process all your data doubles every 18 months
  – MOORE’S LAW, INVERTED! (the worked arithmetic appears after these slides)
  – (And Moore’s Law may run out in the next couple of decades?)
• Big challenge (opportunity?) for software systems research
  – Traditional scalability research won’t help
  – “Ideal” linear scale-up is NOT NEARLY ENOUGH!

Data Volume: Prognostications
• Today
  – SwipeStream
    • E.g. Wal-Mart’s 24 TB data warehouse
  – ClickStream
  – Web
    • Internet Archive: ?? TB
  – Replicated OS/Apps
• Tomorrow
  – Sensors galore
  – DARPA/Berkeley “Smart Dust”: temperature, light, humidity, pressure, accelerometer, magnetics
• Note: the privacy issues only get more complex – both technically and ethically

Explaining Disk Appetite
• Areal density increases 60%/yr
• Yet MB/$ rises much faster!
[Chart: MB/$ vs. Moore’s Law, 1988–2000]
Source: J. Porter, Disk/Trend, Inc. http://www.disktrend.com/pdf/portrpkg.pdf
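The running-estimate sketch promised in the CONTROL slide: online aggregation reports an estimate plus a shrinking confidence interval while the query runs. Below is a minimal Java version for AVG over a randomly ordered stream, using Welford’s running variance and a normal-approximation interval; the class and method names are illustrative, and CONTROL’s actual estimators (e.g., for ripple joins) are considerably more involved.

// Running AVG with a CLT-based 95% confidence half-width. Valid once n is
// large and the input order is random – a key CONTROL requirement.
class OnlineAvg {
    private long n;
    private double mean, m2;   // Welford's running mean and sum of squared deviations

    void observe(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
    }

    double estimate() { return mean; }

    double halfWidth() {
        if (n < 2) return Double.POSITIVE_INFINITY;
        double variance = m2 / (n - 1);
        return 1.96 * Math.sqrt(variance / n);   // approximate 95% interval
    }
}

Calling observe() per tuple and periodically reporting estimate() ± halfWidth() yields exactly the “greedy” regime pictured earlier: a usable answer almost immediately, refined as data streams in.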
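And the worked arithmetic behind “Moore’s Law inverted” on the “Disk Appetite, cont.” slide, under that slide’s two assumptions – stored data doubling every 9 months, processing rate doubling every 18 months (t in months):

S(t) = S_0 \cdot 2^{t/9} \qquad \text{(data stored)}
R(t) = R_0 \cdot 2^{t/18} \qquad \text{(processing rate, Moore's Law)}
T(t) = \frac{S(t)}{R(t)} = \frac{S_0}{R_0} \cdot 2^{t/9 - t/18} = T_0 \cdot 2^{t/18}

So the time to process everything you have stored itself doubles every 18 months, which is why “ideal” linear scale-up cannot keep pace.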