Federated Facts and Figures
Joseph M. Hellerstein, UC Berkeley

Road Map
- The Deep Web and the FFF
- An Overview of Telegraph
- Demo: Election 2000
- From Tapping to Trawling
- A Taste of Policy and Countermeasures
- Delicious Snacks

Meet the “Deep Web”
- Available in your browser, but not via hyperlinks
  - Accessed via forms (press the “submit” button)
- Typically runs some code to generate data
  - E.g. call out to a database, or run some “servlet”
  - Pretty-print results in HTML / Dynamic HTML
- Estimated to be >400x larger than the “surface web”
- Not accessible via the search engines
  - They typically crawl hyperlinks only

Federated Facts and Figures
- One part of the deep web: more full-text documents
  - E.g. archived newspaper articles, legal documents, etc.
  - Figure out how to fetch these, then add them to a search engine
  - Various people working on this (e.g. CompletePlanet)
- Another part: Facts and Figures
  - I.e. structured database data
  - Fetch is only the first challenge
  - Want to combine (“federate”) these databases
  - Want to search by criteria other than keywords
  - Want to analyze the data en masse
  - I.e. want full query power, not just search
    - Search was always easy
    - Ranking not clearly appropriate here

Meet the FFF
http://telegraph.cs.berkeley.edu

Telegraph
- An adaptive dataflow system
- Dataflow
  - siphon data from the deep web and other data pools
  - harness data streaming from sensors and traces
  - flow these data streams through code
- Adaptive
  - sensor nets & wide-area networks are volatile!
  - like Telegraph Avenue, it needs to “be cool” with a volatile mix from all over the world
  - adaptive techniques route data to machines and code
  - a marriage of queries, app-level routing/filtering, and machine learning
- First apps
  - Facts and Figures Federation: Election 2000
  - Continuous queries on sensor nets
  - Rich queries on Peer-to-Peer
- Joe Hellerstein, Mike Franklin, & co.
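The dataflow idea behind Telegraph, streaming records through a graph of operators, can be sketched with Python generators. This is a minimal illustration, not Telegraph's actual code; the operator names and sample records are invented:

```python
# A minimal sketch of the dataflow idea: records stream through a graph
# of operators, each consuming its input lazily (not Telegraph's code).

def scan(records):
    """Source operator: emit records one at a time."""
    for r in records:
        yield r

def select(stream, predicate):
    """Filter operator: pass through records satisfying the predicate."""
    for r in stream:
        if predicate(r):
            yield r

def project(stream, fields):
    """Projection operator: keep only the named fields."""
    for r in stream:
        yield {f: r[f] for f in fields}

# Wire up a tiny dataflow graph: scan -> select -> project.
people = [
    {"name": "Smith", "city": "Berkeley", "age": 34},
    {"name": "Jones", "city": "Oakland", "age": 51},
    {"name": "Lee", "city": "Berkeley", "age": 28},
]
flow = project(select(scan(people), lambda r: r["city"] == "Berkeley"),
               ["name"])
print(list(flow))  # [{'name': 'Smith'}, {'name': 'Lee'}]
```

Because each operator pulls from its input lazily, records flow through the whole graph one at a time rather than being materialized between stages.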
Dataflow Commonalities
- Dataflow is at the heart of queries and networks
  - Query engines move records through operators
  - Networks move packets through routers
  - Networked data-intensive apps are an emerging middle ground
- Database systems: high-function, high-integrity, carefully administered. They compile intelligent query plans based on data models, statistical properties, and query semantics.
- Networks: low-function, high-availability, federated administration. They adapt to performance variability and treat data and code as opaque, for loose coupling.

Long-Running Dataflows on the FFF
- Not precomputed like web indexes
- Need online systems & apps for online performance goals
- Subject of prior work in the CONTROL project
  - A combination of query processing, sampling/estimation, and HCI
- [Figure: percent complete vs. time; the online approach delivers useful partial results early, while the traditional approach delivers nothing until 100% complete]

Telegraph Architecture
- Telegraph executes dataflow graphs
- Extensible set of operators
  - With extensible optimization rules
- Data access operators
  - TeSS: the Telegraph Screen Scraper
  - Napster/Gnutella readers
  - File readers
- Data processing operators
  - Selections (filters), joins, drill-down/roll-up, aggregation
- Adaptivity operators
  - Eddies, STeMs, FLuX, etc.

Screen Scraping: TeSS
- Screen scrapers do two things:
  - Fetch: emulate a web user clicking
  - Parse: extract info from the resulting HTML/XML
- Somebody has to train the screen scraper
  - Need a separate wrapper for each site
  - Some research work on making this process semi-automatic
- TeSS is an open-source screen scraper
  - Available at http://telegraph.cs.berkeley.edu/tess
  - Written by a (superstar) sophomore!
  - Simple scripting interface targeted today
  - Moving towards a GUI for non-technical users (“by example”)

First Demo: Election 2000

From Tapping to Trawling
- Telegraph allows users to pose rich queries over the deep web
- But sometimes we would like to be more aggressive:
  - Preload a Telegraph cache
  - Access a variety of data for offline mining
  - More (we’ll see soon!)
- Want something like a webcrawler for the FFF
  - But the FFF is too big.
  - Want to “trawl” for the interesting stuff hidden there.

From Tapping to Trawling
[Diagram: an eddy routes Name/Address tuples among wrappers for AnyWho, Yahoo Maps, and Infospace, with a DupElim operator; example inputs: Name = “Smith”, Street = “1600 Pennsylvania Avenue, DC”]

API Challenges in Trawling
- Load APIs on the web today: service and silence
  - Various policies at the servers, hard to learn
  - No analogy to robots.txt (which is too limiting anyhow)
  - Feedback can be delayed, painful
- Solutions
  - Be very conservative
  - Make out-of-band (human) arrangements
  - Both seem inefficient
- Finding new sites to trawl is hard
  - Have to wrap them: fetch is easy-ish, parse hard-ish
  - XML will help a little here
- Query? Or update? Again, an API problem!
  - Imagine we auto-trawled AnyWho and WeSpamYou.com

Trawling “Domains”
- Can now collect lists of: names (first, last), addresses, companies, cities, states, etc.
- Can keep the lists organized by site and in toto
- Allows for offline mining, etc.
  - Q: Do webgraph mining techniques apply to facts and figures?

Exploiting Enumerated Domains I
- Can trawl any site on known domains!
  - Suddenly the deep web is not so hidden.
- In essence, we expand our trawl
- Can use pre-existing domains to trawl further
- Or, can add new sites to the trawl process

Exploiting Enumerated Domains II
- Trawling gets a sample (signature) of a site’s content
  - Analogous to a random walk, but needs to be characterized better
- Can identify that two sites have related subsets of domains
- Helps with the query composition problem
  - Rich query interfaces tend to be non-trivial
    - What sites to use? How to combine them?
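The domain-expansion idea, values trawled from one site seed queries against another, expanding the trawl toward a fixpoint, can be sketched as a breadth-first loop. This is a hypothetical toy; the `anywho` and `infospace` functions below are stand-ins for wrapped deep-web forms, and all the data is invented:

```python
# Hypothetical sketch of domain-expansion trawling (not Telegraph's code).
# Each "site" is a wrapped form: given a value from a known domain (e.g. a
# last name), it returns records that may mention new values of the domain.

def trawl(sites, seed_values):
    """Expand the trawl: query every site with every known domain value,
    feeding newly discovered values back into the work queue."""
    known = set(seed_values)
    frontier = list(seed_values)
    results = []
    while frontier:
        value = frontier.pop()
        for site in sites:
            for record in site(value):
                results.append(record)
                for v in record.get("names", []):
                    if v not in known:       # a new domain value discovered
                        known.add(v)
                        frontier.append(v)
    return known, results

# Two toy "sites" standing in for wrapped deep-web forms.
def anywho(name):
    data = {"Smith": [{"names": ["Jones"], "addr": "1600 Penn Ave"}]}
    return data.get(name, [])

def infospace(name):
    data = {"Jones": [{"names": ["Lee"], "addr": "2 Main St"}]}
    return data.get(name, [])

known, results = trawl([anywho, infospace], ["Smith"])
print(sorted(known))  # ['Jones', 'Lee', 'Smith']
```

A real trawler would also need the conservatism discussed under "API Challenges": rate limiting, backoff, and a bound on how far the expansion is allowed to run.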
Imagine:
- A traditional search-engine experience to pick some sites
- The system suggests how to join the sites in a meaningful way
- As you build the query, you always see incremental results
  - Refine the query as the data pours in
  - The Berkeley CONTROL project has worked on incremental queries
- Blends search, query, browse and mine

A Sampler of FFF Policy Issues
- Statistical DB security
- Issues facing the power of the FFF
  - “False” combinations
  - Combination strength
- What is trawling?
  - Copying? So what?
  - Akamai for the deep web?
  - Cracking?

Sampler of Countermeasures
- Trawl detection
  - And distributed trawl detection
- Metadata
  - Watermarking
  - Provenance, lineage, disclaimers
- Stockpiling Spam

Delicious Snacks
- "Concepts are delicious snacks with which we try to alleviate our amazement” -- A. J. Heschel, Man Is Not Alone

Technical Snacks
- Adaptive dataflow
  - Systems + learning
- Incremental & continuous querying
  - And online, bounded trawling
  - Adds an HCI component to the above
- FFF APIs and standards
  - The wrapper-writing bottleneck: XML?
  - Backoff APIs?
  - Search vs. update
- Mining trawls

More Technical Snacks
- Tie-ins with security
- Applications beyond the FFF
  - Sensors
  - P2P
  - Overlay networks

Policy Questions
- Presenting & interpreting data
  - Not just search
- Privacy: what is it, what’s it for?
- Leading indicators from the FFF

More?
- http://telegraph.cs.berkeley.edu
- [email protected]
- Collaborators:
  - Mike Franklin, Hal Varian -- UCB
  - Lisa Hellerstein & Torsten Suel -- Polytechnic
  - Sirish Chandrasekaran, Amol Deshpande, Sam Madden, Vijayshankar Raman, Fred Reiss, Mehul Shah -- UCB

Backup Slides

Telegraph: Adaptive Dataflow
- Mixed design philosophy:
  - Tolerate loose coupling and partial failure
  - Adapt online and provide best-effort results
  - Learn statistical properties online
  - Exploit knowledge of semantics via extensible optimization infrastructures
- Target new networked, data-intensive applications

Adaptive Systems: General Flavor
- Repeat:
  1. Observe (model) the environment
  2. Use the observation to choose a behavior
  3.
     Take action

Adaptive Dataflow in DBs: History
- A rich but unacknowledged history
  - Codd’s data independence is predicated on adaptivity!
    - Adapt opaquely to changing schema and storage
  - Query optimization does it!
    - Statistics-driven optimization is a key differentiator between DBMSs and other systems

Adaptivity in Current DBs
- Limited & coarse-grained
- Repeat:
  1. Observe (model) the environment -- runstats (once per week!!): model changes in the data
  2. Use the observation to choose a behavior -- query optimization: fixes a single static query plan
  3. Take action -- query execution: blindly follow the plan

What’s So Hard Here?
- A volatile regime
  - Data flows unpredictably from sources
  - Code performs unpredictably along flows
  - Continuous volatility due to many decentralized systems
- Lots of choices
  - Choice of services
  - Choice of machines
  - Choice of info: sensor fusion, data reduction, etc.
  - Order of operation
- Maintenance
  - Federated world
  - Partial failure is the common case
- Adaptivity required!

Adaptive Query Processing Work
- Frequency of adaptivity
  - Late binding: dynamic, parametric [HP88, GW89, IN+92, GC94, AC+96, LP97]
  - Per query: Mariposa [SA+96], ASE [CR94]
  - Competition: RDB [AZ96]
  - Inter-operator: [KD98], Tukwila [IF+99]
  - Query scrambling: [AF+96, UFA98]
- Survey: Hellerstein, Franklin, et al., IEEE Data Engineering Bulletin, 2000

A Networking Problem!?
- Networks do dataflow!
- Significant history of adaptive techniques
  - E.g. TCP congestion control
  - E.g. routing
- But traditionally much lower function
  - Ship bitstreams
  - Minimal, fixed code
- Lately, moving up the food chain?
  - App-level routing
  - Active networks
  - The politics of growth: assumption of complexity = assumption of liability

Networking Code as Dataflow?
- States & events, not threads
  - Asynchronous events are natural to networks
  - State machines appear in protocol specifications and systems code
  - Low-overhead, spreading to big systems
  - A totally different programming style
    - A remaining area of hacker machismo
- Eventflow optimization
  - Can’t eventflow be adaptively optimized like dataflow?
  - Why didn’t that happen years ago?
  - Hold this thought

Query Plans are Dataflow Too
- Programming model: iterators
  - An old idea, widely used in DB query processing
  - An object with three methods: Init(), GetNext(), Close()
  - Input/output types
- Query plan: a graph of iterators
  - Pipelining: iterators that return results before their children Close()

Clever Dataflow Tricks
- Volcano: the “exchange” iterator [Graefe]
  - Encapsulates exchange logic in an iterator, not in the dataflow system
  - Box-and-arrow programming can ignore parallelism

Some Solutions We’re Focusing On
- Rivers
  - Adaptive partitioning of work across machines
- Eddies
  - Adaptive ordering of pipelined operations
- Quality of Service
  - Online aggregation & data reduction: CONTROL
  - MUST have app semantics
  - Often may want user interaction
    - UI models of temporal interest
- Data dissemination
  - Adaptively choosing what to send, what to cache

River
- Berkeley built the world-record sorting machine
  - On the NOW: 100 Sun workstations + SAN
  - Only beat the record under ideal conditions
    - No such thing in practice! (Arpaci-Dusseau)^2, with Culler, Hellerstein, Patterson
- River: adaptive dataflow on clusters
  - One main idea: Distributed Queues
    - An adaptive exchange operator
    - Simplifies management and programming
  - Remzi Arpaci-Dusseau, Eric Anderson, Noah Treuhaft, w/ Culler, Hellerstein, Patterson, Yelick

River Multi-Operator Query Plans
- Deal with pipelines of commutative operators
- Adapt at a finer granularity than current DBMSs

Continuous Adaptivity: Eddies
- An eddy is a pipelining tuple-routing iterator
  - Just like a join or sort or exchange
- Works best with other pipelining operators
  - Like ripple joins, online reordering, etc.
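The iterator model named above, an object with Init(), GetNext(), and Close(), can be sketched in a few lines. This is a generic Volcano-style illustration, not code from any real engine; the operator classes are invented:

```python
# A sketch of the Volcano-style iterator model: every operator exposes
# init()/get_next()/close(), and a query plan is a graph of such
# iterators that pipeline results upward.

class Scan:
    """Leaf operator: emit records from an in-memory list."""
    def __init__(self, records):
        self.records = records
    def init(self):
        self.pos = 0
    def get_next(self):
        if self.pos >= len(self.records):
            return None                      # end of stream
        r = self.records[self.pos]
        self.pos += 1
        return r
    def close(self):
        self.records = None

class Select:
    """Pipelining filter: returns results before its child is closed."""
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate
    def init(self):
        self.child.init()
    def get_next(self):
        while True:
            r = self.child.get_next()
            if r is None or self.predicate(r):
                return r
    def close(self):
        self.child.close()

# A tiny plan: Select over Scan, driven from the root.
plan = Select(Scan([1, 2, 3, 4, 5]), lambda x: x % 2 == 1)
plan.init()
out = []
while (t := plan.get_next()) is not None:
    out.append(t)
plan.close()
print(out)  # [1, 3, 5]
```

The consumer pulls from the root; each get_next() call propagates down the graph, so tuples flow through the plan one at a time.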
Continuous Adaptivity: Eddies (Ron Avnur & Joe Hellerstein)
- How to order and reorder operators over time
  - Based on performance and economic/administrative feedback
- Vs. River:
  - River optimizes each operator “horizontally”
  - Eddies optimize a pipeline “vertically”

Continuous Adaptivity: Eddies
- Adjusts flow adaptively
  - Tuples are routed through the operators in different orders
  - Each tuple visits each operator once before output
- Naïve routing policy:
  - All operators fetch from the eddy as fast as possible
    - À la River
  - Turns out this doesn’t quite work
    - It measures only the rate of work, not the benefit
- Lottery-based routing
  - Uses “lottery scheduling” to address a bandit problem
  - Kris Hildrum et al. are looking at formalizing this
  - Various AI students are looking at reinforcement learning
- Competitive eddies
  - Throw in redundant data-access and code modules!

An Aside: n-Arm Bandits
- A little machine-learning problem:
  - Each arm pays off differently
- Explore? Or exploit?
  - Sometimes you want to choose an arm at random
  - Usually you want to go with the best
  - If the payoff probabilities are stationary, dampen exploration over time

Eddies with Lottery Scheduling
- An operator gets 1 ticket when it takes a tuple
  - Favors operators that run fast (low cost)
- An operator loses a ticket when it returns a tuple
  - Favors operators with a high rejection rate (low selectivity)
- Lottery scheduling:
  - When two operators vie for the same tuple, hold a lottery
  - Never let any operator go to zero tickets
    - Supports occasional random “exploration”
  - Set up “inflation” (forgetting) to adapt over time
    - E.g. tix' = α*oldtix + newtix
- Promising!
  - Initial performance results
  - Ongoing work on proofs of convergence
    - Have an analysis for a constrained case

To Be Continued
- Tune & formalize the policy
- Competitive eddies
  - Source & join selection
  - Requires duplicate management
- Parallelism
  - Eddies + Rivers?
- Reliability
  - Long-running flows
  - Rivers + RAID-style computation
[Diagram: an eddy routing tuples from sources R1–R3 and S1–S3 through competing access methods: σ via index1, σ via index2, and a block hash join]

To Be Continued, cont.
- What about wide area?
  - Data reduction
  - Sensor fusion
  - Asynchronous communication
- Continuous queries
  - Events
  - Disconnected operation
- Lower-level eventflow?
  - Can eddies, rivers, etc. be brought to bear on programming?
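The lottery-scheduled eddy routing described in the backup slides can be illustrated with a toy implementation. This is a sketch, not the Telegraph code: the predicates, ticket constants, and the simple per-tuple decay standing in for the tix' = α*oldtix + newtix inflation scheme are all invented for illustration:

```python
import random

# Toy eddy with lottery-based routing (a sketch, not Telegraph's code).
# Each tuple must visit every operator once before output; the eddy holds
# a lottery among the operators the tuple has not yet visited.

class Op:
    def __init__(self, name, predicate):
        self.name, self.predicate = name, predicate
        self.tickets = 1.0
    def apply(self, t):
        self.tickets += 1          # gets a ticket when it takes a tuple (favors fast ops)
        passed = self.predicate(t)
        if passed:
            self.tickets -= 1      # loses a ticket when it returns it (favors selective ops)
        self.tickets = max(self.tickets, 0.1)   # never reach zero: keep exploring
        return passed

def eddy(tuples, ops, alpha=0.9, rng=random.Random(0)):
    out = []
    for t in tuples:
        remaining = list(ops)
        alive = True
        while alive and remaining:
            # Lottery: pick among unvisited ops, weighted by their tickets.
            weights = [op.tickets for op in remaining]
            op = rng.choices(remaining, weights)[0]
            remaining.remove(op)
            alive = op.apply(t)    # a rejected tuple is dropped early
        if alive:
            out.append(t)
        for op in ops:
            op.tickets *= alpha    # crude forgetting, in the spirit of tix' = α*oldtix + newtix
    return out

ops = [Op("even", lambda x: x % 2 == 0), Op("small", lambda x: x < 50)]
result = eddy(range(100), ops)
print(len(result))  # 25 tuples satisfy both predicates
```

Because the operators are commutative filters, the output set is the same whatever order the eddy chooses; only the work done varies. Over time the lottery shifts tuples toward the cheaper, more selective operator first.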