Federated Facts and Figures
Joseph M. Hellerstein
UC Berkeley
Road Map
The Deep Web and the FFF
An Overview of Telegraph
Demo: Election 2000
From Tapping to Trawling
A Taste of Policy and Countermeasures
Delicious Snacks
Meet the “Deep Web”
Available in your browser, but not via hyperlinks


Accessed via forms (press the “submit” button)
Typically runs some code to generate data


E.g. call out to a database, or run some “servlet”
Pretty-print results in HTML

Dynamic HTML
Estimated to be >400x larger than the “surface web”
Not accessible in the search engines

Typically crawl hyperlinks only
Federated Facts and Figures
One part of the deep web: more full-text documents



E.g. archived newspaper articles, legal documents, etc.
Figure out how to fetch these, then add to search engine
Various people working on this (e.g. CompletePlanet)
Another part: Facts and Figures






I.e. structured database data
Fetch is only the first challenge
Want to combine (“federate”) these databases
Want to search by criteria other than keywords
Want to analyze the data en masse
I.e. want full query power, not just search


Search was always easy
Ranking not clearly appropriate here
Meet the FFF
http://telegraph.cs.berkeley.edu
Telegraph
An adaptive dataflow system

Dataflow




siphon data from the deep web and other data pools
harness data streaming from sensors and traces
flow these data streams through code
Adaptive


sensor nets & wide area networks: volatile!
like Telegraph Avenue


needs to “be cool” with volatile mix from all over the world
adaptive techniques route data to machines and code

marriage of queries, app-level route/filter, machine learning
First apps



Facts and Figures Federation: Election 2000
Continuous queries on sensor nets
Rich queries on Peer-to-Peer
Joe Hellerstein, Mike Franklin, & co.
Dataflow Commonalities
Dataflow at the heart of queries and networks



Query engines move records through operators
Networks move packets through routers
Networked data-intensive apps an emerging middle ground
Database Systems:

High-function, high integrity, carefully administered. Compile intelligent query plans based on data models and statistical properties, query semantics.
Networks:

Low-function, high availability, federated administration. Adapt to performance variabilities, treat data and code as opaque for loose coupling.
Long-Running Dataflows on the FFF
Not precomputed like web indexes
Need online systems & apps for online performance goals
Subject of prior work in CONTROL project

Combo of query processing, sampling/estimation, HCI
[Figure: progress toward 100% over Time, Online vs. Traditional]
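
The "online" idea of combining query processing with sampling/estimation can be illustrated by a running aggregate that reports an estimate and an error bound after every record, instead of one exact answer at the end. A minimal sketch, not the CONTROL implementation: the name `online_average`, Welford's variance update, and the normal-approximation interval are all assumptions chosen for illustration.

```python
import math
import random

def online_average(values, z=1.96):
    """Yield (count, running_mean, half_width) after each value.

    half_width is a normal-approximation confidence bound on the mean;
    it is infinite until at least two values have been seen.
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (Welford's method)
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n > 1:
            std_err = math.sqrt(m2 / (n - 1) / n)
            half_width = z * std_err
        else:
            half_width = float("inf")
        yield n, mean, half_width

if __name__ == "__main__":
    # Synthetic data, purely for demonstration.
    rng = random.Random(0)
    data = [rng.uniform(0, 100) for _ in range(10_000)]
    for n, mean, hw in online_average(data):
        if n in (10, 100, 1000, 10_000):
            print(f"after {n:>6} rows: {mean:.2f} +/- {hw:.2f}")
```

A user watching this output can stop the query as soon as the bound is tight enough, which is where the HCI component comes in.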
Telegraph Architecture
Telegraph executes Dataflow Graphs
Extensible set of operators

With extensible optimization rules
Data access operators



TeSS: the Telegraph Screen Scraper
Napster/Gnutella readers
File readers
Data processing operators

Selections (filters), Joins, Drill-Down/Roll-Up, Aggregation
Adaptivity Operators

Eddies, STeMs, FLuX, etc.
Screen Scraping: TeSS
Screen scrapers do two things:


Fetch: emulate a web user clicking
Parse: extract info from resulting HTML/XML
Somebody has to train the screen scraper


Need a separate wrapper for each site
Some research work on making this process semi-automatic
TeSS is an open-source screen-scraper




Available at http://telegraph.cs.berkeley.edu/tess
Written by a (superstar) sophomore!
Simple scripting interface targeted today
Moving towards GUI for non-technical users (“by example”)
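
The fetch/parse split described above can be sketched with the Python standard library. This is not TeSS or its scripting interface; the function names `fetch` and `parse`, the `TableRows` class, and the sample page are hypothetical, and a real wrapper would still have to be written per site.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode
from urllib.request import urlopen

def fetch(form_url, fields):
    """Fetch step: emulate pressing "submit" by POSTing the form fields."""
    with urlopen(form_url, data=urlencode(fields).encode("ascii")) as resp:
        return resp.read().decode("utf-8", errors="replace")

class TableRows(HTMLParser):
    """Parse step: collect the text of each <td> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td" and self._in_cell:
            self._in_cell = False
            self._row[-1] = self._row[-1].strip()
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows (<th> only) stay empty and are skipped
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data

def parse(html):
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows

# Made-up result page standing in for a site's pretty-printed HTML.
SAMPLE = """
<table>
  <tr><th>County</th><th>Votes</th></tr>
  <tr><td>Alameda</td><td>1234</td></tr>
  <tr><td>Marin</td><td>567</td></tr>
</table>
"""

if __name__ == "__main__":
    print(parse(SAMPLE))
```

Only `parse` is exercised here; `fetch` needs a live form URL and is shown to mark where the "emulate a web user clicking" step would go.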
First Demo: Election 2000
From Tapping to Trawling
Telegraph allows users to pose rich queries over the deep web
But sometimes would like to be more aggressive:



Preload a Telegraph cache
Access a variety of data for offline mining
More (we’ll see soon!)
Want something like a webcrawler for FFF


But FFF is too big.
Want to “trawl” for interesting stuff hidden there.
From Tapping to Trawling
[Figure: trawling dataflow over the Name and Address domains, with seed values “Smith” and “1600 Pennsylvania Avenue, DC”, sources AnyWho Name, Infospace Name, Infospace Street, and Yahoo Maps, an Eddy, and DupElim]
API Challenges in Trawling
Load APIs on the web today: service and silence



Various policies at the servers, hard to learn
No analogy to robots.txt (which is too limiting anyhow)
Feedback can be delayed, painful
Solutions



Be very conservative
Make out-of-band (human) arrangements
Both seem inefficient
Finding new sites to trawl is hard

Have to wrap them: fetch is easyish, parse hardish


XML will help a little here
Query? Or Update? Again, an API problem!

Imagine we auto-trawled AnyWho and WeSpamYou.com
Trawling “Domains”
Can now collect lists of:
Names (First, Last), Addresses, Companies, Cities, States, etc.
Can keep lists organized by site and in toto
Allows for offline mining, etc.

Q: Do webgraph mining techniques apply to facts and figures?
Exploiting Enumerated Domains I
Can trawl any site on known domains!

Suddenly the deep web is not so hidden.
In essence, we expand our trawl
Can use pre-existing domains to trawl further
Or, can add new sites to the trawl process
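
The expansion step can be sketched as a worklist loop: values already collected for a domain are fed into any site that accepts that domain as input, and newly seen values join the lists. A minimal sketch under stated assumptions: the `trawl` function, the `sites` structure, the probe budget, and the toy lookup tables are all hypothetical stand-ins for wrapped web forms.

```python
from collections import deque

def trawl(sites, seeds, max_probes=100):
    """Expand enumerated domains by probing sites with known values.

    sites: {site_name: (input_domain, lookup)}, where lookup(value)
           returns a list of (domain, value) facts.
    seeds: iterable of (domain, value) pairs to start from.
    Returns {domain: set_of_values}; max_probes bounds the trawl.
    """
    domains = {}
    frontier = deque()
    for domain, value in seeds:
        seen = domains.setdefault(domain, set())
        if value not in seen:
            seen.add(value)
            frontier.append((domain, value))
    probes = 0
    while frontier and probes < max_probes:
        domain, value = frontier.popleft()
        for input_domain, lookup in sites.values():
            if input_domain != domain:
                continue
            if probes >= max_probes:
                break
            probes += 1
            for new_domain, new_value in lookup(value):
                seen = domains.setdefault(new_domain, set())
                if new_value not in seen:  # duplicate elimination
                    seen.add(new_value)
                    frontier.append((new_domain, new_value))
    return domains

# Toy stand-ins for two form-based sites (made-up data).
PEOPLE = {"Smith": [("address", "1 Main St")],
          "Jones": [("address", "2 Oak Ave")]}
RESIDENTS = {"1 Main St": [("name", "Smith"), ("name", "Jones")]}

SITES = {
    "name_lookup": ("name", lambda v: PEOPLE.get(v, [])),
    "street_lookup": ("address", lambda v: RESIDENTS.get(v, [])),
}

if __name__ == "__main__":
    print(trawl(SITES, [("name", "Smith")]))
```

Starting from a single name, the loop alternates between the two sites until no new values appear or the probe budget runs out; adding a new site to `SITES` extends the trawl without changing the loop.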

Exploiting Enumerated Domains II
Trawling gets a sample (signature) of a site’s content

Analogous to a random walk, but needs to be characterized better
Can identify that 2 sites have related subsets of domains
Helps with the query composition problem

Rich query interfaces tend to be non-trivial


What sites to use? How to combine them?
Imagine:



Traditional search engine experience to pick some sites
System suggests how to join the sites in a meaningful way
As you build the query, you always see incremental results



Refine query as the data pours in
Berkeley CONTROL project has been working on incremental queries
Blends search, query, browse and mine
A Sampler of FFF Policy Issues
Statistical DB Security Issues
Facing the Power of the FFF
“False” combinations
 Combination strength

What is trawling?

Copying? So what?


Akamai for the deep web?
Cracking?
Sampler of Countermeasures
Trawl detection

And Distributed Trawl Detection
Metadata Watermarking

Provenance, Lineage, Disclaimers
Stockpiling Spam
Delicious Snacks
“Concepts are delicious snacks with which we try to alleviate our amazement”
-- A. J. Heschel, Man Is Not Alone
Technical Snacks
Adaptive Dataflow

Systems + Learning
Incremental & continuous querying


And online, bounded trawling
Adds an HCI component to the above
FFF APIs, standards



The wrapper-writing bottleneck: XML?
Backoff APIs?
Search vs. Update
Mining trawls
More Technical Snacks
Tie-ins with Security
Applications beyond FFF
Sensors
 P2P
 Overlay Networks

Policy Questions
Presenting & Interpreting Data

Not just search
Privacy: What is it, what’s it for?
Leading Indicators from the FFF
More?
http://telegraph.cs.berkeley.edu
Collaborators:



Mike Franklin, Hal Varian -- UCB
Lisa Hellerstein & Torsten Suel -- Polytechnic
Sirish Chandrasekaran, Amol Deshpande, Sam Madden, Vijayshankar Raman, Fred Reiss, Mehul Shah -- UCB
Backup Slides
Telegraph: Adaptive Dataflow
Mixed design philosophy:




Tolerate loose coupling and partial failure
Adapt online and provide best-effort results
Learn statistical properties online
Exploit knowledge of semantics via extensible optimization infrastructures
Target new networked, data-intensive applications
Adaptive Systems: General Flavor
Repeat:
1. Observe (model) environment
2. Use observation to choose behavior
3. Take action
Adaptive Dataflow in DBs: History
Rich But Unacknowledged History

Codd's data independence predicated on adaptivity!

adapt opaquely to changing schema and storage
Query optimization does it!

statistics-driven optimization
key differentiator between DBMSs and other systems
Adaptivity in Current DBs
Limited & coarse grain
Repeat:
1. Observe (model) environment
– runstats (once per week!!): model changes in data
2. Use observation to choose behavior
– query optimization: fixes a single static query plan
3. Take action
– query execution: blindly follow plan
What’s So Hard Here?
Volatile regime



Data flows unpredictably from sources
Code performs unpredictably along flows
Continuous volatility due to many decentralized systems
Lots of choices




Choice of services
Choice of machines
Choice of info: sensor fusion, data reduction, etc.
Order of operation
Maintenance


Federated world
Partial failure is the common case
Adaptivity required!
Adaptive Query Processing Work
Frequency of Adaptivity





Late Binding: Dynamic, Parametric [HP88,GW89,IN+92,GC94,AC+96,LP97]
Per Query: Mariposa [SA+96], ASE [CR94]
Competition: RDB [AZ96]
Inter-Op: [KD98], Tukwila [IF+99]
Query Scrambling: [AF+96,UFA98]
Survey: Hellerstein, Franklin, et al., DE Bulletin 2000
A Networking Problem!?
Networks do dataflow!
Significant history of adaptive techniques


E.g. TCP congestion control
E.g. routing
But traditionally much lower function


Ship bitstreams
Minimal, fixed code
Lately, moving up the foodchain?



app-level routing
active networks
politics of growth

assumption of complexity = assumption of liability
Networking Code as Dataflow?
States & Events, Not Threads




Asynchronous events natural to networks
State machines in protocol specification and system code
Low-overhead, spreading to big systems
Totally different programming style

remaining area of hacker machismo
Eventflow optimization



Can’t eventflow be adaptively optimized like dataflow?
Why didn’t that happen years ago?
Hold this thought
Query Plans are Dataflow Too
Programming model: iterators

old idea, widely used in DB query processing
object with three methods:

Init(), GetNext(), Close()
input/output types
query plan: graph of iterators

pipelining: iterators that return results before children Close()
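
The iterator model can be sketched in a few lines. This is a minimal illustration, not Telegraph code: the class names `Scan` and `Select`, the helper `run`, and the use of `None` as the end-of-stream marker are assumptions; `init`, `get_next`, and `close` stand for Init(), GetNext(), and Close().

```python
class Scan:
    """Leaf iterator over an in-memory list of records."""

    def __init__(self, records):
        self.records = records
        self.pos = None

    def init(self):
        self.pos = 0

    def get_next(self):
        if self.pos >= len(self.records):
            return None  # end of stream (records themselves must not be None)
        rec = self.records[self.pos]
        self.pos += 1
        return rec

    def close(self):
        self.pos = None

class Select:
    """Pipelining filter: hands back each match before its child is done."""

    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def init(self):
        self.child.init()

    def get_next(self):
        while True:
            rec = self.child.get_next()
            if rec is None or self.predicate(rec):
                return rec

    def close(self):
        self.child.close()

def run(plan):
    """Drive a query plan (a tree of iterators) from its root."""
    plan.init()
    out = []
    while (rec := plan.get_next()) is not None:
        out.append(rec)
    plan.close()
    return out

if __name__ == "__main__":
    plan = Select(Scan([1, 2, 3, 4, 5, 6]), lambda r: r % 2 == 0)
    print(run(plan))
```

Because every operator exposes the same three methods, operators compose into a plan without knowing what sits below them.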
Clever Dataflow Tricks
Volcano: “exchange” iterator [Graefe]

encapsulate exchange logic in an iterator
not in the dataflow system
Box-and-arrow programming can ignore parallelism
Some Solutions We’re Focusing On
Rivers

Adaptive partitioning of work across machines
Eddies

Adaptive ordering of pipelined operations
Quality of Service



Online aggregation & data reduction: CONTROL
MUST have app-semantics
Often may want user interaction

UI models of temporal interest
Data Dissemination

Adaptively choosing what to send, what to cache
River
Berkeley built the world-record sorting machine


On the NOW: 100 Sun workstations + SAN
Only beat the record under ideal conditions


No such thing in practice!
(Arpaci-Dusseau)²

with Culler, Hellerstein, Patterson
River: adaptive dataflow on clusters

One main idea: Distributed Queues



adaptive exchange operator
Simplifies management and programming
Remzi Arpaci-Dusseau, Eric Anderson, Noah Treuhaft

w/Culler, Hellerstein, Patterson, Yelick
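
Why a distributed queue helps under non-ideal conditions can be seen in a small simulation. This is a sketch, not River's implementation: the functions `static_partition` and `distributed_queue` and the per-item costs are hypothetical, and the "cluster" is just a list of per-item processing times.

```python
import heapq

def static_partition(n_items, costs):
    """Round-robin split: every consumer gets the same share of items.

    Returns the time at which the slowest consumer finishes.
    """
    finish = [0.0] * len(costs)
    for i in range(n_items):
        c = i % len(costs)
        finish[c] += costs[c]
    return max(finish)

def distributed_queue(n_items, costs):
    """Pull-based split: whichever consumer is free takes the next item.

    Returns (finish_time, items_taken_per_consumer).
    """
    free_at = [(0.0, c) for c in range(len(costs))]
    heapq.heapify(free_at)
    taken = [0] * len(costs)
    end = 0.0
    for _ in range(n_items):
        t, c = heapq.heappop(free_at)  # earliest-free consumer pulls
        t += costs[c]
        taken[c] += 1
        end = max(end, t)
        heapq.heappush(free_at, (t, c))
    return end, taken

if __name__ == "__main__":
    costs = [1.0, 1.0, 4.0]  # third machine is 4x slower (made-up numbers)
    print("static:", static_partition(90, costs))
    print("queue: ", distributed_queue(90, costs))
```

With one slow machine, the static split waits on the straggler, while the queue lets fast consumers absorb more of the work without any explicit rebalancing logic.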
Multi-Operator Query Plans
Deal with pipelines of commutative operators
Adapt at finer granularity than current DBMSs
Continuous Adaptivity: Eddies
Eddy
A pipelining tuple-routing iterator

just like join or sort or exchange
Works best with other pipelining operators

like Ripple Joins, online reordering, etc.
Ron Avnur & Joe Hellerstein
Continuous Adaptivity: Eddies
Eddy
How to order and reorder operators over time

based on performance, economic/admin feedback
Vs. River:


River optimizes each operator “horizontally”
Eddies optimize a pipeline “vertically”
Continuous Adaptivity: Eddies
Adjusts flow adaptively


Tuples routed through ops in different orders
Visit each op once before output
Naïve routing policy:

All ops fetch from eddy as fast as possible


A la River
Turns out, doesn’t quite work

Only measures rate of work, not benefit
Lottery-based routing



Uses “lottery scheduling” to address a bandit problem
Kris Hildrum, et al. looking at formalizing this
Various AI students looking at Reinforcement Learning
Competitive Eddies

Throw in redundant data access and code modules!
An Aside: n-Arm Bandits
A little machine learning problem:


Each arm pays off differently
Explore? Or Exploit?



Sometimes want to randomly choose an arm
Usually want to go with the best
If probabilities are stationary, dampen exploration over time
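
One common way to express this explore/exploit trade-off is an epsilon-greedy policy whose exploration rate shrinks over time. This is a sketch of that general technique, not the policy used in eddies; the function name `epsilon_greedy` and the 1/sqrt(t) decay schedule are assumptions for illustration.

```python
import random

def epsilon_greedy(payoff_probs, pulls=5000, seed=0):
    """Play a Bernoulli n-armed bandit with decaying exploration.

    payoff_probs are hidden from the policy; it only observes rewards.
    Returns (pull_counts, estimated_payoffs).
    """
    rng = random.Random(seed)
    n = len(payoff_probs)
    counts = [0] * n
    estimates = [0.0] * n
    for t in range(1, pulls + 1):
        epsilon = 1.0 / (t ** 0.5)      # dampen exploration over time
        if rng.random() < epsilon:
            arm = rng.randrange(n)      # explore: pick a random arm
        else:
            # exploit: go with the best estimate so far
            arm = max(range(n), key=lambda a: estimates[a])
        reward = 1.0 if rng.random() < payoff_probs[arm] else 0.0
        counts[arm] += 1
        # incremental mean of the rewards seen on this arm
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts, estimates

if __name__ == "__main__":
    counts, estimates = epsilon_greedy([0.2, 0.5, 0.8])
    print("pulls per arm:", counts)
    print("estimates:    ", [round(e, 2) for e in estimates])
```

The decay is only safe when the payoff probabilities are stationary; in a volatile setting the policy has to keep exploring or forget old observations.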
Eddies with Lottery Scheduling
Operator gets 1 ticket when it takes a tuple

Favor operators that run fast (low cost)
Operator loses a ticket when it returns a tuple

Favor operators with high rejection rate

Low selectivity
Lottery Scheduling:


When two ops vie for the same tuple, hold a lottery
Never let any operator go to zero tickets

Support occasional random “exploration”
Set up “inflation” (forgetting) to adapt over time

E.g. tix’ = α·oldtix + newtix
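
The ticket rules above can be sketched as a small sequential simulation. This is an illustration under stated assumptions, not the eddy implementation: the names `Op` and `eddy` are hypothetical, operators are plain filters, and because the loop runs one tuple at a time it models only selectivity, not operator speed. The optional `decay` applies the forgetting idea in a simplified form by scaling old ticket balances periodically.

```python
import random

class Op:
    """A filter operator with a ticket balance."""

    def __init__(self, name, predicate):
        self.name = name
        self.predicate = predicate
        self.tickets = 1  # never allowed to drop below 1
        self.seen = 0     # tuples routed to this operator

def eddy(tuples, ops, decay=None, every=100, seed=0):
    """Route each tuple through every op once, in lottery-chosen order.

    An op gains a ticket when it takes a tuple and loses one when it
    returns (passes) the tuple, so ops that reject often hold more
    tickets and tend to be visited first.
    """
    rng = random.Random(seed)
    out = []
    for i, tup in enumerate(tuples, start=1):
        pending = list(ops)
        alive = True
        while pending and alive:
            weights = [op.tickets for op in pending]
            op = rng.choices(pending, weights=weights, k=1)[0]  # the lottery
            pending.remove(op)
            op.seen += 1
            op.tickets += 1                          # took a tuple
            if op.predicate(tup):
                op.tickets = max(1, op.tickets - 1)  # returned it
            else:
                alive = False                        # tuple rejected
        if alive:
            out.append(tup)
        if decay is not None and i % every == 0:
            for op in ops:
                op.tickets = max(1, op.tickets * decay)  # forget old history
    return out

if __name__ == "__main__":
    ops = [Op("lt10", lambda x: x < 10), Op("even", lambda x: x % 2 == 0)]
    result = eddy(range(1000), ops)
    print(result)
    print({op.name: (op.tickets, op.seen) for op in ops})
```

Whatever order the lottery picks, the output is the same set of tuples; only the amount of work changes, since a tuple rejected by the first operator never reaches the second.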
Promising!
Initial performance results
Ongoing work on proofs of convergence

have analysis for constrained case
To Be Continued
Tune & formalize policy
Competitive eddies

Source & Join selection
Requires duplicate management
Parallelism

Eddies + Rivers?
Reliability

Long-running flows
Rivers + RAID-style computation
[Figure: Eddy diagram with labels index1, index2, block, hash, R1 R2 R3, S1 S2 S3]
To Be Continued, cont.
What about wide area?



data reduction
sensor fusion
asynchronous communication
Continuous queries


events
disconnected operation
Lower-level eventflow?

can eddies, rivers, etc. be brought to bear on programming?