Searching and Data Sharing in P2P Systems
Beng Chin Ooi
Department of Computer Science
National University of Singapore
www.comp.nus.edu.sg/~ooibc
Acknowledgement


A few slides are borrowed or adapted from Hellerstein’s
group and his VLDB’04 tutorial slides
Some are screen dumps used as examples
What is P2P?
Client Server Architecture
Peer-to-Peer Architecture
P2P Systems?



Effective Use of the Internet-connected
PCs/workstations directly participate in the Internet
Sites are autonomous
Similar functionalities and responsibilities


Each peer consumes and serves
Resources are distributed
Driving Forces

Main driving forces:

Exploiting existing resources





Computational efficiency is not the main goal
Sharing costs among users
Autonomy
Anonymity
Legal protection
P2P Systems
“A class of applications that takes advantage of resources
like storage, CPU cycles, content and even human
presence available at the edges of the Internet” -- Clay
Shirky, an investment advisor
P2P Applications
[Figure: taxonomy of peer-to-peer applications. Labels shown: Collaboration, Instant Messaging, Groupware, Resource Utilisation, File Sharing, Computation, Bandwidth, Storage, Others. Examples shown: P2P Messenger, Groove, Upriser, SETI, Folding@home, freenet.]
Properties of P2P Applications?




Dynamic and Self-Organizing
Enduring
Resilient
Collaborative
P2P Future

Aberdeen Group’s prediction:



US$930 million by end 2004
From US$20.6 at end of 2000
Standardization


NPI (New Productivity Initiative)
Peer-to-Peer Working Group (P2PWG)

NAT, Taxonomy, Security, File Services, Interoperability
Overlay Networks

P2P applications need to:

Track identities & (IP) addresses of peers
May be many!
May have significant churn
Best not to have n² ID references

Route messages among peers
If you don’t keep track of all peers, this is “multi-hop”

This is an overlay network
Peers are doing both naming and routing
IP becomes “just” the low-level transport
All the IP routing is opaque

Control over naming and routing is powerful
And as we’ll see, brings networks into the database era
Infecting the Network, Peer-to-Peer


The Internet is hard to change.
But Overlay Nets are easy!


P2P is a wonderful “host” for infecting network designs
The “next” Internet is likely to be very different



“Naming” is a key design issue today
Querying and data independence key tomorrow?
Don’t forget:


The Internet was originally an overlay on the telephone network
There is no money to be made in the bit-shipping business
• A modest goal for DB research:
– Don’t query the Internet.
The Evolution of P2P systems

First generation – centralized P2P systems
E.g. Napster, SETI@home
Second generation – decentralized & unstructured P2P systems
E.g. Gnutella
Third generation – structured P2P systems
DHT systems (CAN/Chord/Pastry/Tapestry)
Skip-list based systems
….
Unstructured P2P Systems



P2P with Central Servers
P2P with fully Autonomous Peers (pure p2p)
P2P with Superpeers (SuperNodes)
Unstructured Centralized P2P Systems -- Napster
[Figure: a directory server and peers A and B, with the messages “Get X” and “Reply with X”]
Searching is efficient, with only a few messages exchanged;
Not scalable, and the directory server is a central point of failure;
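A minimal Python sketch of the centralized lookup on this slide. The `DirectoryServer` class and its method names are hypothetical and are not Napster's actual protocol. It illustrates why a search costs only a request and a reply, and why everything depends on the one server holding the index.

```python
class DirectoryServer:
    """Hypothetical central index: file name -> peers that share it."""

    def __init__(self):
        self._index = {}

    def publish(self, peer, filename):
        # A peer announces that it shares `filename`.
        self._index.setdefault(filename, set()).add(peer)

    def lookup(self, filename):
        # One request, one reply: the list of peers holding the file.
        return sorted(self._index.get(filename, ()))


directory = DirectoryServer()
directory.publish("B", "X")
holders = directory.lookup("X")  # peer A would then download X from B directly
```

The file transfer itself happens between the peers; only the index is centralized, which is also why losing the server stops all searches.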
Harnessing Idle CPU Cycles – SETI@HOME
[Figure: peers A to E each download raw data from the central data source, process it when idle, and send the processing results back to the center]
Unstructured Fully Decentralized -- Gnutella


Searching works by flooding, which is inherently unscalable;
Time-to-Live (TTL) is used to partially address this problem;
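A rough sketch of TTL-limited flooding, assuming a small hypothetical overlay stored as an adjacency dictionary (`GRAPH` and the function name are illustrative). It is a simplification of Gnutella: each peer forwards a query only once, but every copy sent is counted, including copies that reach peers that have already seen the query.

```python
from collections import deque

# Hypothetical overlay: peer -> list of neighbours.
GRAPH = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}


def flood(graph, has_file, start, ttl):
    """Return (peers holding the file within `ttl` hops, messages sent)."""
    seen = {start}
    frontier = deque([(start, 0)])
    hits, messages = set(), 0
    while frontier:
        node, depth = frontier.popleft()
        if node in has_file and node != start:
            hits.add(node)
        if depth == ttl:
            continue  # TTL exhausted: this peer does not forward further
        for neighbour in graph[node]:
            messages += 1  # every forwarded copy costs a message
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, depth + 1))
    return hits, messages
```

With this overlay, a TTL of 2 misses the file at peer E, while a TTL of 3 finds it at the cost of more messages, which is the trade-off the TTL controls.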
Techniques for improving search in Gnutella-like Networks





Expanding Ring;
Random Walks;
Good Peer;
Local indices;
Routing indices;
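As one illustration of the techniques listed above, here is a minimal sketch of an expanding ring search: the query is re-issued with a growing TTL until some peer within range holds the file. The overlay `GRAPH` and the function names are assumptions made for the example, not taken from any particular system.

```python
# Hypothetical overlay: peer -> list of neighbours.
GRAPH = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}


def peers_within(graph, start, ttl):
    """Peers reachable from `start` in at most `ttl` hops (breadth-first)."""
    seen = {start}
    frontier = [start]
    for _ in range(ttl):
        next_frontier = []
        for node in frontier:
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    next_frontier.append(neighbour)
        frontier = next_frontier
    return seen


def expanding_ring(graph, has_file, start, max_ttl):
    """Try TTL = 1, 2, ... and stop at the first ring that contains a hit."""
    for ttl in range(1, max_ttl + 1):
        hits = (peers_within(graph, start, ttl) & has_file) - {start}
        if hits:
            return ttl, hits
    return None
```

Popular files tend to be found in a small ring, so most queries never pay for a full flood; rare files cost more because the inner rings are searched repeatedly.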
Freenet
[Figure: peer A issues the query “Who has file X”. The query is forwarded peer by peer (B, C, D) along hints of the form “Peer … might have file X” until it reaches peer E. The reply “Peer E has file X” is passed back, and file X is then downloaded from Peer E.]
Worst Case for Freenet
[Figure: the same network with an additional peer F. The query follows the hints “Peer C might have file X”, “Peer D might have file X” and “Peer E might have file X” and ends with “NOT FOUND!”, while peer F (“I HAVE FILE X!”) is never asked.]


Peer F has the requested file, but the query never reaches it because of a poor
routing decision made at Peer D, so the query goes unmatched.
In this case, the query is rerouted along an alternate path
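The behaviour on these two slides can be sketched as a depth-first search with backtracking and a hop budget. This is only an illustration under stated assumptions: the order of each neighbour list stands in for the routing hints ("Peer D might have file X"), whereas real Freenet ranks neighbours by key closeness, and `GRAPH`, `route` and `hops_to_live` are hypothetical names.

```python
# Hypothetical overlay; each list is ordered by the peer's routing preference.
GRAPH = {
    "A": ["B"],
    "B": ["C", "D"],
    "C": [],
    "D": ["E", "F"],
    "E": [],
    "F": [],
}


def route(graph, holders, start, hops_to_live):
    """Follow preferred neighbours depth-first; back up on dead ends.

    Returns the path to a peer holding the file, or None when the
    hop budget runs out before a holder is reached.
    """
    visited = set()
    budget = hops_to_live

    def visit(node):
        nonlocal budget
        visited.add(node)
        if node in holders:
            return [node]
        for neighbour in graph[node]:
            if neighbour in visited:
                continue
            if budget == 0:
                return None  # no hops left to try an alternate path
            budget -= 1
            path = visit(neighbour)
            if path is not None:
                return [node] + path
        return None

    return visit(start)
```

With a budget of 4 hops the file at E is found, but if only F holds it the budget is spent on C and E first and the search fails, which mirrors the worst case shown above; one extra hop lets the alternate path through D reach F.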
Unstructured P2P with Supernodes


Combine the benefits of centralized and decentralized
search;
Take advantage of the heterogeneity of peer capabilities;
Morpheus
[Figure: a supernode layer above three clusters of peers (A to I); each supernode is the center holding the index for its cluster. Messages shown: Query: “Who has file X”, Reply: “Peer H has file X”, then download file X from Peer H.]
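A small sketch of the supernode idea, assuming a hypothetical `Supernode` class: each supernode keeps a Napster-style index for its own cluster, and a query that misses locally is passed on to the other supernodes. The names and the one-hop forwarding rule are illustrative assumptions, not the Morpheus protocol.

```python
class Supernode:
    """Hypothetical supernode holding the index for one cluster."""

    def __init__(self):
        self.index = {}      # file name -> peers in this cluster
        self.neighbours = [] # other supernodes in the supernode layer

    def publish(self, peer, filename):
        self.index.setdefault(filename, set()).add(peer)

    def search(self, filename):
        # Centralized lookup inside the cluster ...
        hits = set(self.index.get(filename, ()))
        # ... and decentralized forwarding across the supernode layer.
        for supernode in self.neighbours:
            hits |= supernode.index.get(filename, set())
        return hits


left, right = Supernode(), Supernode()
left.neighbours.append(right)
right.publish("H", "X")
found = left.search("X")  # the requester then downloads X from peer H
```

Only the more capable machines carry index and routing load, which is how this design exploits the heterogeneity of peers.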
What is Grid?
“A hardware and software infrastructure that provides
dependable, consistent, pervasive, and inexpensive
access to high-end computational capabilities”
-- Ian Foster & Carl Kesselman, 1998
“Sharing environment implemented via the deployment of
a persistent, standards-based service infrastructure
that supports the creation of, and resource sharing
within distributed communities”
-- Ian Foster & Adriana Iamnitchi, 2003
A basic concept in Grid -- “Virtual Organization”
The evolution of Grid Systems



First generation systems involved proprietary solutions for
sharing high performance computing resources; e.g. Condor
Second generation systems introduced middleware to cope
with scale and heterogeneity, with a focus on large scale
computational power and large volumes of data; e.g. Globus,
EU DataGrid
Third generation systems adopt a service-oriented
approach, take a more holistic view of the e-Science
infrastructure, are metadata-enabled and may exhibit
autonomic features.
Open Grid Services Architecture (OGSA)
P2P vs. Grid --similarities

Both P2P and Grid address the same problem, share the
same goal


Resource sharing within distributed resources.
Both offer promising paradigms for developing distributed
systems and applications
P2P vs. Grid --differences

Resources


Grid– higher-end resources, better connected with high levels of
availability
P2P– edge level devices, intermittently connected with highly
variable availability
P2P vs. Grid --differences

Services


Dependent on the nature of communities
Eg 1. Resource Discovery



Grid—very well structured and stable network making this less of an
issue
P2P—unstable network
Eg 2. Security


Grid—authentication, authorization, accountability
P2P—anonymity, censorship resistance
P2P vs. Grid --differences

Infrastructure



Grid – more emphasis on standardization and interoperability
P2P – little emphasis, no interoperability
Applications


Grid – large range of applications, more computation and data intensive
P2P – more social-based, less computation and data intensive
P2P vs. Grid --differences

Scalability


Grid– Most services, such as resource discovery, are mainly based
on centralized or hierarchical models
P2P– Most P2P systems are decentralized
P2P vs. Grid --summary




Grid needs to do more to address decentralization, self-organization,
fault tolerance, and scalability issues, which are strong points of P2P.
P2P should put more effort into standard infrastructure
and provide more services.
The P2P model could help to ensure Grid scalability
The two technologies are likely to converge (grid +
structured p2p)
Data sharing in P2P systems

Provide only file-level sharing, and lack content-based search
Coarse granularity of information sharing.
Lack of extensibility and flexibility
No easy and rapid means to expand applications
Node’s neighbors are typically statically defined
Difficult to utilize network bandwidth and optimize system performance
Relational data sharing in Unstructured P2P vs. Distributed DB
P2P | Distributed Database Systems
Nodes can join and leave the network anytime. | Nodes are added to or removed from the network in a controlled manner.
Usually no predetermined (global) schema among nodes. Queries: keywords | Have some knowledge of a shared schema. Queries: SQL
Answers to queries are typically incomplete.* | Can actually retrieve the complete set of answers.
Content location is typically by “word of mouth”, e.g., a node routes the query to its neighbors, and so on… | Exact location to direct the query is typically known.
* By “completeness” we mean all answers that satisfy a query
P2P & DB Systems
P2P: Flexibility; Decentralized; Fault Tolerance; Lightweight
DB: Strong Semantics; Powerful query facilities; Transactions & Concurrency Control
Taken from Hellerstein’s group ppt
P2P + DB = ?

P2P Database? No!



ACID transactional guarantees do not scale, nor does the everyday user
want ACID semantics
Much too heavyweight of a solution for the everyday user
Query Processing on P2P!



Both P2P and DBs do data location and movement
Can be naturally unified (lessons in both directions)
P2P brings scalability & flexibility
DB brings relational model & query facilities
Taken from Hellerstein’s group ppt
Many New Challenges

Relative to other parallel/distributed systems








Partial failure
Churn
Few guarantees on transport, storage, etc.
Huge optimization space
Network bottlenecks & other resource constraints
No administrative organizations
Trust issues: security, privacy, incentives
Relative to IP networking


Much higher function, more flexible
Much less controllable/predictable
Some Proposals on Data Sharing…

Database:





Data Mapping (SIGMOD’03)
Piazza (ICDE’03)
PeerDB(ICDE’03)
…
IR:



PlanetP (HPDC’03)
SummaryIndex (TKDE’04 special issue on P2P)
…
The Birth of BestPeer…

Started in 1998
To steal storage and CPU cycles from staff machines
To provide a virtual and parallelised content-based document retrieval
system
To be able to move processes from one PC to another quickly when
users need the PC back
Extended to P2P in early 2000
A VC showed interest in the project
W.S. Ng, B. C. Ooi and K.L. Tan: BestPeer: A self-configurable
peer-to-peer system. ICDE’2002.
BestPeer Network



BestPeer is a generic P2P system designed to serve as a
platform on which P2P applications can be developed
easily and efficiently
Integrates mobile agents with P2P technologies
Each participant runs the BestPeer software
Provides communication facilities and shares resources with other
peers
Provides an environment in which agents can reside and perform
their tasks
BestPeer Network cont…
Large # of peers, small # of LIGLO servers;
Each node holds two types of data: private data and sharable
data;
New node registration:
Register with LIGLO
Obtain a unique BPID from LIGLO.
LIGLO sends a list of (BPID, IP) pairs of peers
that the node can communicate with directly.
Node is ready to communicate with other
peers.
BestPeer Network cont…
Node Rejoins:
Send node’s current IP to LIGLO
For each peer p of the node, send p’s BPID to p’s registered LIGLO
p’s registered LIGLO will reply with the IP of p if p is
currently connected to the network
Node has rejoined
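The registration and rejoin steps can be sketched as a small in-memory registry. This is an illustration only: the `Liglo` class, its method names, the `name:counter` BPID format and the `sample_size` parameter are assumptions and do not come from the BestPeer implementation.

```python
import itertools


class Liglo:
    """Hypothetical LIGLO server: maps BPIDs to the member's current IP."""

    def __init__(self, name):
        self.name = name
        self._counter = itertools.count(1)
        self._members = {}  # BPID -> current IP, or None while offline

    def register(self, ip, sample_size=2):
        # Issue a unique BPID and return some (BPID, IP) pairs of online peers.
        bpid = "{}:{}".format(self.name, next(self._counter))
        online = [(b, addr) for b, addr in self._members.items() if addr is not None]
        self._members[bpid] = ip
        return bpid, online[:sample_size]

    def leave(self, bpid):
        self._members[bpid] = None

    def rejoin(self, bpid, ip):
        # The BPID stays the same even though the IP may have changed.
        self._members[bpid] = ip

    def lookup(self, bpid):
        # Current IP of a member, or None if it is not connected.
        return self._members.get(bpid)


liglo = Liglo("L1")
first, _ = liglo.register("10.0.0.1")
second, known_peers = liglo.register("10.0.0.2")
```

A rejoining node would call `rejoin` on its own LIGLO and then `lookup` on each peer's registered LIGLO to refresh the IP addresses of the peers it remembers.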
BestPeer Network cont…
Access Data from other nodes:
Query propagation by broadcast
Node with matching result will respond to initiating node directly
Two modes to access data:
Phase 1: A node with a matching answer either returns the result directly,
or only indicates that it has the information
Phase 2: In the latter case, the initiating node then sends a further message to
some, if not all, of these nodes to obtain the desired information
Reconfigurable BestPeer Network



A node in the BestPeer network can dynamically reconfigure itself by
keeping the peers that benefit it most.
Based on the assumption that peers that benefit a node most for a query are
most likely to provide the greatest gain for subsequent queries.
Every node controls the maximum number of direct peers it can
have
Reconfigurable BestPeer Network cont…
BestPeer applies an autonomous strategy, where each node
tries to keep promising peers as close as possible, with no
information exchange between peers.
BestPeer provides two default reconfiguration strategies:
MaxCount
Maximizes the number of objects a node can obtain from its directly
connected peers.
MinHops
Minimizes the number of hops that a query needs to travel
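A hedged sketch of how the two strategies might choose which peers to keep after a query. The function names and input shapes are hypothetical. In particular, the MinHops rule used here (promote the answering peers that were farthest away, so that later queries reach them in one hop) is one possible reading of the slide and is an assumption, not a confirmed description of BestPeer.

```python
def max_count(answers, max_peers):
    """Keep the peers that returned the most objects.

    answers: {peer_id: number of objects returned for the last query}
    """
    ranked = sorted(answers, key=lambda peer: (-answers[peer], peer))
    return ranked[:max_peers]


def min_hops(hops, max_peers):
    """Keep the answering peers that were the most hops away (assumed rule).

    hops: {peer_id: hop distance at which the peer answered}
    """
    ranked = sorted(hops, key=lambda peer: (-hops[peer], peer))
    return ranked[:max_peers]


kept_by_count = max_count({"A": 5, "B": 1, "C": 9, "D": 0}, max_peers=2)
kept_by_hops = min_hops({"A": 1, "B": 3, "C": 2}, max_peers=2)
```

Both functions use only what the node observed itself, which matches the autonomous strategy above: no information is exchanged between peers to decide the new neighbour set.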
Location-Independent Global Names Lookup Server
(LIGLO)



To facilitate identification of a single node that may have
different IP addresses at different times
LIGLO is a node that has a fixed IP and runs the LIGLO
software
LIGLO:
Generates BestPeer Global Identity (BPID)
Maintains peer’s current status
LIGLO applies a distributed approach: each LIGLO only
needs to maintain its own members’ names
Features of BestPeer




Combines the power of agent technology and P2P
technology in a single system
Supports a finer granularity of data sharing, and sharing of
computational power
Facilitates dynamic reconfiguration of BestPeer network
Adopts a distributed approach to minimize bottlenecks of
servers acting as LIGLO
Integration of Mobile Agent and P2P
Technologies
P2P technologies provide resource sharing capabilities
among nodes; mobile agents further extend the
functionality
Java-based Agent System
BestPeer Search Agent vs. Traditional Search Agent:



(Trad) Predefined itinerary vs. Auto and transparent
TTL / Hops based lifetime
Result/Cost-based lifespan
PeerDB


PeerDB is built on top of BestPeer
Four components that are integrated and implemented on the
application layer.
Data management system
Facilitates storage, manipulation and retrieval of the data
MySQL as the backend for supporting SQL query facility
Local Dictionary
Metadata stored in Local Dictionary
Export Dictionary
Metadata sharable to other nodes
Cache Manager
Caching remote data in secondary storage
Caching/replacement policy
B.C. Ooi, K.L. Tan, A. Zhou, C.H. Goh, Y.G. Li, C.Y. Liau, B. Ling, W.S. Ng, Y. Shu,
X.Y. Wang, M. Zhang: PeerDB: Peering into Personal Databases. SIGMOD’2003,
Demo.
W.S. Ng, B. C. Ooi, K.L. Tan, A. Zhou: PeerDB: A P2P-based System for Distributed
Data Sharing. ICDE’2003
PeerDB cont…
Agent Layer: DBAgent
Provides the environment for mobile agents (Java agents) to operate in.
Each PeerDB node has a master agent that manages the user query.
Clones and dispatches worker agents to neighboring nodes
P2P Layer:
Network management and message management
Monitors statistics and manages network reconfiguration
Architecture
Sharing Data Without Global Schema
Information Retrieval (IR) approach
Meta-data (keywords) are maintained for each relation’s name and
attributes
These serve as a kind of synonym list
(i.e., a miniature thesaurus)
Example
Peer | Name | Keywords
P1 | Kinases | Kinases, protein, human
P1 | SeqID | key, identifier, ID
P1 | length | length
P1 | proteinSeq | Sequence, protein sequence
P2 | Protein | protein, annexin, zebrafish
P2 | SeqNo | number, identifier
P2 | Len | length
P2 | sequence | sequence
P3 | ProteinKLen | Protein, kinases, length
P3 | ID | Number, identifier
P3 | seqLength | Length
P3 | ProteinKSeq | Protein, sequence
P3 | ID | Number, identifier
P3 | Sequence | sequence
P4 | Protein | Protein, kinases, annexin
P4 | Name | Name
P4 | char | Characteristics, features, functions
Relations
P1: Kinases(SeqID, length, proteinSeq)
P2: Protein(SeqNo, len, sequence)
P3: ProteinKLen(ID, seqLength), ProteinKSeq(ID, sequence)
P4: Protein(Name, char)
P1 Query
SELECT SeqID, proteinSeq
FROM Kinases
WHERE length > 30
* Knows own schema but not the schema of other peers
P2, P3 and P4 match the query relation
SeqID, proteinSeq and length all have matching keywords in P2 and P3
Note: For P3, query may have to be turned into a join query over ProteinKLen and ProteinKSeq
P4 (relation match only) ranks lower than P2 and P3
Semantically, P2’s data are not actually those that P1 is interested in… so the
meta-data & info are returned to the user before fetching the data.
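The matching in this example can be sketched in a few lines of Python. The keyword sets are transcribed from the example (lowercased); the scoring rule is an assumption made for illustration: a peer is a candidate if any of its relations shares a keyword with the query relation, and candidates are ranked by how many query attributes find a keyword match. PeerDB's actual ranking function may differ.

```python
# Query issued at P1: keywords of the relation and of each attribute it uses.
QUERY = {
    "relation": {"kinases", "protein", "human"},
    "attrs": {
        "SeqID": {"key", "identifier", "id"},
        "proteinSeq": {"sequence", "protein sequence"},
        "length": {"length"},
    },
}

# Per peer: list of (relation keywords, {attribute: keywords}).
PEERS = {
    "P2": [({"protein", "annexin", "zebrafish"},
            {"SeqNo": {"number", "identifier"}, "Len": {"length"},
             "sequence": {"sequence"}})],
    "P3": [({"protein", "kinases", "length"},
            {"ID": {"number", "identifier"}, "seqLength": {"length"}}),
           ({"protein", "sequence"},
            {"ID": {"number", "identifier"}, "Sequence": {"sequence"}})],
    "P4": [({"protein", "kinases", "annexin"},
            {"Name": {"name"},
             "char": {"characteristics", "features", "functions"}})],
}


def rank_peers(query, peers):
    """Return [(peer, matched query attributes)] for peers whose relation matches."""
    scores = {}
    for peer, relations in peers.items():
        covered, relation_matched = set(), False
        for relation_keywords, attrs in relations:
            if not relation_keywords & query["relation"]:
                continue
            relation_matched = True
            for name, keywords in query["attrs"].items():
                if any(keywords & other for other in attrs.values()):
                    covered.add(name)
        if relation_matched:
            scores[peer] = len(covered)
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

Under this rule P2 and P3 cover all three query attributes while P4 matches on the relation only, so P4 ranks last. Keyword overlap says nothing about whether P2's zebrafish annexin data is what P1 wants, which is why the user sees the metadata first.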
Query Processing Strategy


Completely assisted by agents, which interact with the DBMS.
Query may be rewritten into another form by the DBAgent,
e.g., single-relation query -> join query involving multiple relations
Local query vs. Remote query – A query is local to a node if
it is initiated there, and remote otherwise.
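To make the rewriting step concrete, here is a sketch that turns P1's single-relation query into a join over P3's two relations. The attribute mapping, the function name and the choice of `ID` as join key are assumptions for the example, and SQLite (from the Python standard library) stands in for the MySQL backend so that the snippet is self-contained.

```python
import sqlite3

# Assumed mapping from P1's attribute names to P3's (relation, attribute) pairs.
MAPPING = {
    "SeqID": ("ProteinKSeq", "ID"),
    "proteinSeq": ("ProteinKSeq", "sequence"),
    "length": ("ProteinKLen", "seqLength"),
}


def rewrite_for_p3(select_attrs, filter_attr, threshold, join_key="ID"):
    """Rewrite 'SELECT attrs FROM Kinases WHERE filter > threshold' for P3."""
    columns = ", ".join("{}.{}".format(*MAPPING[a]) for a in select_attrs)
    filter_rel, filter_col = MAPPING[filter_attr]
    relations = sorted({MAPPING[a][0] for a in select_attrs} | {filter_rel})
    sql = "SELECT {} FROM {}".format(columns, relations[0])
    for relation in relations[1:]:
        sql += " JOIN {r} ON {first}.{k} = {r}.{k}".format(
            r=relation, first=relations[0], k=join_key)
    sql += " WHERE {}.{} > ?".format(filter_rel, filter_col)
    return sql, (threshold,)


def demo():
    """Run the rewritten query against a tiny made-up copy of P3's tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ProteinKLen (ID INTEGER, seqLength INTEGER)")
    conn.execute("CREATE TABLE ProteinKSeq (ID INTEGER, sequence TEXT)")
    conn.executemany("INSERT INTO ProteinKLen VALUES (?, ?)", [(1, 25), (2, 40)])
    conn.executemany("INSERT INTO ProteinKSeq VALUES (?, ?)",
                     [(1, "MKV"), (2, "MSTAVLE")])
    sql, params = rewrite_for_p3(["SeqID", "proteinSeq"], "length", 30)
    rows = conn.execute(sql, params).fetchall()
    conn.close()
    return rows
```

The sample rows are invented; only the second one has a length above 30, so it is the only row the rewritten join returns.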
Convergence of Technologies on P2P Network
Search Engine
DBMS
Information Aggregation
• Possible Business Model for P2P?
Keyword Join – Current Work


A means to facilitate information aggregation
Tuples are “joined” based on similar values (not exact matches as in
a normal join)



IR similarity matching between attribute values + contents
Top-K Answers
Eg. Database search patent filing
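A minimal sketch of a similarity-based join that returns the top-K pairs. The slide only says "IR similarity matching"; the use of Jaccard similarity over word tokens, and the function names, are assumptions chosen to keep the example short.

```python
import heapq


def tokens(text):
    return set(text.lower().split())


def jaccard(a, b):
    """Word-overlap similarity in [0, 1]; a stand-in for an IR similarity."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def keyword_join(left, right, k):
    """Return the k most similar (score, left value, right value) pairs."""
    scored = ((jaccard(l, r), l, r) for l in left for r in right)
    return heapq.nlargest(k, (pair for pair in scored if pair[0] > 0))


papers = ["peer to peer query processing", "grid resource discovery"]
patents = ["query processing in peer networks", "protein sequence search"]
best = keyword_join(papers, patents, k=1)
```

Unlike an equality join, every pair gets a score, so the result is a ranked list and the top-K cut-off decides how many "joined" tuples the user sees.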
Some BestPeer work





Wee Siong Ng, Beng Chin Ooi, Yan Feng Shu, Kian Lee Tan and Wee Hyong Tok
Efficient Distributed CQ Processing using Peers (Poster).
The Twelfth International World Wide Web Conference 2004.
B.C. Ooi, K.L. Tan, A.Y. Zhou, C.H. Goh, Y.G. Li, C.Y. Liau, B. Ling, W.S. Ng, Y.F.
Shu, X.Y. Wang, M. Zhang PeerDB: Peering into Personal Databases.
The 2003 ACM SIGMOD Intl. Conf. on Management of Data (Demo).
Wee Siong Ng, Beng Chin Ooi, Kian Lee Tan and AoYing Zhou
PeerDB: A P2P-based System for Distributed Data Sharing.
The 19th International Conference on Data Engineering 2003.
Panos Kalnis*, Wee Siong Ng, Beng Chin Ooi, Dimitris Papadias*, Kian-Lee Tan
An Adaptive Peer-to-Peer Network for Distributed Caching of OLAP Results.
ACM SIGMOD Conference 2002.
Wee Siong Ng, Beng Chin Ooi and Kian Lee Tan
BestPeer: A Self-Configurable Peer-to-Peer System (Poster).
The 18th International Conference on Data Engineering 2002