Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Introduction to Stanford DB Group Research Li Ruixuan http://cs.hust.edu.cn/rxli/ [email protected] 1 Contents Introduction Past projects Current projects Events References Links 2 The Stanford Database Group “Mainstream” faculty – – – – Hector Garcia-Molina Jennifer Widom Jeff Ullman Gio Wiederhold “Adjunct” faculty – Chris Manning (natural language processing) – Rajeev Motwani (theory) – Terry Winograd (human-computer interaction) A.k.a. Stanford InfoLab 3 Database Group (cont’d) Approximately 25 Ph.D. students Varying numbers of M.S. and undergraduate students Handful of visitors One senior research associate One systems administrator, one programmer Excellent administrative staff Resident photographer 4 Research Areas (very coarse) Digital libraries Peer-to-peer systems Data streams Replication, caching, archiving, broadcast, … The Web Ontologies, semantic Web Data mining Miscellaneous 5 Past Projects LIC: Large-Scale Interoperation and Composition (1999) – mediator (SKC, OntoWeb, CHAIMS, SmiQL, image DB) SKC: Scalable Knowledge Composition (2000) - semantic heterogeneity TID: Trusted Image Distribution (2001) - Image Filtering for Secure Distribution of Medical Information Image Database: Content-based Image Retrieval (2003) SimQL:Simulation Access Language (2001) - Software modules in manufacturing, acquisition, and planning systems 6 Past Projects (cont’d) TSIMMIS: Wrapping and mediation for heterogenous information sources (1998) Lore: A Database Management System for XML (2000) WHIPS: WareHouse Information Prototype at Stanford (1998) - Data warehouse creation and maintenance MIDAS: Mining Data at Stanford (1999) WSQ: Web-Supported Queries (2000) Integrating database queries and Web searches7 Current Projects WebBase: Crawling, storage, indexing, and querying of large collections of Web pages. (Molina) STREAM: A Database Management System for Data Streams (Widom) Peers: Building primitives for peer-to-peer systems (Molina) Digital Libraries: Interoperating on-line services for enduser support (TID,WebBase,OntoAgents) (Molina) TRAPP: Approximate data caching: trading precision for performance (Widom) CHAIMS: Compiling High-level Access Interfaces for Multi-site Software (1999) (Wiederhold) OntoAgents: Ontology based Infrastructure for Agents (2002) (Wiederhold) 8 WebBase: Objectives Provide a storage infrastructure for Web-like content Store a sizeable portion of the Web Enable researchers to easily build indexes of page features across large sets of pages Distribute Webbase content via multicast channels Support structure and content-based querying over the stored collection 9 WebBase: Architecture WWW Page Repository Crawler Indexing Module Analysis Module Multicast Module I n d e x A P I Query Engine Retrieval Indexes Feature Repository Client Client Indexing Client W e b B a s e A P I Client Client 10 WebBase: Current Status Efficient “smart” crawler – Parallelism – Freshness & Relevance Efficient and scalable indexing – Distributed Web-scale content indexes – Indexes over graph structure Unicast dissemination – Within Stanford – External clients: Columbia, U.Wash, U.C.Berkeley 11 WebBase: In Progress WebBase Infrastructure – Multicast dissemination – Complex queries Other work – – – – PageRank extensions Clustering and similarity search Structured data extraction Hidden Web crawling 12 Data Streams: Motivation Traditional DBMS -- data stored in finite, persistent data sets New applications -- data as multiple, continuous, rapid, time-varying data streams – – – – – – – Network monitoring and traffic engineering Security applications Telecom call records Financial applications Web logs and click-streams Sensor networks Manufacturing processes 13 STREAM: Architecture Register Query Streamed Result Stored Result DSMS Input streams Archive Scratch Store Stored Relations 14 STREAM: Challenges Multiple, continuous, rapid, time-varying streams of data Queries may be continuous (not just one-time) – Evaluated continuously as stream data arrives – Answer updated over time Queries may be complex – Beyond element-at-a-time processing – Beyond stream-at-a-time processing 15 DBMS versus DSMS Persistent relations Transient streams (and persistent relations) One-time queries Continuous queries Random access Sequential access Access plan determined by query processor and physical DB design Unpredictable data arrival and characteristics “Unbounded” disk store Bounded main memory 16 STREAM: Current Status Data streams and stored relations Declarative language for registering continuous queries Flexible query plans Designed to cope with high data rates and query workloads – Graceful approximation when needed – Careful resource allocation and usage Relational, centralized (for now) 17 STREAM: Ongoing Work Algebra for streams Semantics for continuous queries Synopses and algorithmic issues Memory management issues Exploiting constraints on streams Approximation in query processing Distributed stream processing System development 18 STREAM: Related Work Amazon/Cougar (Cornell) – sensors Aurora (Brown/MIT) – sensor monitoring, dataflow Hancock (AT&T) – telecom streams Niagara (OGI/Wisconsin) – Internet XML databases OpenCQ (Georgia) – triggers, incr. view maintenance Stream (Stanford) – general-purpose DSMS Tapestry (Xerox) – pub/sub content-based filtering Telegraph (Berkeley) – adaptive engine for sensors Tribeca (Bellcore) – network monitoring 19 Peer-To-Peer Systems Multiple sites (at edge) Distributed resources Sites are autonomous (different owners) Sites are both clients and servers Sites have equal functionality 20 P2P Benefits Pooling available (inexpensive) resources High availability and fault-tolerance Self-organization 21 P2P Challenges Search – – – – – Query Expressiveness Comprehensiveness Topology Data Placement Message Routing Resource Management – fairness – load balancing Security & Privacy – – – – Anonymity Reputation Accountability Information Preservation – Information Quality – Trust – Denial of service attacks 22 Peers: Stanford Research New Architectures Performance Modeling and Optimization Security and Trust Distributed Resource Management Applications 23 Digital Library Project: Overview Payment Institutions Internet Libraries User Interfaces and Annotations Copyright Services Telnet HTTP Z39.50 Search Agents Query/Data Conversion Commercial Information Brokers & Providers 24 DigLib Projects: DLI1,DLI2 Resource Discovery Retrieving Information Interpreting Information Managing Information Sharing Information 25 DigLib: Resource Discovery Geographic Views (Tools to assist you in more systematically locating different types of information from a large and diverse number of information sources) 26 DigLib: Retrieving Information Information Tiling PalmPilot Infrastructure (PDA) Power Browsing (PDA) Query Translator SDLIP (Simple Digital Library Interoperability Protocol) Value Filtering WebBase 27 DigLib: Interpreting Information Murals (Tools to help a user interpret and organize search results) Web Clustering 28 DigLib: Managing Information Archival Repositories Archiving Movie InterBib (a tool for maintaining bibliographic information) Medical Transport Info PhotoBrowser 29 DigLib: Sharing Information Diet ORB (PDA, based on MICO) Digital Wallets Mobile Info Delivery Mobile Security Multicasting 30 DLI1 Projects (95-99) AHA ComMentor DLITE Google GLOSS FAB Grassroots Metadata Architecture RManage/FIRM SenseMaker SCAM Shopping Models, U-PAI SONIA STARTS WebWriter 31 TRAPP: Overview TRAPP: Tradeoff in Replication Precision and Performance A.k.a: Approximate Data Caching Project goal: investigating techniques to permit controlled and explicit relaxation of data precision in exchange for improved performance 32 TRAPP: Motivation Transactional consistency too expensive Even nontransactional propagation of every update still too expensive in many cases Solution: Approximate Caching – Exploit the fact that many applications do not require exact consistency – Avoid propagating insignificant updates – Trade cache precision for network load 33 Example: TRAPP Over Numeric Data Caches store intervals that bound the exact source values Sources refresh when value leaves interval cache [2, 5] [-1, 0.8] refreshes 3.9 source refreshes 0.2 source Query answers are intervals Precision constraints specify maximum width 34 Eg(cont’d): Querying in TRAPP For one-time aggregation queries: – Answers computed by combining approximate cached data and exact source data – At query-time: Find low-cost subset of sources to probe so final answer will have adequate precision – Algorithm determined by aggregation function » Some easy, some hard Query: X + Y (within 2) cache [2, 5] [-1, 0.8] probe X 3.9 source Y 0.2 source Answer: [2.9, 4.7] 35 TRAPP: Approximate Caching Two common scenarios: • Minimize bandwidth usage, precision fixed » TRAPP: caches store bounds as approximations » Queries select combination of cached & source data » Adaptive bound adjustment for good precision level • Bandwidth fixed, maximize precision » Best-Effort Synchronization: caches store stale copies » Refreshing based on priority scheduling » Global priority order via threshold » Adaptive threshold setting for flow control 36 TRAPP: Status Past work: focused on an approximate data caching architecture that permits finegrained control of the precisionperformance tradeoff for numerical data in data caching environments. Current work: applying the above techniques and others to more complex data such as Web pages. 37 CHAIMS: Overview CHAIMS: Compiling High-level Access Interfaces for Multi-site Software Objective: Investigate revolutionary approaches to largescale software composition. Approach: Develop and validate a composition-only language, a protocol for large, distributed, heterogeneous and autonomous megamodules, and a supporting system. Planned contributions: – – – – Asynchrony by splitting up CALL-statement. Hardware and software platform independence. Potential for multi-site dataflow optimization. Performance optimization by invocation scheduling. 38 CHAIMS: Overview Megaprogram for composition, written by domain programmer CHAIMS Megamodules CHAIMS system automates generation of client for distributed system Megamodules, provided by various megamodule providers 39 CHAIMS: Architecture Megamodule Provider Megaprogrammer writes wraps non-CHAIMS compliant megamodules information Megaprogram (in CHAIMS language) information adds information to Wrapper Templates CHAIMS Repository CHAIMS Compiler b generates CSRT (compiled megaprogram) a d e c MEGA modules Distribution System (CORBA, RMI…) 40 OntoAgents: Objective OntoAgents goal: establish an agent infrastructure on the WWW or WWW-like networks Such an agent infrastructure requires an information food chain: every part of the food chain provides information, which enables the existence of the next part. 41 OntoAgents: Architecture •End User •Ontology Articulation Toolkit •Ontology Construction Tool •Agents •Community Portal •Ontologies •Inference Engine •Webpage Annotation Tool •Annotated Webpages •Metadata Repository 42 Events: DB Seminars Academic Year Fall Winter Spring 2002/2003 no seminar in the fall quarter Database Seminar (CS545) Genome Databases (CS545G) 2001/2002 Past, Present, and Future of Database Technology Genome Databases Database Seminar to come 2000/2001 Interoperation, Databases and the Semantic Web Image Databases Databases and the Semantic Web 1999/2000 Ontologies, E-Commerce, n/a XML & Metadata 1998/1999 Digital Libraries Image Databases 1997/1998 Data Warehousing Image Databases 1996/1997 Fall Quarter 96 Image Databases Ontologies, ECommerce, XML & Metadata Internet and Databases Internet and Databases Spring Quarter 43 97 Events: Meetings Stanford Computer Science Forum - Annual Affiliates Meeting, Stanford, May 2003. SWiM (the Stream Winter Meeting): About 35 researchers in the data streams are came together at Stanford for SWiM, Jan. 2003. – Stream Team: A few data streams research groups held some informal get-togethers, 2002. Conference Talk: ACM SIGMOD/PODS, VLDB, ICDT, ICDE, ICDCS, CIDR 44 References: WebBase Junghoo Cho, Hector Garcia-Molina. "Parallel Crawlers," In Proceedings of the Eleventh World Wide Web Conference, May 2002. Taher Haveliwala, Aristides Gionis, etc. "Evaluating Strategies for Similarity Search on the Web," Proceedings of the Eleventh International World Wide Web Conference, May 2002. Taher Haveliwala. "Topic-Sensitive PageRank," Proceedings of the Eleventh International World Wide Web Conference, May 2002. 45 References: STREAM R. Motwani, J. Widom, etc. Query Processing, Resource Management, and Approximation in a Data Stream Management System In Proc. of the 2003 Conference on Innovative Data Systems Research (CIDR), January 2003 A. Arasu, B. Babcock. etc. STREAM: The Stanford Stream Data Manager In Proc. of the ACM Intl Conf. on Management of Data (SIGMOD 2003), June 2003 B. Babcock, S. Babu, etc. Models and Issues in Data Stream Systems Invited paper in Proc. of the 2002 ACM Symp. on Principles of Database Systems (PODS 2002), June 2002 46 References: Peers Neil Daswani, Hector Garcia-Molina and Beverly Yang. Open Problems in Data-Sharing Peer-to-Peer Systems, In ICDT, 2003. Hector Garcia-Molina. Peer-To-Peer Data Management, Key-notes In ICDE, 2002. Hrishikesh Deshpande, Mayank Bawa, and Hector Garcia-Molina. Streaming Live Media over a Peer-to-Peer Network. 47 References: TRAPP C. Olston and J. Widom. Best-Effort Cache Synchronization with Source Cooperation. ACM SIGMOD 2002 International Conference on Management of Data, Madison, Wisconsin, June 2002, pp. 73 -84. C. Olston, B. T. Loo and J. Widom. Adaptive Precision Setting for Cached Approximate Values. ACM SIGMOD 2001 International Conference on Management of Data, Santa Barbara , California, May 2001, pp. 355-366. 48 Useful Links Database Group: http://www-db.stanford.edu/ STREAM: http://www-db.stanford.edu/stream/ Peers: http://www-db.stanford.edu/peers/ DigLib: http://www-diglib.stanford.edu/ TRAPP: http://www-db.stanford.edu/trapp/ WebBase: http://wwwdiglib.stanford.edu/~testbed/doc2/WebBase/ 49