A Data Stream Management System for Network Traffic Management

Shivnath Babu, Stanford University
Lakshminarayanan Subramanian, Univ. California, Berkeley
Jennifer Widom, Stanford University

NRDM, Santa Barbara, CA, May 25, 2001

Network Traffic Management
• Large networks are growing complex and difficult to manage
  – Increasing demands, overprovisioning, hardware changes, manual configuration
  – Lack of information to configure network for effective usage
• Network traffic management is becoming an important part of the Internet infrastructure
  – Collect data
    • E.g., packet traces, network-flow data, SNMP data
  – Process data
    • E.g., compute link utilization, per-hop delays, traffic demands
  – Deploy mechanisms to control traffic
    • E.g., change routing parameters
• Data management forms a core part of traffic management

Traffic Management: Data Collection
• Many data sources
  – Packet and flow traces
  – Router forwarding tables and configuration data
  – SNMP data
  – Active measurements of packet delay, link utilization
• Data is collected continuously
  – Networks need to be up 24x7 for everything
  – Huge and fast-growing databases
• Many current traffic management systems store collected data in file systems or data warehouses

Traffic Management: Data Processing
• Sophisticated data processing is required
• Measuring link utilization
  – Aggregate packet traces
• Maintaining network topology
  – Join SNMP data from different network elements
• Deriving traffic demands
  – Join network flow traces, router forwarding tables and configuration data, and SNMP data
• Anomaly detection, traffic modeling, traffic prediction, and many others
• Most current traffic management systems process data using ad-hoc scripts or software toolkits

Challenge in Data Management: Online Data Processing
• Most current traffic management applications process data offline
  – Huge volume of data
  – Complex processing involved
• Offline processing is indeed appropriate for some applications
  – E.g., capacity planning, determining pricing plans
• Many traffic management applications need online processing
  – E.g., congestion cause detection, resource allocation for guaranteed QoS, detecting denial-of-service attacks, detecting Service-Level Agreement violations, admission control and traffic policing

Online Processing
• What’s wrong with using a file system and procedural processing?
  – Difficult to maintain and reuse (not a long-term solution)
• What’s wrong with using a Database Management System (DBMS)?
  – DBMS expects all data to be managed as persistent data sets
  – DBMS assumes “one-time” queries against stored and finite data

A Data Stream Management System (DSMS) for Online Processing
• Data Streams are the appropriate model for online processing
  – Data is changing frequently (often exclusively through insertions)
  – It is impractical to operate on the same data multiple times
• Continuous queries -- issued once and run “forever”
• Performance
  – Need continuous-query optimization
  – Need adaptive query-optimization
• A Data Stream Management System for traffic management
  – Idea: Support online processing with continuous queries over data streams

A Data Stream Management System for Online Processing (cont’d)
[Figure: architecture diagram. Labels: Applications based on online processing; Continuous Queries; Data Stream Management System; Data Management System; Streams: SNMP data, Packet traces, Flow traces, Router forwarding tables, Active measurements]

Continuous Query over a Single Data Stream
[Figure: a Data Stream of tuples <A,B>, <B,C>, <A,D> flows into continuous query Q; the answer is labeled "A?"]
• Many options with different ramifications
• Stream is infinite, append-only (e.g., packet traces)
  – Size of A is unbounded for a filter query -- cannot store A
  – Stream out A -- but self-join query requires unbounded intermediate state to compute A
  – Updates to tuples in A -- e.g., aggregation query
• Stream has updates, deletions (e.g., SNMP data)
  – Often require more intermediate state to compute A

Operator Architecture in a DSMS
• Stream
  – Append-only semantics: Result tuples that won’t change later
  – Update semantics: Updates to current result
• Store: Result tuples that could change later
• Scratch: Intermediate state to compute future results
• Throw: Unneeded data

Example Queries from Traffic Management
• Single packet trace input data stream (IP headers over a link)
• Continuous query 1: Link utilization (total #bytes sent over the link)
  – Store -- sum of packet lengths
  – Stream -- empty
  – Scratch -- empty
• Continuous query 2: Number of flows per protocol
[Figure: operator diagram. Labels: Packet Trace, Flow Identifier, Per-Protocol #flows counter, Stream, Scratch, Store]

Example Queries from Traffic Management (cont’d)
• Continuous query 3: Join packet traces collected from different points in the network to measure packet delays (or identify routes)
[Figure: Symmetric Hash-Join over Packet trace 1 and Packet trace 2. Labels: HT 1, HT 2, Scratch, Stream]
• Efficient intermediate state management
  – Intermediate state is unbounded theoretically
  – Use of constraints can reduce intermediate state
    • Can reclaim memory after each match
  – Approximate answers can further reduce intermediate state
    • Can you trade precision for state?

Example Queries from Traffic Management (cont’d)
• Continuous query 4: Identify top 5% (source IP address, destination IP address) Pairs with maximum bandwidth consumption over a link
• Non-trivial query over a stream
  – Number of distinct Pairs can vary
  – Bandwidth consumption of each Pair can vary
  – How much intermediate state is needed?
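
The intermediate-state question for continuous query 4 can be made concrete with a minimal sketch. It is not the talk's design: it assumes an exact, unbounded per-Pair byte counter, and the names `TopPairsQuery`, `on_packet`, and `current_answer` are hypothetical. The counter map plays the role of Scratch and grows with the number of distinct Pairs, which is exactly the cost a summarization technique would try to bound.

```python
from collections import defaultdict


class TopPairsQuery:
    """Exact, memory-unbounded sketch of continuous query 4 (illustrative only)."""

    def __init__(self, percent=5):
        self.percent = percent
        # Scratch: bytes seen so far per (source IP, destination IP) pair.
        # One entry per distinct pair, so this state is unbounded in general.
        self.bytes_by_pair = defaultdict(int)

    def on_packet(self, src, dst, length):
        """Consume one packet-trace tuple (an IP header observed on the link)."""
        self.bytes_by_pair[(src, dst)] += length

    def current_answer(self):
        """Store: the current top `percent`% of pairs by bytes (may change later)."""
        n = len(self.bytes_by_pair)
        if n == 0:
            return []
        # Integer ceiling of n * percent / 100, with at least one pair reported.
        k = max(1, (n * self.percent + 99) // 100)
        ranked = sorted(self.bytes_by_pair.items(),
                        key=lambda item: (-item[1], item[0]))
        return ranked[:k]


if __name__ == "__main__":
    query = TopPairsQuery(percent=5)
    for i in range(40):
        query.on_packet("10.0.0.%d" % i, "10.0.1.1", 100 + i)
    query.on_packet("10.0.0.3", "10.0.1.1", 1000)
    for pair, total in query.current_answer():
        print(pair, total)
```

Because both the number of distinct Pairs and each Pair's byte count keep changing, the answer is recomputed on demand here; an approximate version would replace the exact map with a bounded summary at the price of precision.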
[Figure: operator diagram. Labels: Packet trace, Count Distinct Pairs (Scratch), Bandwidth Consumption Of Pairs (Scratch), Top 5% Pairs (Store), Stream]

Further Challenges in Data Management: Distributed Stream Processing
• Data is collected from different points in a network
• Structure of an Internet Service Provider imposes restrictions
  – Core routers are sensitive (so are the network operators)
• Sending collected data to a central processing site is harmful
  – Additional load on the network
  – Hinders real-time processing
  – Won’t scale with the network and traffic
• Truly distributed processing is infeasible for many queries
  – Goal: minimize communication traffic
  – Trade communication traffic for precision

Example Queries from Traffic Management (cont’d)
• Continuous query 5: Identify top 5% of destination IP addresses with maximum bandwidth consumption (to detect denial-of-service attacks)
[Figure: three "CQ 5 local" operators, each sending a Stream to one "CQ 5 global" operator]
• Hierarchical processing structure could also be useful

Summary of Basic Problems and Techniques
• Continuous queries over data streams is a unique combination of:
  – Online processing
  – Storage constraints -- amount of memory available is bounded
    • Query result size may be unbounded
    • Intermediate state may be unbounded
• Relevant techniques
  – Online data structures (not build-and-throw)
  – Summarization: samples, histograms, wavelets, fractals
  – Adaptivity
    • Data characteristics
    • Flow rates
    • Amount of memory

Some Simplifying Assumptions
• In talk, but not necessarily in work
• Traffic management data is clean
  – Data is dirty: incomplete, inconsistent
  – Temporal uncertainties
  – Could be reduced as the importance of traffic management is realized
• Traffic management data is tuple-oriented
  – Often true
  – Implications for query language

Conclusions
• Traffic management requires efficient data management
• Many traffic management applications benefit from online data processing
• Case for a Data Stream Management System (DSMS)
  – Provides continuous queries over data streams for online processing
  – Many interesting research issues
  – Work is in progress
• Additional references
  – S. Babu and J. Widom. Continuous queries over data streams. http://dbpubs.stanford.edu/pub/2001-9
  – STREAM project homepage. http://www-db.stanford.edu/stream
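
Appendix sketch: continuous query 3 (joining packet traces from two points to measure packet delays) names a symmetric hash join with one hash table per input. The following is a minimal illustration under stated assumptions, not the STREAM project's implementation: the class and method names are hypothetical, packets are assumed to carry a unique identifier, and each packet is assumed to be observed at most once at each point. That last assumption is the kind of constraint the talk mentions, since it lets the operator reclaim memory after each match.

```python
class SymmetricHashJoin:
    """Illustrative symmetric hash join over two packet-trace streams."""

    def __init__(self):
        # Scratch: HT 1 and HT 2, mapping packet id -> observation timestamp.
        self.tables = ({}, {})

    def on_packet(self, side, packet_id, timestamp):
        """Process one tuple from trace `side` (0 or 1).

        Returns (packet_id, delay) to stream out when the packet has now been
        seen at both points, otherwise None. Delay is measured as the
        timestamp at point 1 minus the timestamp at point 0.
        """
        other = 1 - side
        if packet_id in self.tables[other]:
            # Match found: emit the result and reclaim the stored entry,
            # relying on the assumed once-per-point constraint.
            other_ts = self.tables[other].pop(packet_id)
            seen = {side: timestamp, other: other_ts}
            return (packet_id, seen[1] - seen[0])
        # No match yet: remember this observation for a future probe.
        self.tables[side][packet_id] = timestamp
        return None

    def scratch_size(self):
        """Number of unmatched observations currently held in Scratch."""
        return len(self.tables[0]) + len(self.tables[1])


if __name__ == "__main__":
    join = SymmetricHashJoin()
    join.on_packet(0, "p1", 10.0)
    join.on_packet(0, "p2", 11.0)
    print(join.on_packet(1, "p1", 10.5))
    print(join.scratch_size())
```

Without the once-per-point assumption, matched entries could not be dropped and Scratch would grow without bound, which is the theoretical case the talk points out; packets lost between the two points still accumulate here and would need a timeout or an approximate technique.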