Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Effective Variation Management for Pseudo Periodical Streams Lv-an Tang, Bin Cui, Hongyan Li, Gaoshan Miao, Dongqing Yang, Xinbiao Zhou School of EECS Peking University 2017/4/29 ACM SIGMOD 2007 Summary Introduction Related Work Variation Management for Pseudo Periodical Stream Experiments Conclusion •2017/4/29 ACM SIGMOD 2007 2/45 Pseudo Periodical Stream Pseudo Periodical Stream Data seems to repeat in a certain period Tiny variation exists between different periods Common in the domain of medical, seismology Typical stream variations: gradual evolutions rather than burst changes •2017/4/29 ACM SIGMOD 2007 3/45 An Example of Pseudo Periodical Stream The respiratory data repeats about every 3.2 seconds Reflects the evolution of the patient’s illness during five hours •2017/4/29 ACM SIGMOD 2007 4/45 Variation Management on Data Stream Data streams are widely applied in many domains Stock market analysis Road traffic control Medical signal processing Online variation management -- an important task When did the variation occur? (Detect variations) What is the variation ? / How does it change? (Describe variations) Why it turns to change in this way ? (Help understanding variations ) •2017/4/29 ACM SIGMOD 2007 5/45 Major Technical Challenges Value Type Traditional Algorithms: Discrete values (enumerative) or Time series (equidistant intervals) Data stream: consecutive real number with variable sampling frequencies Training Sets or Models Several training sets or predefined models Data stream evolves and the models may not work soon On the contrary, the system is required to generate such models as output •2017/4/29 ACM SIGMOD 2007 6/45 Major Technical Challenges II Variation Type Not only on abnormal values and distribution The structure in a period (shape) Noises: unpredictable, random In many applications, the variations are monitored manually Our contribution: proposing a new method named Pattern Growth Graph (PGG) to detect and store variations over pseudo periodical streams •2017/4/29 ACM SIGMOD 2007 7/45 Summary Introduction Related Work Variation Management for Pseudo Periodical Stream Experiments Conclusion •2017/4/29 ACM SIGMOD 2007 8/45 Data Stream Management Systems Data stream work can be loosely classified in two categories: DSMS and Online Data Mining Data Stream Management Systems (DSMS) Such as STREAM, Aurora, TelegraphCQ… Mainly focus on completing predefined SQL queries Not try to find the data features, or to monitor the variations •2017/4/29 ACM SIGMOD 2007 9/45 Online data mining Variation management is an important part of online data mining Three classes according to the algorithms Symbolic Approaches Mathematic Transformation Predefined Models Symbolic Approaches: Tarzan and SAX Space: Put the entire time series/data stream in memory Precision is not good for SAX •2017/4/29 ACM SIGMOD 2007 10/45 Mathematic Transformation Mathematic Transformation: Discrete Wavelet Transform (DWT) and Fast Fourier Transform (FFT) Require the data length fixed, as well as the sampling frequency (equidistant intervals) Haar wavelet transform can only perform on 2n data items, e.g, the data length must be 1024 or 2048 Predefined Models: Using Zigzag to detect events in financial streams (SIGMOD 04) Too domain specific Users can not provide such models in advance – actually they would like them as the output •2017/4/29 ACM SIGMOD 2007 11/45 Summary Introduction Related Work Variation Management for Pseudo Periodical Stream Experiments Conclusion •2017/4/29 ACM SIGMOD 2007 12/45 Task Specification by Respiration Stream Respriation 1400 1200 A B C D E F G H 1000 800 600 400 200 0 0 Time (S) 3 6 9 12 1 801 the stream 1601 2401 Variation : Online detect variation in 3201 one pass Wave: The smallest unit concerned is not a single point, but values in a certain period represented as a wave Alarms: F is actually the noise caused by body movements Summary: A summary with acceptable error bound is very helpful •2017/4/29 ACM SIGMOD 2007 13/45 System Framework Send Alarms 6 5 4 3 2 1 0 OutPut Pattern Evolutions Pattern Growth Graph Data Stream Wave Online Update Stream View Splitting Unmatched 6 5 4 3 2 1 0 Wave-Pattern Matching Partially Matched Full Matched Wave Stream New Pattern Grow Pattern Increase Frequency Online Variation Management •2017/4/29 ACM SIGMOD 2007 14/45 Wave Splitting I Variation: the difference from old data Detected by comparing the old data and coming stream Waste too much resources if comparing at each coming item Just comparing at each wave -- much more efficient How to divide the stream according to the data features? •2017/4/29 ACM SIGMOD 2007 15/45 Wave Splitting II Fixed length window will accumulate error Observation: The waves start and end at valley points that are smaller than a certain value •2017/4/29 ACM SIGMOD 2007 16/45 Upper Bound of Valley Points User define Update with the average value of past valley points •2017/4/29 N U b ( Vi ) / N i 1 ACM SIGMOD 2007 17/45 Valley Sections Valley Section: Approximate flat section represents the time interval between two events It is also worth to study as one part of the wave Take the last point of the section as the cut point •2017/4/29 ACM SIGMOD 2007 18/45 Two Problems in Online Matching I Problem 1: The data stream’s sampling frequency is usually high (>100Hz), waves should be simplified Problem 2: How to compare two waves with different time lengths, and may not have data at same time point? A: {(10,0.5), (20, 1.0), (25, 1.3), …(90, 50.5)} 22 data items B: {(11,0.5), (25, 1.2), (30, 1.7) … (87, 50)} 20 data items •2017/4/29 ACM SIGMOD 2007 19/45 Two Problems in Online Matching II Solution 1: Piecewise Liner Representation Make Problem 2 more difficult: patterns are simplified as segments, how to compare segments and points? •2017/4/29 ACM SIGMOD 2007 20/45 Wave-pattern Matching ECG Time (s) 0 0.5 1.0 Time (s) 1.5 0 0.5 1.0 1.5 In real applications, two sequences are assumed to match if their paths roughly coincide PLR segments record paths of old data Testing whether the incoming stream items are on the paths The intensity of variations can be determined by the number of matching items •2017/4/29 ACM SIGMOD 2007 21/45 Record the Patterns Observation: Many patterns just have few partial segments changed Most stream variations are gradual evolutions rather than burst mutations Recording by a simple list not only ignores their relationship but also causes storage redundancy Utilize the similarity among patterns and reuse the unchanged parts Pattern Growth Graph (PGG) is designed to store patterns and the variation history •2017/4/29 ACM SIGMOD 2007 22/45 Pattern Growth Graph Implemented as bi-directional linked list Only generate new segments on the un-matched data New patterns seems to grow from the old one 1.5 Wave Pattern 1 Pattern 2 2 1 3 1 4 5 6 1' 7 3' 2' 4' 8 0.5 0 1 51 1' Pattern 2 (Growth Pattern) Pattern 1 ( Base Pattern) •2017/4/29 101 151 2' 3' 4' End Start End 1 2 3 4 5 6 7 8 ACM SIGMOD 2007 23/45 Construct Full Wave-pattern New Problem: Wave-Pattern matching needs full pattern to compare, while PGG only stores the new parts Fortunately we can construct the full pattern by propagating the pointers Pattern 3 1" 1' Pattern 2 End 2" 2' 3' Collision Pattern 1 Start End 1 2 3 4 模式 5 6 7 8 9 左1 右1 左2 右2 Step 0 1” 2” 1’ 2’ 9 End Step 1 1’ 1” 2’ 9 2” 1 3’ 8 \ Step 2 1 1’ 1” 2’3’ 8 9 2” Start Final 1 1’ 1” 2’3’ 8 9 2” \ •2017/4/29 8 ( Collision! ) 7 \ \ ACM SIGMOD 2007 \ \ 24/45 Problems for PGG size Waves in data stream: N PGG size: k Time complexity of PGG based matching algorithm is O (k*n) In the worst case, each incoming wave introduces a new pattern: overall time cost is O (n2) When PGG becomes larger, the algorithm is timeconsuming PGG is not allowed to take “forgetting functions” Hard to delete in PGG Some uncommon patterns may have higher domain significance •2017/4/29 ACM SIGMOD 2007 25/45 Rank the Patterns Observation: The most frequent pattern and its similar patterns have the highest possibility to match the incoming wave Matching probability factor The patterns with smaller probability are not deleted, but have lower priority to be compared When one pattern get a match, system not only increase its own rank, also its “families” •2017/4/29 ACM SIGMOD 2007 26/45 Reconstruct the Stream View with PGG Queries on traditional DSMS predefined, hard to conduct when data items passed by Answer “the patient's ECG in the past five hours” Record all patterns’ occurrence time in PGG Reconstruct the stream view with PGG patterns Only consumes about 4% storage space of the original stream, but can provide an approximate stream view within 5% relative error bound •2017/4/29 ACM SIGMOD 2007 27/45 Track Pattern Evolution To answer “Why will it change in this way ?” User selects an interesting pattern, PGG can track the source of it •2017/4/29 ACM SIGMOD 2007 28/45 False Alarm A successful system needs to reduce the false alarms introduced by noises The major problem: noises are caused by many sources, they have various styles and are hard to be modeled •2017/4/29 ACM SIGMOD 2007 29/45 Noise Reorganization A short cut: considering the pattern’s evolution history Some strategies to reduce false alarms on medical stream: Unusual values in growth patterns: the patients’ condition has been exacerbated -- Warning New pattern, it matches successive waves: the underlying pathology mechanism might have some fundamental changes -- Warning A series of new patterns and they all un-match the previous/following waves -- suspected as noises •2017/4/29 ACM SIGMOD 2007 30/45 System Framework Send Alarms 6 5 4 3 2 1 0 OutPut Pattern Evolutions Pattern Growth Graph Data Stream Wave Online Update Stream View Splitting Unmatched 6 5 4 3 2 1 0 Wave-Pattern Matching Partially Matched Full Matched Wave Stream New Pattern Grow Pattern Increase Frequency Online Variation Management •2017/4/29 ACM SIGMOD 2007 31/45 Summary Introduction Related Work Variation Management for Pseudo Periodical Stream Experiments Conclusion •2017/4/29 ACM SIGMOD 2007 32/45 Experimental Setup Data Set Medical streams: Six real pathology signals including ECG, respiration... (over 25,000,000 data points) Earthquake waves: The pacific earthquake wave data from the NGA project. (100,000 data points) Sunspot data: All the sunspot records between the year 1850 and 2001 (55,000 data points) Environment: Intel Pentium 4 3.0GHz CPU with 1GB RAM, Windows XP Professional, JDK 1.5.0… •2017/4/29 ACM SIGMOD 2007 33/45 Effect of Rank Function At the beginning, the effect is insignificant. After three million data points, the naive algorithm’s performance decreases rapidly In the end, the rank algorithm outperforms by about 300% •2017/4/29 ACM SIGMOD 2007 34/45 Reconstruct the Stream View ECG data stream (more than 10M data items) can be represented with only 420 patterns The amazing compressing result is achieved due to two factors The PLR simplify can reduce the size of patterns to about 20% PGG further reduces it to about 3.31% by compressing the repeating and similar patterns (Patterns only need 0.3%, the rest 3% stores the occurrence time of the patterns) •2017/4/29 ACM SIGMOD 2007 35/45 Compared with Other Methods Compared PGG with SAX (symbolic approaches), Discrete Haar Wavelet Transformation (mathematic transformation) and Zigzag (predefined models) The processing efficiency is average 60K—70K items/sec Much higher than real application needs •2017/4/29 ACM SIGMOD 2007 36/45 Variation Detection & Noise Recognition Two important measurements: Sensitivity (High Positive Rate): The algorithm send alarms at meaningful variations Selectivity (Low Negative Rate): The algorithm does not send false alarms on noises The two measurements are conflict Increasing sensitivity to find more variations will inevitably cause more false alarms In a medical environment, sensitivity is much more important -- missing a meaningful variation may cost the patient’s life •2017/4/29 ACM SIGMOD 2007 37/45 Best Results of Sensitivity on Respiration Stream Zigzag sends false alarm at almost every noise section DWT and SAX nearly cannot distinguish real variations from noises •2017/4/29 ACM SIGMOD 2007 38/45 Results of Noise Recognition on Other Stream For other stream, we take precision as the main measurement PGG performs accurately and stably Zigzag is volatile with different datasets: Good on three blood pressure signals (ABP, CVP and ICP, meaningful variations are outliners) Poorly on PLETH (meaningful variations are of inner structures) •2017/4/29 ACM SIGMOD 2007 39/45 Discussion Zigzag: focuses on extreme data points, strongly influenced by outliers SAX: good at finding in a long period using frequency statistics -- more suitable for time series DWT: only effective for signals with strict periods With the effective data structure, PGG discovers and records as much features of the data stream as possible The recorded information helps distinguish between meaningful variations and noises •2017/4/29 ACM SIGMOD 2007 40/45 Summary Introduction Related Work Variation Management for Pseudo Periodical Stream Experiments Conclusion •2017/4/29 ACM SIGMOD 2007 41/45 Conclusion Streams are split as waves and represented by PLR patterns Detect variations by online wave-pattern matching Pattern Growth Graph stores the variation history Reconstruct the stream view with high accuracy Effectively distinguish meaningful variations from noises •2017/4/29 ACM SIGMOD 2007 42/45 The System Interface •2017/4/29 ACM SIGMOD 2007 43/45 Future Work Extend PGG to multiple streams Implement the PGG method in other application domains such as weather forecasting and financial analysis Combine with other methods, like Zigzag… •2017/4/29 ACM SIGMOD 2007 44/45 Thank You Very Much! Please give me questions and suggestions •2017/4/29 ACM SIGMOD 2007 45/45