Mining from Data Streams
- Competition between Quality and Speed
Adapted from:
Wei-Guang Teng (鄧維光) and
S. Muthukrishnan’s presentations
CS 636 - Adv. Data Mining (Wi 04/05)
Streaming: Finding Missing Numbers
Paul permutes numbers 1…n, and
shows all but one to Carole, in the
permuted order, one after the other.
Carole must find the missing number.
Carole cannot remember all the numbers
she has been shown.
Streaming: Finding Missing Numbers
Carole accumulates the sum of all the numbers
that she has been shown. At the end she can
subtract this sum from
n(n+1)/2
Analysis
Takes O(log n) bits to store the partial sum
Performs one addition each time a new number is
shown (takes O(log n) time per number)
Performs one subtraction at the end (takes O(log n) time)
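Carole's strategy fits in a few lines. An illustrative Python sketch (the function name is mine, not from the slides):

```python
def find_missing(stream, n):
    """Find the number missing from a permuted stream of 1..n with one
    value withheld, keeping only an O(log n)-bit partial sum."""
    partial_sum = 0
    for x in stream:                         # one addition per number shown
        partial_sum += x
    return n * (n + 1) // 2 - partial_sum    # one subtraction at the end

# Paul permutes 1..6 and withholds 4:
print(find_missing([3, 6, 1, 5, 2], 6))  # -> 4
```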
Data Streams (1)
Traditional DBMS – data stored in finite,
persistent data sets
New Applications – data input as continuous,
ordered data streams
Network monitoring and traffic engineering
Telecom call detail records (CDR)
ATM operations in banks
Sensor networks
Web logs and click-streams
Transactions in retail chains
Manufacturing processes
Data Streams (2)
Definition
Application Characteristics
Continuous, unbounded, rapid, time-varying
streams of data elements
Massive volumes of data (can be several terabytes)
Records arrive at a rapid rate
Goal
Mine patterns, process queries and compute
statistics on data streams in real-time
Data Stream Algorithms
Streaming involves
Small number of passes over data. (Typically 1?)
Sublinear space (sublinear in the universe or
number of stream items?)
Sublinear time for computing (?)
Similar to dynamic, online, approximation or
randomized algorithms, but with more
constraints.
Data Streams: Analysis Model
[Diagram: the user/application submits a query/mining target to the stream processing engine, which consumes the input stream using scratch space (memory and/or disk) and returns results.]
Motivation
3 Billion Telephone Calls in US each day
30 Billion emails daily, 1 Billion SMS, IMs
Scientific data: NASA's observation satellites
generate billions of readings each day.
IP Network Traffic: up to 1 Billion packets per
hour per router. Each ISP has many hundreds
of routers!
Compare to human scale data: "only" 1 billion
worldwide credit card transactions per month.
Network Management Application
Monitoring and configuring network hardware
and software to ensure smooth operation
[Diagram: the Network Operations Center takes measurements from the network and receives alarms from it.]
IP Network Measurement Data
IP session data
Source     Destination  Duration  Bytes  Protocol
10.1.0.2   16.2.3.7     12        20K    http
18.6.7.1   12.4.0.3     16        24K    http
13.9.4.3   11.6.8.2     15        20K    http
15.2.2.9   17.1.2.1     19        40K    http
12.4.3.8   14.8.7.4     26        58K    http
10.5.1.3   13.0.0.1     27        100K   ftp
11.1.0.6   10.3.4.5     32        300K   ftp
19.7.1.2   16.5.5.8     18        80K    ftp
AT&T collects 100 GBs of NetFlow data each
day!
Network Data Processing
Traffic estimation/analysis
List the top 100 IP addresses in terms of traffic
What is the average duration of an IP session?
Fraud detection
Identify all sessions whose duration was more
than twice the normal
Security/Denial of Service
List all IP addresses that have witnessed a sudden
spike in traffic
Identify IP addresses involved in more than 1000
sessions
Challenges in Network Apps.
1 link with 2 Gb/s. Say avg packet size is 50
bytes.
Number of pkts/sec = 5 Million.
Time per pkt = 0.2 µsec.
If we capture packet headers per packet:
src/dest IP, time, number of bytes, etc. — at least 10
bytes per packet. Space per second is 50 MB. Space
per day is 4.5 TB per link. ISPs have hundreds of
links.
Data Streaming Models
Input data: a1, a2, a3, …
Input stream describes a signal A[i], a
one-dimensional function (value vs.
index)
There is a mapping from the input stream
to the signal
This is the data stream model
Time-Series Model
The ai’s directly form the A[i]’s: the i-th item of the stream is the value A[i].
Cash-Register Model
ai’s are increments to A[j]
ai = (j, Ii), with Ii >= 0
Ai[j] = Ai-1[j] + Ii
Turnstile Model
ai’s are updates to A[j]
ai= (j, Ui)
Ai[j] = Ai-1[j] + Ui
Strict turnstile model
Ai[j] >= 0 at all i
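The three models differ only in how a stream item updates the signal. A hypothetical sketch (using a dict for A; this routine is mine, not from the slides):

```python
from collections import defaultdict

def apply_update(A, item, model):
    """Apply one stream item to the signal A (a dict: index j -> A[j])
    under the time-series, cash-register, or turnstile model."""
    if model == "time-series":
        i, value = item            # the i-th item directly gives A[i]
        A[i] = value
    elif model == "cash-register":
        j, inc = item              # a_i = (j, I_i) with I_i >= 0
        assert inc >= 0
        A[j] += inc                # A_i[j] = A_{i-1}[j] + I_i
    else:                          # turnstile: updates may be negative
        j, upd = item              # a_i = (j, U_i)
        A[j] += upd                # A_i[j] = A_{i-1}[j] + U_i
    return A

A = apply_update(defaultdict(int), (7, 3), "cash-register")
A = apply_update(A, (7, -1), "turnstile")
print(A[7])  # -> 2
```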
Data Stream Algorithms
Compute various functions on the signal
A at various times
Performance measures
Processing time per item ai in the stream
Space used to store the data structure for
A at time t
Time needed to compute the functions on
A
Outline
Introduction & Motivation
Issues & Techs. of Processing Data Streams
Sampling
Histogram
Wavelet
Data Stream Management Systems
Example Algorithms for Frequency Counting
Lossy Counting
Sticky Sampling
Data Stream Algorithms
Stream Processing Requirements
Single pass: each record is examined at most once
Bounded storage: limited memory for storing
synopsis
Real-time: per record processing time (to maintain
synopsis) must be low
Generally, algorithms compute approximate
answers
Difficult to compute answers accurately with
limited memory
Approximation in Data Streams
Approximate Answers - Deterministic Bounds
Algorithms only compute an approximate answer,
but bounds on error
Approximate Answers - Probabilistic Bounds
Algorithms compute an approximate answer with
high probability
With probability at least 1 - δ, the computed answer is
within a factor of ε of the actual answer
Sliding Window Approximation
011000011100000101010
Why?
Approximation technique for bounded memory
Natural in applications (emphasizes recent data)
Well-specified and deterministic semantics
Issues
Extend relational algebra, SQL, query optimization
Algorithmic work
Timestamps?
Timestamps
Explicit
  Injected by data source
  Models real-world event represented by tuple
  Tuples may be out-of-order, but if near-ordered can reorder with small buffers
Implicit
  Introduced as special field by DSMS
  Arrival time in system
  Enables order-based querying and sliding windows
Issues
Distributed streams?
Composite tuples created by DSMS?
Time
Easiest: global system clock
Stream elements and relation updates
timestamped on entry to system
Application-defined time
Streams and relation updates contain application
timestamps, may be out of order
Application generates “heartbeat”
Or deduce heartbeat from parameters: stream skew,
scrambling, latency, and clock progress
Query results in application time
Sampling: Basics
A small random sample S of the data often well represents all the data
Example: select agg from R where R.e is odd (n=12)
Data stream: 9 3 5 2 7 1 6 5 8 4 9 1
Sample S: 9 5 1 8
If agg is avg, return average of odd elements in S
answer: 5
If agg is count, return the average over all elements e in S of:
n if e is odd
0 if e is even
answer: 12 * 3/4 = 9
Unbiased!
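Both estimators on this slide are easy to check in code. An illustrative sketch (the function names are mine):

```python
def estimate_count_odd(sample, n):
    """Unbiased estimate of the number of odd elements in a stream of
    length n: each sampled element contributes n if odd, 0 if even,
    averaged over the sample."""
    return sum(n if e % 2 == 1 else 0 for e in sample) / len(sample)

def estimate_avg_odd(sample):
    """Estimate of the average of the odd elements: simply average the
    odd elements of the sample."""
    odd = [e for e in sample if e % 2 == 1]
    return sum(odd) / len(odd)

sample, n = [9, 5, 1, 8], 12
print(estimate_count_odd(sample, n))  # -> 9.0  (12 * 3/4)
print(estimate_avg_odd(sample))       # -> 5.0
```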
Histograms
Histograms approximate the frequency
distribution of element values in a stream
A histogram (typically) consists of
A partitioning of element domain values into
buckets
A count CB per bucket B (the number of
elements in B)
Long history of use for selectivity estimation
within a query optimizer ([Koo80], [PSC84], etc)
Types of Histograms
Equi-Depth Histograms
Select buckets such that counts per bucket are equal
[Figure: equi-depth histogram — count per bucket over domain values 1..20]
V-Optimal Histograms [IP95] [JKM98]
Select buckets to minimize frequency variance within buckets
minimize ΣB Σv∈B (fv - CB/VB)²   (VB = number of distinct values in bucket B)
[Figure: V-optimal histogram — count per bucket over domain values 1..20]
Answering Queries using Histograms
[IP99]
(Implicitly) map the histogram back to an
approximate relation, & apply the query to the
approximate relation
Example: select count(*) from R where 4<=R.e<=15
[Figure: bucket counts spread evenly among domain values 1..20, with the range 4 <= R.e <= 15 highlighted]
answer: 3.5 * CB
For equi-depth histograms, maximum error: 2 * CB
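The uniform-spread estimate can be sketched directly. The bucket boundaries below are hypothetical, chosen only to illustrate the computation (they are not the slide's exact figure):

```python
def estimate_range_count(buckets, lo, hi):
    """Estimate count(lo <= R.e <= hi) from a histogram given as
    (bucket_lo, bucket_hi, count) triples over an integer domain,
    spreading each bucket's count evenly among its domain values."""
    total = 0.0
    for b_lo, b_hi, count in buckets:
        width = b_hi - b_lo + 1                             # values in bucket
        overlap = max(0, min(hi, b_hi) - max(lo, b_lo) + 1) # values also in query
        total += count * overlap / width
    return total

# Hypothetical equi-depth histogram over domain 1..20, C_B = 4 per bucket:
buckets = [(1, 5, 4), (6, 10, 4), (11, 15, 4), (16, 20, 4)]
print(round(estimate_range_count(buckets, 4, 15), 1))  # -> 9.6
```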
Wavelet Basics
For hierarchical decomposition of functions/signals
Haar wavelets
Simplest wavelet basis => Recursive pairwise averaging and
differencing at different resolutions
Resolution  Averages                   Detail Coefficients
3           [2, 2, 0, 2, 3, 5, 4, 4]   ----
2           [2, 1, 4, 4]               [0, -1, -1, 0]
1           [1.5, 4]                   [0.5, 0]
0           [2.75]                     [-1.25]
Haar wavelet decomposition: [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]
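The recursive pairwise averaging and differencing can be written directly. An illustrative sketch that reproduces the decomposition above:

```python
def haar_decompose(data):
    """One-dimensional Haar wavelet decomposition by recursive pairwise
    averaging and differencing; len(data) must be a power of two."""
    coeffs = []
    while len(data) > 1:
        averages = [(a + b) / 2 for a, b in zip(data[0::2], data[1::2])]
        details  = [(a - b) / 2 for a, b in zip(data[0::2], data[1::2])]
        coeffs = details + coeffs   # finer-resolution details go later
        data = averages
    return data + coeffs            # [overall average, coarse..fine details]

print(haar_decompose([2, 2, 0, 2, 3, 5, 4, 4]))
# -> [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```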
Haar Wavelet Coefficients
Hierarchical decomposition structure (“error tree”)
[Figure: error tree for the decomposition [2.75, -1.25, 0.5, 0, 0, -1, -1, 0] — the root holds the overall average 2.75; each internal node holds a detail coefficient, contributing with a "+" sign along its left subtree and a "-" sign along its right subtree; the leaves are the original frequency distribution 2, 2, 0, 2, 3, 5, 4, 4. A companion panel shows each coefficient's "support", the range of leaves it affects.]
Wavelet-based Histograms [MVW98]
Problem: range-query selectivity estimation
Key idea: use a compact subset of Haar
wavelet coefficients for approximating
frequency distribution
Steps
Compute cumulative frequency distribution C
Compute Haar wavelet transform of C
Coefficient thresholding: only m<<n coefficients
can be kept
Using Wavelet-based Histograms
Selectivity estimation: count(a <= R.e <= b) = C’[b] - C’[a-1]
C’ is the (approximate) “reconstructed” cumulative distribution
Time: O(min{m, logN}), where m = size of wavelet synopsis
(number of coefficients), N= size of domain
At most logN + 1 coefficients are needed to reconstruct any C’ value
Empirical results over synthetic data show improvements over
random sampling and histograms
Data Streaming Systems
Low-level, application-specific approach
DBMS approach
Generic data stream management
systems
DBMS Vs. DSMS: Meta-Questions
Killer-apps
Motivation
Application stream rates exceed DBMS capacity?
Can DSMS handle high rates anyway?
Need for general-purpose DSMS?
Not ad-hoc, application-specific systems?
Non-Trivial
DSMS = merely DBMS with enhanced support for
triggers, temporal constructs, data rate mgmt?
DBMS versus DSMS
DBMS:
  Persistent relations
  One-time queries
  Random access
  Access plan determined by query processor and physical DB design
DSMS:
  Transient streams (and persistent relations)
  Continuous queries
  Sequential access
  Unpredictable data characteristics and arrival patterns
(Simplified) Big Picture of DSMS
[Diagram: applications register queries with the DSMS; input streams arrive continuously; the DSMS maintains a scratch store, an archive, and stored relations, and delivers both stored and streamed results.]
(Simplified) Network Monitoring
[Diagram: monitoring queries are registered with the DSMS; network measurements and packet traces stream in; the DSMS uses lookup tables, a scratch store, and an archive, and outputs intrusion warnings and online performance metrics.]
Using Conventional DBMS
Data streams as relation inserts, continuous
queries as triggers or materialized views
Problems with this approach
Inserts are typically batched, high overhead
Expressiveness: simple conditions (triggers), no
built-in notion of sequence (views)
No notion of approximation, resource allocation
Current systems don’t scale to large # of triggers
Views don’t provide streamed results
Query 1 (self-join)
Find all outgoing calls longer than 2 minutes
SELECT O1.call_ID, O1.caller
FROM   Outgoing O1, Outgoing O2
WHERE  (O2.time - O1.time > 2
        AND O1.call_ID = O2.call_ID
        AND O1.event = start
        AND O2.event = end)
Result requires unbounded storage
Can provide result as data stream
Can output after 2 min, without seeing end
Query 2 (join)
Pair up callers and callees
SELECT O.caller, I.callee
FROM
Outgoing O, Incoming I
WHERE O.call_ID = I.call_ID
Can still provide result as data stream
Requires unbounded temporary storage …
… unless streams are near-synchronized
Query 3 (group-by aggregation)
Total connection time for each caller
SELECT   O1.caller, sum(O2.time - O1.time)
FROM     Outgoing O1, Outgoing O2
WHERE    (O1.call_ID = O2.call_ID
          AND O1.event = start
          AND O2.event = end)
GROUP BY O1.caller
Cannot provide result in (append-only)
stream
Output updates?
Provide current value on demand?
Data Model
Append-only
Call records
Updates
Stock tickers
Deletes
Transactional data
Meta-Data
Control signals, punctuations
System Internals – probably need all above
Related Database Technology
DSMS must use these ideas, but none is a substitute
Triggers, Materialized Views in Conventional DBMS
Main-Memory Databases
Sequence/Temporal/Timeseries Databases
Realtime Databases
Adaptive, Online, Partial Results
Novelty in DSMS
Semantics: input ordering, streaming output, …
State: cannot store unending streams, yet need
history
Performance: rate, variability, imprecision, …
Outline
Introduction & Motivation
Data Stream Management System
Issues & Techs. of Processing Data Streams
Sampling
Histogram
Wavelet
Example Algorithms for Frequency Counting
Lossy Counting
Sticky Sampling
Problem of Frequency Counts
Stream
Identify all elements whose current frequency
exceeds support threshold s = 0.1%
Algorithm 1: Lossy Counting
Step 1: Divide the stream into “windows”
Window 1
Window 2
Window 3
Is window size a function of support s? Will fix later…
Lossy Counting in Action ...
[Figure: frequency counts, initially empty, are incremented with the first window's elements.]
At window boundary, decrement all counters by 1
Lossy Counting (cont’d)
[Figure: the surviving frequency counts are incremented with the next window's elements.]
At window boundary, decrement all counters by 1
Error Analysis
How much do we undercount?
If    current size of stream = N
and   window-size = 1/ε
then  frequency error <= #windows = εN
Rule of thumb:
Set ε = 10% of support s
Example:
Given support frequency s = 1%,
set error frequency
ε = 0.1%
Analysis of Lossy Counting
Output
Elements with counter values exceeding sN – εN
Approximation guarantees
Frequencies underestimated by at most εN
No false negatives
False positives have true frequency at least sN – εN
How many counters do we need?
Worst case: 1/ε log(εN) counters
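A minimal sketch of Lossy Counting as described on these slides (increment per element, decrement all counters at each window boundary); the helper names are mine:

```python
import math

def lossy_counting(stream, epsilon):
    """One-pass Lossy Counting with window width ceil(1/epsilon): increment
    a counter per element; at each window boundary decrement every counter
    by 1 and drop counters that reach zero.  Counts are underestimated by
    at most epsilon * N after N elements."""
    window = math.ceil(1 / epsilon)
    counts = {}
    for i, x in enumerate(stream, start=1):
        counts[x] = counts.get(x, 0) + 1
        if i % window == 0:                       # window boundary
            for key in list(counts):
                counts[key] -= 1
                if counts[key] == 0:
                    del counts[key]
    return counts

def frequent_elements(counts, s, epsilon, n):
    """Report elements whose counter exceeds (s - epsilon) * n."""
    return {x for x, c in counts.items() if c > (s - epsilon) * n}

stream = ["a", "b", "a", "c", "a", "a", "b", "a", "d", "a"] * 5
counts = lossy_counting(stream, epsilon=0.1)      # window size 10
print(frequent_elements(counts, s=0.5, epsilon=0.1, n=len(stream)))  # -> {'a'}
```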
Algorithm 2: Sticky Sampling
Stream: 28  31  41  23  35  19  34  15  30
Create counters by sampling
Maintain exact counts thereafter
What rate should we sample?
Sticky Sampling (cont’d)
For finite stream of length N
Sampling rate = 2/(Nε) log(1/(sδ))
(δ = probability of failure)
Output
Elements with counter values exceeding sN – εN
Same error guarantees
as Lossy Counting
but probabilistic!
Same Rule of thumb:
Set ε = 10% of support s
Example:
Given support threshold s = 1%,
set error threshold
ε = 0.1%
set failure probability δ = 0.01%
Sampling rate?
Finite stream of length N
Sampling rate: 2/(Nε) log(1/(sδ))
Infinite stream with unknown N
Gradually adjust sampling rate
In either case,
Expected number of counters = 2/ε log(1/(sδ))
Independent of N!
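A minimal sketch of Sticky Sampling for a finite stream of known length, using the sampling rate above; the function name and the injectable `rng` hook are mine:

```python
import math
import random

def sticky_sampling(stream, n, s, epsilon, delta, rng=random.random):
    """Sticky Sampling for a finite stream of known length n: an element
    with no counter is sampled with probability r; once a counter exists,
    later occurrences are counted exactly.  Outputs elements whose counter
    exceeds (s - epsilon) * n."""
    r = min(1.0, 2.0 / (n * epsilon) * math.log(1.0 / (s * delta)))
    counts = {}
    for x in stream:
        if x in counts:
            counts[x] += 1          # exact counting after the first sample
        elif rng() < r:
            counts[x] = 1           # counter created by sampling
    return {x for x, c in counts.items() if c > (s - epsilon) * n}

# With rng forced to 0 every element gets a counter, so the output is exact:
stream = ["a"] * 60 + ["b"] * 40
print(sticky_sampling(stream, n=100, s=0.5, epsilon=0.05, delta=0.01,
                      rng=lambda: 0.0))  # -> {'a'}
```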
New Directions
Functional approximation theory
Data structures
Computational geometry
Graph theory
Databases
Hardware
Streaming models
Data stream quality monitoring
References (1)
[AGM99] N. Alon, P.B. Gibbons, Y. Matias, M. Szegedy. Tracking Join and Self-Join Sizes in
Limited Storage. ACM PODS, 1999.
[AMS96] N. Alon, Y. Matias, M. Szegedy. The space complexity of approximating the frequency
moments. ACM STOC, 1996.
[CIK02] G. Cormode, P. Indyk, N. Koudas, S. Muthukrishnan. Fast mining of tabular data via
approximate distance computations. IEEE ICDE, 2002.
[CMN98] S. Chaudhuri, R. Motwani, and V. Narasayya. “Random Sampling for Histogram
Construction: How much is enough?”. ACM SIGMOD 1998.
[CDI02] G. Cormode, M. Datar, P. Indyk, S. Muthukrishnan. Comparing Data Streams Using
Hamming Norms. VLDB, 2002.
[DGG02] A. Dobra, M. Garofalakis, J. Gehrke, R. Rastogi. Processing Complex Aggregate Queries
over Data Streams. ACM SIGMOD, 2002.
[DJM02] T. Dasu, T. Johnson, S. Muthukrishnan, V. Shkapenyuk. Mining database structure or
how to build a data quality browser. ACM SIGMOD, 2002.
[DH00] P. Domingos and G. Hulten. Mining high-speed data streams. ACM SIGKDD, 2000.
[EKSWX98] M. Ester, H.-P. Kriegel, J. Sander, M. Wimmer, and X. Xu. Incremental Clustering for
Mining in a Data Warehousing Environment. VLDB 1998.
[FKS99] J. Feigenbaum, S. Kannan, M. Strauss, M. Viswanathan. An approximate L1-difference
algorithm for massive data streams. IEEE FOCS, 1999.
[FM85] P. Flajolet, G.N. Martin. “Probabilistic Counting Algorithms for Data Base Applications”.
JCSS 31(2), 1985
References (2)
[Gib01] P. Gibbons. Distinct sampling for highly-accurate answers to distinct values queries and
event reports, VLDB 2001.
[GGI02] A.C. Gilbert, S. Guha, P. Indyk, Y. Kotidis, S. Muthukrishnan, M. Strauss. Fast, small-space algorithms for approximate histogram maintenance. ACM STOC, 2002.
[GGRL99] J. Gehrke, V. Ganti, R. Ramakrishnan, and W.-Y. Loh: BOAT-Optimistic Decision Tree
Construction. SIGMOD 1999.
[GK01] M. Greenwald and S. Khanna. “Space-Efficient Online Computation of Quantile
Summaries”. ACM SIGMOD 2001.
[GKM01] A.C. Gilbert, Y. Kotidis, S. Muthukrishnan, M. Strauss. Surfing Wavelets on Streams:
One Pass Summaries for Approximate Aggregate Queries. VLDB 2001.
[GKM02] A.C. Gilbert, Y. Kotidis, S. Muthukrishnan, M. Strauss. “How to Summarize the Universe:
Dynamic Maintenance of Quantiles”. VLDB 2002.
[GKS01b] S. Guha, N. Koudas, and K. Shim. “Data Streams and Histograms”. ACM STOC 2001.
[GM98] P. B. Gibbons and Y. Matias. “New Sampling-Based Summary Statistics for Improving
Approximate Query Answers”. ACM SIGMOD 1998.
[GMP97] P. B. Gibbons, Y. Matias, and V. Poosala. “Fast Incremental Maintenance of
Approximate Histograms”. VLDB 1997.
[GT01] P.B. Gibbons, S. Tirthapura. “Estimating Simple Functions on the Union of Data Streams”.
ACM SPAA, 2001.
References (3)
[HHW97] J. M. Hellerstein, P. J. Haas, and H. J. Wang. “Online Aggregation”. ACM SIGMOD 1997.
[HSD01] G. Hulten, L. Spencer, and P. Domingos. Mining Time-Changing Data Streams. ACM
SIGKDD 2001.
[IKM00] P. Indyk, N. Koudas, S. Muthukrishnan. Identifying representative trends in massive
time series data sets using sketches. VLDB, 2000.
[Ind00] P. Indyk. Stable Distributions, Pseudorandom Generators, Embeddings, and Data Stream
Computation. IEEE FOCS, 2000.
[IP95] Y. Ioannidis and V. Poosala. “Balancing Histogram Optimality and Practicality for Query
Result Size Estimation”. ACM SIGMOD 1995.
[IP99] Y.E. Ioannidis and V. Poosala. “Histogram-Based Approximation of Set-Valued Query
Answers”. VLDB 1999.
[JKM98] H. V. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, K. Sevcik, and T. Suel.
“Optimal Histograms with Quality Guarantees”. VLDB 1998.
[JL84] W.B. Johnson, J. Lindenstrauss. Extensions of Lipschitz Mappings into a Hilbert Space.
Contemporary Mathematics, 26, 1984.
[Koo80] R. P. Kooi. “The Optimization of Queries in Relational Databases”. PhD thesis, Case
Western Reserve University, 1980.
References (4)
[MRL98] G.S. Manku, S. Rajagopalan, and B. G. Lindsay. “Approximate Medians and other
Quantiles in One Pass and with Limited Memory”. ACM SIGMOD 1998.
[MRL99] G.S. Manku, S. Rajagopalan, B.G. Lindsay. Random Sampling Techniques for Space
Efficient Online Computation of Order Statistics of Large Datasets. ACM SIGMOD, 1999.
[MVW98] Y. Matias, J.S. Vitter, and M. Wang. “Wavelet-based Histograms for Selectivity
Estimation”. ACM SIGMOD 1998.
[MVW00] Y. Matias, J.S. Vitter, and M. Wang. “Dynamic Maintenance of Wavelet-based
Histograms”. VLDB 2000.
[PIH96] V. Poosala, Y. Ioannidis, P. Haas, and E. Shekita. “Improved Histograms for Selectivity
Estimation of Range Predicates”. ACM SIGMOD 1996.
[PJO99] F. Provost, D. Jenson, and T. Oates. Efficient Progressive Sampling. KDD 1999.
[Poo97] V. Poosala. “Histogram-Based Estimation Techniques in Database Systems”. PhD Thesis,
Univ. of Wisconsin, 1997.
[PSC84] G. Piatetsky-Shapiro and C. Connell. “Accurate Estimation of the Number of Tuples
Satisfying a Condition”. ACM SIGMOD 1984.
[SDS96] E.J. Stollnitz, T.D. DeRose, and D.H. Salesin. “Wavelets for Computer Graphics”.
Morgan-Kauffman Publishers Inc., 1996.
References (5)
[T96] H. Toivonen. Sampling Large Databases for Association Rules. VLDB 1996.
[TGI02] N. Thaper, S. Guha, P. Indyk, N. Koudas. Dynamic Multidimensional Histograms. ACM
SIGMOD, 2002.
[U89] P. E. Utgoff. Incremental Induction of Decision Trees. Machine Learning, 4, 1989.
[U94] P. E. Utgoff: An Improved Algorithm for Incremental Induction of Decision Trees. ICML
1994.
[Vit85] J. S. Vitter. “Random Sampling with a Reservoir”. ACM TOMS, 1985.