Mining Compressed Commodity Workflows From Massive RFID Data Sets∗
Hector Gonzalez, Jiawei Han, Xiaolei Li
Department of Computer Science
University of Illinois at Urbana-Champaign
{hagonzal, hanj, xli10}@uiuc.edu
∗The work was supported in part by the U.S. National Science Foundation under grants NSF IIS-05-13678 and NSF BDI-05-15813.
ABSTRACT
Radio Frequency Identification (RFID) technology is fast
becoming a prevalent tool in tracking commodities in supply chain management applications. The movement of commodities through the supply chain forms a gigantic workflow
that can be mined for the discovery of trends, flow correlations, and outlier paths, which in turn can be valuable in
understanding and optimizing business processes.
In this paper, we propose a method to construct compressed probabilistic workflows that capture the movement
trends and significant exceptions of the overall data sets,
but with a size that is substantially smaller than that of the
complete RFID workflow. Compression is achieved based
on the following observations: (1) only a relatively small
minority of items deviate from the general trend, (2) only
truly non-redundant deviations, i.e., those that substantially
deviate from the previously recorded ones, are interesting,
and (3) although RFID data is registered at the primitive
level, data analysis usually takes place at a higher abstraction level. Techniques for workflow compression based on
non-redundant transition and emission probabilities are derived; and an algorithm for computing approximate path
probabilities is developed. Our experiments demonstrate
the utility and feasibility of our design, data structure, and
algorithms.
Categories and Subject Descriptors: H.2.8 [Database
Applications]: Data mining
General Terms: Algorithms
Keywords: RFID, workflow induction
1. INTRODUCTION
Radio Frequency Identification (RFID) technology has received considerable attention from both the hardware and
software communities. Hardware research deals with the
issues related to building small and reliable tags and readers, while software research addresses the problems of cleaning, summarizing, warehousing, and mining RFID data sets.
RFID technology is composed of readers and tags. Readers
are transponders capable of detecting a tag from a distance
and without line of sight. Tags are attached to items, and
transmit a unique Electronic Product Code (EPC) when in
proximity to a reader. Each tag reading generates a tuple of
the form (EP C, location, time), where location is the place
where the reader is positioned, and time is the time when
the reading took place.
An RFID data set is a collection of paths, one per distinct tag in the system. Paths can be seen as a probabilistic
workflow, where nodes are path stages and edges are transitions; each edge has a probability that is the fraction of items that took the transition. As millions of individual items
move through possibly hundreds of locations in the supply
chain, such workflows can be enormous.
In [3] we introduced the concept of a FlowCube, which
is a data cube computed on an RFID path database. The
FlowCube is composed of cuboids aggregated at different
levels of abstraction of the path-independent dimensions describing each item (item view ) such as product, manufacturer, and price; and at different levels of abstraction along
the locations and durations of each path stage (path view ).
Item view aggregation is similar to regular aggregation in
data cubes in which, for example, we may go from individual
items, to product type, and to product category. Path view
is a new kind of aggregation for FlowCubes and it may, for
example, collapse all the individual locations within a store
to a single “store” location, or all individual trucks into a
single “transportation location” to present a summarized
view of the paths traversed by items. The measure recorded
in each cell of the FlowCube is a probabilistic workflow. [3]
presents a framework for efficient construction of the FlowCube
but does not go into the details of a concrete workflow design. In this paper we address the design and implementation of a compressed workflow that can be used as a measure
in the FlowCube. Our goal is to design a workflow structure
that is manageable in size, but can still be used to identify
major flow trends and significant exceptions and provide answers to questions such as:
1. Is there a correlation between the time spent at quality
control and the return rate for laptops manufactured
in China?
2. What are the most notable characteristics of paths
traveled by dairy products that end up being discarded
from the stores in Boston?
3. Which containers arriving by ship at New York ports have spent unusually long periods of time at an intermediate port in Europe?
The gigantic size of an RFID data set and the diversity
of queries over flow patterns pose great challenges to traditional workflow induction and analysis technologies since
processing may involve retrieval and reasoning over a large
number of inter-related tuples through different stages of
object movements. Creating a complete workflow that captures the nature of commodity movements and that incorporates time will be prohibitively expensive since there can
be billions of different location and time combinations.
In this paper we propose a new probabilistic workflow
model that is highly compressed but is still capable of capturing the main characteristics of object movements through
locations and time, while keeping track of significant time-location correlations and deviations from the main trend.
The workflow is organized in such a way that a wide range
of queries can be answered efficiently. Our design is based
on the following key observations.
First, it is necessary to eliminate the redundancy present in
RFID data and aggregate it to a minimal level of abstraction
interesting to users. Each reader provides tuples of the form
(EP C, location, time) at fixed time intervals. When an
item stays at the same location for a period of time, multiple
tuples will be generated. These tuples can be grouped into a
single compressed record: (EP C, location, time in, time out).
For example, if a supermarket has readers on each shelf that scan the items every minute, and items stay on the shelf on average for 1 day, the compressed record yields a 1,440-to-1 reduction in size without loss of information. One can further compress the data by aggregating the tuples to a level of abstraction higher than the one present in the raw RFID database, for example, representing data at the hour level if users are not interested in finer granularity. Furthermore, one can rewrite the data set as a set of paths of
the form (itemp : (l1 , d1 ), . . . , (ln , dn )) where di is the time
spent by itemp at the i-th location li .
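To make this compression step concrete, here is a minimal Python sketch (illustrative only; the paper's implementation is in C++ and its details are not given) that groups consecutive readings of a tag at the same location into stay records and then rewrites each tag's stays as a path of (location, duration) pairs. The field names and the toy readings are hypothetical.

```python
from collections import defaultdict

def compress_readings(readings):
    """Collapse raw (epc, location, time) tuples into stay records
    [location, time_in, time_out] per tag. Assumes each tag's readings
    arrive in time order; consecutive readings at the same location are
    merged into one stay."""
    stays = defaultdict(list)                  # epc -> list of [loc, t_in, t_out]
    for epc, loc, t in readings:
        if stays[epc] and stays[epc][-1][0] == loc:
            stays[epc][-1][2] = t              # still at the same location
        else:
            stays[epc].append([loc, t, t])     # entered a new location
    return stays

def stays_to_paths(stays):
    """Rewrite each tag's stays as a path ((l1, d1), ..., (ln, dn)),
    where di is the time spent at location li."""
    return {epc: tuple((loc, t_out - t_in) for loc, t_in, t_out in recs)
            for epc, recs in stays.items()}

# Toy readings (hypothetical; times in minutes).
raw = [("r1", "l1", 0), ("r1", "l1", 1), ("r1", "l1", 2),
       ("r1", "l2", 10), ("r1", "l2", 11),
       ("r2", "l1", 0), ("r2", "l1", 5)]
print(stays_to_paths(compress_readings(raw)))
# {'r1': (('l1', 2), ('l2', 1)), 'r2': (('l1', 5),)}
```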
Second, we can do semantic path compression, by merging and collapsing path segments that may be uninteresting
to users. For example, a retail store manager may not be
interested in item movements through the production lanes
in a factory and may like to group all those transitions into
a single “factory” location record.
Third, many items have similar flow patterns and only a
relatively small number of items significantly deviate from the
general trend. Taking this into consideration, one can construct a highly compressed workflow that represents only the
general trends and significant deviations but ignores very low
support exceptions. Such a workflow may help uncover correlations present in the data. For example, one may detect that
CDs that stayed for more than two months in the factory
have a much higher chance of being returned by customers
than those that stayed for just one month.
Fourth, for concise representation of the overall flow graphs,
it is important to build conditional probabilities on short path
segments before long ones. This is because short path segments lead to much less combinations and facilitate concise
representation. The exceptions caused by long path segments will need to be expressed only when they are substantially different from that derivable from the corresponding
short ones. This will save the overall effort of flow-graph
construction and maintenance.
We have explored efficient implementation of the model proposed above. Our algorithm development and performance study show that such a model, though comprehensive, is clean and concise in comparison with a complete
workflow, and the construction of such a model requires only
a small number of scans of the RFID data sets.
The rest of the paper is organized as follows. Section 2
presents the structure of the input RFID data. Section 3
presents RFID workflow design and compression methods.
Section 4 introduces algorithms for workflow compression.
Section 5 reports the experimental and performance results.
We discuss the related work in Section 6 and conclude our
study in Section 7.
2. COMPRESSED RFID DATA SETS
Data generated from an RFID application can be seen as
a stream of RFID tuples of the form (EP C, location, time),
where EPC is the unique identifier read by an RFID reader,
location is the place where the RFID reader scanned the
item, and time is the time when the reading took place.
Table 1 is an example of a raw RFID database where a
symbol starting with r represents an RFID tag, l a location,
and t a time.
RawData
(r1 , l1 , t1 ) (r2 , l1 , t1 ) . . . (r1 , l1 , t2 ) (r2 , l1 , t2 ) . . . (r1 , l2 , t10 )
(r2 , l1 , t10 ) . . . (r2 , l1 , t15 ) (r3 , l3 , t15 ) . . . (r10000 , l3 , t500 )
Table 1: Raw RFID Records
As shown in our recent study [4], the size of the data can
be reduced by eliminating duplicate readings for the same
object at the same location and creating Stay records of the
form (EP C, location, time in, time out) where time in is
the time when the object enters the location, and time out
is the time when the object leaves the location. Table 2
presents a clean RFID database for the raw data in Table
1. The clean database contains stay records for the 10,000
items present in the raw data.
EPC      Stay(EPC, location, time in, time out)
r1       (r1, l1, t1, t10)
r2       (r2, l1, t15, t20) (r2, l2, t21, t29)
r3       (r3, l1, t10, t15) (r3, l3, t16, t22) (r3, l1, t23, t25)
...      ...
r10000   (r10000, l1, t480, t490) (r10000, l3, t491, t500)

Table 2: A Cleansed RFID Database
It is possible that users are interested in looking at the
data at a level of abstraction higher than the one present in
the raw data, e.g., users may not be interested in the time
spent by items at a location at the second level, but at the
hour level, and it may not be important to distinguish each
item but rather look at the data at the SKU level. We can
aggregate the Stay table to the level (pl , ll , tl ), where pl is the
product level (e.g., SKU, product type, product category),
ll is the location level (e.g., location, locale, location type),
and tl is the time level (e.g., second, minute, hour, day).
For RFID data flow analysis, one could be interested only
in the duration (i.e., length of time) of object stay or transition, but not the absolute time. In such cases, we can rewrite
the Stay table as a set of paths of the form (EPC : (l1, d1)(l2, d2) . . . (lk, dk)), where li is the i-th location in the path
traveled by the item identified by EP C, and di is the total
time that the item stayed at li .
The duration that an item spends at a location is a continuous attribute. In order to simplify the model even further
we can discretize all the distinct durations for each location
into a fixed number of clusters. We call such a path-database with discretized durations a clustered path-database.
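The paper does not prescribe a particular discretization method for durations. As one possible realization, the sketch below uses simple equal-frequency binning per location; the function name and the choice of binning are assumptions of this sketch, not the authors' procedure.

```python
from collections import defaultdict

def cluster_durations(paths, k=3):
    """Replace each (location, duration) stage by (location, cluster_id), using
    equal-frequency binning of the durations observed at each location.
    The binning scheme is just one of many possible discretizations."""
    by_loc = defaultdict(list)
    for path in paths:
        for loc, dur in path:
            by_loc[loc].append(dur)
    bounds = {}
    for loc, durs in by_loc.items():
        durs.sort()
        bounds[loc] = [durs[(len(durs) * i) // k] for i in range(1, k)]

    def cluster(loc, dur):
        return 1 + sum(dur > t for t in bounds[loc])   # cluster ids 1..k

    return [tuple((loc, cluster(loc, dur)) for loc, dur in path) for path in paths]

# Toy paths with durations in hours.
paths = [(("l1", 2), ("l2", 5)), (("l1", 30), ("l2", 6)), (("l1", 70),)]
print(cluster_durations(paths, k=2))
# [(('l1', 1), ('l2', 1)), (('l1', 1), ('l2', 1)), (('l1', 2),)]
```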
Table 3 presents a clustered path-database for the stay
data in Table 2. (lj , i) corresponds to an item that stays for
the duration i (where i is a cluster) at location lj .
Path                                          count
(l1, 1)                                       700
(l1, 2)                                       700
(l1, 3)                                       2400
(l1, 1) → (l2, 1)                             240
(l1, 2) → (l2, 1)                             240
(l1, 3) → (l2, 1)                             800
(l1, 3) → (l3, 1)                             2160
(l1, 3) → (l3, 2)                             2160
(l1, 3) → (l3, 1) → (l1, 1)                   18
(l1, 3) → (l3, 1) → (l1, 2)                   18
(l1, 3) → (l3, 1) → (l1, 3)                   144
(l1, 3) → (l3, 1) → (l1, 1) → (l2, 1)         6
(l1, 3) → (l3, 1) → (l1, 2) → (l2, 1)         6
(l1, 3) → (l3, 1) → (l1, 3) → (l2, 1)         48
(l1, 3) → (l3, 1) → (l1, 1) → (l3, 1)         36
(l1, 3) → (l3, 1) → (l1, 2) → (l3, 1)         36
(l1, 3) → (l3, 1) → (l1, 3) → (l3, 1)         144
(l1, 3) → (l3, 1) → (l1, 3) → (l3, 2)         144

Table 3: A clustered path-database
3. RFID WORKFLOW

A clustered path-database can be naturally modeled as a probabilistic workflow. Each location corresponds to an activity, and locations are linked according to their order of occurrence. A link between activities q and p has an associated probability, which is the percentage of the time that activity p occurred right after activity q. We illustrate the benefit of such modeling with an example: the US government will require every container arriving at ports in the country to carry an RFID tag in order to detect abnormal paths. We can easily determine the level of abnormality of a path by looking at its transition probabilities in the workflow; one could even look at the current trajectory of a container, predict the set of most likely next stages, and raise a real-time alert when an in-transit container takes an unexpected transition.
We now state a more formal definition of a probabilistic workflow, based on the definition of a probabilistic deterministic finite automaton (PDFA) [6]. The only difference is that we associate states with symbols of the alphabet, whereas the PDFA associates transitions with symbols.
A probabilistic workflow is defined by the tuple (Q, Σ, η, τ, q0, F) where
- Q is a finite set of states,
- Σ is the alphabet, in our case the set of locations,
- η : Q → Σ is the naming function, which assigns a symbol from Σ to each state,
- τ : Q × Q → [0, 1] is a function that returns the probability associated with a transition,
- q0 is the initial state,
- F : Q → [0, 1] is a termination function, which returns the probability for a state to be final.

The sum of all the transition probabilities for a state plus its termination probability should add up to 1. Figure 1 presents a probabilistic workflow constructed on all the paths from Table 3. The transition function value for each pair of states li → lj is the number placed on the arrow, computed as count(li, lj)/count(li), where count(li, lj) is the number of items that took the transition and count(li) is the total number of items that arrived at li. The termination function value for each node li is the number placed on top of the node, computed as count(li, #)/count(li), where count(li, #) is the number of items that terminate their path at li. This workflow is a highly compact summary of the data; we have represented all the paths in the database with a model that has just 6 nodes. But the compression is not lossless: the workflow in Figure 1 does not record the duration spent at each location. For example, we cannot distinguish the paths (l1, 1)(l2, 1) and (l1, 2)(l2, 1) of Table 3.

Figure 1: Probabilistic workflow
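A duration-independent workflow such as the one in Figure 1 can be computed in a single pass over the clustered path database by counting arrivals, transitions, and terminations. The Python sketch below identifies a state with the sequence of locations leading to it (a prefix-tree construction), which is one way to obtain a small state set like Figure 1's; the paper does not pin down the state construction, so treat this as an assumption of the sketch.

```python
from collections import Counter

def build_workflow(path_counts):
    """Build a duration-independent probabilistic workflow from a clustered
    path database, ignoring durations. A state is identified here with the
    sequence of locations leading to it. Returns transition probabilities
    tau and termination probabilities F."""
    arrivals = Counter()      # count(s): items that reached state s
    transitions = Counter()   # count(s, s'): items that moved from s to s'
    terminations = Counter()  # count(s, #): items whose path ends at s
    for path, n in path_counts.items():
        locs = tuple(loc for loc, _dur in path)
        for i in range(len(locs)):
            state = locs[: i + 1]
            arrivals[state] += n
            if i + 1 < len(locs):
                transitions[(state, locs[: i + 2])] += n
            else:
                terminations[state] += n
    tau = {edge: c / arrivals[edge[0]] for edge, c in transitions.items()}
    F = {s: terminations[s] / arrivals[s] for s in arrivals}
    return tau, F

# Tiny path database in the format of Table 3 (path -> count), durations included.
db = {
    (("l1", 1),): 700,
    (("l1", 3), ("l2", 1)): 800,
    (("l1", 3), ("l3", 1)): 2160,
}
tau, F = build_workflow(db)
print(round(tau[(("l1",), ("l1", "l2"))], 3))  # fraction of items taking l1 -> l2
print(round(F[("l1",)], 3))                    # fraction of items terminating at l1
```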
3.1 Complete workflow
Creating a workflow that incorporates duration information is valuable in the analysis of RFID paths. Duration
may for example be very important in inventory management; we can detect path segments that have unusually long
durations and use the information to re-route items if possible, or to optimize those segments. In order to incorporate
duration into the workflow, the first option is to extend Σ to
be not just locations, but the Cartesian product of locations
× durations. We refer to a workflow with such nodes as a
complete workflow. Figure 2 presents the complete workflow
for Table 3. The problem with this approach is that in many
cases we will get a very significant increase in the number
of nodes (in our example we went from 6 nodes to 19), with
a large percentage of the new nodes being uninteresting.
For our running example, we would replace the first node l1 with 3 nodes (l1, 1), (l1, 2), and (l1, 3), while l2 would have only one node (l2, 1). If we look at the transition (l1, 3) → (l2, 1), the transition probability is 0.1, a number close to the 0.13 recorded in the duration-independent workflow; for some applications, explicitly identifying this transition does not add much information. Redundancy becomes more serious when examining all the possible transitions between two nodes that each have many possible durations but whose duration distributions are largely independent of each other.
Figure 2: Complete workflow

3.2 Extended probabilistic workflow
In order to solve the inefficiencies present in the complete workflow model, we develop an extended probabilistic
workflow model that incorporates durations by factoring
probabilities into conditional transition probabilities, associated with transitions, and conditional emission probabilities
associated with durations at nodes. This model is similar
to a Hidden Markov Model (HMM) [10], but in our case we condition the emission and transition probabilities at a state on the durations at all previous states, a concept not
present in HMMs.
More formally, an extended probabilistic workflow model is
a tuple (Q, C, Σ, η, T, E, q0 ), where all the components are
defined as in a probabilistic workflow, and C, T , and E are
defined as:
- C is the set of possible durations at any node, including ∗ to represent any duration.
- T : Powerset(Q × C) × (Q ∪ {#}) → [0, 1] is a conditional transition probability function that assigns a probability to each transition going into a state q ∈ Q or # (the special termination state symbol) given the time spent at each of the predecessors of q.
- E : Powerset(Q × C) × (Q × C) → [0, 1] is a conditional emission probability function that assigns a probability to each length of stay c ∈ C at node q given the time spent at each of the predecessors of q.
As before, we require that the emission and transition
probabilities conditioned on a given path prefix add up to
1.
The ability to compute the probability of a path or a path segment is important. It can be used, for example, for data cleaning: we can use the workflow to detect unlikely transitions caused by false positive readings, or to check the most likely next locations for an item right after it leaves a given location, in order to determine whether the reader missed the item or it really moved. The workflow can also be used to rank paths according to likelihood and allow data analysts to focus their effort on optimizing the top-k most likely paths.
Before we present how to compute probabilities in an extended probabilistic workflow we need a few definitions:
- We define a path x to be the sequence of stages x1 x2 . . . xl, where each xi is a location/duration pair of the form (q, c), q ∈ Q and c ∈ C. We write c = ∗ when we do not care about the particular duration at the location.
- The length of a path x, length(x), is the number of stages in the path.
- We can identify the state and duration components of stage i in path x by the functions state(x, i) and time(x, i), respectively.
- We define the prefix of a path, prefix(x, i), to be the first i stages in path x; prefix(x, 0) = φ. The prefix of a path is a path itself, and every path is a prefix of itself.
- The function count(x) returns the number of times that the path x appears in the clustered path database as a complete path or as a prefix of a longer path x'. Note that (q, c = ∗) matches any duration at location q.
- A path xx' is the concatenation of the stages of the paths x and x'. A path x(q, c) is the concatenation of the stages of path x with the stage (q, c).
We can define the probability of a path as the product of
all the conditional emission probabilities for the durations
spent at each location in the path, multiplied by the product
of all the conditional transition probabilities in the path.
P(x) = Emission(x) × Transition(x)    (1)

Emission(x) = ∏_{i=1}^{n} E(prefix(x, i−1), x_i)    (2)

Transition(x) = ( ∏_{i=1}^{n} T(prefix(x, i−1), state(x, i)) ) × T(x, #)    (3)

The computation of an emission probability for c_i (c_i ≠ ∗) in the stage x_i = (q_i, c_i), given the prefix x_1 . . . x_{i−1}, is done as follows:

E(prefix(x, i−1), x_i) = count(prefix(x, i)) / count(prefix(x, i−1)(state(x, i), ∗))    (4)

The computation of a transition probability to location l_i, given the prefix x_1 . . . x_{i−1}, is done as follows:

T(prefix(x, i−1), l_i) = count(prefix(x, i−1)(l_i, ∗)) / count(prefix(x, i−1))    (5)
Example (Emission probability computation). Figure
3 presents each node in the workflow with the conditional
emission and transition probabilities. Rows represent durations at the node, and columns the probability of the duration conditioned on different path segments. If we look at
the conditional emission table for node 6, we can compute
cell 1 (row 1, column 1) as E((l1 , ∗)(l3 , ∗)(l1 , ∗), (l3 , 1)) =
count((l1 , ∗)(l3 , ∗)(l1 , ∗)(l3 , 1))/count( (l1 , ∗)(l3 , ∗)(l1 , ∗)(l3 , ∗))
= 216/360 = 0.6. Similarly we can compute transition probabilities.
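Equations (4) and (5) reduce to prefix counting over the clustered path database. The following Python sketch encodes Table 3 directly and reproduces the example above, E((l1, ∗)(l3, ∗)(l1, ∗), (l3, 1)) = 216/360 = 0.6, with ∗ represented as None; it is an illustration of the formulas, not the authors' code.

```python
# Clustered path database of Table 3: path (tuple of (location, duration)) -> count.
DB = {
    (("l1", 1),): 700, (("l1", 2),): 700, (("l1", 3),): 2400,
    (("l1", 1), ("l2", 1)): 240, (("l1", 2), ("l2", 1)): 240, (("l1", 3), ("l2", 1)): 800,
    (("l1", 3), ("l3", 1)): 2160, (("l1", 3), ("l3", 2)): 2160,
    (("l1", 3), ("l3", 1), ("l1", 1)): 18, (("l1", 3), ("l3", 1), ("l1", 2)): 18,
    (("l1", 3), ("l3", 1), ("l1", 3)): 144,
    (("l1", 3), ("l3", 1), ("l1", 1), ("l2", 1)): 6,
    (("l1", 3), ("l3", 1), ("l1", 2), ("l2", 1)): 6,
    (("l1", 3), ("l3", 1), ("l1", 3), ("l2", 1)): 48,
    (("l1", 3), ("l3", 1), ("l1", 1), ("l3", 1)): 36,
    (("l1", 3), ("l3", 1), ("l1", 2), ("l3", 1)): 36,
    (("l1", 3), ("l3", 1), ("l1", 3), ("l3", 1)): 144,
    (("l1", 3), ("l3", 1), ("l1", 3), ("l3", 2)): 144,
}
ANY = None  # stands for the wildcard duration '*'

def count(pattern, db=DB):
    """count(x): total count of database paths having pattern x as a prefix,
    where a stage (q, ANY) matches any duration at location q."""
    total = 0
    for path, n in db.items():
        if len(path) < len(pattern):
            continue
        if all(pl == ql and (qd is ANY or pd == qd)
               for (pl, pd), (ql, qd) in zip(path, pattern)):
            total += n
    return total

def emission(prefix, stage):
    """Eq. (4): E(prefix, (q, c)) = count(prefix + (q, c)) / count(prefix + (q, *))."""
    q, c = stage
    return count(prefix + ((q, c),)) / count(prefix + ((q, ANY),))

def transition(prefix, loc):
    """Eq. (5): T(prefix, l) = count(prefix + (l, *)) / count(prefix)."""
    return count(prefix + ((loc, ANY),)) / count(prefix)

pfx = (("l1", ANY), ("l3", ANY), ("l1", ANY))
print(emission(pfx, ("l3", 1)))                    # 216 / 360 = 0.6
print(transition((("l1", ANY), ("l3", 1)), "l1"))  # 600 / 2760 (cf. Section 3.2.1 below)
```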
3.2.1 Uninteresting conditional probabilities
When computing the conditional transition and emission
probabilities for a node, we do not need to consider every
possible path that is a prefix of the node as a conditioning factor. In certain supply chain applications it is possible for some items to have unique path segments, e.g., items moving on a conveyor belt may all have identical paths and durations at each stage. When finding the conditional emission or transition probabilities we can consider each unique path segment as an atomic unit; this offers great benefits in computation cost and workflow compression. For example, if every item of a certain type goes through the same 20 stages on a conveyor belt, by considering that path segment as a unit we gain a 2^20 to 1 reduction in the number of possible path
prefixes to consider.
We call a conditional emission or transition entry uninteresting if it can be derived from another entry without loss
of information. For example, if we know that before a certain product gets to the shelf, it always spends 1 day in the
warehouse, 2 days in the store backroom, and 4 hours in the
truck, there is no need to condition the length of stay at the
shelf on all eight combinations of these three stages; conditioning on the single prefix containing all stages is sufficient.
We say that a path x is closed iff we cannot find another path y that goes through the same stages as x, is more specific (has fewer durations set to ∗), and has count(x) = count(y). This is a concept very similar to that of closed
frequent patterns [9]. An emission or transition probability
entry conditioned on x is uninteresting if x is not closed.
Lemma 1. Conditional emission probabilities conditioned
on non-closed paths can be derived without loss of information from conditional emission probabilities conditioned on
closed paths.
Proof Sketch: Assume that we are looking at node qi of
a workflow and computing the emission probability of duration ci . Assume that x is a non-closed path leading to
qi and that y is the closed path corresponding to x. If we
know E(y, (qi , ci )), given that count(x) = count(y) we get
that E(x, (qi , ci )) = E(y, (qi , ci )).
Lemma 2. Conditional transition probabilities conditioned on non-closed paths can be derived without loss of information from conditional transition probabilities conditioned on closed paths.
Proof Sketch: Using an analogous argument to that used
in the proof of lemma 1, the transition probability T (x, qi ) =
T (y, qi ) when x is not closed and y is the corresponding
closed path.
Example (Uninteresting conditional probabilities).
Looking up node 3 in the workflow of Figure 3 and computing the conditional transition probability to node 4 given
(l3 , 1), we get T ((l1 , ∗)(l3 , 1), l1 ) = 600/2760. This is exactly the same as T ((l1 , 3)(l3 , 1), l1 ) = 600/2760. The reason is that (l1 , ∗)(l3 , 1) is not a closed path. Thus the first
conditional probability is uninteresting.
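Because counts can only decrease as a path becomes more specific, it suffices to test one-step specializations when checking whether a path is closed. Below is a small illustrative Python sketch over a toy path database (the duration domain and the data are made up).

```python
ANY = None  # wildcard duration '*'

# Toy clustered path database: path -> count. Every item that visited l3 with
# duration 1 arrived there after spending duration 3 at l1.
DB = {(("l1", 3), ("l3", 1)): 5, (("l1", 2), ("l2", 1)): 3}

def count(pattern):
    return sum(n for path, n in DB.items()
               if len(path) >= len(pattern)
               and all(pl == ql and (qd is ANY or pd == qd)
                       for (pl, pd), (ql, qd) in zip(path, pattern)))

def is_closed(pattern, durations=(1, 2, 3)):
    """A path is closed iff no strictly more specific path over the same stages
    has the same count. Trying one wildcard at a time suffices, because counts
    are non-increasing under specialization."""
    c = count(pattern)
    for i, (loc, dur) in enumerate(pattern):
        if dur is not ANY:
            continue
        for d in durations:
            special = pattern[:i] + ((loc, d),) + pattern[i + 1:]
            if count(special) == c:
                return False
    return True

print(is_closed((("l1", ANY), ("l3", 1))))  # False: (l1,3)(l3,1) has the same count
print(is_closed((("l1", ANY),)))            # True: no single duration accounts for all 8 items
```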
3.2.2 Redundant conditional probabilities
The benefit of all conditional emission and transition probabilities is not the same. Intuitively, a conditional probability that deviates significantly from the probability that we
would infer from the currently recorded conditional probabilities is more important than the one that deviates very
little. Such deviations are important because they are interesting patterns by themselves, e.g., it may be important
to discover that the transition, for perishable products, to
the return counter increases when they have spent too long in
transportation locations. Recording these deviations is also
important to compute more accurate path probabilities.
When determining if a conditional probability is redundant at a node it is important to condition on short path prefixes before we condition on long path prefixes, i.e., we only
record deviations given prefixes of length i if they are not redundant given recorded deviations on prefixes of length i−1.
This optimization is quite natural and effective, as it records substantially fewer deviations, keeping only the truly "surprising" ones, which are the really interesting ones.
Let us define Ê(x, (q, c)) to be the expected emission probability of duration c at node q given path x. If we have
already computed the emission probabilities E(a1 , (q, c)),
. . . , E(an, (q, c)), where each ai is a direct ancestor of x (same path stages, with exactly one of x's concrete durations set to ∗) and (ai, (q, c)) and (aj, (q, c)) are independent and conditionally independent given (q, c) for every i ≠ j, we can compute
Ê(x, (q, c)) as

Ê(x, (q, c)) = ( ∏_{i=1}^{n} E(a_i, (q, c)) / E(b, (q, c)) ) × E(b, (q, c))    (6)
where E(b, (q, c)) is the unconditional probability of observing duration c at state q, and b is the path up to q with all durations set to ∗. Intuitively, this equation adjusts the unconditional probability of observing duration c in stage q by compounding the effect that each ancestor independently has on the emission probability¹.
We say that a conditional emission probability is redundant if |Ê(x, (q, c)) − E(x, (q, c))| < ε, and we call ε the minimum probability deviation threshold².
Similarly, we can define the expected transition probability T̂(x, q) as

T̂(x, q) = ( ∏_{i=1}^{n} T(a_i, q) / T(b, q) ) × T(b, q)    (7)

where each ai is a direct ancestor of x and T(b, q) is the unconditional transition probability to state q. The intuition behind Eq. (7) is the same as for Eq. (6).
We say that a conditional transition probability is redundant if |T̂(x, q) − T(x, q)| < ε.

¹Eq. (6) can be better understood with a simple example: assume three events a, b, and c, with b and c independent and conditionally independent given a. Then P(a|bc) = P(a) × P(a|b)/P(a) × P(a|c)/P(a), which is equivalent to our formula when there are only two ancestors.
²Alternative definitions are possible. For example, we could define an emission probability to be redundant if max_{ai} |E(x, (q, c)) − E(ai, (q, c))| < ε.
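To make Eq. (6) and the redundancy test concrete, the Python sketch below compounds the effect of each recorded ancestor on the unconditional probability and flags an entry as redundant when the observed value is within ε of the expectation. The probability values are made up for illustration.

```python
def expected_emission(ancestor_probs, baseline):
    """Eq. (6): E_hat(x,(q,c)) = (prod_i E(a_i,(q,c)) / E(b,(q,c))) * E(b,(q,c)),
    where ancestor_probs holds the recorded E(a_i,(q,c)) for the direct ancestors
    of x and baseline is the unconditional E(b,(q,c))."""
    e_hat = baseline
    for p in ancestor_probs:
        e_hat *= p / baseline
    return e_hat

def is_redundant(observed, ancestor_probs, baseline, epsilon):
    """An entry is redundant (and need not be stored) if |E_hat - E| < epsilon."""
    return abs(expected_emission(ancestor_probs, baseline) - observed) < epsilon

# Hypothetical values: unconditionally, duration c occurs 40% of the time at q;
# two (assumed independent) ancestor prefixes raise it to 60% and 50%, so the
# compounded expectation is 0.6 * 0.5 / 0.4 = 0.75.
baseline, ancestors = 0.4, [0.6, 0.5]
print(round(expected_emission(ancestors, baseline), 3))  # 0.75
print(is_redundant(0.78, ancestors, baseline, 0.05))     # True:  |0.75 - 0.78| < 0.05
print(is_redundant(0.90, ancestors, baseline, 0.05))     # False: record this deviation
```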
3.2.3 Low support conditional probabilities
Up to this point we have considered all the closed paths
that are a prefix of a node in computing its conditional emission and transition probabilities. Conditioning probabilities
on every closed path may lead to very large probability tables, with many entries supported by very few observations
(paths). The size of our workflow may end up being dominated by noise in the clustered path-database. In order to
solve this problem, we restrict our closed paths to only those
that appear in the clustered database more than δ percent
of the time. This has the effect that any probability entry
will be supported by at least δ percent of the observations
and the size of the tables will be significantly smaller. We
call δ the minimum support.
3.2.4 Compressed Workflow
We call an extended probabilistic workflow, with only interesting and non-redundant conditional emission and transition probability entries, conditioned on closed paths with minimum support δ and minimum probability deviation ε, a compressed workflow.
Finding the optimal values of δ and ε, such that a good compromise between workflow size, exception interestingness, and information loss is achieved, is a very challenging task, and one that depends on the specific application needs. When the main objective is to detect general flow patterns and significant exceptions, the use of larger values of δ and ε has two advantages: discovered exceptions will likely be more interesting, and the total number of exceptions to analyze will be smaller. When the objective is to use the compressed workflow to compute path probabilities with high accuracy, we need to be more conservative in the selection of the parameters. In this case a good indicator for parameter selection may be the nature of the paths for which we will estimate probabilities. ε can be estimated by checking the error on a representative query set with varying levels of the parameter; we can use the value where the error rate has a big jump (see Figure 7). δ could be selected using a top-k frequent path approach such as [5].
Figure 3 shows the compressed workflow derived from the clustered path database in Table 3. The number of nodes is 6, the same as the duration-independent workflow in Figure 1. Each node contains a conditional emission table and a conditional transition table. The lightly shaded columns (nodes 1 and 3) are redundant for ε = 0.2, and the darkly shaded columns (node 6) are not supported for δ = 150 paths. An estimate for the size of the workflow is the number of probability entries, in this case 30 entries³.

³In reality the number of probabilities is smaller, as we only need to store n − 1 entries for each column given the law of total probability.
4. WORKFLOW COMPUTATION METHOD
In this section we will introduce an algorithm to compute
a compressed workflow from a clustered path database, with
minimum support δ and minimum probability deviation ε.
4.1 Mining closed paths
From the discussion in the previous section we know that
we only need to condition transition and emission probabilities on frequent closed paths. The mining of closed paths
can be easily implemented with an efficient closed itemset
mining algorithm [16] after associating locations to nodes in
a duration-independent workflow. If we are computing not a
single cell but a complete FlowCube we can compute all the
frequent closed paths for all cells at all levels of abstraction
in parallel using the method presented in [3].
After we have computed the counts of all closed paths, we
need to do one more pass over the path database computing
the support of each closed path followed by each duration independent suffix. For example, if we get that (l1 , 3)(l3 , 1) is a
frequent closed path, we should compute (l1 , 3)(l3 , 1)(l1 , ∗),
(l1 , 3)(l3 , 1)(l2 , ∗), and (l1 , 3)(l3 , 1)(l3 , ∗). These counts will
be needed in computing conditional emission and transition
probabilities.
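This extra pass can be implemented by matching each database path against the mined closed paths and counting, for every closed path, how often each location immediately follows it. A rough Python sketch (toy data; not the authors' implementation):

```python
from collections import Counter

ANY = None  # wildcard duration '*'

def matches_prefix(path, pattern):
    """True if pattern (with ANY wildcards) is a prefix of path."""
    return len(path) >= len(pattern) and all(
        pl == ql and (qd is ANY or pd == qd)
        for (pl, pd), (ql, qd) in zip(path, pattern))

def suffix_counts(db, closed_paths):
    """For every frequent closed path p, count the support of p followed by each
    duration-independent stage (l, *). One scan over the clustered path database."""
    counts = Counter()
    for path, n in db.items():
        for p in closed_paths:
            if len(path) > len(p) and matches_prefix(path, p):
                next_loc = path[len(p)][0]
                counts[(p, next_loc)] += n
    return counts

# Toy data: the closed path (l1,3)(l3,1) is followed by l1 in 2 + 4 = 6 items.
db = {
    (("l1", 3), ("l3", 1), ("l1", 1)): 2,
    (("l1", 3), ("l3", 1), ("l1", 3), ("l2", 1)): 4,
    (("l1", 3), ("l3", 2)): 7,
}
closed = [(("l1", 3), ("l3", 1))]
print(suffix_counts(db, closed))   # Counter({((('l1', 3), ('l3', 1)), 'l1'): 6})
```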
4.2 Mining conditional probabilities
When computing conditional probabilities given closed
frequent paths we follow a short-to-long principle. We first record unconditional probabilities; then we record probabilities conditioned on paths of length 1 that deviate from
the unconditional probabilities; when we have all paths of
length 1, we look at the ones of length 2, and record deviations given the unconditional entries and the length 1
entries; we repeat this process until we have looked at all
closed frequent path prefixes for a node. This method is
very efficient, as the number of distinct short paths is quite small compared to long paths, and in real RFID applications conditional probabilities usually depend on short path segments, e.g., the transition to the return counter location
may depend on the time that an item stayed at quality control, but not on a very long combination of locations and
durations.
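The short-to-long principle can be organized as a level-wise loop over the closed conditioning prefixes, from more general to more specific, storing only entries that deviate from what is already recorded by at least ε. In the schematic sketch below the expected-value computation is stubbed out (in the real method it is Eq. (6) or (7)); the oracle functions and values are hypothetical.

```python
def record_deviations(closed_prefixes, observed, expected, epsilon):
    """Level-wise, short-to-long recording of conditional probability entries.
    closed_prefixes : conditioning prefixes (tuples of (location, duration) stages)
    observed(prefix): probability measured from the data for this prefix
    expected(prefix, recorded): probability predicted from the already-recorded,
                                more general entries (Eq. (6)/(7) in the paper)
    The unconditional entry is always kept; other entries are kept only when
    they deviate from the expected value by at least epsilon."""
    recorded = {}
    # more general prefixes first: fewer stages, then fewer concrete durations
    order = sorted(closed_prefixes,
                   key=lambda p: (len(p), sum(d is not None for _, d in p)))
    for prefix in order:
        obs = observed(prefix)
        if not prefix or abs(obs - expected(prefix, recorded)) >= epsilon:
            recorded[prefix] = obs
    return recorded

# Toy usage with stubbed oracles (hypothetical probabilities).
obs_table = {(): 0.40, (("l1", 3),): 0.42, (("l1", 3), ("l3", 1)): 0.70}
observed = lambda p: obs_table[p]
expected = lambda p, rec: rec.get((), obs_table[()])   # crude stand-in for Eq. (6)
print(record_deviations(list(obs_table), observed, expected, epsilon=0.1))
# {(): 0.4, (('l1', 3), ('l3', 1)): 0.7}
```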
4.2.1 Computing non-redundant conditional emission probabilities
At this point we have all the information required to compute the conditional emission probabilities given by Eq. (4)
for all nodes in the workflow.
When computing E(pref ix(x, i − 1), xi ) we require the
entire path x to be frequent and not only the pref ix(x, i−1)
part to be frequent. This makes sense as we want to compute
deviations in emission probability when enough supporting
evidence (including the emission duration itself) is available.
We need to determine which conditional probabilities are
redundant, and this can easily be done by arranging the
closed paths that are a prefix of each node in a lattice, and
traversing the lattice from more general paths to more specific paths, storing only probabilities that are found to be
non-redundant using Eq. (6).
4.2.2 Computing non-redundant conditional transition probabilities
We can use Eq. (5) to compute the conditional transition
probabilities for all nodes in the workflow.
When computing T(prefix(x, i−1), qi) we require prefix(x, i−1)(qi, ∗) to be frequent. This guarantees that we have
enough evidence of the conditional transition to qi .
The determination of redundant conditional transition probabilities follows exactly the same logic used for emission
probabilities in the previous section.
4.3 Algorithm
Based on the previous discussion we state algorithm 1,
which summarizes the complete set of steps required to output the compressed workflow.
Analysis. Algorithm 1 can be executed quite efficiently: in addition to the scans required by the frequent closed itemset mining algorithm, we need just two more scans, one to build the duration-independent workflow and another to collect the extra counts needed to compute conditional probability entries.
Figure 3: Compressed Workflow
Algorithm 1 Construct compressed workflow
Input: Clustered path database D, minimum conditional
probability deviation ², minimum support δ
Output: compressed workflow W
Method:
1: On a single pass over D construct the duration independent workflow W1 .
2: Using W1 compute D1 the path database with locations
encoded with nodes in W1 .
3: Call a frequent closed itemset mining algorithm with D1
as the transaction database and minimum support δ to
derive C the set of closed frequent paths.
4: Scan D once to collect the support of the extended frequent closed paths.
5: Construct W by annotating each node in W1 with the
set of non-redundant conditional emission and transition probabilities given the closed paths in C, and the
minimum conditional probability deviation ².
6: return W .
4.4 Computing the probability of a path in the compressed workflow
Eq. (1) can be used to compute the probability of a path
x when we have all possible conditional emission and transition probabilities. But when we have compressed the workflow by using ε as a threshold for redundant probabilities, and δ as the minimum support for recording a conditional probability, we cannot use this equation anymore, because
many of the required probabilities may be missing.
The solution is to define an approximation of P (x) called
P̂ (x), that uses Eqs. (6) and (7) for computing conditional
emission and transition probabilities respectively, whenever
the exact probabilities have not been explicitly recorded in
the compressed model. The path prefixes to use are the
most specific paths recorded in the model that are still more
general than x.
The probability for path x computed by P̂ (x) is an approximation to P (x) and is subject to error. The amount of
error will increase for larger values of ε and δ, and will also depend on the support of x. If P(x), that is, the true probability of x, is very low, then the error can be larger as the
computation of the probability of x may require conditional
probability entries that were not stored in the compressed
workflow because of not having enough support. In the experimental section we explore the conditions under which
error increases in more detail.
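One simple way to realize P̂(x) is to look up, for each stage, the most specific recorded conditioning entry that still generalizes the actual prefix, and fall back to the unconditional entry otherwise. The Python sketch below uses only the single best match instead of compounding several ancestors as Eqs. (6) and (7) do, so it is a simplification; all table contents are hypothetical.

```python
ANY = None  # wildcard duration '*'

def generalizes(pattern, prefix):
    """True if pattern has the same stages as prefix with some durations wildcarded."""
    return len(pattern) == len(prefix) and all(
        pl == ql and (pd is ANY or pd == qd)
        for (pl, pd), (ql, qd) in zip(pattern, prefix))

def best_entry(table, prefix, default):
    """Most specific recorded entry whose conditioning pattern generalizes prefix."""
    candidates = [(sum(d is not ANY for _, d in pat), p)
                  for pat, p in table.items() if generalizes(pat, prefix)]
    return max(candidates)[1] if candidates else default

def approx_path_probability(path, emission, transition, termination,
                            uncond_emission, uncond_transition):
    """P_hat(x): product of the (approximated) emission and transition
    probabilities of every stage, times the termination probability T(x, #),
    which is supplied directly by the caller for brevity."""
    p = 1.0
    for i, (loc, dur) in enumerate(path):
        prefix = path[:i]
        p *= best_entry(emission.get((loc, dur), {}), prefix, uncond_emission[(loc, dur)])
        p *= best_entry(transition.get(loc, {}), prefix, uncond_transition[loc])
    return p * termination

# Hypothetical recorded entries for a 2-stage path (l1,3)(l3,1).
uncond_emission = {("l1", 3): 0.6, ("l3", 1): 0.5}
uncond_transition = {"l1": 1.0, "l3": 0.54}
emission = {("l3", 1): {(("l1", 3),): 0.7}}       # one recorded deviation
transition = {}                                   # no transition deviations recorded
path = (("l1", 3), ("l3", 1))
print(approx_path_probability(path, emission, transition, 0.9,
                              uncond_emission, uncond_transition))
```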
5. PERFORMANCE STUDY
In this section, we perform a thorough analysis of our
model and algorithms. All experiments were implemented
using C++ and were conducted on an Intel Pentium IV
2.4GHz System with 1GB of RAM. The system ran Debian
Sarge with the Linux kernel 2.6.13.4 and gcc 4.0.2.
5.1 Data Synthesis
The RFID databases in our experiments were generated
using a random path generator that simulates the movement
of items through a set of locations in a manner that mimics
the operation of a large retailer. Locations have a multinomial transition probability distribution for the likelihood of
moving to a given next location or stopping, a multinomial
duration probability distribution for the length of stay of
items at the location, and a list of correlations with other
locations. We also associate a preferred level to locations,
to indicate the stage in which they tend to appear. For a
given location, transition probabilities are high for locations
at the next level and low for locations at any other level; this generates the effect that items tend to move forward in
the supply chain but that sometimes can still go backwards
or jump to a future location.
As a notational convenience, we use the following symbols to denote certain data set parameters. N: number of paths. ε: conditional probability threshold. δ: minimum conditional probability support.
5.2 Workflow Compression
One of the main advantages of using a compressed workflow with minimum support δ and minimum conditional probability deviation ε is that it captures the essence of the complete workflow but requires only a fraction of the space. In this section we compare the size of a compressed workflow with that of a complete workflow. The complete workflow is constructed by traversing the clustered path database and adding each path to a probabilistic prefix tree, while keeping track of the counts of items reaching each node and the transition counts. We measure the size S of a workflow by the total number of probability entries. The size of a compressed workflow is the sum of the number of probabilities in the conditional emission and conditional transition tables for each node. The size of a complete workflow is just the sum of the number of transition probabilities for each node.
Figure 4 shows the size of the compressed workflow for various levels of ε, for a data set containing 100,000 paths and using a support δ of 0.1%. In the figure we distinguish the counts of conditional and unconditional transition and emission probabilities. For this experiment the size of the complete workflow is 14,095 probability entries. As we decrease ε, the number of non-redundant conditional transition and emission probabilities increases. For ε = 0.2 we do not record any conditional probabilities and the size of the workflow is just the unconditional emission and transition probabilities; as we decrease ε, the percentage of the total workflow size accounted for by conditional probabilities increases from 0% to around 80%. The size of the compressed workflow is just 1.5% to 7.5% of the size of the complete workflow, so we observe very significant space savings even for small values of ε.
Compression power is even more dramatic when we increase the number of distinct paths. For this example we generated another data set containing every possible distinct path, with a count proportional to its likelihood given the location characteristics. The size of the complete workflow increased to 1,190,479 nodes, while the size of the compressed workflow increased only marginally. Compression power for this case increased to 10,000 to 1.

Figure 4: Compression vs. Conditional probability threshold (ε). N = 100,000, δ = 0.001

Figure 5 shows the size of the compressed workflow for the same data set used for Figure 4, with ε fixed at 5% and with varying support levels. The support parameter determines the number of frequent closed paths that need to be considered in constructing the conditional probability tables of each node; lower support levels tend to create more conditional probability entries. For this experiment we used support levels of 1% to 0.001%; as we decrease support, the percentage of the total workflow size corresponding to conditional probabilities increases from 0% to around 95%. As in the previous case compression is very significant: the size of the compressed workflow is just 1.5% to 4.5% of the size of the complete workflow.

Figure 5: Compression vs. Support (δ). N = 100,000, ε = 0.05

Figure 6 shows the size of the compressed workflow and the complete workflow for different sizes of the input clustered path database. We can see that while the size of the compressed workflow remains largely constant, the size of the complete workflow increases significantly with the number of paths. The reason is that in the compressed workflow we only need to update conditional emission and transition probability entries, while in the complete workflow new nodes need to be added for each new location/duration observed. In this experiment the error induced by compression remained constant for all database sizes at around 9%.

Figure 6: Clustered Path DB size vs. Workflow Size. ε = 0.05, δ = 0.001
5.3 Workflow Compression Error
In this section we look at the error incurred in computing the probability of a path using the compressed workflow as compared to the complete workflow. A compressed workflow with probability deviation ε and minimum support δ may not be able to assign the exact probability to every possible path, because it only records emission and transition probability deviations if they are supported by more than δ transactions and the deviation from the expected probability is larger than ε. The compressed workflow will assign probabilities that may differ from the true probability to uncommon paths that deviate from their expected probabilities (in absolute terms the deviation will still be small, as the path will have very low probability to begin with, but in percentual terms it can be large). On the other hand, more common paths, or path segments, will be assigned very precise probabilities, because their deviations will have enough supporting transactions. The experiments in this section correspond to the same data sets used to compute compressed workflow size in the previous section.
For the experiments in this section we use the percentual error measure⁴, which computes the percentual deviation in probability from the complete workflow, and is defined as

|P(path | compressed wf) − P(path | complete wf)| / P(path | complete wf)

⁴We also use the KL-divergence between the computed probabilities and the true probabilities; this measure mimics the percentual error.

Figure 7 shows the percentual error that a compressed workflow will assign to three sets of paths with varying levels of ε. The sets of paths used for testing have a probability of occurring (minimum probability) of 0.0001, 0.0002, and 0.0005 respectively. We can see that as we increase ε the error increases, and it increases faster for uncommon paths than for more common ones. In this case the common paths have an error of around 9% for every level of ε, while the other two sets go from 13% to 17% and 16% to 27% respectively. This means that even with a low ε we can compute the probability of more common paths and path segments with low error while storing only a small compressed workflow.

Figure 7: Percentual Error vs. Conditional probability threshold (ε). N = 100,000, δ = 0.001

Figure 8 shows the percentual error that a compressed workflow will assign to the same three sets of paths as before, but varying the support threshold δ used to compute the closed paths necessary for recording conditional probability entries. We can see that as we increase support, the percentual error grows, and faster for the path data sets with low probability. Common paths have an error of around 10% for every level of δ, while the other two sets go from 12% to 17% and 14% to 27% respectively. In this case we can also observe that using a small support threshold is enough to compute the probabilities of common paths, and path segments, with low error. As discussed in Section 3.2.4 we can use the slope of the error curves to determine good values of ε for a particular type of query load.

Figure 8: Error vs. Support (δ). N = 100,000, ε = 0.05

5.4 Construction Time

In this section we analyze the efficiency of the algorithm used to construct the compressed workflow. The main components of the algorithm are: (1) computation of the unconditional emission and transition probabilities, (2) computation of the closed paths for a given support δ, and (3) computation of the conditional emission and transition entries given the frequent closed paths and a conditional probability threshold ε. In our experiments we observed that the factor with the highest influence on computation time was the support threshold, given that it determines how many closed paths will be found, and we need to test each node for conditional dependencies against each of the closed paths.
Figure 9 shows the computation time of the conditional emission and transition probabilities for a clustered path database with 1,000,000 paths. We can see that as we increase support, the computation time decreases. For this experiment the time to learn the unconditional probability entries of the workflow was unaffected by the support threshold and remained constant at around 4 seconds. For mining closed patterns we used the program CLOSET+ [16], and the run time was around 2 seconds for each support level.

Figure 9: Construction Time vs. Support (δ). N = 1,000,000, ε = 0.05
6. RELATED WORK
Software research into management of RFID data is divided into three areas. The first is concerned with secure
collection and management of online tag related information
[11, 12]. The second is cleaning of errors present in RFID
data due to reader inaccuracies [7, 8]. The third is related to the creation of multi-dimensional warehouses capable
of providing OLAP operations over large RFID data sets [4,
3]. Our work is an extension of [3]. That paper introduces
the concept of a FlowCube which is a data cube constructed
on a path database where each cell records as its measure a
workflow, which the authors call FlowGraph. In this paper
we explore in detail the design of a concrete probabilistic
workflow that can be used in the FlowCube.
Workflow induction is an area of research close to our work [15]; it studies the problem of learning the structure of a workflow from event logs. [1] first introduced the problem of process mining and proposed a method to discover workflow structure, but for the most part their method assumes no duplicate activities and does not take activity duration
into account. [14] deals with time in workflow learning; it
first learns a workflow without time, and then replays the
log, computing statistics such as average, minimum, and
standard deviation for each node. This approach is not well
suited for RFID flows, where there may be strong interdependencies between the time distributions at different nodes
not captured by summary statistics at a single node.
Another area of research close to RFID workflow mining is that of grammar induction [2, 13]; the idea is to take
as input a set of strings and infer the probabilistic deterministic finite state automaton (PDFA) [6] that generated
the strings. This approach could be applied to a clustered
path database, by looking at each path as a string formed
by an alphabet composed of every possible combination of
location and duration; but as we have shown in the experimental section this approach is not well suited for RFID
data as it tends to generate huge workflows.
7. CONCLUSIONS
We have proposed a novel model for the design and construction of a highly compressed RFID workflow that captures the main trends and important deviations in item
movements in an RFID application. The compressed workflow records, for each node, conditional emission and transition probability tables that have an entry for each non-redundant and sufficiently supported deviation from the
main movement trend. This design takes advantage of the
fact that in many RFID applications items tend to move
according to a general trend, and have a relatively small
number of deviations from it. Our performance study shows
that the size of the compressed workflow is only a fraction
of that of a complete workflow, that it can be used to compute the probabilities of paths and path segments with high
precision, and that it can be computed efficiently.
The compressed workflow presented in our study can be
very useful in providing guidance to users in their analysis process, as it highlights general trends and exceptions
which may not be apparent from the raw data. This knowledge can be used to better understand the way objects move
through the supply chain, and optimize the logistics process
governing the most common flow trends, while exception
flow information can be used to uncover events that may for
example trigger a higher than expected return rate.
Notice that our proposal for workflow compression is based
on the assumption that commodities flow according to a general trend and only a few significant deviations are present
in the data. This fits a good number of RFID applications,
such as supply chain management. However, there are also
other applications where RFID data may not have such characteristics. We believe that further research is needed to
construct efficient workflow models for such applications.
8. REFERENCES
[1] R. Agrawal, D. Gunopulos, and F. Leymann. Mining
process models from workflow logs. In Proc. 1998 Int.
Conf. Extending Database Technology (EDBT’98), pages
469–483, Valencia, Spain, Mar. 1998.
[2] R. C. Carrasco and J. Oncina. Learning stochastic regular
grammars by means of a state merging method. In Proc.
1994 Int. Col. Grammatical Inference (ICGI’94), pages
139–152, Alicante, Spain, Sept. 1994.
[3] H. Gonzalez, J. Han, and X. Li. FlowCube: Constructing
RFID flowcubes for multi-dimensional analysis of
commodity flows. In Proc. 2006 Int. Conf. Very Large
Data Bases (VLDB’06), Seoul, Korea, Sept. 2006.
[4] H. Gonzalez, J. Han, X. Li, and D. Klabjan. Warehousing
and analysis of massive RFID data sets. In Proc. 2006 Int.
Conf. Data Engineering (ICDE’06), Atlanta, Georgia,
April 2006.
[5] J. Han, J. Wang, Y. Lu, and P. Tzvetkov. Mining top-k
frequent closed patterns without minimum support. In
Proc. 2002 Int. Conf. on Data Mining (ICDM’02), pages
211–218, Maebashi, Japan, Dec. 2002.
[6] J. Hopcroft and J. Ullman. Introduction to Automata
Theory, Languages and Computation. Addison Wesley,
Reading, Massachusetts, 1979.
[7] S. R. Jeffery, G. Alonso, and M. J. Franklin. Adaptive
cleaning for RFID data streams. In Technical Report
UCB/EECS-2006-29, EECS Department, University of
California, Berkeley, March 2006.
[8] S. R. Jeffery, G. Alonso, M. J. Franklin, W. Hong, and
J. Widom. A pipelined framework for online cleaning of
sensor data streams. In Proc. 2006 Int. Conf. Data
Engineering (ICDE’06), Atlanta, Georgia, April 2006.
[9] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal.
Discovering frequent closed itemsets for association rules.
In Proc. 7th Int. Conf. Database Theory (ICDT’99), pages
398–416, Jerusalem, Israel, Jan. 1999.
[10] L. R. Rabiner. A tutorial on hidden markov models and
selected applications in speech recognition. Proc. IEEE,
77:257–286, 1989.
[11] S. Sarma, D. L. Brock, and K. Ashton. The networked
physical world. In White paper, MIT Auto-ID Center,
http://archive.epcglobalinc.org/publishedresearch/MITAUTOID-WH-001.pdf,
2000.
[12] S. E. Sarma, S. A. Weis, and D. W. Engels. RFID systems,
security & privacy implications. In White paper, MIT
Auto-ID Center,
http://archive.epcglobalinc.org/publishedresearch/MITAUTOID-WH-014.pdf,
2002.
[13] F. Thollard, P. Dupont, and C. de la Higuera. Probabilistic DFA inference using Kullback-Leibler divergence and
minimality. In Proc. 2000 Int. Conf. Machine Learning
(ICML’00), pages 975–982, Stanford, CA, June 2000.
[14] W. van der Aalst and B. van Dongen. Discovering workflow
performance models from timed logs. In International
Conference on Engineering and Deployment of Cooperative
Information Systems (EDCIS 2002), pages 45–63, 2002.
[15] W. van der Aalst and T. Weijters. Process mining: A
research agenda. In Computers in Industry, pages 231–244,
2004.
[16] J. Wang, J. Han, and J. Pei. CLOSET+: Searching for the
best strategies for mining frequent closed itemsets. In Proc.
2003 ACM SIGKDD Int. Conf. Knowledge Discovery and
Data Mining (KDD’03), pages 236–245, Washington, DC,
Aug. 2003.