Model and Algebra for Genetic Information of Data*
Deyou Tang1,2, Jianqing Xi1,**, and Yubin Guo1
1 School of Computer Science and Engineering, South China University of Technology, Guangzhou, 510641, China
2 Department of Computer Science & Technology, Hunan Polytechnic University, Zhuzhou, 412008, China
[email protected], [email protected], [email protected]
Abstract. The genetic information of data is the evolution information about data in its lifecycle, and it undergoes heredity and mutation as data diffuses. Managing the genetic information of data is significant for information discrimination, auditing, tracing and lookup. This paper presents the Data Genome Model (DGM) and its algebra to manage the genetic information of data in its lifecycle, by analogy with the gene of life. A structure named data genome is used to carry the genetic information of data, and querying algebras are defined to manipulate the genetic information.
1 Introduction
In complex network environments, data is exchanged frequently, and this process creates huge amounts of information about the evolution of data. For example, when data is published, the information behind this operation includes the publisher, the publishing time, etc. This information has great significance for the incremental utilization of data.
Suppose some objectionable pictures are disseminated in a pure P2P network such as Gnutella [1]. To punish the person who initially issued the resource, the department concerned has to spend much time and manpower identifying the source step by step. The problem is that no evolution information can be retrieved from the data itself. However, if each item shared in Gnutella carried some adjunctive information about the evolution of data, such as the publisher, the propagation path, etc., investigators could do their job by analyzing this information, which would save much expense. Unfortunately, current data models usually focus on the content or the external characteristics of data, and little attention has been paid to the evolution of data.
Metadata are the data about data in applications. But metadata only express the static properties of data and do not describe the evolution of data.
The concept of “temporal” is used in temporal database to describe history information about data [2]. TXPath model [3] extended the XPath data model to model the history of semi-structured data. The GEM model [4] is a graphical semi-structured temporal data model. These temporal data models use the history information of data effectively, but they don’t express the relevancy of different data.
* This work is supported by the China Ministry of Education (Grant NoD4109029), Guangdong High-Tech Program (G03B2040770), and Guangdong Natural Science Foundation (B6480598).
** Correspondence author.
M. Ali and R. Dapoigny (Eds.): IEA/AIE 2006, LNAI 4031, pp. 1071 – 1079, 2006.
© Springer-Verlag Berlin Heidelberg 2006
Adoc [5] is a new programming model targeted at collaborative business process integration applications. The concept of “smart distance” was proposed to interweave various dynamic and heterogeneous elements of information systems on the basis of Adoc [6]. ABO was presented to model the lifecycle of the business entity [7]. However, these methods do not model the evolution of data.
Information Lifecycle Management (ILM) aims at improving system resource utilization and maximizing information value automatically for fixed-content reference data [8][9]. Unfortunately, ILM does not take into account the value of the evolution information created in the data lifecycle.
In this paper, we present Data Genome Model (DGM) to manage the information
created in the evolution of data, which is called genetic information. A series of concepts like data gene, data gene sequence and data genome are introduced to carry the
genetic information hierarchically. Data genome algebras are presented to query the
genetic information stored in data genomes.
2 Data Genome Model
In biology, the genetic information of life determines the variety and lineage of species,
and a genome is the complete set of genetic information that an organism possesses.
The genetic information of parents transmits to offspring along with the evolution of
life, and different species have different genetic information.
In the evolution of data, information about the evolutionary process is created, including the creator, creation time and descriptive keywords when data is created; the publisher, publishing time and receivers when data is published; and the correlation with other data. This kind of information closely correlates with data: it is created when data is created and deleted when data is deleted. At the same time, it changes with the evolution of data and differs from phase to phase in the lifecycle of data. Furthermore, it is transmitted to other data, partially or fully, in the evolutionary process. As information created in the evolution of data has the above features, we think data has genetic information too [10]. In this paper, we call this information the genetic information of data and put forward a concept called data genome to carry it, with the genome in biology as reference.
Suppose A is an attribute name of data or of an operation, and V is its value. We use the concepts of data gene, data gene sequence and data genome to carry genetic information for data hierarchically. A data gene carries a piece of genetic information for data and describes some characteristics of data under a given condition. A data gene sequence carries the genetic information of a “segment” in the evolution of data. A data genome is “all-sided”: it represents the evolution of data and carries the genetic information for the whole data.
Definition 1 (Data gene). A data gene fragment F is a 2-tuple: F = (A, V). A data gene is the positive closure of F: D = F+ = (A, V)+.
Data gene fragment is the minimal unit of genetic information, and the selection of attributes and their domains depends on the application. A data gene can describe any characteristic of data; for example, it may be a description of a browse operation or the metadata at a given time. For any data, a whole impression of the data at any moment is encoded in the leading data gene.
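As an illustration of Definition 1, a data gene can be sketched as a non-empty set of (attribute, value) pairs. The names `Fragment`, `Gene` and `make_gene` are hypothetical and not part of the paper, and treating a data gene as an unordered set is an assumption chosen to match the set notation of the later algebra.

```python
from typing import FrozenSet, Tuple

# A data gene fragment F = (A, V): an attribute name with its value.
Fragment = Tuple[str, str]
# A data gene D = F+: one or more fragments (assumed unordered in this sketch).
Gene = FrozenSet[Fragment]


def make_gene(*fragments: Fragment) -> Gene:
    """Build a data gene; the positive closure requires at least one fragment."""
    if not fragments:
        raise ValueError("a data gene needs at least one fragment")
    return frozenset(fragments)


# A gene describing a forward operation, using values mentioned for Figure 1.
forward = make_gene(
    ("Forwarder", "Mike"), ("Acceptor", "Jim"), ("Date", "06/19/2005")
)
```

A frozen set keeps each gene hashable, so genes can themselves be stored in sets or compared for equality when duplicates have to be removed.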
Definition 2 (Leading data gene). A leading data gene carries the instant information about data as a whole, denoted as Da.
Da is created when data is created; it holds the global information about data such as identifier, title, creator, and so on.
Definition 3 (Data gene sequence). Data gene sequence is a 2-tuple, S=<$sid, D+>,
where $sid is the identifier of the sequence, D+ is the positive closure of data genes that
describes the same characteristics of data.
Usually, we need many data gene sequences to describe the genetic information of data
perfectly, and each data gene sequence encodes part of genetic information. The most
important data gene sequence is the main data gene sequence.
Definition 4 (Main data gene sequence). A main data gene sequence Sm is a data gene
sequence, which describes the global characteristics of data and its evolution.
Sm includes the leading data gene and data genes carrying the history of those data gene
fragments in the leading data gene.
Definition 5 (Data genome). Data genome DG is a 3-tuple: DG = <$gid, S+, DG′*>, where $gid is the identifier of the data genome, S+ is a positive closure of data gene sequences, DG′ is a component data genome of DG and DG′* is a Kleene closure of DG′.
In Definition 5, $gid must be globally unique. S+ has at least one data gene sequence, Sm, and each sequence in it must have a different identifier. The component data genome set DG′* carries genetic information for partial data retrieved from other data sources. A component data genome DG′ is a data genome mutated from the data genome of correlated data. In addition, DG′ has only a main data gene sequence, and that main data gene sequence has only a leading data gene; the same holds for the component data genomes in DG′. Furthermore, for each data genome in DG′*, there is a sequence in S+ to record the correlation, whose identifier is equal to the identifier of the component data genome.
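The constraints of Definitions 3 and 5 can be sketched as a small data structure. This is only one possible reading: the class names `GeneSequence` and `DataGenome` and the `validate` method are hypothetical, the main data gene sequence is assumed to share the identifier of its data genome, and the attribute values are illustrative.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Tuple

Gene = FrozenSet[Tuple[str, str]]


@dataclass
class GeneSequence:
    sid: str            # $sid, identifier of the sequence
    genes: List[Gene]   # D+, at least one data gene


@dataclass
class DataGenome:
    gid: str                                   # $gid, assumed globally unique
    sequences: Dict[str, GeneSequence]         # S+, keyed by sequence identifier
    components: List["DataGenome"] = field(default_factory=list)  # DG'*

    def validate(self) -> None:
        # S+ must contain the main data gene sequence (assumed keyed by $gid).
        if self.gid not in self.sequences:
            raise ValueError("missing main data gene sequence")
        for sid, seq in self.sequences.items():
            if sid != seq.sid or not seq.genes:
                raise ValueError(f"malformed sequence {sid}")
        # Each component genome needs a sequence that records the correlation.
        for comp in self.components:
            if comp.gid not in self.sequences:
                raise ValueError(f"no correlation sequence for {comp.gid}")
            comp.validate()


da = frozenset({("Title", "Test"), ("Creator", "Tom"), ("Identifier", "001")})
dg2 = DataGenome(
    "$gid2", {"$gid2": GeneSequence("$gid2", [frozenset({("Title", "Test0")})])}
)
dg = DataGenome(
    "$gid1",
    {
        "$gid1": GeneSequence("$gid1", [da]),
        "$gid2": GeneSequence(
            "$gid2", [frozenset({("Action", "Correlate"), ("Operator", "Tom")})]
        ),
    },
    [dg2],
)
dg.validate()
```

Keying the sequences by identifier makes the "different identifier" requirement hold by construction, so `validate` only has to check the remaining constraints.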
Figure 1 is a reference implementation of a data genome for a text file. It is composed of six data gene sequences and two component data genomes. Sequence $gid1 (herein, each sequence mentioned belongs to data genome dg) is composed of a leading data gene and two ordinary data genes. From the leading data gene in sequence $gid1, we can get a whole impression of Test. The other two data genes in sequence $gid1 denote the history of the corresponding data gene fragments in the leading data gene. Sequence $sid3 carries the genetic information about the editing history of the data. Data genes in sequence $sid6 indicate that Tom forwarded this data to Mike and Jane on 06/18/2005 and Mike forwarded this data to Jim on 06/19/2005. The data gene in sequence $sid7 indicates that Jim browsed this data on 06/20/2005. Sequence $gid2 corresponds to component data genome dg2. The only data gene in sequence $gid2 denotes that Tom correlated the component data genome to data genome $gid01 on
06/15/2005. Data gene sequence $sid5 carries the information of operating on the leading data gene.
[Figure 1: data genome dg with the data gene sequences $gid1, $sid3, $sid5, $sid6, $sid7 and $gid2, their data genes, and the component data genomes dg2 and dg4.]
Fig. 1. The structure of data gene, data gene sequence and data genome¹
Component data genome dg2 is a sub data genome of data genome of
data Test0, which means that part of the genetic information of Test0 is transmitted to Test.
Proposition 1. Each main data gene sequence has a unique leading data gene.
The leading data gene carries genetic information of data that resembles metadata. Though all component data genomes have their own leading data genes, these are local, and there is only one leading data gene belonging to the whole data. Usually, this gene has the most complex structure of all data genes in a data genome.
Proposition 2. Each data genome has a unique main data gene sequence.
The main data gene sequence records the global information of data and its evolution.
Comparatively speaking, the main data gene sequence of each component data genome
carries the genetic information for part of the data.
Proposition 3. Each data has a unique data genome to carry its genetic information,
and the data genome can be represented as a tree with the data genome as the root,
data gene sequences as the leaf nodes and component data genomes as branch nodes.
The simplest data genome has only two nodes, the root and a leaf, when represented as
a tree. Once data correlates to other data, a data genome will be inserted into the
component data genome set. Since each data genome has a unique main data gene
sequence, we also use the identifier of the data genome to identify the main data gene
sequence.
The definitions and propositions presented above give a framework for managing the adjunctive information of data created in its evolution. Since the type of data may vary from structured (e.g. relational data) to semi-structured (e.g. XML, code collections) to unstructured (e.g. Word documents), we leave the modeling of the behaviours of such types of data to users and give only the framework.
¹ 1) Dashed lines are used to explain the structure of each data gene. 2) For simplicity, data genes and the $sid of each sequence in each data genome are displayed in a table, and different background colors are used to distinguish them.
3 Data Genome Algebra Operations
The data genome algebra includes eight operators: union, intersection, difference, select, project, iterate, filter and join, which in turn are the basis for designing the genetic information querying language. As the data gene is the basic operation unit in DGM, the data genome algebra operations are implemented on the basis of algebra operations for data genes, which include union, intersection, difference and project. We first introduce some symbols that are used in the remainder of this paper.
Let G be the set of data genes, g, gi, gj ∈ G. Let DG be the set of data genomes, dg, dgi, dgj, dg′ ∈ DG. ID(s) denotes the identifier of data gene sequence s, and T(s) denotes the set of data genes of s. S(dg) denotes the set of data gene sequences in dg, Sm(dg) denotes the main data gene sequence of dg, and ψ(dg) denotes the component data genome set of dg. A(g) denotes the attribute set of data gene g. μa(f) denotes the attribute of data gene fragment f.
Definition 6 (Algebra operations for data gene)
Union, ∪
$$g_i \cup g_j = \{ f \mid f \in g_i \vee f \in g_j \}$$
Intersection, ∩
$$g_i \cap g_j = \{ f \mid f \in g_i \wedge f \in g_j \}$$
Difference, −
$$g_i - g_j = \{ f \mid f \in g_i \wedge f \notin g_j \}$$
Project, π
$$\pi_{A'}(g) = \{ f \mid f \in g \wedge \mu_a(f) \in A' \}$$
Obviously, the result of union, intersection and difference is a set of data gene fragments. Besides the algebra operations for data genes, two other operations also form the basis of the algebra operations for data genomes.
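The four data gene operations of Definition 6 map directly onto set operations. The sketch below assumes a data gene is represented as a frozen set of (attribute, value) pairs; the function names and the sample values are illustrative and not taken from the paper.

```python
from typing import FrozenSet, Iterable, Tuple

Fragment = Tuple[str, str]
Gene = FrozenSet[Fragment]


def union(gi: Gene, gj: Gene) -> Gene:
    """Fragments that occur in gi or in gj."""
    return gi | gj


def intersection(gi: Gene, gj: Gene) -> Gene:
    """Fragments that occur in both gi and gj."""
    return gi & gj


def difference(gi: Gene, gj: Gene) -> Gene:
    """Fragments of gi that do not occur in gj."""
    return gi - gj


def project(g: Gene, attrs: Iterable[str]) -> Gene:
    """Fragments of g whose attribute belongs to the attribute set A'."""
    wanted = set(attrs)
    return frozenset(f for f in g if f[0] in wanted)


gi = frozenset({("Editor", "Tom"), ("Date", "06/18/2005")})
gj = frozenset({("Forwarder", "Tom"), ("Date", "06/18/2005")})
shared = intersection(gi, gj)
only_editor = project(gi, ["Editor"])
```

Note that a fragment is shared only when both its attribute and its value agree, which follows from treating fragments as whole 2-tuples.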
Definition 7 (Set operations of the main data gene sequence). Suppose si and sj are different main data gene sequences, and I(s) denotes the leading data gene in the main data gene sequence s. Then
$$s' = s_i \ast s_j = \langle \text{\$sid}, T(s') \rangle \quad \text{where } \ast \in \{\cup, \cap, -\} \text{ and}$$
$$\begin{cases} I(s') = I(s_i) \ast I(s_j) \\ T(s') = (T(s_i) - \{I(s_i)\}) \ast (T(s_j) - \{I(s_j)\}) \cup \{I(s')\} \end{cases}$$
Definition 8 (Concatenation of data gene sequences). Suppose si and sj are data gene sequences. Given the concatenating condition f,
$$s' = s_i \mathbin{\infty_f} s_j = \langle \text{\$sid}, T(s') \rangle$$
where
$$T(s') = \bigcup\, (g_i \cup g_j) \mid g_i \in T(s_i) \wedge g_j \in T(s_j) \wedge f(g_i, g_j) = 1$$
The set operation for the main data gene sequence is defined for the set operations of
data genomes, while the concatenation operation is defined for the join operation of
data genomes.
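The concatenation of Definition 8 can be sketched as a nested loop over gene pairs. The sketch reads T(s′) as the collection of merged genes gi ∪ gj for every pair that satisfies f, which is one interpretation of the definition; `concatenate` and `same_person` are hypothetical names, and the sample genes are illustrative.

```python
from typing import Callable, FrozenSet, List, Tuple

Gene = FrozenSet[Tuple[str, str]]


def concatenate(
    si: List[Gene], sj: List[Gene], f: Callable[[Gene, Gene], bool]
) -> List[Gene]:
    """Merge every pair (gi, gj) that satisfies the concatenating condition f."""
    result: List[Gene] = []
    for gi in si:
        for gj in sj:
            merged = gi | gj
            if f(gi, gj) and merged not in result:
                result.append(merged)
    return result


def same_person(gi: Gene, gj: Gene) -> bool:
    # Hypothetical condition: the editor in gi is the creator in gj.
    editor = dict(gi).get("Editor")
    return editor is not None and editor == dict(gj).get("Creator")


edits = [
    frozenset({("Editor", "Tom"), ("Date", "06/18/2005")}),
    frozenset({("Editor", "Jim"), ("Date", "06/20/2005")}),
]
creators = [frozenset({("Creator", "Tom"), ("Title", "T2")})]
joined = concatenate(edits, creators, same_person)
```

Only the first edit gene is paired with the creator gene here, so `joined` holds a single merged gene carrying the fragments of both.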
Definition 9 (Algebra operations for data genome). Let s′ = Sm(dg′), si = Sm(dgi), sj = Sm(dgj).
Union, ∪
$$dg' = dg_i \cup dg_j = \langle \text{\$gid}', S(dg'), \psi(dg') \rangle$$
where
$$\begin{cases} s' = s_i \cup s_j \\ S(dg') = (S(dg_i) - \{s_i\}) \cup (S(dg_j) - \{s_j\}) \cup \{s'\} \\ \psi(dg') = \psi(dg_i) \cup \psi(dg_j) \end{cases}$$
Intersection, ∩
$$dg' = dg_i \cap dg_j = \langle \text{\$gid}', S(dg'), \psi(dg') \rangle$$
where
$$\begin{cases} s' = s_i \cap s_j \\ S(dg') = (S(dg_i) - \{s_i\}) \cap (S(dg_j) - \{s_j\}) \cup \{s'\} \\ \psi(dg') = \psi(dg_i) \cap \psi(dg_j) \end{cases}$$
Difference, −
$$dg' = dg_i - dg_j = \langle \text{\$gid}', S(dg'), \psi(dg') \rangle$$
where
$$\begin{cases} s' = s_i - s_j \\ S(dg') = (S(dg_i) - \{s_i\}) - (S(dg_j) - \{s_j\}) \cup \{s'\} \\ \psi(dg') = \psi(dg_i) - \psi(dg_j) \end{cases}$$
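Select, δ
$$\delta_{sc}(dg) = dg' = \langle \text{\$gid}', S(dg'), \phi \rangle$$
where
$$S(dg') = \{ s \mid s \in S(dg) \wedge sc(s) = \text{true} \}$$
Project, π
$$\pi_{A'}(dg) = dg' = \langle \text{\$gid}', S(dg'), \phi \rangle$$
where
$$S(dg') = \{ s' \mid s' = \{ \pi_{A'}(g) \mid g \in s \wedge A' \subseteq A(g) \} \wedge s \in S(dg) \}$$
Iterate, D
$$D_{a[\text{max-iteration}]}(dg) = dg' = \langle \text{\$gid}', \{s'\}, \phi \rangle$$
where
$$T(s') = \{ g \mid g \in \bigcup s \wedge s \in S(dg) \wedge a \in A(g) \} \cup \left( \bigcup S(D_{a[\text{max-iteration}-1]}(dg')) \mid dg' \notin \psi(dg) \right)$$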
Filter, ε
$$\varepsilon_f(dg) = \bigcup\, dg' \mid dg' \in \psi(dg) \wedge f(dg') = 1$$
Join, ><
$$dg' = dg_i \bowtie_f dg_j = \langle \text{\$gid}', S(dg'), \psi(dg') \rangle$$
where
$$\begin{cases} S(dg') = \bigcup\, (s_1 \mathbin{\infty_f} s_2) \mid s_1 \in S(dg_i) \wedge s_2 \in S(dg_j) \\ \psi(dg') = \psi(dg_i) \cup \psi(dg_j) \end{cases}$$
The result of each operator above is a temporary data genome. The operator δ selects from the data gene sequence set of dg the data gene sequences that satisfy the condition sc. Generally, we use this operator to select one data gene sequence from the data genome by specifying a suitable condition. If no condition is specified, this operation returns a data genome without any component data genome.
The operator π selects the data genes whose attribute set is a superset of the given attribute set, gets rid of the redundant attributes and eliminates the repeated data genes for each data gene sequence in S(dg). If no attribute set is specified, this operator returns a null data genome. The operator D recursively traverses the data genome for the specified attribute set a. If no iteration limit is specified, this operation traverses the whole data genome. If the iteration limit is zero, the temporary data genome has one data gene sequence consisting of the data genes in S(dg) with the given attributes. The operator ε selects component data genomes from the data genome with the given filter condition f. If no condition is specified, this operator returns the union of all component data genomes in ψ(dg).
In the join operation, f is a formula specifying the join predicate, which is in conjunctive form. Usually, it has the form c1 ∧ c2 ∧ ... ∧ cn, where each ci (i = 1..n) has the generic form A1 ⊗ A2, A1 and A2 are attributes belonging to dgi and dgj respectively, and ⊗ is a comparison operator. If the operator ⊗ is “=”, then this operation is an equi-join.
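The select and project operators described above can be sketched over a simplified data genome that keeps only its sequence set S(dg). The representation (a dictionary from sequence identifier to a list of genes), the function names and the `has_editor` condition are assumptions made for illustration; component data genomes are omitted because both operators return a result without them.

```python
from typing import Callable, Dict, FrozenSet, List, Set, Tuple

Gene = FrozenSet[Tuple[str, str]]
Sequences = Dict[str, List[Gene]]  # S(dg): sequence identifier -> data genes


def select(seqs: Sequences, sc: Callable[[List[Gene]], bool]) -> Sequences:
    """Keep the data gene sequences for which the condition sc holds."""
    return {sid: genes for sid, genes in seqs.items() if sc(genes)}


def project(seqs: Sequences, attrs: Set[str]) -> Sequences:
    """Keep genes that have every attribute in attrs, reduced to those attributes."""
    if not attrs:
        return {}  # no attribute set given: null result
    result: Sequences = {}
    for sid, genes in seqs.items():
        kept: List[Gene] = []
        for g in genes:
            if attrs <= {a for a, _ in g}:
                reduced = frozenset(f for f in g if f[0] in attrs)
                if reduced not in kept:  # eliminate repeated data genes
                    kept.append(reduced)
        result[sid] = kept
    return result


def has_editor(genes: List[Gene]) -> bool:
    # Hypothetical select condition: the sequence records edit operations.
    return any("Editor" in dict(g) for g in genes)


seqs: Sequences = {
    "$sid3": [frozenset({("Editor", "Tom"), ("Date", "06/18/2005")})],
    "$sid7": [frozenset({("Browser", "Jim"), ("Date", "06/20/2005")})],
}
edited = project(select(seqs, has_editor), {"Editor", "Date"})
```

Following Definition 9, `project` keeps an entry for every input sequence, even when no gene in it carries the requested attributes; dropping such empty sequences would be an equally plausible design.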
Suppose the structure of the data genome for a document is defined as the data genome in Figure 1. Herein, we give some examples of queries.
Example 1. To query the editing history in 2005, including the operator and operation time:
$$\pi_{\text{Editor}, \text{Action}, \text{Date}}(\delta_{\text{Date} > 12/31/2004 \,\wedge\, \text{Date} < 01/01/2006}(dg))$$
Example 2. To list all data sources related to Test:
$$\pi_{\text{Title}}(D_{\text{Identifier}, \text{Title}}(dg))$$
Example 3. To list the creator and creation time of the first correlated data of Test:
$$\pi_{\text{Creator}, \text{Time}}(\varepsilon_{\text{\$gid} = \text{\$gid2}}(dg))$$
Example 4. To list the common correlated data:
$$\pi_{\text{Title}}(D_{\text{Identifier}, \text{Title}}(dg_1)) \cap \pi_{\text{Title}}(D_{\text{Identifier}, \text{Title}}(dg_2))$$
Example 5. To list the edit history of T1 (suppose its data genome is dg1) in which the operator is the creator of T2 (suppose its data genome is dg2):
$$\pi_{\text{Editor}, \text{Action}, \text{Date}}(dg_1 \bowtie_{dg_1.\text{Editor} = dg_2.\text{Creator}} dg_2)$$
Definition 9 presents data genome algebras to support the query of genetic information.
In fact, we can organize data genomes into a data genome base and perform data mining
on it. For example, if we organize data genomes with the same data identifier retrieved
from the P2P network into a data genome base, we can mine out the number of contributors, track the distribution, trace the propagation path of each copy, etc.
4 Usages and Applications of DGM
The motivation of DGM is to manage information created in the evolution of data. Due to limited space, this paper cannot elaborate on the manipulation of data genomes and the simulation of their evolution along with the evolution of data. The details will be reported in other papers; here we only give a brief introduction to its usage and list some applications.
A data genome can be attached to any type of data and applied in any information system that is able to support DGM. For example, when we create a text file in an editor, the editor should create a data genome for the file accordingly. Once the editor saves data, a data gene carrying the operation information is inserted into the corresponding data genome (such as D4 in Figure 1). Moreover, if the file (suppose its data genome is dg) correlates to another file (suppose its data genome is dg′), the editor should mutate dg′ into a component data genome of dg, and insert a new data gene sequence whose identifier equals the identifier of the main data gene sequence of dg′ into the data gene sequence set of dg to express the correlation (such as dg2 and $gid2 in Figure 1).
DGM can be used in information tracking, data mining, information integration, the semantic web, and so on. Take information tracking as an example: we can find the source and the propagation path of data instantly using the algebras defined in the previous section or special tools based on data genomes. E.g., we can use πForwarder, Acceptor(dg) to extract the forwarders and their acceptors in Figure 1 and construct a propagation graph based on the result, as shown in Figure 2. From Figure 2, we know that Tom is the creator of file Test and that he transmits the file to Mike, and Mike subsequently transmits the file to Jim.
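The construction of a propagation graph from projected forward genes can be sketched as follows. The function name `propagation_graph` and the adjacency-list representation are assumptions; the sample genes restate the forward operations described for Figure 1, with one gene holding both acceptors of the same forward.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, List, Tuple

Gene = FrozenSet[Tuple[str, str]]


def propagation_graph(genes: Iterable[Gene]) -> Dict[str, List[str]]:
    """Map each forwarder to the acceptors that received the data from it."""
    graph: Dict[str, List[str]] = defaultdict(list)
    for g in genes:
        forwarders = [v for a, v in g if a == "Forwarder"]
        acceptors = sorted(v for a, v in g if a == "Acceptor")
        for fw in forwarders:
            for ac in acceptors:
                if ac not in graph[fw]:
                    graph[fw].append(ac)
    return dict(graph)


# Genes as they might look after projecting on Forwarder and Acceptor.
forwards = [
    frozenset({("Forwarder", "Tom"), ("Acceptor", "Mike"), ("Acceptor", "Jane")}),
    frozenset({("Forwarder", "Mike"), ("Acceptor", "Jim")}),
]
graph = propagation_graph(forwards)
```

Walking the resulting adjacency lists from Tom reaches Mike and Jane directly and Jim through Mike, which is the path the text reads off Figure 2.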
[Figure 2: propagation graph with the nodes Tom, Jane, Mike and Jim.]
Fig. 2. A propagation graph constructed by Jim
5 Conclusions and Future Works
This paper presents DGM and its algebras to model and manipulate the evolution of data. However, the research on DGM is just under way, and no parallel work has been found yet. This model may not be perfect yet, and much work has to be done to improve it.
Currently, we are working on the user interfaces of data genome and implementing a file exchange system that uses DGM as the basis for tracking the lineage of the different files appearing in the system. The programming interfaces of data genome include the manipulation interfaces and the query interfaces. The data genome manipulation interfaces include creating a data genome for any type of data, evolving a data genome to synchronize with the evolution of data, etc. Though we present the algebras of data genome in this paper, we are sure a friendly programming interface is necessary.
In addition, our work can be extended in several ways. First, as we have only given a conceptual model for the genetic information of data to date, we plan to design a storage model for DGM soon. Second, we think the number of data gene sequences in a data genome differs for different types of data such as text files, applications, databases, etc., so we need to study the metadata of data genome for different types of data. Third, beyond a query language, we need to provide tools to mine the value of genetic information, such as an information tracking system based on DGM, an information retrieval system with data genome, etc. Moreover, extending DGM to support services is also a promising direction.
References
1. Gnutella. http://gnutella.wego.com.
2. A. Tansel, J. Clifford, S. Gadia, S. Jajodia, A. Segev, and R.T. Snodgrass, eds., Temporal Databases: Theory, Design, and Implementation. Benjamin/Cummings, 1994
3. http://www.w3.org/TR/xpath20/
4. C. Combi, B. Oliboni, and E. Quintarelli. A Graph-Based Data Model to Represent Transaction Time in Semistructured Data. DEXA 2004: 559-568
5. S. Kumaran, P. Nandi, T. Heath, K. Bhaskaran, and R. Das. ADoc-Oriented Programming.
SAINT 2003: 334-343
6. Y.M. Ye, P. Nandi, and S. Kumaran, Smart Distance for Information Systems: The Concept.
IEEE Computational Intelligence Bulletin 2(1), 2003: 25-30
7. P. Nandi, and S. Kumaran. Adaptive Business Objects - A new Component Model for
Business Integration. ICEIS (3), 2005: 179-188
8. D. Reiner, G. Press, M. Lenaghan, D. Barta and R. Urmston, Information lifecycle management: the EMC perspective. Proceedings of the 20th International Conference on Data
Engineering, 2004:804 - 807
9. Y. Chen. Information Valuation for Information Lifecycle Management, Proceedings of the
Second International Conference on Autonomic Computing (ICAC’05), 2005:135 – 146
10. J.Q. Xi, D.Y. Tang, and Y.B. Guo, Data Gene: the Genetic Information Carrier of Data.
Computer Engineering (In Chinese, in press).