Model and Algebra for Genetic Information of Data*

Deyou Tang1,2, Jianqing Xi1,**, and Yubin Guo1

1 School of Computer Science and Engineering, South China University of Technology, Guangzhou, 510641, China
2 Department of Computer Science & Technology, Hunan Polytechnic University, Zhuzhou, 412008, China
[email protected], [email protected], [email protected]

Abstract. The genetic information of data is the information about how data evolves over its lifecycle; it is inherited and mutated as the data diffuses. Managing the genetic information of data is important for information discrimination, auditing, tracing and lookup. This paper presents the Data Genome Model (DGM) and its algebra to manage the genetic information of data throughout its lifecycle, by analogy with the genes of living organisms. A structure named data genome carries the genetic information of data, and a query algebra is defined to manipulate it.

1 Introduction

In complex networked environments, data is exchanged frequently, and this process creates a huge amount of information about how the data evolves. For example, when data is published, the information behind this operation includes the publisher, the publishing time, etc. This information is of great value for the incremental utilization of data. Suppose some objectionable pictures are disseminated in a pure P2P network such as Gnutella [1]. To punish the person who first released the resource, the responsible department has to spend a long time and considerable manpower identifying the source step by step. The problem is that no evolution information can be retrieved from the data itself. However, if each item shared in Gnutella carried some adjunct information about its evolution, such as the publisher and the propagation path, investigators could do their job by analyzing this information, which would save much expense.
* This work is supported by the China Ministry of Education (Grant No. D4109029), Guangdong High-Tech Program (G03B2040770), and Guangdong Natural Science Foundation (B6480598).
** Corresponding author.
M. Ali and R. Dapoigny (Eds.): IEA/AIE 2006, LNAI 4031, pp. 1071-1079, 2006. © Springer-Verlag Berlin Heidelberg 2006

Unfortunately, current data models usually focus on the content or the external characteristics of data and pay little attention to its evolution. Metadata are data about data in applications, but metadata only expresses the static properties of data and does not describe its evolution. Temporal databases use the concept of “temporal” to describe history information about data [2]. The TXPath model [3] extends the XPath data model to model the history of semi-structured data. The GEM model [4] is a graphical semi-structured temporal data model. These temporal data models use the history information of data effectively, but they do not express the relevance between different data. Adoc [5] is a programming model targeted at collaborative business process integration applications. The concept of “smart distance” was proposed to interweave the various dynamic and heterogeneous elements of an information system on the basis of Adoc [6]. ABO was presented to model the lifecycle of business entities [7]. However, those methods do not model the evolution of data. Information Lifecycle Management (ILM) aims at improving system resource utilization and maximizing information value automatically for fixed-content reference data [8][9]. Unfortunately, ILM does not take into account the value of the evolution information created during the data lifecycle.

In this paper, we present the Data Genome Model (DGM) to manage the information created in the evolution of data, which we call genetic information.
A series of concepts such as data gene, data gene sequence and data genome are introduced to carry the genetic information hierarchically. A data genome algebra is presented to query the genetic information stored in data genomes.

2 Data Genome Model

In biology, the genetic information of life determines the variety and lineage of species, and a genome is the complete set of genetic information that an organism possesses. The genetic information of parents is transmitted to their offspring as life evolves, and different species have different genetic information.

As data evolves, information about the evolutionary process is created: the creator, creation time and descriptive keywords when data is created; the publisher, publishing time and receivers when data is published; and the correlation with other data. This kind of information is closely tied to the data: it is created when the data is created and deleted when the data is deleted. It also changes as the data evolves and differs between phases of the data lifecycle. Furthermore, it is transmitted to other data, partially or fully, in the evolutionary process. Because the information created in the evolution of data has these features, we consider that data has genetic information too [10]. In this paper, we call this information the genetic information of data and, taking the genome in biology as a reference, put forward a concept called data genome to carry it.

Suppose A is an attribute name of data or of an operation, and V is its value. We use the concepts of data gene, data gene sequence and data genome to carry the genetic information of data hierarchically. A data gene carries a piece of genetic information and describes some characteristics of data under a given condition. A data gene sequence carries the genetic information of a “segment” in the evolution of data.
A data genome is “all-sided”: it represents the whole evolution of data and carries the genetic information for the data as a whole.

Definition 1 (Data gene). A data gene fragment F is a 2-tuple \(F = (A, V)\); a data gene is the positive closure of F: \(D = F^{+} = (A, V)^{+}\).

The data gene fragment is the minimal unit of genetic information, and the selection of attributes and their domains depends on the application. A data gene can describe any characteristic of data; for example, it may describe a browse operation or the metadata at a given time. For any data, an overall impression of the data at any moment is encoded in the leading data gene.

Definition 2 (Leading data gene). A leading data gene carries the instant information about data as a whole, denoted as \(D_a\). \(D_a\) is created when the data is created, and it holds the global information about the data such as identifier, title, creator, and so on.

Definition 3 (Data gene sequence). A data gene sequence is a 2-tuple \(S = \langle \$sid, D^{+} \rangle\), where $sid is the identifier of the sequence and \(D^{+}\) is the positive closure of data genes that describe the same characteristics of data.

Usually, many data gene sequences are needed to describe the genetic information of data completely, and each data gene sequence encodes part of the genetic information. The most important data gene sequence is the main data gene sequence.

Definition 4 (Main data gene sequence). A main data gene sequence \(S_m\) is a data gene sequence that describes the global characteristics of data and its evolution. \(S_m\) includes the leading data gene and the data genes carrying the history of the data gene fragments in the leading data gene.

Definition 5 (Data genome). A data genome DG is a 3-tuple \(\langle \$gid, S^{+}, DG'^{*} \rangle\), where $gid is the identifier of the data genome, \(S^{+}\) is the positive closure of data gene sequences, DG′ is a component data genome of DG, and \(DG'^{*}\) is the Kleene closure of DG′.

In Definition 5, $gid must be globally unique.
\(S^{+}\) contains at least one data gene sequence, \(S_m\), and all sequences in it must have different identifiers. The component data genome set \(DG'^{*}\) carries genetic information for the parts of the data retrieved from other data sources. A component data genome DG′ is a data genome mutated from the data genome of the correlated data. In addition, DG′ contains only a main data gene sequence, and that main data gene sequence contains only a leading data gene; the same holds for the component data genomes within DG′. Furthermore, for each data genome in \(DG'^{*}\) there is a sequence in \(S^{+}\) that records the correlation, and its identifier equals the identifier of the component data genome.

Figure 1 is a reference implementation of a data genome for a text file. It is composed of six data gene sequences and two component data genomes. Sequence $gid1 (herein, each sequence mentioned belongs to data genome dg) is composed of a leading data gene and two ordinary data genes. From the leading data gene in sequence $gid1, we can get an overall impression of Test. The other two data genes in sequence $gid1 record the history of the corresponding data gene fragments in the leading data gene. Sequence $sid3 carries the genetic information about the editing history of the data. The data genes in sequence $sid6 indicate that Tom forwarded this data to Mike and Jane on 06/18/2005 and that Mike forwarded it to Jim on 06/19/2005. The data gene in sequence $sid7 indicates that Jim browsed this data on 06/20/2005. Sequence $gid2 corresponds to component data genome dg2. The only data gene in sequence $gid2 denotes that Tom correlated the component data genome to data genome $gid01 on 06/15/2005. Data gene sequence $sid5 carries the information about operations on the leading data gene.

[Figure 1: data genome dg ($gid1) with data gene sequences $gid1 (Sm), $sid3, $sid5, $sid6, $sid7 and $gid2, and component data genomes dg2 ($gid2) and dg4 ($gid4).]
Fig. 1. The structure of data gene, data gene sequence and data genome

Component data genome dg2 is a sub data genome of the data genome of data Test0, which means that part of the genetic information of Test0 is transmitted to Test.

Proposition 1. Each main data gene sequence has a unique leading data gene.

The leading data gene carries genetic information that resembles metadata. Although every component data genome has its own leading data gene, those genes are local, and only one leading data gene belongs to the data as a whole. Usually, this gene has the most complex structure of all data genes in a data genome.

Proposition 2. Each data genome has a unique main data gene sequence.

The main data gene sequence records the global information of data and its evolution. By comparison, the main data gene sequence of each component data genome carries the genetic information for only part of the data.

Proposition 3. Each data has a unique data genome to carry its genetic information, and the data genome can be represented as a tree with the data genome as the root, data gene sequences as the leaf nodes and component data genomes as branch nodes.

The simplest data genome has only two nodes, the root and a leaf, when represented as a tree.
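Definitions 1 to 5 and Proposition 3 can be sketched as a small data structure. The following is a minimal illustration only, not the authors' implementation: the class names, the choice of a dict for a data gene (one value per attribute), the convention that the leading data gene is stored first in the main sequence, and the sample values (loosely based on Fig. 1) are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Assumption: a data gene is a dict of attribute -> value, i.e. one value
# per attribute within a gene. Each (attribute, value) item is a fragment.
Gene = Dict[str, str]


@dataclass
class GeneSequence:
    sid: str            # $sid
    genes: List[Gene]   # D+, at least one data gene

    def __post_init__(self) -> None:
        if not self.genes:
            raise ValueError("a data gene sequence needs at least one data gene")


@dataclass
class DataGenome:
    gid: str                                                       # $gid
    sequences: List[GeneSequence]                                  # S+
    components: List["DataGenome"] = field(default_factory=list)  # DG'*

    def __post_init__(self) -> None:
        sids = [s.sid for s in self.sequences]
        if len(set(sids)) != len(sids):
            raise ValueError("sequence identifiers must be distinct")
        # The main data gene sequence is identified by the genome identifier.
        if self.gid not in sids:
            raise ValueError("the main data gene sequence is missing")
        # Each component needs a sequence that records the correlation.
        for comp in self.components:
            if comp.gid not in sids:
                raise ValueError(f"no sequence records the correlation with {comp.gid}")

    @property
    def main_sequence(self) -> GeneSequence:
        return next(s for s in self.sequences if s.sid == self.gid)

    @property
    def leading_gene(self) -> Gene:
        # Assumption: the leading data gene is stored first in the main sequence.
        return self.main_sequence.genes[0]

    def node_count(self) -> int:
        # Tree view of Proposition 3: root + sequence leaves + component subtrees.
        return 1 + len(self.sequences) + sum(c.node_count() for c in self.components)


# Hypothetical sample data loosely based on Fig. 1.
dg2 = DataGenome("$gid2", [GeneSequence("$gid2", [{"Title": "Test0", "Identifier": "002"}])])
dg = DataGenome(
    "$gid1",
    [
        GeneSequence("$gid1", [{"Title": "Test", "Creator": "Tom", "Identifier": "001"}]),
        GeneSequence("$gid2", [{"Action": "Correlate", "Operator": "Tom"}]),
    ],
    components=[dg2],
)
```

Here `dg2` is the simplest possible genome (a root and one leaf), while `dg` adds a correlation sequence and one component, so its tree has five nodes.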
Once data correlates to other data, a data genome is inserted into the component data genome set. Since each data genome has a unique main data gene sequence, we also use the identifier of the data genome to identify the main data gene sequence.

The definitions and propositions presented above give a framework for managing the adjunct information of data created in its evolution. Since the type of data may vary from structured (e.g. relational data) to semi-structured (e.g. XML, code collections) to unstructured (e.g. Word documents), we provide only the framework and leave the modeling of the behaviours of each type of data to users.

Notes to Fig. 1: (1) Dashed lines are used to explain the structure of each data gene. (2) For simplicity, the data genes and the $sid of each sequence in each data genome are displayed in a table, and different background colors are used to distinguish them.

3 Data Genome Algebra Operations

The data genome algebra includes eight operators: union, intersection, difference, select, project, iterate, filter and join. These in turn form the basis for designing a query language for genetic information. As the data gene is the basic operation unit in DGM, the data genome algebra operations are implemented on top of the algebra operations for data genes, which include union, intersection, difference and project.

We first introduce some symbols that are used in the remainder of this paper. Let G be the set of data genes, \(g, g_i, g_j \in G\). Let DG be the set of data genomes, \(dg, dg_i, dg_j, dg' \in DG\). ID(s) denotes the identifier of data gene sequence s, and T(s) denotes the set of data genes of s. S(dg) denotes the set of data gene sequences in dg, \(S_m(dg)\) denotes the main data gene sequence of dg, and \(\psi(dg)\) denotes the component data genome set of dg. A(g) denotes the attribute set of data gene g, and \(\mu_a(f)\) denotes the attribute of data gene fragment f.
Definition 6 (Algebra operations for data gene).

Union, ∪:
\[ g_i \cup g_j = \{\, f \mid f \in g_i \vee f \in g_j \,\} \]

Intersection, ∩:
\[ g_i \cap g_j = \{\, f \mid f \in g_i \wedge f \in g_j \,\} \]

Difference, −:
\[ g_i - g_j = \{\, f \mid f \in g_i \wedge f \notin g_j \,\} \]

Project, π:
\[ \pi_{A'}(g) = \{\, f \mid f \in g \wedge \mu_a(f) \in A' \,\} \]

Obviously, the result of union, intersection and difference is a set of data gene fragments. Besides the algebra operations for data genes, two other operations also underlie the algebra operations for data genomes.

Definition 7 (Set operations of the main data gene sequence). Suppose \(s_i\) and \(s_j\) are different main data gene sequences, and I(s) denotes the leading data gene in the main data gene sequence s. Then
\[ s' = s_i * s_j = \langle \$sid, T(s') \rangle, \qquad * \in \{\cup, \cap, -\}, \]
where
\[ \begin{cases} I(s') = I(s_i) * I(s_j) \\ T(s') = (T(s_i) - \{I(s_i)\}) * (T(s_j) - \{I(s_j)\}) \cup \{I(s')\} \end{cases} \]

Definition 8 (Concatenation of data gene sequences). Suppose \(s_i\) and \(s_j\) are data gene sequences. Given the concatenating condition f,
\[ s' = s_i \mathbin{\infty_f} s_j = \langle \$sid, T(s') \rangle, \]
where
\[ T(s') = \bigcup \{\, (g_i \cup g_j) \mid g_i \in T(s_i) \wedge g_j \in T(s_j) \wedge f(g_i, g_j) = 1 \,\} \]

The set operation for the main data gene sequence is defined for the set operations of data genomes, while the concatenation operation is defined for the join operation of data genomes.

Definition 9 (Algebra operations for data genome). Let \(s' = S_m(dg')\), \(s_i = S_m(dg_i)\), \(s_j = S_m(dg_j)\).
Union, ∪:
\[ dg' = dg_i \cup dg_j = \langle \$gid', S(dg'), \psi(dg') \rangle \]
where
\[ \begin{cases} s' = s_i \cup s_j \\ S(dg') = (S(dg_i) - \{s_i\}) \cup (S(dg_j) - \{s_j\}) \cup \{s'\} \\ \psi(dg') = \psi(dg_i) \cup \psi(dg_j) \end{cases} \]

Intersection, ∩:
\[ dg' = dg_i \cap dg_j = \langle \$gid', S(dg'), \psi(dg') \rangle \]
where
\[ \begin{cases} s' = s_i \cap s_j \\ S(dg') = (S(dg_i) - \{s_i\}) \cap (S(dg_j) - \{s_j\}) \cup \{s'\} \\ \psi(dg') = \psi(dg_i) \cap \psi(dg_j) \end{cases} \]

Difference, −:
\[ dg' = dg_i - dg_j = \langle \$gid', S(dg'), \psi(dg') \rangle \]
where
\[ \begin{cases} s' = s_i - s_j \\ S(dg') = (S(dg_i) - \{s_i\}) - (S(dg_j) - \{s_j\}) \cup \{s'\} \\ \psi(dg') = \psi(dg_i) - \psi(dg_j) \end{cases} \]

Select, δ:
\[ \delta_{sc}(dg) = dg' = \langle \$gid', S(dg'), \emptyset \rangle \]
where
\[ S(dg') = \{\, s \mid s \in S(dg) \wedge sc(s) = \text{true} \,\} \]

Project, π:
\[ \pi_{A'}(dg) = dg' = \langle \$gid', S(dg'), \emptyset \rangle \]
where
\[ S(dg') = \{\, s' \mid s' = \{\, \pi_{A'}(g) \mid g \in s \wedge A' \subseteq A(g) \,\} \wedge s \in S(dg) \,\} \]

Iterate, D:
\[ D_{a[\text{max-iteration}]}(dg) = dg' = \langle \$gid', \{s'\}, \emptyset \rangle \]
where
\[ T(s') = \{\, g \mid g \in \textstyle\bigcup s \wedge s \in S(dg) \wedge a \in A(g) \,\} \cup \Bigl( \bigcup S\bigl(D_{a[\text{max-iteration}-1]}(dg')\bigr) \Bigm| dg' \in \psi(dg) \Bigr) \]

Filter, ε:
\[ \varepsilon_f(dg) = \bigcup \{\, dg' \mid dg' \in \psi(dg) \wedge f(dg') = 1 \,\} \]

Join, ⋈:
\[ dg' = dg_i \bowtie_f dg_j = \langle \$gid', S(dg'), \psi(dg') \rangle \]
where
\[ \begin{cases} S(dg') = \bigcup \{\, (s_1 \mathbin{\infty_f} s_2) \mid s_1 \in S(dg_i) \wedge s_2 \in S(dg_j) \,\} \\ \psi(dg') = \psi(dg_i) \cup \psi(dg_j) \end{cases} \]

The result of each operator above is a temporary data genome.

The operator δ selects from the data gene sequence set of dg the data gene sequences that satisfy the condition sc. Generally, we use this operator to select one data gene sequence from the data genome by specifying a suitable condition. If no condition is specified, this operation returns a data genome without any component data genome.

The operator π picks out the data genes whose attribute set is a superset of the given attribute set, discards the attributes that were not requested and eliminates duplicate data genes, for each data gene sequence in S(dg). If no attribute set is specified, this operator returns a null data genome.
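The data gene operations of Definition 6 and the select and project operators just described can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: a data gene is modelled as a frozenset of (attribute, value) fragments, a genome's sequence set as a dict from $sid to a list of genes, sequences left empty by a projection are dropped, and all function names and sample values (including the `Action="Modify"` value) are hypothetical.

```python
from typing import Callable, Dict, FrozenSet, List, Set, Tuple

Fragment = Tuple[str, str]          # (attribute, value)
Gene = FrozenSet[Fragment]          # a data gene as a set of fragments
Sequences = Dict[str, List[Gene]]   # $sid -> data genes of that sequence


def gene(**attrs: str) -> Gene:
    """Build a data gene from attribute=value pairs."""
    return frozenset(attrs.items())


def attributes(g: Gene) -> Set[str]:
    """A(g): the attribute set of a data gene."""
    return {a for a, _ in g}


def project_gene(g: Gene, attrs: Set[str]) -> Gene:
    """Gene-level project: keep the fragments whose attribute is in attrs."""
    return frozenset(f for f in g if f[0] in attrs)


# Gene-level union, intersection and difference are the frozenset
# operators |, & and -, since a gene is a set of fragments here.


def select(seqs: Sequences, sc: Callable[[str, List[Gene]], bool]) -> Sequences:
    """Select: keep the sequences that satisfy the condition sc."""
    return {sid: genes for sid, genes in seqs.items() if sc(sid, genes)}


def project(seqs: Sequences, attrs: Set[str]) -> Sequences:
    """Project: keep genes covering attrs, trim them, drop duplicates."""
    if not attrs:
        return {}  # no attribute set given: empty result
    result: Sequences = {}
    for sid, genes in seqs.items():
        kept: List[Gene] = []
        for g in genes:
            if attrs <= attributes(g):
                trimmed = project_gene(g, attrs)
                if trimmed not in kept:
                    kept.append(trimmed)
        if kept:  # assumption: sequences left empty are dropped
            result[sid] = kept
    return result


# Hypothetical sample genes in the spirit of Fig. 1.
edit1 = gene(Editor="Tom", Action="Modify", Date="06/16/2005")
edit2 = gene(Editor="Tom", Action="Modify", Date="06/18/2005")
browse = gene(Browser="Jim", Date="06/20/2005")
seqs = {"$sid3": [edit1, edit2], "$sid7": [browse]}

common = edit1 & edit2        # fragments shared by both edits
only_first = edit1 - edit2    # fragments only in the first edit
editors = project(select(seqs, lambda sid, _: sid == "$sid3"), {"Editor"})
```

Composing `select` and `project` this way mirrors the nesting of δ inside π: the two edit genes collapse into a single `Editor` gene because duplicates are eliminated after projection.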
The operator D recursively traverses the data genome for the specified attribute set a. If no iteration limit is specified, this operation traverses the whole data genome. If the iteration limit is zero, the temporary data genome has one data gene sequence consisting of the data genes in S(dg) that have the given attributes.

The operator ε selects component data genomes from the data genome according to the given filter condition f. If no condition is specified, this operator returns the union of all component data genomes in \(\psi(dg)\).

In the join operation, f is a formula in conjunctive form that specifies the join predicate. Usually, it has the form \(c_1 \wedge c_2 \wedge \dots \wedge c_n\), where each \(c_i\) (i = 1..n) has the generic form \(A_1 \otimes A_2\), \(A_1\) and \(A_2\) are attributes belonging to \(dg_i\) and \(dg_j\) respectively, and ⊗ is a comparison operator. If the operator ⊗ is “=”, the operation is an equi-join.

Suppose the structure of the data genome for a document is defined as the data genome in figure 1. We give some examples of queries.

Example 1. To query the editing history in 2005, including the operator and the operation time:
\[ \pi_{Editor,\,Action,\,Date}(\delta_{Date > 12/31/2004 \,\wedge\, Date < 01/01/2006}(dg)) \]

Example 2. To list all data sources related to Test:
\[ \pi_{Title}(D_{Identifier,\,Title}(dg)) \]

Example 3. To list the creator and creation time of the first correlated data of Test:
\[ \pi_{Creator,\,Time}(\varepsilon_{\$gid = \$gid2}(dg)) \]

Example 4. To list the common correlated data:
\[ \pi_{Title}(D_{Identifier,\,Title}(dg_1)) \cap \pi_{Title}(D_{Identifier,\,Title}(dg_2)) \]

Example 5. To list the edit history of T1 (suppose its data genome is \(dg_1\)) in which the operator is the creator of T2 (suppose its data genome is \(dg_2\)):
\[ \pi_{Editor,\,Action,\,Date}(dg_1 \bowtie_{dg_1.Editor = dg_2.Creator} dg_2) \]

Definition 9 presents the data genome algebra to support queries on genetic information. In fact, we can organize data genomes into a data genome base and perform data mining on it.
For example, if we organize the data genomes with the same data identifier retrieved from a P2P network into a data genome base, we can mine the number of contributors, track the distribution, trace the propagation path of each copy, etc.

4 Usages and Applications of DGM

The motivation of DGM is to manage the information created in the evolution of data. Due to space limits, this paper cannot elaborate on the manipulation of data genomes or on how a data genome evolves along with its data. The details will be reported in other papers; here we only give a brief introduction to its usage and list some applications.

A data genome can be attached to any type of data and applied in any information system that supports DGM. For example, when we create a text file in an editor, the editor should create a data genome for the file. Each time the editor saves the data, a data gene carrying the operation information is inserted into the corresponding data genome (such as D4 in figure 1). Moreover, if the file (suppose its data genome is dg) correlates to another file (suppose its data genome is dg′), the editor should mutate dg′ into a component data genome of dg, and insert into the data gene sequence set of dg a new data gene sequence whose identifier equals the identifier of the main data gene sequence of dg′, to express the correlation (such as dg2 and $gid2 in figure 1).

DGM can be used in information tracking, data mining, information integration, the semantic web, and so on. Take information tracking as an example: we can find the source and the propagation path of data instantly using the algebra defined in the previous section or special tools based on data genomes. E.g., we can use \(\pi_{Forwarder,\,Acceptor}(dg)\) to pick out the forwarders and their acceptors in figure 1 and construct a propagation graph from the result, as shown in figure 2.
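The construction of a propagation graph from projected forwarding genes can be sketched as follows. This is only an illustration: the function names are hypothetical, each forwarding gene is assumed to hold exactly one `Forwarder` and one `Acceptor`, and the sample genes follow the forwarding events described for sequence $sid6 in Section 2.

```python
from collections import defaultdict
from typing import Dict, List

Gene = Dict[str, str]


def propagation_graph(genes: List[Gene]) -> Dict[str, List[str]]:
    """Build forwarder -> acceptors adjacency lists from forwarding genes."""
    graph: Dict[str, List[str]] = defaultdict(list)
    for g in genes:
        # Genes without both attributes are not forwarding records.
        if "Forwarder" in g and "Acceptor" in g:
            if g["Acceptor"] not in graph[g["Forwarder"]]:
                graph[g["Forwarder"]].append(g["Acceptor"])
    return dict(graph)


def paths_from(graph: Dict[str, List[str]], source: str) -> List[List[str]]:
    """Enumerate propagation paths from source by depth-first search."""
    paths: List[List[str]] = []

    def walk(node: str, path: List[str]) -> None:
        # Skip nodes already on the path so cycles cannot loop forever.
        successors = [n for n in graph.get(node, []) if n not in path]
        if not successors:
            paths.append(path)
        for n in successors:
            walk(n, path + [n])

    walk(source, [source])
    return paths


# Forwarding genes as described for sequence $sid6 (one acceptor per gene
# is an assumption of this sketch).
forwarding = [
    {"Forwarder": "Tom", "Acceptor": "Mike"},
    {"Forwarder": "Tom", "Acceptor": "Jane"},
    {"Forwarder": "Mike", "Acceptor": "Jim"},
]
graph = propagation_graph(forwarding)
paths = paths_from(graph, "Tom")
```

With these three genes the graph has Tom pointing to Mike and Jane, and Mike pointing to Jim, so the paths from Tom are Tom, Mike, Jim and Tom, Jane.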
From figure 2, we know that Tom is the creator of file Test and that he transmits the file to Mike, and Mike subsequently transmits it to Jim.

[Figure 2: propagation graph with nodes Tom, Jane, Mike and Jim.]
Fig. 2. A propagation graph constructed by Jim

5 Conclusions and Future Works

This paper presents DGM and its algebra to model and manipulate the evolution of data. However, research on DGM has only just begun, and we have found no parallel work so far. The model is not yet complete, and much work remains to improve it. Currently, we are working on the user interfaces of the data genome and implementing a file exchange system that uses DGM as the basis for tracking the lineage of the files that appear in the system. The programming interfaces of the data genome include manipulation interfaces and query interfaces. The manipulation interfaces include creating a data genome for any type of data, evolving the data genome in step with the evolution of the data, etc. Although we present the data genome algebra in this paper, we believe a friendly programming interface is also necessary.

In addition, our work can be extended in several ways. First, as we have so far given only a conceptual model for the genetic information of data, we plan to design a storage model for DGM in the near future. Second, we think the number of data gene sequences in a data genome differs between types of data such as text files, applications and databases, so we need to study the metadata of the data genome for different types of data. Third, beyond a query language, we need to provide tools that exploit the value of genetic information, such as an information tracking system based on DGM or an information retrieval system with data genomes. Moreover, extending DGM to support services is also a promising direction.

References

1. Gnutella. http://gnutella.wego.com.
2. A. Tansel, J. Clifford, S. Gadia, S. Jajodia, A. Segev, and R.T.
Snodgrass, eds., Temporal Databases: Theory, Design, and Implementation. Benjamin/Cummings, 1994.
3. http://www.w3.org/TR/xpath20/
4. C. Combi, B. Oliboni, and E. Quintarelli. A Graph-Based Data Model to Represent Transaction Time in Semistructured Data. DEXA 2004: 559-568.
5. S. Kumaran, P. Nandi, T. Heath, K. Bhaskaran, and R. Das. ADoc-Oriented Programming. SAINT 2003: 334-343.
6. Y.M. Ye, P. Nandi, and S. Kumaran. Smart Distance for Information Systems: The Concept. IEEE Computational Intelligence Bulletin 2(1), 2003: 25-30.
7. P. Nandi and S. Kumaran. Adaptive Business Objects - A New Component Model for Business Integration. ICEIS (3), 2005: 179-188.
8. D. Reiner, G. Press, M. Lenaghan, D. Barta, and R. Urmston. Information Lifecycle Management: The EMC Perspective. Proceedings of the 20th International Conference on Data Engineering, 2004: 804-807.
9. Y. Chen. Information Valuation for Information Lifecycle Management. Proceedings of the Second International Conference on Autonomic Computing (ICAC’05), 2005: 135-146.
10. J.Q. Xi, D.Y. Tang, and Y.B. Guo. Data Gene: the Genetic Information Carrier of Data. Computer Engineering (in Chinese, in press).