Knowledge Discovery from XML Document
Based On Queries
SAUMYA P¹, MAYA MATHEW²
¹ PG Student, Department of CSE, KMCT College of Engineering, Calicut, Kerala, India
E-mail: [email protected]
² Asst Prof, Department of CSE, KMCT College of Engineering, Calicut, Kerala, India
E-mail: [email protected]

Abstract: Extensible Markup Language (XML) has gained popularity as an efficient format for representing and
exchanging data. XML does not require a fixed schema when it is used as a storage standard, which makes documents
without one difficult to query and an interesting subject of research. In this paper, recurrent correlations among
semi-structured data are found by applying data mining techniques. We propose an approach that extracts association
rules giving intentional, summarized information about the structure and content of an XML document. The approach
is called Branch Oriented Rules. Queries on the XML document are answered quickly and approximately by using these
mined rules.
Index Terms— XML, intentional answers, association rules, Branch Oriented Rules (BORs).
I. INTRODUCTION
Large volumes of digital information, such as business
transactions and online shopping records, are now available
on the web. Because these datasets are large, the answer
returned for a query is often large as well. Such data are
commonly stored in Extensible Markup Language (XML) [1], a
format that does not require a fixed structure or schema.
XML has become important on the internet and is widely used
because it lets applications built on different platforms
communicate. Its nested, self-describing nature makes it a
flexible and popular means of storing and sharing data
between heterogeneous platforms.
Data mining is used to extract interesting knowledge from
data stored in databases or data warehouses. The mined
knowledge can be represented in various forms, such as
clusters, decision trees, and association rules. Association
rules have proved to be an effective tool for discovering
interesting relations among data. Three different methods
can be used to find association rules in XML documents. The
first mines rules between XML document tags. The second
finds associations in the element content. The third
discovers frequent substructures in the XML document.
Information can be retrieved from an XML document through
either keyword-based search or query-based search. Keyword
search, which looks for exact matches of the search word, is
the traditional approach. Query-based requests are harder to
process: the fast and concise retrieval of answers from an
XML document to the user query is a tough problem. Mining an
XML document is more challenging than mining a traditional
relational database because its structure and semantics are
not fixed. The structure of an XML document need not be
declared in advance, for example via a DTD or an XML
schema [2].
II. OBJECTIVES
Querying an XML document that has no fixed schema is
difficult for two main reasons: either a huge amount of
information is returned as the result of the query, which is
information overload [3], or the answer contains no useful
content, which is information deprivation [4]. These
situations arise because the user is overwhelmed by the
quantity of data or is unable to formulate a well-structured
query. It is therefore useful to provide users with the
characteristics and structure of the document. This
knowledge can be extracted efficiently from an XML document
by a data mining technique called association rule mining
[5]. The knowledge discovery process [6] thus sets a new
trend in mining information from an XML document in the form
of its contents along with its structure.
The purpose of this paper is to derive rules called Branch
Oriented Rules (BORs), which capture summarized intentional
knowledge mined from XML datasets. These rules are stored in
XML format so that they can be queried as an alternative to
the original dataset, providing fast and concise results
[7]. The method works directly on the XML document without
converting it to any other form. XQuery [8] is used to query
the XML document. User queries on the original document are
transformed into queries on the file of mined rules, the BOR
file, to provide the answers. In this paper Branch Oriented
Rules are proposed for representing recurrent information in
an XML document [9]. This intentional knowledge helps users
understand the content of the document by providing them
with frequent patterns from the dataset. BORs can also be
used when the original document is lost, since they retain
the information extracted from it. BORs thus provide an
abstract level of intentional information about the XML
document [10].
III. EXISTING SYSTEM
To our knowledge, no existing approach has studied the
problem of ranking results according to their relevance.
Determining the intention behind a search that relies on a
keyword-based query is not easy, because the conditions that
a keyword search can match are not unique. Measuring the
probability of each candidate search intention and ranking
the individual matches of all these candidates is therefore
a challenging task.
V. BRANCH ORIENTED RULES
An XML document stores data in a tree structure, which makes
mining it harder than mining a traditional database.
Association rule mining was first proposed for conventional
databases and was later applied to XML mining. Association
rules describe the co-occurrence of data items in a large
collection of information and are written as implications
X => Y, where X and Y are two arbitrary sets such that
X ∩ Y = ∅. The quality of an association rule is usually
measured by its support and confidence. Support is the
frequency of the set X ∪ Y in the dataset, while confidence
is the conditional probability of finding Y having found X,
and is given by supp(X ∪ Y)/supp(X) [11].
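The two measures can be illustrated with a short Python sketch. It is only a toy example under stated assumptions: the dataset is a hypothetical list of label sets, support is normalized to a fraction of the dataset, and the helper names `support` and `confidence` are ours rather than part of any library.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(x, y, transactions):
    """Conditional frequency of Y given X: support of X and Y together over support of X."""
    x, y = frozenset(x), frozenset(y)
    return support(x | y, transactions) / support(x, transactions)


# Hypothetical toy data: each transaction is a set of labels.
transactions = [
    frozenset({"theft", "knife"}),
    frozenset({"theft", "knife"}),
    frozenset({"theft", "pistol"}),
    frozenset({"fraud"}),
]

# Rule {theft} => {knife}
rule_support = support({"theft", "knife"}, transactions)
rule_confidence = confidence({"theft"}, {"knife"}, transactions)
```

Here two of the four transactions contain both labels and three contain "theft", so the rule holds in two of the three cases where its antecedent occurs.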
IV. PROPOSED SYSTEM
The proposed knowledge discovery framework performs data
mining on XML to obtain intentional knowledge. The
intentional knowledge mined consists of association rules
with their support and confidence. This knowledge is written
in XML form. In other words, the result is a set of BORs
(Branch Oriented Rules).
Fig 2. A Sample XML File: “Record.xml”
The support and confidence of a BOR X => Y are:
Support = Count(Y, D) / Cardinality(D)
Confidence = Count(Y, D) / Count(X, D)
where Cardinality(D) is the total number of nodes in the
XML document D, and Count(Y, D) and Count(X, D) denote the
number of occurrences of the structures Y and X in D.
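As a rough illustration of these two ratios, the sketch below counts nodes in a tiny document with Python's standard `xml.etree.ElementTree` module. The document, its tag names, and the restriction of a "structure" to a single parent-child pair are all simplifying assumptions made for the example, not the representation used by the paper.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of a crime-record document; tag names are assumptions.
DOC = """
<records>
  <record><type>theft</type><weapon>knife</weapon></record>
  <record><type>theft</type></record>
  <record><type>fraud</type></record>
</records>
"""


def cardinality(root):
    """Total number of element nodes in the document (Cardinality(D))."""
    return sum(1 for _ in root.iter())


def count_label(root, label):
    """Number of nodes carrying `label`; plays the role of Count(X, D)."""
    return sum(1 for _ in root.iter(label))


def count_parent_child(root, parent, child):
    """Number of `parent` nodes with at least one direct `child`; plays the role of Count(Y, D)."""
    return sum(1 for node in root.iter(parent) if node.find(child) is not None)


root = ET.fromstring(DOC)
n_nodes = cardinality(root)
count_x = count_label(root, "record")
count_y = count_parent_child(root, "record", "weapon")

support = count_y / n_nodes
confidence = count_y / count_x
```

With X taken as "a record node" and Y as "a record node with a weapon child", one of the three records matches Y, and the document has eight element nodes in total.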
Fig 1. System Framework
The framework is shown in Fig 1. The XML file is given as
input to the parser, and the parsed XML file is passed to
the data mining subsystem, which is responsible for frequent
structure generation and BOR extraction. The generated BOR
file, which is itself an XML file, is used by the query
processor. This module takes an XML query on the original
XML file from the end user, translates it into a query on
the BOR file, and uses the mined knowledge to answer the
query quickly.
Fig 2 shows an example XML document, record.xml, which
contains nodes such as country, type of crime, ballistic
item used for the crime, and date of reporting. The idea of
mining knowledge in the form of association rules to get
summarized information about an XML document has been
realized either by using XQuery techniques in the XML
context or by implementing graph-based or tree-based
algorithms. The rules that are extracted and stored must
satisfy a minimum support and a minimum confidence. The
minimum support (minsupport) and minimum confidence
(minconfidence) are given before the rules are extracted.
The lower the value of minsupport, the larger the set of
frequent structures that is extracted. The main steps of
association rule mining are finding the frequent subtrees of
the XML document whose support is greater than a given
threshold, and computing from those frequent subtrees the
rules that satisfy a minimum confidence.
Fig 3. Frequent Sub tree Mining
Algorithm 1 takes as input an XML document D together with
the user-specified support, minsupport, and the
user-specified confidence, minconfidence. The output is the
list of rules built from frequent subtrees whose support is
not less than minsupport and whose confidence is not less
than minconfidence. Algorithm 1 first finds the frequent
structures and then hands each of them to the function that
generates all possible rules.
Algorithm 1:
Generate-Rules (D, minsupport, minconfidence)
1. Ffilter=FilterFrequentSubtree(D, minsupport)
2. Rule_List=∅
3. For all s ϵ Ffilter do
4. Temporary_List=Get_Rules(s, minconfidence)
5. Rule_List= Rule_List∪Temporary_List
6. endFor
7. return Rule_List
Get_Rules(s, minconfidence)
1. Rule_List=∅
2. Black_List=∅
3. For all c, subtree of s do
4. if c is not a subtree of any element in Black_List then
5. confidence =supp(s)/supp(c)
6. if confidence ≥ minconfidence then
7. NewRule=(c, s, confidence, supp(s))
8. Rule_List= Rule_List∪{NewRule}
9. else
10. Black_List= Black_List∪{c}
11. endif
12. endif
13. endfor
14. return Rule_List
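The control flow of Generate-Rules and Get_Rules can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation. It assumes that a structure is represented as a frozenset of branch paths (so "is a subtree of" becomes "is a subset of"), that the support counts are already available in a dictionary `supp` (standing in for FilterFrequentSubtree), and that candidate antecedents are visited from largest to smallest so that the black list can prune safely. All names and data are hypothetical.

```python
from itertools import combinations


def proper_subtrees(s):
    """Yield the proper, non-empty sub-structures of s, largest first."""
    items = sorted(s)
    for size in range(len(items) - 1, 0, -1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def get_rules(s, supp, minconfidence):
    """Build rules c => s for every sub-structure c of s that is confident enough."""
    rules, black_list = [], []
    for c in proper_subtrees(s):
        # A smaller antecedent has support at least as large, hence confidence
        # no higher, so anything contained in a failed antecedent is skipped.
        if any(c <= b for b in black_list):
            continue
        conf = supp[s] / supp[c]
        if conf >= minconfidence:
            rules.append((c, s, conf, supp[s]))
        else:
            black_list.append(c)
    return rules


def generate_rules(supp, minsupport, minconfidence):
    """Keep the frequent structures and collect the rules derived from each."""
    frequent = [s for s, n in supp.items() if n >= minsupport]
    rules = []
    for s in frequent:
        rules.extend(get_rules(s, supp, minconfidence))
    return rules


# Hypothetical occurrence counts for three structures.
A, B = "record/type", "record/weapon"
supp = {
    frozenset({A}): 10,
    frozenset({B}): 4,
    frozenset({A, B}): 4,
}

rules = generate_rules(supp, minsupport=3, minconfidence=0.8)
```

In this toy input the antecedent {A} yields confidence 4/10 and is black-listed, while {B} yields confidence 4/4 and produces the single rule {B} => {A, B}. In a real XML tree only connected sub-structures would count as subtrees, which this subset-based sketch does not check.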
Branch Oriented Rules are generated for those frequent
structures whose support and confidence are not below the
user-defined minimum support and confidence. BOR mining
comprises two steps: 1) mining frequent structures, that is,
finding the structures in the XML document whose number of
occurrences is greater than a user-defined threshold;
2) generating interesting rules, that is, finding the rules
whose confidence is greater than the user-defined confidence
value. Suppose a rule X => Y is generated with support 0.05
and confidence 0.8. Then, if there is a node labeled X in
the document, that node has a child or sibling labeled Y
with 80 percent probability. Each extracted rule is stored
in a <rule> element with three attributes: the rule ID, the
support, and the confidence of the rule. The extracted rules
are stored in an XML file called the BOR file, which can
later be used as a source of information about the contents
of the original XML document. Intentional knowledge for a
given user query can be found in the BOR file by matching
the conditions specified in the query against the rules it
contains.
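A minimal sketch of how such a rule file might be written with Python's standard `xml.etree.ElementTree` module is given below. Only the <rule> element and its three attributes come from the description above; the attribute names `id`, `support`, and `confidence`, the wrapper element `rules`, and the child elements `antecedent` and `consequent` (holding plain-text paths rather than trees) are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def build_bor_file(rules):
    """Return an element tree with one <rule> element per mined rule."""
    root = ET.Element("rules")  # hypothetical wrapper element
    for rule_id, (antecedent, consequent, support, confidence) in enumerate(rules, start=1):
        rule = ET.SubElement(root, "rule", {
            "id": str(rule_id),
            "support": str(support),
            "confidence": str(confidence),
        })
        # Hypothetical children: the two sides of the rule as simple paths.
        ET.SubElement(rule, "antecedent").text = antecedent
        ET.SubElement(rule, "consequent").text = consequent
    return root


# Hypothetical rule with the support and confidence used in the text.
rules = [("crime", "crime/weapon", 0.05, 0.8)]
bor = build_bor_file(rules)
xml_text = ET.tostring(bor, encoding="unicode")
```

The resulting string can be saved to disk and queried later in place of the original document.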
VI. MODIFICATION OF THE XML DOCUMENT AND
RULE UPDATION
Branch Oriented Rules provide fast and concise intentional
answers to user queries on the original document, because
these queries are converted into queries on the BOR file.
Instead of giving properties that describe the data, a BOR
gives the properties that the data frequently satisfies.
Each extracted rule is indexed, since indexes help to
retrieve information efficiently. The index file is itself
an XML document that contains a set of references to each
node in the rules. The information stored in XML documents
may change over time. When a document changes frequently,
the previously obtained rule and index files are updated
instead of being regenerated from scratch.
References are obtained by matching the conditions of the
query with the elements in the index file. These references
give the rule IDs of the rules that match the conditions in
the query, and only those rules are searched to obtain the
answer. Once the files containing the rules and indexes are
saved, they are used as a substitute for the original XML
document when answering queries. The user formulates queries
on the original document, but these queries are translated
into XQuery expressions that are run against the extracted
rule file. As a result, users get summarized, recurrent
answers instead of exact answers. The answers are concise
and give an idea of the content and structure of the file by
reporting the properties that are regularly satisfied.
Answers are thus drawn from the mined knowledge rather than
from the original dataset, and the mined knowledge can also
stand in for the original document if that document is lost.
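The index lookup described above can be sketched as an inverted index from node labels to rule IDs. The sketch is in Python rather than XQuery and rests on assumptions: the rule table, the labels, and the function names `build_index` and `candidate_rules` are hypothetical, and a query is reduced to the list of labels it mentions.

```python
from collections import defaultdict

# Hypothetical mined rules: rule ID -> (labels in the rule, support, confidence).
RULES = {
    1: ({"crime", "type", "weapon"}, 0.05, 0.8),
    2: ({"crime", "country"}, 0.10, 0.9),
    3: ({"crime", "type", "date"}, 0.07, 0.85),
}


def build_index(rules):
    """Map every node label to the set of rule IDs that reference it."""
    index = defaultdict(set)
    for rule_id, (labels, _supp, _conf) in rules.items():
        for label in labels:
            index[label].add(rule_id)
    return index


def candidate_rules(index, query_labels):
    """Rule IDs whose rules mention every label named in the query."""
    id_sets = [index.get(label, set()) for label in query_labels]
    if not id_sets:
        return set()
    return set.intersection(*id_sets)


index = build_index(RULES)
matches = candidate_rules(index, ["type", "weapon"])
```

Only the rules whose IDs are returned need to be inspected to build the answer, which is what keeps query answering on the rule file fast.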
VII. CONCLUSION
In this paper a framework has been proposed for XML query
answering. It extracts Branch Oriented Rules, which are
correlation rules, from a given XML file and uses them to
answer XML queries. The framework takes as input an XML
document and the user-defined thresholds minsupport and
minconfidence, and generates BORs. The main goal is to mine
all Branch Oriented Rules, with their different support and
confidence values, without the need to specify the structure
of the rules in advance. The extracted rules are stored in
XML format so that they can be used in the future in place
of the original document, without mining the XML document
again. The process is made efficient by working directly on
the XML documents, without transforming the data into any
other format. The extracted knowledge can therefore be
queried effectively using XML queries.
REFERENCES
[1] World Wide Web Consortium, Extensible Markup Language (XML)
1.0, http://www.w3C.org/TR/REC-xml/, 1998
[2] S. Gasparini and E. Quintarelli, “Intensional query answering to
xquery expressions”. In Proc. of the 16th Int. Conf. on Database and
Expert Systems Applications, pages 544–553, 2005
[3] Yongming Guo, Dehua Chen, Liangxu Liu, Jiajin Le, “A Frame of
Personalized Information Filtering System Based on XML”,
Networked Computing and Information Management, Volume: 1,
2008. NCM '08.
[4] R. Sree Lekshmi, B. Sasi kumar, “Extracting Information from
Semistructured XML using TARs”, International Journal of
Engineering and Advanced Technology (IJEAT) ISSN: 2249 –
8958, Volume-2, Issue-1, October 2012.
[5] R. Agrawal and R. Srikant, “Fast Algorithms for Mining Association
Rules in Large Databases,” Proc. 20th Int’l Conf. Very Large Data
Bases, pp. 478-499, 1994.
[6] Mirjana Mazuran, Elisa Quintarelli, Letizia Tanca, “Data Mining for
XML Query-Answering Support”, IEEE Transactions On Knowledge
And Data Engineering, Vol. 24, No. 8, August 2012.
[7] Mirjana Mazuran, Elisa Quintarelli, and Letizia Tanca, “Mining
Tree-Based Association Rules from XML Documents,” technical
report, Politecnico di Milano,
http://home.dei.polimi.it/quintare/Papers/MQT09-RR.pdf, 2009.
[8] S. Gasparini and E. Quintarelli, “Intensional Query Answering to
XQuery Expressions,” Proc. 16th Int’l Conf. Database and Expert
Systems Applications, pp. 544-553, 2005
[9] E. Baralis, P. Garza, Elisa Quintarelli, and Letizia Tanca, “Answering
XML Queries by Means of Data Summaries,” ACM Trans.
Information Systems, vol. 25, no. 3, p. 10, 2007.
[10] J. Paik, H.Y. Youn, and U.M. Kim, “A New Method for Mining
Association Rules from a Collection of XML Documents,” Proc. Int’l
Conf. Computational Science and Its Applications, pp. 936-945, 2005
[11] D. Braga, A. Campi, S. Ceri, M. Klemettinen, and P. Lanzi,
“Discovering Interesting Information in XML Data with Association
Rules,” Proc. ACM Symp. Applied Computing, pp. 450-454, 2003