Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Knowledge Discovery from XML Document Based On Queries SAUMYA P1, MAYA MATHEW2 1 PG Student, Department of CSE, KMCT College of Engineering, Calicut, Kerala, India E-mail: [email protected] 2 Asst Prof, Department of CSE, KMCT College of Engineering, Calicut, Kerala, India E-mail: [email protected] Abstract: Nowadays Extensible Markup Language (XML) has gained popularity as an efficient technique used for representing information and data exchange. A fixed or absolute schema is not mandatory while using XML as a storing standard. Accessing such a document is a tough job. Hence it has been chosen as an interesting field of research. In this paper recurrent correlation among semi-structured data is found by applying data mining techniques. We propose an approach that extracts association rules that give intentional and summarized information about the structure and content of the XML document. The approach is called Branch Oriented Rules. The answers to queries on XML document are provided in a quick and approximate way by using these mined rules. Index Terms— XML, intentional answers, association rules, Branch Oriented Rules (BORs). retrieval of answers from an XML document to the user query I. INTRODUCTION In this present world, large size digital information are available on the web like business transaction, shopping etc. These data are oversized and so the datasets returned as the answer to a query is also large. These large sized data are thus stored without any restriction on having a fixed structure and schema in a format known as Extensible Markup Language (XML) [1]. XML has got eminent importance in the field of internet. It is widely used in applications as it can allow applications to have communication though they are built in different platforms. It has a nested and self-describing nature which makes it a flexible means to exchange data. Hence it has become a popular means by which storing and sharing of data between heterogeneous platforms is possible. Data mining has been used to extract interesting knowledge from data stored in databases or data warehouses. Using various forms of representation such as clusters, decision tree, association rules etc this mined knowledge or information can be represented. Association rules have proved to be an effective tool to discover interesting relations among data. Three different methods could be used to find association rules from XML documents. Mining rules between XML document tags is the first method. Finding association from the element content is the second. The third method is to discover interesting relations that is the frequent substructure in the XML document. To retrieve information from XML document either keyword based search or query based search mode of access can be used. Keyword search in which exact word match is found is the traditional approach of searching process given the search word. But when query search is used it becomes difficult to process such request. The fast and concise is a though problem. More challenges are involved in mining the XML document than compared to the traditional relational database because of the unspecific structure and semantics. It is not necessary that the structure of XML document has been declared in advance for example via a DTD or an XML schema [2]. II. OBJECTIVES Querying an XML document which is not having a fixed schema is a difficult process for two main reasons either huge amount of information will turn up as the result of the query which is the overloading of data [3] or the answer will not contain any useful content causing deprivation of information [4]. This situation arises due to the confused state of the user provided with such a large quantity of data or not able to formulate the structure of a query. Thus it will be useful to the user if they are provided with characteristic and the structure of the document. This knowledge can be efficiently extracted from an XML document by implementing a data mining technique called association rule [5]. Thus knowledge discovery process [6] sets a new trend in mining information from an XML in the form of its contents along with its structure. The purpose of this paper is to derive rules called Branch Oriented Rules, a method which mines summarized intentional knowledge from XML datasets. These rules are stored in the XML format so as to be used as an alternative to the original dataset to be queried for providing fast and concise results [7]. This method works directly on the XML 1 document without converting it to any other form. XQuery [8] is used to provide query on XML document. The user queries on the original document are transformed to queries on the mined rules, Branch Oriented Rules (BOR), file to provide the answers. In this paper branch oriented rules is proposed for representing recurrent information in XML document [9]. This intentional knowledge helps users to understand and get an idea about the content of the document by providing them with frequent patterns from the data sets. BORs are also used when original document is lost, as it contains all extracted information present in the original document. So BOR provides an abstract level of intentional information from XML document [10]. III. EXISTING SYSTEM There is no existing approach that has studied the problem of ranking the result according to its relevance. To determine the intention of the search which relies on keyword based query in not an easy job. It’s because the search through the condition is not unique. Thus to measure the probability of each search intention candidate, and to rank the individual matches of all these candidates is a challenging task. V. BRANCH ORIENTED RULES Tree structure is used for storing data in an XML document. When compared with the traditional database mining data stored in an XML document is a hard work. Association rule mining came to be implemented in XML mining after they were proposed for conventional databases. Association rules are used to describe the co-occurrence of data in a large size of collected information and are represented in the form X =>Y implications, where X and Y are assumed to be the two arbitrary sets, such that X ∩Y=∅ . The use of support and confidence are the best way to measure the quality of an association rule. Support refers to the number of occurrence of the set X∩Y in the dataset, while confidence gives the conditional probability of finding Y, having found X and is given by supp(X∩Y)/supp(X)[11]. IV. PROPOSED SYSTEM The proposed knowledge discovery framework is to perform data mining on XML and obtain intentional knowledge. The intentional knowledge mined is nothing but association rules with support and confidence. This knowledge is written in the form of XML. In other words, the result is BORs (Branch Oriented Rules). Fig 2. A Sample XML File: “Record.xml” The support and confidence of BOR is: Support = Count(Y, D) / Cardinality (D). Confidence = Count(Y, D) / Count(X, D) Cardinality (D) represents the total number of nodes in the XML document Count(Y, D) denotes the number of occurrence of structure Y Fig 1. System Framework The framework can be seen in Fig 1. The XML file is given as input to the parser then the parsed XML file is given to data mining sub system which is responsible for frequent structure generation and also BOR extraction. The generated BOR file which is an XML file is used by query processor system. This module takes XML query, on the original XML file, from the end user, translates it to queries on BOR file and makes use of mined knowledge to answer the query quickly. Fig 2 shows an example in an XML document record.xml which contains nodes like country, type of crime, ballistic item used, etc. used for the crime and date of reporting. The idea of mining knowledge in the form of association rules to get summarized information about the XML document had been incorporated in the system by using XQuery techniques in XML context or by implementing graph or tree based algorithms. The rules extracted and stored should satisfy a minimum support and confidence. The minimum support (minsupport) and minimum confidence (minconfidence) is given before the extraction of the rules. Lesser the value of minsupport larger frequent structures is extracted. The main steps involved in association rule mining are finding the frequent subtrees having support greater than a given threshold support from the XML document and computing 2 the rules from the frequent subtrees which satisfies a minimum confidence. Fig 3. Frequent Sub tree Mining Algorithm1 takes input as a XML document D with user specific support i.e. minsupport and the user specific confidence i.e. minconfidence. The output will be the frequent subtree which has the support and confidence not less than the user specified minsupport and minconfidence. Algorithm 1 helps the system find frequent structure and they are handed over to the function that generates all possible rules. Algorithm 1: Generate-Rules (D, minsupport, minconfidence) 1. 2. 3. 4. 5. 6. 7. Ffilter=FilterFrequentSubtree(D, minsupport) Rule_List=∅ For all s ϵ Ffilter do Temporary_List=Get_Rules(s, minconfidence) Rule_List= Rule_List∪Temporary_List endFor return Rule_List Get_Rules(s, minconfidence) 1. Rule_List=∅ 2. Black_List=∅ 3. For all c, subtree of s do 4. if c is not a subtree of any element in Black_List then 5. confidence =supp(s)/supp(c) 6. if confidence ≥ minconfidence 7. NewRule=(c, s, confidence, supp(s)) 8. Rule_List= Rule_List∪{NewRule} 9. else 10. Black_List= Black_List∪c 11. endif 12. endif 13. endfor 14. return Rule_List Branch Oriented Rules are generated for those frequent structures which have their support and confidence value above the user defined minimum support and confidence value. BOR mining comprises of two steps: 1) mining frequent structures obeys the constraint that it has the number of occurrence greater than a user defined threshold value, from the XML document; 2) generating interesting rules, defines finding the rules that have confidence greater than user-defined confidence value. Assuming a rule X=>Y, generated with support as 0.05 and confidence as 0.8, then it is said that if there is a node labeled X in the document, with 80 percent probability that the node has a child or sibling labeled Y. The tag element <rule> is used to store the extracted rule which comprises of three attributes such as rule ID, the support and the confidence of the rule. These extracted rules are stored in XML format called BOR file, which can be used in future as a source to get idea about the contents of the original XML document. Intentional knowledge to the given user query can be found from BOR file which contain rules by matching the conditions specified in the query. VI. MODIFICATION OF THE XML DOCUMENT AND RULE UPDATION Branch Oriented Rules provide fast and concise intentional answer to the user queries on original document as they are converted to queries on BOR file. Instead of giving properties that describe the data it gives the properties which the data satisfies frequently. Indexing is applied to each rule that is being extracted. Indexes help to retrieve data or information efficiently. Files that represent indexes contain set of references to each node in the rules is an XML document. The information stored in the XML documents may keep on updating or changing. Thus in cases where the document may undergo changes frequently, updating process of the previously obtained files containing rules and indexes is done instead of creating new rules. By matching the conditions of the query with the elements in index file references are obtained. The reference will return the results as the rule IDs of the rules which matches the conditions in the query and only those rules are searched for obtaining the answer. Once files containing the rules and indexes are saved they are used as a substitute of the original XML document to be queried. User provides queries on original document but these queries are translated into XQuery and then fired on extracted datasets or file. Due to this instead of getting exact answers users get summarized or recurrent answers to the queries. The answers provided will give an idea about content and structure of the file by giving the properties which are regularly satisfied and will be precise. Thus instead of returning answers from the original dataset, answers are extracted from the mined knowledge and can also be used as a substitute of the original document at times if the document is lost. VII. CONCLUSION In this paper a framework has been proposed for XML query answering which extracts branch oriented rules which are the correlation rules from a given XML file to answer xml queries. The proposed framework takes input as XML document with user defined threshold called minsupport and minconfidence to generate BORs. The main goal is to mine all branch oriented rules without the need to specify the structure of the rules and will be having different support and confidence values. The rules extracted are stored in XML format so that these rules can be used in future, in place of the original document, without mining the XML document again. This process is made efficient by working directly on the XML documents, and not transforming the data into any other format. Therefore extracted knowledge can be used 3 effectively to gain knowledge by using XML query. REFERENCES [1] World Wide Web Consortium, Extensible Markup Language (XML) 1.0, http://www.w3C.org/TR/REC-xml/, 1998 [2] S. Gasparini and E. Quintarelli, “Intensional query answering to xquery expressions”. In Proc. of the 16th Int. Conf. on Database and Expert Systems Applications, pages 544–553, 2005 [3] Yongming Guo Dehua Chen Liangxu Liu, Jiajin Le ,“A Frame of Per Personalized Information Filtering System Based on XML”, Networked Computing and Information Management, Volume: 1, 2008. NCM '08. [4] R.Sree Lekshmi, B. Sasi kumar, “ Extracting Information from Semistructured XML using TARs“, International Journal of Engineering and Advanced Technology (IJEAT) ISSN: 2249 – 8958, Volume-2, Issue-1, October 2012. [5] R. Agrawal and R. Srikant, “Fast Algorithms for Mining Association Rules in Large Databases,” Proc. 20th Int’l Conf. Very Large Data Bases, pp. 478-499, 1994. [6] Mirjana Mazuran, Elisa Quintarelli, Letizia Tanca, “Data Mining for XML Query-Answering Support”, IEEE Transactions On Knowledge And Data Engineering, Vol. 24, No. 8, August 2012. [7] Mirjana Mazuran, Elisa Quintarelli, and Letiza Tanca, “Mining Tree-Based Association Rules from XML Documents,” technical report, Politecnico di Milano, http://home.dei.polimi.it/quintare/Papers/MQT09-RR.pdf, 2009. [8] S. Gasparini and E. Quintarelli, “Intensional Query Answering to XQuery Expressions,” Proc. 16th Int’l Conf. Database and Expert Systems Applications, pp. 544-553, 2005 [9] E. Baralis, P. Garza, Elisa Quintarelli, and Letiza Tanca, “Answering XML Queries by Means of Data Summaries,” ACM Trans. Information Systems, vol. 25, no. 3, p. 10, 2007. [10] J. Paik, H.Y. Youn, and U.M. Kim, “A New Method for Mining Association Rules from a Collection of XML Documents,” Proc. Int’l Conf. Computational Science and Its Applications, pp. 936-945, 2005 [11] D. Braga, A. Campi, S. Ceri, M. Klemettinen, and P. Lanzi, “Discovering Interesting Information in XML Data with Association Rules,” Proc. ACM Symp. Applied Computing, pp. 450-454, 2003 4