Semantically Mining Heterogeneous Data Sources of Deep Web

A thesis submitted for the partial fulfillment of the requirement of the MS (CS) degree

By
Ayesha Manzoor
584-FBAS/MSCS/F09

Supervisor: Dr Ali Daud, Assistant Professor, IIUI
Co-Supervisor: Umara Zahid

Department of Computer Sciences & Software Engineering
International Islamic University, Islamabad Campus
2012

Department of Computer Science
International Islamic University, Islamabad
Date: [date of external examination]

Final Approval

This is to certify that we have read the thesis submitted by [Ayesha Manzoor], [584-FBAS/MSCS/F09]. It is our judgment that this thesis is of sufficient standard to warrant its acceptance by International Islamic University, Islamabad for the degree of [MS Computer Science].

Committee:
External Examiner: [External Examiner's name], [Designation of External Examiner], [Address of External Examiner] ___________________________
Internal Examiner: [Internal Examiner's name], [Designation of Internal Examiner], [Address of Internal Examiner] ___________________________
Supervisor: [Dr Ali Daud], [Assistant Professor], [International Islamic University, Islamabad] ___________________________
Co-Supervisor: [Umara Zahid], [Research Associate], [International Islamic University, Islamabad] ___________________________

Dedicated to my beloved parents and brothers.

A dissertation submitted to the Department of Computer Science & Software Engineering, Faculty of Basic and Applied Sciences, International Islamic University, Islamabad, as a partial fulfillment of the requirement for the award of the degree of MSCS.

Declaration

I hereby declare that this research thesis has not been copied, in whole or in part, from any source. I further declare that I have carried out this research, with the accompanying report, entirely on the basis of my personal efforts and under the proficient guidance of my teachers, especially my supervisors. If any part of this thesis is proved to be copied from any source, or found to be a reproduction of a thesis from any training or educational institution, I shall stand by the consequences.
___________________________
[Ayesha Manzoor]
[584-FBAS-MSCS/F09]

Acknowledgement

First of all, I am obliged to Allah Almighty, the Merciful, the Beneficent, and the source of all knowledge, for granting me the courage and knowledge to complete this thesis. I am thankful to my supervisor Dr Ali Daud and my co-supervisor Mrs. Umara Zahid, who gave me the direction and guidance to accomplish this thesis. I am also thankful to the fellows who helped me, especially Kashif Iftikhar, Umm-e-Zahoora, Robina Khatoon, and Assmah Jabeen. I am also thankful to the university, its staff, and its faculty members. Finally, I am thankful to my mother, my brothers, and my family for encouraging and supporting me in this research work.
___________________________
[Ayesha Manzoor]
[584-FBAS-MSCS/F09]

Abstract

Over the years, a critical increase in the mass of the web has been observed. A large part of it consists of online subject-specific databases hidden behind query interface forms, collectively called the deep web. Existing search engines are unable to completely index this highly relevant information because of its large volume. To make deep web content accessible, the research community has proposed organizing it using machine learning techniques, and clustering is one of the key solutions for organizing deep web databases. In this research work we propose a novel method, "DWSemClust", to semantically cluster deep web databases.
For this purpose, we employ a generative probabilistic model, latent Dirichlet allocation (LDA), to model the content representatives of deep web databases. LDA clusters words into "topics" and treats each document as a mixture of topics; parameter estimation in the model determines what the topics are and in what proportion each document exhibits them. Deep web sources are mostly sparse, which is a further motive for using LDA. The content representative comprises form contents (single or multiple attributes), page contents, and the hyperlink structure in the neighborhood of forms, i.e., hub/authority scores. We first provide a comprehensive assessment of existing deep web clustering approaches. Based on the limitations of the existing approaches, we present our proposed method. Finally, we provide a comparative analysis between our proposed method and the existing methods.

Table of Contents

Chapter 1
1. Problem Definition
1.1. Deep web
1.2. Difference between "surface" web and "deep" web
1.2.1. Searching strategy
1.2.2. Size of deep web
1.2.3. Quality of deep web is different from the "surface" web
1.2.4. Growing ratio of deep web and surface web
1.3. Scale/Coverage of deep web
1.3.1. Scale of deep web
1.3.2. Coverage of deep web
1.4. Coverage of deep web through search engines
1.5. Coverage of deep web through deep web directories
1.6. Challenges of deep web
1.7. Why deep web sources need to be clustered or classified
1.8. Benefits of clustering the deep web sources
1.9. Proposed approach

Chapter 2
2. Literature Review
2.1. Feature extraction/Feature selection
2.2. Clustering
2.3. Classification

Chapter 3
3. Methodology
3.1. Data preprocessing
3.1.1. Form-Page Model
3.1.2. Compute Form-Page Vectors
3.1.3. Stop word removal
3.1.4. Remove less frequent terms
3.2. Calculation of term weights
3.3. Computing Form-Page Similarity
3.4. The CAFC-C Algorithm
3.5. CAFC-CH Algorithm
3.6. Generative model
3.7. Discriminative model
3.8. Difference between generative and discriminative models
3.9. Topic model
3.10. Latent variable
3.11. Prior probability
3.12. Multinomial distribution
3.13. Plate notation
3.14. Proposed Technique: DWSemClust
3.15. Latent Dirichlet allocation
3.16. Summary

Chapter 4
4. Experiments
4.1. Performance Measures
4.2. Dataset
4.3. Parameter settings
4.4. Results and Discussions
4.5. Summary
Chapter 5
5. Conclusions and Future Work
5.1. Concluded points
5.1.1. Stability
5.1.2. Soft Clustering
5.1.3. Semantics
5.1.4. Running time
5.1.5. Parameter estimation
5.2. Future Work
5.2.1. Integrated schema
5.2.2. Check dataset on structured techniques
References

List of Tables

Table 1.1: Deep web estimation and sampling
Table 1.2: Web directories coverage
Table 3.1: Algorithm CAFC-C
Table 3.2: Algorithm CAFC-CH
Table 3.3: Proposed Technique: DWSemClust
Table 4.1: Dataset Description
Table 4.2: Entropy results of DWSemClust for forms and pages
Table 4.3: Topics discovered by DWSemClust
Table 4.4: F-measure results of DWSemClust
Table 4.5: Comparison of CAFC-C and DWSemClust with form and page contents, based on entropy
Table 4.6: Comparison of CAFC-C and DWSemClust with form contents, based on entropy
Table 4.7: Comparison of CAFC-C and DWSemClust with form contents and with form and page contents, based on entropy
Table 4.8: Comparison of CAFC-C and DWSemClust with form contents, based on F-measure
Table 4.9: F-measure comparison of CAFC-C with form and page contents and DWSemClust with forms and pages
Table 4.10: F-measure comparison of CAFC-C with form and page contents and DWSemClust with forms and pages

List of Figures

Figure 1.1: A deep web site
Figure 1.2: Surface and deep web
Figure 1.3: Search engines coverage
Figure 1.4: Document clustering
Figure 3.1: Plate as subgraph
Figure 3.2: Symbol for a hidden parameter
Figure 3.3: Symbol for an observed variable
Figure 3.4: Arrow showing dependency
Figure 3.5: Latent Dirichlet allocation
Figure 4.1: Entropy for DWSemClust with form contents, and with form and page contents
Figure 4.2: F-measure for DWSemClust with form contents, and with form and page contents
Figure 4.3: Comparison of DWSemClust and CAFC-C with form and page contents, based on entropy
Figure 4.4: Comparison of DWSemClust and CAFC-C with form contents, based on entropy
Figure 4.5: Comparison of CAFC-C and DWSemClust with form contents and with form and page contents, based on entropy
Figure 4.6: Comparison of CAFC-C and DWSemClust with form contents, based on F-measure
Figure 4.7: Comparison of CAFC-C and DWSemClust with form and page contents, based on F-measure
Figure 4.8: Comparison of CAFC-C and DWSemClust with form and page contents and with form contents, based on F-measure
Figure 4.9: Comparison of average entropy
Figure 4.10: Comparison of average F-measure
Figure 4.11: Entropy comparison of CAFC-C, CAFC-CH, and DWSemClust

Chapter 1
1. Problem Definition

This chapter introduces the deep web and the surface web, the importance of the deep web, and why its sources need to be uncovered. It covers the difference between the deep web and the surface web, the scale and coverage of the deep web, why we need to cluster or classify deep web sources, and the benefits that can be gained by clustering or classifying them. The last portion presents why we use the proposed technique.

1.1. Deep web

The web is the preferred medium for information transfer and commerce for internet-based companies. With the introduction of e-business, the trend of developing web sites grew, and connecting sites to databases increased the number of dynamic web sites with back-end databases holding important information. This information can be retrieved from the database server through user queries. The information stored in databases and hidden behind HTML pages is called the "deep web".
It is referred to by many other names, such as hidden web, dark net, deep net, and invisible web [1]. All of these terms denote the store of valuable knowledge that cannot be accessed through traditional search engines. Deep web contents are stored in searchable databases and can be retrieved only through direct queries; without a direct query we cannot reach the results held in these databases. When someone queries a searchable database, a result page is returned whose dynamic content depends on the query given to the database.

Figure 1.1: A deep web site

Figure 1.1 shows a deep web site: an interface is connected to a database, and through this interface a user can put his or her query to the database to retrieve information. The terms most frequently used in the deep web context are deep web site, database, and query interface. A deep web site is a web server that exposes information stored in one or more back-end web databases through a web form, which takes a query as input.

Definition. A deep web site can be denoted as Ds, its database as Ddb, and its query interface as DI. DI contains the attributes DIat, t = 1, ..., n. Deep web query interfaces can be categorized into two types:
- Simple query interfaces
- Advanced query interfaces

A simple query interface has a small number of attributes, which cannot convey much information about the interface or the domain it belongs to. An advanced query interface has attributes that are good representatives of a domain; from them we can usually guess the domain of the database. Interfaces vary in their number of attributes, and a simple query interface has fewer of them. The web is rapidly being "deepened" by huge online databases [2], and the size of the deep web is increasing exponentially.

1.2. Difference between the "surface" web and the "deep" web

1.2.1. Searching strategy

The first difference between surface and deep web sources is the crawling strategy. Two ways are used: either an author submits pages for listing and indexing, or a "spider" crawls pages by following hypertext links from one page to another. On the surface web, static pages are linked together. Traditional search engines cannot retrieve the contents of the deep web, because deep web contents are dynamic and can be retrieved only through direct queries pointed at the database [2].

1.2.2. Size of deep web

The second difference between surface and deep web sources is size. The deep web is very large: according to one study, the 60 largest deep web sites contain 84 billion pages of content, 40 times more than the surface web, and hold 750 terabytes of data [2].

1.2.3. Quality of deep web is different from the "surface" web

Deep web contents have higher quality than the surface web [2]. They are more significant for users and better satisfy user needs. Most deep web contents are topic specific and stored in databases, and they lie deeper than the surface web.

1.2.4. Growing ratio of deep web and surface web

The growth rate of the deep web is much higher than that of the surface web, which underlines the importance of the deep web as a next-generation internet [2].
Figure 1.2: Surface and deep web [2]

Figure 1.2 illustrates the quantity of surface web and deep web data and the depth of the deep web. The fish represent data: some data is present at the surface level, but the portion beneath the surface carries far more. This data is precious in both quality and quantity, its growth rate is very high, and its contents are updated frequently; it therefore needs to be uncovered.

1.3. Scale/Coverage of deep web

To assess the scale and coverage of the deep web, first consider the "entrances" to its databases [3]. To reach the information hidden in this sea, we must know where the entrances are and how deep they lie. Query interfaces are the entry points to the databases, and He et al. [3] calculated the depth of these entry points: 93 out of 129 interfaces (72%) were found within depth 3. Through such interfaces, 32 out of 34 web databases (94%) could be reached within depth 3, and 22 out of 24 deep web sites (91.6%) kept their databases within depth 3.

1.3.1. Scale of deep web

He et al. [3] tested a sample of 1,000,000 IP addresses to estimate the scale of the deep web. They crawled the one million IPs to depth three, since most databases can be found within that depth. From this crawl they found 126 deep web sites among 2,256 web servers, with 406 query interfaces and 190 web databases. Extrapolating the sample to the whole IP space of 2,230,124,544 addresses yields an estimated 307,000 deep web sites, 450,000 web databases, and 1,258,000 query interfaces. They also observed the multiplicity of access in the deep web: each deep web site contains on average 1.5 databases, and each database supports on average 2.8 query interfaces. A survey [1] estimated that 43,000 to 96,000 deep web sites were present on the web; these figures indicate that the deep web grew 3 to 7 times from 2000 to 2004 [3].

Table 1.1 below shows the estimation and sampling results of the study. The first column lists deep web sites, web databases (structured or unstructured), and query interfaces; the second column shows the sampling results; the third column shows the total estimate; and the fourth column shows the 99% confidence interval.

Table 1.1: Deep web estimation and sampling [3]

                     Sample Results   Total Estimate   99% Confidence Interval
Deep web sites       126              307,000          236,000-377,000
Web databases        190              450,000          366,000-535,000
  - Unstructured     43               102,000          62,000-142,000
  - Structured       147              348,000          275,000-423,000
Query interfaces     406              1,258,000        1,097,000-1,419,000

1.3.2. Coverage of deep web

Coverage can be defined as how much deep web data can be crawled or indexed, and thus retrieved, through search engines or deep web directories. There are two types of coverage:
- Coverage of the deep web through search engines
- Coverage of the deep web through deep web directories

1.4. Coverage of deep web through search engines

To access hidden web contents, one can "browse" directories to find URLs. It remains an open question whether the deep web can be indexed and crawled as effectively as the surface web. He et al. [3] investigated three popular search engines, MSN (msn.com), Yahoo (yahoo.com), and Google (google.com), randomly choosing 20 web databases out of the 190 in their sampling results.
Figure 1.3 below shows the findings of the survey. MSN's coverage, at 11%, was the lowest of the three; Yahoo and Google each indexed 32% of the deep web, and overall the three engines covered 37% of deep web contents. The study leads to two major conclusions. First, the common belief that the deep web is invisible is not strictly true: since one third of it can be searched through these engines, the deep web is not invisible by nature. Second, the remaining contents are not properly indexed by any search engine, so in practice most of the deep web stays invisible.

Figure 1.3: Search engines coverage [3]

1.5. Coverage of deep web through deep web directories

Besides search engines, which crawl in the traditional way, there are online directories that classify web databases into catalogs. To check the coverage of these directories, He et al. [3] surveyed four popular web directories and counted the coverage each claimed to index: completeplanet.com, lii.org, turbo10.com, and invisible-web.net.

Table 1.2: Web directories coverage [3]

                     Number of Web Databases   Coverage
completeplanet.com   70,000                    15.6%
lii.org              14,000                    3.1%
turbo10.com          2,300                     0.5%
invisible-web.net    1,000                     0.2%

Table 1.2 shows the number of web databases and the coverage of each directory. The completeplanet.com directory lists 70,000 of the estimated 450,000 web databases, which is 15.6% — very low coverage. The other directories' coverage is lower still, ranging from 0.2% to 3.1%. Since such directories are classified manually, it will be hard for them to scale to the deep web. He et al. [3] conclude their study with the following observations:
- Information publicly available on the deep web is 400 to 550 times larger than the commonly defined World Wide Web.
- The deep web holds 7,500 terabytes of information, compared to 19 terabytes on the surface web.
- There are 550 billion individual documents on the deep web, compared to 1 billion on the surface web.
- More than 100,000 deep web sites exist.
- The 60 largest deep web sites contain 750 terabytes of information, exceeding the surface web by 40 times.
- The deep web's growth rate is very high compared to the surface web.
- Deep web site content is deeper and narrower than the surface web.
- The total quality of the deep web is at least 1,000 to 2,000 times greater than that of the surface web.
- The deep web is more informative and better satisfies user needs; its content is mostly topic specific.
- 95% of deep web sources contain publicly accessible information.
- During the four years from 2000 to 2004, deep web sources increased 3 to 7 times.

1.6. Challenges of deep web

Open challenges in this field are:
- Crawling the hidden web
- Categorization of deep web sources
- Integration
- Query mediation

1.7. Why deep web sources need to be clustered or classified

Organizing the data spread over the web into groups or collections makes data access and availability easier while meeting user needs. Above all, the size of the deep web is enormous: according to one study, the 60 largest deep web sites contain 84 billion pages, 40 times more content than the surface web, amounting to 750 terabytes of data [2]. Such a volume needs to be clustered or classified; once this huge amount of data is clustered, it can be used in a way that satisfies user needs.
Figure 1.4: Document clustering

Figure 1.4 depicts the document clustering process for a set of documents gathered from an information retrieval system. Documents of the same color are related to each other but are initially randomly distributed. A clustering algorithm groups the related documents, so that after clustering, related documents sit together in the same cluster.

1.8. Benefits of clustering the deep web sources

Some benefits of clustering the deep web sources are:
- Accessibility over the web will increase.
- The length of web navigation pathways will decrease.
- Services for web users' requests will improve.
- Information retrieval will improve.
- Content delivery on the web will improve.
- Users' navigation behavior becomes easier to understand.
- Diverse data representation standards can be integrated.
- Web information organization practices currently in use will be extended.

1.9. Proposed approach

We work on deep web sources semantically, using latent Dirichlet allocation (LDA). Before explaining LDA, we describe the limitations of traditional, keyword-based clustering methods. Keyword-based clustering extracts words and uses them to match related entities. The vector space model (VSM) is an example of keyword-based modeling: it is the state of the art in clustering and provides a good way to group similar documents on the basis of similar content extracted from text. The major problem with keyword-based clustering is that it ignores semantics — in other words, it ignores polysemous and synonymous terms. In traditional clustering a document is associated with exactly one cluster, which is called hard clustering. These problems motivate topic modeling, which is based on a latent topic layer. Topic modeling generates soft clusters and can capture the semantics of text: latent topics allow a document composed of different topics to belong to more than one cluster. A hidden topic layer is fundamental to topic modeling.

The basic terminology and notation used in LDA are word, document, and corpus:
- A word, denoted w, is the basic unit of discrete data; words are items in a vocabulary.
- A document is a sequence of N words, denoted D = {w1, w2, w3, ..., wN}.
- A corpus is a collection of M documents, denoted C = {D1, D2, D3, ..., DM}.
- A topic layer Z = {Z1, Z2, Z3, ..., Zi} lies between the documents and the words in the documents; Zi represents a latent topic over a document vector d and its words wd. This layer captures semantic relationships, taking the synonymy of words into account.
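As a concrete illustration of this notation only (the vocabulary, documents, and topic assignments below are invented for the example, not taken from our dataset), a corpus can be represented in Python as a list of token lists, with one latent topic per token:

```python
# Toy corpus: each document D is a sequence of words, the corpus C is their collection.
corpus = [
    ["flight", "ticket", "airfare", "destination"],   # document D1
    ["author", "title", "isbn", "publisher"],         # document D2
    ["flight", "hotel", "reservation", "airfare"],    # document D3
]

# Vocabulary: the set of distinct words appearing in the corpus.
vocabulary = sorted({w for doc in corpus for w in doc})

# The latent topic layer Z assigns one of K hidden topics to every token.
# In LDA these assignments are inferred; here they are placeholders.
K = 2
topic_assignments = [[0, 0, 0, 0], [1, 1, 1, 1], [0, 0, 0, 0]]

print(len(corpus), "documents,", len(vocabulary), "terms,", K, "topics")
```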
Chapter 2
2. Literature Review

This chapter gives an overview of the work that has been done in the literature. The work on hidden web mining can be categorized into various areas:
1. Feature extraction/Feature selection
2. Clustering
3. Classification

In [4, 5, 6, 17], feature extraction is discussed, showing how the extracted features improve the results. Clustering is unsupervised learning used for grouping data; [7, 8, 9, 10, 11] propose techniques for clustering deep web sources. Classification is supervised learning; in [12, 13, 14, 15, 16], different classification techniques are used.

2.1. Feature extraction/Feature selection

Sriram et al. [4] address the problem of designing a crawler able to extract text from the hidden web. They introduce a standard operational model for crawling the hidden web and explain how this model is realized in HiWE (Hidden Web Exposer), a prototype crawler developed at Stanford. They introduce a new Layout-based Information Extraction Technique (LITE) and explain how it extracts semantically related information from search forms and result pages, including deep web pages whose forms front a database — a problem for existing crawlers. Their crawler is task specific: a task-specific application is useful in designing a crawler that has knowledge of a particular domain. The HiWE design has two limitations:
1. HiWE is unable to recognize and respond to simple dependencies among form elements.
2. HiWE does not support partially filled forms, i.e., supplying values to only some elements of a form.

Andreas et al. [5] automatically assign a web service to a concept in a category taxonomy, each web service operation to a concept in a domain taxonomy, and each input parameter to a concept in a data-type taxonomy. Web services are clustered in order to create the category taxonomy automatically. The authors assume a category taxonomy C, describing the services a web service provides; a domain taxonomy D, where domains provide specific services; and a data-type taxonomy T, which deals with the semantics of field data. For form classification, a form is converted into a Bayesian network. A tree is built to represent the generative model: the form's domain is at the top, the children represent the data type associated with each field, and the grandchildren represent the terms of each field. A leave-one-out methodology is used, with two bags of terms as baselines: the terms in a form are treated as a single bag of terms for domain classification, and the naive Bayes algorithm is applied to a field's bag of terms for data-type classification. This technique is applicable to databases that are good representatives of a domain, because it makes decisions from parts of the form, such as the labels and values that are available.

In [6], the authors introduce an incremental heuristic algorithm coupled with an entropy-based criterion. The algorithm works well under different sample sizes and parameter settings, and it is very efficient for data streams because it never has to look back at earlier entries. Entropy also makes the clustering criterion very reasonable: unlike earlier methods, which use distance matrices between vectors, this algorithm uses entropy to group categorical attributes. Since the chosen problem is NP-complete, a heuristic is used to solve it, and the incremental approach makes it scalable.
A crucial step in data integration is matching query interfaces across multiple web databases. Interface schema matching uses several types of information: relying on a single aspect of the schema is unreliable and yields uncertain, inaccurate results. The state-of-the-art approach for combining uncertain information from multiple sources is evidence theory. However, traditional evidence theory treats the individual matchers of different matching tasks identically when applied to query interfaces, which reduces matching performance. The authors of [17] propose a novel matching approach for deep web query interfaces based on extended evidence theory. They introduce a dynamic prediction procedure for the differing credibilities of matchers and use exponentially weighted evidence theory, which extends the traditional theory, to combine the results coming from multiple matchers.

2.2. Clustering

In [7], the authors organize structured deep web sources into a domain hierarchy through their query schemas, which are good discriminators of a domain. They view a query schema as categorical data and cluster that categorical data, assuming that sources from the same domain follow the same generative model. They propose a new objective function, model-differentiation, which tests this assumption by maximizing the statistical heterogeneity between clusters. The authors develop the algorithm MD-hac: first, DATAGROUPING pre-clusters the data into groups; second, GROUPSELECTION excludes loner schemas using a loner threshold N; third, CLUSTERINGHAC clusters the remaining groups with the standard HAC algorithm; fourth, LONERHANDLING classifies the loner schemas into the resulting G clusters; finally, BUILDHIERARCHY applies the HAC algorithm again to build the hierarchical tree of domains, considering each cluster as one domain. They adopt chi-square testing to evaluate the homogeneity among clusters, and conditional entropy is also used. On clustering web query schemas, the model-differentiation function with hierarchical agglomerative clustering outperforms existing objectives such as likelihood, entropy, and context linkages. For each source, attributes are manually extracted from its query interface as noun phrases, and its corresponding domain is judged by hand.

CAFC-C [8] produces homogeneous clusters with low entropy and high F-measures, so it can be very helpful in discriminating different online databases. But it has limitations that need to be considered. Since it uses the k-means method, the quality of the resulting clusters is highly affected by the selection of initial seeds. This can hurt in two scenarios: first, when there is heterogeneity in the vocabulary, and second, when different domains nonetheless share a large part of their vocabulary. In these cases, forms alone will not be sufficient for good clustering results.

CAFC-CH [8] is an extension of the CAFC-C algorithm. To raise the utility of CAFC-C, it also includes page contents, which help break ties when the form contents of different domains have overlapping vocabulary, or when forms from the same domain use different vocabularies. Another limitation of k-means is addressed by also considering hyperlinks for the selection of seed clusters.
They use backlinks to improve the quality of the seed clusters, but a limitation remains here too: not all sites have backlinks. The vector space model is used, and its drawback is that it performs only keyword-based matching.

The work in [9] replaces existing automatic database selection methods and cooperative methods, which have their own limitations, with appropriate language models built for each database. The database service itself creates its language model by random sampling, also called the query-based sampling approach. Query-based sampling assumes that every database can run simple queries; the documents returned in response are then used to build a language model for that particular database automatically.

Song et al. [10] propose semantically clustering the deep web: a fuzzy semantic measure is used to integrate the ontology, fuzzy sets are used to check the similarity of the visible features of two deep web forms, and a hybrid particle swarm optimization (PSO) algorithm is proposed for clustering the deep web databases. Average Similarity of Documents to the Cluster Centroid (ASDC) and the Rand Index (RI) are used to evaluate the results. The proposed solution attains higher ASDC values than the k-means and plain PSO approaches. They conclude that similarity is high within clusters and low between clusters, which is a positive sign.

Zhang et al. [11] work on feature extraction with an ontology-based method. They work on three domains and report results evaluated by precision and recall.

2.3. Classification

Xiang et al. [12] work on the classification of structured deep web sources. Their main contributions are a category ontology model for the deep web and, built on it, a vector space model for the deep web. The ontology is an 8-tuple (V, F1, T, S, C, L, ROOT, F2): V is the set of attributes appearing in the interface, where each element has two parts, Ai (the attribute label) and Type (the data type of the attribute); F1 is the reference function that maps attributes to concepts; T describes the nature of an attribute Vi in V — if Vi is in Ts, the attribute occurs only in a specific domain; if Vi is in Tc, the attribute is shared across different domains; and if Vi is in Tn, the attribute is noise and has no meaning; S is the predefined concept of the interface schema; C is the conceptual portion of attributes in the specific domains; L is the set of domains; ROOT is the domain for interfaces that cannot be classified into any domain; and F2 is another reference function. A new weight calculation (DWTF) is presented, which obtains better results than TF-IDF and TF. They evaluate the classification results with average precision and average recall, reporting 91.6% precision and 92.4% recall.

Peiguang et al. [13] contribute to the deep web field through classification. They analyze the attributes that are common within the same domain, describe the characteristics of an interface, and propose a new representation of the interface based on function terms and form terms. They propose an algorithm for literal- and semantics-based similarity computing (LSSC), which uses the two definitions of function term and form term. Another contribution is the combination of LSSC with the NQ algorithm, LSSC_NQ.
Experimental results show that this algorithm gives good results.

Pengpeng Zhao et al. [14] contribute to the deep web field by clustering structured query interfaces. They use a link graph to cluster these sources and propose the Form Graph Cluster (FGC) framework for organizing deep web sources with a pre-query method; the Fuzzy Clustering Method (FCM) is applied to the sources. Their main contribution is that this method is used in this area for the first time. The similarity and dissimilarity of deep web sources can be expressed in a graph: each query interface is treated as a node, a line represents the relationship between two nodes, and the weight on the line represents their similarity or dissimilarity. The form set is treated as an undirected weighted graph. Historically, the degree of similarity and dissimilarity has been measured in the traditional way, as 0 or 1, which is not the best way; the authors therefore use fuzzy set theory. They extract features from the HTML form, defining the controls and dividing them into three kinds: text area, select, and input controls. Only the values of the controls are used; everything else is eliminated.

To improve deep web domain classification, Le et al. [15] select a subset of features from the full feature set extracted from the interfaces. Whereas previous work includes all features extracted from the interfaces or query schemas of sources, they refine the feature set. They treat the interfaces of a domain category as a bag of words and choose from the whole set the words that are suitable for classification, using a novel, simple ranking scheme and a new matrix for feature selection. They obtain high precision, recall, and F-measure using selective features, i.e., aggressive feature selection.

Xian et al. [16] present a new framework that classifies structured deep web sources into topic-specific domains by combining the machine learning technique SVM with query probing through the simple query interface. They use random queries to gather the result schema, which is collected from the result pages of query probing, and then use a domain-specific classifier (DSC) to classify the simple query schemas. Precision, recall, and F-measure are used for evaluation.

Summary

In the sections above we discussed the work that has been done in the field of the deep web. Different authors propose different algorithms and show the efficiency of their work. We categorized the literature review into clustering, classification, and feature extraction.

Chapter 3
3. Methodology

In this chapter we discuss the methodology of [8] and then our own method for clustering the deep web sources. The two algorithms of [8] are described, along with how the document vectors are built and which preprocessing steps are performed on the dataset. We then turn to the method we use for clustering the deep web sources.

3.1. Data preprocessing

3.1.1. Form-Page Model

First we obtain each web form together with the web page it is associated with, called a form page, FP. An FP is a tuple FP(PC, FC), where PC and FC are two individual feature spaces: PC stands for page contents and FC stands for form contents. Both feature spaces are viewed as text, because the authors of [8] use the vector space model [19]: each feature space consists of a vector of the terms present in that feature space together with their associated weights.

3.1.2. Compute Form-Page Vectors

To compute the form-page vectors, we parse the HTML page and extract its contents into the two feature spaces FC and PC. For the FC feature space, we parse the HTML page and extract the contents between the FORM tags; these contents belong to the form but still contain HTML markup, so after removing the HTML markup and scripting tags we obtain the FC feature space. For the PC feature space, the form content is subtracted from the HTML page and the remainder is the page content; after removing the HTML markup and scripting tags we obtain the PC feature space. A hedged code sketch of this extraction follows.
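As an illustration only — [8] does not prescribe a particular parser, and the function name extract_form_page is our own — the following Python sketch separates a page into PC and FC using the third-party BeautifulSoup library:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def extract_form_page(html: str) -> tuple[str, str]:
    """Split an HTML page into (page contents PC, form contents FC) as plain text."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop scripting and style tags so they pollute neither feature space.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # FC: the text between FORM tags, with markup stripped by get_text().
    forms = soup.find_all("form")
    fc = " ".join(form.get_text(" ", strip=True) for form in forms)
    # PC: the page minus its forms; extract() removes each form from the tree.
    for form in forms:
        form.extract()
    pc = soup.get_text(" ", strip=True)
    return pc, fc

pc, fc = extract_form_page(
    "<html><body>Find cheap flights <form><input name='from'>From</form></body></html>")
print("PC:", pc, "| FC:", fc)
```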
3.1.3. Stop word removal

Stop words are noise words that carry no importance but increase the execution time and disturb the final results. Stop words include "an", "of", "the", "are", "is", and so on; stop word lists covering the major stop words are available on the internet, and we use such a list. Stop word removal is applied to both the form feature space and the page feature space.

3.1.4. Remove less frequent terms

Another preprocessing step removes the less frequent terms. The threshold value is three: terms that occur three times or fewer in the whole collection are removed. These terms are also noisy data, and the preprocessing steps must be performed to get good results.

3.2. Calculation of term weights

In information retrieval [21], the TF-IDF (term frequency/inverse document frequency) measure is widely used. It provides a way to model the importance of terms and also to eliminate noisy data from the vectors:

w_j = \mathrm{TF}_j \cdot \log\left(\frac{N}{n_j}\right)    (3.1)

Here w_j is the weight of the j-th term; it can differ between documents, since TF_j is the term frequency of the j-th term, i.e., the number of occurrences of the term in a specific document (for example, if a word appears twice in a document, its term frequency is two). The factor \log(N/n_j) is the IDF (inverse document frequency), where N is the total number of documents and n_j is the document frequency, i.e., the number of documents in which the j-th term appears.

3.3. Computing Form-Page Similarity

The base paper uses the cosine similarity measure [20]. To compute form-page similarity, the distance between the corresponding vectors of each feature space is calculated:

\cos(d_1, d_2) = \frac{d_1 \cdot d_2}{\lVert d_1 \rVert \, \lVert d_2 \rVert}    (3.2)

Here \cos(d_1, d_2) is the cosine similarity of vectors d_1 and d_2: the dot product of d_1 and d_2 divided by the product of their lengths. To aggregate the similarities of the two feature spaces, form contents and page contents, a weighted average of the similarity in each space is taken:

\mathrm{sim}(FP_1, FP_2) = \frac{C_1 \cos(PC_1, PC_2) + C_2 \cos(FC_1, FC_2)}{C_1 + C_2}    (3.3)
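The following Python sketch (our illustration; the helper names are ours) computes the TF-IDF weights of equation (3.1) and the cosine and aggregate similarities of equations (3.2) and (3.3) over sparse term-weight dictionaries:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} vector per document (eq. 3.1)."""
    N = len(docs)
    df = Counter(t for doc in docs for t in set(doc))  # n_j: document frequency per term
    return [{t: tf * math.log(N / df[t]) for t, tf in Counter(doc).items()}
            for doc in docs]

def cosine(v1, v2):
    """Cosine similarity of two sparse vectors (eq. 3.2)."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def form_page_sim(pc1, pc2, fc1, fc2, c1=1.0, c2=1.0):
    """Aggregate form-page similarity (eq. 3.3); c1, c2 weight the two feature spaces."""
    return (c1 * cosine(pc1, pc2) + c2 * cosine(fc1, fc2)) / (c1 + c2)
```

With equal weights c1 = c2, equation (3.3) reduces to the plain average of the two cosine similarities, which is the setting the formula implies when neither feature space is privileged.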
3.4. The CAFC-C Algorithm

Context-Aware Form Clustering (CAFC-C) uses the k-means algorithm to cluster form pages belonging to the same domain. K-means, a partitional centroid-based algorithm, is widely used in document clustering; the main reasons for using it are its simplicity and effectiveness [21].

Table 3.1: Algorithm CAFC-C

Algorithm 1 CAFC-C
1: Input: formPages, k
2: centroids = selectSeeds(formPages, k)  {Randomly select seeds}
3: repeat
4:   clusters = assignPoints(formPages, centroids)  {Assign each form page to the closest centroid}
5:   centroids = recomputeCentroids(clusters)  {Recompute centroids}
6: until stop criterion is reached
7: return clusters

CAFC-C takes the desired number of clusters k as input, and the form pages are the whole collection to be clustered. First, k form pages are randomly selected as seeds from the collection and the centroids are calculated. After centroid calculation, each form page has a distance value and is assigned to the seed cluster with the closest value. The centroid is then the average over the cluster's members:

C = \left( \frac{\sum_{PC \in C} PC}{|C|}, \frac{\sum_{FC \in C} FC}{|C|} \right)    (3.4)

where C is the cluster, \sum_{PC \in C} PC is the sum of the page-content vectors of the cluster's members, divided by the number of pages in the cluster, and \sum_{FC \in C} FC is the sum of the form-content vectors, divided by the number of forms in the cluster. The algorithm recomputes the similarities, reassigns points, and recomputes the cluster centroids until the clusters become stable.

CAFC-C has some limitations. K-means produces hard clusters, since each point is assigned to exactly one cluster. Moreover, CAFC-C uses the vector space model (VSM) with k-means, and the major problem with VSM is that it does not deal with the semantics of the text.

3.5. CAFC-CH Algorithm

CAFC-CH uses an extended form-page model that additionally takes backlinks, forming the triple FP(FC, PC, Backlink) [8]. Backlinks are the web pages that link to the searchable form page. If different pages, or sets of pages, share common backlinks, they are likely to belong to the same domain. Backlinks are retrieved through the link: API provided by different search engines [25].

CAFC-CH has some limitations. The idea can be implemented successfully only on documents for which a complete link graph is available, whereas deep web contents are sparsely distributed over the web.

Table 3.2: Algorithm CAFC-CH

Algorithm 2 CAFC-CH
1: Input: formPages, k  {formPages: set of form pages; k: number of clusters required}
2: hubClusters = SelectHubClusters(formPages, k)
3: clusters = CAFC-C(formPages, k, hubClusters)  {Run k-means using hubClusters instead of random seeds}
4: return clusters

Algorithm 3 SelectHubClusters
1: Input: formPages, k
2: hubs = generateHubs(formPages)
3: distanceMatrix = createDistanceMatrix(hubs)  {Compute distances between hubs}
4: finalSeeds = twoMostDistant(distanceMatrix)  {Select the two hubs that are farthest apart}
5: while finalSeeds.length < k do
6:   finalSeeds = addDistantPoint(finalSeeds, distanceMatrix)
7: end while
8: return finalSeeds
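A minimal sketch of the CAFC-C loop of Table 3.1, assuming the tfidf_vectors and form_page_sim helpers sketched earlier; the fixed iteration count and plain random seeding are our simplifications of the stop criterion and seed selection:

```python
import random

def average_vectors(cluster):
    """Component-wise mean of the (PC, FC) vectors of a cluster's members (eq. 3.4)."""
    if not cluster:
        return None
    def mean(vectors):
        acc = {}
        for v in vectors:
            for t, w in v.items():
                acc[t] = acc.get(t, 0.0) + w
        return {t: w / len(vectors) for t, w in acc.items()}
    return mean([pc for pc, _ in cluster]), mean([fc for _, fc in cluster])

def cafc_c(form_pages, k, iters=20, seed=0):
    """form_pages: list of (pc_vec, fc_vec) pairs. Returns a list of k clusters."""
    rng = random.Random(seed)
    centroids = rng.sample(form_pages, k)            # line 2: random seeds
    for _ in range(iters):                           # lines 3-6: repeat until stable
        clusters = [[] for _ in range(k)]
        for fp in form_pages:                        # line 4: assign to most similar centroid
            best = max(range(k),
                       key=lambda i: form_page_sim(fp[0], centroids[i][0],
                                                   fp[1], centroids[i][1]))
            clusters[best].append(fp)
        centroids = [average_vectors(c) or centroids[i]   # line 5: recompute centroids
                     for i, c in enumerate(clusters)]
    return clusters
```

CAFC-CH would differ only at line 2, replacing the random sample with the hub-based seeds of Algorithm 3.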
3.6. Generative model

A model that randomly generates observable data once hidden parameters are specified is called a generative model. It defines a joint probability distribution over label sequences and observations. Machine learning uses generative models for direct data modeling or as an intermediate step. Examples of generative models include:
- Gaussian mixture models and other types of mixture model
- Latent Dirichlet allocation
- AODE
- Restricted Boltzmann machines
- Hidden Markov models
- Naive Bayes

3.7. Discriminative model

Models used in machine learning to model an unknown (unobserved) variable y given a known (observed) variable x are called discriminative models. In other words, they model the conditional probability distribution P(y | x): a class of models for the dependence of an unobserved variable y on an observed variable x. Within a statistical framework, such a model can be used to predict the unknown variable y from the known variable x. Examples of discriminative models used in machine learning include:
- Support vector machines
- Logistic regression, a type of generalized linear regression used for predicting binary or categorical outputs (also known as maximum entropy classification)
- Neural networks
- Boosting
- Linear discriminant analysis
- Conditional random fields

3.8. Difference between generative and discriminative models

Generative models differ from discriminative models in that they are fully probabilistic models of all variables, whereas a discriminative model only models the unobserved variable y conditioned on the observed variable x. A generative model can be used to generate and simulate values of any variable in the model, while a discriminative model can sample only the target variables conditioned on the observed quantities. Discriminative models cannot express joint relationships between observed and unobserved variables, but in exchange they do not need to model the distribution of the observed variables; for classification and regression tasks this often works in their favor, although generative models remain useful when data must be simulated or is scarce.

3.9. Topic model

In natural language processing and machine learning, a statistical model for discovering the abstract "topics" that occur in a collection of documents is called a topic model. Probabilistic latent semantic indexing (PLSI), proposed by Thomas Hofmann in 1999, was an early topic model. Latent Dirichlet allocation (LDA), developed by David Blei, Andrew Ng, and Michael Jordan in 2002 and used in many applications, is a topic model that deals with documents containing a mixture of topics.

3.10. Latent variable

In statistics, latent variables stand in contrast to observable variables: they are observed only indirectly, through a mathematical model, from variables that are observed. Latent variable models are mathematical models that explain observed variables in terms of latent variables. Latent variables are often termed hidden variables, i.e., variables that are present but hidden. Many disciplines use latent variable models, for example machine learning/artificial intelligence, natural language processing, economics, the social sciences, psychology, and bioinformatics. An advantage of latent variables is the reduction of data dimensionality; dimensionality reduction is the machine learning process of reducing the number of random variables.

3.11. Prior probability

In Bayesian statistical inference, the prior probability distribution, often simply called the prior, of an uncertain quantity p is the distribution assigned before the "data" is taken into account. It is meant to attribute uncertainty, rather than randomness, to the uncertain quantity.

3.12. Multinomial distribution

In probability theory, the multinomial distribution is a generalization of the binomial distribution. The binomial distribution is the distribution of the number of successes in n independent Bernoulli trials, where the probability of "success" is the same on each trial; a Bernoulli trial is a trial with exactly two possible outcomes, true/false or yes/no.
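To make Sections 3.11 and 3.12 concrete, the following numpy snippet (our illustration; the prior values and counts are arbitrary) draws a topic-proportion vector from a Dirichlet prior and then draws counts from a multinomial parameterized by it — exactly the two distributions LDA composes:

```python
import numpy as np

rng = np.random.default_rng(0)

alpha = np.array([0.5, 0.5, 0.5])             # Dirichlet prior parameters (K = 3 topics)
theta = rng.dirichlet(alpha)                  # one sample: topic proportions, sums to 1
counts = rng.multinomial(n=10, pvals=theta)   # 10 draws: how many tokens land in each topic

print("theta =", theta, " counts =", counts, " total =", counts.sum())
```

Small alpha values push theta toward sparse mixtures (a document dominated by few topics), which is why Dirichlet priors suit the sparse deep web sources discussed in Chapter 1.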
3.13. Plate notation

Plate notation is a graphical-model representation of variables. Instead of drawing every variable that repeats in a process, a rectangle called a plate groups the repeating variables into a subgraph; a number in the corner of the plate gives the number of repetitions. Circles represent variables and directed arrows show dependencies. Empty circles denote latent (hidden) variables, which are not directly observed, while filled circles denote observable variables.

Figure 3.1: Plate as subgraph (with repetition count M)
Figure 3.2: Symbol for a hidden parameter
Figure 3.3: Symbol for an observed variable
Figure 3.4: Arrow showing dependency

3.14. Proposed Technique: DWSemClust

Algorithm 1, DWSemClust, takes formPages as input, where formPages are the web pages that contain searchable forms (line 1). parseformPages takes the formPages and separates the form contents and page contents: the HTML page contains FORM tags that enclose the form contents, which are extracted after removing the markup tags, and the same procedure is applied to the page contents. Stop words are then removed from the form and page contents, and words whose frequency is less than three are removed from the dataset. After these preprocessing steps, the form contents and page contents are taken as input for LDA. For each (formCont, pageCont), iterated M times, θ_formCont,pageCont is selected from the hidden parameter α. For each term W_formCont,pageCont, iterated N times per document, Z_formCont,pageCont is selected from θ_formCont,pageCont; the observable variable W_formCont,pageCont is then drawn from p(W_formCont,pageCont | Z_formCont,pageCont, β). LDA then returns the clusters. LDA is used in many applications [26].

Table 3.3: Proposed Technique: DWSemClust

Algorithm 1 DWSemClust
1: Input: formPages  {formPages: set of pages which contain searchable forms}
2: formCont, pageCont = parseformPages(formPages)
3: F1 = LDA(formCont, pageCont)
   For each (formCont, pageCont) in [1...M] do
     Select θ_formCont,pageCont ~ Dir(α)
     For each of the terms W_formCont,pageCont in [1...N] do
       (a) Select a topic Z_formCont,pageCont ~ Multinomial(θ_formCont,pageCont)
       (b) Select a word W_formCont,pageCont from p(W_formCont,pageCont | Z_formCont,pageCont, β), a multinomial probability conditioned on the topic Z_formCont,pageCont
4: Return clusters

3.15. Latent Dirichlet allocation

LDA is a generative probabilistic model that explains a set of observations through unobserved groups in the data, showing why some parts of the data are similar. A document consists of different topics, and each topic can be explained through words. For example, if the words collected into documents are the observations, then each document is a collection of words and a mixture of a small number of topics, and the words in the document reflect the document's topics. LDA is an example of a topic model and was presented as a graphical model by Blei et al. [22]; improvements to LDA appear in [27, 28]. LDA assumes the following generative process for each document D in a corpus C (here, each document is an interface):

1. Select for each interface the length N ~ Poisson(ξ).
2. Select θ_interfaces ~ Dir(α).
3. For each of the N words w_n:
   (a) Select a topic z_n ~ Multinomial(θ_interfaces).
   (b) Select a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n.
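The generative story above can be simulated directly. The sketch below (our illustration; the toy sizes and symmetric priors are invented) samples one synthetic interface document per the three steps:

```python
import numpy as np

rng = np.random.default_rng(42)
K, V = 4, 50                                     # number of topics, vocabulary size
alpha = np.full(K, 0.1)                          # symmetric Dirichlet prior over topics
beta = rng.dirichlet(np.full(V, 0.01), size=K)   # per-topic word distributions (K x V)

def generate_document(xi=20):
    n = rng.poisson(xi)                 # step 1: document length N ~ Poisson(xi)
    theta = rng.dirichlet(alpha)        # step 2: topic mixture theta ~ Dir(alpha)
    words = []
    for _ in range(n):                  # step 3: one topic and one word per position
        z = rng.choice(K, p=theta)      # (a) topic z_n ~ Multinomial(theta)
        w = rng.choice(V, p=beta[z])    # (b) word w_n ~ p(w | z, beta)
        words.append(w)
    return words

print(generate_document())              # a list of word ids for one synthetic interface
```

Inference in LDA runs this story in reverse: given only the word lists, it recovers theta and beta, and theta then serves as the soft cluster membership of each deep web source.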
3.15. Latent Dirichlet allocation
Latent Dirichlet allocation (LDA) is a generative probabilistic model that explains a set of observations through unobserved groups of data, showing why some parts of the data are similar. Documents consist of different topics, and each topic can be explained through words: if the words collected into documents are the observations, then each document is a collection of words and a mixture of a small number of topics, and the words in the document can be explained by the document's topics. LDA is an example of a topic model and was presented as a graphical model by Blei et al. [22]; improvements to LDA are given in [27, 28]. LDA assumes the following generative process for each document d in a corpus C:

1. Select for each interface N ~ Poisson(ξ).
2. Select θ_interfaces ~ Dir(α).
3. For each of the N words w_n:
   (a) Select a topic z_n ~ Multinomial(θ_interfaces).
   (b) Select a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n.

This basic model makes some implicit assumptions. The dimensionality k of the Dirichlet distribution is assumed known and fixed (and thus the dimensionality of the topic variable z is fixed). The word probabilities are parameterized by a k x V matrix β. Finally, the Poisson assumption is not essential: N is independent of the other data-generating variables θ and z, so more realistic document length distributions can be used as needed.

A k-dimensional Dirichlet random variable θ has the following density function:

p(θ | α) = [Γ(Σ_{i=1}^{k} α_i) / Π_{i=1}^{k} Γ(α_i)] θ_1^{α_1 - 1} ... θ_k^{α_k - 1}    (3.5)

where α is the Dirichlet parameter and Γ(x) is the Gamma function. The Dirichlet belongs to the exponential family, which is convenient for parameter estimation in LDA. Given the parameters α and β, the joint distribution of a topic mixture θ, a set of N topics z, and a set of N words w is

p(θ, z, w | α, β) = p(θ | α) Π_{n=1}^{N} p(z_n | θ) p(w_n | z_n, β)    (3.6)

where p(z_n | θ) is simply θ_i for the unique i such that z_n^i = 1. The marginal distribution of a document is obtained by integrating over θ and summing over z:

p(w | α, β) = ∫ p(θ | α) (Π_{n=1}^{N} Σ_{z_n} p(z_n | θ) p(w_n | z_n, β)) dθ    (3.7)

Finally, the probability of a corpus is the product of the marginal probabilities of its documents:

p(D | α, β) = Π_{d=1}^{M} ∫ p(θ_d | α) (Π_{n=1}^{N_d} Σ_{z_dn} p(z_dn | θ_d) p(w_dn | z_dn, β)) dθ_d    (3.8)

In this model, latent Dirichlet allocation is a Bayesian network that models how the documents in a corpus are topically related. Two variables lie outside any plate: the Dirichlet prior parameter α of the per-document topic distributions, and the Dirichlet prior parameter β of the per-topic word distributions. The variables in the outermost plate are sampled once for each document in the collection; the M in the corner of the plate indicates the number of repetitions.

[Figure 3.5: Latent Dirichlet allocation, in plate notation.]

The inner plate covers the N words of a specific document: z is the topic of a specific word in the document and w represents the word itself, so the inner plate repeats N times, once per word. In the model some circles are filled and others are empty: the variables in the empty circles are hidden variables, not directly observed, and only w is observable. The directed edges show the variable dependencies.
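This generative process can be simulated directly. The short sketch below does so with numpy; the values of k, V, ξ, and the prior parameters are illustrative assumptions chosen only to make the example concrete, not the experimental settings (those are given in Section 4.3):

```python
import numpy as np

rng = np.random.default_rng(1)
k, V = 8, 50                      # illustrative: 8 topics, 50-word vocabulary
alpha = np.full(k, 50.0 / k)      # Dirichlet prior on topic mixtures
beta = rng.dirichlet(np.full(V, 0.01), size=k)   # k x V topic-word matrix

def generate_document(xi=20):
    N = rng.poisson(xi)                      # 1. document length
    theta = rng.dirichlet(alpha)             # 2. topic mixture for this document
    words = []
    for _ in range(N):                       # 3. for each of the N words
        z = rng.choice(k, p=theta)           #    (a) draw a topic
        w = rng.choice(V, p=beta[z])         #    (b) draw a word given the topic
        words.append(w)
    return words

print(generate_document())                   # a document, as word indices
```

Inference in LDA runs this process in reverse: given only the words, it recovers the hidden θ and z, which is what DWSemClust exploits to cluster the interfaces.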
3.16. Summary
In the sections above we discussed the preprocessing steps performed over the dataset, described the proposed method in detail, and defined the terms used in the proposed method.

Chapter 4
4. Experiments
In this chapter we discuss the experiments performed and their results. The chapter reports the clustering of the deep web sources and gives an overview of the performance measures. We then describe the experimental setup, covering the dataset and the performance measures used, and discuss the results in detail.

4.1. Performance Measures
The F-measure is a combined measure of precision and recall [24]. As described in the dataset section, the dataset is divided into 8 domains; we cluster it in an unsupervised way and then count the true positives (TP), false negatives (FN), and false positives (FP):

Recall = TP / (TP + FN)    (4.1)

Precision = TP / (TP + FP)    (4.2)

Here TP is the number of members of a class that are truly clustered, i.e. they belong to a cluster and the clustering algorithm also assigns them to it; FN is the number of members of a class that belong to a cluster but are falsely assigned to another domain; and FP is the number of members of other classes that are falsely assigned to the cluster. The F-measure is then computed by the following formula:

F(i, k) = (2 x Recall x Precision) / (Recall + Precision)    (4.3)

The overall F-measure for a set of clusters is computed as the weighted average of the F-measures of the individual clusters. A perfect clustering solution results in an F-score of one, and in general the higher the F-measure, the better the clustering solution.

For evaluating the clusters, the other performance measure we use is entropy. Entropy can be defined as the measure of disorder of a cluster; cluster performance increases as entropy decreases. For every cluster c_j we calculate the probability p_jk that a member of cluster j belongs to class k. Entropy is then calculated through the standard formula

Entropy_j = - Σ_k p_jk log(p_jk)    (4.4)

The total entropy for a set of clusters is the sum of the entropies of the individual clusters, weighted by the size of each cluster. The more homogeneous the clusters, the better the clustering and the lower the entropy.
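To make the two measures concrete, here is a minimal sketch of how they can be computed from true domain labels and predicted cluster labels. The function names and the per-class maximization (following the clustering F-measure of Larsen and Aone [24]) are our own choices for illustration, not the exact evaluation code used in our experiments:

```python
import numpy as np
from collections import Counter

def cluster_entropy(true_labels, cluster_labels):
    """Size-weighted total entropy over all clusters (Eq. 4.4)."""
    n = len(true_labels)
    total = 0.0
    for c in set(cluster_labels):
        members = [t for t, g in zip(true_labels, cluster_labels) if g == c]
        probs = np.array(list(Counter(members).values())) / len(members)
        total += (len(members) / n) * -(probs * np.log(probs)).sum()
    return total

def cluster_f_measure(true_labels, cluster_labels):
    """Weighted average over classes of the best per-cluster F score (Eqs. 4.1-4.3)."""
    n = len(true_labels)
    score = 0.0
    for k in set(true_labels):
        class_size = sum(1 for t in true_labels if t == k)
        best_f = 0.0
        for c in set(cluster_labels):
            tp = sum(1 for t, g in zip(true_labels, cluster_labels) if t == k and g == c)
            cluster_size = sum(1 for g in cluster_labels if g == c)
            if tp == 0:
                continue
            recall, precision = tp / class_size, tp / cluster_size
            best_f = max(best_f, 2 * recall * precision / (recall + precision))
        score += (class_size / n) * best_f
    return score
```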
4.2. Dataset
In order to evaluate performance, the algorithms described in the previous chapter were tested over a set of 259 form pages retrieved from the UIUC repository [23]. We use the TEL-8 query interfaces, which were collected from different sources belonging to 8 different domains; since some pages are no longer available due to server or timeout errors, we use 259 form pages in our experiments. TEL-8 stands for the Travel, Entertainment, and Living groups, which together cover the 8 domains: the Travel group relates to car rentals, hotels, and airfares; the Entertainment group covers music records, movies, and books; and the Living group contains jobs and automobiles. The dataset was created in May 2003. We gathered all the form pages in the repository that are still available on the Web; the collection contains both single- and multi-attribute forms. Table 4.1 shows the dataset description as given in the UIUC repository [23].

Table 4.1: Dataset Description
Group                 Domain          Number of Sources   Number of Queryable Interfaces   Simple query interfaces   Advanced query interfaces
Travel group          Airfares        34                  34                               Yes                       Yes
Travel group          Hotels          26                  26                               Yes                       Yes
Travel group          Car Rentals     17                  17                               Yes                       Yes
Entertainment group   Books           42                  42                               Yes                       Yes
Entertainment group   Movies          41                  41                               Yes                       Yes
Entertainment group   Music Records   35                  35                               Yes                       Yes
Living group          Jobs            25                  25                               Yes                       Yes
Living group          Automobiles     39                  39                               Yes                       Yes

4.3. Parameter settings
The hyperparameters α and β can be optimized through the Gibbs sampling algorithm [27] or the Expectation Maximization (EM) method [28]. We use Gibbs sampling rather than EM because EM is computationally inefficient and vulnerable to local maxima [22]. Hyperparameters need attention because some topic models in different applications are sensitive to them; our topic model, however, is not sensitive to the hyperparameters. In our experiments, with z = 8 topics, the hyperparameter values for α and β are 50/z and 0.01, respectively. The number of topics z is set according to the dataset used, which contains 8 domains as discussed in the dataset section. We ran Gibbs sampling chains of 1000 iterations each. Experiments were performed on a machine running Windows 7 with an Intel(R) Core(TM) 2 CPU (1.67 GHz) and 1 GB of memory.

4.4. Results and Discussions
To check the efficiency of our proposed technique DWSemClust, we calculate entropy and the F-measure as performance measures. Table 4.2 shows the entropy results of DWSemClust for form data only and for both form and page data; DWSemClust shows good performance in both scenarios.

Table 4.2: Entropy results of DWSemClust for forms and pages
Sr.No   DWSemClust (formpages)   DWSemClust (forms)
1       1.235                    1.5119
2       1.286                    1.51408
3       1.232                    1.51408
4       1.249                    1.51408
5       1.275                    1.5292
6       1.213                    1.5292
7       1.211                    1.5035
8       1.221                    1.5453
9       1.241                    1.5452
10      1.263                    1.5008
11      1.260                    1.5017
12      1.299                    1.5090
13      1.263                    1.5120
14      1.249                    1.5246
15      1.249                    1.4940
16      1.249                    1.5029
17      1.249                    1.5137
18      1.249                    1.5196
19      1.267                    1.489
20      1.296                    1.517
21      1.265                    1.522
22      1.305                    1.516
23      1.226                    1.496
24      1.301                    1.507
25      1.265                    1.5452

Figure 4.1 gives a graphical view of the DWSemClust entropy values in Table 4.2, with the number of iterations on the x-axis, the entropy value on the y-axis, and the techniques identified at the right. The entropy curves show that the combination of form and page contents gives better results than form contents alone.

[Figure 4.1: Entropy for DWSemClust with form contents, and with form and page contents.]
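A table such as Table 4.3 below can be produced from a fitted model by ranking each topic's word distribution. A minimal sketch, assuming the fitted gensim model `lda` from the Chapter 3 pipeline sketch (an illustrative stand-in, not our exact code):

```python
# Assuming `lda` is the fitted gensim LdaModel from the pipeline sketch above.
for topic_id in range(lda.num_topics):
    top = lda.show_topic(topic_id, topn=20)        # list of (word, probability)
    words = ", ".join(f"{w} ({p:.6f})" for w, p in top)
    print(f"Topic {topic_id}: {words}")
```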
Table 4.3: Topics discovered through DWSemClust (top 20 words per topic, with probabilities)

Topic 0 (Music Records):
books (0.036005), music (0.028672), book (0.025338), news (0.020005), online (0.019339), contact (0.017339), shop (0.016672), store (0.016672), dvd (0.016006), gift (0.015339), toys (0.015339), usd (0.013339), games (0.013339), browse (0.012006), prices (0.011339), movie (0.010673), quot (0.010673), titles (0.010673), tax (0.010673), reviews (0.010006)

Topic 1 (Airfares):
city (0.066359), airport (0.065777), flight (0.0553), child (0.044241), name (0.037839), infant (0.036092), flights (0.033182), hotel (0.027944), include (0.019795), options (0.018049), lap (0.016885), children (0.016303), seat (0.016303), class (0.015721), miles (0.014557), commerce (0.014557), departing (0.013975), days (0.013975), airports (0.013393), preferred (0.013393)

Topic 2 (Movies):
prices (0.037763), movie (0.034109), rewards (0.031673), sort (0.024974), search (0.024365), artist (0.020711), miles (0.019493), reservation (0.018275), options (0.017667), view (0.017058), confirmation (0.017058), ticket (0.016449), close (0.015231), wyndham (0.015231), special (0.014013), stay (0.014013), hotels (0.014013), offers (0.014013), account (0.012795), join (0.012795)

Topic 3 (Hotels):
deals (0.037741), help (0.032238), travel (0.02988), document (0.028308), information (0.025949), vacation (0.025163), hotel (0.022805), save (0.022805), cheap (0.022019), map (0.021233), sign (0.020447), privacy (0.018088), destination (0.018088), cheapfares (0.017302), hotels (0.016516), write (0.014944), tips (0.014158), specials (0.013372), print (0.013372), writeln (0.012586)

Topic 4 (Jobs):
jobs (0.120145), pound (0.049718), job (0.027451), saving (0.024862), details (0.020719), price (0.019683), summary (0.017094), sales (0.015023), health (0.013469), business (0.012951), manager (0.012433), london (0.011916), care (0.011398), application (0.01088), south (0.009844), filofax (0.009326), west (0.008291), north (0.008291), director (0.007773), category (0.007773)

Topic 5 (Automobiles):
cars (0.055285), chevrolet (0.028059), ford (0.028059), toyota (0.028059), car (0.028059), quantity (0.025584), vehicle (0.023934), nissan (0.021459), honda (0.019809), cyl (0.018984), bmw (0.015684), dodge (0.015684), hyundai (0.015684), mercedes (0.014034), benz (0.014034), mazda (0.014034), volkswagen (0.013209), lexus (0.013209), gmc (0.012384), audi (0.011559)

Topic 6 (Car Rentals):
car (0.077764), time (0.05887), code (0.053056), select (0.047969), travel (0.039249), pick (0.037069), drop (0.034889), ages (0.032709), date (0.032709), address (0.028348), rental (0.024715), location (0.023262), zip (0.021808), city (0.021081), airport (0.018901), company (0.018175), day (0.018175), country (0.016721), age (0.014541), hotel (0.014541)

Topic 7 (Books):
title (0.06196), advanced (0.050344), price (0.031953), keyword (0.030017), author (0.029049), keywords (0.028081), model (0.024209), isbn (0.024209), below (0.022273), results (0.021305), abc (0.018401), category (0.014529), format (0.013561), fields (0.013561), exact (0.012593), sort (0.012593), search (0.011625), artist (0.011625), publisher (0.011625), facilities (0.010657)

Table 4.4 shows the F-measure results for DWSemClust with form and page contents and with form contents only; the higher the F-measure, the better the clustering, and DWSemClust gives high values. Sr.No gives the run number, DWSemClust (formpages) gives the F-measure for form and page contents, and DWSemClust (forms) gives the F-measure for form contents alone. Form and page contents give better results than form contents only.

Table 4.4: F-measure results of DWSemClust
Sr.No   DWSemClust (formpages)   DWSemClust (forms)
1       3.429                    3.892
2       3.802                    3.386
3       4.617                    4.323
4       3.724                    4.037
5       4.293                    3.429
6       3.698                    3.656
7       3.810                    3.994
8       3.739                    3.130
9       3.342                    3.396
10      3.734                    3.512

Figure 4.2 gives a pictorial view of the F-measure values, with the number of iterations on the x-axis, the F-measure values on the y-axis, and the techniques identified at the right.

[Figure 4.2: F-measure for DWSemClust with form contents, and with form and page contents.]
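The comparison figures in this chapter are simple line charts of the per-run scores. As an aside, such a figure can be reproduced from the table data with a few lines of matplotlib; the sketch below uses the values of Table 4.4:

```python
import matplotlib.pyplot as plt

# Per-run F-measure values copied from Table 4.4.
formpages = [3.429, 3.802, 4.617, 3.724, 4.293, 3.698, 3.810, 3.739, 3.342, 3.734]
forms = [3.892, 3.386, 4.323, 4.037, 3.429, 3.656, 3.994, 3.130, 3.396, 3.512]
runs = range(1, len(formpages) + 1)

plt.plot(runs, formpages, marker="o", label="DWSemClust (formpages)")
plt.plot(runs, forms, marker="s", label="DWSemClust (forms)")
plt.xlabel("No of Iterations")
plt.ylabel("F-measure")
plt.legend()
plt.show()
```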
Table 4.5 shows the comparison of CAFC_C and DWSemClust with form and page contents, based on entropy: the first column gives the Sr.No, the second CAFC_C with formpages, and the third DWSemClust with formpages. DWSemClust shows more stable behavior than CAFC_C.

Table 4.5: Comparison of CAFC_C and DWSemClust with form and page contents based on entropy
Sr.No   CAFC_C (formpages)   DWSemClust (formpages)
1       1.3495               1.235
2       7.0615               1.286
3       2.1336               1.232
4       12.4307              1.249
5       2.4678               1.275
6       4.2335               1.213
7       4.565                1.211
8       7.781                1.221
9       5.618                1.241
10      0.4420               1.263
11      9.691                1.260
12      10.578               1.299
13      7.423                1.263
14      0.52538              1.249
15      5.8126               1.249
16      5.6317               1.249
17      2.3613               1.249
18      3.746                1.249
19      4.4169               1.267
20      9.608                1.296
21      9.200                1.265
22      0.4008               1.305
23      1.4245               1.226
24      0.0019               1.301
25      0.7379               1.265
Total entropy     119.6416   31.418
Average entropy   4.785664   1.25672

Figure 4.3 shows the comparison of DWSemClust and CAFC_C with formpages on the basis of entropy, with the number of iterations on the x-axis, the entropy value on the y-axis, and the techniques identified at the right. DWSemClust shows stable behavior, while CAFC_C fluctuates.

[Figure 4.3: Comparison of DWSemClust and CAFC_C with formpages contents based on entropy.]

Table 4.6 shows the entropy comparison of CAFC_C and DWSemClust with form contents: the first column gives the Sr.No, the second CAFC_C with form contents, and the third DWSemClust with form contents.

Table 4.6: Comparison of CAFC_C and DWSemClust with forms contents based on entropy
Sr.No   CAFC_C (forms)   DWSemClust (forms)
1       6.1536           1.5119
2       7.804            1.51408
3       5.6761           1.51408
4       0.130            1.51408
5       4.006            1.5292
6       1.065            1.5292
7       4.282            1.5035
8       15.6036          1.5453
9       4.399            1.5452
10      11.5607          1.5008
11      11.562           1.5017
12      8.7423           1.5090
13      5.5947           1.5120
14      12.0949          1.5246
15      5.650            1.4940
16      3.943            1.5029
17      9.078            1.5137
18      6.312            1.5196
19      5.344            1.489
20      14.787           1.517
21      2.5748           1.522
22      0.2194           1.516
23      10.1402          1.496
24      5.060            1.507
25      10.991           1.5452
Total entropy     172.7733   37.87704
Average entropy   6.910932   1.5150816

Figure 4.4 shows the comparison of DWSemClust and CAFC_C with form contents on the basis of entropy; the x-axis shows the number of iterations, the y-axis the entropy value, and the techniques are identified at the right. The entropy curves demonstrate the efficiency of DWSemClust over CAFC_C.

[Figure 4.4: Comparison of DWSemClust and CAFC_C with forms based on entropy.]

Table 4.7 compares CAFC_C and DWSemClust with form contents and with form and page contents on the basis of entropy. The table clearly shows that DWSemClust gives good results in both scenarios, and that it works better with page contents than with form contents alone.
Table 4.7: Comparison of CAFC_C and DWSemClust with forms contents and with forms and pages contents based on entropy
Sr.No   CAFC_C (formpages)   CAFC_C (forms)   DWSemClust (formpages)   DWSemClust (forms)
1       1.3495               6.1536           1.235                    1.5119
2       7.0615               7.804            1.286                    1.51408
3       2.1336               5.6761           1.232                    1.51408
4       12.4307              0.130            1.249                    1.51408
5       2.4678               4.006            1.275                    1.5292
6       4.2335               1.065            1.213                    1.5292
7       4.565                4.282            1.211                    1.5035
8       7.781                15.6036          1.221                    1.5453
9       5.618                4.399            1.241                    1.5452
10      0.4420               11.5607          1.263                    1.5008
11      9.691                11.562           1.260                    1.5017
12      10.578               8.7423           1.299                    1.5090
13      7.423                5.5947           1.263                    1.5120
14      0.52538              12.0949          1.249                    1.5246
15      5.8126               5.650            1.249                    1.4940
16      5.6317               3.943            1.249                    1.5029
17      2.3613               9.078            1.249                    1.5137
18      3.746                6.312            1.249                    1.5196
19      4.4169               5.344            1.267                    1.489
20      9.608                14.787           1.296                    1.517
21      9.200                2.5748           1.265                    1.522
22      0.4008               0.2194           1.305                    1.516
23      1.4245               10.1402          1.226                    1.496
24      0.0019               5.060            1.301                    1.507
25      0.7379               10.991           1.265                    1.5452
Total entropy     119.6416   172.7733   31.418    37.87704
Average entropy   4.785664   6.910932   1.25672   1.5150816

Figure 4.5 shows the comparison of CAFC_C and DWSemClust with form contents and with form and page contents based on entropy, with the number of iterations on the x-axis, the entropy value on the y-axis, and the techniques identified at the right. The entropy curves for DWSemClust show the efficiency of our proposed technique.

[Figure 4.5: Comparison of CAFC_C and DWSemClust with forms contents and with forms and pages contents based on entropy.]

Table 4.8 shows the F-measure comparison of CAFC_C and DWSemClust with form contents: the first column gives the Sr.No, the second the F-measure values for CAFC_C (forms), and the third the F-measure values for DWSemClust (forms).

Table 4.8: Comparison of CAFC_C and DWSemClust with forms contents based on F-measure
Sr.No   CAFC_C (forms)   DWSemClust (forms)
1       3.680            3.892
2       3.538            3.386
3       4.059            4.323
4       2.407            4.037
5       2.739            3.429
6       3.441            3.656
7       3.461            3.994
8       3.047            3.130
9       3.309            3.396
10      3.901            3.512

Figure 4.6 shows the comparison of CAFC_C and DWSemClust on the basis of the F-measure, with the number of iterations on the x-axis, the F-measure values on the y-axis, and the techniques identified at the right. Since a higher F-measure is better, DWSemClust, which gives higher F-measure values than CAFC_C, performs better.

[Figure 4.6: Comparison of CAFC_C and DWSemClust with forms contents based on F-measure.]

Table 4.9 illustrates the comparison of CAFC_C and DWSemClust with form and page contents: the first column gives the Sr.No, the second the F-measure values of CAFC_C with formpages, and the third the F-measure values of DWSemClust.
Table 4.9: F-measure comparison of CAFC_C and DWSemClust with forms and pages contents
Sr.No   CAFC_C (formpages)   DWSemClust (formpages)
1       2.450                3.429
2       2.367                3.802
3       2.079                4.617
4       2.067                3.724
5       1.584                4.293
6       2.301                3.698
7       2.44                 3.810
8       2.536                3.739
9       2.448                3.342
10      1.966                3.734

Figure 4.7 shows the comparison of CAFC_C and DWSemClust with form and page contents on the basis of the F-measure, with the number of iterations on the x-axis, the F-measure values on the y-axis, and the techniques identified at the right. Since a higher F-measure is better, DWSemClust, which gives higher values than CAFC_C, performs better.

[Figure 4.7: Comparison of CAFC_C and DWSemClust with forms and pages contents based on F-measure.]

Table 4.10 shows the comparison of CAFC_C and DWSemClust with form contents only and with form and page contents on the basis of F-measure values.

Table 4.10: F-measure comparison of CAFC_C and DWSemClust with forms contents and with forms and pages contents
Sr.No   CAFC_C (formpages)   CAFC_C (forms)   DWSemClust (formpages)   DWSemClust (forms)
1       2.450                3.680            3.429                    3.892
2       2.367                3.538            3.802                    3.386
3       2.079                4.059            4.617                    4.323
4       2.067                2.407            3.724                    4.037
5       1.584                2.739            4.293                    3.429
6       2.301                3.441            3.698                    3.656
7       2.44                 3.461            3.810                    3.994
8       2.536                3.047            3.739                    3.130
9       2.448                3.309            3.342                    3.396
10      1.966                3.901            3.734                    3.512
Total F-measure     22.238   33.582   38.108   37.755
Average F-measure   2.224    3.358    3.811    3.776

Figure 4.8 shows the comparison of CAFC_C and DWSemClust with form contents and with form and page contents, with the number of iterations on the x-axis, the F-measure values on the y-axis, and the techniques identified at the right. DWSemClust gives better F-measure values in both situations, with form contents and with form and page contents.

[Figure 4.8: Comparison of CAFC_C and DWSemClust with forms and pages and forms contents based on F-measure.]

Figure 4.9 shows the average entropy of CAFC_C and DWSemClust with form contents and with form and page contents. The average entropy of DWSemClust is lower than that of CAFC_C in both scenarios.

[Figure 4.9: Comparison of average entropy across CAFC_C (forms), CAFC_C (formpages), DWSemClust (forms), and DWSemClust (formpages).]

Figure 4.10 shows the average F-measure of CAFC_C and DWSemClust with form contents and with form and page contents. The average F-measure of DWSemClust is higher than that of CAFC_C in both scenarios.

[Figure 4.10: Comparison of average F-measure across CAFC_C (forms), CAFC_C (formpages), DWSemClust (forms), and DWSemClust (formpages).]

Table 4.11: Comparison of CAFC_C, CAFC_CH and DWSemClust on the basis of entropy
Performance Measure      CAFC_C   CAFC_CH   DWSemClust
Entropy for form pages   4.786    4.321     1.257
Entropy for forms        6.911    6.449     1.515

Table 4.11 compares CAFC_C, CAFC_CH, and DWSemClust; the first column gives the performance measure used, which is entropy.
CAFC_C uses a random selection of documents, while CAFC_CH uses hub-induced similarity as a preprocessing step: hubs are generated first, the number of clusters is selected, and then the CAFC_C algorithm is run to cluster the sources. Among all these techniques, DWSemClust performs best.

[Figure 4.11: Entropy comparison of CAFC_C, CAFC_CH and DWSemClust.]

The figure shows the comparison of CAFC_C, CAFC_CH, and DWSemClust, with the entropy value on the y-axis and the techniques on the x-axis. The entropy values of CAFC_C and CAFC_CH are higher than that of DWSemClust; since cluster performance decreases as the entropy value increases, the lower entropy of DWSemClust indicates better clusters.

Table 4.12: Comparison of CAFC_C, CAFC_CH and DWSemClust on the basis of F-measure
Performance Measure        CAFC_C   CAFC_CH   DWSemClust
F-measure for form pages   2.224    3.215     3.811
F-measure for forms        3.358    3.402     3.776

Table 4.12 compares CAFC_C, CAFC_CH, and DWSemClust; the first column gives the performance measure used, which is the F-measure. As before, CAFC_C uses a random selection of documents and CAFC_CH uses hub-induced similarity as a preprocessing step, but among all these techniques DWSemClust performs best.

[Figure 4.12: F-measure comparison of CAFC_C, CAFC_CH and DWSemClust.]

The figure shows the comparison of CAFC_C, CAFC_CH, and DWSemClust, with the F-measure value on the y-axis and the techniques on the x-axis. The F-measure values of CAFC_C and CAFC_CH are lower than that of DWSemClust; since cluster performance increases with the F-measure, the higher F-measure of DWSemClust indicates better clusters.

4.5. Summary
In the sections above we discussed the performance measures and briefly described the dataset, including where it was gathered and which attributes are used. We then described the experimental setup and results in detail. From this discussion we conclude that the proposed technique performs well, with a high F-measure and a low entropy value.

Chapter 5
5. Conclusions and Future Work
In this thesis we cluster the heterogeneous sources of the deep web, which is a critical task towards the integration of those sources. Correct and accurate clustering is desirable for satisfying user needs and supporting timely decisions, and the results show that our proposed technique DWSemClust is more efficient than existing techniques. It takes less time to produce clustering results, and while the existing techniques fluctuate, the proposed technique is more stable. One strength of our approach is that it deals with semantics, because LDA has a semantic layer that deals specifically with meaning. Moreover, LDA produces soft clusters, giving the probability of each document for each cluster. Hence, DWSemClust is suitable for scenarios where the sources are sparsely distributed over the web.
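To illustrate the soft-clustering point, the per-document cluster probabilities can be read directly off a fitted LDA model. A minimal sketch, assuming the gensim `lda` model and `corpus` from the Chapter 3 pipeline sketch (illustrative assumptions, not our exact code):

```python
# Soft assignment: the model's full topic distribution for one document.
soft = lda.get_document_topics(corpus[0], minimum_probability=0.0)
print(soft)                      # e.g. [(0, 0.71), (1, 0.04), ..., (7, 0.02)]

# Hard assignment, for comparison: keep only the most probable cluster,
# which is the kind of single-cluster allocation CAFC_C produces.
hard = max(soft, key=lambda tp: tp[1])[0]
```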
A survey [18] identifies the parameters for clustering deep web sources; according to that survey, our approach covers all the major parameters.

5.1. Concluded points
The following key points are concluded.

5.1.1. Stability
DWSemClust is more stable than CAFC_C: the experimental section clearly shows that the entropy values of CAFC_C fluctuate in every iteration, while DWSemClust shows stable behavior.

5.1.2. Soft clustering
DWSemClust uses latent Dirichlet allocation, a topic model that gives soft clusters. Soft clustering gives the probability of each document for each cluster, whereas CAFC_C gives hard clusters, allocating each document to a single cluster.

5.1.3. Semantics
DWSemClust uses latent Dirichlet allocation, which works on semantics, while CAFC_C does not deal with semantics; the results therefore show the difference between semantic and non-semantic techniques.

5.1.4. Running time
The running time of DWSemClust is low compared to CAFC_C: CAFC_C takes approximately an hour per run, whereas DWSemClust takes about 5 minutes.

5.1.5. Parameter estimation
The parameters defined in [18] are: structured deep web sources, unstructured sources, simple query interfaces, advanced query interfaces, clustering, classification, query probing, visible form features, macro, micro, and the use of ontology. In our work we cover structured deep web sources, simple query interfaces, advanced query interfaces, clustering, visible form features, and the macro parameter.

5.2. Future Work
5.2.1. Integrated schema
We will work on an integrated schema based on our clustering results, which will be used to satisfy customer queries. For example, if a customer wants to buy a book, he will query the database and the integrated schema will return all possible stores from which he can buy it and at what prices.

5.2.2. Check dataset on structured techniques
We will check the dataset on structured techniques, since the proposed technique uses the topic model LDA, which is unstructured in nature. If such a technique gives good results, we will investigate a combination of our technique with other techniques.

References
[1] Wikipedia: http://en.wikipedia.org/wiki/Deep_Web
[2] Deep web search directory service: http://www.completeplanet.com
[3] K. C.-C. Chang, B. He, C. Li, M. Patel, and Z. Zhang, "Structured databases on the web: Observations and implications", SIGMOD Record, vol. 33(3), pp. 61-70, Sept. 2004.
[4] S. Raghavan and H. Garcia-Molina, "Crawling the Hidden Web", in VLDB, pp. 129-138, 2001.
[5] A. Hess and N. Kushmerick, "Automatically attaching semantic metadata to web services", in Proceedings of IIWeb, pp. 111-116, 2003.
[6] D. Barbara, J. Couto, and Y. Li, "COOLCAT: An entropy-based algorithm for categorical clustering", in 11th International Conference on Information and Knowledge Management, pp. 582-589, 2002.
[7] B. He, T. Tao, and K. C.-C. Chang, "Organizing structured web sources by query schemas: a clustering approach", in CIKM, pp. 22-31, 2004.
[8] L. Barbosa, J. Freire, and A. Silva, "Organizing hidden-web databases by clustering visible web documents", in ICDE, pp. 326-335, 2007.
[9] J. P. Callan, M. Connell, and A. Du, "Automatic discovery of language models for text databases", in SIGMOD, pp. 479-490, 1999.
[10] H. Li et al., "Clustering Deep Web Databases Semantically", in AIRS, pp. 365-376, 2008.
[11] W. Zhang, K. Chen, and F. Zhang, "Mining Data Records based on Ontology Evolution for Deep Web", in 2nd International Conference on Computer Engineering and Technology, 2010.
[12] H. X. Xu, X.-L. Hao, S.-Y. Wang, and Y.-F. Hu, "A Method of Deep Web Classification", in CMLC, pp. 4009-4014, 2007.
[13] L. Peiguang, Y. Du, X. Tan, and C. Lv, "Research on Automatic Classification for Deep Web Query Interfaces", in ISIP, pp. 313-317, 2008.
[14] P. Zhao, L. Huang, and W. Fang, "Organizing Structured Deep Web by Clustering Query Interfaces Link Graph", in Advanced Data Mining and Applications (ADMA), pp. 683-690, 2008.
[15] H. Q. Le, "Classifying Structured Web Sources Using Aggressive Feature Selection", in 5th International Conference on Web Information Systems and Technologies (WEBIST), pp. 618-625, 2009.
[16] X. Xian, P. Zhao, W. Fang, and J. Xin, "Automatic classification of deep web databases with simple query interface", in ICIMA, pp. 85-88, 2009.
[17] Y. Q. Dong, Q. Z. Li, Y. H. Ding, et al., "A query interface matching approach based on extended evidence theory for Deep Web", Journal of Computer Science and Technology, vol. 25(3), May 2010.
[18] U. Noor, Z. Rashid, and A. Rauf, "A Survey of Automatic Deep Web Classification Techniques", International Journal of Computer Applications, 2011.
[19] G. Salton, A. Wong, and C. S. Yang, "A vector space model for automatic indexing", CACM, vol. 18(11), pp. 613-620, 1975.
[20] R. A. Baeza-Yates and B. A. Ribeiro-Neto, "Modern Information Retrieval", ACM Press/Addison-Wesley, 1999.
[21] M. Steinbach, G. Karypis, and V. Kumar, "A comparison of document clustering techniques", in KDD Workshop on Text Mining, 2000.
[22] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet Allocation", Journal of Machine Learning Research, vol. 3, pp. 993-1022, 2003.
[23] The UIUC Web integration repository: http://metaquerier.cs.uiuc.edu/repository
[24] B. Larsen and C. Aone, "Fast and effective text mining using linear-time document clustering", in KDD, pp. 16-22, 1999.
[25] K. Bharat, A. Broder, M. Henzinger, P. Kumar, and S. Venkatasubramanian, "The connectivity server: Fast access to linkage information on the Web", Computer Networks, vol. 30(1-7), pp. 469-477, 1998.
[26] A. Daud, J. Li, L. Zhou, and F. Muhammad, "Knowledge Discovery through Directed Probabilistic Topic Models: a Survey", Frontiers of Computer Science in China (FCS), vol. 4(2), pp. 280-301, June 2010.
[27] T. L. Griffiths and M. Steyvers, "Finding scientific topics", in Proceedings of the National Academy of Sciences (NAS), USA, pp. 5228-5235, 2004.
[28] T. Hofmann, "Probabilistic Latent Semantic Analysis", in Proc. of the 15th Annual Conference on Uncertainty in Artificial Intelligence (UAI), Stockholm, Sweden, 1999.