Semantically Mining Heterogeneous
Data Sources of Deep Web
A thesis submitted
For the Partial Fulfillment of the Requirement of
MS (CS) Degree
By
Ayesha Manzoor
584-FBAS/MSCS/F09
Supervisor
Dr Ali Daud
Assistant Professor IIUI
Co-Supervisor
Umara Zahid
Department of Computer Sciences & Software Engineering
International Islamic University Islamabad Campus
2012
Department of Computer Science
International Islamic University Islamabad
Date: [date of external examination]
Final Approval
This is to certify that we have read the thesis submitted by [Ayesha Manzoor], [584-FBAS/MSCS/F09]. It is our judgment that this thesis is of sufficient standard to warrant its acceptance by International Islamic University, Islamabad for the degree of [MS Computer Science].
Committee:
External Examiner:
[External Examiner’s name]
[Designation of External Examiner]
[Address of External Examiner]
___________________________
Internal Examiner:
[Internal Examiner’s name]
[Designation of Internal Examiner]
[Address of Internal Examiner]
___________________________
Supervisor:
[Dr Ali Daud]
[Assistant Professor]
[International Islamic University, Islamabad]
___________________________
Co-Supervisor:
[Umara Zahid]
[Research Associate]
[International Islamic University, Islamabad]
___________________________
Dedicated to my beloved Parents and
Brothers
A dissertation submitted to the
Department of Computer Science & Software Engineering,
Faculty of Basic and Applied Sciences,
International Islamic University, Islamabad
as partial fulfillment of the requirements for the award of the
degree of MS (CS).
Declaration
I hereby declare that this research thesis, neither as a whole nor in part, has been copied from any source. It is further declared that I have done this research, with the accompanying report, entirely on the basis of my personal efforts, under the proficient guidance of my teachers, especially my supervisors. If any part of this thesis is proved to be copied from any source, or found to be a reproduction of a thesis from any training institute or educational institution, I shall stand by the consequences.
___________________________
[Ayesha Manzoor]
[584-FBAS-MSCS/F09]
Acknowledgement
First of all, I am obliged to Allah Almighty, the Merciful, the Beneficent and the source of all knowledge, for granting me the courage and knowledge to complete this thesis. I am thankful to my supervisor Dr Ali Daud and my co-supervisor Mrs. Umara Zahid, who gave me direction and guidance to accomplish this thesis. I am also thankful to my fellows who helped me in this thesis, especially Kashif Iftikhar, Umm-e-Zahoora, Robina Khatoon, and Assmah Jabeen. I am also thankful to the university, its staff, and its faculty members. Finally, I am thankful to my mother, my brothers, and my family for encouraging and supporting me in this research work.
___________________________
[Ayesha Manzoor]
[584-FBAS-MSCS/F09]
Abstract
Over the years a critical increase in the mass of the web has been observed. A large part of it comprises online subject-specific databases hidden behind query interface forms, known as the deep web. Existing search engines are unable to completely index this highly relevant information due to its large volume. To make deep web content accessible, the research community has proposed organizing it using machine learning techniques. Clustering is one of the key solutions for organizing deep web databases. In this research work we propose a novel method, DWSemClust, to semantically cluster deep web databases. For this purpose we employ a generative probabilistic model, latent Dirichlet allocation (LDA), to model the content representative of deep web databases. LDA clusters words into "topics", and a document is a mixture of different topics; parameter estimation in the model determines what the topics are and in what proportion each document contains them. Deep web sources are mostly sparse, which is one motive for using LDA. The content representative comprises form contents (single attribute/multiple attributes), page contents, and the hyperlink structure in the neighborhood of forms, i.e., hub/authority scores. In this work we first provide a comprehensive assessment of existing deep web clustering approaches. Based on the limitations of existing approaches we present our proposed method. Finally, we provide a comparative analysis between our proposed method and the existing methods.
Table of Contents

Chapter 1
1. Problem Definition
  1.1. Deep web
  1.2. Difference between "surface" web and "deep" web
    1.2.1. Searching strategy
    1.2.2. Size of deep web
    1.2.3. Quality of deep web is different from the "surface" web
    1.2.4. Growing ratio of deep web and surface web
  1.3. Scale/Coverage of deep web
    1.3.1. Scale of deep web
    1.3.2. Coverage of deep web
  1.4. Coverage of deep web through search engines
  1.5. Coverage of deep web through deep web directories
  1.6. Challenges of deep web
  1.7. Why deep web sources need to be clustered or classified
  1.8. Benefits of clustering the deep web sources
  1.9. Proposed approach
Chapter 2
2. Literature Review
  2.1. Feature extraction/Feature selection
  2.2. Clustering
  2.3. Classification
Chapter 3
3. Methodology
  3.1. Data preprocessing
    3.1.1. Form-Page Model
    3.1.2. Compute Form-Page Vectors
    3.1.3. Stop word removal
    3.1.4. Remove less frequent terms
  3.2. Calculation of term weights
  3.3. Computing Form-Page Similarity
  3.4. The CAFC-C Algorithm
  3.5. CAFC-CH Algorithm
  3.6. Generative model
  3.7. Discriminative model
  3.8. Difference between generative models and discriminative models
  3.9. Topic model
  3.10. Latent variable
  3.11. Prior probability
  3.12. Multinomial distribution
  3.13. Plate notation
  3.14. Proposed Technique: DWSemClust
  3.15. Latent Dirichlet allocation
  3.16. Summary
Chapter 4
4. Experiments
  4.1. Performance Measures
  4.2. Dataset
  4.3. Parameter settings
  4.4. Results and Discussions
  4.5. Summary
Chapter 5
5. Conclusions and Future Work
  5.1. Concluded points
    5.1.1. Stability
    5.1.2. Soft Clustering
    5.1.3. Semantics
    5.1.4. Running time
    5.1.5. Parameter estimation
  5.2. Future Work
    5.2.1. Integrated schema
    5.2.2. Check dataset on structured techniques
References
List of Tables

Table 1.1: Deep web estimation and sampling
Table 1.2: Web directories coverage
Table 3.1: Algorithm CAFC-C
Table 3.2: Algorithm CAFC-CH
Table 3.3: Proposed Technique: DWSemClust
Table 4.1: Dataset Description
Table 4.2: Entropy results of DWSemClust for forms and pages
Table 4.3: Topics discovery through DWSemClust
Table 4.4: F-measure results of DWSemClust
Table 4.5: Comparison of CAFC-C and DWSemClust with form and page contents based on entropy
Table 4.6: Comparison of CAFC-C and DWSemClust with form contents based on entropy
Table 4.7: Comparison of CAFC-C and DWSemClust with form contents and with form and page contents based on entropy
Table 4.8: Comparison of CAFC-C and DWSemClust with form contents based on F-measure
Table 4.9: F-measure comparison of CAFC-C with form and page contents and DWSemClust with forms and pages
Table 4.10: F-measure comparison of CAFC-C with form and page contents and DWSemClust with forms and pages
List of Figures

Figure 1.1: A deep web site
Figure 1.2: Surface and deep web
Figure 1.3: Search engines coverage
Figure 1.4: Document clustering
Figure 3.1: Plate as sub graph
Figure 3.2: Symbol for hidden parameter
Figure 3.3: Symbol for observed variable
Figure 3.4: Arrow shows the dependency
Figure 3.5: Latent Dirichlet allocation
Figure 4.1: Entropy for DWSemClust with form contents, form and page contents
Figure 4.2: F-measure for DWSemClust with form contents, form and page contents
Figure 4.3: Comparison of DWSemClust and CAFC-C with form and page contents based on entropy
Figure 4.4: Comparison of DWSemClust and CAFC-C with form contents based on entropy
Figure 4.5: Comparison of CAFC-C and DWSemClust with form contents and with form and page contents based on entropy
Figure 4.6: Comparison of CAFC-C and DWSemClust with form contents based on F-measure
Figure 4.7: Comparison of CAFC-C and DWSemClust with form and page contents based on F-measure
Figure 4.8: Comparison of CAFC-C and DWSemClust with forms and pages and form contents based on F-measure
Figure 4.9: Comparison of average entropy
Figure 4.10: Comparison of average F-measure
Figure 4.11: Entropy comparison of CAFC-C, CAFC-CH and DWSemClust
Chapter 1
1. Problem Definition
This chapter introduces the deep web and the surface web, the importance of the deep web, and why its sources need to be uncovered. It covers the difference between the deep web and the surface web, the scale and coverage of the deep web, why we need to cluster or classify deep web sources, and the benefits that can be gained by doing so. The last portion presents why we use the proposed technique.
1.1. Deep web
The web is the preferred medium for information transfer and commerce for internet-based companies. With the introduction of e-business, the trend of developing web sites grew, and connecting sites to databases increased the number of sites with a back-end database holding their important information. This information can be retrieved through user queries to the database server. The information stored in databases and hidden behind HTML pages is called the "deep web". It is referred to by many other names, such as dark net, deep net, and invisible net [1]. All these terms describe the store of valuable knowledge that cannot be accessed through traditional search engines. Deep web contents are stored in searchable databases and can be retrieved only through a direct query; without a direct query we cannot reach the database results. When someone queries a searchable database, a resultant page is returned containing dynamic content that answers the query given to the database.
Figure 1.1: A deep web site
Figure 1.1 shows a deep web site: an interface is connected to a database, and through this interface a user can put his/her query to the database to retrieve information.

The terms most often used for the deep web are deep web site, database, and query interface. A deep web site is a web server that serves information stored in one or more back-end web databases, and a web form is used to get information from the database by taking a query as input.
Definition
A deep web site is denoted Ds, its database Ddb, and its query interface DI. DI contains the attributes DIat, at = 1…n.
Deep web query interfaces can be categorized into two types:

• Simple query interface
• Advanced query interface

A simple query interface is an interface with a small number of attributes. It cannot give much information about the interface or the domain the interface belongs to.

An advanced query interface is an interface whose attributes are good representatives of a domain; from them we can usually guess the domain of the database. Interfaces vary in their number of attributes, and a simple query interface has fewer of them.
The web is rapidly being "deepened" by huge online databases [2], and the size of the deep web is increasing exponentially.
1.2. Difference between "surface" web and "deep" web
1.2.1. Searching strategy
The first difference between surface and deep web sources is the crawling strategy. The surface web is indexed in two ways: first, an author submits pages for listing and indexing; second, "spiders" crawl from page to page through hypertext links. In the surface web, static pages are linked together. Traditional search engines cannot retrieve the contents of the deep web;
deep web contents are dynamic and can be retrieved only through a direct query that points to the database [2].
1.2.2. Size of deep web
The second difference between surface and deep web sources is size. The deep web is very big: according to one study, the 60 largest deep web sites alone contain 84 billion pages, roughly 40 times more content than the surface web, and hold 750 terabytes of data [2].
1.2.3. Quality of deep web is different from the “surface” web
Deep web contents have higher quality than the surface web [2]. Deep web content is more significant for users and better satisfies their needs. Most deep web contents are topic specific and stored in databases. Deep web contents are also deeper than the surface web.
1.2.4. Growing ratio of deep web and surface web
The deep web is growing much faster than the surface web, which shows its importance as the next-generation internet [2].
Figure 1.2: Surface and deep web [2]
Figure 1.2 shows the quantity of surface web and deep web data and the depth of the deep web. The fishes represent data: some data is present at the surface level, but the portion beneath the surface carries far more. This data is precious, in the sense of both quality and quantity, and needs to be uncovered, since its growth rate is very high and its contents are updated frequently.
1.3. Scale/Coverage of deep web
To understand the scale and coverage of the deep web, first take a look at the "entrances" to its databases [3]. To get at the information hidden in this sea, we must find the entrances and know how deep they lie. Query interfaces are the entry points to the databases, and He et al. [3] calculated the depth of each entry point: 93 out of 129 query interfaces (72%) were found within depth 3; 32 out of 34 web databases (94%) could be reached within depth 3; and 22 out of 24 deep web sites (91.6%) contained their database within level 3, which can be referred to as depth-3 coverage.
1.3.1. Scale of deep web
He et al. [3] tested a sample of 1,000,000 IPs to determine the scale of the deep web. They crawled the million IPs to depth three, because most databases can be found within that depth. The crawl found 126 deep web sites among 2,256 web servers, with 406 query interfaces and 190 web databases. Extrapolating this sample to the whole IP space of 2,230,124,544 IPs gives an estimated 307,000 deep web sites, 450,000 databases, and 1,258,000 query interfaces. They also observed the multiplicity of access in the deep web: on average, each deep web site contains 1.5 databases, and each database supports 2.8 query interfaces. A survey [1] estimated that 43,000 to 96,000 deep web sites are present over the web. These figures show that the deep web increased 3-7 times from 2000 to 2004 [3].
Table 1.1 below shows the estimation and sampling of the study. The first column lists deep web sites, web databases (structured or unstructured), and query
interfaces. The second column shows the sampling results, the third column the total estimate, and the fourth column the 99% confidence interval.
Table 1.1: Deep web estimation and sampling [3]

                     Sample Results   Total Estimate   99% Confidence Interval
Deep web sites       126              307,000          236,000-377,000
Web databases        190              450,000          366,000-535,000
  - Unstructured     43               102,000          62,000-142,000
  - Structured       147              348,000          275,000-423,000
Query interfaces     406              1,258,000        1,097,000-1,419,000
1.3.2. Coverage of deep web
Coverage can be defined as how much deep web data can be crawled or indexed and then retrieved through search engines or deep web directories. There are two types of coverage:

• Coverage of the deep web through search engines
• Coverage of the deep web through deep web directories
1.4. Coverage of deep web through search engines
To access hidden web contents, one can "browse" directories or use URLs directly. It remains an open question whether the deep web can be indexed and crawled as effectively as the surface web. He et al. [3] investigated three popular search engines, MSN (msn.com), Yahoo (yahoo.com), and Google (google.com), randomly choosing 20 web databases out of the 190 in their sampling results. Figure 1.3 below shows the findings: MSN's coverage, at 11%, was the lowest of the three, while Yahoo and Google each indexed 32% of the deep web. Together these search engines covered 37% of deep web contents.

The study leads to two major conclusions. First, the common belief that the deep web is invisible is inaccurate: since one third of it can be searched through these engines, the deep web is not invisible by nature. Second, the remaining contents are not properly indexed by any search engine, so most deep web contents remain invisible.
Figure 1.3: Search engines coverage of the entire deep web [3] (MSN.com 11%, Yahoo.com 32%, Google.com 32%, all combined 37%)
1.5. Coverage of deep web through deep web directories
Besides search engines, which crawl traditionally, there are online directories that classify web databases into catalogs. To check the coverage of these directories, He et al. [3] surveyed four popular web directories and counted the coverage each claimed to index: completeplanet.com, lii.org, turbo10.com, and invisible_web.net.
Table 1.2: Web directories coverage [3]

                      Number of Web Databases   Coverage
completeplanet.com    70,000                    15.6%
lii.org               14,000                    3.1%
turbo10.com           2,300                     0.5%
invisible_web.net     1,000                     0.2%
Table 1.2 shows the number of web databases and the coverage of each directory. The completeplanet.com directory holds 70,000 of the 450,000 web databases, which is 15.6%: very low coverage. The coverage of the other directories was even lower, ranging from 0.2% to 3.1%. This suggests that directories, being manually classified, will be hard to scale to the deep web.

He et al. [3] conclude their study as follows:
• Information that is publicly available on the deep web is 400 to 550 times larger than the commonly defined World Wide Web.
• The deep web holds 7,500 terabytes of information, compared to 19 terabytes on the surface web.
• There are 550 billion individual documents on the deep web, compared to 1 billion on the surface web.
• More than 100,000 deep web sites exist.
• The 60 largest deep web sites contain 750 terabytes of information, exceeding the surface web by 40 times.
• The deep web's growth rate is very high compared to the surface web.
• Deep web site content is deeper and narrower than the surface web.
• The total quality of the deep web is 1,000 to 2,000 times greater than that of the surface web.
• The deep web is more informative and better satisfies user needs.
• Deep web content is mostly topic specific.
• 95% of deep web sources are publicly accessible information.
• During the 4 years from 2000 to 2004, deep web sources increased 3 to 7 times.
1.6. Challenges of deep web
Open challenges in this field are:

• Crawling the hidden web
• Categorization of deep web sources
• Integration
• Query mediation
1.7. Why deep web sources need to be clustered or classified
Organizing the data spread over the web into groups/collections makes data access and availability easier while meeting user needs. First of all, the deep web is very large: according to one study, the 60 largest deep web sites contain 84 billion pages, roughly 40 times more content than the surface web, and hold 750 terabytes of data [2]. Data at this scale needs to be clustered or classified; once it is, it can be used in a way that satisfies users' needs.
Figure 1.4: Document clustering (documents #1-#10, initially unordered, are grouped by a document clustering algorithm into Clusters 1-3)
Figure 1.4 illustrates the document clustering process. A set of documents is gathered from an information retrieval system; documents of the same color are related to each other but randomly distributed. A clustering algorithm groups the related documents, so that after it runs, related documents sit together in the same cluster.
1.8. Benefits of clustering the deep web sources
Some benefits of clustering the deep web sources are:

• Accessibility over the web will increase.
• The length of web navigation pathways will decrease.
• Service of web users' requests will improve.
• Information retrieval will improve.
• Content delivery on the web will improve.
• Users' navigation behavior will be better understood.
• Diverse data representation standards can be integrated.
• Web information organizational practices currently in use will be extended.
1.9. Proposed approach
We work on deep web sources semantically, using Latent Dirichlet Allocation (LDA). Before explaining LDA, we describe the limitations of traditional, keyword-based clustering methods. Keyword-based clustering extracts words that are used to match related entities; the Vector Space Model (VSM), an example of keyword-based modeling and the state of the art in clustering, gives a good way to group similar documents on the basis of similar content extracted from the text. The major problem with keyword-based clustering is that it ignores semantics; in other words, it ignores polysemous and synonymous terms. Traditionally a document is associated with a single cluster, which is called hard clustering. These problems motivate topic modeling, which is based on a latent topic layer. Topic modeling is a technique that generates soft clusters and can capture the semantics of text: latent topics allow documents that are composed of different topics to belong to more than one cluster. A hidden topic layer is fundamental to topic modeling. The basic terminology and notation used in LDA are word, document, and corpus:

• The basic unit of discrete data is a word; words are items in a vocabulary, denoted w.
• A document is a sequence of words. A document contains N words, denoted D = {w1, w2, w3, w4, …, wN}.
• A corpus is a collection of documents, denoted C = {D1, D2, D3, D4, …, DM}, i.e., a corpus containing M documents.

A topic layer Z = {Z1, Z2, Z3, Z4, …, Zi} sits between the documents and the words in the documents, where Zi represents a latent topic of a document vector d with words wd. This layer is used to capture semantic relationships that account for the synonymy of words.
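To make this notation concrete, here is a minimal sketch of the word/document/corpus representation and the latent topic layer in Python, using the gensim library as one possible LDA implementation; the toy documents and the topic count are illustrative only.

```python
# Toy sketch of the notation above, assuming the gensim library:
# a corpus C of three documents over a small vocabulary of words w,
# with a latent topic layer Z of two topics.
from gensim import corpora, models

documents = [                                   # corpus C = {D1, D2, D3}
    ["car", "make", "model", "price", "mileage"],
    ["flight", "airline", "departure", "arrival", "ticket"],
    ["car", "dealer", "price", "year", "model"],
]
dictionary = corpora.Dictionary(documents)      # vocabulary of words w
bows = [dictionary.doc2bow(d) for d in documents]

lda = models.LdaModel(bows, num_topics=2, id2word=dictionary,
                      random_state=1)
for i, bow in enumerate(bows):
    # Each document is a soft mixture over the latent topics Z1, Z2.
    print(i, lda.get_document_topics(bow))
```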
Chapter 2
2. Literature Review
This chapter gives an overview of the work that has been done in the literature. Work on hidden web mining can be categorized into three areas:

1. Feature extraction/feature selection
2. Clustering
3. Classification

Feature extraction is discussed in [4, 5, 6, 17], which show how these features improve results. Clustering is unsupervised learning used for grouping data; [7, 8, 9, 10, 11] propose techniques for clustering deep web sources. Classification is supervised learning; [12, 13, 14, 15, 16] use different techniques for classification.
2.1. Feature extraction/Feature selection
Sriram et al. [4] address the problem of designing a crawler able to extract text from the hidden web. They introduce a standard operational model for crawling the hidden web and explain how this model is realized in HiWE (Hidden Web Exposer), a prototype crawler built at Stanford. They also introduce a new Layout-based Information Extraction Technique (LITE) and explain how it extracts semantically related information from search forms and result pages, including deep web pages with a database behind their forms, which is a problem for existing crawlers. Their crawler is task specific: a task-specific application is useful in designing a crawler that has knowledge of the target domain.

There are two limitations of the HiWE design:

1. HiWE is not able to recognize and respond to simple dependencies among form elements.
2. HiWE does not support incompletely filled forms, i.e., giving values to only some elements in a form.
Andreas et al. [5] automatically allocate a Web Service operation to a concept in a domain taxonomy, each input parameter to a concept in a data-type taxonomy, and the Web Service itself to a concept in a category taxonomy; Web Services are clustered in order to create the category taxonomy automatically. The authors assume a category taxonomy C, describing the services a Web Service provides; a domain taxonomy D, where domains provide specific services; and a data-type taxonomy T, where data types deal with the semantics of field data.

For classification, a web form is converted into a Bayesian network. A tree is built that represents the generative model: the domain of the form is at the top, the children represent the data type associated with each field, and the grandchildren represent the terms of each field.

A leave-one-out methodology is used for two bag-of-terms baselines: the terms in a form are taken as a single bag of terms for domain classification, and the naïve Bayes algorithm is applied over each field's bag of terms for data-type classification. This technique is applicable to databases that are good representatives of a domain, because it makes decisions on the visible parts of the form, e.g., the labels and values that are available.
In [6] an incremental heuristic algorithm is introduced by connecting it with an entropy-based algorithm. The algorithm works well with different sample sizes and parameter settings. It is very efficient on data streams, since for every new entry it does not have to look far back, and entropy makes the clustering criterion very reasonable. The algorithm uses entropy to group categorical attributes, unlike earlier methods that use distance matrices between vectors. Since the chosen problem is NP-complete, a heuristic is used to solve it, and the incremental approach makes it scalable.
A crucial step in data integration is matching query interfaces across multiple web databases. In interface schema matching, different types of information are used: relying on a single aspect of the schema is unreliable and yields uncertain, inaccurate results.
The state-of-the-art approach is evidence theory for combining uncertain information from multiple sources. However, traditional evidence theory has limitations in treating the individual matchers of the different matching tasks applied to query interfaces, which reduces matching performance. The authors propose a novel matching approach for deep web query interfaces based on extended evidence theory. They introduce a dynamic prediction procedure for the differing credibilities of matchers and use exponentially weighted evidence theory, which extends traditional evidence theory, to combine the results coming from multiple matchers.
2.2. Clustering
In [7] the authors organize structured deep web sources into a domain hierarchy through query schemas, which are good discriminators of a domain. They view a query schema as categorical data and cluster that categorical data, assuming that sources in the same domain use the same generative model. They propose a new objective function, model differentiation, used to test this assumption by maximizing statistical heterogeneity between clusters.

The authors develop the algorithm MDhac: first, DATAGROUPING pre-clusters the data into groups; second, GROUPSELECTION excludes loner schemas using a loner threshold N; third, CLUSTERINGHAC clusters the remaining groups with the standard HAC algorithm; fourth, LONERHANDLING classifies the loner schemas into the resulting G clusters; finally, BUILDHIERARCHY again applies the HAC algorithm to build the hierarchical tree of domains (by considering each cluster as one domain).

They adopt χ2 testing for evaluating homogeneity among clusters, and conditional entropy is used. On clustering web query schemas, the model-differentiation function outperforms existing ones, such as likelihood, entropy, and context linkages, with the hierarchical agglomerative clustering algorithm. For each source, attributes are manually extracted from its query interface by extracting noun phrases, and its corresponding domain is then judged.
CAFC-C [8] helps obtain homogeneous clusters with low entropy and high F-measures, so it can be very helpful in discriminating different online databases. But it has some limitations that need to be considered. As it uses the k-means method, the
quality of the resultant clusters is highly affected by the selection of initial seeds. This can hurt badly in two scenarios: first, when there is heterogeneity in vocabulary, and second, when the domains are different but share a large part of their vocabulary. In these cases forms alone will not be sufficient for good clustering results.
CAFC-CH [8] is an extension of the CAFC-C algorithm. To increase the utility of the above-mentioned algorithm, it also includes page contents, which help break the tie when there is vocabulary overlap between the form contents of different domains, or when forms of the same domain use different vocabularies.

Another limitation of k-means is addressed by also considering hyperlinks for the selection of seed clusters. Backlinks are used to improve the quality of the seed clusters, but a limitation remains here too, since not all sites have backlinks. The vector space model is used, whose drawback is that it supports only keyword-based matching.
The work in [9] replaces earlier automatic database selection methods and cooperative methods, due to their limitations, with appropriate language models built for each database. The solution is that the database service itself creates its language model by random sampling, also called the query-based sampling approach. Query-based sampling assumes that every database can run a simple query and return some documents, which in turn help build a language model for that particular database automatically.
Song et al. [10] propose semantically clustering the deep web: a fuzzy semantic measure is used to integrate the ontology, fuzzy sets are used to check the similarity of the visible features of two deep web forms, and a hybrid Particle Swarm Optimization (PSO) algorithm is proposed for clustering the deep web databases. Average Similarity of Documents to the Cluster Centroid (ASDC) and the Rand Index (RI) are used to evaluate the results. The proposed solution has higher ASDC values than the K-means and PSO approaches. It is concluded that similarity within a cluster is high and similarity between clusters is low, which is a positive sign.
Zhang et al. [11] work on feature extraction using an ontology-based method. They work on three domains and report results evaluated by precision and recall.
2.3. Classification
Xiang et al. [12] work on the classification of structured deep web sources. Their main contributions are a category ontology model for the deep web and, built on that model, a vector space model for the deep web. Their ontology is an 8-tuple (V, F1, T, S, C, L, ROOT, F2): V is the set of attributes appearing in the interface, where each element has two parts, Ai (the attribute label) and Type (the data type of the attribute); F1 is the reference function that maps attributes to concepts; T describes the nature of an attribute Vi ∈ V (if Vi is in Ts, the attribute occurs only in a specific domain; if Vi is in Tc, the attribute is shared across different domains; if Vi is in Tn, the attribute is noise and has no meaning); S is the predefined concept of the interface schema; C is the conceptual portion of attributes in the specific domains; L is the set of domains; ROOT is the domain for anything that cannot be classified into any domain; and F2 is another reference function. A new weight calculation (DWTF) is presented, which gets better results than TF-IDF and TF. They evaluate the classification results with average precision and average recall, obtaining 91.6% precision and 92.4% recall.
Peiguang et al. [13] contribute to the deep web field through classification. They analyze the attributes that are common within a domain, describe the characteristics of an interface, and propose a new representation of the interface based on function terms and form terms. They propose an algorithm, Literal and Semantic based Similarity Computing (LSSC), which uses the two definitions of function term and form term. Another contribution is the combination of LSSC and the NQ algorithm into LSSC_NQ. Experimental results show that this algorithm gives good results.
Pengpeng Zhao et al. [14] contribute to the deep web field by clustering structured query interfaces. They use a link graph to cluster these sources and propose the Form Graph Cluster (FGC) framework for organizing deep web sources with a pre-query method; the Fuzzy Clustering Method (FCM) is applied to the sources. Their main contribution is that this method
is used for the first time in this area. The similarity and dissimilarity of deep web sources can be expressed in a graph: a query interface is treated as a node, a line shows the relationship between two nodes, and the weight on the line shows their similarity or dissimilarity, so the form set becomes an undirected weighted graph. Historically, the degree of similarity and dissimilarity was measured in the traditional way, as 0 or 1, which is not the best approach, so the authors use fuzzy set theory. They extract features from the HTML form, defining controls and dividing them into three kinds: text area, select, and input controls. Only the values of the controls are used; everything else is eliminated.
To improve deep web domain classification results, Le et al. [15] select a subset of features from the full feature set extracted from the interfaces. In previous work, all features extracted from the interfaces or query schemas of sources were included; here the feature set is refined. They treat the interfaces of a domain category as a bag of words and choose from the whole set the words that are suitable for classification, using a novel, simple ranking scheme and a new metric for feature selection. They obtain high precision, recall, and F-measure using selective features, i.e., aggressive feature selection.
Xian et al. [16] present a new framework that classifies structured deep web sources into topic-specific domains by combining the machine learning technique SVM with query probing through a simple query interface. They use random queries to gather the result schema, which is collected from the result pages of query probing. A domain-specific classifier (DSC) is then used to classify the simple query schemas. Precision, recall, and F-measure are used for evaluation.
Summary
In the sections above we discussed the work that has been done in the field of the deep web. Different authors propose different algorithms and show the efficiency of their work. We categorized the literature review into clustering, classification, and feature extraction.
Chapter 3
3. Methodology
In this chapter we first discuss the methodology of [8] and then our own method for clustering deep web sources. Two algorithms from [8] are described, and we explain how the document vectors are built and which preprocessing steps are performed on the dataset. We then turn our discussion to the method we use for clustering the deep web sources.
3.1. Data preprocessing
3.1.1. Form-Page Model
A web form together with the web page it appears on is called a form page, FP. An FP is a tuple FP = (PC, FC), where PC and FC are two individual feature spaces: PC stands for page contents and FC for form contents. Both feature spaces are viewed as text. Following the vector space model [19] used in [8], each feature space consists of a vector of the terms present in that space together with their associated weights.
3.1.2. Compute Form-Page Vectors
To compute the form-page vectors, the HTML page is parsed and its contents extracted; as discussed earlier, two feature spaces, FC and PC, are computed. For the FC feature space, the contents between the FORM tags are extracted; these contents belong to the form but still contain HTML markup, so the FC feature space is gathered after removing the HTML markup and scripting tags. For the PC feature space, FC is subtracted from the HTML page and the remainder is the page contents; again, the PC feature space is gathered after removing the HTML markup and scripting tags.
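As an illustration, here is a minimal Python sketch of this FC/PC split, assuming the BeautifulSoup library for HTML parsing (the base paper does not prescribe a particular parser).

```python
# Sketch of the FC/PC split: FC is the text inside FORM tags,
# PC is whatever text remains once the forms are removed.
from bs4 import BeautifulSoup

def form_page_vectors(html):
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):       # drop scripting tags
        tag.decompose()
    fc_terms = []
    for form in soup.find_all("form"):
        fc_terms.extend(form.get_text(" ").split())  # FC: text between FORM tags
        form.extract()                               # subtract FC from the page
    pc_terms = soup.get_text(" ").split()            # PC: the remaining page text
    return fc_terms, pc_terms

fc, pc = form_page_vectors("<html><body>Cheap used cars"
                           "<form>Make Model Price</form></body></html>")
print(fc)  # ['Make', 'Model', 'Price']
print(pc)  # ['Cheap', 'used', 'cars']
```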
3.1.3. Stop word removal
Stop words are noisy words that carry no importance but increase execution time and disturb the final results. Stop words include "an, of, the, are, is", etc.; lists containing the major stop words are available on the internet, and we include one such list. Stop word removal is applied to both the form feature space and the page feature space.
3.1.4. Remove less frequent terms
Another preprocessing step removes the less frequent terms. The threshold value is three, meaning that terms occurring three times or fewer in the whole collection are removed. These terms are also noisy data; to get good results, the preprocessing steps must be performed.
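A minimal Python sketch of these two preprocessing steps; the stop word set here is a tiny stand-in for the full list mentioned above.

```python
# Stop word removal, then dropping terms whose frequency in the
# whole collection is three or less.
from collections import Counter

STOP_WORDS = {"an", "of", "the", "are", "is", "and", "a", "to", "in"}

def preprocess(docs):
    docs = [[t.lower() for t in d if t.lower() not in STOP_WORDS]
            for d in docs]
    freq = Counter(t for d in docs for t in d)       # whole-collection counts
    return [[t for t in d if freq[t] > 3] for d in docs]
```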
3.2. Calculation of term weights
The TF-IDF (term frequency / inverse document frequency) measure is widely used in information retrieval [21]. It gives a way to model the importance of terms and also eliminates noisy data from the vectors.

wj = TFj · log(N / nj)    (3.1)

wj is the weight of the jth term. It can differ between documents, since TFj is the term frequency of the jth term, i.e., the number of occurrences of the term in a specific document (for example, if a word appears two times in a document, its term frequency is two). log(N / nj) is the IDF (inverse document frequency), where N is the total number of documents and nj is the document frequency, i.e., the number of documents in which the jth term appears.
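A minimal Python sketch of equation (3.1), computed over a tokenized document collection:

```python
# wj = TFj * log(N / nj), producing one {term: weight} dict per document.
import math
from collections import Counter

def tfidf_weights(docs):
    N = len(docs)                                  # total number of documents
    nj = Counter(t for d in docs for t in set(d))  # document frequency per term
    weights = []
    for d in docs:
        tf = Counter(d)                            # term frequencies in this document
        weights.append({t: tf[t] * math.log(N / nj[t]) for t in tf})
    return weights
```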
3.3. Computing Form-Page Similarity
The authors of the base paper use the cosine similarity measure [20]. To compute form-page similarity, the distance between the corresponding vectors of both feature spaces is calculated.
cos(d1, d2) = (d1 · d2) / (||d1|| · ||d2||)    (3.2)
cos(d1, d2) is the cosine similarity of vectors d1 and d2: the dot product of d1 and d2 divided by the product of their lengths. To aggregate the similarities of the two feature spaces, form contents and page contents, the weighted average of the similarity in each space is taken:
sim(FP1, FP2) = (C1 · cos(PC1, PC2) + C2 · cos(FC1, FC2)) / (C1 + C2)    (3.3)
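A minimal Python sketch of equations (3.2) and (3.3), treating each feature space as a dict of term weights (e.g., from the tfidf_weights sketch above); C1 and C2 are the aggregation weights.

```python
# Cosine similarity of two term-weight vectors, and the weighted
# average over the PC and FC feature spaces of two form pages.
import math

def cos(d1, d2):
    dot = sum(w * d2.get(t, 0.0) for t, w in d1.items())
    n1 = math.sqrt(sum(w * w for w in d1.values()))
    n2 = math.sqrt(sum(w * w for w in d2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def sim(fp1, fp2, c1=1.0, c2=1.0):
    (pc1, fc1), (pc2, fc2) = fp1, fp2   # form pages as (PC, FC) pairs
    return (c1 * cos(pc1, pc2) + c2 * cos(fc1, fc2)) / (c1 + c2)
```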
Table 3.1: Algorithm CAFC-C
Algorithm 1 CAFC-C
1: Input: formPages, k
2: centroids = selectSeeds(formPages, k)  {Randomly select seeds}
3: repeat
4:   clusters = assignPoints(formPages, centroids)  {Assign each form page to the closest centroid}
5:   centroids = recomputeCentroids(clusters)  {Recompute centroids}
6: until stop criterion is reached
7: return clusters
3.4. The CAFC-C Algorithm
For clustering form pages belonging to the same domain, Context-Aware Form Clustering (CAFC-C) uses the k-means algorithm. K-means is widely used in document clustering; it is a partitioning, centroid-based algorithm, and the main reasons for using it are its simplicity and effectiveness [21]. CAFC-C takes the desired number of clusters k as input; the form pages are the whole collection that needs to be clustered. First, k
seed form pages are randomly selected from the collection and the centroids are calculated. After centroid calculation, each form page has a distance to every centroid and is assigned to the seed cluster with the closest value. The centroid of each cluster is then recomputed as the average over its members using the formula:
C = ( ΣPC∈C PC / |C| , ΣFC∈C FC / |C| )    (3.4)
C is the cluster; ΣPC∈C PC is the sum of the PC vectors belonging to cluster C, divided by the number of pages in C, and ΣFC∈C FC is the sum of the FC vectors belonging to C, divided by the number of forms in C. The algorithm recomputes similarities, reassigns points, and recomputes cluster centroids until the clusters become stable.
CAFC-C has some limitations:

• K-means produces hard clusters, as each point is assigned to only one cluster.
• CAFC-C uses the Vector Space Model (VSM) with k-means, and the major problem with VSM is that it does not deal with the semantics of the text.
3.5. CAFC-CH Algorithm
CAFC-CH uses an extended form-page model that additionally takes backlinks, giving a three-tuple FP = (FC, PC, Backlink) [8]. Backlinks are the web pages that link to the searchable form page. If different pages, or sets of pages, share common backlinks, they likely belong to the same domain. Backlinks are retrieved through the link: API provided by different search engines [25].

CAFC-CH has some limitations:

• The idea can be successfully implemented only on documents for which the complete link graph is available.
• Deep web contents are sparsely distributed over the web.
Table 3.2: Algorithm CAFC-CH

Algorithm 2 CAFC-CH
1: Input: formPages, k  {formPages: set of form pages; k: number of clusters required}
2: hubClusters = SelectHubClusters(formPages, k)
3: clusters = CAFC-C(formPages, k, hubClusters)  {Compute k-means using hubClusters instead of random seeds}
4: return clusters

Algorithm 3 SelectHubClusters
1: Input: formPages, k
2: hubs = generateHubs(formPages)
3: distanceMatrix = createDistanceMatrix(hubs)  {Compute distances between hubs}
4: finalSeeds = twoMostDistant(distanceMatrix)  {Select the two hubs that are farthest apart}
5: while finalSeeds.length < k do
6:   finalSeeds = addDistantPoint(finalSeeds, distanceMatrix)
7: end while
8: return finalSeeds
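A minimal Python sketch of the seed-spreading step in Algorithm 3 (lines 4-7), assuming a precomputed symmetric distance matrix; since the listing does not spell out addDistantPoint, the common farthest-point heuristic is used here as an assumption.

```python
# Start with the two hubs that are farthest apart, then repeatedly
# add the hub whose distance to its nearest chosen seed is largest.
def select_distant_seeds(dist, k):
    n = len(dist)
    i, j = max(((a, b) for a in range(n) for b in range(a + 1, n)),
               key=lambda p: dist[p[0]][p[1]])     # line 4: two most distant
    seeds = [i, j]
    while len(seeds) < k:                          # lines 5-7
        rest = [p for p in range(n) if p not in seeds]
        seeds.append(max(rest, key=lambda p: min(dist[p][s] for s in seeds)))
    return seeds

print(select_distant_seeds([[0, 4, 1], [4, 0, 3], [1, 3, 0]], 3))  # [0, 1, 2]
```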
3.6. Generative model
A model that randomly generates observable data once hidden parameters are specified is called a generative model. It gives a joint probability distribution over label sequences and observations. Machine learning uses generative models for direct data modeling or as an intermediate step. Examples of generative models include:
• Gaussian mixture models and other types of mixture model
• Latent Dirichlet allocation
• AODE
• Restricted Boltzmann machines
• Hidden Markov models
• Naive Bayes
3.7. Discriminative model
Models used in machine learning for modeling an unknown or unobserved variable y given a known or observed variable x are called discriminative models. In other words, they find the conditional probability distribution P(y | x): a class of models for the dependence of an unobserved variable y on an observed variable x. Within a statistical framework, such a model can be used to predict the unknown variable y from the known variable x. Examples of discriminative models used in machine learning include:

• Support vector machines
• Logistic regression, a type of generalized linear regression used for predicting binary or categorical outputs (also known as maximum entropy classification)
• Neural networks
• Boosting
• Linear discriminant analysis
• Conditional random fields
3.8. Difference between generative models and discriminative models
Generative models differ from discriminative models in several ways:

• Generative models are fully probabilistic models of all variables; a discriminative model, on the other hand, models only an unknown or unobserved variable y given a known or observed variable x. A generative model can be used to generate and simulate the value of any variable in the model.
• Discriminative models can only sample the specific variables that are observed quantities.
• Discriminative models cannot express general relationships between observed and unobserved variables, although they do not need to model the distribution of the observed variables.
• Generative models perform better at classification and regression tasks.

3.9. Topic model
In natural language processing and machine learning, a statistical model for discovering the abstract "topics" that occur in a collection of documents is called a topic model. Probabilistic latent semantic indexing (PLSI), proposed by Thomas Hofmann in 1999, was an early topic model. Latent Dirichlet allocation (LDA), developed by David Blei, Andrew Ng, and Michael Jordan, is the topic model used in many applications; it deals with documents that are mixtures of topics.
3.10. Latent variable
In statistics, latent variables stand in contrast to observable variables: they are observed only indirectly, through a mathematical model, from variables that are observed directly. Latent variable models are mathematical models that explain observed variables in terms of latent variables. Latent variables are often termed hidden variables, i.e., variables that are present but hidden. Many disciplines use latent variable models, for example machine learning/artificial intelligence, natural language processing, economics, the social sciences, psychology, and bioinformatics. An advantage of latent variables is the reduction of data dimensionality: dimensionality reduction is a process used in machine learning to reduce the number of random variables.
3.11. Prior probability
In Bayesian statistical inference, the prior probability distribution, called the prior, of an uncertain quantity p expresses one's uncertainty about p before the "data" are taken into account. It is meant to attribute uncertainty, rather than randomness, to the uncertain quantity.
3.12. Multinomial distribution
In probability theory, the multinomial distribution is a generalization of the binomial distribution. The binomial distribution is the probability distribution of the number of successes in n independent Bernoulli trials, where the probability of "success" is the same on each trial; a Bernoulli trial is a trial with two possible outcomes, true/false or yes/no. The multinomial distribution generalizes this to trials with more than two possible outcomes.
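A small illustration, assuming NumPy: ten independent trials over three possible outcomes.

```python
# 10 trials over outcomes with probabilities 0.5, 0.3, 0.2; the result
# counts how often each outcome occurred, and the counts sum to 10.
import numpy as np

rng = np.random.default_rng(seed=0)
counts = rng.multinomial(10, [0.5, 0.3, 0.2])
print(counts, counts.sum())  # e.g. [6 2 2] 10
```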
3.13. Plate notation
Plate notation is a graphical way of representing the variables in a model. Instead of drawing each repetition of a variable individually, a rectangle called a plate is drawn as a subgraph grouping the repeated variables, with the number of repetitions shown in the corner of the plate. Circles represent variables, and directed arrows show dependencies. In plate notation, empty circles denote latent variables (hidden, not directly observed), and filled circles denote observed variables.
Figure 3.1: Plate as subgraph
Figure 3.2: Symbol for hidden parameter
Figure 3.3: Symbol for observed variable
Figure 3.4: Arrow shows the dependency
3.14. Proposed Technique: DWSemClust
Algorithm 1 (DWSemClust) takes formPages as input, where formPages are the Web pages that contain searchable forms (line 1). parseformPages takes formPages and separates the form contents from the page contents: each HTML page contains a FORM tag holding the form contents, which are extracted after removing the markup tags, and the same procedure is applied to the page contents. Stopwords are then removed from the form and page contents, and words whose frequency is less than three are removed from the dataset. After these preprocessing steps, the form and page contents are given as input to LDA. For each (formCont, pageCont), iterated M times, θ_(formCont,pageCont) is selected from the Dirichlet prior with hyperparameter α. For each term W_(formCont,pageCont), iterated N times per document, a topic Z_(formCont,pageCont) is selected from θ_(formCont,pageCont), and then the observable variable W_(formCont,pageCont) is drawn from the probability p(W_(formCont,pageCont) | Z_(formCont,pageCont), β). LDA then returns the clusters. LDA is used in many applications [26].
Table 3.3: Proposed Technique: DWSemClust

Algorithm 1 DWSemClust
1: Input: formPages {formPages: set of pages which contain searchable forms}
2: formCont, pageCont = parseformPages(formPages)
3: F1 = LDA(formCont, pageCont)
   For each (formCont, pageCont) in [1…M] do
      Select θ_(formCont,pageCont) ~ Dir(α)
      For each of the terms W_(formCont,pageCont) in [1…N] do
         (a) Select a topic Z_(formCont,pageCont) ~ Multinomial(θ_(formCont,pageCont))
         (b) Select a word W_(formCont,pageCont) from p(W_(formCont,pageCont) | Z_(formCont,pageCont), β), a multinomial probability conditioned on the topic Z_(formCont,pageCont)
4: Return clusters
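To make the pipeline concrete, the following is a minimal Python sketch of the steps described above, assuming BeautifulSoup for HTML parsing and gensim for LDA inference; names such as parse_form_page and dwsemclust are illustrative only and do not reproduce the thesis implementation.

# Minimal sketch of the DWSemClust pipeline (illustrative, not thesis code).
from collections import Counter
from bs4 import BeautifulSoup              # assumed available for parsing
from gensim import corpora, models         # assumed available for LDA

STOPWORDS = {"the", "a", "an", "of", "and", "or", "to", "in"}   # abridged

def parse_form_page(html):
    # Separate form contents (inside FORM tags) from the page contents.
    soup = BeautifulSoup(html, "html.parser")
    forms = soup.find_all("form")
    form_text = " ".join(f.get_text(" ") for f in forms)
    for f in forms:
        f.decompose()                      # drop forms; keep page text
    return form_text, soup.get_text(" ")

def preprocess(texts):
    # Lowercase, keep alphabetic tokens, remove stopwords, drop freq < 3.
    token_lists = [[w for w in t.lower().split()
                    if w.isalpha() and w not in STOPWORDS] for t in texts]
    freq = Counter(w for toks in token_lists for w in toks)
    return [[w for w in toks if freq[w] >= 3] for toks in token_lists]

def dwsemclust(form_pages, num_topics=8):
    parsed = [parse_form_page(html) for html in form_pages]
    docs = preprocess([form + " " + page for form, page in parsed])
    dictionary = corpora.Dictionary(docs)
    corpus = [dictionary.doc2bow(d) for d in docs]
    lda = models.LdaModel(corpus, id2word=dictionary, num_topics=num_topics,
                          alpha=50.0 / num_topics, eta=0.01)
    # Soft clusters: per-document topic probabilities.
    return [lda.get_document_topics(bow) for bow in corpus]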
3.15. Latent Dirichlet allocation
LDA is a generative probabilistic model that explains a set of observations through unobserved groups in the data, showing why some parts of the data are similar. A document consists of different topics, and each topic can be explained through words. For example, if the words collected into documents are the observations, then each document is a collection of words drawn from a mixture of a small number of topics, and the words in the document can be explained by the document's topics. LDA is an example of a topic model and was presented as a graphical model by Blei et al. [22]; its inference and extensions are discussed in [27, 28]. LDA assumes the following generative process for each document D in a corpus C:
1. Select, for each interface, N ~ Poisson(ξ).
2. Select θ_interfaces ~ Dir(α).
3. For each of the N words w_n:
(a) Select a topic z_n ~ Multinomial(θ_interfaces).
(b) Select a word w_n from p(w_n | z_n, β), a multinomial probability conditioned on the topic z_n.
The basic model makes some simplifying assumptions:
 The dimensionality L of the Dirichlet distribution is assumed known and fixed (and thus the topic dimensionality Z is fixed).
 The word probabilities are parameterized by an L×V matrix β.
 Finally, the Poisson assumption is not critical: more realistic distributions over document length can be used as needed, since N is independent of the other data-generating variables θ and z.
(A short simulation of this generative process is sketched below.)
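This generative process can be simulated directly. The following Python sketch, using hypothetical toy dimensions, implements steps 1–3 above:

import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: L topics over a vocabulary of V words (hypothetical).
L, V, xi, alpha = 3, 8, 6.0, 0.5
beta = rng.dirichlet(np.ones(V), size=L)      # L x V topic-word matrix β

def generate_document():
    N = max(1, rng.poisson(xi))               # 1. N ~ Poisson(ξ)
    theta = rng.dirichlet(np.full(L, alpha))  # 2. θ ~ Dir(α)
    words = []
    for _ in range(N):                        # 3. for each of the N words:
        z = rng.choice(L, p=theta)            #    (a) z_n ~ Multinomial(θ)
        w = rng.choice(V, p=beta[z])          #    (b) w_n ~ p(w_n | z_n, β)
        words.append(w)
    return theta, words

theta, words = generate_document()
print(theta, words)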
A k-dimensional Dirichlet random variable θ has the following density function:

$$p(\theta \mid \alpha) = \frac{\Gamma\!\left(\sum_{i=1}^{k}\alpha_i\right)}{\prod_{i=1}^{k}\Gamma(\alpha_i)}\;\theta_1^{\alpha_1-1}\cdots\theta_k^{\alpha_k-1} \qquad (3.5)$$
Here α is the Dirichlet parameter and Γ(x) is the Gamma function; the Dirichlet belongs to the exponential family, which facilitates parameter estimation for LDA. Given the parameters α and β, the joint distribution of a topic mixture θ, a set of N topics z, and a set of N words w is:

$$p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha)\prod_{n=1}^{N} p(z_n \mid \theta)\,p(w_n \mid z_n, \beta) \qquad (3.6)$$
Here $p(z_n \mid \theta)$ is simply $\theta_i$ for the unique $i$ such that $z_n^i = 1$. Integrating over θ and summing over z gives the marginal distribution of a document:

$$p(w \mid \alpha, \beta) = \int p(\theta \mid \alpha)\left(\prod_{n=1}^{N}\sum_{z_n} p(z_n \mid \theta)\,p(w_n \mid z_n, \beta)\right) d\theta \qquad (3.7)$$
Finally, the probability of a corpus is obtained by taking the product of the marginal probabilities of the individual documents:

$$p(D \mid \alpha, \beta) = \prod_{d=1}^{M}\int p(\theta_d \mid \alpha)\left(\prod_{n=1}^{N_d}\sum_{z_{dn}} p(z_{dn} \mid \theta_d)\,p(w_{dn} \mid z_{dn}, \beta)\right) d\theta_d \qquad (3.8)$$
In this model, latent Dirichlet allocation is represented as a Bayesian network that models how the documents in a corpus (a collection of documents) are related topically. Two variables lie outside any plate: α and β. The Dirichlet prior parameter α governs the per-document topic distributions, and the Dirichlet prior parameter β governs the per-topic word distributions. The variables in the outermost plate are sampled once for each document in the collection; the M in the corner of the plate indicates the number of iterations, i.e., all variables inside it are sampled once per document.
Figure 3.5: Latent Dirichlet allocation
The inner plate contains the N words of a specific document; z is the topic of a specific word in the document and w represents the word itself. The inner plate is iterated N times, once for each word in a document. In the model, some circles are filled and others are empty: the variables in the empty circles are hidden (latent) variables that are not directly observed, and only w is observable. A directed edge shows a dependency between variables.
3.16. Summary
In this chapter we discussed the preprocessing steps performed over the dataset, then presented the proposed method in detail and described the terms used in it.
Chapter 4
4. Experiments
In this chapter we discuss the experiments that were performed and their results. The chapter reports the clustering of the deep web sources and gives an overview of the performance measures. We then describe the experimental setup, including the dataset and the performance measures used, and discuss the results in detail.
4.1. Performance Measures
The F-measure is a combined measure of precision and recall [24]. As described in the dataset section, the dataset is divided into 8 domains; the sources are clustered in an unsupervised way, and we then count the true positives (TP), false negatives (FN), and false positives (FP):

$$\text{Recall} = \frac{TP}{TP + FN} \qquad (4.1)$$

$$\text{Precision} = \frac{TP}{TP + FP} \qquad (4.2)$$

Here TP is the number of members of a class that are truly clustered, i.e., they belong to that cluster and the clustering algorithm also assigns them to it; FN is the number of members of a class that belong to that cluster but are falsely assigned to another domain; and FP is the number of members of other classes that are falsely assigned to that cluster. The F-measure is then computed by the following formula:
$$F(i, k) = \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}} \qquad (4.3)$$
The overall F-measure for a set of clusters is computed as the weighted average of the F-measure values of the individual clusters. A perfect clustering solution results in an F-score of one; in general, the higher the F-measure value, the better the clustering solution.
Another performance measure used for cluster evaluation is entropy, defined as the measure of disorder of a cluster; cluster quality increases as entropy decreases. For every cluster $c_j$ we calculate the probability $p_{jk}$ that a member of cluster $j$ belongs to class $k$. The entropy of a cluster is calculated through the standard formula:

$$\text{Entropy}_j = -\sum_{k} p_{jk}\log(p_{jk}) \qquad (4.4)$$

The total entropy for the set of all clusters is the sum of the entropies of the individual clusters, weighted by the size of each cluster: the more homogeneous the clusters, the better the clustering and the lower the entropy.
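A small Python sketch of how these measures can be computed from true domain labels and cluster assignments is given below; it implements equations (4.1)–(4.4) in their standard weighted form (the thesis tooling may differ), and the example labels are hypothetical.

import math
from collections import Counter

def cluster_entropy(labels, clusters):
    # Total entropy (Eq. 4.4), weighted by cluster size.
    n = len(labels)
    total = 0.0
    for c in set(clusters):
        members = [labels[i] for i in range(n) if clusters[i] == c]
        size = len(members)
        ent = 0.0
        for count in Counter(members).values():
            p = count / size                      # p_jk
            ent -= p * math.log(p)
        total += (size / n) * ent                 # weight by cluster size
    return total

def f_measure(labels, clusters):
    # Overall F-measure: weighted best F(i, k) over clusters for each class.
    n = len(labels)
    total = 0.0
    for cls in set(labels):
        class_size = sum(1 for l in labels if l == cls)
        best = 0.0
        for c in set(clusters):
            tp = sum(1 for i in range(n)
                     if labels[i] == cls and clusters[i] == c)
            if tp == 0:
                continue
            cluster_size = sum(1 for x in clusters if x == c)
            recall = tp / class_size              # Eq. 4.1
            precision = tp / cluster_size         # Eq. 4.2
            f = 2 * recall * precision / (recall + precision)   # Eq. 4.3
            best = max(best, f)
        total += (class_size / n) * best          # weighted average
    return total

# Hypothetical example: 6 sources from 2 domains placed into 2 clusters.
labels   = ["books", "books", "books", "jobs", "jobs", "jobs"]
clusters = [0, 0, 1, 1, 1, 1]
print(cluster_entropy(labels, clusters), f_measure(labels, clusters))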
4.2. Dataset
In order to evaluate performance, the algorithms described in the previous chapter were tested over a set of 259 form pages retrieved from the UIUC repository [23]. We use the TEL_8 query interfaces, which were collected from sources belonging to 8 different domains; since some pages were unavailable due to server or timeout errors, we use 259 form pages for our experiments. TEL_8 stands for the Travel group, Entertainment group, and Living group, which together cover 8 domains: the Travel group relates to car rentals, hotels, and airfares; the Entertainment group contains music records, movies, and books; and the Living group contains jobs and automobiles. The dataset was created in May 2003. We gathered all the form pages in the repository that are still available on the Web; some had errors or were unavailable due to timeouts. The collection contains both single- and multi-attribute forms. Table 4.1 shows the dataset description, which is available in the UIUC repository [23].
Table 4.1: Dataset Description

Groups | Domains | Number of Sources | Number of Queryable Interfaces | Simple query interfaces | Advanced query interfaces
Travel group | Airfares | 34 | 34 | Yes | Yes
Travel group | Hotels | 26 | 26 | Yes | Yes
Travel group | Car Rentals | 17 | 17 | Yes | Yes
Entertainment group | Books | 42 | 42 | Yes | Yes
Entertainment group | Movies | 41 | 41 | Yes | Yes
Entertainment group | Music Records | 35 | 35 | Yes | Yes
Living group | Jobs | 25 | 25 | Yes | Yes
Living group | Automobiles | 39 | 39 | Yes | Yes

4.3. Parameter settings
The hyperparameters α and β can be optimized through the Gibbs sampling algorithm [27] or the Expectation Maximization (EM) method [28]. The Gibbs sampling algorithm is used rather than EM because EM is computationally inefficient and vulnerable to local maxima [22]. Hyperparameters need to be optimized because some topic models, in certain applications, are sensitive to them; our topic model, however, is not sensitive to these hyperparameters. In our experiments, for z = 8 topics, the hyperparameter values for α and β are 50/z and 0.01, respectively. The number of topics z is set according to the dataset used, which contains 8 domains as discussed in detail in the dataset section. We ran Gibbs sampling chains for 1000 iterations each. Experiments were performed on a machine running Windows 7 with an Intel(R) Core(TM) 2 CPU (1.67 GHz) and 1 GB of memory.
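For illustration, the following self-contained Python sketch implements the collapsed Gibbs sampler of [27] with the hyperparameter settings above (α = 50/z, β = 0.01, 1000 iterations by default); it outlines the inference procedure and is not the exact experimental code. The toy corpus of word-id lists is hypothetical.

import numpy as np

rng = np.random.default_rng(0)

def gibbs_lda(docs, V, Z=8, n_iter=1000):
    # Collapsed Gibbs sampling for LDA; docs are lists of word ids in [0, V).
    alpha, beta = 50.0 / Z, 0.01
    D = len(docs)
    ndz = np.zeros((D, Z))                  # doc-topic counts
    nzw = np.zeros((Z, V))                  # topic-word counts
    nz = np.zeros(Z)                        # words per topic
    z_assign = []
    for d, doc in enumerate(docs):          # random initialization
        zs = rng.integers(Z, size=len(doc))
        z_assign.append(zs)
        for w, z in zip(doc, zs):
            ndz[d, z] += 1; nzw[z, w] += 1; nz[z] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = z_assign[d][i]          # remove current assignment
                ndz[d, z] -= 1; nzw[z, w] -= 1; nz[z] -= 1
                # full conditional p(z | rest), up to a constant
                p = (ndz[d] + alpha) * (nzw[:, w] + beta) / (nz + V * beta)
                z = rng.choice(Z, p=p / p.sum())
                z_assign[d][i] = z          # resample and restore counts
                ndz[d, z] += 1; nzw[z, w] += 1; nz[z] += 1
    theta = (ndz + alpha) / (ndz + alpha).sum(axis=1, keepdims=True)
    return theta                            # soft cluster memberships

# Hypothetical toy corpus over a vocabulary of size 5:
docs = [[0, 1, 2, 0], [3, 4, 3], [0, 2, 2, 1]]
print(gibbs_lda(docs, V=5, Z=2, n_iter=50))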
4.4. Results and Discussions
In order to check the efficiency of our proposed technique DWSemClust, we calculate entropy and F-measure as performance measures. Table 4.2 shows the entropy results of DWSemClust for form contents alone and for combined form and page contents; DWSemClust shows good performance in both scenarios.
Table 4.2: Entropy results of DWSemClust for forms and pages

Sr.No | DWSemClust (formpages) | DWSemClust (forms)
1 | 1.235 | 1.5119
2 | 1.286 | 1.51408
3 | 1.232 | 1.51408
4 | 1.249 | 1.51408
5 | 1.275 | 1.5292
6 | 1.213 | 1.5292
7 | 1.211 | 1.5035
8 | 1.221 | 1.5453
9 | 1.241 | 1.5452
10 | 1.263 | 1.5008
11 | 1.260 | 1.5017
12 | 1.299 | 1.5090
13 | 1.263 | 1.5120
14 | 1.249 | 1.5246
15 | 1.249 | 1.4940
16 | 1.249 | 1.5029
17 | 1.249 | 1.5137
18 | 1.249 | 1.5196
19 | 1.267 | 1.489
20 | 1.296 | 1.517
21 | 1.265 | 1.522
22 | 1.305 | 1.516
23 | 1.226 | 1.496
24 | 1.301 | 1.507
25 | 1.265 | 1.5452
Figure 4.1 shows a graphical view of the DWSemClust entropy values given in Table 4.2. The number of iterations is presented on the x-axis, the entropy value on the y-axis, and the techniques are shown at the right side. The entropy curve shows that the combination of form and page contents gives better results than form contents alone.
Figure 4.1: Entropy for DWSemClust with form contents, form and page contents
Table 4.3: Topics discovered through DWSemClust (each topic listed as word followed by its probability)

Topic 0 (Music Record):
books 0.036005, music 0.028672, book 0.025338, news 0.020005, online 0.019339, contact 0.017339, shop 0.016672, store 0.016672, dvd 0.016006, gift 0.015339, toys 0.015339, usd 0.013339, games 0.013339, browse 0.012006, prices 0.011339, movie 0.010673, quot 0.010673, titles 0.010673, tax 0.010673, reviews 0.010006

Topic 1 (Airfares):
city 0.066359, airport 0.065777, flight 0.0553, child 0.044241, name 0.037839, infant 0.036092, flights 0.033182, hotel 0.027944, include 0.019795, options 0.018049, lap 0.016885, children 0.016303, seat 0.016303, class 0.015721, miles 0.014557, commerce 0.014557, departing 0.013975, days 0.013975, airports 0.013393, preferred 0.013393

Topic 2 (Movies):
prices 0.037763, movie 0.034109, rewards 0.031673, sort 0.024974, search 0.024365, artist 0.020711, miles 0.019493, reservation 0.018275, options 0.017667, view 0.017058, confirmation 0.017058, ticket 0.016449, close 0.015231, wyndham 0.015231, special 0.014013, stay 0.014013, hotels 0.014013, offers 0.014013, account 0.012795, join 0.012795

Topic 3 (Hotels):
deals 0.037741, help 0.032238, travel 0.02988, document 0.028308, information 0.025949, vacation 0.025163, hotel 0.022805, save 0.022805, cheap 0.022019, map 0.021233, sign 0.020447, privacy 0.018088, destination 0.018088, cheapfares 0.017302, hotels 0.016516, write 0.014944, tips 0.014158, specials 0.013372, print 0.013372, writeln 0.012586

Topic 4 (Jobs):
jobs 0.120145, pound 0.049718, job 0.027451, saving 0.024862, details 0.020719, price 0.019683, summary 0.017094, sales 0.015023, health 0.013469, business 0.012951, manager 0.012433, london 0.011916, care 0.011398, application 0.01088, south 0.009844, filofax 0.009326, west 0.008291, north 0.008291, director 0.007773, category 0.007773

Topic 5 (Automobiles):
cars 0.055285, chevrolet 0.028059, ford 0.028059, toyota 0.028059, car 0.028059, quantity 0.025584, vehicle 0.023934, nissan 0.021459, honda 0.019809, cyl 0.018984, bmw 0.015684, dodge 0.015684, hyundai 0.015684, mercedes 0.014034, benz 0.014034, mazda 0.014034, volkswagen 0.013209, lexus 0.013209, gmc 0.012384, audi 0.011559

Topic 6 (Car Rental):
car 0.077764, time 0.05887, code 0.053056, select 0.047969, travel 0.039249, pick 0.037069, drop 0.034889, ages 0.032709, date 0.032709, address 0.028348, rental 0.024715, location 0.023262, zip 0.021808, city 0.021081, airport 0.018901, company 0.018175, day 0.018175, country 0.016721, age 0.014541, hotel 0.014541

Topic 7 (Books):
title 0.06196, advanced 0.050344, price 0.031953, keyword 0.030017, author 0.029049, keywords 0.028081, model 0.024209, isbn 0.024209, below 0.022273, results 0.021305, abc 0.018401, category 0.014529, format 0.013561, fields 0.013561, exact 0.012593, sort 0.012593, search 0.011625, artist 0.011625, publisher 0.011625, facilities 0.010657
Table 4.4 shows the F-measure results of DWSemClust for form pages and for form contents; the higher the F-measure, the better the clustering, and DWSemClust gives high values. Sr.No shows the iteration number, DWSemClust (formpages) shows the F-measure results for form and page contents, and DWSemClust (forms) shows the F-measure results for form contents. Form and page contents together give better results than form contents alone.
Table 4.4: F-measure results of DWSemClust

Sr.No | DWSemClust (formpages) | DWSemClust (forms)
1 | 3.429 | 3.892
2 | 3.802 | 3.386
3 | 4.617 | 4.323
4 | 3.724 | 4.037
5 | 4.293 | 3.429
6 | 3.698 | 3.656
7 | 3.810 | 3.994
8 | 3.739 | 3.130
9 | 3.342 | 3.396
10 | 3.734 | 3.512
Figure 4.2 shows a pictorial view of the F-measure values. The number of iterations is shown on the x-axis, the F-measure values on the y-axis, and the techniques are shown at the right side.
Figure 4.2: F-measure for DWSemClust with form contents, form and page contents
Table 4.5 shows the comparison of CAFC_C and DWSemClust with form and page contents: the first column shows the Sr.No, the second column shows CAFC_C with formpages, and the third column shows DWSemClust with formpages. DWSemClust shows more stable behavior than CAFC_C.
Table 4.5: Comparison of the CAFC_C and DWSemClust with form and page contents based on entropy

Sr.No | CAFC_C (formpages) | DWSemClust (formpages)
1 | 1.3495 | 1.235
2 | 7.0615 | 1.286
3 | 2.1336 | 1.232
4 | 12.4307 | 1.249
5 | 2.4678 | 1.275
6 | 4.2335 | 1.213
7 | 4.565 | 1.211
8 | 7.781 | 1.221
9 | 5.618 | 1.241
10 | 0.4420 | 1.263
11 | 9.691 | 1.260
12 | 10.578 | 1.299
13 | 7.423 | 1.263
14 | 0.52538 | 1.249
15 | 5.8126 | 1.249
16 | 5.6317 | 1.249
17 | 2.3613 | 1.249
18 | 3.746 | 1.249
19 | 4.4169 | 1.267
20 | 9.608 | 1.296
21 | 9.200 | 1.265
22 | 0.4008 | 1.305
23 | 1.4245 | 1.226
24 | 0.0019 | 1.301
25 | 0.7379 | 1.265
Total entropy | 119.6416 | 31.418
Average entropy | 4.785664 | 1.25672
Figure 4.3 shows the comparison of DWSemClust and CAFC_C with formpages on the basis of entropy. The x-axis shows the number of iterations, the y-axis shows the entropy value, and the techniques are shown at the right side. DWSemClust shows stable behavior while CAFC_C fluctuates.
Figure 4.3: Comparison of DWSemClust and CAFC_C with formpages contents based on entropy
Table 4.6 shows the entropy comparison of CAFC_C and DWSemClust with form contents: the first column shows the Sr.No, the second column shows CAFC_C with form contents, and the third column shows DWSemClust with form contents.
Table 4.6: Comparison of the CAFC_C and DWSemClust with forms contents based on entropy

Sr.No | CAFC_C (forms) | DWSemClust (forms)
1 | 6.1536 | 1.5119
2 | 7.804 | 1.51408
3 | 5.6761 | 1.51408
4 | 0.130 | 1.51408
5 | 4.006 | 1.5292
6 | 1.065 | 1.5292
7 | 4.282 | 1.5035
8 | 15.6036 | 1.5453
9 | 4.399 | 1.5452
10 | 11.5607 | 1.5008
11 | 11.562 | 1.5017
12 | 8.7423 | 1.5090
13 | 5.5947 | 1.5120
14 | 12.0949 | 1.5246
15 | 5.650 | 1.4940
16 | 3.943 | 1.5029
17 | 9.078 | 1.5137
18 | 6.312 | 1.5196
19 | 5.344 | 1.489
20 | 14.787 | 1.517
21 | 2.5748 | 1.522
22 | 0.2194 | 1.516
23 | 10.1402 | 1.496
24 | 5.060 | 1.507
25 | 10.991 | 1.5452
Total entropy | 172.7733 | 37.87704
Average entropy | 6.910932 | 1.5150816
Figure 4.4 shows the comparison of DWSemClust and CAFC_C with form contents on the basis of entropy. The x-axis shows the number of iterations, the y-axis shows the entropy value, and the techniques are shown at the right side. The entropy curve demonstrates the efficiency of DWSemClust over CAFC_C.
Figure 4.4: Comparison of DWSemClust and CAFC_C with forms based on entropy
Table 4.7 shows the comparison of CAFC_C and DWSemClust with form contents and with form and page contents on the basis of entropy. The table clearly shows that DWSemClust gives good results in both scenarios, and that DWSemClust works better with form and page contents than with form contents alone.
Table 4.7: Comparison of the CAFC_C and DWSemClust with forms contents and with forms and pages contents based on entropy

Sr.No | CAFC_C (formpages) | CAFC_C (forms) | DWSemClust (formpages) | DWSemClust (forms)
1 | 1.3495 | 6.1536 | 1.235 | 1.5119
2 | 7.0615 | 7.804 | 1.286 | 1.51408
3 | 2.1336 | 5.6761 | 1.232 | 1.51408
4 | 12.4307 | 0.130 | 1.249 | 1.51408
5 | 2.4678 | 4.006 | 1.275 | 1.5292
6 | 4.2335 | 1.065 | 1.213 | 1.5292
7 | 4.565 | 4.282 | 1.211 | 1.5035
8 | 7.781 | 15.6036 | 1.221 | 1.5453
9 | 5.618 | 4.399 | 1.241 | 1.5452
10 | 0.4420 | 11.5607 | 1.263 | 1.5008
11 | 9.691 | 11.562 | 1.260 | 1.5017
12 | 10.578 | 8.7423 | 1.299 | 1.5090
13 | 7.423 | 5.5947 | 1.263 | 1.5120
14 | 0.52538 | 12.0949 | 1.249 | 1.5246
15 | 5.8126 | 5.650 | 1.249 | 1.4940
16 | 5.6317 | 3.943 | 1.249 | 1.5029
17 | 2.3613 | 9.078 | 1.249 | 1.5137
18 | 3.746 | 6.312 | 1.249 | 1.5196
19 | 4.4169 | 5.344 | 1.267 | 1.489
20 | 9.608 | 14.787 | 1.296 | 1.517
21 | 9.200 | 2.5748 | 1.265 | 1.522
22 | 0.4008 | 0.2194 | 1.305 | 1.516
23 | 1.4245 | 10.1402 | 1.226 | 1.496
24 | 0.0019 | 5.060 | 1.301 | 1.507
25 | 0.7379 | 10.991 | 1.265 | 1.5452
Total entropy | 119.6416 | 172.7733 | 31.418 | 37.87704
Average entropy | 4.785664 | 6.910932 | 1.25672 | 1.5150816
Figure 4.5 shows the comparison of CAFC_C and DWSemClust with form contents and with form and page contents based on entropy. The x-axis represents the number of iterations, the y-axis represents the entropy value, and the techniques are shown at the right side. The entropy curve for DWSemClust shows the efficiency of our proposed technique.
Figure 4.5: Comparison of the CAFC_C and DWSemClust with forms contents and with forms and pages contents based on entropy
Table 4.8 shows the F-measure comparison of CAFC_C and DWSemClust with form contents: the first column shows the Sr.No, the second column shows the F-measure values for CAFC_C (forms), and the third column shows the F-measure values for DWSemClust (forms).
Table 4.8: Comparison of the CAFC_C and DWSemClust with forms contents based on F-measure

Sr.No | CAFC_C (forms) | DWSemClust (forms)
1 | 3.680 | 3.892
2 | 3.538 | 3.386
3 | 4.059 | 4.323
4 | 2.407 | 4.037
5 | 2.739 | 3.429
6 | 3.441 | 3.656
7 | 3.461 | 3.994
8 | 3.047 | 3.130
9 | 3.309 | 3.396
10 | 3.901 | 3.512
Figure 4.6 shows the comparison of CAFC_C and DWSemClust on the basis of F-measure. The number of iterations is shown on the x-axis, the F-measure values on the y-axis, and the techniques at the right side. Since a higher F-measure value is better, DWSemClust, which gives higher F-measure values than CAFC_C, performs better.
Figure 4.6: Comparison of the CAFC_C and DWSemClust with forms contents based on F-measure
Table 4.9 illustrates the comparison of CAFC_C and DWSemClust with form and page contents: the first column shows the Sr.No, the second column shows the F-measure values of CAFC_C with formpages, and the third column shows the F-measure values of DWSemClust.
Table 4.9: F-measure comparison of the CAFC_C with forms and page contents and DWSemClust with forms and pages

Sr.No | CAFC_C (formpages) | DWSemClust (formpages)
1 | 2.450 | 3.429
2 | 2.367 | 3.802
3 | 2.079 | 4.617
4 | 2.067 | 3.724
5 | 1.584 | 4.293
6 | 2.301 | 3.698
7 | 2.44 | 3.810
8 | 2.536 | 3.739
9 | 2.448 | 3.342
10 | 1.966 | 3.734
Figure 4.7 shows the comparison of CAFC_C and DWSemClust with form and page contents on the basis of F-measure. The number of iterations is shown on the x-axis, the F-measure values on the y-axis, and the techniques at the right side. Since a higher F-measure value is better, DWSemClust, which gives higher F-measure values than CAFC_C, performs better.
Figure 4.7: Comparison of the CAFC_C and DWSemClust with forms and pages contents based on F-measure
Table 4.10 shows the comparison of CAFC_C and DWSemClust with form contents only and with form and page contents on the basis of F-measure values.
Table 4.10: F-measure comparison of the CAFC_C with forms and page contents and DWSemClust with forms and pages

Sr.No | CAFC_C (formpages) | CAFC_C (forms) | DWSemClust (formpages) | DWSemClust (forms)
1 | 2.450 | 3.680 | 3.429 | 3.892
2 | 2.367 | 3.538 | 3.802 | 3.386
3 | 2.079 | 4.059 | 4.617 | 4.323
4 | 2.067 | 2.407 | 3.724 | 4.037
5 | 1.584 | 2.739 | 4.293 | 3.429
6 | 2.301 | 3.441 | 3.698 | 3.656
7 | 2.44 | 3.461 | 3.810 | 3.994
8 | 2.536 | 3.047 | 3.739 | 3.130
9 | 2.448 | 3.309 | 3.342 | 3.396
10 | 1.966 | 3.901 | 3.734 | 3.512
Total F-measure | 22.238 | 33.582 | 38.108 | 37.755
Average F-measure | 2.224 | 3.358 | 3.811 | 3.776
Figure 4.8 shows the comparison of CAFC_C and DWSemClust with form contents and with form and page contents. The number of iterations is shown on the x-axis and the F-measure values on the y-axis, with the techniques shown at the right. DWSemClust gives better F-measure values in both situations, with form contents and with form and page contents.
Figure 4.8: Comparison of the CAFC_C and DWSemClust with forms and pages and forms contents based on F-measure
Figure 4.9 shows the average entropy of CAFC_C and DWSemClust with form contents and with form and page contents. The average entropy of DWSemClust is lower than that of CAFC_C in both scenarios.
Figure 4.9: Comparison of average entropy (bar chart over CAFC_C(forms), CAFC_C(formpages), DWSemClust(forms), DWSemClust(formpages))
Figure 4.10 shows the average F-measure of CAFC_C and DWSemClust with form contents and with form and page contents. The average F-measure of DWSemClust is higher than that of CAFC_C in both scenarios.
Figure 4.10: Comparison of average F-measure (bar chart over CAFC_C(forms), CAFC_C(formpages), DWSemClust(forms), DWSemClust(formpages))
Table 4.11: Comparison of CAFC_C, CAFC_CH and DWSemClust on the basis of entropy

Performance Measure | CAFC_C | CAFC_CH | DWSemClust
Entropy for form pages | 4.786 | 4.321 | 1.257
Entropy for forms | 6.911 | 6.449 | 1.515
Table 4.11 shows the comparison of CAFC_C, CAFC_CH, and DWSemClust; the performance measure used is entropy. CAFC_C uses random selection of documents, while CAFC_CH uses hub-induced similarity as a preprocessing step: hubs are first generated, the number of clusters is selected, and then the CAFC_C algorithm is run to cluster the sources. Among all these techniques, DWSemClust performs best.
Figure 4.11: Entropy comparison of CAFC_C, CAFC_CH and DWSemClust
The figure shows the comparison of CAFC_C, CAFC_CH, and DWSemClust, with the entropy value on the y-axis and the techniques on the x-axis. The entropy values of CAFC_C and CAFC_CH are higher than that of DWSemClust; since cluster quality decreases as entropy increases, the lower entropy of DWSemClust indicates better clustering.
Table 4.12: Comparison of CAFC_C, CAFC_CH and DWSemClust on the basis of F-measure

Performance Measure | CAFC_C | CAFC_CH | DWSemClust
F-measure for form pages | 2.224 | 3.215 | 3.811
F-measure for forms | 3.358 | 3.402 | 3.776
Table 4.12 shows the comparison of CAFC_C, CAFC_CH, and DWSemClust; the performance measure used is F-measure. CAFC_C uses random selection of documents, while CAFC_CH uses hub-induced similarity as a preprocessing step: hubs are first generated, the number of clusters is selected, and then the CAFC_C algorithm is run to cluster the sources. Among all these techniques, DWSemClust performs best.
Figure 4.12: F-measure comparison of CAFC_C, CAFC_CH and DWSemClust
The figure shows the comparison of CAFC_C, CAFC_CH, and DWSemClust, with the F-measure value on the y-axis and the techniques on the x-axis. The F-measure values of CAFC_C and CAFC_CH are lower than that of DWSemClust; since a higher F-measure indicates better clustering, DWSemClust performs best.
4.5. Summary
In this chapter we discussed the performance measures and briefly described the dataset, including where it was gathered and which attributes were used. We then described the experimental setup and results in detail. From this discussion we conclude that the proposed technique performs well, with a high F-measure and a low entropy value.
Chapter 5
5. Conclusions and Future Work
In this thesis we cluster the heterogeneous sources of the deep web, which is a critical task towards the integration of these sources. Correct and accurate clustering is desirable for satisfying user needs and for timely decision making, and the results show that our proposed technique DWSemClust is more efficient than the existing techniques. It takes less time to produce the clustering results, and while the existing techniques fluctuate, the proposed technique is more stable. One strength of our approach is that it deals with semantics, because LDA has a semantic layer. LDA also produces soft clusters, giving the probability of each document for each cluster. Hence, DWSemClust is suitable for the scenario where the sources are sparsely distributed over the web. A survey [18] identifies the parameters for clustering deep web sources, and according to that survey our approach covers all major parameters.
5.1. Concluded points
The following key points are concluded.
5.1.1. Stability
DWSemClust is more stable than CAFC_C: the experimental section clearly shows that the entropy values of CAFC_C fluctuate in every iteration, while DWSemClust shows stable behavior.
5.1.2. Soft Clustering
DWSemClust uses latent Dirichlet allocation, a topic model that gives soft clusters. Soft clustering gives the probability of each document for each cluster, whereas CAFC_C gives hard clusters, allocating each document to a single cluster.
5.1.3. Semantics
DWSemClust uses latent Dirichlet allocation, which works on semantics. Since CAFC_C does not deal with semantics, the results show the difference between semantic and non-semantic techniques.
5.1.4. Running time
The running time of DWSemClust is low compared to CAFC_C: CAFC_C takes approximately an hour for one execution, whereas DWSemClust takes about 5 minutes.
5.1.5. Parameter estimation
The parameters are defined in [18]; according to it, the parameters are structured deep web sources, unstructured sources, simple query interfaces, advanced query interfaces, clustering, classification, query probing, visible form features, macro, micro, and the use of ontology. In our work we use structured deep web sources, simple query interfaces, advanced query interfaces, clustering, visible form features, and the macro parameter.
5.2. Future Work
5.2.1. Integrated schema
We will work on an integrated schema on the basis of our clustering results, to be used to satisfy customer queries. For example, if a customer wants to buy a book, he will query the database and the integrated schema will return all possible stores from which he can buy it, along with their prices.
5.2.2. Check dataset on structured techniques
We will check the dataset on structured techniques, since the proposed technique uses the topic model LDA, which is unstructured in nature. If a structured technique gives good results, we will investigate a combination of our technique with it.
References
[1] Wikipedia: http://en.wikipedia.org/wiki/Deep_Web
[2] Deep web search directory service: http://www.completeplanet.com
[3] K. C.-C. Chang, B. He, C. Li, M. Patel, and Z. Zhang, "Structured databases on the web: Observations and implications", SIGMOD Record, vol. 33(3), pp. 61–70, Sept. 2004.
[4] S. Raghavan and H. Garcia-Molina, "Crawling the Hidden Web", in VLDB, pp. 129–138, 2001.
[5] A. Hess and N. Kushmerick, "Automatically attaching semantic metadata to web services", in Proceedings of IIWeb, pp. 111–116, 2003.
[6] D. Barbará, J. Couto, and Y. Li, "COOLCAT: an entropy-based algorithm for categorical clustering", in 11th International Conference on Information and Knowledge Management, pp. 582–589, 2002.
[7] B. He, T. Tao, and K. C.-C. Chang, "Organizing structured web sources by query schemas: a clustering approach", in CIKM, pp. 22–31, 2004.
[8] L. Barbosa, J. Freire, and A. Silva, "Organizing hidden-web databases by clustering visible web documents", in ICDE, pp. 326–335, 2007.
[9] J. P. Callan, M. Connell, and A. Du, "Automatic discovery of language models for text databases", in SIGMOD, pp. 479–490, 1999.
[10] H. Li et al., "Clustering Deep Web Databases Semantically", in AIRS, pp. 365–376, 2008.
[11] W. Zhang, K. Chen, and F. Zhang, "Mining Data Records based on Ontology Evolution for Deep Web", in 2nd International Conference on Computer Engineering and Technology, 2010.
[12] H. Xiang Xu, Xiu-Lan Hao, Shu-Yun Wang, and Yun-Fa Hu, "A Method of Deep Web Classification", CMLC, pp. 4009–4014, 2007.
[13] L. Peiguang, Yibing Du, Xiaohua Tan, and Chao Lv, "Research on Automatic Classification for Deep Web Query Interfaces", ISIP, pp. 313–317, 2008.
[14] P. Zhao, L. Huang, and W. Fang, "Organizing Structured Deep Web by Clustering Query Interfaces Link Graph", ADMA, pp. 683–690, 2008.
[15] H. Q. Le, "Classifying Structured Web Sources Using Aggressive Feature Selection", in 5th International Conference on Web Information Systems and Technologies (WEBIST), pp. 618–625, 2009.
[16] X. Xian, P. Zhao, W. Fang, and J. Xin, "Automatic classification of deep web databases with simple query interface", ICIMA, pp. 85–88, 2009.
[17] Y. Q. Dong, Q. Z. Li, Y. H. Ding, et al., "A query interface matching approach based on extended evidence theory for Deep Web", Journal of Computer Science and Technology, vol. 25(3), May 2010.
[18] U. Noor, Z. Rashid, and A. Rauf, "A Survey of Automatic Deep Web Classification Techniques", International Journal of Computer Applications, 2011.
[19] G. Salton, A. Wong, and C. S. Yang, "A vector space model for automatic indexing", CACM, vol. 18(11), pp. 613–620, 1975.
[20] R. A. Baeza-Yates and B. A. Ribeiro-Neto, "Modern Information Retrieval", ACM Press/Addison-Wesley, 1999.
[21] M. Steinbach, G. Karypis, and V. Kumar, "A comparison of document clustering techniques", in KDD Workshop on Text Mining, 2000.
[22] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet Allocation", Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003.
[23] The UIUC Web integration repository: http://metaquerier.cs.uiuc.edu/repository
[24] B. Larsen and C. Aone, "Fast and effective text mining using linear-time document clustering", in KDD, pp. 16–22, 1999.
[25] K. Bharat, A. Broder, M. Henzinger, P. Kumar, and S. Venkatasubramanian, "The connectivity server: Fast access to linkage information on the Web", Computer Networks, vol. 30(1-7), pp. 469–477, 1998.
[26] A. Daud, J. Li, L. Zhou, and F. Muhammad, "Knowledge Discovery through Directed Probabilistic Topic Model: a Survey", Journal of Frontiers of Computer Science in China (FCS), vol. 4(2), pp. 280–301, June 2010.
[27] T. L. Griffiths and M. Steyvers, "Finding scientific topics", in Proceedings of the National Academy of Sciences (NAS), USA, pp. 5228–5235, 2004.
[28] T. Hofmann, "Probabilistic Latent Semantic Analysis", in Proc. of the 15th Annual Conference on Uncertainty in Artificial Intelligence (UAI), Stockholm, Sweden, 1999.