International Journal of Research in Management, Science & Technology (E-ISSN: 2321-3264)
Vol. 2, No. 3, December 2014
Available at www.ijrmst.org
Deep Web Search in Data Mining for Data Extraction
Nripendra Narayan Das#1, Dimpy Gambhir#2
1,2 Department of Computer Science, Rawal Institute of Engineering and Technology, Faridabad, India
1 [email protected]
2 [email protected]
Abstract: WWW stands for World Wide Web, and it is an advanced information retrieval system. Over the years the World Wide Web has become weighed down with information, and it has become hard to retrieve data that matches a given need. Web mining extracts useful information from web pages. Web mining techniques seek to extract knowledge from Web data, including web documents, hyperlinks between documents, and usage logs of web sites. Web usage mining mines knowledge from diverse websites. Extracting appropriate data from deep web pages is a demanding problem because of the overflow of data onto the web. Web servers generate a huge amount of information on web users' browsing activities. This information is called click stream or web access log data. The click stream data can be enriched with information about the content of visited pages.
The aim of this paper is to obtain all the data behind a form by multiple submissions of the form, filled out in all possible ways by an agent, but efficiency concerns lead us to consider alternatives. After a small number of submissions we can estimate the amount of remaining data and aim to maximize the coverage of the data.
I. INTRODUCTION
Mining information from the Web and identifying relevant resources that match a query is studied in the field of Information Retrieval (IR). Today, it is impossible to imagine the Web without search engines. However, traditional search engines have limitations: they return only the result pages that have already been gathered and pre-processed by crawlers. This technique is efficient for static web pages, which remain the same for long periods of time. The data to be mined varies from structured to unstructured. Data mining mainly deals with structured data organized in a database, while text mining mainly handles unstructured data. Web mining lies in between and copes with semi-structured and/or unstructured data. Web mining calls for creative use of data mining and/or text mining techniques as well as its own distinctive approaches. Mining web data is one of the most challenging tasks for data mining and data management scholars, because the web holds huge amounts of heterogeneous, loosely structured data and we can easily get overwhelmed by it. As the Web reaches its full potential, however, we must improve its services, make it more comprehensible, and increase its usability. As researchers continue to develop data mining techniques, we believe this technology will play an increasingly important role in meeting the challenges of developing the intelligent Web.
Users can encounter the following problems when interacting with the Web:
A. Finding relevant information
Most people use some search service when they want to find specific information on the Web. A user usually inputs a simple keyword query, and the result is a list of pages ranked by their similarity to the query. Today's search tools suffer from low precision and low recall, mainly because of wrong or incomplete keyword queries. As a result, many search results are irrelevant.
B. Creating new knowledge
This is a data-triggered process: it presumes that we have a collection of Web data and want to extract potentially useful knowledge from it.
C. Personalisation of information
People differ in the contents and presentations they prefer while interacting with the Web.
D. Learning about consumers or individual users
This is a group of sub-problems such as mass customization of information for intended consumers, problems related to effective Web site design and management, problems related to marketing, and others.
Current search engines such as Google and Yahoo Search provide only links in response to submitted queries.
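As a brief illustration of the precision and recall mentioned under problem A, the sketch below computes both measures for a single query. The function name precision_recall and the page identifiers are hypothetical and are not taken from the paper; they only illustrate the standard definitions, where precision is the fraction of returned pages that are relevant and recall is the fraction of relevant pages that are returned.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: page ids returned by the search tool
    relevant:  page ids that actually answer the query
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = retrieved_set & relevant_set
    # Guard against empty sets to avoid division by zero.
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Made-up example: 4 pages are returned, 2 of them are relevant,
# and 5 relevant pages exist in total.
retrieved = ["p1", "p2", "p3", "p4"]
relevant = {"p2", "p4", "p5", "p6", "p7"}
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

In this made-up case half of the returned pages are relevant, while most of the relevant pages are never returned, which is the situation an incomplete keyword query tends to produce.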
II. RELATED WORK
Data mining can be viewed as a result of the natural evolution of information technology. The database system industry has followed an evolutionary path in the development of data collection and creation, data management (including data storage and retrieval, and database transaction processing), and advanced data analysis (involving data warehousing and data mining) [8]. Research and development in database systems since the 1970s has progressed from early hierarchical and network database systems to relational database systems (where data are stored in relational table structures), data modelling tools, and indexing and accessing methods. In addition, users gained convenient and flexible data access through query languages, user interfaces, optimized query processing, and transaction management. Efficient methods for on-line transaction processing (OLTP) [3], where a query is viewed as a read-only transaction, have contributed substantially to the evolution and wide acceptance of relational technology as a major tool for efficient storage, retrieval, and management of large amounts of data.
Application-oriented database systems, including spatial, temporal, multimedia, active, stream, and sensor, and scientific and engineering databases, knowledge bases, and office information bases, have flourished. Issues related to the distribution, diversification, and sharing of data have been studied extensively. Heterogeneous database systems and Internet-based global information systems such as the World Wide Web (WWW) have also emerged and play a vital role in the information industry.
2321-3264/Copyright©2014, IJRMST, December 2014
Fig. 1: A Framework of Data Mining Process
Users currently choose among various approaches to access information on the web. Most of these approaches are based on the following:
A. Content or Keyword based
Most search engines, such as MSN, Google, or Yahoo, search for information by keyword or by content-directory browsing; they use keyword indices or manually built directories to find documents with the specified keywords or contents.
B. Multilevel Deep Web Querying
Much information cannot be accessed through static URL links because, unlike the surface web, it hides behind searchable database query forms. For example, if a user is searching for a movie, book, or song whose details are not on the indexed pages, a multilevel web search is needed to find the relevant information.
C. Dynamic Web Link Clicking
The user dynamically follows the links to web resources presented by search engines.
The success of these approaches and techniques, especially with the more recent page ranking by search engines, depends heavily on efficient web data mining, which shows great promise of becoming the ultimate information system.
III. PROPOSED WORK
A. ALGORITHM
1. Go through available online admission forms for the following purposes:
   a. Identify the student details that are required to fill in an online admission form, i.e. personal, contact, academic details, etc., by crawling through available online admission forms.
   b. Identify the sub-categories (sub-fields) of each detail according to the fields (Text Fields) of the online admission forms, e.g.
      i. Student Name can be subdivided into Prefix, First Name, Middle Name, Last Name, Suffix, Family Name, etc.
      ii. Address can be subdivided into House/Flat No, Building/Block, Street, City, PIN, etc.
2. Crawl through HTML forms and collect the value of the HTML attribute “name” of each HTML Text Field together with the HTML label for that Text Field. Create key (HTML label)-value (value of attribute name) pairs, e.g.
   a. A text field labeled “First Name” may have the name attribute value first-name, fname, fn, first_name, etc.
3. Store each key-value pair (like FirstName [Key] -> fn, firstname, fname [Value]) in the database, after removing all non-alphanumeric characters from the key and the value, for future reference.
TABLE-I: Key value specification
#   Key           Value
1   FirstName     Fn
2   FirstName     Firstname
3   FirstName     Fname
4   StudentName   Name
5   StudentName   Studentname
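A minimal sketch of how key-value pairs like those in TABLE-I might be collected from crawled forms. It assumes that each label is tied to its text field through a for/id pair, and it uses an in-memory dictionary in place of the database. The names normalize, LabelNameCollector, and build_key_value_table, as well as the two form fragments, are illustrative and do not come from the paper.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


def normalize(text):
    """Remove every non-alphanumeric character from a key or value."""
    return re.sub(r"[^0-9A-Za-z]", "", text)


class LabelNameCollector(HTMLParser):
    """Collects (label text, name attribute) pairs for text inputs.

    Assumption: each <label for="..."> refers to the id of its <input>.
    """

    def __init__(self):
        super().__init__()
        self.labels = {}          # input id -> label text
        self.names = {}           # input id -> value of the name attribute
        self._current_for = None  # id targeted by the label being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "label":
            self._current_for = attrs.get("for")
        elif tag == "input" and attrs.get("type", "text") == "text":
            if "id" in attrs and "name" in attrs:
                self.names[attrs["id"]] = attrs["name"]

    def handle_data(self, data):
        if self._current_for and data.strip():
            self.labels[self._current_for] = data.strip()

    def handle_endtag(self, tag):
        if tag == "label":
            self._current_for = None

    def pairs(self):
        return [(self.labels[i], self.names[i])
                for i in self.names if i in self.labels]


def build_key_value_table(html_forms):
    """Map each normalized label (key) to its normalized field names (values)."""
    table = defaultdict(set)
    for html in html_forms:
        collector = LabelNameCollector()
        collector.feed(html)
        for label, name in collector.pairs():
            table[normalize(label)].add(normalize(name))
    return table


# Two made-up fragments standing in for crawled admission forms.
forms = [
    '<form><label for="a">First Name</label>'
    '<input type="text" id="a" name="first-name"></form>',
    '<form><label for="x">First Name</label><input id="x" name="fn"></form>',
]
table = build_key_value_table(forms)
print(dict(table))
```

Both fragments label their field “First Name”, so the two different name attributes end up under the single key FirstName, which is the grouping the table shows.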
The key also refers to the sub-category identified in STEP 1. The key-value pairs help in binding (auto-filling) student details to the form and submitting it online.
4. To fill in an online form automatically, follow these steps:
   a. Visit the admission form link.
   b. Identify the HTML admission form in it.
   c. Store the values of the “action” and “method” attributes of the HTML form in “SUBMIT_URL” and “SUBMIT_TYPE”, to be used for submitting the form.
   d. Parse the HTML form, collect the value of the “name” attribute of all inputs (Text Fields of the HTML form), and store them in an array “INPUTS”.
   e. Traverse the array “INPUTS”, find the Keys in the key-value table by matching the value column against the “INPUTS” elements, and store the Keys in an array “REQUIRED_DETAIL”.
   f. Create a new list of Key-Value pairs “FORM_PARAM”.
   g. Loop through the “REQUIRED_DETAIL” array, store the array index in “INDEX”, and follow these steps for each iteration:
      i. Store the value of INPUTS[INDEX] in FORM_PARAM as a key.
      ii. Store the actual detail of the student matching REQUIRED_DETAIL[INDEX] in FORM_PARAM as a value.
   h. STEP g creates a list of key-value pairs named FORM_PARAM (form data) that must be submitted through the form.
   i. Submit FORM_PARAM (form data) to SUBMIT_URL using SUBMIT_TYPE.
   j. Record the response of the submission in the database for future reference.
5. Repeat STEP 4 to submit student details in multiple online admission forms.
[16]. Fan, W., Wallace, L., Rich, S. and Zhang, Z. 2005. Tapping into the Power of Text Mining. Communications of the ACM - Privacy and Security in highly dynamic systems. Vol. 49, Issue-9.
[17]. Bharanipriya, V. and Prasad, K. 2011. Web content Mining Tools: A Comparative study. International Journal of Information Technology and Knowledge Management. Vol. 4, No. 1, 211-215.
[18]. R. Chau, C. Yeh and K. Smith, Personalized multilingual web content mining, KES (2004), pp. 155-163.
[19]. P. Kolari and A. Joshi, Web mining: Research and practice, Comput. Sci. Eng. July/August (2004) 42-53.
[20]. B. Liu and K. Chang, Editorial: Special issue on web content mining, SIGKDD Explorations 6(2) (2004) 1-4.
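Sub-steps b to h of STEP 4 can be sketched as follows, under several assumptions that are not in the paper: the page contains a single form, field names are lower-cased before they are matched against the key-value table, and the table and student record are small in-memory dictionaries. The names VALUE_TO_KEY, FormParser, and build_submission, the example URL, and the student data are all hypothetical. Sending the request (step i) and recording the response (step j) are left out so that the sketch does not touch the network.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin

# Key-value table in the spirit of TABLE-I, inverted: field name -> key.
VALUE_TO_KEY = {
    "fn": "FirstName", "firstname": "FirstName", "fname": "FirstName",
    "name": "StudentName", "studentname": "StudentName",
}


class FormParser(HTMLParser):
    """Steps b-d: find the form, its action/method, and its text inputs."""

    def __init__(self):
        super().__init__()
        self.submit_url = None     # SUBMIT_URL
        self.submit_type = "GET"   # SUBMIT_TYPE (HTML default is GET)
        self.inputs = []           # INPUTS

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and self.submit_url is None:
            self.submit_url = attrs.get("action") or ""
            self.submit_type = (attrs.get("method") or "GET").upper()
        elif tag == "input" and attrs.get("type", "text") == "text":
            if attrs.get("name"):
                self.inputs.append(attrs["name"])


def build_submission(page_url, html, student):
    """Return (SUBMIT_URL, SUBMIT_TYPE, encoded FORM_PARAM) for one form."""
    parser = FormParser()
    parser.feed(html)
    form_param = {}                                  # step f
    for name in parser.inputs:                       # steps e and g
        # Normalize the field name, then look up its key in the table.
        key = VALUE_TO_KEY.get(re.sub(r"[^0-9a-z]", "", name.lower()))
        if key in student:
            form_param[name] = student[key]          # steps g.i and g.ii
    submit_url = urljoin(page_url, parser.submit_url or "")
    return submit_url, parser.submit_type, urlencode(form_param)


# Made-up admission form and student record.
html = ('<form action="/apply" method="post">'
        '<input type="text" name="first-name">'
        '<input type="text" name="city"></form>')
student = {"FirstName": "Asha"}
url, method, body = build_submission("http://example.org/admission", html, student)
print(url, method, body)
```

Fields whose names have no entry in the table, such as city here, are simply skipped in this sketch; a fuller version would need a policy for required fields that cannot be matched.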
IV. CONCLUSION
In this paper, an automatic prototype is proposed. The first part of the prototype separates deep web pages from surface web pages and discards the latter. After collecting the deep web pages, the second part extracts all the search query interfaces and chooses the interface with the maximum number of fields. The third part presents only one interface, to be filled in by the user, and finally a query is formed to fetch the relevant result pages.
REFERENCES
[1]. Kosala, R. and Blockeel, H. 2000. Web Mining Research: A Survey. SIGKDD Explorations. Vol. 2, 1-15.
[2]. Bergman, M.K. (2001). "The Deep Web: Surfacing Hidden Value". In The Journal of Electronic Publishing, Vol. 7, No. 1.
[3]. Han, J., Kamber, M.: Data Mining: Concepts and Techniques, Second edition, p. 628-648. Morgan Kaufmann Publishers, 2006.
[4]. Ramakrishna, Gowdar et al., "Web Mining: Key Accomplishments, Applications and Future Directions", in the International Conference on Data Storage and Data Engineering 2010.
[5]. Jiawei Han, Kevin Chen-Chuan Chang, "Data Mining for Web Intelligence", IEEE International Conference on Data Mining, 2002.
[6]. S. Chaudhuri and U. Dayal, "An Overview of Data Warehousing and OLAP Technology," SIGMOD Record, vol. 26, no. 1, 1997, pp. 65-74.
[7]. S. Brin and L. Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine," Proc. 7th International World Wide Web Conf. (WWW98), ACM Press, New York, 1998, pp. 107-117.
[8]. J. Srivastava et al., "Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data," SIGKDD Explorations, vol. 1, no. 2, 2000, pp. 12-23.
[9]. Kosala and Blockeel, "Web mining research: A survey," SIGKDD Explorations: Newsletter of the Special Interest Group (SIG) on Knowledge Discovery and Data Mining, ACM, Vol. 2, 2000.
[10]. https://sites.google.com/site/ontowebmining/
[11]. http://www.ScholarSearchEngines.com/
[12]. http://deep-web.org/how-to-research/deep-web-search-engines/
[13]. http://en.wikipedia.org/wiki/Deep_Web
[14]. Sriram Raghavan and Hector Garcia-Molina. "Crawling the hidden web". In Proceedings of the International Conference on Very Large Data Bases, pages 129-138, San Francisco, CA, USA, 2001.
[15]. Kevin Chen-Chuan Chang, Bin He, Chengkai Li, Mitesh Patel, and Zhen Zhang. "Structured databases on the web: observations and implications". SIGMOD Record, 33(3):61-70, 2004.