International Journal of Research in Management, Science & Technology (E-ISSN: 2321-3264), Vol. 2, No. 3, December 2014. Available at www.ijrmst.org
2321-3264/Copyright©2014, IJRMST, December 2014

Deep Web Search in Data Mining for Data Extraction

Nripendra Narayan Das#1, Dimpy Gambhir#2
1,2 Department of Computer Science, Rawal Institute of Engineering and Technology, Faridabad, India
1 [email protected]
2 [email protected]

Abstract:- WWW stands for World Wide Web, an advanced information retrieval system. Over the years the World Wide Web has become weighed down with information, and it has become hard to retrieve data that matches a given need. Web mining extracts useful information from web pages. Web mining techniques seek to extract knowledge from Web data, including web documents, hyperlinks between documents, and usage logs of web sites. Web usage mining mines knowledge from diverse websites. Extracting appropriate data from deep web pages is a pressing problem because of the overflow of data onto the web. Web servers generate a huge amount of information on web users' browsing activities; this is called click stream or web access log data. The click stream data can be enriched with information about the content of visited pages. The aim of this paper is to obtain all the data behind a form through multiple submissions of the form, filled out in all possible ways by an agent, but efficiency concerns lead us to consider alternatives. We can estimate the amount of remaining data after a small number of submissions and aim to maximize the coverage of the data.

I. INTRODUCTION

Mining information from the Web and identifying relevant resources that match a query is studied in the field of Information Retrieval (IR). Today, it is impossible to imagine the Web without search engines. However, traditional search engines have limitations: they return only result pages that have already been gathered and pre-processed by crawlers. This technique is efficient for static web pages, which remain the same for long periods of time.

The data to be mined varies from structured to unstructured. Data mining mainly deals with structured data organized in a database, while text mining mainly handles unstructured data. Web mining lies in between and copes with semi-structured and/or unstructured data. Web mining calls for creative use of data mining and/or text mining techniques as well as its own distinctive approaches. Mining web data is one of the most challenging tasks for data mining and data management scholars, because the data available on the web is huge, heterogeneous and less structured, and we can easily get overwhelmed by it. As the Web reaches its full potential, however, we must improve its services, make it more comprehensible, and increase its usability. As researchers continue to develop data mining techniques, we believe this technology will play an increasingly important role in meeting the challenges of developing the intelligent Web.

Users could encounter the following problems when interacting with the Web:

A. Finding relevant information
Most people use some search service when they want to find specific information on the Web. A user usually inputs a simple keyword query, and the result is a list of pages ranked by their similarity to the query. Today's search tools suffer from low precision and low recall, mainly because of wrong or incomplete keyword queries. This makes many search results irrelevant.

B. Creating new knowledge
This is a data-triggered process: it presumes that we have a collection of Web data and want to extract potentially useful knowledge from it.

C. Personalisation of information
People differ in the contents and presentations they prefer while interacting with the Web.

D. Learning about consumers or individual users
This is a group of sub-problems, such as mass-customizing information for intended consumers, problems related to effective Web site design and management, problems related to marketing, and others.

The latest search engines, such as Google and Yahoo Search, provide only links in response to submitted queries.

II. RELATED WORK

Data mining can be viewed as a result of the natural evolution of information technology. The database system industry has witnessed an evolutionary path in the development of data collection, creation, data management (including data storage and retrieval, and database transaction processing), and advanced data analysis (involving data warehousing and data mining) [8]. Research and development in database systems since the 1970s has progressed from early hierarchical and network database systems to relational database systems (where data are stored in relational table structures), data modelling tools, and indexing and accessing methods. In addition, users gained convenient and flexible data access through query languages, user interfaces, optimized query processing, and transaction management. Efficient methods for on-line transaction processing (OLTP) [3], where a query is viewed as a read-only transaction, have contributed substantially to the evolution and wide acceptance of relational technology as a major tool for efficient storage, retrieval, and management of large amounts of data. Application-oriented database systems, including spatial, temporal, multimedia, active, stream, and sensor databases, scientific and engineering databases, knowledge bases, and office information bases, have flourished. Issues related to the distribution, diversification, and sharing of data have been studied extensively.
Heterogeneous database systems and Internet-based global information systems such as the World Wide Web (WWW) have also emerged and play a vital role in the information industry.

Fig -1 A Framework of Data Mining Process

To access information from the web, users currently choose among various approaches. Most of these approaches are based on the following:

A. Content or Keyword based
Most search engines, such as MSN, Google or Yahoo, search for information by keyword or by content-directory browsing: they use keyword indices or manually built directories to find documents with specified keywords or contents.

B. Multilevel Deep Web Querying
Much information cannot be accessed through static URL links because, unlike the surface Web, it hides behind searchable database query forms. For example, if a user is searching for a movie, book or song whose information is not on the indexed pages, a multilevel web search is needed to find the relevant information.

C. Dynamic Web Link Clicking
The user dynamically follows the links to web resources presented by search engines.

The success of these approaches and techniques, especially with the more recent page ranking by search engines, depends heavily on efficient web data mining, which shows great promise to become the ultimate information system.

III. PROPOSED WORK

A. ALGORITHM

1. Go through available online admission forms for the following purposes:
   a. Identify the student details required to fill in an online admission form (personal, contact, academic details, etc.) by crawling through available online admission forms.
   b. Identify the sub-categories (sub-fields) of each detail according to the fields (Text Fields) of the online admission forms, e.g.
      i. Student Name can be subdivided into Prefix, First Name, Middle Name, Last Name, Suffix, Family Name, etc.
      ii. Address can be subdivided into House/Flat No, Building/Block, Street, City, PIN, etc.
2. Crawl through HTML forms and collect the value of the HTML attribute "name" of each HTML Text Field, together with the HTML label for that Text Field. Create key (HTML label)-value (value of attribute name) pairs, e.g.
   a. A text field labeled "First Name" may have the name attribute value first-name, fname, fn, first_name, etc. Store the key-value pairs (like FirstName [Key] -> fn, firstname, fname [Value]) in the database for future reference, after removing all non-alphanumeric characters from key and value.

TABLE-I Key value specification

#   Key           Value
1   FirstName     Fn
2   FirstName     Firstname
3   FirstName     Fname
4   StudentName   Name
5   StudentName   Studentname

3. Key also refers to the sub-category identified in STEP 1. The key-value pairs help in binding/auto-filling student details in the form and submitting it online.
4. To fill in an online form automatically, follow the given steps:
   a. Visit the admission form link.
   b. Identify the HTML admission form in it.
   c. Store the values of the "action" and "method" attributes of the HTML form in "SUBMIT_URL" and "SUBMIT_TYPE" to submit the form.
   d. Parse the HTML form, collect the value of the "name" attribute of all inputs (Text Fields of the HTML form) and store them in an array "INPUTS".
   e. Traverse array "INPUTS", find Keys from the key-value table by matching the value column with the "INPUTS" elements, and store the Keys in array "REQUIRED_DETAIL".
   f. Create a new Key-Value pair "FORM_PARAM".
   g. Loop through the "REQUIRED_DETAIL" array, store the array index in "INDEX" and follow these steps for each iteration:
      i. Store the value of INPUTS[INDEX] in FORM_PARAM as a key.
      ii. Store the actual detail of the student matching REQUIRED_DETAIL[INDEX] in FORM_PARAM as a value.
   h. STEP g creates a list of key-value pairs named FORM_PARAM (form data) required to submit through the form.
   i. Submit FORM_PARAM (form data) at SUBMIT_URL using SUBMIT_TYPE.
   j. Record the response of the submission in the database for future reference.
5. Repeat STEP 4 to submit student details in multiple online admission forms.

IV. CONCLUSION

In this paper, an automatic prototype is proposed. The first part of this prototype separates deep web pages from surface web pages and discards the latter. After collecting the deep web pages, the system extracts all the search query interfaces and, in the second part, chooses the interface with the maximum number of fields. The third part of the system provides a single interface to be filled in by the user, and finally a query is formed to fetch the relevant result pages.

REFERENCES

[1]. Kosala, R. and Blockeel, H. 2000. Web Mining Research: A Survey. SIGKDD Explorations. Vol. 2, 1-15.
[2]. Bergman, M.K. (2001). "The Deep Web: Surfacing Hidden Value". In The Journal of Electronic Publishing, Vol. 7, No. 1.
[3]. Han, J., Kamber, M.: Data Mining: Concepts and Techniques, Second edition, p. 628-648. Morgan Kaufmann Publishers, 2006.
[4]. Ramakrishna, Gowdar et al., "Web Mining: Key Accomplishments, Applications and Future Directions", in the International Conference on Data Storage and Data Engineering, 2010.
[5]. Jiawei Han, Kevin Chen-Chuan Chang, "Data Mining for Web Intelligence", IEEE International Conference on Data Mining, 2002.
[6]. S. Chaudhuri and U. Dayal, "An Overview of Data Warehousing and OLAP Technology," SIGMOD Record, vol. 26, no. 1, 1997, pp. 65-74.
[7]. S. Brin and L. Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine," Proc. 7th International World Wide Web Conf. (WWW98), ACM Press, New York, 1998, pp. 107-117.
[8]. J. Srivastava et al., "Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data," SIGKDD Explorations, vol. 1, no. 2, 2000, pp. 12-23.
[9]. Kosala and Blockeel, "Web mining research: A survey," SIGKDD Explorations: Newsletter of the Special Interest Group (SIG) on Knowledge Discovery and Data Mining, ACM, Vol. 2, 2000.
[10]. https://sites.google.com/site/ontowebmining/
[11]. http://www.ScholarSearchEngines.com/
[12]. http://deep-web.org/how-to-research/deep-web-search-engines/
[13]. http://en.wikipedia.org/wiki/Deep_Web
[14]. Sriram Raghavan and Hector Garcia-Molina. "Crawling the hidden web". In Proceedings of the International Conference on Very Large Data Bases, pages 129-138, San Francisco, CA, USA, 2001.
[15]. Kevin Chen-Chuan Chang, Bin He, Chengkai Li, Mitesh Patel, and Zhen Zhang. "Structured databases on the web: observations and implications". SIGMOD Record, 33(3):61-70, 2004.
[16]. Fan, W., Wallace, L., Rich, S. and Zhang, Z. 2005. Tapping into the Power of Text Mining. Communications of the ACM - Privacy and Security in highly dynamic systems. Vol. 49, Issue-9.
[17]. Bharanipriya, V. and Prasad, K. 2011. Web content Mining Tools: A Comparative study. International Journal of Information Technology and Knowledge Management. Vol. 4, No 1, 211-215.
[18]. R. Chau, C. Yeh and K. Smith, Personalized multilingual web content mining, KES (2004), pp. 155–163.
[19]. P. Kolari and A. Joshi, Web mining: Research and practice, Comput. Sci. Eng. July/August (2004) 42–53.
[20]. B. Liu and K. Chang, Editorial: Special issue on web content mining, SIGKDD Explorations 6(2) (2004) 1–4.
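
Steps 2 and 4.c to 4.h of the algorithm in Section III (Proposed Work) can be illustrated with a minimal Python sketch. It is an illustration under assumptions, not the authors' implementation: the names normalize, KEY_VALUE_TABLE, FormParser and build_form_param, the sample form markup and the student record are all hypothetical. The actual submission (step 4.i) and the database logging (step 4.j) are left out, and inputs with no match in the key-value table are simply skipped, which the paper does not specify.

```python
import re
from html.parser import HTMLParser


def normalize(text):
    """Remove every non-alphanumeric character and lowercase the rest (step 2)."""
    return re.sub(r"[^0-9A-Za-z]", "", text).lower()


# Hypothetical key-value table in the spirit of Table I:
# normalized value of the "name" attribute -> key (sub-category from step 1).
KEY_VALUE_TABLE = {
    "fn": "FirstName",
    "firstname": "FirstName",
    "fname": "FirstName",
    "name": "StudentName",
    "studentname": "StudentName",
}


class FormParser(HTMLParser):
    """Collects action, method and text-field names of the first form (4.c, 4.d)."""

    def __init__(self):
        super().__init__()
        self.submit_url = None   # SUBMIT_URL in the paper
        self.submit_type = None  # SUBMIT_TYPE in the paper
        self.inputs = []         # INPUTS in the paper
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.submit_url = attrs.get("action", "")
            self.submit_type = (attrs.get("method") or "get").upper()
        elif tag == "input" and self._in_form:
            # Only text fields that carry a name attribute are collected.
            if attrs.get("type", "text") == "text" and attrs.get("name"):
                self.inputs.append(attrs["name"])

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


def build_form_param(inputs, student):
    """Steps 4.e to 4.h: pair each recognised input name with the student detail."""
    form_param = {}
    for name in inputs:
        key = KEY_VALUE_TABLE.get(normalize(name))
        if key is not None and key in student:
            form_param[name] = student[key]
    return form_param


# Hypothetical admission form and student record, for illustration only.
SAMPLE_FORM = """
<form action="/admission/submit" method="post">
  <label>First Name</label> <input type="text" name="first-name">
  <label>Student Name</label> <input type="text" name="student_name">
  <label>Captcha</label> <input type="text" name="captcha">
</form>
"""

parser = FormParser()
parser.feed(SAMPLE_FORM)
student = {"FirstName": "Asha", "StudentName": "Asha Verma"}
form_param = build_form_param(parser.inputs, student)
print(parser.submit_url, parser.submit_type, form_param)
```

The sketch folds steps 4.e and 4.g into one loop, so that an unrecognised field such as "captcha" cannot shift the indices of INPUTS and REQUIRED_DETAIL against each other. A real agent would also have to resolve SUBMIT_URL against the page address and handle non-text inputs.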