2016 Joint International Conference on Artificial Intelligence and Computer Engineering (AICE 2016) and
International Conference on Network and Communication Security (NCS 2016)
ISBN: 978-1-60595-362-5
The Research of a Spider Based on Crawling Algorithm
Xin-Yang WANG1,a, Jian ZHANG2,b
1 Guangdong Vocational Technology Institute, China
2 Foshan Southern Institute of Data Science and Technology, China
a [email protected], b [email protected]
Keywords: Spider, URL Seed, Scope First, Document Correlativity, Threshold.
Abstract. This paper conducts in-depth research on data mining in three areas: the work flow, key technologies and software algorithm of the spider. The paper analyzes the work flow and key technologies of the URL-oriented spider in detail. It also proposes managing the URL list with several queues and sorting the URLs by document correlativity so that HTML files can be downloaded at high speed. The aim of this paper is to design a well-adjusted and fully functional software model of the spider. Sun JDK + Borland JBuilder + SQL Server + IIS + the Bot package are used as the software development environment.
Introduction
As WEB sites involve a large amount of complex information, how to mine valuable data from them has become one of the current research focuses in data mining. The key issue in mining data on the WEB is how to design a spider, so this paper conducts in-depth research on data mining in three areas, namely the work flow, key technologies and software algorithm of the spider, with the aim of designing a well-adjusted and fully functional software model of the spider. Sun JDK + Borland JBuilder + SQL Server + IIS + the Bot package are used as the software development environment.
Work Flow of the Spider
Figure 1. Work Flow of the Spider.
After a theme or URL is provided, the spider starts crawling Web pages. For example, the "theme-related spider", developed on the basis of a fully integrated index database, crawls Web pages according to the theme with a high "recall rate" and "accuracy rate"; however, due to its great development difficulty and high cost, it is applied mainly in high-end search engine systems. The "URL-oriented spider" starts crawling from a specific URL, also known as the seed. This spider is good for extracting information from a whole site, but it tends to return a large number of useless pages because it does not value the URLs. Thus, in order to improve the "recall rate" and lower the cost, I propose a "URL seed based spider" model, which crawls pages starting from the URL seed and values the pages by computing document correlativity with respect to both the URL seed and the theme.
If every page the spider has crawled is downloaded, a wealth of information is available for further data mining, but the search "accuracy" may be impaired, so the importance of pages needs to be analyzed. Based on an analysis of page value using document correlativity, pages whose computed correlativity is lower than a certain threshold are not downloaded.
Theoretically the WEB is a "map": given a URL seed, the spider can crawl all the nodes of the "map". However, the goal of crawling is not to let the spider move back and forth between Web pages endlessly, but to stop it once enough information has been collected. I choose to control the crawling of the spider by setting a time limit and defining a specific crawling scope. For example, the spider is set to complete its crawling work within one day, or to crawl within the URL scope and up to a certain depth. "Downloading pages" means saving the pages on the hard disk in the form of HTML, which is a tag language. The spider should be able to "recognize" these tags and mine important data from them; this is the "page parsing" work.
With the information parsed, data application is expanded in three modes: database, HTML and XML. In database mode, data can be saved safely; in HTML mode, the parsed information is aggregated, formatted and written into HTML files, and then released by the WEB server; in XML mode, the parsed information is written into XML files and interpreted by the client, which reduces the WEB server load and unifies the data expression. If DOM programming is applied to the XML together with XSLT, various advanced "data inquiry" functions become available.
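As an illustration of the XML mode, the following minimal sketch applies an XSLT stylesheet to the XML written by the spider using the standard javax.xml.transform API; the file names jobs.xml, jobs.xslt and report.html are hypothetical placeholders, not part of the original system.
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Minimal sketch: apply an XSLT stylesheet to the XML produced by the spider.
// The file names jobs.xml, jobs.xslt and report.html are hypothetical.
public class XmlTransformDemo {
    public static void main(String[] args) throws Exception {
        TransformerFactory factory = TransformerFactory.newInstance();
        Transformer transformer = factory.newTransformer(new StreamSource("jobs.xslt"));
        // Transform the parsed data into an HTML report on the client side.
        transformer.transform(new StreamSource("jobs.xml"), new StreamResult("report.html"));
    }
}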
Design of the Spider
Application of the Crawling Algorithms
The crawling algorithms of the spider include depth-first, breadth-first and Fish-Search, among others. The depth-first (breadth-first) algorithm performs a depth-first (breadth-first) traversal of the target site to collect pages. The Fish-Search algorithm simulates the foraging of sea fish: fish continue to reproduce if they find food, otherwise they die. Its weakness is that the document correlativity computation is simplistic. The Shark-Search algorithm improves on Fish-Search by extending the correlativity measurement methods. Because the important data of a site is generally located at the high levels of the "tree", the breadth-first algorithm is the better choice for obtaining important data in a short time, whereas the depth-first algorithm can easily get stuck in a branch it cannot get out of. For that reason, the "URL seed based spider" model uses the breadth-first algorithm. Application of the crawling algorithm is shown in Figure 2.
Figure 2. Application of the Crawling Algorithms.
Breadth-first crawling can be implemented either recursively or non-recursively. The spider can obtain dozens or hundreds of URLs from each page it crawls, and the number of URLs may grow sharply as the crawling depth increases. If the recursive implementation is employed, frequent stack push and pop operations will greatly impair system performance. Therefore, the non-recursive implementation is used and queues are adopted to manage the URL list. Because document correlativity decides the value of a URL, I sort the queue elements by correlativity. Reference [1] uses a single queue to manage the URL list, which I think is limited by the queue length and may delay the response to important URLs in the queue, so I adopt several queues to manage the URL list instead. The head of each queue is handled by a separate thread, as shown in the sketch below, to finish page crawling.
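The following minimal sketch illustrates this design: a non-recursive breadth-first crawl in which each depth level has its own queue ordered by document correlativity. The class and method names (BreadthFirstCrawler, download(), extractLinks(), correlativity()) are illustrative stubs, not the paper's actual implementation.
import java.util.*;

// Minimal sketch of non-recursive breadth-first crawling with several
// URL queues sorted by document correlativity (all names are illustrative).
public class BreadthFirstCrawler {

    static class ScoredUrl {
        final String url;
        final double score;
        ScoredUrl(String url, double score) { this.url = url; this.score = score; }
    }

    // Head of each queue is the URL with the highest correlativity at that depth.
    private static final Comparator<ScoredUrl> BY_SCORE_DESC =
            (a, b) -> Double.compare(b.score, a.score);

    private final List<PriorityQueue<ScoredUrl>> queues = new ArrayList<>();
    private final Set<String> visited = new HashSet<>();

    public void crawl(String seed, int maxDepth, double threshold) {
        queues.add(new PriorityQueue<>(BY_SCORE_DESC));
        queues.get(0).add(new ScoredUrl(seed, 1.0));
        for (int depth = 0; depth < maxDepth; depth++) {
            PriorityQueue<ScoredUrl> current = queues.get(depth);
            PriorityQueue<ScoredUrl> next = new PriorityQueue<>(BY_SCORE_DESC);
            queues.add(next);                                  // one queue per depth level
            while (!current.isEmpty()) {
                ScoredUrl item = current.poll();
                if (!visited.add(item.url)) continue;          // skip already visited pages
                String html = download(item.url);              // download the page (stub)
                for (String link : extractLinks(html)) {       // collect out-links (stub)
                    double score = correlativity(link);        // VSM score of the linked page (stub)
                    if (score >= threshold) next.add(new ScoredUrl(link, score));
                }
            }
        }
    }

    // Placeholders for the components discussed in the paper.
    private String download(String url) { return ""; }
    private List<String> extractLinks(String html) { return Collections.emptyList(); }
    private double correlativity(String url) { return 0.0; }
}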
Computation of Document Correlativity
To ensure that the spider stays close to the theme when crawling pages, the document correlativity of Web pages must be measured quantitatively. I use the vector space model (VSM) to compute document correlativity. Suppose there are $n$ keywords and the weight of keyword $i$ is $\omega_i$; then the theme vector can be expressed as:
$\alpha = (\omega_1, \omega_2, \ldots, \omega_i, \ldots, \omega_n), \quad i = 1, 2, 3, \ldots, n$
Count the occurrence frequency of the $n$ keywords in a page and denote it by $x_i$; then the theme vector of the page is expressed as:
$\beta = (x_1\omega_1, x_2\omega_2, \ldots, x_i\omega_i, \ldots, x_n\omega_n), \quad i = 1, 2, 3, \ldots, n$
And the document correlativity of the page is expressed as:
$\mathrm{Cos}(\alpha, \beta) = (\alpha, \beta) \,/\, (|\alpha|\,|\beta|)$
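The correlativity computation can be sketched as follows; the method assumes the keyword weights $\omega_i$ and the page term frequencies $x_i$ have already been counted, and the class name is an illustrative placeholder.
// Minimal sketch of the VSM correlativity computation described above.
public class Correlativity {
    // alpha_i = omega_i, beta_i = x_i * omega_i; returns Cos(alpha, beta).
    public static double cosine(double[] omega, double[] x) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < omega.length; i++) {
            double a = omega[i];
            double b = x[i] * omega[i];
            dot += a * b;
            normA += a * a;
            normB += b * b;
        }
        if (normA == 0 || normB == 0) return 0;   // avoid division by zero
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}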
For a given threshold value $t$, if $\mathrm{Cos}(\alpha, \beta) \ge t$, the current page is considered correlated with the theme, and its URL is added to the download queue. Setting the value of $t$ requires a correct estimate of the document correlativity of unknown pages. According to Reference [3], $t$ can be set smaller if more pages are needed, otherwise larger. This method fixes the document correlativity threshold, which can lead to one of two extremes, either too many useless pages or too few useful pages, once the real threshold is far away from this value. Therefore, I have borrowed the idea of segmenting grey-level images by the iterative approach in digital image processing and designed the following algorithm, in which the threshold $t$ is updated dynamically.
Step 1: set the initial threshold $t_i = t_0$, where $t_0$ is a small value;
Step 2: compute the current document correlativity $t' = \mathrm{Cos}(\alpha, \beta)$; if $t' \ge t_i$, the page is correlated with the theme, otherwise it deviates from the theme;
Step 3: find the value $t_{max}$ with the highest correlativity and the value $t_{min}$ with the lowest correlativity in the downloaded URL queue, update the threshold $t_{i+1} = (t_{max} + t_{min})/2$, and then compute the following means:
$T_{mean1} = \sum_{j=t_{min}}^{t_{i+1}} j\,h(j) \Big/ \sum_{j=t_{min}}^{t_{i+1}} h(j), \qquad T_{mean2} = \sum_{j=t_{i+1}}^{t_{max}} j\,h(j) \Big/ \sum_{j=t_{i+1}}^{t_{max}} h(j)$
Compute $t_{i+1} = (T_{mean1} + T_{mean2})/2$, where $h(j)$ represents how frequently the correlativity value $j$ appears among the downloaded documents. A queue is therefore designed to store the document correlativity values and their appearance frequencies.
Step 4: use the new threshold to download pages, then let $t_i = t_{i+1}$ and repeat Step 2 and Step 3 to recompute the threshold. When the threshold shows a convergence tendency, that is, when $|t_{i+1} - t_i| \le \varepsilon$ (where $\varepsilon$ is an infinitesimally small positive value), the current threshold can be considered appropriate; the document correlativity of subsequent pages is compared against this value, and the iteration stops.
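A minimal sketch of one Step 3 update is given below; the queue of correlativity values and appearance frequencies $h(j)$ is represented here as a TreeMap, and the caller repeats the update as new pages are downloaded (Step 4) until $|t_{i+1} - t_i| \le \varepsilon$. The names are illustrative, not the paper's code.
import java.util.Map;
import java.util.TreeMap;

// Minimal sketch of one Step-3 threshold update from the iterative scheme above,
// modeled on iterative grey-level thresholding; h maps a (discretised)
// correlativity value j to its appearance frequency h(j).
public class ThresholdUpdater {
    public static double step3(TreeMap<Double, Integer> h) {
        double tMin = h.firstKey();                       // lowest correlativity seen so far
        double tMax = h.lastKey();                        // highest correlativity seen so far
        double split = (tMax + tMin) / 2;                 // provisional threshold t_{i+1}
        double mean1 = weightedMean(h, tMin, split);      // T_mean1 over [t_min, t_{i+1}]
        double mean2 = weightedMean(h, split, tMax);      // T_mean2 over [t_{i+1}, t_max]
        return (mean1 + mean2) / 2;                       // refined threshold t_{i+1}
    }

    private static double weightedMean(TreeMap<Double, Integer> h, double lo, double hi) {
        double num = 0, den = 0;
        for (Map.Entry<Double, Integer> e : h.subMap(lo, true, hi, true).entrySet()) {
            num += e.getKey() * e.getValue();             // sum of j * h(j)
            den += e.getValue();                          // sum of h(j)
        }
        return den == 0 ? lo : num / den;                 // guard against an empty interval
    }
}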
Application of Key Technologies
1) Socket technology: The spider needs to maintain an HTTP connection with the site while crawling, which is a typical HTTP application. HTTP runs on top of the TCP/IP protocol through sockets, so the spider is essentially a network communication program built on the Socket. Two classes (Socket and ServerSocket) are defined in Java for socket programming. In view of its higher level of abstraction and better extensibility, Socket is used as the basis for the network communication program in the software, as sketched below.
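The sketch below sends a plain HTTP GET request over a Java Socket and prints the response; the host name is the test seed used later in the paper, and the request details are a minimal illustration rather than the spider's full HTTPSocket class.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

// Minimal sketch of the Socket-based HTTP request at the heart of the spider.
public class HttpGetDemo {
    public static void main(String[] args) throws Exception {
        String host = "www.51job.com";
        try (Socket socket = new Socket(host, 80)) {
            PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
            out.print("GET / HTTP/1.1\r\n");        // request line
            out.print("Host: " + host + "\r\n");    // required header for HTTP/1.1
            out.print("Connection: close\r\n\r\n"); // close after one response
            out.flush();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(socket.getInputStream()));
            String line;
            while ((line = in.readLine()) != null) { // read headers and HTML body
                System.out.println(line);
            }
        }
    }
}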
2) Multithreading technology: When the spider downloads pages, serial waiting between multiple download tasks is common, which works against system performance. A multi-task parallel scheduling method is therefore employed: the download of each HTML page is allocated to a separate thread, so as to make the most of the computer's resources and improve the download speed; a minimal sketch follows.
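The sketch below shows such parallel scheduling with a fixed thread pool; downloadPage() stands in for the Socket-based download and the pool size is a parameter (the test in the Conclusion suggests about 100 threads works well).
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Minimal sketch of multi-task parallel scheduling: each HTML download runs
// in its own worker thread drawn from a fixed pool (names are illustrative).
public class ParallelDownloader {
    public static void downloadAll(List<String> urls, int threads) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String url : urls) {
            pool.submit(() -> downloadPage(url));   // one download task per URL
        }
        pool.shutdown();                            // accept no new tasks
        pool.awaitTermination(1, TimeUnit.HOURS);   // wait for the downloads to finish
    }

    private static void downloadPage(String url) {
        // ... fetch the page over HTTP and save it as an HTML file (omitted)
    }
}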
3) Tag parsing technology: We must enable the spider to "recognize" the various formatting settings on Web pages if we want to parse the information on those pages. The basic form of Web pages is HTML, so the spider must know the meanings of HTML tags. Both Swing in Java and HTMLPage in the Bot package can parse Web pages. Swing offers a low-level parser with complex parsing mechanics, while HTMLPage is a high-level API for parsing different tags and is more convenient to program with; I choose the latter.
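For comparison, the sketch below shows the low-level Swing route using HTMLEditorKit.ParserCallback and ParserDelegator; it is not the HTMLPage-based code actually used in the system, and the file name page.html is a placeholder.
import java.io.FileReader;
import java.io.Reader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Minimal sketch of tag parsing with the low-level Swing parser mentioned above.
public class TagParserDemo {
    public static void main(String[] args) throws Exception {
        Reader reader = new FileReader("page.html");
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                System.out.println("start tag: " + tag);          // the spider "recognizes" each tag
            }
            @Override
            public void handleText(char[] data, int pos) {
                System.out.println("text: " + new String(data));  // text between tags
            }
        };
        new ParserDelegator().parse(reader, callback, true);
        reader.close();
    }
}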
4) Streaming technology: After the Socket connection is successfully established, communication between the spider and the site is carried out in the form of streams. In addition, when data application is expanded, data needs to be written to HTML or XML files in the form of streams. In the design of the program, the FileOutputStream/FileInputStream operations of Java are mainly used.
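A minimal sketch of recording parsed data through a file stream, in the style of the formatting functions in the next section, is shown below; the file name result.html is a placeholder.
import java.io.FileOutputStream;
import java.io.PrintStream;

// Minimal sketch of writing parsed data to an HTML file through a stream.
public class StreamDemo {
    public static void main(String[] args) throws Exception {
        PrintStream ps = new PrintStream(new FileOutputStream("result.html"));
        ps.println("<html><body>parsed data goes here</body></html>"); // write via the stream
        ps.close();  // flush and release the file handle
    }
}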
Some Core Codes
This paper applies the spider model in the following environment: www.51job.com is used as the URL seed, and I have designed 15 analytic functions for this purpose, namely getZipcode(), getCompanyinfo(), getHR(), getPhone(), getFax(), getJobInfo(), getSalary(), getResumelang(), getLang(), getYear(), getPlace(), getDate(), getMail(), getNum() and getSite(). getJobInfo() is now taken as an example to describe the program implementation; the others work in the same way.
protected String getJobInfo(String filename) {
  try {
    String url = "http://127.0.0.1/job/" + filename;          // send a Socket connection request to the WEB server
    HTTPSocket http = new HTTPSocket();                        // create a socket connection
    http.send(url, null);
    int i = http.getBody().indexOf("job description");         // locate "job description" in the HTML
    int j = http.getBody().indexOf("apply for the position");  // locate the character string in the HTML
    String info = http.getBody().substring(i, j);              // extract the information between positions i and j
    return info;                                               // return the parsing result
  } catch (Exception e) {
    ... // prompt of failure to extract information properly, omitted
  }
}
Expanding the data application means writing the data into files in the form of streams. For this purpose, I have designed two formatting functions, WriteIntoHTML() and WriteIntoXML(). WriteIntoHTML() is now taken as an example to describe the program implementation:
protected void WriteIntoHTML() {
  ... // create an HTML file, open the file stream ps, omitted
  ps.println("<html><head><title>running result of the spider</title></head>"); // file title
  ps.println("<body><h1>running result of the spider</h1>");                    // result title
  ps.println("<table width='75%' border='1' align='center'>");                  // put all information together in tabular form
  ps.println("<tr><td align='center'>email</td>");                              // start from the header and develop the table body
  ps.println("<td align='center'>release date</td>");                           // develop the table body
  ... // other header cells are developed in the same way as above, omitted
  ps.println("</tr>");                                                          // header ends
  for (int i = 1; i <= numofinfo; i++) {  // numofinfo is the number of extracted information pieces
    ps.println("<tr>");                   // parsed data row starts
    String mail = getMail();              // parse "company email address"
    ps.print("<td align='center'>" + mail + "</td>"); // output the parsing result of "company email address"
    ... // code for the other items is the same as above, omitted
    ps.println("</tr>");                  // output of the current record ends
  }
  ... // close the file stream ps, omitted
}
Conclusion
With the URL seed set to www.51job.com, the theme set to "programmer", the crawling depth set to 4, and the crawling range restricted to the 51job website, I have conducted a test. It shows that the system reaches its best performance when the number of threads is about 100; that the relevancy threshold converges; and that the number of queues managing the URL list cannot be increased without limit as the crawling depth grows, so good results can be obtained by determining the number of queues according to the quantity of URLs at the higher levels.
References
[1] Jeff Heaton, Programming Spider, Bot and Aggregators in Java [M]. Translated by Tong Zhaofeng, et al. Beijing: Publishing House of Electronics Industry, 2002.
[2] Zhang Hongbin, Software Design of a Spider for Online Job Hunting [J]. Journal of East China Jiaotong University, 2006, 23(1): 113-116.
[3] Wang Tao, Fan Xiaozhong, Design and Implementation of a Themed Crawler [J]. Computer Application, 2004, 24(6): 270-272.
[4] Long Tengfang, Research on the Application of Data Mining Technology in the Field of Agriculture, 2005, 21(8): 42-44.