Executing 'Threads' Concurrently in C++
Libby Shoop, Macalester College, with Dick Brown, St. Olaf College
1 Web Spider – Executed Sequentially
Also known as a web crawler, a web spider downloads web page content, beginning with a
starting URL. If the spider encounters links inside the content, it downloads those pages'
content as well. Web spiders are used for tasks such as creating copies of visited pages for
indexing or automating maintenance tasks on a web site.
In a sequential web spider program, one “spider” accesses the data and does all of the
work. In our C++ code (in the folder serial/src), there are three main data structures used
to accomplish this task (a sketch of possible declarations follows the list):
1. m_work – a queue of URLs left to be scraped
2. m_finished – a queue of URLs that have already been retrieved
3. m_urls – the count of each URL that was encountered
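As a rough sketch of what these members might look like (the actual types in serial/src may
differ; m_finished is shown as a vector because the crawl skeleton below calls clear() on it):

    std::queue<std::string> m_work;       // URLs left to be scraped (needs <queue>)
    std::vector<std::string> m_finished;  // URLs already retrieved (needs <vector>)
    std::map<std::string, int> m_urls;    // count of each URL encountered (needs <map>)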
The web spider works by taking URLs from the work queue and processing the
corresponding web pages. Whenever the spider finds a URL that has not yet been processed,
it adds that URL to the work queue to be investigated in turn. This repeats until no new
URLs are found or the maximum number of URLs has been processed.
1.2 Exercises
1. Complete the process method in spider.cpp (a possible completion is sketched after this
exercise list):

void spider::process(const std::string & url, const std::string & html)
{
    std::string baseUrl(getBaseUrl(url));
    page p;
    p.scan(baseUrl, url, html);
    for (size_t i = 0; i < p.get_urls()->size(); ++i)
    {
        /* INSERT CODE HERE
           The page object p now includes a list of URLs that appear in
           links within this page. For each such URL with index i, if that
           URL is not an image, add it to the work queue. */
    }
    m_pages.push_back(p);
}
2. Complete the method crawl in spider.cpp (a possible completion of the loop body is also
sketched after this list). The method skeleton is shown below:

void spider::crawl(const std::string & startURL)
{
    while (!m_work.empty()) m_work.pop();
    m_finished.clear();
    m_pages.clear();
    m_work.push(startURL);
    size_t processed = 0;
    while (processed < m_maxUrls && !m_work.empty())
    {
        std::string url = m_work.front(); m_work.pop();
        raw_page data;
        /* INSERT CODE HERE
           Download the page for url into data, process it, record the
           URL as finished, and count it toward the processed total. */
    }
}
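One way the loop body in the process exercise might be completed (a sketch only: the helper
is_image is an assumption, not part of the provided code; any test that recognizes image
links, such as a suffix check for .jpg, .png, or .gif, would do):

    const std::string & u = (*p.get_urls())[i];  // the i-th URL found on this page
    if (!is_image(u))                            // skip links to images
        m_work.push(u);                          // queue the URL to be crawled later

Likewise, a sketch of what the rest of crawl's while loop might contain, assuming
hypothetical raw_page methods download() and get_html() for fetching a page and retrieving
its HTML, and a vector-like m_finished:

    if (data.download(url))             // fetch the page; skip it on failure
    {
        process(url, data.get_html());  // scan the HTML and queue any new URLs
        m_finished.push_back(url);      // record that this URL is done
        ++processed;                    // count toward the m_maxUrls limit
    }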
2 Concurrent Spiders using Boost Threads
While this spider program is functional, a sequential program like this does not scale
to a large dataset: crawling millions or billions of pages one at a time would take far too
long to complete. To make the spider program more effective, we will make use of the Boost
threads library and concurrent programming practices. Our new parallel program makes a
significant change to the original sequential spider program in order to incorporate
concurrency:
1. Multiple spiders actively crawl through the URLs. These spiders run as threads, executing
concurrently and acting independently on a set of data structures shared by all of the spiders.
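To make this concrete, here is a minimal sketch of how a fixed number of spider threads
might be launched with the Boost threads library (the names concurrent_spider, run, and
start_spiders are assumptions for illustration, not part of the provided code):

#include <boost/thread.hpp>
#include <boost/bind.hpp>
#include <vector>

class concurrent_spider {            // stand-in for the students' spider class
public:
    void run();                      // crawl until the shared work queue is empty
};

void start_spiders(std::vector<concurrent_spider> & spiders)
{
    boost::thread_group group;
    // launch one thread per spider; each thread runs that spider's run() method
    for (size_t i = 0; i < spiders.size(); ++i)
        group.create_thread(boost::bind(&concurrent_spider::run, &spiders[i]));
    group.join_all();                // wait for every spider to finish crawling
}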
2.2 Exercises for the parallel program (TO REVISE FOR C++)
1. In the class containing crawl, rewrite the code to create each Boost thread.
Note: this means creating startThread and finishing main(), which is currently provided
in the student's Concurrent Spider.
2. Create the shared data structures in a separate class, and name this file
SharedSpiderData.java (see the C++ sketch after this list).
Note: this file can be provided to students.
3. Pass one instance of the shared data to each instance of the Runnable via the
constructor.
4. Complete the method run() of ConcurrentSpider.java so that each thread grabs a URL from
the work queue and processes the page.
5. Complete the method processPage in ConcurrentSpider.java to process a page.
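Although the exercises above still use the Java class names, a rough C++ counterpart of the
shared state that exercise 2 describes might look like this (a sketch only; the class and
member names are assumptions, and a single mutex is the simplest correct choice):

#include <boost/thread/mutex.hpp>
#include <map>
#include <queue>
#include <string>
#include <vector>

struct shared_spider_data
{
    boost::mutex lock;                  // guards all three containers below
    std::queue<std::string> work;       // URLs waiting to be crawled
    std::vector<std::string> finished;  // URLs that have been processed
    std::map<std::string, int> counts;  // how many times each URL was seen
};

Each spider thread would acquire the mutex (for example, with a boost::mutex::scoped_lock)
before touching any of the containers; a finer-grained design could give each container its
own lock.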
2.3 Once you have completed the Concurrent Spider
Conduct some experiments to determine the speedup of your concurrent threaded program
with varying numbers of threads, as compared to your original sequential version; here, the
speedup for N threads is the sequential running time divided by the concurrent running time
with N threads. Write a report discussing your findings.
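A minimal timing sketch for such an experiment (boost::timer::cpu_timer is from the separate
Boost.Timer library; any wall-clock timer would serve equally well, and the crawl call shown
is a placeholder):

#include <boost/timer/timer.hpp>
#include <iostream>

int main()
{
    boost::timer::cpu_timer timer;       // starts timing immediately
    // spider s;
    // s.crawl("http://example.com/");   // run the version being measured here
    std::cout << timer.format();         // prints wall-clock and CPU times
    return 0;
}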
3 Resources
We are using these shared data structure classes (from Java's concurrency library) for
implementing a concurrent spider:
• ArrayBlockingQueue – a bounded blocking queue backed by an array
• ConcurrentLinkedQueue – an unbounded thread-safe queue based on linked nodes
• ConcurrentHashMap – a hash table supporting full concurrency of retrievals and adjustable
expected concurrency for updates, found inside another class provided for you to hold the
counts of URLs
Java has special data structures designed to be shared by threads; see the documentation
for java.util.concurrent:
http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/package-summary.html