Executing 'Threads' Concurrently in C++
Libby Shoop, Macalester College, with Dick Brown, St. Olaf College
1 Web Spider – Executed Sequentially
Also known as a web crawler, a web spider downloads web page content, beginning with a
starting URL. If the spider encounters links inside the content, it downloads those pages'
content as well. Web spiders are used for tasks such as creating copies of visited pages for
indexing or automating maintenance tasks on a web site.
In a sequential web spider program, one “spider” accesses the data and does all of the
work. In our C++ code (in the folder serial/src), there are three main data structures used
to accomplish this task (a sketch of possible declarations follows the list):
1. m_work – a queue of URLs left to be scraped
2. m_finished – a queue of URLs that have already been retrieved
3. m_urls – the count of each URL that was encountered
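As a rough sketch of what these members might look like (the actual types in serial/src may
differ; m_finished is shown as a vector because the crawl skeleton below calls clear() on it):

    std::queue<std::string> m_work;       // URLs left to be scraped (needs <queue>)
    std::vector<std::string> m_finished;  // URLs already retrieved (needs <vector>)
    std::map<std::string, int> m_urls;    // count of each URL encountered (needs <map>)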
The web spider works by taking URLs from the work queue and processing the
corresponding web pages. Whenever the spider finds a URL that has not yet been processed,
it adds that URL to the work queue to be investigated in turn. This repeats until no new
URLs are found or the maximum number of URLs has been processed.
1.2 Exercises
1. Complete the process method in spider.cpp (a possible completion is sketched after this
exercise list):

void spider::process(const std::string & url, const std::string & html)
{
    std::string baseUrl(getBaseUrl(url));
    page p;
    p.scan(baseUrl, url, html);
    for (size_t i = 0; i < p.get_urls()->size(); ++i)
    {
        /* INSERT CODE HERE
           The page object p now includes a list of URLs that appear in
           links within this page. For each such URL with index i, if that
           URL is not an image, add it to the work queue. */
    }
    m_pages.push_back(p);
}
2. Complete the method crawl in spider.cpp (a possible completion of the loop body is also
sketched after this list). The method skeleton is shown below:

void spider::crawl(const std::string & startURL)
{
    while (!m_work.empty()) m_work.pop();
    m_finished.clear();
    m_pages.clear();
    m_work.push(startURL);
    size_t processed = 0;
    while (processed < m_maxUrls && !m_work.empty())
    {
        std::string url = m_work.front(); m_work.pop();
        raw_page data;
        /* INSERT CODE HERE
           Download the page for url into data, process it, record the
           URL as finished, and count it toward the processed total. */
    }
}
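One way the loop body in the process exercise might be completed (a sketch only: the helper
is_image is an assumption, not part of the provided code; any test that recognizes image
links, such as a suffix check for .jpg, .png, or .gif, would do):

    const std::string & u = (*p.get_urls())[i];  // the i-th URL found on this page
    if (!is_image(u))                            // skip links to images
        m_work.push(u);                          // queue the URL to be crawled later

Likewise, a sketch of what the rest of crawl's while loop might contain, assuming
hypothetical raw_page methods download() and get_html() for fetching a page and retrieving
its HTML, and a vector-like m_finished:

    if (data.download(url))             // fetch the page; skip it on failure
    {
        process(url, data.get_html());  // scan the HTML and queue any new URLs
        m_finished.push_back(url);      // record that this URL is done
        ++processed;                    // count toward the m_maxUrls limit
    }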
2 Concurrent Spiders using Boost Threads
While this spider program is functional, a sequential program like this does not scale
to a large dataset: crawling millions or billions of pages one at a time would take far too
long to complete. To make the spider program more effective, we will make use of the Boost
threads library and concurrent programming practices. Our new parallel program makes a
significant change to the original sequential spider program in order to incorporate
concurrency:
1. Multiple spiders actively crawl through the URLs. These spiders run as threads, executing
concurrently and acting independently on a set of data structures shared by all of the spiders.
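To make this concrete, here is a minimal sketch of how a fixed number of spider threads
might be launched with the Boost threads library (the names concurrent_spider, run, and
start_spiders are assumptions for illustration, not part of the provided code):

#include <boost/thread.hpp>
#include <boost/bind.hpp>
#include <vector>

class concurrent_spider {            // stand-in for the students' spider class
public:
    void run();                      // crawl until the shared work queue is empty
};

void start_spiders(std::vector<concurrent_spider> & spiders)
{
    boost::thread_group group;
    // launch one thread per spider; each thread runs that spider's run() method
    for (size_t i = 0; i < spiders.size(); ++i)
        group.create_thread(boost::bind(&concurrent_spider::run, &spiders[i]));
    group.join_all();                // wait for every spider to finish crawling
}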
2.2 Exercises for the parallel program (TO REVISE FOR C++)
1. In the class containing crawl, rewrite the code to create each Boost thread.
Note: this means creating startThread and finishing main(), which is currently provided
in the student's Concurrent Spider.
2. Create the shared data structures in a separate class, and name this file
SharedSpiderData.java (see the C++ sketch after this list).
Note: this file can be provided to students.
3. Pass one instance of the shared data to each instance of the Runnable via the
constructor.
4. Complete the method run() of ConcurrentSpider.java so that each thread grabs a URL from
the work queue and processes the page.
5. Complete the method processPage in ConcurrentSpider.java to process a page.
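Although the exercises above still use the Java class names, a rough C++ counterpart of the
shared state that exercise 2 describes might look like this (a sketch only; the class and
member names are assumptions, and a single mutex is the simplest correct choice):

#include <boost/thread/mutex.hpp>
#include <map>
#include <queue>
#include <string>
#include <vector>

struct shared_spider_data
{
    boost::mutex lock;                  // guards all three containers below
    std::queue<std::string> work;       // URLs waiting to be crawled
    std::vector<std::string> finished;  // URLs that have been processed
    std::map<std::string, int> counts;  // how many times each URL was seen
};

Each spider thread would acquire the mutex (for example, with a boost::mutex::scoped_lock)
before touching any of the containers; a finer-grained design could give each container its
own lock.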
2.3 Once you have completed the Concurrent Spider
Conduct some experiments to determine the speedup of your concurrent threaded program
with varying numbers of threads, as compared to your original sequential version; here, the
speedup for N threads is the sequential running time divided by the concurrent running time
with N threads. Write a report discussing your findings.
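A minimal timing sketch for such an experiment (boost::timer::cpu_timer is from the separate
Boost.Timer library; any wall-clock timer would serve equally well, and the crawl call shown
is a placeholder):

#include <boost/timer/timer.hpp>
#include <iostream>

int main()
{
    boost::timer::cpu_timer timer;       // starts timing immediately
    // spider s;
    // s.crawl("http://example.com/");   // run the version being measured here
    std::cout << timer.format();         // prints wall-clock and CPU times
    return 0;
}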
3 Resources
We are using these shared data structure classes (from Java's concurrency library) for
implementing a concurrent spider:
• ArrayBlockingQueue – a bounded blocking queue backed by an array
• ConcurrentLinkedQueue – an unbounded thread-safe queue based on linked nodes
• ConcurrentHashMap – a hash table supporting full concurrency of retrievals and adjustable
expected concurrency for updates, found inside another class provided for you to hold the
counts of URLs
Java has special data structures designed to be shared by threads; see the documentation
for java.util.concurrent:
http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/package-summary.html