2016 Fall.01
Group of Intelligent Analysis & Recommendation System, Xiamen University

The Design and Implementation of a Crawler Framework Based on Scrapy
September 19, 2016
WANG Weiwei, Department of Automation, Xiamen University

CONTENT
01 Which sites can be crawled
02 The Framework of Crawler
03 Data processing and application
04 Open Source Code
05 Our Code
06 Distributed Crawls
07 Avoiding getting banned
08 Papers and Research

PART ONE
1. Which sites can be crawled
All kinds of sites can be crawled; the real question is which sites are worth crawling.

PART TWO
2. The Framework of Crawler
Scrapy (https://scrapy.org/): a fast and powerful scraping and web crawling framework.

PART THREE
3. Data processing and application
Content and text analysis
News websites, e.g. http://news.sina.com.cn/, http://news.163.com/, http://news.qq.com/ ...
Industry analysis
Shopping sites, e.g. http://www.jd.com/, https://www.taobao.com/, http://www.yhd.com/ ...
Social media monitoring
Social networks, e.g. Weibo, WeChat Official Accounts, Facebook, Twitter ...

PART FOUR
4. Open Source Code
Scrapy is a fast, high-level web crawling and web scraping framework used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

Other open-source crawlers:
WeChat Official Account crawler: https://github.com/hexcola/wcspider
Douban Books crawler: https://github.com/lanbing510/DouBanSpider
Zhihu crawler: https://github.com/LiuRoy/zhihu_spider
Bilibili user crawler: https://github.com/airingursb/bilibili-user
Sina Weibo crawler: https://github.com/LiuXingMing/SinaSpider
Distributed novel-download crawler: https://github.com/gnemoug/distribute_crawler
CNKI crawler: https://github.com/yanzhou/CnkiSpider
Lianjia crawler: https://github.com/lanbing510/LianJiaSpider
JD crawler: https://github.com/taizilongxu/scrapy_jingdong
QQ Groups crawler: https://github.com/caspartse/QQ-Groups-Spider
WooYun crawler: https://github.com/hanc00l/wooyun_public

PART FIVE
5. Our Code
- Based on Scrapy
- Encapsulation
- Provides an API

Workflow

What to do next in our framework?
- JavaScript rendering
- Simulated user login
- Cookie handling
- Proxy servers
- Redis

PART SIX
6. Distributed Crawls

PART SEVEN
7. Avoiding getting banned
• Rotate your user agent from a pool of well-known browser user agents (search online for a list).
• Disable cookies (see COOKIES_ENABLED), as some sites may use cookies to spot bot behaviour.
• Use download delays (2 or higher); see the DOWNLOAD_DELAY setting.
• If possible, use Google cache to fetch pages instead of hitting the sites directly.
• Use a pool of rotating IPs, e.g. the free Tor project or paid services like ProxyMesh.
• Use a highly distributed downloader that circumvents bans internally, so you can just focus on parsing clean pages. One example of such a downloader is Crawlera.

PART EIGHT
8. Papers and Research
- Crawler technology
- Data mining

Q&A
Thanks for listening!