Installing and Developing with Python pyspider

# Created on 2017-07-28 13:44:53
# Project: pyspiderdemo
#
# mimvp.com

from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {
    }

    # Run on_start once every 24 hours.
    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('mimvp.com', callback=self.index_page)

    # Treat a fetched index page as valid for 10 days.
    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        # Follow every link whose href starts with "http".
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    # Detail pages get a higher scheduling priority.
    @config(priority=2)
    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }
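The two decorator arguments in the handler above are plain arithmetic: `every` is given in minutes and `age` in seconds. A minimal sketch, using only the standard library, that checks what the two expressions work out to:

```python
from datetime import timedelta

# Values copied from the handler above.
every_minutes = 24 * 60            # argument to @every(minutes=...)
age_seconds = 10 * 24 * 60 * 60    # argument to @config(age=...)

# Convert both to timedelta objects to make the units explicit.
every_interval = timedelta(minutes=every_minutes)
age_interval = timedelta(seconds=age_seconds)

print(every_interval)  # 1 day, 0:00:00
print(age_interval)    # 10 days, 0:00:00
```

In other words, on_start is scheduled once a day, and an index page is regarded as still valid for ten days after it was fetched.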

Run result:

2) Example 2: Crawling web pages through a proxy

PySpider supports crawling web pages through a proxy. There are two ways to configure one:

Method 1:

Pass the proxy on the command line with the --phantomjs-proxy option:

--phantomjs-proxy TEXT phantomjs proxy ip:port

For example, the start command:

pyspider --phantomjs-proxy "188.226.141.217:8080" all
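The proxy is given as an ip:port string, so a typo in the address only shows up once crawling fails. A minimal sketch of checking such a string before handing it to pyspider; `parse_proxy` is a hypothetical helper written for this illustration and is not part of pyspider:

```python
import ipaddress


def parse_proxy(proxy):
    """Split an 'ip:port' string and validate both parts.

    Hypothetical helper for illustration; not part of pyspider.
    """
    host, sep, port_text = proxy.rpartition(':')
    if not sep or not host:
        raise ValueError('expected ip:port, got %r' % proxy)
    ipaddress.ip_address(host)      # raises ValueError for a malformed IP
    port = int(port_text)           # raises ValueError for a non-numeric port
    if not 0 < port < 65536:
        raise ValueError('port out of range: %d' % port)
    return host, port


host, port = parse_proxy('188.226.141.217:8080')
print(host, port)
```

This only checks the format of the address; it does not test whether the proxy is reachable.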

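Method 2:

Set the proxy globally in crawl_config, as shown below:

crawl_config = { 'proxy' : '188.226.141.217:8080'}

PySpider grew out of the crawler backend of a vertical search engine we built earlier. We needed to collect data from 200 sites (not all at the same time, since some sites go down; about 100+ were running at any one time) and were required to get each site's updates into the database within 5 minutes. Flexible crawl control was therefore essential.

At the same time, with 100 sites, some may break or be redesigned on any given day, so we also need to be able to detect when a template stops working and to check the crawl status.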

Sample code:

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2017-07-28 14:13:14
# Project: mimvp_proxy_pyspider
#
# mimvp.com

from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    # A dict keeps only one value per key, so only one 'proxy' entry can be
    # active at a time; uncomment the line for the proxy you want to use.
    crawl_config = {
        'proxy' : 'http://188.226.141.217:8080',     # http
        # 'proxy' : 'https://182.253.32.65:3128'     # https
    }
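Because crawl_config is an ordinary Python dict, it can hold only one 'proxy' entry: when the same key appears twice in a dict literal, the later value silently replaces the earlier one. A short sketch in plain Python (no pyspider needed) that demonstrates this with the two addresses from the sample:

```python
# A dict literal that repeats a key keeps only the last value.
crawl_config = {
    'proxy': 'http://188.226.141.217:8080',
    'proxy': 'https://182.253.32.65:3128',
}

print(len(crawl_config))        # 1
print(crawl_config['proxy'])    # https://182.253.32.65:3128
```

To switch proxies, keep one entry active and comment out the other.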