Getting Started with Scrapy Crawlers in Python: Code Walkthrough

            for img in post.get('images', []):
                img_id = img['img_id']
                url = 'https://photo.tuchong.com/%s/f/%s.jpg' % (item['site_id'], img_id)
                item['images'][img_id] = url

            item['tags'] = []
            # Flatten tags into a list of tag_name strings
            for tag in post.get('tags', []):
                item['tags'].append(tag['tag_name'])

            items.append(item)

        return items

After these steps, the scraped data is stored in TuchongItem objects, as structured data that is easy to process and save.
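To make the structure concrete, the image and tag handling above can be sketched as a standalone function that runs without Scrapy. The function name `extract_images_and_tags` and the sample post are assumptions made for illustration; only the URL pattern and the `images`, `img_id`, `tags`, and `tag_name` keys come from the code above.

```python
def extract_images_and_tags(post, site_id):
    """Build the images mapping and the tags list for one post dict.

    Hypothetical helper: it mirrors the loop in parse(), but works on
    plain dicts so it can be tried outside a running spider.
    """
    images = {}
    for img in post.get('images', []):
        img_id = img['img_id']
        # Same URL pattern as in the spider code
        images[img_id] = 'https://photo.tuchong.com/%s/f/%s.jpg' % (site_id, img_id)
    # Reduce each tag object to its tag_name string
    tags = [tag['tag_name'] for tag in post.get('tags', [])]
    return images, tags


# Made-up sample post, for illustration only
sample_post = {
    'images': [{'img_id': 111}, {'img_id': 222}],
    'tags': [{'tag_name': 'portrait'}, {'tag_name': 'street'}],
}
images, tags = extract_images_and_tags(sample_post, site_id='12345')
print(images)
print(tags)
```

A post with no `images` or `tags` key simply yields an empty mapping and an empty list, which is why the defaults passed to `get` are empty lists.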

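As mentioned earlier, not every scraped entry is needed. In this example we only want galleries of type "multi_photo", and galleries with too few images are not wanted either. Filtering the scraped entries, and deciding how to save them, is handled in pipelines.py. That file already contains a TuchongPipeline class, created by default, which overrides the process_item function; modify that function so that it returns only the items that meet the conditions.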
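One way such a filter might look is sketched below. Several details are assumptions, not taken from the original: that the item carries a `type` field, the threshold `MIN_IMAGES = 3`, and the use of Scrapy's `DropItem` exception to discard unwanted items. The fallback class exists only so the sketch can be run without Scrapy installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Fallback so this sketch can be tried without Scrapy installed
    class DropItem(Exception):
        pass

MIN_IMAGES = 3  # hypothetical threshold for "too few images"


class TuchongPipeline(object):
    def process_item(self, item, spider):
        # Keep only multi-photo galleries (assumes the item has a 'type' field)
        if item.get('type') != 'multi_photo':
            raise DropItem('not a multi_photo gallery')
        # Discard galleries with too few images
        if len(item.get('images', {})) < MIN_IMAGES:
            raise DropItem('too few images')
        # Items that pass both checks continue down the pipeline
        return item
```

In a Scrapy project the pipeline also has to be enabled through the ITEM_PIPELINES setting in settings.py before process_item is called.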

import json

import scrapy

from ..items import TuchongItem


class PhotoSpider(scrapy.Spider):
    name = 'photo'
    # allowed_domains = ['tuchong.com']
    # start_urls = ['http://tuchong.com/']

    def start_requests(self):
        url = 'https://tuchong.com/rest/tags/%s/posts?page=%d&count=20&order=weekly'
        # Fetch 10 pages, 20 galleries per page
        # Use parse as the callback and return the Request objects
        for page in range(1, 11):
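The URL template takes a tag name and a page number. As a rough sketch of how the ten page URLs could be produced, the helper below fills in the template for each page. The function name `build_page_urls` and the tag `'landscape'` are placeholders chosen for illustration, not values from the original.

```python
URL_TEMPLATE = 'https://tuchong.com/rest/tags/%s/posts?page=%d&count=20&order=weekly'


def build_page_urls(tag, first_page=1, last_page=10):
    """Return the list of request URLs for the given tag, one per page."""
    return [URL_TEMPLATE % (tag, page)
            for page in range(first_page, last_page + 1)]


# 'landscape' is a placeholder tag used only for this example
urls = build_page_urls('landscape')
print(len(urls))
print(urls[0])

# Inside start_requests, each URL would typically be wrapped as
# scrapy.Request(url, callback=self.parse) and yielded.
```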
Article title: Getting Started with Scrapy Crawlers in Python: Code Walkthrough
URL: http://www.17bianji.com/lsqh/39298.html