Scapy入门介绍-shenhailuanma-ChinaUnix博客

转自：

Scrapy 是神马

简单的来讲，这货是个爬虫框架，就是简单设置一下就能很强大的那种。

Scrapy 安装篇

sudo pip install scrapy

Scrapy 介绍

scrapy data flow

这张图完美的诠释了Scrapy的工作流程，首先下载数据，middlewares控制下载时候的参数。然后将Response进行parse，形成item，再将item交给pipeline处理，与此同时讲选中的链接放入Scheduler生成Request，等待Downloader处理。

Scrapy 主要元素介绍

HTMLXpathSelector

说道Scrapy第一个肯定要说的就是它了吧。用XPath的形式抽取网页中的信息，当然在Scrapy中你也可以选择Scrapy包以外的BeautifulSoap做同样的事情。

pages = hxs.select("//div[@class='post_thumbnail']/a")

item

item就是存储趴下来的数据的最合适的容器，为什么这么说呢，因为它可以方便的保存为JSON，XML等格式，还可以方便Pipeline进行后续的处理。

你首先需要声明一个继承自scrapy.item的class，填入你需要的字段

from scrapy.item import Item, Field

class FotomenItem(Item):
    # define the fields for your item here like:
    # name = Field()
    title = Field()
    url = Field()
    image_urls = Field()

Pipeline

普通的pipeline没什么意思，Scrapy提供了很多有趣的Pipeline，比如处理图片的ImagePipeline。它可以帮你方便的存储爬去过程中遇到的图片文件，可以对图片进行格式转换，生成缩略图等操作，操作起来十分方便。

将需要下载的图片链接放在item的images_url字段中，然后在加载ImagePipeline的时候就会自动帮你下载到指定的目录，并对文件名进行SHA1处理，防止重复。 http://doc.scrapy.org/en/latest/topics/images.html

想启用ImagePipeline的话，首先在settings.py中设置

ITEM_PIPELINES = ['scrapy.contrib.pipeline.images.ImagesPipeline']
IMAGES_STORE = '/Users/windwild/Code/project/fotomen/images/'

ITEM_PIPELINES 字段是设置加载哪个的Pipeline的。
IMAGES_STORE 字段是设置图片存放的路径的。

想要启用图片Pipeline这两个东西缺一不可。

PS. 我在实践中遇到了一个错误，就是总会出现图片错误的提示。究其原因是没有安装PIL这个包，这个包的全称是Python Imaging Library。不安装不行啊！

Pipeline 扩展

因为我想写个scrapy爬取上的图片，并按照不同的Post进行分类存储。所以ImagePipeline中以图片链接的SHA1值存储图片并不适合。我们需要对ImagePipeline进行一定的改造。

首先新建一个文件，新建一个类 MyImagesPipeline 继承自 ImagesPipeline 然后覆盖你想改造的函数。最后在 settings.py 中设置你加载你自己定义的Pipeline就大功告成。

from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.http import Request

class MyImagesPipeline(ImagesPipeline):
    title = "default"

    def image_key(self, url):
        year,month = url.split('/')[-3],url.split('/')[-2]
        image_guid = hashlib.sha1(url).hexdigest()
        img_path = "%s/%s/%s" % (year,month,self.title)
        return '%s/%s.jpg' % (img_path, image_guid)

    def get_media_requests(self, item, info):
        self.title = item['url'].split('/')[-2]
        for image_url in item['image_urls']:
            yield Request(image_url)

小结

后边的东西待我实践中慢慢总结放上来。

这次实现的Project可以在找到。