# -*- coding: utf-8 -*-
import scrapy

from ..items import PyjobItem


class PyjobSpiderSpider(scrapy.Spider):
    name = 'pyjob_spider'
    allowed_domains = ['python.org']
    start_urls = ['http://python.org/jobs/']

    def parse(self, response):
        # Each job posting sits inside an <h2 class="listing-company"> element
        for res in response.xpath('//h2[@class="listing-company"]'):
            job = PyjobItem()
            # extract() returns a list of all matching text nodes
            job['title'] = res.xpath('.//span[@class="listing-company-name"]/a/text()').extract()
            job['location'] = res.xpath('.//span[@class="listing-location"]/a/text()').extract()
            yield job

        # Follow the "next" pagination link, if there is one, with the same callback
        next_page = response.xpath('//li[@class="next"]/a/@href').extract()
        if next_page:
            url = response.urljoin(next_page[0])
            yield scrapy.Request(url, callback=self.parse)

pyjob_spider.py

Results

The command below runs the crawl and saves the results in JSON format. Scrapy infers the output file type from the file extension.

scrapy crawl pyjob_spider -o test.json

Here is an excerpt of the results. The title and the work location of each job posting are saved.

[
{"title": ["Software Engineer - Python"], "location": ["San Francisco, CA, United States"]},
{"title": ["Back-End Developer"], "location": ["London, London, United Kingdom"]},
{"title": ["Director of Engineering"], "location": ["Mountain View, CA, United States"]},
{"title": ["Python / full-stack Developer"], "location": ["London, United Kingdom"]},
{"title": ["Software Product Engineer"], "location": ["Carmel, Indiana, United States"]},
{"title": ["Python Developer"], "location": ["New York, NY, USA"]},
{"title": ["SENIOR FULL-STACK WEB DEVELOPER - LIVE BLOG "], "location": ["BELGRADE, PRAGUE,BERLIN, Serbia, Czech republic, Germany"]},
{"title": ["Python Developer"], "location": ["Redruth, Cornwall, United Kingdom"]},
{"title": ["Senior Python Developer"], "location": ["Dublin, Ireland"]},
...
{"title": ["Python/Django Developer"], "location": ["London, Greater London, United Kingdom"]},
{"title": ["Entry-level Python/Django Developer"], "location": ["LONDON, Greater London, United Kingdom"]},
{"title": ["Senior Bioinformatician/Analyst"], "location": ["Cambridge, United Kingdom"]},
{"title": ["Senior Bioinformatician"], "location": ["Cambridge, United Kingdom"]},
{"title": ["Junior Backend Developer"], "location": ["Sheung Wan, HK Island, Hong Kong"]},
{"title": ["Python Developer"], "location": ["Des Moines, Iowa, United States"]},
{"title": ["Cloud Systems Programmer (Remote)"], "location": ["Atlanta, GA, USA"]},
{"title": ["Experienced Python Developer"], "location": ["Houthalen, Belgium"]},
{"title": ["Senior Python Backend Developer (Ethereum Blockchain/RaidEX)"], "location": ["Mainz, Berlin, Copenhagen, Germany, Denmark"]}
]

test.json
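Because `extract()` returns a list, every field in the output is a one-element list. As a minimal sketch of post-processing, the helper below flattens such records into plain strings. The function names `flatten_jobs` and `load_jobs` are my own for illustration and are not part of Scrapy.

```python
import json


def flatten_jobs(records):
    """Convert {"title": [...], "location": [...]} records into plain-string dicts."""
    flat = []
    for rec in records:
        flat.append({
            # Each value is a list of text nodes; join them and trim whitespace
            key: " ".join(part.strip() for part in values).strip()
            for key, values in rec.items()
        })
    return flat


def load_jobs(path):
    """Read a JSON file written by `scrapy crawl ... -o <path>` and flatten it."""
    with open(path, encoding="utf-8") as f:
        return flatten_jobs(json.load(f))
```

Calling `load_jobs("test.json")` after the crawl should give a list of dicts whose `title` and `location` values are strings rather than lists.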

Shell mode

Choosing XPath expressions is also covered as a tip in the session, but I worked out the selectors in shell mode while also referring to the site below.

qiita.com
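The same kind of selector check can be tried offline against a small snippet. The sketch below deliberately swaps in the standard library's `xml.etree.ElementTree`, which supports only a limited XPath subset (no `text()`, so the text is read from `.text`), instead of Scrapy's selectors. The `SAMPLE` markup is a hypothetical, simplified listing written for this example, not a copy of the real page. In the Scrapy shell, the equivalent expressions would be passed to `response.xpath(...)`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup for one job listing (an assumption, not the real page)
SAMPLE = """
<ol>
  <li>
    <h2 class="listing-company">
      <span class="listing-company-name"><a href="/jobs/1/">Python Developer</a></span>
      <span class="listing-location"><a href="/jobs/location/x/">Dublin, Ireland</a></span>
    </h2>
  </li>
</ol>
"""


def extract_jobs(markup):
    """Apply the spider's relative paths to each listing heading."""
    root = ET.fromstring(markup.strip())
    jobs = []
    for h2 in root.findall(".//h2[@class='listing-company']"):
        title = h2.find(".//span[@class='listing-company-name']/a")
        location = h2.find(".//span[@class='listing-location']/a")
        jobs.append({
            # find() returns None when nothing matches, so guard before reading .text
            "title": title.text if title is not None else None,
            "location": location.text if location is not None else None,
        })
    return jobs
```

A `None` value in the result is a quick signal that a path does not match the markup.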

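Separating the crawling and scraping stages

As a tip, the session touches on separating the crawling stage from the scraping stage.

As I understand it, the benefits of this separation are:

If scraping fails because the site structure changed, you do not have to redo the crawl
It is easier to tell whether a problem lies in the crawling or in the scraping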

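Summary and impressions

I struggled quite a bit to get the XPath expressions right. Put another way, most of the implementation work went into writing XPath, so once you are used to it, implementing a spider should be easy.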

AT's Blog

Hobby programming, guitar, music, and so on

Notes from watching "Pythonで作るWebクローラ入門" on YouTube
