Scrapy: scraping a paginated resource one page after another

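Suppose there is a resource that presents a list of records split across paginated pages.

Each page carries its own message, page number is {page number}, and the goal is to collect that message from every page.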

Assumed layout of the resource

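Each page contains the message page number is {page number}
Pagination links are present, and clicking a number navigates to that page
The current page has a link carrying an attribute that marks the "next page" (e.g. rel="next")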

f:id:mat5ukawa:20180605002555p:plain
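To try the spider without a real site, a stand-in for the resource can be served locally. The following is only a minimal sketch using the standard library: the `?page=` query parameter, the total of 10 pages, and the names `render_page`, `SeedsHandler`, and `serve` are all assumptions made for illustration, not details of the original resource. Only the `page_num` div, the `/seeds` path on port 3000, and the rel="next" link come from the description above.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

LAST_PAGE = 10  # assumption: chosen to match a ten-page resource


def render_page(page, last_page=LAST_PAGE):
    """Return HTML for one page: message div, numbered links, optional rel="next" link."""
    numbers = ''.join(
        f'<a href="/seeds?page={n}">{n}</a>' for n in range(1, last_page + 1)
    )
    # The last page deliberately has no rel="next" link
    next_link = (
        f'<a rel="next" href="/seeds?page={page + 1}">Next</a>'
        if page < last_page else ''
    )
    return (
        '<html><body>'
        f'<div id="page_num">page number is {page}</div>'
        f'<nav>{numbers}{next_link}</nav>'
        '</body></html>'
    )


class SeedsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        url = urlparse(self.path)
        if url.path != '/seeds':
            self.send_error(404)
            return
        # A missing ?page= parameter is treated as page 1
        page = int(parse_qs(url.query).get('page', ['1'])[0])
        body = render_page(page).encode('utf-8')
        self.send_response(200)
        self.send_header('Content-Type', 'text/html; charset=utf-8')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=3000):
    """Serve the hypothetical paginated resource until interrupted."""
    HTTPServer(('localhost', port), SeedsHandler).serve_forever()
```

Calling `serve()` in a separate terminal should make http://localhost:3000/seeds respond with page 1, with each page linking forward until page 10.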

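Crawl strategy (conceptual diagram)

The resource is traversed in one direction only. Once the traversal is finished (that is, once the last page has been reached), the crawl stops.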

f:id:mat5ukawa:20180605002617p:plain
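The one-directional traversal can be sketched without Scrapy at all. In this illustration the site is modelled as a plain dictionary mapping each URL to its message and its "next" URL; the `crawl` function, the `PAGES` data, and the `?page=` URLs are hypothetical. The `visited` set is an extra safeguard against a "next" link that loops back, which the original description does not mention.

```python
def crawl(pages, start_url):
    """Follow "next" links from start_url until a page has none."""
    messages = []
    visited = set()
    url = start_url
    while url is not None and url not in visited:
        visited.add(url)
        message, next_url = pages[url]
        messages.append(message)
        url = next_url  # None on the last page, which ends the loop
    return messages


# Hypothetical three-page site: url -> (message, next url or None)
PAGES = {
    '/seeds?page=1': ('page number is 1', '/seeds?page=2'),
    '/seeds?page=2': ('page number is 2', '/seeds?page=3'),
    '/seeds?page=3': ('page number is 3', None),
}

print(crawl(PAGES, '/seeds?page=1'))
# ['page number is 1', 'page number is 2', 'page number is 3']
```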

Crawl strategy (source)

main.py

import scrapy


class Spider(scrapy.Spider):
    name = 'spider'
    # URL that gives access to the paginated resource
    start_urls = ['http://localhost:3000/seeds']

    def parse(self, response):
        # Emit one item for each page-number message on the current page
        for content in response.xpath("//div[@id='page_num']/text()").getall():
            yield {'content': content}

        # Look for the rel="next" link; the last page has none, so the crawl ends there
        next_page = response.css('a[rel="next"]').xpath('@href').get()
        if next_page is not None:
            # Resolve a possibly relative href against the current page URL
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
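The href taken from the rel="next" link may be relative, which is why the spider passes it through `response.urljoin` before requesting it. As a rough illustration of that resolution step, the sketch below uses the standard library's `urllib.parse.urljoin` as a stand-in; the helper name `resolve_next` and the `?page=` URLs are assumptions made for the example.

```python
from urllib.parse import urljoin


def resolve_next(current_url, href):
    """Turn the href of a "next" link into an absolute URL."""
    return urljoin(current_url, href)


current = 'http://localhost:3000/seeds?page=2'

# Root-relative, query-only, and already absolute hrefs all resolve to the same URL
print(resolve_next(current, '/seeds?page=3'))
print(resolve_next(current, '?page=3'))
print(resolve_next(current, 'http://localhost:3000/seeds?page=3'))
# each line prints http://localhost:3000/seeds?page=3
```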

$ scrapy runspider main.py -o outfile.json

outfile.json

[
{"content": "page number is 1"},
{"content": "page number is 2"},
{"content": "page number is 3"},
{"content": "page number is 4"},
{"content": "page number is 5"},
{"content": "page number is 6"},
{"content": "page number is 7"},
{"content": "page number is 8"},
{"content": "page number is 9"},
{"content": "page number is 10"}
]

References

https://doc.scrapy.org/en/latest/intro/tutorial.html?highlight=recursive#following-links