import scrapy

class Spider(scrapy.Spider):
  start_urls = ['http://localhost:3000/seeds']
  name = 'spider'

  def parse(self, response):
    # scrape the message that identifies the current page
    id = response.xpath("//div[@id='page_num']/text()").extract_first()
    yield { 'id' : id }

    # follow each link in the table and scrape the linked resource
    for link in response.xpath("//table/tbody/tr/td[2]/a/@href").extract():
      yield scrapy.Request(response.urljoin(link), callback = self._parse_link)

    # move on to the next page, if there is one
    next_page = response.css('a[rel="next"]').xpath("@href").extract_first()
    if next_page is not None:
      next_page = response.urljoin(next_page)
      yield scrapy.Request(next_page, callback = self.parse)

  def _parse_link(self, response):
    detail = response.xpath("//p[@id='name']/text()").extract_first()
    yield { 'detail' : detail }

$ scrapy runspider crawler/main.py -o outfile.json

outfile.json

[
{"id": "page number is 1"},
{"detail": "e8b5202869b74aa1ffd0eadff7c3664f"},
{"detail": "8de50ac9ae0edd7b6d62a84af338186c"},
{"detail": "6f5a8295e6ccb687ceb9c4789f6e63f7"},
{"detail": "cb4e15e1100ae6aed7b2cfea3ce1842b"},
{"id": "page number is 2"},
{"detail": "0e89ba31a9592c3058fe87f5bef28e33"},
{"detail": "eed30fcd7d827f33127375ec44e7748f"},
{"detail": "18b84f501753059fab3cd3402703d2d9"},
{"detail": "dda2c6b55c0aedb84834f8a16aa9d024"},
...
]

A page I consulted but did not use directly

https://stackoverflow.com/questions/30491498/getting-scrapy-to-follow-specific-links-on-a-page

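The crawl becomes two-dimensional (pages, and the links on each page), so weigh the computational cost before using this mechanism to scrape deeper levels.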
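To get a feel for that cost, here is a rough back-of-the-envelope sketch. It assumes every page lists the same number of links and every followed link again exposes that many links; `estimate_requests` and its parameters are hypothetical names for illustration, and the figures (10 pages, 4 links per page) are only sample inputs.

```python
def estimate_requests(pages: int, links_per_level: int, depth: int) -> int:
    """Estimate the total number of requests for a uniform fan-out crawl.

    Assumption: each paginated page costs one request, and every resource
    reached at one level exposes `links_per_level` links to the next level,
    followed down to `depth` levels below the paginated page.
    """
    # Requests caused by a single paginated page: 1 + f + f^2 + ... + f^depth
    per_page = sum(links_per_level ** level for level in range(depth + 1))
    return pages * per_page


if __name__ == "__main__":
    for depth in (0, 1, 2, 3):
        total = estimate_requests(pages=10, links_per_level=4, depth=depth)
        print(f"depth={depth}: {total} requests")
```

Under these assumptions the request count grows geometrically with depth, which is why each extra level deserves a second thought.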

2018-06-05

scrapy: scraping a paginated resource page by page, in order

Python

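Suppose there is a resource that presents a list of records split across paginated pages.

Each page contains its own message, page number is {page number}, and the goal is to retrieve that message from every page.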

Assumed layout of the resource

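Each page contains the message page number is {page number}
Pagination links are present, and clicking a number navigates to that page
Relative to the current page, an attribute marking the "next page" (e.g. rel="next") exists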
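As a concrete illustration of these assumptions, the sketch below parses a made-up page with the standard library only, so it runs without Scrapy. The markup in `SAMPLE_PAGE`, the `/seeds?page=N` URLs, and the names `PageProbe` and `probe` are all hypothetical; the real resource may be structured differently.

```python
from html.parser import HTMLParser

# Hypothetical markup that satisfies the three assumptions listed above.
SAMPLE_PAGE = """
<html><body>
  <div id="page_num">page number is 1</div>
  <nav>
    <a href="/seeds?page=1">1</a>
    <a rel="next" href="/seeds?page=2">2</a>
    <a href="/seeds?page=3">3</a>
  </nav>
</body></html>
"""


class PageProbe(HTMLParser):
    """Collects the page message and the href of the rel="next" link."""

    def __init__(self):
        super().__init__()
        self.message = None
        self.next_href = None
        self._in_page_num = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("id") == "page_num":
            self._in_page_num = True
        elif tag == "a" and attrs.get("rel") == "next" and self.next_href is None:
            self.next_href = attrs.get("href")

    def handle_data(self, data):
        # Keep the first non-empty text found inside <div id="page_num">.
        text = data.strip()
        if self._in_page_num and self.message is None and text:
            self.message = text

    def handle_endtag(self, tag):
        if tag == "div":
            self._in_page_num = False


def probe(html):
    """Return (page message, next-page href or None) for one HTML page."""
    parser = PageProbe()
    parser.feed(html)
    return parser.message, parser.next_href


if __name__ == "__main__":
    print(probe(SAMPLE_PAGE))
```

On the last page there would be no rel="next" link, so the second element comes back as None; that is the condition the spider uses to stop.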

[Figure: assumed layout of the paginated resource]

Traversal method (conceptual diagram)

Traverse the resource in one direction only. Once the traversal is finished (that is, once the last page is reached), stop crawling.

[Figure: one-way traversal across the pages]
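The one-way traversal can be sketched without any network access. The snippet below is a simplified stand-in, not the spider itself: `PAGES` is a hypothetical in-memory site mapping each URL to its message and its next-page URL, and `crawl` is an illustrative name.

```python
# Hypothetical in-memory site: URL -> (page message, next-page URL or None).
PAGES = {
    "/seeds?page=1": ("page number is 1", "/seeds?page=2"),
    "/seeds?page=2": ("page number is 2", "/seeds?page=3"),
    "/seeds?page=3": ("page number is 3", None),
}


def crawl(start_url, pages):
    """Yield one item per page, following next-page links in one direction."""
    url = start_url
    seen = set()  # guards against a next link that loops back
    while url is not None and url not in seen:
        seen.add(url)
        message, next_url = pages[url]
        yield {"content": message}
        url = next_url  # None on the last page, which ends the loop


if __name__ == "__main__":
    for item in crawl("/seeds?page=1", PAGES):
        print(item)
```

The loop ends for the same reason the spider does: the last page offers no next link to follow.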

Traversal method (source)

main.py

import scrapy

class Spider(scrapy.Spider):
  start_urls = ['http://localhost:3000/seeds']  # URL at which the paginated resource can be reached
  name = 'spider'

  def parse(self, response):
    for content in response.xpath("//div[@id='page_num']/text()").extract():
      yield { 'content' : content }

    next_page = response.css('a[rel="next"]').xpath("@href").extract_first()
    if next_page is not None:
      next_page = response.urljoin(next_page)
      yield scrapy.Request(next_page, callback = self.parse)

$ scrapy runspider main.py -o outfile.json

outfile.json

[
{"content": "page number is 1"},
{"content": "page number is 2"},
{"content": "page number is 3"},
{"content": "page number is 4"},
{"content": "page number is 5"},
{"content": "page number is 6"},
{"content": "page number is 7"},
{"content": "page number is 8"},
{"content": "page number is 9"},
{"content": "page number is 10"}
]

References

https://doc.scrapy.org/en/latest/intro/tutorial.html?highlight=recursive#following-links
