2018-06-05

Crawl a paginated resource page by page and scrape one level down into each link on every page

Python

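Building on

http://mat5ukawa.hateblo.jp/entry/2018/06/05/002826

Visit each page in turn
On each visit, go one level down into the detail link of every record
Scrape the page reached

This should be useful for walking the records of pages produced by a query.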

import scrapy

class Spider(scrapy.Spider):
  start_urls = ['http://localhost:3000/seeds']
  name = 'spider'

  def parse(self, response):
    # scrape resource's id (named page_id to avoid shadowing the builtin id)
    page_id = response.xpath("//div[@id='page_num']/text()").extract_first()
    yield { 'id' : page_id }

    # scrape resource in each link
    for link in response.xpath("//table/tbody/tr/td[2]/a/@href").extract():
      yield scrapy.Request(response.urljoin(link), callback = self._parse_link)

    # transit to next resource
    next_page = response.css('a[rel="next"]').xpath("@href").extract_first()
    if next_page is not None:
      next_page = response.urljoin(next_page)
      yield scrapy.Request(next_page, callback = self.parse)

  def _parse_link(self, response):
    detail = response.xpath("//p[@id='name']/text()").extract_first()
    yield { 'detail' : detail }
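
The spider passes every href through response.urljoin before requesting it. As far as the Scrapy documentation describes it, that method is a wrapper around the standard library's urljoin with the response URL as the base, so the effect can be previewed without Scrapy. The hrefs below are hypothetical examples, not links taken from the real resource.

```python
from urllib.parse import urljoin

# URL of the page currently being parsed (hypothetical example).
base = "http://localhost:3000/seeds?page=2"

# A root-relative href replaces the whole path and query.
print(urljoin(base, "/seeds/5"))  # http://localhost:3000/seeds/5

# A plain relative href is resolved against the directory of the base path.
print(urljoin(base, "seeds/5"))   # http://localhost:3000/seeds/5

# An absolute href is returned unchanged.
print(urljoin(base, "http://example.com/x"))  # http://example.com/x
```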

$ scrapy runspider crawler/main.py -o outfile.json

outfile.json

[
{"id": "page number is 1"},
{"detail": "e8b5202869b74aa1ffd0eadff7c3664f"},
{"detail": "8de50ac9ae0edd7b6d62a84af338186c"},
{"detail": "6f5a8295e6ccb687ceb9c4789f6e63f7"},
{"detail": "cb4e15e1100ae6aed7b2cfea3ce1842b"},
{"id": "page number is 2"},
{"detail": "0e89ba31a9592c3058fe87f5bef28e33"},
{"detail": "eed30fcd7d827f33127375ec44e7748f"},
{"detail": "18b84f501753059fab3cd3402703d2d9"},
{"detail": "dda2c6b55c0aedb84834f8a16aa9d024"},
...
]

A page I consulted but did not actually use

https://stackoverflow.com/questions/30491498/getting-scrapy-to-follow-specific-links-on-a-page

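The crawl becomes two-dimensional (pages times links per page), so weigh the number of requests before using this mechanism to scrape deeper levels.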
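
One rough way to weigh that cost is to estimate the number of requests before crawling. The sketch below assumes that every page lists the same number of links and that each deeper level fans out by the same number again; estimate_requests and its parameters are hypothetical names, and the figures are chosen only for illustration.

```python
def estimate_requests(pages, links_per_page, depth=1):
    """Estimate how many HTTP requests a paginated crawl issues.

    pages          -- number of paginated list pages
    links_per_page -- links followed from each page (assumed constant)
    depth          -- how many levels below each list page are followed
    """
    total = pages    # one request per list page
    fanout = pages
    for _ in range(depth):
        fanout *= links_per_page  # requests issued at this level
        total += fanout
    return total


print(estimate_requests(pages=10, links_per_page=4))           # 50
print(estimate_requests(pages=10, links_per_page=4, depth=2))  # 210
```

Going one level deeper multiplies the last term by the number of links per page, so the total grows geometrically with depth.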

2018-06-05

scrapy: scrape a paginated resource one page after another

Python

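Suppose there is a resource that presents a list of records with pagination.

Each page carries its own message, page number is {page number}, so the goal is to collect that message from every page.

Assumed layout of the resource

Every page contains page number is {page number}
Pagination links are present, and clicking a number moves to that page
An attribute marking the "next page" relative to the current page (e.g. rel="next") exists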

[Figure: assumed layout of the paginated resource]
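
To make these assumptions concrete, here is a minimal stdlib-only sketch that pulls the per-page message and the rel="next" link out of a single page. The sample markup and the /seeds?page=N hrefs are hypothetical stand-ins for the real resource (only id="page_num" and rel="next" come from this post), and PageParser is an illustrative name. In the actual spider, Scrapy's selectors do this work.

```python
from html.parser import HTMLParser


class PageParser(HTMLParser):
    """Collect the text of div#page_num and the href of a[rel=next]."""

    def __init__(self):
        super().__init__()
        self.message = None
        self.next_href = None
        self._in_page_num = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("id") == "page_num":
            self._in_page_num = True
        elif tag == "a" and attrs.get("rel") == "next":
            self.next_href = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "div":
            self._in_page_num = False

    def handle_data(self, data):
        if self._in_page_num and self.message is None:
            self.message = data.strip()


# Hypothetical markup; the real page layout is only assumed here.
SAMPLE = """
<div id="page_num">page number is 1</div>
<nav>
  <a href="/seeds?page=1">1</a>
  <a rel="next" href="/seeds?page=2">2</a>
</nav>
"""

parser = PageParser()
parser.feed(SAMPLE)
print(parser.message)    # page number is 1
print(parser.next_href)  # /seeds?page=2
```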

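Crawl strategy (conceptual diagram)

The resource is traversed in one direction. Once the traversal is finished (that is, the last page has been reached), the crawl stops.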

[Figure: one-way traversal of the pages, stopping at the last page]
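
The one-way traversal can be sketched without any network access. Here PAGES is a hypothetical in-memory stand-in for the site, mapping each URL to its message and its relative "next" link (None on the last page), and crawl is an illustrative name.

```python
from urllib.parse import urljoin

# Hypothetical site: URL -> (message, relative "next" link or None).
PAGES = {
    "http://localhost:3000/seeds": ("page number is 1", "/seeds?page=2"),
    "http://localhost:3000/seeds?page=2": ("page number is 2", "/seeds?page=3"),
    "http://localhost:3000/seeds?page=3": ("page number is 3", None),
}


def crawl(start_url):
    """Follow "next" links in one direction until a page has none."""
    url = start_url
    while url is not None:
        message, next_link = PAGES[url]
        yield message
        # Resolve the relative link against the current URL, or stop.
        url = urljoin(url, next_link) if next_link is not None else None


print(list(crawl("http://localhost:3000/seeds")))
# ['page number is 1', 'page number is 2', 'page number is 3']
```

The loop ends for the same reason the spider does: the last page has no "next" link, so nothing further is scheduled.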

Crawl strategy (source)

main.py

import scrapy

class Spider(scrapy.Spider):
  start_urls = ['http://localhost:3000/seeds']  # URL at which the paginated resource can be reached
  name = 'spider'

  def parse(self, response):
    for content in response.xpath("//div[@id='page_num']/text()").extract():
      yield { 'content' : content }

    next_page = response.css('a[rel="next"]').xpath("@href").extract_first()
    if next_page is not None:
      next_page = response.urljoin(next_page)
      yield scrapy.Request(next_page, callback = self.parse)

$ scrapy runspider main.py -o outfile.json

outfile.json

[
{"content": "page number is 1"},
{"content": "page number is 2"},
{"content": "page number is 3"},
{"content": "page number is 4"},
{"content": "page number is 5"},
{"content": "page number is 6"},
{"content": "page number is 7"},
{"content": "page number is 8"},
{"content": "page number is 9"},
{"content": "page number is 10"}
]

References

https://doc.scrapy.org/en/latest/intro/tutorial.html?highlight=recursive#following-links

2018-06-04

Scrape a resource on a localhost server with scrapy and write the parse result to a json file

Python

I would rather not test against someone else's server

Environment

Mac OSX - 10.13.4
Python - 3.6.5
nginx - 1.12.2
scrapy - 1.5.0

localhost server

Edit the index.html directly under the workspace directory.

/path/to/nginx/workspace/index.html

<html>
  <head>
    <style type="text/css">
      #caption {
        color: red;
      }   
      .elem {
        color: blue;
      }   
    </style>
  </head>
  <body>
    <h1 id="caption">hello world</h1>
    <div>
      <ul>
        <li class="elem">abc</li>
        <li class="electric">def</li>
        <li class="elem"><a href="#">link to myself</a></li>
      </ul>
    </div>
  </body>
</html>

$ nginx

Crawler

Install scrapy in a virtual environment

$ mkdir testdir && cd testdir
$ python3 -m venv ./venv/environment
$ source ./venv/environment/bin/activate
$ pip3 install scrapy
$ touch main.py

main.py

The crawler itself

import scrapy

class Spider(scrapy.Spider):
  start_urls = ['http://localhost:8080/index.html']
  name = 'spider'

  def parse(self, response):
    for text in response.xpath('//h1/text()').extract():
      yield { 'h1-text' : text }

    for dom in response.css('li.electric').extract():
      yield { 'dom li-electric' : dom }

    for text in response.xpath('//li/a/text()').extract():
      yield { 'text of a in li' : text }

Crawl

$ scrapy runspider main.py -o outfile.json

Crawl result

outfile.json

[
{"h1-text": "hello world"},
{"dom li-electric": "<li class=\"electric\">def</li>"},
{"text of a in li": "link to myself"}
]
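
The three selectors in parse can be sanity-checked against the page without running a crawl. This sketch swaps in the standard library's xml.etree.ElementTree for Scrapy's selectors, which works here only because the sample page happens to be well-formed XML. The HTML is a trimmed copy of the index.html above with the style block omitted.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of index.html; parsed as XML, not by Scrapy.
HTML = """
<html>
  <body>
    <h1 id="caption">hello world</h1>
    <div>
      <ul>
        <li class="elem">abc</li>
        <li class="electric">def</li>
        <li class="elem"><a href="#">link to myself</a></li>
      </ul>
    </div>
  </body>
</html>
"""

root = ET.fromstring(HTML.strip())

h1_texts = [h1.text for h1 in root.iter("h1")]
electric = [li.text for li in root.findall(".//li[@class='electric']")]
link_texts = [a.text for a in root.findall(".//li/a")]

print(h1_texts)    # ['hello world']
print(electric)    # ['def']
print(link_texts)  # ['link to myself']
```

Unlike the spider, which yields the whole li element for li.electric, this check prints only its text.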

References

2018-06-03

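What is a Python virtual environment?

Python

A "virtual environment" is an environment in which you can run a Python that is isolated from administrator and system privileges.

Physically it is just a directory, conventionally named venv (or env).

Under it live the Python binary, pip, third-party packages, and so on.

(See the linked page for the strict definition.)

Quick example

Create a virtual environment that can run `numpy`, use it, and export (and install) its dependencies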

Environment

MacOSX - 10.13.4
Python - 3.6.5
- preferably installed with brew

Create an empty directory and move into it

$ mkdir mytest && cd mytest

Create the virtual environment

$ python3 -m venv ./venv/environment
$ source venv/environment/bin/activate
(environment)
$ which python3
/Users/matsukawa/Develop/python/mytest/venv/environment/bin/python3

The output of the third command shows that Python is running from the virtual environment.
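
The same check can be made from inside Python. A minimal sketch, assuming Python 3.3 or later (where sys.base_prefix exists); in_virtualenv is a hypothetical helper name.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a venv."""
    # In a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print(sys.executable)   # path of the running interpreter
print(in_virtualenv())
```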

Install `numpy` into the virtual environment

(environment)
$ pip3 install numpy

Create main.py directly under the mytest directory

./mytest/main.py

import numpy as np

print(np.array([1, 2, 3]))

Use the virtual environment

(environment)
$ python3 main.py

> Output
[1 2 3]

Export the packages the virtual environment depends on

Use this as the way to tell other environments which packages to install

(environment)
$ pip3 freeze > requirements.txt

./mytest/requirements.txt

numpy==1.14.3

To install from it

(environment)
$ pip3 install -r requirements.txt
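
The name==version lines written by pip3 freeze are easy to inspect from Python, for instance to compare two environments. A minimal sketch, assuming every relevant line is an exact == pin (pip itself accepts many more forms); parse_pins is a hypothetical helper, not part of pip.

```python
def parse_pins(text):
    """Map package name -> pinned version for 'name==version' lines."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, sep, version = line.partition("==")
        if sep:  # ignore anything that is not an exact pin
            pins[name.strip()] = version.strip()
    return pins


print(parse_pins("numpy==1.14.3\n"))  # {'numpy': '1.14.3'}
```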

Leave the virtual environment

(environment)
$ deactivate
$

Check that the (environment) prefix is gone from the shell prompt

Note

A venv should be created separately in each environment, so exclude it from source control.

References

2018-06-01

A minimal setup.py layout

Python

Running python setup.py sdist produces dist without warnings
There is not even a hello world script

"Skeleton" may be a more fitting word than "minimal"

Purpose

Get used to working with setup.py

Tested versions

Python 2.7.10
MacOSX 10.13.4

Directory layout

./
├── .gitignore
├── LICENSE.rst
├── MANIFEST.in
├── README.rst
├── lib
│   └── __init__.py
└── setup.py

`.gitignore`

Long, so see the linked page

`LICENSE.rst`

LICENSE
====================

MIT

`MANIFEST.in`

include *.rst

`README.rst`

setup_py
====================

Minimal setup.py to develop

`lib/__init__.py`

name = 'lib'

`setup.py`

from setuptools import setup, find_packages

setup(
  name         = 'minimal_setup',
  version      = '1.0',
  description  = 'minimal setup',
  author       = 'ymatsukawa',
  author_email = 'ymatsukawa27@example.com',
  url          = 'https://github.com/ymatsukawa/',
  license      = 'MIT',
  packages     = find_packages(where = '.'),
)

What to verify

Running python setup.py sdist creates dist/minimal_setup-1.0.tar.gz in the directory where the command was run
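
lib/__init__.py matters because find_packages only picks up directories that contain an __init__.py. The sketch below is a simplified stdlib stand-in for that discovery rule, not setuptools' actual implementation; discover_packages is a hypothetical name, and the temporary directory merely recreates part of the layout above plus a docs directory added for contrast.

```python
import os
import tempfile


def discover_packages(where="."):
    """List dotted names of directories under `where` holding __init__.py."""
    found = []
    for dirpath, dirnames, filenames in os.walk(where):
        if dirpath == where:
            continue  # the project root itself is not a package
        if "__init__.py" not in filenames:
            dirnames[:] = []  # do not descend below a non-package
            continue
        rel = os.path.relpath(dirpath, where)
        found.append(rel.replace(os.sep, "."))
    return sorted(found)


with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "lib"))
    open(os.path.join(root, "lib", "__init__.py"), "w").close()
    os.mkdir(os.path.join(root, "docs"))  # no __init__.py: not a package
    print(discover_packages(root))  # ['lib']
```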
