Scraping – Summary of Ways to Speed Up Beautiful Soup

Overview
Conclusion
Preparing test HTML files
Parsing speed: text vs. bytes
Parsing speed by HTML parser
Using parallel processing
Removing newlines and tabs beforehand

Overview

This article covers ways to speed up BeautifulSoup when parsing a large number of HTML files.

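Conclusion

To state the conclusion up front, the following two changes sped up parsing.

Switch the HTML parser from Python's built-in html.parser to lxml. (lxml must be installed with pip install lxml.)
```
BeautifulSoup(html, "lxml")
```
Run the parsing in parallel with futures.ProcessPoolExecutor. The gain is large when the CPU has many logical cores.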

Preparing test HTML files

For testing, I prepared 100 HTML files of roughly 100 KB each.

In [1]:

from pathlib import Path


from bs4 import BeautifulSoup

input_dir = Path("html")
html_paths = sorted(input_dir.glob("*.html"))
paths = html_paths[:100]
print(f"number of files: {len(paths)}")

number of files: 100
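
The article does not say where the test files came from. To reproduce the measurements without real pages, the following minimal sketch writes synthetic HTML files of roughly 100 KB each. The function name make_test_files and the page layout are assumptions made for illustration, and synthetic pages will not necessarily time the same as real-world HTML.

```python
from pathlib import Path


def make_test_files(output_dir, count=100, target_bytes=100_000):
    """Write `count` synthetic HTML files of roughly `target_bytes` each."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    head = "<!DOCTYPE html>\n<html>\n<head><title>page {i}</title></head>\n<body>\n"
    row = '\t<div class="item"><a href="/item/{n}">item {n}</a><p>some text</p></div>\n'
    tail = "</body>\n</html>\n"

    paths = []
    for i in range(count):
        parts = [head.format(i=i)]
        size = len(parts[0]) + len(tail)
        n = 0
        # Append rows until the page reaches the target size. The content is
        # ASCII only, so the character count equals the UTF-8 byte count.
        while size < target_bytes:
            line = row.format(n=n)
            parts.append(line)
            size += len(line)
            n += 1
        parts.append(tail)
        path = output_dir / f"{i:03d}.html"
        path.write_text("".join(parts), encoding="utf-8")
        paths.append(path)
    return paths
```

Calling make_test_files("html") would create files that the glob in the cell above picks up.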

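Parsing speed: text vs. bytes

I checked whether it matters if the data passed to BeautifulSoup is text (str) or bytes. There was no significant difference. Encodings such as Shift_JIS were not tested, so the result may differ in those cases.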

In [2]:

%%timeit

# Read as text (utf-8)
for path in paths:
    with open(path) as f:
        html = f.read()
    soup = BeautifulSoup(html)

3.37 s ± 12.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [3]:

%%timeit

# Read as bytes (utf-8)
for path in paths:
    with open(path, "rb") as f:
        html = f.read()
    soup = BeautifulSoup(html)

3.35 s ± 12.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
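
The Shift_JIS case was not measured above. As a starting point for such a test, the sketch below passes Shift_JIS bytes to Beautiful Soup and names the encoding through the from_encoding argument, so the library does not have to guess it. The sample markup is made up for illustration, and html.parser is used only so that the snippet runs without extra packages.

```python
from bs4 import BeautifulSoup

# Made-up sample page, encoded as Shift_JIS bytes.
html_text = "<html><head><title>テスト</title></head><body><p>本文</p></body></html>"
html_bytes = html_text.encode("shift_jis")

# from_encoding tells Beautiful Soup how to decode the bytes.
soup = BeautifulSoup(html_bytes, "html.parser", from_encoding="shift_jis")
title = soup.title.string
print(title)
```

Timing this cell against one that decodes the bytes first with html_bytes.decode("shift_jis") would show whether the encoding changes the text-versus-bytes comparison.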

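Parsing speed by HTML parser

Beautiful Soup can use html.parser, which ships with Python's standard library, as well as lxml and html5lib. I checked whether these differ in speed. Note that lxml and html5lib are external libraries and must be installed with pip beforehand. If no parser is named, Beautiful Soup picks the best one that is installed, so naming it explicitly keeps the comparison reproducible. From slowest to fastest, the result was html5lib < html.parser (built in) < lxml. If parsing speed matters, lxml is worth considering.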

!pip install lxml html5lib

In [4]:

%%timeit

# Use "lxml" as the parser
for path in paths:
    with open(path) as f:
        html = f.read()
    soup = BeautifulSoup(html, "lxml")

3.39 s ± 38.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]:

%%timeit

# Use "html.parser" as the parser
for path in paths:
    with open(path) as f:
        html = f.read()
    soup = BeautifulSoup(html, "html.parser")

9.74 s ± 1.96 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]:

%%timeit

# Use "html5lib" as the parser
for path in paths:
    with open(path) as f:
        html = f.read()
    soup = BeautifulSoup(html, "html5lib")

11.3 s ± 9.21 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
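
Outside a notebook, the same comparison can be run with the timeit module. The sketch below is one possible helper, not the code used for the figures above: time_parsers is a hypothetical name, the sample markup is made up, and parsers that are not installed are skipped by catching bs4's FeatureNotFound.

```python
import timeit

from bs4 import BeautifulSoup, FeatureNotFound


def time_parsers(html, parsers=("lxml", "html.parser", "html5lib"), repeat=3):
    """Return {parser name: best time in seconds} for the installed parsers."""
    results = {}
    for name in parsers:
        try:
            # Raises FeatureNotFound when the parser is not available.
            BeautifulSoup("<p></p>", name)
        except FeatureNotFound:
            continue
        timings = timeit.repeat(
            lambda: BeautifulSoup(html, name), number=1, repeat=repeat
        )
        results[name] = min(timings)
    return results


# Made-up sample document; real pages should be used for meaningful numbers.
sample = "<html><body>" + "<div><a href='#'>x</a></div>" * 1000 + "</body></html>"
for name, seconds in sorted(time_parsers(sample).items(), key=lambda kv: kv[1]):
    print(f"{name}: {seconds * 1000:.1f} ms")
```

Taking the minimum of several runs reduces the effect of background noise on the measurement.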

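Using parallel processing

Running the parsing in parallel with ProcessPoolExecutor from concurrent.futures can shorten the execution time. With ThreadPoolExecutor, on the other hand, it actually became slower, probably because of the GIL.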

In [7]:

from concurrent import futures


def parse_html(path):
    with open(path) as f:
        html = f.read()
    soup = BeautifulSoup(html, "lxml")

In [8]:

%%timeit

# Parallelize with ThreadPoolExecutor
with futures.ThreadPoolExecutor() as executor:
    rets = list(executor.map(parse_html, paths))

8.03 s ± 9.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [9]:

%%timeit

# Parallelize with ProcessPoolExecutor
with futures.ProcessPoolExecutor() as executor:
    rets = list(executor.map(parse_html, paths))

873 ms ± 4.86 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
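
parse_html above discards the soup, so the timings measure parsing only. In real use the worker usually has to hand something back, and with ProcessPoolExecutor the return value is pickled and sent to the parent process. The sketch below is one way to do that, assuming only a small piece of data is needed per page. extract_title and extract_titles are hypothetical names, and the executor class is a parameter so the same function can be tried with threads or processes.

```python
from concurrent import futures
from functools import partial

from bs4 import BeautifulSoup


def extract_title(path, parser="lxml"):
    """Parse one file and return only its title text (None if missing)."""
    with open(path, encoding="utf-8") as f:
        soup = BeautifulSoup(f.read(), parser)
    return soup.title.get_text() if soup.title else None


def extract_titles(paths, executor_cls=futures.ProcessPoolExecutor, parser="lxml"):
    """Run extract_title over paths with the given executor class."""
    # partial keeps the worker picklable, which process pools require.
    worker = partial(extract_title, parser=parser)
    with executor_cls() as executor:
        return list(executor.map(worker, paths))
```

When this is run as a script rather than in a notebook, the call to extract_titles should sit under if __name__ == "__main__":, because some platforms start worker processes by re-importing the main module.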

Removing newlines and tabs beforehand

Removing the newlines and tabs contained in the HTML before parsing can speed up the parse.

In [10]:

%%timeit

for path in paths:
    with open(path) as f:
        html = f.read()
    soup = BeautifulSoup(html, "lxml")

5.11 s ± 7.99 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [ ]:

%%timeit

for path in paths:
    with open(path) as f:
        html = f.read()
    # Remove newlines and tabs.
    html = html.replace("\n", "").replace("\t", "")
    soup = BeautifulSoup(html, "lxml")
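
One caveat: deleting every newline and tab also changes text content, for example by joining words that were separated only by a line break and by flattening pre blocks. The sketch below shows the effect and, as an alternative that is not from the article, removes only whitespace that sits between tags. That variant is an assumption for illustration, and its speed was not measured here.

```python
import re

from bs4 import BeautifulSoup

# Made-up sample markup.
html = "<div>\n\t<p>Hello\nworld</p>\n\t<pre>a\n\tb</pre>\n</div>"

# Approach from the article: drop every newline and tab.
naive = html.replace("\n", "").replace("\t", "")

# Hypothetical alternative: drop only whitespace runs between tags.
between_tags = re.sub(r">\s+<", "><", html)

naive_soup = BeautifulSoup(naive, "html.parser")
safe_soup = BeautifulSoup(between_tags, "html.parser")

naive_text = naive_soup.p.get_text()
safe_text = safe_soup.p.get_text()
print(repr(naive_text))  # 'Helloworld'
print(repr(safe_text))   # 'Hello\nworld'
```

The between-tags variant can still drop a meaningful space between inline elements, so either form should be checked against the text you intend to extract.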
