PyTorch – How GoogLeNet Works and How to Implement It

Overview

This article explains GoogLeNet (Inception v1), a deep learning model for image recognition, and presents an example implementation in PyTorch.

GoogLeNet (Inception v1)

GoogLeNet (Inception v1) is the CNN model that won the image recognition competition ILSVRC 2014. The explanation here follows the paper Going deeper with convolutions. Incidentally, the name GoogLeNet is an homage to LeNet, a well-known earlier model.

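Model structure

Whereas VGG gets deeper by stacking convolutional layers, GoogLeNet introduces the Inception Module, which widens the network horizontally as well as deepening it vertically.

Near the input, as in earlier models, convolutional and pooling layers alternate to shrink the spatial size of the feature maps.
The intermediate layers are a stack of Inception Modules.
The ReLU activation is applied right after every convolutional layer and every fully connected layer except the output layer.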

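Inception Module

The Inception Module has the structure shown in the figure below. Conv #3×3 reduce and Conv #5×5 reduce are 1×1 convolutional layers that reduce the number of channels before the input reaches the 3×3 and 5×5 convolutional layers, respectively. The outputs of the four branches are concatenated along the channel dimension into a single output.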

Inception Module

Parameters of the convolutional and pooling layers in the Inception Module

Name	branch	kernel_size	stride	padding
conv 1×1	branch1	1	1	0
conv 3×3 reduce	branch2	1	1	0
conv 3×3	branch2	3	1	1
conv 5×5 reduce	branch3	1	1	0
conv 5×5	branch3	5	1	2
Max Pooling	branch4	3	1	1
pool proj	branch4	1	1	0
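
The branches can only be concatenated along the channel dimension if they all keep the spatial size of the input. The sketch below checks this with the standard output-size formula, using the kernel sizes and paddings from the table and stride 1 for every layer, including the pooling layer (as in `nn.MaxPool2d(kernel_size=3, stride=1, padding=1)` in the implementation further down). The helper name `conv_out` and the 28×28 input are chosen only for illustration.

```python
def conv_out(size, kernel_size, stride, padding):
    """Output size along one spatial axis of a convolution or pooling layer (floor mode)."""
    return (size + 2 * padding - kernel_size) // stride + 1


# (name, kernel_size, stride, padding) of the layers that change the receptive field.
# The 1x1 "reduce" and "pool proj" layers behave like "conv 1x1".
layers = [
    ("conv 1x1", 1, 1, 0),
    ("conv 3x3", 3, 1, 1),
    ("conv 5x5", 5, 1, 2),
    ("max pooling", 3, 1, 1),
]

# Feed a 28x28 feature map (the size seen by Inception 3a) through each layer.
sizes = {name: conv_out(28, k, s, p) for name, k, s, p in layers}
print(sizes)
```

With a stride of 2 in the pooling branch, the same formula would give 14 instead of 28, and the concatenation would fail.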

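The number of output channels of the six convolutional layers differs from one Inception Module to another, as listed below.

Number of output channels of each convolutional layer in the Inception Modules

Name	#1×1	#3×3 reduce	#3×3	#5×5 reduce	#5×5	pool proj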
Inception 3a	64	96	128	16	32	32
Inception 3b	128	128	192	32	96	64
Inception 4a	192	96	208	16	48	64
Inception 4b	160	112	224	24	64	64
Inception 4c	128	128	256	24	64	64
Inception 4d	112	144	288	32	64	64
Inception 4e	256	160	320	32	128	128
Inception 5a	256	160	320	32	128	128
Inception 5b	384	192	384	48	128	128
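
Because the four branch outputs are concatenated along the channel dimension, the output channel count of a module is the sum of #1×1, #3×3, #5×5 and pool proj (the two "reduce" layers are internal to their branches). The following sketch recomputes those sums from the table; the dictionary and function names are illustrative only.

```python
# (#1x1, #3x3 reduce, #3x3, #5x5 reduce, #5x5, pool proj) for each module
inception_cfg = {
    "3a": (64, 96, 128, 16, 32, 32),
    "3b": (128, 128, 192, 32, 96, 64),
    "4a": (192, 96, 208, 16, 48, 64),
    "4b": (160, 112, 224, 24, 64, 64),
    "4c": (128, 128, 256, 24, 64, 64),
    "4d": (112, 144, 288, 32, 64, 64),
    "4e": (256, 160, 320, 32, 128, 128),
    "5a": (256, 160, 320, 32, 128, 128),
    "5b": (384, 192, 384, 48, 128, 128),
}


def out_channels(cfg):
    """Channels after concatenating the four branches of one module."""
    ch1x1, _ch3x3red, ch3x3, _ch5x5red, ch5x5, pool_proj = cfg
    return ch1x1 + ch3x3 + ch5x5 + pool_proj


channels = {name: out_channels(cfg) for name, cfg in inception_cfg.items()}
for name, ch in channels.items():
    print(f"Inception {name}: {ch}")
```

Each sum is also the number of input channels of the module that follows, which is where values such as 256, 480 and 832 in the implementation come from.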

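Auxiliary Classifier

During training, two auxiliary classifiers branch off from intermediate layers in addition to the final output layer, and the loss is computed as a weighted sum of the three losses.

loss = 0.3 * (loss of aux1) + 0.3 * (loss of aux2) + (loss of the final output layer)

This counteracts the difficulty of training that comes with a deeper network. The auxiliary classifiers are disabled at inference time.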
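
As a minimal sketch of the weighted sum above: the function below combines three already-computed loss values. The function name and the sample numbers are made up for illustration. In a real training loop the three arguments would be the per-batch losses of the final output and of the two auxiliary outputs, for example scalar tensors returned by a cross-entropy criterion.

```python
def combined_loss(main_loss, aux1_loss, aux2_loss, aux_weight=0.3):
    """Final-layer loss plus the two auxiliary losses, each scaled by aux_weight.

    Works on plain floats and on any type that supports + and *,
    such as scalar loss tensors.
    """
    return main_loss + aux_weight * (aux1_loss + aux2_loss)


# Made-up loss values, for illustration only.
loss = combined_loss(main_loss=1.0, aux1_loss=2.0, aux2_loss=3.0)
print(loss)
```

At inference time only the final output is used, so no weighting is involved.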

Parameters of Auxiliary Classifier 1

Name	Output shape	out_features	kernel_size	stride	padding
Input	(N, 512, 14, 14)	-	-	-	-
AvgPool_aux	(N, 512, 4, 4)	-	5	3	0
Conv_aux	(N, 128, 4, 4)	128	1	1	0
Flatten	(N, 128 * 4 * 4)	-	-	-	-
Linear_aux1	(N, 1024)	1024	-	-	-
Dropout_aux (p=0.7)	(N, 1024)	-	-	-	-
Linear_aux2	(N, 1000)	1000	-	-	-

Parameters of Auxiliary Classifier 2

Name	Output shape	out_features	kernel_size	stride	padding
Input	(N, 528, 14, 14)	-	-	-	-
AvgPool_aux	(N, 528, 4, 4)	-	5	3	0
Conv_aux	(N, 128, 4, 4)	128	1	1	0
Flatten	(N, 128 * 4 * 4)	-	-	-	-
Linear_aux1	(N, 1024)	1024	-	-	-
Dropout_aux (p=0.7)	(N, 1024)	-	-	-	-
Linear_aux2	(N, 1000)	1000	-	-	-
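
The 4×4 spatial size and the flattened length 128 * 4 * 4 in the tables above follow from a 5×5, stride-3 average pooling applied to the 14×14 output of Inception 4a and 4d. The short calculation below reproduces both numbers; the helper name `pool_out` is illustrative only.

```python
def pool_out(size, kernel_size, stride, padding=0):
    """Output size along one spatial axis of a pooling layer (floor mode)."""
    return (size + 2 * padding - kernel_size) // stride + 1


# 14x14 feature map coming out of Inception 4a / 4d
spatial = pool_out(14, kernel_size=5, stride=3)

# Conv_aux keeps the spatial size and outputs 128 channels
flat_features = 128 * spatial * spatial

print(spatial, flat_features)
```

The second number is the in_features value that the first fully connected layer of the auxiliary classifier must accept.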

Parameters of the whole GoogLeNet

Structure of GoogLeNet

Name	Output shape	out_features	kernel_size	stride	padding
Input	(N, 3, 224, 224)	-	-	-	-
Conv1	(N, 64, 112, 112)	64	7	2	3
MaxPooling1	(N, 64, 56, 56)	-	3	2	0
Conv2	(N, 64, 56, 56)	64	1	1	0
Conv3	(N, 192, 56, 56)	192	3	1	1
MaxPooling2	(N, 192, 28, 28)	-	3	2	0
Inception 3a	(N, 256, 28, 28)	-	-	-	-
Inception 3b	(N, 480, 28, 28)	-	-	-	-
MaxPooling3	(N, 480, 14, 14)	-	3	2	0
Inception 4a	(N, 512, 14, 14)	-	-	-	-
Inception 4b	(N, 512, 14, 14)	-	-	-	-
Inception 4c	(N, 512, 14, 14)	-	-	-	-
Inception 4d	(N, 528, 14, 14)	-	-	-	-
Inception 4e	(N, 832, 14, 14)	-	-	-	-
MaxPooling4	(N, 832, 7, 7)	-	3	2	0
Inception 5a	(N, 832, 7, 7)	-	-	-	-
Inception 5b	(N, 1024, 7, 7)	-	-	-	-
AvgPooling	(N, 1024, 1, 1)	-	7	1	0
Flatten	(N, 1024)	-	-	-	-
Dropout (p=0.4)	(N, 1024)	-	-	-	-
Linear	(N, 1000)	1000	-	-	-

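PyTorch implementation

This implementation is based on the torchvision implementation and the Caffe implementation.

The figure in the paper contains several Local Response Normalization layers, but they are removed here because the VGG paper reports that they have no effect.
BatchNorm2d is inserted right after every convolutional layer.
ceil_mode=True is passed to nn.MaxPool2d(). When the window slides over the input, PyTorch by default discards the leftover part of the input that does not fill a whole window, whereas Caffe does not. Discarding it would make the shapes disagree with the paper, so this option is set.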
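
The effect of ceil_mode can be checked without running the model. The sketch below applies the pooling output-size formula to the four 3×3, stride-2, padding-0 max pooling layers of the network, once with floor rounding (the PyTorch default) and once with ceil rounding. The function names are illustrative, and the extra check in ceil mode mirrors the documented PyTorch rule that a window may not start beyond the input.

```python
import math


def pool_out(size, kernel_size=3, stride=2, padding=0, ceil_mode=False):
    """Output size along one spatial axis of a max pooling layer."""
    rounding = math.ceil if ceil_mode else math.floor
    out = rounding((size + 2 * padding - kernel_size) / stride) + 1
    # In ceil mode the last window must still start inside the input or the left padding.
    if ceil_mode and (out - 1) * stride >= size + padding:
        out -= 1
    return out


def trace(size, ceil_mode):
    """Sizes after MaxPooling1 .. MaxPooling4, starting from the Conv1 output size."""
    sizes = [size]
    for _ in range(4):
        size = pool_out(size, ceil_mode=ceil_mode)
        sizes.append(size)
    return sizes


ceil_sizes = trace(112, ceil_mode=True)
floor_sizes = trace(112, ceil_mode=False)
print("ceil_mode=True :", ceil_sizes)
print("ceil_mode=False:", floor_sizes)
```

Only the ceil variant reproduces the 56, 28, 14, 7 sequence of the paper. With floor rounding the first pooling layer already yields 55.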

Defining the convolutional layer

Every convolutional layer runs Conv2d -> BatchNorm2d -> ReLU in that order, so this sequence is wrapped in a module.

import torch
import torch.nn as nn
import torch.nn.functional as F


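class BasicConv2d(nn.Module):
    """Conv2d -> BatchNorm2d -> ReLU block used for every convolution in the model."""

    def __init__(self, in_channels, out_channels, **kwargs):
        super().__init__()
        # The convolution bias is omitted because BatchNorm2d adds its own shift term.
        self.conv = nn.Conv2d(in_channels, out_channels, bias=False, **kwargs)
        self.bn = nn.BatchNorm2d(out_channels, eps=0.001)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.conv(x)
        x = self.bn(x)
        x = self.relu(x)

        return x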

Defining the Auxiliary Classifier

The Auxiliary Classifier is wrapped in a module.

When the model input is (N, 3, 224, 224), the average pooling layer receives (N, C, 14, 14) and, with kernel size 5 and stride 3, outputs (N, C, 4, 4). It is replaced with AdaptiveAvgPool2d so that inputs larger than (224, 224) are also handled.
The dropout probability is 0.7 in the Caffe implementation.

class InceptionAux(nn.Module):
    def __init__(
        self,
        in_channels,
        num_classes,
        dropout,
    ):
        super().__init__()
        # Adaptive pooling yields a 4x4 map regardless of the input size.
        self.pool = nn.AdaptiveAvgPool2d((4, 4))
        self.conv = BasicConv2d(in_channels, 128, kernel_size=1)
        self.fc1 = nn.Linear(128 * 4 * 4, 1024)
        self.fc2 = nn.Linear(1024, num_classes)
        self.dropout = nn.Dropout(p=dropout)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.pool(x)  # (N, 512 or 528, 4, 4)
        x = self.conv(x)  # (N, 128, 4, 4)
        x = torch.flatten(x, 1)  # (N, 128 * 4 * 4)
        x = self.fc1(x)  # (N, 1024)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.fc2(x)  # (N, num_classes)

        return x

Inception Module

The Inception Module is wrapped in a module. At the end, torch.cat(..., dim=1) concatenates the branch outputs along the channel dimension.

Inception Module

class Inception(nn.Module):
    def __init__(
        self,
        in_channels,
        ch1x1,
        ch3x3red,
        ch3x3,
        ch5x5red,
        ch5x5,
        pool_proj,
    ):
        super().__init__()
        self.branch1 = BasicConv2d(in_channels, ch1x1, kernel_size=1)
        self.branch2 = nn.Sequential(
            BasicConv2d(in_channels, ch3x3red, kernel_size=1),
            BasicConv2d(ch3x3red, ch3x3, kernel_size=3, padding=1),
        )
        self.branch3 = nn.Sequential(
            BasicConv2d(in_channels, ch5x5red, kernel_size=1),
            BasicConv2d(ch5x5red, ch5x5, kernel_size=5, padding=2),
        )
        self.branch4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1, ceil_mode=True),
            BasicConv2d(in_channels, pool_proj, kernel_size=1),
        )

    def forward(self, x):
        branch1 = self.branch1(x)
        branch2 = self.branch2(x)
        branch3 = self.branch3(x)
        branch4 = self.branch4(x)

        out = torch.cat([branch1, branch2, branch3, branch4], 1)

        return out

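Defining the GoogLeNet model

When the model input is (N, 3, 224, 224), the average pooling layer just before the output layer receives (N, C, 7, 7) and, with kernel size 7 and stride 1, outputs (N, C, 1, 1). It is replaced with AdaptiveAvgPool2d so that inputs larger than (224, 224) are also handled.
The paper does not describe the initialization method, so the weights are initialized the way caffe/train_val.prototxt does it.
- Convolution kernels and fully connected weights use Xavier initialization, which draws from the uniform distribution over $[-\sqrt{\frac{3}{fan\_in}}, \sqrt{\frac{3}{fan\_in}}]$.
- Fully connected biases are initialized to the constant 0 in the output layers and 0.2 in all other layers.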
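
To give a feel for the size of these initial weights, the sketch below evaluates the bound $\sqrt{3 / fan\_in}$ for two layers of the model. fan_in is the number of inputs feeding one output unit: in_channels * kernel_height * kernel_width for a convolution and in_features for a fully connected layer. The function and variable names are illustrative only.

```python
import math


def xavier_bound(fan_in):
    """Half-width of the uniform distribution U(-bound, bound) described above."""
    return math.sqrt(3.0 / fan_in)


# Conv1: 3 input channels, 7x7 kernel
conv1_fan_in = 3 * 7 * 7
# First fully connected layer of an auxiliary classifier: 128 * 4 * 4 inputs
fc1_fan_in = 128 * 4 * 4

conv1_bound = xavier_bound(conv1_fan_in)
fc1_bound = xavier_bound(fc1_fan_in)
print(f"conv1: fan_in={conv1_fan_in}, bound={conv1_bound:.4f}")
print(f"fc1:   fan_in={fc1_fan_in}, bound={fc1_bound:.4f}")
```

As far as the PyTorch API is concerned, nn.init.kaiming_uniform_ with its default a=0 and nonlinearity="linear" uses a gain of 1 and therefore samples from the same interval.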

class GoogLeNet(nn.Module):
    def __init__(
        self,
        num_classes=1000,
        aux_logits=True,
        dropout=0.4,
        dropout_aux=0.7,
    ):
        super().__init__()
        self.aux_logits = aux_logits

        self.conv1 = BasicConv2d(3, 64, kernel_size=7, stride=2, padding=3)
        self.maxpool1 = nn.MaxPool2d(3, stride=2, ceil_mode=True)
        self.conv2 = BasicConv2d(64, 64, kernel_size=1)
        self.conv3 = BasicConv2d(64, 192, kernel_size=3, padding=1)
        self.maxpool2 = nn.MaxPool2d(3, stride=2, ceil_mode=True)

        self.inception3a = Inception(192, 64, 96, 128, 16, 32, 32)
        self.inception3b = Inception(256, 128, 128, 192, 32, 96, 64)
        self.maxpool3 = nn.MaxPool2d(3, stride=2, ceil_mode=True)

        self.inception4a = Inception(480, 192, 96, 208, 16, 48, 64)
        self.inception4b = Inception(512, 160, 112, 224, 24, 64, 64)
        self.inception4c = Inception(512, 128, 128, 256, 24, 64, 64)
        self.inception4d = Inception(512, 112, 144, 288, 32, 64, 64)
        self.inception4e = Inception(528, 256, 160, 320, 32, 128, 128)
        self.maxpool4 = nn.MaxPool2d(3, stride=2, ceil_mode=True)

        self.inception5a = Inception(832, 256, 160, 320, 32, 128, 128)
        self.inception5b = Inception(832, 384, 192, 384, 48, 128, 128)

        if aux_logits:
            self.aux1 = InceptionAux(512, num_classes, dropout=dropout_aux)
            self.aux2 = InceptionAux(528, num_classes, dropout=dropout_aux)
        else:
            self.aux1 = None
            self.aux2 = None

        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.dropout = nn.Dropout(p=dropout)
        self.fc = nn.Linear(1024, num_classes)
        self._initialize_weights()

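    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, (nn.Conv2d, nn.Linear)):
                # With gain 1 (nonlinearity="linear") this samples from
                # U(-sqrt(3 / fan_in), sqrt(3 / fan_in)), matching the Xavier scheme above.
                nn.init.kaiming_uniform_(m.weight, nonlinearity="linear")
            if isinstance(m, nn.Linear):
                # Output layers (main and auxiliary) share the size of self.fc.
                if m.out_features == self.fc.out_features:
                    nn.init.zeros_(m.bias)  # output layers: bias 0
                else:
                    nn.init.constant_(m.bias, 0.2)  # other fully connected layers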

    def forward(self, x):
        x = self.conv1(x)  # (N, 64, 112, 112)
        x = self.maxpool1(x)  # (N, 64, 56, 56)
        x = self.conv2(x)  # (N, 64, 56, 56)
        x = self.conv3(x)  # (N, 192, 56, 56)
        x = self.maxpool2(x)  # (N, 192, 28, 28)
        x = self.inception3a(x)  # (N, 256, 28, 28)
        x = self.inception3b(x)  # (N, 480, 28, 28)
        x = self.maxpool3(x)  # (N, 480, 14, 14)
        x = self.inception4a(x)  # (N, 512, 14, 14)

        aux1 = self.aux1(x) if self.aux_logits and self.training else None

        x = self.inception4b(x)  # (N, 512, 14, 14)
        x = self.inception4c(x)  # (N, 512, 14, 14)
        x = self.inception4d(x)  # (N, 528, 14, 14)

        aux2 = self.aux2(x) if self.aux_logits and self.training else None

        x = self.inception4e(x)  # (N, 832, 14, 14)
        x = self.maxpool4(x)  # (N, 832, 7, 7)

        x = self.inception5a(x)  # (N, 832, 7, 7)
        x = self.inception5b(x)  # (N, 1024, 7, 7)

        x = self.avgpool(x)  # (N, 1024, 1, 1)
        x = torch.flatten(x, 1)  # (N, 1024)
        x = self.dropout(x)
        x = self.fc(x)  # (N, 1000)

        if self.aux_logits and self.training:
            return x, aux2, aux1
        else:
            return x


def googlenet():
    return GoogLeNet()

torchvision implementation

The model is available as torchvision.models.googlenet().

Model	Function	Number of parameters	Top-1 error (%)	Top-5 error (%)
GoogLeNet	googlenet()	13004888	30.22	10.47

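Differences from the paper

The torchvision implementation was ported from the TensorFlow implementation inception_v1.py and differs from the paper in the following points.

The pooling layer after inception4e uses kernel_size=2x2, stride=2 in torchvision, whereas the paper uses kernel_size=3x3, stride=2.
The #5x5 convolutional layer of the Inception module has kernel size 3 instead of 5.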
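
Neither difference changes the tensor shapes, which is probably why they are easy to miss. The sketch below compares both pooling variants on the 14×14 output of inception4e, assuming unpadded pooling with ceil_mode=True, and counts the convolution weights of the #5x5 layer of Inception 3a (16 input channels, 32 output channels) for both kernel sizes. The helper names are illustrative only.

```python
import math


def pool_out_ceil(size, kernel_size, stride):
    """Output size of an unpadded pooling layer with ceil rounding."""
    return math.ceil((size - kernel_size) / stride) + 1


def conv_weight_count(in_channels, out_channels, kernel_size):
    """Number of weights in a square convolution kernel without bias."""
    return in_channels * out_channels * kernel_size * kernel_size


# Pooling after inception4e: both variants map 14x14 to the same size.
paper_pool = pool_out_ceil(14, kernel_size=3, stride=2)
tv_pool = pool_out_ceil(14, kernel_size=2, stride=2)

# "#5x5" layer of Inception 3a: 16 -> 32 channels
paper_weights = conv_weight_count(16, 32, kernel_size=5)
tv_weights = conv_weight_count(16, 32, kernel_size=3)

print(paper_pool, tv_pool)
print(paper_weights, tv_weights)
```

The shapes agree, but the smaller kernel carries fewer weights, so parameter counts measured on the torchvision model are not directly comparable to a 5×5 implementation.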
