pandas – How to check for missing values with isna and notna

Overview

This article shows how to check whether values are missing in pandas.

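Missing values

A value that was not recorded in a dataset for some reason is called a missing value (missing data). In a CSV file, missing values can be written in various ways, such as an empty field or the string "N/A". In pandas, a missing value is conventionally represented by the floating-point value NaN (Not a Number). NaN is one of the special floating-point values originally defined to represent the result of an undefined floating-point operation such as 0/0. Besides NaN, pandas also treats None and NaT (Not a Time) as missing values.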
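As a quick illustration of the paragraph above, the following sketch checks which objects pandas.isna() reports as missing. The dictionary names (candidates, non_missing) are only illustrative.

```python
import numpy as np
import pandas as pd

# NaN, None and NaT are all reported as missing by pd.isna().
candidates = {"np.nan": np.nan, "None": None, "pd.NaT": pd.NaT}
results = {name: bool(pd.isna(value)) for name, value in candidates.items()}
print(results)

# Ordinary values, including placeholder strings, are NOT treated as missing.
non_missing = {"0": 0, "empty string": "", "'N/A'": "N/A"}
non_missing_results = {name: bool(pd.isna(v)) for name, v in non_missing.items()}
print(non_missing_results)

# In a numeric Series, None is stored as NaN and the dtype becomes float64.
s = pd.Series([1.0, None])
print(s.dtype)  # float64
```

Note that a string such as "N/A" only becomes a missing value if it is converted when the data is loaded; once it is stored as a plain string, isna() returns False for it.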

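Do not use the comparison operator == to check for missing values

IEEE 754, the standard for floating-point arithmetic, defines an == comparison between two NaNs to be False. To check whether a value is missing, use a dedicated function such as pandas.isna() instead.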

In [1]:

import numpy as np

print(np.nan == np.nan)

False
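The same pitfall applies element-wise to a Series. The sketch below compares the two approaches on a small illustrative Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# Element-wise == against NaN is False everywhere, even at the NaN position.
by_equality = s == np.nan
# isna() flags the missing element correctly.
by_isna = s.isna()

print(by_equality.tolist())  # [False, False, False]
print(by_isna.tolist())      # [False, True, False]
```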

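pandas.DataFrame.count – count non-missing values

pandas.DataFrame.count() counts the non-missing values in each column or each row.

df.count(axis=0): counts the non-missing values in each column.
df.count(axis=1): counts the non-missing values in each row.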

DataFrame.count(self, axis=0, level=None, numeric_only=False)

Parameters

Name	Type	Default	Description
axis	{0 or 'index', 1 or 'columns'}	0	Direction to count along.
level	int or str, optional	None	For a MultiIndex, the level to aggregate over.
numeric_only	bool	False	If True, columns (rows) containing data other than float, int, or bool are excluded from the count.

Returns

Type	Description
Series or DataFrame	Number of non-missing values

axis – direction to count along

In [2]:

import pandas as pd

df = pd.DataFrame(
    [
        [None, 2, 3, 4, 5],
        [None, 7, 8, 9, 10],
        [11, 12, 13, 14, 15],
        [16, 17, 18, None, 20],
        [None, None, None, None, None],
    ],
    index=["a", "b", "c", "d", "e"],
    columns=["A", "B", "C", "D", "E"],
)
display(df)

# axis=0: one count per column
print(df.count(axis=0))

# axis=1: one count per row
print(df.count(axis=1))

	A	B	C	D	E
a	NaN	2.0	3.0	4.0	5.0
b	NaN	7.0	8.0	9.0	10.0
c	11.0	12.0	13.0	14.0	15.0
d	16.0	17.0	18.0	NaN	20.0
e	NaN	NaN	NaN	NaN	NaN

A    2
B    4
C    4
D    3
E    4
dtype: int64
a    4
b    4
c    5
d    4
e    0
dtype: int64
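One way to read count() is as the sum of the boolean mask returned by notna(). The sketch below checks this on a small illustrative frame (not the one above):

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [None, 11.0, 16.0], "B": [2.0, 12.0, None]},
    index=["a", "c", "d"],
)

per_column = df.count(axis=0)  # A: 2, B: 2
per_row = df.count(axis=1)     # a: 1, c: 2, d: 1

# Summing the notna() mask along the same axis gives the same numbers.
per_column_mask = df.notna().sum(axis=0)
per_row_mask = df.notna().sum(axis=1)

print(per_column.tolist(), per_column_mask.tolist())
print(per_row.tolist(), per_row_mask.tolist())
```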

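level – count per level of a MultiIndex

When the index is a MultiIndex, level specifies which level to aggregate over. Note that the level parameter was deprecated in pandas 1.3 and removed in pandas 2.0, so the example below runs only on older versions.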

In [3]:

df = pd.DataFrame(
    [
        [None, 2, 3, 4, 5],
        [None, 7, 8, 9, 10],
        [11, 12, 13, 14, 15],
        [16, 17, 18, None, 20],
        [None, None, None, None, None],
    ],
    index=pd.MultiIndex.from_tuples(
        [("a", 0), ("a", 1), ("b", 0), ("c", 0), ("c", 1)], names=["lv1", "lv2"]
    ),
    columns=["A", "B", "C", "D", "E"],
)
display(df)

display(df.count(axis=0, level=0))

display(df.count(axis=0, level=1))

		A	B	C	D	E
lv1	lv2
a	0	NaN	2.0	3.0	4.0	5.0
a	1	NaN	7.0	8.0	9.0	10.0
b	0	11.0	12.0	13.0	14.0	15.0
c	0	16.0	17.0	18.0	NaN	20.0
c	1	NaN	NaN	NaN	NaN	NaN

	A	B	C	D	E
lv1
a	0	2	2	2	2
b	1	1	1	1	1
c	1	1	1	0	1

	A	B	C	D	E
lv2
0	2	3	3	2	3
1	0	1	1	1	1
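On pandas 2.0 and later, where count() no longer accepts level, the same per-level counts can be obtained with groupby. A minimal sketch, using a smaller illustrative frame than the one above:

```python
import pandas as pd

df = pd.DataFrame(
    [[None, 2], [None, 7], [11, 12], [16, None]],
    index=pd.MultiIndex.from_tuples(
        [("a", 0), ("a", 1), ("b", 0), ("c", 0)], names=["lv1", "lv2"]
    ),
    columns=["A", "B"],
)

# Group the rows by one index level, then count non-missing values per group.
by_lv1 = df.groupby(level="lv1").count()
by_lv2 = df.groupby(level="lv2").count()

print(by_lv1)
print(by_lv2)
```

Passing the level name ("lv1") or its position (0) to groupby should be equivalent here.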

numeric_only – count only columns (rows) whose values are all numeric

In [4]:

df = pd.DataFrame(
    [
        ["hoge", 2, 3, 4, 5],
        [None, 7, 8, 9, 10],
        [11, 12, None, 14, 15],
        [16, 17, 18, "hoge", 20],
        [None, None, None, None, None],
    ],
    index=["a", "b", "c", "d", "e"],
    columns=["A", "B", "C", "D", "E"],
)
display(df)

print(df.count(axis=0, numeric_only=True))

	A	B	C	D	E
a	hoge	2.0	3.0	4	5.0
b	None	7.0	8.0	9	10.0
c	11	12.0	NaN	14	15.0
d	16	17.0	18.0	hoge	20.0
e	None	NaN	NaN	None	NaN

B    4
C    3
E    4
dtype: int64
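Which columns are dropped depends on each column's dtype. The sketch below, on a small illustrative frame, prints the dtypes next to the result of numeric_only=True:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["hoge", None, 11],  # mixed values -> object dtype
        "B": [2, 7, None],        # numbers with a missing value -> float64
        "C": [3, 8, 13],          # integers only -> int64
    }
)
print(df.dtypes)

# Column A has object dtype, so it is left out of the result.
numeric_counts = df.count(numeric_only=True)
print(numeric_counts)
```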

pandas.Series.count

pandas.Series.count() counts the non-missing values in a Series.

Series.count(self, level=None)

Parameters

Name	Type	Default	Description
level	int or level name	None	For a MultiIndex, the level to aggregate over.

Returns

Type	Description
int or Series (if level specified)	Number of non-missing values

In [5]:

s = pd.Series([None, 2, 3, 4, None])
print(s.count())

3
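For reference, the sketch below relates count() to the total length and the number of missing values for the same Series. The variable names are illustrative.

```python
import pandas as pd

s = pd.Series([None, 2, 3, 4, None])

n_valid = s.count()         # non-missing values
n_missing = s.isna().sum()  # missing values
n_total = len(s)            # all values, missing or not

print(n_valid, n_missing, n_total)  # 3 2 5
```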

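Getting the index of non-NaN values

pandas.DataFrame.first_valid_index(): returns the index of the first row that contains at least one non-NaN value.
pandas.DataFrame.last_valid_index(): returns the index of the last row that contains at least one non-NaN value.
pandas.Series.first_valid_index(): returns the index of the first non-NaN value.
pandas.Series.last_valid_index(): returns the index of the last non-NaN value.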

In [6]:

df = pd.DataFrame(
    [
        [None, None, None, None, None],
        [None, 2, 3, 4, 5],
        [None, None, 8, 9, 10],
        [11, 12, 13, 14, 15],
        [None, None, None, None, None],
    ],
    index=["a", "b", "c", "d", "e"],
    columns=["A", "B", "C", "D", "E"],
)

print(df.first_valid_index())
print(df.last_valid_index())

b
d

In [7]:

s = pd.Series([None, 2, 3, 4, None], index=["a", "b", "c", "d", "e"])

print(s.first_valid_index())
print(s.last_valid_index())

b
d
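To find the first and last valid index of each column separately, one option is to apply the Series method column by column. A minimal sketch on a small illustrative frame:

```python
import pandas as pd

df = pd.DataFrame(
    [[None, None], [None, 2], [11, 12], [None, None]],
    index=["a", "b", "c", "d"],
    columns=["A", "B"],
)

# Whole-frame result: the first/last row with at least one non-NaN value.
first_row = df.first_valid_index()  # 'b'
last_row = df.last_valid_index()    # 'c'

# Per-column result: call the Series method on every column.
first_per_column = df.apply(lambda col: col.first_valid_index())
last_per_column = df.apply(lambda col: col.last_valid_index())
print(first_per_column.to_dict())  # {'A': 'c', 'B': 'b'}
print(last_per_column.to_dict())   # {'A': 'c', 'B': 'c'}

# When every value is missing, the result is None.
all_missing = pd.Series([None, None], dtype="float64").first_valid_index()
print(all_missing)  # None
```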

pandas.Series.hasnans – whether a Series contains NaN

pandas.Series.hasnans returns whether the Series contains any NaN.

In [8]:

s = pd.Series([None, 2, 3, 4, None])
print(s.hasnans)

True
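hasnans is a Series attribute; a DataFrame does not provide it. One possible equivalent for a DataFrame is to reduce the isna() mask with any(), as in this sketch on an illustrative frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan], "B": [3.0, 4.0]})

# Per column: does the column contain at least one missing value?
columns_with_nan = df.isna().any()
print(columns_with_nan.to_dict())  # {'A': True, 'B': False}

# Whole frame: is there a missing value anywhere?
any_nan = bool(df.isna().any().any())
print(any_nan)  # True

# Each column is a Series, so hasnans is available column by column.
print(df["A"].hasnans, df["B"].hasnans)  # True False
```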

Checking whether values are NaN

	Is missing	Is not missing
pandas	pandas.isna()	pandas.notna()
pandas (alias)	pandas.isnull()	pandas.notnull()
DataFrame	pandas.DataFrame.isna()	pandas.DataFrame.notna()
Series	pandas.Series.isna()	pandas.Series.notna()

In [9]:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.array([[1, 2, None], [4, 5, 6], [np.nan, 8, 9]]),
    index=["a", "b", "c"],
    columns=["A", "B", "C"],
)
display(df)

display(pd.isna(df))  # or pd.isnull(df)
display(pd.notna(df))  # or pd.notnull(df)

	A	B	C
a	1	2	None
b	4	5	6
c	NaN	8	9

	A	B	C
a	False	False	True
b	False	False	False
c	True	False	False

	A	B	C
a	True	True	False
b	True	True	True
c	False	True	True

In [10]:

s = pd.Series([1, 2, np.nan, 4, None])
print(pd.isna(s))  # or pd.isnull(s)
print(pd.notna(s))  # or pd.notnull(s)

0    False
1    False
2     True
3    False
4     True
dtype: bool
0     True
1     True
2    False
3     True
4    False
dtype: bool
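The boolean masks returned by isna() and notna() are typically used for counting or filtering. A minimal sketch, on an illustrative frame with float columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"A": [1.0, 4.0, np.nan], "B": [2.0, 5.0, 8.0], "C": [np.nan, 6.0, 9.0]},
    index=["a", "b", "c"],
)

# Count the missing values in each column by summing the isna() mask.
missing_per_column = df.isna().sum()
print(missing_per_column.to_dict())  # {'A': 1, 'B': 0, 'C': 1}

# Keep only the rows where column A is present.
rows_with_a = df[df["A"].notna()]
print(list(rows_with_a.index))  # ['a', 'b']

# Keep only the rows without any missing value.
complete_rows = df[df.notna().all(axis=1)]
print(list(complete_rows.index))  # ['b']
```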
