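pandas – How to create a cross tabulation table with crosstab

Overview

This article explains how to perform cross tabulation in pandas using crosstab.

Cross tabulation

A cross tabulation table, also called a contingency table, is a table built to analyze the relationship between two or more categorical variables. For example, suppose there are two categorical variables, sex (male, female) and handedness (right-handed, left-handed), and we draw a random sample of 100 people. The relationship between the two variables can then be summarized in the following cross tabulation table.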

	Right-handed	Left-handed	Total
Male	43	9	52
Female	44	4	48
Total	87	13	100

In this table, the rightmost column holds the row marginal totals, the bottom row holds the column marginal totals, and the bottom-right cell holds the grand total.
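As a minimal sketch, the handedness table can be reproduced with pandas. Only the four cell counts come from the table; the English labels, the column names sex and handedness, and the way the 100 observations are rebuilt are assumptions made for illustration.

```python
import pandas as pd

# Cell counts taken from the handedness table; the labels are illustrative.
counts = {
    ("Male", "Right-handed"): 43,
    ("Male", "Left-handed"): 9,
    ("Female", "Right-handed"): 44,
    ("Female", "Left-handed"): 4,
}

# Expand the counts back into one row per sampled person (100 rows in total).
rows = [(sex, hand) for (sex, hand), n in counts.items() for _ in range(n)]
people = pd.DataFrame(rows, columns=["sex", "handedness"])

# margins=True appends the row and column marginal totals and the grand total.
table = pd.crosstab(people["sex"], people["handedness"], margins=True)
print(table)
```

The "All" row and column of the result correspond to the marginal totals, and the cell where they meet is the grand total.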

pandas.crosstab

pandas.crosstab is a function for creating cross tabulation tables.

pandas.crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, margins_name='All', dropna=True, normalize=False) → DataFrame

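Arguments

Name	Type	Default	Description
index	array-like, Series, or list of arrays/Series	(required)	Categorical variable(s) shown in the rows of the cross tabulation table
columns	array-like, Series, or list of arrays/Series	(required)	Categorical variable(s) shown in the columns of the cross tabulation table
values	array-like, optional	None	Quantitative variable to aggregate. Must be specified together with aggfunc.
rownames	sequence	None	Names for the index (row) levels of the cross tabulation table
colnames	sequence	None	Names for the columns levels of the cross tabulation table
aggfunc	function, optional	None	Aggregation method. Must be specified together with values.
margins	bool	False	Whether to add a row and a column of marginal totals to the cross tabulation table
margins_name	str	'All'	Name of the row and column that hold the marginal totals
dropna	bool	True	Whether to exclude columns whose entries are all NaN
normalize	bool, {'all', 'index', 'columns'}, or {0, 1}	False	Whether, and along which axis, to normalize the values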

Return value

Type	Description
DataFrame	The cross tabulation table
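The pairing of values and aggfunc noted in the arguments table can be checked with a small sketch. The three-row data below is hypothetical, and the example assumes a pandas version in which crosstab raises ValueError when only one of the two arguments is given.

```python
import pandas as pd

# Hypothetical three-row sample, just enough to call crosstab.
sex = pd.Series(["Female", "Male", "Male"], name="sex")
time = pd.Series(["Dinner", "Dinner", "Lunch"], name="time")
bill = pd.Series([10.0, 20.0, 30.0], name="total_bill")

# Passing only one of values / aggfunc is expected to be rejected.
errors = []
for kwargs in ({"values": bill}, {"aggfunc": "sum"}):
    try:
        pd.crosstab(sex, time, **kwargs)
    except ValueError as err:
        errors.append(str(err))

# Passing both together aggregates the quantitative variable per cell.
total = pd.crosstab(sex, time, values=bill, aggfunc="sum")
print(errors)
print(total)
```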

Examples

Loading the data

As an example, we use a dataset that records the tips paid at a restaurant.

In [1]:

import pandas as pd

df = pd.read_csv("https://git.io/JJlHw")
df.head()

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	2
4	24.59	3.61	Female	No	Sun	Dinner	4

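Cross tabulating frequencies

Pass the categorical variable to show in the rows as index and the categorical variable to show in the columns as columns. When values is omitted, each cell contains the number of records for that combination.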

In [2]:

ret = pd.crosstab(df["sex"], df["time"])
ret

time	Dinner	Lunch
sex
Female	52	35
Male	124	33
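One way to read a frequency crosstab is as a group-by count that has been pivoted. The sketch below illustrates this on a hypothetical six-row stand-in for the tips data (the toy DataFrame is an assumption, not the real dataset).

```python
import pandas as pd

# Hypothetical toy data standing in for the tips dataset.
toy = pd.DataFrame({
    "sex": ["Female", "Male", "Male", "Female", "Male", "Female"],
    "time": ["Dinner", "Dinner", "Lunch", "Lunch", "Dinner", "Dinner"],
})

# Frequency table: rows are sex, columns are time.
freq = pd.crosstab(toy["sex"], toy["time"])

# Same counts via groupby: count rows per (sex, time) pair,
# then move the "time" level into the columns.
by_group = toy.groupby(["sex", "time"]).size().unstack(fill_value=0)

print(freq)
print(by_group)
```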

If two or more categorical variables are passed to index or columns, the corresponding axis of the result becomes a MultiIndex.

In [3]:

ret = pd.crosstab(df["sex"], [df["day"], df["time"]])
ret

day	Fri		Sat	Sun	Thur
time	Dinner	Lunch	Dinner	Dinner	Dinner	Lunch
sex
Female	5	4	28	18	1	31
Male	7	3	59	58	0	30
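Once the columns are a MultiIndex, sub-tables can be pulled out by level. The following is a sketch on hypothetical toy data; the variable names nested, friday, and dinner are illustrative.

```python
import pandas as pd

# Hypothetical toy data with two column-side categorical variables.
toy = pd.DataFrame({
    "sex": ["Female", "Male", "Male", "Female", "Male"],
    "day": ["Fri", "Fri", "Sat", "Sat", "Fri"],
    "time": ["Dinner", "Lunch", "Dinner", "Dinner", "Dinner"],
})

# Passing a list of Series to columns yields MultiIndex columns (day, time).
nested = pd.crosstab(toy["sex"], [toy["day"], toy["time"]])

# Selecting a top-level key returns the sub-table for that day.
friday = nested["Fri"]

# xs() takes a cross-section on an inner level, here every "Dinner" column.
dinner = nested.xs("Dinner", axis=1, level="time")

print(nested)
print(friday)
print(dinner)
```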

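Cross tabulating values

If a quantitative variable is passed as values, each cell is computed by applying the method given in aggfunc to the values that fall into that cell.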

In [4]:

ret = pd.crosstab(df["sex"], df["time"], df["total_bill"], aggfunc="mean")
ret

time	Dinner	Lunch
sex
Female	19.213077	16.339143
Male	21.461452	18.048485
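A value-based crosstab can be cross-checked against a group-by aggregation. This sketch uses hypothetical toy numbers chosen so the means are easy to verify by hand.

```python
import pandas as pd

# Hypothetical toy data; total_bill values are made up for easy arithmetic.
toy = pd.DataFrame({
    "sex": ["Female", "Male", "Male", "Female", "Male", "Female"],
    "time": ["Dinner", "Dinner", "Lunch", "Lunch", "Dinner", "Dinner"],
    "total_bill": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
})

# Mean total_bill per (sex, time) cell.
mean_bill = pd.crosstab(
    toy["sex"], toy["time"], values=toy["total_bill"], aggfunc="mean"
)

# The same numbers from groupby + unstack.
by_group = toy.groupby(["sex", "time"])["total_bill"].mean().unstack()

print(mean_bill)
```

Here the Female/Dinner cell averages 10.0 and 60.0, so it holds 35.0.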

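Adding marginal totals to the cross tabulation table

If margins=True is specified, a row and a column of marginal totals (named All by default) are added to the cross tabulation table.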

In [5]:

ret = pd.crosstab(df["sex"], df["time"], margins=True)
ret

time	Dinner	Lunch	All
sex
Female	52	35	87
Male	124	33	157
All	176	68	244
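To make the meaning of the All row and column concrete, the sketch below compares them with sums computed from the plain frequency table. The toy data is hypothetical.

```python
import pandas as pd

# Hypothetical toy data standing in for the tips dataset.
toy = pd.DataFrame({
    "sex": ["Female", "Male", "Male", "Female", "Male", "Female"],
    "time": ["Dinner", "Dinner", "Lunch", "Lunch", "Dinner", "Dinner"],
})

counts = pd.crosstab(toy["sex"], toy["time"])
with_totals = pd.crosstab(toy["sex"], toy["time"], margins=True)

# The "All" column should match the row sums of the plain table,
# and the "All" row should match its column sums.
row_totals = with_totals["All"].drop("All")
col_totals = with_totals.loc["All"].drop("All")

print(with_totals)
```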

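Setting the names of the rows, columns, and marginal totals

The names of the index and columns of the cross tabulation table can be set with rownames and colnames, respectively. The label used for the marginal totals can be set with margins_name.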

In [6]:

ret = pd.crosstab(
    df["sex"],
    df["time"],
    rownames=["Sex"],
    colnames=["Time"],
    margins=True,
    margins_name="Total",
)
ret

Time	Dinner	Lunch	Total
Sex
Female	52	35	87
Male	124	33	157
Total	176	68	244
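rownames and colnames are particularly useful when the inputs carry no names of their own, as with plain NumPy arrays. The following is a sketch under that assumption; the arrays and the labels "sex", "time", and "Total" are illustrative.

```python
import numpy as np
import pandas as pd

# Plain arrays have no name attribute, unlike DataFrame columns.
sex = np.array(["Female", "Male", "Male", "Female"])
time = np.array(["Dinner", "Dinner", "Lunch", "Lunch"])

# Name the axes explicitly and relabel the marginal totals.
named = pd.crosstab(
    sex,
    time,
    rownames=["sex"],
    colnames=["time"],
    margins=True,
    margins_name="Total",
)
print(named)
```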

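Normalizing

The normalize argument specifies how the values of the cross tabulation table are normalized.

normalize=False: no normalization
normalize=True or normalize="all": normalize over all values (the whole table sums to 1)
normalize="columns" or normalize=1: normalize over each column (each column sums to 1)
normalize="index" or normalize=0: normalize over each row (each row sums to 1)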

In [7]:

ret = pd.crosstab(df["sex"], df["time"], normalize=True)  # normalize="all"
ret

time	Dinner	Lunch
sex
Female	0.213115	0.143443
Male	0.508197	0.135246

In [8]:

ret = pd.crosstab(df["sex"], df["time"], normalize="columns")  # same as normalize=1
ret

time	Dinner	Lunch
sex
Female	0.295455	0.514706
Male	0.704545	0.485294

In [9]:

ret = pd.crosstab(df["sex"], df["time"], normalize="index")  # same as normalize=0
ret

time	Dinner	Lunch
sex
Female	0.597701	0.402299
Male	0.789809	0.210191
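A quick way to confirm which axis each normalize option works along is to check which sums come out as 1. The sketch below does this on hypothetical toy data and also checks the integer aliases (0 for "index", 1 for "columns").

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the tips dataset.
toy = pd.DataFrame({
    "sex": ["Female", "Male", "Male", "Female", "Male", "Female"],
    "time": ["Dinner", "Dinner", "Lunch", "Lunch", "Dinner", "Dinner"],
})

overall = pd.crosstab(toy["sex"], toy["time"], normalize="all")
by_row = pd.crosstab(toy["sex"], toy["time"], normalize="index")
by_col = pd.crosstab(toy["sex"], toy["time"], normalize="columns")

# Integer aliases: 0 behaves like "index", 1 behaves like "columns".
by_row_alias = pd.crosstab(toy["sex"], toy["time"], normalize=0)
by_col_alias = pd.crosstab(toy["sex"], toy["time"], normalize=1)

print(by_row.sum(axis=1))   # each row sums to 1
print(by_col.sum(axis=0))   # each column sums to 1
print(overall.values.sum()) # the whole table sums to 1
```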

pandas.pivot_table

pandas.pivot_table is a function that creates a spreadsheet-style pivot table as a DataFrame.

pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All', observed=False)

Arguments

Name	Type	Default	Description
data	DataFrame	(required)	Input data
values	column to aggregate, optional	None	Column(s) to aggregate
index	column, Grouper, array, or list of the previous	None	Keys to group by on the pivot table index. If an array is passed, it must be the same length as the data and is used in the same manner as column values. A list can contain any of the other types (except list).
columns	column, Grouper, array, or list of the previous	None	Keys to group by on the pivot table column. If an array is passed, it must be the same length as the data and is used in the same manner as column values. A list can contain any of the other types (except list).
aggfunc	function, list of functions, dict	'mean'	If a list of functions is passed, the resulting pivot table will have hierarchical columns whose top level are the function names (inferred from the function objects themselves). If a dict is passed, the key is the column to aggregate and the value is a function or list of functions.
fill_value	scalar	None	Value to replace missing values with (in the resulting pivot table, after aggregation).
margins	bool	False	Add all row / columns (e.g. for subtotal / grand totals).
dropna	bool	True	Do not include columns whose entries are all NaN.
margins_name	str	'All'	Name of the row / column that will contain the totals when margins is True.
observed	bool	False	This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers. Changed in version 0.25.0.

Return value

Type	Description
DataFrame	An Excel style pivot table.
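pivot_table takes column names from a DataFrame, whereas crosstab takes the arrays themselves, but both can produce the same table. The sketch below shows this on hypothetical toy data.

```python
import pandas as pd

# Hypothetical toy data; total_bill values are made up for easy arithmetic.
toy = pd.DataFrame({
    "sex": ["Female", "Male", "Male", "Female", "Male", "Female"],
    "time": ["Dinner", "Dinner", "Lunch", "Lunch", "Dinner", "Dinner"],
    "total_bill": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
})

# pivot_table: refer to columns of the DataFrame by name.
pivot = pd.pivot_table(
    toy, values="total_bill", index="sex", columns="time", aggfunc="mean"
)

# crosstab: pass the Series directly.
cross = pd.crosstab(
    toy["sex"], toy["time"], values=toy["total_bill"], aggfunc="mean"
)

print(pivot)
print(cross)
```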

pandas.DataFrame.pivot_table

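pandas.DataFrame.pivot_table is the DataFrame method counterpart of pandas.pivot_table. It takes the same arguments except data, which is the DataFrame the method is called on.

DataFrame.pivot_table(values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All', observed=False)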

Arguments

Name	Type	Default	Description
values	column to aggregate, optional	None	Column(s) to aggregate
index	column, Grouper, array, or list of the previous	None	Keys to group by on the pivot table index. If an array is passed, it must be the same length as the data and is used in the same manner as column values. A list can contain any of the other types (except list).
columns	column, Grouper, array, or list of the previous	None	Keys to group by on the pivot table column. If an array is passed, it must be the same length as the data and is used in the same manner as column values. A list can contain any of the other types (except list).
aggfunc	function, list of functions, dict	'mean'	If a list of functions is passed, the resulting pivot table will have hierarchical columns whose top level are the function names (inferred from the function objects themselves). If a dict is passed, the key is the column to aggregate and the value is a function or list of functions.
fill_value	scalar	None	Value to replace missing values with (in the resulting pivot table, after aggregation).
margins	bool	False	Add all row / columns (e.g. for subtotal / grand totals).
dropna	bool	True	Do not include columns whose entries are all NaN.
margins_name	str	'All'	Name of the row / column that will contain the totals when margins is True.
observed	bool	False	This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers. Changed in version 0.25.0.

Return value

Type	Description
DataFrame	An Excel style pivot table.
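As a small sketch of the method form, the example below calls pivot_table on a hypothetical DataFrame in which one (sex, time) combination never occurs, and uses fill_value to replace the resulting missing cell.

```python
import pandas as pd

# Hypothetical toy data with no Female/Lunch record.
toy = pd.DataFrame({
    "sex": ["Female", "Male", "Male", "Female"],
    "time": ["Dinner", "Dinner", "Lunch", "Dinner"],
    "total_bill": [10.0, 20.0, 30.0, 60.0],
})

# Without fill_value, the missing combination shows up as NaN.
without_fill = toy.pivot_table(
    values="total_bill", index="sex", columns="time", aggfunc="sum"
)

# With fill_value=0, the missing cell is replaced after aggregation.
with_fill = toy.pivot_table(
    values="total_bill", index="sex", columns="time", aggfunc="sum", fill_value=0
)

print(without_fill)
print(with_fill)
```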
