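This post continues from the previous article in this series.
If you want to run the data processing yourself, please start from "(Part 3-2) Data processing for the Ames housing price dataset ①".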
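Variable selection
Let's move on to variable selection. In real projects we often create a large number of explanatory variables, and at times we run into the phenomenon known as the "curse of dimensionality".
We therefore need some method for deciding which variables not to use. This time I again take the orthodox approach: check the correlation coefficients and, for each highly correlated pair, keep only one of the two variables.
Besides reducing the amount of computation, this mitigates multicollinearity and helps produce a more stable model.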
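Summary of correlation coefficient types
The most commonly used correlation coefficient is probably Pearson's. A typical use is examining the relationship between weight and height.
However, the appropriate way to compute a correlation depends on the measurement scale of the variables.
For continuous values such as weight and height, Pearson's correlation coefficient is fine. For ordinal variables, such as survey answers rated on a five-point scale from "bad" to "good", Spearman's rank correlation coefficient is the better choice.
The details are summarized in articles by the Kameda Laboratory at Tokyo University of Science, Yamaguchi (山口東京理科大学 亀田研究室) and by 株式会社アイスタット.
On measurement scales, I found the ITmedia article 'Why "27°C × 2 = 54°C" is meaningless: the basics of "measurement" and "data"' easy to follow.
Summarizing the references above gives the tables below.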
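Measurement scales
Type | Scale | Examples |
---|---|---|
Qualitative variable | Nominal scale | Sex, prefecture, etc. |
Qualitative variable | Ordinal scale | Rankings, five-point survey ratings |
Quantitative variable | Interval scale | Standardized deviation scores (hensachi), Celsius temperature (°C), etc. |
Quantitative variable | Ratio scale | Monetary amounts, absolute temperature, etc. |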
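Correlation coefficients
Variable types | Correlation measure |
---|---|
Quantitative x quantitative | Pearson's correlation coefficient |
Ordinal x ordinal | Spearman's rank correlation coefficient |
Nominal x nominal | Cramér's V |
Quantitative x nominal | Correlation ratio |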
All of these measures can be computed on the Ames dataset, so I will try each of them.
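To make the difference between the first two rows of the table concrete, here is a minimal sketch on made-up data (the values and the names `x` and `y` are my own illustration, not taken from the Ames dataset). It computes Spearman's coefficient as Pearson's coefficient applied to the ranks of the values, which is how the rank correlation is defined.

```python
import pandas as pd

# Hypothetical toy data: y increases monotonically with x, but not linearly
x = pd.Series([1, 2, 3, 4, 5], dtype=float)
y = x ** 3

# Pearson measures the strength of the linear relationship
pearson = x.corr(y)

# Spearman is Pearson's coefficient applied to the ranks of the values
spearman = x.rank().corr(y.rank())

print(f"pearson={pearson:.3f}, spearman={spearman:.3f}")
```

Because only the order of the values matters, Spearman's coefficient reaches 1 for any strictly increasing relationship, while Pearson's stays below 1 unless the relationship is exactly linear.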
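Preparing to compute the correlations
# Boundary between the training data and the test data
max_train_index = train_records-1
# DataFrame used for checking correlations (training rows only)
df_corr = df_all.loc[0:max_train_index,:].copy()
df_corr["SalePrice"] = df["SalePrice"]
# Continuous (quantitative) variables only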
quant_cols=[
'LotFrontage'
, 'LotArea'
, 'MasVnrArea'
, 'BsmtFinSF1'
, 'BsmtFinSF2'
, 'BsmtUnfSF'
, 'TotalBsmtSF'
, '1stFlrSF'
, '2ndFlrSF'
, 'LowQualFinSF'
, 'GrLivArea'
, 'BsmtFullBath'
, 'BsmtHalfBath'
, 'FullBath'
, 'HalfBath'
, 'BedroomAbvGr'
, 'KitchenAbvGr'
, 'TotRmsAbvGrd'
, 'Fireplaces'
, 'GarageCars'
, 'GarageArea'
, 'WoodDeckSF'
, 'OpenPorchSF'
, 'EnclosedPorch'
, '3SsnPorch'
, 'ScreenPorch'
, 'PoolArea'
, 'MiscVal'
, 'TotalLivArea'
, 'TotalBathRms'
, 'TotalWoodDeckPorch'
, 'SalePrice' # target variable
]
# Ordinal variables only
ordinal_cols=[
'LotShape'
,'Utilities'
,'LandSlope'
,'OverallQual'
,'OverallCond'
,'ExterQual'
,'ExterCond'
,'BsmtQual'
,'BsmtCond'
,'BsmtExposure'
,'BsmtFinType1'
,'BsmtFinType2'
,'HeatingQC'
,'KitchenQual'
,'Functional'
,'FireplaceQu'
,'GarageFinish'
,'GarageQual'
,'GarageCond'
,'PoolQC'
,'Fence'
,'SalePrice' # target variable
]
df_corr_obj = df_corr.select_dtypes(include=object)
df_corr_number = df_corr.select_dtypes(include='number').drop("Id",axis=1)
df_corr_ordinal = df_corr_number[ordinal_cols]
df_corr_quant = df_corr_number[quant_cols]
Functions for computing the correlations
# Reference: https://www.statology.org/cramers-v-in-python/
# Compute Cramér's V
def cramersV(col1,col2):
    import pandas as pd
    import numpy as np
    from scipy.stats import chi2_contingency
    # Cross-tabulate the two variables
    crosstab =np.array(pd.crosstab(col1,col2,rownames=None,colnames=None))
    # Chi-squared statistic
    X2 = chi2_contingency(crosstab, correction=False)[0]
    # Number of samples
    n = np.sum(crosstab)
    # min(number of rows, number of columns) - 1
    k = min(crosstab.shape) - 1
    # Cramér's V
    V = np.sqrt((X2/n) / k)
    return V
# Reference: https://www.kaggle.com/code/chrisbss1/cramer-s-v-correlation-matrix/notebook
def get_corr_cramers(dataframe):
    import pandas as pd
    import numpy as np
    dataframe_obj = dataframe.select_dtypes(include=object)
    dataframe_cols = dataframe_obj.columns
    rows= []
    for var1 in dataframe_cols:
        col = []
        for var2 in dataframe_cols:
            cramers = cramersV(dataframe_obj[var1], dataframe_obj[var2])
            col.append(round(cramers,2))
        rows.append(col)
    results = np.array(rows)
    return pd.DataFrame(results, columns = dataframe_cols, index = dataframe_cols)
# Reference: https://stackoverflow.com/questions/52083501/how-to-compute-correlation-ratio-or-eta-in-python
# return correlation_ratio (eta_squared) of category_var x numerical_var
def correlation_ratio(category_var, numerical_var):
    import pandas as pd
    import numpy as np
    category = np.array(category_var)
    numerical = np.array(numerical_var)
    ssw = 0 # Sum of Squares Within
    ssb = 0 # Sum of Squares Between
    for acategory in set(category):
        subgroup = numerical[np.where(category == acategory)[0]]
        # Within-group variation
        ssw += sum((subgroup - np.mean(subgroup))**2)
        # Between-group variation: SUM(group n * (group mean - overall mean)^2)
        ssb += len(subgroup) * (np.mean(subgroup) - np.mean(numerical))**2
    eta_squared = ssb / (ssb + ssw)
    return eta_squared
# Reference: https://www.kaggle.com/code/chrisbss1/cramer-s-v-correlation-matrix/notebook
def get_corr_ratio(dataframe):
    import pandas as pd
    import numpy as np
    dataframe_obj = dataframe.select_dtypes(include=object)
    dataframe_num = dataframe.select_dtypes(include='number')
    dataframe_obj_cols = dataframe_obj.columns
    dataframe_num_cols = dataframe_num.columns
    rows= []
    for var1 in dataframe_obj_cols:
        col = []
        for var2 in dataframe_num_cols:
            eta_squared = correlation_ratio(dataframe[var1], dataframe[var2])
            col.append(round(eta_squared,2))
        rows.append(col)
    results = np.array(rows)
    return pd.DataFrame(results, columns = dataframe_num_cols, index = dataframe_obj_cols)
# Function that plots the correlations as a heatmap
def get_corr(dataframe,r=0.3,method='pearson'):
    """
    arg1: dataframe
    arg2: correlation threshold to display
    arg3: method (pearson,spearman,cramers,corr_ratio)
    """
    import matplotlib.pyplot as plt
    import seaborn as sns
    import pandas as pd
    plt.figure(figsize=(18,14))
    # Compute the correlations
    if method == 'cramers':
        corr = get_corr_cramers(dataframe)
    elif method == 'corr_ratio':
        corr = get_corr_ratio(dataframe)
    else:
        corr = dataframe.corr(method=method)
    # Draw the heatmap (cells with |r| below the threshold r are hidden)
    return sns.heatmap(corr, vmax=1, vmin=-1, center=0, mask = abs(corr) < r,linecolor="black",linewidths=0.5, annot=True,annot_kws={"size":8})

def get_corr_list(dataframe,r=0.3,method='pearson'):
    """
    Build a list of correlations with duplicate pairs removed.
    arg1: dataframe
    arg2: correlation threshold to display
    arg3: method (pearson,spearman,cramers,corr_ratio)
    return: list of correlation
    Reference: https://stackoverflow.com/questions/48395350/how-to-remove-duplicates-from-correlation-in-pandas
    """
    import pandas as pd
    import numpy as np
    # Compute the correlations
    if method == 'cramers':
        corr = get_corr_cramers(dataframe)
    elif method == 'corr_ratio':
        corr = get_corr_ratio(dataframe)
    else:
        corr = dataframe.corr(method=method)
    if method != 'corr_ratio':
        # Boolean matrix that is True for the cells to mask (the diagonal and below)
        np_for_mask = np.tril(np.ones(corr.shape)).astype(np.bool_)
        # Replace the lower-left half of the correlation matrix with NaN
        corr = corr.mask(np_for_mask)
    # Convert from a matrix to long format
    corr = corr.stack().reset_index()
    corr.columns = ["col1","col2","r"]
    return corr[abs(corr["r"]) >= r].sort_values(by="r",key=abs,ascending=False).reset_index(drop=True)
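The masking step in `get_corr_list` is easier to see on a tiny example. The sketch below uses a made-up 3x3 correlation matrix (the values and the column names `a`, `b`, `c` are hypothetical) and applies the same idea: hide the diagonal and the lower triangle, reshape to long format, then filter and sort by absolute value.

```python
import numpy as np
import pandas as pd

# Hypothetical 3x3 correlation matrix used only for illustration
corr = pd.DataFrame(
    [[1.0, 0.8, -0.2],
     [0.8, 1.0, 0.5],
     [-0.2, 0.5, 1.0]],
    index=["a", "b", "c"],
    columns=["a", "b", "c"],
)

# True on the diagonal and below: these cells are self-correlations or duplicates
lower = np.tril(np.ones(corr.shape)).astype(bool)

# Blank out the masked cells, then reshape the matrix to long format
pairs = corr.mask(lower).stack().reset_index()
pairs.columns = ["col1", "col2", "r"]

# Keep pairs with |r| >= 0.3, strongest first
pairs = (
    pairs[pairs["r"].abs() >= 0.3]
    .sort_values(by="r", key=abs, ascending=False)
    .reset_index(drop=True)
)
print(pairs)
```

Each pair appears only once, and the weak (a, c) pair is filtered out by the threshold.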
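Pearson's correlation coefficient (quantitative x quantitative)
# Check correlations among the quantitative variables only (ordinal variables excluded)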
get_corr(df_corr_quant)
from IPython.display import HTML
HTML(get_corr_list(df_corr_quant).to_html())
col1 col2 r 0 GarageCars GarageArea 0.882475 1 GrLivArea TotalLivArea 0.880324 2 GrLivArea TotRmsAbvGrd 0.825489 3 TotalBsmtSF TotalLivArea 0.822888 4 TotalBsmtSF 1stFlrSF 0.819530 5 1stFlrSF TotalLivArea 0.797678 6 TotalLivArea SalePrice 0.778959 7 WoodDeckSF TotalWoodDeckPorch 0.743209 8 GrLivArea SalePrice 0.708624 9 2ndFlrSF GrLivArea 0.687501 10 TotRmsAbvGrd TotalLivArea 0.678802 11 BedroomAbvGr TotRmsAbvGrd 0.676620 12 BsmtFinSF1 BsmtFullBath 0.649212 13 GarageCars SalePrice 0.640409 14 GrLivArea FullBath 0.630012 15 GarageArea SalePrice 0.623431 16 FullBath TotalBathRms 0.621043 17 GrLivArea TotalBathRms 0.617494 18 2ndFlrSF TotRmsAbvGrd 0.616423 19 TotalBsmtSF SalePrice 0.613581 20 TotalBathRms SalePrice 0.613005 21 2ndFlrSF HalfBath 0.609707 22 HalfBath TotalBathRms 0.605905 23 1stFlrSF SalePrice 0.605852 24 TotalLivArea TotalBathRms 0.574806 25 FullBath TotalLivArea 0.574403 26 1stFlrSF GrLivArea 0.566024 27 FullBath SalePrice 0.560664 28 GarageArea TotalLivArea 0.558466 29 FullBath TotRmsAbvGrd 0.554784 30 TotRmsAbvGrd SalePrice 0.533723 31 GarageCars TotalLivArea 0.529608 32 BsmtFinSF1 TotalBsmtSF 0.522396 33 GrLivArea BedroomAbvGr 0.521270 34 2ndFlrSF BedroomAbvGr 0.502901 35 BsmtFinSF1 BsmtUnfSF -0.495251 36 1stFlrSF GarageArea 0.489782 37 TotalBsmtSF GarageArea 0.486665 38 2ndFlrSF TotalBathRms 0.482426 39 TotRmsAbvGrd TotalBathRms 0.482310 40 Fireplaces TotalLivArea 0.475416 41 MasVnrArea SalePrice 0.472614 42 FullBath GarageCars 0.469672 43 GrLivArea GarageArea 0.468997 44 BsmtFullBath TotalBathRms 0.468786 45 GarageCars TotalBathRms 0.468671 46 GrLivArea GarageCars 0.467247 47 Fireplaces SalePrice 0.466929 48 GrLivArea Fireplaces 0.461679 49 OpenPorchSF TotalWoodDeckPorch 0.458911 50 TotalBsmtSF GrLivArea 0.454868 51 LotFrontage TotalLivArea 0.448730 52 BsmtFinSF1 1stFlrSF 0.445863 53 MasVnrArea TotalLivArea 0.439385 54 1stFlrSF GarageCars 0.439317 55 TotalBsmtSF GarageCars 0.434585 56 LotFrontage 1stFlrSF 0.434109 57 GarageArea TotalBathRms 
0.425791 58 BsmtUnfSF BsmtFullBath -0.422900 59 2ndFlrSF FullBath 0.421378 60 BsmtFinSF1 TotalBathRms 0.419852 61 GrLivArea HalfBath 0.415772 62 BsmtUnfSF TotalBsmtSF 0.415360 63 BsmtFinSF1 TotalLivArea 0.411084 64 1stFlrSF Fireplaces 0.410531 65 1stFlrSF TotRmsAbvGrd 0.409516 66 FullBath GarageArea 0.405656 67 TotalLivArea TotalWoodDeckPorch 0.397695 68 TotalWoodDeckPorch SalePrice 0.390993 69 MasVnrArea GrLivArea 0.388052 70 BsmtFinSF1 SalePrice 0.386420 71 LotFrontage GrLivArea 0.385190 72 GrLivArea TotalWoodDeckPorch 0.381182 73 LotFrontage TotalBsmtSF 0.381038 74 1stFlrSF FullBath 0.380637 75 BsmtUnfSF TotalLivArea 0.374540 76 MasVnrArea GarageArea 0.370884 77 FullBath BedroomAbvGr 0.363252 78 TotRmsAbvGrd GarageCars 0.362289 79 MasVnrArea GarageCars 0.361945 80 MasVnrArea TotalBsmtSF 0.360067 81 BedroomAbvGr TotalLivArea 0.359459 82 LotFrontage SalePrice 0.349876 83 2ndFlrSF TotalLivArea 0.345689 84 HalfBath TotRmsAbvGrd 0.343415 85 OpenPorchSF TotalLivArea 0.342402 86 Fireplaces TotalBathRms 0.341565 87 MasVnrArea 1stFlrSF 0.339850 88 TotalBsmtSF Fireplaces 0.339519 89 TotalBsmtSF TotalBathRms 0.339473 90 LotFrontage GarageArea 0.339085 91 TotRmsAbvGrd GarageArea 0.337822 92 LotFrontage LotArea 0.335957 93 LotFrontage TotRmsAbvGrd 0.332619 94 GrLivArea OpenPorchSF 0.330224 95 TotRmsAbvGrd Fireplaces 0.326114 96 MasVnrArea TotalBathRms 0.325309 97 WoodDeckSF SalePrice 0.324413 98 TotalBsmtSF FullBath 0.323722 99 2ndFlrSF SalePrice 0.319334 100 BsmtUnfSF 1stFlrSF 0.317987 101 OpenPorchSF SalePrice 0.315856 102 TotalBathRms TotalWoodDeckPorch 0.315110 103 TotalBsmtSF BsmtFullBath 0.307351 104 LotArea TotalLivArea 0.306814 105 Fireplaces GarageCars 0.300789 106 1stFlrSF TotalBathRms 0.300033
For each pair with a correlation coefficient of 0.7 or higher, I will drop one of the two variables.
As a rule, I keep whichever has the higher correlation with SalePrice.
GarageCars GarageArea 0.882475
Keep GarageCars.
GrLivArea TotalLivArea 0.880324
Keep TotalLivArea.
TotalBsmtSF TotalLivArea 0.822888
Keep TotalLivArea.
1stFlrSF TotalLivArea 0.797678
Keep TotalLivArea.
WoodDeckSF TotalWoodDeckPorch 0.743209
Keep TotalWoodDeckPorch.
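The decisions above were made by hand. As a rough sketch of how the same rule could be automated, the helper below walks through the feature pairs from the strongest correlation down and, for each pair at or above the threshold, suggests dropping the member that is less correlated with the target. The function name `suggest_drops` and the toy columns are my own inventions for illustration, and a greedy procedure like this will not necessarily reproduce every manual decision.

```python
import pandas as pd

def suggest_drops(df, target, threshold=0.7, method="pearson"):
    """Suggest one column to drop from each highly correlated feature pair."""
    corr = df.corr(method=method).abs()
    features = [c for c in corr.columns if c != target]
    # Collect each feature pair once, keeping only pairs at or above the threshold
    pairs = [
        (corr.loc[a, b], a, b)
        for i, a in enumerate(features)
        for b in features[i + 1:]
        if corr.loc[a, b] >= threshold
    ]
    pairs.sort(reverse=True)  # strongest pair first
    dropped = []
    for _, a, b in pairs:
        if a in dropped or b in dropped:
            continue  # this pair is already resolved by an earlier drop
        # Drop the member that is less correlated with the target
        weaker = a if corr.loc[a, target] < corr.loc[b, target] else b
        dropped.append(weaker)
    return dropped

# Hypothetical toy data: "a" and "b" are strongly correlated, "y" follows "b"
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [1, 2, 3, 4, 6, 5],
    "c": [1, -1, 1, -1, 1, -1],
})
toy["y"] = toy["b"] * 10
to_drop = suggest_drops(toy, target="y")
print(to_drop)
```

In the toy data only the (a, b) pair exceeds the threshold, and `b` tracks the target more closely, so `a` is the suggested drop.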
# Variables to drop
cols_to_erase=[
'GarageArea'
,'GrLivArea'
,'TotalBsmtSF'
,'1stFlrSF'
,'WoodDeckSF'
]
# Drop them and check the result again
HTML(get_corr_list(df_corr_quant.drop(cols_to_erase,axis=1)).to_html())
col1 col2 r 0 TotalLivArea SalePrice 0.778959 1 TotRmsAbvGrd TotalLivArea 0.678802 2 BedroomAbvGr TotRmsAbvGrd 0.676620 3 BsmtFinSF1 BsmtFullBath 0.649212 4 GarageCars SalePrice 0.640409 5 FullBath TotalBathRms 0.621043 6 2ndFlrSF TotRmsAbvGrd 0.616423 7 TotalBathRms SalePrice 0.613005 8 2ndFlrSF HalfBath 0.609707 9 HalfBath TotalBathRms 0.605905 10 TotalLivArea TotalBathRms 0.574806 11 FullBath TotalLivArea 0.574403 12 FullBath SalePrice 0.560664 13 FullBath TotRmsAbvGrd 0.554784 14 TotRmsAbvGrd SalePrice 0.533723 15 GarageCars TotalLivArea 0.529608 16 2ndFlrSF BedroomAbvGr 0.502901 17 BsmtFinSF1 BsmtUnfSF -0.495251 18 2ndFlrSF TotalBathRms 0.482426 19 TotRmsAbvGrd TotalBathRms 0.482310 20 Fireplaces TotalLivArea 0.475416 21 MasVnrArea SalePrice 0.472614 22 FullBath GarageCars 0.469672 23 BsmtFullBath TotalBathRms 0.468786 24 GarageCars TotalBathRms 0.468671 25 Fireplaces SalePrice 0.466929 26 OpenPorchSF TotalWoodDeckPorch 0.458911 27 LotFrontage TotalLivArea 0.448730 28 MasVnrArea TotalLivArea 0.439385 29 BsmtUnfSF BsmtFullBath -0.422900 30 2ndFlrSF FullBath 0.421378 31 BsmtFinSF1 TotalBathRms 0.419852 32 BsmtFinSF1 TotalLivArea 0.411084 33 TotalLivArea TotalWoodDeckPorch 0.397695 34 TotalWoodDeckPorch SalePrice 0.390993 35 BsmtFinSF1 SalePrice 0.386420 36 BsmtUnfSF TotalLivArea 0.374540 37 FullBath BedroomAbvGr 0.363252 38 TotRmsAbvGrd GarageCars 0.362289 39 MasVnrArea GarageCars 0.361945 40 BedroomAbvGr TotalLivArea 0.359459 41 LotFrontage SalePrice 0.349876 42 2ndFlrSF TotalLivArea 0.345689 43 HalfBath TotRmsAbvGrd 0.343415 44 OpenPorchSF TotalLivArea 0.342402 45 Fireplaces TotalBathRms 0.341565 46 LotFrontage LotArea 0.335957 47 LotFrontage TotRmsAbvGrd 0.332619 48 TotRmsAbvGrd Fireplaces 0.326114 49 MasVnrArea TotalBathRms 0.325309 50 2ndFlrSF SalePrice 0.319334 51 OpenPorchSF SalePrice 0.315856 52 TotalBathRms TotalWoodDeckPorch 0.315110 53 LotArea TotalLivArea 0.306814 54 Fireplaces GarageCars 0.300789
Spearman's rank correlation coefficient (ordinal x ordinal)
# Ordinal variables and the target variable only
get_corr(df_corr_ordinal,method='spearman')
# Show as a list
HTML(get_corr_list(df_corr_ordinal,method='spearman').to_html())
col1 col2 r 0 GarageQual GarageCond 0.817132 1 OverallQual SalePrice 0.809829 2 ExterQual KitchenQual 0.725266 3 OverallQual ExterQual 0.715988 4 ExterQual SalePrice 0.684014 5 BsmtQual SalePrice 0.678026 6 OverallQual BsmtQual 0.673048 7 KitchenQual SalePrice 0.672849 8 OverallQual KitchenQual 0.660498 9 ExterQual BsmtQual 0.645766 10 GarageFinish SalePrice 0.633974 11 BsmtQual KitchenQual 0.575112 12 OverallQual GarageFinish 0.567090 13 BsmtQual GarageFinish 0.555535 14 ExterQual HeatingQC 0.552073 15 FireplaceQu SalePrice 0.537602 16 ExterQual GarageFinish 0.536103 17 HeatingQC KitchenQual 0.532787 18 HeatingQC SalePrice 0.491392 19 OverallQual FireplaceQu 0.481197 20 KitchenQual GarageFinish 0.480438 21 OverallQual HeatingQC 0.473591 22 BsmtQual HeatingQC 0.453746 23 GarageFinish GarageCond 0.419086 24 GarageFinish GarageQual 0.415570 25 HeatingQC GarageFinish 0.406279 26 BsmtQual BsmtExposure 0.380819 27 FireplaceQu GarageFinish 0.380119 28 BsmtQual BsmtFinType1 0.375364 29 BsmtExposure BsmtFinType1 0.374814 30 BsmtFinType1 SalePrice 0.361625 31 ExterQual FireplaceQu 0.352144 32 GarageQual SalePrice 0.351082 33 KitchenQual FireplaceQu 0.348324 34 BsmtExposure SalePrice 0.344207 35 GarageCond SalePrice 0.339015 36 OverallCond ExterCond 0.329091 37 LotShape SalePrice -0.321055 38 BsmtQual FireplaceQu 0.317001 39 BsmtQual BsmtCond 0.316947 40 BsmtCond BsmtFinType2 0.301495
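Here too, for each pair with a correlation coefficient of 0.7 or higher, I will drop one of the two variables.
As a rule, I keep the variable that has the higher correlation with the target variable (SalePrice).
GarageQual GarageCond 0.817132
Keep GarageQual.
ExterQual KitchenQual 0.725266
Keep ExterQual.
OverallQual ExterQual 0.715988
Keep OverallQual.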
# Variables to drop
cols_to_erase=[
'GarageCond'
,'KitchenQual'
,'ExterQual'
]
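# Drop them and check the result again
# Note: method is not passed here, so the values shown below are Pearson (the default), not Spearman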
HTML(get_corr_list(df_corr_ordinal.drop(cols_to_erase,axis=1)).to_html())
index | col1 | col2 | r |
---|---|---|---|
0 | OverallQual | SalePrice | 0.790982 |
1 | BsmtQual | BsmtCond | 0.633713 |
2 | OverallQual | BsmtQual | 0.629379 |
3 | BsmtQual | SalePrice | 0.585207 |
4 | OverallQual | GarageFinish | 0.556863 |
5 | GarageFinish | SalePrice | 0.549247 |
6 | FireplaceQu | SalePrice | 0.520438 |
7 | OverallQual | FireplaceQu | 0.490788 |
8 | BsmtQual | GarageFinish | 0.485184 |
9 | GarageFinish | GarageQual | 0.482399 |
10 | OverallQual | HeatingQC | 0.457083 |
11 | HeatingQC | SalePrice | 0.427649 |
12 | BsmtQual | BsmtExposure | 0.399339 |
13 | BsmtQual | HeatingQC | 0.397169 |
14 | FireplaceQu | GarageFinish | 0.394891 |
15 | HeatingQC | GarageFinish | 0.392244 |
16 | OverallCond | ExterCond | 0.389163 |
17 | BsmtQual | BsmtFinType1 | 0.377398 |
18 | BsmtExposure | SalePrice | 0.374696 |
19 | BsmtExposure | BsmtFinType1 | 0.347840 |
20 | BsmtQual | FireplaceQu | 0.307337 |
21 | BsmtFinType1 | SalePrice | 0.304908 |
Cramér's V (nominal x nominal)
get_corr(df_corr_obj,method='cramers')
HTML(get_corr_list(df_corr_obj,method='cramers').to_html())
col1 col2 r 0 MSSubClass BldgType 0.90 1 MSSubClass HouseStyle 0.85 2 Exterior1st Exterior2nd 0.76 3 YearBuilt GarageYrBlt 0.75 4 YearBuilt YearRemodAdd 0.73 5 MSZoning Neighborhood 0.65 6 YearRemodAdd GarageYrBlt 0.64 7 YearBuilt Foundation 0.59 8 YearBuilt CentralAir 0.57 9 YearBuilt Heating 0.55 10 YearBuilt PavedDrive 0.54 11 GarageType GarageYrBlt 0.52 12 Foundation GarageYrBlt 0.51 13 CentralAir GarageYrBlt 0.50 14 HouseStyle YearBuilt 0.49 15 Alley YearBuilt 0.48 16 SaleType SaleCondition 0.48 17 Condition2 YearBuilt 0.47 18 YearBuilt MasVnrType 0.47 19 GarageYrBlt PavedDrive 0.46 20 RoofStyle RoofMatl 0.46 21 Heating CentralAir 0.46 22 MSSubClass YearBuilt 0.46 23 Neighborhood YearBuilt 0.45 24 Alley Neighborhood 0.45 25 MSZoning YearBuilt 0.45 26 MSSubClass CentralAir 0.45 27 MasVnrType GarageYrBlt 0.45 28 Neighborhood Foundation 0.44 29 YearBuilt SaleCondition 0.44 30 YearRemodAdd CentralAir 0.44 31 Neighborhood BldgType 0.44 32 GarageYrBlt SaleCondition 0.43 33 CentralAir Electrical 0.42 34 YearBuilt Exterior1st 0.41 35 YearRemodAdd MasVnrType 0.41 36 BldgType YearBuilt 0.40 37 MSZoning GarageYrBlt 0.40 38 Neighborhood CentralAir 0.40 39 Neighborhood MasVnrType 0.40 40 Neighborhood GarageYrBlt 0.39 41 MSSubClass GarageYrBlt 0.39 42 YearBuilt GarageType 0.39 43 MSSubClass Neighborhood 0.39 44 YearRemodAdd Foundation 0.39 45 MSZoning Alley 0.39 46 YearRemodAdd SaleCondition 0.39 47 YearBuilt Exterior2nd 0.39 48 LandContour YearBuilt 0.38 49 Alley GarageYrBlt 0.38 50 LandContour Neighborhood 0.38 51 Exterior2nd GarageYrBlt 0.38 52 HouseStyle GarageYrBlt 0.38 53 Electrical GarageYrBlt 0.37 54 MSSubClass Foundation 0.37 55 Exterior1st GarageYrBlt 0.37 56 Foundation CentralAir 0.37 57 CentralAir GarageType 0.37 58 YearBuilt SaleType 0.36 59 Exterior1st CentralAir 0.36 60 Exterior2nd CentralAir 0.35 61 YearBuilt Electrical 0.35 62 MSSubClass MSZoning 0.35 63 Street YearBuilt 0.35 64 Condition2 MiscFeature 0.35 65 Neighborhood YearRemodAdd 0.34 66 Neighborhood 
Exterior2nd 0.34 67 MSSubClass GarageType 0.34 68 CentralAir PavedDrive 0.34 69 BldgType GarageYrBlt 0.34 70 Street GarageYrBlt 0.34 71 GarageYrBlt SaleType 0.34 72 LandContour GarageYrBlt 0.33 73 Exterior1st Foundation 0.33 74 Exterior2nd Foundation 0.33 75 Neighborhood PavedDrive 0.33 76 YearRemodAdd SaleType 0.32 77 MSSubClass Alley 0.32 78 YearBuilt RoofStyle 0.32 79 MSSubClass PavedDrive 0.32 80 Condition2 RoofStyle 0.32 81 Neighborhood GarageType 0.32 82 Neighborhood HouseStyle 0.32 83 Street YearRemodAdd 0.32 84 Heating GarageYrBlt 0.32 85 YearBuilt YrSold 0.31 86 Neighborhood Exterior1st 0.31 87 BldgType YearRemodAdd 0.31 88 YearRemodAdd Exterior1st 0.31 89 MSSubClass YearRemodAdd 0.30 90 MSZoning YearRemodAdd 0.30 91 MSZoning CentralAir 0.30 92 GarageType PavedDrive 0.30 93 YearBuilt RoofMatl 0.30 94 GarageYrBlt YrSold 0.30 95 YearRemodAdd Exterior2nd 0.30 96 Condition1 YearBuilt 0.30
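The following pairs show the strongest association:
0 MSSubClass BldgType 0.90
1 MSSubClass HouseStyle 0.85
2 Exterior1st Exterior2nd 0.76
3 YearBuilt GarageYrBlt 0.75
4 YearBuilt YearRemodAdd 0.73
To decide which one of each pair to keep, I will keep the variable with the higher correlation ratio with SalePrice.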
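For intuition about the Cramér's V values above, here is a small self-contained sketch on made-up categories (the function name `cramers_v` and the toy labels are hypothetical). It computes the chi-squared statistic directly from the observed and expected counts, without a continuity correction, and then applies the same formula as `cramersV`: the square root of chi-squared divided by n times (min(rows, columns) - 1).

```python
import numpy as np
import pandas as pd

def cramers_v(x, y):
    """Cramér's V for two categorical sequences (no continuity correction)."""
    table = pd.crosstab(pd.Series(x), pd.Series(y)).to_numpy().astype(float)
    n = table.sum()
    # Expected counts if the two variables were independent
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    k = min(table.shape) - 1
    return np.sqrt(chi2 / (n * k))

# Hypothetical toy labels
kind = ["a", "a", "b", "b"]
v_identical = cramers_v(kind, ["u", "u", "v", "v"])  # one variable determines the other
v_unrelated = cramers_v(kind, ["u", "v", "u", "v"])  # the counts show no association

print(v_identical, v_unrelated)
```

A value of 1 means that knowing one variable fully determines the other, and 0 means the cross-tabulated counts match what independence would predict.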
Correlation ratio (quantitative x nominal)
# Check the correlation ratio between SalePrice and the nominal variables that had a high Cramér's V
chk_cols=[
'MSSubClass'
,'BldgType'
,'HouseStyle'
,'Exterior1st'
,'Exterior2nd'
,'YearBuilt'
,'GarageYrBlt'
,'YearRemodAdd'
,'SalePrice'
]
get_corr(df_corr[chk_cols],method='corr_ratio',r=0)
HTML(get_corr_list(df_corr[chk_cols],method='corr_ratio',r=0).to_html())
index | col1 | col2 | r |
---|---|---|---|
0 | YearBuilt | SalePrice | 0.44 |
1 | GarageYrBlt | SalePrice | 0.39 |
2 | YearRemodAdd | SalePrice | 0.31 |
3 | MSSubClass | SalePrice | 0.25 |
4 | Exterior1st | SalePrice | 0.15 |
5 | Exterior2nd | SalePrice | 0.15 |
6 | HouseStyle | SalePrice | 0.09 |
7 | BldgType | SalePrice | 0.03 |
0 MSSubClass BldgType 0.90
Keep MSSubClass.
1 MSSubClass HouseStyle 0.85
Keep MSSubClass.
2 Exterior1st Exterior2nd 0.76
Either could be kept, since both have a correlation ratio of 0.15 with SalePrice. Keep Exterior1st.
3 YearBuilt GarageYrBlt 0.75
Keep YearBuilt.
4 YearBuilt YearRemodAdd 0.73
Keep YearBuilt.
# Variables to drop
cols_to_erase=[
'GarageArea'
,'GrLivArea'
,'TotalBsmtSF'
,'1stFlrSF'
,'WoodDeckSF'
,'GarageCond'
,'KitchenQual'
,'ExterQual'
,'BldgType'
,'HouseStyle'
,'Exterior2nd'
,'GarageYrBlt'
,'YearRemodAdd'
]
get_corr(df_corr.drop("Id",axis=1).drop(cols_to_erase,axis=1),method='corr_ratio')
HTML(get_corr_list(df_corr.drop("Id",axis=1).drop(cols_to_erase,axis=1),method='corr_ratio').to_html())
col1 col2 r 0 GarageType GarageQual 0.89 1 MiscFeature MiscVal 0.88 2 MSSubClass 2ndFlrSF 0.81 3 MSSubClass KitchenAbvGr 0.61 4 YearBuilt OverallQual 0.56 5 Neighborhood SalePrice 0.55 6 YearBuilt BsmtQual 0.53 7 Foundation BsmtQual 0.53 8 Neighborhood OverallQual 0.52 9 Foundation BsmtCond 0.48 10 GarageType GarageFinish 0.47 11 MasVnrType MasVnrArea 0.47 12 MSSubClass HalfBath 0.47 13 YearBuilt HeatingQC 0.46 14 YearBuilt GarageFinish 0.45 15 YearBuilt SalePrice 0.44 16 YearBuilt GarageCars 0.43 17 GarageType GarageCars 0.41 18 Neighborhood BsmtQual 0.41 19 YearBuilt FullBath 0.41 20 MSSubClass TotalBathRms 0.40 21 YearBuilt TotalBathRms 0.40 22 LandContour LandSlope 0.39 23 Neighborhood GarageCars 0.38 24 Neighborhood GarageFinish 0.38 25 MSSubClass TotRmsAbvGrd 0.38 26 Neighborhood FullBath 0.36 27 MSSubClass BedroomAbvGr 0.35 28 MSSubClass GarageFinish 0.34 29 Foundation OverallQual 0.34 30 Neighborhood TotalBathRms 0.34 31 Neighborhood TotalLivArea 0.31 32 Neighborhood HeatingQC 0.30 33 MSSubClass LotFrontage 0.30
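My understanding is that the correlation ratio indicates whether the distribution of a numeric variable differs between the groups of a categorical variable.
SalePrice has a moderately high value with both Neighborhood and YearBuilt, which I take to mean that SalePrice likely differs by Neighborhood and by YearBuilt.
Here there seems to be little point in keeping only one of the two variables (if anything, it may be better to keep them), so I take no action based on the correlation ratio results.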
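The interpretation above can be checked on a tiny example. The sketch below recomputes the same quantity as `correlation_ratio` (between-group variation divided by total variation) with pandas, on made-up groups; the function name `eta_squared` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def eta_squared(categories, values):
    """Between-group variation divided by total variation."""
    values = pd.Series(np.asarray(values, dtype=float))
    # Mean of each observation's own group, aligned row by row
    group_mean = values.groupby(np.asarray(categories)).transform("mean")
    ssw = ((values - group_mean) ** 2).sum()          # within-group variation
    ssb = ((group_mean - values.mean()) ** 2).sum()   # between-group variation
    return ssb / (ssb + ssw)

groups = ["A", "A", "B", "B"]
separated = eta_squared(groups, [1, 1, 5, 5])    # groups have clearly different values
overlapping = eta_squared(groups, [1, 5, 1, 5])  # groups have the same mean

print(separated, overlapping)
```

When the groups are cleanly separated the value is 1, and when the group means coincide it is 0, which matches the reading that a high value means the numeric variable differs by category.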
Finally, check the correlations among all numeric variables (Pearson's correlation coefficient)
# Variables to drop
cols_to_erase=[
'GarageArea'
,'GrLivArea'
,'TotalBsmtSF'
,'1stFlrSF'
,'WoodDeckSF'
,'GarageCond'
,'KitchenQual'
,'ExterQual'
#,'BldgType'
#,'HouseStyle'
#,'Exterior2nd'
#,'GarageYrBlt'
#,'YearRemodAdd'
]
# Show correlations among all numeric variables, including the ordinal variables already converted to numbers
get_corr(df_corr_number.drop(cols_to_erase,axis=1))
HTML(get_corr_list(df_corr_number.drop(cols_to_erase,axis=1)).to_html())
col1 col2 r 0 PoolArea PoolQC 0.899924 1 Fireplaces FireplaceQu 0.863241 2 OverallQual SalePrice 0.790982 3 BsmtFinType2 BsmtFinSF2 0.788986 4 TotalLivArea SalePrice 0.778959 5 BsmtFinType1 BsmtFinSF1 0.695751 6 TotRmsAbvGrd TotalLivArea 0.678802 7 BedroomAbvGr TotRmsAbvGrd 0.676620 8 OverallQual TotalLivArea 0.664830 9 BsmtFinSF1 BsmtFullBath 0.649212 10 GarageCars SalePrice 0.640409 11 BsmtQual BsmtCond 0.633713 12 OverallQual BsmtQual 0.629379 13 FullBath TotalBathRms 0.621043 14 2ndFlrSF TotRmsAbvGrd 0.616423 15 TotalBathRms SalePrice 0.613005 16 2ndFlrSF HalfBath 0.609707 17 HalfBath TotalBathRms 0.605905 18 OverallQual GarageCars 0.600671 19 BsmtFinType1 BsmtFullBath 0.589056 20 BsmtQual SalePrice 0.585207 21 GarageFinish GarageCars 0.579729 22 GarageCars GarageQual 0.576622 23 TotalLivArea TotalBathRms 0.574806 24 FullBath TotalLivArea 0.574403 25 FullBath SalePrice 0.560664 26 OverallQual GarageFinish 0.556863 27 FullBath TotRmsAbvGrd 0.554784 28 OverallQual FullBath 0.550600 29 GarageFinish SalePrice 0.549247 30 TotRmsAbvGrd SalePrice 0.533723 31 OverallQual TotalBathRms 0.529906 32 GarageCars TotalLivArea 0.529608 33 FireplaceQu SalePrice 0.520438 34 BsmtQual TotalLivArea 0.509830 35 2ndFlrSF BedroomAbvGr 0.502901 36 BsmtFinSF1 BsmtUnfSF -0.495251 37 OverallQual FireplaceQu 0.490788 38 BsmtQual GarageFinish 0.485184 39 FireplaceQu TotalLivArea 0.485004 40 2ndFlrSF TotalBathRms 0.482426 41 GarageFinish GarageQual 0.482399 42 TotRmsAbvGrd TotalBathRms 0.482310 43 Fireplaces TotalLivArea 0.475416 44 MasVnrArea SalePrice 0.472614 45 FullBath GarageCars 0.469672 46 BsmtQual TotalBathRms 0.469031 47 BsmtFullBath TotalBathRms 0.468786 48 GarageCars TotalBathRms 0.468671 49 Fireplaces SalePrice 0.466929 50 OpenPorchSF TotalWoodDeckPorch 0.458911 51 OverallQual HeatingQC 0.457083 52 GarageFinish TotalBathRms 0.454948 53 BsmtQual GarageCars 0.449194 54 LotFrontage TotalLivArea 0.448730 55 MasVnrArea TotalLivArea 0.439385 56 LotArea LandSlope -0.436868 57 HeatingQC 
SalePrice 0.427649 58 OverallQual TotRmsAbvGrd 0.427452 59 GarageFinish TotalLivArea 0.423132 60 BsmtUnfSF BsmtFullBath -0.422900 61 2ndFlrSF FullBath 0.421378 62 BsmtFinSF1 TotalBathRms 0.419852 63 BsmtFinSF1 TotalLivArea 0.411084 64 FullBath GarageFinish 0.407588 65 OverallQual MasVnrArea 0.407252 66 BsmtFinType1 TotalBathRms 0.402710 67 BsmtFinType1 BsmtUnfSF -0.400184 68 BsmtQual BsmtExposure 0.399339 69 TotalLivArea TotalWoodDeckPorch 0.397695 70 BsmtQual HeatingQC 0.397169 71 OverallQual Fireplaces 0.396765 72 FireplaceQu GarageFinish 0.394891 73 HeatingQC GarageFinish 0.392244 74 TotalWoodDeckPorch SalePrice 0.390993 75 OverallCond ExterCond 0.389163 76 BsmtFinSF1 SalePrice 0.386420 77 BsmtQual BsmtFinType1 0.377398 78 BsmtExposure SalePrice 0.374696 79 BsmtUnfSF TotalLivArea 0.374540 80 BsmtQual FullBath 0.371243 81 FireplaceQu GarageCars 0.370034 82 BsmtExposure BsmtFinSF1 0.369115 83 FullBath BedroomAbvGr 0.363252 84 TotRmsAbvGrd GarageCars 0.362289 85 MasVnrArea GarageCars 0.361945 86 BedroomAbvGr TotalLivArea 0.359459 87 TotRmsAbvGrd FireplaceQu 0.355589 88 LotFrontage SalePrice 0.349876 89 BsmtExposure BsmtFinType1 0.347840 90 2ndFlrSF TotalLivArea 0.345689 91 HalfBath TotRmsAbvGrd 0.343415 92 OpenPorchSF TotalLivArea 0.342402 93 Fireplaces TotalBathRms 0.341565 94 BsmtExposure BsmtFullBath 0.338672 95 LotFrontage LotArea 0.335957 96 FireplaceQu TotalBathRms 0.335915 97 HeatingQC FullBath 0.333499 98 LotFrontage TotRmsAbvGrd 0.332619 99 TotRmsAbvGrd Fireplaces 0.326114 100 HeatingQC GarageCars 0.325347 101 MasVnrArea TotalBathRms 0.325309 102 Fireplaces GarageFinish 0.324376 103 2ndFlrSF SalePrice 0.319334 104 OpenPorchSF SalePrice 0.315856 105 LotArea LotShape -0.315484 106 TotalBathRms TotalWoodDeckPorch 0.315110 107 OverallQual OpenPorchSF 0.308819 108 OverallQual BsmtUnfSF 0.308159 109 BsmtQual FireplaceQu 0.307337 110 LotArea TotalLivArea 0.306814 111 OverallQual TotalWoodDeckPorch 0.306097 112 BsmtFinType1 SalePrice 0.304908 113 BsmtQual 
BsmtFinSF1 0.304607 114 HeatingQC TotalLivArea 0.303991 115 Fireplaces GarageCars 0.300789
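Finally, check the correlations among all numeric variables (Spearman's rank correlation coefficient)
# Show correlations among all numeric variables, including the ordinal variables already converted to numbers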
get_corr(df_corr_number.drop(cols_to_erase,axis=1),method="spearman")
HTML(get_corr_list(df_corr_number.drop(cols_to_erase,axis=1),method='spearman').to_html())
col1 col2 r 0 PoolArea PoolQC 0.999991 1 BsmtFinType2 BsmtFinSF2 0.902542 2 Fireplaces FireplaceQu 0.895131 3 TotalLivArea SalePrice 0.814984 4 OverallQual SalePrice 0.809829 5 BsmtFinType1 BsmtFinSF1 0.795755 6 TotalBathRms SalePrice 0.691160 7 GarageCars SalePrice 0.690711 8 BsmtQual SalePrice 0.678026 9 TotRmsAbvGrd TotalLivArea 0.676489 10 BsmtFinSF1 BsmtFullBath 0.674175 11 OverallQual BsmtQual 0.673048 12 BedroomAbvGr TotRmsAbvGrd 0.667822 13 OverallQual TotalLivArea 0.655684 14 FullBath SalePrice 0.635957 15 GarageFinish SalePrice 0.633974 16 FullBath TotalBathRms 0.633341 17 2ndFlrSF HalfBath 0.625272 18 FullBath TotalLivArea 0.616407 19 OverallQual GarageCars 0.608756 20 LotFrontage LotArea 0.608313 21 HalfBath TotalBathRms 0.607923 22 BsmtFinType1 BsmtFullBath 0.595755 23 2ndFlrSF TotRmsAbvGrd 0.587189 24 TotalLivArea TotalBathRms 0.586014 25 OverallQual FullBath 0.576372 26 BsmtFinSF1 BsmtUnfSF -0.573638 27 GarageCars TotalLivArea 0.573071 28 OverallQual GarageFinish 0.567090 29 FullBath TotRmsAbvGrd 0.558665 30 BsmtQual GarageFinish 0.555535 31 BsmtQual GarageCars 0.551884 32 OverallQual TotalBathRms 0.548317 33 GarageFinish GarageCars 0.548214 34 BsmtQual TotalBathRms 0.538023 35 FireplaceQu SalePrice 0.537602 36 TotRmsAbvGrd SalePrice 0.532586 37 Fireplaces SalePrice 0.519247 38 FullBath GarageCars 0.518310 39 BsmtQual FullBath 0.510767 40 2ndFlrSF BedroomAbvGr 0.510443 41 BsmtQual TotalLivArea 0.507785 42 GarageCars TotalBathRms 0.506718 43 FireplaceQu TotalLivArea 0.502548 44 Fireplaces TotalLivArea 0.493839 45 HeatingQC SalePrice 0.491392 46 LotArea TotalLivArea 0.485170 47 GarageFinish TotalBathRms 0.484792 48 OverallQual FireplaceQu 0.481197 49 TotRmsAbvGrd TotalBathRms 0.478120 50 OpenPorchSF SalePrice 0.477561 51 OverallQual HeatingQC 0.473591 52 GarageFinish TotalLivArea 0.457168 53 LotArea SalePrice 0.456461 54 BsmtQual HeatingQC 0.453746 55 2ndFlrSF TotalBathRms 0.452348 56 BsmtUnfSF BsmtFullBath -0.447472 57 BsmtFullBath TotalBathRms 
0.444909 58 OpenPorchSF TotalWoodDeckPorch 0.442410 59 FullBath GarageFinish 0.435853 60 OverallQual OpenPorchSF 0.435046 61 LotFrontage TotalLivArea 0.431782 62 OverallQual TotRmsAbvGrd 0.427806 63 TotalWoodDeckPorch SalePrice 0.425249 64 GarageCars GarageQual 0.420815 65 OverallQual Fireplaces 0.420626 66 BsmtFinType1 TotalBathRms 0.418561 67 MasVnrArea SalePrice 0.415906 68 OpenPorchSF TotalLivArea 0.415838 69 GarageFinish GarageQual 0.415570 70 OverallQual MasVnrArea 0.408136 71 LotFrontage SalePrice 0.407481 72 HeatingQC GarageFinish 0.406279 73 LotArea TotRmsAbvGrd 0.405924 74 BsmtQual OpenPorchSF 0.400801 75 OpenPorchSF TotalBathRms 0.400649 76 MasVnrArea TotalLivArea 0.400072 77 MasVnrArea GarageCars 0.398213 78 BedroomAbvGr TotalLivArea 0.391959 79 TotalLivArea TotalWoodDeckPorch 0.391641 80 BsmtFinType1 BsmtUnfSF -0.390439 81 BsmtFinSF1 TotalBathRms 0.388938 82 TotRmsAbvGrd GarageCars 0.386244 83 2ndFlrSF FullBath 0.384187 84 BsmtQual BsmtExposure 0.380819 85 FireplaceQu GarageFinish 0.380119 86 BsmtQual BsmtFinType1 0.375364 87 BsmtExposure BsmtFinType1 0.374814 88 FullBath OpenPorchSF 0.370152 89 TotRmsAbvGrd FireplaceQu 0.367548 90 BsmtFinType1 SalePrice 0.361625 91 HalfBath TotRmsAbvGrd 0.359001 92 Fireplaces TotalBathRms 0.358501 93 BsmtUnfSF TotalLivArea 0.358177 94 FireplaceQu GarageCars 0.357182 95 Fireplaces GarageFinish 0.351574 96 GarageQual SalePrice 0.351082 97 LotArea Fireplaces 0.350198 98 HeatingQC FullBath 0.348801 99 LotFrontage GarageCars 0.348541 100 TotRmsAbvGrd Fireplaces 0.346829 101 BsmtExposure SalePrice 0.344207 102 HeatingQC GarageCars 0.343988 103 HalfBath SalePrice 0.343008 104 GarageCars OpenPorchSF 0.342701 105 LotFrontage TotRmsAbvGrd 0.341900 106 LotArea LotShape -0.341581 107 GarageFinish OpenPorchSF 0.340726 108 LotArea GarageCars 0.340195 109 BsmtExposure BsmtFinSF1 0.340172 110 MasVnrArea GarageFinish 0.339790 111 LotArea BedroomAbvGr 0.337788 112 FullBath BedroomAbvGr 0.336515 113 HeatingQC TotalLivArea 0.334766 114 
MasVnrArea TotalBathRms 0.331010 115 OverallCond ExterCond 0.329091 116 OverallQual TotalWoodDeckPorch 0.329067 117 Fireplaces GarageCars 0.325520 118 TotalBathRms TotalWoodDeckPorch 0.324840 119 BsmtExposure BsmtFullBath 0.323130 120 LotShape SalePrice -0.321055 121 FireplaceQu TotalBathRms 0.318769 122 MasVnrArea BsmtQual 0.318218 123 LotArea FireplaceQu 0.317002 124 BsmtQual FireplaceQu 0.317001 125 BsmtQual BsmtCond 0.316947 126 HeatingQC TotalBathRms 0.312123 127 Fireplaces TotalWoodDeckPorch 0.307912 128 HeatingQC OpenPorchSF 0.303542 129 LotFrontage BedroomAbvGr 0.302878 130 BsmtFinSF1 SalePrice 0.301871 131 BsmtCond BsmtFinType2 0.301495
# We keep whichever is more correlated with SalePrice, so check the drop candidates against SalePrice
chk_cols=[
'PoolArea'
,'PoolQC'
,'Fireplaces'
,'FireplaceQu'
,'BsmtFinType2'
,'BsmtFinSF2'
,'SalePrice'
]
HTML(get_corr_list(df_corr_number[chk_cols],r=0).to_html())
index | col1 | col2 | r |
---|---|---|---|
0 | PoolArea | PoolQC | 0.899924 |
1 | Fireplaces | FireplaceQu | 0.863241 |
2 | BsmtFinType2 | BsmtFinSF2 | 0.788986 |
3 | FireplaceQu | SalePrice | 0.520438 |
4 | Fireplaces | SalePrice | 0.466929 |
5 | PoolQC | SalePrice | 0.115484 |
6 | PoolQC | Fireplaces | 0.095621 |
7 | PoolArea | Fireplaces | 0.095074 |
8 | PoolArea | SalePrice | 0.092404 |
9 | PoolQC | FireplaceQu | 0.054380 |
10 | Fireplaces | BsmtFinType2 | 0.052070 |
11 | PoolArea | FireplaceQu | 0.048737 |
12 | Fireplaces | BsmtFinSF2 | 0.046921 |
13 | PoolArea | BsmtFinSF2 | 0.041709 |
14 | PoolQC | BsmtFinSF2 | 0.014524 |
15 | PoolArea | BsmtFinType2 | 0.013058 |
16 | BsmtFinSF2 | SalePrice | -0.011378 |
17 | BsmtFinType2 | SalePrice | -0.005323 |
18 | PoolQC | BsmtFinType2 | 0.004901 |
19 | FireplaceQu | BsmtFinSF2 | 0.001518 |
20 | FireplaceQu | BsmtFinType2 | 0.000022 |
HTML(get_corr_list(df_corr_number[chk_cols],method='spearman',r=0).to_html())
index | col1 | col2 | r |
---|---|---|---|
0 | PoolArea | PoolQC | 0.999991 |
1 | BsmtFinType2 | BsmtFinSF2 | 0.902542 |
2 | Fireplaces | FireplaceQu | 0.895131 |
3 | FireplaceQu | SalePrice | 0.537602 |
4 | Fireplaces | SalePrice | 0.519247 |
5 | PoolQC | Fireplaces | 0.083937 |
6 | PoolArea | Fireplaces | 0.083876 |
7 | PoolArea | BsmtFinSF2 | 0.068076 |
8 | PoolQC | BsmtFinSF2 | 0.068000 |
9 | Fireplaces | BsmtFinType2 | 0.060091 |
10 | PoolArea | BsmtFinType2 | 0.058715 |
11 | PoolQC | BsmtFinType2 | 0.058663 |
12 | PoolQC | SalePrice | 0.058469 |
13 | PoolArea | SalePrice | 0.058453 |
14 | PoolQC | FireplaceQu | 0.052000 |
15 | PoolArea | FireplaceQu | 0.051880 |
16 | BsmtFinType2 | SalePrice | 0.039813 |
17 | BsmtFinSF2 | SalePrice | -0.038806 |
18 | Fireplaces | BsmtFinSF2 | 0.029886 |
19 | FireplaceQu | BsmtFinSF2 | -0.023980 |
20 | FireplaceQu | BsmtFinType2 | 0.006487 |
Summary of results
Pearson
0 PoolArea PoolQC 0.899924
1 Fireplaces FireplaceQu 0.863241
2 OverallQual SalePrice 0.790982
3 BsmtFinType2 BsmtFinSF2 0.788986
4 TotalLivArea SalePrice 0.778959
Spearman
0 PoolArea PoolQC 0.999991
1 BsmtFinType2 BsmtFinSF2 0.902542
2 Fireplaces FireplaceQu 0.895131
3 TotalLivArea SalePrice 0.814984
4 OverallQual SalePrice 0.809829
5 BsmtFinType1 BsmtFinSF1 0.795755
Decisions
PoolArea PoolQC
Keep PoolQC.
Fireplaces FireplaceQu
Keep FireplaceQu.
BsmtFinType2 BsmtFinSF2
Pearson and Spearman disagree here, which makes the choice difficult, but I keep BsmtFinType2 because it matches my intuition.
## Variables dropped as a result of the correlation checks (16 variables)
cols_to_erase=[
'GarageArea'
,'GrLivArea'
,'TotalBsmtSF'
,'1stFlrSF'
,'WoodDeckSF'
,'GarageCond'
,'KitchenQual'
,'ExterQual'
,'BldgType'
,'HouseStyle'
,'Exterior2nd'
,'GarageYrBlt'
,'YearRemodAdd'
,'PoolArea'
,'Fireplaces'
,'BsmtFinSF2'
]
df_all = df_all.drop(cols_to_erase,axis=1)
Feature engineering ② (using OneHotEncoder)
from sklearn.preprocessing import OneHotEncoder
OneHotEnc = OneHotEncoder(categories='auto',drop='if_binary',handle_unknown='ignore') # handle_unknown='ignore': unknown categories become all zeros instead of raising an error
# Columns to one-hot encode
OneHotCols=df_all.select_dtypes(include=object).columns.to_list()
# fit_transform to create the dummy variables
get_dummies = OneHotEnc.fit_transform(df_all[OneHotCols])
# Get the names of the dummy variables
dummy_cols = OneHotEnc.get_feature_names_out()
# Add the dummy variables to the original DataFrame
df_all = df_all.join(pd.DataFrame(get_dummies.toarray(),columns=dummy_cols))
# Drop the original columns that were converted to dummies
df_all = df_all.drop(columns=OneHotCols)
df_all.shape
(2919, 334)
df_all.info()
RangeIndex: 2919 entries, 0 to 2918 Columns: 334 entries, Id to SaleCondition_Partial dtypes: float64(317), int64(17) memory usage: 7.4 MB
Exporting the processed training and test data
Finally, I split the data back into training and test sets and export them so they can be reused.
# Boundary between the training data and the test data
max_train_index = train_records-1
# Processed training data
df_train = df_all.loc[0:max_train_index,:].copy()
df_train["SalePrice"] = df["SalePrice"]
# Processed test data
df_test = df_all.loc[(max_train_index + 1):,:].copy()
# Export the data for modeling
df_train.to_csv("ames_train.csv", index=False)
df_test.to_csv("ames_test.csv",index=False)
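One detail in the split above is easy to miss: `.loc` slices by label and includes both endpoints, which is why the boundary is `train_records-1`. The sketch below illustrates this on a made-up frame (the names `combined`, `n_train`, and `last_train_label` are hypothetical), assuming a default RangeIndex like the one `df_all.info()` reports.

```python
import pandas as pd

# Hypothetical stand-in for the combined data: 3 training rows followed by 2 test rows
combined = pd.DataFrame({"Id": [1, 2, 3, 4, 5]})
n_train = 3
last_train_label = n_train - 1

# .loc is label-based and inclusive at both ends
train_part = combined.loc[0:last_train_label, :]
test_part = combined.loc[last_train_label + 1:, :]

# The positional equivalent with .iloc excludes the end position
same_by_position = combined.iloc[:n_train, :]

print(len(train_part), len(test_part))
```

If the index were not a plain 0..n-1 range, label-based slicing could select different rows, so the positional `.iloc` form may be the safer choice in that situation.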
Summary
The data processing part is finally complete.
Next time we move on to the fun part: modeling.
Library versions
pandas Version: 1.4.3
numpy Version: 1.22.4
scikit-learn Version: 1.1.1
seaborn Version: 0.11.2
matplotlib Version: 3.5.2