A blog where I analyze all kinds of data with Python while thinking over my own career

(Part 3-3) Data Processing for the Ames Housing Price Dataset (2)

Data Analytics

This post continues from the article below.

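(Part 3-2) Data Processing for the Ames Housing Price Dataset (1)
Last time, in (Part 3-1) Planning the Data Processing for the Ames Housing Price Dataset, I planned how to proceed. This time I handled missing values and outliers, and did feature engineering (1) (creating additional variables and changing data types). Variable selection and feature...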

If you want to run the data processing yourself, please start from "(Part 3-2) Data Processing for the Ames Housing Price Dataset (1)".


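Variable selection

Let's move on to variable selection. In real-world work we often create a large number of explanatory variables, and at times we run into the phenomenon known as the "curse of dimensionality".

So we decide, by some method, which variables not to use. This time I again take the orthodox route: check the correlation coefficients and, for each pair of highly correlated variables, keep only one of the two.

Besides reducing the amount of computation, this mitigates multicollinearity and helps produce a more stable model.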
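As a rough illustration of why a highly correlated pair hurts stability, the sketch below computes the variance inflation factor (VIF) of each column as the diagonal of the inverse correlation matrix. The toy columns (`area`, `area_copy`, `unrelated`) and the helper name `vif_from_corr` are made up for this example and are not part of the Ames workflow in this post.

```python
import numpy as np
import pandas as pd


def vif_from_corr(df: pd.DataFrame) -> pd.Series:
    """VIF of each column, taken from the diagonal of the inverse correlation matrix."""
    corr = df.corr().to_numpy()
    return pd.Series(np.diag(np.linalg.inv(corr)), index=df.columns)


# Hypothetical toy data: "area_copy" nearly duplicates "area".
rng = np.random.default_rng(0)
n = 500
area = rng.normal(size=n)
toy = pd.DataFrame({
    "area": area,
    "area_copy": area + rng.normal(scale=0.1, size=n),
    "unrelated": rng.normal(size=n),
})

vif_all = vif_from_corr(toy)                                # both near-duplicates get a large VIF
vif_reduced = vif_from_corr(toy.drop(columns="area_copy"))  # VIFs fall back towards 1

print(vif_all.round(1))
print(vif_reduced.round(1))
```

A VIF close to 1 means a column is barely explained by the others. Once one of the near-duplicate columns is removed, the remaining values should drop to roughly that level.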

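Summary of the types of correlation coefficient

The Pearson correlation coefficient is probably the most commonly used one. It is used, for example, to examine the relationship between weight and height.

However, it seems that the method for computing a correlation coefficient should be chosen according to the measurement scale of the variables.

For continuous values such as the weight and height mentioned above, the Pearson correlation coefficient is fine. For correlations between ordinal measures, such as survey answers rated on a five-point scale from "bad" to "good", Spearman's rank correlation coefficient appears to be the better choice.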
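To make the difference concrete, here is a minimal sketch on made-up data (not from the Ames dataset): `y` increases strictly with `x` but not linearly. Spearman's coefficient only looks at ranks, so it treats the relationship as perfect, while Pearson's coefficient is pulled down by the curvature.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical data: strictly increasing, but far from a straight line.
x = np.arange(1, 21, dtype=float)
y = np.exp(x / 3.0)

pearson_r = pearsonr(x, y)[0]    # measures linear association
spearman_r = spearmanr(x, y)[0]  # measures monotonic (rank) association

print(f"Pearson:  {pearson_r:.3f}")
print(f"Spearman: {spearman_r:.3f}")
```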

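The details are summarized in articles by the Kameda Laboratory at Tokyo University of Science, Yamaguchi (山口東京理科大学 亀田研究室) and by Aistat Co., Ltd. (株式会社アイスタット).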

On measurement scales, the ITmedia article on why "27°C x 2 = 54°C" means nothing (the basics of "measurement" and "data") was easy to follow.

Summarizing the references above gives the tables below.

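Measurement scales

Type          Scale           Examples
Qualitative   Nominal scale   Gender, prefecture, etc.
Qualitative   Ordinal scale   Rankings, five-point survey ratings
Quantitative  Interval scale  Standardized deviation scores (hensachi), degrees Celsius (°C), etc.
Quantitative  Ratio scale     Monetary amounts, absolute temperature, etc.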

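Correlation measures

Variable types               Correlation measure
Quantitative x quantitative  Pearson correlation coefficient
Ordinal x ordinal            Spearman's rank correlation coefficient
Nominal x nominal            Cramér's V (coefficient of association)
Quantitative x nominal       Correlation ratio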

The Ames dataset looks like it allows all of the above to be computed, so I will try each of them.

Preparing to compute the correlation measures

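# Boundary between the training data and the test data
max_train_index = train_records-1

# DataFrame used for checking correlations (training rows only)
df_corr = df_all.loc[0:max_train_index,:].copy()
df_corr["SalePrice"] = df["SalePrice"]
# Restrict to continuous (quantitative) variables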
quant_cols=[
 'LotFrontage'
, 'LotArea'
, 'MasVnrArea'
, 'BsmtFinSF1'
, 'BsmtFinSF2'
, 'BsmtUnfSF'
, 'TotalBsmtSF'
, '1stFlrSF'
, '2ndFlrSF'
, 'LowQualFinSF'
, 'GrLivArea'
, 'BsmtFullBath'
, 'BsmtHalfBath'
, 'FullBath'
, 'HalfBath'
, 'BedroomAbvGr'
, 'KitchenAbvGr'
, 'TotRmsAbvGrd'
, 'Fireplaces'
, 'GarageCars'
, 'GarageArea'
, 'WoodDeckSF'
, 'OpenPorchSF'
, 'EnclosedPorch'
, '3SsnPorch'
, 'ScreenPorch'
, 'PoolArea'
, 'MiscVal'
, 'TotalLivArea'
, 'TotalBathRms'
, 'TotalWoodDeckPorch'
, 'SalePrice' # target variable
]
# Restrict to ordinal variables
ordinal_cols=[
'LotShape'
,'Utilities'
,'LandSlope'
,'OverallQual'
,'OverallCond'
,'ExterQual'
,'ExterCond'
,'BsmtQual'
,'BsmtCond'
,'BsmtExposure'
,'BsmtFinType1'
,'BsmtFinType2'
,'HeatingQC'
,'KitchenQual'
,'Functional'
,'FireplaceQu'
,'GarageFinish'
,'GarageQual'
,'GarageCond'
,'PoolQC'
,'Fence'
,'SalePrice' # target variable
]
df_corr_obj = df_corr.select_dtypes(include=object)
df_corr_number = df_corr.select_dtypes(include='number').drop("Id",axis=1)
df_corr_ordinal = df_corr_number[ordinal_cols]
df_corr_quant = df_corr_number[quant_cols]

Functions for computing the correlation measures

# 参考: https://www.statology.org/cramers-v-in-python/
# Compute Cramér's V
def cramersV(col1,col2):
    import pandas as pd
    import numpy as np
    from scipy.stats import chi2_contingency

    # Cross-tabulate the two variables
    crosstab =np.array(pd.crosstab(col1,col2,rownames=None,colnames=None))
    # Chi-squared statistic
    X2 = chi2_contingency(crosstab, correction=False)[0]
    # Sample size
    n = np.sum(crosstab)
    # min(number of rows, number of columns) - 1
    k = min(crosstab.shape) - 1
    # Cramér's V
    V = np.sqrt((X2/n) / k)

    return V

# 参考: https://www.kaggle.com/code/chrisbss1/cramer-s-v-correlation-matrix/notebook
def get_corr_cramers(dataframe):
    import pandas as pd
    import numpy as np

    dataframe_obj = dataframe.select_dtypes(include=object)
    dataframe_cols = dataframe_obj.columns

    rows= []

    for var1 in dataframe_cols:
        col = []
        for var2 in dataframe_cols:
            cramers = cramersV(dataframe_obj[var1], dataframe_obj[var2])
            col.append(round(cramers,2))
        rows.append(col)

    results = np.array(rows)

    return pd.DataFrame(results, columns = dataframe_cols, index = dataframe_cols)

# 参考: https://stackoverflow.com/questions/52083501/how-to-compute-correlation-ratio-or-eta-in-python
# return correlation_ratio (eta_squared) of category_var x numerical_var
def correlation_ratio(category_var, numerical_var):

    import pandas as pd
    import numpy as np

    category = np.array(category_var)
    numerical = np.array(numerical_var)

    ssw = 0 # Sum of Squares Within 
    ssb = 0 # Sum of Squares Between

    for acategory in set(category):
        subgroup = numerical[np.where(category == acategory)[0]]
        # Within-group variation
        ssw += sum((subgroup - np.mean(subgroup))**2)
        # Between-group variation: SUM(group n * (group mean - overall mean)^2)
        ssb += len(subgroup) * (np.mean(subgroup) - np.mean(numerical))**2

    eta_squared = ssb / (ssb + ssw)

    return eta_squared

# 参考: https://www.kaggle.com/code/chrisbss1/cramer-s-v-correlation-matrix/notebook
def get_corr_ratio(dataframe):

    import pandas as pd
    import numpy as np

    dataframe_obj = dataframe.select_dtypes(include=object)
    dataframe_num = dataframe.select_dtypes(include='number')
    dataframe_obj_cols = dataframe_obj.columns
    dataframe_num_cols = dataframe_num.columns
    rows= []

    for var1 in dataframe_obj_cols:
        col = []
        for var2 in dataframe_num_cols:
            eta_squared = correlation_ratio(dataframe[var1], dataframe[var2])
            col.append(round(eta_squared,2))
        rows.append(col)

    results = np.array(rows)

    return pd.DataFrame(results, columns = dataframe_num_cols, index = dataframe_obj_cols)

# Function that plots the correlation coefficients as a heatmap
"""
arg1: dataframe
arg2: correlation threshold to display
arg3: method (pearson,spearman,cramers,corr_ratio)
"""
def get_corr(dataframe,r=0.3,method='pearson'):
    import matplotlib.pyplot as plt
    import seaborn as sns
    import pandas as pd
    plt.figure(figsize=(18,14))

    # Compute the correlation coefficients
    if method == 'cramers':
        corr = get_corr_cramers(dataframe)
    elif method == 'corr_ratio':
        corr = get_corr_ratio(dataframe)
    else:
        corr = dataframe.corr(method=method)

    # Show the heatmap (cells whose absolute value is below the threshold r are hidden)
    return sns.heatmap(corr, vmax=1, vmin=-1, center=0, mask = abs(corr) < r,linecolor="black",linewidths=0.5, annot=True,annot_kws={"size":8})

"""
Function that builds a list of correlation coefficients, with duplicate pairs removed.
arg1: dataframe
arg2: correlation threshold to display
arg3: method (pearson,spearman,cramers,corr_ratio)
return: list of correlation
# 参考: https://stackoverflow.com/questions/48395350/how-to-remove-duplicates-from-correlation-in-pandas
"""
def get_corr_list(dataframe,r=0.3,method='pearson'):
    import pandas as pd
    import numpy as np

    # Compute the correlation coefficients
    if method == 'cramers':
        corr = get_corr_cramers(dataframe)
    elif method == 'corr_ratio':
        corr = get_corr_ratio(dataframe)
    else:
        corr = dataframe.corr(method=method)

    if method != 'corr_ratio':
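        # Boolean matrix that is True on the lower triangle (diagonal included), i.e. the cells to mask
        np_for_mask = np.tril(np.ones(corr.shape)).astype(np.bool_)

        # Replace the lower-left half of the correlation matrix with nulls
        corr = corr.mask(np_for_mask)

    # Convert from a matrix to long format (one row per pair)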
    corr = corr.stack().reset_index()

    corr.columns = ["col1","col2","r"]

    return corr[abs(corr["r"]) >= r].sort_values(by="r",key=abs,ascending=False).reset_index().drop("index",axis=1)
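As a quick sanity check of the Cramér's V formula used above, V = sqrt(chi2 / (n * (min(rows, cols) - 1))), the following sketch evaluates it on two made-up cases where the answer is known in advance. The function `cramers_v` is a compact restatement written only for this example, and the toy series are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def cramers_v(col1, col2) -> float:
    """Cramér's V from the chi-squared statistic of the cross-tabulation."""
    table = pd.crosstab(col1, col2).to_numpy()
    chi2 = chi2_contingency(table, correction=False)[0]
    n = table.sum()
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))


# Case 1: the second variable is just a relabeling of the first (perfect association).
style = pd.Series(["1Story", "2Story", "Split"] * 20)
style_code = style.map({"1Story": "A", "2Story": "B", "Split": "C"})
v_same = cramers_v(style, style_code)

# Case 2: every combination occurs equally often (no association).
a = pd.Series(["x", "y"] * 30)
b = pd.Series(["p", "p", "q", "q"] * 15)
v_indep = cramers_v(a, b)

print(v_same, v_indep)
```

The first case should come out at 1 and the second at 0, which are the two ends of the scale that the heatmaps and lists in this post sit between.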

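Pearson correlation coefficient (quantitative x quantitative)

# Check the correlations among the quantitative variables only (ordinal scales excluded)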
get_corr(df_corr_quant)

(Figure: heatmap of the Pearson correlation coefficients among the quantitative variables)

from IPython.display import HTML
HTML(get_corr_list(df_corr_quant).to_html())
Out[0]

col1 col2 r
0 GarageCars GarageArea 0.882475
1 GrLivArea TotalLivArea 0.880324
2 GrLivArea TotRmsAbvGrd 0.825489
3 TotalBsmtSF TotalLivArea 0.822888
4 TotalBsmtSF 1stFlrSF 0.819530
5 1stFlrSF TotalLivArea 0.797678
6 TotalLivArea SalePrice 0.778959
7 WoodDeckSF TotalWoodDeckPorch 0.743209
8 GrLivArea SalePrice 0.708624
9 2ndFlrSF GrLivArea 0.687501
10 TotRmsAbvGrd TotalLivArea 0.678802
11 BedroomAbvGr TotRmsAbvGrd 0.676620
12 BsmtFinSF1 BsmtFullBath 0.649212
13 GarageCars SalePrice 0.640409
14 GrLivArea FullBath 0.630012
15 GarageArea SalePrice 0.623431
16 FullBath TotalBathRms 0.621043
17 GrLivArea TotalBathRms 0.617494
18 2ndFlrSF TotRmsAbvGrd 0.616423
19 TotalBsmtSF SalePrice 0.613581
20 TotalBathRms SalePrice 0.613005
21 2ndFlrSF HalfBath 0.609707
22 HalfBath TotalBathRms 0.605905
23 1stFlrSF SalePrice 0.605852
24 TotalLivArea TotalBathRms 0.574806
25 FullBath TotalLivArea 0.574403
26 1stFlrSF GrLivArea 0.566024
27 FullBath SalePrice 0.560664
28 GarageArea TotalLivArea 0.558466
29 FullBath TotRmsAbvGrd 0.554784
30 TotRmsAbvGrd SalePrice 0.533723
31 GarageCars TotalLivArea 0.529608
32 BsmtFinSF1 TotalBsmtSF 0.522396
33 GrLivArea BedroomAbvGr 0.521270
34 2ndFlrSF BedroomAbvGr 0.502901
35 BsmtFinSF1 BsmtUnfSF -0.495251
36 1stFlrSF GarageArea 0.489782
37 TotalBsmtSF GarageArea 0.486665
38 2ndFlrSF TotalBathRms 0.482426
39 TotRmsAbvGrd TotalBathRms 0.482310
40 Fireplaces TotalLivArea 0.475416
41 MasVnrArea SalePrice 0.472614
42 FullBath GarageCars 0.469672
43 GrLivArea GarageArea 0.468997
44 BsmtFullBath TotalBathRms 0.468786
45 GarageCars TotalBathRms 0.468671
46 GrLivArea GarageCars 0.467247
47 Fireplaces SalePrice 0.466929
48 GrLivArea Fireplaces 0.461679
49 OpenPorchSF TotalWoodDeckPorch 0.458911
50 TotalBsmtSF GrLivArea 0.454868
51 LotFrontage TotalLivArea 0.448730
52 BsmtFinSF1 1stFlrSF 0.445863
53 MasVnrArea TotalLivArea 0.439385
54 1stFlrSF GarageCars 0.439317
55 TotalBsmtSF GarageCars 0.434585
56 LotFrontage 1stFlrSF 0.434109
57 GarageArea TotalBathRms 0.425791
58 BsmtUnfSF BsmtFullBath -0.422900
59 2ndFlrSF FullBath 0.421378
60 BsmtFinSF1 TotalBathRms 0.419852
61 GrLivArea HalfBath 0.415772
62 BsmtUnfSF TotalBsmtSF 0.415360
63 BsmtFinSF1 TotalLivArea 0.411084
64 1stFlrSF Fireplaces 0.410531
65 1stFlrSF TotRmsAbvGrd 0.409516
66 FullBath GarageArea 0.405656
67 TotalLivArea TotalWoodDeckPorch 0.397695
68 TotalWoodDeckPorch SalePrice 0.390993
69 MasVnrArea GrLivArea 0.388052
70 BsmtFinSF1 SalePrice 0.386420
71 LotFrontage GrLivArea 0.385190
72 GrLivArea TotalWoodDeckPorch 0.381182
73 LotFrontage TotalBsmtSF 0.381038
74 1stFlrSF FullBath 0.380637
75 BsmtUnfSF TotalLivArea 0.374540
76 MasVnrArea GarageArea 0.370884
77 FullBath BedroomAbvGr 0.363252
78 TotRmsAbvGrd GarageCars 0.362289
79 MasVnrArea GarageCars 0.361945
80 MasVnrArea TotalBsmtSF 0.360067
81 BedroomAbvGr TotalLivArea 0.359459
82 LotFrontage SalePrice 0.349876
83 2ndFlrSF TotalLivArea 0.345689
84 HalfBath TotRmsAbvGrd 0.343415
85 OpenPorchSF TotalLivArea 0.342402
86 Fireplaces TotalBathRms 0.341565
87 MasVnrArea 1stFlrSF 0.339850
88 TotalBsmtSF Fireplaces 0.339519
89 TotalBsmtSF TotalBathRms 0.339473
90 LotFrontage GarageArea 0.339085
91 TotRmsAbvGrd GarageArea 0.337822
92 LotFrontage LotArea 0.335957
93 LotFrontage TotRmsAbvGrd 0.332619
94 GrLivArea OpenPorchSF 0.330224
95 TotRmsAbvGrd Fireplaces 0.326114
96 MasVnrArea TotalBathRms 0.325309
97 WoodDeckSF SalePrice 0.324413
98 TotalBsmtSF FullBath 0.323722
99 2ndFlrSF SalePrice 0.319334
100 BsmtUnfSF 1stFlrSF 0.317987
101 OpenPorchSF SalePrice 0.315856
102 TotalBathRms TotalWoodDeckPorch 0.315110
103 TotalBsmtSF BsmtFullBath 0.307351
104 LotArea TotalLivArea 0.306814
105 Fireplaces GarageCars 0.300789
106 1stFlrSF TotalBathRms 0.300033

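For each pair with a correlation coefficient of 0.7 or higher, I will drop one of the two variables. Pairs that involve SalePrice itself are left alone, since it is the target variable. GrLivArea x TotRmsAbvGrd (0.825489) and TotalBsmtSF x 1stFlrSF (0.819530) are not listed separately below, because dropping GrLivArea, TotalBsmtSF and 1stFlrSF already resolves them.

As a rule, I keep whichever variable is more strongly correlated with SalePrice.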

GarageCars GarageArea 0.882475

Keep GarageCars.

GrLivArea TotalLivArea 0.880324

Keep TotalLivArea.

TotalBsmtSF TotalLivArea 0.822888

Keep TotalLivArea.

1stFlrSF TotalLivArea 0.797678

Keep TotalLivArea.

WoodDeckSF TotalWoodDeckPorch 0.743209

Keep TotalWoodDeckPorch.

# Variables to drop
cols_to_erase=[
'GarageArea'
,'GrLivArea'
,'TotalBsmtSF'
,'1stFlrSF'
,'WoodDeckSF'
]
# Drop them and check the result again
HTML(get_corr_list(df_corr_quant.drop(cols_to_erase,axis=1)).to_html())
Out[0]

col1 col2 r
0 TotalLivArea SalePrice 0.778959
1 TotRmsAbvGrd TotalLivArea 0.678802
2 BedroomAbvGr TotRmsAbvGrd 0.676620
3 BsmtFinSF1 BsmtFullBath 0.649212
4 GarageCars SalePrice 0.640409
5 FullBath TotalBathRms 0.621043
6 2ndFlrSF TotRmsAbvGrd 0.616423
7 TotalBathRms SalePrice 0.613005
8 2ndFlrSF HalfBath 0.609707
9 HalfBath TotalBathRms 0.605905
10 TotalLivArea TotalBathRms 0.574806
11 FullBath TotalLivArea 0.574403
12 FullBath SalePrice 0.560664
13 FullBath TotRmsAbvGrd 0.554784
14 TotRmsAbvGrd SalePrice 0.533723
15 GarageCars TotalLivArea 0.529608
16 2ndFlrSF BedroomAbvGr 0.502901
17 BsmtFinSF1 BsmtUnfSF -0.495251
18 2ndFlrSF TotalBathRms 0.482426
19 TotRmsAbvGrd TotalBathRms 0.482310
20 Fireplaces TotalLivArea 0.475416
21 MasVnrArea SalePrice 0.472614
22 FullBath GarageCars 0.469672
23 BsmtFullBath TotalBathRms 0.468786
24 GarageCars TotalBathRms 0.468671
25 Fireplaces SalePrice 0.466929
26 OpenPorchSF TotalWoodDeckPorch 0.458911
27 LotFrontage TotalLivArea 0.448730
28 MasVnrArea TotalLivArea 0.439385
29 BsmtUnfSF BsmtFullBath -0.422900
30 2ndFlrSF FullBath 0.421378
31 BsmtFinSF1 TotalBathRms 0.419852
32 BsmtFinSF1 TotalLivArea 0.411084
33 TotalLivArea TotalWoodDeckPorch 0.397695
34 TotalWoodDeckPorch SalePrice 0.390993
35 BsmtFinSF1 SalePrice 0.386420
36 BsmtUnfSF TotalLivArea 0.374540
37 FullBath BedroomAbvGr 0.363252
38 TotRmsAbvGrd GarageCars 0.362289
39 MasVnrArea GarageCars 0.361945
40 BedroomAbvGr TotalLivArea 0.359459
41 LotFrontage SalePrice 0.349876
42 2ndFlrSF TotalLivArea 0.345689
43 HalfBath TotRmsAbvGrd 0.343415
44 OpenPorchSF TotalLivArea 0.342402
45 Fireplaces TotalBathRms 0.341565
46 LotFrontage LotArea 0.335957
47 LotFrontage TotRmsAbvGrd 0.332619
48 TotRmsAbvGrd Fireplaces 0.326114
49 MasVnrArea TotalBathRms 0.325309
50 2ndFlrSF SalePrice 0.319334
51 OpenPorchSF SalePrice 0.315856
52 TotalBathRms TotalWoodDeckPorch 0.315110
53 LotArea TotalLivArea 0.306814
54 Fireplaces GarageCars 0.300789
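The manual rule used in this section (for each pair at or above the threshold, keep the variable that is more strongly correlated with the target) could be automated roughly as follows. This is only a sketch under assumptions: the function name `suggest_drops`, its greedy strongest-pair-first order, and the toy columns are all made up here, and the result is not guaranteed to match a hand-made selection in every case.

```python
import numpy as np
import pandas as pd


def suggest_drops(df: pd.DataFrame, target: str, threshold: float = 0.7,
                  method: str = "pearson") -> list:
    """Greedy suggestion: from each highly correlated pair, drop the weaker predictor."""
    corr = df.corr(method=method).abs()
    features = [c for c in df.columns if c != target]
    # Visit feature pairs from the strongest correlation downwards.
    pairs = sorted(
        ((corr.loc[a, b], a, b)
         for i, a in enumerate(features) for b in features[i + 1:]),
        reverse=True,
    )
    dropped = []
    for r, a, b in pairs:
        if r < threshold:
            break
        if a in dropped or b in dropped:
            continue  # this pair was already resolved by an earlier drop
        weaker = a if corr.loc[a, target] < corr.loc[b, target] else b
        dropped.append(weaker)
    return dropped


# Hypothetical toy data: "Part" is a noisy copy of "Total"; "Price" depends on "Total".
rng = np.random.default_rng(0)
n = 500
total = rng.normal(size=n)
toy = pd.DataFrame({
    "Total": total,
    "Part": total + rng.normal(scale=0.5, size=n),
    "Other": rng.normal(size=n),
})
toy["Price"] = 2.0 * toy["Total"] + toy["Other"] + rng.normal(scale=0.5, size=n)

to_drop = suggest_drops(toy, target="Price")
print(to_drop)
```

Reviewing the suggested list by hand is still worthwhile, since a purely numeric rule cannot tell which variable is easier to interpret.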

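Spearman's rank correlation coefficient (ordinal x ordinal)

# Ordinal variables and the target variable only
get_corr(df_corr_ordinal,method='spearman')

(Figure: heatmap of Spearman's rank correlation coefficients among the ordinal variables and SalePrice)

# Show as a list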
HTML(get_corr_list(df_corr_ordinal,method='spearman').to_html())
Out[0]

col1 col2 r
0 GarageQual GarageCond 0.817132
1 OverallQual SalePrice 0.809829
2 ExterQual KitchenQual 0.725266
3 OverallQual ExterQual 0.715988
4 ExterQual SalePrice 0.684014
5 BsmtQual SalePrice 0.678026
6 OverallQual BsmtQual 0.673048
7 KitchenQual SalePrice 0.672849
8 OverallQual KitchenQual 0.660498
9 ExterQual BsmtQual 0.645766
10 GarageFinish SalePrice 0.633974
11 BsmtQual KitchenQual 0.575112
12 OverallQual GarageFinish 0.567090
13 BsmtQual GarageFinish 0.555535
14 ExterQual HeatingQC 0.552073
15 FireplaceQu SalePrice 0.537602
16 ExterQual GarageFinish 0.536103
17 HeatingQC KitchenQual 0.532787
18 HeatingQC SalePrice 0.491392
19 OverallQual FireplaceQu 0.481197
20 KitchenQual GarageFinish 0.480438
21 OverallQual HeatingQC 0.473591
22 BsmtQual HeatingQC 0.453746
23 GarageFinish GarageCond 0.419086
24 GarageFinish GarageQual 0.415570
25 HeatingQC GarageFinish 0.406279
26 BsmtQual BsmtExposure 0.380819
27 FireplaceQu GarageFinish 0.380119
28 BsmtQual BsmtFinType1 0.375364
29 BsmtExposure BsmtFinType1 0.374814
30 BsmtFinType1 SalePrice 0.361625
31 ExterQual FireplaceQu 0.352144
32 GarageQual SalePrice 0.351082
33 KitchenQual FireplaceQu 0.348324
34 BsmtExposure SalePrice 0.344207
35 GarageCond SalePrice 0.339015
36 OverallCond ExterCond 0.329091
37 LotShape SalePrice -0.321055
38 BsmtQual FireplaceQu 0.317001
39 BsmtQual BsmtCond 0.316947
40 BsmtCond BsmtFinType2 0.301495

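Here too, for each pair with a correlation coefficient of 0.7 or higher, I will drop one of the two variables.

As a rule, I keep the variable that is more strongly correlated with the target variable (SalePrice).

GarageQual GarageCond 0.817132

Keep GarageQual.

ExterQual KitchenQual 0.725266

Keep ExterQual.

OverallQual ExterQual 0.715988

Keep OverallQual.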

# Variables to drop
cols_to_erase=[
'GarageCond'
,'KitchenQual'
,'ExterQual'
]
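# Drop them and check the result again
# Note: method is not specified in this call, so it falls back to the default 'pearson';
# the values listed below are therefore Pearson coefficients. Pass method='spearman' to stay consistent with this section.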
HTML(get_corr_list(df_corr_ordinal.drop(cols_to_erase,axis=1)).to_html())
Out[0]

col1 col2 r
0 OverallQual SalePrice 0.790982
1 BsmtQual BsmtCond 0.633713
2 OverallQual BsmtQual 0.629379
3 BsmtQual SalePrice 0.585207
4 OverallQual GarageFinish 0.556863
5 GarageFinish SalePrice 0.549247
6 FireplaceQu SalePrice 0.520438
7 OverallQual FireplaceQu 0.490788
8 BsmtQual GarageFinish 0.485184
9 GarageFinish GarageQual 0.482399
10 OverallQual HeatingQC 0.457083
11 HeatingQC SalePrice 0.427649
12 BsmtQual BsmtExposure 0.399339
13 BsmtQual HeatingQC 0.397169
14 FireplaceQu GarageFinish 0.394891
15 HeatingQC GarageFinish 0.392244
16 OverallCond ExterCond 0.389163
17 BsmtQual BsmtFinType1 0.377398
18 BsmtExposure SalePrice 0.374696
19 BsmtExposure BsmtFinType1 0.347840
20 BsmtQual FireplaceQu 0.307337
21 BsmtFinType1 SalePrice 0.304908

Cramér's V (nominal x nominal)

get_corr(df_corr_obj,method='cramers')

(Figure: heatmap of Cramér's V among the nominal variables)

HTML(get_corr_list(df_corr_obj,method='cramers').to_html())
Out[0]

col1 col2 r
0 MSSubClass BldgType 0.90
1 MSSubClass HouseStyle 0.85
2 Exterior1st Exterior2nd 0.76
3 YearBuilt GarageYrBlt 0.75
4 YearBuilt YearRemodAdd 0.73
5 MSZoning Neighborhood 0.65
6 YearRemodAdd GarageYrBlt 0.64
7 YearBuilt Foundation 0.59
8 YearBuilt CentralAir 0.57
9 YearBuilt Heating 0.55
10 YearBuilt PavedDrive 0.54
11 GarageType GarageYrBlt 0.52
12 Foundation GarageYrBlt 0.51
13 CentralAir GarageYrBlt 0.50
14 HouseStyle YearBuilt 0.49
15 Alley YearBuilt 0.48
16 SaleType SaleCondition 0.48
17 Condition2 YearBuilt 0.47
18 YearBuilt MasVnrType 0.47
19 GarageYrBlt PavedDrive 0.46
20 RoofStyle RoofMatl 0.46
21 Heating CentralAir 0.46
22 MSSubClass YearBuilt 0.46
23 Neighborhood YearBuilt 0.45
24 Alley Neighborhood 0.45
25 MSZoning YearBuilt 0.45
26 MSSubClass CentralAir 0.45
27 MasVnrType GarageYrBlt 0.45
28 Neighborhood Foundation 0.44
29 YearBuilt SaleCondition 0.44
30 YearRemodAdd CentralAir 0.44
31 Neighborhood BldgType 0.44
32 GarageYrBlt SaleCondition 0.43
33 CentralAir Electrical 0.42
34 YearBuilt Exterior1st 0.41
35 YearRemodAdd MasVnrType 0.41
36 BldgType YearBuilt 0.40
37 MSZoning GarageYrBlt 0.40
38 Neighborhood CentralAir 0.40
39 Neighborhood MasVnrType 0.40
40 Neighborhood GarageYrBlt 0.39
41 MSSubClass GarageYrBlt 0.39
42 YearBuilt GarageType 0.39
43 MSSubClass Neighborhood 0.39
44 YearRemodAdd Foundation 0.39
45 MSZoning Alley 0.39
46 YearRemodAdd SaleCondition 0.39
47 YearBuilt Exterior2nd 0.39
48 LandContour YearBuilt 0.38
49 Alley GarageYrBlt 0.38
50 LandContour Neighborhood 0.38
51 Exterior2nd GarageYrBlt 0.38
52 HouseStyle GarageYrBlt 0.38
53 Electrical GarageYrBlt 0.37
54 MSSubClass Foundation 0.37
55 Exterior1st GarageYrBlt 0.37
56 Foundation CentralAir 0.37
57 CentralAir GarageType 0.37
58 YearBuilt SaleType 0.36
59 Exterior1st CentralAir 0.36
60 Exterior2nd CentralAir 0.35
61 YearBuilt Electrical 0.35
62 MSSubClass MSZoning 0.35
63 Street YearBuilt 0.35
64 Condition2 MiscFeature 0.35
65 Neighborhood YearRemodAdd 0.34
66 Neighborhood Exterior2nd 0.34
67 MSSubClass GarageType 0.34
68 CentralAir PavedDrive 0.34
69 BldgType GarageYrBlt 0.34
70 Street GarageYrBlt 0.34
71 GarageYrBlt SaleType 0.34
72 LandContour GarageYrBlt 0.33
73 Exterior1st Foundation 0.33
74 Exterior2nd Foundation 0.33
75 Neighborhood PavedDrive 0.33
76 YearRemodAdd SaleType 0.32
77 MSSubClass Alley 0.32
78 YearBuilt RoofStyle 0.32
79 MSSubClass PavedDrive 0.32
80 Condition2 RoofStyle 0.32
81 Neighborhood GarageType 0.32
82 Neighborhood HouseStyle 0.32
83 Street YearRemodAdd 0.32
84 Heating GarageYrBlt 0.32
85 YearBuilt YrSold 0.31
86 Neighborhood Exterior1st 0.31
87 BldgType YearRemodAdd 0.31
88 YearRemodAdd Exterior1st 0.31
89 MSSubClass YearRemodAdd 0.30
90 MSZoning YearRemodAdd 0.30
91 MSZoning CentralAir 0.30
92 GarageType PavedDrive 0.30
93 YearBuilt RoofMatl 0.30
94 GarageYrBlt YrSold 0.30
95 YearRemodAdd Exterior2nd 0.30
96 Condition1 YearBuilt 0.30

0 MSSubClass BldgType 0.90
1 MSSubClass HouseStyle 0.85
2 Exterior1st Exterior2nd 0.76
3 YearBuilt GarageYrBlt 0.75
4 YearBuilt YearRemodAdd 0.73

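These pairs appear to be the most strongly associated.

To decide which one of each pair to keep, I will keep the variable with the higher correlation ratio with SalePrice.

Correlation ratio (quantitative x nominal)

# Check the correlation ratio between SalePrice and the nominal variables that showed high Cramér's V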
chk_cols=[
'MSSubClass'
,'BldgType'
,'HouseStyle'
,'Exterior1st'
,'Exterior2nd'
,'YearBuilt'
,'GarageYrBlt'
,'YearRemodAdd'
,'SalePrice'
]
get_corr(df_corr[chk_cols],method='corr_ratio',r=0)

(Figure: heatmap of the correlation ratios between the selected nominal variables and SalePrice)

HTML(get_corr_list(df_corr[chk_cols],method='corr_ratio',r=0).to_html())
Out[0]

col1 col2 r
0 YearBuilt SalePrice 0.44
1 GarageYrBlt SalePrice 0.39
2 YearRemodAdd SalePrice 0.31
3 MSSubClass SalePrice 0.25
4 Exterior1st SalePrice 0.15
5 Exterior2nd SalePrice 0.15
6 HouseStyle SalePrice 0.09
7 BldgType SalePrice 0.03

0 MSSubClass BldgType 0.90

Keep MSSubClass.

1 MSSubClass HouseStyle 0.85

Keep MSSubClass.

2 Exterior1st Exterior2nd 0.76

Either one would do, since both have the same correlation ratio (0.15). Keep Exterior1st.

3 YearBuilt GarageYrBlt 0.75

Keep YearBuilt.

4 YearBuilt YearRemodAdd 0.73

Keep YearBuilt.

# Variables to drop
cols_to_erase=[
'GarageArea'
,'GrLivArea'
,'TotalBsmtSF'
,'1stFlrSF'
,'WoodDeckSF'
,'GarageCond'
,'KitchenQual'
,'ExterQual'
,'BldgType'
,'HouseStyle'
,'Exterior2nd'
,'GarageYrBlt'
,'YearRemodAdd'
]
get_corr(df_corr.drop("Id",axis=1).drop(cols_to_erase,axis=1),method='corr_ratio')

(Figure: heatmap of the correlation ratios after dropping the variables above)

HTML(get_corr_list(df_corr.drop("Id",axis=1).drop(cols_to_erase,axis=1),method='corr_ratio').to_html())
Out[0]

col1 col2 r
0 GarageType GarageQual 0.89
1 MiscFeature MiscVal 0.88
2 MSSubClass 2ndFlrSF 0.81
3 MSSubClass KitchenAbvGr 0.61
4 YearBuilt OverallQual 0.56
5 Neighborhood SalePrice 0.55
6 YearBuilt BsmtQual 0.53
7 Foundation BsmtQual 0.53
8 Neighborhood OverallQual 0.52
9 Foundation BsmtCond 0.48
10 GarageType GarageFinish 0.47
11 MasVnrType MasVnrArea 0.47
12 MSSubClass HalfBath 0.47
13 YearBuilt HeatingQC 0.46
14 YearBuilt GarageFinish 0.45
15 YearBuilt SalePrice 0.44
16 YearBuilt GarageCars 0.43
17 GarageType GarageCars 0.41
18 Neighborhood BsmtQual 0.41
19 YearBuilt FullBath 0.41
20 MSSubClass TotalBathRms 0.40
21 YearBuilt TotalBathRms 0.40
22 LandContour LandSlope 0.39
23 Neighborhood GarageCars 0.38
24 Neighborhood GarageFinish 0.38
25 MSSubClass TotRmsAbvGrd 0.38
26 Neighborhood FullBath 0.36
27 MSSubClass BedroomAbvGr 0.35
28 MSSubClass GarageFinish 0.34
29 Foundation OverallQual 0.34
30 Neighborhood TotalBathRms 0.34
31 Neighborhood TotalLivArea 0.31
32 Neighborhood HeatingQC 0.30
33 MSSubClass LotFrontage 0.30

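My understanding is that the correlation ratio indicates whether the distribution of a numeric variable differs between the groups of a categorical variable.

Neighborhood and YearBuilt show fairly high values against SalePrice. I take this to mean that SalePrice is likely to differ by Neighborhood, or by YearBuilt.

Dropping one of the two on this basis does not seem meaningful (if anything, both are probably worth keeping), so I take no action based on the correlation ratio results.

Finally, check the correlations among all numeric variables (Pearson correlation coefficient)

# Variables to drop (the nominal columns are commented out because df_corr_number contains numeric columns only)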
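The interpretation above can be checked on a tiny made-up example. The sketch below computes eta squared (between-group sum of squares divided by total sum of squares) with a pandas groupby; the function name `eta_squared` and the toy columns are assumptions for illustration only.

```python
import pandas as pd


def eta_squared(groups: pd.Series, values: pd.Series) -> float:
    """Correlation ratio (eta squared): between-group variation / total variation."""
    grand_mean = values.mean()
    grouped = values.groupby(groups)
    ss_between = (grouped.size() * (grouped.mean() - grand_mean) ** 2).sum()
    ss_total = ((values - grand_mean) ** 2).sum()
    return float(ss_between / ss_total)


# Hypothetical toy data with two neighborhoods.
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "B"],
    "separated": [100, 100, 100, 200, 200, 200],    # prices differ only between groups
    "overlapping": [100, 150, 200, 100, 150, 200],  # both groups have the same mean
})

eta_separated = eta_squared(toy["Neighborhood"], toy["separated"])      # 1.0
eta_overlapping = eta_squared(toy["Neighborhood"], toy["overlapping"])  # 0.0
print(eta_separated, eta_overlapping)
```

A value near 1 means the group alone almost determines the numeric value, and a value near 0 means the group means are indistinguishable.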
cols_to_erase=[
'GarageArea'
,'GrLivArea'
,'TotalBsmtSF'
,'1stFlrSF'
,'WoodDeckSF'
,'GarageCond'
,'KitchenQual'
,'ExterQual'
#,'BldgType'
#,'HouseStyle'
#,'Exterior2nd'
#,'GarageYrBlt'
#,'YearRemodAdd'
]
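# Show the correlations among all numeric variables, including the ordinal variables already converted to numbers
get_corr(df_corr_number.drop(cols_to_erase,axis=1))

(Figure: heatmap of the Pearson correlation coefficients among all numeric variables)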

HTML(get_corr_list(df_corr_number.drop(cols_to_erase,axis=1)).to_html())
Out[0]

col1 col2 r
0 PoolArea PoolQC 0.899924
1 Fireplaces FireplaceQu 0.863241
2 OverallQual SalePrice 0.790982
3 BsmtFinType2 BsmtFinSF2 0.788986
4 TotalLivArea SalePrice 0.778959
5 BsmtFinType1 BsmtFinSF1 0.695751
6 TotRmsAbvGrd TotalLivArea 0.678802
7 BedroomAbvGr TotRmsAbvGrd 0.676620
8 OverallQual TotalLivArea 0.664830
9 BsmtFinSF1 BsmtFullBath 0.649212
10 GarageCars SalePrice 0.640409
11 BsmtQual BsmtCond 0.633713
12 OverallQual BsmtQual 0.629379
13 FullBath TotalBathRms 0.621043
14 2ndFlrSF TotRmsAbvGrd 0.616423
15 TotalBathRms SalePrice 0.613005
16 2ndFlrSF HalfBath 0.609707
17 HalfBath TotalBathRms 0.605905
18 OverallQual GarageCars 0.600671
19 BsmtFinType1 BsmtFullBath 0.589056
20 BsmtQual SalePrice 0.585207
21 GarageFinish GarageCars 0.579729
22 GarageCars GarageQual 0.576622
23 TotalLivArea TotalBathRms 0.574806
24 FullBath TotalLivArea 0.574403
25 FullBath SalePrice 0.560664
26 OverallQual GarageFinish 0.556863
27 FullBath TotRmsAbvGrd 0.554784
28 OverallQual FullBath 0.550600
29 GarageFinish SalePrice 0.549247
30 TotRmsAbvGrd SalePrice 0.533723
31 OverallQual TotalBathRms 0.529906
32 GarageCars TotalLivArea 0.529608
33 FireplaceQu SalePrice 0.520438
34 BsmtQual TotalLivArea 0.509830
35 2ndFlrSF BedroomAbvGr 0.502901
36 BsmtFinSF1 BsmtUnfSF -0.495251
37 OverallQual FireplaceQu 0.490788
38 BsmtQual GarageFinish 0.485184
39 FireplaceQu TotalLivArea 0.485004
40 2ndFlrSF TotalBathRms 0.482426
41 GarageFinish GarageQual 0.482399
42 TotRmsAbvGrd TotalBathRms 0.482310
43 Fireplaces TotalLivArea 0.475416
44 MasVnrArea SalePrice 0.472614
45 FullBath GarageCars 0.469672
46 BsmtQual TotalBathRms 0.469031
47 BsmtFullBath TotalBathRms 0.468786
48 GarageCars TotalBathRms 0.468671
49 Fireplaces SalePrice 0.466929
50 OpenPorchSF TotalWoodDeckPorch 0.458911
51 OverallQual HeatingQC 0.457083
52 GarageFinish TotalBathRms 0.454948
53 BsmtQual GarageCars 0.449194
54 LotFrontage TotalLivArea 0.448730
55 MasVnrArea TotalLivArea 0.439385
56 LotArea LandSlope -0.436868
57 HeatingQC SalePrice 0.427649
58 OverallQual TotRmsAbvGrd 0.427452
59 GarageFinish TotalLivArea 0.423132
60 BsmtUnfSF BsmtFullBath -0.422900
61 2ndFlrSF FullBath 0.421378
62 BsmtFinSF1 TotalBathRms 0.419852
63 BsmtFinSF1 TotalLivArea 0.411084
64 FullBath GarageFinish 0.407588
65 OverallQual MasVnrArea 0.407252
66 BsmtFinType1 TotalBathRms 0.402710
67 BsmtFinType1 BsmtUnfSF -0.400184
68 BsmtQual BsmtExposure 0.399339
69 TotalLivArea TotalWoodDeckPorch 0.397695
70 BsmtQual HeatingQC 0.397169
71 OverallQual Fireplaces 0.396765
72 FireplaceQu GarageFinish 0.394891
73 HeatingQC GarageFinish 0.392244
74 TotalWoodDeckPorch SalePrice 0.390993
75 OverallCond ExterCond 0.389163
76 BsmtFinSF1 SalePrice 0.386420
77 BsmtQual BsmtFinType1 0.377398
78 BsmtExposure SalePrice 0.374696
79 BsmtUnfSF TotalLivArea 0.374540
80 BsmtQual FullBath 0.371243
81 FireplaceQu GarageCars 0.370034
82 BsmtExposure BsmtFinSF1 0.369115
83 FullBath BedroomAbvGr 0.363252
84 TotRmsAbvGrd GarageCars 0.362289
85 MasVnrArea GarageCars 0.361945
86 BedroomAbvGr TotalLivArea 0.359459
87 TotRmsAbvGrd FireplaceQu 0.355589
88 LotFrontage SalePrice 0.349876
89 BsmtExposure BsmtFinType1 0.347840
90 2ndFlrSF TotalLivArea 0.345689
91 HalfBath TotRmsAbvGrd 0.343415
92 OpenPorchSF TotalLivArea 0.342402
93 Fireplaces TotalBathRms 0.341565
94 BsmtExposure BsmtFullBath 0.338672
95 LotFrontage LotArea 0.335957
96 FireplaceQu TotalBathRms 0.335915
97 HeatingQC FullBath 0.333499
98 LotFrontage TotRmsAbvGrd 0.332619
99 TotRmsAbvGrd Fireplaces 0.326114
100 HeatingQC GarageCars 0.325347
101 MasVnrArea TotalBathRms 0.325309
102 Fireplaces GarageFinish 0.324376
103 2ndFlrSF SalePrice 0.319334
104 OpenPorchSF SalePrice 0.315856
105 LotArea LotShape -0.315484
106 TotalBathRms TotalWoodDeckPorch 0.315110
107 OverallQual OpenPorchSF 0.308819
108 OverallQual BsmtUnfSF 0.308159
109 BsmtQual FireplaceQu 0.307337
110 LotArea TotalLivArea 0.306814
111 OverallQual TotalWoodDeckPorch 0.306097
112 BsmtFinType1 SalePrice 0.304908
113 BsmtQual BsmtFinSF1 0.304607
114 HeatingQC TotalLivArea 0.303991
115 Fireplaces GarageCars 0.300789

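Finally, check the correlations among all numeric variables (Spearman's rank correlation coefficient)

# Show the correlations among all numeric variables, including the ordinal variables already converted to numbers
get_corr(df_corr_number.drop(cols_to_erase,axis=1),method="spearman")

(Figure: heatmap of Spearman's rank correlation coefficients among all numeric variables)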

HTML(get_corr_list(df_corr_number.drop(cols_to_erase,axis=1),method='spearman').to_html())
Out[0]

col1 col2 r
0 PoolArea PoolQC 0.999991
1 BsmtFinType2 BsmtFinSF2 0.902542
2 Fireplaces FireplaceQu 0.895131
3 TotalLivArea SalePrice 0.814984
4 OverallQual SalePrice 0.809829
5 BsmtFinType1 BsmtFinSF1 0.795755
6 TotalBathRms SalePrice 0.691160
7 GarageCars SalePrice 0.690711
8 BsmtQual SalePrice 0.678026
9 TotRmsAbvGrd TotalLivArea 0.676489
10 BsmtFinSF1 BsmtFullBath 0.674175
11 OverallQual BsmtQual 0.673048
12 BedroomAbvGr TotRmsAbvGrd 0.667822
13 OverallQual TotalLivArea 0.655684
14 FullBath SalePrice 0.635957
15 GarageFinish SalePrice 0.633974
16 FullBath TotalBathRms 0.633341
17 2ndFlrSF HalfBath 0.625272
18 FullBath TotalLivArea 0.616407
19 OverallQual GarageCars 0.608756
20 LotFrontage LotArea 0.608313
21 HalfBath TotalBathRms 0.607923
22 BsmtFinType1 BsmtFullBath 0.595755
23 2ndFlrSF TotRmsAbvGrd 0.587189
24 TotalLivArea TotalBathRms 0.586014
25 OverallQual FullBath 0.576372
26 BsmtFinSF1 BsmtUnfSF -0.573638
27 GarageCars TotalLivArea 0.573071
28 OverallQual GarageFinish 0.567090
29 FullBath TotRmsAbvGrd 0.558665
30 BsmtQual GarageFinish 0.555535
31 BsmtQual GarageCars 0.551884
32 OverallQual TotalBathRms 0.548317
33 GarageFinish GarageCars 0.548214
34 BsmtQual TotalBathRms 0.538023
35 FireplaceQu SalePrice 0.537602
36 TotRmsAbvGrd SalePrice 0.532586
37 Fireplaces SalePrice 0.519247
38 FullBath GarageCars 0.518310
39 BsmtQual FullBath 0.510767
40 2ndFlrSF BedroomAbvGr 0.510443
41 BsmtQual TotalLivArea 0.507785
42 GarageCars TotalBathRms 0.506718
43 FireplaceQu TotalLivArea 0.502548
44 Fireplaces TotalLivArea 0.493839
45 HeatingQC SalePrice 0.491392
46 LotArea TotalLivArea 0.485170
47 GarageFinish TotalBathRms 0.484792
48 OverallQual FireplaceQu 0.481197
49 TotRmsAbvGrd TotalBathRms 0.478120
50 OpenPorchSF SalePrice 0.477561
51 OverallQual HeatingQC 0.473591
52 GarageFinish TotalLivArea 0.457168
53 LotArea SalePrice 0.456461
54 BsmtQual HeatingQC 0.453746
55 2ndFlrSF TotalBathRms 0.452348
56 BsmtUnfSF BsmtFullBath -0.447472
57 BsmtFullBath TotalBathRms 0.444909
58 OpenPorchSF TotalWoodDeckPorch 0.442410
59 FullBath GarageFinish 0.435853
60 OverallQual OpenPorchSF 0.435046
61 LotFrontage TotalLivArea 0.431782
62 OverallQual TotRmsAbvGrd 0.427806
63 TotalWoodDeckPorch SalePrice 0.425249
64 GarageCars GarageQual 0.420815
65 OverallQual Fireplaces 0.420626
66 BsmtFinType1 TotalBathRms 0.418561
67 MasVnrArea SalePrice 0.415906
68 OpenPorchSF TotalLivArea 0.415838
69 GarageFinish GarageQual 0.415570
70 OverallQual MasVnrArea 0.408136
71 LotFrontage SalePrice 0.407481
72 HeatingQC GarageFinish 0.406279
73 LotArea TotRmsAbvGrd 0.405924
74 BsmtQual OpenPorchSF 0.400801
75 OpenPorchSF TotalBathRms 0.400649
76 MasVnrArea TotalLivArea 0.400072
77 MasVnrArea GarageCars 0.398213
78 BedroomAbvGr TotalLivArea 0.391959
79 TotalLivArea TotalWoodDeckPorch 0.391641
80 BsmtFinType1 BsmtUnfSF -0.390439
81 BsmtFinSF1 TotalBathRms 0.388938
82 TotRmsAbvGrd GarageCars 0.386244
83 2ndFlrSF FullBath 0.384187
84 BsmtQual BsmtExposure 0.380819
85 FireplaceQu GarageFinish 0.380119
86 BsmtQual BsmtFinType1 0.375364
87 BsmtExposure BsmtFinType1 0.374814
88 FullBath OpenPorchSF 0.370152
89 TotRmsAbvGrd FireplaceQu 0.367548
90 BsmtFinType1 SalePrice 0.361625
91 HalfBath TotRmsAbvGrd 0.359001
92 Fireplaces TotalBathRms 0.358501
93 BsmtUnfSF TotalLivArea 0.358177
94 FireplaceQu GarageCars 0.357182
95 Fireplaces GarageFinish 0.351574
96 GarageQual SalePrice 0.351082
97 LotArea Fireplaces 0.350198
98 HeatingQC FullBath 0.348801
99 LotFrontage GarageCars 0.348541
100 TotRmsAbvGrd Fireplaces 0.346829
101 BsmtExposure SalePrice 0.344207
102 HeatingQC GarageCars 0.343988
103 HalfBath SalePrice 0.343008
104 GarageCars OpenPorchSF 0.342701
105 LotFrontage TotRmsAbvGrd 0.341900
106 LotArea LotShape -0.341581
107 GarageFinish OpenPorchSF 0.340726
108 LotArea GarageCars 0.340195
109 BsmtExposure BsmtFinSF1 0.340172
110 MasVnrArea GarageFinish 0.339790
111 LotArea BedroomAbvGr 0.337788
112 FullBath BedroomAbvGr 0.336515
113 HeatingQC TotalLivArea 0.334766
114 MasVnrArea TotalBathRms 0.331010
115 OverallCond ExterCond 0.329091
116 OverallQual TotalWoodDeckPorch 0.329067
117 Fireplaces GarageCars 0.325520
118 TotalBathRms TotalWoodDeckPorch 0.324840
119 BsmtExposure BsmtFullBath 0.323130
120 LotShape SalePrice -0.321055
121 FireplaceQu TotalBathRms 0.318769
122 MasVnrArea BsmtQual 0.318218
123 LotArea FireplaceQu 0.317002
124 BsmtQual FireplaceQu 0.317001
125 BsmtQual BsmtCond 0.316947
126 HeatingQC TotalBathRms 0.312123
127 Fireplaces TotalWoodDeckPorch 0.307912
128 HeatingQC OpenPorchSF 0.303542
129 LotFrontage BedroomAbvGr 0.302878
130 BsmtFinSF1 SalePrice 0.301871
131 BsmtCond BsmtFinType2 0.301495
# The variable more correlated with SalePrice will be kept, so check the removal candidates against SalePrice
chk_cols=[
'PoolArea'
,'PoolQC'
,'Fireplaces'
,'FireplaceQu'
,'BsmtFinType2'
,'BsmtFinSF2'
,'SalePrice'
]
HTML(get_corr_list(df_corr_number[chk_cols],r=0).to_html())
Out[0]

col1 col2 r
0 PoolArea PoolQC 0.899924
1 Fireplaces FireplaceQu 0.863241
2 BsmtFinType2 BsmtFinSF2 0.788986
3 FireplaceQu SalePrice 0.520438
4 Fireplaces SalePrice 0.466929
5 PoolQC SalePrice 0.115484
6 PoolQC Fireplaces 0.095621
7 PoolArea Fireplaces 0.095074
8 PoolArea SalePrice 0.092404
9 PoolQC FireplaceQu 0.054380
10 Fireplaces BsmtFinType2 0.052070
11 PoolArea FireplaceQu 0.048737
12 Fireplaces BsmtFinSF2 0.046921
13 PoolArea BsmtFinSF2 0.041709
14 PoolQC BsmtFinSF2 0.014524
15 PoolArea BsmtFinType2 0.013058
16 BsmtFinSF2 SalePrice -0.011378
17 BsmtFinType2 SalePrice -0.005323
18 PoolQC BsmtFinType2 0.004901
19 FireplaceQu BsmtFinSF2 0.001518
20 FireplaceQu BsmtFinType2 0.000022
HTML(get_corr_list(df_corr_number[chk_cols],method='spearman',r=0).to_html())
Out[0]

col1 col2 r
0 PoolArea PoolQC 0.999991
1 BsmtFinType2 BsmtFinSF2 0.902542
2 Fireplaces FireplaceQu 0.895131
3 FireplaceQu SalePrice 0.537602
4 Fireplaces SalePrice 0.519247
5 PoolQC Fireplaces 0.083937
6 PoolArea Fireplaces 0.083876
7 PoolArea BsmtFinSF2 0.068076
8 PoolQC BsmtFinSF2 0.068000
9 Fireplaces BsmtFinType2 0.060091
10 PoolArea BsmtFinType2 0.058715
11 PoolQC BsmtFinType2 0.058663
12 PoolQC SalePrice 0.058469
13 PoolArea SalePrice 0.058453
14 PoolQC FireplaceQu 0.052000
15 PoolArea FireplaceQu 0.051880
16 BsmtFinType2 SalePrice 0.039813
17 BsmtFinSF2 SalePrice -0.038806
18 Fireplaces BsmtFinSF2 0.029886
19 FireplaceQu BsmtFinSF2 -0.023980
20 FireplaceQu BsmtFinType2 0.006487

Summary of results

Pearson

0 PoolArea PoolQC 0.899924
1 Fireplaces FireplaceQu 0.863241
2 OverallQual SalePrice 0.790982
3 BsmtFinType2 BsmtFinSF2 0.788986
4 TotalLivArea SalePrice 0.778959

Spearman

0 PoolArea PoolQC 0.999991
1 BsmtFinType2 BsmtFinSF2 0.902542
2 Fireplaces FireplaceQu 0.895131
3 TotalLivArea SalePrice 0.814984
4 OverallQual SalePrice 0.809829
5 BsmtFinType1 BsmtFinSF1 0.795755

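Decisions

PoolArea PoolQC

Keep PoolQC.

Fireplaces FireplaceQu

Keep FireplaceQu.

BsmtFinType2 BsmtFinSF2

Pearson and Spearman give different answers here, which makes the choice difficult, but I keep BsmtFinType2 because it matches my intuition.

# Variables marked for removal during the correlation checks (16 variables)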
cols_to_erase=[
 'GarageArea'
,'GrLivArea'
,'TotalBsmtSF'
,'1stFlrSF'
,'WoodDeckSF'
,'GarageCond'
,'KitchenQual'
,'ExterQual'
,'BldgType'
,'HouseStyle'
,'Exterior2nd'
,'GarageYrBlt'
,'YearRemodAdd'
,'PoolArea'
,'Fireplaces'
,'BsmtFinSF2'
]
df_all = df_all.drop(cols_to_erase,axis=1)

Feature engineering (2): using OneHotEncoder

from sklearn.preprocessing import OneHotEncoder
OneHotEnc = OneHotEncoder(categories='auto',drop='if_binary',handle_unknown='ignore') # unknown categories are encoded as all zeros instead of raising an error

# Variables to one-hot encode
OneHotCols=df_all.select_dtypes(include=object).columns.to_list()

# fit_transform to create the dummy variables
get_dummies = OneHotEnc.fit_transform(df_all[OneHotCols])

# Get the names of the dummy variables
dummy_cols = OneHotEnc.get_feature_names_out()

# Add the dummy variables to the original DataFrame (index passed explicitly so the join aligns row by row)
df_all = df_all.join(pd.DataFrame(get_dummies.toarray(),columns=dummy_cols,index=df_all.index))

# Drop the original columns that were converted to dummies
df_all = df_all.drop(columns=OneHotCols)
df_all.shape
Out[0]

    (2919, 334)

df_all.info()
Out[0]

    RangeIndex: 2919 entries, 0 to 2918
    Columns: 334 entries, Id to SaleCondition_Partial
    dtypes: float64(317), int64(17)
    memory usage: 7.4 MB

Exporting the processed training and test data

Finally, I split the data back into training data and test data so that it can be reused, and export both.

# Boundary between the training data and the test data
max_train_index = train_records-1

# Processed training data
df_train = df_all.loc[0:max_train_index,:].copy()
df_train["SalePrice"] = df["SalePrice"]

# Processed test data
df_test = df_all.loc[(max_train_index + 1):,:].copy()

# Export the data for modeling
df_train.to_csv("ames_train.csv", index=False)
df_test.to_csv("ames_test.csv",index=False)
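One detail worth spelling out is why the boundary is `train_records-1`: label-based slicing with `.loc` includes the end label, unlike ordinary Python slicing. The sketch below shows this on a made-up seven-row frame and round-trips the training part through CSV in memory. All names ending in `_toy`, `train_part` and `test_part` are hypothetical and only stand in for the real variables.

```python
import io

import pandas as pd

# Hypothetical stand-in for df_all: 5 training rows followed by 2 test rows.
df_all_toy = pd.DataFrame({"Id": range(1, 8), "feature": range(10, 17)})
train_records_toy = 5
max_train_index_toy = train_records_toy - 1

# .loc slices by label and includes the end label, so 0:4 yields 5 rows.
train_part = df_all_toy.loc[0:max_train_index_toy, :].copy()
test_part = df_all_toy.loc[max_train_index_toy + 1:, :].copy()

# Round trip through CSV (in memory here) with index=False, as in the export step.
buffer = io.StringIO()
train_part.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(len(train_part), len(test_part))
print(reloaded)
```

This relies on the frame having a default 0-based integer index, as the `RangeIndex` shown in the `df_all.info()` output suggests.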

Summary

The data processing part is finally complete.

Next time we move on to the part I have been looking forward to: modeling.


Library versions

pandas Version: 1.4.3
numpy Version: 1.22.4
scikit-learn Version: 1.1.1
seaborn Version: 0.11.2
matplotlib Version: 3.5.2


References

A comparison of the Pearson and Spearman correlation methods - Minitab