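This time we look at XGBoost.
I tried grid search with SVR and Bayesian optimization with Random Forest, so for XGBoost I would like to tune the parameters with random search.
Before that, this article summarizes the results obtained with the default settings.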
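Evaluation metric
The competition asks us to predict SalePrice (the sale price) for each house Id.
Submissions appear to be evaluated on the Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted SalePrice and the logarithm of the actual SalePrice.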
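To make the metric concrete, here is a minimal sketch of how it could be computed locally. The function name `rmse_of_logs` and the toy prices are my own assumptions for illustration; Kaggle's exact implementation is not shown in this article.

```python
import numpy as np


def rmse_of_logs(y_true, y_pred):
    """RMSE between log(actual) and log(predicted); prices must be strictly positive."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if np.any(y_true <= 0) or np.any(y_pred <= 0):
        raise ValueError("prices must be strictly positive to take the logarithm")
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))


# Toy prices for illustration only (not taken from the dataset)
actual = [200000.0, 150000.0]
predicted = [220000.0, 135000.0]
print(rmse_of_logs(actual, predicted))
```

Because the error is taken on the log scale, being 10% off on a cheap house costs the same as being 10% off on an expensive one.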
XGBoost
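Preparing the data for analysis
Missing-value handling and feature engineering were done beforehand, and the processed data was exported.
To reproduce the results in this article, please prepare the data by following the articles below first.
(Part 3-2) Data processing for the Ames housing price dataset ①
(Part 3-3) Data processing for the Ames housing price dataset ②
Loading the training data and the data to be scored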
import pandas as pd
import numpy as np
# Load the training and test data of the Ames housing price dataset
df = pd.read_csv("/Users/hinomaruc/Desktop/blog/dataset/ames/ames_train.csv")
df_test = pd.read_csv("/Users/hinomaruc/Desktop/blog/dataset/ames/ames_test.csv")
df.head()
Id LotFrontage LotArea LotShape Utilities LandSlope OverallQual OverallCond MasVnrArea ExterCond ... SaleType_New SaleType_Oth SaleType_WD SaleCondition_Abnorml SaleCondition_AdjLand SaleCondition_Alloca SaleCondition_Family SaleCondition_Normal SaleCondition_Partial SalePrice
0 1 65.0 8450 3.0 3.0 2.0 7 5 196.0 2.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 208500
1 2 80.0 9600 3.0 3.0 2.0 6 8 0.0 2.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 181500
2 3 68.0 11250 2.0 3.0 2.0 7 5 162.0 2.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 223500
3 4 60.0 9550 2.0 3.0 2.0 7 5 0.0 2.0 ... 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 140000
4 5 84.0 14260 2.0 3.0 2.0 8 5 350.0 2.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 250000
5 rows × 335 columns
# Plot settings
from IPython.display import HTML
import seaborn as sns
from matplotlib import ticker
import matplotlib.pyplot as plt
sns.set_style("whitegrid")
from matplotlib import rcParams
rcParams['font.family'] = 'Hiragino Sans' # on Mac
#rcParams['font.family'] = 'Meiryo' # on Windows
#rcParams['font.family'] = 'VL PGothic' # on Linux
rcParams['xtick.labelsize'] = 12 # font size of the x-axis tick labels
rcParams['ytick.labelsize'] = 12 # font size of the y-axis tick labels
rcParams['axes.labelsize'] = 18 # font size of the axis labels
rcParams['figure.figsize'] = 18,8 # figure size (inches)
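The font setting above is switched by commenting lines in and out for each OS. As a minimal sketch, the same choice could be derived from the OS name. `pick_japanese_font` and the `sans-serif` fallback are hypothetical names of my own, not part of matplotlib, and the fonts still have to be installed on the machine.

```python
import platform

# Font names are the ones listed in the plot settings above
FONT_BY_SYSTEM = {
    "Darwin": "Hiragino Sans",  # macOS
    "Windows": "Meiryo",
    "Linux": "VL PGothic",
}


def pick_japanese_font(system=None, default="sans-serif"):
    """Return a font family name for the given OS name (hypothetical helper)."""
    if system is None:
        system = platform.system()
    return FONT_BY_SYSTEM.get(system, default)


print(pick_japanese_font("Darwin"))
```

The result could then be assigned with `rcParams['font.family'] = pick_japanese_font()`.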
Selecting the variables used for XGBoost
As with the previous models, I will simply feed in all of the variables.
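Since every column goes into the model, it may be worth confirming that the training and test data expose the same feature columns in the same order. The following is a minimal sketch with toy frames; `check_feature_alignment` is a hypothetical helper of my own, and only the column names `Id` and `SalePrice` come from this article.

```python
import pandas as pd


def check_feature_alignment(train_df, test_df,
                            drop_train=("Id", "SalePrice"), drop_test=("Id",)):
    """Return (missing_in_test, extra_in_test, same_order) for the feature columns."""
    train_cols = [c for c in train_df.columns if c not in drop_train]
    test_cols = [c for c in test_df.columns if c not in drop_test]
    train_set, test_set = set(train_cols), set(test_cols)
    missing = [c for c in train_cols if c not in test_set]
    extra = [c for c in test_cols if c not in train_set]
    return missing, extra, train_cols == test_cols


# Toy frames for illustration only
toy_train = pd.DataFrame({"Id": [1], "LotArea": [8450], "OverallQual": [7], "SalePrice": [208500]})
toy_test = pd.DataFrame({"Id": [1461], "LotArea": [11622], "OverallQual": [5]})
print(check_feature_alignment(toy_train, toy_test))
```

With the real data this would be called as `check_feature_alignment(df, df_test)` once both files are loaded.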
Training with XGBoost (default settings)
# Specify the explanatory variables and the target variable
# Training data
X_train = df.drop(["Id","SalePrice"],axis=1)
Y_train = df["SalePrice"] # sale price
# Test data
X_test = df_test.drop(["Id"],axis=1)
import xgboost as xgb
import multiprocessing
# https://xgboost.readthedocs.io/en/stable/python/examples/sklearn_parallel.html
if __name__ == "__main__":
    print("Parallel Parameter optimization")
    xgb_model = xgb.XGBRegressor(objective='reg:squarederror', n_jobs=multiprocessing.cpu_count() // 2)
    xgb_model.fit(X_train, Y_train)
Parallel Parameter optimization
Training finished without any errors.
# List the model parameters
xgb_model.get_params()
{'objective': 'reg:squarederror',
 'base_score': 0.5,
 'booster': 'gbtree',
 'callbacks': None,
 'colsample_bylevel': 1,
 'colsample_bynode': 1,
 'colsample_bytree': 1,
 'early_stopping_rounds': None,
 'enable_categorical': False,
 'eval_metric': None,
 'gamma': 0,
 'gpu_id': -1,
 'grow_policy': 'depthwise',
 'importance_type': None,
 'interaction_constraints': '',
 'learning_rate': 0.300000012,
 'max_bin': 256,
 'max_cat_to_onehot': 4,
 'max_delta_step': 0,
 'max_depth': 6,
 'max_leaves': 0,
 'min_child_weight': 1,
 'missing': nan,
 'monotone_constraints': '()',
 'n_estimators': 100,
 'n_jobs': 2,
 'num_parallel_tree': 1,
 'predictor': 'auto',
 'random_state': 0,
 'reg_alpha': 0,
 'reg_lambda': 1,
 'sampling_method': 'uniform',
 'scale_pos_weight': 1,
 'subsample': 1,
 'tree_method': 'exact',
 'validate_parameters': 1,
 'verbosity': None}
# Check the feature importances of the explanatory variables
coef = pd.DataFrame()
coef["features"] = xgb_model.feature_names_in_
coef["importances"] = xgb_model.feature_importances_
HTML(coef.sort_values(by="importances",ascending=False).to_html())
features importances
5 OverallQual 0.314412
38 TotalLivArea 0.124968
39 TotalBathRms 0.062914
29 GarageCars 0.037409
9 BsmtQual 0.021757
235 RoofStyle_Flat 0.021322
30 GarageQual 0.016001
24 KitchenAbvGr 0.014833
270 Foundation_PConc 0.014108
17 2ndFlrSF 0.011595
81 Neighborhood_Crawfor 0.010507
13 BsmtFinSF1 0.010086
51 MSSubClass_60 0.009971
181 YearBuilt_1957 0.009928
4 LandSlope 0.008380
280 CentralAir_Y 0.008091
324 SaleType_New 0.008017
252 Exterior1st_BrkFace 0.007694
64 Alley_NA 0.007384
27 FireplaceQu 0.007335
228 YearBuilt_2004 0.007203
53 MSSubClass_75 0.006930
26 Functional 0.006755
61 MSZoning_RM 0.006310
201 YearBuilt_1977 0.006139
111 Condition2_Norm 0.006057
277 Heating_Grav 0.006040
251 Exterior1st_BrkComm 0.006014
326 SaleType_WD 0.005569
255 Exterior1st_HdBoard 0.005513
103 Condition1_PosA 0.005412
327 SaleCondition_Abnorml 0.005405
6 OverallCond 0.004754
57 MSZoning_C (all) 0.004641
101 Condition1_Feedr 0.004376
289 GarageType_BuiltIn 0.004252
28 GarageFinish 0.003979
47 MSSubClass_30 0.003976
96 Neighborhood_Somerst 0.003691
21 FullBath 0.003685
269 Foundation_CBlock 0.003624
102 Condition1_Norm 0.003313
267 MasVnrType_Stone 0.003205
104 Condition1_PosN 0.003194
75 Neighborhood_Blmngtn 0.003100
2 LotShape 0.002842
172 YearBuilt_1948 0.002796
86 Neighborhood_Mitchel 0.002784
1 LotArea 0.002728
12 BsmtFinType1 0.002684
34 ScreenPorch 0.002540
84 Neighborhood_IDOTRR 0.002379
282 Electrical_FuseF 0.002168
40 TotalWoodDeckPorch 0.002133
287 GarageType_Attchd 0.002099
331 SaleCondition_Normal 0.001987
100 Condition1_Artery 0.001985
58 MSZoning_FV 0.001925
294 PavedDrive_P 0.001919
231 YearBuilt_2007 0.001873
83 Neighborhood_Gilbert 0.001857
226 YearBuilt_2002 0.001842
330 SaleCondition_Family 0.001815
312 MoSold_9 0.001812
54 MSSubClass_80 0.001778
229 YearBuilt_2005 0.001773
74 LotConfig_Inside 0.001706
165 YearBuilt_1939 0.001692
15 BsmtUnfSF 0.001688
233 YearBuilt_2009 0.001659
167 YearBuilt_1941 0.001615
90 Neighborhood_NoRidge 0.001614
147 YearBuilt_1920 0.001610
11 BsmtExposure 0.001602
108 Condition1_RRNn 0.001572
99 Neighborhood_Veenker 0.001539
31 OpenPorchSF 0.001538
16 HeatingQC 0.001535
303 MoSold_11 0.001495
18 LowQualFinSF 0.001485
55 MSSubClass_85 0.001437
37 MiscVal 0.001394
187 YearBuilt_1963 0.001388
91 Neighborhood_NridgHt 0.001373
80 Neighborhood_CollgCr 0.001362
97 Neighborhood_StoneBr 0.001359
313 YrSold_2006 0.001317
78 Neighborhood_BrkSide 0.001313
32 EnclosedPorch 0.001309
304 MoSold_12 0.001305
95 Neighborhood_SawyerW 0.001297
14 BsmtFinType2 0.001281
105 Condition1_RRAe 0.001231
159 YearBuilt_1932 0.001209
71 LotConfig_CulDSac 0.001182
302 MoSold_10 0.001182
301 MoSold_1 0.001180
20 BsmtHalfBath 0.001176
35 PoolQC 0.001165
79 Neighborhood_ClearCr 0.001136
23 BedroomAbvGr 0.001118
65 Alley_Pave 0.001111
214 YearBuilt_1990 0.001043
310 MoSold_7 0.001029
266 MasVnrType_None 0.001017
7 MasVnrArea 0.001003
45 MSSubClass_190 0.001002
257 Exterior1st_MetalSd 0.000993
200 YearBuilt_1976 0.000949
311 MoSold_8 0.000943
41 MSSubClass_120 0.000939
262 Exterior1st_Wd Sdng 0.000878
232 YearBuilt_2008 0.000876
182 YearBuilt_1958 0.000844
195 YearBuilt_1971 0.000844
194 YearBuilt_1970 0.000839
306 MoSold_3 0.000819
265 MasVnrType_BrkFace 0.000790
143 YearBuilt_1916 0.000789
72 LotConfig_FR2 0.000787
87 Neighborhood_NAmes 0.000778
82 Neighborhood_Edwards 0.000776
268 Foundation_BrkTil 0.000740
258 Exterior1st_Plywood 0.000739
70 LotConfig_Corner 0.000737
254 Exterior1st_CemntBd 0.000729
161 YearBuilt_1935 0.000699
191 YearBuilt_1967 0.000677
315 YrSold_2008 0.000672
177 YearBuilt_1953 0.000665
219 YearBuilt_1995 0.000651
93 Neighborhood_SWISU 0.000648
137 YearBuilt_1910 0.000639
220 YearBuilt_1996 0.000638
307 MoSold_4 0.000617
317 YrSold_2010 0.000614
291 GarageType_Detchd 0.000612
227 YearBuilt_2003 0.000612
308 MoSold_5 0.000582
285 Electrical_SBrkr 0.000579
222 YearBuilt_1998 0.000556
225 YearBuilt_2001 0.000550
36 Fence 0.000549
273 Foundation_Wood 0.000547
230 YearBuilt_2006 0.000546
94 Neighborhood_Sawyer 0.000543
98 Neighborhood_Timber 0.000539
316 YrSold_2009 0.000532
152 YearBuilt_1925 0.000521
67 LandContour_HLS 0.000519
218 YearBuilt_1994 0.000515
154 YearBuilt_1927 0.000499
238 RoofStyle_Hip 0.000497
50 MSSubClass_50 0.000483
198 YearBuilt_1974 0.000479
19 BsmtFullBath 0.000461
309 MoSold_6 0.000441
106 Condition1_RRAn 0.000431
175 YearBuilt_1951 0.000429
132 YearBuilt_1904 0.000416
203 YearBuilt_1979 0.000410
0 LotFrontage 0.000410
22 HalfBath 0.000395
192 YearBuilt_1968 0.000393
46 MSSubClass_20 0.000383
199 YearBuilt_1975 0.000377
264 MasVnrType_BrkCmn 0.000367
180 YearBuilt_1956 0.000352
281 Electrical_FuseA 0.000347
221 YearBuilt_1997 0.000344
69 LandContour_Lvl 0.000339
318 SaleType_COD 0.000337
92 Neighborhood_OldTown 0.000334
189 YearBuilt_1965 0.000329
196 YearBuilt_1972 0.000325
157 YearBuilt_1930 0.000322
217 YearBuilt_1993 0.000320
209 YearBuilt_1985 0.000320
213 YearBuilt_1989 0.000309
25 TotRmsAbvGrd 0.000308
49 MSSubClass_45 0.000308
68 LandContour_Low 0.000304
162 YearBuilt_1936 0.000303
236 RoofStyle_Gable 0.000303
314 YrSold_2007 0.000300
8 ExterCond 0.000289
52 MSSubClass_70 0.000287
290 GarageType_CarPort 0.000286
249 Exterior1st_AsbShng 0.000245
66 LandContour_Bnk 0.000239
197 YearBuilt_1973 0.000234
205 YearBuilt_1981 0.000233
295 PavedDrive_Y 0.000219
60 MSZoning_RL 0.000218
211 YearBuilt_1987 0.000211
89 Neighborhood_NWAmes 0.000201
193 YearBuilt_1969 0.000183
223 YearBuilt_1999 0.000182
248 RoofMatl_WdShngl 0.000179
305 MoSold_2 0.000179
179 YearBuilt_1955 0.000176
174 YearBuilt_1950 0.000156
242 RoofMatl_CompShg 0.000153
166 YearBuilt_1940 0.000137
73 LotConfig_FR3 0.000137
85 Neighborhood_MeadowV 0.000137
163 YearBuilt_1937 0.000133
261 Exterior1st_VinylSd 0.000130
144 YearBuilt_1917 0.000127
63 Alley_Grvl 0.000123
186 YearBuilt_1962 0.000122
151 YearBuilt_1924 0.000122
156 YearBuilt_1929 0.000101
171 YearBuilt_1947 0.000098
216 YearBuilt_1992 0.000096
293 PavedDrive_N 0.000092
188 YearBuilt_1964 0.000089
33 3SsnPorch 0.000065
323 SaleType_ConLw 0.000046
322 SaleType_ConLI 0.000043
256 Exterior1st_ImStucc 0.000043
10 BsmtCond 0.000039
129 YearBuilt_1900 0.000037
185 YearBuilt_1961 0.000031
263 Exterior1st_WdShing 0.000026
173 YearBuilt_1949 0.000025
292 GarageType_NA 0.000000
120 YearBuilt_1880 0.000000
118 YearBuilt_1875 0.000000
119 YearBuilt_1879 0.000000
117 YearBuilt_1872 0.000000
116 Condition2_RRNn 0.000000
296 MiscFeature_Gar2 0.000000
204 YearBuilt_1980 0.000000
286 GarageType_2Types 0.000000
288 GarageType_Basment 0.000000
121 YearBuilt_1882 0.000000
298 MiscFeature_Othr 0.000000
122 YearBuilt_1885 0.000000
284 Electrical_Mix 0.000000
283 Electrical_FuseP 0.000000
123 YearBuilt_1890 0.000000
124 YearBuilt_1892 0.000000
125 YearBuilt_1893 0.000000
279 Heating_Wall 0.000000
278 Heating_OthW 0.000000
297 MiscFeature_NA 0.000000
107 Condition1_RRNe 0.000000
299 MiscFeature_Shed 0.000000
56 MSSubClass_90 0.000000
3 Utilities 0.000000
329 SaleCondition_Alloca 0.000000
328 SaleCondition_AdjLand 0.000000
42 MSSubClass_150 0.000000
325 SaleType_Oth 0.000000
43 MSSubClass_160 0.000000
321 SaleType_ConLD 0.000000
320 SaleType_Con 0.000000
319 SaleType_CWD 0.000000
44 MSSubClass_180 0.000000
48 MSSubClass_40 0.000000
59 MSZoning_RH 0.000000
300 MiscFeature_TenC 0.000000
62 Street_Pave 0.000000
76 Neighborhood_Blueste 0.000000
77 Neighborhood_BrDale 0.000000
88 Neighborhood_NPkVill 0.000000
276 Heating_GasW 0.000000
109 Condition2_Artery 0.000000
110 Condition2_Feedr 0.000000
112 Condition2_PosA 0.000000
113 Condition2_PosN 0.000000
114 Condition2_RRAe 0.000000
115 Condition2_RRAn 0.000000
126 YearBuilt_1895 0.000000
130 YearBuilt_1901 0.000000
275 Heating_GasA 0.000000
241 RoofMatl_ClyTile 0.000000
239 RoofStyle_Mansard 0.000000
149 YearBuilt_1922 0.000000
237 RoofStyle_Gambrel 0.000000
150 YearBuilt_1923 0.000000
234 YearBuilt_2010 0.000000
153 YearBuilt_1926 0.000000
155 YearBuilt_1928 0.000000
158 YearBuilt_1931 0.000000
160 YearBuilt_1934 0.000000
164 YearBuilt_1938 0.000000
168 YearBuilt_1942 0.000000
169 YearBuilt_1945 0.000000
224 YearBuilt_2000 0.000000
170 YearBuilt_1946 0.000000
176 YearBuilt_1952 0.000000
178 YearBuilt_1954 0.000000
183 YearBuilt_1959 0.000000
184 YearBuilt_1960 0.000000
215 YearBuilt_1991 0.000000
212 YearBuilt_1988 0.000000
210 YearBuilt_1986 0.000000
190 YearBuilt_1966 0.000000
208 YearBuilt_1984 0.000000
207 YearBuilt_1983 0.000000
206 YearBuilt_1982 0.000000
240 RoofStyle_Shed 0.000000
148 YearBuilt_1921 0.000000
274 Heating_Floor 0.000000
243 RoofMatl_Membran 0.000000
127 YearBuilt_1896 0.000000
272 Foundation_Stone 0.000000
271 Foundation_Slab 0.000000
128 YearBuilt_1898 0.000000
202 YearBuilt_1978 0.000000
131 YearBuilt_1902 0.000000
133 YearBuilt_1905 0.000000
134 YearBuilt_1906 0.000000
135 YearBuilt_1907 0.000000
136 YearBuilt_1908 0.000000
260 Exterior1st_Stucco 0.000000
259 Exterior1st_Stone 0.000000
138 YearBuilt_1911 0.000000
139 YearBuilt_1912 0.000000
140 YearBuilt_1913 0.000000
141 YearBuilt_1914 0.000000
253 Exterior1st_CBlock 0.000000
142 YearBuilt_1915 0.000000
250 Exterior1st_AsphShn 0.000000
145 YearBuilt_1918 0.000000
146 YearBuilt_1919 0.000000
247 RoofMatl_WdShake 0.000000
246 RoofMatl_Tar&Grv 0.000000
245 RoofMatl_Roll 0.000000
244 RoofMatl_Metal 0.000000
332 SaleCondition_Partial 0.000000
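The importance table is long, so a short summary can be easier to read. Below is a minimal sketch that ranks the features, adds a cumulative importance column, and counts the features the model never used. `summarize_importances` is a hypothetical helper of my own; the toy frame reuses four rows from the table above and has the same `features` and `importances` columns as `coef`.

```python
import pandas as pd


def summarize_importances(coef, top_n=10):
    """Return the top_n features with cumulative importance and the number of zero-importance features."""
    ranked = coef.sort_values(by="importances", ascending=False).reset_index(drop=True)
    ranked["cumulative"] = ranked["importances"].cumsum()
    n_zero = int((ranked["importances"] == 0).sum())
    return ranked.head(top_n), n_zero


# Four rows copied from the table above, for illustration only
toy_coef = pd.DataFrame({
    "features": ["Utilities", "OverallQual", "Street_Pave", "TotalLivArea"],
    "importances": [0.0, 0.314412, 0.0, 0.124968],
})
top, n_zero = summarize_importances(toy_coef, top_n=2)
print(top)
print("features with zero importance:", n_zero)
```

With the real model this would be called as `summarize_importances(coef)`.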
### Apply the model and predict SalePrice
df_test["SalePrice"] = xgb_model.predict(X_test)
df_test[["Id","SalePrice"]]
        Id      SalePrice
0     1461  126004.601562
1     1462  165939.859375
2     1463  187305.421875
3     1464  193261.218750
4     1465  187279.500000
...    ...            ...
1454  2915   84418.914062
1455  2916   75160.867188
1456  2917  158373.281250
1457  2918  119447.867188
1458  2919  216942.234375
1459 rows × 2 columns
sns.histplot(df_test["SalePrice"],bins=20)
The distribution of the predicted prices looks plausible.
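Because the metric takes the logarithm of each prediction, a quick numeric check alongside the histogram can help before submitting. This is a minimal sketch; `check_predictions` is a hypothetical helper of my own, and the three sample values are copied from the prediction output above.

```python
import numpy as np


def check_predictions(pred):
    """Summarize predictions: count, finiteness, positivity, and range."""
    pred = np.asarray(pred, dtype=float)
    return {
        "n": int(pred.size),
        "all_finite": bool(np.isfinite(pred).all()),
        "all_positive": bool((pred > 0).all()),
        "min": float(pred.min()),
        "max": float(pred.max()),
    }


# Three predicted values copied from the output above, for illustration only
sample = [126004.601562, 165939.859375, 84418.914062]
print(check_predictions(sample))
```

With the real predictions this would be called as `check_predictions(df_test["SalePrice"])`.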
Uploading the scored results to Kaggle
df_test[["Id","SalePrice"]].to_csv("ames_submission.csv",index=False)
!/Users/hinomaruc/Desktop/blog/my-venv/bin/kaggle competitions submit -c house-prices-advanced-regression-techniques -f ames_submission.csv -m "#7 xgboost"
100%|██████████████████████████████████████| 21.1k/21.1k [00:03<00:00, 5.57kB/s]
Successfully submitted to House Prices - Advanced Regression Techniques
#7 xgboost Score: 0.14524
This is a better result than Random Forest with parameter tuning by Bayesian optimization.
Library versions used
pandas Version: 1.4.3
numpy Version: 1.22.4
scikit-learn Version: 1.1.1
seaborn Version: 0.11.2
matplotlib Version: 3.5.2
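Summary
XGBoost is as good as ever. It is easy to use, and even with the default settings it reached a score of 0.14524.
Next time I will tune XGBoost's hyperparameters with random search.
References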
・https://xgboost.readthedocs.io/en/stable/python/python_intro.html