A blog about analyzing all kinds of data with Python while thinking through my own career.

(No. 4-7) Predicting Ames Housing Prices with XGBoost, Part 2

Data Analytics

Last time I tried XGBoost with its default settings.

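(No. 4-7) Predicting Ames Housing Prices with XGBoost, Part 1
This time it's XGBoost. I tried grid search for SVR and Bayesian optimization for random forest, so for XGBoost I'd like to tune the parameters with a method called random search. In this post, the results with the default settings are...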

This time I'd like to tune the parameters with a method called random search.

instead of testing every combination of hyperparameters, random searches only test a certain number of combinations that are selected randomly.
Source: https://towardsdatascience.com/improve-your-hyperparameter-tuning-experience-with-the-random-search-2c05d789175f

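As the name suggests, random search looks for good parameters by evaluating randomly selected combinations.

Grid search tests every combination you specify, so it finds the best one within the grid, but it takes correspondingly longer.

Random search needs less processing time than grid search. Because some combinations are never checked, it may not reach the very best result, but it can be expected to land at least close to the optimal parameters.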
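
To make the difference in cost concrete, here is a minimal sketch that counts the candidates each approach would evaluate. The search space below is hypothetical and chosen only for illustration; it is not the one used later in this post. `ParameterGrid` and `ParameterSampler` are the scikit-learn helpers behind grid search and random search.

```python
from scipy.stats import loguniform
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Hypothetical search space, chosen only for illustration
grid_space = {
    "max_depth": [5, 6, 7, 8, 9, 10],
    "subsample": [0.25, 0.5, 0.75, 1.0],
    "colsample_bytree": [0.25, 0.5, 0.75, 1.0],
}
# Grid search evaluates every combination: 6 * 4 * 4 = 96 candidates
grid_candidates = list(ParameterGrid(grid_space))

# Random search draws a fixed number of candidates, and continuous
# parameters can be sampled from distributions instead of fixed lists
random_space = {
    "max_depth": [5, 6, 7, 8, 9, 10],
    "subsample": loguniform(1e-1, 1),
    "colsample_bytree": loguniform(1e-1, 1),
}
random_candidates = list(
    ParameterSampler(random_space, n_iter=10, random_state=0)
)

print(len(grid_candidates))    # 96
print(len(random_candidates))  # 10
```

With 5-fold cross-validation, the grid would need 480 fits while the random search needs 50, at the price of leaving most of the space unexplored.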

Let's give it a try.


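Evaluation metric

This is a competition to predict the SalePrice (sale price) for each house Id.

The evaluation metric appears to be the Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted SalePrice and the logarithm of the observed SalePrice.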
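
As a rough illustration of this metric, the sketch below computes the RMSE between log prices with NumPy. The function name `rmse_of_logs` and the sample prices are made up for this example; Kaggle's own scoring code is not shown in this post.

```python
import numpy as np

def rmse_of_logs(y_true, y_pred):
    """RMSE between log(actual) and log(predicted) prices."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))

# Made-up prices, for illustration only
actual = [200000, 150000, 300000]
predicted = [220000, 135000, 300000]
print(rmse_of_logs(actual, predicted))
```

Because the error is measured on the log scale, a 10% miss on a cheap house counts about as much as a 10% miss on an expensive one.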

House Prices - Advanced Regression Techniques | Kaggle
Predict sales prices and practice feature engineering, RFs, and gradient boosting

XGBoost

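Preparing the data for analysis

Missing-value handling and feature engineering were done beforehand, and the resulting data was exported.

To reproduce the results in this post, prepare the data by following the posts below first.

(No. 3-2) Data Processing for the Ames Housing Price Dataset (1)

(No. 3-3) Data Processing for the Ames Housing Price Dataset (2)

Loading the training data and the scoring data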

import pandas as pd
import numpy as np
# Load the training and test data of the Ames housing price dataset
df = pd.read_csv("/Users/hinomaruc/Desktop/blog/dataset/ames/ames_train.csv")
df_test = pd.read_csv("/Users/hinomaruc/Desktop/blog/dataset/ames/ames_test.csv")
df.head()
Out[0]

Id LotFrontage LotArea LotShape Utilities LandSlope OverallQual OverallCond MasVnrArea ExterCond ... SaleType_New SaleType_Oth SaleType_WD SaleCondition_Abnorml SaleCondition_AdjLand SaleCondition_Alloca SaleCondition_Family SaleCondition_Normal SaleCondition_Partial SalePrice
0 1 65.0 8450 3.0 3.0 2.0 7 5 196.0 2.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 208500
1 2 80.0 9600 3.0 3.0 2.0 6 8 0.0 2.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 181500
2 3 68.0 11250 2.0 3.0 2.0 7 5 162.0 2.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 223500
3 4 60.0 9550 2.0 3.0 2.0 7 5 0.0 2.0 ... 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 140000
4 5 84.0 14260 2.0 3.0 2.0 8 5 350.0 2.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 250000

5 rows × 335 columns

# Plot settings
from IPython.display import HTML
import seaborn as sns
from matplotlib import ticker
import matplotlib.pyplot as plt
sns.set_style("whitegrid")
from matplotlib import rcParams
rcParams['font.family'] = 'Hiragino Sans' # on macOS
#rcParams['font.family'] = 'Meiryo' # on Windows
#rcParams['font.family'] = 'VL PGothic' # on Linux
rcParams['xtick.labelsize'] = 12       # font size of the x-axis tick labels
rcParams['ytick.labelsize'] = 12       # font size of the y-axis tick labels
rcParams['axes.labelsize'] = 18        # font size of the axis labels
rcParams['figure.figsize'] = 18,8      # figure size (inches)

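Choosing the variables to use for XGBoost

As before, I'll simply throw in all of them.

Training XGBoost (random search)

# Specify the explanatory variables and the target variable

# Training data
X_train = df.drop(["Id","SalePrice"],axis=1)
Y_train = df["SalePrice"] # sale price

# Test data
X_test = df_test.drop(["Id"],axis=1)
# Parameters explored by the random search
# Reference: https://xgboost.readthedocs.io/en/stable/parameter.html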
from scipy.stats import loguniform  # SciPy >= 1.4; avoids relying on scikit-learn's internal compatibility module (sklearn.utils.fixes)
import scipy.stats

distributions = {
 'objective': ['reg:squarederror','reg:squaredlogerror'], # default=reg:squarederror
 'booster': ['gbtree','gblinear'], # default=gbtree
 'colsample_bylevel': loguniform(1e-1, 1), # default=1
 'colsample_bynode': loguniform(1e-1, 1), # default=1
 'colsample_bytree': loguniform(1e-1, 1), # default=1
 'gamma': scipy.stats.expon(scale=10), # default=0 alias: min_split_loss
 'grow_policy': ['depthwise','lossguide'], #  default:depthwise
 'eta': scipy.stats.expon(scale=1), # default=0.3
 'max_delta_step': scipy.stats.expon(scale=10), # default=0
 'max_depth': [5,6,7,8,9,10], # default=6; raised an error unless the values were ints
 'min_child_weight': scipy.stats.expon(scale=10), # default=1
 'reg_alpha': scipy.stats.expon(scale=10), # default=0
 'reg_lambda': scipy.stats.expon(scale=10), # default=1 for gbtree, 0 for gblinear
 'subsample': loguniform(1e-1, 1) # default=1, range (0, 1]
 #'base_score': 0.5, # default=0.5
 #'callbacks': None,
 #'early_stopping_rounds': None,
 #'enable_categorical': False,
 #'eval_metric': None, # default according to objective
 #'gpu_id': -1, 
 #'importance_type': None,
 #'interaction_constraints': '',
 #'max_bin': 256,
 #'max_cat_to_onehot': 4,
 #'missing': nan,
 #'monotone_constraints': '()',
 #'n_estimators': 100,
 #'n_jobs': 2,
 #'num_parallel_tree': 1,
 #'max_leaves': 0, # default=0 Not used by exact tree method; raised an error unless the value was an int
 #'predictor': 'auto',
 #'random_state': 0,
 #'sampling_method': 'uniform', # default=uniform
 #'scale_pos_weight': 1,
 #'tree_method': 'exact',
 #'validate_parameters': 1,
 #'verbosity': None
}

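# Notes
# loguniform(1e-1, 1) -> returns values between 0.1 and 1
# scipy.stats.expon(scale=10) -> returns non-negative values with mean 10 (there is no upper bound, so values above 10 do occur)
# Note: you can check the returned values with scipy.stats.expon(scale=10).rvs() and similar calls.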

import xgboost as xgb
import multiprocessing

regr = xgb.XGBRegressor(random_state=1414)

from sklearn.model_selection import RandomizedSearchCV
# Random search: https://scikit-learn.org/stable/modules/grid_search.html#randomized-parameter-search
# No scoring is specified, so the regressor's default score (R^2) is used to rank candidates
search = RandomizedSearchCV(regr, distributions, random_state=1415, verbose=3, n_jobs=multiprocessing.cpu_count() // 2, cv=5, n_iter=100)

# Fit the search
search.fit(X_train, Y_train)
Out[0]
Fitting 5 folds for each of 100 candidates, totalling 500 fits
[CV 2/5] END booster=gbtree, colsample_bylevel=0.20249137197695594, colsample_bynode=0.1689576228883714, colsample_bytree=0.4942861918490231, eta=0.47552003596119125, gamma=1.853547031998277, grow_policy=lossguide, max_delta_step=19.54969293025852, max_depth=8, min_child_weight=2.6684567339875, objective=reg:squarederror, reg_alpha=1.1879186345650592, reg_lambda=28.056713516080713, subsample=0.1054067067326944;, score=-5.044 total time=   0.4s
... (omitted) ...
[21:54:30] WARNING: /Users/runner/work/xgboost/xgboost/python-package/build/temp.macosx-10.9-x86_64-3.7/xgboost/src/learner.cc:627: 
Parameters: { "colsample_bylevel", "colsample_bynode", "colsample_bytree", "gamma", "grow_policy", "max_delta_step", "max_depth", "min_child_weight", "subsample" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.
... (omitted) ...
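
The sampled values in the log above (for example max_delta_step=19.5 and reg_lambda=28.1) show that scipy.stats.expon(scale=10) is not capped at 10. The following sketch draws from the two distributions used in the search space to check their ranges; the sample size and seed are arbitrary choices for this illustration.

```python
import numpy as np
from scipy.stats import expon, loguniform

n = 10_000
log_samples = loguniform(1e-1, 1).rvs(size=n, random_state=0)
exp_samples = expon(scale=10).rvs(size=n, random_state=0)

# loguniform(1e-1, 1) stays inside [0.1, 1]
print(log_samples.min(), log_samples.max())

# expon(scale=10) is non-negative with mean 10 and no upper bound
print(exp_samples.mean(), exp_samples.max())

# Share of draws above 10; the theoretical value is exp(-1), about 0.37
print((exp_samples > 10).mean())
```

In other words, roughly a third of the sampled regularization values land above 10, which is worth keeping in mind when reading the best parameters.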

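Warnings sometimes appeared along the way, but the search finished and produced the best parameters.

Depending on the parameter combination, some parameters are not used, and XGBoost emits a WARNING to say so. The parameters flagged in the log above are tree-specific ones such as max_depth and subsample, which is consistent with trials that sampled booster=gblinear.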
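
One way to avoid sampling parameters that a booster ignores is to give each booster its own search space. scikit-learn accepts a list of dicts for the parameter distributions: one dict is picked first, then parameters are sampled from it. The sketch below only samples candidates with `ParameterSampler`; the split into `tree_space` and `linear_space` and the parameters in each are assumptions for illustration, not the setup used in this post.

```python
from scipy.stats import expon, loguniform
from sklearn.model_selection import ParameterSampler

# Tree-specific parameters are only sampled together with booster="gbtree"
tree_space = {
    "booster": ["gbtree"],
    "max_depth": [5, 6, 7, 8, 9, 10],
    "subsample": loguniform(1e-1, 1),
    "colsample_bytree": loguniform(1e-1, 1),
    "reg_alpha": expon(scale=10),
    "reg_lambda": expon(scale=10),
}
# The linear booster only gets the regularization terms
linear_space = {
    "booster": ["gblinear"],
    "reg_alpha": expon(scale=10),
    "reg_lambda": expon(scale=10),
}
distributions_by_booster = [tree_space, linear_space]

samples = list(
    ParameterSampler(distributions_by_booster, n_iter=20, random_state=0)
)
for params in samples[:3]:
    print(params)
```

The same list could be passed to RandomizedSearchCV in place of a single dict, so that no trial mixes gblinear with tree-only parameters.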

print("Best parameter (CV score=%0.3f):" % search.best_score_)
print(search.best_params_)
Out[0]
Best parameter (CV score=0.809):
{'booster': 'gblinear', 'colsample_bylevel': 0.4519016075105826, 'colsample_bynode': 0.3313778089901033, 'colsample_bytree': 0.1438563872420179, 'eta': 0.2363846114743429, 'gamma': 9.323119539203132, 'grow_policy': 'lossguide', 'max_delta_step': 0.16298747971912123, 'max_depth': 6, 'min_child_weight': 10.922794802690138, 'objective': 'reg:squarederror', 'reg_alpha': 31.147959993458834, 'reg_lambda': 0.004404341890977873, 'subsample': 0.2660245581100381}
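
Note that the CV score printed here is R^2, the default score of a scikit-learn regressor, because no scoring argument was passed to RandomizedSearchCV. It is therefore not directly comparable to the competition metric. A minimal sketch of one alternative is shown below: train on log prices and score with RMSE. It uses synthetic data and Ridge as a lightweight stand-in for xgb.XGBRegressor, so every name and number in it is an assumption for illustration only.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (NOT the Ames data): positive "prices"
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.exp(12 + 0.3 * X[:, 0] + 0.1 * rng.normal(size=200))

# Training on log(price) and scoring with RMSE makes the CV score
# comparable to an RMSE-of-log-prices metric
search_demo = RandomizedSearchCV(
    Ridge(),                               # stand-in for xgb.XGBRegressor
    {"alpha": loguniform(1e-3, 1e2)},
    scoring="neg_root_mean_squared_error",
    n_iter=10,
    cv=5,
    random_state=0,
)
search_demo.fit(X, np.log(y))

# scikit-learn maximizes scores, so RMSE is reported with a negative sign
print("CV RMSE on log prices: %0.3f" % -search_demo.best_score_)
```

With this setup the model predicts log prices, so predictions would need to be converted back with np.exp before writing a submission file.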
# Check the coefficients (importances) of the explanatory variables
best_xgb_model = search.best_estimator_
coef = pd.DataFrame()
coef["features"] = best_xgb_model.feature_names_in_
coef["importances"] = best_xgb_model.feature_importances_
HTML(coef.sort_values(by="importances",ascending=False).to_html())
Out[0]

features importances
241 RoofMatl_ClyTile 1.364783e-01
113 Condition2_PosN 8.331048e-02
43 MSSubClass_160 7.708784e-02
57 MSZoning_C (all) 6.432155e-02
82 Neighborhood_Edwards 5.758692e-02
66 LandContour_Bnk 4.005601e-02
101 Condition1_Feedr 3.854522e-02
83 Neighborhood_Gilbert 3.851381e-02
85 Neighborhood_MeadowV 3.771904e-02
269 Foundation_CBlock 3.534563e-02
268 Foundation_BrkTil 3.476647e-02
45 MSSubClass_190 3.404771e-02
87 Neighborhood_NAmes 3.368609e-02
61 MSZoning_RM 3.331088e-02
86 Neighborhood_Mitchel 3.252959e-02
286 GarageType_2Types 3.248380e-02
24 KitchenAbvGr 3.205123e-02
89 Neighborhood_NWAmes 3.154989e-02
41 MSSubClass_120 2.955483e-02
105 Condition1_RRAe 2.941146e-02
294 PavedDrive_P 2.927018e-02
264 MasVnrType_BrkCmn 2.820734e-02
293 PavedDrive_N 2.744913e-02
72 LotConfig_FR2 2.668959e-02
93 Neighborhood_SWISU 2.591797e-02
260 Exterior1st_Stucco 2.559235e-02
104 Condition1_PosN 2.535482e-02
84 Neighborhood_IDOTRR 2.516552e-02
51 MSSubClass_60 2.477851e-02
208 YearBuilt_1984 2.423521e-02
92 Neighborhood_OldTown 2.420007e-02
137 YearBuilt_1910 2.412999e-02
295 PavedDrive_Y 2.286284e-02
326 SaleType_WD 2.247498e-02
285 Electrical_SBrkr 2.211657e-02
94 Neighborhood_Sawyer 2.202679e-02
236 RoofStyle_Gable 2.201424e-02
297 MiscFeature_NA 2.197519e-02
100 Condition1_Artery 2.145700e-02
299 MiscFeature_Shed 2.140664e-02
265 MasVnrType_BrkFace 2.129468e-02
327 SaleCondition_Abnorml 2.018851e-02
55 MSSubClass_85 1.957877e-02
64 Alley_NA 1.885411e-02
111 Condition2_Norm 1.883510e-02
62 Street_Pave 1.855532e-02
282 Electrical_FuseF 1.744998e-02
302 MoSold_10 1.723088e-02
246 RoofMatl_Tar&Grv 1.655600e-02
275 Heating_GasA 1.614092e-02
52 MSSubClass_70 1.557085e-02
197 YearBuilt_1973 1.531382e-02
80 Neighborhood_CollgCr 1.495131e-02
316 YrSold_2009 1.483480e-02
199 YearBuilt_1975 1.476974e-02
63 Alley_Grvl 1.474482e-02
278 Heating_OthW 1.463584e-02
187 YearBuilt_1963 1.426249e-02
73 LotConfig_FR3 1.421422e-02
193 YearBuilt_1969 1.401104e-02
281 Electrical_FuseA 1.395318e-02
44 MSSubClass_180 1.375699e-02
287 GarageType_Attchd 1.325168e-02
330 SaleCondition_Family 1.324303e-02
291 GarageType_Detchd 1.303744e-02
147 YearBuilt_1920 1.293923e-02
181 YearBuilt_1957 1.283755e-02
189 YearBuilt_1965 1.254245e-02
318 SaleType_COD 1.249419e-02
315 YrSold_2008 1.159116e-02
69 LandContour_Lvl 1.148733e-02
188 YearBuilt_1964 1.120989e-02
311 MoSold_8 1.065633e-02
255 Exterior1st_HdBoard 1.040226e-02
249 Exterior1st_AsbShng 1.022551e-02
129 YearBuilt_1900 1.022127e-02
305 MoSold_2 1.008975e-02
74 LotConfig_Inside 9.991845e-03
70 LotConfig_Corner 9.955981e-03
276 Heating_GasW 9.897222e-03
60 MSZoning_RL 9.582582e-03
217 YearBuilt_1993 9.558224e-03
317 YrSold_2010 9.185801e-03
65 Alley_Pave 9.033608e-03
242 RoofMatl_CompShg 8.982881e-03
309 MoSold_6 8.847974e-03
182 YearBuilt_1958 8.790661e-03
304 MoSold_12 8.656987e-03
303 MoSold_11 8.602913e-03
216 YearBuilt_1992 8.593409e-03
202 YearBuilt_1978 8.434063e-03
30 GarageQual 7.770768e-03
98 Neighborhood_Timber 7.513789e-03
23 BedroomAbvGr 7.246896e-03
201 YearBuilt_1977 7.220321e-03
314 YrSold_2007 7.173019e-03
266 MasVnrType_None 7.122650e-03
79 Neighborhood_ClearCr 6.765737e-03
155 YearBuilt_1928 6.729275e-03
270 Foundation_PConc 6.695215e-03
50 MSSubClass_50 6.639951e-03
263 Exterior1st_WdShing 6.572891e-03
151 YearBuilt_1924 6.456998e-03
331 SaleCondition_Normal 6.424721e-03
200 YearBuilt_1976 6.347195e-03
292 GarageType_NA 6.335316e-03
214 YearBuilt_1990 5.895378e-03
262 Exterior1st_Wd Sdng 5.809718e-03
313 YrSold_2006 5.601048e-03
323 SaleType_ConLw 5.485953e-03
231 YearBuilt_2007 5.405434e-03
277 Heating_Grav 5.203868e-03
128 YearBuilt_1898 5.097715e-03
145 YearBuilt_1918 5.086328e-03
59 MSZoning_RH 5.015036e-03
288 GarageType_Basment 4.653136e-03
235 RoofStyle_Flat 4.479492e-03
215 YearBuilt_1991 4.285953e-03
301 MoSold_1 4.169056e-03
10 BsmtCond 4.117712e-03
142 YearBuilt_1915 3.892665e-03
290 GarageType_CarPort 3.686885e-03
68 LandContour_Low 3.557683e-03
26 Functional 3.539023e-03
102 Condition1_Norm 3.359756e-03
139 YearBuilt_1912 3.316668e-03
53 MSSubClass_75 2.694070e-03
267 MasVnrType_Stone 2.654621e-03
306 MoSold_3 2.651667e-03
192 YearBuilt_1968 2.488862e-03
196 YearBuilt_1972 2.293185e-03
312 MoSold_9 2.131248e-03
177 YearBuilt_1953 2.061957e-03
141 YearBuilt_1914 2.011883e-03
167 YearBuilt_1941 1.992464e-03
54 MSSubClass_80 1.966759e-03
228 YearBuilt_2004 1.510183e-03
222 YearBuilt_1998 1.384342e-03
307 MoSold_4 1.197370e-03
261 Exterior1st_VinylSd 8.920432e-04
258 Exterior1st_Plywood 5.492828e-04
205 YearBuilt_1981 4.443001e-04
20 BsmtHalfBath 3.706285e-04
32 EnclosedPorch 7.811136e-05
31 OpenPorchSF 1.667543e-05
37 MiscVal 7.832025e-06
58 MSZoning_FV 1.950779e-06
106 Condition1_RRAn 1.513840e-06
221 YearBuilt_1997 2.133810e-07
184 YearBuilt_1960 6.860162e-10
183 YearBuilt_1959 1.487147e-10
195 YearBuilt_1971 1.239090e-10
191 YearBuilt_1967 2.228932e-12
173 YearBuilt_1949 1.528147e-12
170 YearBuilt_1946 1.377871e-12
153 YearBuilt_1926 1.157876e-12
185 YearBuilt_1961 7.385856e-13
298 MiscFeature_Othr 2.602122e-13
194 YearBuilt_1970 2.597090e-13
206 YearBuilt_1982 6.927290e-14
166 YearBuilt_1940 6.164192e-14
176 YearBuilt_1952 3.766673e-14
149 YearBuilt_1922 3.661187e-14
279 Heating_Wall 3.307666e-14
175 YearBuilt_1951 2.807156e-14
110 Condition2_Feedr 2.626883e-14
157 YearBuilt_1930 2.484343e-14
109 Condition2_Artery 1.310668e-14
154 YearBuilt_1927 1.213512e-14
251 Exterior1st_BrkComm 1.171719e-14
122 YearBuilt_1885 1.031899e-14
284 Electrical_Mix 9.573177e-15
164 YearBuilt_1938 9.435978e-15
204 YearBuilt_1980 7.608706e-15
328 SaleCondition_AdjLand 6.082970e-15
325 SaleType_Oth 6.036885e-15
152 YearBuilt_1925 3.811845e-15
171 YearBuilt_1947 3.349124e-15
158 YearBuilt_1931 3.274006e-15
116 Condition2_RRNn 2.507881e-15
245 RoofMatl_Roll 1.500304e-15
274 Heating_Floor 8.004688e-16
118 YearBuilt_1875 3.416647e-16
117 YearBuilt_1872 -0.000000e+00
115 Condition2_RRAn -0.000000e+00
119 YearBuilt_1879 -0.000000e+00
130 YearBuilt_1901 -0.000000e+00
244 RoofMatl_Metal -0.000000e+00
114 Condition2_RRAe -0.000000e+00
209 YearBuilt_1985 -0.000000e+00
107 Condition1_RRNe -0.000000e+00
296 MiscFeature_Gar2 -0.000000e+00
273 Foundation_Wood -0.000000e+00
300 MiscFeature_TenC -0.000000e+00
239 RoofStyle_Mansard -0.000000e+00
121 YearBuilt_1882 -0.000000e+00
123 YearBuilt_1890 -0.000000e+00
48 MSSubClass_40 -0.000000e+00
136 YearBuilt_1908 -0.000000e+00
163 YearBuilt_1937 -0.000000e+00
253 Exterior1st_CBlock -0.000000e+00
146 YearBuilt_1919 -0.000000e+00
126 YearBuilt_1895 -0.000000e+00
144 YearBuilt_1917 -0.000000e+00
140 YearBuilt_1913 -0.000000e+00
138 YearBuilt_1911 -0.000000e+00
42 MSSubClass_150 -0.000000e+00
135 YearBuilt_1907 -0.000000e+00
127 YearBuilt_1896 -0.000000e+00
133 YearBuilt_1905 -0.000000e+00
132 YearBuilt_1904 -0.000000e+00
131 YearBuilt_1902 -0.000000e+00
134 YearBuilt_1906 -0.000000e+00
150 YearBuilt_1923 -1.473838e-16
212 YearBuilt_1988 -3.795833e-16
156 YearBuilt_1929 -1.228930e-15
243 RoofMatl_Membran -2.374595e-15
237 RoofStyle_Gambrel -2.507778e-15
250 Exterior1st_AsphShn -2.711456e-15
223 YearBuilt_1999 -4.773307e-15
103 Condition1_PosA -5.643381e-15
76 Neighborhood_Blueste -9.171732e-15
256 Exterior1st_ImStucc -1.467647e-14
162 YearBuilt_1936 -1.641516e-14
272 Foundation_Stone -1.766240e-14
240 RoofStyle_Shed -2.115336e-14
108 Condition1_RRNn -2.270212e-14
168 YearBuilt_1942 -2.525766e-14
321 SaleType_ConLD -3.641150e-13
234 YearBuilt_2010 -4.661779e-13
211 YearBuilt_1987 -8.581646e-13
78 Neighborhood_BrkSide -9.901119e-13
227 YearBuilt_2003 -1.000230e-12
322 SaleType_ConLI -1.119078e-11
229 YearBuilt_2005 -4.700402e-11
226 YearBuilt_2002 -5.132749e-11
120 YearBuilt_1880 -1.338902e-10
230 YearBuilt_2006 -1.538841e-10
95 Neighborhood_SawyerW -1.836599e-09
207 YearBuilt_1983 -2.088400e-09
143 YearBuilt_1916 -3.702996e-09
247 RoofMatl_WdShake -5.027636e-09
283 Electrical_FuseP -7.475331e-09
320 SaleType_Con -1.704861e-08
148 YearBuilt_1921 -9.035572e-08
75 Neighborhood_Blmngtn -1.130907e-07
1 LotArea -1.311478e-06
33 3SsnPorch -1.905425e-05
38 TotalLivArea -2.273556e-05
15 BsmtUnfSF -3.702013e-05
18 LowQualFinSF -5.270061e-05
289 GarageType_BuiltIn -5.607540e-05
13 BsmtFinSF1 -5.735111e-05
280 CentralAir_Y -6.028981e-05
40 TotalWoodDeckPorch -7.493954e-05
34 ScreenPorch -8.281871e-05
7 MasVnrArea -9.308037e-05
17 2ndFlrSF -1.019909e-04
329 SaleCondition_Alloca -1.053824e-04
88 Neighborhood_NPkVill -1.178915e-04
179 YearBuilt_1955 -1.461224e-04
0 LotFrontage -4.833398e-04
47 MSSubClass_30 -5.049534e-04
172 YearBuilt_1948 -7.013854e-04
36 Fence -7.061210e-04
56 MSSubClass_90 -8.671080e-04
210 YearBuilt_1986 -1.196173e-03
178 YearBuilt_1954 -1.755640e-03
49 MSSubClass_45 -2.162681e-03
8 ExterCond -2.589513e-03
224 YearBuilt_2000 -2.913814e-03
161 YearBuilt_1935 -3.098941e-03
257 Exterior1st_MetalSd -3.223440e-03
310 MoSold_7 -3.445940e-03
112 Condition2_PosA -3.939633e-03
125 YearBuilt_1893 -3.951334e-03
14 BsmtFinType2 -4.144865e-03
259 Exterior1st_Stone -4.418576e-03
169 YearBuilt_1945 -4.633356e-03
16 HeatingQC -4.667311e-03
160 YearBuilt_1934 -4.922951e-03
190 YearBuilt_1966 -5.101435e-03
12 BsmtFinType1 -5.125653e-03
308 MoSold_5 -5.613291e-03
238 RoofStyle_Hip -5.619870e-03
186 YearBuilt_1962 -5.738712e-03
25 TotRmsAbvGrd -5.917800e-03
19 BsmtFullBath -6.115159e-03
9 BsmtQual -6.710044e-03
319 SaleType_CWD -6.827183e-03
39 TotalBathRms -6.863371e-03
180 YearBuilt_1956 -6.970157e-03
332 SaleCondition_Partial -7.070826e-03
28 GarageFinish -7.854546e-03
165 YearBuilt_1939 -7.989953e-03
203 YearBuilt_1979 -8.023988e-03
96 Neighborhood_Somerst -8.101874e-03
198 YearBuilt_1974 -8.251028e-03
324 SaleType_New -8.350134e-03
35 PoolQC -8.447775e-03
27 FireplaceQu -8.714589e-03
213 YearBuilt_1989 -9.248829e-03
174 YearBuilt_1950 -9.307830e-03
6 OverallCond -9.386095e-03
22 HalfBath -9.791858e-03
77 Neighborhood_BrDale -1.045923e-02
67 LandContour_HLS -1.263923e-02
218 YearBuilt_1994 -1.345258e-02
124 YearBuilt_1892 -1.540263e-02
2 LotShape -1.547271e-02
225 YearBuilt_2001 -1.943541e-02
5 OverallQual -1.948718e-02
11 BsmtExposure -1.987083e-02
219 YearBuilt_1995 -1.991667e-02
4 LandSlope -2.237393e-02
3 Utilities -2.246451e-02
99 Neighborhood_Veenker -2.276831e-02
21 FullBath -2.300613e-02
159 YearBuilt_1932 -2.454287e-02
220 YearBuilt_1996 -2.497749e-02
271 Foundation_Slab -2.534658e-02
29 GarageCars -2.769025e-02
233 YearBuilt_2009 -2.785729e-02
46 MSSubClass_20 -3.145137e-02
254 Exterior1st_CemntBd -3.647721e-02
81 Neighborhood_Crawfor -3.756185e-02
71 LotConfig_CulDSac -3.814503e-02
252 Exterior1st_BrkFace -4.142647e-02
232 YearBuilt_2008 -7.582833e-02
90 Neighborhood_NoRidge -1.032294e-01
91 Neighborhood_NridgHt -1.072475e-01
97 Neighborhood_StoneBr -1.153319e-01
248 RoofMatl_WdShngl -1.214418e-01

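Looking at the result, it seems a little suspicious. RoofMatl_ClyTile having the highest importance does not match my intuition. Note that the best model uses booster=gblinear, so these values are presumably derived from linear coefficients rather than tree-based gains, which would explain why many of them are negative.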
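
Since the table contains both positive and negative values, sorting by the raw value pushes the largest negative entries to the bottom. A small sketch of ranking by magnitude instead, using four rows copied from the table above (the `coef_demo` and `abs_importance` names are introduced only for this example):

```python
import pandas as pd

# Four rows copied from the table above
coef_demo = pd.DataFrame({
    "features": ["RoofMatl_ClyTile", "Condition2_PosN",
                 "Neighborhood_StoneBr", "RoofMatl_WdShngl"],
    "importances": [1.364783e-01, 8.331048e-02, -1.153319e-01, -1.214418e-01],
})

# Rank by magnitude so that large negative values are not pushed to the bottom
coef_demo["abs_importance"] = coef_demo["importances"].abs()
ranked = coef_demo.sort_values(by="abs_importance", ascending=False)
print(ranked[["features", "importances"]].to_string(index=False))
```

Ranked this way, RoofMatl_WdShngl and Neighborhood_StoneBr come right after RoofMatl_ClyTile, ahead of Condition2_PosN.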

### Apply the model and predict SalePrice
df_test["SalePrice"] = search.predict(X_test)
df_test[["Id","SalePrice"]]
Out[0]

Id SalePrice
0 1461 110429.750000
1 1462 126215.140625
2 1463 183316.406250
3 1464 190553.328125
4 1465 209255.078125
... ... ...
1454 2915 69338.406250
1455 2916 72127.171875
1456 2917 187375.984375
1457 2918 119090.171875
1458 2919 223522.406250

1459 rows × 2 columns

sns.histplot(df_test["SalePrice"],bins=20)

The predictions look plausible.

Uploading the scored results to Kaggle

df_test[["Id","SalePrice"]].to_csv("ames_submission.csv",index=False)
!/Users/hinomaruc/Desktop/blog/my-venv/bin/kaggle competitions submit -c house-prices-advanced-regression-techniques -f ames_submission.csv -m "#7 xgboost random search"
Out[0]
100%|██████████████████████████████████████| 21.1k/21.1k [00:04<00:00, 5.16kB/s]
Successfully submitted to House Prices - Advanced Regression Techniques

#7 xgboost random search
Score: 0.16435

The score is worse than with XGBoost's default settings. It may be better to increase the number of trials and to narrow down the parameters being searched.


Library versions used

pandas Version: 1.4.3
numpy Version: 1.22.4
scikit-learn Version: 1.1.1
seaborn Version: 0.11.2
matplotlib Version: 3.5.2


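Summary

This time I tuned XGBoost's parameters with random search.

The result was a failure, but I found the ease of running the search very appealing.

I have tried various methods, but personally I like grid search best because it is the easiest to understand.

Next time I will try AutoML, my last resort.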


References

https://xgboost.readthedocs.io/en/stable/parameter.html
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.expon.html
https://amalog.hateblo.jp/entry/hyper-parameter-search
