A blog where I analyze all kinds of data with Python while thinking over my own career

(Part 4-7) Predicting Ames House Prices with XGBoost, Part 1

Data Analytics

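This time the model is XGBoost.

I tried grid search with SVR and Bayesian optimization with random forest, so for XGBoost I would like to tune the parameters with random search.

This post summarizes the results obtained with the default settings.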

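Evaluation metric

This is a competition to predict the SalePrice (sale price) for each house Id.

The evaluation metric appears to be the Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted SalePrice and the logarithm of the observed SalePrice.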
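As a rough illustration of this metric, here is a minimal sketch that uses only the standard library. The function name `rmse_of_logs` and the example prices are made up for illustration and are not from the competition data.

```python
import math

def rmse_of_logs(y_true, y_pred):
    """RMSE between log(y_pred) and log(y_true); all prices must be > 0."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [
        (math.log(p) - math.log(t)) ** 2 for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up prices for illustration only (not from the dataset)
actual = [200000.0, 100000.0]
predicted = [220000.0, 90000.0]
score = rmse_of_logs(actual, predicted)
print(score)
```

Because the error is taken on the log scale, being 10% off on an expensive house and 10% off on a cheap house contribute about the same amount to the score.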

House Prices - Advanced Regression Techniques | Kaggle
Predict sales prices and practice feature engineering, RFs, and gradient boosting

XGBoost

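Preparing the data for analysis

Missing-value handling and feature engineering were done beforehand, and the resulting data was exported.

To reproduce the results in this post, please prepare the data by following the posts below first.

(Part 3-2) Data processing for the Ames house price dataset ①

(Part 3-3) Data processing for the Ames house price dataset ②

Loading the training data and the data to be scored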

import pandas as pd
import numpy as np
# Load the training and test data of the Ames house price dataset
df = pd.read_csv("/Users/hinomaruc/Desktop/blog/dataset/ames/ames_train.csv")
df_test = pd.read_csv("/Users/hinomaruc/Desktop/blog/dataset/ames/ames_test.csv")
df.head()
Out[0]

Id LotFrontage LotArea LotShape Utilities LandSlope OverallQual OverallCond MasVnrArea ExterCond ... SaleType_New SaleType_Oth SaleType_WD SaleCondition_Abnorml SaleCondition_AdjLand SaleCondition_Alloca SaleCondition_Family SaleCondition_Normal SaleCondition_Partial SalePrice
0 1 65.0 8450 3.0 3.0 2.0 7 5 196.0 2.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 208500
1 2 80.0 9600 3.0 3.0 2.0 6 8 0.0 2.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 181500
2 3 68.0 11250 2.0 3.0 2.0 7 5 162.0 2.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 223500
3 4 60.0 9550 2.0 3.0 2.0 7 5 0.0 2.0 ... 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 140000
4 5 84.0 14260 2.0 3.0 2.0 8 5 350.0 2.0 ... 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 250000

5 rows × 335 columns

# Plot settings
from IPython.display import HTML
import seaborn as sns
from matplotlib import ticker
import matplotlib.pyplot as plt
sns.set_style("whitegrid")
from matplotlib import rcParams
rcParams['font.family'] = 'Hiragino Sans' # on Mac
#rcParams['font.family'] = 'Meiryo' # on Windows
#rcParams['font.family'] = 'VL PGothic' # on Linux
rcParams['xtick.labelsize'] = 12       # font size of the x-axis tick labels
rcParams['ytick.labelsize'] = 12       # font size of the y-axis tick labels
rcParams['axes.labelsize'] = 18        # font size of the axis labels
rcParams['figure.figsize'] = 18,8      # figure size (inches)

Choosing the variables to use with XGBoost

As before, I will simply feed in all of them.

Training with XGBoost (default settings)

# Specify the explanatory variables and the target variable

# Training data
X_train = df.drop(["Id","SalePrice"],axis=1)
Y_train = df["SalePrice"] # sale price

# Test data
X_test = df_test.drop(["Id"],axis=1)
import xgboost as xgb
import multiprocessing

# https://xgboost.readthedocs.io/en/stable/python/examples/sklearn_parallel.html
if __name__ == "__main__":
    print("Parallel Parameter optimization")
    xgb_model = xgb.XGBRegressor(objective ='reg:squarederror',n_jobs=multiprocessing.cpu_count() // 2)
    xgb_model.fit(X_train, Y_train)
Out[0]
Parallel Parameter optimization

Training finished without any errors.

# List the model parameters
xgb_model.get_params()
Out[0]
{'objective': 'reg:squarederror',
 'base_score': 0.5,
 'booster': 'gbtree',
 'callbacks': None,
 'colsample_bylevel': 1,
 'colsample_bynode': 1,
 'colsample_bytree': 1,
 'early_stopping_rounds': None,
 'enable_categorical': False,
 'eval_metric': None,
 'gamma': 0,
 'gpu_id': -1,
 'grow_policy': 'depthwise',
 'importance_type': None,
 'interaction_constraints': '',
 'learning_rate': 0.300000012,
 'max_bin': 256,
 'max_cat_to_onehot': 4,
 'max_delta_step': 0,
 'max_depth': 6,
 'max_leaves': 0,
 'min_child_weight': 1,
 'missing': nan,
 'monotone_constraints': '()',
 'n_estimators': 100,
 'n_jobs': 2,
 'num_parallel_tree': 1,
 'predictor': 'auto',
 'random_state': 0,
 'reg_alpha': 0,
 'reg_lambda': 1,
 'sampling_method': 'uniform',
 'scale_pos_weight': 1,
 'subsample': 1,
 'tree_method': 'exact',
 'validate_parameters': 1,
 'verbosity': None}
# Check the feature importances of the explanatory variables
coef = pd.DataFrame()
coef["features"] = xgb_model.feature_names_in_
coef["importances"] = xgb_model.feature_importances_
HTML(coef.sort_values(by="importances",ascending=False).to_html())
Out[0]

features importances
5 OverallQual 0.314412
38 TotalLivArea 0.124968
39 TotalBathRms 0.062914
29 GarageCars 0.037409
9 BsmtQual 0.021757
235 RoofStyle_Flat 0.021322
30 GarageQual 0.016001
24 KitchenAbvGr 0.014833
270 Foundation_PConc 0.014108
17 2ndFlrSF 0.011595
81 Neighborhood_Crawfor 0.010507
13 BsmtFinSF1 0.010086
51 MSSubClass_60 0.009971
181 YearBuilt_1957 0.009928
4 LandSlope 0.008380
280 CentralAir_Y 0.008091
324 SaleType_New 0.008017
252 Exterior1st_BrkFace 0.007694
64 Alley_NA 0.007384
27 FireplaceQu 0.007335
228 YearBuilt_2004 0.007203
53 MSSubClass_75 0.006930
26 Functional 0.006755
61 MSZoning_RM 0.006310
201 YearBuilt_1977 0.006139
111 Condition2_Norm 0.006057
277 Heating_Grav 0.006040
251 Exterior1st_BrkComm 0.006014
326 SaleType_WD 0.005569
255 Exterior1st_HdBoard 0.005513
103 Condition1_PosA 0.005412
327 SaleCondition_Abnorml 0.005405
6 OverallCond 0.004754
57 MSZoning_C (all) 0.004641
101 Condition1_Feedr 0.004376
289 GarageType_BuiltIn 0.004252
28 GarageFinish 0.003979
47 MSSubClass_30 0.003976
96 Neighborhood_Somerst 0.003691
21 FullBath 0.003685
269 Foundation_CBlock 0.003624
102 Condition1_Norm 0.003313
267 MasVnrType_Stone 0.003205
104 Condition1_PosN 0.003194
75 Neighborhood_Blmngtn 0.003100
2 LotShape 0.002842
172 YearBuilt_1948 0.002796
86 Neighborhood_Mitchel 0.002784
1 LotArea 0.002728
12 BsmtFinType1 0.002684
34 ScreenPorch 0.002540
84 Neighborhood_IDOTRR 0.002379
282 Electrical_FuseF 0.002168
40 TotalWoodDeckPorch 0.002133
287 GarageType_Attchd 0.002099
331 SaleCondition_Normal 0.001987
100 Condition1_Artery 0.001985
58 MSZoning_FV 0.001925
294 PavedDrive_P 0.001919
231 YearBuilt_2007 0.001873
83 Neighborhood_Gilbert 0.001857
226 YearBuilt_2002 0.001842
330 SaleCondition_Family 0.001815
312 MoSold_9 0.001812
54 MSSubClass_80 0.001778
229 YearBuilt_2005 0.001773
74 LotConfig_Inside 0.001706
165 YearBuilt_1939 0.001692
15 BsmtUnfSF 0.001688
233 YearBuilt_2009 0.001659
167 YearBuilt_1941 0.001615
90 Neighborhood_NoRidge 0.001614
147 YearBuilt_1920 0.001610
11 BsmtExposure 0.001602
108 Condition1_RRNn 0.001572
99 Neighborhood_Veenker 0.001539
31 OpenPorchSF 0.001538
16 HeatingQC 0.001535
303 MoSold_11 0.001495
18 LowQualFinSF 0.001485
55 MSSubClass_85 0.001437
37 MiscVal 0.001394
187 YearBuilt_1963 0.001388
91 Neighborhood_NridgHt 0.001373
80 Neighborhood_CollgCr 0.001362
97 Neighborhood_StoneBr 0.001359
313 YrSold_2006 0.001317
78 Neighborhood_BrkSide 0.001313
32 EnclosedPorch 0.001309
304 MoSold_12 0.001305
95 Neighborhood_SawyerW 0.001297
14 BsmtFinType2 0.001281
105 Condition1_RRAe 0.001231
159 YearBuilt_1932 0.001209
71 LotConfig_CulDSac 0.001182
302 MoSold_10 0.001182
301 MoSold_1 0.001180
20 BsmtHalfBath 0.001176
35 PoolQC 0.001165
79 Neighborhood_ClearCr 0.001136
23 BedroomAbvGr 0.001118
65 Alley_Pave 0.001111
214 YearBuilt_1990 0.001043
310 MoSold_7 0.001029
266 MasVnrType_None 0.001017
7 MasVnrArea 0.001003
45 MSSubClass_190 0.001002
257 Exterior1st_MetalSd 0.000993
200 YearBuilt_1976 0.000949
311 MoSold_8 0.000943
41 MSSubClass_120 0.000939
262 Exterior1st_Wd Sdng 0.000878
232 YearBuilt_2008 0.000876
182 YearBuilt_1958 0.000844
195 YearBuilt_1971 0.000844
194 YearBuilt_1970 0.000839
306 MoSold_3 0.000819
265 MasVnrType_BrkFace 0.000790
143 YearBuilt_1916 0.000789
72 LotConfig_FR2 0.000787
87 Neighborhood_NAmes 0.000778
82 Neighborhood_Edwards 0.000776
268 Foundation_BrkTil 0.000740
258 Exterior1st_Plywood 0.000739
70 LotConfig_Corner 0.000737
254 Exterior1st_CemntBd 0.000729
161 YearBuilt_1935 0.000699
191 YearBuilt_1967 0.000677
315 YrSold_2008 0.000672
177 YearBuilt_1953 0.000665
219 YearBuilt_1995 0.000651
93 Neighborhood_SWISU 0.000648
137 YearBuilt_1910 0.000639
220 YearBuilt_1996 0.000638
307 MoSold_4 0.000617
317 YrSold_2010 0.000614
291 GarageType_Detchd 0.000612
227 YearBuilt_2003 0.000612
308 MoSold_5 0.000582
285 Electrical_SBrkr 0.000579
222 YearBuilt_1998 0.000556
225 YearBuilt_2001 0.000550
36 Fence 0.000549
273 Foundation_Wood 0.000547
230 YearBuilt_2006 0.000546
94 Neighborhood_Sawyer 0.000543
98 Neighborhood_Timber 0.000539
316 YrSold_2009 0.000532
152 YearBuilt_1925 0.000521
67 LandContour_HLS 0.000519
218 YearBuilt_1994 0.000515
154 YearBuilt_1927 0.000499
238 RoofStyle_Hip 0.000497
50 MSSubClass_50 0.000483
198 YearBuilt_1974 0.000479
19 BsmtFullBath 0.000461
309 MoSold_6 0.000441
106 Condition1_RRAn 0.000431
175 YearBuilt_1951 0.000429
132 YearBuilt_1904 0.000416
203 YearBuilt_1979 0.000410
0 LotFrontage 0.000410
22 HalfBath 0.000395
192 YearBuilt_1968 0.000393
46 MSSubClass_20 0.000383
199 YearBuilt_1975 0.000377
264 MasVnrType_BrkCmn 0.000367
180 YearBuilt_1956 0.000352
281 Electrical_FuseA 0.000347
221 YearBuilt_1997 0.000344
69 LandContour_Lvl 0.000339
318 SaleType_COD 0.000337
92 Neighborhood_OldTown 0.000334
189 YearBuilt_1965 0.000329
196 YearBuilt_1972 0.000325
157 YearBuilt_1930 0.000322
217 YearBuilt_1993 0.000320
209 YearBuilt_1985 0.000320
213 YearBuilt_1989 0.000309
25 TotRmsAbvGrd 0.000308
49 MSSubClass_45 0.000308
68 LandContour_Low 0.000304
162 YearBuilt_1936 0.000303
236 RoofStyle_Gable 0.000303
314 YrSold_2007 0.000300
8 ExterCond 0.000289
52 MSSubClass_70 0.000287
290 GarageType_CarPort 0.000286
249 Exterior1st_AsbShng 0.000245
66 LandContour_Bnk 0.000239
197 YearBuilt_1973 0.000234
205 YearBuilt_1981 0.000233
295 PavedDrive_Y 0.000219
60 MSZoning_RL 0.000218
211 YearBuilt_1987 0.000211
89 Neighborhood_NWAmes 0.000201
193 YearBuilt_1969 0.000183
223 YearBuilt_1999 0.000182
248 RoofMatl_WdShngl 0.000179
305 MoSold_2 0.000179
179 YearBuilt_1955 0.000176
174 YearBuilt_1950 0.000156
242 RoofMatl_CompShg 0.000153
166 YearBuilt_1940 0.000137
73 LotConfig_FR3 0.000137
85 Neighborhood_MeadowV 0.000137
163 YearBuilt_1937 0.000133
261 Exterior1st_VinylSd 0.000130
144 YearBuilt_1917 0.000127
63 Alley_Grvl 0.000123
186 YearBuilt_1962 0.000122
151 YearBuilt_1924 0.000122
156 YearBuilt_1929 0.000101
171 YearBuilt_1947 0.000098
216 YearBuilt_1992 0.000096
293 PavedDrive_N 0.000092
188 YearBuilt_1964 0.000089
33 3SsnPorch 0.000065
323 SaleType_ConLw 0.000046
322 SaleType_ConLI 0.000043
256 Exterior1st_ImStucc 0.000043
10 BsmtCond 0.000039
129 YearBuilt_1900 0.000037
185 YearBuilt_1961 0.000031
263 Exterior1st_WdShing 0.000026
173 YearBuilt_1949 0.000025
292 GarageType_NA 0.000000
120 YearBuilt_1880 0.000000
118 YearBuilt_1875 0.000000
119 YearBuilt_1879 0.000000
117 YearBuilt_1872 0.000000
116 Condition2_RRNn 0.000000
296 MiscFeature_Gar2 0.000000
204 YearBuilt_1980 0.000000
286 GarageType_2Types 0.000000
288 GarageType_Basment 0.000000
121 YearBuilt_1882 0.000000
298 MiscFeature_Othr 0.000000
122 YearBuilt_1885 0.000000
284 Electrical_Mix 0.000000
283 Electrical_FuseP 0.000000
123 YearBuilt_1890 0.000000
124 YearBuilt_1892 0.000000
125 YearBuilt_1893 0.000000
279 Heating_Wall 0.000000
278 Heating_OthW 0.000000
297 MiscFeature_NA 0.000000
107 Condition1_RRNe 0.000000
299 MiscFeature_Shed 0.000000
56 MSSubClass_90 0.000000
3 Utilities 0.000000
329 SaleCondition_Alloca 0.000000
328 SaleCondition_AdjLand 0.000000
42 MSSubClass_150 0.000000
325 SaleType_Oth 0.000000
43 MSSubClass_160 0.000000
321 SaleType_ConLD 0.000000
320 SaleType_Con 0.000000
319 SaleType_CWD 0.000000
44 MSSubClass_180 0.000000
48 MSSubClass_40 0.000000
59 MSZoning_RH 0.000000
300 MiscFeature_TenC 0.000000
62 Street_Pave 0.000000
76 Neighborhood_Blueste 0.000000
77 Neighborhood_BrDale 0.000000
88 Neighborhood_NPkVill 0.000000
276 Heating_GasW 0.000000
109 Condition2_Artery 0.000000
110 Condition2_Feedr 0.000000
112 Condition2_PosA 0.000000
113 Condition2_PosN 0.000000
114 Condition2_RRAe 0.000000
115 Condition2_RRAn 0.000000
126 YearBuilt_1895 0.000000
130 YearBuilt_1901 0.000000
275 Heating_GasA 0.000000
241 RoofMatl_ClyTile 0.000000
239 RoofStyle_Mansard 0.000000
149 YearBuilt_1922 0.000000
237 RoofStyle_Gambrel 0.000000
150 YearBuilt_1923 0.000000
234 YearBuilt_2010 0.000000
153 YearBuilt_1926 0.000000
155 YearBuilt_1928 0.000000
158 YearBuilt_1931 0.000000
160 YearBuilt_1934 0.000000
164 YearBuilt_1938 0.000000
168 YearBuilt_1942 0.000000
169 YearBuilt_1945 0.000000
224 YearBuilt_2000 0.000000
170 YearBuilt_1946 0.000000
176 YearBuilt_1952 0.000000
178 YearBuilt_1954 0.000000
183 YearBuilt_1959 0.000000
184 YearBuilt_1960 0.000000
215 YearBuilt_1991 0.000000
212 YearBuilt_1988 0.000000
210 YearBuilt_1986 0.000000
190 YearBuilt_1966 0.000000
208 YearBuilt_1984 0.000000
207 YearBuilt_1983 0.000000
206 YearBuilt_1982 0.000000
240 RoofStyle_Shed 0.000000
148 YearBuilt_1921 0.000000
274 Heating_Floor 0.000000
243 RoofMatl_Membran 0.000000
127 YearBuilt_1896 0.000000
272 Foundation_Stone 0.000000
271 Foundation_Slab 0.000000
128 YearBuilt_1898 0.000000
202 YearBuilt_1978 0.000000
131 YearBuilt_1902 0.000000
133 YearBuilt_1905 0.000000
134 YearBuilt_1906 0.000000
135 YearBuilt_1907 0.000000
136 YearBuilt_1908 0.000000
260 Exterior1st_Stucco 0.000000
259 Exterior1st_Stone 0.000000
138 YearBuilt_1911 0.000000
139 YearBuilt_1912 0.000000
140 YearBuilt_1913 0.000000
141 YearBuilt_1914 0.000000
253 Exterior1st_CBlock 0.000000
142 YearBuilt_1915 0.000000
250 Exterior1st_AsphShn 0.000000
145 YearBuilt_1918 0.000000
146 YearBuilt_1919 0.000000
247 RoofMatl_WdShake 0.000000
246 RoofMatl_Tar&Grv 0.000000
245 RoofMatl_Roll 0.000000
244 RoofMatl_Metal 0.000000
332 SaleCondition_Partial 0.000000
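Since many of the rows above are one-hot dummy columns (for example the `YearBuilt_*` columns), it can be easier to read the table after summing the importances per original column. The sketch below uses a handful of values copied from the table. The helper `source_column` is hypothetical, and it assumes that every dummy column is named `<column>_<category>` and that no other feature name contains an underscore.

```python
from collections import defaultdict

# A few (feature, importance) pairs copied from the table above
importances = {
    "OverallQual": 0.314412,
    "TotalLivArea": 0.124968,
    "YearBuilt_1957": 0.009928,
    "YearBuilt_2004": 0.007203,
    "YearBuilt_1977": 0.006139,
    "SaleType_New": 0.008017,
    "SaleType_WD": 0.005569,
}

def source_column(feature_name):
    """Assumption: dummy columns are named '<column>_<category>'."""
    if "_" in feature_name:
        return feature_name.rsplit("_", 1)[0]
    return feature_name

# Sum the importances of all dummies that came from the same column
grouped = defaultdict(float)
for name, value in importances.items():
    grouped[source_column(name)] += value

for column, total in sorted(grouped.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{column}: {total:.6f}")
```

The summed values are only a rough reading aid. They show how much importance was spread over the dummies of one column, not what a model trained on the unencoded column would report.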
### Apply the model and predict SalePrice
df_test["SalePrice"] = xgb_model.predict(X_test)
df_test[["Id","SalePrice"]]
Out[0]

Id SalePrice
0 1461 126004.601562
1 1462 165939.859375
2 1463 187305.421875
3 1464 193261.218750
4 1465 187279.500000
... ... ...
1454 2915 84418.914062
1455 2916 75160.867188
1456 2917 158373.281250
1457 2918 119447.867188
1458 2919 216942.234375

1459 rows × 2 columns

sns.histplot(df_test["SalePrice"],bins=20)

The predictions look reasonable.
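Because the metric takes the logarithm of each prediction, every predicted price has to be a positive, finite number. Besides looking at the histogram, a small check like the one below could be run before submitting. The function `check_predictions` is a hypothetical helper, shown here on the first five predictions from the output above.

```python
import math

def check_predictions(prices):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for i, price in enumerate(prices):
        if not math.isfinite(price):
            problems.append(f"row {i}: not a finite number ({price})")
        elif price <= 0:
            problems.append(f"row {i}: non-positive price ({price})")
    return problems

# First five predictions from the output above
sample = [126004.601562, 165939.859375, 187305.421875, 193261.218750, 187279.500000]
issues = check_predictions(sample)
print(issues or "no problems found")
```

With the DataFrame from this post, the full column could be passed as `df_test["SalePrice"].tolist()`.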

Uploading the scored results to Kaggle

df_test[["Id","SalePrice"]].to_csv("ames_submission.csv",index=False)
!/Users/hinomaruc/Desktop/blog/my-venv/bin/kaggle competitions submit -c house-prices-advanced-regression-techniques -f ames_submission.csv -m "#7 xgboost"
Out[0]
100%|██████████████████████████████████████| 21.1k/21.1k [00:03<00:00, 5.57kB/s]
Successfully submitted to House Prices - Advanced Regression Techniques

#7 xgboost
Score: 0.14524

This is a better result than Random Forest with Bayesian-optimization parameter tuning.
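To get a feel for what a score of 0.14524 means, the log-scale error can be converted back into a price ratio. This is only a loose reading: the score is a root mean square over all houses, so individual errors can be much larger or smaller.

```python
import math

score = 0.14524  # leaderboard score reported above

# An error of `score` on the log scale corresponds to multiplying
# or dividing the price by exp(score).
factor = math.exp(score)
print(f"typical ratio between prediction and actual: x{factor:.3f}")
print(f"i.e. roughly {100 * (factor - 1):.1f}% off")
```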

Library versions used

pandas Version: 1.4.3
numpy Version: 1.22.4
scikit-learn Version: 1.1.1
seaborn Version: 0.11.2
matplotlib Version: 3.5.2

Summary

XGBoost really is nice. It is easy to use and gives good accuracy, which makes it an excellent choice.

Next time I will tune XGBoost's hyperparameters with random search.

References

https://xgboost.readthedocs.io/en/stable/python/python_intro.html
