A blog where I analyze all kinds of data with Python while thinking over my own career

(Part 4-9) Predicting Ames house prices with AutoML (AutoGluon)


This time I try AutoGluon, an AutoML library, on the Ames dataset.


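How to set up the AutoML environments on a Mac is covered in the articles below. Almost everything is installed with pip, so much the same commands should also work on Linux.

* Anything installed with brew install needs to be replaced with its yum or apt equivalent.

(MLJAR) Setting up three AutoML environments in Python
(AutoGluon) Setting up three AutoML environments in Python
(auto-sklearn) Setting up three AutoML environments in Python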


Let's get started.


Upgrading AutoGluon

I want to use as recent a version of the library as possible.

source ~/venv-autogluon/bin/activate
(venv-autogluon) python3 -m pip install autogluon --upgrade
Out[0]
Requirement already satisfied: autogluon in ./venv-autogluon/lib/python3.8/site-packages (0.4.2)
Collecting autogluon
  Using cached autogluon-0.5.2-py3-none-any.whl (9.6 kB)
... (omitted) ...
Successfully installed Cython-0.29.32 autogluon-0.5.2 autogluon.common-0.5.2 autogluon.core-0.5.2 autogluon.features-0.5.2 autogluon.multimodal-0.5.2 autogluon.tabular-0.5.2 autogluon.text-0.5.2 autogluon.timeseries-0.5.2 autogluon.vision-0.5.2 click-8.0.4 convertdate-2.4.0 distlib-0.3.5 future-0.18.2 gluonts-0.9.7 grpcio-1.43.0 hijri-converter-2.2.4 holidays-0.14.2 hyperopt-0.2.7 korean-lunar-calendar-0.2.1 llvmlite-0.39.0 nlpaug-1.1.10 nltk-3.7 numba-0.56.0 numpy-1.21.6 patsy-0.5.2 platformdirs-2.5.2 pmdarima-1.8.5 protobuf-3.18.1 py4j-0.10.9.5 pymeeus-0.5.11 pytorch-metric-learning-1.3.2 ray-1.13.0 sktime-0.11.4 statsmodels-0.13.2 tbats-1.1.0 tensorboardX-2.5.1 torch-1.11.0 torchtext-0.12.0 torchvision-0.12.0 transformers-4.20.1 virtualenv-20.16.3

AutoGluon was upgraded from version 0.4.2 to 0.5.2.


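Evaluation metric

The competition asks for a predicted SalePrice (sale price) for each house Id.

According to the competition page, the evaluation metric is the Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted SalePrice and the logarithm of the observed SalePrice.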
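To make the metric concrete, here is a minimal sketch that uses only the standard library. The function name `rmse_of_logs` and the prices are made up for illustration; this is not Kaggle's scoring code.

```python
import math


def rmse_of_logs(y_true, y_pred):
    """RMSE between log(actual price) and log(predicted price)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [
        (math.log(pred) - math.log(true)) ** 2
        for true, pred in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up prices, only to show how the function is called.
actual = [100000.0, 200000.0, 400000.0]
predicted = [110000.0, 180000.0, 400000.0]
print(rmse_of_logs(actual, predicted))
```

Because the errors are taken on the log scale, being 10% off on a cheap house costs the same as being 10% off on an expensive one.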

House Prices - Advanced Regression Techniques | Kaggle
Predict sales prices and practice feature engineering, RFs, and gradient boosting

AutoGluon

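Preparing the data for analysis

Missing-value handling and feature engineering were done beforehand, and the resulting data was exported.

To reproduce the results in this article, prepare the data first by following the articles below.

(Part 3-2) Data preparation for the Ames house price dataset (1)

(Part 3-3) Data preparation for the Ames house price dataset (2)

Loading the training data and the scoring data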

import pandas as pd
import numpy as np
# Load the training and test data of the Ames house price dataset
df = pd.read_csv("/Users/hinomaruc/Desktop/blog/dataset/ames/ames_train.csv")
df_test = pd.read_csv("/Users/hinomaruc/Desktop/blog/dataset/ames/ames_test.csv")
# Plot settings
import seaborn as sns
from matplotlib import ticker
import matplotlib.pyplot as plt
sns.set_style("whitegrid")
from matplotlib import rcParams
rcParams['font.family'] = 'Hiragino Sans' # on Mac
#rcParams['font.family'] = 'Meiryo' # on Windows
#rcParams['font.family'] = 'VL PGothic' # on Linux
rcParams['xtick.labelsize'] = 12       # font size of the x-axis tick labels
rcParams['ytick.labelsize'] = 12       # font size of the y-axis tick labels
rcParams['axes.labelsize'] = 18        # font size of the axis labels
rcParams['figure.figsize'] = 18,8      # figure size (inches)
# Specify the explanatory variables and the target variable

# Training data (AutoGluon expects the target column to be included)
X_train = df.drop(["Id"],axis=1)

# Test data
X_test = df_test.drop(["Id"],axis=1)

Building the model with AutoGluon

# https://auto.gluon.ai/stable/api/autogluon.tabular.TabularPredictor.html
# Build the AutoGluon model
from autogluon.tabular import TabularPredictor
predictor = TabularPredictor(label="SalePrice", problem_type="regression",path="RESULT_AUTOGLUON").fit(X_train, time_limit = 600)
Out[0]

Beginning AutoGluon training ... Time limit = 600s
AutoGluon will save models to "RESULT_AUTOGLUON/"
AutoGluon Version:  0.5.2
Python Version:     3.8.13
Operating System:   Darwin
Train Data Rows:    1460
Train Data Columns: 333
Label Column: SalePrice
Preprocessing data ...
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
    Available Memory:                    9107.26 MB
    Train Data (Original)  Memory Usage: 3.89 MB (0.0% of available memory)
    Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
    Stage 1 Generators:
        Fitting AsTypeFeatureGenerator...
            Note: Converting 286 features to boolean dtype as they only contain 2 unique values.
    Stage 2 Generators:
        Fitting FillNaFeatureGenerator...
    Stage 3 Generators:
        Fitting IdentityFeatureGenerator...
    Stage 4 Generators:
        Fitting DropUniqueFeatureGenerator...
    Useless Original Features (Count: 7): ['MSSubClass_150', 'YearBuilt_1879', 'YearBuilt_1895', 'YearBuilt_1896', 'YearBuilt_1901', 'YearBuilt_1902', 'YearBuilt_1907']
        These features carry no predictive signal and should be manually investigated.
        This is typically a feature which has the same value for all rows.
        These features do not need to be present at inference time.
    Types of features in original data (raw dtype, special dtypes):
        ('float', []) : 310 | ['LotFrontage', 'LotShape', 'Utilities', 'LandSlope', 'MasVnrArea', ...]
        ('int', [])   :  16 | ['LotArea', 'OverallQual', 'OverallCond', '2ndFlrSF', 'LowQualFinSF', ...]
    Types of features in processed data (raw dtype, special dtypes):
        ('float', [])     :  24 | ['LotFrontage', 'LotShape', 'LandSlope', 'MasVnrArea', 'ExterCond', ...]
        ('int', [])       :  16 | ['LotArea', 'OverallQual', 'OverallCond', '2ndFlrSF', 'LowQualFinSF', ...]
        ('int', ['bool']) : 286 | ['Utilities', 'MSSubClass_120', 'MSSubClass_160', 'MSSubClass_180', 'MSSubClass_190', ...]
    1.0s = Fit runtime
    326 features in original data used to generate 326 features in processed data.
    Train Data (Processed) Memory Usage: 0.88 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 1.2s ...
AutoGluon will gauge predictive performance using evaluation metric: 'root_mean_squared_error'
    This metric's sign has been flipped to adhere to being higher_is_better. The metric score can be multiplied by -1 to get the metric value.
    To change this, specify the eval_metric parameter of Predictor()
Automatically generating train/validation split with holdout_frac=0.2, Train Rows: 1168, Val Rows: 292
Fitting 11 L1 models ...
Fitting model: KNeighborsUnif ... Training model for up to 598.8s of the 598.78s of remaining time.
    -49351.2967  = Validation score   (-root_mean_squared_error)
    0.09s    = Training   runtime
    0.07s    = Validation runtime
Fitting model: KNeighborsDist ... Training model for up to 598.59s of the 598.58s of remaining time.
    -49022.574   = Validation score   (-root_mean_squared_error)
    0.06s    = Training   runtime
    0.04s    = Validation runtime
Fitting model: LightGBMXT ... Training model for up to 598.46s of the 598.45s of remaining time.
    -32311.2318  = Validation score   (-root_mean_squared_error)
    2.71s    = Training   runtime
    0.01s    = Validation runtime
Fitting model: LightGBM ... Training model for up to 595.69s of the 595.68s of remaining time.
    -33364.3905  = Validation score   (-root_mean_squared_error)
    0.82s    = Training   runtime
    0.01s    = Validation runtime
Fitting model: RandomForestMSE ... Training model for up to 594.83s of the 594.82s of remaining time.
    -33091.9716  = Validation score   (-root_mean_squared_error)
    4.49s    = Training   runtime
    0.07s    = Validation runtime
Fitting model: CatBoost ... Training model for up to 590.17s of the 590.16s of remaining time.
    -29978.5556  = Validation score   (-root_mean_squared_error)
    21.76s   = Training   runtime
    0.04s    = Validation runtime
Fitting model: ExtraTreesMSE ... Training model for up to 568.35s of the 568.34s of remaining time.
    -32129.1943  = Validation score   (-root_mean_squared_error)
    3.44s    = Training   runtime
    0.07s    = Validation runtime
Fitting model: NeuralNetFastAI ... Training model for up to 564.74s of the 564.72s of remaining time.
No improvement since epoch 3: early stopping
[W ParallelNative.cpp:229] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
    -52385.7488  = Validation score   (-root_mean_squared_error)
    20.07s   = Training   runtime
    0.18s    = Validation runtime
Fitting model: XGBoost ... Training model for up to 544.44s of the 544.43s of remaining time.
    -28249.057   = Validation score   (-root_mean_squared_error)
    4.8s     = Training   runtime
    0.02s    = Validation runtime
Fitting model: NeuralNetTorch ... Training model for up to 539.59s of the 539.58s of remaining time.
[W ParallelNative.cpp:229] Warning: Cannot set number of intraop threads after parallel work has started or after set_num_threads call when using native parallel backend (function set_num_threads)
    -34557.6018  = Validation score   (-root_mean_squared_error)
    5.05s    = Training   runtime
    0.04s    = Validation runtime
Fitting model: LightGBMLarge ... Training model for up to 534.49s of the 534.48s of remaining time.
    -30409.2806  = Validation score   (-root_mean_squared_error)
    2.78s    = Training   runtime
    0.01s    = Validation runtime
Fitting model: WeightedEnsemble_L2 ... Training model for up to 360.0s of the 531.52s of remaining time.
    -27721.2328  = Validation score   (-root_mean_squared_error)
    0.64s    = Training   runtime
    0.0s     = Validation runtime
AutoGluon training complete, total runtime = 69.2s ... Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("RESULT_AUTOGLUON/")

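AutoGluon handles a lot automatically: it preprocesses the features, splits off a validation set, and fits 11 base models plus a weighted ensemble.
In the end, the WeightedEnsemble_L2 model was selected as the best model.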
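The validation scores in the log are negated RMSE values, since the log notes that the metric's sign is flipped so that higher is better. As a small sketch, the scores copied from the log above can be converted back to RMSE and ranked with plain Python; the dictionary and the helper name `to_rmse` are only for illustration.

```python
# Validation scores copied from the training log above (negated RMSE).
validation_scores = {
    "KNeighborsUnif": -49351.2967,
    "KNeighborsDist": -49022.574,
    "LightGBMXT": -32311.2318,
    "LightGBM": -33364.3905,
    "RandomForestMSE": -33091.9716,
    "CatBoost": -29978.5556,
    "ExtraTreesMSE": -32129.1943,
    "NeuralNetFastAI": -52385.7488,
    "XGBoost": -28249.057,
    "NeuralNetTorch": -34557.6018,
    "LightGBMLarge": -30409.2806,
    "WeightedEnsemble_L2": -27721.2328,
}


def to_rmse(score):
    """Undo the sign flip: multiply the logged score by -1."""
    return -score


# Sort from the smallest RMSE (best) to the largest (worst).
ranked = sorted(
    ((name, to_rmse(score)) for name, score in validation_scores.items()),
    key=lambda item: item[1],
)

for name, rmse in ranked:
    print(f"{name:<22}{rmse:>12.2f}")
```

Note that these are RMSE values in raw price units on the 292 validation rows, not the log-scale metric Kaggle uses.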

# Check the feature importance
predictor.feature_importance(X_train)
Out[0]

    These features in provided data are not utilized by the predictor and will be ignored: ['MSSubClass_150', 'YearBuilt_1879', 'YearBuilt_1895', 'YearBuilt_1896', 'YearBuilt_1901', 'YearBuilt_1902', 'YearBuilt_1907']
    Computing feature importance via permutation shuffling for 326 features using 1460 rows with 5 shuffle sets...
        1178.64s    = Expected runtime (235.73s per shuffle set)
        889.07s = Actual runtime (Completed 5 of 5 shuffle sets)

                    importance       stddev       p_value  n      p99_high        p99_low
TotalLivArea      38261.203832  1094.563494  8.028544e-08  5  40514.925195   36007.482469
OverallQual       23512.093204   936.372603  3.012264e-07  5  25440.097335   21584.089073
BsmtQual           4811.859551   195.716333  3.277030e-07  5   5214.842186    4408.876915
GarageCars         2733.962537   333.895938  2.617385e-05  5   3421.458889    2046.466185
LotArea            2485.904056    60.642905  4.246384e-08  5   2610.768635    2361.039477
...                        ...          ...           ...  ...          ...            ...
RoofMatl_CompShg     -2.558456     3.332701  9.194039e-01  5      4.303621      -9.420532
YrSold_2008          -3.084284     4.882976  8.846518e-01  5      6.969831     -13.138400
YearBuilt_1916       -3.790175     2.284067  9.896777e-01  5      0.912750      -8.493101
PoolQC              -36.494475    12.669437  9.985053e-01  5    -10.407929     -62.581021
MoSold_1            -71.427322    47.552748  9.858308e-01  5     26.484442    -169.339086

326 rows × 6 columns

The results took quite a while to come back (889 seconds according to the log).

TotalLivArea and OverallQual rank at the top, which suggests the result can be trusted.
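The log says the importance is computed "via permutation shuffling" with 5 shuffle sets. The idea can be sketched in plain Python as follows. This is a toy illustration, not AutoGluon's implementation: `toy_predict`, the column names, and the data are all hypothetical.

```python
import math
import random


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def permutation_importance(predict, rows, y_true, column, n_shuffles=5, seed=0):
    """Average increase in RMSE after shuffling one column's values across rows."""
    rng = random.Random(seed)
    baseline = rmse(y_true, [predict(row) for row in rows])
    increases = []
    for _ in range(n_shuffles):
        shuffled = [row[column] for row in rows]
        rng.shuffle(shuffled)
        # Copy each row, replacing only the shuffled column.
        permuted = [dict(row, **{column: value}) for row, value in zip(rows, shuffled)]
        increases.append(rmse(y_true, [predict(row) for row in permuted]) - baseline)
    return sum(increases) / n_shuffles


def toy_predict(row):
    """Hypothetical fitted model: uses "area" and ignores "noise"."""
    return 100.0 * row["area"]


rows = [{"area": float(a), "noise": float((a * 7) % 5)} for a in range(1, 21)]
prices = [toy_predict(row) for row in rows]

area_importance = permutation_importance(toy_predict, rows, prices, "area")
noise_importance = permutation_importance(toy_predict, rows, prices, "noise")
print("area :", area_importance)
print("noise:", noise_importance)
```

A feature the model relies on makes the error jump when shuffled, while a feature the model ignores scores zero. Small negative values, like those at the bottom of the table above, mean that shuffling happened to help slightly, which is best read as noise.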

# Compute the adjusted R^2 (R^2 adjusted for degrees of freedom)
def adjusted_r2(X,Y,Yhat):
    from sklearn.metrics import r2_score
    r_squared = r2_score(Y, Yhat)
    # X.shape[1] is used as the number of predictors
    adjusted = 1 - (1-r_squared)*(len(Y)-1)/(len(Y)-X.shape[1]-1)
    return adjusted

# Check the accuracy on the training data (to see how much the model overfits)
print("train r2_adjusted",adjusted_r2(X_train,X_train["SalePrice"], predictor.predict(X_train)))
Out[0]
train r2_adjusted 0.9680146829503177
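For reference, the adjustment itself is simple arithmetic: with n rows and p predictors, adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1). The sketch below shows how the penalty grows with the number of predictors. The R² value of 0.975 is hypothetical, and 333 is the feature count reported in the training log.

```python
def adjusted_r2_from_r2(r_squared, n_rows, n_features):
    """Adjusted R^2 from a plain R^2, the row count, and the predictor count."""
    if n_rows - n_features - 1 <= 0:
        raise ValueError("need more rows than predictors + 1")
    return 1 - (1 - r_squared) * (n_rows - 1) / (n_rows - n_features - 1)


r2 = 0.975  # hypothetical plain R^2, for illustration only
for n_features in (10, 100, 333):
    print(n_features, round(adjusted_r2_from_r2(r2, 1460, n_features), 4))
```

With 1,460 rows and a few hundred one-hot columns, the adjusted value sits noticeably below the plain R².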
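### Apply the model and predict SalePrice
# Predict on the test features (X_test). The original run passed X_train here,
# and the output, histogram, and Kaggle score that follow come from that run.
df_test["SalePrice"] = predictor.predict(X_test)
df_test[["Id","SalePrice"]]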
Out[0]

        Id      SalePrice
0     1461  208608.531250
1     1462  183918.640625
2     1463  225355.890625
3     1464  141544.953125
4     1465  286628.718750
...    ...            ...
1454  2915  187273.500000
1455  2916  173471.203125
1456  2917  208200.875000
1457  2918  266205.906250
1458  2919  142144.937500

1459 rows × 2 columns

# Check the distribution of the predicted SalePrice
sns.histplot(df_test["SalePrice"],bins=20)
Out[0]

Uploading the scored results to Kaggle

df_test[["Id","SalePrice"]].to_csv("ames_submission.csv",index=False)
!/Users/hinomaruc/Desktop/blog/my-venv/bin/kaggle competitions submit -c house-prices-advanced-regression-techniques -f ames_submission.csv -m "#9 automl autogluon"
Out[0]
100%|██████████████████████████████████████| 21.1k/21.1k [00:04<00:00, 5.29kB/s]
Successfully submitted to House Prices - Advanced Regression Techniques

#9 automl autogluon
Score: 0.56305

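Unexpectedly, the score was very poor. My first guess was that the 600-second training limit was too short, but the log shows that training finished in 69.2 seconds, so the time limit was not the constraint. The more likely cause is that the submitted predictions were generated from the training rows (X_train) rather than the test rows (X_test), so they do not correspond to the test houses.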


Library versions used

pandas Version: 1.3.5
numpy Version: 1.21.6
scikit-learn Version: 1.0.2
seaborn Version: 0.11.2
matplotlib Version: 3.5.2
autogluon Version: 0.5.2
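A list like the one above can be produced from the active environment with the standard library. This is only a sketch of one way to do it, not necessarily how the list in this article was generated; the helper name `installed_version` is made up.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


for name in ("pandas", "numpy", "scikit-learn", "seaborn", "matplotlib", "autogluon"):
    print(f"{name} Version: {installed_version(name) or 'not installed'}")
```

`importlib.metadata` is available from Python 3.8, which matches the Python 3.8.13 shown in the training log.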


Summary

Next time I'd like to try auto-sklearn.
