The Reigning Algorithms of Machine Learning Competitions: XGBoost, LightGBM, CatBoost

December 16, 2020

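Anyone who has taken part in machine learning competitions has probably heard of XGBoost, the algorithm that has swept all kinds of contests. Two later algorithms in the same family, LightGBM and CatBoost, are also available but less widely known. Since material on them seems relatively scarce, I worked through some learning resources to write this introduction, partly to push my own learning along. I hope this modest write-up helps you understand these three common algorithms better.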

The map of machine Learning algorithms

Source

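XGBoost, LightGBM and CatBoost all belong to the Boosting branch of ensemble learning. The basic idea of ensemble learning is: if one model is not enough, have you tried two? If two are not enough, have you tried three? In other words, several learners are combined so that they compensate for one another's weaknesses, which makes the overall model more flexible.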

Boosting

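Boosting is a sequential algorithm: it trains a series of weak learners one after another, each one correcting the errors of the models before it, and finally combines them into a single strong learner. A weak learner is a model that performs only slightly better than random guessing. Such learners are cheap to train, low in complexity, and not prone to overfitting. Because of these properties, many weak learners can be trained in a short time and combined into a powerful model.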

Gradient Boosting

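Gradient boosting applies gradient descent to the training of these weak learners in order to minimize the error: at every step a new learner is fitted to the negative gradient of the loss with respect to the current predictions. The main problem with this approach is that every gradient descent step requires building a new learner, which is very inefficient. This is where XGBoost comes in.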
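To make the idea concrete, here is a minimal sketch of gradient boosting for squared loss in plain Python. It is not how any of the three libraries is implemented; the helper names (fit_stump, gradient_boost, mean_squared_error) and the toy data are assumptions made up for illustration. With squared loss the negative gradient is simply the residual, so each round fits a one-split stump to the current residuals.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump that minimises squared error on the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Sequentially add stumps, each fitted to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)          # start from the mean prediction
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        # For the loss 1/2 * (y - p)^2 the negative gradient is the residual y - p
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict


def mean_squared_error(predict, xs, ys):
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data (made up): a step from roughly 1 to roughly 3
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]

model = gradient_boost(xs, ys)
baseline_mse = mean_squared_error(lambda x: sum(ys) / len(ys), xs, ys)
boosted_mse = mean_squared_error(model, xs, ys)
print(baseline_mse, boosted_mse)
```

Every round has to build a new stump from scratch, which is exactly the cost the paragraph above complains about.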

Gradient Boosting Decision Tree (GBDT)

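When every weak learner mentioned above is a decision tree, the resulting gradient boosting method, which tries to learn and combine the best set of decision trees, is called GBDT. It is currently the most common way to implement gradient boosting, and it also happens to mitigate the tendency of individual decision trees to overfit. All three of today's topics, XGBoost, LightGBM and CatBoost, are improvements built on GBDT.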

XGBoost

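XGBoost stands for eXtreme Gradient Boosting. It sounds powerful and extreme, and it really is; you can think of it as gradient boosting on steroids. Compared with a plain GBDT, XGBoost is more efficient, more flexible and more lightweight. Much of this comes from parallelism, but note that the trees are still added one after another: what XGBoost parallelizes is the construction of each individual tree, where candidate splits for different features are evaluated on multiple threads.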

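XGBoost grows each tree level-wise (depth-wise): all nodes on the same level are split before moving on to the next level, each split is chosen by the gain it brings to the regularized objective (a role similar to information gain), and splits whose gain is too small are pruned. Even with parallel split finding, evaluating every possible split is still time-consuming on large datasets, which is what motivated LightGBM and CatBoost.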
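As a rough illustration of how a split can be scored, the sketch below computes the gain formula given in the XGBoost paper (Chen and Guestrin, 2016) from sums of first-order gradients G and second-order gradients H on each side of a split, with L2 regularization lambda and split penalty gamma. The function names and the toy data are assumptions for illustration; the real library works on pre-sorted or histogram-binned features rather than a Python loop.

```python
def leaf_score(g_sum, h_sum, reg_lambda):
    """Structure score of a leaf: G^2 / (H + lambda)."""
    return g_sum * g_sum / (h_sum + reg_lambda)


def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Gain = 1/2 * (score(left) + score(right) - score(parent)) - gamma."""
    parent = leaf_score(g_left + g_right, h_left + h_right, reg_lambda)
    children = (leaf_score(g_left, h_left, reg_lambda)
                + leaf_score(g_right, h_right, reg_lambda))
    return 0.5 * (children - parent) - gamma


def best_split(xs, grads, hess, reg_lambda=1.0, gamma=0.0):
    """Scan the sorted feature values once and return (best gain, threshold)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    g_total, h_total = sum(grads), sum(hess)
    g_left = h_left = 0.0
    best_gain, best_threshold = 0.0, None
    for pos in range(len(order) - 1):
        i, nxt = order[pos], order[pos + 1]
        g_left += grads[i]
        h_left += hess[i]
        if xs[i] == xs[nxt]:
            continue  # cannot split between identical feature values
        gain = split_gain(g_left, h_left, g_total - g_left, h_total - h_left,
                          reg_lambda, gamma)
        if gain > best_gain:
            best_gain, best_threshold = gain, (xs[i] + xs[nxt]) / 2
    return best_gain, best_threshold


# Toy data (made up). For squared loss with current prediction p:
# gradient = p - y, hessian = 1
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, 1, 3, 3, 3]
current_prediction = 2.0
grads = [current_prediction - y for y in ys]
hess = [1.0] * len(ys)

gain, threshold = best_split(xs, grads, hess)
print(gain, threshold)
```

A split whose gain does not exceed zero after subtracting gamma is not worth keeping, which is the pruning rule mentioned above.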

LightGBM

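Instead of level-wise growth, LightGBM grows trees leaf-wise: at each step it splits the leaf that gives the largest loss reduction. It also uses two techniques, Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB), to reduce the amount of computation.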

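GOSS is a down-sampling algorithm. It first computes the gradients of all samples and sorts them by absolute value. It then keeps the samples with large gradients and randomly draws a few of the remaining small-gradient samples to keep. The reasoning is that a large gradient means a sample is not yet well trained, while a small gradient means it already is, so continuing to train on every small-gradient sample is inefficient and their number is reduced by down-sampling. The small-gradient samples are drawn at random, and the drawn samples are given a larger weight, so that the data distribution is not distorted too much along the way. This reduces computation while preserving the accuracy of the model.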
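The sampling step can be sketched in a few lines of plain Python. This is a simplified illustration, not LightGBM's implementation: goss_sample is a hypothetical helper, and the parameter names top_rate and other_rate are borrowed from LightGBM's parameters of the same name. Following the LightGBM paper, the randomly kept small-gradient samples are re-weighted by (1 - top_rate) / other_rate to compensate for the ones that were dropped.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return {sample index: weight} for the samples kept by GOSS."""
    n = len(gradients)
    # Sort indices by absolute gradient, largest first
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)

    top = order[:n_top]      # always kept: large gradients, still under-trained
    rest = order[n_top:]     # small gradients: only a random subset is kept
    sampled = random.Random(seed).sample(rest, n_other)

    # Up-weight the sampled small-gradient rows to make up for the dropped ones
    amplify = (1.0 - top_rate) / other_rate
    weights = {i: 1.0 for i in top}
    weights.update({i: amplify for i in sampled})
    return weights


# Toy gradients (made up): absolute value grows with the index
gradients = [(-1) ** i * i / 20 for i in range(20)]
weights = goss_sample(gradients)
print(sorted(weights.items()))
```

With 20 samples, top_rate=0.2 and other_rate=0.1, only 6 samples take part in the next split search instead of 20.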

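EFB is a dimensionality reduction algorithm whose main purpose is to merge several features into one. High-dimensional data is usually sparse, and many sparse features are mutually exclusive: they are almost never non-zero at the same time, like the columns of a one-hot encoding, where most entries are 0. EFB uses a greedy algorithm to group such mutually exclusive features into bundles and merges each bundle into a single feature, which reduces the number of features.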
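A simplified sketch of the bundling idea follows. It assumes non-negative integer-coded features and treats two features as conflicting when both are non-zero in the same row; the helper names (conflicts, greedy_bundles, merge_bundle) and the toy columns are made up for illustration. LightGBM itself works on histogram bins and tolerates a small conflict rate, so treat this only as an outline of the greedy step and the offset trick.

```python
def conflicts(col_a, col_b):
    """Number of rows in which both features are non-zero."""
    return sum(1 for a, b in zip(col_a, col_b) if a != 0 and b != 0)


def greedy_bundles(columns, max_conflicts=0):
    """Greedily put each feature into the first bundle it barely conflicts with."""
    # Visit features with many non-zero values first
    order = sorted(columns, key=lambda name: sum(v != 0 for v in columns[name]),
                   reverse=True)
    bundles = []
    for name in order:
        for bundle in bundles:
            total = sum(conflicts(columns[name], columns[other]) for other in bundle)
            if total <= max_conflicts:
                bundle.append(name)
                break
        else:
            bundles.append([name])   # no compatible bundle found: start a new one
    return bundles


def merge_bundle(columns, bundle):
    """Merge a bundle into one column, shifting each feature into its own value range."""
    merged = [0] * len(columns[bundle[0]])
    offset = 0
    for name in bundle:
        col = columns[name]
        for row, value in enumerate(col):
            if value != 0:
                merged[row] = value + offset
        offset += max(col)
    return merged


# Toy data (made up): three one-hot columns plus one dense count column
columns = {
    "is_red":   [1, 0, 0, 1, 0, 0],
    "is_green": [0, 1, 0, 0, 1, 0],
    "is_blue":  [0, 0, 1, 0, 0, 1],
    "clicks":   [3, 1, 2, 0, 5, 4],
}

bundles = greedy_bundles(columns)
merged = merge_bundle(columns, bundles[1])
print(bundles)
print(merged)
```

Here the three one-hot columns collapse into a single column with values 1, 2 and 3, while the dense column stays on its own.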

CatBoost

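The Cat in CatBoost stands for Categorical, because it was designed with categorical data in mind, although it can still be applied to regression problems. What makes CatBoost special is that it is built on symmetric (oblivious) decision trees. A symmetric tree is a level-wise tree built from binary splits in which every node on the same level uses the same feature and the same split condition, and each condition has only two outcomes, yes or no. This keeps the tree structure very simple. CatBoost also has a mechanism to prevent overfitting called Ordered Boosting: it works on random permutations of the training samples, and the statistics used for a sample (including the target statistics that encode its categorical features) are computed only from the samples that precede it in the permutation, which keeps the sample's own target from leaking into its features.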
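Because every level of a symmetric tree asks a single yes/no question, evaluating one is just a matter of collecting one bit per level and using the bits as a leaf index. The sketch below illustrates this in plain Python; oblivious_predict, the split list and the leaf values are hypothetical and do not reflect CatBoost's internal data structures.

```python
def oblivious_predict(row, splits, leaf_values):
    """Evaluate a symmetric (oblivious) tree.

    splits: one (feature index, threshold) pair per level, shared by all nodes
            on that level.
    leaf_values: 2 ** len(splits) values, indexed by the answers to the splits.
    """
    index = 0
    for level, (feature, threshold) in enumerate(splits):
        if row[feature] > threshold:   # "yes" sets the bit for this level
            index |= 1 << level
    return leaf_values[index]


# A depth-2 tree (made-up numbers):
# level 0 asks "feature 0 > 0.5?", level 1 asks "feature 1 > 10.0?"
splits = [(0, 0.5), (1, 10.0)]
leaf_values = [0.1, 0.4, 0.3, 0.9]

print(oblivious_predict([1, 5.0], splits, leaf_values))
print(oblivious_predict([0, 20.0], splits, leaf_values))
print(oblivious_predict([1, 20.0], splits, leaf_values))
```

A tree of depth d therefore needs only d comparisons and one table lookup per prediction, with no branching on the tree structure itself.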

In terms of results, CatBoost needed the least computation of the three in this comparison.

Important parameters of XGBoost, LightGBM and CatBoost

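Since these three are the darlings of machine learning competitions, of course we should go over their important parameters, so that you can brute-force train efficiently when competing!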

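Note that saying XGBoost has no parameters for categorical values does not mean XGBoost cannot handle classification problems. It means that, at the time of writing, it has no tunable parameters dedicated to categorical features. All three can be used for both classification and regression; how else could they reign?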

Source

Implementing XGBoost, LightGBM and CatBoost

Data: 2015 Flight Delays and Cancellations

Gist

Data Preprocessing

import pandas as pd
import numpy as np
import time
from sklearn.model_selection import train_test_split

# Read the data and keep a 10% sample so that training stays fast
data = pd.read_csv("flights.csv")
data = data.sample(frac=0.1, random_state=10)

# Keep only the feature columns plus the target, then drop rows with missing values
data = data[["MONTH", "DAY", "DAY_OF_WEEK", "AIRLINE", "FLIGHT_NUMBER", "DESTINATION_AIRPORT",
             "ORIGIN_AIRPORT", "AIR_TIME", "DEPARTURE_TIME", "DISTANCE", "ARRIVAL_DELAY"]]
data = data.dropna()

# Binary target: 1 if the flight arrived more than 10 minutes late, otherwise 0
data["ARRIVAL_DELAY"] = (data["ARRIVAL_DELAY"] > 10) * 1

# Encode these columns as integer category codes (starting from 1)
cols = ["AIRLINE", "FLIGHT_NUMBER", "DESTINATION_AIRPORT", "ORIGIN_AIRPORT"]
for item in cols:
    data[item] = data[item].astype("category").cat.codes + 1

print(data.head())

# Train/test split (75% / 25%)
train, test, y_train, y_test = train_test_split(
    data.drop(["ARRIVAL_DELAY"], axis=1), data["ARRIVAL_DELAY"],
    random_state=10, test_size=0.25)

XGBoost

# Example of XGBoost

import xgboost as xgb
from sklearn import metrics
from sklearn.model_selection import GridSearchCV

def auc(m, train, test):
    """Return (train AUC, test AUC) for a fitted classifier with predict_proba."""
    return (metrics.roc_auc_score(y_train, m.predict_proba(train)[:, 1]),
            metrics.roc_auc_score(y_test, m.predict_proba(test)[:, 1]))

# Parameter tuning with a 3-fold grid search
model = xgb.XGBClassifier()
param_dist = {"max_depth": [10, 30, 50],
              "min_child_weight": [1, 3, 6],
              "n_estimators": [200],
              "learning_rate": [0.05, 0.1, 0.16]}
grid_search = GridSearchCV(model, param_grid=param_dist, cv=3,
                           verbose=10, n_jobs=-1)
grid_search.fit(train, y_train)

grid_search.best_estimator_

# Refit with the chosen parameters (XGBClassifier takes verbosity, not verbose)
model = xgb.XGBClassifier(max_depth=50, min_child_weight=1, n_estimators=200,
                          n_jobs=-1, verbosity=1, learning_rate=0.16)
model.fit(train, y_train)

auc(model, train, test)
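The auc helper above relies on roc_auc_score. As a reminder of what that number means, here is a small pure-Python sketch (roc_auc is a hypothetical helper, not scikit-learn's implementation): the AUC is the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as half. Only the ranking of the scores matters, which is why it can be computed from probabilities and from raw model scores alike.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs that are ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # A correctly ordered pair counts 1, a tie counts 0.5
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (made up): three of the four positive/negative pairs are ordered correctly
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 0.75
```

This pairwise version is quadratic in the number of samples, so it is only meant for understanding; for real data the scikit-learn function is the practical choice.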

LightGBM

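Both LightGBM and CatBoost let you declare categorical features, so I train two models for each: one with categorical features and one without. Also, it seems that in LightGBM num_boost_round can only be controlled through train, so it cannot be tuned in the grid search.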


# Example of LightGBM

import lightgbm as lgb
from sklearn import metrics
from sklearn.model_selection import GridSearchCV

def auc2(m, train, test):
    """Return (train AUC, test AUC) for a Booster whose predict() returns scores."""
    return (metrics.roc_auc_score(y_train, m.predict(train)),
            metrics.roc_auc_score(y_test, m.predict(test)))

# Parameter tuning (the silent argument is omitted; newer LightGBM releases dropped it)
lg = lgb.LGBMClassifier()
param_dist = {"max_depth": [25, 50, 75],
              "learning_rate": [0.01, 0.05, 0.1],
              "num_leaves": [300, 900, 1200],
              "n_estimators": [200]
             }
grid_search = GridSearchCV(lg, n_jobs=-1, param_grid=param_dist, cv=3,
                           scoring="roc_auc", verbose=5)
grid_search.fit(train, y_train)
grid_search.best_estimator_

# objective is set explicitly: without it lgb.train falls back to regression
params = {"objective": "binary", "max_depth": 50, "learning_rate": 0.1,
          "num_leaves": 900, "n_estimators": 300}

# Without categorical features
d_train = lgb.Dataset(train, label=y_train)
model2 = lgb.train(params, d_train)
auc2(model2, train, test)

# With categorical features (declared on the Dataset)
cate_features_name = ["MONTH", "DAY", "DAY_OF_WEEK", "AIRLINE", "DESTINATION_AIRPORT",
                      "ORIGIN_AIRPORT"]
d_train_cat = lgb.Dataset(train, label=y_train, categorical_feature=cate_features_name)
model2 = lgb.train(params, d_train_cat)
auc2(model2, train, test)

CatBoost

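If you run into the puzzling error 'CatBoostClassifier' object has no attribute 'CatBoostClassifier' here, the cause is that the name cb, which refers to the catboost module, has been rebound to a classifier instance by a line such as cb = cb.CatBoostClassifier(), so every later cb.CatBoostClassifier(...) call fails. The listing below avoids this by giving the estimator its own name. If problems persist, upgrading with pip install --upgrade catboost is also worth a try.

# Example of CatBoost
import catboost as cb
from sklearn import metrics
from sklearn.model_selection import GridSearchCV

# Column positions of the categorical features in train (MONTH through ORIGIN_AIRPORT)
cat_features_index = [0, 1, 2, 3, 4, 5, 6]

def auc(m, train, test):
    """Return (train AUC, test AUC) for a fitted classifier with predict_proba."""
    return (metrics.roc_auc_score(y_train, m.predict_proba(train)[:, 1]),
            metrics.roc_auc_score(y_test, m.predict_proba(test)[:, 1]))

# What LightGBM calls n_estimators is called iterations in CatBoost
params = {"depth": [4, 7, 10],
          "learning_rate": [0.03, 0.1, 0.15],
          "l2_leaf_reg": [1, 4, 9],
          "iterations": [300]}
# Use a separate name so that the module alias cb is not overwritten
cb_clf = cb.CatBoostClassifier()
cb_model = GridSearchCV(cb_clf, params, scoring="roc_auc", cv=3)
cb_model.fit(train, y_train)

# Without categorical features
clf = cb.CatBoostClassifier(eval_metric="AUC", depth=10, iterations=500,
                            l2_leaf_reg=9, learning_rate=0.15)
clf.fit(train, y_train)
auc(clf, train, test)

# With categorical features
clf = cb.CatBoostClassifier(eval_metric="AUC", one_hot_max_size=31,
                            depth=10, iterations=500, l2_leaf_reg=9, learning_rate=0.15)
clf.fit(train, y_train, cat_features=cat_features_index)
auc(clf, train, test)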

Conclusion

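The experiment shows that, for both training and prediction, CatBoost unsurprisingly takes the least time, and it also achieves the highest score. However, CatBoost only reaches its best score of 0.816 when categorical features are included; when categorical features are left out, XGBoost performs slightly better than LightGBM.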
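The listings above import time but never use it, so the timing comparison cannot be reproduced from them as written. One possible way to measure it is a small wrapper like the one below; timed is a hypothetical helper, not part of the original experiment, and the commented calls only indicate where it could be applied.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# With the models above one could write, for example:
#   _, fit_seconds = timed(model.fit, train, y_train)
#   _, predict_seconds = timed(model.predict_proba, test)
# Stand-in workload so that this snippet runs on its own:
total, seconds = timed(sum, range(1_000_000))
print(total, seconds)
```

Wall-clock measurements like this depend on the machine and on n_jobs, so they are best compared only within a single run.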


nbsword's blog
