Reading Time: 6mins, First Published: Sat, May 26, 2018

A couple of years ago I read a blog post on Analytics Vidhya, Complete Guide to Parameter Tuning in XGBoost (with codes in Python). The original post uses a multi-step grid-search to tune an XGBoost model. This post will develop a simple “pipeline tool” to automate this sort of tuning.
Knowledge of, and interest in, machine learning is beneficial to follow along.
DISCLAIMER
I’ll start by stating that I make no promises that this methodology will deliver a good model! The premise here is simple: given that one might want to build a multi-step machine learning tuning pipeline, how can we make that process as easy as possible? I do not necessarily advocate the process. Consider this post an experiment!
Introduction
Basic Concept of a Multi-Step Grid-Search
Cross validated grid-searches are a popular method of tuning hyper-parameters in machine learning models.
A disadvantage of grid-searches, and of hyper-parameter tuning in general, is that they can be rather slow: the number of parameter combinations grows exponentially as we add hyper-parameters to the search.
Searching 10 settings across 5 hyper-parameters generates 10^5 = 100,000 combinations; add in 5-fold cross-validation and this means training the model 500,000 times!
Rather than explore the parameter space exhaustively, we can dramatically reduce the size of the search by taking a step-wise “greedy” approach: tune one or two parameters at a time, take the best parameters from that search, and move on to the next set of parameters. This is exactly the style of tuning suggested in the Analytics Vidhya post.
So 500,000 becomes 5 * (10 + 10 + 10 + 10 + 10) = 250 model fits.
A 2,000-fold reduction.
Obviously this approach has trade-offs, as it only explores a tiny fraction of the parameter space.
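A quick back-of-the-envelope check of those numbers (a minimal sketch; the exact counts depend on the grid sizes you actually choose):
# Cost of an exhaustive search vs. a step-wise search, counted in model fits
settings_per_param, n_params, cv_folds = 10, 5, 5
exhaustive_fits = (settings_per_param ** n_params) * cv_folds   # 100,000 combinations x 5 folds
stepwise_fits = (settings_per_param * n_params) * cv_folds      # 50 combinations x 5 folds
print(exhaustive_fits, stepwise_fits, exhaustive_fits // stepwise_fits)  # 500000 250 2000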
Basic Setup
This is the basic setup I am using to run the code below, displayed nicely using the Watermark extension for Jupyter notebooks.
%watermark -a "James Poynter" -d -m -v -p numpy,pandas,xgboost,sklearn,matplotlib,tqdm
James Poynter 2018-05-26
CPython 3.6.4
IPython 6.2.1
numpy 1.14.0
pandas 0.22.0
xgboost 0.71
sklearn 0.19.1
matplotlib 2.1.2
tqdm 4.23.4
compiler : GCC 7.2.0
system : Linux
release : 4.13.0-39-generic
machine : x86_64
processor : x86_64
CPU cores : 4
interpreter: 64bit
Imports
Start by importing the required libraries. Here we use XGBoost's scikit-learn compatible API.
import os
from functools import wraps
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets, model_selection, linear_model, metrics, preprocessing
from sklearn.pipeline import Pipeline
from xgboost.sklearn import XGBClassifier
from tqdm import tqdm
Plot Styling
Next we set the matplotlib styling; here we use the bmh style, as seen in the Bayesian Methods for Hackers book.
plt.style.use("bmh") # Bayesian Method for Hackers style
The dataset
The dataset used for the demonstration is the scikit-learn breast cancer dataset. The code below loads the dataset and transfers it into a DataFrame.
breast_cancer = datasets.load_breast_cancer()
breast_cancer_df = pd.DataFrame(data=breast_cancer.data,
columns=breast_cancer.feature_names)
breast_cancer_df["target"] = breast_cancer.target
There are 569 examples in the dataset, which comprises 30 features plus the target column we added, giving the 31 columns seen below. Normally we would do some exploratory analysis at this point, but for the purposes of this tutorial EDA is off topic.
breast_cancer_df.shape
(569, 31)
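One quick check worth doing even without a full EDA is the class balance, since we stratify the train/test split below and later tune scale_pos_weight (a minimal sketch):
# Proportion of examples in each target class
print(breast_cancer_df.target.value_counts(normalize=True))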
Train, Test splits
Next we split out training and test sets, choosing to hold back 25% of examples as test data, and setting random_state to 888 so that the split is the same each time the notebook is run.
X, y = breast_cancer_df.drop("target", axis=1), breast_cancer_df.target
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.25, random_state=888,
stratify=y)
Basic Model: Logistic Regression
Let’s start with a basic, out-of-the-box regularised logistic regression.
logistic_regression_clf = linear_model.LogisticRegression().fit(X_train, y_train)
def plot_auc_metrics(X, y, model):
    # Plot a ROC curve and report the AUC for a fitted model
    fpr, tpr, thresholds = metrics.roc_curve(y, model.predict(X))
    roc_auc = metrics.auc(fpr, tpr)
    plt.plot(fpr, tpr, label=f"ROC AUC {roc_auc:.2%}")
    plt.plot([0, 1], [0, 1], color="r", linestyle="--")
    plt.legend()
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.title("Receiver Operating Characteristic")
plot_auc_metrics(X_test, y_test, logistic_regression_clf)
Even a simple logistic regression model achieves a very good result straight out of the box on this dataset.
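It is worth noting that plot_auc_metrics builds the ROC curve from hard class predictions; roc_curve is more commonly given a continuous score. A variant using predicted probabilities might look like the sketch below (plot_auc_metrics_proba is a hypothetical helper, and it assumes the model exposes predict_proba, which LogisticRegression, XGBClassifier, and a fitted GridSearchCV wrapping them all do):
def plot_auc_metrics_proba(X, y, model):
    # Use the predicted probability of the positive class as the score
    scores = model.predict_proba(X)[:, 1]
    fpr, tpr, thresholds = metrics.roc_curve(y, scores)
    roc_auc = metrics.auc(fpr, tpr)
    plt.plot(fpr, tpr, label=f"ROC AUC {roc_auc:.2%}")
    plt.plot([0, 1], [0, 1], color="r", linestyle="--")
    plt.legend()
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.title("Receiver Operating Characteristic")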
Grid-Search on LogisticRegression
The implementation of logistic regression above can be improved by adding some feature scaling and a grid-search.
First we create a StratifiedShuffleSplit object.
sss = model_selection.StratifiedShuffleSplit(random_state=42, n_splits=50)
And then build a pipeline to contain the preprocessing StandardScaler and the LogisticRegression model.
logistic_pipeline = Pipeline([("scaler", preprocessing.StandardScaler()),
                              ("clf", linear_model.LogisticRegression())])
Next we tune the pipeline using GridSearchCV, passing in the StratifiedShuffleSplit object as cv. Note the double-underscore format used in the param_grid to address the parameters of the pipeline's named steps.
logistic_grid = {"clf__C": np.logspace(-4, 0, 20), "clf__penalty": ["l1", "l2"], "clf__class_weight": [None, "balanced"]}
logistic_clf_gs = model_selection.GridSearchCV(logistic_pipeline, logistic_grid, scoring="roc_auc",
return_train_score=True, cv=sss, n_jobs=4)
logistic_clf_gs.fit(X_train, y_train)
plot_auc_metrics(X_test, y_test, logistic_clf_gs)
This improves the result.
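To see exactly which settings the grid-search selected, we can inspect the fitted GridSearchCV object (a small sketch; the exact values will depend on the run):
# Winning hyper-parameters and the corresponding cross-validated score
print(logistic_clf_gs.best_params_)
print(f"Best CV ROC AUC: {logistic_clf_gs.best_score_:.4f}")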
XGBoost Out of the Box Performance
Next we test the out of the box performance of XGBoost.
xgboost_clf_oob = XGBClassifier().fit(X_train, y_train)
plot_auc_metrics(X_test, y_test, xgboost_clf_oob)
XGBoost performs a little better out of the box; let's see how some basic tuning affects the performance.
Basic Tuned XGBoost
The code below implements a simple GridSearchCV checking two parameters, n_estimators and max_depth, which relate to the number of boosting rounds and the maximum depth of the trees.
param_grid = {"n_estimators": np.arange(1, 250, 25), 'max_depth': range(1, 5)}
xgboost_clf_gs = model_selection.GridSearchCV(XGBClassifier(), param_grid, scoring="roc_auc", return_train_score=True,
cv=sss)
xgboost_clf_gs.fit(X_train, y_train)
plot_auc_metrics(X_test, y_test, xgboost_clf_gs)
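The full cross-validation results can also be pulled into a DataFrame to see how the score varies across the n_estimators / max_depth grid (a minimal sketch using the standard cv_results_ attribute):
# Top five parameter combinations by mean cross-validated test score
cv_results = pd.DataFrame(xgboost_clf_gs.cv_results_)
cols = ["param_n_estimators", "param_max_depth", "mean_test_score", "std_test_score"]
print(cv_results.sort_values("mean_test_score", ascending=False)[cols].head())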
Multi-Step Grid-Search with XGBoost
Implementing a simple multi-step Pipeline
The param_grid_pipeline object below is a simple Python list, where each element is a dictionary representing the param_grid for one step in the tuning process.
param_grid_pipeline = [
{"n_estimators": np.arange(1, 250, 25)},
{'max_depth':range(1, 5), 'min_child_weight':range(1,6,2)},
{"scale_pos_weight": np.linspace(0, 1, 10)},
{'gamma':[i/10.0 for i in range(0,5)]},
{'subsample':[i/10.0 for i in range(6,10)], 'colsample_bytree':[i/10.0 for i in range(6,10)]},
{'reg_alpha':[1e-5, 1e-2, 0.1, 1, 100]},
{'reg_alpha':[0, 0.001, 0.005, 0.01, 0.05]},
{"n_estimators": np.arange(1, 250, 25)},
]
The run_param_grid_pipeline function below uses a for loop to work through each step in the pipeline; for each step a grid-search is performed and the previous best model is replaced with the new best model.
The plot_all_cv_results function simply plots each step of the grid-search as a separate subplot in a single figure.
def run_param_grid_pipeline(model, param_grid_pipeline, X_train, y_train, *args, **kwargs):
    # Run each param_grid in turn, carrying the best estimator forward to the next step
    all_cv_results = []
    for param_grid in tqdm(param_grid_pipeline):
        model, cv_results = tune_model(model, param_grid, X_train, y_train, *args, **kwargs)
        all_cv_results.append(pd.DataFrame(cv_results))
    return model, all_cv_results

def tune_model(model, param_grid, X_train, y_train, *args, **kwargs):
    # Grid-search a single param_grid and return the best estimator plus the CV results
    xgboost_gs = model_selection.GridSearchCV(model, param_grid, scoring="roc_auc", return_train_score=True,
                                              *args, **kwargs)
    xgboost_gs.fit(X_train, y_train)
    cv_results = pd.DataFrame(xgboost_gs.cv_results_)
    return xgboost_gs.best_estimator_, cv_results

def plot_all_cv_results(all_cv_results):
    # Plot mean train/test scores for every step of the pipeline, one subplot per step
    steps = len(all_cv_results)
    fig, axes = plt.subplots(steps, figsize=(20, steps * 7), sharey=True)
    for i, cv_results in enumerate(all_cv_results):
        best_result_index = cv_results.mean_test_score.values.argmax()
        best_param = cv_results["params"].iloc[best_result_index]
        cv_results[["mean_test_score", "params"]].set_index("params").plot(ax=axes[i], marker="o", color="b")
        cv_results[["mean_train_score", "params"]].set_index("params").plot(ax=axes[i], marker="o", color="r")
        axes[i].set_ylabel(f"Best Param {best_param}")
        for j, row in enumerate(cv_results[["mean_test_score", "params"]].set_index("params").itertuples()):
            axes[i].text(j, row.mean_test_score, row.Index, rotation=270, rotation_mode="anchor")
xgboost_pipe_clf, xgboost_pipe_results = run_param_grid_pipeline(XGBClassifier(), param_grid_pipeline, X_train, y_train,
n_jobs=4, cv=sss)
plot_all_cv_results(xgboost_pipe_results)
Multi-Step Grid-Search Tuning Summary
And this is the performance of the XGBoost model trained using the multi-step grid-search pipeline.
plot_auc_metrics(X_test, y_test, xgboost_pipe_clf)
Performance is back to the level seen with the out-of-the-box XGBoost model, but still worse than the LogisticRegression pipeline.
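For reference, the hyper-parameters the multi-step search settled on can be read straight off the returned estimator (a sketch; get_params is part of the scikit-learn compatible API that XGBClassifier exposes):
# xgboost_pipe_clf is the fitted XGBClassifier returned by the final pipeline step
final_params = xgboost_pipe_clf.get_params()
for name in ["n_estimators", "max_depth", "min_child_weight", "scale_pos_weight",
             "gamma", "subsample", "colsample_bytree", "reg_alpha"]:
    print(name, final_params[name])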
Conclusion
Disappointingly, in this example the multi-step grid-search pipeline did not perform very well, but it's an interesting idea, and perhaps a more robust implementation, or different data, would have produced better results.