Reading Time: 6mins, First Published: Sat, May 26, 2018
XGBoost
Python
Machine-Learning
Data Science


A couple of years ago I read a blog post on Analytics Vidhya, Complete Guide to Parameter Tuning in XGBoost (with codes in Python). The original post uses a multi-step grid-search to tune an XGBoost model. This post will develop a simple “pipeline tool” to automate this sort of tuning.

Some knowledge of, and interest in, machine learning will help you follow along.


DISCLAIMER

I’ll start by stating that I make no promises that this methodology will deliver a good model! The premise here is simple: given that one might want to build a multi-step machine learning pipeline, how can we make that process as easy as possible? I do not necessarily advocate the process. Consider this post an experiment!


Introduction



Cross validated grid-searches are a popular method of tuning hyper-parameters in machine learning models.

A disadvantage of grid-searches, and of hyper-parameter tuning in general, is that they can be rather slow: as we search more parameters, the parameter space grows multiplicatively with every dimension we add.

Searching 10 settings across 5 hyper-parameters generates 10^5 = 100,000 combinations; add in 5-fold cross-validation and this means fitting the model 500,000 times!

Rather than explore the parameter space exhaustively, we can dramatically reduce its size by taking a step-wise “greedy” approach: tune one or two parameters at a time, keep the best values from that search, and move on to the next set of parameters. This is the exact style of tuning suggested in the Analytics Vidhya post.

So 500,000 fits become 5 * (10 + 10 + 10 + 10 + 10) = 250 fits.

A 2,000-fold reduction.
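
A back-of-envelope check of those numbers in Python (illustrative only, not part of the original notebook):

# Exhaustive: every combination of 10 settings for 5 parameters, times 5 CV folds.
exhaustive_fits = 10 ** 5 * 5          # 500,000
# Greedy: 5 steps of 10 settings each, times 5 CV folds.
greedy_fits = 5 * (10 * 5)             # 250
print(exhaustive_fits, greedy_fits, exhaustive_fits // greedy_fits)  # 500000 250 2000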

Obviously this approach has trade-offs, as it only explores a tiny fraction of the parameter space.

Basic Setup


This is the basic setup I am using to run the code below, displayed nicely using the Watermark extension for Jupyter notebooks.

%watermark -a "James Poynter" -d -m -v -p numpy,pandas,xgboost,sklearn,matplotlib,tqdm
James Poynter 2018-05-26

CPython 3.6.4
IPython 6.2.1

numpy 1.14.0
pandas 0.22.0
xgboost 0.71
sklearn 0.19.1
matplotlib 2.1.2
tqdm 4.23.4

compiler   : GCC 7.2.0
system     : Linux
release    : 4.13.0-39-generic
machine    : x86_64
processor  : x86_64
CPU cores  : 4
interpreter: 64bit

Imports


Start by importing the required libraries. Here we use XGBoost's sklearn-compatible API.

import os
from functools import wraps

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets, model_selection, linear_model, metrics, preprocessing
from sklearn.pipeline import Pipeline
from xgboost.sklearn import XGBClassifier
from tqdm import tqdm

Plot Styling

Next we set the matplotlib styling; here we use the bmh style, as seen in the Bayesian Methods for Hackers book.

plt.style.use("bmh")  # Bayesian Method for Hackers style

The dataset


The dataset used for the demonstration is the Scikit-learn breast cancer dataset. The code below loads the dataset and transfers it into a DataFrame.

breast_cancer = datasets.load_breast_cancer()
breast_cancer_df = pd.DataFrame(data=breast_cancer.data,
                                columns=breast_cancer.feature_names)
breast_cancer_df["target"] = breast_cancer.target

There are 569 examples in the dataset, with 30 features plus the target column (31 columns in total). Normally we would do some exploratory analysis at this point, but for the purposes of this tutorial EDA is off topic.

breast_cancer_df.shape
(569, 31)

Train, Test splits


Next we split out a training and test set, holding back 25% of the examples as test data, stratifying on the target, and setting random_state to 888 to ensure the split is the same each time the notebook is run.

X, y = breast_cancer_df.drop("target", axis=1), breast_cancer_df.target
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.25, random_state=888,
                                                                    stratify=y)
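
Since we passed stratify=y, both splits should preserve the class balance; a quick illustrative check (not part of the original notebook):

# Illustrative: confirm the class proportions match across the two splits.
print(f"train positive rate: {y_train.mean():.3f}")
print(f"test positive rate:  {y_test.mean():.3f}")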

Basic Model: Logistic Regression


Let’s start with a basic out of the box regularised logistic regression.

logistic_regression_clf = linear_model.LogisticRegression().fit(X_train, y_train)


def plot_auc_metrics(X, y, model):
    # Note: model.predict returns hard class labels, so the ROC curve is built
    # from a single operating point rather than the full range of scores.
    fpr, tpr, thresholds = metrics.roc_curve(y, model.predict(X))
    roc_auc = metrics.auc(fpr, tpr)

    plt.plot(fpr, tpr, label=f"ROC AUC {roc_auc:.2%}")
    plt.plot([0, 1], [0, 1], color="r", linestyle="--")  # chance line
    plt.legend()
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.title("Receiver Operating Characteristic")


plot_auc_metrics(X_test, y_test, logistic_regression_clf)

Even a simple logistic regression model achieves a very good result straight out of the box on this dataset.

[Figure: ROC curve for the out-of-the-box logistic regression model]
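
As a side note, plot_auc_metrics scores the model's hard class predictions. A variant using the predicted probability of the positive class would trace the full ROC curve; a minimal sketch (not used for the figures in this post):

def plot_auc_metrics_proba(X, y, model):
    # Use the probability of the positive class so the ROC curve is traced
    # over every possible threshold, not just the default 0.5 cut-off.
    y_score = model.predict_proba(X)[:, 1]
    fpr, tpr, _ = metrics.roc_curve(y, y_score)
    roc_auc = metrics.auc(fpr, tpr)

    plt.plot(fpr, tpr, label=f"ROC AUC {roc_auc:.2%}")
    plt.plot([0, 1], [0, 1], color="r", linestyle="--")
    plt.legend()
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.title("Receiver Operating Characteristic")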

Grid-Search on LogisticRegression


The logistic regression implementation above can be improved by adding some feature scaling and a grid-search.

First we create a StratifiedShuffleSplit object.

sss = model_selection.StratifiedShuffleSplit(random_state=42, n_splits=50)
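
Each of the 50 splits it produces holds out a stratified validation fold that preserves the class balance of y_train. A quick illustrative check (the loop below is not part of the original notebook):

# Illustrative only: inspect the first split to confirm stratification.
train_idx, val_idx = next(sss.split(X_train, y_train))
print(f"overall positive rate:    {y_train.mean():.3f}")
print(f"validation positive rate: {y_train.iloc[val_idx].mean():.3f}")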

We then build a pipeline to contain the preprocessing StandardScaler and the LogisticRegression model.

logistic_pipeline = Pipeline([("scaler", preprocessing.StandardScaler()),
                              ("clf", linear_model.LogisticRegression())])

Next we tune the pipeline using GridSearchCV, passing the StratifiedShuffleSplit object in as cv. Note the double-underscore format used in the param_grid to address parameters of the clf step.

logistic_grid = {"clf__C": np.logspace(-4, 0, 20),
                 "clf__penalty": ["l1", "l2"],
                 "clf__class_weight": [None, "balanced"]}

logistic_clf_gs = model_selection.GridSearchCV(logistic_pipeline, logistic_grid, scoring="roc_auc",
                                               return_train_score=True, cv=sss, n_jobs=4)

logistic_clf_gs.fit(X_train, y_train)
plot_auc_metrics(X_test, y_test, logistic_clf_gs)

This improves the result:

[Figure: ROC curve for the tuned logistic regression pipeline]
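
If you are curious which combination the search settled on, GridSearchCV exposes it via best_params_ and best_score_ (shown here as a sketch; the winning values depend on the run):

# Inspect the winning hyper-parameters and the mean cross-validated score.
print(logistic_clf_gs.best_params_)
print(f"best CV ROC AUC: {logistic_clf_gs.best_score_:.4f}")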


XGBoost Out of the Box Performance


Next we test the out of the box performance of XGBoost.

xgboost_clf_oob = XGBClassifier().fit(X_train, y_train)
plot_auc_metrics(X_test, y_test, xgboost_clf_oob)

XGBoost performs a little better out of the box; let's see how some basic tuning affects the performance.

[Figure: ROC curve for the out-of-the-box XGBoost model]


Basic Tuned XGBoost


The code below implements a simple GridSearchCV over two parameters, n_estimators and max_depth, which control the number of boosting rounds and the maximum depth of the trees.

param_grid = {"n_estimators": np.arange(1, 250, 25), 'max_depth': range(1, 5)}
xgboost_clf_gs = model_selection.GridSearchCV(XGBClassifier(), param_grid, scoring="roc_auc", return_train_score=True,
                                              cv=sss)
xgboost_clf_gs.fit(X_train, y_train)

plot_auc_metrics(X_test, y_test, xgboost_clf_gs)

[Figure: ROC curve for the grid-searched XGBoost model]
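
GridSearchCV also records every fit in cv_results_, which the pipeline tool below leans on. A quick illustrative sketch of inspecting it as a DataFrame:

# Illustrative: view the best-scoring parameter combinations from the search.
cv_results = pd.DataFrame(xgboost_clf_gs.cv_results_)
print(cv_results[["params", "mean_train_score", "mean_test_score"]]
      .sort_values("mean_test_score", ascending=False)
      .head())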


Multi-Step Grid-Search with XGBoost


Implementing a simple multi-step Pipeline

The param_grid_pipeline object below is a simple Python list, where each element is a dictionary representing the param_grid for one step in the tuning process.

param_grid_pipeline = [
    {"n_estimators": np.arange(1, 250, 25)},                          # 1. number of boosting rounds
    {"max_depth": range(1, 5), "min_child_weight": range(1, 6, 2)},   # 2. tree complexity
    {"scale_pos_weight": np.linspace(0, 1, 10)},                      # 3. class balance weighting
    {"gamma": [i / 10.0 for i in range(0, 5)]},                       # 4. minimum split loss
    {"subsample": [i / 10.0 for i in range(6, 10)],                   # 5. row and column sampling
     "colsample_bytree": [i / 10.0 for i in range(6, 10)]},
    {"reg_alpha": [1e-5, 1e-2, 0.1, 1, 100]},                         # 6. L1 regularisation (coarse)
    {"reg_alpha": [0, 0.001, 0.005, 0.01, 0.05]},                     # 7. L1 regularisation (fine)
    {"n_estimators": np.arange(1, 250, 25)},                          # 8. re-tune boosting rounds
]

The run_param_grid_pipeline function below uses a for loop to step through each step in the pipeline; for each step a grid-search is performed and the previous best model is replaced with the new best model.

The plot_all_cv_results function simply plots each step of the grid-search as a separate subplot in a single graphic.

def run_param_grid_pipeline(model, param_grid_pipeline, X_train, y_train, *args, **kwargs):
    """Run each param_grid in turn, carrying the best estimator forward to the next step."""
    all_cv_results = []
    for param_grid in tqdm(param_grid_pipeline):
        model, cv_results = tune_model(model, param_grid, X_train, y_train, *args, **kwargs)
        all_cv_results.append(pd.DataFrame(cv_results))
    return model, all_cv_results


def tune_model(model, param_grid, X_train, y_train, *args, **kwargs):
    """Grid-search a single param_grid and return the best estimator plus the CV results."""
    xgboost_gs = model_selection.GridSearchCV(model, param_grid, *args, scoring="roc_auc",
                                              return_train_score=True, **kwargs)
    xgboost_gs.fit(X_train, y_train)
    cv_results = pd.DataFrame(xgboost_gs.cv_results_)
    return xgboost_gs.best_estimator_, cv_results


def plot_all_cv_results(all_cv_results):
    """Plot mean train and test scores for every step, one subplot per step."""
    steps = len(all_cv_results)
    fig, axes = plt.subplots(steps, figsize=(20, steps * 7), sharey=True)

    for i, cv_results in enumerate(all_cv_results):
        best_result_index = cv_results.mean_test_score.values.argmax()
        best_param = cv_results["params"].iloc[best_result_index]
        cv_results[["mean_test_score", "params"]].set_index("params").plot(ax=axes[i], marker="o", color="b")
        cv_results[["mean_train_score", "params"]].set_index("params").plot(ax=axes[i], marker="o", color="r")
        axes[i].set_ylabel(f"Best Param {best_param}")

        # Label each point with its parameter combination.
        for j, row in enumerate(cv_results[["mean_test_score", "params"]].set_index("params").itertuples()):
            axes[i].text(j, row.mean_test_score, row.Index, rotation=270, rotation_mode="anchor")


xgboost_pipe_clf, xgboost_pipe_results = run_param_grid_pipeline(XGBClassifier(), param_grid_pipeline, X_train, y_train,
                                                                 n_jobs=4, cv=sss)
plot_all_cv_results(xgboost_pipe_results)

Multi-Step Grid-Search Tuning Summary


[Figure: cross-validation train and test scores for each step of the multi-step grid-search]
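
Before checking test performance, it can be useful to inspect the hyper-parameter values the final estimator ended up with; a quick illustrative sketch using the sklearn get_params API:

# Illustrative: print the hyper-parameters carried forward by the final tuned estimator.
for name in ["n_estimators", "max_depth", "min_child_weight", "scale_pos_weight",
             "gamma", "subsample", "colsample_bytree", "reg_alpha"]:
    print(name, xgboost_pipe_clf.get_params()[name])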

And this is the performance of the XGBoost model trained using the multi-step grid-search pipeline.

plot_auc_metrics(X_test, y_test, xgboost_pipe_clf)

Performance is back to the level seen in the out-of-the-box XGBoost model, but still worse than the LogisticRegression pipeline.

[Figure: ROC curve for the XGBoost model tuned with the multi-step grid-search pipeline]

Conclusion


Disappointingly, in this example the multi-step grid-search pipeline did not perform very well, but it's an interesting idea, and perhaps with a more robust implementation, or on different data, it would have performed better.