Reading Time: 24mins, First Published: Sun, Apr 29, 2018
JackKnife Variance
Random Forests
Data-Science


This tutorial provides an introductory guide to tree based models and random forests. In the second half of the tutorial we will look at how to explain model predictions using Local Interpretable Model-Agnostic Explanations (LIME), and how to calculate confidence intervals around random forest predictions using an approach called infinitesimal jackknife variance.

Introduction


Data scientists are often primarily interested in the prediction accuracy of their models. However, two topics discussed far less frequently, and often overlooked completely, are how to explain a prediction and how much confidence the model has in that prediction. Gaining a better understanding of a prediction can prove helpful, particularly in a business setting where the output of the model is used as one component of the overall decision-making process.

What tools are needed for this tutorial?


This tutorial will utilise the following libraries:

# Scientific Python stack
import pandas as pd
import numpy as np
from scipy import stats
from sklearn import ensemble, model_selection, metrics, datasets, tree

# Data visualisation
import matplotlib.pyplot as plt
import matplotlib as mpl
from mpl_toolkits.mplot3d import Axes3D

# Statistical visualisations
import seaborn as sns

# Plotting decision trees
import graphviz

# Explaining model predictions
from lime import lime_tabular

# Implementation of infinitesimal jackknife variance
import forestci

Predicting House Prices with the Boston Housing Data Set


Throughout the post we will utilise the Boston Housing data set to illustrate the key theoretical concepts, and to provide practical examples of how to implement each theory or idea. The Boston Housing data set is available through Scikit learn’s data sets module. I expect that some readers will already be familiar with the data set.

Loading the data


We load the data from the data sets module using the load_boston() method, and create an object called house_prices where we will store all of the data.

house_prices = datasets.load_boston()

About the Boston Housing Data


Each of the Scikit learn data sets comes complete with a description which gives an overview of the data.

We can see that the full data set is relatively small: 506 instances and 13 attributes; there are also certain features included in the data set that could be considered unethical by modern standards. Reading further we find that the data set was collected pre-1978. Typically these sorts of data sets are used in academic settings, as it makes comparison with previous works easier.

The target variable in our analysis is: Median value of owner-occupied homes in $1000’s


print(house_prices["DESCR"])

Boston House Prices Dataset
===========================

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506

    :Number of Attributes: 13 numeric/categorical predictive

    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.

This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. ‘Hedonic prices and the demand for clean air’, J. Environ. Economics & Management, vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, ‘Regression diagnostics …', Wiley, 1980. N.B. Various transformations are used in the table on pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression problems.

References


  • Belsley, Kuh & Welsch, ‘Regression diagnostics: Identifying Influential Data and Sources of Collinearity’, Wiley, 1980. 244-261.
  • Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
  • many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)

Data analysis: Creating train, test splits


It is good practice to carry out exploratory analysis on the training set rather than the full data set. The logic is that keeping the test set fully isolated avoids any chance of the test data influencing the analyst’s decision-making process.

The code below splits out 20% of the data, chosen at random, as a test set. We also place the features and labels into a pandas DataFrame and Series respectively. This additional step makes the data easier to work with by providing column names and pandas data munging functionality.

X_train, X_test, y_train, y_test = model_selection.train_test_split(
    pd.DataFrame(house_prices.data, columns=house_prices.feature_names),
    pd.Series(house_prices.target, name="med_price"),
    test_size=0.20, random_state=42)

Exploratory Data Analysis (EDA)


The focus of this post is tree based models and random forests, so we will not go into much detail on exploratory data analysis. However, we can get a lot of bang for our buck using the Seaborn library’s pairplot() function.

The pair plot generates a matrix of scatter plots which show how one variable relates to another, in a one-factor sense. The bottom row is the median house price in each district. We can see that there are a few potential relationships that look interesting, but RM and LSTAT stand out. On the diagonal, the scatter plots are replaced with a histogram of each feature. If we look at the median price row, we can see that the data is capped at 50: the distribution is initially fairly smooth and then there is an unexpected bump at 50. This feature is also visible in a number of the scatter plots.

sns.pairplot(pd.concat([X_train, y_train], axis=1))
plt.savefig("exploratory.png")

exploratory

Regression using tree based models


We will frame the problem of predicting the median house price as a regression problem; this post will not cover a classification example, but the basic principles of what we cover here generalise to classification problems.

How does a regression tree work?


Let’s jump ahead a little:

from future import the decision tree model we haven’t built yet…

treeout

As the name suggests, a decision tree is a tree of decisions. As the training data flows through the tree it “splits” at the different “decision nodes” and collects at the end nodes, which we call “leaves”.

At each decision node in the tree two decisions can be made:

  • the first decision is whether or not to split
  • the second is how to split

In the example above we start at the “root node” with 404 training samples. The decision has been made to split, and the criterion is RM <= 6.941: instances with 6.941 rooms or fewer go left, and the others go right. And so 337 instances go left, and 67 go right. We then do the same at the next node. Finally, we arrive at the leaves, where the training samples collect.

When we want to make a new prediction, the new data instance will flow through the tree, passing through the decision nodes and being steered to arrive at one of the leaves. The predicted house price will be the average of all of the training prices that fell on the same leaf of the tree during training.
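
As a quick sanity check of this idea, the sketch below (which assumes the tree_reg model we fit later in the post) rebuilds the predictions by hand: it groups the training prices by the leaf each sample lands in, and confirms that the tree’s prediction for a test sample is just the mean training price in its leaf.

# A sketch, assuming the fitted tree_reg model from later in the post
leaf_ids = tree_reg.apply(X_train)             # leaf index for each training sample
leaf_means = y_train.groupby(leaf_ids).mean()  # average training price per leaf

test_leaves = tree_reg.apply(X_test)
manual_prediction = leaf_means.loc[test_leaves].values

# The tree's prediction is the mean training price in the matching leaf
assert np.allclose(manual_prediction, tree_reg.predict(X_test))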

The choice of whether or not to split can be determined by predefined structural restrictions such as:

  • maximum tree depth
  • minimum samples to split

the choice can also be influenced by whether the split is worth doing. The worth of the split is determined by a cost function. In the case of regression the cost function is the reduction in variance upon splitting:

$$ cost(D) = \sum_{i \in D}(y_i - \bar{y})^2 $$

The question of whether to split then becomes whether or not splitting the data into two new nodes will result in an overall decrease in variance, relative to leaving things as is.
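
To make the cost function concrete, here is a tiny sketch with made-up numbers (not the housing data): the split is worth making because the summed cost of the two child nodes is lower than the cost of the parent node.

def node_cost(y):
    """cost(D) = sum of squared deviations from the node mean."""
    return np.sum((y - y.mean()) ** 2)

parent = np.array([10.0, 12.0, 30.0, 34.0])
left, right = parent[:2], parent[2:]

print(node_cost(parent))                   # cost before splitting: 451.0
print(node_cost(left) + node_cost(right))  # cost after splitting: 10.0, so split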

Defining what makes a good model


A good model will:

be capable of predicting the median house price on previously unseen data, with the minimum amount of error.

The test set we have held out will provide our unseen data. The key concept here is that the model must generalise well to previously unseen data.

The next question is how we define a good error metric. A generally good choice for a regression model is the mean squared error (MSE): the average of the squared differences between the predicted prices and the actual prices. We will often quote its square root, the root mean squared error (RMSE), which is on the same scale as the target. Our model’s aim is to minimise this error metric.

$$ RMSE = \sqrt {\frac{1}{N}{\sum\limits_{i =1}^N {(y_{i} - \hat{y}_{i})^2 }}} $$
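
As a toy illustration of the metric (made-up numbers, not the housing data), the RMSE is just the square root of the mean squared error, and matches Scikit learn’s metrics.mean_squared_error:

y_true = np.array([24.0, 21.5, 34.0])   # actual prices
y_pred = np.array([25.0, 20.0, 30.0])   # predicted prices

mse = np.mean((y_true - y_pred) ** 2)   # ~6.42
rmse = np.sqrt(mse)                     # ~2.53

print(np.isclose(mse, metrics.mean_squared_error(y_true, y_pred)))  # True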

How would a single tree try to reduce this error?


Our model has access to a table of features, which contain information about each property: the number of rooms, the distance to employment centres, etc.

A decision tree begins at the root node; the root node contains all of the instances in the training data set. As we previously stated, the aim at each node is to reduce the variance of the parent node by splitting the samples into two. But how are the splits decided? There’s more than one algorithm, but one way is to:

  • look at each one of the features one at a time
  • and then, using a single feature, try to split the data set into two parts: houses with more rooms than x, and houses with fewer rooms than x. The split values can be based on the unique feature values in the training data: if we had four houses with [1, 2, 3, 4] rooms, the candidate splits might be the mid-points between those values, i.e. [1.5, 2.5, 3.5].
  • the model then evaluates each possible split, for each possible feature, to find the split that best reduces the variance.

This type of recipe is referred to as a greedy algorithm:

A greedy algorithm is an algorithmic paradigm that follows the problem-solving heuristic of making the locally optimal choice at each stage with the hope of finding a global optimum. Wikipedia
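
As a rough sketch of the greedy search (not Scikit learn’s actual implementation), we can try every mid-point of every feature and keep the split with the lowest total cost, reusing the node_cost idea from the earlier sketch:

def node_cost(y):
    # same cost as the earlier sketch: sum of squared deviations from the mean
    return np.sum((y - y.mean()) ** 2)

def best_split(X, y):
    best = (None, None, node_cost(y))  # (feature, threshold, total cost)
    for feature in X.columns:
        values = np.sort(X[feature].unique())
        thresholds = (values[:-1] + values[1:]) / 2  # mid-points between unique values
        for threshold in thresholds:
            mask = X[feature] <= threshold
            cost = node_cost(y[mask]) + node_cost(y[~mask])
            if cost < best[2]:
                best = (feature, threshold, cost)
    return best

# Should recover a root split close to RM <= 6.941, like the tree plotted above
print(best_split(X_train, y_train))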

Scale Robustness of Tree Based Models


The fact that the algorithm considers only one feature at a time when making a split, and that the split threshold does not rely on all of the data, just two instances (or one in some implementations), means that tree based models are very robust to features being on different scales, relative to other models. The reasoning is similar to why medians are robust: outliers have no effect. This means that if one feature is orders of magnitude larger than another, there is no real impact on the model.

From a data scientist’s perspective this removes the need to scale data, and this quality has a number of benefits:

  • Slightly less work: Firstly, it removes the need for the data scientist to build a pipeline to scale the data prior to model training and prediction.
  • Aids interpretability: Secondly, the lack of any scaling can help make the model outputs more understandable, as we retain the original scaling of the data.

Tree based models are not scale invariant


Whilst tree based models are robust to features of different scales, depending on the implementation they are not necessarily scale invariant. If we think about the second step above, where we generate the candidate splits, we can see why:

As an example, suppose we have another data set where one of the columns is made up of the array [1, 2, 3, 4]. Taking the log of this feature shrinks the larger values more than the smaller values, which shifts the candidate split points.

e.g.


log_values = np.log([1, 2, 3, 4])
log_splits = log_values[1:] + (log_values[:-1] - log_values[1:])/2
np.exp(log_splits)

>>> array([ 1.41421356,  2.44948974,  3.46410162])

original = np.array([1, 2, 3, 4])
original_splits = original[1:] + (original[:-1] - original[1:])/2
original_splits

>>> array([ 1.5,  2.5,  3.5])

The effect is that predictions can vary depending on the feature scaling. In the vast majority of cases the impact is likely to be fairly minimal, although it is worth being aware of.

Fitting a tree based model with Scikit Learn


Fitting a tree model with Scikit learn is very easy. The code below restricts the depth of the tree to two splits so that we can make a plot of the model for illustration.


tree_reg = tree.DecisionTreeRegressor(max_depth=2)
tree_reg.fit(X_train, y_train)

>>> DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
           max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, presort=False, random_state=None,
           splitter='best')

The graphic is produced using graphviz. The graph illustrates the algorithm that the model has learned from the data.

tree.export_graphviz(tree_reg, out_file="tree.dot", feature_names=X_train.columns)

# Save to png
from subprocess import check_call
check_call(['dot','-Tpng','tree.dot','-o','treeout.png'])

treeout

Visualising the graph in 3D


As only two features are utilised in the model, RM and LSTAT, we can create a 3D representation of how the predicted median value varies with them. The graphic is useful for illustrating how the different levels of the tree combine to model non-linear features of the data.

room_range = np.linspace(X_train.RM.min(), X_train.RM.max(), 100)
lstat_range = np.linspace(X_train.LSTAT.min(), X_train.LSTAT.max(), 100)

xs, ys = np.meshgrid(room_range, lstat_range)


# Hard-code the split rules learned by the depth-2 tree above
zs = np.where(xs <= 6.941,
              np.where(ys <= 14.4, 23.3246, 14.8291),
              np.where(xs <= 7.437, 32.3634, 44.65))


fig, axes = plt.subplots(figsize=(10,10))
axes = fig.gca(projection='3d')
surf = axes.plot_surface(xs, ys, zs, alpha = 0.3, rstride=1, cstride=1, cmap=plt.cm.cool, linewidth=0.5,
                       antialiased=True)
axes.set_title("SURFACE PLOT BASED ON BUSINESS RULES")
axes.set_ylabel("LSTAT")
axes.set_xlabel("RM")
axes.set_zlabel("MED VALUE")

fig.savefig("3D_tree.png")

3D_tree

Interpretability of simple tree based models


The interpretability of tree based models is one of their key strengths. The model is well defined, easily understandable, and could be given to anyone in the business. Unfortunately, as the tree gets deeper it quickly becomes an unmanageable task to navigate the complex flow of decisions, as is self-evident from the graph when we increase the depth to 10.

treeout

High variance of trees & over-fitting


Tree based models will often over-fit the data, and suffer from high variance (small changes in the data can result in large changes in the model). As the model begins to over-fit it starts to model noise in the data, rather than any real, generalisable pattern. This means that when we try to make a prediction on new data, the model is confused by all of the noise patterns it memorised, and it tries to build this noise from the training data into the new predictions. As the noise is purely random, there is no generalisable pattern, and so the model makes mistakes.

Over-fitting is often characterised by a strong result on the training set, and a much weaker result on the test set. We can see this is the case for the second, deeper regression tree:

train_test_split
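
The comparison above can be reproduced along the following lines (a sketch rather than the exact code used for the figure; here I fit a tree with max_depth=10, the depth mentioned above, and compare predicted against actual prices on the training and test sets):

deep_tree = tree.DecisionTreeRegressor(max_depth=10).fit(X_train, y_train)

fig, axes = plt.subplots(1, 2, figsize=(12, 6), sharex=True, sharey=True)
for ax, X, y, name in [(axes[0], X_train, y_train, "train"),
                       (axes[1], X_test, y_test, "test")]:
    predictions = deep_tree.predict(X)
    ax.scatter(predictions, y, alpha=0.5)
    ax.plot([0, 50], [0, 50], color="r", linestyle="--")  # perfect predictions lie on this line
    ax.set_title("{} set MSE: {:.1f}".format(name, metrics.mean_squared_error(y, predictions)))
    ax.set_xlabel("Predicted median price")
    ax.set_ylabel("Actual median price")

fig.savefig("train_test_split_sketch.png")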

We can of course restrict the complexity of the model to combat the over-fitting. This process of complexity restriction is called regularisation. Regularisation is one of the most important concepts in machine learning. By reducing the complexity of the model we can find a sweet spot between over-fitting the data: memorising too much noise, and under-fitting the data: not learning enough of the real patterns in the data. This concept is referred to as the bias variance trade-off.

A lot of people still hold on to the notion that a more complex model is by definition a better model. This is simply not true. I often see this fallacy play out when discussing models with underwriters, catastrophe modellers, and even actuaries; the reason this myth prevails is that there are still many professions which do not judge the worth of a model by its ability to predict and generalise to new data, but instead judge how good a model is by how well it fits the data as a whole. Without a focus on how well the model generalises, the model will “fit the data” best when it memorises all the noise in the data set!

A high bias example


As an example of what happens at the other end of the spectrum, if we plot our original model with max depth 2 we see the following scatter plots. Here the MSE is almost the same on the test and training sets, but in both cases it is higher than for the previous model. This is a prime example of under-fitting. A data scientist would call this a high bias model, because we are ignoring what the data is telling us: our model is simply too restrictive and lacks the complexity to properly model the real patterns in the data.

high_bias_tree


There are many different approaches to finding the sweet spot between over-fitting and under-fitting. A simple option is to try lots of different values for maximum depth and see which is best. The issue with this approach is that each time we test a different value we take another look at the test data; if we are not careful we can end up tuning our hyper-parameters on the test data. It’s a bit like being able to see your marks on an exam as you write the answers: hmmm, just 2 out of 5, what if I try this… oh, 3, great! :-)

To try to avoid this situation, data scientists tune hyper-parameters using a technique called cross validation. In cross-validation we split the training set into, say, 5 slices. We hold out one slice of the training set as a test set, which we call a validation set to avoid confusion with the actual test set. We then train a model on the first four slices, and test on the held-out slice. We can also permute the slices, so that, say, the second slice becomes the validation set and the rest become the training set. This technique allows us to train 5 models on different subsets of the data, and we can then combine these results to build up a picture of how the model performs on average.

Using this cross-validation approach we can test each hyper-parameter value, in our previous example the max_depth hyper-parameter, without looking at the test data!
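
As a quick sketch of the idea, Scikit learn’s cross_val_score does the slicing and re-fitting for us; max_depth=5 below is just an arbitrary value to illustrate scoring a single hyper-parameter setting (the default score for a regressor is R²):

scores = model_selection.cross_val_score(
    tree.DecisionTreeRegressor(max_depth=5), X_train, y_train, cv=5)

print(scores)         # one R^2 score per validation slice
print(scores.mean())  # average performance across the 5 folds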

Grid searches can be computationally expensive: if we search 10 different depths with 5 folds, that means we have trained and tested 50 models! If we tune an additional hyper-parameter, say max_features, with 10 different options, this means we are testing 500 models (5 folds x 10 max_depth values x 10 max_features values). Large grid searches can take days or even weeks.

Running a grid search

param_grid = {
    "max_depth": np.arange(1, 15),
}

gs = model_selection.GridSearchCV(tree.DecisionTreeRegressor(), param_grid=param_grid, cv=5, n_jobs=-1)

gs.fit(X_train, y_train)
# mean_test_score uses the regressor's default scorer (R^2), so higher is better
mean_cv_score = gs.cv_results_["mean_test_score"]
max_depth = gs.cv_results_["param_max_depth"].data

plt.plot(max_depth, mean_cv_score)
plt.axvline(np.argmax(mean_cv_score) + 1, alpha=0.5, linestyle="--", color="r")
plt.savefig("hyper-param-tuning.png")

hyper_param_tuning

If we plot the results for the model chosen by the cross-validated grid search, we can now see that it performs slightly worse on the training data, but much better on the unseen test data: which confirms that indeed a more complex model is not always better!

tuned_tree

Introduction to Ensemble models


As previously discussed, one method of reducing the variance of the model is to reduce the model complexity through regularisation, but often this is not the optimum methodology in terms of prediction accuracy. Another popular methodology designed to overcome the high variance associated with tree based models is to build ensembles. An ensemble groups lots of models together into a single model. In the case of tree based models, this means grouping lots of different trees together to make a “forest” of trees.

Averaging the results from multiple high variance models will reduce the overall variance. However, for this to work successfully certain conditions are necessary: obviously, if we were to average the output of the exact same model multiple times this would not result in less variance. Intuitively the concept is similar to the wisdom of the crowds, which posits that, subject to certain criteria, the average of many individuals’ opinions is smarter than the opinion of any one individual. The criteria for a wise crowd are similar to what makes a good forest:

  • Diversity of opinion: Each person should have private information even if it’s just an eccentric interpretation of the known facts.
  • Independence: People’s opinions aren’t determined by the opinions of those around them.
  • Decentralisation: People are able to specialise and draw on local knowledge.
  • Aggregation: Some mechanism exists for turning private judgements into a collective decision.

Whilst the analogy is not perfect, it is useful. Tree based ensembles use randomised re-sampling techniques (which we shall discuss next) to generate diversity, independence and decentralisation, and we aggregate results by taking the mean of the predictions from all of the trees in the model, and thus start to average out the noise. This allows us to reduce the variance of the overall ensemble without increasing the bias. Hopefully, this process should allow us to generate better predictions.
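
As a toy illustration of why averaging helps (made-up numbers, and assuming perfectly independent estimators, which real trees are not), the standard deviation of the mean of 100 noisy estimates is roughly a tenth of the standard deviation of a single estimate:

rng = np.random.RandomState(0)
true_value = 22.0
# 10,000 repeats of an "ensemble" of 100 independent noisy estimates
noisy_estimates = true_value + rng.normal(scale=5.0, size=(10000, 100))

print(noisy_estimates[:, 0].std())         # a single estimate: ~5.0
print(noisy_estimates.mean(axis=1).std())  # the ensemble average: ~0.5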

Bagging


One example of an ensemble is known as a “bagging” learner; the name “bagging” is a shortening of the term “bootstrap aggregating”. Bootstrap aggregation, or bootstrapping, refers to the method by which we introduce diversity of opinion. In simple terms, each tree in the ensemble receives a slightly different version of the original data set. This diversity is achieved by re-sampling the original data set with replacement:

  • we pick out a row at random from the original data
  • record the information from the randomly chosen row
  • replace the row back into the original data
  • and repeat this process until we have recorded a new data set, typically the same size as the original one.

This technique of re-sampling with replacement is known as bootstrap re-sampling, and it means that certain rows in the original data are sampled more than once, while others are missed completely. The technique introduces diversity in the data that each tree receives, and hence each tree forms a different representation of the training data.
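
A minimal sketch of bootstrap re-sampling with numpy might look like the following: we sample row indices with replacement, so some rows appear more than once and, on average, roughly a third of the rows do not appear at all.

rng = np.random.RandomState(42)
n_rows = X_train.shape[0]

bootstrap_indices = rng.choice(n_rows, size=n_rows, replace=True)
X_boot, y_boot = X_train.iloc[bootstrap_indices], y_train.iloc[bootstrap_indices]

# Fraction of the original rows that made it into this bootstrap sample (~63%)
print(np.unique(bootstrap_indices).size / n_rows)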

Random Forests


Random forests take the bagging concept a step further by introducing an additional source of noise or randomness; typically this involves randomly selecting only a subset of the available columns at each split. This means that when the algorithm is looking for a good splitting point it does not have a full choice of columns to choose from; it has to make do with what it has in the randomly selected subset. The purpose of this is to restrict the available data, and force each tree to come up with a different representation of the structure. These additional sources of noise usually have hyper-parameters which can be tuned by the analyst. There are also variants on this approach, such as the “extremely randomised trees” algorithm, but the common thread is a bagging algorithm plus an additional source of randomness.
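
In Scikit learn this extra randomness is controlled by the max_features hyper-parameter; the values below are arbitrary illustrations rather than tuned choices:

# Consider only 4 randomly chosen columns at each split
rf_subset = ensemble.RandomForestRegressor(n_estimators=100, max_features=4)

# Consider every column at each split: essentially plain bagging of trees
rf_bagging = ensemble.RandomForestRegressor(n_estimators=100, max_features=None)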

Boosting


Boosting refers to another class of tree based models. In a boosted model an initial tree makes a prediction. The next model then evaluates the errors that the first tree made, and builds a second tree which gives more weight/importance to the observations the first model predicted incorrectly. Intuitively, the aim of each successive tree is to deal with the mistakes of its predecessors.

This post will not discuss boosted models further; however, this is not to discount their value: implementations of gradient boosted models such as XGBoost are used frequently in machine learning competitions, but they deserve a separate post!

Explaining Predictions and Evaluating variance in Random Forests


Setting up a Basic Random Forest Model


As an example we will use a simple grid search to tune a few of the model’s hyper-parameters. We fix the number of estimators at 500, and provide a range of options for max_depth and max_features.

For ease of use we name the best estimator found in the grid search forest.

trees_in_forest = 500
param_grid = {
    "n_estimators":[trees_in_forest],
    "max_depth":[None] + list(range(2, 53, 10)),
    "max_features":[None] + list(range(2, X_train.shape[1]))
}

gs = model_selection.GridSearchCV(ensemble.RandomForestRegressor(), param_grid=param_grid, n_jobs=2)

gs.fit(X_train, y_train)

forest = gs.best_estimator_

Note the computational expense here:

  • 500 trees per forest
  • 3-fold cross validation
  • 7 max_depth options
  • 12 max_features options

I make that to be 126,000 trees! On a bigger data set you may need to use fewer trees, or faster hardware!

Model performance


The plot below illustrates the performance of the random forest. We can see that the MSE on the test set is lower than for the tuned tree model we built earlier (though not substantially); I think with a little more effort we could improve the performance further, but let’s leave it there for now.

forest_scatter

Random Forest Feature Importance


Now that we have a trained model we can look at the feature importances found by the model. A large feature importance indicates an important feature. Scikit learn normalises the feature importances so that they sum to one.

The following code plots the feature importances of the random forest, along with the standard deviation of each feature importance across all of the trees. We can see that the LSTAT and RM features are the most important, which reaffirms what we saw in our original tree based model.

importances = forest.feature_importances_

std = np.std([tree.feature_importances_ for tree in forest.estimators_], axis=0)
indices = np.argsort(importances)[::-1]

# Plot the feature importances of the forest
plt.figure(figsize = (15,5))
plt.title("Feature importances")
plt.bar(range(X_train.shape[1]), importances[indices], yerr=std[indices], align="center")
plt.xticks(range(X_train.shape[1]), X_train.columns[indices],rotation='vertical')
plt.xlim([-1, X_train.shape[1]])
plt.ylabel("Feature Importance")
plt.xlabel("Feature Name")
plt.savefig("feature_importances.png", dpi=300)

feature_importance

Black Boxes


Unfortunately, whilst the feature importances give us an indication of which features were important in the model, they do not tell us why they were important or in what way, i.e. how does RM affect the prediction? This illustrates an important point about ensemble models, and complex machine learning models in general: as the model complexity increases we can potentially make better predictions, at the expense of comprehensibility. Models which fit this description are sometimes referred to as black box models, so we might think of our random forests as black forests!

Looking inside the box: Local Interpretable Model-Agnostic Explanations (LIME)


The aim of LIME (Local Interpretable Model-Agnostic Explanations) is to explain how a model makes a prediction.

The key intuition behind LIME is that it is much easier to approximate a black-box model by a simple model locally (in the neighbourhood of the prediction we want to explain), as opposed to trying to approximate a model globally.

Installation


At the time of writing, the version of LIME available through pip does not include support for regression; support for regression was included from version 0.1.1.22. By the time you read this the version available via pip may have been updated, but in case it is not you can clone the repo from GitHub:

https://github.com/marcotcr/lime

And install by running:

python setup.py install
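
A minimal sketch of how lime_tabular can be used with the forest model we build in the next section might look like this (the choice of the first test row and num_features=5 are arbitrary):

explainer = lime_tabular.LimeTabularExplainer(
    X_train.values,
    feature_names=X_train.columns.tolist(),
    mode="regression")

explanation = explainer.explain_instance(
    X_test.values[0], forest.predict, num_features=5)

# Feature rules and their local weights for this single prediction
print(explanation.as_list())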


Infinitesimal Jackknife Variance


Infinitesimal jackknife variance is a technique designed to estimate the variance of a random forest’s predictions using re-sampling techniques.

The technique is detailed in the paper available here

Sources of model variance


In a random forest there are two key sources of variance:

  1. Bootstrap noise: Monte Carlo variance introduced by the bagging process, because we only grow a finite number of bootstrapped trees
  2. Sampling variance: variance in the estimate due to the particular training sample we happened to observe

What is infinitesimal jackknife variance?


The aim of the infinitesimal jackknife technique is to estimate how much variance there is in the prediction once bootstrap effects are eliminated. Bootstrap effects are eliminated when the number of bagging estimators approaches infinity; however, in the real world we can’t have an infinite number of estimators, so as a work-around Wager et al. proposed the following:

$$ V^B_{IJ} = \sum^{n}_{i=1}Cov^2_i $$

where

$$ Cov_{i} = \frac{\sum_{b}(N^*_{bi} - 1)(t^*_{b}(x)-\bar{t}^*(x))}{B} $$

The equation looks complicated but let’s talk through the basic structure and see if we can build a basic understanding.

\( N^*_{bi} \) indicates the number of times the \( i^{th} \) training observation appears in the bootstrap sample b. That is how many times a data point was placed in each tree during the bagging process described previously.

\((t^*_{b}(x)-\bar{t}^*(x))\) is the centred prediction. For each tree in the random forest we calculate the difference between the tree’s prediction on the test set and the average prediction over all trees on the test set.

And so the jackknife variance captures the covariance between the composition of the bagged sample and the prediction made by the tree built on that bagged sample, i.e. how the prediction and the composition of the in-bag sample change together.

Bias adjusted jackknife variance


If you read through the original technical paper:

“Confidence intervals for random forests: The jackknife and the infinitesimal jackknife”

The equation to compute the bias-corrected jackknife variance is as follows:

$$ V^B_{IJ-U} = V^B_{IJ} - \frac{n}{B^2} \sum^B_{b=1}(t^*_b(x)-\bar{t}^*(x))^2 $$

Implementing Jackknife variance in Python


The code below implements the key elements of the equations above; the calculation of how often each sample appears in each tree’s in-bag sample is borrowed from the forestci library.

def get_infinitesimal_jackknife_variance(inbag, forest, n_estimators, X_test):
    # Predictions from each tree on the test set, shape (n_test, n_trees)
    simple_prediction = np.array([tree.predict(X_test) for tree in forest]).T
    # Centre each tree's predictions around its mean (mirroring forestci)
    centered_prediction = simple_prediction - simple_prediction.mean(axis=0)
    # V_IJ = sum_i Cov_i^2, with Cov_i built from the in-bag counts (N_bi - 1)
    # and the centred predictions, averaged over the B trees
    y_err = np.sum((np.dot(inbag - 1, centered_prediction.T) / n_estimators) ** 2, 0)
    bias_adjustment = bias_correction(y_err, inbag, centered_prediction, n_estimators)
    return y_err, bias_adjustment


def bias_correction(V_IJ, inbag, pred_centered, n_trees):
    # Bias term from the paper: (n / B^2) * sum_b (t_b(x) - t_bar(x))^2
    n_train_samples = inbag.shape[0]
    boot_var = np.square(pred_centered).sum(axis=1) / n_trees

    bias_correction = n_train_samples * boot_var / n_trees

    return bias_correction

# Compute the in-bag counts, the jackknife variance and the bias adjustment
inbag = forestci.calc_inbag(X_train.shape[0], forest)
y_err, bias_adjustment = get_infinitesimal_jackknife_variance(inbag, forest, trees_in_forest,  X_test)

# Make Plot
fig, axes = plt.subplots(figsize=(8,8))
axes.errorbar(forest.predict(X_test), y_test , yerr=np.sqrt(y_err-bias_adjustment), linestyle="",
              marker="o", markersize=5)

axes.plot([0,50], [0,50], color="r", linestyle="--")

jackknife_variance

Forestci Implementation


Fortunately there’s no need to roll your own implementation of jackknife variance: the forestci library has an implementation that works with scikit learn.

Using the forestci implementation

inbag = forestci.calc_inbag(X_train.shape[0], forest)
y_err = forestci.random_forest_error(forest, inbag, X_train, X_test)

fig, axes = plt.subplots(figsize=(8,8))
axes.errorbar(forest.predict(X_test), y_test , yerr=np.sqrt(y_err), linestyle="",
              marker="o", markersize=5)

axes.plot([0,50], [0,50], color="r", linestyle="--")

jackknife_variance2

The implementation of the bias adjustment in forestci also includes a factor called n_var. I honestly am not sure what n_var is supposed to do; the code that implements n_var is as follows. Comments below please :-)


# Forestci implementation
n_var = np.mean(np.square(inbag[0:trees_in_forest]).mean(axis=1).T.view() -
                np.square(inbag[0:trees_in_forest].mean(axis=1)).T.view())

Conclusion


If you made it all the way through the post, I hope that you found it useful.

Understanding and explaining the predictions from random forests is clearly a big subject; the more I researched for this post, the more interesting ideas and developments sprang up. In fact there are many elements, especially LIME, which I would like to take a closer look at in a follow-up post.

If you are interested in learning more about explaining models, the LIME documentation and the paper explaining the jackknife variance are good places to start.

Thanks for reading,

James