Intuitive Interpretation of Random Forest

For someone who thinks that random forest is a black box algorithm, this post can offer a differing opinion. I am going to cover 4 interpretation methods that can help us get meaning out of a random forest model, with intuitive explanations. I am also going to briefly discuss the pseudo code behind each of these interpretation methods. I learned about them in the fast.ai ‘Introduction to Machine Learning’ course as an MSAN student at USF.

1. How important are our features?

It is pretty common to use model.feature_importances_ in sklearn's random forest to find the important features. Important features are those that are more closely related to the dependent variable and contribute more to its variation. We generally feed as many features as we can to a random forest model and let the algorithm give back the list of features that it found most useful for prediction. But carefully choosing the right features can make our target predictions more accurate.

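The idea of calculating feature importance by shuffling is simple, but great. (Note that sklearn's built-in feature_importances_ attribute is computed from the impurity decrease at the splits, not by shuffling; the steps below describe the shuffling approach.)

Breaking the idea down into easy steps: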

1. train a random forest model (assuming the right hyper-parameters).
2. find the prediction score of the model (call it the benchmark score).
3. find p more prediction scores, where p is the number of features, each time randomly shuffling the column of the i-th feature.
4. compare all p scores with the benchmark score. If randomly shuffling the i-th column hurts the score, our model is bad without that feature.
5. remove the features that do not hurt the benchmark score and retrain the model with the reduced subset of features.

Code to calculate feature importance from scratch:
The code below gives a sorted list of (feature, importance) pairs for all the features.
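
A minimal sketch of the shuffling steps above, assuming RMSE as the score and a fitted sklearn regressor. The names `rmse` and `feat_imp_sketch` and the synthetic data are illustrative assumptions; this is not the `feat_imp` code that produced the bulldozer output shown in this post. Importance is taken as benchmark RMSE minus shuffled RMSE, so the most negative value marks the most important feature.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def rmse(model, X, y):
    """Root mean squared error of the model's predictions on (X, y)."""
    return float(np.sqrt(np.mean((model.predict(X) - y) ** 2)))


def feat_imp_sketch(model, X, y, names, seed=0):
    """Benchmark RMSE minus RMSE after shuffling each column in turn.

    Returns (name, importance) pairs sorted so that the most negative value,
    i.e. the feature whose shuffling hurts RMSE most, comes first.
    """
    shuffler = np.random.default_rng(seed)
    benchmark = rmse(model, X, y)
    importance = {}
    for i, name in enumerate(names):
        X_shuffled = X.copy()
        # Shuffling column i breaks its link with y but keeps its distribution.
        X_shuffled[:, i] = shuffler.permutation(X_shuffled[:, i])
        importance[name] = benchmark - rmse(model, X_shuffled, y)
    return sorted(importance.items(), key=lambda kv: kv[1])


# Synthetic data (assumption): y depends strongly on "a", weakly on "b", not on "c".
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = 5.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

model = RandomForestRegressor(n_estimators=30, random_state=0).fit(X, y)
importance = feat_imp_sketch(model, X, y, names=["a", "b", "c"])
print(importance)
```

sklearn also ships a ready-made version of this idea as `sklearn.inspection.permutation_importance`, which repeats the shuffling several times and reports the mean and spread.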

Output:

importance = feat_imp(ens, X_train[cols], y_train); importance
[('YearMade', -0.21947050888595573),
('Coupler_System', -0.21318328275792894),
('ProductSize', -0.18353291714217482),
('saleYear', -0.045706193607739254),
('Enclosure', -0.041566508577359523),
('MachineID', -0.01399141076436905),
('MachineHoursCurrentMeter', -1.9246700722952426e-05)]
In the above output, shuffling YearMade (a proxy for removing it from the model) increases the prediction RMSE the most: the more negative the value, the larger the increase. So it must be the most important feature.

(The above results correspond to data taken from a Kaggle competition. Here is the link: https://www.kaggle.com/c/bluebook-for-bulldozers)

Generally, when businesses want to predict something, their end goal is either to reduce costs or to improve profits. Before making big business decisions, businesses want to estimate the risk of that decision. But when prediction results are presented without a confidence interval, rather than reducing the risk, we might inadvertently expose the business to more risk.

It is relatively easy to find the confidence level of our predictions when we use a linear model (or, in general, any model based on distributional assumptions). But for a random forest, finding a confidence interval is not as straightforward.

I guess anyone who has taken a linear regression class must have seen this image (A). To find the best linear model, we look for the model with the best bias-variance tradeoff. The image nicely illustrates the definition of bias and variance in our predictions. (Suppose the 4 images show darts thrown by 4 different people.)

If we have high bias and low variance (3rd person), we consistently hit away from the bull's eye. On the contrary, if we have high variance and low bias (2nd person), we are very inconsistent in where the dart lands. If one has to guess where the 2nd person's next dart will go, it could either hit the bull's eye or land away from it. Now, let’s suppose that catching a credit fraud in real life is analogous to hitting the bull's eye in the above example. If the credit company has a predictive model similar to the 2nd person’s dart-throwing behavior, the company might not catch fraud most of the time, even though on average the model predicts right.

The takeaway is that, rather than looking only at mean predictions, we should also check the confidence level of our point predictions.

How to do that in a random forest?

A random forest is made from multiple decision trees (as many as given by n_estimators). Each tree individually predicts for the new data, and the random forest outputs the mean prediction from those trees. The idea for the confidence level of predictions is just to see how much the predictions coming from different trees vary for the new observations. Then, to analyze further, we can look for patterns among the observations with the highest variability of predictions (something like: predictions corresponding to year 2011 have high variability).

Source code for prediction confidence based on tree variance:
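
A minimal sketch of the idea, assuming a fitted sklearn `RandomForestRegressor`, whose individual trees are exposed through its `estimators_` attribute. The helper name `tree_spread` and the synthetic data are illustrative assumptions, and the standard deviation across trees is used here as one possible measure of variability.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def tree_spread(model, X):
    """Per-observation mean and standard deviation of the individual trees' predictions."""
    # Shape (n_trees, n_observations): one row of predictions per tree.
    all_tree_preds = np.stack([tree.predict(X) for tree in model.estimators_])
    return all_tree_preds.mean(axis=0), all_tree_preds.std(axis=0)


# Synthetic training and validation data (assumption, for illustration only).
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(400, 2))
y_train = X_train[:, 0] + rng.normal(scale=0.5, size=400)
X_valid = rng.uniform(0, 10, size=(20, 2))

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)
mean_pred, std_pred = tree_spread(model, X_valid)

# The validation row where the trees disagree most is the one we are least confident about.
least_confident = int(np.argmax(std_pred))
print(least_confident, mean_pred[least_confident], std_pred[least_confident])
```

Sorting the validation rows by `std_pred` and inspecting the top few is one way to look for the kind of pattern described above.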

The output of the above code will look like the following:

Reading from this output, we can say that we are least confident about our prediction for the validation observation at index 14.

Feature importance (as in the 1st section) is useful if we want to analyze which features are important for the overall random forest model. But if we are interested in one particular observation, that is where the tree interpreter comes into play.

For example, suppose there is an RF model which predicts whether a patient X coming to the hospital has a high probability of readmission. For the sake of simplicity, let’s say we only have 3 features: the patient's blood pressure, age and sex. Now, if our model says that patient A has an 80% chance of readmission, how can we know what is special about patient A that makes our model predict he/she will be readmitted? In this case, the tree interpreter tells us the prediction path followed for that particular patient. Something like: because patient A is a 65-year-old male, our model predicts that he will be readmitted. Another patient B, whom the model also predicts to be readmitted, might be flagged because B has high blood pressure (not because of age or sex).

Basically, the tree interpreter gives a sorted list of the bias (the mean of the data at the starting node) and the individual node contributions for a given prediction.

The decision tree (depth: 3) in image (B) is based on the Boston housing price data set. It shows the breakdown of the decision path in terms of the prediction values at intermediate nodes and the features that cause the values to change. The contribution of a node is the difference between the value at that node and the value at the previous node.

Image (C) gives an example output of using the tree interpreter for Patient A. It says that being 65 years old was the highest contributor to the model predicting a higher probability of readmission than the mean.

The spreadsheet output can also be visualized using a waterfall chart (D). I made this one using the quick and easy waterfall chart from the “waterfallcharts package”.

Code for the above waterfall chart:

Just to be clear about terminology:
Value means the target value predicted by a node (just the mean of the target observations falling in that node).
Contribution is the value at the present node minus the value at the previous node (this is what gives the feature contribution for a path).
Path is the combination of all the feature splits taken by an observation in order to reach a leaf node.
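
The definitions above can be sketched for a single sklearn decision tree. This is an illustrative reconstruction, not the treeinterpreter package's code: the helper name `path_contributions` and the synthetic data are assumptions. It reads node values from `tree_.value`, follows the observation's path with `decision_path`, and credits each change in value to the feature that the previous node split on.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def path_contributions(tree, x):
    """Split one tree's prediction for a single row x into bias + per-feature contributions."""
    t = tree.tree_
    # Node ids visited by x, from the root down to its leaf.
    path = tree.decision_path(x.reshape(1, -1)).indices
    values = t.value[path, 0, 0]   # mean target of the training rows in each node
    bias = values[0]               # value at the starting (root) node
    contributions = np.zeros(x.shape[0])
    for parent, step in zip(path[:-1], np.diff(values)):
        # Value at present node minus value at previous node,
        # credited to the feature the previous node split on.
        contributions[t.feature[parent]] += step
    return bias, contributions


# Synthetic data (assumption): the target depends mostly on feature 0.
rng = np.random.default_rng(1)
X = rng.uniform(size=(200, 3))
y = 4.0 * X[:, 0] + X[:, 1]

tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
bias, contributions = path_contributions(tree, X[0])
prediction = tree.predict(X[:1])[0]
print(bias, contributions, prediction)
```

By construction, the bias plus the sum of the contributions equals the tree's prediction for that row. For a forest, the same decomposition can be averaged over the trees.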

The function from the treeinterpreter package for getting contributions from each node is pretty straightforward, and can be explored here.

Having found the most important features, the next thing we might be interested in is the direct relationship between the target variable and the features of interest. The analogue in linear regression is the model coefficients. For linear regression, coefficients are calculated in such a way that we can interpret them by saying: “what would be the change in Y with a 1 unit change in X(j), keeping all other X(i’s) constant”.

Although we have feature importances from the random forest, they only give the relative change in Y with respect to changes in the X(i’s). We cannot directly interpret them as how much change in Y is caused by a unit change in X(j), keeping all other features constant.

Luckily, we have partial dependence plots, which can be viewed as a graphical representation of linear model coefficients but can also be extended to seemingly black box models. The idea is to isolate the changes in predictions so that they come solely from a specific feature. This is different from a scatter plot of X vs. Y, as a scatter plot does not isolate the direct relationship of X vs. Y and can be affected by indirect relationships with other variables on which both X and Y depend.

The steps to make a PDP are as follows:

1. train a random forest model (let’s say F1…F4 are our features and Y is the target variable; suppose F1 is the most important feature).
2. we are interested in exploring the direct relationship of Y and F1.
3. replace every value in column F1 with F1(A), one distinct value of F1, and find new predictions for all observations. Take the mean of the predictions (call it the base value).
4. repeat step 3 for F1(B) … F1(E), i.e. for all distinct values of feature F1.
5. the PDP’s X-axis has the distinct values of F1, and its Y-axis is the change in mean prediction for that F1 value from the base value.
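
The steps above can be sketched as follows, assuming a fitted sklearn regressor and a feature with only a handful of distinct values. The helper name `partial_dependence_sketch` and the synthetic data are illustrative assumptions, and the smallest distinct value is used as the base value.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def partial_dependence_sketch(model, X, col):
    """Change in mean prediction, relative to the base value, as column `col`
    is forced to each of its distinct values in turn."""
    values = np.unique(X[:, col])
    mean_preds = []
    for v in values:
        X_forced = X.copy()
        X_forced[:, col] = v  # every row gets the same value of the feature
        mean_preds.append(model.predict(X_forced).mean())
    mean_preds = np.array(mean_preds)
    # Step 5: report the change from the base value (the first distinct value).
    return values, mean_preds - mean_preds[0]


# Synthetic data (assumption): F1 takes five distinct values and drives the target.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.integers(0, 5, size=400).astype(float),  # F1
    rng.normal(size=400),                        # F2
])
y = 2.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=400)

model = RandomForestRegressor(n_estimators=30, random_state=0).fit(X, y)
values, pdp = partial_dependence_sketch(model, X, col=0)
print(dict(zip(values, np.round(pdp, 2))))
```

Plotting `values` against `pdp` gives the curve. For continuous features with many distinct values, a coarser grid is usually used instead; sklearn offers this through `sklearn.inspection.partial_dependence`.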

Below (E) is what a partial dependence plot looks like (made on the Kaggle bulldozer competition data). It shows the relationship of YearMade with SalePrice.

And below (F) is what a line plot of SalePrice vs. YearMade looks like. We can see that a scatter/line plot might not capture the direct impact of YearMade on SalePrice the way the PDP does.

Source of the above 2 plots: the rf interpretation notebook of the fast.ai ml1 course.

In many cases, random forests can beat linear models for prediction. An objection frequently raised against random forests is that their results are harder to interpret than those of linear models. But one can address this misconceived objection using the interpretation methods discussed above.

Bio: I am currently studying Data Science (Analytics) at the University of San Francisco and interning at Manifold.ai. Previously, I worked as a Data Scientist at Capgemini and as a Sr. Business Analyst at Altisource.

Linkedin: https://www.linkedin.com/in/prince-grover-0562a946/