Shapley Values

Article Outline

  • Why SHAP (SHapley Additive exPlanations)
  • About Dataset
  • Loading Dataset
  • Model Fitting
  • Shapley values estimation
  • Variable Importance plot
  • Summary plot
  • Dependence Plot
  • Force Plot

Why SHAP (SHapley Additive exPlanations)?

A very common problem with machine learning models is interpretability. The majority of algorithms (tree-based ones specifically) provide an aggregate, global feature importance, but this lacks interpretability because it does not indicate the direction of a feature's impact.

Many methods are available for computing variable importance. The drop-column method is one of the simplest techniques, but it is computationally expensive because the number of models to train grows with the number of features. Another approach is the permutation method, where the values of a particular feature are permuted to measure the resulting change in model accuracy. It has an advantage over the drop-column method (fewer models to train), but it fails when correlated features exist in the training dataset. For example, in medical data, if you use systolic and diastolic blood pressure (which are correlated) to train a model, the permutation method is not able to distinguish their importance. To cope with this problem, more advanced methods were introduced. One of them is SHAP (SHapley Additive exPlanations), proposed by Lundberg and Lee [1], which is reliable, fast and computationally inexpensive.
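For reference, scikit-learn ships the permutation method as permutation_importance( ). A minimal sketch, assuming a fitted estimator model and held-out data X_test and Y_test (which we build later in this article):

from sklearn.inspection import permutation_importance

# Shuffle each feature in turn and record the average drop in test-set score
result = permutation_importance(model, X_test, Y_test, n_repeats = 10, random_state = 0)
print(result.importances_mean)   # mean score drop per feature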

Advantages

  • SHAP and Shapley values are based on the foundations of game theory. Shapley values guarantee that the prediction is fairly distributed across the different features (variables).
  • SHAP can provide global interpretation by computing the Shapley values for a whole dataset and combining them.
  • SHAP connects with other interpretability techniques, such as LIME.
  • SHAP has a lightning-fast explainer for tree-based models.

About Dataset

I have a Transportation Engineering (Civil Engineering domain) background. During my civil engineering Diploma, B.Tech and M.Tech, I performed the concrete characteristic compressive strength test in a laboratory setting. Thus, I thought it would be interesting to model and interpret the concrete’s compressive strength using a tree-based ensemble (Random Forest).

Hence, in this article, we are going to use the concrete dataset [2] obtained from the UCI Machine Learning Repository.

The dataset includes the following variables, which are the ingredients for making durable, high-strength concrete.

I1: Cement (C1): kg in a m3 mixture
I2: Blast Furnace Slag (C2): kg in a m3 mixture
I3: Fly Ash (C3): kg in a m3 mixture
I4: Water (C4): kg in a m3 mixture
I5: Superplasticizer (C5): kg in a m3 mixture
I6: Coarse Aggregate (C6): kg in a m3 mixture
I7: Fine Aggregate (C7): kg in a m3 mixture
I8: Age: Day (1~365)
O1: Concrete compressive strength: MPa

where I: input; O: output; C: component; m3: cubic meter; and MPa: megapascal.

Before proceeding to the data analysis part, let’s get familiar with the different inputs of the concrete dataset.

Concrete

Concrete comprises three basic components: water, aggregate (rock, sand or gravel) and cement. Cement acts as a binding agent when mixed with water and aggregates.

Compressive Strength

Compressive strength is one of the vital parameters that determine the performance of concrete as a construction material. A concrete mix is designed to achieve the required performance and durability for a given construction work/project. The compressive strength of concrete is determined in laboratories in order to maintain the desired quality of concrete during casting. It is calculated by dividing the failure load by the area over which the load is applied, usually after a 28-day (I8: Age) curing period, though researchers also report strength after 7, 14 and 21 days of curing. The strength of concrete is achieved by controlling the proportions of cement (C1), fine (C7) and coarse (C6) aggregates, water, and various admixtures. The characteristic compressive strength of concrete, fc/fck, is usually reported in MPa (O1). For normal construction, the characteristic compressive strength can vary from 10 to 60 MPa, while for certain structures the requirement can go beyond 600 MPa.
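As a quick worked example with hypothetical numbers: a standard 150 mm cube that fails under a load of 675 kN has a compressive strength of 675,000 N / (150 mm × 150 mm) = 30 MPa.

# Compressive strength = failure load / loaded area (hypothetical numbers)
failure_load_n = 675_000                    # failure load in newtons
area_mm2 = 150 * 150                        # loaded face of a standard 150 mm cube
strength_mpa = failure_load_n / area_mm2    # 1 N/mm2 equals 1 MPa
print(strength_mpa)                         # 30.0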

Admixture

Nowadays, researchers use different admixtures to obtain desired properties; fly ash (C3) is one of them. Fly ash acts as an admixture in concrete mixes: it is a pozzolanic substance containing aluminous and siliceous material that, when mixed with lime and water, forms a compound similar to cement. It is added to concrete to improve workability and to reduce permeability and bleeding.

Similarly, ground granulated blast furnace slag (C2), a mineral admixture, is added to concrete to improve properties such as workability, strength and durability.

Superplasticizers

Superplasticizers (high-range water reducers) are used in concrete mixes to make high-strength, durable concrete. Superplasticizers (C5) are water-soluble organic substances that reduce the amount of water required to achieve a given consistency of concrete, reduce the water-cement ratio, reduce cement content and increase slump. Using superplasticizers reduces the water requirement by up to 30% without loss of workability.

Aim

The aim of this article is to understand the variables of a black-box model and their contributions. Here, we will mainly focus on the Shapley value estimation process using the shap Python library and on how we can use it for better model interpretation.

In this article, we will train a concrete compressive strength prediction model and interpret the contribution of its variables using Shapley values.

Loading relevant libraries

The very first step is to load the relevant Python libraries.

import pandas as pd               # Data manipulation
import numpy as np                # Array manipulation
import matplotlib.pyplot as plt   # Plotting

# Sklearn for data splitting and modeling
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

Loading dataset

The next step is to load the data from an Excel sheet stored locally and perform basic exploratory data analysis.

concrete = pd.read_excel("Concrete.xlsx")
concrete.head()
First five rows
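Beyond .head(), a couple of quick checks round out the basic exploratory pass:

concrete.shape       # number of rows and columns
concrete.info()      # column dtypes and missing values
concrete.describe()  # summary statistics for each column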

Let’s view the different column names using the .columns attribute.

concrete.columns
Column names

Let’s assign the X variables (independent variables) by dropping “Comp_str”, assign the Y variable (outcome variable: Comp_str), and save the column names in X_featurenames.

X = concrete.drop("Comp_str", axis = 1)
Y = concrete['Comp_str']
X_featurenames = X.columns

Next, we will split the data into an 80% train and 20% test dataset using the train_test_split( ) function from the sklearn library.

# Split the data into train and test data:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 44)

The next step is to fit a random forest regressor using the RandomForestRegressor( ) function with parameters such as maximum tree depth (max_depth = 20) and number of trees to train (n_estimators = 10000).

# Build the model with the random forest regression algorithm:
model = RandomForestRegressor(max_depth = 20, random_state = 0, n_estimators = 10000)
model.fit(X_train, Y_train)
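Before interpreting the model, it is worth a quick sanity check of its predictive quality. The snippet below reports the R² score on the held-out test data:

# Quick sanity check: R² of the fitted model on the held-out test data
print("Test R²:", model.score(X_test, Y_test))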

The advantage of tree-based algorithms is that they provide global variable importance, which means you can rank variables based on their contribution to the model. Here, we extract the feature importances using .feature_importances_ and supply the column names (X_featurenames). Then we take the top five contributing variables and plot them using a bar plot.

Here, you can observe that the age of the concrete has the highest influence in the model, followed by the cement content. The problem with global importance is that it gives an averaged overview of how variables contribute to the model, but it lacks the direction of impact, i.e. whether a variable has a positive or negative influence.

feat_importances = pd.Series(model.feature_importances_, index = X_featurenames)
feat_importances.nlargest(5).plot(kind = 'barh')
Global feature importance

From the black-box model alone, it is nearly impossible to get a feeling for its inner workings. This brings us to a question of trust: do you trust that a certain prediction from the model is correct? Or do you even trust that the model is making sound predictions?

Here, we can utilize advanced methods such as SHAP.

Summary Plot

In order to understand variable importance along with the direction of impact, one can draw a summary plot using the shap Python library. The plot’s x-axis shows the SHAP values (negative to positive) and the y-axis lists the features (variables). The colour encodes the feature value: red indicates a high value of the feature and blue indicates a low value.

Steps:

  1. Create a tree explainer using shap.TreeExplainer( ), supplying the trained model
  2. Estimate the Shapley values on the test dataset using ex.shap_values( )
  3. Generate a summary plot using the shap.summary_plot( ) method

import shap

ex = shap.TreeExplainer(model)
shap_values = ex.shap_values(X_test)
shap.summary_plot(shap_values, X_test)
Summary plot

Interpretation: The summary plot shows that the top three influential variables are the age of the concrete, the cement content and the water content, which determine the characteristic compressive strength of concrete. The Age variable has a wide range of values and a positive impact on compressive strength: as concrete ages, its characteristic compressive strength increases. Cement also has a positive influence. It is worth noting that water content has a negative impact on compressive strength: the more water we add, the lower the strength. Superplasticizer and blast furnace slag have a positive impact on compressive strength. Similarly, adding more fine and coarse aggregate reduces the compressive strength.
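If you prefer a simple ranked bar chart of mean absolute SHAP values, the same data can be drawn with plot_type = "bar":

# Bar variant: mean(|SHAP value|) per feature, a SHAP-based importance ranking
shap.summary_plot(shap_values, X_test, plot_type = "bar")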

Dependence Plot

You can also draw a dependence plot (marginal influence) to understand whether a feature (variable) has a linear or non-linear relationship with the dependent (outcome) variable. In other words, the dependence plot shows the marginal effect that one or two features have on the predicted outcome of a machine learning model.

You can use the shap.dependence_plot( ) method and pass the feature whose interaction you want to plot. The function automatically picks another feature that your selected variable interacts with most strongly.

Here, we pass the Cement feature, whose interaction we want to observe. The plot illustrates that the Cement feature has a positive, roughly linear relationship with compressive strength (the outcome variable) and that cement interacts with superplasticizer.

shap.dependence_plot("Cement", shap_values, X_test)
Dependence plot for cement content
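By default, the colouring feature is picked automatically; you can also pin it yourself with the interaction_index argument (assuming your superplasticizer column is named "Superplasticizer" — adjust to the actual column name in your file):

# Pin the colouring (interaction) feature explicitly
shap.dependence_plot("Cement", shap_values, X_test, interaction_index = "Superplasticizer")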

Similarly, water has a negative and almost linear relationship.

shap.dependence_plot("Water", shap_values, X_test)
Dependence plot for water content

The fine aggregate has a negative linear relationship with the outcome variable.

shap.dependence_plot("Fine_aggregate", shap_values, X_test)
Dependence plot for fine aggregate

Local Interpretability

Shapley values can also be computed for individual observations to understand the impact of different features. The resulting plot provides explainability for a single model prediction.

In order to generate the force plot, first call shap.initjs() if you are using a Jupyter notebook.

Steps:

  1. Create a model explainer using shap.KernelExplainer( )
  2. Compute Shapley values for a particular observation; here, I supply the first observation (index 0) from the test dataset
  3. Generate a force plot using the shap.force_plot( ) method

shap.initjs()
ex = shap.KernelExplainer(model.predict, X_train)
shap_values = ex.shap_values(X_test.iloc[0,:])
shap.force_plot(ex.expected_value, shap_values, X_test.iloc[0,:])
Force plot

Interpretation: The plot provides

  1. The model output value: 21.99
  2. The base value: the value that would be predicted if we did not know any features for the current output (base value: 36.04)
  3. The impact of each feature on the output, shown along the x-axis
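If you are curious where the base value in point 2 comes from: for KernelExplainer it is (approximately) the mean model prediction over the background data passed to the explainer, which you can check directly:

# The base value is the average prediction over the background dataset
print(ex.expected_value)                  # base value reported in the force plot
print(model.predict(X_train).mean())      # should match closely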

Here we can see red and blue arrows associated with each feature. Each arrow indicates:

  • How much a feature impacts the model: the bigger the arrow, the bigger the impact.
  • How a feature impacts the model: a red arrow pushes the outcome to the right (increases the model output value), while a blue arrow pushes it to the left (decreases the model output value).

For observation zero (the first test observation), we can see that the cement content has the highest impact, pushing the outcome to the right, while the Age variable decreases the outcome, pushing it to the left.
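Note that KernelExplainer is model-agnostic but slow, since it re-evaluates the model many times. Because our model is tree-based, the TreeExplainer from the summary-plot section should produce the same kind of force plot much faster; a minimal sketch:

# Faster, tree-specific alternative for the same local explanation
tree_ex = shap.TreeExplainer(model)
tree_shap_values = tree_ex.shap_values(X_test.iloc[0, :])
shap.force_plot(tree_ex.expected_value, tree_shap_values, X_test.iloc[0, :])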

I hope you learned something new from this blog.


References

[1] S. Lundberg and S.-I. Lee, “A Unified Approach to Interpreting Model Predictions,” Advances in Neural Information Processing Systems (2017), pp. 4766–4775. https://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions

[2] I-Cheng Yeh, “Modeling of the strength of high-performance concrete using artificial neural networks,” Cement and Concrete Research, Vol. 28, No. 12, pp. 1797–1808 (1998).