Article outline
Introduction
Data Background
Aim of the article
Exploratory analysis
Training a Random Forest Model
Global Importance
Local Importance
Introduction
In supervised machine learning, two types of task are commonly performed: regression (predicting continuous values) and classification (predicting discrete values). Black-box algorithms such as SVMs, random forests, boosted trees and neural networks often provide better prediction accuracy than conventional algorithms. The problem starts when we want to understand the impact (magnitude and direction) of the different variables. In this article, I present an example of a random forest binary classification model and its interpretation at the global level, using variable importance, and at the local level, using Local Interpretable Model-agnostic Explanations (LIME).
Data Background
In this example, we are going to use the Pima Indian Diabetes 2 data set obtained from the UCI Repository of machine learning databases (Newman et al. 1998).
This data set is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the data set is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the data set. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.
The Pima Indian Diabetes 2 data set is the refined version (all missing values were assigned as NA) of the Pima Indian diabetes data. The data set contains the following independent and dependent variables.
Independent variables (symbol: I)
- I1: pregnant: Number of times pregnant
- I2: glucose: Plasma glucose concentration (glucose tolerance test)
- I3: pressure: Diastolic blood pressure (mm Hg)
- I4: triceps: Triceps skin fold thickness (mm)
- I5: insulin: 2-Hour serum insulin (mu U/ml)
- I6: mass: Body mass index (weight in kg/(height in m)²)
- I7: pedigree: Diabetes pedigree function
- I8: age: Age (years)
Dependent Variable (symbol: D)
- D1: diabetes: diabetes case (pos/neg)
Aim of the Modelling
- fitting a random forest ensemble binary classification model that accurately predicts whether or not the patients in the data set have diabetes
- understanding the global influence of variables on diabetes prediction
- understanding the influence of variables on the local level for the individual patient
Loading Libraries
The very first step will be to load relevant libraries.
import pandas as pd # data manipulation
import numpy as np # number manipulation/crunching
import matplotlib.pyplot as plt # plotting
# Classification report
from sklearn.metrics import classification_report
# Train Test split
from sklearn.model_selection import train_test_split
# Random forest classifier
from sklearn.ensemble import RandomForestClassifier
Reading dataset
After loading the data, the next essential step is to perform an exploratory data analysis, which helps you become familiar with the data. Use the head( ) method to view the top five rows of the data.
diabetes = pd.read_csv("diabetes.csv")
diabetes.head()
The table below shows the first five rows. The full diabetes data set includes 392 observations and 9 columns/variables. The independent variables have int64 and float64 data types, whereas the dependent/response variable (diabetes) holds strings (neg/pos), which pandas stores as the object data type.
Let’s print the column names
diabetes.columns
Mapping output variable into 0 and 1
Before proceeding to model fitting, it is often essential to ensure that the data types are consistent with the library/package that you are going to use. In the diabetes data set, the dependent variable (diabetes) consists of strings/characters, i.e., neg/pos, which need to be converted into integers by mapping neg: 0 and pos: 1 using the .map( ) method.
diabetes["diabetes"] = diabetes["diabetes"].map({"neg":0, "pos":1})
diabetes["diabetes"].value_counts()
Now you can see that the dependent variable “diabetes” has been converted from the object type to int64.
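A quick sanity check after the mapping can be sketched as follows. The small frame below is made-up toy data standing in for diabetes.csv, and the names toy, mapped and n_unmapped are only illustrative. The point is that .map( ) returns NaN for any label missing from the dictionary, so it is worth confirming that nothing was left unmapped before relying on the integer type.

```python
import pandas as pd

# Toy stand-in for the response column (made-up values, not the real data)
toy = pd.DataFrame({"diabetes": ["neg", "pos", "neg", "pos", "neg"]})

mapped = toy["diabetes"].map({"neg": 0, "pos": 1})

# Labels absent from the dictionary would become NaN and force a float column
n_unmapped = int(mapped.isna().sum())
print("unmapped labels:", n_unmapped)
print("dtype after mapping:", mapped.dtype)
```

If n_unmapped is greater than zero, the raw column probably contains a spelling variant (for example a trailing space) that needs cleaning first.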
The next step is to get basic summary statistics of the data using the .describe( ) method, which computes the count, mean, standard deviation, minimum, maximum and percentile (25th, 50th and 75th) values. This helps you detect anomalies in your data set, such as variables with high variance or extremely skewed distributions.
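As a minimal sketch of this step, the snippet below calls .describe( ) on a toy frame. The numbers are invented for illustration and are not taken from the diabetes data; on the real data you would call diabetes.describe( ) directly. The extra .skew( ) call is one possible way to flag long-tailed columns.

```python
import pandas as pd

# Invented values standing in for two columns of the data set
toy = pd.DataFrame({
    "glucose": [90, 110, 125, 140, 160, 200],
    "insulin": [40, 60, 80, 120, 200, 600],
})

# count, mean, std, min, 25%, 50%, 75%, max for every numeric column
summary = toy.describe()
print(summary.round(2))

# Skewness far from 0 points to a long tail (here: the invented insulin column)
skewness = toy.skew()
print(skewness.round(2))
```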
Training RF Model
The next step is to split the diabetes data set into train and test sets using train_test_split from the sklearn.model_selection module, and then to fit a random forest model using the sklearn package/library.
Train and Test Split
The whole data set is generally split into 80% train and 20% test data (a common rule of thumb). The 80% train data is used for model training, while the remaining 20% is used to check how well the model generalizes and for local model interpretation.
Y = diabetes['diabetes']
X = diabetes[['pregnant', 'glucose', 'pressure', 'triceps', 'insulin', 'mass',
'pedigree', 'age']]
X_featurenames = X.columns
# Split the data into train and test data:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2)
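Note that the split above draws a different random partition on every run, so the rows that end up in the test set will change between runs. One possible variation, sketched below on random toy data, passes random_state to fix the shuffle and stratify to keep the pos/neg ratio in both parts. The seed 42, the toy frame and the helper name split are assumptions for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for X and Y (random values, not the real data)
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(100, 3)), columns=["glucose", "mass", "age"])
Y_toy = pd.Series([0] * 70 + [1] * 30, name="diabetes")

def split(seed):
    # random_state fixes the shuffle; stratify keeps the class ratio in both parts
    return train_test_split(X_toy, Y_toy, test_size=0.2,
                            random_state=seed, stratify=Y_toy)

X_tr, X_te, Y_tr, Y_te = split(42)
X_tr2, X_te2, _, _ = split(42)

print("same test rows on both runs:", X_te.index.equals(X_te2.index))
print("share of positives in test part:", Y_te.mean())
```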
In order to fit a random forest model, you first need to install the scikit-learn (sklearn) package/library and then import RandomForestClassifier from sklearn.ensemble. Here, I have fitted 10,000 trees with a maximum depth of 20.
# Build the model with the random forest classification algorithm:
model = RandomForestClassifier(max_depth = 20, random_state = 0, n_estimators = 10000)
model.fit(X_train, Y_train)
Classification Report
Let’s predict the test data class labels using predict( ) and generate a classification report. The classification report revealed that the micro average of the F1 score is about 0.71. For a single-label problem like this one, the micro-averaged F1 score equals the overall accuracy, so the trained model classifies about 71% of the test observations correctly.
y_pred = model.predict(X_test)
print(classification_report(Y_test, y_pred, target_names=["Diabetes -ve", "Diabetes +ve"]))
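For a single-label problem, the micro-averaged F1 score pools true positives, false positives and false negatives over both classes, which makes it equal to plain accuracy. The sketch below checks this on ten made-up labels and also prints the F1 score of the +ve class alone, which is usually the more informative number when positives are the minority. In recent scikit-learn releases the report labels this row accuracy rather than micro avg, as far as I am aware.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels for ten test patients (0 = Diabetes -ve, 1 = Diabetes +ve)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

acc = accuracy_score(y_true, y_hat)
# Micro average: counts are pooled over both classes before computing F1
f1_micro = f1_score(y_true, y_hat, average="micro")
# F1 for the +ve class only
f1_pos = f1_score(y_true, y_hat, pos_label=1)

print("accuracy:", acc)
print("micro F1:", f1_micro)
print("F1 (+ve class):", round(f1_pos, 3))
```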
Feature Importance Plot
An advantage of tree-based algorithms is that they provide global variable importance, which means you can rank the variables based on their contribution to the model. Here, you can observe that the glucose variable has the highest influence in the model, followed by insulin. The problem with global importance is that it only gives an average overview of how the variables contribute to the model; it does not tell you why a particular patient received a particular prediction.
feat_importances = pd.Series(model.feature_importances_, index = X_featurenames)
feat_importances.nlargest(5).plot(kind = 'barh')
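Because the global importance is an average over all trees, it can help to look at how much it varies from tree to tree. The sketch below does this on synthetic data from make_classification; the column names x0 to x4 are placeholders and the forest is kept small (200 trees) so that it runs quickly. It is an illustration of the idea, not a re-analysis of the diabetes data.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; column names are placeholders
X_arr, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                               n_redundant=0, random_state=0)
cols = [f"x{i}" for i in range(5)]
X_toy = pd.DataFrame(X_arr, columns=cols)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_toy, y)

# Impurity-based importances: averaged over trees and normalized to sum to 1
mean_imp = pd.Series(rf.feature_importances_, index=cols)
# Spread of each feature's importance across the individual trees
std_imp = pd.Series(
    np.std([tree.feature_importances_ for tree in rf.estimators_], axis=0),
    index=cols)

ranking = pd.DataFrame({"mean": mean_imp, "std": std_imp})
ranking = ranking.sort_values("mean", ascending=False)
print(ranking.round(3))
```

A large spread relative to the mean suggests that the ranking of that variable is not very stable.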
With a black-box model, it is nearly impossible to get a feeling for its inner workings. This brings us to a question of trust: do you trust that a certain prediction from the model is correct? Or do you even trust that the model is making sound predictions?
Creating a model explainer
LIME is short for Local Interpretable Model-Agnostic Explanations. Local refers to local fidelity, i.e., we want the explanation to really reflect the behaviour of the classifier “around” the instance being predicted. This explanation is useless unless it is interpretable, that is, unless a human can make sense of it. LIME is able to explain any model without needing to ‘peek’ into it, so it is model-agnostic.
Behind the workings of LIME lies the assumption that every complex model is linear on a local scale. It asserts that it is possible to fit a simple model around a single observation that will mimic how the global model behaves at that locality (Pedersen and Benesty, 2016).
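The local-linearity idea can be sketched in a few lines: perturb the observation, ask the black-box model for predictions on the perturbed points, weight those points by their closeness to the observation, and fit a weighted linear model. The code below is a simplified illustration on synthetic data, not the lime library's implementation; the helper name local_surrogate, the Gaussian perturbation, the kernel width and the Ridge penalty are all assumptions made for this sketch.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

# Synthetic stand-in for the black-box model and its training data
X_arr, y = make_classification(n_samples=400, n_features=4, n_informative=3,
                               n_redundant=0, random_state=0)
black_box = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_arr, y)
mu, sd = X_arr.mean(axis=0), X_arr.std(axis=0)

def local_surrogate(x, predict_proba, n_samples=2000, seed=0):
    """Fit a proximity-weighted linear model around the single row x."""
    rng = np.random.default_rng(seed)
    # 1. Perturb: sample points around x, scaled by each feature's spread
    Z = x + rng.normal(size=(n_samples, x.size)) * sd
    Z[0] = x  # keep the observation itself in the sample
    # 2. Query the black box for the probability of the positive class
    p = predict_proba(Z)[:, 1]
    # 3. Weight samples by closeness to x (exponential kernel, assumed width)
    d = np.linalg.norm((Z - x) / sd, axis=1)
    width = 0.75 * np.sqrt(x.size)
    w = np.exp(-(d ** 2) / width ** 2)
    # 4. Fit the simple, interpretable model on the weighted sample
    lin = Ridge(alpha=1.0).fit((Z - mu) / sd, p, sample_weight=w)
    local_pred = lin.predict(((x - mu) / sd).reshape(1, -1))[0]
    return lin.intercept_, lin.coef_, local_pred

x0 = X_arr[2]
intercept, coefs, local_pred = local_surrogate(x0, black_box.predict_proba)
model_pred = black_box.predict_proba(x0.reshape(1, -1))[0, 1]

print("surrogate intercept:", round(float(intercept), 3))
print("surrogate coefficients:", np.round(coefs, 3))
print("surrogate prediction:", round(float(local_pred), 3))
print("black-box prediction:", round(float(model_pred), 3))
```

The sign of each coefficient gives the direction of a variable's local effect and its size gives the magnitude. Comparing the surrogate prediction with the black-box prediction indicates how faithful the local approximation is.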
LIME explainer fitting steps
- import the lime library
- import lime.lime_tabular
- Create an explainer using the LimeTabularExplainer( ) class
- Supply the X_train values, feature names and class names as ‘Diabetes -ve’, ‘Diabetes +ve’
- Use lasso_path for feature selection
- Bin continuous variables into discrete values (discretize_continuous = True) based on “quartile”
- Select mode as classification
import lime
import lime.lime_tabular
explainer = lime.lime_tabular.LimeTabularExplainer(
    X_train.values,
    feature_names = X_featurenames,
    class_names = ['Diabetes -ve', 'Diabetes +ve'],
    feature_selection = "lasso_path",
    discretize_continuous = True,
    discretizer = "quartile",
    verbose = True,
    mode = 'classification')
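To see what quartile discretization means in practice, the sketch below bins a single variable at the 25th, 50th and 75th percentiles of the training values and turns a reading into a readable rule. This is why explanations are phrased as rules such as glucose > some cut point rather than as raw values. The readings and the helper quartile_label are made up for illustration, and the code shows the idea only, not the lime library's internal routine.

```python
import numpy as np

# Made-up glucose readings standing in for the training column
glucose_train = np.array([78, 85, 92, 99, 105, 112, 118, 126, 135, 142, 158, 181])

# Quartile cut points computed from the training data
q1, q2, q3 = np.percentile(glucose_train, [25, 50, 75])

def quartile_label(value, name="glucose"):
    """Return the quartile bin a value falls into, written as a readable rule."""
    if value <= q1:
        return f"{name} <= {q1:.2f}"
    if value <= q2:
        return f"{q1:.2f} < {name} <= {q2:.2f}"
    if value <= q3:
        return f"{q2:.2f} < {name} <= {q3:.2f}"
    return f"{name} > {q3:.2f}"

print(quartile_label(160))
print(quartile_label(80))
```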
For a local-level explanation, let’s pick an observation from the test data that is diabetes +ve. Let’s select the 3rd observation (index 254). Here are the first 5 observations from the X_test data set, including the 3rd observation (index number 254).
X_test.iloc[0:5]
Let’s observe the output variable. You can observe the 3rd observation (index 254) has a value of 1 which indicates it is diabetes +ve.
Y_test.iloc[0:5]
Let’s see whether LIME is able to identify which variables contribute to +ve diabetes, and what the magnitude and direction of their impact is, for observation 3 (index number 254).
Explain an observation
For model explanation, one needs to supply the observation and the model predicted probabilities.
The output shows that the intercept of the local LIME model is 0.245 and its prediction is 0.613 (Prediction_local), while the original random forest model predicts 0.589. Now we can plot the explaining variables to show their contributions. In the plot, the green bars on the right show support for +ve diabetes, while the red bars on the left count against it. The rule glucose > 142 shows the strongest support for +ve diabetes for the selected observation. In other words, for observation 3 in the test data set, having glucose > 142 was the main contributor to the +ve diabetes prediction.
# explain_instance expects a 1-D NumPy array, so pass the row's values
exp = explainer.explain_instance(X_test.iloc[2].values, model.predict_proba)
exp.as_pyplot_figure()
Similarly, you can plot a detailed explanation using the show_in_notebook( ) function.
# explain_instance expects a 1-D NumPy array, so pass the row's values
exp = explainer.explain_instance(X_test.iloc[2].values, model.predict_proba)
exp.show_in_notebook(show_table = True, show_all = False)
In summary, black-box models no longer have to remain black boxes. Researchers have proposed plenty of algorithms for explaining them, such as LIME and Shapley values (SHAP). The explanation mechanism shown above can be used with all major classification and regression algorithms, including deep neural networks.