
Introduction

In the supervised machine learning world, two types of algorithmic tasks are commonly performed: regression (predicting continuous values) and classification (predicting discrete values). In this blog, I present an example of a binary classification algorithm called “Binary Logistic Regression”, which comes under the Binomial family with a logit link function. Binary logistic regression is used for predicting binary classes, for example in cases where you want to predict yes/no, win/loss, negative/positive, True/False, and so on. There is quite a bit of difference between training/fitting a model for production and for research publication. This blog will guide you through a research-oriented practical overview of modelling and interpretation, i.e. how one can model a binary logistic regression and interpret it for publishing in a journal/article.

Article Outline

  • Data Background
  • Aim of the modelling
  • Data Loading
  • Basic Exploratory Analysis
  • Data Preparation
  • Model Fitting/Training
  • Interpretation of Model Summary
  • Model Evaluation on Test data Set
  • References

Data Background

In this example, we are going to use the Pima Indian Diabetes 2 data set obtained from the UCI Repository of machine learning databases (Newman et al. 1998).

This data set is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the data set is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the data set. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

The Pima Indian Diabetes 2 data set is the refined version (all NA or missing values were removed) of the Pima Indian diabetes data. The data set contains the following independent and dependent variables.

Independent variables (symbol: I)

  • I1: pregnant: Number of times pregnant
  • I2: glucose: Plasma glucose concentration (glucose tolerance test)
  • I3: pressure: Diastolic blood pressure (mm Hg)
  • I4: triceps: Triceps skin fold thickness (mm)
  • I5: insulin: 2-Hour serum insulin (mu U/ml)
  • I6: mass: Body mass index (weight in kg/(height in m)²)
  • I7: pedigree: Diabetes pedigree function
  • I8: age: Age (years)

Dependent Variable (symbol: D)

  • D1: diabetes: diabetes case (pos/neg)

Aim of the Modelling

The aim of this blog is to fit a binary logistic regression machine learning model that accurately predicts whether or not the patients in the data set have diabetes, and then to understand the influence of the significant factors that truly affect the outcome. Finally, we test the trained model’s generalization strength (model evaluation) on the unseen test data set.

Loading Libraries and Data Set

Step 1: The first step is to load the relevant libraries, such as pandas (data loading and manipulation), and matplotlib and seaborn (plotting).


Step 2: The next step is to read the data from your local storage using the pandas read_csv( ) function and save it in a variable called “diabetes”.

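Together, these two steps might look like the following sketch; the file name “diabetes.csv” and its location are assumptions about your local setup:

    # Load the relevant libraries
    import pandas as pd                # data loading and manipulation
    import matplotlib.pyplot as plt    # plotting
    import seaborn as sns              # statistical plotting

    # Read the data from local storage into a variable called "diabetes"
    diabetes = pd.read_csv("diabetes.csv")  # file name/path is an assumption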

Exploratory Data Analysis

Step 1: After data loading, the next essential step is to perform an exploratory data analysis that helps in data familiarization. Use the head( ) function to view the top five rows of the data.

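In code, this is a one-liner on the loaded DataFrame:

    # View the top five rows of the data set
    diabetes.head()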

Step 2: It is often essential to know about the column data types and whether any data is missing. The .info( ) method helps in identifying data types and the presence of missing values.

For the diabetes data set, the output shows 392 observations and 9 columns/variables. The independent variables are of int64 and float64 data types, whereas the dependent/response variable (diabetes) is a string (neg/pos), also known as an object type.

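A minimal sketch of the check:

    # Column names, data types, non-null counts and memory usage
    diabetes.info()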

Step 3: We can initially fit a logistic regression line using seaborn’s regplot( ) function to visualize how the probability of having diabetes changes with the pedigree values. Here, “pedigree” is plotted on the x-axis and “diabetes” on the y-axis. In a similar fashion, we can check the logistic regression plot with other variables. This type of plot is only possible when fitting a logistic regression with a single independent variable. The plot gives you an intuition of how the logistic model fits an ‘S’-shaped curve and how the predicted probability moves from 0 to 1 across the observed values. In the upcoming model fitting, we will train/fit a multiple logistic regression model, which includes multiple independent variables.

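A sketch of such a plot with seaborn, assuming the diabetes outcome has already been encoded as 0/1 (that mapping is shown in the next section):

    # Univariate logistic fit: probability of diabetes vs. pedigree
    sns.regplot(x="pedigree", y="diabetes", data=diabetes, logistic=True)
    plt.show()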

Data Preparation

Before proceeding to model fitting, it is often essential to ensure that the data types are consistent with the library/package that you are going to use. In the diabetes data set, the dependent variable (diabetes) consists of strings/characters, i.e. neg/pos, which need to be converted into integers by mapping neg: 0 and pos: 1 using the .map( ) method.

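A sketch of the recoding step:

    # Recode the outcome: neg -> 0, pos -> 1
    diabetes["diabetes"] = diabetes["diabetes"].map({"neg": 0, "pos": 1})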

Now the dependent variable “diabetes” has been converted from object to int64 type.


The next step is to gain knowledge about basic data summary statistics using the .describe( ) method, which computes the count, mean, standard deviation, minimum, maximum and percentile (25th, 50th and 75th) values. This helps you to detect anomalies in your data set, such as variables with high variance or extremely skewed data.

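For example:

    # Count, mean, standard deviation, min, max and 25th/50th/75th percentiles
    diabetes.describe()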

Model Fitting (Binary Logistic Regression)

The next step is to split the diabetes data set into train and test sets using train_test_split from the sklearn.model_selection module, and then to fit a logistic regression model using the statsmodels package/library.

Train and Test Split

The whole data set is generally split into an 80% train and 20% test set (a general rule of thumb). The 80% train data is used for model training, while the remaining 20% is used to check how the model generalizes on an unseen data set.

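A sketch of the split; the random_state value is an arbitrary choice to make the split reproducible:

    from sklearn.model_selection import train_test_split

    # Hold out 20% of the rows as an unseen test set
    train, test = train_test_split(diabetes, test_size=0.2, random_state=42)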

Fitting Logistic Regression

In order to fit a logistic regression model, first you need to install the statsmodels package/library, and then import statsmodels.api as sm and the logit function from statsmodels.formula.api.

Here, we are going to fit the model using the following formula notation:

formula = ('dep_variable ~ ind_variable_1 + ind_variable_2 + ... and so on')

The model is fitted using the logit( ) function; the same can be achieved with glm( ). Here, logit( ) is used because it provides additional model fit statistics such as the pseudo R-squared value.

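A sketch of the model fitting, spelling out the formula with all eight predictors (the variable names follow the Data Background section):

    import statsmodels.api as sm
    from statsmodels.formula.api import logit

    # Dependent variable ~ all independent variables
    formula = ("diabetes ~ pregnant + glucose + pressure + triceps + "
               "insulin + mass + pedigree + age")

    # Fit the binary logistic regression on the training split
    model = logit(formula=formula, data=train).fit()

    # Model fit statistics, coefficients, p-values and 95% CIs
    print(model.summary())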

Interpretation of Model Summary

After model fitting, the next step is to generate the model summary table and interpret the model coefficients. The model summary includes two segments: the first provides model fit statistics, and the second provides the model coefficients, their significance and 95% confidence interval values. In publication or article writing, you often need to interpret the variable coefficients from the summary table.

The model fit statistics show that the model was fitted using the Maximum Likelihood Estimation (MLE) technique and that it converged properly, with no errors. The McFadden pseudo R-squared value is 0.327, which indicates a well-fitted model.

Additionally, the table provides a log-likelihood ratio test. The likelihood ratio test (often termed the LR test) is a goodness-of-fit test that compares two models: the null model and the final model. The test revealed that when the model was fitted with only the intercept (null model), the log-likelihood was -198.29, which improved significantly when the model was fitted with all independent variables (log-likelihood = -133.48). The improvement in fit is also significant (p-value < 0.05).

[Figure: Binary Logit Regression Summary Table]

The coefficient table showed that only glucose and pedigree have a significant influence (p-values < 0.05) on diabetes. The coefficients are in log-odds terms. The interpretation of the model coefficients could be as follows:
Each one-unit increase in glucose increases the log odds of having diabetes by 0.038, and its p-value indicates that it is significant in determining diabetes. Similarly, each one-unit increase in pedigree increases the log odds of having diabetes by 1.231, and its p-value is significant too.
Interpreting coefficients in log-odds terms does not make much sense if you need to report them in an article or publication. That is why the concept of the odds ratio was introduced.

ODDs Ratio

The odds is the ratio of the probability of an event occurring to the probability of the event not occurring. When we take the ratio of two such odds, it is called an odds ratio.

[Figure: Odds and odds ratio — odds = p / (1 − p); the odds ratio is the ratio of two such odds]

Mathematically, one can compute the odds ratio by taking the exponent of the estimated coefficients. For example, in the odds ratio table below, you can observe that pedigree has an odds ratio of 3.427, which indicates that a one-unit increase in pedigree multiplies the odds of having diabetes by 3.427.

[Figure: Odds Ratio Estimates]
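
A sketch of how such a table can be computed, exponentiating the estimated coefficients and their confidence limits from the fitted model:

    import numpy as np

    # Odds ratios with 95% confidence intervals
    conf = model.conf_int()  # columns 0 and 1 hold the lower/upper bounds
    odds_ratios = pd.DataFrame({
        "Odds Ratio": np.exp(model.params),
        "CI 2.5%": np.exp(conf[0]),
        "CI 97.5%": np.exp(conf[1]),
    })
    print(odds_ratios)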

Marginal Effects Computation

Marginal effects are an alternative metric that can be used to describe the impact of a predictor on the outcome variable. Marginal effects can be described as the change in outcome as a function of the change in the treatment (or independent variable of interest) holding all other variables in the model constant. In linear regression, the estimated regression coefficients are marginal effects and are more easily interpreted.

There are three types of marginal effects reported by researchers: marginal effects at representative values (MERs), marginal effects at means (MEMs), and average marginal effects (AMEs), which are computed at every observed value of x and averaged across the results (Leeper, 2017). For categorical variables, the average marginal effects are calculated for every discrete change with respect to the reference level.

The statsmodels library offers the following Marginal Effects computation:

[Figure: statsmodels documentation on marginal effects options]

In the STEM research domains, average marginal effects are very popular and often reported by researchers. In our case, we have estimated the AMEs of the predictor variables using the .get_margeff( ) function and printed the summary of the report.

[Figure: Average Marginal Effects Estimates]
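
A sketch of this step; at='overall' and method='dydx' are the statsmodels defaults and correspond to AMEs:

    # Average marginal effect of each predictor on P(diabetes = 1)
    ame = model.get_margeff(at="overall", method="dydx")
    print(ame.summary())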

The average marginal effects table reports the AMEs, standard errors, z-values, p-values and 95% confidence intervals. The interpretation of AMEs is similar to that of linear models. For example, the AME value of pedigree is 0.1677, which can be interpreted as: a one-unit increase in pedigree increases the probability of having diabetes by 0.1677 (about 16.77 percentage points).

Model Evaluation on Test Data Set

After fitting a binary logistic regression model, the next step is to check how well the fitted model performs on unseen data i.e. 20% test data.

Thus, the next step is to predict the classes in the test data set and generate a confusion matrix. The steps involve the following (a code sketch follows the list):

  • The first step is to import the NumPy library as np and to import classification_report and accuracy_score from sklearn.metrics.
  • Next, predict the diabetes probabilities on the test data using the model.predict( ) function.
  • Set a cut-off value (0.5 for binary classification): probabilities below 0.5 are treated as neg (0) and those above as pos (1).
  • Use pandas crosstab( ) to create a confusion matrix between the actual (neg: 0, pos: 1) and predicted (neg: 0, pos: 1) classes.
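
A sketch of these steps, continuing with the fitted model and the test split from above:

    import numpy as np
    from sklearn.metrics import classification_report, accuracy_score

    # Predicted probabilities of being diabetic for the test set
    pred_prob = model.predict(test)

    # Apply the 0.5 cut-off: below 0.5 -> neg (0), otherwise -> pos (1)
    pred_class = np.where(pred_prob < 0.5, 0, 1)

    # Confusion matrix of actual vs. predicted classes
    print(pd.crosstab(test["diabetes"], pred_class,
                      rownames=["Actual"], colnames=["Predicted"]))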

Confusion Matrix

The confusion matrix revealed that the test data set has 52 cases of negative (0) and 27 cases of positive (1). The trained model accurately classified 44 of the negative (neg: 0) cases and 16 of the positive (pos: 1) cases.

[Figure: Confusion matrix computation and resulting table]

Classification Accuracy

The classification accuracy can be calculated as follows:

Accuracy = (correctly classified negatives + correctly classified positives) / total test samples = (44 + 16) / 79 ≈ 0.76

The same accuracy can be estimated using the accuracy_score( ) function. The result revealed that the classifier is about 76% accurate in classifying unseen data.

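A sketch of the call, reusing the test labels and predicted classes from the confusion-matrix step:

    from sklearn.metrics import accuracy_score

    # Overall classification accuracy on the 20% test set
    print(accuracy_score(test["diabetes"], pred_class))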

Classification Report

A classification report is used to measure the quality of predictions from a classification algorithm: how many predictions were correct and how many were not. The report is built from the counts of True Positives, True Negatives, False Positives and False Negatives.

  1. TP / True Positive: when an actual observation was positive and the model prediction is also positive
  2. TN / True Negative: when an actual observation was negative and the model prediction is also negative
  3. FP / False Positive: when an actual observation was negative but the model prediction is positive
  4. FN / False Negative: when an actual observation was positive but the model prediction is negative

The classification report provides information on precision, recall and F1-score.

We have already calculated the classification accuracy, so the obvious question would be: what is the need for precision, recall and F1-score? The answer is that accuracy is not a good measure when a class imbalance exists in the data set. A data set is said to be balanced if the dependent variable includes an approximately equal proportion of both classes (in the binary classification case). For example, if the diabetes data set includes 50% diabetic and 50% non-diabetic patients, then the data set is balanced and in such a case we can use accuracy as an evaluation metric. But in the real world this is often not the case.

Let’s make it more concrete with an example. Say you have gathered a diabetes data set that has 1000 samples. You passed the data set through your trained model and the model predicted all the samples as non-diabetic. But later, when you skim through the data set, you observe that among the 1000 samples, 3 patients have diabetes. So our model misclassified the 3 diabetic patients as non-diabetic (false negatives). Even with these 3 misclassifications, if we calculate the prediction accuracy, we still get a great accuracy of 99.7%.

Accuracy = 997 correct predictions / 1000 samples = 99.7%

But practically the model does not serve its purpose, i.e. it is not able to identify the diabetic patients; thus, for imbalanced data sets, accuracy is not a good evaluation metric.

To cope with this problem, the concepts of precision and recall were introduced.

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)

Precision: determines the accuracy of positive predictions.

Recall: determines the fraction of positives that were correctly identified.

F1 Score: the harmonic mean of precision and recall, with a best score of 1 and a worst score of 0. The F1 score conveys the balance between precision and recall.
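
A sketch of generating the report for our test predictions; the target_names labels are added here only for readability:

    from sklearn.metrics import classification_report

    # Per-class precision, recall, F1-score and support
    print(classification_report(test["diabetes"], pred_class,
                                target_names=["neg", "pos"]))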

The classification report revealed that the micro average of F1 score is about 0.72, which indicates that the trained model has a classification strength of 72%.

[Figure: Classification Report]

Binary logistic regression is still a very popular ML algorithm (for binary classification) in the STEM research domain. It is easy to train and interpret compared to many sophisticated and complex black-box models.

Dataset and Code

Click here for diabetes data and code

** Hope this blog helps **

See you next time!


References

Leeper, T.J., (2017). Interpreting regression results using average marginal effects with R’s margins. Tech. rep.

Newman, C. B. D. & Merz, C. (1998). UCI Repository of machine learning databases, Technical report, University of California, Irvine, Dept. of Information and Computer Sciences.

Bangdiwala, S. I. (2018). Regression: binary logistic. International Journal of Injury Control and Safety Promotion. DOI: 10.1080/17457300.2018.1486503