- Data Background
- Aim of the modelling
- Data Loading
- Basic Exploratory Analysis
- Data Preparation
- Model Comparison
- Hyper-parameter Tuning
- Model Plots development
- Model Testing
- Model Finalization and Saving
In this example, we are going to use the Pima Indian Diabetes 2 data set obtained from the UCI Repository of machine learning databases (Newman et al. 1998).
This data set is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the data set is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the data set. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.
The Pima Indian Diabetes 2 data set is the refined version (all missing values were assigned as NA) of the Pima Indian diabetes data. The data set contains the following independent and dependent variables.
Independent variables (symbol: I)
- I1: pregnant: Number of times pregnant
- I2: glucose: Plasma glucose concentration (glucose tolerance test)
- I3: pressure: Diastolic blood pressure (mm Hg)
- I4: triceps: Triceps skin fold thickness (mm)
- I5: insulin: 2-Hour serum insulin (mu U/ml)
- I6: mass: Body mass index (weight in kg/(height in m)²)
- I7: pedigree: Diabetes pedigree function
- I8: age: Age (years)
Dependent Variable (symbol: D)
- D1: diabetes: diabetes case (pos/neg)
The aim of this article is to build a diabetes classification model using the PyCaret Python library.
PyCaret is an open-source, low-code machine learning library in Python that allows you to prepare your data and train and deploy a model with a few lines of code.
The very first step is to load the relevant libraries.
import pandas as pd  # data loading and manipulation
import matplotlib.pyplot as plt  # plotting
import seaborn as sns  # statistical plotting
Reading Diabetes Dataset
The next step is to load the diabetes dataset using the pandas read_csv( ) function and print the first five rows.
diabetes = pd.read_csv("diabetes.csv")
diabetes.head()
To see a description of the data, such as data types and missing values, you can use the .info( ) method. You can see that the dataset contains 2 float columns, 6 integer columns, and 1 object column (the dependent variable).
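As a small, self-contained illustration of this inspection step, here is a sketch that reads a made-up three-row sample (the values are hypothetical, only the column names follow the article) and reports data types and missing values. The isna( ).sum( ) call is an extra I have added to count the NA entries per column.

```python
import io
import pandas as pd

# Hypothetical three-row sample that reuses the article's column names
csv_text = """pregnant,glucose,pressure,triceps,insulin,mass,pedigree,age,diabetes
2,120,70,30,NA,31.5,0.45,40,pos
1,90,64,25,NA,25.0,0.30,29,neg
0,95,68,22,80,27.2,0.20,23,neg
"""

# read_csv treats the string "NA" as a missing value by default
sample = pd.read_csv(io.StringIO(csv_text))

sample.info()                             # dtypes and non-null counts per column
missing_per_column = sample.isna().sum()  # number of NA values per column
print(missing_per_column)
```

A column with missing values, such as insulin here, is read as float even though the recorded values are whole numbers.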
Setting PyCaret Environment
To start with PyCaret, the first step is to import all functions from PyCaret’s classification module.
from pycaret.classification import *
The next step is to prepare the data for analysis. Those who regularly work with different datasets know that data preparation is the most time-consuming part of a project (it is often said to take around 80% of the overall time). Even with the various modules of the scikit-learn library, preparing a dataset still requires many steps. With PyCaret, you can prepare the data in a single step using the setup( ) function. Here, I have supplied the diabetes dataset and set the target to “diabetes”. Although the pregnant column is already an integer, I have passed it in the numeric_features argument to illustrate that you can tell PyCaret to treat specific columns as numeric or categorical. I have also set train_size to 0.8, which splits the dataset into 80% train and 20% test, and normalize = True, so that the features are normalized during processing. To make the modeling process reproducible, you can set the session_id.
You may now be wondering about dummy coding. PyCaret automatically dummy codes your categorical variables, so you do not have to worry about it. You can see that after processing the dependent variable is encoded as well (neg: 0; pos: 1).
dia_clf = setup(data = diabetes, target = 'diabetes', numeric_features=["pregnant"], train_size = 0.8, normalize=True, session_id=123)
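To make these preprocessing steps more concrete, here is a rough sketch in plain Python of what an 80/20 split, target encoding, and z-score normalization amount to. It is not PyCaret's internal code. The glucose readings and labels are hypothetical, and z-score scaling is assumed because it is PyCaret's default normalization method.

```python
import random
import statistics

# Hypothetical glucose readings and labels, for illustration only
glucose = [148.0, 85.0, 183.0, 89.0, 137.0, 116.0, 78.0, 115.0, 197.0, 125.0]
labels = ["pos", "neg", "pos", "neg", "pos", "neg", "pos", "neg", "pos", "pos"]

# Encode the target as integers (neg -> 0, pos -> 1)
encoding = {"neg": 0, "pos": 1}
y = [encoding[label] for label in labels]

# Shuffle the row indices reproducibly, then keep 80% for training
rng = random.Random(123)  # the seed plays the role of session_id
indices = list(range(len(glucose)))
rng.shuffle(indices)
n_train = int(0.8 * len(indices))
train_idx, test_idx = indices[:n_train], indices[n_train:]

# Estimate the scaling statistics on the training rows only
train_values = [glucose[i] for i in train_idx]
mean = statistics.mean(train_values)
std = statistics.pstdev(train_values)

def zscore(value):
    """Scale a value with the mean and standard deviation of the training rows."""
    return (value - mean) / std

train_scaled = [zscore(glucose[i]) for i in train_idx]
test_scaled = [zscore(glucose[i]) for i in test_idx]
```

The scaling statistics come from the training rows only, so nothing about the test rows leaks into the preprocessing.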
To compare multiple models and get an initial idea of which type of classification model gives better results, you can use the compare_models( ) function. Here I have set sort = “AUC”, so after training the models are sorted in decreasing order of the AUC metric. You can observe that Extreme Gradient Boosting topped the list with the best AUC value.
compare_models(sort = "AUC")
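If the AUC metric is new to you, one way to read it is as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it that way on four made-up predictions. The function name roc_auc is my own and is not part of PyCaret.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and predicted probabilities
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, scores)
print(auc)  # 3 of the 4 positive/negative pairs are ordered correctly -> 0.75
```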
Once you have an idea of the best-performing model, the next step is to tune its hyper-parameters to obtain a stable model that does not overfit the data. To tune the model, use the tune_model( ) function and supply the model name and the metric to optimize; here I have used “AUC”. XGBoost has many hyper-parameters, and trying every combination is very time- and resource-consuming, so by default tune_model( ) conducts a random search over a hyper-parameter grid, which is a fast and efficient way to get good results. Here, I have supplied n_iter = 500, which evaluates 500 randomly sampled hyper-parameter combinations. By default, each combination is evaluated with 10-fold cross-validation to provide a better estimate of model performance. Note that passing the model name as a string works in PyCaret 1.x; from version 2.0 onward, tune_model( ) expects a model object returned by create_model( ).
tuned_xgb = tune_model("xgboost", optimize = "AUC", n_iter = 500)
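To show the idea behind a random search with k-fold cross-validation, here is a minimal sketch in plain Python. It is not how PyCaret implements tune_model( ). The search space is hypothetical, and toy_score is a stand-in for fitting a model on the training folds and measuring AUC on the validation fold.

```python
import random

# Hypothetical search space, for illustration only (not PyCaret's internal grid)
param_grid = {
    "max_depth": [2, 3, 4, 5, 6],
    "learning_rate": [0.01, 0.05, 0.1, 0.3],
    "n_estimators": [50, 100, 200, 400],
}

def kfold_indices(n_rows, n_folds):
    """Yield (train_idx, valid_idx) pairs that partition range(n_rows) into folds."""
    start = 0
    for fold in range(n_folds):
        size = n_rows // n_folds + (1 if fold < n_rows % n_folds else 0)
        valid_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_rows))
        yield train_idx, valid_idx
        start += size

def toy_score(params, train_idx, valid_idx):
    """Stand-in for fitting on train_idx and returning the AUC on valid_idx."""
    penalty = 0.05 * abs(params["max_depth"] - 4) + abs(params["learning_rate"] - 0.1)
    return 1.0 - penalty

def random_search(n_iter, n_rows=100, n_folds=10, seed=123):
    """Sample n_iter random combinations and keep the one with the best mean fold score."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in param_grid.items()}
        scores = [toy_score(params, tr, va) for tr, va in kfold_indices(n_rows, n_folds)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score

best_params, best_score = random_search(n_iter=20)
print(best_params, round(best_score, 3))
```

The cost grows with n_iter times the number of folds, which is why n_iter = 500 with 10 folds takes a while on a real model.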
To get the best hyper-parameters you just need to print the tuned model object.
One of the best parts of the PyCaret library is that it provides ready-to-use model plots with a single line of code. You can plot a global variable importance plot (if the trained model supports it), a confusion matrix, a ROC curve with its AUC, a precision-recall curve, a local variable importance plot, and many more.
Figure (a). Variable Importance Plot (Global Importance)
plot_model(tuned_xgb, plot = 'feature')
Figure (b). Confusion Matrix
plot_model(tuned_xgb, plot = 'confusion_matrix')
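For readers who want to see what the confusion matrix counts, here is a small sketch that tallies the four cells by hand. The labels and predictions are made up, and confusion_counts is a hypothetical helper, not a PyCaret function.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = pos, 0 = neg)."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            counts["tp"] += 1
        elif actual == 0 and predicted == 0:
            counts["tn"] += 1
        elif actual == 0 and predicted == 1:
            counts["fp"] += 1
        else:
            counts["fn"] += 1
    return counts

# Hypothetical test labels and model predictions
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

counts = confusion_counts(y_true, y_pred)
accuracy = (counts["tp"] + counts["tn"]) / len(y_true)
print(counts, accuracy)
```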
Figure (c). AUROC Plot
plot_model(tuned_xgb, plot = 'auc')
Figure (d). Precision Recall Plot
plot_model(tuned_xgb, plot = 'pr')
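The precision-recall curve is traced by moving the classification threshold and recomputing precision and recall each time. The sketch below does this for three thresholds on made-up scores. The helper name and the convention of reporting precision as 1.0 when nothing is predicted positive are my own assumptions.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when every score >= threshold is predicted positive."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(y_true, predicted) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, predicted) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, predicted) if y == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 1.0  # assumed convention for no positives
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical labels and predicted probabilities
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

curve = [(t, *precision_recall_at(y_true, scores, t)) for t in (0.2, 0.5, 0.9)]
for threshold, precision, recall in curve:
    print(threshold, round(precision, 3), recall)
```

Raising the threshold trades recall for precision, which is the trade-off the plot visualizes.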
Figure (e). Local Importance Plot
Initially, we plotted the global importance plot, but global importance shows only how important a variable is, not the direction of its impact. For example, the global importance plot in Figure (a) revealed that glucose is the top predictor, but it does not reveal whether a unit increase in glucose increases or decreases the likelihood of diabetes.
To understand the direction of impact, two popular approaches were developed: LIME and Shapley values (SHAP). In PyCaret, you can compute and plot the Shapley values using the interpret_model( ) function.
Here, I have plotted the Shapley values by supplying the tuned XGBoost model. The plot shows Shapley values on the x-axis and the variables on the y-axis. The color represents the value of the variable: red means a high value and blue means a low value. For example, high glucose values (red) have large positive Shapley values, so glucose has a strong positive impact on diabetes: as glucose concentration increases, the predicted probability of diabetes also increases.
interpret_model(tuned_xgb, plot = 'summary')
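To give a feel for what a Shapley value is, here is a brute-force sketch on a tiny two-feature scoring function. This is not what interpret_model( ) runs internally, and toy_risk_model, the baseline, and the patient values are all hypothetical. Each feature's Shapley value is its average marginal contribution over all orders in which the features can be switched from the baseline to the patient's values.

```python
from itertools import permutations

def shapley_values(model, instance, baseline):
    """Exact Shapley values: average marginal contribution over all feature orderings."""
    features = list(instance)
    totals = {name: 0.0 for name in features}
    orderings = list(permutations(features))
    for order in orderings:
        current = dict(baseline)              # start from the baseline input
        previous_output = model(current)
        for name in order:
            current[name] = instance[name]    # switch this feature to its actual value
            output = model(current)
            totals[name] += output - previous_output
            previous_output = output
    return {name: total / len(orderings) for name, total in totals.items()}

def toy_risk_model(x):
    """Hypothetical risk score with an interaction term (not the tuned XGBoost model)."""
    return 0.01 * x["glucose"] + 0.02 * x["mass"] + 0.0001 * x["glucose"] * x["mass"]

baseline = {"glucose": 100.0, "mass": 30.0}   # assumed reference patient
patient = {"glucose": 150.0, "mass": 40.0}    # assumed patient to explain

values = shapley_values(toy_risk_model, patient, baseline)
print(values)
```

The values add up to the difference between the patient's score and the baseline score, and a positive value for glucose means that the higher glucose reading pushed the score up.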
Prediction on Test Dataset
After model training, the next step is to check how your tuned model performs on an unseen, or test, dataset. You can evaluate the model on the test data using the predict_model( ) function. By default, it uses the 20% hold-out split that was set aside during the data pre-processing step. You can observe that the test AUC is about 0.8189, which is good.
predict_model(tuned_xgb)
Once you are satisfied with the final model performance, you will likely want to save the model. Before that, finalize the model using the finalize_model( ) function, which refits it on the complete dataset, including the hold-out set.
final_xgb = finalize_model(tuned_xgb)
You can use the save_model( ) function to save the finalized model for future use.
save_model(final_xgb, 'Final tuned_xgb Model 11July2020')
Loading Saved Model
Similarly, you can load a saved model using the load_model( ) function.
saved_final_xgb = load_model('Final tuned_xgb Model 11July2020')
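The save and load steps boil down to writing a Python object to disk and reading it back. Here is a minimal sketch of that round trip using the standard pickle module. The dictionary is a hypothetical stand-in for a fitted pipeline, and the file name is my own; PyCaret handles the serialization for you and writes a .pkl file.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted model pipeline
model = {"name": "tuned_xgb", "threshold": 0.5, "features": ["glucose", "mass"]}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "final_model.pkl")  # assumed file name
    with open(path, "wb") as handle:
        pickle.dump(model, handle)                  # save to disk
    with open(path, "rb") as handle:
        restored = pickle.load(handle)              # load it back

print(restored == model)
```

Only load pickled files from sources you trust, since unpickling can execute arbitrary code.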
PyCaret is a very high-level machine learning library with which you can train, tune, and deploy a model to production using very few lines of code.
I hope you learned something new. See you next time!
Newman, C. B. D. & Merz, C. (1998). UCI Repository of machine learning databases, Technical report, University of California, Irvine, Dept. of Information and Computer Sciences.