Introduction
Pearsonâs correlation (r) is utilized when we have two numeric variables, and we want to see if there is a linear relationship between those variables. It is often used to eliminate correlated variables before fitting regression models. Similarly, there are few statistical measures of association exists like Cramerâs V and Theilâs U that are used to check association between two nominal variables (categorical variables).
In this blog, we are going to use various libraries to estimate pair-wise Pearsonâs correlation matrix, and pair-wise nominal association (using Cramerâs V and Theilâs U).
Further, we will present them using aesthetically beautiful plots (using Matplotlib library).
Article Outline
- Installing libraries
- Loading libraries
- Loading dataset
- Estimating and plotting pairwise Pearsonâs Correlation (r) using pandas and matplotlib
- Estimating and plotting pairwise Pearsonâs Correlation (r) using dython library
- Estimating Pearsonâs correlation (r) using pingouin library (hypothesis testing)
- Estimating and plotting Cramers V (association among nominal variables) using association_metrics library
- Estimating and plotting Cramers V (association among nominal variables) using dython library
- Estimating and plotting Theilâs U (association among nominal variables) using dython library
- Saving plots
- Dataset and codes download link
About datasets
In this blog, we are going to use two datasets. The dataset name and their description are provided as follows:
- Concrete dataset
- Tips dataset
Concrete Dataset
The concrete dataset [1] obtained from the UCI Machine Learning library.
The dataset includes the following variables, which are the ingredients for making durable high strength concrete.
Cement: kg in a m3 mixture
Blast Furnace Slag: kg in a m3 mixture
Fly Ash: kg in a m3 mixture
Water: kg in a m3 mixture
Superplasticizer: kg in a m3 mixture
Coarse Aggregate: kg in a m3 mixture
Fine Aggregate: kg in a m3 mixture
Age: Day (1~365)
Concrete compressive strength: MPa
where m3: meter cube and MPa: Megapascal.
Tips Dataset
The data was reported in a collection of case studies for business statistics. The dataset is also available through the Python library Seaborn [2].
The Tips data contains 244 observations and 7 variables (excluding the index). The variables descriptions are as follows:
bill: Total bill (cost of the meal), including tax, in US dollars
tip: Tip (gratuity) in US dollars
sex: Sex of person paying for the meal (Male, Female)
smoker: Presence of smoker in a party? (No, Yes)
weekday: day of the week (Saturday, Sunday, Thursday and Friday)
time: time of day (Dinner/Lunch)
size: the size of the party
Installing libraries
The first step is to install the following libraries in your python environment.
pip install pandas # data loading and wrangling
pip install matplotlib # plots
pip install seaborn # statistical plots
pip install association_metrics # correlation/association & plot
pip install dython # Correlation/association & plot
pip install pingouin # Hypothesis testing
Importing libraries
Next, import the following libraries.
import numpy as np # for data crunching
import pandas as pd # Data manipulation
import matplotlib.pyplot as plt # Plotting
import seaborn as sns # Statistical plots and tips data
Loading dataset
Letâs read the Concrete dataset using pandas .read_csv( ) method. Print the first five observations using the .head( ) method.
concrete = pd.read_csv("concrete.csv")
concrete.head()
Once the data is loaded, the next step is to check the data information. This is to confirm that all the variables are of numerical type so that we can compute a correlation coefficient (r).
concrete.info()
As per the output, all the variables are numeric type, i.e., either integer or float type.
Estimating Pearsonâs Correlation (r) between numerical variables
Parametric Correlation: Pearson correlation (r), is a linear association between two variables and presented in a range of -1 to +1.
Non-parametric Correlation (r): Kendall tau and Spearman rho, are rank-based correlation coefficients (ranges between -1 to +1).
First, we will try the pandas inbuilt corr( ) method on the concrete dataset to compute Pearsonâs (r) and round it to two decimal places.
cor_df = concrete.corr(method = "pearson").round(2)
cor_df
It can be noted that few variable pairs are highly correlated. For example, we can see that the correlation between cement and strength is +0.50, similarly water and strength variable pair has a correlation strength of -0.29.
Pearsonâs pairwise correlation plot using Pandas and matplotlib library
Once we have the pair-wise correlation matrix, we can generate a plot to illustrate it.
To generate the plot, we need to go through the following steps.
Step 1: Create a figure and axis object with matplotlibâs subplots( ) method.
Step 2: Generate the figure using the ax.imshow( ). Here we have supplied a default Blues colormap and set the minimum (-1) and maximum values (+1).
Step 3: Set the x and y ticks using set method
Step 4: Resize the tick parameters using tick_params( ) method
Step 5: Add a colour-bar using ax.colorbar( ) method to relate the colour gradient to correlation strength
Step 6: Loop through coordinates (x, y) and correlation (r) values using np.ndenumerate( ) method and annotate the values using ax.annotate( ) method.
# Step 1: Initiating a fig and axis object
fig, ax = plt.subplots(figsize = (12, 10))
# Step 2: Create a plot
cax = ax.imshow(cor_df.values, interpolation='nearest', cmap = 'Blues', vmin = -1, vmax = 1)
# Step 3: Set axis tick labels
ax.set_xticks(ticks = range(len(concrete.columns)),
labels = concrete.columns)
ax.set_yticks(ticks = range(len(concrete.columns)),
labels = concrete.columns)
# Step 4: Resize the tick parameters
ax.tick_params(axis = "x", labelsize = 12, labelrotation = 90)
ax.tick_params(axis = "y", labelsize = 12, labelrotation = 0)
# Step 5: Adding a color bar
fig.colorbar(cax).ax.tick_params(labelsize = 12)
# Step 6: Add annotation
for (x, y), t in np.ndenumerate(cor_df):
ax.annotate("{:.2f}".format(t),
xy = (x, y),
va = "center", # vertical position
ha = "center") # horizontal position
Estimating and Plotting Pearsonâs pairwise correlation coefficients using dython library
We can generate the same pairwise correlation plot using dython library, with just two lines of code.
Here, the associations( ) method automatically apply the âPearsonâ correlation method on the numerical variables and generates the plot.
# using dython library
from dython.nominal import associations
# Step 1: Instantiate a figure and axis object
fig, ax = plt.subplots(figsize=(16, 8))
# Step 2: Creating a pair-wise correlation plot
# Saving it into a variable(r)
r = associations(concrete, ax = ax, cmap = "Blues")
The saved variable contains two objects, i) The correlation matrix and ii) Axis object.
Letâs print the correlation matrix.
r["corr"].round(2)
Estimating Pearsonâs correlation using pingouin library
Occasionally, we need to perform a hypothesis test just to report the correlation strength (r) and the p-value.
pingouin library comes in handy which provides different statistics for the pair-wise correlation, including p-value.
Documentation link 1
Documentation link 2
We usually test the following hypothesis while computing correlation.
Null Hypothesis (H0): True correlation is equal to zero.
Alternative Hypothesis (H1): True correlation is not equal to zero.
Here for example, we would like to compute the correlation between four variables which are âcementâ, âwaterâ, âstrengthâ and âsuperplasticâ.
# Import library as pg
import pingouin as pg
# Perform a hypothesis testing
(
pg.pairwise_corr(concrete,
columns = ["cement", "water", "strength", "superplastic"],
method = 'pearson')
.round(3)
)
The results show that the probability of all pair-wise correlation values are lower than the conventional 5% (P<0.05), thus here the alternative hypothesis is true.
Estimating Cramerâs V association among nominal variables
In statistics, CramĂ©râs V (sometimes referred to as CramĂ©râs phi and denoted as Ïc) is a measure of association between two nominal variables [3]. The value ranges 0 to 1, where 0 indicates no association and 1 indicates a perfect association.
Cramerâs V facts:
- The value ranges between 0 and 1
- It is a symmetrical association measure [V(x, y) = V(y, x)]
For estimating the Cramerâs V we are going to use the tips dataset.
# Loading tips dataset
tips = sns.load_dataset("tips")
# Print first five observations
tips.head()
Generating Cramerâs V pairwise matrix plot using `association_metrics` library
To compute the pairwise Cramerâs V association matrix, we need to go through the following steps.
Step 1: import association_metrics as sm.
Step 2: Use a lambda function to convert all object columns to categorical columns.
Step 3: use CramersV( ) method on the dataset.
Step 4: use .fit( ) method to fit the estimator to compute values.
# Using association_metrics library
import association_metrics as am
# Convert object columns to Category columns
df = tips.apply(
lambda x: x.astype("category") if x.dtype == "O" else x)
# Initialize a CramersV object using the pandas.DataFrame (df)
cramers_v = am.CramersV(df)
# It will return a pairwise matrix filled with Cramer's V, where
# columns and index are the categorical variables of the passed # pandas.DataFrame
cfit = cramers_v.fit().round(2)
cfit
The next step is to generate a pair-wise association plot, the same way we did it for Pearsonâs correlation coefficient (r).
# Instantiating a figure and axes object
fig, ax = plt.subplots(figsize = (10, 6))
# Generate a plot
cax = ax.imshow(cfit.values, interpolation='nearest', cmap='Blues', vmin=0, vmax=1)
# Setting the axes labels
ax.set_xticks(ticks = range(len(cfit.columns)),
labels = cfit.columns)
ax.set_yticks(ticks = range(len(cfit.columns)),
labels = cfit.columns)
# Setting tick parameters
ax.tick_params(axis = "x", labelsize = 12, labelrotation = 0)
ax.tick_params(axis = "y", labelsize = 12, labelrotation = 0)
# Adding a colorbar
fig.colorbar(cax).ax.tick_params(labelsize = 12)
# Adding annotations
for (x, y), t in np.ndenumerate(cfit):
ax.annotate("{:.2f}".format(t),
xy = (x, y),
va = "center",
ha = "center").set(color = "black", size = 12)
Generating Cramerâs V pairwise association plot using dython library
The same pair-wise association plot can be generated using dython library using two lines of codes. Just, we need to provide categorical columns (object) and nom_nom_assoc = âcramerâ.
Note:
[1] For dython library, we must supply the categorical columns as âobjectâ type. Here, we converted âcategoricalâ columns to âobjectâ columns.
[2] By default, dython library enables Cramerâs V bias correction (cramers_v_bias_correction=True); thus values estimated using dython library will be slightly different from association_metrics library.
# Importing library
from dython.nominal import associations
# Convert categorical columns to object columns
df = tips.apply(
lambda x: x.astype("object") if x.dtype == "category" else x)
# Instantiate a figure and axis object
fig, ax = plt.subplots(figsize = (12, 7))
# Estimate and generate Cramer's V association plot
cramers_v = associations(df[["sex", "smoker", "day", "time"]],
nom_nom_assoc = 'cramer',
ax = ax,
cmap = "Blues")
The saved object cramers_v contains two sub-objects, i) the pair-wise association value matrix and ii) axis object.
Let’s print the association matrix and round up to two decimal places.
cramers_v["corr"].round(2)
Letâs access the axis object that we can later use for plot modifications.
Note: This method of axis object retrieval is appropriate only when if we have not instantiated the figure and axis objects using Matplotlibâs .subplots( ).
cramers_v["ax"]
<AxesSubplot:>
Generating Theilâs U pairwise association plot using dython library
In statistics, Theilâs U is a measure of association between two nominal variables [4, 5]. Similar to Cramerâs V, the value ranges 0 to 1, where 0 indicates no association and 1 indicates a perfect association.
Theilâs U facts:
- The value ranges between 0 and 1
- It is an asymmetrical association measure [U(x, y) â U(y, x)]
Here, we are going to use the dython library and will supply nom_nom_assoc = âtheilâ.
Note: for dython library, we must supply the categorical columns as âobjectâ type. Here, we converted âcategoricalâ columns to âobjectâ columns.
# Loading libarary
from dython.nominal import associations
# Convert categorical columns to object columns
df = tips.apply(lambda x: x.astype("object") if x.dtype == "category" else x)
# Instantiate a figure and axis object
fig, ax = plt.subplots(figsize = (10, 7))
# Estimate and generate Theil's U association plot
theils_u = associations(df[["sex", "smoker", "day", "time"]],
nom_nom_assoc = 'theil',
ax = ax,
cmap = "Blues")
Same way, we can retrieve the association matrix.
theils_u["corr"].round(2)
Letâs retrieve the axis object.
theils_u["ax"]
<AxesSubplot:>
Saving Plots
We have instantiated the figure (fig) object for each of the plots generated above. Thus, after generating a particular plot, we can save the plot using .savefig( ) method.
# bbox_inches = "tight" removes unnecessary borders
# You can change the "dpi" value to get a higher resolution plot
fig.savefig("figurename.png", dpi = 300, bbox_inches = "tight")
These are the various ways to generate aesthetically beautiful publication ready correlation and association plots.
Click here for the data and code
References
[1] I-Cheng Yeh, âModeling of the strength of high-performance concrete using artificial neural networks,â Cement and Concrete Research, Vol. 28, â12, pp. 1797â1808 (1998).
[2] Bryant, P. G. and Smith, M. A. (1995), Practical Data Analysis: Case Studies in Business Statistics, Richard D. Irwin Publishing, Homewood, IL.
[3] Cramér, H., 2016. Mathematical methods of statistics, in: Mathematical Methods of Statistics (PMS-9), Volume 9. Princeton University press, pp.
1â575.
[4] Theil, H., 1958. Economic policy and forecasts.
[5] Theil, H., 1966. Applied economic forecasting, rand mcnally and company.
I hope you learned something new!