Introduction
A bar plot is a graph that shows the relationship between a categorical variable and a numerical variable. Two types are popular in data visualization: the dodged bar plot and the stacked bar plot.
A dodged bar plot compares the levels of a grouping variable by drawing the groups side by side. It can compare categorical counts or relative proportions, and more generally numerical statistics such as the mean or median.
In this article, we deal with count-based bar plots, where we compare the proportions corresponding to a grouping variable.
Article outline
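This article covers the following: loading the libraries, building a basic bar plot from Rectangle patches and with bar( ), and generating a dodged bar plot with matplotlib, pandas and seaborn.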
Loading libraries
The first step is to load the required libraries.
import numpy as np # array manipulation
import pandas as pd # Data Manipulation
import matplotlib.pyplot as plt # Visualisation
import seaborn as sns # Visualisation
Basic knowledge of matplotlib’s subplots
If you have a basic knowledge of matplotlib’s subplots( ) method, you can proceed with the article; otherwise, I highly recommend reading the first blog in this visualisation guide series.
Link: Introduction to Line Plot — Matplotlib, Pandas and Seaborn Visualization Guide (Part 1)
Basic barplot using Rectangle method
In this article, we will learn how to generate dodged bar plots. But before we proceed to such an advanced statistical plot, we first need to be familiar with how matplotlib builds a bar plot step by step.
To build a bar plot, we need to go through the following steps:
Step 1: Import Rectangle from matplotlib.patches.
Step 2: Instantiate the figure (fig) and axes (ax) objects using plt.subplots( ).
Step 3: Use Rectangle( ) to generate a patch/rectangle object. It takes the (x, y) anchor point as a tuple (the lower-left corner for a positive width and height), followed by the width and height of the bar.
Step 4: Generate two such Rectangle/patch objects (rec1 and rec2) to place on the axes (ax) object.
Step 5: Add these patch/rectangle objects to the axes (ax) using the add_patch( ) method.
from matplotlib.patches import Rectangle
fig, ax = plt.subplots()
# Define rectangle
# Rectangle((x, y), width, height)
rec1 = Rectangle((0.1, 0), 0.2, 0.9)
rec2 = Rectangle((0.5, 0), 0.2, 0.5)
# Adding patch object/ rectangles
ax.add_patch(rec1)
ax.add_patch(rec2)
Help on the methods
You can get help using Python’s built-in help( ) function: supply any object (for example, Rectangle) to get information on its attributes and methods.
help(Rectangle)
Check for the Patches
Let’s check whether the axes (ax) object contains the patches/rectangles. We can do that by accessing the patches attribute of the axes object (ax). The output shows that the axes object contains two patches.
ax.patches
<Axes.ArtistList of 2 patches>
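Each entry in ax.patches is a Rectangle, so the geometry we passed in can be read back with getter methods. The following is a minimal sketch that rebuilds the two rectangles from above; the loop and the printed layout are only illustrative.

```python
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

fig, ax = plt.subplots()
ax.add_patch(Rectangle((0.1, 0), 0.2, 0.9))
ax.add_patch(Rectangle((0.5, 0), 0.2, 0.5))

# Read back the geometry stored on each patch:
# index, x, y, width, height
for i, p in enumerate(ax.patches):
    print(i, p.get_x(), p.get_y(), p.get_width(), p.get_height())
```

The same getters are used later in the article to position the data labels.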
Changing the Rectangle/Patch Colour
We can customise patch properties. Let’s change the colour of the 2nd rectangle by accessing it via ax.patches[1] and applying set_color("red").
ax.patches[1].set_color("red")
fig
Now you have a basic idea of how matplotlib generates the rectangles of a bar plot. This approach works, but it becomes tedious when there are many bars to plot. To overcome this, we can use a more convenient method offered by the axes object (ax) called bar( ).
Let’s proceed step by step:
Step 1: Instantiate a figure (fig) and axes (ax) object.
Step 2: Generate a list of x-axis and y-axis values.
Step 3: Use the bar( ) method of the axes (ax) object and pass the x and y lists.
This way you can generate a basic bar plot.
# Adding bars using defined values
fig, ax = plt.subplots()
x = [0, 1, 2, 3, 4]
y = [1, 3, 5, 2, 7]
# Use ax.bar()
ax.bar(x, y)
Again, let’s check the patches of the axes (ax) object using the patches attribute. This time it contains 5 patches/rectangles.
# Check number of patches
ax.patches
<Axes.ArtistList of 5 patches>
As before, we can change the colour of individual rectangle/patch objects. Let’s change the 4th patch’s colour to red. It uses the same set_color( ) method, applied to the 4th patch via patches[3]. The index is 3 because Python indexing starts at zero.
# Change the 4th patch (index 3) to red
ax.patches[3].set_color("red")
fig
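The colour can also be chosen when the bars are created, because bar( ) accepts a list of colours through its color argument. A minimal sketch with the same x and y values; note that color sets the fill only, whereas set_color( ) changes both the fill and the edge.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
x = [0, 1, 2, 3, 4]
y = [1, 3, 5, 2, 7]

# One colour per bar; only the 4th bar (index 3) is red
colors = ["tab:blue"] * len(x)
colors[3] = "red"
bars = ax.bar(x, y, color=colors)
```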
We are now familiar with the bar plot and how to generate them from scratch. Now let’s proceed with a new form of plot called “dodged bar plot”.
Dodged barplot [matplotlib style]
A dodged bar plot presents counts, proportions or statistics (mean/median) for two or more groups side by side, which makes the groups easy to compare.
For the current plot, we are going to use the tips dataset.
Source:
Bryant, P. G. and Smith, M. A. (1995), Practical Data Analysis: Case Studies in Business Statistics, Richard D. Irwin Publishing, Homewood, IL.
The tips dataset contains 244 observations and 7 variables (excluding the index). The variable descriptions are as follows:
bill: Total bill (cost of the meal), including tax, in US dollars
tip: Tip (gratuity) in US dollars
sex: Sex of person paying for the meal (Male, Female)
smoker: Presence of smoker in a party? (No, Yes)
weekday: day of the week (Saturday, Sunday, Thursday and Friday)
time: time of day (Dinner/Lunch)
size: the size of the party
Let’s load the tips dataset using pandas read_csv( ) method and print the first 4 observations using head() method.
tips = pd.read_csv("datasets/tips.csv")
tips.head(4)
Aim of the plot
The aim is to compute the proportion of smokers and non-smokers within each sex and display it as a dodged bar plot. The figure below shows the final plot, which we are going to build using three approaches (matplotlib, pandas and seaborn). In the plot, sex is on the x-axis and the percentage for each smoker category is on the y-axis. We will also add labels on top of the bars and customise the legend.
Estimate gender/sex wise smoker percentage
To generate this dodged plot, we need to compute the proportion of smokers and non-smokers within each sex. The steps are as follows:
Step 1: Apply the groupby( ) method to group the data by ‘sex’ and select the ‘smoker’ column from each group.
Step 2: Apply the value_counts( ) method with normalize = True to compute the proportions.
Step 3: Multiply by 100 using .mul(100) and round to 2 decimal places using .round(2).
Step 4: Apply the unstack( ) method so that the sex labels form the index, the smoker statuses form the columns, and the percentage values fill the cells.
Step 5: Save the output in the df variable.
df = (tips
      .groupby("sex")["smoker"]
      .value_counts(normalize=True)
      .mul(100)
      .round(2)
      .unstack())
df
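The same kind of table can also be produced with pd.crosstab( ), which normalises each row when normalize="index" is supplied. This is an alternative sketch, not the approach used in the rest of the article, and the small toy DataFrame is a made-up stand-in for the tips data.

```python
import pandas as pd

# Hypothetical stand-in for the tips data (not the real observations)
toy = pd.DataFrame({
    "sex": ["Female", "Female", "Female", "Male", "Male", "Male", "Male"],
    "smoker": ["No", "No", "Yes", "No", "Yes", "Yes", "Yes"],
})

# Row-wise proportions: each sex row sums to 100
pct = (pd.crosstab(toy["sex"], toy["smoker"], normalize="index")
       .mul(100)
       .round(2))
print(pct)
```

The result has the same layout as df: sex in the index, smoker status in the columns.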
Next, we take the DataFrame index using df.index and save it in label, then generate a matching range of positions with the np.arange( ) function and save it in x. We will need these two objects to customise the plots. Printing them shows that label contains the sex labels (Female and Male) and that x is a NumPy array containing 0 and 1.
# Generating labels and index
label = df.index
x = np.arange(len(label))
print(label)
print(x)
Index(['Female', 'Male'], dtype='object', name='sex')
[0 1]
Understanding the plotting mechanism
The very first thing to do is to call matplotlib’s subplots( ) method to generate the figure (fig) and axes (ax) objects. The figure size is set to 8 by 6 inches.
Next, set the bar width to 0.2 and call the bar( ) method on the axes object (ax), on which the bars will be drawn.
In bar( ), we supply the columns of df separately. In the first call, we pass x (the previously generated positions) and the ‘No’ column as the x and y values, followed by the width, a label (to identify the bars in the legend) and the bar border colour via the edgecolor argument. The returned bar object is saved to rect1.
Similarly, for the ‘Yes’ column, we create another bar object and save it to rect2.
If we look at the resulting plot, the blue and orange bars sit in a single column, which is far from the desired dodged plot. This is because the bars of both groups (No/Yes) are drawn at the same x positions, one over the other.
To fix this, we need to move the blue bars to the left by 0.1 and the orange bars to the right by 0.1.
#create the base axis
fig, ax = plt.subplots(figsize = (8,6))
#set the width of the bars
width = 0.2
#add first pair of bars
rect1 = ax.bar(x,
               df["No"],
               width = width,
               label = "No",
               edgecolor = "black")
#add second pair of bars
rect2 = ax.bar(x,
               df["Yes"],
               width = width,
               label = "Yes",
               edgecolor = "black")
Now, if we subtract 0.1 from the x positions of the blue bars (x - width/2), add 0.1 to those of the orange bars (x + width/2) and plot again, the bars appear as dodged bars.
One problem remains: the x-axis tick labels do not match the final plot we want. We need to correct them.
#create the base axis
fig, ax = plt.subplots(figsize = (8,6))
#set the width of the bars
width = 0.2
# create the first pair of bars, shifted to the left by width/2
rect1 = ax.bar(x - width/2,
               df["No"],
               width = width,
               label = "No",
               edgecolor = "black")
# create the second pair of bars, shifted to the right by width/2
rect2 = ax.bar(x + width/2,
               df["Yes"],
               width = width,
               label = "Yes",
               edgecolor = "black")
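The shift of -width/2 and +width/2 is the two-group case of a more general rule: with n groups, bar number i is offset by (i - (n - 1) / 2) * width, so that the whole cluster stays centred on the tick. The sketch below illustrates this; dodge_offsets is a hypothetical helper name, not a matplotlib function.

```python
import numpy as np

def dodge_offsets(n_groups, width):
    """Hypothetical helper: x-offsets that centre n_groups bars on each tick."""
    return (np.arange(n_groups) - (n_groups - 1) / 2) * width

print(dodge_offsets(2, 0.2))  # two groups: -width/2 and +width/2
print(dodge_offsets(3, 0.2))  # three groups: -width, 0 and +width
```

In a loop over the columns of df, each column would then be drawn at x plus its own offset.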
Let’s reset the x-axis ticks using set_xticks(x), which places them at the values stored in x. The ticks now read 0 and 1, but we still need to label them by sex.
# Reset x-ticks
ax.set_xticks(x)
fig
Next, set the x-tick labels using the set_xticklabels( ) method by supplying the label object (created initially). Now we have got the desired x-tick labels.
# Setting x-axis tick labels
ax.set_xticklabels(label)
fig
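On recent matplotlib releases the two calls can be combined, since set_xticks( ) also accepts a labels argument (to my knowledge from version 3.5 onwards). A minimal sketch with placeholder bar heights:

```python
import numpy as np
import matplotlib.pyplot as plt

label = ["Female", "Male"]      # same labels as df.index above
x = np.arange(len(label))

fig, ax = plt.subplots()
ax.bar(x, [1, 2])               # placeholder heights, only to have bars to label
# Tick positions and tick text set together (assumes Matplotlib >= 3.5)
ax.set_xticks(x, labels=label)
```

On older versions, the two-step set_xticks( ) and set_xticklabels( ) approach shown above remains the way to go.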
Concept of Patch objects (groups)
Now let’s move to one of the important concepts in bar plots: the patch. Every rectangle you see in a bar plot is known as a patch object, which stores information such as the bar’s height, width, x and y position, colour, etc. Let’s inspect the patches of our axes (ax) object. Accessing the .patches attribute of the axes (ax) shows that it contains 4 patch objects, corresponding to the 4 bars.
# Number of patches
ax.patches
<Axes.ArtistList of 4 patches>
To retrieve the information and make use of it, we need to know the order of the patches.
- The blue patches contain information regarding the “No” column and the orange patches contain information regarding “Yes” column.
- The order will be blue Female bar (patch 0), blue Male bar (patch 1), orange Female bar (patch 2), orange Male bar (patch 3).
Let’s retrieve the height from the first patch. To do so, you need to select the first patch object using .patches[0] and apply the get_height( ) method, which reveals the height, i.e., 62.07.
# 0 & 1 are blue pair; 2 & 3 are orange pair (left to right)
ax.patches[0].get_height()
62.07
Labelling bars
Now that we know the concept of patches, we can add a label to each bar by retrieving the height from each patch object in a for loop. The steps are as follows:
Step 1: Loop through the patch objects (ax.patches), assigning each one to a temporary variable ‘p’.
Step 2: Use the ax.annotate( ) method to add the label. It takes the label text and the x and y position. We retrieve the height using get_height( ) and convert it to a string using str( ) so that a percentage (%) symbol can be appended. The position is derived from get_x( ) and get_height( ), offset by 0.03 in the x-direction and 1 in the y-direction so that the text sits just above the bar. Save the returned annotation to a variable ‘t’.
Step 3: Use the set( ) method to change the properties of the annotated text.
# Adding bar values
for p in ax.patches:
    t = ax.annotate(str(p.get_height()) + "%", xy = (p.get_x() + 0.03, p.get_height() + 1))
    t.set(color = "black", size = 14)
fig
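Recent matplotlib releases (3.4 and later, as far as I am aware) also provide ax.bar_label( ), which labels every bar of a container in one call and centres the text automatically. The sketch below is an alternative to the manual loop; the percentages are placeholders for illustration, not values computed from the tips data.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(2)
width = 0.2
fig, ax = plt.subplots(figsize=(8, 6))
# Placeholder percentages for illustration, not computed from the tips data
rect1 = ax.bar(x - width/2, [60.0, 55.5], width=width, label="No")
rect2 = ax.bar(x + width/2, [40.0, 44.5], width=width, label="Yes")

# One call per bar container; fmt appends the % sign, padding is in points
labels1 = ax.bar_label(rect1, fmt="%.2f%%", padding=3, size=14)
labels2 = ax.bar_label(rect2, fmt="%.2f%%", padding=3, size=14)
```

The manual loop is still worth knowing, because it works on any patch and gives full control over the label position.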
Customising bar plot
The first customisation step is to remove some spines (plot border lines). I usually prefer turning off the top and right spines. To do this, loop over the spine names, select each spine with ax.spines[position] and call set_visible(False).
We can also alter the tick parameters [using tick_params( )], and axis labels [using set_xlabel( ) and set_ylabel( )] to make the plot informative and aesthetically beautiful.
# Remove spines
for s in ["top", "right"]:
    ax.spines[s].set_visible(False)
# Adding axes and tick labels
ax.tick_params(axis = "x", labelsize = 14, labelrotation = 45)
ax.set_ylabel("Percentage", size = 14)
ax.set_xlabel("Sex", size = 14)
fig
Last but not least, let’s customise the legend. Here, using the ax.legend( ) method, I have changed the existing labels to “N” and “Y”.
The axes area ranges from 0 to 1 in both the x and y directions (axes coordinates). We can use this to position the legend near the middle of the plot. We access the legend via ax.legend_ and set its position using .set_bbox_to_anchor( ), supplying the x and y position as a list.
Now our plot is finalized and ready to use.
# Customize legend
ax.legend(labels = ["N", "Y"],
          fontsize = 12,
          title = "Smoker",
          title_fontsize = 18)
# Fix legend position
ax.legend_.set_bbox_to_anchor([0.6, 0.5])
fig
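The position can also be given directly in the ax.legend( ) call through the loc and bbox_to_anchor arguments. A minimal sketch with placeholder bars; here loc="center" pins the centre of the legend box to the given point, which is a slightly different placement from the one used above.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.bar([0, 1], [3, 4], label="No")    # placeholder bars
ax.bar([0, 1], [1, 2], label="Yes")

# loc names the point of the legend box that is pinned to bbox_to_anchor,
# which is given in axes coordinates (0 to 1 in each direction)
leg = ax.legend(title="Smoker", fontsize=12, title_fontsize=18,
                loc="center", bbox_to_anchor=(0.5, 0.5))
```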
Saving the plot
To save a plot, call the savefig( ) method on the figure object (fig), supplying the file path (images/dodged_barplot.png) and the resolution (dpi = 300).
# Save figure object
fig.savefig("images/dodged_barplot.png", dpi = 300)
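Note that savefig( ) does not create missing folders, so the target directory has to exist first. The sketch below creates it with pathlib before saving; a temporary folder stands in for the article’s images/ directory, and bbox_inches="tight" is an optional extra that trims surrounding whitespace.

```python
from pathlib import Path
import tempfile
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.bar([0, 1], [1, 2])  # placeholder plot

# A temporary folder stands in for the article's images/ directory
out_dir = Path(tempfile.mkdtemp()) / "images"
out_dir.mkdir(parents=True, exist_ok=True)   # create the folder if it is missing
out_file = out_dir / "dodged_barplot.png"
fig.savefig(out_file, dpi=300, bbox_inches="tight")
```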
Dodged bar plot using pandas DataFrame’s plot( ) method
The next step is to generate the same dodged plot using the plot( ) method of a pandas DataFrame.
The first step is to prepare the data, exactly as we did for the last plot.
tips = pd.read_csv("datasets/tips.csv")
df = (tips
.groupby("sex")["smoker"]
.value_counts(normalize=True)
.mul(100)
.round(2)
.unstack())
df
Pandas plot( ) method
Let’s generate the dodged plot using the pandas plot( ) method. The steps are as follows:
Step 1: Use the subplots( ) method from matplotlib to generate the figure (fig) and axes (ax) objects. Set the figure size to 10 by 4 inches.
Step 2: Apply the plot( ) method on the DataFrame (df) object. Specify kind = "bar", ax = ax and edgecolor = "black".
Bam! Your plot framework is almost ready.
fig, ax = plt.subplots(figsize = (10, 4))
df.plot(kind = "bar",
        ax = ax,
        edgecolor = "black")
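Under the hood, pandas draws one group of bars per column on the axes we passed in, so everything we learned about patches still applies. The sketch below checks this on a placeholder table that mimics the layout of df (the values are illustrative, not the tips percentages); rot = 0 is an optional pandas argument that keeps the tick labels horizontal.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder table with the same layout as df (index = sex, columns = smoker)
wide = pd.DataFrame({"No": [60.0, 55.5], "Yes": [40.0, 44.5]},
                    index=pd.Index(["Female", "Male"], name="sex"))

fig, ax = plt.subplots(figsize=(10, 4))
returned = wide.plot(kind="bar", ax=ax, edgecolor="black", rot=0)

# One bar container per column, each holding one bar per index row
for container in ax.containers:
    print([bar.get_height() for bar in container])
```

The patch order is therefore the same as in the matplotlib version: all ‘No’ bars first, then all ‘Yes’ bars.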
The next part is labelling and customising the plot, which is exactly the same as in the raw matplotlib-based approach. Here, I did not alter the legend labels [“No”, “Yes”].
# Add data labels
for p in ax.patches:
    t = ax.annotate(str(p.get_height()) + "%", xy = (p.get_x() + 0.03, p.get_height() + 1))
    t.set(color = "black", size = 14)
# Remove spines
for s in ["top", "right"]:
    ax.spines[s].set_visible(False)
# Add axes labels and tick parameters
ax.set_xlabel("Sex", size = 14)
ax.set_ylabel("Percentage", size = 14)
ax.tick_params(labelsize = 14, labelrotation = 0)
# Customise legend
ax.legend(labels = ["No", "Yes"],
          fontsize = 12,
          title = "Smoker",
          title_fontsize = 18)
# Fix legend position
ax.legend_.set_bbox_to_anchor([0.5, 0.3])
fig
Dodged barplot with pandas DataFrame [seaborn style]
Next, we will generate the same plot using seaborn. Seaborn expects the input data as a pandas DataFrame.
The process of calculating the group-wise proportions is similar, with small differences. Here, use rename( ) to name the values ‘percent’ and the reset_index( ) method instead of unstack( ) to convert the index levels to columns. The output is a long-format pandas DataFrame with one row per sex and smoker combination and the percentage in its own column.
df = (
    tips
    .groupby("sex")["smoker"]
    .value_counts(normalize = True)
    .mul(100)
    .rename('percent')
    .reset_index()
    .round(2)
)
df
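If you already have the wide table from the earlier sections, the long format can also be obtained from it with melt( ). A minimal sketch on a placeholder wide table (the values are illustrative, not the tips percentages):

```python
import pandas as pd

# Placeholder wide table (index = sex, columns = smoker), illustrative values
wide = pd.DataFrame({"No": [60.0, 55.5], "Yes": [40.0, 44.5]},
                    index=pd.Index(["Female", "Male"], name="sex"))
wide.columns.name = "smoker"

# One row per (sex, smoker) pair, the layout seaborn works with
long_df = wide.reset_index().melt(id_vars="sex", var_name="smoker",
                                  value_name="percent")
print(long_df)
```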
Plotting a dodged plot [seaborn method]
Here, we use the catplot( ) function from the seaborn library. We supply the x variable as "sex", the y variable as "percent", the colour grouping as hue = "smoker", kind = "bar", the DataFrame object (df) and legend = False.
Since catplot( ) does not take an axes (ax) object, we need to retrieve the axes (ax) and figure (fig) objects after plotting.
We can retrieve the axes (ax) object using the plt.gca( ) and figure (fig) object using the plt.gcf( ). The gca refers to `get current axes` and gcf refers to the `get current figure`.
sns.catplot(x = "sex",
            y = 'percent',
            hue = "smoker",
            kind = 'bar',
            data = df,
            legend = False)
ax = plt.gca()
fig = plt.gcf()
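Another way to get hold of these objects is through the grid object that catplot( ) returns, which exposes the axes and the figure as attributes. This is a sketch on placeholder long-format data; I am assuming a reasonably recent seaborn here, since older releases name the figure attribute g.fig instead of g.figure.

```python
import pandas as pd
import seaborn as sns

# Placeholder long-format data, illustrative values only
long_df = pd.DataFrame({
    "sex": ["Female", "Female", "Male", "Male"],
    "smoker": ["No", "Yes", "No", "Yes"],
    "percent": [60.0, 40.0, 55.5, 44.5],
})

g = sns.catplot(x="sex", y="percent", hue="smoker", kind="bar",
                data=long_df, legend=False)
ax = g.ax          # the single Axes of the grid
fig = g.figure     # on older seaborn releases this attribute is g.fig
```

This avoids relying on the current plot state, which matters when several figures are open at once.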
The next step is customising the plot, i.e., adding data labels and modifying the ticks and axis labels.
Lastly, we will fix the size of the plot using the fig.set_size_inches( ).
sns.catplot(x = "sex",
            y = 'percent',
            hue = "smoker",
            kind = 'bar',
            data = df,
            legend = False)
################################
# Customization
################################
# Retrieve axis and fig objects from the current plot environment
ax = plt.gca()
fig = plt.gcf()
# Add bar labels
for p in ax.patches:
    p.set_edgecolor("black") # Add black border across all bars
    # round() works whether the height is a Python float or a NumPy float
    t = ax.annotate(str(round(p.get_height(), 2)) + "%", xy = (p.get_x() + 0.1, p.get_height() + 1))
    t.set(size = 14)
# Adding axes labels and tick parameters
ax.set_xlabel("Sex", size = 16)
ax.set_ylabel("Percentage", size = 16)
ax.tick_params(labelsize = 14)
# Legend customisation
ax.legend(fontsize = 12,
          title = "Smoker",
          title_fontsize = 12)
# Resetting figure size
fig.set_size_inches(8, 4)
Once you learn base matplotlib, you can customise the plots in various ways. I hope you now know various ways to generate a dodged plot. Apply the learned concepts to your datasets.
I hope you learned something new!