thumbnail

Data Manipulation in Python using the dfply Package (dplyr-Style)

Introduction

I have been working with R programming language since 2018. R is a statistical programming language very popular among data science enthusiasts, statisticians, and academicians. If you have ever used R then you might have encountered the popular data manipulation package called “dplyr” developed by [Hadley Wickham [aut, cre], Romain François [aut], Lionel Henry [aut], Kirill Müller [aut] and RStudio [cph, fnd] ]. This package contains several essential tools that make your data manipulation fast, fluent, and fun. The dplyr library/package allows you to write functions in chained order, which is very very helpful.

I personally use R for data manipulation and Python for ML. Even before starting a model training in Python I usually perform some sort of manipulation to better understanding the dataset. In the Python world, the Pandas package is one of the most popular packages usually used for data manipulation but I personally like the chaining style of the dplyr package when I need to perform very complex operations.

You can enjoy the fun of the dplyr package in Python too. Kiefer Katovich has implemented similar essential functions of dplyr into a new Python package known as dfply. The dfply package’s philosophy is primarily based on the following major functions and it is constantly updated by the package maintainer.

  • select() picks variables based on their names.
  • filter_by() picks cases based on their values.
  • mutate() adds new variables that are functions of existing variables
  • summarise() reduces multiple values down to a single summary.
  • arrange() changes the ordering of the rows.

The current article focuses on showing the core dfply functionality using the popular hflights dataset/pandas data frame.

Article Outline

  • Introduction
  • dfply installation
  • Description of hflights dataset
  • Data Manipulation using primary dfply functions
  • Others important functions
  • Dataset and code link

dfply installation

The very first step is to install the dfply package using pip or anaconda installation. This package requires Python 3.

# pip installation
pip install dfply
# conda installation
conda install -c tallic dfply

Description of hflights dataset

The hflights dataset contains all flights departing from Houston airports IAH (George Bush Intercontinental) and HOU (Houston Hobby). The data set contains the following features/variables.

YearMonthDaydate of departureDay of the week of departure (DayOfWeek), Departure and arrival times (DepTime, ArrTime), Unique abbreviation for a carrier (UniqueCarrier), Flight number (FlightNum), Airplane tail number (TailNum), Elapsed time of flight in minutes (ActualElapsedTime), Flight time in minutes (AirTime), Arrival and Departure delays in minutes (ArrDelay, DepDelay), Origin and Destination airport codes (Origin, Dest), Distance of flight in miles (Distance), Taxi in and Out times in minutes (TaxiIn, TaxiOut), Cancelled indicator: 1 = Yes, 0 = No (Cancelled), Diverted indicator: 1 = Yes, 0 = No (Diverted).

Importing essential packages

The first step is to import the essential packages before dive into data manipulation fun.

import pandas as pdfrom dfply import *

Loading hflights dataset

Let’s import the hflights data into the python environment using pandas pd.read_csv() function. I have cleaned the dataset by removing missing values and few columns.

hflights = pd.read_csv("hflights_cleaned.csv", index_col=0)
hflights.head()
Image for post
First few columns

Checking the column names

Let’s check the column names using .columns method.

hflights.columns
Image for post

Introduction to the pipe method of dfply

R’s magrittr package first introduced the pipe operator >>The operator pipe their left-hand side values forward into expressions that appear on the right-hand side, i.e. one can replace f(x) with x >> f(), where >> is the (main) pipe-operator.

dfply works directly on pandas DataFrames, chaining operations on the data with the >> operator, or alternatively starting with >>= for inplace operations.

Lets Start piping

Observing the head or tail of the data frame

Let’s start with the most common operation observing first or last few row examples using head( ) or tail( ) function but more like dfply way (chaining/piping). Let’s check first 2 rows of hflight data frame by chaining head( ) function with >>.

hflights >> head(2)
Image for post
Due to page limitation only a few variables are printed

Similarly, you can check the last observations/rows to get a glimpse of the data using the tail() function.

A) Selecting and dropping columns

You often need to work with datasets that include a large number of columns/variables but only a few of them will be the variable of interest. Let’s say you want to select Year, Origin, and Dest from hflights dataset and print top 2 rowsThe variable selection can be done using X dot -> [X.dot] or quotation ‘var1’ , ‘var2’ … or column index number.

A1) X dot way:

(hflights 
  >> select(X.Year, X.Origin, X.Dest) 
  >> head(2))
Image for post

A2) Quote and index-based selection

(hflights >>
 select(0, ['Origin', 'Dest']) >>
 head(2))
Image for post

A3) starts_with( ):

Sometimes we need to select variables that start with a common name/string/characters. One example could be column names starts with “date” prefix. Here in this dataset, say we want to select columns that start with Arr character.

(hflights >>
 select(starts_with("Arr")) >>
 head(2))
Image for post

A4. Drop using a tilde (~):

we can put a tilde sign in front of variables to exclude them from the column selection.

Arr_not_in_data = hflights >> select(~starts_with("Arr"))
Arr_not_in_data.columns
Image for post

The above data frame now doesn’t include columns that start with Arr characters.

A5) columns_from( ): Dropping column UniqueCarrier onwards

The drop function is opposite to selection. We could use the columns_from selection filter along with drop( ) function to drop all columns “UniqueCarrier” onwards and select the top 2 rows of the rest.

(hflights >>
drop(columns_from(X.UniqueCarrier)) >>
head(2))
Image for post

A6) Column_to() and column_from() together

Let’s say we want to select the first two columns, the last three columns, and the “DayOfWeek” column together. The inclusive = True argument keeps the column with index 1.

(hflights >> 
select(columns_to(1, inclusive=True),
       'DayOfWeek',
       columns_from(-3)) >>
head(2))

B) Subset selection based on filter_by( )

filter_by( ) function helps you to select a subset of a pandas data frame based on logical conditions. Let’s say we want to select a subset of rows where Arrival Delay ≥ 500 and based on this condition we want to select all columns that start with Arr. You can set multiple conditions separated with commas.

(hflights >>
 filter_by(X.ArrDelay >= 500) >>
 select(contains("Arr")) >>
 head(3))
Image for post

Pulling out columns using pull() function

In case you only care about one particular column at the end of your pipeline, you can pull out a column and returns it as a pandas series. Let’s pull out the ArrTime column based on the condition of ArrDelay≥500

(hflights >>
 filter_by(X.ArrDelay >= 500) >>
 pull('ArrTime') >>
 head(4))
Image for post
four rows are shown due to head(4) function

C) Creating new columns using mutate( )

Let’s say we want to calculate the total delay of each flight. This can be calculated using X.ArrDelay + X.DepDelay. We can pass this inside the mutate function and assign to a new column name Total_Delay.

C1) mutate( ):

(hflights >> 
  select(X.ArrDelay, X.DepDelay) >> 
  mutate(Total_Delay = X.ArrDelay + X.DepDelay) >> 
  head(3))
Image for post

C2) transmute( ):

Sometimes we want the newly generated column data only; to perform different statistical analysis on that. Plotting a histogram could be a great example. In such cases, we could use the transmute( ) function.

(hflights >> 
  select(X.ArrDelay, X.DepDelay) >> 
  transmute(Total_Delay = X.ArrDelay + X.DepDelay) >> 
  head(3))
Image for post

D) sorting rows with arrange( )

Let’s, for example, we want to arrange our dataset based on decreasing values of ArrDelay column. To achieve this we could use the arrange( ) function and supply the ArrDelay column and set the ascending = False.

(hflights >>
 arrange(X.ArrDelay, ascending=False) >>
 select(X.ArrDelay)>>
 head())
Image for post

E) Summarizing data using summarize( )

Summarize help in computing single row statistics such as mean, standard deviation, minimum and maximum etc.

E1. summarize( ):

For example, if we want to calculate the mean arrival delay and mean departure delay we could you the summarize( ) function and supply the columns with a dot method; in our case mean( ).

(hflights >>
 summarize(mean_arr_delay = X.ArrDelay.mean(),
           mean_dep_delay = X.DepDelay.std()))
Image for post

E2. summarize_each( ):

In some situation, we need to compute multiple statistics for multiple variables. In such scenarios we could use summarize_each( ) function. Here, in the following example, we applied the mean (np.mean) and standard deviation (np.std) function on ArrDelay and DepDelay.

(hflights >>
 summarize_each([np.mean, np.std], X.ArrDelay, X.DepDelay))
Image for post

E3. Grouping data using group_by( )

group_by( ) function allows you to group data based on one or multiple columns. This helps us to compute statistics on each group of data separately.

For example, we want to compute the monthly mean and standard deviation of Arrival Delay and Departure Delay. This can be achieved using the group_by(X.Month). We assigned the resulting data frame to monthly_stats variable and printed the first 4 months.

monthly_stats = (hflights >> 
  select(X.ArrDelay, X.DepDelay, X.Month) >> 
  group_by(X.Month) >> 
  summarise_each([np.mean, np.std], X.ArrDelay, X.DepDelay))
monthly_stats.head(4)
Image for post

E4. Plotting resulting data frame combining pandas plot( )

Instead of getting just only the data frame, we could add a pandas plot function in the resulting data frame. Here, we plotted a line plot using Month on the x-axis and mean Arrival Delay on the y-axis.

(hflights >> 
  select(X.ArrDelay, X.DepDelay, X.Month) >> 
  group_by(X.Month) >> 
  summarise_each([np.mean, np.std],
                 X.ArrDelay,
                 X.DepDelay)).plot(x = "Month",
                                   y = "ArrDelay_mean",
                                   kind = "line")
Image for post

F) Others important functions

There are plenty of useful functions available in dfply package. Here are a few of them.

F1. Sampling

Taking a random sample in the data science domain is a very common task. A random sample can be obtained supplying fraction argument (frac) or number (n) inside the sample( ) function.

In this example, we used the sample( ) function and supplied fraction (frac) argument. One can also set the replacement to True/False.

hflights >> sample(frac=0.2, replace=False)

if you know how many random samples do you want in advance then use (number of the sample) argument instead of frac (fraction). Here we set replacement to True.

hflights >> sample(n=4, replace=True)

F2. distinct( ):

The distinct( ) function helps in identifying the number of distinct levels available in a categorical column.

hflights >> distinct(X.Origin) >> select(X.Origin)
Image for post

F3. rename( ):

The rename() function will rename columns provided as values to what you set as the keys in the keyword arguments.

hflights = hflights >> rename(Destination = X.Dest)
hflights.columns
Image for post

F4. Count using n( )

Let’s estimate the monthly flights count where Departure Delay ≤ 10 minutes using n( ) function.

(hflights >>
 filter_by(X.DepDelay <= 10) >>
 group_by(X.Month) >>
 summarize(count = n(X.FlightNum)))
Image for post

F5. mutate( ) + between( )

One can use between( ) function inside the mutate( ) function to create a boolean variable. Here, we have created an arr_delay_btwn boolean variable and coded levels as True if ArrDelay within 1 and else False.

(hflights >> 
 select(X.ArrDelay) >> 
 mutate(arr_delay_btwn = between(X.ArrDelay, 1, 5)) >>
 head(6))
Image for post

F6. row_slice( ): Row Slicing

You can use row_slice( ) function to slicing rows from a pandas data frame.

hflights >> row_slice([2,5])
Image for post

Let’s calculate the average flight speed

Let’s go deeper and compute the flight’s average travel speed from each origin. To get the speed in kilometre per hour (kmph), we need to convert the distance (in mile) to kilometre and AirTime (in minutes) to an hour unit. To convert the distance into km unit we need to multiply distance with 1.60934. Similarly, we need to divide AirTime by 60 to convert it in hour unit.

(hflights >>
 select(X.Origin, X.Distance, X.AirTime) >> 
 mutate(distance_km = X.Distance * 1.60934,
        travel_time_hr = X.AirTime/60) >>
 group_by(X.Origin) >>
 mutate(speed = X.distance_km/X.travel_time_hr) >>
 summarise(avg_speed = X.speed.mean()) >>
 arrange(X.avg_speed, ascending=False)
)
Image for post

Conclusion

dfply definitely going to make your data manipulation easy and fun. In this article, I have just scratched the surface of dplyr style functions. To know more, you must visit the GitHub page of dfply documentation.

Dataset and Code

Click here for code and dataset

** Hoping this blog would help **

See you next time!