What is chaining? I want to know more…
Method chaining is a programming style in which you call several methods one after another on a single underlying object, with each call operating on the result of the previous one. From a data science perspective it is useful in areas such as insight analysis and preparing data for modelling. You’ll likely get more benefit from it in some of these areas than in others.
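To make the idea concrete, here is a minimal sketch that compares a step-by-step version with its chained equivalent. The `sales` dataframe and its column names are made up purely for illustration.

```python
import pandas as pd

# Hypothetical toy data, invented purely for illustration
sales = pd.DataFrame({
    "region": ["north", "south", "north", "south", "east"],
    "units": [10, 3, 7, 12, 5],
    "price": [2.0, 4.0, 2.0, 4.0, 3.0],
})

# Step-by-step style: every stage gets its own intermediate variable
with_revenue = sales.copy()
with_revenue["revenue"] = with_revenue["units"] * with_revenue["price"]
big_orders = with_revenue[with_revenue["units"] >= 5]
stepwise = big_orders.groupby("region")["revenue"].sum().sort_values(ascending=False)

# Chained style: the same stages, read top to bottom as one expression
chained = (
    sales
    .assign(revenue=lambda d: d["units"] * d["price"])
    .query("units >= 5")
    .groupby("region")["revenue"]
    .sum()
    .sort_values(ascending=False)
)

print(chained)
```

Both versions produce the same result. The chained one simply avoids naming `with_revenue` and `big_orders`.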
There are some great resources available on the web. Check out the following:
- Matt Harrison’s Pandas Chaining Tutorial:
- Tom Augspurger’s blog post:
- Sin Yi Chou blog post on pipe operators:
- Pipe method chaining blog posts:
- Others
Many more exist, so I recommend you google around for more as you explore…
This style of programming in pandas was likely inspired by the pipe-based workflows of the dplyr package in R, and it is said to have roots in the Smalltalk programming language, developed by researchers in the 1970s and 80s.
Benefits of method chaining (e.g. during EDA)
Exploratory Data Analysis (EDA) is the process of examining a dataset to understand its main characteristics. It typically does not involve building models; the focus is on summarizing the data and visualizing it. The idea is not new: John Tukey promoted it in his 1977 book Exploratory Data Analysis.
Although there is no standard approach when beginning a data analysis, it is typically a good idea to develop a routine for yourself when first examining a dataset. Similar to everyday routines that we have for waking up, showering, going to work, eating, and so on, a data analysis routine helps you to quickly get acquainted with a new dataset.
Be aware, though, that various sources use EDA as a blanket term for many of the operations that occur during an analysis.
Some issues people tend to come across when performing these activities without chaining are:
- Spaghetti code: When a single logical operation is spread over many lines and cells, it becomes difficult to keep track of what is happening. Method chaining makes the code more readable by grouping related operations together.
- Memory-inefficient code: Every intermediate result stored in its own variable stays in memory for as long as that variable exists. Method chaining does not make the individual operations faster, but it avoids keeping those named intermediates around, as well as the clutter they add.
- Less versatile code: Steps that depend on a trail of intermediate variables are harder to reorder, remove, or reuse. In a chain, each step depends only on the output of the previous one, so adding, dropping, or rearranging steps is usually a small change.
Key benefits of method chaining include:
- Conciseness: You can perform multiple operations on an object in a single expression instead of a series of separate statements.
- Readability: Related operations are grouped together and read top to bottom in the order they are applied.
- Efficiency: Fewer intermediate variables are kept around, which can reduce memory use, although the operations themselves run no faster.
- Flexibility: You can chain together any number of methods, provided each one can work on the output of the one before it.
Even with these pluses, people usually complain that chains are difficult to debug since no intermediate variables are created. Some good responses are:
- Build the chain up line by line, running it as you go.
- Create functions that read or output intermediate results and pipe them into the chain.
- Just refactor afterwards.
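The second suggestion could look something like the sketch below. The helper name `peek` and the toy `customers` data are my own assumptions for illustration; the only pandas feature relied on is that `pipe` passes the dataframe to a function and continues with whatever that function returns.

```python
import pandas as pd

def peek(data, label=""):
    """Print the shape of the intermediate result and pass it through unchanged."""
    print(f"{label}: {data.shape[0]} rows x {data.shape[1]} columns")
    return data

# Hypothetical toy data, invented purely for illustration
customers = pd.DataFrame({
    "customer": ["a", "b", "c", "d"],
    "tenure": [1, 24, 36, 2],
    "churned": [True, False, False, True],
})

result = (
    customers
    .pipe(peek, label="raw")
    .query("tenure >= 12")
    .pipe(peek, label="after tenure filter")
    .assign(tenure_years=lambda d: d["tenure"] / 12)
    .pipe(peek, label="after assign")
)
```

Because `peek` returns its input untouched, the calls can be added or removed without changing the result of the chain.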
Ways to incorporate method chaining into your workflow…
Ultimately this truly depends on the task you are performing; however, some guiding points are:
- Start by getting used to some common functions/patterns that can be used (look over some of the example resources)
- In theory every intermediate link in a chain has to output a dataframe-like object; only the last link can return something else.
- Try playing around with chaining during EDA activities.
- Start tidying up existing code using these principles
- Replace multi-cell spaghetti code with chains.
- When trying to create a chain from scratch, think about the particular question you are trying to answer
- Keeping the principles of chaining in mind while doing so makes it a lot easier to come up with code.
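The point about intermediate links returning dataframe-like objects can be sketched as follows. The `orders` data is invented for illustration. The last part shows a common way a chain breaks: pandas methods called with `inplace=True` return `None`, so nothing can follow them.

```python
import pandas as pd

# Hypothetical toy data, invented purely for illustration
orders = pd.DataFrame({
    "status": ["open", "closed", "open"],
    "amount": [10, 20, 30],
})

# Intermediate links return dataframes, so further methods can follow
open_orders = (
    orders
    .query("status == 'open'")
    .sort_values("amount", ascending=False)
)

# The last link may return something else, here a single number
open_total = (
    orders
    .query("status == 'open'")
    .loc[:, "amount"]
    .sum()
)

# In-place operations return None, so no method can be chained after them
returned = orders.copy().sort_values("amount", inplace=True)

print(type(open_orders).__name__, open_total, returned)
```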
Basic Example: Utilising a Churn dataset
These techniques are applicable to all the problems and analyses you work on. For convenience, I will showcase some basic operations using a telco churn dataset.
Setup
You’ll need to import some basic libraries as well as load your data and view it. As a side note, I did this by downloading the file locally and using Google Drive to load it into Colab.
# Importing libraries and packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import colormaps as cm
import seaborn as sns
import io
# Setting styles
plt.style.use('ggplot')
cmap = cm['Spectral']
# Uploading local file
from google.colab import files
uploaded = files.upload()
# Loading data into a dataframe
df = pd.read_csv(io.BytesIO(uploaded['telco_churn_kaggle.csv']))
df.head()
Viewing basic descriptive statistics
When you first begin exploring a new dataset, a common thing to do is view descriptive statistics of your data to get a feel for what you are working with. In doing so you may already be chaining without formally knowing it, in which case tidying up the indentation may be all that is needed.
# Understanding the various columns
df.info()
# Total memory usage in bytes for each of your columns
df.memory_usage(deep=True)
# How many columns are available of all the different datatypes
(
df
.dtypes
.value_counts()
)
# Descriptive stats for numeric columns
(
df
.describe(include=[np.number],
percentiles=[.01, .05, .10, .25, .5, .75, .9, .95, .99])
.T
)
# Descriptive stats for text columns
(
df
.describe(include=[object, 'category'])
.T
)
# How many null values are present in each column
(
df
.isna()
.sum()
)
Generating plots
When it comes to plotting, it’s very easy to get careless and end up with code spread across several cells that all need to be run in order to generate your plots. This can be frustrating to come back to after some time, since you may have forgotten which cells in your notebook need to be run first.
With chaining as below, all the code required to generate a plot is in one place, which makes it a lot easier to come back to as well as much more pleasing to the eye.
# Plotting the number of churners by the various genders
(
df
.assign(Churn=lambda d: d['Churn'].eq('Yes'))  # astype('bool') would turn the string 'No' into True
.groupby('gender')
.agg({'Churn': 'sum'})
.plot(kind='bar', cmap=cmap, color='b', figsize=(10, 4), title='Churners by Gender')
)
# Plotting the number of senior citizens who are churners
(
df
.assign(SeniorCitizen=df.SeniorCitizen.replace({1: 'Senior', 0: 'Not Senior'}),
Churn=lambda d: d['Churn'].eq('Yes'))  # astype('bool') would turn the string 'No' into True
.groupby('SeniorCitizen')
.agg(total_churners=pd.NamedAgg(column= 'Churn', aggfunc='sum'))
.plot.bar(figsize=(10, 4), cmap=cmap, color='g', title='Churners by Senior Citizen status')
)
# We can combine these into a single plot as follows
fig, ax_array = plt.subplots(1, 3, figsize=(16,8))
(ax1, ax2, ax3) = ax_array
fig.suptitle('Data Viz Summary', size=20)
(
df
.assign(Churn=lambda d: d['Churn'].eq('Yes'))  # astype('bool') would turn the string 'No' into True
.groupby('gender')
.agg({'Churn': 'sum'})
.plot(kind='bar', cmap=cmap, color='b', title='Churners by Gender', ax=ax1)
)
(
df
.assign(SeniorCitizen=df.SeniorCitizen.replace({1: 'Senior', 0: 'Not Senior'}),
Churn=lambda d: d['Churn'].eq('Yes'))  # astype('bool') would turn the string 'No' into True
.groupby('SeniorCitizen')
.agg(total_churners=pd.NamedAgg(column= 'Churn', aggfunc='sum'))
.plot(kind='bar', cmap=cmap, color='g', title='Churners by Senior Citizen Status', ax=ax2)
)
(
df
.drop_duplicates(subset=['customerID'], keep='first')
.loc[:, 'TotalCharges']
.pipe(pd.to_numeric, errors='coerce')  # blank strings in TotalCharges become NaN instead of raising
.plot(kind='kde',
cmap=cmap,
title='Distribution of Total Charges per Customer',
xlabel='Total Charges',
ax=ax3)
)
Performing transformations/cleaning of data
First, define which transformations you want to perform, then package them up into functions.
# Decide which transformations to apply and create a function for each one
# Notice how every function returns a dataframe, which is what lets it be used with pipe
def col_name(data):
    # reset_index returns a new dataframe, so the rename below leaves the original df untouched
    data = data.reset_index(drop=True)
    data.columns = ['customer_id', 'gender', 'senior_citizen', 'partner', 'dependents',
                    'tenure', 'phone_service', 'multiple_lines', 'internet_service',
                    'online_security', 'online_backup', 'device_protection', 'tech_support',
                    'streaming_tv', 'streaming_movies', 'contract', 'paperless_billing',
                    'payment_method', 'monthly_charges', 'total_charges', 'label']
    return data

def change_types_and_label(data):
    # Blank total charges are replaced with the monthly charge before converting to float
    data.loc[data['total_charges'] == ' ', 'total_charges'] = data['monthly_charges']
    data['total_charges'] = data['total_charges'].astype('float64')
    data['senior_citizen'] = data['senior_citizen'].astype('category')
    data['loyality'] = data['contract'].apply(lambda x: 0 if x == 'Month-to-month' else 1).astype('category')
    data['label'] = data['label'].apply(lambda x: 0 if x == 'No' else 1).astype('int64')
    data['multiple_lines'] = data['multiple_lines'].apply(lambda x: 'No' if x == 'No phone service' else x)
    # 'No internet service' is treated the same as 'No' for the add-on services
    cols = ['online_backup', 'device_protection', 'tech_support', 'streaming_tv', 'streaming_movies', 'online_security']
    for col in cols:
        data[col] = data[col].apply(lambda x: 'No' if x == 'No internet service' else x)
    return data

def fillnan(data):
    # Despite its name, this step encodes internet_service as 0 (none) or 1 (any service)
    data['internet_service'] = data['internet_service'].apply(lambda x: 0 if x == 'No' else 1).astype('category')
    return data

def drop_cols(data):
    cols = ['customer_id']
    # drop returns a new dataframe, so its result has to be the return value
    return data.drop(cols, axis=1)
You can then apply these functions to your data by calling the pipe method successively, storing the result as df2 for the analysis that follows:
df2 = (
df
.pipe(col_name)
.pipe(change_types_and_label)
.pipe(fillnan)
.pipe(drop_cols)
)
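`pipe` also forwards any extra arguments to the function, which lets one helper serve several chains. The sketch below is not the author's code: `to_snake_case` and `drop_columns` are hypothetical helpers, and the two-row `raw` dataframe only imitates a few of the telco columns. Each helper returns a new dataframe instead of modifying its input, so the original data stays available for re-running the chain.

```python
import re

import pandas as pd

def to_snake_case(data):
    """Return a copy with column names such as 'MonthlyCharges' turned into 'monthly_charges'."""
    return data.rename(columns=lambda c: re.sub(r"(?<=[a-z])(?=[A-Z])", "_", c).lower())

def drop_columns(data, cols):
    """Return a copy without the given columns."""
    return data.drop(columns=cols)

# Hypothetical two-row sample that imitates a few of the telco columns
raw = pd.DataFrame({
    "customerID": ["001", "002"],
    "MonthlyCharges": [29.85, 56.95],
    "TotalCharges": ["29.85", " "],
})

clean = (
    raw
    .pipe(to_snake_case)
    # Blank strings cannot be parsed, so errors="coerce" turns them into NaN
    .assign(total_charges=lambda d: pd.to_numeric(d["total_charges"], errors="coerce"))
    # Extra keyword arguments given to pipe are passed on to the helper
    .pipe(drop_columns, cols=["customer_id"])
)

print(clean)
```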
Answering specific questions
This is probably one of the main goals you will have when performing insight analysis; after all, analysis tends to stem from questions. I have found that the deeper the question, the more benefit you get from chaining!
# Which payment method has the highest ratio of churners?
(
df2
.groupby('payment_method')['label']
.agg(['sum', 'count'])
.assign(ratio = lambda x: x['sum'] / x['count'])
.sort_values(by='ratio', ascending=False)
.style.background_gradient(low=0.42)
)
# What is the ratio of churners by gender?
(
df2
.groupby('gender')['label']
.agg(['sum', 'count'])
.assign(ratio = lambda x: x['sum'] / x['count'])
.sort_values(by='ratio', ascending=False)
.style.background_gradient(low=0.42)
)
# Which type of contract owners churn the most?
(
df2
.groupby('contract')['label']
.agg(['sum', 'count'])
.assign(ratio = lambda x: x['sum'] / x['count'])
.sort_values(by='ratio', ascending=False)
.style.background_gradient(low=0.42)
)
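The three chains above differ only in the column they group by, so they could be folded into one function. The sketch below assumes a hypothetical helper named `churn_rate_by` and a toy `sample` dataframe; it relies on a 0/1 `label` column like the one created earlier, and it leaves out the styling step so that it returns a plain dataframe.

```python
import pandas as pd

def churn_rate_by(data, column, label="label"):
    """Count churners and customers per group and sort by the churn ratio."""
    return (
        data
        .groupby(column)[label]
        .agg(churners="sum", customers="count")
        .assign(ratio=lambda d: d["churners"] / d["customers"])
        .sort_values(by="ratio", ascending=False)
    )

# Hypothetical toy data, invented purely for illustration
sample = pd.DataFrame({
    "payment_method": ["card", "card", "check", "check", "check", "transfer"],
    "label": [0, 1, 1, 1, 0, 0],
})

summary = sample.pipe(churn_rate_by, column="payment_method")
print(summary)
```

Since the helper returns a dataframe, the gradient styling can still be added as the last link, for example `summary.style.background_gradient(low=0.42)`.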
Additional sections: TBA!
There are plenty of other ways to use chaining, so I recommend you explore and find ways to incorporate it into your own work.