Last year I wrote an article about why I try to avoid using dataframes. Dataframes are powerful because they are so versatile and flexible, but my argument was that rigorous data analysis code should introduce as much rigidity as possible to reduce the likelihood of mistakes. As an alternative, I suggested we implement data pipelines as transformations between custom object types - types you define yourself with (intentionally) limited functionality and readable structure. That said, as a professional social scientist who often has to analyze survey data, I recognize that sometimes dataframes are still the best tools for the job. In this article, I will explain how my suggested approaches can be used for dataframe-oriented pipelines.
Encapsulation is one of the most basic principles of object-oriented programming: it means bundling data elements with the methods that operate on them (fundamental to Python classes) and creating interfaces for working with those data elements rather than requiring users to access or modify the data directly. By encapsulating dataframes within custom types, we can create interfaces that (1) implicitly or explicitly enforce the expected structure of the data and (2) expose a clear set of transformations that may be applied to our particular dataset. Now I’ll use some Python examples to show how we can use this principle in our design patterns.
For these examples, I’ll use the iris dataset, which can easily be accessed through the seaborn package. The code below loads the dataset and prints the first few rows.
import pandas as pd
import dataclasses
import typing
import seaborn
iris_df = seaborn.load_dataset("iris")
print(iris_df.head(3))
The first several rows of that dataframe look like this:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
Create Custom Types to Encapsulate the Dataframes
Now we create a custom type to encapsulate this dataframe. We can do this by creating a new dataclass with a single attribute: the dataframe we are encapsulating (see also my tips for writing effective dataclasses). I will also add a __repr__ method to make any output more readable. Note that here I’m using an attribute name with a double underscore prefix “__” so that Python applies name mangling to it: the inner dataframe cannot be accessed (at least not casually) from outside the methods of this class. For most cases I believe this is overkill - you need not place that restriction unless you believe something could go very wrong.
@dataclasses.dataclass
class IrisData0:
    __df: pd.DataFrame

    def __repr__(self) -> str:
        return f'{self.__class__.__name__}(size={self.__df.shape[0]})'
The dataclass decorator creates a constructor for this object under the hood, so creating a new IrisData0 object is as simple as calling that constructor.
idata = IrisData0(iris_df)
idata
The __repr__ method we created above shows the class name and the number of rows in the table.
output:
IrisData0(size=150)
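As a caveat, the double underscore prefix works through Python’s name mangling rather than any hard access control, so it discourages rather than prevents outside access. A minimal check of that behavior (the mangled attribute name below follows Python’s standard scheme):
try:
    idata.__df  # raises AttributeError: name mangling hides the attribute
except AttributeError:
    pass

idata._IrisData0__df  # the mangled name is still reachable, so this is a convention, not a lock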
Because the dataframe itself is now hidden, we need to create methods that give users explicit access to aspects of the underlying data. I will create two property methods that return columns of the underlying dataset. For this application, I would recommend creating a property for each column of the dataframe.
    # properties defined as part of the IrisData0 class
    @property
    def sepal_length(self) -> pd.Series:
        return self.__df['sepal_length']

    @property
    def species(self) -> pd.Series:
        return self.__df['species']
These properties return regular pandas Series objects, so we may transform them, but they cannot be reassigned by the downstream user. For instance, we can compute the number of observations by species.
idata.species.value_counts()
And the result is a regular Series object.
output:
species
setosa 50
versicolor 50
virginica 50
Name: count, dtype: int64
To make this object useful, we will also need to create methods for
transforming the data. A common operation we might want to perform on
the data is to calculate the average lengths and widths by species.
Recall that we must implement this as part of the IrisData0
class because the contained dataframe should not be exposed to the
downstream user.
    # another method of the IrisData0 class
    def species_mean_dataframe(self) -> pd.DataFrame:
        return self.__df.groupby('species').mean()
The downstream user can call the method without accessing the dataframe directly.
idata.species_mean_dataframe()
This method returns a regular dataframe that can be manipulated however the user sees fit.
output:
sepal_length sepal_width petal_length petal_width
species
setosa 5.006 3.428 1.462 0.246
versicolor 5.936 2.770 4.260 1.326
virginica 6.588 2.974 5.552 2.026
Oh, the Power
While this most basic example is trivial to implement, the improvement over using raw dataframes cannot be overstated. Here are some of the benefits:
- We gave it a name. This is so important. The name can be used as a type hint for static analyzers or in the signatures of downstream functions so that you know what kind of dataframe is expected (see the sketch after this list). If you inspect the dataframe inside objects of this type, you will have certain expectations for what the underlying data will look like regardless of where it fits within your data pipeline.
- We restricted the ways that users can access the underlying data. The user can only access the properties we defined on the object. From simply inspecting our object definition, we know that the user will never be able to change the column or index names, for instance, without changing the underlying dataframe being encapsulated. Because of this, we can guarantee that the property sepal_length will always return the pandas Series for that column.
- We defined the transformations that may be applied to this data. We know that species_mean_dataframe is the only transformation that will ever be applied to this dataframe directly. If you ever go back to change this class, you will know all the use cases to support simply by looking at the object methods. A static analyzer can look only at this object and know whether any of the methods will fail.
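To make the first benefit concrete, here is a minimal sketch of how the type might be used downstream; summarize_iris is a hypothetical helper, not something defined earlier in this article.
def summarize_iris(data: IrisData0) -> pd.DataFrame:
    # the signature tells both the reader and the static analyzer that this
    # function expects our custom type, not an arbitrary dataframe
    return data.species_mean_dataframe()

summary = summarize_iris(idata)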
Connecting the Pipes
The benefits of these patterns only grow as we apply them to other parts of our data pipelines. Let us return to the method we used above to compute averages of all the attributes within each species. The return type of that method is a dataframe, so we should consider encapsulating it as well. Let us call the encapsulating class SpeciesMean0; the definition will look familiar. I also add a property to access all the unique species in the dataset.
@dataclasses.dataclass
class SpeciesMean0:
    __species_av: pd.DataFrame

    def __repr__(self) -> str:
        return f'{self.__class__.__name__}(num_species={self.__species_av.shape[0]})'

    @property
    def all_species(self) -> typing.List[str]:
        return list(self.__species_av.index)
Now we can create a factory constructor method for
SpeciesMean0
so that it knows how to create itself from the
original type. To create it, we call the existing
species_mean_dataframe
method.
    # factory constructor method defined as part of the SpeciesMean0 class
    @classmethod
    def from_iris_data(cls, iris_data: IrisData0) -> typing.Self:
        return cls(iris_data.species_mean_dataframe())
Call this factory constructor method to create the averages from the original iris data object.
SpeciesMean0.from_iris_data(idata)
The new object’s __repr__ output looks very similar to that of the previous example object.
output:
SpeciesMean0(num_species=3)
To make a cleaner interface, we can even call this constructor from a method of the original data source. While this increases coupling between the two objects, it means each object in the pipeline knows how to produce the next one.
@dataclasses.dataclass
class IrisData0:
    __df: pd.DataFrame
    ...

    def species_mean(self) -> SpeciesMean0:
        return SpeciesMean0.from_iris_data(self)
And you can use this method like any other.
idata.species_mean()
The expected return type is returned.
output:
SpeciesMean0(num_species=3)
In this way, you could imagine a series of dataframe transformations being represented as a series of custom types, with factory constructor methods describing how each object is created from the others. It is much simpler to describe your data pipelines in terms of sequences of defined types than as sequences of dataframes with different structures. From a quick scan, the reader knows the kinds of transformations that are expected to occur on the data. This is a powerful benefit of encapsulation and custom types.
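As a sketch, the pipeline so far reads as a chain of named types; the commented-out report stage is purely hypothetical, shown only to suggest how further steps would attach.
# the pipeline reads as a sequence of named types rather than anonymous dataframes
means = IrisData0(iris_df).species_mean()
# a hypothetical next stage would follow the same pattern, e.g.
# report = SpeciesReport.from_species_mean(means)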
A Little Bookkeeping
The example above shows the power of encapsulation, but we can create stronger guarantees about the structure of the data from the input all the way through the pipeline by doing a little more bookkeeping. Recall that the ideal behavior for a buggy data pipeline is to fail fast. To do this, we can keep track of structural features of the dataframe, such as column names, more explicitly, and access those features only through dedicated objects rather than hard-coded strings scattered through the pipeline.
For example, let us design a type which contains all of the column
names that should appear in the input data. If we can guarantee that the
input data has all of these columns when it is ingested, we can make
sure our program fails right away rather than waiting until we try to
access the property. Here I create the classmethod all
so
we can grab the full list later.
class IrisColNames:
    sepal_length = 'sepal_length'
    sepal_width = 'sepal_width'
    petal_length = 'petal_length'
    petal_width = 'petal_width'
    species = 'species'

    @classmethod
    def all(cls) -> typing.List[str]:
        return [cls.sepal_length, cls.sepal_width, cls.petal_length, cls.petal_width, cls.species]
The encapsulating object itself looks very similar to the previous example, except that the factory constructor method explicitly selects the columns defined in IrisColNames instead of waiting for them to be accessed downstream. You can also see that we reference the IrisColNames attributes instead of writing the column names explicitly.
@dataclasses.dataclass
class IrisData1:
    __df: pd.DataFrame

    @classmethod
    def from_dataframe(cls, df: pd.DataFrame) -> typing.Self:
        return cls(
            df[IrisColNames.all()]
        )

    def __repr__(self) -> str:
        return f'{self.__class__.__name__}(size={self.__df.shape[0]})'

    @property
    def sepal_length(self) -> pd.Series:
        return self.__df[IrisColNames.sepal_length]

    @property
    def species(self) -> pd.Series:
        return self.__df[IrisColNames.species]

    def filter_by_species(self, species: str) -> typing.Self:
        return self.__class__(self.__df.query(f'{IrisColNames.species} == "{species}"'))
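Under these definitions, ingestion and a simple filter might look like the following quick usage sketch.
idata1 = IrisData1.from_dataframe(iris_df)
idata1.filter_by_species('setosa')  # -> IrisData1(size=50)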
Without enforcing these column names at ingestion, there is a potential for simple errors to propagate downstream and cause failures far from their source. For instance, consider our first example, IrisData0. If the input data file had named the sepal length column "sepal_len" instead of "sepal_length", we wouldn’t know there was an error until we accessed the sepal_length property, or something derived from it, downstream. Even worse, imagine the dataset is one that might be averaged downstream without first checking whether all columns are present. We would have a significant error that passes silently through our data pipeline.
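To make the contrast concrete, here is a small sketch of that failure mode; bad_df is constructed here purely for illustration by renaming one column of the example data.
# a copy of the data with the misnamed column, purely for illustration
bad_df = iris_df.rename(columns={'sepal_length': 'sepal_len'})

# IrisData0 accepts the malformed input without complaint; a KeyError would
# only surface later, when downstream code finally touches sepal_length
broken = IrisData0(bad_df)

# IrisData1 fails immediately at ingestion because from_dataframe selects
# the expected columns up front
try:
    IrisData1.from_dataframe(bad_df)
except KeyError as err:
    print(f'bad input caught at ingestion: {err}')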
Explicitly enumerating the columns associated with the input data takes more work, but it can also make your code more modular. Imagine that another wave of iris data is collected, but the format of the input data is slightly different: some of the column names have changed. In this case, we may want to separate our column name sets into IrisColNames2013 and IrisColNames2024 types which both inherit from BaseIrisColNames. The IrisData.from_dataframe factory constructor method could then accept a subtype of BaseIrisColNames, or you could create multiple constructors such as from_2024_dataframe which load the appropriate column name type but otherwise behave exactly the same. In either case, the user will be required to call the correct input parsing code for the given input data.
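To sketch the first option: the base class, two column name sets, and a variant constructor might look like the following. The 2024 column names and the IrisData2 class are invented here purely for illustration; renaming to a canonical set of names is just one way to keep the rest of the class unchanged.
class BaseIrisColNames:
    sepal_length: str
    sepal_width: str
    petal_length: str
    petal_width: str
    species: str

    @classmethod
    def all(cls) -> typing.List[str]:
        return [cls.sepal_length, cls.sepal_width, cls.petal_length, cls.petal_width, cls.species]

class IrisColNames2013(BaseIrisColNames):
    sepal_length = 'sepal_length'
    sepal_width = 'sepal_width'
    petal_length = 'petal_length'
    petal_width = 'petal_width'
    species = 'species'

class IrisColNames2024(BaseIrisColNames):
    # hypothetical names for the newer wave of data
    sepal_length = 'sepal_len'
    sepal_width = 'sepal_wid'
    petal_length = 'petal_len'
    petal_width = 'petal_wid'
    species = 'species_name'

@dataclasses.dataclass
class IrisData2:
    __df: pd.DataFrame

    @classmethod
    def from_dataframe(cls, df: pd.DataFrame,
            colnames: typing.Type[BaseIrisColNames] = IrisColNames2013) -> typing.Self:
        # select the expected columns (failing fast if any are missing), then
        # rename them to the canonical names so the rest of the class is unchanged
        selected = df[colnames.all()]
        return cls(selected.rename(columns=dict(zip(colnames.all(), IrisColNames2013.all()))))

    # ...__repr__, properties, and methods as in IrisData1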
In Conclusion
The tips I shared here are a great start toward improving data pipelines that involve dataframes, and I strongly encourage you to explore other ways of building more structure into your data pipelines. While I do believe that dataframes are the wrong choice most of the time, with a little work we can drastically improve our code and avoid many of their pitfalls. Happy analyzing!