The popular dataclasses module has been pushing many data scientists to adopt more object-oriented patterns in their data pipelines since it was introduced to the Python standard library. This module makes it easy to create data types by offering a decorator that automatically generates __init__ and a number of other boilerplate dunder methods from only a set of statically defined attributes with type hints (I recommend this tutorial). Previously I have written about object-oriented alternatives to dataframes for data science, and in this article I wanted to share a few patterns I use for data objects created with dataclasses in my own work.
Dataclass Basics
Firstly, I strongly recommend reading more about the dataclasses module before reading this article if you are not familiar with it, though the general principles may still be useful otherwise.
You can create a dataclass by applying the dataclasses.dataclass decorator to a class definition. The class definition should contain a set of static attributes with type hints, which may have default values. From this class definition, the decorator creates an __init__ method that includes all of these attributes as parameters; attributes with defaults become default parameter values.
import dataclasses

@dataclasses.dataclass
class MyType:
    a: int
    b: float
    c: str = ''
You can instantiate this object using the generated constructor as you would expect.
obj = MyType(1, 1.0)
obj = MyType(1, 2.0, c='hello')
obj = MyType(
    a=1,
    b=2.0,
    c='hello world',
)
The module offers many more features than these, some of which I will bring up when discussing anti-patterns, but it is worth reading more about the module if you are interested in using it.
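Beyond __init__, the decorator also generates __repr__ and __eq__ by default, which you get for free with every dataclass. A quick sketch (the Point class here is just for illustration):

```python
import dataclasses

@dataclasses.dataclass
class Point:
    x: int
    y: int

p = Point(1, 2)
# __repr__ shows field names and values
print(p)                  # Point(x=1, y=2)
# __eq__ compares field values rather than identity
print(p == Point(1, 2))   # True
```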
1. Immutability
Firstly, I recommend that you make your dataclass objects immutable - that is, objects with a fixed set of attributes whose values cannot be changed after instantiation. Normally Python objects can have new attributes assigned to them at any point - this is part of the power and flexibility of the language. That said, adding additional attributes to the object other than those described in the definition tends to weaken the value of dataclasses - I recommend avoiding them.
You can require that dataclasses enforce this rule by passing the frozen=True argument to the dataclass decorator.
@dataclasses.dataclass(frozen=True)
class MyType:
    a: int
    b: float
You will then see an exception when you attempt to assign a value to a new or existing attribute.
obj = MyType(1, 0.0)
You’ll see an exception when changing an existing attribute.
obj.b = 2
<string> in __setattr__(self, name, value)
FrozenInstanceError: cannot assign to field 'b'
And even when creating an additional attribute.
obj.c = 5
<string> in __setattr__(self, name, value)
FrozenInstanceError: cannot assign to field 'c'
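Because frozen instances cannot be modified in place, the idiomatic way to "change" one is to build an updated copy with dataclasses.replace. A minimal sketch:

```python
import dataclasses

@dataclasses.dataclass(frozen=True)
class MyType:
    a: int
    b: float

obj = MyType(1, 0.0)
# replace() builds a new instance, copying any fields you don't override
obj2 = dataclasses.replace(obj, b=2.0)
print(obj2)   # MyType(a=1, b=2.0); obj itself is untouched
```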
2. Slots
Next, I recommend using the slots interface to fix the set of attributes associated with your object, whether or not you choose to make the values immutable. You can read more about this interface on the related Python wiki page, but essentially you make this guarantee to the interpreter so that it doesn’t need to create extra resources to allow for the introduction of new attributes dynamically. In my experience, this can cut memory usage down by around half, depending on the number of attributes you use.
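The savings come from dropping the per-instance __dict__. A rough illustration using plain classes to isolate the effect (exact memory numbers vary by interpreter and version):

```python
class Regular:
    def __init__(self, a, b):
        self.a = a
        self.b = b

class Slotted:
    # fixed attribute set declared up front
    __slots__ = ('a', 'b')
    def __init__(self, a, b):
        self.a = a
        self.b = b

r = Regular(1, 2.0)
s = Slotted(1, 2.0)
# slotted instances carry no per-instance __dict__ to pay for
print(hasattr(r, '__dict__'), hasattr(s, '__dict__'))  # True False
```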
To use the slots interface, simply provide the attribute names as a list of strings assigned to the __slots__ static attribute. Note that the dataclass decorator will ignore this particular attribute, so it should not affect other aspects of your object.
@dataclasses.dataclass
class MyType:
    __slots__ = ['a', 'b']
    a: int
    b: float
It works like a regular dataclass but will raise an AttributeError if you try to assign a value to a new attribute.
obj = MyType(1, 0.0)
obj.b = 2
obj.c = 5
The error will look like this:
---> 11 obj.c = 5
AttributeError: 'MyType' object has no attribute 'c'
Starting in Python 3.10, you can alternatively provide the slots=True argument to the dataclasses.dataclass decorator to accomplish this (read more here).
@dataclasses.dataclass(slots=True)
class MyType:
    a: int
    b: float
3. Static Factory Methods
Next, I recommend using static factory methods - static methods that return instances of the containing class - to instantiate dataclasses in cases other than direct assignment of attributes. I believe this is preferable to relying on logic placed in __post_init__, which will be executed every time you instantiate the object, even if no actual work needs to be done.
In Python, you can use the classmethod decorator to create a static factory method like the example below. You can see that new just passes a and b directly to the object constructor - essentially doing nothing except acting as a pass-through.
@dataclasses.dataclass
class MyType:
    a: int
    b: float

    @classmethod
    def new(cls, a: int, b: float):
        # logic goes here
        return cls(a=a, b=b)
Using this approach, instantiating a new object is straightforward.
obj = MyType.new(1, 2.0)
Alternatively, the dataclass decorator allows you to place some logic required to instantiate the object in a method called __post_init__, which will be called at the end of the generated __init__ after attribute assignment is complete. By default, this method accepts a single argument: a reference to the object itself after assignment.
...
    def __post_init__(self):
        # post-init code here
        pass
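If you do adopt __post_init__, note that init-only inputs can be routed through dataclasses.InitVar fields, which are passed to __post_init__ as arguments rather than stored as regular fields. A small sketch (the Scaled class and its fields are illustrative):

```python
import dataclasses

@dataclasses.dataclass
class Scaled:
    value: float = 0.0
    # consumed during initialization, not kept as a field
    scale: dataclasses.InitVar[float] = 1.0

    def __post_init__(self, scale: float):
        # InitVar values arrive here after attribute assignment
        self.value = self.value * scale

obj = Scaled(2.0, scale=3.0)
print(obj.value)  # value was scaled at init: 6.0
```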
I will now cover a few use-cases to demonstrate the value of using static factory methods over __post_init__.
Non-data Arguments
The most obvious use-case for static factory methods arises when the interface you use to create the object differs from the interface offered by the dataclass-generated __init__ method. This might occur, for instance, when some parameters needed to create the object will not be used later, or when the data being stored is actually a function of some other parameters.
For example, imagine we want to create an object that maintains the current timestamp and some other data. The dataclass will just include a timestamp and the other data, but we should not need to provide a timestamp object directly from the outside, since we know that attribute will always be a timestamp - we should be able to create it from within the object. To do this, we create a static factory method that requires only a time zone - the timestamp can then use that information, and the time zone itself need not be stored elsewhere because we can always get it from the timestamp.
import datetime

@dataclasses.dataclass
class MyTimeKeeper:
    ts: datetime.datetime
    other_data: dict

    @classmethod
    def now(cls, tz: datetime.timezone, **other_data):
        return cls(datetime.datetime.now(tz=tz), other_data)
Of course, you can build even higher-level static factory methods on top of this one to support the object as well. Both allow you to create the object without providing a timestamp.
...
    @classmethod
    def now_utc(cls, **other_data):
        return cls.now(datetime.timezone.utc, **other_data)

    @classmethod
    def now_naive(cls, **other_data):
        return cls.now(None, **other_data)
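Putting these factories together, usage looks like the following. This is a self-contained sketch, so it repeats the class definition from above (the source keyword argument is just example data):

```python
import dataclasses
import datetime

@dataclasses.dataclass
class MyTimeKeeper:
    ts: datetime.datetime
    other_data: dict

    @classmethod
    def now(cls, tz, **other_data):
        return cls(datetime.datetime.now(tz=tz), other_data)

    @classmethod
    def now_utc(cls, **other_data):
        return cls.now(datetime.timezone.utc, **other_data)

    @classmethod
    def now_naive(cls, **other_data):
        return cls.now(None, **other_data)

obj = MyTimeKeeper.now_utc(source='sensor-1')
print(obj.ts.tzinfo)    # a UTC-aware timestamp
print(obj.other_data)   # {'source': 'sensor-1'}
```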
Creating these objects only using the static factory methods guarantees that the stored timestamp will always be a timestamp but also makes clear to the reader that we only need the time zone for making the timestamp - not for later use.
An approach using __post_init__ would require us to partially initialize the object by leaving ts = None and then filling it in later, but this has several downsides: (a) it is unclear to the reader how and whether the timestamp may be used in the future, (b) you do not have much flexibility in the ways you instantiate these objects, and (c) the interface for creating the object is not as clean.
@dataclasses.dataclass
class MyTimeKeeperPostInit:
    tz: datetime.timezone
    other_data: dict
    ts: datetime.datetime = None

    def __post_init__(self):
        self.ts = datetime.datetime.now(self.tz)
Parameter Co-dependence
Now let us explore a case where we have a dataclass with two attributes a and b, and we know that if one of these values is not provided, the object should compute the other following the relationship b = 2a. I created two static factory methods to handle each of these cases. Note that we require the user to choose the case explicitly, following the Zen of Python, which suggests that "explicit is better than implicit." This object should not be required to guess which case needs to be handled - the user can do that.
@dataclasses.dataclass
class MyType:
    a: int
    b: float

    @classmethod
    def from_a_only(cls, a: int):
        return cls(a=a, b=a*2)

    @classmethod
    def from_b_only(cls, b: float):
        return cls(a=b/2, b=b)
obj = MyType.from_a_only(1)
obj = MyType.from_b_only(2)
The anti-pattern that relies on __post_init__ would require us to place some if-else logic inside the object constructor to handle these cases. Unfortunately many Python APIs rely on this type of "guessing", even though the Zen of Python says "in the face of ambiguity, refuse the temptation to guess."
@dataclasses.dataclass
class MyType:
    a: int = None
    b: float = None

    def __post_init__(self):
        if self.b is None and self.a is not None:
            self.b = self.a * 2
        elif self.a is None and self.b is not None:
            self.a = self.b / 2
        elif self.a is None and self.b is None:
            raise ValueError('Neither a nor b were provided.')
It is also worth noting that in many cases, static factory methods provide cleaner alternatives to using default_factory parameters or anything else that requires more complex logic.
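For instance, where you might otherwise reach for field(default_factory=...), a factory method can construct the derived or mutable value up front. A sketch (the Config class and with_defaults name are illustrative):

```python
import dataclasses

@dataclasses.dataclass
class Config:
    name: str
    tags: list

    @classmethod
    def with_defaults(cls, name: str):
        # build the mutable default here rather than via default_factory
        return cls(name=name, tags=[])

c1 = Config.with_defaults('first')
c2 = Config.with_defaults('second')
c1.tags.append('x')
print(c2.tags)  # [] - each instance gets its own list
```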
4. Extend Functionality Using Composition
As projects grow and requirements change, the number of methods
associated with a single dataclass might become unwieldy. To organize
some of these methods, I recommend creating a new container object that
contains only an instance of the original object and operates on that
instance. The interface can be exposed from the original object using a
property
method that simply returns a new instance of the
container object.
In this example, we define MyTypeMath
to be a container
around MyType
using the dataclass decorator, and in methods
simply refer to this attribute instead of self
. In this
way, you can easily grow the code associated with the dataclass.
@dataclasses.dataclass
class MyType:
    a: int
    b: float

    @property
    def math(self):
        return MyTypeMath(self)

@dataclasses.dataclass
class MyTypeMath:
    mt: MyType

    def product(self):
        return self.mt.a * self.mt.b

    def sum(self):
        return self.mt.a + self.mt.b
And the interface would be used like this:
obj = MyType(1, 1.0)
obj.math.product()
obj.math.sum()
If creating the container involves any other work, you can create it once and then access its many methods directly.
mather = obj.math
mather.sum()
mather.product()
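One design note: the math property above builds a fresh MyTypeMath on every access. If that overhead matters, functools.cached_property can memoize it - though this relies on the instance having a __dict__, so it will not work together with frozen=True or slots. A sketch:

```python
import dataclasses
import functools

@dataclasses.dataclass
class MyType:
    a: int
    b: float

    @functools.cached_property
    def math(self):
        # built once on first access, then stored on the instance
        return MyTypeMath(self)

@dataclasses.dataclass
class MyTypeMath:
    mt: 'MyType'

    def product(self):
        return self.mt.a * self.mt.b

obj = MyType(2, 3.0)
print(obj.math is obj.math)  # True - same container after first access
print(obj.math.product())    # 6.0
```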
5. Define Custom Types for Transformations
My last pattern concerns chaining transformations from one dataclass object to another. This will look much like the previous pattern, but instead of creating a wrapper object that only adds functionality, we create a second object with a static factory method that performs some transformation. In turn, the original class uses that static factory method instead of the basic __init__ method provided by dataclasses.
@dataclasses.dataclass
class MyType:
    a: int
    b: float

    def get_summary(self):
        return MySummaryType.from_mytype(self)

@dataclasses.dataclass
class MySummaryType:
    added: float
    product: float

    @classmethod
    def from_mytype(cls, mt: MyType):
        return cls(mt.a + mt.b, mt.a * mt.b)
print(MyType(1, 4).get_summary())
While most transformations will operate on collections of dataclasses (which I will cover in a future article), this approach works well for basic object-wise transformations.
Conclusions
Object-oriented design can drastically improve the maintainability of your data science code, and these patterns will improve the readability and robustness of those designs.
In the next article I will cover patterns for working with collections of dataclasses - aggregation, filtering, grouping, and more.