In this article I discuss some patterns and anti-patterns for building specialized types that act as data collections, and I will focus on variable-sized collections of fixed-size data types. This will typically mean iterables of structs or classes that exclusively store atomic data types. Built-in lists, dicts, and arrays are all powerful tools for managing data objects, but I argue that either extending or encapsulating them with custom types can improve the readability, maintainability, and error robustness of your data pipelines.
This article serves as a natural progression from my article on patterns for dataclasses (essentially the data objects I reference here) and builds on some of the basic strategies described in my article on weaknesses of using dataframes in your pipelines. The examples I offer here are based in Python, but I believe they could apply to most other languages.
Data Object Collections
Built-in collection types such as lists, arrays, or dicts offer the simplest methods for storing and manipulating sequences of data objects: they are designed to handle collections of any type and are therefore fit for most applications. For example, let us start with a very basic data object in Python built using the dataclasses module. This type has exactly two properties - an int and a float - and is used to represent a single element in your dataset.
import dataclasses

@dataclasses.dataclass
class MyType:
    a: int
    b: float
Creating a list of these objects is then fairly straightforward: we can simply call the constructor in a loop (here, a list comprehension). We can even add a type hint to note that this is a list of objects of this specific type.
import typing
mytypes: typing.List[MyType] = [MyType(i, 1/(i+1)) for i in range(10)]
And, of course, we can continue to use methods that operate on iterables to manipulate these objects, typically in for loops (or list comprehensions). These objects are very general, and you can work with them in the same way you work with any other iterable.
mytype_products = [mt.a * mt.b for mt in mytypes]
The problem, I argue, is that they are a little too general - there is nothing in the code that indicates how these objects will be transformed, and any downstream customers/functions must write their own logic to manipulate them as generic iterables. The alternative is that, instead of using these built-in types directly, you create your own application-specific types that are manipulated through methods you define.
In most languages, there are two primary ways to create collection types: (a) extend an existing collection type, or (b) encapsulate a collection in another custom type. I will discuss each approach below, but most subsequent patterns will focus on the encapsulating approach because it is more complicated; the equivalent approach for extended types should be clear from it.
A. Encapsulate Collection Types
The most typical approach for building custom collection types is to create a wrapper object that contains and encapsulates a collection type. You may choose which features of the collection (such as iteration or indexing) to expose, and add additional methods. The dataclasses module can be helpful here as it can build a constructor that accepts a single object: a collection of objects of some type (though the wrapper could certainly contain additional attributes). The constructor simply assigns the collection to an attribute of the object.
@dataclasses.dataclass
class MyCollection:
    objs: typing.List
And you might instantiate the new collection like this:
MyCollection([])
You could even set a default to allow you to create an empty collection.
@dataclasses.dataclass
class MyCollection:
    objs: typing.List[MyType] = dataclasses.field(default_factory=list)
In which case you could use MyCollection() to instantiate a collection with an empty list.
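As a quick illustration of my own (the direct objs access here is the raw interface; later sections add cleaner methods):

mc = MyCollection()               # empty list created by the default factory
mc.objs.append(MyType(1, 0.5))    # manipulate the wrapped list directly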
B. Extend Existing Types
Alternatively, for simple cases, you may extend an existing collection type. To do this in Python, you will probably want to subclass the generic aliases in the typing module instead of list, dict, or set directly. This object will work almost exactly like the inherited type, but with any additional methods you would like to assign.
class MyCollectionExtended(typing.List[MyType]):
    pass
You would instantiate this the same way you use the list constructor.
MyCollectionExtended(MyType(i, i + 1) for i in range(10))
For this article, I will use the former approach of wrapping collections, but I certainly find extending existing types to be valuable in simple cases where I want to minimize code. Using some of the encapsulation principles I discuss here (namely static factory methods), you could easily design your pipeline so that you can shift from one approach to the other as the project changes.
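For example, here is a hedged sketch of how a shared static factory method could make the two approaches interchangeable; from_numbers mirrors the factory defined later in this article:

class MyCollectionExtended(typing.List[MyType]):
    @classmethod
    def from_numbers(cls, numbers: typing.Iterable[int]):
        # same construction interface as the wrapper version shown later,
        # so call sites would not change if you swapped implementations
        return cls(MyType(i, 1 / (i + 1)) for i in numbers)

mc = MyCollectionExtended.from_numbers(range(10))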
Now I will cover strategies for building more features into these collection types.
Static Factory Methods
The first concern for custom collections is the methods by which they are constructed - for this, I recommend using static factory methods almost exclusively. Static factory methods are class-level functions that return instances of the parent object. It is generally best to put any construction logic in a static factory method instead of the constructor, in case you want to instantiate the object without that logic.
One benefit of these methods is that they can also call the constructors or static factory methods of the contained types. For example, using the dataclass-generated constructor of the below collection requires you to either pass a set of pre-constructed instances to the underlying list, or append them afterwards. Alternatively, the static factory method calls the MyType constructor for you, so you can simply pass it an iterable of the relevant information to build the collection with the proper types.
@dataclasses.dataclass
class MyCollection:
    objs: typing.List[MyType] = dataclasses.field(default_factory=list)

    @classmethod
    def from_ab_pairs(cls, elements: typing.Iterable):
        return cls([MyType(*el) for el in elements])
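Usage might then look like this (an illustrative call of my own, assuming (a, b) pairs):

mc = MyCollection.from_ab_pairs((i, 1 / (i + 1)) for i in range(10))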
For more complicated cases, you may need to use a static factory method of the contained types instead of their constructors. In a case where we want to create MyType objects from a single integer, we can add a static factory method to that type.
@dataclasses.dataclass
class MyType:
    a: int
    b: float

    @classmethod
    def from_number(cls, i: int):
        return cls(i, 1 / (i + 1))
Then we simply call that as we iterate over the data being used to create the collection.
@dataclasses.dataclass
class MyCollection:
    objs: typing.List[MyType] = dataclasses.field(default_factory=list)

    @classmethod
    def from_numbers(cls, numbers: typing.Iterable[int]):
        return cls([MyType.from_number(i) for i in numbers])
This greatly simplifies the process of creating new collections using only the data needed for the contained types.
MyCollection.from_numbers(range(10))
You can imagine how this would scale to more complicated cases.
Exposing Collection Methods
Whereas extending existing types gives you access to the behavior of collections directly, building custom wrapper types may require you to implement some boilerplate functionality such as iteration and numerical (or other) indexing. You can do some of this by defining __iter__ and __getitem__ methods.
@dataclasses.dataclass
class MyCollection:
    objs: typing.List[MyType] = dataclasses.field(default_factory=list)
    ...

    def __iter__(self) -> typing.Iterator[MyType]:
        return iter(self.objs)

    def __getitem__(self, ind: int) -> MyType:
        return self.objs[ind]
In cases where you are wrapping dictionaries or sets, you might want to add additional pass-through functionality.
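For example, here is a sketch (my own addition, not from the original) of two more pass-throughs that are useful regardless of the underlying collection:

@dataclasses.dataclass
class MyCollection:
    objs: typing.List[MyType] = dataclasses.field(default_factory=list)
    ...

    def __len__(self) -> int:
        return len(self.objs)

    def __contains__(self, mt: MyType) -> bool:
        return mt in self.objs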
Interface for adding elements
The essential characteristic of the collections I am discussing here is that they contain only data objects of the specified type. Without further work, you would rely on the customer to create a new instance of the contained type before it can be added. Basic software engineering principles suggest that we should encapsulate relevant functionality for the contained type, so we could add an append() method to the collection (although obviously, and less ideally, the customer could add to the list directly).
The most basic encapsulation method would simply act as a pass-through.
@dataclasses.dataclass
class MyCollection:
    objs: typing.List[MyType] = dataclasses.field(default_factory=list)
    ...

    def append(self, *args, **kwargs) -> None:
        return self.objs.append(*args, **kwargs)
A better solution is to move object construction into the append method so that the customer does not need to create the object each time. You can use either the constructor or a static factory method for this.
@dataclasses.dataclass
class MyCollection:
    objs: typing.List[MyType] = dataclasses.field(default_factory=list)
    ...

    def append_mytype(self, *args, **kwargs) -> None:
        return self.objs.append(MyType(*args, **kwargs))

    def append_from_number(self, *args, **kwargs) -> None:
        return self.objs.append(MyType.from_number(*args, **kwargs))

mc = MyCollection()
mc.append(MyType(1, 2.0))
mc.append_from_number(1)
Adding this to an extended collection type involves using the built-in collection methods directly, instead of manipulating a contained collection.
class MyCollectionExtended(typing.List[MyType]):
    ...

    def append_mytype(self, *args, **kwargs) -> None:
        return self.append(MyType(*args, **kwargs))

    def append_from_number(self, *args, **kwargs) -> None:
        return self.append(MyType.from_number(*args, **kwargs))
You would create interfaces for similar methods such as element removal by following the same pattern, as in the sketch below.
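For example, a minimal sketch of in-place removal (remove_if is a hypothetical name of my own):

@dataclasses.dataclass
class MyCollection:
    objs: typing.List[MyType] = dataclasses.field(default_factory=list)
    ...

    def remove_if(self, condition: typing.Callable[[MyType], bool]) -> None:
        # keep only the elements that do not match the condition
        self.objs = [o for o in self.objs if not condition(o)]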
Filtering
Filtering functions return a collection of the same type that includes only a subset of the original elements. To return the same type, you will likely need to access the self.__class__ attribute.
@dataclasses.dataclass
class MyCollection:
    objs: typing.List[MyType] = dataclasses.field(default_factory=list)
    ...

    def filter(self, keep_if: typing.Callable[[MyType], bool]):
        return self.__class__([o for o in self.objs if keep_if(o)])
Returning the same type means that you can still use any methods defined in the original object.
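A hypothetical usage example of my own:

mc = MyCollection.from_numbers(range(10))
evens = mc.filter(lambda mt: mt.a % 2 == 0)  # still a MyCollection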
Aggregation
Aggregation functions reduce a set of contained elements into a single element according to some function. As an example, let us say we want to return the average element in a collection - that is, an element that represents the average of the a and b attributes. We would start by creating a custom type for the return value so that the customer knows it is an aggregation of multiple elements and not an observation itself. Not much is needed here unless we want to add new functionality.
class MyTypeAverage(MyType):
    pass
Actually computing the average can be done in a new collection method, which returns the new average type.
import statistics

@dataclasses.dataclass
class MyCollection:
    objs: typing.List[MyType] = dataclasses.field(default_factory=list)
    ...

    def average(self) -> MyTypeAverage:
        return MyTypeAverage(
            a = statistics.mean([o.a for o in self.objs]),
            b = statistics.mean([o.b for o in self.objs]),
        )
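Calling it is then straightforward; this illustrative call is mine (result shown approximately):

avg = MyCollection.from_numbers(range(10)).average()
# -> MyTypeAverage(a=4.5, b≈0.2929)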
If you expect the averaging to appear in other places, you could place that code in the average type itself as a static factory method.
class MyTypeAverage(MyType):
    @classmethod
    def from_mytypes(cls, mtypes: typing.Iterable[MyType]):
        mtypes = list(mtypes)  # materialize so we can iterate twice
        return cls(
            a = statistics.mean([o.a for o in mtypes]),
            b = statistics.mean([o.b for o in mtypes]),
        )
Then simply call that from the collection object.
@dataclasses.dataclass
class MyCollection:
    objs: typing.List[MyType] = dataclasses.field(default_factory=list)
    ...

    def average_sfm(self) -> MyTypeAverage:
        return MyTypeAverage.from_mytypes(self.objs)
In this way, you could call the aggregated object’s static factory method from other functions that return that type.
Grouping and Aggregation
Aggregation is often used in conjunction with grouping: splitting elements into subgroups according to some criteria and aggregating within those groups. In Python, you could represent groups as a dictionary mapping some key to our collection objects. It is important to reference self.__class__ so that the groups you create are the same type as the original collection.
@dataclasses.dataclass
class MyCollection:
    objs: typing.List[MyType] = dataclasses.field(default_factory=list)
    ...

    def group_by_as_dict(self, key_func: typing.Callable[[MyType], typing.Hashable]) -> typing.Dict[typing.Hashable, "MyCollection"]:
        groups = dict()
        for el in self.objs:
            k = key_func(el)
            if k not in groups:
                groups[k] = list()
            groups[k].append(el)
        return {k: self.__class__(grp) for k, grp in groups.items()}
For readability, it may also be helpful to create a custom type, however simple, to represent the grouped objects.
class GroupedMyCollection(typing.Dict[typing.Hashable, MyCollection]):
    pass
Because these functions return a mapping from keys to the original collection type, you can use the previously defined aggregation functions on each group.
@dataclasses.dataclass
class MyCollection:
    objs: typing.List[MyType] = dataclasses.field(default_factory=list)
    ...

    def group_by_average(self, *args, **kwargs) -> typing.Dict[typing.Hashable, MyTypeAverage]:
        return {k: grp.average() for k, grp in self.group_by_as_dict(*args, **kwargs).items()}
A better approach may be to add functionality to the grouping object itself, so you can apply the grouping first and then perform additional operations on the result.
class GroupedMyCollection(typing.Dict[typing.Hashable, MyCollection]):
    def average(self) -> typing.Dict[typing.Hashable, MyTypeAverage]:
        return {k: grp.average() for k, grp in self.items()}
To do this, you’d wrap the grouping function with the custom grouping type.
@dataclasses.dataclass
class MyCollection:
    objs: typing.List[MyType] = dataclasses.field(default_factory=list)
    ...

    def group_by(self, key_func: typing.Callable[[MyType], typing.Hashable]) -> GroupedMyCollection:
        groups = dict()
        for el in self.objs:
            k = key_func(el)
            if k not in groups:
                groups[k] = list()
            groups[k].append(el)
        return GroupedMyCollection({k: self.__class__(grp) for k, grp in groups.items()})
Instead of using .group_by_average(), you could use .group_by().average(). This is possible because you are creating the intermediary type for the grouped collection.
mc = MyCollection.from_numbers(range(10))
mc.group_by(lambda mt: int(mt.a) % 2 == 0).average()
And the output would look like the following:
{
    True: MyTypeAverage(a=4, b=0.3574603174603175),
    False: MyTypeAverage(a=5, b=0.22833333333333333)
}
Note that for practical purposes, I recommend fixing the key function so that the customer can see all the groupings one would expect to use with a given collection - I passed it as a parameter here only for illustration. Fixing it improves readability and avoids leaving the key function specification to the customer, since there may be many cases they must consider.
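For instance, a hedged sketch with the key fixed (group_by_parity is a hypothetical name of my own):

@dataclasses.dataclass
class MyCollection:
    objs: typing.List[MyType] = dataclasses.field(default_factory=list)
    ...

    def group_by_parity(self) -> GroupedMyCollection:
        # the grouping criterion is fixed and named, not left to the customer
        return self.group_by(lambda mt: mt.a % 2 == 0)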
Mutations and Element-wise Transformations
Mutations always involve transforming objects from one type to another - rarely would I output a simple list/array of numbers, for instance, as you might with the mutate function in R/dplyr. As such, we would create a new type for the result of the transformation, as well as the associated collection type. We would also define static factory methods on each to support the transformation.
@dataclasses.dataclass
class MyTypeTwo:
    sum: float
    prod: float

    @classmethod
    def from_mytype(cls, mt: MyType):
        return cls(sum = mt.a + mt.b, prod = mt.a * mt.b)

@dataclasses.dataclass
class MyCollectionTwo:
    objs: typing.List[MyTypeTwo] = dataclasses.field(default_factory=list)

    @classmethod
    def from_mycollection(cls, mytypes: MyCollection):
        return cls([MyTypeTwo.from_mytype(mt) for mt in mytypes])
To further improve readability, I recommend adding a method that calls the static factory method from within the original collection object. You can then call this method to return a new collection of the transformed types.
@dataclasses.dataclass
class MyCollection:
    objs: typing.List[MyType] = dataclasses.field(default_factory=list)
    ...

    def transform_to_two(self) -> MyCollectionTwo:
        return MyCollectionTwo.from_mycollection(self)
This creates a fairly simple interface, and you could imagine chaining these methods to perform more complicated operations.
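An illustrative chained call of my own:

transformed = MyCollection.from_numbers(range(10)).transform_to_two()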
Parallelized Transformations
In the case where you want to implement parallelization, you can call the element-level static factory method directly in each process. This way, all parallelization code is maintained within the object itself.
import multiprocessing

@dataclasses.dataclass
class MyCollection:
    objs: typing.List[MyType] = dataclasses.field(default_factory=list)
    ...

    def transform_parallelized(self) -> MyCollectionTwo:
        with multiprocessing.Pool() as p:
            results = p.map(MyTypeTwo.from_mytype, self)
        return MyCollectionTwo(results)
The fact that you are using parallelized code need not even be apparent to the customer.
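One practical caveat (standard multiprocessing behavior rather than anything specific to this pattern): on platforms that spawn rather than fork worker processes, such as Windows and recent macOS defaults, the call should happen under a __main__ guard so workers can safely re-import the module.

if __name__ == '__main__':
    # prevents spawned workers from re-executing this call on import
    result = MyCollection.from_numbers(range(10)).transform_parallelized()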
Extend Functionality Using Composition
Following the recommendation in my previous article discussing patterns for dataclasses, I recommend extending functionality using composition here as well. This addresses the problem of creating collection objects with a massive number of methods. You simply create a wrapper object that is instantiated from the original collection and operates on that collection. For instance, let us say you need to implement a number of math and statistical methods but don't want to clog the main object with them. First you would create such a wrapper object - fairly easy with the dataclasses module. Notice that I am breaking the encapsulation of the original collection by accessing objs directly - it isn't strictly necessary to do this, but it is a fair choice here.
@dataclasses.dataclass
class MyCollectionMath:
    mc: MyCollection

    def total_a(self) -> float:
        return sum([mt.a for mt in self.mc.objs])

    def total_b(self) -> float:
        return sum([mt.b for mt in self.mc.objs])
Then add the constructor to a method within the original collection. Python makes a clean interface for this using the property decorator.
@dataclasses.dataclass
class MyCollection:
    objs: typing.List[MyType] = dataclasses.field(default_factory=list)
    ...

    @property
    def math(self) -> MyCollectionMath:
        return MyCollectionMath(self)
And you can access all the additional methods through temporary instances of the wrapper object.
mc.math.total_a()
Plotting Objects
The last pattern I will discuss involves specifically managing interfaces for plotting results, although it could apply to many scenarios where extending your collection types involves some type of transformation. I recommend creating objects specifically for plotting values in your collection types. To do this, you can create a wrapper object similar to my pattern for extending collections, except have it instead wrap a dataframe and create a static factory method to create the dataframe from the original collection. You can then add any plotting functionality to that object - here I chose a function to create a bar graph.
import pandas as pd
import plotly.express as px

@dataclasses.dataclass
class MyCollectionPlotter:
    df: pd.DataFrame

    @classmethod
    def from_mycollection(cls, mc: MyCollection):
        df = pd.DataFrame([dataclasses.asdict(mt) for mt in mc])
        return cls(df)

    def bar(self):
        return px.bar(self.df, x='a', y='b')
And to make the interface clean you can simply add it as a method of the original collection.
@dataclasses.dataclass
class MyCollection:
    objs: typing.List[MyType] = dataclasses.field(default_factory=list)
    ...

    def plotter(self) -> MyCollectionPlotter:
        return MyCollectionPlotter.from_mycollection(self)
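Usage might then look like the following; .show() is plotly's standard method for displaying a figure.

mc = MyCollection.from_numbers(range(10))
fig = mc.plotter().bar()
fig.show()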
This is a method for extending the object with plotting functionality even when it requires a transformation to a dataframe prior to plotting. In many cases you will likely want to call multiple plotting functions in the same script/function, and you can do that by interacting with the same plotting object. Plotting is a good use case because it offers an easy way to organize many plotting functions with different aesthetics, but this pattern may be appropriate for other problem types as well.
Conclusions
Creating custom types for collections makes your code more readable and generally has the same benefits I discussed in my article about the challenges with dataframes. The general idea is that more structure is better - it can make your code more readable, easier to maintain, and less error-prone. While at first glance it may feel like overkill because it requires so much additional code, the payoff comes as projects become large or dynamic enough that you need better ways to organize your code.