Here I show a few design pattern examples you can use in your own code; refer back to the main article for the full discussion.
```python
from __future__ import annotations
import typing

import pandas as pd
import seaborn

# url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv'
# iris_df = pd.read_csv(url)
iris_df = seaborn.load_dataset("iris")
print(iris_df.shape)
iris_df.head()
```

```
(150, 5)
```

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
The `dataclasses` module has been wildly popular lately because it is one of the easiest ways to create data containers. The decorator `dataclasses.dataclass` generates an `__init__` method that accepts the three annotated attributes, and we use `frozen=True` to indicate that instances should not be modified after creation. This is a good failsafe to have when working with your data: unless you have a very good reason to expect modification, I recommend redesigning your code around this constraint.
```python
import dataclasses

@dataclasses.dataclass(frozen=True)
class IrisEntryDataclass:
    sepal_length: float
    sepal_width: float
    species: str

    # factory method constructor
    @classmethod
    def from_dataframe_row(cls, row: pd.Series) -> IrisEntryDataclass:
        return cls(
            sepal_length=row['sepal_length'],
            sepal_width=row['sepal_width'],
            species=row['species'],
        )

    def sepal_area(self) -> float:
        return self.sepal_length * self.sepal_width
```
```
IrisEntryDataclass(sepal_length=5.1, sepal_width=3.5, species='setosa')
IrisEntryDataclass(sepal_length=4.9, sepal_width=3.0, species='setosa')
IrisEntryDataclass(sepal_length=4.7, sepal_width=3.2, species='setosa')
IrisEntryDataclass(sepal_length=4.6, sepal_width=3.1, species='setosa')
```
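Because the class is frozen, any attempt to assign to an attribute after construction raises `dataclasses.FrozenInstanceError`. A quick demonstration (the class definition is repeated so the snippet runs standalone):

```python
import dataclasses

@dataclasses.dataclass(frozen=True)
class IrisEntryDataclass:
    sepal_length: float
    sepal_width: float
    species: str

entry = IrisEntryDataclass(sepal_length=5.1, sepal_width=3.5, species='setosa')
try:
    entry.sepal_length = 6.0  # assignment is blocked on frozen instances
except dataclasses.FrozenInstanceError:
    print('cannot assign to a frozen dataclass instance')
```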
Using the factory method you can create a list of entry objects that contain the data associated with each row.
```python
# now create a function to convert the dataframe rows into entry objects
def dataframe_to_entries(df: pd.DataFrame) -> typing.List[IrisEntryDataclass]:
    entries = list()
    for ind, row in df.iterrows():
        new_iris = IrisEntryDataclass.from_dataframe_row(row)
        entries.append(new_iris)
    return entries

# now parse the dataframe and print the output
entries = dataframe_to_entries(iris_df)
for e in entries[:4]:
    print(e)
```
## attrs Classes

Ideally you could impose stricter validation rules on the data inserted into these objects. The `attrs` project maintains a superset of the functionality of `dataclasses` (`attrs` actually predates, and was an inspiration for, `dataclasses`) and provides some convenient decorators and methods for type checking/conversion and value checking. Note that below I create validation functions using the `@species.validator`, `@sepal_length.validator`, and `@sepal_width.validator` decorators.
```python
import attrs

@attrs.define(frozen=True)  # attrs.define uses slotted classes by default
class IrisEntryAttrs:
    '''Represents a single iris.'''
    sepal_length: float = attrs.field(converter=float)
    sepal_width: float = attrs.field(converter=float)
    species: str = attrs.field(converter=str)

    ######################### Factory Methods #########################
    @classmethod
    def from_dataframe_row(cls, row: pd.Series) -> IrisEntryAttrs:
        return cls(
            sepal_length=row['sepal_length'],
            sepal_width=row['sepal_width'],
            species=row['species'],
        )

    ######################### Validators #########################
    @species.validator
    def species_validator(self, attr, value) -> None:
        if not len(value) > 0:
            raise ValueError(f'Attribute {attr.name} must be a '
                             'string larger than 0 characters.')

    @sepal_length.validator
    @sepal_width.validator
    def meas_validator(self, attr, value) -> None:
        if not value > 0:
            raise ValueError(f'Attribute {attr.name} was '
                             f'{value}, but it must be larger than zero.')

    ######################### Properties #########################
    def sepal_area(self) -> float:
        return self.sepal_length * self.sepal_width
```
```
IrisEntryAttrs(sepal_length=5.1, sepal_width=3.5, species='setosa')
IrisEntryAttrs(sepal_length=4.9, sepal_width=3.0, species='setosa')
IrisEntryAttrs(sepal_length=4.7, sepal_width=3.2, species='setosa')
IrisEntryAttrs(sepal_length=4.6, sepal_width=3.1, species='setosa')
```
Following convention, one might be tempted to build a container object for the list of entries, and in some select cases it makes sense to create a container that directly inherits from the builtin `list` or, as shown here, from `typing.List` (which is supposed to be more friendly for inheritance). In general inheritance is bad for your health, but in some select cases it can make things simpler. Here our primary use is to introduce the factory method `from_dataframe`, which simply calls the `IrisEntryAttrs` factory method to parse each row of the dataframe separately. This way you can add operations like grouping, filtering, or even data type conversion as extensions of the list.
```python
class IrisEntriesList(typing.List[IrisEntryAttrs]):

    ######################### Factory Methods #########################
    @classmethod
    def from_dataframe(cls, df: pd.DataFrame) -> IrisEntriesList:
        # add type hint by hinting at the returned variable
        elist = [IrisEntryAttrs.from_dataframe_row(row) for ind, row in df.iterrows()]
        new_entries: IrisEntriesList = cls(elist)
        return new_entries

    ######################### Dataframe Conversion #########################
    def as_dataframe(self) -> pd.DataFrame:
        return pd.DataFrame({
            'sepal_length': [e.sepal_length for e in self],
            'sepal_width': [e.sepal_width for e in self],
            'species': [e.species for e in self],
        })

    ######################### Grouping and Filtering #########################
    def group_by_species(self) -> typing.Dict[str, IrisEntriesList]:
        groups = dict()
        for e in self:
            groups.setdefault(e.species, self.__class__())
            groups[e.species].append(e)
        return groups

    def filter_sepal_area(self, sepal_area: float) -> IrisEntriesList:
        elist = [e for e in self if e.sepal_area() >= sepal_area]
        entries: IrisEntriesList = self.__class__(elist)
        return entries
```
Here I'm including some of the quick examples I showed in the article.
Dataframes are popular because they are easy to use and great for plotting and computing summary statistics. While I often use them to work with tabular data, I suggest you avoid using them as the primary data structures in your pipelines for two reasons: (1) you have no explicit knowledge about your data without introspecting it, and that introspection may need to happen at multiple levels of your program; and (2) they are often the wrong tool for the job performance-wise, even though they may be fine for many tasks.
```python
iris_df.head()
```

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
For languages such as Python, one might prefer standard data structures like `list`s, `dict`s, and `set`s because they are simple and, in many cases, the right tools for the job (that is, you can choose optimal data structures for whatever operations you will perform, an approach often missed by dataframe users). However, they can be a lot to keep track of as the data takes on more complicated structure and you trace it through larger and larger call stacks. You can see the complexity from the type hints I provide in `group_by_species`, a function that simply groups objects by species: it accepts a list of dictionaries mapping strings to floats or strings, and it outputs a dictionary mapping strings to lists of dictionaries that map strings to floats or strings.
```python
irises = iris_df.to_dict(orient='records')

EntryList = typing.List[typing.Dict[str, typing.Union[float, str]]]

def group_by_species(irises: EntryList) -> typing.Dict[str, EntryList]:
    groups = dict()
    for iris in irises:
        groups.setdefault(iris['species'], list())
        groups[iris['species']].append(iris)
    return groups

groups = group_by_species(irises)
groups.keys()
```

```
dict_keys(['setosa', 'versicolor', 'virginica'])
```
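Every consumer of that nested type then has to unpack it all over again. For example, computing the mean sepal length per species from the grouped records looks like this (a sketch with a few hand-entered rows standing in for `iris_df`):

```python
import typing

EntryList = typing.List[typing.Dict[str, typing.Union[float, str]]]

def group_by_species(irises: EntryList) -> typing.Dict[str, EntryList]:
    groups: typing.Dict[str, EntryList] = {}
    for iris in irises:
        groups.setdefault(iris['species'], []).append(iris)
    return groups

def mean_sepal_length(groups: typing.Dict[str, EntryList]) -> typing.Dict[str, float]:
    # each value is itself a list of dicts, so every consumer must re-learn the shape
    return {
        species: sum(e['sepal_length'] for e in entries) / len(entries)
        for species, entries in groups.items()
    }

# hand-entered rows standing in for iris_df.to_dict(orient='records')
irises: EntryList = [
    {'sepal_length': 5.1, 'species': 'setosa'},
    {'sepal_length': 4.9, 'species': 'setosa'},
    {'sepal_length': 7.0, 'species': 'versicolor'},
]
print(mean_sepal_length(group_by_species(irises)))
# {'setosa': 5.0, 'versicolor': 7.0}
```

Contrast this with the `IrisEntriesList` approach above, where the element type carries its own documentation and methods.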