I frequently need to store large binary data like parsed text objects, trained ML models, or custom data types as part of my data analysis workflow. For me, SQLite and other database engines are obvious choices because I ususally need to record metadata like algorithm hyperparameters, evaluation results, or timestamps and any other information that might be relevant when I need to look back at various models I have run. Using relational databases to store large binary data is typically ill-advised, so my solution was to implement a custom column type that handles insertion of binary data by saving to the filesystem and including only the file path in the original database.
If you've been attempting to keep track of multiple models generated from your analyses without databases, you've likely been generating filenames like `modelA_a=0.1_tau=3_iter=5_runs=100.pic', which include information about the stored model with some metadata. The metadata is hard to sort through because you will be reading raw filenames or writing parsers to automatically scan the files to find a model you are interested in. In contrast, traditional databases are great at storing this type of metadata, but innapropriate for storing large binary data alongside it. The implementation of my solution in doctable allows us to treat these columns just as if they were regular binary (or text) data being stored in the database directly, while actually storing the data in the filesystem for faster read and write times.
In case you are not familiar,
doctable is designed to provide an object-oriented interface for working with relational database tables.
It is database engine independent because it is built on SQLAlchemy Core, but does not provide object-relational mapping (ORM) functionality - it simply works as a way to describe database schemas in Python code, generate queries, and work with query results through a Python interface.
I used SQLAlchemy's custom data type feature to implement my solution to the binary storage problem, so it can be accessed just as if it were any other data type.
From the doctable perspective, these special file columns look like any other column used to stored pickled binary data.
I created a few examples that might be helpful for understanding the main interface implemented in doctable here.
I implemented this feature by subclassing
sqlalchemy.types.TypeDecorator to a custom
PickleFileType class and overloading the
process_result_value to implement the insert/update and select functionality as you would any other custom sqlalchemy type.
Each time a query is executed, sqlalchemy constructs a new
PickleFileType and is passed a reference to a persistant
FileTypeControl object which was created when the
DocTable was instantiated.
Each file-based column maintains a
FileTypeControl object to maintain state between queries and bridge messages to and from the
When an object is inserted into the database, the
PickleFileType object creates a random filename for the payload data, and writes the pickled object to disk.
It then writes the filename directly into the SQLTable to keep track of the associated file.
When an object is selected from the database, the
PickleFileType retreives the filename and reads/unpickles the file into the returned object.
From the interface, insert/update/delete operations appear as if you were inserting or retrieving data in the database directly.
I set up a benchmark where I created two different databases: one that includes a payload column where the data is to be inserted directly into the database, and another that includes a payload column where that stores the data into a file in the filesystem and inserts only a filename into the actual database.
The benchmark consists of three different measures. The first is a measure of insert time: how long it takes to store a set of binary objects into the database. The second is a measure of how long it takes to select the index column of the database. The hypothesis behind this measure is that inserting large quantities of data into the table will create larger distances (in terms of disk addresses) between each successive row, and thus slow down select statements even when the payload column is not included in the query. The last measure is the query for the actual payload, which is intended to test the raw read speed.
This first figure shows the average ammount of time it took to insert 500 objects into an SQLite database on my machine at various data payload sizes. We can see that the file-based solution I implemented shows a ~3x improvement over inserting directly into the database.
This next figure shows the average ammount of time it took to select just the index column of the table. We can see that the file-based solution is approximately 5x faster than the in-database storage solution.
This last figure shows the time it took to load the actual data payload. The file-based solution is much faster than the in-database storage solution, and the multiplier appears to increase with filesize.
From these results we can see that the file storage-based solution improves performance significantly according to these three measures. Of course, it is difficult to project these results into extreme cases where we are storing extremely large data payloads or a ton of really small payloads, but I think it does provide insight into some reasonable cases.