The layering of orthogonal concepts within SQLAlchemy lends itself to deep introspection, which is useful for purposes ranging from debugging to concise expression of programmatic intent. Version 0.8 added a detailed introspection API; previously, these capabilities existed but were mostly undocumented and without official support. We’ll cover some deeper parts of this API through the investigation of an application bug and the addition of a common feature. First, though, it might be best to glance at the surface.

Transmutation Ingredients

SQLAlchemy is a comprehensive database interface tool that is split into several components. The most obvious distinction is between ‘SQLAlchemy Core’ and the ‘SQLAlchemy ORM’. Core and the ORM are themselves each subdivided into several layers, though the primary focus of this article is the ORM’s internals. In addition, it’s important to note the separation of the ORM from the declarative extension. The declarative extension adds the declarative base class and other niceties, but ultimately it is just another layer.

Session

A primary focal point of SQLAlchemy is the fabled “db session”. This object is the key to interacting with the ORM (during model usage, rather than creation) since nearly all of the heavy lifting is done in a way that is rooted to a single Session. This Session does several things that are mostly behind the scenes, but all ORM object instances ultimately hold a reference back to it.

The Session object is responsible for storing the in-memory Python object instances and synchronizing them with the database’s current state. One important shortcut (normally) taken by SQLAlchemy is to assume that all interaction with the session takes place in the context of a transaction. This allows SQLAlchemy to batch updates, maintain its identity map, and issue queries that return accurate results while only communicating with the database when needed.
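The identity map can be seen directly in a minimal, self-contained sketch. The model, attribute names, and in-memory SQLite setup below are invented for illustration (SQLAlchemy 1.4+ assumed); the point is that two lookups of the same primary key through one Session yield the very same Python object:

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class User(Base):
    __tablename__ = "users"
    id = Column(Integer, primary_key=True)
    name = Column(String)

engine = create_engine("sqlite://")  # in-memory database for the demo
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

session.add(User(id=1, name="alice"))
session.commit()

# Two lookups for the same primary key return the very same Python object;
# the second lookup is served from the identity map without emitting SQL.
a = session.get(User, 1)
b = session.get(User, 1)
assert a is b
```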

Flushing

In common use of SQLAlchemy, communication with the database is delayed until “needed”. In particular, this means that

inst = db_session.query(MyObj).get(1)
inst.first_attr = "hello"

does not execute an UPDATE statement. Instead, the new value of ‘first_attr’ is recorded as pending within the instance’s state and sent in an UPDATE statement when a flush occurs. Flushes are either explicit (session.flush()) or automatic (run before each query, including SELECT queries). In addition, a flush is always executed before a commit. Autoflush exists to ensure that changing an object and then querying for it returns correct results, since before the flush the database is unaware of in-memory modifications. In other words, if one ran the above code and then ran db_session.query(MyObj).filter_by(first_attr="hello") with autoflush=False, the object would not be returned; with autoflush=True, a .flush() call would be executed first, allowing the DB to notice that this object meets the criteria.
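This autoflush behavior can be demonstrated end to end. In this sketch (the MyObj model, its columns, and the in-memory SQLite setup are assumptions for illustration; SQLAlchemy 1.4+), the same query misses inside a no_autoflush block and matches once autoflush is allowed to run:

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class MyObj(Base):
    __tablename__ = "my_obj"
    id = Column(Integer, primary_key=True)
    first_attr = Column(String)

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()  # autoflush=True is the default
session.add(MyObj(id=1, first_attr="initial"))
session.commit()

inst = session.get(MyObj, 1)
inst.first_attr = "hello"  # pending, in-memory only

# With autoflush suspended, the database has not seen the change, so the
# query misses:
with session.no_autoflush:
    missed = session.query(MyObj).filter_by(first_attr="hello").all()

# With autoflush active, a flush (and its UPDATE) runs before the SELECT,
# so the same query now finds the object:
found = session.query(MyObj).filter_by(first_attr="hello").all()
```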

InstanceState

Every model instance has an associated InstanceState instance, which is the actual store for these values. In particular, the current in-memory values (if they are loaded) live on InstanceState’s .dict attribute (essentially the instance’s __dict__), while the last-flushed values of any modified attributes are recorded on .committed_state (a somewhat confusing name). The official API to access this data, however, is the History interface. This interface shows the old value and new value in a much more convenient way, and is obtained via the inspection API.
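A small sketch can make this layout concrete. The Item model and values here are invented for illustration (SQLAlchemy 1.4+ assumed):

```python
from sqlalchemy import Column, Integer, String, create_engine, inspect
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Item(Base):
    __tablename__ = "items"
    id = Column(Integer, primary_key=True)
    name = Column(String)

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()
session.add(Item(id=1, name="old"))
session.commit()

item = session.get(Item, 1)
state = inspect(item)  # the InstanceState
item.name = "new"

current = state.dict["name"]              # "new": current in-memory value
previous = state.committed_state["name"]  # "old": value as of the last flush
history = state.attrs.name.history        # the official History interface
```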

istate = inspect(inst) returns an InstanceState. istate.attrs returns a “namespace” (a dict-like object) mapping attribute names to AttributeState instances. Each AttributeState exposes a history attribute, which returns a History object: the official interface to the old and new pre-flush values.

Alchemical Calcination 1

In resolving bugs, one must first investigate and determine their cause. In a bug I resolved recently, a logically unchanged object was causing SQLAlchemy to emit an UPDATE statement, which caused the database to update a last-modified timestamp. In this case, an application of inspect(), InstanceState, AttributeState, and History used just before db_session.commit() was very useful in spotting the issue:

>>> dict([(k, v.history) for k, v in inspect(model_instance).attrs.items() if v.history.has_changes()])
{u'location_id': History(added=['2'], unchanged=(), deleted=[2L])}

Given a model instance, we inspect() it, which returns an InstanceState instance. This tells us about the state of the object in its session (pending, detached, etc.), and has details about its attributes. Accessing the attrs attribute returns a “namespace”, which behaves more or less like a dict. Its keys are the names of persisted attributes for our instance, and its values are AttributeState objects. An AttributeState object’s history attribute gives us access to a History object, and it is these History objects that contain the details of state that is pending but not yet persisted to the database via a flush operation.

It is worthwhile to note that this history API is generally only useful pre-flush, because it is the flush that issues the UPDATE or INSERT statements and resets the history. That being said, the above could integrate quite nicely with a session before_flush listener (or a simple breakpoint).
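As a sketch of that integration (the listener, model, and attribute names below are illustrative, not from the original application; SQLAlchemy 1.4+ assumed), a before_flush listener can capture the pending History of every dirty object before the flush resets it:

```python
from sqlalchemy import Column, Integer, String, create_engine, event, inspect
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Widget(Base):
    __tablename__ = "widgets"
    id = Column(Integer, primary_key=True)
    name = Column(String)

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)

captured = []

@event.listens_for(Session, "before_flush")
def log_pending_history(session, flush_context, instances):
    # History is still populated here; once the flush completes it is reset.
    for obj in session.dirty:
        captured.append({
            key: attr.history
            for key, attr in inspect(obj).attrs.items()
            if attr.history.has_changes()
        })

session = Session()
session.add(Widget(id=1, name="old"))
session.commit()  # INSERT flush: session.dirty is empty, nothing captured

w = session.get(Widget, 1)
w.name = "new"
session.commit()  # UPDATE flush: the listener captures the pending History
```

Registering the listener on the sessionmaker (rather than one session instance) applies it to every session it creates.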

Alchemical Multiplication 2

Serialization is a common feature added to many declarative base implementations. Often it will take the name of .as_dict(), .as_json(), or even .__getstate__() for Base classes that would like to support the pickle protocol. Unfortunately, several implementations fall short of achieving various desired outcomes. For example, one may want to serialize an object to JSON for display on the frontend. However, as soon as different users have different logical “attribute level” permissions to view fields (e.g., ‘owner’, ‘salary’, or ‘home_address’), this one-size-fits-all approach can fall short. There are also other decisions to make: often an object has dependent children (say, a user has multiple phone numbers), and in the JSON representation it may be convenient to return the attribute ‘phones’ as a list of numbers rather than deal with an entirely separate UserPhone object on the frontend. In short, there’s no one-size-fits-all solution.

That being said, below is my one-size-fits-all solution. It inspects an object instance and returns a serialized dict. The function is recursive by default, though that can be disabled. Many-to-many relationships are followed and returned as dicts or as a list of ids (depending on arguments). In addition, it takes a filter_func that is called twice per dumped object: once with a dict of attributes (before hitting the database) that can whitelist or add additional attributes to return, and a second time with the loaded attribute values. This allows a clean logical dump with appropriate filtering based on where it’s called.
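The full implementation is not reproduced here, but a simplified sketch along those lines might look like the following. The Thing and Foo models, their columns, and the exact dump() behavior are assumptions for illustration; filter_func and the list-of-ids mode are omitted for brevity, and recursion is limited to a single level (SQLAlchemy 1.4+ assumed):

```python
from sqlalchemy import (Column, ForeignKey, Integer, String, create_engine,
                        inspect)
from sqlalchemy.orm import declarative_base, relationship, sessionmaker

Base = declarative_base()

class Thing(Base):
    __tablename__ = "things"
    id = Column(Integer, primary_key=True)
    attr_a = Column(String)
    attr_b = Column(String)
    foos = relationship("Foo")

class Foo(Base):
    __tablename__ = "foos"
    id = Column(Integer, primary_key=True)
    bar = Column(Integer)
    thing_id = Column(Integer, ForeignKey("things.id"))

def dump(obj, include_relationships=False):
    """Serialize a mapped instance to a plain dict via the inspection API."""
    mapper = inspect(obj).mapper
    # Column-backed attributes become plain key/value pairs.
    result = {attr.key: getattr(obj, attr.key) for attr in mapper.column_attrs}
    if include_relationships:
        for rel in mapper.relationships:
            value = getattr(obj, rel.key)
            if rel.uselist:
                result[rel.key] = [dump(child) for child in value]
            elif value is not None:
                result[rel.key] = dump(value)
    return result

# Demo setup with an in-memory database:
engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()
session.add(Thing(id=1, attr_a="a", attr_b="b",
                  foos=[Foo(id=1, bar=123), Foo(id=2, bar=456)]))
session.commit()
model_instance = session.get(Thing, 1)
```

Unlike the trimmed output shown below, this sketch also includes foreign key columns (such as the hypothetical thing_id) in child dicts; suppressing those, guarding against back-reference cycles, and applying filter_func are left to the real implementation.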

>>> dump(model_instance)
{'id': 1, 'attr_a': 'a', 'attr_b': 'b'}

>>> dump(model_instance, include_relationships=True)
{'id': 1, 'attr_a': 'a', 'attr_b': 'b', 'foos': [{'id': 1, 'bar': 123}, {'id': 2, 'bar': 456}]}

  1. Calcination ↩︎

  2. Multiplication ↩︎