Saturday, December 20, 2008

Lazy Loading Considered Harmful

Lazy loading is a common feature in object-relational data access frameworks. It can produce some really nasty side effects though, and can even be considered harmful. The pattern is considered so risky by some commercial vendors that it isn't included in some products.

Consider a class model that has a Customer class, an Order class, and an OrderLine class. In this model, a Customer is associated with many Orders, and an Order is associated with many OrderLines. The same model is reflected in the relational database, where the application data for this model is stored.

An object-relational data access framework allows order data to be retrieved in the form of an Order object rather than an order data row. This is a great benefit to object-oriented programmers since their environments (.NET, Java, etc.) are natively object-oriented. They will inevitably be working with Order objects, Customer objects, and OrderLine objects.

Objects retrieved from an object-relational data access framework with lazy loading can cause unintended queries to be executed against relational database servers, causing performance, scalability, and data integrity problems.

For example, if a programmer writes some program code that retrieves a Customer object, and then makes use of the Customer's reference to Orders, the data access framework will automatically query all of the Customer's Orders. If the customer has done a lot of business with the company in the past, it may have thousands of orders.
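
To make the scenario concrete, here is a minimal Python sketch of a lazily loaded association. The FakeDatabase class, its query_orders method, and the orders property are hypothetical names invented for illustration, not the API of any real framework. The only point is that something that reads like a plain attribute access can hide a query.

```python
class FakeDatabase:
    """Hypothetical stand-in for a database connection; counts queries issued."""

    def __init__(self, orders_by_customer):
        self._orders_by_customer = orders_by_customer
        self.query_count = 0

    def query_orders(self, customer_id):
        self.query_count += 1
        return list(self._orders_by_customer.get(customer_id, []))


class Customer:
    def __init__(self, customer_id, db):
        self.customer_id = customer_id
        self._db = db
        self._orders = None  # nothing loaded yet

    @property
    def orders(self):
        # The first access silently runs a query; later accesses reuse the result.
        if self._orders is None:
            self._orders = self._db.query_orders(self.customer_id)
        return self._orders


db = FakeDatabase({42: [f"order-{n}" for n in range(3000)]})
customer = Customer(42, db)
queries_before = db.query_count     # no query ran when the Customer was built
order_total = len(customer.orders)  # reads like attribute access, runs a query
queries_after = db.query_count
```

In this sketch the query count goes from zero to one on the line that merely reads customer.orders, and all 3000 orders come back whether the caller wanted them or not.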

An entire application written on an object-relational data access framework that has lazy loading will have quite a lot of these auto-loading relationships set up between objects.

Serving hundreds of users with this application will put stress on database servers that can become unmanageable. In cases like this, it's beneficial to not use lazy loading at all. It's better to write program code that explicitly loads data when it's needed rather than have these risky operations lurking in program code, where they may be hard to find due to their implicit, transparent, and seemingly innocent nature.

Lazy loading can indeed be harmful, and you can see why commercial vendors might not even include such unsafe features in their object-relational data access frameworks. These kinds of problems can lead to increased database server maintenance and operations costs, excess client license expenditures, and even costly business continuity problems when the excess load put on database servers by lazy loading leads to outages.

Up to this point, this article has been a bit of a ruse, filled with many misconceptions about object-relational data access that are unfortunately perpetuated by folks who tend to see object-oriented design through invisible relational data-oriented lenses. The argument is often used to dissuade the use of lazy loading, and has indeed been used by vendors to justify the exclusion of lazy loading from object-relational data access frameworks.

The crux of the issue is the assumption that a Customer object would have an association to its list of Order objects. It's likely not a reasonable way to build a class model for these objects. The Customer class would not have an association to its list of Orders.

Database modeling and class modeling have different rules. They are fundamentally different kinds of technologies, and so the fact that the rules are different shouldn't be much of a surprise. However, if you don't stop to wonder if these differences exist, you might just go ahead and shape your objects and their associations the way you shape the database tables and the relationships that you're used to.

A fixed association between a Customer object and its Order objects is an unnatural association. Although you can conceive of the association in real life, it's not an appropriate association for a class model.

Class models, like many kinds of information models, have natural partitions. The Customer class and the Order class are not part of the same partition. Putting a hard link between them, across their partition boundaries, isn't something that you would simply do without putting some consideration into the design, regardless of whether this association is in a database's data model.

Ironically, there are no hard links between relational database tables. The technology doesn't allow for it. Any conception that we have of hard links between the Customer table and the Order table due to a Customer_ID foreign key field in the Order table is merely a trick of the mind. It's a concept. A Customer row has no knowledge of its Order rows. An Order row has no knowledge of its Customer row.

The Order row has a copy of the value of a Customer ID in one of its columns, but that isn't a fixed association or hard link to the actual Customer row object in the database server's memory model. That foreign key value can be used to query the Order table to find the Customer's orders, or can be used to query the Customer table to find an Order's customer, but these data structures don't have fixed associations the way that objects in an object model do.
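
A small sketch using Python's built-in sqlite3 module shows the same thing. The table and column names here are my own assumptions for the example (the order table is called customer_order only because ORDER is a reserved word in SQL). The order row carries nothing but a copied value, and the association only appears when a query makes use of that value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE customer_order (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER,   -- a copy of a customer's id value, nothing more
        total REAL
    );
    INSERT INTO customer VALUES (1, 'Acme');
    INSERT INTO customer_order VALUES (10, 1, 99.5), (11, 1, 12.0);
""")

# Fetching an order row yields plain values, including the copied customer id.
order_row = conn.execute(
    "SELECT id, customer_id, total FROM customer_order WHERE id = 10"
).fetchone()

# The "association" exists only while a query is using that value.
customer_of_order = conn.execute(
    "SELECT name FROM customer WHERE id = ?", (order_row[1],)
).fetchone()
orders_of_customer = conn.execute(
    "SELECT id FROM customer_order WHERE customer_id = ? ORDER BY id", (1,)
).fetchall()
```

Neither row holds a reference to the other. Each direction of the relationship is just another query keyed on the same value.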

All tables in relational database data models are partitioned. This is just how relational databases work. When we conceive of fixed associations between database tables, we are merely overlaying our conceptual model on a technological model. We can even implement constraint logic in a relational database server to mimic our conceptual model, but none of this useful wizardry changes the fact that the entities in the underlying technological model are naturally partitioned.

Partitioning is a common technique used to reduce the complexity caused by the associations between resources. Every fixed association between two entities makes the system itself more fixed, or rigid, and so more difficult to adapt to new requirements and repairs.

Partitions are found at all levels of a system's architecture - from distributed services off in the cloud, all the way down to the relational databases, and even the disk storage systems beneath them.

Pat Helland writes about partitions in his paper on almost-infinitely scalable systems, Life beyond Distributed Transactions. Roger Sessions talks about partitions in his Simple Iterative Partitions process for decreasing complexity in enterprise architecture. You can find this pattern everywhere in software. Once you realize that it exists, you'll see it everywhere.

Eric Evans writes about a particular partitioning pattern called Aggregate in Domain-Driven Design: Tackling Complexity in the Heart of Software. This pattern is useful in guiding the decisions you can make in designing a class model.

The Customer class and the Order class are in separate aggregates. There may be cases where this isn't true, but for the most part it's likely to hold true. Because they are in separate aggregates, I would think twice about establishing a fixed association between them. Without a fixed association between them, I no longer have the lazy loading risk.

I can still get all Order objects for a Customer by specifically querying the database through a data access class written to support Order data access needs for the application. This query is issued from within the higher level business scenario logic that might need a Customer as well as its Orders. In practice, this pattern covers the situations where you might have presumed a need to query the database for a Customer's orders via a Customer object rather than a data access object for Customer.
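
As a rough sketch of that arrangement in Python: CustomerData, OrderData, and order_history are hypothetical names, and the in-memory lists stand in for real queries. Customer carries no orders attribute at all, and the scenario logic asks the Order data access class for what it needs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Customer:
    # No `orders` attribute: Customer and Order are treated as separate aggregates.
    customer_id: int
    name: str


@dataclass(frozen=True)
class Order:
    order_id: int
    customer_id: int  # a copied value, not an object reference


class CustomerData:
    """Hypothetical data access class for Customer, backed by a list for brevity."""

    def __init__(self, customers):
        self._by_id = {c.customer_id: c for c in customers}

    def get(self, customer_id):
        return self._by_id[customer_id]


class OrderData:
    """Hypothetical data access class for Order; a real one would run SQL."""

    def __init__(self, orders):
        self._orders = list(orders)

    def find_by_customer(self, customer_id):
        return [o for o in self._orders if o.customer_id == customer_id]


def order_history(customer_id, customer_data, order_data):
    """Business scenario logic: loads exactly what it needs, visibly."""
    customer = customer_data.get(customer_id)
    orders = order_data.find_by_customer(customer_id)
    return customer, orders


customer_data = CustomerData([Customer(1, "Acme"), Customer(2, "Globex")])
order_data = OrderData([Order(10, 1), Order(11, 2), Order(12, 1)])
customer, orders = order_history(1, customer_data, order_data)
```

The query for orders is an ordinary, visible call in the scenario code, so there is nothing for a lazy loader to trigger behind the programmer's back.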

From within an aggregate, you might make use of lazy loading or eager loading depending on performance analysis and other empirical knowledge. For example, the Order class does have a fixed association between itself and its OrderLines. When an Order object is retrieved, if it is always necessary to retrieve the Order's lines, then the association would be eager loaded on the spot. If not, it would be lazy loaded - deferring the decision to load the related OrderLines till later in the execution of a business transaction if the Order's lines are referenced.
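
The choice between the two inside an aggregate might look something like the following Python sketch. LineStore, lines_for, and the eager flag are assumptions made up for this example. A real framework would express the same decision through mapping configuration rather than a constructor argument.

```python
class LineStore:
    """Hypothetical source of OrderLine data; counts the queries it serves."""

    def __init__(self, lines_by_order):
        self._lines_by_order = lines_by_order
        self.query_count = 0

    def lines_for(self, order_id):
        self.query_count += 1
        return list(self._lines_by_order.get(order_id, []))


class Order:
    def __init__(self, order_id, store, eager=False):
        self.order_id = order_id
        self._store = store
        # Eager: load the lines on the spot. Lazy: defer until first use.
        self._lines = store.lines_for(order_id) if eager else None

    @property
    def lines(self):
        if self._lines is None:
            self._lines = self._store.lines_for(self.order_id)
        return self._lines


store = LineStore({1: ["2 x widget", "1 x gadget"]})

eager_order = Order(1, store, eager=True)
queries_after_eager = store.query_count      # lines were loaded with the Order

lazy_order = Order(1, store)
queries_after_lazy_build = store.query_count  # unchanged: nothing loaded yet
lazy_lines = lazy_order.lines                 # the deferred load happens here
queries_after_lazy_use = store.query_count
```

Either way the load stays inside the Order aggregate, where the amount of data is bounded by a single order rather than a customer's entire history.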

There are cases where you might not put a fixed association between an Order and OrderLines at all. It depends on the context, the amount of data being loaded, and possibly other factors as well. There are no canonical models - only business contexts that are best served with models crafted to suit the circumstances, regardless of the unbounded predilections for reuse that we sometimes succumb to.

Lazy loading is harmful if you use it in support of naive class modeling. The absence of lazy loading in object-relational applications is just as harmful, leading to infrastructure code bloat in non-infrastructure classes, poor encapsulation, higher complexity, higher coupling, and code that is harder to understand, test, and maintain.

Any trepidation that people have with lazy loading is often rooted in misconceptions about object-oriented programming and design. To use lazy loading safely, start by modeling the object-oriented parts of your application according to object-oriented principles, and recognize that there are different principles for object modeling and data modeling, and that this is a good thing rather than something to hide from.

Any tool is considered harmful when used improperly or naively. Or, like my favorite software development quote says:

Every tool is a weapon - if you hold it right
- Ani DiFranco