The Canon-View Paradigm
During my time at Amazon, our backend systems were plagued with data consistency issues. I found existing design patterns and frameworks insufficient for describing the true root cause of these issues. Instead, I came up with what I call the Canon-View Paradigm.
The Canon-View Paradigm is a way of thinking about data systems that standardizes many different types of systems on a simple shared vocabulary that allows for clear reasoning about engineering tradeoffs. I call it a “paradigm” because it is a high-level, system-wide taxonomy; in that sense, it is a higher degree of abstraction than a design pattern (such as Model-View-Controller). It could be thought of as a conceptual framework, but in software, the word “framework” often means a specific set of code-level abstractions (which themselves often implement design patterns, such as the Django framework).
Information Systems
Most networked computer systems (enterprise web services, social media, ride-share apps) are information systems. That is, they store some information about the real world in a digital format to be accessed by people at different places and times. While this is obviously the case with, e.g., an enterprise record-keeping system, this framing is less obviously applicable to, say, DoorDash. Nonetheless, the Canon-View Paradigm applies to any system that can be thought of in this way.
The Canon
If an information system stores information about the real world, the Canon is the name we give to the authoritative stored representation of that information. The Canon is the single source of truth for a piece of information stored in the system, and a given piece of information in the system can only be stored canonically ONCE. This condition is the characterizing feature of the Canon-View Paradigm. Incidentally, canonical data often corresponds nearly 1-to-1 with user input.
Views
All non-canonical information represented in the system is called a View. This is an intentionally broad definition, and distinct from a database view in SQL; indeed, almost any data representation can be considered a View so long as it is not the Canon. A UI rendering of canonical data is a View. A computation on two pieces of canonical data is a View. A Slack notification triggered by some change in the system is a View. A database view or computed column in SQL is a View.
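To make the distinction concrete, here is a minimal sketch in Python using the standard-library sqlite3 module. The line_item table, the order_total view, and the order_total_cents helper are all hypothetical names invented for this illustration. The table holds the Canon; the SQL view is a View in the sense above, a computation on two pieces of canonical data that is never stored or written to directly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE line_item (            -- Canon: each fact is stored exactly once
        id INTEGER PRIMARY KEY,
        order_id INTEGER NOT NULL,
        unit_price_cents INTEGER NOT NULL,
        quantity INTEGER NOT NULL
    );
    CREATE VIEW order_total AS          -- View: derived from the Canon on every read
        SELECT order_id, SUM(unit_price_cents * quantity) AS total_cents
        FROM line_item
        GROUP BY order_id;
""")
conn.executemany(
    "INSERT INTO line_item (order_id, unit_price_cents, quantity) VALUES (?, ?, ?)",
    [(1, 500, 2), (1, 250, 1)],
)


def order_total_cents(conn, order_id):
    """Read the View for one order; returns 0 if the order has no line items."""
    row = conn.execute(
        "SELECT total_cents FROM order_total WHERE order_id = ?", (order_id,)
    ).fetchone()
    return row[0] if row else 0


print(order_total_cents(conn, 1))  # 1250

# Changing the Canon is the only write; the View follows automatically.
conn.execute("UPDATE line_item SET quantity = 3 WHERE id = 1")
print(order_total_cents(conn, 1))  # 1750
```

Because the total is never stored, there is no second copy that could drift out of sync with the line items.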
As the name might suggest, these two definitions are sufficient for defining the Canon-View Paradigm. It is the fairly simple matter of considering every data representation in a system as either the Canon, or a View, and being extremely strict in distinguishing exactly one representation of each piece of information as canonical and therefore authoritative over all the Views. The Canon-View Paradigm is all about preventing the storage of duplicate data.
Example: Age and Birthday
Though this may all seem trivial, here’s a toy example that illustrates the nuance required in distinguishing Canon from View.
The Problem
Let’s say a developer is building a social network, and they want to allow users to input their birthday as well as display their age. This developer creates a screen that allows the user to enter their birthdate, then creates a backend endpoint that takes the birthdate, computes the user’s current age based on the date, and stores both values in the database as two columns in the user table: birth_date and age. The savvy reader will already be wondering: what happens next year?
The problem here is that the system is treating both birth_date and age as canonical, whereas in reality age is a computation on birth_date and today(), and therefore a View.
The Canon-View Paradigm Solution
Under the Canon-View Paradigm, the solution here is simple: treat birth_date as canonical data, and compute age as a View of that data. The question of when and how to make that computation is a matter of engineering tradeoffs. By default, it’s perfectly reasonable to make this computation at read-time. Either way, in understanding age to be a computed value, we avoid one of the most painful classes of bugs: data corruption.
Cache
A seasoned engineer will have the sensible protest that many Views can’t practically be re-computed on every request, and I agree. In such cases, it is acceptable to temporarily store a View. Under the Canon-View Paradigm, we call any stored View a Cache. As with View, this term is already well-known in software engineering, though we use it in a slightly broader way. Data memoized by a React useMemo hook is a Cache. Data in a CDN is a Cache. A materialized database view is a Cache. A SQL table of statistics that is populated by a daily cronjob is a Cache.
In the toy example, the age column in the database is a Cache (even though it’s stored in the database rather than some “cache”-style storage). This may seem pedantic, but when you call it a cache, you invoke all the considerations that come with caching, such as cache staleness, invalidation, and refreshing. Crucially, you establish that the Cached data in age is junior to the Canonical data in birth_date: if they ever conflict, we consider that an invalid cache and always resynchronize to the value in birth_date. Furthermore, the function by which a Cache is computed from the Canon must always be idempotent: it must be possible to run it multiple times and always get the same result.
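Here is one way the "Cache is junior to Canon" rule might look in code. This is a sketch under my own assumptions: UserRow, compute_age, and refresh_age_cache are hypothetical names, and the row is an in-memory object standing in for a database record.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class UserRow:
    birth_date: date                  # Canon
    cached_age: Optional[int] = None  # Cache: a stored View, junior to birth_date


def compute_age(birth_date: date, today: date) -> int:
    """The View function: whole years between birth_date and today."""
    years = today.year - birth_date.year
    if (today.month, today.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


def refresh_age_cache(row: UserRow, today: date) -> bool:
    """Resynchronize the Cache from the Canon; returns True if it was stale."""
    fresh = compute_age(row.birth_date, today)
    stale = row.cached_age != fresh
    row.cached_age = fresh  # the Canon always wins; the old cached value is discarded
    return stale


row = UserRow(birth_date=date(1990, 6, 15), cached_age=33)
print(refresh_age_cache(row, date(2024, 6, 15)))  # True: cache was stale, now 34
print(refresh_age_cache(row, date(2024, 6, 15)))  # False: rerunning changes nothing
```

The refresh never reads the old cached value to produce the new one, which is what makes it safe to run any number of times.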
Consequences of the Canon-View Paradigm
We’ve now defined the Canon and Views (which, when stored, we call Caches). It’s a simple, borderline pedantic taxonomy. Its power is in its consequences.
Complexity Allocation
When adhering strictly to the Canon-View Paradigm, the resulting systems allocate the majority of their complexity to the read side. It is rare that any complex computation happens at write-time (with the exception of data validation). To the engineer who is used to building denormalized systems, this may seem more complex. I assert that for any given system, the total complexity is the same, but since the majority of that complexity is allocated to the read side, the maintenance burden is decreased. The reason for this is simple: when a bug is found in the read path, the bug needs to be fixed, and that is all. When a bug is found in the write path, the bug needs to be fixed, any data written since the introduction of the bug needs to be reviewed, and that data needs to be restored from its corrupted state to an uncorrupted state. When the write path includes computations that are non-idempotent, this process can only be described as hellish.
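The asymmetry can be sketched with the age example. Everything below is hypothetical: buggy_age stands in for a flawed computation that shipped, fixed_age for its correction, and plain Python lists stand in for database tables.

```python
from datetime import date


def buggy_age(birth_date: date, today: date) -> int:
    # Bug: ignores whether this year's birthday has happened yet.
    return today.year - birth_date.year


def fixed_age(birth_date: date, today: date) -> int:
    years = today.year - birth_date.year
    if (today.month, today.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


today = date(2024, 1, 10)
birth_dates = [date(1990, 6, 15), date(2000, 1, 5)]

# Write-path design: the buggy result was persisted next to the Canon.
stored_rows = [{"birth_date": b, "age": buggy_age(b, today)} for b in birth_dates]

# Deploying the fix does not touch what is already stored...
corrupted = [r for r in stored_rows if r["age"] != fixed_age(r["birth_date"], today)]
print(len(corrupted))  # 1

# ...so a review and backfill are needed, possible here only because the Canon survived.
for r in corrupted:
    r["age"] = fixed_age(r["birth_date"], today)

# Read-path design: only the Canon is stored, so deploying the fix repairs every read.
canon_rows = [{"birth_date": b} for b in birth_dates]
ages = [fixed_age(r["birth_date"], today) for r in canon_rows]
print(ages)  # [33, 24]
```

In this toy case the backfill is easy because birth_date was kept. The hard cases are the ones where the write path discarded or overwrote the inputs it computed from.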
Storage
When data modeling with the Canon-View Paradigm in mind, one leans toward a fairly strongly normalized data model (often landing around BCNF or 4th Normal Form). At first glance, such data models sometimes seem to have a larger data footprint. However, the whole purpose of normalization is to prevent duplicate data storage, so a denormalized alternative generally requires more storage space to represent the same information.
Conclusion
Whether you choose to use this taxonomy or not, my hope is that you have an increased sensitivity to the subtle ways in which duplicate data can be stored, and the cost of bugs in systems where this is the case. I’ve found the terms Canon, View, and Cache to help me reason about these systems. I hope you find them helpful too.
Bonus
This simple taxonomy can be stretched beyond reason. For example, a DAW can be thought of in terms of the Canon-View Paradigm. I don’t recommend it. It would look like this.