Normal Forms in a Graph World
Graph databases have grown in popularity in the last several years. Their popularity is partly thanks to their flexibility and intuitiveness. However, I believe they are not fundamentally different from relational databases, but rather a different way of thinking about relational data. They have a direct analogy to relational data, using the following simple translations:
Graph | Relational |
---|---|
Vertex Type | Table |
Vertex | Row |
Vertex Field | Column |
Edge Type | One Join Table per Source Type/Target Type pair |
Edge | A Row in a Join Table |
We can use this analogy to analyze Normal Forms in terms of Graph Databases. This gives us a language with which to discuss how to improve our graph database normalization, as well as when and how to violate normalization. Denormalization is often necessary for technical or organizational reasons, but some forms of denormalization are costlier than others.
These examples mostly come from the surprisingly-approachable Wikipedia page on Database normalization.
First Normal Form (1NF)
First normal form is simply the condition that no columns in a table contain multiple values nor nested values. In a graph world, this is equivalent to fields with multiple values (Collections), or fields with nested values (Objects or Maps).
When to Normalize
This normal form tends to be less problematic than others, and it is frequently a strong candidate for denormalization. Normalization is useful (and denormalization is high-cost) if either of these conditions hold:
- The data represented in the Collection, Object, or Map may be referenced by other vertices, and would need to be duplicated if denormalized.
- The data represented in the Collection, Object, or Map need to be uniquely identified (for reasons such as revision history) and would benefit from a stable vertex ID.
If neither condition holds, the data is a good candidate for low-cost denormalization.
How to Normalize
Normalization of multiple/nested values simply involves breaking the field into its own Vertex Type, then adding Edges from the original Vertex to the new Vertices as necessary.
Second Normal Form (2NF)
A relation is in second normal form if every non-candidate-key attribute depends on the whole candidate key, not just part of it 1.
This is only relevant if a table has a multi-column primary key. Vertex IDs are analogous to single-column primary keys, so we can safely ignore this form for now.
Third Normal Form (3NF)
A relation is in third normal form if every non-candidate-key attribute depends directly (not transitively) on the candidate key. In the words of Bill Kent: "[every] non-key [attribute] must provide a fact about the key, the whole key, and nothing but the key"
Aside: Transitive Functional Dependencies
A functional dependency is when the value of one field depends on the value of another. For example, a book’s Author depends on what book it is (obviously!), which is represented by its ISBN. In a Books table, Author is a functional dependency of ISBN.
A transitive functional dependency is when the value in one field depends on the value of another via some third field. For example, a book’s Author Nationality depends on its Author, which depends on its ISBN, so Author Nationality is a transitive functional dependency of ISBN (and a direct functional dependency of Author).
In this example, storing Author Nationality in the Books table violates 3NF, as the Author Nationality provides a fact not about the key (ISBN), but about the Author (which itself is a fact about the key).
In graph models, the same constraint applies at the field-level. If one field is a direct functional dependency on another field—rather than the vertex itself—it is a transitive functional dependency, and it violates 3NF.
When to Normalize
Similar to 1NF, violating this normal form may be desirable under certain conditions. However, this normal form should be adhered to if the data in the transitive functional dependency may be referenced by other vertices, and would need to be duplicated if denormalized,
How to Normalize
Normalization of transitive dependencies involves extracting both fields of the transitive dependency (Author and Author Nationality) into their own Vertex Type, then adding Edges from the original Vertex (Book) to the new Vertex(ices) (Author) as necessary.
Graph Normalization Heuristics
These forms can be wrapped into a few simple heuristics. These heuristics favor normalization by default, but also include cases of safe denormalization:
- Avoid storing computed fields
- Only store Collections if the elements of the collection don’t need to be uniquely identified
- Otherwise, extract the collection to vertices
- Only store Objects or Maps if the data inside each Object or Map will never be referenced by multiple vertices.
- Otherwise, extract the Object or Map into vertices
- If you have a Vertex Type with a 1-to-1 relationship with another vertex, it can be safely turned into a sub-field of the Vertex Type with which it has a 1-to-1 relationship.