Monday, October 06, 2008

The Case for Canonicals

You might have heard people talking about a "Canonical Data Model", or CDM for short. You might even have heard the rumour that having a CDM is a critical success factor in achieving the true benefits of a Service Oriented Architecture. But perhaps you have yet to encounter the first situation in which one is actually being used. Is a CDM really a must-have or just another buzzword?

First let me try to explain what a CDM actually is, apart from being one of the integration design patterns. In short, a canonical data model provides a generic view of the structure of the data that systems deal with: for example, a generic concept of what a Customer is, what attributes it should have, and what the data types and formats of those attributes are.
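To make this a bit more tangible, here is a minimal sketch in Python of what a canonical Customer and two mappings around it could look like. Everything in it is an assumption made up for illustration: the attribute names, the hypothetical "CRM" and "billing" record layouts, and the choice of formats (a real CDM in an SOA would more likely be expressed as an XML schema).

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class CanonicalCustomer:
    """Hypothetical canonical definition: one agreed set of attributes and types."""
    customer_id: str
    full_name: str
    date_of_birth: date
    country_code: str  # assumed convention: upper-case two-letter country code


def from_crm(record: dict) -> CanonicalCustomer:
    """Map a record from an imaginary CRM system to the canonical form."""
    return CanonicalCustomer(
        customer_id=str(record["CustNo"]),
        full_name=f'{record["FirstName"]} {record["LastName"]}',
        date_of_birth=datetime.strptime(record["DOB"], "%d/%m/%Y").date(),
        country_code=record["Country"].upper(),
    )


def to_billing(customer: CanonicalCustomer) -> dict:
    """Map the canonical form to the layout of an imaginary billing system."""
    return {
        "id": customer.customer_id,
        "name": customer.full_name,
        "birth_date": customer.date_of_birth.isoformat(),
        "country": customer.country_code,
    }


crm_record = {
    "CustNo": 1042,
    "FirstName": "Ada",
    "LastName": "Jansen",
    "DOB": "10/12/1980",
    "Country": "nl",
}
billing_record = to_billing(from_crm(crm_record))
```

Note that neither system needs to know the other's format: each one only maps to or from the canonical definition.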

It might surprise you that having a common view of an entity like "Customer" is often far from common practice. Imagine a big organization like a bank that has many systems with different purposes, often inherited from different companies through mergers. Such an organization can easily have as many definitions of "Customer" in its data dictionary as it has systems that deal with customers.

Now what if such an organization needs to integrate all these systems with SOA, using XML transformations for that? If there are N systems to integrate, then in principle there are N * (N - 1) possible mappings for an entity such as Customer, because every system may need a mapping to each of the other N - 1 systems. In case of 4 systems that need to exchange customer data, that already means 12 mappings, as you can see in the following picture. But if you define one generic definition and map to and from that definition, then the maximum number of mappings is 2 * N. In case of 4 systems that means only 8.

A larger bank easily has hundreds of applications with dozens of different definitions of "Customer", let's say 30. Then the difference is 870 versus 60! And that is only for Customer; there are plenty of other entity types that need to be exchanged as well, like Account, Address, etc. You get the picture?
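The arithmetic above is easy to check for yourself. The following small Python sketch simply evaluates the two formulas from the text; the function names are mine and only serve the illustration.

```python
def point_to_point_mappings(n: int) -> int:
    """Worst case without a CDM: every system maps to each of the other n - 1."""
    return n * (n - 1)


def canonical_mappings(n: int) -> int:
    """With a CDM: each system maps once to and once from the canonical definition."""
    return 2 * n


for n in (4, 30):
    print(n, point_to_point_mappings(n), canonical_mappings(n))
# prints:
# 4 12 8
# 30 870 60
```

The first number grows quadratically with the number of definitions, the second only linearly, which is why the gap widens so quickly.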

So the incentive to use the Canonical Data Model design pattern is to reduce the number of mappings, and with that the inter-dependency between systems, the complexity of the overall integration and, last but not least, the maintenance of all that. For larger organizations this can make a huge difference.

Having said all this, it is probably never the case that all systems need to integrate with each other, let alone that this always requires a two-way mapping of every entity involved. Whether an entity should have a definition in a CDM depends on how many mappings it will be involved in. With three systems the two approaches tie at six mappings each (3 * 2 versus 2 * 3). When there are more than three systems that all need to exchange an entity in a bi-directional way, then a canonical definition of the entity starts to make sense.

The case against canonicals is that in some organizations it might prove far from trivial to reach a common view of what the generic definition of a specific entity should look like.