Tuesday, June 16, 2015

Natural Keys are Healthy

It seems that I often attract suggestions from developers who have recently been reading Ralph Kimball. This week, it was a suggestion to add surrogate keys to all the dimension tables because this "would make the datamart more extensible and scalable"! I had to argue my preference to only use synthetic (surrogate) keys when there is a need. If the natural business key works and there is no need for SCD (slowing changing dimensions or Type 2, as Kimball would say) I would keep the design simple and use the existing natural key. That key is already maintained by the source system and business users are typically familiar with it. In this case the main dimension table contains book titles and has a natural key of ISBN. I think that since the ISBN is an International Standard Book Number, and its attributes don't need SCD support, it would make a perfectly good primary key of the titles dimension table. Also, the fact tables would have ISBN as a foreign key - just like the source system. This requires much less ETL logic and much easier reconciliation. No need to translate surrogate keys when determining which facts are related to ISBN=x. Why would you want to replace ISBN with a synthetic number, which is meaningless to everyone? I struggle with applications that are more complex than they need to be. I think Einstein said something along the lines of "Everything Should Be Made as Simple as Possible, But Not Simpler".

Some might argue that the fact table will be thinner with a numeric foreign key rather than the 13 byte ISBN. That is true, however I think it's risky to remove the natural key (ISBN) from the fact table as it makes reconciliation with ISBN difficult in the event of a bug in the synthetic key allocation. Also, the fact and dimension tables might like indexes on these keys, and with natural and synthetic keys, there might be a need for twice the number of indexes. Overall, I would argue that natural keys offer higher performance.

I have seen datamarts mystified by inappropriate use of surrogate keys. To illustrate, imagine the titles dimension constructed with a synthetic surrogate key and the ISBN no longer unique. This might be argued by a few to be superior as it would support historical integrity. (Hmm, like historical integrity is necessarily superior.) Think about a slowly changing ISBN! Someone enters a new title as "Kate the Cat", when it should have been "Katie the Cat". After a few weeks, it gets noticed and is corrected. However the SCD datamart is too clever and says that it was "Kate the Cat" for those first few weeks, so it must remain so, for historical integrity! From now on, users of this dimension must combine "Kate the Cat" with "Katie the Cat" to get sales information on this book. This can be frustrating for users, who know only one book called "Katie the Cat".

I have also seen datamarts where the dimension tables have blown out by more than 20 times (1 Million customers turns into a 20 Million row customer dimension) because all attributes are maintained with SCD logic. So each customer record had been updated an average of 20 times. Sometimes these changes are irrelevant to the business regarding SCD, such as a field containing the number of Customer Service calls. In this case, every time the customer calls up, there will be a new record in the customer dimension, with just the call count incremented by 1. For what purpose?

I have also seen date dimensions with surrogate keys! This also puzzles me. Does the developer believe that attributes of a date will change and that reports should show those dates with historical integrity? To my simple mind, that just seems to be a quick way to make a datamart/DW more complex for no benefit. Unless, there is benefit in increasing future work loads as maintenance and reconciliation become bigger chores.

I would add, that if you really do identify a need for SCD I would encourage you to only maintain SCD on the identified attributes. Do not apply SCD logic to all attributes blindly.

Please don't see this blog as anti Kimball. I truly believe that he has brought a lot of rigour and maturity to data warehousing. However, to follow his dimensional modelling blindly is not pragmatic and does not help organisations trying to build business intelligence.

No comments: