I’m sure most BI developers have had a customer ask why there are null values in their report and what they mean. Unknown values are very common in a data warehouse solution, and there are a number of possible causes. Sometimes developers take action to fix or hide the unknowns, and sometimes we decide it’s OK to leave them as they are. It all depends on whether the developer understands where the null value comes from and asks a couple of questions before making a decision.
Before looking at the design, the first thing is always to check your data quality. It often happens that the source data is invalid for the field type, or that there is a domain constraint we didn’t capture correctly. These mistakes are easy to make but hard to discover because they hide in the details. So start by checking whether any of the data selection rules is violated.
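As a rough illustration of that first check, here is a minimal Python sketch that scans source rows for values that break a field type or a domain constraint. The column names (`amount`, `status`), the list of valid statuses, and the sample rows are all made up for the example, so treat it as a starting point rather than a real validation framework.

```python
from decimal import Decimal, InvalidOperation

# Hypothetical domain constraint: the only status codes the warehouse expects.
VALID_STATUSES = {"OPEN", "SHIPPED", "CANCELLED"}


def find_violations(rows):
    """Return (row_index, column, reason) for every value that breaks a rule."""
    violations = []
    for i, row in enumerate(rows):
        amount = row.get("amount")
        if amount is not None:
            try:
                Decimal(str(amount))  # field type rule: amount must be numeric
            except InvalidOperation:
                violations.append((i, "amount", "not numeric"))
        status = row.get("status")
        if status is not None and status not in VALID_STATUSES:
            violations.append((i, "status", "outside domain"))
    return violations


# Made-up source rows: one bad amount, one status in the wrong case,
# and one row that is simply empty in the source.
source_rows = [
    {"amount": "19.99", "status": "OPEN"},
    {"amount": "n/a", "status": "SHIPPED"},
    {"amount": "5", "status": "Shipped"},
    {"amount": None, "status": None},
]

violations = find_violations(source_rows)
for row_index, column, reason in violations:
    print(f"row {row_index}: {column} is {reason}")
```

Rows flagged here would probably load as nulls if the ETL silently discards invalid values, which is why it pays to rule this cause out first. The last row is not flagged because it is empty in the source, which is a different cause.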
Is it a Dimension Attribute?
In an OLTP relational database, it’s often important to keep null values as they are so the data stays true to what was entered. But unlike a transactional system, a data warehouse has more complex situations. The first scenario is a null attribute value in a dimension table. Most often this is because certain attributes do not apply to all dimension records. For example, a customer who is not based in the US may have an empty State. Alternatively, the data may not be available yet because of a delay in entry in the source system. In either situation, we can create a custom transformation that captures the null values instead of just leaving a blank field. A simple way is to insert a row with -1 as the key and ‘Not Applicable’ as the description, but we can assign any custom value to represent a missing piece of data.
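To make the dimension case concrete, here is a small Python sketch of such a transformation. The customer columns, the 'Unknown' label for late-arriving data, and the sample customers are my own assumptions for the example; only the -1 key and the 'Not Applicable' description come from the approach described above.

```python
NOT_APPLICABLE = "Not Applicable"
UNKNOWN = "Unknown"  # assumed label for data that has not arrived yet


def fill_state(customer):
    """Replace an empty State with a label that explains why it is empty."""
    state = customer.get("state")
    if state:
        return state
    if customer.get("country") != "US":
        return NOT_APPLICABLE  # the attribute does not apply to this record
    return UNKNOWN  # US customer whose state has not been entered yet


# Seed the dimension with the special -1 row.
dim_customer = [
    {"customer_key": -1, "name": NOT_APPLICABLE,
     "country": NOT_APPLICABLE, "state": NOT_APPLICABLE},
]

# Made-up source rows.
source_customers = [
    {"name": "Alice", "country": "US", "state": "TX"},
    {"name": "Bruno", "country": "DE", "state": None},
    {"name": "Carol", "country": "US", "state": None},
]

for key, customer in enumerate(source_customers, start=1):
    dim_customer.append({
        "customer_key": key,
        "name": customer["name"],
        "country": customer["country"],
        "state": fill_state(customer),
    })

for row in dim_customer:
    print(row["customer_key"], row["name"], row["state"])
```

Keeping two labels lets a report reader tell the difference between "this does not apply" and "we don’t know yet".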
Is it a Fact Measure?
A null fact measure is very similar to a null dimension attribute. The cause could be data that does not exist or a delayed data input. But when we consider how fact measures are interpreted, it’s usually fine to treat a blank value as zero. When it comes to cube aggregation and report development, displaying the null values as 0 or 0.00 is not a big challenge. That is to say, a developer can leave the null value as it is and change the format strings in the downstream cube and reporting tools.
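The idea of leaving the stored value alone and handling nulls only at display time can be sketched in a few lines of Python. The function name and the sample measures are hypothetical; in practice this would be a format string in the cube or reporting tool rather than hand-written code.

```python
def format_measure(value, decimals=2):
    """Render a measure for display, showing a null (None) as zero."""
    number = 0 if value is None else value
    return f"{number:.{decimals}f}"


# Made-up fact measures; None stands for a null in the fact table.
sales_amounts = [120.5, None, 30]

# The stored values stay untouched; only the displayed text changes.
displayed = [format_measure(v) for v in sales_amounts]

# A sum that treats nulls as zero gives the same total as skipping them.
total = sum(v for v in sales_amounts if v is not None)

print(displayed)
print(format_measure(total))
```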
Is it a Fact Surrogate Key?
This is probably one of the most important ETL steps a developer should take care of. In theory, every surrogate key should be assigned a value to avoid violating referential integrity. There are plenty of reasons for a null value in a surrogate key field: the join could be incorrect, the fact row might not apply to that dimension, or the key may simply not exist in the source system. Thus, we need a consistent method for dealing with empty surrogate keys. For anything that is not applicable or not related to the fact table, we can set the key to -1, -2, or -3 depending on the situation, and insert those keys and their corresponding descriptions into the dimension tables. For a transaction that happens before its dimension record is created, we can take the dimension attributes from the fact data and insert them into the dimension table. For example, this applies when a marketing custom item is sold in a special business unit and the item cannot be found in the item dimension table yet.
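The surrogate key rules above could look something like the following Python sketch. The meanings I give to -1 and -2, the `line_type` flag that marks a row with no item, and all the column names are assumptions for the example. A real ETL job would do this lookup in the database or the ETL tool, but the logic is the same.

```python
NOT_APPLICABLE_KEY = -1    # assumed meaning: the dimension does not apply to the row
MISSING_IN_SOURCE_KEY = -2  # assumed meaning: the source sent no business key

# Item dimension seeded with the special rows plus one regular item.
dim_item = [
    {"item_key": NOT_APPLICABLE_KEY, "item_code": None, "description": "Not Applicable"},
    {"item_key": MISSING_IN_SOURCE_KEY, "item_code": None, "description": "Missing in Source"},
    {"item_key": 1, "item_code": "A100", "description": "Standard widget"},
]


def resolve_item_key(fact_row, dimension):
    """Return a surrogate key for the fact row, never None."""
    if fact_row.get("line_type") == "FEE":  # hypothetical rule: fees have no item
        return NOT_APPLICABLE_KEY
    code = fact_row.get("item_code")
    if code is None:
        return MISSING_IN_SOURCE_KEY
    for row in dimension:
        if row["item_code"] == code:
            return row["item_key"]
    # The item is sold before it exists in the dimension:
    # create the dimension row from what the fact data knows.
    new_key = max(row["item_key"] for row in dimension) + 1
    dimension.append({
        "item_key": new_key,
        "item_code": code,
        "description": fact_row.get("item_description", "Unknown"),
    })
    return new_key


# Made-up fact rows covering each situation.
fact_rows = [
    {"line_type": "SALE", "item_code": "A100"},
    {"line_type": "FEE", "item_code": None},
    {"line_type": "SALE", "item_code": None},
    {"line_type": "SALE", "item_code": "MKT-7", "item_description": "Marketing custom item"},
    {"line_type": "SALE", "item_code": "MKT-7", "item_description": "Marketing custom item"},
]

item_keys = [resolve_item_key(row, dim_item) for row in fact_rows]
print(item_keys)
```

Every fact row ends up with a key that exists in the dimension, so the join from fact to dimension never drops a row. The second sale of the marketing item reuses the row created by the first one.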
Is the Action Taken Consistently?
No matter what action you take, my experience tells me the key is to stay consistent throughout your data warehouse development. It definitely confuses business users when they see three separate rows labeled blank, ‘N/A’, and ‘Not Applicable’ in their pivot table, even though all three have the same meaning. Our job is to treat null values based on the identified cause and implement the solution consistently throughout the system.
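One simple way to enforce that consistency is to route every label through a single function that maps all the spellings of "no value" to one canonical text. The sketch below is only an illustration: the list of aliases and the sample labels are assumptions, and you would pick whichever canonical label your team agrees on.

```python
CANONICAL_LABEL = "Not Applicable"

# Assumed spellings that all mean the same thing (compared in lower case).
ALIASES = {"", "n/a", "na", "not applicable"}


def normalize_label(value):
    """Map every variant of 'no value' to one canonical label."""
    if value is None or value.strip().lower() in ALIASES:
        return CANONICAL_LABEL
    return value


# Made-up labels as they might arrive from different ETL packages.
raw_labels = [None, "", "N/A", "Not Applicable", "Texas"]

distinct_labels = sorted({normalize_label(v) for v in raw_labels})
print(distinct_labels)
```

With this in place the pivot table shows one ‘Not Applicable’ row instead of three rows that mean the same thing.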