Wednesday 14 February 2018

How to Improve Data Quality

Is he 98 or 298? It all depends on the data type …
Designing a new database is exciting: new ways of looking at data are being created. But frustration often mixes with the excitement, because a certain amount of groundwork is needed to get the project off the ground.

This includes figuring out what information needs to be stored. It is a process of going through the existing business processes, envisioning the new system, and deciding what is or is not needed, all now with an eye towards GDPR.

And making the right decisions about your data will help make the new system a success.

Once you have decided what data needs to be stored, there are more questions. What range of values do you expect to hold? Will you always know the value when a new record is entered? Could the value ever be negative? What is the largest value the field might hold? It is at this point that the frustration invariably turns to annoyance. "It doesn’t matter", comes back the reply. "We will think about that later" is another favourite. "After all, we’re not short of disk space, so what’s the problem?"

Yet each one of those unmade decisions is a missed opportunity to improve the quality of the data being held, and a potential source of bugs. These decisions are not about running out of disk space, but about making sure the data you hold is the data you intended. The data type of each field is the most basic and most valuable constraint in a database. It is what separates relational databases from other ways of storing data, such as spreadsheets.

The data that can be entered into a particular field is constrained by a number of things:
  • The data type. This ensures that, for example, a date value cannot be entered into a field designed to hold an integer.
  • Primary and foreign keys. As well as providing relationships between tables, they also check data as it is being entered. A value entered into a foreign key field must match an existing value in the referenced table’s primary key.
  • NULL or NOT NULL. Fields that allow NULLs risk introducing hard-to-debug issues when data is added to the system. It is often necessary to allow NULLs, because the information may not always be available when records are added. However, fields that allow NULL can produce query results that are not what you expect.
  • Allowable ranges. Whilst some data types automatically limit the range that can be stored, others do not. For example, a SQL Server tinyint cannot hold values greater than 255, or negative values, which might make it a good data type for a field holding age values. A SQL Server smallint, on the other hand, allows values between -32,768 and +32,767, making it a poor choice for an age field, as incorrect data could easily be entered.
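The NULL pitfall in the list above is easy to demonstrate. Here is a minimal sketch using SQLite from Python; the `customer` table and its data are invented for illustration, and SQL Server behaves the same way with NULLs in comparisons:

```python
import sqlite3

# Hypothetical table for illustration: customers with an optional age.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customer (name TEXT NOT NULL, age INTEGER)")
con.executemany("INSERT INTO customer VALUES (?, ?)",
                [("Alice", 30), ("Bob", 45), ("Carol", None)])

# Intuition says these two queries between them cover every customer,
# but Carol (age NULL) appears in neither: a NULL comparison is "unknown".
young = con.execute("SELECT COUNT(*) FROM customer WHERE age <= 40").fetchone()[0]
older = con.execute("SELECT COUNT(*) FROM customer WHERE age > 40").fetchone()[0]
print(young, older, young + older)  # 1 1 2 -- one of three customers silently missing
```

Three rows are in the table, yet the two queries together only account for two of them, which is exactly the kind of unexpected result a nullable field can produce.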
There are many more examples of how data types and constraints help keep data quality high. Whilst I fully admit that the age example is not ideal (it would be far better to hold date of birth rather than age), it is easily understood and illustrates the point.
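To make the age example concrete, here is a sketch of the same idea using SQLite from Python. The schema and data are invented for illustration; SQLite has no tinyint, so a CHECK constraint stands in for the 0–255 range that SQL Server’s tinyint enforces automatically, and the foreign key shows values being checked on entry:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked

# Hypothetical schema for illustration.
con.execute("""
    CREATE TABLE department (
        dept_id INTEGER PRIMARY KEY,
        name    TEXT NOT NULL
    )""")
con.execute("""
    CREATE TABLE employee (
        emp_id  INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        age     INTEGER NOT NULL CHECK (age BETWEEN 0 AND 255),
        dept_id INTEGER NOT NULL REFERENCES department(dept_id)
    )""")
con.execute("INSERT INTO department VALUES (1, 'Sales')")

# A valid row goes in...
con.execute("INSERT INTO employee VALUES (1, 'Alice', 34, 1)")

# ...but bad data is rejected at the door, not discovered months later.
try:
    con.execute("INSERT INTO employee VALUES (2, 'Bob', 298, 1)")  # age out of range
except sqlite3.IntegrityError as e:
    print("rejected:", e)

try:
    con.execute("INSERT INTO employee VALUES (3, 'Carol', 29, 9)")  # no such department
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

Is he 98 or 298? With the range constraint in place, 298 never makes it into the table at all.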

Of course, things can be changed in a database system. You can add constraints later, or change constraints you got wrong. But everything comes at a price, and sometimes that price is reliability: every change in a system has consequences, sometimes unexpected ones.

So if you are getting impatient with never-ending questions about your data, think of them as an investment in keeping the quality of your data high for years to come.