Minimum requirements for a dataset
Completeness and accuracy are the minimum requirements for any dataset. Without these characteristics, any final outcome is open to bias and wrong conclusions.
Any analysis relies on the initial dataset that we have: a Machine Learning algorithm, no matter how sophisticated it is, depends on the quality of the data. For this reason, we need to invest time and effort in verifying the reliability of the data before starting the process of transforming it into valuable information for the business.
In this article I would like to describe an approach to data verification, as well as ways of limiting the impact of using low-quality datasets.
1. Validate completeness
How do we understand completeness in the context of a dataset? We are complete when we can be sure that we have all the data and are not just working with a subset.
The main questions to answer can be formulated this way: Is this data a subset? Are there missing fields, whether rows or columns?
If a dataset belongs to an organization, it is useful, whenever possible, to verify the original source of the information. If the data comes from NASA, the Census Bureau, CE or any other origin, it is desirable to obtain it from the original entity, with the minimum possible editing. In many cases, the same dataset can be found from different sources, and each source can hold a different quantity or quality of information. An example of this is the landslides dataset on Kaggle, which has 1,693 rows, while the same dataset from NASA (the original source of the data is the GLC, from NASA's Goddard Center) has 11,033 rows.
If I am working within a company, I need to ask the business units whether my information is complete. Do I have everything? Were all the rows incorporated? Are there time restrictions, business units, or products that were not included?
Our strategy for this point is to verify the completeness of the data directly with the source that generated our data set.
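As a minimal sketch of this kind of check, assuming a hypothetical copy of the data and illustrative expected counts taken from the original source's documentation (the names and numbers below are stand-ins, not real values), we can compare row counts, column coverage and null counts with pandas:

```python
import pandas as pd

# Minimal sketch: "our_copy" stands in for the dataset we downloaded, and
# EXPECTED_ROWS / EXPECTED_COLUMNS for what the original source documents.
# All names and numbers here are illustrative assumptions.
our_copy = pd.DataFrame({
    "id": [1, 2, 3],
    "date": ["2021-01-01", "2021-01-02", None],
    "fatalities": [0, 2, None],
})

EXPECTED_ROWS = 11_033                      # row count reported by the original source
EXPECTED_COLUMNS = {"id", "date", "fatalities", "country"}

# Row count: a large gap suggests we only have a subset of the data.
print(f"Rows in our copy: {len(our_copy)} / expected: {EXPECTED_ROWS}")

# Column coverage: fields documented at the source but missing from our copy.
missing_cols = EXPECTED_COLUMNS - set(our_copy.columns)
print("Missing columns:", missing_cols or "none")

# Null counts: fields that exist but are largely empty.
print(our_copy.isna().sum())
```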
2. Validate consistency
Understanding what each column means, and how the data relate to each other, allows us to determine whether the dataset under analysis is consistent, that is, whether the data agree with one another.
This is especially relevant when we are combining data from a variety of sources into a single dataset.
For example, suppose that in the process of generating a single dataset to analyze the sales of all the branches of a company, two different datasets hold different information about the same branch office. We are facing a problem of referential integrity, which requires validating which value is correct, rectifying it, and then continuing with the process of generating the unified dataset.
Our strategy for this point is to verify the meaning of each column and check for inconsistencies.
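A minimal sketch of such a check, using a hypothetical pair of branch tables (the names and values are made up for illustration), could flag records where the same branch carries conflicting values:

```python
import pandas as pd

# Hypothetical example: two branch-level extracts merged into one dataset.
branches_a = pd.DataFrame({
    "branch_id": [101, 102, 103],
    "region": ["North", "South", "East"],
})
branches_b = pd.DataFrame({
    "branch_id": [101, 102, 104],
    "region": ["North", "West", "East"],   # branch 102 disagrees with the first file
})

combined = pd.concat([branches_a, branches_b], ignore_index=True)

# A branch should map to exactly one region; more than one distinct value
# signals a referential-integrity conflict that must be resolved manually.
conflicts = (
    combined.groupby("branch_id")["region"]
    .nunique()
    .loc[lambda s: s > 1]
)
print("Branches with conflicting information:", conflicts.index.tolist())
```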
3. Validate constraints
A dataset must follow the constraints defined for its own data.
Usually the dataset comes with a description of the fields that details what each column means and the type of data it should contain. For example, if there are fields corresponding to telephone numbers, verify that they contain only numbers and not a mixture of numbers and letters. If it is indicated that a field can only take binary values, verify that the data is restricted to 1 and 0.
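A minimal sketch of these two checks, assuming hypothetical columns named phone and is_active whose documentation specifies digits-only and 0/1 values respectively, could look like this:

```python
import pandas as pd

# Minimal sketch with hypothetical columns "phone" and "is_active";
# the documentation is assumed to say: phone = digits only, is_active = 0 or 1.
df = pd.DataFrame({
    "phone": ["5551234567", "555-ABC-9999", "5559876543"],
    "is_active": [1, 0, 2],
})

# Phone numbers should contain digits only.
bad_phones = df[~df["phone"].str.fullmatch(r"\d+")]

# A binary field should only take the values 0 or 1.
bad_flags = df[~df["is_active"].isin([0, 1])]

print("Rows violating the phone constraint:\n", bad_phones)
print("Rows violating the binary constraint:\n", bad_flags)
```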
What happens if we find data that is not consistent, data that should not be there? We can choose to clean it, or, if the error is too severe, we can even dismiss the dataset altogether. At least we will have abandoned the project in its initial stages, without having invested an enormous amount of time and effort in something that would never have been useful.
Our strategy to validate constraints is to read, verify and critically analyze the documentation that accompanies our dataset.
4. Validate uniformity
Uniformity means that all the data are expressed in the same unit of measure.
One of the most remarkable metric mishaps caused the loss of the Mars Climate Orbiter: a Lockheed Martin engineering team used English units of measurement while NASA's team used the more conventional metric system for a key spacecraft operation.
Other common problems arise with monetary data: we need to verify that all amounts are expressed in the same currency. It may happen that some quantities are expressed in one currency and others in a different one, making any analysis meaningless and requiring, as a first task, a transformation of the data into a single common unit.
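A minimal sketch of that first transformation step, assuming a hypothetical sales table and illustrative (not real) exchange rates, could look like this:

```python
import pandas as pd

# Minimal sketch: a hypothetical sales table with amounts in mixed currencies;
# the exchange rates below are placeholders, not real market rates.
sales = pd.DataFrame({
    "amount": [1000.0, 850.0, 1200.0],
    "currency": ["USD", "EUR", "USD"],
})

rates_to_usd = {"USD": 1.0, "EUR": 1.10}  # placeholder rates

# Convert everything to a single reference currency before any aggregation.
sales["amount_usd"] = sales["amount"] * sales["currency"].map(rates_to_usd)
print(sales)
```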
Conclusions
One of the most critical parts of data analysis happens way before you build your first visualization, or your machine learning models. And while not necessarily the most glamorous, data preparation and transformation can’t be ignored.
As we saw above, a combination of strategies can be followed to verify the data, and in some cases an external verification may ultimately be required to resolve any inconsistency or doubt.