4 ideas for cleaning datasets
While working with real datasets from NASA, news coverage of the Irish referendum, and Twitter messages, I had a great opportunity to think about the process of obtaining a dataset full of reliable and meaningful data, ready to be analyzed in depth.
Before starting, I would like to clarify what a reliable and meaningful dataset means: it is a set of data without biases, with records related to each other and fields whose origin has been validated.
Real-life datasets come with many more problems and inconsistencies to fix, because they are usually built from different sources. That may imply biases (for example, too much data from one source and much less from another), the presence of outliers, problems with data types, missing values, and many other issues.
Step by step
1) General idea about our dataset
Our very first step is to explain to ourselves what the dataset means: what does the data represent? A country? The state of the economy during the last year? A database of our customers? If we understand what the information means, we are far more likely to notice when part of the data has a problem.
If we are lucky, we can rely on the documentation accompanying the dataset, but when documentation is missing, a short check with the business (or with the source of our data) can be really beneficial for everyone, especially at the beginning of the cleaning process.
Some of the questions that we would like to answer:
- What does our dataset look like? How many rows does it have? How many columns? What is the meaning of each column?
- Are there different sources mixed into the data? What is the impact of that? (Usually it means some columns full of NA for one origin but not the other, or too much data from one source and much less from the other; a per-source sketch follows the code below.)
How to check that?
#summary statistics for every column of the dataset
summary(df)
#dimensions of the dataset
dim(df)
#general stats for the dataset
DataExplorer::introduce(df)
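If the dataset mixes several origins, a quick way of measuring the impact is to count the rows and the missing values contributed by each source. A minimal sketch, assuming a hypothetical column called source and a source named "source_A":
#rows contributed by each source
table(df$source)
#share of NAs per column for one of the sources
colMeans(is.na(df[df$source == "source_A", ]))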
2) Completeness of the dataset
Is the dataset full of NAs or not? Which are the columns where we should execute the cleanup?
As a rule of thumb, 25 to 30% is the maximum share of missing values usually allowed; beyond that we might want to drop the variable from the analysis. There are exceptions, of course: if you know that the most common value is 0 you can replace the NAs with that value, or if the customer insists on keeping the variable for the analysis you can replace the NAs with the average of the remaining values.
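As a rough sketch of those rules, assuming a hypothetical numeric column called value1 (the three replacements below are alternatives, not consecutive steps):
#share of missing values in each column
colMeans(is.na(df))
#option 1: drop a variable with more than ~30% missing values
df$value1 <- NULL
#option 2: replace its NAs with 0, when we know 0 is the most common value
df$value1[is.na(df$value1)] <- 0
#option 3: replace its NAs with the mean of the remaining values
df$value1[is.na(df$value1)] <- mean(df$value1, na.rm = TRUE)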
At this point, the kind of questions needed to secure the integrity of the data varies with the dataset, but some of them could be:
- Do we have the same quantity of information for each unit of time that we are analyzing (day/month/year)?
- Is there more information from one place than from the others? Is that consistent with our data universe? (A tabular sketch for this question follows the plots below.)
How to check that?
#histogram by unit of time
ggplot(df, aes(x = year)) +
  geom_histogram(binwidth = 1, fill = "#453781FF") +
  labs(
    title = "Landslide 2007 - 2016",
    subtitle = "Distribution of events by year",
    caption = "source: GLC by NASA Goddard Space Flight Center \n by thinkingondata.com"
  ) +
  theme_minimal()
#This function returns and plots frequency of missing values for each feature.
DataExplorer::plot_missing(df)
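As a tabular complement to the plots above, assuming the year column used in the histogram and a hypothetical country column for the "place" question:
#rows per year and per place
table(df$year)
table(df$country)
#cross-tabulation to spot years or places with little or no data
table(df$year, df$country)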
3) Consistency of the data
Our initial consistency validation is about variable types: character variables should contain text and numeric variables should contain numbers.
In other cases consistency relates to the information the dataset is supposed to contain: if our dataset is “Data 2017 for France and Italy” and we only have information for some months and only for Italy, our data is not consistent with what the business expects from us.
Another recommendation: if you feel that something is wrong, take a moment and think about it, because you have probably already detected an issue that you are simply unable to describe in words... yet.
So basic questions to answer:
- Check the type of each column (character, numeric, ...); a quick base-R check is sketched after the code below.
- Check whether all money-related data is expressed in the same currency. If not, we need to apply a transformation (an illustrative sketch closes this section).
- Do we have possible duplication in the data?
How to check consistency?
# Generate a small sample of the data
head(df)
tail(df)
# Inspect that sample, looking for mistakes and inconsistencies
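For the data-type question in the list above, base R already gives a quick answer:
#compact overview of the structure and type of every column
str(df)
#or just the class of each column
sapply(df, class)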
How to check for duplicates?
install.packages("janitor")
library(janitor)
mtcars %>% janitor::get_dupes(wt)
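About the currency question: the exact transformation depends on the dataset, but as a purely illustrative sketch, assuming hypothetical amount and currency columns and an assumed fixed exchange rate:
#convert USD amounts to EUR so that every row uses the same currency (illustrative rate only)
eur_per_usd <- 0.85
df$amount_eur <- ifelse(df$currency == "USD", df$amount * eur_per_usd, df$amount)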
4) Distribution of the numeric variables
Histograms are used for variables whose values are numerical. They show the distribution of the data and can help to detect unusual observations (outliers) or gaps in the data. In both cases we need to decide on our next steps: usually that means deleting outliers and deciding whether to keep variables with large gaps (an outlier check is sketched after the code below).
#Simple histogram
hist(df$value1)
#Histogram with ggplot2
ggplot(data = diamonds) + geom_histogram(aes(x = price), binwidth = 500)
#skimr prints summary statistics, including a small inline histogram, for every numeric or integer variable in the dataset
library(skimr)
skim(df)
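One common way (among others) to flag the outliers mentioned above is the 1.5 * IQR rule, sketched here for a hypothetical numeric column value1:
#values outside 1.5 * IQR from the quartiles
q <- quantile(df$value1, c(0.25, 0.75), na.rm = TRUE)
iqr <- q[2] - q[1]
outliers <- df$value1 < q[1] - 1.5 * iqr | df$value1 > q[2] + 1.5 * iqr
#inspect the flagged rows before deciding whether to drop them
df[which(outliers), ]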
Conclusions
This post describes general considerations for cleaning real-world datasets. The foundation, before starting the exploratory analysis, is to obtain a tidy dataset full of reliable data.
To get there, we went through a series of considerations: understanding what the dataset represents, checking the consistency and completeness of the data, and finally analyzing the distribution of the data, looking for gaps and outliers.
Acknowledgements
https://towardsdatascience.com/data-cleaning-101-948d22a92e4