According to some estimates, between 50% and 80% of a data scientist's work is spent collecting and preparing data, what the New York Times calls "janitor work." When we consider the iterative nature of the data science process (refer to The Data Science Process), we see that each cycle typically repeats the data preparation step. As our understanding of the data evolves and the model is refined, we often find ourselves going back to further develop the data. While data preparation has never been easy, in a big data world the greater variety of data and data sources makes it all the more difficult. These sources rarely store or present data in a structure that facilitates analysis. To address this issue, we need to tidy the data. Let me explain…
Transaction systems structure data for fast processing. BI systems structure data for reporting and interaction. In data analysis, however, we need a different kind of structure, one well suited to analysis, which we refer to as tidy data. Hadley Wickham first explained the principles of tidy data in 2014. As Wickham describes it, a tidy data set has the following three features:
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
To understand this a bit better, let’s look at the example in the table below. The entire table represents an observational unit, which in this case is demographic information. Each cell contains a value, a measure of an attribute. A variable, a column of our table, contains all the values for one attribute. In our example, the variable weight (column four) contains all the values we have collected for the weight attribute of our test subjects. Each row represents an observation: all the measurements we have gathered across the attributes for one test subject.
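The structure above is easy to express in code. Here is a minimal sketch in pandas, using hypothetical subject names and values (none of these figures come from the original table):

```python
import pandas as pd

# Hypothetical demographic data following the three tidy rules:
# each variable is a column, each observation (one test subject) is a
# row, and the whole table holds one observational unit (demographics).
demographics = pd.DataFrame({
    "name":   ["Judy", "Raj", "Maria"],
    "age":    [34, 28, 45],
    "height": [165, 178, 160],  # centimeters
    "weight": [62, 75, 58],     # kilograms; column four, as in the text
})

print(demographics)
```

Each column is a variable, and each row holds every measurement for a single subject.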
As I noted above, with big data we have a variety of data sources and data types. These systems in turn structure and present the data in the way best suited to their own purposes, which are probably not data analysis. While from a data science perspective we could call this messy data, we should bear in mind that for these systems the alternate structure may be quite reasonable; indeed, it may be the best structure for their purposes. With that in mind, I prefer to refer to data that do not follow the guidelines above as un-tidy data. The most often cited characteristics of un-tidy data are as follows:
- Multiple observational units stored in a single table.
- A single observational unit stored in multiple tables.
- Column headers are values, not variable names.
- Variables stored in both rows and columns.
- Multiple variables stored in one column.
Let’s look at examples of each of these.
Multiple Observational Units Stored in a Single Table
Now, let’s say that we want to look at more than just demographic data. What if we wanted to see whether we could predict what type of dog a person would own based on some of this demographic data? We could, if we liked, put all the data in one table as we see below:
Unfortunately, this is a bit un-tidy. Notice that we repeat one subject, Judy, along with all her demographic information. This duplication of data can generate inconsistencies in addition to complicating the analysis. The problem with this table is that it combines two observational units: the demographics of the pet owner and the pet. To tidy up this data, we split the observational units into separate tables joined in a one-to-many relationship as follows.
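The split can be sketched in pandas. The table contents and the `owner_id` key below are hypothetical, introduced only to illustrate the one-to-many join:

```python
import pandas as pd

# Hypothetical combined table mixing two observational units:
# owner demographics repeat for each pet (Judy appears twice).
combined = pd.DataFrame({
    "name":   ["Judy", "Judy", "Raj"],
    "age":    [34, 34, 28],
    "weight": [62, 62, 75],
    "dog":    ["Beagle", "Poodle", "Terrier"],
})

# One table per observational unit, joined one-to-many by an
# owner_id key (a surrogate column introduced for the join).
owners = (combined[["name", "age", "weight"]]
          .drop_duplicates()
          .reset_index(drop=True))
owners["owner_id"] = owners.index

pets = combined.merge(owners, on=["name", "age", "weight"])[["owner_id", "dog"]]

print(owners)
print(pets)
```

Judy’s demographics now appear once in `owners`, while `pets` carries only the key and the pet-level variables.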
A Single Observational Unit Stored in Multiple Tables
As you might expect, at times source systems break the data for an observational unit into many tables. In the example below, the demographic information for our test subjects resides in two separate tables. To resolve this issue, we simply merge the data into one table, as shown in the original example.
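In pandas this is a single merge on the shared key. The two fragment tables below are hypothetical stand-ins for the split shown in the example:

```python
import pandas as pd

# Hypothetical split: one observational unit (demographics)
# spread across two tables keyed by subject name.
basics = pd.DataFrame({"name": ["Judy", "Raj"], "age": [34, 28]})
vitals = pd.DataFrame({"name": ["Judy", "Raj"], "weight": [62, 75]})

# Merge on the shared key to restore a single tidy table.
demographics = basics.merge(vitals, on="name")
print(demographics)
```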
Column Headers Are Values, Not Variable Names
Although it may seem intuitive that each column represents a variable, we often organize data quite differently to make it more readable. Let’s look at a table of the populations of the various religions for each of the G20 nations. The problem is that the column headings, the various religions, are values of a variable, not variables themselves.
To tidy up this table, we pivot a portion of it as shown in the example below. In this observational unit, we have three variables: Country, Religion, and Population. Each row represents a specific measurement.
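This pivot is exactly what pandas `melt` does. The countries and counts below are illustrative placeholders, not figures from the original table:

```python
import pandas as pd

# Hypothetical wide table: the column headers (Christian, Muslim)
# are values of a Religion variable, not variables themselves.
wide = pd.DataFrame({
    "Country":   ["Argentina", "Australia"],
    "Christian": [38_000_000, 14_000_000],  # illustrative counts only
    "Muslim":    [1_000_000, 600_000],
})

# melt turns the value-bearing headers into a Religion column,
# yielding the three variables: Country, Religion, Population.
tidy = wide.melt(id_vars="Country", var_name="Religion", value_name="Population")
print(tidy)
```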
Variables Stored in Both Rows and Columns
In this next example, we see variables stored in both rows and columns. As you can see, things are a bit confusing. We have month and year as variables, with the days distributed across the columns. Also note how each row represents a different type of measure: maximum, mean, and minimum temperature. What happens with months that have fewer than 31 days? Do we use a value of NULL or NA? Perhaps it is obvious that the value is missing because the month does not have that particular day, but what about data sets where this is not as obvious?
To correct this issue, we consolidate the day columns into a single variable. We then give each of the temperature variables its own column. Notice how in tidying up the data we are narrowing the tables: going from wide tables, i.e. tables with many columns, to tables with fewer columns and many more rows.
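The two steps, melt the day columns, then pivot the measure rows, can be sketched as follows. The three-day fragment and the temperature values are hypothetical:

```python
import pandas as pd

# Hypothetical fragment of the weather layout: one row per measure,
# with days spread across columns d1..d3.
messy = pd.DataFrame({
    "year":    [2023, 2023, 2023],
    "month":   [1, 1, 1],
    "measure": ["tmax", "tmean", "tmin"],
    "d1":      [8.0, 5.0, 2.0],
    "d2":      [9.0, 6.0, 3.0],
    "d3":      [7.0, 4.0, 1.0],
})

# Step 1: melt the day columns into a single 'day' variable.
long = messy.melt(id_vars=["year", "month", "measure"],
                  var_name="day", value_name="temp")
long["day"] = long["day"].str.lstrip("d").astype(int)

# Step 2: pivot the measure rows (tmax/tmean/tmin) into columns,
# so each temperature variable gets its own column.
tidy = (long.pivot_table(index=["year", "month", "day"],
                         columns="measure", values="temp")
            .reset_index())
print(tidy)
```

The result is narrower and longer: one row per (year, month, day), with one column per temperature variable.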
Multiple Variables Stored in One Column
The final un-tidy issue to discuss is storing multiple variables in a single column. In the example below, I have combined the sex and age variables into a single column. To resolve this problem, we simply split the column into two variables, as in the original example.
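A minimal sketch of the split, assuming a hypothetical combined encoding like "F34" (the column name and format below are my own, not from the original table):

```python
import pandas as pd

# Hypothetical column combining sex and age, e.g. "F34" = female, 34.
subjects = pd.DataFrame({
    "name":    ["Judy", "Raj"],
    "sex_age": ["F34", "M28"],
})

# Split the combined column into two variables with a regex extract,
# then drop the un-tidy original.
subjects[["sex", "age"]] = subjects["sex_age"].str.extract(r"([MF])(\d+)")
subjects["age"] = subjects["age"].astype(int)
subjects = subjects.drop(columns="sex_age")
print(subjects)
```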
Back in 2014 when Wickham first published his paper, he noted that data tidying was a small but important part of data preparation. He also noted that having a standard framework makes it easier to develop standard tools and to exchange data sets between various environments and technologies. He expressed his hope that others would build on the framework he defined to develop better storage strategies and tools.
My intent for this post was to provide a brief introduction to the subject of tidy data. I would strongly encourage you to refer to “Tidy Data” by Hadley Wickham, in the Journal of Statistical Software, August 2014, for a more thorough discussion of this topic.
Lohr, Steve. “For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights.” The New York Times, Aug. 17, 2014.
Note that the G20 includes the European Union. I felt it redundant to include the counts for the EU in the table since several EU nations were already included.