The Data Lake is not Lazy Man’s Lasagna.


When I was a kid I worked in a department store where we typically ate lunch in a corner of the warehouse in the back of the store. One year, as the holidays approached, I remember a lunch-time conversation about how I was looking forward to Christmas dinner, when my mother would serve her homemade lasagna and sauce. A coworker replied that she did not bother with all the work involved in the traditional recipe, replacing it with what she called lazy man's lasagna. She went on to explain that she would dump everything into one big pot to cook. Later she would serve the mess in bowls. Bowls? Rather than the beautifully structured layers of pasta, meat, and cheese, she served a jumbled stew in bowls? What sacrilege!

As organizations move to data lakes from the enterprise data warehouse, denigrating schema-on-write in favor of schema-on-read approaches, I get the impression that many see the data lake as the lazy man's lasagna of data storage. Now, I do not intend to set sail on the sea of articles about why the data warehouse is dead, replaced by glimmering new lakes. That subject has been pontificated upon ad nauseam. Let me simply say that while the rigorous structure of a data warehouse inhibits its ability to keep pace with the velocity, variety, and volume of a big data world, it would be a significant error to miss the need to properly curate the data in a data lake. Let me explain …

There are three basic phases through which data moves in a data lake: ingestion, curation, and consumption. During ingestion, data is integrated into the data lake in its native format, i.e. the format of the source system. Not only does this allow the capture of data at its original level of granularity, it also frees data architects from being data soothsayers who must predict users' future data requirements. We just grab everything, the way it exists in the source system. When users have need of data in the lake, they simply skip on down to the lake with a bucket to scoop up what they need. It is at this point that users structure the data in schemas that meet their specific needs.
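To make the schema-on-read idea concrete, here is a minimal sketch in Python. The records, field names, and the `read_with_schema` helper are illustrative assumptions, not part of any real lake tooling; the point is only that no target schema is imposed at write time.

```python
import json

# Ingestion: records land in the lake exactly as the source systems emit them.
raw_lake = [
    json.dumps({"Product_ID": "FZX 3542", "Product_Name": "Guitar Tuner", "Inventory": 304}),
    json.dumps({"Product_ID": "F-35-42", "Price": 15.45}),
]

# Consumption: the user applies a schema at read time, keeping only the
# fields they care about and tolerating fields a given record lacks.
def read_with_schema(lake, wanted_fields):
    for line in lake:
        record = json.loads(line)
        yield {f: record.get(f) for f in wanted_fields}

inventory_view = list(read_with_schema(raw_lake, ["Product_ID", "Inventory"]))
```

The same raw records could be read again tomorrow with a completely different schema, which is the whole point of deferring structure to read time.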

But where do they dip their bucket? How do they know what data is in the lake? Which data is brackish and which is pure? The curation process, the middle phase, provides answers to these questions.

Data curation prepares and maintains the data so that it is available to users. It does this by synthesizing the data, defining its lineage, building linkages between data elements, and identifying it within the data catalog. We should note here that we do not modify or rewrite the data. The objective of the lake is to write the data once and read it many times. Rather than making modifications to the original data, the curation process envelops the data in metadata. Let's look at an example.


{
  "Source_System": "Inventory-Poughkeepsie",
  "Load_Date": "03-AUG-2018",
  "Lineage": "Prod_Transform 1.09",
  "Canonical": {"Product_ID": "FZX 3542"},
  "Global": {"Element_ID": "111000222444"},

  "Product_Name": "Guitar Tuner",
  "Product_ID": "FZX 3542",
  "Inventory": 304,
  "Safety_Stock": 12
}




In the example above, the metadata identifies not only the source of the data, but also when the data was loaded into the lake and which transformations were applied to it, including the version of the code that performed each transformation. For this particular record, we see that the data was loaded from the Poughkeepsie inventory system on the 3rd of August 2018, and that version 1.09 of Prod_Transform was applied.
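The enveloping step itself can be sketched as a small function. This is a hypothetical illustration, not a real curation tool; the flat record layout and the `envelop` helper are my assumptions, with the field names taken from the example above.

```python
def envelop(raw_record, source_system, load_date,
            lineage=None, canonical=None, global_id=None):
    """Wrap a raw record in curation metadata without rewriting the record."""
    metadata = {"Source_System": source_system, "Load_Date": load_date}
    if lineage:
        metadata["Lineage"] = lineage
    if canonical:
        metadata["Canonical"] = canonical
    if global_id:
        metadata["Global"] = {"Element_ID": global_id}
    # The original fields are carried through untouched: write once, read many.
    return {**metadata, **raw_record}

curated = envelop(
    {"Product_Name": "Guitar Tuner", "Product_ID": "FZX 3542",
     "Inventory": 304, "Safety_Stock": 12},
    source_system="Inventory-Poughkeepsie",
    load_date="03-AUG-2018",
    lineage="Prod_Transform 1.09",
    canonical={"Product_ID": "FZX 3542"},
    global_id="111000222444",
)
```

Note that the raw record's own keys and values pass through unchanged; curation only adds the metadata wrapper around them.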

As noted above, an important aspect of data curation is creating linkages between data elements. In this example we do this using a global element identifier in the metadata. Using this identifier, we are able to link this product record with a product record in the lake that originated from another system, perhaps a sales record as shown below.


{
  "Source_System": "Sales-Tuscaloosa",
  "Load_Date": "10-JUL-2018",
  "Global": {"Element_ID": "111000222444"},

  "Product_Name": "Guitar Tuner",
  "Product_ID": "F-35-42",
  "Price": 15.45,
  "Quantity": 12
}




You will also notice that the product identifier in the sales system is different from the ID in the inventory system. The product ID in the inventory system, however, has been identified as canonical. By this we mean that it is defined as the truth, blessed by the curators of the data lake as conforming to the rules of acceptability. Users can trust it. When extracting data from the data lake, therefore, users would give this ID preference.
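Putting the two records together, a consumer could use the global element identifier to link them and the canonical designation to pick the trustworthy product ID. The sketch below is a hypothetical illustration; the helper functions and the flat record layout are my assumptions, with field values taken from the two examples.

```python
inventory = {"Source_System": "Inventory-Poughkeepsie",
             "Canonical": {"Product_ID": "FZX 3542"},
             "Global": {"Element_ID": "111000222444"},
             "Product_ID": "FZX 3542", "Inventory": 304}

sales = {"Source_System": "Sales-Tuscaloosa",
         "Global": {"Element_ID": "111000222444"},
         "Product_ID": "F-35-42", "Price": 15.45, "Quantity": 12}

def link_by_global_id(records):
    """Group records from different source systems by their global element ID."""
    linked = {}
    for rec in records:
        linked.setdefault(rec["Global"]["Element_ID"], []).append(rec)
    return linked

def canonical_product_id(linked_records):
    """Prefer the product ID a curator has blessed as canonical."""
    for rec in linked_records:
        if "Product_ID" in rec.get("Canonical", {}):
            return rec["Canonical"]["Product_ID"]
    return None

groups = link_by_global_id([inventory, sales])
trusted_id = canonical_product_id(groups["111000222444"])  # "FZX 3542"
```

Even though the sales record carries its own, non-canonical product ID, the linkage lets a user report on both records under the single trusted identifier.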

Although these are only a few simple examples of data curation, I believe they make the point that the data lake is not the data storage equivalent of lazy man's lasagna. Think of your metadata as the pasta, with your data being the meaty, cheesy, saucy goodness in between. We don't provide structure by rolling the meat into meatballs, or cutting the cheese into orderly strips, but envelop them in layers of pasta that we can easily eat with a fork without spilling it on our Sunday-best shirt.
