Recently I read a LinkedIn article by Kalyan Sambhangi in which he asked where all the data philosophers are. He makes a good point: if data science is truly a science, shouldn’t the philosophy of science apply to it as well, in the form of a philosophy of data science? I agree; it should. So, let me ask a follow-up question. What is the philosophy of data science (and should data scientists care)?
The ultimate question that the philosophy of science attempts to answer is whether science can reveal the truth about the world as it really is. Can we depend on science to give us an accurate understanding of the world, and ultimately the universe, in which we live? Isn’t this what we ask ourselves as data scientists? Each step of our process is ultimately concerned with how well we are reflecting reality, from the selection and cleansing of data to the construction and evaluation of a model. We are concerned with developing an accurate picture of the world around us. The philosophy of data science, then, seeks to understand the foundations and methods of data science. In doing so, it establishes the validity of the field and defines its boundaries.
Now get ready for the surprise twist in the story.
Nate Silver once said that “data scientist is a sexed-up term for a statistician”. It is difficult for me to argue with Silver. Sure, I understand that in addition to statistics you also need hacking skills and domain knowledge, as the infamous Conway diagram tells us. Yet at the heart of the data science process is statistics. My point is not that statistics is the most important of the skills Conway listed, although I believe I could make a convincing argument that it is. My point is that statistics is core to data science.
When we discuss the philosophy of data science we are really discussing the philosophy of statistics, an already established field of study. And… that is a good thing.
The philosophy of statistics addresses such topics as methodology, ethics, and epistemology, as well as causality versus correlation. If statistics is core to data science, it follows that the philosophy of statistics is core to the philosophy of data science. It lays the groundwork upon which we, as data scientists, can build. It remains to be seen whether what we build is an extension of the philosophy of statistics or a separate school of thought. In either case, if we are to pursue a philosophy of data science, we also need to understand the philosophy of statistics.
Should data scientists care? Well, that depends on the data scientist. You do not need to understand the philosophy of data science to do data science. I won’t be attempting to amend Conway’s Venn diagram. As we construct our models or perform regression testing, the finer philosophical details may have little impact, if any. If, however, you want a deeper understanding of our field, one that goes beyond the mere mechanics, then yes, you should care. Ultimately, if you develop these deeper insights, you will be a better data scientist.
As a community, it is critical that we begin to explore these issues. Janet Stemwedel reminded us that “what Einstein did for physics had as much to do with proposing a (philosophical) reorganization of the theoretical territory as it did with new empirical data. So perhaps the odd scientist can put some philosophical training to good scientific use”. We can only benefit by thinking more deeply about data science.