We live in a world where larger and larger volumes of varied data types are coming at us at ever-increasing speed; in other words, we live in a world of big data. To make sense of big data, we have turned to data science. Data science is a tool employed by the transliterate to transform data into information. Let me explain …
In his book The Innovators, Walter Isaacson makes the point that “the quest for artificial intelligence – machines that think on their own – has consistently proved less fruitful than creating ways to forge a partnership of symbiosis between people and machines.” This symbiosis is data science. To understand this partnership better, I would like to review the data science methodology.
There have been a number of descriptions of the data science process, from the Computing Community Consortium to KDD (Knowledge Discovery in Databases) to the CRISP-DM Consortium. I have included references to each of these in the links below. As shown in the data flow, the data science process is composed of the following six steps:
- Organization Understanding – In this phase, we develop an understanding of the organization for which we are undertaking the project. We assess the current state of the organization, determine objectives for the data mining project, and produce a project plan. It is important to establish quantitative metrics for determining the success of a project. For example, saying that a project will improve customer satisfaction is not sufficient: how do you measure that objectively? A quantitative metric, such as increasing customer retention by 2%, provides an objective way to measure project success.
- Data Understanding – As noted above, in the world of big data we receive an expanding variety of data. In this phase of the process, we examine these streams to determine which elements are relevant to our objectives. As we collect and profile the data, we evaluate its quality. A thorough analysis includes an understanding of the metadata as well. Note also that as we come to understand the data better, we gain new insights into the organization, which may cause us to reevaluate our Phase 1 findings.
- Data Preparation – Data preparation includes ETL (Extraction, Transformation, and Load) and data cleansing. As we have noted above, however, we receive data in a variety of formats, many of which are not suitable for analysis. As part of data preparation, we will need to transform data such as text or images into structures more conducive to analysis. We also receive data from many different sources, sources that most probably are not integrated with one another. Data preparation, therefore, includes creating linkages between the various data domains.
- Modeling – Now we get to the glamorous, fun part: the model. During the modeling phase, we select a modeling technique and construct the model; we then test its performance. Refer to Overfitting / Underfitting – How well does Your Model Fit and Overfit / Underfit – Shaving with Occam’s Razor. As we refine our models, we may realize that we need additional attributes, or that we need to restructure the data to fit a change in technique. These data changes could, in turn, require that we step back to the data preparation phase.
- Evaluation – Although we have evaluated the performance of the model itself, in this phase we reexamine what we have built at a higher level, assessing how well we realized our objectives. If, during this review, we realize that we have not succeeded in our mission, we will need to step back to the modeling phase and make the appropriate adjustments.
- Deployment – Deployment is integration into the operational environment. During the deployment phase, we monitor and maintain the system, measuring its performance. We may realize that we are not meeting the stated objective, which returns us to Phase 1. On the other hand, deployment may give us new insights that cause us to expand our scope, or to apply the same techniques to similar or related issues, again returning us to Phase 1.
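To make the Data Understanding phase a little more concrete, here is a minimal profiling sketch in Python. The records and field names are hypothetical, and a real profile would cover many more quality checks (types, ranges, duplicates, referential integrity):

```python
# Hypothetical incoming records; in practice these would stream from a source system.
records = [
    {"age": 34, "plan": "gold"},
    {"age": None, "plan": "silver"},
    {"age": 29, "plan": "gold"},
    {"age": 51, "plan": None},
]

def profile(records, field):
    """Report the missing-value count and distinct values for one field."""
    values = [r[field] for r in records]
    missing = sum(1 for v in values if v is None)
    distinct = {v for v in values if v is not None}
    return {"missing": missing, "distinct": sorted(distinct, key=str)}

age_profile = profile(records, "age")    # {'missing': 1, 'distinct': [29, 34, 51]}
plan_profile = profile(records, "plan")  # {'missing': 1, 'distinct': ['gold', 'silver']}
```

Even a simple profile like this surfaces the quality questions (How much is missing? What values actually occur?) that feed back into the Phase 1 findings.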
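The Data Preparation phase can be sketched the same way. In this hypothetical example, a CRM feed is cleansed (whitespace trimmed, case normalized, missing values filled) and then linked to a separate billing feed by a shared key; the record layouts and cleansing rules are illustrative assumptions, not a prescription:

```python
# Hypothetical raw feeds from two sources that are not integrated with one another.
crm_records = [
    {"customer_id": "C001", "name": " Alice ", "region": "east"},
    {"customer_id": "C002", "name": "Bob", "region": None},
    {"customer_id": "C003", "name": "Carol", "region": "WEST"},
]
billing_records = [
    {"customer_id": "C001", "monthly_spend": "120.50"},
    {"customer_id": "C003", "monthly_spend": "87.00"},
]

def cleanse(record):
    """Trim whitespace, normalize case, and fill missing regions."""
    return {
        "customer_id": record["customer_id"],
        "name": record["name"].strip(),
        "region": (record["region"] or "unknown").lower(),
    }

# Link the cleansed CRM feed to the billing feed by the shared key.
spend_by_id = {r["customer_id"]: float(r["monthly_spend"]) for r in billing_records}
prepared = []
for record in map(cleanse, crm_records):
    record["monthly_spend"] = spend_by_id.get(record["customer_id"], 0.0)
    prepared.append(record)
```

The linkage step is where the "creating linkages between the various data domains" work happens; at scale this is a join in a database or a data pipeline rather than a dictionary lookup.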
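Finally, the Modeling and Evaluation phases in miniature: the sketch below holds out part of a toy data set, fits a simple nearest-centroid classifier (a stand-in for whatever technique the project actually selects), and then evaluates accuracy on the examples the model never saw:

```python
# Hypothetical labeled data: (feature, label) pairs, e.g. monthly spend -> churn flag.
data = [(10, 0), (12, 0), (11, 0), (9, 0), (40, 1), (42, 1), (39, 1), (41, 1)]

# Hold out part of the data so evaluation uses examples the model never saw.
train = data[:3] + data[4:7]
test = [data[3], data[7]]

def fit(train):
    """Modeling: learn one centroid (mean feature value) per class."""
    sums, counts = {}, {}
    for x, y in train:
        sums[y] = sums.get(y, 0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(centroids, x):
    """Assign the class whose centroid is nearest."""
    return min(centroids, key=lambda y: abs(x - centroids[y]))

# Evaluation: accuracy on the held-out examples.
centroids = fit(train)
accuracy = sum(predict(centroids, x) == y for x, y in test) / len(test)
```

The held-out test set is what keeps the evaluation honest with respect to overfitting; a real project would also evaluate against the quantitative Phase 1 objective, not just a model metric.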
Data science is more than just building a model; it is more than just statistics. In a previous post, From Data Literacy to Transliteracy, I noted that transliteracy is the process of transforming data into information. As we can see, data science is the process by which the transliterate perform this transformation in a big data world. It is the realization of the symbiosis described by Isaacson.
Additional Sources of Information
The Computing Community Consortium Big Data Whitepaper
From Data Mining to Knowledge Discovery
Isaacson, Walter. The Innovators. Simon & Schuster, 2014, pp. 4–5.