According to some estimates, between 50% and 80% of a data scientist's work is spent collecting and preparing data, what the New York Times calls "janitor work"[1]. When we consider the iterative nature of the data science process (refer to The Data Science Process), we see that each cycle typically repeats the data preparation step. As our understanding of the data evolves and the model is refined, we often find ourselves going back to further develop the data. While data preparation has never been easy, in a big data world the greater variety of data and data sources makes it all the more difficult. These sources rarely store or present data in a structure that facilitates analysis. To address this issue, we need to tidy the data. Let me explain…
R
Open Data Science
Everyone is talking about data science. One study found that 96% of companies believe that data science is integral to the success of their business. Yet most of these organizations (70%) are not realizing its full potential. They cite such factors as poor data quality, a lack of talent, and a lack of access to proper tools and technology[1]. …