Data Science Workflow + Data Wrangling


A good grasp of the data science workflow is important. Philip Guo has given a nice overview of it, discussing the key stages. Sebastian Raschka offers another workflow, this one for supervised machine learning models. However, as Ben Lorica points out, data analysis is just one component of data science. The crucial task of data cleaning is often overlooked by novices.

Somehow, I find cleaning data rather therapeutic. Profound weirdness aside, I often hear from people who prefer the data engineering aspects of their job (a term covering tasks closer to data deployment, maintenance and wrangling). Part of the appeal is the challenge of implementing a good, reliable data infrastructure for big data, particularly for real-time analysis, which requires a lot of creativity and pure technical prowess. They are also motivated by how much data cleaning matters to the problem they want to solve, such as modelling with an unstructured dataset or improving the accuracy of a model. Tavish Srivastava has an example using two humorously named bank cards, Metrro and Barcllay.

Looking at the landscape of data science, different aspects of the job are probably going to become increasingly specialised. It is not hard to envision this continuing, much as the webmasters prevalent in the early 90s diverged into web designers, developers, SEO specialists and so on. This is just the beginning.
