30 Apr Will the “real” data scientist please stand up?
The term ‘data science’ has become increasingly popular over the last few years, especially when coupled with its more famous cousin, ‘big data’. A simple keyword search on social media forums and job portals is sufficient to suggest that data science is here to stay beyond its initial hype.
In a recent conversation, a friend who is a data scientist for a Toronto-based technology start-up and I discussed how the knowledge domains and skills associated with data science (e.g. creating datasets, querying databases for information, data analysis/interpretation using statistical methods, prediction using machine learning methods, and so on) had existed for decades before the term started gaining popularity. Both of us agreed that it makes sense to embrace some of the current hype around data science.
For those interested in the history of the discipline, here is an excellent blog post that talks about the chronological evolution and usage of data science. To me, metaphorically speaking, adopting a new term in business and technological circles is akin to the creation of a music genre. A new genre, as symbolized by a name (e.g. heavy metal, jazz fusion, disco in the 70’s; synth pop in the 80’s; grunge in the early 90’s; dubstep in the late 90’s; etc.), is popularized, partly, due to the uniqueness of that genre as represented by its musical elements, and also by the media surrounding that culture/community. A new genre is not born out of nothing. The musical elements/subgenres comprising this new genre have existed on their own somewhat independently, but now play a collaborative role in defining this new genre as something unique. Likewise, ‘data science’ symbolizes a unique amalgamation of various previously existing domains.
Despite the popularity of this term, many of us, including hiring managers at IT companies, are still surprisingly unclear about what data science entails. The term is used vaguely, particularly with regard to skill sets. How many of us would qualify as data scientists, and how many would be considered not quite up to the mark, or “fake”? Bernard Marr presents this reality check.
Extending the music analogy further, with data science we are in an interesting transitional phase where the genre is still evolving and redefining itself. Let’s take heavy metal, for example. During its initial years, bands such as Led Zeppelin, Deep Purple, and Black Sabbath were associated with the sound of heavy metal. However, today, five decades later, many metal enthusiasts including myself would wince at the thought of referring to Led Zeppelin and Deep Purple as metal bands, but would unhesitatingly prescribe Black Sabbath as required course material for Heavy Metal 101. Likewise, the boundaries and components of data science are still evolving. The fully defined picture of data science will be clear only in retrospect.
Having said that, are there ways to enlighten ourselves about what constitutes data science in its present state? Fortunately, several people have been asking similar questions, and their efforts have resulted in an abundance of useful, relevant information on the web. Data science can be understood through the core areas of study or knowledge domains and disciplines (e.g. computer science, statistics, databases). We could also think of it in terms of job roles and skills within the data science spectrum.
Data science can also be understood as a combination of the disciplines involved and the skills required. Ferris Jumah provides a novel way of visualizing what is hot in data science using a “data-centric” approach. Finally, here is an exhaustive information resource from DataCamp for aspiring data scientists, which provides areas of study, skills and background required, and even a roadmap with resources to acquire those skills – the mother of all data science infographics, in my opinion! There is a wealth of this kind of information available online and I have barely scratched the surface.
So, who qualifies as a data scientist? This blog post gets right to the point by listing 14 definitions, each highlighting different aspects of data science, some poignant, some detailed, and some tickling the funny bone. A quick read of all these definitions reveals key areas and corresponding skill sets, allowing us to build a summarized knowledge schema around data science – data analysis; data munging, cleaning, and manipulation; inference from big datasets; interpretation using statistical methods and machine learning; software engineering; storytelling and visualization, to name a few in no particular order. My particular favorite is #11 by John Rauser, which I have often heard paraphrased. Another interesting point to note is that the nature of the definition varies depending on the background (i.e. knowledge, training, experience) and current position of the person defining it. Each individual’s perception of data science carries a bias that is unique to his or her past and present experiences and knowledge. I find this exciting because it enables us to aggregate diverse perspectives from current data science practitioners in the real world and assimilate these views into a common data science schema. This also serves as a good method to learn more about data science – by hearing it straight from the horse’s mouth, so to speak.
Data Science Weekly runs a section titled “Data Scientist Interviews” in its weekly newsletter. The title is self-explanatory. At least two volumes of interviews with data scientists have been published since 2014. After reading the first volume, I can say with conviction that this is one of the more valuable resources for keeping abreast of the field of data science, understanding its scope and the nature of data science problems, and monitoring how the field is evolving. Volume 1 contains interviews with data scientists from diverse backgrounds who work in a wide range of fields and share a passion for answering questions pertaining to data.
Now that we have a good overview of what data science involves, how do we recognize the hidden data scientist within us? I believe the key is to find our niche. Going back to musical metaphors, we are like musicians in a band playing jazz fusion, for instance. There isn’t enough time to master every style. But it helps to perfect one or two styles and have at least a surface-level understanding of others. Members of 70’s jazz rock fusion bands such as The Mahavishnu Orchestra, Weather Report, and Shakti were exemplary individual musicians who brought their individual strengths to the common goals of the band, forming a distinctive voice. In data science teams, irrespective of whether we are statisticians, computer scientists, software engineers, or hybrids of these areas, we all have a role to play. We use our strengths to provide insights and useful answers to the common questions the team is attempting to address, while utilizing the rich data sources that are presently available. Have you found the hidden data scientist within yourself?
To your success as a data scientist!
By: Naresh Vempala