According to Glassdoor, data science has been the hottest and best job in the United States for three years straight. But what is it actually about? In this post, I will try to describe data science, drawing on a wide range of existing definitions of the field. In general, data science spans many different areas and can therefore be applied to many kinds of work. In a future post, I will pick up on the definition provided here and elaborate on what we refer to on this website as a full stack data scientist.
What is Data Science About?
As the name implies, data science focuses on working with data. That means that data of some kind usually sits at the core of a data scientist's work. However, there is no restriction on what kind of data that is: it can be small or big, structured or unstructured, and so forth. Around this data, a data scientist offers services that fall, for example, into the areas of data processing, data analytics, and data insights. Next, let us properly define the default data science process.
Data Science Process: CRISP-DM Model
When talking about a reference for a data science process, I prefer to use the so-called cross-industry standard process for data mining, or CRISP-DM for short. It was first conceived in 1996 in an attempt to describe the default data mining process for tackling a diverse set of problems. According to KDnuggets, it is still the most prominent data mining process in use. While originally developed for data mining, it carries over to data science one to one. The CRISP-DM model can be visualized as follows:
It breaks the data science process down into six major phases. The sequence does not need to be followed strictly: the arrows only mark the most frequently observed back-and-forth steps, and other iterations can and should be considered. This iterative nature of data science is also visualized by the outer circle. Also note that not every project warrants each individual step outlined.
Business understanding. Initially, a data scientist is often confronted with requirements and needs from the business. The goal is then to understand these needs and translate them into proper data science objectives.
Data understanding. In this phase, the data scientist should get familiar with the data, identify quality issues and missing data, and spot interesting facets of the data.
Data preparation. This phase transforms the initial raw data into the formats needed for modeling. The necessary steps are often performed multiple times, and going back and forth between data preparation and modeling might be necessary. Frequently, data scientists spend most of their time in this phase of the data science process.
Modeling. In this phase, a data scientist applies various modeling and statistical methods in order to achieve the predictive or analytical objectives of the project.
Evaluation. A fundamental aspect of many data science projects is the proper evaluation of the developed models and methods. This might make adaptations of previous steps necessary and is central to the iterative data science process.
Deployment. Particularly in corporate environments, modeling is not the end of a project; the end is rather the deployment of the resulting models and insights. This phase can be as simple as generating reports for stakeholders, but it can also involve developing automated dashboards and, for predictive models, deploying the models into production.
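To make the iterative nature of the six phases a bit more tangible, here is a minimal Python sketch that treats them as a small state machine. It is only an illustration under stated assumptions, not part of CRISP-DM itself: the names Phase, BACK_STEPS, next_phase, and walk are hypothetical, and the backward steps encoded here are only the ones commonly drawn as arrows in the CRISP-DM diagram. As noted above, other iterations are possible in practice.

```python
from enum import Enum


class Phase(Enum):
    """The six CRISP-DM phases, in their usual forward order."""
    BUSINESS_UNDERSTANDING = "Business understanding"
    DATA_UNDERSTANDING = "Data understanding"
    DATA_PREPARATION = "Data preparation"
    MODELING = "Modeling"
    EVALUATION = "Evaluation"
    DEPLOYMENT = "Deployment"


# Enum members iterate in definition order, which gives the forward sequence.
FORWARD_ORDER = list(Phase)

# Assumption: only the backward arrows commonly shown in the CRISP-DM diagram.
BACK_STEPS = {
    Phase.DATA_UNDERSTANDING: Phase.BUSINESS_UNDERSTANDING,
    Phase.MODELING: Phase.DATA_PREPARATION,
    Phase.EVALUATION: Phase.BUSINESS_UNDERSTANDING,
}


def next_phase(current, needs_rework=False):
    """Return the phase that follows `current`, or None after deployment.

    If rework is needed and a backward step is defined for `current`,
    the project returns to that earlier phase instead of moving forward.
    """
    if needs_rework and current in BACK_STEPS:
        return BACK_STEPS[current]
    idx = FORWARD_ORDER.index(current)
    if idx + 1 < len(FORWARD_ORDER):
        return FORWARD_ORDER[idx + 1]
    return None


def walk(decisions):
    """Trace the phases visited, given one rework decision per step."""
    visited = [Phase.BUSINESS_UNDERSTANDING]
    for needs_rework in decisions:
        following = next_phase(visited[-1], needs_rework)
        if following is None:
            break
        visited.append(following)
    return visited


if __name__ == "__main__":
    # Hypothetical project: modeling reveals that the data needs more preparation.
    for phase in walk([False, False, False, True, False, False, False]):
        print(phase.value)
```

In this hypothetical run, the fourth decision sends the project from modeling back to data preparation before it continues on to evaluation and deployment, which mirrors the back-and-forth between those two phases described above.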
The Data Science Venn Diagram
As we have seen in the CRISP-DM process, the tasks of a data scientist call for skills and knowledge in a wide range of fields. These areas can be roughly captured by the infamous Data Science Venn diagram depicted below (there are also other, similar versions, like the one at datajobs.com):
As we can see, data science requires, among others, skills in the area of mathematics & statistics, hacking skills, and substantive expertise.
Math & Statistics knowledge. At heart, data scientists aim at properly preparing data, gaining statistical insights, and conducting predictive analytics on data, as elaborated above. To do so, they need knowledge in the areas of mathematics and statistics. Mathematics is specifically needed to understand the underlying models being applied, to adapt them, and to build one's own methodological solutions to given problems. Statistics is needed so that data scientists can provide accurate statistical insights, for example using frequentist and Bayesian statistics. It is the job of data scientists to provide statistically relevant insights to stakeholders and to ground their methodology in sound statistical requirements. Combined with the hacking skills described next, a fundamental aspect of data science is also machine learning, whose methods depend heavily on mathematics and statistics.
Hacking skills. This area mainly concerns the ability of a data scientist to apply technology where needed. It does not refer to hacking into computer networks, but rather to having the technological skills to find intuitive solutions to diverse problem settings. To that end, a data scientist usually uses his or her preferred programming language to tackle problems in a fast-paced and agile way. This makes it possible to carry out all the steps of the CRISP-DM data science process described above. At the same time, hacking also refers to the ability of data scientists to think algorithmically.
Substantive expertise. The final area is what makes a good scientist, meaning that a data scientist should be able to ask the right research questions for the problem setting at hand, as well as to tinker and research critically. To answer research hypotheses, a data scientist then uses methods from the other two areas. This area also includes strong business acumen tailored to the business you are working in. For example, a data scientist in the banking business should be able to understand the surrounding environment and to act as a consultant in that area of expertise. It also means being able to connect the dots where standard business departments fail to see the forest for the trees.
Putting it all together
Above, we have discussed a general data science process and the skills needed to tackle the various aspects and problems of this process. We can see from this that data science is a very broad field that requires substantial expertise in different areas. This is what makes data science such an interesting job and has led to its sharp rise in popularity in recent years. Many businesses have understood that data science can bring a huge benefit to their business. However, this has also led to high expectations, which are oftentimes unrealistic. To cope with such expectations, I find it highly important that data scientists offer a very robust and diverse skill set that also allows them to connect to other areas, which might not even necessarily fall into the ones discussed in this article. Also, data scientists should be able to implement projects from start to finish, being involved in all the necessary steps. Having this 360° view of the topic is what I like to refer to as being a full stack data scientist. On this website and blog, we aim to discuss a wide array of topics. In one of the following posts, we will also discuss the full stack data scientist in more detail.
References and readings
- https://datajobs.com/what-is-data-science
- http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
- https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining
Also published on Medium.