As a full stack data scientist, you need to be competent in a wide range of areas, as discussed in previous posts. You also have to carry out many different parts of the data science process yourself. To do so, good skills in one or more programming languages are required. On this website, we will focus heavily on Python as our programming language of choice. In this blog post, I want to briefly explain why I think that Python is the perfect programming language for a full stack data scientist.
What is Python?
Python’s official website describes it as follows:
Python is an interpreted, object-oriented, high-level programming language with dynamic semantics. Its high-level built in data structures, combined with dynamic typing and dynamic binding, make it very attractive for Rapid Application Development, as well as for use as a scripting or glue language to connect existing components together. Python’s simple, easy to learn syntax emphasizes readability and therefore reduces the cost of program maintenance. Python supports modules and packages, which encourages program modularity and code reuse. The Python interpreter and the extensive standard library are available in source or binary form without charge for all major platforms, and can be freely distributed.
The executive summary continues with:
Often, programmers fall in love with Python because of the increased productivity it provides. Since there is no compilation step, the edit-test-debug cycle is incredibly fast. Debugging Python programs is easy: a bug or bad input will never cause a segmentation fault. Instead, when the interpreter discovers an error, it raises an exception. When the program doesn’t catch the exception, the interpreter prints a stack trace. A source level debugger allows inspection of local and global variables, evaluation of arbitrary expressions, setting breakpoints, stepping through the code a line at a time, and so on. The debugger is written in Python itself, testifying to Python’s introspective power. On the other hand, often the quickest way to debug a program is to add a few print statements to the source: the fast edit-test-debug cycle makes this simple approach very effective.
This excellent summary gives us a good overview of the main advantages of Python. Most importantly, Python is a general-purpose language: it has been designed to be applicable to a wide range of application domains. This is perfect for our duties as full stack data scientists, as we have to tackle tasks in a variety of areas, primarily data modeling, but also subsequent tasks like model deployment.
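The debugging behavior described in the quoted summary can be sketched with a few lines of standard-library Python. This is only a minimal illustration: the functions `parse_ratio` and `safe_parse_ratio` are hypothetical names invented for this example. Bad input leads to an exception that the program can catch and inspect, and `traceback.print_exc()` prints the same kind of stack trace the interpreter would show for an uncaught error.

```python
import traceback


def parse_ratio(text):
    """Parse a string such as "3/4" and return the ratio as a float."""
    numerator, denominator = text.split("/")
    return int(numerator) / int(denominator)


def safe_parse_ratio(text):
    """Return the parsed ratio, or None if the input is malformed."""
    try:
        return parse_ratio(text)
    except (ValueError, ZeroDivisionError) as error:
        # The interpreter raised an exception instead of crashing, so we can
        # report it and print the stack trace for debugging.
        print(f"Could not parse {text!r}: {type(error).__name__}: {error}")
        traceback.print_exc()
        return None


print(safe_parse_ratio("3/4"))  # valid input
print(safe_parse_ratio("3/0"))  # division by zero raises ZeroDivisionError
print(safe_parse_ratio("abc"))  # malformed input raises ValueError
```

Because there is no compilation step, changing the input strings and re-running the script takes seconds, which is the fast edit-test-debug cycle the summary refers to.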
Open source community
One of the main advantages of Python for a data scientist is its huge open source community, which supports developers in their endeavors. For nearly every problem you will encounter during your career as a full stack data scientist, some solution in Python already exists. Nearly all necessary tools are available as open source packages, most of which are hosted, documented, and supported on GitHub. If you have a question regarding Python, the go-to resource is Stack Overflow. In particular, Python has established itself as the go-to language for machine learning, and a huge repertoire of open source libraries has emerged in this field. We will discuss a few next.
Most popular and important packages
In the following, I want to briefly mention the most popular and important Python packages for a full stack data scientist. This list is by no means complete, and we will learn about many more handy packages in future blog posts.
- Scikit-learn offers simple and efficient tools for data mining and data analysis. It covers most of the important machine learning models and algorithms. Scikit-learn has also established a simple-to-use API for training models and making predictions, which is re-used in many other popular machine learning libraries.
- Fundamentally, scikit-learn and many other libraries build on NumPy and SciPy, which both offer functionality for scientific computing with Python.
- Pandas is an open source library providing high-performance, easy-to-use data structures and data analysis tools.
- Matplotlib is a plotting library that, according to its website, produces publication quality figures and visualizations.
- Both Theano and TensorFlow have brought the functionality of working with tensors to the Python community. Probably the most popular Python library for deep learning is Keras.
- For natural language processing, NLTK is the go-to library.
- Statsmodels provides functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration.
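To give a feeling for the training and prediction API mentioned in the scikit-learn entry above, here is a minimal sketch of the fit/predict convention in plain Python. It deliberately does not import scikit-learn: `MeanRegressor` is a hypothetical toy class written for this post, and it only imitates the convention that an estimator learns from data in `fit` and produces outputs in `predict`.

```python
from statistics import mean


class MeanRegressor:
    """Toy estimator that always predicts the mean of the training targets."""

    def fit(self, X, y):
        # Learn the single parameter of this model: the average target value.
        self.mean_ = mean(y)
        # Returning self allows chaining such as MeanRegressor().fit(X, y).
        return self

    def predict(self, X):
        # One prediction per input row, regardless of the feature values.
        return [self.mean_ for _ in X]


X_train = [[1.0], [2.0], [3.0]]
y_train = [10.0, 20.0, 30.0]

model = MeanRegressor().fit(X_train, y_train)
predictions = model.predict([[4.0], [5.0]])
print(predictions)
```

The benefit of such a shared convention is that models become interchangeable: code that calls `fit` and `predict` does not need to know which algorithm sits behind them.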
As mentioned, there are many more awesome libraries. A short overview of some relevant to data science can also be found online.
Apart from all the aspects mentioned above, so-called Jupyter notebooks have brought incredible value to the general workflow of data scientists over the past few years. The official website describes them as follows:
The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.
An example of such a notebook running Python can be found online. Notebooks are ideal for a data scientist, specifically for prototyping. Combined with their rich Markdown and visualization functionality, notebooks can also be an excellent tool for producing reports that can be shared with others. Examples of data science notebooks can be found on Kaggle. Another great thing about Jupyter notebooks is that they are not restricted to Python: you can also run other kernels, such as R or Julia.
Alternative programming languages for a full stack data scientist
There are many other great languages for data science out there, and everyone should use the one they are most comfortable with. It might also make sense to use different languages for different parts and tasks of a project. Widening your skill set by learning different languages also makes you more flexible in your work.
R in particular can be highly valuable for a data scientist. R is a language that focuses specifically on statistical computing and graphics, and it provides a large array of statistical and graphical techniques. Especially in terms of statistical modeling, it has many benefits over Python and offers a large number of open source packages tailored for that task. In a nutshell, I would say that the statistical community focuses on R, while the machine learning community focuses more on Python. Nonetheless, you can usually do both things in both languages to a large degree.
Apart from Python and R, it might make sense to also look into other evolving technologies and languages, such as Spark for the Hadoop ecosystem, Node.js and JavaScript for web development, Julia, or other languages relevant to data science like SQL for databases. For most of these, however, wrappers and APIs for Python are available, so that they can be used directly within the Python ecosystem. If you are looking for a single programming language to start with in data science, I strongly recommend Python.
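As a small illustration of using another language from within Python, the following sketch runs SQL through the `sqlite3` module from Python's standard library. The table name `measurements` and its contents are made up for this example, and an in-memory database is assumed so that nothing needs to be installed or written to disk.

```python
import sqlite3

# An in-memory database keeps the example self-contained.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE measurements (sensor TEXT, value REAL)")
connection.executemany(
    "INSERT INTO measurements (sensor, value) VALUES (?, ?)",
    [("a", 1.0), ("a", 3.0), ("b", 10.0)],
)

# The aggregation is plain SQL; Python only sends the query and collects the rows.
rows = connection.execute(
    "SELECT sensor, AVG(value) FROM measurements GROUP BY sensor ORDER BY sensor"
).fetchall()
connection.close()

print(rows)
```

The query result arrives as ordinary Python tuples, so it can be passed straight on to the rest of a Python-based workflow.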
Also published on Medium.