How to become a data scientist?

This is a tech blog about data science and programming. I will write down what I learn along the way and share it with the community. In this first post, I would like to say something about my transition from academia to industry. Hopefully, it will be helpful for people with similar backgrounds.

What is my background?

I am a mathematician interested in mathematical physics. To satisfy your curiosity, let me describe what mathematicians do every day. After finding an interesting problem, a mathematician buries himself in scratch paper and calculations, tries to find nice formulas or algorithms and prove beautiful theorems, and finally publishes papers. It is a hard career to pursue: he has to guarantee the quality of his fresh knowledge and find a way to sell it to authorities, who obviously never care about his ideas. I thought I was a promising mathematician until I figured out I was not able to find a job in a math department. Life hit me so hard!

Physics is a totally different science from mathematics. I worked on condensed matter physics, which has deep connections with deep learning, since an artificial neural network is similar to a tensor network. Physicists have two powerful tools in hand, field theory and quantum mechanics, yet their combination, quantum field theory, still lacks a good mathematical definition. Fields such as the electromagnetic or gravitational field are described mathematically by tensors, whose dynamics are modeled by differential equations. Quantum theory teaches us that light is composed of light quanta, or photons, which are basically discrete packets of energy. Machine learning borrows ideas from physics, for example, to build energy-based models like the restricted Boltzmann machine. Meanwhile, machine learning also provides new approaches to the study of physics, for example, pattern recognition in astrophysics.

What are the prerequisites?

You don’t need a PhD to become a data scientist. What you really need is basic college mathematics and statistics plus programming skills. You don’t need physics to learn data science, even though the concept of entropy will appear in machine learning algorithms.

One key component of machine learning is algorithms. To understand them, you need to know linear algebra, calculus, Bayesian probability and so on. For instance, the concepts of inner product and Hilbert space are useful: what is the relation between the l1 and l2 norms, and what is the kernel method in a support vector machine? A basic knowledge of probability is very helpful: what is the difference between a generative and a discriminative model, and why is naive Bayes so naive? Statistical hypothesis testing is sometimes very important. Another application of statistics is A/B testing, say, if you want to deploy your model as a web service.
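
To make the question about the l1 and l2 norms concrete, here is a small sketch in plain Python. The helper names l1_norm and l2_norm are my own, chosen for illustration; in practice you would probably use NumPy. For any vector x with n entries, the two norms bound each other: ||x||_2 <= ||x||_1 <= sqrt(n) ||x||_2.

```python
import math

def l1_norm(x):
    """Sum of absolute values (Manhattan length)."""
    return sum(abs(v) for v in x)

def l2_norm(x):
    """Square root of the sum of squares (Euclidean length)."""
    return math.sqrt(sum(v * v for v in x))

# A made-up example vector.
x = [3.0, -4.0]
n = len(x)

print(l1_norm(x))  # 7.0
print(l2_norm(x))  # 5.0

# The l2 norm never exceeds the l1 norm, and the l1 norm
# never exceeds sqrt(n) times the l2 norm.
bounds_hold = l2_norm(x) <= l1_norm(x) <= math.sqrt(n) * l2_norm(x)
print(bounds_hold)  # True
```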

Before you start to learn data science, it is better to have some experience with Linux and shell scripting. Besides installing all kinds of software and packages from a shell, you will need to test your model in a terminal or console without a graphical user interface (GUI). In data science, Python is a good programming language to start with. It is an interpreted, object-oriented, high-level language with dynamic semantics. Of course, to really understand object-oriented programming, it is better to have some experience in Java or C++. Following its “batteries included” philosophy, Python has a rich and versatile standard library, so you should learn some of its powerful built-in modules besides the basic syntax and data structures.
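
As a small taste of the standard library, the sketch below counts words with collections.Counter and summarizes numbers with the statistics module, without installing anything. The sample text and numbers are made up for illustration.

```python
from collections import Counter
import statistics

# Count word frequencies in a sentence.
text = "to be or not to be"
word_counts = Counter(text.split())
print(word_counts.most_common(2))  # [('to', 2), ('be', 2)]

# Basic descriptive statistics.
samples = [2, 4, 4, 4, 5, 5, 7, 9]
sample_mean = statistics.mean(samples)
sample_pstdev = statistics.pstdev(samples)  # population standard deviation
print(sample_mean)    # 5
print(sample_pstdev)  # 2.0
```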

You need a decent machine to do computations. You may build a desktop with a video card, that is, a graphics processing unit (GPU). Install a Linux system; you may try it out with a video game. Install Python and pip, the package manager for Python modules. Alternatively, you could install Anaconda. Without a powerful computer, you could turn to cloud computing.

How to learn data science?

A basic knowledge of databases and structured query language (SQL) is very helpful, since most structured data is traditionally stored in tables in relational databases. Python has a built-in sqlite3 module, an interface to the lightweight SQLite database. As another example, MySQL is a popular free and open-source relational database with GUI tools. With some familiarity with SQL, it is easier to learn the Pandas DataFrame and its operations. If you don’t have a data engineer supporting you, as a data scientist you will have to take care of data pipeline work such as extract, transform, load (ETL) yourself.
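
Here is a minimal sketch of what working with the sqlite3 module looks like. The table name orders, its columns, and the rows are all made up for illustration.

```python
import sqlite3

# An in-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 20.0), ("bob", 15.0), ("alice", 5.0)],
)

# Total spending per customer, largest first.
rows = conn.execute(
    "SELECT customer, SUM(amount) AS total "
    "FROM orders GROUP BY customer ORDER BY total DESC"
).fetchall()
print(rows)  # [('alice', 25.0), ('bob', 15.0)]
conn.close()
```

The same aggregation in Pandas would be a groupby on the customer column followed by a sum, which is why knowing SQL makes the DataFrame operations feel familiar.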

Machine learning is a category of algorithms that automate predictive model building and perform specific tasks such as classification, regression and clustering. Machine learning can be divided into two parts: classical algorithms and neural network methods.

Classical machine learning algorithms include Linear Regression, Logistic Regression, Decision Tree, Support Vector Machine, Naive Bayes, K-Nearest Neighbors and so on. The process of machine learning can be described as learning a mapping f that best maps feature vectors X to a target variable y: y = f(X). Feature engineering is often the most important step in building a predictive model, because the choice of features strongly affects the accuracy of the trained model. Scikit-Learn is a free and open-source machine learning library in Python with detailed documentation. It is built on the numerical and scientific libraries NumPy and SciPy. Scikit-Learn is mostly written in Python, so it is worth your time to study its examples, documentation and implementations intensively.
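
To illustrate what “learning a mapping f” means, here is a sketch of the simplest case, a least-squares line fit with one feature, in plain Python. This is not how Scikit-Learn implements linear regression; the function name fit_line and the toy data are my own assumptions.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Toy data generated from y = 2x + 1 (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)

def f(x):
    """The learned mapping from a feature to a prediction."""
    return slope * x + intercept

print(slope, intercept)  # 2.0 1.0
print(f(4.0))            # 9.0
```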

Artificial neural networks and deep learning have become very popular since the achievements of DeepMind and AlphaGo, thanks to the fast development of hardware technology. One promise of deep learning is to enable self-driving cars in the near future. Neural networks include multi-layer perceptrons (MLP), convolutional neural networks (CNN), recurrent neural networks (RNN) and so on. Besides applications in traditional machine learning tasks, deep learning has been widely used in computer vision and natural language processing (NLP). Among many deep learning libraries, TensorFlow and PyTorch are popular frameworks in research and industry. The main difference has been that TensorFlow 1.x builds a static computational graph before running it, while PyTorch builds the graph dynamically as the code runs; TensorFlow 2.x also executes eagerly by default. Both TensorFlow and PyTorch are written mainly in C++ and Python. TensorFlow provides a visualization tool called TensorBoard and easy-to-use application programming interfaces (APIs) for various programming languages, so I would recommend learning TensorFlow first, then PyTorch if you have time. The advanced math used in deep learning is the computation of gradients for backpropagation. Both frameworks do this by reverse-mode automatic differentiation, which PyTorch calls autograd.
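
The idea behind reverse-mode automatic differentiation fits in a few lines. The sketch below is a toy for scalars that supports only addition and multiplication; the class name Var is my own, and it is in no way the actual implementation inside TensorFlow or PyTorch.

```python
class Var:
    """A scalar that remembers how it was computed."""

    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        # Each parent is paired with the local derivative d(self)/d(parent).
        self.parents = parents

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    def backward(self):
        """Accumulate d(self)/d(node) into node.grad for every node."""
        # Build a topological order so that a node's gradient is complete
        # before it is passed on to its parents.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent, _ in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        # Walk from the output back to the inputs, applying the chain rule.
        for node in reversed(order):
            for parent, local_grad in node.parents:
                parent.grad += local_grad * node.grad

x = Var(2.0)
y = Var(3.0)
z = x * y + x  # z = xy + x
z.backward()
print(x.grad)  # dz/dx = y + 1 = 4.0
print(y.grad)  # dz/dy = x = 2.0
```

A single backward pass yields the derivative of the output with respect to every input, which is why this mode suits neural networks with many parameters and one scalar loss.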

What would be a plus?

Data analysis is sometimes too abstract, so we need data visualization to help us understand data. There are powerful libraries such as Matplotlib and Seaborn that provide all kinds of plots in Python. Given structured data, a data scientist should also be able to use software like Tableau or Power BI to report to a broad audience.

As more companies move their data and computational platforms to the cloud, it is an advantage for a data scientist to have a good knowledge of big data frameworks like Hadoop and Spark. In particular, Spark offers fast in-memory computational tools such as Spark SQL and Spark ML, so it is pretty easy to do machine learning directly on the cluster. Spark is evolving very fast and now interoperates with Pandas DataFrames; it is also possible to integrate Scikit-Learn or TensorFlow with Spark. As a result, Spark has grown into a good platform for cloud computing and machine learning. If you do not have a GPU in your computer and you need to do deep learning, big technology companies provide cloud services where you can rent a Linux virtual machine and buy GPU hours at a reasonable price. With the development of big data, cloud computing has become a popular way to consume computing as a service, from rented virtual machines to software as a service (SaaS).

As a data scientist, how do you deploy your production-ready predictive model as an end-to-end application? It is very convenient if you know how to create a website or develop a smartphone app. Within the realm of Python, it is a good idea to learn a web framework such as Flask or Django. Of course, JavaScript is a popular language to learn, and Node.js has become a powerful choice for the backend of web development. If you are good at Java and familiar with the Model–View–Controller (MVC) architecture or a similar one, it is not hard to get started with Android development.
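
Flask and Django applications both speak WSGI, the standard interface between Python web applications and servers, so it helps to see that interface once without any framework. The sketch below uses only the standard library; the predict function is a hypothetical stand-in for a trained model, and the query parameter x and the port are my own choices.

```python
import json
from urllib.parse import parse_qs
from wsgiref.simple_server import make_server

def predict(x):
    """Hypothetical stand-in for a trained model: y = 2x + 1."""
    return 2.0 * x + 1.0

def app(environ, start_response):
    """WSGI application: a request for /?x=3 returns {"prediction": 7.0}."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    headers = [("Content-Type", "application/json")]
    try:
        x = float(query["x"][0])
    except (KeyError, ValueError):
        # The parameter is missing or is not a number.
        start_response("400 Bad Request", headers)
        return [json.dumps({"error": "pass a number as ?x="}).encode("utf-8")]
    start_response("200 OK", headers)
    return [json.dumps({"prediction": predict(x)}).encode("utf-8")]

def serve(port=8000):
    """Run the app on localhost with the reference server (blocks forever)."""
    with make_server("", port, app) as server:
        server.serve_forever()
```

Calling serve() and opening http://localhost:8000/?x=3 in a browser should show the JSON response. The reference server is meant for experiments only; a framework such as Flask adds routing, request parsing and much more on top of the same interface.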

Why they need a PhD

As a PhD, what are your advantages in data science, if any? We had to read papers, understand long formulas, and do research, remember? In a broad sense, mathematics or physics is a data science as well. When you write a paper, you first make reasonable assumptions and define a solvable problem, then collect evidence such as data or formulas to support a conclusion, which could be a universal model or an acceptable solution to the original problem. So as a mathematician or physicist, you have already been a data scientist for a long time, just with a different set of tools. To become a data scientist, you have to learn new formulas or algorithms, start coding and embrace a new culture. I would like to inject a scientific attitude into data science and view it as a science more than a technology. A PhD tends to understand the theory under the hood instead of just calling APIs. If necessary, he can modify the underlying algorithms to fulfill the requirements.

Summary

In this post, I gave some hints, from my personal experience as a math PhD, about how to learn data science. To be a data scientist, you have to learn machine learning algorithms and master programming languages such as Python to solve problems in business. To be a qualified data scientist, you need to learn a handful of skills and, more importantly, stay curious, creative and open-minded forever. If you are a mathematician or physicist with no knowledge of programming, I do not encourage you to jump from academia to industry, because the learning curve is steep. If you are brave enough to do so, I promise you will have fun studying and applying data science.

Written on March 29, 2019