How Much Math Do Data Analysts or Data Scientists Need?

by Martin D. Maas, Ph.D

In this post, we'll discuss the main professional roles in the data industry, and where mathematics actually fits.

Data Analysts and Data Scientists need to focus on different areas of mathematics.

“When we are in front of a blackboard, we call it statistics; when in front of a computer, it becomes machine learning; and in a business presentation, we refer to the process as artificial intelligence.”

I often get asked what the role of mathematics is in the data industry, so this post is an attempt to answer that question. Honestly, it’s a good idea to stop making things seem more complicated than they actually are.

By the way, my formal education consists of an MSc and a PhD in Applied Mathematics. I acquired what we now call Data Skills mostly in Computational Statistics courses, using the old-fashioned method of reading books and doing research. Most of my recent work lies in the field of remote sensing, trying to make sense of vast amounts of satellite data.

In sum, I believe Data Science is mostly about the interplay of coding, statistics, and domain-specific knowledge. However, there are different roles in the data industry, and the required mathematical background can vary substantially.

Data Analysis: Making sense of data

Data analysis involves finding patterns and trends in large amounts of data with the goal of providing insights that can help solve problems and improve business decisions. To perform data analysis, you need to understand how to collect and organize data, how to extract the information you want, and how to interpret the results.

The skills you should probably look forward to learning are:

  • Basic coding. Learn easy languages like Python and R. I would suggest R in case you are starting from scratch, and Python if you have at least some programming background.
  • Data Visualization. The bread and butter of almost any kind of Data Science work. You should get familiar with tools such as ggplot2 (in R), or matplotlib’s pyplot (in Python).
  • Statistics with R. You don’t need to take a classical statistics course with nothing but theorems unless you want to become a statistician/mathematician yourself. However, classical statistics rests on a solid tradition and firm theoretical ground. A combination of coding in a simple language like R and some theoretical background in statistics should be a good fit, and there are many courses that get this mix just right.

So… is there a lot of math here? Honestly, for this role what you need is a practical understanding of statistics, not theoretical statistics.
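To give a flavor of what “practical” means here, the following is a minimal sketch in Python using only the standard library. The sales figures are made up for illustration, and the variable names are hypothetical; the point is simply that knowing when to trust the mean versus the median matters more than proving theorems about either.

```python
import statistics

# Hypothetical daily sales figures, invented for this example.
# One day (610) is an obvious outlier.
daily_sales = [120, 135, 128, 142, 610, 131, 125]

mean_sales = statistics.mean(daily_sales)      # sensitive to the outlier
median_sales = statistics.median(daily_sales)  # robust to the outlier
spread = statistics.stdev(daily_sales)         # sample standard deviation

print(f"mean:   {mean_sales:.1f}")
print(f"median: {median_sales:.1f}")
print(f"stdev:  {spread:.1f}")
```

Here a single unusual day pulls the mean well above the median, so reporting “typical daily sales” with the mean alone would be misleading. Spotting this kind of thing is the practical understanding an analyst needs.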

Keep in mind that even basic data analysis skills, combined with sensible domain expertise, can be a very powerful tool in today’s environment. You can read more in my post about domain knowledge in data science.

Data Science: Make Predictions Based on Data

Data science is the role in the data industry that requires the most advanced mathematical skills. As this happens to be the best-known role in the industry, it has led to the idea that math requirements pervade the whole field, which is not the case.

As a data scientist, your job is to discover patterns and make connections among data to solve complex problems. This task requires a broad base of math and programming skills. Specifically, you’ll need to be comfortable working with data visualization, statistical analyses, machine learning, programming languages, and databases.

The difference between a data analyst and a data scientist is that, while a data analyst is more of a generalist who uses analytics and domain knowledge to gain insights and make recommendations, a data scientist is a specialist who strives to use advanced analytics to solve problems in more automated ways.

Crucially, the key difference between these roles lies in the ability of a data scientist to create predictive models.

Indeed, predictive models have a long history in statistics. They are often referred to as “forecasts”, and statisticians have been making them for well over a century. But of course, it’s the modern tooling that makes this field so exciting.

The required skills here can probably be grouped into two categories: mastering classical statistical techniques, and approaching the more advanced computational tools.

Important fact: if you don’t know where the classical techniques fall short, you won’t know why you are applying more sophisticated ones, and that’s not a good place to be. Learning the fundamentals is definitely key to the long-term understanding of the various tools that regularly pop up in the area.

With this in mind, here is a list of what are probably the most important things to know:

  • Basic Calculus. Data Science doesn’t actually require much calculus, other than as a prerequisite to probability and statistical theory.
  • Linear Algebra, as it is the basis of modern practical computing. Least squares, dimensionality reduction, collinearity, and more can all be understood in terms of Linear Algebra.
  • Linear Statistical Models. A second course on statistics is usually about so-called “Multivariate Statistics”, that is, mastering least squares techniques in statistics. And least squares is one of those deceptively simple, and yet most powerful, tools out there. I believe it is common for people to become excited about more sophisticated techniques and hot topics before they have exhausted the simplest ones, and that is a mistake. You can, of course, create forecasts and solve classification problems using nothing but least squares. So I wouldn’t skip this topic at all.
  • Machine Learning/Deep Learning/AI. These are more advanced predictive techniques. I believe it’s best to focus on case studies and application areas to learn these techniques, as the theoretical understanding of these tools is actually well behind the state of the art of industry practice.
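As a rough illustration of the claim that least squares alone can produce both forecasts and classifiers, here is a minimal sketch in plain Python. All the data are made up, and the helper names `fit_line` and `classify` are hypothetical. The classifier uses one simple approach (encode the two classes as -1 and +1, fit a line, and threshold at zero), which is a baseline rather than a recommended method.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Forecasting: fit a trend to six made-up monthly values,
# then extrapolate one month ahead.
months = [1, 2, 3, 4, 5, 6]
demand = [10.0, 12.1, 13.9, 16.2, 18.0, 19.8]
b0, b1 = fit_line(months, demand)
forecast_month_7 = b0 + b1 * 7
print(f"forecast for month 7: {forecast_month_7:.2f}")

# Classification: encode the two classes as -1 and +1,
# fit the same model, and threshold the prediction at zero.
feature = [1.0, 1.5, 2.0, 3.0, 3.5, 4.0]
label = [-1, -1, -1, 1, 1, 1]
c0, c1 = fit_line(feature, label)


def classify(x):
    """Return +1 or -1 depending on the sign of the fitted line at x."""
    return 1 if c0 + c1 * x >= 0 else -1


print(classify(1.2), classify(3.8))
```

With more than one feature, the same idea carries over by solving the least squares problem in matrix form, which is where the Linear Algebra from the previous bullet comes in.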

Data and Cloud Engineering: When Data Applications Get Real

Data engineers, like all engineers, are called upon when things “get real”. That is, when models have to be implemented on a large scale, with massive amounts of data, and run in real time.

In order to do this, we have “the cloud”. The cloud is this amazing technical and commercial innovation that makes it possible to rent massive server infrastructure “by the second” (or is it by the millisecond already?). This has created so many possibilities that I can’t even get started. Interacting with Data Science applications is just one of these new possibilities.

  • Databases: Learn how large-scale data is stored and accessed. The main paradigm is SQL, and there are also the newer NoSQL databases.
  • Virtualization: This is what makes it possible to run code on the cloud. You will need to know about virtual machines, containers, Kubernetes, and microservices.
  • Big Data: When the scale of the data you are working with becomes really, really large, you will need to get familiar with tools like Hadoop, Spark, and Google’s BigQuery.
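To make the SQL paradigm from the first bullet concrete, here is a small sketch using Python’s built-in sqlite3 module with an in-memory database. The `readings` table and its contents are invented for this example. A production system would use a server-based or cloud database instead, but the query language looks much the same.

```python
import sqlite3

# An in-memory SQLite database: nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table of sensor readings, made up for illustration.
conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings (sensor, value) VALUES (?, ?)",
    [("a", 1.0), ("a", 3.0), ("b", 10.0), ("b", 14.0), ("b", 12.0)],
)

# Aggregate inside the database: count and average per sensor.
rows = conn.execute(
    "SELECT sensor, COUNT(*), AVG(value) "
    "FROM readings GROUP BY sensor ORDER BY sensor"
).fetchall()

for sensor, n, avg in rows:
    print(sensor, n, avg)

conn.close()
```

The design point is that the aggregation happens inside the database, so only the small summary travels back to your code. That matters far more once the table has billions of rows rather than five.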

So, do you really need a lot of math to become a data engineer? Frankly, not really.

Conclusion

Starting a career in the Data Industry can be quite competitive. But as data usage by companies continues to grow, so will the demand for data-savvy professionals.

Bear in mind that there are many professional roles within this industry, and there is a varying degree of mathematics involved with each one.
