What is Data Science All About?
In this post, I will give a general view of what I think are the basic skills required to get into the data world, and digress about the many ways to acquire them.
“When we are in front of a blackboard, we call it statistics; when in front of a computer, it becomes machine learning; and in a business presentation, we refer to the process as artificial intelligence”.
Honestly, let’s stop making things seem more complicated than they actually are. Data Science is about coding, statistics, and domain-specific knowledge, and that’s basically it.
By the way, my formal education has been an MSc and a PhD in Applied Mathematics. I acquired what we now call Data Skills mostly in Computational Statistics courses, using the old-fashioned method of reading books, and doing research. Most of my recent work lies in the field of remote sensing, trying to make sense of vast amounts of satellite data.
Basic Professional Roles in the Data Industry
“The world’s most valuable resource is no longer oil, but data”, claimed The Economist a few years ago. Indeed, companies continue to collect increasing amounts of data to serve business needs. And just like oil, raw data isn’t as valuable as “refined” data. Consequently, companies require an increasing number of data-savvy professionals.
Among those data professionals, the best known are the “Data Scientists”. But new technical roles have emerged as well, such as Data Analysts or Data Engineers.
Data Analyst: Make Sense out of Data
From a technical point of view, this is probably the most basic role out there. However, even basic data analysis skills, combined with sensible domain expertise, can be a very powerful tool in today’s environment.
The skills you should probably aim to learn are:
- Basic coding. Learn easy languages like Python and R. I would suggest R in case you are starting from scratch, and Python if you have at least some programming background.
- Data Visualization. The bread and butter of almost any kind of Data Science work. You should get familiar with tools such as ggplot2 (in R), or matplotlib’s pyplot (in Python).
- Statistics with R. You don’t need to take a classical statistics course with nothing but theorems unless you want to become a statistician/mathematician yourself. However, classical statistics sits on a solid tradition and a firm theoretical ground. Combining coding in a simple language like R with some theoretical background in statistics should be a good fit, and there are many courses that get this mix just right.
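As a small illustration of the kind of summary a first course mixing coding and statistics usually starts from, here is a minimal sketch using only Python's standard library. Python is used here rather than R only to keep the example short and dependency-free; the sample values and variable names are invented for illustration.

```python
import statistics

# Hypothetical sample: daily counts of support tickets (invented data).
tickets = [2, 4, 4, 4, 5, 5, 7, 9]

mean_tickets = statistics.mean(tickets)      # arithmetic average
median_tickets = statistics.median(tickets)  # middle value, less sensitive to outliers
spread = statistics.pstdev(tickets)          # population standard deviation

print(f"mean={mean_tickets}, median={median_tickets}, stdev={spread}")
```

Comparing the mean with the median is a quick, informal way to spot skewed data before plotting it.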
Data Science: Make Predictions Based on Data
After data analysis and visualization, the next step is the ability to create predictive models.
Indeed, predictive models have a long history in statistics. They are often referred to as “forecasts”, and statisticians have been making them for well over a century. But of course, it’s the modern tooling that makes this field so exciting.
The required skills here can probably be classified into two categories: mastering classical statistical techniques, and approaching the more advanced computational tools. Important fact: if you don’t know where the classical techniques fall short, you won’t know why you are applying more sophisticated ones, and that’s not a good place to be.
- Linear Statistical Models. A second course on statistics is usually about so-called “Multivariate Statistics”, i.e. mastering least squares techniques in statistics. And least squares is one of those deceptively simple, and yet most powerful tools out there. I believe it is common for people to become excited about more sophisticated techniques and hot topics before they have exhausted the simplest ones, and that is a mistake. You can, of course, create forecasts, and solve classification problems using nothing but least squares. So I wouldn’t skip this topic at all.
- Machine Learning/Deep Learning/AI. These are more advanced predictive techniques. I believe it’s best to focus on case studies and application areas to learn these techniques, as the theoretical understanding of these tools is actually well behind the state of the art of industry practice.
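To make the point about least squares concrete, here is a minimal sketch in plain Python of a one-variable least squares fit, used first for a forecast and then for a simple two-class problem. The data, the function names (fit_line, predict_pass), and the 0.5 decision threshold are all assumptions made for illustration, not a recommended recipe.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Forecast: hypothetical monthly sales, extrapolated one month ahead.
months = [1, 2, 3, 4, 5]
sales = [10.0, 12.0, 14.0, 16.0, 18.0]
intercept, slope = fit_line(months, sales)
forecast_month_6 = intercept + slope * 6

# Classification: code the two classes as 0 and 1, fit the same model,
# and predict class 1 whenever the fitted value reaches 0.5.
hours_studied = [1, 2, 3, 6, 7, 8]
passed = [0, 0, 0, 1, 1, 1]
c_intercept, c_slope = fit_line(hours_studied, passed)


def predict_pass(hours):
    """Return 1 if the fitted line predicts a pass, else 0."""
    return int(c_intercept + c_slope * hours >= 0.5)


print(forecast_month_6, predict_pass(2), predict_pass(7))
```

Fitting a line to 0/1 labels is crude compared with, say, logistic regression, and seeing where it falls short is exactly the kind of understanding that motivates the more sophisticated tools.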
Data and Cloud Engineering: When Data Applications Get Real
Data engineers, like all engineers, are called upon when things “get real”. That is, when models have to be implemented on a large scale, with massive amounts of data, and run in real time.
In order to do this, we have “the cloud”.
The cloud is this amazing technical and commercial innovation that enables the possibility of renting out massive server infrastructure “by the second” (or is it by the millisecond already?). This has created so many possibilities that I can’t even get started. Interacting with Data Science applications is just one of these new possibilities.
- Databases: Learn about how large-scale data is stored and accessed. The main paradigm is SQL, and there is also the newer NoSQL.
- Virtualization: This is what makes it possible to run code on the cloud. You will need to know about virtual machines, Containers, Kubernetes, and Microservices.
- Big Data: When the scale of the data you are working with becomes really, really large, you will need to get familiar with tools like Hadoop, Spark, and Google’s BigQuery.
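For a first taste of the SQL paradigm mentioned in the list above, here is a minimal sketch using the sqlite3 module from Python's standard library with an in-memory database. The table name, columns, and rows are invented for illustration; production systems would typically use a server-based database, but the query language looks much the same.

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table of customer orders (invented data).
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 30.0), ("bob", 12.5), ("alice", 20.0)],
)

# Aggregate: total spent per customer, largest first.
rows = conn.execute(
    "SELECT customer, SUM(amount) AS total "
    "FROM orders GROUP BY customer ORDER BY total DESC"
).fetchall()
conn.close()

for customer, total in rows:
    print(customer, total)
```

The GROUP BY query is the SQL counterpart of the summary tables a data analyst builds all the time, only computed where the data is stored.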
The Importance of Domain Expertise in Data Science
Arguably, domain expertise in areas like business, finance, or science and engineering, combined with solid data skills, is far preferable to purely data-handling knowledge, no matter how sophisticated.
With domain expertise you can:
- Understand the big picture of what the data is about,
- Understand the business goal of working with this particular data,
- Ask the right questions in order to determine the problem to solve,
- Communicate effectively with non-technical peers in your industry,
- Determine a relevant criterion to measure the success of a model.
These matters are undoubtedly crucial when working for an organization.
Also, we should consider the level of maturity of the field of Data Science. Today the field is still new, but there are many ongoing efforts to improve the infrastructure and automate many software development tasks.
In a few years, much of the technical heavy-lifting might well disappear. With domain knowledge, on the other hand, you will be able to target the right business problem to solve with the available data-science techniques, whatever those might be. Demand for such skill will never go away.
So What is the Mathematics of Data Science?
As a mathematician myself, it’d be rather odd not to mention the mathematics involved in this post.
However, frankly, I don’t think math should be the main concern of someone who wants to enter the field. As discussed in the previous sections, acquiring so-called “data-skills” (coding and statistics), and combining them with domain-specific knowledge, looks far more valuable to me than trying to focus on the theoretical heavy-lifting. This is especially so as the field matures, and more automatic tools are available for an increasing number of practitioners.
On the other hand, if you already have some background in math, and are looking just to brush up on what’s most useful, that’s a whole different story.
- Did I mention Statistics?
- Linear Algebra, as it is the basis of modern practical computing. Least squares, dimensionality reduction, collinearity, and more, all can be understood in terms of Linear Algebra.
- Basic Calculus. Data Science doesn’t actually require much calculus, other than as a prerequisite to probability and statistical theory.
How to Learn Data Science with Free Content?
There are many ways of learning the data skills we have been discussing. Options vary wildly: there are undergraduate degrees, master’s, online programs, and even lots of free material out there.
However, it’s easy to feel overwhelmed and lost with so many options.
Before moving on, let me just share two free resources on YouTube for learning beginner data-science material:
Let’s now move on to discuss the elephant in the room, when it comes to online learning:
Why Pay for Online Courses with So Much Free Content?
In the world we live in today, most of the existing information and knowledge can be found online for free, in places like YouTube or blogs (like this one!). So, why could it be worth paying anything for content you can get for free? There are actually a few reasons:
- Save time. While searching online for free content, it is most likely that you will be exposed to material which is seriously unorganized, outdated, or even wrong. All of these problems combined can waste a lot of your time. You might well be in a position where you should consider spending some money in order to save some of this trouble. Curated, streamlined, systematic, and credible curriculums do offer great value.
- Stay Focused and Avoid Procrastination. If you have paid some money you are also less likely to procrastinate, as you will be more inclined to try to get the most out of your money. Following a program should help you stay focused, self-motivated, and on-track.
- Learn Fundamentals, not Just Hot Topics. The best programs will cover fundamentals, not only hot topics. Importantly, I strongly advise learning statistics.
- Get Feedback. Full programs will provide feedback. With online tools, this usually comes in the form of automatically-graded coding assignments or peer-grading.
- Build a project portfolio. Good programs will encourage and help you build a project portfolio.
- Get Certified. This will help you start marketing your skills and inspire trust.
- Career Services. Some programs will even go as far as providing a bundle of certifications and employment services, such as resume writing and interview coaching. Importantly, there are actual people helping you out here, personally. And not just with the technical aspects, but with soft skills.
- Start Networking. Meet other people and start collaborating.
What Massive Online Programs Lack
Don’t get me wrong on this one. I don’t think there is any fundamental problem with online learning. I have been an online teacher myself, and I’m absolutely enthusiastic about where digital materials might lead us.
I think, however, we have to acknowledge that the massive scale that commercial online learning companies are striving for comes with certain unavoidable limitations, when compared with the more personal, one-on-one experiences we have in a traditional classroom.
- Accountability. If you have a coach or teacher you meet at least once a week, with whom you have a personal interaction, you become accountable to them. Instead, one of the key advantages of online platforms is that they let you “learn at your own pace”. This also means that you are only accountable to yourself.
- Personalized and Credible Feedback. Automated grading and peer-grading are far from an ideal replacement for a real teacher personally looking at and grading your work, and pointing out whatever difficulties you might have encountered.
Starting a career in the Data Industry can be quite competitive. But as data usage by companies continues to grow, so will the demand for data-savvy professionals.
So whether you choose a DIY approach based on free online material, enroll in online training programs, or even go for a full University degree, I hope the experience will be rewarding for you!
Of course, if you want to move past the intermediate level and into more advanced roles within the Data Industry, you’re probably going to switch back to a DIY approach and read books, and/or consult free online content in blogs like this one. For example, if you are interested in Julia, an up-and-coming programming language for Scientific and Statistical Computing, see my tutorial series.