Learning Good Programming Practices for Scientific Code
Learning best programming practices is important and difficult. Without formal training in computer science, it is also easy to get lost in the process.
Researchers spend an increasing amount of time writing, testing, debugging, installing, and maintaining software. This can be very time-consuming, to the extent that these software-related tasks can easily become a dominant factor in how long it takes to produce a new result.
Interestingly, there is a whole body of knowledge called software engineering, which, in a few words, deals with how to organize code and software projects in order to maximize the productivity of software developers.
Increasing our productivity in software development is quite desirable, but it also requires an initial time investment.
Publication-Driven-Development: Writing Horrible Code
When doing research, the typical programming workflow is to iteratively design, implement, and refine a sophisticated set of algorithms, which, most of the time, evolves into a horrible pile of undocumented hacks.
After some iterations of that process, we get the results we need for a publication. In the end, the software does what it was meant to do (produce a publication) and nothing more.
This process comes at a practical cost: adding new features or sharing some code with others has become impossible. Therefore, when beginning a new project, we almost have to start from scratch. Each time.
At some point, when we search our old code for something we can rely on and find nothing worth using, we wish we had written reusable code instead of disposable code: reliable software for future use, instead of a big pile of rubbish that can only serve the purpose of producing that one publication.
I know that from first-hand experience. As an applied mathematician, writing sophisticated algorithms has been my job for more than a decade, but I didn’t have any formal education in software engineering. Much of what I know today about software development stems from my own experiences and mistakes, as well as having read several great books.
A balance must be found, of course, between writing good code and writing code as fast as possible.
To be honest, in some cases writing disposable code as fast as possible, just to get a paper published, is a perfectly fine plan. However, there’s no reason not to follow at least some of the best practices, even in this case. This would be the approach of developing “barely sufficient” good practices.
Still, I believe one concept is important here: knowledge can sometimes work like compound interest. In particular, making a small effort today can lead to a big gain tomorrow, while saving some time by cutting corners today will probably come back to bite us at a greater cost later.
How Can a Researcher Learn Basic Software Engineering?
OK, I believe we have made the case for the desirability of learning good programming practices. But is it possible to learn software engineering as a researcher, or are we condemned to be bad programmers?
I think that software engineering is just like any new knowledge we want to incorporate. And as researchers, we have this ability to incorporate new knowledge, right? We can learn software engineering over time, from talking to colleagues, from reading books and papers, and most importantly by learning from our own experiences.
Yet, there is a major barrier to entry into software engineering as a non-CS researcher.
For a researcher, the newest tools and coolest CS trends come bundled with a few annoying characteristics:
- Much of what is written in software engineering textbooks just doesn’t apply to our case, as it focuses on commercial software developers who get requests from clients. All the advice about specifying requirements in advance tends to be useless, as a research software project’s requirements are constantly changing.
- Even if we do read some software engineering books, the material often isn’t written with non-experts in mind.
- A lot of the material seems to be highly opinionated and discussed with zealotry (e.g., paradigm wars).
Reading books about software engineering is, to some extent, trying to leverage the experience and the insight of other developers. The problem is that they might have been working in a completely different environment.
Given this situation, learning best programming practices is hard. It takes years of practice, study, and learning from past errors.
Additionally, you just can’t learn best programming practices the same way you learn a cooking recipe. Blindly following a certain mantra “because it’s what the experts do” will result in writing even more horrible code than before the “knowledge” was acquired.
When trying to learn best programming practices, the know-why is just as important as, or maybe even more important than, the know-how, as good practices are not set in stone, but rather might depend on the situation.
What can we learn, then? Ignoring all the client/specification references in the software engineering literature, I think the typical researcher should focus on the following basic areas:
- Tooling and Automation (IDEs, package managers, and version control systems).
- Workflow Tips, like test-driven development.
- Use and misuse of abstraction.
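As a rough illustration of the test-driven workflow mentioned in the list above, here is a minimal sketch in Python. The function `trapezoid` and both tests are hypothetical examples, not taken from any particular project. The idea is that the tests encode known properties of the method first (exactness for straight lines, convergence for a smooth function), and the implementation is then written to satisfy them.

```python
import math


def trapezoid(f, a, b, n=100):
    """Approximate the integral of f over [a, b] using n trapezoids."""
    if n < 1:
        raise ValueError("n must be at least 1")
    h = (b - a) / n
    # The end points carry half weight, the interior points full weight.
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h


def test_trapezoid_is_exact_for_linear_functions():
    # The trapezoid rule integrates straight lines exactly:
    # the integral of 2x + 1 over [0, 1] is 2.
    assert math.isclose(trapezoid(lambda x: 2 * x + 1, 0.0, 1.0, n=4), 2.0)


def test_trapezoid_converges_for_sine():
    # The integral of sin over [0, pi] is 2; a fine grid should get close.
    result = trapezoid(math.sin, 0.0, math.pi, n=1000)
    assert math.isclose(result, 2.0, rel_tol=1e-5)
```

A test runner such as pytest would pick up the `test_` functions automatically; they can also simply be called by hand. Either way, the tests stay around after the paper is published and keep guarding the code when it is reused.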
Unlearning bad programming practices
Certain bad programming practices might develop naturally as a consequence of using the following programming languages.
- Matlab. You need to place every externally-callable function in its own .m file (d’oh!), while the other functions within that file are visible only to each other. Everything has to be an array for performance, which can easily lead to illegible code.
- Python. Your code is a disposable script almost by definition. Let’s be honest about something: the real software developers are those who wrote the underlying libraries in C that power Python.
- R. Same as Python in relation to writing disposable code, plus the speed of native R is even worse.
- C++. Somehow encourages unnecessary abstraction, like the idea that “everything has to be a class”, or the idea that any serious project has to use the most advanced features of the language, like template metaprogramming.
- Fortran. Strict backwards compatibility has led practitioners to stick with F77, and the language has provided little incentive to adopt more modern practices. The lack of modern internet-era tooling has also led to isolated development environments where many programmers often “reinvent the wheel” to solve common tasks.
The Julia programming language
Julia, by solving the so-called two-language problem, also solves some of the most important issues mentioned above.
In Julia, you can try new ideas very quickly, and the code you write in Julia is not just a disposable script, since with little extra effort it can become “the real thing”.
Also, a nice feature of Julia is that it encourages a programming style based on composing many small functions, as opposed to writing highly abstract and massive code.
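The style of composing small functions is not tied to one language. As a minimal sketch, here it is in Python rather than Julia, with hypothetical function names chosen only for illustration. Each step does one thing and can be tested on its own, and the pipeline is just the steps applied in order.

```python
from statistics import mean, stdev


def drop_missing(values):
    """Remove entries that are None."""
    return [v for v in values if v is not None]


def standardize(values):
    """Shift and scale to zero mean and unit sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def clip(values, limit=3.0):
    """Restrict every value to the interval [-limit, limit]."""
    return [max(-limit, min(limit, v)) for v in values]


def compose(*steps):
    """Return a function that applies the given steps from left to right."""
    def pipeline(values):
        for step in steps:
            values = step(values)
        return values
    return pipeline


# The whole preprocessing stage is a composition of three small functions.
preprocess = compose(drop_missing, standardize, clip)
```

Calling `preprocess([1.0, None, 2.0, 3.0])` drops the missing entry and standardizes the rest. Swapping a step in or out means changing one argument to `compose`, not editing a monolithic script.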
Learning good programming practices without thorough training in computer science can be challenging, especially in the particular scenario faced by researchers who constantly have to meet publication or other deadlines, at which point the code must just run.
I will continue covering these topics on this website, so stay tuned!
Here are some good papers about this, by the way:
- Barely sufficient practices in scientific computing
- Good enough practices in scientific computing
- Best Practices for Scientific Computing
- Software Engineering Practices for Scientific Software Development: A Systematic Mapping Study