Researchers spend an increasing amount of time writing, testing, debugging, installing, and maintaining software. This can be very time-consuming, to the extent that these software-related tasks can easily become a dominant factor in how long it takes to produce a new result.
Interestingly, there is a whole body of knowledge called software engineering, which, in a few words, deals with how to organize code and software projects in order to maximize the productivity of software developers.
Increasing our productivity in software development is quite desirable, but it also requires an initial time investment.
Publication-Driven-Development: Writing Horrible Code
When doing research, the typical programming workflow is to iteratively design, implement, and develop a sophisticated set of algorithms, which, most of the time, evolves into a horrible pile of undocumented hacks.
After some iterations of that process, we get the results we need for a publication. In the end, the software does what it was meant to do (produce a publication) and nothing more.
This process comes at a practical cost: adding new features or sharing some code with others has become impossible. Therefore, when beginning a new project, we almost have to start from scratch. Each time.
At a certain point, when we search our old code for something we can rely on and find nothing worth using, we wish we had written reusable code instead of disposable code; that we had written reliable software for future use, instead of a big pile of rubbish that can only serve the purpose of producing that one publication.
I know that from first-hand experience. As an applied mathematician, writing sophisticated algorithms has been my job for more than a decade, but I didn’t have any formal education in software engineering. Much of what I know today about software development stems from my own experiences and mistakes, as well as having read several great books.
A balance must be found, of course, between writing good code and writing code as fast as possible.
To be honest, in some cases writing disposable code as fast as possible, just to get a paper published, is a perfectly fine plan. However, there’s no reason not to follow at least some of the best practices, even in this case. This would be the approach of developing “barely sufficient” good practices.
Still, I believe the following is an important concept: knowledge can sometimes work like compound interest. A small effort today can lead to a big gain tomorrow, while cutting corners to save time today will probably come back to bite us at a greater cost later.
How Can a Researcher Learn Basic Software Engineering?
Ok, I believe we made the case for the desirability of learning good programming practices. But is it possible to learn software engineering as a researcher, or are we condemned to be bad programmers?
I think that software engineering is just like any new knowledge we want to incorporate. And as researchers, we have this ability to incorporate new knowledge, right? We can learn software engineering over time, from talking to colleagues, from reading books and papers, and most importantly by learning from our own experiences.
Yet, there is a major barrier to entry into software engineering as a non-CS researcher.
For a researcher, the newest tools and coolest CS trends come bundled with a few annoying characteristics:
- Much of what is written in software engineering textbooks just doesn’t apply to our case, as it focuses on commercial software developers who get requests from clients. All the advice about specifying requirements in advance tends to be useless, as a research software project’s requirements are constantly changing.
- Even if we do read some software engineering books, the material often isn’t well explained for non-experts.
- A lot of the material seems to be highly opinionated and discussed with zealotry (e.g. paradigm wars).
Reading books about software engineering is, to some extent, an attempt to leverage the experience and insight of other developers. The problem is that they might have been working in a completely different environment.
Given this situation, learning best programming practices is hard. It takes years of practice, study, and learning from past errors.
Additionally, you just can’t learn best programming practices the same way you learn a cooking recipe. Blindly following a certain mantra “because it’s what the experts do” will result in writing even more horrible code than before the “knowledge” was acquired.
When trying to learn best programming practices, the know-why is just as important as the know-how, or maybe even more important, because good practices are not set in stone and may depend on the situation.
Basic Guidelines for Writing Scientific Code
What can we learn, then?
Ignoring all the client/specification references in the software engineering literature, I think the typical researcher should focus on the following basic points:
- Clarity above all. When working with complex algorithms we must pay attention to how intelligible our code is. Adding comments, choosing clear variable names, and using consistent indentation are a few basic tips.
- Use version control systems. Bugs in scientific code can be hard to track, so comparing different versions, or organizing our work with issue trackers can come in very handy.
- Use libraries. You are trying to stand on the shoulders of giants, not develop everything from scratch, right? When you need to solve a problem, spend some time looking for existing solutions.
- Create and run tests. Testing portions of our code for correctness is of paramount importance. We will gain insight into what’s working and where the errors might be: in the places where tests are failing, of course, but untested areas will also start to look suspicious once you are used to testing.
- Learn to refactor your code. Once you have tests in place, you can do the ultimate software engineering task: change how the code is written, so it becomes clearer and more reusable, without affecting its current behavior.
- Write documentation. This can seem boring or a waste of time, but it can actually be easy to do. Remember the first guideline about adding comments to your code? Well, modern tools can automatically extract those comments when they are formatted in specific ways, and help you create stunning documentation in a largely automated way. In this day and age, documentation is part of the code.
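To make the points on clarity, testing, and documentation a bit more concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the function `trapezoid_integral`, its parameters, and the test are hypothetical names invented for this example, not part of any library.

```python
import math


def trapezoid_integral(f, a, b, n=1000):
    """Approximate the integral of f over [a, b] with the composite trapezoid rule.

    Parameters
    ----------
    f : callable
        Function of one real variable.
    a, b : float
        Integration limits.
    n : int
        Number of subintervals (must be positive).

    Returns
    -------
    float
        Approximation of the integral.
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    step = (b - a) / n
    # Endpoints get weight 1/2, interior points get weight 1.
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * step)
    return total * step


def test_trapezoid_integral_sine():
    # Hypothetical test: the exact integral of sin over [0, pi] is 2.
    approx = trapezoid_integral(math.sin, 0.0, math.pi, n=1000)
    assert abs(approx - 2.0) < 1e-5


test_trapezoid_integral_sine()
```

The docstring is the kind of specially formatted comment that a documentation generator such as Sphinx can extract, and a test runner such as pytest would pick up the `test_` function automatically. With the test in place, the loop could later be refactored (for instance, vectorized) while checking that the behavior stays the same.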
Unlearning Bad Programming Practices
Certain bad programming practices might develop naturally as a consequence of using the following programming languages.
- Matlab. You need to place every externally-callable function in its own .m file (d’oh!), while the other functions within that file can only see each other. Everything has to be an array for performance, which can easily lead to illegible code.
- Python. Your code is a disposable script almost by definition. Let’s be honest about something: the real software developers are those who wrote the underlying libraries in C that power Python.
- R. Same as Python in relation to writing disposable code, plus the speed of native R is even worse.
- C++. Somehow encourages unnecessary abstraction, like the idea that “everything has to be a class”, or the idea that any serious project has to use the most advanced features of the language, like template metaprogramming.
- Fortran. Strict backwards compatibility has led practitioners to stick with F77, and the language has provided little incentive to adopt more modern practices. The lack of modern internet-era tooling has also led to isolated development environments where many programmers often “reinvent the wheel” to solve common tasks.
The Julia Programming Language
Julia, by solving the so-called two-language problem, also solves some of the most important issues mentioned above.
In Julia, you can try new ideas very quickly, and the code you write is not just a disposable script, since with a little extra effort it can become “the real thing”.
Also, a nice feature of Julia is that it encourages a programming style based on composing many small functions, as opposed to writing massive, highly abstract code.
Learning good programming practices without thorough training in computer science can be challenging, especially in the scenario researchers typically face: constantly having to meet publication or other deadlines by which the code must just run.
I will continue covering these topics on this website, so stay tuned!
Here are some good papers about this, by the way: