Performance Optimization Checklist

by Martin D. Maas, Ph.D

A quick overview of things to do when your Julia code performance is not what you expected.

Writing fast code vs writing code fast

Julia strives to achieve the speed of C/Fortran with a scripting-language syntax and workflow, which is no minor goal. However, Julia is not free of trade-offs and gotchas when it comes to combining performance and usability.

As a general rule, obtaining extra performance will require a little effort on the programmer’s side. Hopefully, if you are using Julia, that effort will be substantially lower than the effort of switching to a different language and dealing with a multi-language project.

For example, one of the basic rules for getting good performance in Julia (or, rather, for avoiding interpreted-language-like bad performance) is that we can’t simply type code into the REPL anymore (i.e. in the global scope). Instead, our code has to be contained inside a function, which requires a small effort on our side.

Additionally, when writing fast code we will have to care about memory allocation, something we usually don’t worry about in scripting languages.

Finally, in performance-critical sections of our code, the compiler should be able to know the type of each variable in advance, another thing we don’t need to think about in scripting languages.

So, we can say that even though writing Julia code is extremely easy and intuitive, writing fast Julia code can be considerably more difficult. This page will hopefully provide some guidance on how to avoid the most common pitfalls.

So, without further ado, let’s begin with the basic performance checklist.

1. Benchmark to find bottlenecks

Whenever we think our program is not fast enough, we need to find where the performance bottlenecks are. There is a built-in tool for this, the profiler, but it can be quite sophisticated and is not easy to use. The easiest way to start testing your code for performance is, naturally, to time sections of it with the @time macro (built-in) and the @btime macro (from the BenchmarkTools.jl package). In my view, the profiler starts making sense once you have a very large code-base.
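
For reference, a minimal profiling session looks something like this (myFunc here is a placeholder for whatever function you want to inspect):

using Profile

myFunc()            # run once first, so that compilation is not profiled
@profile myFunc()   # collect stack-trace samples while the function runs
Profile.print()     # print the sampled call tree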

It’s important to pay attention not just to timing, but also to memory allocation and time spent in garbage collection. For example, when running @time myFunc(), seeing something like this

64.115791 seconds (12.71 M allocations: 23.535 GiB, 62.19% gc time)

could be a sign that something is wrong. The causes for concern that we will briefly discuss in this post are: use of global variables, too many allocations, and type instabilities.
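
A minimal benchmarking setup looks like this (again with myFunc as a placeholder). Keep in mind that @time measures a single call, including compilation on the first run, while @btime runs the function many times and reports the minimum:

using BenchmarkTools

@time myFunc()    # one-shot timing; the first call includes compilation
@btime myFunc()   # repeated, more reliable timing with allocation counts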

2. Avoid using global variables (and the global scope)

In order to produce fast code, the compiler needs to be able to know the type of each variable in our algorithm, so it can avoid the overhead of handling all possible cases, and instead produce specialized code.

However, given that the Julia compiler is structured around the compilation of functions, it is impossible for it to know the type of a (non-constant) global variable referenced inside a function, since that variable’s type can change at any moment. The same situation arises when executing code in the global scope.

Consider the following example

a = 3π

function do_stuff(b)
    return a*b   # `a` is a non-constant global: its type is unknown here
end

b = rand(6,6)
c = do_stuff(b)

The function do_stuff has no information about the global variable a. In our case a is just a scalar Float64 number, but it could also be a matrix, a complex number, or anything else. As the compiler has to be able to handle all of those cases, it won’t produce specialized code.
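
A quick way to confirm this is the built-in @code_warntype macro, which flags badly-inferred variables:

@code_warntype do_stuff(b)   # shows `a` (and the result) inferred as Any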

We can fix our toy example by simply adding a as an additional parameter to our function do_stuff.

a = 3π

function do_stuff(a,b)
    return a*b   # `a` is now an argument, so the compiler knows its type
end


b = rand(6,6)
c = do_stuff(a,b)

Whenever we have a long list of problem parameters and want to apply this procedure, it can be a good idea to refactor the parameters into a struct (see the structs page here) and pass that struct around to a function.
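
As a minimal sketch of that pattern (the struct name and fields here are made up for illustration):

struct ProblemParams
    a::Float64
    tol::Float64
end

function do_stuff(params::ProblemParams, b)
    return params.a * b   # the compiler knows params.a is a Float64
end

params = ProblemParams(3π, 1e-8)
c = do_stuff(params, rand(6,6))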

3. Avoid intermediate allocations

Other than performing arithmetic operations, moving memory around can take considerable time. Importantly, @time and @btime report how much memory was allocated by a given function, and excessive memory allocation can slow down our code considerably.

There are two possible ways to avoid allocating memory:

  • Reuse existing pre-allocated memory. This requires manually pre-allocating variables outside of loops, or relying on an important tool, the @view macro (a sketch of the pre-allocation approach follows right after this list).
  • Rely on a special kind of memory called the stack. The stack is part of the memory hierarchy (see Wiki): for simplicity, let’s just say it is a small region of memory (usually about 8MB on most Linux systems) that sits really close to the processor and can thus be accessed very efficiently. In particular, Julia won’t report stack-allocated variables as allocations, as those don’t have to be managed by the garbage collector.
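
Here is a minimal sketch of the pre-allocation approach (the names fused_update!, A, B and out are made up for illustration). The output buffer is allocated once, and the in-place broadcast .= writes into it instead of creating a new array on every iteration:

function fused_update!(out, A, B)
    for _ ∈ 1:100
        out .= A .* B .+ out   # in-place fused broadcast: reuses `out`, allocates nothing
    end
    return out
end

A, B = rand(1000), rand(1000)
out = zeros(1000)              # allocated once, outside the hot loop
fused_update!(out, A, B)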

Importantly, allocations are sometimes hard to spot. For example, there can be unexpected allocations on the right-hand side of expressions that look perfectly fine.

In fact, the following examples are things I consider quite tricky performance gotchas:

f(p,x) = p*x

function apply_f(p,x1,x2,x3)
    for i ∈ 1:1000
        a1 = f(p, [x1,x2,x3])   # builds a new heap-allocated Vector on every iteration
    end
end

@btime apply_f(0.5, 0.1, 0.2, 0.3)
71.574 μs (2000 allocations: 156.25 KiB)

What is happening here is that this function allocates a small temporary array, [x1,x2,x3], on the right-hand side in every iteration, while this small modification:

using StaticArrays

function apply_f_SA(p,x1,x2,x3)
    for i ∈ 1:1000
        a1 = f(p, SA[x1,x2,x3])   # a stack-allocated SVector: no heap allocation
    end
end

is non-allocating, and consequently much faster, as shown by this test:

@btime apply_f_SA(0.5, 0.1, 0.2, 0.3)
2.679 ns (0 allocations: 0 bytes)

which, as we can see, is a dramatic speed-up over the previous version (in fact, a timing of a few nanoseconds for a 1000-iteration loop suggests that, once the allocations were gone, the compiler was able to optimize the loop away entirely).

This is tricky, but you’ll learn to be suspicious of Julia’s default dynamic array behaviour when you care about performance. Dynamic arrays are, of course, very convenient, and this represents the kind of trade-off between convenience and performance that we should expect.

A similar gotcha can happen when referencing existing arrays. By default, Julia creates copies of sub-arrays; to avoid this we can resort to the @views macro (see official docs). Consider the following example:

function moving_average(A, window)
    na = length(A) - window
    Avg = Array{eltype(A)}(undef, na)
    for i ∈ 1:na
        Avg[i] = sum(A[i:i+window]) / window   # A[i:i+window] copies the slice
    end
    return Avg
end

@btime B = moving_average(rand(1_000_000), 100)
261.356 ms (999904 allocations: 869.66 MiB)

while, if we remove the temporary allocation that occurs when doing A[i:i+window] by using the @views macro:

function moving_average_views(A, window)
    na = length(A) - window
    Avg = Array{eltype(A)}(undef, na)
    for i ∈ 1:na
        Avg[i] = sum(@views A[i:i+window]) / window   # the view references A without copying
    end
    return Avg
end

and repeat the benchmark:

@btime B = moving_average_views(rand(1_000_000), 100)
25.685 ms (4 allocations: 15.26 MiB)

it turns out that we have obtained a massive 10X speed-up.

4. More tips

While use of the global scope and excessive memory allocation (sometimes inadvertent) have been the major performance issues in the code I typically write, there are other things worth mentioning, which will perhaps be the topics of future posts on this website.

  • Avoid type instabilities. Type-unstable code can be as slow as an interpreted language. While this is not always the case, and small type instabilities (like a conversion from Int to Float64) are usually not a big deal, if you care about performance it’s better to stick to type-stable functions (see the first sketch after this list).
  • Aim for good threading efficiency, when possible. Nowadays, all processors are multi-core. With a simple macro, Threads.@threads, placed before a for loop, you can start gaining speed-ups (see the second sketch after this list). Check the scaling when you enable threads: if enabling 4 threads on a given computation doesn’t provide something close to a 4X speed-up, that is a sign that something is sub-optimal, such as how memory is being managed, the presence of type instabilities, or usage of the global scope. However, also remember that not all algorithms can achieve a threading efficiency of 100%. If the data flow in your algorithm is limited by memory bandwidth (i.e. when you are reading in a lot of data and performing operations of low arithmetic intensity on it), adding more processors won’t help you.
  • Consider SIMD optimizations or GPU computing. Once all of the above is done and you have a nearly-optimal multi-core CPU code, if you still need more performance you might want to consider these two advanced optimization techniques. I believe it’s important not to rush into these advanced optimizations too early, as you won’t see any improvements if the above-mentioned factors aren’t dealt with first.
  • Master the official Performance Tips. The performance tips section of the official docs is a thick piece of content that can take months to fully digest. It is a section to revisit often, especially as your code gets more complex.
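
On type stability, here is a minimal sketch (the function names are made up): the return type of the first function depends on a runtime value, which is exactly the kind of problem @code_warntype highlights.

unstable(x) = x > 0 ? 1.0 : 0     # may return a Float64 or an Int
stable(x)   = x > 0 ? 1.0 : 0.0   # always returns a Float64

@code_warntype unstable(1.0)      # flags the Union{Float64, Int64} return type

And on threading, another minimal sketch (names again made up for illustration). Each iteration writes to a distinct slot of the output array, so there are no data races:

using Base.Threads

function square_all!(out, A)
    @threads for i ∈ eachindex(A)
        out[i] = A[i]^2   # independent writes: safe to parallelize
    end
    return out
end

A = rand(1_000_000)
square_all!(similar(A), A)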

Conclusion

Writing high-performance Julia code is generally harder than writing a simple script that does the job we want. However, the gap between the simplest code we can write and highly-performing Julia code can be quite small.

A topic related to writing fast Julia code is whether we will always be able to obtain the optimal performance of a given algorithm on given hardware using Julia.

This question could be the subject of much debate, and I honestly still don’t have a definitive answer. However, as a general rule, I believe a programmer who wishes to obtain optimal performance can, in some cases, begin to miss some quite low-level features of languages like C or Fortran, such as manual memory management. While developing a very low-level, C-style dialect of Julia is possible, this doesn’t seem to be the mainstream use case for Julia, and as such, some tools in this area are still experimental (see, for example, packages like StrideArrays.jl and Bumper.jl). However, given some time, tools like these may even get merged into the language, providing the potential for optimal performance in all cases.