Julia vs Numba and Cython: Looking Beyond Microbenchmarks

We compare the performance of Julia and Numba, for a minimal benchmark that enables to see some fundamental difference between Julia and accelerated-Python.

A friendly contest

Python is a programming language with so many advantages that we don’t even need to list them. It is also commonly acknowledged that its main drawback is code performance.

Given this problem, many ways of speeding-up Python have been proposed over time. The goal of the various approaches is to inject a bit of compiled code to make Python fast, with the least possible effort on the programmer’s side.

We normally want to do this only for a performance-critical section, while keeping the advantage of the higher-level logic and the rich package ecosystem.

There are many choices in this area, like Cython, PyPy, calling Fortran/C functions, and so forth. In this post we will be mostly focusing on Numba, which at the time of writing this seems mature enough for real test cases, and seems to be performing well, at least on several microbenchmarks.

I found that running personal benchmarks is beneficial before making long-term commitments to a particular software stack. In this post I’m sharing some preliminary conclusions, based on the discussion I found online, and the test I came up with, to test some of the ideas under discussion around what I think is the main point.

Importantly, my goal with this kind of benchmark is not about taking sides in so-called language wars, but to gain insight about two amazing programming languages and their package ecosystems.

Much of the discussion in this article is based on the following references (which are quite lengthy, I have to say):

How Should we Compare Julia vs Accelerated-Python?

A fundamental difference between Julia and Python, is that while in Julia code is put together during JIT compilation, in Python packages are put together during interpretation.

This difference is crucial, because while the Julia compiler gets a shot at performing global optimizations (i.e. interprocedural), most forms of optimized Python (like Numba or Cython) will only get a chance to optimize a small snippet of code.

But wait! Isn’t that what we wanted in the first place, simply to optimize a small performance-critical section of code?

Well… no.

We also wanted to keep the advantages of an easy-to-use high-level logic, and a rich package ecosystem, while we call this optimized fragment of code.

So, let’s see if we do get to keep all that, with the various flavors of accelerated Python.

Code Simplicity and Performance Optimizations

What about code simplicity?

On the first hand, Cython requires that the programmer write substantial ‘glue’ code,

Also, naive Cython won’t optimize much, you have to specifically tell it where and what to optimize (it needs to be used wisely).

I have to admit that I like the Numba approach to fast Python, as the compromise looks minimal.

We still have to note that even basic Python structures such as dicts are not supported in Numba, which could certainly be a problem. However, the approach reminds me a lot to calling C or Fortran from Python, something I have been attempting for some time, and which should be perfectly valid for segments of number-crunching code within a Python application. (But we’ll see).

As for the long-awaited PyPy, which could be a game-changer if it manages to scale up in supported features, see this presentation 82 by Armin Ronacher (creator of Flask and PyPy contributor) on why it’s so hard to optimize Python because of the very design of the language.

Julia vs Numba: A Minimalistic Benchmark

Benchmarking Globally vs Locally Compiler-Optimized Code

I strongly adhere to the so-called “principle of minimum generality” in many areas of my life.

That means the following: we should be able to come up with the most simple relevant case to discuss an idea.

On one hand, microbenchmarks can be often too simple. For example, I have seen some 3 or 4 liners comparing Julia with Numba, that I don’t think are at all relevant. By making decisions based on those simple microbenchmarks, we are taking the risk of being penny-wise and dollar-dumb.

On the other hand, large scale, realistic applications have too many moving parts to be actually checked in detail.

We need to select a sufficiently realistic benchmark case where some fundamental difference between Julia and Numba-optimized Python is actually put to test. Which I believe requires us to aim for testing the degree of composability and performance of each language.

The Benchmark

Without further ado, let’s get into the nitty-gritty of making a (hopefully) meaningful comparison.

I started with the idea of composing a quadrature rule with a non-linear solver, but I actually ended up building yet a simpler test.

Let’s evaluate the following function with a quadrature rule:

$$ g(p) = \int_{-1}^1 e^{px} dx $$

We would normally do this with a simple one-linear that calls a specialized quadrature library. But to make sure that we are testing the same exact code in each language, we’ll use a 7-liner trapezoidal rule instead.

So let’s pretend this trapezoidal rule code was provided by an expert and is available trough a package.

using LoopVectorization

""" Integrate a function f with the Trapezoidal Rule"""
function quad_trap(f,a,b,N) 
    h = (b-a)/N
    int = h * ( f(a) + f(b) ) / 2
    @turbo for k=1:N-1
        xk = (b-a) * k/N + a
        int = int + h*f(xk)
    end
    return int
end

And now this is the one-liner we would normally write:

g(p) = quad_trap( x -> exp(p*x) - 10, -1, 1, 10000) 
@time g(1)

We would normally do more than just evaluate a function at a single point, but let’s keep things simple. This function g(p) is the basic building block we would need to build any sort of higher-level logic around the optimized quadrature rule.

Let’s do this in pure Python, and Numba now. For conciseness I only share the Numba code here (the pure Python can be obtained by removing the references to @nb.njit)

import math
import numba as nb
import time

@nb.njit   
def quad_trap(f,a,b,N):
    h = (b-a)/N
    integral = h * ( f(a) + f(b) ) / 2
    for k in range(N):
        xk = (b-a) * k/N + a
        integral = integral + h*f(xk)
    return integral

Now, in order to call our Numba-optimized function, we need to provide it with another JIT-compiled function, as described in the documentation.

Let’s do it in two different ways: a completely fixed function of x, and a parametrized function of x and p, which is the one we actually intend to use. I even added some help in case the Numba type-inference didn’t the types right.

@nb.njit(nb.float64(nb.float64))
def func(x):
    return math.exp(x) - 10

def g(p): 
    @nb.njit(nb.float64(nb.float64))
    def integrand(x):
        return math.exp(p*x) - 10
    return quad_trap(integrand, -1, 1, 10000) 

# warm-up JIT
q1 = quad_trap(func,-1,1,10000)
start_time = time.time()
q1 = quad_trap(func,-1,1,10000)
print("Quadrature--- %s seconds ---" % (time.time() - start_time))

# warm-up JIT
a = f(1)
start_time = time.time()
r = f(1)
print("Eval f(1)--- %s seconds ---" % (time.time() - start_time))

Results

Language Quadrature Eval g(1)
Pure Python 0.00276 0.00315
Numba-Python 0.00027 0.11
Numba-fastmath 0.00015 0.11
Julia 0.00005 0.00005

I believe there are a number of insights to gain from this comparison.

Numba is 10X faster than pure Python for the micro-benchmark of a simple quadrature rule. However, Julia is still more than 3X faster than Numba, in part due to SIMD optimizations enabled by LoopVectorization.jl. But most importantly, Numba breaks down when we add a minimal higher-level construction.

The reason of this breakdown was explained to me by Jérôme Richard as an answer to my question about this benchmark in StackOverflow.

What’s happening under the hood is that Numba is recompiling the integrand function each time we call it. Moreover, what we are trying to do here is passing an inner function as an argument, which is unsupported by the present version of Numba according to the documentation.

This means that Numba is not really applicable to this use case (say, of using a numba-optimized quadrature library and then adding higher-level logic on top of it).

To fix it, we could modify the quadrature code ourselves (which pretty much negates the whole point of this benchmark), or resort to Numpy, which I benchmarked in the past in this post

Final Remarks

The code of this simple tests is available in my Github repo. I welcome comments in the issue section of the repo, or emails.

Of course, this conclusions apply to the present version of Numba, which at the time of writing this is v0.50.

Composability, Code Reuse, and Package Ecosystems

My finals thoughts are on composability, code reuse, and package ecosystems.

In the present days, a package ecosystem is probably one of the most important factors (if not the single one) behind our choices of one programming language over an other.

At the time of writing this, there are around 200.000 Python packages and only about 6.000 Julia packages, so it is natural that most people will be using Python for the time being.

While there is a great Python package ecosystem, there is no Numba-specific package ecosystem under development.

On the other hand, Julia packages compose nicely and very efficiently.

So my guess is that you should expect Python to be slow. Speeding it up seems to involve a substantial compromise, as we don’t get to keep all the nice things about Python. In escence, someone needs to rewrite more and more of Python into C for performance.

So I believe that, over time, the Julia package ecosystem will develop to offer much better performance than the Python ecosystem in many areas. It will also develop much faster per invested unit of time. This is why, in its present state, Julia strikes me as the best choice for people developing new things which do not require too much leverage on existing projects.