NumPy Performance

In my Python code I wanted to compute the Local Binary Patterns of 165 images each sized 100x130 pixels. The algorithm is pretty simple, which should be a very easy task...

The first naive version I came up with looked like this:

def rlbp_slow(X):
    X = np.asarray(X)
    ysize, xsize = X.shape
    result = np.zeros((ysize-2,xsize-2), dtype=np.uint8)
    for y in range(1, ysize-1):
        for x in range(1, xsize-1):
                center = X[y,x]
                code = 0
                code |= (X[y-1,x-1] >= center) << 7
                code |= (X[y-1,x] >= center) << 6
                code |= (X[y-1,x+1] >= center) << 5
                code |= (X[y,x+1] >= center) << 4
                code |= (X[y+1,x+1] >= center) << 3
                code |= (X[y+1,x] >= center) << 2
                code |= (X[y+1,x-1] >= center) << 1
                code |= (X[y,x-1] >= center) << 0
                result[y-1,x-1] = code
    return result

You already see the problem: The for loop over the pixels will be unbelievably slow in Python, but hey I am optimistic. There's a Multi-core processor inside, just throw some more cycles at it -- I don't mind the milliseconds! How long could it take?

# [...]
    from time import time
    t0 = time()
    for i in range(0,X.shape[1]):
        rlbp_slow(X[:,i].reshape(self.height, self.width))
    self.logger.debug("time to compute patterns took=%.6f seconds" % (time()-t0))
# [...]

Forever. 185 seconds. 3 minutes. Lightyears away from realtime:

2011-09-30 14:12:39,811 - facerec.models.LBP - DEBUG - time to compute patterns took=185.088027 seconds

Why? This code is executed in Python and due to checking array bounds (and so on) for each call it get's unbelievably slow; I should really port this to C. But if you think another second about it you will probably recognize that you can perform this calculation by only using X:

Take the inner matrix of X as the center values and compare it with the equally sized matrix at a (-1,-1) offset. Multiply the result with 2^7 and you are done for the first neighbor. Now take the matrix at a (-1,0) offset, multiply with 2^6 and add it... You see where this leads to.

Is this useful in NumPy? Yes it is! Because now the computations are performed in C.

Let's see what it looks like:

def rlbp_fast(X):
    X = (1<<7) * (X[0:-2,0:-2] >= X[1:-1,1:-1]) \
        + (1<<6) * (X[0:-2,1:-1] >= X[1:-1,1:-1]) \
        + (1<<5) * (X[0:-2,2:] >= X[1:-1,1:-1]) \
        + (1<<4) * (X[1:-1,2:] >= X[1:-1,1:-1]) \
        + (1<<3) * (X[2:,2:] >= X[1:-1,1:-1]) \
        + (1<<2) * (X[2:,1:-1] >= X[1:-1,1:-1]) \
        + (1<<1) * (X[2:,:-2] >= X[1:-1,1:-1]) \
        + (1<<0) * (X[1:-1,:-2] >= X[1:-1,1:-1])
    return X

This code yields the same result and needs 0.27 seconds to complete:

2011-09-30 14:29:40,337 - facerec.models.LBP - DEBUG - time to compute patterns took=0.269666 seconds

By vectorizing the code we can use the Local Binary Patterns without going to C and stay with our familiar NumPy syntax. While it was easy to vectorize the code in this example, it may not be trivial for complicated algorithms. But we can still do faster with the tools NumPy and SciPy have, and don't need to vectorize the code.

Beware! Things get a little bit tough to debug from here on. By using scipy.weave you can either use weave.blitz or weave.inline to weave C/C++ code into your program. Our code is rather easy for blitz, because it only has to translate our NumPy ranges into blitz::Range objects:

from scipy import weave

def rlbp_fast_blitz(X):
    X = np.asarray(X) # blitz otherwise doesn't know the type
    Y = np.zeros(((X.shape[0]-2), (X.shape[1]-2)), dtype=np.uint8) # and we don't want to override X
    arg_dict={'X':X, 'Y':Y} # variables C++ has to know about
    expr = "Y = (1<<7) * (X[0:-2,0:-2] >= X[1:-1,1:-1]) \
        + (1<<6) * (X[0:-2,1:-1] >= X[1:-1,1:-1])   \
        + (1<<5) * (X[0:-2,2:] >= X[1:-1,1:-1]) \
        + (1<<4) * (X[1:-1,2:] >= X[1:-1,1:-1]) \
        + (1<<3) * (X[2:,2:] >= X[1:-1,1:-1]) \
        + (1<<2) * (X[2:,1:-1] >= X[1:-1,1:-1]) \
        + (1<<1) * (X[2:,:-2] >= X[1:-1,1:-1]) \
        + (1<<0) * (X[1:-1,:-2] >= X[1:-1,1:-1])"
    weave.blitz(expr, arg_dict, check_size=0)
    return Y

When you run the program for the first time it gets translated and compiled. The generated C++ code looks familiar (I am not pasting the whole thing here):

// ...
Y=(1<<7)*(X(blitz::Range(0,NX(0)-2-1),blitz::Range(0,NX(1)-2-1))>=X(blitz::Range(1,NX(0)-1-1),blitz::Range(1,NX(1)-1-1)))+(1<<6)*[...]
// ...

Translating and compiling takes some time, so the first call now takes 4.2 seconds to execute:

2011-09-30 14:49:17,483 - facerec.models.LBP - DEBUG - time to compute patterns took=4.170938 seconds

But the second call only takes 0.05 seconds to finish:

2011-09-30 14:50:19,677 - facerec.models.LBP - DEBUG - time to compute patterns took=0.052150 seconds

Sometimes the blitz syntax is not expressive enough, so you want to fall back to standard C/C++. You can write inline C++ with the weave.inline module. The code is first embedded into a C++ file (with all the macros) and is then compiled.

Let's see how my very naive attempt performs in C++:

def rlbp_fast_inline(X):
    X = np.asarray(X)
    Y = np.zeros(((X.shape[0]-2), (X.shape[1]-2)), dtype=np.uint8) # allocate some space
    expr = """
    int i,j;
    for(i=1;i<NX[0]-1;i++) {
        for(j=1;j<NX[1]-1;j++) {
            int center = X2(i,j);
            int code = 0;
            code |= (X2(i-1,j-1) >= center) << 7;
            code |= (X2(i-1,j) >= center) << 6;
            code |= (X2(i-1,j+1) >= center) << 5;
            code |= (X2(i,j+1) >= center) << 4;
            code |= (X2(i+1,j+1) >= center) << 3;
            code |= (X2(i+1,j) >= center) << 2;
            code |= (X2(i+1,j-1) >= center) << 1;
            code |= (X2(i,j-1) >= center) << 0;
            Y2(i-1,j-1) = code;
        }
    }
    """
    weave.inline(expr, ['X', 'Y'])
    return Y

I'll explain this code a bit. Don't think you've missed something! The NX, X2, Y2 variables are macros created by scipy.weave to make your life easier. NX has the information about the shape of X; X2 is a macro that allows 2-dimensional indexing.

Now in C++ my naive attempt only takes 0.07 seconds:

2011-09-30 15:28:17,026 - facerec.models.LBP - DEBUG - time to compute patterns took=0.069170 seconds

This is just a little slower compared to the code generated by the weave.blitz module.

Finally make sure that all functions calculate the same (you should come up with something more sophisticated):

import numpy as np

X = np.asarray(np.random.rand(100,100)*255, dtype=np.uint8)
ts = np.sum(rlbp_slow(X)) \
    == np.sum(rlbp_fast(X)) \
    == np.sum(rlbp_fast_blitz(X)) \
    == np.sum(rlbp_fast_inline(X))

if ts:
    print "Test succeeded."
else:
    print "Test failed."

Do all functions calculate the same?

philipp@mango:~/github$ python rlbp.py
Test succeeded.

They do. Hooray!

Conclusion

So you saw that we could speed up our code from 185 seconds to 0.05 seconds. That's one of the reasons why C, C++ and Fortran aren't dead, because they are blazingly fast at some tasks. It's great that NumPy allows to inline C++ code that easy. For more complex tasks you should research for tools like Cython, because it's probably easier (and better supported) to interface with external C/C++ code from Python -- at least the documentation suggests it.