Friday, April 27, 2012

Rpython to LLVM - Part2

In the last post we saw that Rpython-to-LLVM can be about 200X faster than Python in a tight loop. What happens when the loop gets more complicated? This next test introduces a Vector class, and using a new decorator, the LLVM backend can translate instances of this class into the SSE-optimized LLVM vector type.
@rpy.vector( type='float32', length=4 )
class Vector(object):
    def __init__(self, x=0.0, y=0.0, z=0.0):
        self.x = x
        self.y = y
        self.z = z

    def __getitem__(self, index):
        r = 0.0
        if index == 0: r = self.x
        elif index == 1: r = self.y
        elif index == 2: r = self.z
        return r

    def __setitem__(self, index, value):
        if index == 0: self.x = value
        elif index == 1: self.y = value
        elif index == 2: self.z = value

    def __add__(self, other):
        x = self.x + other.x
        y = self.y + other.y
        z = self.z + other.z
        return Vector(x, y, z)

The new decorator is "rpy.vector( type, length )"; for best SSE performance use type float32 with length 4, even if you only use 3 components, since a 4 x float32 vector fills a 128-bit SSE register exactly.

Test Function:

def test(x1, y1, z1, x2, y2, z2):
    a = Vector(x1, y1, z1)
    b = Vector(x2, y2, z2)
    i = 0
    c = 0.0
    while i < 16000000:
        v = a + b
        c += v[0] + v[1] + v[2]
        i += 1
    return c

Test Results:

  • Python2 = 51 seconds
  • Rpython-to-LLVM = 0.019 seconds
How could LLVM be roughly 2,680X faster than standard Python? It turns out that in this case LLVM notices that "a + b" and the component sum never change inside the loop, moves those operations into the function entry block, and reduces the loop body to a single float add plus the counter (see the optimized LLVM ASM below).


define float @test(float %x1_0, float %y1_0, float %z1_0, float %x2_0, float %y2_0, float %z2_0) {
entry:
  %0 = insertelement <4 x float> undef, float %x1_0, i32 0
  %1 = insertelement <4 x float> %0, float %y1_0, i32 1
  %2 = insertelement <4 x float> %1, float %z1_0, i32 2
  %3 = insertelement <4 x float> undef, float %x2_0, i32 0
  %4 = insertelement <4 x float> %3, float %y2_0, i32 1
  %5 = insertelement <4 x float> %4, float %z2_0, i32 2
  %vecadd = fadd <4 x float> %2, %5               ; a + b, hoisted out of the loop
  %element = extractelement <4 x float> %vecadd, i32 0
  %element3 = extractelement <4 x float> %vecadd, i32 1
  %v5 = fadd float %element, %element3
  %element4 = extractelement <4 x float> %vecadd, i32 2
  %v7 = fadd float %v5, %element4                 ; v[0] + v[1] + v[2], also hoisted
  br label %while_loop

while_loop:
  %st_c_0.0 = phi float [ 0.000000e+00, %entry ], [ %v8, %while_loop.while_loop_crit_edge ]
  %st_i_0.0 = phi i32 [ 0, %entry ], [ %v9, %while_loop.while_loop_crit_edge ]
  %v8 = fadd float %st_c_0.0, %v7                 ; the loop body: one add
  %v9 = add i32 %st_i_0.0, 1                      ; and the counter
  %v10 = icmp ult i32 %v9, 16000000
  br i1 %v10, label %while_loop.while_loop_crit_edge, label %else

while_loop.while_loop_crit_edge:
  br label %while_loop

else:
  %v8.lcssa = phi float [ %v8, %while_loop ]
  ret float %v8.lcssa
}

Part2: Escaping the GIL

llvm-py contains an example that shows you how to bypass the LLVM Execution Engine and instead call your compiled function via ctypes. The advantage of ctypes over the Execution Engine is that ctypes releases the GIL during the call, which allows your Python threads to run in parallel. The next test simply calls the same function from four threads at the same time.
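The calling pattern can be sketched like this. With a real compiled function you would wrap its raw address (as returned by the Execution Engine) in a ctypes.CFUNCTYPE; here libm's sqrt stands in for the compiled function so the sketch runs anywhere, and the worker function and names are hypothetical:

```python
import ctypes
import ctypes.util
import threading

# With a compiled LLVM function you would instead do something like:
#   fn = ctypes.CFUNCTYPE(ctypes.c_float, *[ctypes.c_float] * 6)(addr)
# where addr is the raw pointer to the JIT-compiled code.
libm = ctypes.CDLL(ctypes.util.find_library("m") or None)
libm.sqrt.restype = ctypes.c_double
libm.sqrt.argtypes = [ctypes.c_double]

def worker(results, i):
    # ctypes drops the GIL around each foreign call, so these
    # threads can actually run in parallel on a multi-core CPU
    acc = 0.0
    for _ in range(1000):
        acc += libm.sqrt(2.0)
    results[i] = acc

results = [0.0] * 4
threads = [threading.Thread(target=worker, args=(results, i)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The Execution Engine's run_function, by contrast, is entered while holding the GIL, so four threads calling it end up serialized.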

Test 4 Threads:

  • LLVM Execution Engine = 0.086 seconds
  • Ctypes = 0.025 seconds
As we can see in this test with 4 threads, ctypes is about 3.4X faster on a quad-core CPU. Note that another way to escape the GIL is the multiprocessing module; there are pros and cons to both processes and threads. Rpythonic now uses ctypes by default to call the compiled LLVM functions, so it's up to you to decide whether to take advantage of threads or not.


  1. This is really cool!

    Are you aware of numba? (see the numba-users Google Group)

    Even though this is rpython to llvm rather than numpy-aware python to llvm, this is a pretty big overlap...

    1. I recently found Numba too, really cool project.