Optimizing the Lua/C++ Interoperability

May

23, 2017

by mrdooz


Tech

Introduction

The Roblox engine is written in a combination of C++ and Lua, with the code that performs computationally intensive operations written in optimized C++, while game logic and scripts are written in Lua, for ease of development. For this model to be effective, it requires that the transitions between Lua and C++ be as fast as possible, as any time spent in this no man’s land is essentially just wasted milliseconds.

Over the past couple of months, we’ve been rolling out various improvements to this part of the system. One part specifically—the actual invocation of C++ methods from Lua—was particularly interesting, as it led to considerable speed improvements and required digging around in the guts of Lua to understand how things worked under the hood.

We ended up modifying the Lua VM itself, but before we get to that, we need to lay some groundwork.

Compilers, VM, and bytecode

When Lua source code is compiled, it’s compiled into Lua bytecode, which the Lua VM then runs. Lua bytecode has around 35 instructions in total, for things like reading/writing tables, calling functions, performing binary operations, jumps and conditionals, and so on. The Lua VM is register-based, as opposed to being stack-based like many other VMs, so part of what the compiler does when it generates bytecode is determine which registers each instruction should use.

Each instruction has the form “OP_CODE A B,” or “OP_CODE  A B C,” where “OP_CODE” is the opcode (for example, CALL for calling a function) and A/B/C are the opcode arguments. The arguments (or registers) aren’t actual values. Instead, they are indices that point into one of two tables: the constant table (written Kst(..)) or the register table (written R(..)).

Interlude

For a detailed description of Lua bytecode, see “A No-Frills introduction to Lua 5.1 VM Instructions.” It’s a lot more exciting than it sounds; I promise!

To give you a feel for what Lua bytecode looks like, we’re going to go over some simple programs first and then progress to some more relevant examples.

Interlude

Using the Chunkspy utility, we can disassemble Lua bytecode into Lua assembly and get a listing of the code, as well as the constant table, so we can essentially see what bytecode gets generated for any given Lua source code.

Basic bytecode examples

A simple program like “x = 10” compiles into:

.const "x"       ;  0

.const 10        ;  1

[1]  loadk                        0    1        ;   10 

[2]  setglobal                 0   0       ;   x

 

The first two lines show the constant table (with the string value “x” in slot 0 and the integer value 10 in slot 1), and the following two lines are the disassembled opcodes.

[Line 1] Looking up the LOADK opcode in “No Frills,” we see that it has the form “LOADK A Bx — R(A) := Kst(Bx).” So, LOADK has two arguments (registers A and B) and its operation is to assign the value found in the constant table at the slot given by the second register, Kst(Bx), to the register table in the slot given by the first argument, R(A). “Bx” just means that because the opcode only has two arguments, the B register is extended and assigned more bits.

[Line 2] SETGLOBAL has the form “SETGLOBAL A Bx — Gbl[Kst(Bx)] := R(A).” It assigns a value to the global table using the key given by the constant table at the slot of the second argument. Because the second argument is 0 and the value of the constant table at 0 is “x,” it writes something to the global table using the key “x.” Whatever is in the register table at the slot given by the first argument is what’s being written, which the previous instruction loaded with the value 10.

Let’s look at a slightly more complicated example, “x = 10; y = x.” I’ll leave the manual execution of the code as an exercise for the reader. 🙂

.const    "x"  ;    0

.const    10  ;    1

.const     "y"   ;    2

[1]  loadk          0   1        ;   10

[2]  setglobal   0   0       ;   x

[3]  getglobal   0   0       ;   x

[4]  setglobal   0   2       ;   y

 

Bytecode for function calls

Let’s look at the code generated for “foo(10):”

.const "foo"      ; 0

.const 10       ;   1

[1]  getglobal        0    0                 ;  foo   //  R(A)  :=  Gbl[Kst(Bx)]

[2]  loadk              1     1                  ;  10   //  R(A)  :=  Kst(Bx)

[3]  call                 0     2      1

 
To execute function calls, the function must be loaded into the first register and arguments into subsequent ones. The semantics for “CALL A B C” are such that A holds the function, B is the number of arguments (actually, it’s the number of arguments +1, due to the way “…” is implemented), and C is the number of return values (again, it’s the number of return values +1, to handle multiple return values).

We’re familiar with the first two lines; they load a value into register table slot 0 and the value 10 into register table slot 1. The third line is what performs the function call, using the value in register A (register table slot 0, which was loaded with “foo”), with B specifying the number of arguments, and C the number of return values (remember, both B and C should have 1 added to their values). Before the function can be called, the VM also verifies that the value in R(A) is in fact callable.

Interlude — Metafunctions, metatables, and userdata

Lua has  a mechanism that allows users to extend the functionality of tables by associating a metatable with an existing table. The metatable contains fallback methods that are invoked if a certain method or operation can’t be performed on the main table (see https://www.lua.org/pil/13.html for a thorough description).

For our purposes, the most relevant entries in the metatable are the “__index” and “__call” fields. __index is used when looking up an element in a table, so the code “local x = my_table[10]” would first call the __index method on my_table. If that failed, it would instead try to call __index on my_table’s metatable. __call is similarly used when you try to treat something as a function and call it “local x = foo(42),” for example

In order for Lua and C++ to interoperate, they need some way to be able to share functions and data. Lua facilitates this by providing a data type called UserData. UserData objects can be created in C++ land, and because they are native Lua data types, they can be adorned with metatables that allows Lua code to interact with them as if they were just regular Lua objects.

Member function calls

Okay, back to looking at some bytecode! This next example is a bit more interesting because it shows what happens when you have code like “foo:bar(10),” which is calling the bar method on the instance foo (an instance of the class Foo).

foo:bar(10)

.const    "foo"   ;   0

.const    "bar"   ;   1

.const    10   ;   2

[1]  getglobal              0    0                ;  foo

[2]  self                       0     0    257     ;  "bar"

[3]  loadk                    2     2               ;  10

[4]  call                       0     3      1

 
The new thing here is the self instruction [line 2], which we haven’t seen before. Self has the syntax “SELF A B C — R(A) := R(B)[RK(C)]; R(A+1) := R(B),” so let’s break this down. In the register table at slot R(A), it will place the result of looking up the table in register slot R(B) using the key in slot RK(C). It will also copy whatever was in slot R(B) into slot R(A+1), but more on this later. You might notice that the value of the C register is 257. This is valid because Lua is using RK(C) to look up the value, and RK will either use the register table or the constant table, depending on the value of 9:th bit. If it’s a 1, which it is in this case, then the constant table is used; otherwise, the lookup goes to the register table (after masking out the highest bit).

Line 3 places 10 in slot 2, and finally line 4 will execute the function call.

The SELF instruction serves two purposes. First, it looks for the “bar” method on the Foo class, and places it in R(A). Secondly, because foo is an instance method and we need the instance of the class that we’re invoking the method on when doing the call, it places this instance in R(A+1). If you’re familiar with classes in Python, then you might recognize the concept: methods are usually written as “def my_method(self, arg1, arg2..),” where self is the class instance.

We’ll need to dig into this a little deeper and take a look at what happens when the foo instance is a C++ object, represented in Lua as a UserData object.

The SELF call can be seen as a table lookup, i.e. Foo[“bar”] (capital Foo represents the class Foo, as opposed to foo, the instance), and we know that lookups will use the __index method. When the foo instance was created in C++ land, a metatable was associated with the instance, and the metatable had its __index field set to a piece of C++ code that will get called when __index is invoked.

Interlude — lua_State

When C/C++ gets called from Lua, the only data that gets passed is a lua_State object. This object contains everything related to the currently running Lua thread. The most important information in the state object is the Lua stack, which contains the function arguments (accessed via the lua_tointeger/tostring etc family of functions) and is also used to return values back to Lua.

In pseudo-C++, our __index function looks something like this:

int  metaIndex(lua_State*  L)

{

           //  first  argument  is  the  userdata  object

           UserData*  userdata  =  lua_touserdata(L,  1);


           //  get  some  kind  of  descriptor,  that  contains  information

           //  about  what  methods the  class  exposes

           ClassDescriptor*  desc =  getDescriptorForUserData(userdata);


           //  See  if  the  class  has  the  requested  method

           const  char*  methodName  =  lua_tostring(L,  2);

           MemberFunctionPtr  method  = desc->hasMethod(methodName);

           if  (method)

           {

                   //  Upvalues  are  values  that  are  available  when  a  C

                   //  function  is  invoked.

                   lua_pushupvalue(L,  method);

                   lua_pushcfunction(L,  methodInvoker); 

                   return  1;

           }

           else

           {

                   lua_pushnil(L);

                   return  0;

           }

}

 
Lots of internals are glossed over, but here’s the gist of it. Given that the UserData object passed as the first argument on the Lua stack, we’re able to find a descriptor that describes the actual C++ class, and via the descriptor we can see if this class has a method with the given name. If it does, a function pointer to a method invoker is pushed on the Lua stack, and we return success.

After this call, the Lua VM will place the rest of the arguments in the register table, and then call the function we returned from the metaIndex method, which will again call into C++, and land in the invoker function:

int  methodInvoker(lua_State*  L)

{
           //  Get  the  userdata  and  the  class  descriptor

           UserData*  userdata  =  lua_touserdata(L,  1);
        
           ClassDescriptor*  desc  =  getDescriptorForUserData(userdata);
 
           Class*  instance  =  (Class*)userdata;


           //  Using  Lua's  upvalue  mechanism,  get  the  'method'

           //  that  was  stored  in  metaIndex.

           MemberFunctionPtr  method  =  lua_getupvalue(L,  1);


           //  This  is  hand-wavey,  but  we  have  some  mechanism  of  being

           //  able  to  invoke  a  member  function  via  the  class  descriptor,

           //  and  also  pop  arguments  from  the  Lua  stack,  and  push  return  values

           return  desc->invokeFunction(instance,  method,  L);
}

 
The methodInvoker also uses the ClassDescriptor, but this time it’s able to invoke the member function, and pop the correct arguments from the stack.

The home stretch!

Now that we can clearly see the two round trips from Lua to C++, we can try to figure out how to optimize this.

Our end goal is to do a single function call from Lua to C++ and have all the pieces we need on the Lua stack to be able to do method lookup and invocation at once. The problem seems to be that we have one register too few. When we call our combined lookup/invoker function, we want the Lua stack to look like [self, method name, arg1, arg2, …], but looking at SELF, we see that it uses its first slot for the result of looking up the method function and the second slot for storing the instance.

A key realization came when we looked at the way the __call metamethod worked. If an object has the __call metamethod, then before the _call function gets invoked, the object itself is pushed on the stack and all arguments are shifted up. By piggy-backing on this functionality, there was a way of getting “self” on the stack without having to explicitly store it in a register.

The second part involved getting the method name on the stack as well. For this, we had to be sneaky and alter the workings of the SELF opcode.

Remember that in the default case, SELF would try to find the member function and store this in R(A) together with the instance in R(A+1). We ended up skipping the lookup altogether and stored the actual object in R(A) and the method name in R(A+1).

If we now made sure that the object in R(A) had a __call metamethod, then we’d also end up pushing self on the stack. So, we’d have a stack that looked like [self, method name, args…] and did just a single call into C++. Perfect! Well, almost. 🙂

Before considering this done, we wanted to put some final polish on it. We didn’t want to overload the semantics of the __call metamethod, so instead we added a specific metamethod for this type of invocationcalled __namecallthat was only available on UserData objects. We also modified the SELF opcode so it only uses the new semantics if the object has a __namecall metamethod.

The second thing we did was mostly make the new path and the old path able to share code easily. Instead of having the method-name as the second argument, we pushed it to the last argument. So, after it had been used to look up the method pointer, it could easily be popped off the stack and the stack looked like it did if the function had been invoked via the old path.

Conclusion

How much of an impact does this optimization have? Well, like most things in programming, the answer is “it depends.” For functions that are heavyweightand you don’t call that oftenyou won’t see much of an improvement. But for smaller functions that you call often, the savings can be considerable.

People on the Developer Forum quickly noticed the appearance of this strange, new metamethod, and a table was presented that compared the speed of __namecall against both the old method of calling instance methods and against a workaround that developers had been using to optimize method calling:

local  part  =  workspace.Baseplate


local  count  =  1000000


local  start0  =  tick()

for  i=1,count  do

        part:IsA("BasePart")

end

local  end0  =  tick()



local  start1  =  tick()

for  i=1,count  do

       local  isa  =  part.IsA
       
       isa(part,  "BasePart")

end

local  end1  =  tick()



local  start2  =  tick()

local  isa  =  part.IsA

for  i=1,count  do
 
       isa(part,  "Basepart")

end

local  end2  =  tick()




print("namecall",  end0  -  start0)

print("index+call",  end1  -  start1)

print("call",  end2  -  start2)



>  namecall  0.49229717254639

>  index+call  0.78510332107544

>  call  0.49960780143738

 
The first loop uses the new __namecall code path, but because all the magic happens under the hood, there is no need for developers to change any existing code to benefit from the optimization.

The second loop emulates the old way of doing an instance method call; first doing a lookup to find the method and then invoking it.

And finally, the third loop shows a common optimization that developers were doing, where the method was first looked up, stored in a local variable, and then the variable was invoked.

The nice thing here is that it shows that with the __namecall optimization, it’s no longer necessary to explicitly cache instance functions, as it’s just as fast as the cached optimization, so the most straightforward code will also be the most performant.

Now that __namecall has been deployed, and we’re happy with the results we’re seeing, it’s time to turn our focus to memory usage, and see what we can do to improve the client in that area!