How Can High Performance Code Be Produced From Dynamic Languages

By Andrew Bennett 

About Andrew…

Andrew’s interest for computing started when he learnt to program on his Atari 800XL as a child. He has worked in research labs for computer vision and machine learning and has produced toolchains for embedded systems. More recently, with BJSS, he has worked on frontend and backend systems for medical and financial clients.

Outside work he is a keen runner and enjoys spending time with his family and taking their dog out for walks in the countryside.

There are a number of dynamic languages used in production systems, but the most popular are Ruby, Python and JavaScript. This article will cover the toolchains that are used to compile and run programs written in these languages along with how they produce high performance code.

a little introduction 

Before we take a deep dive into toolchains let’s talk a little about what dynamic languages are, and how they differ from static languages like Java or C#. The dynamic element of these languages means that a program can change its structure at runtime. Methods or fields on a class can be added, modified or removed at runtime. There is also no typing of variables, fields or arguments as you would see in a Java program. The operations are performed given the current object assigned to the variable. If the object has the required method or field the program will run correctly (which allows us to use techniques like duck typing). This means that a lot of the checks to ensure the operations you are trying to do are valid will happen at runtime, as opposed to build or compile time with static languages.

The lack of typing and ability to change program structure at runtime are the two main performance problems in dynamic languages, and there are various ways to solve them as we will see later on in this article.

SOME EXAMPLE DYNAMIC LANGUAGE TOOLCHAINS 

Let’s now look at some example toolchains for Python, Ruby and JavaScript and how they produce high performance code. Please refer to my previous article on compilers for an overview on how they work.

The first set of toolchains are ones that just have an interpreter, but do not have a JIT. This is the approach used by CPython and the standard Ruby toolchain. For both of these toolchains the interpreter is written in C and is heavily specialised to running Python or Ruby bytecode. A lot of C tricks are used in the code to improve performance. The code for the interpreter is then compiled down using a C compiler to machine code for the target machine so it can run quickly. In Python, methods and constants can change during lifetime of the program. We need a quick way to get back the current value of a method or constant. Caching techniques are used to do this so that performance is not impacted.

The next set of toolchains are ones that use a pre-existing JIT. The aim here is to write our dynamic toolchain in a language like Java or C#. This is the approach used by: JRuby, Jython (written in Java) and IronPython and IronRuby (written in C#). There are two reasons to do this. Firstly, we hope to use the JITs in the Java or C# toolchains to improve performance. These JITs have been used for many years and have a huge amount of work done on them to make them fast The aim is to piggyback off that to improve performance. Secondly, polyglot programming can be used, as the dynamic language can access to the static language and vice-versa. For example, JRuby allows Java and Ruby to interact.

There are two main problems that need to be solved when using this approach.

Firstly, we have a dynamic language which is untyped, and an abstract syntax tree (AST) representing our program written in that language. What we would like to do is map this on to Java Bytecode or CIL; and Java or C# libraries. Both the bytecode and libraries are typed, so we need to perform some form of typing when we convert the AST to get this to work.

The other aspect, as covered previously, is because the language is dynamic we need to have techniques to deal with the method and constant field lookups. These need to be fast to prevent the lookups being a bottleneck on the application.

How to map to Bytecode

So, how do we map a dynamic language to static bytecode? IronPython and IronRuby make use of the Dynamic Language Runtime (DLR) to produce CIL. To use this library you convert the AST for your program into a AST used by the DLR. From this AST the DLR will then create bytecode. Jython and JRuby have custom code that walk the AST tree and from this process generate Java Bytecode including any type conversions required.

Next, how do we make looking up methods, fields, and constants quick?  In IronRuby and IronPython they use DLR’s polymorphic caching feature. This associates a method cache function with each location where a call is made in the code. When the method call is performed it goes through the method cache function. This will look at the arguments and if it has seen it before will return back the cached method. Otherwise it will ask the DLR to look up the location of the method (based on the arguments), and also to update the code in the method cache function to match this new method, so that next time the lookup is a lot faster.

The other approach is what Jython and JRuby use. These make use of the invokedynamic bytecode instruction in the JVM. In the same principal as before we need to make a call to a method, or get the value of a field or constant, but we do not know where it is. The call is wrapped up in the invokedynamic instruction. When the JVM hits this instruction the first time it will run a bootstrap method associated with it. This bootstrap method returns back a method handle. The method handle contains a set of operations to run to tell the JVM where to find the current location of the method, field or constant. These operations can be optimised away by the JVM making it fast. The next time we need to call the method or get the field or constant value at this point in the bytecode we will just use the result returned by the initial invokedynamic call.

Let’s look at an example from JRuby to show these two concepts. This has an output method that takes two arguments, x and y, adds them together and prints the result to the screen. The output method is then called up with two integer arguments, and secondly with two string arguments.

So, how do we map a dynamic language to static bytecode? IronPython and IronRuby make use of the Dynamic Language Runtime (DLR) to produce CIL. To use this library you convert the AST for your program into a AST used by the DLR. From this AST the DLR will then create bytecode. Jython and JRuby have custom code that walk the AST tree and from this process generate Java Bytecode including any type conversions required.

Next, how do we make looking up methods, fields, and constants quick?  In IronRuby and IronPython they use DLR’s polymorphic caching feature. This associates a method cache function with each location where a call is made in the code. When the method call is performed it goes through the method cache function. This will look at the arguments and if it has seen it before will return back the cached method. Otherwise it will ask the DLR to look up the location of the method (based on the arguments), and also to update the code in the method cache function to match this new method, so that next time the lookup is a lot faster.

The other approach is what Jython and JRuby use. These make use of the invokedynamic bytecode instruction in the JVM. In the same principal as before we need to make a call to a method, or get the value of a field or constant, but we do not know where it is. The call is wrapped up in the invokedynamic instruction. When the JVM hits this instruction the first time it will run a bootstrap method associated with it. This bootstrap method returns back a method handle. The method handle contains a set of operations to run to tell the JVM where to find the current location of the method, field or constant. These operations can be optimised away by the JVM making it fast. The next time we need to call the method or get the field or constant value at this point in the bytecode we will just use the result returned by the initial invokedynamic call.

Let’s look at an example from JRuby to show these two concepts. This has an output method that takes two arguments, x and y, adds them together and prints the result to the screen. The output method is then called up with two integer arguments, and secondly with two string arguments.

When we run jruby with the bytecode option it will output the Java bytecode required to run this program with the JRuby Java libraries. If we look at a small section of the bytecode below to make the integer call to output we can see the use of the invokedynamic instruction and type conversion that needs to be performed before making the call to the output method.

There are two lines that convert the values 99 and 55 to the Ruby’s Fixnum format. To do this we need to make a call to the fixnum method. We use the invokedynamic instruction along with the org/jruby/ir/targets/FixnumObjectSite.bootstrap bootstrap method to find the current definition of the method. When the method runs it returns back an object that implements the IRubyObject interface which allows it to work with the JRuby Java libraries used in the test.invokeOther2:output method.

When we run jruby with the bytecode option it will output the Java bytecode required to run this program with the JRuby Java libraries. If we look at a small section of the bytecode below to make the integer call to output we can see the use of the invokedynamic instruction and type conversion that needs to be performed before making the call to the output method.

There are two lines that convert the values 99 and 55 to the Ruby’s Fixnum format. To do this we need to make a call to the fixnum method. We use the invokedynamic instruction along with the org/jruby/ir/targets/FixnumObjectSite.bootstrap bootstrap method to find the current definition of the method. When the method runs it returns back an object that implements the IRubyObject interface which allows it to work with the JRuby Java libraries used in the test.invokeOther2:output method.

However, there are problems with using these JITs as they are optimised for static languages like C# and Java, and the bytecode that is typically generated from these languages. This means they are not really designed for dynamic languages, and some performance problems still exist because certain bytecode sequences generated by the dynamic language toolchains are not always optimised properly by the JIT. To gain higher performance we need to look at writing our own JIT so that we have greater control over the machine code that gets generated, and then ultimately higher performance.
The major problem with converting dynamic bytecode to machine code is that machine code is static. Firstly, it is typed; for example, an add instruction will only work with integer values, but it won’t be able to deal with strings. Secondly, it assumes functions and data structures appear in the same place in memory and won’t change. This means it can’t deal with the dynamic aspects of languages, for example if a new field is added to a class.

So how do we convert dynamic bytecode into static machine code? While the interpreter has been running the bytecode it records hot regions of bytecode that are run with a specific type. For example, if we had a method that took two arguments and added them together, and this happened a large amount of time with integers.

The interpreter would pass this information onto the JIT which would convert the bytecode to machine code, by selecting machine code instructions that are for the specific type we are specializing to, in our example this would be integer add instructions. Checks and guards are also added to the start of machine code to ensure it is only run for that specific data type. If these checks or guards fail, for example if we tried to pass in strings to our method, we would deoptimize back to the interpreter. The interpreter would then use the bytecode to run this method with the failing arguments.

This is the approach used by the V8 JavaScript engine used in Node and the Chrome web browser. Its bytecode interpreter is called Ignition and its JIT is called TurboFan. Let’s walk through an example showing the machine code generated by TurboFan for a simple JavaScript program.

This example has a function called output that adds its two arguments together and appends the result to the end of the array a.  We then call this function 10000 times to allow the JIT to be activated to optimize the code.  On the right is a snippet of the machine code that is generated by TurboFan.  The lines in red show where it is checking the first argument to the method is an integer, and if this fails it calls a function to deoptimize back to the interpreter.  The line in green shows that it has made use of the addl x86 machine code instruction to add the two integers together.

An Alternative Approach

An alternative JITing approach is done by PyPy. PyPy is written in Python and takes a program written in RPython and converts it to machine code. RPython is a restricted form of Python that makes it easy to infer the types used in the program. There are interpreters written in RPython for Python, Ruby and PHP. PyPy can also add a Tracing JIT to the interpreter. This looks for loops in the user’s code. When this happens a trace of the interpreter operations during the loop is recorded. This trace is then converted to machine code and executed. Guards are used to check that the objects used in the loop have not changed their state. If they do, PyPy deoptimizes back to the interpreter.

Finally, instead of writing an entire toolchain from scratch could we just change the JIT in an existing toolchain? GraalVM is a project to do just that. It replaces the JVMs JIT with a JIT written in Java. It is currently being used for a number of reasons. Firstly, Ahead Of Time (AOT) compilation of Java classes to produce native images to improve startup performance. Secondly, to run JavaScript, Python, Java, R, C, C++ and Ruby on the same JVM via the Truffle framework. And finally, zero-overhead interoperability between these programming languages, by using the Polygot library.

Let’s look in more detail at Truffle. This is a Java framework for writing AST interpreters for programming languages. Using the framework we write code to deal with each of the different types of AST nodes. For example add, if statements or calling methods. When an AST is run on the AST interpreter the AST nodes will use a technique called partial evaluation to specialise themselves. This means the nodes will specialise themselves to the current state of the program i.e. types of function arguments or which path is taken in an if statement. When the bytecode and fields for these AST nodes are sent to the GraalVM JIT it will assume they are static and won’t change.  It will then produce machine code based on these assumptions. If the assumptions are invalidated, we deoptimize back to the AST interpreter which can then specialise the AST nodes again. Currently Truffle is experimental for Ruby and Python and is not designed for production use yet.

In Summary

This article has covered how different dynamic languages can produce high performance code. There has been a large amount of recent work on performance improvements to dynamic languages, and it will be interesting to see if this allows the use of them to increase, both standalone and with traditional static languages via polyglot programming.