Advancing the JIT compiler

It is customary to reintroduce oneself after a long blogging hiatus, such as the roughly six months of silence I've displayed here. So here I am, ready to talk JIT compilers again. Or compilers in general, if you wish. Today I want to discuss possible ways to improve the MoarVM JIT compiler for great performance wins! Or so I hope.

It is no secret (at least not to me) that the MoarVM JIT compiler isn't as good as, say, any of {v8, luajit, HotSpot, Spidermonkey, PyPy}. There are many reasons why, and to understand them it is helpful to consider the structure of the Rakudo-MoarVM 'stack'.
So to help you do that, I've drawn this picture:

[Diagram: the Rakudo-MoarVM compilation stack]

A picture may be worth more than a thousand words, but this is just a diagram, so I can summarize the important points as:
  1. MoarVM is the 'end of the pipe'. Many optimizations may be applied to your program before it ever reaches this layer. Many more optimizations cannot be proven safe due to perl6 semantics, however, which is why spesh is a speculative optimizer.
  2. The JIT only kicks in after spesh has applied several optimizations, and in the current system only after spesh has generated optimized bytecode. (This is a flaw that we are looking to correct.) This point is often missed, possibly because in many other systems the optimization and JIT compilation steps are not so clearly delineated.
  3. Within MoarVM, 6model plays a central role. 6model is the name for Rakudo's implementation of the perl6 (meta-)object system. Many operations, such as array indexing and hash lookups, are implemented as 6model operations. In general, these ultimately resolve to 6model 'representation' objects, which are implemented in C. Many steps in the execution of a perl6 program are really (virtually) dispatched to these 'REPR' objects, as the sketch below illustrates.
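
To make that dispatch concrete, here is a minimal sketch of the pattern in C. The names are hypothetical and simplified, not MoarVM's actual definitions: every object carries a pointer to its representation, whose operations are C function pointers, so each high-level access pays for an indirect call.

    /* Hypothetical, simplified sketch of REPR dispatch; these are not
     * MoarVM's actual type or function names. */
    typedef struct Repr Repr;

    typedef struct {
        const Repr *repr;   /* every object knows its representation */
        /* ... object body follows ... */
    } Object;

    struct Repr {
        /* positional access, implemented per representation in C */
        long (*at_pos)(Object *obj, long index);
        void (*bind_pos)(Object *obj, long index, long value);
    };

    /* What an array-indexing opcode ultimately boils down to: */
    long array_at_pos(Object *obj, long index) {
        return obj->repr->at_pos(obj, index);   /* indirect call */
    }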
So what benefit can an improved JIT compiler bring? The JIT has an advantage over every other component: on the one hand it has all the information of the layers above, and on the other it has the full flexibility of the underlying machine. Many tricks that are easy at the level of machine code are awkward (and compiler-specific) to express at higher levels. Generating code at runtime is more-or-less a superpower, which is probably what LISP fans will tell you too.

The current JIT cannot really take advantage of this power because it is very simple. All operations in a given piece of code are compiled independently and stitched together to form the entire subroutine. You can think of it as a 'cut-and-paste' compiler. Because all pieces are independent, each piece has to load and store its values directly from and to memory, causing unnecessary memory traffic. There is also no way to reorder operations or coalesce several operations into a smaller number of instructions. Indeed, most forms of machine-level optimization are virtually impossible.
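
To illustrate (in C rather than the machine code the JIT actually emits, and with made-up template names), here is roughly what stitched-together templates do. Every template works on the in-memory register file, so the intermediate value of the addition makes a pointless round trip through memory:

    /* Illustrative only: made-up names, C standing in for emitted
     * machine code. Each template reads and writes the in-memory
     * register file 'reg'. */
    static void template_add_i(long *reg, int d, int a, int b) {
        reg[d] = reg[a] + reg[b];   /* two loads, one store */
    }

    static void template_mul_i(long *reg, int d, int a, int b) {
        reg[d] = reg[a] * reg[b];   /* two loads, one store */
    }

    /* Stitched together for 'r3 = (r0 + r1) * r2': the sum is stored
     * to reg[3] and immediately reloaded, traffic that a smarter
     * compiler would elide by keeping it in a machine register. */
    static void compiled_frame(long *reg) {
        template_add_i(reg, 3, 0, 1);
        template_mul_i(reg, 3, 3, 2);
    }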

I propose to replace - step-by-step - our current 'cut-and-paste' JIT with a more advanced 'expression JIT'. The idea is very simple: in order to generate good code for a machine, it is first necessary to reify the operations on that machine. Compilation is then the process of ordering these operations in a (directed acyclic) graph and selecting instructions to perform them. There is nothing new about that idea; 'real' compilers have been written that way for ages, and there is very little preventing us from doing the same. (Currently, DynASM doesn't support dynamic selection of registers other than those present in x86, which are rather limited. This is something that can be fixed, though.)
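
As a sketch of what such reified operations might look like (in C, with hypothetical names; this is not an existing MoarVM data structure), consider a small expression node type. Sharing a node between two parents is what turns the tree into a DAG, so a common subexpression is represented, and can be computed, exactly once:

    #include <stdlib.h>

    /* Hypothetical expression node for a would-be 'expression JIT'. */
    typedef enum { EXPR_CONST, EXPR_LOAD, EXPR_ADD, EXPR_MUL } ExprOp;

    typedef struct ExprNode {
        ExprOp op;
        struct ExprNode *left, *right;  /* operands, NULL where unused */
        long arg;                       /* constant value or register index */
    } ExprNode;

    static ExprNode *node(ExprOp op, ExprNode *l, ExprNode *r, long arg) {
        ExprNode *n = malloc(sizeof *n);  /* error handling elided */
        n->op = op; n->left = l; n->right = r; n->arg = arg;
        return n;
    }

    /* Build (r0 + r1) * (r0 + r1); reusing 'sum' makes this a DAG. */
    static ExprNode *example(void) {
        ExprNode *sum = node(EXPR_ADD, node(EXPR_LOAD, NULL, NULL, 0),
                                       node(EXPR_LOAD, NULL, NULL, 1), 0);
        return node(EXPR_MUL, sum, sum, 0);
    }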


To be sure, I think we should leave large parts of the current JIT infrastructure intact, with maybe an improvement here or there. And I think this 'expression JIT' can be introduced piecemeal, with the current 'cut-and-paste' code snippets serving as a template. The expression JIT will function as another node type for the current JIT, although probably most code will eventually be converted to it. When this is done it opens up a lot of possibilities:
  • Load and store elision, and in general 'optimal' code generation by register selection and allocation. We are aided in a way by the fact that x86-64 has become more 'RISC-y' over the years, which makes instruction selection simpler.
  • JIT compilation of REPR object methods. The 'expression tree' on which the JIT operates is architecture-independent, so it's feasible to have REPR objects generate a tree for specific operations, effectively 'inlining' such operations into the frame being compiled. In real terms, this may translate a high-level array access into a simple pointer dereference, as the first sketch after this list shows.
  • NativeCall calls may be more easily converted into JIT-level C calls, which is ultimately the same thing, just much more efficient.
  • Many optimizations become much simpler, even where they were possible before. For example, the 'small bigint' optimization, which lets us execute small integer operations with big integer semantics, courtesy of cheap overflow checking at the machine level (see the second sketch below). Or possibly transforming tight operation loops into equivalent SIMD instructions, although that one is in fact rather more involved.
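
Continuing the earlier REPR sketch (again with hypothetical names and layout), the array-access case could reduce from an indirect call to a guarded pointer dereference once the REPR hands the JIT an expression tree for the operation:

    #include <stddef.h>

    /* Hypothetical body layout of an integer-array representation. */
    typedef struct {
        long  *slots;   /* contiguous storage */
        size_t elems;   /* number of elements */
    } IntArrayBody;

    /* What the JIT could emit inline instead of the indirect at_pos
     * call: a bounds check plus a plain pointer dereference. */
    long inlined_at_pos(IntArrayBody *body, long index) {
        if ((size_t)index >= body->elems)
            return 0;   /* placeholder for the out-of-bounds slow path */
        return body->slots[index];
    }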
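
And for the 'small bigint' idea, overflow checking really is cheap at the machine level; compilers expose it directly. A minimal sketch, where the fallback routine is a hypothetical stand-in and __builtin_add_overflow is a GCC/Clang builtin:

    /* Hypothetical stand-in for the real arbitrary-precision fallback. */
    static long bigint_add_slow(long a, long b) {
        /* a real implementation would promote to a bignum; elided */
        (void)a; (void)b;
        return 0;
    }

    /* Add on plain machine integers, falling back only on overflow.
     * The builtin compiles to an add plus a jump-on-overflow. */
    static long smallbigint_add(long a, long b) {
        long result;
        if (__builtin_add_overflow(a, b, &result))
            return bigint_add_slow(a, b);
        return result;
    }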
To conclude, I think the current MoarVM JIT compiler can be radically improved. I also think that this is nothing intellectually groundbreaking, and that there are few barriers to implementation. And finally, that with these improvements MoarVM will have a strong base for future development. I'd love to hear your thoughts.
