Programming languages findings and other thoughts: Structuring the compiler

1 Simple is difficult
2 Implementation of the toplevel of the compiler
3 Interface to the passes and modular implementation

Simple is difficult

In the implementation of the L compiler, the focus was put on the readability and simplicity of the source code. As often in computer science, it is difficult to make things simple: I had a complete, working prototype, with parser, typing, cps transformation and compilation to LLVM when I started this blog; I spent all this time cleaning, documenting, and especially simplifying the compiler, publishing the progress along the way.

I believe that the result is worth the work: a simpler compiler is easier to extend, both for myself and future contributors. The challenge will be to keep things easy as more features are added; this is possible only if we wait for things to be well structured before adding the non-essential features¹; and if we do not hesitate to perform a local redesign occasionally.

Implementation of the toplevel of the compiler

The toplevel of the compiler, that link all the passes together, is a good example of the results of such a redesign. A design goal of this toplevel is to make each pass communicate with a minimal interface. I had written a first attempt, using a record updated functionally, with each field containing the internal state of one pass. The toplevel consisted in a giant function with nested loops – one loop per pass. This structure was inelegant for a number of reasons:

Even if the type of the state could be made abstract, the abstract type was still part of the interface. Moreover, the initial value for this type also had to be provided by the passes.
Updating the environment functionally introduced a lot of boilerplate code; even if imperative updates might have helped a little.
There was also some boilerplate code to handle the case where a single input could return multiple output (e.g. a datatype declaration in ML defines both a type and some constructors).
The nested loop pattern was not really readable.

After a while, I came up with a much simpler design using streams of definitions. The code for a toplevel is then dead-simple:



module Stream = Extensions.Stream

let process_file file = let parsetree_stream = Parser.make_stream file in let ast_stream = Astfromsexp.from_stream parsetree_stream in let cps_stream = Cpstransform.from_stream ast_stream in let converted_stream = Stream.iter_and_copy Cpsconvertclosures.in_definition cps_stream in let llvm_stream = Cpsllvm.from_stream converted_stream in Stream.iter (fun elt → Llvmexec.exec_nodef () elt) llvm_stream

The actual implementation also interposes optional printer to see the contents of the stream, which is very useful when debugging the compiler.

The stream interface make it easy to add new passes, completely remove the state from the interface, factorize the pattern when a pass can produce several elements at once, retain the minimal memory consumption of the nested loop design, and can be easily updated to a parallel pipelining implementation.

Interface to the passes and modular implementation

Interface of each pass

The interface of each pass is made minimal. Each pass only declares:

The type of the elements in the stream (e.g. tokens, typed AST definitions, CPS definitions…). The type is not abstract: it is used by the following pass to convert them to its own representation (e.g. the CPS pass use the AST type to perform CPS transformation, the LLVM pass use the CPS type to compile to LLVM).
A function mapping an input stream to an output stream.
Possibly, a function to print elements in the stream, used for debugging the compiler.

And that is all. The interface of each pass clear is minimal, which allows to study or change it independently.

Note: this way of structuring things imply that transformation from one pass to the next is done in the second pass. (This implies that the second pass depends on the first one).

An alternative is to expose functions to build elements in the second pass, and move the transformation code from the beginning of the second pass to the end of the first pass. In this case, the first pass depends on the second pass. One advantage is that the first pass does not have to expose the type of its internal format. I am considering using this structure for the parser, where a "parsetree" is produced, but whose type is not very interesting outside of the parsing phase; while the building functions for the AST are more interesting.

Modules inside each pass

Inside each pass, the code is structured around a hierarchy of modules and interfaces. The idea is that each interface progressively hides more details about the implementation of the pass, until the top where we get the minimal interface presented earlier. Intermediate levels show up when we manage to group a complex piece of code into a simple interface. For instance in the CPS pass, there are four intermediary modules:

CPStransform, performing AST to CPS transformation. Its implementation consist of 4 sub-modules, that include the compilation of pattern matching; its interface only consists of a single function.
CPSconvertclosure is a slightly less complex algorithm; its interface also consists in a single function.
CPSshrink will perform shrinking reductions and optimizations. I have not yet implemented the cleaned version of this module.
CPSbase provides general functions for creating and updating CPS terms; the interface of this module is more complex, but hides many details about the internal CPS data structure (which is quite complex), and prevents direct writes to this structure.

Structuring the code around hierarchical, modular interfaces and implementations thus allow intermediate levels of abstractions; when completing a task one is provided with just the needed information, the right degree of abstraction; this helps to focus. The hierarchy also decrease, and thus simplifies, the size of the interfaces. OCaml was perfect for this job, but C, to a lesser extent, also allows this kind of modular programming.

Comparison with object orientation

Note that this design is not at all object-oriented. I think that when providing a new abstract type, packing it with all the methods allowing to access or update objects of this type is an excellent idea. However by using only object-oriented interfaces, the tendency is to to present all methods of an object in a flat manner, and implement all of them in the same class and file. This is problematic when there are many methods (which should be grouped by themes), or that some methods are lengthy algorithms (the algorithm should be set in a separate file). Modules allow such grouping and separation of methods, while creating a new usual object-oriented classes would not work (because we operate on an object defined in another class). Of course there are object-oriented work-arounds; but I think that modules and interfaces with abstract types solve the problem more elegantly (while still allowing "type + methods"-style interfaces). I hope that the code for the L compiler, written in OCaml, shows my point.

Footnotes:

¹ I have a huge and ever-growing list of extensions I would like to experiment or add to the L language

Programming languages findings and other thoughts

2013/06/30

Structuring the compiler

Table of Contents