Wednesday, July 04, 2007

One of the things that I have been working on in my free time is a compiler for a subset of the Scheme programming language. This compiler is written in Scheme and is a small, relatively fast compiler.

Languages and Usage

It supports a very basic core language:

  1. fixed size integers
  2. vectors
  3. scheme pairs (via cons)
  4. quoted constants
  5. let, letrec, top level define
  6. booleans, if
  7. set! (mutable variables)
  8. lambdas, higher order functions

The core language that it supports is a higher order programming language (with first class functions) and proper tail calls.

The compiler takes input in the supported subset of scheme and emits x86 assembly code. Here is the "life-cycle" of a compilation: The emitted assembly has entry point into scheme, a sort of scheme_main(). To use the compiler one needs to compile the generated assembly into .obj using any intel-syntax-compatible assembler - I use ml.exe, which is Microsoft's assembler. The .obj then needs to be compiled and linked against a minimal bootstrap C source file which essentially allocates some amount of memory from the OS and them jumps into the scheme code. I use cl.exe, Microsoft's C++ compiler, for this purpose. The assembler and C compiler are free tools that maybe downloaded from Microsoft's website.

All of the core compilation, including the register allocator and code generator are written in Scheme. The generated code is rather efficient (read: not terrible), though a lot can be done improve its quality. In fact the quality of code it generates substantially better than one or two compilation systems that I know of that are used commercially. The code generator is a macro-based instruction selector and the register allocator is a linear allocator as opposed to a graph-based allocator.

The compiler code base is purely functional (for the most part ...) - the minor variations are control operators like yield, recursive functions etc. which are borderline-functional depending upon the school of thought that you belong to.

Organisation

At the point of this writing, the bulk of the compiler source is limited to one file called compiler.ss. This file contains all of the traditional compiler passes as well as the register allocator and the code generator. This file is a little above two thousand lines of scheme code. While that is pretty small for a compiler, I believe it could be made smaller and prettier still.

The compiler is a collection of "passes". The input to the first pass is the source language. The output from the last pass is x86 assembly. Each pass does a small transformation on the source tree - for example a pass could remove one-armed ifs and replace them with ifs that have a do-nothing else branch. The compiler has a fairly large number of passes (greater than 15 or so), each doing a simple and relatively specific transformation. One could think of a pass as a mini-compiler that turns a specific input grammar into a specific output grammar. I believe that it was Kent Dybvigs original observation that the traditional wisdom about expensive-compiler-passes is essentially misguided. Hence his own production quality compiler has many many passes, is blazingly fast, produces efficient code and has a manageable code base.

In compiler.ss, passes are written as a collection of a "visitors". A visitor is a small code transformer that takes one syntactic form and does a transformation on it. For example a visitor could be one that takes a one-armed if and returns a proper if that has an else branch. A pass is constucted by composing several visitors together. The infrastructure that enables us to write the passes as a collection of visitors has most of the basic elements of a term rewriting system. The advantage of writing the compiler as a collection of visitors makes things easy -  I can completely isolate each code transformation from the rest of the system and be free to focus on solving that little problem well.

The source tree is represented as a scheme s-expression and hence requires no parsing and none of the traditonal noise that gets in the way of compilation. At some point towards the middle of the passes the intermediate language that is output is referred to as an "IL". This IL is rich enough to represent all of the details of the scheme program while not have any scheme specific components to it. It is very much like a machine-independant assembly language - almost like C. The hope is that I maybe able to compile other languages to this IL and them share the backend of the compiler - the code generator, register allocator etc.

Most of the other files are essentially infrastructure pieces (sometimes rather generic) and helpers. The source walking visitor infrastructure is in micro-pass.ss. The yield control operator is implemented via callcc and set! in yield.ss. The super handy match macro is implemented in match.ss (I am not the original author of the match pattern matcher).

Usage and Installation notes

You will need the following to use the compiler

  1. The compiler sources
  2. Petite Chez Scheme (Just get the latest from www.scheme.com)
  3. Microsoft's C++ compiler suite or something compatible (You need the compiler and assembler, these are free downloads) 
  4. Ruby (Only needed if you wish the run the test suite)

Install Petite and the Microsoft tools. Makes sure that you have petite, cl.exe and ml.exe in your command line path. Compile a little hello-world with cl and see if things are installed and working fine. Extract the compiler sources to some folder. Thats it you are done. If you want to run the test suite, have ruby also in your path - this maybe a good idea if you want to play around with the sources.

There are two batch files for your convenience:

test-all.cmd: Runs the test suite. The test suite consist of the files in the tests folder.

compile.cmd: This compiles a specific file for you.

Consider a little program that recursively removes a value from a list:

(define foldr (lambda (op unit xs)
(if (null? xs)
unit
(op (car xs) (foldr op unit (cdr xs))))))
(define remove (lambda (xs x)
(foldr (lambda (v ls)
(if (pair? v)
(cons (remove v x) ls)
(if (eq? x v)
ls
(cons v ls))))
()
xs)))



(let* ((vals '(0 1 2 0 0 0 3
(1 2 0 0 3 4)
(1 2 0
(1 2 0 3 4) 4)
4 5 0
(1 2 0 0 3 4 0)
7 8 0))
(res (remove vals 0)))
(cons vals (cons res ())))

We compile foo.ss as follows:
> compile foo.ss
produces foo.asm and foo.exe, among other things. foo.exe when run produces:
((0 1 2 0 0 0 3 (1 2 0 0 3 4) (1 2 0 (1 2 0 3 4) 4) 4 5 0 (1 2 0 0 3 4 0) 7 8 0) (1 2 3 (1 2 3 4) (1 2 (1 2 3 4) 4) 4 5 (1 2 3 4) 7 8))

The generated foo.asm is the result of the compilation. Its the .asm file that you want to eyeball to make observations about how inifintely retarded the quality of my output asm is in comparison to your-favourite-compiler.

Source Language

The best way to find the supported subset of the language is to grep the tests folder. However, here is an attempt:

  • 30-bit numbers, ascii characters, vectors, lists, #t, #f, (), (void)
  • quoted constants composed of the above
  • cons, car, cdr, set-car!, set-cdr!
  • make-vector, vector-ref, vector-set!, vector-length
  • let, letrec, let*, set!
  • fx+, fx-, fx*, fxadd1, fxsub1, fxlognot, fxlogor, fxlogand
  • fixnum->char, char->fixnum,
  • pair?, null?, fixnum?, char?, vector?, boolean?, fxzero?, procedure?
  • fx=, fx<, fx<=, fx>, fx>=, char=?, char<=?, char>?, char>=?, eq?, not
  • lambda, if, begin, define (top level only)

If you are unfamiliar with these operators you can use Chez Scheme's index as a reference. Here are the disappointing bits: there is no quasiquote, no input-output, no forgein function interface, no file loads or includes ... yet. Note that while this is only the core subset the compiler supports, this is mostly rich enough to build a library of commonly used stuff.

Since the supported language has no output statements, the result of of program execution (the final result) is merely printed out at the console. This can be of any of the supported data type, including any s-expression (Note: I dont do anything to handle circular expression so brace yourself to see lots of parenthesis if thats the case).

XEmacs nonsense:
The best way to read the sources would be to use xemacs with folding mode enabled to ;{{{ and ;}}}. To do this, you need to add the following to your init.el file:
(folding-add-to-marks-list 'scheme-mode "{{{" "}}}" nil t)

The most useful folding mode commands are folding-mode, folding-context-next-action. It maybe nice to bind them to key combinations like so:
(global-set-key (kbd "C-c C-m") 'folding-mode)
(global-set-key (kbd "C-c C-f") 'folding-context-next-action)

Future Work and What it lacks

There are many things I would like to add to the compiler. Since I am writing this compiler entirely for the fun of it, it is one of my objectives to make choices that are interesting as opposed to useful. I believe most programming languages today are designed with the aim of being useful which is partly why they all suck; they are crammed with hasty decisions substantiated by all sorts of reasons. Eventually I suspect my compiler will fall prey to usefulness and/or boredom, until then I hope to have fun.

Most of the items in the menu below haven't been implemented yet because of time or because I am not compelled enough by the best ideas I have.

GC: The topmost on my list of things to add is a GC. While it is fairly easy to do a copy-collector in the C-runtime system, I have been toying with the idea of writing the GC in the subset of scheme that I compile. It is interesting to see the minimal set of primitives that will have to be added to the language and how cleanly this can be expressed.

Threading: It would be interesting to write a compiler that supports a mutlithreaded language. There are easy and not so nice ways to accomplish in the present compiler, and there are deeper more elegant ways to this as well.

Errors: The compiler assumes good input programs. This is never the case in "real" life and an early error checking pass is required. Traditionally good error reporting is considered hard in compilers and with good reason. However I may never get down to doing this because this leans more towards useful than interesting in my present world view.

STM: I have been doing some work on STM over the past year or so and I have some pet algorithms of my own that need compiler support to implement well. Once I have STM, like Satnam Singh shows, it maybe possible to express Join patterns and such in terms of them.

Large Objects and Strings: The compiler doesnt have any native support for strings. One may treat vectors of characters of strings, but I wonder if I can do better - for example have a fast string-copy. Similarly the compiler works with a fixed size heap at this point. Allocation requests that overshoot this heap size would simply run into a guard page and segfault the system. Support extensible and large object heaps would be nice.

Pattern Language: I would really like to add compilation support for a pattern language. At some level I suspect that pattern languages are more basic than lambdas. The existence of Barry Jay's Pattern calculus strengthens this view. Scheme's best pattern matcher 'match' is added as a macro to the language. It would be nice if the compiler to primitively support a generic enough pattern matcher.

Continuations: I would like to natively support delimited continuations in some form - most likely my yield operator or a more traditional shift-reset. These can however be added nicely into the compilation once the GC and the threading issues have been addressed because those will affect the runtime stack architecture a lot.

Module System: Some sort of a module system.

Macros: Though this is strictly not an addition to the compiler, it would be intersting to support a pattern language of some sort. Some system where the compiler itself was part if the runtime system would be awesome - maybe some approximation of Walid Taha's MetaML. The thing is that MetaML does not have an catamorphism operator and is hence not very useful for describing syntax extensions: maybe this can be coupled with a pattern system in some way.

Types: At some point I would be intersted in adding a type inferencer to the system. The compilation at this point is rather blind and fast. If one attempts to do a primitive addition between a cons cell and a number, the compiler provides absolutely no security - not even a runtime exception. Scheme solution, as is most "dynamic" languages, is to verify the types of things at runtime before doing primitive operations. These checks are expensive and it would be interesting to see to what extent I can substitute these with compile time assurances.

Parser: With all of these additions, the input language is no longer Scheme. Atleast no self respecting Schemer would allow me the pleasure of their company if I said so. This calls for a different input syntax, one that is more amenable to type decorations as well. While writing a parser is a easy, I am fairly stumped by what a good syntax would be. The syntax should also lend itself to macro generation and possibly tot a Haskell-like do-synatx so that I can hide the set! behind monadic state - maybe ST or some such state monad. If you have any good ideas about syntax design, let me know.

As a matter of fact if you are interested in toying with any part of the compiler let me know. Chances are that I am going to disagree with you and reject your code in the first place, but all the same it would be fun to know someone else think.

Credits

I started work on this compiler in the course of Prof. Kent Dybvig's famous compiler class, in Spring 2007. Kent is the author of the amazing Chez Scheme compiler. This compiler would not have happened without Kent and his awesome grad student Abdulaziz Ghulom.

Why is the compiler written in Scheme? The compiler is written in Scheme because Scheme is a beautiful, expressive language. It is my opinion that writting your system in the most expressive medium that you have leads to better and more manageable code. A direct conseuence of having mangeable code is that it encourages one to implement and experiment with more interesting compilation strategies - in time the extra abstraction privded by an expressive language pays off in spades. I have seen too many systems die the death of complexity by being prematurely optimised into a C-like language - in time the code becomes expert-only and eventually just unmanageable.

Download Compiler Source

Wednesday, July 04, 2007 4:50:44 PM (Eastern Standard Time, UTC-05:00)  #    Comments [2]  | 
Friday, July 06, 2007 7:44:51 AM (Eastern Standard Time, UTC-05:00)
(Deep Bow) (Deep Bow) (Deep Bow)
Friday, July 06, 2007 6:29:34 PM (Eastern Standard Time, UTC-05:00)
Come back to Microsoft!
Name
E-mail
Home page

Comment (Some html is allowed: a@href@title, strike) where the @ means "attribute." For example, you can use <a href="" title=""> or <blockquote cite="Scott">.  

Enter the code shown (prevents robots):

Live Comment Preview