Wednesday, September 05, 2007

Khusrau rain suhaag ki, jo main jaagi pi ke sang,
Tan mora man piya ka, jo dono ek hi rang.

Thus, with the words of Amir Khusro, starts the 1978 opus of Shyam Benegal. I was so moved by this movie, all the little details in it, the characters, the poetry, the zeitgeist. The film aparently introduces the gorgeous Nafisa Ali. It also has Naseeruddin Shah, vibrant as ever, Shashi Kapoor, Shabana Azmi, Jennifer Kendall ...

The movie captures an angst in visuals and poetry, over the period of 1857 to 1858 - the days of the "Sepoy Mutiny" in a little town in Rohilkhand.

http://en.wikipedia.org/wiki/Junoon_(film)
http://www.imdb.com/title/tt0077783/

 

Tuesday, September 04, 2007 11:17:50 PM (Eastern Standard Time, UTC-05:00)  #    Comments [3]  | 
 Sunday, August 05, 2007

I asked my soul, "What is Delhi?"
She replies, "The world is the body, Delhi its life"
                                                                                  - Ghalib


Sunday, August 05, 2007 1:01:11 AM (Eastern Standard Time, UTC-05:00)  #    Comments [2]  | 
 Wednesday, July 25, 2007
  sudo! 
This is awesome:




Wednesday, July 25, 2007 7:36:01 PM (Eastern Standard Time, UTC-05:00)  #    Comments [5]  | 
 Tuesday, July 24, 2007

In a preachy mood: Today I spent some time talking to someone about how a piece of code that is going to affect his hapiness (aka he is responsible for) has some really bad design decisions in it. We went over how there is this little bug that shows as a consequence of this design flaw. The bug itself has an easy fix and such, but the underlying reason is that several layers of leaky abstractions interact with each other in unhappy ways.

In short I was trying to tell him that its not a bug, but that its bad design waiting for a victim - this should be better described as a pit or a trap than a bug. Bugs are visible warts, pits are lurking demons. Bugs are plain as day, easy to justify - pits are lurkers, those who can tell good code from bad, feel uneasy in their precense.

Do you have pits in your code? Do you have a sense of what parts of your code are pits? If you had to choose between fixing a pit and a bug, what would you do?

I believe everyone's code has pits to varying degrees and that they are completely unavoidable beacuse there is simply no such thing as a perfect apriori design. That said, here are some slightly more interesting questions:

If you had to implement something new - there is solution 1 that is going to introduce a pit but will deliver said target surely and quickly and there is approach 2 that will potentially eliminate the pit, but is going to be far more expensive to implement - How would you choose which approach to take?

I usually go ahead and implement the pit when I dont have enough information to decide between the two. Often there are pits that will never surface and no one gets trapped in them. When I dont know enough implementing the pit is a good way to (1) get my work done (2) ensure that I am collecting data so that the next time I know what the real costs are.

Its vastly more valuable to document the fact that a choice was made based on the lack of sufficient information, than the excrutiating details of choice itself. Pits turn into chronic illnesses of your code base when no one knows why ceratin decisions were made, and hence no one ever feels comfortable back tracking over a bad decision. Once I know an ill-informed choice was made, and once I know the price I am paying for it, its a lot easier to back track over it.

In other words, dont just document your design, document your choices in the context of your design space - it will keep you honest. Well written code to a large extent is its own documentation - what it is not a good documentation of is the choices that you did not take and the reasons why you did not take them.

Thoughts?

Tuesday, July 24, 2007 2:33:39 AM (Eastern Standard Time, UTC-05:00)  #    Comments [0]  | 
 Friday, July 20, 2007

As an indirect result of some of conversations in the past few days at Google and Microsoft, I was left thinking that I need a better way to explain to someone why basic research of languages and language constructs is necessary. This crops up every now and then, and it would be useful if I had a simple answer to this. I think I have an approximate answer now, what do you think:

I think basic research about language constructs is necessary because it helps separate concerns. It separates of the quality of the abstraction that the construct provides from the details of the implementation of the construct. In this way one can study and construct and make observation about how valuable it would be as a tool for expressing computation. Once we have a great abstraction, we can then focus our attention on the best possible implementation of the abstraction or some select component of it, now with a clear understanding of the expressive power of the end result.

As an example our present notions of concurrent programming with threads and locks/semaphores is a broken idiom - they are hard to program with, are brittle and don’t compose well. The notion that it would be nicer to write concurrent programs as memory transactions came about and gained favor because the quality of the abstractions that memory transactions provided was far greater than that of locks and semaphores. By quality I mean that memory transactions compose far better, provide guarantees about progress, fairness etc and are not prone to the sorts of pitfalls that lock-based approaches suffer from. This view of transactions matured into a set of key combinators that gave us "composable" memory transactions. Once the idea that "memory transactions are good" gained ground it spurred a cottage industry of people trying to efficiently implement these abstractions.

The underlying point is that the search for abstractions to represent computation is not hindered by details of implementation. What we refer to as an implementation is only the expression of an idea in terms of other abstractions that previous languages have imposed on us. Limiting ourself to the scope of these, baises us against ideas that dont appear natural to them. I like to think of this as a form of "the evil of premature optimisation" at the level of ideas.

Friday, July 20, 2007 12:31:34 AM (Eastern Standard Time, UTC-05:00)  #    Comments [0]  | 
 Wednesday, July 04, 2007

One of the things that I have been working on in my free time is a compiler for a subset of the Scheme programming language. This compiler is written in Scheme and is a small, relatively fast compiler.

Languages and Usage

It supports a very basic core language:

  1. fixed size integers
  2. vectors
  3. scheme pairs (via cons)
  4. quoted constants
  5. let, letrec, top level define
  6. booleans, if
  7. set! (mutable variables)
  8. lambdas, higher order functions

The core language that it supports is a higher order programming language (with first class functions) and proper tail calls.

The compiler takes input in the supported subset of scheme and emits x86 assembly code. Here is the "life-cycle" of a compilation: The emitted assembly has entry point into scheme, a sort of scheme_main(). To use the compiler one needs to compile the generated assembly into .obj using any intel-syntax-compatible assembler - I use ml.exe, which is Microsoft's assembler. The .obj then needs to be compiled and linked against a minimal bootstrap C source file which essentially allocates some amount of memory from the OS and them jumps into the scheme code. I use cl.exe, Microsoft's C++ compiler, for this purpose. The assembler and C compiler are free tools that maybe downloaded from Microsoft's website.

All of the core compilation, including the register allocator and code generator are written in Scheme. The generated code is rather efficient (read: not terrible), though a lot can be done improve its quality. In fact the quality of code it generates substantially better than one or two compilation systems that I know of that are used commercially. The code generator is a macro-based instruction selector and the register allocator is a linear allocator as opposed to a graph-based allocator.

The compiler code base is purely functional (for the most part ...) - the minor variations are control operators like yield, recursive functions etc. which are borderline-functional depending upon the school of thought that you belong to.

Organisation

At the point of this writing, the bulk of the compiler source is limited to one file called compiler.ss. This file contains all of the traditional compiler passes as well as the register allocator and the code generator. This file is a little above two thousand lines of scheme code. While that is pretty small for a compiler, I believe it could be made smaller and prettier still.

The compiler is a collection of "passes". The input to the first pass is the source language. The output from the last pass is x86 assembly. Each pass does a small transformation on the source tree - for example a pass could remove one-armed ifs and replace them with ifs that have a do-nothing else branch. The compiler has a fairly large number of passes (greater than 15 or so), each doing a simple and relatively specific transformation. One could think of a pass as a mini-compiler that turns a specific input grammar into a specific output grammar. I believe that it was Kent Dybvigs original observation that the traditional wisdom about expensive-compiler-passes is essentially misguided. Hence his own production quality compiler has many many passes, is blazingly fast, produces efficient code and has a manageable code base.

In compiler.ss, passes are written as a collection of a "visitors". A visitor is a small code transformer that takes one syntactic form and does a transformation on it. For example a visitor could be one that takes a one-armed if and returns a proper if that has an else branch. A pass is constucted by composing several visitors together. The infrastructure that enables us to write the passes as a collection of visitors has most of the basic elements of a term rewriting system. The advantage of writing the compiler as a collection of visitors makes things easy -  I can completely isolate each code transformation from the rest of the system and be free to focus on solving that little problem well.

The source tree is represented as a scheme s-expression and hence requires no parsing and none of the traditonal noise that gets in the way of compilation. At some point towards the middle of the passes the intermediate language that is output is referred to as an "IL". This IL is rich enough to represent all of the details of the scheme program while not have any scheme specific components to it. It is very much like a machine-independant assembly language - almost like C. The hope is that I maybe able to compile other languages to this IL and them share the backend of the compiler - the code generator, register allocator etc.

Most of the other files are essentially infrastructure pieces (sometimes rather generic) and helpers. The source walking visitor infrastructure is in micro-pass.ss. The yield control operator is implemented via callcc and set! in yield.ss. The super handy match macro is implemented in match.ss (I am not the original author of the match pattern matcher).

Usage and Installation notes

You will need the following to use the compiler

  1. The compiler sources
  2. Petite Chez Scheme (Just get the latest from www.scheme.com)
  3. Microsoft's C++ compiler suite or something compatible (You need the compiler and assembler, these are free downloads) 
  4. Ruby (Only needed if you wish the run the test suite)

Install Petite and the Microsoft tools. Makes sure that you have petite, cl.exe and ml.exe in your command line path. Compile a little hello-world with cl and see if things are installed and working fine. Extract the compiler sources to some folder. Thats it you are done. If you want to run the test suite, have ruby also in your path - this maybe a good idea if you want to play around with the sources.

There are two batch files for your convenience:

test-all.cmd: Runs the test suite. The test suite consist of the files in the tests folder.

compile.cmd: This compiles a specific file for you.

Consider a little program that recursively removes a value from a list:

(define foldr (lambda (op unit xs)
(if (null? xs)
unit
(op (car xs) (foldr op unit (cdr xs))))))
(define remove (lambda (xs x)
(foldr (lambda (v ls)
(if (pair? v)
(cons (remove v x) ls)
(if (eq? x v)
ls
(cons v ls))))
()
xs)))



(let* ((vals '(0 1 2 0 0 0 3
(1 2 0 0 3 4)
(1 2 0
(1 2 0 3 4) 4)
4 5 0
(1 2 0 0 3 4 0)
7 8 0))
(res (remove vals 0)))
(cons vals (cons res ())))

We compile foo.ss as follows:
> compile foo.ss
produces foo.asm and foo.exe, among other things. foo.exe when run produces:
((0 1 2 0 0 0 3 (1 2 0 0 3 4) (1 2 0 (1 2 0 3 4) 4) 4 5 0 (1 2 0 0 3 4 0) 7 8 0) (1 2 3 (1 2 3 4) (1 2 (1 2 3 4) 4) 4 5 (1 2 3 4) 7 8))

The generated foo.asm is the result of the compilation. Its the .asm file that you want to eyeball to make observations about how inifintely retarded the quality of my output asm is in comparison to your-favourite-compiler.

Source Language

The best way to find the supported subset of the language is to grep the tests folder. However, here is an attempt:

  • 30-bit numbers, ascii characters, vectors, lists, #t, #f, (), (void)
  • quoted constants composed of the above
  • cons, car, cdr, set-car!, set-cdr!
  • make-vector, vector-ref, vector-set!, vector-length
  • let, letrec, let*, set!
  • fx+, fx-, fx*, fxadd1, fxsub1, fxlognot, fxlogor, fxlogand
  • fixnum->char, char->fixnum,
  • pair?, null?, fixnum?, char?, vector?, boolean?, fxzero?, procedure?
  • fx=, fx<, fx<=, fx>, fx>=, char=?, char<=?, char>?, char>=?, eq?, not
  • lambda, if, begin, define (top level only)

If you are unfamiliar with these operators you can use Chez Scheme's index as a reference. Here are the disappointing bits: there is no quasiquote, no input-output, no forgein function interface, no file loads or includes ... yet. Note that while this is only the core subset the compiler supports, this is mostly rich enough to build a library of commonly used stuff.

Since the supported language has no output statements, the result of of program execution (the final result) is merely printed out at the console. This can be of any of the supported data type, including any s-expression (Note: I dont do anything to handle circular expression so brace yourself to see lots of parenthesis if thats the case).

XEmacs nonsense:
The best way to read the sources would be to use xemacs with folding mode enabled to ;{{{ and ;}}}. To do this, you need to add the following to your init.el file:
(folding-add-to-marks-list 'scheme-mode "{{{" "}}}" nil t)

The most useful folding mode commands are folding-mode, folding-context-next-action. It maybe nice to bind them to key combinations like so:
(global-set-key (kbd "C-c C-m") 'folding-mode)
(global-set-key (kbd "C-c C-f") 'folding-context-next-action)

Future Work and What it lacks

There are many things I would like to add to the compiler. Since I am writing this compiler entirely for the fun of it, it is one of my objectives to make choices that are interesting as opposed to useful. I believe most programming languages today are designed with the aim of being useful which is partly why they all suck; they are crammed with hasty decisions substantiated by all sorts of reasons. Eventually I suspect my compiler will fall prey to usefulness and/or boredom, until then I hope to have fun.

Most of the items in the menu below haven't been implemented yet because of time or because I am not compelled enough by the best ideas I have.

GC: The topmost on my list of things to add is a GC. While it is fairly easy to do a copy-collector in the C-runtime system, I have been toying with the idea of writing the GC in the subset of scheme that I compile. It is interesting to see the minimal set of primitives that will have to be added to the language and how cleanly this can be expressed.

Threading: It would be interesting to write a compiler that supports a mutlithreaded language. There are easy and not so nice ways to accomplish in the present compiler, and there are deeper more elegant ways to this as well.

Errors: The compiler assumes good input programs. This is never the case in "real" life and an early error checking pass is required. Traditionally good error reporting is considered hard in compilers and with good reason. However I may never get down to doing this because this leans more towards useful than interesting in my present world view.

STM: I have been doing some work on STM over the past year or so and I have some pet algorithms of my own that need compiler support to implement well. Once I have STM, like Satnam Singh shows, it maybe possible to express Join patterns and such in terms of them.

Large Objects and Strings: The compiler doesnt have any native support for strings. One may treat vectors of characters of strings, but I wonder if I can do better - for example have a fast string-copy. Similarly the compiler works with a fixed size heap at this point. Allocation requests that overshoot this heap size would simply run into a guard page and segfault the system. Support extensible and large object heaps would be nice.

Pattern Language: I would really like to add compilation support for a pattern language. At some level I suspect that pattern languages are more basic than lambdas. The existence of Barry Jay's Pattern calculus strengthens this view. Scheme's best pattern matcher 'match' is added as a macro to the language. It would be nice if the compiler to primitively support a generic enough pattern matcher.

Continuations: I would like to natively support delimited continuations in some form - most likely my yield operator or a more traditional shift-reset. These can however be added nicely into the compilation once the GC and the threading issues have been addressed because those will affect the runtime stack architecture a lot.

Module System: Some sort of a module system.

Macros: Though this is strictly not an addition to the compiler, it would be intersting to support a pattern language of some sort. Some system where the compiler itself was part if the runtime system would be awesome - maybe some approximation of Walid Taha's MetaML. The thing is that MetaML does not have an catamorphism operator and is hence not very useful for describing syntax extensions: maybe this can be coupled with a pattern system in some way.

Types: At some point I would be intersted in adding a type inferencer to the system. The compilation at this point is rather blind and fast. If one attempts to do a primitive addition between a cons cell and a number, the compiler provides absolutely no security - not even a runtime exception. Scheme solution, as is most "dynamic" languages, is to verify the types of things at runtime before doing primitive operations. These checks are expensive and it would be interesting to see to what extent I can substitute these with compile time assurances.

Parser: With all of these additions, the input language is no longer Scheme. Atleast no self respecting Schemer would allow me the pleasure of their company if I said so. This calls for a different input syntax, one that is more amenable to type decorations as well. While writing a parser is a easy, I am fairly stumped by what a good syntax would be. The syntax should also lend itself to macro generation and possibly tot a Haskell-like do-synatx so that I can hide the set! behind monadic state - maybe ST or some such state monad. If you have any good ideas about syntax design, let me know.

As a matter of fact if you are interested in toying with any part of the compiler let me know. Chances are that I am going to disagree with you and reject your code in the first place, but all the same it would be fun to know someone else think.

Credits

I started work on this compiler in the course of Prof. Kent Dybvig's famous compiler class, in Spring 2007. Kent is the author of the amazing Chez Scheme compiler. This compiler would not have happened without Kent and his awesome grad student Abdulaziz Ghulom.

Why is the compiler written in Scheme? The compiler is written in Scheme because Scheme is a beautiful, expressive language. It is my opinion that writting your system in the most expressive medium that you have leads to better and more manageable code. A direct conseuence of having mangeable code is that it encourages one to implement and experiment with more interesting compilation strategies - in time the extra abstraction privded by an expressive language pays off in spades. I have seen too many systems die the death of complexity by being prematurely optimised into a C-like language - in time the code becomes expert-only and eventually just unmanageable.

Download Compiler Source

Wednesday, July 04, 2007 4:50:44 PM (Eastern Standard Time, UTC-05:00)  #    Comments [2]  | 
 Tuesday, July 03, 2007
  Spirit 

Which part of the lack of something can you point at and say, there should have been something here? This is what is missing.

Tuesday, July 03, 2007 11:43:18 AM (Eastern Standard Time, UTC-05:00)  #    Comments [1]  | 
 Saturday, May 05, 2007

Despite my many many associations with Microsoft over the past years, I haven't ever really been to the holy land of Redmond. Over the years I have had one too many missed oppurtunities to go to Seattle/Redmond, each planned visit getting stalled for some reason or the other. So every once in while when some one hears about some association I have had with Redmond and asks me about Seattle I have to politely tell them that I have never really been there. All that is about to change.

But not quiet in the way I expected it would - tomorrow morning I fly to Seattle from where on Monday, the day after, I fly to my host company's HQ at MountainView California. After a week of orientation in sunny California, I fly back to Seattle where I will be spending all summer... with Google.

Strange as it may be, my first trip to the holy land is to work at a different altar. I will be working as an intern at Google, Kirkland this summer. Though Google's is generally secretive about its work, I believe I will be working on something very close to my interests in programming languages. :)

Saturday, May 05, 2007 8:48:43 AM (Eastern Standard Time, UTC-05:00)  #    Comments [7]  |