Wednesday, August 24, 2011

Let's make a shit JavaScript interpreter! Part Three.



Let's put the Research into R&D. I guess it should be called Shit Research to go along with the name of this series (p1) of articles (p2) I started one year ago. It takes a lot of time to read over hundreds of thousands of lines of undocumented C++ and java code, so part three took much longer than expected.


Let's explore existing JavaScript implementations.

In this part we are going to take a small digression to look over other JavaScript implementations. I'll provide a short description of how each JavaScript implementation is made. We can use the Architectural knowledge from each implementation to inspire our own implementation.

Another benefit of researching each implementation is that we can find all the different tools they use. For example different test suites.


Make sure to read the URL's for each implementation to find out more information about each one.

There is a list of ECMAScript implementations at wikipedia. We will not cover all of the ones listed there. If like me, you may spend a few hours or even days reading through the links from there.


narcissus - js in js.

Narcissus is a javascript implementation written in javascript (with some SpiderMonkey language extensions). It is written by the same author as the original JavaScript implementation back in 2005-2007 (Brendan Eich). In mid 2010 the Narcissus code has been taken up again by the Mozilla project to make researching JavaScript changes easier.

It is a good implementation to study, since it is fairly small, and is meant to be easy enough to read.

It uses a hand written parser, not a generated one.

The original narcissus source repository.

The new repository of narcissus, and two articles about it.

There is a port to python of the narcissus parser called pynarcissus. The authors of pynarcissus found it difficult to port the rest of the interpreter since it relies on JavaScript features itself.

It weighs in at about 7000 lines of code counted with wc -l. I had to count lines in this way since my SLOC counter does not seem to count javascript.


Spider monkey.

Spider Monkey is the original JavaScript implementation used by Netscape, and the Mozilla project.

SpiderMonkey is a production grade, high quality JavaScript implementation.

It also has excellent documentation. Especially the js/src/README.html file which includes a design walk through.

"""The compiler consists of a recursive-descent parser and a random-logic rather than table-driven lexical scanner. Semantic and lexical feedback are used to disambiguate hard cases such as missing semicolons, assignable expressions ("lvalues" in C parlance), etc. The parser generates bytecode as it parses, using fixup lists for downward branches and code buffering and rewriting for exceptional cases such as for loops. It attempts no error recovery. The interpreter executes the bytecode of top-level scripts, and calls itself indirectly to interpret function bodies (which are also scripts). All state associated with an interpreter instance is passed through formal parameters to the interpreter entry point; most implicit state is collected in a type named JSContext. Therefore, all API and almost all other functions in JSRef take a JSContext pointer as their first argument.

The decompiler translates postfix bytecode into infix source by consulting a separate byte-sized code, called source notes, to disambiguate bytecodes that result from more than one grammatical production.
"""



Google V8.

Written in C++, javascript and assembler. Uses hidden classes, and generates machine code at run time.

The parser is hand written parser in C++. It's not generated.

'Preparses', which creates tokens. Then creates an AST by parsing. Finally compiling the AST. parser.css is where all the parsing happens.

The projects documentation is quite limited - so reading the source is the best way to get in there. There are some videos which describe the architecture, and engineering decisions behind the choices taken.

http://code.google.com/p/v8/

V8 weighs in at a mighty 605,962 SLOC. Making it the largest, biggest, badest V8 JavaScript engine around.


JSLint.

JSLint is written in JavaScript and uses a TDOP approach to the parser like we are using. I won't discuss this, since it is documented well elsewhere.


Rhino

Rhino can run as either an interpreter or a java byte code generator. It's hosted by the mozilla project (who make firefox). It complements their C++ implementation (spidermonkey) and their JavaScript implementation (narcissus).

There are a few things which show it is a very mature as well as modern implementation. The project was started in 1999, and has been developed ever since. It has partial support for JavaScript 1.8, and ECMAScript 5 standards. Weighing in at 50,551 source lines of Java code (SLOC), it can be considered a very large code base. Another modern feature is that it supports the CommonJS javascript module standard. Despite still being hosted in cvs, it's still being maintained, and features are being added regularly.

The Rhino JavaScript team maintains a library of tests and other documentation for JavaScript. Tests are being shared with the other JavaScript implementations within the mozilla spidermonkey project. There have also been parts which have been ported from spidermonkey - such as the hand written parser.

A handwritten scanner they call TokenStream creates tokens, and then parses those into an AST. Despite being mostly hand written, some parts are generated. Specifically the stringToKeyword method, which detects keywords is generated somehow.

The documentation of the architecture of the project is limited. There is however some API documentation. With a couple of modifications to some ant build files I was able to build it, as well as even make a few small modifications.

The wikipedia Rhino page has some great information on the rhino javascript engine.


KJS

KJS is the KDE JavaScript implementation for the konqueror browser. It was the parent of the JavaScript implementations done by Apple Computer, inc. I won't go into any detail on this one, since I'll cover JavaScriptCore instead. KJS is written in C++, for the QT library.


JavaScriptCore, Squirrelfish, Nitro

You can browse the source here: http://trac.webkit.org/browser/trunk/Source/JavaScriptCore

This uses a hand written lexer(tokeniser), and a hand written parser. The code structure of the parser and lexer looked eerily familiar. The code base is mostly written in C++ and is quite massive. 140,837 SLOC

There is lots of platform specific code, but it also has a jit, uses byte code, and an interpreter. There is also lots of development code in there for things like debuggers, and profilers.

Closed source JavaScript implementations



There are a few JavaScript implementations that are closed source. The two main ones in widespread use are the ones from Microsoft, and the ones from Opera.

They have however published papers and blog posts about their implementations. I won't cover them any more, because not as much can be learned without the source code.


pypy javascript

The pypy project started a javascript interpreter now too.
https://bitbucket.org/pypy/lang-js/src/de89ec32a7dc/js/javascript-interpreter.txt

The description of the project mentions it's currently using the spider monkey parser, but it appears to generate one using a parser generator provided by pypy. Using a EBNF grammar file. It also creates an AST.

It works for some simple javascript programs that don't use the javascript standard library. I'm not sure of the future of the project, since it appears it was a GSOC project which has now finished, so there might not be any full time developers left on it.

It's written in RPython (a restricted subset of python) and python. Running on top of pypy, it should theoretically be able to take advantage of that platforms jit and garbage collector.

This project makes use of some JavaScript tests and benchmarks from other projects. Specifically some benchmarks from v8, and the language shootout website. It also includes the "ECMA 262 Edition 1" tests.

It weighs in at 5452 Source Lines of Code (SLOC). Which is much smaller, but the implementation is also not complete, so that is to be expected.


So what have we learned then?



We see that most of the implementations use a hand written parser. We also see that the implementations in js and python are much smaller. So despite them being incomplete, I think it proves that it should be feasible to make our shit interpreter in python. We don't need half a million lines of C++ to do our project.

We have also learned that there are test suites available, which should help us out a lot. In fact, many of the implementations share the test suites. Having a test suite already available makes it way easier to write an implementation of something yourself. It acts as a guide to development, and also reduces the time for testing since a lot of it can be automated.

Exercise for next time



Choose One(1) of the implementations, build it, run it, and modify it slightly to do something different. Try and run the tests that come with it.



Further reading.

This whole article is "further reading", but we can never have too much to read. Can we!?

This time, instead of reading it on the train or in the bath tub - may I suggest reading these on a couch?

Sunday, August 21, 2011

bugzilla bugs to bitbucket issues

As part of the pygame move to bitbucket, we are converting a bug database stored in bugzilla.

Work is progressing on the issues migration:
https://bitbucket.org/illume/bugzilla_bitbucket

Bugzilla is more comprehensive compared to bitbucket. So not all things can be migrated from the bugzilla bugs to the bitbucket issues.

Also, the bitbucket API is limited in what it can do. You can't add user comments or add attachments.

To get around the attachments part, I made the migration script create a directory of attachments, and then link to them in the bug text. Then I've extracted the attachments from bugzilla and put them online.

To hack around the lack of comment adding in the bitbucket API, I'm just combining all of the comments into the content text.

Other features bitbucket issues API does not have are priorities, and platforms. The issues features _does_ support priorities, but it doesn't look like the API is set up yet. You can't add platforms (like 'Windows XP SP2') at all it seems. You also can not add free form tags of any sort. Another couple of basic things you can not set with bitbucket issues are the date, and the bug id.

So there you have it. This code worked ok for us.

This code might be a useful start for anyone else wanting to start on a migration. Feel free to go fork it yourself: https://bitbucket.org/illume/bugzilla_bitbucket