Monday, November 7, 2011

Implementing a Parser and Simple Compiler for the Java Virtual Machine

1. Overview


  This document covers the implementation of a simple recursive-descent parser for an infix adder language with a lexer, parser, and compiler.  The language is implemented with Java and compiles to Java Virtual Machine (JVM) bytecode.  The first implementation of the compiler is different from other implementations because it is fully implemented in one class without third party parser generator libraries.  The implementation of the language only uses the Java standard libraries.  The code for the parser does not use a parser generator so it only contains pure Java standard libraries and no references to third-party libraries.

I wanted to discuss this particular example it helps for you to understand the nature of programming languages and their implementation.  Don't be scared about the parsing input token routines in them.  They help as a tool to the developer.  This a basic implementation but at least I provided a JVM compiler.  Actually code generator from the AST to JVM byte code.


Figure 1 : Mini adder language input (6 tokens are visible)
  The language program in the snippet above consists of a SYMBOL name token, an assignment token, NUMBER value the PLUS token and another NUMBER token.  The result of the add operation is stored in  global memory.  


2. Lexical Analysis


  The lexical analysis consists of reading the character tokens from the input stream and then converting those   ASCII characters into tokens that are used by our language.
Figure 2: Source code for lexer method, convert input ASCII characters into our language token types

  If you look at the code snippet above, the lexer method mostly consists of a large case statement that checks for a particular character and then creates a wrapper token object for that character. Once the token object is created, we add start to add the token node to the abstract syntax tree.

3. Lexer and Simple Parser with ANTLR


The parser example code uses only the Java standard libraries.   Most developers that write domain specific languages or other programming languages may use a parser generator.  ANTLR is a parser generator system written in Java that can output parser code in Java programming language or Python.  You provide a grammar and ANTLR generates the code that you can use for the lexer and parser.  In our case, are going to create a simple grammar and then two classes are created, the lexer and parser.

When then feed the input tokens to our ANTLR code generated code.





4. Parser Implementation and Abstract Syntax Tree (AST)


  The parser implementation converts and analyzes the tokens and coverts the tokens to an abstract syntax tree.  Once we build the AST, then we can traverse the tree and perform operations at each node.

Figure 3.1 : Lookahead parsing (LL(2)) and look for match errors

  We used look ahead two parsing.   Match method.

Figure 3.2 : Output from Abstract Syntax Tree
Figure 3:  Building the AST



4.1 Mini Language Functionality: Adding, Global Memory, Printing

Figure 4:  Visitor implementation, visit the AST expression nodes.  Perform 'add' operation

5. Java Bytecode


  The Java bytecode consists of  the magic number CAFEBABE.  After that comes the constant pool data.  After the constant pool data comes the data buffer which consists of the code block.


5.1 Java Tools : Javap

Figure 5 : Output from Javap Disassemble/Decompile on a Java Class

5.2 The Constant Pool


  The constant pool is a collection of constants that are referenced by ID in the data buffer or code section.


5.3 The Data Buffer and Code Section Attributes

The data buffer code section contains the code.

5.4 Analyzing the OpenJDK Java Source


Figure 6 : OpenJDK Java Source ClassWriter.java, writes Java bytecode to file/stream




6. Mini Compiler Implementation (writing to JVM bytecode)


  When thinking about the Java programming language, think of it as a wrapper for generating Java bytecode.  You don't necessarily need a mainstream programming language.  You do need to format the bytecode input into a class file and feed that into the Java virtual machine.


Figure 7 : Mini adder compiler, adding constants to constant pool. Matches Javap output (see comments)




7. Conclusion


This document covers the implementation of a compiler, a very basic compiler for the JVM.  We looked at the lexer, the parser, the AST, JVM bytecode and compiler.
A. Resources
B. Source Code
https://github.com/berlinbrown/ImperativeStaticCompiledLangOneFile - Github Source, see CompiledLangOneFile.java)

Also see:
http://math-services.appspot.com/lang2.jsp

C. Building the OpenJDK Java Compiler


D. Tools


http://www.antlr.org/


E. Disclaimer


For this document, I included snapshot images of the source instead of the actual source because I plan on converting this document to other formats.  I wanted to keep some of the syntax highlighting from my IDE.

-- Berlin Brown on 9/22/2011

No comments: