This is not a how-to post; rather, it describes the technical concepts behind the implementation of the Ballerina programming language.

When we started developing the language, we were newbies in language development. We had to research everything from scratch. In fact, we took leave for two weeks and studied compiler theory first 🙂

So how is our compiler designed? The current compiler design consists of two parts: a front end and a back end.

High-level design

When it comes to compilers, they are usually written in the language they compile. This practice is called bootstrapping the compiler. We have done this for our compiler’s back end. So, as of now (current version 1.1.0), the front end is written in Java, and the back end is written in Ballerina. (The BVM back end mentioned in the above diagram is no longer there. I have added it just for clarity, so that people can understand that we can plug in as many back ends as we need.)

In Ballerina, we have a concept called a module, which is the top-level language construct that we build from source files. It is also the shareable artifact in the language, so people can publish their own modules, import others’ modules, etc. In doing so, we have to share those modules somehow, but sharing the full source of a module is pointless (we had done this for some time).

So, we had an idea to compile these module sources into a shareable format that is platform-neutral and generic. We call this shareable intermediate state BIR (Ballerina Intermediate Representation). Essentially, the compiler front end creates this BIR. It compiles Ballerina source files down to the BIR format, which then gets serialized and fed into the back-end implementations.

The back end reads this serialized byte stream and compiles that down to relevant target platforms (JVM bytecode, machine code, etc). This separation has given us the flexibility to bootstrap our back end implementations, as the back ends are a separate entity that reads a byte stream and works on top of its own tree structure.


So, let’s dig deep into the compiler front end first.

The compiler front end has two types of inputs. One is *.bal source files, and the other is *.bir files (when we import some already compiled modules). But the semantic analysis phases should only run for the *.bal source input; there’s no need to run them on top of BIR, as they have already run when the BIR was created. So, what do we need from BIR in the compiler front end, and how have we achieved that?

From BIR, the compiler front end only needs details concerning top-level components that are accessible from other packages: specifically, top-level symbols and their types only. The same BIR is also fed into the compiler’s back end. When it comes to the compiler’s back end, we need more from the BIR than simply top-level symbols, like method bodies, type bodies, etc.

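To make this concrete, the module in question could look something like the following. (This is an illustrative sketch; only the names pubFunc and localVar come from the discussion below, and the type definition is invented.)

```ballerina
// A module with one top-level type definition and one public function.
type Person record {
    string name;
    int age;
};

public function pubFunc() returns int {
    int localVar = 10; // local to pubFunc; invisible outside this module
    return localVar;
}
```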
I’ll explain this with the above example, which has a type definition and a function. Here, what we need at the compiler’s front end are top-level types and the public function signature because they can be accessed from other modules that depend on this module. But it doesn’t need to know about the local variable localVar, as it cannot be accessed from outside this module (outside pubFunc to be exact). But, when it comes to back end code generation, it needs to know about the type definition, function, and the body of the function.

This means we have to read BIR files to achieve two goals. One is to generate the relevant bytecode at the compiler’s back end (which needs to read the full BIR). The other is to recreate the required top-level symbols in the compiler’s front end. So, we have optimized the serialized BIR with these goals in mind. (I think that is a separate discussion entirely; for the moment, what we need to know is that with BIR, we are capable of recreating the relevant symbols in the compiler’s front end and generating bytecode in the back end very efficiently.)

So, let’s look at how the compiler’s front end does its magic.

The compiler’s front end is again broken into multiple phases, and we have used the visitor pattern to implement these phases, which made it possible for us to keep all the logic in a single place per phase without scattering it across the node classes. Below is a diagram depicting those phases in the front end.

Compiler front-end breakdown (Note: only the major phases are included)

The first phase is the ANTLR parser, which we use to parse the BAL file and generate an AST (Abstract Syntax Tree). To do this, we use a lexer and a parser generated by ANTLR, but the AST is our own (not an ANTLR-generated generic tree), which we build by listening to ANTLR events. The topmost node in this AST is called a module, and a module has child nodes that are called top-level nodes.

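As a sketch, the example code could be along these lines (illustrative; only the names globalVar, testFunc, and localVar come from the discussion below):

```ballerina
int globalVar = 100;  // top-level node: global variable definition

function testFunc() { // top-level node: function definition
    int localVar = 5; // local variable, not a top-level node
}
```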
So, if we take the above example code, which has a global variable definition and a function, the function, testFunc, and the global variable, globalVar, are top-level nodes, whereas localVar is only a local variable to the function.

For each relevant node, we create symbols (lightweight representations of nodes that only contain the details relevant for type checking and linking) and define them in the relevant scopes. To explain this a bit more: we have scopes for each block level, so a module has its own scope, a function has its own scope, an if body has its own scope, etc. These scopes are chained, so every child scope has a link to its parent scope. When looking up a symbol, we search through the scope chain starting from the current scope and go up until the symbol is found.
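The chained-scope lookup can be sketched in Java (the front end’s implementation language) as follows. This is a minimal illustration with invented class names, not the actual compiler classes.

```java
import java.util.HashMap;
import java.util.Map;

// A scope holds its own symbols plus a link to the enclosing scope.
class Scope {
    private final Scope parent; // null for the module (outermost) scope
    private final Map<String, String> symbols = new HashMap<>(); // name -> symbol info

    Scope(Scope parent) { this.parent = parent; }

    void define(String name, String symbol) { symbols.put(name, symbol); }

    // Walk the scope chain from the current scope up until the symbol is found.
    String lookup(String name) {
        for (Scope s = this; s != null; s = s.parent) {
            String found = s.symbols.get(name);
            if (found != null) return found;
        }
        return null; // not found in any enclosing scope
    }
}

public class ScopeDemo {
    public static void main(String[] args) {
        Scope module = new Scope(null);
        module.define("globalVar", "module-level int");
        Scope function = new Scope(module);
        function.define("localVar", "function-level int");

        // Lookups fall through to the parent scope when needed.
        System.out.println(function.lookup("localVar"));  // function-level int
        System.out.println(function.lookup("globalVar")); // module-level int
        System.out.println(module.lookup("localVar"));    // null
    }
}
```

Because the chain is walked innermost-first, a local variable naturally shadows a global with the same name.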

Let’s take an example to explain this compilation flow. Look at the following sample BAL code.
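A sketch consistent with the walk-through that follows could be (the function bodies are invented for illustration; the names a, test, and getA appear in the later discussion):

```ballerina
int a = getA();   // global variable initialized via a function defined further down

function test() {
    int b = a + 1;
}

function getA() returns int {
    return 10;
}
```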

The parser will generate an AST as follows for the above code:

AST (Note — This only has the high-level tree for clarity)

Then, this tree will go into the symbol enter phase, where we iterate through the top-level nodes and define their symbols in the module scope. In this example, we have two such symbols: one global variable named “a” and one function named “test”. In this phase, we don’t do any validations; we only define symbols in the scope. The idea behind that is to allow cyclic references and to fix ordering issues. An example of an ordering issue goes as follows:

As we have to iterate through the top-level nodes, at the time we go through the first global variable definition node, the getA function is not yet defined in the scope. If we tried to validate the statement at that point, we would encounter a symbol-not-found issue for the getA function. This phase is about symbol entry only.

The next phase is semantic analysis. Here, we analyze each statement. When it comes to statements, almost all the statement nodes have two parts, LHS and RHS, which are separated by an equal sign.

When it comes to analyzing the global variable definition, we first analyze its LHS and find the relevant type. In this case, it is int. Then, what we do is we analyze the RHS expression and get the RHS type. Then, we type check whether RHS type is assignable to the LHS type. That is the whole point of our semantic analyzer — of course, there are other complex scenarios like type inference, symbol linking, etc., but essentially, this is what we do.
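The core check can be sketched like this. It uses a toy assignability rule purely for illustration; the real Ballerina type system is far richer (unions, records, inference, etc.).

```java
// Minimal sketch (assumed, not the real compiler API) of the check the
// semantic analyzer performs for a statement like `int a = <expr>;`.
public class TypeCheckDemo {
    enum BType { INT, STRING, ANY }

    // Toy rule: everything is assignable to ANY; otherwise types must match.
    static boolean isAssignable(BType rhsType, BType lhsType) {
        return lhsType == BType.ANY || rhsType == lhsType;
    }

    public static void main(String[] args) {
        // int a = 10;      -> RHS type INT, LHS type INT   -> ok
        System.out.println(isAssignable(BType.INT, BType.INT));    // true
        // int a = "hello"; -> RHS type STRING, LHS type INT -> semantic error
        System.out.println(isAssignable(BType.STRING, BType.INT)); // false
    }
}
```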

Now, it comes to the code analysis phase. In this phase, we do several validations such as:

  • Loop continuation statement validation.
  • Function return validation/unreachable code validation.
  • Worker send/receive validation.
  • Experimental feature usage validation.

I will explain here how we do return validation for the following code.
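As a hedged illustration, the code could be an if-else ladder like this:

```ballerina
function sign(int value) returns string {
    if (value < 0) {
        return "negative";
    } else if (value == 0) {
        return "zero";
    } else {
        return "positive";
    }
}
```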

First of all, I have to explain how we model this if-else ladder in our AST. Our model only has if and else parts, so the AST we have here is something like the following diagram:

AST for if-else code block

So, the logic is: for an if-else ladder to be considered as returning on all paths, both the if and the else branches should return.

We use a boolean, stmtReturns, to keep track of whether each path returns. For example, at the beginning of the function, stmtReturns is false. Then, we analyze statement-by-statement in which we get the if-else condition. Here, we analyze the if body and mark the stmtReturns variable accordingly. Then, we keep track of the current state of the stmtReturns variable, reset the variable, and analyze the else body. Now, we have two states for both the if and else paths.

Then, we set the final value for the stmtReturns variable accordingly. (If both paths are true, then the final value is true. Otherwise, the final value is false.) This goes on recursively, depending on the depth of the if-else ladder. It then goes on iteratively for every statement in the function body. At the end of each function, we can determine whether all the paths have return statements or not by looking at the value of the stmtReturns variable.
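The stmtReturns bookkeeping described above can be sketched as follows. The node classes are invented for illustration and are not the compiler’s real AST types.

```java
import java.util.List;

public class ReturnAnalyzer {
    interface Stmt {}
    static class ReturnStmt implements Stmt {}
    static class OtherStmt implements Stmt {}
    static class IfElse implements Stmt {
        final List<Stmt> ifBody, elseBody;
        IfElse(List<Stmt> ifBody, List<Stmt> elseBody) {
            this.ifBody = ifBody;
            this.elseBody = elseBody;
        }
    }

    // Returns true only when every path through the statement list returns.
    static boolean analyze(List<Stmt> body) {
        boolean stmtReturns = false; // false at the start of each block
        for (Stmt s : body) {
            if (s instanceof ReturnStmt) {
                stmtReturns = true;
            } else if (s instanceof IfElse) {
                IfElse ie = (IfElse) s;
                boolean ifReturns = analyze(ie.ifBody);     // analyze the if body
                boolean elseReturns = analyze(ie.elseBody); // reset, analyze the else body
                // The ladder returns only if both paths return.
                stmtReturns = stmtReturns || (ifReturns && elseReturns);
            }
        }
        return stmtReturns;
    }

    public static void main(String[] args) {
        // if { return } else { return } -> all paths return
        Stmt allPaths = new IfElse(List.<Stmt>of(new ReturnStmt()),
                                   List.<Stmt>of(new ReturnStmt()));
        System.out.println(analyze(List.of(allPaths))); // true
        // if { return } else { ... }   -> the else path falls through
        Stmt missing = new IfElse(List.<Stmt>of(new ReturnStmt()),
                                  List.<Stmt>of(new OtherStmt()));
        System.out.println(analyze(List.of(missing)));  // false
    }
}
```

The recursion mirrors the depth of the if-else ladder, exactly as described in the paragraph above.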

Next comes the data flow analyzer, which is essentially responsible for detecting uses of uninitialized variables. Here, we keep track of each variable (variable symbol) and its state (uninitialized or not) in a hashmap. At this point, the variable definition and its usages are already linked through the variable symbol. So, what we have to do here is simple: we iterate through the statement list. When we find a variable definition, we put the variable symbol into the hashmap with its state. (If the variable definition has an RHS expression, that means the variable is initialized.) Once we find a variable reference, we look in the hashmap for the symbol and check its state.

If it is an RHS reference and the current variable state is uninitialized, then that’s a semantic failure. But if the reference is an LHS reference, then we can update the variable state in the hashmap to initialized. This becomes a bit complex when it comes to if-else ladders. There, we keep a separate hashmap per block to track states and, at the end, merge them accordingly. If one path returns, there’s no need to merge; we simply take the other path. Otherwise, we merge them.
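The per-branch maps and the merge rule can be sketched as follows (a minimal illustration with invented names):

```java
import java.util.HashMap;
import java.util.Map;

public class DataflowDemo {
    enum State { UNINIT, INIT }

    // Merge the variable states of the if and else branches: a variable is
    // initialized after the if-else only if BOTH branches initialized it.
    static Map<String, State> merge(Map<String, State> ifStates,
                                    Map<String, State> elseStates) {
        Map<String, State> merged = new HashMap<>();
        for (Map.Entry<String, State> e : ifStates.entrySet()) {
            State inIf = e.getValue();
            State inElse = elseStates.getOrDefault(e.getKey(), State.UNINIT);
            merged.put(e.getKey(),
                       (inIf == State.INIT && inElse == State.INIT)
                           ? State.INIT : State.UNINIT);
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, State> ifBranch = new HashMap<>();
        ifBranch.put("x", State.INIT);   // x assigned in the if branch
        ifBranch.put("y", State.INIT);   // y assigned in the if branch only
        Map<String, State> elseBranch = new HashMap<>();
        elseBranch.put("x", State.INIT); // x assigned in the else branch too

        Map<String, State> after = merge(ifBranch, elseBranch);
        System.out.println(after.get("x")); // INIT   -> safe to read afterwards
        System.out.println(after.get("y")); // UNINIT -> reading y is an error
    }
}
```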

The next phase is the taint analyzer. I think it deserves a separate topic, so I’m going to skip it in this post. (I will try to do another post covering what we do in the taint analyzer as well as how we do it.) For now, just think of it as another phase that does some more validations on the AST.

The next phase is the desugar phase. This is important, as we change the original AST here. In our compiler’s front end, this is the first place where we rewrite the AST. The reason is that the AST just before this phase is the one used in our language server plugins. Those plugins need a semantically validated AST (so that they can show semantic errors as well), and that AST should accurately represent the BAL code, as the plugins need to be able to regenerate BAL code from the AST.

For example, suppose we have the following tuple destructure statement in the BAL file.
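An illustrative statement (the variable names and values are invented):

```ballerina
// Destructure a tuple value into two variables in one statement.
[int, string] [a, b] = [10, "hello"];
```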

We rewrite the above statement to something like below:
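A hedged sketch of the rewritten form (the $tuple$ temporary is an invented name; the compiler generates its own internal names):

```ballerina
// The tuple is first stored in a compiler-generated temporary variable,
// then each member becomes a simple index-based access.
[int, string] $tuple$ = [10, "hello"];
int a = $tuple$[0];
string b = $tuple$[1];
```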

What we do here is create new AST nodes and use them in place of the previous nodes.

If I take another example with a lock statement, before desugar it looks like the following:
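For illustration, a lock statement in source form could be:

```ballerina
int counter = 0;

function increment() {
    lock {
        counter = counter + 1;
    }
}
```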

After the desugar phase, it will become something like the following:
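A hedged, pseudo-Ballerina sketch of the rewritten form (the lock/unlock lines below stand for internal compiler nodes with no real surface syntax, and incrementBody is an invented placeholder for the original lock body):

```ballerina
function increment() {
    lock counter;                      // internal node: acquire the lock on counter
    var result = trap incrementBody(); // run the body, trapping any panic
    unlock counter;                    // internal node: release the lock either way
    if (result is error) {
        panic result;
    }
}
```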

As you can see, we also create nodes that are not expressible in the actual language syntax.

Now comes another important phase called the BIRGen phase, where we convert the current AST to a BIR tree. So, what is the difference between AST and BIR?

For clarity, I will use a sample and explain that.
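The sample is a small function with an if-else (a hedged sketch; the exact code in the original example may differ):

```ballerina
function classify(int value) returns string {
    if (value > 20) {
        return "big";
    }
    return "small";
}
```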

For the above BAL code, the AST would look something like the following:

AST for a function with if-else
The BIR for that would look like the following:

BIR representation of the above tree

As you can see, the BIR tree is friendlier in the sense of data flow. Where the AST is just a node tree, a BIR tree can be used to do data flow analysis easily. (We had plans to move our current data flow analyzer on top of the BIR tree; however, we haven’t done that yet.) A BIR tree is basically a tree of basic blocks (marked as BB0, BB1, etc.). Within a basic block, what we have is a list of instructions and one terminator. Even though the terminator is also an instruction, it is a terminating instruction for the basic block (the last instruction of the block). For example, CONSTANT_LOAD is a normal instruction, whereas RETURN and BRANCH instructions are terminators.
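A hedged textual sketch of such a basic-block structure (the notation is invented for illustration; only the BB labels and the instruction names CONSTANT_LOAD, BRANCH, and RETURN come from the discussion above):

```
BB0:
    %0 = CONSTANT_LOAD 20          // instruction
    %1 = value > %0                // instruction
    BRANCH %1 -> BB1 | BB2         // terminator: picks the next basic block
BB1:
    %2 = CONSTANT_LOAD "big"
    RETURN %2                      // terminator
BB2:
    %3 = CONSTANT_LOAD "small"
    RETURN %3                      // terminator
```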

The next phase is the BIR optimizer, which optimizes away unwanted instructions and temporary variables. The BIR we have at the end is platform-neutral and generic. Now, we serialize this BIR with our own serializer and feed the serialized byte array to the back-end code gen. I’m not going to explain back-end code generation in detail because the post is already too long :).

But, to give you a general idea, we read the serialized byte array and build the BIR model again. Then, we feed that to the code generator. (At the moment, we have two types of code generators: one generating direct machine code and the other generating JVM bytecode; here, what I’m referring to is the JVM bytecode generator, as the machine code generator is still not in a releasable state.) From there, the code generator produces JVM bytecode by walking the BIR tree. This whole phase is written in Ballerina itself.

I hope this is enough for a single post and is helpful enough for people who are eager to learn about compilers and particularly interested in Ballerina. 🙂
