Back

Generating and Interpreting Bytecode for MITScript — Using Rust

05-03-2017 Languages

Well, this was my first foray into the world of Rust, the systems language that is Mozilla’s precious baby. And what better way to learn this hip new language than to write an MITScript bytecode interpreter for Computer Language Engineering?

Rust was not gentle for this first-time developer. It does so much to protect you that my first attempt writing a few hundred lines of code resulted in the same number of compiler errors, and I needed a lot of help from my team to just get anything to compile. For this reason, it’s not great for iterating quickly if you aren’t very experienced already. But, I’m licking my chops at the fact that the end result will be much safer, and hopefully faster, than our classmates’ C++ compilers. This post will be about the struggles I encountered as a Rust newbie, as well as the fun of generating/interpreting MITScript bytecode.

Fortunately, I had some great Rust developers on my team – Jason and James Gilles, who have been avid Rustaceans for years.

If you have ever seen bytecode, you’ll expect to see some non-human-readable hex bytes that stand for instructions. Instead, we have a strange text-based bytecode consisting of the following instructions:

Load/store instructions:

LoadConst    
LoadFunc     
LoadLocal    
StoreLocal   
LoadGlobal   
StoreGlobal  
PushReference
LoadReference   
StoreReference  
AllocRecord     
FieldLoad    
FieldStore   
IndexLoad       
IndexStore

Closure functions:

1
2
3

AllocClosure    
Call    
Return

Binary/unary ops:

Add     
Sub     
Mul     
Div     
Neg     
Gt      
Geq     
Eq      
And     
Or      
Not

Control flow:

1
2

Goto
If

Stack manipulation:

1
2
3

Dup     
Swap    
Pop

And the full bytecode is a series of nested functions, like

function
{
  functions =
  [
    function
    {
      functions =[
        function
         {
          functions  = [],
          constants  = [],
          parameter_count = 1,
          local_vars = [z],
          local_ref_vars = [],
          free_vars  = [y],
          names = [x],
          instructions = [
            ...
             return
          ]
         }
      ],
      constants = [None, 1],
      parameter_count = 1,
      local_vars = [y, g],
      local_ref_vars = [y],
      free_vars = [],
      names = [],
      instructions = [
        ...
        return
       ]
    }
  ],
  constants = [1, None],
  parameter_count = 1,
  local_vars = [],
  local_ref_vars = [],
  free_vars = [],
  names = [x, f],
  instructions =
  [
    ...
  ]
}

Some things to note:
A free_var is a variable from a parent scope that’s accessed in the current function.

A closure contains a function, as well as a list of references to the free variables in the nested function. So, when we make a call, we need to clone all the closure’s free variable references into the newly created frame.

Pieces of the puzzle

We split up the work into three parts. James quickly finished the bytecode parser, using the Rust library lalrpop. Jason and I worked on (last minute) the bytecode generation and interpreting, respectively.

Each bytecode function was modeled as a Function struct.

I also have a Frame structure that, as in the AST interpreter from Lab 2, contains the context for the currently executing closure. Most importantly, there is an operand stack onto which references, values, closures, etc. are pushed and popped as opcodes are executed. The HeapValue enum lists them all:

type FrameSlot = Gc<RefCell<HeapValue>>;

pub enum HeapValue {
    Reference(FrameSlot),
    Record(Gc<RefCell<Record>>),
    Function(Gc<Function>),
    Closure(Gc<Closure>),
    // note: oddly enough, foreign functions behave more like
    // closures, since you can actually call them
    ForeignFunction(ForeignFunction),
    String(String),
    Integer(i64),
    Boolean(bool),
    None,
}

References have a layer of indirection provided by the Gc<RefCell<>> construct in Rust. Gc is just an alias for Rc. In Rust, Rc is a reference-counted pointer, and RefCell is a wrapper for a mutable value. RefCell introduces something called “interior mutability”, which allows us to modify an otherwise-immutable location like a field in a struct. We can use borrow or borrow_mut to get a pointer to the value contained in the RefCell.

The bytecode generator

Jason did this part almost entirely himself, and had to wrangle with some tricky scoping rules that I still don’t entirely understand.

One key function is lookup_variable, which determines whether a variable is a reference, local, or global.

/// Determines what scope a variable's value should be found in
pub fn lookup_variable(&mut self, name: &str) {
  if self.globals.contains(name) {
            self.map.insert(name.to_owned(), Location::Global);
        } else if self.locals.contains(name) {
            self.map.insert(name.to_owned(), Location::Local);
        } else if self.lookup_ref(name) {
            // We found the variable in a parent scope
            self.free.push(name.to_owned());
            self.map.insert(name.to_owned(), Location::Reference);
        } else if self.lookup_global(name) {
            // Found the var in one of the active globals
            self.globals.push(name.to_owned());
            self.map.insert(name.to_owned(), Location::Global);
        } else {
            panic!("Uninitialized variables {}", name);
        }
    }

lookup_ref also propagates up to parents, pushing a reference variable in parent scopes, because it has to be in each parent scope to be copied down to the current scope! To be more clear, say we have nested functions f1, f2…f100; and f100 uses some variable x that is local to f3. f4…f99 must all have x as a free variable, so when we see that x is being used in f100, lookup_ref will add it as a free variable for all parents.

Next, we actually need to start thinking about performance! We’ll write a garbage collector for our compiler in Lab 4.