(Re)writing an interpreter in Rust

Published by Danny van Kooten on November 17, 2022. Permalink

Two years ago I wrote my first interpreter for a toy programming language called Monkey, in C.

It works and is pretty fast, but I remember plenty of frustration with segfaults and hard-to-track-down memory leaks as soon as I introduced heap-allocated values.

Much of that was undoubtedly because I was not very experienced with C. But, and I know this for a fact now, it was also because C makes it very easy for issues like this to pop up.

Rewrite it in Rust

One of my long-time friends has been working hard on a fast multi-threaded DataFrame library in Rust. I wanted to contribute something, so I started working on a CLI interface for it.

Working through the codebase made me realise that if I wanted to contribute in a more meaningful way, I first had to work on my Rust skills. I was already comfortable enough with Rust to solve Advent of Code puzzles, but I had yet to really get in a good fight with the borrow checker. I needed a bigger project to really get rusty.

What better way to practice than to build an interpreter? It’s fun, and you can go deep trying to make it fast. Shall we try to make it at least as performant as the one I wrote in C?

As our benchmark program, we’ll measure a very inefficient recursive Fibonacci implementation:

function fib(n) {
    if n < 2 {
        return n;
    }

    return fib(n-1) + fib(n-2);
}

fib(35)

Let’s take a quick look at the times to beat, using Hyperfine:

hyperfine \
    -n pepper-tree-walker "pepper --tree-walker fib35.pr" \
    -n pepper-vm "pepper --vm fib35.pr" \
    -n python-3.10 "python fib35.py"  \
    --runs 10 --export-markdown /tmp/hf.md && cat /tmp/hf.md

Command	Mean [s]	Min [s]	Max [s]	Relative
`pepper-tree-walker`	3.791 ± 0.031	3.755	3.816	5.00 ± 0.13
`pepper-vm`	0.758 ± 0.018	0.738	0.773	1.00
`python-3.10`	2.033 ± 0.008	2.025	2.041	2.68 ± 0.07

That is 3.8 seconds for a tree-walking implementation, or under 1 second when compiled to bytecode and executed inside a virtual machine. I also included CPython 3.10, which uses bytecode too but is not known to be particularly fast.

Now, let’s see if we can beat these times using Rust.

First benchmarks using a tree-walking interpreter

Fast forward some time, and we now have a first draft of a working interpreter that successfully lexes, parses, and evaluates the Fibonacci program above. It’s called Nederlang, and you can try it in your browser. Thanks, WASM!

We’ll start our optimization work from this commit.

The source code is first turned into tokens and then parsed into an Abstract Syntax Tree (AST), where each node is a variant of the Expr enum:

enum Expr {
    Int(ExprInt),
    Float(ExprFloat),
    Bool(ExprBool),
    String(ExprString),
    Infix(ExprInfix),
    Prefix(ExprPrefix),
    If(ExprIf),
    Identifier(String),
    Function(ExprFunction),
    Call(ExprCall),
    Assign(ExprAssign),
    Declare(ExprDeclare),
    Block(Vec<Expr>),
}

struct ExprInt {
    value: i64,
}

struct ExprInfix {
    left: Box<Expr>,
    operator: Operator,
    right: Box<Expr>,
}

// etc..

When running the program, we’ll walk this tree and evaluate each expression. Some of these expressions mutate variables stored inside the Environment type:

struct Environment<'a> {
    symbols: RefCell<HashMap<String, Object>>,
    outer: Option<&'a Environment<'a>>,
}

When resolving a variable by name, we look it up in the symbols HashMap, traversing upwards to the outermost Environment until we have a match.

Now we’re ready to run the first benchmark. We’ll build a release binary and time the Fibonacci program with Hyperfine. ~drum roll~

cargo build --release && hyperfine --runs 10 'target/release/nederlang fib.nl'

Command	Mean [s]	Min [s]	Max [s]
`nederlang fib.nl`	39.292 ± 0.309	39.037	39.636

39 seconds. Ouch. I know tree walking is not supposed to be fast, but given that the C implementation manages it in under 5 seconds, surely we should be able to come close to that in Rust?

Optimizing Rust code for performance

Let’s take a look at where all this time is spent by asking perf to record a call graph:

cargo build --release && perf record --call-graph dwarf target/release/nederlang fib.nl && perf report

100.00%     8.34%   [.] nederlang::eval::eval_expr
100.00%     0.00%   [.] nederlang::eval::eval_infix_expr (inlined)
 99.94%     2.10%   [.] nederlang::eval::eval_block
 99.69%     0.00%   [.] nederlang::eval::eval_if_expr (inlined)
 59.32%    15.26%   [.] nederlang::eval::Environment::resolve
 27.53%     0.00%   [.] std::collections::hash::map::HashMap<K,V,S>::get (inlined)
 27.53%     0.00%   [.] hashbrown::map::HashMap<K,V,S,A>::get (inlined)
 27.49%     0.00%   [.] hashbrown::map::HashMap<K,V,S,A>::get_inner (inlined)

Right away, it shows that a whopping 59% of time is spent in Environment::resolve, which resolves variables by name. Let’s optimize that.

Using a faster HashMap implementation

What if we switch to a faster, non-cryptographic HashMap implementation like fxhash?

We add the dependency to Cargo.toml:

[dependencies]
fxhash = "0.2.1"

Then we import the new HashMap type under an alias, so the rest of our code can remain untouched:

use fxhash::FxHashMap as HashMap;

Re-running the same release build and Hyperfine benchmark shows what this change does:

Command	Mean [s]	Min [s]	Max [s]
`nederlang fib.nl`	32.284 ± 0.337	31.953	32.627

Down to 32 seconds, an 18% performance improvement. Not bad, but still slow. We need to rethink how we store and resolve variables.

Using a Vec instead of a Hashmap

In Nederlang there is a global scope and a local scope. Each function call creates a new local scope, so every variable declared inside that function is dropped once the function returns.

We could represent that as a Vec<HashMap<String, Object>>, pushing a new HashMap before evaluating a function body and then popping it afterwards.

But in a typical program, there will only be a handful of variables per scope. What if we drop the HashMap entirely and use a Vec<Vec<(String, Object)>> instead?

The look-up time will be O(n) instead of O(1), but since there are only a handful of variables to iterate over, I have a feeling it will be faster than going through a hash function.

type Scope = Vec<(String, Object)>;

struct Environment {
    scopes: Vec<Scope>,
}

impl Environment {
    fn resolve(&self, ident: &str) -> Object {
        for scope in self.scopes.iter().rev() {
            for (name, value) in scope {
                if name == ident {
                    return value.clone();
                }
            }
        }

        Object::Null
    }
}

You can see the entire commit here. Let’s see what this change does to our benchmark times.

Command	Mean [s]	Min [s]	Max [s]
`nederlang fib.nl`	23.883 ± 0.129	23.743	23.998

24 seconds. A 25% performance improvement! Just 80% more to go…

Separating names from their values

100.00%    17.41%  [.] nederlang::eval::eval_expr
100.00%     0.00%  [.] nederlang::eval::eval_infix_expr (inlined)
100.00%     4.24%  [.] nederlang::eval::eval_block
100.00%     0.00%  [.] nederlang::eval::eval_if_expr (inlined)
 47.63%     5.58%  [.] nederlang::eval::Environment::resolve

48% of the time is still spent inside Environment::resolve. What else can we do to speed this up?

To resolve a variable value by name, it currently iterates over several separately allocated Vec instances holding tuples of variable names and values.

Changing the language specification of Nederlang to prevent function names from being shadowed would allow us to speed up resolving functions by name. That way, we can start by looking at the outermost global scope and only then traverse the inner scopes. A cool trick, but not really Rust-related, so let’s think of what else there is.

What about using a single Vec<(String, Object)> then? If we store the number of declared variables before evaluating the function body, we can call Vec::truncate() afterwards to get rid of all variables declared in that function.

While we’re at it, let’s use a separate Vec for names and values. That should speed up resolving variables by name, since we’re only interested in the value of the variable we’re looking for.

struct Environment {
    names: Vec<String>,
    values: Vec<Object>
}

impl Environment {
    fn new() -> Self {
        Environment {
            names: Vec::with_capacity(64),
            values: Vec::with_capacity(64)
        }
    }

    fn resolve(&self, ident: &str) -> Object {
        if let Some(pos) = self.names.iter().rev().position(|name| *name == ident) {
            return self.values[self.values.len() - 1 - pos].clone();
        }

        Object::Null
    }

    fn insert(&mut self, ident: String, value: Object) {
        self.names.push(ident);
        self.values.push(value);
    }
}

A nice benefit is that we can now start pushing values into the environment while still evaluating the argument expressions. This works because, without a corresponding element in the names vector, they will not be resolved if an argument expression refers to a variable with the same name from another scope.

// In eval_call_expr(...)
let var_length = env.names.len();
for value_expr in &expr.arguments {
    let value = eval_expr(value_expr, env)?;
    env.values.push(value);
}
for name in parameters {
    env.names.push(name);
}

let result = eval_block(&body, env);
env.names.truncate(var_length);
env.values.truncate(var_length);
return result;

Are we faster than the C implementation yet?

Command	Mean [s]	Min [s]	Max [s]
`nederlang fib.nl`	21.415 ± 0.112	21.308	21.578

And the profile provided by perf:

100.00%    16.80%   [.] nederlang::eval::eval_expr
100.00%     0.00%   [.] nederlang::eval::eval_infix_expr (inlined)
100.00%     4.73%   [.] nederlang::eval::eval_block
100.00%     0.00%   [.] nederlang::eval::eval_if_expr (inlined)
 49.90%     4.84%   [.] nederlang::eval::Environment::resolve
 46.16%     0.00%   [.] <nederlang::object::Object as core::clone::Clone>::clone (inlined)
 41.64%     5.54%   [.] <alloc::vec::Vec<T,A> as core::clone::Clone>::clone
 41.51%     0.00%   [.] alloc::slice::<impl [T]>::to_vec_in (inlined)
 41.51%     0.00%   [.] alloc::slice::hack::to_vec (inlined)
 41.51%     0.00%   [.] <T as alloc::slice::hack::ConvertVec>::to_vec (inlined)
 40.23%    19.38%   [.] <nederlang::ast::Expr as core::clone::Clone>::clone

The good news is that we’re down to 21 seconds now, so resolving variables did indeed get faster.

The bad news is that we can no longer ignore the obvious problem: all these calls to clone(). We’re cloning a bunch of Expr and Vec types inside the hot path.

Reference nodes in the AST vs. cloning

The reason is that Nederlang’s object type looks like this:

enum Object {
    Null,
    Int(i64),
    Float(f64),
    Bool(bool),
    String(String),
    Func(Vec<String>, Vec<Expr>),
}

The Object::Func variant owns two Vec types. These hold the parameter names and the function body. But wait, can’t we keep the AST around in memory and use that instead? That way, we only have to store a reference inside our Object::Func variant.

Being lazy and coming from C, let’s first get a feel for what kind of performance gain we can expect by simply storing a raw pointer and dereferencing it later. Don’t worry, I promise to fix it properly later on.

// Storing the raw pointer
enum Object {
    ...
    Func(*const ExprFunction),
}

// Creating the raw pointer
Object::Func(expr as *const ExprFunction)

// Dereferencing the raw pointer
let func = unsafe {
    &*(ptr as *const ExprFunction)
};

Here’s the full commit. Let’s run another benchmark now.

Command	Mean [s]	Min [s]	Max [s]
`nederlang fib.nl`	4.222 ± 0.059	4.177	4.322

Down to 4.2 seconds. That’s the 80% improvement we needed! Surely that is worth introducing the lifetime constraints and sleeping sound at night knowing our code is safe.

- fn eval_expr(expr: &Expr, env: &mut Environment) -> Result<Object, Error> {
+ fn eval_expr<'a>(expr: &'a Expr, env: &mut Environment<'a>) -> Result<Object<'a>, Error>

1337 lifetime constraints later…

Now that we’re making sure the AST sticks around in memory for longer than the Environment type, why not use string references for the variable names too? We’ve got the lifetimes set up now anyway, and it should get rid of some String clones.

struct Environment<'a> {
-   names: Vec<String>,
+   names: Vec<&'a str>,
    values: Vec<Object<'a>>
}

Command	Mean [s]	Min [s]	Max [s]
`nederlang fib.nl`	3.699 ± 0.015	3.685	3.720

Another 12% faster and now on par with the tree-walking interpreter written in C. Hurray! Let’s keep going though; there’s way more optimization to do.

Representing dynamically typed values

Now that we’ve gotten rid of most allocation-related performance hogs, it’s time to look at the Object type.

enum Object<'a> {
    Null,
    Int(i64),
    Float(f64),
    Bool(bool),
    String(String),
    Func(&'a ExprFunction),
}

Looks good to me. Let’s see how it’s laid out in memory using the -Zprint-type-sizes flag for rustc:

RUSTFLAGS=-Zprint-type-sizes cargo +nightly build --release

Finding our Object type in the result:

print-type-size type: `object::Object<'_>`: 32 bytes, alignment: 8 bytes
print-type-size     discriminant: 1 bytes
print-type-size     variant `String`: 31 bytes
print-type-size         padding: 7 bytes
print-type-size         field `.0`: 24 bytes, alignment: 8 bytes
print-type-size     variant `Int`: 15 bytes
print-type-size         padding: 7 bytes
print-type-size         field `.0`: 8 bytes, alignment: 8 bytes
print-type-size     variant `Float`: 15 bytes
print-type-size         padding: 7 bytes
print-type-size         field `.0`: 8 bytes, alignment: 8 bytes
print-type-size     variant `Func`: 15 bytes
print-type-size         padding: 7 bytes
print-type-size         field `.0`: 8 bytes, alignment: 8 bytes
print-type-size     variant `Bool`: 1 bytes
print-type-size         field `.0`: 1 bytes
print-type-size     variant `Null`: 0 bytes

32 bytes?! What does the Rust reference have to say about enumerated types?

Any enum value consumes as much memory as the largest variant for its corresponding enum type, as well as the size needed to store a discriminant.

That explains it. The largest variant is Object::String, holding a String type of 24 bytes. The discriminant takes up a single byte, but because of alignment, it adds another 7 bytes of padding.

Using 32 bytes while most of our variants could theoretically fit into just 8 bytes does not sound optimal. What can we do to shrink it?

One option is to Box the value of the Object::String variant, but that only shrinks it to 16 bytes. Can we somehow make the value plus a type discriminant fit in 8 bytes?

Sure! We can do pointer tagging in Rust.

Shrinking our Object type using pointer tagging

Trigger warning: unsafe Rust ahead!

Right now, our Object type owns the String value it holds, which means a lot of cloning just passing objects of this variant around.

We should probably allocate our String objects elsewhere and have our Object store a reference, or pointer, instead. That also means we need some kind of garbage collection to manage that memory for us, but I will happily ignore that for this post as we’re not really working with any heap-allocated values in our recursive Fibonacci program anyway.

What if we get our Object type to look something like this?

enum Object {
    Null,
    Bool(bool),
    Int(i64),
    String(*const String),
}

This way, the size of our Object will be 16 bytes: 8 bytes for the i64 or raw pointer, and another 8 bytes for the enum discriminant. But what if we store the discriminant inside the pointer or value?

Because of memory alignment, memory addresses on 64-bit architectures will also be byte-aligned. This leaves the 3 least significant bits unused, as these will always be 0.

There are currently only 6 different types of values in Nederlang, so 3 bits are enough to store our type information.

The new Object type will be a thin wrapper around a raw pointer:

struct Object(*mut u8);

Let’s use a separate enum for a Type tag:

#[repr(u8)]
enum Type {
    // The types below are all stored directly inside the pointer
    Null = 0b000,
    Int,
    Bool,
    Function,

    // The types below are all heap-allocated
    Float,
    String,
}

We’ll introduce some helper methods for creating a new object with a given type and retrieving just the type tag.

impl<'a> Object {
    /// Creates a new object from the value (or address) with the given type mask applied
    fn with_type(raw: *mut u8, t: Type) -> Self {
        Self((raw as usize | t as usize) as _)
    }

    fn get_type(self) -> Type {
        // Safety: self.0 with TAG_MASK applied will always yield a correct Type
        unsafe { std::mem::transmute((self.0 as usize & 0b111) as u8) }
    }
}

Here’s a method to create an Object holding an integer value with the Type::Int tag.

impl<'a> Object {
    /// Create a new integer value
    fn int(value: i64) -> Self {
        // assert there is no data loss because of the shift
        debug_assert_eq!(((value << 3) >> 3), value);
        Self::with_type((value << 3) as _, Type::Int)
    }
}

As you can see, we’re losing 3 bits because of the type tag, so our maximum integer value will now be 2^60 instead of 2^63.

To later retrieve the integer value, we simply shift 3 bits to the right again, discarding the type tag.

impl<'a> Object {
    /// Returns the integer value of this object pointer
    /// The caller should ensure the object is of the correct type
    fn as_int(self) -> i64 {
        self.0 as i64 >> 3
    }
}

Storing and retrieving a raw pointer is very similar, except to retrieve the address we reset the lowest 3 bits to 0 instead of shifting.

impl<'a> Object {
    fn as_ptr(self) -> *mut u8 {
        (self.0 as usize & !0b111) as _
    }

    unsafe fn get<T>(self) -> &'a T {
        &*(self.as_ptr() as *const T)
    }
}

So now, if we have an Object, we can do the following to get a reference to the String value it points to:

assert_eq!(obj.get_type(), Type::String);
let str = obj.get::<String>();

You can view the complete tagged pointer implementation in this commit.

OK. Now our Object type has shrunk to just 8 bytes, meaning it fits into a register!

What does that yield us in terms of performance? Let’s re-run the same benchmark to find out.

Command	Mean [s]	Min [s]	Max [s]
`nederlang fib.nl`	2.093 ± 0.027	2.049	2.118

BOOM! 2.1 seconds, down from 3.7. A 43% performance improvement. Worth it if you ask me.

(Manually) inlining the hot path

We’ve squeezed most performance out of the tree walker by now, but there are still some things we can do. We can compile an optimized binary using profile guided optimization. While that shaved off another few percent for me, it feels a bit too much like cheating.

Another option is to manually instruct the compiler which functions to inline, trading binary size for runtime performance. Let’s take a look at the binary size before spraying [inline] all over our code.

ls -lh target/release/nederlang
-rwxr-xr-x 2 danny danny 2.1M Nov 22 09:57 target/release/nederlang*

And after inlining all of the hot path:

ls -lh target/release/nederlang
-rwxr-xr-x 2 danny danny 2.1M Nov 22 09:59 target/release/nederlang*

No change in size! Yet we know it worked because running perf now shows that all of our hot functions were inlined:

99.99%    84.80%  [.] nederlang::eval::eval_expr (inlined)
99.99%     0.00%  [.] nederlang::eval::eval_if_expr (inlined)
99.99%     0.00%  [.] nederlang::eval::eval_infix_expr (inlined)
99.99%     0.00%  [.] nederlang::eval::eval_block (inlined)
99.99%     0.00%  [.] nederlang::eval::eval_call_expr (inlined)
43.08%     0.00%  [.] nederlang::eval::Environment::resolve (inlined)

So… did it yield a significant performance improvement? Let’s ask Hyperfine.

Command	Mean [s]	Min [s]	Max [s]
`nederlang fib.nl`	1.873 ± 0.017	1.854	1.895

1.8 seconds, an 11% improvement. Free speed!

Further optimizations: bytecode compilation

That’s about as far as I want to go with this tree walker.

The most obvious remaining performance improvement will come from not evaluating the AST directly, but transforming it into CPU-cache-efficient bytecode, applying optimizations at compile time, and then executing the instructions inside a virtual machine.

My experience writing Rust versus C

In the past few weeks, I’ve grown considerably more comfortable writing Rust code, grasping lifetimes, and dealing with Rust’s tooling ecosystem.

I’ve now implemented and optimized a simple interpreted programming language in both C and Rust. I think it is easy to write performant C code, but very hard to write safe, leak-free C code. Especially for a newcomer.

Rust reverses that default. It is very easy to write safe, leak-free Rust code, yet a newcomer to the language might have to spend some time optimizing that code for performance.

This really only applies to newcomers like me, though. In hindsight, most of the performance optimizations in this post were pretty obvious and trivial to fix once I knew what to watch for.

I’ve prepared a GitHub branch containing the various optimizations described here, so you can look at all of the actual code. Note that the order of optimizations might differ slightly from the order in this post.