Home Article List

Write a Damn Virtual Machine

The most common question you see in programming forums is: What projects should I do? In the case of low-level programming, which is my typical space, the person asking this question is usually looking for projects related to low-level programming. My answer to this question is always the same: Write a damn virtual machine. People tend to look at this as a daunting task. When a beginner thinks VM, they likely think of virtual operating system instances, like Virtual Box. An intermediate might be closer to understanding what I mean if language runtime VMs come to mind, like the .NET runtime for C# or JVM for Java. In either case, this seems like an impossible task, hardly something simple enough to recommend to someone as a quick hobby project. But that is not the case at all.

The term Virtual Machine is not bound to any level of complexity. As I mentioned before, you can use them as a lightweight language backend or even an OS-as-an-app. Both of those are rather complex cases, what is a virtual machine in its most simple form? Well, it is a virtual representation of a physical machine (duh!), and when we say “machine”, we are meaning “computer”. What are the basic building blocks of a computer that we need to virtualize? Atomically - that is, atomic in the indivisible sense - you really only need memory and a processor. You might even argue that you don’t need memory at all, but that’s a different conversation. If we reduce our machine to a virtual processor and memory module, we have two simple devices that we can represent with minimal code.

#include <stdint.h>

uint8_t memory[64 * 1024];

struct {
    uint32_t registers[32];
} cpu;

That’s it. In 5 lines, we have created a virtual machine in its most basic form. It’s so simple you can even compile it if you really wanted to. We have 64KB of memory and a 32-bit processor with 32 general purpose registers. However, this isn’t really going to do much on its own. When we think of the types of computers we are trying to emulate, we think of giving them instructions that they can execute. This takes the form of machine code and, for us, assembly language. Take a look at this MIPS-style assembly.

add $r1, $r2, $r3

In pseudo code, this looks like:

r1 = r2 + r3

In the real world, we encode these instructions into a fixed-width number that instructs our processor at the gate level. In the world of VMs, we don’t have to worry about that tiny nonsense. Instead, we can simply create a function that operates on our CPU.

void add_instruction(unsigned r1, unsigned r2, unsigned r3)
{
    cpu->registers[r1] = cpu->registers[r2] + cpu->registers[r3];
}

We have now effectively created a processor that can run real instructions. Ignoring the more nuanced details (though real processors and thus emulators and VMs typically design around those), you can implement all arithmetic instructions pretty easily.

void sub_instruction(unsigned r1, unsigned r2, unsigned r3)
{
    cpu->registers[r1] = cpu->registers[r2] - cpu->registers[r3];
}

void mul_instruction(unsigned r1, unsigned r2, unsigned r3)
{
    cpu->registers[r1] = cpu->registers[r2] * cpu->registers[r3];
}

void div_instruction(unsigned r1, unsigned r2, unsigned r3)
{
    cpu->registers[r1] = cpu->registers[r2] / cpu->registers[r3];
}

I’m not going to type any more of these, because then I would be the one writing the damn virtual machine. If you use your imagination, I’m sure you could easily see how bitwise and comparison operations could be implemented.

We’ve neglected memory up to this point, but we want to be at least a few steps above Turing completeness. Assembly language allows us to interact with memory by moving data between memory and the CPU. This is done with load and store instructions, which move memory to and from the processor respectively. Sticking to MIPS for the examples:

lw $r0, 4($r1)
sw $r0, 8($r1)

In case you don’t have knowledge of assembly language, I should note that the integer values here (4 in the lw and 8 in the sw) are offsets. Think of indexing an array at indices 4 and 8. With that in mind, we can make simple functions for these operations that make use of both our processor and our memory.

void load_word(unsigned r1, unsigned r2, unsigned offset)
{
    cpu->registers[r1] = memory[r2 + offset];
}

void store_word(unsigned r1, unsigned r2, unsigned offset)
{
    memory[r2 + offset] = cpu->registers[r1];
}

Now we have memory that we can swap data with and a processor that can operate on that data.

This idea is incredibly scalable. Nothing is stopping you from stripping away the hardware emulation principles in favor of something more practical. But sticking to the subject of hardware emulation, you could take this idea really far with even basic knowledge of computer architecture. You could easily add more helpful and realistic dedicated hardware to your processor, as an example.

struct
{
    uint32_t registers[32];
    uint32_t pc; // Program counter
    uint32_t sp; // Stack pointer
    // ... so on and so forth
} cpu;

To make things really interesting, you might even make a small assembler. Please pardon the unsafe, messy code. This is a trivial example and does not need to be overcomplicated.

#include <stdio.h>
#include <string.h>

int main() {
    char op[4];
    unsigned rd, rs, rt, off;

    scanf("%3s", op);

    if (!strcmp(op, "add")) {
        scanf(" $%d, $%d, $%d", &rd, &rs, &rt);
        add_instruction(rd, rs, rt);
    }
    
    if (!strcmp(op, "sub")) {
        scanf(" $%d, $%d, $%d", &rd, &rs, &rt);
        sub_instruction(rd, rs, rt);
    }

    if (!strcmp(op, "lw")) {
        scanf(" $%d, %d($%d)", &rt, &off, &rs);
        load_word(rt, rs, off);
    }

    if (!strcmp(op, "sw")) {
        scanf(" $%d, %d($%d)", &rt, &off, &rs);
        store_word(rt, rs, off);
    }

    return 0;
}

Now, go forth. Build a damn virtual machine. Research alternative approaches. Clean up the code to follow better practices than what I have shown here, and make something cool. Above all else, have fun. If anyone else asks for project ideas - especially somewhere like a low-level programming community - link them to this article. Thanks for reading :)