Bytecode and VM

This page is a tour, not a reference. For the per-opcode detail, fast-lane shapes, fold rules, and benchmark deltas behind each landing, the changelog carries the granular notes; the entries from v0.105.0 through v0.145.0 cover the bytecode VM end-to-end.

Value representation

Every mino value flows through the runtime as a mino_val_t *. The low three bits of the pointer carry a tag that picks between a heap object and one of four inline shapes:

tag 000  ->  heap pointer to a struct mino_val
tag 001  ->  inline 61-bit signed int (payload in bits 63..3)
tag 010  ->  inline BOOL          (one bit at offset 3)
tag 011  ->  inline NIL           (the constant pointer itself)
tag 100  ->  inline CHAR          (21-bit Unicode codepoint)
tag 101..111 -> reserved

61-bit inline ints cover the range ±2^60 (~±1.15×10^18). Anything wider widens to a heap-allocated BigInt. The decode relies on arithmetic right shift of a signed integer, which C99 leaves implementation-defined for negative operands; every supported toolchain (clang, gcc, msvc on x86_64 and arm64) implements it as sign-preserving. 64-bit hosts only.

The tag scheme has practical consequences. Tight integer loops skip allocation entirely because the inline-int payload lives in a register or stack slot. Tag tests are a single AND with 0x7 against a known constant. Heap pointers are 8-byte aligned by construction, so the bottom three bits are always zero and the runtime can discriminate the inline cases without touching the heap object.
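The tagging scheme described above can be sketched in a few lines of C. This is an illustration only, not mino's actual source: the helper names (val_tag, make_int, int_value, and so on) are hypothetical, and placing the CHAR payload at bit 3 is an assumption. The tag values and the 61-bit int payload follow the table above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names throughout; 64-bit hosts only. */
typedef struct mino_val mino_val_t;   /* opaque heap object */

enum { TAG_MASK = 0x7, TAG_HEAP = 0, TAG_INT = 1, TAG_BOOL = 2,
       TAG_NIL = 3, TAG_CHAR = 4 };

#define INLINE_INT_MAX (((int64_t)1 << 60) - 1)
#define INLINE_INT_MIN (-((int64_t)1 << 60))
#define VAL_NIL        ((mino_val_t *)(uintptr_t)TAG_NIL)

/* A tag test is one AND against the low three bits. */
static inline unsigned val_tag(const mino_val_t *v) {
    return (unsigned)((uintptr_t)v & TAG_MASK);
}

/* Values outside this range would widen to a heap BigInt. */
static inline bool int_fits_inline(int64_t n) {
    return n >= INLINE_INT_MIN && n <= INLINE_INT_MAX;
}

static inline mino_val_t *make_int(int64_t n) {
    /* Shift in unsigned space: left-shifting a negative signed value is undefined. */
    return (mino_val_t *)(uintptr_t)(((uint64_t)n << 3) | TAG_INT);
}

static inline int64_t int_value(const mino_val_t *v) {
    /* Arithmetic right shift restores the sign (implementation-defined in C99). */
    return (int64_t)(uintptr_t)v >> 3;
}

static inline mino_val_t *make_bool(bool b) {
    return (mino_val_t *)(uintptr_t)(((uintptr_t)b << 3) | TAG_BOOL);
}

/* Assumption: the 21-bit codepoint sits directly above the tag bits. */
static inline mino_val_t *make_char(uint32_t cp) {
    return (mino_val_t *)(uintptr_t)(((uintptr_t)cp << 3) | TAG_CHAR);
}
```

None of these helpers dereference the pointer, which is why inline values never touch the heap.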

Instruction encoding

Instructions are 32-bit unsigned words in one of three shapes:

ABC   :  op (8)  | A (8)  | B (8)  | C (8)
ABx   :  op (8)  | A (8)  | Bx (16)
AsBx  :  op (8)  | A (8)  | sBx (16, biased by 0x8000)

Opcodes occupy the low byte. A is the destination register for the common ABC shape. Bx is an unsigned 16-bit index into the const pool or a wide immediate; sBx is the signed variant used by jumps so a zero offset encodes a no-op jump. The fixed-width design simplifies fetch and decode: the dispatch loop reads one 32-bit word, masks out the op byte, and switches.
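As a rough sketch of the three shapes, the encoders and decoders below pack fields into a 32-bit word. The function names are hypothetical. The field positions are also an assumption: the text only says the opcode is in the low byte, so the sketch reads the layout table from low bits to high bits (A in bits 8..15, B in 16..23, C in 24..31, Bx in 16..31).

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t insn_t;

#define SBX_BIAS 0x8000   /* bias from the AsBx layout above */

static inline insn_t enc_abc(uint8_t op, uint8_t a, uint8_t b, uint8_t c) {
    return (insn_t)op | ((insn_t)a << 8) | ((insn_t)b << 16) | ((insn_t)c << 24);
}

static inline insn_t enc_abx(uint8_t op, uint8_t a, uint16_t bx) {
    return (insn_t)op | ((insn_t)a << 8) | ((insn_t)bx << 16);
}

/* Valid for -0x8000 <= sbx <= 0x7FFF; the bias maps that onto 0..0xFFFF. */
static inline insn_t enc_asbx(uint8_t op, uint8_t a, int32_t sbx) {
    return enc_abx(op, a, (uint16_t)(sbx + SBX_BIAS));
}

static inline unsigned insn_op(insn_t i) { return i & 0xFF; }
static inline unsigned insn_a(insn_t i)  { return (i >> 8) & 0xFF; }
static inline unsigned insn_b(insn_t i)  { return (i >> 16) & 0xFF; }
static inline unsigned insn_c(insn_t i)  { return (i >> 24) & 0xFF; }
static inline unsigned insn_bx(insn_t i) { return i >> 16; }
static inline int      insn_sbx(insn_t i) { return (int)(i >> 16) - SBX_BIAS; }
```

With this bias, a jump offset of zero is stored as 0x8000 in the Bx field and decodes back to 0.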

Each compiled function record (mino_bc_fn_t) carries the instruction stream, a const pool indexed by Bx, a register count, an array of arity clauses, and an inline-cache slot array. The MINO_FN heap value owns one mino_bc_fn_t across all closures built from the same template, so two closures of (fn [i] (fn [] i)) share the IC slots even though each carries its own captured environment.

Dispatch and the register window

The runtime maintains a single growable register stack on the state (S->bc_regs). Each entry into a compiled function pushes a window of n_regs slots onto the stack and points regs at the window base. Arguments arrive in regs[0..n_params), optionally followed by a collected rest list, and the body compiles to read and write within the window. On return the window is popped and the result lands at the caller's designated ret_base slot.

Dispatch is a single switch over the op byte inside one C function (mino_bc_run). The switch lets gcc and clang emit a jump table on platforms that have it. Computed-goto dispatch is not used: a single branch-and-decode loop is easier to read and avoids a fragile portability footprint, and the per-op cost is already low enough that dispatch is rarely the bottleneck.
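A switch-based dispatch loop of this kind can be sketched as follows. The four opcodes, the macro names, and the use of plain int64_t registers are all hypothetical stand-ins chosen for the illustration; the real enum lives in src/eval/bc/internal.h and the real loop is mino_bc_run.

```c
#include <assert.h>
#include <stdint.h>

enum { OP_LOADI, OP_ADD, OP_JMP, OP_RET };   /* hypothetical opcodes */

/* Field positions are assumed: op in the low byte, then A, B, C. */
#define OP(i)  ((i) & 0xFF)
#define A(i)   (((i) >> 8) & 0xFF)
#define B(i)   (((i) >> 16) & 0xFF)
#define C(i)   (((i) >> 24) & 0xFF)
#define SBX(i) ((int)((i) >> 16) - 0x8000)

#define ABC(op, a, b, c) \
    ((uint32_t)(op) | (uint32_t)(a) << 8 | (uint32_t)(b) << 16 | (uint32_t)(c) << 24)
#define ASBX(op, a, s) \
    ((uint32_t)(op) | (uint32_t)(a) << 8 | (uint32_t)((s) + 0x8000) << 16)

/* Fetch one word, mask out the op byte, switch. */
static int64_t run(const uint32_t *code, int64_t *regs) {
    const uint32_t *pc = code;
    for (;;) {
        uint32_t i = *pc++;
        switch (OP(i)) {
        case OP_LOADI: regs[A(i)] = SBX(i);                  break;
        case OP_ADD:   regs[A(i)] = regs[B(i)] + regs[C(i)]; break;
        case OP_JMP:   pc += SBX(i);                         break;
        case OP_RET:   return regs[A(i)];
        default:       return -1;   /* unknown opcode in this sketch */
        }
    }
}

static int64_t demo(void) {
    const uint32_t code[] = {
        ASBX(OP_LOADI, 0, 40),
        ASBX(OP_LOADI, 1, 2),
        ASBX(OP_JMP, 0, 0),        /* zero offset: falls through */
        ABC(OP_ADD, 2, 0, 1),
        ABC(OP_RET, 2, 0, 0),
    };
    int64_t regs[3] = {0};
    return run(code, regs);
}
```

Calling demo() loads 40 and 2, steps over the no-op jump, adds, and returns register 2.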

Any opcode that can re-enter user code (calls, global resolution, closure construction, var redefinition) re-reads the window pointer from S->bc_regs + base on the next cycle so a recursive mino_bc_run that triggers register-stack growth — and therefore reallocation — does not leave the outer frame with a dangling pointer.
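The dangling-pointer hazard is easiest to see in a small model. The sketch below keeps a growable register stack on a state struct, addresses each window by its base offset, and re-derives the window pointer after a nested call that may have reallocated the stack. The names state_t, push_window, pop_window, and sum_to are hypothetical, and long stands in for mino_val_t *.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    long  *bc_regs;    /* growable register stack */
    size_t top, cap;   /* slots in use, slots allocated */
} state_t;

/* Push a window of n slots and return its base offset. May realloc. */
static size_t push_window(state_t *S, size_t n) {
    if (S->top + n > S->cap) {
        size_t cap = S->cap ? S->cap : 8;
        while (cap < S->top + n) cap *= 2;
        long *p = realloc(S->bc_regs, cap * sizeof *p);
        if (!p) abort();           /* the sketch does not model OOM recovery */
        S->bc_regs = p;
        S->cap = cap;
    }
    size_t base = S->top;
    S->top += n;
    return base;
}

static void pop_window(state_t *S, size_t base) { S->top = base; }

/* A recursive "call" deep enough to force the stack to grow. */
static long sum_to(state_t *S, long n) {
    size_t base = push_window(S, 4);
    long *regs = S->bc_regs + base;
    regs[0] = n;                          /* argument arrives in regs[0] */
    long result = 0;
    if (n > 0) {
        long sub = sum_to(S, n - 1);      /* may realloc S->bc_regs */
        regs = S->bc_regs + base;         /* re-read: the old pointer may dangle */
        result = regs[0] + sub;
    }
    pop_window(S, base);
    return result;
}
```

Storing the base offset rather than the pointer is what makes the re-read cheap: one add after any operation that can re-enter user code.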

Opcode catalog

Operations group into a small handful of families. The complete enum and per-op encoding live in src/eval/bc/internal.h.

Inline cache for globals

Every reference to a global name in a compiled body uses an IC slot that holds the resolved value plus an ic_gen snapshot. The state-wide counter bumps on def / ns-unmap / var-set-root / var-unintern; a bumped counter makes every slot miss in lockstep, with no per-slot invalidation work. The cache fills only when no dyn-bindings are active, so the stored value is always the var's published root and never a dyn-shadowed value.
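A generation-stamped cache of this shape might look like the sketch below. All names are hypothetical, the state holds a single global for brevity, and int stands in for a value. One detail goes beyond what the text states: the sketch also bypasses the cache on reads while dyn-bindings are active, which is a conservative assumption rather than a description of mino's behaviour.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t ic_gen;        /* bumped on def / ns-unmap / var-set-root / var-unintern */
    int      dyn_depth;     /* > 0 while dyn-bindings are active */
    int      root;          /* published root of the one global in this sketch */
    unsigned slow_lookups;  /* instrumentation only */
} state_t;

typedef struct {
    uint64_t gen;           /* ic_gen snapshot taken at fill time */
    int      value;
    int      filled;
} ic_slot_t;

static int slow_resolve(state_t *S) {
    S->slow_lookups++;
    return S->root;
}

static int load_global(state_t *S, ic_slot_t *slot) {
    /* Hit: the snapshot still matches and nothing is dyn-bound. */
    if (slot->filled && slot->gen == S->ic_gen && S->dyn_depth == 0)
        return slot->value;
    int v = slow_resolve(S);
    if (S->dyn_depth == 0) {            /* fill only with a published root */
        slot->value  = v;
        slot->gen    = S->ic_gen;
        slot->filled = 1;
    }
    return v;
}

/* Any redefinition bumps the counter; every slot misses on its next read. */
static void set_root(state_t *S, int v) {
    S->root = v;
    S->ic_gen++;
}
```

The invalidation cost is a single increment regardless of how many slots exist, at the price of refilling unrelated slots after any redefinition.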

Compile-time folding and dead-binding elimination

Two optimisations operate on the AST before emission, both gated by ic_gen coherence:

Capturing let scopes (those that publish bindings into an env for an inner closure) opt out of both folds. Identity short-circuits in the prim layer give (assoc m k v) and set conj the property that (identical? m (assoc m k (get m k))) holds — a real signal callers can rely on, affordable because cached hashes keep the equality check O(1) in the typical no-match case.

Calling convention

Two callee ABIs coexist:

Tail calls produce a MINO_TAIL_CALL sentinel that the call-site trampoline consumes; bc-runnable tail targets unpack it back to argv and re-enter mino_bc_run in the same dispatch loop iteration so deep tail recursion stays on a constant-size C stack. Multi-arity fns carry an array of mino_bc_clause_t entries; the runtime picks the matching clause at fn entry and seats argv into the chosen clause's parameter registers, collecting trailing args into a rest list when has_rest is set.
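The trampoline idea can be illustrated in isolation. In the sketch below a callee either returns a value or returns a record describing the next call, and a loop at the call site keeps consuming those records. The result_t struct stands in for the MINO_TAIL_CALL sentinel; its layout and every function name here are hypothetical, and multi-arity clause selection is not modelled.

```c
#include <assert.h>
#include <stdint.h>

typedef struct result result_t;
typedef result_t (*fn_t)(int64_t n, int64_t acc);

struct result {
    int     is_tail_call;   /* stand-in for the tail-call sentinel */
    int64_t value;          /* meaningful when is_tail_call == 0 */
    fn_t    target;         /* next callee when is_tail_call == 1 */
    int64_t argv[2];
};

static result_t done(int64_t v) {
    result_t r = {0, v, 0, {0, 0}};
    return r;
}

static result_t tail_call(fn_t f, int64_t a, int64_t b) {
    result_t r = {1, 0, f, {a, b}};
    return r;
}

/* Self tail call: returns a request instead of recursing on the C stack. */
static result_t sum_down(int64_t n, int64_t acc) {
    if (n == 0) return done(acc);
    return tail_call(sum_down, n - 1, acc + n);
}

/* The call site loops until a plain value comes back. */
static int64_t trampoline(fn_t f, int64_t a, int64_t b) {
    result_t r = f(a, b);
    while (r.is_tail_call)
        r = r.target(r.argv[0], r.argv[1]);
    return r.value;
}
```

A million-deep tail recursion through trampoline uses one C frame for the loop plus one for the active callee, which is the constant-stack property the paragraph above describes.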

GC integration

mino's collector is a non-moving two-generation tracing collector. The VM cooperates with it in three places:

Soundness considerations

Worth being honest: the recent opcode-fusion and compile-time-fold work is the kind of optimisation that is easy to get subtly wrong. New tests and adversarial fuzzing give us confidence it is fairly stable, but the hard problem underneath is knowing in advance which aspects of Clojure are truly static and predictable so the compiler can treat them as axioms.

Every optimisation either has to be invariant under every reachable redefinition of a Clojure name, or carry a coherence check that catches a redefinition before the next dispatch. The current discipline:

The conservative shape of these checks is the cost of playing this game with Clojure's late-bound semantics. A statically-typed dialect could do more here; a dialect with mutable globals would have to do less.

What this design borrows and where it differs

Why Clojure makes some of this work

Several optimisations that would be unsound in a Lua-style or generic-Scheme VM become natural in a Clojure dialect:

The CPJIT layer

The CPJIT cycle (v0.178.0 – v0.240.0) added a copy-and-patch JIT on top of the bytecode VM. The design avoids every cost that previously kept JIT off the table: there is no code-gen backend, no runtime assembler, no signal-handler hooks, and no per-platform EH wiring.

Copy-and-patch works by pre-compiling each bytecode instruction's body as a short C function (a stencil), then asking the host C compiler to emit an object file per stencil. A small extractor (tools/stencil-extract, roughly 1,500 lines of C99 spread across per-format modules for Mach-O / ELF / PE-COFF) walks the object file, lifts the function body bytes out of the .text / __text section, and records every relocation that has to be patched at runtime (register operand slots, immediate constants, calls to extern helpers). The output is a byte table the runtime consumes.

At compile time the JIT walks the bytecode, copies each stencil's bytes into a writable / executable region, patches the operand slots with the current instruction's actual register and constant indices, and links neighbouring stencils into a musttail chain so the host compiler's tail-call guarantee turns the chain into a single threaded loop. There is no inline cache or specialiser inside the JIT — the IC machinery lives in the bytecode body and the JIT preserves it verbatim.
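The copy-then-patch step can be sketched without generating real machine code. In the illustration below a stencil is a byte template plus a list of holes, and emit copies the template into an output buffer and overwrites each hole with the current instruction's operand. Everything here is hypothetical: the struct names, the placeholder bytes (0xAA, 0xBB, 0xCC are not instructions), and the 32-bit hole width. A real implementation would write into an executable mapping and would also link stencils together, which this sketch omits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    size_t offset;    /* where in the stencil the hole starts */
    int    operand;   /* which operand fills it: 0 = A, 1 = B */
} hole_t;

typedef struct {
    const uint8_t *bytes;
    size_t         len;
    const hole_t  *holes;
    size_t         n_holes;
} stencil_t;

/* Placeholder template: three marker bytes around two 4-byte holes. */
static const uint8_t ADD_BYTES[] = {0xAA, 0, 0, 0, 0, 0xBB, 0, 0, 0, 0, 0xCC};
static const hole_t  ADD_HOLES[] = {{1, 0}, {6, 1}};
static const stencil_t ADD_STENCIL = {ADD_BYTES, sizeof ADD_BYTES, ADD_HOLES, 2};

/* Copy the stencil to out[at..] and patch its holes.
   Returns the next write offset, or (size_t)-1 if it does not fit. */
static size_t emit(uint8_t *out, size_t cap, size_t at,
                   const stencil_t *st, const uint32_t operands[2]) {
    if (at + st->len > cap) return (size_t)-1;
    memcpy(out + at, st->bytes, st->len);
    for (size_t h = 0; h < st->n_holes; h++) {
        uint32_t v = operands[st->holes[h].operand];
        memcpy(out + at + st->holes[h].offset, &v, sizeof v);
    }
    return at + st->len;
}

/* Read a patched hole back in host byte order. */
static uint32_t read_u32(const uint8_t *p) {
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

Because the per-instruction work is a memcpy plus a handful of stores, compile time scales with bytecode length and needs no assembler at runtime.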

Five host arches ship full byte tables today: ARM64 Darwin, ARM64 Linux, x86_64 Linux, x86_64 Darwin, and x86_64 Windows (PE-COFF + VirtualAlloc for the writable / executable region). The runtime auto-detects the host and enables the JIT; per-state runtime control lives behind mino_state_set_jit_mode (AUTO / OFF / ON) and mino_state_set_jit_hot_threshold (call count before a function compiles); the CLI exposes both as --jit=auto|off|on and --jit-threshold=N.

An embedder that prefers a smaller binary over peak throughput can link mino-lean instead — the same source compiled with the JIT pipeline gated out by -DMINO_CPJIT=0. CI builds both binaries every push and asserts byte-identical stdout across ./mino --jit=auto, --jit=on, --jit=off, and ./mino-lean (4-way parity).

What the JIT covers today. Move, load-constant, fused load-then-return, return-arg / return-immediate, the canonical integer arithmetic and comparison ops (add / sub / mul / lt / le / gt / ge / eq, both register and constant operand variants), inc / dec, zero-test, and the loop-with-int-bound hot lane. Functions that mix in unsupported bytecodes fall back to the interpreter transparently. As the bytecode VM grows, the stencil set follows.

What the JIT does not do. Type-feedback specialisation; SSA-style optimisation; register allocation across stencils; deoptimisation. The stencil is bytecode-identical to what the interpreter runs — just stitched together with the dispatch loop elided. The soundness model is therefore the same as the interpreter's: if --jit=on and --jit=off observably diverge, the JIT is the bug, not the program.

Recently picked up

Six frontiers from earlier drafts of this page have shipped and folded into the steady-state VM. The benchmark deltas below are from the cycle's first-pass landings (all median-of-five, with the empty-thunk harness floor subtracted).

Still open

Hypotheses worth picking up. Each line is a one-liner; the shape of the work and rough payoff are obvious from the sketch.

Beyond opcodes: a formal Clojure spec

The soundness discipline keeps coming back to one question: which properties of Clojure are safe to treat as axioms? Right now the answer is a case-by-case judgement call. A separate, much larger project — a formal and executable Clojure language spec plus a meta-analysis engine written in core.logic or Prolog with full runtime introspection — would turn that judgement call into a mechanical check. Each candidate optimisation would carry the axiom it depends on as data; the engine would mechanically verify the axiom holds for the dialect's surface and the fold's reachability.

That is not this project, and it is not next quarter's project. It is the natural endpoint of the soundness work the VM is already doing by hand.