Bytecode and VM

This page is a tour, not a reference. For the per-opcode detail, fast-lane shapes, fold rules, and benchmark deltas behind each landing, the changelog carries the granular notes; the entries from v0.105.0 through v0.145.0 cover the bytecode VM end-to-end.

Value representation

Every mino value flows through the runtime as a mino_val_t *. The low three bits of the pointer carry a tag that picks between a heap object and one of four inline shapes:

tag 000  ->  heap pointer to a struct mino_val
tag 001  ->  inline 61-bit signed int (payload in bits 63..3)
tag 010  ->  inline BOOL          (one bit at offset 3)
tag 011  ->  inline NIL           (the constant pointer itself)
tag 100  ->  inline CHAR          (21-bit Unicode codepoint)
tag 101..111 -> reserved

61-bit inline ints cover the range ±2⁶⁰ (~±1.15×10¹⁸). Anything wider widens to a heap-allocated BigInt. The decode relies on arithmetic right shift of a signed integer, which C99 leaves implementation-defined for negative operands; every supported toolchain (clang, gcc, msvc on x86_64 and arm64) implements it as sign-preserving. 64-bit hosts only.

The tag scheme has practical consequences. Tight integer loops skip allocation entirely because the inline-int payload lives in a register or stack slot. Tag tests are a single AND with 0x7 against a known constant. Heap pointers are 8-byte aligned by construction, so the bottom three bits are always zero and the runtime can discriminate the inline cases without touching the heap object.

Instruction encoding

Instructions are 32-bit unsigned words in one of three shapes:

ABC   :  op (8)  | A (8)  | B (8)  | C (8)
ABx   :  op (8)  | A (8)  | Bx (16)
AsBx  :  op (8)  | A (8)  | sBx (16, biased by 0x8000)

Opcodes occupy the low byte. A is the destination register for the common ABC shape. Bx is an unsigned 16-bit index into the const pool or a wide immediate; sBx is the signed variant used by jumps so a zero offset encodes a no-op jump. The fixed-width design simplifies fetch and decode: the dispatch loop reads one 32-bit word, masks out the op byte, and switches.

Each compiled function record (mino_bc_fn_t) carries the instruction stream, a const pool indexed by Bx, a register count, an array of arity clauses, and an inline-cache slot array. The MINO_FN heap value owns one mino_bc_fn_t across all closures built from the same template, so two closures of (fn [i] (fn [] i)) share the IC slots even though each carries its own captured environment.

Dispatch and the register window

The runtime maintains a single growable register stack on the state (S->bc_regs). Each entry into a compiled function pushes a window of n_regs slots onto the stack and points regs at the window base. Arguments arrive in regs[0..n_params), optionally followed by a collected rest list, and the body compiles to read and write within the window. On return the window is popped and the result lands at the caller's designated ret_base slot.

Dispatch is a single switch over the op byte inside one C function (mino_bc_run). The switch lets gcc and clang emit a jump table on platforms that have it; computed-goto dispatch is not used — the readability win of a single branch-and-decode loop wins against a fragile portability footprint, and the per-op cost is already low enough that the dispatch is rarely the bottleneck.

Any opcode that can re-enter user code (calls, global resolution, closure construction, var redefinition) re-reads the window pointer from S->bc_regs + base on the next cycle so a recursive mino_bc_run that triggers register-stack growth — and therefore reallocation — does not leave the outer frame with a dangling pointer.

Opcode catalog

Operations group into a small handful of families. The complete enum and per-op encoding live in src/eval/bc/internal.h.

Generic forms. OP_MOVE, OP_LOAD_K, OP_JMP, OP_JMPIFNOT. Register-to-register move, const-pool load, and jumps.
Global resolution. OP_GETGLOBAL_CACHED (the form actually emitted) and OP_SETGLOBAL. The cached read goes through an inline-cache slot; see the section below.
Calls and returns. OP_CALL, OP_TAILCALL, OP_RETURN, OP_CLOSURE. Hot calls pass an argv slice straight from the caller's register window with no cons-spine allocation; tail calls produce a sentinel that an internal trampoline consumes so deep tail recursion stays on a constant-size C stack.
Integer fast lanes. Per-op opcodes for tagged-int arithmetic, comparisons, and the common unary forms (inc, dec, zero?, pos?, neg?, even?, odd?). Overflow on +/-/* bails to the boxed slow lane via the compiler builtin so Clojure's throw-on-overflow semantics stay intact. Immediate-operand variants fold a signed 8-bit literal into the C slot so (< i 10) avoids loading the literal separately.
Fused counted loops. Two common loop shapes — single-binding and two-binding inc/dec — emit a single fused opcode at the recur target. Each iteration is one decode plus one or two tagged-int updates and a back-jump. Emission is gated on canonical-prim resolution so a user (defn dec ...) shadow is honoured.
Collection fast lanes. OP_NTH_VEC reads a persistent vector's leaf directly; OP_GET_KW_MAP does a single hash + HAMT lookup. Any miss falls back to prim_nth / prim_get.
try / catch / throw and dynamic bindings. OP_PUSHCATCH / OP_POPCATCH / OP_THROW bracket a try region; the handler runs in the same bc frame with register window and dyn stack restored. OP_PUSHDYN / OP_POPDYN manage a dyn frame for binding; the cached-global IC skips the cache while a dyn stack is active so a stale var root never masks a dyn-shadowed value.
Lazy sequences. OP_MAKE_LAZY constructs a thunk from a const-pool function value. Realisation flows back through the apply path so an un-compileable lazy body drops to the tree-walker without breaking the value-level contract.
Env management for capturing forms. OP_PUSH_ENV / OP_POP_ENV / OP_ENV_BIND bracket let scopes only when an inner fn closes over the outer bindings. Fns without inner fns skip these ops entirely; the common case pays no env-frame cost.

Inline cache for globals

Every reference to a global name in a compiled body uses an IC slot that holds the resolved value plus an ic_gen snapshot. The state-wide counter bumps on def / ns-unmap / var-set-root / var-unintern; a bumped counter misses every slot in lockstep, no per-slot invalidation work. The cache fills only when no dyn-bindings are active, so the stored value is always the var's published root and never a dyn-shadowed value.

Compile-time folding and dead-binding elimination

Two optimisations operate on the AST before emission, both gated by ic_gen coherence:

Let-binding fold-through. When a let's right-hand side folds via the PURE_PRIMS table, the compiler remembers the folded value on the local. A later pure-prim call whose args all resolve through this substitution folds the whole expression. (let [x (+ 1 2)] (* x x)) collapses to a single OP_LOAD_K 9.
Dead-binding elimination. A let binding whose name appears nowhere in the body and whose right-hand side is observably side-effect-free is dropped at compile time. Macros that expand to verbose (let [_ ... ...] body) around a binding the body never reads collapse to just the body.

Capturing let scopes (those that publish bindings into an env for an inner closure) opt out of both folds. Identity short-circuits in the prim layer give (assoc m k v) and set conj the property that (identical? m (assoc m k (get m k))) holds — a real signal callers can rely on, affordable because cached hashes keep the equality check O(1) in the typical no-match case.

Calling convention

Two callee ABIs coexist:

argv ABI. Bytecode-runnable user fns and the prims registered through mino_prim_argv receive an argv pointer plus a count directly. OP_CALL hands the caller's register slice through as the argv with zero cons cells allocated for the call. This is the hot path.
Cons-spine ABI. Fn1 prims, tree-walker user fns, macros, and non-fn callables receive a cons list of evaluated arguments. The call site builds the list lazily on the slow path so the hot path pays nothing.

Tail calls produce a MINO_TAIL_CALL sentinel that the call-site trampoline consumes; bc-runnable tail targets unpack it back to argv and re-enter mino_bc_run in the same dispatch loop iteration so deep tail recursion stays on a constant-size C stack. Multi-arity fns carry an array of mino_bc_clause_t entries; the runtime picks the matching clause at fn entry and seats argv into the chosen clause's parameter registers, collecting trailing args into a rest list when has_rest is set.

GC integration

mino's collector is a non-moving two-generation tracing collector. The VM cooperates with it in three places:

Register stack roots. The GC root walk scans every live register slot in [0, bc_top) on S->bc_regs. The window grows on demand; bc_pop_window clears every slot before the next push lands on it so stale references do not pin objects past their last use. The body emitter writes to every register before any op that may collect, so the root walk sees filled slots rather than uninitialised state.
Const pool and IC slots. The MINO_FN mark pass walks the bc fn's const pool and IC slot array so children and cached resolutions stay reachable across minors. The write barrier records old-to-young pointers when an IC slot in an old fn is filled with a young var root, so the next minor's remset is honest without per-op pinning.
Trampoline-safe handoff. The argv passed to a callee is the live register slice of the caller. Because the slice lives on the GC-rooted register stack and the callee re-reads its base on every cycle, an allocation inside the callee cannot leave the argv with a dangling tail.

Soundness considerations

Worth being honest: the recent opcode-fusion and compile-time-fold work is the kind of optimisation that is easy to get subtly wrong. New tests and adversarial fuzzing give us confidence it is fairly stable, but the hard problem underneath is knowing in advance which aspects of Clojure are truly static and predictable so the compiler can treat them as axioms.

Every optimisation either has to be invariant under every reachable redefinition of a Clojure name, or carry a coherence check that catches a redefinition before the next dispatch. The current discipline:

Canonical-prim resolution check. Fused-loop and fast-lane emission only fires when the symbol at the head resolves to the canonical core prim. A user (defn dec [x] ...) shadow declines the fused step automatically.
ic_gen coherence. Any optimisation that bakes in a global resolution at compile time stashes the state's ic_gen snapshot on the fn. A mismatch at dispatch invalidates the bc and forces recompile, so (def + -) re-emits the body.
Side-effect-free check. Dead-binding elimination only drops bindings whose RHS the compiler can see through to a literal, symbol, or pure-prim call. Anything else fails closed.
Capture opt-out. Both folds skip let scopes that publish bindings into an env for an inner closure — env publishing is itself observable.
Overflow handling. Tagged-int fast lanes decline to the boxed slow lane on overflow so (dec MIN_INT) still throws Clojure-correct.

The conservative shape of these checks is the cost of playing this game with Clojure's late-bound semantics. A statically-typed dialect could do more here; a dialect with mutable globals would have to do less.

What this design borrows and where it differs

Lua 5.5. Register-based dispatch, fixed-width ABC/ABx/AsBx encoding, and a switch interpreter loop are direct Lua influences. mino differs by using low-tag pointer tagging rather than NaN-boxing (better for arbitrary-precision ints), by treating persistent vectors / HAMTs / lists / sets as distinct heap shapes rather than collapsing them into Lua's tables, and by layering a copy-and-patch JIT on top of the interpreter rather than committing to a single execution mode (see CPJIT below).
Janet. Small, embeddable, register-based, ship the bytecode interpreter and call it done — matches mino's ethos for the core path. mino differs by being a Clojure dialect: persistent immutable values are the default, lazy sequences are first-class, STM and agents are in the core surface, and a per-state recursive mutex serialises script execution rather than Janet's fiber-cooperative model. mino also has an optional copy-and-patch JIT on top of the bytecode VM; Janet does not.
BEAM (Erlang / Elixir). Per-process isolation, message-passing-only communication, and a preemptive scheduler are a fundamentally different concurrency model. mino shares a heap within a single mino_state_t and recovers BEAM's isolation property at a coarser grain via multiple runtimes and mino_clone. The per-state GIL is explicit rather than emergent.

Why Clojure makes some of this work

Several optimisations that would be unsound in a Lua-style or generic-Scheme VM become natural in a Clojure dialect:

Homoiconicity. A let is a list with a vector of binding pairs, so fold-through and dead-binding elimination are ordinary AST rewrites rather than IR passes.
Immutable bindings. A let name never observably changes between bind and body, so the compiler can substitute without worrying about aliasing. The same property lets fused-loop registers update in place safely.
Persistent immutable collections. A direct trie walk is safe because the trie is immutable. Subvecs share a parent trie and the walker honours the window without materialising it. Identity short-circuits on assoc / conj are observable because identical? is a real signal.
Cached hashes. Structural-equality compares short-circuit on hash mismatch in O(1), which is what makes the identity short-circuit cheap in practice rather than only when the lookup happens to miss.
Lazy sequences as values. OP_MAKE_LAZY emits against a thunk const and trusts the GC to keep its captured env alive. Realisation flows back through the apply path so an un-compileable lazy body drops to the tree-walker transparently.
Late binding plus a coherence epoch. The ic_gen counter turns Clojure's late-bound vars into a single load-compare per cached site. The optimiser checks at dispatch against the snapshot; on a redefinition the bc invalidates and recompiles.

The CPJIT layer

The CPJIT cycle (v0.178.0 – v0.240.0) added a copy-and-patch JIT on top of the bytecode VM. The design avoids every cost that previously kept JIT off the table: there is no code-gen backend, no runtime assembler, no signal-handler hooks, and no per-platform EH wiring.

Copy-and-patch works by pre-compiling each bytecode instruction's body as a short C function (a stencil), then asking the host C compiler to emit an object file per stencil. A small extractor (tools/stencil-extract, roughly 1,500 lines of C99 spread across per-format modules for Mach-O / ELF / PE-COFF) walks the object file, lifts the function body bytes out of the .text / __text section, and records every relocation that has to be patched at runtime (register operand slots, immediate constants, calls to extern helpers). The output is a byte table the runtime consumes.

At compile time the JIT walks the bytecode, copies each stencil's bytes into a writable / executable region, patches the operand slots with the current instruction's actual register and constant indices, and links neighbouring stencils into a musttail chain so the host compiler's tail-call guarantee turns the chain into a single threaded loop. There is no inline cache or specialiser inside the JIT — the IC machinery lives in the bytecode body and the JIT preserves it verbatim.

Five host arches ship full byte tables today: ARM64 Darwin, ARM64 Linux, x86_64 Linux, x86_64 Darwin, and x86_64 Windows (PE-COFF + VirtualAlloc for the writable / executable region). The runtime auto-detects the host and enables the JIT; per-state runtime control lives behind mino_state_set_jit_mode (AUTO / OFF / ON) and mino_state_set_jit_hot_threshold (call count before a function compiles); the CLI exposes both as --jit=auto|off|on and --jit-threshold=N.

An embedder that prefers a smaller binary over peak throughput can link mino-lean instead — the same source compiled with the JIT pipeline gated out by -DMINO_CPJIT=0. CI builds both binaries every push and asserts byte-identical stdout across ./mino --jit=auto, --jit=on, --jit=off, and ./mino-lean (4-way parity).

What the JIT covers today. Move, load-constant, fused load-then-return, return-arg / return-immediate, the canonical integer arithmetic and comparison ops (add / sub / mul / lt / le / gt / ge / eq, both register and constant operand variants), inc / dec, zero-test, the loop-with-int-bound hot lane. Functions that mix unsupported bytecodes fall back to the interpreter transparently. As the bytecode VM grows, the stencil set follows.

What the JIT does not do. Type-feedback specialisation; SSA-style optimisation; register allocation across stencils; deoptimisation. The stencil is bytecode-identical to what the interpreter runs — just stitched together with the dispatch loop elided. The soundness model is therefore the same as the interpreter's: if --jit=on and --jit=off observably diverge, the JIT is the bug, not the program.

Recently picked up

Six frontiers from earlier drafts of this page have shipped and folded into the steady-state VM. The benchmark deltas below are from the cycle's first-pass landings (all median-of-five, with the empty-thunk harness floor subtracted).

Write-path fast lanes. OP_CONJ_VEC for arity-2 (conj v x) and OP_ASSOC for the arity-3 (assoc coll k v) shape land in v0.152.0. The runtime dispatches by collection type (vector or map) and falls back to the canonical prim on any miss. conj-vec -34%, assoc-vec -56%, assoc-small -32%.
Inlining of small pure prims. OP_FIRST_VEC / OP_COUNT_VEC / OP_EMPTY_VEC in v0.153.0 turn the hot single-arg seq prims on a vector into a direct field read; misses fall through to the canonical prim. count-vec -94%, first-vec -93%, empty?-vec ~0 ns.
Record fast paths and keyword-as-fn inlining. v0.154.0 extends OP_GET_KW_MAP with a record branch (fixed-slot fetch after a declared-field lookup), and rewrites the (:kw coll) keyword-as-fn shape to the same fast lane the (get coll :kw) form already used. get-kw-record -93%, kw-fn-record -84%, kw-fn-map -77%.
Inline-cached call sites. v0.155.0 fuses the global-symbol head resolution and the dispatch into one OP_CALL_CACHED opcode that re-uses the OP_GETGLOBAL_CACHED ic-slot discipline. Dynamic bindings and closure captures still shadow even on a hot site; the cached path drops a register window slot and one inline-cache lookup per call. fib-30 -13%, loop-recur-1M -9%.
Generic get and dissoc fast lanes. v0.156.0 opens OP_GET_KW_MAP to any hashable key on a map (the keyword guard now only fences the record branch), and adds OP_DISSOC for the arity-2 (dissoc m k) shape. Records, sorted-maps, transients, and variadic dissoc keep the canonical-prim path. get-str-map -81%, dissoc-map -21%.
Transducer fusion for reduce pipelines. v0.157.0 recognises a (reduce f init (->> src (map ...) (filter ...) (take ...))) chain at the moment prim_reduce runs: the outer LAZY cell's c_thunk pointer identifies the stage kind, the chain unwinds by walking the thunk pointers, and the bottom source walks once through a fused element-by-element loop with the stages applied inline. Lazy-seq cells from map / filter / take stop being allocated. The five common numeric predicates got argv-ABI siblings in the same commit so a (filter odd? ...) stage stays cons-free. pipeline-sum -77% (93.5 µs → 21.3 µs); pipeline alloc count -86%.
Protocol-method inline cache. v0.158.0 adds OP_PROTOCOL_CALL_CACHED (and a tail-position twin) for monomorphic and small-PIC protocol dispatch. The slot pins the dispatch atom at compile, the hot path derefs it once, pointer-compares the deref'd map and the first-arg type discriminator against the cached pair, and on a hit invokes the cached impl directly via apply_callable_argv — no protocol-dispatch trampoline, no map_get on the hot path. The miss path performs one map_get_val with a :default fallback and refills the IC under write barriers. proto-mono-area -57% (5.03 µs → 2.14 µs); proto-bi-area -50%.
Seq-fusion generalisation. v0.159.0 extracts the v0.157.0 walker from prim_reduce into a shared pipeline_walk and wires it into into (vector target), mapv, filterv, and dorun. The same head-shape recognition kicks in for every consumer; the lazy cells from map / filter / take stop being allocated regardless of which terminal consumer drains the pipeline. into-vec-pipeline -70%, mapv-pipeline -65%, filterv-pipeline -95%, dorun-pipeline -94%.
Chunked-source walk + canonical-prim stages. v0.161.0 adds two combined optimisations to the fused walker. When the unwound source is (or forces to) a MINO_CHUNKED_CONS the walk iterates the chunk's value array directly instead of going through seq_iter_val / seq_iter_next per element. And when a stage's callable resolves at walk-entry to one of the canonical numeric prims (inc, dec, odd?, even?, pos?, neg?, zero?), the operation is applied inline on tagged-int elements with no apply_callable_argv call. reduce + map inc (range 1m) -19%; reduce + filter odd? (range 1m) -9%.
Hot/cold handler partition. v0.162.0 splits the dispatch switch in src/eval/bc/vm.c along the op-count profile. The 18 hot ops (the move / load-k / cached-get / cached-call / int fast lanes / tail-call / return / loop-fused / read-side small-prim / assoc family) stay inlined; the long tail (NOP, the uncached OP_GETGLOBAL / OP_CALL, closure build, env push/pop/bind, OP_THROW and the dyn / try frames, OP_NTH_VEC / OP_EMPTY_VEC / OP_CONJ_VEC / OP_DISSOC) moves to a static bc_cold_op helper called from the default arm. The partition leaves room to add future hot opcodes without bumping the case count back over clang's tipping point that bit OP_TAILCALL_CACHED last cycle. matrix neutral as a gate; small speedups (5–8%) on shapes where the hot ops fit in fewer cache lines after the ladder shrinks.
IC consumer consolidation. v0.163.0 funnels the three IC-cache consumers (OP_GETGLOBAL_CACHED, OP_CALL_CACHED, OP_PROTOCOL_CALL_CACHED) behind two shared helpers — ic_resolve_global carries the dyn / env / cached / resolve cascade and the miss-path write-barrier refill; ic_resolve_protocol carries the atom-deref / type-disc / cache-check / map_get miss-fallback / refill sequence. The IC-slot GC walk that used to duplicate across the MINO_FN and GC_T_BC_FN walker arms is centralised in one gc_mark_bc_ic_slots function. No behavior change; the substantiation pays off when a fourth IC consumer lands.
Unboxed int-acc reducer fast lane. v0.164.0 routes (reduce <op> [init] coll) through a shared reduce_ctx_t that keeps the accumulator as an unboxed long long while the reducer is one of the canonical numeric prims (+, *, -, bit-and, bit-or, bit-xor) and every element so far has been tagged-int with no overflow. The first miss boxes the accumulator and falls through to the generic reduce_step path so the numeric tower stays Clojure-correct. Shared across the vec, set, list, and pipeline reducer entry points. (reduce + vec-100k) -48%; (reduce + set-100k) -49%; (reduce + list-100k) -24%; range reduce stays at the existing floor.
In-place transient vector mutation. v0.165.0 puts each (transient ...) on a monotonic 32-bit owner ID, and adds an owner field to every vec trie / tail node. conj! / assoc! / pop! on a vector mutate the owner-tagged nodes in place — the first edit through a fresh transient clones the touched node (and stamps it with the owner); every later edit on that node is a single slot write + count bump with no allocation. Slot writes route through gc_write_barrier so an OLD owner-tagged node that aged across a minor during a long batch keeps its remset entry consistent. mino_vec_node stayed at 264 bytes; the persistent path's clone size is unchanged. into-vec-pipeline -74% (589 µs → 152 µs); mapv-pipeline -69%; persistent conj/assoc/pop flat.
Builder-pattern compile-time rewrite. v0.166.0 recognises the canonical (loop [... acc []] (if <t> (recur ... (conj acc x)) acc)) shape (and the assoc sister form over {}, then/else either way) at compile time, and rewrites it to (persistent! (loop [... acc (transient [])] ...)) with conj! / assoc! in the recur step. The substrate from the previous release makes the rewrite pay off — on the wrapper transients it was 2.5× slower than the persistent baseline; with owner-tagged in-place mutation it's a 3.4× win. (loop ... (conj acc i)) N=100k: 92 ms → 27 ms (-71%); matches a hand-written transient builder within run-to-run noise.

Still open

Hypotheses worth picking up. Each line is a one-liner; the shape of the work and rough payoff are obvious from the sketch.

Dispatch shape rework. Both standard alternatives to the current switch have been tried and regressed. Apple clang at -O2 tail-merges every per-handler goto *target into a single dispatch site, collapsing the threaded interpreter back to switch shape; asm goto with an explicit label list is the canonical workaround but Apple clang miscompiles register-held label addresses. [[clang::musttail]] per-handler dispatch was tried in the v0.117.0 era and regressed too: per-handler prolog/epilog plus locals reload beat the stack-growth savings on short bodies. v0.162.0's hot/cold partition mitigated the case-count ceiling that had blocked adding new hot ops, so the urgency is lower now; the remaining 1–5% real-world win on modern CPUs (per a 2026 CPython measurement on M1 and Raptor Lake) would need either a Linux GCC build (its tail-merging is less aggressive) or a different architecture entirely (shared-locals threaded dispatch via explicit register pinning, two-tier dispatch, direct-threaded code).
Profile-guided opcode rewriting. Hot sites swap generic opcodes for specialised variants after a stable shape is observed — adaptive specialisation without a JIT. Type-feedback fast lanes for arith on observed int+int sites would extend the literal-arg fast lanes already shipped, and a runtime OP_CALL -> OP_CALL_CACHED rewrite would cover closure-bound heads the compiler can't statically prove. v0.163.0 consolidated the three existing IC consumers behind shared resolve helpers and a unified GC-scan; the framework is now ready for a fourth consumer to plug in. The earlier OP_TAILCALL_CACHED probe that regressed fib-30 motivated the v0.162.0 hot/cold partition, which gives the dispatch switch room to grow new hot ops without taxing the existing case ladder. Concrete next step: a profile-driven type-feedback op-rewrite pass.
Recur-shape fusion expansion. OP_LOOP_INT_DEC / OP_LOOP_INT_DEC_INC covers a single binding shape today. Real-workload profiling shows roughly zero loops in the bench matrix or test suite hit the fused opcode — the long tail of two- and three-binding loops with pos? / < / ≤ tests is the next coverage frontier.
OP_GET_KW_MAP static folding. Inside a defrecord method body the record type is statically known; the field-index lookup that OP_GET_KW_MAP resolves at runtime could fold to a constant. The opcode is 12% of protocol_bench dispatch — small but concentrated.
Fused BigInt arithmetic. Multi-step BigInt op stays in BigInt form across the chain, avoiding the re-tag roundtrip per step.
Parallel reduce on persistent vectors. Fork/join trie walk via fold / cat, combined with a runtime-per-shard model.
Full TCO across more tail positions. Propagate the tail flag through every tail-shaped form so cross-fn tail calls stay constant-stack regardless of which side is compiled.

Beyond opcodes: a formal Clojure spec

The soundness discipline keeps coming back to one question: which properties of Clojure are safe to treat as axioms? Right now the answer is a case-by-case judgement call. A separate, much larger project — a formal and executable Clojure language spec plus a meta-analysis engine written in core.logic or Prolog with full runtime introspection — would turn that judgement call into a mechanical check. Each candidate optimisation would carry the axiom it depends on as data; the engine would mechanically verify the axiom holds for the dialect's surface and the fold's reachability.

That is not this project, and it is not next quarter's project. It is the natural endpoint of the soundness work the VM is already doing by hand.