JIT
What ships today
- 58 opcode stencils covering arithmetic (OP_ADD_II, OP_SUB_II, OP_MUL_II, k-immediate variants), comparison (OP_EQ_II through OP_GE_II plus *_IK forms), bitwise (OP_BAND_II, OP_BOR_II, OP_BXOR_II, OP_SHL_II, OP_SHR_II, OP_USHR_II, OP_BNOT_I), unary predicates (OP_ZERO_INT_P, OP_POS_P_I, OP_EVEN_P_I, OP_ODD_P_I), unary increment/decrement (OP_INC_I, OP_DEC_I), fused loop steps (OP_LOOP_INT_LT, OP_LOOP_INT_DEC, OP_LOOP_INT_LT_INC, OP_LOOP_INT_DEC_INC), data-structure fast lanes (OP_NTH_VEC, OP_FIRST_VEC, OP_COUNT_VEC, OP_EMPTY_VEC, OP_CONJ_VEC, OP_GET_KW_MAP, OP_ASSOC, OP_DISSOC), dispatch (OP_CALL, OP_CALL_CACHED, OP_TAILCALL, OP_PROTOCOL_CALL_CACHED, OP_PROTOCOL_TAILCALL_CACHED, OP_GETGLOBAL_CACHED, OP_CLOSURE, OP_MAKE_LAZY), env management (OP_PUSH_ENV, OP_POP_ENV, OP_ENV_BIND), leaf shapes (OP_LOAD_K, OP_MOVE, OP_RETURN, and the fused OP_LOAD_K_RETURN superinstruction), and one synthetic OP_DEOPT_TO_INTERP stencil that the compile path inserts to bail back to the interpreter.
- 5 host arches with on-disk byte tables. Each generated header carries the stencil bytes plus the symbol and relocation tables the runtime patcher consumes:
  - stencils_arm64_darwin.h (90,568 bytes)
  - stencils_arm64_linux.h (90,782 bytes)
  - stencils_x86_64_darwin.h (84,925 bytes)
  - stencils_x86_64_linux.h (84,396 bytes)
  - stencils_x86_64_windows.h (87,195 bytes)
- Side-exit deopt path. Fns whose first unstenciled op sits past PC 0 compile to a native prefix plus a deopt stencil. When execution reaches the deopt instruction the native code returns into mino_bc_run_resume, which drives the interpreter over the same regs window from the recorded PC. Round-trip cost is roughly 100 ns per deopt on Apple Silicon; the path amortises against running the prefix through the interpreter around the 30-op prefix length. After this landed, the realistic_bench and real_workloads corpora both show 100% native eligibility with zero hard rejections.
- Cancellable JIT'd loops. Each fused-loop stencil polls mino_bc_safepoint on a 256-iteration downcounter, so a spinning JIT'd loop responds to (future-cancel f) within bounded wall time even when the body is entirely native. Per-iteration cost is one decrement plus one branch on the hot path.
- Adaptive tiering. In AUTO mode any callee invoked from inside a JIT-compiled region picks up a threshold of 1, so warm-up gaps on short-lived scripts collapse to the first call after the JIT'd caller fires. Tunable via mino_state_set_jit_hot_threshold for embedders that want different cold-call counts.
- Dual-binary build. The full mino binary builds with -DMINO_CPJIT=1 and links the entire JIT pipeline. The parallel mino-lean binary builds the same source tree with MINO_CPJIT undefined; the patcher, emitter, and stencil entry layers compile to no-ops and the runtime decision tree collapses to the tree-walker plus the bytecode VM. mino_state_jit_capability returns {.available=0, ...} on mino-lean and {.available=1, ...} on the full build.
- 4-way parity green on the dev host. The test-jit-parity task runs the test suite four times (MINO_JIT=auto, MINO_JIT=on, MINO_JIT=off, and the mino-lean binary) and asserts the four stdouts are byte-identical and all four processes exit 0:

      $ ./mino task test-jit-parity
      93 tests, 93 assertions: 93 passed, 0 failed, 0 errors
      jit-parity: OK -- stdout byte-identical across jit-auto / jit-on / jit-off / lean, all exit 0

- Synthetic-blob selftests. The tools/stencil-extract --selftest binary builds hand-crafted Mach-O, ELF, and COFF object blobs with known function bodies, symbol tables, and relocation tables, then runs each format parser against them and asserts the extracted bytes match expected values:

      $ ./tools/stencil-extract --selftest
      selftest_macho_synthetic: OK
      selftest_elf_synthetic: OK
      selftest_coff_synthetic: OK
      stencil_extract selftest: OK
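
The cancellable-loop bullet above describes a downcounter poll. A minimal C sketch of that idea follows; it is an illustration, not the stencil itself, which is machine code. The names demo_state_t, demo_safepoint, and demo_loop are hypothetical, and the real signature of mino_bc_safepoint is not shown in this document.

```c
#include <stdbool.h>
#include <stdint.h>

#define POLL_INTERVAL 256u  /* the 256-iteration downcounter described above */

/* Hypothetical stand-in for the runtime's cancellation state. */
typedef struct {
    bool     cancel_requested;
    uint64_t polls;           /* how many times the slow path ran */
} demo_state_t;

/* Hypothetical stand-in for mino_bc_safepoint: returns true to stop the loop. */
static bool demo_safepoint(demo_state_t *st) {
    st->polls++;
    return st->cancel_requested;
}

/* Counted-up loop. Returns the iteration index at which it stopped.
   cancel_at simulates another thread requesting cancellation; a real
   loop would not carry that extra compare on its hot path. */
static uint64_t demo_loop(demo_state_t *st, uint64_t limit, uint64_t cancel_at) {
    uint32_t budget = POLL_INTERVAL;
    uint64_t i = 0;
    while (i < limit) {
        if (i == cancel_at) st->cancel_requested = true; /* simulated cancel */
        if (--budget == 0) {          /* poll check: one decrement, one branch */
            budget = POLL_INTERVAL;
            if (demo_safepoint(st)) break;
        }
        i++;
    }
    return i;
}
```

In this sketch the cancel request is observed at the next poll boundary, so the loop never runs more than 255 further iterations after the flag is set.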
Per-host pipeline stages
Each stage is independent. extract parses the object file produced by the host compiler. generate emits the on-disk byte table the runtime patcher reads. build is the full in-tree compile against that byte table. smoke is the test suite running through the JIT pipeline. parity is the four-way auto/on/off/lean byte-identical-stdout check.
| Target | Format | extract | generate | build | smoke | parity |
|---|---|---|---|---|---|---|
| ARM64 Darwin | Mach-O 64 | green | green | green | green | green |
| ARM64 Linux | ELF64 | green | green | green | green | green |
| x86_64 Linux | ELF64 | green | green | green | green | green |
| x86_64 Darwin | Mach-O 64 | green | green | partial | partial | partial |
| x86_64 Windows | PE/COFF | green | green | green | green | partial |
Partial-cell notes
- x86_64 Darwin build/smoke/parity: covered by the cross-compile parity job on macos-14 that regenerates every committed byte table and asserts git diff --exit-code. End-to-end execution on Intel macOS is not in the GHA matrix; GitHub has been retiring Intel-Mac runners and the project does not yet host a self-hosted Intel runner. The Mach-O 64 stencil parser and patcher share their data flow with the ARM64 Darwin path, which is the dev host and is exercised on every push.
- x86_64 Windows parity: the four-way parity sits inside the release-gate composite, which depends on a libsanitizer build that MinGW does not ship. The Windows runner instead runs the smoke build (make + tests/run.clj through the JIT) on every push, which exercises the same code paths the parity check would but stops short of the byte-identical-stdout assertion.
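
The byte-identical-stdout assertion that the parity column refers to can be sketched in C as below. This is an assumed shape for illustration only: run_result_t and parity_ok are hypothetical names, and the real test-jit-parity task may capture and compare output differently.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record of one captured run: its stdout bytes and exit code. */
typedef struct {
    const char *label;      /* e.g. "jit-auto", "jit-on", "jit-off", "lean" */
    const char *out;        /* captured stdout */
    size_t      out_len;
    int         exit_code;
} run_result_t;

/* True when every run exited 0 and every stdout matches the first run
   byte for byte. */
static bool parity_ok(const run_result_t *runs, size_t n) {
    if (n == 0) return false;
    for (size_t i = 0; i < n; i++) {
        if (runs[i].exit_code != 0) return false;
        if (runs[i].out_len != runs[0].out_len) return false;
        if (memcmp(runs[i].out, runs[0].out, runs[0].out_len) != 0) return false;
    }
    return true;
}
```

Comparing lengths before memcmp keeps the check safe when one run produced a truncated or extended stdout.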
Runtime control
Each row in the support table is a build claim. At runtime, every JIT-capable binary lets the host choose how the pipeline executes. The five public symbols are stable across releases:
    void mino_state_set_jit_mode(mino_state_t *S,
                                 mino_jit_mode_t mode);
    mino_jit_mode_t mino_state_jit_mode(const mino_state_t *S);
    void mino_state_set_jit_hot_threshold(mino_state_t *S,
                                          unsigned n);
    unsigned mino_state_jit_hot_threshold(const mino_state_t *S);
    mino_jit_capability_t mino_state_jit_capability(const mino_state_t *S);

Modes
- MINO_JIT_MODE_AUTO (default): compile when the hot-call threshold trips.
- MINO_JIT_MODE_OFF: never compile.
- MINO_JIT_MODE_ON: compile on first call.

ON is for benchmarking and parity testing; AUTO is the default for embedders.
Hot threshold
Default seed is the compile-time MINO_JIT_THRESHOLD (currently 10 calls). Lower for shorter-lived scripts where warm-up matters; raise to avoid compiling rarely-called functions in long-lived embedders. Inside an AUTO region the threshold collapses to 1 for callees, so the warm-up gap doesn't compound across nested JIT'd calls.
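
The threshold rule above can be sketched as a small decision function. This is a sketch under assumptions: demo_fn_t, effective_threshold, and note_call are hypothetical names, and whether the real runtime compiles on the Nth call or just after it is not stated in this document.

```c
#include <stdbool.h>

#define DEMO_JIT_THRESHOLD 10u  /* the current default seed quoted above */

/* Hypothetical per-function bookkeeping. */
typedef struct {
    unsigned calls;     /* calls observed so far */
    bool     compiled;  /* already handed to the JIT */
} demo_fn_t;

/* Callees invoked from a JIT-compiled region use a threshold of 1;
   everything else uses the configured threshold. */
static unsigned effective_threshold(unsigned configured, bool caller_is_jitted) {
    return caller_is_jitted ? 1u : configured;
}

/* Records one call; returns true if this call trips compilation.
   Assumes compilation fires on the call that reaches the threshold. */
static bool note_call(demo_fn_t *fn, unsigned configured, bool caller_is_jitted) {
    if (fn->compiled) return false;
    fn->calls++;
    if (fn->calls >= effective_threshold(configured, caller_is_jitted)) {
        fn->compiled = true;
        return true;
    }
    return false;
}
```

With the default seed, a function called from interpreted code trips on its tenth call in this model, while the same function called from a JIT'd caller trips on its first.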
Capability discovery
mino_state_jit_capability returns a struct with available, mode, threshold, host_arch, and host_os fields. Embedders use this at startup to size their tuning before any script runs. mino-lean returns {.available=0, ...} so a host build that depends on JIT throughput knows to fall back.
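
A startup check along these lines might look like the sketch below. To keep it compilable without mino's headers, the types and function bodies are local mocks: the function names and signatures come from the prototypes above, but the field types, the enum values, the mino_state_t layout, and the configure_jit helper are all assumptions made for illustration.

```c
/* --- Mock declarations (assumed layouts, NOT mino's real headers) --- */
typedef enum {
    MINO_JIT_MODE_AUTO, MINO_JIT_MODE_OFF, MINO_JIT_MODE_ON
} mino_jit_mode_t;

typedef struct {
    int             available;
    mino_jit_mode_t mode;
    unsigned        threshold;
    const char     *host_arch;
    const char     *host_os;
} mino_jit_capability_t;

typedef struct { mino_jit_capability_t cap; } mino_state_t;

/* Mock bodies standing in for the runtime. */
static mino_jit_capability_t mino_state_jit_capability(const mino_state_t *S) {
    return S->cap;
}
static void mino_state_set_jit_hot_threshold(mino_state_t *S, unsigned n) {
    S->cap.threshold = n;
}
static unsigned mino_state_jit_hot_threshold(const mino_state_t *S) {
    return S->cap.threshold;
}

/* Hypothetical embedder policy: tune the threshold only when the JIT is
   present. Returns 1 if the JIT path is usable, 0 if the host should
   fall back to its non-JIT sizing. */
static int configure_jit(mino_state_t *S, unsigned wanted_threshold) {
    mino_jit_capability_t cap = mino_state_jit_capability(S);
    if (!cap.available) {
        return 0;  /* e.g. a mino-lean build */
    }
    mino_state_set_jit_hot_threshold(S, wanted_threshold);
    return 1;
}
```

The point of the pattern is that the host branches on the capability struct once, before any script runs, rather than probing for JIT behaviour later.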
CLI flags and env vars
Mode and threshold are also reachable from outside the embed surface for scripting use:
- --jit=auto|off|on and --jit-threshold=N call through to the same internal setters.
- MINO_JIT and MINO_JIT_HOT_THRESHOLD set the per-state default at mino_state_new time before the host gets a handle.
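
As an illustration of how a --jit=auto|off|on argument could be mapped onto a mode before it reaches a setter, here is a minimal C sketch. The demo_ names are hypothetical, and the real CLI parser may differ, for example in how it treats case or unknown values.

```c
#include <string.h>

/* Hypothetical mirror of the three modes. */
typedef enum { DEMO_JIT_AUTO, DEMO_JIT_OFF, DEMO_JIT_ON } demo_jit_mode_t;

/* Parses "auto", "off", or "on" (case-sensitive in this sketch).
   Returns 0 on success, -1 on an unknown value. */
static int demo_parse_jit_mode(const char *s, demo_jit_mode_t *out) {
    if (strcmp(s, "auto") == 0) { *out = DEMO_JIT_AUTO; return 0; }
    if (strcmp(s, "off")  == 0) { *out = DEMO_JIT_OFF;  return 0; }
    if (strcmp(s, "on")   == 0) { *out = DEMO_JIT_ON;   return 0; }
    return -1;
}

/* Handles one argv entry. Returns 1 if it was a valid --jit= flag,
   0 if it is some other argument, -1 if it was --jit= with a bad value. */
static int demo_parse_jit_flag(const char *arg, demo_jit_mode_t *out) {
    static const char prefix[] = "--jit=";
    size_t n = sizeof prefix - 1;
    if (strncmp(arg, prefix, n) != 0) return 0;
    return demo_parse_jit_mode(arg + n, out) == 0 ? 1 : -1;
}
```

The same demo_parse_jit_mode helper could serve an environment-variable value, which is one way the flag and the env var could share a code path.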
Side-exit deopt path
When a function's first unstenciled op sits past PC 0, the JIT compiles the supported prefix natively and plants an OP_DEOPT_TO_INTERP stencil at the first unstenciled position. The stencil records the resume PC on the state and returns NULL; mino_jit_invoke detects the deopt sentinel, clears the flag, and tail-calls mino_bc_run_resume to drive the interpreter over the same regs window from the recorded PC. The interpreter runs to function exit; subsequent calls re-enter the native prefix from the top, so the deopt cost is paid once per call, not per iteration.
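
The control flow in the paragraph above can be modelled in a few lines of C. This is a toy model, not the runtime: demo_invoke stands in for mino_jit_invoke and demo_interp_resume for mino_bc_run_resume, whose real signatures are not shown in this document, and the register contents and resume PC are made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical slice of interpreter state. */
typedef struct {
    int      deopt_pending;   /* set by the deopt stencil */
    uint16_t resume_pc;       /* PC recorded by the deopt stencil */
    uint16_t resumed_from;    /* demo bookkeeping: last PC resumed at */
    int      interp_resumes;  /* demo bookkeeping: resume count */
} demo_state_t;

typedef void *(*demo_native_fn)(demo_state_t *st, int64_t *regs);

/* Stand-in native prefix: does its share of the work, then reaches the
   deopt stencil, which records the resume PC and returns NULL. */
static void *demo_native_prefix(demo_state_t *st, int64_t *regs) {
    regs[0] += 1;            /* work covered by stencils */
    st->resume_pc = 3;       /* first unstenciled op (illustrative value) */
    st->deopt_pending = 1;
    return NULL;             /* deopt sentinel */
}

/* Stand-in interpreter resume: continues over the same regs window. */
static void *demo_interp_resume(demo_state_t *st, int64_t *regs, uint16_t pc) {
    st->interp_resumes++;
    st->resumed_from = pc;
    regs[0] *= 2;            /* pretend the remaining ops double r0 */
    return &regs[0];
}

/* Detect the sentinel, clear the flag, hand off to the interpreter. */
static void *demo_invoke(demo_state_t *st, demo_native_fn fn, int64_t *regs) {
    void *r = fn(st, regs);
    if (r == NULL && st->deopt_pending) {
        st->deopt_pending = 0;
        return demo_interp_resume(st, regs, st->resume_pc);
    }
    return r;
}
```

Each call to demo_invoke enters the native prefix from the top and resumes in the interpreter exactly once, which mirrors the once-per-call cost described above.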
Two safety gates apply: the resume PC must fit in the 16-bit Bx slot the deopt stencil reads, and no direct-emit branch in the prefix may land past it. Both are checked by mino_jit_eligible before compile; fns failing either gate take the regular interpreter path. MINO_CPJIT_STATS=tracing surfaces an ok-with-deopt line per fn that took the compile-with-deopt path, and the bytes-blocked dashboard splits each op's total into hard (no native prefix) and ok-with-deopt counts so the reader can tell which blockers side-exit picked up.
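
The two gates can be expressed as a small predicate. The sketch below is an assumed reading of the rule, not the body of mino_jit_eligible: demo_deopt_eligible is a hypothetical name, branch targets are modelled as a plain array of PCs, and a branch landing exactly on the deopt position is treated as allowed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Gate 1: the resume PC must fit in a 16-bit slot.
   Gate 2: no branch in the native prefix may target a PC past the
   deopt position. */
static bool demo_deopt_eligible(size_t deopt_pc,
                                const size_t *branch_targets,
                                size_t n_targets) {
    if (deopt_pc > UINT16_MAX) return false;             /* gate 1 */
    for (size_t i = 0; i < n_targets; i++) {
        if (branch_targets[i] > deopt_pc) return false;  /* gate 2 */
    }
    return true;
}
```

A function that fails either check would, per the text above, simply take the regular interpreter path.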
Where the JIT shines: tight compute
The JIT does best on loop kernels and recursive compute where its stencils cover the inner cycle end-to-end. These are the workloads the copy-and-patch substrate was designed for: no allocation per iteration, no transducer machinery, just fused tagged-int arithmetic and inline-cached call dispatch. Figures are the median of three runs each on Apple Silicon (arm64-darwin) against mino v0.323.0.
| Workload | JIT off | JIT on | Speedup |
|---|---|---|---|
| (dec-only 10M): counted-down loop | 30.46 ms | 15.20 ms | 2.00x |
| (lt-only 10M): counted-up loop | 30.84 ms | 17.15 ms | 1.80x |
| (sum-to 1M): counter + accumulator | 19.41 ms | 3.01 ms | 6.46x |
| (fib 30): recursive compute | 107.15 ms | 53.34 ms | 2.01x |
The sum-to row is the strongest case among the current shapes: the JIT covers both (< i n) and (+ acc i) inline through the fused OP_LOOP_INT_LT_INC stencil, which eliminates the tagged-int dispatch overhead on both the counter and the accumulator. The other rows roughly halve their run time because the JIT covers either the loop step or the recursion path, while the function-call layer still goes through the interpreter dispatcher for the recursive branch.
Where the JIT does not shine: alloc / GC pressure
Median of three runs per cell, captured on Apple Silicon (arm64-darwin) against mino v0.323.0. All numbers in ms/op except the sub-ms row in µs/op.
| Row | JIT on | JIT off | Ratio (off/on) | Reading |
|---|---|---|---|---|
| build 5k int-map and sum | 10.05 ms | 10.34 ms | 1.03x | within noise envelope |
| bump 5k int-map values | 17.97 ms | 16.94 ms | 0.94x | within noise envelope |
| map/filter/map/reduce over 50k | 757 µs | 779 µs | 1.03x | within noise envelope |
| nested vectors 500x100 | 18.03 ms | 18.67 ms | 1.04x | within noise envelope |
| realize 10k of lazy range | 4.19 ms | 4.48 ms | 1.07x | within noise envelope |
| fibonacci(25) | 6.65 ms | 9.21 ms | 1.38x | meaningful JIT win |
Five of six rows land within the +/- 7% noise envelope. Allocation- and GC-dominated workloads are not where the JIT lives; they are dominated by nursery sizing, write-barrier cost, and minor-cycle frequency. The JIT sits above the GC and cannot accelerate work the allocator and collector are already doing. The one row that moves meaningfully is fibonacci(25), pure compute that the JIT's recursive-call inline cache and fused tagged-int arithmetic cover end-to-end.
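
The reading column in the A/B table follows from simple arithmetic: take the median of three runs per cell, divide off by on, and compare the ratio against the noise envelope. A minimal C sketch, with hypothetical demo_ names and the 7% envelope passed in as a parameter:

```c
typedef enum { DEMO_NOISE, DEMO_JIT_WIN, DEMO_JIT_LOSS } demo_reading_t;

/* Median of three samples. */
static double demo_median3(double a, double b, double c) {
    if ((a <= b && b <= c) || (c <= b && b <= a)) return b;
    if ((b <= a && a <= c) || (c <= a && a <= b)) return a;
    return c;
}

/* Ratio as used in the table: JIT-off time divided by JIT-on time. */
static double demo_ratio(double off_ms, double on_ms) {
    return off_ms / on_ms;
}

/* Classifies a ratio against a symmetric envelope such as 0.07. */
static demo_reading_t demo_classify(double ratio, double envelope) {
    if (ratio > 1.0 + envelope) return DEMO_JIT_WIN;
    if (ratio < 1.0 - envelope) return DEMO_JIT_LOSS;
    return DEMO_NOISE;
}
```

Applied to the table, 9.21 / 6.65 for fibonacci(25) falls outside the envelope, while 16.94 / 17.97 for the int-map bump row stays inside it.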
Out of scope
- Type-feedback specialization. Stencils dispatch on opcode shape, not on per-call-site type history. A future cycle can add an IC layer that captures observed types and patches a fast path; the current surface holds the interpreter-parity contract.
- Forward stencil hooks. The side-exit path is one-way: native to interpreter, then interpreter to function exit. Re-entering native after a deopt-resumed dispatch reaches a stenciled run again is a future enhancement; deferred until a workload demonstrates the need.
- Cross-module-leak static analysis for the stencil extractor. Synthetic-blob selftests cover regression detection; static analysis is nice-to-have, not feature-complete-blocking.
- Self-hosted Intel Mac runner for x86_64 Darwin end-to-end verification. Operational, not code; the cross-compile parity job is the documented floor.
How each cell is gated
All five targets land their byte tables through the same extractor (tools/stencil-extract), so a regression in a format parser breaks every host that shares that format. The synthetic-blob selftest in tools/stencil-extract --selftest catches parser regressions before any compile runs; the per-host generated byte table comparison catches drift introduced after the parser passes.
Three workflows produce the green cells:
- release-gate (every push, non-Windows): produces extract / generate / build / smoke / parity for the three hosts where it runs.
- cross-compile (every push, on macos-14): regenerates every committed stencil byte table and asserts no diff. Covers x86_64 Darwin's first two columns and guards the byte-identical claim for the other four hosts.
- ci-nightly (04:00 UTC): re-runs release-gate plus extended suites (GC stress, fault injection, embedding stress) on the three non-Windows hosts. Surfaces toolchain drift on a daily cadence instead of waiting for the next PR to trip on it.
Next steps
- Bytecode and VM -> The CPJIT layer: the architectural tour of the stencil substrate, ICache discipline, and what the runtime patcher does.
- Garbage Collection: where the floor lives for the allocation-heavy rows in the A/B table above.
- Performance: the runtime-perf track that surrounds JIT engagement.