JIT

The copy-and-patch JIT (CPJIT) is feature-complete on five host pipelines. It extracts pre-compiled opcode stencils into byte tables, one per host object format; the runtime patcher copies those bytes into executable memory and rewrites the operand immediates per call site. The full mino binary links the JIT in by default; the parallel mino-lean build compiles the CPJIT machinery out entirely for hosts that prefer a smaller footprint to the throughput speedup.
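The copy-then-patch step described above can be sketched in a few lines of C. This is a minimal illustration, not mino's patcher: the demo_stencil record, its single 32-bit hole, and demo_copy_and_patch are hypothetical names, and the destination is an ordinary buffer rather than executable memory (a real patcher also has to deal with page permissions and instruction-cache maintenance).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stencil record: pre-compiled bytes plus the offset of one
 * 32-bit operand "hole" that is rewritten per call site. */
typedef struct {
    const uint8_t *bytes;    /* stencil body taken from a byte table */
    size_t         len;      /* body length in bytes */
    size_t         imm_off;  /* offset of the 32-bit immediate hole */
} demo_stencil;

/* Copy the stencil into dst, then patch its immediate.
 * Returns the number of bytes written, or 0 if the stencil does not fit
 * or the hole would fall outside the stencil body. */
static size_t demo_copy_and_patch(uint8_t *dst, size_t cap,
                                  const demo_stencil *st, uint32_t imm)
{
    if (st->len > cap || st->imm_off + sizeof imm > st->len)
        return 0;
    memcpy(dst, st->bytes, st->len);              /* copy  */
    memcpy(dst + st->imm_off, &imm, sizeof imm);  /* patch */
    return st->len;
}
```

Using memcpy for the patch avoids unaligned stores, since a hole can sit at any byte offset inside a stencil.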

What ships today

65 opcode stencils covering:
- arithmetic: OP_ADD_II, OP_SUB_II, OP_MUL_II, k-immediate variants
- comparison: OP_EQ_II through OP_GE_II plus *_IK forms
- bitwise: OP_BAND_II, OP_BOR_II, OP_BXOR_II, OP_SHL_II, OP_SHR_II, OP_USHR_II, OP_BNOT_I
- unary predicates: OP_ZERO_INT_P, OP_POS_P_I, OP_EVEN_P_I, OP_ODD_P_I
- unary increment/decrement: OP_INC_I, OP_DEC_I
- fused loop steps: OP_LOOP_INT_LT, OP_LOOP_INT_DEC, OP_LOOP_INT_LT_INC, OP_LOOP_INT_DEC_INC
- data-structure fast lanes: OP_NTH_VEC, OP_FIRST_VEC, OP_COUNT_VEC, OP_EMPTY_VEC, OP_CONJ_VEC, OP_GET_KW_MAP, OP_ASSOC, OP_DISSOC
- dispatch: OP_CALL, OP_CALL_CACHED, OP_TAILCALL, OP_PROTOCOL_CALL_CACHED, OP_PROTOCOL_TAILCALL_CACHED, OP_GETGLOBAL_CACHED, OP_CLOSURE, OP_MAKE_LAZY
- env management: OP_PUSH_ENV, OP_POP_ENV, OP_ENV_BIND
- leaf shapes: OP_LOAD_K, OP_MOVE, OP_RETURN, and the fused OP_LOAD_K_RETURN superinstruction
- two synthetic stencils the compile path inserts itself: OP_DEOPT_TO_INTERP to bail back to the interpreter, and OP_SAFEPOINT_POLL, planted before every backward jump so generic loops stay cancellable
5 host arches with on-disk byte tables. Each generated header carries the stencil bytes plus the symbol and relocation tables the runtime patcher consumes:
- stencils_arm64_darwin.h (102,298 bytes)
- stencils_arm64_linux.h (102,410 bytes)
- stencils_x86_64_darwin.h (95,897 bytes)
- stencils_x86_64_linux.h (96,317 bytes)
- stencils_x86_64_windows.h (99,068 bytes)
Side-exit deopt path. Fns whose first unstenciled op sits past PC 0 compile to a native prefix plus a deopt stencil. When execution reaches the deopt instruction, the native code returns into mino_bc_run_resume, which drives the interpreter over the same regs window from the recorded PC. Round-trip cost is roughly 100 ns per deopt on Apple Silicon; compared with running the prefix through the interpreter, the path breaks even at a prefix length of around 30 ops. After this landed, the realistic_bench and real_workloads corpora both show 100% native eligibility with zero hard rejections.
Cancellable JIT'd loops. Every native loop back-edge polls the safepoint on a 256-iteration downcounter: fused-loop stencils keep the counter in a register, and every other loop shape gets a safepoint stencil planted before its backward jump. A spinning JIT'd loop responds to (future-cancel f) within bounded wall time even when the body is entirely native, and the poll keeps the runtime's lock auto-yield at the same cadence the interpreter produces, so a native spin cannot starve sibling workers. Per-iteration cost is one decrement plus one branch on the fused hot path.
Adaptive tiering. In AUTO mode any callee invoked from inside a JIT-compiled region picks up a threshold of 1, so warm-up gaps on short-lived scripts collapse to the first call after the JIT'd caller fires. Tunable via mino_state_set_jit_hot_threshold for embedders that want different cold-call counts.
Dual-binary build. The full mino binary builds with -DMINO_CPJIT=1 and links the entire JIT pipeline. The parallel mino-lean binary builds the same source tree with MINO_CPJIT undefined; the patcher, emitter, and stencil entry layers compile to no-ops and the runtime decision tree collapses to the tree-walker plus the bytecode VM. mino_state_jit_capability returns {.available=0, ...} on mino-lean and {.available=1, ...} on the full build.
4-way parity green on the dev host. The test-jit-parity task runs the test suite four times (MINO_JIT=auto, MINO_JIT=on, MINO_JIT=off, and the mino-lean binary) and asserts the four stdouts are byte-identical and all four processes exit 0:
```
$ ./mino task test-jit-parity
93 tests, 93 assertions: 93 passed, 0 failed, 0 errors
  jit-parity: OK -- stdout byte-identical across
              jit-auto / jit-on / jit-off / lean, all exit 0
```
Synthetic-blob selftests. The tools/stencil-extract --selftest binary builds hand-crafted Mach-O, ELF, and COFF object blobs with known function bodies, symbol tables, and relocation tables, then runs each format parser against them and asserts the extracted bytes match expected values:
```
$ ./tools/stencil-extract --selftest
selftest_macho_synthetic: OK
selftest_elf_synthetic: OK
selftest_coff_synthetic: OK
stencil_extract selftest: OK
```
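The 256-iteration downcounter from the cancellable-loops item above can be modelled in plain C. This is a sketch of the idea only: demo_safepoint and demo_sum_to are hypothetical names, the cancel flag stands in for whatever the runtime actually checks at a safepoint, and the real stencils keep the counter in a register rather than a C local.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_POLL_INTERVAL 256u  /* iterations between polls, per the text */

/* Hypothetical stand-in for the runtime's safepoint state. */
typedef struct {
    volatile bool cancel_requested;  /* set by another thread to cancel */
    unsigned      polls;             /* how many times the loop polled  */
} demo_safepoint;

/* Sum 0..n-1, polling the safepoint once every 256 back-edges.
 * Returns false if the loop observed a cancel request at a poll. */
static bool demo_sum_to(int64_t n, demo_safepoint *sp, int64_t *out)
{
    int64_t  acc = 0;
    unsigned budget = DEMO_POLL_INTERVAL;
    for (int64_t i = 0; i < n; i++) {
        acc += i;
        if (--budget == 0) {           /* hot path: decrement + branch */
            budget = DEMO_POLL_INTERVAL;
            sp->polls++;
            if (sp->cancel_requested)
                return false;          /* bounded response to cancel */
        }
    }
    *out = acc;
    return true;
}
```

The cancel check sits on the cold side of the branch, so the per-iteration cost stays at the decrement and the branch while the worst-case delay before a cancel is observed is bounded by 256 iterations.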

Per-host pipeline stages

Each stage is independent. extract parses the object file produced by the stencil compiler. generate emits the on-disk byte table the runtime patcher reads. build is the full in-tree compile against that byte table. smoke is the test suite running through the JIT pipeline. parity is the four-way auto/on/off/lean byte-identical-stdout check.

The committed byte tables are regenerated by a single pinned zig cc (a bundled, version-locked Clang with cross-compilation) that builds every target from one host via ./mino task gen-stencils-all. Stencil sources are hermetic, so all five targets cross-compile with no platform SDK, and having one toolchain emit every table is what makes the tables byte-for-byte reproducible. This is a maintainer-only step: make plus any C99 compiler still builds mino from the committed bytes, and embedders never invoke a stencil compiler.

Target	Format	extract	generate	build	smoke	parity
ARM64 Darwin	Mach-O 64	green	green	green	green	green
ARM64 Linux	ELF64	green	green	green	green	green
x86_64 Linux	ELF64	green	green	green	green	green
x86_64 Darwin	Mach-O 64	green	green	partial	partial	partial
x86_64 Windows	PE/COFF	green	green	green	green	partial

Partial-cell notes

x86_64 Darwin build/smoke/parity: covered by the stencil-determinism job, which runs on one pinned-zig Linux runner every push, regenerates every committed byte table (including both Darwin targets, which need no macOS SDK) and asserts git diff --exit-code. End-to-end execution on Intel macOS is not in the GHA matrix; GitHub has been retiring Intel-Mac runners and the project does not yet host a self-hosted Intel runner. The Mach-O 64 stencil parser and patcher share their data flow with the ARM64 Darwin path, which is the dev host and is exercised on every push.
x86_64 Windows parity: the four-way parity sits inside the release-gate composite, which depends on a libsanitizer build that MinGW does not ship. The Windows runner instead runs the smoke build (make + tests/run.clj through the JIT) on every push, which exercises the same code paths the parity check would but stops short of the byte-identical-stdout assertion.

Runtime control

Each row in the support table is a build claim. At runtime, every JIT-capable binary lets the host choose how the pipeline executes. The five public symbols are stable across releases:

void                  mino_state_set_jit_mode(mino_state *S,
                                              mino_jit_mode mode);
mino_jit_mode         mino_state_jit_mode(const mino_state *S);

void                  mino_state_set_jit_hot_threshold(mino_state *S,
                                                       unsigned n);
unsigned              mino_state_jit_hot_threshold(const mino_state *S);

mino_jit_capability   mino_state_jit_capability(const mino_state *S);

Modes

MINO_JIT_MODE_AUTO (default): compile when the hot-call threshold trips. MINO_JIT_MODE_OFF: never compile. MINO_JIT_MODE_ON: compile on first call. ON is for benchmarking and parity testing; AUTO is the default for embedders.

Hot threshold

Default seed is the compile-time MINO_JIT_THRESHOLD (currently 10 calls). Lower it for shorter-lived scripts where warm-up matters; raise it to avoid compiling rarely-called functions in long-lived embedders. In AUTO mode, any callee invoked from inside a JIT-compiled region gets a threshold of 1, so the warm-up gap doesn't compound across nested JIT'd calls.
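The threshold rule can be sketched as a per-function call counter. Everything here is illustrative: demo_fn, demo_note_call, and the choice to trigger on the call that reaches the threshold (rather than the one after it) are assumptions, not mino's internals.

```c
#include <stdbool.h>

#define DEMO_DEFAULT_THRESHOLD 10u  /* mirrors the documented default seed */

/* Hypothetical per-function tiering state. */
typedef struct {
    unsigned calls;     /* calls seen while still interpreted */
    bool     compiled;  /* true once the fn has been handed to the JIT */
} demo_fn;

/* Record one call in AUTO mode; return true when this call should
 * trigger compilation. A callee invoked from JIT-compiled code uses a
 * threshold of 1 instead of the state's hot threshold. */
static bool demo_note_call(demo_fn *fn, unsigned hot_threshold,
                           bool caller_is_native)
{
    if (fn->compiled)
        return false;
    unsigned threshold = caller_is_native ? 1u : hot_threshold;
    fn->calls++;
    if (fn->calls >= threshold) {
        fn->compiled = true;
        return true;
    }
    return false;
}
```

With the default threshold, a function called only from interpreted code compiles on its tenth call in this model, while a callee reached from a native caller compiles on its first.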

Capability discovery

mino_state_jit_capability returns a struct with available, mode, threshold, host_arch, and host_os fields. Embedders use this at startup to size their tuning before any script runs. mino-lean returns {.available=0, ...} so a host build that depends on JIT throughput knows to fall back.
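A startup routine built on capability discovery might look like the sketch below. To keep it compilable on its own, it carries stand-in definitions for the types and the three functions it calls; in a real embed those come from mino's public header. The function and enum names follow this page, but the field types, the enum ordering, the host_configure_jit helper, and the thresholds 2 and 50 are assumptions chosen for illustration.

```c
/* --- Stand-ins so the sketch compiles alone; not mino's definitions. --- */
typedef enum {
    MINO_JIT_MODE_AUTO, MINO_JIT_MODE_OFF, MINO_JIT_MODE_ON
} mino_jit_mode;

typedef struct {
    int           available;   /* field types are assumed */
    mino_jit_mode mode;
    unsigned      threshold;
    const char   *host_arch;
    const char   *host_os;
} mino_jit_capability;

typedef struct mino_state { mino_jit_capability cap; } mino_state;

static void mino_state_set_jit_mode(mino_state *S, mino_jit_mode mode)
{ S->cap.mode = mode; }
static void mino_state_set_jit_hot_threshold(mino_state *S, unsigned n)
{ S->cap.threshold = n; }
static mino_jit_capability mino_state_jit_capability(const mino_state *S)
{ return S->cap; }
/* --- End of stand-ins. -------------------------------------------------- */

/* Hypothetical host helper: tune the JIT before any script runs.
 * Returns 1 if the JIT is available, 0 if the host must plan for
 * interpreter-only throughput (for example on mino-lean). */
static int host_configure_jit(mino_state *S, int short_lived_scripts)
{
    mino_jit_capability cap = mino_state_jit_capability(S);
    if (!cap.available)
        return 0;
    mino_state_set_jit_mode(S, MINO_JIT_MODE_AUTO);
    /* Illustrative values: low for short scripts, high for long-lived hosts. */
    mino_state_set_jit_hot_threshold(S, short_lived_scripts ? 2u : 50u);
    return 1;
}
```

Checking available first means the same host code runs unchanged against both the full binary and mino-lean.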

CLI flags and env vars

Mode and threshold are also reachable from outside the embed surface for scripting use:

--jit=auto|off|on and --jit-threshold=N call through to the same internal setters.
MINO_JIT and MINO_JIT_HOT_THRESHOLD set the per-state default at mino_state_new time before the host gets a handle.
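Both entry points accept the same three mode words, so a host that wants to mirror them in its own configuration layer could parse the value as sketched here. demo_jit_mode and demo_parse_jit_mode are hypothetical, and rejecting unknown values (rather than falling back to a default) is an assumption about desirable behaviour, not a description of what mino does.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical mirror of the three documented mode words. */
typedef enum { DEMO_JIT_AUTO, DEMO_JIT_OFF, DEMO_JIT_ON } demo_jit_mode;

/* Parse "auto", "off", or "on" (as accepted by --jit= and MINO_JIT).
 * Returns 1 and stores the mode on success; returns 0 and leaves *out
 * untouched for NULL or any other string. */
static int demo_parse_jit_mode(const char *s, demo_jit_mode *out)
{
    if (s == NULL)
        return 0;
    if (strcmp(s, "auto") == 0) { *out = DEMO_JIT_AUTO; return 1; }
    if (strcmp(s, "off")  == 0) { *out = DEMO_JIT_OFF;  return 1; }
    if (strcmp(s, "on")   == 0) { *out = DEMO_JIT_ON;   return 1; }
    return 0;
}
```

A caller would typically pass the result of getenv("MINO_JIT") and keep its existing default when the function returns 0.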

Side-exit deopt path

When a function's first unstenciled op sits past PC 0, the JIT compiles the supported prefix natively and plants an OP_DEOPT_TO_INTERP stencil at the first unstenciled position. The stencil records the resume PC on the state and returns NULL; mino_jit_invoke detects the deopt sentinel, clears the flag, and tail-calls mino_bc_run_resume to drive the interpreter over the same regs window from the recorded PC. The interpreter runs to function exit; subsequent calls re-enter the native prefix from the top, so the deopt cost is paid once per call, not per iteration.

Two safety gates apply: the resume PC must fit in the 16-bit Bx slot the deopt stencil reads, and no direct-emit branch in the prefix may land past it. Both are checked by mino_jit_eligible before compile; fns failing either gate take the regular interpreter path. MINO_CPJIT_STATS=tracing surfaces an ok-with-deopt line per fn that took the compile-with-deopt path, and the bytes-blocked dashboard splits each op's total into hard (no native prefix) and ok-with-deopt counts so the reader can tell which blockers the side-exit path picked up.
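The two gates amount to a small predicate over the candidate resume PC and the branch targets inside the native prefix. The sketch below is a hypothetical restatement of that rule, not the body of mino_jit_eligible: the function name, the flat array of branch targets, and the reading of "past" as strictly greater than the resume PC are assumptions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Check the two side-exit gates for a candidate deopt point.
 *   resume_pc       PC of the first unstenciled op
 *   branch_targets  target PCs of the direct-emit branches in the prefix
 *   n               number of entries in branch_targets */
static bool demo_deopt_gates_ok(uint32_t resume_pc,
                                const uint32_t *branch_targets, size_t n)
{
    if (resume_pc == 0)            /* no native prefix to compile at all */
        return false;
    if (resume_pc > UINT16_MAX)    /* gate 1: must fit the 16-bit Bx slot */
        return false;
    for (size_t i = 0; i < n; i++) {
        if (branch_targets[i] > resume_pc)  /* gate 2: lands past the exit */
            return false;
    }
    return true;
}
```

A function that fails either gate would simply stay on the regular interpreter path, as described above.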

Where the JIT shines: tight compute

Loop kernels and recursive compute where the JIT's stencils cover the inner cycle end-to-end. These are the workloads the copy-and-patch substrate was designed for: no allocation per iteration, no transducer machinery, just fused tagged-int arithmetic and inline-cached call dispatch. Median of three runs each on Apple Silicon (arm64-darwin) against mino v0.323.0.

Workload	JIT off	JIT on	Speedup
`(dec-only 10M)` — counted-down loop	30.46 ms	15.20 ms	2.00x
`(lt-only 10M)` — counted-up loop	30.84 ms	17.15 ms	1.80x
`(sum-to 1M)` — counter + accumulator	19.41 ms	3.01 ms	6.46x
`(fib 30)` — recursive compute	107.15 ms	53.34 ms	2.01x

The sum-to row is the strongest case in current shapes: the JIT covers both (< i n) and (+ acc i) inline (fused OP_LOOP_INT_LT_INC stencil), eliminating the tagged-int dispatch overhead on both the counter and the accumulator. The other rows roughly halve their runtime because the JIT covers either the loop step or the recursion path, but the function-call layer still goes through the interpreter dispatcher for the recursive branch.

Where the JIT does not shine: alloc / GC pressure

Median of three runs per cell, captured on Apple Silicon (arm64-darwin) against mino v0.323.0. All numbers in ms/op except the sub-ms row in µs/op.

Row	JIT on	JIT off	Ratio (off/on)	Reading
build 5k int-map and sum	10.05 ms	10.34 ms	1.03x	within noise envelope
bump 5k int-map values	17.97 ms	16.94 ms	0.94x	within noise envelope
map/filter/map/reduce over 50k	757 µs	779 µs	1.03x	within noise envelope
nested vectors 500x100	18.03 ms	18.67 ms	1.04x	within noise envelope
realize 10k of lazy range	4.19 ms	4.48 ms	1.07x	within noise envelope
fibonacci(25)	6.65 ms	9.21 ms	1.38x	meaningful JIT win

Five of six rows land within the +/- 7% noise envelope. Allocation- and GC-dominated workloads are not where the JIT lives; they are dominated by nursery sizing, write-barrier cost, and minor-cycle frequency. The JIT sits above the GC and cannot accelerate work the allocator and collector are already doing. The one row that moves meaningfully is fibonacci(25), pure compute that the JIT's recursive-call inline cache and fused tagged-int arithmetic cover end-to-end.
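The "within noise envelope" reading in the table is a threshold on how far the off/on ratio sits from 1.0. A minimal sketch of that classification, with the hypothetical helper demo_within_noise and the 7% envelope passed in as a parameter:

```c
#include <stdbool.h>

/* True when off/on deviates from 1.0 by at most `envelope`
 * (0.07 for the +/- 7% envelope used in the table above).
 * `on` must be non-zero. */
static bool demo_within_noise(double off, double on, double envelope)
{
    double delta = off / on - 1.0;
    if (delta < 0.0)
        delta = -delta;   /* absolute deviation, without needing libm */
    return delta <= envelope;
}
```

Applied to the table, demo_within_noise(4.48, 4.19, 0.07) holds for the lazy-range row, while demo_within_noise(9.21, 6.65, 0.07) does not for fibonacci(25).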

Out of scope

Type-feedback specialization. Stencils dispatch on opcode shape, not on per-call-site type history. A future cycle can add an IC layer that captures observed types and patches a fast path; the current surface holds the interpreter-parity contract.
Forward stencil hooks. The side-exit path is one-way: native to interpreter, then interpreter to function exit. Re-entering native code when the deopt-resumed interpreter reaches a stenciled run again is a future enhancement, deferred until a workload demonstrates the need.
Cross-module-leak static analysis for the stencil extractor. Synthetic-blob selftests cover regression detection; static analysis is nice-to-have, not feature-complete-blocking.
Self-hosted Intel Mac runner for x86_64 Darwin end-to-end verification. Operational, not code; the cross-compile parity job is the documented floor.

How each cell is gated

All five targets land their byte tables through the same extractor (tools/stencil-extract), so a regression in a format parser breaks every host that shares that format. The synthetic-blob selftest in tools/stencil-extract --selftest catches parser regressions before any compile runs; the per-host generated byte table comparison catches drift introduced after the parser passes.

Three workflows produce the green cells:

release-gate (every push, non-Windows): produces extract / generate / build / smoke / parity for the three hosts where it runs.
cross-compile (every push, on macos-14): regenerates every committed stencil byte table and asserts no diff. Covers x86_64 Darwin's first two columns and guards the byte-identical claim for the other four hosts.
ci-nightly (04:00 UTC): re-runs release-gate plus extended suites (GC stress, fault injection, embedding stress) on the three non-Windows hosts. Surfaces toolchain drift on a daily cadence instead of waiting for the next PR to trip on it.

Next steps

Bytecode and VM -> The CPJIT layer: the architectural tour of the stencil substrate, ICache discipline, and what the runtime patcher does.
Garbage Collection: where the floor lives for the allocation-heavy rows in the A/B table above.
Performance: the runtime-perf track that surrounds JIT engagement.