Performance
mino's evaluator is now a layered system. The tree-walker remains as the ground-truth interpreter; on top of it sits a small register-based bytecode VM that compiles function bodies lazily on first call. The compiler bails to the tree-walker on any unsupported form, so behavior stays identical and the VM is additive. The numbers below reflect this layered shape, not a single-path rewrite.
Footprint
Three binary footprints are worth knowing about: the Floor tier that an embedder commits to, the Sandbox tier with the canonical Clojure surface, and the Standalone ceiling that ships from Homebrew. All are linked with -ffunction-sections -Wl,--gc-sections (Mach-O: -Wl,-dead_strip) so unreferenced subsystems drop out at link time, and stripped with strip --strip-all.
Each tier ships in two flavours: the JIT-free build (matches the parallel mino-lean binary; embed with MINO_CPJIT undefined) and the JIT-included build (the default Homebrew bottle). Both columns are stripped, dead-section-eliminated builds against the same C source tree.
| Build | No JIT | + JIT | JIT cost | What's in it |
|---|---|---|---|---|
| Floor (install_minimal only) | ~601 KB | ~651 KB | +50 KB (8%) | mino_state_new + mino_install_minimal + mino_eval_string. Reader, evaluator, GC, persistent collections, numeric ops, foundational macros. No core.clj evaluation, no regex / bignum / multimethods / protocols / transducers, no I/O. |
| Sandbox (install_sandbox) | ~909 KB | ~943 KB | +34 KB (4%) | Floor plus regex, bignum, multimethods, protocols, transducers, and the safe bundled libs: every name a Clojure scripter expects. Still no I/O, FS, processes, STM, agents, async. |
| Standalone (install_all + REPL) | ~962 KB | ~996 KB | +34 KB (4%) | Sandbox plus I/O, FS, subprocess, STM, agents, async, host-interop, all bundled clojure.* namespaces, the project resolver, the task/deps machinery, and the REPL crash handler. The released mino binary an end user receives from Homebrew. The no-JIT Standalone column is the mino-lean sibling binary. |
JIT adds 34–50 KB across the tiers (4–8% of each binary). That is under one percent of any modern device's disk budget and adds well under 1 ms of disk-load time on a cold launch. The JIT-included build pays back 1.8–6.5x on compute-bound hot code (see the JIT page for the workload table). Embedders running one-shot scripts on Floor get marginal value from JIT, since there are no hot loops to amortize against; embedders on Sandbox or Standalone with any sustained execution should keep it on.
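The percentages in the JIT cost column can be reproduced from the two size columns. The following is a minimal sketch that only restates the table's arithmetic; the `TIERS` dictionary and the `jit_cost` helper are illustrative names, not part of mino.

```python
# Stripped binary sizes in KB, copied from the footprint table above.
TIERS = {
    "Floor": (601, 651),
    "Sandbox": (909, 943),
    "Standalone": (962, 996),
}


def jit_cost(no_jit_kb, jit_kb):
    """Return (KB added by the JIT, percent growth over the no-JIT build)."""
    delta = jit_kb - no_jit_kb
    return delta, 100.0 * delta / no_jit_kb


for tier, (no_jit, with_jit) in TIERS.items():
    delta, pct = jit_cost(no_jit, with_jit)
    print(f"{tier:<10} +{delta} KB ({pct:.1f}%)")
```

Run as written, this prints +50 KB (8.3%) for Floor, +34 KB (3.7%) for Sandbox, and +34 KB (3.5%) for Standalone, which the table rounds to 8%, 4%, and 4%.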
Source-side numbers for what an in-tree embedder pulls in:
| Item | Size | Notes |
|---|---|---|
| C source tree (src/ minus vendor) | ~2.27 MB | C source plus generated bundled-source headers |
| Vendor (imath for BigInt) | ~157 KB | Only loaded when arithmetic exceeds 64-bit range |
| Bundled stdlib source (clojure.* headers compiled into the binary) | ~194 KB | Lazy-installed; the minimum-embed build drops these |
| core.clj source | ~121 KB | Embedded as a C string literal; evaluated the first time mino_install brings in a non-floor capability |
Cold startup
Wall time from fork+exec to process exit. Each row is the median of 50 invocations after three warmup runs to prime the OS page cache. Both columns evaluate the same (+ 1 2) expression; the difference between the two is the cost of linking the JIT pipeline in (a small mmap + symbol-table init at mino_state_new time, with no actual compilation triggered by the one-shot expression).
| Tier | No JIT | + JIT | JIT cost | Notes |
|---|---|---|---|---|
| Floor (install_minimal) | 3.98 ms | 4.01 ms | ~0 | Process spawn + mino_state_new + mino_install_minimal + eval + exit. No core.clj parse / eval. |
| Sandbox (install_sandbox) | 6.99 ms | 6.99 ms | 0 | Floor + regex + bignum + multimethods + protocols + transducers + the safe bundled libs. Parses and evaluates core.clj at install. |
| Standalone (./mino -e ...) | 8.09 ms | 8.04 ms | ~0 | Sandbox plus I/O, FS, processes, STM, agents, async, bundled clojure.*. The Homebrew binary. No-JIT column is the mino-lean sibling. |
JIT contributes essentially zero to cold start. The pipeline initializes an mmap'd page at mino_state_new (sub-millisecond), and the hot-call threshold means nothing compiles until user code has called a function more than MINO_JIT_THRESHOLD (default 10) times. A one-shot script that fits in a single expression therefore pays the same wall time on JIT and no-JIT builds.
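The measurement protocol described above (a few unrecorded warmup runs, then the median of 50 timed spawns) can be sketched as a small harness. This is an illustration of the method, not the script that produced the table. The `median_wall_ms` helper is a hypothetical name, and the stand-in command is the Python interpreter so the sketch runs anywhere.

```python
import statistics
import subprocess
import sys
import time


def median_wall_ms(cmd, runs=50, warmups=3):
    """Median wall time in ms for spawning cmd and waiting for it to exit."""
    for _ in range(warmups):
        # Warmup runs prime the OS page cache and are not recorded.
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


if __name__ == "__main__":
    # Stand-in command. To time mino, pass its command line instead, for
    # example ["./mino", "-e", "(+ 1 2)"], assuming -e takes the expression
    # as its next argument.
    cmd = [sys.executable, "-c", "pass"]
    print(f"median: {median_wall_ms(cmd, runs=10):.2f} ms")
```

A harness like this adds its own process-spawn overhead on top of the child's runtime, so it is better suited to comparing two builds against each other than to reproducing the absolute numbers in the table.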
The next table shows per-state initialization cost, measured in-process over 50 init/teardown cycles inside one binary (no fork/exec overhead). An embedder that creates one runtime up-front pays this once; an embedder that spins one runtime per request sees it on every call.
| Operation | Median | Notes |
|---|---|---|
| mino_state_new + mino_install_minimal + mino_state_free | 0.18 ms | Floor tier. No core.clj parse / eval. |
| mino_state_new + mino_install_sandbox + mino_state_free | 2.65 ms | Sandbox tier. Equivalent to mino_install(S, env, MINO_CAP_DEFAULT). Parses and evaluates core.clj with regex, bignum, multimethods, protocols, transducers, and the safe bundled libs enabled. |
| mino_state_new + mino_install_all + mino_state_free | 2.78 ms | Adds I/O, FS, STM, agents, bundled clojure.* registration (lazy; not evaluated until required). The standalone CLI path. |
The Floor tier saves ~2.5 ms per state initialization (0.18 ms versus 2.65 ms for Sandbox) by skipping core.clj evaluation. The cost is that capability-gated names (e.g. re-find, defmulti, slurp) are not bound; user code calling them raises an MNS002 capability-disabled diagnostic until the host installs the corresponding capability.
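To see how the choice between one long-lived runtime and one runtime per request plays out, the medians from the table can be multiplied through. This is a back-of-the-envelope sketch only; `INIT_MS` and `init_overhead_ms` are illustrative names, and it assumes the measured medians hold under load.

```python
# Median init + teardown cost per state in ms, from the table above.
INIT_MS = {"floor": 0.18, "sandbox": 2.65, "all": 2.78}


def init_overhead_ms(tier, requests, per_request_state):
    """Total initialization time in ms spent while serving `requests` requests."""
    states_created = requests if per_request_state else 1
    return INIT_MS[tier] * states_created


saving = INIT_MS["sandbox"] - INIT_MS["floor"]
print(f"Floor saves {saving:.2f} ms per state")

per_request = init_overhead_ms("sandbox", 1000, per_request_state=True)
shared = init_overhead_ms("sandbox", 1000, per_request_state=False)
print(f"1000 requests, one state each: {per_request:.0f} ms of init")
print(f"1000 requests, one shared state: {shared:.2f} ms of init")
```

Under these assumptions, a Sandbox state per request spends about 2.65 s on initialization across 1,000 requests, while a single shared state pays 2.65 ms once.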
Core operations
Per-call cost for fundamental eval shapes, measured through the full read + eval path via the bytecode VM (mean of 100,000 iterations, dominant fast-path). Lower is better.
| Operation | Cost | Notes |
|---|---|---|
| Primitive call (+ 1 2) | 5.5 µs | Fused int-add fast lane, no boxing |
| User fn call (1 arg) | 5.5 µs | Compiled to bytecode; register-window entry |
| User fn call (3 args) | 5.8 µs | Cost grows ~0.1 µs per arg in the bc path |
| Vector literal [1 2 3] | 5.0 µs | 32-way trie allocation |
| Map literal {:a 1} | 5.1 µs | HAMT insertion per key |
| (get m k) on 100-key map | 5.8 µs | Hash + HAMT traversal |
| Symbol resolution (local let) | 4.7 µs | Register read; no env walk |
| Symbol resolution (global var) | 4.9 µs | Inline-cache hit on the bc GETGLOBAL slot |
| Closure capture (1 var) | 4.9 µs | Env-chain extend + restore |
| Closure capture (5 vars) | 5.0 µs | Captures scale below noise |
Bulk operations
Cost of working with collections at scale. The fused counted-loop opcodes (OP_LOOP_INT_DEC et al.) and the int-fast-lane opcodes are responsible for most of the movement here since the last bench.
| Operation | Cost | Per element |
|---|---|---|
| (into [] (range 100)) | 29.8 µs | 0.30 µs |
| (reduce + 0 (range 100)) | 12.6 µs | 0.13 µs |
| (reduce + 0 (range 1000)) | 13.9 µs | 0.014 µs |
| loop/recur 1,000 iterations | 5.5 µs | 0.006 µs |
| loop/recur 10,000 iterations | 73 µs | 0.007 µs |
| Build 100-key map (assoc loop) | 282 µs | 2.82 µs/key |
| conj 1,000-element vector | 183 µs | 0.18 µs/elt |
| conj 10,000-element vector | 2.29 ms | 0.23 µs/elt |
| nth random on 1,000-vec | 5.5 µs | — |
| (get m k) on 1,000-key map | 5.4 µs | — |
| (fib 25) recursive (~242k calls) | 6.65 ms | 0.027 µs/call |
Tight integer loops run at near-native speed when the compiler can prove the iteration is int-typed. loop/recur over 10,000 iterations clocks in around 7 ns per step because the fused-loop opcode collapses the test/dec/back-jump into a single dispatch with two tagged-int checks. Recursive Fibonacci sees the JIT compile the recursion hot path and reaches roughly 27 ns per call.
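The per-call and per-step figures follow directly from the table. The sketch below recomputes them, assuming the benchmark's fib is the usual naive definition (a base case for n < 2 and two recursive calls otherwise); the benchmark source is not shown here, so that definition is an assumption, and `fib_calls` is an illustrative helper.

```python
def fib_calls(n):
    """Invocations made by naive recursive fib(n), assuming a base case at n < 2.

    Uses calls(n) = 1 + calls(n - 1) + calls(n - 2) with calls(0) = calls(1) = 1.
    """
    a, b = 1, 1  # calls(0), calls(1)
    for _ in range(n - 1):
        a, b = b, a + b + 1
    return b if n >= 1 else a


calls = fib_calls(25)
ns_per_call = 6.65e6 / calls      # 6.65 ms expressed in ns
ns_per_step = 73_000 / 10_000     # 73 µs over 10,000 loop/recur iterations

print(f"fib(25) makes {calls} calls")
print(f"{ns_per_call:.1f} ns per call")
print(f"{ns_per_step:.1f} ns per loop step")
```

This gives 242,785 calls, about 27.4 ns per call, and 7.3 ns per loop step, in line with the rounded values quoted above.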
Eager collection builders
When laziness is not needed, rangev, mapv, and filterv produce vectors directly in C, bypassing thunk allocation entirely. The bc compiler also recognizes reduce over a vector and dispatches straight to the C primitive walker.
| Operation | Cost | vs. lazy equivalent |
|---|---|---|
| (rangev 100) | 2.8 µs | 10× faster than (into [] (range 100)) |
| (mapv inc (rangev 100)) | 15.2 µs | Eliminates per-element thunk + cons |
| (filterv odd? (rangev 100)) | 13.8 µs | Same shape as mapv |
Use rangev for data generation and reduce over vectors for the biggest wins. The speedup over lazy comes from skipping thunk allocation and eval overhead per element; once per-element work is dominated by a user fn the gap narrows.
Concurrency
Standalone mino grants cpu_count worker threads at startup, so future, promise, thread, and the blocking channel ops <!!/>!!/alts!! resolve to real OS threads. Embedders start at one (single-threaded) and raise the limit via mino_set_thread_limit or one of the pool/factory grants. A runtime that has at least one live worker serializes script execution on a per-state recursive mutex; cross-state work runs fully concurrent and intra-state work is naturally race-free. Single-threaded states skip the mutex entirely and pay no lock cost.
core.async numbers from the current bench run:
| Operation | Cost | Notes |
|---|---|---|
| offer!/poll! on (chan 1024) (no scheduler) | 281 µs/op | Buffer + offer/poll data-structure cost only |
| offer! on full buffer returns false | 117 µs/op | Hot rejection path |
| poll! on empty buffer returns nil | 67 µs/op | Hot empty-buffer path |
| put!/take! on (chan 1) + drain! | 363 µs/op | Callback path through the scheduler |
| go block (<! ch) with pending put + drain! | 1.84 ms/op | IOC state machine + park/unpark roundtrip |
| go producer/consumer hand-shake pair | 3.67 ms/op | Two park/unpark cycles end-to-end |
| alts! over 1 ready channel | 405 µs/op | Arbitration on a single ready candidate |
| alts! over 8 channels, last ready | 1.03 ms/op | Linear walk through :priority order |
| alts! with :default | 159 µs/op | Fast non-block path |
| (timeout 0) + take! + drain | 692 µs/op | Timer-chan path through the scheduler |
Shared-state work scales to one core's worth of throughput regardless of worker count because the per-state recursive mutex serializes script execution. To get parallel speedup, distribute work across runtime instances and pass results back via the host or via mino_clone, not across workers in one runtime.
Garbage collection
mino uses a non-moving two-generation tracing collector. Short-lived values live in a young-gen nursery that is swept in bounded minor collections. Survivors are promoted to old-gen, which is marked incrementally, paced by the allocator. A write barrier records old-to-young pointers so minors stay proportional to young reachability. The collector is stop-the-world at slice boundaries; there are no collector threads.
GC share is a function of allocation pressure, not absolute speed. The bytecode VM cut the constant-factor cost of computation but left allocation rates largely unchanged, so GC share rose proportionally on the same workloads:
| Workload | GC share | Max pause |
|---|---|---|
| Small function calls (empty, identity, let) | ~12% | ~1.4 ms |
| loop/recur 10,000 iterations | ~0% | — |
| Build 1,000-element vector via conj | ~19% | ~1.4 ms |
| Build 10,000-element vector via conj | ~21% | ~7.8 ms |
| Build 5k int-map and sum | ~16% | ~1.8 ms |
| map/filter/map/reduce over 50,000 (fused transducers) | ~0% | — |
| Nested vectors 500x100 | ~17% | ~2.0 ms |
| Realize 10k of lazy range | ~33% | ~4.0 ms |
Five tuning knobs are exposed through mino_gc_set_param: nursery size, major growth multiplier, promotion age, incremental slice budget, and allocation quantum between slices. The defaults target interactive latency on a general workload; embedders with throughput-dominated batches or tighter pause budgets can shift the tradeoff without rebuilding.
The default nursery size rose from 1 MiB to 4 MiB in v0.250.0 after a measured pass over realistic_bench. Allocation-heavy workloads (bump-int-map, nested-vec, lazy-range realization) gained 1.14–1.42x with no measurable regression in worst-case minor-GC pause: the larger nursery collects more bytes per cycle, so total GC wall time falls even though each minor pass sweeps more. Each VM state holds 3 extra MiB of young-gen residency before the first major GC; embedders running many concurrent VM states under tight memory budgets can override the default via MINO_GC_NURSERY_BYTES or mino_gc_set_param(S, MINO_GC_NURSERY_BYTES, n).
realistic_bench (v0.249 vs v0.250):
| Row | 1 MiB (v0.249) | 4 MiB (v0.250) | Speedup |
|---|---|---|---|
| build 5k int-map and sum | 12.72 ms | 11.18 ms | 1.14x |
| bump 5k int-map values | 22.92 ms | 16.12 ms | 1.42x |
| map/filter/map/reduce 50k | 0.77 ms | 0.68 ms | 1.13x |
| nested vectors 500×100 | 23.00 ms | 16.92 ms | 1.36x |
| realize 10k lazy range | 7.82 ms | 5.79 ms | 1.35x |
| fibonacci(25) | 7.83 ms | 6.43 ms | 1.22x |
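The speedup column and the memory side of the tradeoff can both be checked with a few lines of arithmetic. This sketch recomputes the ratios from the table and estimates the extra young-gen residency for a host running many states; `speedup` and `extra_nursery_mib` are illustrative helpers, not mino APIs.

```python
# (1 MiB nursery, 4 MiB nursery) timings in ms, from the realistic_bench table.
ROWS = {
    "build 5k int-map and sum": (12.72, 11.18),
    "bump 5k int-map values": (22.92, 16.12),
    "map/filter/map/reduce 50k": (0.77, 0.68),
    "nested vectors 500x100": (23.00, 16.92),
    "realize 10k lazy range": (7.82, 5.79),
    "fibonacci(25)": (7.83, 6.43),
}


def speedup(before_ms, after_ms):
    """Ratio of old time to new time; values above 1.0 mean faster."""
    return before_ms / after_ms


def extra_nursery_mib(states, old_mib=1, new_mib=4):
    """Extra young-gen residency across `states` VM states, in MiB."""
    return states * (new_mib - old_mib)


for row, (before, after) in ROWS.items():
    print(f"{row:<28} {speedup(before, after):.2f}x")

print(f"100 states hold {extra_nursery_mib(100)} MiB more young-gen")
```

For a host with 100 concurrent states, the larger default costs about 300 MiB of additional nursery residency, which is the case where overriding the nursery size is worth considering.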
Where the time goes
The cost centers in order of impact:
- Allocation pressure. Persistent collections, cons cells, and intermediate seqs are the dominant source of work in any realistic pipeline. Recent reduce and assoc/identity short-circuits cut redundant allocation, but laziness still pays a thunk + cons-cell per element. Use rangev/mapv/filterv when laziness is not needed; use loop/recur when iterating without building a collection.
- Core library initialization. A mino_state_t on the Standard or Standalone tier parses and evaluates core.clj (~5.2 ms here, smaller on faster hardware). The Floor tier (mino_install_minimal) skips this entirely and pays only ~0.22 ms. Parsed forms are cached per state, so additional envs in one state avoid re-parsing. Bundled clojure.* namespaces are registered but not evaluated until required.
- Bytecode bailouts. Forms the bc compiler doesn't yet handle bail to the tree-walker on first call and are remembered as declined; subsequent calls skip recompilation but still pay tree-walker per-call cost. Compiler coverage is the lever here, not VM speed.
- Per-state lock at every eval entry (when threaded). Once a state is multi-threaded (host has granted workers, or the standalone has run mino_install_all), each script entry through mino_eval_string or a worker mino_call takes the per-state recursive mutex. Uncontested cost is ~10 ns per entry; single-threaded states skip the mutex entirely.
What this means in practice
mino is fast enough for configuration evaluation, rules engines, interactive consoles, plugin systems, scripting automation, and data transformation on moderate-sized collections. It evaluates a simple expression in low microseconds and runs well over a hundred thousand iterations per millisecond on tight integer loops (about 7 ns per step).
It is still not the right choice for tight numerical loops in the hot inner cycle of a server, large-scale data processing, or any workload where per-element overhead matters at the nanosecond level. For those cases, do the heavy lifting in C and pass results to mino for composition and coordination. The embedding model supports this naturally.
Benchmarking
Write benchmarks as mino scripts or C programs that link against the mino source. Compile and run with the same per-subsystem flags the standalone build uses:
cc -std=c99 -O2 \
-Isrc -Isrc/public -Isrc/runtime -Isrc/gc -Isrc/eval \
-Isrc/collections -Isrc/prim -Isrc/async -Isrc/interop \
-Isrc/diag -Isrc/vendor/imath \
-o my_bench my_bench.c \
src/public/*.c src/runtime/*.c src/gc/*.c src/eval/*.c \
src/eval/bc/*.c src/collections/*.c src/prim/*.c \
src/async/*.c src/interop/*.c src/regex/*.c \
src/diag/*.c src/vendor/imath/*.c \
-lm -lpthread
./my_bench
For minimum-footprint embed measurements, add -ffunction-sections -fdata-sections to the compile flags and -Wl,--gc-sections to the link flags so unreferenced subsystems drop out.