Performance

mino's evaluator is now a layered system. The tree-walker remains as the ground-truth interpreter; on top of it sits a small register-based bytecode VM that compiles function bodies lazily on first call. The compiler bails to the tree-walker on any unsupported form, so behavior stays identical and the VM is additive. The numbers below reflect this layered shape, not a single-path rewrite.

Footprint

Three binary footprints are worth knowing about: the Floor tier that an embedder commits to, the Sandbox tier with the canonical Clojure surface, and the Standalone ceiling that ships from Homebrew. All three are built with -ffunction-sections -Wl,--gc-sections (Mach-O: -Wl,-dead_strip) so unreferenced subsystems drop out at link time, and stripped with strip --strip-all.

Each tier ships in two flavours: the JIT-free build (matches the parallel mino-lean binary; embed with MINO_CPJIT undefined) and the JIT-included build (the default Homebrew bottle). Both columns are stripped, dead-section-eliminated builds against the same C source tree.

| Build | No JIT | + JIT | JIT cost | What's in it |
| --- | --- | --- | --- | --- |
| Floor (install_minimal only) | ~601 KB | ~651 KB | +50 KB (8%) | mino_state_new + mino_install_minimal + mino_eval_string. Reader, evaluator, GC, persistent collections, numeric ops, foundational macros. No core.clj evaluation, no regex / bignum / multimethods / protocols / transducers, no I/O. |
| Sandbox (install_sandbox) | ~909 KB | ~943 KB | +34 KB (4%) | Floor plus regex, bignum, multimethods, protocols, transducers, and the safe bundled libs: every name a Clojure scripter expects. Still no I/O, FS, processes, STM, agents, async. |
| Standalone (install_all + REPL) | ~962 KB | ~996 KB | +34 KB (4%) | Sandbox plus I/O, FS, subprocess, STM, agents, async, host-interop, all bundled clojure.* namespaces, the project resolver, the task/deps machinery, and the REPL crash handler. The released mino binary an end user receives from Homebrew. The no-JIT Standalone column is the mino-lean sibling binary. |

JIT adds 34–50 KB across the tiers — under one percent of any modern device's disk budget, and well under 1 ms of additional disk-load time on a cold launch. The JIT-included build pays back 1.8–6.5x on compute-bound hot code (see the JIT page for the workload table). Embedders running one-shot scripts on Floor get marginal value from JIT (no hot loops to amortize against); embedders on Sandbox or Standalone with any sustained execution should keep it on.

Source-side numbers for what an in-tree embedder pulls in:

| Item | Size | Notes |
| --- | --- | --- |
| C source tree (src/ minus vendor) | ~2.27 MB | C source plus generated bundled-source headers |
| Vendor (imath for BigInt) | ~157 KB | Only loaded when arithmetic exceeds 64-bit range |
| Bundled stdlib source (clojure.* headers compiled into the binary) | ~194 KB | Lazy-installed; the minimum-embed build drops these |
| core.clj source | ~121 KB | Embedded as a C string literal; evaluated the first time mino_install brings in a non-floor capability |

Cold startup

Wall time from fork+exec to process exit. Each row is the median of 50 invocations after three warmup runs to prime the OS page cache. Both columns evaluate the same (+ 1 2) expression; the difference between the two is the cost of linking the JIT pipeline in (a small mmap + symbol-table init at mino_state_new time, with no actual compilation triggered by the one-shot expression).

| Tier | No JIT | + JIT | JIT cost | Notes |
| --- | --- | --- | --- | --- |
| Floor (install_minimal) | 3.98 ms | 4.01 ms | ~0 | Process spawn + mino_state_new + mino_install_minimal + eval + exit. No core.clj parse / eval. |
| Sandbox (install_sandbox) | 6.99 ms | 6.99 ms | 0 | Floor + regex + bignum + multimethods + protocols + transducers + the safe bundled libs. Parses and evaluates core.clj at install. |
| Standalone (./mino -e ...) | 8.09 ms | 8.04 ms | ~0 | Sandbox plus I/O, FS, processes, STM, agents, async, bundled clojure.*. The Homebrew binary. No-JIT column is the mino-lean sibling. |

JIT contributes essentially zero to cold start. The pipeline initializes an mmap'd page at mino_state_new (sub-millisecond), and the hot-call threshold means nothing compiles until user code calls a function past MINO_JIT_THRESHOLD (default 10) times. A one-shot script that fits in a single expression therefore pays the same wall-time on JIT and no-JIT builds.
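The measurement method described above (wall time from fork+exec to process exit, median of 50 runs after three warmups) can be sketched in C roughly as follows. This is a minimal POSIX sketch, not the project's actual bench harness; the function names are hypothetical, and the command to run is whatever argv the caller supplies.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Three-way comparison of two doubles for qsort. */
static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Median of n samples; sorts the array in place. */
double median(double *xs, size_t n) {
    assert(n > 0);
    qsort(xs, n, sizeof xs[0], cmp_double);
    return (n % 2) ? xs[n / 2] : (xs[n / 2 - 1] + xs[n / 2]) / 2.0;
}

/* Wall time in milliseconds from fork to child exit; -1.0 on failure. */
double spawn_ms(char *const argv[]) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    pid_t pid = fork();
    if (pid < 0) return -1.0;
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);  /* only reached if exec failed */
    }
    int status;
    if (waitpid(pid, &status, 0) < 0) return -1.0;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
}

/* Median spawn time over `runs` timed invocations, after `warmups`
   untimed invocations that prime the OS page cache. */
double cold_start_median_ms(char *const argv[], int warmups, int runs) {
    assert(runs > 0);
    double *samples = malloc((size_t)runs * sizeof *samples);
    if (!samples) return -1.0;
    for (int i = 0; i < warmups; i++) (void)spawn_ms(argv);
    for (int i = 0; i < runs; i++) samples[i] = spawn_ms(argv);
    double m = median(samples, (size_t)runs);
    free(samples);
    return m;
}
```

Calling cold_start_median_ms(argv, 3, 50) with the command line under test would mirror the warmup and sample counts used for the table. The median is used rather than the mean so that an occasional slow spawn does not skew the row.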

Per-process initialization cost, measured in-process over 50 init/teardown cycles inside one binary (no fork/exec overhead). An embedder that creates one runtime up-front pays this once; an embedder that spins one runtime per request sees it on every call.

| Operation | Median | Notes |
| --- | --- | --- |
| mino_state_new + mino_install_minimal + mino_state_free | 0.18 ms | Floor tier. No core.clj parse / eval. |
| mino_state_new + mino_install_sandbox + mino_state_free | 2.65 ms | Sandbox tier. Equivalent to mino_install(S, env, MINO_CAP_DEFAULT). Parses and evaluates core.clj with regex, bignum, multimethods, protocols, transducers, and the safe bundled libs enabled. |
| mino_state_new + mino_install_all + mino_state_free | 2.78 ms | Adds I/O, FS, STM, agents, bundled clojure.* registration (lazy; not evaluated until required). The standalone CLI path. |

The Floor tier saves ~2.5 ms of initialization per runtime relative to Sandbox (0.18 ms against 2.65 ms) by skipping core.clj evaluation. The cost is that capability-gated names (e.g. re-find, defmulti, slurp) are not bound; user code calling them raises an MNS002 capability-disabled diagnostic until the host installs the corresponding capability.

Core operations

Per-call cost for fundamental eval shapes, measured through the full read + eval path via the bytecode VM (mean of 100,000 iterations, dominant fast-path). Lower is better.

| Operation | Cost | Notes |
| --- | --- | --- |
| Primitive call (+ 1 2) | 5.5 µs | Fused int-add fast lane, no boxing |
| User fn call (1 arg) | 5.5 µs | Compiled to bytecode; register-window entry |
| User fn call (3 args) | 5.8 µs | Cost grows ~0.1 µs per arg in the bc path |
| Vector literal [1 2 3] | 5.0 µs | 32-way trie allocation |
| Map literal {:a 1} | 5.1 µs | HAMT insertion per key |
| (get m k) on 100-key map | 5.8 µs | Hash + HAMT traversal |
| Symbol resolution (local let) | 4.7 µs | Register read; no env walk |
| Symbol resolution (global var) | 4.9 µs | Inline-cache hit on the bc GETGLOBAL slot |
| Closure capture (1 var) | 4.9 µs | Env-chain extend + restore |
| Closure capture (5 vars) | 5.0 µs | Captures scale below noise |

Bulk operations

Cost of working with collections at scale. The fused counted-loop opcodes (OP_LOOP_INT_DEC et al.) and the int-fast-lane opcodes are responsible for most of the movement here since the last bench.

| Operation | Cost | Per element |
| --- | --- | --- |
| (into [] (range 100)) | 29.8 µs | 0.30 µs |
| (reduce + 0 (range 100)) | 12.6 µs | 0.13 µs |
| (reduce + 0 (range 1000)) | 13.9 µs | 0.014 µs |
| loop/recur 1,000 iterations | 5.5 µs | 0.006 µs |
| loop/recur 10,000 iterations | 73 µs | 0.007 µs |
| Build 100-key map (assoc loop) | 282 µs | 2.82 µs/key |
| conj 1,000-element vector | 183 µs | 0.18 µs/elt |
| conj 10,000-element vector | 2.29 ms | 0.23 µs/elt |
| nth random on 1,000-vec | 5.5 µs | |
| (get m k) on 1,000-key map | 5.4 µs | |
| (fib 25) recursive (~242k calls) | 6.65 ms | 0.027 µs/call |

Tight integer loops run at near-native speed when the compiler can prove the iteration is int-typed. loop/recur over 10,000 iterations clocks in around 7 ns per step because the fused-loop opcode collapses the test/dec/back-jump into a single dispatch with two tagged-int checks. Recursive Fibonacci sees the JIT compile the recursion hot path and reaches roughly 27 ns per call.
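To make the dispatch saving concrete, here is a toy interpreter in C that counts dispatches for a countdown loop written two ways. It is an illustration only and not mino's VM: the name OP_LOOP_INT_DEC comes from the text, but the other opcodes, the instruction encoding, and the single-register machine are assumptions, and the tagged-int checks are not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy opcodes. Only the name OP_LOOP_INT_DEC appears in the text;
   the rest, and all encodings, are invented for illustration. */
enum { OP_JZ, OP_DEC, OP_JMP, OP_LOOP_INT_DEC, OP_HALT };

typedef struct { uint8_t op; int32_t target; } insn;

/* Unfused countdown: test, decrement, and back-jump are three dispatches. */
const insn unfused_loop[] = {
    { OP_JZ, 3 },    /* 0: exit when the counter reaches zero */
    { OP_DEC, 0 },   /* 1: counter-- */
    { OP_JMP, 0 },   /* 2: back to the test */
    { OP_HALT, 0 },  /* 3 */
};

/* Fused countdown: one opcode performs all three steps. */
const insn fused_loop[] = {
    { OP_LOOP_INT_DEC, 0 },  /* 0: if counter != 0, counter-- and repeat */
    { OP_HALT, 0 },          /* 1 */
};

/* Runs a program on one integer register and returns the number of
   dispatches executed, or -1 on an unknown opcode. */
long count_dispatches(const insn *code, int64_t counter) {
    assert(code != NULL && counter >= 0);
    long dispatches = 0;
    size_t pc = 0;
    for (;;) {
        insn i = code[pc];
        dispatches++;
        switch (i.op) {
        case OP_JZ:  pc = (counter == 0) ? (size_t)i.target : pc + 1; break;
        case OP_DEC: counter--; pc++; break;
        case OP_JMP: pc = (size_t)i.target; break;
        case OP_LOOP_INT_DEC:
            if (counter != 0) { counter--; pc = (size_t)i.target; }
            else pc++;
            break;
        case OP_HALT: return dispatches;
        default: return -1;
        }
    }
}
```

For 10,000 iterations the unfused program executes 30,002 dispatches and the fused one 10,002, which is the kind of reduction a fused counted-loop opcode is after.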

Eager collection builders

When laziness is not needed, rangev, mapv, and filterv produce vectors directly in C, bypassing thunk allocation entirely. The bc compiler also recognizes reduce over a vector and dispatches straight to the C primitive walker.

| Operation | Cost | vs. lazy equivalent |
| --- | --- | --- |
| (rangev 100) | 2.8 µs | 10× faster than (into [] (range 100)) |
| (mapv inc (rangev 100)) | 15.2 µs | Eliminates per-element thunk + cons |
| (filterv odd? (rangev 100)) | 13.8 µs | Same shape as mapv |

Use rangev for data generation and reduce over vectors for the biggest wins. The speedup over lazy comes from skipping thunk allocation and eval overhead per element; once per-element work is dominated by a user fn the gap narrows.
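The allocation difference behind that gap can be illustrated with a toy C model that counts heap allocations. This is a sketch under simplifying assumptions, not mino's implementation: one heap cell stands in for the per-element thunk + cons, a flat array stands in for the vector, and all names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

long alloc_count = 0;  /* number of heap allocations made so far */

static void *counted_malloc(size_t size) {
    alloc_count++;
    return malloc(size);
}

/* Stand-in for a realized lazy-seq node (thunk + cons collapsed into one). */
typedef struct cell { long value; struct cell *next; } cell;

/* Lazy-style path: one heap cell per element, then a copy into the result. */
long *range_via_cells(long n) {
    assert(n > 0);
    cell *head = NULL, *tail = NULL;
    for (long i = 0; i < n; i++) {
        cell *c = counted_malloc(sizeof *c);
        if (!c) abort();
        c->value = i;
        c->next = NULL;
        if (tail) tail->next = c; else head = c;
        tail = c;
    }
    long *out = counted_malloc((size_t)n * sizeof *out);
    if (!out) abort();
    long k = 0;
    while (head) {
        cell *next = head->next;
        out[k++] = head->value;
        free(head);
        head = next;
    }
    return out;
}

/* Eager-style path: one allocation, filled directly. */
long *range_direct(long n) {
    assert(n > 0);
    long *out = counted_malloc((size_t)n * sizeof *out);
    if (!out) abort();
    for (long i = 0; i < n; i++) out[i] = i;
    return out;
}
```

In this model a 100-element range costs 101 allocations on the cell path and one on the direct path. The real ratio in mino depends on its lazy-seq and vector representations, so treat the counts as an illustration of the shape rather than a measurement.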

Concurrency

Standalone mino grants cpu_count worker threads at startup, so future, promise, thread, and the blocking channel ops <!!/>!!/alts!! resolve to real OS threads. Embedders start at one (single-threaded) and raise the limit via mino_set_thread_limit or one of the pool/factory grants. A runtime that has at least one live worker serializes script execution on a per-state recursive mutex; cross-state work runs fully concurrent and intra-state work is naturally race-free. Single-threaded states skip the mutex entirely and pay no lock cost.

core.async numbers from the current bench run:

| Operation | Cost | Notes |
| --- | --- | --- |
| offer!/poll! on (chan 1024) (no scheduler) | 281 µs/op | Buffer + offer/poll data-structure cost only |
| offer! on full buffer returns false | 117 µs/op | Hot rejection path |
| poll! on empty buffer returns nil | 67 µs/op | Hot empty-buffer path |
| put!/take! on (chan 1) + drain! | 363 µs/op | Callback path through the scheduler |
| go block (<! ch) with pending put + drain! | 1.84 ms/op | IOC state machine + park/unpark roundtrip |
| go producer/consumer hand-shake pair | 3.67 ms/op | Two park/unpark cycles end-to-end |
| alts! over 1 ready channel | 405 µs/op | Arbitration on a single ready candidate |
| alts! over 8 channels, last ready | 1.03 ms/op | Linear walk through :priority order |
| alts! with :default | 159 µs/op | Fast non-block path |
| (timeout 0) + take! + drain | 692 µs/op | Timer-chan path through the scheduler |

Shared-state work scales to one core's worth of throughput regardless of worker count because the per-state recursive mutex serializes script execution. To get parallel speedup, distribute work across runtime instances and pass results back via the host or via mino_clone, not across workers in one runtime.

Garbage collection

mino uses a non-moving two-generation tracing collector. Short-lived values live in a young-gen nursery that is swept in bounded minor collections. Survivors are promoted to old-gen, which is marked incrementally, paced by the allocator. A write barrier records old-to-young pointers so minors stay proportional to young reachability. The collector is stop-the-world at slice boundaries; there are no collector threads.
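The write-barrier idea can be sketched in a few lines of C. This is a generic generational-GC illustration, not mino's collector: the object layout, the fixed-size remembered set, and every name below are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Toy heap object: a generation flag and one pointer field. */
typedef struct obj {
    int is_old;         /* 1 once promoted to old-gen */
    int in_remset;      /* 1 if already recorded in the remembered set */
    struct obj *field;
} obj;

#define REMSET_CAP 64
obj *remset[REMSET_CAP];  /* old objects that may point into young-gen */
size_t remset_len = 0;

/* Store `value` into holder->field through the write barrier.
   An old holder that starts pointing at a young object is recorded once,
   so a minor collection can treat it as an extra root. */
void write_field(obj *holder, obj *value) {
    assert(holder != NULL);
    holder->field = value;
    if (holder->is_old && value != NULL && !value->is_old && !holder->in_remset) {
        assert(remset_len < REMSET_CAP);
        holder->in_remset = 1;
        remset[remset_len++] = holder;
    }
}
```

Young-to-young and old-to-old stores fall through without touching the set, which is what keeps a minor collection proportional to young reachability: it scans the roots plus the recorded holders instead of all of old-gen.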

GC share is a function of allocation pressure, not absolute speed. The bytecode VM cut the constant-factor cost of computation but left allocation rates largely unchanged, so GC share rose proportionally on the same workloads:

| Workload | GC share | Max pause |
| --- | --- | --- |
| Small function calls (empty, identity, let) | ~12% | ~1.4 ms |
| loop/recur 10,000 iterations | ~0% | |
| Build 1,000-element vector via conj | ~19% | ~1.4 ms |
| Build 10,000-element vector via conj | ~21% | ~7.8 ms |
| Build 5k int-map and sum | ~16% | ~1.8 ms |
| map/filter/map/reduce over 50,000 (fused transducers) | ~0% | |
| Nested vectors 500x100 | ~17% | ~2.0 ms |
| Realize 10k of lazy range | ~33% | ~4.0 ms |

Five tuning knobs are exposed through mino_gc_set_param: nursery size, major growth multiplier, promotion age, incremental slice budget, and allocation quantum between slices. The defaults target interactive latency on a general workload; embedders with throughput-dominated batches or tighter pause budgets can shift the tradeoff without rebuilding.

The default nursery size rose from 1 MiB to 4 MiB in v0.250.0 after a measured pass over realistic_bench. Allocation-heavy workloads (bump-int-map, nested-vec, lazy-range realization) gained 1.14–1.42x with no measurable regression in worst-case minor-GC pause — the larger nursery collects more bytes per cycle so total GC wall time falls even though each minor pass sweeps more. Each VM state holds 3 extra MiB of young-gen residency before the first major GC; embedders running many concurrent VM states under tight memory budgets override via MINO_GC_NURSERY_BYTES or mino_gc_set_param(S, MINO_GC_NURSERY_BYTES, n).

realistic_bench (v0.249 vs v0.250):

| Row | 1 MiB (v0.249) | 4 MiB (v0.250) | Speedup |
| --- | --- | --- | --- |
| build 5k int-map and sum | 12.72 ms | 11.18 ms | 1.14x |
| bump 5k int-map values | 22.92 ms | 16.12 ms | 1.42x |
| map/filter/map/reduce 50k | 0.77 ms | 0.68 ms | 1.13x |
| nested vectors 500×100 | 23.00 ms | 16.92 ms | 1.36x |
| realize 10k lazy range | 7.82 ms | 5.79 ms | 1.35x |
| fibonacci(25) | 7.83 ms | 6.43 ms | 1.22x |
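For embedders weighing the residency cost of the larger nursery against many concurrent states, the arithmetic can be sketched as a small host-side helper. The policy below (split a fixed young-gen budget evenly and cap at the 4 MiB default) is a hypothetical example, not something mino prescribes, and the function name is invented.

```c
#include <assert.h>
#include <stddef.h>

#define MIB ((size_t)1024 * 1024)
#define DEFAULT_NURSERY_BYTES (4 * MIB)  /* default stated in the text */

/* Largest per-state nursery that keeps n_states nurseries within
   budget_bytes, capped at the 4 MiB default. Hypothetical host policy. */
size_t nursery_bytes_for_budget(size_t budget_bytes, size_t n_states) {
    assert(n_states > 0);
    size_t share = budget_bytes / n_states;
    return share < DEFAULT_NURSERY_BYTES ? share : DEFAULT_NURSERY_BYTES;
}
```

With a 128 MiB young-gen budget, 64 states come out at 2 MiB each, while 8 states stay at the 4 MiB default. The result would be the value passed as n to mino_gc_set_param(S, MINO_GC_NURSERY_BYTES, n).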

Where the time goes

The cost centers in order of impact:

What this means in practice

mino is fast enough for configuration evaluation, rules engines, interactive consoles, plugin systems, scripting automation, and data transformation on moderate-sized collections. It evaluates a simple expression in about 5 µs and runs tight integer loops at roughly 140,000 to 180,000 iterations per millisecond (6–7 ns per step in the loop/recur rows above).

It is still not the right choice for tight numerical loops in the hot inner cycle of a server, large-scale data processing, or any workload where per-element overhead matters at the nanosecond level. For those cases, do the heavy lifting in C and pass results to mino for composition and coordination. The embedding model supports this naturally.

Benchmarking

Write benchmarks as mino scripts or C programs that link against the mino source. Compile and run with the same per-subsystem flags the standalone build uses:

cc -std=c99 -O2 \
  -Isrc -Isrc/public -Isrc/runtime -Isrc/gc -Isrc/eval \
  -Isrc/collections -Isrc/prim -Isrc/async -Isrc/interop \
  -Isrc/diag -Isrc/vendor/imath \
  -o my_bench my_bench.c \
  src/public/*.c src/runtime/*.c src/gc/*.c src/eval/*.c \
  src/eval/bc/*.c src/collections/*.c src/prim/*.c \
  src/async/*.c src/interop/*.c src/regex/*.c \
  src/diag/*.c src/vendor/imath/*.c \
  -lm -lpthread
./my_bench

For minimum-footprint embed measurements, add -ffunction-sections -fdata-sections to the compile flags and -Wl,--gc-sections (Mach-O: -Wl,-dead_strip) to the link flags so unreferenced subsystems drop out.
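A C benchmark such as my_bench.c needs little more than a monotonic clock and a loop. The sketch below is one possible shape, assuming a POSIX clock_gettime; bench_mean_us, bench_fn, and count_call are hypothetical names, and the workload is a placeholder because the mino call signatures are not given here.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

/* Signature of the code under test; `ctx` carries whatever state it needs. */
typedef void (*bench_fn)(void *ctx);

/* Monotonic clock reading in microseconds. */
static double now_us(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
}

/* Mean cost per call in microseconds over `iters` timed calls,
   after `warmups` untimed calls. */
double bench_mean_us(bench_fn fn, void *ctx, long warmups, long iters) {
    assert(fn != NULL && iters > 0);
    for (long i = 0; i < warmups; i++) fn(ctx);
    double t0 = now_us();
    for (long i = 0; i < iters; i++) fn(ctx);
    return (now_us() - t0) / (double)iters;
}

/* Placeholder workload: bumps a counter. A real benchmark would call
   into the mino API here instead. */
void count_call(void *ctx) {
    (*(long *)ctx)++;
}
```

A main function would create the runtime once, pass it through ctx, and call bench_mean_us with the iteration count of interest. Timing the whole loop and dividing keeps clock overhead out of the per-call figure, which matters when a single call costs only a few microseconds.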