Performance

Numbers below were measured against mino v0.323.0 on Apple Silicon (arm64-darwin) under normal desktop load. Treat them as directional: different hardware will shift absolute numbers, but the ratios between rows hold.

Several optimization cycles shaped the current numbers:

The bc-frontiers cycle (v0.152.0 – v0.157.0) landed targeted fast lanes that moved the write-side, small-prim, record-access, generic-get, call-site, and transducer-fusion rows by 9–93%.
The Clojure-aware cycle (v0.158.0 – v0.163.0) added a protocol-method inline cache (proto-mono-area -57%), generalised the seq fusion to into / mapv / filterv / dorun (-65 to -95%), and inlined canonical-prim stages over a chunked-source walk (reduce-pipeline rows -9 to -19% on top of the v0.157.0 fusion).
The follow-on pre-JIT sweep (v0.164.0 – v0.166.0) routes the canonical numeric reducers through an unboxed long-long accumulator (reduce on vec / set / list -24 to -49% on 100k sizes), ships real in-place transient vector mutation (into-vec-pipeline -74%, mapv-pipeline -69%), and rewrites the persistent-builder loop shape at compile time to use the new in-place transients ((loop ... (conj acc i)) at N=100k -71%).
The CPJIT cycle (v0.178.0 – v0.240.0) added a copy-and-patch runtime JIT, the dual-binary mino / mino-lean split, runtime JIT modes (--jit=auto|on|off), and end-to-end portability across five host arches (ARM64 Darwin / Linux, x86_64 Linux / Darwin / Windows).
The GC nursery bump in v0.250.0 (1 MiB → 4 MiB default) cut total GC wall-time by 35–60% across realistic_bench rows (1.13–1.42x speedup) without raising worst-case minor-GC pause.

Those later rows are not yet re-tabled here. The full bench suite lives in mino-bench/benchmarks/ and the in-process / cold-start / footprint harnesses live in mino-bench/tests/. Every table on this page links to the bench file its rows were measured against; follow the source link in the section heading to see the actual code.

mino's evaluator is now a layered system. The tree-walker remains as the ground-truth interpreter; on top of it sits a small register-based bytecode VM that compiles function bodies lazily on first call. The compiler bails to the tree-walker on any unsupported form, so behavior stays identical and the VM is additive. The numbers below reflect this layered shape, not a single-path rewrite.
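The layered dispatch described above can be pictured as a small per-function state machine. The sketch below is an illustration only, not mino's implementation: `toy_fn`, `fn_state`, `toy_compile`, and `toy_call` are hypothetical names, and the `uses_unsupported_form` flag stands in for whatever the real compiler checks. The declined state mirrors the note under "Where the time goes" that a bailed function is remembered so later calls skip recompilation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-function compile state (illustrative names only). */
typedef enum { FN_UNCOMPILED, FN_COMPILED, FN_DECLINED } fn_state;

typedef struct {
    bool     uses_unsupported_form; /* stand-in for "compiler cannot handle this body" */
    fn_state state;
    int      compile_attempts;      /* how many times the compiler ran */
} toy_fn;

typedef enum { RAN_IN_VM, RAN_IN_TREE_WALKER } engine;

/* Try to compile; returns false when the body holds an unsupported form. */
bool toy_compile(toy_fn *f) {
    f->compile_attempts++;
    return !f->uses_unsupported_form;
}

/* Dispatch one call: compile lazily on first call, fall back on bailout.
   A declined function is never recompiled, but every later call still
   pays the tree-walker cost. */
engine toy_call(toy_fn *f) {
    assert(f != NULL);
    if (f->state == FN_UNCOMPILED) {
        f->state = toy_compile(f) ? FN_COMPILED : FN_DECLINED;
    }
    return f->state == FN_COMPILED ? RAN_IN_VM : RAN_IN_TREE_WALKER;
}
```

Because the fallback path is the same interpreter that defines the language semantics, a declined function changes only the cost of a call, never its result.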

Footprint (source) (source)

Three binary footprints are worth knowing about: the Floor tier that an embedder commits to, the Sandbox tier with the canonical Clojure surface, and the Standalone ceiling that ships from Homebrew. All are compiled with -ffunction-sections, linked with -Wl,--gc-sections (Mach-O: -Wl,-dead_strip) so unreferenced subsystems drop out at link time, and stripped with strip --strip-all.

Each tier ships in two flavours: the JIT-free build (matches the parallel mino-lean binary; embed with MINO_CPJIT undefined) and the JIT-included build (the default Homebrew bottle). Both columns are stripped, dead-section-eliminated builds against the same C source tree.

Build	No JIT	+ JIT	JIT cost	What's in it
Floor (`install_minimal` only)	~601 KB	~651 KB	+50 KB (8%)	`mino_state_new` + `mino_install_minimal` + `mino_eval_string`. Reader, evaluator, GC, persistent collections, numeric ops, foundational macros. No `core.clj` evaluation, no regex / bignum / multimethods / protocols / transducers, no I/O.
Sandbox (`install_sandbox`)	~909 KB	~943 KB	+34 KB (4%)	Floor plus regex, bignum, multimethods, protocols, transducers, and the safe bundled libs — every name a Clojure scripter expects. Still no I/O, FS, processes, STM, agents, async.
Standalone (`install_all` + REPL)	~962 KB	~996 KB	+34 KB (4%)	Sandbox plus I/O, FS, `subprocess`, STM, agents, async, host-interop, all bundled `clojure.*` namespaces, the project resolver, the task/deps machinery, and the REPL crash handler. The released `mino` binary an end user receives from Homebrew. The no-JIT Standalone column is the `mino-lean` sibling binary.

JIT adds 34–50 KB across the tiers — under one percent of any modern device's disk budget, and well under 1 ms of additional disk-load time on a cold launch. The JIT-included build pays back 1.8–6.5x on compute-bound hot code (see the JIT page for the workload table). Embedders running one-shot scripts on Floor get marginal value from JIT (no hot loops to amortize against); embedders on Sandbox or Standalone with any sustained execution should keep it on.

Source-side numbers for what an in-tree embedder pulls in:

Item	Size	Notes
C source tree (`src/` minus vendor)	~2.27 MB	C source plus generated bundled-source headers
Vendor (`imath` for BigInt)	~157 KB	Only loaded when arithmetic exceeds 64-bit range
Bundled stdlib source (`clojure.*` headers compiled into the binary)	~194 KB	Lazy-installed; the minimum-embed build drops these
`core.clj` source	~121 KB	Embedded as a C string literal; evaluated the first time `mino_install` brings in a non-floor capability

Cold startup (source)

Wall time from fork+exec to process exit. Each row is the median of 50 invocations after three warmup runs to prime the OS page cache. Both columns evaluate the same (+ 1 2) expression; the difference between the two is the cost of linking the JIT pipeline in (a small mmap + symbol-table init at mino_state_new time, with no actual compilation triggered by the one-shot expression).

Tier	No JIT	+ JIT	JIT cost	Notes
Floor (`install_minimal`)	3.98 ms	4.01 ms	~0	Process spawn + `mino_state_new` + `mino_install_minimal` + eval + exit. No `core.clj` parse / eval.
Sandbox (`install_sandbox`)	6.99 ms	6.99 ms	0	Floor + regex + bignum + multimethods + protocols + transducers + the safe bundled libs. Parses and evaluates `core.clj` at install.
Standalone (`./mino -e ...`)	8.09 ms	8.04 ms	~0	Sandbox plus I/O, FS, processes, STM, agents, async, bundled `clojure.*`. The Homebrew binary. No-JIT column is the `mino-lean` sibling.
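The measurement method described above (fork+exec to exit, three warmup runs, median of 50) can be approximated with a short POSIX harness. This is a sketch under stated assumptions, not the harness in mino-bench/tests/: the function names are hypothetical, and it assumes a POSIX system with `fork`, `execvp`, `waitpid`, and `clock_gettime`.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* qsort comparator for doubles. */
int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Median of n samples; sorts the array in place. */
double median(double *samples, size_t n) {
    assert(n > 0);
    qsort(samples, n, sizeof *samples, cmp_double);
    return (n % 2) ? samples[n / 2]
                   : (samples[n / 2 - 1] + samples[n / 2]) / 2.0;
}

/* Wall time in ms from fork to child exit, or -1.0 on failure. */
double time_one_run_ms(char *const argv[]) {
    struct timespec t0, t1;
    int status;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    pid_t pid = fork();
    if (pid < 0) return -1.0;
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127); /* only reached when exec fails */
    }
    if (waitpid(pid, &status, 0) < 0) return -1.0;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
}

/* Median over `runs` timed invocations after `warmups` discarded ones.
   Returns -1.0 if allocation or any timed run fails. */
double cold_start_median_ms(char *const argv[], int warmups, int runs) {
    assert(runs > 0);
    for (int i = 0; i < warmups; i++) time_one_run_ms(argv);
    double *samples = malloc((size_t)runs * sizeof *samples);
    if (!samples) return -1.0;
    double result = 0.0;
    for (int i = 0; i < runs && result >= 0.0; i++) {
        samples[i] = time_one_run_ms(argv);
        if (samples[i] < 0.0) result = -1.0;
    }
    if (result >= 0.0) result = median(samples, (size_t)runs);
    free(samples);
    return result;
}
```

Calling `cold_start_median_ms` with an argv of `{"./mino", "-e", "(+ 1 2)", NULL}`, 3 warmups, and 50 runs would follow the shape of the Standalone row. The median is used instead of the mean so that a single scheduler hiccup does not skew the row.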

JIT contributes essentially nothing to cold start. The pipeline initializes an mmap'd page at mino_state_new (sub-millisecond), and the hot-call threshold means nothing compiles until user code has called a function more than MINO_JIT_THRESHOLD (default 10) times. A one-shot script that fits in a single expression therefore pays the same wall time on JIT and no-JIT builds.
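The hot-call threshold can be illustrated with a per-function call counter. The sketch below is hypothetical: `toy_hot_fn`, `toy_hot_call`, and `TOY_JIT_THRESHOLD` are invented names, and the exact boundary (compiling on the call that pushes the count past the threshold) is an assumption about how "past 10 times" is counted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Mirrors the documented default of 10; a stand-in, not mino's definition. */
#define TOY_JIT_THRESHOLD 10

typedef struct {
    unsigned calls;  /* calls observed so far */
    bool     jitted; /* true once native code would have been emitted */
} toy_hot_fn;

/* Record one call; returns true if this call would run as compiled code.
   Assumption: compilation triggers once the count exceeds the threshold. */
bool toy_hot_call(toy_hot_fn *f) {
    assert(f != NULL);
    f->calls++;
    if (!f->jitted && f->calls > TOY_JIT_THRESHOLD) {
        f->jitted = true; /* a real runtime would compile here */
    }
    return f->jitted;
}
```

Under this model a script that calls each function once never reaches the compile branch, which matches the equal No JIT and + JIT columns in the cold-start table.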

Per-process initialization cost, measured in-process over 50 init/teardown cycles inside one binary (no fork/exec overhead). An embedder that creates one runtime up-front pays this once; an embedder that spins one runtime per request sees it on every call.

Operation	Median	Notes
`mino_state_new` + `mino_install_minimal` + `mino_state_free`	0.18 ms	Floor tier. No `core.clj` parse / eval.
`mino_state_new` + `mino_install_sandbox` + `mino_state_free`	2.65 ms	Sandbox tier. Equivalent to `mino_install(S, env, MINO_CAP_DEFAULT)`. Parses and evaluates `core.clj` with regex, bignum, multimethods, protocols, transducers, and the safe bundled libs enabled.
`mino_state_new` + `mino_install_all` + `mino_state_free`	2.78 ms	Adds I/O, FS, STM, agents, bundled `clojure.*` registration (lazy; not evaluated until `require`d). The standalone CLI path.

The Floor tier saves ~2.5 ms of initialization on every cold start (0.18 ms versus 2.65 ms in the table above) by skipping core.clj evaluation. The cost is that capability-gated names (e.g. re-find, defmulti, slurp) are not bound; user code calling them raises an MNS002 capability-disabled diagnostic until the host installs the corresponding capability.

Core operations (source) (source)

Per-call cost for fundamental eval shapes, measured through the full read + eval path via the bytecode VM (mean of 100,000 iterations, dominant fast-path). Lower is better.

Operation	Cost	Notes
Primitive call `(+ 1 2)`	5.5 µs	Fused int-add fast lane, no boxing
User fn call (1 arg)	5.5 µs	Compiled to bytecode; register-window entry
User fn call (3 args)	5.8 µs	Cost grows ~0.15 µs per arg in the bc path
Vector literal `[1 2 3]`	5.0 µs	32-way trie allocation
Map literal `{:a 1}`	5.1 µs	HAMT insertion per key
`(get m k)` on 100-key map	5.8 µs	Hash + HAMT traversal
Symbol resolution (local `let`)	4.7 µs	Register read; no env walk
Symbol resolution (global var)	4.9 µs	Inline-cache hit on the bc `GETGLOBAL` slot
Closure capture (1 var)	4.9 µs	Env-chain extend + restore
Closure capture (5 vars)	5.0 µs	Captures scale below noise

Bulk operations (source) (source) (source)

Cost of working with collections at scale. The fused counted-loop opcodes (OP_LOOP_INT_DEC et al.) and the int-fast-lane opcodes are responsible for most of the movement here since the last bench.

Operation	Cost	Per element
`(into [] (range 100))`	29.8 µs	0.30 µs
`(reduce + 0 (range 100))`	12.6 µs	0.13 µs
`(reduce + 0 (range 1000))`	13.9 µs	0.014 µs
`loop/recur` 1,000 iterations	5.5 µs	0.006 µs
`loop/recur` 10,000 iterations	73 µs	0.007 µs
Build 100-key map (`assoc` loop)	282 µs	2.82 µs/key
`conj` 1,000-element vector	183 µs	0.18 µs/elt
`conj` 10,000-element vector	2.29 ms	0.23 µs/elt
`nth` random on 1,000-vec	5.5 µs	—
`(get m k)` on 1,000-key map	5.4 µs	—
`(fib 25)` recursive (~242k calls)	6.65 ms	0.027 µs/call

Tight integer loops cost single-digit nanoseconds per step when the compiler can prove the iteration is int-typed. loop/recur over 10,000 iterations clocks in around 7 ns per step because the fused-loop opcode collapses the test/dec/back-jump into a single dispatch with two tagged-int checks. Recursive Fibonacci sees the JIT compile the recursion hot path and reaches roughly 27 ns per call.
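To make the fused counted-loop idea concrete, here is a minimal sketch of a loop step that performs two tag checks, the loop test, and the decrement without leaving the handler. The tagging scheme (low bit set marks a small integer) and the names `val`, `TAG_INT`, and `fused_loop_int_dec` are assumptions for illustration; mino's actual value representation and the semantics of OP_LOOP_INT_DEC may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tagging scheme: low bit 1 marks a small integer. */
typedef intptr_t val;
#define TAG_INT(n) ((val)(((uintptr_t)(n) << 1) | 1u))
#define UNTAG(v)   ((intptr_t)(v) >> 1)
#define IS_INT(v)  (((v) & 1) != 0)

/* Count `counter` down to zero, adding `step` to an accumulator on each
   trip. One loop iteration models one dispatch of a fused opcode: two tag
   checks, the loop test, and the decrement. Returns false when either
   operand is not a tagged int, so a caller could bail to a generic path. */
bool fused_loop_int_dec(val counter, val step, val *acc_out) {
    assert(acc_out != NULL);
    val acc = TAG_INT(0);
    for (;;) {
        if (!IS_INT(counter) || !IS_INT(step)) return false; /* tag checks */
        if (counter <= TAG_INT(0)) break; /* tagged compare keeps int order */
        counter -= 2;    /* (2n+1) - 2 = 2(n-1)+1: decrement while tagged */
        acc += step - 1; /* (2a+1) + (2s+1) - 1 = 2(a+s)+1: tagged add */
    }
    *acc_out = acc;
    return true;
}
```

Both arithmetic steps operate directly on the tagged words, which is why no boxing or untagging appears inside the loop.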

Eager collection builders (source) (source)

When laziness is not needed, rangev, mapv, and filterv produce vectors directly in C, bypassing thunk allocation entirely. The bc compiler also recognizes reduce over a vector and dispatches straight to the C primitive walker.

Operation	Cost	vs. lazy equivalent
`(rangev 100)`	2.8 µs	10× faster than `(into [] (range 100))`
`(mapv inc (rangev 100))`	15.2 µs	Eliminates per-element thunk + cons
`(filterv odd? (rangev 100))`	13.8 µs	Same shape as `mapv`

Use rangev for data generation and reduce over vectors for the biggest wins. The speedup over lazy comes from skipping thunk allocation and eval overhead per element; once per-element work is dominated by a user fn the gap narrows.

Concurrency (source)

Standalone mino grants cpu_count worker threads at startup, so future, promise, thread, and the blocking channel ops <!!/>!!/alts!! resolve to real OS threads. Embedders start at one (single-threaded) and raise the limit via mino_set_thread_limit or one of the pool/factory grants. A runtime that has at least one live worker serializes script execution on a per-state recursive mutex; cross-state work runs fully concurrent and intra-state work is naturally race-free. Single-threaded states skip the mutex entirely and pay no lock cost.

core.async numbers from the current bench run:

Operation	Cost	Notes
`offer!`/`poll!` on `(chan 1024)` (no scheduler)	281 µs/op	Buffer + offer/poll data-structure cost only
`offer!` on full buffer returns false	117 µs/op	Hot rejection path
`poll!` on empty buffer returns nil	67 µs/op	Hot empty-buffer path
`put!`/`take!` on `(chan 1)` + `drain!`	363 µs/op	Callback path through the scheduler
`go` block `(<! ch)` with pending put + `drain!`	1.84 ms/op	IOC state machine + park/unpark roundtrip
`go` producer/consumer hand-shake pair	3.67 ms/op	Two park/unpark cycles end-to-end
`alts!` over 1 ready channel	405 µs/op	Arbitration on a single ready candidate
`alts!` over 8 channels, last ready	1.03 ms/op	Linear walk through `:priority` order
`alts!` with `:default`	159 µs/op	Fast non-block path
`(timeout 0)` + `take!` + drain	692 µs/op	Timer-chan path through the scheduler

Shared-state work scales to one core's worth of throughput regardless of worker count because the per-state recursive mutex serializes script execution. To get parallel speedup, distribute work across runtime instances and pass results back via the host or via mino_clone, not across workers in one runtime.
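The one-lock-per-state model can be sketched with POSIX threads. Everything here is a toy: `toy_state`, `toy_eval`, and `toy_run_per_state` are hypothetical names, and the counter stands in for interpreter state. The point is only that workers bound to different states never touch the same mutex, while nested entries on one state re-lock safely because the mutex is recursive.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Toy runtime state: one recursive mutex guards all script execution. */
typedef struct {
    pthread_mutex_t lock;
    long            counter; /* stands in for interpreter-visible state */
} toy_state;

int toy_state_init(toy_state *s) {
    pthread_mutexattr_t attr;
    if (pthread_mutexattr_init(&attr) != 0) return -1;
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    int rc = pthread_mutex_init(&s->lock, &attr);
    pthread_mutexattr_destroy(&attr);
    s->counter = 0;
    return rc == 0 ? 0 : -1;
}

/* One "script entry": takes the state lock, possibly re-entrantly. */
void toy_eval(toy_state *s, int depth) {
    pthread_mutex_lock(&s->lock);
    if (depth > 0) toy_eval(s, depth - 1); /* nested entry re-locks safely */
    else s->counter++;
    pthread_mutex_unlock(&s->lock);
}

typedef struct { toy_state *state; int evals; } toy_job;

static void *toy_worker(void *arg) {
    toy_job *job = arg;
    for (int i = 0; i < job->evals; i++) toy_eval(job->state, 2);
    return NULL;
}

/* Run one worker per state; workers on different states never contend. */
int toy_run_per_state(toy_state *states, int n, int evals_each) {
    assert(n > 0 && n <= 16);
    pthread_t tids[16];
    toy_job jobs[16];
    int started = 0;
    for (; started < n; started++) {
        jobs[started].state = &states[started];
        jobs[started].evals = evals_each;
        if (pthread_create(&tids[started], NULL, toy_worker,
                           &jobs[started]) != 0) break;
    }
    for (int i = 0; i < started; i++) pthread_join(tids[i], NULL);
    return started == n ? 0 : -1;
}
```

Pointing several workers at the same `toy_state` would still produce correct counts, but the workers would take turns on the lock, which is the one-core ceiling described above.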

Garbage collection (source)

mino uses a non-moving two-generation tracing collector. Short-lived values live in a young-gen nursery that is swept in bounded minor collections. Survivors are promoted to old-gen, which is marked incrementally, paced by the allocator. A write barrier records old-to-young pointers so minors stay proportional to young reachability. The collector is stop-the-world at slice boundaries; there are no collector threads.
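The write barrier is the piece that keeps minor collections cheap, so a small sketch may help. The types and names below (`toy_obj`, `toy_remembered_set`, `toy_write_field`) are hypothetical and much simpler than a real collector: each object has a single pointer field, and the remembered set is a fixed-size array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { GEN_YOUNG, GEN_OLD } toy_gen;

typedef struct toy_obj {
    toy_gen         gen;
    bool            remembered; /* already in the remembered set? */
    struct toy_obj *field;      /* a single pointer slot */
} toy_obj;

#define REMEMBERED_CAP 64

typedef struct {
    toy_obj *items[REMEMBERED_CAP];
    size_t   count;
} toy_remembered_set;

/* Store `value` into `owner->field`. If this creates an old-to-young
   edge, record the owner once so a minor collection can treat it as an
   extra root instead of scanning all of old-gen. */
void toy_write_field(toy_remembered_set *rs, toy_obj *owner, toy_obj *value) {
    assert(rs != NULL && owner != NULL);
    owner->field = value;
    if (owner->gen == GEN_OLD && value != NULL && value->gen == GEN_YOUNG
        && !owner->remembered) {
        assert(rs->count < REMEMBERED_CAP);
        rs->items[rs->count++] = owner;
        owner->remembered = true;
    }
}
```

Old-to-old and young-to-old stores fall straight through, so the barrier adds work only on the edges a minor collection would otherwise miss.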

GC share is a function of allocation pressure, not absolute speed. The bytecode VM cut the constant-factor cost of computation but left allocation rates largely unchanged, so GC share rose proportionally on the same workloads:

Workload	GC share	Max pause
Small function calls (empty, identity, let)	~12%	~1.4 ms
`loop/recur` 10,000 iterations	~0%	—
Build 1,000-element vector via `conj`	~19%	~1.4 ms
Build 10,000-element vector via `conj`	~21%	~7.8 ms
Build 5k int-map and sum	~16%	~1.8 ms
map/filter/map/reduce over 50,000 (fused transducers)	~0%	—
Nested vectors 500x100	~17%	~2.0 ms
Realize 10k of lazy range	~33%	~4.0 ms

Five tuning knobs are exposed through mino_gc_set_param: nursery size, major growth multiplier, promotion age, incremental slice budget, and allocation quantum between slices. The defaults target interactive latency on a general workload; embedders with throughput-dominated batches or tighter pause budgets can shift the tradeoff without rebuilding.

The default nursery size rose from 1 MiB to 4 MiB in v0.250.0 after a measured pass over realistic_bench. The allocation-heavy rows (bump-int-map, nested-vec, lazy-range realization) gained 1.35–1.42x, and every row in the table below gained at least 1.13x, with no measurable regression in worst-case minor-GC pause. The larger nursery collects more bytes per cycle and therefore runs fewer minor collections, so total GC wall time falls even though each minor pass sweeps more. Each VM state holds 3 extra MiB of young-gen residency before the first major GC; embedders running many concurrent VM states under tight memory budgets can override the default via MINO_GC_NURSERY_BYTES or mino_gc_set_param(S, MINO_GC_NURSERY_BYTES, n).

realistic_bench (v0.249 vs v0.250):

Row	1 MiB (v0.249)	4 MiB (v0.250)	Speedup
build 5k int-map and sum	12.72 ms	11.18 ms	1.14x
bump 5k int-map values	22.92 ms	16.12 ms	1.42x
map/filter/map/reduce 50k	0.77 ms	0.68 ms	1.13x
nested vectors 500×100	23.00 ms	16.92 ms	1.36x
realize 10k lazy range	7.82 ms	5.79 ms	1.35x
fibonacci(25)	7.83 ms	6.43 ms	1.22x
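The Speedup column is the ratio of the two wall times. A small helper makes the arithmetic explicit and lets the rows be re-checked; `speedup_x100` is a hypothetical name, and it returns hundredths so the comparison stays in integers.

```c
#include <assert.h>

/* Speedup of `after_ms` relative to `before_ms`, in hundredths and
   rounded to nearest (a ratio of 1.42 is returned as 142). */
int speedup_x100(double before_ms, double after_ms) {
    assert(before_ms > 0.0 && after_ms > 0.0);
    return (int)(before_ms / after_ms * 100.0 + 0.5);
}
```

For example, the bump-int-map row gives 22.92 / 16.12 ≈ 1.42, and the map/filter/map/reduce row gives 0.77 / 0.68 ≈ 1.13, which is the smallest gain in the table.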

Where the time goes

The cost centers in order of impact:

Allocation pressure. Persistent collections, cons cells, and intermediate seqs are the dominant source of work in any realistic pipeline. Recent reduce and assoc/identity short-circuits cut redundant allocation, but laziness still pays a thunk + cons-cell per element. Use rangev/mapv/filterv when laziness is not needed; use loop/recur when iterating without building a collection.
Core library initialization. A mino_state_t on the Sandbox or Standalone tier parses and evaluates core.clj, which accounts for most of the 2.65–2.78 ms in-process initialization cost measured above (smaller on faster hardware). The Floor tier (mino_install_minimal) skips this entirely and pays only ~0.18 ms. Parsed forms are cached per state, so additional envs in one state avoid re-parsing. Bundled clojure.* namespaces are registered but not evaluated until required.
Bytecode bailouts. Forms the bc compiler doesn't yet handle bail to the tree-walker on first call and are remembered as declined; subsequent calls skip recompilation but still pay tree-walker per-call cost. Compiler coverage is the lever here, not VM speed.
Per-state lock at every eval entry (when threaded). Once a state is multi-threaded (host has granted workers, or the standalone has run mino_install_all), each script entry through mino_eval_string or a worker mino_call takes the per-state recursive mutex. Uncontested cost is ~10 ns per entry; single-threaded states skip the mutex entirely.

What this means in practice

mino is fast enough for configuration evaluation, rules engines, interactive consoles, plugin systems, scripting automation, and data transformation on moderate-sized collections. It evaluates a simple expression in low microseconds and runs well over a hundred thousand iterations per millisecond on tight integer loops.

It is still not the right choice for tight numerical loops in the hot inner cycle of a server, large-scale data processing, or any workload where per-element overhead matters at the nanosecond level. For those cases, do the heavy lifting in C and pass results to mino for composition and coordination. The embedding model supports this naturally.

Benchmarking

Write benchmarks as mino scripts or C programs that link against the mino source. Compile and run with the same per-subsystem flags the standalone build uses:

cc -std=c99 -O2 \
  -Isrc -Isrc/public -Isrc/runtime -Isrc/gc -Isrc/eval \
  -Isrc/collections -Isrc/prim -Isrc/async -Isrc/interop \
  -Isrc/diag -Isrc/vendor/imath \
  -o my_bench my_bench.c \
  src/public/*.c src/runtime/*.c src/gc/*.c src/eval/*.c \
  src/eval/bc/*.c src/collections/*.c src/prim/*.c \
  src/async/*.c src/interop/*.c src/regex/*.c \
  src/diag/*.c src/vendor/imath/*.c \
  -lm -lpthread
./my_bench

For minimum-footprint embed measurements, add -ffunction-sections -fdata-sections to the compile flags and -Wl,--gc-sections (Mach-O: -Wl,-dead_strip) to the link flags so unreferenced subsystems drop out.
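A C benchmark built this way needs little more than a timing loop around the workload. The skeleton below is a sketch of that loop and deliberately contains no mino calls, because the exact signatures are not shown on this page: `bench_fn`, `now_ns`, `bench_mean_ns`, and `count_call` are hypothetical names, and a real benchmark would replace the example workload with a callback that evaluates a mino form. It reports a mean per call, matching the "mean of 100,000 iterations" method used for the core operations table.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Workload callback; a real benchmark would evaluate a mino form here. */
typedef void (*bench_fn)(void *ctx);

/* Monotonic clock reading in nanoseconds. */
double now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec * 1e9 + (double)ts.tv_nsec;
}

/* Mean cost per call in nanoseconds over `iters` calls, measured after
   `warmup` untimed calls that let caches and any JIT settle. */
double bench_mean_ns(bench_fn fn, void *ctx, long warmup, long iters) {
    assert(fn != NULL && iters > 0);
    for (long i = 0; i < warmup; i++) fn(ctx);
    double t0 = now_ns();
    for (long i = 0; i < iters; i++) fn(ctx);
    return (now_ns() - t0) / (double)iters;
}

/* Example workload: bump a counter through a volatile pointer so the
   compiler cannot drop the calls. */
void count_call(void *ctx) {
    volatile long *n = ctx;
    (*n)++;
}
```

Keeping the warmup calls outside the timed region matters on JIT builds, where the first calls past the hot-call threshold include compilation time.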