Benchmarks & Performance
Higher-Kinded-J ships with a comprehensive JMH benchmark suite in the hkj-benchmarks module. These benchmarks measure the real cost of the library's abstractions so you can make informed decisions about where and when to use them.
- What the benchmark suite covers and how it is organised
- How to run benchmarks: all, per-type, with GC profiling
- How to interpret results and spot regressions
- What performance characteristics to expect from each type
"Measure. Don't guess." — Kirk Pepperdine, Java performance expert
Why Benchmarks Matter
Functional abstractions wrap values. Wrapping has a cost. The question is never "is there overhead?" — there always is — but "does the overhead matter for my workload?" The benchmark suite answers that question with data rather than intuition.
The suite is designed around three principles:
- Honesty — measure real abstraction costs, not contrived best cases
- Comparability — include raw Java baselines alongside library operations
- Actionability — organise results so regressions are immediately visible
What Is Measured
The hkj-benchmarks module contains 19 benchmark classes covering every major type in the library:
Core Types
| Benchmark | Type | What It Tells You |
|---|---|---|
| EitherBenchmark | Either<L,R> | Instance reuse on the Left track, short-circuit efficiency |
| MaybeBenchmark | Maybe<A> | Instance reuse on Nothing, nullable interop cost |
| TrampolineBenchmark | Trampoline<A> | Stack-safe recursion overhead vs naive recursion |
| FreeBenchmark | Free<F,A> | Free monad interpretation cost |
Effect Types
| Benchmark | Type | What It Tells You |
|---|---|---|
| IOBenchmark | IO<A> | Lazy construction and platform thread execution |
| VTaskBenchmark | VTask<A> | Virtual thread execution, map/flatMap chains |
| VStreamBenchmark | VStream<A> | Pull-based stream construction, combinator pipelines, parallel ops, chunking, Java Stream comparison |
| VTaskParBenchmark | Par combinators | Parallel zip, all, race, traverse via StructuredTaskScope |
| ScopeBenchmark | Scope, Resource | Scope joiner strategies (allSucceed, anySucceed, accumulating), Resource bracket overhead |
Effect Path Wrappers
| Benchmark | Type | What It Tells You |
|---|---|---|
| VTaskPathBenchmark | VTaskPath<A> | Wrapper overhead on top of VTask |
| IOPathBenchmark | IOPath<A> | Wrapper overhead on top of IO |
| ForPathVTaskBenchmark | ForPath with VTask | For-comprehension tuple allocation cost |
Comparisons
| Benchmark | What It Compares |
|---|---|
| VTaskVsIOBenchmark | Virtual threads vs platform threads |
| VTaskVsPlatformThreadsBenchmark | VTask vs ExecutorService at scale |
| VTaskPathVsIOPathBenchmark | Path wrapper costs across effect types |
| AbstractionOverheadBenchmark | HKJ abstractions vs raw Java |
| ConcurrencyScalingBenchmark | Thread scaling under concurrent load |
| MemoryFootprintBenchmark | Allocation rates for VTask, IO, CompletableFuture |
Running Benchmarks
All Benchmarks
./gradlew :hkj-benchmarks:jmh
A Single Benchmark Class
./gradlew :hkj-benchmarks:jmh --includes=".*VStreamBenchmark.*"
./gradlew :hkj-benchmarks:jmh --includes=".*VTaskBenchmark.*"
./gradlew :hkj-benchmarks:jmh --includes=".*EitherBenchmark.*"
A Single Benchmark Method
./gradlew :hkj-benchmarks:jmh --includes=".*VTaskBenchmark.runSucceed.*"
With GC Profiling
This reveals allocation rates and GC pressure — essential for understanding memory behaviour:
./gradlew :hkj-benchmarks:jmh -Pjmh.profilers=gc
Long / Stress Mode
Runs with chainDepth=10000 and recursionDepth=10000 for thorough stack-safety validation:
./gradlew :hkj-benchmarks:longBenchmark
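To see why depth is the variable worth stressing, here is a minimal sketch of the trampolining idea in plain Java. The names `Bounce`, `Done`, `More`, and `StackSafetyDemo` are hypothetical and exist only for this illustration; this is not the library's `Trampoline` implementation, just the general technique of turning recursion into a loop over heap-allocated steps.

```java
import java.util.function.Supplier;

// Hypothetical miniature trampoline, for illustration only.
sealed interface Bounce<A> permits Done, More {
    // Runs the computation in a loop, so the call stack does not grow with depth.
    static <A> A run(Bounce<A> start) {
        Bounce<A> current = start;
        while (current instanceof More<A> more) {
            current = more.next().get();
        }
        return ((Done<A>) current).value();
    }
}

record Done<A>(A value) implements Bounce<A> {}

record More<A>(Supplier<Bounce<A>> next) implements Bounce<A> {}

final class StackSafetyDemo {
    // Naive recursion: one stack frame per step, so large n may overflow the stack.
    static long naiveSum(long n) {
        return n == 0 ? 0 : n + naiveSum(n - 1);
    }

    // Trampolined recursion: each step returns a description of the next step.
    static Bounce<Long> sum(long n, long acc) {
        return n == 0
            ? new Done<Long>(acc)
            : new More<Long>(() -> sum(n - 1, acc + n));
    }

    public static void main(String[] args) {
        System.out.println(Bounce.run(sum(1_000_000L, 0L)));
    }
}
```

The trampolined version pays one allocation per step, which is the overhead that TrampolineBenchmark compares against naive recursion.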
Formatted Report
./gradlew :hkj-benchmarks:benchmarkReport
Reading the Output
The suite's benchmarks report throughput in operations per microsecond (ops/us). Higher is better.
Benchmark Mode Cnt Score Error Units
EitherBenchmark.rightMap thrpt 20 15.234 ± 0.512 ops/us
EitherBenchmark.leftMap thrpt 20 89.123 ± 1.234 ops/us
Score is the measured throughput. Error is the half-width of the 99.9% confidence interval, so the true mean is likely to lie within Score ± Error. If the error is larger than ~30% of the score, the result is noisy: increase the warmup or measurement iterations.
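The noise rule of thumb is simple arithmetic on the two columns. As a small sketch, using a hypothetical `NoiseCheck` helper that is not part of the benchmark module:

```java
final class NoiseCheck {
    // Error as a fraction of the score, e.g. 0.512 / 15.234 for the rightMap row above.
    static double relativeError(double score, double error) {
        return error / score;
    }

    // Applies the ~30% rule of thumb from the text.
    static boolean isNoisy(double score, double error) {
        return relativeError(score, error) > 0.30;
    }

    public static void main(String[] args) {
        System.out.printf("rightMap relative error: %.1f%%%n",
            100 * relativeError(15.234, 0.512));
    }
}
```

For the sample rows above, both relative errors are a few percent, well inside the threshold.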
What to Look For
| Signal | Meaning |
|---|---|
| Left/Nothing operations 5-40x faster than Right/Just | Instance reuse is working |
| VTask ~10-30% slower than IO for simple ops | Expected virtual thread overhead |
| VTask blocking I/O 10x+ faster than platform threads at scale | Virtual thread scheduler scaling |
| Deep chain (50+ steps) completes without error | Stack safety is intact |
| VStream slower than Java Stream | Expected; virtual thread + pull overhead |
| parEvalMap scales with concurrency for I/O | Parallel pipeline working correctly |
| Scope joiners similar speed to Par.all | Minimal Scope abstraction cost |
| Wrapper overhead < 15% | Acceptable Path wrapper cost |
| PVector via ListKind 30-70% the throughput of ArrayList | Persistent collection iteration tax |
Warning Signs
| Signal | Possible Cause |
|---|---|
| Left/Nothing same speed as Right/Just | Instance reuse broken |
| Error margin > 50% of score | Noisy environment, insufficient warmup |
| Deep chain throws StackOverflowError | Stack safety regression |
| VStream > 100x slower than Java Stream | Excessive allocation in pull loop |
| Wrapper overhead > 30% | Unnecessary allocation in Path wrapper |
Expected Performance by Type
Either and Maybe
These types use instance reuse: Left and Nothing operations return the same object without allocating, making short-circuit paths essentially free.
| Comparison | Expected Ratio |
|---|---|
| leftMap vs rightMap | Left 5-10x faster |
| nothingMap vs justMap | Nothing 5-10x faster |
| leftLongChain vs rightLongChain | Left 10-50x faster |
| constructNothing vs constructJust | Singleton reuse can reach 30-40x |
The "construct" pair is the strongest signal that allocation, not branching, dominates the gap. constructNothing returns the cached singleton, whereas constructJust allocates a fresh wrapper. On hot paths producing many Just or Right values, that allocation cost is real.
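The instance-reuse pattern itself is easy to sketch. The following uses hypothetical `MiniMaybe`, `MiniJust`, and `MiniNothing` types written for this page; it is an assumption about the general shape of the technique, not a copy of the library's `Maybe` source.

```java
import java.util.function.Function;

// Hypothetical miniature Maybe, illustrating instance reuse only.
sealed interface MiniMaybe<A> permits MiniJust, MiniNothing {
    <B> MiniMaybe<B> map(Function<? super A, ? extends B> f);

    // Always hands back the one shared instance: no allocation.
    @SuppressWarnings("unchecked")
    static <A> MiniMaybe<A> nothing() {
        return (MiniMaybe<A>) MiniNothing.INSTANCE;
    }

    // Allocates a fresh wrapper for every value.
    static <A> MiniMaybe<A> just(A value) {
        return new MiniJust<A>(value);
    }
}

record MiniJust<A>(A value) implements MiniMaybe<A> {
    @Override
    public <B> MiniMaybe<B> map(Function<? super A, ? extends B> f) {
        return new MiniJust<B>(f.apply(value)); // runs f and allocates a new wrapper
    }
}

final class MiniNothing<A> implements MiniMaybe<A> {
    static final MiniNothing<?> INSTANCE = new MiniNothing<Object>();

    private MiniNothing() {}

    @Override
    @SuppressWarnings("unchecked")
    public <B> MiniMaybe<B> map(Function<? super A, ? extends B> f) {
        return (MiniMaybe<B>) this; // same object back; f is never called
    }
}
```

The unchecked casts are safe here because the empty case holds no value of type `A`, so it can stand in for any element type.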
VTask
Virtual thread overhead is the dominant cost for simple operations. For real workloads involving I/O, this overhead is negligible, and the scheduling story flips dramatically in VTask's favour.
| Comparison | Expected |
|---|---|
| Construction (succeed, delay) | Very fast (~100+ ops/us) |
| VTask vs IO (simple execution) | VTask ~10-30% slower |
| Deep chains (50+) | Completes without error |
| Blocking I/O at concurrency 10 | Roughly equivalent to platform threads |
| Blocking I/O at concurrency 100 | VTask ~3-4x faster |
| Blocking I/O at concurrency 1000 | VTask 15-20x faster |
The blocking I/O numbers come from VTaskVsPlatformThreadsBenchmark. Platform-thread throughput collapses as concurrency grows because OS threads become the bottleneck. Virtual-thread throughput stays roughly flat because a blocked virtual thread unmounts from its carrier thread, leaving the carrier free to run other tasks. This is the strongest single argument for VTask in I/O-bound workloads.
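The same effect can be observed with the JDK alone. The sketch below assumes Java 21 or later and uses `Thread.sleep` as a stand-in for blocking I/O; the class name, the pool size of 16, and the task counts are illustrative assumptions, not the benchmark's actual configuration. It is a rough timing loop rather than a JMH benchmark, so treat the printed numbers as indicative only.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class BlockingScalingDemo {
    // Submits `tasks` blocking jobs, waits for all of them, and returns how many completed.
    static int runBlocking(ExecutorService executor, int tasks, Duration block) {
        try (executor) { // ExecutorService is AutoCloseable since Java 19
            List<Future<Integer>> futures = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                futures.add(executor.submit(() -> {
                    Thread.sleep(block); // stands in for a blocking I/O call
                    return 1;
                }));
            }
            int completed = 0;
            for (Future<Integer> future : futures) {
                completed += future.get();
            }
            return completed;
        } catch (Exception e) {
            throw new IllegalStateException("blocking task failed", e);
        }
    }

    static long elapsedMillis(ExecutorService executor, int tasks, Duration block) {
        long start = System.nanoTime();
        runBlocking(executor, tasks, block);
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        int tasks = 1_000;
        Duration block = Duration.ofMillis(10);
        // Pool size 16 is an arbitrary choice for the comparison.
        long platform = elapsedMillis(Executors.newFixedThreadPool(16), tasks, block);
        long virtual = elapsedMillis(Executors.newVirtualThreadPerTaskExecutor(), tasks, block);
        System.out.printf("platform pool: %d ms, virtual threads: %d ms%n", platform, virtual);
    }
}
```

With a bounded platform pool, tasks queue behind the sleeping threads; with one virtual thread per task, all of the sleeps overlap.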
VStream
VStream's pull-based model adds overhead per element compared to Java Stream's push model, but provides laziness, virtual thread execution, and error recovery that Java Stream cannot.
| Comparison | Expected |
|---|---|
| Construction (empty, of, range) | Very fast (~100+ ops/us) |
| VStream map vs Java Stream map | VStream slower |
| Deep map chain (50) | Completes without error |
| Deep flatMap chain (50) | Completes without error |
| existsEarlyMatch vs existsNoMatch | Early match much faster (short-circuit) |
Effect Path Wrappers
| Comparison | Expected Overhead |
|---|---|
| VTaskPath vs raw VTask | 5-15% |
| IOPath vs raw IO | 5-15% |
| ForPath vs direct chaining | 10-25% |
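One way to turn two throughput scores into an overhead percentage is shown below. This is a sketch under assumptions: the hypothetical `WrapperOverhead` helper defines overhead as the fraction of throughput lost relative to the raw type, which may not be exactly how the suite's own assertion tests compute it, and the "borderline" label for the middle band is invented here. The 15% and 30% thresholds are the ones given under "What to Look For" and "Warning Signs".

```java
final class WrapperOverhead {
    // Fraction of throughput lost by wrapping: (raw - wrapped) / raw.
    static double overhead(double rawOpsPerUs, double wrappedOpsPerUs) {
        return (rawOpsPerUs - wrappedOpsPerUs) / rawOpsPerUs;
    }

    // Below 15% is acceptable and above 30% is a warning sign, per the tables on this page.
    // The middle band's label is an assumption made for this sketch.
    static String classify(double overhead) {
        if (overhead < 0.15) {
            return "acceptable";
        }
        if (overhead > 0.30) {
            return "warning";
        }
        return "borderline";
    }

    public static void main(String[] args) {
        // Made-up scores, purely to show the arithmetic.
        double o = overhead(100.0, 90.0);
        System.out.printf("overhead %.0f%% -> %s%n", 100 * o, classify(o));
    }
}
```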
PCollections via ListKind
PCollectionsHktBenchmark measures PVector against ArrayList when both are processed through the standard ListMonad/ListTraverse pipeline. The widen/narrow boundary itself is free; the cost lives in iteration.
| Comparison | Expected Ratio |
|---|---|
| widenNarrow PVector vs ArrayList | Indistinguishable, both at the JMH ceiling |
| map / flatMap / foldMap PVector vs ArrayList | PVector 30-40% the throughput |
| traverse PVector vs ArrayList | PVector ~70% the throughput |
The traverse case is the headline result. The more applicative work an operation does per element, the smaller the gap, because applicative map2 and Kind boxing dominate the underlying iteration cost. See PCollections Integration for the full table and methodology.
Free Monad
Free trades raw speed for program-as-data: a Free programme can be inspected, transformed, or interpreted multiple ways before it is run. That flexibility costs at least three orders of magnitude versus direct function composition.
| Comparison | Expected Ratio |
|---|---|
| directComposition vs deepChainExecution | Direct ~3-4 orders of magnitude faster |
| pureTrampolineIntegration vs flatMappedTrampolineIntegration | Pure ~100x faster |
| programConstruction vs sequentialInterpretation | Construction far cheaper than interpretation |
If a workload is on a hot path and does not benefit from inspection or alternative interpreters, prefer direct composition. Free shines when the same programme is interpreted multiple ways, when the structure is analysed before execution, or when the alternatives would require substantial duplication.
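The trade-off can be illustrated without the library. The sketch below is deliberately not a Free monad: it is a flat list of hypothetical `Step` instructions, used only to show the program-as-data idea of building a description once and interpreting it more than one way, next to the direct composition it replaces.

```java
import java.util.List;

// Hypothetical instruction set, for illustration only.
sealed interface Step permits Add, Multiply {}

record Add(int n) implements Step {}

record Multiply(int n) implements Step {}

final class ProgramAsData {
    // Interpreter 1: execute the program against an input.
    static int evaluate(List<Step> program, int input) {
        int acc = input;
        for (Step step : program) {
            if (step instanceof Add add) {
                acc += add.n();
            } else if (step instanceof Multiply multiply) {
                acc *= multiply.n();
            }
        }
        return acc;
    }

    // Interpreter 2: render the same program as text without running it.
    static String describe(List<Step> program) {
        String expr = "x";
        for (Step step : program) {
            if (step instanceof Add add) {
                expr = "(" + expr + " + " + add.n() + ")";
            } else if (step instanceof Multiply multiply) {
                expr = "(" + expr + " * " + multiply.n() + ")";
            }
        }
        return expr;
    }

    // Direct composition: fastest, but opaque to inspection.
    static int direct(int x) {
        return (x + 2) * 3;
    }
}
```

`evaluate` and `direct` compute the same result for the program `[Add(2), Multiply(3)]`, but only the data version can also be described, rewritten, or analysed before it runs. The interpretation loop and the per-step objects are where the extra cost comes from.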
Benchmark Assertion Tests
The benchmark suite includes automated assertion tests that validate performance characteristics after each benchmark run. These are not just "did it finish?" checks — they verify relative performance, overhead ratios, and sanity bounds.
The assertion tests fail (not skip) if benchmark results are missing. This is intentional. Run ./gradlew :hkj-benchmarks:jmh before running ./gradlew :hkj-benchmarks:test. Silent skips hide missing quality gates.
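The fail-rather-than-skip policy can be sketched as follows. `ResultGate` and its methods are hypothetical names, and the map of scores stands in for whatever the real tests parse out of results.json; the actual test code may be structured quite differently.

```java
import java.util.Map;

final class ResultGate {
    // Fails loudly when a result is absent, instead of silently skipping the check.
    static double requireScore(Map<String, Double> scores, String benchmark) {
        Double score = scores.get(benchmark);
        if (score == null) {
            throw new AssertionError(
                "Missing result for " + benchmark + "; run the benchmarks first");
        }
        return score;
    }

    // Relative check: `fast` must be at least `minRatio` times the throughput of `slow`.
    static void assertFaster(
            Map<String, Double> scores, String fast, String slow, double minRatio) {
        double ratio = requireScore(scores, fast) / requireScore(scores, slow);
        if (ratio < minRatio) {
            throw new AssertionError(
                fast + " is only " + ratio + "x " + slow + ", expected at least " + minRatio + "x");
        }
    }
}
```

With the sample scores from "Reading the Output" (leftMap 89.123, rightMap 15.234), a minimum ratio of 5 passes, while an empty map fails immediately.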
What the Tests Validate
| Test Group | What It Checks |
|---|---|
| SanityChecks | Every benchmark has positive throughput and bounded error margins |
| VTaskRelativePerformance | VTask construction costs (succeed, of, map) are positive |
| ParCombinatorPerformance | Par.zip and Par.map2 have positive throughput |
| VTaskVsIOOverhead | Both VTask and IO construction perform within expected bounds |
| CoreTypePerformance | Maybe, Either, and Trampoline operations have positive throughput |
| FoldPlusPerformance | Fold combination overhead is bounded; sum vs plus parity |
| AbstractionOverhead | Raw Java > IO > VTask ordering; VTaskPath wrapper overhead bounded |
| ConcurrencyScaling | Single and multi-threaded VTask/IO performance is positive |
| IOPerformance | IO construction vs execution ratios; deep recursion completes (stack safety) |
| IOPathPerformance | IOPath construction, map pipelines, and error handling overhead |
| VTaskPathPerformance | VTaskPath construction, map pipelines, and timeout overhead |
| VTaskPathVsIOPath | Cross-type comparison: construction ratios and conversion costs |
| ForPathVTaskPerformance | For-comprehension overhead vs direct chaining; parallel step overhead |
| ScopePerformance | Scope.allSucceed, Resource bracket, and Par.all throughput |
| MemoryFootprint | Bulk construction rates for VTask, IO, and CompletableFuture |
| VStreamPerformance | VStream map execution, construction vs execution, Java Stream baseline |
| VTaskVsPlatformThreads | VTask Par.all vs platform thread pool at scale |
| FreeMonadPerformance | Free monad construction, stack safety, and interpretation overhead |
Running the Tests
# Step 1: Run benchmarks (generates results.json)
./gradlew :hkj-benchmarks:jmh
# Step 2: Run assertion tests against the results
./gradlew :hkj-benchmarks:test
# Or run both together via the benchmarkValidation task
./gradlew benchmarkValidation
Release Quality Gate
The releaseReadiness task is a single-command quality gate that runs every verification step, ordered from fastest to slowest so failures surface early:
./gradlew releaseReadiness
| Step | Task | What It Checks | Speed |
|---|---|---|---|
| 1 | spotlessCheck | Code formatting (Google Java Format) | Seconds |
| 2 | build | Compilation, all unit tests, JaCoCo coverage | Minutes |
| 3 | :hkj-benchmarks:jmh | JMH benchmarks execute successfully | Minutes |
| 4 | :hkj-benchmarks:test | Benchmark assertion tests pass | Seconds |
| 5 | :hkj-processor:pitest (full) | Mutation testing with STRONGER mutators | Slowest |
If any step fails, the build stops immediately. All five must pass before a release.
The release gate runs pitest with -Ppitest.profile=full, which uses STRONGER mutators and all available CPU cores. This is more thorough than the default conservative profile used during local development.
Reports Generated
After a successful run, reports are available at:
| Tool | Location |
|---|---|
| JaCoCo | hkj-core/build/reports/jacoco/test/html/index.html |
| JMH (JSON) | hkj-benchmarks/build/reports/jmh/results.json |
| JMH (human) | hkj-benchmarks/build/reports/jmh/human.txt |
| Pitest | hkj-processor/build/reports/pitest/index.html |
When Overhead Matters (and When It Doesn't)
The benchmarks consistently show that abstraction overhead is measured in nanoseconds. Real-world operations — database queries, HTTP calls, file reads — are measured in milliseconds. The overhead is three to four orders of magnitude smaller than any I/O operation.
Abstraction overhead matters in exactly three scenarios:
- Tight computational loops processing millions of items per second with no I/O. Use primitives directly.
- Very long chains (hundreds of steps) creating GC pressure. Break into named submethods.
- Reflexive parallelism on trivial work. VTaskParBenchmark shows par* variants running thousands of times slower than their sequential equivalents when the per-task work is negligible. Forking, joining, and synchronising all cost more than a few-nanosecond computation. Reach for Par.zip, Par.map2, or Scope only when each branch does enough work to amortise the structured-concurrency cost.
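The third scenario is easy to reproduce with the JDK alone. This sketch uses `CompletableFuture` rather than the library's Par combinators or structured concurrency, so it only approximates what VTaskParBenchmark measures; the class and method names are hypothetical.

```java
import java.util.concurrent.CompletableFuture;

final class TrivialParallelism {
    // Two additions, done inline: a handful of nanoseconds.
    static int sequential(int a, int b) {
        return (a + 1) + (b + 1);
    }

    // The same two additions, each forked to another thread and then joined.
    // Task submission, scheduling, and the join dwarf the arithmetic itself.
    static int forked(int a, int b) {
        CompletableFuture<Integer> left = CompletableFuture.supplyAsync(() -> a + 1);
        CompletableFuture<Integer> right = CompletableFuture.supplyAsync(() -> b + 1);
        return left.thenCombine(right, Integer::sum).join();
    }
}
```

Both methods return the same value. The forked version only starts to pay off once each branch does enough work, such as a network call or a substantial computation, to outweigh the coordination cost.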
For everything else, the type safety, composability, and testability benefits far outweigh the cost.
- Production Readiness — stack traces, allocation analysis, and stack safety for Effect Path types
- hkj-benchmarks README — full method reference for all 18 benchmark classes
- Performance Testing Guide — benchmark categories and CI integration