VStream Performance: Benchmarks and Optimisation

Measuring the Cost of Lazy, Effectful Streaming

What You'll Learn

  • How VStream performance compares to raw Java streams
  • Overhead characteristics of VStream's virtual thread model
  • How to run VStream benchmarks using JMH
  • Performance characteristics of parallel operations and chunking
  • When to use VStream vs raw Java streams

Run Benchmarks

./gradlew :hkj-benchmarks:jmh --includes=".*VStreamBenchmark.*"

Why Measure?

VStream wraps every element pull in a VTask, which evaluates on a virtual thread. That wrapping has a cost. The question is never "is there overhead?" but "does the overhead matter for my workload?" The benchmarks in this page answer that question with data.

For simple in-memory transformations, Java Stream is faster. For I/O-bound pipelines with concurrent element processing, VStream with parEvalMap is more capable and often faster end-to-end.

Package: org.higherkindedj.benchmarks
Module: hkj-benchmarks


Benchmark Methodology

All benchmarks use JMH (Java Microbenchmark Harness) with the following configuration:

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@State(Scope.Thread)

For GC profiling:

./gradlew :hkj-benchmarks:jmh --includes=".*VStreamBenchmark.*" -PjmhProfilers=gc

Performance Characteristics

Construction Cost

VStream construction is lightweight. All factory methods return lazy descriptions without allocating element storage:

Operation                   Overhead    Notes
VStream.empty()             Near zero   Returns singleton
VStream.of(value)           Near zero   Single lambda capture
VStream.fromList(list)      Near zero   Index-based lazy iteration
VStream.range(start, end)   Near zero   Unfold-based

Combinator Overhead

Combinators like map, filter, and flatMap are lazy: they build a description of the pipeline without executing it. Construction cost is O(1) regardless of stream size.
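The same O(1) composition can be illustrated with plain `java.util.function` composition — a hypothetical stand-in for VStream's combinators, not its internals. Building the pipeline touches no elements; only applying it does:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

public class LazyPipelineDemo {
    static final AtomicInteger evaluations = new AtomicInteger();

    // Each composition step is O(1): it builds a description of the
    // transformation without running any user function.
    static Function<Integer, Integer> buildPipeline() {
        Function<Integer, Integer> addOne = x -> {
            evaluations.incrementAndGet(); // counts actual executions
            return x + 1;
        };
        return addOne.andThen(x -> x * 2);
    }

    public static void main(String[] args) {
        Function<Integer, Integer> pipeline = buildPipeline();
        // Nothing has executed yet: construction is pure description.
        System.out.println("evaluations after build: " + evaluations.get());
        // Applying the pipeline (the "terminal" step) runs the functions.
        System.out.println("result: " + pipeline.apply(10));
    }
}
```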

Terminal Operation Cost

Terminal operations (toList, foldLeft, count) execute the full pipeline. Each element pull involves:

  1. A VTask evaluation (virtual thread fork and join)
  2. Pattern matching on the Step ADT (Emit, Skip, Done)
  3. Any user-supplied transformation functions

  Per-Element Cost Breakdown
  ══════════════════════════

  Consumer calls pull()
       │
       ▼
  ┌─────────────────────┐
  │  VTask evaluation   │ ◀── dominant cost for simple transforms
  │  (virtual thread    │     (~microseconds)
  │   fork and join)    │
  └──────────┬──────────┘
             │
             ▼
  ┌─────────────────────┐
  │  Step pattern match │ ◀── near zero (ADT dispatch)
  │  Emit / Skip / Done │
  └──────────┬──────────┘
             │
             ▼
  ┌─────────────────────┐
  │  User function      │ ◀── depends on workload
  │  (map, filter, etc.)│     trivial: ~nanoseconds
  └──────────┬──────────┘     I/O: ~milliseconds (dominates)
             │
             ▼
        Next pull

The per-element cost is dominated by the VTask overhead for simple transformations. For I/O-bound operations, the virtual thread overhead is negligible compared to I/O latency.
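The fork-and-join cost described above can be sketched with plain virtual threads (JDK 21+). This is an illustrative analogue of one VTask per element pull, not the actual VStream internals:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class PerElementForkDemo {
    // One virtual-thread fork/join per element, mirroring the dominant
    // per-pull cost for simple transforms described above.
    static List<Integer> pullAll(List<Integer> source) {
        List<Integer> out = new ArrayList<>();
        for (int x : source) {
            AtomicInteger result = new AtomicInteger();
            // fork: evaluate the element's transform on a virtual thread
            Thread t = Thread.ofVirtual().start(() -> result.set(x * 2));
            try {
                t.join(); // join: wait for the step to complete
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
            out.add(result.get());
        }
        return out;
    }

    public static void main(String[] args) {
        // The fork/join pair runs once per element; for a trivial x * 2
        // transform it dominates the cost, for I/O it would be negligible.
        System.out.println(pullAll(List.of(1, 2, 3)));
    }
}
```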

Parallel Processing

Parallel operations add overhead for StructuredTaskScope management but provide significant throughput improvements for I/O-bound workloads:

Scenario               Sequential   parEvalMap(4)   parEvalMap(8)   Speedup
100ms I/O, 100 items   ~10s         ~2.5s           ~1.25s          8x
10ms I/O, 1000 items   ~10s         ~2.5s           ~1.25s          8x
CPU-bound, 100 items   ~Ns          ~N/4s           diminishing     <4x

For CPU-bound operations, parallelism beyond the number of available processors provides no benefit and may decrease performance due to scheduling overhead.
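A minimal sketch of what parEvalMap-style bounded concurrency looks like using plain JDK virtual threads and a semaphore — an illustrative analogue under stated assumptions, not VStream's implementation:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;
import java.util.function.Function;

public class BoundedParDemo {
    // Evaluate f over the input with at most n elements in flight at once,
    // preserving input order in the result (the parEvalMap contract the
    // table above assumes).
    static <A, B> List<B> parEvalMap(int n, List<A> input, Function<A, B> f) {
        Semaphore permits = new Semaphore(n);
        try (var exec = Executors.newVirtualThreadPerTaskExecutor()) {
            List<Future<B>> futures = new ArrayList<>();
            for (A a : input) {
                futures.add(exec.submit(() -> {
                    permits.acquire();                 // bound concurrency at n
                    try {
                        return f.apply(a);
                    } finally {
                        permits.release();
                    }
                }));
            }
            List<B> out = new ArrayList<>();
            for (Future<B> fu : futures) {             // collect in input order
                try {
                    out.add(fu.get());
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            }
            return out;
        }
    }

    public static void main(String[] args) {
        // With an I/O-bound f, up to 4 elements would wait concurrently.
        System.out.println(parEvalMap(4, List.of(1, 2, 3, 4, 5), x -> x * 10));
    }
}
```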

Chunking

Chunking reduces per-element overhead by amortising the VTask/virtual-thread cost across a batch:

  Element-by-element                   chunk(10) + batch
  ════════════════════                 ════════════════════

  e₁ ──▶ VTask ──▶ r₁                 e₁ ─┐
  e₂ ──▶ VTask ──▶ r₂                 e₂  │
  e₃ ──▶ VTask ──▶ r₃                 e₃  │
  e₄ ──▶ VTask ──▶ r₄                 ... ├──▶ 1 VTask ──▶ [r₁..r₁₀]
  ...                                  e₉  │
  e₁₀──▶ VTask ──▶ r₁₀                e₁₀─┘

  10 VTask forks                       1 VTask fork
  10× virtual thread overhead          1× virtual thread overhead

Approach              Cost
Element-by-element    1 VTask fork per element
chunk(10) + batch     1 VTask fork per 10 elements
chunk(100) + batch    1 VTask fork per 100 elements

Larger chunk sizes reduce overhead but increase latency for the first result.
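The amortisation can be sketched with plain virtual threads — a hypothetical analogue of chunk(n) plus batch processing, forking once per chunk rather than once per element:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class ChunkingDemo {
    static final AtomicInteger forks = new AtomicInteger();

    // Process the input in chunks of the given size: one virtual-thread
    // fork per chunk, so per-element overhead is divided by the chunk size.
    static List<Integer> processChunked(List<Integer> in, int size) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < in.size(); i += size) {
            List<Integer> chunk = in.subList(i, Math.min(i + size, in.size()));
            List<Integer> results = new ArrayList<>();
            Thread t = Thread.ofVirtual().start(() -> {   // one fork per chunk
                forks.incrementAndGet();
                for (int x : chunk) results.add(x * 2);   // batch of work
            });
            try {
                t.join();
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
            out.addAll(results);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> in = new ArrayList<>();
        for (int i = 1; i <= 100; i++) in.add(i);
        processChunked(in, 10);
        // 100 elements, chunk size 10: 10 forks instead of 100.
        System.out.println("forks: " + forks.get());
    }
}
```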


VStream vs Java Stream

Aspect            VStream                            Java Stream
Execution model   Virtual threads per pull           Platform threads
Laziness          Fully lazy, composable             Lazy, single-use
Reusability       Reusable                           Single-use
Parallelism       parEvalMap (bounded concurrency)   parallel() (ForkJoinPool)
Effects           Built-in via VTask                 Manual management
Backpressure      Implicit (pull-based)              N/A
Overhead          Higher for simple ops              Lower for simple ops
Best for          I/O-bound, effectful pipelines     CPU-bound, in-memory data

Guidance:

  • Use Java Stream for in-memory data transformations where effects are not needed
  • Use VStream when your pipeline involves I/O, needs composable error handling, or benefits from virtual thread integration
  • For simple map/filter/collect on lists, Java Stream is faster
  • For I/O-bound pipelines with concurrent element processing, VStream with parEvalMap is more capable

Optimisation Tips

Minimise Pipeline Depth

Each combinator adds a layer of indirection. For hot paths, consider combining operations:

// Prefer: single map with combined logic
stream.map(x -> (x + 1) * 2)

// Over: multiple chained maps
stream.map(x -> x + 1).map(x -> x * 2)

Use Appropriate Chunk Sizes

For batch I/O operations, chunk size should balance between:

  • Larger chunks: fewer VTask evaluations, higher throughput
  • Smaller chunks: lower memory usage, faster first-result latency

Choose Concurrency Wisely

Over-provisioning concurrency wastes resources. Under-provisioning leaves throughput on the table. Start with the recommendations in the parallel operations guide and measure.
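As a starting point, a common rule of thumb (an assumption here, not a VStream-specific recommendation) is concurrency near the core count for CPU-bound work and a higher, capped value for I/O-bound work — then measure:

```java
public class ConcurrencySizing {
    // CPU-bound: workers beyond the core count only add scheduling overhead.
    static int cpuBound() {
        return Runtime.getRuntime().availableProcessors();
    }

    // I/O-bound: workers spend most of their time waiting, so go higher.
    // The 4x multiplier and the 64 cap are illustrative starting points
    // to tune with measurement, not fixed recommendations.
    static int ioBound() {
        return Math.min(64, cpuBound() * 4);
    }

    public static void main(String[] args) {
        System.out.println("cpu-bound concurrency: " + cpuBound());
        System.out.println("i/o-bound concurrency: " + ioBound());
    }
}
```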


Key Takeaways

  • VStream construction and combinator application is near-zero cost (lazy)
  • Per-element overhead comes from VTask evaluation during terminal operations
  • Parallel operations provide significant speedup for I/O-bound workloads
  • Chunking amortises per-element overhead across batches
  • Use Java Stream for simple in-memory transforms; use VStream for effectful, concurrent pipelines
  • Measure with JMH before optimising; the bottleneck is usually I/O, not framework overhead
