Benchmarks
Performance measurement suite covering scheduler, memory, IRQ, IPC, and stress benchmarks, with statistical analysis of the results.
Suite Architecture
The benchmark suite (helix-benchmarks, ~500 lines) provides a framework for measuring kernel performance across all subsystems.
Dependencies
[dependencies]
helix-hal = { path = "../hal" }
helix-core = { path = "../core" }
helix-execution = { path = "../subsystems/execution" }
helix-memory = { path = "../subsystems/memory" }
helix-dis = { path = "../subsystems/dis" }
helix-modules = { path = "../modules" }
Benchmark Runner
Feature Flags
[features]
verbose = [] # Print per-iteration timing
extended = [] # Run additional stress benchmarks
stress = [] # High-iteration stress tests
Configuration
Benchmark Parameters
| Parameter | Default | Extended | Stress |
|---|---|---|---|
| Warmup iterations | 100 | 1,000 | 10,000 |
| Measured iterations | 1,000 | 10,000 | 100,000 |
| Statistical samples | 10 | 100 | 1,000 |
| Outlier detection | MAD | MAD | MAD |
| Output format | Summary | Detailed | Full |
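The parameter table above can be expressed as a small set of presets. This is a minimal sketch, not the suite's actual types: `Profile`, `BenchConfig`, `from_features`, and `for_profile` are hypothetical names, and the rule that `stress` takes precedence over `extended` when both features are enabled is an assumption.

```rust
/// Hypothetical profile selector mirroring the table columns.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Profile {
    Default,
    Extended,
    Stress,
}

impl Profile {
    /// Picks a profile from feature switches. A caller could pass
    /// `cfg!(feature = "extended")` and `cfg!(feature = "stress")`.
    /// Assumption: `stress` wins when both are set.
    pub const fn from_features(extended: bool, stress: bool) -> Self {
        if stress {
            Profile::Stress
        } else if extended {
            Profile::Extended
        } else {
            Profile::Default
        }
    }
}

/// Hypothetical container for the numeric rows of the table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct BenchConfig {
    pub warmup_iterations: u32,
    pub measured_iterations: u32,
    pub statistical_samples: u32,
}

impl BenchConfig {
    /// Returns the values listed in the parameter table for a profile.
    pub const fn for_profile(profile: Profile) -> Self {
        match profile {
            Profile::Default => Self {
                warmup_iterations: 100,
                measured_iterations: 1_000,
                statistical_samples: 10,
            },
            Profile::Extended => Self {
                warmup_iterations: 1_000,
                measured_iterations: 10_000,
                statistical_samples: 100,
            },
            Profile::Stress => Self {
                warmup_iterations: 10_000,
                measured_iterations: 100_000,
                statistical_samples: 1_000,
            },
        }
    }
}
```

Keeping the presets in `const fn` form means they can be evaluated at compile time, which suits a kernel build where the profile is fixed by feature flags.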
Running Benchmarks
# Via kernel shell
> bench
# Via make (runs in QEMU)
make bench
# Specific benchmark
> bench context_switch
> bench memory_alloc
> bench syscall_latency
Scheduler Benchmarks
Context Switch Latency
Measures the time to switch between two threads:
| Metric | Target | Description |
|---|---|---|
| Mean | < 1 us | Average context switch time |
| P99 | < 5 us | 99th percentile (tail latency) |
| Min | ~500 ns | Best case (hot cache) |
| Max | < 20 us | Worst case (cold cache, TLB flush) |
Thread Creation
Measures time to create a new thread (allocate stack, initialize context, add to scheduler):
| Metric | Target |
|---|---|
| Mean | < 10 us |
| P99 | < 50 us |
Scheduler Throughput
Measures scheduling decisions per second with varying thread counts:
| Threads | Target Decisions/sec |
|---|---|
| 10 | > 1,000,000 |
| 100 | > 500,000 |
| 1,000 | > 100,000 |
DIS Intent Scheduling
Measures overhead of intent-based scheduling vs. simple priority scheduling:
| Operation | Target |
|---|---|
| Intent classification | < 100 ns |
| Policy evaluation | < 500 ns |
| Queue selection | < 50 ns |
| Full DIS dispatch | < 2 us |
Memory Benchmarks
Page Allocation
| Allocator | Operation | Target |
|---|---|---|
| Bump | Single page | < 50 ns |
| Bitmap | Single page | < 200 ns |
| Bitmap | Contiguous 16 pages | < 1 us |
| Buddy | Single page | < 100 ns |
| Buddy | 1 MB (256 pages) | < 500 ns |
| Buddy | 8 MB (2048 pages) | < 1 us |
Slab Allocation
| Size Class | Alloc Target | Free Target |
|---|---|---|
| 16 bytes | < 30 ns | < 20 ns |
| 64 bytes | < 30 ns | < 20 ns |
| 256 bytes | < 40 ns | < 25 ns |
| 1024 bytes | < 50 ns | < 30 ns |
| 2048 bytes | < 60 ns | < 35 ns |
Virtual Memory
| Operation | Target |
|---|---|
| Page map | < 200 ns |
| Page unmap | < 150 ns |
| TLB flush (single) | < 100 ns |
| TLB flush (full) | < 500 ns |
| mmap_anonymous (4 KB) | < 1 us |
| mmap_anonymous (1 MB) | < 10 us |
Syscall Benchmarks
| Syscall | Target Latency |
|---|---|
| getpid (trivial) | < 100 ns |
| read (0 bytes) | < 300 ns |
| write (serial, 1 byte) | < 500 ns |
| mmap (anonymous, 4 KB) | < 2 us |
| fork | < 50 us |
IPC Benchmarks
| Operation | Target |
|---|---|
| Channel send (64B) | < 100 ns |
| Channel recv (64B) | < 100 ns |
| OneShot round-trip | < 300 ns |
| EventBus publish (1 subscriber) | < 200 ns |
| EventBus publish (10 subscribers) | < 1 us |
| MessageRouter send | < 200 ns |
Throughput
| Scenario | Target |
|---|---|
| Channel: 1 producer, 1 consumer | > 5M msg/sec |
| Channel: 4 producers, 1 consumer | > 2M msg/sec |
| EventBus: 1 topic, 10 subscribers | > 1M msg/sec |
Statistical Analysis
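The benchmark framework filters outliers, reports confidence intervals and percentiles for each metric, and tests differences between runs for statistical significance: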
Outlier Detection
The Median Absolute Deviation (MAD) method identifies outliers:
- Compute the median of all measurements
- Compute the MAD = median(|xi - median|)
- Flag values where |xi - median| > 3 * MAD as outliers
- Report with and without outliers
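The four steps above can be sketched as follows. This is an illustrative host-side version using `std` and `f64` samples; the names `median`, `MadReport`, and `mad_outliers` are hypothetical and not taken from the suite, which may use integer cycle counts and a `no_std` implementation instead.

```rust
/// Median of a slice (sorts a copy). Returns None for empty input.
fn median(values: &[f64]) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    Some(if sorted.len() % 2 == 0 {
        (sorted[mid - 1] + sorted[mid]) / 2.0
    } else {
        sorted[mid]
    })
}

/// Hypothetical result type: the statistics plus the split samples,
/// so a caller can report with and without outliers.
#[derive(Debug)]
pub struct MadReport {
    pub median: f64,
    pub mad: f64,
    pub outliers: Vec<f64>,
    pub inliers: Vec<f64>,
}

/// Flags samples where |x - median| > k * MAD (the text uses k = 3).
pub fn mad_outliers(samples: &[f64], k: f64) -> Option<MadReport> {
    let med = median(samples)?;
    // Absolute deviation of every sample from the median.
    let deviations: Vec<f64> = samples.iter().map(|x| (x - med).abs()).collect();
    let mad = median(&deviations)?;
    let (outliers, inliers): (Vec<f64>, Vec<f64>) = samples
        .iter()
        .copied()
        .partition(|x| (x - med).abs() > k * mad);
    Some(MadReport { median: med, mad, outliers, inliers })
}
```

Note that when more than half of the samples are identical the MAD is zero, so every sample that differs from the median is flagged; a real implementation may want a floor on the MAD for that case.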
Confidence Intervals
For each metric, the framework reports:
- Mean with 95% confidence interval
- Median (robust to outliers)
- Standard deviation
- Min / Max (absolute bounds)
- P50, P90, P95, P99 (percentile distribution)
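A sketch of how the reported values listed above might be computed from a set of samples. The names `Summary`, `percentile`, and `summarize` are hypothetical. Two choices here are assumptions rather than documented behavior: the confidence interval uses the normal approximation (z = 1.96), and percentiles use the nearest-rank definition. With only 10 samples, a Student's t interval would be wider than this approximation.

```rust
/// Hypothetical per-metric summary matching the list above.
#[derive(Debug)]
pub struct Summary {
    pub mean: f64,
    /// Half-width of the 95% interval: mean ± ci95_half_width.
    pub ci95_half_width: f64,
    pub median: f64,
    pub std_dev: f64,
    pub min: f64,
    pub max: f64,
    pub p50: f64,
    pub p90: f64,
    pub p95: f64,
    pub p99: f64,
}

/// Nearest-rank percentile on an ascending, non-empty slice.
/// Integer arithmetic avoids floating-point rounding in the rank.
fn percentile(sorted: &[f64], pct: usize) -> f64 {
    let rank = (pct * sorted.len() + 99) / 100; // ceil(pct * n / 100)
    sorted[rank.clamp(1, sorted.len()) - 1]
}

/// Returns None when fewer than two samples are available, since the
/// sample standard deviation is undefined in that case.
pub fn summarize(samples: &[f64]) -> Option<Summary> {
    if samples.len() < 2 {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let len = sorted.len();
    let n = len as f64;

    let mean = sorted.iter().sum::<f64>() / n;
    // Sample variance with Bessel's correction (n - 1).
    let variance = sorted.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0);
    let std_dev = variance.sqrt();
    let median = if len % 2 == 0 {
        (sorted[len / 2 - 1] + sorted[len / 2]) / 2.0
    } else {
        sorted[len / 2]
    };

    Some(Summary {
        mean,
        ci95_half_width: 1.96 * std_dev / n.sqrt(),
        median,
        std_dev,
        min: sorted[0],
        max: sorted[len - 1],
        p50: percentile(&sorted, 50),
        p90: percentile(&sorted, 90),
        p95: percentile(&sorted, 95),
        p99: percentile(&sorted, 99),
    })
}
```

The median and P50 can differ slightly for even sample counts: the median averages the two middle values, while nearest-rank P50 always returns an observed sample.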
Comparison
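When comparing two benchmark runs, the framework reports the relative change in each metric and whether the difference is statistically significant: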
Benchmark: context_switch
Before: mean=890ns, p99=4.2us
After: mean=850ns, p99=3.8us
Change: -4.5% mean, -9.5% p99
Significant: yes (p < 0.05, Mann-Whitney U test)
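The comparison report above combines a relative change with a Mann-Whitney U test. The sketch below shows one way to compute both; `percent_change` and `mann_whitney` are hypothetical names. The z-score uses the normal approximation without a tie correction, which is an assumption and is only rough for small sample counts; the suite may compute an exact p-value instead.

```rust
/// Relative change from `before` to `after`, in percent.
/// Negative values mean the metric got smaller.
pub fn percent_change(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}

/// Mann-Whitney U test on two sets of samples.
/// Returns (U, z), where U is the smaller of the two U statistics and
/// z is its normal-approximation score. |z| > 1.96 corresponds to
/// p < 0.05 for a two-sided test under that approximation.
pub fn mann_whitney(a: &[f64], b: &[f64]) -> Option<(f64, f64)> {
    if a.is_empty() || b.is_empty() {
        return None;
    }
    // Pool both runs, remembering which run each value came from.
    let mut pooled: Vec<(f64, bool)> = a
        .iter()
        .map(|&x| (x, true))
        .chain(b.iter().map(|&x| (x, false)))
        .collect();
    pooled.sort_by(|l, r| l.0.total_cmp(&r.0));

    // Sum the ranks of run `a`; tied values share their average rank.
    let mut rank_sum_a = 0.0;
    let mut i = 0;
    while i < pooled.len() {
        let mut j = i;
        while j < pooled.len() && pooled[j].0 == pooled[i].0 {
            j += 1;
        }
        // The tie group occupies 1-based ranks i+1 ..= j.
        let avg_rank = (i + 1 + j) as f64 / 2.0;
        let from_a = pooled[i..j].iter().filter(|e| e.1).count() as f64;
        rank_sum_a += avg_rank * from_a;
        i = j;
    }

    let (n1, n2) = (a.len() as f64, b.len() as f64);
    let u_a = rank_sum_a - n1 * (n1 + 1.0) / 2.0;
    let u = u_a.min(n1 * n2 - u_a);
    let mean_u = n1 * n2 / 2.0;
    let sd_u = (n1 * n2 * (n1 + n2 + 1.0) / 12.0).sqrt();
    Some((u, (u - mean_u) / sd_u))
}
```

A rank-based test is a reasonable fit for latency data because it does not assume a normal distribution and is not dominated by a few very slow iterations.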
Running Benchmarks
From the Kernel Shell
helix> bench
=== Helix Benchmark Suite ===
[1/8] context_switch .............. 890ns mean (1000 iters)
[2/8] thread_create ............... 8.2us mean (1000 iters)
[3/8] page_alloc_bitmap ........... 180ns mean (10000 iters)
[4/8] page_alloc_buddy ............ 95ns mean (10000 iters)
[5/8] slab_alloc_64 ............... 28ns mean (10000 iters)
[6/8] syscall_getpid .............. 85ns mean (10000 iters)
[7/8] ipc_channel_roundtrip ....... 210ns mean (10000 iters)
[8/8] dis_intent_classify ......... 75ns mean (10000 iters)
All benchmarks passed target thresholds.
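A tool that consumes this output needs to turn each result line into structured data. The following is a minimal sketch that assumes the line layout is exactly as in the sample above (index, name, dot leader, mean with an `ns` or `us` suffix, iteration count); the real format may differ, and `ResultLine`, `parse_result_line`, and `parse_duration_ns` are hypothetical names.

```rust
/// Hypothetical parsed form of one result line.
#[derive(Debug, PartialEq)]
pub struct ResultLine {
    pub name: String,
    pub mean_ns: f64,
    pub iterations: u64,
}

/// Converts a token such as "890ns" or "8.2us" to nanoseconds.
fn parse_duration_ns(token: &str) -> Option<f64> {
    if let Some(value) = token.strip_suffix("ns") {
        value.parse::<f64>().ok()
    } else if let Some(value) = token.strip_suffix("us") {
        value.parse::<f64>().ok().map(|us| us * 1000.0)
    } else {
        None
    }
}

/// Parses a line like
/// "[1/8] context_switch .............. 890ns mean (1000 iters)".
/// Returns None for banners, summaries, and anything else that does
/// not match the assumed layout.
pub fn parse_result_line(line: &str) -> Option<ResultLine> {
    let mut tokens = line.split_whitespace();

    let index = tokens.next()?; // "[1/8]"
    if !(index.starts_with('[') && index.ends_with(']')) {
        return None;
    }
    let name = tokens.next()?.to_string(); // "context_switch"
    let _dot_leader = tokens.next()?; // ".............."
    let mean_ns = parse_duration_ns(tokens.next()?)?; // "890ns"
    if tokens.next()? != "mean" {
        return None;
    }
    // "(1000" followed by "iters)".
    let iterations: u64 = tokens.next()?.strip_prefix('(')?.parse().ok()?;

    Some(ResultLine { name, mean_ns, iterations })
}
```

Normalizing every mean to nanoseconds at parse time keeps later comparisons free of unit handling.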
From Host
# Run all benchmarks in QEMU
make bench
# Run with extended iterations
make bench FEATURES=extended
# Run stress tests
make bench FEATURES=stress
Continuous Integration
Benchmarks run on every PR to detect performance regressions:
- Build the kernel with profile.bench
- Boot in QEMU with benchmark arguments
- Parse serial output for results
- Compare against baseline (stored in repo)
- Fail the build if any metric regresses > 10%
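The last two steps amount to a threshold check against the stored baseline. A minimal sketch follows, assuming every metric is a latency in nanoseconds where lower is better (throughput metrics would need the sign reversed). The function name `find_regressions` and the map-based baseline representation are hypothetical; the source does not describe the baseline file format.

```rust
use std::collections::BTreeMap;

/// Returns every metric whose current value is more than
/// `max_regression_pct` percent above its baseline, together with the
/// measured change in percent.
///
/// Assumptions: lower is better, and metrics missing from `current`
/// are skipped here; a real gate would probably treat a missing
/// metric as a failure.
pub fn find_regressions(
    baseline: &BTreeMap<String, f64>,
    current: &BTreeMap<String, f64>,
    max_regression_pct: f64,
) -> Vec<(String, f64)> {
    let mut regressions = Vec::new();
    for (name, &base) in baseline {
        if let Some(&now) = current.get(name) {
            let change_pct = (now - base) / base * 100.0;
            if change_pct > max_regression_pct {
                regressions.push((name.clone(), change_pct));
            }
        }
    }
    regressions
}
```

A CI job could call this with a threshold of `10.0` and exit with a non-zero status whenever the returned list is non-empty.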