CPU scaling benchmark

| metric | value |
|---|---|
| workers | 8 + 1 main |
| iters total | 500M (55555555/stream) |
| elapsed | 1145.14 ms |
| total CPU used | 9397.19 ms |
| speedup | 8.21× vs serial |
| efficiency | 91.2% of 9× ideal |

| stream | spawn (ms) | spawned at (ms) | work start at (ms) | work end at (ms) | work (ms) | reap wait (ms) |
|---|---|---|---|---|---|---|
| 0 (main) | 0 | 14.82 | 14.83 | 1079.86 | 1065.03 | 0 |
| 1 | 2.172 | 2.18 | 32.44 | 1119.47 | 1087.03 | 39.79 |
| 2 | 1.689 | 3.89 | 20.48 | 1110.23 | 1089.75 | 33.08 |
| 3 | 1.589 | 5.5 | 22.12 | 1012 | 989.88 | 0.14 |
| 4 | 1.6 | 7.11 | 33.81 | 1142.05 | 1108.24 | 62.31 |
| 5 | 1.616 | 8.74 | 43.06 | 1120.49 | 1077.43 | 42.65 |
| 6 | 1.633 | 10.39 | 41.77 | 1082.16 | 1040.39 | 7.54 |
| 7 | 2.714 | 13.12 | 43.03 | 909.77 | 866.74 | 0.19 |
| 8 | 1.666 | 14.8 | 63.05 | 1135.75 | 1072.7 | 56.04 |
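
The summary figures can be cross-checked against the per-stream table. The sketch below copies the work-time column by hand, then recomputes total CPU time, speedup and efficiency; the variable names are illustrative and are not part of the benchmark itself.

```python
# Per-stream work times in ms, copied from the table (stream 0 = main, then 1..8).
work_ms = [1065.03, 1087.03, 1089.75, 989.88, 1108.24,
           1077.43, 1040.39, 866.74, 1072.70]
elapsed_ms = 1145.14          # wall-clock time for the whole run
streams = len(work_ms)        # 8 workers + 1 main = 9

iters_per_stream = 500_000_000 // streams   # integer split of the 500M total

total_cpu_ms = sum(work_ms)                 # "total CPU used"
speedup = total_cpu_ms / elapsed_ms         # sum(stream time) / wall-clock
efficiency = speedup / streams              # fraction of the 9x ideal

print(f"iters/stream   {iters_per_stream}")
print(f"total CPU used {total_cpu_ms:.2f} ms")
print(f"speedup        {speedup:.2f}x")
print(f"efficiency     {efficiency:.1%}")
```

This reproduces the reported 55555555 iterations per stream, 9397.19 ms, 8.21× and 91.2%.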
[Timeline chart: one row per stream (main, w1 to w8); segments show fork+handshake, CPU work, and parent reap wait.]
what this measures
Each stream runs a tight integer LCG (linear congruential generator) loop: the working set is one CPU register,
with no memory access and no shared data. Speedup = sum(stream CPU time) / wall-clock elapsed, where each stream's
CPU time is its work ms in the table. Efficiency = speedup / (workers + 1).
100% efficiency means perfect linear scaling; anything below 100% is the cost of serial fork setup,
the reap tail, and SMT/core contention.
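
For illustration, a fork-per-stream harness of this kind might look like the sketch below. It is not the benchmark's actual code: the LCG constants, the pipe-based reporting of each child's work time, and the names `lcg_loop`, `timed_work` and `run` are all assumptions. It is also POSIX-only (`os.fork`), and an interpreted Python loop does not keep its state in a single register, so absolute timings will look nothing like the figures reported here. Only the structure (serial fork, parallel work, parent reap) and the speedup and efficiency formulas carry over.

```python
import os
import struct
import time

# Assumed LCG constants (the common Numerical Recipes pair); the benchmark's are not stated.
A, C, MASK = 1664525, 1013904223, 0xFFFFFFFF

def lcg_loop(seed: int, iters: int) -> int:
    """Advance a 32-bit LCG `iters` times; the only state is one integer."""
    x = seed
    for _ in range(iters):
        x = (A * x + C) & MASK
    return x

def timed_work(seed: int, iters: int) -> float:
    """Run the loop and return its duration in milliseconds."""
    t0 = time.perf_counter()
    lcg_loop(seed, iters)
    return (time.perf_counter() - t0) * 1000.0

def run(workers: int, iters_per_stream: int):
    """Fork `workers` children serially, work in parallel, then reap them."""
    start = time.perf_counter()
    children = []
    for i in range(1, workers + 1):
        r, w = os.pipe()
        pid = os.fork()
        if pid == 0:                      # child: do the work, report, exit
            try:
                os.close(r)
                os.write(w, struct.pack("d", timed_work(i, iters_per_stream)))
            finally:
                os._exit(0)
        os.close(w)                       # parent keeps only the read end
        children.append((pid, r))

    work_ms = [timed_work(0, iters_per_stream)]   # main runs its own stream
    for pid, r in children:                       # reap: blocks on slow children
        work_ms.append(struct.unpack("d", os.read(r, 8))[0])
        os.close(r)
        os.waitpid(pid, 0)

    elapsed_ms = (time.perf_counter() - start) * 1000.0
    speedup = sum(work_ms) / elapsed_ms
    efficiency = speedup / (workers + 1)
    return elapsed_ms, speedup, efficiency

if __name__ == "__main__":
    elapsed, speedup, eff = run(workers=2, iters_per_stream=200_000)
    print(f"elapsed {elapsed:.2f} ms, speedup {speedup:.2f}x, efficiency {eff:.1%}")
```

Because each stream's work interval lies inside the parent's wall-clock window, the computed speedup cannot exceed workers + 1. Serial fork time and the wait for the slowest child are what pull efficiency below 100%.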