CPU scaling benchmark
workers: 8 + 1 main
iters total: 500M (55555555/stream)
elapsed: 1158.93 ms
total CPU used: 8841.72 ms
speedup: 7.63× vs serial
efficiency: 84.8% of 9× ideal

| stream | spawn ms | spawned @ ms | work start @ ms | work end @ ms | work ms | reap wait ms |
|---|---|---|---|---|---|---|
| 0 (main) | 0 | 13.46 | 13.47 | 895.84 | 882.37 | 0 |
| 1 | 2.139 | 2.16 | 21.2 | 885.49 | 864.29 | 0.14 |
| 2 | 1.593 | 3.77 | 18.71 | 1143.64 | 1124.93 | 248 |
| 3 | 1.559 | 5.35 | 44.01 | 1034.16 | 990.15 | 143.82 |
| 4 | 1.511 | 6.88 | 51.77 | 1062.51 | 1010.74 | 169.78 |
| 5 | 1.451 | 8.35 | 27.4 | 829.32 | 801.92 | 0.2 |
| 6 | 1.542 | 9.91 | 41.76 | 1148.35 | 1106.59 | 253.1 |
| 7 | 1.943 | 11.86 | 45.19 | 1004.33 | 959.14 | 114.11 |
| 8 | 1.564 | 13.44 | 54.27 | 1155.86 | 1101.59 | 260.14 |
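
The table can be cross-checked against the headline figures. The sketch below assumes that "work ms" is simply work end minus work start, and that "total CPU used" is the sum of the per-stream work ms; the names `ROWS`, `TOTAL_CPU_MS`, and `check_rows` are illustrative and not part of the benchmark.

```python
# Rows copied from the table: (stream, work start @ ms, work end @ ms, work ms).
ROWS = [
    ("0 (main)", 13.47, 895.84, 882.37),
    ("1", 21.2, 885.49, 864.29),
    ("2", 18.71, 1143.64, 1124.93),
    ("3", 44.01, 1034.16, 990.15),
    ("4", 51.77, 1062.51, 1010.74),
    ("5", 27.4, 829.32, 801.92),
    ("6", 41.76, 1148.35, 1106.59),
    ("7", 45.19, 1004.33, 959.14),
    ("8", 54.27, 1155.86, 1101.59),
]
TOTAL_CPU_MS = 8841.72  # "total CPU used" from the summary


def check_rows(rows, tol=0.005):
    """Return the streams whose work ms differs from (end - start) by more than tol."""
    return [name for name, start, end, work in rows
            if abs((end - start) - work) > tol]


mismatches = check_rows(ROWS)
total_work_ms = round(sum(row[3] for row in ROWS), 2)

print("rows where work ms != end - start:", mismatches)
print("sum of work ms:", total_work_ms, "reported total:", TOTAL_CPU_MS)
```

Under those assumptions every row is internally consistent and the per-stream work times add up to the reported total.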

[Timeline chart: one row per stream (main, w1 to w8); legend: fork+handshake, CPU work, parent reap wait]

what this measures
Each stream runs a tight integer LCG loop: the working set is a single CPU register, with no memory access
and no shared data. Speedup = sum(stream CPU time) / wall-clock elapsed. Efficiency = speedup / (workers + 1),
where the +1 counts the main process, which runs a stream of its own. 100% efficiency means perfect linear
scaling; anything below that is the cost of serial fork setup, the reap tail (the parent waiting for the
slowest workers to exit), and SMT/core contention.
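
As a minimal sketch of the definitions above, the following shows a register-only LCG loop of the kind described and recomputes the headline numbers from the reported totals. The LCG constants are an assumption (Knuth's MMIX parameters, modulo 2^64); the benchmark's actual constants are not stated. The function names are hypothetical, and a Python loop only illustrates the arithmetic, not the native loop's performance.

```python
MASK64 = (1 << 64) - 1

# Assumed constants (Knuth's MMIX LCG); the benchmark's real values are not given.
LCG_A = 6364136223846793005
LCG_C = 1442695040888963407


def lcg_stream(seed, iters):
    """Advance a 64-bit LCG `iters` times; the only state is one integer."""
    state = seed & MASK64
    for _ in range(iters):
        state = (state * LCG_A + LCG_C) & MASK64
    return state


def speedup(total_cpu_ms, elapsed_ms):
    """Speedup = sum(stream CPU time) / wall-clock elapsed."""
    return total_cpu_ms / elapsed_ms


def efficiency(speedup_value, streams):
    """Efficiency = speedup / (workers + 1), i.e. speedup per stream."""
    return speedup_value / streams


s = speedup(8841.72, 1158.93)   # reported total CPU used and elapsed
e = efficiency(s, 8 + 1)        # 8 workers plus the main process
print(f"speedup {s:.2f}x, efficiency {e:.1%}")
# prints "speedup 7.63x, efficiency 84.8%"
```

The recomputed values match the summary figures, which confirms that efficiency is measured against 9 streams rather than 8 workers.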