Pi 5: The Vector Processor, including whetv64SPg12 and whetv64DPg12
During the 1980s and early 1990s I was responsible for evaluating and acceptance testing of supercomputers for the UK government, and for those centrally funded for universities. For multi-user development work, the latter were particularly interested in vector versus scalar performance. I converted my Fortran scalar Whetstone benchmark into one where every test function could vectorise, with a default vector length of 256 words.
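The conversion can be pictured with a minimal C sketch (illustrative only, with hypothetical function names, not the benchmark's actual source): the scalar tests repeatedly update a few scalar variables, each dependent on the last result, while the vector version replaces each variable with an array, so every pass presents the compiler with a long loop of independent results that it can vectorise.

Code:
/* Illustrative sketch only - not the actual benchmark source.
   Scalar form: one chain of dependent results per pass. */
void pa_scalar(double *x1, double *x2, double *x3, double *x4, double t)
{
    *x1 = ( *x1 + *x2 + *x3 - *x4) * t;
    *x2 = ( *x1 + *x2 - *x3 + *x4) * t;
    *x3 = ( *x1 - *x2 + *x3 + *x4) * t;
    *x4 = (-*x1 + *x2 + *x3 + *x4) * t;
}

/* Vector form: each variable becomes an array of 256 words, giving
   256 independent results per pass for the compiler to vectorise. */
#define VLEN 256
void pa_vector(double x1[], double x2[], double x3[], double x4[], double t)
{
    for (int i = 0; i < VLEN; i++) {
        x1[i] = ( x1[i] + x2[i] + x3[i] - x4[i]) * t;
        x2[i] = ( x1[i] + x2[i] - x3[i] + x4[i]) * t;
        x3[i] = ( x1[i] - x2[i] + x3[i] + x4[i]) * t;
        x4[i] = (-x1[i] + x2[i] + x3[i] + x4[i]) * t;
    }
}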
The vector version was finely tuned, hands on, on Cray 1 serial 1, which was at the Rutherford Laboratory, near Didcot, for a time. Its first real use was during factory and site trials of the first UK full scale Cray 1. Next came the first CDC Cyber 205, and the last occasion was attending user benchmark tests in Japan, on behalf of ULCC, at NEC and Fujitsu, where my benchmarks were also run.
I recompiled the scalar and vector C Whetstone benchmarks on the Pi 5 using gcc 12. The scalar results were effectively the same as those from gcc 8, quoted earlier in this topic. Results for the single and double precision vector versions were as follows. Note that the N5 and N8 tests, which use the maths functions (both executed at DP), mainly determine the final MWIPS rating.
The gcc 12 vector benchmark was also run on the Pi 4, to compare like with like. On that basis, for the three main MFLOPS measurements, the Pi 5 was effectively 3.1 times faster, for both single and double precision operation.
For both systems, double precision MFLOPS results were effectively half those at single precision, as expected with SIMD vector operation.
Code:
Pi 4 GCC 12 SP
Whetstone Vector Benchmark gcc 12 64 Bit Single Precision, Sun Dec 10 17:42:10 2023

Loop content          Result             MFLOPS    MOPS  Seconds

N1 floating point  -1.13316142559051      2387             0.4
N2 floating point  -1.13312149047851      2407             2.8
N3 if then else     1.00000000000000              7428     0.7
N4 fixed point     12.00000000000000              1736     9.0
N5 sin,cos etc.     0.49998238682747         79            52.2
N6 floating point   0.99999982118607       2577            10.4
N7 assignments      3.00000000000000             10223     0.9
N8 exp,sqrt etc.    0.75002217292786         78            23.7

MWIPS                                      4955           100.0

Pi 4 GCC 12 DP
Whetstone Vector Benchmark gcc 12 64 Bit Double Precision, Sun Dec 10 17:47:48 2023

Loop content          Result             MFLOPS    MOPS  Seconds

N1 floating point  -1.13314558088707      1164             0.7
N2 floating point  -1.13310306766606      1173             4.9
N3 if then else     1.00000000000000              7424     0.6
N4 fixed point     12.00000000000000              1735     7.8
N5 sin,cos etc.     0.49998080312724         76            47.0
N6 floating point   0.99999988868927       1295            18.0
N7 assignments      3.00000000000000              5325     1.5
N8 exp,sqrt etc.    0.75002006515491         83            19.4

MWIPS                                      4314           100.0

Pi 5 GCC 12 SP
Whetstone Vector Benchmark gcc 12 64 Bit Single Precision, Sat Oct 7 10:46:30 2023

Loop content          Result             MFLOPS    MOPS  Seconds  Pi 5/4

N1 floating point  -1.13316142559051      7393             0.3     3.10
N2 floating point  -1.13312149047851      7365             2.0     3.06
N3 if then else     1.00000000000000             14169     0.8     1.91
N4 fixed point     12.00000000000000              2399    14.5     1.38
N5 sin,cos etc.     0.49998238682747        177            51.7     2.24
N6 floating point   0.99999982118607       8079             7.4     3.13
N7 assignments      3.00000000000000             26419     0.8     2.58
N8 exp,sqrt etc.    0.75002217292786        178            23.0     2.29

MWIPS                                     10975           100.3     2.21

Pi 5 GCC 12 DP
Whetstone Vector Benchmark gcc 12 64 Bit Double Precision, Sat Oct 7 10:50:40 2023

Loop content          Result             MFLOPS    MOPS  Seconds  Pi 5/4

N1 floating point  -1.13314558088707      3603             0.5     3.10
N2 floating point  -1.13310306766606      3620             3.6     3.09
N3 if then else     1.00000000000000             14168     0.7     1.91
N4 fixed point     12.00000000000000              2399    12.9     1.38
N5 sin,cos etc.     0.49998080312724        172            47.5     2.25
N6 floating point   0.99999988868927       3998            13.3     3.09
N7 assignments      3.00000000000000             13172     1.4     2.47
N8 exp,sqrt etc.    0.75002006515491        183            20.0     2.21

MWIPS                                      9830            99.9     2.28
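The halving at double precision follows from the SIMD register width: a 128-bit NEON register holds four single precision words but only two at double precision, so each vector instruction does half as many DP operations. A minimal sketch using GCC's vector extensions (illustrative, not part of the benchmark) shows the two cases:

Code:
/* Illustrative only: 128-bit vectors at SP and DP. With optimisation,
   on AArch64, each function typically compiles to a single fadd. */
typedef float  v4sf __attribute__((vector_size(16)));  /* 4 x 32-bit SP */
typedef double v2df __attribute__((vector_size(16)));  /* 2 x 64-bit DP */

v4sf add_sp(v4sf a, v4sf b) { return a + b; }  /* fadd v.4s = 4 results */
v2df add_dp(v2df a, v2df b) { return a + b; }  /* fadd v.2d = 2 results */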
Example Of Vector Instructions Compiled

Code:
.L11:
        add     x0, x0, 16
        ldr     q4, [x0, -16]
        ldr     q0, [x0, 4816]
        ldr     q9, [x0, 9648]
        fadd    v4.4s, v0.4s, v4.4s
        ldr     q8, [x0, 14480]
        fadd    v4.4s, v4.4s, v9.4s
        fsub    v4.4s, v4.4s, v8.4s
        fmla    v0.4s, v1.4s, v4.4s
        fsub    v0.4s, v0.4s, v9.4s
        fadd    v0.4s, v0.4s, v8.4s
        fmul    v0.4s, v0.4s, v1.4s
        fneg    v2.4s, v0.4s
        mov     v5.16b, v0.16b
        mov     v3.16b, v0.16b
        fmla    v2.4s, v1.4s, v4.4s
        fmls    v5.4s, v1.4s, v4.4s
        fmla    v3.4s, v1.4s, v4.4s
        fadd    v2.4s, v2.4s, v9.4s
        mov     v4.16b, v5.16b
        fadd    v2.4s, v2.4s, v8.4s
        fmla    v4.4s, v2.4s, v1.4s
        fmla    v3.4s, v2.4s, v1.4s
        fadd    v4.4s, v4.4s, v8.4s
        fmls    v3.4s, v4.4s, v1.4s
        fmul    v3.4s, v3.4s, v1.4s
        fadd    v0.4s, v3.4s, v0.4s
        str     q3, [x0, -16]
        fmls    v0.4s, v2.4s, v1.4s
        fmla    v0.4s, v4.4s, v1.4s
        fmul    v0.4s, v0.4s, v1.4s
        fsub    v5.4s, v3.4s, v0.4s
        str     q0, [x0, 4816]
        fsub    v0.4s, v0.4s, v3.4s
        mov     v3.16b, v5.16b
        fmla    v3.4s, v2.4s, v1.4s
        mov     v2.16b, v3.16b
        fmla    v2.4s, v4.4s, v1.4s
        fmul    v2.4s, v2.4s, v1.4s
        fadd    v0.4s, v0.4s, v2.4s
        str     q2, [x0, 9648]
        fmla    v0.4s, v4.4s, v1.4s
        fmul    v0.4s, v0.4s, v1.4s
        str     q0, [x0, 14480]
        cmp     x0, x22
        bne     .L11
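In this single precision inner loop, each ldr q instruction loads four 32-bit values into a 128-bit register, and the .4s arithmetic instructions (fadd, fsub, fmul, fmla, fmls) operate on all four lanes at once. The double precision compilation uses the two-lane .2d forms instead, consistent with the halved DP MFLOPS above. The pattern of adds, subtracts and multiplies by a constant suggests this is one of the first two floating point test loops, working on four arrays.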
Comparison With Old Supercomputers

Following are Scalar and Vector Whetstone benchmark results for the original supercomputers. In the 1980s, these provided a useful tool in confirming system choices for university work involving multiple user access, typically with programs containing 90% vectorisable code. The choice then depended on scalar versus vector performance, and on multiple processors versus multiple pipelines.
Pi 5 results are included and can look good on a per MHz basis. See the next page for comparisons, including for the benchmark originally used to validate performance of the first Cray 1 supercomputer.
Code:
                          Scalar          Vector      Vector/Scalar
                   MHz   MWIPS MFLOPS   MWIPS MFLOPS     MFLOPS      Date

Cray 1              80    16.2    5.9      98     47        8.0      1978
CDC Cyber 205       50    11.9    4.9     161     57       11.7      1981
Cray XMP1          118    30.3   11.0     313    151       13.7      1982
Cray 2/1           244    25.8    N/A     425    N/A                 1984
Amdahl VP 500 #    143    21.7    7.5     250    103       13.8      1984
Amdahl VP 1100 #   143    21.7    7.5     374    146       19.5      1984
Amdahl VP 1200 #   143    21.7    7.5     581    264       35.3      1984
IBM 3090-150 VP     54    12.1    4.9      60     17        3.6      1986
(CDC) ETA 10E       95    15.7    6.5     335    124       19.2      1987
Cray YMP1          154    31.0   12.0     449    195       16.3      1987
Fujitsu VP-2400/4  312    71.7   25.4    1828    794       31.3      1991
NEC SX-3/11        345    42.9   17.0    1106    441       25.9      1991
NEC SX-3/12        345    42.9   17.0    1667    753       44.3      1991
Raspberry Pi 5 SP 2400    5843   1206   10986   7599        6.3      2023
Raspberry Pi 5 DP 2400     N/A    N/A    9816   3731        3.1      2023

# Fujitsu Systems
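As an example of the per MHz comparison, the Pi 5's 7599 single precision vector MFLOPS at 2400 MHz is around 3.2 MFLOPS per MHz, against roughly 0.6 for the Cray 1 (47 at 80 MHz) and 2.2 for the NEC SX-3/12 (753 at 345 MHz), although the Pi 5's vector to scalar gain of 6.3 is far below the 30 to 44 achieved by the last of these vector supercomputers.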
Posted by RoyLongbottom, Wed Jan 17, 2024 4:59 pm