Channel: Raspberry Pi Forums

General discussion • Pi 5 The Vector Processor

Pi 5 The Vector Processor including whetv64SPg12 and whetv64DPg12

During the 1980s and early 1990s I was responsible for evaluating and acceptance testing supercomputers for the UK government, and those centrally funded for universities. For multiple-user development, the latter were particularly interested in vector versus scalar performance. I converted my Fortran scalar Whetstone benchmark to one where every test function could vectorize, with a default vector length of 256 words.

The vector version was finely tuned, hands on, on Cray 1 serial number 1, which was at Didcot Rutherford Laboratory for a time. Its first real use was during factory and site trials of the first UK full scale Cray 1. Next came the first CDC Cyber 205, and last was attending user benchmark tests in Japan for ULCC at NEC and Fujitsu, where my benchmarks were also run.

I recompiled the scalar and vector C Whetstone benchmarks on the Pi 5, using gcc 12. The scalar results were effectively the same as those from gcc 8, quoted earlier in this topic. Results for the single and double precision vector versions were as follows. Note that the N5 and N8 tests, which use functions (both executed at DP), take most of the running time and so mainly determine the final MWIPS rating.

The gcc 12 vector benchmark was also run on the Pi 4, to compare like with like. Then, for the three main MFLOPS measurements, the Pi 5 was effectively 3.1 times faster for both single and double precision operation.

For both systems, double precision MFLOPS results were effectively half those at single precision, as expected with SIMD vector operation.
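
The halving follows from the register width: an AArch64 Advanced SIMD q register is 128 bits, so it holds four single precision or two double precision values. The following is a minimal sketch of that arithmetic and of the ratios quoted above. The names `SIMD_REGISTER_BYTES`, `lanes_per_register` and `rate_ratio` are illustrative only and are not taken from the benchmark source.

```c
#include <assert.h>
#include <stddef.h>

/* Width in bytes of one AArch64 Advanced SIMD q register (128 bits). */
#define SIMD_REGISTER_BYTES 16

/* Number of elements of the given size that fit in one q register.
   Hypothetical helper, for illustration only. */
size_t lanes_per_register(size_t element_bytes)
{
    assert(element_bytes > 0 && element_bytes <= SIMD_REGISTER_BYTES);
    return SIMD_REGISTER_BYTES / element_bytes;
}

/* Ratio of two measured rates, for example Pi 5 MFLOPS over Pi 4 MFLOPS,
   or SP MFLOPS over DP MFLOPS. Hypothetical helper. */
double rate_ratio(double numerator, double denominator)
{
    assert(denominator > 0.0);
    return numerator / denominator;
}
```

Using the N1 figures from the tables, `rate_ratio(7393, 2387)` is roughly 3.1 (Pi 5 versus Pi 4 at SP) and `rate_ratio(7393, 3603)` is roughly 2.05 (SP versus DP on the Pi 5), against a lane ratio of `lanes_per_register(sizeof(float)) / lanes_per_register(sizeof(double))`, which is 2.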

Code:

Pi 4 GCC 12 SP

Whetstone Vector Benchmark gcc 12 64 Bit Single Precision, Sun Dec 10 17:42:10 2023

Loop content          Result          MFLOPS  MOPS Seconds
N1 floating point    -1.13316142559051  2387           0.4
N2 floating point    -1.13312149047851  2407           2.8
N3 if then else       1.00000000000000        7428     0.7
N4 fixed point       12.00000000000000        1736     9.0
N5 sin,cos etc.       0.49998238682747          79    52.2
N6 floating point     0.99999982118607  2577          10.4
N7 assignments        3.00000000000000       10223     0.9
N8 exp,sqrt etc.      0.75002217292786          78    23.7
MWIPS                                   4955         100.0

Pi 4 GCC 12 DP

Whetstone Vector Benchmark gcc 12 64 Bit Double Precision, Sun Dec 10 17:47:48 2023

Loop content          Result          MFLOPS  MOPS Seconds
N1 floating point    -1.13314558088707  1164           0.7
N2 floating point    -1.13310306766606  1173           4.9
N3 if then else       1.00000000000000        7424     0.6
N4 fixed point       12.00000000000000        1735     7.8
N5 sin,cos etc.       0.49998080312724          76    47.0
N6 floating point     0.99999988868927  1295          18.0
N7 assignments        3.00000000000000        5325     1.5
N8 exp,sqrt etc.      0.75002006515491          83    19.4
MWIPS                                   4314         100.0

Pi 5 GCC 12 SP

Whetstone Vector Benchmark gcc 12 64 Bit Single Precision, Sat Oct  7 10:46:30 2023

Loop content          Result          MFLOPS  MOPS Seconds   Pi 5/4
N1 floating point    -1.13316142559051  7393           0.3     3.10
N2 floating point    -1.13312149047851  7365           2.0     3.06
N3 if then else       1.00000000000000       14169     0.8     1.91
N4 fixed point       12.00000000000000        2399    14.5     1.38
N5 sin,cos etc.       0.49998238682747         177    51.7     2.24
N6 floating point     0.99999982118607  8079           7.4     3.13
N7 assignments        3.00000000000000       26419     0.8     2.58
N8 exp,sqrt etc.      0.75002217292786         178    23.0     2.29
MWIPS                                  10975         100.3     2.21

Pi 5 GCC 12 DP

Whetstone Vector Benchmark gcc 12 64 Bit Double Precision, Sat Oct  7 10:50:40 2023

Loop content          Result          MFLOPS  MOPS Seconds   Pi 5/4
N1 floating point    -1.13314558088707  3603           0.5     3.10
N2 floating point    -1.13310306766606  3620           3.6     3.09
N3 if then else       1.00000000000000       14168     0.7     1.91
N4 fixed point       12.00000000000000        2399    12.9     1.38
N5 sin,cos etc.       0.49998080312724         172    47.5     2.25
N6 floating point     0.99999988868927  3998          13.3     3.09
N7 assignments        3.00000000000000       13172     1.4     2.47
N8 exp,sqrt etc.      0.75002006515491         183    20.0     2.21
MWIPS                                   9830          99.9     2.28

Example Of Vector Instructions Compiled

Code:

.L11:
        add     x0, x0, 16
        ldr     q4, [x0, -16]
        ldr     q0, [x0, 4816]
        ldr     q9, [x0, 9648]
        fadd    v4.4s, v0.4s, v4.4s
        ldr     q8, [x0, 14480]
        fadd    v4.4s, v4.4s, v9.4s
        fsub    v4.4s, v4.4s, v8.4s
        fmla    v0.4s, v1.4s, v4.4s
        fsub    v0.4s, v0.4s, v9.4s
        fadd    v0.4s, v0.4s, v8.4s
        fmul    v0.4s, v0.4s, v1.4s
        fneg    v2.4s, v0.4s
        mov     v5.16b, v0.16b
        mov     v3.16b, v0.16b
        fmla    v2.4s, v1.4s, v4.4s
        fmls    v5.4s, v1.4s, v4.4s
        fmla    v3.4s, v1.4s, v4.4s
        fadd    v2.4s, v2.4s, v9.4s
        mov     v4.16b, v5.16b
        fadd    v2.4s, v2.4s, v8.4s
        fmla    v4.4s, v2.4s, v1.4s
        fmla    v3.4s, v2.4s, v1.4s
        fadd    v4.4s, v4.4s, v8.4s
        fmls    v3.4s, v4.4s, v1.4s
        fmul    v3.4s, v3.4s, v1.4s
        fadd    v0.4s, v3.4s, v0.4s
        str     q3, [x0, -16]
        fmls    v0.4s, v2.4s, v1.4s
        fmla    v0.4s, v4.4s, v1.4s
        fmul    v0.4s, v0.4s, v1.4s
        fsub    v5.4s, v3.4s, v0.4s
        str     q0, [x0, 4816]
        fsub    v0.4s, v0.4s, v3.4s
        mov     v3.16b, v5.16b
        fmla    v3.4s, v2.4s, v1.4s
        mov     v2.16b, v3.16b
        fmla    v2.4s, v4.4s, v1.4s
        fmul    v2.4s, v2.4s, v1.4s
        fadd    v0.4s, v0.4s, v2.4s
        str     q2, [x0, 9648]
        fmla    v0.4s, v4.4s, v1.4s
        fmul    v0.4s, v0.4s, v1.4s
        str     q0, [x0, 14480]
        cmp     x0, x22
        bne     .L11

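
The loop above loads from four arrays, combines them with adds, subtracts and multiplies by a constant held in v1, and stores four results, processing four single precision values (`.4s`) per instruction. For illustration only, the following C sketch shows a loop of that general shape, assuming the vector benchmark applies the classic Whetstone N1 arithmetic across arrays. It is not the benchmark source, and the names `VLEN`, `whet_n1_pass`, `x1` to `x4` and `t` are hypothetical.

```c
#include <stddef.h>

/* Default vector length quoted in the post; the macro name is hypothetical. */
#define VLEN 256

/* One pass of the classic Whetstone N1 arithmetic applied element by
   element across four arrays. Each element is independent of its
   neighbours, so the compiler is free to process several per instruction.
   The restrict qualifiers promise that the arrays do not overlap. */
void whet_n1_pass(float *restrict x1, float *restrict x2,
                  float *restrict x3, float *restrict x4,
                  float t, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        x1[i] = ( x1[i] + x2[i] + x3[i] - x4[i]) * t;
        x2[i] = ( x1[i] + x2[i] - x3[i] + x4[i]) * t;
        x3[i] = ( x1[i] - x2[i] + x3[i] + x4[i]) * t;
        x4[i] = (-x1[i] + x2[i] + x3[i] + x4[i]) * t;
    }
}
```

Compiling a file containing this function with `gcc -O3 -S` and reading the generated `.s` file is one way to check whether the loop was vectorized. The exact instruction sequence will depend on the compiler version and options.
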
Comparison With Old Supercomputers

The following are scalar and vector Whetstone benchmark results for the original supercomputers. In the 1980s these benchmarks provided a useful tool for confirming the choice of system for university work with multiple user access, typically with programs containing 90% vectorisable code. The choices then depended on scalar versus vector performance and on multiple processors versus multiple pipelines.

Pi 5 results are included and can look good on a per MHz basis. See the next page for comparisons, including for the benchmark originally used to validate performance of the first Cray 1 supercomputer.

Code:

                                                    Vector
                            Scalar        Vector    /Scalar
                     MHz  MWIPS MFLOPS  MWIPS MFLOPS MFLOPS   DATE
Cray 1                80   16.2    5.9     98     47    8.0   1978
CDC Cyber 205         50   11.9    4.9    161     57   11.7   1981
Cray XMP1            118   30.3   11.0    313    151   13.7   1982
Cray 2/1             244   25.8    N/A    425    N/A          1984
Amdahl VP 500   #    143   21.7    7.5    250    103   13.8   1984
Amdahl VP 1100  #    143   21.7    7.5    374    146   19.5   1984
Amdahl VP 1200  #    143   21.7    7.5    581    264   35.3   1984
IBM 3090-150 VP       54   12.1    4.9     60     17    3.6   1986
(CDC) ETA 10E         95   15.7    6.5    335    124   19.2   1987
Cray YMP1            154   31.0   12.0    449    195   16.3   1987
Fujitsu VP-2400/4    312   71.7   25.4   1828    794   31.3   1991
NEC SX-3/11          345   42.9   17.0   1106    441   25.9   1991
NEC SX-3/12          345   42.9   17.0   1667    753   44.3   1991
                # Fujitsu Systems
Raspberry Pi 5 SP   2400   5843   1206  10986   7599    6.3   2023
Raspberry Pi 5 DP   2400    N/A    N/A   9816   3731    3.1   2023
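
The per MHz comparison can be worked through by dividing each MFLOPS figure in the table by the clock speed. This is a minimal sketch of that arithmetic. The function name `mflops_per_mhz` is hypothetical.

```c
#include <assert.h>

/* MFLOPS delivered per MHz of clock: a rough measure of floating point
   results per clock cycle. Hypothetical helper, for illustration only. */
double mflops_per_mhz(double mflops, double mhz)
{
    assert(mhz > 0.0);
    return mflops / mhz;
}
```

With the table values, the Cray 1 gives about 0.07 scalar (5.9 / 80) and about 0.59 vector (47 / 80) MFLOPS per MHz, while the Pi 5 at single precision gives about 0.50 scalar (1206 / 2400) and about 3.17 vector (7599 / 2400).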

Statistics: Posted by RoyLongbottom — Wed Jan 17, 2024 4:59 pm


