# EXTENDING MAGMAPORTABILITY

Final Presentation

Anna Fortenberry, UNT UTK RECSEM REU 2022

Mentors: Dr. Stan Tomov, UTK and Dr. Kwai Wong, UTK

# CONTENTS

- Problem Overview
- Software and Hardware
- Methodology
- CUDA to DPC++ Translation
- Porting MAGMA SGEMM
- Hardware Usage
- Performance
- Conclusion

1 PROBLEM OVERVIEW

# SIGNIFICANCE OF SUPERCOMPUTING

 Supercomputers provide the computational power necessary to resolve problems in a vast number of important domains



# **EVOLUTION OF SUPERCOMPUTER SYSTEM DESIGN**

- NVIDIA opened a new door for supercomputing (SC) capabilities with the invention of the GPU in 1999
- NVIDIA Tesla K20X GPU powered the first successful hybrid SC system in 2012
- SC Systems are continually increasing in diversity

[4], [5]

| TOP500 The List    |                       |                   |                  |  |  |  |
|--------------------|-----------------------|-------------------|------------------|--|--|--|
| JUNE 2022          | CPU/ Accelerator      | JUNE 2019         | CPU/ Accelerator |  |  |  |
| Frontier           | AMD, AMD              | Summit            | IBM, NVIDIA      |  |  |  |
| S.C. Fugaku        | Fugaku                | Sierra            | IBM, NVIDIA      |  |  |  |
| LUMI               | AMD, AMD              | Sunway TaihuLight | Sunway           |  |  |  |
| Summit             | IBM, NVIDIA           | Tianhe-2A         | Intel            |  |  |  |
| Sierra             | IBM, NVIDIA           | Frontera          | Intel            |  |  |  |
| Sunway TaihuLight  | Sunway                | Piz Daint         | Intel, NVIDIA    |  |  |  |
| Perlmutter         | AMD, NVIDIA           | Trinity           | Intel            |  |  |  |
| Selene AMD, NVIDIA |                       | ABCI              | Intel, NVIDIA    |  |  |  |
| Tianhe-2A          | Tianhe-2A Intel, NUDT |                   | Intel            |  |  |  |
| Adastra            | AMD, AMD              | Lassen            | IBM, NVIDIA      |  |  |  |

# FIRST INTEL GPU POWERED SUPERCOMPUTER



Anticipated for release in late 2022, Intel hopes to enter the supercomputer GPU vendor domain by powering the Aurora supercomputer at Argonne National Laboratory

[8]

### INTEL ONEAPI

- Intel recently released a new programming model called oneAPI
- Applications that take advantage of oneAPI gain portability to all supported hardware platforms
  - CPUs (Scalar Architecture)
  - GPUs (Vector Architecture)
  - FPGAs (Spatial Architecture)
  - Other Accelerators (Matrix Architecture)

# MAGMA STRUCTURE



- Designed originally to run on NVIDIA GPUs
- Extended to support AMD GPUs



- oneAPI includes tools for adopting the model
  - Data Parallel C++ (DPC++) Translation Tool (DPCT)
  - oneAPI Math Kernel Library (oneMKL)

# RESEARCH QUESTIONS

- How well does the DPCT tool translate CUDA code to DPC++ code?
- What are the common translation errors?
- Can this tool be used to translate MAGMA?
- Is DPC++ portable to Nvidia and AMD GPUs, and multicore CPUs?
- What is the performance of DPC++ on each of these accelerators comparative to CUDA?

# 2 SOFTWARE AND HARDWARE

### **OPTIMIZED APPLICATIONS**

### **OPTIMIZED MIDDLEWARE & FRAMEWORKS**

DIRECT PROGRAMMING Data Parallel C++ (DPC++) API-BASED PROGRAMMING oneAPI Libraries

Analysis & Debug Tools









- DPC++ is a oneAPI implementation of the Khronos standard SYCL
- SYCL is an accelerator language that allows code reuse across hardware targets
- SYCL adds data
   parallelism and
   heterogeneous
   programming to
   standard ISO C++

# SOFTWARE OVERVIEW

# DPC++ Compatibility Tool (DPCT)

oneAPI tool to assist with migrating CUDA code to DPC++ code; translates with high accuracy

# DPC++-LLVM (CLang-LLVM)

LLVM-based compiler project that supports SYCL language

# oneAPI Math Kernel (oneMKL)

set of math routines for use in high performance computing on a variety of computational devices

# **DPC++ LLVM NVIDIA\***

CLANG-LLVM build on Linux with CUDA NVIDIA support; allows DPC++ to port to NVIDIA GPUs

# Compute Unified Device Architecture (CUDA)

NVIDIA parallel computing platform for harnessing power of GPUs

# Intel DevCloud

Remote development environments that grant access to Intel hardware for testing oneAPI projects\*

# CENTRAL PROCESSING UNITS

AMD EPYC 7742 PROCESSOR

INTEL® XEON® PROCESSOR E5-2698 V4

Cores: 64 Cores: 20

Base Clock: 2.25 Ghz Base Clock: 2.20 Ghz

# of Threads: 128 # of Threads: 40

Cache: 256 MB Cache: 50 MB

**GRAPHICS PROCESSING UNITS** 

**NVIDIA GEFORCE RTX 3060 INTEL UHD GRAPHICS P630** (Discrete)

[0x3e96] (Integrated)

GPU Cores: 3584

Base Clock: 1320 MHz

Memory Size: 12 GB **GPU Cores:** 192

Base Clock: 350 MHz

Memory Size: Shared System

# 3 METHODOLOGY



structures of CUDA files to DPC++ with DPCT for correctness

Configure system to run DPC++ code on Nvidia GPU

Set up directory with MAGMA CUDA sgemm and dependencies

Test and compare performance of sgemm on available hardware

4 CUDA TO DPC++ TRANSLATION

# SIMPLE KERNEL TRANSLATION

```
__global__ void VectorAddKernel(float* A, float* B, float* C)
     A[threadIdx.x] = threadIdx.x + 1.0f;
     B[threadIdx.x] = threadIdx.x + 1.0f;
     C[threadIdx.x] = A[threadIdx.x] + B[threadIdx.x];
void VectorAddKernel(float* A, float* B, float* C, sycl::nd_item<3> item_ct1)
     A[item_ct1.get_local_id(2)] = item_ct1.get_local_id(2) + 1.0f;
     B[item_ct1.get_local_id(2)] = item_ct1.get_local_id(2) + 1.0f;
     C[item_ct1.get_local_id(2)] =
     A[item_ct1.get_local_id(2)] + B[item_ct1.get_local_id(2)];
```

# TEST 1: ISOLATED FILE

- Translated files for CUDA vector addition and vector-matrix multiplication
- 100% compilation and execution accuracy
- CUDA error handling dead code clean up for file readability

# TEST 2: FILE WITH HEADERS

- Matrix-matrix multiplication file with six included headers
- 98.7% compilation accuracy and 98.0% execution accuracy in the main file
- 10% of the code needed dead code touchups
- Header files had 100% compilation accuracy and execution accuracies ranging from 75%-100%

```
cudaGetDeviceCount(&device count);
                                        device count =
                                        dpct::dev mgr::instance().device count()
while (current device < device count) while (current device < device count)</pre>
    cudaGetDeviceProperties
                                             dpct::dev mgr::instance()
       (&deviceProp, current device);
                                              .get device(current device)
                                              .get device info(deviceProp);
    if (deviceProp.computeMode !=
                                            if (true)
         cudaComputeModeProhibited)
    else {
                                             else {
         devices prohibited++;
                                                 devices prohibited++;
     current device++;
                                             current device++;
```

# 5 PORTING MAGMA SGEMM

# C = A B

template < DIM\_X, DIM\_Y, DIM\_M, DIM\_N, DIM\_K, DIM\_XA, DIM\_YA, DIM\_XB, DIM\_YB>

For I = 1 .. M step DIM\_M

For J = 1 .. N step DIM\_N

For K = 1 .. K step DIM\_K

C<sub>I,J</sub> += A<sub>IK</sub> B<sub>K,J</sub>







- Implementation is templated with 9 parameters
- Computation is done with thread blocks of size[ DIM\_X , DIM\_Y ]
- Thread t<sub>ij</sub> computes
   [ DIM\_M / DIM\_X, DIM\_N / DIM\_Y ] elements of C<sub>IJ</sub>
- A<sub>IK</sub> gets loaded in shared memory by [ DIM\_XA , DIM\_YA ] threads
- B<sub>KJ</sub> gets loaded in shared memory by [ DIM\_XB , DIM\_YB ] threads
- C<sub>IJ</sub> is held and computed in registers

# MAGMA SGEMM TRANSLATION PROCESS

- Collected MAGMA SGEMM CUDA code and dependencies in one directory
- Used DPCT to recursively migrate CUDA code to DPC++
- Translated header files that did not migrate independently in a separate directory and then copied them into the MAGMA SGEMM directory
- Implemented compiler directives as needed

# 6 HARDWARE USAGE

# **MULTICORE CPUS**

```
user1@REU1901-HP-Z800-Workstation: ~/anna/mtxMtxMulCnvt/one/dp...
                                                                    user1@REU1901-HP-Z800-Workstation: ~/anna/mtxMtxMulCnvt/one/dp...
                                                                                       1111100.0%
                                                                                        100.0%
                                                                    Tasks: 207, 1038 thr; 12 running
                                                                    Uptime: 23 days, 05:45:41
    PID USER
                       NI
                           VIRT
                                   RES
                                         SHR S CPU% MEM%
                                                            TIME+
                                                          15h41:40 ./intelCpuExec -wA=8192 -wB=8192 -hA=8192 -hB=8192
3602527 user1
                   20
                        0
                                        267M R 99.4
                                                     3.4
                                                             18:57 ./intelCpuExec -wA=8192 -wB=8192 -hA=8192 -hB=8192
                                                             18:57 ./intelCpuExec -wA=8192 -wB=8192 -hA=8192 -hB=8192
3602528 user1
                   20
                        0
                                        267M R 99.4
                                                     3.4
                                                             19:05 ./intelCpuExec -wA=8192 -wB=8192 -hA=8192 -hB=8192
3602523 user1
                   20
                        0
                                        267M R 98.8
                                                     3.4
                                                             18:44 ./intelCpuExec -wA=8192 -wB=8192 -hA=8192 -hB=8192
3602526 user1
                   20
                        0
                                        267M R 98.8
                                                     3.4
                                                             16:11 ./intelCpuExec -wA=8192 -wB=8192 -hA=8192 -hB=8192
3602515 user1
                   20
                                        267M R 95.5
                                                     3.4
                                                             18:58 ./intelCpuExec -wA=8192 -wB=8192 -hA=8192 -hB=8192
3602521 user1
                   20
                                               99.4
3602519 user1
                   20
                        0
                                        267M R 98.8
                                                     3.4
                                                             18:47 ./intelCpuExec -wA=8192 -wB=8192 -hA=8192 -hB=8192
                                                             19:03 ./intelCpuExec -wA=8192 -wB=8192 -hA=8192 -hB=8192
3602534 user1
                   20
                                        267M R 98.8
                                                     3.4
3602518 user1
                   20
                                        267M R 96.2
                                                             18:50 ./intelCpuExec -wA=8192 -wB=8192 -hA=8192 -hB=8192
3602516 user1
                   20
                                        267M R 97.5
                                                     3.4
                                                             18:09 ./intelCpuExec -wA=8192 -wB=8192 -hA=8192 -hB=8192
3602525 user1
                   20
                                        267M R 99.4
                                                     3.4
                                                             18:23 ./intelCpuExec -wA=8192 -wB=8192 -hA=8192 -hB=8192
3603925 user1
                   20
                        0 11708
                                 5128
                                        3220 R
                                                2.6
                                                     0.0
                                                          0:16.55 htop
```

### MULTICORE CPUS



# NVIDIA GeForce GTX 1650

| i Jul                                | 8 10                                          | :16:5                              | 8 2022                                 |                      |                                                                                       |                            |                                                                      |
|--------------------------------------|-----------------------------------------------|------------------------------------|----------------------------------------|----------------------|---------------------------------------------------------------------------------------|----------------------------|----------------------------------------------------------------------|
| NVIDI                                | A-SMI                                         | 470.1                              | 29.06 Dri                              | iver Ve              | ersion: 470.129.06 (                                                                  | CUDA Versio                | n: 11.4                                                              |
|                                      | Name<br>Temp                                  | Perf                               |                                        |                      | Bus-Id Disp.A<br>Memory-Usage                                                         |                            | Uncorr. ECC<br>Compute M.<br>MIG M.                                  |
| 0 I<br>35%                           | =====<br>NVIDIA<br>63C                        | GeFo<br>P3                         | ========<br>rce Of<br>60W / 10         |                      | ======================================                                                | 100%                       | N/A<br>N/A<br>Default                                                |
|                                      |                                               |                                    |                                        | 1                    |                                                                                       |                            | N/A                                                                  |
| Proce                                | <br><br>sses:                                 |                                    |                                        |                      |                                                                                       |                            | N/A                                                                  |
|                                      | sses:<br>GI<br>ID                             | CI<br>ID                           | PID                                    | l<br>····+··<br>Type | Process name                                                                          |                            |                                                                      |
| Proces                               | GI<br>ID                                      | ID                                 |                                        |                      |                                                                                       |                            | GPU Memory<br>Usage                                                  |
| Proces                               | GI<br>ID<br>=====<br>N/A                      | ID<br>=====<br>N/A                 | ========<br>1239                       | <br>G                | <br>/usr/lib/xorg/Xorg                                                                |                            | GPU Memory<br>Usage<br>====================================          |
| Proces                               | GI<br>ID<br>=====<br>N/A<br>N/A               | ID<br>=====<br>N/A<br>N/A          | ====================================== | G<br>G               | /usr/lib/xorg/Xorg<br>/usr/lib/xorg/Xorg                                              |                            | GPU Memory<br>Usage<br>====================================          |
| Proces<br>GPU                        | GI<br>ID<br>=====<br>N/A<br>N/A<br>N/A        | ID<br>N/A<br>N/A<br>N/A            | ====================================== | <br>G                | /usr/lib/xorg/Xorg<br>/usr/lib/xorg/Xorg<br>/usr/lib/xorg/Xorg<br>/usr/bin/gnome-she  | ıı                         | GPU Memory<br>Usage<br>========<br>23MiB<br>241MiB<br>25MiB          |
| <br>Proces<br>GPU<br>=====<br>0<br>0 | GI<br>ID<br>=====<br>N/A<br>N/A<br>N/A<br>N/A | ID<br><br>N/A<br>N/A<br>N/A<br>N/A | ====================================== | <br>G<br>G           | <br>/usr/lib/xorg/Xorg<br>/usr/lib/xorg/Xorg<br>/usr/bin/gnome-she<br>RendererForSite | ll<br>PerProcess           | GPU Memory                                                           |
| <br>Proces<br>GPU<br><br>0<br>0<br>0 | GI<br>ID<br>=====<br>N/A<br>N/A<br>N/A        | ID<br>N/A<br>N/A<br>N/A            | ====================================== | G<br>G<br>G          | /usr/lib/xorg/Xorg<br>/usr/lib/xorg/Xorg<br>/usr/lib/xorg/Xorg<br>/usr/bin/gnome-she  | ll<br>PerProcess<br>irefox | GPU Memory<br>Usage<br>========<br>23MiB<br>241MiB<br>25MiB<br>13MiB |

# 7 PERFORMANCE

# TEST PARAMETERS

# cuda =

- -DMAGMA\_TUNING
- -DDIM X=16
- -DDIM\_Y=16
- -DBLK M nn=96
- -DBLK\_N\_nn=96
- -DBLK\_K\_nn=16
- -DDIM XA=32
- -DDIM\_YA=8
- -DDIM XB=8
- -DDIM\_YB=32



template < 16, 16, 96, 96, 16, 32, 8, 8, 32>







- Thread  $t_{ii}$  computes [ 96 / 16 , 96 / 16 ] elements of  $C_{IJ}$
- A<sub>IK</sub> gets loaded in shared memory by [ 32, 8 ] threads
- B<sub>KJ</sub> gets loaded in shared memory by [8, 32] threads
- C<sub>IJ</sub> is held and computed in registers

# AMD EPYC 7742 64-CORE PROCESSOR @ 2.25GHZ



# INTEL® XEON® CPU E5-2698 V4 20-CORE PROCESSOR @ 2.20GHZ



# **NVIDIA GEFORCE RTX 3060**



# INTEL UHD GRAPHICS P630 [0x3e96]



# ADDITIONAL TEST PARAMETERS

|       | DIM_X | DIM_Y | DIM_M | DIM_N | DIM_K | DIM_XA | DIM_YA | DIM_XB | DIM_YB |
|-------|-------|-------|-------|-------|-------|--------|--------|--------|--------|
| cuda  | 16    | 16    | 96    | 96    | 16    | 32     | 8      | 8      | 32     |
| ker2  | 16    | 16    | 64    | 64    | 8     | 32     | 8      | 8      | 32     |
| ker11 | 12    | 4     | 48    | 48    | 2     | 24     | 2      | 24     | 2      |

# INTEL UHD GRAPHICS P630 [0x3e96]



# 8 CONCLUSION

# **SUMMARY**

- oneAPI is a promising approach for parallel programming across various architectures
- DPCT tool can be used successfully for an initial port of CUDA code to DPC++
- Large numerical libraries like MAGMA, originally written in CUDA to support Nvidia GPUs, can be easily translated to DPC++ to provide functional portability to different vendor GPUs, as well as multicore CPUs

### **SUMMARY**

- Initial migrated code tuned for Nvidia GPUs performs well on multicore CPUs
- Initial migrated code tuned for Nvidia GPUs retains performance on Nvidia GPUs
- Initial migrated code tuned for Nvidia GPUs performs poorly on the available Intel GPU
  - Tuning is required, but optimal parameters are difficult to find without further knowledge on the hardware design

# ONGOING AND FUTURE WORKS

- Full translation of MAGMA
- ICL account configuration
- Finding near optimal parameters for the Intel integrated GPU
- Testing migrated code on discrete Intel GPU upon release

# **ACKNOWLEDGEMENTS**



This research was funded by the National Science Foundation.



This research was conducted at the University of Tennessee at Knoxville through the RECSEM REU.



Home University
University of North
Texas

- [1] Advancing computing and data capabilities for scientific discovery and continued U.S. technological leadership. Oak Ridge National Lab. <a href="https://www.ornl.gov/directorate/ccsd">https://www.ornl.gov/directorate/ccsd</a>
- [2] <a href="https://thenounproject.com/search/icons/?iconspage=1&q=quantum">https://thenounproject.com/search/icons/?iconspage=1&q=quantum</a>
- [3] Computing at LLNL. Lawrence Livermore National Laboratory. <a href="https://computing.llnl.gov/">https://computing.llnl.gov/</a>
- [4] NVIDIA HISTORY. Nvidia. <a href="https://www.nvidia.com/en-us/about-nvidia/corporate-timeline/">https://www.nvidia.com/en-us/about-nvidia/corporate-timeline/</a>
- [5] New Titan Supercomputer Named Fastest in the World. Department of Energy.

  <a href="https://www.energy.gov/articles/new-titan-supercomputer-named-fastest-world-0">https://www.energy.gov/articles/new-titan-supercomputer-named-fastest-world-0</a>
- [6] June 2019. The Top 500 List. <a href="https://www.top500.org/lists/top500/2019/06/">https://www.top500.org/lists/top500/2019/06/</a>
- [7] June 2022. The Top 500 List. <a href="https://www.top500.org/lists/top500/2022/06/">https://www.top500.org/lists/top500/2022/06/</a>

- [8] Aurora: HPC and AI at Exascale. Intel.

  <a href="https://www.intel.com/content/www/us/en/high-performance-computing/supercomputing/exascale-computing.html">https://www.intel.com/content/www/us/en/high-performance-computing/supercomputing/exascale-computing.html</a>
- [9] Compare Benefits of CPUs, GPUs, and FPGAs for Different oneAPI Compute Workloads. Intel.

  <a href="https://www.intel.com/content/www/us/en/developer/articles/technical/comparing-cpus-apus-and-fpgas-for-oneapi.html#gs.83gstn">https://www.intel.com/content/www/us/en/developer/articles/technical/comparing-cpus-apus-and-fpgas-for-oneapi.html#gs.83gstn</a>
- [10] Intel oneAPI Programming Overview. Intel.

  <a href="https://www.intel.com/content/www/us/en/develop/documentation/onea-pi-programming-quide/top/introduction-to-oneapi-programming/intel-on-eapi-programming-overview.html">https://www.intel.com/content/www/us/en/develop/documentation/onea-pi-programming-quide/top/introduction-to-oneapi-programming/intel-on-eapi-programming-overview.html</a>
- [11] Data Parallel C++: the oneAPI Implementation of SYCL\*. Intel. https://www.intel.com/content/www/us/en/developer/tools/oneapi/data-parallel-c-plus-plus.html#gs.83xmmq

- [12] Intel® DPC++ Compatibility Tool. Intel.

  <a href="https://www.intel.com/content/www/us/en/developer/tools/oneapi/dpc-c\_ompatibility-tool.html#gs.83zp77">https://www.intel.com/content/www/us/en/developer/tools/oneapi/dpc-c\_ompatibility-tool.html#gs.83zp77</a>
  - [13] oneMKL. Intel.

    <a href="https://spec.oneapi.io/versions/latest/elements/oneMKL/source/index.ht">https://spec.oneapi.io/versions/latest/elements/oneMKL/source/index.ht</a>
    <a href="mailto:ml">ml</a>
  - [14] What Is CUDA? NVIDIA. <a href="https://blogs.nvidia.com/blog/2012/09/10/what-is-cuda-2/">https://blogs.nvidia.com/blog/2012/09/10/what-is-cuda-2/</a>
  - [15] Compiling SYCL\* for Different GPUs. Intel.

    <a href="https://www.intel.com/content/www/us/en/developer/articles/technical/compiling-sycl-with-different-gpus.html">https://www.intel.com/content/www/us/en/developer/articles/technical/compiling-sycl-with-different-gpus.html</a>
  - [16] AMD EPYC™ 7742. AMD. https://www.amd.com/en/products/cpu/amd-epyc-7742

- [17] Intel® Xeon® Processor E5-2698 v4. Intel. https://ark.intel.com/content/www/us/en/ark/products/91753/intel-xeon-processor-e52698-v4-50m-cache-2-20-ghz.html
- [18] GEFORCE RTX 3060 FAMILY. Nvidia. https://www.nvidia.com/en-us/geforce/graphics-cards/30-series/rtx-3060 -3060ti/
- [19] Intel UHD Graphics P630. TechPowerUp. https://www.techpowerup.com/gpu-specs/uhd-graphics-p630.c3676
- [20] Intel® DevCloud. Intel® DevCloud

### Presentation Template:

Catalina, J. (n.d.). Minimal business. Free PowerPoint Template & amp; Google Slides theme. SlidesCarnival. Retrieved July 5, 2022, from <a href="https://www.slidescarnival.com/eleanor-free-presentation-template/308#preview">https://www.slidescarnival.com/eleanor-free-presentation-template/308#preview</a>