







# Runtime Systems and Out-of-Core Cholesky Factorization on the Intel Xeon Phi System



香港中文大學 The Chinese University of Hong Kong



# **PROPOSED METHODOLOGY**

Performance Testing in seconds, GFLOPS, GLOPS/sec ("Giga Floating Operations Per Second")

1. Nested-For Loop Matrix Multiplication (MM) - QUARK

> Both Native and Offload Execution were taken into consideration

## **NESTED FOR-LOOP MATRIX MULTIPLICATION RESULTS**

I have modified example code from Dr. Asim YarKhan for a QUARK-multithreaded, tiled-routine matrix multiplication driver that will measure the performance in seconds and GFLOPS and print this data in a user-friendly manner to be used on

To generate GFLOPS/sec, under the assumption that C = A \* B where A,B,C are symmetric matrices (n by n), then the general formula would be: 109 \* time\_avg

|            |        |        | 10     | + cinc_uv |
|------------|--------|--------|--------|-----------|
|            | NB=100 | NB=250 | NB=500 | NB=1000   |
| 4 threads  | 13.46  | 12.56  | 12.17  | 9.01      |
| 8 threads  | 26.78  | 24.34  | 23.66  | 14.89     |
| 16 threads | 52.33  | 47.48  | 45.76  | 23.14     |
| 32 threads | 53.96  | 50.25  | 32.62  | 22.62     |
| 64 threads | 52.23  | 49.54  | 20.25  | 13.07     |
|            |        |        |        |           |
|            | NB=100 | NB=250 | NB=500 | NB=1000   |
| 4 threads  | 1.00   | 1.39   | 1.50   | 1.63      |
| 8 threads  | 1.70   | 2.16   | 2.22   | 2.22      |
| 16 threads | 3.03   | 3.70   | 3.54   | 3.36      |
| 32 threads | 5.52   | 6.48   | 5.78   | 4.94      |
| 64 threads | 9.97   | 11.38  | 9.49   | 6.68      |

 $\succ$  The general trend for the HOST shows optimal performance at 16 threads; though at smaller tile sizes, this threshold can be 32 threads.

- $\succ$  The general trend for the MIC shows that optimal performance can be attained at 64 threads, and the data proves to be scalable; however, the actual performance is significantly slower than that on the HOST.
- The performance is still poor (~50 GFLOPS/sec on HOST and ~10 GFLOPS/sec on MIC) but there is possibility for increased performance through offloading and

### DGEMM > PLASMA is installed as a module within Beacon, and a separate environment was installed on the HOST for comparison data The routine is optimized through a tiled routine similar to the QUARK MM. **MIC Environment Variables:** > OMP\_NUM\_THREADS: - In Beacon, each node has 4 MIC, each with 60 cores (MAX VALUE = 240). ➤ KMP\_AFFINITY: - Compact: Sequential Queuing - Balanced: Threads allocated evenly among cores Intel's MKL Library has been advertised to have its functions optimized (i.e., DGEMM = 833 GFLOPS/s); therefore, this test was recreated. The test was successful Given the maximum number of threads and setting the core 120 thread organization to balanced, 180 threads the results matched.

organizing in a compact manner.

**CODE GENERATING DAG&CODE USING QUARK** 

#### struct Label{long I;long J;long K;};

struct List{long node;label Node;char type;label in[3];label out[n-1];};

if((j>k)&&(i>j)) //dgemm type:(i,j,k),wherei>j>k

list[count].Node=assignlabel(i,j,k); list[count].node=(i+1+j\*n)+k\*n\*n; list[count].type='M'; fprintf(fp,"%ld[label=\"(%ld,%ld,%ld)|GEMM\",color=forestgreen];\n",list[count].node,i,j,k); //assign node atrributes like label,color and so on for(q=0;q<3;q++)//Traverse the in-nodes and specify the data dependencies by edges

if (!((list[count].in[q].I==-1)||(list[count].in[q].J==-1)||(list[count].in[q].K==-1))) fprintf(fp,"%ld->%ld;",(list[count].in[q].I+1+list[count].in[q]. J\*n+list[count].in[q].K\*n\*n), list[count].node); fprintf(fp,"{rank=same;depth%ld %ld}\n",(3\*k+3),list[count].node); //mark the depth

void CORE\_dgemm\_quark(Quark \*quark); //body omitted void QUARK\_CORE\_dgemm(Quark \*quark, Quark\_Task\_Flags \*task\_flags, PLASMA\_enum transA, PLASMA enum transB, int m, int n, int k, int nb, double alpha, const double \*A, int Ida,const double \*B, int Idb,double beta, double \*C, int Idc); //body omitted

if((j>k)&&(i>j)) //dgemm type:(i,j,k),wherei>j>k\*

Quark\_Task\_Flags tflags=Quark\_Task\_Flags\_Initializer; //initailize the task QUARK\_Task\_Flag\_Set(&tflags,TASK\_PRIORITY,1); //set task attributes like priority

QUARK\_CORE\_dgemm(quark,&tflags,CblasNoTrans,CblasTrans,NB,NB,NB,NB,-1.0,&A2(0,0, i,k),NB,&A2(0,0,j,k),NB,1.0,&A2(0,0,i,j),NB); // pass the arguments ,where data dependencies are implied continue;

# **EXPECTED GOALS**

#### Runtime Systems

Optimize QUARK implementations (matrix multiplication, DGEMM) with additional OpenMP and Offloading directives to produce better performance.

Incorporate the OOC Cholesky Factorization into QUARK and implement onto Beacon.

OOC Cholesky Factorization:

Complete the code combining OOC algorithm and general Cholesky factorization.

Extend to multiple MPI processes case.

Extend to LU factorization with pivoting and QR factorization.

REFERENCES

Betro, Vincent. <u>Beacon Quickstart Guide at AACE/NICS</u>

Betro, Vincent. Beacon Training: Using the Intel Many Integrate Core (MIC) Architecture: Native Mode and Intel MPI. March 2013

D'Azevedo, Eduardo, Shiquan Su, and Kwai Wong. <u>A Performance Study of Solving</u> <u>a Large Dense Matrix for Radiation Heat Transfer.</u>

Intel. https://software.intel.com/en-us/mic-developer

YarKhan, Asim. <u>Dynamic Task Execution on Shared and Distributed Memory</u> Architectures. Dec. 2012.

YarKhan, Asim, Jakub Kurzak, and Jack Dongarra. *QUARK Users' Guide*. April 2011 Images are provided by Google Images, their respective websites, or generated using software

**TEAM INFO** 

• Authors: Allan Richmond Razon Morales and Tian Chong • Mentors: Dr. Kwai Wong and Dr. Eduardo D'Azevedo • Collaborators: Dr. Shiquan Su, Dr. Asim YarKhan, and Ben Chan