Loop Transformations in the Polyhedron Model

2022-01-14

1 Introduction

To understand the basic concepts of the polyhedron model, let’s start with a simple GEMM kernel:

for (i0 = 0; i0 < ni; i0++) {
for (i1 = 0; i1 < nj; i1++)
C[i0][i1] *= beta;    /* statement S[i0, i1] */
for (i1 = 0; i1 < nk; i1++)
for (i2 = 0; i2 < nj; i2++)
C[i0][i2] += alpha * A[i0][i1] * B[i1][i2];   /* statement T[i0, i1, i2] */
}


Denote:

• C[i0][i1] *= beta; as $S[i_0, i_1]$
• C[i0][i2] += alpha * A[i0][i1] * B[i1][i2]; as $T[i_0, i_1, i_2]$

The schedule of the statements in the kernel can be expressed as: $$S[i_0, i_1] \rightarrow [i_0, 0, i_1, 0] \newline T[i_0, i_1, i_2] \rightarrow [i_0, 1, i_1, i_2]$$

Literally translating from the schedule, there should be four dimensions of loop nests:

for (j0 = 0; j0 < ni; j0++) /* dim 0 */
for (j1 = 0; j1 < 2; j1++)  /* dim 1 */
if (j1 == 0)
for (j2 = 0; j2 < nj; j2++) /* dim 2 */
S[j0, j2]
else if (j1 == 1)
for (j2 = 0; j2 < nk; j2++) /* dim 2 */
for (j3 = 0; j3 < nj; j3++) /* dim 3 */
T[j0, j2, j3]


Note that by mapping the dimension 2 to constants 0 and 1, we specify that all S’s are executed before T’s in dimension 1.

2 Structural Parameters (Constants)

In our example, the structural parameters are the loop bounds ni, nj and nk. A schedule will only consist of loop iterator variables (i0, i1), structural parameters and their linear combinations.

Compared to the previous schedule, the following schedule collapsed the constant dimension and added a structural parameter $n_j$ to the second dimension of T: $$S[i_0, i_1] \rightarrow [i_0, i_1, 0] \newline T[i_0, i_1, i_2] \rightarrow [i_0, n_j + i_1, i_2]$$ Since $i_1 \in [0,n_j)$ for S, in the dimension 1, all T’s will be executed after the S’s. This means that it is actually the same schedule as above.

for (j0 = 0; j0 < ni; j0++) /* dim 0 */
for (j1 = 0; j1 < nj + nk; j1++)  /* dim 1 */
if (j1 < nj)
S[j0, j1]
else if (j1 >= nj)
for (j2 = 0; j2 < nj; j2++) /* dim 2 */
T[j0, j1 - nj, j2]



3 Swapping Dimensions

In the following schedule, we mapped $i_1$ to the third dimension instead of the second. As dimension 2 of S is a constant now, there is no need to add $n_j$. Adding $1$ is sufficient to make T be executed after S. $$S[i_0, i_1] \rightarrow [i_0, 0, i_1] \newline T[i_0, i_1, i_2] \rightarrow [i_0, 1 + i_1, i_2]$$ Again, this is a same schedule.

for (j0 = 0; j0 < ni; j0++) /* dim 0 */
for (j1 = 0; j1 < 1 + nk; j1++)  /* dim 1 */
if (j1 < 1)
for (j2 = 0; j2 < nj; j2++) /* dim 2 */
S[j0, j2]
else if (j1 >= 1)
for (j2 = 0; j2 < nj; j2++) /* dim 2 */
T[j0, j1 - 1, j2]


4 Multiplying by -1

In the following schedule, we multiplied $i_0$ by $-1$ in S and $-1$ in T. We also added a constant offset $-n_i$ to dimension 0 of S. We remove the redundant $+1$ in dimension 1 of T because all S’s are already scheduled before T’s in dimension 0. $$S[i_0, i_1] \rightarrow [-n_i-i_0, 0, i_1] \newline T[i_0, i_1, i_2] \rightarrow [-i_0, i_1, i_2]$$ By multiplying an iteration variable by a negative coefficient, the execution order is reversed on the corresponding dimension.

for (j0 = - ni - ni + 1; j0 < 0; j0++) /* dim 0 */
if (j0 <= -ni)
for (j2 = 0; j2 < nj; j2++)         /* dim 2 */
S[-ni - j0, j2]
else if (j0 > -ni)
for (j1 = 0; j1 < nk; j1++)     /* dim 1 */
for (j2 = 0; j2 < nj; j2++)     /* dim 2 */
T[-j0, j1, j2]


5 Interleaving Dimensions

Try adding $i_0$ to $i_1$ in dimension 2 of S: $$S[i_0, i_1] \rightarrow [-n_i-i_0, 0, -i_0-5i_1] \newline T[i_0, i_1, i_2] \rightarrow [-i_0, i_1, i_2]$$ Although $i_0$ and $i_1$ are added together on dimension 2 of S, dimension 0 has separated different $i_0$’s. Therefore, dimension 2 solely depends on $i_1$.

for (j0 = - ni - ni + 1; j0 < 0; j0++) /* dim 0 */
if (j0 <= -ni)
for (j2 = -nj; j2 < 0; j2++)         /* dim 2 */
S[-ni - j0, -j2 - 1]
else if (j0 > -ni)
for (j1 = 0; j1 < nk; j1++)     /* dim 1 */
for (j2 = 0; j2 < nj; j2++)     /* dim 2 */
T[-j0, j1, j2]


In fact, this is one of the fastest-running schedules discovered by random exploration on my computer.

However, things could be a bit different if: $$S[i_0, i_1] \rightarrow [-n_i, 0, -i_0-5i_1]$$ Without limits in dimension 0, the dimensions of $i_0$ and $i_1$ are interleaved, allowing parallelism:

S[ni - 1, nj - 1]
S[ni - 2, nj - 1]
S[ni - 3, nj - 1]
S[ni - 4, nj - 1]
S[ni - 5, nj - 1]
S[ni - 6, nj - 1] / S[ni - 1, nj - 2]   // can be executed in parallel
S[ni - 7, nj - 1] / S[ni - 2, nj - 2]
S[ni - 8, nj - 1] / S[ni - 3, nj - 2]
S[ni - 9, nj - 1] / S[ni - 4, nj - 2]
S[ni - 10, nj - 1] / S[ni - 5, nj - 2]
S[ni - 11, nj - 1] / S[ni - 6, nj - 2] / S[ni - 1, nj - 3]
S[ni - 12, nj - 1] / S[ni - 7, nj - 2] / S[ni - 2, nj - 3]
...


6 Other Schedules with Good Performance

\begin{aligned} S[i_0, i_1] \rightarrow [&2n_j-n_i, &0, &i_0, &2n_j+n_k-5i_1] \\ T[i_0, i_1, i_2] \rightarrow [&2n_j+2i_0, &2i_1, &i_2, &i_0] \end{aligned}
This is actually equivalent to:
\begin{aligned} S[i_0, i_1] \rightarrow [&0, &0, &i_0, &-i_1] \\ T[i_0, i_1, i_2] \rightarrow [&1, &i_0, &i_1, &i_2] \end{aligned} which is a very ordinary schedule similar to the original execution order.