Loop Transformations in the Polyhedron Model

2022-01-14

5 min read

1 Introduction

To understand the basic concepts of the polyhedron model, let’s start with a simple GEMM kernel:

for (i0 = 0; i0 < ni; i0++) {
    for (i1 = 0; i1 < nj; i1++)
	    C[i0][i1] *= beta;    /* statement S[i0, i1] */
    for (i1 = 0; i1 < nk; i1++)
        for (i2 = 0; i2 < nj; i2++)
	        C[i0][i2] += alpha * A[i0][i1] * B[i1][i2];   /* statement T[i0, i1, i2] */
}

Denote:

C[i0][i1] *= beta; as $S[i_0, i_1]$
C[i0][i2] += alpha * A[i0][i1] * B[i1][i2]; as $T[i_0, i_1, i_2]$

The schedule of the statements in the kernel can be expressed as: $$ S[i_0, i_1] \rightarrow [i_0, 0, i_1, 0] \newline T[i_0, i_1, i_2] \rightarrow [i_0, 1, i_1, i_2] $$

Literally translating from the schedule, there should be four dimensions of loop nests:

for (j0 = 0; j0 < ni; j0++) /* dim 0 */
    for (j1 = 0; j1 < 2; j1++)  /* dim 1 */
        if (j1 == 0)
            for (j2 = 0; j2 < nj; j2++) /* dim 2 */
                S[j0, j2]
        else if (j1 == 1)
            for (j2 = 0; j2 < nk; j2++) /* dim 2 */
                for (j3 = 0; j3 < nj; j3++) /* dim 3 */
                    T[j0, j2, j3]

Note that by mapping the dimension 2 to constants 0 and 1, we specify that all S’s are executed before T’s in dimension 1.

2 Structural Parameters (Constants)

In our example, the structural parameters are the loop bounds ni, nj and nk. A schedule will only consist of loop iterator variables (i0, i1), structural parameters and their linear combinations.

Compared to the previous schedule, the following schedule collapsed the constant dimension and added a structural parameter $n_j$ to the second dimension of T: $$ S[i_0, i_1] \rightarrow [i_0, i_1, 0] \newline T[i_0, i_1, i_2] \rightarrow [i_0, n_j + i_1, i_2] $$ Since $i_1 \in [0,n_j)$ for S, in the dimension 1, all T’s will be executed after the S’s. This means that it is actually the same schedule as above.

for (j0 = 0; j0 < ni; j0++) /* dim 0 */
    for (j1 = 0; j1 < nj + nk; j1++)  /* dim 1 */
        if (j1 < nj)
            S[j0, j1]
        else if (j1 >= nj)
            for (j2 = 0; j2 < nj; j2++) /* dim 2 */
                T[j0, j1 - nj, j2]

3 Swapping Dimensions

In the following schedule, we mapped $i_1$ to the third dimension instead of the second. As dimension 2 of S is a constant now, there is no need to add $n_j$. Adding $1$ is sufficient to make T be executed after S. $$ S[i_0, i_1] \rightarrow [i_0, 0, i_1] \newline T[i_0, i_1, i_2] \rightarrow [i_0, 1 + i_1, i_2] $$ Again, this is a same schedule.

for (j0 = 0; j0 < ni; j0++) /* dim 0 */
    for (j1 = 0; j1 < 1 + nk; j1++)  /* dim 1 */
        if (j1 < 1)
            for (j2 = 0; j2 < nj; j2++) /* dim 2 */
                S[j0, j2]
        else if (j1 >= 1)
            for (j2 = 0; j2 < nj; j2++) /* dim 2 */
                T[j0, j1 - 1, j2]

4 Multiplying by -1

In the following schedule, we multiplied $i_0$ by $-1$ in S and $-1$ in T. We also added a constant offset $-n_i$ to dimension 0 of S. We remove the redundant $+1$ in dimension 1 of T because all S’s are already scheduled before T’s in dimension 0. $$ S[i_0, i_1] \rightarrow [-n_i-i_0, 0, i_1] \newline T[i_0, i_1, i_2] \rightarrow [-i_0, i_1, i_2] $$ By multiplying an iteration variable by a negative coefficient, the execution order is reversed on the corresponding dimension.

for (j0 = - ni - ni + 1; j0 < 0; j0++) /* dim 0 */
    if (j0 <= -ni)
        for (j2 = 0; j2 < nj; j2++)         /* dim 2 */
            S[-ni - j0, j2]
    else if (j0 > -ni)
        for (j1 = 0; j1 < nk; j1++)     /* dim 1 */
            for (j2 = 0; j2 < nj; j2++)     /* dim 2 */
                T[-j0, j1, j2]

5 Interleaving Dimensions

Try adding $i_0$ to $i_1$ in dimension 2 of S: $$ S[i_0, i_1] \rightarrow [-n_i-i_0, 0, -i_0-5i_1] \newline T[i_0, i_1, i_2] \rightarrow [-i_0, i_1, i_2] $$ Although $i_0$ and $i_1$ are added together on dimension 2 of S, dimension 0 has separated different $i_0$’s. Therefore, dimension 2 solely depends on $i_1$.

for (j0 = - ni - ni + 1; j0 < 0; j0++) /* dim 0 */
    if (j0 <= -ni)
        for (j2 = -nj; j2 < 0; j2++)         /* dim 2 */
            S[-ni - j0, -j2 - 1]
    else if (j0 > -ni)
        for (j1 = 0; j1 < nk; j1++)     /* dim 1 */
            for (j2 = 0; j2 < nj; j2++)     /* dim 2 */
                T[-j0, j1, j2]

In fact, this is one of the fastest-running schedules discovered by random exploration on my computer.

However, things could be a bit different if: $$ S[i_0, i_1] \rightarrow [-n_i, 0, -i_0-5i_1] $$ Without limits in dimension 0, the dimensions of $i_0$ and $i_1$ are interleaved, allowing parallelism:

S[ni - 1, nj - 1]
S[ni - 2, nj - 1]
S[ni - 3, nj - 1]
S[ni - 4, nj - 1]
S[ni - 5, nj - 1]
S[ni - 6, nj - 1] / S[ni - 1, nj - 2]   // can be executed in parallel
S[ni - 7, nj - 1] / S[ni - 2, nj - 2]
S[ni - 8, nj - 1] / S[ni - 3, nj - 2]
S[ni - 9, nj - 1] / S[ni - 4, nj - 2]
S[ni - 10, nj - 1] / S[ni - 5, nj - 2]
S[ni - 11, nj - 1] / S[ni - 6, nj - 2] / S[ni - 1, nj - 3]
S[ni - 12, nj - 1] / S[ni - 7, nj - 2] / S[ni - 2, nj - 3]
...

6 Other Schedules with Good Performance

$$ \begin{aligned} S[i_0, i_1] \rightarrow [&2n_j-n_i, &0, &i_0, &2n_j+n_k-5i_1] \\ T[i_0, i_1, i_2] \rightarrow [&2n_j+2i_0, &2i_1, &i_2, &i_0] \end{aligned} $$

This is actually equivalent to:

$$ \begin{aligned} S[i_0, i_1] \rightarrow [&0, &0, &i_0, &-i_1] \\ T[i_0, i_1, i_2] \rightarrow [&1, &i_0, &i_1, &i_2] \end{aligned} $$ which is a very ordinary schedule similar to the original execution order.

Next Vanilla CPU

On This Page