1 Introduction
To understand the basic concepts of the polyhedron model, let’s start with a simple GEMM kernel:
for (i0 = 0; i0 < ni; i0++) {
for (i1 = 0; i1 < nj; i1++)
C[i0][i1] *= beta; /* statement S[i0, i1] */
for (i1 = 0; i1 < nk; i1++)
for (i2 = 0; i2 < nj; i2++)
C[i0][i2] += alpha * A[i0][i1] * B[i1][i2]; /* statement T[i0, i1, i2] */
}
Denote:
C[i0][i1] *= beta;
as $S[i_0, i_1]$C[i0][i2] += alpha * A[i0][i1] * B[i1][i2];
as $T[i_0, i_1, i_2]$
The schedule of the statements in the kernel can be expressed as: $$ S[i_0, i_1] \rightarrow [i_0, 0, i_1, 0] \newline T[i_0, i_1, i_2] \rightarrow [i_0, 1, i_1, i_2] $$
Literally translating from the schedule, there should be four dimensions of loop nests:
for (j0 = 0; j0 < ni; j0++) /* dim 0 */
for (j1 = 0; j1 < 2; j1++) /* dim 1 */
if (j1 == 0)
for (j2 = 0; j2 < nj; j2++) /* dim 2 */
S[j0, j2]
else if (j1 == 1)
for (j2 = 0; j2 < nk; j2++) /* dim 2 */
for (j3 = 0; j3 < nj; j3++) /* dim 3 */
T[j0, j2, j3]
Note that by mapping the dimension 2 to constants 0 and 1, we specify that all S’s are executed before T’s in dimension 1.
2 Structural Parameters (Constants)
In our example, the structural parameters are the loop bounds ni
, nj
and nk
. A schedule will only consist of loop iterator variables (i0
, i1
), structural parameters and their linear combinations.
Compared to the previous schedule, the following schedule collapsed the constant dimension and added a structural parameter $n_j$ to the second dimension of T: $$ S[i_0, i_1] \rightarrow [i_0, i_1, 0] \newline T[i_0, i_1, i_2] \rightarrow [i_0, n_j + i_1, i_2] $$ Since $i_1 \in [0,n_j)$ for S, in the dimension 1, all T’s will be executed after the S’s. This means that it is actually the same schedule as above.
for (j0 = 0; j0 < ni; j0++) /* dim 0 */
for (j1 = 0; j1 < nj + nk; j1++) /* dim 1 */
if (j1 < nj)
S[j0, j1]
else if (j1 >= nj)
for (j2 = 0; j2 < nj; j2++) /* dim 2 */
T[j0, j1 - nj, j2]
3 Swapping Dimensions
In the following schedule, we mapped $i_1$ to the third dimension instead of the second. As dimension 2 of S is a constant now, there is no need to add $n_j$. Adding $1$ is sufficient to make T be executed after S. $$ S[i_0, i_1] \rightarrow [i_0, 0, i_1] \newline T[i_0, i_1, i_2] \rightarrow [i_0, 1 + i_1, i_2] $$ Again, this is a same schedule.
for (j0 = 0; j0 < ni; j0++) /* dim 0 */
for (j1 = 0; j1 < 1 + nk; j1++) /* dim 1 */
if (j1 < 1)
for (j2 = 0; j2 < nj; j2++) /* dim 2 */
S[j0, j2]
else if (j1 >= 1)
for (j2 = 0; j2 < nj; j2++) /* dim 2 */
T[j0, j1 - 1, j2]
4 Multiplying by -1
In the following schedule, we multiplied $i_0$ by $-1$ in S and $-1$ in T. We also added a constant offset $-n_i$ to dimension 0 of S. We remove the redundant $+1$ in dimension 1 of T because all S’s are already scheduled before T’s in dimension 0. $$ S[i_0, i_1] \rightarrow [-n_i-i_0, 0, i_1] \newline T[i_0, i_1, i_2] \rightarrow [-i_0, i_1, i_2] $$ By multiplying an iteration variable by a negative coefficient, the execution order is reversed on the corresponding dimension.
for (j0 = - ni - ni + 1; j0 < 0; j0++) /* dim 0 */
if (j0 <= -ni)
for (j2 = 0; j2 < nj; j2++) /* dim 2 */
S[-ni - j0, j2]
else if (j0 > -ni)
for (j1 = 0; j1 < nk; j1++) /* dim 1 */
for (j2 = 0; j2 < nj; j2++) /* dim 2 */
T[-j0, j1, j2]
5 Interleaving Dimensions
Try adding $i_0$ to $i_1$ in dimension 2 of S: $$ S[i_0, i_1] \rightarrow [-n_i-i_0, 0, -i_0-5i_1] \newline T[i_0, i_1, i_2] \rightarrow [-i_0, i_1, i_2] $$ Although $i_0$ and $i_1$ are added together on dimension 2 of S, dimension 0 has separated different $i_0$’s. Therefore, dimension 2 solely depends on $i_1$.
for (j0 = - ni - ni + 1; j0 < 0; j0++) /* dim 0 */
if (j0 <= -ni)
for (j2 = -nj; j2 < 0; j2++) /* dim 2 */
S[-ni - j0, -j2 - 1]
else if (j0 > -ni)
for (j1 = 0; j1 < nk; j1++) /* dim 1 */
for (j2 = 0; j2 < nj; j2++) /* dim 2 */
T[-j0, j1, j2]
In fact, this is one of the fastest-running schedules discovered by random exploration on my computer.
However, things could be a bit different if: $$ S[i_0, i_1] \rightarrow [-n_i, 0, -i_0-5i_1] $$ Without limits in dimension 0, the dimensions of $i_0$ and $i_1$ are interleaved, allowing parallelism:
S[ni - 1, nj - 1]
S[ni - 2, nj - 1]
S[ni - 3, nj - 1]
S[ni - 4, nj - 1]
S[ni - 5, nj - 1]
S[ni - 6, nj - 1] / S[ni - 1, nj - 2] // can be executed in parallel
S[ni - 7, nj - 1] / S[ni - 2, nj - 2]
S[ni - 8, nj - 1] / S[ni - 3, nj - 2]
S[ni - 9, nj - 1] / S[ni - 4, nj - 2]
S[ni - 10, nj - 1] / S[ni - 5, nj - 2]
S[ni - 11, nj - 1] / S[ni - 6, nj - 2] / S[ni - 1, nj - 3]
S[ni - 12, nj - 1] / S[ni - 7, nj - 2] / S[ni - 2, nj - 3]
...