2008-03-22

Optimizations for Your Loop (repost)

The compiler may or may not apply the following optimizations to your loop: Loop Interchange, Unrolling, Cache Blocking, Loop Distribution, Loop Fusion, and Load Pair. These transformations are discussed in the following sections, including how to transform loops manually and how to control them with pragmas or compiler options.

Loop Interchange

Loop Interchange is a nested-loop transformation applied by HLO that simply swaps the execution order of two nested loops. It is typically done to provide sequential unit-stride access to the array elements used inside the loop, improving cache locality. The compiler's -O3 (Linux*) or /O3 (Windows*) optimization looks for opportunities to apply loop interchange for you.

Loop interchange: the process of swapping the order of two iteration variables, mainly to avoid non-sequential access to array elements.

The following is an example of a loop interchange:

Example

// Loop before Loop Interchange
for (i = 0; i < NUM; i++) {
  for (j = 0; j < NUM; j++) {
    for (k = 0; k < NUM; k++) {
      c[i][j] = c[i][j] + a[i][k] * b[k][j];
    }
  }
}

// Loop after Loop Interchange of k & j loops
for (i = 0; i < NUM; i++) {
  for (k = 0; k < NUM; k++) {
    for (j = 0; j < NUM; j++) {
      c[i][j] = c[i][j] + a[i][k] * b[k][j];
    }
  }
}

See discussion under Non-Unit Stride Memory Access for more detail.

Unrolling

Loop unrolling is a loop transformation generally applied by HLO that takes better advantage of instruction-level parallelism, keeping as many functional units as possible busy doing useful work during a single loop iteration. In loop unrolling, you add more work to the inside of the loop in exchange for doing fewer loop iterations.

循环展开:有利于指令级别的并行.

There are pragmas and command-line options to control unroll behavior.

Pragma              Description
#pragma unroll      By itself, allows the compiler to determine the unroll factor.
#pragma unroll(n)   Instructs the compiler to unroll the loop n times.
#pragma nounroll    Instructs the compiler not to unroll the specified loop.

The corresponding command-line options are -unroll[n] (Linux) and /Qunroll[:n] (Windows).
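As a brief illustration of the pragmas in use (the loop bodies here are invented for the example):

Example

// Suggest a specific unroll factor for a hot loop.
#pragma unroll(4)
for (i = 0; i < n; i++) {
  y[i] = y[i] + a * x[i];
}

// Keep a cold loop rolled, for example to reduce code size.
#pragma nounroll
for (i = 0; i < n; i++) {
  sum = sum + x[i];
}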

Generally, you should unroll loops by factors related to the cache line size; experiment to find the best number. Consider the following loop:

Example

N = 1025;
M = 5;

for (i = 0; i < N; i++) {
  for (j = 0; j < M; j++) {
    a[i][j] = b[i][j] + c[i][j];
  }
}

Assume you want to unroll the "i" (outer) loop by a factor of 4, but you notice that 4 does not evenly divide N = 1025. Unrolling in this case is difficult; however, you can use a "post-conditioning" loop to take care of the leftover iterations as follows:

Example

N = 1025;
M = 5;
K = N % 4;  // K = N mod 4, the leftover iterations

// main part of loop, unrolled by 4
for (i = 0; i < N-K; i += 4) {
  for (j = 0; j < M; j++) {
    a[i][j]   = b[i][j]   + c[i][j];
    a[i+1][j] = b[i+1][j] + c[i+1][j];
    a[i+2][j] = b[i+2][j] + c[i+2][j];
    a[i+3][j] = b[i+3][j] + c[i+3][j];
  }
}

// post-conditioning part of loop handles the remaining K iterations
for (i = N-K; i < N; i++) {
  for (j = 0; j < M; j++) {
    a[i][j] = b[i][j] + c[i][j];
  }
}

Post-conditioning is preferred over pre-conditioning (which you will see described in some books) because post-conditioning keeps the main unrolled loop starting at the beginning of the arrays, preserving data alignment and avoiding the cost of misaligned memory accesses. For contrast, a pre-conditioned version of the same loop is sketched below.
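In the pre-conditioned variant, the leftover iterations run first, so the main unrolled loop no longer begins at i = 0; this is how the alignment of the arrays' first elements can be lost:

Example

// pre-conditioning: peel the K leftover iterations first
for (i = 0; i < K; i++) {
  for (j = 0; j < M; j++) {
    a[i][j] = b[i][j] + c[i][j];
  }
}

// main unrolled loop now starts at i = K, possibly off alignment
for (i = K; i < N; i += 4) {
  for (j = 0; j < M; j++) {
    a[i][j]   = b[i][j]   + c[i][j];
    a[i+1][j] = b[i+1][j] + c[i+1][j];
    a[i+2][j] = b[i+2][j] + c[i+2][j];
    a[i+3][j] = b[i+3][j] + c[i+3][j];
  }
}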

Cache Blocking

Cache blocking involves structuring data blocks so that they conveniently fit into a portion of the L1 or L2 cache. By controlling data cache locality, an application can minimize performance delays due to memory bus access. The application controls this behavior by dividing a large array into smaller blocks of memory (tiles) so that a thread can make repeated accesses to that data while it is still in cache. For example, image processing and video applications are well suited to cache blocking techniques because an image can be processed on smaller portions of the total image or video frame. Compilers often use the same technique, by grouping related blocks of instructions close together so they execute from the L2 cache.

The effectiveness of the cache blocking technique is highly dependent on the data block size, the processor's cache size, and the number of times the data is reused. Cache sizes vary among processors. An application can detect the data cache size using the CPUID instruction and dynamically adjust cache blocking tile sizes to maximize performance; a sketch of such a query follows this paragraph. As a general rule, cache block sizes should target approximately one-half to three-quarters of the physical cache size. For systems with Hyper-Threading Technology (HT Technology) enabled, target one-quarter to one-half of the physical cache size. (See Designing for Hyper-Threading Technology for other design considerations.)
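As a hedged sketch of the cache-size query: Intel processors report deterministic cache parameters through CPUID leaf 4. The code below uses the __cpuid_count macro from the GCC/Clang <cpuid.h> header (an assumption about your toolchain; with the Microsoft or Intel compilers you would use the __cpuidex intrinsic instead):

Example

#include <stdio.h>
#include <cpuid.h>   /* GCC/Clang helper header; toolchain assumption */

int main(void)
{
    unsigned int eax, ebx, ecx, edx, i;
    for (i = 0; ; i++) {
        /* CPUID leaf 4, subleaf i: deterministic cache parameters */
        __cpuid_count(4, i, eax, ebx, ecx, edx);
        unsigned int type = eax & 0x1F;   /* 0 means no more caches */
        if (type == 0)
            break;
        if (type == 2)                    /* skip instruction caches */
            continue;
        unsigned int level      = (eax >> 5) & 0x7;
        unsigned int ways       = ((ebx >> 22) & 0x3FF) + 1;
        unsigned int partitions = ((ebx >> 12) & 0x3FF) + 1;
        unsigned int line_size  = (ebx & 0xFFF) + 1;
        unsigned int sets       = ecx + 1;
        unsigned long bytes = (unsigned long)ways * partitions * line_size * sets;
        printf("L%u cache: %lu KB, %u-byte lines\n", level, bytes / 1024, line_size);
    }
    return 0;
}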

Cache blocking is applied by HLO and is used on large arrays that cannot all fit into cache at once. It pulls a subset of the data into cache (a small region) and uses that cached data as effectively as possible before the data is replaced by new data from memory.

In the following pseudocode example, we transform the indicated loops:

Example

for i = 1, 1000
  for j = 1, 1000
    for k = 1, 1000
      A[i, j, k] = A[i, j, k] + B[i, k, j]
    end_for
  end_for
end_for

to

for v = 1, 1000, 20
  for u = 1, 1000, 20
    for k = v, v+19
      for j = u, u+19
        for i = 1, 1000
          A[i, j, k] = A[i, j, k] + B[i, k, j]
        end_for
      end_for
    end_for
  end_for
end_for

The cache block size is set to 20. The goal is to load a block of data into cache, do every bit of computing we can with that data, and only then load a new block into cache. There are 20 elements of A and 20 elements of B in cache at the same time, and you should do as much work with this data as you can before you increment to the next cache block.

Blocking factors will differ for different architectures; determine them experimentally. For example, single-precision and double-precision data require different blocking factors. Typically, the overall impact on performance can be significant.
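To connect this back to the earlier matrix-multiplication example, a cache-blocked version might be sketched in C as follows. The BLOCK value of 64 is only a placeholder; as noted above, the right tile size depends on the target cache and must be found experimentally (the sketch also assumes N is a multiple of BLOCK):

Example

#define N     1024
#define BLOCK 64    /* placeholder tile size; tune for your cache */

void blocked_matmul(double c[N][N], double a[N][N], double b[N][N])
{
    int i, j, k, jj, kk;
    /* walk over BLOCK x BLOCK tiles of the j and k dimensions */
    for (jj = 0; jj < N; jj += BLOCK) {
        for (kk = 0; kk < N; kk += BLOCK) {
            for (i = 0; i < N; i++) {
                for (k = kk; k < kk + BLOCK; k++) {
                    for (j = jj; j < jj + BLOCK; j++) {
                        /* reuse the tile of b while it is still cached */
                        c[i][j] = c[i][j] + a[i][k] * b[k][j];
                    }
                }
            }
        }
    }
}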

Loop Distribution

Loop Distribution is an HLO loop transformation that splits a large loop into two or more smaller loops. It can be useful when optimizations such as software pipelining (SWP) or vectorization cannot take place due to excessive register usage. By splitting a loop into smaller pieces, it may be possible to get each smaller loop, or at least one of them, to software-pipeline or vectorize. An example is as follows:

Loop distribution: splitting one large loop into two or more smaller loops.

Example

// Before distribution or splitting the loop
for (i = 0; i < NUM; i++) {
   a[i] = foo1(i);
   b[i] = foo2(i);
         :
         :
   z[i] = foo26(i);
}

After distribution, the two smaller loops might look as follows:

// After distribution or splitting the loop
for (i = 0; i < NUM; i++) {
   a[i] = foo1(i);
   b[i] = foo2(i);
         :
         :
   m[i] = foo13(i);
}

for (i = 0; i < NUM; i++) {
   n[i] = foo14(i);
   o[i] = foo15(i);
         :
         :
   z[i] = foo26(i);
}

There is a pragma to suggest loop distribution to the compiler:

Example

#pragma distribute point

When placed outside a loop, this pragma makes the compiler attempt to distribute the loop based on its internal heuristics. The following is an example of using the pragma outside the loop:

Example

#pragma distribute point
for (i = 0; i < NUM; i++) {
   a[i] = foo1(i);
   b[i] = foo2(i);
         :
         :
   z[i] = foo26(i);
}

When placed within a loop, the pragma makes the compiler attempt to distribute the loop at that point; all loop-carried dependencies will be ignored. The following example uses the pragma within a loop to indicate precisely where the split should take place:

Example

for (i = 0; i < NUM; i++) {
   a[i] = foo1(i);
   b[i] = foo2(i);
         :
#pragma distribute point
         :
   z[i] = foo26(i);
}

Loop Fusion

Loop Fusion is the inverse of Loop Distribution. The idea is to join two loops that have the same trip count in order to reduce loop overhead. The -O3 (Linux) or /O3 (Windows) option will attempt loop fusion if the opportunity is present.

Loop fusion: merging two loops into one.
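As a brief illustration (the loop bodies here are invented for the example):

Example

// Before fusion: two loops with the same trip count
for (i = 0; i < NUM; i++) {
  a[i] = b[i] + 1;
}
for (i = 0; i < NUM; i++) {
  c[i] = a[i] * 2;
}

// After fusion: one loop, half the loop overhead
for (i = 0; i < NUM; i++) {
  a[i] = b[i] + 1;
  c[i] = a[i] * 2;
}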

Load Pair (Itanium® Compiler)

Load pair (ldfp) instructions load two contiguous single-precision or double-precision values from memory in a single instruction. Load pairs can significantly improve performance.
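Load pairs require suitably aligned data. As a hedged sketch, aligning an array and walking it two elements at a time gives the Itanium compiler an opportunity to combine the two loads into one ldfp; whether it actually does so depends on its analysis. (__attribute__((aligned(16))) is the Linux spelling; on Windows the equivalent is __declspec(align(16)).)

Example

double x[1024] __attribute__((aligned(16)));

double sum_pairs(void)
{
    double s = 0.0;
    int i;
    for (i = 0; i < 1024; i += 2) {
        s = s + x[i] + x[i+1];   /* two contiguous loads: an ldfp candidate */
    }
    return s;
}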

Manual Loop Transformations

There might be cases where manual loop transformations are acceptable or even preferred. As a general rule, though, you should let the compiler transform loops for you. Transform loops manually only as a last resort, in cases where you are trying to squeeze out additional performance.

Manual loop transformations have many disadvantages, which include the following:

  1. Application code becomes harder to maintain over time.

  2. New compiler features can cause you to lose any optimization you gain by manually transforming the loop.

  3. Manual transformations might unintentionally tie your code to a specific architecture.

The HLO report can give you an idea of what loop transformations have been applied by the compiler.
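For example, with the Intel compilers of this era the report can be requested roughly as follows (the exact option spellings vary by compiler version, so treat these as an assumption and check your documentation):

Example

icc -O3 -opt-report 3 -opt-report-phase hlo mycode.c      (Linux)
icl /O3 /Qopt-report:3 /Qopt-report-phase:hlo mycode.c    (Windows)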

Experimentation is a critical part of manually transforming loops. You might try applying a loop transformation that the compiler ignored; sometimes it is even beneficial to hand-apply a transformation that the compiler already applies with -O3 (Linux) or /O3 (Windows).


