Haenni's PhD thesis

EPIC architecture and multimedia
instruction sets for cryptographic

PhD thesis


Author : Jacques-Olivier Haenni
Thesis director : Prof. Eduardo Sanchez


Exploiting parallelism to improve the performance of a system is a technique that has been around for a long time. In the context of processor architectures, different kinds of parallelism can be distinguished: task-level parallelism (multiprocessor systems or multithreaded processors), instruction-level parallelism (superscalar, VLIW, or EPIC processors), and data-level parallelism (vector processors or multimedia instructions).

Aside from the instruction-level parallelism which is present is every modern high-performance processor, the width of the processed data words has steadily increased, from the 4 bits of the Intel 4004 in 1971 to today's 64 bits. In the meantime, however, many applications underexploit this computing capacity by still using 8-, 16- or 32-bit values.

Therefore, in order to better match the needs of some applications, processor manufacturers have implemented instructions called multimedia instructions, which consider a single processor register as a vector of 8-, 16-, or even 32-bit elements.

However, the use of these instructions is cumbersome, as current compilers are not capable of exploiting them. The developer wishing to benefit from their potential must then use libraries of predefined functions or write code which explicitly uses the multimedia instructions, either by directly writing assembly code or by calling intrinsics.

In 2001, HP and Intel launched the first processor of their new 64-bit IPF architecture (formerly known as IA-64), called Itanium. This architecture defines multimedia instructions that, unlike previous multimedia instruction sets, are completely integrated in the processor's instruction set.

Our study of Itanium specifications has shown that the latency between one of these multimedia instructions and a logic or arithmetic instruction is 4 or 12 clock cycles, depending on the distance, in clock cycles, between the instructions. Such a latency can dramatically slow down a program, and even a good instruction schedule may not avoid this problem. The solution we propose is to separate the instructions by inserting, for lack of better instructions, no-ops between them. We call this transformation noptimisation. Our experiments, initially run on a prototype and then on commercial systems, have shown that this technique introduces a performance improvement of 20% to 60% for the IDEA and RC6 cryptographic algorithms. The transformation has been implemented in the SGI's Pro64 compiler, currently the best public domain compiler for Itanium. This modification has been submitted to SGI and has been integrated in the official release.

While the benefits of multimedia instructions for graphic applications, for example, are well known, cryptographic applications may also take advantage of them. We have realized optimised implementations of the IDEA and RC6 algorithms with IPF and Altivec (Motorola PowerPC G4) multimedia instructions. These programs have been executed on commercial machines after having been simulated, in the case of IPF, with the functional simulator ski provided by HP. These experiments have revealed major performance improvements due to the use of multimedia instructions. In fact, the measured encryption throughputs surpass all current implementations. These experiments also show the advantages of the Altivec multimedia instructions compared to IPF's, as well as the need for compilers supporting multimedia instructions.

At present, many research projects are concerned with the compiler support for multimedia instructions. However, their fairly limited results clearly show the difficulty of the problem.

The approach initially adopted in such projects consisted of exploiting the techniques developed for the compilation on vector systems, which try to parallelize the loops in a program. Another useful approach consists of unrolling the loops and then looking for instructions that can be replaced by a multimedia instruction. We justify this approach with the simplification of the dependence analysis and with the fact that vector systems are considerably different from multimedia instructions, for example in the size and the organisation of the vectors in memory.

Finally, we describe the main issues raised by the automatic search for data-level parallelism, for which we propose solutions or at least some leads to be investigated in future works.


Last update: March 21st, 2002
Jacques-Olivier Haenni
E-mail: haenni@mail.com