Java Optimization for Superscalar and Vector Architectures

Douglas Lyon, Fairfield University, Fairfield CT, U.S.A.

COLUMN

PDF Version

Windows NT crashed.

I am the Blue Screen of Death.

No one hears your screams.

-Anon

Abstract

This paper describes the refactoring of Java code to take advantage of the superscalar and vector architectures available on many modern desktop computers.

The unrolling of Java loops is shown to cause some speed-ups for Java code. However, our benchmarks reveal that Java still lags behind vectorized C code.

The present state-of-the-art in computer hardware has outpaced the current state of the JIT (Just-In-Time) compilers. We have resorted to modifying our Java code to make use of JNI (Java Native Interface) based vector-accelerated C programs to obtain speed-ups from 2 to 10 times.

1 THE SUPERSCALAR G4

TextThe superscalar G4 core (also known as the Motorola MPC74xx series processor) is capable of executing four instructions per clock cycle. Superscalar processors are based on pipelined MIMD (Multiple Instruction, Multiple Data) architectures. Simple loop unwinding can take advantage of such architectures. The G4 also has a 128-bit wide vector unit called the Altivec. The Altivec has 32 registers with 128 bits each and represents a departure from the pure general purpose CPU and a focus upon a SIMD (Single Instruction, Multiple Data) architecture that enables fine-grained parallelism. A block diagram depicting the PowerPC execution flow is shown in Figure 1.

Figure 1. The G4 Execution Flow

Although the PowerPC (PPC) is a RISC (Reduced Instruction Set Computer) system, the addition of the Altivec processor adds over 160 machine instructions. These instructions map directly to C subroutines that can be invoked for the purpose of acceleration. In an ideal world, vectorizing compilers should be able to map data into these instructions and take advantage of the Altivec. In our experience, however, this almost never happens, even with the simplest of code [Freescale].

The trend toward SIMD-style signal processing type architectures is gaining wide acceptance (first with Intel’s MMX instructions and lately with the new SSE and SSE2 instructions). The SIMD idea can be traced back to Alan Turing's 1946 parallel computing studies on VLIW (Very Long Instruction Word) computers. A key difference between the super-scalar architecture and the VLIW architecture is that programs are typically changed in order to take advantage of the new architecture. In theory, Java obviates the need for the programs to change, in response to a new architecture, since this is the responsibility of the JIT (which should mean recompilation for each run). However, the present JIT’s do not take advantage of the modern VLIW machine [Patterson and Hennessy].

The computer industry is motivated to incorporate MIMD architectures into their hardware because they enable the fast computation of homogenous data arrays. These appear in several applications, such as signal processing, image processing, computer graphics, computer vision, communications, etc. In fact, an entire industry has grown up around the creation of chips dedicated to the processing of homogenous data arrays. These are typically called DSP (Digital Signal Processing) chips. It is therefore a matter of an evolutionary trend that we see DSP functions incorporated into general-purpose processors.

2 WHAT ARE C PROGRAMMERS DOING?

To observe just how C programmers optimize their code for the PPC, consider the following function, which is designed to add n numbers in an array and return the result:

The kernal of the for-loop translates into the following assembler:

Based on the addition of 1024 floats (32 bits, each), we find that the unoptimized code, running on a G4/400 Mhz, executes in 9 microseconds. In an effort to get the superscalar nature of the processor to kick-in, it is typical for the C programmer to unwind the loop. Since the G4 processor has a 4 stage super-scalar pipe-line, we can obtain some speed-up by using:

float sumV(float a[], int n) {

int i = 0;

register float s[4] = {0};

for (i=0; i < n; i=i+4) {

s[0] = s[0] + a[i];

s[1] = s[1] + a[i+1];

s[2] = s[2] + a[i+2];

s[3] = s[3] + a[i+3];

}// compaction phase...

return s[0]+s[1]+s[2]+s[3];

}

The kernal of the for-loop translates into the following assembler:

Unwinding the loop by 4 elements gives us a speed up of 2 microseconds (that is a 7 microsecond execution time). However, based on an inspection of the assembler, we find that no vector operations are added to the assembler output. Thus, we have been able to take advantage of the super-scalar architecture, but not the internal vector architecture.

Encouraged by the speed up, we double the unwinding of the loop. Further speed-ups can be had, since there may be a speculative pre-fetch in the CPU that enables the needed data to appear in the L1 cache (which is 32k bytes in size). Thus, we expand the unwinding and note that the array length must be an even multiple of 8, or we will access the array out of bounds:

float sumV8(float a[], int n) {

int i = 0;

register float b[8] = {0};

for (i=0; i < n; i=i+8) {

b[0] = b[0] + a[i];

b[1] = b[1] + a[i+1];

b[2] = b[2] + a[i+2];

b[3] = b[3] + a[i+3];

b[4] = b[4] + a[i+4];

b[5] = b[5] + a[i+5];

b[6] = b[6] + a[i+6];

b[7] = b[7] + a[i+7];

}

return b[0]+b[1]+b[2]+b[3]+

b[4]+b[5]+b[6]+b[7];

}

The unwinding of the loop by 8 does give a small speed-up (executing the sum in only 6 microseconds). It is well to note that unwinding by 16 actually slows the code by a small margin, and thus there are diminishing returns in using the unwinding approach.

One would hope that a vectorizing compiler would see the structure of the above code and take advantage of the obvious parallelism. This appears to be beyond the GCC compilers ability (at the moment). In support of the Altivec processor a -faltivec has been added to the GCC compiler. This enables the direct invocation of extensions to the C language. These include the vector data type, as well as direct invocation of subroutines that map to Altivec assembler instructions, without the inclusion of header files [Freescale]. For example:

The assembler for the kernal of the for-loop is:

The assembler reveals that vec_add has been mapped to the assembler instruction vaddfp. Thus, we have finally started to take advantage of the Altivec processor. However, to do it required major modification to the source code.

Even worse, the code has gotten rather much harder to understand. The invocation, vec_splat_u32 zeros out the vector (a 128 bit quantity). The vec_add instruction is adding 128 bits (i.e. 4, 32-bit floats) at a time. The new code now runs in 3 microseconds (about 300% faster then the original sum function).

Not to be outdone, an optimized assembler language subroutine is available in a library called SAL (Scientific Application Library) [Mercury].

The svex based sumSal function can sum the 1024 floats in just 1.5 microseconds (some 600% faster than the original sum function).

Even faster results can be had when using the VAST compiler [Crescent]. A non-disclosure agreement requires that I remain silent about how much faster VAST is. Also, some of the code had to be altered in order to get the code to work. I am told that my code uses non-standard C extensions.

3 JAVA HAS SPEED PROBLEMS

In Java it is possible to write the same sum function as in C:

We turn off garbage collection and run our benchmark in interpreted mode. This takes 503 microseconds. The just-in-time compiler can be turned back on using:

java -noclassgc -Xmixed math.Mat1

With the JIT on, the sum method takes 41 microseconds to run (about 4.5 times slower than the corresponding C code). We are using the same G4 processor (running at 400 Mhz). Just as in Section 1, we proceed to unwind the for-loop:

The above code saves no time at all, and, in fact runs in 48 microseconds (7 microseconds slower). What is going on? Each test is run 10 times, and the fastest time is being reported. This gives the JIT a chance to kick in. What happens if we only run the test once? The sum code runs in 246 microseconds (about 30 times slower than the slowest C code).

The following code can take advantage of pipelining, without introducing independence in the computation:

It is able to execute in 39 microseconds. Expansion to 8 terms yields:

The above program runs in 31 microseconds.

Thus with simple loop unwinding, Java seems able to get a 25% increase in speed, yet it is still 3 times slower than our slowest C code (and 30 times slower than vector-optimized assembler).

4 WHAT IS GOING ON WITH JAVA’S COMPILER?

Our simple-minded example just adds up a list of numbers, yet we see dramatic speed hits when we start to benchmark our code. Also, we find no direct improvement available for vectorization (though there is some improvement with loop unwinding).

What is going on with the Java compiler? Does it understand anything about vectorization? It might be useful for us to examine the assembler output for some of our code. Given the input code:

The compiler produced:

Now consider the sum4 function:

Compare the assembler output for sum with that of sum4:

Code:

Thus we see in the byte codes an output that has expanded in direct correspondence with the expansion in the input code. Yet the code does appear to run somewhat faster, indicating that the JIT/G4 is taking advantage of loop unwinding to exploit pipelining. Wherever the speed up is occurring, we know for sure it is not with Javac.

5 WHAT CAN WE DO TO SPEED UP JAVA?

Several projects have been started to address poor numerical performance in Java. For example, IBM has created the Ninja project, a closed-source C-based system that is compiled only for the PowerPC. Ninja uses C to create a series of JNI (Java Native Interface) assembler calls in order to speed subroutine invocations. If Ninja were an open-source project then we could recompile it for other platforms. As it is, it accelerates PowerPC only.

There are some very high-performance vector accelerated C libraries available from Apple Computer [Apple]. The Apple code is open-source but highly optimized for the PowerPC. Additionally, it has not been interfaced to JNI. To be of use to Java programmers, a more portable version of the vector-accelerated code is needed, so that several platforms can take advantage of the vectorization.

There are portable, open-source, vector-accelerated C libraries available for signal processing [VSIPL]. However, such libraries do not have JNI interfaces, and generating them is a non-trivial exercise.

6 ON THE VECTORIZATION OF JAVA

This section describes a low-level vector-based API that is open-source. In addition, a Java reference implementation is available, for portability to platforms where the C code is not vectorized.

The accelerated C-based API can be recompiled on a per-platform basis, thereby speeding up code that has been refactored to take advantage of the vector-based subroutines. This is going to require that the programmer alter code, in order to get the speed advantage. A nice alternative would be to create a vectorizing JIT compiler.

Luca Lutterotti created a project that contains the source code for implementing all the Altivec operations in Java. Benchmarks show that there can be a 2 to 10 times speed-up over pure Java implementations [Lutterotti].

The Lutterotti library automatically detects the presents of the Altivec processor. Java code is supplied that will compute the result (when no libraries are available). If there are native C-code libraries or if there are native C-code libraries that make use of Altivec, those are used instead. From the programmer’s point of view, a single, high-level subroutine is supplied:

A short array can actually slow down the execution of JNI-based computations (due to overhead). On the other hand, given arrays of length 256 floats or longer, the above code can run from 2 to 10 times faster than the Java source code. Size 256 arrays are 10 times faster in Altivec accelerated code than the pure Java code. However, size 1024 arrays are only 2.6 times faster than the pure Java code. What happened? Based on my examinations of the code, there appears to be allocation and freeing of array storage, at run-time. This is sure to cause a speed-hit (and one that might be avoidable, given more effort)

7 CONCLUSION

The creation of application specific API’s appears to be a trend in Java. Particularly where speed is critical. For example, there is the Java Advanced Imaging (JAI) API used for image processing, and the Java3D API for graphics. These API’s are closed source and typically written in C or C++. We, as Java programmers, must hope that Sun will support our platform with the new API, or Java programs that make use of the new API will not run.

For example, for several years, Windows, Solaris and Linux variants had Java3D and JAI, but the Macintosh did not. As a result, Mac users had to use another platform in order to use these APIs.

Application specific API’s are basically frameworks that encourage code reuse. (Altivec is not meant for double precision (i.e., 64 bit) floating point numbers. As a result, vectorization is limited to integers and single-precision floating point numbers.

Calling subroutines outside of the Java environment exchanges reliability for speed. The C code can generate segmentation faults, access memory in ways that Java cannot and generally increases the complexity of the system. We have found that complexity is inversely related to reliability.

It would be far better for reliability and complexity if there were vector-based byte codes that could be created by Javac. These could then map into low-level vector instructions (i.e., Altivec or SSE/SSE2). Clearly, heroic refactoring of legacy code is costly, in terms of programmer time. Even better would be a JIT that understands the vectorization of code (thus leaving the byte codes intact). With the advent of a JIT that can take advantage of pipelined MIMD and SIMD architectures, common on today’s desktops, Java would be propelled into the world of high-performance numerical computing [Arvedahl] [Sanseri].

ACKNOWLEDGEMENTS

The author is indebted to Rob Distinti, Member of Technical Staff at Norden Systems, for his contribution of the vectorized version of the sum function.

REFERENCES

[Apple] http://developer.apple.com/hardware/ve/vector_libraries.html.

[Arvedahl] Svante Arvedahl. Java Just-In-Time Compilation for the IA-64 Architecture. 2002. Master's thesis, Linkoping University.

[Crescent] Crescent Bay Software, http://www.crescentbaysoftware.com/compilertech.html

[Freescale] Altivec Technology Programming Interface Manual, Motorola/Freescale, 1999, http://www.freescale.com/files/32bit/doc/ref_manual/ALTIVECPIM.pdf

[IBM] Ninja (Numerically Intensive Java) http://alphaworks.ibm.com/tech/ninja

[Lutterotti] Luca Lutterotti, http://www.ing.unitn.it/~luttero/javaonMac/

[Mercury] SAL Reference Manual, TC-SAL-RM-570, Mercury Computer Systems, Inc., Chelmsford, MA 01824-2820, May, 2002. http://www.mc.com/.

[Patterson and Hennessy] David Patterson, John Hennessy: Computer Organization and Design, Morgan Kaufmann Publishers, Inc., San Francisco, California, 1998.

[Sanseri] Samuel K. Sanseri: Toward an Optimizing JIT Compiler for IA-64. 2000. Master's thesis, Portland State University, OR, US.

[VSIPL] Vector Signal Image Processing Library, http://www.vsipl.org/

About the author

After receiving his Ph.D. from Rensselaer Polytechnic Institute, Dr. Lyon worked at AT&T Bell Laboratories. He has also worked for the Jet Propulsion Laboratory at the California Institute of Technology. He is currently the Chairman of the Computer Engineering Department at Fairfield University, a senior member of the IEEE and President of DocJava, Inc., a consulting firm in Connecticut. E-mail Dr. Lyon at Lyon@DocJava.com. His website is http://www.DocJava.com.

Cite this column as follows: Douglas Lyon: “Java Optimization for Superscalar and Vector Architectures", in Journal of Object Technology, vol. 4, no. 2, March-April 2005, pp. 27-39 http://www.jot.fm/issues/issue_2005_03/column3

Previous column

Next column