E-Bunny: A Dynamic Compiler for Embedded Java Virtual Machines

Mourad Debbabi, Abdelouahed Gherbi, Lamia Ketari, Chamseddine Talhi, Hamdi Yahyaoui, Sami Zhioua, Concordia Institute for Information Systems Engineering Concordia University, Quebec, Canada
Nadia Tawbi, Computer Science Department, Laval University, Quebec, Canada

REFEREED
ARTICLE

PDF Version

Abstract

A new acceleration technology for Java embedded virtual machines is presented in this paper. Based on the selective dynamic compilation technique, this technology addresses the J2ME/CLDC (Java 2 Micro Edition for Connected Limited Device Configuration) platform. The primary objective of our work is to come up with an efficient, lightweight and low-footprint accelerated embedded Java Virtual Machine. This is achieved by the means of integrating a selective dynamic compiler that we called E-Bunny into the J2ME/CLDC virtual machine KVM. This paper presents the motivations, the architecture, the design and the implementation issues of E-Bunny and how we addressed them. Experimental results on the performance of our modified KVM demonstrate that we accomplished a speedup of 400% with respect to the Sun’s latest version of KVM. This experimentation was carried on using standard J2ME benchmarks.

1 MOTIVATIONS AND BACKGROUND

With the advent and rising popularity of wireless systems, there is a proliferation of small internet-enabled devices (e.g. PDAs, cell phones, pagers, etc.). In this context, Java is emerging as a standard execution environment due to its security, portability, mobility and network support features. In particular, J2ME/CLDC (Java 2 Micro- Edition for Connected Limited Device Configuration) [13] is now recognized as the standard Java platform in the domain of mobile wireless devices such as pagers, handheld PDAs, TV set-top boxes, appliances, etc. It gained big momentum and is now standardized by the Java Community Process (JCP) and adopted by many standardization bodies such as 3GPP, MEXE and OMA. Another factor that has amplified the wide industrial adoption of J2ME/CLDC is the broad range of Java based solutions that are available in the market. All these factors make Java and J2ME/CLDC an ideal solution for software development in the arena of embedded systems.

The heart of J2ME/CLDC technology is Sun’s Kilobyte Virtual Machine (KVM) [14]. The KVM is a Java virtual machine initially designed with the constraints of low-end mobile devices in mind. The performance and the security of the KVM will be two key factors in the successful deployment of Java technology in wireless and embedded systems.

The primary intent of our research initiative is to dramatically improve the performance of J2ME/CLDC virtual machines such as the KVM.

Lately, a surge of interest has been expressed in the acceleration of Java virtual machines for embedded systems. Two main approaches have been explored: hardware and software acceleration.

For hardware acceleration, a plethora of companies (Zucotto Wireless [3], Nazomi [4], etc.), had proposed Java processors that execute in silicon Java bytecodes. Although these hardware accelerators achieve a significant speedup in terms of virtual machine performance, it remains that their use comes with a high price in terms of power consumption. This energy issue is really damaging especially in the case of low-end mobile devices. Moreover, the cost (royalties, licensing, etc.) of these hardware acceleration technologies is an additional obstacle to their adoption by the industry. These drawbacks of hardware acceleration created an interesting but challenging niche for software acceleration of embedded Java virtual machines. For software acceleration, a large spectrum of techniques has been advanced [1, 5, 6, 7, 12, 16]. These techniques could be classified into 4 categories: general optimizations, ahead-of-time optimizations, just-in-time compilation and selective dynamic compilation. General optimizations consist in designing and implementing more efficient virtual machine components (better garbage collector, fast threading system, accelerated lookups, etc.). Ahead-of-time optimizations consist in using extensive static analysis (flow analysis, annotated type analysis, abstract interpretation, etc.) to optimize programs before execution. Just-in-time (JIT) compilation consists of the dynamic compilation of Java executables (bytecode). This dynamic compilation is achieved thanks to a compiler that is embedded in the Java virtual machine. The compiler is in charge of translating bytecode into the native code of the host platform on which the code is being executed. The selective dynamic compilation consists in compiling, on the fly, into native code only a selected set of methods that are performance-critical.

Experience demonstrated that general and ahead-of-time optimizations can lead to reasonable accelerations. However, they cannot compete with just-in-time and selective dynamic compilation in reaching big speedups (for instance an acceleration of more than 200 %) [15].

It is established that just-in-time compilers require a lot of memory to store the dynamic compiler and the binary code that it generates. The compilation process also implements sophisticated flow analysis and register allocation algorithms in order to generate optimized and high-quality native code. JIT compilers allow to reach high speedup but the static analysis that they use induces a significant overhead in terms of memory and time. This makes JIT compilers much more appropriate for J2SE (Java 2 Standard Edition) and J2EE (Java 2 Enterprise Edition) platforms.

Selective dynamic compilation deviates from the JIT compilation by selecting and compiling, on the fly, only those fragments of the class files that are frequently executed. These code fragments are generally referred to as hotspots. For instance, one can select only those methods that are frequently invoked and convert them to native code. By doing so, significant acceleration of the virtual machine could be reached since efficient optimizations are concentrated on performance critical fragments of the program. Another major advantage of this approach is to reduce memory overhead because only a part of a program is converted to native code. This makes selective dynamic compilation more adequate for embedded systems than JIT compilers.

This paper presents an extremely lightweight selective dynamic compiler for embedded Java virtual machine called E-Bunny. It is built on top of KVM. The contributions are threefold:

Our system is the first academic work that targets CLDC-based embedded Java virtual machines optimization by dynamic compilation. The remaining systems are commercial products such as [15] and [17].
Our solution, besides the compilation of all kind of bytecodes, covers the different issues of the integration of a dynamic compiler into a virtual machine such as multi-threading support, exception handling, garbage collection, switching mechanism between the compiler and the interpreter modes, etc.
Our solution is efficient. It allows to accelerate the performance by a factor of 4 while the memory footprint overhead does not exceed 138 KB.

The remainder of this paper is organized as follows. Section 2 highlights related work relevant to the dynamic compilation in the embedded context. In section 3, we present the architecture as well as the key ideas of our system. Design issues are detailed in section 4. Section 5 presents the results of our implementation and finally section 6 is a conclusion.

2 RELATED WORK

Dynamic compilation became a popular approach to optimize Java performance. Almost all standard Java virtual machines [1, 2, 12, 20, 19] are endowed with a dynamic compiler.

Java HotSpot VM [12], which is the core component of Java 2 Standard Edition, is equipped with a selective dynamic compiler. Performance critical methods are detected by means of a counter-based profiler with additional heuristics. These heuristics investigate the caller methods in order to compile them with the triggering method. On-Stack-Replacement is also used to trigger compilation while a method is still running. The Java Hotspot VM dynamic compiler applies several classical optimizations and is considered as one of the most efficient in the market. Another important reason of this efficiency is the aggressive method in-lining it applies. Indeed, methods that are detected frequent are not only compiled but also in-lined. The benefit of this optimization is that it produces larger blocks of code for the compiler to perform optimizations. Operating on large blocks of code increases the effectiveness of classical optimizations. However, using On-Stack-Replacement or applying expensive optimizations such as aggressive method in-lining is not adequate for embedded systems because they require important resources particularly memory space.

IBM Jalapeño [1] is a Java virtual machine equipped with a Just-In-Time compiler with different levels of optimizations. That is, according to the frequency of a method, this is compiled with the appropriate level of optimization. The profiler used by Jalapeño is very complex and is part of a bigger system called Adaptive Optimization System (AOS). As Java Hotspot VM, the features of IBM Jalapeño cannot be applied in an embedded context due to lack of resources. For instance, a Just-In-Time approach is not suitable because it requires that all methods are compiled, whereas, embedded systems lack the necessary memory to store all method generated code.

Other Java virtual machines are endowed with dynamic compilers including IBM JDK [19], Intel ORP [1], IBM Mixed-Mode-Compiler [20], Latte [21] and OpenJIT [11].

Embedded systems lack hardware resources that are available in desktop systems such as hundreds megabytes of RAM or microprocessors operating at over 2 GHz. This sets several limitations on what dynamic compilation could accomplish in embedded systems. In the sequel, we outline these limitations.

In the context of embedded systems, dynamic compilation should cope with two major difficulties. First, the dynamic compiler should be maintained in memory while the application is executing. This is very challenging because of the stringent lack of memory resources. Second, heavyweight code optimizations are not affordable because of their overhead. However, without such optimizations, a dynamic compiler produces a code of low quality and large quantity. This code requires additional memory to be stored. It is worth mentioning that the produced native code could be 8 times the size of the original bytecode [15]. Hence, embedded dynamic compilers are required to be extremely frugal with memory resources. Another consequence of the big size of native instructions compared to bytecode is the risk of instruction cache overflow. Indeed, among the hardware limitations of embedded systems is the reduced amount of on-chip processor instruction cache. The amount of this resource is suitable for bytecode interpretation. However, due to its big size, the machine code produced by the dynamic compiler can be several times larger than the size of the available instruction cache [15]. This leads to several cache misses that decreases the program performance.

Despite these difficulties, dynamic compilation is also used in CLDC-based em- bedded virtual machines [15, 17, 18]. However, except one paper about KJIT [18], no detailed information about these systems is available in the literature due to commercial reasons.

KJIT [18] is a lightweight dynamic compiler that uses as its foundation the KVM. KJIT does not use any form of profiling for the simple reason that all methods are compiled. This strategy seems to be very heavyweight and only feasible in server or desktop systems. The key idea to make this strategy adequate for embedded Java virtual machines is to compile only a subset of bytecodes. The remaining bytecodes continue to be handled by the interpreter. Indeed, whenever one of the interpreted bytecodes is encountered, execution switches back from the compiled mode to the interpreted mode. This requires an efficient handling of the switching mechanism since this operation is highly frequent. This is achieved in KJIT by pre-processing the bytecode before their compilation. The cost of pre-processing, however, is an additional time required for pre-processing together with an additional space required to store the generated bytecode. The latter is 30% larger than the original bytecode.

CLDC Hotspot VM [15] is an embedded virtual machine introduced by Sun Microsystems. As its name indicates, it is strongly inspired by the standard Java Hotsopt VM. All features of Java Hotspot VM that can be adapted to resourceconstrained environments are applied. Among these features, we find a selective dynamic compiler. Performance critical methods are detected by a single statistical profiler. The compilation is performed in one pass. Three basic optimizations are applied: constant folding, constant propagation and loop peeling. The memory footprint required by CLDC Hotspot (including APIs) reaches 1 megabyte which is almost the double of the space required by KVM. No more details are provided about the CLDC Hotspot dynamic compiler.

3 ARCHITECTURE

E-Bunny is a selective dynamic compiler for embedded Java virtual machines that uses as its foundation the KVM. In this section, we present the key features that make E-Bunny an appropriate Java acceleration technology for embedded systems.

The major features of E-Bunny are:

Reduced Memory Footprint: The footprint resulting from the integration of the E-Bunny dynamic compiler does not exceed 150 KB. The key idea to reduce the code size of E-Bunny is to merge the compilation processing of some bytecodes. This is possible because several bytecodes have joint processing (e.g. invokespecial, invokevirtual). This strategy is applied mainly for some bytecodes that have fast versions [9](e.g. getstatic, getstatic fast, getstaticp fast, getstatic2 fast).
Selective Compilation: Since selective dynamic compilation is the most adequate compilation-based acceleration technique for embedded systems, it was adopted in E-Bunny. Only a subset of methods is compiled. The methods are selected according to their invocation frequencies. The unit of compilation is exclusively a method.
Efficient Stack-Based Code Generation: For the compilation strategy, a trade-off has to be made between the compilation cost and the generated code quality. Although a register-based code is more efficient, we do not generate such code because it requires more passes over the bytecode. In E-Bunny, we generate a stack-based code because it requires only one pass over the bytecode. Thus, a one-pass code generation strategy is adopted, without using neither intermediate representations nor heavyweight optimizations. Only optimizations that might be applied in one pass are allowed.
Using two stacks: Portability is an important issue in embedded systems. Virtual machines with dynamic compilers are strongly machine-dependent because they compile the bytecode into the target machine native code. Thus, it is unavoidable to become dependent on the target machine and consequently less portable. CLDC Hotspot [15] and IBM Mixed-Mode [20] virtual machines, which adopt a mixed-mode approach, use only the native stack for both interpretation and compiled methods execution. However, the native stack is part of the target machine architecture. This makes the virtual machine very dependent on the target machine. Our dynamic compilation approach advocates the use of two stacks (one for interpretation and one for compiled methods execution) to preserve portability to a high extent. Hence, interpretation remains totally machine-independent. Moreover, the platform-specific code is isolated in specific modules. Future porting of E-Bunny to another platform requires the modification or the re-creation of only these modules. However, the drawback of this way to preserve portability is the complexity introduced by the use of two stacks. Indeed, (1) method arguments must be transferred between the two stacks when necessary, (2) the garbage collector must consider the native stack when it scans the heap and (3) additional context information must be added to the thread data structure.
Multi-threading Support: Another challenge introduced by dynamic compilation is the multi-threading support. Conventionally, each thread has its own execution stack. Since our approach uses two kind of stacks, each thread is assigned two stacks upon its creation: a Java stack to interpret methods and a native stack to run compiled ones. For the Java stack, we adopt the KVM way to manage the Java stacks. KVM allocates the Java stack from the heap and manipulates it at a software level. For the native stack, the approach adopted is to organize the native stack as a pool of segments and allocate a segment to each thread. Thus, the native stack will be shared by all living threads.
LRU Algorithm for Cache Management: A limited memory space is allocated to the compiled code. When this space is full, a cache strategy based on a Least Recently Used (LRU) algorithm [10] is adopted to free the necessary space.

Figure 1: E-Bunny Architecture

The E-Bunny architecture is depicted in Figure 1. It includes four major components: the execution engine, the profiler, the compiler and the cache manager. Initially, all invoked Java methods are interpreted. During interpretation, a counterbased profiler gathers profiling information. As the code is interpreted, the profiler identifies hotspot methods. Once a method is recognized as hotspot, its bytecodes are translated into native code by the compiler. The produced native code is stored in the dynamic compiler cache. On future references to the same method, the cached compiled method is executed instead of interpreting it. In the sequel, we highlight the main components of the proposed embedded dynamic compiler architecture.

Interpreter

By interpreter, we mean the KVM’s interpreter. It is the only way in KVM to execute Java programs. The interpreter is basically a loop that fetches, decodes and interprets bytecodes of a given method. E-Bunny adopts a mixed-mode¹ execution approach. Therefore, the interpreter cooperates with the machine code execution component to switch between the two modes: interpreted and native.

Machine Code Execution Component

The machine code execution component is responsible for invoking compiled methods. Basically, it looks for the corresponding machine code in the cache and then executes it. In addition, this component is responsible for transferring, if any, methods arguments from one stack to the other. More details about this mechanism are given in Section 4. When the machine code of a given method is executed, the cache must be set up-to-date according to the cache algorithm. The machine code execution component triggers the cache update.

Profiler

A simple counter-based profiler is used in E-Bunny. A per-method counter is incremented each time the method is invoked. A method is considered as hotspot when its counter reaches an a-priori defined threshold. Consequently, a compilation request is triggered. Our profiling strategy assumes that every method called by a compiled method must be compiled. So, a compiled method cannot invoke an interpreted method. The motivation behind this strategy is to reduce the complexity of the switch mechanism between interpreted and native modes.

Compiler

The compiler is the core component of E-Bunny. It is triggered by the profiler. Basically, it compiles Java methods to machine code. The E-Bunny compiler is extremely lightweight. It goes through the bytecode in one pass and generates a stack-based code. Unlike traditional compilers, neither intermediate representations nor heavyweight optimizations are used. The compilation strategy is described in Section 4.

Cache Manager

Once a method is recognized as hotspot and compiled, the corresponding machine code needs to be stored in the heap. In E-Bunny, a fixed space, called the cache, is pre-allocated in the heap to store the generated code. The cache is pre-allocated in the permanent space of the heap to avoid any conflicts with the garbage collection mechanism. Conventionally, the garbage collector does not scan the permanent space. When the cache becomes full, the cache manager must select elements to remove in order to free space for newly compiled methods. For this purpose an LRU algorithm is used. This algorithm selects the method (or the methods) that has (have) not been invoked for the largest period of time and removes it (or them) from the cache. A queue is used to keep the chronological order of invoked methods. This queue is updated each time a compiled method is invoked. The LRU algorithm does not prevent methods from being recompiled several times, but our experiments show that using an LRU algorithm reduces efficiently the re-compilation rate.

4 DESIGN

In this section we discuss the design issues of E-Bunny. First, we detail the compilation strategy. Second, we focus on a delicate aspect of selective dynamic compilation which is the switch mechanism between interpreted and compiled modes. Then, we illustrate how E-Bunny supports multi-threading. Finally, we describe the interaction with the garbage collection mechanism.

Compilation Strategy

Our compilation strategy spans over a lightweight one pass compilation technique. This strategy avoids complex computations performed by common compilers and generates a code of reasonable quality. Indeed, the generated code is stack-based as Java bytecode but uses many information computed at the compilation step (field offsets, Constant Pool entry address etc.). These information are grafted in the generated code in order to avoid unnecessary further re-computation. Compiling a method goes through three steps. First, generating context saving instructions (the prologue). Second, translating bytecodes into machine code instructions. Third, generating context restoration instructions (the epilogue). The second step, which is the core step of our compilation strategy, consists in translating each bytecode into a sequence of native instructions. We distinguish two categories of bytecodes. The first kind includes bytecodes that are completely translated into native instructions in a straight manner. They are called: simple bytecodes. The second kind of bytecodes are more complex to translate and are called complex bytecodes. E-Bunny targets Intel IA-32 architecture [8]. IA-32 is a 32 bit CISC machine which provides eight general purpose registers (EAX, EBX, ECX, EDX, ESI, EDI, EBP, ESP). EBP and ESP are dedicated to stack management. The remaining registers fulfill different other tasks. Hence, compiling a method consists in generating assembly instructions that reproduces the bytecode behavior on an IA-32 platform.

Context Saving and Context Restoration

Context saving and context restoration instructions are used to re-establish the calling method context after the execution exits from the callee method. Context saving instructions figure on top of the method generated code. They carry out three basic operations. First, save old EBP register value (of the calling method) on the native stack. Second, assign a new value to EBP register (of the callee method). Third, assign the new value of EBP to LastEBP field of the current thread structure (CurrentThread->LastEBP =EBP). Actually, LastEBP is a new field added to the thread data structure in E-Bunny. It gives for each thread the EBP value of the last called compiled method. This information is used to two purposes: garbage collection and exception handling. Context saving instructions in E-Bunny are illustrated in Table 1.

Table 1: Context Saving

Table 2: Context Restoration

Context restoration instructions figure on the end of method generated code. Since all Java methods finish by a return bytecode (ireturn, lreturn, areturn or return), context restoration instructions are generated when translating return bytecodes. Context restoration instructions carry out two operations. First, restore the old value of EBP register (saved on the native stack). Second, assign this value to LastEBP field of CurrentThread data structure. Consequently, when returning to the calling method, EBP register and CurrentThread->LastEBP are set to the appropriate values. Context restoration instructions in E-Bunny are illustrated in Table 2.

Simple Bytecodes Translation

The simple bytecode category includes loads (e.g. iload, iaload, ldc), stores (e.g. astore, lastore), stack manipulation (e.g. pop, dup), arithmetic, logic and shift (e.g. iadd, land, ishr, l2i) and branching (e.g. ifne, if_icmpeq, goto) bytecodes. These bytecodes are directly translated into stack-based machine code which reproduces the interpreter behavior on the native stack. As an example, we give the translation details of two bytecodes iload and ifeq. We have chosen iload bytecode to show how it behaves on the native stack and ifeq bytecode to illustrate how E-Bunny translates forward branching instructions in one pass.

Table 3: iload Bytecode Translation

Figure 2: iload on the Native Stack

The iload bytecode loads integer (32 bit value) from local variables to the top of the stack (stack operand). It is directly mapped to a push instruction as shown in Table 3. The symbols “[ ]” denote the content of the specified location. The FrameSize variable is an integer that gives the number of local variables of the current method. The displacement is multiplied by 4 because addresses in Intel² grow by 4. The value 3 in the offset of the push instruction denotes EIP register and two empty spaces (space 1 and space 2) as illustrated in Figure 2. Notice that space 1 and space 2 are left for the returned value of the method (see Section 4).

An important issue of the one pass translation strategy is how to deal with control flow bytecodes (or branching bytecodes). In the sequel, we highlight this issue and how it is processed in E-Bunny.

Control flow bytecodes are translated using a bytecode map table called BCNativeMap. It is an integer table that maps each bytecode index to its corresponding machine instruction index. For instance, BCNativeMap[5] = 31 means that the machine instructions corresponding to the fifth bytecode start at index 31 in the NativeCode table. A BCNativeMap table is associated with each compiled method. It is initialized to zero, and filled progressively when the translation of the method is going on. At a given moment of translation, the BCNativeMap table is filled for the scanned bytecodes and empty (zero values) for the rest.

To deal with control flow bytecodes, there are two situations: forward branches and backward branches. The translation of backward branches is straightforward. BCNativeMap table is used to get the target machine instruction index, and then to generate the corresponding jump native instruction. However, the translation of forward branches is more complex. Indeed, the corresponding BCNativeMap entry of a forward branch is not resolved yet (contains zero) as the target bytecode is not already compiled. The complete translation of the forward branching bytecode is postponed until the target bytecode is reached. Actually, a jump native instruction op-code (e.g. JL, JNE, JMP) is generated with an unresolved operand. Each next bytecode is checked whether it is a target of an unresolved branching. In such case, the corresponding offset is computed and then used to patch the unresolved operand of the jump native instruction. To handle multiple unresolved forward branching, a linked list is used. In Table 4, we give the translation algorithm of ifeq bytecode and the corresponding machine instructions.

Table 4: ifeq Bytecode Translation

Table 5: putfield Bytecode Translation

Complex Bytecodes Translation

In KVM, interpretation of some bytecodes involves the use of some virtual machine runtime services such as method lookup or field reference resolution. For example, getfield bytecode uses field reference resolution subroutine to resolve the reference of the appropriate field. In E-Bunny, bytecodes that require virtual machine services are called complex bytecodes. Translating this kind of bytecodes is more complex. A possible approach consists in generating the corresponding native code, instruction by instruction, including virtual machine services. This yields a complex and a very bulky code. The approach adopted in E-Bunny is to define for each bytecode a specific function that calls necessary virtual machine services. These functions are called from the native code. Hence, the generated machine code is compact and less complex. Complex bytecodes category includes field access, objects creation, array manipulation, method invocation, return, monitor, casting and exception bytecodes. In the sequel we illustrate the translation mechanism of field access bytecodes and exception bytecode (athrow).

Field access bytecodes are putfield, getfield, putstatic, getstatic and their fast versions. These bytecodes access to object fields using symbolic references, to get or set their values. Such symbolic references have to be resolved so that the access could be possible. In E-Bunny, for each bytecode a specific function is defined. For getfield and getstatic, this function calls the field reference resolution subroutine and returns the value of the field in EAX register (EAX:ECX for long values). This value is then pushed on the native stack. For putfield and putstatic, this function calls the field reference resolution subroutine, sets this field and returns the field size in the EAX register. This value is then used to update the native stack. The translation of putfield is given in Table 5. The cp index denotes a Constant Pool index. The “ Switch” operator is used to clarify the illustration. Actually, it is implemented with jump instructions.

Method invocation bytecodes require a special care because they involve more complex processing than other bytecodes. Furthermore they are performance critical (on average 11% of total bytecodes are method invocation bytecodes [16]). The processing that method invocation bytecodes should carry out includes: performing a method reference resolution, checking the validity of the receiver object, performing a dynamic method lookup, creating a new frame for the invoked method, etc. Moreover, these bytecodes deal with the switch mechanism between interpreted and compiled modes (more details about the switch mechanism are given in next section).

In E-Bunny, to translate invoke bytecodes, specific functions have been defined (callVirtual for invokevirtual and invokespecial, callStatic for invokeStatic and callinterface for invokeinterface). These functions behave differently according to the kind of the called method (compiled or native). The conventions adopted to design these functions are the following: If the invoked method is native, the defined functions resolve the invoked method reference, check the validity of the receiving object, make the method lookup, transfer the arguments form the native stack to the Java stack (because the native functions consider the Java stack to get the arguments), invoke the native functions and finally transfer returned value to Java stack. In addition, these functions, set EAX register to 1 (to distinguish the native function call from the compiled method invocation) and EDX register to an integer value necessary to update the native stack.

Table 6: invokevirtual Bytecode Translation

Table 7: athrow Bytecode Translation

If the invoked method is compiled, these defined functions resolve the invoked method reference, check the validity of the receiving object, make the method lookup, get the machine code from the cache, update the latter and exit. We note here that the defined function does not call the invoked method. Instead, the machine code address of the called method and the local variables number are returned respectively in EDX register and in the top of the stack. These values are used to prepare the invocation. After that, the method is invoked. When it exits, the method context information is discarded using the value computed by the translated return bytecode. As an illustration of this mechanism, the translation of the invokevirtual bytecode is given in Table 6.

Another important feature of the Java language is the exception handling mechanism. An exception is raised by the athrow bytecode. A major constraint that the dynamic compilation should respect is to preserve exception handling semantics. The dynamic compilation introduces new issues relevant to exception propagation. Indeed, a compiled method could propagate an exception to an interpreted method. E-Bunny handles the following situations:

The method where the exception is thrown and the one where the exception is caught, are both interpreted: the original virtual machine exception handling mechanism is used.
The method where the exception is thrown, is interpreted and the one where the exception is caught is a compiled method. This case is excluded since the adopted profiling strategy assumes that a compiled method cannot invoke an interpreted one.
The method, where the exception is thrown, is a compiled method. There are two possible situations: the method catching the exception is either interpreted or compiled. In order to locate the method handling the exception, first, BCNativeMap is used to identify the bytecode corresponding to the native instruction throwing the exception. Then, an exception handler lookup is performed. If the method catching the exception (the method containing the handler) is a compiled one then its BCNativeMap is used to locate the native instruction corresponding to the bytecode handling the exception. A jump to this instruction is performed. Otherwise, we switch to the interpreted mode at the level of this bytecode.

Table 8: Interpreted to Native Switch Algorithm

The generated code for the athrow bytecode calls a specific function called throwExc. This function uses the virtual machine exception handling functionality to locate the exception handler. Then, it resumes the execution at the appropriate location. The translation of athrow is illustrated in Table 7.

Switch Mechanism

In E-Bunny, compiled methods are executed in the native stack while interpreted methods are interpreted in the Java stack located in the heap. Hence, execution of Java programs overlaps between native stack and Java stack. The switch between the two modes implies context transferring between the two stacks. We distinguish two situations where the switch occurs between the two modes: interpreted to native and native to interpreted.

The interpreted to native switch occurs when interpreting an invoke bytecode and the invoked method happens to be compiled. Coming from the interpretation mode, the called method arguments are on the top of the Java stack. The switch to the native mode requires their transfer to the native stack. This is carried out according to the algorithm in Table 8. The algorithm is based on the method frame layout in the native stack, depicted in Figure 3.

Figure 3: Switch Mechanism to the Native Mode

The switch from the native mode to the interpreted mode occurs in two situations. First, when a compiled method calls an interpreted method. Second, when a compiled method exits and returns back to its interpreted caller method. The pro- filing strategy we adopt assumes that every method called by a compiled method should be compiled. The switch is then reduced only to the second situation (return case). Handling this switch consists in transferring the returned value, if any, from the native stack to the Java stack. The implementation of this action depends on the returned value type as illustrated in Table 9.

Figure 3 shows the native stack in different phases of the switch. First, before method call (a). Second, after algorithm 1 (b). Third, when the translated called method exits (c). In (c), we assume that the called method returns a one word size value (e.g. integer, short, reference, etc.).

Threads Management

A Java virtual machine provides a framework to run properly different threads. Each thread has its own stack. In the interpreted mode, these stacks are created and managed at a software level. Basically, thread stacks are allocated in the heap. However, in E-Bunny, since two stacks are used, compiled methods have to be run on a native stack. Consequently, threads should use the native stack.

In the current implementation, the native stack is organized as a pool of segments. A segment is assigned to a thread when the latter is created. The segment pool management is based on a bit map. An entry of this map is a bit indicating whether the corresponding segment is used or free. Therefore, each thread executes its compiled methods in its own segment. A consequence of managing several threads with two stacks is that each thread has two forms of context information.

Table 9: Native to Interpreted Switch Algorithm

The first is relevant to Java stack (e.g. sp: stack pointer, fp: frame pointer, lp: locals pointer) and the second is relevant to the native stack (ESP, EBP and EIP registers). The data structure representing the thread in the virtual machine holds information representing both contexts (Figure 4).

Thread scheduling in the KVM is based on a round robin scheduling model. Each thread keeps control during a time-slice. This is decremented after each bytecode that may cause a control transfer (e.g. branching and invoke bytecodes). When the time-slice becomes zero, the virtual machine stops the current thread and resumes the next one in the running threads queue. Method compilation requires the support of thread scheduling. To achieve this purpose, additional code is generated for bytecodes causing transfer control. This code, mainly, decrements the ESI register, dedicated to hold time-slice value, and triggers a thread switch when ESI reaches the NULL value. In addition, a special care to save both contexts relevant to Java and native stacks, is taken.

Garbage Collection Issues

KVM garbage collection is based on a mark-sweep with compaction algorithm. With the selective approach of E-Bunny, compiled methods are executed on the native stack. Consequently, the native stack may contain some object references. Since the current garbage collection algorithm scans only the heap, the native stack will not be considered, and then, object references on it neither will be marked nor updated. Therefore, the current garbage collection algorithm is inefficient with a selective approach.

The garbage collection algorithm has to be extended to deal with the native stack. Precisely, translated method frames in native stack have to be scanned in order to mark and update object references. In E-Bunny, the garbage collection algorithm is enhanced to address this issue. Mainly, the garbage collection functionalities are modified to take into account the object references in the native stack. Indeed, for both marking and updating loops, we check if the frame corresponds to a compiled method or not. If the method is compiled we consider the native stack, otherwise we consider the Java stack.

Figure 4: Multi-threading in E-Bunny

5 IMPLEMENTATION AND RESULTS

E-Bunny is implemented using the C programming language. In our experiments, we used the GNU compiler to build the latest version of KVM (KVM 1.0.4) with E-Bunny. Table 10 shows the executable size of KVM with and without E-Bunny. The first column gives the total executable footprint of KVM without E-Bunny. The second column gives the total executable footprint of KVM equipped with EBunny dynamic compiler. Finally, column 3 of the table shows that using GCC to build KVM with E-Bunny produces a footprint overhead of 64 KB. To summarize, E-Bunny requires 64 KB for executable footprint overhead, 64 KB for storing translated methods and 10 KB for a map between bytecodes and native instructions which is used for compiling control flow instructions and for exception handling mechanism. Hence, the total memory resources required by E-Bunny dynamic compiler is 138 KB. To evaluate the performance of E-Bunny, we have run CaffeineMark benchmark
(without the float test) on the original version of KVM with and without E-Bunny.

E-Bunny produces an overall speedup of 4 over original KVM 1.0.4 (Overall score in Figure 5). Moreover, we built the MIDP 2.0 profile, intended to CLDC devices, using E-Bunny and we ran successfully several midlets. Figure 6 shows a snapshot of MIDP emulator illustrating CaffeineMark midlet results.

Table 10: Executable File Footprint overhead

Figure 5: CaffeineMark Scores of KVM 1.0.4 and E-Bunny with GCC

6 CONCLUSION AND FUTURE WORK

We reported, a new acceleration technology for Java embedded virtual machines that is based on selective dynamic compilation. This technology targets the J2ME/CLDC (Java 2 Micro Edition for Connected Limited Device Configuration) platform. We designed and implemented an efficient, lightweight and low-footprint accelerated embedded Java Virtual Machine. This has been achieved by the means of integrating a selective dynamic compiler, called E-Bunny, into the J2ME/CLDC virtual machine KVM. We presented the motivations, the architecture, the design as well as the technical issues of E-Bunny and how we addressed them. Experimental results demonstrated that we accomplished a speedup of 400% with respect to the Sun’s latest version of KVM.

Currently, many enhancements of E-Bunny are in progress. The major one concerns the bidirectional smooth switching between the interpreted and compiled modes. In fact, the profiling strategy adopted in the current version of E-Bunny, which consists in compiling each method called from the compiled one, is less complex to implement. However, it presents a drawback since it leads to compile nonperformance- critical methods. On the other hand, a more efficient centralized thread scheduling is being implemented. This is expected to reduce the generated native code size.

Figure 6: CaffeineMark midlet ran by KVM 1.0.4 and by E-Bunny

Footnotes

¹ Execution overlaps between interpretation mode and native code execution mode.

² The terms Intel stack and native stack are used interchangeably.

REFERENCES

[1] B. Alpern, C. Attanasio, J. Barton, M. Burke, P. Cheng, J. Choi, A. Cocchi, S. Fink, D. Grove, S. Hummel M. Hind, D. Lieber, V. Litvinov, M. Mergen, T. Ngo, J. Russell, V. Sarkar, M. Serrano, J. Shepherd, S. Smith, V. Sreedhar, H. Srinivasan, and J. Whaley. "The Jalapeño Virtual Machine". IBM Systems Journal, 39(1):211–238, February 2000.

[2] M. Cierniak, G. Lueh, and J. Stichnoth. "Practicing JUDO: Java under Dynamic Optimizations". In Proceedings of SIGPLAN 2000 Conference on Programming Languages Design and Implementation, pages 13–26, Vancouver, Canada, June 2000.

[3] G. Comeau. Java Companion Processors versus Accelerators. http://www.zucotto.com, 2002.

[4] Nazomi Communications. Boosting the performance of Java Software on Smart Handheld Devices and Internet Appliance. http://www.nazomi.com, April 2002.

[5] T. Cramer, R. Friedman, T. Miller, D. Seherger, R. Wilson, and M. Wolczko. "Compiling Java Just in Time". IEEE Micro, 17(3):36–43, 1997.

[6] E. Gagnon and L. Hendren. "Effective inline-threaded interpretation of java bytecode using preparation sequences". In Proceedings of Compiler Construction, 12th International Conference, CC 2003, Held as Part of the Joint Euro- pean Conferences on Theory and Practice of Software, ETAPS 2003, Warsaw, Poland, April 7-11, 2003, volume 2622 of Lecture Notes in Computer Science, pages 170–184. Springer-Verlag, 2003.

[7] C.-H. A. Hsieh, J. C. Gyllenhaal, and W. W. Hwu. "Java Bytecode to Native Code Translation: the Caffeine Prototype and Preliminary Results". In IEEE, editor, Proceedings of the 29th annual IEEE/ACM International Symposium on Microarchitecture, December 2-4, 1996, Paris, France. IEEE Computer Society Press, 1996.

[8] Intel. IA-32 Intel Architecture Software Developer’s Manual Volume 1: Basic Architecture. Intel Corporation, California, USA, order number 245470 edition, 2000.

[9] T. Lindholm and F. Yellin. The Java Virtual Machine Specification. Addison Wesley, 1996.

[10] S. M. Majercik and M. L. Littman. "Using Caching to Solve Larger Probabilistic Planning Problems". In Proceedings of the 15th National Conference on Artificial Intelligence (AAAI-98) and of the 10th Conference on Innovative Applications of Artificial Intelligence (IAAI-98), pages 954–960, Menlo Park, July 26–30 1998. AAAI Press.

[11] F. Maruyama. "OpenJIT 2: The Design and Implementation of Application Framework for JIT Compilers". In USENIX, editor, Proceedings of the Java Virtual Machine Research and Technology Symposium (JVM’01): April 23–24, 2001, Monterey, California, USA. Berkeley, CA, Berkeley, CA, USA, 2001. USENIX.

[12] Sun MicroSystems. "The Java HotSpot Performance Engine Architecture". Technical report, Sun MicroSystems, California, USA, April 1999.

[13] Sun MicroSystems. "Connected, Limited Device Configuration. Specification Version 1.0, Java 2 Platform Micro Edition". Technical report, Sun MicroSystems, California, USA, May 2000.

[14] Sun MicroSystems. "KVM Porting Guide". Technical report, Sun MicroSystems, California, USA, September 2001.

[15] Sun MicroSystems. "CLDC HotSpot Implementation Virtual Machine". Technical report, Sun MicroSystems, California, USA, 2002.

[16] R. Radhakrishnan, N. Vijaykrishnan, L. John, A. Sivasubramaniam, J. Rubio, and J. Sabarinathan. "Java Runtime Systems: Characterization and Architectural Implications". IEEE Transactions on Computers, 50(2):131–146, 2001.

[17] K. Schmid. "Esmertec’s Jbed Micro Edition CLDC and Jbed Profile for MID". Technical report, Esmertec AG, Dubendorf, Switzerland, Spring 2002.

[18] N. Shaylor. "A Just-in-Time Compiler for Memory-Constrained Low-Power Devices". In Proceedings of the 2nd Java Virtual Machine Research and Technology Symposium, pages 119–126, San Francisco, CA, USA, August 2002.

[19] T. Suganuma, T. Ogasawara, M. Takeuchi, T. Yasue, M. Kawahito, K. Ishizaki, H. Komatsu, and T. Nakatani. "Overview of the IBM Java Just-in-Time Compiler". IBM Systems Journal, 39(1):175–193, 2000.

[20] T. Suganuma, T. Yasue, M. Kawahito, H. Komatsu, and T. Nakatani. A Dynamic "Optimization Framework for a Java Just-in-Time Compiler". ACM SIGPLAN Notices, 36(11):180–195, November 2001.

[21] B. Yang, S.M. Moon, S. Park, J. Lee, S. Lee, J. Park, Y.C. Chung, S. Kim, K. Ebcio?glu, and E. Altman. "LaTTe: A Java VM just-in-time compiler with fast and efficient register allocation". In Proceedings of the 1999 International Conference on Parallel Architectures and Compilation Techniques (PACT ’99), pages 128–138, Newport Beach, California, October 12–16, 1999. IEEE Computer Society Press.

About the authors

		Mourad Debbabi holds Ph.D. and M.Sc. degrees in computer science from Paris-XI Orsay, University, France. He published more than 60 research papers in international journals and conferences on computer security, formal semantics, mobile and embedded platforms, Java technology security and acceleration, cryptographic protocol specification, design and analysis, malicious code detection, programming languages, type theory and specification and verifi- cation of safety-critical systems. He is a Full Professor and the Associate Director of the Concordia Institute for Information Systems Engineering, Concordia University, Montreal, Quebec, Canada. He is the Specification Lead of four Standard JAIN (Java Intelligent Networks) Java Specification Requests (JSRs) dedicated to the elaboration of standard specifications for presence and instant messaging through the Java Community Process (JCP) program. In the past, he served as Senior Scientist at the Panasonic Information and Network Technologies Laboratory, Princeton, New Jersey, USA; Associate Professor at the Computer Science Department of Laval University, Quebec, Canada; Senior Scientist at General Electric Research Center, New York, USA; Research Associate at the Computer Science Department of Stanford University, California, USA; and Permanent Researcher at the Bull Corporate Research Center, Paris, France. He can be reached at debbabi@ciise.concordia.ca. See also http://www.ciise.concordia.ca/~debbabi.
		Abdelouahed Gherbi received his Engineering degree in Computer Engineering in 1992 and his Master degree in 1997, from the University of Constantine, Algeria. He is a Ph.D. student in the Electrical and Computer Engineering Department at Concordia University in Montreal. The main topic of his research activities is the acceleration of embedded Java virtual machines. He can be reached at gherbi@ciise.concordia.ca.
		Lamia Ketari is a researcher at CSA (Computer Security and Acceleration) research group at Concordia Institute for Information Systems Engineering, Concordia University, Montreal, Quebec, Canada. She holds an MBA in Management Information Systems form Laval University. Currently, she is pursuing a Ph.D. thesis on the acceleration of Java for wireless devices and semantic-based optimization correctness. In the past, she worked on malicious code detection in COTS (Commercial Off-The-Shelf) applications. More specifically, she worked, with the collaboration of other research group members, on the implementation of a tool for monitoring critical system resources for malicious code detection, called: DaMon (Dynamic Analysis Monitoring). She can be reached at ketari@ciise.concordia.ca.
		Chamseddine Talhi is a researcher at CSA (Computer Security and Acceleration) research group at Concordia Institute for Information Systems Engineering, Concordia University, Montreal, Quebec, Canada. He holds an M.Sc. degree from Constantine University, Algeria. He is pursuing a Ph.D. thesis on the security of embedded Java platforms. He is interested in characterizing enforceable security policies in resource-constrained platforms. In the past years, he worked on formal specification and verification of telecommunication services. He participated in the design and the implementation of a Java optimizing compiler for J2ME/CLDC. He can be reached at talhi@ciise.concordia.ca.
		Nadia Tawbi holds a Ph.D and M.Sc. degrees from Pierre et Marie Curie University, Paris France. She was affiliated to Bull SA research center from 1989 to 1996. Since this year she joined the computer science department of Laval University where she is now an associate professor. Her research interests include static analysis of programs, optimization, object oriented languages, formal verification and security. She can be reached at Nadia.Tawbi@ift.ulaval.ca.
		Hamdi Yahyaoui is a researcher at CSA (Computer Security and Acceleration) research group at Concordia Institute for Information Systems Engineering, Concordia University, Montreal, Quebec, Canada. He holds an M.Sc. degree from Laval University. He is pursuing a Ph.D. thesis on the acceleration and semantic foundations of embedded Java virtual machines. In the past years, he worked on formal dynamic semantics and control flow analysis of Java. He participated in the design and implementation of a dynamic optimizing compiler for J2ME/CLDC. He can be reached at hamdi@ciise.concordia.ca.
		Sami Zhioua is a Ph.D. student at Concordia Institute for Information Systems Engineering, Concordia University, Montreal, Quebec, Canada. The main topic of his research activities is the acceleration of Java in the context of embedded systems. The goal of his research is to design and implement new techniques in order to improve the performance of embedded Java virtual machines. He participated in the design and implementation of a dynamic optimizing compiler for J2ME/CLDC. He can be reached at zhioua@ciise.concordia.ca.

Cite this article as follows: Mourad Debbabi et al: ”E-Bunny: A Dynamic Compiler for Embedded Java Virtual Machines”, in Journal of Object Technology, vol. 4, no. 1, January–February 2005, pages 83–108, http://www.jot.fm/issues/issue_2005_01/article2