Diffstat (limited to 'src/site/xdoc/manual/bcel-api.xml')
-rw-r--r-- | src/site/xdoc/manual/bcel-api.xml | 645 |
1 files changed, 645 insertions, 0 deletions
diff --git a/src/site/xdoc/manual/bcel-api.xml b/src/site/xdoc/manual/bcel-api.xml new file mode 100644 index 00000000..8417f1d1 --- /dev/null +++ b/src/site/xdoc/manual/bcel-api.xml @@ -0,0 +1,645 @@ +<?xml version="1.0"?> +<!-- + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. +--> +<document> + <properties> + <title>The BCEL API</title> + </properties> + + <body> + <section name="The BCEL API"> + <p> + The <font face="helvetica,arial">BCEL</font> API abstracts from + the concrete circumstances of the Java Virtual Machine and how to + read and write binary Java class files. The API mainly consists + of three parts: + </p> + + <p> + + <ol type="1"> + <li> A package that contains classes that describe "static" + constraints of class files, i.e., reflects the class file format and + is not intended for byte code modifications. The classes may be + used to read and write class files from or to a file. This is + useful especially for analyzing Java classes without having the + source files at hand. The main data structure is called + <tt>JavaClass</tt> which contains methods, fields, etc..</li> + + <li> A package to dynamically generate or modify + <tt>JavaClass</tt> or <tt>Method</tt> objects. 
It may be used to + insert analysis code, to strip unnecessary information from class + files, or to implement the code generator back-end of a Java + compiler.</li> + + <li> Various code examples and utilities like a class file viewer, + a tool to convert class files into HTML, and a converter from + class files to the <a + href="http://jasmin.sourceforge.net">Jasmin</a> assembly + language.</li> + </ol> + </p> + + <subsection name="JavaClass"> + <p> + The "static" component of the <font + face="helvetica,arial">BCEL</font> API resides in the package + <tt>org.apache.bcel.classfile</tt> and closely represents class + files. All of the binary components and data structures declared + in the <a + href="http://docs.oracle.com/javase/specs/">JVM + specification</a> and described in section <a + href="#2 The Java Virtual Machine">2</a> are mapped to classes. + + <a href="#Figure 3">Figure 3</a> shows a UML diagram of the + hierarchy of classes of the <font face="helvetica,arial">BCEL + </font>API. <a href="#Figure 8">Figure 8</a> in the appendix also + shows a detailed diagram of the <tt>ConstantPool</tt> components. + </p> + + <p align="center"> + <a name="Figure 3"> + <img src="../images/javaclass.gif"/> <br/> + Figure 3: UML diagram for the JavaClass API</a> + </p> + + <p> + The top-level data structure is <tt>JavaClass</tt>, which in most + cases is created by a <tt>ClassParser</tt> object that is capable + of parsing binary class files. A <tt>JavaClass</tt> object + basically consists of fields, methods, and symbolic references to the + super class and to the implemented interfaces. + </p> + + <p> + The constant pool serves as a kind of central repository and is + thus of central importance for all components. + <tt>ConstantPool</tt> objects contain a fixed-size array of + <tt>Constant</tt> entries, which may be retrieved via the + <tt>getConstant()</tt> method taking an integer index as argument. 
+ Indexes to the constant pool may be contained in instructions as + well as in other components of a class file and in constant pool + entries themselves. + </p> + + <p> + Methods and fields contain a signature, symbolically defining + their types. Access flags like <tt>public static final</tt> occur + in several places and are encoded by an integer bit mask, e.g., + <tt>public static final</tt> corresponds to the Java expression + </p> + + + <source>int access_flags = ACC_PUBLIC | ACC_STATIC | ACC_FINAL;</source> + + <p> + As mentioned in <a href="jvm.html#Java_class_file_format">section + 2.1</a> already, several components may contain <em>attribute</em> + objects: classes, fields, methods, and <tt>Code</tt> objects + (introduced in <a href="jvm.html#Method_code">section 2.3</a>). The + latter is an attribute itself that contains the actual byte code + array, the maximum stack size, the number of local variables, a + table of handled exceptions, and some optional debugging + information coded as <tt>LineNumberTable</tt> and + <tt>LocalVariableTable</tt> attributes. Attributes are in general + specific to some data structure, i.e., no two components share the + same kind of attribute, though this is not explicitly + forbidden. In the figure the <tt>Attribute</tt> classes are stereotyped + with the component they belong to. + </p> + + </subsection> + + <subsection name="Class repository"> + <p> + Using the provided <tt>Repository</tt> class, reading class files into + a <tt>JavaClass</tt> object is quite simple: + </p> + + <source>JavaClass clazz = Repository.lookupClass("java.lang.String");</source> + + <p> + The repository also contains methods providing the dynamic equivalent + of the <tt>instanceof</tt> operator, and other useful routines: + </p> + + <source> +if (Repository.instanceOf(clazz, super_class)) { + ... 
+} + </source> + + </subsection> + + <h4>Accessing class file data</h4> + + <p> + Information within the class file components may be accessed like + Java Beans via intuitive set/get methods. All of them also define + a <tt>toString()</tt> method so that implementing a simple class + viewer is very easy. In fact all of the examples used here have + been produced this way: + </p> + + <source> +System.out.println(clazz); +printCode(clazz.getMethods()); +... +public static void printCode(Method[] methods) { + for (int i = 0; i < methods.length; i++) { + System.out.println(methods[i]); + + Code code = methods[i].getCode(); + if (code != null) // Non-abstract method + System.out.println(code); + } +} + </source> + + <h4>Analyzing class data</h4> + <p> + Last but not least, <font face="helvetica,arial">BCEL</font> + supports the <em>Visitor</em> design pattern, so one can write + visitor objects to traverse and analyze the contents of a class + file. Included in the distribution is a class + <tt>JasminVisitor</tt> that converts class files into the <a + href="http://jasmin.sourceforge.net">Jasmin</a> + assembler language. + </p> + + <subsection name="ClassGen"> + <p> + This part of the API (package <tt>org.apache.bcel.generic</tt>) + supplies an abstraction level for creating or transforming class + files dynamically. It makes the static constraints of Java class + files like the hard-coded byte code addresses "generic". The + generic constant pool, for example, is implemented by the class + <tt>ConstantPoolGen</tt> which offers methods for adding different + types of constants. Accordingly, <tt>ClassGen</tt> offers an + interface to add methods, fields, and attributes. + <a href="#Figure 4">Figure 4</a> gives an overview of this part of the API. 
+ </p> + + <p align="center"> + <a name="Figure 4"> + <img src="../images/classgen.gif"/> + <br/> + Figure 4: UML diagram of the ClassGen API</a> + </p> + + <h4>Types</h4> + <p> + We abstract from the concrete details of the type signature syntax + (see <a href="jvm.html#Type_information">2.5</a>) by introducing the + <tt>Type</tt> class, which is used, for example, by methods to + define their return and argument types. Concrete sub-classes are + <tt>BasicType</tt>, <tt>ObjectType</tt>, and <tt>ArrayType</tt> + which consists of the element type and the number of + dimensions. For commonly used types the class offers some + predefined constants. For example, the method signature of the + <tt>main</tt> method as shown in + <a href="jvm.html#Type_information">section 2.5</a> is represented by: + </p> + + <source> +Type return_type = Type.VOID; +Type[] arg_types = new Type[] { new ArrayType(Type.STRING, 1) }; + </source> + + <p> + <tt>Type</tt> also contains methods to convert types into textual + signatures and vice versa. The sub-classes contain implementations + of the routines and constraints specified by the Java Language + Specification. + </p> + + <h4>Generic fields and methods</h4> + <p> + Fields are represented by <tt>FieldGen</tt> objects, which may be + freely modified by the user. If they have the access rights + <tt>static final</tt>, i.e., are constants and of basic type, they + may optionally have an initializing value. + </p> + + <p> + Generic methods contain methods to add exceptions the method may + throw, local variables, and exception handlers. The latter two are + represented by user-configurable objects as well. Because + exception handlers and local variables contain references to byte + code addresses, they also take the role of an <em>instruction + targeter</em> in our terminology. Instruction targeters contain a + method <tt>updateTarget()</tt> to redirect a reference. This is + somewhat related to the Observer design pattern. 
Generic + (non-abstract) methods refer to <em>instruction lists</em> that + consist of instruction objects. References to byte code addresses + are implemented by handles to instruction objects. If the list is + updated the instruction targeters will be informed about it. This + is explained in more detail in the following sections. + </p> + + <p> + The maximum stack size needed by the method and the maximum number + of local variables used may be set manually or computed + automatically via the <tt>setMaxStack()</tt> and + <tt>setMaxLocals()</tt> methods. + </p> + + <h4>Instructions</h4> + <p> + Modeling instructions as objects may look somewhat odd at first + sight, but in fact enables programmers to obtain a high-level view + of control flow without handling details like concrete byte code + offsets. Instructions consist of an opcode (sometimes called + tag), their length in bytes and an offset (or index) within the + byte code. Since many instructions are immutable (stack operators, + for example), the <tt>InstructionConstants</tt> interface offers + shareable predefined "fly-weight" constants to use. + </p> + + <p> + Instructions are grouped via sub-classing; the type hierarchy of + instruction classes is illustrated by an (incomplete) figure in the + appendix. The most important family of instructions is the + <em>branch instructions</em>, e.g., <tt>goto</tt>, that branch to + targets somewhere within the byte code. Obviously, this makes them + candidates for playing an <tt>InstructionTargeter</tt> role, + too. Instructions are further grouped by the interfaces they + implement; there are, e.g., <tt>TypedInstruction</tt>s that are + associated with a specific type like <tt>ldc</tt>, or + <tt>ExceptionThrower</tt> instructions that may raise exceptions + when executed. + </p> + + <p> + All instructions can be traversed via <tt>accept(Visitor v)</tt> + methods, i.e., the Visitor design pattern. 
There is, however, a + special trick in these methods that allows merging the handling + of certain instruction groups. The <tt>accept()</tt> methods not only + call the corresponding <tt>visit()</tt> method, but first call the + <tt>visit()</tt> methods of their respective super classes and + implemented interfaces, i.e., the most specific + <tt>visit()</tt> call is last. Thus one can group the handling of, + say, all <tt>BranchInstruction</tt>s into one single method. + </p> + + <p> + For debugging purposes it may even make sense to "invent" your own + instructions. In a sophisticated code generator like the one used + as a backend of the <a href="http://barat.sourceforge.net">Barat + framework</a> for static analysis one often has to insert + temporary <tt>nop</tt> (No operation) instructions. When examining + the produced code it may be very difficult to trace back where the + <tt>nop</tt> was actually inserted. One could think of a derived + <tt>nop2</tt> instruction that contains additional debugging + information. When the instruction list is dumped to byte code, the + extra data is simply dropped. + </p> + + <p> + One could also think of new byte code instructions operating on + complex numbers that are replaced by normal byte code at + load time or are recognized by a new JVM. + </p> + + <h4>Instruction lists</h4> + <p> + An <em>instruction list</em> is implemented by a list of + <em>instruction handles</em> encapsulating instruction objects. + References to instructions in the list are thus not implemented by + direct pointers to instructions but by pointers to instruction + <em>handles</em>. This makes appending, inserting and deleting + areas of code very simple and also allows us to reuse immutable + instruction objects (fly-weight objects). Since we use symbolic + references, computation of concrete byte code offsets does not + need to occur until finalization, i.e., until the user has + finished the process of generating or transforming code. 
We will + use the term instruction handle and instruction synonymously + throughout the rest of the paper. Instruction handles may contain + additional user-defined data using the <tt>addAttribute()</tt> + method. + </p> + + <p> + <b>Appending:</b> One can append instructions or other instruction + lists anywhere to an existing list. The instructions are appended + after the given instruction handle. All append methods return a + new instruction handle which may then be used as the target of a + branch instruction, e.g.: + </p> + + <source> +InstructionList il = new InstructionList(); +... +GOTO g = new GOTO(null); +il.append(g); +... +// Use immutable fly-weight object +InstructionHandle ih = il.append(InstructionConstants.ACONST_NULL); +g.setTarget(ih); + </source> + + <p> + <b>Inserting:</b> Instructions may be inserted anywhere into an + existing list. They are inserted before the given instruction + handle. All insert methods return a new instruction handle which + may then be used as the start address of an exception handler, for + example. + </p> + + <source> +InstructionHandle start = il.insert(insertion_point, InstructionConstants.NOP); +... +mg.addExceptionHandler(start, end, handler, "java.io.IOException"); + </source> + + <p> + <b>Deleting:</b> Deletion of instructions is also very + straightforward; all instruction handles and the contained + instructions within a given range are removed from the instruction + list and disposed. The <tt>delete()</tt> method may however throw + a <tt>TargetLostException</tt> when there are instruction + targeters still referencing one of the deleted instructions. The + user is forced to handle such exceptions in a <tt>try-catch</tt> + clause and redirect these references elsewhere. The <em>peep + hole</em> optimizer described in the appendix gives a detailed + example for this. 
+ </p> + + <source> +try { + il.delete(first, last); +} catch (TargetLostException e) { + for (InstructionHandle target : e.getTargets()) { + for (InstructionTargeter targeter : target.getTargeters()) { + targeter.updateTarget(target, new_target); + } + } +} + </source> + + <p> + <b>Finalizing:</b> When the instruction list is ready to be dumped + to pure byte code, all symbolic references must be mapped to real + byte code offsets. This is done by the <tt>getByteCode()</tt> + method which is called by default by + <tt>MethodGen.getMethod()</tt>. Afterwards you should call + <tt>dispose()</tt> so that the instruction handles can be reused + internally. This helps to improve memory usage. + </p> + + <source> +InstructionList il = new InstructionList(); + +ClassGen cg = new ClassGen("HelloWorld", "java.lang.Object", + "<generated>", ACC_PUBLIC | ACC_SUPER, null); +MethodGen mg = new MethodGen(ACC_STATIC | ACC_PUBLIC, + Type.VOID, new Type[] { new ArrayType(Type.STRING, 1) }, + new String[] { "argv" }, "main", "HelloWorld", il, cp); +... +cg.addMethod(mg.getMethod()); +il.dispose(); // Reuse instruction handles of list + </source> + + <h4>Code example revisited</h4> + <p> + Using instruction lists gives us a generic view upon the code: In + <a href="#Figure 5">Figure 5</a> we again present the code chunk + of the <tt>readInt()</tt> method of the factorial example in section + <a href="jvm.html#Code_example">2.6</a>: The local variables + <tt>n</tt> and <tt>e1</tt> both hold two references to + instructions, defining their scope. There are two <tt>goto</tt>s + branching to the <tt>iload</tt> at the end of the method. One of + the exception handlers is displayed, too: it references the start + and the end of the <tt>try</tt> block and also the exception + handler code. 
+ </p> + + <p align="center"> + <a name="Figure 5"> + <img src="../images/il.gif"/> + <br/> + Figure 5: Instruction list for <tt>readInt()</tt> method</a> + </p> + + <h4>Instruction factories</h4> + <p> + To simplify the creation of certain instructions the user can use + the supplied <tt>InstructionFactory</tt> class, which offers many + useful methods to create instructions from + scratch. Alternatively, one can use <em>compound + instructions</em>: When producing byte code, some patterns + typically occur very frequently, for instance the compilation of + arithmetic or comparison expressions. You certainly do not want + to rewrite the code that translates such expressions into byte + code in every place they may appear. In order to support this, the + <font face="helvetica,arial">BCEL</font> API includes a <em>compound + instruction</em> (an interface with a single + <tt>getInstructionList()</tt> method). Instances of classes + implementing this interface + may be used in any place where normal instructions would occur, + particularly in append operations. + </p> + + <p> + <b>Example: Pushing constants</b> Pushing constants onto the + operand stack may be coded in different ways. As explained in <a + href="jvm.html#Byte_code_instruction_set">section 2.2</a> there are + some "short-cut" instructions that can be used to make the + produced byte code more compact. The smallest instruction to push + a single <tt>1</tt> onto the stack is <tt>iconst_1</tt>; other + possibilities are <tt>bipush</tt> (can be used to push values + between -128 and 127), <tt>sipush</tt> (between -32768 and 32767), + or <tt>ldc</tt> (load constant from constant pool). + </p> + + <p> + Instead of repeatedly selecting the most compact instruction in, + say, a switch, one can use the compound <tt>PUSH</tt> instruction + whenever pushing a constant number or string. It will produce the + appropriate byte code instruction and insert entries into the + constant pool if necessary. 
+ </p> + + <source> +InstructionFactory f = new InstructionFactory(class_gen); +InstructionList il = new InstructionList(); +... +il.append(new PUSH(cp, "Hello, world")); +il.append(new PUSH(cp, 4711)); +... +il.append(f.createPrintln("Hello World")); +... +il.append(f.createReturn(type)); + </source> + + <h4>Code patterns using regular expressions</h4> + <p> + When transforming code, for instance during optimization or when + inserting analysis method calls, one typically searches for + certain patterns of code to perform the transformation at. To + simplify handling such situations <font + face="helvetica,arial">BCEL </font>introduces a special feature: + One can search for given code patterns within an instruction list + using <em>regular expressions</em>. In such expressions, + instructions are represented by their opcode names, e.g., + <tt>LDC</tt>, one may also use their respective super classes, e.g., + "<tt>IfInstruction</tt>". Meta characters like <tt>+</tt>, + <tt>*</tt>, and <tt>(..|..)</tt> have their usual meanings. Thus, + the expression + </p> + + <source>"NOP+(ILOAD|ALOAD)*"</source> + + <p> + represents a piece of code consisting of at least one <tt>NOP</tt> + followed by a possibly empty sequence of <tt>ILOAD</tt> and + <tt>ALOAD</tt> instructions. + </p> + + <p> + The <tt>search()</tt> method of class + <tt>org.apache.bcel.util.InstructionFinder</tt> gets a regular + expression and a starting point as arguments and returns an + iterator describing the area of matched instructions. Additional + constraints to the matching area of instructions, which can not be + implemented via regular expressions, may be expressed via <em>code + constraint</em> objects. + </p> + + <h4>Example: Optimizing boolean expressions</h4> + <p> + In Java, boolean values are mapped to 1 and to 0, + respectively. Thus, the simplest way to evaluate boolean + expressions is to push a 1 or a 0 onto the operand stack depending + on the truth value of the expression. 
But this way, the + subsequent combination of boolean expressions (with + <tt>&&</tt>, for example) yields long chunks of code that push + lots of 1s and 0s onto the stack. + </p> + + <p> + When the code has been finalized these chunks can be optimized + with a <em>peep hole</em> algorithm: An <tt>IfInstruction</tt> + (e.g., the comparison of two integers: <tt>if_icmpeq</tt>) that + produces either a 1 or a 0 on the stack and is followed by an + <tt>ifne</tt> instruction (branch if stack value is not 0) may be + replaced by the <tt>IfInstruction</tt> with its branch target + replaced by the target of the <tt>ifne</tt> instruction: + </p> + + <source> +CodeConstraint constraint = new CodeConstraint() { + public boolean checkCode(InstructionHandle[] match) { + IfInstruction if1 = (IfInstruction) match[0].getInstruction(); + GOTO g = (GOTO) match[2].getInstruction(); + return (if1.getTarget() == match[3]) && + (g.getTarget() == match[4]); + } +}; + +InstructionFinder f = new InstructionFinder(il); +String pat = "IfInstruction ICONST_0 GOTO ICONST_1 NOP(IFEQ|IFNE)"; + +for (Iterator e = f.search(pat, constraint); e.hasNext(); ) { + InstructionHandle[] match = (InstructionHandle[]) e.next(); + ... + // Handles of branch instructions are BranchHandles + ((BranchHandle) match[0]).setTarget(((BranchHandle) match[5]).getTarget()); // Update target + ... + try { + il.delete(match[1], match[5]); + } catch (TargetLostException ex) { + ... + } +} + </source> + + <p> + The applied code constraint object ensures that the matched code + really corresponds to the targeted expression pattern. Subsequent + application of this algorithm removes all unnecessary stack + operations and branch instructions from the byte code. If any of + the deleted instructions is still referenced by an + <tt>InstructionTargeter</tt> object, the reference has to be + updated in the <tt>catch</tt>-clause. 
+ </p> + + <p> + <b>Example application:</b> + The expression: + </p> + + <source> + if ((a == null) || (i < 2)) + System.out.println("Ooops"); + </source> + + <p> + can be mapped to both of the chunks of byte code shown in <a + href="#Figure 6">figure 6</a>. The left column represents the + unoptimized code while the right column displays the same code + after the peep hole algorithm has been applied: + </p> + + <p align="center"><a name="Figure 6"> + <table> + <tr> + <td valign="top"><pre> + 5: aload_0 + 6: ifnull #13 + 9: iconst_0 + 10: goto #14 + 13: iconst_1 + 14: nop + 15: ifne #36 + 18: iload_1 + 19: iconst_2 + 20: if_icmplt #27 + 23: iconst_0 + 24: goto #28 + 27: iconst_1 + 28: nop + 29: ifne #36 + 32: iconst_0 + 33: goto #37 + 36: iconst_1 + 37: nop + 38: ifeq #52 + 41: getstatic System.out + 44: ldc "Ooops" + 46: invokevirtual println + 52: return + </pre></td> + <td valign="top"><pre> + 10: aload_0 + 11: ifnull #19 + 14: iload_1 + 15: iconst_2 + 16: if_icmpge #27 + 19: getstatic System.out + 22: ldc "Ooops" + 24: invokevirtual println + 27: return + </pre></td> + </tr> + </table> + </a> + </p> + </subsection> + </section> + </body> +</document>
\ No newline at end of file