Reverse-Engineering
Instruction Encodings
Wilson Hsieh, University of Utah
Dawson Engler, Stanford University
Godmar Back, University of Utah
What’s the Problem?
Dynamic code generation, JIT compilation
– Emit instructions quickly
– Therefore, avoid assembler
Need to know how to produce binary instructions
Want to express instructions in assembly
“Generate add %l1, %l2, %l1 for SPARC”
Reverse-Engineering Instruction Encodings
USENIX ‘01
What Do We Do?
How can I get the following mapping:
assembly instruction -> binary format
That mapping exists in the assembler already!
assembly instruction -> assembler -> binary instruction
So let’s reverse-engineer it out of the assembler.
DERIVE Tool Chain
[Figure: DERIVE tool chain. Boxes: instruction description, assembler, DERIVE, encoding description, code emitter generator, code emitter, debugger, disassembler, JIT compiler]
Instruction Descriptions
/* SPARC fragment */
iregs = ( %g0, %g1, %g2, ..., %i6, %i7 );
and, andcc, andn, ...
    &op& r_1:iregs, r_2:iregs, r_dest:iregs
  | &op& r_1:iregs, imm, r_dest:iregs
  ;
ba, bn, bne, ...
    &op& &label&
  | &op&",a" &label&
  ;
Encoding Descriptions
/* MIPS breakpoint instruction */
{ "break", "&op& imm",
  1,                          /* operand */
  4,                          /* bytes */
  ...
  { 0xd, 0x0, 0x0, 0x0, },    /* opcode information */
  {                           /* operand information */
    { "imm",                  /* name */
      IMMED,                  /* an immediate */
      IDENT,                  /* encoded value = input value */
      0,                      /* lowest value */
      10,                     /* length */
      ...
      16,                     /* bit offset */
      I_UNSIGNED,             /* unsigned field */
      ... },
  } }
Code Emitters
/* x86 addl instruction */
#define E_addl_rr_1(_code, rf, rt) do {\
    register unsigned short _0 = (0xc001\
        | ((((rf)) << 11))\
        | ((((rt)) << 8)));\
    *(unsigned short*)((char*) _code) = _0;\
    _code = (void *)((char *) _code + 2);\
} while (0)
/* emit "addl %ecx, %ebx" in code_buffer */
E_addl_rr_1(code_buffer, REGecx, REGebx);
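To make the arithmetic in the macro concrete, here is a minimal sketch of the same emitter written as a C function. It is an illustration, not DERIVE output: the name emit_addl_rr is hypothetical, the REG values are assumed to be the standard x86 register numbers, and the two bytes are stored one at a time so that the sketch does not depend on host byte order or alignment.

```c
#include <assert.h>

/* Assumed register numbers (standard x86 encoding); names follow the slide. */
enum { REGeax = 0, REGecx = 1, REGedx = 2, REGebx = 3 };

/* Hypothetical function form of E_addl_rr_1: emit "addl rf, rt" at code
 * and return the address just past the emitted instruction. */
unsigned char *emit_addl_rr(unsigned char *code, unsigned rf, unsigned rt)
{
    unsigned word;

    /* Runtime check: each register number must fit its 3-bit field. */
    assert(rf < 8 && rt < 8);

    /* Same constant and shifts as the macro: opcode bits 0xc001,
     * source register at bit 11, destination register at bit 8. */
    word = 0xc001u | (rf << 11) | (rt << 8);

    code[0] = (unsigned char)(word & 0xffu);  /* low byte first */
    code[1] = (unsigned char)(word >> 8);
    return code + 2;
}
```

Calling emit_addl_rr(buf, REGecx, REGebx) should leave the bytes 0x01 0xcb in buf, which is the usual encoding of addl %ecx, %ebx.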
Instruction Model
Opcode
Registers (names)
– Register sets
– Cache prefetch hints on MIPS
– Address scale on x86
Immediates (integers)
– Not registers
Labels (jump targets)
– Absolute jumps
– Relative jumps
[Figure: a 32-bit instruction word, bit 31 down to bit 0, divided into OPCODE, ARG 1, ARG 2, and ARG 3 fields]
Overall Strategy
Solve for one field at a time
– Hold other fields fixed and vary the desired field
– Use randomization when necessary to find legal values
Anything that is not in a field is the opcode
Intuition Behind DERIVE
Assembly instruction      Binary encoding
and %g7, %g6, %g0;        0x8009 0xc006
and %g7, %g6, %g1;        0x8209 0xc006
and %g7, %g6, %g2;        0x8409 0xc006
and %g7, %g6, %g3;        0x8609 0xc006
and %g7, %g6, %g4;        0x8809 0xc006
and %g7, %g6, %g5;        0x8a09 0xc006
and %g7, %g6, %g6;        0x8c09 0xc006
and %g7, %g6, %g7;        0x8e09 0xc006
and %g7, %g6, %o0;        0x9009 0xc006
and %g7, %g6, %o1;        0x9209 0xc006
and %g7, %g6, %o2;        0x9409 0xc006
and %g7, %g6, %o3;        0x9609 0xc006
and %g7, %g6, %o4;        0x9809 0xc006
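The pattern in the thirteen encodings listed above can be checked mechanically. The sketch below is an illustration only, not DERIVE's code: it joins each pair of half-words into one 32-bit word and XORs every sample against the first to see which bits ever change. The names observed and varying_bits are hypothetical. With only thirteen of the registers sampled, four bits are seen to vary; enumerating every register would be needed to expose the full width of the field.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The thirteen encodings from the slide, joined into 32-bit words. */
const uint32_t observed[13] = {
    0x8009c006u, 0x8209c006u, 0x8409c006u, 0x8609c006u,
    0x8809c006u, 0x8a09c006u, 0x8c09c006u, 0x8e09c006u,
    0x9009c006u, 0x9209c006u, 0x9409c006u, 0x9609c006u,
    0x9809c006u
};

/* Return the set of bits that differ from the first sample in at
 * least one of the other samples. */
uint32_t varying_bits(const uint32_t *words, size_t n)
{
    uint32_t mask = 0;
    size_t i;

    assert(n > 0);
    for (i = 1; i < n; i++)
        mask |= words[i] ^ words[0];
    return mask;
}
```

For this data, varying_bits(observed, 13) yields 0x1e000000, and the bits outside that mask are identical in every sample, which is what lets them be attributed to the opcode and the two fixed operands.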
DERIVE Structure
Field Type                Solver
register fields           register solver
absolute jump targets     immediate solver
immediate fields          immediate solver
relative jump targets     jump solver
Register Solver
Primary assumptions (for purposes of the talk):
– Register fields are independent
– All register values are legal
Enumerate registers for one field at a time
– Hold other fields constant
– Solve each field separately
Example: 3 register fields, 5 bits per field
– 2^5 * 3 = 32 * 3 = 96 combinations, versus 2^15 = 32768 if all three fields were varied together
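The enumeration described above can be sketched in a few lines of C. This is a toy under stated assumptions, not DERIVE itself: toy_assemble_and is a hypothetical stand-in for running the real assembler, and it hard-codes the SPARC format-3 layout (rs1 at bit 14, rs2 at bit 0, rd at bit 25) that the solver is supposed to rediscover. The names solve_register_field, solve_opcode, and assembler_calls are also invented for the example.

```c
#include <assert.h>
#include <stdint.h>

#define NREGS   32   /* registers per field, as in the example above */
#define NFIELDS 3

unsigned assembler_calls = 0;   /* counts invocations of the toy assembler */

/* Stand-in for assembling "and r[0], r[1], r[2]" (assumed SPARC layout). */
uint32_t toy_assemble_and(const unsigned r[NFIELDS])
{
    assert(r[0] < NREGS && r[1] < NREGS && r[2] < NREGS);
    assembler_calls++;
    return 0x80080000u | ((uint32_t)r[2] << 25)
                       | ((uint32_t)r[0] << 14)
                       | (uint32_t)r[1];
}

/* Vary one register field over every register while the other fields stay
 * fixed. Returns the mask of bits that changed; *sample receives the
 * encoding with the varied field set to register 0. */
uint32_t solve_register_field(unsigned field, uint32_t *sample)
{
    unsigned regs[NFIELDS] = { 7, 6, 0 };   /* as in "and %g7, %g6, ..." */
    uint32_t mask = 0;
    unsigned r;

    assert(field < NFIELDS);
    regs[field] = 0;
    *sample = toy_assemble_and(regs);
    for (r = 1; r < NREGS; r++) {
        regs[field] = r;
        mask |= toy_assemble_and(regs) ^ *sample;
    }
    return mask;
}

/* Solve each field separately, then treat every bit that belongs to no
 * field as part of the opcode. */
uint32_t solve_opcode(uint32_t masks[NFIELDS])
{
    uint32_t all = 0, sample = 0;
    unsigned f;

    for (f = 0; f < NFIELDS; f++) {
        masks[f] = solve_register_field(f, &sample);
        all |= masks[f];
    }
    return sample & ~all;
}
```

With this toy assembler, solve_opcode makes 96 assembler calls in total, 32 per field, matching the count in the example.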
Immediate Solver
Primary assumptions:
– Immediate field is a single range of bits in instruction
Explore each bit size to find encoding of one field
– Values of 1, 2, 4, 8, 16, ...
– Again, hold other fields constant
Example: 10-bit immediate field
– 10 combinations
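A minimal sketch of the power-of-two probing described above, under stated assumptions. toy_assemble_break is a hypothetical stand-in for the assembler on MIPS "break imm"; it bakes in the layout from the encoding description shown earlier (a 10-bit unsigned field at bit offset 16, opcode word 0x0000000d) and rejects values that do not fit. The struct and function names are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the assembler: returns 0 and stores the encoding on
 * success, or -1 if the immediate is rejected as out of range. */
int toy_assemble_break(uint32_t imm, uint32_t *word)
{
    assert(word != 0);
    if (imm >= (1u << 10))
        return -1;
    *word = 0x0000000du | (imm << 16);
    return 0;
}

struct imm_field { unsigned offset, width; };

/* Probe 1, 2, 4, 8, ... with everything else fixed. Each accepted probe
 * flips one bit relative to the encoding of 0; the first flipped bit gives
 * the field offset and the number of accepted probes gives the width. */
struct imm_field solve_immediate(unsigned *probes)
{
    struct imm_field f = { 0, 0 };
    uint32_t base = 0, word = 0;
    unsigned bit;

    if (toy_assemble_break(0, &base) != 0)
        return f;
    *probes = 0;
    for (bit = 0; bit < 32; bit++) {
        uint32_t diff;
        unsigned pos = 0;

        if (toy_assemble_break(1u << bit, &word) != 0)
            break;                    /* value no longer fits the field */
        diff = word ^ base;
        assert(diff != 0);
        while (!((diff >> pos) & 1u))
            pos++;                    /* position of the flipped bit */
        if (bit == 0)
            f.offset = pos;
        f.width++;
        (*probes)++;
    }
    return f;
}
```

For the toy assembler this takes ten accepted probes and reports offset 16 and width 10, in line with the 10-bit example above. The sketch assumes the identity encoding (encoded value equals input value); other transformations would need extra handling.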
Jump Solver
Primary assumptions:
– Label field is a single range of bits
Emit jumps to different offsets
– Find where label goes for encoding of “0”
– Find smallest jump size
– Find high bit by emitting a negative-valued jump
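The three steps above can be sketched as follows. This is a toy under stated assumptions rather than DERIVE's implementation: toy_assemble_ba is a hypothetical stand-in for assembling a SPARC "ba" to a label a given number of bytes away, with an assumed 22-bit word displacement in the low bits and opcode bits 0x10800000. It also assumes that a zero-byte offset is the label position that encodes as 0, which a real solver would have to search for, and it ignores range limits.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the assembler: returns 0 and stores the encoding, or -1
 * for targets that are not word-aligned. */
int toy_assemble_ba(int32_t offset, uint32_t *word)
{
    assert(word != 0);
    if (offset % 4 != 0)
        return -1;
    *word = 0x10800000u | ((uint32_t)(offset / 4) & 0x3fffffu);
    return 0;
}

struct jump_field { unsigned low_bit, high_bit; int32_t scale; };

struct jump_field solve_jump(void)
{
    struct jump_field f = { 0, 0, 0 };
    uint32_t zero = 0, word = 0, diff;
    int32_t off;
    unsigned b;

    /* Step 1: encoding of "0" (assumed to be a jump of zero bytes). */
    if (toy_assemble_ba(0, &zero) != 0)
        return f;

    /* Step 2: smallest jump size that assembles and changes the encoding. */
    for (off = 1; off <= 64; off *= 2) {
        if (toy_assemble_ba(off, &word) == 0 && word != zero) {
            f.scale = off;
            break;
        }
    }
    assert(f.scale != 0);
    diff = word ^ zero;               /* the smallest jump flips the low bit */
    while (!((diff >> f.low_bit) & 1u))
        f.low_bit++;

    /* Step 3: a negative jump sets every bit of a two's-complement field,
     * so the highest changed bit is the top of the field. */
    if (toy_assemble_ba(-f.scale, &word) != 0)
        return f;
    diff = word ^ zero;
    for (b = 0; b < 32; b++)
        if ((diff >> b) & 1u)
            f.high_bit = b;
    return f;
}
```

On the toy assembler this reports a jump granularity of 4 bytes and a label field spanning bits 0 through 21.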
Solving Time
Processor     Run Time (minutes)     Description (lines)
Alpha         6.3                    104
ARM           ~43                    227
MIPS          2.5                    81
PowerPC       4.8                    186
SPARC         4.8                    97
x86           ~240                   221
x86-kaffe     4.9                    106
Instruction Emitter Generator
Reads in DERIVE-generated specifications
Produces C macros
– Can generate runtime checks
– Debugging support
– Handles multiple instruction encodings
– “Linkage” macros for backpatching
Used to retarget Kaffe (publicly available JVM) on x86
– Reduced backend description from 2084 to 1267 lines (40%)
Extensions
Can handle instructions that take a subset of registers
– SPARC double-word loads
Special encodings that are register-dependent
– %eax on x86
Can handle simple transformations
– Low bits dropped off of jump offsets
User can specify transformations
– Address scaling on x86
User can specify registers that are dependent
– PowerPC post-increment instructions
Future Work
Extending DERIVE
– Fields that are broken up into multiple bit ranges
– Memoization of computations
ATOM-like tools
– Reverse-engineering linkers
Related Work
Instruction encoding munging
– NJ Toolkit [Ramsey & Fernández, USENIX 1995]
Testing assemblers
– NJ Toolkit [Fernández and Ramsey, ICSE 1997]
Reverse engineering compiler technology
– Retarget back-end generators [Collberg, PLDI 1997]
Summary
DERIVE is a cool hack, but it isn’t just a hack.
– It is a useful tool.
– It is a good proof of concept.
– We did some clever tricks to build it.
http://www.cs.utah.edu/~wilson/derive.tar.gz