斑斓视界
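Some time ago I needed to measure how long methods take to execute, so I used an Android Runtime Hook framework to implement aspect-oriented programming and dug into how the framework works. Reading articles and framework source alone still left gaps in my understanding of Android Runtime Hook, so in the end I decided to write a framework of my own to sort out the details of implementing one.
This article targets the ARM 64 architecture.
Floating point
The commonly cited rule for argument passing in a method call is as follows: register x0 holds the art::ArtMethod pointer of the callee, registers x1 ~ x7 hold the first 7 arguments, and the remaining arguments are passed on the stack. If the method is not static, the this pointer is the first argument and is held in x1.
But if arguments are parsed from the seven registers x1 ~ x7 alone, floating-point values cause serious trouble.
Take the following method call as an example: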
fun argumentCheck(register1: Byte, register2: Short, register3: Int, register4: Long, register5: Float, register6: Double) {
Log.i("Sample", "argumentCheck() : registers - $register1 $register2 $register3 $register4 $register5 $register6")
}
argumentCheck(1, 2, 3, 4, 5.0f, 6.0)
If the data in registers x1 ~ x7 is parsed directly, the floating-point arguments come out as garbage:
Before Hook: I/Sample: argumentCheck() : registers - 1 2 3 4 5.0 6.0
After Hook: I/Sample: argumentCheck() : registers - 1 2 3 4 1.4E-45 3.5E-323
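The reason is that floating-point values are stored and computed differently from integers, so CPUs provide dedicated registers for floating-point operations, such as the xmm registers on x86.
On ARM 64, the floating-point registers are named dX: there are 32 of them, d0 ~ d31, each 8 bytes wide. The low 4 bytes of each are addressable as sX, likewise 32 in total, s0 ~ s31. When Android Runtime lays out arguments, dX holds double arguments and sX holds float arguments.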
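The two garbage values in the After Hook line are what small integer bit patterns look like when they are read as floating-point numbers: 1.4E-45 is the float whose bits are 1, and 3.5E-323 is the double whose bits are 7. The sketch below reproduces that reinterpretation. The helper names are hypothetical, and which integer registers actually held those patterns is an assumption, since the source does not say.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits");
static_assert(sizeof(double) == sizeof(std::uint64_t), "double must be 64 bits");

// Reinterprets a 32-bit pattern as a float without converting the value.
float BitsToFloat(std::uint32_t bits) {
    float value;
    std::memcpy(&value, &bits, sizeof(value));
    return value;
}

// Reinterprets a 64-bit pattern as a double without converting the value.
double BitsToDouble(std::uint64_t bits) {
    double value;
    std::memcpy(&value, &bits, sizeof(value));
    return value;
}

// Hypothetical helper: reads saved integer register x<index> (1..7) as if it
// held a float argument, which is the mistake described in the text.
float MisreadAsFloat(const std::uint64_t* x_registers, int index) {
    assert(index >= 1 && index <= 7);
    return BitsToFloat(static_cast<std::uint32_t>(x_registers[index]));
}
```

A real 5.0f has the bit pattern 0x40A00000, so it can only be recovered from a register that actually received it, which on ARM 64 is one of the sX / dX registers.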
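How exactly floating-point arguments are laid out is set aside for now and covered in detail later.
Stack and pointers
This problem came from a habit of thought.
Because the registers are 64 bits wide, I carelessly assumed that every argument occupies 8 bytes on the stack on ARM 64, and things went wrong.
Again, take a method call as an example: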
fun argumentCheck(register1: Byte, register2: Short, register3: Int, register4: Long, register5: Float, register6: Double,
stack1: Boolean, stack2: Byte, stack3: Char, stack4: Short, stack5: Int, stack6: Long, stack7: Float, stack8: Double,
objNull: Any?, maskA: Long, obj: Any, maskB: Long) {
...
}
argumentCheck(
1, 2, 3, 4, 5.0f, 6.0, // registers
true, 7, '8', 9, 10, 11, 12.0f, 13.0, // stack
null, 0x1020304050607080, this@MainActivity, 0x1020304050607080
)
Dumping memory at the obtained stack-bottom address with LLDB gives the following.
(lldb) x/15x 0x7fd722d930
0x7fd722d930: 0x00000007 0x00000038 0x00000009 0x0000000a
0x7fd722d940: 0x0000000b 0x00000000 0x41400000 0x00000000
0x7fd722d950: 0x402a0000 0x00000000 0x50607080 0x10203040
0x7fd722d960: 0x76fc0ac0 0x50607080 0x10203040
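As the dump shows, only long and double values occupy 8 bytes; every other type occupies 4 bytes. When reading arguments from the stack, the read size must therefore be chosen according to the type being read.
Another point: although the first few arguments are held in registers, they still occupy stack space according to the type sizes above. Android Runtime may reuse that space for other purposes, so its contents cannot be relied on, but the offset from the stack bottom to an argument that is not in a register still has to be computed from the sizes of the arguments before it.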
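The offset rule can be sketched as a small helper that walks a shorty string (return type first, then one character per parameter) and accumulates slot sizes. The function names are hypothetical, and the sketch assumes that for a non-static method the 4-byte this reference occupies the first slot of the argument area.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Bytes occupied by one argument of the given shorty type:
// 8 for long ('J') and double ('D'), 4 for everything else,
// including object references.
std::size_t SlotSize(char shorty_type) {
    return (shorty_type == 'J' || shorty_type == 'D') ? 8 : 4;
}

// Returns the byte offset of every declared parameter from the start of the
// argument area. shorty[0] is the return type and is skipped.
std::vector<std::size_t> ArgOffsets(const std::string& shorty, bool is_static) {
    assert(!shorty.empty());
    std::vector<std::size_t> offsets;
    std::size_t offset = is_static ? 0 : 4;  // assumed: "this" comes first
    for (std::size_t i = 1; i < shorty.size(); ++i) {
        offsets.push_back(offset);
        offset += SlotSize(shorty[i]);
    }
    return offsets;
}
```

For the example call, a hand-written shorty would be "VBSIJFDZBCSIJFDLJLJ". Under the assumptions above, stack2 sits at offset 40 and objNull at offset 76, a distance of 36 bytes, which matches the distance between the values 0x00000007 and the null reference in the dump.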
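Interestingly, because two separator arguments were placed in the parameter list, it is easy to see that an object pointer occupies 4 bytes. Android Runtime uses 4-byte pointers to compress object references, at the cost of limiting object references to 4 GB of memory. The saving on a single pointer is negligible, but summed over every object pointer in the virtual machine it becomes substantial.
Argument passing
If a method's code entry is replaced with a piece of code that faults, calling the method crashes inside that code. The call stack at that point shows that the crash sits under art_quick_invoke_stub or art_quick_invoke_static_stub, depending on whether the method is static.
These two functions (to call them that for now) are what invoke the method's code.
One frame further along the call stack, both functions are called from art::ArtMethod::Invoke().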
void ArtMethod::Invoke(Thread* self, uint32_t* args, uint32_t args_size, JValue* result,
const char* shorty) {
...
if (...) {
...
} else {
...
bool have_quick_code = GetEntryPointFromQuickCompiledCode() != nullptr;
if (LIKELY(have_quick_code)) {
...
if (!IsStatic()) {
(*art_quick_invoke_stub)(this, args, args_size, self, result, shorty);
} else {
(*art_quick_invoke_static_stub)(this, args, args_size, self, result, shorty);
}
...
} else {
...
}
}
...
}
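The two are largely the same and differ only in how the this pointer is handled, so only art_quick_invoke_stub is explained here.
art_quick_invoke_stub is a code segment written in assembly, with a separate implementation per architecture. The ARM 64 code is in quick_entrypoints_arm64.S.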
/*
* extern"C" void art_quick_invoke_stub(ArtMethod *method, x0
* uint32_t *args, x1
* uint32_t argsize, w2
* Thread *self, x3
* JValue *result, x4
* char *shorty); x5
* +----------------------+
* | |
* | C/C++ frame |
* | LR'' |
* | FP'' | <- SP'
* +----------------------+
* +----------------------+
* | x28 | <- TODO: Remove callee-saves.
* | : |
* | x19 |
* | SP' |
* | X5 |
* | X4 | Saved registers
* | LR' |
* | FP' | <- FP
* +----------------------+
* | uint32_t out[n-1] |
* | : : | Outs
* | uint32_t out[0] |
* | ArtMethod* | <- SP value=null
* +----------------------+
*
* Outgoing registers:
* x0 - Method*
* x1-x7 - integer parameters.
* d0-d7 - Floating point parameters.
* xSELF = self
* SP = & of ArtMethod*
* x1 = "this" pointer.
*
*/
ENTRY art_quick_invoke_stub
// Spill registers as per AACPS64 calling convention.
INVOKE_STUB_CREATE_FRAME
// Fill registers x/w1 to x/w7 and s/d0 to s/d7 with parameters.
// Parse the passed shorty to determine which register to load.
// Load addresses for routines that load WXSD registers.
adr x11, .LstoreW2
adr x12, .LstoreX2
adr x13, .LstoreS0
adr x14, .LstoreD0
// Initialize routine offsets to 0 for integers and floats.
// x8 for integers, x15 for floating point.
mov x8, #0
mov x15, #0
add x10, x5, #1 // Load shorty address, plus one to skip return value.
ldr w1, [x9],#4 // Load "this" parameter, and increment arg pointer.
// Loop to fill registers.
.LfillRegisters:
ldrb w17, [x10], #1 // Load next character in signature, and increment.
cbz w17, .LcallFunction // Exit at end of signature. Shorty 0 terminated.
cmp w17, #'F' // is this a float?
bne .LisDouble
cmp x15, # 8*12 // Skip this load if all registers full.
beq .Ladvance4
add x17, x13, x15 // Calculate subroutine to jump to.
br x17
.LisDouble:
cmp w17, #'D' // is this a double?
bne .LisLong
cmp x15, # 8*12 // Skip this load if all registers full.
beq .Ladvance8
add x17, x14, x15 // Calculate subroutine to jump to.
br x17
.LisLong:
cmp w17, #'J' // is this a long?
bne .LisOther
cmp x8, # 6*12 // Skip this load if all registers full.
beq .Ladvance8
add x17, x12, x8 // Calculate subroutine to jump to.
br x17
.LisOther: // Everything else takes one vReg.
cmp x8, # 6*12 // Skip this load if all registers full.
beq .Ladvance4
add x17, x11, x8 // Calculate subroutine to jump to.
br x17
.Ladvance4:
add x9, x9, #4
b .LfillRegisters
.Ladvance8:
add x9, x9, #8
b .LfillRegisters
// Macro for loading a parameter into a register.
// counter - the register with offset into these tables
// size - the size of the register - 4 or 8 bytes.
// register - the name of the register to be loaded.
.macro LOADREG counter size register return
ldr \register , [x9], #\size
add \counter, \counter, 12
b \return
.endm
// Store ints.
.LstoreW2:
LOADREG x8 4 w2 .LfillRegisters
LOADREG x8 4 w3 .LfillRegisters
LOADREG x8 4 w4 .LfillRegisters
LOADREG x8 4 w5 .LfillRegisters
LOADREG x8 4 w6 .LfillRegisters
LOADREG x8 4 w7 .LfillRegisters
// Store longs.
.LstoreX2:
LOADREG x8 8 x2 .LfillRegisters
LOADREG x8 8 x3 .LfillRegisters
LOADREG x8 8 x4 .LfillRegisters
LOADREG x8 8 x5 .LfillRegisters
LOADREG x8 8 x6 .LfillRegisters
LOADREG x8 8 x7 .LfillRegisters
// Store singles.
.LstoreS0:
LOADREG x15 4 s0 .LfillRegisters
LOADREG x15 4 s1 .LfillRegisters
LOADREG x15 4 s2 .LfillRegisters
LOADREG x15 4 s3 .LfillRegisters
LOADREG x15 4 s4 .LfillRegisters
LOADREG x15 4 s5 .LfillRegisters
LOADREG x15 4 s6 .LfillRegisters
LOADREG x15 4 s7 .LfillRegisters
// Store doubles.
.LstoreD0:
LOADREG x15 8 d0 .LfillRegisters
LOADREG x15 8 d1 .LfillRegisters
LOADREG x15 8 d2 .LfillRegisters
LOADREG x15 8 d3 .LfillRegisters
LOADREG x15 8 d4 .LfillRegisters
LOADREG x15 8 d5 .LfillRegisters
LOADREG x15 8 d6 .LfillRegisters
LOADREG x15 8 d7 .LfillRegisters
.LcallFunction:
INVOKE_STUB_CALL_AND_RETURN
END art_quick_invoke_stub
.macro INVOKE_STUB_CREATE_FRAME
SAVE_SIZE=6*8 // x4, x5, x19, x20, FP, LR saved.
SAVE_TWO_REGS_INCREASE_FRAME x4, x5, SAVE_SIZE
SAVE_TWO_REGS x19, x20, 16
SAVE_TWO_REGS xFP, xLR, 32
mov xFP, sp // Use xFP for frame pointer, as it's callee-saved.
.cfi_def_cfa_register xFP
add x10, x2, #(__SIZEOF_POINTER__ + 0xf) // Reserve space for ArtMethod*, arguments and
and x10, x10, # ~0xf // round up for 16-byte stack alignment.
sub sp, sp, x10 // Adjust SP for ArtMethod*, args and alignment padding.
mov xSELF, x3 // Move thread pointer into SELF register.
// Copy arguments into stack frame.
// Use simple copy routine for now.
// 4 bytes per slot.
// X1 - source address
// W2 - args length
// X9 - destination address.
// W10 - temporary
add x9, sp, #8 // Destination address is bottom of stack + null.
// Copy parameters into the stack. Use numeric label as this is a macro and Clang's assembler
// does not have unique-id variables.
1:
cbz w2, 2f
sub w2, w2, #4 // Need 65536 bytes of range.
ldr w10, [x1, x2]
str w10, [x9, x2]
b 1b
2:
// Store null into ArtMethod* at bottom of frame.
str xzr, [sp]
.endm
.macro INVOKE_STUB_CALL_AND_RETURN
REFRESH_MARKING_REGISTER
// load method-> METHOD_QUICK_CODE_OFFSET
ldr x9, [x0, #ART_METHOD_QUICK_CODE_OFFSET_64]
// Branch to method.
blr x9
// Pop the ArtMethod* (null), arguments and alignment padding from the stack.
mov sp, xFP
.cfi_def_cfa_register sp
// Restore saved registers including value address and shorty address.
RESTORE_TWO_REGS x19, x20, 16
RESTORE_TWO_REGS xFP, xLR, 32
RESTORE_TWO_REGS_DECREASE_FRAME x4, x5, SAVE_SIZE
// Store result (w0/x0/s0/d0) appropriately, depending on resultType.
ldrb w10, [x5]
// Check the return type and store the correct register into the jvalue in memory.
// Use numeric label as this is a macro and Clang's assembler does not have unique-id variables.
// Don't set anything for a void type.
cmp w10, #'V'
beq 1f
// Is it a double?
cmp w10, #'D'
beq 2f
// Is it a float?
cmp w10, #'F'
beq 3f
// Just store x0. Doesn't matter if it is 64 or 32 bits.
str x0, [x4]
1: // Finish up.
ret
2: // Store double.
str d0, [x4]
ret
3: // Store float.
str s0, [x4]
ret
.endm
The explanation is split into several parts, starting with the function declaration and its incoming arguments.
/*
* extern"C" void art_quick_invoke_stub(ArtMethod *method, x0
* uint32_t *args, x1
* uint32_t argsize, w2
* Thread *self, x3
* JValue *result, x4
* char *shorty); x5
*/
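Under the C calling convention on ARM 64, the first 8 arguments of a function are passed in registers. Each argument means what its name suggests; the one worth explaining is the shorty pointer. It points to a string that identifies the types of the method's return value and parameters, using the primitive type descriptors from the Java class file format. Take the following method: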
fun argumentCheck(register1: Byte, register2: Short, register3: Int, register4: Long, register5: Float, register6: Double): Any { ... }
Its shorty string is "LBSIJFD", where shorty[0] describes the return type.
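As a sketch of how such a string relates to full JVM type descriptors: primitive types keep their descriptor character, while object and array types both collapse to 'L'. The function names below are hypothetical and this is not ART's own code.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Maps one JVM type descriptor (e.g. "I", "Ljava/lang/String;", "[I")
// to its shorty character.
char ShortyChar(const std::string& descriptor) {
    assert(!descriptor.empty());
    const char first = descriptor[0];
    return (first == '[') ? 'L' : first;  // arrays are references too
}

// Builds a shorty: return type first, then one character per parameter.
std::string ToShorty(const std::string& return_descriptor,
                     const std::vector<std::string>& param_descriptors) {
    std::string shorty(1, ShortyChar(return_descriptor));
    for (const std::string& descriptor : param_descriptors) {
        shorty.push_back(ShortyChar(descriptor));
    }
    return shorty;
}
```

Calling ToShorty("Ljava/lang/Object;", {"B", "S", "I", "J", "F", "D"}) yields the "LBSIJFD" string from the example above.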
.macro INVOKE_STUB_CREATE_FRAME
SAVE_SIZE=6*8 // x4, x5, x19, x20, FP, LR saved.
SAVE_TWO_REGS_INCREASE_FRAME x4, x5, SAVE_SIZE
SAVE_TWO_REGS x19, x20, 16
SAVE_TWO_REGS xFP, xLR, 32
mov xFP, sp // Use xFP for frame pointer, as it's callee-saved.
.cfi_def_cfa_register xFP
add x10, x2, #(__SIZEOF_POINTER__ + 0xf) // Reserve space for ArtMethod*, arguments and
and x10, x10, # ~0xf // round up for 16-byte stack alignment.
sub sp, sp, x10 // Adjust SP for ArtMethod*, args and alignment padding.
mov xSELF, x3 // Move thread pointer into SELF register.
// Copy arguments into stack frame.
// Use simple copy routine for now.
// 4 bytes per slot.
// X1 - source address
// W2 - args length
// X9 - destination address.
// W10 - temporary
add x9, sp, #8 // Destination address is bottom of stack + null.
// Copy parameters into the stack. Use numeric label as this is a macro and Clang's assembler
// does not have unique-id variables.
1:
cbz w2, 2f
sub w2, w2, #4 // Need 65536 bytes of range.
ldr w10, [x1, x2]
str w10, [x9, x2]
b 1b
2:
// Store null into ArtMethod* at bottom of frame.
str xzr, [sp]
.endm
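This macro initializes the stack frame and backs up some registers onto the stack. The main points to note are that xSELF (x19) holds the pointer to the current thread, and that the return address (LR) saved in the frame is what the ret instruction later uses to finish the call.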
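The size arithmetic in the macro (the add / and pair on x10) reserves room for the ArtMethod* slot plus the arguments and rounds the total up to a multiple of 16. A minimal sketch of the same computation follows; the constant and function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kPointerSize = 8;  // __SIZEOF_POINTER__ on ARM 64

// Mirrors:  add x10, x2, #(__SIZEOF_POINTER__ + 0xf)
//           and x10, x10, # ~0xf
// args_size is the byte length of the argument array (w2).
constexpr std::uint64_t OutsAreaSize(std::uint32_t args_size) {
    return (args_size + kPointerSize + 0xf) & ~static_cast<std::uint64_t>(0xf);
}

// Mirrors:  add x9, sp, #8
// The copied arguments start right after the null ArtMethod* slot.
constexpr std::uint64_t kArgsOffsetFromSp = kPointerSize;

static_assert(OutsAreaSize(0) == 16, "the ArtMethod* slot alone still takes 16 bytes");
```

For a 100-byte argument array the stack pointer is lowered by 112 bytes: 8 for the ArtMethod* slot, 100 for the arguments, and 4 of alignment padding.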
// Fill registers x/w1 to x/w7 and s/d0 to s/d7 with parameters.
// Parse the passed shorty to determine which register to load.
// Load addresses for routines that load WXSD registers.
adr x11, .LstoreW2
adr x12, .LstoreX2
adr x13, .LstoreS0
adr x14, .LstoreD0
// Initialize routine offsets to 0 for integers and floats.
// x8 for integers, x15 for floating point.
mov x8, #0
mov x15, #0
...
// Store ints.
.LstoreW2:
LOADREG x8 4 w2 .LfillRegisters
...
LOADREG x8 4 w7 .LfillRegisters
// Store longs.
.LstoreX2:
LOADREG x8 8 x2 .LfillRegisters
...
LOADREG x8 8 x7 .LfillRegisters
// Store singles.
.LstoreS0:
LOADREG x15 4 s0 .LfillRegisters
...
LOADREG x15 4 s7 .LfillRegisters
// Store doubles.
.LstoreD0:
LOADREG x15 8 d0 .LfillRegisters
...
LOADREG x15 8 d7 .LfillRegisters
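In the second part, x11 ~ x14 hold the base addresses of the load routines for other types, long, float and double respectively, while x8 and x15 serve as offsets for integer and floating-point arguments, used to select the instruction that loads into the N-th register.
For example, x13 holds the address of the label .LstoreS0:, which points at the code LOADREG x15 4 s0 .LfillRegisters. Each LOADREG expansion is 12 bytes long, so when the third floating-point value is to be loaded, x15 has the value 12 * (3 - 1) = 24, x13 + x15 points at the code LOADREG x15 4 s2 .LfillRegisters, and the value is loaded into register s2.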
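The offset arithmetic follows from the listing itself: every LOADREG expansion is 3 instructions of 4 bytes, the counter grows by 12 per load, and the loop compares the counters against 6*12 and 8*12. A small sketch with hypothetical names:

```cpp
#include <cassert>

constexpr unsigned kLoadRegBytes = 12;  // 3 instructions * 4 bytes each

// Offset held in x8 / x15 after `loaded_so_far` registers have been filled.
constexpr unsigned RoutineOffset(unsigned loaded_so_far) {
    return loaded_so_far * kLoadRegBytes;
}

// Mirrors "cmp x8, # 6*12" and "cmp x15, # 8*12".
constexpr bool IntRegistersFull(unsigned offset) { return offset == 6 * kLoadRegBytes; }
constexpr bool FpRegistersFull(unsigned offset) { return offset == 8 * kLoadRegBytes; }

// Register number selected by an offset: integer loads start at w2 / x2
// (x1 already holds "this"), floating-point loads start at s0 / d0.
constexpr unsigned IntRegisterNumber(unsigned offset) { return 2 + offset / kLoadRegBytes; }
constexpr unsigned FpRegisterNumber(unsigned offset) { return offset / kLoadRegBytes; }

static_assert(RoutineOffset(2) == 24, "the third float is loaded 24 bytes past .LstoreS0");
static_assert(FpRegisterNumber(24) == 2, "offset 24 selects s2 / d2");
```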
add x10, x5, #1 // Load shorty address, plus one to skip return value.
ldr w1, [x9],#4 // Load "this" parameter, and increment arg pointer.
This loads the type descriptor string and the this pointer, which are kept in x10 and w1 respectively.
// Loop to fill registers.
.LfillRegisters:
ldrb w17, [x10], #1 // Load next character in signature, and increment.
cbz w17, .LcallFunction // Exit at end of signature. Shorty 0 terminated.
cmp w17, #'F' // is this a float?
bne .LisDouble
cmp x15, # 8*12 // Skip this load if all registers full.
beq .Ladvance4
add x17, x13, x15 // Calculate subroutine to jump to.
br x17
.LisDouble:
cmp w17, #'D' // is this a double?
bne .LisLong
cmp x15, # 8*12 // Skip this load if all registers full.
beq .Ladvance8
add x17, x14, x15 // Calculate subroutine to jump to.
br x17
.LisLong:
cmp w17, #'J' // is this a long?
bne .LisOther
cmp x8, # 6*12 // Skip this load if all registers full.
beq .Ladvance8
add x17, x12, x8 // Calculate subroutine to jump to.
br x17
.LisOther: // Everything else takes one vReg.
cmp x8, # 6*12 // Skip this load if all registers full.
beq .Ladvance4
add x17, x11, x8 // Calculate subroutine to jump to.
br x17
.Ladvance4:
add x9, x9, #4
b .LfillRegisters
.Ladvance8:
add x9, x9, #8
b .LfillRegisters
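This part fills the registers, as described above. Once the registers are full, the remaining arguments stay on the stack and only the argument pointer in x9 is advanced past them; .Ladvance4 and .Ladvance8 correspond to the two step sizes.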
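The loop can be modelled in a few lines to see where each parameter of a non-static method ends up. This is a simplified sketch with hypothetical names: it tracks only the register names chosen by the loop, not the values loaded or the x9 pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// For each parameter in the shorty (shorty[0] is the return type), returns
// the register the fill loop would load it into, or "stack" if the matching
// register bank is already full. "this" is in w1 and is not listed.
std::vector<std::string> AssignRegisters(const std::string& shorty) {
    assert(!shorty.empty());
    std::vector<std::string> placement;
    unsigned int_used = 0;  // plays the role of x8 / 12, covers w2..w7 / x2..x7
    unsigned fp_used = 0;   // plays the role of x15 / 12, covers s0..s7 / d0..d7
    for (std::size_t i = 1; i < shorty.size(); ++i) {
        const char type = shorty[i];
        if (type == 'F' || type == 'D') {
            if (fp_used == 8) {
                placement.push_back("stack");
            } else {
                placement.push_back((type == 'F' ? "s" : "d") + std::to_string(fp_used++));
            }
        } else {
            if (int_used == 6) {
                placement.push_back("stack");
            } else {
                placement.push_back((type == 'J' ? "x" : "w") + std::to_string(2 + int_used++));
            }
        }
    }
    return placement;
}
```

Note that float and double share one counter, so a float followed by a double lands in s0 and d1, not s0 and d0.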
// Macro for loading a parameter into a register.
// counter - the register with offset into these tables
// size - the size of the register - 4 or 8 bytes.
// register - the name of the register to be loaded.
.macro LOADREG counter size register return
ldr \register , [x9], #\size
add \counter, \counter, 12
b \return
.endm
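This is the instruction sequence that fills a register. It is a macro whose expansion is 12 bytes long (3 instructions of 4 bytes each). After the data is loaded into a register, the corresponding offset is incremented by 12 so that it points at the code that fills the next register.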
.macro INVOKE_STUB_CALL_AND_RETURN
REFRESH_MARKING_REGISTER
// load method-> METHOD_QUICK_CODE_OFFSET
ldr x9, [x0, #ART_METHOD_QUICK_CODE_OFFSET_64]
// Branch to method.
blr x9
// Pop the ArtMethod* (null), arguments and alignment padding from the stack.
mov sp, xFP
.cfi_def_cfa_register sp
// Restore saved registers including value address and shorty address.
RESTORE_TWO_REGS x19, x20, 16
RESTORE_TWO_REGS xFP, xLR, 32
RESTORE_TWO_REGS_DECREASE_FRAME x4, x5, SAVE_SIZE
// Store result (w0/x0/s0/d0) appropriately, depending on resultType.
ldrb w10, [x5]
// Check the return type and store the correct register into the jvalue in memory.
// Use numeric label as this is a macro and Clang's assembler does not have unique-id variables.
// Don't set anything for a void type.
cmp w10, #'V'
beq 1f
// Is it a double?
cmp w10, #'D'
beq 2f
// Is it a float?
cmp w10, #'F'
beq 3f
// Just store x0. Doesn't matter if it is 64 or 32 bits.
str x0, [x4]
1: // Finish up.
ret
2: // Store double.
str d0, [x4]
ret
3: // Store float.
str s0, [x4]
ret
.endm
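In the final step, the method's entry code pointer is loaded from its offset in the ArtMethod into x9 and branched to; afterwards the return value is written to the memory pointed to by x4 according to the return type, completing the call.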
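The return-value handling amounts to a switch on shorty[0]. The sketch below models it with an 8-byte buffer standing in for the JValue that x4 points to; ResultSlot and the function names are hypothetical, and the three value parameters stand in for the x0, s0 and d0 registers.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for the 8 bytes of result storage.
struct ResultSlot {
    unsigned char bytes[8];
};

// Stores the register that matches the return type, as the macro does.
void StoreResult(char return_type, std::uint64_t x0, float s0, double d0,
                 ResultSlot* result) {
    assert(result != nullptr);
    switch (return_type) {
        case 'V': break;                                                  // void: store nothing
        case 'D': std::memcpy(result->bytes, &d0, sizeof(d0)); break;     // str d0, [x4]
        case 'F': std::memcpy(result->bytes, &s0, sizeof(s0)); break;     // str s0, [x4]
        default:  std::memcpy(result->bytes, &x0, sizeof(x0)); break;     // str x0, [x4]
    }
}

std::uint64_t ReadAsU64(const ResultSlot& slot) {
    std::uint64_t value;
    std::memcpy(&value, slot.bytes, sizeof(value));
    return value;
}

double ReadAsDouble(const ResultSlot& slot) {
    double value;
    std::memcpy(&value, slot.bytes, sizeof(value));
    return value;
}
```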
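Concurrency
To keep stack unwinding from going wrong and crashing the virtual machine, the trampoline code cannot save data on the stack, so we use an externally allocated object (of a type named Box) to back up the register data we need. Compared with the stack, this approach has an obvious drawback: a stack is private to its thread, whereas data stored in external memory gets overwritten when the method is called concurrently.
The trampoline code therefore needs special handling; a spin lock is used here to address thread safety.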
bridge_match:
ldr x16, bridge_box_pointer
check_lock:
ldr x17, [x16]
cmp x17, #0
bne check_lock
str x19, [x16]
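Here x16 points to the Box object used to store the data. Its first field is the lock: register data may be written only when the lock value is 0, and each thread sets the lock to its own thread pointer before it starts writing.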
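One caveat about this kind of lock: when the check and the store are separate instructions, two threads can both observe 0 before either one stores, and both then proceed. A compare-and-swap closes that window by doing the check and the store as one atomic step. The following is a minimal sketch of that idea using std::atomic, not the framework's trampoline code; BoxLock and the function names are hypothetical, and only the lock word of Box is modelled.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical model of the first field of Box.
struct BoxLock {
    std::atomic<std::uintptr_t> owner{0};  // 0 means unlocked
};

// Attempts to change the lock word from 0 to `self` atomically.
bool TryLock(BoxLock& lock, std::uintptr_t self) {
    assert(self != 0);
    std::uintptr_t expected = 0;
    return lock.owner.compare_exchange_strong(
        expected, self, std::memory_order_acquire, std::memory_order_relaxed);
}

// Spins until the lock is acquired.
void Lock(BoxLock& lock, std::uintptr_t self) {
    while (!TryLock(lock, self)) {
        // Busy-wait; another thread currently owns the Box.
    }
}

// Publishes the writes made under the lock and marks the Box as free.
void Unlock(BoxLock& lock) {
    lock.owner.store(0, std::memory_order_release);
}
```

On ARM 64 a compiler typically lowers the compare-exchange to an exclusive load/store pair or a single CAS instruction, so the same idea can be expressed directly in trampoline assembly.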
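Once the jump to the target method is complete, the saved register data has to be parsed. For performance we should not hold the lock while parsing, yet we also need to keep other threads from overwriting the data during parsing.
The solution is to make a copy of the Box object for parsing and release the lock on the original Box as soon as the copy is done.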
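Box *Runtime::UnlockAndCopyBox(Box *origin) {
// Snapshot the Box while the current thread still holds the lock.
auto *clone = reinterpret_cast<Box *>(malloc(sizeof(Box)));
if (clone != nullptr) {
memcpy(clone, origin, sizeof(Box));
}
// Release the lock so that other threads can reuse the original Box.
origin->lock_ = 0;
// The caller owns the copy and must free() it; nullptr means allocation failed.
return clone;
}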