Kaleidoscope (斑斓视界) Development Notes

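A while ago I needed to measure method execution time, so I used an Android Runtime hook framework to do aspect-oriented programming and dug into how such frameworks work. Reading articles and framework source alone still left gaps in my understanding of Android Runtime hooking, so in the end I decided to write a framework myself to sort out the implementation details.

This article targets the ARM64 architecture.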

Floating Point

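The conventional explanation is that argument passing for a method call follows these rules: register x0 holds the art::ArtMethod pointer of the callee, registers x1 ~ x7 hold the first 7 arguments, and the remaining arguments are passed on the stack. If the method is not static, the this pointer is the first argument and is held in x1.

But if arguments are parsed using only the seven registers x1 ~ x7, floating-point values cause serious trouble.

Take the following method call as an example: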

fun argumentCheck(register1: Byte, register2: Short, register3: Int, register4: Long, register5: Float, register6: Double) {
    Log.i("Sample", "argumentCheck() : registers - $register1 $register2 $register3 $register4 $register5 $register6")
}

argumentCheck(1, 2, 3, 4, 5.0f, 6.0)

If the data in x1 ~ x7 is parsed directly, the result looks like this mess.

Before Hook:    I/Sample: argumentCheck() : registers - 1 2 3 4 5.0 6.0
After Hook:     I/Sample: argumentCheck() : registers - 1 2 3 4 1.4E-45 3.5E-323

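The reason is that floating-point values are stored and computed differently from integers, so CPUs provide dedicated registers for floating-point operations, such as the xmm registers on x86.

On ARM64 the floating-point registers are named dX: d0 ~ d31, 32 in total, each 8 bytes wide. The low 4 bytes of each are addressable as sX, likewise s0 ~ s31, 32 in total. When Android Runtime lays out arguments, dX holds double arguments and sX holds float arguments.

How exactly floating-point arguments are laid out is set aside for now and covered in detail later.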
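To make the garbage values in the log less mysterious, here is a minimal sketch of what happens when the bits of a general-purpose register are read as a float or double. The helper names are my own, and which integers actually sat in the registers is an assumption; the logged 1.4E-45 and 3.5E-323 are simply consistent with the bit patterns 1 and 7.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reinterpret the low 32 bits of a general-purpose register as a float.
float BitsToFloat(uint64_t reg) {
    uint32_t low = static_cast<uint32_t>(reg);
    float out;
    std::memcpy(&out, &low, sizeof(out));
    return out;
}

// Reinterpret all 64 bits of a general-purpose register as a double.
double BitsToDouble(uint64_t reg) {
    double out;
    std::memcpy(&out, &reg, sizeof(out));
    return out;
}

// Usage: only the real IEEE 754 bit patterns decode back to 5.0f and 6.0.
void CheckRoundTrip() {
    assert(BitsToFloat(0x40A00000u) == 5.0f);
    assert(BitsToDouble(0x4018000000000000ull) == 6.0);
}
```

A small integer such as 1 or 7 decodes to a tiny subnormal number, which is why the hooked call printed values near zero instead of 5.0 and 6.0.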

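Stack and Pointers

This one is a problem caused by habit.

Because the registers are 64 bits wide, I slid into assuming that every argument occupies 8 bytes on the stack under ARM64, and that is where things went wrong.

Again, take a method call as an example: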

fun argumentCheck(register1: Byte, register2: Short, register3: Int, register4: Long, register5: Float, register6: Double,
                    stack1: Boolean, stack2: Byte, stack3: Char, stack4: Short, stack5: Int, stack6: Long, stack7: Float, stack8: Double,
                    objNull: Any?, maskA: Long, obj: Any, maskB: Long) {
    ...
}

argumentCheck(
    1, 2, 3, 4, 5.0f, 6.0,                                              // registers
    true, 7, '8', 9, 10, 11, 12.0f, 13.0,                               // stack
    null, 0x1020304050607080, this@MainActivity, 0x1020304050607080
)

Dumping memory at the obtained stack-bottom address with LLDB gives the following.

(lldb) x/15x 0x7fd722d930
0x7fd722d930: 0x00000007 0x00000038 0x00000009 0x0000000a
0x7fd722d940: 0x0000000b 0x00000000 0x41400000 0x00000000
0x7fd722d950: 0x402a0000 0x00000000 0x50607080 0x10203040
0x7fd722d960: 0x76fc0ac0 0x50607080 0x10203040

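As the dump shows, long and double values are 8 bytes while every other type is 4 bytes, so when reading arguments from the stack, the read size must be chosen according to the type being read.

Another point: although the first few arguments are held in registers, they still occupy stack space according to the type sizes above; that space may simply be used by Android Runtime for other purposes. So the offset of a non-register argument from the stack bottom has to be computed from the sizes of the arguments that precede it.

Interestingly, because the parameter list includes two separator arguments, we can clearly see that an object pointer occupies 4 bytes. At the cost of limiting object references to 4 GB of address space, Android Runtime uses 4-byte pointers to compress the space references take. The saving for a single pointer is negligible, but summed over every object pointer in the VM it becomes substantial.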
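The offset rule can be sketched as a small helper. This is an illustration under my own naming (SlotSize and ArgOffsets are hypothetical, not framework functions); it assumes offsets are measured from the start of the argument area and that a non-static method has its 4-byte this reference in the first slot.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Bytes one argument of the given shorty type occupies in the argument area:
// 8 for long ('J') and double ('D'), 4 for everything else, references included.
std::size_t SlotSize(char type) {
    return (type == 'J' || type == 'D') ? 8 : 4;
}

// Byte offset of each declared argument from the start of the argument area.
// shorty[0] is the return type and is skipped. For a non-static method the
// first 4-byte slot holds `this`, so declared arguments start after it.
std::vector<std::size_t> ArgOffsets(const char* shorty, bool is_static) {
    assert(shorty != nullptr && shorty[0] != '\0');
    std::vector<std::size_t> offsets;
    std::size_t offset = is_static ? 0 : 4;
    for (const char* p = shorty + 1; *p != '\0'; ++p) {
        offsets.push_back(offset);  // arguments held in registers still take a slot
        offset += SlotSize(*p);
    }
    return offsets;
}
```

For the call above, assuming a Unit return type, the shorty would be "VBSIJFDZBCSIJFDLJLJ". The computed spacing agrees with the dump: counting from the slot that holds 7, the long 11 sits at +16, 12.0f at +24, 13.0 at +28, the null reference at +36, and the second object reference at +48.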

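Argument Passing

If a method's code entry point is replaced with code that faults, calling the method crashes. Inspecting the call stack at that point shows the crash site directly under art_quick_invoke_stub or art_quick_invoke_static_stub, depending on whether the method is static.

It is these two "functions" (let's call them that for now) that invoke the method's code.

One frame further down the call stack, both of them are called from art::ArtMethod::Invoke().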

void ArtMethod::Invoke(Thread* self, uint32_t* args, uint32_t args_size, JValue* result,
                       const char* shorty) {
  ...
  if (...) {
    ...
  } else {
    ...
    bool have_quick_code = GetEntryPointFromQuickCompiledCode() != nullptr;
    if (LIKELY(have_quick_code)) {
      ...
      if (!IsStatic()) {
        (*art_quick_invoke_stub)(this, args, args_size, self, result, shorty);
      } else {
        (*art_quick_invoke_static_stub)(this, args, args_size, self, result, shorty);
      }
      ...
    } else {
      ...
    }
  }
  ...
}

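The two are largely the same and differ only in how the this pointer is handled, so only art_quick_invoke_stub is explained here.

art_quick_invoke_stub is written in assembly, with a separate implementation for each architecture. The ARM64 code lives in quick_entrypoints_arm64.S.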

/*
 *  extern"C" void art_quick_invoke_stub(ArtMethod *method,   x0
 *                                       uint32_t  *args,     x1
 *                                       uint32_t argsize,    w2
 *                                       Thread *self,        x3
 *                                       JValue *result,      x4
 *                                       char   *shorty);     x5
 *  +----------------------+
 *  |                      |
 *  |  C/C++ frame         |
 *  |       LR''           |
 *  |       FP''           | <- SP'
 *  +----------------------+
 *  +----------------------+
 *  |        x28           | <- TODO: Remove callee-saves.
 *  |         :            |
 *  |        x19           |
 *  |        SP'           |
 *  |        X5            |
 *  |        X4            |        Saved registers
 *  |        LR'           |
 *  |        FP'           | <- FP
 *  +----------------------+
 *  | uint32_t out[n-1]    |
 *  |    :      :          |        Outs
 *  | uint32_t out[0]      |
 *  | ArtMethod*           | <- SP  value=null
 *  +----------------------+
 *
 * Outgoing registers:
 *  x0    - Method*
 *  x1-x7 - integer parameters.
 *  d0-d7 - Floating point parameters.
 *  xSELF = self
 *  SP = & of ArtMethod*
 *  x1 = "this" pointer.
 *
 */
ENTRY art_quick_invoke_stub
    // Spill registers as per AACPS64 calling convention.
    INVOKE_STUB_CREATE_FRAME

    // Fill registers x/w1 to x/w7 and s/d0 to s/d7 with parameters.
    // Parse the passed shorty to determine which register to load.
    // Load addresses for routines that load WXSD registers.
    adr  x11, .LstoreW2
    adr  x12, .LstoreX2
    adr  x13, .LstoreS0
    adr  x14, .LstoreD0

    // Initialize routine offsets to 0 for integers and floats.
    // x8 for integers, x15 for floating point.
    mov x8, #0
    mov x15, #0

    add x10, x5, #1         // Load shorty address, plus one to skip return value.
    ldr w1, [x9],#4         // Load "this" parameter, and increment arg pointer.

    // Loop to fill registers.
.LfillRegisters:
    ldrb w17, [x10], #1       // Load next character in signature, and increment.
    cbz w17, .LcallFunction   // Exit at end of signature. Shorty 0 terminated.

    cmp  w17, #'F' // is this a float?
    bne .LisDouble

    cmp x15, # 8*12         // Skip this load if all registers full.
    beq .Ladvance4

    add x17, x13, x15       // Calculate subroutine to jump to.
    br  x17

.LisDouble:
    cmp w17, #'D'           // is this a double?
    bne .LisLong

    cmp x15, # 8*12         // Skip this load if all registers full.
    beq .Ladvance8

    add x17, x14, x15       // Calculate subroutine to jump to.
    br x17

.LisLong:
    cmp w17, #'J'           // is this a long?
    bne .LisOther

    cmp x8, # 6*12          // Skip this load if all registers full.
    beq .Ladvance8

    add x17, x12, x8        // Calculate subroutine to jump to.
    br x17

.LisOther:                  // Everything else takes one vReg.
    cmp x8, # 6*12          // Skip this load if all registers full.
    beq .Ladvance4

    add x17, x11, x8        // Calculate subroutine to jump to.
    br x17

.Ladvance4:
    add x9, x9, #4
    b .LfillRegisters

.Ladvance8:
    add x9, x9, #8
    b .LfillRegisters

// Macro for loading a parameter into a register.
//  counter - the register with offset into these tables
//  size - the size of the register - 4 or 8 bytes.
//  register - the name of the register to be loaded.
.macro LOADREG counter size register return
    ldr \register , [x9], #\size
    add \counter, \counter, 12
    b \return
.endm

// Store ints.
.LstoreW2:
    LOADREG x8 4 w2 .LfillRegisters
    LOADREG x8 4 w3 .LfillRegisters
    LOADREG x8 4 w4 .LfillRegisters
    LOADREG x8 4 w5 .LfillRegisters
    LOADREG x8 4 w6 .LfillRegisters
    LOADREG x8 4 w7 .LfillRegisters

// Store longs.
.LstoreX2:
    LOADREG x8 8 x2 .LfillRegisters
    LOADREG x8 8 x3 .LfillRegisters
    LOADREG x8 8 x4 .LfillRegisters
    LOADREG x8 8 x5 .LfillRegisters
    LOADREG x8 8 x6 .LfillRegisters
    LOADREG x8 8 x7 .LfillRegisters

// Store singles.
.LstoreS0:
    LOADREG x15 4 s0 .LfillRegisters
    LOADREG x15 4 s1 .LfillRegisters
    LOADREG x15 4 s2 .LfillRegisters
    LOADREG x15 4 s3 .LfillRegisters
    LOADREG x15 4 s4 .LfillRegisters
    LOADREG x15 4 s5 .LfillRegisters
    LOADREG x15 4 s6 .LfillRegisters
    LOADREG x15 4 s7 .LfillRegisters

// Store doubles.
.LstoreD0:
    LOADREG x15 8 d0 .LfillRegisters
    LOADREG x15 8 d1 .LfillRegisters
    LOADREG x15 8 d2 .LfillRegisters
    LOADREG x15 8 d3 .LfillRegisters
    LOADREG x15 8 d4 .LfillRegisters
    LOADREG x15 8 d5 .LfillRegisters
    LOADREG x15 8 d6 .LfillRegisters
    LOADREG x15 8 d7 .LfillRegisters


.LcallFunction:

    INVOKE_STUB_CALL_AND_RETURN

END art_quick_invoke_stub

.macro INVOKE_STUB_CREATE_FRAME
SAVE_SIZE=6*8   // x4, x5, x19, x20, FP, LR saved.
    SAVE_TWO_REGS_INCREASE_FRAME x4, x5, SAVE_SIZE
    SAVE_TWO_REGS x19, x20, 16
    SAVE_TWO_REGS xFP, xLR, 32

    mov xFP, sp                            // Use xFP for frame pointer, as it's callee-saved.
    .cfi_def_cfa_register xFP

    add x10, x2, #(__SIZEOF_POINTER__ + 0xf) // Reserve space for ArtMethod*, arguments and
    and x10, x10, # ~0xf                   // round up for 16-byte stack alignment.
    sub sp, sp, x10                        // Adjust SP for ArtMethod*, args and alignment padding.

    mov xSELF, x3                          // Move thread pointer into SELF register.

    // Copy arguments into stack frame.
    // Use simple copy routine for now.
    // 4 bytes per slot.
    // X1 - source address
    // W2 - args length
    // X9 - destination address.
    // W10 - temporary
    add x9, sp, #8                         // Destination address is bottom of stack + null.

    // Copy parameters into the stack. Use numeric label as this is a macro and Clang's assembler
    // does not have unique-id variables.
1:
    cbz w2, 2f
    sub w2, w2, #4      // Need 65536 bytes of range.
    ldr w10, [x1, x2]
    str w10, [x9, x2]
    b 1b

2:
    // Store null into ArtMethod* at bottom of frame.
    str xzr, [sp]
.endm

.macro INVOKE_STUB_CALL_AND_RETURN

    REFRESH_MARKING_REGISTER

    // load method-> METHOD_QUICK_CODE_OFFSET
    ldr x9, [x0, #ART_METHOD_QUICK_CODE_OFFSET_64]
    // Branch to method.
    blr x9

    // Pop the ArtMethod* (null), arguments and alignment padding from the stack.
    mov sp, xFP
    .cfi_def_cfa_register sp

    // Restore saved registers including value address and shorty address.
    RESTORE_TWO_REGS x19, x20, 16
    RESTORE_TWO_REGS xFP, xLR, 32
    RESTORE_TWO_REGS_DECREASE_FRAME x4, x5, SAVE_SIZE

    // Store result (w0/x0/s0/d0) appropriately, depending on resultType.
    ldrb w10, [x5]

    // Check the return type and store the correct register into the jvalue in memory.
    // Use numeric label as this is a macro and Clang's assembler does not have unique-id variables.

    // Don't set anything for a void type.
    cmp w10, #'V'
    beq 1f

    // Is it a double?
    cmp w10, #'D'
    beq 2f

    // Is it a float?
    cmp w10, #'F'
    beq 3f

    // Just store x0. Doesn't matter if it is 64 or 32 bits.
    str x0, [x4]

1:  // Finish up.
    ret

2:  // Store double.
    str d0, [x4]
    ret

3:  // Store float.
    str s0, [x4]
    ret

.endm

The explanation is split into several parts. First, the function declaration and the incoming arguments.

/*
 *  extern"C" void art_quick_invoke_stub(ArtMethod *method,   x0
 *                                       uint32_t  *args,     x1
 *                                       uint32_t argsize,    w2
 *                                       Thread *self,        x3
 *                                       JValue *result,      x4
 *                                       char   *shorty);     x5
 */

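Under the C calling convention on ARM64, the first 8 arguments of a function are passed in registers. Each parameter means what its name says; the one worth explaining is the shorty pointer. It points to a string that identifies the method's return and parameter types using the primitive type descriptors from Java class files, with every reference type written as L. Take the following method: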

fun argumentCheck(register1: Byte, register2: Short, register3: Int, register4: Long, register5: Float, register6: Double): Any { ... }

Its shorty string is "LBSIJFD"; shorty[0] describes the return type.
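To make the shorty format concrete, here is a sketch that derives a shorty from a JVM method descriptor. The function names are hypothetical and the code assumes a well-formed descriptor; ART computes shorties elsewhere, so this only illustrates the mapping.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Map the field descriptor starting at desc[pos] to its shorty character and
// advance pos past it. Array and class types both collapse to 'L'.
char NextShortyChar(const std::string& desc, std::size_t& pos) {
    assert(pos < desc.size());
    char c = desc[pos];
    if (c == '[' || c == 'L') {
        while (desc[pos] == '[') ++pos;      // skip array dimensions
        if (desc[pos] == 'L') {
            pos = desc.find(';', pos);       // skip the class name
            assert(pos != std::string::npos);
        }
        ++pos;                               // step past ';' or the element type
        return 'L';
    }
    ++pos;
    return c;                                // primitive type, or 'V' for void
}

// Build the shorty for a descriptor such as "(BSIJFD)Ljava/lang/Object;".
std::string ShortyFromDescriptor(const std::string& desc) {
    assert(!desc.empty() && desc[0] == '(');
    std::string params;
    std::size_t pos = 1;
    while (desc[pos] != ')') params += NextShortyChar(desc, pos);
    ++pos;                                   // skip ')'
    std::string shorty(1, NextShortyChar(desc, pos));
    return shorty + params;                  // return type first, then parameters
}
```

With the descriptor "(BSIJFD)Ljava/lang/Object;" this yields "LBSIJFD", matching the example method.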

.macro INVOKE_STUB_CREATE_FRAME
SAVE_SIZE=6*8   // x4, x5, x19, x20, FP, LR saved.
    SAVE_TWO_REGS_INCREASE_FRAME x4, x5, SAVE_SIZE
    SAVE_TWO_REGS x19, x20, 16
    SAVE_TWO_REGS xFP, xLR, 32

    mov xFP, sp                            // Use xFP for frame pointer, as it's callee-saved.
    .cfi_def_cfa_register xFP

    add x10, x2, #(__SIZEOF_POINTER__ + 0xf) // Reserve space for ArtMethod*, arguments and
    and x10, x10, # ~0xf                   // round up for 16-byte stack alignment.
    sub sp, sp, x10                        // Adjust SP for ArtMethod*, args and alignment padding.

    mov xSELF, x3                          // Move thread pointer into SELF register.

    // Copy arguments into stack frame.
    // Use simple copy routine for now.
    // 4 bytes per slot.
    // X1 - source address
    // W2 - args length
    // X9 - destination address.
    // W10 - temporary
    add x9, sp, #8                         // Destination address is bottom of stack + null.

    // Copy parameters into the stack. Use numeric label as this is a macro and Clang's assembler
    // does not have unique-id variables.
1:
    cbz w2, 2f
    sub w2, w2, #4      // Need 65536 bytes of range.
    ldr w10, [x1, x2]
    str w10, [x9, x2]
    b 1b

2:
    // Store null into ArtMethod* at bottom of frame.
    str xzr, [sp]
.endm

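This macro sets up the stack frame: it backs up some registers onto the stack, reserves 16-byte-aligned space for the ArtMethod* slot and the arguments, and copies the arguments into that space. The points to note are that xSELF (x19) holds the current thread pointer, and that the return address (LR) is saved in the frame so that the ret instruction can end the call.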
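The size computation done by the add / and / sub sequence in the macro can be written out as follows. OutFrameSize is a name I made up for this sketch, and the pointer size of 8 assumes ARM64.

```cpp
#include <cassert>
#include <cstdint>

// Bytes the stub subtracts from SP: the ArtMethod* slot plus the argument
// bytes, rounded up to a multiple of 16 to keep the stack aligned.
uint64_t OutFrameSize(uint32_t args_size) {
    assert(args_size % 4 == 0);              // arguments are copied in 4-byte slots
    constexpr uint64_t kPointerSize = 8;     // __SIZEOF_POINTER__ on ARM64
    return (args_size + kPointerSize + 0xf) & ~uint64_t{0xf};
}
```

For example, 12 bytes of arguments need 12 + 8 = 20 bytes, which rounds up to 32. The copied arguments then start at SP + 8, just above the null ArtMethod* slot.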

    // Fill registers x/w1 to x/w7 and s/d0 to s/d7 with parameters.
    // Parse the passed shorty to determine which register to load.
    // Load addresses for routines that load WXSD registers.
    adr  x11, .LstoreW2
    adr  x12, .LstoreX2
    adr  x13, .LstoreS0
    adr  x14, .LstoreD0

    // Initialize routine offsets to 0 for integers and floats.
    // x8 for integers, x15 for floating point.
    mov x8, #0
    mov x15, #0

    ...

    // Store ints.
.LstoreW2:
    LOADREG x8 4 w2 .LfillRegisters
    ...
    LOADREG x8 4 w7 .LfillRegisters

// Store longs.
.LstoreX2:
    LOADREG x8 8 x2 .LfillRegisters
    ...
    LOADREG x8 8 x7 .LfillRegisters

// Store singles.
.LstoreS0:
    LOADREG x15 4 s0 .LfillRegisters
    ...
    LOADREG x15 4 s7 .LfillRegisters

// Store doubles.
.LstoreD0:
    LOADREG x15 8 d0 .LfillRegisters
    ...
    LOADREG x15 8 d7 .LfillRegisters

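In the second part, x11 ~ x14 hold the base addresses of the load routines for other (32-bit) types, long, float, and double respectively, while x8 and x15 are running offsets for the integer and floating-point registers, used to select the instruction that loads into the n-th register.

For example, x13 holds the address of the label .LstoreS0:, which points at the code LOADREG x15 4 s0 .LfillRegisters. When the third floating-point argument is about to be loaded, x15 has reached 12 * (3 - 1) = 24, because each LOADREG expansion is 12 bytes long, so x13 + x15 points at LOADREG x15 4 s2 .LfillRegisters and the value is loaded into s2.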

    add x10, x5, #1         // Load shorty address, plus one to skip return value.
    ldr w1, [x9],#4         // Load "this" parameter, and increment arg pointer.

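This loads the address of the type descriptor string (skipping the return type) and the this pointer, which are stored in x10 and w1 respectively.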

    // Loop to fill registers.
.LfillRegisters:
    ldrb w17, [x10], #1       // Load next character in signature, and increment.
    cbz w17, .LcallFunction   // Exit at end of signature. Shorty 0 terminated.

    cmp  w17, #'F' // is this a float?
    bne .LisDouble

    cmp x15, # 8*12         // Skip this load if all registers full.
    beq .Ladvance4

    add x17, x13, x15       // Calculate subroutine to jump to.
    br  x17

.LisDouble:
    cmp w17, #'D'           // is this a double?
    bne .LisLong

    cmp x15, # 8*12         // Skip this load if all registers full.
    beq .Ladvance8

    add x17, x14, x15       // Calculate subroutine to jump to.
    br x17

.LisLong:
    cmp w17, #'J'           // is this a long?
    bne .LisOther

    cmp x8, # 6*12          // Skip this load if all registers full.
    beq .Ladvance8

    add x17, x12, x8        // Calculate subroutine to jump to.
    br x17

.LisOther:                  // Everything else takes one vReg.
    cmp x8, # 6*12          // Skip this load if all registers full.
    beq .Ladvance4

    add x17, x11, x8        // Calculate subroutine to jump to.
    br x17

.Ladvance4:
    add x9, x9, #4
    b .LfillRegisters

.Ladvance8:
    add x9, x9, #8
    b .LfillRegisters

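This part of the code fills the registers, as described above. Once the registers of a class are full, the remaining arguments of that class stay only in the stack copy; the code merely advances the argument pointer x9 past them, with .Ladvance4 and .Ladvance8 covering the two slot sizes.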
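The effect of the loop can be modelled in a few lines. This is a sketch for the non-static stub only, with a hypothetical function name; it reports where each declared argument ends up and ignores the actual data movement.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Where each declared argument of a non-static method lands, following the
// loop above: up to six integer-class arguments in x2..x7 (w2..w7 for 32-bit
// values), up to eight floating-point arguments in s0..s7 / d0..d7, and the
// rest only in the stack copy.
std::vector<std::string> AssignRegisters(const char* shorty) {
    assert(shorty != nullptr && shorty[0] != '\0');
    std::vector<std::string> where;
    int next_int = 0;  // plays the role of x8 / 12
    int next_fp = 0;   // plays the role of x15 / 12, shared by float and double
    for (const char* p = shorty + 1; *p != '\0'; ++p) {
        if (*p == 'F' || *p == 'D') {
            if (next_fp == 8) { where.push_back("stack"); continue; }
            where.push_back((*p == 'F' ? "s" : "d") + std::to_string(next_fp++));
        } else {
            if (next_int == 6) { where.push_back("stack"); continue; }
            where.push_back((*p == 'J' ? "x" : "w") + std::to_string(next_int++ + 2));
        }
    }
    return where;
}
```

For the shorty "LBSIJFD" this gives w2, w3, w4, x5, s0, d1. Note that the double goes to d1 rather than d0, because float and double arguments share one counter.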

// Macro for loading a parameter into a register.
//  counter - the register with offset into these tables
//  size - the size of the register - 4 or 8 bytes.
//  register - the name of the register to be loaded.
.macro LOADREG counter size register return
    ldr \register , [x9], #\size
    add \counter, \counter, 12
    b \return
.endm

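This part covers the instruction sequence that fills a register. It is a macro whose expansion is 12 bytes long (3 instructions of a fixed 4 bytes each). After the data is loaded into the register, the corresponding offset is incremented by 12 so that it points at the next register-filling sequence.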
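The address arithmetic behind the computed branch is small enough to spell out. The names below are mine, and the sketch assumes every LOADREG expansion is exactly three 4-byte instructions, as stated above.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t kLoadRegBytes = 3 * 4;  // ldr + add + b, 4 bytes each

// Address of the n-th LOADREG expansion (0-based) in a table starting at base.
constexpr uint64_t EntryAddress(uint64_t table_base, unsigned n) {
    return table_base + n * kLoadRegBytes;
}

// The stub treats a register class as full once its offset reaches the end
// of the table: six entries for x2..x7, eight for s0..s7 / d0..d7.
constexpr bool IntRegsFull(uint64_t x8) { return x8 == 6 * kLoadRegBytes; }
constexpr bool FpRegsFull(uint64_t x15) { return x15 == 8 * kLoadRegBytes; }

// Usage: the third float (index 2) lands on the s2 entry, 24 bytes into the table.
void CheckThirdFloatEntry() {
    assert(EntryAddress(0, 2) == 24);
}
```

Because the offset doubles as the register index times 12, the same counter serves both as the branch target and as the "registers full" test.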

.macro INVOKE_STUB_CALL_AND_RETURN

    REFRESH_MARKING_REGISTER

    // load method-> METHOD_QUICK_CODE_OFFSET
    ldr x9, [x0, #ART_METHOD_QUICK_CODE_OFFSET_64]
    // Branch to method.
    blr x9

    // Pop the ArtMethod* (null), arguments and alignment padding from the stack.
    mov sp, xFP
    .cfi_def_cfa_register sp

    // Restore saved registers including value address and shorty address.
    RESTORE_TWO_REGS x19, x20, 16
    RESTORE_TWO_REGS xFP, xLR, 32
    RESTORE_TWO_REGS_DECREASE_FRAME x4, x5, SAVE_SIZE

    // Store result (w0/x0/s0/d0) appropriately, depending on resultType.
    ldrb w10, [x5]

    // Check the return type and store the correct register into the jvalue in memory.
    // Use numeric label as this is a macro and Clang's assembler does not have unique-id variables.

    // Don't set anything for a void type.
    cmp w10, #'V'
    beq 1f

    // Is it a double?
    cmp w10, #'D'
    beq 2f

    // Is it a float?
    cmp w10, #'F'
    beq 3f

    // Just store x0. Doesn't matter if it is 64 or 32 bits.
    str x0, [x4]

1:  // Finish up.
    ret

2:  // Store double.
    str d0, [x4]
    ret

3:  // Store float.
    str s0, [x4]
    ret

.endm

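In the last step, the method's entry point is read from the ArtMethod at a fixed offset into x9 and called. The return value is then written to the memory x4 points to according to the return type, which completes the call.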
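The return-value dispatch can be sketched in C++ as well. ResultSlot and ReturnRegs are simplified stand-ins I invented for this illustration: the first plays the role of the JValue that x4 points to, the second models x0, d0 and s0 as separate fields even though s0 is really the low half of d0.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Stand-in for the JValue that x4 points to: 8 bytes of raw storage.
struct ResultSlot { unsigned char bytes[8]; };

// Registers that may carry the return value, reduced to plain values.
struct ReturnRegs { uint64_t x0; double d0; float s0; };

// Mirror of the stub's tail: pick the register by the return type shorty[0].
void StoreResult(char return_type, const ReturnRegs& regs, ResultSlot* result) {
    assert(result != nullptr);
    switch (return_type) {
        case 'V': break;                                          // void: store nothing
        case 'D': std::memcpy(result->bytes, &regs.d0, 8); break; // str d0, [x4]
        case 'F': std::memcpy(result->bytes, &regs.s0, 4); break; // str s0, [x4]
        default:  std::memcpy(result->bytes, &regs.x0, 8); break; // str x0, [x4]
    }
}
```

A hook that replaces the method's code has to honour the same convention, returning floating-point results in s0 / d0 and everything else in x0.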

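Concurrency

To keep stack unwinding from going wrong and crashing the VM, the trampoline code cannot save data on the stack, so we use an externally allocated object (of a type named Box) to back up the register data we need. Compared with the stack, this has an obvious drawback: the stack is private to a thread, whereas data kept in external storage gets overwritten when the method is called concurrently.

The trampoline code therefore needs special handling, and a spin lock is used here to make it thread-safe.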

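bridge_match:
    ldr x16, bridge_box_pointer     // x16 = address of the shared Box
check_lock:
    ldr x17, [x16]                  // read the lock word (first field of Box)
    cmp x17, #0
    bne check_lock                  // spin while another thread holds the lock
    str x19, [x16]                  // claim it with the current thread pointer (xSELF)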

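Here x16 points to the Box object used to save the data. Its first field is the lock: register data may be written only when the lock value is 0, and each thread sets the lock to its own thread pointer before it starts writing.

Once the jump to the target method completes, the saved register data has to be parsed. For performance we should not hold the lock while parsing, yet the data must not be overwritten by another thread while it is being parsed.

The solution is to make a copy of the Box object for parsing and release the lock on the original Box as soon as the copy is done.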

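#include <cstdlib>  // malloc
#include <cstring>  // memcpy

// Snapshot the shared Box, then release its lock so other threads can proceed.
// The copy is malloc'd, so it must eventually be released with free().
Box *Runtime::UnlockAndCopyBox(Box *origin) {
    auto *clone = reinterpret_cast<Box *>(malloc(sizeof(Box)));
    memcpy(clone, origin, sizeof(Box));  // copy while the lock is still held
    origin->lock_ = 0;                   // release only after the copy is complete
    return clone;
}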
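One thing worth noting about the trampoline listing: the lock is checked with one instruction and claimed with another, so two threads could both observe 0 before either of them stores. The sketch below shows the same acquire, copy, release sequence with an atomic compare-and-swap closing that window. It is not the framework's code; the Box layout, BoxSnapshot, LockBox and UnlockAndCopy are all assumptions made for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <memory>

// Hypothetical layout: a lock word followed by the saved register values.
struct Box {
    std::atomic<uintptr_t> lock{0};  // 0 = free, otherwise the owner's thread pointer
    uint64_t regs[8] = {};
};

// Plain copy of the payload, owned by the thread that parses it.
struct BoxSnapshot { uint64_t regs[8]; };

// Spin until the lock word changes from 0 to owner in one atomic step.
void LockBox(Box& box, uintptr_t owner) {
    assert(owner != 0);
    uintptr_t expected = 0;
    while (!box.lock.compare_exchange_weak(expected, owner,
                                           std::memory_order_acquire,
                                           std::memory_order_relaxed)) {
        expected = 0;  // a failed exchange overwrote it with the current holder
    }
}

// Copy the payload while still holding the lock, then release the lock.
std::unique_ptr<BoxSnapshot> UnlockAndCopy(Box& box) {
    auto snapshot = std::make_unique<BoxSnapshot>();
    for (int i = 0; i < 8; ++i) snapshot->regs[i] = box.regs[i];
    box.lock.store(0, std::memory_order_release);
    return snapshot;
}
```

In trampoline assembly the equivalent of compare_exchange would be an exclusive load/store pair or a compare-and-swap instruction; the C++ version is only meant to show the ordering of the steps.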

Open-Source Framework

AoraMD - Kaleidoscope