I’ve been making changes to the JIT in SpiderMonkey, and sometimes I get a SEGFAULT. Okay, so I open it in gdb, and then this happens:
Thread 1 "js" received signal SIGSEGV, Segmentation fault.
0x0000129af35af5e9 in ?? ()
Not helpful, maybe there’s something in the stack?
(gdb) backtrace
#0  0x0000129af35af5e9 in  ()
#1  0x0000129af35b107d in  ()
#2  0xfff9800000000000 in  ()
#3  0xfff8800000000002 in  ()
#4  0xfff8800000000002 in  ()
Still not helpful. I’m reasonably confident the crash is in JITed code, which has no debugging symbols or other info, so I don’t know what it was actually executing when it crashed.
In case it’s not apparent, this is a short blog post where I can make notes of one way to get some more information when debugging JITed code.
First of all, those really large addresses (frames 2, 3 and 4) look suspicious. I’m not sure what causes that.
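One way to make "suspicious" concrete is a rough sketch like the one below. It assumes the common x86-64 setup with 48-bit virtual addresses, where a usable address must have bits 63 down to 47 all equal (a "canonical" address). The function name is made up for illustration. It only shows that frames 2 to 4 cannot be real return addresses; it does not say what those values actually are.

```python
def is_canonical(addr: int, va_bits: int = 48) -> bool:
    """Return True if bits 63..(va_bits - 1) of addr are all equal.

    Assumes 48-bit virtual addresses by default (an assumption about
    the machine, not something stated in the post).
    """
    top = addr >> (va_bits - 1)                # bit va_bits-1 and everything above
    all_ones = (1 << (64 - va_bits + 1)) - 1   # 17 one-bits when va_bits is 48
    return top == 0 or top == all_ones


# Frame addresses copied from the backtrace above.
FRAMES = [
    0x0000129AF35AF5E9,
    0x0000129AF35B107D,
    0xFFF9800000000000,
    0xFFF8800000000002,
]

for number, addr in enumerate(FRAMES):
    print(f"#{number} {addr:#018x} canonical={is_canonical(addr)}")
```

Under that assumption the first two frames pass the check and the large ones do not, so gdb is most likely reading ordinary data off the stack and guessing that it is a return address.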
Now, I know the change I made to the JIT, so it’s likely that that’s the code that’s crashing; I just don’t know why. It would help to see what code is being executed:
(gdb) disassemble
No function contains program counter for selected frame.
What it’s trying to say is that the current program counter at this level in the backtrace does not correspond to any function in the compiled C++ program (SpiderMonkey). So, unless we did a call or jump to something invalid, we’re probably executing JITed code.
Let’s get more info:
(gdb) info registers
rax            0x7ffff54b30c0      140737308733632
rbx            0xe4e4e4e400000891  -1953184670468274031
rcx            0xc                 12
rdx            0x7ffff54c1058      140737308790872
rsi            0xa                 10
rdi            0x7ffff54c1040      140737308790848
rbp            0x7fffffff9438      0x7fffffff9438
rsp            0x7fffffff9418      0x7fffffff9418
r8             0x7fffffff9088      140737488326792
r9             0x8                 8
r10            0x7fffffff9068      140737488326760
r11            0x7ffff5d2f128      140737317630248
r12            0x0                 0
r13            0x0                 0
r14            0x7ffff54a0040      140737308655680
r15            0x0                 0
rip            0x129af35af5e9      0x129af35af5e9
eflags         0x10202             [ IF RF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0
These are the values in the CPU registers. The debugger uses the rip (program counter), rsp (stack pointer) and rbp (frame pointer) registers to know what is executing and to read the stack, including the calls that led to this one. We can use them too: we’re going to use rip to figure out what’s being executed. Its current value is 0x129af35af5e9.
(gdb) dump memory code.raw 0x129af35af5e9 0x129af35af600
Then in a shell:
$ hexdump -C code.raw
00000000  83 03 01 c7 02 4b 00 00  00 e9 82 00 00 00 49 bb  |.....K........I.|
00000010  a8 ab d1 f5 ff 7f 00                              |.......|
I have asked gdb to write the contents of memory, starting at the instruction pointer, to a file named code.raw. Note that on x86-64 you need to write at least 15 bytes, as a single instruction can be that long; I wrote 23 bytes.
I’d normally disassemble code using the objdump program:
$ objdump -d code.raw
objdump: code.raw: File format not recognised
In this case it needs extra clues about the raw data in this file. We tell it the file format, the machine "i386" and give the disassembler more information about the machine "x86-64".
$ objdump -b binary -m i386 -M x86-64 -D code.raw

code.raw:     file format binary

Disassembly of section .data:

00000000 <.data>:
   0:   83 03 01                addl   $0x1,(%rbx)
   3:   c7 02 4b 00 00 00       movl   $0x4b,(%rdx)
   9:   e9 82 00 00 00          jmpq   0x90
   e:   49                      rex.WB
   f:   bb a8 ab d1 f5          mov    $0xf5d1aba8,%ebx
  14:   ff                      (bad)
  15:   7f 00                   jg     0x17
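If you want to try that objdump invocation without a crashing process to hand, a small sketch like this recreates the same 23 bytes from the hexdump shown earlier. The constant and function names are mine, picked for illustration.

```python
# Bytes copied from the hexdump of code.raw shown earlier in the post.
CRASH_BYTES = bytes.fromhex(
    "83 03 01 c7 02 4b 00 00 00 e9 82 00 00 00 49 bb "
    "a8 ab d1 f5 ff 7f 00"
)


def write_raw(path: str = "code.raw") -> int:
    """Write the captured bytes to path and return how many were written."""
    with open(path, "wb") as handle:
        return handle.write(CRASH_BYTES)


if __name__ == "__main__":
    count = write_raw()
    print(f"wrote {count} bytes to code.raw")
```

Running objdump with the flags above on the resulting file should then give the same kind of listing.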
Yay. I can see the instruction it crashed on: adding the number 1 to the 32-bit value stored at the address pointed to by rbx. I’d like some more context, so I have to get the instructions that led to this. Note that after the jmpq instruction nothing makes sense; that’s okay, since that jump is always taken.
(gdb) dump memory code.raw 0x2ce07c3895e6 0x2ce07c3895f7
...
$ objdump -b binary -m i386 -M x86-64 -D code.raw

code.raw:     file format binary

Disassembly of section .data:

00000000 <.data>:
   0:   49 8b 1b                mov    (%r11),%rbx
   3:   83 03 01                addl   $0x1,(%rbx)
   6:   c7 02 4b 00 00 00       movl   $0x4b,(%rdx)
   c:   e9 82 00 00 00          jmpq   0x93
When I go back three bytes I get lucky and find another valid instruction that also makes sense.
(gdb) dump memory code.raw 0x2ce07c3895e5 0x2ce07c3895f7
...
$ objdump -b binary -m i386 -M x86-64 -D code.raw

code.raw:     file format binary

Disassembly of section .data:

00000000 <.data>:
   0:   00 49 8b                add    %cl,-0x75(%rcx)
   3:   1b 83 03 01 c7 02       sbb    0x2c70103(%rbx),%eax
   9:   4b 00 00                rex.WXB add %al,(%r8)
   c:   00 e9                   add    %ch,%cl
   e:   82                      (bad)
   f:   00 00                   add    %al,(%rax)
        ...
Gibberish. Unfortunately I just have to guess which byte an instruction might begin on, or go back byte-by-byte until I find instructions that make sense. There was quite a bit of experimentation, and a lot more gibberish, until I found:
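The byte-by-byte search can be made a little less tedious by generating the gdb commands up front. This is only a sketch: the function name is hypothetical, and it relies on the fact that an x86-64 instruction is at most 15 bytes long, so the previous instruction has to start somewhere within 15 bytes of a known instruction boundary.

```python
def candidate_dumps(known_start: int, end: int,
                    max_back: int = 15, out: str = "code.raw") -> list[str]:
    """Build one gdb 'dump memory' command per candidate start address.

    known_start is an address already known to begin an instruction;
    each candidate starts 1..max_back bytes before it.
    """
    return [
        f"dump memory {out} {known_start - back:#x} {end:#x}"
        for back in range(1, max_back + 1)
    ]


if __name__ == "__main__":
    # Addresses taken from the gdb session in this post.
    for command in candidate_dumps(0x2CE07C3895E6, 0x2CE07C3895F7):
        print(command)
```

Each command overwrites code.raw, so objdump has to be run after every dump. A candidate is promising when the listing lines up with the instructions already known to be correct.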
(gdb) dump memory code.raw 0x2ce07c3895dd 0x2ce07c3895f7
...
$ objdump -b binary -m i386 -M x86-64 -D code.raw

code.raw:     file format binary

Disassembly of section .data:

00000000 <.data>:
   0:   bb 28 f1 d2 f5          mov    $0xf5d2f128,%ebx
   5:   ff                      (bad)
   6:   7f 00                   jg     0x8
   8:   00 49 8b                add    %cl,-0x75(%rcx)
   b:   1b 83 03 01 c7 02       sbb    0x2c70103(%rbx),%eax
  11:   4b 00 00                rex.WXB add %al,(%r8)
  14:   00 e9                   add    %ch,%cl
  16:   82                      (bad)
  17:   00 00                   add    %al,(%rax)
        ...
This is almost correct (except for all the gibberish). At least it starts on an instruction that kind-of makes sense, with a valid-looking memory address. But wait, that instruction uses ebx, a 32-bit register, which is not what I’m expecting since the code I’m JITing works with 64-bit memory addresses. And all that gibberish could be part of a memory address: it has bytes like 0xff and 0x7f in it!
I go back one more byte:
(gdb) dump memory code.raw 0x2ce07c3895dc 0x2ce07c3895f7
...
$ objdump -b binary -m i386 -M x86-64 -D code.raw

code.raw:     file format binary

Disassembly of section .data:

00000000 <.data>:
   0:   49 bb 28 f1 d2 f5 ff    movabs $0x7ffff5d2f128,%r11
   7:   7f 00 00
   a:   49 8b 1b                mov    (%r11),%rbx
   d:   83 03 01                addl   $0x1,(%rbx)
  10:   c7 02 4b 00 00 00       movl   $0x4b,(%rdx)
  16:   e9 82 00 00 00          jmpq   0x9d
Got it. That’s a long instruction (which I’ll talk more about in my next article), and it decodes correctly now that we have the extra byte at the beginning. x86 has prefix bytes for some instructions, which can override some things about the instruction. In this case 0x49 is a REX prefix saying that this instruction operates on 64-bit data (well, 0x48 says that, and the +1 is the extra high bit of the register number, which is what turns rbx into r11).
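To see how that one prefix byte changes the decoding, here is a minimal sketch that picks apart just this instruction by hand. It is not a general decoder. It assumes the REX layout 0100WRXB and the "0xb8 + register" move-immediate opcode form, and the function name is my own.

```python
import struct


def decode_rex(prefix: int) -> dict[str, int]:
    """Split a REX prefix byte (binary 0100WRXB) into its four flag bits."""
    if prefix & 0xF0 != 0x40:
        raise ValueError(f"{prefix:#x} is not a REX prefix")
    return {
        "W": (prefix >> 3) & 1,  # 64-bit operand size
        "R": (prefix >> 2) & 1,
        "X": (prefix >> 1) & 1,
        "B": prefix & 1,         # extra high bit for the register number
    }


# The ten bytes of the movabs instruction from the listing above.
INSN = bytes.fromhex("49 bb 28 f1 d2 f5 ff 7f 00 00")

rex = decode_rex(INSN[0])
opcode = INSN[1]                                  # 0xbb is 0xb8 + 3
register = (rex["B"] << 3) | (opcode & 0x7)       # 8 + 3 = 11, i.e. r11
(immediate,) = struct.unpack("<Q", INSN[2:10])    # little-endian 64-bit value

print(rex, f"register=r{register}", f"immediate={immediate:#x}")
```

Without the prefix, the same 0xbb opcode names register 3 (ebx) and takes only a 4-byte immediate, which is exactly the misleading listing from the attempt one byte later.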
And there’s the bug (3rd line). I’m dereferencing this address, the one that I load into r11, twice: once in the mov from (%r11) and then again during the addl. I should only dereference it once. The cause was that I misunderstood SpiderMonkey’s macro assembler’s mnemonics.
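A toy model of the double dereference, using a dictionary as a stand-in for memory. The two values come from the register dump earlier in the post (r11 and rbx). The "fixed" version is my guess at the intended behaviour, not the actual patch, and the model ignores the fact that addl only touches 32 bits.

```python
COUNTER_ADDR = 0x7FFFF5D2F128        # address loaded into r11 by the movabs
COUNTER_VALUE = 0xE4E4E4E400000891   # what ended up in rbx at the crash


def buggy(memory: dict[int, int], r11: int) -> None:
    rbx = memory[r11]    # mov (%r11),%rbx : first dereference
    memory[rbx] += 1     # addl $0x1,(%rbx): second dereference, bad address


def fixed(memory: dict[int, int], r11: int) -> None:
    memory[r11] += 1     # hypothetical fix: increment through r11 directly


if __name__ == "__main__":
    memory = {COUNTER_ADDR: COUNTER_VALUE}
    try:
        buggy(memory, COUNTER_ADDR)
    except KeyError as missing:
        # The KeyError plays the role of the SIGSEGV here.
        print(f"fault at {missing.args[0]:#x}")
    fixed(memory, COUNTER_ADDR)
    print(f"counter is now {memory[COUNTER_ADDR]:#x}")
```

In the buggy version the counter's contents get treated as a pointer, which is why rbx held a value that looks nothing like an address.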
Update 2018-08-07
One response to this pointed out that I could have just used:
(gdb) disassemble 0x12345, +0x100
To disassemble a range of memory, and wouldn’t have had the "No function contains program counter for selected frame." error. They even suggested I could use something like:
(gdb) disassemble $rip-50, +0x100
I’ll definitely try these next time. They might not be the exact syntax; I haven’t tested them.
Update 2018-08-18
Another tip is to use: x/20i $pc
That’s the whole command. x (examine) means that GDB should treat $pc as a memory location to read from, not as a literal value; /20i means "treat that memory location as containing instructions and show 20 of them".
You can also use the same format with display, as in display/4i $pc, so that every time you stepi, it will auto-print the next 4 instructions.