The missing XSAVE

  • 8 February 2022
  • 0 replies
  • 20 views

1. Introduction

This is a write-up of a bug in Undo’s software replay technology that I diagnosed and fixed. I’m writing it up while the details are still fairly clear in my memory and I have access to the relevant debugging transcripts. Hopefully this will be an entertaining story, and show off many useful features of UDB.

2. The bug

This bug was provoked when I pushed a small commit removing some long-dead code. This caused a backwards-compatibility test to fail in the overnight run, and since my commit was “responsible” for the failure (as determined by git bisect), it fell to me to fix the problem. This put me in a bad mood, because I was certain that there was nothing wrong with my commit, and that what was going on was that removing the dead code had perturbed the code layout causing some previously latent bug to spring into life. There’s little that’s more annoying than to be blamed for something that isn’t your fault. But afterwards I was glad I worked on it.

What was the failure? The backwards-compatibility test took a recording made with LiveRecorder version 5.3, and replayed it on the latest version of UDB (which at the time was 6.6). When the recording was replayed, UDB crashed with this message:

**************************************************************
Fatal error: xsave_type != 0xcafedead && (xsave_type & ~(~0x2f)) == 0
Additional information: B-code has not set up instr_ab_to_nonab_vars.xsave.type correctly: cafedead
Location: src/engine_impl/events/event_x86.inc.c:185:event_xsave_replay_sanity_checks [10635:198932]
**************************************************************

What this means is that there was a discrepancy between the execution sequence of the program when it was recorded and the execution sequence when it was replayed. When recorded, the program had executed an XSAVE instruction, but at the corresponding point in replay the program executed a JLE instead (see the disassembly below).

Emulating the XSAVE instruction should have set the xsave_type variable to an appropriate value, but it remained at its default value of 0xcafedead. The transcript below demonstrates how to reproduce the bug in UDB after the recording has been loaded. I used the ugo time command to jump to just before the crash, and the info events command to query the non-deterministic events around this time (you can see that the recording thinks it is about to execute an XSAVE event). I used the step-instruction command (si for short) to step over the instructions, demonstrating that crash occurs on the JLE at 0x2b033904e0a9.

start 1> ugo time 26863981
0x00002b033904e0a0 in ?? ()
98% 26,863,981> info events -a -200 -b +1
time=26,863,832:0xffffffffffffffff: write. result=0x69 size=80.
time=26,863,981:0xffffffffffffffff: XSAVE. size=96.
98% 26,863,981> disas $pc,$pc+11
Dump of assembler code from 0x2b033904e0a0 to 0x2b033904e0ab:
=> 0x00002b033904e0a0 <_IO_file_write@@GLIBC_2.2.5+32>: sub %rax,%rbx
0x00002b033904e0a3 <_IO_file_write@@GLIBC_2.2.5+35>: add %rax,%rbp
0x00002b033904e0a6 <_IO_file_write@@GLIBC_2.2.5+38>: test %rbx,%rbx
0x00002b033904e0a9 <_IO_file_write@@GLIBC_2.2.5+41>: jle 0x2b033904e100 <_IO_file_write@@GLIBC_2.2.5+128>
End of assembler dump.
98% 26,863,981> si 3
0x00002b033904e0a9 in _IO_file_write@@GLIBC_2.2.5 () from /tmp/undodb.11859.1637847800.0086906.848e573eb2bbc4f3/debuggee-1-9wz3qm99/symbol-files/lib64/libc.so.6
98% 26,863,981> si
**************************************************************
Fatal error: xsave_type != 0xcafedead && (xsave_type & ~(~0x2f)) == 0
Additional information: B-code has not set up instr_ab_to_nonab_vars.xsave.type correctly: cafedead
Location: src/engine_impl/events/event_x86.inc.c:185:event_xsave_replay_sanity_checks [11904:198932]
**************************************************************

We call this kind of bug, where replay behaviour does not match record behaviour, a “non-determinism” bug, because the usual cause is that we failed to capture some non-deterministic feature of the system at record time, and so when we replay the program it takes a different execution path which we eventually discover through failure to execute the expected instruction. However, it seemed doubtful that this was the cause in this case, because the recording replayed correctly in UDB version 6.5.

3. A-code and B-code

To find the point where the execution paths diverged, I set up two instances of UDB replaying the same recording, one with version 6.5 (which I knew could replay the recording correctly) and one with 6.6 (which could not). Then I made both instances go to the same time in execution history and compared the execution state (registers and memory) between the two. I did a binary search to find the point where the replay of the recording diverged: this was a comparison instruction which succeeded in 6.5 and failed in 6.6. The comparison was with the register XMM0. Here’s the value of the XMM0 register in 6.6:

98% 26,861,822> p/x $xmm0.uint128
$3 = 0x0
image.gifand here’s the same register at the same point in 6.5:
98% 26,861,822> p/x $xmm0.uint128
$3 = 0x3f618961b5740d10
image.gifSo in 6.6 the XMM0 register was somehow incorrectly cleared. I traced this back to the preceding XRSTOR instruction:
98% 26,860,035> ugo time 26,860,035:0x2b0335972bf4
0x00002b0335972bf4 in _dl_runtime_resolve_xsavec () from /tmp/undodb.15785.1637859117.4496183.6b6e82dbc96fddf0/debuggee-1-03wan4qf/symbol-files/lib64/ld-linux-x86-64.so.2
98% 26,860,035> disas $pc,$pc+1
Dump of assembler code from 0x2b0335972bf4 to 0x2b0335972bf5:
=> 0x00002b0335972bf4 <_dl_runtime_resolve_xsavec+132>: xrstor 0x40(%rsp)
End of assembler dump.
98% 26,860,035> p/x $xmm0.uint128
$22 = 0x3f618961b5740d10
98% 26,860,035> si
0x00002b0335972bf9 in _dl_runtime_resolve_xsavec () from /tmp/undodb.15785.1637859117.4496183.6b6e82dbc96fddf0/debuggee-1-03wan4qf/symbol-files/lib64/ld-linux-x86-64.so.2
98% 26,860,035> p/x $xmm0.uint128
$23 = 0x0

In 6.6 this XRSTOR instruction cleared the XMM0 register but in 6.5 it left XMM0 unchanged (by restoring the same value). What was going on here? On x86-64, the XRSTOR instruction reverses the effect of the XSAVE instruction: XSAVE copies the processor state (or a subset thereof, depending on the flags in the RAX register) to memory, and XRSTOR restores processor state from memory. (You can read the gory details in the Intel 64 and IA-32 Architectures Software Developer’s Manual Volume 2: Instruction Set Reference.) The operation of the XRSTOR instruction is controlled by the XSTATE_BV field, at the offset of 512 from the base of the XSAVE region. This is a bit-vector describing the data stored in the XSAVE region. It is intersected with the requested-feature bitmap in the RAX register to determine what parts of the processor state get restored.

The XSAVE area was on the stack at 0x40(%rsp) and the requested-feature bitmap was in the RAX register:

98% 26,860,035> p/x $rax
$10 = 0xee

The value 0xee included, among other flags, the XSTATE_SSE flag, which includes the XMM registers. The XSTATE_BV field was here:

98% 26,860,035> p/x *(uint64_t*)($rsp+0x40+512)
$11 = 0x2

The value 0x2 was the XSTATE_SSE flag, so these were the registers that were supposed to be restored. The XMM0 register is stored at offset 160 from the base of the XSAVE area. In UDB version 6.6 this contained the wrong value (zero):

98% 26,860,035> x/2g $rsp+0x40+(10*16)
0x7ffd1741d260:    0x0000000000000000    0x0000000000000000
image.gifbut in version 6.5 the corresponding contents were non-zero:
98% 26,860,035> x/2g $rsp+0x40+(10*16)
0x7ffd1741d260:    0x3f618961b5740d10    0x0000000000000000
image.gifSomething must have gone wrong at the preceding XSAVE instruction, which failed to save the XMM0 register to the XSAVE area. I used the ugo event prev command to jump to this instruction. You’ll see from the disassembly below that this was actually an XSAVEC instruction (a variant of XSAVE that “compresses” the result by omitting the parts of the processor state that were not requested to be saved).
98% 26,860,035> ugo event prev
Going to: time=26,859,273:0xffffffffffffffff: XSAVE. size=96.
0x00002b0335972bd8 in _dl_runtime_resolve_xsavec () from /tmp/undodb.28658.1637919175.2831407.d35582fb6fc784c6/debuggee-1-y0jqeq4w/symbol-files/lib64/ld-linux-x86-64.so.2
98% 26,859,273> disas $pc, $pc+1
Dump of assembler code from 0x2b0335972bd8 to 0x2b0335972bd9:
=> 0x00002b0335972bd8 <_dl_runtime_resolve_xsavec+104>: xsavec 0x40(%rsp)
End of assembler dump.

I checked the value of the XMM0 register, used the step-instruction command to step over the XSAVE instruction, and then examined memory to see what had been saved:

98% 26,859,273> p/x $xmm0.uint128
$15 = 0x3f618961b5740d10
98% 26,859,273> si
0x00002b0335972bdd in _dl_runtime_resolve_xsavec () from /tmp/undodb.28093.1637918403.5949795.bc5f1f623b0c8cb6/debuggee-1-r8r7phvh/symbol-files/lib64/ld-linux-x86-64.so.2
98% 26,859,274> x/2g $rsp+0x40+(10*16)
0x7ffd1741d260: 0x0000000000000000 0x0000000000000000

In version 6.6 (above) the XMM0 register was not saved by the XSAVEC instruction, whereas in version 6.5 (below) it was saved correctly:

98% 26,859,274> x/2g $rsp+0x40+(10*16)
0x7ffd1741d260:    0x3f618961b5740d10    0x0000000000000000

How could the XSAVEC have failed to save the XMM0 register? One possibility is that the requested-feature flags in the RAX register omitted the XSTATE_SSE flag. But this wasn’t the case:

98% 26,860,035> p/x $rax
$16 = 0xee

The other possibility was that something had gone wrong with the translation of the XSAVEC instruction. This is where I have to confess that I’ve been concealing some complexity from you! The transcripts above contain disassemblies of the program being executed. But these are not the instructions that were actually executed! In order to record or replay the non-deterministic behaviour of the program, LiveRecorder uses a “just-in-time” approach to translate the original code (which we call “A-code”) into the code that is actually executed (which we call “B-code”). So long as A-code is executing deterministic instructions, the translation is simple, but when A-code executes a non-deterministic instruction like SYSCALL or XSAVEC, we translate it into code that will allow us to record or replay the non-deterministic results. When everything is working correctly we can ignore this and pretend that A-code is executing, but when something goes wrong we have to peek under the hood. I used the “umaintenance info bbs” command to look at the B-code translation of the A-code:

95% 26,072,993> umaint info bbs -d $pc
Dump of assembler code from 0x2b0335972bd8 to 0x2b0335972bdd:
=> 0x00002b0335972bd8 <_dl_runtime_resolve_xsavec+104>: xsavec 0x40(%rsp)
End of assembler dump.
Dump of assembler code from 0x5f40045b5f80 to 0x5f40045b6030:
0x00005f40045b5f80: movabs %rax,0x3f80011ea0c0
0x00005f40045b5f8a: mov %rdx,-0x3174d49(%rip) # 0x5f4001441248
0x00005f40045b5f91: movabs 0x3f8000dc6d88,%rax
0x00005f40045b5f9b: mov %rdi,0x8(%rax)
0x00005f40045b5f9f: mov %rsi,0x10(%rax)
0x00005f40045b5fa3: mov %rcx,0x30(%rax)
0x00005f40045b5fa7: mov %r8,0x40(%rax)
0x00005f40045b5fab: mov %r9,0x48(%rax)
0x00005f40045b5faf: movabs 0x3f80011ea0c0,%rax
0x00005f40045b5fb9: lea 0x40(%rsp),%rsi
0x00005f40045b5fbe: movabs 0x3f8000dc6d88,%rax
0x00005f40045b5fc8: mov %rsp,0x80(%rax)
0x00005f40045b5fcf: lea 0x8(%rax),%rsp
0x00005f40045b5fd3: pushf
0x00005f40045b5fd4: movabs 0x3f80011ea0c0,%rax
0x00005f40045b5fde: and $0x3133097c,%eax
0x00005f40045b5fe4: and $0x323134,%edx
0x00005f40045b5fea: mov %rax,%r8
0x00005f40045b5fed: mov %rdx,%r9
0x00005f40045b5ff0: movabs 0x3f8000dc6d88,%rax
0x00005f40045b5ffa: movabs $0x5f4001441280,%rax
0x00005f40045b6004: decl (%rax)
0x00005f40045b6006: movabs $0x8c,%rdi
0x00005f40045b6010: movabs $0x2b0335972bdd,%rcx
0x00005f40045b601a: rex.W jmp *0x7(%rip) # 0x5f40045b6028
0x00005f40045b6021: nop
0x00005f40045b6022: nop
0x00005f40045b6023: nop
0x00005f40045b6024: nop
0x00005f40045b6025: nop
0x00005f40045b6026: movabs $0x3f8000196d10,%r14
End of assembler dump.
Found 1 BB(s) overlapping the specified range.
BB length: 5, A-code from 0x2b0335972bd8 - 0x2b0335972bdd, B-code from 0x5f40045b5f80 - 0x5f40045b6030, BBT_XSAVEC

There’s a lot of detail there, so here’s the important bit from the B-code:

   0x00005f40045b6021:    nop
   0x00005f40045b6022:    nop
   0x00005f40045b6023:    nop
   0x00005f40045b6024:    nop
   0x00005f40045b6025:    nop

These five bytes are where the XSAVEC instruction should have been. No wonder the XMM0 register was not being saved: in fact, nothing was being saved! To summarize the results of the investigation so far:

Replay crashed
… because an XSAVE event was replayed but the XSAVE type was the default value;
… because no XSAVE had been executed;
… because the program had taken the wrong branch;
… because the XMM0 register had the wrong value;
… because it had been clobbered by an XRSTOR instruction;
… because the XSAVE area had zeroes in the XMM0 field;
… because the XMM0 register had not been saved;
… because no XSAVE instruction had been executed;
… because the B-code contained NOPs instead of an XSAVE instruction.

4. Child code

To make further progress I needed to look at what went wrong in the just-in-time instruction translator. Translation takes place inside the process being debugged (the “child” process): after a block of B-code finishes executing, it jumps to a run-time management function which has the job of ensuring that the next block of A-code has been translated. Debugging this run-time management code is slightly tricky because the Unix ptrace mechanism allows a process to have only one debugger at a time, and in this situation the child already has a debugger, namely the instance of UDB that is replaying the recording. So the way we support debugging the child code is to multiplex our ptrace connection so that we can have two debuggers running simultaneously, one debugging the original code via the B-code translation, and the other debugging the management code in the same process. As you might imagine, this is a bit delicate when the process switches from original code to management code or vice versa. Luckily for me I didn’t need to spend much time looking at the management code.

The translation of XSAVEC instructions takes place in the function xlate_XSAVEC(), so after attaching a debugger to the “debug server” (the multiplexed debugging system described above) I set a breakpoint there:

(udebug) b xlate_XSAVEC
Breakpoint 1 at 0x3f80001acbb0: file src/engine_impl/instr/xlate_x86_64.inc.c, line 3564.
(udebug) c
Continuing.

Breakpoint 1, xlate_XSAVEC (...) at src/engine_impl/instr/xlate_x86_64.inc.c:3564
3564 if (instr_config()->simulate_xsavec)
(udebug) list
3561 xlate_XSAVEC(pc_t src, pc_t effective_src, ASMGEN_PTR_DECL,
3562 size_t instr_len, bool to_xsaveopt, uint64_t recorded_xcr0)
3563 {
3564 if (instr_config()->simulate_xsavec)
3565 {
3566 return s_xlate_simulate_xsavec(src, ASMGEN_PTR, instr_len,
3567 false, recorded_xcr0);
3568 }
3569
3570 return s_common_xlate_XSAVE(src, effective_src, ASMGEN_PTR,
(udebug) p instr_config()->simulate_xsavec
$1 = true

The XSAVEC translator takes one of two code paths, according to the simulate_xsavec flag. This feature is designed to allow UDB to take a recording made on a CPU supporting the XSAVEC instruction and replay it on a CPU lacking this feature. But this couldn’t explain what I was seeing, because the replay machine (my laptop in this case) has a Broadwell CPU, which supports the XSAVEC instruction. So it seemed more likely that the simulate_xsavec flag had been configured incorrectly.

5. Nanny code

The configuration of the instruction translation system takes place in a separate process, called “Nanny” because its job is to look after the various “child” processes involved in recording and replaying. Nanny and child processes share some regions of memory, so that after Nanny configures the instruction translation system, the children can access the configuration by calling instr_config(). Nanny code is much easier to debug than child code, because it’s possible to use LiveRecorder to make a recording of Nanny, and then this can be replayed forwards and backwards to investigate the problem. The instruction translation system is configured in the function instr_configure(), so I set a breakpoint there:

start 1> b instr_configure
Breakpoint 1 at 0x3f80001997d0: file src/engine_impl/instr/instr_config.c, line 16.
start 1> c
Continuing.

Breakpoint 1, instr_configure (ifs=ifs@entry=0x3f8001211a40) at src/engine_impl/instr/instr_config.c:16
16 *s_instr_config = *ifs;
1% 1,139,945> p ifs->simulate_xsavec
$1 = true

To find out how this flag was set, I set a watchpoint on the memory location of the flag, and used the reverse-continue command (rc for short) to run backwards to the point where the flag was set:

1% 1,139,945> watch -l ifs->simulate_xsavec
Hardware watchpoint 2: -location ifs->simulate_xsavec
1% 1,139,945> rc
Continuing.

Hardware watchpoint 2: -location ifs->simulate_xsavec

Old value = true
New value = false
0x00003f80001dc2f0 in s_set_global_state (mi=0x3f8001210ef0) at src/engine_impl/save_load/load.c:405
405 instr_config.simulate_xsavec = !(asm_xsave_supported() & ASM_CPUID_XSAVEC) ||

The “Old value” and “New value” labels are from the debugger’s point of view, not the debuggee’s, so when running backwards we have to mentally swap them: this means that when Nanny was recorded, it set the simulate_xsavec flag to true at this point. Why did it do that? Well there were three possible reasons:

1% 1,139,944> list
400 * recording, and is not user configurable in any way.
401 */
402 instr_config.simulate_adx = true;
403
404 #if defined(__i386__) || defined(__x86_64__)
405 instr_config.simulate_xsavec = !(asm_xsave_supported() & ASM_CPUID_XSAVEC) ||
406 config_get()->simulate_instr.xsavec ||
407 s_is_xsave_features_subset(mi->xsave_xcr0);
408 #endif
409
image.gifThese three possibilities were:
  1. !(asm_xsave_supported() & ASM_CPUID_XSAVEC): that is, XSAVEC was not supported on the replay CPU, as discussed above. But this was not the case: the replay CPU did support this instruction.
  2. config_get()->simulate_instr.xsavec: that is, a global configuration flag requested simulation of XSAVEC, for example for testing purposes. But this flag had not been set:
    1% 1,139,944> p s_config->simulate_instr.xsavec
    $2 = false

     

  3. s_is_xsave_features_subset(mi->xsave_xcr0): that is, the replay CPU lacked one or more of the features of the recording CPU. Here mi->xsave_xcr0 is a set of flags describing the XSAVE capabilities of the processor, extracted from extended control register 0 on the machine where the recording was made, and saved in the recording. (Note that the function name is misleading since s_is_xsave_features_subset() returns true if the recording CPU’s XSAVE capabilities are not a subset of the replay CPU’s XSAVE capabilities.)

So it was condition 3 that was satisfied:

1% 1,139,944> p/x mi->xsave_xcr0
$3 = 0x34353734310b2c
1% 1,139,944> rs
s_is_xsave_features_subset (saved_xcr0=<optimised out>) at src/engine_impl/save_load/load.c:275
275 return (saved_xcr0 & ~replay_xcr0) != 0;
1% 1,139,943> list
270 uint64_t replay_xcr0;
271
272 int e = proc_get_xcr0(libc_heap, NULL, &replay_xcr0);
273 verify_warnf(e == 0, "failed to get XCR0 value: %s", debug_errno_string(e));
274
275 return (saved_xcr0 & ~replay_xcr0) != 0;
276 }
277 #endif
1% 1,139,943> p/x replay_xcr0
$4 = 0x1f
1% 1,139,943> p/x 0x34353734310b2c & ~replay_xcr0
$5 = 0x34353734310b20

(At this point I failed to notice that the value of mi->xsave_xcr0 looks very like ASCII-encoded text: in particular, the bytes 34 35 37 34 31 encode the string “45741”. This would have been a useful hint as to the underlying cause of the bug! But the good thing about working backwards through program execution with UDB is that you don’t have to be so clever.)

The value in mi->xsave_xcr0 looked suspicious: extended control register 0 should not have so many set bits. Where did this value come from? I set a watchpoint on the location of mi->xsave_xcr0 and reverse-continued to find the point where it was set:

1% 1,139,943> del 2
1% 1,139,943> watch -l mi->xsave_xcr0
Hardware watchpoint 3: -location mi->xsave_xcr0
1% 1,139,943> rc
Continuing.

Hardware watchpoint 3: -location mi->xsave_xcr0

Old value = 14695210004056876
New value = 14695210002221868
0x00003f800014f2a0 in TrioWriteNumber (self=self@entry=0x3f8001211100, number=1, flags=flags@entry=0, width=width@entry=0, precision=-1, base=10) at 3rd_party/util/libc_sprintf.c:2038
2038 *pointer-- = digits[number % base];
0% 13,781>

image.gifWhat happened here? This does not look like an update to mi->xsave_xcr0! But it’s important to remember that a location watchpoint watches a value at a specific location in memory, and that memory location can be reused for different data structures during the execution of the program. So what this demonstrated was that mi->xsave_xcr0 was never updated at all, and so its value was whatever was at that location in memory before the mi structure was allocated.

Note also that the reverse-continue command ran a long way backwards: you can see that the bbcount in the prompt reduced from 1,139,943 all the way to 13,781. So this libc_snprintf() call was very early in the program’s execution, long before it was loading the recording and updating the recording_misc_info_t data structure. When a reverse-execution command takes you a long way from the code you were investigating it is often convenient to use the ugo undo command (uu for short) to get back to where you were:

0% 13,781> uu
0x00003f80001dc2f0 in s_set_global_state (mi=0x3f8001210ef0) at src/engine_impl/save_load/load.c:405
405 instr_config.simulate_xsavec = !(asm_xsave_supported() & ASM_CPUID_XSAVEC) ||
1% 1,139,943>
image.gifSo how come the most recent update to mi->xsave_xcr0 was the apparently unrelated update to the local variable pointer in the libc_snprintf() function? Could it be that mi was allocated on the stack? It sure was:
1% 1,139,944> p mi
$10 = (const recording_misc_info_t *) 0x3f8001210ef0
1% 1,139,944> p/x $sp
$11 = 0x3f8001210ce0
image.gifSo this finally explained the bug! The recording_misc_info_t structure was a local variable in the function handle_command_load():
recording_misc_info_t misc_info;

Being a local variable, it was allocated on the stack, and since this is the C programming language, local variables are not initialized unless they have explicit initializers. The misc_info structure was supposed to be initialized by a call to recording_read_misc_info() to fill in the fields of the structure with data from the recording file. But this function only filled in the fields which were actually read from the file, and recording files did not have an xsave_xcr0 field until LiveRecorder version 6.0. (In older versions of the product, the replay CPU had to have a superset of the features of the recording CPU, but this condition proved onerous, and starting with version 6.0 we began to add support for replay via emulation where features were missing from the replay CPU.) To summarize the analysis:

Replay crashed
… because an XSAVE event was replayed but the XSAVE type was the default value;
… because no XSAVE was executed;
… because the program took the wrong branch;
… because the XMM0 register had the wrong value;
… because it had been clobbered by an XRSTOR instruction;
… because the XSAVE area had zeroes in the XMM0 field;
… because the XMM0 register had not been saved;
… because no XSAVE instruction was executed;
… because the B-code contained NOPs instead of an XSAVE instruction;
… because it had been translated by s_xlate_simulate_xsavec();
… because the simulate_xsavec flag had been configured incorrectly;
… because s_is_xsave_features_subset(mi->xsave_xcr0) returned true;
… because mi->xsave_xcr0 had the bogus value 0x34353734310b2c;
… because the structure was stack-allocated and this field had never been initialized;
… because the recording had no xsave_xcr0 field.

This analysis satisfyingly explained the symptoms of the bug:

  1. It only affected recordings made with LiveRecorder version 5.3 or earlier, because later recordings contained an xsave_xcr0 field that was used to initialize the corresponding field in the recording_misc_info_t structure.
  2. It randomly manifested or not, depending on changes to nearby code, because these changes affected the stack layout and so changed the bogus value in the xsave_xcr0 field in the recording_misc_info_t structure. In particular, the dead code that I had removed in the commit that started the whole thing had contained an (unused) stack-allocated data structure whose removal pushed the recording_misc_info_t structure upward in memory, causing it to overlap the memory used by the local variable pointer in a very early libc_snprintf() call.

The code introducing the bug had been committed in June 2020, but the bug had remained latent for a year and a half before I caused it to reveal itself. 

6. The fix

I won’t claim that this is the largest ratio of investigation to diff that I’ve ever been involved with (it’s probably beaten by the last lousy bug) but it’s an impressive ratio nonetheless:

- recording_misc_info_t misc_info;
+ recording_misc_info_t misc_info = { 0 }

7. Further work

Of course, the fix is not the end of the matter. Applying the “five whys” principle, we ought to try to understand how the bug was introduced, and take appropriate action to reduce the risk of something similar happening in future. From general approaches to specific, here are the ideas I had:

  1. The bug would have been caught immediately if we could have run Nanny under Valgrind. Valgrind’s “Memcheck” tool detects dangerous use of uninitialized values, and would have caught the dependency on the uninitialized xsave_xcr0 field. Unfortunately, we can’t run Nanny under Valgrind because Nanny is not launched from an executable, but forked from the original child process. When LiveRecorder was first developed in the early 2000s, there were technical reasons for doing it this way, but Linux’s capabilities have become much more powerful in the last couple of decades, so we should now refactor the Nanny–child relationship to make it possible (among other benefits) to run Nanny under Valgrind. (We do run other parts of the UDB software under Valgrind.)

  2. The bug was not caught in review. We can’t expect reviewers to spot subtle action-at-a-distance bugs like this, but what we can do instead (following the “software inspection” principle) is to write down a rule (that all stack-allocated data structures must be initialized), and ask reviewers to apply the rule to new code. It’s easy to check the rule locally, and if it had been applied here, it would have fixed the problem without anyone needing to spot the bug.
  3. A programmer might reasonably expect that a function named recording_read_misc_info() fills in the whole of the  recording_misc_info_t structure. The fact that it only fills in the fields that were present in the recording, and leaves other fields uninitialized, is a bug magnet. We should update this function so that it updates the whole structure, either by moving initialization code into the function, or moving fields out of the structure.
  4. A programmer might reasonably expect that a function named s_is_xsave_features_subset() returns true if the record CPU’s XSAVE capabilities are a subset of the replay CPU’s capabilities. We should rename the function or change its behaviour to correspond to the name.

0 replies

Be the first to reply!

Reply