Solving a heap corruption bug using reverse watchpoints

  • 22 February 2021
  • 4 replies
  • 36 views

I recently added a function of our code whose purpose was to concatenate the contents of two memory regions. The basic scheme was:

  • Ensure destination region is large enough
  • Work out number of bytes that need copying (len)
  • Work out offset to copy bytes to
  • Do memcpy(dest + offset, src, len);

It being early on a Monday and the coffee not yet having kicked in I erroneously determined that offset and len were the same quantity and ended up typing something like:

memcpy(dest + offset, src, offset);

 

This is broken as offset may be much larger than the number of extra bytes reserved at the end of dest. This defect manifested as seemingly unrelated failures a long time after the broken copy had taken place. These failures were due to corruption of data stored on the heap. I was able to obtain a recording. My normal work flow for debugging a recording of a session that terminated with an assertion failure (as this was) is something like:

  • ugo end
  • backtrace to get an impression of where I am
  • reverse-finish to the failing assertion
  • Navigate around a little with next/reverse-next

In this case I identified the memory region that was corrupted and set a watchpoint on it. A reverse-continue led me to the innards of a memset. At this point I decided the likely cause was some sort of heap allocation issue. I decided to try a debug build, which incorporates extensive heap checking. This build failed much sooner with a heap integrity check. Once again I obtained a recording. I identified the corrupted field in the heap block header, set a watchpoint and did a reverse-continue. This took me to my new function and it was rapidly apparent what my mistake was.


4 replies

Badge

@Edward Tait Thanks for sharing.

I’m curious...when your watchpoint and rc led you to the memset, how come exploring up the stack, or a few rf didn’t lead you to the conclusion without the need for either enabling heap checking or getting another recording with that turned on?

The data being overwritten by the memset wasn’t heap bookkeeping data but rather something the code I was working on stored on the heap. I examined the backtrace and could tell from the context that the memset was being used on a region of memory that was freshly allocated. At that point I suspected the code I was working on was doing a use-after-free (as it did some caching of allocated structures that seemed very credible), but inspection, setting breakpoints and stepping back and forward through some functions I thought might be misbehaving seemed to show that there was no use-after-free. From the fact that already allocated memory seemed to be being allocated again I determined that the heap’s internal data structures were getting corrupted. I didn’t know enough about the heap internals to know where a good place for a watchpoint would be, so went for the slow way of using full heap checking.

Badge

Ah...I think I follow. So although you could see in the first recording that the data structures weren’t quite right, you didn’t know enough to deduce what wasn’t right...and therefore which bit of memory to watch next to go back further for the next breadcrumb?

It sounds like in your case, the issue reproduced readily enough that turning on extra diagnostics and then moving on to a second recording wasn’t too time-consuming and allowed you to get to the bottom of things for yourself.  Had this not been the case, perhaps you would have asked a colleague with more expertise in the heap module to take a look or it might have made sense in some scripting to do some post hoc scripting to replicate some of the run-time checking that you enabled, but on a recording.

Exactly - the test failed very repeatably and it only took a few minutes to get a build with heap debugging enabled. There would have been a few extra steps to reverse back to the allocation code where the memory got reallocated in the first recording and I’d still have had to dig around in the heap code to find where to set the next watchpoint, so it was a judgement call on what was going to be quicker.

Reply