Distributed debugging using wallclock

  • 21 April 2021
  • 0 replies
  • 12 views

  • New Participant
  • 2 replies

Although I’m supposed to be busy these days running the company I still like to write code when I can carve out some time. I tell myself it’s important to understand the world of our customers, but really that’s just an excuse - it’s still my happy place :)

 

Forever we’ve wanted to make it easy to stitch together multiple processes. We have multi-process correlation for shared memory support and it’s super cool, but there’s nothing for processes communicating over the network. Imagine being able to reverse-step through a remote procedure call. It’s been on our todo list for ever, but so far it hasn’t quite made it to the top.

 

Like most applications these days, UDB itself is composed of multiple processes. There is GDB, which talks to our own implementation of gdbserver (called, imaginatively enough, udbserver), which talks to a component we call nanny. All this happens over sockets.

 

I was trying to diagnose a problem where our udbserver was receiving a bogus progress notification. As of our latest release, we report the progress (i.e. percentage done) of long-running notifications. (As everyone knows, this is one of the hardest problems in computer science.) Our nanny component was suddenly and erroneously sending a 0% notification to udbserver, which then caused an assertion failure. So I had a recording of the udbserver and one of nanny, and I could see from udbserver that it appeared to be doing the right thing and nanny was just sending bogus data. Why? In an ideal world I’d have been able to reverse through udbserver’s socket recv() to nanny’s socket send(), but in the absence of that I did `uinfo wallclock` on the udbserver recording when it was on the socket recv(), switched to the nanny recording and did `ugo wallclock` to that time. Wallclock time is not always super precise, but I could see I’d landed close to a socket send() event (I did `uinfo events` to see the nearby system calls), so I went back to that send() and voila, here was the source of the bug. Actually I wasn’t straight there, but one more watchpoint and reverse-continue and I was. All in all it was a few minutes to fix something that otherwise would have wasted half a day. And when you only get to spend half a day every so often doing actual programming, that would have really sucked.


0 replies

Be the first to reply!

Reply