Corruption of siginfo in sighandler when recorded

  • 29 October 2021
  • 0 replies
  • 53 views

  • New Participant
  • 0 replies

I've recently worked on a problem reported by a customer where our recording engine appeared to change the behaviour of signal handlers and I found the investigation and subsequent workaround interesting enough to share.

First a bit on signal handlers. Processes can install their own functions to handle signals with the rt_sigaction syscall:

struct sigaction sa;
sa.sa_flags = SA_SIGINFO;
sa.sa_sigaction = handler;
sigemptyset(&sa.sa_mask);
sigaction(SIGCHLD, &sa, NULL);

When the kernel invokes this function to deliver a signal it passes a few parameters to the handler to provide information about the signal that is being delivered. Given that the SA_SIGINFO option was used to install the handler, one of those parameters will be a siginfo_t struct.

typedef struct {
int si_signo;
int si_errno;
int si_code;
union __sifields _sifields;
} siginfo_t;

The most used fields in the siginfo_t struct are si_signo, which encodes the signal number, and si_code which encodes a reason for why the signal was created. For example, a si_code value of SI_USER generally means that the signal was sent directly from a user space process, whereas SI_KERNEL means that it was produced by some event in the kernel (see SEND_SIG_PRIV in the kernel code).

The fact that _sifields is a union means that different signals can use the same structure to efficiently encode a number of different fields that are specific to each signal.
 

union __sifields {
/* kill() */
struct {
__kernel_pid_t _pid; /* sender's pid */
__kernel_uid32_t _uid; /* sender's uid */
} _kill;

// ...

/* SIGCHLD */
struct {
__kernel_pid_t _pid; /* which child */
__kernel_uid32_t _uid; /* sender's uid */
int _status; /* exit code */
__ARCH_SI_CLOCK_T _utime;
__ARCH_SI_CLOCK_T _stime;
} _sigchld;

// ...
};

The behaviour that I've been investigating expressed itself in the signal handlers of SIGCHLD signals. These signals are sent to a process when one of its forked child processes terminate, allowing the parent's signal handler to inspect the _sifields._sigchld._status field to determine what the exit status of the child process was. The default disposition for SIGCHLD signals is to ignore them, so an application wouldn't be informed about these signals unless it install a handler.

Our reproducer simply checks that the SIGCHLD's siginfo_t structure contains the expected values (si_status is an alias for _sifields._sigchld._status).

void handler(int signo, siginfo_t *info, void *ctx)
{
assert(info->si_signo == SIGCHLD);
assert(info->si_code == CLD_EXITED);
assert(info->si_status == EXIT_SUCCESS);
}
    pid_t pid = fork();
if (pid > 0) { /* parent */
int wstatus = -1;
waitpid(pid, &wstatus, 0);
assert(WIFEXITED(wstatus) && WEXITSTATUS(wstatus) == EXIT_SUCCESS);
}
else if (pid == 0) { /* child */
_exit(EXIT_SUCCESS);
}

When run natively the status field would always provide the expected EXIT_SUCCESS value. When run under recording the field would intermittently contain other values. Further experimentation with the reproducer showed that the field always contains a bad value, but incidentally, the bad value would often be EXIT_SUCCESS. Furthermore, the failures only expressed themselves on a few of our older Linux distributions. The investigation was carried out on a v3.10 kernel.

There are several components involved with getting the siginfo_t structure safely into the signal handler. The kernel first creates the SIGCHLD signal, with a corresponding siginfo_t structure which the kernel holds onto for now, and queues the signal to be delivered to our process. Since the recorded process is run under the supervision of the kernel's ptrace facility, the recording engine is able to intercept the signal before it is delivered to the recorded process.

A notification about the signal is passed from the kernel to the recording engine via the waitpid interface, at which point we're able to inspect and modify the siginfo_t structure via ptrace calls. This is something our recording engine has to do quite a lot because of how the recorded process is instrumented. The delivery of a signal must often be deferred until a later point when it is safe to deliver it, and the siginfo_t structure must then be reinstalled with a ptrace call before the correct signal can be injected into the process.

When the the recorded process is resumed the kernel copies the siginfo_t structure into it and invokes the signal handler function. At that point our instrumentation regains control and takes inventory of the world before letting the actual signal handler run.

All of this adds complexity to the engine's handling of signals, and when combined with the requirement to support more advanced signal features such as syscall interruption and running signal handlers on alternative signal stacks, it is easy to imagine that the engine could be messing up somewhere and corrupting the contents of the siginfo_t structure.

The structure makes multiple passes between the kernel, the recording engine and the recorded process, so there are multiple entities and communications to consider when narrowing down this problem. I decided to approach this from both ends of the siginfo_t structure's life cycle.

In the beginning the structure is created by the kernel, but since we're not able to record the kernel it was easier to have a look at the first observation of the structure within the recording engine. One of the nice features of our recording engine is that it is able to record parts of itself, and from such a recording it was possible to confirm that the kernel indeed passes a valid siginfo_t structure to us via ptrace. Later on in the same recording it was confirmed that the structure is still valid when it is passed back into the kernel just before the recorded process is resumed and the signal is reinjected. Without a recording of the engine it would have taken much longer to absolve this part of the engine's handling.

At the other end of the life cycle, the recording from the recorded process itself has already informed us the the structure that is present in the application's signal handler is invalid. However, before that structure reaches the application's signal handler it has already passed through a part of our engine that sits inside the recorded process itself. Unfortunately, this part of the engine cannot be recorded because the ptrace facility only allows a single recording engine to be attached to a single process. Luckily, we have an internal tool providing limited debugging capabilities for this part of our engine. Using this it was confirmed that the siginfo_t structure is invalid when it is set up into the recorded process, even before the engine touches it.

This leaves us with the uncomfortable suspicion that the siginfo_t structure may have been clobbered by the kernel. It looks fine when going in and looks bad when coming out.

SystemTap is my go-to tool when I need to have a look at what the kernel is doing in detail. It lets you to create and load small kernel modules that dynamically hook into probe points to display or modify live kernel state. In this case it turned out to be very interesting to inspect the contents of  the siginfo_t structure which the kernel stores internally.

The signal_delivered kernel function is invoked at a point when the kernel has finished setting up the signal frame in the user space process but before the process has been resumed. The following SystemTap script was used to instrument the function to dump some relevant information.

probe signal_delivered = kernel.function("signal_delivered")
{
sig = $sig
regs = $regs
info = $info
}

probe signal_delivered {
if (info && execname() == "reproducer")
{
// At this point regs->sp should point to frame, and frame->info is the info in user memory
printf("signal_delivered: pid:%d, %p, %p, %p\n", pid(), sig, regs, info)
printf(" info: 0x%x\n", info)
printf(" signo: 0x%x\n", info->si_signo)
printf(" code: 0x%x\n", info->si_code)
printf(" status: 0x%x\n", info->_sifields->_sigchld->_status)
printf(" $sp (frame): 0x%x\n", regs->sp)
printf(" frame->info: 0x%x\n", &@cast(regs->sp, "struct rt_sigframe")->info)
printf(" signo: 0x%x\n", @cast(regs->sp, "struct rt_sigframe")->info->si_signo)
printf(" code: 0x%x\n", @cast(regs->sp, "struct rt_sigframe")->info->si_code)
printf(" status: 0x%x\n", @cast(regs->sp, "struct rt_sigframe")->info->_sifields->_sigchld->_status)
print_backtrace()
printf("\n")
}
}

The output from running this script tells us two things. Firstly it shows us the contents of the siginfo_t structure inside the kernel, that is, the one which is used as a source from which to copy into user space. Secondly, it tells us what the contents of the actual siginfo_t structure in user space is after it has been written. In our instance these two structures are different!

signal_delivered: pid:8353, 0x11, 0xffff8b1c7529ff58, 0xffff8b1c7529fe70
info: 0xffff8b1c7529fe70
signo: 0x11
code: 0x1
status: 0x0
$sp (frame): 0x7ffd67c1c9b8
frame->info: 0x7ffd67c1caf0
signo: 0x11
code: 0x1
status: 0x67c1cb20
0xffffffffaccb38b0 : signal_delivered+0x0/0xa0 [kernel]
0xffffffffaccb398c : signal_setup_done+0x3c/0x60 [kernel]
0xffffffffacc2c6a0 : do_signal+0x1d0/0x6f0 [kernel]
0xffffffffacc2cc32 : do_notify_resume+0x72/0xc0 [kernel]
0xffffffffad38c57c : retint_signal+0x48/0x8c [kernel]

The status field of the structure inside the kernel has the expected EXIT_SUCCESS value, but in user space there seems to be a totally different, seemingly random value. This prompted a finer study of the `do_signal` function, which quickly suggested `setup_rt_frame` and `copy_siginfo_to_user` as suspects.

It turns out that kernels from before v4.14 handle siginfo_t structures a bit differently. When copying siginfo_t structures into user space each signal is given special treatment. The common fields si_signo, si_errno and si_code are always written to user space, but the remaining fields are guarded by a switch case. F.ex for SIGCHLD we have (from copy_siginfo_to_user in signal.c):

    err |= __put_user((short)from->si_code, &to->si_code);
switch (from->si_code & __SI_MASK) {

// ...

case __SI_CHLD:
err |= __put_user(from->si_pid, &to->si_pid);
err |= __put_user(from->si_uid, &to->si_uid);
err |= __put_user(from->si_status, &to->si_status);
err |= __put_user(from->si_utime, &to->si_utime);
err |= __put_user(from->si_stime, &to->si_stime);
break;

This piece of code is key. A bitmask __SI_MASK is applied to si_code to switch between different sets of fields. The SIGCHLD signal is required to have a si_code value containing __SI_CHLD in order for the si_status field to be copied into user space. Otherwise the si_status field is simply not copied. When combining the __SI_CHLD value (0x4000) with the CLD_EXITED value (0x1), which is emitted when children exit normally, we would expect our si_code to be 0x4001, but the SystemTap output has shown that the si_code field internal in the kernel only contains 0x1. This explains exactly why we see bad si_status values in the recorded process. The kernel-internal structure lacks an internal __SI_CHLD value and this prevents the kernel from writing out the si_status. The recorded process is presented with an out of scope stack value! Interestingly, the si_pid and si_uid values appear to be correct and this can be attributed to a catch-all default case. The fact that these fields are part of a union means that they alias onto most of the fields for other signals in the union, and only signals that extend beyond the storage of the si_pid and si_uid fields end up losing information.

    default: /* this is just in case for now ... */
err |= __put_user(from->si_pid, &to->si_pid);
err |= __put_user(from->si_uid, &to->si_uid);
break;

The obvious question now is why the internal __SI_CHLD value is missing. It's actually caused by the siginfo_t structure taking a round-trip to the recording engine and back again with a pair of PTRACE_GETSIGINFO and PTRACE_SETSIGINFO calls. Even though the recording engine does not modify the contents of the structure, the copy_siginfo_to_user function masks away the internal __SI_MASK values by casting the si_code field to a short (see earlier listing) while the opposite copy_siginfo_from_user function makes no effort to reintroduce the internal value.

The solution to the problem turned out to be simple. Since the kernel does not perform any masking in copy_siginfo_from_user the recording engine can reintroduce the internal __SI_MASK values when reinjecting the structures with PTRACE_SETSIGINFO. When doing so the si_status field gets populated with valid values and the native behaviour is restored.

Starting with kernel version 4.14 the internal __SI_CHLD values were removed, and later the switch case that handles each signal differently was also removed. These days the whole contents of the siginfo_t structure is copied unmodified from and to user space, so this workaround of reintroducing __SI_MASK values is only valid for older kernels.

This obscure behaviour is something that any debugger using PTRACE_SETSIGINFO can run into, but our recording engine happens to use PTRACE_SETSIGINFO a lot in order to defer signals so it is especially susceptible to this problem.

As a final nugget, it was found that the initial workaround did not work for SIGPOLL signals in 32-bit programs running on x86-64. The offset of the si_fd field prevents it from being aliased and copied in the default case, meaning that the field should have been affected by the same problem and should respond to the same fix, however, the field had the opposite response of only failing when the workaround was applied. This turned out to be caused by a secondary bug related to the mapping of siginfo_t layouts from 32-bit to 64-bit. I'll leave it to the reader to figure out what the problem with the copy_siginfo_to_user32 function is and how that relates to the __SI_CHLD masking behaviour.

I think I learnt a lot from having to switch between tools to retrieve information from different types of components of this system. SystemTap was invaluable in figuring out how the kernel handles siginfo_t structures, but I wouldn't have had the confidence to dive into the kernel if hadn't recorded and analysed the engine first. I'd be interested to hear how other people would approach this kind of problem.


0 replies

Be the first to reply!

Reply