Crash recovery in 256 bytes

The exhubris supervisor.

(This continues my series of posts on the exhubris tools I’m building, to enable more people to use Hubris in their embedded systems.)

One of Hubris’s strongest features is its ability to handle crashes in drivers and other application logic. It leaves the specific crash handling behavior up to the application programmer through a mechanism called a supervisor.

In this post I’ll look at why I made this decision and how it works in practice, and then walk through the exhubris supervisor reference implementation, minisuper. (Spoiler: it’s very small.)

The role of the supervisor in Hubris

Bugs happen, and programs crash. That’s more or less the single-sentence motivation for Hubris. But what happens when a program crashes?

On your phone or a desktop computer, you probably get a popup warning saying that an application has stopped. You can launch it again if you want. This approach requires human interaction: someone has to read the error message and restart the program (or not) as desired.

Servers in datacenters generally can’t expect a human operator to click dialog boxes, so programs on these machines are usually restarted automatically by some sort of service framework. There are typically tools to control how often the program can be restarted, and whether there are circumstances where the system should give up and stop trying.

Hubris is designed for systems that don’t require human interaction, and may not even have any sort of interface. The tiny microcontrollers inside a keyboard or power adapter need to just keep working without asking for help. So, in such systems, what should we do if a task crashes?

We could immediately restart it, but if we’re in a situation where a task may endlessly crash on startup, that could be very expensive, preventing the machine from meeting its other responsibilities.

We could leave it be, but what will cause it to start back up again? Does the user have to turn us off and back on again?

We could introduce some sort of backoff strategy, waiting longer periods of time each time a task crashes. But how do we cap the wait? If the wait begins at one second and doubles every time, after ten crashes we’re waiting 17 minutes, which is an awfully long time for a toaster to become unresponsive!
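To make that arithmetic concrete, here is a minimal sketch of a doubling backoff with a cap. The function name, its parameters, and the idea of passing the cap in are illustrative assumptions; nothing like this is part of Hubris or exhubris.

```rust
/// Wait (in seconds) after `doublings` doublings of `base_secs`, clamped to
/// `cap_secs`. Hypothetical helper for illustration, not an exhubris API.
fn backoff_secs(base_secs: u64, doublings: u32, cap_secs: u64) -> u64 {
    // checked_pow and checked_mul return None on overflow instead of
    // panicking or wrapping; an overflowing wait is treated as "at the cap".
    2u64.checked_pow(doublings)
        .and_then(|factor| base_secs.checked_mul(factor))
        .map_or(cap_secs, |wait| wait.min(cap_secs))
}
```

Starting from one second, ten doublings give 1024 seconds, which is the roughly 17 minutes mentioned above. A cap of 60 seconds would clamp that to one minute, but picking the cap is exactly the kind of application-specific choice at issue here.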

My conclusion is that there is no right answer to this question. The correct action to take when a program crashes depends on the context and the application’s requirements. So, Hubris leaves it up to you, the programmer.

Specifically, the decision is left up to the supervisor task. All Hubris applications must include one task playing the role of supervisor. The supervisor will be informed about any crashes, and have the opportunity to take any corrective action it likes.

When a task in a Hubris application crashes — whether it’s because it abused a syscall, dereferenced a wild pointer, or explicitly called panic! — the kernel does only three things.

  1. It records information about the crash (called a fault) in the task’s control block in kernel memory.

  2. It posts a notification to the supervisor task.

  3. It invokes the scheduler to choose a different task to run, since the current task is dead. This almost always chooses the supervisor task [1].

That’s it. The kernel does not alter the crashed task’s other state in any way. It also does not immediately unblock any other tasks that are trying to send messages to the crashed task (more on this later).
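The three steps can be sketched as a toy model. This is not kernel code: the types, field names, and the convention that a lower priority number is more important are all assumptions made for this illustration.

```rust
const SUPERVISOR: usize = 0;
const FAULT_NOTIFICATION: u32 = 1 << 0; // notification 0 as a bitmask

#[derive(Debug, Clone, Copy, PartialEq)]
enum TaskState {
    Runnable,
    WaitingForNotification(u32), // blocked until a bit in this mask is posted
    Faulted(&'static str),       // the recorded fault
}

struct Task {
    state: TaskState,
    priority: u8, // assumption in this model: lower number = more important
    pending: u32, // pending notification bits
}

struct ToyKernel {
    tasks: Vec<Task>,
}

impl ToyKernel {
    /// Set notification bits on a task, waking it if it was waiting for them.
    fn post(&mut self, index: usize, bits: u32) {
        let task = &mut self.tasks[index];
        task.pending |= bits;
        if let TaskState::WaitingForNotification(mask) = task.state {
            if (task.pending & mask) != 0 {
                task.state = TaskState::Runnable;
            }
        }
    }

    /// Choose the most important runnable task, if any.
    fn pick_next(&self) -> Option<usize> {
        self.tasks
            .iter()
            .enumerate()
            .filter(|(_, task)| task.state == TaskState::Runnable)
            .min_by_key(|(_, task)| task.priority)
            .map(|(index, _)| index)
    }

    /// Model of the three steps taken when task `index` crashes.
    fn fault(&mut self, index: usize, reason: &'static str) -> Option<usize> {
        self.tasks[index].state = TaskState::Faulted(reason); // 1. record the fault
        self.post(SUPERVISOR, FAULT_NOTIFICATION);            // 2. notify the supervisor
        self.pick_next()                                      // 3. pick another task to run
    }
}
```

In this model, a supervisor that is waiting on its fault notification and has the best priority is always the task chosen in step 3, and no other task’s state is touched.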

[1] This is because the supervisor task is normally the highest priority task in the system, able to preempt anything else, and normally hangs out waiting for the notification to arrive. While this is probably the behavior you want, you could choose to do things differently. I’m not the cops.

Aside: what happens without a supervisor?

You could choose to build an application that doesn’t contain a supervisor task. I don’t recommend doing this, but if you did, here’s what would happen.

When any task in your application crashes, the kernel will post notification 0 to task 0 (the first task in the config file) as if it were a supervisor. Your task, which is not a supervisor, would get this spurious notification and perhaps just ignore it, going on about its day.

The crashed task, which I’ll call Task A, will be left in the Faulted state, unable to run.

Now, what if Task A responds to messages from other tasks? If a client (Task B) tries to send to A while A is Faulted, B will block (as in any case where A is unavailable) … and has no way to unblock.

In fact, if A crashes while processing a message from B, B will also sit blocked, waiting for the reply… which will never come.

It’s not unusual for a Hubris application to have several layers of tasks exchanging messages. If some task C goes to send to B while B is waiting for A (which is dead), C is now blocked too.

This will continue as a sort of “pile-up” with more and more tasks waiting (transitively) for a message that will never arrive. This is why it’s important to include at least a minimal supervisor implementation in your application if there’s any risk of a task crashing. (And I think it’s always best to assume that the risk exists!) Hubris’s design assumes the supervisor exists, and a lot of fundamental operations (like inter-task messaging) have defined behaviors that just don’t make sense without it.
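The pile-up can be illustrated with a small model in which each task is either free or blocked on exactly one other task. The function name and the representation of the task table as a slice are assumptions for this sketch, not anything the kernel exposes.

```rust
/// `waits_on[i] == Some(j)` means task `i` is blocked on task `j`.
/// Returns every task whose chain of waits leads to the `dead` task.
/// (Hypothetical model for illustration only.)
fn stuck_tasks(waits_on: &[Option<usize>], dead: usize) -> Vec<usize> {
    (0..waits_on.len())
        .filter(|&start| {
            let mut current = start;
            // Follow the chain of waits. Bounding the number of steps by the
            // table size guards against cycles in malformed input.
            for _ in 0..waits_on.len() {
                match waits_on.get(current).copied().flatten() {
                    Some(next) if next == dead => return true,
                    Some(next) => current = next,
                    None => return false,
                }
            }
            false
        })
        .collect()
}
```

With Task A at index 1, B at index 2 waiting on A, and C at index 3 waiting on B, the table `[None, None, Some(1), Some(2), None]` reports both B and C as stuck once A is dead, even though C never talks to A directly.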

A minimal supervisor

Just because your Hubris application should include a supervisor doesn’t mean you have to write a supervisor. The exhubris repo contains a minimal supervisor (called minisuper) that you can use to get started — or use forever, if it meets your needs.

Let’s take it apart!

I’ll be discussing the code at a specific commit (cae47f138) in the exhubris repo if you want to follow along.

minisuper lives at the path task/minisuper in the repository. In that directory are three files:

task/minisuper
├── Cargo.toml
├── README.mkdn
└── src
    └── main.rs

This is the simplest possible project layout for a Rust executable, and there’s nothing all that interesting in Cargo.toml, so the rest of this section will focus on task/minisuper/src/main.rs. Here’s the entire program, with some comments removed because we’re going to walk through it in detail below.

#![no_std]
#![no_main]

const FAULT_NOTIFICATION: u32 = 1;

#[export_name = "main"]
fn main() -> ! {
    loop {
        userlib::sys_recv_notification(FAULT_NOTIFICATION);

        // Recovery action: find the dead task, starting with the first
        // non-supervisor task index (1).
        let mut next_task = 1;
        while let Some(fault_index) = kipc::find_faulted_task(next_task) {
            let fault_index = usize::from(fault_index);
            kipc::reinitialize_task(fault_index, kipc::NewState::Runnable);
            // keep moving, there could be more than one
            next_task = fault_index + 1;
        }
    }
}

The notification response loop

minisuper, like all Hubris tasks, is written as an infinite loop. Once started, it runs until it crashes (which it shouldn’t).

At the top of the loop, minisuper calls the function sys_recv_notification, passing it FAULT_NOTIFICATION, a bitmask with only bit 0 set (hence the value 1). This is a specialized wrapper for the more general RECV syscall: sys_recv_notification blocks the current task until a notification arrives whose corresponding bit in the mask is set. So, this call blocks waiting for this task’s notification 0.

The Hubris kernel-to-supervisor interface specifies that notification 0 is the one the kernel will post on a crash, so this causes minisuper to wait for a crash.

Technically, minisuper might also wake without any tasks crashing. This is because other tasks can use the POST syscall to manually set the supervisor’s notification 0 to pending. As a result, the loop is designed to gracefully handle the case where no tasks have actually crashed, by going back to sleep in sys_recv_notification.

(sys_recv_notification returns a bitmask indicating which notifications caused the task to wake up. We ignore it, since we only set one bit in the mask — we know if we wake that it’s notification 0!)
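The relationship between pending notifications and the mask can be sketched in a few lines. This is a model of the behavior described above, not the userlib or kernel implementation; the function name is made up, and clearing the delivered bits is an assumption of the model.

```rust
/// Model of notification delivery: returns the pending bits selected by
/// `mask` and clears them. A result of zero means nothing the caller asked
/// for is pending, so a real task would stay blocked.
/// (Hypothetical helper, not the userlib API.)
fn take_notifications(pending: &mut u32, mask: u32) -> u32 {
    let firing = *pending & mask;
    *pending &= !firing; // assumption: delivered bits stop being pending
    firing
}
```

With a mask of 1, the only possible non-zero result is 1, which is why a caller that waits on a single bit can ignore the returned value.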

Kernel IPCs (KIPCs)

Supervisors need to be able to act on other tasks — to read out information about the crash, and to restart them. These are not powers that other tasks should have! But they require help from the kernel to actually implement.

Most Hubris kernel operations are exposed as system calls (like the RECV that sys_recv_notification used above). System calls are available to all tasks.

If the supervisor-specific actions were implemented as system calls, we’d now have a set of system calls that are only available to some tasks. They would need to check the identity of the caller, and treat calls from any task other than the supervisor as errors. Those checks would need to consistently happen in every protected syscall, and not in others. This seemed unfortunate, and I wanted to avoid it.

The Hubris system call interface is also relatively stable. We rarely add new things and have never removed anything. The supervisor interface, on the other hand, has been regularly adjusted over the years, as we learn from our own systems.

These two properties led me to expose supervisor-specific operations through a different mechanism, kernel IPCs or KIPCs. From the supervisor’s perspective, a KIPC looks just like using SEND to send a message to another task, but instead of a valid task ID, the message is sent to a special ID reserved for the kernel.

The kernel handles these messages internally, giving it a convenient single place to screen out tasks other than the supervisor. KIPCs can also send and return more complex serialized Rust types than the simpler system call ABI can.
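As a rough illustration of the “single place to screen out tasks” idea, here is a toy routing function. All of the names are invented for this sketch, and the Refused outcome is a placeholder: what the real kernel does to a non-supervisor caller is not covered here.

```rust
const SUPERVISOR: usize = 0;

/// Where a message is addressed. (Hypothetical types for illustration.)
#[derive(Debug, PartialEq)]
enum Dest {
    Kernel,      // the special ID reserved for the kernel
    Task(usize), // an ordinary task
}

#[derive(Debug, PartialEq)]
enum Outcome {
    HandledByKernel,
    DeliveredTo(usize),
    Refused, // placeholder for whatever the kernel does to other callers
}

/// One match arm is the only place that checks the caller's identity;
/// ordinary task-to-task sends never pay for the check.
fn route_send(caller: usize, dest: Dest) -> Outcome {
    match dest {
        Dest::Kernel if caller == SUPERVISOR => Outcome::HandledByKernel,
        Dest::Kernel => Outcome::Refused,
        Dest::Task(index) => Outcome::DeliveredTo(index),
    }
}
```

The design point is that adding a new supervisor-only operation means adding a new message type behind the kernel destination, without touching the syscall table or repeating the identity check.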

minisuper uses two KIPCs, specifically: find_faulted_task and reinitialize_task.

find_faulted_task asks the kernel to scan its task table starting at a given index, returning the index of the next task in the Faulted state, if there is one. This KIPC lets minisuper scan for faulted tasks efficiently; in the usual case of a single faulted task, only two calls to find_faulted_task are required: one to find it and a second one to find that there are no others.

reinitialize_task resets a task to its default state and optionally sets it running. (The alternative is to leave it stopped.) “Default state” here means that the CPU registers are reloaded to their initial values, including moving the program counter back to the task’s entry point, and resetting the task’s stack pointer. (It does not clear the task’s RAM; tasks are responsible for doing this themselves.) reinitialize_task has some other important effects that I’ll save for a section below.
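To see why a single faulted task costs two scans, here is a model of the two operations over a plain slice of “is this task faulted?” flags. The slice stands in for the kernel’s task table, clearing a flag stands in for reinitialize_task, and both function signatures are assumptions that differ from the real kipc API.

```rust
/// Model of the scan: index of the first faulted task at or after `start`.
fn find_faulted_task(faulted: &[bool], start: usize) -> Option<usize> {
    let offset = faulted.get(start..)?.iter().position(|&is_faulted| is_faulted)?;
    Some(start + offset)
}

/// The same loop shape as minisuper, run against the model. Returns how
/// many scans were needed, so the cost is easy to check.
fn restart_all(faulted: &mut [bool]) -> usize {
    let mut scans = 0;
    let mut next_task = 1; // skip the supervisor at index 0
    loop {
        scans += 1;
        match find_faulted_task(faulted, next_task) {
            Some(index) => {
                faulted[index] = false; // stand-in for reinitialize_task
                next_task = index + 1;  // keep moving, there could be more
            }
            None => return scans,
        }
    }
}
```

In general the loop needs one scan per faulted task plus a final scan that comes back empty, and because each scan resumes after the previous hit, the table is walked only once in total.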

With that explanation of the KIPC operations out of the way, let’s look at what minisuper is doing with them.

Crash handling policy

Because minisuper’s code has probably scrolled out of view by now, let me repeat the crash handling code here:

let mut next_task = 1;
while let Some(fault_index) = kipc::find_faulted_task(next_task) {
    let fault_index = usize::from(fault_index);
    kipc::reinitialize_task(fault_index, kipc::NewState::Runnable);
    next_task = fault_index + 1;
}

The search starts at task index 1; this is because the supervisor is task index 0, and if the supervisor has faulted, this code is no longer running!

Each time a faulted task is found, minisuper calls reinitialize_task to rewind it to starting state, and passes NewState::Runnable to ask the kernel to start it. It then resumes the search starting from the next task, until all faulted tasks have been reinitialized.

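In practice, this restarts one task at a time. The supervisor normally has the highest priority, so it gets to run as soon as a task faults, before any other task has a chance to run and fault too.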

It is possible to construct a weird application where this property is not true, but don’t.

From stepping through this code, it’s clear that minisuper implements a policy of immediately reinitializing and running any task that crashes.

How reinitialize_task prevents pile-ups

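I left two loose ends above: the kernel does not immediately unblock tasks that are trying to send messages to a crashed task, and reinitialize_task has some other important effects beyond restarting its target.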

These both lead to the same explanation!

The reinitialize_task operation is very carefully designed to not leave any task behind. In addition to restarting the target task, it may also have effects on other tasks:

  1. Any task that is waiting in SEND with a message to the restarted task is broken out of its wait with an error code. (Often, tasks will just try to send again to the new incarnation of the task, but that’s up to them!)

  2. If the crashed task was already processing one or more messages from other tasks — that is, the other tasks are blocked waiting for the reply — those other tasks are broken out of their waits with an error code. (This looks just like the first case to the sending tasks.)

  3. If any tasks are waiting in a closed RECV to the crashed task, they are broken out of their wait, too. (Closed RECV is a little-used Hubris syscall feature that I won’t explain here; if you’d like to know more, there’s a section on closed RECV in the reference manual.)

This all happens when the task is restarted, rather than when the task initially crashes. This is because the other tasks are somewhat likely to retry their sends; delaying that until the crashed task has been restarted means the sends will succeed right away, without hitting the crashed task and failing. This also makes it easier to view the state of all the tasks at the time of the crash, through a debugger or other tools.

Thus, when minisuper restarts a crashed task using reinitialize_task, it also ensures that all the tasks that depend on the crashed task are left in a valid state, and are able to make progress.
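The three effects listed above can be modeled over a simple table of task states. The state names and the function signature are assumptions for this sketch. In particular, the model always leaves the target runnable and reports woken tasks as a list, where the real operation takes a NewState and hands each woken task an error code.

```rust
/// Toy scheduling states; the payload is the index of the task being
/// waited on. (Hypothetical types, not the kernel's.)
#[derive(Debug, Clone, Copy, PartialEq)]
enum State {
    Runnable,
    Faulted,
    InSend(usize),       // blocked trying to deliver a message
    InReply(usize),      // message delivered, blocked waiting for the reply
    InClosedRecv(usize), // blocked receiving from one specific task
}

/// Model of reinitialize_task: restart `target` and break every task that
/// was blocked on it out of its wait. Returns the indices of woken tasks.
fn reinitialize_task(tasks: &mut [State], target: usize) -> Vec<usize> {
    let mut woken = Vec::new();
    for (index, state) in tasks.iter_mut().enumerate() {
        let blocked_on = match *state {
            State::InSend(peer) | State::InReply(peer) | State::InClosedRecv(peer) => Some(peer),
            State::Runnable | State::Faulted => None,
        };
        if blocked_on == Some(target) {
            *state = State::Runnable; // the real task would also see an error code
            woken.push(index);
        }
    }
    if let Some(slot) = tasks.get_mut(target) {
        *slot = State::Runnable;
    }
    woken
}
```

Note that a task blocked on one of the woken tasks (rather than on the restarted task itself) is left alone: its peer is alive again and can get around to it in the normal way.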

Who supervises the supervisor?

If the Hubris kernel notifies the supervisor when a task crashes, what happens if the supervisor crashes?

The answer: nothing. Not right away, anyway.

If the supervisor crashes, the kernel leaves it be. Any other tasks that crash will not be restarted, resulting in a potential pile-up. This is bad.

Ideally, the supervisor will not crash, and supervisor tasks should be written to avoid crashing. If the supervisor needs to do something complex where crashing is a possibility, however, you’ll want to use a hardware watchdog timer to ensure the system can restart if the supervisor dies. (How exactly to do this depends on the microcontroller model, and is out of scope for this post.)

The exhubris libraries provide a feature that can help you write a supervisor that cannot crash: the no-panic feature on userlib. All tasks depend on userlib; if they add the no-panic feature, any panics that are not optimized away by the compiler will cause a link failure with an error message.

The error message is currently not very good; it looks something like:

rust-lld: error: undefined symbol: you_have_introduced_a_panic_but_this_task_has_disabled_panics
>>> referenced by lib.rs:565 (sys/userlib/src/lib.rs:565)
>>>               /home/cbiffle/proj/exhubris/play/.work/demo/build/super:(rust_begin_unwind)

This provides a way to ensure, at compile time, that a task cannot panic. If the task is written in safe Rust, then it also should not be able to crash due to memory protection violations, illegal instructions, and the like. This makes this feature really helpful when writing a supervisor, and if you check task/minisuper/Cargo.toml, you’ll see that the feature is on:

[dependencies]
userlib = {workspace = true, features = ["no-panic"]}

This is not a guarantee against bugs, only crashes. The code can still do wrong things, enter an infinite loop, etc. But it can’t crash, at least.
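When writing code under a restriction like this, it helps to reach for operations that have no panicking path at all, rather than relying on the optimizer to remove one. The two helpers below are hypothetical examples of that style using only standard Rust; they are not taken from minisuper.

```rust
/// Indexing with `values[index]` panics when out of bounds; `get` returns
/// an Option instead, so there is no panic path to optimize away.
fn get_or_zero(values: &[u32], index: usize) -> u32 {
    values.get(index).copied().unwrap_or(0)
}

/// Plain `index + 1` can panic on overflow when overflow checks are
/// enabled; `checked_add` reports overflow as None instead.
fn next_index(index: usize) -> Option<usize> {
    index.checked_add(1)
}
```

Whether a given panic gets optimized out can depend on compiler version and settings, so code written this way is less likely to start failing to link after an unrelated change.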

The cost of supervision

The fact that Hubris requires a supervisor task means that most applications have one more task than they would otherwise need. Since embedded applications are usually space-constrained (in either RAM or flash), is this an impractical cost?

Adding minisuper to an application typically requires…

This is pretty small. I haven’t had any trouble fitting minisuper into even the smallest of supported microcontrollers.

Conclusions

With any luck, you now understand the role of the supervisor task in a Hubris application, and have a sense of what you would need to do to write your own. (Try it!)

The supervisor task is only one of several places where Hubris puts the application programmer in the driver’s seat, instead of hardcoding behavior into the kernel. I may touch on a few others (like peripheral and memory sharing) in future posts.

I feel that this design decision has held up well, and having a very compact reference implementation (minisuper) shows that crash recovery doesn’t have to be expensive and complex to be useful. The same “no questions asked” restart-on-crash policy is what we use in Oxide’s production firmware (in Oxide’s supervisor implementation), and it’s been working well.