On Hubris And Humility

Last week I gave a talk at the Open Source Firmware Conference about some of the work I’m doing at Oxide Computer, entitled On Hubris and Humility. There is a video of the talk if you’d like to watch it in video form. It came out pretty alright!

The conference version of the talk has a constantly animated background that makes the video hard for some people to watch. OSFC doesn’t appear to be bothering with either captions or transcripts, so my friends who don’t hear as well as I do (or just don’t want to turn their speakers on!) are kind of out of luck.

And so, here’s a transcript with my slides inlined. The words may not exactly match the audio because this is written from my speaker’s notes. And, yes, my slides are all character art. The browser rendering is imperfect.

I’ve also written an epilogue at the end after the initial response to the talk.

Intro

    ▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏                                                
▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏                                                
▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏▏                                                        
                                                                                
                                                                                
                       β–„β–„         β–—  β––    ▐        ▝                            
                      β–—β–˜β–β––β–—β–—β––     ▐  β–Œβ–— β–— ▐▄▖  β––β–„ β–—β–„   β–„β––                       
                      ▐  β–Œβ–β–˜β–     β–β–„β–„β–Œβ– ▐ β–β–˜β–œ  β–› β–˜ ▐  ▐ ▝                       
                      ▐  β–Œβ– ▐     ▐  β–Œβ– ▐ ▐ ▐  β–Œ   ▐   β–€β–š                       
                       β–™β–Ÿ ▐ ▐     ▐  β–Œβ–β–„β–œ ▐▙▛  β–Œ  β–—β–Ÿβ–„ β–β–„β–ž                       
                                                                                
                 β–—β––       ▐     β–—  β––         ▝  β–β–œ   ▝   β–—                      
                 β–β–Œ β–—β–—β––  β–„β–Ÿ     ▐  β–Œβ–— β–— β–—β–„β–„ β–—β–„   ▐  β–—β–„  β–—β–Ÿβ–„ β–— β–—                 
                 β–Œβ– β–β–˜β– β–β–˜β–œ     β–β–„β–„β–Œβ– ▐ ▐▐▐  ▐   ▐   ▐   ▐  β–β––β–ž                 
                 β–™β–Ÿ ▐ ▐ ▐ ▐     ▐  β–Œβ– ▐ ▐▐▐  ▐   ▐   ▐   ▐   β–™β–Œ                 
                ▐  β–Œβ– ▐ β–β–™β–ˆ     ▐  β–Œβ–β–„β–œ ▐▐▐ β–—β–Ÿβ–„  ▝▄ β–—β–Ÿβ–„  ▝▄  β–œ                  
                                                             β–ž                  
                                                            β–β–˜                  
                                                                                
                   CLIFF L. BIFFLE ─── OXIDE COMPUTER COMPANY                   
                                                                                
                                                                                
                                                       β–‹β–‹β–‹β–‹β–‹β–‹β–‹β–‹β–‹β–‹β–‹β–‹β–‹β–‹β–‹β–‹β–‹β–‹β–‹      
                                                β–‹β–‹β–‹β–‹β–‹β–‹β–‹β–‹β–‹β–‹β–‹β–‹β–‹β–‹β–‹β–‹β–‹β–‹β–‹β–‹β–‹β–‹β–‹β–‹β–‹β–‹β–‹β–‹β–‹β–‹β–‹β–‹
                                                             β–‹β–‹β–‹β–‹β–‹β–‹β–‹β–‹β–‹β–‹β–‹β–‹β–‹β–‹β–‹β–‹β–‹β–‹β–‹

Hello. My name is Cliff, and I work at Oxide Computer Company. Today, I’ll be talking about Hubris and Humility.

We at Oxide are building a new kind of server, and like many computers nowadays it has a large central processor for running customer code, and several smaller processors for housekeeping and management purposes. Unlike most manufacturers, we’re writing the code that runs on all those processors ourselves.

To keep complexity down, we decided early on to go with microcontroller-class processors for all the smaller supporting cores, such as the ARM Cortex-M series or similar RISC-V CPUs. What these processors have in common is memory protection hardware – not, importantly, memory mapping hardware or virtual memory support. A typical implementation lets you apply protection attributes to ranges of physical address space, which you can use to isolate programs from one another. Or even just to catch null pointer dereferences, something a lot of firmware environments don’t do.

Our goal is that our customers never have to think about or interact with these processors unless they want to, which means that the software running on them needs to be quite robust. We wanted to use the memory protection hardware to support this goal.

If you’re looking for an operating system with memory protection support on this class of processor, there are frankly not a lot of options. We evaluated several, getting pretty far with one, but in each case we ran into serious problems. And so in March 2020 we made the difficult decision to write our own, at the same time that we were designing and building our first product.

This is kind of unreasonable, but not as unreasonable as it might sound. We expected that the bulk of our time would go into writing drivers, and since we were creating the hardware ourselves, no off-the-shelf choice would save us time by providing canned drivers. In addition, by building the operating system and the product at the same time, we could try to avoid speculative engineering and build a system that covered our needs without excess generality. Even so, in recognition of how the suggestion sounded, I named the project Hubris.

Today I’ll give a high-level overview of the system and how its parts fit together, and then dive into the ways that our design goals, our use of Rust, and our discoveries along the way have influenced the design process.

A whirlwind tour of Hubris

I’d like to start by taking you on a whirlwind tour of Hubris’s structure. I’ll be touching on a lot of points here, but mostly superficially. Later in this talk I’ll zoom in on some of them, and if you want to really dig in, there’s a reference manual linked on the last slide.

▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍                              ON HUBRIS AND HUMILITY   
                                                                                
                    SERVICE PROCESSOR APPLICATION COMPONENTS                    
                                                                                
                                                                                
                          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”β”Œβ”€β”€β”€β”€β”€β”β”Œβ”€β”€β”€β”€β”€β”€β”                         
                          β”‚ supervisor β”‚β”‚ rcc β”‚β”‚ gpio β”‚                         
                          β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜β””β”€β”€β”€β”€β”€β”˜β””β”€β”€β”€β”€β”€β”€β”˜                         
                       β”Œβ”€β”€β”€β”€β”€β”€β”β”Œβ”€β”€β”€β”€β”€β”€β”β”Œβ”€β”€β”€β”€β”€β”β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”                       
                       β”‚ spi2 β”‚β”‚ spi4 β”‚β”‚ i2c β”‚β”‚ spdprox β”‚                       
                       β””β”€β”€β”€β”€β”€β”€β”˜β””β”€β”€β”€β”€β”€β”€β”˜β””β”€β”€β”€β”€β”€β”˜β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                       
                      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                     
                      β”‚ spiflash β”‚β”‚ thermal β”‚β”‚ sequencing β”‚                     
                      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                     
                                β”Œβ”€β”€β”€β”€β”€β”€β”€β”β”Œβ”€β”€β”€β”€β”€β”€β”                               
                                β”‚ hiffy β”‚β”‚ idle β”‚                               
                                β””β”€β”€β”€β”€β”€β”€β”€β”˜β””β”€β”€β”€β”€β”€β”€β”˜                               
                                                                                
                      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                     
                      β”‚           hubris kernel           β”‚                     
                      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                     
                                                                                
                                                                                
                                                                                
   CLIFF L. BIFFLE ── OXIDE COMPUTER COMPANY        ▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍

The best way to understand Hubris is to start with a worked example, and I’ve chosen our most complex one, which is the Service Processor. Our Service Processor is roughly analogous to the Baseboard Management Controller you might find on a traditional server.

The Service Processor is one of the use cases that’s driving the development of Hubris. We refer to a system like this, built with Hubris, as an application of Hubris.

The service processor application, like most firmware, consists of a collection of software pieces – represented by boxes here. Some of these are widely reusable, such as utility code, or drivers for common devices, or the Hubris kernel itself. Some are application-specific. Here we encounter the first thing that makes Hubris unusual in its class: components such as these are separately compiled and isolated from one another using the processor’s memory protection hardware. We refer to these isolated components as tasks.

▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍                              ON HUBRIS AND HUMILITY   
                                                                                
                    SERVICE PROCESSOR APPLICATION COMPONENTS                    
                                                                                
                                                          ─┐                    
                          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”β”Œβ”€β”€β”€β”€β”€β”β”Œβ”€β”€β”€β”€β”€β”€β”    β”‚                    
                          β”‚ supervisor β”‚β”‚ rcc β”‚β”‚ gpio β”‚    β”‚                    
                          β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜β””β”€β”€β”€β”€β”€β”˜β””β”€β”€β”€β”€β”€β”€β”˜    β”‚                    
                       β”Œβ”€β”€β”€β”€β”€β”€β”β”Œβ”€β”€β”€β”€β”€β”€β”β”Œβ”€β”€β”€β”€β”€β”β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚                    
                       β”‚ spi2 β”‚β”‚ spi4 β”‚β”‚ i2c β”‚β”‚ spdprox β”‚  β”‚ unprivileged,      
                       β””β”€β”€β”€β”€β”€β”€β”˜β””β”€β”€β”€β”€β”€β”€β”˜β””β”€β”€β”€β”€β”€β”˜β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚ isolated           
                      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”β”‚                    
                      β”‚ spiflash β”‚β”‚ thermal β”‚β”‚ sequencing β”‚β”‚                    
                      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜β”‚                    
                                β”Œβ”€β”€β”€β”€β”€β”€β”€β”β”Œβ”€β”€β”€β”€β”€β”€β”          β”‚                    
                                β”‚ hiffy β”‚β”‚ idle β”‚          β”‚                    
                                β””β”€β”€β”€β”€β”€β”€β”€β”˜β””β”€β”€β”€β”€β”€β”€β”˜          β”‚                    
                     ───────────────────────────────────────                    
                      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”β”‚                    
                      β”‚           hubris kernel           β”‚β”‚ privileged         
                      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜β”‚                    
                                                          β”€β”˜                    
                                                                                
                                                                                
   CLIFF L. BIFFLE ── OXIDE COMPUTER COMPANY        ▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍

Tasks occupy separate areas of RAM and Flash, and cannot access one another’s memory directly. Separate compilation is important for making this robust: if a library internally declares a global static variable, for instance, each task is sure to get its own copy instead of trying to share.

But if every piece of the system were totally isolated from all the others, it wouldn’t be a very useful system. Application tasks can communicate with one another using a small set of inter-task interaction operations, which together offer something that resembles a cross-task call operation. We informally refer to tasks making these calls as clients, and the tasks handling them as servers, though in a real system most tasks act as clients and servers simultaneously, forming a hierarchy of task interaction.

Several of the tasks shown here are drivers, which brings us to another unusual aspect of the system: drivers run in unprivileged mode, outside the kernel, and are typically isolated in their own tasks. These tasks get exclusive access to any memory mapped registers they require to do their work, by way of the memory protection unit, and can claim hardware interrupt signals, which the kernel will route to the task.

All this is assembled together into a cohesive image by the Hubris build system, controlled by an application configuration file. And here we find the third thing that makes Hubris unusual: Hubris is an aggressively static system. The configuration file defines the full set of tasks that may ever be running in the application. These tasks are assigned to sections of address space by the build system, and they will forever occupy those sections.

Hubris has no operations for creating or destroying tasks at runtime. Task resource requirements are determined during the build and are fixed once deployed. This takes the kernel out of the resource allocation business. Memories are the most visible resources we handle this way, but it applies to all allocatable or routable resources, including hardware interrupts and memory-mapped registers – all are explicitly wired up at compile time and cannot be changed at runtime.

This notion of compile time configuration, static allocation, and specialization of code extends up into application logic, as well. Tasks can be customized with a tree of configuration data that is made available at compile time. It’s not unusual for a single application such as the service processor to contain three or four tasks all built from the same code with different configuration – for instance, handling different SPI controllers. By transforming the configuration data into constant expressions accessible by both the task and the compiler, we can use generic code to generate compact, specialized binaries, without littering the code with conditional compilation directives.
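
To make this concrete, here is a hedged sketch of the idea – the module and constant names below are invented for illustration, not what our build system actually emits. The build turns per-task configuration into ordinary Rust constants, and generic driver code specializes itself against them:

    // Hypothetical sketch: these names are invented for illustration, not
    // the actual output of the Hubris build system.
    mod task_config {
        /// Which SPI controller this instance of the driver manages.
        pub const CONTROLLER: usize = 2;
        /// Peripheral clock frequency in hertz, used for baud-rate math.
        pub const CLOCK_HZ: u32 = 100_000_000;
    }

    // Generic driver code uses these like any other constants; the compiler
    // folds the arithmetic away, producing a specialized binary for each
    // configured instance.
    fn clock_divider(baud: u32) -> u32 {
        task_config::CLOCK_HZ / baud
    }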

Now, you might be asking, if we can’t create or destroy tasks at runtime, how do we deal with crashes or other software failures? The answer lies in the single task control operation that Hubris does provide, which is in-place reinitialization. When invoked, the reinitialization operation will, informally speaking, stop the task, disconnect it from anything it was using, reset its registers and stack to their default states, and then set it running.

The decision of when to reinitialize a task is left to the application, since different applications may have different constraints. For instance, a driver might require that other tasks be informed of its failure, or even restarted at the same time. Applications choose a task to play the role of supervisor. When any task in the application has a kernel visible fault, such as a memory access violation, or an explicit call to panic!, the kernel records the fault and delivers a notification to the supervisor task. The supervisor can then take any action the application requires, such as restarting one or more tasks or updating an event log.
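
In pseudo-Rust, in the same spirit as the listings later in this talk, a supervisor’s main loop might be shaped roughly like this – the helper functions and notification constant are placeholders, not the exact kernel interface:

    // Rough sketch only; the helpers and the notification bit are
    // placeholders, not the exact Hubris supervisor interface.
    loop {
        // Block until the kernel posts the fault notification.
        let bits = wait_for_notification(FAULT_NOTIFICATION);

        if bits & FAULT_NOTIFICATION != 0 {
            // Walk the fixed, compile-time set of tasks and apply the
            // application's restart policy to anything that has faulted.
            for task in all_tasks() {
                if task_is_faulted(task) {
                    record_fault(task);
                    reinitialize_task(task);
                }
            }
        }
    }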

With this plus some related mechanisms, we can implement recursive component-level restarts in the service processor firmware. Our intent is for the service processor itself to only reboot in truly catastrophic circumstances, and we don’t consider a driver crash to be sufficiently catastrophic. Instead, we restart the driver, and possibly some of its clients, and carry on.

Memory isolation is key to being able to restart components like this, by limiting the β€œblast radius” of a malfunction. No matter how corrupt the state in one task becomes, other tasks can expect their state to be okay.

Because of this, we don’t need to restrict how each task is written under the hood. While most of Hubris itself and our application code is written in Rust’s safe subset, which grants immunity from common classes of memory usage errors, the unsafe subset of Rust is incredibly valuable, particularly when writing fast drivers, or doing things like DMA. While code using unsafe Rust is still significantly safer than C – for instance, array indexing is still checked and integer overflows are still caught – it has access to operations that suppress those checks, and so it has the ability to corrupt any memory it can reach if it works hard enough. The combination of encouraging (but not requiring) safe Rust, while using memory isolation as a backstop, gives us a lot more flexibility without any loss of system-level safety.
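
As a generic illustration (not code from our tree), the kind of unsafe operation a driver might need looks like this: poking a memory-mapped register through a raw pointer, which safe Rust deliberately will not express.

    // Generic sketch, not code from the Hubris tree: writing a
    // memory-mapped register requires a volatile raw-pointer write.
    fn start_dma(start_reg: *mut u32) {
        unsafe {
            // The author, not the compiler, must guarantee that this
            // address really is the register this task has been granted.
            core::ptr::write_volatile(start_reg, 1);
        }
    }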

Which is not to say things never go wrong – just that, when they do, it is not typically a memory safety issue. It is far more likely to be a misread of a component datasheet, a subtle interaction between two tasks, or a plain-old logic error. In these cases, we break out Humility.

Humility is a Hubris-specific debugger, which we designed and implemented concurrently with Hubris itself. It can connect to a live target system over JTAG or SWD and provide live views of what’s going on inside – everything from log output, to the state of all tasks, to individual task stack traces and memory dumps. We expect that, despite our best efforts, we will ship products containing bugs; unlike a lot of other companies, we don’t expect that our servers will be allowed to β€œphone home” from a customer site. As a result, we’ve also built out coredump support that can capture a complete snapshot of service processor state into a file that can be loaded into Humility and debugged off-site, should the customer elect to share it with us.

So – that’s the 10 kilometer view of Hubris and Humility. For the rest of my time here, I’m going to zoom in on a few areas of our experience that I thought were particularly interesting or surprising.

Synchronicity: it’s nice

The interaction between tasks and the Hubris kernel is fully synchronous. This is somewhat unusual, and has some pleasant implications, some of which we saw coming, and some of which surprised us.

First, I’ll unpack what I mean by β€œsynchronous.”

As I’ve described, application code runs in isolated tasks, which communicate with each other and with the kernel by making syscalls. These syscalls perform a complete operation and then return. That might sound natural, but a lot of systems allow programs to queue up work to be performed by the kernel on their behalf, perhaps receiving some kind of notification later when the work is complete.

For systems like the ones we’re building, this has four problems.

First, it introduces a queueing problem into the kernel. Sizing such queues, accounting for their resource usage, avoiding starvation, and producing proper backpressure are difficult problems in real systems. How many asynchronous operations can a given task have in flight? How will it behave if that limit is reached? Will this condition occur when you’re testing the system, or will it produce strange behaviors only at the worst time: when the system’s under heavy load?

Second, it increases the complexity of the system in the last place you want complexity: the kernel. Asynchronous systems need structures for keeping track of in-flight operations, whether that’s kernel threads, in-kernel event continuation and completion tables, or similar.

Third, asynchronous systems can be harder to reason about. Sometimes the events will complete in the order of issuance, sometimes they won’t. Either the programs running on the system themselves become fully asynchronous, or they risk having their behavior depend on the order of completion. It’s also significantly more difficult to display the state of an asynchronous program – mechanisms familiar to developers, like stack traces, no longer make sense.

Fourth, asynchronous systems tend to be less efficient. There has been a movement toward fully asynchronous systems interfaces in recent years, driven in large part by increases in core count and complexity of desktop applications. A well-designed asynchronous system interface, with aggressively concurrent software written to support it, can produce higher average throughput if the events it deals in really can complete in varying order. What it cannot produce is lower latency. Synchronous call interfaces simply use fewer machine instructions to complete the task of setting up and completing a call. As a result, systems built on synchronous call interfaces tend to be smaller. On the class of processor we’re targeting, with low core count, limited RAM and flash, and real-time responsibilities, smaller systems with lower latency win.

SEND

Let’s make this concrete by looking at the most common synchronous syscall in Hubris, SEND. Here is the signature of the syscall in Rust.

▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍                              ON HUBRIS AND HUMILITY   
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                  
                  β”‚1 pub fn sys_send(                        β”‚                  
                  β”‚2     target: TaskId,                     β”‚                  
                  β”‚3     operation: u16,                     β”‚                  
                  β”‚4     outgoing: &[u8],                    β”‚                  
                  β”‚5     incoming: &mut [u8],                β”‚                  
                  β”‚6     leases: &[Lease<'_>],               β”‚                  
                  β”‚7 ) -> (u32, usize);                      β”‚                  
                  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                  
              Listing 3: the actual signature of the send syscall.              
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
   CLIFF L. BIFFLE ── OXIDE COMPUTER COMPANY        ▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍

This call sends a message to target, invoking a given operation, and carrying some outgoing data. It then expects to get an incoming response back. The syscall will return two values, a response code carrying success/failure information, and a usize giving the size of message that was received into the incoming buffer. This call directly reflects the kernel syscall ABI, and in practice few programs call it directly, preferring to use wrappers – but since we’re talking about the system architecture, wrappers would be distracting.
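
As a quick illustration of the shape of a call – the operation number, names, and buffer size here are invented, not an interface from our tree:

    // Invented illustration of using the raw syscall; the operation
    // number, function name, and response size are placeholders.
    const OP_READ_TEMP: u16 = 3;

    fn read_temperature(sensor: TaskId) -> Result<[u8; 4], u32> {
        let mut response = [0u8; 4];
        let (code, len) = sys_send(sensor, OP_READ_TEMP, &[], &mut response, &[]);
        if code == 0 && len == response.len() {
            Ok(response)
        } else {
            Err(code)
        }
    }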

▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍                              ON HUBRIS AND HUMILITY   
                                                                                
                                                                                
                                                                                
                           A TYPICAL CROSS-TASK CALL                            
                                                                                
                                                                                
                                                                                
                           Task A           Task B                              
                             β”‚                 ┆                                
                             └─ SEND ─────────►┐                                
                             ┆                 β”‚                                
                            not              doing                              
                          running            stuff                              
                             ┆                 β”‚                                
                             β”Œβ—„β”€β”€β”€β”€β”€β”€β”€β”€ REPLY ──                                
                             β”‚                RECV                              
                           running             ┆                                
                             β”‚              (waiting)                           
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
   CLIFF L. BIFFLE ── OXIDE COMPUTER COMPANY        ▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍

When a task calls SEND it immediately loses the CPU. In the fast path, when the target task is ready and waiting, the scheduler switches to the target task immediately. If the target is not ready, the sender is blocked until it becomes ready. Either way, something else runs.

Messages are always shuttled from task to task directly. They are never buffered or queued in the kernel. Beyond avoiding the resource management problems that queueing would introduce, this eliminates an extra memcpy into the kernel – the data is copied only once, from sender to recipient – which can be a real boon in performance-sensitive applications.

There are two more subtle implications of this design.

One is that each task can have either zero or one outgoing messages at any given time. Because a task loses the CPU until it receives a reply to its message, it can’t simultaneously send two messages – nor can it queue up dozens of outgoing messages to barrage its peers. This takes away a fault or attack amplification vector that’s available in asynchronous systems.

Another is that a task that is sending a message is, by definition, not running. Because a task sending a message or waiting for a reply is not schedulable, we can immediately context switch away from the sending task – if possible, directly into the target task. The target task can then assume that the sender is waiting for a reply.

And, as it turns out, this interacts very nicely with Rust’s opinions on memory aliasing.

Aliasing and leases

In Rust, references represent borrowed data, which comes from somewhere else, possibly your caller’s stack. References are analyzed, at compile time, by the compiler to check that code that operates on borrowed data will stop operating on it before returning. In general, it’s legal to pass references to borrowed data down into functions you call, and then operate on it after the function has returned, because you can rely on the compiler to ensure that that function hasn’t secretly hung onto a reference.
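
Within a single task, that’s just ordinary Rust – a textbook example, nothing Hubris-specific:

    // Ordinary single-task borrowing: `fill` borrows the caller's buffer,
    // and the compiler guarantees it cannot keep the reference past its
    // return.
    fn fill(buf: &mut [u8]) {
        for byte in buf.iter_mut() {
            *byte = 0xAA;
        }
    }

    fn caller() {
        let mut scratch = [0u8; 16];
        fill(&mut scratch);
        // Safe to keep using `scratch`: `fill` cannot have retained a
        // reference to it.
        assert_eq!(scratch[0], 0xAA);
    }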

The Hubris lease mechanism extends this function-to-function borrowing mechanism across task boundaries.

▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍                              ON HUBRIS AND HUMILITY   
                                                                                
                                                                                
                                                                                
                           A TYPICAL CROSS-TASK CALL                            
                                                                                
                                                                                
                                                                                
                           Task A           Task B                              
                             β”‚                 ┆                                
                             └─ SEND ─────────►┐                                
                             ┆                 β”‚                                
                             ┆   β—„β”„β”„β”„β”„β”„β”„β”„β”„β”„accessing                            
                            not  β”„β”„β”„β”„β”„β”„β”„β”„β”„β–Ίborrowed                             
                          running           memory                              
                             ┆                 β”‚                                
                             ┆                 β”‚                                
                             β”Œβ—„β”€β”€β”€β”€β”€β”€β”€β”€ REPLY ──                                
                             β”‚                RECV                              
                           running             ┆                                
                             β”‚              (waiting)                           
                                                                                
                                                                                
                                                                                
   CLIFF L. BIFFLE ── OXIDE COMPUTER COMPANY        ▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍

A function running in one task can construct leases referencing data in that task, and attach them to a send to another task. The receiving task is then treated as borrowing the leased data, and can access its contents whenever it likes, until it resumes the first task by replying. Now, these sorts of dynamic task interactions have to be checked at runtime, rather than at compile time like Rust borrows, but otherwise the operation is pretty similar.

This is how drivers on Hubris implement operations like transmitting blocks of data out a serial port: the task that wishes to transmit sends a message indicating its intent, with a read-only lease attached. The driver then loads data through the lease in whatever chunk size it wishes, until the transmission is complete and the sender is resumed. While in this case you could package the data into the message portion of the send call, that would require the kernel to copy the data all in one go, which in turn means that the driver needs to have already set aside a chunk of RAM large enough to receive any potentially sent stream of bytes in one whack. Not only would this waste precious RAM most of the time, it would inevitably lead to some arbitrary low limit on how many bytes can be sent through a serial port in one call. Using leases, the sender doesn’t know or care how large a buffer the driver is keeping available, and the driver is responsible for moving chunks of sent data out of the sender’s memory.
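
From the sending task’s point of view, the whole exchange might look roughly like this sketch – the operation number is invented, and the lease constructor shown is an assumption about the API shape rather than the exact one:

    // Illustrative sketch of the client side of a leased transmit. The
    // operation number is invented, and `Lease::read_only` is an assumed
    // constructor, not necessarily the exact API.
    const OP_WRITE: u16 = 1;

    fn transmit(uart: TaskId, data: &[u8]) -> Result<(), u32> {
        let (code, _len) = sys_send(
            uart,
            OP_WRITE,
            &[],                       // no inline payload needed
            &mut [],                   // no response payload expected
            &[Lease::read_only(data)], // driver pulls bytes through this
        );
        if code == 0 { Ok(()) } else { Err(code) }
    }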

Cross-task borrows are only safe, from both an engineering reliability perspective and a Rust memory model perspective, because of synchronous messaging. The recipient accessing data through a lease can be confident that it has exclusive access, if required, because the previous holder of exclusive access is stopped.

When the borrowing task replies to the corresponding message, any leases that came along with it are atomically revoked. This ensures that, from the lender’s perspective, sharing memory across tasks works identically to passing references into a function within the same task.

As a programmer

Now, let’s switch away from how the syscall behaves, and talk about how we, the programmers, think it behaves.

Synchronous systems are easier to understand and build.

Most programming in an embedded system is basically explaining to a computer how to perform a process described in a flow chart. There may or may not literally be a flow chart the programmer is working from – I’m speaking conceptually. We want to take a sequence of actions and checks and conditionals and loops and turn it into something the computer can do. Doing this will usually involve building a state machine.

Now, I’m using the term β€œstate machine” in its broadest sense. A lot of embedded software, or systems software in general, thinks of a β€œstate machine” as a sort of interpreter, either hand-written or generated by a tool, that under the hood probably contains a state variable and a big switch statement or table. That’s one kind of a state machine. And that kind of state machine can be very useful.

But we also have a programming language. Programming languages are tools for expressing state machines, and they are quite good at it. When you instead use a programming language to implement an interpreter for some other description of a state machine, you’re introducing a layer of conceptual indirection. It takes a reader a lot more effort to work out the behavior of an arbitrary nest of state transition edges, than it does to read a while loop. Manually rolled state machines deny their authors the advancements of our industry since the introduction of structured programming, nearly 60 years ago. The result is almost always more complex, and more code, than a naive expression of the same process simply written in the programming language itself.

The decades of collective culture around writing, analyzing, and inspecting programs written in programming languages not only make things easier for the reader, they also get you better tool support. If a program is mysteriously not making progress, I can use existing debug information produced by any compiler to get a stack trace and display local variables. I would need to spend time engineering this myself for a handwritten state machine.

And yet – despite all this – I’ve written my share of hand-rolled state machines, some quite recently. Why do we do this? Some people feel that explicit state machines or so-called β€œsuper-loop” architectures are easier to understand than straight-line code or preemptive multitasking, but I just spent several minutes explaining why I think that’s misguided. The real answer is almost always asynchrony. Perhaps your driver needs to be nudged forward in small increments from interrupt context. The way most processors implement interrupt handlers, they don’t have an opportunity to maintain a stack from invocation to invocation, more or less requiring them to be unrolled into an explicit state machine. The other main reason, in my case, is resource limitations – perhaps stacks are expensive, and so I need to multiplex them across several logical processes. Each of these individual processes may itself be synchronous internally, but the combination of processes may be mutually asynchronous, with events coming at different and uncorrelated times. And so I’ve ended up with asynchrony again.

Hubris approaches this problem pragmatically. We have attempted to provide a platform where a motivated programmer can entirely avoid asynchrony. Cross-task messaging is synchronous, as I’ve explained, but that’s only part of the problem. The other thing that many operating systems make asynchronous is events – signal handlers, interrupt handlers, and the like.

Since I’ve been talking about drivers, I’ll address interrupt handlers first. Interrupts are, as their name implies, asynchronous on most processors. The Hubris kernel provides β€œbottom-half” interrupt handler stubs that use compile-time configuration to route interrupts to tasks in the application. Tasks then see interrupts as synchronous events, delivered on request. Specifically, any time a task checks for incoming messages, it can pass an optional parameter specifying which, if any, of its interrupts it would also like to hear about.

This has turned out to be unexpectedly pleasant. A lot of drivers in our firmware are structured as server tasks, meaning that the core loop of the driver resembles, in pseudo-Rust, the following:

▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍                              ON HUBRIS AND HUMILITY   
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                      
                     β”‚1 loop {                           β”‚                      
                     β”‚2     let m = sys_recv();          β”‚                      
                     β”‚3     match m.operation {          β”‚                      
                     β”‚4         CONTROL_LASERS => { ... }β”‚                      
                     β”‚5         EJECT_USER => { ... }    β”‚                      
                     β”‚6     }                            β”‚                      
                     β”‚7 }                                β”‚                      
                     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                      
                       Listing 4: a typical server loop.                        
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
   CLIFF L. BIFFLE ── OXIDE COMPUTER COMPANY        ▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍

Each time through this loop, the driver receives an incoming message and dispatches based on the operation that was requested. One message is handled at any given time.

Altering this driver to support interrupts, say during the process of ejecting the user, can be done two different ways. First, the fully synchronous method:

▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍                              ON HUBRIS AND HUMILITY   
                                                                                
                                                                                
             β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”             
             β”‚ 1 loop {                                           β”‚             
             β”‚ 2     let m = sys_recv();                          β”‚             
             β”‚ 3     match m.operation {                          β”‚             
             β”‚ 4         CONTROL_LASERS => { ... }                β”‚             
             β”‚ 5         EJECT_USER => {                          β”‚             
             β”‚ 6             start_rocket_motor();                β”‚             
             β”‚ 7                                                  β”‚             
             β”‚ 8             // not actual syntax, but not far offβ”‚             
             β”‚ 9             sys_recv(ROCKET_IRQ_ONLY);           β”‚             
             β”‚10                                                  β”‚             
             β”‚11             release_locking_clamp();             β”‚             
             β”‚12                                                  β”‚             
             β”‚13             sys_reply(m.caller, OK);             β”‚             
             β”‚14         }                                        β”‚             
             β”‚15     }                                            β”‚             
             β”‚16 }                                                β”‚             
             β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜             
              Listing 5: server using synchronous interrupt wait.               
                                                                                
                                                                                
   CLIFF L. BIFFLE ── OXIDE COMPUTER COMPANY        ▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍

Here the driver is waiting (line 9) for a specific interrupt, and will not honor other messages until it arrives. A real driver dealing with real hardware might include a timeout, but either way, this lets the driver author express a sequence of operations that includes waiting for interrupts as straight line code.

If the driver needs to process other requests while waiting for that interrupt, there’s also the semi-asynchronous approach:

▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍                              ON HUBRIS AND HUMILITY   
                                                                                
                                                                                
              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”               
              β”‚ 1 loop {                                        β”‚               
              β”‚ 2     let m = sys_recv_or_irq();                β”‚               
              β”‚ 3     match m.operation {                       β”‚               
              β”‚ 4         INTERRUPT => {                        β”‚               
              β”‚ 5             // handle interrupts here         β”‚               
              β”‚ 6         }                                     β”‚               
              β”‚ 7         CONTROL_LASERS => { ... }             β”‚               
              β”‚ 8         EJECT_USER => {                       β”‚               
              β”‚ 9             // omitted: what if two of these  β”‚               
              β”‚10             // requests arrive simultaneously?β”‚               
              β”‚11             start_rocket_motor();             β”‚               
              β”‚12                                               β”‚               
              β”‚13             enable_irq();                     β”‚               
              β”‚14         }                                     β”‚               
              β”‚15     }                                         β”‚               
              β”‚16 }                                             β”‚               
              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜               
                Listing 6: interleaving interrupts and requests.                
                                                                                
                                                                                
   CLIFF L. BIFFLE ── OXIDE COMPUTER COMPANY        ▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍

This is using the fact that interrupt notifications can be opted into on any receive call. So, the main match statement processing incoming messages is now evaluating both incoming messages and interrupts that have happened since the last time it checked.

I say this approach is semi-asynchronous because, unlike a signal handler, it never alters the control flow of the code you read on the page. This was an explicit design goal of ours, and it makes writing and debugging interrupt driven drivers much easier.

Hubris also does not provide Unix-style intrusive signal handlers. And so, by combining these pieces, we have the summary of the task execution model in Hubris: code you write in a task executes as written or fails. Nothing in the system will arbitrarily alter your program’s control flow from what appears on the page, except potentially to stop and restart it from the top. When phrased like that, it seems weird to have to say it out loud – like, don’t all programmers pretty much assume that the code they write executes the way they wrote it? To which I say, yes, we do, and that assumption is often wrong, and leads to common classes of bugs, from simple data races on up. Hubris attempts to provide a platform where that reasonable assumption can be maintained.

Types: they’re good

For a systems programming language, Rust has an unusually powerful type system. We’ve used this to great effect. Let’s look at two small examples.

Making illegal task states unrepresentable

First, from the kernel internals, we have TaskState.

As the name implies, TaskState is a type describing the state of a task. While Hubris doesn’t recognize all that many possible task states, there are still several, some of which have associated data. I’ll focus on three potential states today.

First, a task may be runnable.

Second, a task may be blocked trying to deliver a message to a peer.

Third, a task may have failed with a fault.

A task in each of these states needs to keep slightly different sets of information around. The runnable state needs the least, just static properties of the task such as its scheduling priority.

A task that is attempting to deliver a message, a state we call InSend, needs to keep track of quite a bit more. Because the task has entered the InSend state in the first place, we know it tried to send a message to a peer that was not ready to receive. That’s a good start, but we also need to keep track of the location of the message in the sender’s memory space, any leases that were attached to the message, and of course the identity of the task that, we hope, will eventually receive the message.

A task that has taken a fault, on the other hand, doesn’t need any of that, but it does need information about the fault. We record a fault number from a taxonomy of Hubris faults, and in some cases, we record additional metadata. For instance, in the event of a memory access violation, we record whether the fault was imprecise or precise, and, in the latter case, the offending address. For various types of kernel-originated software faults, we record information about why the kernel faulted the task. Finally, for any fault, we also record the state the task was in just before the fault.

Now, what a lot of kernels do here is to work out the union of the information required by all possible task states, and flatten it into a struct. In our case, it might look something like this:

▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍                              ON HUBRIS AND HUMILITY   
                                                                                
                                                                                
                                                                                
                  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                   
                  β”‚ 1 struct Task {                         β”‚                   
                  β”‚ 2     // ... stuff omitted...           β”‚                   
                  β”‚ 3                                       β”‚                   
                  β”‚ 4     state: TaskState,                 β”‚                   
                  β”‚ 5     previous_state: TaskState,        β”‚                   
                  β”‚ 6     ipc_peer: TaskId,                 β”‚                   
                  β”‚ 7     message: USlice<u8>,              β”‚                   
                  β”‚ 8     fault_kind: FaultKind,            β”‚                   
                  β”‚ 9     fault_address: u32,               β”‚                   
                  β”‚10     extra_fault_info: u32,            β”‚                   
                  β”‚11     // ... and so on.                 β”‚                   
                  β”‚12 }                                     β”‚                   
                  β”‚13                                       β”‚                   
                  β”‚14 enum TaskState { Ready, Sending, ... }β”‚                   
                  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                   
              Listing 7: how we didn't implement the Task struct.               
                                                                                
                                                                                
                                                                                
   CLIFF L. BIFFLE ── OXIDE COMPUTER COMPANY        ▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍

This can work, but there are several problems with it. First, it isn’t at all clear to a reader when each of these fields is valid, and when it may contain garbage. Second, that lack of clarity can lead to security and robustness problems. For instance, a task that has died with a fault should not be scheduled. Any outgoing messages it was attempting to send at the time of the fault should not be delivered. Similarly, we should not be able to receive a message from a task that isn’t trying to send one. Yet all these mistakes can be made quite easily with the flat state representation.

Finally, if we wanted to record task state before a fault, it’s not clear what subset we would record, and it’s definitely not clear how to handle any field that needs to be both saved and updated for the fault.

Having Rust enums available makes a big difference here. Enums in Rust can work just like enums in C, but this is a sort of degenerate case; they can also express arbitrarily complex tagged unions. Our top-level task state enum looks like this:

▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍                              ON HUBRIS AND HUMILITY   
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                           
                        β”‚ 1 enum TaskState {        β”‚                           
                        β”‚ 2     Healthy(SchedState),β”‚                           
                        β”‚ 3     Faulted { ... },    β”‚                           
                        β”‚ 4 }                       β”‚                           
                        β”‚ 5                         β”‚                           
                        β”‚ 6 enum SchedState {       β”‚                           
                        β”‚ 7     Runnable,           β”‚                           
                        β”‚ 8     InSend(TaskId),     β”‚                           
                        β”‚ 9     // ... more         β”‚                           
                        β”‚10 }                       β”‚                           
                        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                           
                     Listing 8: the actual task state enum.                     
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
   CLIFF L. BIFFLE ── OXIDE COMPUTER COMPANY        ▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍

TaskState distinguishes between healthy states and faulted states, and the healthy states are separated into a nested enum, SchedState. I’ll come back to the members of Faulted that are omitted there.

Runnable and InSend are both variants of SchedState.

The task ID designating the intended recipient of the message is only accessible, and indeed only exists as a member field, when the state is InSend. It is not possible to accidentally reference it if the task is in a different state, nor is it possible to transition into InSend without providing it.

SchedState is broken out into a sub-enum because it means that code in the kernel can be written such that it can’t even talk about fault states. For instance, there is a set_healthy_state operation that is used in the implementation of several syscalls to move tasks between states. This takes a SchedState as a parameter, meaning the operation simply cannot put a task into a faulted state – that is a different and less frequently used operation. Similarly, the scheduler deals with tasks in terms of their SchedState, and since a faulted task doesn’t have one, it cannot attempt to schedule a faulted task.
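
In sketch form, the shape of that operation is roughly the following – the real Task struct has more fields, and the exact signature in our kernel may differ:

    // Sketch only; the real Task struct has more fields, and the exact
    // signature in the kernel may differ.
    struct Task {
        state: TaskState,
        // ... other fields omitted
    }

    impl Task {
        /// Moves this task between healthy states. Because the parameter
        /// is a SchedState, this operation cannot mark a task as faulted.
        fn set_healthy_state(&mut self, new: SchedState) {
            self.state = TaskState::Healthy(new);
        }
    }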

Well, technically, it isn’t quite true that a faulted task doesn’t have a SchedState. Remember that I mentioned we record the pre-fault state. The actual definition of Faulted looks like this:

▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍                              ON HUBRIS AND HUMILITY   
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                     
                    β”‚1 enum TaskState {                   β”‚                     
                    β”‚2     Healthy(SchedState),           β”‚                     
                    β”‚3     Faulted {                      β”‚                     
                    β”‚4         fault: FaultInfo,          β”‚                     
                    β”‚5         original_state: SchedState,β”‚                     
                    β”‚6     }                              β”‚                     
                    β”‚7 }                                  β”‚                     
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                     
                    Listing 9: fields of the faulted state.                     
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
   CLIFF L. BIFFLE ── OXIDE COMPUTER COMPANY        ▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍

This is another useful benefit of breaking healthy states out into the SchedState type: we can store the pre-fault healthy state here. We can’t embed a TaskState within a TaskState because that would produce an infinitely large data structure; but we can embed a SchedState. This gets us several benefits. First, while the SchedState is present in both healthy and faulted states, it is accessed in very different ways, syntactically, so they’re very difficult to confuse. Second, I can guarantee you that the pre-fault original_state does not, itself, describe a fault, because the SchedState type cannot describe faults. Beyond making the code easier to use, being rigorous about which enum variants can appear in which contexts eliminates a lot of β€œdefault” or β€œdon’t-care” branches in switch statements, which is good, because these often accumulate bugs, particularly as enums are extended during the life of the system.
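
As a small illustration of both points, here’s a sketch (not the actual kernel code) of a helper that reports a task’s pre-fault state. Both variants are handled explicitly, with no catch-all arm, and the two SchedStates are reached through visibly different syntax:

    // Sketch, using the TaskState definition from the listing above.
    fn pre_fault_state(state: &TaskState) -> Option<&SchedState> {
        match state {
            // A healthy task has no pre-fault state to report.
            TaskState::Healthy(_) => None,
            // The recorded pre-fault state is a named field of Faulted,
            // so it reads very differently from the healthy case above.
            TaskState::Faulted { original_state, .. } => Some(original_state),
        }
    }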

Parsing and the humble slice

So, that’s one example of simplifying kernel book-keeping using Rust types. The other example I’d like to discuss is a case of maintaining security properties using types.

▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍                              ON HUBRIS AND HUMILITY   
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                  
                  β”‚1 pub fn sys_send(                        β”‚                  
                  β”‚2     target: TaskId,                     β”‚                  
                  β”‚3     operation: u16,                     β”‚                  
                  β”‚4     outgoing: &[u8],                    β”‚                  
                  β”‚5     incoming: &mut [u8],                β”‚                  
                  β”‚6     leases: &[Lease<'_>],               β”‚                  
                  β”‚7 ) -> (u32, usize);                      β”‚                  
                  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                  
              Listing 3: the actual signature of the send syscall.              
                                                                                
                                    (again)                                     
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
   CLIFF L. BIFFLE ── OXIDE COMPUTER COMPANY        ▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍

Consider the outgoing message argument to the SEND syscall I presented earlier. From the task’s perspective, this argument is passed as a slice, &[u8]. This is a Rust standard type that consists of a pointer and a length. This aspect of SEND is very much like the POSIX write call, which takes an explicit pointer and a length. While the Rust signature of the syscall takes a slice, the machine-level ABI just moves a pointer and length in two registers, and it’s entirely possible for a misbehaving or malicious task to pass an arbitrary pointer and length. This means the kernel has a validation problem: given an address and size, which the caller alleges it can access, determine whether the caller should be allowed to access it.

This problem is common to basically all kernels that have some form of memory protection. It’s also a common source of bugs and vulnerabilities, for the simple reason that it’s easy to forget to check, or even if you remember to check, it’s easy to accidentally separate the address and size values.

In every case like this that I’ve studied, there are two things at play:

  1. The address/size pair after validation is indistinguishable from the same pair before validation.
  2. The address and size are not welded together into a unitary value, and operations are inconsistent on whether they take both or just a base address. (For instance, C pointer indirection only takes a base address.)

It is surprisingly easy to solve both of these problems using types, and in fact, we do this all the time in more pedestrian situations. Consider taking a piece of text and trying to extract a base-ten number from it. You would probably use a number-parsing routine that is capable of returning either a number, or an error indicator if the string does not contain a base-ten number, something like the top listing on this slide:

▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍                              ON HUBRIS AND HUMILITY   
                                                                                
                  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                  
                  β”‚1 if let Some(n) = some_text.parse_u32() {β”‚                  
                  β”‚2     println!("we have a number!");      β”‚                  
                  β”‚3     operate_on(n);                      β”‚                  
                  β”‚4 } else {                                β”‚                  
                  β”‚5     println!("not a number");           β”‚                  
                  β”‚6 }                                       β”‚                  
                  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                  
               Listing 10: how one might parse and use a number.                
                                                                                
                                                                                
                     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                     
                     β”‚1 if check_is_number(some_text) {   β”‚                     
                     β”‚2     println!("we have a number!");β”‚                     
                     β”‚3     operate_on(some_text);        β”‚                     
                     β”‚4 } else {                          β”‚                     
                     β”‚5     println!("not a number");     β”‚                     
                     β”‚6 }                                 β”‚                     
                     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                     
                Listing 11: validating without using the result.                
                                                                                
                                                                                
   CLIFF L. BIFFLE ── OXIDE COMPUTER COMPANY        ▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍

This isn’t the only way to do this; you could also write one routine for inspecting the string and verifying that it contains a number, and then pass the text around, assuring anyone who encounters it that it contains a decimal number. That’s what the second listing does.

But this approach feels weird, and it’s worth asking why it feels weird. As systems programmers, we’re likely to object on performance grounds: presumably we’re going to need to parse a number out of that text eventually, so doing a validation step that inspects the text, only to do a separate parsing step later, will probably use more CPU time than a single parsing operation that can indicate errors. And that’s true, in practice!

But there’s another reason it feels weird. Consider the definition of the operate_on function in each of these two cases.

▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍                              ON HUBRIS AND HUMILITY   
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                       
                       β”‚1 // First case                 β”‚                       
                       β”‚2 fn operate_on(n: u32) { ... } β”‚                       
                       β”‚3                               β”‚                       
                       β”‚4 // Second case                β”‚                       
                       β”‚5 fn operate_on(n: &str) { ... }β”‚                       
                       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                       
                  Listing 12: operate_on gets different types,                  
                  and thus different assurances, in each case.                  
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
   CLIFF L. BIFFLE ── OXIDE COMPUTER COMPANY        ▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍

In the first case, the code in the function can be assured that n is a valid u32, because, well, it says right there. In the second case, the caller could pass any arbitrary string. You could add a large block comment on the function declaring that any string passed in must be a valid number, but the code will still need to deal with the possibility that it’s been passed garbage. Or, it could elect not to deal with that possibility and assume its callers have done the right thing every time, which is often how we get kernel exploits.

So, I’d argue that the second approach feels weird for some very good reasons, and that you should prefer the first approach. This is a specific case of a more general principle, summarized by Alexis King as β€œParse, Don’t Validate.” By treating input validation as a parsing problem, we get two benefits:

  1. By parsing once and returning the result, we avoid accidentally wasting time by validating/parsing repeatedly.
  2. The fact that parsing or validation has occurred is now reflected in the types.

And it’s that second point that I’m flogging here. Going back to our operate_on cases, it is simply not possible for the user to call the u32 version with an un-validated string, because it doesn’t accept a string – it accepts a u32, and even a 40-year-old C compiler will warn you if you confuse those. This means that the implementation doesn’t need to decide between re-checking the input or potentially being the cause of bugs and vulnerabilities down the road. It and the caller are working together to move only valid data around.
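
Concretely, the standard library already phrases number parsing this way. Here’s a small, self-contained sketch using str::parse rather than the simplified parse_u32 from the slide:

    fn operate_on(n: u32) {
        // n is already known to be a number; no re-checking required.
        println!("twice the number is {}", n * 2);
    }

    fn main() {
        let some_text = "42";
        // str::parse returns a Result: either the parsed value or an error.
        match some_text.parse::<u32>() {
            Ok(n) => operate_on(n),
            Err(_) => println!("not a number"),
        }
    }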

So why don’t we do the same thing with addresses from user code?

▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍                              ON HUBRIS AND HUMILITY   
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                  
                 β”‚1 // Linux:                                β”‚                  
                 β”‚2 access_ok(VERIFY_READ, pointer, length); β”‚                  
                 β”‚3                                          β”‚                  
                 β”‚4 // Windows:                              β”‚                  
                 β”‚5 ProbeForRead(pointer, length, alignment);β”‚                  
                 β”‚6                                          β”‚                  
                 β”‚7 // it goes on                            β”‚                  
                 β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                  
             Listing 13: pointers get validated, types don't change.             
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
   CLIFF L. BIFFLE ── OXIDE COMPUTER COMPANY        ▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍

Here are typical pointer checks from the Linux and Windows kernels.

If we accept that the key distinction between a parsing operation and a validation operation is that parsing produces an artifact that you didn’t have before – a change in type, typically – then these are both validation operations. We get no support from the compiler here; you can delete these calls from a program and not get a compiler warning. It’s also not at all clear where in the kernel these operations need to be called – each function must decide whether to spend time and code validating pointer arguments, or potentially be the source of bugs in the future when a caller fails to notice the block comment.

Comments and conventions are great, but they are not machine-checkable, and they are a poor substitute for types. Both Linux and Windows, to use the two examples at hand, have shipped security vulnerabilities related to using these validation operations incorrectly, in ways that would have been prevented by rephrasing them as parsing.

The easiest way to see this is to consider what should happen if the length is zero. Reading zero bytes from an address is probably fine, and on Windows at least, ProbeForRead will OK it independent of address. However, it’s super easy to do this by accident:

▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍                              ON HUBRIS AND HUMILITY   
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                   
                  β”‚1 const uint8_t * user_data = ...;       β”‚                   
                  β”‚2 size_t len = 0;                        β”‚                   
                  β”‚3                                        β”‚                   
                  β”‚4 ProbeForRead(user_data, len, 1); // OK!β”‚                   
                  β”‚5                                        β”‚                   
                  β”‚6 *user_data // no error, no warning     β”‚                   
                  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                   
                   Listing 14: mere validation won't stop this.                  
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
   CLIFF L. BIFFLE ── OXIDE COMPUTER COMPANY        ▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍

Any dereference of the pointer is a bug, but nothing prevents someone from making this mistake…except code review.

Now, consider this hypothetical replacement for ProbeForRead, written in Rust syntax.

▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍                              ON HUBRIS AND HUMILITY   
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                           β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                           
                           β”‚1 fn probe_for_read<T>( β”‚                           
                           β”‚2     pointer: *const T,β”‚                           
                           β”‚3     length: usize,    β”‚                           
                           β”‚4     alignment: usize, β”‚                           
                           β”‚5 ) -> Option<&[T]>;    β”‚                           
                           β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                           
                       Listing 15: recast using Rust types.                      
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
   CLIFF L. BIFFLE ── OXIDE COMPUTER COMPANY        ▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍

This takes the same three arguments, using the same types as C, but the return type has changed. It now returns an option of slice of T, meaning the caller will either get the value None if the check failed, or Some with a slice.

In this case, the slice represents the parse result. It uses the type system to indicate that the memory is OK to access, contains a sequence of values of some type T, and is correctly aligned. This also serves to bond the pointer and length together, so that they can’t mistakenly be used separately or mixed up.

If a function needs a reference to valid user memory, it can require a slice. If it’s willing to do the validation itself, it can take a pointer and a length.

This signature works particularly well in Rust because so-called raw pointers, like the *const T you see here, are allowed to contain arbitrary invalid values, but can’t be dereferenced in safe code. Meaning, the code can pass this pointer value around whether or not it has been validated, without risk, since it will not be unexpectedly dereferenced.
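
To see how this plays out at a call site, here’s a hedged sketch of a caller of the hypothetical probe_for_read above; the handle_request and process functions, and the error value, are placeholders rather than real Windows or Hubris APIs:

    // Sketch built on the hypothetical probe_for_read signature above.
    fn handle_request(user_ptr: *const u8, len: usize) -> Result<(), ()> {
        match probe_for_read(user_ptr, len, 1) {
            // The slice carries the proof of validation in its type; a
            // zero-length request just yields an empty slice, which cannot
            // be accidentally dereferenced out of bounds.
            Some(data) => {
                process(data);
                Ok(())
            }
            // No slice, no access: the caller is forced to handle failure.
            None => Err(()),
        }
    }

    fn process(data: &[u8]) {
        // This function can simply require a slice and trust its validity.
        let _ = data.len();
    }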

In practice, you may want to bond the pointer and length together before validating, to make them easier to pass around and harder to accidentally separate, and this is what we do in Hubris. Here’s the signature of the equivalent Hubris operation:

▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍                              ON HUBRIS AND HUMILITY   
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                       β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                        
                       β”‚1 fn try_read<T>(              β”‚                        
                       β”‚2     task: &Task,             β”‚                        
                       β”‚3     slice: USlice<T>,        β”‚                        
                       β”‚4 ) -> Result<&[T], FaultInfo>;β”‚                        
                       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                        
                     Listing 16: Hubris equivalent operation.                    
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
   CLIFF L. BIFFLE ── OXIDE COMPUTER COMPANY        ▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍

Similar, except that we pass the current task explicitly instead of getting it from some per-thread context, and we use a generic USlice type to capture an unvalidated pointer-length pair received from user mode.

A USlice provides stronger guarantees than a raw pointer: it’s not a pointer, and can’t be easily dereferenced even in unsafe code.
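
To give a sense of what that means, here’s a sketch of the general shape of such a type; the real Hubris USlice differs in detail. It records a base address and element count as plain integers, so there is no pointer to dereference until try_read has vouched for the memory:

    use core::marker::PhantomData;

    // Sketch of the general shape; not the actual Hubris definition.
    struct USlice<T> {
        base_address: usize, // just a number, not a pointer
        length: usize,       // element count alleged by the caller
        _marker: PhantomData<*const T>,
    }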

So, why don’t we see more of this? I think the reasons are complex, but part of it is that we haven’t been talking about this in the systems programming community – which is part of why I’m talking about it today.

But part of it is that you can’t really achieve this in C, and you can’t achieve it robustly in C++, and those have been our only two practical options for a very long time.

I’m not here to rag on C’s type system, so I’ll leave the reasons why this doesn’t work for a future blog post. But implementing a new kernel from scratch in Rust, you find a lot of opportunities like this for encoding your intended integrity and security properties in the types themselves, so that the compiler can assist you in detecting any violations of those properties.

Speaking of compiler assistance, there’s a subtle Rust thing in that function signature. Because there’s a reference (ampersand) in both the arguments and the return type, without further instruction, the compiler connects the two lifetimes. This means that if the try_read call succeeds, the compiler considers the task β€œborrowed” until we dispose of the slice. Borrowing the task with a shared reference like this prevents most mutation, and in particular, it means it is not possible to change the task’s memory map in a way that would invalidate try_read’s decision while retaining access to the memory. To do that, you’d need to drop the slice, make the change, and then call try_read again to get a new slice; that second call is checked against the changed memory map, so it can’t hand back access the new map doesn’t allow. This eliminates a potentially subtle class of bugs, and conveniently happens to be the natural way to express it in Rust.
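
Spelled out with an explicit lifetime, the signature above is equivalent to something like this; this is a sketch of what the compiler infers, not a change to the actual code:

    // The elided lifetime ties the returned slice to the borrow of the task:
    fn try_read<'task, T>(
        task: &'task Task,
        slice: USlice<T>,
    ) -> Result<&'task [T], FaultInfo>;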

Application-Debugger Co-Design

At Oxide we’re enthusiastic proponents of hardware-software co-design, treating hardware and software together as an integrated product and making tradeoffs across the two. Hubris’s design has been influenced by a different form of co-design that I didn’t see coming: application-debugger co-design.

▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍                              ON HUBRIS AND HUMILITY   
                                                                                
   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   
   β”‚[cbiffle@gwydion]$ humility -d hubris.core.54 manifest                  β”‚   
   β”‚humility:      version => hubris build archive v1.0.0                   β”‚   
   β”‚humility:      git rev => a2e01755592189aea0c6cabf36fc5cc9257190b2-dirtyβ”‚   
   β”‚humility:        board => stm32f4-discovery                             β”‚   
   β”‚humility:       target => thumbv7em-none-eabihf                         β”‚   
   β”‚humility:     features => itm, stm32f4                                  β”‚   
   β”‚humility:   total size => 70K                                           β”‚   
   β”‚humility:  kernel size => 18K                                           β”‚   
   β”‚humility:        tasks => 8                                             β”‚   
   β”‚humility:                 ID TASK                SIZE FEATURES          β”‚   
   β”‚humility:                  0 jefe               10.6K itm               β”‚   
   β”‚humility:                  1 rcc_driver          4.9K stm32f4           β”‚   
   β”‚humility:                  2 usart_driver        6.3K stm32f4           β”‚   
   β”‚humility:                  3 user_leds           5.7K stm32f4           β”‚   
   β”‚humility:                  4 ping                5.5K uart              β”‚   
   β”‚humility:                  5 pong                4.8K                   β”‚   
   β”‚humility:                  6 hiffy              14.4K                   β”‚   
   β”‚humility:                  7 idle                0.1K                   β”‚   
   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   
             humility manifest showing the contents of a core dump              
                                                                                
   CLIFF L. BIFFLE ── OXIDE COMPUTER COMPANY        ▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍

We wrote the debugger alongside the kernel.

I want to talk about two aspects of Humility today: how it has changed the operating system, and why more people don’t do this.

On the first point: you can get to most of the really deep changes that Humility brought to Hubris by starting at console interfaces. Most embedded projects I’ve worked on have had a console, usually over serial, sometimes over USB. It gets used during development and test, is usually critical during bringup, and sometimes survives into production. In many cases it’s the only way to verify that all the system tasks are running and not being starved of CPU, for instance.

Consoles seem simple, but I would argue that this appearance is deceptive as their feature set grows. A typical human-readable console over a UART requires printf-equivalent formatting code for strings and numbers. If your application needs real numbers for sensor measurements, or if your printf simply complies with the C standard, that means you’re also pulling along floating point formatting code. This can burn many kilobytes of Flash space, and we haven’t even gotten to input.

Implementing a console in the languages we’ve traditionally used is also a difficult task, because it’s just so easy to get things wrong. Unless you’re fuzzing your console interface – and you are, right? – it probably contains buffer overflows, inadvertent acceptance of illegal data, format string vulnerabilities, or potentially even stack smashing.

Our Hubris-based firmware applications don’t have console interfaces. They don’t contain printf-level data formatting, and they cannot parse command lines. And yet we’re not really missing any of the functionality, because we’ve split the task between the application and the debugger.

We’ve established a set of interface patterns between the debugger and application, which effectively form a user-extensible kernel-aware debugger interface. I refer to this as the Debug Binary Interface or DBI. Like an ABI explaining how to pass values and format data structures in an application, the DBI defines how to declare variables and types such that the debugger will find them and do … stuff. And we’re leaving that β€œstuff” deliberately ill-defined.

On the kernel end of things, we use this to print task status information:

▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍                              ON HUBRIS AND HUMILITY   
                                                                                
                                                                                
                                                                                
                                                                                
      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      
      β”‚[cbiffle@gwydion]$ humility -d ~/Downloads/hubris.core.54 task    β”‚      
      β”‚humility: attached to dump                                        β”‚      
      β”‚system time = 166704                                              β”‚      
      β”‚ID TASK                 GEN PRI STATE                             β”‚      
      β”‚ 0 jefe                   0   0 recv, notif: bit0                 β”‚      
      β”‚ 1 rcc_driver             0   1 recv                              β”‚      
      β”‚ 2 usart_driver           0   2 recv, notif: bit0(irq38)          β”‚      
      β”‚ 3 user_leds              0   2 recv                              β”‚      
      β”‚ 4 ping                  59   4 FAULT: divide by zero (was: ready)β”‚      
      β”‚ 5 pong                   0   3 recv, notif: bit0(T+296)          β”‚      
      β”‚ 6 hiffy                  0   3 notif: bit0(T+203)                β”‚      
      β”‚ 7 idle                   0   5 RUNNING                           β”‚      
      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      
               humility tasks showing task status in a core dump                
                                                                                
                                                                                
                                                                                
                                                                                
   CLIFF L. BIFFLE ── OXIDE COMPUTER COMPANY        ▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍

As you can see here, task index 4, called ping, has failed with a divide by zero error, and so it might be useful to pull its current stack trace:

▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍                              ON HUBRIS AND HUMILITY   
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     
     β”‚[cbiffle@gwydion]$ humility tasks -sl ping                          β”‚     
     β”‚system time = 166704                                                β”‚     
     β”‚ID TASK                 GEN PRI STATE                               β”‚     
     β”‚ 4 ping                  59   4 FAULT: divide by zero (was: ready)  β”‚     
     β”‚   |                                                                β”‚     
     β”‚   +--->  0x200025b0 0x0802405e task_ping::divzero                  β”‚     
     β”‚                     @ /home/cbiffle/hubris/task-ping/src/main.rs:28β”‚     
     β”‚          0x20002600 0x080240f2 userlib::sys_panic                  β”‚     
     β”‚                     @ /home/cbiffle/hubris/userlib/src/lib.rs:642  β”‚     
     β”‚          0x20002600 0x080240f2 main                                β”‚     
     β”‚                     @ /home/cbiffle/hubris/task-ping/src/main.rs:39β”‚     
     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     
             humility tasks -s showing stack trace of a failed task             
                                                                                
                                                                                
                                                                                
                                                                                
   CLIFF L. BIFFLE ── OXIDE COMPUTER COMPANY        ▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍

Outside the kernel, our supervisor implementation has a debugger interface that lets us request that a particular crashing task be held for inspection, instead of automatically restarted like it would be in production, by writing a request into a reserved section of its RAM. And for more general debug or bringup work, our firmware images include a debug agent task that can run small interpreted programs delivered directly into its RAM by the debugger. We use this facility, called the Humility Interchange Format or HIF, to request application-specific sequences of operations, for example, to check and enumerate the tree of attached PMBus devices.

Data motion between the target system and the debugger relies on two mechanisms. For target-to-debugger, we decode data structures directly out of target memory using the DWARF debug information from the compiler. For debugger-to-target, for anything more complex than β€œpoke a 1 into this to activate,” we use Rust enums encoded with serde into a compact binary format, deposited directly into designated areas of target RAM. This means that, while we do have a parser exposed, it is machine-generated, strict, and difficult to get to, which is an improvement.
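
As a purely illustrative sketch of the kind of debugger-to-target request this enables (the actual Hubris types, names, and encoding details differ, and this assumes the serde derive macros), one might write something like:

    use serde::{Deserialize, Serialize};

    // Illustrative only: a request enum the debugger could serialize into a
    // compact binary form and deposit into a designated region of target RAM.
    #[derive(Serialize, Deserialize)]
    enum DebugRequest {
        // Ask the supervisor to hold a crashing task for inspection.
        HoldTaskOnFault { task_index: u16 },
        // Let a previously held task be restarted normally.
        ReleaseTask { task_index: u16 },
    }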

This neatly resolves a tension that I’ve dealt with my whole embedded career: how much Flash should we waste on things that aren’t expected to happen in production? Should the system include a manufacturing self test mode, or the ability to take over devices from drivers at runtime? If so, what happens if these modes do get activated in production? By moving this code out of the firmware, we’ve answered that question: they won’t, unless someone has physical access to the JTAG scan chain and has authenticated with the processor to open the debug interface.

If writing a debugger is so great, why isn’t everyone doing it?

Having debug tools that understand your operating system and application has proven invaluable. So, why aren’t more people doing this?

I think it’s for four main reasons.

First: it’s a domain shift. If your job is to write firmware to make the product go, writing a debugger seems like it might require a different set of skills. It might even be in a totally different language.

Second: it’s a bunch of work, and it’s not immediately obvious how it helps you get to your deadline faster, and that’s all a lot of people care about. One of the shared beliefs that unites us at Oxide is that investing in tools early lets us move faster in the long term, but that belief is unfortunately not universal across all companies.

Third: if you wanted to make it less work to write a debugger by reusing existing code, existing debuggers are typically not modular or designed for reuse. You cannot, for instance, easily link OpenOCD’s SWD support into a new tool, or borrow GDB’s stack trace reconstruction implementation. While you could copy-paste the code out, it would be a significant amount of work to adapt it.

Finally: if you did decide to write your own debugger, the documentation in this area can be truly arcane. DWARF in particular has a reputation for being monstrously complex and hard to follow – a reputation that, in my opinion, is only partially deserved.

The first two points, where writing a debugger is scary to either you or your boss, I can’t really help with, except maybe by giving more talks. But on the other two, where debuggers tend not to be reusable and debug information is complex and hard to understand, I can hook you up. We’re in the process of refactoring Humility to make its core generic and reusable for any program that wants to parse and understand debug information and the contents of another running program. Of course, we’re currently heads down trying to get our first product out the door, so it may take us a few months, but it is coming.

Conclusion

I could not have pulled this off on my own, and I’m fortunate to work with a wonderful group of folks here at Oxide:

▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍                              ON HUBRIS AND HUMILITY   
                                                                                
                                                                                
                                   CORE TEAM                                    
                                                                                
                Laura Abbott     Rick Altherr   Cliff L. Biffle                 
                Bryan Cantrill   Matt Keeter    Steve Klabnik                   
                                                                                
                                                                                
                                   COMMITTERS                                   
                                                                                
                     Luqman Aden            Adam Leventhal                      
                                                                                
                     Dan Cross              David Pacheco                       
                                                                                
                     Nathanael Huffman      Ben Stoltz                          
                                                                                
                     Sean Klein                                                 
                                                                                
                                                                                
                            WITH SPECIAL GUEST STAR                             
                                                                                
                   OZYMANDIAS, KING OF TRIVIAL CODE REFORMATS                   
                                                                                
   CLIFF L. BIFFLE ── OXIDE COMPUTER COMPANY        ▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍

I’d like to thank both the core Hubris developers, and the seven or so folks who have taken time away from other parts of our product to improve things in firmware land.

We’ve published the repos involved in Hubris development, as well as the draft reference manual, so if you’re interested by anything I had to say today – or infuriated by it – I’d encourage you to read more at the URLs on the final slide.

Hubris and Humility are not done by any means, but they’ve become very useful for our purposes, and maybe you’ll find them useful too.

Thanks.

▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍                              ON HUBRIS AND HUMILITY   
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                     Code: github.com/oxidecomputer/hubris                      
                                                                                
                          Docs: hubris.oxide.computer
                                                                                
                             Oxide: oxide.computer                              
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
                                                                                
   CLIFF L. BIFFLE ── OXIDE COMPUTER COMPANY        ▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍▍

Clickable versions of those links for the web: github.com/oxidecomputer/hubris, hubris.oxide.computer, and oxide.computer.

Epilogue

It seems like a lot of folks appreciated the talk, which is great! I’ve collected answers to the questions I’m getting most often into a FAQ over in the Hubris repo. Have a look if you’re curious.

If you have any questions not covered by the FAQ, or are interested in having me come talk about something at your conference or meetup, please contact me.

The main objection I’ve been getting is that this feels like Rust evangelism to some folks. I’m not sure I can help them with that – if they can show me another systems language that supports all the features I touch on in this talk, I’d be very excited to learn about it. I talk about Rust a lot in this talk because I haven’t found other practical options.

Many of the things we’ve done in Hubris cannot be done robustly in C/C++, in the sense that they could be approximated, but their integrity would have to be maintained through code review or convention rather than compile-time checks. Believe me, I’ve tried in previous jobs.

But if someone dislikes Rust and proves me wrong on that, you know what? That would be fantastic. Because my goal is not to make everyone use Rust – it’s to improve the robustness of systems software against entirely preventable bugs. Because the current status quo has literally killed people, and we need to do better, however we do it.