On Hubris And Humility
- Intro
- A whirlwind tour of Hubris
- Synchronicity: it's nice
- Types: they're good
- Application-Debugger Co-Design
- Conclusion
- Epilogue
Last week I gave a talk at the Open Source Firmware Conference about some of the work I'm doing at Oxide Computer, entitled On Hubris and Humility. There is a video of the talk if you'd like to watch it in video form. It came out pretty alright!
The conference version of the talk has a constantly animated background that makes the video hard for some people to watch. OSFC doesn't appear to be bothering with either captions or transcripts, so my friends who don't hear as well as I do (or just don't want to turn their speakers on!) are kind of out of luck.
And so, here's a transcript with my slides inlined. The words may not exactly match the audio because this is written from my speaker's notes. And, yes, my slides are all character art. The browser rendering is imperfect.
I've also written an epilogue at the end after the initial response to the talk.
- Hubris repo
- Hubris reference manual in case you don't feel like you've read enough of my writing today
Intro
[Title slide: "ON HUBRIS AND HUMILITY" in large character art]

CLIFF L. BIFFLE — OXIDE COMPUTER COMPANY
Hello. My name is Cliff, and I work at Oxide Computer Company. Today, I'll be talking about Hubris and Humility.
We at Oxide are building a new kind of server, and like many computers nowadays it has a large central processor for running customer code, and several smaller processors for housekeeping and management purposes. Unlike most manufacturers, we're writing the code that runs on all those processors ourselves.
To keep complexity down, we decided early on to go with microcontroller-class processors for all the smaller supporting cores, such as the ARM Cortex-M series or similar RISC-V CPUs. What these processors have in common is memory protection hardware — not, importantly, memory mapping hardware or virtual memory support. A typical implementation lets you apply protection attributes to ranges of physical address space, which you can use to isolate programs from one another. Or even just to catch null pointer dereferences, something a lot of firmware environments don't do.
Our goal is that our customers never have to think about or interact with these processors unless they want to, which means that the software running on them needs to be quite robust. We wanted to use the memory protection hardware to support this goal.
If you're looking for an operating system with memory protection support on this class of processor, there are frankly not a lot of options. We evaluated several, getting pretty far with one, but in each case we ran into serious problems. And so in March 2020 we made the difficult decision to write our own, at the same time that we were designing and building our first product.
This is kind of unreasonable, but not as unreasonable as it might sound. We expected that the bulk of our time would go into writing drivers, and since we were creating the hardware ourselves, no off-the-shelf choice would save us time by providing canned drivers. In addition, by building the operating system and the product at the same time, we could try to avoid speculative engineering and build a system that covered our needs without excess generality. Even so, in recognition of how the suggestion sounded, I named the project Hubris.
Today I'll give a high-level overview of the system and how its parts fit together, and then dive into the ways that our design goals, our use of Rust, and our discoveries along the way have influenced the design process.
A whirlwind tour of Hubris
I'd like to start by taking you on a whirlwind tour of Hubris's structure. I'll be touching on a lot of points here, but mostly superficially. Later in this talk I'll zoom in on some of them, and if you want to really dig in, there's a reference manual linked on the last slide.
SERVICE PROCESSOR APPLICATION COMPONENTS

+------------+ +-----+ +------+
| supervisor | | rcc | | gpio |
+------------+ +-----+ +------+
+------+ +------+ +-----+ +---------+
| spi2 | | spi4 | | i2c | | spdprox |
+------+ +------+ +-----+ +---------+
+----------+ +---------+ +------------+
| spiflash | | thermal | | sequencing |
+----------+ +---------+ +------------+
+-------+ +------+
| hiffy | | idle |
+-------+ +------+
+-------------------------------------+
|            hubris kernel            |
+-------------------------------------+
The best way to understand Hubris is to start with a worked example, and I've chosen our most complex one, which is the Service Processor. Our Service Processor is roughly analogous to the Baseboard Management Controller you might find on a traditional server.
The Service Processor is one of the use cases that's driving the development of Hubris. We refer to a system like this, built with Hubris, as an application of Hubris.
The service processor application, like most firmware, consists of a collection of software pieces — represented by boxes here. Some of these are widely reusable, such as utility code, or drivers for common devices, or the Hubris kernel itself. Some are application-specific. Here we encounter the first thing that makes Hubris unusual in its class: components such as these are separately compiled and isolated from one another using the processor's memory protection hardware. We refer to these isolated components as tasks.
SERVICE PROCESSOR APPLICATION COMPONENTS

  unprivileged, isolated tasks:

+------------+ +-----+ +------+
| supervisor | | rcc | | gpio |
+------------+ +-----+ +------+
+------+ +------+ +-----+ +---------+
| spi2 | | spi4 | | i2c | | spdprox |
+------+ +------+ +-----+ +---------+
+----------+ +---------+ +------------+
| spiflash | | thermal | | sequencing |
+----------+ +---------+ +------------+
+-------+ +------+
| hiffy | | idle |
+-------+ +------+

  privileged:

+-------------------------------------+
|            hubris kernel            |
+-------------------------------------+
Tasks occupy separate areas of RAM and Flash, and cannot access one another's memory directly. Separate compilation is important for making this robust: if a library internally declares a global static variable, for instance, each task is sure to get its own copy instead of trying to share.
But if every piece of the system were totally isolated from all the others, it wouldn't be a very useful system. Application tasks can communicate with one another using a small set of inter-task interaction operations, which together offer something that resembles a cross-task call operation. We informally refer to tasks making these calls as clients, and the tasks handling them as servers, though in a real system most tasks act as clients and servers simultaneously, forming a hierarchy of task interaction.
Several of the tasks shown here are drivers, which brings us to another unusual aspect of the system: drivers run in unprivileged mode, outside the kernel, and are typically isolated in their own tasks. These tasks get exclusive access to any memory mapped registers they require to do their work, by way of the memory protection unit, and can claim hardware interrupt signals, which the kernel will route to the task.
All this is assembled together into a cohesive image by the Hubris build system, controlled by an application configuration file. And here we find the third thing that makes Hubris unusual: Hubris is an aggressively static system. The configuration file defines the full set of tasks that may ever be running in the application. These tasks are assigned to sections of address space by the build system, and they will forever occupy those sections.
Hubris has no operations for creating or destroying tasks at runtime. Task resource requirements are determined during the build and are fixed once deployed. This takes the kernel out of the resource allocation business. Memories are the most visible resources we handle this way, but it applies to all allocatable or routable resources, including hardware interrupts and memory-mapped registers — all are explicitly wired up at compile time and cannot be changed at runtime.
This notion of compile time configuration, static allocation, and specialization of code extends up into application logic, as well. Tasks can be customized with a tree of configuration data that is made available at compile time. It's not unusual for a single application such as the service processor to contain three or four tasks all built from the same code with different configuration — for instance, handling different SPI controllers. By transforming the configuration data into constant expressions accessible by both the task and the compiler, we can use generic code to generate compact, specialized binaries, without littering the code with conditional compilation directives.
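To illustrate the idea in miniature (the names and shapes here are invented for this transcript, not the actual Hubris build machinery): the build system can be thought of as emitting constant configuration that generic driver code is compiled against, letting the compiler specialize each task instance.

```rust
// Hypothetical illustration of compile-time task configuration. The real
// Hubris build system generates something analogous from the app config file.

/// Per-instance configuration, standing in for (hypothetically) generated code.
struct SpiConfig {
    controller: u32, // which SPI controller this instance drives
    clock_div: u32,  // clock divider baked in at compile time
}

// Two task binaries built from the same driver code, differing only in the
// constant configuration they are compiled against.
const SPI2_CONFIG: SpiConfig = SpiConfig { controller: 2, clock_div: 8 };
const SPI4_CONFIG: SpiConfig = SpiConfig { controller: 4, clock_div: 16 };

/// Generic driver logic: because the config is a constant, the compiler can
/// fold these computations and drop unused branches, yielding a specialized
/// binary with no conditional-compilation directives in the source.
const fn clock_rate_hz(cfg: &SpiConfig, bus_hz: u32) -> u32 {
    bus_hz / cfg.clock_div
}

fn main() {
    println!("spi{}: {} Hz", SPI2_CONFIG.controller, clock_rate_hz(&SPI2_CONFIG, 80_000_000));
    println!("spi{}: {} Hz", SPI4_CONFIG.controller, clock_rate_hz(&SPI4_CONFIG, 80_000_000));
}
```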
Now, you might be asking: if we can't create or destroy tasks at runtime, how do we deal with crashes or other software failures? The answer lies in the single task control operation that Hubris does provide, which is in-place reinitialization. When invoked, the reinitialization operation will, informally speaking, stop the task, disconnect it from anything it was using, reset its registers and stack to their default states, and then set it running.
The decision of when to reinitialize a task is left to the application, since different applications may have different constraints. For instance, a driver might require that other tasks be informed of its failure, or even restarted at the same time. Applications choose a task to play the role of supervisor. When any task in the application has a kernel-visible fault, such as a memory access violation, or an explicit call to panic!, the kernel records the fault and delivers a notification to the supervisor task. The supervisor can then take any action the application requires, such as restarting one or more tasks or updating an event log.
With this plus some related mechanisms, we can implement recursive component-level restarts in the service processor firmware. Our intent is for the service processor itself to only reboot in truly catastrophic circumstances, and we don't consider a driver crash to be sufficiently catastrophic. Instead, we restart the driver, and possibly some of its clients, and carry on.
Memory isolation is key to being able to restart components like this, by limiting the "blast radius" of a malfunction. No matter how corrupt the state in one task becomes, other tasks can expect their state to be okay.
Because of this, we don't need to restrict how each task is written under the hood. While most of Hubris itself and our application code is written in Rust's safe subset, which grants immunity from common classes of memory usage errors, the unsafe subset of Rust is incredibly valuable, particularly when writing fast drivers, or doing things like DMA. While code using unsafe Rust is still significantly safer than C — for instance, array indexing is still checked and integer overflows are still caught — it has access to operations that suppress those checks, and so it has the ability to corrupt any memory it can reach if it works hard enough. The combination of encouraging (but not requiring) safe Rust, while using memory isolation as a backstop, gives us a lot more flexibility without any loss of system-level safety.
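As a small standalone illustration (ordinary Rust, not Hubris code) of the checks in question: safe indexing is bounds-checked at runtime, while `get_unchecked` inside an `unsafe` block opts out of that check.

```rust
/// Safe Rust: an out-of-range access yields None instead of reading garbage.
fn checked_read(data: &[u8], i: usize) -> Option<u8> {
    data.get(i).copied()
}

/// Unsafe Rust: the caller promises i < data.len(). The bounds check is
/// suppressed, so a bad index here could read arbitrary reachable memory.
fn unchecked_read(data: &[u8], i: usize) -> u8 {
    unsafe { *data.get_unchecked(i) }
}

fn main() {
    let buf = [10u8, 20, 30];
    println!("{:?}", checked_read(&buf, 1)); // Some(20)
    println!("{:?}", checked_read(&buf, 9)); // None: caught, no corruption
    println!("{}", unchecked_read(&buf, 2)); // fine only because 2 is in range
}
```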
Which is not to say things never go wrong — just that, when they do, it is not typically a memory safety issue. It is far more likely to be a misread of a component datasheet, a subtle interaction between two tasks, or a plain-old logic error. In these cases, we break out Humility.
Humility is a Hubris-specific debugger, which we designed and implemented concurrently with Hubris itself. It can connect to a live target system over JTAG or SWD and provide live views of what's going on inside — everything from log output, to the state of all tasks, to individual task stack traces and memory dumps. We expect that, despite our best efforts, we will ship products containing bugs; unlike a lot of other companies, we don't expect that our servers will be allowed to "phone home" from a customer site. As a result, we've also built out coredump support that can capture a complete snapshot of service processor state into a file that can be loaded into Humility and debugged off-site, should the customer elect to share it with us.
So — that's the 10 kilometer view of Hubris and Humility. For the rest of my time here, I'm going to zoom in on some particular areas of our experience that I thought were particularly interesting or surprising.
Synchronicity: it's nice
The interaction between tasks and the Hubris kernel is fully synchronous. This is somewhat unusual, and has some pleasant implications, some of which we saw coming, and some of which surprised us.
First, I'll unpack what I mean by "synchronous."
As I've described, application code runs in isolated tasks, which communicate with each other and with the kernel by making syscalls. These syscalls perform a complete operation and then return. That might sound natural, but a lot of systems allow programs to queue up work to be performed by the kernel on their behalf, perhaps receiving some kind of notification later that the work is complete.
For systems like the ones weβre building, this has four problems.
First, it introduces a queueing problem into the kernel. Sizing such queues, accounting for their resource usage, avoiding starvation, and producing proper backpressure are difficult problems in real systems. How many asynchronous operations can a given task have in flight? How will it behave if that limit is reached? Will this condition occur when you're testing the system, or will it produce strange behaviors only at the worst time: when the system's under heavy load?
Second, it increases the complexity of the system in the last place where you want complexity: the kernel. Asynchronous systems need structures for keeping track of in-flight operations, whether that's kernel threads, in-kernel event continuation and completion tables, or similar.
Third, asynchronous systems can be harder to reason about. Sometimes the events will complete in the order of issuance, sometimes they won't. Either the programs running on the system themselves become fully asynchronous, or they risk having their behavior depend on the order of completion. It's also significantly more difficult to display the state of an asynchronous program — mechanisms familiar to developers, like stack traces, no longer make sense.
Fourth, asynchronous systems tend to be less efficient. There has been a movement toward fully asynchronous systems interfaces in recent years, driven in large part by increases in core count and complexity of desktop applications. A well-designed asynchronous system interface, with aggressively concurrent software written to support it, can produce higher average throughput if the events it deals in really can complete in varying order. What it cannot produce is lower latency. Synchronous call interfaces simply use fewer machine instructions to complete the task of setting up and completing a call. As a result, systems built on synchronous call interfaces tend to be smaller. On the class of processor we're targeting, with low core count, limited RAM and flash, and real-time responsibilities, smaller systems with lower latency win.
SEND
Let's make this concrete by looking at the most common synchronous syscall in Hubris, SEND. Here is the signature of the syscall in Rust.
pub fn sys_send(
    target: TaskId,
    operation: u16,
    outgoing: &[u8],
    incoming: &mut [u8],
    leases: &[Lease<'_>],
) -> (u32, usize);

Listing 3: the actual signature of the send syscall.
This call sends a message to target, invoking a given operation, and carrying some outgoing data. It then expects to get an incoming response back. The syscall will return two values: a response code carrying success/failure information, and a usize giving the size of the message that was received into the incoming buffer. This call directly reflects the kernel syscall ABI, and in practice few programs call it directly, preferring to use wrappers — but since we're talking about the system architecture, wrappers would be distracting.
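For a flavor of what such a wrapper might look like, here is a hypothetical sketch with sys_send stubbed out (and leases omitted) so it stands alone; the real wrappers, and all the names here, differ from what Hubris actually generates.

```rust
// Hypothetical sketch of the kind of typed wrapper tasks call instead of the
// raw syscall. sys_send is stubbed so the example is self-contained; the real
// one is a kernel trap. All names here are invented for illustration.

#[derive(Clone, Copy)]
struct TaskId(u16);

const OP_SET_LED: u16 = 1; // invented operation code

/// Stub standing in for the real syscall: pretends the server replied with
/// success and wrote no response bytes.
fn sys_send(
    _target: TaskId,
    _operation: u16,
    _outgoing: &[u8],
    _incoming: &mut [u8],
) -> (u32, usize) {
    (0, 0) // (response code, bytes written into `incoming`)
}

/// The wrapper: callers see a normal typed function, not a byte interface.
fn set_led(driver: TaskId, led: u8, on: bool) -> Result<(), u32> {
    let msg = [led, on as u8];
    let mut response = [0u8; 0]; // this operation returns no payload
    let (code, _len) = sys_send(driver, OP_SET_LED, &msg, &mut response);
    if code == 0 { Ok(()) } else { Err(code) }
}
```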
A TYPICAL CROSS-TASK CALL

Task A                Task B
  |                     |
  |-- SEND ------------>|
  |                     |
 not                  doing
running               stuff
  |                     |
  |<---------- REPLY ---|
  |                    RECV
running                 |
  |                 (waiting)
When a task calls SEND, it immediately loses the CPU. In the fast path, when the target task is ready and waiting, the scheduler switches to the target task immediately. If the target is not ready, the sender is blocked until it becomes ready. Either way, something else runs.

Messages are always shuttled from task to task directly. They are never buffered or queued in the kernel. In addition to avoiding the resource management problems that queueing would introduce, this also removes a memcpy to the kernel by only copying once, which can be a real boon in performance-sensitive applications.
There are two more subtle implications of this design.
One is that each task can have either zero or one outgoing messages at any given time. Because a task loses the CPU until it receives a reply to its message, it can't simultaneously send two messages — nor can it queue up dozens of outgoing messages to barrage its peers. This takes away a fault or attack amplification vector that's available in asynchronous systems.
Another is that a task that is sending a message is, by definition, not running. Because a task sending a message or waiting for a reply is not schedulable, we can immediately context switch away from the sending task — if possible, directly into the target task. The target task can then assume that the sender is waiting for a reply.
And, as it turns out, this interacts very nicely with Rust's opinions on memory aliasing.
Aliasing and leases
In Rust, references represent borrowed data, which comes from somewhere else, possibly your caller's stack. References are analyzed at compile time by the compiler, which checks that code operating on borrowed data stops operating on it before returning. In general, it's legal to pass references to borrowed data down into functions you call, and then operate on it after the function has returned, because you can rely on the compiler to ensure that that function hasn't secretly hung onto a reference.
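Within one task this is ordinary Rust; a minimal example of the guarantee being described:

```rust
/// The compiler verifies that `fill` stops using the borrow when it returns,
/// so the caller may freely use the buffer again afterwards.
fn fill(out: &mut [u8], value: u8) {
    for b in out.iter_mut() {
        *b = value;
    }
}

fn main() {
    let mut buf = [0u8; 4];
    fill(&mut buf, 7); // mutable borrow lives only for the duration of the call
    assert_eq!(buf, [7, 7, 7, 7]); // caller resumes exclusive access
    println!("{:?}", buf);
}
```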
The Hubris lease mechanism extends this function-to-function borrowing mechanism across task boundaries.
A TYPICAL CROSS-TASK CALL

Task A                Task B
  |                     |
  |-- SEND ------------>|
  |                     |
  |<----------- accessing
 not -----------> borrowed
running               memory
  |                     |
  |<---------- REPLY ---|
  |                    RECV
running                 |
  |                 (waiting)
A function running in one task can construct leases referencing data in that task, and attach them to a send to another task. The receiving task is then treated as borrowing the leased data, and can access its contents whenever it likes, until it resumes the first task by replying. Now, these sorts of dynamic task interactions have to be checked at runtime, rather than at compile time like Rust borrows, but otherwise the operation is pretty similar.
This is how drivers on Hubris implement operations like transmitting blocks of data out a serial port: the task that wishes to transmit sends a message indicating its intent, with a read-only lease attached. The driver then loads data through the lease in whatever chunk size it wishes, until the transmission is complete and the sender is resumed. While in this case you could package the data into the message portion of the send call, that would require the kernel to copy the data all in one go, which in turn means that the driver needs to have already set aside a chunk of RAM large enough to receive any potentially sent stream of bytes in one whack. Not only would this waste precious RAM most of the time, it would inevitably lead to some arbitrary low limit on how many bytes can be sent through a serial port in one call. Using leases, the sender doesn't know or care how large of a buffer the driver is keeping available, and the driver is responsible for moving chunks of sent data out of the sender's memory.
Cross-task borrows are only safe, from both an engineering reliability perspective and a Rust memory model perspective, because of synchronous messaging. The recipient accessing data through a lease can be confident that it has exclusive access, if required, because the previous holder of exclusive access is stopped.
When the borrowing task replies to the corresponding message, any leases that came along with it are atomically revoked. This ensures that, from the lender's perspective, sharing memory across tasks works identically to passing references into a function within the same task.
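The chunked-transmit pattern can be sketched in plain Rust; the BorrowedBuffer type below is an invented stand-in for the real lease handle, not the Hubris API.

```rust
/// Invented stand-in for a read-only lease handle: the server pulls data out
/// of the (stopped) sender's memory in chunks of its own choosing.
struct BorrowedBuffer<'a> {
    data: &'a [u8],
}

impl<'a> BorrowedBuffer<'a> {
    fn len(&self) -> usize {
        self.data.len()
    }

    /// Copy up to out.len() bytes starting at `offset`; returns bytes copied.
    fn read_at(&self, offset: usize, out: &mut [u8]) -> usize {
        let end = (offset + out.len()).min(self.data.len());
        let n = end.saturating_sub(offset);
        out[..n].copy_from_slice(&self.data[offset..end]);
        n
    }
}

/// Drain the whole borrow through a small fixed scratch buffer, the way a
/// serial driver would feed bytes into its hardware FIFO: the driver's own
/// RAM requirement stays constant no matter how much the sender lends.
fn transmit_all(borrow: &BorrowedBuffer<'_>) -> Vec<u8> {
    let mut sent = Vec::new();
    let mut scratch = [0u8; 4]; // the driver's small chunk buffer
    let mut offset = 0;
    while offset < borrow.len() {
        let n = borrow.read_at(offset, &mut scratch);
        sent.extend_from_slice(&scratch[..n]);
        offset += n;
    }
    sent
}
```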
As a programmer
Now, let's switch away from how the syscall behaves, and talk about how we, the programmers, think it behaves.
Synchronous systems are easier to understand and build.
Most programming in an embedded system is basically explaining to a computer how to perform a process described in a flow chart. There may or may not literally be a flow chart the programmer is working from; I'm speaking conceptually. We want to take a sequence of actions and checks and conditionals and loops and turn it into something the computer can do. Doing this will usually involve building a state machine.
Now, I'm using the term "state machine" in its broadest sense. A lot of embedded software, or systems software in general, thinks of a "state machine" as a sort of interpreter, either hand-written or generated by a tool, that under the hood probably contains a state variable and a big switch statement or table. That's one kind of a state machine. And that kind of state machine can be very useful.

But we also have a programming language. Programming languages are tools for expressing state machines, and they are quite good at it. When you instead use a programming language to implement an interpreter for some other description of a state machine, you're introducing a layer of conceptual indirection. It takes a reader a lot more effort to work out the behavior of an arbitrary nest of state transition edges than it does to read a while loop. Manually rolled state machines deny their authors the advancements of our industry since the introduction of structured programming, nearly 60 years ago. The result is almost always more complex, and more code, than a naive expression of the same process simply written in the programming language itself.
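To make the contrast concrete, here is a toy example (not from the talk): the same "sum the first N inputs" process written once as a hand-rolled state machine and once as straight-line structured code.

```rust
// Hand-rolled version: an explicit state variable plus a match, the way a
// super-loop or generated interpreter would express the process.
enum State {
    Collecting { seen: u32 },
    Done,
}

fn step(state: State, input: u32, limit: u32, total: &mut u32) -> State {
    match state {
        State::Collecting { seen } => {
            *total += input;
            if seen + 1 == limit {
                State::Done
            } else {
                State::Collecting { seen: seen + 1 }
            }
        }
        State::Done => State::Done,
    }
}

// Structured version: the same process as straight-line code. The "state"
// lives implicitly in the loop counter and the program counter, and a reader
// can see the whole behavior at a glance.
fn sum_first(inputs: &[u32], limit: usize) -> u32 {
    let mut total = 0;
    for &x in inputs.iter().take(limit) {
        total += x;
    }
    total
}
```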
The decades of collective culture around writing, analyzing, and inspecting programs written in programming languages not only make things easier for the reader, but also get you better tool support. If a program is mysteriously not making progress, I can use existing debug information produced by any compiler to get a stack trace and display local variables. I would need to spend time engineering this myself for a handwritten state machine.
And yet — despite all this — I've written my share of hand-rolled state machines, some quite recently. Why do we do this? Some people feel that explicit state machines or so-called "super-loop" architectures are easier to understand than straight-line code or preemptive multitasking, but I just spent several minutes explaining why I think that's misguided. The real answer is almost always asynchrony. Perhaps your driver needs to be nudged forward in small increments from interrupt context. The way most processors implement interrupt handlers, they don't have an opportunity to maintain a stack from invocation to invocation, more or less requiring them to be unrolled into an explicit state machine. The other main reason, in my case, is resource limitations — perhaps stacks are expensive, and so I need to multiplex them across several logical processes. Each of these individual processes may itself be synchronous internally, but the combination of processes may be mutually asynchronous, with events coming at different and uncorrelated times. And so I've ended up with asynchrony again.
Hubris approaches this problem pragmatically. We have attempted to provide a platform where a motivated programmer can entirely avoid asynchrony. Cross-task messaging is synchronous, as I've explained, but that's only part of the problem. The other thing that many operating systems make asynchronous is events — signal handlers, interrupt handlers, and the like.
Since I've been talking about drivers, I'll address interrupt handlers first. Interrupts are, as their name implies, asynchronous on most processors. The Hubris kernel provides "bottom-half" interrupt handler stubs that use compile-time configuration to route interrupts to tasks in the application. Tasks then see interrupts as synchronous events, delivered on request. Specifically, any time a task checks for incoming messages, it can pass an optional parameter specifying which, if any, of its interrupts it would also like to hear about.
This has turned out to be unexpectedly pleasant. A lot of drivers in our firmware are structured as server tasks, meaning that the core loop of the driver resembles, in pseudo-Rust, the following:
loop {
    let m = sys_recv();
    match m.operation {
        CONTROL_LASERS => { ... }
        EJECT_USER => { ... }
    }
}

Listing 4: a typical server loop.
Each time through this loop, the driver receives an incoming message and dispatches based on the operation that was requested. One message is handled at any given time.
Altering this driver to support interrupts, say during the process of ejecting the user, can be done two different ways. First, the fully synchronous method:
 1  loop {
 2      let m = sys_recv();
 3      match m.operation {
 4          CONTROL_LASERS => { ... }
 5          EJECT_USER => {
 6              start_rocket_motor();
 7
 8              // not actual syntax, but not far off
 9              sys_recv(ROCKET_IRQ_ONLY);
10
11              release_locking_clamp();
12
13              sys_reply(m.caller, OK);
14          }
15      }
16  }

Listing 5: server using synchronous interrupt wait.
Here the driver is waiting (line 9) for a specific interrupt, and will not honor other messages until it arrives. A real driver dealing with real hardware might include a timeout, but either way, this lets the driver author express a sequence of operations that includes waiting for interrupts as straight line code.
If the driver needs to process other requests while waiting for that interrupt, there's also the semi-asynchronous approach:
loop {
    let m = sys_recv_or_irq();
    match m.operation {
        INTERRUPT => {
            // handle interrupts here
        }
        CONTROL_LASERS => { ... }
        EJECT_USER => {
            // omitted: what if two of these
            // requests arrive simultaneously?
            start_rocket_motor();

            enable_irq();
        }
    }
}

Listing 6: interleaving interrupts and requests.
This is using the fact that interrupt notifications can be opted into on any receive call. So, the main match statement processing incoming messages is now evaluating both incoming messages and interrupts that have happened since the last time it checked.
I say this approach is semi-asynchronous because, unlike a signal handler, it never alters the control flow of the code you read on the page. This was an explicit design goal of ours, and it makes writing and debugging interrupt-driven drivers much easier.
Hubris also does not provide Unix-style intrusive signal handlers. And so, by combining these pieces, we have the summary of the task execution model in Hubris: code you write in a task executes as written or fails. Nothing in the system will arbitrarily alter your program's control flow from what appears on the page, except potentially to stop and restart it from the top. When phrased like that, it seems weird to have to say it out loud — like, don't all programmers pretty much assume that the code they write executes the way they wrote it? To which I say: yes, we do, and that assumption is often wrong, and leads to common classes of bugs, from simple data races on up. Hubris attempts to provide a platform where that reasonable assumption can be maintained.
Types: they're good
For a systems programming language, Rust has an unusually powerful type system. We've used this to great effect. Let's look at two small examples.
Making illegal task states unrepresentable
First, from the kernel internals, we have TaskState.

As the name implies, TaskState is a type describing the state of a task. While Hubris doesn't recognize all that many possible task states, there are still several, some of which have associated data. I'll focus on three potential states today.
First, a task may be runnable.
Second, a task may be blocked trying to deliver a message to a peer.
Third, a task may have failed with a fault.
A task in each of these states needs to keep slightly different sets of information around. The runnable state needs the least, just static properties of the task such as its scheduling priority.
A task that is attempting to deliver a message, a state we call InSend, needs to keep track of quite a bit more. Because the task has entered the InSend state in the first place, we know it tried to send a message to a peer that was not ready to receive. That's a good start, but we also need to keep track of the location of the message in the sender's memory space, any leases that were attached to the message, and of course the identity of the task that, we hope, will eventually receive the message.
A task that has taken a fault, on the other hand, doesn't need any of that, but it does need information about the fault. We record a fault number from a taxonomy of Hubris faults, and in some cases, we record additional metadata. For instance, in the event of a memory access violation, we record whether the fault was imprecise or precise, and, in the latter case, the offending address. For various types of kernel-originated software faults, we record information about why the kernel faulted the task. Finally, for any fault, we also record the state the task was in just before the fault.
Now, what a lot of kernels do here is to work out the union of the information required by all possible task states, and flatten it into a struct. In our case, it might look something like
ON HUBRIS AND HUMILITY

    struct Task {
        // ... stuff omitted...

        state: TaskState,
        previous_state: TaskState,
        ipc_peer: TaskId,
        message: USlice<u8>,
        fault_kind: FaultKind,
        fault_address: u32,
        extra_fault_info: u32,
        // ... and so on.
    }

    enum TaskState { Ready, Sending, ... }

Listing 7: how we didn't implement the Task struct.

CLIFF L. BIFFLE – OXIDE COMPUTER COMPANY
This can work, but there are several problems with it. First, it isn't at all clear to a reader when each of these fields is valid, and when it may contain garbage. Second, that lack of clarity can lead to security and robustness problems. For instance, a task that has died with a fault should not be scheduled. Any outgoing messages it was attempting to send at the time of the fault should not be delivered. Similarly, we should not be able to receive a message from a task that isn't trying to send one. Yet all these mistakes can be made quite easily with the flat state representation.
Finally, if we wanted to record task state before a fault, it's not clear what subset we would record, and it's definitely not clear how to handle any field that needs to be both saved and updated for the fault.
Having Rust enums available makes a big difference here. Enums in Rust can work just like enums in C, but this is a sort of degenerate case; they can also express arbitrarily complex tagged unions. Our top-level task state enum looks like this:
    enum TaskState {
        Healthy(SchedState),
        Faulted { ... },
    }

    enum SchedState {
        Runnable,
        InSend(TaskId),
        // ... more
    }

Listing 8: the actual task state enum.
TaskState distinguishes between healthy states and faulted states, and the healthy states are separated into a nested enum, SchedState. I'll come back to the members of Faulted that are omitted there.
Runnable and InSend are both variants of SchedState. The task ID designating the intended recipient of the message is only accessible, and indeed only exists as a member field, when the state is InSend. It is not possible to accidentally reference it if the task is in a different state, nor is it possible to transition into InSend without providing it.
SchedState is broken out into a sub-enum because it means that code in the kernel can be written such that it can't even talk about fault states. For instance, there is a set_healthy_state operation that is used in the implementation of several syscalls to move tasks between states. This takes a SchedState as a parameter, meaning the operation simply cannot put a task into a faulted state; that is a different and less frequently used operation. Similarly, the scheduler deals with tasks in terms of their SchedState, and since a faulted task doesn't have one, it cannot attempt to schedule a faulted task.
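As a sketch of how the types enforce this, here is a simplified model. TaskId is stood in for by a bare u32 and the fault metadata by a number; neither matches the real kernel code:

```rust
#[derive(Debug, PartialEq)]
enum SchedState {
    Runnable,
    InSend(u32), // TaskId simplified to u32 for this sketch
}

#[derive(Debug, PartialEq)]
enum TaskState {
    Healthy(SchedState),
    Faulted { fault: u32, original_state: SchedState },
}

struct Task {
    state: TaskState,
}

impl Task {
    /// Because this takes a SchedState, callers cannot use it to put a
    /// task into a Faulted state -- the parameter type rules that out.
    fn set_healthy_state(&mut self, s: SchedState) {
        self.state = TaskState::Healthy(s);
    }
}
```

The compiler, not a convention, guarantees that set_healthy_state can only ever produce a Healthy task.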
Well, technically, it isn't quite true that a faulted task doesn't have a SchedState. Remember that I mentioned we record the pre-fault state. The actual definition of Faulted looks like this:
    enum TaskState {
        Healthy(SchedState),
        Faulted {
            fault: FaultInfo,
            original_state: SchedState,
        },
    }

Listing 9: fields of the faulted state.
This is another useful benefit of breaking healthy states out into the SchedState type: we can store the pre-fault healthy state here. We can't embed a TaskState within a TaskState because that would produce an infinitely large data structure; but we can embed a SchedState. This gets us several benefits. First, while the SchedState is present in both healthy and faulted states, it is accessed in very different ways, syntactically, so they're very difficult to confuse. Second, I can guarantee you that the pre-fault original_state does not, itself, describe a fault, because the SchedState type cannot describe faults. Beyond making the code easier to use, being rigorous about which enum variants can appear in which contexts eliminates a lot of "default" or "don't-care" branches in switch statements, which is good, because these often accumulate bugs, particularly as enums are extended during the life of the system.
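A hypothetical scheduler helper shows the effect. This is illustrative code, not the kernel's:

```rust
#[allow(dead_code)]
enum SchedState {
    Runnable,
    InSend(u32), // TaskId simplified to a bare number for this sketch
}

/// No default arm: if a variant is ever added to SchedState, this match
/// stops compiling until the new case is handled explicitly. A `_ => ...`
/// arm would instead silently swallow the new state.
fn is_ready_to_run(s: &SchedState) -> bool {
    match s {
        SchedState::Runnable => true,
        SchedState::InSend(_) => false,
    }
}
```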
Parsing and the humble slice
So, that's one example of simplifying kernel book-keeping using Rust types. The other example I'd like to discuss is a case of maintaining security properties using types.
    pub fn sys_send(
        target: TaskId,
        operation: u16,
        outgoing: &[u8],
        incoming: &mut [u8],
        leases: &[Lease<'_>],
    ) -> (u32, usize);

Listing 3: the actual signature of the send syscall. (again)
Consider the outgoing message argument to the SEND syscall I presented earlier. From the task's perspective, this argument is passed as a slice, &[u8]. This is a Rust standard type that consists of a pointer and a length. This aspect of SEND is very much like the POSIX write call, which takes an explicit pointer and a length. While the Rust signature of the syscall takes a slice, the machine-level ABI just moves a pointer and length in two registers, and it's entirely possible for a misbehaving or malicious task to pass an arbitrary pointer and length. This means the kernel has a validation problem: given an address and size, which the caller alleges it can access, determine whether the caller should be allowed to access it.
This problem is common to basically all kernels that have some form of memory protection. It's also a common source of bugs and vulnerabilities, for the simple reason that it's easy to forget to check, or even if you remember to check, it's easy to accidentally separate the address and size values.
In every case like this that I've studied, there are two things at play:
- The address/size pair after validation is indistinguishable from the same pair before validation.
- The address and size are not welded together into a unitary value, and operations are inconsistent on whether they take both or just a base address. (For instance, C pointer indirection only takes a base address.)
It is surprisingly easy to solve both of these problems using types, and in fact, we do this all the time in more pedestrian situations. Consider taking a piece of text and trying to extract a base-ten number from it. You would probably use a number-parsing routine that is capable of returning either a number, or an error indicator if the string does not contain a base-ten number, something like the top listing on this slide:
    if let Some(n) = some_text.parse_u32() {
        println!("we have a number!");
        operate_on(n);
    } else {
        println!("not a number");
    }

Listing 10: how one might parse and use a number.

    if check_is_number(some_text) {
        println!("we have a number!");
        operate_on(some_text);
    } else {
        println!("not a number");
    }

Listing 11: validating without using the result.
This isn't the only way to do this; you could also write one routine for inspecting the string and verifying that it contains a number, and then pass the text around, assuring anyone who encounters it that it contains a decimal number. That's what the second listing does.
But this approach feels weird, and it's worth asking why it feels weird. As systems programmers, one likely objection is about performance: presumably we're going to need to parse a number out of that text eventually, so doing a validation step that inspects the text, only to later do a parsing step, will probably use more CPU time than a parsing operation that can indicate errors. And that's true, in practice!
But there's another reason it feels weird. Consider the definition of the operate_on function in each of these two cases.
    // First case
    fn operate_on(n: u32) { ... }

    // Second case
    fn operate_on(n: &str) { ... }

Listing 12: operate_on gets different types, and thus different assurances, in each case.
In the first case, the code in the function can be assured that n is a valid u32, because, well, it says right there. In the second case, the caller could pass any arbitrary string. You could add a large block comment on the function declaring that any string passed in must be a valid number, but the code will still need to deal with the possibility that it's passed garbage. Or, it could elect not to deal with that possibility and assume its callers have done the right thing every time, which is often how we get kernel exploits.
So, I'd argue that the second approach feels weird for some very good reasons, and that you should prefer the first approach. This is a specific case of a more general principle, summarized by Alexis King as "Parse, Don't Validate." By treating input validation as a parsing problem, we get two benefits:
- By parsing once and returning the result, we avoid accidentally wasting time by validating/parsing repeatedly.
- The fact that parsing or validation has occurred is now reflected in the types.
And it's that second point that I'm flogging here. Going back to our operate_on cases, it is simply not possible for the user to call the u32 version with an un-validated string, because it doesn't accept a string; it accepts a u32, and even a 40-year-old C compiler will warn you if you confuse those. This means that the implementation doesn't need to decide between re-checking the input or potentially being the cause of bugs and vulnerabilities down the road. It and the caller are working together to move only valid data around.
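In real Rust (the parse_u32 of Listing 10 is pseudocode), the standard str::parse plays this role: it returns a Result, so the only way to reach operate_on is through a successful parse. A runnable miniature:

```rust
fn operate_on(n: u32) -> u32 {
    // This function can only ever see a valid u32; no re-checking needed.
    n * 2
}

fn handle(text: &str) -> Option<u32> {
    // str::parse produces a typed artifact (a u32) or an error.
    // There is no way to "bless" the string without actually parsing it.
    match text.parse::<u32>() {
        Ok(n) => Some(operate_on(n)),
        Err(_) => None,
    }
}
```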
So why don't we do the same thing with addresses from user code?
    // Linux:
    access_ok(VERIFY_READ, pointer, length);

    // Windows:
    ProbeForRead(pointer, length, alignment);

    // it goes on

Listing 12: pointers get validated, types don't change.
Here are typical pointer checks from the Linux and Windows kernels.
If we accept that the key distinction between a parsing operation and a validation operation is that parsing produces an artifact that you didn't have before (a change in type, typically), then these are both validation operations. We get no support from the compiler here; you can delete these calls from a program and not get a compiler warning. It's also not at all clear where in the kernel these operations need to be called: each function must decide whether to spend time and code validating pointer arguments, or potentially be the source of bugs in the future when a caller fails to notice the block comment.
Comments and conventions are great, but they are not machine-checkable, and they are a poor substitute for types. Both Linux and Windows, to use the two examples at hand, have shipped security vulnerabilities related to using these validation operations incorrectly, in ways that would have been prevented by rephrasing them as parsing.
The easiest way to see this is to consider what should happen if the length is zero. Reading zero bytes from an address is probably fine, and on Windows at least, ProbeForRead will OK it independent of address. However, it's super easy to do this by accident:
    const uint8_t * user_data = ...;
    size_t len = 0;

    ProbeForRead(user_data, len, 1); // OK!

    *user_data // no error, no warning

Listing 13: mere validation won't stop this.
Any dereference of the pointer is a bug, but nothing prevents someone from making this mistake… except code review.
Now, consider this hypothetical replacement for ProbeForRead, written in Rust syntax.
    fn probe_for_read<T>(
        pointer: *const T,
        length: usize,
        alignment: usize,
    ) -> Option<&[T]>;

Listing 14: recast using Rust types.
This takes the same three arguments, using the same types as C, but the return type has changed. It now returns an option of slice of T, meaning the caller will either get the value None if the check failed, or Some with a slice. In this case, the slice represents the parse result. It uses the type system to indicate that the memory is OK to access, contains a sequence of values of some type T, and is correctly aligned. This also serves to bond the pointer and length together, so that they can't mistakenly be used separately or mixed up.
If a function needs a reference to valid user memory, it can require a slice. If it's willing to do the validation itself, it can take a pointer and a length.
This signature works particularly well in Rust because so-called raw pointers, like the *const T you see here, are allowed to contain arbitrary invalid values, but can't be dereferenced in safe code. Meaning, the code can pass this pointer value around whether or not it has been validated, without risk, since it will not be unexpectedly dereferenced.
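To make the idea concrete, here is one way the hypothetical probe_for_read of Listing 14 could be fleshed out. The permitted region is passed in explicitly here (which the listing's signature doesn't do), and a real kernel would consult the task's memory map instead; this is a sketch, not Hubris code:

```rust
/// Sketch of a parsing-style pointer check: on success, the untyped
/// pointer/length pair becomes a typed slice. The 'static lifetime is a
/// simplification for this sketch.
fn probe_for_read<T>(
    pointer: *const T,
    length: usize,
    region_base: usize,
    region_size: usize,
) -> Option<&'static [T]> {
    let addr = pointer as usize;
    let bytes = length.checked_mul(core::mem::size_of::<T>())?;
    let end = addr.checked_add(bytes)?;
    let aligned = addr % core::mem::align_of::<T>() == 0;
    let in_range = addr >= region_base && end <= region_base + region_size;
    if aligned && in_range {
        // Safety: only sound because we just checked bounds and
        // alignment against the permitted region.
        Some(unsafe { core::slice::from_raw_parts(pointer, length) })
    } else {
        None
    }
}
```

Downstream code that takes &[T] simply cannot be handed an unchecked pointer; the only way to manufacture the slice is to go through the check.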
In practice, you may want to bond the pointer and length together before validating, to make them easier to pass around and harder to accidentally separate, and this is what we do in Hubris. Here's the signature of the equivalent Hubris operation:
    fn try_read<T>(
        task: &Task,
        slice: USlice<T>,
    ) -> Result<&[T], FaultInfo>;

Listing 15: Hubris equivalent operation.
Similar, except that we pass the current task explicitly instead of getting it from some per-thread context, and we use a generic USlice type to capture an unvalidated pointer-length pair received from user mode.
A USlice provides stronger guarantees than a raw pointer: it's not a pointer, and can't be easily dereferenced even in unsafe code.
So, why don't we see more of this? I think the reasons are complex, but part of it is that we haven't been talking about this in the systems programming community, which is part of why I'm talking about it today.
But part of it is that you can't really achieve this in C, and you can't achieve it robustly in C++, and those have been our only two practical options for a very long time.
I'm not here to rag on C's type system, so I'll leave the reasons why this doesn't work for a future blog post. But implementing a new kernel from scratch in Rust, you find a lot of opportunities like this for encoding your intended integrity and security properties in the types themselves, so that the compiler can assist you in detecting any violations of those properties.
Speaking of compiler assistance, there's a subtle Rust thing in that function signature. Because there's a reference (ampersand) in both the arguments and return type, without further instruction, the compiler connects the two lifetimes. This means that if the try_read call succeeds, the compiler considers the task "borrowed" until we dispose of the slice. Borrowing the task with a shared reference like this prevents most mutation, and in particular, this means it is not possible to change the task's memory map in a way that would invalidate try_read's decision while retaining access to the memory. To do that, you'd need to drop the slice, make the change, and then call try_read again to get a new slice, which of course you wouldn't because you just messed up the memory map. This eliminates a potentially subtle class of bugs, and conveniently happens to be the natural way to express it in Rust.
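The effect can be sketched with an ordinary safe-Rust type standing in for the kernel's Task. The Vec-backed memory and the method bodies here are illustrative only:

```rust
struct Task {
    memory: Vec<u8>, // stand-in for the task's mapped memory
}

impl Task {
    /// The elided lifetimes tie the returned slice to &self, so the task
    /// stays (immutably) borrowed for as long as the slice is alive.
    fn try_read(&self, base: usize, len: usize) -> Option<&[u8]> {
        self.memory.get(base..base + len)
    }

    /// Stand-in for "change the task's memory map"; requires &mut self.
    fn remap_memory(&mut self) {
        self.memory.clear();
    }
}
```

While a slice returned by try_read is still live, calling remap_memory is a compile error, because remap_memory needs &mut self and the task is still immutably borrowed. You must drop the slice first, exactly as described above.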
Application-Debugger Co-Design
At Oxide we're enthusiastic proponents of hardware-software co-design, treating hardware and software together as an integrated product and making tradeoffs across the two. Hubris's design has been influenced by a different form of co-design that I didn't see coming: application-debugger co-design.
    [cbiffle@gwydion]$ humility -d hubris.core.54 manifest
    humility:    version => hubris build archive v1.0.0
    humility:    git rev => a2e01755592189aea0c6cabf36fc5cc9257190b2-dirty
    humility:      board => stm32f4-discovery
    humility:     target => thumbv7em-none-eabihf
    humility:   features => itm, stm32f4
    humility: total size => 70K
    humility: kernel size => 18K
    humility:      tasks => 8
    humility:   ID TASK          SIZE FEATURES
    humility:    0 jefe         10.6K itm
    humility:    1 rcc_driver    4.9K stm32f4
    humility:    2 usart_driver  6.3K stm32f4
    humility:    3 user_leds     5.7K stm32f4
    humility:    4 ping          5.5K uart
    humility:    5 pong          4.8K
    humility:    6 hiffy        14.4K
    humility:    7 idle          0.1K

humility manifest showing the contents of a core dump
We wrote the debugger alongside the kernel.
I want to talk about two aspects of Humility today: how it has changed the operating system, and why more people don't do this.
On the first point: you can get to most of the really deep changes that Humility brought to Hubris by starting at console interfaces. Most embedded projects I've worked on have had a console, usually over serial, sometimes over USB. It gets used during development and test, is usually critical during bringup, and sometimes survives into production. In many cases it's the only way to verify that all the system tasks are running and not being starved of CPU, for instance.
Consoles seem simple, but I would argue that this appearance is deceptive as their feature set grows. A typical human-readable console over a UART requires printf-equivalent formatting code for strings and numbers. If your application needs real numbers for sensor measurements, or if your printf simply complies with the C standard, that means you're also pulling along floating point formatting code. This can burn many kilobytes of Flash space, and we haven't even gotten to input.
Implementing a console in the languages we've traditionally used is also a difficult task, because it's just so easy to get things wrong. Unless you're fuzzing your console interface (and you are, right?), it probably contains buffer overflows, inadvertent acceptance of illegal data, format string vulnerabilities, or potentially even stack smashing.
Our Hubris-based firmware applications don't have console interfaces. They don't contain printf-level data formatting, and they cannot parse command lines. And yet we're not really missing any of the functionality, because we've split the task between the application and the debugger.
We've established a set of interface patterns between the debugger and application, which effectively form a user-extensible kernel-aware debugger interface. I refer to this as the Debug Binary Interface or DBI. Like an ABI explaining how to pass values and format data structures in an application, the DBI defines how to declare variables and types such that the debugger will find them and do… stuff. And we're leaving that "stuff" deliberately ill-defined.
On the kernel end of things, we use this to print task status information:
    [cbiffle@gwydion]$ humility -d ~/Downloads/hubris.core.54 task
    humility: attached to dump
    system time = 166704
    ID TASK          GEN PRI STATE
     0 jefe            0   0 recv, notif: bit0
     1 rcc_driver      0   1 recv
     2 usart_driver    0   2 recv, notif: bit0(irq38)
     3 user_leds       0   2 recv
     4 ping           59   4 FAULT: divide by zero (was: ready)
     5 pong            0   3 recv, notif: bit0(T+296)
     6 hiffy           0   3 notif: bit0(T+203)
     7 idle            0   5 RUNNING

humility tasks showing task status in a core dump
As you can see here, task index 4, called ping, has failed with a divide by zero error, and so it might be useful to pull its current stack trace:
    [cbiffle@gwydion]$ humility tasks -sl ping
    system time = 166704
    ID TASK          GEN PRI STATE
     4 ping           59   4 FAULT: divide by zero (was: ready)
       |
       +--->  0x200025b0 0x0802405e task_ping::divzero
                  @ /home/cbiffle/hubris/task-ping/src/main.rs:28
              0x20002600 0x080240f2 userlib::sys_panic
                  @ /home/cbiffle/hubris/userlib/src/lib.rs:642
              0x20002600 0x080240f2 main
                  @ /home/cbiffle/hubris/task-ping/src/main.rs:39

humility tasks -s showing stack trace of a failed task
Outside the kernel, our supervisor implementation has a debugger interface that lets us request that a particular crashing task be held for inspection, instead of being automatically restarted like it would be in production, by writing a request into a reserved section of its RAM. And for more general debug or bringup work, our firmware images include a debug agent task that can run small interpreted programs delivered directly into its RAM by the debugger. We use this facility, called the Humility Interchange Format or HIF, to request application-specific sequences of operations, for example, to check and enumerate the tree of attached PMBus devices.
Data motion between the target system and the debugger relies on two mechanisms. For target-to-debugger, we decode data structures directly out of target memory using the DWARF debug information from the compiler. For debugger-to-target, for anything more complex than "poke a 1 into this to activate," we use Rust enums encoded with serde into a compact binary format, deposited directly into designated areas of target RAM. This means that, while we do have a parser exposed, it is machine-generated, strict, and difficult to get to, which is an improvement.
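In Hubris the debugger-to-target encoding is produced by serde. As a hand-rolled miniature of the same shape (emphatically not the actual wire format, and with made-up request names), a strict tag-plus-payload codec might look like:

```rust
#[derive(Debug, PartialEq)]
enum DebugRequest {
    /// "Hold this task on fault instead of restarting it." (hypothetical)
    HoldTask(u8),
    /// "Resume normal supervision." (hypothetical)
    Release,
}

fn encode(req: &DebugRequest) -> Vec<u8> {
    match req {
        DebugRequest::HoldTask(id) => vec![1, *id],
        DebugRequest::Release => vec![2],
    }
}

/// Strict decoder: unknown tags or trailing bytes are errors, not
/// something to be skipped over -- the opposite of a forgiving
/// hand-written console parser.
fn decode(bytes: &[u8]) -> Option<DebugRequest> {
    match bytes {
        [1, id] => Some(DebugRequest::HoldTask(*id)),
        [2] => Some(DebugRequest::Release),
        _ => None,
    }
}
```

The point is the shape: the codec is derived mechanically from the enum definition, so the target-side parser has no ad-hoc string handling to get wrong.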
This neatly resolves a tension that I've dealt with my whole embedded career: how much Flash should we waste on things that aren't expected to happen in production? Should the system include a manufacturing self test mode, or the ability to take over devices from drivers at runtime? If so, what happens if these modes do get activated in production? By moving this code out of the firmware, we've answered that question: they won't, unless someone has physical access to the JTAG scan chain and has authenticated with the processor to open the debug interface.
If writing a debugger is so great, why isn't everyone doing it?
Having debug tools that understand your operating system and application has proven invaluable. So, why aren't more people doing this?
I think it's for four main reasons.
First: it's a domain shift. If your job is to write firmware to make the product go, writing a debugger seems like it might require a different set of skills. It might even be in a totally different language.
Second: it's a bunch of work, and it's not immediately obvious how it helps you get to your deadline faster, and that's all a lot of people care about. One of the shared beliefs that unites us at Oxide is that investing in tools early lets us move faster in the long term, but that belief is unfortunately not universal across all companies.
Third: if you wanted to make it less work to write a debugger by reusing existing code, existing debuggers are typically not modular or designed for reuse. You cannot, for instance, easily link OpenOCD's SWD support into a new tool, or borrow GDB's stack trace reconstruction implementation. While you could copy-paste the code out, it would be a significant amount of work to adapt it.
Finally: if you did decide to write your own debugger, the documentation in this area can be truly arcane. DWARF in particular has a reputation for being monstrously complex and hard to follow, a reputation that, in my opinion, is only partially deserved.
The first two points, where writing a debugger is scary to either you or your boss, I can't really help with, except maybe by giving more talks. But on the other two, where debuggers tend not to be reusable and debug information is complex and hard to understand, I can hook you up. We're in the process of refactoring Humility to make its core generic and reusable for any program that wants to parse and understand debug information and the contents of another running program. Of course, we're currently heads down trying to get our first product out the door, so it may take us a few months, but it is coming.
Conclusion
I could not have pulled this off on my own, and I'm fortunate to work with a wonderful group of folks here at Oxide:
CORE TEAM
Laura Abbott      Rick Altherr     Cliff L. Biffle
Bryan Cantrill    Matt Keeter      Steve Klabnik

COMMITTERS
Luqman Aden          Adam Leventhal
Dan Cross            David Pacheco
Nathanael Huffman    Ben Stoltz
Sean Klein

WITH SPECIAL GUEST STAR
OZYMANDIAS, KING OF TRIVIAL CODE REFORMATS
I'd like to thank both the core Hubris developers, and the seven or so folks who have taken time away from other parts of our product to improve things in firmware land.
We've published the repos involved in Hubris development, as well as the draft reference manual, so if you're interested by anything I had to say today, or infuriated by it, I'd encourage you to read more at the URLs on the final slide.
Hubris and Humility are not done by any means, but they've become very useful for our purposes, and maybe you'll find them useful too.
Thanks.
Code: github.com/oxidecomputer/hubris
Docs: hubris.oxide.computer
Oxide: oxide.computer
Clickable versions of those links for the web:
- https://github.com/oxidecomputer/hubris
- https://hubris.oxide.computer
- https://oxide.computer
Epilogue
It seems like a lot of folks appreciated the talk, which is great! I've collected answers to the questions I'm getting most often into a FAQ over in the Hubris repo. Have a look if you're curious.
If you have any questions not covered by the FAQ, or are interested in having me come talk about something at your conference or meetup, please contact me.
The main objection I've been getting is that this feels like Rust evangelism to some folks. I'm not sure I can help them with that; if they can show me another systems language that supports all the features I touch on in this talk, I'd be very excited to learn about it. I talk about Rust a lot in this talk because I haven't found other practical options.
Many of the things we've done in Hubris cannot be done robustly in C/C++, in the sense that they could be approximated, but their integrity would have to be maintained through code review or convention rather than compile-time checks. Believe me, I've tried in previous jobs.
But if someone dislikes Rust and proves me wrong on that, you know what? That would be fantastic. Because my goal is not to make everyone use Rust; it's to improve the robustness of systems software against entirely preventable bugs. Because the current status quo has literally killed people, and we need to do better, however we do it.