From Hubris To Bits
A look at the Hubris build process.
- High level overview
- The initial task build
- Re-linking to measure task sizes
- Allocating space to tasks
- Linking tasks into their allocated areas
- Building the kernel
- Packaging the application
- The future
The embedded platform we’ve built for firmware at Oxide is called Hubris. It’s unusual for a microcontroller operating system, and probably the biggest thing that makes it unusual is its use of separately-compiled tasks.
Most firmware applications mash all of their task and OS code together in a common memory space, which is simple and efficient, but can lead to subtle bugs. Hubris instead places bits of code in their own isolated memory areas, where they can run (and crash!) separately. This requires that each bit be compiled as a standalone program.
The CPUs we target don’t have virtual memory, so each of these separate programs has to be laid out at a known place in the address space. This introduces some challenges, and has prevented us from “just” using an off-the-shelf build system.
This post will walk through the process of building a Hubris application from source, from the perspective of the build system, and examine some of these challenges and how we addressed them.
High level overview
The Hubris “build system” is not so much a build system as a fancy linker.
At its simplest, building a Hubris application image consists of the following steps:
- For each task (applications contain at least one), compile its code into a standalone ELF binary.
- Compile the Hubris kernel into a separate standalone ELF binary.
- Lay those binaries out at non-overlapping addresses in memory.
- Combine them into a single image.
We can’t actually do the steps in that order, however. Three things prevent it:
- Unix-style toolchains (a category that includes LLVM/rustc) like to produce either a relocatable program that cannot yet be run, or a fully prepared standalone binary. The latter cannot be moved to a new location in memory — at least, not easily.
- Because we need to lay out all programs, we need to know their sizes. We can’t know their sizes until we build them. (Early versions of Hubris made you specify RAM and flash requirements for every task; this was super annoying and we fixed it.)
- Because we are an “execute-in-place” (XIP) system, we put code in flash (where it is treated as read-only) and data in RAM (where it can be written). Flash and RAM are often separated by gigabytes of physical address space on modern microcontrollers, so the Unix-style strategy of “assume every program can start at the same address and is entirely in RAM” doesn’t work.
As a result, the actual high level sequence looks like this:
- Build every task individually into a partially-linked, relocatable object file.
- Link each object file separately against the target device’s memory layout, as though it were the only program on the microcontroller. This step is necessary to measure the actual size, because it lets the linker script separate RAM from flash, and applies link-time optimizations (LTO).
- Inspect the results of step 2 to determine the actual RAM and flash requirements of each task.
- Use that information to determine a valid layout of tasks in the address space. This must conform to certain requirements imposed by the memory protection units of the microcontrollers we use.
- Link each object file again, this time against its allocated partition.
- Compile the Hubris kernel to use space left over after task allocation, feeding it information about where the tasks wound up in step 5. (We only wind up having to link the kernel once.)
- Combine the results.
So can’t you just use the linker?
The linker’s job is to take a bunch of separately compiled machine code blobs and place them in memory, resolving references between them; this is the same thing the Hubris build system does. If the linker were powerful enough to do what we need, we wouldn’t need to do all of this.
Unfortunately, linkers generally don’t provide proper scripting interfaces, and are not generally written as libraries that can be driven from another program, so they’re quite restricted in what they can do. Unix-style linkers, which include basically all the open source ones, were designed for machines with virtual memory and (nowadays) dynamically linked libraries, which is a significantly easier problem and not a good fit for our needs.
So the answer is, I would love to just use the linker, but I haven’t figured out how.
We do, of course, use the linker for the individual “link” steps in the build.
The initial task build
For this section, let’s focus on a single task in the application.
A Hubris task is an executable program — a “binary” in programmer slang. To keep things (relatively) simple, each task is a Cargo package containing a bin target. As a result, it’s technically possible to create a new Hubris task by running
cargo new --bin my-task-name-here
…though in practice you’ll need to clean up some defaults to make it actually work.
Currently, Hubris tasks live in the same Cargo workspace as the Hubris kernel, because that worked well for Oxide. I’m working on loosening this requirement, allowing tasks to be built from git repos or crates.io. So for this section, let’s just assume it’s a Rust program built with Cargo that lives somewhere.
Each task can specify arbitrary dependencies through Cargo. Importantly, each task can depend on separate versions of packages if needed: since each task is built separately, there’s no need to resolve their dependencies to common versions.
During the initial build, we want to compile the task and its dependencies into a relocatable object. This is not the default for a bin target, so we have to override some things. (It is the default for a staticlib target, but building programs as staticlib turns out to have some issues that we weren’t able to work around.)
Specifically, we pass one additional flag to the compiler:

-C link-arg=-r

-C link-arg=xxx tells rustc to pass the flag xxx through to the linker. The linker (which in our case is LLVM’s lld) understands -r to mean “link this program, but skip the final pass that resolves relocations.”
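Within the build system (itself a Rust program), that flag might be wired up roughly like the sketch below. This is illustrative, not Hubris’s actual invocation; the crate name and profile are placeholders:

use std::process::Command;

fn build_relocatable(task_crate: &str) -> std::io::Result<std::process::ExitStatus> {
    Command::new("cargo")
        // `cargo rustc` builds one crate and forwards extra flags to rustc.
        .args(["rustc", "-p", task_crate, "--release", "--"])
        // Ask the linker to stop before resolving relocations.
        .args(["-C", "link-arg=-r"])
        .status()
}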
This produces an ELF object file with relocation data intact, which means that every internal reference within the program is annotated with information about what it’s intending to reference. This extra information allows the program to be rearranged in memory, including shuffling the order of its functions or moving variables. Because they are a linker fiction, these relocations must be resolved before the program can be handed to the actual hardware. We’ll do that in a later pass.
At the end of this step:
- We know that the task can be compiled: its dependencies are available, it doesn’t contain any syntax errors, etc.
- We have an ELF object file containing the task’s compiled code and data.
- The ELF object file starts everything at address zero, and mashes things together in no particular order, which means it’s not actually valid to attempt to run the program on hardware yet.
- However, it also contains relocation data that we can use to fix that.
Aside: task configuration
Tasks in Hubris are configurable, and the configuration is controlled by the application’s app.toml file. This means a generic task can be customized for a specific application.
The most basic configuration is through Cargo features. However, Cargo features are frustratingly limited, so we also provide a more generic mechanism.
Each task in the app.toml can have a config section containing arbitrary TOML data. This gets passed into the task build via, believe it or not, an environment variable. To actually use the configuration, code in the task (or a dependency of the task) must use a build.rs script to collect the environment variable contents and inject them into the build somehow, typically through code generation.
Cargo allows build.rs scripts to emit a special cargo::rerun-if-env-changed directive that marks the build as dependent on the contents of an environment variable, so we get rebuild-on-change for free using this method.
This might sound a little gross, but it’s the most reliable way of passing configuration data behind Cargo’s back, if you need something more powerful than its “purely additive boolean features” model.
(To ensure that builds don’t depend on hidden state, which would make them difficult to reproduce, we always set the environment variable during a build, even if it gets set to empty.)
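A minimal sketch of the task side, assuming the variable is named HUBRIS_TASK_CONFIG (check the Hubris repo for the real name):

use std::{env, fs, path::PathBuf};

fn main() {
    // Rebuild whenever the smuggled configuration changes.
    println!("cargo::rerun-if-env-changed=HUBRIS_TASK_CONFIG");

    let config = env::var("HUBRIS_TASK_CONFIG").unwrap_or_default();

    // A real script would parse the TOML and generate typed constants;
    // here we just drop the raw text into OUT_DIR for include!().
    let out = PathBuf::from(env::var("OUT_DIR").unwrap()).join("config.rs");
    fs::write(out, format!("pub const RAW_CONFIG: &str = {:?};", config)).unwrap();
}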
Aside: task slots
“Task slots” are similar to task configuration, but more specialized. They provide a way for tasks to reference each other, usually for IPC interaction.
A task slot gets created in a task’s executable using the task_slot! Rust macro. This emits a placeholder value into the task’s data section, and some supplementary information into a custom task_slots section.
The task referenced by the slot is chosen in the application’s app.toml file, not in the source code. This makes it easier to write reusable IPC clients that get retargeted to different servers in different applications.
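Concretely, a slot declaration in task code looks roughly like this; treat the exact API as a sketch rather than the macro’s current definition:

use userlib::*;

// Declares a slot named "uart_driver"; app.toml decides which task
// actually fills it.
task_slot!(UART, uart_driver);

fn send_byte(byte: u8) {
    // By runtime, the placeholder behind UART has been rewritten to the
    // task ID chosen in app.toml, so this names a real task.
    let uart = UART.get_task_id();
    // ... IPC to `uart` using `byte` goes here ...
    let _ = (uart, byte);
}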
During the initial build, the build system identifies all task slots using the information in the custom section, and rewrites the fields in the data section with the task IDs of the desired tasks.
This is very similar to the operation performed by a relocating linker, and I would like to replace this with relocations at some point in the future.
Re-linking to measure task sizes
Since Hubris statically lays out all tasks in memory, we need to know their exact sizes during the build. At this step in the process we have enough information to figure it out, though the method may not be obvious.
You see, the size of the machine code generated for a program can depend on how exactly it gets laid out in memory. We currently target ARM CPUs, but essentially all CPUs share this property: the number of bytes required for a branch, call, or data reference generally depends on how far away it’s reaching. There tend to be a set of compact “short” instructions for reaching things that are nearby in the address space, and one or more tiers of “longer” instructions with greater reach.
A conservative compiler could always use the longest-reach instructions, and in fact this is what most compilers do in their initial output. It’s generally left to the linker to notice that shorter instructions could be used, and replace the original instructions to save space. This process is called linker relaxation, because the original program was generated with the most aggressive constraints possible, and the linker is using additional information to relax those constraints and generate more efficient machine code.
(Linker relaxation is the simplest of a family of algorithms called link-time optimizations, or LTO, which introduce even more reasons why a program’s size might change when it’s linked. LLVM’s lld provides us with powerful LTO behind the scenes, and we’re grateful for it.)
Because we don’t know exactly where our task will live in memory, we can’t yet do this process exactly, but we can approximate it.
Our target microcontrollers will generally have a block of flash at one place in the address space, and a block of RAM somewhere distant. For instance, on the STM32 series, flash is typically at address 0x0800_0000, and RAM is typically around 0x2000_0000. Those addresses are separated by 384 MiB of space, so references from flash to RAM typically can’t use the most compact versions of instructions, but references within flash often can. (A virtually addressed machine would hide the gap by mapping pages of flash and RAM next to each other; since we target physically addressed machines, we don’t have this power.)
So, to measure the size of our task, we will link it in isolation.
We grant the task the entire block of flash, and the entire block of RAM, as if it is the only program running on the microcontroller. We do this by generating a custom linker script describing the desired memory layout, and then handing both that script and our relocatable ELF object (from the previous step) to the linker to process.
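Sketching the idea in code: the build system might emit a measurement-pass linker script like this, with sizes taken from the chip definition (the lengths below are placeholders):

use std::io::Write;

fn write_measurement_script(mut out: impl Write) -> std::io::Result<()> {
    // Grant the task the whole device, as if it were alone on the chip.
    writeln!(out, "MEMORY")?;
    writeln!(out, "{{")?;
    writeln!(out, "  FLASH (rx) : ORIGIN = 0x08000000, LENGTH = 1024K")?;
    writeln!(out, "  RAM (rwx)  : ORIGIN = 0x20000000, LENGTH = 128K")?;
    writeln!(out, "}}")?;
    Ok(())
}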
The linker spits out a fully linked, non-relocatable ELF executable containing a post-LTO version of our task, which we then throw away.
Well, I exaggerate a bit. We don’t throw it away immediately. First, we process it to measure the results, by parsing the ELF program headers and determining the exact extent of flash and RAM areas. (I want to give a shout-out to the excellent goblin crate, which makes parsing and analyzing ELF headers super easy.)
So, for example, we might determine at this point that our task needs 15,000 bytes of flash, and 2,000 bytes of RAM.
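That measurement is simple to sketch with goblin, using the STM32-style address ranges from earlier to stand in for the real memory map:

use goblin::elf::{program_header::PT_LOAD, Elf};

fn measure(elf_bytes: &[u8]) -> Result<(u64, u64), goblin::error::Error> {
    let elf = Elf::parse(elf_bytes)?;
    let (mut flash, mut ram) = (0u64, 0u64);
    for ph in &elf.program_headers {
        if ph.p_type != PT_LOAD {
            continue;
        }
        // Bucket each loadable segment by the block its address falls in.
        // (A real pass would also count initialized data's flash copy,
        // via p_paddr.)
        match ph.p_vaddr {
            0x0800_0000..=0x080F_FFFF => flash += ph.p_memsz,
            0x2000_0000..=0x2001_FFFF => ram += ph.p_memsz,
            _ => (),
        }
    }
    Ok((flash, ram))
}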
Aside: the stack
The RAM measured in this step includes only static variables — or in linker jargon, “data and BSS.” It does not include memory used by the stack.
Because we’re a physically addressed system with static memory allocation, we can’t use the same hack as Unix systems, which magically map in more pages of stack as the original stack is exceeded. We need to know the size of the stack in advance.
Measuring the stack requirement of an arbitrary program using static analysis is a hard problem. Factors like recursion and dynamic dispatch make it unsolvable in the general case, and while it can be solved for limited classes of programs, those classes of programs are often not very useful. For example, while we don’t use recursion very often, you can easily wind up with program forms that look like recursion in the code generated by derive(Debug).
So, currently, we require tasks to state their stack requirement explicitly.
My colleague Matt Keeter and I have been experimenting with very basic stack size checks in the build system, and Matt has successfully implemented a conservative analysis that can report an error if a program definitely does not have enough stack — but because of the issues I described above, it’s not an exact analysis, and cannot catch all possible runtime stack overflows.
Allocating space to tasks
After the last step, we have a collection of fully linked task binaries, all linked to the same overlapping position in the address space. We’re going to toss those aside and just use the information we derived from them: how much of each kind of memory is required for each task.
Now, we need to divide the available memory of each type (RAM and flash) between tasks, according to their needs.
Because Hubris isolates tasks using the memory protection features of the CPU, the way we divide space between tasks depends on the type of CPU, so we need to understand how Hubris uses memory protection.
Memory protection layout requirements
On ARMv6/7-M parts (e.g. Cortex-M0 through Cortex-M7) and RISC-V parts using the optional Physical Memory Protection (PMP) scheme, protection regions must be a power of two in size, and must be naturally aligned. That is, you can define a region spanning exactly 1024 bytes at an address that’s an integer multiple of 1024, but not a region spanning 1025 bytes, or 1536 bytes. The hardware gives us a handful of these regions, typically eight.
On ARMv8-M parts (e.g. Cortex-M33), protection regions can span any number of 32-byte chunks, and must be aligned to a 32-byte boundary. We still only get a handful of regions to play with, but the more flexible sizing and alignment makes this a lot easier to deal with.
(Hypothetically, if we were to port Hubris to processors designed for virtual memory, “protection regions” would be groups of pages, typically 4096 bytes in size.)
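As a worked example of those rules, here is how the two sizing disciplines might be computed (a sketch, not Hubris’s actual code):

fn armv7m_region_size(bytes: u32) -> u32 {
    // Power of two, with a hardware minimum of 32 bytes; the region
    // must also be aligned to this size.
    bytes.next_power_of_two().max(32)
}

fn armv8m_region_size(bytes: u32) -> u32 {
    // Any number of 32-byte chunks, aligned to 32 bytes.
    (bytes + 31) & !31
}

fn main() {
    // 33,000 bytes of code: 65,536 under v7-M rules, 33,024 under v8-M.
    assert_eq!(armv7m_region_size(33_000), 65_536);
    assert_eq!(armv8m_region_size(33_000), 33_024);
}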
Restrictions like this tend to cause both internal fragmentation and external fragmentation, which waste memory:
- Internal fragmentation, because a task with, say, 33,000 bytes of code may need to be allocated 65,536 bytes (64 KiB) of space due to hardware restrictions — wasting 32,536 bytes!
- External fragmentation, because that 64 KiB space may need to be aligned to a 64 KiB boundary, which may require adding quite a bit of padding before and/or after.
Reducing external fragmentation is easiest, but we also have some tricks to reduce internal fragmentation. I’ll go over those in a bit.
Getting the most out of limited regions
Hubris swaps out the full set of hardware protection regions on each context switch, so each task can use the full set for its own purposes. The simplest way to use them would be to assign one region for RAM and stack (marked read/write) and one region for code and data in flash (marked read-only plus execute). But the hardware typically allows for eight or more regions, and this leaves six of them sitting idle, which is wasteful.
We also use regions for drivers. In most operating systems, drivers are code linked into the kernel. In a Hubris-based application, drivers are just tasks like any other, but tasks that are allowed direct access to some subset of memory-mapped peripherals in the system, such as I2C controllers or Ethernet MACs. Generally, each such peripheral consumes one more protection region in a driver task. (Sometimes peripherals happen to be adjacent and can be merged, but this is rare.)
Finally, we also use one region for null pointer defense. While null pointer dereferences are very unlikely in the Hubris kernel, we still want to catch them — but most microcontrollers map accessible memory near physical address zero, so accesses near zero will succeed by default. We reserve one region to cover the range of addresses around address zero, and mark them as inaccessible even to privileged (kernel) code. We keep this region fixed across context switches.
Reducing internal fragmentation
So far a common task will have around four regions used:
- RAM and stack.
- Flash.
- Some memory-mapped peripheral.
- The fixed “null” region around zero.
…leaving four regions idle.
Last year Matt Keeter added “region splitting” to the Hubris task layout algorithm, which uses these idle extra regions to “tile” a task’s RAM and flash areas and attempt to reduce internal fragmentation. For instance,
- Without splitting, a 33,000 byte task may require 64 KiB of space.
- If one region is free, however, we may be able to divide that into a 32 KiB region + a 256 byte region, which wastes only 24 bytes.
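The decomposition at the heart of this is straightforward to sketch (the real algorithm also has to keep each piece naturally aligned, which constrains where the pieces can land):

fn split_regions(mut need: u32, max_regions: usize) -> Vec<u32> {
    let mut pieces = Vec::new();
    // Peel off the largest power of two that fits, while regions remain.
    while need > 0 && pieces.len() + 1 < max_regions {
        let piece = (1u32 << (31 - need.leading_zeros())).max(32);
        pieces.push(piece);
        need = need.saturating_sub(piece);
    }
    // Round whatever is left up to one final power-of-two region.
    if need > 0 {
        pieces.push(need.next_power_of_two().max(32));
    }
    pieces
}

fn main() {
    // 33,000 bytes with two regions available: 32 KiB + 256 bytes,
    // wasting 24 bytes instead of 32,536.
    assert_eq!(split_regions(33_000, 2), vec![32_768, 256]);
}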
Reducing external fragmentation
We can reduce external fragmentation — padding between task regions due to hardware alignment requirements — by shuffling the order of allocations to fill gaps.
The interaction between this and region splitting is subtle, and currently the Hubris allocator uses a brute-force quadratic algorithm. Because the search space is small, the brute force is fast enough that the choice of algorithm doesn’t really matter.
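The core placement step looks something like the sketch below: bump a cursor, aligning each region to its natural boundary. The shuffling amounts to trying this with different orderings and keeping the most compact result.

fn align_up(addr: u32, align: u32) -> u32 {
    (addr + align - 1) & !(align - 1)
}

/// Places power-of-two-sized regions in the given order, returning
/// (start, size) pairs. A sketch of one candidate layout.
fn place(sizes: &[u32], base: u32) -> Vec<(u32, u32)> {
    let mut cursor = base;
    sizes
        .iter()
        .map(|&size| {
            // Power-of-two regions must be aligned to their own size.
            let start = align_up(cursor, size);
            cursor = start + size;
            (start, size)
        })
        .collect()
}

fn main() {
    // A 256-byte region placed after a 64 KiB one wastes nothing here;
    // the reverse order would insert 65,280 bytes of padding.
    let layout = place(&[65_536, 256], 0x0800_0000);
    assert_eq!(layout, vec![(0x0800_0000, 65_536), (0x0801_0000, 256)]);
}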
Linking tasks into their allocated areas
The algorithm in the previous step assigned each task’s RAM and flash areas to some number of regions of physical address space, subject to the requirements imposed by the hardware.
In this step, we grab the relocatable, partially-linked object files we produced in step 1. Remember those? We’re going to link them again, but this time we’ll keep the results.
For each task, we generate a linker script that directs the linker to place RAM and flash sections at the appropriate locations in the address space. We then pass this to the linker, along with the original relocatable object file. The linker produces a fully-linked ELF executable, which does not overlap any of the other task executables.
We stash these executables for packaging in a later step.
Building the kernel
Currently, the kernel gets built into fixed areas of RAM and flash, the sizes of which are chosen by the application in its app.toml. Because the kernel’s size changes very rarely, this hasn’t been as annoying as it might sound.
Since the kernel’s RAM and flash sizes are effectively fixed, we don’t have to do a complex multiple-link dance with the kernel, unlike with the tasks. However, there’s some other nuance to building the kernel, which I’ll describe here.
First: the Hubris kernel is not an executable. It’s a library. Applications are expected to provide a main.rs file containing an executable wrapper for the kernel. This can be as simple as a single line calling kern::start_kernel (see the sketch after this list), but in a real embedded application, it will usually also contain code doing some combination of…
- Configuring clock trees, perhaps with board-specific oscillator information.
- Arranging early power sequencing or configuring low-power modes.
- Checking hardware board revision pins to avoid booting firmware for system A on system B by accident.
- Providing custom “fast interrupt” handlers that respond to events faster than the kernel’s default path. (This is quite rare, but super handy.)
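A minimal sketch of such a wrapper is below. The start_kernel argument and the CYCLES_PER_MS constant are assumptions for illustration; the real signature lives in the kern crate:

#![no_std]
#![no_main]

// Hypothetical board constant: CPU clock cycles per millisecond.
const CYCLES_PER_MS: u32 = 96_000;

#[no_mangle]
pub extern "C" fn main() -> ! {
    // Board-specific setup (clock tree, power sequencing, revision-pin
    // checks, custom fast ISRs) would happen here, before the kernel
    // takes over.
    unsafe { kern::start_kernel(CYCLES_PER_MS) }
}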
So when we “build the kernel,” what we’re actually doing is building this application-provided executable (a Cargo bin target) that happens to depend on the kernel.
Second: the kernel gets statically customized for the application. In addition to setting Cargo features in the app.toml, the kernel build also has access to all the information in the app.toml, plus the knowledge we derived in the previous steps:
- The number of tasks in the application.
- Their priorities.
- The subset of tasks that should be started at boot vs. started manually later.
- The routing of hardware interrupts to task notifications.
- Which tasks should have access to which memory-mapped peripherals.
- The physical address and permissions of each task’s RAM, flash, and other regions.
- The address of the entry point (the main function) in each task.
- Each task’s initial stack pointer value.
Reusing the approach we used for task configuration, we smuggle this information behind Cargo’s back using an environment variable, HUBRIS_KCONFIG. This environment variable carries a detailed data structure encoded in RON format; if you’re curious about the details, see build/kconfig in the Hubris repo.
The kernel’s build.rs script processes this and generates a bunch of static and const items that get included with the kernel’s Rust code. This means that, from the kernel’s perspective, all these properties of tasks are compile-time constants, and can be used for optimizations. For instance, this lets us eliminate a lot more bounds checks around the task table, since we know its precise length at compile time.
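The generated items might take a shape like this (every name and address below is invented for the sketch). Because the task table is a fixed-length array, any index already checked against TASK_COUNT needs no runtime bounds check:

// Hypothetical output of the kernel's build.rs code generation.
pub struct TaskDesc {
    pub entry_point: u32,
    pub initial_stack: u32,
    pub priority: u8,
    pub start_at_boot: bool,
}

pub const TASK_COUNT: usize = 2;

pub static TASKS: [TaskDesc; TASK_COUNT] = [
    TaskDesc { entry_point: 0x0800_4000, initial_stack: 0x2000_1000, priority: 1, start_at_boot: true },
    TaskDesc { entry_point: 0x0801_0000, initial_stack: 0x2000_2000, priority: 2, start_at_boot: false },
];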
This also means there’s basically no “startup” phase required when the kernel boots. There is no process of loading a task table, no information to parse, and (importantly) nothing that can fail to parse. At boot, the kernel gets its own affairs in order, and then scans a table that indicates which tasks should get started immediately; it initializes those tasks, whose memory locations it knows statically, and then jumps into the scheduler.
We’ve revised this approach a few times during Hubris’s development, and this version provides significant kernel size reductions, in addition to eliminating some rare-but-technically-possible error paths.
The vector table
The kernel contains the hardware vector table.
The vector table is read-only and typically placed in flash. Because we know the interrupt routing at compile time, we statically generate it to ensure that reset sends control to the kernel’s entry point, and all other interrupts send control to a kernel-provided default interrupt service routine (ISR), which routes to tasks.
It’s possible for an application to override this default ISR in its main.rs file, by providing alternate hardware-specific interrupt service routines. We use weak symbols to achieve this: by default, all vectors are routed to the kernel’s default routine, but any override in the application replaces this at link time.
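The override side might look roughly like this: the application defines a strong symbol with a device-specific vector name (the name here is invented), and the linker prefers it over the kernel’s weak default:

// Hand-tuned fast ISR in the application's main.rs; the symbol name
// must match the device's vector-table entry.
#[no_mangle]
pub unsafe extern "C" fn DMA2_STREAM0_IRQHandler() {
    // Acknowledge and service the interrupt directly, skipping the
    // kernel's generic ISR-to-notification path.
}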
Packaging the application
Hubris applications are distributed as bundles, which are merely ZIP files that contain a conventional file structure. Among other things, the bundle contains
- All of the fully-linked task ELF files we produced a couple of steps ago, and
- The kernel ELF file.
For some purposes it’s useful to merge the files together into something other than a ZIP file. For instance, we have a tool that takes the collection of (non-overlapping) ELF files and merges them into an IHEX or SREC file for flashing. In practice, however, we usually use the bundle directly and flash through Humility.
The future
This approach is the result of a few years of iteration, and I think it’s pretty good for the time being. The main thing I’d like to change is how it’s implemented.
Hubris’s build system is currently a monolithic program that lives in the same repo as the kernel… and all of Oxide’s production firmware. This isn’t ideal, because it makes it hard for anyone else to use Hubris. So, I’m hoping to fix it.