Revisiting Hubris appconfigs

First in a series on exhubris.

2024-11-25

So in my day-job over at Oxide we’ve built this nice embedded operating system called Hubris. If you follow my blog, you’re probably aware of it.

I also build a lot of embedded electronics outside my day-job, and people sometimes ask me (often excitedly!) whether those projects use Hubris.

The answer so far is “no.” This is for a variety of reasons, but probably the biggest: it’s actually quite difficult to use Hubris for anything if you don’t want your code to live in the Oxide Hubris repo!

I would like to fix this, to enable other teams to use Hubris without having to coordinate with Oxide (or even publish their source code!). I’m starting by trying to address the needs of a single friendly customer: me.

As of this week I have it working, in a set of tools I call exhubris. It’s not by any means done (or all that pleasant to use). I’m going to write some posts about it, to help me think through the design process, and (more importantly!) to solicit feedback from my readers on where they think things should go.

This first post starts with the part of Hubris most users encounter first: the application configuration file, or appconfig.

Hubris refresher and links

(If you’ve been following Hubris closely, this section will be a bit of a review. If you’re new here, welcome!)

The details of the system have changed a lot over the past four years, but the basic design is still what I described in my announcement talk. A Hubris application is a firmware image intended to be flashed onto some microcontroller; it is made up of the Hubris kernel and a collection of (application-chosen) tasks. The kernel and each task are all compiled separately, and isolated from one another using hardware memory protection. As a result, while we benefit from Rust’s memory safety, we don’t rely on it for system correctness.

Drivers are just tasks in Hubris, but tasks that are granted access to one or more memory-mapped peripherals, and that receive interrupts as messages from the kernel.

Each task can crash independently; thanks to memory isolation, a crash in one task doesn’t damage others. A crashed task will generally be restarted immediately¹, but that policy is up to an application-chosen task called the supervisor. Oxide’s supervisor implementation does some additional stuff, like recording a coredump of each crashed task, but beyond that, restarting a task is very quick. We’ve found this to be a really powerful tool for ensuring system robustness. It has resulted in some comical situations where a fundamental driver is crashing on an Oxide server over five thousand times per second but the system is still working fine.

¹ We don’t generally use strategies like exponential backoff in Hubris, because Hubris applications are intended to run without human intervention, and getting all the details of backoff right is hard. For instance, you probably want to cap the backoff at some interval, but how do you choose it? What do you do if it turns out to be wrong? No operator will be sitting at a console to restart the backoff.

You define an application by writing an appconfig file. This is intended to specify everything that goes into the image, to ensure that you can reproduce the build later. We process the appconfig using a set of image building tools, and call out to Cargo to build each piece of the image before stitching them together. (I have a rather long blog post on the process of building a Hubris image if you’re curious.)

Those are the pieces that are most relevant for this and future posts; if you’d like to know a lot more, the Hubris reference manual is pretty detailed and intended to be quite accessible.

A brief history of appconfigs

A Hubris application needs some way to specify all the bits that go into it. We added support for this in early 2020, shortly after the initial draft of the kernel was working. At the time, we chose to use TOML.

We have now been writing and maintaining appconfigs for four years. Currently, we maintain 56 of them, describing firmware applications from tiny 8-pin microcontrollers without enough flash to store the text of this webpage, up through our rather beastly Service Processor in the Oxide servers — a 400 MHz CPU with 2 MiB of flash.

The format has evolved over time as we’ve needed to express more complex ideas. It’s serving us fairly well, but it was not designed per se. It was incrementally grown with features added as we needed them. This means parts of it are creaky and not terribly consistent, but I’ll get to that in the next sections!

Appconfigs today

In theory, you create a Hubris application by opening an editor and writing an appconfig, which is currently a TOML file. (In practice, today, you also need to check it into the Hubris repo, but let’s ignore that for the moment.)

The appconfig has several sections that specify different parts of the build. Let’s walk through them, using a simple application as an example. This is a real production appconfig for donglet², a jack-of-all-trades interface board we use in test automation at Oxide, based on the STM32G031 microcontroller.

² I did not name this board.

Currently, the appconfig expects to live inside a Cargo workspace. The workspace contains a rust-toolchain.toml file that pins an exact toolchain revision via rustup, and a Cargo.lock file that pins the hashes of all dependencies. This information is critical to being able to reproduce the build results, but since Cargo/rustup have it covered, you won’t see this information in the file below.

First, some top level keys give your firmware a name and indicate compatibility with a specific target, board, and chip. (If you have noticed that this information is redundant, you’re going to like exhubris.)

name = "donglet-g031"
target = "thumbv6m-none-eabi"
chip = "../../chips/stm32g0"
memory = "memory-g031x8.toml"
board = "donglet-g031"

The next section specifies how to build the kernel, which in practice means designating a Cargo bin crate that depends on the kernel. The Hubris kernel is a library, and applications provide a main.rs that calls it, giving them an opportunity to e.g. set up the clock tree and check revision pins. Here the kernel is built by the crate named app-donglet in the Cargo workspace.

[kernel]
name = "app-donglet"
requires = {flash = 19168, ram = 1820}
features = ["g031"]
stacksize = 936

This section needs to assign specific amounts of RAM and flash to the kernel, and to indicate how much RAM to use for the kernel stack. (If this looks annoying, keep reading, I’ll come back to it.)

A third section defines all the tasks in the application, by pointing to the Cargo bin crates that define them. The tasks section also specifies some resource assignments, and can provide config to each task to customize its build. (Think Cargo features, but way more powerful.) The donglet image includes seven tasks, but I’ll skip most of them here. These three demonstrate most of the bells and whistles:

[tasks.jefe]
name = "task-jefe"
priority = 0
start = true
stacksize = 368
notifications = ["fault", "timer"]

[tasks.sys]
name = "drv-stm32xx-sys"
priority = 1
uses = ["rcc", "gpio", "system_flash"]
start = true
features = ["g031", "no-ipc-counters"]
stacksize = 256
task-slots = ["jefe"]

[tasks.i2c_driver]
name = "drv-stm32xx-i2c-server"
features = ["g031", "no-ipc-counters"]
priority = 2
uses = ["i2c1"]
start = true
task-slots = ["sys"]
stacksize = 896
notifications = ["i2c1-irq"]

[tasks.i2c_driver.interrupts]
"i2c1.irq" = "i2c1-irq"

Each task chooses a crate from the workspace (somewhat confusingly called name), and is assigned resources: a priority for scheduling, a stack size, and in the case of drivers a set of memory mapped peripherals and interrupts. (Interrupts are routed by the kernel to notifications, which can also be used to implement “software interrupts” — in this case, our supervisor jefe does this with its fault notification, which is how the kernel informs it of crashes in other tasks.)
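
To make the notification mechanism concrete, here is a minimal sketch of how a build tool could turn a task’s notifications list into bit masks. It assumes bits are handed out in declaration order, and the function name notification_mask is hypothetical rather than something taken from the Hubris sources.

```rust
/// Returns the bit mask for `name`, assuming notification bits are
/// assigned in declaration order (an assumption made for this sketch).
fn notification_mask(declared: &[&str], name: &str) -> Option<u32> {
    declared
        .iter()
        .position(|n| *n == name)
        // A task has only 32 notification bits to hand out.
        .filter(|&bit| bit < 32)
        .map(|bit| 1u32 << bit)
}
```

Under that assumption, the jefe stanza above would give "fault" the mask 0b01 and "timer" the mask 0b10.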

There are two important keys to note here, uses and task-slots. A task’s uses list names a series of memory-mapped peripherals, defined in the configuration for the chip. The build system sets up the task’s memory protection config so that these peripherals are directly accessible, and others are not. The sys task here uses three things, rcc (for doing clock tree setup on STM32), gpio (for messing with pins), and system_flash (we use this to get the unique die ID for the chip).
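
The effect of a uses list can be sketched as a simple address check. This is only an illustration of the access rule, not how the kernel or the MPU implements it; the Peripheral struct and task_can_access function are hypothetical names.

```rust
/// A memory-mapped peripheral as a chip definition might describe it.
/// (Hypothetical type for this sketch.)
struct Peripheral {
    name: &'static str,
    base: u32,
    size: u32,
}

/// Returns true if `addr` falls inside one of the peripherals that the
/// task named in its `uses` list.
fn task_can_access(chip: &[Peripheral], uses: &[&str], addr: u32) -> bool {
    chip.iter()
        .filter(|p| uses.contains(&p.name))
        .any(|p| addr >= p.base && addr - p.base < p.size)
}
```

A task that lists only rcc would pass this check for addresses inside the RCC block and fail it for anything in, say, the I2C block.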

task-slots is similar, but the things being named are other tasks instead of peripherals. A task that contacts another task via IPC is expected to name the target task in its task-slots list; the build system then ensures that it can generate a TaskId for that task at compile time. This allows task code to be generic over which server(s) it interacts with.
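
As a rough sketch of the lookup this implies, the function below resolves slot names to task indices. It assumes a task’s index is its position in the application’s task list, and it leaves out the generation number that a real TaskId also carries; resolve_task_slots is a hypothetical name.

```rust
/// Resolves each entry in a task's `task-slots` list to the index of the
/// named task, or reports the first name that matches no task.
fn resolve_task_slots<'a>(
    all_tasks: &[&str],
    slots: &[&'a str],
) -> Result<Vec<usize>, &'a str> {
    slots
        .iter()
        .map(|slot| all_tasks.iter().position(|t| t == slot).ok_or(*slot))
        .collect()
}
```

A misspelled slot name turns into a build-time error here instead of a message sent to the wrong task at runtime.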

Moving on from tasks: the last top level section provides config that can be shared and referenced by all tasks. This tends to be the longest in real-world applications, believe it or not, because it winds up looking a lot like a DeviceTree…

[config]
[[config.i2c.controllers]]
controller = 1

[config.i2c.controllers.ports.B]
scl.pin = 6
sda.pin = 9
af = 6

[[config.i2c.controllers.ports.B.muxes]]
driver = "pca9548"
address = 0x73

[[config.i2c.devices]]
controller = 1
mux = 1
segment = 1 
address = 0b1010_000
device = "at24csw080"
description = "Sharkfin VPD"
removable = true

[[config.i2c.devices]]
controller = 1
mux = 1
segment = 2
address = 0b1010_000
device = "at24csw080"
description = "Gimlet Fan VPD"
removable = true

In this case, the i2c_driver task uses this information at compile time to configure its use of pins and the set of devices it expects. But because this is global information, other tasks can also refer to it for the same purpose. We have a testing-related task called validate that uses this to scan for attached devices and test that they respond correctly, for instance.

(Tasks can also have private config information, which is used much less often so I’ve skipped it here. It’s basically Cargo features but much, much more powerful. This will come up later.)

Appconfigs, reviewed

I’ve been using appconfigs for almost four years now, and I have opinions.

I think the basic idea is great. We need some way to specify a bunch of executable programs to build, and Cargo isn’t great at that — plus, we need configuration that’s a lot more flexible than what Cargo offers. So some sort of input file that drives a tool, which in turn drives Cargo, seems reasonable.

The problem is, well, everything else.

TOML doesn’t scale

I chose TOML because we needed a format. Anyone who writes Rust has at least encountered TOML, since Cargo uses it heavily. TOML was, in hindsight, the wrong choice. It scales poorly to complex trees of data, and it turns out, appconfigs wind up being complex trees of data! The problems are already apparent in the file I excerpted above, and it’s one of our simplest:

We move between zero and four levels of nesting with almost no visual clue that anything has happened, because TOML doesn’t believe in indentation.
I2C devices are all tagged with the “I’m an array element” syntax, [[config.i2c.devices]]. If you needed to know which array element, you’re going to be doing some squinting. But in practice there’s basically no way to define a complex map in an array without using this syntax, because…
TOML has weird opinions about map and array literals. Arrays are permitted to be wrapped across lines, but maps…aren’t? You’re supposed to totally change syntax and write a table instead. This makes keeping the file readable as strings change in length a bit of a chore. This is less of a problem in simple documents, but starts to crop up in even moderately complex Cargo.toml files as people add/remove features from dependencies.

Plus, TOML assumes a sort of “least common denominator” data model, with no concept of enumerated types, tuples, enums with fields, etc. This means there are data structures we can describe simply and elegantly in Rust that we can’t easily express in TOML. (They can be expressed, and serde is very good at this, but you wouldn’t want to write them by hand!)
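
As an illustration of the kind of type meant here, consider a pin configuration where each variant carries different data. The PinMode type and its variants are invented for this example and do not come from the Hubris codebase.

```rust
/// How a pin should be configured. Each variant carries different data,
/// which is natural in Rust but has no direct spelling in TOML.
#[derive(Debug, Clone, Copy, PartialEq)]
enum PinMode {
    Input,
    Output { initial_high: bool },
    Alternate(u8),
}

/// Returns the alternate-function number, if this mode has one.
fn alternate_function(mode: PinMode) -> Option<u8> {
    match mode {
        PinMode::Alternate(af) => Some(af),
        PinMode::Input | PinMode::Output { .. } => None,
    }
}
```

Writing a value of this type by hand in TOML means inventing some encoding for "which variant is this", and every reader of the file has to learn that encoding.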

Tasks have to specify too much

Each stanza in tasks declares things like the set of notifications it exposes, how much stack it needs, what task slots it exposes, etc.

Every time it’s used.

In every application.

Many of our tasks are generic and reusable. jefe, for instance, appears in every Oxide firmware image. sys and i2c_driver are also nearly ubiquitous in our STM32-based images. In every case, we have to repeat all this information.

This is silly. It would be better to have a way for the task to centrally declare the parts of this that don’t change — which is basically everything except the stack size — and then have the appconfig just fill in the rest of the template. This would also let us do better checking for e.g. an appconfig that fails to wire up a required notification.
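
A minimal sketch of that checking, assuming a task could ship a template listing the notifications it requires. TaskTemplate and missing_notifications are hypothetical names; nothing like this exists in the TOML tooling described above.

```rust
/// Properties a task could declare once, alongside its source code.
/// (Hypothetical type for this sketch.)
struct TaskTemplate {
    required_notifications: &'static [&'static str],
}

/// Returns the required notifications that an appconfig stanza failed to
/// declare, in the order the template lists them.
fn missing_notifications(
    template: &TaskTemplate,
    declared: &[&str],
) -> Vec<&'static str> {
    template
        .required_notifications
        .iter()
        .copied()
        .filter(|n| !declared.contains(n))
        .collect()
}
```

An appconfig that wires up jefe with only "fault" would be flagged as missing "timer", rather than failing in some less obvious way later.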

And yet tasks don’t specify enough

There’s a bunch more information that I’d love to have available. For instance, a task has a task slot bob to talk to some server… what IPC protocol does it expect bob to implement? If we knew that, we could generate simpler code, and detect cases where you’ve miswired the application.

Similarly, tasks don’t provide any hints about what information they expect in their config, or what parts of the global config they rely upon. It would be nice to have a schema, so we could give feedback about mistakes. (If not a schema, just having a list of expected top level keys would be a great start!)

As another example: having task-slots as a first-class concept allows our tools to analyze task IPC relationships, and detect things like priority inversion at build time. This is great! But there’s a lot more information in a typical config that is not first-class, and is not easy to analyze from tools. For instance, some complex configurations wind up including dictionaries of task names. The fact that those are task names is implicit.
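
The priority check mentioned above can be sketched in a few lines once task-slots are first-class data. This assumes the rule is that a client may only call servers with a strictly more important priority (a lower number); the Task struct and priority_inversions function are hypothetical names.

```rust
/// The parts of a task stanza this check needs.
struct Task {
    name: &'static str,
    priority: u8,
    task_slots: &'static [&'static str],
}

/// Finds (client, server) pairs where the client is at the same or a more
/// important priority than the server it calls. Lower numbers are more
/// important, so a well-formed edge has client.priority > server.priority.
fn priority_inversions(tasks: &[Task]) -> Vec<(&'static str, &'static str)> {
    let mut bad = Vec::new();
    for client in tasks {
        for slot in client.task_slots {
            // Unknown slot names are ignored here; a real tool would
            // report them separately.
            if let Some(server) = tasks.iter().find(|t| t.name == *slot) {
                if client.priority <= server.priority {
                    bad.push((client.name, server.name));
                }
            }
        }
    }
    bad
}
```

The same analysis is not possible for task names buried in free-form config, because the tool cannot tell which strings are task names.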

Starting fresh

Because exhubris is not intended to build Oxide’s existing firmware codebase (yet), there’s no particular reason why it needs to understand the original TOML appconfig format. And so, it doesn’t! I’ve implemented an alternative format, which is sure to change as I apply it to real problems.

I’ll present the same donglet appconfig in the new format in this section. Note that, by the time you read this, the format may already have changed! But the high-level ideas should remain the same.

I haven’t yet fixed all the problems I mentioned in the previous section. That will unfold in future posts.

A more powerful meta-format

Given my stated intention to stop using TOML, what should I use instead?

One option is to define my own grammar for appconfigs, but that seems like a lot of work. Using something off-the-shelf means editors are more likely to do syntax highlighting correctly, for example.

I’m currently using KDL. KDL is fairly expressive and has a robust Rust parser. It looks a lot like configuration written in Tcl, which for me is a plus (say what you will about Tcl the programming language, but I think it makes for very readable configuration files).

You can read more about KDL at the link above, if you want. You don’t really need to understand KDL to read an appconfig, which is part of the reason I like KDL.

Here is a list of alternatives I considered and decided not to use right now, behind a fold for people who aren’t config format nerds.

A list of rejected (for now) formats

YAML: I really dislike YAML, and I’m writing the code, so my opinion matters. I find its use of indentation difficult to read; I’m not against semantic indentation, I just think YAML uses it badly. There are too many ways of expressing each concept. The standard is very complex. And there’s the Norway problem.
JSON: doesn’t allow comments, requires trailing commas almost everywhere except when they are completely forbidden, requires property names to always be "quoted strings", doesn’t support underscores in numeric literals, doesn’t support binary literals at all, and requires the whole file to be wrapped in curlies and indented. I think JSON’s a pretty decent interchange format, but this is not an interchange format, I will be writing this.
RON: Fixes most issues with JSON! Still requires trailing commas, and I wish there were a way to omit the outermost object parentheses. Not obvious how you’d add something like includes or anchors/references, since it’s really a data declaration language. Better for interchange, in my opinion.
HOCON: Pretty interesting, but it denies the existence of unsigned 64-bit numbers, to say nothing of 128-bit numbers. This isn’t a property of an implementation, it’s in the spec. I blame the JVM. Also, built-in syntax to include things from URLs is… terrifying, and it doesn’t appear to support underscores in numbers (a thing I feel very strongly about).
Tcl: Super powerful for config files, and probably the next thing I’d try after KDL. But the available implementations in Rust are pretty limited — I now maintain a fork of at least one, as part of this effort. Tcl denies the existence of numbers which is actually way easier to work around than denying the existence of unsigned numbers.
Pkl: I’ll be following this closely, I like how well-defined the semantics are, and the fact that it’s designed to produce a data structure (unlike KDL). But it’s very, very complex, and I can’t find a full implementation in Rust. Also, like HOCON, its designers decided that 64-bit signed integers should cover every integer use case, which is (to be blunt) bone-headed.
XML: Too verbose, sorry. I like aspects of XML (schemas, paths, and transformations are super well-defined), but not the act of writing it.

KDL’s current Rust implementation is rather difficult to use, and assumes integers are i64s for whatever reason. But the i64 thing is not in the spec, it’s an implementation limitation, so I can probably fix it. (The knuffel crate attempted to make writing parsers easier, but it’s dead.) Despite these issues, it seems like my best option for now.

App basics

Appconfigs are stored in files named (by convention) app.kdl. I’m using the .kdl extension because it causes editor syntax highlighting to just work in editors with KDL support.

We need a way to tell this KDL file from all other kinds of KDL files, and the solution I’ve landed on is to require the name of the firmware image to appear in the first (non-comment non-blank) line of the file, keyed by the word app.

// Name of this firmware image.
app "donglet-g031"
// Name of the target board, which happens to match but might not.
board "boards/donglet-g031.kdl"
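
The "first meaningful line" rule can be approximated without a KDL parser. The sketch below is line-based and only understands // comments, so it would be fooled by block comments and other KDL syntax; app_name is a hypothetical helper, not the way exhubris does it.

```rust
/// Returns the app name if the first line that is neither blank nor a
/// `//` comment has the form `app "name"`. This is a line-based
/// approximation, not a real KDL parser.
fn app_name(source: &str) -> Option<&str> {
    let first = source
        .lines()
        .map(str::trim)
        .find(|line| !line.is_empty() && !line.starts_with("//"))?;
    let rest = first.strip_prefix("app")?;
    // Require whitespace after the keyword, so `apple "x"` is rejected.
    if !rest.starts_with(char::is_whitespace) {
        return None;
    }
    rest.trim_start().strip_prefix('"')?.split('"').next()
}
```

Given the two-line header above, this returns the name donglet-g031; given a file that starts with some other node, it returns nothing.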

Here the board file serves mostly to reference a chip file, which in turn specifies

peripheral and memory layout,
interrupt controller configuration, and
the target triple used by the compiler (here thumbv6m-none-eabi).

I’ll include that below as a sort of appendix, if you’re curious.

I also allow the board to be inlined if it’s a one-off board. That would instead look like

// Alternative to the version above:
board "donglet-g031" {
    chip "chips/stm32g031k8.kdl" 
}

Either way works, and the tools treat them as equivalent.

The next required section tells the tools how to build the kernel, which (as before) is really an executable that includes the kernel as a library. The equivalent to the original TOML would be:

kernel {
    workspace-crate "app-donglet"
    features "g031"
    stack-size 936
}

If you compare this to the TOML you’ll notice a few things.

The crate containing the startup code and kernel is now referenced as workspace-crate instead of just name. This is about to become important.
The file does not specify sizes for kernel flash and RAM. The exhubris tools implement an autosizing method that removes this requirement.

exhubris supports several different ways of specifying a crate, in any position where a crate is specified (kernel or task). If we instead wanted to build the kernel from a definition in someone else’s repo we could write:

// Alternative to the version above:
kernel {
    git-crate {
        repo "https://github.com/cbiffle/exhubris"
        package "kernel-generic-stm32g031"
        rev "e5d5c7c08d791f3a6590eb762b1512a4a8cab44b"
    }
    features "g031"
    stack-size 936
}

I also plan to allow specification of a crate version on crates.io, eventually.
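
One way to model "several ways of specifying a crate" in the tools is an enum with one variant per source. The sketch below is an assumption about shape only: the CrateSource type and the is_pinned policy (a git revision must be a full 40-digit hex commit hash) are made up for illustration.

```rust
/// Where the source for a kernel or task crate comes from.
/// (Hypothetical type for this sketch.)
#[derive(Debug, PartialEq)]
enum CrateSource {
    /// `workspace-crate "name"`: a package in the current Cargo workspace.
    Workspace { package: String },
    /// `git-crate { repo; package; rev }`: a package at a git revision.
    Git { repo: String, package: String, rev: String },
}

impl CrateSource {
    /// Treats a git source as reproducible only if `rev` looks like a full
    /// commit hash (40 hex digits) rather than a branch name that can move.
    /// Workspace crates are treated as pinned by the workspace itself.
    fn is_pinned(&self) -> bool {
        match self {
            CrateSource::Workspace { .. } => true,
            CrateSource::Git { rev, .. } => {
                rev.len() == 40 && rev.chars().all(|c| c.is_ascii_hexdigit())
            }
        }
    }
}
```

A crates.io source would slot in as a third variant carrying a package name and version.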

Now we come to our first tasks, the Oxide supervisor jefe and the STM32 core driver sys:

task "jefe" {
    workspace-crate "task-jefe"
    priority 0
    stack-size 368
    notification "fault"
    notification "timer"
}

task "sys" {
    workspace-crate "drv-stm32xx-sys"
    features "g031" "no-ipc-counters"
    stack-size 256
    priority 1
    uses-peripheral "rcc"
    uses-peripheral "gpio"
    uses-peripheral "system_flash"
    uses-task "jefe"
}

This is an almost direct translation of the TOML, but there are some changes:

start = true (causing the task to be started automatically at boot) is now an implicit default. There’s, like, one case in our entire codebase where we use start = false, so requiring start = true just adds noise.
Instead of an array of names, notifications now get their own lines. This allows them to grow bodies (between {curly braces}) and have properties and stuff. (Though none currently exist.)
The uses list is now uses-peripheral and has one line per peripheral, for the same reason. (We’ll see a use for this in a moment!)
task-slots has become uses-task for consistency.

And now to our most complex task:

task "i2c_driver" {
    workspace-crate "drv-stm32xx-i2c-server"
    features "g031" "no-ipc-counters"
    priority 2
    stack-size 896

    uses-task "sys"

    uses-peripheral "i2c1" {
        irq-notification "i2c1-irq"
    }
}

This example shows why I’ve started moving things like uses-peripheral to their own lines. This task customizes its use of the i2c1 peripheral by adding interrupt routing. The original did this with an extra TOML table and a notifications declaration, but in this version, the one line does double-duty:

Names a notification bit i2c1-irq (which is then used in the Rust code), and
Selects I2C1’s only interrupt³ and routes it to the i2c1-irq notification.

³ It’s pretty common for peripherals on complex chips to have multiple IRQs. In fact, if this were a slightly higher end STM32 chip instead of an STM32G0, the I2C1 peripheral would have two interrupts! In that case, irq-notification takes an additional string designating which IRQ gets mapped.
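
The "name is optional when there is only one IRQ" rule can be sketched as follows. The select_irq function is hypothetical, and the IRQ names and numbers used to exercise it are illustrative rather than copied from any chipdef.

```rust
/// Picks the IRQ number for an `irq-notification` entry. `irqs` lists the
/// peripheral's interrupts as (optional name, number) pairs.
fn select_irq(
    irqs: &[(Option<&str>, u32)],
    wanted: Option<&str>,
) -> Result<u32, String> {
    match (irqs, wanted) {
        // Exactly one interrupt and no name given: unambiguous.
        ([(_, number)], None) => Ok(*number),
        // No name given, but zero or several interrupts to choose from.
        (_, None) => Err(format!(
            "expected exactly one IRQ, found {}",
            irqs.len()
        )),
        // A name was given: look it up.
        (_, Some(name)) => irqs
            .iter()
            .find(|(n, _)| *n == Some(name))
            .map(|(_, number)| *number)
            .ok_or_else(|| format!("no IRQ named {name}")),
    }
}
```

For a peripheral with a single unnamed interrupt the name can be left out; for one with separate event and error interrupts, leaving it out is an error.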

Finally, we come to the part where I think the new format is strongest: config data. This is global config that can be referenced by the build for any task, which is mostly used here to define the I2C device tree.

config "i2c" {
    controllers {
        i2c1 {
            ports {
                B {
                    scl-pin 6
                    sda-pin 9
                    af 6

                    muxes {
                        vpd {
                            driver "pca9548"
                            address 0x73
                        }
                    }
                }
            }
            devices {
                sharkfin_vpd {
                    controller "i2c1"
                    mux "vpd"
                    segment 1
                    address 0b1010_000
                    device "at24csw080"
                    description "Sharkfin VPD"
                    removable
                }

                // ... and so on
            }
        }
    }
}

Compared to the original this has gotten quite…indenty. I’d probably simplify this in practice, but I think it’s already easier to visually scan and tell which things are nested in which other things.

The reason it’s so indenty is that I’m using a simple subset of KDL for config data, one that corresponds to the JSON data model⁴. This is important. The purpose of config data like this is to be passed into task builds, which means it needs to be serialized to some format, and then parsed again during build. Since KDL doesn’t have a serde codec, parsing it manually in Rust is painful. Instead, exhubris will exploit the correspondence with JSON to convert the config data into JSON format and hand that to the task builds, which can then use serde to trivially parse it.
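
Because the subset corresponds to the JSON data model, serializing a parsed config subtree is mechanical. The sketch below uses a hand-rolled Value type and to_json function, both hypothetical, with no external crates; a real tool would more likely lean on an existing JSON library.

```rust
/// A value in the JSON data model, as a config subtree might be held in
/// memory after parsing. Arrays and null are left out for brevity.
enum Value {
    Bool(bool),
    Int(u64),
    Str(String),
    Map(Vec<(String, Value)>),
}

/// Serializes a value as compact JSON. String escaping handles only
/// quotes and backslashes, which is enough for a sketch.
fn to_json(value: &Value) -> String {
    match value {
        Value::Bool(b) => b.to_string(),
        Value::Int(n) => n.to_string(),
        Value::Str(s) => {
            format!("\"{}\"", s.replace('\\', "\\\\").replace('"', "\\\""))
        }
        Value::Map(entries) => {
            let fields: Vec<String> = entries
                .iter()
                .map(|(k, v)| {
                    format!("{}:{}", to_json(&Value::Str(k.clone())), to_json(v))
                })
                .collect();
            format!("{{{}}}", fields.join(","))
        }
    }
}
```

The vpd mux node from the example, with its driver and address children, would come out as a two-field JSON object that a task build can read back into a struct.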

Choosing a subset that corresponds to JSON also means that I can use JSON Schema to define the expected shape and contents of the config data, and JSON Pointer to reference nodes within it if needed. (KDL has its own schema and path/pointer projects underway, but neither appears to be done or implemented, and that doesn’t fix the whole “passing KDL to task builds is rude because it’s hard to parse” issue.)

⁴ KDL formally defines a JSON-in-KDL embedding called JiK. It’s not exactly what I want here so I’m using a subset of that subset! That might change in the future.

Chipdef (optional reading)

I referenced board and chip definitions in the appconfig example above. Board definitions are currently trivial: they just reference a chipdef. Chipdefs are much more interesting. Here’s part of the definition that donglet would use.

// Name of the chip; also indicates that this is a chipdef
chip "STM32G031x8"

// How to compile code for this chip.
target-triple "thumbv6m-none-eabi"

// Size of the hardware vector table. This information is
// required for determining the kernel layout automatically.
vector-table-size 0xC0

// Definition of memory regions.
memory {
    // We treat the vector table as a separate region from
    // flash, because it isn't allocatable to tasks.
    region "vectors" {
        base 0x0800_0000
        size 0xC0
        read
    }

    region "flash" {
        base 0x0800_00C0
        size 0xFF40 // 64 kiB - 0xC0
        read
        execute
    }

    // The STM32G0 is simple and only has one SRAM.
    region "ram" {
        base 0x2000_0000
        size 8192
        read
        write
    }
}

peripheral "rcc" {
    base 0x4002_1000
    size 0x400
    // If a peripheral has only a single interrupt, there's
    // no need to name it. If it had more than one, names
    // would be required to distinguish them.
    irq 4
}

// This mapping merges the ~5 GPIO blocks into one region,
// because in practice we map them all into the same task,
// and this makes better use of the limited MPU region count.
peripheral "gpio" {
    base 0x5000_0000
    size 0x2000
}

peripheral "i2c1" {
    base 0x4000_5400
    size 0x400
    irq 23
}

The full chipdef would have a bunch more peripherals, but this covers all the peripherals used in the section I excerpted above.
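
Since each peripheral entry ends up consuming MPU regions, one check a tool could run on a chipdef is whether an entry fits in a single region. On ARMv6-M parts like this one, an MPU region must be a power of two in size, at least 256 bytes, and aligned to its size. The function name fits_one_mpu_region is hypothetical, and whether exhubris performs this exact check is an assumption.

```rust
/// Returns true if `base`/`size` can be covered by exactly one ARMv6-M
/// MPU region: a power-of-two size of at least 256 bytes, with the base
/// address a multiple of the size.
fn fits_one_mpu_region(base: u32, size: u32) -> bool {
    // The size test runs first, so the modulo never divides by zero.
    size >= 256 && size.is_power_of_two() && base % size == 0
}
```

A 0x400-byte peripheral block on a 0x400 boundary passes; a size that is not a power of two, or a base that is not aligned to the size, would need to be split or rounded by the tools.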

Eventually, this file should grow some knowledge about pin availability on each package, because wiring signals to pins is the most common board-level configuration in our apps. But I haven’t even sketched this yet.

Conclusions and future directions

I’m very enthusiastic about this reboot of appconfigs. I’ve only built small demo applications so far, but it feels much more powerful and easier to extend.

I’m also feeling quite bullish about exhubris in general. The exhubris tools are on GitHub today if you’d like to poke around, but keep in mind that it’s very early days, and the code is in flux.

Some of my topics of active research — which are also likely topics for future posts in this series — currently include:

Cooking up some illustrative, non-trivial demo apps for people to look at.
Incorporating our Idol IDL more deeply into the build process, so that tasks can specify what interface(s) they implement.
Formalizing include/overlay support, so that common sections of config don’t have to be repeated.
Adding a task.kdl method for tasks to specify their own properties, so we don’t have to repeat them in every appconfig.
Letting users run the exhubris tools without checking out the repo, but in a way that uses the correct version for each of their projects, sort of like rustup does with its rust-toolchain.toml file.
Generalizing our Humility debugger so it can be used on non-Oxide applications. Specifically, we need a way to modularize it.

I am actively looking for feedback on this project, since the whole point is enabling other people to use Hubris. If you had any insights while reading this post, or have ideas on how to make Hubris more useful for your purposes, please reach out.
