Revisiting Hubris appconfigs
First in a series on exhubris.
- Hubris refresher and links
- A brief history of appconfigs
- Appconfigs today
- Appconfigs, reviewed
- Starting fresh
- Conclusions and future directions
So in my day-job over at Oxide we’ve built this nice embedded operating system called Hubris. If you follow my blog, you’re probably aware of it.
I also build a lot of embedded electronics outside my day-job, and people sometimes ask me (often excitedly!) if they’re using Hubris.
The answer so far is “no.” This is for a variety of reasons, but probably the biggest: it’s actually quite difficult to use Hubris for anything if you don’t want your code to live in the Oxide Hubris repo!
I would like to fix this, to enable other teams to use Hubris without having to coordinate with Oxide (or even publish their source code!). I’m starting by trying to address the needs of a single friendly customer: me.
As of this week I have it working, in a set of tools I call exhubris. It’s not by any means done (or all that pleasant to use). I’m going to write some posts about it, to help me think through the design process, and (more importantly!) to solicit feedback from my readers on where they think things should go.
This first post starts with the part of Hubris most users encounter first: the application configuration file, or appconfig.
Hubris refresher and links
(If you’ve been following Hubris closely, this section will be a bit of a review. If you’re new here, welcome!)
The details of the system have changed a lot over the past four years, but the basic design is still what I described in my announcement talk. A Hubris application is a firmware image intended to be flashed onto some microcontroller; it is made up of the Hubris kernel and a collection of (application-chosen) tasks. The kernel and each task are all compiled separately, and isolated from one another using hardware memory protection. As a result, while we benefit from Rust’s memory safety, we don’t rely on it for system correctness.
Drivers are just tasks in Hubris, but tasks that are granted access to one or more memory-mapped peripherals, and that receive interrupts as messages from the kernel.
Each task can crash independently; thanks to memory isolation, a crash in one task doesn’t damage others. A crashed task will generally be restarted immediately[1], but that policy is up to an application-chosen task called the supervisor. Oxide’s supervisor implementation does some additional stuff, like recording a coredump of each crashed task, but beyond that, restarting a task is very quick. We’ve found this to be a really powerful tool for ensuring system robustness. It has resulted in some comical situations where a fundamental driver is crashing on an Oxide server over five thousand times per second but the system is still working fine.
[1] We don’t generally use strategies like exponential backoff in Hubris, because Hubris applications are intended to run without human intervention, and getting all the details of backoff right is hard. For instance, you probably want to cap the backoff at some interval; how do you choose it? What do you do if it turns out to be wrong? No operator will be sitting at a console to restart the backoff.
You define an application by writing an appconfig file. This is intended to specify everything that goes into the image, to ensure that you can reproduce the build later. We process the appconfig using a set of image building tools, and call out to Cargo to build each piece of the image before stitching them together. (I have a rather long blog post on the process of building a Hubris image if you’re curious.)
Those are the pieces that are most relevant for this and future posts; if you’d like to know a lot more, the Hubris reference manual is pretty detailed and intended to be quite accessible.
A brief history of appconfigs
A Hubris application needs some way to specify all the bits that go into it. We added support for this in early 2020, shortly after the initial draft of the kernel was working. At the time, we chose to use TOML.
We have now been writing and maintaining appconfigs for four years. Currently, we maintain 56 of them, describing firmware applications from tiny 8-pin microcontrollers without enough flash to store the text of this webpage, up through our rather beastly Service Processor in the Oxide servers — a 400 MHz CPU with 2 MiB of flash.
The format has evolved over time as we’ve needed to express more complex ideas. It’s serving us fairly well, but it was not designed per se. It was incrementally grown with features added as we needed them. This means parts of it are creaky and not terribly consistent, but I’ll get to that in the next sections!
Appconfigs today
In theory, you create a Hubris application by opening an editor and writing an appconfig, which is currently a TOML file. (In practice, today, you also need to check it into the Hubris repo, but let’s ignore that for the moment.)
The appconfig has several sections that specify different parts of the build.
Let’s walk through them, using a simple application as an example. This is a real production appconfig for donglet[2], a jack-of-all-trades interface board we use in test automation at Oxide, based on the STM32G031 microcontroller.
[2] I did not name this board.
Currently, the appconfig expects to live inside a Cargo workspace. The workspace contains a rust-toolchain.toml file that pins an exact toolchain revision via rustup, and a Cargo.lock file that pins the hashes of all dependencies. This information is critical to being able to reproduce the build results, but since Cargo/rustup have it covered, you won’t see this information in the file below.
First, some top-level keys give your firmware a name and indicate compatibility with a specific target, board, and chip. (If you have noticed that this information is redundant, you’re going to like exhubris.)
name = "donglet-g031"
target = "thumbv6m-none-eabi"
chip = "../../chips/stm32g0"
memory = "memory-g031x8.toml"
board = "donglet-g031"
The next section specifies how to build the kernel, which in practice means designating a Cargo bin crate that depends on the kernel: the Hubris kernel is a library, and applications provide a main.rs that calls it, giving them an opportunity to e.g. set up the clock tree and check revision pins. Here the kernel is built by the crate named app-donglet in the Cargo workspace.
[kernel]
name = "app-donglet"
requires = {flash = 19168, ram = 1820}
features = ["g031"]
stacksize = 936
This section needs to assign specific amounts of RAM and flash to the kernel, and to indicate how much RAM to use for the kernel stack. (If this looks annoying, keep reading, I’ll come back to it.)
A third section defines all the tasks in the application, by pointing to the Cargo bin crates that define them. The tasks section also specifies some resource assignments, and can provide config to each task to customize its build. (Think Cargo features, but way more powerful.) The donglet image includes seven tasks, but I’ll skip most of them here. These three demonstrate most of the bells and whistles:
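[tasks.jefe]
name = "task-jefe"
priority = 0
start = true
stacksize = 368
notifications = ["fault", "timer"]
[tasks.sys]
name = "drv-stm32xx-sys"
priority = 1
uses = ["rcc", "gpio", "system_flash"]
start = true
features = ["g031", "no-ipc-counters"]
stacksize = 256
task-slots = ["jefe"]
[tasks.i2c_driver]
name = "drv-stm32xx-i2c-server"
features = ["g031", "no-ipc-counters"]
priority = 2
uses = ["i2c1"]
start = true
task-slots = ["sys"]
stacksize = 896
notifications = ["i2c1-irq"]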
[tasks.i2c_driver.interrupts]
"i2c1.irq" = "i2c1-irq"
Each task chooses a crate from the workspace (somewhat confusingly called name), and is assigned resources: a priority for scheduling, a stack size, and in the case of drivers a set of memory mapped peripherals and interrupts. (Interrupts are routed by the kernel to notifications, which can also be used to implement “software interrupts”; in this case, our supervisor jefe does this with its fault notification, which is how the kernel informs it of crashes in other tasks.)
There are two important keys to note here, uses and task-slots. A task’s uses list names a series of memory-mapped peripherals, defined in the configuration for the chip. The build system sets up the task’s memory protection config so that these peripherals are directly accessible, and others are not. The sys task here uses three things, rcc (for doing clock tree setup on STM32), gpio (for messing with pins), and system_flash (we use this to get the unique die ID for the chip).
task-slots is similar, but the things being named are other tasks instead of peripherals. A task that contacts another task via IPC is expected to name the target task in its task-slots list; the build system then ensures that it can generate a TaskId for that task at compile time. This allows task code to be generic over which server(s) it interacts with.
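As a rough illustration of the build-time half of this, here is a minimal sketch of resolving task-slot names against the list of tasks in an application. The function name `resolve_slots` is hypothetical, and treating a task's position in the appconfig as its index is an assumption of the sketch, not a description of how the Hubris build tools are written.

```rust
/// Resolve each task-slot name to the index of the task it names.
///
/// `task_order` lists the application's tasks in appconfig order. Using the
/// position in that list as the task's index is an assumption of this sketch.
fn resolve_slots(task_order: &[&str], slots: &[&str]) -> Result<Vec<usize>, String> {
    slots
        .iter()
        .map(|slot| {
            task_order
                .iter()
                .position(|name| name == slot)
                // A slot naming a task that is not in the image is a build error.
                .ok_or_else(|| format!("task slot {:?} names no task in this app", slot))
        })
        .collect()
}
```

With the three tasks shown above, the `sys` slot of `i2c_driver` would resolve to a fixed index, while a slot naming a task that is not in the image would fail before any code is compiled.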
Moving on from tasks: the last top level section provides config that can be shared and referenced by all tasks. This tends to be the longest in real-world applications, believe it or not, because it winds up looking a lot like a DeviceTree…
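[config]
[[config.i2c.controllers]]
controller = 1
[config.i2c.controllers.ports.B]
scl.pin = 6
sda.pin = 9
af = 6
[[config.i2c.controllers.ports.B.muxes]]
driver = "pca9548"
address = 0x73
[[config.i2c.devices]]
controller = 1
mux = 1
segment = 1
address = 0b1010_000
device = "at24csw080"
description = "Sharkfin VPD"
removable = true
[[config.i2c.devices]]
controller = 1
mux = 1
segment = 2
address = 0b1010_000
device = "at24csw080"
description = "Gimlet Fan VPD"
removable = true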
In this case, the i2c_driver task uses this information at compile time to configure its use of pins and the set of devices it expects. But because this is global information, other tasks can also refer to it for the same purpose. We have a validate testing-related task that uses this to scan for attached devices and test that they respond correctly, for instance.
(Tasks can also have private config information, which is used much less often so I’ve skipped it here. It’s basically Cargo features but much, much more powerful. This will come up later.)
Appconfigs, reviewed
I’ve been using appconfigs for almost four years now, and I have opinions.
I think the basic idea is great. We need some way to specify a bunch of executable programs to build, and Cargo isn’t great at that — plus, we need configuration that’s a lot more flexible than what Cargo offers. So some sort of input file that drives a tool, which in turn drives Cargo, seems reasonable.
The problem is, well, everything else.
TOML doesn’t scale
I chose TOML because we needed a format. Anyone who writes Rust has at least encountered TOML, since Cargo uses it heavily. TOML was, in hindsight, the wrong choice. It scales poorly to complex trees of data, and it turns out, appconfigs wind up being complex trees of data! The problems are already apparent in the file I excerpted above, and it’s one of our simplest:
- We move between zero and four levels of nesting with almost no visual clue that anything has happened, because TOML doesn’t believe in indentation.
- I2C devices are all tagged with the “I’m an array element” syntax, [[config.i2c.devices]]. If you needed to know which array element, you’re going to be doing some squinting. But in practice there’s basically no way to define a complex map in an array without using this syntax, because…
- TOML has weird opinions about map and array literals. Arrays are permitted to be wrapped across lines, but maps…aren’t? You’re supposed to totally change syntax and write a table instead. This makes keeping the file readable as strings change in length a bit of a chore. This is less of a problem in simple documents, but starts to crop up in even moderately complex Cargo.toml files as people add/remove features from dependencies.
Plus, TOML assumes a sort of “least common denominator” data model, with no concept of enumerated types, tuples, enums with fields, etc. This means there are data structures we can describe simply and elegantly in Rust that we can’t easily express in TOML. (They can be expressed, serde is very good at this, but you wouldn’t want to write them by hand!)
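To make the “enums with fields” point concrete, here is a small sketch of the kind of type that is easy to write in Rust but awkward to write by hand in TOML. The `Route` and `Device` types are hypothetical, invented for this illustration; they are not taken from the Hubris codebase.

```rust
/// How a device is reached from its I2C controller (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
enum Route {
    /// The device sits directly on the controller's bus.
    Direct,
    /// The device sits behind a named mux, on one of its segments.
    ViaMux { mux: String, segment: u8 },
}

/// A device entry: a 7-bit address plus the route used to reach it.
#[derive(Debug, Clone, PartialEq)]
struct Device {
    address: u8,
    route: Route,
}

/// Describe a device in words; the match must handle every variant.
fn describe(device: &Device) -> String {
    match &device.route {
        Route::Direct => format!("0x{:02x} on the bus", device.address),
        Route::ViaMux { mux, segment } => {
            format!("0x{:02x} behind mux {} segment {}", device.address, mux, segment)
        }
    }
}
```

In TOML, the two variants would have to be flattened into optional keys, and nothing in the file format would stop someone from giving a segment without a mux.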
Tasks have to specify too much
Each stanza in tasks declares things like the set of notifications it exposes, how much stack it needs, what task slots it exposes, etc.
Every time it’s used.
In every application.
Many of our tasks are generic and reusable. jefe, for instance, appears in every Oxide firmware image. sys and i2c_driver are also nearly ubiquitous in our STM32-based images. In every case, we have to repeat all this information.
This is silly. It would be better to have a way for the task to centrally declare the parts of this that don’t change — which is basically everything except the stack size — and then have the appconfig just fill in the rest of the template. This would also let us do better checking for e.g. an appconfig that fails to wire up a required notification.
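One way to picture that split is sketched below: a template that the task declares once, a much smaller per-application record, and a check that catches a required notification nobody wired up. Every type and field name here (`TaskTemplate`, `TaskUse`, `missing_notifications`) is hypothetical; no such mechanism exists in the format described in this post.

```rust
use std::collections::BTreeSet;

/// What a reusable task would declare once, next to its source (hypothetical).
struct TaskTemplate {
    crate_name: &'static str,
    /// Notifications that must be connected to something for the task to work.
    required_notifications: &'static [&'static str],
}

/// What an appconfig would still supply for each use of the task (hypothetical).
struct TaskUse {
    priority: u8,
    stack_size: u32,
    /// Pairs of (notification name, interrupt source) wired up by this app.
    wired: Vec<(&'static str, &'static str)>,
}

/// Returns the required notifications that the appconfig left unwired.
fn missing_notifications(template: &TaskTemplate, usage: &TaskUse) -> Vec<&'static str> {
    let wired: BTreeSet<&str> = usage.wired.iter().map(|(name, _)| *name).collect();
    template
        .required_notifications
        .iter()
        .copied()
        .filter(|name| !wired.contains(name))
        .collect()
}
```

A build tool could run a check like this over every task and report the result as an error, instead of leaving the mistake to show up as a driver that never sees its interrupt.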
And yet tasks don’t specify enough
There’s a bunch more information that I’d love to have available. For instance, a task has a task slot bob to talk to some server… what IPC protocol does it expect bob to implement? If we knew that, we could generate simpler code, and detect cases where you’ve miswired the application.
Similarly, tasks don’t provide any hints about what information they expect in their config, or what parts of the global config they rely upon. It would be nice to have a schema, so we could give feedback about mistakes. (If not a schema, just having a list of expected top level keys would be a great start!)
As another example: having task-slots as a first-class concept allows our tools to analyze task IPC relationships, and detect things like priority inversion at build time. This is great! But there’s a lot more information in a typical config that is not first-class, and is not easy to analyze from tools. For instance, some complex configurations wind up including dictionaries of task names. The fact that those are task names is implicit.
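The priority-inversion check mentioned above can be sketched in a few lines once task slots are first-class data. This sketch assumes two things that the post does not spell out: that a numerically lower priority is more important (0 being the most important), and that a task should only send to tasks that are strictly more important than itself. The `Task` shape is hypothetical.

```rust
/// One task as the check sees it (hypothetical shape).
struct Task {
    name: &'static str,
    /// Assumed convention: lower number means more important.
    priority: u8,
    /// Names of the tasks this one sends IPC to.
    task_slots: &'static [&'static str],
}

/// Returns (client, server) pairs where the server is not strictly more
/// important than the client that sends to it.
fn priority_inversions(tasks: &[Task]) -> Vec<(&'static str, &'static str)> {
    let mut found = Vec::new();
    for client in tasks {
        for &slot in client.task_slots {
            // Unknown names are skipped here; a real tool would report them too.
            if let Some(server) = tasks.iter().find(|t| t.name == slot) {
                if server.priority >= client.priority {
                    found.push((client.name, server.name));
                }
            }
        }
    }
    found
}
```

Run over the three donglet tasks shown earlier, this finds nothing: sys (priority 1) sends to jefe (priority 0), and i2c_driver (priority 2) sends to sys. A dictionary of task names buried in a config blob gets no such check, because the tool cannot tell that the strings are task names.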
Starting fresh
Because exhubris is not intended to build Oxide’s existing firmware codebase (yet), there’s no particular reason why it needs to understand the original TOML appconfig format. And so, it doesn’t! I’ve implemented an alternative format, which is sure to change as I apply it to real problems.
I’ll present the same donglet appconfig in the new format in this section. Note that, by the time you read this, the format may already have changed! But the high-level ideas should remain the same.
I haven’t yet fixed all the problems I mentioned in the previous section. That will unfold in future posts.
A more powerful meta-format
Given my stated intention to stop using TOML, what should I use instead?
One option is to define my own grammar for appconfigs, but that seems like a lot of work. Using something off-the-shelf means editors are more likely to do syntax highlighting correctly, for example.
I’m currently using KDL. KDL is fairly expressive and has a robust Rust parser. It looks a lot like configuration written in Tcl, which for me is a plus (say what you will about Tcl the programming language, but I think it makes for very readable configuration files).
You can read more about KDL at the link above, if you want. You don’t really need to understand KDL to read an appconfig, which is part of the reason I like KDL.
Here is a list of alternatives I considered and decided not to use right now, behind a fold for people who aren’t config format nerds.
A list of rejected (for now) formats
- YAML: I really dislike YAML, and I’m writing the code, so my opinion matters. I find its use of indentation difficult to read; I’m not against semantic indentation, I just think YAML uses it badly. There are too many ways of expressing each concept. The standard is very complex. And there’s the Norway problem.
- JSON: doesn’t allow comments, requires trailing commas almost everywhere except when they are completely forbidden, requires property names to always be "quoted strings", doesn’t support underscores in numeric literals, doesn’t support binary literals at all, and requires the whole file to be wrapped in curlies and indented. I think JSON’s a pretty decent interchange format, but this is not an interchange format, I will be writing this.
- RON: Fixes most issues with JSON! Still requires trailing commas, and I wish there were a way to omit the outermost object parentheses. Not obvious how you’d add something like includes or anchors/references, since it’s really a data declaration language. Better for interchange, in my opinion.
- HOCON: Pretty interesting, but it denies the existence of unsigned 64-bit numbers, to say nothing of 128-bit numbers. This isn’t a property of an implementation, it’s in the spec. I blame the JVM. Also, built-in syntax to include things from URLs is… terrifying, and it doesn’t appear to support underscores in numbers (a thing I feel very strongly about).
- Tcl: Super powerful for config files, and probably the next thing I’d try after KDL. But the available implementations in Rust are pretty limited; I now maintain a fork of at least one, as part of this effort. Tcl denies the existence of numbers which is actually way easier to work around than denying the existence of unsigned numbers.
- Pkl: I’ll be following this closely, I like how well-defined the semantics are, and the fact that it’s designed to produce a data structure (unlike KDL). But it’s very, very complex, and I can’t find a full implementation in Rust. Also, like HOCON, its designers decided that 64-bit signed integers should cover every integer use case, which is (to be blunt) bone-headed.
- XML: Too verbose, sorry. I like aspects of XML (schemas, paths, and transformations are super well-defined), but not the act of writing it.
KDL’s current Rust implementation is rather difficult to use, and assumes integers are i64s for whatever reason. But the i64 thing is not in the spec, it’s an implementation limitation, so I can probably fix it. (The knuffel crate attempted to make writing parsers easier, but it’s dead.) Despite these issues, it seems like my best option for now.
App basics
Appconfigs are stored in files named (by convention) app.kdl. I’m using the .kdl extension because it causes editor syntax highlighting to just work in editors with KDL support.
We need a way to tell this KDL file from all other kinds of KDL files, and the solution I’ve landed on is to require the name of the firmware image to appear in the first (non-comment non-blank) line of the file, keyed by the word app.
// Name of this firmware image.
app "donglet-g031"
// Name of the target board, which happens to match but might not.
board "boards/donglet-g031.kdl"
Here the board file serves mostly to reference a chip file, which in turn specifies
- peripheral and memory layout,
- interrupt controller configuration, and
- the target triple used by the compiler (here thumbv6m-none-eabi).
I’ll include that below as a sort of appendix, if you’re curious.
I also allow the board to be inlined if it’s a one-off board. That would instead look like
// Alternative to the version above:
board "donglet-g031" {
    chip "chips/stm32g031k8.kdl"
}
Either way works, and the tools treat them as equivalent.
The next required section tells the tools how to build the kernel, which (as before) is really an executable that includes the kernel as a library. The equivalent to the original TOML would be:
kernel {
    workspace-crate "app-donglet"
    features "g031"
    stack-size 936
}
If you compare this to the TOML you’ll notice a few things.
- The crate containing the startup code and kernel is now referenced as workspace-crate instead of just name. This is about to become important.
- The file does not specify sizes for kernel flash and RAM. The exhubris tools implement an autosizing method that removes this requirement.
exhubris supports several different ways of specifying a crate, in any position where a crate is specified (kernel or task). If we instead wanted to build the kernel from a definition in someone else’s repo we could write:
// Alternative to the version above:
kernel {
    git-crate {
        repo "https://github.com/cbiffle/exhubris"
        package "kernel-generic-stm32g031"
        rev "e5d5c7c08d791f3a6590eb762b1512a4a8cab44b"
    }
    features "g031"
    stack-size 936
}
I also plan to allow specification of a crate version on crates.io, eventually.
Now we come to our first tasks, the Oxide supervisor jefe and the STM32 core driver sys:
task "jefe" {
    workspace-crate "task-jefe"
    priority 0
    stack-size 368
    notification "fault"
    notification "timer"
}
task "sys" {
    workspace-crate "drv-stm32xx-sys"
    features "g031" "no-ipc-counters"
    stack-size 256
    priority 1
    uses-peripheral "rcc"
    uses-peripheral "gpio"
    uses-peripheral "system_flash"
    uses-task "jefe"
}
This is an almost direct translation of the TOML, but there are some changes:
- start = true (causing the task to be started automatically at boot) is now an implicit default. There’s, like, one case in our entire codebase where we use start = false, so requiring start = true just adds noise.
- Instead of an array of names, notifications now get their own lines. This allows them to grow bodies (between {curly braces}) and have properties and stuff. (Though none currently exist.)
- The uses list is now uses-peripheral and has one line per peripheral, for the same reason. (We’ll see a use for this in a moment!)
- task-slots has become uses-task for consistency.
And now to our most complex task:
task "i2c_driver" {
    workspace-crate "drv-stm32xx-i2c-server"
    features "g031" "no-ipc-counters"
    priority 2
    stack-size 896
    uses-task "sys"
    uses-peripheral "i2c1" {
        irq-notification "i2c1-irq"
    }
}
This example shows why I’ve started moving things like uses-peripheral to their own lines. This task customizes its use of the i2c1 peripheral by adding interrupt routing. The original did this with an extra TOML table and a notifications declaration, but in this version, the one line does double-duty:
- Names a notification bit i2c1-irq (which is then used in the Rust code), and
- Selects I2C1’s only interrupt[3] and routes it to the i2c1-irq notification.
[3] It’s pretty common for peripherals on complex chips to have multiple IRQs. In fact, if this were a slightly higher end STM32 chip instead of an STM32G0, the I2C1 peripheral would have two interrupts! In that case, irq-notification takes an additional string designating which IRQ gets mapped.
Finally, we come to the part where I think the new format is strongest: config data. This is global config that can be referenced by the build for any task, which is mostly used here to define the I2C device tree.
config "i2c" {
    controllers {
        i2c1 {
            ports {
                B {
                    scl-pin 6
                    sda-pin 9
                    af 6
                    muxes {
                        vpd {
                            driver "pca9548"
                            address 0x73
                        }
                    }
                }
            }
            devices {
                sharkfin_vpd {
                    controller "i2c1"
                    mux "vpd"
                    segment 1
                    address 0b1010_000
                    device "at24csw080"
                    description "Sharkfin VPD"
                    removable
                }
                // ... and so on
            }
        }
    }
}
Compared to the original this has gotten quite…indenty. I’d probably simplify this in practice, but I think it’s already easier to visually scan and tell which things are nested in which other things.
The reason it’s so indenty is that I’m using a simple subset of KDL for config data, one that corresponds to the JSON data model[4]. This is important. The purpose of config data like this is to be passed into task builds, which means it needs to be serialized to some format, and then parsed again during build. Since KDL doesn’t have a serde codec, parsing it manually in Rust is painful. Instead, exhubris will exploit the correspondence with JSON to convert the config data into RON format and hand that to the task builds, which can then use serde to trivially parse it.
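To give a feel for what “corresponds to the JSON data model” could mean, here is a toy model of such a subset and a function that renders it as JSON text. This is a guess at the shape of the mapping, not the exhubris implementation: the `Node` type is hypothetical, reading a bare name like `removable` as boolean true is an assumption, and the sketch emits JSON rather than RON to keep it dependency-free.

```rust
/// A config node in a JSON-like subset of KDL (hypothetical model).
enum Node {
    /// A bare name such as `removable`, read here as boolean true.
    Flag,
    /// A name with one integer argument, such as `segment 1`.
    Int(i64),
    /// A name with one string argument, such as `mux "vpd"`.
    Str(&'static str),
    /// A name with a `{ ... }` body of named children.
    Map(Vec<(&'static str, Node)>),
}

/// Render a node as JSON text.
///
/// Strings are quoted with Rust's Debug formatting, which matches JSON only
/// for plain ASCII text; that is enough for this illustration.
fn to_json(node: &Node) -> String {
    match node {
        Node::Flag => "true".to_string(),
        Node::Int(n) => n.to_string(),
        Node::Str(s) => format!("{:?}", s),
        Node::Map(children) => {
            let fields: Vec<String> = children
                .iter()
                .map(|(name, child)| format!("{:?}:{}", name, to_json(child)))
                .collect();
            format!("{{{}}}", fields.join(","))
        }
    }
}
```

Because every node is either a scalar or a map of named children, the nesting in the KDL file is exactly the nesting of the resulting data, which is where the indentation comes from.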
Choosing a subset that corresponds to JSON also means that I can use JSON Schema to define the expected shape and contents of the config data, and JSON Pointer to reference nodes within it if needed. (KDL has its own schema and path/pointer projects underway, but neither appear to be done or implemented, and that doesn’t fix the whole “passing KDL to task builds is rude because it’s hard to parse” issue.)
[4] KDL formally defines a JSON-in-KDL embedding called JiK. It’s not exactly what I want here so I’m using a subset of that subset! That might change in the future.
Chipdef (optional reading)
I referenced board and chip definitions in the appconfig example above. Board definitions are currently trivial: they just reference a chipdef. Chipdefs are much more interesting. Here’s part of the definition that donglet would use.
// Name of the chip; also indicates that this is a chipdef
chip "STM32G031x8"
// How to compile code for this chip.
target-triple "thumbv6m-none-eabi"
// Size of the hardware vector table. This information is
// required for determining the kernel layout automatically.
vector-table-size 0xC0
// Definition of memory regions.
memory {
    // We treat the vector table as a separate region from
    // flash, because it isn't allocatable to tasks.
    region "vectors" {
        base 0x0800_0000
        size 0xC0
        read
    }
    region "flash" {
        base 0x0800_00C0
        size 0xFF40 // 64 KiB - 0xC0
        read
        execute
    }
    // The STM32G0 is simple and only has one SRAM.
    region "ram" {
        base 0x2000_0000
        size 8192
        read
        write
    }
}
peripheral "rcc" {
    base 0x4002_4400
    size 0x400
    // If a peripheral has only a single interrupt, there's
    // no need to name it. If it had more than one, names
    // would be required to distinguish them.
    irq 4
}
// This mapping merges the ~5 GPIO blocks into one region,
// because in practice we map them all into the same task,
// and this makes better use of the limited MPU region count.
peripheral "gpios" {
    base 0x5000_0000
    size 2000
}
peripheral "i2c1" {
    base 0x4000_5000
    size 0x400
    irq 23
}
The full chipdef would have a bunch more peripherals, but this covers all the peripherals used in the section I excerpted above.
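The memory section of a chipdef is simple enough that a tool can sanity-check it. The sketch below copies the three memory regions from the excerpt into a hypothetical `Region` type and checks them for overlap; it illustrates the kind of validation such a file makes possible and is not code from exhubris.

```rust
/// One memory region from a chipdef (hypothetical type).
#[derive(Debug, Clone, Copy)]
struct Region {
    name: &'static str,
    base: u32,
    size: u32,
}

impl Region {
    /// First address past the end of the region.
    fn end(&self) -> u32 {
        self.base + self.size
    }
}

/// Returns the names of the first pair of regions that overlap, if any.
fn first_overlap(regions: &[Region]) -> Option<(&'static str, &'static str)> {
    for (i, a) in regions.iter().enumerate() {
        for b in &regions[i + 1..] {
            if a.base < b.end() && b.base < a.end() {
                return Some((a.name, b.name));
            }
        }
    }
    None
}

/// The three memory regions from the chipdef excerpt.
fn donglet_memory() -> [Region; 3] {
    [
        Region { name: "vectors", base: 0x0800_0000, size: 0xC0 },
        Region { name: "flash", base: 0x0800_00C0, size: 0xFF40 },
        Region { name: "ram", base: 0x2000_0000, size: 8192 },
    ]
}
```

The same data also confirms the comment in the excerpt: the vector table ends exactly where flash begins, and the two regions together add up to 64 KiB.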
Eventually, this file should grow some knowledge about pin availability on each package, because wiring signals to pins is the most common board-level configuration in our apps. But I haven’t even sketched this yet.
Conclusions and future directions
I’m very enthusiastic about this reboot of appconfigs. I’ve only built small demo applications so far, but it feels much more powerful and easier to extend.
I’m also feeling quite bullish about exhubris in general. The exhubris tools are on GitHub today if you’d like to poke around, but keep in mind that it’s very early days, and the code is in flux.
Some of my topics of active research — which are also likely topics for future posts in this series — currently include:
- Cooking up some illustrative, non-trivial demo apps for people to look at.
- Incorporating our Idol IDL more deeply into the build process, so that tasks can specify what interface(s) they implement.
- Formalizing include/overlay support, so that common sections of config don’t have to be repeated.
- Adding a task.kdl method for tasks to specify their own properties, so we don’t have to repeat them in every appconfig.
- Letting users run the exhubris tools without checking out the repo, but in a way that uses the correct version for each of their projects, sort of like rustup does with its rust-toolchain.toml file.
- How to generalize our Humility debugger so it can be used on non-Oxide applications. Specifically, we need a way to modularize it.
I am actively looking for feedback on this project, since the whole point is enabling other people to use Hubris. If you had any insights while reading this post, or have ideas on how to make Hubris more useful for your purposes, please reach out.