#api-design

lilos v1.0 released

After five years of development, something like seven art projects, one commercial product, and many changes to the dark corners of the Rust language, I’ve decided lilos is ready for a 1.0 release!

Some parts I’m excited about include:

  • As of this release, the lilos APIs are entirely cancellation-safe.

  • This release contains contributions from five other people, bringing the total number of contributors to seven! (Want to be number eight? Come say hi!)

  • Thanks to one of those contributors, the operating system tests are now running in CI on QEMU!

(For anyone who’s new, lilos is a tiny embedded operating system that uses Rust async to allow complex multitasking on very limited microcontrollers without requiring dynamic memory allocation. Read more about lilos on my project page, where I link to the docs and provide a curated collection of blog posts on the topic.)
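
If you haven’t seen async Rust run without an allocator, here’s a toy illustration of the idea lilos builds on. To be clear, this is not lilos code: it’s a minimal hand-rolled poll loop over two stack-pinned tasks, using only the standard library.

```rust
use std::future::Future;
use std::pin::{pin, Pin};
use std::ptr;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// A do-nothing waker: this toy scheduler polls every task on every pass,
// so it never needs real wakeups. (A real embedded executor would wake
// tasks from interrupts instead of busy-polling.)
fn raw_waker() -> RawWaker {
    RawWaker::new(ptr::null(), &VTABLE)
}
static VTABLE: RawWakerVTable =
    RawWakerVTable::new(|_| raw_waker(), |_| {}, |_| {}, |_| {});

// The smallest possible yield point: Pending once, then Ready.
struct YieldOnce(bool);

impl Future for YieldOnce {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        if std::mem::replace(&mut self.0, true) {
            Poll::Ready(())
        } else {
            Poll::Pending
        }
    }
}

// A "task" is just an async fn; the compiler turns it into a state
// machine whose size is known at build time.
async fn tick(name: &str) {
    for i in 0..3 {
        println!("{name}: {i}");
        YieldOnce(false).await;
    }
}

fn main() {
    let waker = unsafe { Waker::from_raw(raw_waker()) };
    let mut cx = Context::from_waker(&waker);

    // Both task futures live on the stack, pinned in place: no heap.
    let mut a = pin!(tick("a"));
    let mut b = pin!(tick("b"));

    let (mut a_done, mut b_done) = (false, false);
    while !(a_done && b_done) {
        if !a_done { a_done = a.as_mut().poll(&mut cx).is_ready(); }
        if !b_done { b_done = b.as_mut().poll(&mut cx).is_ready(); }
    }
}
```

The two tasks interleave their output cooperatively, and their entire state lives in stack-pinned futures. An OS like lilos layers real wakeups, timers, and more on top, but no heap is needed anywhere.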

See the release notes if you’re curious about what’s changed. If you’ve got firmware written for an earlier version of lilos (particularly the 0.3.x series) and would like to update (you don’t have to!), those release notes will guide you through the process. There have been some breaking API changes, but I promise they’re all improvements.

The server chose violence

I’m continuing to reflect on the past four years with Hubris — April Fool’s Day was, appropriately enough, the fourth anniversary of the first Hubris user program, and today is the fourth anniversary of the first kernel code. (I wrote the user program first to help me understand what the kernel’s API wanted to look like.)

Of all of Hubris’s design decisions, there’s one that gets a “wait what” response more often than any other. It’s also proving to be a critical part of the system’s overall robustness. In this post, I’ll take a look at our 13th and oddest syscall, REPLY_FAULT.

Who killed the network switch?

We found a neat bug in Hubris this week. Like many bugs, it wasn’t a bug when it was originally written — correct code became a bug as other things changed around it.

I thought the bug itself, and the process of finding and fixing it, provided an interesting window into our development process around Hubris. It’s very rare for us to find a bug in the Hubris kernel, mostly because it’s so small. So I jumped at the opportunity to write this one down.

This is a tale of how two features, each useful on its own, can combine to become a bug. Read on for details.

Mutex without lock, Queue without push: cancel safety in lilos

I’m trying to do something kind of unusual with lilos: in addition to almost all of the APIs being safe in the Rust sense, I’m also attempting to create an entire system API that is cancel-safe. I’ve written a lot about Rust’s async feature and its notion of cancellation recently, such as my suggestion for reframing how we think about async/await.

My thoughts on this actually stem from my early work on lilos, where I started beating the drum of cancel-safety back in 2020. My notion of what it means to be cancel-safe has gotten more nuanced since then, and I’ve recently made the latest batch of changes to try to help applications built on lilos be more robust by default.
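
To make the property concrete, here’s a hedged toy example rather than lilos’s actual API: `send` is a hypothetical stand-in for any awaitable transmit operation, and the two functions differ only in when they commit to removing data from the queue.

```rust
use std::collections::VecDeque;

// Hypothetical async transmit: a stand-in for a UART, socket, whatever.
async fn send(_byte: u8) {
    // ...would await the hardware in real life...
}

// Cancel-UNSAFE: the byte is popped *before* the await. If this future
// gets dropped while `send` is pending (say it lost a `select!` race),
// the byte has already left the queue and is silently gone.
async fn forward_unsafe(q: &mut VecDeque<u8>) {
    if let Some(byte) = q.pop_front() {
        send(byte).await;
    }
}

// Cancel-safe: don't commit until the send completes. If the future is
// dropped at the `.await`, the byte is still in the queue next time.
async fn forward_safe(q: &mut VecDeque<u8>) {
    if let Some(&byte) = q.front() {
        send(byte).await;
        q.pop_front();
    }
}
```

In `forward_safe`, cancellation can only happen at the `.await`; once `send` resolves, the pop runs inside the same poll, so dropping the future never strands a byte.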

So, wanna nerd out about async API design and robustness? I know you do.

Safely writing code that isn't thread-safe

One of the nice things about the Rust programming language is that it makes it easier to write correct concurrent (e.g. threaded) programs – to the degree that Rust’s slogan has been, at times, “fearless concurrency.”

But I’d like to tell you about the other side of Rust, which I think is under-appreciated: Rust also enables you to write programs that are not concurrent. This capability is missing from most other languages, and its absence is a source of much complexity and bugs.

“But wait,” you might be saying, “of course I can write code that isn’t concurrent in Java or Python or C!”

Can you, though? You can certainly write code that ignores concurrency, and would malfunction if (say) used from multiple threads simultaneously. But that’s not the same thing as writing code that isn’t concurrent – code that simply can’t be used concurrently, by compiler guarantee.

In Rust, you can. Let’s look at why you can do it, and why it’s awesome.
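
For a taste of it, using nothing but the standard library: `Rc` is deliberately not `Send`, so the compiler itself enforces that anything built on it stays on one thread.

```rust
use std::rc::Rc;

fn main() {
    // Rc's reference count is non-atomic on purpose. That makes it
    // cheap, and it makes Rc deliberately not Send.
    let value = Rc::new(42);
    let alias = Rc::clone(&value);

    // Moving it to another thread is a *compile-time* error:
    //
    //     std::thread::spawn(move || println!("{alias}"));
    //     // error[E0277]: `Rc<i32>` cannot be sent between threads safely
    //
    // So anything built on Rc is single-threaded by compiler guarantee,
    // and can skip atomics and locks entirely.
    println!("{value} {alias}");
}
```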

Why Rust mutexes look like they do

One of the common complaints I hear from systems programmers who try Rust is about mutexes, and specifically about the Rust Mutex API. The complaints usually go something like this:

  • They don’t want the mutex to contain data, just a lock.
  • They don’t want to have to manage a “guard” value that unlocks the mutex on drop – often, more specifically, they just want to call an unlock operation because they feel like that’s more explicit.

These changes would make the Rust mutex API equivalent to the C/Posix mutex API. In one case I’ve seen someone try to use Mutex<()> and trickery to fake it.
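
Here’s a sketch of that workaround, with hypothetical names. Note that the data has to be an atomic just to make the sharing compile at all; for ordinary non-atomic data, safe Rust would refuse to share it across threads in the first place.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

// A naked lock next to the data it supposedly guards. The association
// between LOCK and DATA exists only in the programmer's head.
static LOCK: Mutex<()> = Mutex::new(());
static DATA: AtomicU64 = AtomicU64::new(0);

fn increment() {
    let _guard = LOCK.lock().unwrap();
    let v = DATA.load(Ordering::Relaxed);
    DATA.store(v + 1, Ordering::Relaxed); // "protected" by convention only
}

fn main() {
    increment();
    // Forgot the lock entirely; this compiles without complaint.
    DATA.store(0, Ordering::Relaxed);
    println!("{}", DATA.load(Ordering::Relaxed));
}
```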

There’s a problem with this, though: these two aspects of Mutex’s design are inextricably linked to one another, and to Rust’s broader safety guarantees – changing either or both of them will open the door to subtle bugs and corruption due to data races.
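
You can see the coupling directly in the standard library’s `Mutex` (real `std` API, nothing hypothetical here):

```rust
use std::sync::Mutex;

fn main() {
    // The data lives *inside* the mutex...
    let counter = Mutex::new(0_u64);

    {
        // ...and the only way to reach it is the guard that `lock`
        // returns, so holding the lock and touching the data are the
        // same act.
        let mut guard = counter.lock().unwrap();
        *guard += 1;
    } // guard dropped here; the mutex unlocks itself

    // Out here the guard is gone: there is no name left through which
    // to reach the data while the lock is not held.
    println!("{}", counter.lock().unwrap());
}
```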

A C-style mutex API consisting of some bundle of implicitly guarded data, plus lock and unlock functions, isn’t wise in Rust because it allows safe code to easily commit errors that break memory safety and create data races.

Perhaps controversially, I’d argue that this is also true in C. It’s just more obvious in Rust, because Rust rigorously distinguishes between the notion of “safe” code that cannot commit such errors, and “unsafe” code that can commit such errors if it wishes. C does not make this distinction, and as a result, any code using a mutex in C can trivially produce serious, potentially exploitable, bugs.

In the rest of this post I’ll walk through a typical C mutex API, compare with a typical Rust mutex API, and look at what happens if we change the Rust API to resemble C in various ways.