After five years of development, something like seven art projects, one
commercial product, and many changes to the dark corners of the Rust
language, I’ve decided lilos is ready for a 1.0 release!
Some parts I’m excited about include:
- As of this release, the lilos APIs are entirely cancellation-safe.
- This release contains contributions from five other people, bringing the total number of contributors to seven! (Want to be number eight? Come say hi!)
- Thanks to one of those contributors, the operating system tests are now running in CI on QEMU!
(For anyone who’s new, lilos is a tiny embedded operating system that uses
Rust async to allow complex multitasking on very limited microcontrollers
without requiring dynamic memory allocation. Read more about lilos on my
project page, where I link to the docs and provide a curated collection
of blog posts on the topic.)
See the release notes if you’re curious about what’s changed. If
you’ve got firmware written for an earlier version of lilos (particularly the
0.3.x series) and would like to update (you don’t have to!), those release notes
will guide you through the process. There have been some breaking API changes,
but I promise they’re all improvements.
I’m continuing to reflect on the past four years with Hubris — April Fool’s
Day was, appropriately enough, the fourth anniversary of the first Hubris user
program, and today is the fourth anniversary of the first kernel code. (I wrote
the user program first to help me understand what the kernel’s API wanted to
look like.)
Of all of Hubris’s design decisions, there’s one that gets a “wait what”
response more often than any other. It’s also proving to be a critical part of
the system’s overall robustness. In this post, I’ll take a look at our 13th and
oddest syscall, REPLY_FAULT.
We found a neat bug in Hubris this week. Like many bugs, it wasn’t a bug when
it was originally written — correct code became a bug as other things
changed around it.
I thought the bug itself, and the process of finding and fixing it, provided an
interesting window into our development process around Hubris. It’s very rare
for us to find a bug in the Hubris kernel, mostly because it’s so small. So I
jumped at the opportunity to write this one down.
This is a tale of how two features, each useful on its own, can combine to
become a bug. Read on for details.
I’m trying to do something kind of unusual with lilos: in addition to almost
all the APIs being safe in the Rust sense, I’m also attempting to create an
entire system API that is cancel-safe. I’ve written a lot about Rust’s async
feature and its notion of cancellation recently, such as my suggestion for
reframing how we think about async/await.
My thoughts on this actually stem from my early work on lilos, where I started
beating the drum of cancel-safety back in 2020. My notion
of what it means to be cancel-safe has gotten more nuanced since then, and I’ve
recently made the latest batch of changes to try to help applications built on
lilos be more robust by default.
So, wanna nerd out about async API design and robustness? I know you do.
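To make the hazard concrete, here’s a minimal sketch. The Queue type and its methods below are stubs invented for illustration; they are not lilos’s actual API.

```rust
// Illustrative only: this Queue is a stub, not lilos's real API.
struct Queue;

impl Queue {
    // Pretend this waits until an item is available, then removes and
    // returns it.
    async fn pop(&self) -> u8 {
        42
    }

    // Pretend this waits for an I/O resource before writing.
    async fn send(&self, byte: u8) {
        let _ = byte;
    }
}

// Cancel-UNSAFE: if the caller drops this future between the two
// awaits (say, after losing a select/timeout race), `item` has
// already been removed from the input queue and is silently lost.
async fn forward(input: &Queue, output: &Queue) {
    let item = input.pop().await;
    output.send(item).await;
}
```

The goal of cancel-safe API design is to make functions like forward hard to get wrong, for example by not committing a side effect (like removing the item) until the operation as a whole can complete.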
One of the nice things about the Rust programming language is that it
makes it easier to write correct concurrent (e.g. threaded) programs – to the
degree that Rust’s slogan has been, at times, “fearless concurrency.”
But I’d like to tell you about the other side of Rust, which I think is
under-appreciated. Rust enables you to write programs that are not concurrent.
This feature is missing from most other languages, and its absence is a
source of much complexity and bugs.
“But wait,” you might be saying, “of course I can write code that isn’t
concurrent in Java or Python or C!”
Can you, though? You can certainly write code that ignores concurrency, and
would malfunction if (say) used from multiple threads simultaneously. But that’s
not the same thing as writing code that isn’t concurrent – code that simply
can’t be used concurrently, by compiler guarantee.
In Rust, you can. Let’s look at why you can do it, and why it’s awesome.
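Here’s a small taste of that guarantee in action, using the standard library’s Cell: a type that allows mutation through a shared reference precisely because the compiler forbids it from ever being shared between threads.

```rust
use std::cell::Cell;

fn main() {
    // Cell allows mutation through a shared reference. That's only
    // sound because Cell is !Sync: the compiler guarantees it will
    // never be used from two threads at once.
    let counter = Cell::new(0_u32);
    counter.set(counter.get() + 1);
    println!("count = {}", counter.get());

    // Uncommenting this doesn't merely misbehave at runtime; it
    // fails to compile, with "`Cell<u32>` cannot be shared between
    // threads safely":
    //
    // std::thread::scope(|s| {
    //     s.spawn(|| counter.set(counter.get() + 1));
    // });
}
```

Note that this isn’t a runtime check or a coding convention: the non-concurrency of Cell is part of its type, and the compiler enforces it at every use site.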
One of the common complaints I hear from systems programmers who try Rust is
about mutexes, and specifically about the Rust Mutex API. The complaints
usually go something like this:
- They don’t want the mutex to contain data, just a lock.
- They don’t want to have to manage a “guard” value that unlocks the mutex on drop – often, more specifically, they just want to call an unlock operation because they feel like that’s more explicit.
These changes would make the Rust mutex API equivalent to the C/POSIX mutex API.
In one case I’ve seen someone try to use Mutex<()> and trickery to fake it.
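That trickery ends up looking something like the sketch below (the type and its names are invented for illustration). Note what it costs: an unsafe impl, and an unchecked promise that the lock really does guard the data.

```rust
use std::cell::UnsafeCell;
use std::sync::Mutex;

// A C/POSIX-shaped mutex: the lock sits *next to* the data instead
// of owning it.
struct CStyleCounter {
    lock: Mutex<()>,
    value: UnsafeCell<u32>,
}

// We have to assert thread-safety ourselves, because the compiler
// has no idea that `lock` is supposed to guard `value`.
unsafe impl Sync for CStyleCounter {}

impl CStyleCounter {
    fn increment(&self) {
        let _guard = self.lock.lock().unwrap();
        // Only programmer discipline connects this access to the
        // lock: delete the line above and this still compiles,
        // data race and all.
        unsafe { *self.value.get() += 1 }
    }
}
```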
There’s a problem with this, though: these two aspects of Mutex’s design are
inextricably linked to one another, and to Rust’s broader safety guarantees –
changing either or both of them will open the door to subtle bugs and
corruption due to data races.
A C-style mutex API consisting of some bundle of implicitly guarded data, plus
lock and unlock functions, isn’t wise in Rust because it allows safe code to
easily commit errors that break memory safety and create data races.
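For contrast, here’s the shape the standard library’s Mutex actually has: the data lives inside the mutex, and the only way to reach it is through the guard, so “no access without locking” is enforced by the compiler rather than by convention.

```rust
use std::sync::Mutex;

fn main() {
    // The mutex owns the data it protects.
    let counter = Mutex::new(0_u32);

    {
        // Locking yields a guard; the u32 is only reachable
        // through it.
        let mut guard = counter.lock().unwrap();
        *guard += 1;
    } // guard dropped here: the mutex unlocks, access is revoked

    // There is no way to touch the u32 without locking again;
    // something like `counter += 1` simply doesn't compile.
    println!("count = {}", counter.lock().unwrap());
}
```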
Perhaps controversially, I’d argue that this is also true in C. It’s just more
obvious in Rust, because Rust rigorously distinguishes between the notion of
“safe” code that cannot commit such errors, and “unsafe” code that can commit
such errors if it wishes. C does not make this distinction, and as a result, any
code using a mutex in C can trivially produce serious, potentially exploitable,
bugs.
In the rest of this post I’ll walk through a typical C mutex API, compare it
with a typical Rust mutex API, and look at what happens if we change the Rust
API to resemble C in various ways.