I’m trying to do something kind of unusual with lilos: in addition to almost
all the APIs being safe-in-the-Rust sense, I’m also attempting to create an
entire system API that is cancel-safe. I’ve written a lot about Rust’s async
feature and its notion of cancellation recently, such as my suggestion for
reframing how we think about async/await.
My thoughts on this actually stem from my early work on lilos, where I started
beating the drum of cancel-safety back in 2020. My notion
of what it means to be cancel-safe has gotten more nuanced since then, and I’ve
recently made the latest batch of changes to try to help applications built on
lilos be more robust by default.
So, wanna nerd out about async API design and robustness? I know you do.
I recently posted about my debugger for async Rust, which can
generate what I call “await-traces” for async code that’s suspended and not
currently running. I mentioned at the time that it appeared possible to get the
source code file name and line number corresponding to the await points, but
left that for future work.
I’m a big fan of Rust’s async feature, which lets you write explicit state
machines like straight-line code. One of the operating systems I maintain,
lilos, is almost entirely based on async, and I think it’s a killer
feature for embedded development.
async is also popular when writing webservers and other network services. My
colleagues at Oxide use it quite a bit. Watching them work has underscored one
of the current issues with async, however: the debugging story is not great.
In particular, answering the question “why isn’t my program currently doing
anything” is very hard.
I’ve been quietly tinkering on some tools to improve the situation since 2021,
and I’ve recently released a prototype debugger for lilos: lildb. lildb
can print await traces for uninstrumented lilos programs, which are like
stack traces, but for suspended futures. I wrote this to help me debug my own
programs, but I’m publishing it to try and move the discussion on async
debugging forward. To that end, this post will walk through what it does, how it
derives the information it uses, and areas where we could improve things.
Now that Hubris has gotten some attention, people sometimes ask me if my
personal projects are powered by Hubris.
The answer is: no, in general, they are not. My personal projects use my other
operating system, lilos, which predates Hubris and takes a fundamentally
different approach. It has dramatically lower resource requirements and allows
more styles of concurrency.