Making Safe Things From Unsafe Parts
Learn Rust the Dangerous Way, Part 5
In part 4 we took the unsafe code that deals with treating arrays of `f64` as arrays of vectors, and we corralled it into a safe API. In this installment, we’ll look at the remaining reasons why `advance` is an `unsafe fn`, and make it safe — not by removing all the `unsafe`, but by narrowing it down.
This one’s a doozy — the remaining changes to `advance` are hard to separate, so I’ve packed them all into one section. Now is probably a good time to refill your coffee.
The state of `advance`
When we left off last time, our program looked like this:
Like in the original C program, `advance` is a long function — 149 lines long, to be exact — so I won’t reproduce the whole thing here. Instead, we’ll look at highlights.
`advance` is still an `unsafe fn`. This normally means that a function can violate memory safety if used incorrectly — it’s a warning to callers that they need to ensure that some conditions hold, or the program may do totally undefined things.
Is `advance` actually unsafe to call? Sort of, but that doesn’t matter just yet. If you were to delete the `unsafe` modifier and try to compile, you’d get a bunch of errors (27 of them). This is because an `unsafe fn` also gains the ability to call other `unsafe` operations without being explicit about it, and we’re still doing that. What `unsafe` operations is `advance` relying on? If you were to study those compile errors, you’d see two classes of operations, which I’ll discuss in the next two sections.
Mutable statics
`advance` contains two `static mut` variables, declared (as of the end of last chapter) as follows:

```rust
static mut position_Deltas: [Interactions; 3] = …;
static mut magnitudes: Interactions = Interactions { … };
```

These two variables, `position_Deltas` and `magnitudes`, are used like locals within `advance`, but are declared as `static mut`.
Why is this unsafe? Basically, because it causes the function to not be reentrant. By storing some local data in a global variable, each call to the function depends on the previous one, and two concurrent calls to the function — perhaps from separate threads — would corrupt one another’s data and produce incorrect results.
Rust really doesn’t like this sort of non-reentrant function, because Rust assumes that you might want to use threads someday, and thread safety guarantees are part of Rust’s core values. But this sort of thing is equally dangerous in C, where threads are becoming more common. (See, for example, `strtok` vs. `strtok_r`.)
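To make the hazard concrete, here’s a minimal sketch (invented for illustration, not from the n-body program) of a function that keeps state in a `static mut`. Rust forces every access into an `unsafe` block, because two threads calling it simultaneously would race on the shared variable:

```rust
static mut CALLS: u32 = 0;

/// Returns a "unique" ID by bumping a global counter.
fn next_id() -> u32 {
    // Unsafe: this read-modify-write is not atomic, so two threads
    // calling next_id at once could observe or produce the same value.
    unsafe {
        CALLS += 1;
        CALLS
    }
}

fn main() {
    let a = next_id();
    let b = next_id();
    // Each call depends on the one before it: the function is not reentrant.
    assert_eq!(a + 1, b);
}
```

Single-threaded, this works fine, which is exactly why the pattern survives in benchmark code like ours.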
So if it’s dangerous, why does the program do it?
The original program didn’t document the intent behind using `static`, but I’m pretty confident that it’s an optimization to reduce the cost of calling `advance`.
These variables aren’t tiny — each `Interactions` contains 10 `f64`s, so `position_Deltas` contains 30, for 240 bytes. If these were declared as locals, the function would need to allocate a large stack frame to contain them, and initialize them on each call. `advance` is called a lot, and that cost would add up. Experimentally, the naive approach to switching these variables to locals — replacing `static` with `let` — costs about 5%!
(Notice that this optimization is doing essentially the same thing as the use of uninitialized memory we saw in part 3: it’s trying to avoid initializing a chunk of memory on each call if it doesn’t need to. In part 3, that optimization proved unnecessary; in this case, the optimization works, probably because the initialization loop is more complex, confusing dead store elimination. Remember to measure the effect of your optimizations!)
We’ll address this `unsafe`ty by replacing this optimization with an equivalent, safe one, later in this section. But first, there’s another `unsafe` to consider:
SIMD intrinsics
All the SIMD intrinsics in the standard library are `unsafe`.
That might seem unnecessary and arbitrary — after all, most safety issues we’ve considered involve memory, and a function like `_mm_add_pd` just adds numbers without touching memory. So why is it `unsafe`?
At the risk of being snarky: blame Intel.
We’re using SSE2 instructions in this program. What happens if you run SSE2 code on a processor that doesn’t implement SSE2? One would hope that, when the processor first hits an unsupported instruction, it would trigger some sort of fault and reject the program.
Nope! Many SSE/SSE2 instructions encode other actions on earlier processors. They’re undocumented, but they work. So what looks like a vector add to us might look like a jump or a store to memory to an earlier CPU.
And so, since an unsupported SIMD instruction can destroy the program’s environment in unpredictable ways, using them explicitly like we’re doing is `unsafe`. (The current Rust program, like the original C program, will do unpredictable things on an older CPU.)
It’s entirely possible to use SIMD in safe Rust — `rustc` is surprisingly good at auto-vectorizing loops, there are convenient third-party libraries, and you can always check the target processor model at compile time. We’ll take that last option, later in this section.
Getting to safety
Getting the static out
To recap: `advance` declares two working arrays as `static` to avoid paying the cost of initializing them on each call. This optimization works, and saves around 5%, but it renders the function non-reentrant and `unsafe`.
These arrays also carry data between invocations of the function. This is purely a side effect; `advance` overwrites the arrays every time, so the data from previous invocations doesn’t get used, but it still gets stored after `advance` returns.
This observation is key to making the trick safe. What other ways could we declare the arrays so that `advance` doesn’t have to stack-allocate them on every call, and where data may flow between successive calls to `advance`? How about as locals in `advance`’s caller?
Allocating the arrays as locals in `main`, and passing them by reference to `advance`, achieves the same optimization goals without using any `unsafe` code at all.
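As a sketch of the pattern (with made-up names and sizes, not the real `advance`), the hot function borrows its scratch storage from the caller; the buffer is allocated once and reused across calls, with no `static` anywhere:

```rust
// Hypothetical stand-in for advance: it fully overwrites `scratch`
// before reading it, so stale contents from a previous call are harmless.
fn step(total: &mut f64, scratch: &mut [f64; 8]) {
    for (i, slot) in scratch.iter_mut().enumerate() {
        *slot = i as f64; // overwrite; never read old data
    }
    *total += scratch.iter().sum::<f64>();
}

fn main() {
    // Allocated once in the caller, lent to each call by &mut reference.
    let mut scratch = [0.0; 8];
    let mut total = 0.0;
    for _ in 0..3 {
        step(&mut total, &mut scratch);
    }
    assert_eq!(total, 84.0); // 3 * (0 + 1 + ... + 7)
}
```

The borrow checker guarantees that two simultaneous calls can’t share one `scratch`, which is exactly the reentrancy property the `static mut` version lacked.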
The signature of `advance` changes to allow the caller to pass in references:

```rust
unsafe fn advance(
    …,
    position_Deltas: &mut [Interactions; 3],
    magnitudes: &mut Interactions,
)
```

…and then we delete the `static` declarations.
We’ve had `Interactions` declared inside the body of `advance` ever since we introduced it in part 4; if we’re going to expose it to the caller, we need to move it, its `impl`, and the two `const`s it depends on a few lines up, outside `advance`. (Copy-paste will suffice.)
Finally, the caller, `main`, needs to allocate the two variables and pass them in by reference. `main` now looks like this:
The complete program after this change:
Safer SIMD
As we saw above, the safety issue with SIMD hinges on the behavior of older processors when executing newly added instructions.
`rustc` can target a wide variety of processors. When compiling, you can target a generic “least common denominator” processor, a specific model, or processors with a particular set of features. So far, when building the binaries, I’ve targeted the CPU model `native`, which is a placeholder meaning “the kind of CPU I’m compiling on.” In my case, `native` actually means `skylake`; on your computer, it may represent something else.
If you compile a binary specifically targeting a recent CPU (such as `skylake` from 2015-ish), and then run it on an older CPU (such as a 2003-vintage AMD Opteron), Bad Things can happen, because the x86 instruction set encoding is simply not backwards-compatible¹. There’s not a lot we can do to defend against this mistake; while we could query the processor’s features at runtime, it’s possible that the code leading up to that check was compiled using new instructions that will execute incorrectly on the older machine.
¹ Lest I appear to be specifically criticizing Intel here, this is a problem on basically every processor architecture. ARM does only slightly better. Only RISC-V went out of its way to ensure that future extensions can be recognized and rejected by older processors.
On the other hand, if you compile a binary targeting an older machine, it should run on that older machine — if it can’t, it should fail to compile. Rust will let your code call `_mm_add_pd` (which only makes sense on processors that support SSE2) even if your target CPU doesn’t appear to support it — Rust assumes that you’ve done some sort of CPUID check and you know what you’re doing. This is part of the `unsafe` contract for that operation, a contract that we’re currently fulfilling by sheer luck.
If we can ensure that our code will only compile when targeting processors with SSE2, then we can guarantee that the SIMD operations we’re using are safe. We’ll do that the same way you’d probably do it in C: conditional compilation.
Rust’s conditional compilation mechanism is different from C’s. Rather than a separate macro language evaluated by a preprocessor step, Rust’s is integrated into the language as attributes. By attaching a `#[cfg(...)]` attribute to something, you make the presence of that thing conditional on certain compile-time configuration.

In our case, `#[cfg(target_feature = "sse2")]` makes something conditional on the compile-time target CPU supporting SSE2.
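As a toy example (the function and names here are invented, not from the program), `#[cfg]` can select between two definitions based on the target’s features, and `cfg!` is the expression form of the same check:

```rust
// Compiled only when the target supports SSE2.
#[cfg(target_feature = "sse2")]
fn vector_backend() -> &'static str {
    "sse2"
}

// Compiled only when it doesn't; without this fallback, any caller of
// vector_backend() would fail to compile on a non-SSE2 target.
#[cfg(not(target_feature = "sse2"))]
fn vector_backend() -> &'static str {
    "scalar"
}

fn main() {
    // cfg! evaluates the same condition as a boolean expression.
    assert_eq!(vector_backend() == "sse2", cfg!(target_feature = "sse2"));
    println!("using {} backend", vector_backend());
}
```

Exactly one of the two definitions exists in any given build, which is the property we’re about to exploit with `advance`.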
I’ll use a slightly blunt approach and slap that attribute right on `advance`, like so:

```rust
#[cfg(target_feature = "sse2")]  // <-- only compile for SSE2
unsafe fn advance(
    …
```
The reason I’m calling this “blunt” is the way that it fails if we target a CPU without SSE2². The easiest way to do this is just to disable the `sse2` feature on our compiler command line (next-to-last line of the command):
```shell
$ rustc --target=x86_64-unknown-linux-musl \
    -C opt-level=3 -C debuginfo=2 -C codegen-units=1 \
    -C target-cpu=native \
    -C target-feature=-sse2 \
    nbody-5b.rs -o nbody-5b
```

```
error[E0425]: cannot find function `advance` in this scope
   --> 5/nbody-5b.rs:294:13
    |
294 |             advance(
    |             ^^^^^^^ not found in this scope

error: aborting due to previous error

For more information about this error, try `rustc --explain E0425`.
```
The way I’ve expressed the `#[cfg]` attribute, if the target processor doesn’t support SSE2, the `advance` function just vanishes — so any code that tries to call it now fails to compile. This is sufficient, but the error message is confusing, and leaves it to the reader to deduce that `advance` is missing because their processor is too old. In a Real Program I’d handle this slightly differently. See the `core::arch` module docs for different approaches, including checking for features at runtime, and enabling features in some places but not others.
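One of those alternatives, runtime detection, looks roughly like this. (A sketch: `std` provides the `is_x86_feature_detected!` macro on x86/x86_64 only, so the illustration guards it with a `target_arch` check; the function name is invented.)

```rust
#[cfg(target_arch = "x86_64")]
fn report() {
    // Queries CPUID at runtime, so a single binary can pick a code path.
    if is_x86_feature_detected!("sse2") {
        println!("SSE2 present: safe to call SSE2 intrinsics");
    } else {
        println!("no SSE2: take a scalar fallback path");
    }
}

#[cfg(not(target_arch = "x86_64"))]
fn report() {
    println!("not x86: SSE2 is not a concern on this target");
}

fn main() {
    report();
}
```

The catch, as noted above, is that the code performing the check must itself be compiled for the lowest common denominator, or it may never run far enough to check.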
² If you’ve done a lot of SIMD programming on Intel, you may be asking, “but wait, are there any x86-64 CPUs that don’t support SSE2?” In fact, no, I don’t believe there are, so this is just me being paranoid. But if you extended the algorithm to use SSE4 features, this would become very important.
Because this is literally a one-line change, I haven’t included the full program source after the change. I trust you can figure it out if you need to. Don’t worry, there’ll be full source code after the next change.
Turning `unsafe` inside-out
After we got rid of the `static mut` variables, the only obligation the caller had to meet to use `advance` safely was to ensure that SSE2 instructions weren’t going to do dumb things on their processor. Now that `advance` only exists when targeting a CPU with SSE2, it is no longer `unsafe` to call it. This means that we can remove the `unsafe` keyword from `fn advance`, and switch to using more granular `unsafe` blocks inside the function.
For instance, we can explicitly mark the `unsafe` bit in this snippet (which appears about 2/3 of the way through `advance`).

Before:

```rust
magnitudes.as_vectors[i] = _mm_mul_pd(…);
```

After:

```rust
// Safety: this code is only compiled for processors that
// support SSE2, so the SIMD operations used here are
// safe.
magnitudes.as_vectors[i] = unsafe { _mm_mul_pd(…) };
```
(Remember that, in Rust, blocks like `unsafe { … }` — or even `if` — can appear where values are expected, on the right-hand side of an assignment.)
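A tiny illustration of that rule, unrelated to SIMD: an `unsafe` block yields the value of its final expression, so it can sit on the right-hand side of an assignment, just like an `if`:

```rust
fn main() {
    let x = 42i32;
    let p = &x as *const i32; // raw pointer: dereferencing it is unsafe

    // The unsafe block is an expression whose value is *p.
    let y = unsafe { *p };
    assert_eq!(y, 42);

    // `if` works the same way as an expression.
    let sign = if y >= 0 { 1 } else { -1 };
    assert_eq!(sign, 1);
}
```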
Okay, but why is this better? It’s a fair question. There’s still `unsafe` stuff happening. Code in this `unsafe` block could be messing with raw pointers, overwriting memory, smashing the stack.
But it isn’t. The `unsafe` block is small, and contains only calls to SIMD operations and references to local variables. It only takes a few seconds to read the block and see that it’s only using SIMD.
Like when we wrapped unsafe `union` operations in a safe API in part 4, moving to small, focused `unsafe` blocks makes it much easier to inspect the code and convince ourselves that it’s correct. In particular, by converting the remaining SIMD snippets into granular `unsafe` blocks, we don’t have to carefully inspect the rest of `advance` for `unsafe` operations — we just have to search for `unsafe` in an editor and read the snippets we find.
By going through `advance` somewhat mechanically, and wrapping small `unsafe` blocks around each SIMD expression, we can get `advance` to compile without itself being marked as an `unsafe fn`. Here’s the source code at this point:
(I would prefer to pull these `unsafe` blocks out of `advance` entirely and provide a safe SIMD API, like we did for the `union` in part 4. I’m not doing this right now because it would be a fairly disruptive change in a section that is already quite long, but stay tuned for a future installment.)
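To sketch what such a safe wrapper could look like (a hypothetical helper, not code from this program): a small function owns the `unsafe` block and exposes a safe signature, relying on the fact that SSE2 is part of the x86_64 baseline.

```rust
#[cfg(target_arch = "x86_64")]
fn add_pairs(a: [f64; 2], b: [f64; 2]) -> [f64; 2] {
    use core::arch::x86_64::{_mm_add_pd, _mm_set_pd, _mm_storeu_pd};
    let mut out = [0.0; 2];
    // Safety: SSE2 is guaranteed on every x86_64 CPU, so these
    // intrinsics cannot hit an unsupported instruction here, and the
    // store writes exactly two f64s into a two-f64 buffer.
    unsafe {
        let v = _mm_add_pd(_mm_set_pd(a[1], a[0]), _mm_set_pd(b[1], b[0]));
        _mm_storeu_pd(out.as_mut_ptr(), v);
    }
    out
}

#[cfg(target_arch = "x86_64")]
fn main() {
    assert_eq!(add_pairs([1.0, 2.0], [3.0, 4.0]), [4.0, 6.0]);
}

#[cfg(not(target_arch = "x86_64"))]
fn main() {}
```

Callers of `add_pairs` never write `unsafe` at all; the safety argument lives in one place, next to the intrinsics it justifies.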
Removing the `unsafe` block from `main`
We’re in the home stretch!
Currently, `main` contains a big `unsafe` block. It had to, because the functions it was calling were all originally `unsafe`. But now all our functions are safe, and so it seems like high time to remove that `unsafe` block:
Before:
After…?
But, surprise! It won’t compile. And the reason is familiar: the compiler points out that we’re messing around with a `static mut`, which implies thread safety and reentrancy problems. (This time, the `static mut` is the global `solar_Bodies` array.)
“But we’re in `main`!” you might protest. “`main` doesn’t need to be reentrant.”

Rust does not agree. `main` is not special in Rust — it’s just a function, and it needs to abide by the same rules as any other function. For instance, though it would be kind of odd, you could totally call `main` from another function — which would be reentering it, since all functions are called from `main`. (This is all true in C, as well.)
Fortunately, this one’s pretty easy to fix. We’re using the `static mut` variable `solar_Bodies` to hold the state of the simulation, initialized from some starting conditions. We can change the starting conditions into a `const`…

Before:

```rust
static mut solar_Bodies: … = …;
```

After:

```rust
const STARTING_STATE: … = …;
```

…and then switch to holding our running simulation state in a local, instead of a `static`:

Before:

After:
And now `main` is entirely safe.
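The shape of the fix, in miniature (made-up state type and values, not the real solar system data): the starting conditions live in a `const`, and `main` copies them into a local `mut` variable that it then mutates freely, with no `unsafe` anywhere:

```rust
// Immutable starting conditions, baked into the binary.
const STARTING_STATE: [f64; 3] = [1.0, 2.0, 3.0];

fn step(state: &mut [f64; 3]) {
    for v in state.iter_mut() {
        *v *= 2.0;
    }
}

fn main() {
    // Copy the const into a local; no static mut, so no unsafe.
    let mut state = STARTING_STATE;
    step(&mut state);
    assert_eq!(state, [2.0, 4.0, 6.0]);
    // The const itself is untouched.
    assert_eq!(STARTING_STATE, [1.0, 2.0, 3.0]);
}
```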
The program with these changes:
Evaluation
This has been quite a sprint! What have we accomplished?
- `advance` is now safe to call.
- The program will (correctly) fail to compile when targeting a CPU without SSE2.
- The remaining `unsafe` bits inside `advance` are now granular and easy to inspect.
- `main` is now safe and reentrant.
What effect does it have on the program as a whole? I’ve compiled a version at each step in this section (`5a`, `5b`, and finally `5`). First, let’s check their sizes:
```
$ size nbody-{4,5a,5b,5}
   text    data     bss     dec     hex filename
 265608   10332    7432  283372   452ec nbody-4.bench
 265808   10332    7112  283252   45274 nbody-5a.bench
 265808   10332    7112  283252   45274 nbody-5b.bench
 265888   10044    7112  283044   451a4 nbody-5.bench
```
Removing the `static mut` arrays inside `advance` reduced the `bss` RAM usage, which makes sense, as `bss` measures permanently dedicated (`static`) sections of RAM. This isn’t a real reduction in RAM usage, because the arrays are simply moved to the stack, which isn’t accounted for here.

The `text` section grew by 80 bytes at the point where we started initializing `solar_Bodies` from a `const`, instead of statically.

So, you win some, you lose some; overall, the binary is 328 bytes smaller.
More importantly, how’s the performance? Let’s measure at each step:
| Command | Mean [s] | Min [s] | Max [s] | Ratio |
|:---|---:|---:|---:|---:|
| `./nbody.clang-8.bench 50000000` | 5.277 ± 0.007 | 5.258 | 5.282 | 1.00x |
| `./nbody-1.bench 50000000` | 5.123 ± 0.024 | 5.095 | 5.161 | 0.97x |
| `./nbody-2.bench 50000000` | 5.101 ± 0.005 | 5.093 | 5.107 | 0.97x |
| `./nbody-3.bench 50000000` | 5.103 ± 0.002 | 5.100 | 5.105 | 0.97x |
| `./nbody-4.bench 50000000` | 5.104 ± 0.002 | 5.101 | 5.107 | 0.97x |
| `./nbody-5a.bench 50000000` | 5.145 ± 0.003 | 5.140 | 5.150 | 0.97x |
| `./nbody-5b.bench 50000000` | 5.147 ± 0.002 | 5.145 | 5.149 | 0.98x |
| `./nbody-5.bench 50000000` | 5.121 ± 0.003 | 5.118 | 5.127 | 0.97x |
In other words,

- Removing the optimization of using `static` tables in `advance`, and switching to locals allocated in `main`, had a cost (`nbody-5a`): runtime increased by 0.8% (41ms) over the code from part 4³.
- Rearranging `unsafe` code (`nbody-5b`) had a negligible effect, as we’d expect.
- Switching from using an `unsafe` `static mut` for the solar system state, to a local (`nbody-5`), won us back most of the performance we lost.
The final “pretty safe” version of the program is slightly faster than the original Rust translation, and significantly (3%) faster than the C original.
And it’s 0.3% / 20ms slower than `nbody-4`. If that 20ms is important to you, you could always stop with `nbody-4` and skip the changes made in this section. One of the nice things about Rust is that, while Rust encourages you to make programs safe, it’s not required. `nbody-4` is a pretty reasonable place to leave things, if `nbody-5` doesn’t meet your performance needs.
³ Why are locals more costly than `static`s if we’re not paying to initialize them? It appears to come down to code density and addressing modes on x86-64. Because the address of a `static` is known during build (at link time), `rustc` can emit instructions that directly reference it with embedded absolute addresses. With a local, we have to compute its address relative to the stack pointer (`%rsp`) to reference it. The latter approach appears to produce less-dense code in this case.

This isn’t an absolute truth (pun intended): locals are usually faster on ARM, where instructions can’t embed an absolute memory address cheaply. Measure, measure, measure.
Wrapping up
We’ve reached our destination! Let’s look back at what we’ve done in the series so far.
Starting with a C program, we’ve produced an equivalent Rust program, following an incremental process without sweeping architectural changes. Comparing the two,
- The Rust code doesn’t use any pointer arithmetic or unchecked array indexing, which eliminates bugs like buffer overruns, stack smashes, and the like.
- The Rust code doesn’t use raw pointers at all, relying on Rust references to eliminate the possibility of `NULL` pointer bugs or accesses to uninitialized memory.
- While we don’t use threads, the Rust code is fully reentrant and thread-safe, so we could operate in a threaded environment if needed.
- The Rust program is slightly faster than the C program compiled by Clang (by about 3%), and significantly faster than GCC’s output (by 17%).
- The Rust source code is longer, mostly due to the need to call Intel SIMD intrinsics directly, while the C program relies on non-standard operator overloading.
- When both programs are statically linked⁴, the Rust binary is smaller — 40% the size of the GCC/Clang output.
- Compile times are roughly equivalent (GCC is slightly faster, Clang slightly slower).
While we’ve finished the task we originally set out to accomplish, I’m not quite done yet. The current Rust code is a close equivalent to the C code, but it’s very weird-looking Rust code. In the next (bonus!) section, I’ll look at how the program would differ if it were written in normal, idiomatic Rust.
⁴ I’m not playing games by statically linking; the author of the C program was doing it before I got here. That being said, a Rust program on Linux using dynamic linking will tend to be larger than a C program doing the same thing, simply because C assumes that a compatible version of its standard library is already installed on the system, and Rust does not — so `std` is always linked in. This can make a big difference for small binaries. (I point this out so it doesn’t seem like I’m glossing over Rust binary size issues, which can be a problem in some use cases.)