Let The Compiler Do The Work
Learn Rust the Dangerous Way, Part 6
- A review
- Back to basics: a simple idiomatic version
- Building the code
- Inspecting the results
- Performance evaluation
- The pros and cons of auto-vectorization
- Conclusion
In this series so far, we’ve taken a C program and converted it into a faster, smaller, and reasonably robust Rust program. The Rust program is a recognizable descendant of the C program, and that was deliberate: my goal was to compare and contrast the two languages for optimized code.
In this bonus section, I'll walk through how we'd write the program from scratch in Rust. In particular, I'm going to rely on compiler auto-vectorization to produce a program that is shorter, simpler, portable, and significantly faster… and without any `unsafe`.
Can it be? Read on…
A review
Recall from part 5 that we had a mostly-safe Rust program which nevertheless contained two kinds of `unsafe` code:

- Explicit Intel SSE vector intrinsics, which only work properly on certain processors, rendering us non-portable.
- Type-punning of memory between `f64` and vector types, to support the vector intrinsics.
As for performance, we were doing alright; here’s where things stood, using Clang as a baseline:
| Command | Mean [s] | Min [s] | Max [s] | Ratio |
|---|---|---|---|---|
| ./nbody.gcc-8.bench 50000000 | 6.199 ± 0.006 | 6.185 | 6.206 | 1.17x |
| ./nbody.clang-8.bench 50000000 | 5.277 ± 0.007 | 5.258 | 5.282 | 1.00x |
| ./nbody-5.bench 50000000 | 5.121 ± 0.003 | 5.118 | 5.127 | 0.97x |
Back to basics: a simple idiomatic version
From the original C program, we inherited a bunch of optimizations. Let’s undo them, and implement the essential algorithm without making too many assumptions. Because the algorithm can be implemented with a few simple loops, processing array elements that don’t depend on one another, it’s an excellent candidate for auto-vectorization.
Rusty bits
I'm sticking to fairly basic Rust here, since people reading this may not be fluent yet, but I'm also writing this program authentically — this is really how I'd solve this problem. The problem doesn't require a lot of whizbang features.
The main thing you’re going to be seeing below is iterators. Iterators in Rust play a similar role to iterators in C++, but that comparison will quickly lead you astray, as the two are quite different. You don’t need to deeply understand iterators to read this section, but I’d suggest at least reading over the Iterators section of the Rust book to familiarize yourself with the pattern. I’ll discuss some of the nuanced bits as we encounter them.
Finally, a word on code style: I'm using the Rust standard conventions as implemented by the `rustfmt` tool, because doing this is automatic and super easy.
I would consider implementing a `Vec3` type with arithmetic operator overloading, to keep the actual simulation code simpler, but I'm trying not to assume that you understand Rust traits yet.
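Had I gone that route, the heart of it would be something like this sketch — `std::ops::Add` is the trait behind the `+` operator:

use std::ops::Add;

// A glimpse of the `Vec3` I decided against.
#[derive(Clone, Copy, Debug)]
struct Vec3([f64; 3]);

impl Add for Vec3 {
    type Output = Vec3;

    fn add(self, other: Vec3) -> Vec3 {
        Vec3([
            self.0[0] + other.0[0],
            self.0[1] + other.0[1],
            self.0[2] + other.0[2],
        ])
    }
}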
Fundamentals
Each body in the solar system simulation is represented by this `struct`:
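/// One body in the simulation: the Sun or a planet.
#[derive(Clone, Debug)]
struct Body {
    position: [f64; 3],
    velocity: [f64; 3],
    mass: f64,
}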
I'm deriving `Clone` (meaning it's okay if this type gets copied explicitly) and `Debug` (for pretty debugging output, should we need it). This is by habit; neither feature is required here.
We’ll use the following constants:
use std::f64::consts::PI;

/// Number of bodies modeled in the simulation.
const BODIES_COUNT: usize = 5;

const SOLAR_MASS: f64 = 4. * PI * PI;
const DAYS_PER_YEAR: f64 = 365.24;

/// Number of body-body interactions.
const INTERACTIONS: usize = BODIES_COUNT * (BODIES_COUNT - 1) / 2;
Finally, we have the constant table giving the starting state of the simulation. This is the same data as the C program used, but I’ve reformatted the constants:
- I've put in thousands separators, so `4.841431442` becomes `4.841_431_442`, because I find that easier to read.
- The constants had more digits than the actual precision of a `double`/`f64` allows. I've rounded them to their actual values.
Rust’s linter, Clippy, pointed that second one out, which I appreciated — I don’t like my constants to be dishonest!
/// Initial state of the simulation.
const STARTING_STATE: [Body; BODIES_COUNT] = [
    // The Sun starts at the origin; `offset_momentum` will fix up its velocity.
    Body {
        mass: SOLAR_MASS,
        position: [0.; 3],
        velocity: [0.; 3],
    },
    // ...followed by Jupiter, Saturn, Uranus, and Neptune. The long table of
    // their positions, velocities, and masses is omitted here for brevity.
];
offset_momentum

Since the order of items in a file doesn't matter in Rust, I'm once again visiting the functions from shortest to longest, starting with `offset_momentum`.
The algorithm uses `offset_momentum` once, at startup, to adjust the sun's velocity. This means it's probably not a hot spot. My Rust version of the operation looks like this:
/// Adjust the Sun's velocity to offset system momentum.
fn offset_momentum(bodies: &mut [Body; BODIES_COUNT]) {
    // Note 1: split `bodies` into the sun and everything else.
    let (sun, planets) = bodies.split_first_mut().unwrap();
    for planet in planets {
        for m in 0..3 {
            sun.velocity[m] -= planet.velocity[m] * planet.mass / SOLAR_MASS;
        }
    }
}
This is a slightly different algorithm from what C used, but the numeric results are the same. C subtracted each body’s velocity from the sun’s… including the sun’s… which was a complex way of initializing the sun’s velocity to zero. This not only seemed odd, but it made it hard to use an iterator.
You see, in each iteration of the C loop, two bodies are touched:

- `bodies[0]`, the sun, gets written.
- `bodies[i]`, which can be either the sun or a planet, gets read.
On iteration 0, `bodies[0]` and `bodies[i]` are the same — the names alias, and Rust has restrictive checks around aliasing and mutation to prevent data races and iterator invalidation, a common C++ bug.
To short-circuit this, I've made use of a common Rust pattern, which is to split the array into two non-overlapping sections (Note 1 above), the sun on one side, and all the planets on the other. (This is another example of an operation that uses `unsafe` under the hood, but presents an API that we can't misuse from safe code.)
Now I can mutate the `sun` freely even while iterating over the `planets`.
Using iterators like this is a great way of avoiding bounds checks, by not requesting them in the first place — the iterator is by definition restricted to valid bounds.
If iterators are so great, you might be wondering why I didn't use them for the `m` loop. This loop is walking over two parallel arrays, `sun.velocity` and `planet.velocity`. You can do that with iterators, but I think the code winds up kind of ugly, and in my tests, there's no performance advantage either way. So, aesthetics win.
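For the curious, the zipped version would look something like this sketch:

// The iterator formulation I decided against: walk the two velocity arrays
// in lockstep with `zip` instead of indexing them with `m`.
for (sun_v, planet_v) in sun.velocity.iter_mut().zip(&planet.velocity) {
    *sun_v -= planet_v * planet.mass / SOLAR_MASS;
}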
output_energy and sqr
This routine gets called to print the energy before and after the simulation has been run. The output gets used to check that the program ran correctly.
At this point, let me introduce the utility function `sqr`, because I got tired of writing things like `body.velocity[0] * body.velocity[0]`.
// A convenient way of computing `x` squared.
fn sqr(x: f64) -> f64 {
    x * x
}
`sqr` is a place where one might be tempted to use a macro in C, but writing such a macro in a way that `x` doesn't get evaluated twice is tricky.
You'll basically never see a macro used for something like this in Rust code. A function is easier to write, less error-prone to apply, and Rust is fairly aggressive about inlining, so a function generally has no runtime cost over a macro. (If you're concerned that a function might not get inlined, you can slap the `#[inline(always)]` attribute on it and move on.)
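For the record, that looks like this:

#[inline(always)]
fn sqr(x: f64) -> f64 {
    x * x
}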
Right — now that we've got `sqr`, on to `output_energy`.
/// Print the system energy.
fn output_energy(bodies: &[Body; BODIES_COUNT]) {
    let mut energy = 0.;
    for (i, body) in bodies.iter().enumerate() { // Note 1
        // Add the kinetic energy for each body.
        energy += 0.5
            * body.mass
            * (sqr(body.velocity[0]) + sqr(body.velocity[1]) + sqr(body.velocity[2]));

        // Add the potential energy between this body and every later body.
        for body2 in &bodies[i + 1..] { // Note 2
            energy -= body.mass * body2.mass
                / f64::sqrt(
                    sqr(body.position[0] - body2.position[0])
                        + sqr(body.position[1] - body2.position[1])
                        + sqr(body.position[2] - body2.position[2]),
                );
        }
    }
    println!("{:.9}", energy);
}
The method being used here is the same as in C, but there are some unfamiliar parts to the code.
Note 1: The Rust code `for (i, body) in bodies.iter().enumerate()` is a common way to process all the items in a collection with their indices. The `iter()` gets an iterator over each `body`, and the `enumerate()` operation (available on any iterator) extends it with an upwards-counting number.
Okay, so Rust provides a new weird way to write a `for` loop. Why did I use it here?
Because, as in `offset_momentum`, this approach doesn't request or require bounds checks, so I don't have to measure the results to tell whether the compiler was able to eliminate them. It will also adapt to any changes in the length of the `bodies` array, without even needing to rely on a common constant. And because, in general, it's faster than an explicit indexing loop in practice.

You'll see this pattern a lot; it's worth understanding.
Note 2: The other significant Rust-ism here is how we express the `body2` loop, which was `j` in the C version. The goal is, for each `body`, to iterate over every other body with a higher index. We express this by slicing the array starting at `i + 1` and iterating over the result. This technically asks for a check that `i + 1` is in bounds, and that `BODIES_COUNT` is equal to or less than the length — but that's better than checking every time through the loop, and in practice, the compiler is great at optimizing this away.
I strongly prefer this phrasing of the loop, because we no longer have to keep `i` and `j` straight in the equations below. I have an easier time telling `body` from `body2` than `bodies[i]` from `bodies[j]`.
advance
The `advance` operation pushes the simulation forward by one time step. It's the biggest hot spot in the program, executed some 50 million times during a benchmark run. It's where most of the optimizations were applied in C. But let's take a step back and consider what it actually does.
Using the method from the C program, `advance` consists of four phases:

- Compute the vector between each pair of bodies in the system.
- Compute the magnitude of the gravitational force given those vectors.
- Apply gravitation from each body to every other body's velocity.
- Update each body's position based on its velocity.
Let’s visit each of those steps in turn.
/// Steps the simulation forward by one time-step.
fn advance(bodies: &mut [Body; BODIES_COUNT]) {
    // Compute the vector between each unique pair of bodies.
    let mut position_deltas = [[0.; 3]; INTERACTIONS];
    {
        let mut k = 0;
        for i in 0..BODIES_COUNT - 1 {
            for j in i + 1..BODIES_COUNT {
                for (m, delta) in position_deltas[k].iter_mut().enumerate() {
                    *delta = bodies[i].position[m] - bodies[j].position[m];
                }
                k += 1;
            }
        }
    }
That loop is nearly unmodified from the C translation. I played with replacing the range-based loops with iterators, but I didn't stick with the result, for two reasons. First, I thought it was kind of hard to read. Second, it messed up the compiler's ability to reason about `k`. Right now, the compiler is smart enough to look at the way `k` is maintained and determine that it's always in bounds for `position_deltas[k]`; adding some iterator implementations to the mix seemed to break this. Usually, iterators help avoid bounds checks, but this is a complex case.
I am, however, using an iterator to loop over each dimension of the vector, which actually improved code generation a bit.
Note that when we reach the end of that code snippet, `position_deltas` is still mutable (`mut`). We don't need to change it ever again, so it doesn't have to be `mut`. We'll see an idiom for cases like this in just a moment.
Moving on:
// Compute the `1/d^3` magnitude between each pair of bodies.
let magnitudes = {
    let mut magnitudes = [0.; INTERACTIONS];
    for (mag, delta) in magnitudes.iter_mut().zip(&position_deltas) {
        let distance_squared = sqr(delta[0]) + sqr(delta[1]) + sqr(delta[2]);
        *mag = 0.01 / (distance_squared * distance_squared.sqrt());
    }
    magnitudes
};
This time, I'm using the "freeze" pattern: update a mutable array in place within the block, but then assign it to a non-`mut` binding, preventing accidental further mutation. This array is small enough that `rustc` does the right thing. (I didn't do this for `position_deltas` because doing so generated a call to `memcpy`. This is likely to be a compiler bug, and illustrates why benchmarking your programs is important.)
If you’ve been following the translation of the C program in the rest of this series, you might notice that this code is really short. For performance, the previous programs used a faster but less accurate “reciprocal square root” operation, followed by two iterations of the Newton-Raphson method for approximating square roots to improve its precision.
That's clever, but:

- Using that "reciprocal square root" instruction makes us non-portable, because it's Intel-specific, and
- I'm keeping things simple until data shows me I can't.

So, I just used `sqrt`.
Now, the velocity update loop:
// Apply every other body's gravitation to each body's velocity.
{
    let mut k = 0;
    for i in 0..BODIES_COUNT - 1 {
        for j in i + 1..BODIES_COUNT {
            let i_mass_mag = bodies[i].mass * magnitudes[k];
            let j_mass_mag = bodies[j].mass * magnitudes[k];
            for (m, delta) in position_deltas[k].iter().enumerate() {
                bodies[i].velocity[m] -= delta * j_mass_mag;
                bodies[j].velocity[m] += delta * i_mass_mag;
            }
            k += 1;
        }
    }
}
Again, that loop is substantially similar to the version from part 5, but using an iterator where possible. (As in the first loop in `advance`, I haven't found an iterator-based formulation for the `i`/`j`/`k` loops that improves code generation.)
And finally, position updates:
// Update each body's position.
for body in bodies {
    for (position, velocity) in body.position.iter_mut().zip(&body.velocity) {
        *position += 0.01 * velocity;
    }
}
This is identical to the code from part 5 except for the iterators.
main
Finally, we need a `main` function to drive the whole shebang:
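fn main() {
    // Number of simulation steps, taken from the command line — 50 million
    // in the benchmark runs. (The precise argument handling here is a sketch.)
    let steps: usize = std::env::args()
        .nth(1)
        .expect("usage: nbody-6 <steps>")
        .parse()
        .expect("steps must be a number");

    let mut solar_bodies = STARTING_STATE;

    offset_momentum(&mut solar_bodies);
    output_energy(&solar_bodies);
    for _ in 0..steps {
        advance(&mut solar_bodies);
    }
    output_energy(&solar_bodies);
}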
That’s hardly changed from part 5 — in fact, I only changed some spelling and capitalization to match Rust conventions.
Building the code
The completed program is simply all of the pieces above, assembled into a single file, `nbody-6.rs`.
Normally, we’d build this program using Cargo, and that’s how I built it when writing this article, but because we don’t have any dependencies, you can just compile the code with:
$ rustc --edition=2018 \
-C opt-level=3 -C codegen-units=1 -C debuginfo=2 -C target-cpu=core2 \
--target x86_64-unknown-linux-musl \
nbody-6.rs -o nbody-6
That being said, because I want to encourage you to use Cargo, here’s how you would do it.
First, we need a directory containing a `Cargo.toml` file defining a build with the same options as our command line above.
[package]
name = "nbody"
version = "0.1.0"
authors = ["Cliff L. Biffle <code@cliffle.com>"]
edition = "2018"

[dependencies]

[profile.release]
opt-level = 3
codegen-units = 1
debug = 2
Next, because `Cargo.toml` is portable and unconcerned with certain build details, we need a `.cargo/config` file customizing the machine-specific parts of the build:
[build]
target = "x86_64-unknown-linux-musl" # static linking
rustflags = ["-C", "target-cpu=core2"]
Finally, place the source code in `src/main.rs` and run `cargo build --release`.
Cargo is good. You should use it if possible.
Okay, end of evangelism. Moving right along…
Inspecting the results
When you want to know if a compiler is behaving a certain way, you check the output. One way to check it is to measure its performance; another is to disassemble it and read the result.
We can disassemble the binary with
$ objdump -d --demangle nbody-6 | less
The listing is long, but inside `nbody_6::main` we find code that looks like this:
401cc0: 66 45 0f 59 c9 mulpd %xmm9,%xmm9
401cc5: 66 44 0f 58 cc addpd %xmm4,%xmm9
401cca: 66 41 0f 51 d1 sqrtpd %xmm9,%xmm2
401ccf: 66 41 0f 59 d1 mulpd %xmm9,%xmm2
Those are SSE vector instructions, alright. They even end in `pd`, for "packed doubles," which means they're operating on two `f64`s at a time. Since we didn't ask for that anywhere in our code, this is a sign that auto-vectorization is happening.
`rustc` is quite aggressive about auto-vectorizing programs, which is one reason why it's important to set `target-cpu` to something reasonable. By default, it assumes a very generic processor — which ensures that your binaries will run on your friend's computer, but may leave some performance on the table by not taking advantage of recent processor features.
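Two handy commands here: `rustc` can list the CPU names it understands, and `target-cpu=native` tunes for whatever machine you're compiling on (at the cost of the binary's portability):

$ rustc --print target-cpus
$ rustc -C target-cpu=native -C opt-level=3 nbody-6.rs -o nbody-6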
But is auto-vectorization actually helping? To find out, we’ll have to run it.
Performance evaluation
Let’s take a baseline measurement — surely we’ll need to optimize it further, right?
| Command | Mean [s] | Min [s] | Max [s] | Ratio |
|---|---|---|---|---|
| ./nbody.gcc-8.bench 50000000 | 6.199 ± 0.006 | 6.185 | 6.206 | 1.17x |
| ./nbody.clang-8.bench 50000000 | 5.277 ± 0.007 | 5.258 | 5.282 | 1.00x |
| ./nbody-5.bench 50000000 | 5.121 ± 0.003 | 5.118 | 5.127 | 0.97x |
| ./nbody-6.bench 50000000 | 3.853 ± 0.001 | 3.850 | 3.854 | 0.73x |
No, we won't. The simpler auto-vectorized program is significantly faster than the hand-optimized version. For each simulation step performed by the original program compiled by `gcc`, the `clang` version can do 1.2 steps, and the new one can do 1.6.
We’ve produced the fastest implementation so far, by doing the least work.
The pros and cons of auto-vectorization
Pro
Our new program is much simpler and clearer than the original C program. It expresses the algorithm in a straightforward way, which can be read and understood without knowing anything about SIMD instruction sets.
The program uses no `unsafe` of any kind, so we know without thinking very much that it's not going to crash, violate memory safety, or introduce security flaws like stack-smashing opportunities.
The program is entirely portable. There's no Intel-specific stuff in it, and in fact, no 64-bit-specific stuff. It compiles and runs fine on x86, x86-64, and 32- and 64-bit ARM, among others.
This program is also more accurate than the C program, because we used real `sqrt` instead of an approximation. (You could make it faster, but non-portable, by re-introducing the reciprocal square root trick.)
Finally, the program can automatically benefit from new instruction set improvements, simply by changing the `target-cpu`. For example, if you set `target-cpu=sandybridge` or later, the program will use AVX instead of SSE.
Con
So auto-vectorization is amazing and we never have to write vector code again, right?
That depends.
Auto-vectorization is magic. One of the downsides of magic is that, when it doesn't work, it can be hard to tell why.
For example, I mentioned above that you can recompile the program to use AVX. At face value, this should make the program twice as fast, because AVX instructions can chew on four `f64`s at a time instead of SSE's two. But, on my Skylake-based machine at least, the AVX version becomes slower.
By inspecting the disassembly and applying Linux's `perf` tool, I was able to discover why: when targeting AVX, `rustc` 1.39.0 fails to vectorize the `sqrt` in the magnitude computation loop. `sqrt` is the most expensive instruction we have, and instead of getting twice as cheap, it got twice as expensive. I was able to detect this with a benchmark, but to diagnose it I had to read the assembly listing — and that got me no closer to explaining why it fails.
Auto-vectorization may also require you to be careful about how you write your loops. The LLVM folks have some docs on auto-vectorization that show patterns to use and avoid (in C, but they’re similar in Rust).
Conclusion
We’ve written a simpler program that runs much faster than our hand-optimized code, thanks to the compiler. But we’ve also seen that auto-vectorization has drawbacks. So should we rely on it in real-world situations?
Personally? My answer is yes. In my experience, the amount of time I spend inspecting and profiling auto-vectorization hiccups is much less than the amount of time I could spend hand-optimizing an algorithm for a particular processor's vector unit. And even if I were to invest the extra work, I'd have to do it all again if I wanted to switch to ARM or AVX later.
Though, admittedly, hand-optimizing an algorithm for a particular processor, particularly in assembly language, is a lot of fun. But it's rarely an investment your clients are interested in making nowadays.

When we write high-performance code nowadays, we're always relying on some collection of optimization passes in the compiler. Any of them can glitch out and let us down, because compilers are, in the end, dumb pattern-matching machines. Auto-vectorization is one such optimization.
You should always have regression tests for your software’s important features, and if performance is one of your features, you should treat it as one:
- Document the compiler version you tested with, or ideally, pin it in your build system. (Rust has the `rust-toolchain` file to do this; see the example after this list.)
- Include a benchmark test that will indicate if things have suddenly gotten slower, so you can at least know to investigate. Ensure that this test runs as part of the build or continuous integration flow. (For Rust projects using Cargo, I recommend Criterion.)
- Make sure to run the benchmarks when you upgrade the compiler. I haven't had a `rustc` update break auto-vectorization before, but I know it's possible, because I've had `gcc` upgrades break me several times.
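Pinning the toolchain really is a one-liner: a `rust-toolchain` file in the project root containing just a version, such as the one this article was tested with:

$ echo "1.39.0" > rust-toolchain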
But you don’t have to agree with me. Maybe auto-vectorization asks you to put too much trust in the compiler. That’s fine!
In the next section, I'll look at how we can explicitly vector-optimize our algorithms in Rust, without using non-portable or `unsafe` code. Stay tuned!