Let The Compiler Do The Work
Learn Rust the Dangerous Way, Part 6
- A review
- Back to basics: a simple idiomatic version
- Building the code
- Inspecting the results
- Performance evaluation
- The pros and cons of auto-vectorization
- Conclusion
In this series so far, we’ve taken a C program and converted it into a faster, smaller, and reasonably robust Rust program. The Rust program is a recognizable descendant of the C program, and that was deliberate: my goal was to compare and contrast the two languages for optimized code.
In this bonus section, I'll walk through how we'd write the program from scratch in Rust. In particular, I'm going to rely on compiler auto-vectorization to produce a program that is shorter, simpler, portable, and significantly faster… and without any `unsafe`.
Can it be? Read on…
A review
Recall from part 5 that we had a mostly-safe Rust program which nevertheless contained two kinds of `unsafe` code:

- Explicit Intel SSE vector intrinsics, which only work properly on certain processors, rendering us non-portable.
- Type-punning of memory between `f64` and vector types, to support the vector intrinsics.
As for performance, we were doing alright; here’s where things stood, using Clang as a baseline:
| Command | Mean [s] | Min [s] | Max [s] | Ratio |
|---|---|---|---|---|
| ./nbody.gcc-8.bench 50000000 | 6.199 ± 0.006 | 6.185 | 6.206 | 1.17x |
| ./nbody.clang-8.bench 50000000 | 5.277 ± 0.007 | 5.258 | 5.282 | 1.00x |
| ./nbody-5.bench 50000000 | 5.121 ± 0.003 | 5.118 | 5.127 | 0.97x |
Back to basics: a simple idiomatic version
From the original C program, we inherited a bunch of optimizations. Let’s undo them, and implement the essential algorithm without making too many assumptions. Because the algorithm can be implemented with a few simple loops, processing array elements that don’t depend on one another, it’s an excellent candidate for auto-vectorization.
Rusty bits
I'm sticking to fairly basic Rust here, since people reading this may not be fluent yet, but I'm also writing this program authentically — this is really how I'd solve this problem. The problem doesn't require a lot of whizbang features.
The main thing you’re going to be seeing below is iterators. Iterators in Rust play a similar role to iterators in C++, but that comparison will quickly lead you astray, as the two are quite different. You don’t need to deeply understand iterators to read this section, but I’d suggest at least reading over the Iterators section of the Rust book to familiarize yourself with the pattern. I’ll discuss some of the nuanced bits as we encounter them.
Finally, a word on code style: I'm using the Rust standard conventions as implemented by the `rustfmt` tool, because doing this is automatic and super easy.
I would consider implementing a `Vec3` type with arithmetic operator overloading, to keep the actual simulation code simpler, but I'm trying not to assume that you understand Rust traits yet.
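Had I gone that route, the heart of it would be something like this sketch — `std::ops::Add` is the trait behind the `+` operator:

use std::ops::Add;

// A glimpse of the `Vec3` I decided against.
#[derive(Clone, Copy, Debug)]
struct Vec3([f64; 3]);

impl Add for Vec3 {
    type Output = Vec3;

    fn add(self, other: Vec3) -> Vec3 {
        Vec3([
            self.0[0] + other.0[0],
            self.0[1] + other.0[1],
            self.0[2] + other.0[2],
        ])
    }
}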
Fundamentals
Each body in the solar system simulation is represented by this `struct`:
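/// One body in the simulation: the Sun or a planet.
#[derive(Clone, Debug)]
struct Body {
    position: [f64; 3],
    velocity: [f64; 3],
    mass: f64,
}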
I'm deriving `Clone` (meaning it's okay if this type gets copied explicitly) and `Debug` (for pretty debugging output, should we need it). This is by habit; neither feature is required here.
We’ll use the following constants:
use std::f64::consts::PI;

/// Number of bodies modeled in the simulation.
const BODIES_COUNT: usize = 5;

const SOLAR_MASS: f64 = 4. * PI * PI;
const DAYS_PER_YEAR: f64 = 365.24;

/// Number of body-body interactions.
const INTERACTIONS: usize = BODIES_COUNT * (BODIES_COUNT - 1) / 2;
Finally, we have the constant table giving the starting state of the simulation. This is the same data as the C program used, but I’ve reformatted the constants:
- I've put in thousands separators, so `4.841431442` becomes `4.841_431_442`, because I find that easier to read.
- The constants had more digits than the actual precision of a `double`/`f64` allows. I've rounded them to their actual values.
Rust’s linter, Clippy, pointed that second one out, which I appreciated — I don’t like my constants to be dishonest!
/// Initial state of the simulation.
const STARTING_STATE: [Body; BODIES_COUNT] = [
    // The Sun starts at the origin; `offset_momentum` will fix up its velocity.
    Body {
        mass: SOLAR_MASS,
        position: [0.; 3],
        velocity: [0.; 3],
    },
    // ...followed by Jupiter, Saturn, Uranus, and Neptune. The long table of
    // their positions, velocities, and masses is omitted here for brevity.
];
offset_momentum

Since the order of items in a file doesn't matter in Rust, I'm once again visiting the functions from shortest to longest, starting with `offset_momentum`.
The algorithm uses `offset_momentum` once, at startup, to adjust the sun's velocity. This means it's probably not a hot spot. My Rust version of the operation looks like this:
/// Adjust the Sun's velocity to offset system momentum.
fn offset_momentum(bodies: &mut [Body; BODIES_COUNT]) {
    // Note 1: split `bodies` into the sun and everything else.
    let (sun, planets) = bodies.split_first_mut().unwrap();
    for planet in planets {
        for m in 0..3 {
            sun.velocity[m] -= planet.velocity[m] * planet.mass / SOLAR_MASS;
        }
    }
}
This is a slightly different algorithm from what C used, but the numeric results are the same. C subtracted each body’s velocity from the sun’s… including the sun’s… which was a complex way of initializing the sun’s velocity to zero. This not only seemed odd, but it made it hard to use an iterator.
You see, in each iteration of the C loop, two bodies are touched:

- `bodies[0]`, the sun, gets written.
- `bodies[i]`, which can be either the sun or a planet, gets read.
On iteration 0, `bodies[0]` and `bodies[i]` are the same — the names alias, and Rust has restrictive checks around aliasing and mutation to prevent data races and iterator invalidation, a common C++ bug.
To short-circuit this, I've made use of a common Rust pattern, which is to split the array into two non-overlapping sections (Note 1 above), the sun on one side, and all the planets on the other. (This is another example of an operation that uses `unsafe` under the hood, but presents an API that we can't misuse from safe code.)
Now I can mutate the `sun` freely even while iterating over the `planets`.
Using iterators like this is a great way of avoiding bounds checks, by not requesting them in the first place — the iterator is by definition restricted to valid bounds.
If iterators are so great, you might be wondering why I didn't use them for the `m` loop. This loop is walking over two parallel arrays, `sun.velocity` and `planet.velocity`. You can do that with iterators, but I think the code winds up kind of ugly, and in my tests, there's no performance advantage either way. So, aesthetics win.
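For the curious, the zipped version would look something like this sketch:

// The iterator formulation I decided against: walk the two velocity arrays
// in lockstep with `zip` instead of indexing them with `m`.
for (sun_v, planet_v) in sun.velocity.iter_mut().zip(&planet.velocity) {
    *sun_v -= planet_v * planet.mass / SOLAR_MASS;
}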
output_energy and sqr
This routine gets called to print the energy before and after the simulation has been run. The output gets used to check that the program ran correctly.
At this point, let me introduce the utility function `sqr`, because I got tired of writing things like `body.velocity[0] * body.velocity[0]`.
// A convenient way of computing `x` squared.
fn sqr(x: f64) -> f64 {
    x * x
}
`sqr` is a place where one might be tempted to use a macro in C, but writing such a macro in a way that `x` doesn't get evaluated twice is tricky.
You'll basically never see a macro used for something like this in Rust code. A function is easier to write, less error-prone to apply, and Rust is fairly aggressive about inlining, so a function generally has no runtime cost over a macro. (If you're concerned that a function might not get inlined, you can slap the `#[inline(always)]` attribute on it and move on.)
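For the record, that looks like this:

#[inline(always)]
fn sqr(x: f64) -> f64 {
    x * x
}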
Right — now that we've got `sqr`, on to `output_energy`.
/// Print the system energy.
fn output_energy(bodies: &[Body; BODIES_COUNT]) {
    let mut energy = 0.;
    for (i, body) in bodies.iter().enumerate() { // Note 1
        // Add the kinetic energy for each body.
        energy += 0.5
            * body.mass
            * (sqr(body.velocity[0]) + sqr(body.velocity[1]) + sqr(body.velocity[2]));

        // Add the potential energy between this body and every later body.
        for body2 in &bodies[i + 1..] { // Note 2
            energy -= body.mass * body2.mass
                / f64::sqrt(
                    sqr(body.position[0] - body2.position[0])
                        + sqr(body.position[1] - body2.position[1])
                        + sqr(body.position[2] - body2.position[2]),
                );
        }
    }
    println!("{:.9}", energy);
}
The method being used here is the same as in C, but there are some unfamiliar parts to the code.
Note 1: The Rust code `for (i, body) in bodies.iter().enumerate()` is a common way to process all the items in a collection with their indices. The `iter()` gets an iterator over each `body`, and the `enumerate()` operation (available on any iterator) extends it with an upwards-counting number.
Okay, so Rust provides a new weird way to write a `for` loop. Why did I use it here?
Because, as in `offset_momentum`, this approach doesn't request or require bounds checks, so I don't have to measure the results to tell whether the compiler was able to eliminate them. It will also adapt to any changes in the length of the `bodies` array, without even needing to rely on a common constant. And because, in general, it's faster than an explicit indexing loop in practice.

You'll see this pattern a lot; it's worth understanding.
Note 2: The other significant Rust-ism here is how we express the `body2` loop, which was `j` in the C version. The goal is, for each `body`, to iterate over every other body with a higher index. We express this by slicing the array starting at `i + 1` and iterating over the result. This technically asks for a check that `i + 1` is in bounds, and that `BODIES_COUNT` is equal to or less than the length — but that's better than checking every time through the loop, and in practice, the compiler is great at optimizing this away.
I strongly prefer this phrasing of the loop, because we no longer have to keep `i` and `j` straight in the equations below. I have an easier time telling `body` from `body2` than `bodies[i]` from `bodies[j]`.
advance
The `advance` operation pushes the simulation forward by one time step. It's the biggest hot spot in the program, executed some 50 million times during a benchmark run. It's where most of the optimizations were applied in C. But let's take a step back and consider what it actually does.
Using the method from the C program, `advance` consists of four phases:

- Compute the vector between each pair of bodies in the system.
- Compute the magnitude of the gravitational force given those vectors.
- Apply gravitation from each body to every other body's velocity.
- Update each body's position based on its velocity.
Let’s visit each of those steps in turn.
/// Steps the simulation forward by one time-step.
fn advance(bodies: &mut [Body; BODIES_COUNT]) {
    // Compute the vector between each unique pair of bodies.
    let mut position_deltas = [[0.; 3]; INTERACTIONS];
    {
        let mut k = 0;
        for i in 0..BODIES_COUNT - 1 {
            for j in i + 1..BODIES_COUNT {
                for (m, delta) in position_deltas[k].iter_mut().enumerate() {
                    *delta = bodies[i].position[m] - bodies[j].position[m];
                }
                k += 1;
            }
        }
    }
That loop is nearly unmodified from the C translation. I played with replacing the range-based loops with iterators, but I didn't stick with the result, for two reasons. First, I thought it was kind of hard to read. Second, it messed up the compiler's ability to reason about `k`. Right now, the compiler is smart enough to look at the way `k` is maintained and determine that it's always in bounds for `position_deltas[k]`; adding some iterator implementations to the mix seemed to break this. Usually, iterators help avoid bounds checks, but this is a complex case.
I am, however, using an iterator to loop over each dimension of the vector, which actually improved code generation a bit.
Note that when we reach the end of that code snippet, `position_deltas` is still mutable (`mut`). We don't need to change it ever again, so it doesn't have to be `mut`. We'll see an idiom for cases like this in just a moment.
Moving on:
// Compute the `1/d^3` magnitude between each pair of bodies.
let magnitudes = {
    let mut magnitudes = [0.; INTERACTIONS];
    for (mag, delta) in magnitudes.iter_mut().zip(&position_deltas) {
        let distance_squared = sqr(delta[0]) + sqr(delta[1]) + sqr(delta[2]);
        *mag = 0.01 / (distance_squared * distance_squared.sqrt());
    }
    magnitudes
};
This time, I'm using the "freeze" pattern: update a mutable array in place within the block, but then assign it to a non-`mut` binding, preventing accidental further mutation. This array is small enough that `rustc` does the right thing. (I didn't do this for `position_deltas` because doing so generated a call to `memcpy`. This is likely to be a compiler bug, and illustrates why benchmarking your programs is important.)
If you’ve been following the translation of the C program in the rest of this series, you might notice that this code is really short. For performance, the previous programs used a faster but less accurate “reciprocal square root” operation, followed by two iterations of the Newton-Raphson method for approximating square roots to improve its precision.
That's clever, but:

- Using that "reciprocal square root" instruction makes us non-portable, because it's Intel-specific, and
- I'm keeping things simple until data shows me I can't.

So, I just used `sqrt`.
Now, the velocity update loop:
// Apply every other body's gravitation to each body's velocity.
{
    let mut k = 0;
    for i in 0..BODIES_COUNT - 1 {
        for j in i + 1..BODIES_COUNT {
            let i_mass_mag = bodies[i].mass * magnitudes[k];
            let j_mass_mag = bodies[j].mass * magnitudes[k];
            for (m, delta) in position_deltas[k].iter().enumerate() {
                bodies[i].velocity[m] -= delta * j_mass_mag;
                bodies[j].velocity[m] += delta * i_mass_mag;
            }
            k += 1;
        }
    }
}
Again, that loop is substantially similar to the version from part 5, but using an iterator where possible. (As in the first loop in `advance`, I haven't found an iterator-based formulation for the `i`/`j`/`k` loops that improves code generation.)
And finally, position updates:
// Update each body's position.
for body in bodies {
    for (position, velocity) in body.position.iter_mut().zip(&body.velocity) {
        *position += 0.01 * velocity;
    }
}
This is identical to the code from part 5 except for the iterators.
main
Finally, we need a `main` function to drive the whole shebang:
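fn main() {
    // Number of simulation steps, taken from the command line — 50 million
    // in the benchmark runs. (The precise argument handling here is a sketch.)
    let steps: usize = std::env::args()
        .nth(1)
        .expect("usage: nbody-6 <steps>")
        .parse()
        .expect("steps must be a number");

    let mut solar_bodies = STARTING_STATE;

    offset_momentum(&mut solar_bodies);
    output_energy(&solar_bodies);
    for _ in 0..steps {
        advance(&mut solar_bodies);
    }
    output_energy(&solar_bodies);
}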
That’s hardly changed from part 5 — in fact, I only changed some spelling and capitalization to match Rust conventions.
Building the code
The completed program is simply all of the pieces above, assembled into a single file, `nbody-6.rs`.
Normally, we’d build this program using Cargo, and that’s how I built it when writing this article, but because we don’t have any dependencies, you can just compile the code with:
$ rustc --edition=2018 \
-C opt-level=3 -C codegen-units=1 -C debuginfo=2 -C target-cpu=core2 \
--target x86_64-unknown-linux-musl \
nbody-6.rs -o nbody-6
That being said, because I want to encourage you to use Cargo, here’s how you would do it.
First, we need a directory containing a `Cargo.toml` file defining a build with the same options as our command line above.
[package]
name = "nbody"
version = "0.1.0"
authors = ["Cliff L. Biffle <code@cliffle.com>"]
edition = "2018"

[dependencies]

[profile.release]
opt-level = 3
codegen-units = 1
debug = 2
Next, because `Cargo.toml` is portable and unconcerned with certain build details, we need a `.cargo/config` file customizing the machine-specific parts of the build:
[build]
target = "x86_64-unknown-linux-musl" # static linking
rustflags = ["-C", "target-cpu=core2"]
Finally, place the source code in `src/main.rs` and run `cargo build --release`.
Cargo is good. You should use it if possible.
Okay, end of evangelism. Moving right along…
Inspecting the results
When you want to know if a compiler is behaving a certain way, you check the output. One way to check it is to measure its performance; another is to disassemble it and read the result.
We can disassemble the binary with
$ objdump -d --demangle nbody-6 | less
The listing is long, but inside `nbody_6::main` we find code that looks like this:
401cc0: 66 45 0f 59 c9 mulpd %xmm9,%xmm9
401cc5: 66 44 0f 58 cc addpd %xmm4,%xmm9
401cca: 66 41 0f 51 d1 sqrtpd %xmm9,%xmm2
401ccf: 66 41 0f 59 d1 mulpd %xmm9,%xmm2
Those are SSE vector instructions, alright. They even end in `pd`, for "packed doubles," which means they're operating on two `f64`s at a time. Since we didn't ask for that anywhere in our code, this is a sign that auto-vectorization is happening.
`rustc` is quite aggressive about auto-vectorizing programs, which is one reason why it's important to set `target-cpu` to something reasonable. By default, it assumes a very generic processor — which ensures that your binaries will run on your friend's computer, but may leave some performance on the table by not taking advantage of recent processor features.
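Two handy commands here: `rustc` can list the CPU names it understands, and `target-cpu=native` tunes for whatever machine you're compiling on (at the cost of the binary's portability):

$ rustc --print target-cpus
$ rustc -C target-cpu=native -C opt-level=3 nbody-6.rs -o nbody-6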
But is auto-vectorization actually helping? To find out, we’ll have to run it.
Performance evaluation
Let’s take a baseline measurement — surely we’ll need to optimize it further, right?
| Command | Mean [s] | Min [s] | Max [s] | Ratio |
|---|---|---|---|---|
| ./nbody.gcc-8.bench 50000000 | 6.199 ± 0.006 | 6.185 | 6.206 | 1.17x |
| ./nbody.clang-8.bench 50000000 | 5.277 ± 0.007 | 5.258 | 5.282 | 1.00x |
| ./nbody-5.bench 50000000 | 5.121 ± 0.003 | 5.118 | 5.127 | 0.97x |
| ./nbody-6.bench 50000000 | 3.853 ± 0.001 | 3.850 | 3.854 | 0.73x |
No, we won't. The simpler auto-vectorized program is significantly faster than the hand-optimized version. For each simulation step performed by the original program compiled by `gcc`, the `clang` version can do 1.2 steps, and the new one can do 1.6.
We’ve produced the fastest implementation so far, by doing the least work.
The pros and cons of auto-vectorization
Pro
Our new program is much simpler and clearer than the original C program. It expresses the algorithm in a straightforward way, which can be read and understood without knowing anything about SIMD instruction sets.
The program uses no `unsafe` of any kind, so we know without thinking very much that it's not going to crash, violate memory safety, or introduce security flaws like stack-smashing opportunities.
The program is entirely portable. There's no Intel-specific stuff in it, and in fact, no 64-bit-specific stuff. It compiles and runs fine on x86, x86-64, and 32- and 64-bit ARM, among others.
This program is also more accurate than the C program, because we used real `sqrt` instead of an approximation. (You could make it faster, but non-portable, by re-introducing the reciprocal square root trick.)
Finally, the program can automatically benefit from new instruction set improvements, simply by changing the `target-cpu`. For example, if you set `target-cpu=sandybridge` or later, the program will use AVX instead of SSE.
Con
So auto-vectorization is amazing and we never have to write vector code again, right?
That depends.
Auto-vectorization is magic. One of the downsides of magic is that, when it doesn't work, it can be hard to tell why.
For example, I mentioned above that you can recompile the program to use AVX. At face value, this should make the program twice as fast, because AVX instructions can chew on four `f64`s at a time instead of SSE's two. But, on my Skylake-based machine at least, the AVX version becomes slower.
By inspecting the disassembly and applying Linux's `perf` tool, I was able to discover why: when targeting AVX, `rustc` 1.39.0 fails to vectorize the `sqrt` in the magnitude computation loop. `sqrt` is the most expensive instruction we have, and instead of getting twice as cheap, it got twice as expensive. I was able to detect this with a benchmark, but to diagnose it I had to read the assembly listing — and that got me no closer to explaining why it fails.
Auto-vectorization may also require you to be careful about how you write your loops. The LLVM folks have some docs on auto-vectorization that show patterns to use and avoid (in C, but they’re similar in Rust).
Conclusion
We’ve written a simpler program that runs much faster than our hand-optimized code, thanks to the compiler. But we’ve also seen that auto-vectorization has drawbacks. So should we rely on it in real-world situations?
Personally? My answer is yes. In my experience, the amount of time I spend inspecting and profiling auto-vectorization hiccups is much less than the amount of time I could spend hand-optimizing an algorithm for a particular processor's vector unit. And even if I were to invest the extra work, I'd have to do it all again if I wanted to switch to ARM or AVX later.
Though, admittedly, hand-optimizing an algorithm for a particular processor, particularly in assembly language, is a lot of fun. But it's rarely an investment your clients are interested in making nowadays.

When we write high-performance code nowadays, we're always relying on some collection of optimization passes in the compiler. Any of them can glitch out and let us down, because compilers are, in the end, dumb pattern-matching machines. Auto-vectorization is one such optimization.
You should always have regression tests for your software’s important features, and if performance is one of your features, you should treat it as one:
- Document the compiler version you tested with, or ideally, pin it in your build system. (Rust has the `rust-toolchain` file to do this; see the example after this list.)
- Include a benchmark test that will indicate if things have suddenly gotten slower, so you can at least know to investigate. Ensure that this test runs as part of the build or continuous integration flow. (For Rust projects using Cargo, I recommend Criterion.)
- Make sure to run the benchmarks when you upgrade the compiler. I haven't had a `rustc` update break auto-vectorization before, but I know it's possible, because I've had `gcc` upgrades break me several times.
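Pinning the toolchain really is a one-liner: a `rust-toolchain` file in the project root containing just a version, such as the one this article was tested with:

$ echo "1.39.0" > rust-toolchain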
But you don’t have to agree with me. Maybe auto-vectorization asks you to put too much trust in the compiler. That’s fine!
In the next section, I'll look at how we can explicitly vector-optimize our algorithms in Rust, without using non-portable or `unsafe` code. Stay tuned!