
# Optimizing a Simple Ray-Tracer Written in Go, Part 2 — Calli…

This is the second part of my mini-series on how I used the Go profiling and built-in benchmarking tools to optimize a naive ray-tracer written in Go. For part 1, click here.

1. Sorting efficiently
2. Slices vs Arrays
3. Pointers or values?
4. AVX2 using c2goasm
5. A final word on threading
6. Conclusion

(Note: Most of this blog post was written based on using Go 1.13)

This part takes off directly after part 1.

## Source Code

The full source code for my little ray-tracer can be found here: https://github.com/eriklupander/rt

## 1. Sorting efficiently

Once I had gotten past all those pre-allocations in section 5 of part 1, I was starting to run out of low-hanging fruit to optimize. A new heap profile to the rescue!

The heap profile pointed at `reflectlite.Swapper`, which needed closer examination. The ray-tracing code must sometimes sort things, intersections in particular. The order of intersections can be important for things like refraction (how light changes direction when entering and exiting various materials). Therefore, every time I had found the intersections of a Ray, I sorted them using `sort.Slice`:

At this point, sorting accounted for about 16% of all allocations. The quick fix was to switch to `sort.Sort`, which requires implementing the `sort.Interface` interface.

This required creating a new type for []Intersection that implemented this interface:

I then just had to change a few method signatures to take `Intersections` instead of `[]Intersection` and I could use `sort.Sort` which allocated way less memory.

This improvement reduced the duration from 1.9 seconds to 1.6 seconds and decreased the number of allocations by about 3 million.

## 2. Slices vs Arrays

Which one is better for representing 4 64-bit floating-point values in a naive ray-tracer?

I originally implemented my `Tuple4` and `Mat4x4` structs with slices as the underlying storage. I tried to read up on the performance-related pros and cons of using slices over arrays, with the verdict often being that a slice is just a small header (a pointer, a length, and a capacity) referring to an underlying array, so it should be just as fast.

Well – at some point I decided I needed to test both ways and carefully benchmark them using `go test -bench`, but mainly by looking at total render time.

I started by refactoring the entire codebase to use arrays as the backing storage for `Tuple4` and `Mat4x4`.

I think one of the largest benefits was that I now indexed directly into the underlying arrays instead of accessing `Tuple4` values through either `Elems` or a `Get(index int)` method in the mathematical functions. Here's a simple example from code that multiplies a row in one matrix by a column in another:
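A sketch of the pattern (not the exact repo code): each output element is a row of the first matrix dotted with a column of the second, indexed straight into the flat arrays.

```go
package main

import "fmt"

// Mat4x4 stores a 4x4 matrix row-major in a flat [16]float64.
type Mat4x4 [16]float64

// Multiply computes m1 x m2. Row r of m1 starts at index r*4;
// column c of m2 is the elements c, 4+c, 8+c, 12+c.
func Multiply(m1, m2 Mat4x4) Mat4x4 {
	var out Mat4x4
	for row := 0; row < 4; row++ {
		for col := 0; col < 4; col++ {
			out[row*4+col] = m1[row*4]*m2[col] +
				m1[row*4+1]*m2[4+col] +
				m1[row*4+2]*m2[8+col] +
				m1[row*4+3]*m2[12+col]
		}
	}
	return out
}

func main() {
	identity := Mat4x4{
		1, 0, 0, 0,
		0, 1, 0, 0,
		0, 0, 1, 0,
		0, 0, 0, 1,
	}
	m := Mat4x4{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16}
	fmt.Println(Multiply(identity, m) == m) // true
}
```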

Honestly, I still can't say for sure exactly why arrays were faster for my particular use-case, but the speed-up turned out to be significant. The thing is, this refactoring touched so much of the code base that micro-benchmarking individual functions was not very conclusive.

However – in the end, rendering the reference image went from 1.6 to 1.1 seconds. At this point, getting a ~30% decrease in total render time was very welcome.

## 3. Pointers or values?

I also read up a bit on whether one should pass parameters as pointers or values in Go with regard to performance. The conclusion was typically something akin to "it's better to pass by value (i.e. copy) up to a certain size of N bytes", where N seemed to differ a bit, but up to a kilobyte should be fine. Well – I had to test this as well.

I started with a simple microbenchmark for my `Tuple4` `Add()` function:


Certainly not a huge difference: the pass-by-pointer variant is about 8% faster in this microbenchmark. But given that a function such as `Add` is used extensively – for example, when determining a pixel's color from its material's ambient, diffuse and specular components – even a pretty small improvement like this one can help overall performance.

Similarly, a benchmark comparing passing a 4×4 matrix and a 1×4 tuple to a multiply function showed an even more significant improvement using pointers:

`BenchmarkMultiplyByTupleUsingValues-8 79611511 13.8 ns/op`

`BenchmarkMultiplyByTupleUsingPointers-8 100000000 10.7 ns/op `

I implemented this change in some critical code paths in the codebase and got a decent speedup.

The new duration was 0.8s compared to 1.1s before. Allocations did not change.

## 4. AVX2 using c2goasm

I stumbled upon this excellent article, which got me thinking that I could perhaps identify some bottleneck mathematical operation and improve its execution speed by explicitly taking advantage of the SIMD, AVX and AVX2 CPU extensions available on most modern x86-64 microprocessors. I haven't been able to determine whether the Go compiler itself takes advantage of SIMD/AVX on `GOARCH=amd64`, but I don't think it does, at least not for my purposes.

For details on this optimization, please check the article linked above. Here's a quick summary of how I went about replacing my `MultiplyByTuple` function with a Plan9 assembly implementation that takes full advantage of AVX2 and can be called without the overhead of CGO.

## 4.1 Intrinsics

Write C code that uses Intel intrinsics to perform our Matrix x Tuple multiplication:

It’s OK if all that C code doesn’t make sense. Note that `double` equals our Go `float64` and that we use `_pd` intrinsic function variants for double precision. Comments may provide some insight.

## 4.2 Compile to X86 Assembly

With the C code above ready, one could imagine we'd use CGO to call it – but that's one of the neat things about this approach: by generating native Plan9 Go assembly, we eliminate the CGO function call overhead. To transform the C code into native Go assembly, we first compile it into standard x86-64 assembly using the `clang` compiler:

`clang -S -mavx2 -mfma -masm=intel -mno-red-zone -mstackrealign -mllvm -inline-threshold=1000 -fno-asynchronous-unwind-tables -fno-exceptions -fno-rtti -c -O3 cfiles/MultiplyMatrixByVec64.c`

This generates a `MultiplyMatrixByVec64.s` x86 assembly file.

## 4.3 Convert to Plan9 Assembly

Next, we turn to c2goasm for generating Go Plan9 assembly callable from a .go file.

`c2goasm -a -f cfiles/MultiplyMatrixByVec64.s internal/pkg/mat/MultiplyMatrixByVec64_amd64.s`

The resulting `MultiplyMatrixByVec64_amd64.s` contains our Go Plan9 assembly and looks like this (slightly truncated for brevity):

## 4.4 Making it Callable

Finally, we need some hand-written plain Go code to glue the assembly in `MultiplyMatrixByVec64_amd64.s` to our ordinary Go code:

Once here, we can call `MultiplyByTuplePtr(m *Mat4x4, f2 *Tuple4, _f4 *Tuple4)` just like any other Go code.

## 4.5 Results

So, how much faster is this code than the vanilla `MultiplyByTupleUsingPointers` variant we benchmarked previously? A new microbenchmark tells us:

Result (including the previous benchmarks from section 3):

More than twice as fast! Given that `MultiplyByTuple` is called several times on every ray/shape intersection test, this fix should provide a nice speedup:

Rendering the reference image now took 0.6s compared to 0.8s before.

I've also played around with implementing the `Dot` and `Cross` product functions using the same approach. `Dot` turned out to be significantly slower when done with intrinsics, while `Cross` was interesting: the microbenchmark showed the intrinsics version to be maybe 15% slower than the pure Go one, but a real render of an image was a few percent faster when the Plan9 assembly `Cross` was used. If anything, it serves as a good reminder to be careful about drawing conclusions from micro-benchmarks of isolated functions – the performance of the program as a whole is the most important metric when chasing optimizations.

## 5. A final word on threading

With the assembly optimizations above done, I was more or less "done" with my optimizations. However, the topic of multi-threading performance in the ray-tracer remained somewhat of a mystery to me. We saw in part 1 that the initial multi-threading improvement (before all other improvements) was about 2.5x. How does multi-threading work out after all other optimizations?

(note that the image rendered for this particular chart isn’t identical to the reference one)

About 3.5x faster using all 8 available threads/workers compared to a single worker. Note, though, that there's very little improvement once we move past 4 threads on my machine.

I haven’t figured this one out exactly, but I believe it boils down to several things:

1. Having a CPU (my Mac uses a 4870HQ) with 4 cores / 8 hardware threads (using hyper-threading) doesn't necessarily mean one should expect the raw performance of 8 "real" CPU cores. Some resources (CPU caches, etc.) are probably shared within a single core.
2. Memory bandwidth? We’re still allocating and de-allocating a substantial amount of memory. Perhaps the memory subsystem is holding the CPU back?
3. I’ve spent next to no effort thinking about efficient multi-threading or optimizations on the CPU-level. I.e. things like optimizing for CPU cache size. In other words – my implementation could very well be a bad one as far as efficient multi-threading is concerned.
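For context, the threading model itself is simple: a scanline worker pool, roughly like this (a simplified sketch with hypothetical names; `renderLine` stands in for the real per-scanline ray-tracing work):

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

const width, height = 16, 8

// renderLine fills one scanline of the image. The placeholder
// "shading" here is just x+y; the real version traces rays.
func renderLine(y int, pixels []float64) {
	for x := 0; x < width; x++ {
		pixels[y*width+x] = float64(x + y)
	}
}

// render fans scanlines out over the given number of workers.
// Each worker writes to disjoint rows, so no locking is needed.
func render(workers int) []float64 {
	pixels := make([]float64, width*height)
	jobs := make(chan int)
	var wg sync.WaitGroup

	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for y := range jobs {
				renderLine(y, pixels)
			}
		}()
	}
	for y := 0; y < height; y++ {
		jobs <- y
	}
	close(jobs)
	wg.Wait()
	return pixels
}

func main() {
	img := render(runtime.NumCPU())
	fmt.Println(len(img))
}
```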

I've also run the ray-tracer on a more modern Intel Core i9 8 core / 16 thread MacBook Pro 2019 and a desktop computer with an AMD Ryzen 2600X 6 core / 12 thread CPU, seeing similar behavior where performance improvements are negligible once `worker count` > `num of physical cores`. However, I do remember running the ray-tracer on my Ryzen 2600X with the memory clocked to 2100 MHz instead of the normal 3200 MHz. I noticed that CPU usage was down to 60-70% per core instead of the >99% I saw with the memory at its normal speed, which could indicate memory bandwidth or latency being a culprit. Perhaps I'll do a follow-up on this particular topic!

I’m sure there’s someone out there who could shed some light on this issue. Feel free to use the comments!

After all these optimizations (plus BVHs), rendering a complex scene with tens of thousands of triangles, multi-sampling, soft shadows, etc. had become possible in a reasonable amount of time. This 1920×1080 render of a dragon took about 20 minutes:

## 6. Conclusion

First, a little reminder that the findings in this blog post are strictly anecdotal and specific to my use case: a naive ray-tracer written without performance in mind. Your mileage may vary in other circumstances!

I'm also well aware there's a lot more one can do, such as more comprehensive escape analysis and making sure the compiler inlines function calls properly. That said, going from 3m15s to 0.6s for the same result was very rewarding, and I learned a ton of stuff doing it. As a bonus, I probably had just as much fun doing these performance optimizations as I originally had creating the ray-tracer from the aforementioned "The Ray Tracer Challenge" book.

To give an overview of this optimization journey, the following diagram gives a pretty good view:

The single most important fix was caching the inverse matrix, probably followed by multi-threading and generally avoiding re-allocating structs and slices in the renderer. The performance gains from slices vs arrays, pointers vs values, and sorting were less clear-cut, but together they did provide a substantial gain. The intrinsics and Plan9 assembly were perhaps stretching things a bit, but fun to experiment with.

That’s it! Hope you enjoyed reading my ramblings on ray-tracer performance optimizations in Go.

Feel free to share this blog post using your favorite social media platform! There should be a few icons below to get you started.