Indeed, if you get some fundamental piece of the core math or intersection algorithms wrong, later uses of those functions (such as checking the intersections of a ray) will be off and very difficult to troubleshoot.
So, the book takes you step by step from building your core math library, rendering your first pixels, and grasping the core concepts, to rendering various shapes complete with lighting, reflections, refraction, shadows, and more. By the end of the book, your renderer is capable of creating images such as the one below (Gopher model by Takuya Ueda):
That said — the purpose of this blog post is not teaching you how to create a ray-tracer — buy the book instead. This blog post is about Golang and my journey after finishing the book and making my renderer perform better. As it turned out, my performance optimizations cut the time for a single render by several orders of magnitude!
1.2 The “Reference” Image
While not very impressive by 2020 graphical standards, this “reference image” of mine showcases a number of different object primitives, shadows, reflections, refraction, lighting, patterns, and transparency. This is a high-res render, while the resolution used for all performance comparisons was 640×480.
Why not? Ok – after doing Java for 15+ years, I’ve been working full-time with Go projects for the last year or so. And I simply love the simplicity of the language, the fast and efficient tooling, and how it lets our team efficiently write stable, performant, and correct services without the language or its abstractions getting in our way very often.
I’ll also admit, if I had wanted to write the most performant ray-tracer from the very beginning, I probably should have done this in C, C++, or Rust. However, given that my challenge was learning how to do simple ray-tracing, I’d rather not have to deal with (re)learning a programming language I’m not comfortable with. This point is actually rather significant, as it turned out I’ve probably spent more time on my ray-tracer after finishing the book doing the performance optimizations this blog post is about than I spent with the book. Nevertheless, it’s probable I’d run into many or most of the issues in whatever language I’d picked.
The cucumber-like tests from the book were implemented as standard Go unit tests. My implementation used Go 1.13 with no 3rd-party libraries at the time I completed the book.
2.1 Other Optimizations
This blog post is about improving the Go-related code of the renderer. However, I must clarify that a huge part of creating a performant ray-tracer is actually about making it “do less work” through optimization of the actual logic of the renderer. Ray-tracing is mainly about casting rays and checking if they intersect something. If you can decrease the number of rays to cast without sacrificing image quality, that is probably the greatest improvement you can make.
An example of such an improvement is using something called Bounding Boxes, where a complex object is encapsulated inside a “cheap” primitive, which can speed up rendering of complex meshes by several orders of magnitude:
(Source Wikimedia: https://commons.wikimedia.org/wiki/File:BoundingBox.jpg)
A complex mesh such as the head in the image above may consist of hundreds of thousands of triangles. Without optimizations, a naive ray tracer would need to do an intersection test for every single triangle for every ray cast in the scene, including shadow and reflection rays. That would make rendering of complex scenes quite infeasible on commodity hardware, where a single render could stretch into days or weeks.
Instead, by putting the complex head mesh inside a virtual “box”, a simple ray-box intersection test for each ray cast tells us whether we need to test the triangles of the head at all: if the ray doesn’t intersect the box, it can’t intersect any of the thousands of triangles either, so we can skip them.
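As an illustration (not the book's actual code), the ray-box test can be done with the classic "slab" method. The types below are simplified stand-ins for the renderer's own:

```go
package main

import (
	"fmt"
	"math"
)

// Ray and box components are [x, y, z] arrays for brevity; the real
// renderer has its own tuple types.
type Ray struct{ Origin, Dir [3]float64 }
type AABB struct{ Min, Max [3]float64 }

// Intersects is the classic "slab" test: per axis, compute the parametric
// interval in which the ray lies between the box's two parallel planes;
// the ray hits the box iff the three intervals overlap.
func (b AABB) Intersects(r Ray) bool {
	tMin, tMax := math.Inf(-1), math.Inf(1)
	for i := 0; i < 3; i++ {
		// Division by zero is fine here: IEEE infinities make the
		// comparisons below behave correctly for axis-parallel rays.
		inv := 1.0 / r.Dir[i]
		t0 := (b.Min[i] - r.Origin[i]) * inv
		t1 := (b.Max[i] - r.Origin[i]) * inv
		if inv < 0 {
			t0, t1 = t1, t0
		}
		tMin = math.Max(tMin, t0)
		tMax = math.Min(tMax, t1)
		if tMax < tMin {
			return false // the intervals no longer overlap: a miss
		}
	}
	return true
}

func main() {
	box := AABB{Min: [3]float64{-1, -1, -1}, Max: [3]float64{1, 1, 1}}
	hit := Ray{Origin: [3]float64{0, 0, -5}, Dir: [3]float64{0, 0, 1}}
	miss := Ray{Origin: [3]float64{0, 0, -5}, Dir: [3]float64{0, 1, 0}}
	fmt.Println(box.Intersects(hit), box.Intersects(miss)) // true false
}
```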
Further, we can subdivide the head into many smaller boxes in a tree-like structure we can traverse and only do intersection tests for the triangles bounded by the leaf “box”. This is known as bounding volume hierarchies.
(Source: Wikimedia: https://commons.wikimedia.org/wiki/File:Example_of_bounding_volume_hierarchy.svg)
As said above, these optimizations are tremendously important for complex scenes with many and/or complex objects. While I’ve implemented Bounding Boxes and BVHs in my renderer based on one of the online bonus chapters of the book, for the simple “reference image” used for benchmarking in this blog post, the BVH is ineffectual since the reference scene does not contain any grouped primitives.
Once I was done with the book, I could render simple scenes such as the “reference image” on my MacBook Pro 2014 (4 cores / 8 threads) in 640×480 resolution in about 3 minutes and 15 seconds. This made further experimentation with multi-sampling, textures, or soft shadows a very slow process, leading me to embark upon my journey of optimizations.
At a glance, ray-tracing is more or less about casting a ray through each would-be pixel of the image plane, and then applying the exact same algorithms to determine the output color of that given pixel. In theory, this sounds like an embarrassingly parallel problem.
Therefore, my first optimization was to move from a purely single-threaded rendering process to something that would utilize all available CPU threads.
3.1 Distributing the work
So, this is where the Go fun starts. I decided to apply the simple worker pool pattern that I’ve touched on before on this blog.
In terms of Go code, the renderer treats each horizontal line of pixels as a unit of work that one of the workers can consume from a channel. The number of workers is set to GOMAXPROCS, matching the number of virtual CPU cores, 4/8 in my case.
The actual rendering code (e.g. workerFuncPerLine()) just picks one “job” at a time from the channel and passes it on to the rendering function.
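A minimal sketch of this worker-pool pattern, with illustrative names and a stubbed-out renderLine standing in for the real per-line rendering:

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

const width, height = 640, 480

// renderLine stands in for the real per-line rendering work; it returns
// the number of pixels it "rendered" so the sketch has something to total.
func renderLine(y int) int {
	return width
}

// renderParallel distributes one job per horizontal line over a pool of
// GOMAXPROCS workers and returns the total number of pixels rendered.
func renderParallel() int {
	jobs := make(chan int, height) // one job per horizontal line
	results := make([]int, height) // each worker writes its own line index
	var wg sync.WaitGroup

	for w := 0; w < runtime.GOMAXPROCS(0); w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for y := range jobs { // consume lines until the channel closes
				results[y] = renderLine(y)
			}
		}()
	}
	for y := 0; y < height; y++ {
		jobs <- y
	}
	close(jobs)
	wg.Wait()

	total := 0
	for _, n := range results {
		total += n
	}
	return total
}

func main() {
	fmt.Println("pixels rendered:", renderParallel()) // 640*480 = 307200
}
```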
3.2 Unit of Work — Congestion Issues?
Why use one line as the unit of work? Well, actually, I started out passing one pixel at a time. However, if you’re rendering a 1920×1080 image, that’s approximately 2 million “jobs” to pass over that channel. A CPU profile run showed that a large portion of time was being spent in the channel machinery:
(Partial profile, pass one pixel per job)
I couldn’t quite make sense of this, other than that the profiling tools seemed to think a ridiculous amount of time was being spent waiting for something. A block profile shed some additional light on the issue:
(Block profile, pass one pixel per job)
The workerFuncPerLine() method spent a whole lot of time being blocked. By making the tiny change of passing one line at a time as the unit of work instead, things started looking much better:
(Block profile, pass one line per job)
In this particular example, the block in chanrecv2 went from 34.3 seconds down to 0.6, while chansend1 actually increased from 4 to 8 seconds, clearly indicating that we improved things a lot while simultaneously moving the bottleneck to the sender side. The CPU profile also looked much better now:
(Profile, pass one line per job)
Overall, one should note that this change didn’t cut the total render time in half; it yielded an overall improvement of perhaps 5-7%. But it’s a good example of how the Go profiling tools can identify an issue not directly related to mathematics or ray-tracing algorithms, but rather to the internal app architecture and how “jobs” were distributed to the “workers”.
3.3 Sharing Memory
As I quite quickly noticed, moving the codebase from strictly single-threaded operation to using a worker pool introduced a whole new slew of problems to solve. State and shared object structures were definitely among the more challenging ones.
What does the code that renders a single pixel need to do its job? Well, it needs access to all geometry and light sources in the scene, as well as knowledge of where the “camera” is. All of that was implemented as plain Go structs, including quite a bit of state that at times was mutated by the renderer in order to avoid allocating memory for intermediate calculations.
This was actually a harder nut to crack than I initially thought. Since I didn’t fancy rewriting everything from scratch, and using mutexes everywhere would be both error-prone and unlikely to scale all that well, I decided to let each “render thread” get a full copy of the entire scene to work with. As long as the scene doesn’t contain a lot of really complex 3D models, keeping N copies of some structs in memory isn’t a big issue in this day of 16 GB laptops.
The final declaration of a “Render Context” is implemented as a Go struct and looks like this:
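The original declaration isn't reproduced here, but a rough, compilable sketch of the idea could look like the following. The field names are mine, and the scene types are placeholders:

```go
package main

import "fmt"

// Placeholder scene types; the real renderer has full implementations.
type World struct{}
type Camera struct{}
type Ray struct{}

// Ctx sketches a per-worker "render context": an ID, this worker's own
// full copy of the scene, and pre-allocated scratch memory. The field
// names are illustrative, not the post's actual declaration.
type Ctx struct {
	Id     int
	World  World  // full copy: no sharing, no mutexes needed
	Camera Camera // full copy of the camera as well
	Rays   []Ray  // pre-allocated scratch rays, one per recursion level
}

// NewCtx builds one autonomous render context; the renderer would create
// GOMAXPROCS of these.
func NewCtx(id int, w World, c Camera, maxDepth int) *Ctx {
	return &Ctx{Id: id, World: w, Camera: c, Rays: make([]Ray, maxDepth)}
}

func main() {
	ctx := NewCtx(0, World{}, Camera{}, 5)
	fmt.Println(ctx.Id, len(ctx.Rays)) // 0 5
}
```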
The renderer instantiates GOMAXPROCS of these “render contexts”, each able to autonomously render any pixel in the scene. Perhaps not the most elegant of solutions, but it provided safety and definitely got rid of some of the rather strange render errors and other problems I had before moving away from “shared memory”.
Whenever a pixel had been completed, it was written directly to the *mat.Canvas through a mutex-protected method. This mutex did not become any kind of bottleneck.
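A sketch of what such a mutex-protected canvas write might look like; the actual mat.Canvas implementation differs, and the names here are illustrative:

```go
package main

import (
	"fmt"
	"sync"
)

// Canvas guards its pixel buffer with a mutex so many render goroutines
// can write completed pixels concurrently.
type Canvas struct {
	mu     sync.Mutex
	w, h   int
	pixels []float64 // flattened r,g,b triplets
}

func NewCanvas(w, h int) *Canvas {
	return &Canvas{w: w, h: h, pixels: make([]float64, w*h*3)}
}

// WritePixel is the mutex-protected method each worker calls once per
// finished pixel; one short lock per pixel stays uncontended in practice.
func (c *Canvas) WritePixel(x, y int, r, g, b float64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	i := (y*c.w + x) * 3
	c.pixels[i], c.pixels[i+1], c.pixels[i+2] = r, g, b
}

func main() {
	c := NewCanvas(2, 2)
	var wg sync.WaitGroup
	for x := 0; x < 2; x++ {
		for y := 0; y < 2; y++ {
			wg.Add(1)
			go func(x, y int) { // concurrent writes, one per pixel
				defer wg.Done()
				c.WritePixel(x, y, 1, 1, 1)
			}(x, y)
		}
	}
	wg.Wait()
	fmt.Println(c.pixels[0]) // 1
}
```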
3.4 Running Multi-Threaded
So, how did going from 1 to 8 “workers” affect performance? While I didn’t expect my renders to become 8 times faster, I was kind of expecting maybe a 5-7x speedup.
I got about 2.5x.
The reference image was down to about 1m30s from 3m15s. This was a huge disappointment! But on the flip side, it really got me intrigued about what was holding the CPU back. I’ll revisit this in the final section of the second blog post of this mini-series.
What do you do when your code doesn’t perform like you expect? You profile it! So, I added the requisite boilerplate to enable profiling:
I guess the finer details of pprof are better explained elsewhere, so let’s jump directly to what the CPU profile showed after enabling multi-threading:
These profile PNGs are awesome for seeing not only where the most time is being spent (hint: bigger is NOT better) but also the call hierarchies. It seems about 25% of the CPU time was being spent in runtime.mallocgc, which in plain English means “stuff the garbage collector is doing”. While my naive ray-tracer definitely wasn’t written with memory re-use in mind, this did seem a bit excessive.
Luckily, the profiling tools also provide heap profiling, capable of both tracking the amount of memory being allocated down to the line-of-code level and showing the number of allocations occurring in any given function or line.
Time for a big image:
Yes, look at that Inverse() function. Over 95% of all memory allocations are happening in its descendants! A closer look:
Yup – someone down that call hierarchy is performing about 3.7 billion(!) allocations for rendering a 640×480 image!! That’s more than 12,000 allocations per pixel being rendered. Something was definitely rotten in my own little kingdom of ray-tracing.
4.1 Caching the Inverse Transformation Matrix
While “inverse the transformation matrix” sounds like something Picard could say on Star Trek, it’s one of those fundamental pieces of math our ray-tracer needs when moving from “world space” to “object space” while performing a ray intersection test on an object. Since we’re testing all objects in the scene for every ray being cast, including reflections and shadows, this adds up to an enormous amount of matrix math.
(The transformation matrix is a 4×4 matrix that keeps track of the translation (position in 3D space), rotation and scaling of our primitives in relation to the “world” cartesian coordinate system.)
The good news is that we actually only need to calculate the inverse transformation matrix for our scene objects exactly once. So once we know where in world space our camera and scene objects are, we can calculate their inverse transformation matrices and store the result in their structs:
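A compilable sketch of the caching idea, using a SetTransform(translation Mat4x4) method as in the post. The Inverse() here is a toy that only handles pure translations, standing in for the renderer's general (and much more expensive) 4×4 inversion:

```go
package main

import "fmt"

type Mat4x4 [16]float64 // row-major 4×4 matrix

// Translation builds a matrix that moves points by (x, y, z).
func Translation(x, y, z float64) Mat4x4 {
	return Mat4x4{
		1, 0, 0, x,
		0, 1, 0, y,
		0, 0, 1, z,
		0, 0, 0, 1,
	}
}

// Inverse only handles pure translations here (negate the translation
// column), standing in for the renderer's general matrix inversion.
func (m Mat4x4) Inverse() Mat4x4 {
	inv := m
	inv[3], inv[7], inv[11] = -m[3], -m[7], -m[11]
	return inv
}

// Sphere caches the inverse right next to the transform itself.
type Sphere struct {
	Transform Mat4x4
	Inv       Mat4x4
}

// SetTransform is only called during scene setup, so paying for the
// inversion once here lets the hot intersection code just read s.Inv.
func (s *Sphere) SetTransform(translation Mat4x4) {
	s.Transform = translation
	s.Inv = translation.Inverse()
}

func main() {
	var s Sphere
	s.SetTransform(Translation(2, 3, 4))
	fmt.Println(s.Inv[3], s.Inv[7], s.Inv[11]) // -2 -3 -4
}
```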
SetTransform(translation Mat4x4) above is only called once when we’re setting up the scene.
This was a trivial optimization to code, but it made a huge difference: The time needed to render the reference image went from 1m30 to about 4.5 seconds!! A single threaded render went from 3m15s to about 10s.
Looking at allocations, we went from 3.8 billion allocations to “just” 180 million, and the total amount of memory allocated went from 154 GB to 5.9 GB. While we certainly saved a significant number of CPU cycles on the actual math of calculating the inverse, the big win definitely was easing the pressure on the memory subsystem and the garbage collector, since calculating the inverse required allocating memory on each invocation. The Inverse() function is now used so sparingly it doesn’t even show up in a heap profile.
3.7 billion down, 180 million to go? If I had the time and will (which I don’t) I’d probably rewrite everything from scratch with a “zero-allocation” goal in mind. Keep things on the stack, pre-allocate every piece of memory needed to hold results, check inlining etc.
I should perhaps mention that when I began implementing things based on the book, I used an “everything should be immutable” architecture. While that really did help with correctness and avoiding nasty bugs due to (accidental) mutation, allocating new slices or structs on every single computation turned out to be a quite horrible idea from a performance point of view.
Since I didn’t want to start over (and have to refactor at least a hundred implementations of those cucumber tests implemented in plain Go), I took an approach of trying to identify one big “allocator” at a time in order to see if I could make it re-use memory or keep its allocations on the stack. Heap profiler to the rescue!
Looking at this new heap profile, the biggest culprit seems to be the MultiplyByTuple function, responsible for almost 50% of the remaining allocations. It’s a quite central part of the ray-object intersection code, so it’s called often. What does it do?
Yes, it’s just a simple function that multiplies a 4×4 matrix with a 1×4 tuple and returns the resulting tuple. The problem: it’s creating a new slice on every invocation. That’s not something you want in a performance-critical context. (In a later part I’ll return to the question of “slices vs arrays”.)
What we would like to do is to have the calling code pass a pointer to a struct in which we can store the result, and hopefully the calling code can reuse the allocated memory or keep it on the stack.
Easy peasy – a new third parameter passes a pointer to a Tuple4, and we’re down to zero allocations. However, there’s a bit more to it. The most typical usage is in this code:
In this snippet, we’re creating a new transformed Ray (a ray is a point in 3D space plus a vector holding its direction) from a matrix and another ray. The results are then put into the new Ray. Moving the NewTuple() call here won’t help us at all. No, this requires a bit more refactoring, so the whole call chain uses the C-style pattern of passing the result as a pointer.
This implementation passes pointers to the underlying Tuple4 structs directly to MultiplyByTuplePtr, which stores the results directly in the passed Ray pointer. As long as the code calling TransformRayPtr has allocated out *Ray in a sensible way, we’ve probably cut the number of allocations really significantly. The Ray in this case is something that can safely be pre-allocated once per level of render-context recursion.
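A self-contained sketch of this pointer-passing style; the post's actual signatures may differ slightly:

```go
package main

import "fmt"

type Mat4x4 [16]float64 // row-major 4×4 matrix
type Tuple4 [4]float64  // an array instead of a slice: no heap allocation

// MultiplyByTuplePtr stores m * t into out instead of returning a fresh
// tuple, so the caller controls where the result memory lives.
func MultiplyByTuplePtr(m *Mat4x4, t *Tuple4, out *Tuple4) {
	for row := 0; row < 4; row++ {
		sum := 0.0
		for col := 0; col < 4; col++ {
			sum += m[row*4+col] * t[col]
		}
		out[row] = sum
	}
}

type Ray struct{ Origin, Direction Tuple4 }

// TransformRayPtr transforms r by m, writing into the caller-owned out.
func TransformRayPtr(m *Mat4x4, r *Ray, out *Ray) {
	MultiplyByTuplePtr(m, &r.Origin, &out.Origin)
	MultiplyByTuplePtr(m, &r.Direction, &out.Direction)
}

func main() {
	identity := Mat4x4{1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1}
	r := Ray{Origin: Tuple4{1, 2, 3, 1}, Direction: Tuple4{0, 0, 1, 0}}
	var out Ray // stays on the caller's stack, or is pre-allocated per context
	TransformRayPtr(&identity, &r, &out)
	fmt.Println(out.Origin) // [1 2 3 1]
}
```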
I won’t go into the exact details of how pre-allocating memory for that Ray works, but at a high level, each render goroutine has the “render context” mentioned before, and each render context only deals with a single pixel at a time. The final color of a single pixel depends on many things, in particular the number of extra raycasts needed to follow reflections and refractions and to test whether the point in space is illuminated. Luckily, we can treat each new “bounce” as a new ray, so as long as we have allocated enough “rays” in the context to support our maximum recursion depth, we’re fine with this approach.
We can also use Go’s built-in support for microbenchmarking. Consider these two benchmarks:
In this microbenchmark, the latter is more than twice as fast. (Go tip: remember to do something with the result; otherwise, the Go compiler may optimize away the function call we’re benchmarking altogether.)
5.1 Cache, pre-allocate, cache, pre-allocate…
The results obtained in this section clearly indicated that pre-allocating data structures, reusing them, and caching results whenever possible were key to improving efficiency through the reuse of memory.
A few more examples where pre-allocating memory gave really nice benefits:
5.1.1 Intersection lists
There are a lot of intersection lists to keep track of “hits” when casting rays through a scene. It turned out that the intersection code for each primitive (e.g. sphere, plane, cylinder, box, triangle) was creating a new Intersection slice on each invocation:
This is the intersection code for a Plane. Note how a fresh slice is created and populated at the end of the method if there was an intersection. This pattern repeated itself for all primitive types.
However, for each render context we also know that once we have “used” the intersection(s) of a given primitive, we can safely re-use the same slice for that primitive for any subsequent ray / primitive intersection test. Therefore, a simple pre-allocated slice of sufficient length on the primitive created a large saving in allocations:
Why a slice of size 1? A plane can only be intersected exactly once by a ray. Other primitive shapes may be intersected more times: a sphere may be intersected twice (entry and exit), while cones are the worst, with up to 4 possible intersections.
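A sketch of the pre-allocation idea for the plane case, with a deliberately trimmed-down Ray; the real intersection code works in object space with full ray and intersection types:

```go
package main

import (
	"fmt"
	"math"
)

const epsilon = 1e-7

// Ray is trimmed down to the two components a y=0 plane test needs.
type Ray struct{ OriginY, DirY float64 }

type Intersection struct {
	T      float64
	Object *Plane
}

// Plane carries a pre-allocated intersection slice of capacity 1: a plane
// can be hit at most once per ray, so within one render context the same
// backing array can be reused for every intersection test.
type Plane struct {
	savedXs []Intersection
}

func NewPlane() *Plane {
	return &Plane{savedXs: make([]Intersection, 0, 1)}
}

// IntersectLocal reslices savedXs to zero length and appends into it, so
// no new memory is allocated after the plane has been constructed.
func (p *Plane) IntersectLocal(r Ray) []Intersection {
	p.savedXs = p.savedXs[:0]
	if math.Abs(r.DirY) < epsilon {
		return p.savedXs // ray is parallel to the plane: no hit
	}
	p.savedXs = append(p.savedXs, Intersection{T: -r.OriginY / r.DirY, Object: p})
	return p.savedXs
}

func main() {
	p := NewPlane()
	xs := p.IntersectLocal(Ray{OriginY: 1, DirY: -1})
	fmt.Println(len(xs), xs[0].T) // 1 1
}
```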
Another nice optimization was to look at the top-level render context – what was being created anew for each new pixel being rendered? A lot, it turned out. In its final form, these were things we could pre-allocate and re-use:
ShadeData contains pre-allocated structs needed for a single recursion. I won’t go into more implementation details, but as one can see, there are quite a few of these “pre-allocated” structs and slices, and the complexity of the solution did indeed become significant once I realized how the recursive nature of the ray-tracer affected things.
There are more of these, but in order not to make this section even longer, the key take-away is that while immutability is a really nice thing to strive for in general computer programs, for really performance-critical software, re-using memory, caching, and avoiding allocations might be a good thing…
5.1.2 Re-use slice memory
This is a simple one. Many intersection lists were being “re-used” by explicitly setting them to nil, or even by doing a make on them again (which, strictly speaking, isn’t reusing anything other than the variable name…).
Instead, slicing a slice to zero length retains the memory previously allocated for its contents, but allows adding items to it from its 0 index again:
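A tiny self-contained demonstration of the difference:

```go
package main

import "fmt"

// reset reslices to zero length: the backing array (and thus the capacity)
// is retained, so subsequent appends reuse the memory instead of
// allocating, unlike `xs = nil` or a fresh make().
func reset(xs []int) []int {
	return xs[:0]
}

func main() {
	xs := make([]int, 0, 8) // allocated once, capacity 8
	xs = append(xs, 1, 2, 3)
	fmt.Println(len(xs), cap(xs)) // 3 8

	xs = reset(xs)
	fmt.Println(len(xs), cap(xs)) // 0 8

	xs = append(xs, 4)            // no allocation: same backing array
	fmt.Println(xs[0], cap(xs))   // 4 8
}
```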
5.2 Outcome of Reducing Allocations and Caching
The changes described above were just a few of many optimizations where allocations in computational functions were replaced by letting the caller pass a pointer to the result, memory was pre-allocated, and slice memory was reused. In the end, the results were approximately the following:
- Duration dropped from ~4.5 to ~1.9 seconds
- Allocation count dropped from 180 million to about 33 million
- Total memory allocated went from 5.9 GB to about 1.3 GB
This sums up the first part of this 2-part blog series on my adventures optimizing my little ray-tracer. In the next and final part, I’ll continue with more optimizations.