for example, the current distance to the closest intersection.
Printing and exception handling facilities are also available.
After using OptiX host API functions to provide scene
data such as geometry, materials, acceleration structures,
hierarchical relationships, and programs, the application
launches ray tracing with the rtContextLaunch API
function, which passes control to OptiX. If required, a new ray
tracing kernel is compiled from the given user programs,
acceleration structures are built (or updated), and data is
synchronized between host and device memory; finally,
the ray tracing kernel is executed, invoking the various user
programs as described above.
After execution of the ray tracing kernel has completed, its
resulting data can be used by the application. Typically, this
involves reading from output buffers filled by one of the user
programs or displaying such a buffer directly, for example, via
OpenGL. An interactive or multi-pass application then repeats
the process starting at context setup, where arbitrary changes
to the context can be made, and the kernel is launched again.
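From the host's perspective, this launch sequence amounts to dirty-state tracking around each rtContextLaunch call: recompile, rebuild, and resynchronize only what has changed since the previous launch. The following minimal C++ sketch illustrates that control flow; the class and member names are hypothetical stand-ins for illustration, not the actual OptiX host API.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for an OptiX-like context; names are illustrative.
struct Context {
    bool programsDirty = true;  // user programs changed -> recompile kernel
    bool accelDirty = true;     // geometry changed -> rebuild/refit accel
    bool dataDirty = true;      // host-side writes -> re-upload to device
    std::vector<float> outputBuffer;

    // Launch mirrors the steps in the text: compile if required, build
    // (or update) acceleration structures, synchronize data, then execute.
    std::vector<std::string> launch(unsigned width, unsigned height) {
        std::vector<std::string> steps;  // record which steps actually ran
        if (programsDirty) { steps.push_back("compile kernel"); programsDirty = false; }
        if (accelDirty)    { steps.push_back("build accel");    accelDirty = false; }
        if (dataDirty)     { steps.push_back("sync data");      dataDirty = false; }
        steps.push_back("execute kernel");
        outputBuffer.assign(width * height, 0.0f);  // kernel fills the output
        return steps;
    }
};
```

On the first launch all four steps run; on a second launch of an unchanged scene only the kernel executes, mirroring the incremental setup–launch–readback loop described above.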
4. DOMAIN-SPECIFIC COMPILATION
The core of the OptiX host runtime is a just-in-time (JIT)
compiler that serves several important functions. First,
the JIT stage combines all of the user-provided shader
programs into one or more kernels. Second, it analyzes
the node graph to identify data-dependent optimizations.
Finally, the resulting kernel is executed on the GPU using
the CUDA driver API.
Generating and optimizing code for massively parallel
architectures presents several challenges. One challenge is
that code size and live state per computation must be minimized for maximum performance. Another challenge is
structuring the code to reduce divergence. Our experience
with OptiX highlights the interesting tensions between
these sometimes conflicting requirements.
4.1. OptiX programs
The user-specified programs described in Section 3.1 are
provided to the OptiX host API in the form of Parallel Thread
Execution (PTX) functions.8 PTX is a virtual machine assembly language for NVIDIA's CUDA architecture, similar in
many ways to the popular open source Low-Level Virtual
Machine (LLVM) intermediate representation.5 Like LLVM,
PTX defines a set of simple instructions that provide basic
operations for arithmetic, control flow and memory access.
PTX also provides several higher-level operations such as
texture access and transcendental operations. Also similar
to LLVM, PTX assumes an infinite register file and abstracts
many real machine instructions. A JIT compiler in the CUDA
runtime will perform register allocation, instruction scheduling, dead-code elimination, and numerous other late optimizations as it produces machine code targeting a particular GPU architecture.
PTX is written from the perspective of a single thread and
thus does not require explicit lane mask manipulation operations. This makes it straightforward to lower PTX from a
high-level shading language, while giving the OptiX runtime
the ability to manipulate and optimize the resulting code.
NVIDIA’s CUDA C/C++ compiler, nvcc, emits PTX and
is currently the preferred mechanism for programming
OptiX. Programs are compiled offline using nvcc and submitted to the OptiX API as a PTX string. By leveraging the
CUDA C++ compiler, OptiX shader programs have a rich set
of programming language constructs available, including
pointers, templates, and overloading that come automatically by using C++ as the input language. A set of header
files is provided that support the necessary variable annotations and pseudo-instructions for tracing rays and other
OptiX operations. These operations are lowered to PTX in
the form of a call instruction that gets further processed by
the OptiX runtime.
4.2. PTX to PTX compilation
Given the set of PTX functions for a particular scene, the
OptiX compiler rewrites the PTX using multiple PTX to PTX
transformation passes, which are similar to the compiler
passes that have proven successful in the LLVM infrastructure. In this manner, OptiX uses PTX as an intermediate
representation rather than a traditional instruction set. This
process implements a number of domain-specific operations including an ABI (calling sequence), link-time optimizations, and data-dependent optimizations. The fact that
most data structures in a typical ray tracer are read-only provides a substantial opportunity for optimizations that would
not be considered safe in a more general environment.
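For instance, because a node-graph variable cannot change during a launch, the compiler can specialize the kernel by folding the variable's current value directly into the instruction stream. The toy below illustrates that idea on a string-based IR; it is a sketch of the concept only, not the actual OptiX pass, which operates on PTX.

```cpp
#include <map>
#include <string>
#include <vector>

// Toy "instructions": either "load <var>" or "imm <value>".
// Because scene variables are read-only for the duration of a launch,
// every load of a variable whose value is known at kernel-compile time
// can be replaced by an immediate constant -- a specialization that
// would be unsafe if the variable could be written mid-launch.
std::vector<std::string> specialize(std::vector<std::string> code,
                                    const std::map<std::string, int>& vars) {
    for (auto& inst : code) {
        if (inst.rfind("load ", 0) == 0) {        // instruction is a load
            auto it = vars.find(inst.substr(5));  // look up the variable name
            if (it != vars.end())
                inst = "imm " + std::to_string(it->second);
        }
    }
    return code;
}
```

Specializing `{"load max_depth", "add"}` against a scene where `max_depth` is 5 yields `{"imm 5", "add"}`; downstream constant folding and dead-code elimination can then prune whole branches of the kernel.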
One of the primary steps is transforming the set of
mutually recursive programs into a non-recursive state
machine. Although this was originally done to allow execution on a device that does not support recursion, we
found benefits in scheduling coherent operations on the
SIMT device and now employ this transformation even on
newer devices that have direct support for recursion. The
main step in the transformation is the introduction of a
continuation, which is the minimal set of data necessary to
resume a suspended function.
The set of PTX registers to be saved in the continuation
is determined using a backward dataflow analysis pass that
determines which registers are live when a recursive call (e.g.,
rtTrace) is encountered. A live register is one that is used as
an argument for some subsequent instruction in the data-flow graph. We reserve slots on a per-thread stack array for
each of these variables, store them on the stack before the
call and restore them after the call. This is similar to a caller-save ABI that a traditional compiler would implement for
a CPU-based programming language. In preparation for
introducing continuations, we perform a loop-hoisting pass
and a copy-propagation pass on each function to help minimize the state saved in each continuation.
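The liveness computation itself can be sketched as a single backward walk over the instructions that follow the call: a register is live at the call if it is read afterwards before being redefined. The toy below operates on (use, def) pairs rather than real PTX and is illustrative only, not the actual OptiX analysis.

```cpp
#include <set>
#include <string>
#include <vector>

struct Inst {
    std::vector<std::string> uses;  // registers read by this instruction
    std::vector<std::string> defs;  // registers written by this instruction
};

// Walk the instructions after the call site backwards, accumulating the
// registers that are read before being redefined. These are exactly the
// registers that must be saved in the continuation around the call.
std::set<std::string> liveAfterCall(const std::vector<Inst>& after) {
    std::set<std::string> live;
    for (auto it = after.rbegin(); it != after.rend(); ++it) {
        for (const auto& d : it->defs) live.erase(d);   // killed by a def
        for (const auto& u : it->uses) live.insert(u);  // made live by a use
    }
    return live;
}
```

For example, if the code after the call is `t = hit * 2; result = t + origin`, then `hit` and `origin` are live across the call and must be saved, while `t` is defined afterwards and is not.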
Finally, the call is replaced with a branch to return execution to the state machine described below, and a label that
can be used to eventually return control flow to this function. Further detail on this transformation can be found in
the original paper.
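The shape of the resulting code can be illustrated by rewriting a tiny recursive ray evaluation as an explicit state machine. Here the continuation is just a saved weight; the real transformation saves whichever registers the liveness analysis found. Names and structure are illustrative, not OptiX's generated code.

```cpp
#include <vector>

// Recursive form: each bounce attenuates by half, up to maxDepth bounces.
float traceRecursive(int depth, int maxDepth) {
    if (depth == maxDepth) return 1.0f;
    return 0.5f * traceRecursive(depth + 1, maxDepth);
}

// Non-recursive form: the recursive call becomes a pushed continuation
// (the live state, here just a weight) plus a jump back to the dispatch
// loop; "returning" pops the continuation and resumes the suspended work.
float traceStateMachine(int maxDepth) {
    enum State { CALL, RESUME, DONE };
    struct Continuation { float weight; };
    std::vector<Continuation> stack;  // per-thread continuation stack

    State state = CALL;
    int depth = 0;
    float result = 0.0f;
    while (state != DONE) {
        switch (state) {
        case CALL:
            if (depth == maxDepth) { result = 1.0f; state = RESUME; }
            else { stack.push_back({0.5f}); ++depth; }  // "call": save and recurse
            break;
        case RESUME:
            if (stack.empty()) state = DONE;            // outermost frame done
            else { result *= stack.back().weight; stack.pop_back(); }
            break;
        case DONE:
            break;
        }
    }
    return result;
}
```

Both forms compute the same value, but the state-machine form needs no hardware call stack, and a scheduler can regroup threads sitting at the same state label to reduce divergence, which is the coherence benefit noted above.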
4.3. Optimization
The OptiX compiler infrastructure provides a set of
domain-specific and data-dependent optimizations