MergeTree

Citation

Year
2017

Version
Peer reviewed version (post-print)

Link to publication
TUTCRIS Portal (http://www.tut.fi/tutcris)

Published in
ACM Transactions on Graphics

DOI
10.1145/3132702

License
CC BY

Take down policy
If you believe that this document breaches copyright, please contact cris.tau@tuni.fi, and we will remove access to the work immediately and investigate your claim.
MergeTree: A Fast Hardware HLBVH Constructor for Animated Ray Tracing

TIMO VIITANEN, MATIAS KOSKELA, PEKKA JÄÄSKELÄINEN, HEIKKI KULTALA, and JARMO TAKALA,
Tampere University of Technology, Finland

Ray tracing is a computationally intensive rendering technique traditionally used in offline high-quality rendering. Powerful hardware accelerators have been recently developed which put real-time ray tracing even in the reach of mobile devices. However, rendering animated scenes remains difficult, as updating the acceleration trees for each frame is a memory-intensive process. This article proposes MergeTree, the first hardware architecture for Hierarchical Linear Bounding Volume Hierarchy (HLBVH) construction, designed to minimize memory traffic. For evaluation, the hardware constructor is synthesized on a 28nm CMOS process technology. Compared to a state-of-the-art binned SAH builder, the present work speeds up construction by a factor of 5; reduces build energy by a factor of 3.2, and memory traffic by a factor of 3. A software HLBVH builder on GPU requires 3.5 times more memory traffic. In order to take tree quality into account, a rendering accelerator is modeled alongside the builder. Given the use of a toplevel build to improve tree quality, the proposed builder reduces system energy per frame by an average 41% with primary rays and 13% with diffuse rays. In large (>500K triangles) scenes, the difference is more pronounced, 62% and 35%, respectively.

ACM Reference format:

1 INTRODUCTION

Ray tracing is a rendering technique where effects such as shadows, reflection and global illumination are more natural to express than in rasterization. Mainstream use of ray tracing has been restricted to offline rendering, but recent years have seen a concerted effort by the academia and the industry to enable real-time ray tracing of dynamic scenes. One prong of this effort has been the development of dedicated ray tracing hardware architectures, both based on programmable processors [47, 48, 53], fixed-function hardware pipelines [32, 41], reconfigurable pipelines [28] and building on conventional, rasterization-based GPUs [25]. Several works focus on mobile systems, reasoning that the ray tracing approach scales well to drawing complex scenes on small displays [48], or aiming to create mobile augmented-reality experiences with physically based lighting [32]. Recently, a commercial mobile GPU IP with ray tracing acceleration has been announced alongside a programming API [44].

Recent ray-tracing hardware accelerators are able enable real-time ray tracing, so far restricted to high-end desktop GPUs, on mobile devices. However, they have so far been largely restricted to scenes with little or no animated content. The reason is that fast rendering algorithms require the scene to be organized in an acceleration data structure such as a Bounding Volume Hierarchy (BVH) tree. When displaying a dynamic scene, the data structure needs to be updated or rebuilt on each frame as the scene changes, posing an additional computational challenge. Given enough animated geometry, construction effort overtakes rendering, as ray tracing scales logarithmically with the number of scene primitives, while construction algorithms have $O(n)$ [30] or $O(n \log n)$ [51] complexity. On the desktop, tree construction has been the subject of intensive research, and fast GPU tree builders now exist which organize large scenes at real-time rates. The fastest builders are based on the Linear Bounding Volume Hierarchy (LBVH) algorithm by Lauterbach et al. [30], and the improved Hierarchical LBVH (HLBVH) by Pantaleoni and Luebke [42], and are able to organize scenes with millions of triangles in real time. These builders leverage the massive amount of computing resources and memory bandwidth available on desktop GPUs, and are, therefore, not directly applicable on mobile systems with limited resources.

A particular restriction of mobile devices is their limited memory bandwidth: a high-end mobile System-on-Chip (SoC) has an order of magnitude less memory bandwidth than a high-end desktop GPU. CMOS logic scaling has allowed increasingly complex on-chip computation to fit in the tight power budgets of mobile SoCs, but the energy cost of off-chip communication has scaled down at a slower pace, and is now very expensive compared to computation. For example, reading the operands of a double-precision multiply-add from external memory and writing back the result costs ca. 200 times more energy than the arithmetic itself [24]. Hence, the design of mobile hardware is an exercise of minimizing memory accesses to work around the memory bottleneck. Mobile GPUs incorporate a slew of special architectural techniques to this end, such as tile-based rendering, texture compression [5] and frame buffer compression [46].

A similar body of memory-conserving techniques is emerging for ray tracing accelerators, including treelent scheduling [3], streaming data models [28] and quantized trees [25]. Tree construction for ray tracing is even more memory-intensive than rendering, as the fast sorting-based build algorithms iterate over datasets of hundreds of megabytes, and perform little computation for each element.
This article focuses on mobile hardware acceleration of the HLBVH algorithm [42]. HLBVH is interesting as a powerful builder in its own right, and as a component in virtually all build algorithms that aim for a fast build time [12, 18, 23]. HLBVH is also an interesting target for hardware acceleration since it does not make heavy use of floating-point arithmetic, hence, a hardware builder could fit in a small silicon footprint. We investigate whether the high build performance of HLBVH on GPU can translate into energy-efficient operation in a mobile context.

The main contributions of this article are as follows. We propose the first hardware HLBVH builder architecture, named MergeTree. MergeTree incorporates novel architectural techniques to reduce memory bandwidth usage: a hardware-accelerated external sort and a novel streaming algorithm for joint hierarchy emission and AABB computation, which operates directly on the sort outputs. The proposed architecture is evaluated via logic synthesis and power analysis on a 28nm Fully Depleted Silicon on Insulator (FDSOI) process technology, and by means of a system-level model which includes traversal and intersection hardware. In addition, two toplevel builds are evaluated as inexpensive postprocessing steps to improve tree quality. Early simulation results for MergeTree were reported in a conference brief [49].

Compared to previous work which uses the more expensive binned SAH algorithm [13], MergeTree gives large improvements in build performance, energy efficiency, memory traffic and silicon area, at the cost of reduced tree quality. Toplevel builds are able to recover much of the quality at low cost. System-level modeling shows that with large animated scenes, the energy cost of tree construction becomes comparable to the cost of rendering, and takes up a significant fraction of a mobile power budget. Hence, the build energy savings from MergeTree translate into significant system energy savings despite the slightly lower tree quality. We also observe that most of the energy footprint in hardware-accelerated tree construction is due to DRAM traffic. A direct translation of GPU HLBVH algorithms to hardware, without the proposed memory traffic optimizations, would have energy consumption and runtime similar to [13].

This article is organized as follows. Section 2 discusses background on BVH construction algorithms. Section 3 reviews related work on hardware tree builders and sorting units. Section 4 discusses the basic algorithmic approach in this work and tradeoffs, while Section 5 describes the hardware architecture implementing the chosen algorithm. In Section 6, the architecture is evaluated by means of ASIC synthesis and system-level simulations, and a detailed power analysis is presented. Section 7 discusses limitations of the proposed architecture and future work, and Section 8 concludes the article.

2 PRELIMINARIES

In a BVH, each node subdivides primitives into two disjoint sets, whose Axis-Aligned Bounding Boxes (AABB) are stored. If a traced ray does not intersect an AABB, all primitives underneath can be discarded, greatly speeding up the rendering process. A standard way to evaluate the quality of a BVH tree is its Surface Area Heuristic (SAH) cost, introduced by Goldsmith and Salmon [19]. The SAH cost of a data structure is the expected cost to traverse a random non-terminating ray through the scene.

A gold-standard way to construct BVH trees is the SAH sweep [36], a greedy top-down partitioning algorithm which at each step evaluates all possible axis-aligned splits of the primitives, according to their AABB centroids, into two subset. The algorithm then selects the split with the lowest SAH and repeats recursively for each subset. Since the basic SAH sweep has a long runtime, often the binned variation [51] is used instead, which evaluates only, e.g., 8 or 16 possible splits per axis. Starting with Linear BVH (LBVH) by Lauterbach et al. [30], a family of GPU construction algorithms has been proposed which are orders of magnitude faster than SAH-based builders. In LBVH, the scene primitives are first sorted according to the Morton codes of their AABB centroids, and then in the process of hierarchy emission, a BVH hierarchy is built which has a binary radix tree topology with regards to the sorted Morton codes. Finally, the AABBs of each node are computed in a bottom-up order.

Pantaleoni and Luebke [42] propose Hierarchical Linear BVH (HLBVH) with improved build performance and a more compact
memory layout compared to LBVH. They further suggest improving tree quality by rebuilding upper levels of the tree with a binned SAH sweep, in an approach called HLBVH+SAH. We use this terminology in the present work, though several works use LBVH to denote HLBVH+SAH. Garanzha et al. [17] and Karras [22] have further optimized the GPU software implementation of HLBVH, especially the hierarchy emission which is non-trivial to parallelize. Most recently, Apetrei [6] combined the hierarchy emission and AABB calculation stages into a single step for a further speedup.

More complex algorithms have been built around HLBVH which further improve tree quality. Karras and Aila [23] divide a HLBVH tree into treelets and rearrange nodes within each treelet in parallel to achieve higher tree quality, in an approach later dubbed Treelet Restructuring BVH (TRBVH). In the Agglomerative TRBVH (ATRBVH) of Domingues and Pedrini [12], the exhaustive search of treelet permutations in TRBVH is replaced with an agglomerative build, yielding nearly the same tree quality at a fraction of the build time. Garanzha et al. [18] sort primitives according to their Morton codes, but instead of the HLBVH hierarchy emission, they store primitive counts in a multi-level grid and perform an approximate SAH sweep. Both TRBVH and Garanzha et al. [18] also break large triangles with spatial splits to improve tree quality. Recently, Ganes- tam and Doggett [16] proposed a fast, high-quality preprocessing step for triangle splitting.

3 RELATED WORK

In this section, we introduce related work on hardware architectures for tree construction and sorting.

3.1 Tree Build and Update Hardware

Some hardware builders have been proposed for k-d trees, an alternative acceleration data structure to BVH. The RayCore architecture [39] incorporates a hardware k-d tree builder unit which builds high levels of the hierarchy with a binned SAH sweep, and switches to a sorting-based build when the dataset is small enough to fit in on-chip SRAM. The builder is suitable only for small models, e.g., a 64K triangle scene already takes 0.1 seconds to build. Therefore, they also design their renderer to use two acceleration trees: the smaller tree contains all animated geometry and is rebuilt with the hardware unit. The FastTree unit [34] uses Morton codes for k-d tree construction, and is the fastest k-d tree constructor hardware so far, at ca. 4 times the performance of the RayCore builder. However, tree quality is not evaluated. They use a memory-intensive radix sort, but this does not appear to harm performance, since k-d trees are much more compute-intensive to construct than BVHs, so their memory interface is not stressed.

In the literature, BVH trees are shown to be less expensive to build and update than k-d trees. Doyle et al. [13] propose the first hardware architecture for BVH construction, which implements the recursive binned SAH sweep algorithm. The architecture was recently prototyped on FPGA [14]. The HART rendering system [40] updates the BVH tree with hardware-accelerated refit operation instead of running a full rebuild on each frame. Since tree quality degrades with each refit, asynchronous rebuilds are run on the CPU to refresh the tree.

The present work is the first hardware implementation of the HLBVH algorithm, which is the basis for most high-performance GPU builders. In contrast to GPU implementations of HLBVH [6, 17, 22, 42], we adapt the algorithm for a streaming, hardware-oriented implementation with minimal memory traffic. The produced trees remain identical to the original work [42]. Our main point of comparison is the state-of-the-art builder by Doyle et al. [13], which implements binned SAH, a more computationally expensive algorithm. Compared to a refit accelerator [40], the proposed builder is able to handle animations which affect mesh topology, e.g., fluids rendered with Marching Cubes, and can handle animation frames with entirely new geometry.

3.2 Sorting Hardware

The multi-merge sort approach has recently been used to sort large bodies of data with FPGAs to accelerate database operations. Koch and Torrensen [27] implement their multi-merge logic with a tree of comparators, and merge data from up to 102 input buffers. As a main difficulty in comparator tree design, they identify the problem of propagating back-pressure through the tree in a single cycle, and solve the issue by inserting decoupling FIFOs that split the tree into smaller sub-trees. Moreover, they pipeline the compare operations to get a higher operating frequency on FPGA. Casper et al. [9] further increase throughput by augmenting the top of the tree with comparators that produce multiple sorted values per cycle. They demonstrate merges from up to 8K input buffers per cycle, but in this case require over 2 MB of buffer memories. The proposed accelerator implements a novel multi-merge based on a pipelined hardware heap rather than a comparator tree, giving a compact silicon area footprint at the cost of reduced throughput.

4 ALGORITHM DESIGN

In this section we describe how we adapt Pantaleoni and Luebke’s [42] HLBVH algorithm to reduce memory traffic.

4.1 Data Structure Design

There are several variations of BVHs in the literature, and the chosen variant can have implications on build performance, so we describe the layout used in this work in detail here. In this layout, the node data structure stores the axis aligned bounding boxes of its two children and pointers to them. Áfra et al. [1] describe this arrangement as MBVH2. A leaf size field determines whether the child is another node or a leaf. Leaves are contiguous sub-arrays in a leaf table, bounded by the index and leaf size fields. Each entry in the table is a pointer to primitive data, which is used for ray-primitive tests and shading. The complete node data structure is 64 bytes long as shown in Fig. 2.

There are many variations on the above details in the literature. Sometimes a node only has its own AABB and two child pointers, but the two-AABB structure is superior in hardware ray tracing [31]. More importantly, many high-performance ray tracers, e.g., Aila and Laine [4], store primitive data in the leaf table, foregoing the extra indirection per rendering. Often the primitives are also preprocessed for faster intersection testing with, e.g., Shevtsov’s [45] method.
Since primitives in the same leaf may not have been contiguous in the input data, this implies rearranging the primitives.

The present work opts to use a reference table for two main reasons. First, we try to make our results compatible with the main prior work by Doyle [13], which to our best understanding has this structure. Secondly, by taking primitive AABBs as input, the resulting architecture is generic to any primitive type for which an AABB can be computed - some examples of useful primitives are the pyramidal displacement mapped surfaces in [39] and indexed-vertex triangle lists in [25]. A primitive-leaf table could be produced as a post-processing step.

4.2 Sorting

Sorting accounts for much of the memory traffic in HLBVH, so we optimized it by referring to literature on external sorting data on slow magnetic disc drives. One optimal sorting algorithm in this environment is the multimergesort [2]. Given N data elements that reside in slow external memory, a fast local memory of size M, and a preferred read length of B, the multimergesort first performs partial sorts for N/M blocks of size M. After this, the algorithm runs multi-merge passes which merge M/B sorted blocks into a larger block. Table 1 compares the minimum memory accesses of multimergesorting in the context of tree construction to a typical radix-16 sort. Assuming one primitive per leaf, a BVH organizing N primitives has N − 1 nodes. Any builder, then, has unavoidable memory traffic from primitive input (32B per input AABB) and hierarchy emission (64B per input AABB for nodes and 4B for the leaf table). In addition, primitive sorting requires memory accesses.

GPU implementations often use, e.g. a radix-16 parallel prefix sort, which performs eight passes through the data, each pass reordering the data according to four bits of the sort key. In each pass the entire data array is read twice and written once. Assuming the sort operates on 8B Morton code - primitive reference pairs, it then requires 3 × 8 × 8B = 192B traffic per input AABB. Finally, the joint hierarchy emission and AABB computation stage must fetch the primitive AABBs referenced by the sort results, adding 32B to the unavoidable 68B output traffic. Out of the total traffic of 324B, more than half is produced by the sort. Replacing the radix sort with multimergesort and assuming a small enough scene to sort in a single pass (2M triangles in the proposed design), the sort traffic drops to a negligible 16B, with an additional 16B per pass. Assuming the AABBs from primitive input may be streamed on-chip to the block sort stage, and the results of the multi-merge to the hierarchy emission stage, the sorted pairs are only accessed twice, for 16B traffic. The total traffic of 148B is less than half that of the radix sort case.

The reads and writes in Table 1 are otherwise consequent, except when the hierarchy emission stage loads the primitive AABBs referenced by each sort output, it generates inefficient 32B random accesses. Consequently, it is interesting to directly sort the primitive AABBs instead of references. Direct AABB sorting is clearly inefficient with a radix sort due to the quadrupled sorting traffic. With multimergesort, the extra traffic is smaller (48B), and nearly offset by the removed hierarchy emission loads (32B). Total traffic still increases by ca. 11%, but all memory accesses can now be arranged in long consecutive bursts which are more efficient. If two or more multi-merge passes are needed, the advantage of AABB sorting is less clear. It is, then, desirable to use AABBs as sorting elements and support as wide a merge as practical, so that scenes of interest fit in a single pass. We use the AABB multi-merge approach as the basis for the present work. It should be mentioned that, in the extreme, the primitives themselves could be used as sort elements. It is inexpensive to add hardware to recompute their AABBs and Morton codes on demand, so the main cost would be increased memory traffic and on-chip storage. This approach is efficient for generating a leaf array with primitive data, discussed in the previous section, but foregoes generality of the architecture for different primitives.

5 HARDWARE ARCHITECTURE

This section describes a hardware architecture named MergeTree which implements the designed construction algorithm. A block diagram of the architecture is shown in Fig. 1. The architecture can be divided into the pre-sorting and multi-merge stages which sort an array of input AABBs according to their Morton codes, and a hierarchy stage which processes the sorted AABBs to emit a BVH tree. It has two operating modes where different sub-modules are active, as shown in Fig. 3. The partial sort mode is used to generate N AABB arrays small enough to fit in the on-chip AABB storage, while the hierarchy stage is inactive. In the multi-merge mode, the arrays are merged into a final sorted sequence and fed into the
MergeTree: A Fast Hardware HLBVH Constructor for Animated Ray Tracing

•

5.1 Primitive Input
The main data format used for input and internal storage is an AABB with three lower-bound and three upper-bound coordinates, a memory reference to primitive data, and a Morton code, for a total of 256 bits.

5.2 Heap Unit
The main component of the proposed algorithm which requires hardware acceleration is the multi-merge from many input sequences into a single, sorted output stream. Hardware acceleration is necessary since a sequential heap implementation for a heap of capacity \( n \) takes up to \( \log n \) compare-swaps to insert an element, which is too slow for our use. Since comparator trees, used by recent sorting accelerators [9, 27], appear unreasonably large when scaled to wide inputs, we use an alternate approach of implementing a pipelined heap data structure in hardware. In software implementations the heap is stored in a single array in memory, where the children of an element can be found with simple arithmetic. However, in custom-designed hardware, each level of the heap can be implemented as a separate memory module, and heap operations would then operate in a pipelined manner, such that a new heap operation can be started while previous operations are still propagating toward deeper levels. This hardware structure was proposed by Bhagwan and Lin [8] for the implementation of large priority queues in telecommunications processors. Ioannou and Katevenis [20] optimize the design for clock speed by overlapping stages of computation. In this work we largely follow the design of the latter work, and we refer the reader to their article [20] for details. We support the insert and replace operations, and implement remove as a replace with a large special-case value \( \infty \). Replace operations have a maximum throughput of one value per two clock cycles, since fully pipelined operation would need an expensive set of global bypass wires between heap levels.

Detailed comparison to comparator trees is outside the scope of this article, but it should be noted that trees have \( O(n \log n) \) comparators with regards to the merge width, and the present design has \( O(\log n) \). The heap is, consequently, easier to scale to the wide merges desired in this work. Our throughput is lower than an optimized comparator tree, but since we have a much higher clock frequency available on ASIC than FPGA, and are sorting larger data elements, this is less of an issue than in the FPGA works. If more throughput is desired, it may be interesting to use a hybrid scheme similar to [9], combining a small toplevel comparator tree with subtrees implemented as heaps.

5.3 Multi-Merge Unit
The pipelined heap is connected to the rest of the system by a hardware finite state machine which initializes the heap, requests replacement data from the memory and feeds it into the heap, and emits output data. To limit the size of the heap, the full primitive AABBs are stored in a SRAM scratchpad memory, and the heap only contains Morton codes and DRAM addresses of the corresponding primitives. The scratchpad memory is organized into a set of double-buffered queues for each block being multi-merged: when one buffer has been processed, a memory read is queued to replace it, but processing can continue from the other side of the double buffer. Only if the second half of the double buffer is also consumed before replacement input arrives for the first half, does the multi-merge unit need to stall. With double buffering, the multi-merge unit provides a degree of memory latency hiding, as the merge process is likely to visit multiple other buffers between the \( B \) elements of a single buffer. This property depends on the heap accesses being...
well distributed between the blocks: if, e.g., the input data is already
sorted, performance may suffer.

At the outset of partial sort and multi-merge modes, the FSM
makes memory requests to fill the scratchpad, and inserts the ini-
tial elements of each buffer into the empty heap. In steady-state
operation, on each even cycle, the finite state machine reads the
top element of the heap, bit manipulates its DRAM address to find
the scratchpad location of the following data element in the same
block, and begins a SRAM read at that location. On the following
cycle the SRAM data is available and used for a replace heap opera-
tion. On odd cycles, the AABB corresponding to the previously read
heap-top value is read from scratchpad, and is written to the output
FIFO on the next even cycle, simultaneously with the next heap-top
read, in a pipelined manner. Like the heap unit, this design is also
half-pipelined, producing one sorted AABB per two cycles in the
absence of stalls.

Fig. 4 demonstrates this steady-state behavior. On the example
even cycle, the scratchpad address addr(1) of a sorted element is
read from the heap, incremented, and used to initialize a scratchpad
read of the element’s successor in the same buffer. On the following
odd cycle the successor’s Morton code morton(X) is available from
the scratchpad, and is used to replace the top element of the heap.
Simultaneously, a scratchpad read is initialized with the original
(incremented) address. On the next even cycle, the sorted ele-
ment’s AABB would be read from the scratchpad and written to the
output, as is done for the previous sorted element (0) in Fig. 4.

A separate register array tracks which buffers in the scratchpad
are valid: if the finite state machine attempts to fetch invalid data,
exection instead stalls until the referenced data is available. Loca-
tions are invalidated when consumed, and validated again when
replaced from memory. When the final element from the given block
or scene is read, a remove operation is performed instead of a re-
place. To avoid special-case handling for the possible small buffer
at the end of the scene, it is padded to full size with special-case
AABBs with a higher Morton code than is generated for normal
scene geometry: these are, then, sorted to the end of the scene and
ignored by subsequent processing.

5.4 Hierarchy Emitter
In order to minimize external memory traffic, we stream sorted
AABBs to a hardware finite state machine which implements a
serial HLBVH hierarchy emission algorithm. The hardware unit
implements Algorithm 1, such that a single stack operation is per-
formed per cycle.

Fig. 5 shows a visual example of the algorithm in operation. The
algorithm reads in a sorted stream of AABBs, and computes the
highest differing bit between each pair of successive Morton codes,
which is interpreted as a hierarchy level for an inner node to be
generated. Based on each input’s hierarchy level, the unit either
concatenates the input into a large leaf (lines 4-8), pushes it into
a small hardware stack, or combines it with the top AABB in the
stack to generate a node. The latter process is then repeated with
the AABB of the generated node used as the input node, until a
higher-level element is found in the stack, or the stack is empty.
Each node is output when sufficient primitives have been read to
determine its child bounding boxes, resulting in a bottom-up, depth-
first order. Stack entries represent inner nodes whose right child is
unknown: they consist of a left child AABB and a hierarchy level.
Optionally, a highlevel_threshold parameter can be supported by the
hardware in order to emit high-level nodes for a separate toplevel
build (lines 15-19), as in the HLBVH+SAH by Pantaleoni and Lue-
bke [42]. The unit then generates separate trees for each highlevel
grid cell, and outputs their AABBs to a buffer for postprocessing.
When the parameter is greater than the highest Morton code bit,
the unit’s reverts to conventional HLBVH and outputs a single tree.

An example of the algorithm is shown in Fig. 5. In Cycles 1 and
2 in Fig. 5, stack entries for nodes on hierarchy levels 2 and 1 are
pushed to the stack: the left child of the level-2 node is the primitive
and child of the level-1 node, b. The next encountered hierarchy
level of 3 is higher than the stack-top at level 1, so the stack-top
node A can be completed. On Cycle 4, the next node in stack can be
completed. On Cycle 5, a stack entry containing the entire subtree
constructed so far, is pushed to the stack: it will become the root of
the tree. Cycles 6 through 9 finish the right-side subtree in the same
manner, except that primitives d and e have the same Morton code,
and so are combined to the same leaf. Finally on Cycle 10, the top
node is emitted. Cycles 8..10 also show that processing finishes with
a special-case input AABB with higher Morton code value than in
the rest of the geometry: this causes the remaining stack entries
to be popped. The generated inner node topology in this example
corresponds to the Morton code bits shown in Fig. 5.

It is visible that generating n inner nodes requires 2n stack opera-
tions, while enlarging a leaf takes 1 cycle. Since a BVH organizing
m leaves has at most m − 1 nodes, the worst-case runtime of the
emitter is 2m − 2 cycles for for m inputs, when there is exactly one
primitive per leaf. The average throughput of the hierarchy emitter
is, then, the same or higher as that of the multi-merge unit, but data
is consumed at an uneven rate depending on inputs, so a FIFO buffer

ALGORITHM 1: Streaming hierarchy emission algorithm

```
while True do
    input ← nextInput ;
    read nextInput from FIFO ;
    while input and nextInput have the same Morton codes do
        input ← nextInput ;
        read nextInput from FIFO ;
        input ← combine (input, nextInput) ;
    end
    if diff > highlevel_threshold then
        highlevel_output.push( input ) ;
    else
        stack.push( input, diff ) ;
    end
end
```
is required between the two units. Hierarchy emission finishes by assuming the highest differing bit after the final sorted primitive is ∞, which causes all remaining stack entries to be popped. The maximum required stack depth is the same as the number of Morton code bits, i.e., possible hierarchy levels.

5.5 Partial Sort
The multi-merge stage described above is straightforward to reuse for the scratchpad-sized partial sort, by configuring the merge heap so that each double-buffer in the AABB storage is the final one in its block, and no further buffers are fetched. Then the only additional hardware needed is logic to sort every buffer-sized sub-block prior to merging. This is easy to implement concurrently with data reading, by streaming the read data into a small number of buffer-sized pre-sorters: our implementation consists of a hardware state machine performing an insertion sort at a rate of one compare-and-swap per second, plus one cycle of overhead for each inner loop of the sort. Multiplexers are inserted to bypass the insertion sorters in the multi-merge mode, and the hierarchy emitter in the partial sort mode.

5.6 Toplevel build
Trees produced by HLBVH likely require post-processing for acceptable quality. The earliest idea proposed in this direction is HLBVH+SAH [42], where top levels of the BVH hierarchy, corresponding to high bits of the Morton code, are rearranged with a higher-quality algorithm. This toplevel build concept is of particular interest in a memory-constrained system, as the datasets are small enough to fit on-chip.

In order to evaluate toplevel builds, we add a configuration option to our simulated hardware to emit an array of high-level nodes, which can then be passed to a separate software or hardware builder. Two toplevel builders are evaluated in this work: a binned SAH build using the accelerator of Doyle et al. [13], and a software implementation of ATRBVH [12]. ATRBVH is itself a LBVH postprocessing step, but the usage in this paper is novel in that we apply the algorithm only to the high-level nodes rather than processing the entire tree: it is then sufficiently lightweight to give real-time performance on a mobile CPU.

6 EVALUATION
This section describes first models used for comparison against the state-of-the-art BVH builder [13]. We then evaluate the performance, silicon area and power characteristics of the proposed builder architecture in isolation. Finally, the builders are modeled as components of a larger rendering system.

6.1 Binned SAH builder model
The closest point of comparison for the proposed builder is the binned SAH architecture proposed by Doyle et al. [13]. The SAH builder is also of interest as a top-level builder used to improve the tree quality of HLBVH trees output by the current work. For
evaluation, it is interesting to examine the builder in more scenes than reported in [13], and to consider its energy consumption.

First, memory traffic is modeled by instrumenting a software binned SAH builder to record build statistics. Traffic is caused by the BVH output, and by large sweeps whose datasets are too long to fit in local primitive buffers. The architecture instance in [13] has buffer capacity for 8192 primitive AABBs. However, during each sweep, the hardware simultaneously produces input data for two child sweeps: in some cases both children could individually fit in the buffers, but are together too large. In this case, we assume only the smaller child to be a large sweep. As shown in Tab. 2, this model comes close to replicating the memory traffic results in [13]. A fixed size threshold of 4096 slightly overestimates traffic, while a threshold of 8192 underestimates it. Floating-point operations are counted based on Wald’s algorithm [51], and then combined with memory traffic to obtain a lower-bound memory model.

Finally, the runtime of a simplified binned SAH builder is modeled so as to give a lower bound for toplevel build performance. The simplified builder operates serially and has a single partitioning unit. It alternates between partitioning and bining a sweep at a rate of one input AABB per cycle, and SAH computation, which takes 32 cycles at the end of each sweep. The unit simplified in this way is substantially slower than the original, but as toplevel trees have only ca. 500-2000 nodes in our test scenes, toplevel build has negligible effect on runtime.

### 6.2 Implementation and Power Analysis

In the graphics hardware community, hardware complexity is often estimated by counting arithmetic units and memories in the design, but in this case we are especially interested in the energy of the proposed architecture, and whether it can reach a high clock frequency. Therefore we wrote a prototype RTL description of the proposed architecture and synthesize it on a CMOS technology. All components in Fig. 1 were implemented in SystemVerilog, and SRAM macros were used for the AABB storage.

The tree builder was synthesized on a 28nm FDSOI technology with Synopsys Design Compiler. The parameters of the builder were set at $M = 8192$, $B = 16$, resulting in a unit with a 256KB scratchpad memory, which handles up to 2M triangles in one pass, reads data in 512B increments and performs a 256-way merge. We include eight partial sorters for scalability. To determine the buffer size $B$, we experimented with the DRAMPower model [10] and found that increasing consecutive access size is clearly beneficial at least up to 512B (16 AABBs). The target frequency is set at 1GHz, supply voltage at 1V, and operating temperature at 25°C. Clock gating and multi-threshold voltage optimizations are enabled.

In order to evaluate performance, we run RTL simulations of the builder unit with various input scenes and memory interfaces. The external memory is modeled with the DRAM simulator Ramulator [26]. For the memory organization, we select 64-bit, 1- and 2-channel LPDDR3-1600 (slow, medium) which are the closest devices to state-of-the-art mobile device memory for which we are able to perform power analysis. In addition, we simulate a 64-bit, 4-channel LPDDR3-1333 memory (fast) which gives a bandwidth close to Doyle [13] to facilitate a direct runtime comparison, but this interface is not representative of mobile systems. We subtract from all memory power figures a static power term computed from an idle memory transaction trace, corresponding to, e.g., refresh power, to isolate the extra dynamic power added by the tree build. Ramulator is integrated to the RTL testbench through the SystemVerilog Direct Programming Interface wrapper, such that memory requests from the simulated builder map into Ramulator transactions, and input data is fed into the builder when the corresponding read transaction completes. RTL simulation was run on 14 test scenes, and the resulting trees were verified in a software ray tracer.

We have also implemented a C++ architectural simulator which gives results within ca. 5% of the RTL simulation at a 20x faster runtime, and allows easier observation of the build process. Fig. 6 shows example simulation traces generated with the C++ simulator. The different execution states are visible: first the unit alternates between partial sort reads which utilize the insertion sorters, and writes which utilize the merge heap. Most of the execution time is spent on the multi-merge phase, which is clearly memory-limited in the slower memory options. In the fast memory option, the throughput of the multi-merge hardware starts to limit performance. Finally, a toplevel SAH build is shown, which is more compute-intensive and uses little memory.

The build times, memory traffic and tree quality of the proposed builder are compared to related work. As a desktop benchmark, we compare against the high-quality ATRBVH builder by Domingues and Pedrini [12] with default settings, and their freely available implementation of Karras’ HLBVH algorithm [22], set to use 32-bit Morton codes. The GPU builders are run on a computer with a GeForce GTX 1080 GPU and a Intel Core i7-3930K CPU, counting only kernel execution times. In tree builder hardware, we compare against the state of the art binned SAH builder by Doyle et al. [13] and the k-d tree builder FastTree [34]. We use the performance figures from their article, and generate memory traffic as described in the previous subsection. Memory traffic for GPU HLBVH was extracted with `nvprof`.

Tree quality is evaluated based on the SAH cost of produced trees [19]. The SAH cost $C$ of a BVH can be computed as:

$$C = C_l \sum_{l=0}^{n_{nodes}} \frac{A(N_l)}{A(R)} + C_f \sum_{l=0}^{n_{leaves}} \frac{A(L_l)}{A(R)} + C_t \sum_{l=0}^{n_{leaves}} P_l A(L_l),$$

where $A(N_l)$ and $A(L_l)$ are the surface areas of the given inner nodes and leaves, $A(R)$ the surface area of the scene AABB, $P_l$ is the primitive count within a given leaf, $C_l$ is the cost of traversing a node, $C_f$ the cost of traversing a leaf, and $C_t$ the cost of a
Table 3. Build performance and quality comparison. SAH costs are relative to a full SAH sweep. Average build time is normalized to GTX 1080 HLBVH. BW denotes system memory bandwidth. The proposed builder (HW HLBVH) is evaluated with three DRAM configurations. Toplevel builds (HW binned SAH, TK1 ATRBVH) show build time in addition to HW HLBVH.

<table>
<thead>
<tr>
<th>Model</th>
<th>BW (GB/s)</th>
<th>Time (ms)</th>
<th>SAH</th>
<th>Time (ms)</th>
<th>SAH</th>
<th>Time (ms)</th>
<th>SAH</th>
<th>Time (ms)</th>
<th>SAH</th>
</tr>
</thead>
<tbody>
<tr>
<td>Crytek (262K)</td>
<td>320</td>
<td>0.27</td>
<td>166%</td>
<td>0.45</td>
<td>103%</td>
<td>0.52</td>
<td>123%</td>
<td>0.82</td>
<td>129%</td>
</tr>
<tr>
<td>Conference (283K)</td>
<td>1.44</td>
<td>76%</td>
<td>2.67</td>
<td>89%</td>
<td>3.35</td>
<td>92%</td>
<td>6.03</td>
<td>87%</td>
<td></td>
</tr>
<tr>
<td>Sportscar (301K)</td>
<td>44</td>
<td>103%</td>
<td>-</td>
<td>104%</td>
<td>3</td>
<td>103%</td>
<td>-</td>
<td>102%</td>
<td></td>
</tr>
<tr>
<td>Italian (375K)</td>
<td>12.8</td>
<td>5.1</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>10.8</td>
<td>-</td>
<td>-</td>
<td></td>
</tr>
<tr>
<td>Hokaido (500K)</td>
<td>25.6</td>
<td>0.10</td>
<td>-</td>
<td>0.66</td>
<td>-</td>
<td>0.91</td>
<td>-</td>
<td>1.46</td>
<td></td>
</tr>
<tr>
<td>TK1 ATRBVH topl.</td>
<td>12.8</td>
<td>0.08</td>
<td>-</td>
<td>0.49</td>
<td>-</td>
<td>0.66</td>
<td>-</td>
<td>1.08</td>
<td></td>
</tr>
<tr>
<td>Mem. BW</td>
<td>14.9</td>
<td>5.60</td>
<td>80%</td>
<td>28.92</td>
<td>116%</td>
<td>9.31</td>
<td>113%</td>
<td>6.35</td>
<td>109%</td>
</tr>
<tr>
<td>Babylonia (500K)</td>
<td>320</td>
<td>1.10</td>
<td>151%</td>
<td>1.25</td>
<td>150%</td>
<td>1.22</td>
<td>184%</td>
<td>1.45</td>
<td>214%</td>
</tr>
<tr>
<td>Kitchen (764K)</td>
<td>320</td>
<td>8.11</td>
<td>87%</td>
<td>10.01</td>
<td>96%</td>
<td>9.45</td>
<td>89%</td>
<td>11.69</td>
<td>85%</td>
</tr>
<tr>
<td>Dragon (871K)</td>
<td>44</td>
<td>-</td>
<td>107%</td>
<td>11</td>
<td>103%</td>
<td>-</td>
<td>120%</td>
<td>-</td>
<td>105%</td>
</tr>
<tr>
<td>HW k-d tree [34]</td>
<td>12.8</td>
<td>17.2</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td></td>
</tr>
<tr>
<td>HW HLBVH (proposed)</td>
<td>12.8</td>
<td>3.60</td>
<td>142%</td>
<td>2.83</td>
<td>154%</td>
<td>4.17</td>
<td>142%</td>
<td>4.80</td>
<td>208%</td>
</tr>
<tr>
<td>HW Binned SAH topl.</td>
<td>12.8</td>
<td>0.06</td>
<td>109%</td>
<td>0.02</td>
<td>106%</td>
<td>0.08</td>
<td>121%</td>
<td>0.04</td>
<td>135%</td>
</tr>
<tr>
<td>TK1 ATRBVH topl.</td>
<td>14.9</td>
<td>10.05</td>
<td>99%</td>
<td>4.33</td>
<td>89%</td>
<td>15.20</td>
<td>116%</td>
<td>8.62</td>
<td>117%</td>
</tr>
<tr>
<td>Mem. BW</td>
<td>320</td>
<td>1.80</td>
<td>208%</td>
<td>2.49</td>
<td>155%</td>
<td>2.99</td>
<td>136%</td>
<td>3.66</td>
<td>133%</td>
</tr>
<tr>
<td>Mem. BW</td>
<td>320</td>
<td>14.69</td>
<td>91%</td>
<td>19.80</td>
<td>91%</td>
<td>23.22</td>
<td>90%</td>
<td>29.55</td>
<td>84%</td>
</tr>
<tr>
<td>Mem. BW</td>
<td>44</td>
<td>-</td>
<td>104%</td>
<td>-</td>
<td>101%</td>
<td>30.00</td>
<td>102%</td>
<td>-</td>
<td>102%</td>
</tr>
<tr>
<td>Mem. BW</td>
<td>12.8</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>52.5</td>
<td>-</td>
<td>65.5</td>
<td>-</td>
</tr>
<tr>
<td>Mem. BW</td>
<td>12.8</td>
<td>6.52</td>
<td>202%</td>
<td>8.61</td>
<td>149%</td>
<td>13.08</td>
<td>135%</td>
<td>15.60</td>
<td>132%</td>
</tr>
<tr>
<td>Mem. BW</td>
<td>25.6</td>
<td>4.43</td>
<td>-</td>
<td>6.07</td>
<td>-</td>
<td>8.55</td>
<td>-</td>
<td>10.32</td>
<td>-</td>
</tr>
<tr>
<td>Mem. BW</td>
<td>42.7</td>
<td>3.21</td>
<td>-</td>
<td>4.46</td>
<td>-</td>
<td>6.04</td>
<td>-</td>
<td>7.32</td>
<td>-</td>
</tr>
<tr>
<td>Mem. BW</td>
<td>12.8</td>
<td>0.04</td>
<td>147%</td>
<td>0.02</td>
<td>113%</td>
<td>0.08</td>
<td>120%</td>
<td>0.06</td>
<td>120%</td>
</tr>
<tr>
<td>Mem. BW</td>
<td>14.9</td>
<td>5.74</td>
<td>128%</td>
<td>3.98</td>
<td>94%</td>
<td>17.60</td>
<td>123%</td>
<td>12.84</td>
<td>122%</td>
</tr>
<tr>
<td>Mem. BW</td>
<td>320</td>
<td>4.41</td>
<td>160%</td>
<td>5.07</td>
<td>144%</td>
<td>0.68</td>
<td>153%</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td>Mem. BW</td>
<td>320</td>
<td>39.84</td>
<td>82%</td>
<td>42.37</td>
<td>88%</td>
<td>5.14</td>
<td>92%</td>
<td>3.96</td>
<td>99%</td>
</tr>
<tr>
<td>Mem. BW</td>
<td>44</td>
<td>-</td>
<td>101%</td>
<td>-</td>
<td>103%</td>
<td>6.60</td>
<td>104%</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td>Mem. BW</td>
<td>12.8</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>9.39</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td>Mem. BW</td>
<td>12.8</td>
<td>16.76</td>
<td>144%</td>
<td>24.05</td>
<td>139%</td>
<td>2.02</td>
<td>148%</td>
<td>1.92</td>
<td>148%</td>
</tr>
<tr>
<td>Mem. BW</td>
<td>25.6</td>
<td>11.76</td>
<td>-</td>
<td>15.76</td>
<td>-</td>
<td>1.33</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td>Mem. BW</td>
<td>42.7</td>
<td>8.56</td>
<td>-</td>
<td>11.12</td>
<td>-</td>
<td>1.00</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td>Mem. BW</td>
<td>12.8</td>
<td>0.06</td>
<td>116%</td>
<td>0.28</td>
<td>128%</td>
<td>0.03</td>
<td>118%</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td>Mem. BW</td>
<td>14.9</td>
<td>9.69</td>
<td>109%</td>
<td>45.75</td>
<td>119%</td>
<td>4.94</td>
<td>111%</td>
<td>-</td>
<td>-</td>
</tr>
</tbody>
</table>
Table 4. Area and power breakdown of synthesized design. Power results from Lion scene, fast memory model.

<table>
<thead>
<tr>
<th></th>
<th>Area (mm²)</th>
<th>Power (mW)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Insertion sorters</td>
<td>0.24</td>
<td>25.8</td>
</tr>
<tr>
<td>Multi-merge unit</td>
<td>1.14</td>
<td>17.2</td>
</tr>
<tr>
<td>Hierarchy emitter</td>
<td>0.03</td>
<td>2.2</td>
</tr>
<tr>
<td>FIFOs and muxes</td>
<td>0.99</td>
<td>16.0</td>
</tr>
<tr>
<td><strong>Total</strong></td>
<td><strong>2.41</strong></td>
<td><strong>61.2</strong></td>
</tr>
</tbody>
</table>

Table 5. Hardware comparison with quadratic process scaling.

<table>
<thead>
<tr>
<th>Architecture</th>
<th>Area (mm²)</th>
<th>Node Size (nm)</th>
<th>Area @28nm (mm²)</th>
<th>Mem. BW (GB/s)</th>
</tr>
</thead>
<tbody>
<tr>
<td>GeForce GTX 1080</td>
<td>314</td>
<td>16</td>
<td>962 (5.5x)</td>
<td>320</td>
</tr>
<tr>
<td>HW binned SAH [13]</td>
<td>31.9</td>
<td>65</td>
<td>5.9</td>
<td>44</td>
</tr>
<tr>
<td>HW k-d tree [34]</td>
<td>1.4</td>
<td>28</td>
<td>1.4</td>
<td>43</td>
</tr>
<tr>
<td>MergeTree</td>
<td>2.4</td>
<td>28</td>
<td>2.4</td>
<td>43</td>
</tr>
</tbody>
</table>

Table 6. Memory traffic comparison (MB)

<table>
<thead>
<tr>
<th>Scene</th>
<th>Proposed</th>
<th>HW Binned SAH</th>
<th>GPU HLBVH</th>
</tr>
</thead>
<tbody>
<tr>
<td>Toasters</td>
<td>1.7</td>
<td>1.1 (0.7x)</td>
<td>1.7 (1.0x)</td>
</tr>
<tr>
<td>Bunny</td>
<td>10.6</td>
<td>18 (1.6x)</td>
<td>27 (2.5x)</td>
</tr>
<tr>
<td>Cloth</td>
<td>14.2</td>
<td>25 (1.7x)</td>
<td>36 (2.5x)</td>
</tr>
<tr>
<td>Fairy</td>
<td>19.4</td>
<td>70 (3.6x)</td>
<td>74 (3.8x)</td>
</tr>
<tr>
<td>Crytek</td>
<td>32.9</td>
<td>116 (3.5x)</td>
<td>115 (3.5x)</td>
</tr>
<tr>
<td>Conference</td>
<td>28.9</td>
<td>125 (4.3x)</td>
<td>125 (4.3x)</td>
</tr>
<tr>
<td>Sportscar</td>
<td>38.6</td>
<td>110 (2.9x)</td>
<td>135 (3.5x)</td>
</tr>
<tr>
<td>Italian</td>
<td>43.9</td>
<td>145 (3.3x)</td>
<td>169 (3.9x)</td>
</tr>
<tr>
<td>Babylonia</td>
<td>60.2</td>
<td>219 (3.6x)</td>
<td>230 (3.8x)</td>
</tr>
<tr>
<td>Kitchen</td>
<td>77.1</td>
<td>436 (5.6x)</td>
<td>352 (4.6x)</td>
</tr>
<tr>
<td>Dragon</td>
<td>124.8</td>
<td>379 (3.0x)</td>
<td>413 (3.3x)</td>
</tr>
<tr>
<td>Buddha</td>
<td>147.9</td>
<td>493 (3.3x)</td>
<td>520 (3.5x)</td>
</tr>
<tr>
<td>Livingroom</td>
<td>150.4</td>
<td>833 (5.5x)</td>
<td>685 (4.6x)</td>
</tr>
<tr>
<td>Lion</td>
<td>230.1</td>
<td>800 (3.5x)</td>
<td>766 (3.3x)</td>
</tr>
<tr>
<td><strong>Geom. mean</strong></td>
<td><strong>-</strong></td>
<td>(3.0x)</td>
<td>(3.3x)</td>
</tr>
</tbody>
</table>

Table 7. Power analysis results, average of 14 scenes.

<table>
<thead>
<tr>
<th></th>
<th>Max. BW (GB/s)</th>
<th>Mem. traffic (GB/s)</th>
<th>Logic power (mW)</th>
<th>DRAM power (mW)</th>
<th>Total power (mW)</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>12.8</td>
<td>9.2</td>
<td>58.1</td>
<td>507.2</td>
<td>565.3</td>
</tr>
<tr>
<td></td>
<td>25.6</td>
<td>13.9</td>
<td>61.8</td>
<td>814.8</td>
<td>876.6</td>
</tr>
<tr>
<td></td>
<td>42.7</td>
<td>18.9</td>
<td>62.7</td>
<td>1112.4</td>
<td>1175.2</td>
</tr>
</tbody>
</table>

primitive intersection test [23]. We use the SAH cost parameters $C_i = 1.2, C_f = 0, C_t = 1$ given for GPUs by Karras and Aila [23] in order to be comparable with previous work. SAH costs are normalized to a full SAH sweep [52]. Quality was not evaluated for FastTree.

Two toplevel builds are evaluated to recover quality, as discussed in Subsection 4.6. HLBVH+SAH is modeled with a cycle-level simulator, while for HLBVH+ATRBVH, we run a single-threaded C++ implementation on an NVidia Jetson TK1 board with a Tegra K1 SoC. The ATRBVH build runs two iterations over the toplevel nodes and uses a treelet size of 8.

For power analysis, we extract switching activity information files from the previous simulations, and perform power analysis with Synopsys Design Compiler. The constructed trees are loaded into a software ray tracer to verify correctness and compute quality. External memory power is determined by exporting DRAM command traces from Ramulator to DRAMPower [10]. We estimate the power and energy consumption of a 64b DRAM component by doubling the figures for a 32b component, as these could be combined into a 64b component.

6.3 Results

In Table 3, the resulting build performance and tree quality is compared to related work. Even with the slow memory option, the present design is over 2x faster than the state of the art binned SAH unit, and with the fast memory option, 5x faster (6.3x including the Toasters scene, but the 1ms runtime reported in that scene is too imprecise for comparison). Compared to the state of the art k-d tree builder by Liu et al. [34] with the same 12.8 GB/s bandwidth, MergeTree is 4.7x faster. With the fast memory option, the proposed unit is within a factor of two of the desktop GPU HLBVH builder, which has 7.5x more memory bandwidth, and an order of magnitude larger chip area and power envelope. ATRBVH gives a higher tree quality, but is, on average, 7.4x slower than HLBVH.

6.3.1 Area, Power and Memory Traffic. The unit was successfully synthesized and meets timing constraints at 1GHz. The cell area and power breakdown of the synthesized unit is shown in Table 4. Table 5 shows an area comparison to related work. The proposed unit has ca. 2.5x less area than a binned SAH builder [13].

The results of power analysis are shown in Table 7. The main result is that over 90% of total power consumption in the design comes from the DRAM interface. There are some straightforward optimizations to reduce the on-chip power: for example, a multi-bank scratchpad could be used in place of the current expensive dual-port SRAM. However, since the power consumed by the hardware unit itself is negligible compared to DRAM, the improvements from optimization would also be marginal.

Table 6 compares our memory traffic to related work. The proposed builder generates 3.0x less traffic than hardware binned SAH, and 3.3x less than a GPU build - the radix sort stage of the GPU build alone generates roughly as much traffic as our complete build. Our builder also achieves very high bus utilizations of 72%, 54% and 44% of the memory bandwidth on the slow, medium and fast interface options, respectively.

The above results show that the energy consumption and build speed of MergeTree are largely determined by the amount of memory traffic generated. A straightforward conversion of HLBVH to hardware, without the proposed memory traffic optimizations, would likely have a ca. 3x higher energy consumption and runtime, almost as high as the binned SAH builder [13].
The evaluated toplevel builds give significant tree quality improvements: HLBVH+SAH to an average of 118% and HLBVH+ATRBVH to 111%, only ca. 5% worse than a binned SAH build. The hardware binned SAH builder has an insignificant runtime compared to the LBVH, but consumes extra chip area. Our naive single-threaded software implementation of toplevel ATRBVH is already fast enough for real-time construction on a mobile SoC, and could run all scenes at 30FPS except for Lion, which has demanding geometry.

### 6.4 System Level Comparison

From the previous results, it is apparent that MergeTree gives a similar tradeoff as desktop GPU HLBVH: it is very fast and energy-efficient compared to prior work, at the cost of reduced tree quality, which can be mostly recovered with postprocessing toplevel builds. We can conjecture that a binned SAH builder is advantageous in small scenes where the build effort is minuscule relative to rendering, and the proposed builder becomes advantageous in larger scenes. The exact tradeoff depends on the particular scene and visual effects being displayed. This subsection further quantifies the system-level effects of builder selection by modeling a larger system which includes rendering hardware. The model focuses on system energy consumption per frame, as it is a main figure of interest in mobile systems and simplifies modeling. We start out from the premise that the BVH tree for the complete scene is rebuilt for each frame, and a viewpoint is then rendered at a 1280x720 resolution. The scenes and viewpoints used are shown in Table 3. Benchmarks are run for primary ray rendering, as well as diffuse lighting with one sample per pixel, limited to three bounces. The latter is representative of incoherent secondary rays.

The main components of the present model are a fixed-function rendering accelerator, combined with MergeTree, binned SAH and ATRBVH hardware builders. Moreover, toplevel build combinations of HLBVH+SAH and HLBVH+ATRBVH are tested which combine two hardware builders. For MergeTree, we use accurate energy figures based on post-synthesis power analysis and DRAMPower. For the binned SAH builder, we estimate memory traffic and FPU operation counts as described in Subsection 5.1, and then obtain a lower-bound energy model by assuming fully utilized FPU units and long, consecutive burst accesses to DRAM. For the ATRBVH builder, memory accesses and FPU operations are likewise counted from program code. No high-performance hardware architecture has been published for ATRBVH, but given the input size for toplevel builds, even a serial hardware unit performing one FPU operation per cycle is sufficient to process all test scenes at over 100fps. Finally, the rendering accelerator is modeled after the traversal and intersection units of SSRT [32], and simulated at cycle-level as described in [50]. Some common assumptions are used when modeling the hardware units. The units reside on a mobile SoC which is fabricated with a 28nm process technology, and equipped with the 25.6GB/s memory interface described earlier. As with MergeTree, off-chip memory accesses are modeled with Ramulator and DRAMPower. Caches and SRAMs are parametrized with CACTI 6.5 [38]. Floating-point unit energy is based on the figures of Galal et al. [15], with linear process scaling, as in the GpusimPow simulator [35]. Reciprocal calculation is estimated to take as much energy as 3 FLOPs, as in [33]. All units beside MergeTree operate at 500MHz.

#### 6.4.1 Rendering hardware model

The modeled accelerator architecture is shown in Fig. 8: it consists of separate fixed-function pipelines for tree traversal and primitive intersection. Scenes are rendered with a software ray tracer, from which a traversal trace is extracted and fed to a cycle-level hardware simulator, which traces utilizations for all components in Fig. 8. The power consumption of each component is determined by multiplying a dynamic power term with the utilization and adding a static power term. For the
Fig. 7. System energy with only primary rays (top) and diffuse secondary rays (bottom), expressed as power at 30FPS for comparison to mobile GPUs. Four alternatives are modeled: HLBVH only, binned SAH only, and HLBVH with binned SAH and ATRBVH toplevel builds. HLBVH build energy is from RTL simulation, while tracing and binned SAH energy are based on higher-level models. The proposed HLBVH builder reduces build energy at the cost of worse tree quality, which increases tracing energy. Toplevel builds remove much of the quality penalty. Tracing complexity is weakly related to scene size, while build energy grows, and becomes dominant in large scenes. Averaged results are shown in Table 8.

Fig. 8. Ray tracing accelerator simulated for system level energy modeling. The accelerator is modeled after the traversal and intersection unit in SGRT [32], with a two-AABB node layout [31]. All shown components are simulated at cycle level.

In addition to tree construction and traversal, shading is the third main component in the rendering process. Shading has a very wide range of complexity: we experimented with adding the energy cost of minimal shading and pixel output to the model, i.e., Phong shading with a directional light, but this had minimal effect on the total power. On the other hand, sufficiently complex shading may dominate the rendering process [29]. Mobile ray tracing would likely opt for inexpensive shaders at first. As shading cost is independent of tree quality, we omit it from the model.

6.4.2 Results. The system-level energy results are shown scene by scene in Fig. 7, and summarized in Table 8. It is visible that the energy cost of tree construction scales asymptotically faster with scene size than traversal, and with binned SAH, dominates the energy profile in large scenes. Though the binned SAH builder performs significant floating-point computation, most of its energy consumption is also due to DRAM accesses. MergeTree uses on average ca. 3.2× less energy. In primary ray tracing, the build energy savings are sufficient to make HLBVH preferred in all scenes except Toasters. In large scenes, tree construction dominates the system energy, and HLBVH gives significant savings.

With incoherent secondary rays, the energy footprint of ray tracing grows significantly, and is dominated by memory traffic. As such, tree quality has a larger effect on system energy. Moreover, the tracing energy penalty of the proposed builder is larger than predicted by SAH cost. Toplevel builds reduce system energy, but less than predicted by SAH. Regardless, in large (>500K triangle)
Table 8. System energy with different builders, main results. Energy normalized to binned SAH [13], averaged over 14 scenes.

<table>
<thead>
<tr>
<th></th>
<th>Energy</th>
<th>Binned SAH</th>
<th>HLBVH</th>
<th>HLBVH +SAH</th>
<th>HLBVH +ATRBVH</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Primary rays, all scenes</strong></td>
<td>Build</td>
<td>100%</td>
<td>32%</td>
<td>33%</td>
<td>33%</td>
</tr>
<tr>
<td></td>
<td>Trace</td>
<td>100%</td>
<td>143%</td>
<td>114%</td>
<td>112%</td>
</tr>
<tr>
<td></td>
<td>Total</td>
<td>100%</td>
<td>71%</td>
<td>59%</td>
<td>58%</td>
</tr>
<tr>
<td><strong>Primary rays, large scenes</strong></td>
<td>Build</td>
<td>100%</td>
<td>22%</td>
<td>22%</td>
<td>22%</td>
</tr>
<tr>
<td></td>
<td>Trace</td>
<td>100%</td>
<td>148%</td>
<td>122%</td>
<td>122%</td>
</tr>
<tr>
<td></td>
<td>Total</td>
<td>100%</td>
<td>41%</td>
<td>37%</td>
<td>37%</td>
</tr>
<tr>
<td><strong>Diffuse rays, all scenes</strong></td>
<td>Build</td>
<td>100%</td>
<td>32%</td>
<td>33%</td>
<td>33%</td>
</tr>
<tr>
<td></td>
<td>Trace</td>
<td>100%</td>
<td>163%</td>
<td>134%</td>
<td>128%</td>
</tr>
<tr>
<td></td>
<td>Total</td>
<td>100%</td>
<td>110%</td>
<td>91%</td>
<td>87%</td>
</tr>
<tr>
<td><strong>Diffuse rays, large scenes</strong></td>
<td>Build</td>
<td>100%</td>
<td>22%</td>
<td>22%</td>
<td>22%</td>
</tr>
<tr>
<td></td>
<td>Trace</td>
<td>100%</td>
<td>165%</td>
<td>141%</td>
<td>137%</td>
</tr>
<tr>
<td></td>
<td>Total</td>
<td>100%</td>
<td>79%</td>
<td>68%</td>
<td>65%</td>
</tr>
</tbody>
</table>

scenes the cost of tree construction is significant enough that the proposed builder consistently reduces system energy compared to binned SAH.

For comparison with mobile power budgets and GPUs, the energy results in Fig. 7 are presented as power at a fixed 30fps frame rate. Diffuse, animated ray tracing in our model with HLBVH +ATRBVH dissipates between 143..2077mW of power, and primary ray tracing between 72..791mW. For a point of comparison, in the benchmarks of Pathania et al. [43], recent mobile games on an Odroid-XU+E platform dissipate ca. 2..3W, of which ca. 0.8..1.8W is used in the mobile GPU. These results suggest that MergeTree allows the ray tracing of large (>500K triangle) dynamic scenes in a mobile power envelope. However, in the most demanding scenes there is only limited margin for complex shading. A binned SAH builder would already use most of the mobile power budget for these scenes.

7 LIMITATIONS AND FUTURE WORK

One difficulty in the proposed design is handling scenes of over $\frac{M^2}{2B}$ primitives (2 million triangles with the evaluation setup), as they require more than one multi-merge pass. It is simple to add control logic for multiple passes, but the use of AABBs as sorting elements is then suboptimal. Another possibility is to enlarge the scratchpad: doubling the memory size $M$ quadruples the model size that can be processed in one pass. In our experience at least a 512KB scratchpad memory can run at 1GHz; this would be sufficient for scenes of 8M triangles. It would also be interesting to evaluate, as a replacement for double-buffering, the other well-known multi-merge sort read scheduling technique of forecasting, used by, e.g., Barve et al. [7]. Forecasting introduces a second heap which stores the final elements of each buffer, and uses it to fetch replacement buffers in an optimal order. Forecasting would, again, double the size of scene that can be sorted in a single pass with a given scratchpad size.

Recent trends in ray tracing accelerators are toward techniques which reduce the cost of ray traversal while complicating tree construction, e.g., compressed BVHs [25] and treetlet scheduling [3]. In future architectures incorporating these features, the cost of tree construction will be emphasized even more than in straightforward single-precision traversal as evaluated in this article. We are extending MergeTree to generate compressed trees.

This article focused on mobile ray tracing, but the design also has interesting applications in, e.g., collision detection as in [11], and with minor modifications, construction of point set k-d trees [22] and data sorting.

8 CONCLUSION

In this article, we proposed MergeTree, the first hardware accelerator architecture for the HLBVH algorithm, which forms the basis for the highest-performance GPU tree construction algorithms. Novel techniques were proposed to adapt HLBVH into a streaming, serial hardware form, which is suitable for mobile systems with limited power budgets, due to reduced memory traffic. Our results show significant improvements over previous state of the art [13] in terms of build performance, silicon area, memory traffic and energy consumption, at the cost of reduced tree quality, which can be mitigated with inexpensive toplevel builds.

The proposed architecture substantially increases the size of animated scenes which could be rendered by a mobile ray tracing accelerator in real time, and also has applications outside ray tracing. System level modeling showed that the cost of tree construction begins to rival the cost of real-time rendering in large scenes. MergeTree gives significant system energy savings in these scenes.

ACKNOWLEDGMENTS

This work was financially supported by the TUT graduate school and the Academy of Finland (decision #297548, PLC). The 3D models used are courtesy of Ingo Wald (Fairy), Andrew Kensler (Toasters), Yasutoshi Mori (Sportscar), Frank Meinl (Crytek Sponza), Jonathan G. (Italian, Babylonian), Anat Gryngber and Greg Ward (Conference), Naga Govindaraju, Ilkun Kabul and Stephane Redon (Cloth), the Stanford Computer Graphics Laboratory (Bunny, Dragon), and the SceneNet library [21] (Livingroom, Kitchen). Crytek Sponza and Dragon have modifications courtesy of Morgan McGuire [37].

REFERENCES


Received March 2017; accepted June 2017