Implementing a Tiled Parallel Renderer

This assignment is due at 11:59pm on October 2nd.

In this (somewhat) mini-assignment you are given started code for a CPU parallel renderer and your job is to parallelize the renderer using a sort-middle tiled approach (a.k.a. a "tiled renderer", see Lecture 2, slide 44). The renderer is currently able to render triangle meshes into a single-sample-per-pixel frame buffer. As discussed in class the renderer shades quad-fragments, and supports very simple shading with "nearest texel" texture sampling. You'll find this renderer (after you properly parallelize it), to be quite fast for a software implementation.

Getting Started

Grab the renderer starter code.

git clone git+ssh://linux.andrew.cmu.edu/afs/cs/academic/class/15869-f13/code/15869_repo.git

Due to heavy use of C++11, building the code on Linux requires G++ 4.8 (all cluster machines had been updated to have it). If you are using the public andrew Linux clusters, we recommend that you run on the 6-core Xeon machines in GHC 3000 (hostnames are: ghc27-50.ghc.andrew.cmu.edu). We have also included Visual Studio 2013 solution files.

Linux Build Instructions:

The codebase using CMake as its cross-platform make utility. We suggest an out-of-source build using cmake, which you can perform via the following steps:

Assuming your top-level source directory is called: SOURCEDIR ...

// 1. Create a build directory anywhere you like
mkdir BUILDDIR

// 2. Enter the build directory
cd BUILDDIR

// 3. Configure the build
cmake PATH_TO_SOURCEDIR/Renderer

At this point your build should be configured, and you can compile the renderer simply by typing:

make

Running the Renderer:

NOTICE: on Andrew linux machines, you'd need to add /usr/local/lib/gcc/lib64 you your dynamic library search path. For example:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib/gcc/lib64

For simplicity we've set the renderer to look for scene data files in a directory called Media that should be located in your BUILDDIR. Just copy Media directory from the repository into your BUILDDIR.

cp -r PATH_TO_SOURCEDIR/Media .

Now you can run the renderer. Just run ./bin/render. The result will be an output image result.bmp that is a rendering of the popular "Sponza" scene in computer graphics. We've included a number of other scenes for you as well. Type ./bin/render -help to see your options.

About the Code

The renderer you are given is already parallelized, but in a very simple manner. Geometry work (vertex processing, triangle assembly, and clipping) is a parallelized across all cores, yielding a buffer of post-clip triangles. Then, rasterization for these triangles is performed serially, yielding a large buffer of quad-fragments. This buffer is shaded in parallel, and the resulting shaded quad-fragments are depth-tested and (if needed) blended into the frame-buffer in serial.

The code for the bulk of the renderer's implementation is located in SOURCEDIR/Renderer/RasterRenderer.

You should first familiarize yourself with the implementation of the non-tiled reference implementation. The implementation resides in the method ForwardNonTiledRendererAlgorithm::RenderProjectedBatch() in SOURCEDIR/Renderer/RasterRenderer/NontiledForwardRenderer.cpp. This method accepts as input a list of post-clip triangles, rasterizes the triangles into a buffer of quad fragments, shades the quad fragments, and then performs the appropriate frame-buffer blend. This code has been heavily documented to aid your understanding.

(WARNING: The rest of the renderer is less heavily documented and also written for performance. Explore at your own risk!)

Implementing Tiled Parallelization

Your job in this assignment is to implement an tiled parallel renderer implementation in SOURCEDIR/Renderer/RasterRenderer/TiledRenderer.cpp. You will use the -tiled option to tell the renderer to use your tiled implementation instead of the reference one. For example:

./bin/render sponza -tiled

As a reminder, the tiled renderer will work like this:

In the first phase of execution, the renderer will classify all triangles into bins. That is, it will build a datastructure where there is one bin per screen tile, and each bin contains a list of triangles that potentially overlap the bin.

In the second phase of execution, the renderer will process each bin in parallel. For each bin, the renderer will rasterize each triangle in the tile's list, making sure that only fragments within the current tile are generated. The fragments will be immediately shaded and blended into the frame-buffer. Thus, once triangles have been classified into bins, all multi-core parallelism in the renderer is entirely across the independent bins.

This basic structure should be evident in TiledRendererAlgorithm::RenderProjectedBatch(). You'll see two phases of execution. In the first phase, one "task" per core is created (basically one thread per core). Each thread will execute TiledRendererAlgorithm::ClassifyTriangles which classifies a set of triangles into bins. In the second phase, the renderer creates one independent task per screen tile, and TiledRendererAlgorithm::RenderTile is invoked for each tasks to generate the image.

Hints on Phase 1 (Classifying Triangles)

  • The stub code in TiledRendererAlgorithm::ClassifyTriangles is ready to go for you to start processing triangles. For each triangle you'll need to compute the bins the triangle overlaps (we'd like you to implement this math: a simple bounding box check is okay, but you can do even better) and then stick the triangle in the appropriate bins.
  • The definition of a triangle is in ProjectedTriangle.h (vertex positions X0, Y0, ... are in fixed-point subpixel units, with 4 bits of subpixel precision. (So if vertex 0 is in the middle of pixel (1,1) it's represented by the values X0=24, Y0=24)
  • Remember that each core is processing triangles in parallel, and that your implementation must preserve order. One way to go about this assignment is to protect a grid of lists (one grid cell per tile) data structure with a lock, and make sure new triangles are always inserted into the appropriate place in the list to preserve order. However, I'll tell you right now this implementation is going to be slow. Instead, we'd like you to implement this assignment without ever taking any locks in either ClassifyTriangles or RenderTile!
  • Although the implementation described targets a GPU (and has many GPU-specific details), Sections 5.2 and 5.3 (and Figure 1) of this paper should give you a big hint of the design we are looking for.

Hints on Phase 2 (Processing Bins)

  • Your implementation of RenderTile should be a serial implementation of rasterizeration for all triangles in "the list" associated with a single bin. The challenge, given the no lock implementation required in Phase 1, is how to find the triangles that should be in this conceptual list.
  • A software-engineering challenge of phase 2 is shading a quad-fragment via a call to ShadeFragment. Please see this document for a better description of how buffers are managed in the surrounding code in the renderer. This will explain why the interface to ShadeFragment is surprizingly complex. (Keep in mind this renderer is built for high performance, and so we incur the complexity of complex buffer management in certain areas -- it was hard to abstract these details in this case).
  • One advantage of a tiled renderer is that an entire frame-buffer tile will fit in cache. It likely for maximum performance you'll want to write results to tile-major frame-buffer structure while processing the tile, and then copy the results back out to the renderer's frame-buffer at the end. See comment in TiledRendererAlgorithm::Finish()

Performance

A quality implementation will make the renderer run quite fast (2-3 times faster than the non-tiled reference). The following are per frame render times for the the course staff's tiled implementation running on the 6-core machines in GHC 3000 (Intel Xeon CPU W3670 @ 3.20GHz)

              Non-Tiled Ref Impl    Staff's Tiled Impl
Square                   12.5 ms                7.9 ms
Bunny                    37.9 ms               15.2 ms
Sibenik                  66.5 ms               22.8 ms
Sponza                   87.8 ms               24.1 ms

Handin

Assignment handin directories will be created for you at: /afs/cs/academic/class/15859-f13/handin/asst1/ANDREWID

For convenience, Ph.D. students with a CS account should access this directory with CS credentials (e.g., via login from a CS linux server, like: linux.gp.cs.cmu.edu). You can check which account (@cs or @andrew) was given premission to access your directory by typing fs la from your handin directory.

Undergrad and masters students without CS accounts should use their andrew credentials. You may need to run aklog cs.cmu.edu to avoid AFS permission issues.

Code Handin

Please hand in the Renderer directory of your modified source tree (I should be able to build and run the code on Andrew Linux machines) by dropping your Renderer directory into a freshly checked out tree. Specifically, my scripts will be looking for a directory:

/afs/cs/academic/class/15859-f13/handin/asst1/ANDREWID/Renderer

Writeup Handin

In the top level of your handin directory, please include as a short writeup describing how you decided to implement your parallelization. I'll be looking for:

/afs/cs/academic/class/15859-f13/handin/asst1/ANDREWID/writeup.pdf

Specifically address the following:

  1. How does your solution sort triangles into tile lists in parallel?
  2. How does your solution preserve processing order of triangles (recall fragments have to hit the frame-buffer in the order the triangles are submitted to the renderer?
  3. Did you take any steps to minimize triangle "bin spread"? (that is, to minimize the number of lists a triangle must be inserted into)

Also include a measurement of your tiled renderer's performance on the four scenes provided in the examples above. (preferrably on one of the six-core Gates ghc27-50.ghc.andrew.cmu.edu machines, but feel free to experiment on any parallel system you have access to.