Assignment 1: Implementing a Parallel Sort-Middle Tiled Renderer : 15-869 Fall 2014

The assignment is due on Friday, Oct 3rd at 11:59pm.

In this assignment you are given starter code for a CPU parallel renderer. Your job is to implement a few missing parts of the rasterizer and to parallelize the renderer using a sort-middle tiled approach (a.k.a. a "tiled renderer"). You'll find this renderer (after you properly parallelize it), to be quite fast for a software CPU implementation.

Getting Started

Grab the renderer starter code: (yes the 15869-f13 in the url is correct)

git clone git+ssh://linux.andrew.cmu.edu/afs/cs/academic/class/15869-f13/asst/asst1-f14.git

Due to heavy use of C++11, building the code on Linux requires G++ 4.8. All Gates cluster machines had been updated to have it, but we recommend you work on the Gates 3000 machines, ghc28-49.ghc.andrew.cmu.edu since they have six-core CPUs with hyperthreading. We have also included Visual Studio 2013 solution files for those that wish to develop/test on Windows.

About the Code

The starting codebase is a basic framework for a parallel CPU-based renderer. The renderer performs shading on quad fragments as discussed in class. It is already parallelized, but in a very simple manner. Geometry work (vertex processing, triangle assembly, and clipping) is parallelized by distributing chunks of triangles across all cores, resulting in a buffer of post-clip triangles. Then, rasterization and early depth testing of these triangles is performed serially on one core, yielding a large buffer of quad fragments. This buffer is shaded in parallel using all cores. Last, the shaded quad fragments are blended into the frame buffer in serial.

This is a codebase that is written for performance, so it is not the cleanest possible academic code. (For example, SSE code is present throughout the codebase.) While we have done our best to extensively document the parts of the renderer that you most likely need to understand and/or modify, some parts of the codebase will be less clearly documented. Nonetheless you are invited to explore the codebase and feel free to ask the staff questions.

Although it is not necessary to understand until Part 3 of this assiment, you may wish to review this diagram, which illustrates the basic structure of the tiled version of the renderer.

Assuming your local top-level source directory for the starter code is: SOURCEDIR ...

The renderer's implementation is located in SOURCEDIR/Renderer/RasterRenderer.

Various tests of the renderer are located in SOURCEDIR/Renderer/TestDriver.

Linux Build Instructions:

The codebase uses CMake as its cross-platform make utility. We suggest building via an out-of-source build, which you can perform via the following steps:

// 1. Create a build directory anywhere you like
mkdir BUILDDIR

// 2. Enter the build directory
cd BUILDDIR

// 3. Configure the build
cmake PATH_TO_SOURCEDIR/Renderer

At this point Makefiles for your local machine have been generated, and you can compile the renderer and all tests by typing:

make

Running the Renderer:

./bin/render square -mediadir /afs/cs/academic/class/15869-f13/asst/Media

Note: to save typing (and decrease load time for some of the larger scenes) you may wish to locate the /Media directory to a local directory, or create a soft link.

The result will be an output image (BMP format) that is supposed to be a rendering of a square, but OH NO, the image is all black! That leads you to Part 1 of the assignment...

Part 1: Implementing Rasterization

The first step in the assignment is to fix the renderer by implementing a triangle rasterizer (a quad fragment generator). You will do this by implementing two functions current stubbed out in Rasterizer.h: RasterizeTriangle and TriangleSIMD::TestQuadFragment.

RasterizeTriangle accepts as input a region of the output image and a triangle. It must identify all 2x2 pixel blocks within this screen region that may be at least partially covered by the triangle. For each of these blocks, it should call the provided function processQuadFragmentFunc().

Part of the imeplementation of processQuadFragmentFunc() is determine if the triangle actually covers samples in this 2x2 pixel block. It does so by calling TriangleSIMD::TestQuadFragment. This method accepts as input the coordinates of four screen sample points (corresponding to the 2x2 block of pixels), and outputs a bitmask indicating which of these points lies within the triangle. The expected format of the bitmask is a little non-standard, so please pay attention to the comments in the code. Also, keep in mind that your implementation of "inside triangle" should respect the rasterization edge rules discussed in class. (Conveniently, the values in TriangleSIMD::isOwnerEdge[i] will be very helpful in crafting an implementation that respects the rules.)

In order to implement these two required functions, you'll need to understand the data stored in the TriangleSIMD structure. We suggest you look at the struct's definition in Rasterizer.h, the ProjectedTriangle struct definition in ProjectedTriangle.h, and also the implementation of SetupTriangle in RendererImplBase.h. SetupTriangle is helpful because it performs the per-triangle "setup" operations discussed in class. As discussed in class, these operations include converting floating-point triangle vertex positions into a fixed-point representation (the renderer uses a fixed-point representation with 4 bits of fraction), computing edge equations based on the fixed-point positions, performing back-face culling, etc. The output of SetupTriangle is a fully populated TriangleSIMD structure that is ready for SIMD-optimized rasterization.

Once you have a valid implementation of RasterizeTriangle and TriangleSIMD::TestQuadFragment, your renderer will output correct pictures!

Now re-run the renderer binary on the 'square' scene.

./bin/render square -mediadir /afs/cs/academic/class/15869-f13/asst/Media

You should see a beautiful picture of a square. We've included a number of more complex scenes for you to enjoy. Type ./bin/render --help to see your options. Examples of correct rendered output for all scenes is in: /afs/cs/academic/class/15869-f13/asst/correct_images.zip.

Hints and Tips:

The definition of a triangle is in ProjectedTriangle.h. Vertex positions X0, Y0, ... are stored in a fixed-point representation is with 4 bits of subpixel precision (notice their type is not float but unsigned short). So if vertex 0 is at the center of pixel (1,1) it is represented by the values X0=24, Y0=24. (The bottom-left of pixel (1,1) would be represented as (16,16). The same fixed-point representation is true for fields in TriangleSIMD.
You are welcome to consider the tiled rasterization methods we discussed in class, but we urge you to try simple solutions first in your implementation of RasterizeTriangle. Optimization of TriangleSIMD::TestQuadFragment can be done in isolation, but before optimizing RasterizeTriangle we suggest you complete part 2 below.

Part 2: Implementing a Sort-Middle Tiled Parallelization Scheme

Before considering how to implemented a tiled version of this renderer, you should first familiarize yourself with the implementation of the non-tiled reference implementation. The implementation resides in the method ForwardNonTiledRendererAlgorithm::RenderProjectedBatch() in NontiledForwardRenderer.cpp. The method accepts as input a list of post-clip triangles, rasterizes the triangles into a buffer of quad fragments, shades the quad fragments, and then performs the appropriate frame-buffer updates. The function has been heavily documented to aid your understanding.

You must implement a sort-middle tiled parallel renderer in TiledRenderer.cpp. Use the -tiled option to instruct the test harness to use your tiled implementation instead of the reference one. For example:

./bin/render sponza -tiled -mediadir PATH_TO_MEDIA

As a reminder, the tiled renderer will implement a two-phase process as described in class:

In the first phase of execution, the renderer will classify all triangles into bins. That is, it will build a data structure where there is one bin per screen tile, and each bin contains a list of triangles that potentially overlap the bin.

In the second phase of execution, the renderer will process the bins in parallel, dynamically distributing the next unprocessed bin to the next available core. Thus, once triangles have been classified into bins, all multi-core parallelism in the renderer is entirely across the independent bins. For each bin, the core will rasterize each triangle in the tile's list, making sure that only fragments within the current tile are generated. The fragments will be immediately shaded and then blended into the frame buffer.

This basic structure should be evident in the implemenation of TiledRendererAlgorithm::RenderProjectedBatch(). You'll see two phases of execution. In the first phase, one "task" per core is created (basically one worker thread per core). Each thread will execute TiledRendererAlgorithm::BinTriangles which classifies a set of triangles into bins. In the second phase, the renderer creates one independent task per screen tile, and TiledRendererAlgorithm::ProcessBin is invoked for each task to compute final pixel values for that render target tile.

Hints on Part 2, Phase 1 (Binning Triangles)

The stub code in TiledRendererAlgorithm::BinTriangles is ready to go for you to start processing triangles. For each triangle you'll need to compute the screen tiles the triangle overlaps (we'd like you to implement this math: a simple bounding box check is okay, but it's possible to do better) and then stick the triangle in the appropriate bins.
Remember that each core is processing triangles in parallel, and that your implementation must preserve order. One way to go about this assignment is to protect a grid of lists (one grid cell per tile) data structure with a lock, and make sure new triangles are always inserted into the appropriate place in the list to preserve order. However, I'll tell you right now this implementation is almost certainly going to be slow. Instead, we'd like you to implement this assignment without ever taking a lock in either BinTriangles or ProcessBin!
Although the implementation described targets a GPU (and has many GPU-specific details), Sections 5.2 and 5.3 (and Figure 1) of this paper should give you a big hint about the structure of the design we are looking for. Come talk to the staff quickly if you are stuck.

Hints on Part 2, Phase 2 (Processing Bins)

Your implementation of TiledRendererAlgorithm::ProcessBin should be a serial implementation of rasterizeration for all triangles in "the list" associated with a single bin. The challenge, given the no-lock implementation required in Phase 1, is how to find the triangles that should be in this conceptual list.
A software-engineering challenge of phase 2 is shading a quad fragment via a call to ShadeFragment. Please see this document for a better description of how buffers are managed in the surrounding code in the renderer. This will explain why the interface to ShadeFragment is surprizingly complex. (Keep in mind this renderer is built for high performance, and so we incur the complexity of complex buffer management in certain areas -- it was hard to abstract these details in this case).
One advantage of a tiled renderer is that an entire frame-buffer tile will fit in cache. It may be the case that for maximum performance you'll want to write results to tile-major frame-buffer structure while processing the tile, and then copy the results back out to the renderer's frame buffer at the end. See comment in TiledRendererAlgorithm::Finish(). Not all students saw benefit from this optimization last year.

Part 3: (optional) Adding a Blend Mode

Notice that the renderer currently implements "replace" as the color-buffer blend function. (That is, if a fragment passes the depth-test then the corresponding pixel's color is set to that of the fragment.) Modify the renderer to support correct alpha blending into the render target. alphablend is a simple test containing transparent triangles that can be used to test your implementation. Note that alpha blending support will noticably improve the quality of the warehouse scene, since the steel grating textures contain alpha.

Hint 1: state.AlphaBlend provides the blend state of the renderer.
Hint 2: See FrameBuffer::BlendPixel in FrameBuffer.h

Extra Credit and Challenges

Implement multi-sample anti-aliasing. If you're a real graphics person, the jaggy edges of triangles produced by this renderer should bother you... a lot! Modify the renderer to support multi-sample anti-aliasing (MSAA).
Tile-based deferred rendering. Several modern mobile GPUs implement a version of tiled rendering that's often called "tile-based deferred rendering" (TBDR). Unlike a traditional early-Z pipeine that shades quad fragments immediately after they pass the depth test, a TBDR system executes visibility and depth testing for all quad fragments in a bucket prior to beginning any shading for the bucket. This means the renderer can figure out which quad fragments are not overwritten by subsequent fragments, and therefore shade only the quad fragments that actually contribute coverage to the final image. This is true regardless of the order the application provides triangles to the pipeline. (Note: TBDR rendering still respects the triangle ordering requirement -- all rendered images are exactly the same as an image produced by processing triangles sequentially.)
Performance. Your TA Yong He thinks his reference implementation of the renderer is pretty fast. Extra credit points will be given for implementations that are fast enough to complete with his renderer! (If you can beat Yong's renderer I know a couple of people that would probably be interested to offer you a job.)

Performance

If you decide to run your renderer on machines other than ghc27-49.ghc.andrew.cmu.edu, you should change the value of Cores on line 20 of RendererImplBase.h. It is currently set to 12, anticipating the 6-core, 12-hyperthread machines in GHC 3000. (If you are unsure of the type of box you are on, count its virtual cores by inspecting the output of less /proc/cpuinfo.)

Your primary goal on this assignment is to build a correct rasterizer and a correct tiled renderer. However, to stay in the spirit of the class we would like you to try a few performance optimizations.

Below is a table of performance results from Yong's so-called (by him) "super-fast" implementation (build settings were unchanged from the start code):

Timings from: ghc48.ghc.andrew.cmu.edu  (1024x768 rendering, unloaded machine)

Scene          Not-Tiled         Tiled
==========================================================================
square           12.8 ms        8.0 ms                     
bunny            21.6 ms       13.3 ms
sibenik          86.2 ms       58.5 ms
sponza          170.5 ms      113.6 ms
warehouse       181.3 ms      114.7 ms
station         112.1 ms       80.6 ms

Handin

Assignment handin directories will be created for you at: /afs/cs/academic/class/15869-f14-users/ANDREWID.

You may need to run aklog cs.cmu.edu prior to copying files over. (Otherwise you may observe a permission denied error.)

Code Handin

Please hand in the RasterRenderer directory of your modified source tree (I should be able to build and run the code on Andrew Linux machines) by dropping your RasterRenderer directory into a freshly checked out tree. Specifically, my scripts will be looking for the directory:

/afs/cs/academic/class/15869-f14-users/ANDREWID/asst1/RasterRenderer

Writeup Handin

Please include a short writeup describing your implementation of parts 1 and 2.

/afs/cs/academic/class/15869-f14-users/ANDREWID/asst1/writeup.pdf

Specifically address the following:

How did you implement RasterizeTriangle to improve on the naive implementation provided?
How does your solution sort triangles into tile lists in parallel?
How does your solution preserve processing order of triangles (recall fragments must hit the frame buffer in the order the triangles are submitted to the renderer?
Did you take any steps to minimize triangle "bin spread"? (bin spread = the number of bin lists a triangle must be inserted into)
Include measurements of your tiled renderer's performance on the six-core Gates machines: ghc27-49.ghc.andrew.cmu.edu. You may also wish to run your renderer on larger machines for fun, just ask Kayvon for access to a 32-core, 64-hyperthread Ubuntu box.