Assignment 2: Kaffe: A Mini DNN Framework

Due Tues Nov 8th, at 11:59pm

Using your framework you will be able to evaluate popular networks (such as VGG-16 or Inception) to perform image classification tasks, or train your own networks. The goals of the assignment are to get basic experience implementing basic DNN layers described, as well as optimizing their implementation for modern parallel architectures. The assignment will also provide the experience of training a DNN for an image classification task.

Getting Started

Grab the assignment starter code from the public Github repo.

git clone

The starter code should build on Andrew Linux. It has a dependency on Halide, which is prebuilt here:


Therefore, to run the starter code you will need to add the following to your LD_LIBRARY_PATH.

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/afs/cs/academic/class/15769-f16/assignments/Halide/bin 

DNN model definitions used in this assignment are located here:


And image datasets used to train and evaluate your networks are located here:


Linux Build Instructions:

Makefiles are included at the top of the source tree, so you should be able to build the starter codebase by simply typing 'make'. This will produce two binaries inference and training. For example, you can run the VGG-16 network to classify the contents of 10 images from the ImageNet dataset using the command line (use -h to view the other command-line options that allow you change evaluation batch size, number of images used during performance testing, etc.):

./inference -v -n 10 /afs/cs/academic/class/15769-f16/assignments/asst2_models 

The starter code will evaluate the network on each of the images and (since the -v verbose mode option is set in the commandline above) will report network evaluation performance and accuracy (vs. ground-truth labels from the data set). Note that is expected for the network to not get 100% accuracy against the ground truth correct classification results, so the harness code also compare your implementation's output against the expected output of a correctly implemented reference implementation of the network. For example, given the commandline above, a correct implementation of the assignment should print this:

Loading weights from disk...
Building DNN pipeline...
Sub-pipeline 0 compile time: 16592ms
Testing using 10 images (batch size: 1)
Image 0: predicted: 65 score: 0.534817
Batch time: 1822ms (1822 ms/image)
Image 1: predicted: 970 score: 0.590683
Batch time: 1485ms (1485 ms/image)
Image 2: predicted: 230 score: 0.50218
Batch time: 1292ms (1292 ms/image)
Image 3: predicted: 967 score: 0.502137
Batch time: 1246ms (1246 ms/image)
Image 4: predicted: 748 score: 0.163116
Batch time: 1251ms (1251 ms/image)
Image 5: predicted: 58 score: 0.346265
Batch time: 1245ms (1245 ms/image)
Image 6: predicted: 334 score: 0.932226
Batch time: 1244ms (1244 ms/image)
Image 7: predicted: 852 score: 0.371311
Batch time: 1250ms (1250 ms/image)
Image 8: predicted: 337 score: 0.295529
Batch time: 1241ms (1241 ms/image)
Image 9: predicted: 332 score: 0.753167
Batch time: 1239ms (1239 ms/image)
1331.5 ms/image
DNN Accuracy: 50%
Network output matches the reference (CORRECT)

Of course, if you run the starter code, you will not get this output because you do not yet have a correct implementation of the required DNN layers. (It will report a mismatch between your program's output and that of the reference.)

Part 1: Implementing Forward Evaluation

In the first part of this assignment you will implement the fully connected, convolutional, maxpool, and softmax layers needed to correctly evaluate VGG and GoogleNet. Once this is complete, you will be able to evaluate these network on input images to produce a classification results. (Note: the assignment starter code provides implementations of a number of other layers used by these networks, such as ReLU, Average Pooling, Local Response Normalization, Flattening, Concatenation, etc.).

The starter code for this assignment is built on top of Halide. You are not required to use Halide for this assignment, and in fact it would be very interesting if a student wished to reimplement the assignment completely in C++. However, we caution that if you choose to not use Halide you will need to implement large parts of the starter code that are written in Halide. (e.g., the layer types provided to you above) While this handout is written assuming you will performance tune a Halide-based implementation for a CPU, you are also welcome to attempt to target a GPU with your implementation. (This may require a modest modification to the starter code as well.)

To get acquainted with the code, you probably wish to inspect NetworkDefinitions.h and Layers.h. (Given how the starter code is structured, most of your implementation will be in Layers.h).

  • Begin by taking a look at the class Layer in Layers.h. A network is defined by a graph of layers. Each layer has a convenient Layer::name (for pretty printing and reference), as well as a list of layers who's values it receives as input (Layer::inputs) and a list of layers that consume its output (Layer::outputs). A layer also can be inspected for information about the dimensionality of its output activations.
  • In the starter code, each layer also holds a Halide function Layer::forward that evaluates the output of the layer, and Halide functions for computing gradients. (1) Gradient of layer's output with respect to its parameters Layer::f_param_grads (such as Weights and biases) and (2) the gradient of the loss with respect to the layers inputs Layer::f_input_grads. The gradient functions are not necessary in Part 1 of this assignment, since no training is performed.
  • Each layer also maintains Halide buffers which hold the values of layer parameters Layer::params. When training is performed, the layer also holds buffers for parameter gradients Layer::param_grads and a copy of parameter values to support parameter update with momentum Layer::param_cache.
  • Layers are combined to create a network topology graph, represented by instances of the Network class in NetworkDefinition.h. Review the various implementations of networks provided in the starter code Vgg, Googlenet, and ToyNet. In particular, you can see how layers are wired up to form a network in Vgg::define_forward (or the equivalent methods for GoogleNet or ToyNet). In the starter code, wiring up layers into a graph is simultaneously building a large Halide pipeline to evaluate the entire network. This pipeline is JIT compiled by the call to Network::
  • Finally, to see how it's all put together, take a look at main() in Inception.cpp. The program instantiates a Network, JIT compiles the Halide program represented by the the graph of layers, and then evaluates this pipeline on a sequence of images. (For VGG and inception, JIT compilation may take up around 15 seconds.)

What you need to do

Please implement Halide functions that perform a forward evaluation pass for Affine (fully connected), Convolutional, MaxPool, and SoftMax layers. Please see the comments in the starter code in the constructor of each of these layer types (in Layers.h). Your code will define the Halide function forward which is a member of each of these unique layer classes.

The Kaffe framework's layer definitions closely resemble the layers of the popular deep learning framework Caffe. (However, your course staff believes everything is better with a K). You may find it helpful to read the documentation of Caffe's layer types to better understand what the layers you must implement need to do:

Notes and Tips:

  • A nice description of the softmax is here:
  • You may need to be careful about numerical precision in your Softmax.
  • In this part of the assignment you need not need consider the challenge of computing gradients for the layers in question (You are not training anything yet! That will come in Part 3.)
  • If you choose to implement this assignment in Halide, a very modest amount of new code is needed in this part of the assignment.

Part 2: Optimizing (Scheduling) Your Forward Evaluation

In this part of the assignment you need to optimize your layer implementations from Part 1 to improve the performance of forward evaluation. If you are implementing the assignment in Halide, this means you'll need to write Halide schedules for your networks. A reasonable implementation should be able to easily meet the performance of Caffe running on the same CPU, which requiring significantly less overall memory. A good implementation should be able to exceed Caffe's performance.

Notes and Tips:

  • We recommend starting by attempting to write good per-layer schedules for all your layers. (That is, schedule the work performed by each layer independently. In this design, the definition of the schedule can be placed right by the definition of the forward func in the Layer constructor. In this approach, your schedules will not attempt any form of cross-layer fusion... in Halide terms there is a compute_root() between layers, so layer outputs will be passed between layers in the form of large in-memory buffers.
  • Good schedules (even per-layer schedulers) may be input cardinality dependent. How to order and tile may depend on the size of the inputs.
  • If you wish to go further, you can consider opportunities for cross-layer fusion. However, you are advised to profile your per-layer performance prior to making decisions about when/whether cross-layer optimizations are useful and necessary. (Don't prematurely optimize without knowing what parts of the code are performance critical!)

Performance profiling

Halide has built-in mechanism for profiling the execution time and memory usage of the program. This can be enabled by setting the HL_JIT_TARGET (since the assignment is setup for a JIT compilation flow. This would be HL_TARGET for the AOT compilation flow) environment variable to host-profile. On the command line this would look like this:

HL_JIT_TARGET=host-profile ./inference -v -n 10 
                /afs/cs/academic/class/15769-f16/assignments/asst2_data vgg

This will produce profiling information which looks like this:

total time: 317.019470 ms  samples: 6052  runs: 1  time/run: 317.019470 ms
average threads used: 7.627396
heap allocations: 37  peak heap usage: 25920512 bytes
constant_exterior:     0.051ms   (0%)    threads: 1.000  peak: 612912   num: 1 avg: 612912
conv1_1_forward:       1.676ms   (0%)    threads: 7.516  peak: 12845056 num: 1 avg: 12845056
constant_exterior$1:   1.393ms   (0%)    threads: 1.000  peak: 13075456 num: 1 avg: 13075456
conv1_2_forward:       28.200ms  (8%)    threads: 7.380  peak: 12845056 num: 1 avg: 12845056
pool1_forward:         1.034ms   (0%)    threads: 1.000  peak: 3211264  num: 1 avg: 3211264
maximum:               0.362ms   (0%)    threads: 1.000  stack: 4
constant_exterior$3:   0.362ms   (0%)    threads: 1.000  peak: 3326976  num: 1 avg: 3326976
conv2_1_forward:       12.673ms  (3%)    threads: 7.783  peak: 6422528  num: 1 avg: 6422528
constant_exterior$4:   0.642ms   (0%)    threads: 1.000  peak: 6653952  num: 1 avg: 6653952
conv2_2_forward:       25.636ms  (8%)    threads: 7.873  peak: 6422528  num: 1 avg: 6422528

Note that naming Halide functions is useful for both debugging and well as making sense of the profiling information. Also the profiling tends to be inaccurate when a function is specified as compute_at. However, the profiling information is very useful to understand which functions to focus on for performance optimization.

Part 3: Implementing Back-Propagation and Training

Now that you're framework can successfully evaluate a network on an image for inference, it's now time to add support for training. We've created a simple starter network ToyNet (see it's definition in NetworkDefinitions.h) that is a template for a network we'd like you to train to perform image classification on small 32x32 images from the CIFAR 100 dataset.

In order to enable back-propagation through ToyNet, you will neet to:

  1. implement (and provide efficient schedules for) gradient computations for the Affine, Convolutional, MaxPool, and SoftMax layers you implemented in Part 1. please see the comments in the define_gradients() function for these layers. Your code will need to define Halide functions for dL/dp (stored as f_param_grads) and dL/dUnit for all inputs (stored as f_input_grads).
  2. Implement a basic stochastic gradient descent loop to drive the training process in Training.cpp. Your implementation should use parameter update with momemtum as described in class (at the bottom of the slide) or in more clarity here.

In this part of the assignment you're task is to successfully train a classification network on CIFAR. You are free to change the topology of ToyNet to your liking, experiment with learning rates and momentum parameters, etc. With the right hand holding of the training process (and a performant implementation), you should be able to train your network on CIFAR in a few hours.

Notes and Tips:

  • We've included (see commandline arg help) methods for loading model parameters from disk, and storing parameters (as a checkpoint during training). This will enabling you to checkpoint and resume training as you see fit.


In this assignment you will be graded 50% on correctness and 40% on your code's performance, and 10% on the writeup.

  • 15 points for a correct implementation of inference
  • 20 points for a correct implementation of gradients (the ability to train)
  • 10 points for the ability to successful train CIFAR to the accuracy threshold. (Accuracy goals to be releaseed shortly.)
  • 35 points for the performance of your forward implementation
  • 10 points for the performance of your gradient computations
  • 10 points for your writeup.

The following are reference timings for forward evaluation on This is an 8-core Xeon E5 1660 v4 machine (see specs here). Similar hosts are: ghc28/31/32/


A baseline implementation (basic parallelization and vectorization): (performance at this level would receive no more than 5 performance points)

  • vgg 400ms/image
  • inception 43ms/image

A modestly tuned implementation that intelligently performs tiling: (performance at or above this level is considered an "A" grade: 30 performance points)

  • vgg 137ms/image
  • inception 43ms/image

A even better implementation: (Better solutions are certainly possible with per-layer schedule tweaking.)

  • vgg 120ms/image
  • inception 32ms/image

Additional notes:

  • GHC machines can get loaded, so make sure you are running on a
  • The above results were obtained using batch size 32, but your implementation may prefer a different batch size. Please indicate in your writeup what the batch size should be.)
  • I will run inference on a ghc machine that is similar to ghc35 (see hostnames above) and on large set of test images -n > 100 to eliminate any warm up overhead.
    • I will focus on your VGG implementation for performance points (if you're in the ballpark of the 137ms/image reference implementation you will get 30 points, a solid A grade, I'll linearly drop points to the 400ms/image baseline.)
  • Inception is largely for fun, since it's harder to tune. But I'm very interested to see how well folks can do.


A reasonable reference implementation of training was able to train to 20% accuracy on CIFAR in about 30 minutes (8 epochs, 6ms/image during training on ghc35). The point of the training part of the assignment is to successfully train, which will be nearly impossible without a basic set of performance optimizations. Obtaining this accuracy in a similar ballpark about 1 hour will give you fill training credit.


You may need to run aklog prior to copying files over. (Otherwise you may observe a permission denied error.)

Since there are many ways to approach this assignment, please hand in your full (We should be able to build and run the code on Andrew Linux machines by typign 'make' from this tree.


As part of this handin please include your writeup in /afs/cs/academic/class/15769-f16-users/ANDREWID/asst2/writeup.pdf. In this writeup, please include:

  • A description of your basic approach to scheduling the pipeline (if you used Halide, describe your schedules, if you went solo in C++, CUDA, or some other programming framework, please describe your approach in more detail)
  • A description of your process in training an image classifier on the CIFAR 100 data in Part 3. What knobs did you play around with (learning rate?, momentum?) Describe any observations encountered in the process (congratulations... for many of you, you've now trained for first network!)