## Lecture 6: Specializing Hardware for Image Processing

Visual Computing Systems CMU 15-769, Fall 2016

## So far, the discussion in this class has focused on generating efficient code for multi-core processors such as CPUs and GPUs.

## **Consider the complexity of executing an instruction on a modern processor...**

- **Read instruction**: address translation, communicate with icache, access icache, etc.
- **Decode instruction**: translate op to uops, access uop cache, etc.
- **Check for dependencies/pipeline hazards**
- Identify available execution resource
- Use decoded operands to control register file SRAM (retrieve data)
- Move data from register file to selected execution resource
- **Perform arithmetic operation**
- Move data from execution resource to register file
- Use decoded operands to control write to register file SRAM

**Question:** 

How does SIMD execution reduce overhead when executing certain types of computations? What properties must these computations have?

### Fraction of energy consumed by different parts of instruction pipeline (H.264 video encoding) [Hameed et al. ISCA 2010]



| Abbreviation | Meaning                                |
|--------------|----------------------------------------|
| FU           | functional units                       |
| RF           | register fetch                         |
| Ctrl         | misc pipeline control                  |
| Pip          | pipeline registers (interstage)        |
| D-\$         | data cache                             |
| IF           | instruction fetch + instruction cache  |

## DSPs

- **Typically simpler instruction stream control paths**
- **Complex instructions (e.g., SIMD/VLIW): perform many operations per instruction**



### Contrast to custom circuit to perform the operation

Example: 8-bit logical OR
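
To make the contrast concrete, here is a minimal Python sketch of what a custom 8-bit OR circuit computes: eight independent 2-input OR gates, one per bit position, with no instruction fetch, decode, or register file involved. The helper names (`or_gate`, `or8_circuit`, `to_bits`, `from_bits`) are hypothetical and exist only for this illustration; it models the logic, not timing or energy.

```python
def or_gate(a: int, b: int) -> int:
    """Model of one 2-input OR gate operating on single bits (0 or 1)."""
    return 1 if (a or b) else 0


def or8_circuit(a_bits, b_bits):
    """Eight independent OR gates, one per bit position.

    Every gate sees its inputs directly; there is no control logic.
    """
    assert len(a_bits) == len(b_bits) == 8
    return [or_gate(a, b) for a, b in zip(a_bits, b_bits)]


def to_bits(value: int):
    """Little-endian list of the 8 low bits of value."""
    return [(value >> i) & 1 for i in range(8)]


def from_bits(bits):
    """Inverse of to_bits: pack a little-endian bit list into an integer."""
    return sum(bit << i for i, bit in enumerate(bits))


result = from_bits(or8_circuit(to_bits(0b10100000), to_bits(0b00001111)))
```

On a general-purpose processor the same OR passes through every pipeline step listed earlier in the lecture; in the custom circuit, the gates are the entire implementation.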





## **Recall use of custom circuits in modern SoC**

- Audio encode/decode
- Video encode/decode
- **High-frame rate camera RAW processing (ISP)**
- **Data compression**





### **Aside: Anton supercomputer** [Developed by DE Shaw Research]

- Supercomputer containing custom circuits for molecular dynamics
  - Simulates time evolution of proteins
- ASIC for computing particle-particle interaction in a single cycle
  - Anton 1: 512 particle-particle interaction units



## FPGAs (field programmable gate arrays)

- FPGA chip provides array of logic blocks, connected by interconnect
- **Programmer defines behavior of logic blocks via hardware description language** (HDL) like Verilog or VHDL



## Specifying combinatorial logic via LUT

- Example: 6-input, 1 output LUT in Xilinx Virtex-7 FPGAs
  - Think of a LUT6 as a 64 element table
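
The "64 element table" view can be sketched in a few lines of Python. This is only a software model under simple assumptions: the class name `LUT6` and its methods are made up for illustration and are not a Xilinx API. The six input bits form an index into the table, and "programming" the LUT means filling in the 64 entries.

```python
class LUT6:
    """Software model of a 6-input, 1-output lookup table."""

    def __init__(self, table):
        # One stored output bit for each of the 2**6 = 64 input combinations.
        assert len(table) == 64
        self.table = [int(bool(v)) for v in table]

    @classmethod
    def from_function(cls, fn):
        """Fill the table by evaluating fn on all 64 input combinations."""
        table = []
        for index in range(64):
            bits = [(index >> i) & 1 for i in range(6)]
            table.append(fn(*bits))
        return cls(table)

    def evaluate(self, *inputs):
        """The six input bits select one table entry; no logic is computed."""
        assert len(inputs) == 6
        index = sum((bit & 1) << i for i, bit in enumerate(inputs))
        return self.table[index]


# Any boolean function of six inputs fits in the same structure.
parity = LUT6.from_function(lambda a, b, c, d, e, f: a ^ b ^ c ^ d ^ e ^ f)
and6 = LUT6.from_function(lambda a, b, c, d, e, f: a & b & c & d & e & f)
```

The same table structure implements both functions; only the stored contents differ, which is why one physical LUT can be configured to realize any 6-input boolean function.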



## Question

### What is the role of an ISA? (e.g., x86)

**Answer:** interface between program definition (software) and hardware implementation

**Compilers produce sequences of instructions**

Hardware executes sequences of instructions as efficiently as possible. (As shown earlier in the lecture, many circuits exist to implement/preserve this abstraction, not to execute the computation needed by the program.)

## New ways of defining hardware

- Verilog/VHDL present very low level programming abstractions for modeling circuits (RTL abstraction: register transfer level)
  - **Combinatorial logic**
  - Registers
- Due to need for greater efficiency, there is significant modern interest in making it easier to synthesize circuit-level designs
  - Skip the ISA, directly synthesize circuits needed to compute the tasks defined by a program.
  - Raise the level of abstraction of direct hardware programming

### **Examples:**

- C to HDL (e.g., ROCCC, Vivado HLS)
- Bluespec
- **CoRAM** [Chung 2011]
- **Chisel** [Bachrach 2012]

## Enter domain specific languages

### **Compiling image processing pipelines directly to FPGAs**

- **Darkroom** [Hegarty 2014]
- **Rigel** [Hegarty 2016]

### **Motivation:**

- **Convenience of high-level description of image processing algorithms (like Halide)**
- **Energy-efficiency of FPGA implementations** (particularly important for high-frame rate, low-latency, always on, embedded/robotics applications)

## **Optimizing for minimal buffering**

- **Recall: scheduling Halide programs for CPUs/GPUs** 
  - Key challenge: organize computation so intermediate buffers fit in caches
- **Scheduling for FPGAs:** 
  - Key challenge: minimize size of intermediate buffers (keep buffered data spatially close to combinatorial logic)

### **Consider 1D convolution:**

out(x) = (in(x-1) + in(x) + in(x+1)) / 3.0

### **Efficient hardware implementation: requires storage for 3 pixels in registers**
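
A minimal Python sketch of the three-register idea, assuming pixels arrive one per step and that no output is produced until all three registers hold valid data (border handling is an assumption, not something the slide specifies). The function name `stream_conv1d` is hypothetical, and a `deque` with `maxlen=3` stands in for the three pixel registers.

```python
from collections import deque


def stream_conv1d(pixels):
    """Yield (in(x-1) + in(x) + in(x+1)) / 3.0 for each interior x.

    Consumes one pixel per step and never stores more than three pixels.
    """
    window = deque(maxlen=3)  # models three pixel registers
    for pixel in pixels:
        window.append(pixel)  # shift: the oldest pixel drops out
        if len(window) == 3:  # registers are full, so an output is valid
            yield (window[0] + window[1] + window[2]) / 3.0


outputs = list(stream_conv1d([3, 6, 9, 12, 15]))
# outputs == [6.0, 9.0, 12.0]
```

The storage requirement is independent of the input length: the whole signal is never buffered, only the current three-pixel window.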



## Line buffering

### **Consider convolution of 2D image in vertical direction:**

out(x,y) = (in(x,y-1) + in(x,y) + in(x,y+1)) / 3.0

### **Efficient hardware implementation:**

let buf be a shift register containing 2\*WIDTH+1 pixels ("line buffer")

// assume: no output until shift register fills

out\_pixel = (buf[0] + buf[WIDTH] + buf[2\*WIDTH]) / 3.0

shift(buf); // buf[i] = buf[i+1]

buf[2\*WIDTH] = in\_pixel



Note: despite the array notation, the line buffer \*is not\* a random-access SRAM; it is a shift register
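
The line-buffer scheme can be simulated in software as a rough sketch. The assumptions here are that pixels arrive in row-major scanline order, that the image width is known up front, and that outputs exist only for interior rows. The name `stream_vertical_conv` is hypothetical, and a `deque` with `maxlen=2*width+1` stands in for the shift register; indexing it is a simulation convenience, while real hardware would tap three fixed positions.

```python
from collections import deque


def stream_vertical_conv(pixels, width):
    """Yield out(x, y) = (in(x, y-1) + in(x, y) + in(x, y+1)) / 3.0.

    pixels must be in row-major scanline order. Storage is bounded by
    2*width + 1 pixels, regardless of image height.
    """
    buf = deque(maxlen=2 * width + 1)  # the line buffer
    for in_pixel in pixels:
        buf.append(in_pixel)  # shift in; the oldest pixel falls out when full
        if len(buf) == 2 * width + 1:  # no output until the buffer fills
            # Taps sit exactly one image row apart.
            yield (buf[0] + buf[width] + buf[2 * width]) / 3.0


# 3x3 image, rows [1, 2, 3], [4, 5, 6], [7, 8, 9], flattened in scanline order.
outputs = list(stream_vertical_conv([1, 2, 3, 4, 5, 6, 7, 8, 9], width=3))
# outputs == [4.0, 5.0, 6.0]
```

In this sketch the new pixel is shifted in before the taps are read, so the newest pixel plays the role of in(x, y+1); the result matches the pseudocode above up to when each output appears.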



### **Class discussion: Rigel** [Hegarty 2016]