
Review questions:

  • Describe the idea of SIMD execution. (Why is it a reasonable design idea to consider?)
  • Note the forall loop in the code in the fictitious data-parallel language above. I emphasized the point that the semantics of the forall loop are that all iterations of the loop can be executed independently and in any order. Note that this is different from saying that the semantics dictate that the loop is executed in parallel. Imagine that N is a large number (N ~ millions) and that you are a compiler or runtime mapping this program to the Intel-Core-i7-like quad-core, 8-wide SIMD processor illustrated above. How would you do it?

Advanced mode question: In class we talked about using SIMD hardware capabilities by adding SIMD vector instructions, e.g., _mm256_mul_ps, to an instruction stream. (By the way, you may be interested in perusing the Intel Intrinsics Guide.)

You may be surprised to learn that although GPUs certainly feature SIMD execution support, they do not feature an ISA with SIMD instructions. Nor do GPUs feature any magical "auto-vectorization" support. What does the interface to GPU hardware look like, and how does a GPU achieve efficient SIMD execution?

  • SIMD execution allows multiple elements of data to be operated on using a single instruction. It saves the energy/area overheads of fetching and decoding multiple instructions to perform the same operation.
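  • I would map the loop iterations to cores in such a way that contiguous iterations, which access data in the same cache line, are mapped to the same core. Within each core, groups of 8 contiguous iterations can then be executed together by one 8-wide SIMD instruction.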
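
One way to picture the mapping described in the answer is the minimal C++ sketch below. It is only an illustration under stated assumptions: the loop body (an element-wise multiply) is a stand-in, since the slide's code is not reproduced here, and the names `block_for_core`, `run_block`, and `forall_multiply` are hypothetical. Each core receives one contiguous block of iterations, and within a block the iterations are grouped 8 at a time, which is the shape a compiler could turn into 8-wide vector instructions. The blocks are run one after another for simplicity; because forall iterations are independent, a real runtime could give each block to its own thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed machine parameters, taken from the processor described in the question.
constexpr std::size_t kCores = 4;      // quad-core
constexpr std::size_t kSimdWidth = 8;  // 8-wide SIMD (8 floats per vector)

// Half-open range [begin, end) of loop iterations assigned to one core.
struct Range { std::size_t begin, end; };

// Split n iterations into kCores contiguous blocks. Block sizes are rounded up
// to a multiple of kSimdWidth so that cores work on whole vectors where possible.
Range block_for_core(std::size_t n, std::size_t core) {
    std::size_t vectors = (n + kSimdWidth - 1) / kSimdWidth;         // ceil(n / 8)
    std::size_t vectors_per_core = (vectors + kCores - 1) / kCores;  // ceil(vectors / 4)
    std::size_t chunk = vectors_per_core * kSimdWidth;
    std::size_t begin = std::min(n, core * chunk);
    std::size_t end = std::min(n, begin + chunk);
    return {begin, end};
}

// Stand-in loop body (an assumption): forall i: out[i] = a[i] * b[i],
// restricted to the iterations owned by one core.
void run_block(const std::vector<float>& a, const std::vector<float>& b,
               std::vector<float>& out, Range r) {
    std::size_t i = r.begin;
    // Full groups of kSimdWidth contiguous iterations. Each group corresponds
    // to one vector instruction (for example, one _mm256_mul_ps).
    for (; i + kSimdWidth <= r.end; i += kSimdWidth) {
        for (std::size_t lane = 0; lane < kSimdWidth; ++lane) {
            out[i + lane] = a[i + lane] * b[i + lane];
        }
    }
    // Scalar remainder when the block is not a multiple of kSimdWidth.
    for (; i < r.end; ++i) {
        out[i] = a[i] * b[i];
    }
}

std::vector<float> forall_multiply(const std::vector<float>& a,
                                   const std::vector<float>& b) {
    assert(a.size() == b.size());
    std::vector<float> out(a.size());
    // Sequential here for simplicity; the blocks are independent, so a real
    // runtime could execute them concurrently, one block per core.
    for (std::size_t core = 0; core < kCores; ++core) {
        run_block(a, b, out, block_for_core(a.size(), core));
    }
    return out;
}
```

Calling `forall_multiply(a, b)` on two equal-length vectors returns their element-wise product. The blocked assignment was chosen over an interleaved one to match the answer's point about keeping neighboring iterations, and therefore neighboring data, on the same core.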

GPUs use a form of SIMD execution referred to as SIMT (Single Instruction Multiple Thread). Here, each "SIMD lane" in the GPU core is essentially programmed as a separate thread of execution (although only one instruction is fetched and decoded). Each thread is mapped to a different SIMD lane in a manner that's transparent to the programmer. CUDA also exposes further SIMD vectorization support within each "thread" (