Advanced mode question: In class we talked about using SIMD hardware capabilities by adding SIMD vector instructions into an instruction stream, e.g., _mm256_mul_ps. (By the way, you may be interested in perusing the Intel Intrinsics Guide.)
You may be surprised to learn that although GPUs certainly feature SIMD execution support, they do not feature an ISA with SIMD instructions. Nor do GPUs feature any magical "auto-vectorization" support. What does the interface to GPU hardware look like, and how does a GPU achieve efficient SIMD execution?
GPUs use a form of SIMD execution referred to as SIMT (Single Instruction, Multiple Thread). Here, each "SIMD lane" in the GPU core is essentially programmed as a separate scalar thread of execution, although only one instruction is fetched and decoded for the whole group of lanes. Threads are grouped (into "warps" on NVIDIA hardware) and mapped to SIMD lanes in a manner that's transparent to the programmer. CUDA also exposes further SIMD vectorization support within each "thread" (http://docs.nvidia.com/cuda/cuda-c-programming-guide/#built-in-vector-types).
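As a concrete sketch of the SIMT interface (the kernel and variable names below are hypothetical, not from the original), the programmer writes a scalar per-thread program containing no vector instructions, and the hardware executes groups of 32 such threads in lockstep, with one fetched instruction driving all 32 lanes:

```cuda
// Hypothetical element-wise multiply, written as a scalar per-thread program.
// No SIMD instructions appear in the source: SIMD execution comes from the
// hardware running a warp of 32 threads in lockstep across 32 SIMD lanes.
__global__ void vecMul(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element index
    if (i < n)                                      // lanes past the end are masked off
        c[i] = a[i] * b[i];                         // scalar multiply, per thread
}

// Launch (host side): the programmer chooses thread counts; the thread-to-lane
// mapping is handled by the hardware/runtime and never appears in the ISA.
// vecMul<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
```

Contrast this with the AVX example from class: there the programmer (or compiler) must explicitly emit `_mm256_mul_ps` on 8-wide vectors, whereas here the vectorization is implicit in the mapping of threads to lanes.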