Review question: Please answer the question on the slide.
Maybe? At first glance, the instructions within a loop iteration depend on each other. However, if we unroll the loop, operations from different iterations are independent: while one iteration is stalled on a memory operation, the processor can make progress on the next iteration. Does this make sense?
I realize that the programs are bandwidth-limited, but hardware multithreading should still reduce the total time a little by hiding memory latency.
@kayvonf The style of the first program is easier to maintain because the individual function implementations are decoupled (as opposed to the monolithic second program). This makes large libraries easier to maintain and enables compositionality (e.g., building a fused multiply-add by combining the two functions).
Hardware multithreading could allow for greater performance: stalls can be hidden by switching to another thread. There is also potential for ILP and TLP, assuming the generated assembly is scheduled properly.
The typical optimization would be to assign one thread per element index in each function. This is straightforward if the architecture exposes enough threads; otherwise, the work must be distributed across the cores by some partitioning scheme. When one thread stalls, another can run in its place.
Additionally, program 2 computes everything in place, improving cache/memory locality. The code is also more compact and entirely visible to the compiler, so most of the data transformation can happen at the register level and loop unrolling can occur.