kayvonf

Review question: Please answer the question on the slide.

zf

Assuming $n$ is not trivially small, Program 2 will perform better because it has better cache locality. In Program 1, tmp1[0] is first brought into the cache when it is written with A[0]+B[0]. It is later evicted as later elements of the array (say, tmp1[1024] and beyond) are accessed. Then, when tmp2[0] is calculated, tmp1[0] has to be fetched from memory again.
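The slide's code is not reproduced in this thread, so here is only a rough sketch of what the two programs might look like. The names tmp1, tmp2, A and B come from the comment above. Everything else is an assumption made for illustration: that the computation is E = D + (A + B) * C, and the helper names add, mul, program1 and program2.

```c
#include <assert.h>
#include <stddef.h>

/* Element-wise helpers: each call is one full pass over its arrays. */
static void add(size_t n, const float *x, const float *y, float *out) {
    for (size_t i = 0; i < n; i++)
        out[i] = x[i] + y[i];
}

static void mul(size_t n, const float *x, const float *y, float *out) {
    for (size_t i = 0; i < n; i++)
        out[i] = x[i] * y[i];
}

/* Assumed shape of Program 1: three passes, temporaries stored in memory. */
void program1(size_t n, const float *A, const float *B, const float *C,
              const float *D, float *tmp1, float *tmp2, float *E) {
    assert(n == 0 || (tmp1 != NULL && tmp2 != NULL)); /* caller supplies scratch */
    add(n, A, B, tmp1);    /* tmp1[i] written; evicted before reuse if n is large */
    mul(n, tmp1, C, tmp2); /* tmp1[i] read back, tmp2[i] written */
    add(n, tmp2, D, E);    /* tmp2[i] read back */
}

/* Assumed shape of Program 2: one fused pass, intermediates never stored to arrays. */
void program2(size_t n, const float *A, const float *B, const float *C,
              const float *D, float *E) {
    for (size_t i = 0; i < n; i++)
        E[i] = D[i] + (A[i] + B[i]) * C[i];
}
```

In this sketch both versions compute the same values. The only difference is whether the intermediate results are stored to and reloaded from arrays between passes.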

In Program 1, each element of tmp1[] and tmp2[] makes two round trips between memory and cache (written in one pass, read back in the next), while in Program 2 every data element makes only one. Even if prefetching works perfectly thanks to the sequential access pattern, the running time is still bounded below by (amount of memory traffic) / (memory bandwidth).
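To put rough numbers on that bound, here is a back-of-the-envelope sketch. It assumes the slide computes E = D + (A + B) * C on float arrays, that $n$ is large enough that nothing survives in cache between passes, and that write-allocate traffic can be ignored. The pass counts and the function names traffic_bytes and min_seconds are illustrative and not taken from the slide.

```c
#include <assert.h>
#include <stddef.h>

/* Full-array passes over main memory under the assumptions above:
 * Program 1: read A, B; write tmp1; read tmp1, C; write tmp2;
 *            read tmp2, D; write E                      -> 9 passes
 * Program 2: read A, B, C, D; write E                   -> 5 passes */
enum { PROGRAM1_PASSES = 9, PROGRAM2_PASSES = 5 };

/* Total bytes moved between memory and cache for n-element float arrays. */
size_t traffic_bytes(size_t n, int passes) {
    return (size_t)passes * n * sizeof(float);
}

/* Lower bound on running time when memory bandwidth is the only limit. */
double min_seconds(size_t bytes, double bandwidth_bytes_per_sec) {
    assert(bandwidth_bytes_per_sec > 0.0);
    return (double)bytes / bandwidth_bytes_per_sec;
}
```

Under these assumptions Program 1 moves 9/5 = 1.8 times as many bytes as Program 2, so in the bandwidth-bound regime it would be expected to take roughly 1.8 times as long, even with perfect prefetching.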

Because the programs perform sequential, element-wise independent arithmetic, the computation can be pipelined and vectorized efficiently, so memory bandwidth is likely to be the bottleneck.

If $n$ is very small and everything fits in cache, the performance difference should be marginal. In that case, the extra loop-control instructions that Program 1 executes may still have a very small effect.