kayvonf

Assuming $n$ is not trivially small, Program 2 will perform better because it has better cache locality. Consider Program 1: tmp1[0] is brought in from memory for the first time when it is written with A[0]+B[0]. It is later evicted from the cache as the loop moves on to later elements of the array, say tmp1[1024]. Then, when tmp2[0] is calculated, tmp1[0] has to be fetched from memory again.
If $n$ is very small and everything fits in cache, the performance difference should be marginal. In that case, the fact that Program 1 executes more loop-control instructions may have a very small effect.
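
To make the access pattern concrete, here is a rough sketch of the two shapes being compared. The two programs are not reproduced in this comment, so everything here is an assumption for illustration: the function names `unfused` and `fused`, the array `C`, and the choice of multiplication as the second step are all hypothetical. Only tmp1[i] = A[i]+B[i] and the fact that tmp2[i] depends on tmp1[i] come from the discussion above.

```c
#include <stddef.h>

/* Hypothetical stand-in for Program 1: two separate passes over the data.
 * By the time the second loop reads tmp1[0], the first loop has already
 * streamed through all n elements, so for large n that line has likely
 * been evicted and must be fetched from memory again. */
void unfused(size_t n, const float *A, const float *B, const float *C,
             float *tmp1, float *tmp2)
{
    for (size_t i = 0; i < n; i++)
        tmp1[i] = A[i] + B[i];

    /* Assumed second step; the real operation may differ. */
    for (size_t i = 0; i < n; i++)
        tmp2[i] = tmp1[i] * C[i];
}

/* Hypothetical stand-in for Program 2: one pass. The intermediate sum is
 * consumed immediately, so it never needs to make a round trip through
 * memory, and there is only one set of loop-control instructions. */
void fused(size_t n, const float *A, const float *B, const float *C,
           float *out)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (A[i] + B[i]) * C[i];
}
```

Both functions compute the same values. The difference is only in how many times each element travels between memory and the processor, which is why the gap should shrink toward nothing once all the arrays fit in cache.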