Review question: Give an intuitive description of what it means for a processor to run an instruction stream.
An instruction stream is a list of operations for a core to complete, and is the simplest form to which a program must be reduced in order for it to be executed. Common examples of instructions are those that load data from memory into a register, those that modify data in registers (like addition and multiplication), and those that store register data back to memory. A pointer known as the Program Counter dictates which instruction to execute next, and is updated when an instruction finishes executing.
Executing an instruction entails fetching the instruction from memory, decoding it to determine what type of operation it describes, potentially performing an arithmetic operation, potentially accessing memory to load or store a value, and potentially storing the result of an operation back in a register. Finally, the Program Counter is updated to point to the next instruction that should be executed.
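The fetch-decode-execute cycle described above can be sketched as a toy interpreter. The instruction format, register names, and memory layout here are all invented for illustration, not any real ISA:

```python
# Hypothetical sketch of the fetch/decode/execute loop described above.
# Instructions are tuples like ("LOAD", "r0", 0); all names are made up.

def run(program, registers, memory):
    """Execute an instruction stream until a HALT instruction is reached."""
    pc = 0  # Program Counter: index of the next instruction to execute
    while True:
        # Fetch: read the instruction the PC points to.
        op, *args = program[pc]
        # Decode + execute: dispatch on the operation type.
        if op == "LOAD":      # load memory[addr] into a register
            reg, addr = args
            registers[reg] = memory[addr]
        elif op == "ADD":     # arithmetic on register values
            dst, a, b = args
            registers[dst] = registers[a] + registers[b]
        elif op == "STORE":   # store a register's value back to memory
            reg, addr = args
            memory[addr] = registers[reg]
        elif op == "HALT":
            return
        # Finally, update the PC to point at the next instruction.
        pc += 1

# Example stream: compute memory[2] = memory[0] + memory[1]
memory = {0: 5, 1: 7, 2: 0}
registers = {"r0": 0, "r1": 0}
program = [
    ("LOAD", "r0", 0),
    ("LOAD", "r1", 1),
    ("ADD", "r0", "r0", "r1"),
    ("STORE", "r0", 2),
    ("HALT",),
]
run(program, registers, memory)
print(memory[2])  # 12
```

Note that a real core would also handle branch instructions, which update the PC to something other than the next sequential instruction.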
A core that uses two threads of execution can select one instruction to execute from a pair of instruction streams. Giving a core the choice of which instruction stream to service is a form of latency hiding. For example, if the first instruction stream attempts to load uncached data from memory, then the core will issue that load request, then switch to executing the second instruction stream while the data is retrieved from memory. If the second instruction stream stalls before the data requested by the first stream has been fetched, then the core will sit idle. Each instruction stream has an execution context, which the core swaps in when that stream is being executed. It is important to remember that even though the core is executing two instruction streams, there is only one piece of instruction fetch/decode hardware, and only one instruction is executed at a time.
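The latency-hiding behavior above can be modeled with a toy scheduler: each cycle the core picks one runnable stream, and a stream that issues a load stalls for a fixed number of cycles while the other stream runs. The latency value and instruction encoding are invented for illustration:

```python
# Toy model (not real hardware) of interleaved multi-threading. Each
# cycle, the core chooses one runnable instruction stream; a stream
# that issues a "load" stalls for MEM_LATENCY cycles while others run.

MEM_LATENCY = 3  # assumed memory latency, in cycles

def simulate(streams):
    """streams: list of instruction lists, each entry 'compute' or 'load'."""
    pc = [0] * len(streams)             # per-stream program counter
    stalled_until = [0] * len(streams)  # cycle at which each stream is runnable
    cycle = 0
    trace = []  # (cycle, stream id or None, instruction) tuples
    while any(pc[i] < len(streams[i]) for i in range(len(streams))):
        runnable = [i for i in range(len(streams))
                    if pc[i] < len(streams[i]) and stalled_until[i] <= cycle]
        if runnable:
            i = runnable[0]             # swap in one runnable execution context
            instr = streams[i][pc[i]]
            pc[i] += 1
            trace.append((cycle, i, instr))
            if instr == "load":         # issue the load, then stall this stream
                stalled_until[i] = cycle + MEM_LATENCY
        else:
            trace.append((cycle, None, "idle"))  # all streams stalled: core idles
        cycle += 1
    return trace

# Stream 0 stalls on a load; stream 1's compute work hides the latency.
trace = simulate([["load", "compute"], ["compute", "compute", "compute"]])
for entry in trace:
    print(entry)
```

In this run the core never goes idle: while stream 0 waits on its load, all three of stream 1's compute instructions execute. With only one stream, those three cycles would have been stalls.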
@ccanel For your third paragraph, I believe you are referring to temporal multithreading. This is true if the processor is not superscalar. However, simultaneous multithreading does allow multiple threads to share a pipeline.
@ccanel, @jellybean: More generally, simultaneous multi-threading involves executing instructions from two different execution contexts (a.k.a. hardware threads) in parallel on a core. In class, I mainly described interleaved multi-threading, where each clock the core chooses one runnable execution context (a.k.a. hardware thread) and executes the next instruction in that thread's instruction stream using the core's execution resources.
But it's also possible for the core to choose more than one thread to run per clock. For example, the NVIDIA GTX 980 GPU maintains state for up to 64 execution contexts (called "warps" in NVIDIA-speak) on its cores, and each clock it chooses up to four of those 64 threads to execute instructions from. Those four threads execute simultaneously on the core using four different sets of execution resources. So there is interleaved multi-threading in that the chip interleaves up to 64 execution contexts, and simultaneous multi-threading in that it chooses up to four of those contexts to run each clock. (And if you look carefully at the slide I linked to, there is also super-scalar execution in that the core will try to run up to two independent instructions for each of those four warps -- up to a total of eight instructions overall -- each clock.) In short, many forms of parallel and concurrent execution are employed in a modern processor.
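To make the counting concrete, here is a toy model of that scheduling hierarchy. The limits (4 warps per clock, 2 independent instructions per warp) come from the description above; the selection policy itself is invented for illustration:

```python
# Toy model of the GTX 980 core's per-clock issue described above:
# up to 4 of the 64 resident warps are chosen each clock, and up to 2
# independent instructions issue per chosen warp. The "pick the first
# runnable warps" policy is an assumption, not the real scheduler.

MAX_WARPS_PER_CLOCK = 4
MAX_INSTRS_PER_WARP = 2

def issue(runnable_warps, independent_instrs):
    """Return the (warp, slot) pairs issued this clock.

    runnable_warps: warp ids that are not stalled this clock.
    independent_instrs: warp id -> number of independent instructions
    available at the front of that warp's instruction stream.
    """
    issued = []
    for w in runnable_warps[:MAX_WARPS_PER_CLOCK]:
        n = min(independent_instrs[w], MAX_INSTRS_PER_WARP)
        issued.extend((w, slot) for slot in range(n))
    return issued

# Five runnable warps (of 64); warp 9 has only 1 independent instruction.
issued = issue([2, 9, 17, 33, 60], {2: 2, 9: 1, 17: 2, 33: 2, 60: 2})
print(len(issued))  # 7: two each from warps 2, 17, 33 and one from warp 9
```

Warp 60 is runnable but doesn't issue this clock because only four warps can be selected; when every chosen warp has two independent instructions available, the core reaches its peak of eight instructions per clock.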
Intel's Hyper-threading technology is an implementation of multi-threading that makes sense if you consider the context: Intel had spent years building superscalar processors that could execute several different instructions per clock (within a single instruction stream). But as we discussed, it's not always possible for one instruction stream to have the right mixture of independent instructions to utilize all the available units in the core (this is the case of insufficient ILP). Therefore, it's a logical step to say, hey, to increase the CPU's chance of finding the right mix, let's modify our processor to have two threads available to choose instructions from instead of one!
Of course, running two threads is not always better than one, since these threads might thrash each other's data in the cache, resulting in more cache misses that ultimately cause far more stalls than Hyper-Threading could ever hope to fill. On the other hand, running two threads at once can also be beneficial in terms of cache behavior if the threads access similar data. One thread might access address X, bringing it into the cache. Then, if X is accessed by the other thread for the first time, what normally would have been a cold miss in a single-threaded system turns out to be a cache hit!
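The cache-sharing benefit is easy to see in a toy model. This is just a set standing in for a shared cache, with invented addresses; real caches have limited capacity, which is exactly what makes the thrashing case above possible:

```python
# Toy illustration of two hyper-threads sharing one cache: the second
# thread's first access to X hits because the first thread already
# brought X in. (A real cache is finite, so sharing can also thrash.)

cache = set()
stats = {"hits": 0, "misses": 0}

def access(addr):
    if addr in cache:
        stats["hits"] += 1
    else:
        stats["misses"] += 1
        cache.add(addr)  # bring the line into the shared cache

access("X")   # thread 1: cold miss, X is now cached
access("X")   # thread 2: a cold miss with private caches, but a hit here
print(stats)  # {'hits': 1, 'misses': 1}
```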
So to summarize: Intel processors that support Hyper-threading maintain two execution contexts (threads) on chip at once. Each clock, the chip looks at the two available contexts, and tries to find a mixture of runnable instructions that best utilizes all the execution units the core has available. It might be the case that one thread has sufficient ILP to consume the full capability of the core, and if so, the chip may just run instructions from that one thread, in which case it is essentially behaving like a processor performing interleaved multi-threading.