CSC264 Comp Org/Arch II - MIPS Pipeline

MIPS Pipeline

Stages

Step	Name	Description
IF	Instruction Fetch	fetch instruction, update PC
ID	Instruction Decode	fetch registers, decode (generate control signals)
EX	Execute	perform ALU operation
MEM	Memory Access	read/write memory
WB	Write Back	store result in destination register

Operations in Each Stage

The operations required in the IF stage are the same for all instructions: fetch the instruction and update the PC. After the IF stage, the required operations differ.

R Type Instructions (op Rd, Rs, Rt)

IF: fetch instruction, update PC
ID: decode, fetch operands Rs and Rt from the register file
EX: ALU executes the instruction
MEM: nothing
WB: store result of EX in destination register Rd

Load Instructions (op Rt, offset(Rs))

IF: fetch instruction, update PC
ID: decode, fetch Rs from the register file, fetch offset from the instruction
EX: ALU computes the effective address (Rs + offset)
MEM: read from the effective address
WB: store the value that was read in destination register Rt

Store Instructions (op Rt, offset(Rs))

IF: fetch instruction, update PC
ID: decode, fetch Rs and Rt from the register file, fetch offset from the instruction
EX: ALU computes the effective address (Rs + offset)
MEM: write Rt at the effective address
WB: nothing
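
The load and store steps above can be sketched in Python. This is only a minimal model under stated assumptions: registers and memory are plain dictionaries, and the names `sign_extend16`, `load_word`, and `store_word` are hypothetical helpers chosen for illustration. The 16-bit offset is sign-extended before it is added to Rs, as in the MIPS I-type format.

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def load_word(regs, memory, rt, rs, offset):
    """lw rt, offset(rs): EX computes the address, MEM reads, WB writes rt."""
    address = regs[rs] + sign_extend16(offset)   # EX: effective address
    value = memory[address]                      # MEM: read
    regs[rt] = value                             # WB: write destination register
    return address


def store_word(regs, memory, rt, rs, offset):
    """sw rt, offset(rs): EX computes the address, MEM writes, WB does nothing."""
    address = regs[rs] + sign_extend16(offset)   # EX: effective address
    memory[address] = regs[rt]                   # MEM: write
    return address


# Hypothetical register and memory contents, for illustration only.
regs = {"$t1": 100, "$s0": 0, "$t2": 7}
memory = {104: 42}
load_word(regs, memory, "$s0", "$t1", 4)     # $s0 <- memory[100 + 4]
store_word(regs, memory, "$t2", "$t1", -4)   # memory[100 - 4] <- $t2
```

Note that the store returns without touching the register file, which mirrors its empty WB stage.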

Branch Instructions (op Rs, Rt, label)

The assembler translates the label into a displacement: the number of instructions between the instruction that follows the branch and the branch target (negative for a backward branch).

IF: fetch instruction, update PC
ID: decode, fetch Rs and Rt from the register file
EX: a dedicated adder computes the new PC from the displacement, and the ALU evaluates the branch condition using Rs and Rt
MEM: nothing
WB: nothing
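
The EX-stage work of a branch can be sketched as follows. This is an illustrative Python model, not a description of the hardware: the function names `branch_target` and `beq` are hypothetical, and the sketch assumes the usual MIPS convention that the displacement counts 4-byte instructions relative to the instruction after the branch (PC + 4).

```python
def branch_target(pc, displacement):
    """Target address for a branch located at pc with a signed word displacement."""
    return (pc + 4) + (displacement << 2)   # shift converts instructions to bytes


def beq(pc, regs, rs, rt, displacement):
    """Return the next PC for beq rs, rt, label."""
    taken = regs[rs] == regs[rt]               # ALU evaluates the condition
    target = branch_target(pc, displacement)   # separate adder computes the target
    return target if taken else pc + 4


# Hypothetical register values and addresses, for illustration only.
regs = {"$1": 5, "$2": 5, "$3": 9}
next_pc_taken = beq(0x400010, regs, "$1", "$2", 3)
next_pc_not_taken = beq(0x400010, regs, "$1", "$3", 3)
```

Both the condition and the target are computed in the same stage, which is why the hardware needs an adder separate from the ALU.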

Stages Used

Instruction	Steps Required
R type	IF ID EX WB
branch	IF ID EX
load	IF ID EX MEM WB
store	IF ID EX MEM

To make sure that there is only one instruction in each stage at any given time, the unused stages are filled with no-ops (NOPs), stages in which the instruction does nothing. This makes these instructions take longer but keeps the pipeline running in sync.

Instruction	Steps Required
R type	IF ID EX NOP WB
branch	IF ID EX NOP NOP
load	IF ID EX MEM WB
store	IF ID EX MEM NOP

Here is an example of a few instructions going through the pipeline:
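
A text version of such a diagram can be generated with a short Python sketch. It assumes an ideal pipeline (one instruction enters per cycle, no stalls), and the three-instruction program and the function name `pipeline_diagram` are made up for illustration.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_diagram(instructions):
    """Return text rows showing the stage each instruction occupies in each cycle.

    Assumes an ideal pipeline: instruction i enters IF in cycle i + 1 and
    advances one stage per cycle with no stalls.
    """
    total_cycles = len(instructions) + len(STAGES) - 1
    width = max(len(text) for text in instructions)
    header = " " * width + "".join(
        f"{'c' + str(c + 1):>5}" for c in range(total_cycles)
    )
    rows = [header]
    for i, text in enumerate(instructions):
        cells = []
        for cycle in range(total_cycles):
            stage_index = cycle - i
            if 0 <= stage_index < len(STAGES):
                cells.append(f"{STAGES[stage_index]:>5}")
            else:
                cells.append(" " * 5)   # instruction not in the pipeline this cycle
        rows.append(text.ljust(width) + "".join(cells))
    return rows


program = ["lw  $t0, 0($s1)", "add $t1, $t2, $t3", "sw  $t4, 4($s1)"]
diagram = pipeline_diagram(program)
print("\n".join(diagram))
```

Each row is shifted one cycle to the right of the row above it, so three instructions finish in 3 + 5 - 1 = 7 cycles rather than 15.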

Speedup

Given:

k stage pipeline
clock time per stage t_p
n instructions (tasks)
1st instr (task) takes time k * t_p
remaining n-1 instrs take time t_p each

So the total time for n instructions in a k stage pipeline is:

     k*t_p + (n-1)*t_p = (k+n-1)*t_p

The total time without a pipeline for n instructions is n * k*t_p. The speedup from pipelining is:

     speedup = (time without pipelining) / (time with pipelining)
             = (n*k*t_p) / ((k+n-1)*t_p)

For large n, the k-1 term is negligible compared with n, so this gets very close to:

     (n*k*t_p) / (n*t_p) = k

So theoretically, speedup is the number of stages in the pipeline.
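
The formulas above can be checked numerically. The following Python sketch evaluates the speedup for a 5-stage pipeline and several values of n; the function names are hypothetical, and t_p defaults to 1 because it cancels out of the ratio.

```python
def pipelined_time(k, n, t_p):
    """Total time for n instructions on a k-stage pipeline: (k + n - 1) * t_p."""
    return (k + n - 1) * t_p


def unpipelined_time(k, n, t_p):
    """Total time for n instructions without pipelining: n * k * t_p."""
    return n * k * t_p


def speedup(k, n, t_p=1.0):
    """Ratio of unpipelined time to pipelined time."""
    return unpipelined_time(k, n, t_p) / pipelined_time(k, n, t_p)


for n in (1, 10, 100, 10_000):
    print(n, round(speedup(5, n), 3))
```

With a single instruction there is no speedup at all, and the ratio approaches k = 5 only as n grows large.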

The images in this handout are taken from "Computer Organization and Design" by Patterson and Hennessy.

Pipeline Hazards

There are three things that can disrupt the smooth flow of the pipeline:

structural hazard (resource conflict): A structural hazard occurs when two instructions in the pipeline need the same hardware at the same time. For example, an instruction in the IF stage needs an adder to compute PC+4 while an instruction in the EX stage needs the ALU for its own calculation. Another example is two instructions needing to access memory at the same time: one is in the IF stage (fetching the instruction from memory) and the other is in the MEM stage (reading or writing a data value).
data hazard (data dependency): A data hazard exists when a value calculated by one instruction is not ready when it needs to be used by a subsequent instruction in the pipeline.
control hazard: Branch statements cause problems because the pipeline needs to be loaded with instructions before the branch has been executed. So which instructions should be loaded: those for the case of the branch being taken, or the branch not being taken?

Structural Hazard

This problem is solved by duplicating resources that are needed in multiple stages. MIPS has an adder dedicated to updating the PC by adding 4 (used in IF). MIPS also has an adder dedicated to computing the branch target address (used in EX). This leaves the ALU free to execute the operation required in the EX stage. Also, MIPS has separate memories (caches) for instructions and data, so one instruction can use the instruction memory in the IF stage at the same time that another instruction uses the data memory in the MEM stage.
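
The memory conflict described above can be illustrated with a small Python sketch. It assumes an ideal 5-stage pipeline in which instruction i (counting from 0) is in IF during cycle i and in MEM during cycle i + 3; the function name `memory_conflicts` and the example program are hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def memory_conflicts(uses_data_memory, unified_memory):
    """Return the cycles in which a fetch and a data access need the same memory.

    uses_data_memory[i] is True when instruction i is a load or a store.
    Cycles and instruction indices are 0-based.
    """
    if not unified_memory:
        return []   # separate instruction and data memories never collide
    conflicts = []
    count = len(uses_data_memory)
    for i, uses_mem in enumerate(uses_data_memory):
        mem_cycle = i + STAGES.index("MEM")
        fetched_index = mem_cycle   # the instruction in IF during that cycle
        if uses_mem and fetched_index < count:
            conflicts.append(mem_cycle)
    return conflicts


# Hypothetical program: lw, add, sub, or, sw
program_uses_memory = [True, False, False, False, True]
single_memory = memory_conflicts(program_uses_memory, unified_memory=True)
split_memory = memory_conflicts(program_uses_memory, unified_memory=False)
```

With one shared memory, the `lw` reaches MEM in the same cycle that the fourth instruction is fetched; with separate instruction and data memories the list of conflicts is empty.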

Data Hazard

A data hazard occurs when one instruction needs the result of an earlier instruction that is still in the pipeline, for example:

   add      $s0, $t0, $t1
   sub      $t2, $s0, $t3

This is called a data dependency. Since the add instruction doesn't update $s0 until the WB stage, this could stall (delay) the pipeline for three cycles. Data dependencies occur in code all the time. One way to avoid the hazard is to have the compiler reorder the instructions so that independent instructions separate the dependent pair. This is often not possible, so reordering alone is not enough.

As soon as the ALU computes the result of the add, it can be sent directly to the input of the sub. This is an example of forwarding or bypassing, which is the main solution to data hazards. This requires extra hardware to send the output of the ALU (from the add) to the input of the ALU (for the sub). In our example, this will remove the pipeline delay.
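
The decision of when to forward can be sketched in Python. This is a simplified illustration rather than the actual forwarding hardware: instructions are modeled as hypothetical tuples `(op, dest, src1, src2)`, and the check that the destination is not `$zero` is an assumption based on that register always reading as 0.

```python
def forward_sources(producer, consumer):
    """Return which source operands of consumer should take producer's ALU result.

    Each instruction is a tuple (op, dest, src1, src2).
    """
    _, dest, _, _ = producer
    _, _, src1, src2 = consumer
    if dest == "$zero":
        return []   # writes to $zero are discarded, so nothing to forward
    return [name for name, src in (("src1", src1), ("src2", src2)) if src == dest]


add = ("add", "$s0", "$t0", "$t1")
sub = ("sub", "$t2", "$s0", "$t3")
forwarded = forward_sources(add, sub)
print(forwarded)
```

For the add/sub pair in the text, only the first source operand of the sub matches the add's destination, so only that ALU input is taken from the forwarding path.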

However, if we have a load followed by an R-type instruction:

   lw       $s0, 4($t1)
   sub      $t2, $s0, $t3

the result of the load is not available until after the MEM stage. Forwarding will help but will not eliminate the delay, so a one-cycle stall (also called a bubble) is still needed.
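
The load-use case can be counted with a short Python sketch. It assumes a 5-stage pipeline with full forwarding, in which the only remaining stall is a load immediately followed by an instruction that reads the loaded register. Instructions are modeled as hypothetical tuples `(op, dest, sources)`, and every listed source is assumed to be needed in the EX stage.

```python
def stalls_with_forwarding(program):
    """Count bubbles caused by load-use hazards between adjacent instructions.

    Each instruction is a tuple (op, dest, sources), where sources is a tuple
    of the registers the instruction reads.
    """
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, curr_sources = curr
        if prev_op == "lw" and prev_dest in curr_sources:
            stalls += 1   # loaded value is not ready until after MEM
    return stalls


load_then_use = [
    ("lw", "$s0", ("$t1",)),
    ("sub", "$t2", ("$s0", "$t3")),
]
alu_then_use = [
    ("add", "$s0", ("$t0", "$t1")),
    ("sub", "$t2", ("$s0", "$t3")),
]
load_stalls = stalls_with_forwarding(load_then_use)
alu_stalls = stalls_with_forwarding(alu_then_use)
```

The load followed by the sub costs one bubble, while the add followed by the sub costs none because the ALU result can be forwarded directly.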

Control Hazard

The statement after a branch has to be fetched right after the branch is fetched. But there is no way to know which instruction should be fetched, because the branch condition hasn't been evaluated yet. This control hazard occurs on every branch instruction. There are several ways to handle it:

stalling: If we can move all branch processing to the ID stage (test registers, calculate new PC, and update PC), then we only need to stall for one cycle. This requires additional hardware, and more importantly, slows down the execution of every branch.
branch prediction: This method predicts whether the branch will be taken, and loads the pipeline with the instructions that match the prediction. If the prediction is correct there is no delay, but if the prediction is wrong then the pipeline must be refilled with the statements on the other execution path. There are several different prediction methods. One is to always predict a branch will not be taken. Another is to use different predictions for different branches. For example, the branch at the bottom of a loop will be taken many times, and not taken only once. Dynamic prediction keeps a history of each branch (taken/untaken) and uses this history to predict future branching.
delayed branching: The assembler moves an instruction unaffected by the branch so that it is executed right after the branch. The instructions controlled by the branch are then not loaded into the pipeline until the branch has been processed. For example, given the following code:
```
      add    $4, $5, $6       # does not affect $1 or $2
      beq    $1, $2, labela
      or     $7, $8, $9
```
These instructions can be reordered as follows:
```
      beq    $1, $2, labela
      add    $4, $5, $6       # executes whether or not the branch is taken
      or     $7, $8, $9
```
The branch instruction is completed in time to load either the or instruction or the instruction at labela.
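
The dynamic prediction mentioned under branch prediction can be sketched in Python. The text only says that a history is kept per branch; the 2-bit saturating counter used here is one common scheme, chosen as an assumption for illustration. The class name, the starting counter value, and the branch address are hypothetical.

```python
class TwoBitPredictor:
    """Per-branch 2-bit saturating counters: 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self):
        self.counters = {}   # branch address -> counter value in 0..3

    def predict(self, address):
        # Unseen branches start at 1 (weakly not taken); this is an assumption.
        return self.counters.get(address, 1) >= 2

    def update(self, address, taken):
        counter = self.counters.get(address, 1)
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
        self.counters[address] = counter


def count_mispredictions(predictor, address, outcomes):
    """Run a sequence of actual outcomes through the predictor and count misses."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(address) != taken:
            misses += 1
        predictor.update(address, taken)
    return misses


# A loop-closing branch: taken nine times, then not taken once on loop exit.
loop_outcomes = [True] * 9 + [False]
predictor = TwoBitPredictor()
first_run = count_mispredictions(predictor, 0x400020, loop_outcomes)
second_run = count_mispredictions(predictor, 0x400020, loop_outcomes)
```

On the first pass through the loop the predictor misses the first iteration and the exit. On the second pass it misses only the exit, because the single not-taken outcome does not flip the prediction.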
