Forwarding
233 in one slide!

- The class consists roughly of 4 quarters: (Bolded words are the big ideas of the course, pay attention when you hear these words)
  1. You will build a simple computer processor
     Build and create **state** machines with **data**, **control**, and **indirection**
  2. You will learn how high-level language code executes on a processor
     Time limitations create **dependencies** in the **state** of the processor
  3. You will learn why computers perform the way they do
     Physical limitations require **locality** and **indirection** in how we access **state**
  4. You will learn about hardware mechanisms for parallelism
     **Locality**, **dependencies**, and **indirection** on performance enhancing drugs

- We will have a SPIMbot contest!
Today’s Lecture

- Why dependencies and delayed feedback necessitate forwarding
- How to implement forwarding
This diagram shows the execution of an ideal code fragment.

<p>| | | | | | |</p>
<table>
<thead>
<tr>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>lw</td>
<td>lw</td>
<td>lw</td>
<td>lw</td>
<td>lw</td>
<td>lw</td>
</tr>
<tr>
<td>$8, 4($29)</td>
<td>$8, 4($29)</td>
<td>$8, 4($29)</td>
<td>$8, 4($29)</td>
<td>$8, 4($29)</td>
<td>$8, 4($29)</td>
</tr>
</tbody>
</table>

<p>| | | | | | |</p>
<table>
<thead>
<tr>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>sub</td>
<td>sub</td>
<td>sub</td>
<td>sub</td>
<td>sub</td>
<td>sub</td>
</tr>
<tr>
<td>$2, $4, $5</td>
<td>$2, $4, $5</td>
<td>$2, $4, $5</td>
<td>$2, $4, $5</td>
<td>$2, $4, $5</td>
<td>$2, $4, $5</td>
</tr>
</tbody>
</table>

<p>| | | | | | |</p>
<table>
<thead>
<tr>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>and</td>
<td>and</td>
<td>and</td>
<td>and</td>
<td>and</td>
<td>and</td>
</tr>
<tr>
<td>$9, $10, $11</td>
<td>$9, $10, $11</td>
<td>$9, $10, $11</td>
<td>$9, $10, $11</td>
<td>$9, $10, $11</td>
<td>$9, $10, $11</td>
</tr>
</tbody>
</table>

<p>| | | | | | |</p>
<table>
<thead>
<tr>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>or</td>
<td>or</td>
<td>or</td>
<td>or</td>
<td>or</td>
<td>or</td>
</tr>
<tr>
<td>$16, $17, $18</td>
<td>$16, $17, $18</td>
<td>$16, $17, $18</td>
<td>$16, $17, $18</td>
<td>$16, $17, $18</td>
<td>$16, $17, $18</td>
</tr>
</tbody>
</table>

<p>| | | | | | |</p>
<table>
<thead>
<tr>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>add</td>
<td>add</td>
<td>add</td>
<td>add</td>
<td>add</td>
<td>add</td>
</tr>
<tr>
<td>$13, $14, $0</td>
<td>$13, $14, $0</td>
<td>$13, $14, $0</td>
<td>$13, $14, $0</td>
<td>$13, $14, $0</td>
<td>$13, $14, $0</td>
</tr>
</tbody>
</table>

- Each instruction needs a total of five cycles for execution.
- One instruction begins on every clock cycle for the first five cycles.
- One instruction completes on each cycle from that time on.
Note how everything goes left to right, except …

Writing back to registers is delayed creating data hazards
Our examples were too simple because they lacked **dependencies** between instructions

```assembly
lw $8, 4($29)
sub $2, $4, $5
and $9, $10, $11
or $16, $17, $18
add $13, $14, $0
```

- The instructions in this example are **independent**.
  - Each instruction reads and writes completely different registers.
  - Our datapath handles this sequence easily, as we saw last time.

- But most sequences of instructions are **not** independent!
An example with dependencies

- sub $2, $1, $3
- and $12, $2, $5
- or $13, $6, $2
- add $14, $2, $2
- sw $15, 100($2)
**sub does not write to the reg file until cycle 5, this creates two data hazards**

<table>
<thead>
<tr>
<th>Instruction</th>
<th>Clock Cycle</th>
</tr>
</thead>
<tbody>
<tr>
<td>sub $2, $1, $3</td>
<td>1 2 3 4 5 6 7 8 9</td>
</tr>
<tr>
<td>and $12, $2, $5</td>
<td>IF</td>
</tr>
<tr>
<td>or $13, $6, $2</td>
<td>IF</td>
</tr>
<tr>
<td>add $14, $2, $2</td>
<td>IF</td>
</tr>
<tr>
<td>sw $15, 100($2)</td>
<td>IF</td>
</tr>
</tbody>
</table>

Read the old value of $2 from the reg file, not the result of sub
**sub** finishes writing to the reg file after half a clock cycle, reg file reads take half a cycle

Clock cycle

<p>| | | | | | | | | | |</p>
<table>
<thead>
<tr>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>2</td>
<td>3</td>
<td>4</td>
<td>5</td>
<td>6</td>
<td>7</td>
<td>8</td>
<td>9</td>
<td></td>
</tr>
</tbody>
</table>

- **sub** $2$, $1$, $3$
  - IF | ID | EX | MEM | WB

- and $12$, $2$, $5$
  - IF | ID | EX | MEM | WB

- or $13$, $6$, $2$
  - IF | ID | EX | MEM | WB

- add $14$, $2$, $2$
  - IF | ID | EX | MEM | WB

- sw $15$, 100($2)$
  - IF | ID | EX | MEM | WB
Use arrows to show dependencies: Arrows that point backwards reveal **data hazards**

- `sub $2, $1, $3`
- `and $12, $2, $5`
- `or $13, $6, $2`
- `add $14, $2, $2`
- `sw $15, 100($2)`
An alternate pipeline diagram showing components

sub $2, $1, $3

and $12, $2, $5

or $13, $6, $2

add $14, $2, $2

sw $15, 100($2)
To eliminate hazards, identify when the correct value is created and consumed

R[1]-R[3] calculated by ALU at the end of this cycle

Consume R[1]-R[3] at the start of these cycles
Use pipeline registers to access correct values before values are written back to the reg file

“Forward” data from pipeline registers to later instructions
Forward values from pipeline registers so later instructions can use the correct value

- In clock cycle 4, AND gets R[1]-R[3] from EX/MEM
- In cycle 5, OR gets R[1]-R[3] from MEM/WB

```
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
```
A forwarding unit selects the correct ALU inputs for the EX stage:

- No hazard: ALU’s operands come from the register file, like normal
- Data hazard: operands come from either the EX/MEM or MEM/WB pipeline registers instead

sub $2, $1, $3

and $12, $2, $5

Or $13, $6, $2
Add forwarding muxes in front of the ALU
Use “.” notation to indicate contents of pipeline registers

```
add $2, $10, $1
or $10, $3, $6
sub $1, $2, $3
```
ALU source A comes from the pipeline register when necessary.

if \((\text{EX/MEM.} \text{RegWrite} = 1 \text{ and } \text{EX/MEM.} \text{RegisterRd} = \text{ID/EX.} \text{RegisterRs})\)
then \(\text{ForwardA} = 2\)

\[
\text{sub } \$2, \$1, \$3
\]

\[
\text{and } \$12, \$2, \$5
\]

- ALU source B is similar.

if \((\text{EX/MEM.} \text{RegWrite} = 1 \text{ and } \text{EX/MEM.} \text{RegisterRd} = \text{ID/EX.} \text{RegisterRt})\)
then \(\text{ForwardB} = 2\)
A MEM/WB hazard may occur between an instruction in the EX stage and the instruction from two cycles ago

add $1, $2, $3

add $4, $2, $4

sub $5, $5, $1
What happens if a register is updated twice in a row?

- Register $1$ is written by both of the previous instructions; from which instruction should it receive its value?

```
add $1, $2, $3
add $1, $1, $4
sub $5, $5, $1
```
MEM/WB hazard equations

- Equation for MEM/WB hazards for ALU source A

\[
\text{if (MEM/WB.RegWrite = 1} \\
\text{and MEM/WB.RegisterRd = ID/EX.RegisterRs) } \\
\text{then ForwardA = 1}
\]

- Equation for MEM/WB hazards for ALU source B

\[
\text{if (MEM/WB.RegWrite = 1} \\
\text{and MEM/WB.RegisterRd = ID/EX.RegisterRt) } \\
\text{then ForwardB = 1}
\]
Add forwarding unit control logic
The forwarding unit

- The forwarding unit has several control signals as inputs.

  ID/EX.RegisterRs  EX/MEM.RegisterRd  MEM/WB.RegisterRd
  ID/EX.RegisterRt  EX/MEM.RegWrite   MEM/WB.RegWrite

The forwarding unit outputs are selectors for the ForwardA and ForwardB multiplexers attached to the ALU. These outputs are generated from the inputs using the equations on the previous pages.

- Some new buses route data from pipeline registers to the new muxes.
Example

```plaintext
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
```

- Assume again each register initially contains its number plus 100.
  - After the first instruction, $2 should contain -2 (101 - 103).
  - The other instructions should all use -2 as one of their operands.

- We’ll try to keep the example short.
  - Assume no forwarding is needed except for register $2.
  - We’ll skip the first two cycles, since they’re the same as before.
Clock cycle 3

IF: or $13, $6, $2
ID: and $12, $2, $5
EX: sub $2, $1, $3
Clock cycle 4: forwarding $2 from EX/MEM

IF: add $14, $2, $2

ID: or $13, $6, $2

EX: and $12, $2, $5

MEM: sub $2, $1, $3

Instruction memory

ID/EX

Registers

ALU

Data memory

ID/EX/RegisterRs

Forwarding Unit

EX/MEM/RegisterRd

MEM/WB/RegisterRd
Clock cycle 5: forwarding $2 from MEM/WB

IF: `sw $15, 100($2)`
ID: `add $14, $2, $2`
EX: or $13, $6, $2
MEM: and $12, $2, $5
WB: sub $2, $1, $3
Lots of data hazards

- The first data hazard occurs during cycle 4.
  - The forwarding unit notices that the ALU’s first source register for the AND is also the destination of the SUB instruction.
  - The correct value is forwarded from the EX/MEM register, overriding the incorrect old value still in the register file.

- A second hazard occurs during clock cycle 5.
  - The ALU’s second source (for OR) is the SUB destination again.
  - This time, the value has to be forwarded from the MEM/WB pipeline register instead.

- There are no other hazards involving the SUB instruction.
  - During cycle 5, SUB writes its result back into register $2$.
  - The ADD instruction can read this new value from the register file in the same cycle.
Complete pipelined datapath...so far
We can forward for store instructions too

Forward to offset

```
add $1, $2, $3
sw  $4, 0($1)
```

Forward to data

```
add $1, $2, $3
sw  $1, 0($4)
```
Store Bypassing: Forward to offset

EX: sw $4, 0($1)
MEM: add $1, $2, $3
Store Bypassing: Forward to data

EX: sw $1, 0($4)  MEM: add $1, $2, $3

IF/ID

ID/EX

EX/MEM

MEM/WB

PC

Addr

Instr

Instruction memory

Instr [15 - 0]

Extend

Read register 1

Read data 1

Read register 2

Write register

Read data 2

Write data

Read register

Write register

Write data

Read data 2

Read data 1

Rt

Rd

Rs

Forwarding Unit

ALU

Zero

ALUSrc

Result

RegDst

RESULT

Write data

Read data

Address

Data memory

Write

Read data

EX/MEM.RegisterRd

MEM/WB.RegisterRd
What about stores after a load?

- In what cycle is:
  - The load value available?
  - The store value needed?

- What do we have to add to the datapath?
Load/Store Bypassing: Extend the Datapath

Sequence:
lw $1, 0($2)
sw $1, 0($4)
Each MIPS instruction writes to at most one register.

- This makes the forwarding hardware easier to design, since there is only one destination register that ever needs to be forwarded.

Forwarding is especially important with deep pipelines like the ones in all current PC processors.

Section 6.4 of the textbook has some additional material not shown here.

- Their hazard detection equations also ensure that the source register is not $0$, which can never be modified.
- There is a more complex example of forwarding, with several cases covered. Take a look at it!
Summary

- In real code, most instructions are dependent upon other ones.
  - This can lead to data hazards in our original pipelined datapath.
  - Instructions can’t write back to the register file soon enough for the next two instructions to read.

- **Forwarding** eliminates data hazards involving arithmetic instructions.
  - The forwarding unit detects hazards by comparing the destination registers of previous instructions to the source registers of the current instruction.
  - Hazards are avoided by grabbing results from the pipeline registers before they are written back to the register file.

- Next time we’ll finish up pipelining.
  - Forwarding can’t save us in some cases involving lw.
  - We still haven’t talked about branches for the pipelined datapath.