02. Pipeline and branch prediction

Base lecture in Russian

CPU performance (instructions/sec) = frequency / CPI = (cycles/sec) / (cycles/instruction)

  1. Higher frequency
    • ⇒ higher energy cost
    • ⇒ more heat emitted
    • ⇒ limited by hardware (e. g. by the speed of electrons ⇒ smaller parts)
  2. More efficient instructions
    • Logic (e. g. the «memory wall» effect)
    • Microprogram design of internal CPU devices (see below)

  3. + More efficient memory access
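
The formula above is plain arithmetic; a quick sanity check in Python (the frequency and CPI values are made up for illustration):

```python
# performance (instructions/sec) = frequency / CPI
frequency_hz = 3_000_000_000   # 3 GHz: cycles per second (illustrative)
cpi = 1.5                      # average cycles per instruction (illustrative)

performance_ips = frequency_hz / cpi
print(performance_ips)         # 2000000000.0, i.e. 2e9 instructions/sec
```

Raising the frequency scales the numerator; pipelining and superscalar execution (below) attack the denominator.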

Simultaneous execution: dependency

Problem: simultaneous execution is impossible when the next instruction cannot start until the previous one has finished

Data dependence
instruction 2 uses data prepared by instruction 1
  • register calculation flow and conditional operations

       addiu $t6, $t5, $t4
       sll   $v0, $t6, 2
    
  • memory access (both memory calculation flow and $at register usage)

       sw    $t2, 0x10010000
       lw    $t3, 0x10010000
    
  • indirect addressing

       addiu $t4, $t4, 4
       sw    $t2, ($t4)
    
  • Path dependence

    instruction 1 calculates the address of instruction 2

  • Conditional branch
       ble    $t2, $t3, Label
       # continue here?

       Label:
       # …or here?
    
  • Indirect jump
       lw    $ra, ($sp)
       jr    $ra
    
  • There's no universal dependency eliminator.
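
Although there is no universal eliminator, detecting a data dependence between two instructions is straightforward. A sketch in Python (register sets are listed by hand; no real decoder is assumed):

```python
# A sketch (not any real CPU's logic) of detecting a read-after-write (RAW)
# data dependence between two adjacent instructions.
def raw_hazard(writes_of_first, reads_of_second):
    """True if the second instruction reads a register the first one writes."""
    return bool(set(writes_of_first) & set(reads_of_second))

# addiu $t6, $t5, $t4 writes $t6; sll $v0, $t6, 2 reads $t6:
print(raw_hazard({"$t6"}, {"$t6"}))   # True: the second must wait for $t6
print(raw_hazard({"$t6"}, {"$t3"}))   # False: the pair is independent
```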

    Scaling

    Leaving task parallelism alone, what can be done to scale up one program execution?

    Vector operations

    scalable data in one instruction (e. g. vector addition, mesh tracing etc.)

  • Easy to implement by scaling math coprocessor and/or CPU ALU unit
  • Data is vector (i. e. multiple scalar data)
    • ⇒ Full vector load is needed for efficiency
    • ⇒ High impact on data dependence
    • Reorder of calculations as possible solution

  • Top requested
    • Multimedia (MMX/SSE etc.)
    • GPU/APU
    • DAC/ADC
  • What next:
    • assembler optimization

    • hardware code reordering

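The semantics of a vector instruction can be sketched in plain Python (no real SIMD involved): one operation applies element-wise to a whole vector, which is also why a full vector load is needed for efficiency.

```python
# One vector instruction replaces a loop of scalar ones: element-wise add,
# sketched in plain Python to show the semantics only.
def vector_add(a, b):
    return [x + y for x, y in zip(a, b)]

# Four scalar additions collapse into one "instruction":
print(vector_add([1, 2, 3, 4], [10, 20, 30, 40]))  # [11, 22, 33, 44]
```

If the vectors were shorter than the hardware lane count, part of the unit would idle, hence the full-load requirement.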
    Superscalar operations

    multiple instructions at a time (or rather microinstructions, e. g. decoding, memory access, calculation etc.)

  • Require special CPU (re-)design
  • Impact on both data and path dependency
    • Reorder of instructions as possible solution

  • Very complex task of full-load program design
  • Full loading problem: any dependency turns parallel execution into sequential
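
How a dependency turns parallel execution into sequential can be shown with a toy scheduler (an illustration, not real hardware): an instruction may issue only after everything it depends on has issued in an earlier cycle.

```python
# Toy in-order issue model of a 2-wide superscalar CPU (illustrative only).
def schedule(deps, width=2):
    """deps[i] = set of instruction indices that instruction i depends on.
    Returns {instruction: cycle in which it issues}."""
    cycle_of = {}
    cycle = 0
    while len(cycle_of) < len(deps):
        issued = 0
        for i in range(len(deps)):
            if i in cycle_of or issued == width:
                continue
            # issue only if every dependency issued in an earlier cycle
            if all(cycle_of.get(d, cycle) < cycle for d in deps[i]):
                cycle_of[i] = cycle
                issued += 1
        cycle += 1
    return cycle_of

# Two independent instructions pair up; a dependence chain serializes:
print(schedule([set(), set(), {0}, {2}]))  # {0: 0, 1: 0, 2: 1, 3: 2}
```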

    Hardware implementations

    Example: MARS BHT simulator

    Example code:

   .eqv    START   0x10010000
   .eqv    SZ      256
   .eqv    GAP     11
   .text
           li      $t8 0               # pass number (0 … GAP-1)
           li      $t7 GAP
           sll     $t7 $t7 2           # stride in bytes = GAP * 4
   loopg:  sll     $t9 $t8 2           # byte offset where this pass starts
   loop:   lw      $t0 START($t9)      # strided walk over SZ bytes of data
           addu    $t9 $t9 $t7
           blt     $t9 SZ loop         # inner loop: taken until offset ≥ SZ
           addiu   $t8 $t8 1
           blt     $t8 GAP loopg       # outer loop: one pass per start offset
    
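The MARS BHT tool tracks branches like the two blt's above. A minimal one-bit branch history table, sketched in Python (table size and PC indexing here are illustrative, not MARS's exact scheme):

```python
# One-bit branch history table: remember the last outcome per table entry
# and predict it again next time.
class BHT:
    def __init__(self, entries=8):
        self.table = [False] * entries   # False = predict "not taken"

    def predict(self, pc):
        return self.table[pc % len(self.table)]

    def update(self, pc, taken):
        self.table[pc % len(self.table)] = taken

bht = BHT()
hits = 0
outcomes = [True] * 9 + [False]          # a loop branch: taken 9x, then exits
for taken in outcomes:
    hits += bht.predict(0x40) == taken
    bht.update(0x40, taken)
print(hits, "of", len(outcomes))         # 8 of 10: misses on first and last pass
```

A one-bit predictor always mispredicts twice per loop; two-bit counters reduce that to once.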

    Pipeline

    Pipeline is a microprogram-based instruction design in which a single instruction is divided into stages, each implemented by a separate CPU part. Under favourable circumstances, each stage can act in parallel with the others (each working on a different instruction).

    Rough scheme of the MIPS pipeline (other architectures provide different ones):

    1. Instruction fetch

      • Read from text memory
    2. Instruction Decode

      • Determine arguments and read them from registers
        • including fresh data, calculated on the E stage but not yet passed through the W stage

      • Determine the next address (including non-pseudo conditional branches, because they do not require arithmetic)
    3. Execution

      • ALU/C1 calculates
    4. Memory fetch

      • Data memory access
    5. Writeback

      • Write result to registers
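
With the five stages above, the speedup of an ideal, stall-free pipeline follows from a simple cycle count (a textbook model, not a measurement of any real CPU):

```python
# Cycle counts for n instructions on a 5-stage pipeline vs. no pipelining,
# assuming no stalls at all (idealized textbook model).
STAGES = 5

def cycles_unpipelined(n):
    return STAGES * n            # each instruction occupies all stages alone

def cycles_pipelined(n):
    return STAGES + (n - 1)      # fill the pipe once, then one finishes per cycle

print(cycles_unpipelined(100), cycles_pipelined(100))  # 500 104
```

For large n the speedup approaches the number of stages; data and path conflicts below are what keeps real pipelines away from this bound.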

    Example:

       lui   $at, 0x1001      # lw $t2, 0x10010000 pseudoinstruction
       lw    $t2, 0($at)
       sll   $t3, $t3, 2
       add   $t2, $t2, $t3    # depends on $t3
       lui   $at, 0x1004      # sw $t2, 0x10040000 pseudoinstruction
       sw    $t2, 0($at)
    

    simple (with no register back-access) HTML emulator of MIPS with a pipeline (GitHub)

    Data conflict

    Pipeline slip/stall

       lw    $t2, 0($at)
       add   $t1, $t3, $t2
    
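Here lw has its data only after the M stage, while add wants it already at E, so the pipeline must slip for one cycle. The rule, sketched in Python (the tuple encoding of instructions is an assumption of this sketch):

```python
# Load-use stall rule of the classic 5-stage MIPS pipeline, sketched in
# Python: a load followed immediately by a consumer costs one slip cycle.
def load_use_stalls(first, second):
    """first/second: (opcode, dest_reg, source_regs); returns stall cycles."""
    op, dest, _ = first
    _, _, sources = second
    return 1 if op == "lw" and dest in sources else 0

print(load_use_stalls(("lw", "$t2", ["$at"]),
                      ("add", "$t1", ["$t3", "$t2"])))  # 1 cycle slip
```

Reordering an independent instruction between the two hides the slip entirely.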

    Path conflict

    At stage D the next address is calculated, but the next instruction has already been fetched. This is how the «delay slot» appears.

    Delay slot
    fetching and execution of the instruction located just after a conditional branch, regardless of the condition value
    addi $ra, $ra, 4    |     addi $t2, $t2, 4
    jr   $ra            |     jr   $ra
                        |     nop
    • or
    addi $ra, $ra, 4    |     jr $ra
    jr   $ra            |     addi $t2, $t2, 4
  • The code at the bottom right is correct: an instruction that does not affect the jump target may fill the delay slot

  • Modern MIPS CPUs also insert the slip delay in hardware
  • MARS can simulate this («Delayed branching» option)
  • RISC-V has no delay slot
  • H/W

    HSE/ProgrammingOS/02_PipelinePrediction (last edited by FrBrGeorge 2020-03-01 11:30:56)