# Simty: Generalized SIMT execution on RISC-V

**CARRV 2017** 

Sylvain Collange, INRIA Rennes / IRISA, sylvain.collange@inria.fr





#### From CPU-GPU to heterogeneous multi-core

- Yesterday (2000-2010)
  - Homogeneous multi-core
  - Discrete components
- Today (2011-...)
  - Heterogeneous multi-core
  - Physically unified
    - CPU + GPU on the same chip
  - Logically separated
    - Different programming models, compilers, instruction sets
- Tomorrow
  - Unified programming models?
  - Single instruction set?
- Defining the general-purpose throughput-oriented core



### Outline

- Stateless dynamic vectorization
  - Functional view
  - Implementation options
- The Simty core
  - Design goals
  - Micro-architecture

#### The enabler: dynamic inter-thread vectorization

Idea: the microarchitecture aggregates threads together to assemble vector instructions



- Force threads to run in lockstep: threads execute the same instruction at the same time (or do nothing)
- Generalization of GPU's SIMT for general-purpose ISAs
- Benefits vs. static vectorization
  - Programmability: software sees only threads, not threads + vectors
  - Portability: vector width is not exposed in the ISA
  - Scalability: more threads → larger vectors, more latency hiding, or more cores
  - Implementation simplicity: handling traps is straightforward

#### Goto considered harmful?

| Instruction set | Control transfer instructions |
|-----------------|-------------------------------|
| RISC-V | jal, jalr, bXX, ecall, ebreak, Xret |
| NVIDIA Tesla (2007) | bar, bra, brkpt, cal, cont, kil, pbk, pret, ret, ssy, trap |
| NVIDIA Fermi (2010) | bar, bpt, bra, brk, brx, cal, cont, exit, jcal, jmx, kil, pbk, pret, ret, ssy, .s |
| Intel GMA Gen4 (2006) | jmpi, if, iff, else, endif, do, while, break, cont, halt, msave, mrest, push, pop |
| Intel GMA SB (2011) | jmpi, if, else, endif, case, while, break, cont, halt, call, return, fork |
| AMD R500 (2005) | jump, loop, endloop, rep, endrep, breakloop, breakrep, continue |
| AMD R600 (2007) | push, push_else, pop, loop_start, loop_start_no_al, loop_start_dx10, loop_end, loop_continue, loop_break, jump, else, call, call_fs, return, return_fs, alu, alu_push_before, alu_pop_after, alu_pop2_after, alu_continue, alu_break, alu_else_after |
| AMD Cayman (2011) | push_else, pop, push_wqm, pop_wqm, else_wqm, jump_any, reactivate, reactivate, loop_start, loop_start_no_al, loop_end, loop_continue, loop_break, jump, else, call, call_fs, return, return_fs, alu, alu_push_before, alu_pop_after, alu_pop2_after, alu_continue, alu_break, alu_else_after |

Control transfer instructions in GPU instruction sets vs. RISC-V

- GPUs: control-flow divergence and convergence are explicit
  - Incompatible with general-purpose instruction sets

### Stateless dynamic vectorization

Idea: per-thread PCs characterize thread state

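Figure: example code shown next to the per-thread program counters (PCs). The code has nested branches: `if(tid < 2)`, then `if(tid == 0)` with `x = 2;` on one path and `x = 3;` on the `else` path. Threads whose PC matches the Master PC are active; threads whose PC does not match are inactive.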

- Policy: MPC = min(PC<sub>i</sub>) inside deepest function
  - Intuition: favor threads that are behind so they can catch up
  - Earliest reconvergence with code laid out in reverse post order
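The policy above can be sketched as a toy lockstep simulation of the nested-branch example. This is a minimal illustration, not Simty's implementation: the PC numbering (0 to 5), the function name `run_nested_branch`, and the single call depth (so the "deepest function" term plays no role) are all assumptions made for the example.

```python
def run_nested_branch(num_threads=4):
    """Toy lockstep run of: if (tid < 2) { if (tid == 0) x = 2; else x = 3; }"""
    END = 5                      # hypothetical PC of the reconvergence point
    pcs = [0] * num_threads      # one PC per thread: the only control state
    x = [None] * num_threads
    trace = []                   # (master PC, activity mask) recorded per step
    while True:
        mpc = min(pcs)                       # policy: MPC = min(PC_i)
        active = [pc == mpc for pc in pcs]   # match -> active, no match -> inactive
        trace.append((mpc, active))
        if mpc == END:
            break
        for tid in range(num_threads):
            if not active[tid]:
                continue
            if mpc == 0:      # outer test: skip the whole body unless tid < 2
                pcs[tid] = 1 if tid < 2 else END
            elif mpc == 1:    # inner test: else path starts at PC 4
                pcs[tid] = 2 if tid == 0 else 4
            elif mpc == 2:    # x = 2
                x[tid] = 2
                pcs[tid] = 3
            elif mpc == 3:    # jump over the else block
                pcs[tid] = END
            elif mpc == 4:    # x = 3
                x[tid] = 3
                pcs[tid] = END
    return x, trace


if __name__ == "__main__":
    values, steps = run_nested_branch()
    for mpc, mask in steps:
        print(mpc, "".join("1" if a else "0" for a in mask))
```

With four threads, the activity masks go from all threads active, to threads 0 and 1, to single threads on each side of the inner branch, and back to all threads at the reconvergence point, without any explicit divergence or convergence instruction in the code.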

#### Functional view

- Figure: handling of a control transfer instruction or exception
- Figure: handling of an arithmetic instruction

### Implementation 1: reduction tree

Straightforward implementation of the functional view

- On every branch: compute Master PC from individual PCs
  - Reduction tree to compute max(depth)-min(PCs)
- On every instruction: compare Master PC with individual PCs
  - Row of address comparators
- Issues: area, energy overheads, extra branch resolution latency
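A software analogue of this scheme might look like the sketch below. It assumes each thread context is a `(depth, pc)` pair and that the reduction is a balanced pairwise tree; the helper names `better`, `reduce_tree` and `active_mask` are hypothetical and only serve the illustration.

```python
def better(a, b):
    """Pick the context to run first: deeper call depth wins, then lower PC."""
    (depth_a, pc_a), (depth_b, pc_b) = a, b
    if depth_a != depth_b:
        return a if depth_a > depth_b else b
    return a if pc_a <= pc_b else b


def reduce_tree(contexts):
    """Pairwise reduction; returns (master context, number of tree levels).

    Assumes a non-empty list of (depth, pc) pairs.
    """
    level = list(contexts)
    levels = 0
    while len(level) > 1:
        level = [
            better(level[i], level[i + 1]) if i + 1 < len(level) else level[i]
            for i in range(0, len(level), 2)
        ]
        levels += 1
    return level[0], levels


def active_mask(contexts, master):
    """Row of comparators: a thread is active when its context matches the master."""
    return [ctx == master for ctx in contexts]


if __name__ == "__main__":
    ctxs = [(0, 0x40), (1, 0x90), (1, 0x80), (0, 0x10)]
    master, levels = reduce_tree(ctxs)
    print(master, levels, active_mask(ctxs, master))
```

The number of tree levels grows with the logarithm of the thread count, which gives a rough feel for the extra branch resolution latency listed among the issues.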



### Implementation 2: sorted context table

- Common case: few different PCs
- Order is stable over time
- Keep common PCs (CPCs) and activity masks in a sorted heap



- Branch = insertion in sorted context table
- Convergence = fusion of head entries when CPC<sub>1</sub>=CPC<sub>2</sub>
- Activity mask is readily available
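The bullets above can be illustrated with a small sorted table of `(CPC, mask)` entries. This is a sketch under simplifying assumptions: masks are plain integers with one bit per thread, call depth is ignored, and entries with equal CPCs are merged at insertion time rather than only at the head. The class name `ContextTable` and its methods are hypothetical.

```python
import bisect


class ContextTable:
    """Sorted list of (CPC, activity mask) entries; the head entry executes next."""

    def __init__(self, pc, mask):
        self.entries = [(pc, mask)]  # kept sorted by CPC, ascending

    def head(self):
        return self.entries[0]

    def _insert(self, pc, mask):
        if mask == 0:
            return  # no thread follows this path
        keys = [cpc for cpc, _ in self.entries]
        i = bisect.bisect_left(keys, pc)
        if i < len(self.entries) and self.entries[i][0] == pc:
            # Convergence: same CPC, so fuse the activity masks.
            self.entries[i] = (pc, self.entries[i][1] | mask)
        else:
            self.entries.insert(i, (pc, mask))

    def advance(self, next_pc):
        """Head entry executed a non-divergent instruction and moves to next_pc."""
        _, mask = self.entries.pop(0)
        self._insert(next_pc, mask)

    def branch(self, taken_mask, target_pc, fallthrough_pc):
        """Head entry executed a branch: split its mask into two entries."""
        _, mask = self.entries.pop(0)
        self._insert(target_pc, mask & taken_mask)
        self._insert(fallthrough_pc, mask & ~taken_mask)


if __name__ == "__main__":
    table = ContextTable(0, 0b1111)
    table.branch(0b1100, target_pc=5, fallthrough_pc=1)
    print(table.entries)
```

The activity mask of the running threads is simply the mask stored in the head entry, so no comparison against individual PCs is needed on each instruction.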

### Simty: illustrating the simplicity of SIMT

Proof of concept for dynamic inter-thread vectorization

- Focus on the core ideas → the RISC of dynamic vectorization
- Simple programming model
  - Many scalar threads
  - General-purpose RISC-V ISA
- Simple micro-architecture
  - Single-issue RISC pipeline
  - SIMD execution units
- Highly concurrent, scalable
  - Interleaved multi-threading to hide latency
  - Dynamic vectorization to increase execution throughput
  - Target: hundreds of threads per core

### Simty implementation

- Written in synthesizable VHDL
- Runs the RISC-V instruction set (RV32I)
- Fully parametrizable SIMD width, multithreading depth
- 10-stage pipeline



### Multiple warps

- Wide dynamic vectorization found counterproductive
  - Sensitive to control-flow and memory divergence
  - Threads that hit in the cache wait for threads that miss
  - Breaks latency hiding capability of interleaved multi-threading
- Two-level approach: partition threads into warps, vectorize inside warps
  - Standard approach on GPUs
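As a small illustration of the two-level approach, threads can be partitioned by integer division of the thread identifier. The contiguous mapping and the helper names below are assumptions for the example; the slides do not state how Simty assigns threads to warps.

```python
def warp_and_lane(tid, warp_width):
    """Map a thread id to (warp index, lane inside the warp), assuming contiguous ids."""
    return divmod(tid, warp_width)


def partition(num_threads, warp_width):
    """Group thread ids by warp; vectorization then happens inside each group only."""
    warps = {}
    for tid in range(num_threads):
        warp, _ = warp_and_lane(tid, warp_width)
        warps.setdefault(warp, []).append(tid)
    return warps


if __name__ == "__main__":
    print(partition(8, 4))
```

Because each warp is scheduled independently, a cache miss in one warp stalls only that warp's threads while the other warps keep the pipeline busy.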



#### Two-level context table

- Cache top 2 entries in the Hot Context Table register
  - Constant-time access to CPC, activity masks
  - In-band convergence detection
- Other entries in the Cold Context Table
  - Branch → incremental insertion in CCT
  - Out-of-band CCT sorting: inexpensive insertion sort in O(n²)
  - If CCT sorting cannot catch up: degenerates into a stack (as on GPUs)
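The out-of-band sorting can be pictured as doing a bounded amount of work per step. The sketch below assumes the CCT is a list of `(CPC, mask)` entries and that each step performs at most one adjacent compare-and-swap, the elementary move of insertion sort; the function name `sort_step` and the entry layout are hypothetical.

```python
def sort_step(cct):
    """Perform at most one adjacent swap on the CCT; return True if a swap was made."""
    for i in range(len(cct) - 1):
        if cct[i][0] > cct[i + 1][0]:
            cct[i], cct[i + 1] = cct[i + 1], cct[i]
            return True
    return False


if __name__ == "__main__":
    cct = [(0x30, 0b0100), (0x10, 0b0001), (0x20, 0b0010)]
    swaps = 0
    while sort_step(cct):
        swaps += 1
    print(swaps, cct)
```

If branches insert entries faster than such steps can run, the table is consumed in insertion order, which is the stack-like behavior mentioned in the last bullet.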



### Memory access patterns

#### In traditional vector processing

- Scalar load & broadcast; reduction & scalar store
- (Non-unit) strided load; (non-unit) strided store
- Scatter

#### With dynamic vectorization

Support the general case, optimize for the common case

### Memory access unit

- Scalar and aligned unit-strided scenarios: single pass
- Complex accesses in multiple passes using replay
- Execution of a scatter/gather is interruptible
  - Allowed by multi-thread ISA
  - No need to rollback on TLB miss or exception
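One way to picture replay is as a loop that serves, in each pass, every pending thread whose address falls in the same memory line as the first pending thread. This is a hedged sketch rather than Simty's actual unit: the 64-byte line size, the "first pending thread picks the line" rule, and the function name `memory_passes` are assumptions.

```python
LINE_BYTES = 64  # assumed line size, not taken from the slides


def memory_passes(addresses, active):
    """Return the list of passes as (line index, thread ids served in that pass)."""
    pending = [tid for tid, is_active in enumerate(active) if is_active]
    passes = []
    while pending:
        line = addresses[pending[0]] // LINE_BYTES
        served = [tid for tid in pending if addresses[tid] // LINE_BYTES == line]
        passes.append((line, served))
        # The only state carried between passes is the set of pending threads,
        # so the access could be suspended here and resumed later.
        pending = [tid for tid in pending if tid not in served]
    return passes


if __name__ == "__main__":
    unit_stride = [0x100 + 4 * i for i in range(8)]
    print(len(memory_passes(unit_stride, [True] * 8)))
    scattered = [64 * i for i in range(4)]
    print(len(memory_passes(scattered, [True] * 4)))
```

In this model, a scalar access (all threads on one address) and an aligned unit-strided access both finish in a single pass, while a scattered access takes one pass per distinct line.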



### FPGA prototype

#### On Altera Cyclone IV



- Up to 2048 threads per core: 64 warps × 32 threads
- Sweet spot: 8×8 to 32×16 (multithreading depth × SIMD width)
  - Multithreading depth (warps): latency hiding
  - SIMD width (threads per warp): throughput

#### Conclusion

- Stateless dynamic vectorization is implementable
- Unexpectedly inexpensive
  - Overhead amortized even for single-issue RISC without FPU
- Scalable
  - Parallelism in same class as state-of-the-art GPUs
- Minimal software impact
  - Standard scalar RISC-V instruction set, no proprietary extension
  - Reuse the RISC-V software infrastructure: gcc and LLVM backends
  - OS changes to manage ~10K threads?
- One step on the road to single-ISA heterogeneous CPU+GPU
