Nested Parallelism PageRank on RISC-V Vector Multi-Processors

Alon Amid, Albert Ou, Krste Asanović, Borivoje Nikolić
Agenda

- Problem Domain (Graphs/PageRank + Nested Parallelism)
- Silicon-Proven Open Source Hardware and Software Implementations (Rocket + Hwacha + GraphMat + OpenMP)
- FPGA-Accelerated Simulation (FireSim)

- SW/HW Design Space Exploration
- Full-System Implications
Graphs

- Graphs are everywhere
  - Implicit data-parallelism
  - Irregular data layout
- Usefulness of fixed-function acceleration of graph kernels is debatable
- Use general purpose data-parallel acceleration for graph workloads
  - Maximize the efficiency of data-parallel processors

Common Data-Parallel Architectures

- Packed-SIMD
  - Register size exposed in the programming model
  - Direct bit-manipulation
  - ISA changes with every technology generation

- GPUs
  - SIMT programming model
  - Throughput-processors, scratchpad memories

- Vector Architectures
  - Vector-length agnostic programming model
  - Additional flexibility in µarch optimization
Graphs in Data-Parallel Architectures

- Intel AVX
  - Small parallelism factor
  - AVX register utilization is limited by size and alignment constraints
    - Alternative sparse-matrix representations to fit AVX registers (Grazelle [1])

- GPUs [2][3]
  - Amortize data-movement between host memory and GPU memory
  - Load balancing between warps and threads

[2] Scalable SIMD-Efficient Graph Processing on GPUs, Farzad Khorasani, Rajiv Gupta, Laxmi N. Bhuyan
[3] Multiple works by John Owens (UC Davis)

Photo credits:
https://www.tomshardware.co.uk/why-gpu-pricing-will-drop-further.news-58816.html
Hwacha Vector Architecture

- Non-standard RISC-V ISA extension
- Vector-length agnostic programming model
- **Silicon-proven, open-source** vector accelerator
  - Open-sourced at the 1st RISC-V Summit
- Integrated with Rocket chip generator
- TileLink cache-coherent memory system
- Parameterizable multi-lane design
Hwacha Vector Architecture

- Decoupled access-execute
- 4 ops/cycle per lane average throughput
- 128 bits/cycle backing memory bandwidth
- 16 KiB SRAM banked register file per lane
  - Max vector length of 2048 double-width elements
  - Systolic-bank execution
  - 4x128 bits register file bandwidth
Nested Parallelism

- Data-parallel accelerators + multi-processors
- Mixing parallelism properties
  - Task-level parallelism – flexible, but expensive
  - Data-level parallelism – efficient, but rigid
- Many design points, both SW and HW
- How to partition?

Diagram:

A: Scalar processor with on-chip vector accelerator
B: Chip Multi-Processor (CMP)
C: Chip Multi-Processor with packed-SIMD units
D: Chip Multi-Processor with vector accelerators
● Graphs commonly represented as:
  ○ Adjacency lists
  ○ Adjacency matrices
● Adjacency matrix is usually a sparse matrix
● Sparse matrices can be compressed
  ○ Eliminating the zero values
  ○ Reduce storage in memory
● Variety of sparse matrix representations
Graph and Sparse Matrix Representations

\[\begin{array}{cccccccc}
0 & 81 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 5 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
61 & 0 & 9 & 0 & 0 & 0 & 34 & 11 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 42 \\
0 & 0 & 0 & 0 & 0 & 0 & 17 & 0 \\
0 & 92 & 0 & 0 & 0 & 0 & 0 & 70 \\
\end{array}\]

row_indices: [0, 1, 3, 3, 3, 3, 5, 6, 7, 7]
column_indices: [1, 1, 0, 2, 6, 7, 7, 6, 1, 7]
values: [81, 5, 61, 9, 34, 11, 42, 17, 92, 70]
Graph and Sparse Matrix Representations

**COO (Coordinate List Representation)**

<table>
<thead>
<tr>
<th>row_indices</th>
<th>column_indices</th>
<th>values</th>
</tr>
</thead>
<tbody>
<tr>
<td>0 1 3 3 3 3 5 6 7 7</td>
<td>1 1 0 2 6 7 7 6 1 7</td>
<td>81 5 61 9 34 11 42 17 92 70</td>
</tr>
</tbody>
</table>

**CSR (Compressed Sparse Row Representation)**

<table>
<thead>
<tr>
<th>row_pointers</th>
<th>column_indices</th>
<th>values</th>
</tr>
</thead>
<tbody>
<tr>
<td>0 1 2 2 6 6 7 8 10</td>
<td>1 1 0 2 6 7 7 6 1 7</td>
<td>81 5 61 9 34 11 42 17 92 70</td>
</tr>
</tbody>
</table>

**CSC (Compressed Sparse Column Representation)**

<table>
<thead>
<tr>
<th>row_indices</th>
<th>column_pointers</th>
<th>values</th>
</tr>
</thead>
<tbody>
<tr>
<td>3 0 1 7 3 3 6 3 5 7</td>
<td>0 1 4 5 5 5 5 7 10</td>
<td>61 81 5 92 9 34 17 11 42 70</td>
</tr>
</tbody>
</table>
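
The three representations above can be cross-checked with a short script. This is a minimal sketch, not code from the talk: the helper name `compress` and the dimension constant `N` are illustrative, and the input is the COO data from the example matrix.

```python
from itertools import accumulate

# COO arrays copied from the slide (8x8 example matrix).
row_indices = [0, 1, 3, 3, 3, 3, 5, 6, 7, 7]
column_indices = [1, 1, 0, 2, 6, 7, 7, 6, 1, 7]
values = [81, 5, 61, 9, 34, 11, 42, 17, 92, 70]
N = 8  # matrix dimension (rows == columns in this example)


def compress(major, minor, vals, n):
    """Sort entries by (major, minor) and build the pointer array for `major`."""
    order = sorted(range(len(vals)), key=lambda k: (major[k], minor[k]))
    counts = [0] * n
    for m in major:
        counts[m] += 1
    pointers = [0] + list(accumulate(counts))
    return pointers, [minor[k] for k in order], [vals[k] for k in order]


# CSR compresses along rows; CSC compresses along columns.
csr_row_pointers, csr_column_indices, csr_values = compress(
    row_indices, column_indices, values, N)
csc_column_pointers, csc_row_indices, csc_values = compress(
    column_indices, row_indices, values, N)

print(csr_row_pointers)     # [0, 1, 2, 2, 6, 6, 7, 8, 10]
print(csc_column_pointers)  # [0, 1, 4, 5, 5, 5, 5, 7, 10]
```

Repeated pointer values (for example `2, 2` in the CSR pointers) mark empty rows or columns, which is the overhead that DCSR/DCSC removes.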
DCSR/DCSC Representation

- Compress across both dimensions
- Hyper-sparse matrices
  - Required to amortize the overhead of the additional indirection level
- Explicit nested parallelism

\[\begin{array}{cccccccccc}
0 & 61 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 81 & 0 & 5 & 0 & 92 & 0 & 9 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 34 & 0 & 0 & 0 & 17 & 0 & 0 & 0 \\
0 & 11 & 0 & 0 & 42 & 0 & 0 & 70 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
\end{array}\]

\[\begin{array}{lcccccccccc}
\text{values} & 61 & 81 & 5 & 92 & 9 & 34 & 17 & 11 & 42 & 70 \\
\text{column\_indices} & 1 & 2 & 4 & 6 & 8 & 2 & 6 & 1 & 4 & 7 \\
\text{row\_ptrs} & 0 & 1 & 5 & 7 & 10 \\
\text{row\_indices} & 0 & 1 & 6 & 7 \\
\text{row\_starts} & 0 & 2 & 5 \\
\end{array}\]
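
The row compression step can be sketched as follows. This is an illustration under stated assumptions, not the GraphMat implementation: the function name `csr_to_dcsr` is hypothetical, and the 10-row CSR pointer array is derived here from the `row_ptrs` and `row_indices` arrays above. The sketch does not model `row_starts`.

```python
def csr_to_dcsr(row_pointers):
    """Drop empty rows from a CSR pointer array.

    Returns (row_indices, row_ptrs): the indices of the non-empty rows and
    a pointer array with one entry per non-empty row plus a final end marker.
    """
    row_indices = []
    row_ptrs = [0]
    for row in range(len(row_pointers) - 1):
        if row_pointers[row + 1] > row_pointers[row]:  # row has nonzeros
            row_indices.append(row)
            row_ptrs.append(row_pointers[row + 1])
    return row_indices, row_ptrs


# CSR pointers for a 10-row matrix whose only non-empty rows are 0, 1, 6, 7
# (assumed layout, consistent with the DCSR arrays shown above).
csr_row_pointers = [0, 1, 5, 5, 5, 5, 5, 7, 10, 10, 10]

dcsr_row_indices, dcsr_row_ptrs = csr_to_dcsr(csr_row_pointers)
print(dcsr_row_indices)  # [0, 1, 6, 7]
print(dcsr_row_ptrs)     # [0, 1, 5, 7, 10]
```

The pointer array shrinks from 11 entries to 5, at the cost of one extra lookup through `row_indices`; that trade only pays off when most rows are empty.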

Nested Parallelism in DCSR/DCSC

- A DCSR representation is composed of multiple CSR representations
- Two explicit parallelism levels:
  - Level 1 – Task/Thread level parallelism across the external indirection array
  - Level 2 – Data-level parallelism within each sub-CSR representation
Each thread processes one small sub-CSR unit
For demonstration purposes, let’s make the sub-CSR larger
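
The two levels can be sketched in a few lines. This is a rough illustration only: Python threads stand in for OpenMP threads (they do not give real parallel speedup here), the inner loop stands in for the code a vector unit would run, and the split into two partitions of two rows each is an assumption made for the example. The arrays are the DCSR arrays from the DCSR/DCSC Representation slide.

```python
from concurrent.futures import ThreadPoolExecutor

# DCSR arrays from the DCSR/DCSC Representation slide.
values = [61, 81, 5, 92, 9, 34, 17, 11, 42, 70]
column_indices = [1, 2, 4, 6, 8, 2, 6, 1, 4, 7]
row_ptrs = [0, 1, 5, 7, 10]
row_indices = [0, 1, 6, 7]

# Hypothetical split into two sub-CSR partitions, given as
# [first, last) positions in row_indices.
partitions = [(0, 2), (2, 4)]


def process_partition(bounds, x):
    """Level 2: SpMV over one sub-CSR (the data-parallel inner level)."""
    first, last = bounds
    partial = []
    for i in range(first, last):
        acc = 0
        for k in range(row_ptrs[i], row_ptrs[i + 1]):
            acc += values[k] * x[column_indices[k]]
        partial.append((row_indices[i], acc))
    return partial


def dcsr_spmv(x, n_rows, num_threads=2):
    """Level 1: one task per partition, spread across threads."""
    y = [0] * n_rows
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        for partial in pool.map(lambda b: process_partition(b, x), partitions):
            for row, acc in partial:
                y[row] = acc  # each row belongs to exactly one partition
    return y


print(dcsr_spmv([1] * 10, 10))  # [61, 187, 0, 0, 0, 0, 51, 123, 0, 0]
```

Because partitions own disjoint rows, the outer level needs no synchronization on the output vector.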
Sidenote: PageRank

- Measure of importance of nodes in a directed graph
- Represents a random walk
- Can be implemented as an iterative SpMV
- Common iterative graph processing benchmark

\[
P = A \times \frac{1}{N}
\]

\[
y^{(k)} = d \times P \times y^{(k-1)} + \frac{(1-d)}{|V|}
\]

\[
PR(u) = (1-d) + d \sum_{v \in B_u} \frac{PR(v)}{N_v}
\]
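
A minimal sketch of the iterative form, following the update \(y^{(k)} = d \times P \times y^{(k-1)} + \frac{(1-d)}{|V|}\) with each edge contributing \(1/N_v\) of its source's rank. The damping factor `d = 0.85`, the iteration count, and the three-vertex example graph are assumptions chosen for illustration; dangling vertices are not handled.

```python
def pagerank(edges, num_vertices, d=0.85, iterations=50):
    """Power iteration over an edge list of (source, destination) pairs.

    Each inner pass is one sparse matrix-vector product: every edge adds
    d * y[src] / out_degree[src] to the destination's next rank.
    """
    out_degree = [0] * num_vertices
    for src, _ in edges:
        out_degree[src] += 1

    y = [1.0 / num_vertices] * num_vertices
    for _ in range(iterations):
        nxt = [(1.0 - d) / num_vertices] * num_vertices
        for src, dst in edges:
            nxt[dst] += d * y[src] / out_degree[src]
        y = nxt
    return y


# Small directed graph in which every vertex has at least one outgoing edge.
edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
ranks = pagerank(edges, 3)
print(ranks)
```

With no dangling vertices, the ranks keep summing to 1 across iterations, and vertex 2 (two incoming edges) ends up ranked above vertex 1 (one incoming edge).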

Simple Scalar Sparse Matrix Traversal

- Process the internal CSR in a simple scalar loop
- Traverse the pointers array
- Follow the pointer to the values array
- Perform the required operation (multiplication and accumulation for SpMV)
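
The steps above correspond to the familiar scalar CSR SpMV loop. A minimal sketch, using the 8x8 CSR example from the representation slides and an all-ones input vector (so each output is simply the row sum):

```python
def spmv_csr_scalar(row_pointers, column_indices, values, x):
    """y = A * x for a CSR matrix, one row and one nonzero at a time."""
    y = [0] * (len(row_pointers) - 1)
    for row in range(len(row_pointers) - 1):          # traverse the pointers array
        acc = 0
        for k in range(row_pointers[row], row_pointers[row + 1]):
            acc += values[k] * x[column_indices[k]]   # multiply and accumulate
        y[row] = acc
    return y


row_pointers = [0, 1, 2, 2, 6, 6, 7, 8, 10]
column_indices = [1, 1, 0, 2, 6, 7, 7, 6, 1, 7]
values = [81, 5, 61, 9, 34, 11, 42, 17, 92, 70]

print(spmv_csr_scalar(row_pointers, column_indices, values, [1] * 8))
# [81, 5, 0, 115, 0, 42, 17, 162]
```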

<table>
<thead>
<tr>
<th>row_indices</th>
<th>0</th>
<th>1</th>
<th>7</th>
<th>12</th>
<th>21</th>
<th>30</th>
</tr>
</thead>
<tbody>
<tr>
<td>row_ptrs</td>
<td>0</td>
<td>1</td>
<td>5</td>
<td>8</td>
<td>9</td>
<td>11</td>
</tr>
<tr>
<td>column_indices</td>
<td>1</td>
<td>2</td>
<td>4</td>
<td>6</td>
<td>8</td>
<td>14</td>
</tr>
<tr>
<td></td>
<td>14</td>
<td>15</td>
<td>27</td>
<td>43</td>
<td>51</td>
<td>53</td>
</tr>
<tr>
<td>values</td>
<td>61</td>
<td>81</td>
<td>5</td>
<td>92</td>
<td>9</td>
<td>3</td>
</tr>
<tr>
<td></td>
<td>44</td>
<td>2</td>
<td>17</td>
<td>18</td>
<td>10</td>
<td>44</td>
</tr>
</tbody>
</table>
Virtual Processors View

- View of data parallel accelerators as lock-step execution engines
  - No need to dive into µarch
- Number of virtual processors proportional to vector length
- Example: a vector length of 4 => 4 virtual processors
  - Not necessarily implemented as 4 functional units.

Virtual Processors View, Figure 2.3, from Vector Microprocessors, PhD dissertation by Krste Asanović
Stripmining
- The most common technique for loop vectorization
- Operates over strips of data sized by the vector length

Why does simple stripmining not work for CSR/CSC SpMV?
- Pointer arrays: load imbalance, because different pointers point to rows of different lengths
- Values array: serialization on AMOs, because all the values of a strip must be accumulated
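
A rough sketch of both points, under stated assumptions: `VLMAX` and `set_vl` are hypothetical stand-ins for a hardware vector-length request, and `naive_row_utilization` models a strip that assigns one row per virtual processor and waits for the longest row. None of these names come from Hwacha or the talk.

```python
VLMAX = 4  # hypothetical hardware maximum vector length


def set_vl(requested):
    """Model of a vector-length request: grant at most VLMAX elements."""
    return min(requested, VLMAX)


def axpy_stripmined(a, x, y):
    """Compute a*x + y one strip at a time; also return the strip lengths."""
    out, strips, i = list(y), [], 0
    while i < len(x):
        vl = set_vl(len(x) - i)
        for j in range(i, i + vl):  # stands in for one vector operation
            out[j] = a * x[j] + y[j]
        strips.append(vl)
        i += vl
    return out, strips


def naive_row_utilization(row_pointers, vl):
    """Fraction of virtual-processor slots doing useful work when every
    strip of vl rows runs for as long as its longest row."""
    lengths = [b - a for a, b in zip(row_pointers, row_pointers[1:])]
    work = slots = 0
    for i in range(0, len(lengths), vl):
        strip = lengths[i:i + vl]
        work += sum(strip)
        slots += vl * max(strip)
    return work / slots if slots else 0.0


print(axpy_stripmined(2, list(range(10)), [1] * 10)[1])  # [4, 4, 2]
print(naive_row_utilization([0, 1, 2, 2, 6, 6, 7, 8, 10], 4))
```

For a dense loop, stripmining keeps every strip full except the last. For the 8x8 CSR example, rows of length 1, 1, 0, and 4 share a strip, so most slots idle while the longest row finishes: 10 useful operations out of 24 slots.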
● Parallel processing of the pointer array (node-centric)
● Problem: Simple stripmining has low utilization of virtual processors due to load imbalance from the non-uniform vertex degree distribution
● Solution: Pack the row pointers (vertices) to maintain high utilization of virtual processors
  ○ Scalar re-packing after every stripmining iteration

### Packed Stripmining

<table>
<thead>
<tr>
<th>row_indices</th>
<th>0</th>
<th>1</th>
<th>7</th>
<th>12</th>
<th>21</th>
<th>30</th>
</tr>
</thead>
<tbody>
<tr>
<td>row_ptrs</td>
<td>0</td>
<td>1</td>
<td>5</td>
<td>8</td>
<td>9</td>
<td>11</td>
</tr>
<tr>
<td>column_indices</td>
<td>1</td>
<td>2</td>
<td>4</td>
<td>6</td>
<td>8</td>
<td>14</td>
</tr>
<tr>
<td>values</td>
<td>61</td>
<td>81</td>
<td>5</td>
<td>92</td>
<td>9</td>
<td>3</td>
</tr>
</tbody>
</table>
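
One plausible reading of packed stripmining, written as a sequential simulation. This is a sketch of the idea rather than the implementation evaluated in the talk: each entry in `lanes` stands for a virtual processor working on one row, every pass of the outer loop is one vector step, and the scalar code at the top of the loop re-packs unfinished rows together with fresh ones. The input is the 8x8 CSR example from the representation slides.

```python
def spmv_packed_stripmining(row_pointers, column_indices, values, x, vl):
    """Node-centric SpMV with scalar re-packing after every vector step.

    Returns (y, utilization), where utilization is the fraction of
    virtual-processor slots that did useful work.
    """
    n_rows = len(row_pointers) - 1
    y = [0] * n_rows
    next_row = 0
    lanes = []  # each active lane is [row, cursor, end]
    vector_steps = busy = 0
    while True:
        # Scalar re-packing: drop finished rows, then refill with new rows.
        lanes = [lane for lane in lanes if lane[1] < lane[2]]
        while len(lanes) < vl and next_row < n_rows:
            start, end = row_pointers[next_row], row_pointers[next_row + 1]
            if start < end:  # skip empty rows entirely
                lanes.append([next_row, start, end])
            next_row += 1
        if not lanes:
            break
        # One vector step: every active lane consumes one nonzero of its row.
        for lane in lanes:
            row, k, _ = lane
            y[row] += values[k] * x[column_indices[k]]
            lane[1] = k + 1
        vector_steps += 1
        busy += len(lanes)
    utilization = busy / (vector_steps * vl) if vector_steps else 0.0
    return y, utilization


row_pointers = [0, 1, 2, 2, 6, 6, 7, 8, 10]
column_indices = [1, 1, 0, 2, 6, 7, 7, 6, 1, 7]
values = [81, 5, 61, 9, 34, 11, 42, 17, 92, 70]

y, utilization = spmv_packed_stripmining(
    row_pointers, column_indices, values, [1] * 8, vl=4)
print(y)            # [81, 5, 0, 115, 0, 42, 17, 162]
print(utilization)  # 0.625
```

On this small example the packed version needs 4 vector steps for 10 nonzeros. The price is the scalar re-packing work between steps.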
Parallel processing of the values array (edge-centric)

Problem: Accumulation serialization within single vertex

Solution: Distribute accumulation across different vertices by processing values array in constant intervals (rake)
  ○ Allows for trivial load-balancing and high virtual processor utilization without repacking
  ○ Requires predicated tracking of row transitions

Loop Raking

| row_indices | 0 | 1 | 7 | 12 | 21 | 30 |
| row_ptrs   | 0 | 1 | 5 | 8  | 9  | 11 |
| column_indices | 1 | 2 | 4 | 6  | 8  | 14 |
| values      | 61| 81| 5 | 92 | 9  | 3  |
|             | 14| 15| 27| 43 | 51 | 53 |
|             | 44| 2 | 17| 18 | 10 | 44 |
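
A sequential simulation of the loop raking idea, offered as a sketch under assumptions rather than the evaluated implementation. The values array is cut into `vl` equal contiguous intervals, one per virtual processor; the `while` loop models the predicated row-transition check; and the final flush stands in for combining partial sums of rows that span two intervals. It assumes the last entry of `row_pointers` equals the number of nonzeros, and uses the 8x8 CSR example from the representation slides.

```python
from bisect import bisect_right


def spmv_loop_raking(row_pointers, column_indices, values, x, vl):
    """Edge-centric SpMV: each virtual processor walks one interval of values."""
    n_rows, nnz = len(row_pointers) - 1, len(values)
    y = [0] * n_rows
    if n_rows == 0 or nnz == 0:
        return y
    rake = -(-nnz // vl)  # interval length, ceil(nnz / vl)
    starts = [min(vp * rake, nnz) for vp in range(vl)]
    ends = [min(s + rake, nnz) for s in starts]
    # Row containing each virtual processor's first nonzero.
    rows = [min(bisect_right(row_pointers, s) - 1, n_rows - 1) for s in starts]
    acc = [0] * vl
    for step in range(rake):       # lock-step iterations
        for vp in range(vl):       # conceptually parallel
            k = starts[vp] + step
            if k >= ends[vp]:      # predicated off: interval exhausted
                continue
            while k >= row_pointers[rows[vp] + 1]:  # row transition
                y[rows[vp]] += acc[vp]
                acc[vp] = 0
                rows[vp] += 1
            acc[vp] += values[k] * x[column_indices[k]]
    for vp in range(vl):           # flush the remaining partial sums
        y[rows[vp]] += acc[vp]
    return y


row_pointers = [0, 1, 2, 2, 6, 6, 7, 8, 10]
column_indices = [1, 1, 0, 2, 6, 7, 7, 6, 1, 7]
values = [81, 5, 61, 9, 34, 11, 42, 17, 92, 70]

print(spmv_loop_raking(row_pointers, column_indices, values, [1] * 8, vl=4))
# [81, 5, 0, 115, 0, 42, 17, 162]
```

Every virtual processor gets the same number of nonzeros regardless of row lengths, so no re-packing is needed; the cost moves into tracking row boundaries.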
Figure: loop raking with four virtual processors (vp1, vp2, vp3, vp4) over the row_indices, row_ptrs, column_indices, and values arrays.
Evaluation Method – Software Stack

- **GraphMat**
  - High-performance parallel graph processing framework
  - Vertex-programming front-end interface mapped to linear algebra backend
  - Uses DCSC/DCSR data-structures
  - Parallelism using OpenMP and MPI
  - Used in other architecture graph processing evaluations

- **OpenMP**
  - Common shared-memory multi-threaded parallel programming model
  - Scalable programming model for multi-processors
  - Compile-time and run-time features
  - Used for outer-level thread parallelism
Rocket Chip SoC generator
- Configurable SoC parameters such as L2 cache size and number of processor tiles
- Real RTL – conclusions directly reflect on test chips and real silicon

FireSim – cycle-exact FPGA-accelerated simulation on the public cloud

Why FireSim and Rocket Chip?
- Full OpenMP and Linux software stack
- Vector architectures require detailed µarch
- DDR Memory models – important for sparse data-structures
- Real RTL – conclusions directly reflect on test chips and real silicon
Design Space Exploration

- 12 SoC configurations

### Table: SoC Configurations

<table>
<thead>
<tr>
<th>Name</th>
<th>Tiles</th>
<th>Vector Lanes / Tile</th>
<th>L2 Cache Size</th>
</tr>
</thead>
<tbody>
<tr>
<td>T1L1C512</td>
<td>1</td>
<td>1</td>
<td>512</td>
</tr>
<tr>
<td>T1L1C1024</td>
<td>1</td>
<td>1</td>
<td>1024</td>
</tr>
<tr>
<td>T1L1C2048</td>
<td>1</td>
<td>1</td>
<td>2048</td>
</tr>
<tr>
<td>T1L2C512</td>
<td>1</td>
<td>2</td>
<td>512</td>
</tr>
<tr>
<td>T1L2C1024</td>
<td>1</td>
<td>2</td>
<td>1024</td>
</tr>
<tr>
<td>T1L2C2048</td>
<td>1</td>
<td>2</td>
<td>2048</td>
</tr>
<tr>
<td>T2L1C512</td>
<td>2</td>
<td>1</td>
<td>512</td>
</tr>
<tr>
<td>T2L1C1024</td>
<td>2</td>
<td>1</td>
<td>1024</td>
</tr>
<tr>
<td>T2L1C2048</td>
<td>2</td>
<td>1</td>
<td>2048</td>
</tr>
<tr>
<td>T2L2C512</td>
<td>2</td>
<td>2</td>
<td>512</td>
</tr>
<tr>
<td>T2L2C1024</td>
<td>2</td>
<td>2</td>
<td>1024</td>
</tr>
<tr>
<td>T2L2C2048</td>
<td>2</td>
<td>2</td>
<td>2048</td>
</tr>
</tbody>
</table>
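
The configuration names follow a regular pattern, so the sweep can be enumerated programmatically. A small sketch; the dictionary keys are illustrative names, not parameters of the Rocket Chip generator.

```python
from itertools import product

TILES = (1, 2)
LANES_PER_TILE = (1, 2)
L2_CACHE_SIZES = (512, 1024, 2048)

# One entry per point in the hardware design space, named T<tiles>L<lanes>C<cache>.
configs = [
    {
        "name": f"T{tiles}L{lanes}C{cache}",
        "tiles": tiles,
        "lanes_per_tile": lanes,
        "l2_cache_size": cache,
    }
    for tiles, lanes, cache in product(TILES, LANES_PER_TILE, L2_CACHE_SIZES)
]

for config in configs:
    print(config["name"], config["tiles"] * config["lanes_per_tile"], "lanes total")
```

Listing the total lane count makes it easy to spot pairs such as T1L2 and T2L1, which have the same number of lanes but split them differently between the two parallelism levels.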
Design Space Exploration
● **DCSR Partition Factor**
  ○ Affects granularity of task-level parallelism
  ○ Many tasks/partitions can result in shorter vector lengths for the inner parallelism level
  ○ \( \text{num\_DCSR\_partitions} = \text{num\_hardware\_threads} \times \text{DCSR\_partition\_factor} \)

● **Graphs**
  ○ Three graphs from the Stanford Network Analysis Project (SNAP)

### Software parameters

<table>
<thead>
<tr>
<th>Name</th>
<th>Vertices</th>
<th>Edges</th>
<th>Size</th>
</tr>
</thead>
<tbody>
<tr>
<td>wikiVote</td>
<td>7115</td>
<td>103689</td>
<td>433 KB</td>
</tr>
<tr>
<td>roadNet-CA</td>
<td>1965206</td>
<td>2766607</td>
<td>18 MB</td>
</tr>
<tr>
<td>amazon0302</td>
<td>262111</td>
<td>1234877</td>
<td>5.7 MB</td>
</tr>
</tbody>
</table>
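
To get a feel for how the partition factor interacts with graph size, the partition-count formula can be applied to the vertex counts in the table. A back-of-the-envelope sketch: the thread count and partition factors below are hypothetical example values, and the average assumes vertices are spread evenly across partitions.

```python
GRAPH_VERTICES = {
    "wikiVote": 7115,
    "roadNet-CA": 1965206,
    "amazon0302": 262111,
}


def num_dcsr_partitions(num_hardware_threads, dcsr_partition_factor):
    """num_DCSR_partitions = num_hardware_threads x DCSR_partition_factor."""
    return num_hardware_threads * dcsr_partition_factor


def avg_vertices_per_partition(graph, num_hardware_threads, dcsr_partition_factor):
    """Average vertices per partition, assuming an even spread."""
    partitions = num_dcsr_partitions(num_hardware_threads, dcsr_partition_factor)
    return GRAPH_VERTICES[graph] / partitions


# Hypothetical example: 2 hardware threads, partition factors 1 through 16.
for factor in (1, 4, 16):
    print(factor, avg_vertices_per_partition("wikiVote", 2, factor))
```

For a small graph the per-partition vertex count drops quickly as the factor grows, which leaves less work for the inner data-parallel level in each task.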
L2 Cache Size

- L2 cache size does not have an impact on performance
  - Typical of graph workloads with irregular memory accesses
  - Exception: graphs that fit completely within the cache. The wikiVote graph fits in the L2 cache, so it shows a significantly higher speedup
Scaling and Absolute Speedup

- Absolute speedup compared to minimal scalar hardware config
- Multiple tiles present near-linear scaling
- Multi-tile, single-lane as an efficient design point
  - Single vector lane provides significant speedup (greater than the additional 4 ops/cycle)
  - Additional vector lanes (>1) demonstrate smaller overall absolute speedups
Tiles vs. Vector Lanes

- Relative Speedup compared to the parallel-scalar implementation on the same hardware configuration.
- The single-tile, dual-lane configuration presents a higher relative speedup than the dual-tile, single-lane configuration, even though they have the same overall number of lanes.
  - Multi-lane designs have an added benefit in conjunction with multi-core designs.
Loop raking can outperform packed stripmining in all tested hardware configurations, depending on software parameter configuration.

- Packed stripmining pays a re-packing overhead.
Better performance with higher DCSR partition factors
  - Finer grained load-balancing
  - Exception: small wikiVote graph, due to shorter vector lengths and overhead
Bigger graphs present smaller absolute speedups
  - wikiVote > amazon0302 > roadNet-CA

Small graph effects (wikiVote)
  - Fitting fully in L2 cache can more than double the speedup
  - Vector unit utilization in PageRank depends on the number of vertices with outgoing edges
    - As opposed to overall graph size
    - wikiVote has about 7,000 vertices (enough to keep the vector unit utilized with a high partition factor), but only about 2,300 vertices with outgoing edges.

Tested graphs were not significantly scale-free
  - No observed power-law graph effects
Conclusions

- Software/Hardware design space exploration
  - Full Linux-based parallel programming software stack
  - Open-source, silicon-proven hardware
- 4x-25x absolute speedup, 2x-14x vectorized relative speedup
- Loop raking is a better technique than packed-stripmining
- Higher DCSR partition factors => better load balancing
  - Assuming the graph is big enough
- Multi-tile, single vector lane configuration as an efficient design point
Acknowledgments

- Colin Schmidt
- The information, data, or work presented herein was funded in part by the Advanced Research Projects Agency-Energy (ARPA-E), U.S. Department of Energy, under Award Number DE-AR0000849. Research was partially funded by ADEPT Lab industrial sponsor Intel, under the Agile ISTC, and ADEPT Lab affiliates Google, Siemens, and SK Hynix. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.
Questions/Comments