# NeuralScale: A RISC-V Based Neural Processor Boosting AI Inference in Clouds

Stream Computing Inc. (希姆计算), Mark Zhan (詹荣开), 2020-6-7



A New Golden Age for Computer Architecture: Domain-Specific Hardware/Software Co-Design, Enhanced Security, Open Instruction Sets, and Agile Chip Development

- The Computation of AI Workloads: Massive Parallelism and Huge Memory Access Demands
- Deep Learning Theory Evolution: New Operators or Activation Functions
- Operator Customization Needs in Real AI Applications
- Not All Operators Are Compute-Intensive









As a DSA-native ISA design, RISC-V is the best choice for a general-purpose NPU, balancing programmability and performance:

- Turing-Complete Base ISA: Enables C-Language Programming for the NPU
- V-Extension Vector Instructions: Basis for AI Operator Implementation
- Self-Defined Instruction Extensions: Matrix/Conv Acceleration
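To make the role of the vector extension concrete, here is a minimal sketch of the strip-mining pattern that RISC-V vector code typically follows: each pass asks the hardware how many elements it may process, then handles that many. The `vsetvl` helper is a simplified Python model of the instruction's behavior, and the `vlmax` value is an assumption for illustration, not a P920 parameter.

```python
def vsetvl(remaining: int, vlmax: int) -> int:
    """Simplified model of RVV vsetvl: grant at most VLMAX elements."""
    return min(remaining, vlmax)


def vector_add(a, b, vlmax=8):
    """Strip-mined element-wise add.

    Each loop iteration stands for one pass of vector load/add/store
    instructions operating on `vl` elements. `vlmax` is a hypothetical
    hardware vector length.
    """
    if len(a) != len(b):
        raise ValueError("operands must have the same length")
    out = []
    i, n = 0, len(a)
    while i < n:
        vl = vsetvl(n - i, vlmax)            # elements handled this pass
        out.extend(x + y for x, y in zip(a[i:i + vl], b[i:i + vl]))
        i += vl                              # advance by the granted length
    return out


result = vector_add(list(range(10)), [1] * 10, vlmax=4)
```

The same loop works for any array length, because the final pass simply receives a shorter `vl`; this is what lets one operator binary run on implementations with different vector lengths.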





- Scalar Processor Core:
  - RV32GC (IMAFDC) Instruction Set
  - IEEE-754 Compatible Single-Precision FPU
- Vector Exec Unit
  - RISC-V V-Spec Vector ISA w/ FP16 & INT8 data types
  - Extended V-Spec Instructions for sqrt/exp/log, post-increment, etc.
- Matrix Exec Unit
  - Extended GEMM Instructions
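As a rough illustration of how a matrix unit with GEMM instructions is usually driven, the sketch below breaks a matrix multiply into small tiles and accumulates one tile product at a time. The `gemm_tile` function is a hypothetical stand-in for a single GEMM instruction, and the tile size of 2 is an assumption; the slides do not state the actual instruction semantics or tile shape.

```python
def gemm_tile(c_tile, a_tile, b_tile):
    """Hypothetical stand-in for one GEMM instruction: C_tile += A_tile @ B_tile."""
    for i in range(len(a_tile)):
        for j in range(len(b_tile[0])):
            c_tile[i][j] += sum(
                a_tile[i][k] * b_tile[k][j] for k in range(len(b_tile))
            )


def tiled_matmul(a, b, tile=2):
    """Multiply a (m x k) by b (k x n) one tile at a time."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, k, tile):
                # Slice out the operand tiles (edge tiles may be smaller).
                a_t = [row[k0:k0 + tile] for row in a[i0:i0 + tile]]
                b_t = [row[j0:j0 + tile] for row in b[k0:k0 + tile]]
                c_t = [[0] * len(b_t[0]) for _ in a_t]
                gemm_tile(c_t, a_t, b_t)
                # Accumulate the partial product into the output.
                for di, row in enumerate(c_t):
                    for dj, v in enumerate(row):
                        c[i0 + di][j0 + dj] += v
    return c


product = tiled_matmul([[1, 2, 3], [4, 5, 6]], [[7, 8], [9, 10], [11, 12]])
```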











| P920 NPU Feature        | Value               |
|-------------------------|---------------------|
| Chip Area               | ~400 mm<sup>2</sup> |
| Process                 | TSMC 12nm FinFET    |
| GPNPC Cores             | x32                 |
| Peak Performance (FP16) | 128 TFLOPS          |
| Peak Performance (INT8) | 256 TOPS            |
| TDP                     | 130 W               |
| Memory                  | 16 GB LPDDR4        |
| Host IF                 | PCIe Gen4 x16       |
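The table's headline numbers can be broken down per core and per watt with simple arithmetic. The sketch below uses only the values from the table; the derived figures are peak-over-TDP upper bounds, not measured results, and they assume peak throughput is split evenly across the 32 cores.

```python
# Values taken directly from the P920 feature table.
CORES = 32
PEAK_FP16_TFLOPS = 128
PEAK_INT8_TOPS = 256
TDP_WATTS = 130

# Assumption: peak throughput divides evenly across cores.
fp16_per_core = PEAK_FP16_TFLOPS / CORES      # TFLOPS per core
int8_per_core = PEAK_INT8_TOPS / CORES        # TOPS per core

# Peak throughput divided by TDP: an upper bound on efficiency.
fp16_per_watt = PEAK_FP16_TFLOPS / TDP_WATTS  # TFLOPS per watt
int8_per_watt = PEAK_INT8_TOPS / TDP_WATTS    # TOPS per watt

print(f"FP16: {fp16_per_core:.1f} TFLOPS/core, {fp16_per_watt:.2f} TFLOPS/W")
print(f"INT8: {int8_per_core:.1f} TOPS/core, {int8_per_watt:.2f} TOPS/W")
```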



## P920 NPU: Stream Computing 1st Gen NPU Product for AI Inference





[Figure: P920 performance comparison charts. Panel title: BERT (FP16) Performance Comparison. Axes: throughput (images/second, sentences/second), latency (ms), power efficiency (images/second/watt, sentences/second/watt).]

Stream Computing Inc. Private and Confidential





Critical Latency Performance for Real-Time AI Inference



Power Efficiency: Balanced Design for Various AI Domain NN Models

Use Case: A customer has a pre-trained NN model and wants to deploy it on the STC P920 NPU to run AI inference.



TensorTurbo<sup>™</sup> Engine

- Model Zoo: Pre-Optimized Models for STC P920 NPU
- Graph Compiler: TVM-based, deeply customized for NeuralScale architecture.
- Heterogeneous Program Engine (HPE):
  - C/C++ Level Heterogeneous Computing Kernel Programs



### TensorTurbo: Neural Network Compilation

Fundamental Challenge of AI Compilation: scheduling and optimizing the AI program so that hardware resource utilization is maximized and program execution time is minimized.
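One way to see why scheduling matters is to compare running data movement and compute back to back against overlapping them. The sketch below is a toy timing model, assuming one DMA engine, one compute unit, and two on-chip buffers; the per-tile load and compute times are made-up numbers, not P920 measurements.

```python
def serial_time(load, compute):
    """Every tile is loaded, then computed, with no overlap."""
    return sum(load) + sum(compute)


def overlapped_time(load, compute):
    """Two-buffer pipeline: loading tile i+1 overlaps computing tile i.

    A load may start once the DMA engine is free and the target buffer
    (used two tiles earlier) has been consumed. A compute step may start
    once its tile is loaded and the compute unit is free.
    """
    load_done, comp_done = [], []
    for i, (l, c) in enumerate(zip(load, compute)):
        dma_free = load_done[i - 1] if i >= 1 else 0
        buf_free = comp_done[i - 2] if i >= 2 else 0
        load_done.append(max(dma_free, buf_free) + l)
        unit_free = comp_done[i - 1] if i >= 1 else 0
        comp_done.append(max(load_done[i], unit_free) + c)
    return comp_done[-1] if comp_done else 0


# Hypothetical per-tile costs in arbitrary time units.
loads = [2, 2, 2, 2]
computes = [3, 3, 3, 3]
t_serial = serial_time(loads, computes)
t_overlap = overlapped_time(loads, computes)
```

With these numbers the compute unit is busy for all but the first load, so total time approaches the sum of the compute times alone.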








TensorTurbo TVM-based Graph Compiler

- Graph Schedule: Split the input feature map along the batch dimension so that intermediate data stays resident in the L1 buffer
  - Heuristic automatic graph scheduling
  - A set of graph schedule APIs lets users manually split feature map data along the batch dimension
- Tiling Strategies within an Operator:
  - Heuristic Templates
  - Auto Tiling
- Operator Schedule:
  - OpSchedule Template: Reduces the effort of TVM IR-based operator development
- Backend Optimization Passes:
  - VME/MME Instruction Scheduling, DMA Scheduling: Exploit ILP
  - Automatic Sync Insertion
  - Double Buffering
  - On-Chip Buffer Allocation / Bank Conflicts
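The batch-dimension split in the graph schedule can be sketched as a simple capacity calculation: pick the largest sub-batch whose intermediate feature maps fit in the L1 buffer, then process the batch in chunks of that size. The function name, the buffer capacity, and the per-sample footprint below are all assumptions for illustration; the real heuristics in TensorTurbo are not described in the slides.

```python
def split_batch(batch, bytes_per_sample, l1_bytes):
    """Return sub-batch sizes whose intermediate data fits in L1.

    batch            -- total number of samples
    bytes_per_sample -- intermediate feature-map footprint of one sample
    l1_bytes         -- hypothetical L1 buffer capacity
    """
    per_chunk = l1_bytes // bytes_per_sample
    if per_chunk == 0:
        # A single sample is too large; splitting the batch cannot help,
        # so tiling inside the operator would be needed instead.
        raise ValueError("one sample does not fit in the L1 buffer")
    chunks = []
    remaining = batch
    while remaining > 0:
        n = min(per_chunk, remaining)
        chunks.append(n)
        remaining -= n
    return chunks


# Hypothetical numbers: 1 MiB of L1, 300 KiB of intermediates per sample.
chunks = split_batch(batch=32, bytes_per_sample=300 * 1024, l1_bytes=1024 * 1024)
```

Each chunk can then flow through several consecutive operators without spilling intermediates to external memory, which is the point of keeping them L1-resident.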



### User-Space HAL provides CUDA-style Runtime APIs:

- Device Management API
- Kernel Launch & Management
- Device Memory Management
- Host/Device Memory Movement
- Stream Programming APIs
- Event Management across multiple streams
- Utilities:
  - stcGDB: Use GDB to debug a heterogeneous program
  - stcProf: Program Performance Profiling Tool
  - stcSMI: System Monitor Interface Tool
- NPU Firmware:
  - Manages the computing kernels to be launched on the NPU device
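The typical call sequence of a CUDA-style runtime (allocate device memory, copy inputs in, launch a kernel on a stream, synchronize, copy results out) can be sketched with an in-memory mock. Every class and method name below is hypothetical and exists only for this illustration; none of it is the actual STC HAL API, and event management is omitted for brevity.

```python
class MockDevice:
    """Hypothetical in-memory stand-in for an NPU device (not the STC HAL)."""

    def __init__(self):
        self._buffers = {}
        self._next_handle = 1

    def malloc(self, num_elems):
        """Device memory management: allocate and return an opaque handle."""
        handle = self._next_handle
        self._next_handle += 1
        self._buffers[handle] = [0] * num_elems
        return handle

    def free(self, handle):
        del self._buffers[handle]

    def memcpy_h2d(self, dst, host_data):
        """Host-to-device copy."""
        buf = self._buffers[dst]
        if len(host_data) > len(buf):
            raise ValueError("host data larger than device buffer")
        buf[:len(host_data)] = host_data

    def memcpy_d2h(self, src):
        """Device-to-host copy; returns a host-side list."""
        return list(self._buffers[src])

    def buffer(self, handle):
        """Lets mock kernels reach device memory."""
        return self._buffers[handle]


class MockStream:
    """Ordered queue: launches are deferred and run in order at synchronize()."""

    def __init__(self):
        self._pending = []

    def launch(self, kernel, *args):
        self._pending.append((kernel, args))

    def synchronize(self):
        for kernel, args in self._pending:
            kernel(*args)
        self._pending.clear()


def relu_kernel(dev, src, dst):
    """Mock compute kernel: element-wise ReLU from src into dst."""
    out = dev.buffer(dst)
    for i, v in enumerate(dev.buffer(src)):
        out[i] = max(v, 0)


dev = MockDevice()
stream = MockStream()
d_in, d_out = dev.malloc(4), dev.malloc(4)
dev.memcpy_h2d(d_in, [-2, -1, 3, 5])
stream.launch(relu_kernel, dev, d_in, d_out)  # queued, not yet executed
stream.synchronize()                          # run all queued work
host_result = dev.memcpy_d2h(d_out)
dev.free(d_in)
dev.free(d_out)
```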







- Plan to support ~30 optimized NN models in total in CY2021 (2 + 8 + 20):
  - Image Classification
  - Object Detection & Segmentation
  - Video Enhancement
  - Speech Recognition
  - NLP

| Target  | Models |
|---------|--------|
| 2021/5  | ResNet-50, BERT |
| 2021/8  | YOLO, SSD, Mask R-CNN, Faster R-CNN, TDNN+LSTM, Super Resolution, RNN/LSTM, ResNet-101 |
| 2021/12 | Transformer, ResNet-34, Google Inception, VGG16, GoogLeNet, AlexNet, ResNeXt-50, DenseNet, SqueezeNet, Inception-ResNet, MobileNet, GNMT, GCN, GAT, FCN, DeepFM, DeepSpeech2, Wide&Deep, TBD, TBD |



## Stream Computing P920 NPU Competitive Advantages Summary





#### NeuralScale<sup>™</sup>: Advanced RISC-V Based NPC Architecture

- Good Flexibility
- Good Scalability: one NPC architecture fits both inference and training, from cloud to edge
- Good Programmability: Traditional C programming paradigm

Extreme Cost-Effectiveness for AI Computation

Significantly Reduces the TCO of AI Inference Servers

- High Throughput: ResNet-50 & BERT
- Low Latency: Satisfies latency-sensitive real-time AI applications
- Low Power: 130W TDP
- Price: Market-competitive pricing

### Advanced E2E Neural Network Toolchain

TensorTurbo: Industry-Leading E2E NN Toolchain

- TVM-Based Graph Compiler: Deep NN compilation optimization maximizes performance
- Model Zoo: Pre-optimized models let you get started quickly
- Operator Programming IF: Customize layers/operators

## Thanks

Q&A

