## Accelerating Persistent Neural Networks at Datacenter Scale

Eric Chung, Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, Adrian Caulfield, Todd Massengil, Ming Liu, Daniel Lo, Shlomi Alkalay, Michael Haselman, Christian Boehn, Oren Firestein, Alessandro Forin, Kang Su Gatlin, Mahdi Ghandi, Stephen Heil, Kyle Holohan, Tamas Juhasz, Ratna Kumar Kovvuri, Sitaram Lanka, Friedel van Megen, Dima Mukhortov, Prerak Patel, Steve Reinhardt, Adam Sapek, Raja Seera, Balaji Sridharan, Lisa Woods, Phillip Yi-Xiao, Ritchie Zhao, Doug Burger

# The Rise of Deep Learning in ML

## Deep neural networks have enabled major advances in machine learning and AI

- Computer vision
- Language translation
- Speech recognition
- Question answering
- And more...

## Problem: DNNs are challenging to serve and deploy in large-scale online services

- Heavily constrained by latency, cost, and power
- Size and complexity of DNNs are outpacing the growth of commodity CPUs

Figure panels: **Recurrent Neural Networks**; **Convolutional Neural Networks**

#### Silicon alternatives for DNNs

Figure: silicon alternatives plotted along a FLEXIBILITY axis, with two labeled groups of examples:

- MS BrainWave, Baidu SDA, Deephi Tech, ESE, Teradeep, etc.
- Cerebras, Google TPU, Graphcore, Groq, Intel Nervana, Movidius, Wave Computing, etc.

# The power of Deep Learning on FPGA

#### **Performance**

- Excellent inference performance at low batch sizes
- Ultra-low latency serving on modern DNNs
  - More than 10X lower latency than CPUs and GPUs
- Scale to many FPGAs in a single DNN service

#### **Flexibility**

- FPGAs are ideal for adapting to rapidly evolving ML
  - CNNs, LSTMs, MLPs, reinforcement learning, feature extraction, decision trees, etc.
- Inference-optimized numerical precision
- Exploit sparsity and deep compression for larger, faster models

#### Scale

- Microsoft has the world's largest cloud investment in FPGAs
- Multiple Exa-Ops of aggregate AI capacity
- BrainWave runs on Microsoft's scale infrastructure

## Project BrainWave

#### A Scalable FPGA-powered DNN Serving Platform

- **Fast:** ultra-low latency, high-throughput serving of DNN models at low batch sizes
- **Flexible:** adaptive numerical precision and custom operators
- **Friendly:** turnkey deployment of CNTK/Caffe/TF/etc.

Figure labels: Pretrained DNN Model in CNTK, etc.; Scalable DNN Hardware Microservice; BrainWave Soft DPU

## Runs on a Configurable Cloud at Massive Scale

Figure labels: CPU compute layer; Reconfigurable compute layer (FPGA); Converged network

### Deployed in Production Datacenters

- Deployment of an LSTM-based NLP model (tens of millions of parameters)
- Takes tens of milliseconds to serve on well-tuned CPU implementations
- Tail latencies of BrainWave-powered DNN models appear negligible in E2E software pipelines

Figure: stack overview. BrainWave System: Compiler & Runtime, Architecture, Microarchitecture, Persistency at Scale. General Infrastructure: **HW Microservices** on Intel FPGAs.

- **Compiler & Runtime:** A
framework-neutral federated compiler and runtime for compiling pretrained DNN models to soft DPUs
- **Architecture:** adaptive ISA for narrow precision DNN inference; flexible and extensible to support fast-changing AI algorithms
- **Microarchitecture:** BrainWave Soft DPU microarchitecture, highly optimized for narrow precision and low batch
- **Persistency at Scale:** persist model parameters entirely in FPGA on-chip memories; support large models by scaling across many FPGAs
- **HW Microservices on Intel FPGAs:** Intel FPGAs deployed at scale with HW microservices [MICRO'16]

### The BrainWave Stack

Compiler & Runtime, Architecture, Microarchitecture, Persistency at Scale

#### FPGAs Are Deployed in MSFT Servers Worldwide

Figure labels: Catapult v2 Mezzanine card; WCS Gen4.1 Blade with NIC and Catapult FPGA

## Hardware Microservices on FPGAs [MICRO'16]

## BrainWave Compiler & Runtime

#### Common Scenarios

- Convolutional Neural Network (CNN): high compute-to-data ratio
- MLPs, LSTMs, GRUs: **low compute-to-data ratio**

## Conventional Acceleration Approach: Local Offload and Streaming

Figure labels: Initialized in DRAM; 2xCPU

For memory-intensive DNNs with low compute-to-data ratios (e.g., LSTM), HW utilization is limited by
off-chip DRAM bandwidth.

Figure label: **Model Parameters**

## Improving HW utilization with batching

- Batching improves HW utilization but increases latency
- Ideally want high HW utilization at low batch sizes

#### **Observations**

- State-of-the-art FPGAs have O(10K) distributed block RAMs totaling O(10 MB), which gives tens of TB/s of memory bandwidth
- Large-scale cloud services and DNN models run persistently

Solution: persist all model parameters in FPGA on-chip memory during the service lifetime

When a single request arrives, all chip resources (on-chip memories and compute units) are used to process that single query (no batching required)

# What if model doesn't fit in single FPGA?

## Solution: Persistency at Datacenter Scale

Multiple FPGAs at datacenter scale can form a persistent DNN HW microservice, enabling scale-out of models at ultra-low latencies

## Inter-Layer Pipeline Parallelism

$$\begin{aligned}
f_t &= \sigma_g\big(W_f x_t + U_f h_{t-1} + b_f\big) \\
i_t &= \sigma_g\big(W_i x_t + U_i h_{t-1} + b_i\big) \\
o_t &= \sigma_g\big(W_o x_t + U_o h_{t-1} + b_o\big) \\
c_t &= f_t \circ c_{t-1} + i_t \circ \sigma_c\big(W_c x_t + U_c h_{t-1} + b_c\big) \\
h_t &= o_t \circ \sigma_h(c_t)
\end{aligned}$$

- FPGAs communicate directly at < 2 µs/hop
- DNN HWMS shared by all CPUs

## Intra-Layer Parallelism

- Single dense matrix
- FPGAs communicate directly at < 2 µs/hop
- DNN HWMS shared by all CPUs

## BrainWave Soft DPU Architecture

#### **Core Features**

- Single-threaded C programming model (no RTL)
- ISA with specialized instructions: dense matmul, convolutions, non-linear activations, vector operations, embeddings
- Proprietary parameterizable narrow precision format wrapped in float16
interfaces
- Parameterizable microarchitecture, scalable to large FPGAs (~1M ALMs)
- Fully integrated with HW microservices (network-attached)
- P2P protocol to CPU hosts and FPGAs
- Easy to extend ISA with custom operators

## BrainWave Soft DPU Microarchitecture

## Matrix Vector Unit

#### **Features**

- Optimized for batch 1 matrix-vector multiplication
- Matrices distributed row-wise across 1K-10K banks of BRAM, up to 20 TB/s
- Can scale to use all available on-chip BRAMs, DSPs, and soft logic
- In-situ conversion of float16 weights and activations to internal format
- Dense dot product units map efficiently to soft logic and DSPs

Figure panels: **FPGA Performance vs. Data Type**; **Impact of Narrow Precision on Accuracy**

## BrainWave Soft DPU Performance

#### **Single FPGA BrainWave Soft DPU Performance**

| | Arria 10 1150 (20nm) | Stratix 10 280 Early Silicon (14nm) |
|---|---|---|
| Numerical format | ms-fp9 | ms-fp9 |
| ALMs | 316K (74%) | 858K (92%) |
| DSPs | 1,442 (95%) | 5,760 (100%) |
| M20Ks | 2,564 (95%) | 8,151 (70%) |
| Efficiency | 160 GOPS/W | 320 GOPS/W → 720 GOPS/W (production) |

Figure: BrainWave Soft DPU floorplan on Stratix 10 280

### Conclusion

## Microsoft BrainWave is a powerful platform for an accelerated AI cloud

- Runs on Microsoft's hyperscale infrastructure with FPGAs
- Achieves excellent performance at low batch sizes via persistency and narrow precision
- Adaptable to precision and changes in future AI algorithms

# BrainWave running on Hardware Microservices will push the boundary of what is possible to deploy in the cloud

- Deeper/larger CNNs for more accurate computer vision
- Higher dimensional RNNs toward human-like natural language processing
- State-of-the-art speech
- And much more...

Stay tuned for announcements about external availability.

# Thank you!
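
The LSTM equations on the Inter-Layer Pipeline Parallelism slide, and the point that batch-1 serving has a low compute-to-data ratio, can be sketched in plain Python. This is an illustrative sketch and not BrainWave code. The helper names (`matvec`, `lstm_step`, `ops_per_weight`, `bandwidth_bound_ops`) are hypothetical, `bytes_per_weight` is an assumed parameter, and using a logistic sigmoid for σ_g and tanh for σ_c and σ_h follows the common LSTM convention rather than anything the slides state.

```python
import math


def sigmoid(x):
    """Logistic function, assumed here for the gate activation sigma_g."""
    return 1.0 / (1.0 + math.exp(-x))


def matvec(m, v):
    """Dense matrix-vector product on plain Python lists (batch size 1)."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lstm_step(params, x_t, h_prev, c_prev):
    """One LSTM time step following the slide's five equations.

    params maps each gate name in "f", "i", "o", "c" to a (W, U, b) triple.
    """
    def pre_activation(gate):
        W, U, b = params[gate]
        wx = matvec(W, x_t)
        uh = matvec(U, h_prev)
        return [a + r + bias for a, r, bias in zip(wx, uh, b)]

    f = [sigmoid(z) for z in pre_activation("f")]
    i = [sigmoid(z) for z in pre_activation("i")]
    o = [sigmoid(z) for z in pre_activation("o")]
    g = [math.tanh(z) for z in pre_activation("c")]  # sigma_c assumed tanh

    # c_t = f_t * c_{t-1} + i_t * g ; h_t = o_t * tanh(c_t), all element-wise
    c_t = [fv * cp + iv * gv for fv, cp, iv, gv in zip(f, c_prev, i, g)]
    h_t = [ov * math.tanh(cv) for ov, cv in zip(o, c_t)]
    return h_t, c_t


def ops_per_weight(batch):
    """Ops per weight fetched, counting one multiply-accumulate as 2 ops
    and assuming each weight is read once and reused across the batch."""
    return 2 * batch


def bandwidth_bound_ops(bandwidth_bytes_per_s, bytes_per_weight, batch=1):
    """Upper bound on ops/s when weight fetches are the only bottleneck."""
    weights_per_s = bandwidth_bytes_per_s / bytes_per_weight
    return weights_per_s * ops_per_weight(batch)
```

Under this simple model each weight fetched at batch 1 feeds a single multiply-accumulate, so the achievable rate scales directly with memory bandwidth. That is the motivation the slides give for keeping parameters in on-chip memory instead of raising the batch size.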