Learn how Microsoft's Project Brainwave is solving the challenges of real-time AI and accelerating the AI technologies powering Bing and Azure with Intel® Stratix® 10 FPGAs and state-of-the-art DNN models.
To meet the computational demands of deep learning, cloud operators are turning toward specialized hardware for improved efficiency and performance. Project Brainwave, Microsoft's principal infrastructure for real-time AI serving, accelerates deep neural network (DNN) inferencing in major services such as Bing's intelligent search features and Azure. By exploiting distributed model parallelism and pinning over low-latency hardware microservices, Project Brainwave serves state-of-the-art, pre-trained DNN models with high efficiency at low batch sizes. At the heart of the system is a high-performance, precision-adaptable FPGA soft processor that achieves up to 39.5 TFLOPs of effective performance at Batch 1 on a state-of-the-art Intel® Stratix® 10 FPGA.
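To see why pinning matters at low batch sizes, consider a toy roofline-style model, offered here only as a sketch. It assumes that weights which are not pinned must be streamed from off-chip memory once per batch, and that loading overlaps perfectly with compute. The function name, the layer shape, and every hardware number below are hypothetical illustrations, not Project Brainwave measurements.

```python
def batch_utilization(batch_size, flops_per_sample, weight_bytes,
                      peak_flops, offchip_bandwidth, weights_pinned):
    """Fraction of peak compute achieved for one batch in a toy roofline model."""
    compute_time = batch_size * flops_per_sample / peak_flops
    # Streamed weights are fetched once per batch; pinned weights are already on chip.
    load_time = 0.0 if weights_pinned else weight_bytes / offchip_bandwidth
    # Assumption: loads and compute overlap perfectly, so the slower one sets the pace.
    return compute_time / max(compute_time, load_time)


# Hypothetical dense layer and hardware figures (NOT taken from the source).
ROWS = COLS = 2048
FLOPS_PER_SAMPLE = 2 * ROWS * COLS   # one multiply and one add per weight
WEIGHT_BYTES = 2 * ROWS * COLS       # assuming 2 bytes per weight
PEAK_FLOPS = 10e12                   # assumed 10 TFLOPS peak
OFFCHIP_BANDWIDTH = 100e9            # assumed 100 GB/s off-chip bandwidth

for batch in (1, 10, 100):
    streamed = batch_utilization(batch, FLOPS_PER_SAMPLE, WEIGHT_BYTES,
                                 PEAK_FLOPS, OFFCHIP_BANDWIDTH, weights_pinned=False)
    pinned = batch_utilization(batch, FLOPS_PER_SAMPLE, WEIGHT_BYTES,
                               PEAK_FLOPS, OFFCHIP_BANDWIDTH, weights_pinned=True)
    print(f"batch={batch:>3}  streamed={streamed:.2f}  pinned={pinned:.2f}")
```

In this simplified model, streamed weights leave the compute units mostly idle at Batch 1, while pinned weights remove the off-chip bottleneck regardless of batch size. Real systems have many other limits that the model ignores.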
Hastened by the escalating demand for deep learning, the march toward ubiquitous specialized hardware for AI is well underway. Companies, startups, and research groups are pioneering many approaches, ranging from GPGPUs to NPUs. Project Brainwave, Microsoft's principal infrastructure for accelerated AI serving in the cloud, successfully exploits FPGAs on a datacenter-scale fabric for real-time serving of state-of-the-art DNNs.
Project Brainwave offers three key lessons. First, designing a scalable, end-to-end system architecture for deep learning is as critical as optimizing single-chip performance. In Brainwave, the system and the soft NPU are co-architected for each other, exploiting datacenter-scale pinning of models in on-chip memories that scale elastically beyond single-chip solutions. Second, narrow-precision quantization is a viable approach for production DNN models. It enables the Project Brainwave system to achieve performance and energy efficiency (720 GOPs/W on an Intel® Stratix® 10 FPGA 280) competitive with hard NPUs that use standard precisions, without degrading accuracy. Third, by using configurable hardware at scale, a system can be designed without an adversarial tradeoff between latency and throughput (batching). Brainwave is able to serve models at ultra-low latency at Batch 1 without compromising on throughput and efficiency.
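The idea behind narrow-precision quantization can be illustrated with a minimal sketch. The text above does not specify Brainwave's numeric format, so this example swaps in a generic scheme: each block of values is stored as small signed integers that share a single scale factor. The function names and sample weights are hypothetical.

```python
from typing import List, Tuple


def quantize_block(values: List[float], bits: int) -> Tuple[List[int], float]:
    """Map a block of floats to signed `bits`-bit integer codes with one shared scale."""
    if bits < 2:
        raise ValueError("bits must be at least 2")
    qmax = (1 << (bits - 1)) - 1                 # largest code, e.g. 127 for 8 bits
    peak = max((abs(v) for v in values), default=0.0)
    scale = peak / qmax if peak > 0 else 1.0     # shared scale for the whole block
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize_block(codes: List[int], scale: float) -> List[float]:
    """Recover approximate floats from integer codes and the shared scale."""
    return [c * scale for c in codes]


def max_abs_error(values: List[float], bits: int) -> float:
    """Largest round-trip error introduced by quantizing at the given width."""
    codes, scale = quantize_block(values, bits)
    restored = dequantize_block(codes, scale)
    return max(abs(a - b) for a, b in zip(values, restored))


# Hypothetical weights, chosen only for illustration.
weights = [0.5, -1.0, 0.25, 0.75]
for width in (8, 5, 3):
    print(f"{width}-bit max error: {max_abs_error(weights, width):.4f}")
```

Narrower widths shrink storage and arithmetic cost at the price of a larger round-trip error. Whether a given model tolerates that error is an empirical question that this sketch does not answer.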
In the future, real-time AI will become increasingly adaptive to live data, necessitating converged architectures for both low-latency inferencing and high-throughput training. State-of-the-art deep-learning algorithms are also evolving at a blinding pace, requiring continuous innovation of the Brainwave architecture. Today, Project Brainwave serves DNNs in real time for production services such as Bing and Azure and will become available to customers in 2018.