Alibaba Cloud Accelerates AI Applications

Analytics Zoo and bfloat16 improve the performance of AI applications on seventh-generation Alibaba Cloud ECS instances.

At a Glance:

  • Seventh-generation Alibaba Cloud high-frequency ECS instances use the third generation of X-Dragon Architecture and 3rd Generation Intel® Xeon® Scalable processors.

  • 3rd Generation Intel Xeon Scalable processors deliver industry-leading, workload-optimized platforms through enhanced Intel® Deep Learning Boost (Intel® DL Boost), a built-in artificial intelligence (AI) acceleration feature. Enhanced Intel DL Boost provides the industry's first x86 support for bfloat16, which improves AI inference and training performance.


Executive Overview

This paper describes how to use Analytics Zoo and Brain Floating Point 16-bit (bfloat16) to improve the performance of artificial intelligence (AI) applications running on seventh-generation Alibaba Cloud Elastic Compute Service (ECS) instances.

Seventh-generation Alibaba Cloud ECS instances are powered by 3rd Generation Intel® Xeon® Scalable processors, and they provide bfloat16 support.
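
In practice, enabling bfloat16 means running a model's compute-heavy operations in 16-bit precision while keeping the float32 dynamic range. The following is a minimal sketch, not the white paper's Analytics Zoo code: it assumes TensorFlow 2.4 or later built with oneDNN, and the model below is purely illustrative.

    # Minimal sketch: run compute-heavy ops in bfloat16 while keeping
    # model variables in float32. Assumes TensorFlow >= 2.4 built with
    # oneDNN; the model below is illustrative, not the white paper's workload.
    import tensorflow as tf

    tf.keras.mixed_precision.set_global_policy("mixed_bfloat16")

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        # Keep the output layer in float32 for numerical stability.
        tf.keras.layers.Dense(10, activation="softmax", dtype="float32"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

On 3rd Generation Intel Xeon Scalable processors, oneDNN dispatches these bfloat16 operations to the enhanced Intel DL Boost instructions; Analytics Zoo can then scale the same Keras-style model out across a cluster.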

3rd Generation Intel Xeon Scalable processors can process complex AI workloads. By using enhanced Intel DL Boost, 3rd Generation Intel Xeon Scalable processors can deliver up to 1.93 times the AI training performance for image classification,1 up to 1.87 times the AI inference performance for image classification,2 up to 1.7 times the AI training performance for natural language processing (NLP),3 and up to 1.9 times the AI inference performance for NLP, compared with previous-generation processors.4 Many AI training workloads from sectors such as healthcare, finance, and retail can benefit from the bfloat16 support provided by these processors.
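
Before enabling bfloat16, it is worth confirming that the instance's vCPUs actually expose the AVX-512 BF16 instructions used by enhanced Intel DL Boost. A minimal sketch, assuming a Linux guest such as a seventh-generation ECS instance; the helper function name is illustrative:

    # Check /proc/cpuinfo for the avx512_bf16 flag exposed by processors
    # with enhanced Intel DL Boost (Linux-only assumption).
    def cpu_supports_bf16(cpuinfo_path="/proc/cpuinfo"):
        with open(cpuinfo_path) as f:
            return any("avx512_bf16" in line
                       for line in f
                       if line.startswith("flags"))

    if __name__ == "__main__":
        print("AVX-512 BF16 available:", cpu_supports_bf16())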

Read the white paper: Accelerating AI Applications on Alibaba Cloud with Analytics Zoo and Bfloat16.

Product and Performance Information

1. Up to 1.93x higher AI training performance with a 3rd Generation Intel Xeon Scalable processor supporting Intel DL Boost with BF16 vs. a prior-generation processor on ResNet-50 throughput for image classification. New configuration: 1 node, 4 x 3rd Generation Intel Xeon Platinum 8380H processor (pre-production, 28 cores, 250 W) with 384 GB total memory (24 x 16 GB, 3,200 MT/s), 800 GB Intel SSD, ResNet-50 v1.5, ucode 0x700001b, Intel Hyper-Threading Technology (Intel HT Technology) on, Intel Turbo Boost Technology on, running Ubuntu 20.04 LTS, Linux 5.4.0-26,28,29-generic. Throughput: https://github.com/Intel-tensorflow/tensorflow -b bf16/base, commit#828738642769358b388d8f615ded9c213f10c99a, Model Zoo: https://github.com/IntelAI/models -b v1.6.1, ImageNet dataset, oneDNN 1.4, BF16, BS=512, tested by Intel on 5/18/2020. Baseline: 1 node, 4 x Intel Xeon Platinum 8280 processors with 768 GB total memory (24 x 32 GB, 2,933 MT/s), 800 GB Intel SSD, ucode 0x4002f00, Intel HT Technology on, Intel Turbo Boost Technology on, with Ubuntu 20.04 LTS, Linux 5.4.0-26,28,29-generic, ResNet-50 v1.5. Throughput: https://github.com/Intel-tensorflow/tensorflow -b bf16/base, commit#828738642760358b388d8f615ded0c213f10c99a, Model Zoo: https://github.com/IntelAI/models -b v1.6.1, ImageNet dataset, oneDNN 1.4, FP32, BS=512, tested by Intel on 5/18/2020.
2. Up to 1.87x higher AI inference performance with a 3rd Generation Intel Xeon Scalable processor supporting Intel DL Boost with BF16 vs. prior-generation processors using FP32, on ResNet-50 throughput for image classification. New configuration: 1 node, 4 x 3rd Generation Intel Xeon Platinum 8380H processor (pre-production, 28 cores, 250 W) with 384 GB total memory (24 x 16 GB, 3,200 MT/s), 800 GB Intel SSD, ucode 0x700001b, Intel HT Technology on, Intel Turbo Boost Technology on, with Ubuntu 20.04 LTS, Linux 5.4.0-26,28,29-generic, ResNet-50 v1.5. Throughput: https://github.com/Intel-tensorflow/tensorflow -b bf16/base, commit#828738642760358b388e8r615ded0c213f10c99a, Model Zoo: https://github.com/IntelAI/models -b v1.6.1, ImageNet dataset, oneDNN 1.4, BF16, BS=56, 5 instances, 28 cores/instance, tested by Intel on 5/18/2020. Baseline: 1 node, 4 x Intel Xeon Platinum 8280 processors with 768 GB total memory (24 x 32 GB, 2,933 MT/s), 800 GB Intel SSD, ucode 0x4002f00, Intel HT Technology on, Intel Turbo Boost Technology on, with Ubuntu 20.04 LTS, Linux 5.4.0-26,28,29-generic, ResNet-50 v1.5. Throughput: https://github.com/Intel-tensorflow/tensorflow -b bf16/base, commit#828738642760358b388d8f615ded0c213f10c99a, Model Zoo: https://github.com/IntelAI/models -b v1.6.1, ImageNet dataset, oneDNN 1.5, FP32, BS=56, 4 instances, 28 cores/instance, tested by Intel on 5/18/2020.
3. Up to 1.7x higher AI training performance with a 3rd Generation Intel Xeon Scalable processor supporting Intel DL Boost with BF16 vs. a prior-generation processor on BERT throughput for natural language processing. New configuration: 1 node, 4 x 3rd Generation Intel Xeon Platinum 8380H processor (pre-production, 28 cores, 250 W) with 384 GB total memory (24 x 16 GB, 3,200 MT/s), 800 GB Intel SSD, ucode 0x700001b, Intel HT Technology on, Intel Turbo Boost Technology on, with Ubuntu 20.04 LTS, Linux 5.4.0-26,28,29-generic, BERT-Large (QA). Throughput: https://github.com/Intel-tensorflow/tensorflow -b bf16/base, commit#828738642760358b388e8r615ded0c213f10c99a, Model Zoo: https://github.com/IntelAI/models -b v1.6.1, SQuAD 1.1 dataset, oneDNN 1.4, BF16, BS=12, tested by Intel on 5/18/2020. Baseline: 1 node, 4 x Intel Xeon Platinum 8280 processors with 768 GB total memory (24 x 32 GB, 2,933 MT/s), 800 GB Intel SSD, ucode 0x4002f00, Intel HT Technology on, Intel Turbo Boost Technology on, with Ubuntu 20.04 LTS, Linux 5.4.0-26,28,29-generic, BERT-Large (QA). Throughput: https://github.com/Intel-tensorflow/tensorflow -b bf16/base, commit#828738642760358b388d8f615ded0c213f10c99a, Model Zoo: https://github.com/IntelAI/models -b v1.6.1, SQuAD 1.1 dataset, oneDNN 1.5, FP32, BS=12, tested by Intel on 5/18/2020.
4. Up to 1.9x higher AI inference performance with a 3rd Generation Intel Xeon Scalable processor supporting Intel DL Boost with BF16 vs. a prior-generation processor with FP32, on BERT throughput for natural language processing. New configuration: 1 node, 4 x 3rd Generation Intel Xeon Platinum 8380H processor (pre-production, 28 cores, 250 W) with 384 GB total memory (24 x 16 GB, 3,200 MT/s), 800 GB Intel SSD, ucode 0x700001b, Intel HT Technology on, Intel Turbo Boost Technology on, with Ubuntu 20.04 LTS, Linux 5.4.0-26,28,29-generic, BERT-Large (QA). Throughput: https://github.com/Intel-tensorflow/tensorflow -b bf16/base, commit#828738642760358b388e8r615ded0c213f10c99a, Model Zoo: https://github.com/IntelAI/models -b v1.6.1, SQuAD 1.1 dataset, oneDNN 1.4, BF16, BS=32, 4 instances, 28 cores/instance, tested by Intel on 5/18/2020. Baseline: 1 node, 4 x Intel Xeon Platinum 8280 processors with 768 GB total memory (24 x 32 GB, 2,933 MT/s), 800 GB Intel SSD, ucode 0x4002f00, Intel HT Technology on, Intel Turbo Boost Technology on, with Ubuntu 20.04 LTS, Linux 5.4.0-26,28,29-generic, BERT-Large (QA). Throughput: https://github.com/Intel-tensorflow/tensorflow -b bf16/base, commit#828738642760358b388d8f615ded0c213f10c99a, Model Zoo: https://github.com/IntelAI/models -b v1.6.1, SQuAD 1.1 dataset, oneDNN 1.5, FP32, BS=32, 4 instances, 28 cores/instance, tested by Intel on 5/18/2020.