Deploying compute-intensive applications in the cloud or consuming them as online services is more affordable and faster for customers than using their own hardware. Google has experience in providing infrastructure for compute-intensive applications to its cloud customers. Similar infrastructure can be used for other compute-intensive applications within Google.
The C2 instance, Google Cloud’s first compute-optimized instance, offers 40 percent better compute performance [3] and a 90 percent higher CPU frequency [4] than the previous N1 instance. It’s based on 2nd Generation Intel® Xeon® Scalable processors, and Google Cloud Platform (GCP) customers are already benefiting from the performance C2 offers:
- WP Engine powers some of its WordPress Digital Experience Platform services with 2nd Generation Intel® Xeon® Scalable processor-based “C2” (compute-optimized) instances on GCP. When combined with other software optimizations, WP Engine achieved platform performance that was 60 percent faster than before [5].
- Climacell uses C2 for its micro-forecasting tools for weather prediction. Climacell performed internal benchmarking of the proof-of-concept solution comparing a Google Cloud C2 instance with previous-generation N1 clusters [6]. With C2 instances, Climacell achieved 40 percent better price/performance than N1 instances.
Accelerating Compute-Intensive and Artificial Intelligence (AI) Applications
Compute-intensive and artificial intelligence (AI) applications can often benefit from being optimized for Single Instruction Multiple Data (SIMD) instructions. These instructions enable a single processor instruction to process multiple data items at the same time. Intel® Advanced Vector Extensions 512 (Intel® AVX-512) was introduced with the Intel® Xeon® Scalable processor. It doubled the vector register width to 512 bits, compared with the 256-bit registers of previous-generation Intel® Xeon® processors. The wider registers can dramatically increase throughput for applications whose work can be vectorized, that is, expressed as the same operation applied to many data elements at once.
Here are some examples:
- Sony Imageworks has used Intel AVX-512 to accelerate rendering for its animations;
- Descartes Labs has used Intel AVX-512 to help improve compression speeds by 38 percent; and
- Google itself smashed a Guinness World Record by using Intel AVX-512 to speed up the calculation of pi to 31.4 trillion decimal places.
With 512-bit vectors and up to two 512-bit fused multiply-add (FMA) units, a core can perform up to 32 double-precision or 64 single-precision floating-point operations per clock cycle. That is double the number of operations per clock cycle compared to Intel® Advanced Vector Extensions 2 (Intel® AVX2) [7].
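The per-cycle figures follow from simple lane arithmetic. The sketch below reproduces them under two assumptions that are not spelled out in the text: each FMA counts as two floating-point operations per lane (one multiply and one add), and both FMA units issue on every cycle. The function name is illustrative only.

```python
def peak_flops_per_cycle(vector_bits, element_bits, fma_units=2):
    """Theoretical peak floating-point operations per clock cycle for one core."""
    lanes = vector_bits // element_bits  # elements that fit in one vector register
    ops_per_fma = 2                      # assumption: an FMA is one multiply plus one add
    return lanes * ops_per_fma * fma_units


for name, bits in (("Intel AVX2", 256), ("Intel AVX-512", 512)):
    dp = peak_flops_per_cycle(bits, 64)  # double precision: 64-bit elements
    sp = peak_flops_per_cycle(bits, 32)  # single precision: 32-bit elements
    print(f"{name}: {dp} double-precision, {sp} single-precision ops per cycle")
```

These are theoretical peaks; real applications reach them only when the data is already in registers and every cycle issues two FMAs.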
Not all applications benefit from Intel AVX-512, but for those applications that do, the performance gains can be significant.
Deep learning applications are among those applications that can benefit from using Intel AVX-512 to process more data at the same time, with a single instruction.
The Genomics team at Google Brain has been using Intel AVX-512 to improve the performance of its genome-sequencing application, DeepVariant, an open-source tool built on top of TensorFlow.
The most computationally intensive stage in the DeepVariant process is known as call_variants. It compares an individual’s genome with a reference genome, for medical diagnosis, treatment, or drug research. Using the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) and Intel AVX-512, the runtime for call_variants was cut from over 14 hours to 3 hours 25 minutes. The Google Brain team notes that there has been a three-fold reduction in cost as well [8].
Intel MKL-DNN is an open-source library for enhancing the performance of deep learning frameworks on Intel® architecture. It provides building blocks to take advantage of Intel AVX-512 and multithreading to accelerate convolutional neural networks (CNNs).
Accelerating Deep Learning Inference
The 2nd Generation Intel Xeon Scalable processor introduced Intel® Deep Learning Boost (Intel® DL Boost), which adds Vector Neural Network Instructions (VNNI) to accelerate deep learning inference. VNNI extends Intel AVX-512 with a fused multiply-add instruction for the low-precision integer operations that are often used in matrix manipulations as part of deep learning inference. The new instruction does the work that previously took three separate instructions, saving clock cycles on the processor. It can help to speed up applications including image classification, speech recognition, language translation, and object detection.
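To make the "three instructions into one" idea concrete, the following sketch emulates a single 32-bit lane in plain Python. The mnemonics in the comments (VPMADDUBSW, VPMADDWD, VPADDD, and the fused VPDPBUSD) are not named in the text above; they are the commonly documented sequence and are included here as an assumption for illustration. The sketch does not model 32-bit overflow, and the function names are hypothetical.

```python
def saturate_int16(value):
    """Clamp to the signed 16-bit range, as the first legacy step does."""
    return max(-32768, min(32767, value))


def three_step_lane(acc, u8, s8):
    """One 32-bit lane computed with the older three-instruction sequence."""
    # Step 1 (VPMADDUBSW): multiply unsigned by signed bytes, add adjacent pairs into int16.
    pair_sums = [saturate_int16(u8[i] * s8[i] + u8[i + 1] * s8[i + 1]) for i in (0, 2)]
    # Step 2 (VPMADDWD against a vector of ones): add adjacent int16 values into one int32.
    widened = pair_sums[0] * 1 + pair_sums[1] * 1
    # Step 3 (VPADDD): add the result into the 32-bit accumulator.
    return acc + widened


def fused_lane(acc, u8, s8):
    """The same lane computed in one step, in the style of VPDPBUSD."""
    return acc + sum(a * b for a, b in zip(u8, s8))


unsigned_bytes = [10, 200, 3, 255]   # e.g. activations, each in 0..255
signed_bytes = [-5, 7, 127, -128]    # e.g. weights, each in -128..127
legacy = three_step_lane(1000, unsigned_bytes, signed_bytes)
fused = fused_lane(1000, unsigned_bytes, signed_bytes)
print(legacy, fused)
```

A 512-bit register holds 16 such 32-bit lanes, so one fused instruction covers 64 byte-wise multiplications. The two paths agree here because no intermediate sum exceeds the 16-bit range; the fused form also avoids that intermediate saturation step.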
Intel DL Boost enables up to 30X improvement in deep learning throughput [9], and is available for Google Cloud customers to use in their own applications now.
Performance can be further enhanced by using lower precision data, based on 8-bit integers (INT8) instead of 32-bit floating-point (FP32) numbers. Research by Intel found that using INT8 and Intel DL Boost technology together on the Wide & Deep Recommender System improved performance by 200 percent, with a minimal loss of accuracy (less than 0.5 percent), compared to FP32 precision [10].
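As a rough illustration of what moving from FP32 to INT8 involves, the sketch below applies a simple symmetric, per-tensor quantization scheme to a handful of made-up weights. This is a generic textbook scheme chosen for brevity, not the method used in the Intel research cited above, and the values and function names are hypothetical.

```python
def quantize_int8(values):
    """Map floats to integers in -127..127 using one shared scale factor."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the INT8 representation."""
    return [q * scale for q in quantized]


weights = [0.82, -1.27, 0.05, 0.4]   # made-up FP32 values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each value is stored in one byte instead of four, and the rounding error per value is bounded by half the scale factor, which is why accuracy loss can stay small when the value range is well behaved.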
Google Compute Engine will be enabling N2 instance customers to automatically upgrade to the 3rd Generation Intel® Xeon® Scalable processor, with previews coming later this year. 3rd Gen Intel Xeon Scalable processors deliver 1.5 times more performance than other CPUs across 20 popular machine learning and deep learning workloads [11]. For more information, see the 3rd Generation Intel Xeon Scalable fact sheet.
Introducing Intel® Advanced Matrix Extensions (Intel® AMX)
The next-generation Intel Xeon Scalable processors, codenamed Sapphire Rapids, will continue Intel’s strategy of providing built-in AI acceleration on Intel Xeon processors with a new accelerator called Intel® Advanced Matrix Extensions (Intel® AMX).
Intel AMX introduces a new programming paradigm, based on two-dimensional registers called tiles. An accelerator, called TMUL (short for tile matrix multiply unit), carries out operations on the tiles. TMUL is a grid of fused multiply-add units that can read and write tiles. The matrix multiplications in the TMUL instruction set compute:
C[M][N] += A[M][K] * B[K][N]
Each tile has a maximum size of 16 rows of 64 bytes (a total of 1 KB). Programmers can configure a smaller size for each tile if it better fits their algorithm. Data is loaded into tiles from memory using the traditional Intel architecture register set as pointers.
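The tile limits and the multiply-accumulate formula can be sketched in plain Python. The code below works on logical row-major matrices held in nested lists; it does not model the hardware's in-register data layout or the tile configuration step, and all names are hypothetical.

```python
MAX_TILE_ROWS = 16        # from the text: at most 16 rows per tile
MAX_TILE_ROW_BYTES = 64   # from the text: at most 64 bytes per row


def max_elements_per_row(element_bytes):
    """How many elements of a given size fit in one tile row."""
    return MAX_TILE_ROW_BYTES // element_bytes


def tile_matmul_accumulate(c, a, b):
    """Compute C[M][N] += A[M][K] * B[K][N] in place and return C."""
    m, k, n = len(a), len(b), len(b[0])
    assert m <= MAX_TILE_ROWS, "A has more rows than a tile allows"
    assert all(len(row) == k for row in a), "inner dimensions must match"
    for i in range(m):
        for j in range(n):
            c[i][j] += sum(a[i][p] * b[p][j] for p in range(k))
    return c


print(MAX_TILE_ROWS * MAX_TILE_ROW_BYTES)   # bytes in a full-size tile
print(max_elements_per_row(1), max_elements_per_row(2), max_elements_per_row(4))

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = [[1, 0], [0, 1]]      # existing accumulator contents are kept and added to
print(tile_matmul_accumulate(c, a, b))
```

A full row therefore holds 64 one-byte, 32 two-byte, or 16 four-byte elements, and a larger matrix multiplication is built by looping this operation over tile-sized blocks.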
The new instructions include:
- TDPBF16PS, which computes dot products of pairs of Bfloat16 (BF16) elements from two tiles and accumulates the results as single-precision floating-point values.
- TDPBSSD/TDPBSUD/TDPBUSD/TDPBUUD, which are used to multiply signed and unsigned byte elements from two different tiles in different combinations (signed * signed, signed * unsigned, unsigned * signed, unsigned * unsigned).
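The four byte-multiply variants differ only in how the same raw bytes are interpreted. The sketch below shows this for a single four-byte dot product; the mapping of each mnemonic to a (first operand signed, second operand signed) pair follows the order listed above, while the helper names and sample bytes are hypothetical.

```python
# (first operand signed?, second operand signed?) for each variant
VARIANTS = {
    "TDPBSSD": (True, True),
    "TDPBSUD": (True, False),
    "TDPBUSD": (False, True),
    "TDPBUUD": (False, False),
}


def as_int(byte, signed):
    """Interpret a raw byte (0..255) as a signed or unsigned 8-bit integer."""
    return byte - 256 if signed and byte >= 128 else byte


def byte_dot(a_bytes, b_bytes, a_signed, b_signed, acc=0):
    """Dot product of two byte sequences, added to an integer accumulator."""
    return acc + sum(as_int(x, a_signed) * as_int(y, b_signed)
                     for x, y in zip(a_bytes, b_bytes))


a_raw = bytes([0xFF, 0x02, 0x80, 0x01])
b_raw = bytes([0x01, 0xFE, 0x02, 0x03])
results = {name: byte_dot(a_raw, b_raw, *signs) for name, signs in VARIANTS.items()}
for name, value in results.items():
    print(name, value)
```

The identical input bytes give four different sums, so choosing the variant that matches how the data was quantized matters for correctness.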
To find out more about the upcoming instructions, see Chapter 3 of the Intel® Architecture Instruction Set Extensions Programming Reference.