Unlike NAND SSDs, Intel® Optane™ SSDs offer peak performance at queue depths relevant to real-world apps, not synthetic benchmarks.
You want a solid state drive (SSD) that will work the fastest for you and for your workload. Because you are reading this article, it’s a good bet that you study SSD performance specifications when selecting an SSD for your system. When you read the specifications, you see throughput (also known as bandwidth) specified for both reads and writes. You also see the specified maximum accesses per second (commonly called input/output operations per second [IOPS]). It might surprise you to learn that these specifications assume highly idealized test scenarios. These scenarios might not—in fact likely don’t—match the applications that you want to run quickly.
In this article, we explore the role that the number of outstanding accesses (commonly referred to as the queue depth [QD] of a workload) plays in SSD performance. We also examine the types of QDs commonly seen with real applications.
Simply put, most applications have relatively low QDs, and NAND SSDs need high QDs to deliver full performance. With their low latency, Intel® Optane™ SSDs deliver high performance at low QDs. So Intel® Optane™ SSDs deliver high performance for a much wider set of applications.
The Prevalence of Low-QD Applications
QD is not something most people think about every day. An analogy can be used to illustrate QD, show its relationship to latency and throughput, and help explain why lower QDs matter most.
Imagine that your shed is on fire. You don’t have a hose, but you have a bucket and a water faucet at the other side of a small field. So you turn on the faucet, fill the bucket, turn off the faucet, run across the field, and dump the water on the flames. Then you run back to the faucet and repeat the sequence.
In this example (Figure 1), the QD is one (QD=1) because there is only one person and one bucket. The throughput is the average rate at which water is drawn from the faucet and applied to the fire (for example, 12 buckets per hour). Latency is the time between emptying one bucket on the fire and arriving with the next one (for example, five minutes).
As you can see, there is a relationship between the latency and the throughput of water onto the fire. If the field is bigger, it takes longer to transit, so the latency for each bucket of water fetched will increase, and water throughput will drop.
If we could shrink the field (Figure 2) by moving the faucet closer to the shed, we could cross it faster and get water to the fire more quickly. In this case, we reduce the latency and, even with QD=1, increase the throughput and firefighting effectiveness.
Reducing latency sounds like magic. Is there another way? Let’s take this example to QD=2. We need another bucket and a friend to help us. The two firefighters now pass each other in the field, one headed to the fire and one headed to the faucet. The latency hasn’t changed because the field is the same size, but with QD=2, we now have twice the throughput: water is being applied to the fire faster (Figure 3).
Until we run out of buckets and friends, we could continue to increase the throughput of water onto the fire by increasing the QD. But as we add firefighters running across the field, we will start to run into each other (Figure 4). We’ve introduced inefficiency: each added helper won’t help as much as the first additional helper did. Eventually the faucet is never turned off, and someone is always filling a bucket. We have then reached saturation (maximum throughput for the faucet), and adding more buckets (a higher QD) won’t help.
Storage systems work like the example above. The application running on the processor is the shed on fire: it needs buckets of data to move the computation forward. The application or operating system running on the processor makes individual requests for data from an SSD, and the returned data is used to move the computation forward. The number of data items that the application can request simultaneously (the QD, or the number of buckets) depends on the data parallelism of the computation and on the capabilities of the application. The latency for each access depends on the latency of the SSD and of the system path to that SSD. Therefore, the throughput depends on both the application and the SSD used.
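The relationship between QD, latency, and throughput in the bucket analogy can be sketched in a few lines of Python. This is an idealized model, not a measurement: it assumes the rate is simply the number of outstanding requests divided by the latency of each one, capped at a saturation limit. The function name `estimated_rate` and the faucet limit of 60 buckets per hour are assumptions made for illustration, and the model ignores the collisions between helpers described above.

```python
def estimated_rate(queue_depth: int, latency: float, max_rate: float) -> float:
    """Idealized completions per unit time: queue_depth / latency, capped at max_rate."""
    if queue_depth < 1 or latency <= 0 or max_rate <= 0:
        raise ValueError("queue_depth must be >= 1; latency and max_rate must be positive")
    return min(queue_depth / latency, max_rate)


# Bucket analogy: each round trip takes five minutes (from the example above).
ROUND_TRIP_HOURS = 5 / 60
# Assumed saturation point of the faucet, in buckets per hour (hypothetical value).
FAUCET_LIMIT = 60.0

for helpers in (1, 2, 4, 8):
    rate = estimated_rate(helpers, ROUND_TRIP_HOURS, FAUCET_LIMIT)
    print(f"QD={helpers}: about {rate:.0f} buckets per hour")
```

The same function applies to an SSD if the latency is given in seconds and the limit in IOPS: lowering the latency raises the rate at every QD, while raising the QD helps only until the device saturates.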
Application and Benchmark QD
SSD performance is usually measured with benchmarks like FIO (Linux) or CrystalDiskMark (Windows). These benchmarks are capable of high QDs. FIO is completely configurable in terms of QD—just specify the QD you want. FIO tests with QD equal to 128 or 256 are common when reporting SSD performance. CrystalDiskMark includes a test with 16 threads, each with a QD of 32, for a total QD of 512. Such high QDs make sense for fully exercising an SSD and for showing off the biggest possible performance in terms of IOPS and throughput.
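As a rough illustration of how a QD is specified for FIO, the Python sketch below builds (but does not run) command lines for a 4 KB random-read test at several QDs. The helper names, the job name, the 30-second runtime, and the path `/path/to/testfile` are placeholders chosen for this example; the options themselves (`--iodepth`, `--numjobs`, `--ioengine`, and so on) are standard FIO job parameters. The total number of outstanding requests is the per-job QD multiplied by the number of jobs, as in the 16 × 32 = 512 case above.

```python
import shlex
from typing import List


def fio_randread_args(filename: str, iodepth: int, numjobs: int = 1,
                      runtime_s: int = 30) -> List[str]:
    """Build an fio command line for a 4 KB random-read test at a given QD."""
    return [
        "fio",
        "--name=qd_sweep",          # job name (arbitrary label)
        f"--filename={filename}",   # placeholder target file or device
        "--ioengine=libaio",        # asynchronous I/O engine on Linux
        "--direct=1",               # bypass the page cache
        "--rw=randread",
        "--bs=4k",
        f"--iodepth={iodepth}",     # QD per job
        f"--numjobs={numjobs}",     # number of parallel jobs
        "--time_based",
        f"--runtime={runtime_s}",
        "--group_reporting",
    ]


def total_queue_depth(iodepth: int, numjobs: int) -> int:
    """Outstanding requests across all jobs."""
    return iodepth * numjobs


for qd in (1, 2, 4, 8, 32, 128):
    print(shlex.join(fio_randread_args("/path/to/testfile", qd)))
```

Sweeping the low end of the range (QD of 1 to 8) is what exposes the behavior that matters for the applications discussed below.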
However, those high-performance numbers, and their dependence upon high QDs, simply do not reflect the reality experienced daily in most data centers and on users’ PCs. In real-world scenarios, a high QD is rarely achieved and maintained. Intel internal testing of real data center workloads has revealed that most applications operate in the QD range of 1 to 9 (Figure 5).¹ In fact, only an implementation of a decision-support benchmark (such as TPC-H) reaches really large QDs.
The situation is even more acute for PC applications. With our own measurements, we find that many desktop applications support a QD of just one, two, or four. As Figure 6 illustrates, real-world workloads for many of the most popular applications occur at less than QD=3.
Figures 5 and 6 vividly illustrate the disconnect between high QD measurements employed for SSD specification sheets, and the needs of real-world applications. SSD benchmarks provide lots of buckets to move data, while applications provide only a few. With this background, let’s look at NAND and Intel® Optane™ SSD performance versus QD.
NAND SSD Performance
It’s no surprise that NAND SSDs are built from NAND memory. A single NAND SSD contains many NAND integrated circuits. The read latency of the NAND integrated circuit itself dominates SSD latency for all but the less-frequent tail latencies.³ Because of this NAND read latency, modern NAND SSDs typically have an average idle read latency of about 80 microseconds (µs).⁴ For a single 3 GHz CPU, that translates to about 240,000 clock cycles, or roughly as many processor instructions: a big field to run across with a bucket.
Because of this relatively high latency, low-QD performance is a challenge for a NAND SSD. A little math shows how slow the throughput would be at QD=1 with 4 KB transfers: 4,096 bytes × (1/80 µs) ≈ 50 MB/s. Of course, larger transfers (a bigger bucket) will increase this throughput, which is why SSD benchmarks use large transfers for throughput measurements. Note that only some applications can use large transfers.
A little more math shows how low the IOPS would be for QD=1: 1/(80 µs) ≈ 12K IOPS. A higher QD will increase this rate, which is why these values are measured at larger QDs. Larger transfers increase the throughput number but not the IOPS number, which is why IOPS measurements for SSDs rely on high QD levels.
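The QD=1 arithmetic can be reproduced with a short Python sketch. It uses only the figures quoted in the text (80 µs latency, 4,096-byte transfers, a 3 GHz CPU) and assumes that exactly one request is outstanding at a time. The constant names are chosen for this example.

```python
LATENCY_S = 80e-6      # average idle read latency of a NAND SSD, from the text
BLOCK_BYTES = 4096     # 4 KB transfer size
CPU_HZ = 3e9           # 3 GHz processor clock

# With one request outstanding, each read must finish before the next starts.
qd1_iops = 1 / LATENCY_S
qd1_throughput_mb_s = BLOCK_BYTES * qd1_iops / 1e6

# Clock cycles that elapse while the processor waits for a single read.
cycles_per_read = CPU_HZ * LATENCY_S

print(f"QD=1 IOPS:        {qd1_iops:,.0f}")
print(f"QD=1 throughput:  {qd1_throughput_mb_s:.1f} MB/s")
print(f"Cycles per read:  {cycles_per_read:,.0f}")
```

The exact results are 12,500 IOPS and 51.2 MB/s, which the text rounds to 12K IOPS and 50 MB/s.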
There are lots of secondary impacts on NAND SSD performance that also drive the need for a higher QD to reach maximum NAND SSD performance. Only one is worth mentioning here: the Yahtzee effect, named by an Intel colleague, Knut Grimsrud. Each NAND integrated circuit (IC) can sustain only one read through its entire latency. Therefore, to get higher performance, the NAND SSD must have many ICs, and each read must exercise a different IC. But data is held on specific ICs, so incoming accesses may collide with a previous access for a specific IC and have to wait, even though other ICs are idle. It’s as if we have multiple faucets, but each is slow, and each bucket can only be filled by a specific faucet. As the QD increases, the likelihood of collisions of reads for a single IC increases, causing performance to increase more slowly than QD. This is why SSD specification sheets include such large QDs to show high IOPS. Intel® Optane™ SSDs do not suffer from the Yahtzee effect because of their more capable memory and SSD architecture.
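A simple probability model gives a feel for the Yahtzee effect. The sketch below assumes, purely for illustration, that each outstanding read lands on a uniformly random IC and that an IC serves one read at a time. The expected number of distinct ICs kept busy by a given QD is then the classic occupancy formula n × (1 − (1 − 1/n)^QD). The IC count of 32 is a hypothetical value, not a specification of any real SSD.

```python
def expected_busy_ics(num_ics: int, queue_depth: int) -> float:
    """Expected number of distinct ICs hit by queue_depth uniformly random reads."""
    if num_ics < 1 or queue_depth < 0:
        raise ValueError("num_ics must be >= 1 and queue_depth must be >= 0")
    return num_ics * (1.0 - (1.0 - 1.0 / num_ics) ** queue_depth)


NUM_ICS = 32  # hypothetical number of NAND ICs in the SSD

for qd in (1, 4, 16, 32, 128):
    busy = expected_busy_ics(NUM_ICS, qd)
    ideal = min(qd, NUM_ICS)  # what perfect spreading of reads would achieve
    print(f"QD={qd:>3}: about {busy:4.1f} ICs busy, {busy / ideal:.0%} of ideal")
```

Under these assumptions the number of busy ICs grows more slowly than the QD, because some reads queue behind others on the same IC, and a QD several times the IC count is needed before nearly all ICs are kept busy.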
How Intel® Optane™ SSDs Outperform NAND SSDs in Real-World Data Center Operations
Unlike NAND SSDs, Intel® Optane™ SSDs are designed to provide peak performance at real-world QDs, by using a revolutionary memory and SSD architecture that provides consistent low latency. The low latency of the Intel® Optane™ memory media allows the SSD to achieve extremely low latencies (for an SSD) of about 8 µs (a much smaller field to run across). Additionally, unlike NAND SSDs, the latency of Intel® Optane™ SSDs is not dominated by memory latency and does not suffer from a Yahtzee effect. An Intel® Optane™ SSD assembles even a single 4 KB read from multiple Intel® Optane™ memory media ICs, and those ICs are ready for another read very quickly. Intel® Optane™ SSDs avoid the location- and address-based collisions that NAND SSDs exhibit. It is as if Intel® Optane™ SSDs use multiple faucets at once to fill a single bucket, making them ready to fill the next bucket very quickly. This means the Intel® Optane™ memory media is ready for another read in much less time than the NAND SSD, so it doesn’t need input/output (I/O) parallelism to achieve high IOPS.
Stated simply, Intel® Optane™ SSDs deliver peak performance at the lower QDs at which most applications work. NAND SSDs typically require QDs of 128 or more to deliver peak performance, while Intel® Optane™ SSDs can reach full performance at the much smaller QDs often seen with real applications (see Figure 7).⁵ The chart also highlights the performance difference between a NAND SSD (Intel® SSD P4610) and an Intel® Optane™ SSD (Intel® Optane™ SSD P4800X). The results show that, at real-world-relevant QDs, the Intel® Optane™ SSD delivers four to five times the performance of the tested Intel® NAND SSD.
While Figure 7 is an important chart, it tells only part of the story. Figure 8 shows the same workload, but it is plotted to show the operating point of the system in terms of both the throughput delivered (x-axis) and the resulting per-I/O read latency (y-axis). QD is included as the number on the NAND and Intel® Optane™ SSD lines. Suppose we have an application capable of QD=4 operation. The Intel® Optane™ SSD allows that application to operate at greater than 1.2 GB/s throughput with a latency per read-I/O of only about 10 µs. The NAND SSD, on the other hand, provides the application with an operating point of less than 0.3 GB/s and a latency per read-I/O of about 100 µs. Those are very different operating points that will, in turn, result in very different application performance.
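The two QD=4 operating points can be sanity-checked with the same relationship used earlier: throughput is roughly the QD divided by the per-I/O latency, multiplied by the transfer size. The sketch below assumes 4 KB reads, which the text does not state for this workload, and treats the quoted latencies of about 10 µs and about 100 µs as exact. The function name is chosen for this example.

```python
def implied_throughput_gb_per_s(queue_depth: int, latency_s: float,
                                block_bytes: int = 4096) -> float:
    """Throughput implied by a QD and per-I/O latency, in GB/s (1 GB = 1e9 bytes)."""
    if queue_depth < 1 or latency_s <= 0 or block_bytes <= 0:
        raise ValueError("queue_depth, latency_s, and block_bytes must be positive")
    iops = queue_depth / latency_s
    return iops * block_bytes / 1e9


QD = 4
optane = implied_throughput_gb_per_s(QD, 10e-6)   # about 10 µs per read
nand = implied_throughput_gb_per_s(QD, 100e-6)    # about 100 µs per read

print(f"Intel Optane SSD at QD={QD}: about {optane:.2f} GB/s")
print(f"NAND SSD at QD={QD}:         about {nand:.2f} GB/s")
```

Under these assumptions the estimates come to roughly 1.6 GB/s and 0.16 GB/s, consistent with the "greater than 1.2 GB/s" and "less than 0.3 GB/s" operating points described above.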
Also note in Figure 8 that the NAND SSD requires QDs of 128 or even 256 to reach full performance. Even if your application could get to that operating point, it would come at the cost of higher latency for reads. Now you can see why NAND SSD maximum performance is specified for such high QDs, and why you should ask about the latency for a read at that operating point. For this reason, several benchmarks, such as CrystalDiskMark, include QD=1 measurements as a part of their test suites. Intel® Optane™ SSDs reach full performance for a QD of just over 8, and they maintain low read latency at that operating point. For realistic application QDs, an Intel® Optane™ SSD delivers high throughput and simultaneously low latency. When it’s time to put out the fire, I want an Intel® Optane™ SSD in my system.
The Bonus Benefit of Intel® Optane™ SSDs’ Low-Latency Performance: Easier Code
As David Clark at MIT once put it, “Bandwidth problems can be cured with money. Latency problems are harder because the speed of light is fixed; you can’t bribe God.”⁷ Clark was talking about networking, but the same is true for storage; low latency is powerful and has far-ranging impact. We’ve noticed a recurring theme as we have worked with operating system and application developers to integrate low-latency Intel® Optane™ SSDs into systems. These developers have incurred costs in the form of developer time, extra code, and extra compute cycles to overcome the high latency of storage. Over the years, developers of operating systems and key data center applications have expended great effort to increase application throughput in spite of the high latencies of NAND SSDs (and even hard disk drives [HDDs]). Significant code and complex heuristics have been developed to try to shorten the long wait times incurred when transferring data to and from storage. With Intel® Optane™ SSDs, this extra code and extra developer time are no longer needed. The low latency provided by Intel® Optane™ SSDs solves the root of the problem: quick access to data.
To illustrate this concept, let’s look at a commercially important database benchmark, TPC-C. Another colleague at Intel, Jeff Smits, conducted extensive experiments comparing NAND SSD performance to Intel® Optane™ SSD performance. TPC-C is all about throughput: transactions per second (TPS). Database implementations of TPC-C are heavily optimized at the code and system level. Jeff discovered that simply inserting Intel® Optane™ SSDs into the system didn’t deliver the full benefit. He had to reduce the number of outstanding transactions this heavily optimized system generated. When he did this, he saw a strong application-level performance gain. The system assumed high-latency storage, so it included complex code capable of generating lots of simultaneous transactions. Interestingly, dialing back the number of outstanding transactions even allowed CPU caches to function more effectively, because the working set size of the application was reduced. We’ve seen similar simplification-for-performance opportunities with operating system virtual memory paging.
So the bonus benefit of Intel® Optane™ SSDs is a reduction in code complexity and smaller working sets. From that reduced complexity, we see even more increases in system performance. If you are a developer, think about your application and how you could simplify it to achieve higher performance and productivity by using Intel® Optane™ SSDs.
“Real-World” Performance Is Really All That Matters
The term “real-world” is sprinkled liberally throughout this article. That’s as it should be. After all, published performance stats, no matter how breathtakingly impressive, are of little consequence if the same results cannot be achieved in actual practice. While NAND SSD performance stats might impress when browsing sales brochures, Intel® Optane™ SSD performance will impress day-in and day-out in real-world data center operations and PC applications.⁸ ⁹ ¹⁰ ¹¹ ¹² ¹³