The need for speed and scale in modern applications has driven the growth of CPU clock speeds and core density, as well as the adoption of powerful GPU technologies to support workloads that require real-time answers from our data. Network bandwidth requirements have skyrocketed from 1Gb/s to 100Gb/s and beyond, with latency expectations dropping below 1ms in many cases. Moore’s Law is reaching its limits, and the commodity x86 CPU is plateauing in its overall rate of performance growth. Modern workloads and cloud native applications have performance requirements that general purpose CPUs will become increasingly incapable of meeting.
Given the trends listed above, what can an organization do to address the peak-CPU era? How can they leverage CPU and GPU architectures to their benefit and, more importantly, focus each processor on the functions it was truly intended to address?
CPUs, GPUs, and now DPUs?
A new class of processor has recently been introduced: the DPU (Data Processing Unit). Its role in the data center is to accelerate and offload data centric functions from general purpose CPUs, and to act as a catalyst for bare metal composability at scale. Data centric computing requires this new class of processor to accelerate the delivery of data and to offload data centric tasks from the commodity x86 processor, leaving more cycles available for general purpose compute. The data center itself is becoming the “computer,” but to achieve that goal we need to address the challenges created by forcing the commodity x86 compute layer to service data centric operations.
What are Data Centric Operations?
Data centric operations are the core services that handle data processing and data movement within modern infrastructure and data center systems. These processes have long been forced to run on x86 processors regardless of the CPU’s ability to support them efficiently, effectively, and with high performance at scale.
We have asked a great deal of the x86 processor over the years, treating it as the one-stop shop for every service that supports our infrastructure. Even with the growth of high-core-count processors, the demand for ever more compute power shows no sign of slowing.
And why is that? Because we are asking the CPU to do everything, even the things it was not designed to do well. Good enough seems to be acceptable for many, but we at Fungible take a very different approach to supporting the workloads of today and tomorrow.
Today, customers looking to offload data centric services from the CPU must leverage a DPU in order to meet the growing networking and IO requirements of modern computing. The more services that can be offloaded from the CPU, the more resources can be given back to the applications and compute centric services the CPU was designed to run. This in turn removes the IO Tax that organizations pay when they use commodity x86 CPUs as the only processor for IO and data centric services.
The IO Tax
In many data centers today, the CPU is the de facto delivery mechanism for IO and data centric operations. There are many consumption models that deliver these functions — routers, switches, storage arrays, hyperconverged systems, software defined storage, and so on — and almost all of them use commodity x86 processors to service them. In nearly every instance, those systems must overprovision CPU resources to perform those tasks. In some cases, FPGAs are used alongside a CPU to accelerate data services, which adds complexity and cost. The CPU ends up becoming a bottleneck, or must be overprovisioned to absorb fluctuations in data operations, traffic spikes, and bursts of IO.
At scale and under high demand, this model will struggle in the long run as the CPU reaches its peak capabilities. High demand data services in the modern data center are almost universally forced to use host CPU resources for data centric tasks, so the customer ends up overprovisioning compute: paying for resources that may sit idle or, at the opposite end of the spectrum, become overloaded with data processing requests. With this model, the customer is forced to pay an IO Tax to perform the data services and IO that their applications require, and the CPU resources that should be dedicated to applications are siphoned off to service data centric tasks.
As an example, under load, a standard x86 compute node with 48 cores running a high-IO workload will need to dedicate 8 to 12 cores just to drive roughly 1 million IOPS. In our view this is an “IO Tax” forced upon the organization: the burden of delivering data centric operations is absorbed by an expensive compute resource that should be delivering application services. Yet the CPU is asked to handle these data centric operations, and the customer is forced to overprovision CPU resources to pay the IO Tax.
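A minimal back-of-envelope sketch, using only the illustrative figures above (48 cores, 8 to 12 of them consumed by IO), shows how large that tax is as a share of the node:

```python
# Illustrative "IO Tax" calculation using the example figures above.
CORES_PER_NODE = 48
IO_CORES_RANGE = (8, 12)  # cores consumed just to drive ~1 million IOPS (per the example)

for io_cores in IO_CORES_RANGE:
    tax = io_cores / CORES_PER_NODE
    print(f"{io_cores} of {CORES_PER_NODE} cores -> IO Tax of {tax:.0%}")

# Output:
#   8 of 48 cores -> IO Tax of 17%
#   12 of 48 cores -> IO Tax of 25%
```

In other words, roughly a sixth to a quarter of the node is working on IO plumbing rather than the application itself.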
The IO Tax Illustrated
Earlier in the year, working with Nikhef/SURF in Europe, a single server node connected to a single Fungible storage node set a world record of 6.55 million 4K read IOPS using NVMe over TCP in the Linux kernel. This was a clear illustration of a CPU-based server driving high IO to a DPU-powered storage node. During the test, the server’s local CPU hit 100% utilization, which illustrates how demanding IO at this level is when using standard 100G network cards and relying on the operating system to manage NVMe over TCP.
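To put that number in context, a rough payload-bandwidth calculation (our own sketch, not part of the published result) shows why a record like this saturates the host:

```python
# Back-of-envelope: raw payload throughput implied by 6.55M 4K read IOPS.
# Payload only; NVMe/TCP, TCP/IP, and Ethernet framing overhead would add to this.
iops = 6.55e6
block_bytes = 4096

bytes_per_sec = iops * block_bytes            # ~26.8 billion bytes/s
gbytes_per_sec = bytes_per_sec / 1e9          # ~26.8 GB/s
gbits_per_sec = bytes_per_sec * 8 / 1e9       # ~215 Gbit/s

print(f"~{gbytes_per_sec:.1f} GB/s of payload, or ~{gbits_per_sec:.0f} Gbit/s")
# -> more than two 100G links' worth of data, before protocol overhead
```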
To illustrate the impact of data centric operations and the effect of the IO Tax, we wanted to see what would happen when we offloaded the data centric work of driving high IO to a Fungible DPU on the host server as well as at the storage endpoint.
Fungible, working with the San Diego Supercomputer Center, recently ran a test to see how much performance a single server could deliver to a single Fungible Storage node using Fungible Accelerator Cards for NVMe over TCP storage initiation. The result was a new world record of 10 million IOPS on the same 4K read workload. Not only did we see a 53% increase in IOPS, but by leveraging DPU-to-DPU communications, host CPU utilization dropped from 100% in the prior record to 24% at peak performance. While this may not be a standard workload, the reduction in CPU utilization was proven out, and the IO Tax was significantly minimized.
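For readers who want to check the percentages, a small sketch reproducing them from the quoted numbers (6.55M versus 10M IOPS, 100% versus 24% host CPU utilization):

```python
# Reproduce the quoted percentages from the raw numbers above.
old_iops, new_iops = 6.55e6, 10.0e6
old_util, new_util = 1.00, 0.24   # host CPU utilization at peak

iops_gain = (new_iops - old_iops) / old_iops   # ~0.53
util_freed = old_util - new_util               # 0.76

print(f"IOPS gain: {iops_gain:.0%}")                 # -> 53%
print(f"CPU utilization freed: {util_freed:.0%}")    # -> 76%
```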
Thinking in terms of data centric operations across the data center, a 76% reduction in CPU utilization is significant, and it highlights the high price the IO Tax exacts. And while a single server driving a single storage node isn’t always a standard use case, when you consider the large number of disparate workloads with varied IO profiles and the scale at which modern applications operate, that reduction means CPU resources can be freed up for the work they were designed to do.
Putting this into a cost perspective: each of the systems in the test ran a 64-core AMD EPYC processor with a retail price of $8,641, which works out to roughly $135 per core. Reclaiming the cores freed by the DPU offload is therefore worth on the order of $6,518 per CPU!
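As a sanity check on those dollar figures, here is the arithmetic using the list price and core count above; the exact dollar amount depends on the unrounded utilization numbers from the test, so a freed fraction of roughly 75–76% is used here, which lands close to the $6,518 quoted:

```python
# Rough cost framing for a 64-core AMD EPYC CPU with a retail price of $8,641.
cpu_price = 8641
cores = 64

per_core_cost = cpu_price / cores
print(f"Per-core cost: ~${per_core_cost:.0f}")   # -> ~$135

# Value of the CPU capacity reclaimed by offloading to the DPU.
for freed_fraction in (0.75, 0.76):
    reclaimed = cpu_price * freed_fraction
    print(f"Freeing {freed_fraction:.0%} of the CPU is worth ~${reclaimed:,.0f}")
# -> roughly $6,500 per CPU, in line with the figure above
```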
Conclusion
At Fungible, we believe that data centric operations will continue to demand more resources from the CPU as network speeds and the demand for real-time data insights continue to grow. Offloading data centric operations from the CPU to the DPU can immediately result in considerable cost savings for organizations operating at scale with high performance requirements. Customers looking to architect for the future, reduce operating costs, and recapture CPU resources should leverage the DPU in their data center designs to increase resource utilization and gain a competitive advantage.