Myth-Busting Latency Numbers for TCP Offload Engines

When you shop around for a TCP IP core for a TCP offload engine, do you ever ask yourself what the quoted latency numbers really mean?

This Technical Brief sheds light on TCP/IP processing in an FPGA and, in particular, busts some myths about latency numbers. We describe proper ways of measuring latency and explain why latency matters for TCP offload engines when implementing TCP/IP in FPGAs.

Based on MLE’s experience from past projects, it all comes down to technical and economic feasibility, which is determined by FPGA resource costs.
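To make "measuring latency" concrete, here is a minimal software-only sketch, not MLE's measurement method: it times TCP round trips against a local echo server over loopback and reports a distribution rather than a single number. The function names (`echo_server`, `recv_exact`, `measure_rtt`, `summarize`), the 64-byte payload, and the sample count are assumptions chosen for illustration. Loopback results say nothing about an FPGA implementation; the point is only that a quoted latency figure needs a stated payload size, measurement point, and percentile.

```python
import socket
import statistics
import threading
import time


def echo_server(listener: socket.socket) -> None:
    """Accept one connection and echo every byte back until the peer closes."""
    conn, _ = listener.accept()
    with conn:
        # Disable Nagle's algorithm so small messages are sent immediately.
        conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        while True:
            data = conn.recv(4096)
            if not data:
                break
            conn.sendall(data)


def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes, since TCP may deliver a message in pieces."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed early")
        buf.extend(chunk)
    return bytes(buf)


def measure_rtt(samples: int = 1000, payload_size: int = 64) -> list[float]:
    """Return per-message round-trip times in microseconds over loopback."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    threading.Thread(target=echo_server, args=(listener,), daemon=True).start()

    payload = b"x" * payload_size
    rtts = []
    with socket.create_connection(listener.getsockname()) as client:
        client.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        for _ in range(samples):
            start = time.perf_counter_ns()
            client.sendall(payload)
            recv_exact(client, payload_size)
            rtts.append((time.perf_counter_ns() - start) / 1000.0)
    listener.close()
    return rtts


def summarize(rtts: list[float]) -> dict[str, float]:
    """Report min, median, 99th percentile, and max instead of one average."""
    ordered = sorted(rtts)
    return {
        "min": ordered[0],
        "median": statistics.median(ordered),
        "p99": ordered[int(0.99 * (len(ordered) - 1))],
        "max": ordered[-1],
    }


if __name__ == "__main__":
    print(summarize(measure_rtt()))
```

Reporting the minimum alone tends to flatter an implementation, which is why the sketch also returns the median and tail values.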

Put a TCP/UDP/IP Turbo Into Your FPGA-SmartNIC

How MLE and Fraunhofer HHI are breaking the 500 MHz fMax barrier in network protocol acceleration (TCP/IP stack) by using Intel Agilex FPGAs as an FPGA SmartNIC.


MLE provides full system stacks including FPGAs, with a focus on networking for HPC, datacenter, or telecommunications. Often, we implement so-called full accelerators where almost all protocol processing runs efficiently within the FPGA fabric. 

Within this Technical Brief we elaborate on two key aspects for FPGA-based SmartNICs: 

  1. how to implement rapid, reliable connectivity from edge to cloud using FPGA-based SmartNICs with TCP/IP stack acceleration, and
  2. how these implementations can benefit from modern FPGA technology, namely Intel HyperFlex, to deliver better performance, cost, and power.

Our design choices for FPGA SmartNICs include the Corundum project, an open-source, high-performance FPGA-based Network Interface Card (NIC) platform. Corundum supports In-Network Processing for which we have integrated MLE’s Network Protocol Accelerator Platform (NPAP) based on the TCP/UDP/IP Full Accelerator from Fraunhofer HHI. 

To enable network protocol processing at line rates of 100 Gbps or faster, we have optimized this implementation for the Intel HyperFlex architecture. The result is a “turbo charged” FPGA SmartNIC which combines several advantages:

  • NPAP with high throughput for those “heavy” TCP data streams which make up most of the network traffic.
  • NPAP for those latency-sensitive TCP connections where TCP round-trip time (RTT) may dominate the entire system’s response time.
  • Corundum processing in open-source Linux software for the rest, i.e. all those administrative and control TCP connections which hardly use any bandwidth and which are not latency-sensitive.
  • Performance optimizations utilizing Intel HyperFlex to break the 500 MHz fMax barrier and to avoid FPGA resource “bloat”.
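The link between fMax and resource “bloat” can be illustrated with simple arithmetic: to sustain a given line rate, the datapath must move at least line rate divided by clock frequency bits per cycle, so a faster clock permits a narrower datapath. The sketch below is a back-of-the-envelope calculation under stated assumptions, not a description of the NPAP design: it ignores Ethernet framing overhead, rounds up to a power-of-two bus width, and uses hypothetical helper names (`min_bits_per_cycle`, `datapath_width`).

```python
import math


def min_bits_per_cycle(line_rate_gbps: float, fmax_mhz: float) -> float:
    """Minimum payload bits per clock cycle needed to sustain the line rate."""
    return (line_rate_gbps * 1e9) / (fmax_mhz * 1e6)


def datapath_width(line_rate_gbps: float, fmax_mhz: float) -> int:
    """Smallest power-of-two bus width (assumed convention) that suffices."""
    bits = math.ceil(min_bits_per_cycle(line_rate_gbps, fmax_mhz))
    return 1 << (bits - 1).bit_length()


if __name__ == "__main__":
    # 100 Gbps at two hypothetical clock frequencies.
    print(datapath_width(100, 500))  # 200 bits needed -> 256-bit bus
    print(datapath_width(100, 250))  # 400 bits needed -> 512-bit bus
```

Under these assumptions, doubling the clock from 250 MHz to 500 MHz halves the required bus width, and with it, roughly, the width of every register and multiplexer along the datapath.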

Deterministic Networking with TSN-10/25/50/100G

Growing Demand for Deterministic Networking

We all observe a growing need to connect computers with shorter delays (i.e. lower latencies) and higher bandwidth. This holds in particular for High-Performance Computing (HPC) in the data center and for embedded systems such as advanced industrial robotics or autonomous vehicles, which require so-called deterministic networking. Processing TCP/IP-based network protocols at speeds of 10 Gbps and beyond demands kernel bypass solutions (such as Intel’s DPDK, Solarflare’s/Xilinx’s Onload, or Mellanox/NVIDIA VMA) and/or so-called TOEs (TCP Offload Engines).

Domain-Specific Architectures (DSA) use so-called heterogeneous computing elements, also known as Cores, with the objective of putting the compute burden where it belongs. This is a well-established approach going back to the early days when an x86 CPU was partnered with an x87 coprocessor for better floating-point processing. Today, it is common to deploy various flavors of Cores, for example:

  • DSP Cores for digital signal processing in telecommunications
  • Shader Cores optimized for image processing, as they can be found in modern Graphics Processing Units (GPU) 
  • Tensor Processing Unit (TPU) Cores which are optimized for Artificial Intelligence and Deep Learning

This is because such (special purpose) fixed-function or programmable function accelerator Cores are optimized for a particular domain and, when properly used, not only take processing load off the (general purpose) CPU but also deliver better overall performance (which is data processed per time) and better efficiency (which is performance per Watt).

Over the following pages we make a case for processing TCP/IP over TSN over 10/25/50/100 Gigabit Ethernet on dedicated Cores, an approach with significant advantages, in particular for real-time Ethernet and Deterministic Networking. These so-called TCP-TSN-Cores can be integrated either in FPGAs or in SoCs (ASIC and ASSP). As we will show, TCP-TSN-Cores are more than just a TOE, the commonly used approach for network protocol acceleration. By running the entire network protocol stack from OSI Layer 2 to at least Layer 4 in a dedicated integrated circuit (a so-called Full Accelerator), we can remove (general purpose) CPUs entirely from the datapath.

Hence, TCP-TSN-Cores can deliver very low, bounded, and deterministic latency with the predictable scalability needed for 10/25/50/100 Gigabit Deterministic Networking.

