Latency Measurement of 10G/25G/50G/100G TCP-Cores using RTL Simulation

Distributed Systems-of-Systems which, for example, connect smart sensor hubs with centralized processing via Ethernet require very low transport latencies in order to deliver short response times. Those latencies are difficult for system designers to evaluate, and things get worse if the measurement setup and methodology are neither clearly explained nor reproducible.

Therefore, in this Technical Brief we describe how we use the Questa Advanced Simulator from Siemens EDA to analyze latency in a network protocol processing system. We also provide the most recent latency values for NPAP, the TCP/IP stack from Fraunhofer HHI, which, as it turns out, is very competitive with other solutions. Being integrators ourselves, we believe we owe this to the FPGA ecosystem!

 

1. Introduction

The Questa Advanced Simulator, Questa for short, is a verification solution from Siemens EDA (formerly Mentor Graphics) for System-on-Chip designs. Questa lets you do clock-domain-crossing (CDC) verification, formal verification, mixed-signal verification, and portable stimulus to maximize the effectiveness of your verification at the block and subsystem level, so your system-level verification can focus on system-level functionality, including software. The latter is important when accelerating software with FPGAs.

NPAP is the Network Protocol Accelerator Platform, a TCP/UDP/IP Full Accelerator designed and implemented by the Fraunhofer Heinrich-Hertz-Institute (HHI). Fraunhofer HHI focuses on 10 to 100 Gbit/s transmission in the field of high-performance telecom components and on mobile broadband systems. Fraunhofer HHI is located in Berlin, Germany.

While NPAP is also suitable for implementation in an ASIC, at MLE we typically use NPAP for FPGA acceleration, for example in SmartNICs. NPAP is written in IEEE 1076 VHDL and follows a block-based design methodology, as the following diagram shows:


In this diagram the datapath goes horizontally, with ingress and egress 10/25/50/100 Gigabit Ethernet on the left into/out of dedicated FPGA/ASIC hardware blocks:

  • Ethernet Media Access Controller (MAC)
  • Ethernet management (including ARP)
  • IPv4 (including ICMP for network management and diagnostics plus IGMP for group management)
  • one UDP block (taking care of all UDP packets)
  • one TCP block per TCP connection (picture shows 12)

Connectivity between these hardware blocks is via bi-directional, 128-bit wide datapaths, which can scale to 50 Gbps line rate in FPGA and 100 Gbps line rate in ASIC. The control flow is shown vertically, with means to set MAC addresses and IP addresses, plus a command interface to manage (open, configure and close) TCP connections. Control flow can be implemented via hardware state machines or via software. On-chip Full Accelerators (bottom right in the block diagram) can either implement application-level data processing, or data can be sent to and/or received from adjacent CPUs, either SoC-style with integrated CPUs or via PCIe-connected host CPUs.
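To make the attachment point concrete, the following VHDL port sketch shows how an on-chip Full Accelerator would see one TCP session endpoint: a pair of 128-bit streaming interfaces (AXI4-Stream in the measurement setup described below). The entity name, signal names and widths are our illustrative assumptions for this brief, not NPAP's actual port list:

    library ieee;
    use ieee.std_logic_1164.all;

    -- Hypothetical application-side view of one TCP session endpoint.
    -- Names and widths are illustrative only; they merely show the
    -- streaming attachment principle of a Full Accelerator.
    entity app_accelerator is
      port (
        clk           : in  std_logic;                       -- user clock
        rst_n         : in  std_logic;
        -- RX stream: payload received from the TCP connection
        s_axis_tdata  : in  std_logic_vector(127 downto 0);  -- 128-bit datapath
        s_axis_tkeep  : in  std_logic_vector(15 downto 0);
        s_axis_tlast  : in  std_logic;
        s_axis_tvalid : in  std_logic;
        s_axis_tready : out std_logic;
        -- TX stream: payload to be sent over the TCP connection
        m_axis_tdata  : out std_logic_vector(127 downto 0);
        m_axis_tkeep  : out std_logic_vector(15 downto 0);
        m_axis_tlast  : out std_logic;
        m_axis_tvalid : out std_logic;
        m_axis_tready : in  std_logic
      );
    end entity app_accelerator;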

2. NPAP Latency Measurement Setup

For our “door-to-door” latency analysis we integrate two separate instances of NPAP (NPAP_0 and NPAP_1), each within a DUT wrapper, into an RTL HDL testbench. We connect both instances of NPAP via IEEE 802.3 XGMII to model a 10 Gigabit Ethernet connection between them (obviously, in our RTL simulation the Ethernet connection, i.e. the “cable” and the PHY, is infinitely fast). We also add an application which sends data; this is reflected by the (TX) point in the diagram below. The interface between this TX application and NPAP is a streaming interface, namely AXI4-Stream. Similarly, we add a second application which receives data from the second NPAP instance; see the (RX) point in the diagram. Again, AXI4-Stream is used as the interface between NPAP and the application.

If you want to add your “Layer 7” processing to NPAP, just do the same.
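A stripped-down VHDL skeleton of this two-instance setup could look as follows. This is a minimal sketch under our own naming assumptions (dut_wrapper for the DUT wrapper, xgmii_* for the interconnect signals); the actual testbench in the NPAP Demo ERD differs in detail:

    library ieee;
    use ieee.std_logic_1164.all;

    entity tb_npap_latency is
    end entity tb_npap_latency;

    architecture sim of tb_npap_latency is
      signal clk_npap  : std_logic := '0';  -- 175 MHz NPAP/application clock
      signal clk_xgmii : std_logic := '0';  -- 156.25 MHz MAC/XGMII clock
      signal rst_n     : std_logic := '0';
      -- XGMII "cable" between the two DUTs (64 data bits, 8 control bits per direction)
      signal xgmii_0to1_d, xgmii_1to0_d : std_logic_vector(63 downto 0);
      signal xgmii_0to1_c, xgmii_1to0_c : std_logic_vector(7 downto 0);
    begin
      clk_npap  <= not clk_npap  after 2857 ps;  -- approx. 175 MHz
      clk_xgmii <= not clk_xgmii after 3200 ps;  -- 156.25 MHz
      rst_n     <= '1' after 100 ns;

      -- DUT wrapper 0: NPAP_0 plus MAC; the TX application (not shown here)
      -- feeds its AXI4-Stream input
      dut0 : entity work.dut_wrapper
        port map (
          clk_npap  => clk_npap,
          clk_xgmii => clk_xgmii,
          rst_n     => rst_n,
          xgmii_txd => xgmii_0to1_d,
          xgmii_txc => xgmii_0to1_c,
          xgmii_rxd => xgmii_1to0_d,
          xgmii_rxc => xgmii_1to0_c
        );

      -- DUT wrapper 1: NPAP_1 plus MAC; the RX application (not shown here)
      -- consumes its AXI4-Stream output
      dut1 : entity work.dut_wrapper
        port map (
          clk_npap  => clk_npap,
          clk_xgmii => clk_xgmii,
          rst_n     => rst_n,
          xgmii_txd => xgmii_1to0_d,
          xgmii_txc => xgmii_1to0_c,
          xgmii_rxd => xgmii_0to1_d,
          xgmii_rxc => xgmii_0to1_c
        );
    end architecture sim;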


3. NPAP Clocking and Clock Domains

As the diagram above shows, both instances of NPAP are clocked at 175 MHz, and so are the TX application and the RX application. With NPAP implementing a bi-directional, 128-bit wide datapath (one for the RX direction and one for the TX direction), the resulting bandwidth in each direction is 22.4 Gbps (128 bits * 175 MHz). This is plenty, so NPAP does not become the bottleneck and can sustain 10 GigE line rate. In a real FPGA design it would be sufficient to run NPAP at 80+ MHz (in theory, 78.125 MHz) to achieve 10 Gbps line rate.

Side note: This is a key reason why we at MLE like this technology from Fraunhofer HHI, so much that we got deeply involved. Unlike other implementations with a 64-bit datapath, NPAP can easily deliver 50 Gbps line rate in a modern FPGA at 400 MHz clock speeds. And MLE has started to collaborate with FPGA vendors to utilize modern pipeline structures in FPGA fabric to increase the line rate up to 100 Gbps by driving clock speeds above 600 MHz.

But, back to our latency measurement setup: The XGMII interconnection and the two instances of the MAC (the Low-Latency Ethernet Media Access Controller, also from Fraunhofer HHI) operate at 156.25 MHz with a 64-bit wide datapath (the IEEE 802.3 specification references a 32-bit wide datapath running at 312.5 MHz).
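For reference, the line-rate and clock figures above follow directly from the datapath widths:

    128 bit x 175 MHz    = 22.4 Gbit/s   (NPAP datapath, per direction)
    10 Gbit/s / 128 bit  = 78.125 MHz    (minimum NPAP clock for 10 GigE line rate)
    64 bit x 156.25 MHz  = 10 Gbit/s     (MAC/XGMII datapath in our setup)
    32 bit x 312.5 MHz   = 10 Gbit/s     (XGMII as referenced by IEEE 802.3)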

4. NPAP Latency Results

Once everything is connected, we fire up RTL simulation. The following is a screenshot of the Xilinx ISIM waveform viewer, part of the Xilinx Vivado toolchain. (MLE NPAP also supports RTL simulation in other tools such as the Questa Advanced Simulator.)

Here you see the wave window showing a couple of signal groups of interest; additional groups show control signals not relevant for the latency analysis. The first signal group shows the top-level clocks and resets: first the clock and reset of the two MACs, then the clock and reset of the two NPAP instances. The second signal group shows the receive side of DUT2, and the third signal group shows the transmit side of DUT1.

With the 128-bit datapath, a transfer of 32 bytes of payload comprises two AXI4-Stream “beats” of data, with the end-of-frame signal (TLAST) asserted on the second beat.
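A minimal sketch of a testbench process driving such a 32-byte transfer, reusing the illustrative m_axis_*, clk_npap and rst_n names from the sketches above, could look like this:

    -- Drive one 32-byte TCP payload as two 128-bit AXI4-Stream beats.
    -- Signal names are illustrative; both beats carry a full 16 bytes,
    -- so tkeep stays all-ones.
    send_32_bytes : process
    begin
      wait until rst_n = '1';
      wait until rising_edge(clk_npap);

      -- beat 1: payload bytes 0..15
      m_axis_tdata  <= (others => '0');  -- real payload bytes would go here
      m_axis_tkeep  <= (others => '1');
      m_axis_tlast  <= '0';
      m_axis_tvalid <= '1';
      wait until rising_edge(clk_npap) and m_axis_tready = '1';

      -- beat 2: payload bytes 16..31, end of frame
      m_axis_tlast  <= '1';
      wait until rising_edge(clk_npap) and m_axis_tready = '1';

      m_axis_tvalid <= '0';
      m_axis_tlast  <= '0';
      wait;
    end process send_32_bytes;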

While this screenshot only shows the transport of 32 bytes of payload via TCP/IP, we ran several other tests to benchmark the latency for different payload sizes. Here is the table with the results:

TCP Payload Size [Byte]   Latency [ns]
1                          462.8
32                         485.8
64                         520.0
160                        656.0
448                       1092.1
960                       1868.9
1216                      2251.3
1456                      2622.7

Again, latency was measured “door-to-door”, i.e. we measured the time difference from sending payload data into one NPAP instance via TCP/IP until receiving that payload data at the other NPAP instance.
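In the testbench, such a measurement can be taken with a simple monitor process that timestamps accepted AXI4-Stream beats on both sides and reports the difference. The sketch below again reuses our illustrative signal names, and one reasonable choice of trigger points is shown (first beat accepted on the TX side, last beat delivered on the RX side):

    -- Door-to-door latency monitor (sketch): timestamp the first payload
    -- beat accepted by the sending NPAP instance and the last payload beat
    -- delivered by the receiving NPAP instance, then report the difference.
    measure_latency : process
      variable t_tx_start : time;
      variable t_rx_end   : time;
    begin
      -- first TX beat handed to NPAP_0 (illustrative m_axis_* names)
      wait until rising_edge(clk_npap)
                 and m_axis_tvalid = '1' and m_axis_tready = '1';
      t_tx_start := now;

      -- last RX beat (TLAST) delivered by NPAP_1 (illustrative s_axis_* names)
      wait until rising_edge(clk_npap)
                 and s_axis_tvalid = '1' and s_axis_tready = '1'
                 and s_axis_tlast = '1';
      t_rx_end := now;

      report "door-to-door latency: " & time'image(t_rx_end - t_tx_start);
      wait;
    end process measure_latency;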

In the interest of full disclosure: for our analysis we used NPAP v1.7.0 (20211101) within the NPAP Demo ERD v2.8 (20211101) Build 20211129-144058 (#63e1e99).

5. Conclusion and Backgrounder

NPAP, the Network Protocol Accelerator Platform, an IP core for FPGA/ASIC from the Fraunhofer Heinrich-Hertz Institute (HHI), is very competitive with regard to latency, despite the fact that NPAP is a full, compliant implementation of a TCP/IP stack in VHDL.

If you are interested in this technology and want to reproduce our findings and/or wish to evaluate NPAP within your project settings, please contact us!

Authors and Contact Information

Ulrich Langenbach, Dir. Engineering, Missing Link Electronics GmbH
Endric Schubert, PhD, CTO, Missing Link Electronics, Inc.

Missing Link Electronics, Inc.
2880 Zanker Road, Suite 203
San Jose, CA 95134, USA

Missing Link Electronics GmbH
Industriestrasse 10
89231 Neu-Ulm
Germany

www.missinglinkelectronics.com

MLE (Missing Link Electronics) is offering technologies and solutions for Domain-Specific Architectures, which focus on heterogeneous computing using FPGAs. MLE is headquartered in Silicon Valley with offices in Neu-Ulm, Germany.