AMD/Xilinx Zynq UltraScale+ MPSoC PS Mali GPU to PL

Sometimes it is necessary to direct the output of the Mali GPU integrated in the AMD/Xilinx Zynq UltraScale+ MPSoC to the FPGA PL. One example of such a use case is running Linux, rendering hardware-accelerated OpenGL applications, and displaying the results on a PL-attached LVDS, MIPI DSI/CSI, HDMI, or SDI display. The readily available output options for the Mali GPU are limited to PS-attached DisplayPort via GTR transceivers. The following describes how Mali GPU video output can nonetheless be directed to the FPGA PL to enable a broader range of interface options.

High-Speed Data Acquisition Systems

This brief covers the challenges and design choices involved in networking FPGAs and servers for high-speed data acquisition or retrieval. We discuss TCP/IP as a very fast transport when TCP/IP Full Accelerators are used both in the sensor-side FPGA and in the server.

Put a TCP/UDP/IP Turbo Into Your FPGA-SmartNIC

How MLE and Fraunhofer HHI are breaking the 500 MHz fMax barrier in network protocol acceleration by using Intel Agilex FPGAs as an FPGA SmartNIC.


MLE provides full system stacks including FPGAs, with a focus on networking for HPC, datacenter, or telecommunications. Often, we implement so-called full accelerators where almost all protocol processing runs efficiently within the FPGA fabric. 

Within this Technical Brief we elaborate on two key aspects for FPGA-based SmartNICs: 

  1. How to implement rapid, reliable connectivity from edge to cloud using FPGA-based SmartNICs with TCP/IP acceleration, and 
  2. How these implementations can benefit from modern FPGA technology, namely Intel HyperFlex, to deliver better performance, cost, and power.

Our design choices for FPGA SmartNICs include the Corundum project, an open-source, high-performance FPGA-based Network Interface Card (NIC) platform. Corundum supports In-Network Processing, for which we have integrated MLE’s Network Protocol Accelerator Platform (NPAP), based on the TCP/UDP/IP Full Accelerator from Fraunhofer HHI.

To enable network protocol processing at line rates of 100 Gbps or faster, we have optimized this implementation for the Intel HyperFlex architecture. The result is a “turbo charged” FPGA SmartNIC which combines several advantages:

  • NPAP with high throughput for those “heavy” TCP data streams which make up most of the network traffic.
  • NPAP for those latency sensitive TCP connections where TCP round-trip time (RTT) may dominate the entire system’s response time.
  • Corundum processing in open-source Linux software for the rest, i.e. all those administrative and control TCP connections which hardly use any bandwidth and which are not latency sensitive.
  • Performance optimizations utilizing Intel HyperFlex to break the 500 MHz fMax barrier and to avoid FPGA resource “bloat”.
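
To illustrate why clock frequency matters at these line rates, the following minimal sketch relates line rate, datapath width, and the clock frequency needed to keep up. It assumes an idealized pipeline that consumes one full datapath word per clock cycle and ignores framing overhead. The function name and the datapath widths are illustrative assumptions, not figures from the MLE or Fraunhofer HHI implementation.

```python
def required_clock_mhz(line_rate_gbps: float, datapath_bits: int) -> float:
    """Minimum clock in MHz to sustain the line rate.

    Idealized assumption: one full datapath word is processed per cycle,
    with no framing or protocol overhead.
    """
    bits_per_second = line_rate_gbps * 1e9
    return bits_per_second / datapath_bits / 1e6


if __name__ == "__main__":
    # Hypothetical datapath widths, chosen only for illustration.
    for width in (128, 256, 512, 1024):
        mhz = required_clock_mhz(100, width)
        print(f"{width:5d}-bit datapath at 100 Gbps: {mhz:8.2f} MHz")
```

Under these assumptions, a narrower datapath needs a higher clock frequency, while a wider datapath lowers the clock requirement at the cost of more FPGA resources. A higher achievable fMax therefore permits a narrower, less resource-hungry datapath.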

Latency Measurement of 10G/25G/50G/100G TCP-Cores using RTL Simulation

Distributed Systems-of-Systems which, for example, connect smart sensor hubs with centralized processing via Ethernet require very low transport latencies in order to deliver short response times. Such latencies are difficult for system designers to evaluate, and things get worse if the measurement setup and methodology are not clearly explained and the results cannot be reproduced. Therefore, in this Technical Brief we describe how we use the Questa Advanced Simulator from Siemens EDA to measure and analyze latency in a network protocol processing system. We also provide the most recent latency values for NPAP, the TCP/IP Stack from Fraunhofer HHI, which, as it turns out, is very competitive with other solutions. Being integrators ourselves, we believe we owe this to the FPGA ecosystem!

High Level Synthesis for Intel and Xilinx FPGAs

Missing Link Electronics (MLE) has been an early adopter of High Level Synthesis (HLS) for FPGAs. In particular for Domain-Specific Architectures, which aim to accelerate algorithms and communication protocols, HLS delivers on the promised benefits:

  • Increased productivity as we can focus on the behavior and let HLS do the scheduling and resource mapping
  • Better portability across FPGA device families and even across FPGA device vendors

This MLE Technical Brief describes our findings when using HLS to implement a telecommunications network protocol accelerator on FPGAs. Driven by the project’s need for short Time-to-Market, major portions have been implemented in C/C++ using HLS. And given the application’s large unit volume, it was important to evaluate cost/performance across a set of Intel and Xilinx FPGA devices.

Our example uses Intel and Xilinx HLS to implement a specialized Packet FIFO. This Packet FIFO is then integrated as a design block into a block-based top-level design. Although Intel HLS and Xilinx HLS behave quite differently and require special code, we did see a benefit from using HLS compared to “classical” RTL design using VHDL and/or Verilog HDL.

Hence, we encourage the reader to follow a similar approach.

NVMe Streaming

NVMe Streamer for High-Speed FPGA Data Acquisition & Recording

This Technical Brief is about high-speed data streaming with FPGAs. It explains how best to record data that is received by an FPGA, for example from high-speed data acquisition, and that needs to be stored on an NVMe SSD (Non-Volatile Memory Express Solid-State Drive) after FPGA-based Data-in-Motion processing. It also covers the opposite direction, where data streams out of an SSD into an FPGA for Data-in-Motion processing.

Deterministic Networking with TSN-10/25/50/100G

Growing Demand for Deterministic Networking

We all observe a growing need to connect computers with each other with shorter delays (i.e. lower latencies) and higher bandwidth, in particular for High-Performance Computing (HPC) in the data center and in embedded systems such as advanced industrial robotics or autonomous vehicles. Processing TCP/IP-based network protocols at speeds of 10 Gbps and beyond demands kernel bypass solutions (such as Intel’s DPDK, Solarflare’s/Xilinx’ Onload, or Mellanox/NVIDIA VMA) and/or so-called TOEs (TCP Offload Engines).

Domain-Specific Architectures (DSA) use so-called heterogeneous computing elements, also known as Cores, with the objective of putting the compute burden where it belongs. This is a well-established approach going back to the early days when an x86 CPU was partnered with an x87 for better floating-point processing. Today, it is common to deploy various flavors of Cores, for example:

  • DSP Cores for digital signal processing in telecommunications
  • Shader Cores optimized for image processing, as they can be found in modern Graphics Processing Units (GPU) 
  • Tensor Processing Unit (TPU) Cores which are optimized for Artificial Intelligence and Deep Learning

This is because such (special purpose) fixed-function or programmable-function accelerator Cores are optimized for a particular domain and, when properly used, not only take processing load off the (general purpose) CPU but also deliver better overall performance (data processed per unit of time) and better efficiency (performance per Watt).

Over the following pages we will make a case for processing TCP/IP over TSN over 10/25/50/100 Gigabit Ethernet on dedicated Cores, which has significant advantages in particular for real-time Ethernet and Deterministic Networking. These so-called TCP-TSN-Cores can be integrated either in FPGAs or in SoCs (ASIC and ASSP). As we will show, TCP-TSN-Cores are more than just a TOE – the commonly used approach for network protocol acceleration. By running the entire network protocol stack from OSI Layer 2 to at least Layer 4 in a dedicated integrated circuit – a so-called Full Accelerator – we can remove (general purpose) CPUs entirely from the datapath.

Hence, TCP-TSN-Cores can deliver very low, bounded, and deterministic latency with the predictable scalability needed for 10/25/50/100 Gigabit Deterministic Networking.

Recording with NVMe SSDs

Sustained, High-Speed Data Recording with NVMe SSDs

MLE provides NVMe Streamer, an FPGA-based technology which enables users to stream data directly between Programmable Logic (PL) and NVM Express (NVMe) SSDs. The objective behind NVMe Streamer was to provide a solution for high-speed data recording (and replay) without any CPUs involved, either because your FPGA does not have an embedded CPU or because you are looking for a solution with deterministically high read/write bandwidth and performance scalability.

