Picking The Right Granularity When Buffering PCIe/NVMe Data
Non-Volatile Memory Express (NVMe) is a storage interface specification that runs on top of PCIe. Its goal is to leverage the parallelism and low latency of modern SSDs. A typical PCIe data transfer moves its payload in chunks of either 128 or 256 bytes.
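As a quick back-of-the-envelope sketch, the following Python snippet shows how a single 4 KiB NVMe block splits into PCIe TLPs for these two common Maximum Payload Size settings. The 24-byte per-TLP overhead is an assumed, Gen3-style value for framing, header and LCRC, not a figure from the specification, and will vary with the actual link configuration.

```python
# Back-of-the-envelope sketch: how a 4 KiB NVMe block maps onto PCIe TLPs
# for the two common Maximum Payload Size (MPS) settings of 128 and 256 bytes.
# The 24-byte per-TLP overhead (framing + header + LCRC) is an assumed,
# Gen3-style value -- adjust it for your actual link configuration.

BLOCK_SIZE = 4096          # one NVMe block / host page, in bytes
TLP_OVERHEAD = 24          # assumed per-TLP overhead in bytes

for mps in (128, 256):
    num_tlps = BLOCK_SIZE // mps
    payload_efficiency = BLOCK_SIZE / (BLOCK_SIZE + num_tlps * TLP_OVERHEAD)
    print(f"MPS {mps:>3} B: {num_tlps:>2} TLPs per 4 KiB block, "
          f"~{payload_efficiency:.1%} payload efficiency")
```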
SSDs employ several tricks (wear leveling, SLC-to-TLC conversion) to improve their read and write speeds as well as their lifespan. One downside is that their read and write speed is not constant over a long read/write period, which can result in backpressure.
Some applications cannot tolerate backpressure and may end up in an erroneous state when used with a standard SSD subsystem.
One possible mitigation strategy is to place an elastic buffer between the SSD and the data source. On an FPGA, there are several ways to implement such an elastic buffer. At MLE, we investigated BlockRAM (BRAM), UltraRAM (URAM), Dynamic RAM (DRAM) and the second generation of High Bandwidth Memory (HBM2). Each memory technology has advantages and disadvantages when it comes to handling different data chunk sizes. We present our findings below.
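To get a feeling for how large such an elastic buffer has to be, the following minimal Python model tracks the buffer fill level while a constant-rate source keeps writing and the SSD temporarily slows down. All rates and durations are made-up example numbers, chosen only to illustrate the sizing approach.

```python
# Minimal model of an elastic buffer between a constant-rate data source and
# an SSD whose write speed dips temporarily (e.g. when its SLC cache fills).
# All rates and durations below are made-up example numbers.

SOURCE_RATE = 3.0      # GB/s, constant ingest rate from the data source
SSD_FAST    = 3.5      # GB/s, SSD write speed while its cache has room
SSD_SLOW    = 1.0      # GB/s, SSD write speed during a slowdown
SLOW_START, SLOW_END = 2.0, 2.5   # seconds, one assumed slowdown window
DT = 0.001             # simulation time step in seconds

fill = 0.0             # current buffer fill level in GB
peak = 0.0             # worst-case fill level seen so far
t = 0.0
while t < 5.0:
    drain = SSD_SLOW if SLOW_START <= t < SLOW_END else SSD_FAST
    fill = max(0.0, fill + (SOURCE_RATE - drain) * DT)
    peak = max(peak, fill)
    t += DT

print(f"Peak buffer fill: {peak:.2f} GB -> "
      f"minimum elastic buffer size for this scenario")
```

In this example scenario the peak fill level is roughly 1 GB, which already hints that on-chip memory alone cannot absorb longer SSD slowdowns.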
BlockRAM (BRAM)
BRAM is a RAM primitive found on virtually every FPGA. It has two ports, meaning that two different memory locations can be accessed in each clock cycle. BRAM can be configured as an 18 Kb or 36 Kb FIFO.
BRAM is a viable option for small data chunks but might be too precious for large chunks.
The AMD/Xilinx ZCU106 evaluation board has a total of 11 Mb of BRAM.
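A short sizing sketch makes the "too precious" point concrete. It assumes BRAM is used purely as data storage, ignoring ECC and FIFO control overhead, and uses the 11 Mb total quoted above together with the PCIe chunk sizes from the introduction.

```python
# Sizing sketch for BRAM as an elastic buffer (data storage only, ignoring
# ECC and FIFO control overhead).  11 Mb is the BRAM total quoted for the
# ZCU106; the chunk sizes are the PCIe payload sizes from the introduction.

BRAM36_BITS = 36 * 1024          # one 36 Kb BRAM
TOTAL_BRAM_BITS = 11 * 1024**2   # ~11 Mb of BRAM on the ZCU106

for chunk_bytes in (128, 256):
    per_bram = BRAM36_BITS // (chunk_bytes * 8)
    total = TOTAL_BRAM_BITS // (chunk_bytes * 8)
    print(f"{chunk_bytes} B chunks: {per_bram} per 36 Kb BRAM, "
          f"~{total} chunks in all BRAM "
          f"(~{total * chunk_bytes / 1024**2:.2f} MiB)")
```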
UltraRAM (URAM)
Similar to BRAM, URAM is a dual-ported RAM. In contrast to BRAM ports, each of the two URAM ports can only perform a single operation, read or write, per clock cycle. This is because both ports internally operate on a single memory array, with the operation of port A performed before that of port B within the same clock cycle.
The AMD/Xilinx ZCU106 evaluation board has a total of 27 Mb of URAM. This is roughly 2.5 times the available BRAM.
URAM is a good middle ground between BRAM and DRAM: more URAM is available than BRAM, yet, unlike DRAM, it is still on-chip memory and works well with smaller data chunks.
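The middle-ground argument can be put in numbers with a small sketch: how long can each memory absorb a complete SSD stall at an assumed ingest rate of 3 GiB/s? The BRAM and URAM capacities are the ZCU106 figures quoted above; the 4 GiB DDR4 module is an example value for comparison.

```python
# Sketch: how long each memory type can absorb a complete SSD stall at an
# assumed ingest rate of 3 GiB/s.  BRAM/URAM capacities are the ZCU106
# figures from above; the 4 GiB DDR4 module is an example for comparison.

INGEST_RATE = 3 * 1024**3        # bytes per second (assumed 3 GiB/s)

capacities = {
    "BRAM (11 Mb)":  11 * 1024**2 / 8,
    "URAM (27 Mb)":  27 * 1024**2 / 8,
    "DDR4 (4 GiB)":  4 * 1024**3,
}

for name, cap_bytes in capacities.items():
    print(f"{name}: buffers a full stall for "
          f"{cap_bytes / INGEST_RATE * 1000:8.2f} ms")
```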
DDRx DRAM
BRAM and URAM are on-chip memory. DDR3/DDR4/DDR5 DRAM, on the other hand, is off-chip memory, which means that some form of interconnect has to sit between the PL or PS and the DRAM. While BRAM and URAM store data in the Mb range, DRAM makes it possible to store multiple GiB of data.
DRAM is useful for large data chunks, as its efficiency drops for small chunk sizes, where per-access overheads such as row activation dominate the transfer time.
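The following rough estimate illustrates this effect. It assumes the worst case that every chunk hits a different (closed) row, and the ~45 ns activate/precharge penalty and the 64-bit DDR4-3200 bus are illustrative assumptions, not measured values.

```python
# Rough estimate of DRAM efficiency versus chunk size, assuming the worst
# case that every chunk hits a different (closed) row.  The ~45 ns row
# activate/precharge penalty and the 64-bit @ 3200 MT/s bus are
# illustrative DDR4-style assumptions, not measured values.

BUS_BYTES_PER_NS = 64 / 8 * 3.2       # ~25.6 B/ns peak transfer rate
ROW_OVERHEAD_NS = 45.0                # assumed activate + precharge penalty

for chunk in (128, 256, 4096, 65536):
    transfer_ns = chunk / BUS_BYTES_PER_NS
    efficiency = transfer_ns / (transfer_ns + ROW_OVERHEAD_NS)
    print(f"{chunk:>6} B chunks: ~{efficiency:.0%} of peak DRAM bandwidth")
```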
High Bandwidth Memory (HBM2)
HBM2 is similar to DRAM in that it, too, requires some form of interconnect between the PL or PS and the memory. HBM2 is not available on all devices and is expensive, but, as the name suggests, it offers high bandwidth.
Similar to DRAM, HBM2 works best with large data chunks. In contrast to DRAM, it achieves higher bandwidth and is thus best suited for systems that require more bandwidth than DRAM can provide.
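To put the bandwidth difference in perspective, the snippet below compares commonly quoted peak figures: a single 64-bit DDR4-3200 channel versus one 1024-bit HBM2 stack at 2.0 Gb/s per pin. Both are ballpark, vendor-dependent numbers rather than guarantees.

```python
# Comparison sketch: commonly quoted peak bandwidths for a single 64-bit
# DDR4-3200 channel versus one HBM2 stack.  Treat both figures as rough,
# vendor-dependent ballpark numbers rather than guarantees.

DDR4_CHANNEL_GBPS = 25.6   # 64 bit * 3200 MT/s
HBM2_STACK_GBPS = 256.0    # 1024-bit stack interface at 2.0 Gb/s per pin

ratio = HBM2_STACK_GBPS / DDR4_CHANNEL_GBPS
print(f"One HBM2 stack provides roughly {ratio:.0f}x the peak bandwidth "
      f"of a single DDR4-3200 channel")
```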
Conclusion
In this article, we presented an overview of different memory technologies and their suitability for different data chunk sizes. Small chunk sizes are difficult for DRAM but work well in URAM or BRAM. For large data chunks, BRAM and URAM are also viable options but might be too precious. HBM2 is a good option if the bandwidth of DRAM is not sufficient.
In one of our next posts we will discuss how to combine different types of memory (BRAM, URAM and DRAM, for example) into a hybrid memory subsystem for a high-speed NVMe storage system.
Learn more about our IP core offerings in NVMe Streaming.
You know our Mission: If It Is Packets, We Make It Go Faster – today the many flavors of memory for buffering data in FPGAs.
www.missinglinkelectronics.com
MLE (Missing Link Electronics) is offering technologies and solutions for Domain-Specific Architectures, which focus on heterogeneous computing using FPGAs. MLE is headquartered in Silicon Valley with offices in Neu-Ulm and Berlin, Germany.