Shared Virtual-Memory Objects for Disaggregated Memory with Limited Coherency

Disaggregated memory pools [Generated with AI]

Context

The emerging Compute Express Link (CXL) standard extends the reach of main memory beyond a single machine. In a nutshell, it allows byte-granular memory access via the PCIe interface. This enables devices (e.g., GPUs) to access and cache host memory, and it enables hosts to extend their memory capacity with expansion cards. More recent versions of the CXL standard (2.0/3.0) go even further and allow memory disaggregation with centralized memory pools. Combined with coherent access across machines, this enables efficient communication via shared memory. Unfortunately, tracking cache-line states is expensive, so we expect only a small fraction of a memory pool to be cache-coherent across machines. For the remaining, much larger part of the memory, software mechanisms must ensure synchronization.

Problem

With Morsels, we introduced a novel memory-management paradigm that shifts from managing individual pages to managing larger virtual-memory objects, technically represented as subtrees of the page-table hierarchy. This reduces management overhead and enables very fast transfers between address spaces. Now that the memory domain extends to shared CXL memory pools, we want to extend the Morsel concept accordingly. Shared memory objects should reside entirely on the memory pool (including their page tables), and multiple hosts should be able to interact with such an object simultaneously. To cope with the limited coherency, the idea is to place the page tables in coherent memory for synchronization and to implement an ownership model for data pages in software.
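To illustrate why representing an object as a page-table subtree makes transfers between address spaces cheap, here is a minimal sketch. It assumes, purely for illustration, that a morsel is a single last-level page table with 512 entries for 4 KiB pages (2 MiB, as with 4-level paging on x86-64). `PageTable`, `build_morsel`, `transfer_morsel`, and `transfer_per_page` are hypothetical names, not the actual Morsels implementation. Moving the whole subtree rewrites one entry in each parent table, while the per-page baseline touches every page-table entry.

```python
from dataclasses import dataclass, field

ENTRIES_PER_TABLE = 512  # assumption: 9 index bits per level, as on x86-64


@dataclass
class PageTable:
    # Sparse model: index -> child PageTable or physical frame number (int)
    entries: dict = field(default_factory=dict)


def build_morsel(frames):
    """Hypothetical helper: a morsel as one last-level table (up to 2 MiB of 4 KiB pages)."""
    assert len(frames) <= ENTRIES_PER_TABLE
    leaf = PageTable()
    for i, pfn in enumerate(frames):
        leaf.entries[i] = pfn
    return leaf


def transfer_morsel(src_dir, dst_dir, src_idx, dst_idx):
    """Move a whole subtree: one entry cleared in the source directory and one set
    in the destination, independent of how many pages the morsel maps."""
    subtree = src_dir.entries.pop(src_idx)
    dst_dir.entries[dst_idx] = subtree
    return 2  # number of page-table entry updates


def transfer_per_page(src_leaf, dst_leaf):
    """Baseline: remap every page individually (one clear plus one set per page)."""
    writes = 0
    for i, pfn in list(src_leaf.entries.items()):
        dst_leaf.entries[i] = pfn
        del src_leaf.entries[i]
        writes += 2
    return writes
```

For a fully populated 2 MiB morsel, `transfer_morsel` needs 2 entry updates, whereas `transfer_per_page` needs 1,024. The model deliberately ignores TLB shootdowns and locking, which a real kernel implementation has to handle.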

Goal

On the implementation side, this could be achieved with so-called overlay-morsels, one per host. Initially, all parts of an overlay are shared read-only with the authoritative truth on the memory pool. The first write access to a page triggers a page fault, upon which the host acquires ownership of this specific page, meaning that its overlay maps the page writable. The rest of the memory object remains unaffected. Additionally, the page-fault handler must ensure that other hosts can no longer access the page: it clears the page's present bit in the authoritative truth (and possibly in other overlays) and performs a flush. To perform such flushes and invalidations, we expect a mechanism for sending interrupts to the other attached hosts. The same mechanism will also be used to initiate write-backs when another machine requests a page that is exclusively owned.
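The ownership protocol described above can be sketched as a small single-threaded simulation. This is an illustrative model under simplifying assumptions, not the planned kernel implementation: `Pool`, `Host`, and their methods are hypothetical names; `present` and `owner` stand in for state kept in the coherent page tables; `cache` models dirty data that has not yet been flushed to the non-coherent pool; and cross-host interrupts are modeled as direct method calls counted in `interrupts`. Locking, TLBs, and concurrent faults are ignored.

```python
RO, RW = "ro", "rw"


class Pool:
    """Hypothetical model of one shared object on the memory pool."""

    def __init__(self, npages):
        self.data = [0] * npages        # data pages (not coherent across hosts)
        self.present = [True] * npages  # present bits of the authoritative truth
        self.owner = [None] * npages    # host id with exclusive ownership, if any
        self.hosts = {}                 # host id -> Host, for simulated interrupts
        self.interrupts = 0             # number of simulated cross-host interrupts


class Host:
    def __init__(self, hid, pool):
        self.hid, self.pool = hid, pool
        self.overlay = {}  # page -> RO or RW; a missing entry means "not mapped"
        self.cache = {}    # page -> dirty value held only in this host's caches
        pool.hosts[hid] = self

    def _reclaim(self, page):
        # Interrupt the current owner (if it is another host) to write back and release.
        owner = self.pool.owner[page]
        if owner is not None and owner != self.hid:
            self.pool.interrupts += 1
            self.pool.hosts[owner].on_writeback_request(page)

    def on_writeback_request(self, page):
        # Interrupt handler on the owning host: flush dirty data, drop the mapping.
        if page in self.cache:
            self.pool.data[page] = self.cache.pop(page)
        self.overlay.pop(page, None)
        self.pool.owner[page] = None
        self.pool.present[page] = True

    def read(self, page):
        if page not in self.overlay:          # page fault
            if not self.pool.present[page]:   # exclusively owned by another host
                self._reclaim(page)
            self.overlay[page] = RO
        return self.cache.get(page, self.pool.data[page])

    def write(self, page, value):
        if self.overlay.get(page) != RW:      # first write: acquire ownership
            self._reclaim(page)
            self.pool.present[page] = False   # revoke access in the authoritative truth
            for other in self.pool.hosts.values():
                if other is not self and other.overlay.pop(page, None) is not None:
                    self.pool.interrupts += 1  # invalidate another host's read-only mapping
            self.pool.owner[page] = self.hid
            self.overlay[page] = RW
        self.cache[page] = value  # stays in local caches until a write-back
```

In a typical sequence, hosts A and B first read page 0 read-only. A's first write revokes B's mapping (one interrupt) and keeps the new value in A's caches. B's next read then faults on the cleared present bit, which interrupts A to write the value back before B maps the page read-only again. Other pages of the object are never touched.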

The main difference between this approach and existing RDMA-based approaches is that the data always resides on the shared memory pool and is accessed at cache-line granularity rather than page-wise. Due to the lack of hardware supporting the CXL 3.0 standard, the evaluation will be based on a multi-NUMA server system that emulates the performance characteristics of CXL-attached memory.
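To make the granularity difference concrete, the following back-of-the-envelope sketch counts the bytes that cross the interconnect for two access patterns. It assumes 4 KiB pages and 64-byte cache lines, and it treats a page-granular system as fetching each touched page once and a cache-line-granular system as fetching each touched line once; caching, prefetching, and protocol overhead are ignored. `bytes_moved` is a hypothetical helper introduced only for this illustration.

```python
PAGE_SIZE = 4096  # assumed page size in bytes
LINE_SIZE = 64    # assumed cache-line size in bytes


def bytes_moved(addresses, granularity):
    """Bytes transferred if every distinct granule touched is fetched exactly once."""
    granules = {addr // granularity for addr in addresses}
    return len(granules) * granularity


# Sparse pattern: one 8-byte load in each of 8 different pages.
sparse = [i * PAGE_SIZE for i in range(8)]
# Dense pattern: every 8-byte word of a single page.
dense = list(range(0, PAGE_SIZE, 8))

for name, pattern in (("sparse", sparse), ("dense", dense)):
    print(name, bytes_moved(pattern, PAGE_SIZE), bytes_moved(pattern, LINE_SIZE))
# prints:
# sparse 32768 512
# dense 4096 4096
```

For scattered accesses, page granularity moves 64 times more data under these assumptions, while a dense scan of a whole page costs the same either way.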

Topics: CXL, paging, disaggregated memory, Linux kernel

References

An Introduction to the Compute Express Link (CXL) Interconnect: general introduction to the CXL concept

Papers

Morsels: Explicit Virtual Memory Objects
Alexander Halbuer, Christian Dietrich, Florian Rommel, Daniel Lohmann. Proceedings of the 1st Workshop on Disruptive Memory Systems (DIMES). Association for Computing Machinery, 2023.
DOI: 10.1145/3609308.3625267