NX-GZIP hardware acceleration on Talos II POWER9

POWER9 processors[1] ship with two hardware compression engines: NX-842 (a kernel crypto API accelerator for the 842 algorithm) and NX-GZIP (a gzip/deflate-compatible engine accessible from userspace). NX-GZIP is the interesting one: it provides zlib-compatible acceleration via LD_PRELOAD without rebuilding any software. This post documents the setup on a Talos II running Arch Linux ppc64le (kernel 6.19.11) and runs some simple benchmarks to get a feel for the performance improvement to expect.

Background

POWER9 exposes the compression hardware through the VAS (Virtual Accelerator Switchboard) subsystem[2]. There are two engines:

  • NX-842: 842 algorithm, registered in the kernel crypto API. Loads automatically once nx-compress-powernv.ko is active. No userspace setup required.[3]
  • NX-GZIP: gzip/deflate engine, exposed as /dev/crypto/nx-gzip. Requires a one-time NVRAM flag and the libnxz userspace library.[2]

The Arch Linux ppc64le kernel has CONFIG_PPC_VAS=y and CONFIG_CRYPTO_DEV_NX=y already set — no kernel rebuild is needed.
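On another distribution, the two options can be checked without digging through the kernel package. A small sketch that reads /proc/config.gz (available only when the kernel was built with CONFIG_IKCONFIG_PROC; it degrades gracefully otherwise):

```python
import gzip

# The two options this setup needs, as stated above.
WANTED = ("CONFIG_PPC_VAS=y", "CONFIG_CRYPTO_DEV_NX=y")

def missing_options(config_lines, wanted=WANTED):
    """Return the wanted options that are absent from the config lines."""
    have = {line.strip() for line in config_lines}
    return [opt for opt in wanted if opt not in have]

try:
    with gzip.open("/proc/config.gz", "rt") as f:
        print("missing:", missing_options(f))
except OSError:
    print("/proc/config.gz not available (CONFIG_IKCONFIG_PROC not set)")
```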

Step 1: Enable VAS userspace access in NVRAM

Userspace access to the engine is gated by the vas-user-space NVRAM option, which the skiboot firmware reads at boot. A kexec (fast reset) is not sufficient; only a full IPL picks up NVRAM changes.

sudo nvram --update-config "vas-user-space=enable" --partition ibm,skiboot
# Full power cycle required — kexec will not work

# Verify
sudo nvram -p ibm,skiboot --print-config
# "ibm,skiboot" Partition
# --------------------------
# vas-user-space=enable

After reboot, verify in the OPAL message log:

grep "vas-user-space" /sys/firmware/opal/msglog
# NVRAM: Searched for 'vas-user-space' found 'enable'
# VAS: Initialized chip 0 / VAS: Initialized chip 8
# NX0: gzip Coprocessor Enabled / NX8: gzip Coprocessor Enabled

Step 2: Verify /dev/crypto/nx-gzip exists

With CONFIG_PPC_VAS and CONFIG_CRYPTO_DEV_NX enabled (both present in the Arch ppc64le kernel), the device node is created automatically by the VAS subsystem on boot:

ls -la /dev/crypto/nx-gzip
# crw------- 1 root root 236, 0 ...

Step 3: Create system group and udev rule

To make using the device easier, we assign it a dedicated group and proper permissions. Note that udev requires a system group (GID < 1000) for device node rules; with a regular user group the rule silently fails with "Not a system group" and is ignored.

sudo groupadd --system nx-gzip
sudo usermod -aG nx-gzip $USER
echo 'KERNEL=="nx-gzip", GROUP="nx-gzip", MODE="0660"' | \
  sudo tee /etc/udev/rules.d/99-nx-gzip.rules
sudo udevadm control --reload
sudo udevadm trigger --subsystem-match=nx-gzip

Verify:

sudo udevadm test /devices/virtual/nx-gzip/nx-gzip 2>&1 | grep -E "GROUP|MODE"
# GROUP="nx-gzip": Set group ID: 939
# MODE="0660":    Set mode: 0660

Users in the nx-gzip group can now use the crypto device. Log out and back in (or use newgrp nx-gzip) for group membership to take effect.
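The same checks can be done from a script. A small sketch using Python's grp module, with the group name and device path from the udev rule above (it reports what is missing instead of failing on machines without the device):

```python
import grp, os, stat

def check_nx_gzip(dev="/dev/crypto/nx-gzip", group="nx-gzip"):
    """Report group membership and device permissions as a list of strings."""
    report = []
    try:
        gid = grp.getgrnam(group).gr_gid
        report.append(f"in {group} group: {gid in os.getgroups()}")
    except KeyError:
        report.append(f"group {group} does not exist")
    if os.path.exists(dev):
        st = os.stat(dev)
        report.append(f"{dev}: mode {stat.S_IMODE(st.st_mode):o}, gid {st.st_gid}")
    else:
        report.append(f"{dev} not present")
    return report

for line in check_nx_gzip():
    print(line)
```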

Step 4: Build libnxz

The userspace library that intercepts libz calls and reroutes them to the hardware lives at github.com/libnxz/power-gzip. Clone it.

One build fix is needed: lib/nx_gzlib.c line 140 declares digit as char*, but strpbrk returns const char*, which gcc -Werror rejects:

--- a/lib/nx_gzlib.c
+++ b/lib/nx_gzlib.c
@@ -140 +140 @@
-	char* digit;
+	const char* digit;

If you prefer not to apply the patch by hand, these lines make the same fix with sed and build a working library:

cd ~/dat/src/power-gzip
./configure
sed -i 's/\tchar\* digit;/\tconst char* digit;/' lib/nx_gzlib.c
make -j$(nproc) -C lib
# Produces: lib/.libs/libnxz.so.0.0.65

Step 5: Use via LD_PRELOAD

libnxz intercepts zlib API calls (compress, deflate, inflate, etc.) from programs that dynamically link libz.so. It does not work with the gzip binary — GNU gzip bundles its own deflate and never calls libz.so.

For programs that do link libz, the generic invocation is:

LD_PRELOAD=<path-to>/libnxz.so <program>

To verify hardware is actually used:

# Run it
NX_GZIP_LOGFILE=/tmp/nx.log NX_GZIP_VERBOSE=2 NX_GZIP_TRACE=8 LD_PRELOAD=.../libnxz.so <program>

# Verify use
grep "deflate(nx)" /tmp/nx.log   # count should be > 0

This works with python3 (via the zlib module), rsync, pigz, Java, and any other program that dynamically links libz.so.

Does not work with: gzip, zstd, lz4 — these have their own compression implementations.
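For the programs that do go through libz, no code changes are needed at all. The round-trip below runs unmodified either way: under LD_PRELOAD the deflate/inflate calls are serviced by the engine, without it by software zlib, and the compressed format is the same:

```python
import zlib

# Plain zlib round-trip via Python's zlib module -- exactly the calls
# libnxz intercepts when the interpreter runs under LD_PRELOAD.
data = b"Hello world this is some compressible text data " * 1000
compressed = zlib.compress(data, 1)
assert zlib.decompress(compressed) == data
print(f"ratio: {len(compressed) / len(data):.3f}")
```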

Before wrapping a binary, check that it does link libz.so and not libz-ng.so.2 (zlib-ng):

ldd /usr/bin/someprogram | grep libz
# Must show libz.so.1 — not libz-ng.so.2

For example, git on my machine links libz-ng.so.2, so libnxz cannot intercept it — zero NX operations will be recorded.
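To triage several binaries at once, a small sketch that shells out to ldd (assumes ldd is available and the binaries are dynamically linked; the path below is illustrative):

```python
import shutil, subprocess

def classify(ldd_output):
    """Classify the zlib flavour from ldd output text."""
    if "libz-ng.so" in ldd_output:
        return "zlib-ng (libnxz cannot intercept)"
    if "libz.so" in ldd_output:
        return "zlib (interceptable)"
    return "no zlib dependency"

def zlib_flavor(binary):
    out = subprocess.run(["ldd", binary], capture_output=True, text=True).stdout
    return classify(out)

# Example binary; substitute whatever you plan to wrap.
if shutil.which("ldd"):
    print(zlib_flavor("/usr/bin/curl"))
```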

Benchmarks

Benchmark: Python's zlib.compress() with level 1, comparing software zlib against NX-GZIP under LD_PRELOAD.

import zlib, os, time

# Random data (incompressible — worst case)
data = os.urandom(1024 * 1024) * 50   # 50 MB
t = time.perf_counter()
out = zlib.compress(data, 1)
print(f'{len(data) / (time.perf_counter() - t) / 1024**2:.0f} MB/s')

# Compressible data
data = (b'Hello world this is some compressible text data ' * 64) * (1024 * 32)   # 96 MB
t = time.perf_counter()
out = zlib.compress(data, 1)
print(f'{len(data) / (time.perf_counter() - t) / 1024**2:.0f} MB/s')

NX-GZIP results (zlib level 1, POWER9 DD2.2, 2 sockets / 8 cores):

Input type     Size    Software zlib   NX-GZIP     Speedup
Random data    50 MB   24 MB/s         710 MB/s    ~29×
Compressible   96 MB   355 MB/s        6000 MB/s   ~17×

NX-842 (kernel crypto API, separate engine, via nx-compress-powernv.ko):

Path                    Throughput     Speedup
Software 842 fallback   ~104 MB/s      (baseline)
NX-842 hardware         ~11000 MB/s    ~106×

It's an artificial benchmark, but the benefit is pretty clear.

Real-world use: mksquashfs (270 MB source tree, 32 threads)

Input: the skiboot git repository (~270 MB of source code).

Variant    Wall time   User CPU   CPU%     Output size
Software   0.516s      7.87s      1561%    63.8 MB
NX-GZIP    0.102s      0.22s      522%     67.2 MB
Speedup    ~5×         ~36×                -5% ratio

NX trades ~5% compression ratio for a 5× wall-time and 36× CPU reduction. At 32 threads the software path saturates all cores; NX offloads deflate to the coprocessor and frees the CPU for other work. ERR_NX_TARGET_SPACE retries in the verbose log are normal for highly compressible data: they mean the engine splits the output into chunks, not that it fell back to software.
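Tools like mksquashfs reach the engine through zlib's streaming API rather than one-shot compress(). A minimal sketch of that pattern (chunk size is an arbitrary example value):

```python
import zlib

def stream_compress(data, chunk=64 * 1024, level=1):
    """Feed data to deflate in chunks, as archivers do; under libnxz
    each deflate call on the stream is serviced by the NX engine."""
    co = zlib.compressobj(level)
    parts = [co.compress(data[i:i + chunk]) for i in range(0, len(data), chunk)]
    parts.append(co.flush())
    return b"".join(parts)

data = b"compressible " * 100_000
out = stream_compress(data)
assert zlib.decompress(out) == data
print(f"{len(data)} -> {len(out)} bytes")
```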

Example wrapper scripts

To use the accelerated compression with, say, curl, a simple wrapper script early in your PATH is sufficient. A thin wrapper in ~/bin/ transparently applies LD_PRELOAD to the curl binary, which links libz.so:

#!/bin/sh
exec env LD_PRELOAD=/usr/local/lib/libnxz.so /usr/bin/curl "$@"

Programs verified to link libz.so and confirmed working: curl, wget, cargo, ffmpeg, bsdtar, mksquashfs, unsquashfs, qemu-img, mariadb-dump, pg_dump, pg_restore, pg_basebackup.
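To wrap several of these at once, a small generator can stamp out the same wrapper per program. A sketch; the library path is the install location assumed above, and the demo writes into a temporary directory (in practice, point it at ~/bin):

```python
import stat, tempfile
from pathlib import Path

LIBNXZ = "/usr/local/lib/libnxz.so"   # assumed install path

def write_wrapper(name, bindir):
    """Create an LD_PRELOAD wrapper for /usr/bin/<name> in bindir."""
    path = Path(bindir) / name
    path.write_text(
        f'#!/bin/sh\nexec env LD_PRELOAD={LIBNXZ} /usr/bin/{name} "$@"\n'
    )
    # Add execute bits on top of the current mode.
    path.chmod(path.stat().st_mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return path

with tempfile.TemporaryDirectory() as d:
    for prog in ["curl", "wget", "cargo"]:
        print(write_wrapper(prog, d))
```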