NX-GZIP hardware acceleration on Talos II POWER9
POWER9 processors ship with two hardware compression engines: NX-842 (a kernel crypto API accelerator for the 842 algorithm) and NX-GZIP (a gzip/deflate-compatible engine accessible from userspace). NX-GZIP is the interesting one: it provides zlib-compatible acceleration via LD_PRELOAD without rebuilding any software. This post documents a possible setup on a Talos II running Arch Linux ppc64le (kernel 6.19.11) and shows some simple benchmarks to get a feel for the performance improvement to expect.
Background
POWER9 exposes the compression hardware through the VAS (Virtual Accelerator Switchboard) subsystem. There are two engines:
- NX-842: implements the 842 algorithm, registered in the kernel crypto API. Loads automatically once nx-compress-powernv.ko is active; no userspace setup required.
- NX-GZIP: a gzip/deflate engine, exposed as /dev/crypto/nx-gzip. Requires a one-time NVRAM flag and the libnxz userspace library.
The Arch Linux ppc64le kernel has CONFIG_PPC_VAS=y and CONFIG_CRYPTO_DEV_NX=y already set — no kernel rebuild is needed.
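You can confirm both options against the running kernel rather than trusting the package. This is just a sketch; it assumes either IKCONFIG_PROC (/proc/config.gz) or a config file under /boot:

```shell
# Check the running kernel's config, if exposed via /proc/config.gz;
# fall back to the boot-time config file otherwise.
if [ -r /proc/config.gz ]; then
    zgrep -E 'CONFIG_PPC_VAS|CONFIG_CRYPTO_DEV_NX' /proc/config.gz || echo "options not set"
elif [ -r "/boot/config-$(uname -r)" ]; then
    grep -E 'CONFIG_PPC_VAS|CONFIG_CRYPTO_DEV_NX' "/boot/config-$(uname -r)" || echo "options not set"
else
    echo "kernel config not exposed"
fi
```

Both options should show up as =y; if they print as =m or not at all, the VAS device node in Step 2 will not appear.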
Step 1: Enable VAS userspace access in NVRAM
To expose the engine to userspace, the skiboot firmware reads the vas-user-space NVRAM option at boot. A kexec (fast reset) is not sufficient — only a full IPL picks up NVRAM changes.
```shell
sudo nvram --update-config "vas-user-space=enable" --partition ibm,skiboot
# Full power cycle required — kexec will not work

# Verify
sudo nvram -p ibm,skiboot --print-config
# "ibm,skiboot" Partition
# --------------------------
# vas-user-space=enable
```

After reboot, verify in the OPAL message log:

```shell
grep "vas-user-space" /sys/firmware/opal/msglog
# NVRAM: Searched for 'vas-user-space' found 'enable'
# VAS: Initialized chip 0 / VAS: Initialized chip 8
# NX0: gzip Coprocessor Enabled / NX8: gzip Coprocessor Enabled
```
Step 2: Verify /dev/crypto/nx-gzip exists
With CONFIG_PPC_VAS and CONFIG_CRYPTO_DEV_NX enabled (both present in the Arch ppc64le kernel), the device node is created automatically by the VAS subsystem on boot:
```shell
ls -la /dev/crypto/nx-gzip
# crw------- 1 root root 236, 0 ...
```

Step 3: Create system group and udev rule
To make using the device a bit easier, we assign it a group and proper permissions. udev requires a system group (GID < 1000) for device-node rules — with a regular user group the rule silently fails with "Not a system group" and is ignored.
```shell
sudo groupadd --system nx-gzip
sudo usermod -aG nx-gzip $USER
echo 'KERNEL=="nx-gzip", GROUP="nx-gzip", MODE="0660"' | \
    sudo tee /etc/udev/rules.d/99-nx-gzip.rules
sudo udevadm control --reload
sudo udevadm trigger --subsystem-match=nx-gzip
```

Verify:

```shell
sudo udevadm test /devices/virtual/nx-gzip/nx-gzip 2>&1 | grep -E "GROUP|MODE"
# GROUP="nx-gzip": Set group ID: 939
# MODE="0660": Set mode: 0660
```
Users in the nx-gzip group can now use the crypto device.
Log out and back in (or use newgrp nx-gzip) for group membership to take effect.
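A quick sanity check, runnable as the unprivileged user after re-login, confirms the rule and group change took effect (a sketch; it only inspects the device node):

```shell
# Check that /dev/crypto/nx-gzip exists and is readable and writable
# by the current, non-root user.
if [ -c /dev/crypto/nx-gzip ]; then
    if [ -r /dev/crypto/nx-gzip ] && [ -w /dev/crypto/nx-gzip ]; then
        echo "nx-gzip: accessible"
    else
        echo "nx-gzip: present but not accessible (check group membership / udev rule)"
    fi
else
    echo "nx-gzip: device node not present"
fi
```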
Step 4: Build libnxz
The library that catches calls to libz and reroutes them to the hardware lives at github.com/libnxz/power-gzip. Clone it.
There is one build fix needed: lib/nx_gzlib.c line 140 declares digit as char*, but strpbrk returns const char*, which gcc -Werror rejects:
```diff
--- a/lib/nx_gzlib.c
+++ b/lib/nx_gzlib.c
@@ -140 +140 @@
-	char* digit;
+	const char* digit;
```
If you would rather not apply the patch by hand, these lines will give you a working library:
```shell
cd ~/dat/src/power-gzip
./configure
sed -i 's/\tchar\* digit;/\tconst char* digit;/' lib/nx_gzlib.c
make -j$(nproc) -C lib
# Produces: lib/.libs/libnxz.so.0.0.65
```

Step 5: Use via LD_PRELOAD
libnxz intercepts zlib API calls (compress, deflate, inflate, etc.) from programs that dynamically link libz.so. It does not work with the gzip binary — GNU gzip bundles its own deflate and never calls libz.so.
The generic use of the library, for programs that use libz, is:

```shell
LD_PRELOAD=<path-to>/libnxz.so <program>
```

To verify hardware is actually used:

```shell
# Run it
NX_GZIP_LOGFILE=/tmp/nx.log NX_GZIP_VERBOSE=2 NX_GZIP_TRACE=8 LD_PRELOAD=.../libnxz.so <program>

# Verify use
grep "deflate(nx)" /tmp/nx.log  # count should be > 0
```
This works with python3 (via the zlib module), rsync, pigz, Java, and any other program that dynamically links libz.so.
Does not work with: gzip, zstd, lz4 — these have their own compression implementations.
Before wrapping a binary, check that it does link libz.so and not libz-ng.so.2 (zlib-ng):
```shell
ldd /usr/bin/someprogram | grep libz
# Must show libz.so.1 — not libz-ng.so.2
```
For example, git on my machine links libz-ng.so.2, so libnxz cannot intercept it — zero NX operations will be recorded.
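The per-binary ldd check gets tedious once you start wrapping many tools, so a small triage loop helps. This is just a sketch, and the program list is an arbitrary example; adjust it to your system:

```shell
#!/bin/sh
# Classify each program by whether it links classic zlib (libz.so.1),
# zlib-ng (not interceptable by libnxz), or neither (bundled deflate
# or no compression at all).
classify() {
    bin=$(command -v "$1") || { echo "$1: not installed"; return; }
    case "$(ldd "$bin" 2>/dev/null)" in
        *libz-ng.so*) echo "$1: links zlib-ng (not interceptable)" ;;
        *libz.so.1*)  echo "$1: links libz.so.1 (interceptable)" ;;
        *)            echo "$1: no libz.so dependency (bundled or none)" ;;
    esac
}

for prog in curl wget rsync pigz gzip git; do
    classify "$prog"
done
```

On the machine described here, gzip and git would land in the last two buckets, matching the caveats above.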
Benchmarks
Benchmark: Python's zlib.compress() with level 1, comparing software zlib against NX-GZIP under LD_PRELOAD.
```python
import zlib, os, time

# Random data (incompressible — worst case)
data = os.urandom(1024 * 1024) * 50  # 50 MB
t = time.perf_counter()
out = zlib.compress(data, 1)
print(f'{len(data) / (time.perf_counter() - t) / 1024**2:.0f} MB/s')

# Compressible data
data = (b'Hello world this is some compressible text data ' * 64) * (1024 * 32)  # 96 MB
t = time.perf_counter()
out = zlib.compress(data, 1)
print(f'{len(data) / (time.perf_counter() - t) / 1024**2:.0f} MB/s')
```

NX-GZIP results (zlib level 1, POWER9 DD2.2, 2 sockets / 8 cores):
| Input type | Size | Software zlib | NX-GZIP | Speedup |
|---|---|---|---|---|
| Random data | 50 MB | 24 MB/s | 710 MB/s | ~29× |
| Compressible | 96 MB | 355 MB/s | 6000 MB/s | ~17× |
NX-842 (kernel crypto API, separate engine, via nx-compress-powernv.ko):
| Path | Throughput | Speedup |
|---|---|---|
| Software 842 fallback | ~104 MB/s | |
| NX-842 hardware | ~11000 MB/s | ~106× |
It's an artificial benchmark, but the benefit is pretty clear.
Real-world use: mksquashfs (270 MB source tree, 32 threads)
Input: the skiboot git repository (~270 MB of source code).
| Variant | Wall time | User CPU | CPU% | Output size |
|---|---|---|---|---|
| Software | 0.516s | 7.87s | 1561% | 63.8 MB |
| NX-GZIP | 0.102s | 0.22s | 522% | 67.2 MB |
| Speedup | ~5× | ~36× | ~3× | -5% ratio |
NX trades ~5% compression ratio for a 5× wall-time reduction and a 36× CPU reduction. At 32 threads the software path saturates all cores; NX offloads deflate to the coprocessor and frees the CPU for other work. ERR_NX_TARGET_SPACE retries in the verbose log are normal for highly compressible data — they mean the engine is splitting the input into chunks, not falling back to software.
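For reference, the comparison above can be reproduced along these lines. The libnxz and source-tree paths are assumptions (substitute your own), and the snippet skips itself when the pieces are not installed:

```shell
# Hypothetical paths; adjust LIBNXZ and SRC to your system.
LIBNXZ=/usr/local/lib/libnxz.so
SRC=$HOME/src/skiboot

if [ -e "$LIBNXZ" ] && [ -d "$SRC" ] && command -v mksquashfs >/dev/null 2>&1; then
    # Baseline: software zlib across 32 threads
    time mksquashfs "$SRC" /tmp/sw.sqfs -comp gzip -processors 32 -noappend
    # Accelerated: same run with deflate offloaded to the NX engine
    time env LD_PRELOAD="$LIBNXZ" mksquashfs "$SRC" /tmp/nx.sqfs -comp gzip -processors 32 -noappend
else
    echo "libnxz, mksquashfs, or the source tree not available; skipping"
fi
```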
Example wrapper scripts
To use the accelerated compression with, say, curl, a simple wrapper script early in your PATH is sufficient. A thin wrapper in ~/bin/ transparently applies LD_PRELOAD for a curl that links libz.so:
```shell
#!/bin/sh
exec env LD_PRELOAD=/usr/local/lib/libnxz.so /usr/bin/curl "$@"
```
Programs verified to link libz.so and confirmed working: curl, wget, cargo, ffmpeg, bsdtar, mksquashfs, unsquashfs, qemu-img, mariadb-dump, pg_dump, pg_restore, pg_basebackup.
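Extending that idea, a small generator can stamp out wrappers for the whole list. This is a sketch: the libnxz path and program list are assumptions, and the wrappers exec the absolute /usr/bin path so they keep working once ~/bin shadows the real binaries in PATH:

```shell
#!/bin/sh
# Generate thin LD_PRELOAD wrappers for a list of libz-linking programs.
# LIBNXZ and the program list are examples; adjust for your system.
# Note: this overwrites any existing file of the same name in BINDIR.
LIBNXZ=/usr/local/lib/libnxz.so
BINDIR=${BINDIR:-$HOME/bin}
mkdir -p "$BINDIR"

for prog in curl wget mksquashfs qemu-img; do
    real=/usr/bin/$prog
    [ -x "$real" ] || continue   # skip programs not installed
    cat > "$BINDIR/$prog" <<EOF
#!/bin/sh
exec env LD_PRELOAD=$LIBNXZ $real "\$@"
EOF
    chmod +x "$BINDIR/$prog"
done
```

Remove a wrapper (or the whole ~/bin entry from PATH) to fall back to plain software zlib at any time.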