ZFS on high latency devices

"How to stream 10Gbps of block I/O across 100ms of WAN"

This guide assumes familiarity with common ZFS commands and configuration steps. At a minimum, you should understand how ZFS categorizes I/O and how to use zpool iostat -r, -q and -w. There's no magic list of parameters to drop in; instead, there is a procedure to follow so that you can calibrate ZFS to your device. This process can also be used on local disk to identify bottlenecks and problems with data flow, but the gains may be much less significant.

Sometimes, it's useful to be able to run ZFS on something other than a local disk: for example, an iSCSI LUN tunneled across a PPP link, or a Ceph server providing an RBD from a continent away. Obviously there are limits to what we can do with that kind of latency, but ZFS can make working within these limits much easier by refactoring our data into larger blocks, efficiently merging reads and writes, and spinning up many I/O threads in a throughput-oriented situation. This guide describes a method for tuning that behavior; with it, very high performance and large I/O sizes are possible.

This approach can work well enough to saturate 10GbE when connected to high latency, high throughput remote storage. The guide is long because it attempts to isolate each variable and adjust it under the circumstances needed to see its best effect, rather than give cookie-cutter recipes that will fail badly when you're dealing with the storage of 2025, 100 milliseconds away. It's not as bad as it looks at first.

--

There are a few requirements, though:

Larger blocks are better. They are the source that your IO engine will process and they provide the granularity you will see on your pool. Over time, pools with very small blocks fragment badly, so that even with high I/O aggregation, it's not possible to issue very large operations.


 * 64K Only suitable as a write-once or receive-only pool
 * 128K Reasonable choice for a receive-only pool
 * 256K Very good choice for a receive-only pool, probably the minimum size for a pool taking TxG commit writes.
 * 512K and up: best choice for a TxG commit pool.

Larger blocks are easier to merge during I/O processing but more importantly, they maintain more original data locality and fragment the pool less over time. Dealing with high latency storage requires that we maximize our ability to merge our reads and writes.

If larger blocks will create more RMW, consider the tradeoffs. RMW on a local SSD-based pool may be acceptable in order to create large blocks for zfs send/receive backup purposes. RMW on a pool based on high latency storage may be much more painful.

--

Reads are usually not a problem. Writes must be done carefully for the best results.

The best possible situation is a pool that only receives enough writes to fill it once. This will usually sustain very large write sizes, but it is a limited use case.

The next best situation is a pool that is only written to by ZFS receive as snapshot deltas are applied and old snapshots are deleted. This is common for backup applications, resists fragmentation well, and provides consistent performance. Ideally, receive precompressed blocks to maximize write merge.

The next best is a pool that either only receives async writes, or has a SLOG on local disk. Ideally it should have a high maximum time between TxG commits and a high zfs_dirty_data_max. High latency devices work best when we can batch a lot of writes.

The other possibility is a pool receiving sync writes that go to the ZIL on the high latency device (logbias=latency, and zfs_immediate_write_size=some big number). Sync writes will be painfully slow and will sharply drop your average write size. Don't even try a pool with logbias=throughput: the increased fragmentation will destroy read performance.

--

Lots of ARC is a good thing. Lots of dirty data space can also be a good thing provided that dirty data stabilizes without hitting the maximum per-pool or the ARC limit.

--

Try zpool creation with ashift=12 first. If your device is really weird, consider a larger number, but remember that it comes with costs in terms of metadata space and achievable compression.

Many people start tinkering at the i/o issuing thread end, the vdev_max and min active counts. This is a bit like trying to get a car to go faster by opening up the exhaust - if it's a constriction, it will help to open it up, but generally the issues preventing efficient i/o are further up the chain.

What is essential is to keep the I/O pipeline fed and flowing, and to make async reads and writes as efficient at merging as possible. We figure out async writes first because they are the hardest. After they work well, everything else will fall into place.

It's best to test using zfs receive, to get the cleanest possible writes. Once you have the pipeline working well for that, it will be more clear what the impact of other I/O patterns is. Async writes in ZFS flow very roughly as follows:


 * Data

Dirty data for pool (must be stable and about 80% of dirty_data_max)


 * TxG commit

zfs_sync_taskq_batch_pct (traverses data structures to generate IO)

zio_taskq_batch_pct (for compression and checksumming)

zio_dva_throttle_enabled (ZIO throttle)


 * VDEV thread limits

zfs_vdev_async_write_min_active

zfs_vdev_async_write_max_active


 * Aggregation (set this first)

zfs_vdev_aggregation_limit (maximum I/O size)

zfs_vdev_write_gap_limit (I/O gaps)

zfs_vdev_read_gap_limit


 * block device scheduler (set this first)

You must work through this flow to determine if there are any significant issues and to maximize IO merge. The exceptions are:


 * zio_taskq_batch_pct (the default of 75% is fine)


 * agg limit and gap limits (you can reasonably guess these)


 * block device scheduler (should be noop or none)

--

K is a factor that reflects the likely size of free spaces on your pool after extended use. Expressed as a multiple of blocksize, we've found the following very rough numbers useful:

K = 10 for write-once pools

K = 4 for receive-only pools

K = 2.5 for txg commit pools with no indirect writes

Your numbers may be different, but this is a good starting point.

aggr-initial = K * blocksize

aggr-final = 3 * K * blocksize

write gap = 4 * sector size (2^ashift) = 16K for ashift=12

read gap = blocksize * 1.5, but be careful above 256K

sync taskq = 75
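
The rules of thumb above can be worked through as a small calculation. This is only a sketch: the function name `starting_limits` and its dictionary keys are made up for illustration, and the values it returns are starting points to calibrate from, not final settings.

```python
KIB = 1024

def starting_limits(blocksize, k, ashift=12):
    """Return rough starting values in bytes, following the rules of thumb above."""
    sector = 1 << ashift                      # 4096 bytes for ashift=12
    return {
        "aggr_initial": int(k * blocksize),   # K * blocksize
        "aggr_final": int(3 * k * blocksize), # 3 * K * blocksize
        "write_gap": 4 * sector,              # 16K for ashift=12
        "read_gap": int(1.5 * blocksize),     # be careful above 256K blocks
    }

# Example: a receive-only pool (K = 4) with 128K blocks.
limits = starting_limits(128 * KIB, k=4)
for name, value in limits.items():
    print(f"{name} = {value // KIB}K")
```

For 128K blocks and K = 4 this gives an initial aggregation limit of 512K and a final limit of 1536K (1.5M).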

--

The approach taken works like this:


 * Open up batch taskq, aggregation limits, write threads, and ZIO throttle:

/etc/modprobe.d/zfs.conf:

    # 1) This is only a preliminary config used to help test ZFS flow
    # 2) Do not adopt this as a long-term configuration!
    # 3) Fill out all non-static values before copying to /etc/modprobe.d/zfs.conf
    #
    # Disabling the throttle during calibration greatly aids merge
    options zfs zio_dva_throttle_enabled=0
    # TxG commit every 30 seconds
    options zfs zfs_txg_timeout=30
    # Start txg commit just before writers ramp up
    options zfs zfs_dirty_data_sync = {zfs_dirty_data_max * zfs_async_dirty_min * 0.9}
    # Save last 100 txg's information
    options zfs zfs_txg_history=100
    #
    # 0: IO aggregation
    # Limit total agg for very large blocks to blocksize + 64K and read gap to 0.75m
    options zfs zfs_vdev_aggregation_limit=blocksize * K * 3
    options zfs zfs_vdev_write_gap_limit=ashift * 4 (16k for ashift=12)
    options zfs zfs_vdev_read_gap_limit=blocksize + 64k
    #
    # 1: Set the midpoint of the write delay throttle. Recheck dirty frequently!
    options zfs zfs_delay_scale = {blocksize / expected writes per sec in GB/s}
    # so 128k block size @ 384MB/s = 128k/0.384 = 333000.
    #
    # 2: Reduce zfs_sync_taskq_batch_pct until TxG commit speed falls by 10%
    # This will usually end up at 2-5 threads depending on CPU and storage.
    options zfs zfs_sync_taskq_batch_pct=75
    #
    # 3: Reduce zfs_vdev_aggregation_limit to block size * K
    # options zfs zfs_vdev_aggregation_limit=blocksize * K
    #
    # 4: Reduce sync_read, async_read and async_write max
    # 4a: Reduce async_write_max_active
    options zfs zfs_vdev_async_write_max_active=30
    # 4b: Reduce async_read_max_active
    options zfs zfs_vdev_async_read_max_active=30
    # 4c: Reduce sync_read_max_active
    options zfs zfs_vdev_sync_read_max_active=30
    #
    # 5: Raise agg limits
    # options zfs zfs_vdev_aggregation_limit=blocksize * K * 3
    #
    # These are good enough to start with
    options zfs zfs_vdev_sync_read_min_active=4
    options zfs zfs_vdev_async_read_min_active=2
    options zfs zfs_vdev_async_write_min_active=2
    #
    # 6a: Set sync_writes:
    options zfs zfs_vdev_sync_write_min_active=10
    options zfs zfs_vdev_sync_write_max_active=20
    # 6b: Set max threads per vdev
    # options zfs zfs_vdev_max_active= SRmax * 1.25
    #
    # 7: Calibrate ZIO throttle
    # options zfs zfs_vdev_queue_depth_pct=5000
    # options zfs zio_dva_throttle_enabled=1
    #
    # 8: Recheck!

TxG commit should now drive writes without throttling for latency.
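
The preliminary config sets zfs_delay_scale to the block size divided by the expected write rate in GB/s. A minimal sketch of that arithmetic follows; the helper name `delay_scale` is made up, and decimal units (1 GB/s = 10^9 bytes per second) are an assumption chosen so that the result matches the guide's own 128k @ 384MB/s example.

```python
def delay_scale(blocksize_bytes, write_bytes_per_sec):
    """Block size divided by the expected write rate in GB/s (decimal units)."""
    return blocksize_bytes * 10**9 // write_bytes_per_sec

# The guide's example, reading 128k as 128,000 bytes: 333333, quoted as roughly 333000.
print(delay_scale(128_000, 384_000_000))

# The same rate with a binary 128 KiB block gives a slightly larger value.
print(delay_scale(128 * 1024, 384_000_000))
```

Either value is only a starting midpoint; the guide's advice to recheck dirty data frequently still applies.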

Make a zfs send file of a >20G zvol with volblocksize=128k, uncompressed. Put it somewhere where read speed will not be a problem.


 * Make sure the scheduler is "none" or "noop".
 * Make sure the pool has ashift=12 and no compression.
 * zpool create -o ashift=12 rbdpool /dev/rbd0


 * 1) Dirty data, /proc/spl/kstat/zfs/{poolname}/txgs

zfs receive into rbdpool and watch ndirty in txgs. It should sit stably near 70-80% of dirty_data_max, halfway through the dirty data throttle. If not, adjust dirty_data_max or delay_scale to get ndirty to stabilize.
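
Checking ndirty by eye gets tedious over many test runs, so the check can be scripted. The sketch below assumes the txgs kstat is a whitespace-separated table whose column header line contains `txg` and `ndirty`; it locates columns by name rather than position. The function name `ndirty_fractions` and all numbers in `SAMPLE` are made up for illustration.

```python
def ndirty_fractions(txgs_text, dirty_data_max):
    """Return (txg, ndirty / dirty_data_max) for each data row of a txgs dump."""
    rows = [line.split() for line in txgs_text.splitlines() if line.strip()]
    # Skip anything before the column header (for example a kstat preamble line).
    header_idx = next(i for i, cols in enumerate(rows) if "ndirty" in cols)
    header = rows[header_idx]
    txg_col, dirty_col = header.index("txg"), header.index("ndirty")
    result = []
    for cols in rows[header_idx + 1:]:
        if len(cols) != len(header):
            continue  # ignore malformed or truncated rows
        result.append((int(cols[txg_col]), int(cols[dirty_col]) / dirty_data_max))
    return result

# Made-up sample data, not output from a real pool.
SAMPLE = """\
18 0 0x01 100 11200 5000000 70000000
txg   birth         state ndirty      nread nwritten   reads writes otime       qtime wtime stime
1001  100000000000  C     3221225472  0     3000000000 0     3000   30000000000 5000  6000  9000000000
1002  130000000000  C     3006477107  0     2900000000 0     2900   30000000000 5000  6000  8000000000
"""

fractions = ndirty_fractions(SAMPLE, dirty_data_max=4 * 1024**3)
for txg, frac in fractions:
    print(f"txg {txg}: ndirty at {frac:.0%} of dirty_data_max")
```

On a live Linux system the text would come from /proc/spl/kstat/zfs/{poolname}/txgs, and dirty_data_max is typically readable from /sys/module/zfs/parameters/zfs_dirty_data_max; verify both paths on your own system.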

After every zfs receive test, destroy the snapshot so that you are starting from the same point.

Once dirty data is good, measure write aggregation and speed. Speed should be slow but write aggregation should be very good, around 1MB per write op on high latency disk. If not, stop and recheck everything.
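
One rough way to put a number on write aggregation is to divide write bandwidth by write operations from a scripted `zpool iostat -Hp` sample. The sketch below assumes the usual column order for that output (pool, alloc, free, read ops, write ops, read bandwidth, write bandwidth); check it against your version before relying on it. The helper name `avg_write_size` and the sample line are made up.

```python
def avg_write_size(iostat_line):
    """Average bytes per write op from one `zpool iostat -Hp` line."""
    fields = iostat_line.split()
    write_ops = int(fields[4])
    write_bw = int(fields[6])
    return write_bw / write_ops if write_ops else 0.0

# Made-up sample: 100 write ops/s moving 100 MiB/s.
sample = "rbdpool\t21474836480\t85899345920\t0\t100\t0\t104857600"
print(avg_write_size(sample))
```

When sampling with an interval, skip the first report, since it shows averages since boot rather than current activity. The request size histograms from zpool iostat -r give a more detailed picture of the same thing.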

Note that as speed goes up, you may need to use mbuffer with a 16M buffer for the receive.


 * Turn zfs_sync_taskq_batch_pct down until speed reduces 10%. This sets the pace of the initial flow within the TxG commit.

Lowering zfs_sync_taskq_batch_pct has a number of advantages. Most importantly and particularly beneficial when dealing with large blocks, it rate-limits RMW reads during TxG commit. It also seems to considerably improve I/O merge. On many systems it can go quite low before it impacts throughput.

zfs_sync_taskq_batch_pct is now the limiting factor in the TxG commit flow. Decrease it, testing with zfs receive as before, until speed drops by roughly 10%. On most systems this represents 2-5 total threads. At this point you should have a stable write flow without the ZIO throttle enabled and should see significant IO merge.
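
The step-down procedure can be sketched as a simple loop. Everything here is hypothetical scaffolding: `measure` stands in for whatever you do to apply a value of zfs_sync_taskq_batch_pct, rerun the zfs receive test, and report the resulting speed, and the step size and floor are arbitrary choices.

```python
def tune_batch_pct(measure, start=75, step=5, floor=5, drop=0.10):
    """Lower the percentage until measured speed falls `drop` below the baseline."""
    baseline = measure(start)
    pct = start
    while pct - step >= floor:
        pct -= step
        if measure(pct) <= baseline * (1 - drop):
            return pct  # first value at which speed has dropped by about 10%
    return pct          # never dropped enough; stop at the floor

# Made-up speed model for demonstration: throughput only suffers below 20%.
def fake_measure(pct):
    return min(400, pct * 20)

print(tune_batch_pct(fake_measure))
```

In practice each call to `measure` is a full receive test starting from the same pool state, so a coarse step keeps the number of runs manageable.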


 * Verify dirty data is stable and roughly at the midpoint of the dirty data throttle, when under high throughput workloads.


 * Decrease agg limits to K * blocksize


 * Decrease write threads until speed starts to reduce

Decrease zfs_vdev_aggregation_limit to 384K. Test again. If speed is much lower than before, raise the async write thread variables until you approach your previous speed, otherwise lower them until speed starts to decrease. This is to produce a stable write flow as IO aggregation diminishes as the pool fragments. Additional threads help to stabilize speed but can diminish I/O merge and cause contention if they are raised too high.


 * Verify IO merge


 * Decrease async read threads until speed reduces 20%

Test zfs send from your disk for speed, and raise or lower zfs_vdev_async_read_max_active until you reach the desired speed. Settling a little below the maximum you can handle lets other IO "float" on top of zfs send.


 * Decrease sync read threads until speed starts to reduce

Generate sync reads of comparable size. Raise or lower zfs_vdev_sync_read_max_active until you reach peak speed. Often numbers will be comparable to zfs_vdev_async_write_max_active.


 * Raise agg limit to K * blocksize * 3

Raise zfs_vdev_aggregation_limit back up to 1.5M. Test again and verify that ndirty is stable, that r/w aggregation looks good, and that IO is relatively smooth without surges.


 * Check agg size


 * optionally: adjust ZIO throttle for flow


 * Check agg size and throughput


 * Test and verify dirty data

--

IO prioritization: assume SRmax is the highest max (it usually will be). If not, find a compromise value for it so that the other max numbers are within 4 threads of SRmax. This is an old trick from Sun, set as follows:


 * SR: 4 - SRmax
 * SW: SRmax/2 - SRmax
 * AR: 2 - ARmax
 * AW: 2 - AWmax
 * Scrub: 0 - 1
 * VDEV max: SRmax * 1.25

These values are adjustable but are designed for SRmax, ARmax and AWmax to all be relatively high without fighting with each other. When SR or SW is saturated, they share SRmax worth of threads roughly equally, and allow AR and AW to share the remaining 20%. The low value for SRmin keeps sync reads from dominating other I/O.
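
The min/max table above can be expanded into module option lines once SRmax, ARmax and AWmax have been calibrated. This is a sketch only: the function name `vdev_thread_limits` is made up, and rounding SRmax/2 down and SRmax * 1.25 up are assumptions, since the guide does not say how to round.

```python
def vdev_thread_limits(sr_max, ar_max, aw_max):
    """Map the SR/SW/AR/AW/scrub ranges above onto zfs module parameter names."""
    return {
        "zfs_vdev_sync_read_min_active": 4,
        "zfs_vdev_sync_read_max_active": sr_max,
        "zfs_vdev_sync_write_min_active": sr_max // 2,   # SRmax/2, rounded down
        "zfs_vdev_sync_write_max_active": sr_max,
        "zfs_vdev_async_read_min_active": 2,
        "zfs_vdev_async_read_max_active": ar_max,
        "zfs_vdev_async_write_min_active": 2,
        "zfs_vdev_async_write_max_active": aw_max,
        "zfs_vdev_scrub_min_active": 0,
        "zfs_vdev_scrub_max_active": 1,
        "zfs_vdev_max_active": -(-5 * sr_max // 4),      # SRmax * 1.25, rounded up
    }

# Example values only; use the maxima found during calibration.
settings = vdev_thread_limits(sr_max=30, ar_max=28, aw_max=26)
for name, value in settings.items():
    print(f"options zfs {name}={value}")
```

With SRmax = 30 this yields a per-vdev maximum of 38, leaving roughly a fifth of the threads for async reads and writes when sync I/O is saturated.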

--

If AW dominates, decrease zfs_sync_taskq_batch_pct.

If SR dominates latency, decrease sync write min or increase vdev max.

If SW dominates, get a SLOG or fix your workload.

If AR dominates, consider decreasing AR max threads or the total max threads, or rate limit zfs send.

If both AR and AW get choked back, increase vdev max.

If AW gets choked back under peak IO, increase AW min threads, but only a little.

If RMW during txg commit is too slow or too aggressive, adjust zfs_sync_taskq_batch_pct.

If ndirty bounces around during txg commit, adjust delay_scale or give your dirty throttle more room to work in.

If you have read amplification, decrease your read gap.