Below are tips for various workloads.
Descriptions of ZFS internals that have an effect on application performance follow.
Adaptive Replacement Cache
For decades, operating systems have used RAM as a cache to avoid waiting on disk IO, which is extremely slow. This concept is called page replacement. Until ZFS, virtually all filesystems used the Least Recently Used (LRU) page replacement algorithm, which keeps the most recently used data in memory and evicts the least recently used data first. Unfortunately, the LRU algorithm is vulnerable to cache flushes, where an occasional brief change in workload removes all frequently used data from the cache. The Adaptive Replacement Cache (ARC) algorithm was implemented in ZFS to replace LRU. It solves this problem by maintaining four lists:
- A list for recently cached entries.
- A list for recently cached entries that have been accessed more than once.
- A list for entries evicted from the first list.
- A list for entries evicted from the second list.
Data is evicted from the first list while an effort is made to keep data in the second list. In this way, ARC is able to provide a better cache hit rate than LRU, and therefore better performance.
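The difference can be illustrated with a small simulation. The sketch below is not the ARC algorithm: it is a deliberately simplified two-list cache with a fixed split and no lists of evicted entries, so it cannot adapt the way ARC does. The class names, capacities and key names are all assumptions made for illustration. It only shows why keeping entries that were accessed more than once on a separate list protects them from a one-off scan that flushes a plain LRU cache.

```python
from collections import OrderedDict


class LRUCache:
    """Plain LRU: a single list, least recently used entry evicted first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def access(self, key):
        """Return True on a hit; on a miss, cache the key and return False."""
        if key in self.entries:
            self.entries.move_to_end(key)
            return True
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used
        self.entries[key] = True
        return False


class TwoListCache:
    """Simplified stand-in for the first two lists described above.

    'recent' holds entries seen once, 'frequent' holds entries seen more
    than once. The split is fixed and evicted entries are not remembered,
    which is where this sketch departs from ARC.
    """

    def __init__(self, recent_capacity, frequent_capacity):
        self.recent_capacity = recent_capacity
        self.frequent_capacity = frequent_capacity
        self.recent = OrderedDict()
        self.frequent = OrderedDict()

    def access(self, key):
        if key in self.frequent:
            self.frequent.move_to_end(key)
            return True
        if key in self.recent:
            # Second access: promote from the recent list to the frequent list.
            del self.recent[key]
            if len(self.frequent) >= self.frequent_capacity:
                self.frequent.popitem(last=False)
            self.frequent[key] = True
            return True
        # Miss: new entries only ever displace other once-seen entries.
        if len(self.recent) >= self.recent_capacity:
            self.recent.popitem(last=False)
        self.recent[key] = True
        return False


def hot_hits_after_scan(cache, hot_keys, scan_keys):
    """Warm the cache with hot keys, run a one-off scan, count surviving hot keys."""
    for _ in range(2):
        for key in hot_keys:
            cache.access(key)
    for key in scan_keys:
        cache.access(key)
    return sum(cache.access(key) for key in hot_keys)


HOT = ["a", "b", "c"]
SCAN = ["scan%d" % i for i in range(8)]

lru_hits = hot_hits_after_scan(LRUCache(6), HOT, SCAN)
two_list_hits = hot_hits_after_scan(TwoListCache(3, 3), HOT, SCAN)
print("hot keys still cached after the scan: LRU=%d, two-list=%d" % (lru_hits, two_list_hits))
```

Both caches hold six entries in total. The scan of eight cold keys pushes every hot key out of the LRU cache, while in the two-list cache the scan only churns the list of once-seen entries.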
In addition, a feature called L2ARC has been implemented which scans entries that are next to be evicted and caches them. The data stored in ARC and L2ARC can be controlled via the primarycache and secondarycache settings respectively, which can be set on both zvols and datasets. Possible settings are all, none and metadata. It is possible to improve performance when a zvol or dataset hosts an application that does its own caching by caching only metadata. One example is PostgreSQL. Another would be a virtual machine using ZFS.
Top-level vdevs contain an internal property called ashift, which stands for alignment shift. It is set at vdev creation and is immutable. It can be read using `zdb`. It is calculated as the maximum base 2 logarithm of the physical sector size of any child vdev, and it alters the disk format such that writes are always done according to it. This makes 2^ashift the smallest possible IO on a vdev. Configuring ashift correctly is important because partial sector writes incur a penalty where the sector must be read into a buffer before it can be written. ZFS makes the implicit assumption that the sector size reported by drives is correct and calculates ashift based on that.
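The calculation described above can be written out in a few lines. This is a minimal sketch of the arithmetic only, not ZFS code; the function names are assumptions, and the sector sizes passed in stand for whatever the drives report.

```python
def ashift_for(sector_sizes):
    """Return the maximum base 2 logarithm of the given physical sector sizes."""
    shifts = []
    for size in sector_sizes:
        if size <= 0 or size & (size - 1):
            raise ValueError("sector size must be a power of 2: %r" % (size,))
        shifts.append(size.bit_length() - 1)  # exact log2 for powers of 2
    return max(shifts)


def is_partial_sector_write(offset, length, physical_sector_size):
    """True if the write does not cover whole physical sectors.

    Such a write forces the sector to be read into a buffer before it
    can be written back.
    """
    return offset % physical_sector_size != 0 or length % physical_sector_size != 0


# A mirror of one 512-byte drive and one 4096-byte drive gets ashift=12,
# so the smallest IO on that vdev is 2**12 = 4096 bytes.
print(ashift_for([512, 4096]), 1 << ashift_for([512, 4096]))

# A drive that reports 512 bytes but really uses 4096-byte sectors turns
# a 512-byte write into a partial sector write.
print(is_partial_sector_write(0, 512, 4096))
```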
In an ideal world, physical sector size is always reported correctly and therefore, this requires no attention. Unfortunately, this is not the case. The sector size on all storage devices was 512-bytes prior to the creation of flash-based solid state drives. Some operating systems, such as Windows XP, were written under this assumption and will not function when drives report a different sector size.
Flash-based solid state drives came to market around 2007. These devices report 512-byte sectors, but the actual flash pages, which roughly correspond to sectors, are never 512 bytes. The early models used 4096-byte pages while the newer models have moved to an 8192-byte page. In addition, "Advanced Format" hard drives have been created which also use a 4096-byte sector size. Partial page writes suffer from performance degradation similar to that of partial sector writes. In some cases, the design of NAND-flash makes the performance degradation even worse, but that is beyond the scope of this description.
Reporting the correct sector sizes is the responsibility of the block device layer. Unfortunately, this means that the proper handling of drives that misreport their sector sizes differs across platforms. The respective methods are as follows:
- sd.conf on Illumos
- gnop on FreeBSD
- -o ashift= on ZFSOnLinux (http://zfsonlinux.org/faq.html#HowDoesZFSonLinuxHandlesAdvacedFormatDrives)
- -o ashift also works with both Mac-ZFS (pool version 8) and the next generation Mac-ZFS prototype (pool version 5000).
-o ashift= is convenient, but it is flawed in that creating a pool whose top-level vdevs have different optimal sector sizes requires the use of multiple commands. A newer syntax that will rely on the actual sector sizes has been discussed as a cross platform replacement and will likely be implemented in the future.
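To make the "multiple commands" point concrete, the sketch below builds the command lines that would be needed on an implementation that accepts -o ashift= for both `zpool create` and `zpool add`. It only constructs argument lists and does not run anything. The helper name, the pool name `tank` and the device names are hypothetical.

```python
def pool_commands(pool, top_level_vdevs):
    """Build one zpool command per top-level vdev, each with its own ashift.

    top_level_vdevs is a list of (ashift, vdev_spec) pairs, where vdev_spec
    is the list of words describing the vdev, e.g. ["mirror", "sda", "sdb"].
    """
    commands = []
    for index, (ashift, spec) in enumerate(top_level_vdevs):
        verb = "create" if index == 0 else "add"  # first vdev creates the pool
        commands.append(["zpool", verb, "-o", "ashift=%d" % ashift, pool] + list(spec))
    return commands


# Hypothetical layout: a mirror of 4096-byte drives plus a mirror of
# SSDs with 8192-byte pages.
layout = [
    (12, ["mirror", "diskA", "diskB"]),
    (13, ["mirror", "ssdA", "ssdB"]),
]
for command in pool_commands("tank", layout):
    print(" ".join(command))
```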
In addition, Richard Yao has contributed a database of drives known to misreport sector sizes to the ZFSOnLinux project. It is used to automatically adjust ashift without the assistance of the system administrator. This approach is unable to fully compensate for misreported sector sizes whenever drive identifiers are used ambiguously (e.g. virtual machines, iSCSI LUNs, some rare SSDs), but it does a great amount of good. The format is roughly compatible with Illumos' sd.conf and it is expected that other implementations will integrate the database in future releases. Strictly speaking, this database does not belong in ZFS, but the difficulty of patching the Linux kernel (especially older ones) necessitated that this be implemented in ZFS itself for Linux. The same is true for Mac-ZFS. However, FreeBSD and Illumos are both able to implement this in the correct layer.
Internally, ZFS allocates data using multiples of 512 bytes. To be specific, it allocates power-of-2 sized blocks from 512 bytes to 128KB and connects them as needed (via gang blocks) to scale up or down to the required size, which is a multiple of 512 bytes. When compression is enabled, it will occur on these blocks whenever at least 15% savings can be realized. The reduction is done from either recordsize (on datasets) or volblocksize (on zvols).
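The decision described above can be sketched as follows. This is an illustration under stated assumptions rather than ZFS code: Python's `zlib` stands in for the ZFS compression algorithms, the 15% threshold is simply the figure quoted above passed in as a parameter, and the function name is made up for the example.

```python
import os
import zlib

SECTOR = 512  # allocation granularity assumed from the text above


def stored_size(block, min_savings=0.15):
    """Return (allocated_bytes, compressed) for one record or zvol block.

    The compressed form is kept only if it saves at least min_savings;
    either way the result is rounded up to a multiple of 512 bytes.
    """
    candidate = zlib.compress(block)  # stand-in compressor
    if len(candidate) <= len(block) * (1 - min_savings):
        payload, compressed = len(candidate), True
    else:
        payload, compressed = len(block), False
    allocated = -(-payload // SECTOR) * SECTOR  # round up to whole sectors
    return allocated, compressed


record = 128 * 1024
print(stored_size(bytes(record)))       # highly compressible: all zeroes
print(stored_size(os.urandom(record)))  # incompressible: stored as-is
```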
The following compression algorithms are available:
- LZ4: New algorithm added after feature flags were created. It is significantly superior to LZJB in basically all metrics tested.
- LZJB: Default compression algorithm (compression=on) for ZFS. It was created to satisfy the desire for a compression algorithm suitable for use in filesystems. Specifically, it provides fair compression, has a high compression speed, has a high decompression speed and detects incompressible data quickly.
- GZIP (1 through 9): Classic Lempel-Ziv implementation. It provides high compression, but often makes IO CPU-bound.
- ZLE (Zero Length Encoding): A very simple algorithm that only compresses zeroes.
If you want to use compression and are uncertain which to use, use LZ4. It averages a 2.1:1 compression ratio while gzip-1 averages 2.7:1. Both figures are obtained from testing by the LZ4 project on the Silesia corpus. The greater compression ratio of gzip is usually only worthwhile for rarely accessed data.
ZFS datasets use an internal recordsize of 128KB by default. The dataset recordsize is the basic unit of data used for internal copy-on-write on files. Partial record writes require that data be read from either ARC (cheap) or disk (expensive). recordsize can be set to any power of 2 from 512 bytes to 128 kilobytes. Software that writes in fixed record sizes will benefit from the use of a matching recordsize.
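The cost of a mismatched recordsize can be seen by working out which records a write touches and whether each one is only partially covered. The sketch below is simplified arithmetic, not ZFS code: the function name is an assumption, and it ignores the file's current size, so a write that creates a brand new record is still reported as partial.

```python
def records_touched(offset, length, recordsize):
    """Return [(record_index, is_partial)] for a write of length bytes at offset.

    A partial record must first be read from ARC or from disk before the
    new copy of the record can be written.
    """
    if length <= 0:
        return []
    end = offset + length
    first = offset // recordsize
    last = (end - 1) // recordsize
    touched = []
    for index in range(first, last + 1):
        record_start = index * recordsize
        covered = min(end, record_start + recordsize) - max(offset, record_start)
        touched.append((index, covered < recordsize))
    return touched


# An 8KB database page written into a 128KB record rewrites the whole record.
print(records_touched(0, 8192, 128 * 1024))
# With a matching 8KB recordsize the same write covers exactly one record.
print(records_touched(0, 8192, 8192))
```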
Zvols have a volblocksize property that is analogous to recordsize. The default size is 8KB, which is the size of a page on the SPARC architecture. Workloads that use smaller sized IOs (such as swap on x86, which uses 4096-byte pages) will benefit from a smaller volblocksize.
Deduplication is implemented using linear hashing on top of the ZFS Attribute Processor. Each unique entry requires 320 bytes of storage and includes a 256-bit hash. The entries currently use kmem_alloc(), which causes the entries to become 512 bytes once internal fragmentation in kmem_alloc() is considered. Each pool has a global deduplication table shared across all datasets and zvols on which deduplication is enabled. Each hash represents either a unique record in a dataset (of size recordsize) or a block in a zvol (of size volblocksize).
Contrary to popular belief, ZFS need not scan the entire table whenever doing a write because of the use of linear hashing. However, the miss rate can be exceptionally high when there is insufficient memory for the entire table to the point where performance feels like a linear scan. Each miss requires a random seek and the total penalty is linear in the number of misses. Misses occur on both unique data that is being written and duplicate data that is not cached.
The consequence is that sufficient memory to store deduplication data is required for good performance. The deduplication data is considered metadata and therefore can be cached via L2ARC. In addition, the deduplication table will compete with other metadata for metadata storage, which can have a negative effect on performance. Simulation of the number of deduplication table entries needed for a given pool can be done using the -S option to zdb; on a pool that already uses deduplication, the -D option reports statistics for the existing table. A simple multiplication of the entry count by 512 bytes then gives the actual metadata requirements. Alternatively, you can estimate an upper bound on the number of unique blocks: divide the amount of storage you plan to use on each dataset by its recordsize (taking into account that small files each count as a full recordsize for the purposes of deduplication), divide the amount you plan to use on each zvol by its volblocksize, and sum the results. Multiplying that sum by 512 bytes gives the upper bound on memory.
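The upper-bound estimate is simple arithmetic, worked through below. The function name and the example sizes are assumptions chosen for illustration. The caller is expected to have already inflated the planned usage of datasets with many small files, since each small file counts as a full record.

```python
DDT_ENTRY_BYTES = 512  # in-memory size of one entry, per the text above


def ddt_upper_bound_bytes(datasets, zvols):
    """Estimate an upper bound on deduplication table memory.

    datasets: list of (planned_bytes, recordsize) pairs
    zvols:    list of (planned_bytes, volblocksize) pairs
    """
    entries = 0
    for planned_bytes, blocksize in list(datasets) + list(zvols):
        entries += -(-planned_bytes // blocksize)  # round up to whole blocks
    return entries * DDT_ENTRY_BYTES


KIB, GIB, TIB = 1024, 1024 ** 3, 1024 ** 4

# Hypothetical pool: 1 TiB in a dataset with the default 128KB recordsize
# and a 100 GiB zvol with the default 8KB volblocksize.
estimate = ddt_upper_bound_bytes(
    datasets=[(1 * TIB, 128 * KIB)],
    zvols=[(100 * GIB, 8 * KIB)],
)
print("%.2f GiB" % (estimate / GIB))
```

In this example the zvol needs more table memory than the dataset even though it holds a tenth of the data, because its blocks are sixteen times smaller.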
Writes to partial records are expensive. Give PostgreSQL its own dataset for its databases. Set recordsize=8K on your dataset to avoid expensive partial record writes. Also, PostgreSQL implements its own cache algorithm similar to ARC that is specialized for databases; avoiding double caching with primarycache=metadata will likely increase performance.
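The two settings above are applied with `zfs set`. The sketch below only builds the command lines so that they can be reviewed before being run. The helper name and the dataset name `tank/postgres` are hypothetical.

```python
def postgres_tuning_commands(dataset):
    """Build the zfs commands for the PostgreSQL settings described above."""
    return [
        # Match PostgreSQL's 8KB pages to avoid partial record writes.
        ["zfs", "set", "recordsize=8K", dataset],
        # Leave data caching to PostgreSQL; keep only metadata in ARC.
        ["zfs", "set", "primarycache=metadata", dataset],
    ]


for command in postgres_tuning_commands("tank/postgres"):
    print(" ".join(command))
```

Note that recordsize only affects records written after the change, so it is best set before the database files are created.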