Projects

Brainstorm, 18th September 2013
Notes from the meeting that preceded Delphix's semi-annual Engineering Kick Off (EKO)
 * immediately pursuable ideas, plus long-term and strategic thoughts.

Inter-platform coordination ideas
Ideas for projects that would help coordinate changes between platforms …

Mechanism for pulling changes from one place to another
Make it easier to build, test, code review, and integrate ZFS changes into illumos.

Cross-platform test suite
One source base, rather than porting STF to every platform?

Maybe integrate XFS Test Suite.

Userland ZFS
We already have ztest / libzpool and want to:
 * expand this to also be able to test more of zfs in userland
 * be able to run /sbin/zfs, /sbin/zpool against userland implementation
 * be able to run most of testrunner (and/or STF) test suite against userland implementation

ZFS (ZPL) version feature flags
Import ZFS on Linux sa=xattr into illumos.

/dev/zfs ioctl interface versioning
Ensure that future additions/changes to the interface maintain maximum compatibility with userland tools.

Enable FreeBSD's Linux jails and illumos lx branded zones (brandz) to use the ZFS on Linux utilities.

Port ZPIOS from ZFS on Linux to illumos
ZPIOS example

This would require a rewrite to not use Linux interfaces.

Virtual machine images with OpenZFS
To easily try OpenZFS on a choice of distributions within a virtual machine:
 * images could be built for running on public clouds
 * images for installing to real hardware.

Discuss …

ZFS channel programs
Possible Channel Programs:
 * Recursive rollback (revert to a snapshot on dataset and all children, needs a new command line flag, -r is already taken)

Device removal
Based on indirect vdevs, rather than bprewrite.

Reflink support
The two sides of reflink [LWN.net]

Unified ashift handling
[illumos-zfs Specifying ashift when creating vdevs] (2013-07-03)

RAID-Z hybrid allocator
Preferably compatible with pool version 29, for interoperability with Solaris 10u11.

Replace larger ZIO caches with explicit pages
Subproject: document useful kernel interfaces for page manipulation on various platforms

Improved SPA namespace collision management
Needed mostly by virtual machine hosts. Work in progress in Gentoo.

Temporary pool names in zpool import
 * [illumos-zfs RFC: zpool import -t for temporary pool names] (2013-07-01)

Temporary pool names in zpool create.

Realtime TRIM
FreeBSD already has realtime TRIM support. Saso has an implementation for illumos at Nexenta which he hopes to upstream in the next month or two (2015-10-08).

For more info see: http://www.open-zfs.org/wiki/Features#TRIM_Support

Free space TRIM

 * walk metaslab space maps and issue discard commands to the vdevs.
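The walk described above can be sketched in a few lines. This is a toy model, assuming a metaslab is represented by its start offset, size, and a list of allocated (offset, length) segments; `free_ranges` and `issue_discard` are illustrative names, not real OpenZFS interfaces.

```python
# Toy sketch of free-space TRIM: invert a metaslab's allocated segments into
# free ranges, then "issue" one discard per free range. All names here are
# illustrative; real space maps are log-structured on-disk structures.

def free_ranges(ms_start, ms_size, allocated):
    """Yield (offset, length) gaps not covered by the allocated segments."""
    cursor = ms_start
    for off, length in sorted(allocated):
        if off > cursor:
            yield (cursor, off - cursor)
        cursor = max(cursor, off + length)
    end = ms_start + ms_size
    if cursor < end:
        yield (cursor, end - cursor)

def trim_metaslab(ms_start, ms_size, allocated, issue_discard):
    """Walk the metaslab and discard every free range."""
    for off, length in free_ranges(ms_start, ms_size, allocated):
        issue_discard(off, length)
```

A real implementation would also need to avoid discarding space that was freed in a txg that is not yet safely on disk.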

Platform agnostic encryption support
Preferably compatible with pool version 30, implemented as a pool feature flag.

Developer resources include a link to a November 2010 blog post by Oracle.

The early ZFS encryption code published in the zfs-crypto repository of OpenSolaris.org could be a starting point. A copy is available from Richard Yao upon request.

Deduplication improvements
Potential algorithms:
 * Bloom filter.
 * Cuckoo filter.

Convert synchronous writes to asynchronous writes when an ARC miss occurs during a lookup against the DDT.

Use dedicated kmem_cache for deduplication table entries:
 * easy to implement
 * will reduce DDT entries from 512 bytes to 320 bytes.
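The impact of that 512-to-320-byte reduction adds up quickly. A back-of-the-envelope calculation, using the entry sizes quoted above and a made-up workload for illustration:

```python
# Arithmetic behind the savings claimed above, assuming the quoted entry
# sizes (512 bytes today, 320 bytes with a dedicated kmem_cache). The
# workload figures are illustrative assumptions.

def ddt_memory(unique_blocks, entry_size):
    """In-core bytes needed to hold one DDT entry per unique block."""
    return unique_blocks * entry_size

unique = 1 * 2**40 // (128 * 2**10)   # 1 TiB of unique 128 KiB records
before = ddt_memory(unique, 512)      # 4 GiB
after = ddt_memory(unique, 320)       # 2.5 GiB
```

For this workload the dedicated cache saves 1.5 GiB of RAM, a 37.5% reduction, before any allocator overhead is counted.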

ZFS Compression / Dedup to favour provider
Currently, as a storage provider, if a customer has 100 MB of quota available and uploads 50 MB of data which compresses/dedups to 25 MB, the customer's quota is only reduced by 25 MB. The saving favours the customer. As a provider, it is desirable to be able to reverse this logic, so that the customer's quota is reduced by 50 MB and the 25 MB saved by compression/dedup is to the provider's benefit. This is similar to how Google/Amazon/Cloud-Feature.acme already handle it: you get 2 GB of quota, and any compression saving is to Google's benefit.


 * property(?) to charge quota usage by before-compression-dedup size.
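The two accounting policies can be stated precisely in a few lines. This is an illustrative sketch only; the policy names are hypothetical and do not correspond to any existing ZFS property.

```python
# Illustrative sketch of the two quota-accounting policies described above.
# The policy names ("physical"/"logical") are hypothetical, not real ZFS
# property values.

def quota_charge(logical_bytes, physical_bytes, policy="physical"):
    """Return how many bytes to charge against the customer's quota."""
    if policy == "physical":   # current behaviour: savings favour the customer
        return physical_bytes
    if policy == "logical":    # proposed: savings favour the provider
        return logical_bytes
    raise ValueError(policy)
```

In the example above, the 50 MB upload that stores as 25 MB charges 25 MB today, but would charge the full 50 MB under the proposed policy.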

Periodic Data Validation
Problem: ZFS does a great job detecting data errors due to lost writes, media errors, and storage bugs, but only when the user actually accesses the data. A scrub in its current form can take a very long time and can have a highly deleterious impact on overall performance.

Data validation in ZFS should be specified according to data or business needs. Kicking off a scrub every day, week, or month doesn’t directly express that need. More likely, the user wants to express their requirements like this:
 * “Check all old data at least once per month”
 * “Make sure all new writes are verified within 1 day”
 * “Don’t consume more than 50% of my IOPS capacity”

Note that constraints like these may overlap, but that's fine; the user just needs to indicate priority, and the system must alert the user of violations.

I suggest a new type of scrub. Constraints should be expressed and persisted with the pool. Execution of the scrub should tie into the ZFS IO scheduler. That subsystem is ideally situated to identify a relatively idle system. Further, we should order scrub IOs to be minimally impactful. That may mean having a small queue of outstanding scrub IOs that we’d send to the device, or it might mean that we try to organize large, dense contiguous scrub reads by sorting by LBA.
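Constraints in that style can be sanity-checked with simple arithmetic before the scheduler ever runs. A minimal sketch, in which every name and figure is an illustrative assumption rather than a real interface:

```python
# Rough feasibility check for constraints like "check all data once per month"
# combined with "don't consume more than 50% of my IOPS capacity". All names
# and numbers are illustrative assumptions.

def scrub_feasible(pool_bytes, io_size, device_iops, iops_fraction, deadline_days):
    """True if the pool can be read once within the deadline under the IOPS cap."""
    ios_needed = pool_bytes / io_size              # reads required to cover the pool
    budget_iops = device_iops * iops_fraction      # the user's IOPS ceiling
    seconds_needed = ios_needed / budget_iops
    return seconds_needed <= deadline_days * 86400

# e.g. 10 TiB of 128 KiB blocks on a 500 IOPS device, 50% cap, monthly deadline
```

A check like this could run when constraints are set, so the system can alert the user immediately if the constraints cannot all be met, rather than discovering the violation mid-scrub.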

Further, after writing data to disk, there’s a window for repair while the data is still in the ARC. If ZFS could read that data back, then it could not only detect the failure, but correct it even in a system without redundant on-disk data.

- ahl

Sorted Scrub
Problem: The current scrub algorithm is not rotational-media friendly; it generally produces a random read workload and becomes limited by the underlying storage system's seek latency.

Saso Kiselkov of Nexenta gave a talk on Scrub/Resilver Performance at the OpenZFS Developer Summit 2016 (September 2016):

Video, Slides

The following has been copied from Matt Ahrens' reply on the mailing list in July 2016, here: https://www.listbox.com/member/archive/274414/2016/07/sort/time_rev/page/1/entry/1:30/20160709172505:98792D9A-461B-11E6-905A-93F52D8D822E/

We had an intern work on "sorted scrub" last year. Essentially the idea was to read the metadata to gather into memory all the BP's that need to be scrubbed, sort them by DVA (i.e. offset on disk) and then issue the scrub i/os in that sorted order. However, memory can't hold all of the BP's, so we do multiple passes over the metadata, each pass gathering the next chunk of BP's. This code is implemented and seems to work but probably needs some more testing and code cleanup.
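The multi-pass approach described above can be sketched in miniature. Here BPs are stood in for by plain disk offsets, and the list `all_bps` plays the role of walking the metadata; none of this is the intern's actual code.

```python
# Toy sketch of multi-pass sorted scrub: each pass re-walks the metadata,
# keeps only the next memory-sized chunk of block pointers beyond what has
# already been scrubbed, sorts that chunk by offset, and issues those I/Os.
# BPs are modelled as plain integer offsets.

def sorted_scrub_passes(all_bps, mem_limit, issue):
    """Scrub every BP in offset order using at most mem_limit BPs of memory."""
    last_done = -1                       # highest offset scrubbed so far
    while True:
        # One full pass over the "metadata", gathering the next chunk.
        chunk = sorted(bp for bp in all_bps if bp > last_done)[:mem_limit]
        if not chunk:
            return
        for bp in chunk:                 # sequential, sorted scrub I/Os
            issue(bp)
        last_done = chunk[-1]
```

The cost noted below falls out directly: with `n` BPs and memory for `m` of them, the metadata is walked `n/m` times.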

One of the downsides of that approach is having to do multiple passes over the metadata if it doesn't all fit in memory (which it typically does not). In some circumstances, this is worth it, but in others not so much. To improve on that, we would like to do just one pass over the metadata to find all the block pointers. Rather than storing the BP's sorted in memory, we would store them on disk, but only roughly sorted. There are several ways we could do the sorting, which is one of the issues that makes this problem interesting.

We could divide each top-level vdev into chunks (like metaslabs, but probably a different number of them) and for each chunk have an on-disk list of BP's in that chunk that need to be scrubbed/resilvered. When we find a BP, we would append it to the appropriate list. Once we have traversed all the metadata to find all the BP's, we would load one chunk's list of BP's into memory, sort it, and then issue the resilver i/os in sorted order.

As an alternative, it might be better to accumulate as many BP's as fit in memory, sort them, and then write that sorted list to disk. Then remove those BP's from memory and start filling memory again, write that list, etc. Then read all the sorted lists in parallel to do a merge sort. This has the advantage that we do not need to append to lots of lists as we are traversing the metadata. Instead we have to read from lots of lists as we do the scrubs, but this should be more efficient. We also don't have to determine beforehand how many chunks to divide each vdev into.
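This alternative is a classic external merge sort, and can be sketched as follows; Python lists stand in for the on-disk run files, and the names are illustrative only.

```python
# Sketch of the spill-and-merge alternative above: fill memory, sort, spill a
# sorted run "to disk", repeat; then merge-read all runs in offset order.
# Lists stand in for on-disk run files; BPs are plain integer offsets.

import heapq

def build_runs(bps, mem_limit):
    """One pass over the metadata, spilling a sorted run whenever memory fills."""
    runs = []
    for i in range(0, len(bps), mem_limit):
        runs.append(sorted(bps[i:i + mem_limit]))
    return runs

def merged_scrub_order(runs):
    """Read all sorted runs in parallel, merging them into one offset order."""
    return list(heapq.merge(*runs))
```

Unlike the multi-pass variant, the metadata is traversed only once; the price is the extra I/O to write and re-read the spilled runs.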

Real-time replication
https://www.illumos.org/issues/7166

Lustre feature ideas
The Lustre project supports the use of ZFS as an Object Storage Target. They maintain their own feature request page with ZFS project ideas. Below is a list of project ideas that are well defined, benefit Lustre, and have no clear benefit outside of that context.

Collapsible ZAP objects
E.g. fatzap -> microzap downgrades.

Data on separate devices
Architecture - ZFS for Lustre …

TinyZAP
Architecture - ZFS TinyZAP …

Awareness-raising ideas
… awareness of the quality, utility, and availability of open source implementations of ZFS.

Quality
Please add or discuss your ideas.

ZFS and OpenZFS in three minutes (or less)
A very short and light video/animation to grab the attention of people who don't yet realise why ZFS is an extraordinarily good thing.

For an entertaining example of how a little history (completely unrelated to storage) can be taught in ninety seconds, see Hohenwerfen Fortress - The Imprisoned Prince Bishop (context) (part of the ZONE Media portfolio).

A very short video for ZFS and OpenZFS might throw in all that's good, using plain English wherever possible, including:
 * very close to the beginning, the word resilience
 * verifiable integrity of data and so on
 * some basic comparisons (NTFS, HFS Plus, ReFS)

– with the 2010 fork mentioned along the way, but (blink and you'll miss that milestone) the lasting impression from the video is that ZFS is great (years ahead of the alternatives) and OpenZFS is rapidly making it better for a broader user base.

Hint: there exist many ZFS-related videos but many are a tad dry, and cover a huge amount of content. Aim for two minutes :-) …  discuss…

Availability
Please add or discuss your ideas.

General
The OpenZFS channel on YouTube, begun October 2013 – to complement the automatically generated ZFS channel.

https://twitter.com/DeirdreS/status/322422786184314881 (2013-02) draws attention to ZFS-related content amongst videos listed by Deirdré Straughan.