Projects

From OpenZFS

Revision as of 18:36, 11 November 2016

Active projects

Notes from meetings

Brainstorm, 18th September 2013

Notes from the meeting that preceded Delphix's semi-annual Engineering Kick Off (EKO)

  • immediately pursuable ideas, plus long-term and strategic thoughts.

Inter-platform coordination ideas

Ideas for projects that would help coordinate changes between platforms …

Mechanism for pulling changes from one place to another

Make it easier to build, test, code review, and integrate ZFS changes into illumos.

Cross-platform test suite

One sourcebase, rather than porting STF to every platform?

Maybe integrate XFS Test Suite.

Userland ZFS

We already have ztest / libzpool and want to:

  • expand this to also be able to test more of zfs in userland
  • be able to run /sbin/zfs, /sbin/zpool against userland implementation
  • be able to run most of testrunner (and/or STF) test suite against userland implementation

ZFS (ZPL) version feature flags

Import the ZFS on Linux xattr=sa support into illumos.

/dev/zfs ioctl interface versioning

Ensure that future additions/changes to the interface maintain maximum compatibility with userland tools.

Enable FreeBSD's Linux jails and the illumos lx brand (zones) to use the ZFS on Linux utilities.

Port ZPIOS from ZFS on Linux to illumos

ZPIOS example

This would require a rewrite to not use Linux interfaces.

Virtual machine images with OpenZFS

To easily try OpenZFS on a choice of distributions within a virtual machine:

  • images could be built for running on public clouds
  • images for installing to real hardware.

 Discuss …

General feature ideas

ZFS channel programs

Possible Channel Programs:

  • Recursive rollback: revert to a snapshot on a dataset and all of its children. This needs a new command-line flag, since -r is already taken (see the sketch below).
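
Until such a channel program exists, the same effect can be approximated from userland with ordinary zfs commands, although without the atomicity that an in-kernel channel program could provide. A minimal sketch, with example dataset and snapshot names:

    #!/usr/bin/env python3
    """Userland approximation of a "recursive rollback" channel program.

    Rolls a dataset and every descendant dataset back to the snapshot named
    snapname.  A real channel program could do this atomically in the kernel;
    this sketch simply issues one `zfs rollback` per dataset, so it can fail
    part-way through.  Dataset and snapshot names below are examples only.
    """
    import subprocess

    def descendants(dataset):
        """Return the dataset and all of its descendants, via `zfs list -r`."""
        out = subprocess.run(["zfs", "list", "-H", "-r", "-o", "name", dataset],
                             check=True, capture_output=True, text=True).stdout
        return out.splitlines()

    def recursive_rollback(dataset, snapname):
        # Children are listed after their parents, so reverse to handle leaves first.
        for ds in reversed(descendants(dataset)):
            # Plain `zfs rollback` only succeeds when the target snapshot is the
            # most recent one on that dataset; otherwise -r/-R would be needed,
            # which is the flag-name conflict noted above.
            subprocess.run(["zfs", "rollback", f"{ds}@{snapname}"], check=True)

    if __name__ == "__main__":
        recursive_rollback("tank/home", "before-upgrade")   # example names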

Device removal

Based on indirect vdevs, rather than bprewrite.
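
As a rough illustration of what an indirect vdev provides: the removed device's allocated segments are copied elsewhere, and a per-vdev indirection table translates old offsets to the new locations so existing block pointers keep working. A sketch of such a table (the class, names, and layout are illustrative, not the actual on-disk format):

    """Sketch of the "indirect vdev" idea behind device removal (illustrative only)."""
    import bisect

    class IndirectVdev:
        def __init__(self, mapping):
            # mapping: list of (src_offset, length, new_vdev, new_offset) entries
            self.mapping = sorted(mapping)
            self.starts = [m[0] for m in self.mapping]

        def remap(self, offset):
            """Translate an offset on the removed vdev to (new_vdev, new_offset)."""
            i = bisect.bisect_right(self.starts, offset) - 1
            if i < 0:
                raise KeyError("offset not mapped")
            src, length, new_vdev, new_off = self.mapping[i]
            if offset >= src + length:
                raise KeyError("offset not mapped")
            return new_vdev, new_off + (offset - src)

    iv = IndirectVdev([(0, 0x1000, "vdev2", 0x8000), (0x1000, 0x1000, "vdev3", 0x0)])
    print(iv.remap(0x1800))   # -> ('vdev3', 0x800)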

Reflink support

The two sides of reflink() [LWN.net]

Unified ashift handling

[illumos-zfs] Specifying ashift when creating vdevs (2013-07-03): http://www.listbox.com/member/archive/182191/2013/07/search/YXNoaWZ0/sort/subj/page/3/entry/7:58/20130703201427:AEA03DD0-E43E-11E2-A883-F4AAC72FE4D2/

RAID-Z hybrid allocator

Preferably compatible with pool version 29 for Solaris 10u11 compatibility.

Replace larger ZIO caches with explicit pages

Subproject: document useful kernel interfaces for page manipulation on various platforms

Improved SPA namespace collision management

Needed mostly by virtual machine hosts. Work in progress in Gentoo.

Temporary pool names in zpool import

Temporary pool names in zpool create.

TRIM support

Realtime TRIM

FreeBSD already has realtime TRIM support. Saso has an implementation for illumos at Nexenta which he hopes to upstream in the next month or two (2015-10-08).

For more info see: http://www.open-zfs.org/wiki/Features#TRIM_Support

Free space TRIM

  • Walk metaslab space maps and issue discard commands to the vdevs (see the sketch below).
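
A rough sketch of that walk, where free_ranges stands in for the free segments recorded in a metaslab's space map and issue_discard() is a hypothetical placeholder for sending an UNMAP/TRIM command to a vdev (neither is an existing OpenZFS interface):

    """Sketch of free-space TRIM (illustrative only, not OpenZFS code)."""

    MIN_EXTENT = 1 << 20      # skip tiny extents: each discard has a per-command cost
    MAX_EXTENT = 128 << 20    # split huge extents so each command stays bounded

    def issue_discard(vdev, offset, length):
        """Hypothetical placeholder for an UNMAP/TRIM command to the vdev."""
        print(f"TRIM {vdev}: offset={offset:#x} length={length:#x}")

    def trim_free_space(vdev, free_ranges):
        """free_ranges: iterable of (offset, length) byte ranges known to be free."""
        for offset, length in free_ranges:
            if length < MIN_EXTENT:
                continue                              # not worth a discard command
            while length > 0:
                chunk = min(length, MAX_EXTENT)
                issue_discard(vdev, offset, chunk)
                offset += chunk
                length -= chunk

    trim_free_space("vdev0", [(0x100000, 4 << 20), (0x900000, 512)])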

Platform agnostic encryption support

Preferably compatible with pool version 30, implemented as a pool feature flag.

Developer resources include a link to a November 2010 blog post by Oracle.

The early ZFS encryption code published in the zfs-crypto repository of OpenSolaris.org could be a starting point. A copy is available from Richard Yao upon request.

Deduplication improvements

Potential algorithms:

  • Bloom filter (http://www.listbox.com/member/archive/182191/2013/02/search/Ymxvb20gZmlsdGVycw/sort/time_rev/page/1/entry/8:16/20130212183221:70E13332-756C-11E2-996D-F0C715E11FC0/); a minimal sketch appears below
  • Cuckoo Filter (https://www.usenix.org/system/files/nsdip13-paper6.pdf)

Convert synchronous writes to asynchronous writes when an ARC miss occurs during a lookup against the DDT.
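
A Bloom filter is attractive here precisely because of that DDT-miss cost: a small in-memory filter can answer "this checksum is definitely not in the DDT" without any disk I/O, so the expensive on-disk lookup is only paid for blocks that might actually be duplicates. A minimal, self-contained illustration (not ZFS code; the filter size and hash choice are arbitrary):

    """Minimal Bloom filter sketch: a cheap "definitely not in the DDT" test.

    A real implementation would size the filter from the expected DDT entry
    count and rebuild or age it over time.
    """
    import hashlib

    class BloomFilter:
        def __init__(self, nbits=1 << 20, nhashes=4):
            self.nbits = nbits
            self.nhashes = nhashes
            self.bits = bytearray(nbits // 8)

        def _positions(self, key: bytes):
            for i in range(self.nhashes):
                h = hashlib.blake2b(key, digest_size=8, salt=bytes([i] * 8)).digest()
                yield int.from_bytes(h, "little") % self.nbits

        def add(self, key: bytes):
            for pos in self._positions(key):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def might_contain(self, key: bytes) -> bool:
            # False means "definitely absent": skip the DDT lookup entirely.
            # True means "maybe present": fall through to the real DDT lookup.
            return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

    # Usage: add each block checksum as it enters the DDT, test before lookups.
    f = BloomFilter()
    f.add(b"checksum-of-block-A")
    assert f.might_contain(b"checksum-of-block-A")
    print(f.might_contain(b"checksum-of-block-B"))   # almost certainly False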

Use dedicated kmem_cache for deduplication table entries:

  • easy to implement
  • will reduce DDT entries from 512 bytes to 320 bytes.

ZFS Compression / Dedup to favour provider

Currently, if a customer has 100 MB of quota available and uploads 50 MB of data that compresses/dedups down to 25 MB, the customer's quota is only reduced by 25 MB: the savings favour the customer. As a storage provider it is desirable to be able to reverse this logic, so that the customer's quota is reduced by the full 50 MB and the 25 MB saved by compression/dedup is to the provider's benefit. This is similar to how Google, Amazon, and other cloud providers already handle it: you get 2 GB of quota, and any compression savings go to the provider.

  • A property (name to be determined) to charge quota usage by the before-compression/dedup size; see the sketch below.
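
A tiny sketch of the two accounting modes, assuming a hypothetical per-dataset setting (called accounting below purely for illustration) that selects whether quota is charged by the logical or the physical size:

    """Illustration of the proposed quota-charging behaviour (names are hypothetical)."""

    def quota_charge(logical_bytes, physical_bytes, accounting="physical"):
        """Return how many bytes to charge against the dataset's quota.

        accounting="physical": today's behaviour, savings favour the customer.
        accounting="logical":  proposed behaviour, savings favour the provider.
        """
        return logical_bytes if accounting == "logical" else physical_bytes

    # The example from the description: 50 MB written, compressing/deduping to 25 MB.
    MB = 1024 * 1024
    print(quota_charge(50 * MB, 25 * MB, "physical") // MB)  # 25 -> customer benefits
    print(quota_charge(50 * MB, 25 * MB, "logical") // MB)   # 50 -> provider benefits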

Periodic Data Validation

Problem: ZFS does a great job of detecting data errors due to lost writes, media errors, and storage bugs, but only when the user actually accesses the data. Scrub in its current form can take a very long time and can have a highly deleterious impact on overall performance.

Data validation in ZFS should be specified according to data or business needs. Kicking off a scrub every day, week, or month doesn’t directly express that need. More likely, the user wants to express their requirements like this:

  • “Check all old data at least once per month”
  • “Make sure all new writes are verified within 1 day”
  • “Don’t consume more than 50% of my IOPS capacity”


Note that constraints like these may overlap, but that's fine: the user just needs to indicate priorities, and the system must alert the user of any violations.

I suggest a new type of scrub. Constraints should be expressed and persisted with the pool. Execution of the scrub should tie into the ZFS IO scheduler. That subsystem is ideally situated to identify a relatively idle system. Further, we should order scrub IOs to be minimally impactful. That may mean having a small queue of outstanding scrub IOs that we’d send to the device, or it might mean that we try to organize large, dense contiguous scrub reads by sorting by LBA.

Further, after writing data to disk, there’s a window for repair while the data is still in the ARC. If ZFS could read that data back, then it could not only detect the failure, but correct it even in a system without redundant on-disk data.

- ahl
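
A rough sketch of how constraints like those above might be represented and evaluated. Every name in it (the ScrubConstraint class, its fields, the IOPS-headroom check) is hypothetical rather than an existing interface; the point is only that each constraint reduces to "is there data whose last verification is older than its deadline, and is there currently I/O headroom to scrub it":

    """Sketch of constraint-driven scrubbing (all names and structures hypothetical)."""
    from dataclasses import dataclass
    import time

    DAY = 86400

    @dataclass
    class ScrubConstraint:
        max_age_seconds: int      # e.g. "check all data at least once per month"
        max_iops_fraction: float  # e.g. "don't consume more than 50% of IOPS"
        priority: int             # lower number = more important when constraints conflict

    def regions_to_scrub(last_verified, constraints, now, current_iops_fraction):
        """last_verified: {region_id: timestamp of last successful verification}."""
        due = []
        for c in sorted(constraints, key=lambda c: c.priority):
            if current_iops_fraction >= c.max_iops_fraction:
                continue                      # no I/O headroom under this constraint
            for region, ts in last_verified.items():
                if now - ts > c.max_age_seconds:
                    due.append((ts, region))  # stale under this constraint: scrub it
        # Oldest data first; the I/O scheduler would issue these at low priority.
        return [region for _, region in sorted(set(due))]

    constraints = [ScrubConstraint(30 * DAY, 0.5, priority=1),   # old data: monthly
                   ScrubConstraint(1 * DAY, 0.5, priority=2)]    # new writes: daily
    now = time.time()
    print(regions_to_scrub({"ms-0": now - 40 * DAY, "ms-1": now - 2 * DAY},
                           constraints, now, 0.2))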

Sorted Scrub

Problem: The current scrub algorithm is not rotational-media friendly; it generally produces a random read workload and becomes limited by the underlying storage system's seek latency.

Saso Kiselkov of Nexenta gave a talk on Scrub/Resilver Performance at the OpenZFS Developer Summit 2016 (September 2016):

Video: https://youtu.be/SZFwv8BdBj4; Slides: https://drive.google.com/file/d/0B5hUzsxe4cdmVU91cml1N0pKYTQ/view?usp=sharing


The following has been copied from Matt Ahrens' reply on the mailing list in July 2016, here: https://www.listbox.com/member/archive/274414/2016/07/sort/time_rev/page/1/entry/1:30/20160709172505:98792D9A-461B-11E6-905A-93F52D8D822E/

We had an intern work on "sorted scrub" last year. Essentially the idea was to read the metadata to gather into memory all the BP's that need to be scrubbed, sort them by DVA (i.e. offset on disk) and then issue the scrub i/os in that sorted order. However, memory can't hold all of the BP's, so we do multiple passes over the metadata, each pass gathering the next chunk of BP's. This code is implemented and seems to work but probably needs some more testing and code cleanup.

One of the downsides of that approach is having to do multiple passes over the metadata if it doesn't all fit in memory (which it typically does not). In some circumstances, this is worth it, but in others not so much. To improve on that, we would like to do just one pass over the metadata to find all the block pointers. Rather than storing the BP's sorted in memory, we would store them on disk, but only roughly sorted. There are several ways we could do the sorting, which is one of the issues that makes this problem interesting.

We could divide each top-level vdev into chunks (like metaslabs, but probably a different number of them) and for each chunk have an on-disk list of BP's in that chunk that need to be scrubbed/resilvered. When we find a BP, we would append it to the appropriate list. Once we have traversed all the metadata to find all the BP's, we would load one chunk's list of BP's into memory, sort it, and then issue the resilver i/os in sorted order.

As an alternative, it might be better to accumulate as many BP's as fit in memory, sort them, and then write that sorted list to disk. Then remove those BP's from memory and start filling memory again, write that list, etc. Then read all the sorted lists in parallel to do a merge sort. This has the advantage that we do not need to append to lots of lists as we are traversing the metadata. Instead we have to read from lots of lists as we do the scrubs, but this should be more efficient. We also don't have to determine beforehand how many chunks to divide each vdev into.
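
The second approach described above is essentially an external merge sort keyed on DVA offset: fill memory with block pointers, sort them, spill each sorted run, then merge the runs and issue scrub I/Os in offset order. A compact sketch of that shape, with block pointers reduced to plain integer offsets and sorted runs kept as in-memory lists standing in for on-disk spill files:

    """External-merge-sort shape of "sorted scrub" (illustrative only)."""
    import heapq

    def build_sorted_runs(bp_offsets, mem_limit):
        """One pass over the metadata: spill a sorted run whenever memory fills up."""
        runs, current = [], []
        for offset in bp_offsets:          # order of discovery == metadata traversal order
            current.append(offset)
            if len(current) >= mem_limit:
                runs.append(sorted(current))   # "write the sorted list to disk"
                current = []
        if current:
            runs.append(sorted(current))
        return runs

    def scrub_in_offset_order(runs, issue_scrub_io):
        """Merge-read all runs in parallel and issue scrub I/Os sorted by offset."""
        for offset in heapq.merge(*runs):
            issue_scrub_io(offset)         # mostly sequential reads instead of random seeks

    runs = build_sorted_runs([812, 5, 733, 90, 411, 16, 377, 254], mem_limit=3)
    scrub_in_offset_order(runs, lambda off: print("scrub offset", off))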

Real-time replication

https://www.illumos.org/issues/7166

Lustre feature ideas

The Lustre project supports the use of ZFS as an Object Storage Target. They maintain their own feature request page with ZFS project ideas. Below is a list of project ideas that are well defined, benefit Lustre and have no clear benefit outside of that context.

Collapsible ZAP objects

E.g. fatzap -> microzap downgrades.

Data on separate devices

Architecture - ZFS for Lustre (http://wiki.lustre.org/index.php/Architecture_-_ZFS_for_Lustre#Data_on_Separate_Devices) …

TinyZAP

Architecture - ZFS TinyZAP …

Awareness-raising ideas

… awareness of the quality, utility, and availability of open source implementations of ZFS.

Quality

Please add or discuss your ideas.  

Utility

ZFS and OpenZFS in three minutes (or less)

A very short and light video/animation to grab the attention of people who don't yet realise why ZFS is an extraordinarily good thing.

For an entertaining example of how a little history (completely unrelated to storage) can be taught in ninety seconds, see Hohenwerfen Fortress - The Imprisoned Prince Bishop (context) (part of the ZONE Media portfolio).

A very short video for ZFS and OpenZFS might throw in all that's good, using plain English wherever possible, including:

  • very close to the beginning, the word resilience
  • verifiable integrity of data and so on
  • some basic comparisons (NTFS, HFS Plus, ReFS)

– with the 2010 fork mentioned in passing (blink and you'll miss that milestone), the lasting impression from the video should be that ZFS is great (years ahead of the alternatives) and that OpenZFS is rapidly making it better for a broader user base.

Hint: many ZFS-related videos already exist, but most are a tad dry and cover a huge amount of content. Aim for two minutes :-) … discuss …

Availability

Please add or discuss your ideas.  

General

The OpenZFS channel on YouTube, begun in October 2013, complements the automatically generated ZFS channel.

https://twitter.com/DeirdreS/status/322422786184314881 (2013-02) draws attention to ZFS-related content amongst videos listed by Deirdré Straughan.