Documentation/Administrative Commands

This page describes the code flow of zfs administrative commands (/sbin/zfs subcommands that change state). We will use zfs snapshot -r as the example and examine what each layer of code is responsible for. This is intended as an introduction to the many layers of ZFS, so we won't go into detail on how snapshots are implemented. You can read more about snapshots in an old blog post.

In broad strokes, userland gets the list of filesystems from the kernel and determines what snapshots need to be created. Information about snapshots, and all other pool-wide metadata (basically everything except for data inside filesystems) is stored in the MOS (Meta Object Set). The MOS is only modified in syncing context, so we use the synctask infrastructure to run a callback in syncing context to create a new dsl_dataset_phys_t to represent the snapshot.

In more detail, here are the responsibilities of each layer of code:

/sbin/zfs infrastructure: main()

The generic (subcommand agnostic) infrastructure of the zfs command does the following:

  • Create a libzfs handle (libzfs_init()).
  • Determine which subcommand should be executed and run it.
    • Each zfs subcommand has a callback, typically named zfs_do_subcommand-name().
  • Call a libzfs function (zpool_log_history()) to log the command (see below for details).
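
The dispatch step above can be sketched as a table lookup. This is a minimal illustration, not the actual /sbin/zfs source: the type names, the table, the stub callbacks, and their return values are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callback type: each subcommand receives its own argc/argv. */
typedef int (*subcommand_cb_t)(int argc, char **argv);

typedef struct subcommand {
	const char *name;
	subcommand_cb_t func;
} subcommand_t;

/* Stand-ins for the real zfs_do_*() callbacks. */
static int
do_snapshot(int argc, char **argv)
{
	(void) argv;
	/* Pretend a missing snapshot name is a usage error (2). */
	return (argc >= 2 ? 0 : 2);
}

static int
do_destroy(int argc, char **argv)
{
	(void) argc;
	(void) argv;
	return (0);
}

static const subcommand_t command_table[] = {
	{ "snapshot", do_snapshot },
	{ "destroy", do_destroy },
};

/* Find the subcommand named argv[0] and run it; return -1 if unknown. */
static int
run_subcommand(int argc, char **argv)
{
	size_t n = sizeof (command_table) / sizeof (command_table[0]);

	assert(argc >= 1 && argv != NULL);
	for (size_t i = 0; i < n; i++) {
		if (strcmp(argv[0], command_table[i].name) == 0)
			return (command_table[i].func(argc, argv));
	}
	return (-1);
}
```

A caller would pass the arguments that follow the program name, so that argv[0] is the subcommand name.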

snapshot subcommand: zfs_do_snapshot

The snapshot subcommand's callback is zfs_do_snapshot. It does the following:

  • Parse the command line arguments.
  • Create a list of the snapshots that need to be created.
    • Call a libzfs function (zfs_iter_filesystems()) to iterate over the descendant filesystems, adding the snapshot of each filesystem to the list.
  • Call a libzfs function (zfs_snapshot_nvl()) to create the snapshots and handle any errors.
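
As a rough sketch of the list-building step, the following turns a set of filesystem names into snapshot names of the form filesystem@snapname. The real code stores the names in an nvlist; the fixed-size array, the SNAP_NAME_MAX limit, and both function names here are assumptions made to keep the example self-contained.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define	SNAP_NAME_MAX	256	/* illustrative limit, not the real ZFS constant */

/* Write "fs@snap" into buf; return -1 if it does not fit. */
static int
make_snapshot_name(char *buf, size_t len, const char *fs, const char *snap)
{
	int n;

	assert(buf != NULL && fs != NULL && snap != NULL);
	n = snprintf(buf, len, "%s@%s", fs, snap);
	return ((n < 0 || (size_t)n >= len) ? -1 : 0);
}

/* Build one snapshot name per filesystem, as "zfs snapshot -r" would need. */
static int
build_snapshot_list(const char *const *fss, size_t nfs, const char *snap,
    char names[][SNAP_NAME_MAX])
{
	for (size_t i = 0; i < nfs; i++) {
		if (make_snapshot_name(names[i], SNAP_NAME_MAX,
		    fss[i], snap) != 0)
			return (-1);
	}
	return (0);
}
```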

libzfs

We saw two uses of libzfs: iterating over the descendant filesystems, and creating the snapshots.

filesystem iteration: zfs_iter_filesystems()

libzfs provides "handles" to zfs datasets, represented by a zfs_handle_t. The handle is created by getting stats on a dataset from the kernel. The handle then caches these stats (e.g. property values) in userland. Note that the handle is a purely userland (libzfs) concept; the kernel doesn't know about handles, and holding a handle doesn't prevent any concurrent activity (e.g. destroying the dataset, changing properties, etc.).

To iterate over a filesystem's children, libzfs uses the ZFS_IOC_DATASET_LIST_NEXT ioctl to the kernel. Each call to this ioctl returns the next child of the specified dataset, along with the stats (e.g. properties) of that dataset. libzfs uses this information to make a zfs_handle_t, and passes the handle to a callback provided by the caller (zfs_do_snapshot in this case).
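
The "list next" pattern described above can be sketched with a cursor that the caller passes back on every call. Nothing here talks to the kernel: fake_list_next() is a hypothetical stand-in for the ioctl, the child names are made up, and the callback receives a plain name where libzfs would pass a zfs_handle_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's list of one dataset's children. */
static const char *fake_children[] = { "pool/fs/a", "pool/fs/b", "pool/fs/c" };

/*
 * Simulated "list next" call: return the child at *cookie and advance the
 * cookie, or return -1 when there are no more children.
 */
static int
fake_list_next(uint64_t *cookie, const char **name)
{
	size_t n = sizeof (fake_children) / sizeof (fake_children[0]);

	if (*cookie >= n)
		return (-1);
	*name = fake_children[(*cookie)++];
	return (0);
}

typedef int (*iter_cb_t)(const char *name, void *arg);

/* Call cb for each child; stop early if the callback returns nonzero. */
static int
iter_children(iter_cb_t cb, void *arg)
{
	uint64_t cookie = 0;
	const char *name;
	int err;

	while (fake_list_next(&cookie, &name) == 0) {
		if ((err = cb(name, arg)) != 0)
			return (err);
	}
	return (0);
}

/* Example callback: count the children that were visited. */
static int
count_cb(const char *name, void *arg)
{
	assert(name != NULL);
	(*(int *)arg)++;
	return (0);
}
```

Because each call returns only one child, the set of children can change between calls; this is consistent with the point that a handle does not block concurrent activity.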

snapshot creation: zfs_snapshot_nvl()

/sbin/zfs provides the list of snapshots to create, so this is a relatively thin layer in libzfs. Other subcommands have substantially more of their logic implemented in libzfs. The one interesting part of what libzfs does here is handle the errors from the kernel. It will print out human-readable error messages depending on what the error code was from the kernel.
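
To illustrate the error-handling role, here is a minimal sketch that maps an error code to a message. The errno values chosen and the message strings are assumptions for illustration; they are not copied from libzfs, which has its own table of cases and wording.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Map an error code to a human-readable reason (illustrative strings). */
static const char *
snapshot_error_reason(int err)
{
	switch (err) {
	case EEXIST:
		return ("snapshot already exists");
	case ENOENT:
		return ("dataset does not exist");
	case EXDEV:
		return ("snapshots must be in the same pool");
	case ENAMETOOLONG:
		return ("snapshot name is too long");
	default:
		return ("unknown error");
	}
}

/* Format a complete message for one failed snapshot into buf. */
static void
format_snapshot_error(char *buf, size_t len, const char *name, int err)
{
	assert(buf != NULL && name != NULL && err != 0);
	(void) snprintf(buf, len, "cannot create snapshot '%s': %s",
	    name, snapshot_error_reason(err));
}
```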

libzfs calls into libzfs_core to do the actual ioctl to the kernel.

libzfs_core: lzc_snapshot()

libzfs calls lzc_snapshot() in libzfs_core. libzfs_core is a very thin layer which basically just marshals the arguments and calls the ioctl to the kernel. In this case it would be ZFS_IOC_SNAPSHOT.

ioctl infrastructure: zfsdev_ioctl()

When userland calls ioctl() on /dev/zfs, the kernel infrastructure will call zfsdev_ioctl(). This has code which is applied to all zfs ioctls. It does the following:

  • Marshal the arguments, copying them in from the user address space.
  • Determine which ioctl function should be called.
    • Specific ioctl functions are typically named zfs_ioc_name-of-ioctl().
    • The ioctl functions are stored in zfs_ioc_vec[], which is populated by calling zfs_ioctl_register() from zfs_ioctl_init().
  • Call the ioctl-specific permission checking function, zfs_secpolicy_snapshot().
  • Call the ioctl-specific function, zfs_ioc_snapshot().
  • If this is a new-style ioctl (which ZFS_IOC_SNAPSHOT is), and it was successful, we log the ioctl and its arguments on disk in the pool's history.
    • This history log can be printed by running zpool history -i.
  • If this ioctl allows it (which ZFS_IOC_SNAPSHOT does), and the ioctl was successful, we remember that this thread is allowed to log the CLI history, which will be done as a separate ioctl.
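
The register-then-dispatch flow in the list above can be sketched as follows. This is a simplified model under stated assumptions: the struct layout, the ioc_register() and ioc_dispatch() names, the single string argument, and the counter standing in for the on-disk history are all hypothetical, and argument copy-in is omitted.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

typedef int (*ioc_func_t)(const char *arg);
typedef int (*ioc_secpolicy_t)(const char *arg);

typedef struct ioc_vec {
	ioc_func_t func;		/* does the work */
	ioc_secpolicy_t secpolicy;	/* permission check, run first */
	int log_history;		/* nonzero: record successful calls */
} ioc_vec_t;

enum { IOC_SNAPSHOT, IOC_LAST };	/* hypothetical ioctl numbers */

static ioc_vec_t ioc_vec[IOC_LAST];
static int history_entries;	/* stand-in for the pool's history log */

static void
ioc_register(int ioc, ioc_func_t func, ioc_secpolicy_t secpolicy, int log)
{
	assert(ioc >= 0 && ioc < IOC_LAST);
	assert(func != NULL && secpolicy != NULL);
	ioc_vec[ioc] = (ioc_vec_t){ func, secpolicy, log };
}

/* Look up the ioctl, check permissions, run it, then log on success. */
static int
ioc_dispatch(int ioc, const char *arg)
{
	int err;

	if (ioc < 0 || ioc >= IOC_LAST || ioc_vec[ioc].func == NULL)
		return (EINVAL);
	if ((err = ioc_vec[ioc].secpolicy(arg)) != 0)
		return (err);
	err = ioc_vec[ioc].func(arg);
	if (err == 0 && ioc_vec[ioc].log_history)
		history_entries++;
	return (err);
}

/* Stub callbacks used to exercise the dispatcher. */
static int allow_all(const char *arg) { (void) arg; return (0); }
static int deny_all(const char *arg) { (void) arg; return (EPERM); }
static int snapshot_stub(const char *arg) { return (arg != NULL ? 0 : EINVAL); }
```

Keeping the permission check in the generic dispatcher means an ioctl-specific function never runs for a caller that failed its security policy.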

snapshot ioctl: zfs_ioc_snapshot()

This is a relatively thin layer which typically checks that the arguments are well-formed. For example, all of the snapshots must be in the same pool.
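
The same-pool check can be sketched as a comparison of the pool component of each name, i.e. everything before the first '/' or '@'. The function names are hypothetical and the real ioctl works on an nvlist rather than a string array, so treat this as an illustration of the rule rather than the kernel's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Length of the pool component: everything before the first '/' or '@'. */
static size_t
pool_name_len(const char *name)
{
	return (strcspn(name, "/@"));
}

/*
 * Return true if every name contains '@' (so it names a snapshot) and all
 * names share the pool component of the first one.
 */
static bool
snapshots_in_same_pool(const char *const *snaps, size_t n)
{
	size_t poollen;

	if (n == 0)
		return (true);
	assert(snaps != NULL);
	poollen = pool_name_len(snaps[0]);
	for (size_t i = 0; i < n; i++) {
		if (strchr(snaps[i], '@') == NULL)
			return (false);
		if (pool_name_len(snaps[i]) != poollen ||
		    strncmp(snaps[0], snaps[i], poollen) != 0)
			return (false);
	}
	return (true);
}
```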

DSL: dsl_dataset_snapshot()

This layer is also relatively thin: it marshals its arguments into structs and creates a synctask to execute a callback from syncing context. For snapshots, there is also some code to suspend the ZIL on old version pools.

synctask infrastructure: dsl_sync_task()

The synctask infrastructure allows for a thread executing in open context (i.e. from an ioctl()) to execute a callback in syncing context (i.e. from spa_sync()). The MOS (Meta Object Set), which contains all the pool-wide metadata, can only be modified from syncing context.
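
The hand-off from open context to syncing context can be modeled as a queue of callbacks that is drained by the sync pass. This single-threaded sketch is an assumption-heavy simplification: the names are hypothetical, run_sync_pass() stands in for spa_sync(), there is no blocking or TXG accounting, and the separate check callback reflects our understanding of how synctasks validate their arguments rather than anything stated above.

```c
#include <assert.h>
#include <stddef.h>

#define	MAX_TASKS	8

typedef int (*checkfunc_t)(void *arg);	/* validate; nonzero rejects */
typedef void (*syncfunc_t)(void *arg);	/* apply the change */

typedef struct sync_task {
	checkfunc_t check;
	syncfunc_t sync;
	void *arg;
} sync_task_t;

static sync_task_t pending[MAX_TASKS];
static size_t npending;
static int in_syncing_context;
static int mos_object_count;	/* stand-in for pool-wide metadata */

/* Open context: queue the task; nothing is modified yet. */
static int
sync_task_enqueue(checkfunc_t check, syncfunc_t sync, void *arg)
{
	assert(check != NULL && sync != NULL);
	if (npending == MAX_TASKS)
		return (-1);
	pending[npending++] = (sync_task_t){ check, sync, arg };
	return (0);
}

/* Syncing context: run every queued task whose check passes. */
static int
run_sync_pass(void)
{
	int ran = 0;

	in_syncing_context = 1;
	for (size_t i = 0; i < npending; i++) {
		if (pending[i].check(pending[i].arg) == 0) {
			pending[i].sync(pending[i].arg);
			ran++;
		}
	}
	npending = 0;
	in_syncing_context = 0;
	return (ran);
}

static int always_ok(void *arg) { (void) arg; return (0); }

/* Example sync callback: only legal while the sync pass is running. */
static void
create_objects(void *arg)
{
	assert(in_syncing_context);
	mos_object_count += *(int *)arg;
}
```

The assertion in create_objects() captures the rule from the text: the metadata is only touched while the sync pass is running.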

snapshot synctask: dsl_dataset_snapshot_sync()

This code creates each of the snapshots by modifying the MOS. Specifically, it creates a new object in the MOS to represent each snapshot. Each snapshot's object stores a dsl_dataset_phys_t, which will be filled in by this code. Related datasets (e.g. the previous snapshot and the filesystem) will also be modified.
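
The way a new snapshot is linked to its neighbors can be sketched with a much-reduced structure. The ds_phys_t type below is a hypothetical three-field stand-in for dsl_dataset_phys_t, the array stands in for the MOS, and the linking order shown is our assumption about the general shape of the operation, not a transcription of the real code.

```c
#include <assert.h>
#include <stdint.h>

#define	MAX_OBJS	16

/* Hypothetical, heavily reduced stand-in for dsl_dataset_phys_t. */
typedef struct ds_phys {
	uint64_t prev_snap_obj;	/* 0 means "none" */
	uint64_t next_snap_obj;
	uint64_t creation_txg;
} ds_phys_t;

static ds_phys_t mos[MAX_OBJS];	/* object 0 is never allocated */
static uint64_t next_obj = 1;

static uint64_t
alloc_obj(void)
{
	assert(next_obj < MAX_OBJS);
	return (next_obj++);
}

/*
 * Create a snapshot of the filesystem fs_obj: allocate a new object, fill
 * it in, and splice it between the previous snapshot and the filesystem.
 */
static uint64_t
snapshot_sync(uint64_t fs_obj, uint64_t txg)
{
	uint64_t snap_obj = alloc_obj();
	ds_phys_t *fs = &mos[fs_obj];
	ds_phys_t *snap = &mos[snap_obj];

	assert(fs_obj != 0 && fs_obj < MAX_OBJS);
	snap->prev_snap_obj = fs->prev_snap_obj;
	snap->next_snap_obj = fs_obj;
	snap->creation_txg = txg;

	/* The related datasets are modified too. */
	if (fs->prev_snap_obj != 0)
		mos[fs->prev_snap_obj].next_snap_obj = snap_obj;
	fs->prev_snap_obj = snap_obj;
	return (snap_obj);
}
```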

DMU layer: MOS sync: dmu_objset_sync()

In the previous phase, we only modified the in-memory copy of data which is represented on disk. Subsequently (in the same TXG), we write out all the dirty data in the MOS. The MOS is an object set (objset) like any other, the primary difference being that it is only dirtied (modified) in syncing context.