Documentation/Read Write Lecture

This lecture was presented as Lecture 6 of Marshall Kirk McKusick's class, FreeBSD Kernel Internals: An Intensive Code Walkthrough, taught in the spring of 2016 at the historic Hillside Club in Berkeley, California.

The entire class can be obtained from Marshall Kirk McKusick, 1614 Oxford St, Berkeley CA 94709-1608; +1-510-843-9542. Additional information is available at McKusick.com

Notes / outline

Notes by Steve Jacobson 03/01/16, version 0.02

Kirk's Introductory Comments (00:03:59)

 - this week:  guest lecturer Matt Ahrens
   - joined Sun Microsystems in 2001
   - one of the original two developers of ZFS
     - along with Jeff Bonwick
   - the two of them developed ZFS starting with a blank
     whiteboard
   - Matt has devoted the last 15 years of his life to all
     things ZFS
   - Matt will present this lecture:
     - ZFS overview
     - ZFS code walkthrough of the read and write paths            (00:04:54)

Matt's Introductory Comments (00:06:24)

 - works for Delphix
 - has worked on OpenZFS for 10 years

Slide 1: Introduction (00:07:09)

 Delphix
 OpenZFS
 Internals
 Matt Ahrens
 mahrens@delphix.com

Slide 2: What is the ZFS Storage System? (00:07:09)

 - Pooled storage
   - Functionality of filesystem + volume manager in one
   - Filesystems allocate and free space from pool
     - means that a filesystem need not be given a fixed size
       when it is created

 - Transactional object model
   - Always consistent on disk (no FSCK, ever)
   - Universal - file, block, NFS, SMB, iSCSI, FC, ...
     - ZFS can serve up data over any kind of interface

 - End-to-end data integrity
   - Detect & correct silent data corruption
     - throughout the entire storage stack
     - detected through checksums
     - ZFS can detect that data was written to the wrong
       place on disk
       - the checksum is stored elsewhere on disk
         - checked when data is read
       - can then look for the data in another location
         - supplies that data or gives an error

 - Simple administration
   - has more options, more configurability
   - intent is to make this easy for system administrators
   - Concisely express intent
     - configuring storage,
     - how are disks arranged,
     - adding and removing mirrors, etc.
     - also filesystem-level configuration:
       - where should a filesystem be mounted,
       - how should it be shared
     - ZFS-level functionality
       - what checks and functions should be used?
       - should the data be compressed?
       - which data should be compressed?

   - Scalable data structures
     - avoids the need for inconvenient workarounds
     - example:  10 TB disk, maximum limit of 1 TB
       per filesystem
     - example:  limit of 1 million files

Slide 3: ZFS History (00:12:41)

 2001:  development starts with 2 engineers
 - two engineers, blank whiteboard
 2005:  ZFS source code released
 - as part of the OpenSolaris project at Sun
 - allowed it to be ported to other operating systems
 2008:  ZFS released in FreeBSD 7.0
 2010:  Oracle stops contributing to source code for ZFS
 - changes made after 2010 by Oracle were kept proprietary
 2010:  illumos is founded as the truly open successor to
        OpenSolaris
 - with OpenSolaris,
   - was open source,
   - contributions were accepted from outsiders,
   - but was run by one organization, Sun Microsystems
 - illumos is multilateral:
   - a collaboration of server companies
   - would continue to collaborate
 2013:  ZFS on (native) Linux GA
 - the first generally-available port to Linux
 2013:  Open-source ZFS bands together to form OpenZFS
 - Matt co-founded
 - different communities were working on ZFS in isolation
 - intent:  get these communities talking together,
   - collaborating
 2014:  OpenZFS for Mac OS X launch

Slide 4: Block Diagram (00:15:46)

 - a comparison, at a high level, of ZFS with more traditional
   filesystems and volume managers
 - (diagram is not in Book)
 - ZFS is within the rounded rectangle
 - commonalities:
   - ZFS functionality corresponds to the traditional filesystem
     and volume manager
   - at the top is the VFS layer, which sends down operations
     to the filesystem, such as:
     - read,
     - write,
     - create file,
     - rename file
   - ZFS can also accept operations on ZVOLs
     - like an emulated volume
     - like one big flat file
       - can be exported over interfaces like SCSI
 - differences:
   - the traditional model:
     - the interface between traditional filesystems and volume
       managers is the block interface
     - when a volume is created, it is on top of a set number
       of disk drives, and of a set size
     - the filesystem sees a disk of a fixed size
     - the filesystem is not aware, for instance, that it is
       running on top of two mirrored disks
       - if the filesystem reads incorrect data, it has no way
         to ask the volume manager if it has another copy of
         that data
   - ZFS:
     - the traditional interfaces have been reworked
     - the DMU is somewhat like the MMU (memory management
       unit) of a processor
       - an MMU maps between virtual and physical addresses
       - the DMU maps between logical blocks and physical blocks
         - logical block:  this file, this offset within a file
         - physical block:  represented by a block pointer
           - tells us where the data is actually stored on disk
     - the SPA allocates and frees blocks
     - when the filesystem is processing a write:
       - it sends the data down through the DMU,
       - the DMU tells the SPA:  "here is 8 KB of data; please
         write it somewhere on disk, then tell me how to find
         it again"
         - the block pointer tells the DMU how to find it again
       - the SPA is responsible for finding and allocating a
         free location on disk,
         - dealing with mirroring or RAID, if any,
         - then tells the DMU where the data was written
     - if we overwrite data, or remove a file,
       - the DMU or related parts of the system tells the SPA
         - "this block pointer is no longer needed"
     - the ZPL - DMU interface:
       - responsible for the POSIX-like semantics of the
         filesystem
         - files, directories, etc.
       - the ZPL knows that it can consume objects
         - like a file without additional attributes
           - an array of bytes
         - does not need to know how that object is stored
           on disk
         - just understands "offset x of file y"
       - the DMU worries about indirect blocks, etc.

Slide 5: ZFS Module Layering (00:22:29)

 - (See Figure 10.2, Book, page 526)

 - note:  in the code walkthrough, we will be going through
   the read and write system call paths
   - the most performance-sensitive parts of ZFS
 - here, we will focus on other parts of ZFS

 - ZFS extends from below VFS to above GEOM
 - The management portion:
   - creating filesystems,
   - creating snapshots,
   - setting properties,
   - the relationship and arrangement between filesystems

 - on the left-hand side:  the datapath of read and write

 - ZAP:  ZFS Attribute Processor
   - on-disk key-value store
   - primary use:  directories
     - in FFS:  the directory is a mapping from the name
       of a file in the directory to the inode number of that file
     - the ZAP stores a similar mapping
   - implemented with an extensible hashtable
     - can scale to hundreds of millions of entries in a
       single directory
     - is a fairly heavyweight data structure
     - there is also a simpler microZAP layout,
       - automatically used for simple objects

 - ZFS is a copy-on-write filesystem:
   - whenever writing data to disk:  write to area of disk
     not currently in use
   - whenever we write a piece of data, we have to write all
     the ancestors of it
     - write the data,
     - write the block pointer that points to it,
     - change the block that points to that block pointer,
     - etc.
     - this is a very heavyweight operation:
       - with 10 levels of indirection,
         - if you change 1 block, you must write 10 blocks
           to persist that on disk
       - typically not a problem:
         - you are usually changing a bunch of blocks that
           are next to each other
         - so, writing a few ancestor blocks for a large
           number of data block modifications
         - but, if the latency of the write matters, then
           this operation performs very, very poorly
           - the ZIL helps with this

 - ZIL (ZFS Intent Log)
   - allows us to persist synchronous operations quickly
   - on-disk chain of blocks containing log records
     - can append to this list asynchronously
     - for a synchronous operation, the intent log can append
       a block to this linked list
     - can do this asynchronously relative to writing out
       the bulk changes

 - ARC (Adaptive Replacement Cache)                                (00:29:01)
   - example:  a read system call
     - comes through the VFS to ZFS

Slide 6: Decoder Ring (00:00:00)

 - (See Table 10.1, Book, page 524)

 VFS       Virtual File System
 ZFS       Zettabyte File System
 ZPL       ZFS POSIX Layer
 ZIL       ZFS Intent Log
 ZAP       ZFS Attribute Processor
 DMU       Data Management Unit
 DSL       Dataset and Snapshot Layer
 ARC       Adaptive Replacement Cache
 ZIO       ZFS Input / Output
 VDEV      Virtual Device

Slide 7: ZFS Organization (00:00:00)

 - (See Figure 10.3, Book, page 527)

Slide 8: ZFS Structure (00:00:00)

 - (See Figure 10.5, Book, page 532)

Slide 9: Description of a Block Pointer (00:00:00)

 - (See Figure 10.4, Book, page 530)

Question: what is this padding, and why don't we just have a larger asize?

Read Code Path

zfs_read(): contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c (00:13:32)

 - sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c

 - when a read system call is invoked, the first zfs code 
   executed is zfs_freebsd_read()
   - same file as zfs_read()
   - this is a small wrapper around zfs_read()

 - zfs_read():
   - parameters well-documented in block comment

 - zfs_range_lock():
   - POSIX semantics require that every read and write system
     call appears to be atomic from the point of view of the
     caller
     - if a writer writes a range of the file:
       - any reader must see the entire write, or none of the
         write
         - cannot see a partial write
     - many other filesystems implement this with a large
       reader-writer lock on the entire file
       - can have many readers, or only one writer
       - gives bad performance for an application that wants to
         issue a lot of concurrent writes
         - example:  database
       - those filesystems then have a mount flag or similar
         that allows you to disable the locking and the POSIX
         semantics 
         - the application then MUST know what it is doing

   - zfs implements finer-grained locking with zfs_range_lock():
     - creates a reader lock,
     - on file zp,
     - from <second parameter> offset,
     - for <third parameter> length

   - if any concurrent writes overlap that (offset, length),
     - the zfs_range_lock() call will block

   - will not block for concurrent reads
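
 The blocking rule described above can be sketched in a few lines. This is a simplified illustration, not the code in zfs_rlock.c: it scans a flat array instead of an AVL tree, and all type and function names are hypothetical. It also assumes the usual reader-writer rule that a new writer conflicts with any overlapping lock, which the notes do not state explicitly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { LOCK_READER, LOCK_WRITER } lock_type_t;

typedef struct held_range {
	uint64_t	off;
	uint64_t	len;
	lock_type_t	type;
} held_range_t;

static bool
ranges_overlap(uint64_t off1, uint64_t len1, uint64_t off2, uint64_t len2)
{
	return (off1 < off2 + len2 && off2 < off1 + len1);
}

/* Would a new lock of `type` on [off, off + len) have to wait? */
bool
range_would_block(const held_range_t *held, size_t nheld,
    uint64_t off, uint64_t len, lock_type_t type)
{
	assert(len > 0);
	for (size_t i = 0; i < nheld; i++) {
		if (!ranges_overlap(held[i].off, held[i].len, off, len))
			continue;
		/* Overlapping readers coexist; any writer conflicts. */
		if (type == LOCK_WRITER || held[i].type == LOCK_WRITER)
			return (true);
	}
	return (false);
}
```

 A reader overlapping only other readers proceeds immediately; a reader overlapping a held write range would wait.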

zfs_range_lock(): (00:17:01)

 - contrib/opensolaris/uts/common/fs/zfs/zfs_rlock.c

 - implemented with an AVL tree 
   - a balanced, sorted binary tree
   - keeps track of all the ranges that are locked

 - avl_add():  add a newly-allocated lock to the avl tree
   - fast path

 - zfs_range_lock_reader():  add a new entry to the avl tree
   for this range

zfs_range_lock_reader(): (00:18:10)

 - contrib/opensolaris/uts/common/fs/zfs/zfs_rlock.c

 - use avl_find() to locate any rl_t structures that are
   write locked in the range we are trying to read

 - add with zfs_range_add_reader() if no writer conflicts
   - able to deal with partially overlapping reads
   - possible example of two overlapping reads:
     - this part has one reader,
     - this part has two readers,
     - this part has one reader
   - add may require splitting an existing avl entry

- (back in zfs_read()) (00:19:59)

 - note code having to do with mmap()ped reads, which
   we will ignore
   - complication because zfs is not integrated with the
     FreeBSD page cache
     - data could be present in both zfs and the page cache

 - dmu_read_uio_dbuf():  read the specified file from the DMU
   - first parameter specifies the file
   - uio:  where to put the results,
     - and where to start the read
   - nbytes:  read this number of bytes

 - at the end of the function:
   - release the range lock
   - update the atime

dmu_read_uio_dbuf(): (00:21:17)

 - dmu_buf_t *zdb:  which file to read from
   - a dbuf that is in the file we want to read
   - in this case, it will always be the bonus buffer
     - data stored embedded in the dnode
   - quicker than specifying the object set and object number
     - saves us a lookup in the dbuf hash table
     - otherwise, would have to look up the object number in
       the hash table to get the dnode

dmu_read_uio_dnode(): (00:22:38)

 - sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu.c

 - dmu_buf_hold_array_by_dnode():
   - find all the dbufs that we want to read

 - use uiomove() to copy the data from the dbuf to the uio
   - from where zfs_read() ultimately needs to get the data

dmu_buf_hold_array_by_dnode(): (00:24:12)

 - sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu.c

 - might be reading more than one block

 - zfs supports:
   - multiple logical block sizes
   - different physical block sizes

 - typically:
   - a large file will use multiple large blocks
   - a small file will use a single block, the size of the file

 - default zfs block size is 128 KB

 - example:  a 1 MB read involves 8 * 128 KB blocks
   - this function will read all 8 blocks
   - it will issue all 8 reads in parallel
     - if you have:
       - multiple disks,
       - an SSD that can do more than one thing at once,
       - you will be able to take advantage of this
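
 The block count in the example above follows from simple offset arithmetic. The helper below is a hypothetical illustration (its name is not from the ZFS source) and assumes a single fixed block size for the whole file.

```c
#include <assert.h>
#include <stdint.h>

/* Number of blocks touched by a read of `len` bytes at byte offset `off`. */
uint64_t
blocks_spanned(uint64_t off, uint64_t len, uint64_t blksz)
{
	uint64_t first, last;

	assert(len > 0 && blksz > 0);
	first = off / blksz;			/* block holding the first byte */
	last = (off + len - 1) / blksz;		/* block holding the last byte */
	return (last - first + 1);
}
```

 An aligned 1 MB read with 128 KB blocks spans 8 blocks; the same read starting one byte later spans 9.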

 - zio_root():
   - use this to issue all of the ZIOs in parallel, and wait
     for them to complete
   - a ZIO root is a ZIO that doesn't do anything
     - it is a placeholder within a tree of ZIOs
   - example:  the (8) ZIOs that we issued in parallel, above,
     will all be children of the ZIO root

   - later in the function, zio_wait() waits on the parent ZIO
     - implicitly waits for all its children
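
 The parent/child relationship can be modeled with a pending-child counter. The following is a single-threaded sketch under that assumption, with hypothetical names; the real ZIO code is asynchronous and tracks considerably more state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct sketch_io {
	struct sketch_io *parent;	/* NULL for the root */
	int		pending;	/* children not yet completed */
	bool		done;
} sketch_io_t;

/* Create an I/O and, if it has a parent, register it as a child. */
void
sketch_io_create(sketch_io_t *io, sketch_io_t *parent)
{
	io->parent = parent;
	io->pending = 0;
	io->done = false;
	if (parent != NULL)
		parent->pending++;
}

/* Mark an I/O complete and notify its parent. */
void
sketch_io_complete(sketch_io_t *io)
{
	assert(!io->done && io->pending == 0);
	io->done = true;
	if (io->parent != NULL)
		io->parent->pending--;
}

/* A wait on the root can return only when no children are pending. */
bool
sketch_io_wait_would_return(const sketch_io_t *root)
{
	return (root->pending == 0);
}
```

 With 8 children attached to one root, a wait on the root cannot return until the eighth child completes.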

 - for (i = 0; i < numbufs; i++):
   - read the 8 blocks

   - need to do dbuf_hold()
   - initiate the asynchronous I/O with dbuf_read()
     - uses the parent ZIO created above

 - dmu_zfetch():  notify the prefetcher to do the prefetch

dbuf_hold(): (00:28:00)

 - calls dbuf_hold_level()
   - calls dbuf_hold_impl()

dbuf_hold_impl(): (00:28:07)

 - sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dbuf.c

 - dn:  pull a dbuf of this dnode (object),
 - level: 
   - 0 = data,
   - 1 = first-level indirect block,
   - 2 = second-level indirect block,
 - blkid:  0-indexed block in the file
 - dbp:  the dbuf is returned through here
 - tag:  seen throughout zfs
   - zfs uses many reference counts
     - very hard to debug when you get them wrong
     - refcount_t helps with debugging
       - simple reference count in production builds
       - a complex data structure in debug builds
         - contains a list of references
         - whenever we add a reference count, we allocate a
           reference and add it to the linked list
           - a link to "tag" is added
           - tag tells us who acquired the reference
           - often the macro FTAG
             - the name of the calling function
           - common error:  take a reference in a function, and
             - neglect to give it back in some error path
           - in illumos code, there is a way to associate a stack
             trace with the tag
             - not clear if this is ordinarily available
           - adding a hold, and releasing that hold, requires
             the same tag
             - checked in refcount code
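
 The debug-build behavior described above might look roughly like the sketch below. It is an illustration with hypothetical names, not the ZFS refcount implementation: each hold allocates a list node carrying its tag, and a release must present a tag that matches an outstanding hold. Here a mismatch simply returns false rather than stopping the system.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct sketch_ref {
	const void		*tag;	/* who took this hold */
	struct sketch_ref	*next;
} sketch_ref_t;

typedef struct sketch_refcount {
	uint64_t	count;
	sketch_ref_t	*refs;		/* debug-only list of holds */
} sketch_refcount_t;

void
sketch_refcount_add(sketch_refcount_t *rc, const void *tag)
{
	sketch_ref_t *ref = malloc(sizeof (*ref));

	assert(ref != NULL);
	ref->tag = tag;
	ref->next = rc->refs;
	rc->refs = ref;
	rc->count++;
}

/* Returns false if no outstanding hold carries this tag. */
bool
sketch_refcount_remove(sketch_refcount_t *rc, const void *tag)
{
	for (sketch_ref_t **pp = &rc->refs; *pp != NULL; pp = &(*pp)->next) {
		if ((*pp)->tag == tag) {
			sketch_ref_t *ref = *pp;

			*pp = ref->next;
			free(ref);
			rc->count--;
			return (true);
		}
	}
	return (false);
}
```

 A caller would typically pass the same pointer, such as its own function name, to both the add and the remove, so a hold leaked on an error path stays on the list with the name of the function that took it.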

 - dbuf_find():                                                    (00:34:10)
   - tries to locate this dbuf in the hash table
   - hash function is based on (object set, object, level,
     block id)
     - the logical identity of this block
     - DBUF_HASH()

 - if we did not find the dbuf in the hash table:

 - dbuf_findbp()
   - returns the parent dbuf,
   - returns the block pointer for the dbuf that we are
     looking for

dbuf_findbp(): (00:28:07)

 - sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dbuf.c

 - figures out what sort of parent we have
   - block is referenced from an indirect block,
   - or block is referenced from the dnode

 - a dnode contains an embedded pointer
   - can point to a tree of indirect blocks,
     - which will eventually point to a data block

 - } else if (level < nlevels-1):

   - example:  assume that we need to go one more level of
     indirection to find the data block

   - level is the level of the block we are looking for

   - recursively call dbuf_hold_impl()
     - looking for the block in the same file (dn) that is
       one level above us
       - at the blkid that points to us
     - at each level, the block IDs are sequential

     - current FreeBSD:  default indirect block size is 16 KB
       - means there are 128 block pointers in each indirect
         block
       - the first 128 indirect pointers would be in level 1,
         block ID 0
       - the second 128 indirect pointers would be in level 1,
         block ID 1
       - (  >> epbs) means divide by 128
         - entries per block shift
         - should be 7 (2^7 = 128)                                 (00:37:59)
     - recursive call puts a hold on our parent indirect block

   - call dbuf_read() to read parent's indirect block

   - do some math to find the block pointer within the parent's
     indirect block
     - block ID mod 128
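
 The shift and mask described above can be written out as two small helpers. This is a sketch with hypothetical names; the constants assume the 16 KB indirect block size from the notes and a 128-byte block pointer, which is what 128 pointers per 16 KB block implies.

```c
#include <assert.h>
#include <stdint.h>

#define	SKETCH_INDBLKSHIFT	14	/* 16 KB indirect block (from the notes) */
#define	SKETCH_BLKPTRSHIFT	7	/* 128-byte block pointer (implied) */
#define	SKETCH_EPBS	(SKETCH_INDBLKSHIFT - SKETCH_BLKPTRSHIFT)

/* Block ID, one level up, of the indirect block that points to `blkid`. */
uint64_t
parent_blkid(uint64_t blkid, unsigned epbs)
{
	assert(epbs > 0 && epbs < 64);
	return (blkid >> epbs);
}

/* Index of the block pointer for `blkid` within that indirect block. */
uint64_t
index_in_parent(uint64_t blkid, unsigned epbs)
{
	assert(epbs > 0 && epbs < 64);
	return (blkid & ((1ULL << epbs) - 1));
}
```

 For example, level-0 block 300 is reached through level-1 block 2, at index 44 within that indirect block.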

- (back in dbuf_hold_impl()) (00:38:56)

   - dbuf_create():  create a new dbuf for level 0

dbuf_create(): (00:39:05)

 - sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dbuf.c

 - allocate a new dbuf
   - kmem_cache_alloc()
 - fill out information in the dbuf,
 - add it to the hash table
   - dbuf_hash_insert()
     - the lock for the hash bucket is taken inside
       this function
     - it could be that another thread was adding a dbuf
       with the same identity to the hash table
       - it got the lock before we did
       - if so, dbuf_hash_insert() returns the odb that
         was already there
       - we throw ours away and use theirs

 - avl_add():
   - every object keeps track of all of its instantiated dbufs
     in an avl tree
     - sorted by level and block id
     - we need to use this when we are manipulating a whole
       range of blocks
       - example:  freeing a chunk of the file
       - we have to find all the dbufs in this range to
         invalidate them

 - refcount_add():
   - in production builds, a normal increment
   - in debug builds:
     - adds a new reference to the refcount in the first
       parameter
       - the hold count on the dnode
     - db is the tag that's used
       - the dbuf that's holding this dnode

- (back in dbuf_hold_impl()) (00:42:02)

- (back in dmu_buf_hold_array_by_dnode()) (00:42:45)

 - have completed the dbuf_hold()

 - dbuf_read():
   - initiate the read, maybe

dbuf_read(): (00:43:20)

 - sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dbuf.c

 - I have this dbuf
 - I want you to change it to the cached state, if it isn't
   already

 - if (db->db_state == DB_CACHED):
   - it's already cached
   - notify the prefetcher
   - we're done

 - } else if (db->db_state == DB_UNCACHED):
   - if not cached, call dbuf_read_impl()

dbuf_read_impl(): (00:43:47)

 - actually do the read

 - arc_read()
   - zio:  the current zio
     - allows us to wait for a bunch of ZIOs at once
   - os_spa:  the pool
   - db_blkptr:  block pointer, specific to the pool
     - read this block from the above pool

   - the zio pipeline makes extensive use of callbacks
     - when the read completes, call dbuf_read_done

   - there are different priorities of reads
     - ZIO_PRIORITY_SYNC_READ:  synchronous read
       - can be issued in preference to asynchronous writes

- (back in dmu_buf_hold_array_by_dnode()) (00:45:24)

 - this dbuf_read() (following "initiate async i/o" comment)
   - created a ZIO to do the read,
     - attached it as a child of zio argument
   - we did not necessarily issue any I/O to disk yet

 - zio_wait()
   - called on parent zio
   - initiates the children I/Os,
     - and waits for all of them to complete

 - recap:
   - dbuf_read() creates the ZIOs,
   - zio_wait():  wait for the ZIOs to complete,
   - when the ZIOs complete, call dbuf_read_done()

dbuf_read_done(): (00:46:32)

 - } else if (zio == NULL || zio->io_error == 0):
   - no error,
   - dbuf_set_data():  sets db data to buf buffer
     - buf is an arc_buf_t
     - associates arc_buf_t buf with dmu_buf_impl_t db
   - set the db_state to cached

- (back in dmu_buf_hold_array_by_dnode()) (00:47:58)

- (back in dmu_read_uio_dnode()) (00:48:16)

 - call uiomove() to copy the data from the dbuf to the uio

 - we are done

 - that covers the entire read path, except for the ARC and ZIO

arc_read(): (00:48:43)

 - sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c

 - bp:  look for this block pointer in cache,
   - done:  if bp is found in cache, call the "done"
     callback immediately
   - otherwise, we issue a ZIO which will eventually
     call the callback

 - the ARC:  has a scan-resistant MRU algorithm:
   - divided into:
     - blocks that have been accessed exactly once,
     - blocks that have been accessed more than once

 - arc_read() starts with a call to buf_hash_find()
   - looks for an ARC buf in the hash table
   - looks up by this block pointer

   - if it finds something, it returns it with the
     hash_lock held

 - there are two main cases:
   - a hit, or a miss

 - in this case, a hit:
   - buf_hash_find() returned non-NULL,
     - has L1 header,
     - has a nonzero data count

   - buf_hash_find() could also return non-NULL,
     - but no L1 header

     - the arc_buf_hdr_t is divided into two parts:
       - the part needed when the block is cached in the L2ARC,
       - the L1 part, needed when the data is cached in memory

     - if only in L2ARC,
       - want a small header so we can cache many of them
     - hdr->b_l1hdr.b_datacnt > 0:
       - are there any data buffers associated with this 
         ARC buf header

   - ARC_FLAG_CACHED:  tells the caller that this was a
     cache hit

   - if (done):  this is a cache hit,
     - note:  arc_callback_t is used to keep track of the
       callback in the case of a cache miss

     - this case ends up calling the callback
       - dbuf_read_done() in this case

 - else:                                                           (00:54:12)
   - everything did not work out perfectly,
   - we did not have the cache hit

   - if (hdr == NULL):  the header does not exist
     - we need to create a new ARC buf header,
       - arc_buf_alloc()
     - and add it to the hash table

   - else:

     - the ARC has two components:
       - referenced once,
       - referenced multiple times
     - ARC memory is divided to hold these two
       - size of each dynamically determined

     - ghost cache:  ARC headers for data that was recently
       evicted from the cache
       - we keep the headers, discard the associated data
       - hits in the ghost cache tell us if it was a hit in the
         accessed once, or accessed many times section
         - these hits can cause memory to shift between one
           section of the cache and the other
       - this is how you get a buffer with an L1 header, but no
         data

   - acb:  tracks the callback function
     - complication:  you could have many threads calling
       arc_read() on the same block pointer at the same time
       - but, we want to issue only one ZIO read for that data
       - when that one read completes, we call all the callbacks
         for that data
       - how we get multiple concurrent reads of the same block
         pointer:
         - snapshots and clones
         - one instance of the DMU is only going to issue one read
           for one block pointer at a given time
           - it has one dbuf that points to that,
           - the dbuf knows that it is already in the middle of
             reading this,
           - if someone else tries to read this, they wait for
             the existing read to complete
           - but, if you are reading from another clone that
             references the same physical block,
             - it will be reading with the same block pointer,
               but from a different location
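
 The idea of issuing one physical read and fanning the result out to every waiting caller can be sketched as follows. This is a single-threaded illustration with hypothetical names and a fixed-size waiter table; the real ARC keeps a linked list of arc_callback_t structures and takes the hash lock around these steps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define	MAX_WAITERS	8

typedef void (*done_fn_t)(void *private, const void *data);

typedef struct sketch_hdr {
	bool		io_in_progress;
	size_t		nwaiters;
	done_fn_t	done[MAX_WAITERS];
	void		*private[MAX_WAITERS];
} sketch_hdr_t;

/*
 * Register a callback.  Returns 1 if the caller must issue the physical
 * read, 0 if a read is already in flight, -1 if the table is full.
 */
int
sketch_read(sketch_hdr_t *hdr, done_fn_t done, void *private)
{
	assert(hdr != NULL && done != NULL);
	if (hdr->nwaiters == MAX_WAITERS)
		return (-1);
	hdr->done[hdr->nwaiters] = done;
	hdr->private[hdr->nwaiters] = private;
	hdr->nwaiters++;
	if (hdr->io_in_progress)
		return (0);
	hdr->io_in_progress = true;
	return (1);
}

/* The one read finished: call every registered callback. */
void
sketch_read_done(sketch_hdr_t *hdr, const void *data)
{
	for (size_t i = 0; i < hdr->nwaiters; i++)
		hdr->done[i](hdr->private[i], data);
	hdr->nwaiters = 0;
	hdr->io_in_progress = false;
}

/* Example callback: counts how many times it was invoked. */
void
count_done(void *private, const void *data)
{
	(void)data;
	(*(int *)private)++;
}
```

 Two callers asking for the same block result in a single read request, and both callbacks run when it completes.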

   - L1 cache miss, look in L2ARC:
     - instructor skipped over L2ARC code

   - zio_read():  we need to read from disk
     - we've missed all of the caches
     - calls arc_read_done() when the read completes

arc_read_done(): (00:58:55)

 - when the read completes, the done callback will be called,
   - arc_read_done() in this case

 - calls all the callbacks that the DMU has requested

 - digression:  ZFS works transparently with pools that were
   created with either endianness
   - (code near BP_SHOULD_BYTESWAP())
   - note:  SPARC:  big endian; Intel:  little endian
   - note:  FFS:  filesystem cannot be moved between different
     endian systems, because FFS metadata is stored in the
     original machine's endianness
   - ZFS stores metadata on the disk in the host system's native
     endianness,
     - but it keeps track, in every block, what endianness was
       used to write that block
   - func:  a function that knows how to byte swap all the
     metadata in there
   - you can move a pool back and forth between SPARC and Intel
   - byte swapping is done in the ARC
   - all data cached in the ARC is stored in the current host's
     byte order, regardless of how it is stored on disk
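
 A minimal sketch of the swap-on-read idea, assuming the block's metadata is an array of 64-bit words and that the byte order used by the writer is known from the block pointer. The names are hypothetical; the real code selects a type-specific byteswap function rather than one generic routine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Reverse the byte order of one 64-bit word. */
uint64_t
sketch_bswap64(uint64_t x)
{
	x = ((x & 0x00000000FFFFFFFFULL) << 32) | (x >> 32);
	x = ((x & 0x0000FFFF0000FFFFULL) << 16) |
	    ((x >> 16) & 0x0000FFFF0000FFFFULL);
	x = ((x & 0x00FF00FF00FF00FFULL) << 8) |
	    ((x >> 8) & 0x00FF00FF00FF00FFULL);
	return (x);
}

/* Detect the byte order of the machine running this code. */
bool
host_is_big_endian(void)
{
	const uint16_t probe = 1;

	return (*(const uint8_t *)&probe == 0);
}

/* Convert a block to host byte order if it was written the other way. */
void
sketch_fix_byteorder(uint64_t *words, size_t n, bool written_big_endian)
{
	assert(words != NULL || n == 0);
	if (written_big_endian == host_is_big_endian())
		return;		/* already in host order */
	for (size_t i = 0; i < n; i++)
		words[i] = sketch_bswap64(words[i]);
}
```

 Because the conversion happens once, as the block enters the cache, everything above the cache can treat the data as host-order.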

 - comment:  create copies of the data buffer...                   (00:00:55)
   - if there are multiple callbacks,
     - each has its own dbuf,
     - each gets its own copy of the data in the ARC
   - the ARC header is specific to a particular physical block
     on disk
     - the header can have multiple ARC buffers associated with
       it, if there are multiple DMUs that are accessing that
       block concurrently
   - arc_buf_clone() copies data to a new buffer

 - call acb_done() callback function(s) at the end
   - in our case, dbuf_read_done()

We are now done with the ARC (00:02:12)

Next is ZIO

zio_read(): (00:02:24)

 - sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c

 - calls zio_create() to create a ZIO

zio_create(): (00:02:41)

 - fills in all the information that was passed in
   - read,
   - callback

 - io_bp:  block pointer

 - associates the ZIO with its parent
   - zio_add_child()

 - does not issue the I/O read

- (back in zio_read()) (00:03:31)

zio_wait(): (00:03:39)

 - where the I/O read is issued

 - calls zio_execute()
   - drives the whole ZIO pipeline

zio_execute(): (00:03:39)

 - block comment gives a good overview

 - continues to call functions in zio_pipeline[] array
   - contains function pointers for each stage

   - for reads, executes:
     - zio_read_bp_init,
     - zio_ready,
     - zio_vdev_io_start,
     - zio_vdev_io_done,
     - zio_vdev_io_assess,
     - zio_done

   - note that reads are considerably simpler than writes
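
 The table-driven pipeline can be illustrated with a toy version. The sketch below is not the ZFS implementation: the four stages, their names, and the trace buffer are hypothetical. It shows an array of stage functions, a per-I/O bitmask selecting which stages run, and a stage that stops the loop so that execution can resume later from the next stage.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { STAGE_CONTINUE, STAGE_STOP };

typedef struct sketch_zio {
	uint32_t	pipeline;	/* bitmask of enabled stages */
	uint32_t	stage;		/* index of the next stage to run */
	char		trace[8];	/* letters of the stages that ran */
	size_t		ntrace;
} sketch_zio_t;

typedef int (*stage_fn_t)(sketch_zio_t *);

static void
record(sketch_zio_t *zio, char c)
{
	if (zio->ntrace < sizeof (zio->trace))
		zio->trace[zio->ntrace++] = c;
}

static int stage_init(sketch_zio_t *z) { record(z, 'I'); return (STAGE_CONTINUE); }
/* Hands the I/O to the device; completion re-enters sketch_execute(). */
static int stage_io_start(sketch_zio_t *z) { record(z, 'S'); return (STAGE_STOP); }
static int stage_io_done(sketch_zio_t *z) { record(z, 'D'); return (STAGE_CONTINUE); }
static int stage_done(sketch_zio_t *z) { record(z, 'F'); return (STAGE_CONTINUE); }

#define	NSTAGES	4
static const stage_fn_t sketch_pipeline[NSTAGES] = {
	stage_init, stage_io_start, stage_io_done, stage_done
};

void
sketch_execute(sketch_zio_t *zio)
{
	assert(zio != NULL);
	while (zio->stage < NSTAGES) {
		uint32_t cur = zio->stage++;

		if ((zio->pipeline & (1U << cur)) == 0)
			continue;	/* stage not enabled for this I/O */
		if (sketch_pipeline[cur](zio) == STAGE_STOP)
			return;		/* resumed by a later call */
	}
}
```

 The first call runs up to the stage that issues the I/O and returns; a second call, standing in for the completion path, runs the remaining stages.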

zio_read_bp_init(): (00:06:00)

 - if *bp is compressed
   - invoke the zio transform stack
     - zio_push_transform()

   - a buffer is provided, in which to put the user's data
     - must be uncompressed data
     - need to allocate a new buffer, cbuf
       - psize is essentially the compressed size
     - then push the transform
       - after the read from disk completes, calls
         zio_decompress()

zio_vdev_io_start(): (00:07:46)

 - issue the ZIO to the VDEV queue

 - there are two primary types of ZIOs:
   - ones with an associated block pointer
   - ones with an associated VDEV

 - background:  when we first create the ZIO with zio_read(),
   we specify the block pointer
   - the block pointer might reference:
     - multiple VDEVs,
     - mirrors,
     - multiple DVAs

 - in our case, the ZIO has a block pointer associated with it,
   but it does not have a VDEV associated with it

 - all reads go through the mirroring code,
   - knows multiple places where the data might be,
   - knows how to reconstruct data,
   - knows how to replace bad data
   - we have hijacked the mirroring code to use it for all
     types of reads
     - because we might have multiple DVAs

   - the mirror code will create a child ZIO for a specific
     VDEV and offset
     - physical I/O
     - might have to create one or more
   - now, we just have a logical ZIO which is associated with
     the block pointer

vdev_mirror_io_start(): (00:11:11)

 - contrib/opensolaris/uts/common/fs/zfs/vdev_mirror.c

 - calls vdev_mirror_child_select()
 - we decide which location we are going to try first

 - vdev_mirror_map_init()
   - tells us all the places the data might be

 - zio_vdev_child_io()
   - create a new child I/O
     - associated with a particular VDEV, and a particular
       offset
   - mc is a member of the mirror map

   - vdev_mirror_child_done() called when I/O completes
     - if there was an error, causes us to try somewhere else
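
 The retry behavior can be reduced to a loop over the known copies. This is a synchronous sketch with hypothetical names: it tries copies in index order, whereas the real code chooses a preferred child first and drives the retries asynchronously from the child's done callback.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct sketch_copy {
	bool	readable;	/* stands in for "the child I/O succeeded" */
	bool	tried;
} sketch_copy_t;

/*
 * Returns the index of the copy that satisfied the read,
 * or -1 if every copy failed.
 */
int
sketch_mirror_read(sketch_copy_t *copies, size_t ncopies)
{
	assert(copies != NULL && ncopies > 0);
	for (size_t i = 0; i < ncopies; i++) {
		copies[i].tried = true;
		if (copies[i].readable)
			return ((int)i);
		/* Error on this copy: fall through and try the next one. */
	}
	return (-1);
}
```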

- (back in zio_vdev_io_start()) (00:13:12)

 - we create the child ZIO which has a specific VDEV and
   offset,
   - we will come back into zio_vdev_io_start(),
     - but this time vd (VDEV) will not be NULL

 - if the VDEV we selected was not a leaf:
   - example:  could be a mirror or a RAIDZ
   - only leaf vdevs are physical devices
   - example:  RAIDZ
     - an abstraction above the physical hardware
     - not leaf, call through vdev_op_io_start
       - vdev_raidz_io_start() in our example

vdev_raidz_io_start(): (00:15:21)

 - creates ZIOs to read from each of the actual leaf vdevs
   that are part of this RAIDZ
 - when complete, there will be a "done" callback that will
   reconstitute the data and copy it back

 - creates the ZIOs with zio_vdev_child_io()
 - cvd:  child

- (back in zio_vdev_io_start()) (00:16:20)

 - finally, a leaf ZIO

 - at vdev_queue_io()
   - we still have not issued a read to physical disks
   - we only give the disk a few things to do at a time
     - allows us to use our sorting algorithm
       - priority order
       - sorted in LBA order
         - minimizes seeking
   - enqueue this ZIO to the queue
   - return the next ZIO that we actually want to issue to 
     disk
     - may or may not be the same ZIO we passed in

   - call through vdev_op_io_start
     - this time a leaf
     - vdev_geom_io_start() in this case

vdev_geom_io_start(): (00:18:51)

 - contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c

 - interfaces with different things on different operating
   systems

 - ZIO_TYPE_READ:
   - sets up the bio

 - callback is vdev_geom_io_intr()
   - when the I/O completes

 - we are out of ZFS, into the GEOM layer
   - to issue the physical I/O to the disk

vdev_geom_io_intr(): (00:20:36)

 - called after physical I/O to the disk completes

 - calls zio_interrupt() if successful
   - calls zio_taskq_dispatch()
     - calls spa_taskq_dispatch_ent()
       - resume execution on some other thread,
         - running zio_execute()

zio_execute(): (00:21:40)

 - with a different stage this time
   - zio_vdev_io_done()

zio_vdev_io_done(): (00:22:05)

 - interlock:  may have to wait for children

 - if a physical VDEV, and a leaf node,
   - and a read or a write,
     - call vdev_queue_io_done()
       - tells the queueing logic that this ZIO has completed
         - so it can be removed from the queue

 - vdev_op_io_done()
   - vdev code?

zio_checksum_verify(): (00:24:14)

 - the last stage in the pipeline

 - calls zio_checksum_error()
   - calculates the checksum
   - ci_func() actually performs the checksum
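
To make the verify step concrete, here is a small sketch that uses Fletcher-4 (four running sums over 32-bit words) as the checksum function. The choice of Fletcher-4 and the checksum_error() helper below are assumptions for illustration; the notes only say that ci_func() performs whichever checksum is configured.

```python
import struct

MASK64 = (1 << 64) - 1


def fletcher4(data):
    """Fletcher-4 over little-endian 32-bit words, 64-bit accumulators."""
    a = b = c = d = 0
    for (word,) in struct.iter_unpack("<I", data):
        a = (a + word) & MASK64
        b = (b + a) & MASK64
        c = (c + b) & MASK64
        d = (d + c) & MASK64
    return (a, b, c, d)


def checksum_error(expected, data):
    """Return True when the data read does not match the stored checksum."""
    return fletcher4(data) != expected


block = struct.pack("<4I", 1, 2, 3, 4)
stored = fletcher4(block)                    # kept apart from the data itself
corrupted = struct.pack("<4I", 1, 2, 3, 5)   # one word changed on "disk"
```

Since the expected value is stored away from the block it describes, a block that was silently damaged or written to the wrong place fails the comparison when it is read.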

vdev_queue.c: (00:25:13)

 - giant block comment at top describes how it works
 - also good blog posts
   - write throttle

 - scaling up the number of concurrent writes as the amount
   of dirty data increases

 - contains a bunch of tunables that can have a big impact
   on performance

   - how many ZIOs of each priority we will execute
     concurrently

   - control the shape of the graph

 - this layer includes both queueing and I/O aggregation

 - I/O aggregation:
   - if we are reading a bunch of blocks,
   - the code would find the blocks that are contiguous on disk,
     - issue a single I/O for those blocks
   - the VDEV queue can do this aggregation
     - up to 128K by default, tunable
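
The aggregation rule can be sketched as follows. This is a simplified model, assuming requests are plain (offset, size) pairs and that only exactly adjacent requests are merged; aggregate() and AGG_LIMIT are names invented for this example.

```python
AGG_LIMIT = 128 * 1024  # default from the notes; tunable in ZFS


def aggregate(ios, limit=AGG_LIMIT):
    """Merge (offset, size) requests that are contiguous on disk."""
    merged = []
    for offset, size in sorted(ios):
        if merged:
            last_off, last_size = merged[-1]
            contiguous = last_off + last_size == offset
            if contiguous and last_size + size <= limit:
                # Extend the previous I/O instead of issuing a new one.
                merged[-1] = (last_off, last_size + size)
                continue
        merged.append((offset, size))
    return merged


K = 1024
reads = [(0, 64 * K), (64 * K, 64 * K), (128 * K, 64 * K), (512 * K, 8 * K)]
result = aggregate(reads)
```

The first two reads fuse into one 128K I/O; the third is contiguous but would exceed the limit, and the fourth is not adjacent, so both are issued separately.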

Write code path

For the write path, the instructor will provide more description
and do less walking through source code

Slide 5: ZFS Module Layering (00:04:42)

 - (See Figure 10.2, Book, page 526)

 - the write code path is very different from the read code path
   - primarily because the handling of writes is broken into two
     phases
     - open context,
     - syncing context

 - open context:
   - when the write system call calls into ZFS,
     - we copy the data from the user's buffer into ZFS
     - ZFS is caching that dirty data in the DMU,
       - it has to write it out at a later point in time

 - syncing context:
   - is that later point in time
   - we accumulate a lot of dirty data
     - up to 4 GB by default
   - when we write it out, we do so in a big batch,
     - with lots of ZIOs

 - writes come through the VFS/ZPL path,
 - data is stored in the DMU
 
 - in syncing context,
   - the DMU writes data to the ZIO,
   - caches certain data in the ARC,
   - notifies the DSL of allocations so it can keep track
     of space

zfs_write(): (00:07:00)

 - contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c

 - vp:  want to write to this vnode
 - uio:  gives the offset and length,
   - and the data we want to write

 - much more complicated than zfs_read(), because we might need
   to change a bunch of metadata
   - such as size
   - if you are extending a file, it's more complicated

 - zfs_range_lock():  range locking code
   - writer this time

 - while (n > 0):
   - we need to break the write into reasonably sized chunks
     - 1 MB by default
   - user could ask for 1 TB to be written

   - the entire range is locked 
     - atomic with respect to concurrent reads

 - when doing a write, there are two interesting cases:
   - partial block write,
   - full block write

 - partial block write:  have to:
   - read in the block,
   - modify part of it,
   - mark it dirty

 - full block write:
   - have to write the entire block
   - do not need to read in the block first
   - have to associate the new data with the dbufs

 - dmu_tx_create():  a transaction
   - how the ZPL interacts with the DMU:
     - tells it "I am going to be making these modifications"
       - "is there enough space?"
         - if so, we are allowed to continue

 - dmu_tx_assign():  wait for the next available transaction
   group

 - dmu_write_uio_dbuf():  write the data to the DMU from the uio
   - or -
 - dmu_assign_arcbuf():  another way of writing
   - borrows an ARC buffer and fills it in
   - can copy from another kernel subsystem
   - does this function correctly in FreeBSD?

 - zil_commit():
   - ZFS intent log
   - persists changes on disk before we are able to write them
     out to the main part of storage
   - zfs_write() always creates ZIL records in memory
     - stored in memory until a transaction group writes
       them to disk
   - actually writes to the intent log
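
The chunking loop and the partial/full distinction can be illustrated with a short sketch. The helpers chunks() and classify() are hypothetical, and the 128K block size and 1 MB chunk size are example values, not a statement about what the kernel uses for any particular file.

```python
def chunks(offset, length, max_chunk):
    """Yield (offset, nbytes) pieces that never cross a max_chunk boundary."""
    while length > 0:                      # mirrors "while (n > 0)"
        nbytes = min(length, max_chunk - offset % max_chunk)
        yield offset, nbytes
        offset += nbytes
        length -= nbytes


def classify(offset, nbytes, blocksize):
    """Label each block a chunk touches as a full or partial write."""
    out = []
    end = offset + nbytes
    blk = offset // blocksize
    while blk * blocksize < end:
        start, stop = blk * blocksize, (blk + 1) * blocksize
        # Full only if the write covers the block from start to stop.
        full = offset <= start and end >= stop
        out.append((blk, "full" if full else "partial"))
        blk += 1
    return out


BLOCK = 128 * 1024
pieces = list(chunks(offset=64 * 1024, length=320 * 1024, max_chunk=1024 * 1024))
labels = classify(*pieces[0], blocksize=BLOCK)
```

Block 0 is only half covered, so it would need the read-modify-dirty treatment; blocks 1 and 2 are covered completely and need no read first.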

dmu_tx_assign(): (00:13:04)

 - sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_tx.c

 - check to see if there's enough space
 - also check to see if there is enough memory

dmu_tx_wait(): (00:13:04)

 - check to see if there is enough memory

 - check to see how much dirty data is in memory,
   - maybe delay this operation as a way of creating backpressure
     - dmu_tx_delay()
     - prevents an application from filling an unbounded amount
       of memory with dirty data

dmu_tx_delay(): (00:13:38)

 - prefaced by a huge comment block
   - explains how much we delay as the amount of dirty data
     increases
   - there is also a series of blog posts by Adam Leventhal
     - that explains this
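
The shape of the delay curve can be sketched as below. The formula (delay proportional to the dirty data above a threshold, divided by the room left before the limit) follows the description in that comment block as best recalled here; the parameter names and default values are assumptions for this example, and the handling at the hard limit is simplified to "wait the maximum".

```python
def tx_delay_ns(dirty, dirty_max, min_dirty_percent=60,
                delay_scale=500_000, delay_max_ns=100_000_000):
    """Delay grows hyperbolically as dirty data approaches the limit."""
    delay_min_bytes = dirty_max * min_dirty_percent // 100
    if dirty <= delay_min_bytes:
        return 0                      # below the threshold: no backpressure
    if dirty >= dirty_max:
        return delay_max_ns           # simplified: wait the maximum
    delay = delay_scale * (dirty - delay_min_bytes) // (dirty_max - dirty)
    return min(delay, delay_max_ns)


GB = 1 << 30
DIRTY_MAX = 4 * GB
# Delay at 50%, 70%, 80%, 90%, and 99% of the dirty-data limit.
delays = [tx_delay_ns(DIRTY_MAX * pct // 100, DIRTY_MAX)
          for pct in (50, 70, 80, 90, 99)]
```

The delay is zero until the threshold is crossed and then rises ever more steeply, which is what pushes a fast writer back toward the rate the disks can sustain.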

dbuf_dirty(): (00:14:30)

 - sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dbuf.c

 - marks a dbuf as being dirty

 - we might need to keep track of multiple versions of this
   dbuf's dirty data
 - whenever we do a transaction, we do an assign,
   - assigns a transaction to a transaction group
   - determines when these changes will be persisted to disk
   - always assigned to the currently open transaction group
   - the open transaction group is always advancing
     - there is a limit to the amount of data before we
       switch to the next transaction group
 - if you modify the same block several times in quick
   succession,
   - some writes may go to the same block in memory
     - in this case the block is overwritten
   - but, writes could go to blocks in transaction groups
     7 and 8
     - both could be in memory at the same time
     - dirty records (dr = *drp) are what keep track of
       the several dirty versions
   - because of this multiple version possibility, a given
     block can be dirtied either in open context or in syncing
     context, but never in both
   - to keep track of this:
     - user data is only dirty in open context
     - pool-wide metadata is only dirty in syncing context

 - whenever we dirty a dbuf, we dirty all of the ancestors
   - dirty the data block
   - also dirty L1 indirect block
   - and, dirty all references above it
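
A toy model of the two ideas above, one dirty record per transaction group and dirtying of all ancestors. The Dbuf class and its fields are hypothetical and far simpler than the real dbuf and dirty-record structures.

```python
class Dbuf:
    """Toy dbuf: one dirty record per transaction group (txg)."""

    def __init__(self, name, parent=None):
        self.name, self.parent = name, parent
        self.dirty_records = {}        # txg -> data dirtied in that txg

    def dirty(self, txg, data=None):
        # A second write in the same txg overwrites the existing record.
        self.dirty_records[txg] = data
        # Dirtying a block dirties every ancestor in the same txg.
        if self.parent is not None and txg not in self.parent.dirty_records:
            self.parent.dirty(txg)


root = Dbuf("dnode")
l1 = Dbuf("L1 indirect", parent=root)
data = Dbuf("L0 data", parent=l1)

data.dirty(7, b"first")
data.dirty(7, b"second")   # same txg: overwritten in place
data.dirty(8, b"third")    # next txg: a separate dirty version
```

After these three writes the data block holds two dirty versions (txg 7 and txg 8), and both the L1 indirect block and the dnode are dirty in both transaction groups.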

That concludes open context

 - this is relatively simple
 - basically copying stuff into the DMU, and marking stuff
   dirty

spa_sync(): (00:18:31)

 - sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c

 - called from txg_sync_thread()

 - sync thread is always running
   - "do I need to sync up with a transaction group?"
     - if so, call spa_sync()

 - calls dsl_pool_sync()
   - finds all the dirty data sets
   - calls dsl_dataset_sync()
     - calls dmu_objset_sync()

dsl_pool_sync(): (00:19:33)

 - sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_pool.c

 - writes out all the dirty blocks of all the datasets

 - underneath the zio from zio_root() will be a tree containing
   all the dirty data of all the user datasets
   - everything except for the MOS
     - meta-object set
 - could create tens or hundreds of thousands of ZIOs

 - zio_wait():  allows all of the ZIOs to "start running"
   - means allocate space for them and enqueue them in the
     VDEV queue layer
     - then start running them

dsl_dataset_sync(): (00:20:31)

 - contrib/opensolaris/uts/common/fs/zfs/dsl_dataset.c

 - if you have a lot of datasets, each having a little
   dirty data,
   - you will still be able to have maximum parallelism
   - you'll be able to keep all the disks very busy during
     the entire spa_sync()
     - we create as many ZIOs as we can to do so
     - the queueing layer then figures out how many
       concurrent writes to send to disk
     - lots of dirty data implies a heavy write workload,
       - which will lead to lots of parallel ZIO writes

dmu_objset_sync(): (00:21:32)

 - contrib/opensolaris/uts/common/fs/zfs/dmu_objset.c

 - we need to create the tree of ZIOs
   - reflects the on-disk tree of blocks

reference Slide 8: ZFS Structure (00:22:04)

 - (See Figure 10.5, Book, page 532)

 - the lower left objset/dnode represents the root of the tree

- (back in dmu_objset_sync()) (00:22:15)

 - the root is the first zio created by arc_write()

   - pio:  parent ZIO
     - we are a child of this
   - os->os_rootbp:  stores the block pointer here
   - os->os_spa:  write to this pool
   - tx->tx_txg:  the transaction group number this write is part of
   - os->os_phys_buf:  the data to write
   - the next two parameters control the write
   - zp:  the write properties
     - tells us:
       - is compression enabled,
       - which function to use,
     - set by dmu_write_policy()
   - dmu_objset_write_ready:  call this callback when the write
     is ready
     - we are ready to issue this I/O to disk
     - all of the blocks beneath this have been allocated
       - only after we have done the allocation of the block
         below, do we know what block pointers were inside
         os->os_phys_buf
   - dmu_objset_write_done:  call this callback when write
     completes

 - dnode_sync() will build up the tree of ZIOs
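
Why the "ready" callback has to wait for the children can be shown with a small sketch. WriteZio and its bump allocator are hypothetical; the point is only the ordering: a parent's buffer contains its children's block pointers, so the parent cannot be finalized until every child has been allocated.

```python
class WriteZio:
    """Toy write ZIO: a node whose buffer holds its children's pointers."""

    next_addr = 0                       # hypothetical bump allocator

    def __init__(self, name, children=()):
        self.name, self.children = name, list(children)
        self.addr = None
        self.child_ptrs = []

    def run(self, log):
        # Children first: their addresses are not known until allocated.
        for child in self.children:
            child.run(log)
        # "ready": every child is allocated, so the pointers can be filled in.
        self.child_ptrs = [c.addr for c in self.children]
        self.addr = WriteZio.next_addr
        WriteZio.next_addr += 1
        log.append(self.name)


leaves = [WriteZio("data0"), WriteZio("data1")]
indirect = WriteZio("indirect", leaves)
root = WriteZio("objset", [indirect])
order = []
root.run(order)
```

The blocks become ready from the bottom of the tree upward, ending with the root, even though the ZIO tree was created from the root downward.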

reference Slide 8: ZFS Structure (00:24:53)

 - (See Figure 10.5, Book, page 532)

 - in the lower left objset/dnode tree:
   - the ZIOs are for the line of dnodes, which represent
     the metadata

- (back in dmu_objset_sync()) (00:25:01)

 - dmu_objset_sync_dnodes(), last call in the function:
   - will create ZIOs for all the regular file data

reference Slide 8: ZFS Structure (00:24:53)

 - (See Figure 10.5, Book, page 532)

 - in the lower left objset/dnode tree:
   - the ZIOs are for the master node and the file data
     at the bottom

- (back in dmu_objset_sync()) (00:25:26)

 - zil_sync():  tells the ZIL we don't need the contents
   of the zil (first parameter) up to this transaction
   group (second parameter)
   - because we are persisting onto disk

 - zio_nowait():  
   - I'm not waiting for these ZIOs (zio)
   - I've created all the ZIOs under this tree, so you
     can start issuing them to disk

dmu_objset_sync_dnodes(): (00:26:04)

 - contrib/opensolaris/uts/common/fs/zfs/dmu_objset.c

 - calls dnode_sync() on all of the dirty dnodes
   - each dirty object
   - there is also metadata that can be changing about the
     object
     - calls dbuf_sync_list()
       - finds the dirty dbufs
       - calls dbuf_sync_leaf()
         - calls dbuf_write()
           - calls arc_write()

arc_write(): (00:27:21)

 - sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c

 - same parameters as above

 - dbuf_write_physdone() callback
   - keeps track of how much dirty data there is
     - feeds into the write throttle code

dbuf_write_done(): (00:28:04)

 - when the write is done, we will do accounting

dbuf_write_ready(): (00:28:14)

 - sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dbuf.c

 - calls dsl_dataset_block_born()
   - dsl keeps track of space accounting

dsl_dataset_block_born(): (00:28:30)

 - tells the DSL that this block was born
   - born in this dataset, this transaction
 - updates a bunch of accounting

dsl_dataset_block_kill(): (00:29:29)

 - keeps track of the accounting
 - figures out if we should free the block
   - free it if no snapshots reference it
   - we can't free it if a snapshot references it

 - check birth time vs. previous snapshot's birth time
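
The birth-time comparison reduces to a single test, sketched here. The function name and return values are illustrative only; what the real code does with a block it must keep is not covered by these notes.

```python
def block_kill(birth_txg, prev_snap_txg):
    """Decide what to do with a block the live dataset no longer uses."""
    if birth_txg > prev_snap_txg:
        # Born after the most recent snapshot: no snapshot can reference it.
        return "free"
    # Born at or before the snapshot: the snapshot still references it.
    return "keep"


PREV_SNAP_TXG = 100
decisions = [block_kill(txg, PREV_SNAP_TXG) for txg in (50, 100, 101)]
```

No reference counts or scans are needed: comparing two transaction group numbers is enough to know whether a snapshot can still see the block.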

zio_pipeline[] (00:31:18)

 - sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c

 - pipeline concept is similar to read,
   - but the pipeline stages are much different

 - zio_issue_async():
   - causes the ZIO to be processed by a taskq thread
   - reason:  the zio_checksum_generate stage, which
     can be slow
     - checksums the data
   - if not for async, we would have a single spa_sync
     thread issuing ZIOs
     - means a single thread would be doing checksums
       for all the data in the storage pool
   - we use async so we can use a bunch of threads in the
     zio taskqueues to execute the further stages

 - zio_write_bp_init():
   - compression
   - high-level checks
     - example:  de-dup

 - zio_dva_allocate():
   - figures out where to write this on disk
   - allocates a data virtual address

 - zio_ready():
   - wait for any children,
   - call the ready callback to indicate that we are ready,
   - notify our parent that this child is ready
     - maybe we are ready, maybe not

 - zio_vdev_io_start():
   - enqueue this write in the VDEV queue layer

 - zio_vdev_io_done():
   - called when the write I/O completes

 - zio_vdev_io_assess():
   - if there is a failure, might have to retry

 - zio_done():
   - same as for read
   - assess the final fatal error,
   - call the done callback
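
The pipeline idea, an ordered table of stage functions applied to one ZIO, can be sketched as below. Everything here is a stand-in: the ZIO is a dict, zlib supplies both the "compression" and the "checksum", the DVA is a made-up tuple, the disk is a dict, and the completion stages between the I/O and zio_done() are omitted. Real stages can also pause the pipeline and resume on another thread, which this loop does not model.

```python
import zlib


def issue_async(zio):
    zio["thread"] = "taskq"                    # hand off to a worker thread


def write_bp_init(zio):
    zio["data"] = zlib.compress(zio["data"])   # stand-in for ZFS compression


def checksum_generate(zio):
    zio["checksum"] = zlib.crc32(zio["data"])  # stand-in checksum


def dva_allocate(zio):
    zio["dva"] = (0, 4096)                     # hypothetical (vdev, offset)


def ready(zio):
    zio["ready"] = True                        # would notify the parent here


def vdev_io_start(zio):
    zio["disk"][zio["dva"]] = zio["data"]      # "write" to a dict-backed disk


def done(zio):
    zio["done"] = True                         # would call the done callback


WRITE_PIPELINE = [issue_async, write_bp_init, checksum_generate,
                  dva_allocate, ready, vdev_io_start, done]


def execute(zio):
    """Run each stage in order, recording the path taken."""
    for stage in WRITE_PIPELINE:
        stage(zio)
        zio.setdefault("trace", []).append(stage.__name__)
    return zio


zio = execute({"data": b"x" * 1000, "disk": {}})
```

Note that the checksum is computed over the compressed data, and that the location on disk is chosen only after the final size is known.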

zio_write_bp_init(): (00:35:14)

 - calls zio_compress_data()

zio_dva_allocate(): (00:35:39)

 - determining where we're going to write this on disk

 - calls metaslab_alloc()
   - calls metaslab_alloc_dva()
     - we might need to allocate two DVAs if we are storing
       two copies

metaslab_alloc_dva(): (00:36:02)

 - contrib/opensolaris/uts/common/fs/zfs/metaslab.c

 - now we are doing allocation
 - allocation involves three stages:
   - first, select which vdev to write to
     - we are selecting the top-level vdev here
     - this is essentially a round-robin
       - there is a rotor
       - all the VDEVs are on a doubly-linked list
       - we walk the list, allocating some from each
       - we take into account:
         - how much free space there is on the VDEV,
         - how many blocks we have allocated on this vdev
           that have not yet been written
           - we want to keep the queue relatively full,
             but relatively small
           - we can have 1000 outstanding allocations to
             each vdev
             - creates natural throttling
             - disks may write blocks at different speeds
               - faster disk
               - less fragmented disk
             - now issue 100 writes to each disk
             - if writes complete faster on a given disk,
               we give that disk more work to do
   - second, selecting the meta-slab within that device
     - each VDEV is divided into meta-slabs
       - a chunk of contiguous space on the device
         - by default, a top-level VDEV is divided into 200
           meta-slabs
       - each meta-slab keeps track of the space within it
         using a space map
         - on-disk structure
       - also has in-memory data structures that keep track
         of allocated, free parts of meta-slab
       - only specific meta-slabs are loaded in memory
         - we only allocate from loaded meta-slabs
       - full meta-data in memory only for loaded meta-slabs
       - a small amount of information is available about
         every meta-slab
         - how much free space
         - size distribution of free space
   - third, selecting the offset within the meta-slab
     - selected by looking at an AVL tree
       - lists the exact ranges that are free
     - one way of selecting uses a worst-fit algorithm
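
The three stages can be put together in a toy allocator. This is a sketch under several assumptions: the rotor advances on every allocation (the real one allocates some amount from each vdev before moving on), the metaslab with the most free space is chosen, every metaslab is treated as loaded, and the offset is picked first-fit from a sorted list rather than from an AVL tree. Metaslab and Pool are names invented for this example.

```python
class Metaslab:
    def __init__(self, start, size):
        self.free = [(start, size)]          # sorted free ranges (offset, len)

    def free_space(self):
        return sum(length for _, length in self.free)

    def alloc(self, size):
        """First-fit within this metaslab; returns an offset or None."""
        for i, (off, length) in enumerate(self.free):
            if length >= size:
                rest = (off + size, length - size)
                self.free[i:i + 1] = [rest] if rest[1] else []
                return off
        return None


class Pool:
    def __init__(self, vdevs):
        self.vdevs = vdevs                   # list of lists of Metaslab
        self.rotor = 0

    def alloc(self, size):
        """Return (vdev index, offset), trying each vdev once."""
        for _ in range(len(self.vdevs)):
            vdev_id = self.rotor             # stage 1: round-robin vdev
            self.rotor = (self.rotor + 1) % len(self.vdevs)
            # Stage 2: pick the metaslab with the most free space.
            ms = max(self.vdevs[vdev_id], key=Metaslab.free_space)
            off = ms.alloc(size)             # stage 3: offset in the metaslab
            if off is not None:
                return vdev_id, off
        raise RuntimeError("pool is out of space")


pool = Pool([[Metaslab(0, 100), Metaslab(100, 100)],
             [Metaslab(0, 100), Metaslab(100, 100)]])
placements = [pool.alloc(30) for _ in range(4)]
```

Successive allocations alternate between the two vdevs, and within each vdev the second allocation lands in the emptier metaslab.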

Take-home message:

 - open context:  we just copy the dirty data into the DMU
 - syncing context:  we write it all out
 - the DMU finds all the dirty data,
   - creates ZIOs,
   - the SPA does its allocation

Next week: we will start into the networking code

 - how the socket layer works,
 - how we go through a socket pair

                                                      (clock time:  10:06 PM)

======== end of lecture 06 ======== (00:00:00)
