vnodes
The chief construct in VFS is that of a vnode. A vnode is a representation of a file or special object, independent of the underlying file system. Commonly, a vnode would map to the underlying filesystem's index node (inode) object, though filesystem drivers are free to use the vnode's unique identifier in whatever method suits them. For example, table based filesystems (e.g. FAT) which do not support inodes can use that value as a table index. HFS+ and APFS use the number as a B-Tree node identifier.
The structure, however, is meant to remain opaque, and accessed through public KPIs, all in bsd/sys/vnode.h. These are some 120 or so functions, all well documented, providing getters/setters for the vnode's private fields, as well as miscellaneous operations.
Vnodes are closely linked to each other. All vnodes belonging to the same mounted filesystem can be accessed through the struct mount's mnt_vnodelist, and walked through the vnode's v_mntvnodes. The mounted filesystem can also be quickly accessed through the v_mount field, and is free to hold private data (as it does at the mount level's mnt_data) in an opaque v_data pointer. Each vnode also holds a v_freelist TAILQ_ENTRY for easy access to the vnode freelist, as well as name cache entry links to child vnodes and links. Further down the structure, each vnode also holds a v_parent pointer which, along with the v_name pointer (pointing to its component name), allows for quick full pathname reconstruction.
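To illustrate, consider a minimal kext-side sketch using a few of these accessors (the helper name is hypothetical; the called functions are all public KPIs from bsd/sys/vnode.h and bsd/sys/mount.h):

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/mount.h>
#include <sys/vnode.h>

// Hypothetical helper: dump a vnode's component name, full path and
// containing mount, using only the opaque accessor KPIs.
static void describe_vnode(vnode_t vp)
{
        const char *name = vnode_name(vp);     // the v_name component (may be NULL)
        mount_t     mp   = vnode_mount(vp);    // the containing filesystem (v_mount)

        char path[MAXPATHLEN];
        int  len = sizeof(path);
        if (vn_getpath(vp, path, &len) == 0)   // walks v_parent/v_name to rebuild the path
                printf("%s (%s), on %s\n", path, name ? name : "?",
                       vfs_statfs(mp)->f_mntonname);
}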
A key field in the structure is v_op, a pointer to a vnode operations vector. Not to be confused with the vfstable's vfc_vfsops (which operate at the file system level), the v_op provides the implementations of the common vnode lifecycle methods. The implementations are commonly derived from the filesystem the vnode belongs to, but there are a few quasi-filesystems defining operations as well. These are "quasi" in the sense that they are not mountable, yet define their own vnode operations - even if their vnodes are found in another file system.
Thus, the v_op may conveniently change according to vnode type or lifecycle stage. Not all vnode operations are necessarily supported. More detail on this can be found later in this chapter, under "VFS SPIs", and in the NFS case study. Another common occurrence during the vnode lifecycle is that its buffered data changes state - as some of it gets "dirtied" (i.e. modified). Each vnode's buffered data is maintained in two struct buflists - v_cleanblkhd and v_dirtyblkhd.
The underlying type data is maintained in the v_un union, which holds one of several pointers. For directory vnodes (i.e. when v_type is VDIR), this points to a struct mount, which is either the containing filesystem or (when the directory is a mountpoint) another struct mount. For UNIX domain sockets (VSOCK), it points to a struct socket, discussed in Chapter 14. For device files (VCHR/VBLK), to a struct specinfo (as discussed in Chapter 6). For most vnodes (VREG), this points to a ubc_info, discussed next.
The Unified Buffer Cache (UBC)
The Unified Buffer Cache (UBC) is a concept first introduced into NetBSD[1]. Its aim is to unify the caching mechanisms of VFS (named mappings) and the VM subsystems (used for anonymous memory), thereby using one cache which can benefit from being central and common to both, reducing duplicate caching. UBC was also adopted by Apple in XNU, although the implementation varies from that of *BSD.
A key structure in UBC is the ubc_info. This is a structure pointed to from the vnode's ubc_info field (in the v_un union, which applies for a v_type of VREG, that is, regular files). ubc_info structures are allocated from their own dedicated zone (the ubc_info zone). Each ubc_info is created in the context of its vnode (by a call to ubc_info_init_with_size(), from vnode_create_internal()), and - if the vnode in question already has one due to vnode reuse - it is reused as well. The ubc_info also points back to the struct vnode which refers to it. Figure 7-4 visualizes the ubc_info structure:
[Figure 7-4: The ubc_info structure. Among its fields: cs_mtime, the modification time of the file when the first cs_blob was loaded (ubc_get_cs_mtime)]
Most of the UBC information deals with maintaining the vnode content data in memory. The pager and pager control elements point to the Mach memory pager (in particular, a vnode pager, as discussed in Chapter 11). The cluster read-ahead and write-behind handle the fetching and syncing of the vnode contents to disk through maintaining Universal Page Lists (UPLs, also discussed in Chapter 11).
The rest of the elements are used by the code signing subsystem. The most important of these are the cached cs_blobs, which (as discussed in III/5) are used by XNU to enforce code signatures on individual pages, store entitlements, and report code signing information back to user mode via the csops[_audittoken] system calls (#169, #170). The blob information is added to the ubc_info structure by ubc_cs_blob_add(), from load_code_signature() (bsd/kern/mach_loader.c), unless a blob already exists for the vnode, as may be retrieved by ubc_cs_blob_get(). More details on code signing, including the specifics of blob validation, can be found in III/5.
In the interest of preserving opacity, access to the ubc_info fields is performed using ubc_[get/set]* functions (all in bsd/kern/ubc_subr.c), which internally call UBCINFOEXISTS, a macro checking the vnode's ubc_info pointer, before dereferencing it to get the specific field. The ubc_info_t's getters and setters are just one part of the KPI exported by the UBC layer. There are quite a few code signing related functions (ubc_cs_*, discussed in III/5), and the remaining functions deal with Universal Page Lists (UPLs), discussed in more detail throughout Chapter 11.
Buffers
Vnodes maintain buffers, which are used to hold the data of their various I/O requests. The vnode maintains two struct buflists pointers - one in v_dirtyblkhd (for dirty buffers, i.e. those which may need flushing) and another in v_cleanblkhd.
As rich as the buf structure is, it does not hold the actual buffer data, which is maintained separately, through the buf's b_datap (accessible via buf_[set]dataptr()) and/or the b_upl (accessible through buf_[set]upl()). UPLs (Universal Page Lists) are explained in Chapter 11, but for now they can just be thought of as their name implies - lists of memory pages, reserved to hold the buffer contents, and populated by a call to buf_map(). The b_datap may also contain buffer metadata, zalloc()ed from respective dedicated zones, with element sizes ranging in powers of two from 512 through 16,384.
Reading or writing a buffer from/to disk is performed by buf_b[read/write](). The actual operation is then carried out by VNOP_STRATEGY, by default asynchronously. This may be made blocking by calling buf_biowait() on the buffer, as is performed by buf_bread(), and by buf_bwrite() for buffers marked as synchronous (i.e. in which B_ASYNC is not set).
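As a hedged sketch of the synchronous read pattern (assuming vp holds a valid iocount and blksize is the filesystem's block size; all calls are public buffer KPIs from bsd/sys/buf.h):

#include <sys/buf.h>
#include <sys/systm.h>
#include <sys/vnode.h>

static int read_first_block(vnode_t vp, int blksize)
{
        buf_t bp = NULL;
        // buf_bread() issues VNOP_STRATEGY, then buf_biowait()s for completion
        int err = buf_bread(vp, (daddr64_t)0, blksize,
                            vfs_context_ucred(vfs_context_current()), &bp);
        if (err == 0) {
                // b_datap holds the data; buf_dataptr() is its accessor
                printf("first byte: %02x\n", *(unsigned char *)buf_dataptr(bp));
                buf_brelse(bp);   // return the buffer to the cache
        }
        return err;
}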
Fortunately, a series of _iterate functions - notably vfs_iterate(), over the mounted filesystems, and vnode_iterate(), over a given mount's vnodes - are all supported KPIs, and allow their caller to iterate over the structures, specifying a callback action to perform (a sketch follows the note below).
One specific real-world application of this is in the iOS 11.3 jailbreak, which required remounting the root filesystem read-write by overwriting the initial mountpoint data. Previously opened vnodes had their backpointer to the mount data incorrectly set - but vnode_iterate() could be used to overcome that.
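A minimal sketch of the iteration pattern (a hypothetical helper counting the vnodes of every mount; the callback return values are the documented VNODE_RETURNED/VFS_RETURNED constants):

#include <sys/mount.h>
#include <sys/systm.h>
#include <sys/vnode.h>

static int count_vnode(vnode_t vp, void *arg)
{
        (*(int *)arg)++;
        return VNODE_RETURNED;  // drop the iocount taken on our behalf, continue
}

static int count_mount(mount_t mp, void *arg)
{
        int count = 0;
        vnode_iterate(mp, 0, count_vnode, &count);
        printf("%s: %d vnodes\n", vfs_statfs(mp)->f_mntonname, count);
        return VFS_RETURNED;    // continue to the next mounted filesystem
}

static void dump_vnode_counts(void)
{
        vfs_iterate(0, count_mount, NULL);
}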
The attributes are all hardcoded in four corresponding tables, defined in bsd/vfs/vfs_attrlist.c - getattrlist_common_tab, getattrlist_dir_tab, getattrlist_file_tab and getattrlist_common_extended_tab. The tables are all arrays of getattrlist_attrtab structures, also defined (ibid.) as shown in Listing 7-7. The unique structure of the tables (shown in the annotation comment in the listing) makes it easy to locate the table in the decompressed kernelcache's __TEXT.__const, using the joker module.
/*
 * A zero after the ATTR bit indicates that we don't expect the underlying FS to report
 * back with this information, and we will synthesize it at the VFS level.
 */
static struct getattrlist_attrtab getattrlist_common_tab[] = {
        // 0x00000001 0000000002000000 sizeof(int32_t + uint32_t) (1<<7)
        {ATTR_CMN_NAME, VATTR_BIT(va_name), sizeof(struct attrreference), KAUTH_VNODE_READ_ATTRIBUTES},
        {ATTR_CMN_DEVID, 0, sizeof(dev_t), KAUTH_VNODE_READ_ATTRIBUTES},
        ...
There are a variety of ways to query attributes from user mode. On a path name, getattrlist(2) (#220) or (as of Darwin 14) getattrlistat(2) (#476) may be used. Alternatively, fgetattrlist(2) (#228) can be used on a file descriptor. To handle so many attributes, the system calls use a struct attrlist, which breaks the attributes into five 32-bit bitmaps. In this way, a caller can ask for multiple attributes at once. Darwin 14 introduces another system call, getattrlistbulk(2) (#461), specifically intended for retrieving attributes for multiple objects in the same directory.
Perusing the respective manual pages of all the above provides good examples of the system call usage. Note, also, that certain attributes may be defined as writable, in which case setattrlist(2) or fsetattrlist(2) (system calls #221 and #229, respectively) can be used to modify them. Darwin 17 adds setattrlistat(2) (system call #524), which provides the basis for utimensat(2) and other calls.
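A short user-mode sketch querying a file's name and object type (the buffer layout is an assumption matching the requested bitmap - a length word, followed by the attributes in ascending bit order):

#include <sys/types.h>
#include <sys/attr.h>
#include <sys/vnode.h>
#include <limits.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        struct attrlist al = { .bitmapcount = ATTR_BIT_MAP_COUNT,
                               .commonattr  = ATTR_CMN_NAME | ATTR_CMN_OBJTYPE };
        struct {
                u_int32_t       length;    // total length of the returned attributes
                attrreference_t name_ref;  // offset/length of the name string
                fsobj_type_t    obj_type;  // a vtype value (VREG, VDIR, ...)
                char            name[NAME_MAX + 1];
        } __attribute__((packed)) ab;

        if (getattrlist(argv[1], &al, &ab, sizeof(ab), 0) < 0) {
                perror("getattrlist");
                return 1;
        }
        // The name string is located relative to its attrreference_t
        printf("name: %s, type: %d\n",
               (char *)&ab.name_ref + ab.name_ref.attr_dataoffset, ab.obj_type);
        return 0;
}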
Finally, note in Listing 7-7 that attribute table entries also have a kauth_action_t associated with them, commonly KAUTH_VNODE_READ_ATTRIBUTES. As explained in III/3, the KAuth facility is a precursor (circa 10.4) to the Mandatory Access Control Framework (as of 10.5). KAuth is called out from vnode_authorize() and ..._authattr[_new], after MACF's mac_vnode_check_[get/set]* callouts are called. As discussed in III/4, the MAC Framework delegates the decision to a policy extension, commonly Sandbox.kext.
fsctl(2)
The fsctl(2) system call (#242), along with the file-descriptor based ffsctl (#245), are proprietary system calls meant for high level filesystem control operations*. Using any one of the ioctl(2)-style predefined control codes, a user mode caller can direct the underlying filesystem to perform the requested operation.
The fsctl(2) codes known to XNU proper are defined in bsd/sys/fsctl.h, which is also exported to user mode as <sys/fsctl.h>. The codes are #defined both as FSIOC_* and as corresponding FSCTL_*, with the latter being an application of the IOCBASECMD macro over the former. The codes are shown in Table 7-8:
Code                                            Purpose
FSIOC_NAMESPACE_ALLOW_DMG_SNAPSHOT_EVENTS       [Dis]Allow snapshot events on disk images
FIOSEEKHOLE                                     Deprecated: now in fcntl(2)
FIOSEEKDATA                                     Deprecated: now in fcntl(2)
DISK_CONDITIONER_IOC_GET                        Get disk conditioner settings
DISK_CONDITIONER_IOC_SET                        Set disk conditioner settings
* - The man page for both these calls is still found in XNU's bsd/man/man2/fsctl.2, but not installed to the MANPATH.
Perhaps this is for the best, seeing as the man page is terribly outdated, and the one code it lists
(FSGETMOUNTINFOSIZE) is not even present in the XNU sources anymore.
Extended Attributes
Apple makes heavy use of extended attributes, or xattrs. As explained in Volume I/3, extended attributes provide the implementations of important filesystem features, such as compression, data protection, and resource forks. Thus, any filesystem which can support xattrs (as both HFS+ and APFS do) can also provide these features.
As with standard attributes, there are several system calls to handle extended attributes: [f]getxattr (#234, 235), [f]setxattr (#236, 237) and [f]removexattr (#238, 239) can be used to manipulate known attributes by name (all in bsd/vfs/vfs_syscalls.c). As is the common convention, the f... variants work on an already open file descriptor. [f]listxattr (#240, 241) can be used to list the extended attributes of a pathname or descriptor, although com.apple.system.* xattrs get filtered through xattr_protected().
The system calls funnel to the kernel internal functions, vn_[get/set/remove]xattr (in bsd/vfs/vfs_xattr.c). Similar to standard attributes, both the MAC Framework's callouts (mac_vnode_check_[get/set/delete]extattr, hooked by Sandbox.kext and MacOS's Quarantine.kext) and KAuth's (i.e., calling vnode_authorize() with KAUTH_VNODE_[READ/WRITE]_EXTATTRIBUTES) must agree to allow the operation.
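A user-mode sketch exercising these calls, enumerating a path's xattrs and their sizes (the helper name is hypothetical):

#include <sys/xattr.h>
#include <stdio.h>
#include <string.h>

static void dump_xattrs(const char *path)
{
        char names[4096];
        // listxattr(2) returns a buffer of NUL-separated attribute names
        ssize_t len = listxattr(path, names, sizeof(names), XATTR_NOFOLLOW);
        for (ssize_t i = 0; i < len; i += strlen(names + i) + 1) {
                // A NULL value buffer merely queries the attribute's size
                ssize_t vlen = getxattr(path, names + i, NULL, 0, 0, XATTR_NOFOLLOW);
                printf("%s: %zd bytes\n", names + i, vlen);
        }
}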
Some filesystems natively support extended attributes, whereas others do not. Those which do advertise this capability with the VFS_TBLNATIVEXATTR flag (q.v. Listing 7-27). In those which do not (e.g. FAT-derivatives), extended attributes may be emulated by use of hidden "Apple Double" dot-underscore (._) files, #defined as ATTR_FILE_PREFIX. The emulation is also used when archiving files into formats which do not support extended attributes, e.g. tar(1). Thus, when CONFIG_APPLEDOUBLE is set (as it is, by default), the implementations of default_[get/set/list]xattr() (in bsd/vfs/vfs_xattr.c) call open_xattrfile() to open the Apple Double file in kernel, and then get_xattrinfo() to populate the attr_info_t.
The AppleDouble file format is documented with ASCII art in bsd/vfs/vfs_xattr.c, as shown in Listing 7-9:
  |      ...                      |
  |      ATTR DATA 2   <-------'  |
  |      /////////////            |
  |      ...                      |
  |      ATTR DATA N   <----------'
  |      /////////////
  |                      Attribute Free Space
  |
  '----> RESOURCE FORK
         /////////////   Variable Sized Data
         /////////////
         /////////////
         /////////////
         /////////////
  ---------------------------------------------

   NOTE: The EXT ATTR HDR, ATTR ENTRY's and ATTR DATA's are
   stored as part of the Finder Info. The length in the Finder
   Info AppleDouble entry includes the length of the extended
   attribute header, attribute entries, and attribute data.
*/
When the attribute and file exist, hexdumping the file will show the structure presented in Listing 7-9. Output 7-10-b shows the file created by the above xattr addition, annotated. Note that entries are in big endian format, and 16-bit aligned.
#
# The extended attribute is implemented by a hidden file:
#
morpheus@Zephyr (/Volumes/NO NAME) % ls -la ._X
-rwxrwxrwx  1 morpheus  staff  4096 Apr 24 20:32 ._X
morpheus@Zephyr (/Volumes/NO NAME) % hexdump ._X
#         MAGIC        VERSION       Filler (ADH_MACOSX)
00000000  00 05 16 07 00 02 00 00  4d 61 63 20 4f 53 20 58  |........Mac OS X|
#                                   numEntries  AD_FINDERINFO
00000010  20 20 20 20 20 20 20 20  00 02 00 00 00 09 00 00  |        ........|
#                                  AD_RESOURCE  offset
00000020  00 32 00 00 0e b0 00 00  00 02 00 00 0e e2 00 00  |.2..............|
#         length
00000030  01 1e 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000040  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
#                                               total size
00000050  00 00 00 00 41 54 54 52  3b 9a c9 ff 00 00 0e e2  |....ATTR;.......|
#         data start  data_length
00000060  00 00 00 88 00 00 00 05  00 00 00 00 00 00 00 00  |................|
#                                  offset      length
00000070  00 00 00 00 00 00 00 01  00 00 00 88 00 00 00 05  |................|
#         flags nl name[6]
00000080  00 00 05 74 65 73 74 00  76 61 6c 75 65 00 00 00  |...test.value...|
00000090  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
...
00000ee0  00 00 00 00 01 00 00 00  01 00 00 00 00 00 00 00  |................|
00000ef0  00 1e 54 68 69 73 20 72  65 73 6f 75 72 63 65 20  |..This resource |
00000f00  66 6f 72 6b 20 69 6e 74  65 6e 74 69 6f 6e 61 6c  |fork intentional|
00000f10  6c 79 20 6c 65 66 74 20  62 6c 61 6e 6b 20 20 20  |ly left blank   |
00000f20  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000fe0  00 00 00 00 01 00 00 00  01 00 00 00 00 00 00 00  |................|
00000ff0  00 1e 00 00 00 00 00 00  00 00 00 1e 00 1e ff ff  |................|
00001000
The two mandatory attributes, AD_ATTRIBUTES (0x09, at offset 0x32 and spanning 0xeb0 bytes) and AD_RESOURCE (0x02, at offset 0xee2, spanning 0x11e bytes, for the resource fork), are created automatically, and highlighted. The AD_ATTRIBUTES contains one attribute, identified by the ATTR_HDR_MAGIC ('ATTR') and conforming to the struct attr_header (also in bsd/vfs/vfs_xattr.c), with the attribute defined as an attr_entry:
Apple makes heavy use of VFS features - specifically, extended attributes - in order to provide additional, non-standard and mostly private functionality.
Non-standard VFS extensions and the mechanisms providing them:

Extension           Support         Provides
Resource Forks      xattr           Alternate Data Streams
Compression         xattr           Transparent file compression
Restricted          xattr + flag    Darwin 15: Prevent modification to file, sans entitlement
Data Vaulting       flag            Darwin 17: Prevent read access to file, sans entitlement
Data Protection     xattr           NSFileProtectionClass encryption for sensitive files
FSEvents            Char device     Filesystem notifications via /dev/fsevents character device
Document IDs        Proprietary     32-bit identifiers tagging files & directories to track their lifecycle
Object IDs          Proprietary     64-bit identifiers uniquely identifying an object for direct open
Disk Conditioning   Proprietary     Intentional I/O degradation/throttle for specific mount points
Triggers            Proprietary     Trigger vnodes used for automounting filesystems in MacOS
EVFILT_VNODE        kqueues         Vnode lifecycle event notifications via kevent(2)
/dev/vn## device    Device nodes    Loop mount device nodes, #if NVNDEVICE
File Providers      Host port       Designated processes serving as VFS namespace resolvers
Resource Forks
Resource forks are an antiquated legacy of the MacOS Classic days. The Macintosh File
System (MFS) could support a number of "forks", which enabled storing multiple related data
elements in the same file*. The main fork used was the resource fork, in which application
resources (icons, images and other media) could be stored. The NeXTSTEP bundle format
provides a far better method of storing resources, but resource forks are nonetheless supported
to this day. This support is enabled by #defineing NAMEDRSRCFORK, as is done by default across all Darwin flavors.
As discussed in Volume I (Output 3-22), the resource fork may be accessed by requesting the file's com.apple.ResourceFork extended attribute, or by simply appending "..namedfork/rsrc" to any file. Special handling in cache_lookup_path() (in bsd/vfs/vfs_cache.c) checks if a requested filename component starts with two dots and is followed by the _PATH_RSRCFORKSPEC, and the filesystem supports forks (the MNTK_NAMED_STREAMS bit of the mount structure's mnt_kern_flag is set). If so, then CN_WANTSRSRCFORK is set in the cached vnode's cn_flags, and VFS syscalls operate on the fork instead of the actual vnode.
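Both access methods can be demonstrated with a short user-mode sketch (XATTR_RESOURCEFORK_NAME is the <sys/xattr.h> constant for com.apple.ResourceFork):

#include <sys/xattr.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int read_rsrc_fork(const char *path, char *buf, size_t len)
{
        // 1) via the extended attribute...
        ssize_t n = getxattr(path, XATTR_RESOURCEFORK_NAME, buf, len, 0, 0);

        // 2) ...or via the ..namedfork/rsrc pseudo-path
        char forkpath[1024];
        snprintf(forkpath, sizeof(forkpath), "%s/..namedfork/rsrc", path);
        int fd = open(forkpath, O_RDONLY);
        if (fd >= 0) {
                n = read(fd, buf, len);
                close(fd);
        }
        return (int)n;
}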
File compression
The com.apple.decmpfs xattr implements transparent filesystem compression. "Transparent", in that the calls manipulating a compressed file through VFS have no idea if the file is compressed or not. The filesystem calls decmpfs_file_is_compressed() on vnode access (i.e. when implementing ..._vnop_open()), which calls decmpfs_cnode_get_vnode_state to check a cached result. The slower path checks for the UF_COMPRESSED flag, which must always be accompanied by a com.apple.decmpfs extended attribute. The extended attribute is expected to hold, at a minimum, a decmpfs_header (from bsd/sys/decmpfs.h), which will indicate the compression_type and the uncompressed_size (which is reported as the file size for ls(1) and similar tools when the file is flagged as UF_COMPRESSED). Files which are small enough may have their contents compressed into the extended attribute's value. In other cases the compressed data may be held in the resource fork.
* - Windows users may be familiar with the NT equivalent of "Alternate Data Streams", e.g. ::$DATA and the like.
When the file data is requested, the driver can call decmpfs_pagein_compressed and
decmpfs_read_compressed to handle the decompression, while remaining entirely oblivious
to the decompression algorithm used. This is shown in Figure 7-12:
[Figure 7-12: The flow of transparent decompression. Recoverable annotations: 1) A user mode process open(2)s and read(2)s a file on some filesystem; 3) The generic read redirects the call to VNOP_READ, and thence to the filesystem specific implementation's ..._fs_vnop_read(); 4) The filesystem driver calls decmpfs to query if the file is compressed; 6) For compressed files, the driver can satisfy the read with decmpfs_read_compressed; 7) Type 1 is registered by XNU and stores data directly in the xattr's attr_bytes; 8) Kernel extensions can add their own methods, registering a compression_type (1..n, up to CMP_MAX, 255) by calling register_decmpfs_decompressor with a decmpfs_registration (version 1, or version 3 for get_flags), providing their validate (double check compressed file is valid), adjust_fetch (hint to decompressor on upcoming fetch), fetch (retrieve and decompress data), free_data (called on file removal) and get_flags (retrieve compression flags, v3 only) functions.]
#
# Get either com.apple.AppleFSCompression.* kext names, or com.apple.AppleFSCompression.providesType* properties.
# This has the caveat that it might miss an AppleFSCompression provider not following the naming convention,
# but that hasn't happened yet
#
morpheus@Chimera (~)$ ioreg -l -w 0 | grep -E "FSCompression|providesType"
+-o com_apple_AppleFSCompression_AppleFSCompressionTypeZlib  <class com_apple_AppleFSCompression_AppleFSCompressionTypeZlib, ..>
      "com.apple.AppleFSCompression.providesType..." = Yes
      ...
+-o com_apple_AppleFSCompression_AppleFSCompressionTypeDataless  <class com_apple_AppleFSCompression_AppleFSCompressionTypeDataless, ..>
      "com.apple.AppleFSCompression.providesType..." = Yes
The flow in the above diagram can (somewhat) be traced thanks to KDebug codes, which
are emitted at specific points as of Darwin 18. Compression is transparent, but might pose a
challenge for third party raw filesystem tools, which access the filesystem data from outside
XNU, and therefore need to implement their own decompression logic. fsleuth handles most
common compression types known at the time of writing.
Restricted
One of Apple's most notable extensions is the com.apple.rootless extended attribute. When coupled with the SF_RESTRICTED chflags(2) flag, it marks the file as immutable, even to the root user. This is a stronger protection than BSD's SF_IMMUTABLE, because the root user can easily toggle that flag, whereas SF_RESTRICTED may only be modified by a holder of the right entitlement. This is a key feature of Apple's System Integrity Protection for MacOS (also known as "rootless", introduced in MacOS 10.11 and discussed in III/9), culling the formerly omnipotent powers of root so as to put restricted files out of reach.
When the flag is present, the com.apple.rootless extended attribute is checked. If present and containing a value, the process requesting the operation must hold the com.apple.rootless.storage.value entitlement to be allowed modifications. If present with no value, only com.apple.rootless.*install* entitlement holders are allowed to modify the file. This enforcement is provided courtesy of Sandbox.kext, whose platform profile applies to all processes.
Data Vault
The Data Vault facility is a relatively new addition to Darwin, as of version 17. The idea is to extend platform profile/SIP protections from merely preventing modification of files, to preventing reading or even just accessing their metadata. Another special flag, UF_DATAVAULT, is used to datavault files. A code signing flag, CS_DATAVAULT_CONTROLLER (0x80000000), granted to blessed processes through the com.apple.rootless.datavault.controller special entitlement, is required to access these files.
Data Protection
A file system may be mounted with the MNT_CPROTECT flag, which implies its files are protected through NSFileProtectionClass. As described in Volume III (Chapter 11, specifically 11-6 through 11-9), the com.apple.system.cprotect extended attribute holds the wrapped per-file key, which is unwrapped by Apple[SEP]KeyStore.kext callbacks. Calling getattrlist(2) with ATTR_CMN_DATA_PROTECT_FLAGS will retrieve the file protection class for a given file system object. Refer to III/11 for more details about the extended attribute format, protection classes, and AppleKeyStore callbacks.
FSEvents
When XNU is compiled with CONFIG_FSE (as is the case by default), filesystem events also get directed to the FSEvents facility. As described in I/4 (under "FSEvents"), this facility (entirely self-contained in bsd/vfs/vfs_fsevents.c) presents itself to user mode as the /dev/fsevents character device. Clients can then use the device to listen on global filesystem notifications, reading in a stream of kfs_event structures (q.v. Figure 4-1 in Volume I). When in kernel mode, the kfs_event structures are buffered into their own dedicated zone, fs-event-buf. The size of the zone is set at MAX_FSEVENTS (4096) entries, though this may be overridden by the kern.maxkfsevents boot argument.
The FSEvents clients are referred to as watchers. Recall (from I/4) that watchers are expected to use the FSEVENTS_CLONE ioctl(2), and supply a clone_args structure, containing the event reporting array and a queue depth. The kernel mode handler, fseventsioctl, takes these arguments and calls add_watcher() to populate an fs_event_watcher entry in the watcher_table array. Then, when an fsevents record is generated (in numerous locations throughout VFS, by calling add_fsevent), the watcher table is consulted, and - if the specified event type is marked FSE_REPORT and the device node (= volume) it is from was not on the devices_not_to_watch list - the watcher (which is presumably blocking on read(2) from the cloned descriptor) is woken up. The cloned descriptor is of DTYPE_FSEVENTS, and its read(2) is serviced by fmod_watch(), which populates the kfs_event record.
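A hedged user-mode sketch of the cloning step (the structure and constants are from XNU's bsd/sys/fsevents.h, which is not installed in the SDK, so clients typically replicate its definitions):

#include <sys/fsevents.h>
#include <sys/ioctl.h>
#include <fcntl.h>
#include <unistd.h>

int fsevents_clone(void)
{
        int8_t events[FSE_MAX_EVENTS];
        for (int i = 0; i < FSE_MAX_EVENTS; i++)
                events[i] = FSE_REPORT;        // report every event type

        int cloned = -1;
        fsevent_clone_args fca = {
                .event_list        = events,
                .num_events        = FSE_MAX_EVENTS,
                .event_queue_depth = 0x100,
                .fd                = &cloned,
        };
        int dev = open("/dev/fsevents", O_RDONLY);
        if (dev < 0 || ioctl(dev, FSEVENTS_CLONE, &fca) < 0)
                return -1;
        close(dev);     // the clone is independent of the original descriptor
        return cloned;  // read(2) a stream of kfs_event records from this
}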
There is a hard-coded limit of MAX_WATCHERS (8). Apple therefore discourages direct use of the character device (in fact, warning that it is "unsupported"), and offers the user-mode FSEvents.framework, which uses fseventsd. The daemon, along with other Apple processes (namely coreservicesd, revisiond and Spotlight's mds) get flagged as WATCHER_APPLE_SYSTEM_SERVICE (0x0010). This flag prevents events from being dropped
when the watcher queue is over 75% full. This also allows watchers to set directories to ignore, as per some internal radar. The facility is protected by four locks:
• watch_table_lock: Protects the watcher_table. Access to this lock is through [un]lock_watch_table(), which is used when adding/removing watchers or delivering events.
• event_buf_lock: Protects the kfs_event list. Access to this lock is through [un]lock_event_list(), which is called from add_fsevent and release_event_ref.
• event_writer_lock: Protects concurrency of the user mode write(2) operation, handled by the fseventswrite callback. The lock is accessed directly in said function.
• event_handling_lock: Protects the event queue of the watchers, when adding events to a watcher or removing a watcher.
The locks are all static, with the first three grouped into the fsevent-mutex lock group, and the last being the sole member of the fsevent-rw group.
Document IDs
Document tombstones
Files marked with a document ID are closely monitored for lifecycle changes. When such files are created, edited, renamed or removed, the VFS layer offers "document tombstones" as a way to store the metadata about the last operation on the particular document ID.
Object IDs
Another undocumented feature is the ability to open a file by specifying the filesystem and object ID, via the undocumented openbyid_np system call (#479). The operation requires a MACF privilege (PRIV_VFS_OPEN_BY_ID), which the Sandbox enforces with the com.apple.private.vfs.open-by-id entitlement. Among the holders of the entitlement are backupd, searchd, revisiond and the iCloud components (bird(8)/brctl(1), cloudd(8) and others), which utilize the syscall through the private CloudDocsDaemon framework's BRCOpenByID wrapper.
struct doc_tombstone {
        struct vnode    *t_lastop_parent;
        struct vnode    *t_lastop_item;
        uint32_t        t_lastop_parent_vid;
        uint32_t        t_lastop_item_vid;
        uint64_t        t_lastop_fileid;
        uint64_t        t_lastop_document_id;
        unsigned char   t_lastop_filename[NAME_MAX + 1];
};

struct doc_tombstone *doc_tombstone_get(void);
// remove a tombstone
void doc_tombstone_clear(struct doc_tombstone *ut, struct vnode **old_vpp);
Disk Conditioning
Disk Conditioning is a facility for intentional degradation of I/O performance from specific mount points. It allows delaying access time as well as restricting read/write throughput of devices, through callouts made throughout VFS. As with the Network Link Conditioner, it cannot be used to improve times - only to introduce artificial latency and delay. The dmc(1) utility controls the facility when applied over a mount point. The utility works its magic through two specific [f]fsctl(2) (#242/245) codes - DISK_CONDITIONER_IOC_[GET/SET] - both of which accept a disk_conditioner_info structure. The structure is defined in bsd/sys/fsctl.h, as shown in Listing 7-15. This portion of the file is marked XNU_KERNEL_PRIVATE, so it is not exported to the user mode headers.
Listing 7-15: The disk_conditioner_info structure (from XNU 4570's bsd/sys/fsctl.h)

/* Disk conditioner configuration */
typedef struct disk_conditioner_info {
        int enabled;
        uint64_t access_time_usec;      // maximum latency until transfer begins
        uint64_t read_throughput_mbps;  // maximum throughput for reads
        uint64_t write_throughput_mbps; // maximum throughput for writes
        int is_ssd; // behave like an SSD - accessed by disk_conditioner_mount_is_ssd
} disk_conditioner_info;

/* Disk conditioner */
#define DISK_CONDITIONER_IOC_GET        _IOR('A', 18, disk_conditioner_info)
#define DISK_CONDITIONER_FSCTL_GET      IOCBASECMD(DISK_CONDITIONER_IOC_GET)
#define DISK_CONDITIONER_IOC_SET        _IOW('A', 19, disk_conditioner_info)
#define DISK_CONDITIONER_FSCTL_SET      IOCBASECMD(DISK_CONDITIONER_IOC_SET)
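Because the definitions are kernel-private, a user mode caller must replicate them before invoking fsctl(2). A hedged sketch (requires root; the structure layout assumes the XNU 4570 version shown above):

#include <stdint.h>
#include <stdio.h>
#include <sys/ioccom.h>

// Replicated from the kernel-private portion of bsd/sys/fsctl.h
typedef struct disk_conditioner_info {
        int      enabled;
        uint64_t access_time_usec;
        uint64_t read_throughput_mbps;
        uint64_t write_throughput_mbps;
        int      is_ssd;
} disk_conditioner_info;

#define DISK_CONDITIONER_IOC_GET _IOR('A', 18, disk_conditioner_info)

int fsctl(const char *, unsigned long, void *, unsigned int);

int main(int argc, char **argv)
{
        disk_conditioner_info dci = { 0 };
        if (fsctl(argv[1], DISK_CONDITIONER_IOC_GET, &dci, 0) < 0) {
                perror("fsctl");
                return 1;
        }
        printf("enabled: %d, added latency: %llu usec\n",
               dci.enabled, dci.access_time_usec);
        return 0;
}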
Triggers (MacOS)
Darwin's VFS implementation provides support for vnode triggers. These enable an interested kernel extension to set callbacks which will be invoked when the vnode is accessed, particularly for mounting. Triggers are dependent on the CONFIG_TRIGGERS compile-time option, which is set for MacOS but not elsewhere. This makes sense, as the chief use of triggers is in autofs.kext, which performs NFS-automounting on MacOS - a feature which the *OS variants have no need of. As a consequence, the sizeof(struct vnode) is larger by eight bytes on MacOS, owing to the need to place a vnode_resolve_t at the end of the structure.
Assuming all checks allow, the callback is called, and its implementation is entirely up to the kext. In the case of autofs.kext, it works with triggers.kext to propagate the event up to a user mode daemon - automountd, on host special port #11, implementing MIG subsystem #666 with 8 messages, and configurable through /etc/autofs.conf.
The autofs project, consisting of both kexts (autofs and triggers), daemons (autofsd and automountd), mount_autofs, mount_url and several interesting test utilities, is open source[2], and the interested reader is encouraged to peruse it, to find multiple examples of working with triggers and mounting.
EVFILT_VNODE kqueues
In addition to the formidable FSEvents mechanism, Apple extended the VFS implementation so as to integrate it with the kevent(2) facility. As explained in I/8, kqueue(2)s and their kevent(2)s provide a substrate for GCD. Unlike FSEvents, which is global, EVFILT_VNODE requires arming with a specific file descriptor, which will be watched for write-oriented lifecycle events (NOTE_[WRITE/LINK/EXTEND/ATTRIB/RENAME/REVOKE] and NOTE_FUNLOCK). This makes it closer in operation style to Linux's inotify than FSEvents, though directory descriptors cannot be used here.
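A minimal user-mode sketch of arming such a filter (the NOTE_ constants shown are the standard <sys/event.h> ones):

#include <sys/event.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int watch(const char *path)
{
        int kq = kqueue();
        int fd = open(path, O_EVTONLY);  // open for event notification only
        struct kevent ev, out;
        EV_SET(&ev, fd, EVFILT_VNODE, EV_ADD | EV_CLEAR,
               NOTE_WRITE | NOTE_EXTEND | NOTE_ATTRIB | NOTE_RENAME | NOTE_REVOKE,
               0, NULL);
        for (;;) {
                // The changelist both arms the filter and (re)waits for events
                int n = kevent(kq, &ev, 1, &out, 1, NULL);
                if (n <= 0)
                        break;
                printf("event fflags: 0x%x\n", out.fflags);
        }
        close(fd);
        return 0;
}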
/dev/vn## (conditional)
Darwin adopts from BSD support for a special device called /dev/vn##, which can be used to provide a device interface (character, but most commonly block) to a file. This can be used for the common practice of loop mounting, wherein a file containing a raw filesystem image can be mounted like a device. This is contingent on XNU being compiled with NVNDEVICE #defined as an integer, which will cause the creation of /dev/vn## nodes from 0 to that number minus one.
Loop mounting is a common practice, but the -o loop option of mount(8), which might be familiar to some from Linux, is not supported. Instead, Darwin requires open(2)ing the file to be used in this way, and issuing a VNIOCATTACH ioctl(2), supplying the file name, file size, and read/write mode in a struct vn_ioctl (from <sys/vnioctl.h>). This causes the creation of a /dev/vn### with ### taking the next available number. Once the /dev/vn### device is initialized, it can be treated as any other device, and can be mount(8)ed, etc.
The /usr/libexec/vndevice binary on MacOS and BridgeOS (which still identifies internally as vncontrol) provides a CLI which performs the VN ioctl(2)s. The kernel-side implementation is in bsd/dev/vn/vn.c. This feature, which was supported by default in earlier versions of Darwin, is no longer supported, with NVNDEVICE undefined since the HFS+ legacy volume name exploit[3] which relied on this mechanism was used to jailbreak iOS 4.x.
File Providers
Apple introduced file providers back in iOS 10, modifying the API before making it public in Darwin 19 for MacOS as well. These are application extensions, allowing developers to provide a filesystem-like interface for documents on remote servers, using the FileProvider.framework and the NSFileProviderExtension class. The framework is somewhat documented by Apple[4], though all details of implementation are hidden.
Internally, file providers are an application of VFS namespaces ("nspaces"). The nspace subsystem is initialized by a call to nspace_resolver_init(), during vfsinit(). Apple originally used this to support filesystem snapshots. As of Darwin 19, it seems* that this has been extended to support a new concept of "dataless" files. Not much is known about these, but they are marked with SF_DATALESS and appear to be placeholders for remote files. Such files can be materialized (fetched) or evicted. The evictability of a file can be retrieved through ffsctl(2) with the (presently) undocumented code of 0x40084a47, and toggled through 0xc0084a44.
The nspace resolver, as of Darwin 19, is the filecoordinationd(8) daemon. The daemon was introduced as far back as Darwin 17, but in 19 it also claims HSP #30. The daemon registers itself by setting the vfs.nspace.resolver sysctl(2) MIB. When a resolver process exits, proc_exit() calls nspace_resolver_exited() to automatically deregister.
* - At the time of writing the XNU sources are not available (and, from experience, may not be for a while). When
they are, one could hope to find the code behind this facility in bsd/vfs/vfs_syscalls.c.
For this experiment, iCloud Drive must be enabled - which is easy to check with fileproviderctl listproviders. If it is, go to one of the drive-enabled folders. It's easy to check those, too, through the utility:
If iCloud Drive is enabled, placing a file on the desktop and calling fileproviderctl evict file will make the file vanish, leaving in its place a hidden file, .file.icloud:
#
# Perform eviction
#
morpheus@BifrOst (~/Desktop)$ fileproviderctl evict passwd
morpheus@BifrOst (~/Desktop)$ ls -l ./passwd
ls: ./passwd: No such file or directory
#
# A hidden placeholder file remains in its stead:
#
morpheus@BifrOst (~/Desktop)$ ls -la
total 48
drwx------@  8 morpheus staff  256 Oct 14  2019 .
drwxr-xr-x+ 28 morpheus staff  896 Oct 12 13:57 ..
-rw-r--r--@  1 morpheus staff 6148 Oct 12 13:59 .DS_Store
-rw-r--r--   1 morpheus staff    0 Apr 25 20:24 .localized
-rw-r--r--@  1 morpheus staff  154 Oct 12 13:59 .passwd.icloud
#
# Examine this new .passwd.icloud file (a bplist) using jlutil(j)
#
morpheus@BifrOst (~/Desktop)$ jlutil .passwd.icloud
NSURLNameKey: passwd
NSURLFileSizeKey: 6946
NSURLFileResourceTypeKey: NSURLFileResourceTypeRegular
The hidden file holds the basic metadata required to pull this file back from iCloud. Doing so requires using fileproviderctl(1) again:

#
# Materialize the file
#
morpheus@BifrOst (~/Desktop)$ fileproviderctl materialize passwd
Attempting to materialize item at ~/D{5}p/p{4}d
file ~/D{5}p/p{4}d:
ASCII text
morpheus@BifrOst (~/Desktop)$ ls -l ./passwd
-rw-r--r--  1 morpheus staff  6946 Oct 12 13:59 ./passwd
VFS KPIs
The proper use of VFS is through its rich set of Kernel Programming Interfaces, all neatly defined and exported in bsd/vfs/kpi_vfs.c. Through the KPIs, kernel level code requiring VFS client functionality - accessing vnode member data - can do so through accessors, leaving the vnode_t as an opaque pointer. This is important, considering the vnode_t structure tends to change every now and then between releases. Over four dozen such accessors exist, and a good way to find them is to try grep "vnode_" kpi_vfs.c | grep "vp)". Note, however, that not all accessors are defined as "approved" KPIs. vnode_tag is presently Unsupported, and a few are Private.
An important structure used throughout the VFS KPIs is the vfs_context_t. This is a pointer to a struct vfs_context, defined in bsd/sys/user.h as a structure of two fields - the vc_thread (the Mach thread_t) and the vc_ucred (the thread's kauth_cred_t credential structure). Normally, vfs_context_current() can be used to retrieve the current thread context, which is equivalent to accessing the uu_context field of the uthread returned by get_bsdthread_info(current_thread()), or vfs_context_kernel() if no other context could be found. vfs_context_kernel() itself is (as noted in its comment) a very dangerous hack, but one which has been in use for a long time and Apple shows no signs of addressing. Calling vfs_context_create() on a given context can clone it, or create a new context (with the current_thread and kauth_cred_get) if called with NULL. Contexts thus created are kalloc()ed from kalloc.16, and should be freed with vfs_context_rele().
The vfs_context_t is used extensively by VFS to determine the attributes of the current operation, through a battery of vfs_context_* KPIs.
It is inevitable that, sooner or later, kernel code will need to access a file directly. This is the case when, for example, a file is provided to execve(2), or other system calls which accept a pathname as an argument - such as open(2), getfh(2) and the like. Another is when the file needs to be created from kernel mode - such as when dumping a core or kdebug tracing. In these cases, there must be a (relatively) simple way to convert a pathname to a struct vnode. Such functionality is provided by namei(), from bsd/vfs/vfs_lookup.c.
As the listing shows, the nameidata is a rather complex structure. File access therefore begins with a call to NDINIT. This is a macro, #defined in bsd/sys/namei.h as accepting several arguments:
Arg     Purpose
nd      A struct nameidata to be initialized by the call, passed by reference.
op      An operation, which is one of LOOKUP (0), CREATE (1), DELETE (2) or RENAME (3).
        This is passed to VNOP_LOOKUP.
pop     A more precise operation, which is used only if the kernel is compiled with CONFIG_TRIGGERS.
        This is an OP_* path operation value, passed to resolvers.
flags   These set the cn_flags (component name flags) of the ni_cnd field of the nameidata. Common
        flag values are [NO]FOLLOW (symbolic links), LOCKLEAF (to auto-lock vnode on return) and
        AUDITVNPATH[1/2], to request auditing of the pathname.
segflg  A UIO_* value specifying the origin of the namep argument (UIO_USERSPACE/UIO_SYSSPACE).
namep   The pathname of the vnode to be opened.
        When taken from userspace, it is used with the CAST_USER_ADDR_T macro.
ctx     The VFS context, commonly vfs_context_current().
With the struct nameidata initialized, the next step is to call namei() on it. This performs the vnode lookup, by taking the namep component, calling copyinstr(9) (if obtained from UIO_USERSPACE) or copystr(9), and then performing the step-by-step directory traversal (by processing pathname components in between '/' separators), resolving symbolic links if the FOLLOW flag is set. The operation continues until the name can be resolved, and MACF is consulted (using mac_vnode_check_lookup_preflight) before the actual lookup() operation or if a symbolic link is encountered.
The lookup() function is, as its comment states, "a very central and rather complicated routine". Its complexity arises from the many special cases it needs to consider: double dot (..) directory specifiers, union mounts, resource forks, and other idiosyncrasies. The heart of the lookup operation, however, is in a call to VNOP_LOOKUP, through which VFS finds the underlying filesystem driver's ..._vnop_lookup handler. This way the filesystem specific logic can be entirely decoupled from the pathname and link processing. The return code of the lookup can be any of the common errno codes, such as ENOENT, EACCES, EPERM, etc.
Remaining optimistic and assuming the lookup was successful, namei() will likewise be successful, and the vnode will be ready for use in the ni_vp member of the struct nameidata. When the nameidata contains a buffer (as indicated by the HASBUF flag of ni_cnd.cn_flags and a non-NULL ni_cnd.cn_pnbuf), care must be taken to free it. This is handled by a call to nameidone(), which resets the flag, resets the pointer to NULL, and uses FREE_ZONE to free the buffer from the dedicated M_NAMEI BSD zone. When the vnode is no longer needed, it must be released with a call to vnode_put(), which will decrease its v_iocount and possibly put it on the vnode list for recycling (discussed later).
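Putting the steps together, a minimal sketch of this flow (note that NDINIT/namei are XNU-internal, not exported KPIs, so this pattern applies to code built as part of the kernel itself):

#include <sys/namei.h>
#include <sys/vnode.h>

static int lookup_vnode(const char *path, vnode_t *vpp, vfs_context_t ctx)
{
        struct nameidata nd;
        NDINIT(&nd, LOOKUP, OP_LOOKUP, FOLLOW | AUDITVNPATH1,
               UIO_SYSSPACE, CAST_USER_ADDR_T(path), ctx);
        int error = namei(&nd);
        if (error)
                return error;
        *vpp = nd.ni_vp;   // holds an iocount; release with vnode_put()
        nameidone(&nd);    // frees the pathname buffer, if one was allocated
        return 0;
}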
Kernel code can also take a different route than the NDINIT/namei approach, and call vnode_open (from bsd/vfs/vfs_subr.c). This simplifies the process into a single line of code, hiding the eventual use of both the macro and namei (by the internal vn_open_auth()), but the route is more scenic (and laced with more authentication checks).
The main use of a struct vnode is for I/O operations, through vn_rdwr(). This function, declared and well commented in bsd/sys/vnode.h, can be used to either UIO_READ or UIO_WRITE any len bytes from/to offset to/from a specified buffer base address. Additional arguments to this function are the segflg, indicating whether base is a UIO_USERSPACE or UIO_SYSSPACE address, the credentials of the requestor, and the struct proc p of the process on behalf of which the I/O request is done. Additionally, ioflg specifies a bitmask of IO_* options from bsd/sys/vnode.h, and aresid, a pointer to an integer storing the number of bytes which remain in the I/O request after vn_rdwr completes.
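A hedged sketch combining vnode_open() with vn_rdwr() to read a file from kernel code (the helper name is hypothetical):

#include <sys/fcntl.h>
#include <sys/proc.h>
#include <sys/vnode.h>

static int slurp_file(const char *path, char *buf, int len)
{
        vfs_context_t ctx = vfs_context_current();
        vnode_t vp = NULLVP;
        int error = vnode_open(path, FREAD, 0, 0, &vp, ctx);
        if (error)
                return error;

        int resid = len;
        error = vn_rdwr(UIO_READ, vp, (caddr_t)buf, len, 0 /* offset */,
                        UIO_SYSSPACE, 0 /* ioflg */,
                        vfs_context_ucred(ctx), &resid,
                        vfs_context_proc(ctx));
        vnode_close(vp, FREAD, ctx);
        return error;   // on success, len - resid bytes were read into buf
}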
This pattern of access can be seen in several locations around the kernel. The Mach-O loading process of get_macho_vnode() (from bsd/kern/mach_loader.c, discussed in Chapter 6) is one such example. Another good one is exec_activate_image, which supports the execve(2) implementation:
Listing: The code of exec_activate_image pertaining to vnode handling (from bsd/kern/kern_exec.c)
exec_activate_image(struct image_params *imgp)
{
        struct nameidata *ndp = NULL;
        ...
again:
        ...
        error = namei(ndp);
        if (error)
                goto bad_notrans;
        ...
bad_notrans:
        ...
        if (imgp->ip_ndp)
                nameidone(imgp->ip_ndp);
        if (ndp)
                FREE(ndp, M_TEMP);
The file opened in this way may also be a partition, which would require getting device geometry, etc. - hence necessitating an ioctl in kernel. This is achieved by using a function pointer, linking it to either a device-based ioctl() or a file-based one (e.g., do_ioctl = &file_ioctl;). Through the in-kernel ioctl functionality the device geometry or file block map/extent layout can be obtained. Blocks can be mapped into memory, and then read/write operations are as simple as an in-memory copy. When the file is closed, kern_write_file() (a wrapper over vn_rdwr()) is called.
Vnode lifecycle
Vnodes are allocated from the BSD M_VNODE zone (#25), which is backed by the dedicated vnodes zone. The zone grows dynamically as vnodes are allocated, but there is an upper cap on vnodes. The maximum number of vnodes is determined during bsd_startupearly(), set to the kernel's sane_size divided by 64k, plus 1,024. The value is further capped by the compile-time CONFIG_VNODES macro, which is commonly 263,168. This can be overridden by the kern.maxvnodes argument.
File I/O, however, is very frequent. So sooner or later any limit will be hit, but vnodes never get freed - instead, they are recycled. The struct vnode maintains two counts - v_usecount (a reference count, modified by vnode_ref_ext/vnode_rele_internal) and v_iocount (for I/O operations, modified by vnode_get()/vnode_put() and other operations). When both these counts are zero, the vnode may be put on one of three vnode freelists by vnode_list_add(), depending on the vnode flags:
• Vnodes marked VL_DEAD are added to the vnode_dead_list. This list is tried first when obtaining a new vnode.
• Vnodes marked VRAGE are put on the vnode_rage_list, which holds "rapidly aging" vnodes. Aged vnodes are put in the front of free lists, rather than at their end.
• Other vnodes are put on the vnode_free_list. It is easy to determine if a given vnode is already on the freelist, through the VONLIST macro, which expands to a check of the vnode's v_freelist member against the magic value of 0xdeadb.
Vnodes have quite a few structures associated with them, so it's not a simple matter to just put them on a free list. They must be properly laundered - which is the task of vclean(). This routine is responsible for fsync(2)ing the vnode contents, cleaning the associated memory pages in the Unified Buffer Cache, and calling VNOP_INACTIVE if the last reference to the vnode has been dropped, to advise the filesystem of this. It additionally calls VNOP_RECLAIM(), giving the filesystem a chance to remove the vnode from any cached structures or hash lookups, as well as deallocate filesystem private structures.
VFS SPIs
After considering all the objects and KPIs defined for use by VFS clients, let us turn our attention to those interfaces required of the service providers of VFS, and the process of developing a VFS filesystem.
Registering Filesystems
A filesystem provider can register its filesystem with VFS by calling vfs_fsadd(). This function (in bsd/vfs/kpi_vfs.c) takes in a struct vfs_fsentry by reference. If registration is successful, its second argument, a vfstable_t, is populated with an opaque handle, which can be used when deregistering. The magic of VFS is that it handles filesystems of multiple types and varieties. For this, the very notion of a filesystem needs to be abstracted, and VFS's struct vfs_fsentry, shown in Listing 7-26, aims to achieve exactly that:
The vfe_fsname is used to locate the filesystem, when matched against the filesystem type specified by the mount(2) system call. Every filesystem should also declare the vfe_flags, a bitmap of VFS_TBL* constants (also from bsd/sys/mount.h), which inform the kernel of the filesystem capabilities. The most important fields in the vfs_fsentry are the vfe_vfsops and vfe_opvdescs, which specify the filesystem level operations and individual vnode level operations (respectively) that the filesystem supports. In this way, the higher level VFS operations are really just higher level shims, with the kext-supplied filesystem specific logic performing all the actual work.
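A hedged sketch of the registration step (names hypothetical; the operation tables would be filled with the kext's actual implementations):

#include <mach/mach_types.h>
#include <sys/mount.h>
#include <sys/vnode.h>

extern struct vfsops        myfs_vfsops;           // filesystem-level operations
extern struct vnodeopv_desc myfs_vnodeop_opv_desc; // vnode-level operations

static vfstable_t myfs_handle;

kern_return_t myfs_register(void)
{
        struct vnodeopv_desc *opv_descs[] = { &myfs_vnodeop_opv_desc };
        struct vfs_fsentry fse = {
                .vfe_vfsops   = &myfs_vfsops,
                .vfe_vopcnt   = 1,              // one vnodeopv_desc table
                .vfe_opvdescs = opv_descs,
                .vfe_fsname   = "myfs",         // matched against mount(2)'s fs type
                .vfe_flags    = VFS_TBLTHREADSAFE | VFS_TBL64BITREADY,
        };
        // On success, myfs_handle is the opaque handle for vfs_fsremove()
        return (vfs_fsadd(&fse, &myfs_handle) == 0) ? KERN_SUCCESS : KERN_FAILURE;
}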
VFS operations
Any filesystem registered using vfs_fsadd may opt to install a number of callback functions, as a struct vfsops pointer which is provided as the first member (vfe_vfsops) of the struct vfs_fsentry. The structure presently defines some 15 callbacks, though not all are required. The callbacks (all pointers to functions returning an integer) are well documented in bsd/sys/mount.h, and shown in Table 7-28:
Table 7-28: Some of the operations defined in the struct vfsops (from bsd/sys/mount.h)

Operation                                 Purpose
vfs_mount(mp, devvp, data, context)       Mount the fs from devvp on mp
vfs_start(mp, flags, context)             Start the mounted fs at mp. flags unused
vfs_unmount(mp, mntflags, context)        Unmount fs at mp with mntflags (e.g. MNT_FORCE)
Examples of using this KPI can be found in the open source FUSE (discussed later), or by
disassembling Apple's own filesystem kexts.
Vnode operations
The vfe_opvdescs field of the vfs_fsentry defines the operations which populate the v_op vector of every vnode in the registered filesystem, unless otherwise stated (through a quasi filesystem). The operations are defined as an array of vnodeopv_entry_desc structures (defined in bsd/sys/vnode.h), each with two fields - a pointer to the vnodeop_desc and another to the function implementing the operation. The structure is shown in Listing 7-29 (next page).
Listing 7-29: The VFS operation entry and descriptor structures, from XNU 4903's bsd/sys/vnode.h
struct vnodeopv_entry_desc {
        struct vnodeop_desc *opve_op;   /* which operation this is */
        int (*opve_impl)(void *);       /* code implementing this operation */
};
struct vnodeopv_desc {
        int (***opv_desc_vector_p)(void *); /* ptr to the ptr to the vector where op should go */
        struct vnodeopv_entry_desc *opv_desc_ops; /* null terminated list */
};
struct vnodeop_desc {
        int     vdesc_offset;           /* offset in vector--first for speed */
        const char *vdesc_name;         /* a readable name for debugging */
        int     vdesc_flags;            /* VDESC_* flags */
        /*
         * These ops are used by bypass routines to map and locate arguments.
         * Creds and procs are not needed in bypass routines, but sometimes
         * they are useful to (for example) transport layers.
         * Nameidata is useful because it has a cred in it.
         */
        int     *vdesc_vp_offsets;      /* list ended by VDESC_NO_OFFSET */
        int     vdesc_vpp_offset;       /* return vpp location */
        int     vdesc_cred_offset;      /* cred location, if any */
        int     vdesc_proc_offset;      /* proc location, if any */
        int     vdesc_componentname_offset; /* if any */
        int     vdesc_context_offset;   /* context location, if any */
        /*
         * Finally, we've got a list of private data (about each operation)
         * for each transport layer. (Support to manage this list is not
         * yet part of BSD.)
         */
        caddr_t *vdesc_transports;
};
Once the filesystem is registered, execution moves to a callback model, through VNOP_* wrappers over common vnode operations. VFS fulfills its role as an adapter layer, performing common logic for the defined operations before dispatching them to the filesystem-specific implementations, found in the vnode's v_op member. Most wrappers are similar, loading an operation-specific argument structure and passing it to the operation pointer (provided by the filesystem). The VNOP_READ wrapper serves as a typical example:
errno_t
VNOP_READ(vnode_t vp, struct uio *uio, int ioflag, vfs_context_t ctx)
{
        int _err;
        struct vnop_read_args a;
        ...
        a.a_desc = &vnop_read_desc;
        a.a_vp = vp;
        a.a_uio = uio;
        a.a_ioflag = ioflag;
        a.a_context = ctx;

        _err = (*vp->v_op[vnop_read_desc.vdesc_offset])(&a);
        DTRACE_FSINFO_IO(read,
            vnode_t, vp, user_ssize_t, (resid - uio_resid(uio)));

        return (_err);
}
Putting together all we've seen so far, we end up at the flow presented in Figure 7-31, which connects with Figure 5-23:
[Figure 7-31: Dispatching a VNOP - the parameters are serialized into a single structure (struct vnop_read_args a = { vp, uio, ... }), which is then passed through the vnode's v_op vector to the filesystem's implementation]
A good way of gaining familiarity with the VFS APIs and KPIs is to look at them in context - by examining the implementations of some of the file systems used in XNU. The three case studies picked are quite different - devfs, MacOS's NFS support and FUSE - but they are thankfully all open source, and through them some common implementation patterns can be observed.
/dev (devfs)
For devices to be usable by user mode callers, they must have some filesystem representation, in the form of device nodes (which appear in ls -l as 'b'lock or 'c'haracter). Device nodes traditionally had to be created (by the mknod(2) system call) or removed manually following the driver addition or removal - a cumbersome requirement which could lead to unnecessary complications. Modern day UN*X systems (notably, Linux/Android) solved this by installing a user mode daemon to automatically maintain the nodes. Darwin and FreeBSD, however, adopt a different approach.
The /dev directory is itself a mount point, for the devfs special filesystem. This is a virtual filesystem (somewhat like Linux's /proc), where nodes can be created directly from kernel code. Only node pathnames can be created this way, but this proves sufficient. Kernel code can call on devfs_make_node() (from bsd/miscfs/devfs/devfs_tree.c) to create the node, and obtain an opaque handle as it magically appears in /dev. The handle can be used with devfs_remove() (ibid.) to just as magically make it disappear. Once added, the device is ready for use: user mode operations will be redirected by the VFS layer to the implementing callback. Both operations take the devfs_mutex (bsd/miscfs/devfs/devfs_tree.c), through the DEVFS_[UN]LOCK macros (#defined in bsd/miscfs/devfs/devfsdefs.h).
Darwin's devfs implementation closely resembles that of BSD's, with the original author comments and a few Apple modifications. Device nodes are created in the M_DEVFSNODE BSD zone. The node names are allocated from M_DEVFSNAME. The device nodes are maintained as struct devnodes, with their dn_typeinfo (a devnode_type union) holding either their dev_t, directory entry, or symbolic link name. The root node is dev_root, a devdirent_t, from which all files are linked.
Block devices are commonly created in conjunction with more complicated, IOKit-enabled logic. In these cases, the IOMediaBSDClient IOKit class (discussed in Chapter 13) can be used to handle the block device creation automatically, without the need to call the bdevsw* functions at all (or the devfs registration, as discussed next). Similar IOKit handling can be found in IOKit's IOSerialBSDClient, which handles character devices for serial port devices, but in most cases creating a character device is best done manually.
It is possible to manifest a single hardware device as both block and character. This is, in fact, quite common with disk devices, whose block representation is used for mounting filesystems, and the character representation as a "raw" device, for purposes of fsck(8) and the like. Calling cdevsw_add_with_bdev() will use the same major index for both node types (as is the case, for example, with the /dev/[r]disk* nodes).
Raw access to block devices entirely bypasses the filesystem, and thus any file permissions, or extended attributes and flags like those used in SIP, are rendered irrelevant. Apple thus enforces the com.apple.rootless.restricted-block-devices (MacOS) and com.apple.private.security.disk-device-access (*OS) master entitlements, which are bestowed upon the OS's own low-level tools (notably, the fsck* family). On a jailbroken *OS device the entitlement can easily be faked, but in MacOS bypassing it requires disabling SIP.
specfs nodes
Device nodes are still represented as vnodes, but with a v_type of VBLK or VCHR. In addition, when the vnode is created (by devfs, mknod(2), vnode_create_internal(), or otherwise), its vnfs_vops are set to [devfs_]spec_vnodeop_p. This puts such nodes, sooner or later, within the realm of the specfs filesystem.
• When an implementation exists for the operation in both the character or block device switches (open, close and ioctl), it is called upon, in order to perform the operation in a manner determined by the driver. There may still be some device specific tweaks or hacks - for example, preventing opening of mounted block devices, or handling the closing of a controlling tty.
• When dealing with read or write operations, specfs can directly invoke the callbacks for a character device driver. For block devices, however, these callbacks do not exist, and thus one of the buf_bread[n] or buf_b[/a/d]write functions are used.
• Other callbacks in Table 7-32 not called from specfs either have different code paths to call them, or were initially put in for compatibility with BSD, but were quickly phased out or left unsupported.
Hidden in /dev is the rather peculiar /dev/fd quasi-filesystem, called fdesc. First - unlike
other filesystems, it is not an actual mounted filesystem (though it used to be in older versions
of MacOS). Second, the filesystem appears different to each process which uses it. Every process
sees in fdesc numbered entries, corresponding to its open file descriptors*. A good way to see
that is to list the directory with two different processes - one, such as ls(1), and the other a
shell (through autocomplete functionality in /dev/fd). fdesc also creates symbolic links to
descriptors 0, 1 and 2 from /dev/stdin, stdout and stderr (respectively).
• devfs_devfd_readdir(): called from VNOP_READDIR() when the user requests a
directory listing, through getdirentries[64]. The callback obtains its position in the
directory listing by dividing the uio_offset by UIO_MX (16), the record size. It then
checks if that position is a valid file descriptor in the current_proc()'s space - i.e. non-
NULL, and not flagged by UF_RESERVED, using the fdfile and fdflags macros (a
condensed sketch of this check appears after this list).
• devfs_devfd_lookup(): obtains the calling process from the VFS context
pointer, and then checks if the looked up name (actually the descriptor number, in string
form) is valid, in the same way devfs_devfd_readdir() does. If the name is indeed
valid, it calls fdesc_allocvp() to create a vnode for that descriptor on the fly, and
returns it in the lookup's vpp.
The created vnode is tagged as VT_FDESC, and its vnode_fsparam is set such that
the vnfs_vtype is VNON, and the vnode level operations are fdesc_vnodeop_p. The
vnode_fsparam's vnfs_fsnode (which ends up in v_data) points to a struct
fdescnode (from bsd/miscfs/devfs/fdesc.h), which holds the descriptor number in fd_fd.
• fdesc_[get/set]attr: accesses the vnode's v_data, where it finds the fdescnode
structure, from which it retrieves the descriptor number, and uses fp_lookup() to obtain it.
• fdesc_open(): implemented in an admitted "XXX Kludge", storing the descriptor
number in the uthread's uu_dupfd, and deliberately returning ENODEV. This forces a
release of the vnode by vn_open_auth(), and code back in open1() calls
dupfdopen() (from bsd/kern/kern_descrip.c) on the descriptor number. The actual vnode
opened is thus the real vnode pointed to by the descriptor, which explains why all the
other operations return ENOTSUP.
* - Linux's /dev/fd is a symbolic link to /proc/self/fd, wherein pseudofiles are managed by the proc filesystem.
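The position check both callbacks perform can be condensed as follows - a sketch relying
on kernel-private macros and simplified semantics, so illustrative rather than buildable:

#define UIO_MX 16   // fdesc directory record size

// Maps a directory offset to a descriptor number, then checks it the way
// devfs_devfd_readdir() is described above to - via the fdfile/fdflags
// macros over the process's file descriptor table:
static int fdesc_slot_valid(proc_t p, off_t uio_offset)
{
    int fd = (int)(uio_offset / UIO_MX);     // offset -> descriptor number
    if (fd < 0 || fd >= p->p_fd->fd_nfiles)  // out of the table's range
        return 0;
    // Valid if the slot is non-NULL and not flagged UF_RESERVED:
    return (fdfile(p, fd) != NULL) && !(fdflags(p, fd) & UF_RESERVED);
}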
Experiment: Creating a simple character device
As we have seen, character devices make up a large part of the nodes in devfs. It is
common practice to implement anything outside of mass storage devices as character
devices - and it is therefore useful to be able to build a simple character device driver from
scratch. Such a driver can then be used as a template for more complex devices, real or
virtual, which communicate via the POSIX model.
Using an empty kernel extension as a starting point, we can put in the code to create
the device node. First, we need to populate a struct cdevsw with callbacks. These can
initially all be NULL, but better practice is to link them to enodev (from bsd/kern/subr_xxx.c),
which returns ENODEV to user mode. In the entry point, we can then create the device with
cdevsw_add. Unless there is a penchant for a specific major, -1 specifies that the caller is
requesting dynamic allocation of a major index for the added device. If successfully added,
the return code will indicate the major assigned. The devices managed then need to be
published to user mode, using devfs_make_node. This is shown in Listing 7-33:
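A minimal sketch along the lines of that listing - here the "sample" names, and the choice
of the eno_* convenience casts of enodev from <sys/conf.h>, are illustrative rather than the
original listing's code:

#include <mach/mach_types.h>
#include <sys/conf.h>
#include <miscfs/devfs/devfs.h>

static int sample_major = -1;      // filled in by cdevsw_add()
static void *sample_node = NULL;   // opaque handle from devfs_make_node()

// Every callback initially routes to enodev, so user mode gets ENODEV
// until real callbacks are implemented:
static struct cdevsw sample_cdevsw = {
    .d_open = eno_opcl,     .d_close = eno_opcl,
    .d_read = eno_rdwrt,    .d_write = eno_rdwrt,
    .d_ioctl = eno_ioctl,   .d_stop = eno_stop,
    .d_reset = eno_reset,   .d_select = eno_select,
    .d_mmap = eno_mmap,     .d_strategy = eno_strat,
    .d_getc = eno_getc,     .d_putc = eno_putc,
};

kern_return_t sample_start(kmod_info_t *ki, void *d)
{
    // -1 requests dynamic allocation of a major index; a non-negative
    // return value is the major actually assigned
    sample_major = cdevsw_add(-1, &sample_cdevsw);
    if (sample_major < 0) return KERN_FAILURE;

    // Publish the node to user mode: it appears as /dev/sample
    sample_node = devfs_make_node(makedev(sample_major, 0), DEVFS_CHAR,
                                  UID_ROOT, GID_WHEEL, 0666, "sample");
    return (sample_node != NULL) ? KERN_SUCCESS : KERN_FAILURE;
}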
At this point, building and kextload(8)'ing your module should make a new device
node appear in /dev, thanks to the magic of devfs. Trying any operation on the node will
result in an error message, because no callbacks have been implemented.
The next step is to implement a few callbacks. To make the device "functional", the
implemented callbacks usually include read and write. Keeping the example simple, we
can have our device act as a clipboard of sorts, holding data provided by the user using
write(2), and supplying it back to the user through read(2). A partial implementation of
a read function is shown below (the write function can be implemented similarly):
Listing 7-34: A sample reader function for a memory buffer backed character device
char buf[BUFSIZE];   // (the buffer and function names here are illustrative)
int writePos = 0;    // how much data the write callback has stored so far

int sample_read(dev_t dev, struct uio *uio, int ioflag)
{
    int error = 0;

    // TODO: SAMPLE ONLY! Don't forget sanity/bounds checks on kernel memory here..
    // read() only uses one iovec in the uio, but good code should handle multiple.
    // When moving to/from a single buffer, max copy size can be set to uio_resid(),
    // but scatter/gather needs to consider multiple iovec sizes..
    int toCopy = MIN((int)uio_resid(uio), writePos);
    if (toCopy > 0)
        error = uiomove(buf, toCopy, uio);

    return error;
}
NFS (MacOS)
Most UN*X flavors have adopted the Network File System (NFS) standard to provide file
sharing services. MacOS does so as well, supporting both NFSv3 (RFC1813) and NFSv4
(RFC3530).
NFS is a legacy mechanism, and is best discussed elsewhere (in the RFCs
specified, or a good reference like Callaghan's excellent work[5] or the
BSD implementation[6]). The aim of this section is to detail the Darwin
implementation specifics, and not get bogged down with the protocol or
component explanation.
The user mode portions of NFS are handled in MacOS similarly to other operating systems, by
several daemons:
• /sbin/nfsd: provides support for remote client requests using the NFS and/or mount
protocols (formerly provided by the now obsolete mountd(8)). This LaunchDaemon
starts from com.apple.nfsd.plist, contingent on the presence of /etc/exports (which contains
the list of filesystems to export).
• /sbin/rpc.statd: provides the host status service, as a way for local daemons to probe
their remote counterparts.
• /sbin/rpc.lockd: provides the locking service, which is required when a remote client
requests a local file lock.
• /usr/libexec/automountd: manages the autofs mechanism, which transparently
mounts remote filesystems when access to them is attempted. This LaunchDaemon starts
from com.apple.automountd.plist, and claims Host Special Port #11.
• /sbin/nfsiod: sets the maximum number of asynchronous I/O threads. This is a
deprecated daemon, because the number of threads can be controlled by merely
setting the vfs.generic.nfs.client.nfsiod_thread_max sysctl(2) value -
which is exactly what this binary does before exiting (a sketch of this appears after this list).
• nfssvc (#155): This is a "pseudo system call", in that most of the NFS service handling
is done in kernel mode, and so this system call is not expected to return. The nfsd(8)
daemon merely provides a user-mode process shell, spawning any number of server threads, all of
which invoke this call with the NFSSVC_NFSD argument, and remain in it until the
daemon exits or is killed. Another use of the system call is with the NFSSVC_ADDSOCK
argument, which registers the server sockets with the kernel. Lastly, the NFSSVC_EXPORT
flag is used to maintain the server's map of exported filesystems.
• getfh (#161): Enables the translation of any pathname to an NFS handle - fortunately,
only on filesystems which are exported.
• fhopen (#248): Enables the translation of an NFS file handle to an open file descriptor
with the O_ flags from fcntl.h. This is required by /sbin/rpc.lockd, so as to enable locking
when handling NFS requests.
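What nfsiod effectively does can thus be reduced to a few lines of user mode code - a
hedged sketch, with the argument handling simplified:

#include <stdio.h>
#include <stdlib.h>
#include <sys/sysctl.h>

int main(int argc, char **argv)
{
    // Set the maximum number of async I/O threads, then simply exit
    int threads = (argc > 1) ? atoi(argv[1]) : 16;
    if (sysctlbyname("vfs.generic.nfs.client.nfsiod_thread_max",
                     NULL, NULL, &threads, sizeof(threads)) != 0) {
        perror("sysctlbyname");
        return 1;
    }
    return 0;
}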
NFS client services are started automatically when the mount(8) command is given a
mount point and a remote file system specified with -t nfs. This, in turn, calls mount_nfs(8),
which mounts a remote server's filesystem specification on a local directory mount point.
The NFSCLIENT #define also enables the nfsclnt system call (#247). This call, used by
rpc.lockd(8), supports a flag, which may be NFSCLNT_LOCKDNOTIFY or .._LOCKDANS (for
rpc.lockd(8) notification or answers), or NFSCLNT_TESTIDMAP, used by nfs4mapid(8).
The nfsstat(1) utility can be used to display client and server statistics, by polling various
sysctl(8) MIBs in the vfs.nfs namespace. The utility has also been spotted in iOS 13 beta
2, indicating that Apple could be testing NFS client functionality in *OS internal builds.
The mechanism behind FUSE's kernel-to-daemon interaction is a reverse system call. In this
implementation, the user mode daemon performs a system call (commonly, read(2)) on a
device node supplied by the kernel-level VFS driver code. The system call is left to block until the
kernel-level code requires some service from the daemon. It encodes the request in the "read"
data, which is then processed by the daemon and acted upon. The daemon can then write(2)
the reply back to the device node, supplying it back to the VFS driver.
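A skeletal user mode sketch of this pattern - the request format and handler here are
illustrative, and the real FUSE protocol is considerably more involved:

#include <fcntl.h>
#include <unistd.h>

// Hypothetical request format; real implementations define a precise wire protocol
struct fs_request { int opcode; char payload[4088]; };

static void handle_request(struct fs_request *req)
{
    (void)req;   // decode the opcode and act on it
}

int daemon_loop(const char *devpath)   // a /dev node created by the kext
{
    struct fs_request req;
    int fd = open(devpath, O_RDWR);
    if (fd < 0) return -1;

    for (;;) {
        // read(2) blocks until the kernel VFS driver needs service...
        if (read(fd, &req, sizeof(req)) <= 0) break;
        handle_request(&req);
        // ...and write(2) supplies the reply back through the same node
        if (write(fd, &req, sizeof(req)) < 0) break;
    }
    close(fd);
    return 0;
}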
FUSE is by no means unique to Darwin systems. It started in other UNIX flavors, and is
in fact not officially supported on Darwin - the Darwin implementation, MacFUSE[7], was
introduced by Amit Singh (author of the seminal precursor to this work). The project was later picked
up by the open source community, and the present implementation - OSXFUSE[8] - is maintained
to this day. Because FUSE does require a kernel component, it is not applicable in the *OS
variants, wherein Apple uses DMG mounts (by registering loop block devices) instead.
Apple uses its own version of filesystems in user mode, in the private UserFS.framework,
as of iOS 11. The project is naturally closed source and does not share any design ideas with
FUSE: it does not rely on a character device, nor does it implement the reverse syscall
mechanism. The private framework uses XPC to communicate with its userfsd daemon and
userfs_helper, over the com.apple.filesystems.userfs[d/_helper] ports. The
master daemon is entitled for raw device access, and loads filesystem support from the
framework's Plugins/ directory (though these are prelinked into the shared cache). Present
plugins are msdos.dylib and exfat.dylib, obviating the need for the corresponding kernel
extensions, which were indeed removed from *OS kernelcaches. To support iOS 13's "liveFS"
feature, additional livefile_xxx.dylib plugins were introduced, for APFS, exfat, msdos and HFS.
Review Questions
1. Look through the manual pages of BSD's vnode(9), vget(9) and vput(9), comparing
these with Darwin's implementation.
2. Why are filesystems in user mode a good idea? What would the disadvantage be?
3. Why is Apple using its home-grown implementation, rather than something like FUSE?
References
1. Silvers - "UBC: An Efficient Unified I/O and Memory Caching Subsystem for NetBSD" -
https://www.usenix.org/legacy/publications/library/proceedings/usenix2000/freenix/full_papers/silvers/silvers_html/
2. Apple Open Source - autofs project - https://opensource.apple.com/tarballs/autofs
3. The iPhone Wiki - "HFS Legacy Volume Name Exploit" -
https://www.theiphonewiki.com/wiki/HFS_Legacy_Volume_Name_Stack_Buffer_Overflow
4. Apple Developer - File Provider Documentation -
https://developer.apple.com/documentation/fileprovider
5. Callaghan - "NFS Illustrated" -
https://www.amazon.com/NFS-Illustrated-Brent-Callaghan/dp/0201325705
6. McKusick, Neville-Neil & Watson - "The Design and Implementation of the FreeBSD
Operating System" (2nd Edition) - ISBN 978-0321968975
7. MacFUSE project page on Google Code - http://code.google.com/p/macfuse/
8. OSXFUSE project page on GitHub - http://osxfuse.github.com/
Space Oddity: APFS
Apple first introduced its newest filesystem, APFS, as a special preview in MacOS 10.12,
announcing plans to finally retire the venerable (18+ year old) HFS+. Though still not a full-
fledged and bootable filesystem, APFS showed great promise by providing 64-bit compatibility
and plenty of new features.
It was only almost a year later, however, that APFS was deemed stable enough to be used
as a default filesystem. Over this time, Apple kept working and reworking the filesystem
internals, breaking compatibility with previous implementations. The filesystem finally stabilized
with the first out-of-box implementation in iOS 10.3, probably chosen due to the relative safety
of *OS, wherein users are not given free rein over the filesystem. It was then enabled in MacOS
10.13, and has pushed HFS+ to the sidelines.
Although Apple promised the specification of APFS would be available "by the end of the
year" (2016), it failed to deliver, providing a paltry and partial placeholder document extolling
APFS's features, but disclosing virtually no detail on the implementation. In the meantime, it took
extensive reverse engineering to figure out how the filesystem really worked. Preliminary
analysis by Jonas Plum[1] provided detail on the data structures. This was followed by extensive
research detailing APFS internals, performed by Hansen et al. In a detailed article[2], they
provide a forensic view of the data structures used, which proved invaluable for future work,
including the author's implementation of his filesystem tool.
Finally, two and a half years after its initial release and coinciding with that of Darwin 18, the
APFS specification showed up, with no announcement, on developer.apple.com[3]. The document
is fairly detailed in documenting the data structures and constants, but seems at times to be
minimalistic, as if created automatically from the source code comments of the header files -
certainly not on par with the HFS+ specification of TN1150. This chapter, along with the
reference provided by Apple, should hopefully provide a clear view of APFS' intricate structures
and logic.
This book is filled with hands-on experiments, but this chapter, in particular,
is where the reader is encouraged to follow along with each and every one.
Filesystem implementations make very specific use of very particular data
structures - and the best way to understand them is through careful step-by-
step tracing of filesystem operations, and dumping raw blocks. The fsleuth tool,
which is freely available from the book's website, was especially designed with
verbose debugging output to allow the avid APFS (and HFS+) enthusiast to
inspect the filesystem internals.
Partitions are defined in the GUID Partition Table (GPT), which is at the second block of the
disk (with a backup stored towards the end of the disk). The APFS partition type is
identified by a well-known GUID. In MacOS, another well-known GUID is used for APFS recovery
volumes.
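Since the type GUID is well-known (it is visible in the annotated hexdump of Figure 8-2),
locating an APFS partition can be sketched in a few lines of user mode code - assuming
512-byte logical blocks and the standard GPT layout (header at LBA 1, 128-byte entries from
LBA 2); run as root against e.g. /dev/rdisk0:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

// The APFS partition type GUID, 7C3457EF-0000-11AA-AA11-00306543ECAC,
// in its mixed-endian on-disk byte order:
static const uint8_t apfs_guid[16] = {
    0xef, 0x57, 0x34, 0x7c, 0x00, 0x00, 0xaa, 0x11,
    0xaa, 0x11, 0x00, 0x30, 0x65, 0x43, 0xec, 0xac
};

int main(int argc, char **argv)
{
    uint8_t entry[128];
    uint64_t first_lba;
    int fd = open(argc > 1 ? argv[1] : "/dev/rdisk0", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    for (int i = 0; i < 16; i++) {   // check the first 16 partition entries
        if (pread(fd, entry, sizeof(entry), 2 * 512 + i * 128) != sizeof(entry))
            break;
        if (memcmp(entry, apfs_guid, sizeof(apfs_guid)) == 0) {
            memcpy(&first_lba, entry + 32, sizeof(first_lba)); // starting LBA
            printf("Partition %d is APFS, starting at LBA 0x%llx\n",
                   i + 1, (unsigned long long)first_lba);
        }
    }
    close(fd);
    return 0;
}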
Being a filesystem, each volume usually maintains its own object map (though in some
cases it may use that of its container), which is again a B-Tree. Two specific objects make up the
filesystem itself:
• The RootFS Tree: a B-Tree wherein file metadata is maintained. This includes the file's
inode attributes (stat(1), and the like), extended attributes (xattr(1)), and extent
records.
• The Extent Tree: maps logical extents to physical blocks, where the file data is actually
stored.
In addition to the volumes and their filesystems, the container needs to maintain state for
all of its blocks. This is the role of the Space Manager object. The Space Manager maintains a
logical bitmap, wherein '0' indicates the corresponding block is free, and '1' indicates it is in use.
Although every block is 4K, the number of blocks in a given container can be huge, and so the
Space Manager groups contiguous blocks into chunks, and makes use of Chunk Info Blocks
(CIBs) to maintain the bitmaps at a chunk level, and CIB Allocation Blocks (CABs) to group
together contiguous CIBs.
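The chunk arithmetic follows directly: with 4KiB blocks, one bitmap block tracks
4096 * 8 = 32,768 blocks. A hedged sketch (the helper names are illustrative, not Apple's):

#include <stdint.h>

#define BLOCK_SIZE       4096
#define BLOCKS_PER_CHUNK (BLOCK_SIZE * 8)   // one 4K bitmap block = 32768 bits

// Which chunk's bitmap (i.e., which CIB entry) tracks a given block number:
static inline uint64_t chunk_of(uint64_t block)
{
    return block / BLOCKS_PER_CHUNK;
}

// Test a block's bit within its chunk's bitmap: 1 = in use, 0 = free
static inline int block_in_use(const uint8_t *chunk_bitmap, uint64_t block)
{
    uint32_t bit = (uint32_t)(block % BLOCKS_PER_CHUNK);
    return (chunk_bitmap[bit / 8] >> (bit % 8)) & 1;
}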
Our last object in the APFS bestiary is the Reaper. Reapers track the state of large objects,
so that they can be safely deleted and their space reclaimed. An example of that is snapshot
deletion, which requires destroying all deleted objects whose state was preserved for the
snapshot, but is no longer needed once the snapshot is destroyed. The objects to be reaped are
maintained in Reaper List blocks, which, as their name implies, may span multiple blocks and
list entries.
There are additional objects, although less commonly encountered. Fusion drives, which
enable containers to span traditional (magnetic platter) hard drives and solid state disks,
maintain write-back caches and "middle trees" to track hard drive blocks cached on the solid
state disks. APFS also contains built-in support for encryption, supporting an intermediate state
as the drive is in the process of being encrypted (when enabling FileVault), through an
"encryption rolling state" object. Finally, in order to provide EFI support in the face of APFS's
frequent changes, there is the "EFI jumpstart" object, which is an encapsulated EFI driver.
As we continue our exploration, fsleuth(J) will be used to unravel the structure of APFS,
one object at a time, in a series of experiments - starting with inspecting the GUID Partition Table
itself.
Figure 8-1: A very high level view of APFS. The figure's annotations: the container superblock
(block 0) provides a global object map wherein other objects can be looked up; an array of
"filesystems" points to the volume superblocks, each volume representing a mountable
filesystem; each volume's FS B-Tree has records of various types for every inode (with inode #2
for the fs root); the "spaceman" handles free space management for the container; volume
snapshots enable state rollback; and the "Reaper" handles garbage collection for large objects.
Figure 8-2: An annotated hexdump of the GPT from MacOS, showing the protective MBR
(ending in the 55 aa signature), the "EFI PART" GPT header at offset 0x200, and the 128-byte
partition entries starting at offset 0x400: the EFI System Partition (spanning LBAs 0x28-0x64027,
i.e. starting at offset 0x28 * 0x200 = 0x5000, where its BSD 4.4 FAT boot sector is visible), the
APFS container (type GUID 7C3457EF-0000-11AA-AA11-00306543ECAC, spanning LBAs
0x64028-0x3A29E87F), and the APFS recovery partition, with its own distinct type GUID.
Looking at the hexdump can be a bit daunting - but fortunately GPT recognition is built
in to fsleuth(J), which can be run directly on the raw disk device to parse the table.
Note that, prior to MacOS 10.14, fsleuth(J) will detect both APFS partitions - and that
the APFS recovery partition has a different GUID (B5C7...-7B74D1F9) than the one used
for boot. As of MacOS 10.14 there is only one container, and the recovery filesystem is instead a
volume within it.