overlayfs is a stackable filesystem. This article reads the overlayfs source in the 6.12.1 kernel as a way to study how a stackable filesystem is built.
The overlayfs source is small, only about 13k lines in total:
```
$ wc -l fs/overlayfs/*
  1299 fs/overlayfs/copy_up.c
  1421 fs/overlayfs/dir.c
   870 fs/overlayfs/export.c
   619 fs/overlayfs/file.c
  1309 fs/overlayfs/inode.c
   136 fs/overlayfs/Kconfig
     9 fs/overlayfs/Makefile
  1429 fs/overlayfs/namei.c
   898 fs/overlayfs/overlayfs.h
   196 fs/overlayfs/ovl_entry.h
  1009 fs/overlayfs/params.c
    43 fs/overlayfs/params.h
  1247 fs/overlayfs/readdir.c
  1527 fs/overlayfs/super.c
  1535 fs/overlayfs/util.c
   271 fs/overlayfs/xattrs.c
 13818 total
```
The module entry point of overlayfs is in super.c:

fs/overlayfs/super.c:1526: module_init(ovl_init);

The module init function ovl_init(), defined in super.c, does only two things:

1. Call kmem_cache_create() to create the inode cache; all overlayfs inodes are allocated from it.
2. Call register_filesystem() to register the filesystem. The file_system_type variable for overlayfs is ovl_fs_type, also defined in super.c:
```c
struct file_system_type ovl_fs_type = {
	.owner		= THIS_MODULE,
	.name		= "overlay",
	.init_fs_context = ovl_init_fs_context,
	.parameters	= ovl_parameter_spec,
	.fs_flags	= FS_USERNS_MOUNT,
	.kill_sb	= kill_anon_super,
};
```
This structure belongs to the VFS layer; overlayfs mounts go through the init_fs_context() path of the new mount API. ovl_init_fs_context() is defined in params.c:

```c
int ovl_init_fs_context(struct fs_context *fc)
```

Its parameter fc is a struct fs_context that the kernel has already allocated.
The job of init_fs_context() is to allocate two private structures, ctx and ofs, of type struct ovl_fs_context and struct ovl_fs respectively.
struct ovl_fs_context is defined in params.h:
```c
struct ovl_fs_context {
	struct path upper;
	struct path work;
	size_t capacity;
	size_t nr; /* includes nr_data */
	size_t nr_data;
	struct ovl_opt_set set;
	struct ovl_fs_context_layer *lower;
	char *lowerdir_all; /* user provided lowerdir string */
};
```
init_fs_context() pre-allocates room for 3 lower entries. As the name suggests, lower describes the lower-layer directories; it is an array, whose allocated size is kept in capacity while nr tracks how many entries are in use.
struct ovl_fs is defined in ovl_entry.h. It is a large structure; from the context, it holds all of overlayfs's configuration and state:
```c
struct ovl_fs {
	unsigned int numlayer;
	/* Number of unique fs among layers including upper fs */
	unsigned int numfs;
	/* Number of data-only lower layers */
	unsigned int numdatalayer;
	struct ovl_layer *layers;
	struct ovl_sb *fs;
	/* workbasedir is the path at workdir= mount option */
	struct dentry *workbasedir;
	/* workdir is the 'work' or 'index' directory under workbasedir */
	struct dentry *workdir;
	long namelen;
	/* pathnames of lower and upper dirs, for show_options */
	struct ovl_config config;
	/* creds of process who forced instantiation of super block */
	const struct cred *creator_cred;
	bool tmpfile;
	bool noxattr;
	bool nofh;
	/* Did we take the inuse lock? */
	bool upperdir_locked;
	bool workdir_locked;
	/* Traps in ovl inode cache */
	struct inode *workbasedir_trap;
	struct inode *workdir_trap;
	/* -1: disabled, 0: same fs, 1..32: number of unused ino bits */
	int xino_mode;
	/* For allocation of non-persistent inode numbers */
	atomic_long_t last_ino;
	/* Shared whiteout cache */
	struct dentry *whiteout;
	bool no_shared_whiteout;
	/* r/o snapshot of upperdir sb's only taken on volatile mounts */
	errseq_t errseq;
};
```
init_fs_context() initializes some members of ofs->config, with values coming from module parameters or compile-time defaults.
config is of type struct ovl_config, also defined in ovl_entry.h:
```c
struct ovl_config {
	char *upperdir;
	char *workdir;
	char **lowerdirs;
	bool default_permissions;
	int redirect_mode;
	int verity_mode;
	bool index;
	int uuid;
	bool nfs_export;
	int xino;
	bool metacopy;
	bool userxattr;
	bool ovl_volatile;
};
```
Finally, init_fs_context() stores the newly allocated ctx and ofs in the fs_context's fs_private and s_fs_info fields, and sets the fs_context's ops to ovl_context_ops.
ovl_context_ops is of type struct fs_context_operations; it defines the hooks for operating on the fs context, and the kernel drives the rest of the initialization through these hooks.
ovl_context_ops is defined in params.c:
```c
static const struct fs_context_operations ovl_context_ops = {
	.parse_monolithic = ovl_parse_monolithic,
	.parse_param = ovl_parse_param,
	.get_tree    = ovl_get_tree,
	.reconfigure = ovl_reconfigure,
	.free        = ovl_free,
};
```
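The kernel never calls overlayfs functions directly here; it drives everything through this ops table attached to the context, which also carries the filesystem's private pointers. A minimal userspace analogue of the pattern (all names are hypothetical, not kernel APIs):

```c
#include <assert.h>
#include <stddef.h>

struct my_fc;

/* The "ops table": hooks the generic code calls into the filesystem. */
struct my_ops {
	int (*parse_param)(struct my_fc *fc, const char *key, const char *val);
	int (*get_tree)(struct my_fc *fc);
};

/* The "context": ops plus filesystem-private pointers, like fs_context. */
struct my_fc {
	const struct my_ops *ops;
	void *fs_private;	/* like fc->fs_private (ctx) */
	void *s_fs_info;	/* like fc->s_fs_info (ofs) */
	int configured;
};

static int demo_parse_param(struct my_fc *fc, const char *key, const char *val)
{
	(void)key; (void)val;
	fc->configured = 1;
	return 0;
}

static int demo_get_tree(struct my_fc *fc)
{
	return fc->configured ? 0 : -1;
}

static const struct my_ops demo_ops = {
	.parse_param = demo_parse_param,
	.get_tree = demo_get_tree,
};

/* The "kernel side": drives the mount purely through the ops table. */
static int mount_like(struct my_fc *fc, const char *key, const char *val)
{
	int err = fc->ops->parse_param(fc, key, val);
	return err ? err : fc->ops->get_tree(fc);
}
```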
According to the reference documentation, parse_monolithic parses mount options passed as a single data page; we will not look at it here.
The parse_param hook is called each time the kernel has parsed one mount parameter:

```c
static int ovl_parse_param(struct fs_context *fc, struct fs_parameter *param)
```

Its parameters are the fs_context and the fs_parameter.
fc->purpose indicates why the context was created, e.g. FS_CONTEXT_FOR_RECONFIGURE. overlayfs does not allow changing its configuration on remount, so in that case it returns an error.
For everything else it calls opt = fs_parse(fc, ovl_parameter_spec, param, &result);
ovl_parameter_spec is a parameter table; fs_parse() returns an option id, and each option is then parsed into the corresponding config field.
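The table-plus-switch shape of this dispatch can be sketched in userspace as follows (option ids and config fields are illustrative, not the kernel's):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

enum { Opt_lowerdir, Opt_upperdir, Opt_index, Opt_err = -1 };

/* Parameter table: maps option names to ids, like ovl_parameter_spec. */
static const struct { const char *name; int id; } spec[] = {
	{ "lowerdir", Opt_lowerdir },
	{ "upperdir", Opt_upperdir },
	{ "index",    Opt_index },
};

struct config { const char *lowerdir, *upperdir; bool index; };

static int do_parse(struct config *cfg, const char *key, const char *val)
{
	int opt = Opt_err;

	/* fs_parse() equivalent: look the key up in the table. */
	for (size_t i = 0; i < sizeof(spec) / sizeof(spec[0]); i++)
		if (!strcmp(spec[i].name, key))
			opt = spec[i].id;

	/* Switch on the option id and fill the corresponding config field. */
	switch (opt) {
	case Opt_lowerdir: cfg->lowerdir = val; return 0;
	case Opt_upperdir: cfg->upperdir = val; return 0;
	case Opt_index:    cfg->index = !strcmp(val, "on"); return 0;
	default:           return -1;	/* unknown option */
	}
}
```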
The lowerdir/upperdir/workdir options are handled by ovl_parse_layer(), which does three things:

1. ovl_mount_dir(): resolve the option string into a struct path.
   Internally this is err = kern_path(name, LOOKUP_FOLLOW, path);
2. ovl_mount_dir_check(): check the resolved path, i.e. whether the underlying filesystem can be used by overlay.
   The first check is that the path is a directory: if (!d_is_dir(path->dentry))
3. ovl_add_layer(): save the path and the name:
   - the path goes into the work/upper/lower[] members of struct ovl_fs_context;
   - the name goes into the workdir/upperdir members of ofs->config (struct ovl_config).
The get_tree hook, according to the reference documentation, is where the filesystem actually gets created: based on the information in the fs_context it creates a mountable root and superblock, after which the relevant data can be transferred from the fs_context to the superblock. For overlay this means fc->s_fs_info, i.e. the ovl_fs structure allocated earlier, is moved over to the sb.
```c
static int ovl_get_tree(struct fs_context *fc)
{
	return get_tree_nodev(fc, ovl_fill_super);
}
```
overlayfs's get_tree hook, ovl_get_tree(), simply calls the kernel helper get_tree_nodev(): the filesystem only provides a callback that fills in the superblock, and the kernel allocates the superblock for it.
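This division of labor between the generic helper and the fill callback can be sketched like so (a toy analogue with hypothetical names; the real get_tree_nodev() also handles sget_fc(), reference counting, and error paths):

```c
#include <assert.h>
#include <stdlib.h>

struct sb { void *s_fs_info; void *s_root; };

static int demo_fill_super(struct sb *sb, void *fs_info)
{
	(void)fs_info;
	sb->s_root = malloc(1);		/* stands in for d_make_root() */
	return sb->s_root ? 0 : -1;
}

/* The "kernel" allocates the sb, transfers the private info, then hands
 * it to the filesystem's fill callback to populate. */
static int get_tree_like(void *fs_info, int (*fill)(struct sb *, void *))
{
	struct sb *sb = calloc(1, sizeof(*sb));
	if (!sb)
		return -1;
	sb->s_fs_info = fs_info;	/* like moving fc->s_fs_info to sb */
	if (fill(sb, fs_info)) {
		free(sb);
		return -1;
	}
	free(sb->s_root);		/* cleanup, demo only */
	free(sb);
	return 0;
}
```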
Now look at ovl_fill_super(), defined in fs/overlayfs/super.c. Its parameters are the sb the kernel has already allocated, plus the fs_context:

```c
int ovl_fill_super(struct super_block *sb, struct fs_context *fc)
```

Before calling fill_super, the kernel has already transferred s_fs_info from fc to sb, so ovl_fs and ovl_fs_context are fetched as follows:

```c
struct ovl_fs *ofs = sb->s_fs_info;
struct ovl_fs_context *ctx = fc->fs_private;
```

It then initializes the sb in the following steps:
1. Set the superblock's dentry operations:
   sb->s_d_op = &ovl_dentry_operations;
2. Save the creator's credentials:
   ofs->creator_cred = cred = prepare_creds();
3. Verify the options:
   err = ovl_fs_params_verify(ctx, &ofs->config);
4. Set the super operations:
   sb->s_op = &ovl_super_operations;
Next it allocates an array of struct ovl_layer, one entry for the upper layer plus one per lower layer, and stores it in ofs->layers:
```c
struct ovl_layer {
	/* ovl_free_fs() relies on @mnt being the first member! */
	struct vfsmount *mnt;
	/* Trap in ovl inode cache */
	struct inode *trap;
	struct ovl_sb *fs;
	/* Index of this layer in fs root (upper idx == 0) */
	int idx;
	/* One fsid per unique underlying sb (upper fsid == 0) */
	int fsid;
	/* xwhiteouts were found on this layer */
	bool has_xwhiteouts;
};
```
upper layer
===============
The upper layer occupies index 0 of the array.
err = ovl_get_upper(sb, ofs, &layers[0], &ctx->upper); parses the upper layer:
1. It first calls ovl_setup_trap(sb, upperpath->dentry, &upper_layer->trap, "upperdir") to create an inode in the overlay inode cache whose key is the inode of the upper directory. According to the comments, this is a reserved "trap" inode: nothing may be created or looked up through it, and it is marked dead:
   trap->i_flags = S_DEAD;
2. upper_mnt = clone_private_mount(upperpath) clones the mount that upperpath lives on. The clone is private and is not attached to the mount tree; grepping the kernel, overlayfs is the only user of this interface. The clone's root is upperpath, so it behaves like a hidden bind mount.
3. Lock the inode of the upper directory:
   ovl_inuse_trylock(ovl_upper_mnt(ofs)->mnt_root)
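The "inuse" lock is a claim-once bit: the first overlay instance to claim the directory succeeds, and any later instance trying to reuse the same upperdir fails until the first releases it. The kernel sets a flag in the inode's i_state under i_lock; a userspace sketch of the same semantics using a C11 atomic flag (names are illustrative):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdatomic.h>

struct obj { atomic_flag inuse; };

/* Returns true if we claimed the object; false if someone else holds it. */
static bool inuse_trylock(struct obj *o)
{
	return !atomic_flag_test_and_set(&o->inuse);
}

static void inuse_unlock(struct obj *o)
{
	atomic_flag_clear(&o->inuse);
}
```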
workdir
=============
err = ovl_get_workdir(sb, ofs, &ctx->upper, &ctx->work);
parses the workdir and performs some checks:

1. Verify that upperdir and workdir are on the same mount and that neither is an ancestor of the other.
2. Save the workdir dentry and lock its inode:
   ofs->workbasedir = dget(workpath->dentry);
   ovl_inuse_trylock(ofs->workbasedir)
3. As with the upper layer, get a trap inode:
   err = ovl_setup_trap(sb, ofs->workbasedir, &ofs->workbasedir_trap, "workdir");
4. Create the workdir and run compatibility checks, via ovl_make_workdir(sb, ofs, workpath);
   The directory saved above is workbasedir; this function creates a "work" subdirectory under it and runs a series of feature checks.
   Around the creation it calls mnt_want_write() first and mnt_drop_write() afterwards.
   workdir = ovl_workdir_create(ofs, OVL_WORKDIR_NAME, false);
   ofs->workdir = workdir;
lower layer
=========================
oe = ovl_get_lowerstack(sb, ctx, ofs, layers); parses the lower layers and returns an oe:
```c
struct ovl_path {
	const struct ovl_layer *layer;
	struct dentry *dentry;
};

struct ovl_entry {
	unsigned int __numlower;
	struct ovl_path __lowerstack[];
};
```
1. For each lower layer, call ovl_lower_dir(l->name, &l->path, ofs, &sb->s_stack_depth);
   which checks whether the lower layer supports file handles, and from that decides whether ofs can support index and xino.
2. Call err = ovl_get_layers(sb, ofs, ctx, layers);
   This introduces another structure, struct ovl_sb; ofs->fs is an array of it:
```c
struct ovl_sb {
	struct super_block *sb;
	dev_t pseudo_dev;
	/* Unusable (conflicting) uuid */
	bool bad_uuid;
	/* Used as a lower layer (but maybe also as upper) */
	bool is_lower;
};
```
   It allocates ctx->nr + 2 ovl_sb entries in ofs->fs: fs[0] is reserved for the upper layer, the last entry for a null fs, and the ones in between are for the lower layers. ofs->numfs counts how many are in use.
   err = get_anon_bdev(&ofs->fs[0].pseudo_dev); allocates an anonymous block device id for the upper layer.
   Then, for each lower layer:
   a. fsid = ovl_get_fsid(ofs, &l->path);
      which allocates a block device id and initializes ofs->fs[ofs->numfs++].
   b. Perform the same operations as for the upper layer, ovl_setup_trap() and clone_private_mount(), then initialize the layer.
3. Allocate oe, sized nr_merged_lower = ctx->nr - ctx->nr_data entries; data-only layers do not take part in the oe. Then initialize it:
   oe->__lowerstack[i].dentry = dget(ctx->lower[i].path.dentry);
   oe->__lowerstack[i].layer = &ofs->layers[i+1];
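ovl_entry uses a C flexible array member: one allocation holds the header plus __numlower ovl_path slots. A simplified sketch of that layout (simplified types, not the kernel code):

```c
#include <assert.h>
#include <stdlib.h>

struct path_slot { int layer_idx; const char *name; };

struct entry {
	unsigned int __numlower;
	struct path_slot __lowerstack[];	/* flexible array member */
};

/* One allocation for the header plus numlower trailing slots. */
static struct entry *alloc_entry(unsigned int numlower)
{
	struct entry *oe = calloc(1, sizeof(*oe) +
				     numlower * sizeof(struct path_slot));
	if (oe)
		oe->__numlower = numlower;
	return oe;
}
```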
The last step of fill_super assigns the remaining sb attributes:

```c
sb->s_magic = OVERLAYFS_SUPER_MAGIC;
sb->s_xattr = ovl_xattr_handlers(ofs);
sb->s_fs_info = ofs;
sb->s_flags |= SB_POSIXACL;
sb->s_iflags |= SB_I_SKIP_SYNC;
sb->s_iflags |= SB_I_NOUMASK;
sb->s_iflags |= SB_I_EVM_HMAC_UNSUPPORTED;
```
Then it allocates the root dentry and assigns it to the sb:

```c
root_dentry = ovl_get_root(sb, ctx->upper.dentry, oe);
sb->s_root = root_dentry;
```
Now look at the implementation of ovl_get_root():
The kernel provides d_make_root() to create a root dentry; you only need to supply an inode:

root = d_make_root(ovl_new_inode(sb, S_IFDIR, 0));
This introduces another structure:
```c
struct ovl_inode {
	union {
		struct ovl_dir_cache *cache;	/* directory */
		const char *lowerdata_redirect;	/* regular file */
	};
	const char *redirect;
	u64 version;
	unsigned long flags;
	struct inode vfs_inode;
	struct dentry *__upperdentry;
	struct ovl_entry *oe;

	/* synchronize copy up and more */
	struct mutex lock;
};
```
This is overlayfs's inode: the VFS struct inode is embedded inside ovl_inode, and all the overlay-specific information lives in the surrounding fields.
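Given the embedded vfs_inode that the VFS hands back, overlayfs recovers its ovl_inode with the container_of() trick (the source wraps it as OVL_I()). A userspace demonstration with simplified structs:

```c
#include <assert.h>
#include <stddef.h>

struct inode { unsigned long i_ino; };

struct ovl_inode {
	unsigned long flags;
	struct inode vfs_inode;	/* VFS inode embedded in the fs object */
};

/* Standard container_of: step back from a member to its enclosing struct. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static struct ovl_inode *OVL_I(struct inode *inode)
{
	return container_of(inode, struct ovl_inode, vfs_inode);
}
```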
dentry has no such embedding mechanism, so where does overlay keep its per-dentry private data? In the dentry's
void *d_fsdata; /* fs-specific data */
field, which overlay reinterprets as a flags word:
```c
/* private information held for every overlayfs dentry */
static inline unsigned long *OVL_E_FLAGS(struct dentry *dentry)
{
	return (unsigned long *) &dentry->d_fsdata;
}
```
After the root dentry is created, some more initialization follows:

```c
ovl_set_flag(OVL_WHITEOUTS, d_inode(root));
ovl_dentry_set_flag(OVL_E_CONNECTED, root);
ovl_set_upperdata(d_inode(root));
ovl_inode_init(d_inode(root), &oip, ino, fsid);
ovl_dentry_init_flags(root, upperdentry, oe, DCACHE_OP_WEAK_REVALIDATE);

/* root keeps a reference of upperdentry */
dget(upperdentry);
```
ovl_set_flag() sets bits in ovl_inode.flags; ovl_dentry_set_flag() sets bits in dentry.d_fsdata.
Finally, ovl_new_inode():
```c
struct inode *ovl_new_inode(struct super_block *sb, umode_t mode, dev_t rdev)
{
	struct inode *inode;

	inode = new_inode(sb);
	if (inode)
		ovl_fill_inode(inode, mode, rdev);

	return inode;
}

static void ovl_fill_inode(struct inode *inode, umode_t mode, dev_t rdev)
{
	inode->i_mode = mode;
	inode->i_flags |= S_NOCMTIME;
#ifdef CONFIG_FS_POSIX_ACL
	inode->i_acl = inode->i_default_acl = ACL_DONT_CACHE;
#endif

	ovl_lockdep_annotate_inode_mutex_key(inode);

	switch (mode & S_IFMT) {
	case S_IFREG:
		inode->i_op = &ovl_file_inode_operations;
		inode->i_fop = &ovl_file_operations;
		inode->i_mapping->a_ops = &ovl_aops;
		break;

	case S_IFDIR:
		inode->i_op = &ovl_dir_inode_operations;
		inode->i_fop = &ovl_dir_operations;
		break;

	case S_IFLNK:
		inode->i_op = &ovl_symlink_inode_operations;
		break;

	default:
		inode->i_op = &ovl_special_inode_operations;
		init_special_inode(inode, mode, rdev);
		break;
	}
}
```
It calls the kernel's new_inode(sb) to allocate the inode, which in turn ends up in ovl_super_operations.alloc_inode, i.e. ovl_alloc_inode().
References

[1] David Howells. VFS: Introduce filesystem context. https://lwn.net/Articles/780267/
[2] Filesystem Mount API. https://www.kernel.org/doc/html/latest/filesystems/mount_api.html
[3] Overlay Filesystem. https://www.kernel.org/doc/html/latest/filesystems/overlayfs.html