From 2618849f31e7cf51fadd4a5242458501a6d5b315 Mon Sep 17 00:00:00 2001 From: Qu Wenruo Date: Thu, 23 Oct 2025 19:44:04 +1030 Subject: [PATCH 1/4] btrfs: ensure no dirty metadata is written back for an fs with errors [BUG] During development of a minor feature (make sure all btrfs_bio::end_io() is called in task context), I noticed a crash in generic/388, where metadata writes triggered new works after btrfs_stop_all_workers(). It turns out that it can even happen without any code modification, just using RAID5 for metadata and the same workload from generic/388 is going to trigger the use-after-free. [CAUSE] If btrfs hits an error, the fs is marked as error, no new transaction is allowed thus metadata is in a frozen state. But there are some metadata modifications before that error, and they are still in the btree inode page cache. Since there will be no real transaction commit, all those dirty folios are just kept as is in the page cache, and they can not be invalidated by invalidate_inode_pages2() call inside close_ctree(), because they are dirty. And finally after btrfs_stop_all_workers(), we call iput() on btree inode, which triggers writeback of those dirty metadata. And if the fs is using RAID56 metadata, this will trigger RMW and queue new works into rmw_workers, which is already stopped, causing warning from queue_work() and use-after-free. [FIX] Add a special handling for write_one_eb(), that if the fs is already in an error state, immediately mark the bbio as failure, instead of really submitting them. Then during close_ctree(), iput() will just discard all those dirty tree blocks without really writing them back, thus no more new jobs for already stopped-and-freed workqueues. The extra discard in write_one_eb() also acts as an extra safenet. E.g. the transaction abort is triggered by some extent/free space tree corruptions, and since extent/free space tree is already corrupted some tree blocks may be allocated where they shouldn't be (overwriting existing tree blocks). In that case writing them back will further corrupting the fs. CC: stable@vger.kernel.org # 6.6+ Reviewed-by: Filipe Manana Signed-off-by: Qu Wenruo Signed-off-by: David Sterba --- fs/btrfs/extent_io.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 755ec6dfd51c..23273d0e6f22 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -2228,6 +2228,14 @@ static noinline_for_stack void write_one_eb(struct extent_buffer *eb, wbc_account_cgroup_owner(wbc, folio, range_len); folio_unlock(folio); } + /* + * If the fs is already in error status, do not submit any writeback + * but immediately finish it. + */ + if (unlikely(BTRFS_FS_ERROR(fs_info))) { + btrfs_bio_end_io(bbio, errno_to_blk_status(BTRFS_FS_ERROR(fs_info))); + return; + } btrfs_submit_bbio(bbio, 0); } From f260c6aff0b8af236084012d14f9f1bf792ea883 Mon Sep 17 00:00:00 2001 From: Shardul Bankar Date: Sun, 26 Oct 2025 01:30:21 +0530 Subject: [PATCH 2/4] btrfs: fix memory leak of qgroup_list in btrfs_add_qgroup_relation When btrfs_add_qgroup_relation() is called with invalid qgroup levels (src >= dst), the function returns -EINVAL directly without freeing the preallocated qgroup_list structure passed by the caller. This causes a memory leak because the caller unconditionally sets the pointer to NULL after the call, preventing any cleanup. The issue occurs because the level validation check happens before the mutex is acquired and before any error handling path that would free the prealloc pointer. On this early return, the cleanup code at the 'out' label (which includes kfree(prealloc)) is never reached. In btrfs_ioctl_qgroup_assign(), the code pattern is: prealloc = kzalloc(sizeof(*prealloc), GFP_KERNEL); ret = btrfs_add_qgroup_relation(trans, sa->src, sa->dst, prealloc); prealloc = NULL; // Always set to NULL regardless of return value ... kfree(prealloc); // This becomes kfree(NULL), does nothing When the level check fails, 'prealloc' is never freed by either the callee or the caller, resulting in a 64-byte memory leak per failed operation. This can be triggered repeatedly by an unprivileged user with access to a writable btrfs mount, potentially exhausting kernel memory. Fix this by freeing prealloc before the early return, ensuring prealloc is always freed on all error paths. Fixes: 4addc1ffd67a ("btrfs: qgroup: preallocate memory before adding a relation") Reviewed-by: Qu Wenruo Signed-off-by: Shardul Bankar Signed-off-by: David Sterba --- fs/btrfs/qgroup.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c index 1175b8192cd7..31ad8580322a 100644 --- a/fs/btrfs/qgroup.c +++ b/fs/btrfs/qgroup.c @@ -1539,8 +1539,10 @@ int btrfs_add_qgroup_relation(struct btrfs_trans_handle *trans, u64 src, u64 dst ASSERT(prealloc); /* Check the level of src and dst first */ - if (btrfs_qgroup_level(src) >= btrfs_qgroup_level(dst)) + if (btrfs_qgroup_level(src) >= btrfs_qgroup_level(dst)) { + kfree(prealloc); return -EINVAL; + } mutex_lock(&fs_info->qgroup_ioctl_lock); if (!fs_info->quota_root) { From 953902e4fb4c373c81a977f78e40f9f93a79e20f Mon Sep 17 00:00:00 2001 From: Filipe Manana Date: Fri, 24 Oct 2025 12:30:56 +0100 Subject: [PATCH 3/4] btrfs: set inode flag BTRFS_INODE_COPY_EVERYTHING when logging new name If we are logging a new name make sure our inode has the runtime flag BTRFS_INODE_COPY_EVERYTHING set so that at btrfs_log_inode() we will find new inode refs/extrefs in the subvolume tree and copy them into the log tree. We are currently doing it when adding a new link but we are missing it when renaming. An example where this makes a new name not persisted: 1) create symlink with name foo in directory A 2) fsync directory A, which persists the symlink 3) rename the symlink from foo to bar 4) fsync directory A to persist the new symlink name Step 4 isn't working correctly as it's not logging the new name and also leaving the old inode ref in the log tree, so after a power failure the symlink still has the old name of "foo". This is because when we first fsync directoy A we log the symlink's inode (as it's a new entry) and at btrfs_log_inode() we set the log mode to LOG_INODE_ALL and then because we are using that mode and the inode has the runtime flag BTRFS_INODE_NEEDS_FULL_SYNC set, we clear that flag as well as the flag BTRFS_INODE_COPY_EVERYTHING. That means the next time we log the inode, during the rename through the call to btrfs_log_new_name() (calling btrfs_log_inode_parent() and then btrfs_log_inode()), we will not search the subvolume tree for new refs/extrefs and jump directory to the 'log_extents' label. Fix this by making sure we set BTRFS_INODE_COPY_EVERYTHING on an inode when we are about to log a new name. A test case for fstests will follow soon. Reported-by: Vyacheslav Kovalevsky Link: https://lore.kernel.org/linux-btrfs/ac949c74-90c2-4b9a-b7fd-1ffc5c3175c7@gmail.com/ Reviewed-by: Boris Burkov Reviewed-by: Qu Wenruo Signed-off-by: Filipe Manana Signed-off-by: David Sterba --- fs/btrfs/inode.c | 1 - fs/btrfs/tree-log.c | 3 +++ 2 files changed, 3 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 02cb081697fe..b95175116ea3 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -6873,7 +6873,6 @@ static int btrfs_link(struct dentry *old_dentry, struct inode *dir, BTRFS_I(inode)->dir_index = 0ULL; inode_inc_iversion(inode); inode_set_ctime_current(inode); - set_bit(BTRFS_INODE_COPY_EVERYTHING, &BTRFS_I(inode)->runtime_flags); ret = btrfs_add_link(trans, BTRFS_I(dir), BTRFS_I(inode), &fname.disk_name, 1, index); diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 6aad6b65522b..00a59fb79167 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -7910,6 +7910,9 @@ void btrfs_log_new_name(struct btrfs_trans_handle *trans, bool log_pinned = false; int ret; + /* The inode has a new name (ref/extref), so make sure we log it. */ + set_bit(BTRFS_INODE_COPY_EVERYTHING, &inode->runtime_flags); + btrfs_init_log_ctx(&ctx, inode); ctx.logging_new_name = true; From 3b1a4a59a2086badab391687a6a0b86e03048393 Mon Sep 17 00:00:00 2001 From: austinchang Date: Wed, 29 Oct 2025 09:35:27 +0000 Subject: [PATCH 4/4] btrfs: mark dirty extent range for out of bound prealloc extents In btrfs_fallocate(), when the allocated range overlaps with a prealloc extent and the extent starts after i_size, the range doesn't get marked dirty in file_extent_tree. This results in persisting an incorrect disk_i_size for the inode when not using the no-holes feature. This is reproducible since commit 41a2ee75aab0 ("btrfs: introduce per-inode file extent tree"), then became hidden since commit 3d7db6e8bd22 ("btrfs: don't allocate file extent tree for non regular files") and then visible again after commit 8679d2687c35 ("btrfs: initialize inode::file_extent_tree after i_mode has been set"), which fixes the previous commit. The following reproducer triggers the problem: $ cat test.sh MNT=/mnt/test DEV=/dev/vdb mkdir -p $MNT mkfs.btrfs -f -O ^no-holes $DEV mount $DEV $MNT touch $MNT/file1 fallocate -n -o 1M -l 2M $MNT/file1 umount $MNT mount $DEV $MNT len=$((1 * 1024 * 1024)) fallocate -o 1M -l $len $MNT/file1 du --bytes $MNT/file1 umount $MNT mount $DEV $MNT du --bytes $MNT/file1 umount $MNT Running the reproducer gives the following result: $ ./test.sh (...) 2097152 /mnt/test/file1 1048576 /mnt/test/file1 The difference is exactly 1048576 as we assigned. Fix by adding a call to btrfs_inode_set_file_extent_range() in btrfs_fallocate_update_isize(). Fixes: 41a2ee75aab0 ("btrfs: introduce per-inode file extent tree") Signed-off-by: austinchang Reviewed-by: Filipe Manana Signed-off-by: Filipe Manana Signed-off-by: David Sterba --- fs/btrfs/file.c | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 7efd1f8a1912..fa82def46e39 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -2854,12 +2854,22 @@ static int btrfs_fallocate_update_isize(struct inode *inode, { struct btrfs_trans_handle *trans; struct btrfs_root *root = BTRFS_I(inode)->root; + u64 range_start; + u64 range_end; int ret; int ret2; if (mode & FALLOC_FL_KEEP_SIZE || end <= i_size_read(inode)) return 0; + range_start = round_down(i_size_read(inode), root->fs_info->sectorsize); + range_end = round_up(end, root->fs_info->sectorsize); + + ret = btrfs_inode_set_file_extent_range(BTRFS_I(inode), range_start, + range_end - range_start); + if (ret) + return ret; + trans = btrfs_start_transaction(root, 1); if (IS_ERR(trans)) return PTR_ERR(trans);