Using the helper makes it a bit more clear that we're accessing the
first list entry.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Rename all the exported functions from extent_map.h that don't have a
'btrfs_' prefix in their names, so that they are consistent with all the
other functions, to make it clear they are btrfs specific functions and
to avoid potential name collisions in the future with functions defined
elsewhere in the kernel.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These functions are exported and don't have a 'btrfs_' prefix in their
names, which goes against coding style conventions. Rename them to have
such prefix, making it clear they are from btrfs and avoiding potential
collisions in the future with functions defined elsewhere outside btrfs.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These functions are exported and don't have a 'btrfs_' prefix in their
names, which goes against coding style conventions. Rename them to have
such prefix, making it clear they are from btrfs and avoiding potential
collisions in the future with functions defined elsewhere outside btrfs.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These functions are exported and don't have a 'btrfs_' prefix in their
names, which goes against coding style conventions. Rename them to have
such prefix, making it clear they are from btrfs and avoiding potential
collisions in the future with functions defined elsewhere outside btrfs.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These functions are exported so they should have a 'btrfs_' prefix by
convention, to make it clear they are btrfs specific and to avoid
collisions with functions from elsewhere in the kernel.
So add a 'btrfs_' prefix to their name to make it clear they are from
btrfs.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These functions are exported so they should have a 'btrfs_' prefix by
convention, to make it clear they are btrfs specific and to avoid
collisions with functions from elsewhere in the kernel. So add a prefix to
their name.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we fsync a file (or directory) that has no more hard links, because
while a process had a file descriptor open on it, the file's last hard
link was removed and then the process did an fsync against the file
descriptor, after a power failure or crash the file still exists after
replaying the log.
This behaviour is incorrect since once an inode has no more hard links
it's not accessible anymore and we insert an orphan item into its
subvolume's tree so that the deletion of all its items is not missed in
case of a power failure or crash.
So after log replay the file shouldn't exist anymore, which is also the
behaviour on ext4, xfs, f2fs and other filesystems.
Fix this by not ignoring inodes with zero hard links at
btrfs_log_inode_parent() and by committing an inode's delayed inode when
we are not doing a fast fsync (either BTRFS_INODE_COPY_EVERYTHING or
BTRFS_INODE_NEEDS_FULL_SYNC is set in the inode's runtime flags). This
last step is necessary because when removing the last hard link we don't
delete the corresponding ref (or extref) item, instead we record the
change in the inode's delayed inode with the BTRFS_DELAYED_NODE_DEL_IREF
flag, so that when the delayed inode is committed we delete the ref/extref
item from the inode's subvolume tree - otherwise the logging code will log
the last hard link and therefore upon log replay the inode is not deleted.
The base code for a fstests test case that reproduces this bug is the
following:
. ./common/dmflakey
_require_scratch
_require_dm_target flakey
_require_mknod
_scratch_mkfs >>$seqres.full 2>&1 || _fail "mkfs failed"
_require_metadata_journaling $SCRATCH_DEV
_init_flakey
_mount_flakey
touch $SCRATCH_MNT/foo
# Commit the current transaction and persist the file.
_scratch_sync
# A fifo to communicate with a background xfs_io process that will
# fsync the file after we deleted its hard link while it's open by
# xfs_io.
mkfifo $SCRATCH_MNT/fifo
tail -f $SCRATCH_MNT/fifo | \
$XFS_IO_PROG $SCRATCH_MNT/foo >>$seqres.full &
XFS_IO_PID=$!
# Give some time for the xfs_io process to open a file descriptor for
# the file.
sleep 1
# Now while the file is open by the xfs_io process, delete its only
# hard link.
rm -f $SCRATCH_MNT/foo
# Now that it has no more hard links, make the xfs_io process fsync it.
echo "fsync" > $SCRATCH_MNT/fifo
# Terminate the xfs_io process so that we can unmount.
echo "quit" > $SCRATCH_MNT/fifo
wait $XFS_IO_PID
unset XFS_IO_PID
# Simulate a power failure and then mount again the filesystem to
# replay the journal/log.
_flakey_drop_and_remount
# We don't expect the file to exist anymore, since it was fsynced when
# it had no more hard links.
[ -f $SCRATCH_MNT/foo ] && echo "file foo still exists"
_unmount_flakey
# success, all done
echo "Silence is golden"
status=0
exit
A test case for fstests will be submitted soon.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's a pointless label as we don't have to do anything under it other
than return from the function. So remove it and directly return from the
function where we used to goto.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no point in checking if the inode is a directory as
ctx->log_new_dentries is only set in case we are logging a directory down
the call chain of btrfs_log_inode(). So remove that check making the logic
more simple and while at it add a comment about why use a local variable
to track if we later need to log new dentries.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we don't need to log new directory dentries, there's no point in having
an else branch just to set 'ret' to zero, as it's already zero because
every time it gets a non-zero value we jump into one of the exit labels.
So remove it, which reduces source code size and the module text size.
Before this change:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
1813855 163737 16920 1994512 1e6f10 fs/btrfs/btrfs.ko
After this change:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
1813807 163737 16920 1994464 1e6ee0 fs/btrfs/btrfs.ko
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of using memcmp(), which requires copying both file extent items
from each extent buffer into a local buffer, use memcmp_extent_buffer() so
that we only need to copy one of the file extent items and directly use
the extent buffer of the other file extent item for the comparison.
This reduces code size, saves one memory copy and reduces stack usage.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function is exclusively used for log replay since commit
3eb4234424 ("btrfs: remove outdated logic from overwrite_item() and add
assertion"), so update the comment so that it doesn't say it can be used
for logging. Also some minor rewording for clarity and while at it
reformat the affected text so that it fits closer to the 80 characters
limit for comments.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of referring to path->nodes[0] and path->slots[0] multiple times,
which is verbose and confusing since we have an 'eb' and 'slot' variables
as well, introduce local variables 'dst_eb' to point to path->nodes[0] and
'dst_slot' to have path->slots[0], reducing verbosity and making it more
obvious about which extent buffer and slot we are referring to.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no need to allocate memory and copy from both the destination and
source extent buffers to compare if the items are equal, we can instead
use memcmp_extent_buffer() which allows to do only one memory allocation
and copy instead of two.
So use memcmp_extent_buffer() instead of memcmp(), allowing us to avoid
one memory allocation, which can fail or be slow while under memory heavy
pressure, avoid the memory copying and reducing code.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's an internal function and most of the time the callers are doing a lot
of BTRFS_I() calls on the returned VFS inode to get the btrfs inode, so
change the return type to struct btrfs_inode instead.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
fixup_inode_link_count() mostly wants to use a btrfs_inode, plus it's an
internal function so it should take btrfs_inode instead of a VFS inode.
Change the argument type to btrfs_inode, avoiding several BTRFS_I() calls
too.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All callers of read_one_inode() are mostly interested in the btrfs_inode
structure rather than the VFS inode, so make read_one_inode() return
the btrfs_inode instead, avoiding lots of BTRFS_I() calls.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All callers of btrfs_iget_logging() are interested in the btrfs_inode
structure rather than the VFS inode, so make btrfs_iget_logging() return
the btrfs_inode instead, avoiding lots of BTRFS_I() calls.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The btrfs_key is defined as objectid/type/offset and the keys are also
printed like that. For better readability, update all key
initializations to match this order.
Signed-off-by: David Sterba <dsterba@suse.com>
We have several places explicitly calling btrfs_mark_buffer_dirty() but
that is not necessarily since the target leaf came from a path that was
obtained for a btree search function that modifies the btree, something
like btrfs_insert_empty_item() or anything else that ends up calling
btrfs_search_slot() with a value of 1 for its 'cow' argument.
These just make the code more verbose, confusing and add a little extra
overhead and well as increase the module's text size, so remove them.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function read_alloc_one_name() does not initialize the name field of
the passed fscrypt_str struct if kmalloc fails to allocate the
corresponding buffer. Thus, it is not guaranteed that
fscrypt_str.name is initialized when freeing it.
This is a follow-up to the linked patch that fixes the remaining
instances of the bug introduced by commit e43eec81c5 ("btrfs: use
struct qstr instead of name and namelen pairs").
Link: https://lore.kernel.org/linux-btrfs/20241009080833.1355894-1-jroi.martin@gmail.com/
Fixes: e43eec81c5 ("btrfs: use struct qstr instead of name and namelen pairs")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Roi Martin <jroi.martin@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The add_inode_ref() function does not initialize the "name" struct when
it is declared. If any of the following calls to "read_one_inode()
returns NULL,
dir = read_one_inode(root, parent_objectid);
if (!dir) {
ret = -ENOENT;
goto out;
}
inode = read_one_inode(root, inode_objectid);
if (!inode) {
ret = -EIO;
goto out;
}
then "name.name" would be freed on "out" before being initialized.
out:
...
kfree(name.name);
This issue was reported by Coverity with CID 1526744.
Fixes: e43eec81c5 ("btrfs: use struct qstr instead of name and namelen pairs")
CC: stable@vger.kernel.org # 6.6+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Roi Martin <jroi.martin@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have a few places that check if we have the inode locked by doing:
ASSERT(inode_is_locked(vfs_inode));
This actually proved to be useful several times as if assertions are
enabled (and by default they are in many distros) it immediately triggers
a crash which is impossible for users to miss.
However that doesn't check if the lock is held by the calling task, so
the check passes if some other task locked the inode.
Using one of the lockdep functions to check the lock is held, like
lockdep_assert_held() for example, does check that the calling task
holds the lock, and if that's not the case it produces a warning and
stack trace in dmesg. However, despite the misleading "assert" in the
name of the lockdep helpers, it does not trigger a crash/BUG_ON(), just
a warning and splat in dmesg, which is easy to get unnoticed by users
who may have lockdep enabled.
So add a helper that does the ASSERT() and calls lockdep_assert_held()
immediately after and use it every where we check the inode is locked.
Like this if the lock is held by some other task we get the warning
in dmesg which is caught by fstests, very helpful during development,
and may also be occassionaly noticed by users with lockdep enabled.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's pointless to pass a super block argument to btrfs_iget() because we
always pass a root and from it we can get the super block through:
root->fs_info->sb
So remove the super block argument.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As of commit 1b53e51a4a ("btrfs: don't commit transaction for every
subvol create") we started to make any fsync after creating a subvolume
to fallback to a transaction commit if the fsync is performed in the
same transaction that was used to create the subvolume. This happens
with the following at ioctl.c:create_subvol():
$ cat fs/btrfs/ioctl.c
(...)
/* Tree log can't currently deal with an inode which is a new root. */
btrfs_set_log_full_commit(trans);
(...)
Note that the comment is misleading as the problem is not that fsync can
not deal with the root inode of a new root, but that we can not log any
inode that belongs to a root that was not yet persisted because that would
make log replay fail since the root doesn't exist at log replay time.
The above simply makes any fsync fallback to a full transaction commit if
it happens in the same transaction used to create the subvolume - even if
it's an inode that belongs to any other subvolume. This is a brute force
solution and it doesn't necessarily improve performance for every workload
out there - it just moves a full transaction commit from one place, the
subvolume creation, to another - an fsync for any inode.
Just improve on this by making the fallback to a transaction commit only
for an fsync against an inode of the new subvolume, or for the directory
that contains the dentry that points to the new subvolume (in case anyone
attempts to fsync the directory in the same transaction).
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The member extent_map::block_start can be calculated from
extent_map::disk_bytenr + extent_map::offset for regular extents.
And otherwise just extent_map::disk_bytenr.
And this is already validated by the validate_extent_map(). Now we can
remove the member.
However there is a special case in btrfs_create_dio_extent() where we
for NOCOW/PREALLOC ordered extents cannot directly use the resulting
btrfs_file_extent, as btrfs_split_ordered_extent() cannot handle them
yet.
So for that call site, we pass file_extent->disk_bytenr +
file_extent->num_bytes as disk_bytenr for the ordered extent, and 0 for
offset.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The extent_map::block_len is either extent_map::len (non-compressed
extent) or extent_map::disk_num_bytes (compressed extent).
Since we already have sanity checks to do the cross-checks between the
new and old members, we can drop the old extent_map::block_len now.
For most call sites, they can manually select extent_map::len or
extent_map::disk_num_bytes, since most if not all of them have checked
if the extent is compressed.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since we have extent_map::offset, the old extent_map::orig_start is just
extent_map::start - extent_map::offset for non-hole/inline extents.
And since the new extent_map::offset is already verified by
validate_extent_map() while the old orig_start is not, let's just remove
the old member from all call sites.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This would make it very obvious that the member just matches
btrfs_file_extent_item::disk_num_bytes.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of using a inode pointer, use a btrfs_inode pointer in the log
context structure, as this is generally what we need and allows for some
internal APIs to take a btrfs_inode instead, making them more consistent
with most of the code base. This will later allow to help to remove a lot
of BTRFS_I() calls in btrfs_sync_file().
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently struct btrfs_inode has a key member, named "location", that is
either:
1) The key of the inode's item. In this case the objectid is the number
of the inode;
2) A key stored in a dir entry with a type of BTRFS_ROOT_ITEM_KEY, for
the case where we have a root that is a snapshot of a subvolume that
points to other subvolumes. In this case the objectid is the ID of
a subvolume inside the snapshotted parent subvolume.
The key is only used to lookup the inode item for the first case, while
for the second it's never used since it corresponds to directory stubs
created with new_simple_dir() and which are marked as dummy, so there's
no actual inode item to ever update. In the second case we only check
the key type at btrfs_ino() for 32 bits platforms and its objectid is
only needed for unlink.
Instead of using a key we can do fine with just the objectid, since we
can generate the key whenever we need it having only the objectid, as
in all use cases the type is always BTRFS_INODE_ITEM_KEY and the offset
is always 0.
So use only an objectid instead of a full key. This reduces the size of
struct btrfs_inode from 1048 bytes down to 1040 bytes on a release kernel.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The index_cnt field of struct btrfs_inode is used only for two purposes:
1) To store the index for the next entry added to a directory;
2) For the data relocation inode to track the logical start address of the
block group currently being relocated.
For the relocation case we use index_cnt because it's not used for
anything else in the relocation use case - we could have used other fields
that are not used by relocation such as defrag_bytes, last_unlink_trans
or last_reflink_trans for example (among others).
Since the csum_bytes field is not used for directories, do the following
changes:
1) Put index_cnt and csum_bytes in a union, and index_cnt is only
initialized when the inode is a directory. The csum_bytes is only
accessed in IO paths for regular files, so we're fine here;
2) Use the defrag_bytes field for relocation, since the data relocation
inode is never used for defrag purposes. And to make the naming better,
alias it to reloc_block_group_start by using a union.
This reduces the size of struct btrfs_inode by 8 bytes in a release
kernel, from 1056 bytes down to 1048 bytes.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have been seeing crashes on duplicate keys in
btrfs_set_item_key_safe():
BTRFS critical (device vdb): slot 4 key (450 108 8192) new key (450 108 8192)
------------[ cut here ]------------
kernel BUG at fs/btrfs/ctree.c:2620!
invalid opcode: 0000 [#1] PREEMPT SMP PTI
CPU: 0 PID: 3139 Comm: xfs_io Kdump: loaded Not tainted 6.9.0 #6
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
RIP: 0010:btrfs_set_item_key_safe+0x11f/0x290 [btrfs]
With the following stack trace:
#0 btrfs_set_item_key_safe (fs/btrfs/ctree.c:2620:4)
#1 btrfs_drop_extents (fs/btrfs/file.c:411:4)
#2 log_one_extent (fs/btrfs/tree-log.c:4732:9)
#3 btrfs_log_changed_extents (fs/btrfs/tree-log.c:4955:9)
#4 btrfs_log_inode (fs/btrfs/tree-log.c:6626:9)
#5 btrfs_log_inode_parent (fs/btrfs/tree-log.c:7070:8)
#6 btrfs_log_dentry_safe (fs/btrfs/tree-log.c:7171:8)
#7 btrfs_sync_file (fs/btrfs/file.c:1933:8)
#8 vfs_fsync_range (fs/sync.c:188:9)
#9 vfs_fsync (fs/sync.c:202:9)
#10 do_fsync (fs/sync.c:212:9)
#11 __do_sys_fdatasync (fs/sync.c:225:9)
#12 __se_sys_fdatasync (fs/sync.c:223:1)
#13 __x64_sys_fdatasync (fs/sync.c:223:1)
#14 do_syscall_x64 (arch/x86/entry/common.c:52:14)
#15 do_syscall_64 (arch/x86/entry/common.c:83:7)
#16 entry_SYSCALL_64+0xaf/0x14c (arch/x86/entry/entry_64.S:121)
So we're logging a changed extent from fsync, which is splitting an
extent in the log tree. But this split part already exists in the tree,
triggering the BUG().
This is the state of the log tree at the time of the crash, dumped with
drgn (https://github.com/osandov/drgn/blob/main/contrib/btrfs_tree.py)
to get more details than btrfs_print_leaf() gives us:
>>> print_extent_buffer(prog.crashed_thread().stack_trace()[0]["eb"])
leaf 33439744 level 0 items 72 generation 9 owner 18446744073709551610
leaf 33439744 flags 0x100000000000000
fs uuid e5bd3946-400c-4223-8923-190ef1f18677
chunk uuid d58cb17e-6d02-494a-829a-18b7d8a399da
item 0 key (450 INODE_ITEM 0) itemoff 16123 itemsize 160
generation 7 transid 9 size 8192 nbytes 8473563889606862198
block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
sequence 204 flags 0x10(PREALLOC)
atime 1716417703.220000000 (2024-05-22 15:41:43)
ctime 1716417704.983333333 (2024-05-22 15:41:44)
mtime 1716417704.983333333 (2024-05-22 15:41:44)
otime 17592186044416.000000000 (559444-03-08 01:40:16)
item 1 key (450 INODE_REF 256) itemoff 16110 itemsize 13
index 195 namelen 3 name: 193
item 2 key (450 XATTR_ITEM 1640047104) itemoff 16073 itemsize 37
location key (0 UNKNOWN.0 0) type XATTR
transid 7 data_len 1 name_len 6
name: user.a
data a
item 3 key (450 EXTENT_DATA 0) itemoff 16020 itemsize 53
generation 9 type 1 (regular)
extent data disk byte 303144960 nr 12288
extent data offset 0 nr 4096 ram 12288
extent compression 0 (none)
item 4 key (450 EXTENT_DATA 4096) itemoff 15967 itemsize 53
generation 9 type 2 (prealloc)
prealloc data disk byte 303144960 nr 12288
prealloc data offset 4096 nr 8192
item 5 key (450 EXTENT_DATA 8192) itemoff 15914 itemsize 53
generation 9 type 2 (prealloc)
prealloc data disk byte 303144960 nr 12288
prealloc data offset 8192 nr 4096
...
So the real problem happened earlier: notice that items 4 (4k-12k) and 5
(8k-12k) overlap. Both are prealloc extents. Item 4 straddles i_size and
item 5 starts at i_size.
Here is the state of the filesystem tree at the time of the crash:
>>> root = prog.crashed_thread().stack_trace()[2]["inode"].root
>>> ret, nodes, slots = btrfs_search_slot(root, BtrfsKey(450, 0, 0))
>>> print_extent_buffer(nodes[0])
leaf 30425088 level 0 items 184 generation 9 owner 5
leaf 30425088 flags 0x100000000000000
fs uuid e5bd3946-400c-4223-8923-190ef1f18677
chunk uuid d58cb17e-6d02-494a-829a-18b7d8a399da
...
item 179 key (450 INODE_ITEM 0) itemoff 4907 itemsize 160
generation 7 transid 7 size 4096 nbytes 12288
block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0
sequence 6 flags 0x10(PREALLOC)
atime 1716417703.220000000 (2024-05-22 15:41:43)
ctime 1716417703.220000000 (2024-05-22 15:41:43)
mtime 1716417703.220000000 (2024-05-22 15:41:43)
otime 1716417703.220000000 (2024-05-22 15:41:43)
item 180 key (450 INODE_REF 256) itemoff 4894 itemsize 13
index 195 namelen 3 name: 193
item 181 key (450 XATTR_ITEM 1640047104) itemoff 4857 itemsize 37
location key (0 UNKNOWN.0 0) type XATTR
transid 7 data_len 1 name_len 6
name: user.a
data a
item 182 key (450 EXTENT_DATA 0) itemoff 4804 itemsize 53
generation 9 type 1 (regular)
extent data disk byte 303144960 nr 12288
extent data offset 0 nr 8192 ram 12288
extent compression 0 (none)
item 183 key (450 EXTENT_DATA 8192) itemoff 4751 itemsize 53
generation 9 type 2 (prealloc)
prealloc data disk byte 303144960 nr 12288
prealloc data offset 8192 nr 4096
Item 5 in the log tree corresponds to item 183 in the filesystem tree,
but nothing matches item 4. Furthermore, item 183 is the last item in
the leaf.
btrfs_log_prealloc_extents() is responsible for logging prealloc extents
beyond i_size. It first truncates any previously logged prealloc extents
that start beyond i_size. Then, it walks the filesystem tree and copies
the prealloc extent items to the log tree.
If it hits the end of a leaf, then it calls btrfs_next_leaf(), which
unlocks the tree and does another search. However, while the filesystem
tree is unlocked, an ordered extent completion may modify the tree. In
particular, it may insert an extent item that overlaps with an extent
item that was already copied to the log tree.
This may manifest in several ways depending on the exact scenario,
including an EEXIST error that is silently translated to a full sync,
overlapping items in the log tree, or this crash. This particular crash
is triggered by the following sequence of events:
- Initially, the file has i_size=4k, a regular extent from 0-4k, and a
prealloc extent beyond i_size from 4k-12k. The prealloc extent item is
the last item in its B-tree leaf.
- The file is fsync'd, which copies its inode item and both extent items
to the log tree.
- An xattr is set on the file, which sets the
BTRFS_INODE_COPY_EVERYTHING flag.
- The range 4k-8k in the file is written using direct I/O. i_size is
extended to 8k, but the ordered extent is still in flight.
- The file is fsync'd. Since BTRFS_INODE_COPY_EVERYTHING is set, this
calls copy_inode_items_to_log(), which calls
btrfs_log_prealloc_extents().
- btrfs_log_prealloc_extents() finds the 4k-12k prealloc extent in the
filesystem tree. Since it starts before i_size, it skips it. Since it
is the last item in its B-tree leaf, it calls btrfs_next_leaf().
- btrfs_next_leaf() unlocks the path.
- The ordered extent completion runs, which converts the 4k-8k part of
the prealloc extent to written and inserts the remaining prealloc part
from 8k-12k.
- btrfs_next_leaf() does a search and finds the new prealloc extent
8k-12k.
- btrfs_log_prealloc_extents() copies the 8k-12k prealloc extent into
the log tree. Note that it overlaps with the 4k-12k prealloc extent
that was copied to the log tree by the first fsync.
- fsync calls btrfs_log_changed_extents(), which tries to log the 4k-8k
extent that was written.
- This tries to drop the range 4k-8k in the log tree, which requires
adjusting the start of the 4k-12k prealloc extent in the log tree to
8k.
- btrfs_set_item_key_safe() sees that there is already an extent
starting at 8k in the log tree and calls BUG().
Fix this by detecting when we're about to insert an overlapping file
extent item in the log tree and truncating the part that would overlap.
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Extent maps are always associated to an inode's extent map tree, so
there's no need to pass the extent map tree explicitly to
clear_em_logging().
In order to facilitate an upcoming change that adds a shrinker for extent
maps, change clear_em_logging() to receive the inode instead of its extent
map tree.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A comment from Filipe on one of my previous cleanups brought my
attention to a new helper we have for getting the root id of a root,
which makes it easier to read in the code.
The changes where made with the following Coccinelle semantic patch:
// <smpl>
@@
expression E,E1;
@@
(
E->root_key.objectid = E1
|
- E->root_key.objectid
+ btrfs_root_id(E)
)
// </smpl>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor style fixups ]
Signed-off-by: David Sterba <dsterba@suse.com>
We consistently use ->num_bytes everywhere through the delayed ref code,
except in btrfs_ref. Rename btrfs_ref to match all the other code.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have this in both btrfs_tree_ref and btrfs_data_ref, which is just
wasting space and making the code more complicated. Move this into
btrfs_ref proper and update all the call sites to do the assignment in
btrfs_ref.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_ref currently has ->owning_root, and ->ref_root is shared between
the tree ref and data ref, so in order to move that into btrfs_ref
proper I would need to add another root parameter to the initialization
function. This function has too many arguments, and adding another root
will make it easy to make mistakes about which root goes where.
Drop the generic ref init function and statically initialize the
btrfs_ref in every usage. This makes the code easier to read because we
can see what elements we're assigning, and will make the upcoming change
moving the ref_root into the btrfs_ref more clear and less error prone
than adding a new element to the initialization function.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Before deciding if we can do a NOCOW write into a range, one of the things
we have to do is check if there are checksum items for that range. We do
that through the btrfs_lookup_csums_list() function, which searches for
checksums and adds them to a list supplied by the caller.
But all we need is to check if there is any checksum, we don't need to
look for all of them and collect them into a list, which requires more
search time in the checksums tree, allocating memory for checksums items
to add to the list, copy checksums from a leaf into those list items,
then free that memory, etc. This is all unnecessary overhead, wasting
mostly CPU time, and perhaps some occasional IO if we need to read from
disk any extent buffers.
So change btrfs_lookup_csums_list() to allow to return immediately in
case it finds any checksum, without the need to add it to a list and read
it from a leaf. This is accomplished by allowing a NULL list parameter and
making the function return 1 if it found any checksum, 0 if it didn't
found any, and a negative value in case of an error.
The following test with fio was used to measure performance:
$ cat test.sh
#!/bin/bash
DEV=/dev/nullb0
MNT=/mnt/nullb0
cat <<EOF > /tmp/fio-job.ini
[global]
name=fio-rand-write
filename=$MNT/fio-rand-write
rw=randwrite
bssplit=4k/20:8k/20:16k/20:32k/20:64k/20
direct=1
numjobs=16
fallocate=posix
time_based
runtime=300
[file1]
size=8G
ioengine=io_uring
iodepth=16
EOF
umount $MNT &> /dev/null
mkfs.btrfs -f $DEV
mount -o ssd $DEV $MNT
fio /tmp/fio-job.ini
umount $MNT
The test was run on a release kernel (Debian's default kernel config).
The results before this patch:
WRITE: bw=139MiB/s (146MB/s), 8204KiB/s-9504KiB/s (8401kB/s-9732kB/s), io=17.0GiB (18.3GB), run=125317-125344msec
The results after this patch:
WRITE: bw=153MiB/s (160MB/s), 9241KiB/s-10.0MiB/s (9463kB/s-10.5MB/s), io=17.0GiB (18.3GB), run=114054-114071msec
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All the callers of btrfs_lookup_csums_list() pass a value of 0 as the
"search_commit" parameter. So remove it and make the function behave as
to always search from the regular root.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The mod_start and mod_len fields of struct extent_map were introduced by
commit 4e2f84e63d ("Btrfs: improve fsync by filtering extents that we
want") in order to avoid too low performance when fsyncing a file that
keeps getting extent maps merge, because it resulted in each fsync logging
again csum ranges that were already merged before.
We don't need this anymore as extent maps in the list of modified extents
are never merged with other extent maps and once we log an extent map we
remove it from the list of modified extent maps, so it's never logged
twice.
So remove the mod_start and mod_len fields from struct extent_map and use
instead the start and len fields when logging checksums in the fast fsync
path. This also makes EXTENT_FLAG_FILLING unused so remove it as well.
Running the reproducer from the commit mentioned before, with a larger
number of extents and against a null block device, so that IO is fast
and we can better see any impact from searching checksums items and
logging them, gave the following results from dd:
Before this change:
409600000 bytes (410 MB, 391 MiB) copied, 22.948 s, 17.8 MB/s
After this change:
409600000 bytes (410 MB, 391 MiB) copied, 22.9997 s, 17.8 MB/s
So no changes in throughput.
The test was done in a release kernel (non-debug, Debian's default kernel
config) and its steps are the following:
$ mkfs.btrfs -f /dev/nullb0
$ mount /dev/sdb /mnt
$ dd if=/dev/zero of=/mnt/foobar bs=4k count=100000 oflag=sync
$ umount /mnt
This also reduces the size of struct extent_map from 128 bytes down to 112
bytes, so now we can have 36 extents maps per 4K page instead of 32.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The helpers are doing an initialization or release work, none of which
is performance critical that it would require a static inline, so move
them to the .c file.
Signed-off-by: David Sterba <dsterba@suse.com>
When logging an inode and we require to copy items from subvolume leaves
to the log tree, we clone each subvolume leaf and than use that clone to
copy items to the log tree. This is required to avoid possible deadlocks
as stated in commit 796787c978 ("btrfs: do not modify log tree while
holding a leaf from fs tree locked").
The cloning requires allocating an extent buffer (struct extent_buffer)
and then allocating pages (folios) to attach to the extent buffer. This
may be slow in case we are under memory pressure, and since we are doing
the cloning while holding a read lock on a subvolume leaf, it means we
can be blocking other operations on that leaf for significant periods of
time, which can increase latency on operations like creating other files,
renaming files, etc. Similarly because we're under a log transaction, we
may also cause extra delay on other tasks doing an fsync, because syncing
the log requires waiting for tasks that joined a log transaction to exit
the transaction.
So to improve this, for any inode logging operation that needs to copy
items from a subvolume leaf ("full sync" or "copy everything" bit set
in the inode), preallocate a dummy extent buffer before locking any
extent buffer from the subvolume tree, and even before joining a log
transaction, add it to the log context and then use it when we need to
copy items from a subvolume leaf to the log tree. This avoids making
other operations get extra latency when waiting to lock a subvolume
leaf that is used during inode logging and we are under heavy memory
pressure.
The following test script with bonnie++ was used to test this:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdh
MNT=/mnt/sdh
MOUNT_OPTIONS="-o ssd"
MEMTOTAL_BYTES=`free -b | grep Mem: | awk '{ print $2 }'`
NR_DIRECTORIES=20
NR_FILES=20480
DATASET_SIZE=$((MEMTOTAL_BYTES * 2 / 1048576))
DIRECTORY_SIZE=$((MEMTOTAL_BYTES * 2 / NR_FILES))
NR_FILES=$((NR_FILES / 1024))
echo "performance" | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
umount $DEV &> /dev/null
mkfs.btrfs -f $MKFS_OPTIONS $DEV
mount $MOUNT_OPTIONS $DEV $MNT
bonnie++ -u root -d $MNT \
-n $NR_FILES:$DIRECTORY_SIZE:$DIRECTORY_SIZE:$NR_DIRECTORIES \
-r 0 -s $DATASET_SIZE -b
umount $MNT
The results of this test on a 8G VM running a non-debug kernel (Debian's
default kernel config), were the following.
Before this change:
Version 2.00a ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Name:Size etc /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
debian0 7501M 376k 99 1.4g 96 117m 14 1510k 99 2.5g 95 +++++ +++
Latency 35068us 24976us 2944ms 30725us 71770us 26152us
Version 2.00a ------Sequential Create------ --------Random Create--------
debian0 -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
20:384100:384100/20 20480 32 20480 58 20480 48 20480 39 20480 56 20480 61
Latency 411ms 11914us 119ms 617ms 10296us 110ms
After this change:
Version 2.00a ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Name:Size etc /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
debian0 7501M 375k 99 1.4g 97 117m 14 1546k 99 2.3g 98 +++++ +++
Latency 35975us 20945us 2144ms 10297us 2217us 6004us
Version 2.00a ------Sequential Create------ --------Random Create--------
debian0 -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files:max:min /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
20:384100:384100/20 20480 35 20480 58 20480 48 20480 40 20480 57 20480 59
Latency 320ms 11237us 77779us 518ms 6470us 86389us
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With help of neovim, LSP and clangd we can identify header files that
are not actually needed to be included in the .c files. This is focused
only on removal (with minor fixups), further cleanups are possible but
will require doing the header files properly with forward declarations,
minimized includes and include-what-you-use care.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, in struct extent_map, we use an unsigned int (32 bits) to
identify the compression type of an extent and an unsigned long (64 bits
on a 64 bits platform, 32 bits otherwise) for flags. We are only using
6 different flags, so an unsigned long is excessive and we can use flags
to identify the compression type instead of using a dedicated 32 bits
field.
We can easily have tens or hundreds of thousands (or more) of extent maps
on busy and large filesystems, specially with compression enabled or many
or large files with tons of small extents. So it's convenient to have the
extent_map structure as small as possible in order to use less memory.
So remove the compression type field from struct extent_map, use flags
to identify the compression type and shorten the flags field from an
unsigned long to a u32. This saves 8 bytes (on 64 bits platforms) and
reduces the size of the structure from 136 bytes down to 128 bytes, using
now only two cache lines, and increases the number of extent maps we can
have per 4K page from 30 to 32. By using a u32 for the flags instead of
an unsigned long, we no longer use test_bit(), set_bit() and clear_bit(),
but that level of atomicity is not needed as most flags are never cleared
once set (before adding an extent map to the tree), and the ones that can
be cleared or set after an extent map is added to the tree, are always
performed while holding the write lock on the extent map tree, while the
reader holds a lock on the tree or tests for a flag that never changes
once the extent map is in the tree (such as compression flags).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we're not clearing the dirty flag off of extent_buffers in zoned mode,
all that is left of btrfs_redirty_list_add() is a memzero() and some
ASSERT()ions.
As we're also memzero()ing the buffer on write-out btrfs_redirty_list_add()
has become obsolete and can be removed.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmU/xAEACgkQxWXV+ddt
WDvYKg//SjTimA5Nins9mb4jdz8n+dDeZnQhKzy3FqInU41EzDRc4WwnEODmDlTa
AyU9rGB3k0JNSUc075jZFCyLqq/ARiOqRi4x33Gk0ckIlc4X5OgBoqP2XkPh0VlP
txskLCrmhc3pwyR4ErlFDX2jebIUXfkv39bJuE40grGvUatRe+WNq0ERIrgO8RAr
Rc3hBotMH8AIqfD1L6j1ZiZIAyrOkT1BJMuqeoq27/gJZn/MRhM9TCrMTzfWGaoW
SxPrQiCDEN3KECsOY/caroMn3AekDijg/ley1Nf7Z0N6oEV+n4VWWPBFE9HhRz83
9fIdvSbGjSJF6ekzTjcVXPAbcuKZFzeqOdBRMIW3TIUo7mZQyJTVkMsc1y/NL2Z3
9DhlRLIzvWJJjt1CEK0u18n5IU+dGngdktbhWWIuIlo8r+G/iKR/7zqU92VfWLHL
Z7/eh6HgH5zr2bm+yKORbrUjkv4IVhGVarW8D4aM+MCG0lFN2GaPcJCCUrp4n7rZ
PzpQbxXa38ANBk6hsp4ndS8TJSBL9moY8tumzLcKg97nzNMV6KpBdV/G6/QfRLCN
3kM6UbwTAkMwGcQS86Mqx6s04ORLnQeD6f7N6X4Ppx0Mi/zkjI2HkRuvQGp12B0v
iZjCCZAYY2Iu+/TU0GrCXSss/grzIAUPzM9msyV3XGO/VBpwdec=
=9TVx
-----END PGP SIGNATURE-----
Merge tag 'for-6.7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"New features:
- raid-stripe-tree
New tree for logical file extent mapping where the physical mapping
may not match on multiple devices. This is now used in zoned mode
to implement RAID0/RAID1* profiles, but can be used in non-zoned
mode as well. The support for RAID56 is in development and will
eventually fix the problems with the current implementation. This
is a backward incompatible feature and has to be enabled at mkfs
time.
- simple quota accounting (squota)
A simplified mode of qgroup that accounts all space on the initial
extent owners (a subvolume), the snapshots are then cheap to create
and delete. The deletion of snapshots in fully accounting qgroups
is a known CPU/IO performance bottleneck.
The squota is not suitable for the general use case but works well
for containers where the original subvolume exists for the whole
time. This is a backward incompatible feature as it needs extending
some structures, but can be enabled on an existing filesystem.
- temporary filesystem fsid (temp_fsid)
The fsid identifies a filesystem and is hard coded in the
structures, which disallows mounting the same fsid found on
different devices.
For a single device filesystem this is not strictly necessary, a
new temporary fsid can be generated on mount e.g. after a device is
cloned. This will be used by Steam Deck for root partition A/B
testing, or can be used for VM root images.
Other user visible changes:
- filesystems with partially finished metadata_uuid conversion cannot
be mounted anymore and the uuid fixup has to be done by btrfs-progs
(btrfstune).
Performance improvements:
- reduce reservations for checksum deletions (with enabled free space
tree by factor of 4), on a sample workload on file with many
extents the deletion time decreased by 12%
- make extent state merges more efficient during insertions, reduce
rb-tree iterations (run time of critical functions reduced by 5%)
Core changes:
- the integrity check functionality has been removed, this was a
debugging feature and removal does not affect other integrity
checks like checksums or tree-checker
- space reservation changes:
- more efficient delayed ref reservations, this avoids building up
too much work or overusing or exhausting the global block
reserve in some situations
- move delayed refs reservation to the transaction start time,
this prevents some ENOSPC corner cases related to exhaustion of
global reserve
- improvements in reducing excessive reservations for block group
items
- adjust overcommit logic in near full situations, account for one
more chunk to eventually allocate metadata chunk, this is mostly
relevant for small filesystems (<10GiB)
- single device filesystems are scanned but not registered (except
seed devices), this allows temp_fsid to work
- qgroup iterations do not need GFP_ATOMIC allocations anymore
- cleanups, refactoring, reduced data structure size, function
parameter simplifications, error handling fixes"
* tag 'for-6.7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (156 commits)
btrfs: open code timespec64 in struct btrfs_inode
btrfs: remove redundant log root tree index assignment during log sync
btrfs: remove redundant initialization of variable dirty in btrfs_update_time()
btrfs: sysfs: show temp_fsid feature
btrfs: disable the device add feature for temp-fsid
btrfs: disable the seed feature for temp-fsid
btrfs: update comment for temp-fsid, fsid, and metadata_uuid
btrfs: remove pointless empty log context list check when syncing log
btrfs: update comment for struct btrfs_inode::lock
btrfs: remove pointless barrier from btrfs_sync_file()
btrfs: add and use helpers for reading and writing last_trans_committed
btrfs: add and use helpers for reading and writing fs_info->generation
btrfs: add and use helpers for reading and writing log_transid
btrfs: add and use helpers for reading and writing last_log_commit
btrfs: support cloned-device mount capability
btrfs: add helper function find_fsid_by_disk
btrfs: stop reserving excessive space for block group item insertions
btrfs: stop reserving excessive space for block group item updates
btrfs: reorder btrfs_inode to fill gaps
btrfs: open code btrfs_ordered_inode_tree in btrfs_inode
...
During log syncing, when we start updating the log root tree we compute
an index value, stored in variable 'index2', once we lock the log root
tree's mutex. This value depends on the log root's log_transid. And
shortly after we compute again the same value for 'index2' - the value
is exactly the same since we haven't released the mutex and therefore
the log_transid of the log root is the same as before.
This second 'index2' computation became pointless after commit
a93e01682e ("btrfs: remove no longer needed use of log_writers for the
log root tree"). So remove it.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When syncing the log, if we get an error when updating the log root, we
check first if the log root tree context is in a log context list, and if
so it deletes from the log root tree context from the list. This check
however is pointless because at this moment the context is always in a
list, he have just added it to a context list. The check became pointless
after commit a93e01682e ("btrfs: remove no longer needed use of
log_writers for the log root tree"). So remove this now pointless empty
list check.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently the log_transid field of a root is always modified while holding
the root's log_mutex locked. Most readers of a root's log_transid are also
holding the root's log_mutex locked, however there is one exception which
is btrfs_set_inode_last_trans() where we don't take the lock to avoid
blocking several operations if log syncing is happening in parallel.
Any races here should be harmless, and in the worst case they may cause a
fsync to log an inode when it's not really needed, so nothing bad from a
functional perspective.
To avoid data race warnings from tools like KCSAN and other issues such
as load and store tearing (amongst others, see [1]), create helpers to
access the log_transid field of a root using READ_ONCE() and WRITE_ONCE(),
and use these helpers where needed.
[1] https://lwn.net/Articles/793253/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, the last_log_commit of a root can be accessed concurrently
without any lock protection. Readers can be calling btrfs_inode_in_log()
early in a fsync call, which reads a root's last_log_commit, while a
writer can change the last_log_commit while a log tree if being synced,
at btrfs_sync_log(). Any races here should be harmless, and in the worst
case they may cause a fsync to log an inode when it's not really needed,
so nothing bad from a functional perspective.
To avoid data race warnings from tools like KCSAN and other issues such
as load and store tearing (amongst others, see [1]), create helpers to
access the last_log_commit field of a root using READ_ONCE() and
WRITE_ONCE(), and use these helpers everywhere.
[1] https://lwn.net/Articles/793253/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The structure btrfs_ordered_inode_tree is used only in one place, in
btrfs_inode. The structure itself has a 4 byte hole which is wasted
space.
Move the btrfs_ordered_inode_tree members to btrfs_inode with a common
prefix 'ordered_tree_' where the hole can be utilized and shrink inode
size.
Signed-off-by: David Sterba <dsterba@suse.com>
When freeing a log tree, during a transaction commit, we clear its dirty
log pages io tree by calling clear_extent_bits() using a range from 0 to
(u64)-1. This will iterate the io tree's rbtree and call rb_erase() on
each node before freeing it, which will often trigger rebalance operations
on the rbtree. A better alternative it to use extent_io_tree_release(),
which will not do deletions and trigger rebalances.
So use extent_io_tree_release() instead of clear_extent_bits().
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The root argument for fixup_inode_link_count() always matches the root of
the given inode, so remove the root argument and get it from the inode
argument. This also applies to the helpers count_inode_extrefs() and
count_inode_refs() used by fixup_inode_link_count() - they don't need the
root argument, as it always matches the root of the inode passed to them.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The root argument for btrfs_update_inode() always matches the root of the
given inode, so remove the root argument and get it from the inode
argument.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While data extents require us to store additional inline refs to track
the original owner on free, this information is available implicitly for
metadata. It is found in the owner field of the header of the tree
block. Even if other trees refer to this block and the original ref goes
away, we will not rewrite that header field, so it will reliably give the
original owner.
In addition, there is a relocation case where a new data extent needs to
have an owning root separate from the referring root wired through
delayed refs.
To use it for recording simple quota deltas, we need to wire this root
id through from when we create the delayed ref until we fully process
it. Store it in the generic btrfs_ref struct of the delayed ref.
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
When marking an extent buffer as dirty, at btrfs_mark_buffer_dirty(),
we check if its generation matches the running transaction and if not we
just print a warning. Such mismatch is an indicator that something really
went wrong and only printing a warning message (and stack trace) is not
enough to prevent a corruption. Allowing a transaction to commit with such
an extent buffer will trigger an error if we ever try to read it from disk
due to a generation mismatch with its parent generation.
So abort the current transaction with -EUCLEAN if we notice a generation
mismatch. For this we need to pass a transaction handle to
btrfs_mark_buffer_dirty() which is always available except in test code,
in which case we can pass NULL since it operates on dummy extent buffers
and all test roots have a single node/leaf (root node at level 0).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Both callers of btrfs_pin_extent_for_log_replay expand the parameters to
extent buffer members. We can simply pass the extent buffer instead.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is only one caller of btrfs_pin_reserved_extent that expands the
parameters to extent buffer members. We can simply pass the extent
buffer instead.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Function name in the comment does not bring much value to code not
exposed as API and we don't stick to the kdoc format anymore. Update
formatting of parameter descriptions.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmURvloACgkQxWXV+ddt
WDt+CQ/+NgBtQn7eyABsdHzXWPxpFyGZrdw5ldKnly3G+WDW2GKMaZ6CpDuEZGNQ
vMAkSGX5LIHXvO79pDnGG0i+bRINWrc5HZVZ/p5Da6wplBTgIPlbLmxaZX9MJLbx
j7Oz37GXiQJY8BxnVCnsb+bhhTrTbO9HFUQr/nxefIvu22OBdL1WXYcfuBOeEsFG
qr/aeC52YqCVgXvt+8a5DqAKE0NWc4PFMFUMo4vlf1xuL652fvff7xiup1CAIgBh
qsCa17E7q+qjri2phAhbFNadfpH5wGfyjTWScOlaFuXjRhW2v2oqz3WU5IQj4dmu
PI+k++PLUzIxT0IcjD1YbZzRFaEI6fR2W0GA4LK08fjVehh2ao5jOjtRgLl8HlqG
qC5fslAPzUxRmwMmCjSGfXF14sgtyLy8eVWf69xn06/1cbEmfHDrWNXP1QHuq6eT
Jqy8Ywia3jRzzfZ1utABJPLBW4hFQKkyobtyd67fxslUFmtuLvLqGTiOdmVFiD9K
o+BF2xjEz2n8O1+aRZk5SFNC9zcaASaRg/wQrhvSI9qxM18fh4TXgKQOniLzAK7v
lZc+JkegFW4CVquCUpmbsdZAOpVNRXfPOJIt/w6G+oRbaiTvPUnrH+uyq8IGREbw
E7d8XIP0qlF0DQBGK4Mw/riZz/e5MmEKNjza6M+fj2uglpfWTv4=
=6WEW
-----END PGP SIGNATURE-----
Merge tag 'for-6.6-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- delayed refs fixes:
- fix race when refilling delayed refs block reserve
- prevent transaction block reserve underflow when starting
transaction
- error message and value adjustments
- fix build warnings with CONFIG_CC_OPTIMIZE_FOR_SIZE and
-Wmaybe-uninitialized
- fix for smatch report where uninitialized data from invalid extent
buffer range could be returned to the caller
- fix numeric overflow in statfs when calculating lower threshold
for a full filesystem
* tag 'for-6.6-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: initialize start_slot in btrfs_log_prealloc_extents
btrfs: make sure to initialize start and len in find_free_dev_extent
btrfs: reset destination buffer when read_extent_buffer() gets invalid range
btrfs: properly report 0 avail for very full file systems
btrfs: log message if extent item not found when running delayed extent op
btrfs: remove redundant BUG_ON() from __btrfs_inc_extent_ref()
btrfs: return -EUCLEAN for delayed tree ref with a ref count not equals to 1
btrfs: prevent transaction block reserve underflow when starting transaction
btrfs: fix race when refilling delayed refs block reserve
Jens reported a compiler warning when using
CONFIG_CC_OPTIMIZE_FOR_SIZE=y that looks like this
fs/btrfs/tree-log.c: In function ‘btrfs_log_prealloc_extents’:
fs/btrfs/tree-log.c:4828:23: warning: ‘start_slot’ may be used
uninitialized [-Wmaybe-uninitialized]
4828 | ret = copy_items(trans, inode, dst_path, path,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
4829 | start_slot, ins_nr, 1, 0);
| ~~~~~~~~~~~~~~~~~~~~~~~~~
fs/btrfs/tree-log.c:4725:13: note: ‘start_slot’ was declared here
4725 | int start_slot;
| ^~~~~~~~~~
The compiler is incorrect, as we only use this code when ins_len > 0,
and when ins_len > 0 we have start_slot properly initialized. However
we generally find the -Wmaybe-uninitialized warnings valuable, so
initialize start_slot to get rid of the warning.
Reported-by: Jens Axboe <axboe@kernel.dk>
Tested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmTskOwACgkQxWXV+ddt
WDsNJw/8CCi41Z7e3LdJsQd2iy3/+oJZUvIGuT5YvshYxTLCbV7AL+diBPnSQs4Q
/KFMGL7RZBgJzwVoSQtXnESXXgX8VOVfN1zY//k5g6z7BscCEQd73H/M0B8ciZy/
aBygm9tJ7EtWbGZWNR8yad8YtOgl6xoClrPnJK/DCLwMGPy2o+fnKP3Y9FOKY5KM
1Sl0Y4FlJ9dTJpxIwYbx4xmuyHrh2OivjU/KnS9SzQlHu0nl6zsIAE45eKem2/EG
1figY5aFBYPpPYfopbLDalEBR3bQGiViZVJuNEop3AimdcMOXw9jBF3EZYUb5Tgn
MleMDgmmjLGOE/txGhvTxKj9kci2aGX+fJn3jXbcIMksAA0OQFLPqzGvEQcrs6Ok
HA0RsmAkS5fWNDCuuo4ZPXEyUPvluTQizkwyoulOfnK+UPJCWaRqbEBMTsvm6M6X
wFT2czwLpaEU/W6loIZkISUhfbRqVoA3DfHy398QXNzRhSrg8fQJjma1f7mrHvTi
CzU+OD5YSC2nXktVOnklyTr0XT+7HF69cumlDbr8TS8u1qu8n1keU/7M3MBB4xZk
BZFJDz8pnsAqpwVA4T434E/w45MDnYlwBw5r+U8Xjyso8xlau+sYXKcim85vT2Q0
yx/L91P6tdekR1y97p4aDdxw/PgTzdkNGMnsTBMVzgtCj+5pMmE=
=N7Yn
-----END PGP SIGNATURE-----
Merge tag 'for-6.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"No new features, the bulk of the changes are fixes, refactoring and
cleanups. The notable fix is the scrub performance restoration after
rewrite in 6.4, though still only partial.
Fixes:
- scrub performance drop due to rewrite in 6.4 partially restored:
- do IO grouping by blg_plug/blk_unplug again
- avoid unnecessary tree searches when processing stripes, in
extent and checksum trees
- the drop is noticeable on fast PCIe devices, -66% and restored
to -33% of the original
- backports to 6.4 planned
- handle more corner cases of transaction commit during orphan
cleanup or delayed ref processing
- use correct fsid/metadata_uuid when validating super block
- copy directory permissions and time when creating a stub subvolume
Core:
- debugging feature integrity checker deprecated, to be removed in
6.7
- in zoned mode, zones are activated just before the write, making
error handling easier, now the overcommit mechanism can be enabled
again which improves performance by avoiding more frequent flushing
- v0 extent handling completely removed, deprecated long time ago
- error handling improvements
- tests:
- extent buffer bitmap tests
- pinned extent splitting tests
- cleanups and refactoring:
- compression writeback
- extent buffer bitmap
- space flushing, ENOSPC handling"
* tag 'for-6.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (110 commits)
btrfs: zoned: skip splitting and logical rewriting on pre-alloc write
btrfs: tests: test invalid splitting when skipping pinned drop extent_map
btrfs: tests: add a test for btrfs_add_extent_mapping
btrfs: tests: add extent_map tests for dropping with odd layouts
btrfs: scrub: move write back of repaired sectors to scrub_stripe_read_repair_worker()
btrfs: scrub: don't go ordered workqueue for dev-replace
btrfs: scrub: fix grouping of read IO
btrfs: scrub: avoid unnecessary csum tree search preparing stripes
btrfs: scrub: avoid unnecessary extent tree search preparing stripes
btrfs: copy dir permission and time when creating a stub subvolume
btrfs: remove pointless empty list check when reading delayed dir indexes
btrfs: drop redundant check to use fs_devices::metadata_uuid
btrfs: compare the correct fsid/metadata_uuid in btrfs_validate_super
btrfs: use the correct superblock to compare fsid in btrfs_validate_super
btrfs: simplify memcpy either of metadata_uuid or fsid
btrfs: add a helper to read the superblock metadata_uuid
btrfs: remove v0 extent handling
btrfs: output extra debug info if we failed to find an inline backref
btrfs: move the !zoned assert into run_delalloc_cow
btrfs: consolidate the error handling in run_delalloc_nocow
...
Use LIST_HEAD() to initialize the list_head instead of open-coding it.
Signed-off-by: Ruan Jinjie <ruanjinjie@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The variables leaf and slot are initialized when declared but the values
assigned to them are never read as they are being re-assigned later on.
The initializations are redundant and can be removed. Cleans up clang
scan build warnings:
fs/btrfs/tree-log.c:6797:25: warning: Value stored to 'leaf' during its
initialization is never read [deadcode.DeadStores]
fs/btrfs/tree-log.c:6798:7: warning: Value stored to 'slot' during its
initialization is never read [deadcode.DeadStores]
It's been there since b8aa330d2a ("Btrfs: improve performance on fsync
of files with multiple hardlinks") without any usage so it's safe to be
removed.
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In later patches, we're going to change how the inode's ctime field is
used. Switch to using accessor functions instead of raw accesses of
inode->i_ctime.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-Id: <20230705190309.579783-27-jlayton@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
When dropping inode items from a log tree at drop_inode_items(), we this
BUG_ON() on the result of btrfs_search_slot() because we don't expect an
exact match since having a key with an offset of (u64)-1 is unexpected.
That is generally true, but for dir index keys for example, we can get a
key with such an offset value, even though it's very unlikely and it would
take ages to increase the sequence counter for a dir index up to (u64)-1.
We can deal with an exact match, we just have to delete the key at that
slot, so there is really no need to BUG_ON(), error out or trigger any
warning. So remove the BUG_ON().
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_ordered_sum::bytendr stores a logical address. Make that clear by
renaming it to ->logical.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The for_rename argument of btrfs_record_unlink_dir() is defined as an
integer, but the argument is in fact used as a boolean. So change it to
a boolean to make its use more clear.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no point of having a label and goto at btrfs_record_unlink_dir()
because the function is trivial and can just return early if we are not
in a rename context. So remove the label and goto and instead return
early if we are not in a rename.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Update the comments at btrfs_record_unlink_dir() so that they mention
where new names are logged and where old names are removed. Also, while
at it make the width of the comments closer to 80 columns and capitalize
the sentences and finish them with punctuation.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At btrfs_record_unlink_dir() we directly check the logged_trans field of
the given inodes to check if they were previously logged in the current
transaction, and if any of them were, then we can avoid setting the field
last_unlink_trans of the directory to the id of the current transaction if
we are in a rename path. Avoiding that can later prevent falling back to
a transaction commit if anyone attempts to log the directory.
However the logged_trans field, store in struct btrfs_inode, is transient,
not persisted in the inode item on its subvolume b+tree, so that means
that if an inode is evicted and then loaded again, its original value is
lost and it's reset to 0. So directly checking the logged_trans field can
lead to some false negative, and that only results in a performance impact
as mentioned before.
Instead of directly checking the logged_trans field of the inodes, use the
inode_logged() helper, which will check in the log tree if an inode was
logged before in case its logged_trans field has a value of 0. This way
we can avoid setting the directory inode's last_unlink_trans and cause
future logging attempts of it to fallback to transaction commits. The
following test script shows one example where this happens without this
patch:
$ cat test.sh
#!/bin/bash
DEV=/dev/nullb0
MNT=/mnt/nullb0
num_init_files=10000
num_new_files=10000
mkfs.btrfs -f $DEV
mount -o ssd $DEV $MNT
mkdir $MNT/testdir
for ((i = 1; i <= $num_init_files; i++)); do
echo -n > $MNT/testdir/file_$i
done
echo -n > $MNT/testdir/foo
sync
# Add some files so that there's more work in the transaction other
# than just renaming file foo.
for ((i = 1; i <= $num_new_files; i++)); do
echo -n > $MNT/testdir/new_file_$i
done
# Change the file, fsync it.
setfattr -n user.x1 -v 123 $MNT/testdir/foo
xfs_io -c "fsync" $MNT/testdir/foo
# Now triggger eviction of file foo but no eviction for our test
# directory, since it is being used by the process below. This will
# set logged_trans of the file's inode to 0 once it is loaded again.
(
cd $MNT/testdir
while true; do
:
done
) &
pid=$!
echo 2 > /proc/sys/vm/drop_caches
kill $pid
wait $pid
# Move foo out of our testdir. This will set last_unlink_trans
# of the directory inode to the current transaction, because
# logged_trans of both the directory and the file are set to 0.
mv $MNT/testdir/foo $MNT/foo
# Change file foo again and fsync it.
# This fsync will result in a transaction commit because the rename
# above has set last_unlink_trans of the parent directory to the id
# of the current transaction and because our inode for file foo has
# last_unlink_trans set to the current transaction, since it was
# evicted and reloaded and it was previously modified in the current
# transaction (the xattr addition).
xfs_io -c "pwrite 0 64K" $MNT/foo
start=$(date +%s%N)
xfs_io -c "fsync" $MNT/foo
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "file fsync took: $dur milliseconds"
umount $MNT
Before this patch: fsync took 19 milliseconds
After this patch: fsync took 5 milliseconds
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At need_log_inode() we directly check the ->logged_trans field of the
given inode to check if it was previously logged in the transaction, with
the goal of skipping logging the inode again when it's not necessary.
The ->logged_trans field in not persisted in the inode item or elsewhere,
it's only stored in memory (struct btrfs_inode), so it's transient and
lost once the inode is evicted and then loaded again. Once an inode is
loaded, we are conservative and set ->logged_trans to 0, which may mean
that either the inode was never logged in the current transaction or it
was logged but evicted before being loaded again.
Instead of checking the inode's ->logged_trans field directly, we can
use instead the helper inode_logged(), which will really check if the
inode was logged before in the current transaction in case we have a
->logged_trans field with a value of 0. This will prevent unnecessarily
logging an inode when it's not needed, and in some cases preventing a
transaction commit, in case the logging requires a fallback to a
transaction commit. The following test script shows a scenario where
due to eviction we fallback a transaction commit when trying to fsync
a file that was renamed:
$ cat test.sh
#!/bin/bash
DEV=/dev/nullb0
MNT=/mnt/nullb0
num_init_files=10000
num_new_files=10000
mkfs.btrfs -f $DEV
mount -o ssd $DEV $MNT
mkdir $MNT/testdir
for ((i = 1; i <= $num_init_files; i++)); do
echo -n > $MNT/testdir/file_$i
done
echo -n > $MNT/testdir/foo
sync
# Add some files so that there's more work in the transaction other
# than just renaming file foo.
for ((i = 1; i <= $num_new_files; i++)); do
echo -n > $MNT/testdir/new_file_$i
done
# Fsync the directory first.
xfs_io -c "fsync" $MNT/testdir
# Rename file foo.
mv $MNT/testdir/foo $MNT/testdir/bar
# Now trigger eviction of the test directory's inode.
# Once loaded again, it will have logged_trans set to 0 and
# last_unlink_trans set to the current transaction.
echo 2 > /proc/sys/vm/drop_caches
# Fsync file bar (ex-foo).
# Before the patch the fsync would result in a transaction commit
# because the inode for file bar has last_unlink_trans set to the
# current transaction, so it will attempt to log the parent directory
# as well, which will fallback to a full transaction commit because
# it also has its last_unlink_trans set to the current transaction,
# due to the inode eviction.
start=$(date +%s%N)
xfs_io -c "fsync" $MNT/testdir/bar
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "file fsync took: $dur milliseconds"
umount $MNT
Before this patch: fsync took 22 milliseconds
After this patch: fsync took 8 milliseconds
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This fixes the following warning reported by gcc 10.2.1 under x86_64:
../fs/btrfs/tree-log.c: In function ‘btrfs_log_inode’:
../fs/btrfs/tree-log.c:6211:9: error: ‘last_range_start’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
6211 | ret = insert_dir_log_key(trans, log, path, key.objectid,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
6212 | first_dir_index, last_dir_index);
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../fs/btrfs/tree-log.c:6161:6: note: ‘last_range_start’ was declared here
6161 | u64 last_range_start;
| ^~~~~~~~~~~~~~~~
This might be a false positive fixed in later compiler versions but we
want to have it fixed.
Reported-by: k2ci <kernel-bot@kylinos.cn>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Shida Zhang <zhangshida@kylinos.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When logging dir dentries of a directory, we iterate over the subvolume
tree to find dir index keys on leaves modified in the current transaction.
This however is heavy on locking, since btrfs_search_forward() may often
keep locks on extent buffers for quite a while when walking the tree to
find a suitable leaf modified in the current transaction and with a key
not smaller than then the provided minimum key. That means it will block
other tasks trying to access the subvolume tree, which may be common fs
operations like creating, renaming, linking, unlinking, reflinking files,
etc.
A better solution is to iterate the log tree, since it's much smaller than
a subvolume tree and just use plain btrfs_search_slot() (or the wrapper
btrfs_for_each_slot()) and only contains dir index keys added in the
current transaction.
The following bonnie++ test on a non-debug kernel (with Debian's default
kernel config) on a 20G null block device, was used to measure the impact:
$ cat test.sh
#!/bin/bash
DEV=/dev/nullb0
MNT=/mnt/nullb0
NR_DIRECTORIES=20
NR_FILES=20480 # must be a multiple of 1024
DATASET_SIZE=$(( (8 * 1024 * 1024 * 1024) / 1048576 )) # 8 GiB as megabytes
DIRECTORY_SIZE=$(( DATASET_SIZE / NR_FILES ))
NR_FILES=$(( NR_FILES / 1024 ))
umount $DEV &> /dev/null
mkfs.btrfs -f $DEV
mount $DEV $MNT
bonnie++ -u root -d $MNT \
-n $NR_FILES:$DIRECTORY_SIZE:$DIRECTORY_SIZE:$NR_DIRECTORIES \
-r 0 -s $DATASET_SIZE -b
umount $MNT
Before patchset:
Version 2.00a ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Name:Size etc /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
debian0 8G 376k 99 1.1g 98 939m 92 1527k 99 3.2g 99 9060 256
Latency 24920us 207us 680ms 5594us 171us 2891us
Version 2.00a ------Sequential Create------ --------Random Create--------
debian0 -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
20/20 20480 96 +++++ +++ 20480 95 20480 99 +++++ +++ 20480 97
Latency 8708us 137us 5128us 6743us 60us 19712us
After patchset:
Version 2.00a ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Name:Size etc /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
debian0 8G 384k 99 1.2g 99 971m 91 1533k 99 3.3g 99 9180 309
Latency 24930us 125us 661ms 5587us 46us 2020us
Version 2.00a ------Sequential Create------ --------Random Create--------
debian0 -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
20/20 20480 90 +++++ +++ 20480 99 20480 99 +++++ +++ 20480 97
Latency 7030us 61us 1246us 4942us 56us 16855us
The patchset consists of this patch plus a previous one that has the
following subject:
"btrfs: avoid iterating over all indexes when logging directory"
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When logging a directory, after copying all directory index items from the
subvolume tree to the log tree, we iterate over the subvolume tree to find
all dir index items that are located in leaves COWed (or created) in the
current transaction. If we keep logging a directory several times during
the same transaction, we end up iterating over the same dir index items
everytime we log the directory, wasting time and adding extra lock
contention on the subvolume tree.
So just keep track of the last logged dir index offset in order to start
the search for that index (+1) the next time the directory is logged, as
dir index values (key offsets) come from a monotonically increasing
counter.
The following test measures the difference before and after this change:
$ cat test.sh
#!/bin/bash
DEV=/dev/nullb0
MNT=/mnt/nullb0
umount $DEV &> /dev/null
mkfs.btrfs -f $DEV
mount -o ssd $DEV $MNT
# Time values in milliseconds.
declare -a fsync_times
# Total number of files added to the test directory.
num_files=1000000
# Fsync directory after every N files are added.
fsync_period=100
mkdir $MNT/testdir
fsync_total_time=0
for ((i = 1; i <= $num_files; i++)); do
echo -n > $MNT/testdir/file_$i
if [ $((i % fsync_period)) -eq 0 ]; then
start=$(date +%s%N)
xfs_io -c "fsync" $MNT/testdir
end=$(date +%s%N)
fsync_total_time=$((fsync_total_time + (end - start)))
fsync_times[i]=$(( (end - start) / 1000000 ))
echo -n -e "Progress $i / $num_files\r"
fi
done
echo -e "\nHistogram of directory fsync duration in ms:\n"
printf '%s\n' "${fsync_times[@]}" | \
perl -MStatistics::Histogram -e '@d = <>; print get_histogram(\@d);'
fsync_total_time=$((fsync_total_time / 1000000))
echo -e "\nTotal time spent in fsync: $fsync_total_time ms\n"
echo
umount $MNT
The test was run on a non-debug kernel (Debian's default kernel config)
against a 15G null block device.
Result before this change:
Histogram of directory fsync duration in ms:
Count: 10000
Range: 3.000 - 362.000; Mean: 34.556; Median: 31.000; Stddev: 25.751
Percentiles: 90th: 71.000; 95th: 77.000; 99th: 81.000
3.000 - 5.278: 1423 #################################
5.278 - 8.854: 1173 ###########################
8.854 - 14.467: 591 ##############
14.467 - 23.277: 1025 #######################
23.277 - 37.105: 1422 #################################
37.105 - 58.809: 2036 ###############################################
58.809 - 92.876: 2316 #####################################################
92.876 - 146.346: 6 |
146.346 - 230.271: 6 |
230.271 - 362.000: 2 |
Total time spent in fsync: 350527 ms
Result after this change:
Histogram of directory fsync duration in ms:
Count: 10000
Range: 3.000 - 1088.000; Mean: 8.704; Median: 8.000; Stddev: 12.576
Percentiles: 90th: 12.000; 95th: 14.000; 99th: 17.000
3.000 - 6.007: 3222 #################################
6.007 - 11.276: 5197 #####################################################
11.276 - 20.506: 1551 ################
20.506 - 36.674: 24 |
36.674 - 201.552: 1 |
201.552 - 353.841: 4 |
353.841 - 1088.000: 1 |
Total time spent in fsync: 92114 ms
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The tree-log code has three almost identical copies for the accounting on
an extent_buffer that doesn't need to be written any more. The only
difference is that walk_down_log_tree passed the bytenr used to find the
buffer instead of extent_buffer.start and calculates the length using the
nodesize, while the other two callers look at the extent_buffer.len
field that must always be equivalent to the nodesize.
Factor the code into a common helper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_bin_search() is a simple wrapper that searches for the whole slots
by calling btrfs_generic_bin_search() with the starting slot/first_slot
preset to 0.
This simple wrapper can be open coded as btrfs_bin_search().
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is used in the tree-log code and is a holdover from previous
iterations of extent buffer writeback. We can simply use
wait_on_extent_buffer_writeback here, and remove
btrfs_wait_tree_block_writeback completely as it's equivalent (waiting
on page write writeback).
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_clean_tree_block is a misnomer, it's just
clear_extent_buffer_dirty with some extra accounting around it. Rename
this to btrfs_clear_buffer_dirty to make it more clear it belongs with
it's setter, btrfs_mark_buffer_dirty.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we're passing in the trans into btrfs_clean_tree_block, we can
easily roll in the handling of the !trans case and replace all
occurrences of
if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->bflags))
clear_extent_buffer_dirty(eb);
with
btrfs_tree_lock(eb);
btrfs_clean_tree_block(eb);
btrfs_tree_unlock(eb);
We need the lock because if we are actually dirty we need to make sure
we aren't racing with anything that's starting writeout currently. This
also makes sure that we're accounting fs_info->dirty_metadata_bytes
appropriately.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We check the header generation in the extent buffer against the current
running transaction id to see if it's safe to clear DIRTY on this
buffer. Generally speaking if we're clearing the buffer dirty we're
holding the transaction open, but in the case of cleaning up an aborted
transaction we don't, so we have extra checks in that path to check the
transid. To allow for a future cleanup go ahead and pass in the trans
handle so we don't have to rely on ->running_transaction being set.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We currently use 'ret' and 'err' to track the return value for
log_dir_items(), which is confusing and likely the cause for previous
bugs where log_dir_items() did not return an error when it should, fixed
in previous patches.
So change this and use only a single variable, 'ret', to track the return
value. This is simpler and makes it similar to most of the existing code.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we use the value 1 for BTRFS_LOG_FORCE_COMMIT, but that value
has a few inconveniences:
1) If it's ever used by btrfs_log_inode(), or any function down the call
chain, we have to remember to btrfs_set_log_full_commit(), which is
repetitive and has a chance to be forgotten in future use cases.
btrfs_log_inode_parent() only calls btrfs_set_log_full_commit() when
it gets a negative value from btrfs_log_inode();
2) Down the call chain of btrfs_log_inode(), we may have functions that
need to force a log commit, but can return either an error (negative
value), false (0) or true (1). So they are forced to return some
random negative to force a log commit - using BTRFS_LOG_FORCE_COMMIT
would make the intention more clear. Currently the only example is
flush_dir_items_batch().
So turn BTRFS_LOG_FORCE_COMMIT into a negative value. The chosen value
is -(MAX_ERRNO + 1), so that it does not overlap any errno value and makes
it easier to debug.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When logging a directory, we always set the inode's last_dir_index_offset
to the offset of the last dir index item we found. This is using an extra
field in the log context structure, and it makes more sense to update it
only after we insert dir index items, and we could directly update the
inode's last_dir_index_offset field instead.
So make this simpler by updating the inode's last_dir_index_offset only
when we actually insert dir index keys in the log tree, and getting rid
of the last_dir_item_offset field in the log context structure.
Reported-by: David Arendt <admin@prnet.org>
Link: https://lore.kernel.org/linux-btrfs/ae169fc6-f504-28f0-a098-6fa6a4dfb612@leemhuis.info/
Reported-by: Maxim Mikityanskiy <maxtram95@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/Y8voyTXdnPDz8xwY@mail.gmail.com/
Reported-by: Hunter Wardlaw <wardlawhunter@gmail.com>
Link: https://bugzilla.suse.com/show_bug.cgi?id=1207231
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=216851
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When syncing a log, if we fail to update a log root in the log root tree,
we are aborting the transaction if the failure was not -ENOSPC. This is
excessive because there is a chance that a transaction commit can succeed,
and therefore avoid to turn the filesystem into RO mode. All we need to be
careful about is to mark the log for a full commit, which we already do,
to make sure no one commits a super block pointing to an outdated log root
tree.
So don't abort the transaction if we fail to update a log root in the log
root tree, and log an error if the failure is not -ENOSPC, so that it does
not go completely unnoticed.
CC: stable@vger.kernel.org # 6.0+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When syncing the log, if we fail to write log tree extent buffers, we mark
the log for a full commit and abort the transaction. However we don't need
to abort the transaction, all we really need to do is to make sure no one
can commit a superblock pointing to new log tree roots. Just because we
got a failure writing extent buffers for a log tree, it does not mean we
will also fail to do a transaction commit.
One particular case is if due to a bug somewhere, when writing log tree
extent buffers, the tree checker detects some corruption and the writeout
fails because of that. Aborting the transaction can be very disruptive for
a user, specially if the issue happened on a root filesystem. One example
is the scenario in the Link tag below, where an isolated corruption on log
tree leaves was causing transaction aborts when syncing the log.
Link: https://lore.kernel.org/linux-btrfs/ae169fc6-f504-28f0-a098-6fa6a4dfb612@leemhuis.info/
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When logging conflicting inodes, if we reach the maximum limit of inodes,
we return BTRFS_LOG_FORCE_COMMIT to force a transaction commit. However
we don't mark the log for full commit (with btrfs_set_log_full_commit()),
which means that once we leave the log transaction and before we commit
the transaction, some other task may sync the log, which is incomplete
as we have not logged all conflicting inodes, leading to some inconsistent
in case that log ends up being replayed.
So also call btrfs_set_log_full_commit() at add_conflicting_inode().
Fixes: e09d94c9e4 ("btrfs: log conflicting inodes without holding log mutex of the initial inode")
CC: stable@vger.kernel.org # 6.1
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Sometimes we log a directory without holding its VFS lock, so while we
logging it, dir index entries may be added or removed. This typically
happens when logging a dentry from a parent directory that points to a
new directory, through log_new_dir_dentries(), or when while logging
some other inode we also need to log its parent directories (through
btrfs_log_all_parents()).
This means that while we are at log_dir_items(), we may not find a dir
index key we found before, because it was deleted in the meanwhile, so
a call to btrfs_search_slot() may return 1 (key not found). In that case
we return from log_dir_items() with a success value (the variable 'err'
has a value of 0). This can lead to a few problems, specially in the case
where the variable 'last_offset' has a value of (u64)-1 (and it's
initialized to that when it was declared):
1) By returning from log_dir_items() with success (0) and a value of
(u64)-1 for '*last_offset_ret', we end up not logging any other dir
index keys that follow the missing, just deleted, index key. The
(u64)-1 value makes log_directory_changes() not call log_dir_items()
again;
2) Before returning with success (0), log_dir_items(), will log a dir
index range item covering a range from the last old dentry index
(stored in the variable 'last_old_dentry_offset') to the value of
'last_offset'. If 'last_offset' has a value of (u64)-1, then it means
if the log is persisted and replayed after a power failure, it will
cause deletion of all the directory entries that have an index number
between last_old_dentry_offset + 1 and (u64)-1;
3) We can end up returning from log_dir_items() with
ctx->last_dir_item_offset having a lower value than
inode->last_dir_index_offset, because the former is set to the current
key we are processing at process_dir_items_leaf(), and at the end of
log_directory_changes() we set inode->last_dir_index_offset to the
current value of ctx->last_dir_item_offset. So if for example a
deletion of a lower dir index key happened, we set
ctx->last_dir_item_offset to that index value, then if we return from
log_dir_items() because btrfs_search_slot() returned 1, we end up
returning from log_dir_items() with success (0) and then
log_directory_changes() sets inode->last_dir_index_offset to a lower
value than it had before.
This can result in unpredictable and unexpected behaviour when we
need to log again the directory in the same transaction, and can result
in ending up with a log tree leaf that has duplicated keys, as we do
batch insertions of dir index keys into a log tree.
So fix this by making log_dir_items() move on to the next dir index key
if it does not find the one it was looking for.
Reported-by: David Arendt <admin@prnet.org>
Link: https://lore.kernel.org/linux-btrfs/ae169fc6-f504-28f0-a098-6fa6a4dfb612@leemhuis.info/
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When logging a directory, at log_dir_items(), if we get an error when
attempting to search the subvolume tree for a dir index item, we end up
returning 0 (success) from log_dir_items() because 'err' is left with a
value of 0.
This can lead to a few problems, specially in the case the variable
'last_offset' has a value of (u64)-1 (and it's initialized to that when
it was declared):
1) By returning from log_dir_items() with success (0) and a value of
(u64)-1 for '*last_offset_ret', we end up not logging any other dir
index keys that follow the missing, just deleted, index key. The
(u64)-1 value makes log_directory_changes() not call log_dir_items()
again;
2) Before returning with success (0), log_dir_items(), will log a dir
index range item covering a range from the last old dentry index
(stored in the variable 'last_old_dentry_offset') to the value of
'last_offset'. If 'last_offset' has a value of (u64)-1, then it means
if the log is persisted and replayed after a power failure, it will
cause deletion of all the directory entries that have an index number
between last_old_dentry_offset + 1 and (u64)-1;
3) We can end up returning from log_dir_items() with
ctx->last_dir_item_offset having a lower value than
inode->last_dir_index_offset, because the former is set to the current
key we are processing at process_dir_items_leaf(), and at the end of
log_directory_changes() we set inode->last_dir_index_offset to the
current value of ctx->last_dir_item_offset. So if for example a
deletion of a lower dir index key happened, we set
ctx->last_dir_item_offset to that index value, then if we return from
log_dir_items() because btrfs_search_slot() returned an error, we end up
returning without any error from log_dir_items() and then
log_directory_changes() sets inode->last_dir_index_offset to a lower
value than it had before.
This can result in unpredictable and unexpected behaviour when we
need to log again the directory in the same transaction, and can result
in ending up with a log tree leaf that has duplicated keys, as we do
batch insertions of dir index keys into a log tree.
Fix this by setting 'err' to the value of 'ret' in case
btrfs_search_slot() or btrfs_previous_item() returned an error. That will
result in falling back to a full transaction commit.
Reported-by: David Arendt <admin@prnet.org>
Link: https://lore.kernel.org/linux-btrfs/ae169fc6-f504-28f0-a098-6fa6a4dfb612@leemhuis.info/
Fixes: e02119d5a7 ("Btrfs: Add a write ahead tree log to optimize synchronous operations")
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When logging a new name, we don't expect to fail joining a log transaction
since we know at least one of the inodes was logged before in the current
transaction. However if we fail for some unexpected reason, we end up not
freeing the fscrypt name we previously allocated. So fix that by freeing
the name in case we failed to join a log transaction.
Fixes: ab3c5c18e8 ("btrfs: setup qstr from dentrys using fscrypt helper")
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As of commit 193df62457 ("btrfs: search for last logged dir index if
it's not cached in the inode"), the overwrite_item() function is always
called for a root that is from a fs/subvolume tree. In other words, now
it's only used during log replay to modify a fs/subvolume tree. Therefore
we can remove the logic that checks if we are dealing with a log tree at
overwrite_item().
So remove that logic, replacing it with an assertion and document that if
we ever need to support a log root there, we will need to clone the leaf
from the fs/subvolume tree and then release it before modifying the log
tree, which is needed to avoid a potential deadlock, similar to the one
recently fixed by a patch with the subject:
"btrfs: do not modify log tree while holding a leaf from fs tree locked"
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
After commit 193df62457 ("btrfs: search for last logged dir index if
it's not cached in the inode"), there are no more callers of
do_overwrite_item(), except overwrite_item().
Originally both used to be the same function, but were split in
commit 086dcbfa50 ("btrfs: insert items in batches when logging a
directory when possible"), as there was the need to execute all logic
of overwrite_item() but skip the tree search, since in the context of
directory logging we already had a path with a leaf to copy data from.
So unify them again as there is no more need to have them split.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The code used by btrfs_submit_bio only interacts with the rest of
volumes.c through __btrfs_map_block (which itself is a more generic
version of two exported helpers) and does not really have anything
to do with volumes.c. Create a new bio.c file and a bio.h header
going along with it for the btrfs_bio-based storage layer, which
will grow even more going forward.
Also update the file with my copyright notice given that a large
part of the moved code was written or rewritten by me.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Although we have an existing function, btrfs_lookup_csums_range(), to
find all data checksums for a range, it's based on a btrfs_ordered_sum
list.
For the incoming RAID56 data checksum verification at RMW time, we don't
want to waste time by allocating temporary memory.
So this patch will introduce a new helper, btrfs_lookup_csums_bitmap().
It will use bitmap based result, which will be a perfect fit for later
RAID56 usage.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are several different tree block parentness check parameters used
across several helpers:
- level
Mandatory
- transid
Under most cases it's mandatory, but there are several backref cases
which skips this check.
- owner_root
- first_key
Utilized by most top-down tree search routine. Otherwise can be
skipped.
Those four members are not always mandatory checks, and some of them are
the same u64, which means if some arguments got swapped compiler will
not catch it.
Furthermore if we're going to further expand the parentness check, we
need to modify quite some helpers just to add one more parameter.
This patch will concentrate all these members into a structure called
btrfs_tree_parent_check, and pass that structure for the following
helpers:
- btrfs_read_extent_buffer()
- read_tree_block()
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function is for internal interfaces so we should use the
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move these out of ctree.h into orphan.h to cut down on code in ctree.h.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move these out of ctree.h into file.h to cut down on code in ctree.h.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move these prototypes out of ctree.h and into file-item.h.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move these prototypes out of ctree.h and into their own header file.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Update, reformat or reword function comments. This also removes the kdoc
marker so we don't get reports when the function name is missing.
Changes made:
- remove kdoc markers
- reformat the brief description to be a proper sentence
- reword to imperative voice
- align parameter list
- fix typos
Signed-off-by: David Sterba <dsterba@suse.com>
Move all the root-tree.c prototypes to root-tree.h, and then update all
the necessary files to include the new header.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move all the extent tree related prototypes to extent-tree.h out of
ctree.h, and then go include it everywhere needed so everything
compiles.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For directories with encrypted files/filenames, we need to store a flag
indicating this fact. There's no room in other fields, so we'll need to
borrow a bit from dir_type. Since it's now a combination of type and
flags, we rename it to dir_flags to reflect its new usage.
The new flag, FT_ENCRYPTED, indicates a directory containing encrypted
data, which is orthogonal to file type; therefore, add the new
flag, and make conversion from directory type to file type strip the
flag.
As the file types almost never change we can afford to use the bits.
Actual usage will be guarded behind an incompat bit, this patch only
adds the support for later use by fscrypt.
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While struct qstr is more natural without fscrypt, since it's provided
by dentries, struct fscrypt_str is provided by the fscrypt handlers
processing dentries, and is thus more natural in the fscrypt world.
Replace all of the struct qstr uses with struct fscrypt_str.
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Most places where we get a struct qstr, we are doing so from a dentry.
With fscrypt, the dentry's name may be encrypted on-disk, so fscrypt
provides a helper to convert a dentry name to the appropriate disk name
if necessary. Convert each of the dentry name accesses to use
fscrypt_setup_filename(), then convert the resulting fscrypt_name back
to an unencrypted qstr. This does not work for nokey names, but the
specific locations that could spawn nokey names are noted.
At present, since there are no encrypted directories, nothing goes down
the filename encryption paths.
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Many functions throughout btrfs take name buffer and name length
arguments. Most of these functions at the highest level are usually
called with these arguments extracted from a supplied dentry's name.
But the entire name can be passed instead, making each function a little
more elegant.
Each function whose arguments are currently the name and length
extracted from a dentry is herein converted to instead take a pointer to
the name in the dentry. The couple of calls to these calls without a
struct dentry are converted to create an appropriate qstr to pass in.
Additionally, every function which is only called with a name/len
extracted directly from a qstr is also converted.
This change has positive effect on stack consumption, frame of many
functions is reduced but this will be used in the future for fscrypt
related structures.
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All callers pass GFP_NOFS, we can drop the parameter and use it
directly.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is specific to the item-accessor code, move it out of ctree.h into
accessor.h/.c and then update the users to include the new header file.
This un-inlines btrfs_init_map_token, however this is only called once
per function so it's not critical to be inlined. This also saves 904
bytes of code on a release build.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have several fs wide related helpers in ctree.h. The bulk of these
are the incompat flag test helpers, but there are things such as
btrfs_fs_closing() and the read only helpers that also aren't directly
related to the ctree code. Move these into a fs.h header, which will
serve as the location for file system wide related helpers.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When logging an inode in full mode, or when logging xattrs or when logging
the dir index items of a directory, we are modifying the log tree while
holding a read lock on a leaf from the fs/subvolume tree. This can lead to
a deadlock in rare circumstances, but it is a real possibility, and it was
recently reported by syzbot with the following trace from lockdep:
WARNING: possible circular locking dependency detected
6.1.0-rc5-next-20221116-syzkaller #0 Not tainted
------------------------------------------------------
syz-executor.1/16154 is trying to acquire lock:
ffff88807e3084a0 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0xa1/0xf30 fs/btrfs/delayed-inode.c:256
but task is already holding lock:
ffff88807df33078 (btrfs-log-00){++++}-{3:3}, at: __btrfs_tree_lock+0x32/0x3d0 fs/btrfs/locking.c:197
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 (btrfs-log-00){++++}-{3:3}:
down_read_nested+0x9e/0x450 kernel/locking/rwsem.c:1634
__btrfs_tree_read_lock+0x32/0x350 fs/btrfs/locking.c:135
btrfs_tree_read_lock fs/btrfs/locking.c:141 [inline]
btrfs_read_lock_root_node+0x82/0x3a0 fs/btrfs/locking.c:280
btrfs_search_slot_get_root fs/btrfs/ctree.c:1678 [inline]
btrfs_search_slot+0x3ca/0x2c70 fs/btrfs/ctree.c:1998
btrfs_lookup_csum+0x116/0x3f0 fs/btrfs/file-item.c:209
btrfs_csum_file_blocks+0x40e/0x1370 fs/btrfs/file-item.c:1021
log_csums.isra.0+0x244/0x2d0 fs/btrfs/tree-log.c:4258
copy_items.isra.0+0xbfb/0xed0 fs/btrfs/tree-log.c:4403
copy_inode_items_to_log+0x13d6/0x1d90 fs/btrfs/tree-log.c:5873
btrfs_log_inode+0xb19/0x4680 fs/btrfs/tree-log.c:6495
btrfs_log_inode_parent+0x890/0x2a20 fs/btrfs/tree-log.c:6982
btrfs_log_dentry_safe+0x59/0x80 fs/btrfs/tree-log.c:7083
btrfs_sync_file+0xa41/0x13c0 fs/btrfs/file.c:1921
vfs_fsync_range+0x13e/0x230 fs/sync.c:188
generic_write_sync include/linux/fs.h:2856 [inline]
iomap_dio_complete+0x73a/0x920 fs/iomap/direct-io.c:128
btrfs_direct_write fs/btrfs/file.c:1536 [inline]
btrfs_do_write_iter+0xba2/0x1470 fs/btrfs/file.c:1668
call_write_iter include/linux/fs.h:2160 [inline]
do_iter_readv_writev+0x20b/0x3b0 fs/read_write.c:735
do_iter_write+0x182/0x700 fs/read_write.c:861
vfs_iter_write+0x74/0xa0 fs/read_write.c:902
iter_file_splice_write+0x745/0xc90 fs/splice.c:686
do_splice_from fs/splice.c:764 [inline]
direct_splice_actor+0x114/0x180 fs/splice.c:931
splice_direct_to_actor+0x335/0x8a0 fs/splice.c:886
do_splice_direct+0x1ab/0x280 fs/splice.c:974
do_sendfile+0xb19/0x1270 fs/read_write.c:1255
__do_sys_sendfile64 fs/read_write.c:1323 [inline]
__se_sys_sendfile64 fs/read_write.c:1309 [inline]
__x64_sys_sendfile64+0x259/0x2c0 fs/read_write.c:1309
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
-> #1 (btrfs-tree-00){++++}-{3:3}:
__lock_release kernel/locking/lockdep.c:5382 [inline]
lock_release+0x371/0x810 kernel/locking/lockdep.c:5688
up_write+0x2a/0x520 kernel/locking/rwsem.c:1614
btrfs_tree_unlock_rw fs/btrfs/locking.h:189 [inline]
btrfs_unlock_up_safe+0x1e3/0x290 fs/btrfs/locking.c:238
search_leaf fs/btrfs/ctree.c:1832 [inline]
btrfs_search_slot+0x265e/0x2c70 fs/btrfs/ctree.c:2074
btrfs_insert_empty_items+0xbd/0x1c0 fs/btrfs/ctree.c:4133
btrfs_insert_delayed_item+0x826/0xfa0 fs/btrfs/delayed-inode.c:746
btrfs_insert_delayed_items fs/btrfs/delayed-inode.c:824 [inline]
__btrfs_commit_inode_delayed_items fs/btrfs/delayed-inode.c:1111 [inline]
__btrfs_run_delayed_items+0x280/0x590 fs/btrfs/delayed-inode.c:1153
flush_space+0x147/0xe90 fs/btrfs/space-info.c:728
btrfs_async_reclaim_metadata_space+0x541/0xc10 fs/btrfs/space-info.c:1086
process_one_work+0x9bf/0x1710 kernel/workqueue.c:2289
worker_thread+0x669/0x1090 kernel/workqueue.c:2436
kthread+0x2e8/0x3a0 kernel/kthread.c:376
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308
-> #0 (&delayed_node->mutex){+.+.}-{3:3}:
check_prev_add kernel/locking/lockdep.c:3097 [inline]
check_prevs_add kernel/locking/lockdep.c:3216 [inline]
validate_chain kernel/locking/lockdep.c:3831 [inline]
__lock_acquire+0x2a43/0x56d0 kernel/locking/lockdep.c:5055
lock_acquire kernel/locking/lockdep.c:5668 [inline]
lock_acquire+0x1e3/0x630 kernel/locking/lockdep.c:5633
__mutex_lock_common kernel/locking/mutex.c:603 [inline]
__mutex_lock+0x12f/0x1360 kernel/locking/mutex.c:747
__btrfs_release_delayed_node.part.0+0xa1/0xf30 fs/btrfs/delayed-inode.c:256
__btrfs_release_delayed_node fs/btrfs/delayed-inode.c:251 [inline]
btrfs_release_delayed_node fs/btrfs/delayed-inode.c:281 [inline]
btrfs_remove_delayed_node+0x52/0x60 fs/btrfs/delayed-inode.c:1285
btrfs_evict_inode+0x511/0xf30 fs/btrfs/inode.c:5554
evict+0x2ed/0x6b0 fs/inode.c:664
dispose_list+0x117/0x1e0 fs/inode.c:697
prune_icache_sb+0xeb/0x150 fs/inode.c:896
super_cache_scan+0x391/0x590 fs/super.c:106
do_shrink_slab+0x464/0xce0 mm/vmscan.c:843
shrink_slab_memcg mm/vmscan.c:912 [inline]
shrink_slab+0x388/0x660 mm/vmscan.c:991
shrink_node_memcgs mm/vmscan.c:6088 [inline]
shrink_node+0x93d/0x1f30 mm/vmscan.c:6117
shrink_zones mm/vmscan.c:6355 [inline]
do_try_to_free_pages+0x3b4/0x17a0 mm/vmscan.c:6417
try_to_free_mem_cgroup_pages+0x3a4/0xa70 mm/vmscan.c:6732
reclaim_high.constprop.0+0x182/0x230 mm/memcontrol.c:2393
mem_cgroup_handle_over_high+0x190/0x520 mm/memcontrol.c:2578
try_charge_memcg+0xe0c/0x12f0 mm/memcontrol.c:2816
try_charge mm/memcontrol.c:2827 [inline]
charge_memcg+0x90/0x3b0 mm/memcontrol.c:6889
__mem_cgroup_charge+0x2b/0x90 mm/memcontrol.c:6910
mem_cgroup_charge include/linux/memcontrol.h:667 [inline]
__filemap_add_folio+0x615/0xf80 mm/filemap.c:852
filemap_add_folio+0xaf/0x1e0 mm/filemap.c:934
__filemap_get_folio+0x389/0xd80 mm/filemap.c:1976
pagecache_get_page+0x2e/0x280 mm/folio-compat.c:104
find_or_create_page include/linux/pagemap.h:612 [inline]
alloc_extent_buffer+0x2b9/0x1580 fs/btrfs/extent_io.c:4588
btrfs_init_new_buffer fs/btrfs/extent-tree.c:4869 [inline]
btrfs_alloc_tree_block+0x2e1/0x1320 fs/btrfs/extent-tree.c:4988
__btrfs_cow_block+0x3b2/0x1420 fs/btrfs/ctree.c:440
btrfs_cow_block+0x2fa/0x950 fs/btrfs/ctree.c:595
btrfs_search_slot+0x11b0/0x2c70 fs/btrfs/ctree.c:2038
btrfs_update_root+0xdb/0x630 fs/btrfs/root-tree.c:137
update_log_root fs/btrfs/tree-log.c:2841 [inline]
btrfs_sync_log+0xbfb/0x2870 fs/btrfs/tree-log.c:3064
btrfs_sync_file+0xdb9/0x13c0 fs/btrfs/file.c:1947
vfs_fsync_range+0x13e/0x230 fs/sync.c:188
generic_write_sync include/linux/fs.h:2856 [inline]
iomap_dio_complete+0x73a/0x920 fs/iomap/direct-io.c:128
btrfs_direct_write fs/btrfs/file.c:1536 [inline]
btrfs_do_write_iter+0xba2/0x1470 fs/btrfs/file.c:1668
call_write_iter include/linux/fs.h:2160 [inline]
do_iter_readv_writev+0x20b/0x3b0 fs/read_write.c:735
do_iter_write+0x182/0x700 fs/read_write.c:861
vfs_iter_write+0x74/0xa0 fs/read_write.c:902
iter_file_splice_write+0x745/0xc90 fs/splice.c:686
do_splice_from fs/splice.c:764 [inline]
direct_splice_actor+0x114/0x180 fs/splice.c:931
splice_direct_to_actor+0x335/0x8a0 fs/splice.c:886
do_splice_direct+0x1ab/0x280 fs/splice.c:974
do_sendfile+0xb19/0x1270 fs/read_write.c:1255
__do_sys_sendfile64 fs/read_write.c:1323 [inline]
__se_sys_sendfile64 fs/read_write.c:1309 [inline]
__x64_sys_sendfile64+0x259/0x2c0 fs/read_write.c:1309
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
other info that might help us debug this:
Chain exists of:
&delayed_node->mutex --> btrfs-tree-00 --> btrfs-log-00
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(btrfs-log-00);
lock(btrfs-tree-00);
lock(btrfs-log-00);
lock(&delayed_node->mutex);
Holding a read lock on a leaf from a fs/subvolume tree creates a nasty
lock dependency when we are COWing extent buffers for the log tree and we
have two tasks modifying the log tree, with each one in one of the
following 2 scenarios:
1) Modifying the log tree triggers an extent buffer allocation while
holding a write lock on a parent extent buffer from the log tree.
Allocating the pages for an extent buffer, or the extent buffer
struct, can trigger inode eviction and finally the inode eviction
will trigger a release/remove of a delayed node, which requires
taking the delayed node's mutex;
2) Allocating a metadata extent for a log tree can trigger the async
reclaim thread and make us wait for it to release enough space and
unblock our reservation ticket. The reclaim thread can start flushing
delayed items, and that in turn results in the need to lock delayed
node mutexes and in the need to write lock extent buffers of a
subvolume tree - all this while holding a write lock on the parent
extent buffer in the log tree.
So one task in scenario 1) running in parallel with another task in
scenario 2) could lead to a deadlock, one wanting to lock a delayed node
mutex while having a read lock on a leaf from the subvolume, while the
other is holding the delayed node's mutex and wants to write lock the same
subvolume leaf for flushing delayed items.
Fix this by cloning the leaf of the fs/subvolume tree, release/unlock the
fs/subvolume leaf and use the clone leaf instead.
Reported-by: syzbot+9b7c21f486f5e7f8d029@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/000000000000ccc93c05edc4d8cf@google.com/
CC: stable@vger.kernel.org # 6.0+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we have NOWAIT specified on our IOCB and we're writing into a
PREALLOC or NOCOW extent then we need to be able to tell
can_nocow_extent that we don't want to wait on any locks or metadata IO.
Fix can_nocow_extent to allow for NOWAIT.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Stefan Roesch <shr@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have two variants of lock/unlock extent, one set that takes a cached
state, another that does not. This is slightly annoying, and generally
speaking there are only a few places where we don't have a cached state.
Simplify this by making lock_extent/unlock_extent the only variant and
make it take a cached state, then convert all the callers appropriately.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During log replay, when adding/replacing inode references, there are two
special cases that have special code for them:
1) When we have an inode with two or more hardlinks in the same directory,
therefore two or more names encoded in the same inode reference item,
and one of the hard links gets renamed to the old name of another hard
link - that is, the index number for a name changes. This was added in
commit 0d836392ca ("Btrfs: fix mount failure after fsync due to
hard link recreation"), and is covered by test case generic/502 from
fstests;
2) When we have several inodes that got renamed to an old name of some
other inode, in a cascading style. The code to deal with this special
case was added in commit 6b5fc433a7 ("Btrfs: fix fsync after
succession of renames of different files"), and is covered by test
cases generic/526 and generic/527 from fstests.
Both cases can be deal with by making sure __add_inode_ref() is always
called by add_inode_ref() for every name encoded in the inode reference
item, and not just for the first name that has a conflict. With such
change we no longer need that special casing for the two cases mentioned
before. So do those changes.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When logging a directory we start by flushing all its delayed items.
That results in adding dir index items to the subvolume btree, for new
dentries, and removing dir index items from the subvolume btree for any
dentries that were deleted.
This makes it straightforward to log a directory simply by iterating over
all the modified subvolume btree leaves, especially when we used to log
both dir index keys and dir item keys (before commit 339d035424
("btrfs: only copy dir index keys when logging a directory") and when we
used to copy old dir index entries for leaves modified in the current
transaction (before commit 732d591a5d ("btrfs: stop copying old dir
items when logging a directory")).
From an efficiency point of view this has a couple of drawbacks:
1) Adds extra latency, due to copying delayed items to the subvolume btree
and deleting dir index items from the btree.
Further if there are other tasks accessing the btree, which is common
(syscalls like creat, mkdir, rename, link, unlink, truncate, reflinks,
etc, finishing an ordered extent, etc), lock contention can cause
further delays, both to the task logging a directory and to the other
tasks accessing the btree;
2) More time spent overall flushing delayed items, if after logging the
directory further changes are done to the directory in the same
transaction.
For example, if we add 10 dentries to a directory, fsync it, add more
10 dentries, fsync it again, then add more 10 dentries and fsync it
again, then we end up inserting 3 batches of 10 items to the subvolume
btree. With the changes from this patch, we flush all the delayed items
to the btree only once - a single batch of 30 items, and outside the
logging code (transaction commit or when delayed items are flushed
asynchronously).
This change simply skips the flushing of delayed items every time we log a
directory. Instead we copy the delayed insertion items directly to the log
tree and delete delayed deletion items directly from the log tree.
Therefore avoiding changing first the subvolume btree and then scanning it
for new items to copy from it to the log tree and detecting deletions
by observing gaps in consecutive dir index keys in subvolume btree leaves.
Running the following tests on a non-debug kernel (Debian's default kernel
config), on a box with a NVMe device, a 12 cores Intel CPU and 64G of ram,
produced the results below.
The results compare a branch without this patch and all the other patches
it depends on versus the same branch with the patchset applied.
The patchset is comprised of the following patches:
btrfs: don't drop dir index range items when logging a directory
btrfs: remove the root argument from log_new_dir_dentries()
btrfs: update stale comment for log_new_dir_dentries()
btrfs: free list element sooner at log_new_dir_dentries()
btrfs: avoid memory allocation at log_new_dir_dentries() for common case
btrfs: remove root argument from btrfs_delayed_item_reserve_metadata()
btrfs: store index number instead of key in struct btrfs_delayed_item
btrfs: remove unused logic when looking up delayed items
btrfs: shrink the size of struct btrfs_delayed_item
btrfs: search for last logged dir index if it's not cached in the inode
btrfs: move need_log_inode() to above log_conflicting_inodes()
btrfs: move log_new_dir_dentries() above btrfs_log_inode()
btrfs: log conflicting inodes without holding log mutex of the initial inode
btrfs: skip logging parent dir when conflicting inode is not a dir
btrfs: use delayed items when logging a directory
Custom test script for testing time spent at btrfs_log_inode():
#!/bin/bash
DEV=/dev/nvme0n1
MNT=/mnt/nvme0n1
# Total number of files to create in the test directory.
NUM_FILES=10000
# Fsync after creating or renaming N files.
FSYNC_AFTER=100
umount $DEV &> /dev/null
mkfs.btrfs -f $DEV
mount -o ssd $DEV $MNT
TEST_DIR=$MNT/testdir
mkdir $TEST_DIR
echo "Creating files..."
for ((i = 1; i <= $NUM_FILES; i++)); do
echo -n > $TEST_DIR/file_$i
if (( ($i % $FSYNC_AFTER) == 0 )); then
xfs_io -c "fsync" $TEST_DIR
fi
done
sync
echo "Renaming files..."
for ((i = 1; i <= $NUM_FILES; i++)); do
mv $TEST_DIR/file_$i $TEST_DIR/file_$i.renamed
if (( ($i % $FSYNC_AFTER) == 0 )); then
xfs_io -c "fsync" $TEST_DIR
fi
done
umount $MNT
And using the following bpftrace script to capture the total time that is
spent at btrfs_log_inode():
#!/usr/bin/bpftrace
k:btrfs_log_inode
{
@start_log_inode[tid] = nsecs;
}
kr:btrfs_log_inode
/@start_log_inode[tid]/
{
$dur = (nsecs - @start_log_inode[tid]) / 1000;
@btrfs_log_inode_total_time = sum($dur);
delete(@start_log_inode[tid]);
}
END
{
clear(@start_log_inode);
}
Result before applying patchset:
@btrfs_log_inode_total_time: 622642
Result after applying patchset:
@btrfs_log_inode_total_time: 354134 (-43.1% time spent)
The following dbench script was also used for testing:
#!/bin/bash
NUM_JOBS=$(nproc --all)
DEV=/dev/nvme0n1
MNT=/mnt/nvme0n1
MOUNT_OPTIONS="-o ssd"
MKFS_OPTIONS="-O no-holes -R free-space-tree"
echo "performance" | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
umount $DEV &> /dev/null
mkfs.btrfs -f $MKFS_OPTIONS $DEV
mount $MOUNT_OPTIONS $DEV $MNT
dbench -D $MNT --skip-cleanup -t 120 -S $NUM_JOBS
umount $MNT
Before patchset:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 3322265 0.034 21.032
Close 2440562 0.002 0.994
Rename 140664 1.150 269.633
Unlink 670796 1.093 269.678
Deltree 96 5.481 15.510
Mkdir 48 0.004 0.052
Qpathinfo 3010924 0.014 8.127
Qfileinfo 528055 0.001 0.518
Qfsinfo 552113 0.003 0.372
Sfileinfo 270575 0.005 0.688
Find 1164176 0.052 13.931
WriteX 1658537 0.019 5.918
ReadX 5207412 0.003 1.034
LockX 10818 0.003 0.079
UnlockX 10818 0.002 0.313
Flush 232811 1.027 269.735
Throughput 869.867 MB/sec (sync dirs) 12 clients 12 procs max_latency=269.741 ms
After patchset:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 4152738 0.029 20.863
Close 3050770 0.002 1.119
Rename 175829 0.871 211.741
Unlink 838447 0.845 211.724
Deltree 120 4.798 14.162
Mkdir 60 0.003 0.005
Qpathinfo 3763807 0.011 4.673
Qfileinfo 660111 0.001 0.400
Qfsinfo 690141 0.003 0.429
Sfileinfo 338260 0.005 0.725
Find 1455273 0.046 6.787
WriteX 2073307 0.017 5.690
ReadX 6509193 0.003 1.171
LockX 13522 0.003 0.077
UnlockX 13522 0.002 0.125
Flush 291044 0.811 211.631
Throughput 1089.27 MB/sec (sync dirs) 12 clients 12 procs max_latency=211.750 ms
(+25.2% throughput, -21.5% max latency)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we find a conflicting inode (an inode that had the same name and
parent directory as the inode we are logging now) that was deleted in the
current transaction, we always end up logging its parent directory.
This is to deal with the case where the conflicting inode corresponds to
a deleted subvolume/snapshot or a directory that had subvolumes/snapshots
(or some subdirectory inside it had subvolumes/snapshots, etc), because
we can't deal with dropping subvolumes/snapshots during log replay. So
if we log the parent directory, and if we are dealing with these special
cases, then we fallback to a transaction commit when logging the parent,
because its last_unlink_trans will match the current transaction (which
gets set and propagated when a subvolume/snapshot is deleted).
This change skips the logging of the parent directory when the conflicting
inode is not a directory (or a subvolume/snapshot). This is ok because in
this case logging the current inode is enough to trigger an unlink of the
conflicting inode during log replay.
So for a case like this:
$ mkdir /mnt/dir
$ echo -n "first foo data" > /mnt/dir/foo
$ sync
$ rm -f /mnt/dir/foo
$ echo -n "second foo data" > /mnt/dir/foo
$ xfs_io -c "fsync" /mnt/dir/foo
We avoid logging parent directory "dir" when logging the new file "foo".
In other cases it avoids falling back to a transaction commit, when the
parent directory has a last_unlink_trans value that matches the current
transaction, due to moving a file from it to some other directory.
This is a case that happens frequently with dbench for example, where a
new file that has the name/parent of another file that was deleted in the
current transaction, is fsynced.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When logging an inode, if we detect the inode has a reference that
conflicts with some other inode that got renamed, we log that other inode
while holding the log mutex of the current inode. We then find out if
there are other inodes that conflict with the first conflicting inode,
and log them while under the log mutex of the original inode. This is
fine because the recursion can only happen once.
For the upcoming work where we directly log delayed items without flushing
them first to the subvolume tree, this recursion adds a lot of complexity
and it's hard to keep lockdep happy about it.
So collect a list of conflicting inodes and then log the inodes after
unlocking the log mutex of the inode we started with.
Also limit the maximum number of conflict inodes we log to 10, to avoid
spending too much time logging (and maybe allocating too many list
elements too), as typically we don't have more than 1 or 2 conflicting
inodes - if we go over the limit, simply fallback to a transaction commit.
It is possible to have a very long list of conflicting inodes to be
intentionally created by a user if he/she creates a very long succession
of renames like this:
(...)
rename E to F
rename D to E
rename C to D
rename B to C
rename A to B
touch A (create a new file named A)
fsync A
If that happened for a sequence of hundreds or thousands of renames, it
could massively slow down the logging and cause other secondary effects
like for example blocking other fsync operations and transaction commits
for a very long time (assuming it wouldn't run into -ENOSPC or -ENOMEM
first). However such cases are very uncommon to happen in practice,
nevertheless it's better to be prepared for them and avoid chaos.
Such long sequence of conflicting inodes could be created before this
change.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The static function log_new_dir_dentries() is currently defined below
btrfs_log_inode(), but in an upcoming patch a new function is introduced
that is called by btrfs_log_inode() and this new function needs to call
log_new_dir_dentries(). So move log_new_dir_dentries() to a location
between btrfs_log_inode() and need_log_inode() (the later is called by
log_new_dir_dentries()).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The static function need_log_inode() is defined below btrfs_log_inode()
and log_conflicting_inodes(), but in the next patches in the series we
will need to call need_log_inode() in a couple new functions that will be
used by btrfs_log_inode(). So move its definition to a location above
log_conflicting_inodes().
Also make its arguments 'const', since they are not supposed to be
modified.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The key offset of the last dir index item that was logged is stored in
the inode's last_dir_index_offset field. However that field is not
persisted in the inode item or elsewhere, so if the inode gets evicted
and reloaded, it gets a value of (u64)-1, so that when we are logging
dir index items we check if they were logged before, to avoid attempts
to insert duplicated keys and fallback to a transaction commit.
Improve on this by searching for the last dir index that was logged when
we start logging a directory if the inode's last_dir_index_offset is not
set (has a value of (u64)-1) and it was logged before. This avoids
checking if each dir index item we find was already logged before, and
simplifies the logging of dir index items (process_dir_items_leaf()).
This will also be needed for an incoming change where we start logging
delayed items directly, without flushing them first.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At log_new_dir_dentries() we always start by allocating a list element
for the starting inode and then do a while loop with the condition being
a list emptiness check.
This however is not needed, we can avoid allocating this initial list
element and then just check for the list emptiness at the end of the
loop's body. So just do that to save one memory allocation from the
kmalloc-32 slab.
This allows for not doing any memory allocation when we don't have any
subdirectory to log, which is a very common case.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At log_new_dir_dentries(), there's no need to keep the current list
element allocated while processing the leaves with directory items for
the current directory, and while logging other inodes. Plus in case we
find a subdirectory, we also end up allocating a new list element while
the current one is still allocated, temporarily using more memory than
necessary.
So free the current list element early on, before processing leaves.
Also make the removal and release of all list elements in case of an
error more simple by eliminating the label and goto, adding an explicit
loop to release all list elements in case an error happens.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The comment refers to the function log_dir_items() in order to check why
the inodes of new directory entries need to be logged, but the relevant
comments are no longer at log_dir_items(), they were moved to the function
process_dir_items_leaf() in commit eb10d85ee7 ("btrfs: factor out the
copying loop of dir items from log_dir_items()"). So update it with the
current function name.
Also remove references with i_mutex to "VFS lock", since the inode lock
is no longer a mutex since 2016 (it's now a rw semaphore).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no point in passing a root argument to log_new_dir_dentries()
because it always corresponds to the root of the given inode. So remove
it and extract the root from the given inode.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When logging a directory that was previously logged in the current
transaction, we drop all the range items (BTRFS_DIR_LOG_INDEX_KEY key
type). This is because we will process all leaves in the subvolume's tree
that were changed in the current transaction and then add range items for
covering new dir index items and deleted dir index items, which could
cover now a larger range than before.
We used to fail if we tried to insert a range item key that already
exists, so we dropped all range items to avoid failing. However nowadays,
since commit 750ee45490 ("btrfs: fix assertion failure when logging
directory key range item"), we simply update any range item that already
exists, increasing its range's last dir index if needed. Since the range
covered by a range item can never decrease, due to the fact that dir index
values come from a monotonically increasing counter and are never reused,
we can stop dropping all range items before we start logging a directory.
By not dropping the items we can avoid having occasional tree rebalance
operations.
This will also be needed for an incoming change where we start logging
delayed items directly, without flushing them first.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_insert_file_extent() is only ever used to insert holes, so rename
it and remove the redundant parameters.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During log replay, at add_link(), we may increment the link count of
another inode that has a reference that conflicts with a new reference
for the inode currently being processed.
During log replay, at add_link(), we may drop (unlink) a reference from
some inode in the subvolume tree if that reference conflicts with a new
reference found in the log for the inode we are currently processing.
After the unlink, If the link count has decreased from 1 to 0, then we
increment the link count to prevent the inode from being deleted if it's
evicted by an iput() call, because we may have references to add to that
inode later on (and we will fixup its link count later during log replay).
However incrementing the link count from 0 to 1 triggers a warning:
$ cat fs/inode.c
(...)
void inc_nlink(struct inode *inode)
{
if (unlikely(inode->i_nlink == 0)) {
WARN_ON(!(inode->i_state & I_LINKABLE));
atomic_long_dec(&inode->i_sb->s_remove_count);
}
(...)
The I_LINKABLE flag is only set when creating an O_TMPFILE file, so it's
never set during log replay.
Most of the time, the warning isn't triggered even if we dropped the last
reference of the conflicting inode, and this is because:
1) The conflicting inode was previously marked for fixup, through a call
to link_to_fixup_dir(), which increments the inode's link count;
2) And the last iput() on the inode has not triggered eviction of the
inode, nor was eviction triggered after the iput(). So at add_link(),
even if we unlink the last reference of the inode, its link count ends
up being 1 and not 0.
So this means that if eviction is triggered after link_to_fixup_dir() is
called, at add_link() we will read the inode back from the subvolume tree
and have it with a correct link count, matching the number of references
it has on the subvolume tree. So if when we are at add_link() the inode
has exactly one reference only, its link count is 1, and after the unlink
its link count becomes 0.
So fix this by using set_nlink() instead of inc_nlink(), as the former
accepts a transition from 0 to 1 and it's what we use in other similar
contexts (like at link_to_fixup_dir().
Also make add_inode_ref() use set_nlink() instead of inc_nlink() to
bump the link count from 0 to 1.
The warning is actually harmless, but it may scare users. Josef also ran
into it recently.
CC: stable@vger.kernel.org # 5.1+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During log replay, when processing inode references, if we get an error
when looking up for an extended reference at __add_inode_ref(), we ignore
it and proceed, returning success (0) if no other error happens after the
lookup. This is obviously wrong because in case an extended reference
exists and it encodes some name not in the log, we need to unlink it,
otherwise the filesystem state will not match the state it had after the
last fsync.
So just make __add_inode_ref() return an error it gets from the extended
reference lookup.
Fixes: f186373fef ("btrfs: extended inode refs")
CC: stable@vger.kernel.org # 4.9+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When logging a new name, in case of a rename, we pin the log before
changing it. We then either delete a directory entry from the log or
insert a key range item to mark the old name for deletion on log replay.
However when doing one of those log changes we may have another task that
started writing out the log (at btrfs_sync_log()) and it started before
we pinned the log root. So we may end up changing a log tree while its
writeback is being started by another task syncing the log. This can lead
to inconsistencies in a log tree and other unexpected results during log
replay, because we can get some committed node pointing to a node/leaf
that ends up not getting written to disk before the next log commit.
The problem, conceptually, started to happen in commit 88d2beec7e
("btrfs: avoid logging all directory changes during renames"), because
there we started to update the log without joining its current transaction
first.
However the problem only became visible with commit 259c4b96d7
("btrfs: stop doing unnecessary log updates during a rename"), and that is
because we used to pin the log at btrfs_rename() and then before entering
btrfs_log_new_name(), when unlinking the old dentry, we ended up at
btrfs_del_inode_ref_in_log() and btrfs_del_dir_entries_in_log(). Both
of them join the current log transaction, effectively waiting for any log
transaction writeout (due to acquiring the root's log_mutex). This made it
safe even after leaving the current log transaction, because we remained
with the log pinned when we called btrfs_log_new_name().
Then in commit 259c4b96d7 ("btrfs: stop doing unnecessary log updates
during a rename"), we removed the log pinning from btrfs_rename() and
stopped calling btrfs_del_inode_ref_in_log() and
btrfs_del_dir_entries_in_log() during the rename, and started to do all
the needed work at btrfs_log_new_name(), but without joining the current
log transaction, only pinning the log, which is racy because another task
may have started writeout of the log tree right before we pinned the log.
Both commits landed in kernel 5.18, so it doesn't make any practical
difference which should be blamed, but I'm blaming the second commit only
because with the first one, by chance, the problem did not happen due to
the fact we joined the log transaction after pinning the log and unpinned
it only after calling btrfs_log_new_name().
So make btrfs_log_new_name() join the current log transaction instead of
pinning it, so that we never do log updates if it's writeout is starting.
Fixes: 259c4b96d7 ("btrfs: stop doing unnecessary log updates during a rename")
CC: stable@vger.kernel.org # 5.18+
Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Tested-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we will return 1 or -EAGAIN if we decide we need to commit
the transaction rather than sync the log. In practice this doesn't
really matter, we interpret any !0 and !BTRFS_NO_LOG_SYNC as needing to
commit the transaction. However this makes it hard to figure out what
the correct thing to do is.
Fix this up by defining BTRFS_LOG_FORCE_COMMIT and using this in all the
places where we want to force the transaction to be committed.
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
iput() already handles NULL and non-NULL parameter, so it is not needed
to check that. This unifies all iput calls.
Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Lv Ruyi <lv.ruyi@zte.com.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function btrfs_read_buffer() is useless, it just calls
btree_read_extent_buffer_pages() with exactly the same arguments.
So remove it and rename btree_read_extent_buffer_pages() to
btrfs_read_extent_buffer(), which is a shorter name, has the "btrfs_"
prefix (since it's used outside disk-io.c) and the name is clear enough
about what it does.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When inserting a key range item (BTRFS_DIR_LOG_INDEX_KEY) while logging
a directory, we don't expect the insertion to fail with -EEXIST, because
we are holding the directory's log_mutex and we have dropped all existing
BTRFS_DIR_LOG_INDEX_KEY keys from the log tree before we started to log
the directory. However it's possible that during the logging we attempt
to insert the same BTRFS_DIR_LOG_INDEX_KEY key twice, but for this to
happen we need to race with insertions of items from other inodes in the
subvolume's tree while we are logging a directory. Here's how this can
happen:
1) We are logging a directory with inode number 1000 that has its items
spread across 3 leaves in the subvolume's tree:
leaf A - has index keys from the range 2 to 20 for example. The last
item in the leaf corresponds to a dir item for index number 20. All
these dir items were created in a past transaction.
leaf B - has index keys from the range 22 to 100 for example. It has
no keys from other inodes, all its keys are dir index keys for our
directory inode number 1000. Its first key is for the dir item with
a sequence number of 22. All these dir items were also created in a
past transaction.
leaf C - has index keys for our directory for the range 101 to 120 for
example. This leaf also has items from other inodes, and its first
item corresponds to the dir item for index number 101 for our directory
with inode number 1000;
2) When we finish processing the items from leaf A at log_dir_items(),
we log a BTRFS_DIR_LOG_INDEX_KEY key with an offset of 21 and a last
offset of 21, meaning the log is authoritative for the index range
from 21 to 21 (a single sequence number). At this point leaf B was
not yet modified in the current transaction;
3) When we return from log_dir_items() we have released our read lock on
leaf B, and have set *last_offset_ret to 21 (index number of the first
item on leaf B minus 1);
4) Some other task inserts an item for other inode (inode number 1001 for
example) into leaf C. That resulted in pushing some items from leaf C
into leaf B, in order to make room for the new item, so now leaf B
has dir index keys for the sequence number range from 22 to 102 and
leaf C has the dir items for the sequence number range 103 to 120;
5) At log_directory_changes() we call log_dir_items() again, passing it
a 'min_offset' / 'min_key' value of 22 (*last_offset_ret from step 3
plus 1, so 21 + 1). Then btrfs_search_forward() leaves us at slot 0
of leaf B, since leaf B was modified in the current transaction.
We have also initialized 'last_old_dentry_offset' to 20 after calling
btrfs_previous_item() at log_dir_items(), as it left us at the last
item of leaf A, which refers to the dir item with sequence number 20;
6) We then call process_dir_items_leaf() to process the dir items of
leaf B, and when we process the first item, corresponding to slot 0,
sequence number 22, we notice the dir item was created in a past
transaction and its sequence number is greater than the value of
*last_old_dentry_offset + 1 (20 + 1), so we decide to log again a
BTRFS_DIR_LOG_INDEX_KEY key with an offset of 21 and an end range
of 21 (key.offset - 1 == 22 - 1 == 21), which results in an -EEXIST
error from insert_dir_log_key(), as we have already inserted that
key at step 2, triggering the assertion at process_dir_items_leaf().
The trace produced in dmesg is like the following:
assertion failed: ret != -EEXIST, in fs/btrfs/tree-log.c:3857
[198255.980839][ T7460] ------------[ cut here ]------------
[198255.981666][ T7460] kernel BUG at fs/btrfs/ctree.h:3617!
[198255.983141][ T7460] invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
[198255.984080][ T7460] CPU: 0 PID: 7460 Comm: repro-ghost-dir Not tainted 5.18.0-5314c78ac373-misc-next+
[198255.986027][ T7460] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
[198255.988600][ T7460] RIP: 0010:assertfail.constprop.0+0x1c/0x1e
[198255.989465][ T7460] Code: 8b 4c 89 (...)
[198255.992599][ T7460] RSP: 0018:ffffc90007387188 EFLAGS: 00010282
[198255.993414][ T7460] RAX: 000000000000003d RBX: 0000000000000065 RCX: 0000000000000000
[198255.996056][ T7460] RDX: 0000000000000001 RSI: ffffffff8b62b180 RDI: fffff52000e70e24
[198255.997668][ T7460] RBP: ffffc90007387188 R08: 000000000000003d R09: ffff8881f0e16507
[198255.999199][ T7460] R10: ffffed103e1c2ca0 R11: 0000000000000001 R12: 00000000ffffffef
[198256.000683][ T7460] R13: ffff88813befc630 R14: ffff888116c16e70 R15: ffffc90007387358
[198256.007082][ T7460] FS: 00007fc7f7c24640(0000) GS:ffff8881f0c00000(0000) knlGS:0000000000000000
[198256.009939][ T7460] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[198256.014133][ T7460] CR2: 0000560bb16d0b78 CR3: 0000000140b34005 CR4: 0000000000170ef0
[198256.015239][ T7460] Call Trace:
[198256.015674][ T7460] <TASK>
[198256.016313][ T7460] log_dir_items.cold+0x16/0x2c
[198256.018858][ T7460] ? replay_one_extent+0xbf0/0xbf0
[198256.025932][ T7460] ? release_extent_buffer+0x1d2/0x270
[198256.029658][ T7460] ? rcu_read_lock_sched_held+0x16/0x80
[198256.031114][ T7460] ? lock_acquired+0xbe/0x660
[198256.032633][ T7460] ? rcu_read_lock_sched_held+0x16/0x80
[198256.034386][ T7460] ? lock_release+0xcf/0x8a0
[198256.036152][ T7460] log_directory_changes+0xf9/0x170
[198256.036993][ T7460] ? log_dir_items+0xba0/0xba0
[198256.037661][ T7460] ? do_raw_write_unlock+0x7d/0xe0
[198256.038680][ T7460] btrfs_log_inode+0x233b/0x26d0
[198256.041294][ T7460] ? log_directory_changes+0x170/0x170
[198256.042864][ T7460] ? btrfs_attach_transaction_barrier+0x60/0x60
[198256.045130][ T7460] ? rcu_read_lock_sched_held+0x16/0x80
[198256.046568][ T7460] ? lock_release+0xcf/0x8a0
[198256.047504][ T7460] ? lock_downgrade+0x420/0x420
[198256.048712][ T7460] ? ilookup5_nowait+0x81/0xa0
[198256.049747][ T7460] ? lock_downgrade+0x420/0x420
[198256.050652][ T7460] ? do_raw_spin_unlock+0xa9/0x100
[198256.051618][ T7460] ? __might_resched+0x128/0x1c0
[198256.052511][ T7460] ? __might_sleep+0x66/0xc0
[198256.053442][ T7460] ? __kasan_check_read+0x11/0x20
[198256.054251][ T7460] ? iget5_locked+0xbd/0x150
[198256.054986][ T7460] ? run_delayed_iput_locked+0x110/0x110
[198256.055929][ T7460] ? btrfs_iget+0xc7/0x150
[198256.056630][ T7460] ? btrfs_orphan_cleanup+0x4a0/0x4a0
[198256.057502][ T7460] ? free_extent_buffer+0x13/0x20
[198256.058322][ T7460] btrfs_log_inode+0x2654/0x26d0
[198256.059137][ T7460] ? log_directory_changes+0x170/0x170
[198256.060020][ T7460] ? rcu_read_lock_sched_held+0x16/0x80
[198256.060930][ T7460] ? rcu_read_lock_sched_held+0x16/0x80
[198256.061905][ T7460] ? lock_contended+0x770/0x770
[198256.062682][ T7460] ? btrfs_log_inode_parent+0xd04/0x1750
[198256.063582][ T7460] ? lock_downgrade+0x420/0x420
[198256.064432][ T7460] ? preempt_count_sub+0x18/0xc0
[198256.065550][ T7460] ? __mutex_lock+0x580/0xdc0
[198256.066654][ T7460] ? stack_trace_save+0x94/0xc0
[198256.068008][ T7460] ? __kasan_check_write+0x14/0x20
[198256.072149][ T7460] ? __mutex_unlock_slowpath+0x12a/0x430
[198256.073145][ T7460] ? mutex_lock_io_nested+0xcd0/0xcd0
[198256.074341][ T7460] ? wait_for_completion_io_timeout+0x20/0x20
[198256.075345][ T7460] ? lock_downgrade+0x420/0x420
[198256.076142][ T7460] ? lock_contended+0x770/0x770
[198256.076939][ T7460] ? do_raw_spin_lock+0x1c0/0x1c0
[198256.078401][ T7460] ? btrfs_sync_file+0x5e6/0xa40
[198256.080598][ T7460] btrfs_log_inode_parent+0x523/0x1750
[198256.081991][ T7460] ? wait_current_trans+0xc8/0x240
[198256.083320][ T7460] ? lock_downgrade+0x420/0x420
[198256.085450][ T7460] ? btrfs_end_log_trans+0x70/0x70
[198256.086362][ T7460] ? rcu_read_lock_sched_held+0x16/0x80
[198256.087544][ T7460] ? lock_release+0xcf/0x8a0
[198256.088305][ T7460] ? lock_downgrade+0x420/0x420
[198256.090375][ T7460] ? dget_parent+0x8e/0x300
[198256.093538][ T7460] ? do_raw_spin_lock+0x1c0/0x1c0
[198256.094918][ T7460] ? lock_downgrade+0x420/0x420
[198256.097815][ T7460] ? do_raw_spin_unlock+0xa9/0x100
[198256.101822][ T7460] ? dget_parent+0xb7/0x300
[198256.103345][ T7460] btrfs_log_dentry_safe+0x48/0x60
[198256.105052][ T7460] btrfs_sync_file+0x629/0xa40
[198256.106829][ T7460] ? start_ordered_ops.constprop.0+0x120/0x120
[198256.109655][ T7460] ? __fget_files+0x161/0x230
[198256.110760][ T7460] vfs_fsync_range+0x6d/0x110
[198256.111923][ T7460] ? start_ordered_ops.constprop.0+0x120/0x120
[198256.113556][ T7460] __x64_sys_fsync+0x45/0x70
[198256.114323][ T7460] do_syscall_64+0x5c/0xc0
[198256.115084][ T7460] ? syscall_exit_to_user_mode+0x3b/0x50
[198256.116030][ T7460] ? do_syscall_64+0x69/0xc0
[198256.116768][ T7460] ? do_syscall_64+0x69/0xc0
[198256.117555][ T7460] ? do_syscall_64+0x69/0xc0
[198256.118324][ T7460] ? sysvec_call_function_single+0x57/0xc0
[198256.119308][ T7460] ? asm_sysvec_call_function_single+0xa/0x20
[198256.120363][ T7460] entry_SYSCALL_64_after_hwframe+0x44/0xae
[198256.121334][ T7460] RIP: 0033:0x7fc7fe97b6ab
[198256.122067][ T7460] Code: 0f 05 48 (...)
[198256.125198][ T7460] RSP: 002b:00007fc7f7c23950 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[198256.126568][ T7460] RAX: ffffffffffffffda RBX: 00007fc7f7c239f0 RCX: 00007fc7fe97b6ab
[198256.127942][ T7460] RDX: 0000000000000002 RSI: 000056167536bcf0 RDI: 0000000000000004
[198256.129302][ T7460] RBP: 0000000000000004 R08: 0000000000000000 R09: 000000007ffffeb8
[198256.130670][ T7460] R10: 00000000000001ff R11: 0000000000000293 R12: 0000000000000001
[198256.132046][ T7460] R13: 0000561674ca8140 R14: 00007fc7f7c239d0 R15: 000056167536dab8
[198256.133403][ T7460] </TASK>
Fix this by treating -EEXIST as expected at insert_dir_log_key() and have
it update the item with an end offset corresponding to the maximum between
the previously logged end offset and the new requested end offset. The end
offsets may be different due to dir index key deletions that happened as
part of unlink operations while we are logging a directory (triggered when
fsyncing some other inode parented by the directory) or during renames
which always attempt to log a single dir index deletion.
Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Link: https://lore.kernel.org/linux-btrfs/YmyefE9mc2xl5ZMz@hungrycats.org/
Fixes: 732d591a5d ("btrfs: stop copying old dir items when logging a directory")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
On Linux, empty symlinks are invalid, and attempting to create one with
the system call symlink(2) results in an -ENOENT error and this is
explicitly documented in the man page.
If we rename a symlink that was created in the current transaction and its
parent directory was logged before, we actually end up logging the symlink
without logging its content, which is stored in an inline extent. That
means that after a power failure we can end up with an empty symlink,
having no content and an i_size of 0 bytes.
It can be easily reproduced like this:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
$ mkdir /mnt/testdir
$ sync
# Create a file inside the directory and fsync the directory.
$ touch /mnt/testdir/foo
$ xfs_io -c "fsync" /mnt/testdir
# Create a symlink inside the directory and then rename the symlink.
$ ln -s /mnt/testdir/foo /mnt/testdir/bar
$ mv /mnt/testdir/bar /mnt/testdir/baz
# Now fsync again the directory, this persist the log tree.
$ xfs_io -c "fsync" /mnt/testdir
<power failure>
$ mount /dev/sdc /mnt
$ stat -c %s /mnt/testdir/baz
0
$ readlink /mnt/testdir/baz
$
Fix this by always logging symlinks in full mode (LOG_INODE_ALL), so that
their content is also logged.
A test case for fstests will follow.
CC: stable@vger.kernel.org # 4.9+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
On a zoned filesystem, if we fail to allocate the root node for the log
root tree while syncing the log, we end up returning without finishing
the IO plug we started before, resulting in leaking resources as we
have started writeback for extent buffers of a log tree before. That
allocation failure, which typically is either -ENOMEM or -ENOSPC, is not
fatal and the fsync can safely fallback to a full transaction commit.
So release the IO plug if we fail to allocate the extent buffer for the
root of the log root tree when syncing the log on a zoned filesystem.
Fixes: 3ddebf27fc ("btrfs: zoned: reorder log node allocation on zoned filesystem")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During log replay there is this pattern of running delayed items after
every inode unlink. To avoid repeating this several times, move the
logic into an helper function and use it instead of calling
btrfs_unlink_inode() followed by btrfs_run_delayed_items().
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When an inode has a last_reflink_trans matching the current transaction,
we have to take special care when logging its checksums in order to
avoid getting checksum items with overlapping ranges in a log tree,
which could result in missing checksums after log replay (more on that
in the changelogs of commit 40e046acbd ("Btrfs: fix missing data
checksums after replaying a log tree") and commit e289f03ea7 ("btrfs:
fix corrupt log due to concurrent fsync of inodes with shared extents")).
We also need to make sure a full fsync will copy all old file extent
items it finds in modified leaves, because they might have been copied
from some other inode.
However once we fsync an inode, we don't need to keep paying the price of
that extra special care in future fsyncs done in the same transaction,
unless the inode is used for another reflink operation or the full sync
flag is set on it (truncate, failure to allocate extent maps for holes,
and other exceptional and infrequent cases).
So after we fsync an inode reset its last_unlink_trans to zero. In case
another reflink happens, we continue to update the last_reflink_trans of
the inode, just as before. Also set last_reflink_trans to the generation
of the last transaction that modified the inode whenever we need to set
the full sync flag on the inode, just like when we need to load an inode
from disk after eviction.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Doing a full fsync may require processing many leaves of metadata, which
can take some time and result in a task monopolizing a cpu for too long.
So add a cond_resched() after processing a leaf when doing a full fsync,
while not holding any locks on any tree (a subvolume or a log tree).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing a full fsync, at copy_items(), we iterate over all extents and
then collect their checksums into a list. After copying all the extents to
the log tree, we then log all the previously collected checksums.
Before the previous patch in the series (subject "btrfs: stop copying old
file extents when doing a full fsync"), we had to do it this way, because
while we were iterating over the items in the leaf of the subvolume tree,
we were holding a write lock on a leaf of the log tree, so logging the
checksums for an extent right after we collected them could result in a
deadlock, in case the checksum items ended up in the same leaf.
However after the previous patch in the series we now do a first iteration
over all the items in the leaf of the subvolume tree before locking a path
in the log tree, so we can now log the checksums right after we have
obtained them. This avoids holding in memory all checksums for all extents
in the leaf while copying items from the source leaf to the log tree. The
amount of memory used to hold all checksums of the extents in a leaf can
be significant. For example if a leaf has 200 file extent items referring
to 1M extents, using the default crc32c checksums, would result in using
over 200K of memory (not accounting for the extra overhead of struct
btrfs_ordered_sum), with smaller or less extents it would be less, but
it could be much more with more extents per leaf and/or much larger
extents.
So change copy_items() to log the checksums for an extent after looking
them up, and then free their memory, as they are no longer necessary.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When logging an inode in full sync mode, we go over every leaf that was
modified in the current transaction and has items associated to our inode,
and then copy all those items into the log tree. This includes copying
file extent items that were created and added to the inode in past
transactions, which is useless and only makes use more leaf space in the
log tree.
It's common to have a file with many file extent items spanning many
leaves where only a few file extent items are new and need to be logged,
and in such case we log all the file extent items we find in the modified
leaves.
So change the full sync behaviour to skip over file extent items that are
not needed. Those are the ones that match the following criteria:
1) Have a generation older than the current transaction and the inode
was not a target of a reflink operation, as that can copy file extent
items from a past generation from some other inode into our inode, so
we have to log them;
2) Start at an offset within i_size - we must log anything at or beyond
i_size, otherwise we would lose prealloc extents after log replay.
The following script exercises a scenario where this happens, and it's
somehow close enough to what happened often on a SQL Server workload which
I had to debug sometime ago to fix an issue where a pattern of writes to
prealloc extents and fsync resulted in fsync failing with -EIO (that was
commit ea7036de0d ("btrfs: fix fsync failure and transaction abort
after writes to prealloc extents")). In that particular case, we had large
files that had random writes and were often truncated, which made the
next fsync be a full sync.
$ cat test.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
MKFS_OPTIONS="-O no-holes -R free-space-tree"
MOUNT_OPTIONS="-o ssd"
FILE_SIZE=$((1 * 1024 * 1024 * 1024)) # 1G
# FILE_SIZE=$((2 * 1024 * 1024 * 1024)) # 2G
# FILE_SIZE=$((512 * 1024 * 1024)) # 512M
mkfs.btrfs -f $MKFS_OPTIONS $DEV
mount $MOUNT_OPTIONS $DEV $MNT
# Create a file with many extents. Use direct IO to make it faster
# to create the file - using buffered IO we would have to fsync
# after each write (terribly slow).
echo "Creating file with $((FILE_SIZE / 4096)) extents of 4K each..."
xfs_io -f -d -c "pwrite -b 4K 0 $FILE_SIZE" $MNT/foobar
# Commit the transaction, so every extent after this is from an
# old generation.
sync
# Now rewrite only a few extents, which are all far spread apart from
# each other (e.g. 1G / 32M = 32 extents).
# After this only a few extents have a new generation, while all other
# ones have an old generation.
echo "Rewriting $((FILE_SIZE / (32 * 1024 * 1024))) extents..."
for ((i = 0; i < $FILE_SIZE; i += $((32 * 1024 * 1024)))); do
xfs_io -c "pwrite $i 4K" $MNT/foobar >/dev/null
done
# Fsync, the inode logged in full sync mode since it was never fsynced
# before.
echo "Fsyncing file..."
xfs_io -c "fsync" $MNT/foobar
umount $MNT
And the following bpftrace program was running when executing the test
script:
$ cat bpf-script.sh
#!/usr/bin/bpftrace
k:btrfs_log_inode
{
@start_log_inode[tid] = nsecs;
}
kr:btrfs_log_inode
/@start_log_inode[tid]/
{
@log_inode_dur[tid] = (nsecs - @start_log_inode[tid]) / 1000;
delete(@start_log_inode[tid]);
}
k:btrfs_sync_log
{
@start_sync_log[tid] = nsecs;
}
kr:btrfs_sync_log
/@start_sync_log[tid]/
{
$sync_log_dur = (nsecs - @start_sync_log[tid]) / 1000;
printf("btrfs_log_inode() took %llu us\n", @log_inode_dur[tid]);
printf("btrfs_sync_log() took %llu us\n", $sync_log_dur);
delete(@start_sync_log[tid]);
delete(@log_inode_dur[tid]);
exit();
}
With 512M test file, before this patch:
btrfs_log_inode() took 15218 us
btrfs_sync_log() took 1328 us
Log tree has 17 leaves and 1 node, its total size is 294912 bytes.
With 512M test file, after this patch:
btrfs_log_inode() took 14760 us
btrfs_sync_log() took 588 us
Log tree has a single leaf, its total size is 16K.
With 1G test file, before this patch:
btrfs_log_inode() took 27301 us
btrfs_sync_log() took 1767 us
Log tree has 33 leaves and 1 node, its total size is 557056 bytes.
With 1G test file, after this patch:
btrfs_log_inode() took 26166 us
btrfs_sync_log() took 593 us
Log tree has a single leaf, its total size is 16K
With 2G test file, before this patch:
btrfs_log_inode() took 50892 us
btrfs_sync_log() took 3127 us
Log tree has 65 leaves and 1 node, its total size is 1081344 bytes.
With 2G test file, after this patch:
btrfs_log_inode() took 50126 us
btrfs_sync_log() took 586 us
Log tree has a single leaf, its total size is 16K.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we want to log an extent, in the fast fsync path, we obtain a path
to the leaf that will hold the file extent item either through a deletion
search, via btrfs_drop_extents(), or through an insertion search using
btrfs_insert_empty_item(). After that we fill the file extent item's
fields one by one directly on the leaf.
Instead of doing that, we could prepare the file extent item before
obtaining a btree path, and then copy the prepared extent item with a
single operation once we get the path. This helps avoid some contention
on the log tree, since we are holding write locks for longer than
necessary, especially in the case where the path is obtained via
btrfs_drop_extents() through a deletion search, which always keeps a
write lock on the nodes at levels 1 and 2 (besides the leaf).
This change does that, we prepare the file extent item that is going to
be inserted before acquiring a path, and then copy it into a leaf using
a single copy operation once we get a path.
This change if part of a patchset that is comprised of the following
patches:
1/6 btrfs: remove unnecessary leaf free space checks when pushing items
2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf
3/6 btrfs: avoid unnecessary computation when deleting items from a leaf
4/6 btrfs: remove constraint on number of visited leaves when replacing extents
5/6 btrfs: remove useless path release in the fast fsync path
6/6 btrfs: prepare extents to be logged before locking a log tree path
The following test was run to measure the impact of the whole patchset:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
MOUNT_OPTIONS="-o ssd"
MKFS_OPTIONS="-R free-space-tree -O no-holes"
NUM_JOBS=8
FILE_SIZE=128M
RUN_TIME=200
cat <<EOF > /tmp/fio-job.ini
[writers]
rw=randwrite
fsync=1
fallocate=none
group_reporting=1
direct=0
bssplit=4k/20:8k/20:16k/20:32k/10:64k/10:128k/5:256k/5:512k/5:1m/5
ioengine=sync
filesize=$FILE_SIZE
runtime=$RUN_TIME
time_based
directory=$MNT
numjobs=$NUM_JOBS
thread
EOF
echo "performance" | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
echo
echo "Using config:"
echo
cat /tmp/fio-job.ini
echo
umount $MNT &> /dev/null
mkfs.btrfs -f $MKFS_OPTIONS $DEV
mount $MOUNT_OPTIONS $DEV $MNT
fio /tmp/fio-job.ini
umount $MNT
The test ran inside a VM (8 cores, 32G of RAM) with the target disk
mapping to a raw NVMe device, and using a non-debug kernel config
(Debian's default config).
Before the patchset:
WRITE: bw=116MiB/s (122MB/s), 116MiB/s-116MiB/s (122MB/s-122MB/s), io=22.7GiB (24.4GB), run=200013-200013msec
After the patchset:
WRITE: bw=125MiB/s (131MB/s), 125MiB/s-125MiB/s (131MB/s-131MB/s), io=24.3GiB (26.1GB), run=200007-200007msec
A 7.8% gain on throughput and +7.0% more IO done in the same period of
time (200 seconds).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no point in calling btrfs_release_path() after finishing the loop
that logs the modified extents, since log_one_extent() returns with the
path released. In case the list of extents is empty, the path is already
released, so there's no need for that case as well.
So just remove that unnecessary btrfs_release_path() call.
This change if part of a patchset that is comprised of the following
patches:
1/6 btrfs: remove unnecessary leaf free space checks when pushing items
2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf
3/6 btrfs: avoid unnecessary computation when deleting items from a leaf
4/6 btrfs: remove constraint on number of visited leaves when replacing extents
5/6 btrfs: remove useless path release in the fast fsync path
6/6 btrfs: prepare extents to be logged before locking a log tree path
The last patch in the series has some performance test result in its
changelog.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At btrfs_log_inode(), we have two variables to track errors and the
return value of the function, named 'ret' and 'err'. In some places we
use 'ret' and if gets a non-zero value we assign its value to 'err'
and then jump to the 'out' label, while in other places we use 'err'
directly without 'ret' as an intermediary. This is inconsistent, error
prone and not necessary. So change that to use only the 'ret' variable,
making this consistent with most functions in btrfs.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During a rename or link operation, we need to determine if an inode was
previously logged or not, and if it was, do some update to the logged
inode. We used to rely exclusively on the logged_trans field of struct
btrfs_inode to determine that, but that was not reliable because the
value of that field is not persisted in the inode item, so it's lost
when an inode is evicted and loaded back again. That led to several
issues in the past, such as not persisting deletions (such as the case
fixed by commit 803f0f64d1 ("Btrfs: fix fsync not persisting dentry
deletions due to inode evictions")), or resulting in losing a file
after an inode eviction followed by a rename (commit ecc64fab7d
("btrfs: fix lost inode on log replay after mix of fsync, rename and
inode eviction")), besides other issues.
So the inode_logged() helper was introduced and used to determine if an
inode was possibly logged before in the current transaction, with the
caveat that it could return false positives, in the sense that even if an
inode was not logged before in the current transaction, it could still
return true, but never to return false in case the inode was logged.
>From a functional point of view that is fine, but from a performance
perspective it can introduce significant latencies to rename and link
operations, as they will end up doing inode logging even when it is not
necessary.
Recently on a 5.15 kernel, an openSUSE Tumbleweed user reported package
installations and upgrades, with the zypper tool, were often taking a
long time to complete. With strace it could be observed that zypper was
spending about 99% of its time on rename operations, and then with
further analysis we checked that directory logging was happening too
frequently. Taking into account that installation/upgrade of some of the
packages needed a few thousand file renames, the slowdown was very
noticeable for the user.
The issue was caused indirectly due to an excessive number of inode
evictions on a 5.15 kernel, about 100x more compared to a 5.13, 5.14 or
a 5.16-rc8 kernel. While triggering the inode evictions if something
outside btrfs' control, btrfs could still behave better by eliminating
the false positives from the inode_logged() helper.
So change inode_logged() to actually eliminate such false positives caused
by inode eviction and when an inode was never logged since the filesystem
was mounted, as both cases relate to when the logged_trans field of struct
btrfs_inode has a value of zero. When it can not determine if the inode
was logged based only on the logged_trans value, lookup for the existence
of the inode item in the log tree - if it's there then we known the inode
was logged, if it's not there then it can not have been logged in the
current transaction. Once we determine if the inode was logged, update
the logged_trans value to avoid future calls to have to search in the log
tree again.
Alternatively, we could start storing logged_trans in the on disk inode
item structure (struct btrfs_inode_item) in the unused space it still has,
but that would be a bit odd because:
1) We only care about logged_trans since the filesystem was mounted, we
don't care about its value from a previous mount. Having it persisted
in the inode item structure would not make the best use of the precious
unused space;
2) In order to get logged_trans persisted before inode eviction, we would
have to update the delayed inode when we finish logging the inode and
update its logged_trans in struct btrfs_inode, which makes it a bit
cumbersome since we need to check if the delayed inode exists, if not
create it and populate it and deal with any errors (-ENOMEM mostly).
This change is part of a patchset comprised of the following patches:
1/5 btrfs: add helper to delete a dir entry from a log tree
2/5 btrfs: pass the dentry to btrfs_log_new_name() instead of the inode
3/5 btrfs: avoid logging all directory changes during renames
4/5 btrfs: stop doing unnecessary log updates during a rename
5/5 btrfs: avoid inode logging during rename and link when possible
The following test script mimics part of what the zypper tool does during
package installations/upgrades. It does not triggers inode evictions, but
it's similar because it triggers false positives from the inode_logged()
helper, because the inodes have a logged_trans of 0, there's a log tree
due to a fsync of an unrelated file and the directory inode has its
last_trans field set to the current transaction:
$ cat test.sh
#!/bin/bash
DEV=/dev/nvme0n1
MNT=/mnt/nvme0n1
NUM_FILES=10000
mkfs.btrfs -f $DEV
mount $DEV $MNT
mkdir $MNT/testdir
for ((i = 1; i <= $NUM_FILES; i++)); do
echo -n > $MNT/testdir/file_$i
done
sync
# Now do some change to an unrelated file and fsync it.
# This is just to create a log tree to make sure that inode_logged()
# does not return false when called against "testdir".
xfs_io -f -c "pwrite 0 4K" -c "fsync" $MNT/foo
# Do some change to testdir. This is to make sure inode_logged()
# will return true when called against "testdir", because its
# logged_trans is 0, it was changed in the current transaction
# and there's a log tree.
echo -n > $MNT/testdir/file_$((NUM_FILES + 1))
echo "Renaming $NUM_FILES files..."
start=$(date +%s%N)
for ((i = 1; i <= $NUM_FILES; i++)); do
mv $MNT/testdir/file_$i $MNT/testdir/file_$i-RPMDELETE
done
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "Renames took $dur milliseconds"
umount $MNT
Testing this change on a box using a non-debug kernel (Debian's default
kernel config) gave the following results:
NUM_FILES=10000, before patchset: 27837 ms
NUM_FILES=10000, after patches 1/5 to 4/5 applied: 9236 ms (-66.8%)
NUM_FILES=10000, after whole patchset applied: 8902 ms (-68.0%)
NUM_FILES=5000, before patchset: 9127 ms
NUM_FILES=5000, after patches 1/5 to 4/5 applied: 4640 ms (-49.2%)
NUM_FILES=5000, after whole patchset applied: 4441 ms (-51.3%)
NUM_FILES=2000, before patchset: 2528 ms
NUM_FILES=2000, after patches 1/5 to 4/5 applied: 1983 ms (-21.6%)
NUM_FILES=2000, after whole patchset applied: 1747 ms (-30.9%)
NUM_FILES=1000, before patchset: 1085 ms
NUM_FILES=1000, after patches 1/5 to 4/5 applied: 893 ms (-17.7%)
NUM_FILES=1000, after whole patchset applied: 867 ms (-20.1%)
Running dbench on the same physical machine with the following script:
$ cat run-dbench.sh
#!/bin/bash
NUM_JOBS=$(nproc --all)
DEV=/dev/nvme0n1
MNT=/mnt/nvme0n1
MOUNT_OPTIONS="-o ssd"
MKFS_OPTIONS="-O no-holes -R free-space-tree"
echo "performance" | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
mkfs.btrfs -f $MKFS_OPTIONS $DEV
mount $MOUNT_OPTIONS $DEV $MNT
dbench -D $MNT -t 120 $NUM_JOBS
umount $MNT
Before patchset:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 3761352 0.032 143.843
Close 2762770 0.002 2.273
Rename 159304 0.291 67.037
Unlink 759784 0.207 143.998
Deltree 72 4.028 15.977
Mkdir 36 0.003 0.006
Qpathinfo 3409780 0.013 9.678
Qfileinfo 596772 0.001 0.878
Qfsinfo 625189 0.003 1.245
Sfileinfo 306443 0.006 1.840
Find 1318106 0.063 19.798
WriteX 1871137 0.021 8.532
ReadX 5897325 0.003 3.567
LockX 12252 0.003 0.258
UnlockX 12252 0.002 0.100
Flush 263666 3.327 155.632
Throughput 980.047 MB/sec 12 clients 12 procs max_latency=155.636 ms
After whole patchset applied:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 4195584 0.033 107.742
Close 3081932 0.002 1.935
Rename 177641 0.218 14.905
Unlink 847333 0.166 107.822
Deltree 118 5.315 15.247
Mkdir 59 0.004 0.048
Qpathinfo 3802612 0.014 10.302
Qfileinfo 666748 0.001 1.034
Qfsinfo 697329 0.003 0.944
Sfileinfo 341712 0.006 2.099
Find 1470365 0.065 9.359
WriteX 2093921 0.021 8.087
ReadX 6576234 0.003 3.407
LockX 13660 0.003 0.308
UnlockX 13660 0.002 0.114
Flush 294090 2.906 115.539
Throughput 1093.11 MB/sec 12 clients 12 procs max_latency=115.544 ms
+11.5% throughput -25.8% max latency rename max latency -77.8%
Link: https://bugzilla.opensuse.org/show_bug.cgi?id=1193549
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>