The functions qualify for the pure attribute as they always return the
same value for the same argument (in the given scope). This allows to
optimize the calls and cache the value.
Signed-off-by: David Sterba <dsterba@suse.com>
Unlike folio helpers for date the ones for metadata always take the
extent buffer start and length, so they can be simplified to take the
eb only. The fs_info can be obtained from eb too so it can be dropped
as parameter.
Added in patch "btrfs: use metadata specific helpers to simplify extent
buffer helpers".
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are considering the used bytes counter of a block group as the amount
to update the space info's reclaim bytes counter after relocating the
block group, but this value alone is often not enough. This is because we
may have a reserved extent (or more) and in that case its size is
reflected in the reserved counter of the block group - the size of the
extent is only transferred from the reserved counter to the used counter
of the block group when the delayed ref for the extent is run - typically
when committing the transaction (or when flushing delayed refs due to
ENOSPC on space reservation). Such call chain for data extents is:
btrfs_run_delayed_refs_for_head()
run_one_delayed_ref()
run_delayed_data_ref()
alloc_reserved_file_extent()
alloc_reserved_extent()
btrfs_update_block_group()
-> transfers the extent size from the reserved
counter to the used counter
For metadata extents:
btrfs_run_delayed_refs_for_head()
run_one_delayed_ref()
run_delayed_tree_ref()
alloc_reserved_tree_block()
alloc_reserved_extent()
btrfs_update_block_group()
-> transfers the extent size from the reserved
counter to the used counter
Since relocation flushes delalloc, waits for ordered extent completion
and commits the current transaction before doing the actual relocation
work, the correct amount of reclaimed space is therefore the sum of the
"used" and "reserved" counters of the block group before we call
btrfs_relocate_chunk() at btrfs_reclaim_bgs_work().
So fix this by taking the "reserved" counter into consideration.
Fixes: 243192b676 ("btrfs: report reclaim stats in sysfs")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At btrfs_reclaim_bgs_work(), we are grabbing twice the used bytes counter
of the block group while not holding the block group's spinlock. This can
result in races, reported by KCSAN and similar tools, since a concurrent
task can be updating that counter while at btrfs_update_block_group().
So avoid these races by grabbing the counter in a critical section
delimited by the block group's spinlock after setting the block group to
RO mode. This also avoids using two different values of the counter in
case it changes in between each read. This silences KCSAN and is required
for the next patch in the series too.
Fixes: 243192b676 ("btrfs: report reclaim stats in sysfs")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At btrfs_reclaim_bgs_work(), we are grabbing a block group's zone unusable
bytes while not under the protection of the block group's spinlock, so
this can trigger race reports from KCSAN (or similar tools) since that
field is typically updated while holding the lock, such as at
__btrfs_add_free_space_zoned() for example.
Fix this by grabbing the zone unusable bytes while we are still in the
critical section holding the block group's spinlock, which is right above
where we are currently grabbing it.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
After previous patch removing nodesize from parameters,
__alloc_dummy_extent_buffer() and alloc_dummy_extent_buffer() are
identical so we can drop one.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All callers pass a valid fs_info so we can read the nodesize from that
instead of passing it as parameter.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no longer any need for the 'out' label as there are no resources
to cleanup anymore in case of an error and we can directly return if
begin_cmd() fails.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Whenever we issue a command we allocate a path and then compute it. For
the current inode this is not necessary since we have one preallocated
and computed in the send context structure, so we can use it instead
and avoid allocating and freeing a path.
For example if we have 100 extents to send (100 write commands) for a
file, we are allocating and freeing paths 100 times.
So improve on this by avoiding path allocation and freeing whenever a
command is for the current inode by using the current inode's path
stored in the send context structure.
A test was run before applying this patch and the previous one in the
series:
"btrfs: send: keep the current inode's path cached"
The test script is the following:
$ cat test.sh
#!/bin/bash
DEV=/dev/nullb0
MNT=/mnt/nullb0
mkfs.btrfs -f $DEV > /dev/null
mount $DEV $MNT
DIR="$MNT/one/two/three/four"
FILE="$DIR/foobar"
mkdir -p $DIR
# Create some empty files to get a deeper btree and therefore make
# path computations slower.
for ((i = 1; i <= 30000; i++)); do
echo -n > "$DIR/filler_$i"
done
for ((i = 0; i < 10000; i += 2)); do
offset=$(( i * 4096 ))
xfs_io -f -c "pwrite -S 0xab $offset 4K" $FILE > /dev/null
done
btrfs subvolume snapshot -r $MNT $MNT/snap
start=$(date +%s%N)
btrfs send -f /dev/null $MNT/snap
end=$(date +%s%N)
echo -e "\nsend took $(( (end - start) / 1000000 )) milliseconds"
umount $MNT
Result before applying the 2 patches: 1121 milliseconds
Result after applying the 2 patches: 815 milliseconds (-31.6%)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Whenever we need to send a command for the current inode, like sending
writes, xattr updates, truncates, utimes, etc, we compute the inode's
path each time, which implies doing some memory allocations and traversing
the inode hierarchy to extract the name of the inode and each ancestor
directory, and that implies doing lookups in the subvolume tree amongst
other operations.
Most of the time, by far, the current inode's path doesn't change while
we are processing it (like if we need to issue 100 write commands, the
path remains the same and it's pointless to compute it 100 times).
To avoid this keep the current inode's path cached in the send context
and invalidate it or update it whenever it's needed (after unlinks or
renames).
A performance test, and its results, is mentioned in the next patch in
the series (subject: "btrfs: send: avoid path allocation for the current
inode when issuing commands").
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no need to have an 'out' label and jump into it since there are
no resource cleanups to perform (release locks, free memory, etc), so
make this simpler by removing the label and goto and instead return
directly.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no need to have an 'out' label and jump into it since there are
no resource cleanups to perform (release locks, free memory, etc), so
make this simpler by removing the label and goto and instead return
directly.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no need to have an 'out' label and jump into it since there are
no resource cleanups to perform (release locks, free memory, etc), so
make this simpler by removing the label and goto and instead return
directly.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no need to have an 'out' label and jump into it since there are
no resource cleanups to perform (release locks, free memory, etc), so
make this simpler by removing the label and goto and instead return
directly.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no need for the 'out' label as there are no resources to cleanup
in case of an error and we can directly return if begin_cmd() fails.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no need to have an 'out' label and jump into it since there are
no resource cleanups to perform (release locks, free memory, etc), so
make this simpler by removing the label and goto and instead return
directly.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no need for the 'ret' variable, we can just return directly the
result of the call to iterate_dir_item().
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no need to have an 'out' label and jump into it since there are
no resource cleanups to perform (release locks, free memory, etc), so
make this simpler by removing the label and goto and instead return
directly.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no need to have an 'out' label and jump into it since there are
no resource cleanups to perform (release locks, free memory, etc), so
make this simpler by removing the label and goto and instead return
directly.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no need to have an 'out' label and jump into it since there are
no resource cleanups to perform (release locks, free memory, etc), so
make this simpler by removing the label and goto and instead return
directly.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no need to have an 'out' label and jump into it since there are
no resource cleanups to perform (release locks, free memory, etc), so
make this simpler by removing the label and goto and instead return
directly.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no need to have an 'out' label and jump into it since there are
no resource cleanups to perform (release locks, free memory, etc), so
make this simpler by removing the label and goto and instead return
directly.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no need for the 'out' label as there are no resources to cleanup
in case of an error and we can directly return if begin_cmd() fails.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Extract the logic to rename the current inode at process_recorded_refs()
into a helper function and use it, therefore removing duplicated logic
and making it easier for an upcoming patch by avoiding yet more duplicated
logic.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have several local variables at process_recorded_refs() that are used
as booleans, with some of them having a 'bool' type while two of them
having an 'int' type. Change this to make them all use the 'bool' type
which is more clear and to make everything more consistent.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We always send xattrs for the current inode only and both callers of
send_set_xattr() pass a path for the current inode. So move the path
allocation and computation to send_set_xattr(), reducing duplicated
code. This also facilitates an upcoming patch.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no need to have an 'out' label and jump into it since there are
no resource cleanups to perform (release locks, free memory, etc), so
make this simpler by removing the label and goto and instead return
directly.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no need to have an 'out' label and jump into it since there are
no resource cleanups to perform (release locks, free memory, etc), so
make this simpler by removing the label and goto and instead return
directly.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no need to have an 'out' label and jump into it since there are
no resource cleanups to perform (release locks, free memory, etc), so
make this simpler by removing the label and goto and instead return
directly.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When attempting to build a too long path we are currently returning
-ENOMEM, which is very odd and misleading. So update fs_path_ensure_buf()
to return -ENAMETOOLONG instead. Also, while at it, move the WARN_ON()
into the if statement's expression, as it makes it clear what is being
tested and also has the effect of adding 'unlikely' to the statement,
which allows the compiler to generate better code as this condition is
never expected to happen.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no need to have an 'out' label and jump into it since there are
no resource cleanups to perform (release locks, free memory, etc), so
make this simpler by removing the label and goto and instead return
directly.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The helper fs_path_add_path() is basically a copy of fs_path_add() and it
can be made a wrapper around fs_path_add(). So do that and also make it
inline and constify its second argument.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no need to have an 'out' label and jump into it since there are
no resource cleanups to perform (release locks, free memory, etc), so
make this simpler by removing the label and goto and instead return
directly.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no need to have an 'out' label and jump into it since there are
no resource cleanups to perform (release locks, free memory, etc), so
make this simpler by removing the label and goto and instead return
directly.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Several places are hardcoding the path length calculation instead of using
the helper fs_path_len() for that. Update all those places to instead use
fs_path_len().
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The helper function fs_path_len() is trivial and doesn't need to change
its path argument, so make it inline and constify the argument.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's duplicated logic in both branches of the if statement, so move it
outside the branches.
This also reduces the object code size.
Before this change:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
1746279 163600 16920 1926799 1d668f fs/btrfs/btrfs.ko
After this change:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename
1746047 163592 16920 1926559 1d659f fs/btrfs/btrfs.ko
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use a struct btrfs_inode in btrfs_get_name() as it's an internal
helper, allowing to remove some use of BTRFS_I.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use a struct btrfs_inode to btrfs_get_parent() as it's an internal
helper, allowing to remove some use of BTRFS_I.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use a struct btrfs_inode in btrfs_remap_file_range_prep() as it's an
internal helper, allowing to remove some use of BTRFS_I.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use a struct btrfs_inode to btrfs_remap_file_range() as it's an internal
helper, allowing to remove some use of BTRFS_I.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pass a struct btrfs_inode to btrfs_extent_same_range() as it's an
internal interface, allowing to remove some use of BTRFS_I.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pass a struct btrfs_inode to btrfs_double_mmap_unlock() as it's an
internal interface, allowing to remove some use of BTRFS_I.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pass a struct btrfs_inode to btrfs_double_mmap_lock() as it's an
internal interface, allowing to remove some use of BTRFS_I.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pass a struct btrfs_inode to clone_copy_inline_extent() as it's an
internal interface, allowing to remove some use of BTRFS_I.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pass a struct btrfs_inode to the extract() callback as it's an internal
interface, allowing to remove some use of BTRFS_I.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pass a struct btrfs_inode to the apply() callback as it's an internal
interface, allowing to remove some use of BTRFS_I.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pass a struct btrfs_inode to btrfs_inherit_props() as it's an internal
interface, allowing to remove some use of BTRFS_I.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pass a struct btrfs_inode to btrfs_load_inode_props() as it's an
internal interface, allowing to remove some use of BTRFS_I.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pass a struct btrfs_inode to btrfs_fill_inode() as it's an internal
interface, allowing to remove some use of BTRFS_I.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pass a struct btrfs_inode to fill_stack_inode_item() as it's an internal
interface, allowing to remove some use of BTRFS_I.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use a struct btrfs_inode in create_pending_snapshot() as it's an
internal helper, allowing to remove some use of BTRFS_I.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pass a struct btrfs_inode to btrfs_defrag_file() as it's an internal
interface, allowing to remove some use of BTRFS_I.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pass a struct btrfs_inode to btrfs_inode() as it's an internal
interface, allowing to remove some use of BTRFS_I.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pass a struct btrfs_inode to new_simple_dir() as it's an internal
interface, allowing to remove some use of BTRFS_I.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pass a struct btrfs_inode to btrfs_inode() as it's an internal
interface, allowing to remove some use of BTRFS_I.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pass a struct btrfs_inode to btrfs_read_locked_inode() as it's an
internal interface, allowing to remove some use of BTRFS_I.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pass a struct btrfs_inode to extent_range_clear_dirty_for_io() as it's
an internal interface, allowing to remove some use of BTRFS_I.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pass a struct btrfs_inode to can_nocow_extent() as it's an internal
interface, allowing to remove some use of BTRFS_I.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's a label that does nothing else than return, so remove it and
also change other gotos to immediate returns as the function is short
enough for this pattern.
Signed-off-by: David Sterba <dsterba@suse.com>
The btrfs_key is defined as objectid/type/offset and the keys are also
printed like that. For better readability, update all key
initializations to match this order.
Signed-off-by: David Sterba <dsterba@suse.com>
When using offset_in_page() it's clear what it means, we don't need to
store it in the local variable just to use it right away. There's no
change in the generated code, but keeps the declarations smaller.
Signed-off-by: David Sterba <dsterba@suse.com>
Reduce stack consumption of zstd_compress_folios() by 40 bytes
(10*sizeof(int)) as we can store struct zstd_parameters in the workspace
that is reused for each call.
typedef struct {
ZSTD_compressionParameters cParams;
ZSTD_frameParameters fParams;
} ZSTD_parameters;
typedef struct {
unsigned windowLog;
unsigned chainLog;
unsigned hashLog;
unsigned searchLog;
unsigned minMatch;
unsigned targetLength;
ZSTD_strategy strategy;
} ZSTD_compressionParameters;
typedef struct {
int contentSizeFlag;
int checksumFlag;
int noDictIDFlag;
} ZSTD_frameParameters;
Signed-off-by: David Sterba <dsterba@suse.com>
If BTRFS_FS_NEED_ZONE_FINISH is already set for the whole filesystem, exit
early in btrfs_can_activate_zone(). There's no need to check if
BTRFS_FS_NEED_ZONE_FINISH needs to be set if it is already set.
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since we have btrfs_meta_is_subpage(), we should make btrfs_is_subpage()
to be data inode specific.
This change involves:
- Simplify btrfs_is_subpage()
Now we only need to do a very simple sectorsize check against
PAGE_SIZE.
And since the function is pretty simple now, just make it an inline
function.
- Add an extra ASSERT() to make sure btrfs_is_subpage() is only called
on data inode mapping
- Migrate btree_csum_one_bio() to use btrfs_meta_folio_*() helpers
- Migrate alloc_extent_buffer() to use btrfs_meta_folio_*() helpers
- Migrate end_bbio_meta_write() to use btrfs_meta_folio_*() helpers
Or we will trigger the ASSERT() due to calling btrfs_folio_*() on
metadata folios.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
By using a shared bio_add_folio_nofail() with calculated
range_start/range_len, so no more explicit subpage routine needed.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently inside write_one_eb() we have two different ways of handling
subpage and regular metadata.
The differences are:
- Extra offset/length calculation when adding the folio range to bio for
subpage cases
- Only decrease wbc->nr_to_write if the whole page is no longer dirty
for subpage cases
- Use subpage helper for subpage cases
Merge the tow ways into a shared one:
- Always calculate the to-be-queued range
So that bio_add_folio() can use the same calculated resulted length
and offset for both cases.
- Use btrfs_meta_folio_clear_dirty() and
btrfs_meta_folio_set_writeback() helpers
This will cover both cases.
- Only decrease wbc->nr_to_write if the folio is no longer dirty
Since we have the folio locked, no one else can modify the folio dirty
flags (set_extent_buffer_dirty() will also lock the folio for subpage
cases).
Thus after our btrfs_meta_folio_clear_dirty() call, if the whole folio
is no longer dirty, we're submitting the last dirty eb of the folio,
and can decrease wbc->nr_to_write properly.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function btrfs_clear_buffer_dirty() is called on dirty extent buffer
that will not be written back.
The function will call btree_clear_folio_dirty() to clear the folio
dirty flag and also clear PAGECACHE_TAG_DIRTY flag.
And we split the subpage and regular handling, as for subpage cases we
should only clear PAGECACHE_TAG_DIRTY if the last dirty extent buffer in
the page is cleared.
So here we can simplify the function by:
- Use the newly introduced btrfs_meta_folio_clear_and_test_dirty() helper
The helper will return true if we cleared the folio dirty flag.
With that we can use the same helper for both subpage and regular
cases.
- Rename btree_clear_folio_dirty() to btree_clear_folio_dirty_tag()
As we move the folio dirty clearing in the btrfs_clear_buffer_dirty().
- Call btrfs_meta_folio_clear_and_test_dirty() to clear the dirty flags
for both regular and subpage metadata cases
- Only call btree_clear_folio_dirty_tag() when the folio is no longer
dirty
- Update the comment inside set_extent_buffer_dirty()
As there is no separate clear_subpage_extent_buffer_dirty() anymore.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The following functions are doing metadata specific checks:
- set_extent_buffer_uptodate()
- clear_extent_buffer_uptodate()
The reason why we do not use btrfs_folio_*() helpers for those helpers
is, btrfs_is_subpage() cannot handle dummy extent buffer if nodesize >=
PAGE_SIZE but block size < PAGE_SIZE.
In that case, we do not need to attach extra bitmaps to the extent
buffer folio. But since dummy extent buffer folios are not attached to
btree inode, btrfs_is_subpage() will return true, causing problems.
And the following are using btrfs_folio_*() helpers for metadata, but
in theory we should use metadata specific checks:
- set_extent_buffer_dirty()
This is not causing problems because a dummy extent buffer should never
be marked dirty.
To make code simpler, introduce btrfs_meta_folio_*() helpers, to do
the metadata specific handling, so that we do not to open-code such
checks in above involved functions.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently subpage attach/detach is not doing proper dummy extent buffer
subpage check, as btrfs_is_subpage() is not reliable for dummy extent
buffer folios.
Since we have a metadata specific check now, use that for
btrfs_attach_subpage() first.
Then enhance btrfs_detach_subpage() to accept a type parameter, so that
we can do extra checks for dummy extent buffers properly.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we have only one btrfs_is_subpage() to cover both data and
metadata.
But there is a special case for metadata:
- dummy extent buffer, sector size < PAGE_SIZE and node size >= PAGE_SIZE
In such case, btrfs_is_subpage() will return true for extent buffer
folio.
But that is not correct, and that's exactly why we have some open-coded
checks for functions like set_extent_buffer_uptodate() and
clear_extent_buffer_uptodate().
Just extract the metadata specific checks into a helper, and replace
those call sites.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For the future large folio support, our filemap can have folios with
different sizes, thus we can no longer rely on a fixed blocks_per_page
value.
To prepare for that future, here we do:
- Remove btrfs_fs_info::sectors_per_page
- Introduce a helper, btrfs_blocks_per_folio()
Which uses the folio size to calculate the number of blocks for each
folio.
- Migrate the existing btrfs_fs_info::sectors_per_page to use that
helper
There are some exceptions:
* Metadata nodesize < page size support
In the future, even if we support large folios, we will only
allocate a folio that matches our nodesize.
Thus we won't have a folio covering multiple metadata unless
nodesize < page size.
* Existing subpage bitmap dump
We use a single unsigned long to store the bitmap.
That means until we change the bitmap dumping code, our upper limit
for folio size will only be 256K (4K block size, 64 bit unsigned
long).
* btrfs_is_subpage() check
This will be migrated into a future patch.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The ordered extent cleanup is hard to grasp because it doesn't follow
the common cleanup-asap pattern.
E.g. run_delalloc_nocow() and cow_file_range() allocate one or more
ordered extent, but if any error is hit, the cleanup is done later inside
btrfs_run_delalloc_range().
To change the existing delayed cleanup:
- Update the comment on error handling of run_delalloc_nocow()
There are in fact 3 different cases other than 2 if we are doing
ordered extents cleanup inside run_delalloc_nocow():
1) @cow_start and @cow_end not set
No fallback to COW at all.
Before @cur_offset we need to cleanup the OE and page dirty.
After @cur_offset just clear all involved page and extent flags.
2) @cow_start set but @cow_end not set.
This means we failed before even calling fallback_to_cow().
It's just a variant of case 1), where it's @cow_start splitting
the two parts (and we should just ignore @cur_offset since it's
advanced without any new ordered extent).
3) @cow_start and @cow_end both set
This means fallback_to_cow() failed, meaning [start, cow_start)
needs the regular OE and dirty folio cleanup, and skip range
[cow_start, cow_end) as cow_file_range() has done the cleanup,
and eventually cleanup [cow_end, end) range.
- Only reset @cow_start after fallback_to_cow() succeeded
As above case 2) and 3) are both relying on @cow_start to determine
the cleanup range.
- Move btrfs_cleanup_ordered_extents() into run_delalloc_nocow(),
cow_file_range() and nocow_one_range()
For cow_file_range() it's pretty straightforward and easy.
For run_delalloc_nocow() refer to the above 3 different error cases.
For nocow_one_range() if we hit an error, we need to cleanup the
ordered extents by ourselves.
And then it will fallback to case 1), since @cur_offset is not yet
advanced, the existing cleanup will co-operate with nocow_one_range()
well.
- Remove the btrfs_cleanup_ordered_extents() inside submit_uncompressed_range()
As failed cow_file_range() will do all the proper cleanup now.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we're doing all the ordered extent and extent map generation
inside a while() loop of run_delalloc_nocow(). This makes it pretty
hard to read, nor doing proper error handling.
So move that part of code into a helper, nocow_one_range().
This should not change anything, but there is a tiny timing change where
btrfs_dec_nocow_writers() is only called after nocow_one_range() helper
exits.
This timing change is small, and makes error handling easier, thus
should be fine.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The address space flag AS_STABLE_WRITES determine if FGP_STABLE for will
wait for the folio to finish its writeback.
For btrfs, due to the default data checksum behavior, if we modify the
folio while it's still under writeback, it will cause data checksum
mismatch. Thus for quite some call sites we manually call
folio_wait_writeback() to prevent such problem from happening.
Currently there is only one call site inside btrfs really utilizing
FGP_STABLE, and in that case we also manually call folio_wait_writeback()
to do the waiting.
But it's better to properly expose the stable writes flag to a per-inode
basis, to allow call sites to fully benefit from FGP_STABLE flag.
E.g. for inodes with NODATASUM allowing beginning dirtying the page
without waiting for writeback.
This involves:
- Update the mapping's stable write flag when setting/clearing NODATASUM
inode flag using ioctl
This only works for empty files, so it should be fine.
- Update the mapping's stable write flag when reading an inode from disk
- Remove the explicit folio_wait_writeback() for FGP_BEGINWRITE call
site
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently for s390x HW zlib compression, to get the best performance we
need a buffer size which is larger than a page.
This means we need to copy multiple pages into workspace->buf, then use
that buffer as zlib compression input.
Currently it's hardcoded using page sized folio, and all the handling
are deep inside a loop.
Refactor the code by:
- Introduce a dedicated helper to do the buffer copy
The new helper will be called copy_data_into_buffer().
- Add extra ASSERT()s
* Make sure we only go into the function for hardware acceleration
* Make sure we still get page sized folio
- Prepare for future large folios
This means we will rely on the folio size, other than PAGE_SIZE to do
the copy.
- Handle the folio mapping and unmapping inside the helper function
For S390x hardware acceleration case, it never utilize the @data_in
pointer, thus we can do folio mapping/unmapping all inside the function.
Acked-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Tested-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At btrfs_do_readpage() if we get an extent map for a prealloc extent we
end up assigning twice to the 'block_start' variable, first the value
returned by extent_map_block_start() and then EXTENT_MAP_HOLE. This is
pointless so make it more clear by using an if-else statement and doing
only one assignment. Also, while at it, move the declaration of
'block_start' into the while loop's scope, since it's not used outside of
it and the related 'disk_bytenr' is also declared in this scope.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
It is a long known bug that VM image on btrfs can lead to data csum
mismatch, if the qemu is using direct-io for the image (this is commonly
known as cache mode 'none').
[CAUSE]
Inside the VM, if the fs is EXT4 or XFS, or even NTFS from Windows, the
fs is allowed to dirty/modify the folio even if the folio is under
writeback (as long as the address space doesn't have AS_STABLE_WRITES
flag inherited from the block device).
This is a valid optimization to improve the concurrency, and since these
filesystems have no extra checksum on data, the content change is not a
problem at all.
But the final write into the image file is handled by btrfs, which needs
the content not to be modified during writeback, or the checksum will
not match the data (checksum is calculated before submitting the bio).
So EXT4/XFS/NTRFS assume they can modify the folio under writeback, but
btrfs requires no modification, this leads to the false csum mismatch.
This is only a controlled example, there are even cases where
multi-thread programs can submit a direct IO write, then another thread
modifies the direct IO buffer for whatever reason.
For such cases, btrfs has no sane way to detect such cases and leads to
false data csum mismatch.
[FIX]
I have considered the following ideas to solve the problem:
- Make direct IO to always skip data checksum
This not only requires a new incompatible flag, as it breaks the
current per-inode NODATASUM flag.
But also requires extra handling for no csum found cases.
And this also reduces our checksum protection.
- Let hardware handle all the checksum
AKA, just nodatasum mount option.
That requires trust for hardware (which is not that trustful in a lot
of cases), and it's not generic at all.
- Always fallback to buffered write if the inode requires checksum
This was suggested by Christoph, and is the solution utilized by this
patch.
The cost is obvious, the extra buffer copying into page cache, thus it
reduces the performance.
But at least it's still user configurable, if the end user still wants
the zero-copy performance, just set NODATASUM flag for the inode
(which is a common practice for VM images on btrfs).
Since we cannot trust user space programs to keep the buffer
consistent during direct IO, we have no choice but always falling back
to buffered IO. At least by this, we avoid the more deadly false data
checksum mismatch error.
Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In that function we call set_extent_buffer_uptodate() or
clear_extent_buffer_uptodate(), which will already update the uptodate
flag for all the involved extent buffer folios.
Thus there is no need to update the folio uptodate flags again.
Just remove the open-coded part.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Retrieve folios instead of pages and work on them throughout. Removes
a few calls to compound_head() and a reference to page->mapping.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: David Sterba <dsterba@suse.com>
Remove references to the page lock and page->mapping. Also btrfs folios
can never be swizzled into swap (mentioned in extent_write_cache_pages()).
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Only allocate the btrfs_encoded_read_private structure for asynchronous
(io_uring) mode.
There's no need to allocate an object from slab in the synchronous mode. In
such a case stack can be happily used as it used to be before 68d3b27e05
("btrfs: move priv off stack in btrfs_encoded_read_regular_fill_pages()")
which was a preparation for the async mode.
While at it, fix the comment to reflect the atomic => refcount change in
d29662695e ("btrfs: fix use-after-free waiting for encoded read endios").
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit b35108a51c ("jiffies: Define secs_to_jiffies()") introduced
secs_to_jiffies(). As the value here is a multiple of 1000, use
secs_to_jiffies() instead of msecs_to_jiffies() to avoid the
multiplication
This is converted using scripts/coccinelle/misc/secs_to_jiffies.cocci with
the following Coccinelle rules:
@depends on patch@
expression E;
@@
-msecs_to_jiffies
+secs_to_jiffies
(E
- * \( 1000 \| MSEC_PER_SEC \)
)
Link: https://lkml.kernel.org/r/20250225-converge-secs-to-jiffies-part-two-v3-5-a43967e36c88@linux.microsoft.com
Signed-off-by: Easwar Hariharan <eahariha@linux.microsoft.com>
Acked-by: David Sterba <dsterba@suse.com>
Cc: Carlos Maiolino <cem@kernel.org>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Chris Mason <clm@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Maol <dlemoal@kernel.org>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Dick Kennedy <dick.kennedy@broadcom.com>
Cc: Dongsheng Yang <dongsheng.yang@easystack.cn>
Cc: Fabio Estevam <festevam@gmail.com>
Cc: Frank Li <frank.li@nxp.com>
Cc: Hans de Goede <hdegoede@redhat.com>
Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
Cc: Ilpo Jarvinen <ilpo.jarvinen@linux.intel.com>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: James Bottomley <james.bottomley@HansenPartnership.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Jaroslav Kysela <perex@perex.cz>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Julia Lawall <julia.lawall@inria.fr>
Cc: Kalesh Anakkur Purayil <kalesh-anakkur.purayil@broadcom.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Marc Kleine-Budde <mkl@pengutronix.de>
Cc: Mark Brown <broonie@kernel.org>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Nicolas Palix <nicolas.palix@imag.fr>
Cc: Niklas Cassel <cassel@kernel.org>
Cc: Oded Gabbay <ogabbay@kernel.org>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Sascha Hauer <s.hauer@pengutronix.de>
Cc: Sebastian Reichel <sre@kernel.org>
Cc: Selvin Thyparampil Xavier <selvin.xavier@broadcom.com>
Cc: Shawn Guo <shawnguo@kernel.org>
Cc: Shyam-sundar S-k <Shyam-sundar.S-k@amd.com>
Cc: Takashi Iwai <tiwai@suse.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmfLM5wACgkQxWXV+ddt
WDsK3A/7BEIUzin4CpmhBkFQamPCLjLL+Zz2etmoYWCrKnNPRMutVbsgeRM43cBt
NXMD4RSoeXO/aYzrPhe4KMP4a5PkI02v2CEpPJqMRPmbADGyExx5Vnh68ioZWQbi
N54Sd5LqhMT9FcViG46VJXr+MOBKIzO8++TxswIrCDO+6X/Y39+xZGxj4DXrnF38
zgvxbILbiH+7vC1m9NV8K7Vl0jp36hQKcCjJYCfohbVoFQiyvmuh2x0LDL2HnIfH
VpREP+eo/a3ZO8vPo7+4HZ5DVf5AolulbEC6myxsvFScLhWlh218plVyuv4QyACW
RYDm9MqLqfqOkEDgj+Tb0C4s6uyVon5xbRL3aNbSE73KnUVeb/bB77qAejjzAkIr
MvEEeEJp0H34OZm2fnUyFIu3ShDcSif1qH0rCOm1rBeqYZZsX7ny7TvKIqkgrsKk
JbzgpYLyzzqTHs9QERw3OUhIBuefFCs4HlUeukLbUCdqI+ruPp5s76jfHQnT3dzG
ad5CUW8eHf6mkU93dUlQIeDJSVPdaanf0Whomk3eOKgBeu8+gNp9R41kKJ7UtoA9
GG504bqNjSe8t0sVmSyuE30BWAQWYnyCSY/9u46JrB6MtfWv+wikU/Nox4qZjM4d
UhhWkDTELaTngcYkbm5+MD0DkkglTeqEbR9gCM21c9xiJrojhcw=
=v6KI
-----END PGP SIGNATURE-----
Merge tag 'for-6.14-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix leaked extent map after error when reading chunks
- replace use of deprecated strncpy
- in zoned mode, fixed range when ulocking extent range, causing a hang
* tag 'for-6.14-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix a leaked chunk map issue in read_one_chunk()
btrfs: replace deprecated strncpy() with strscpy()
btrfs: zoned: fix extent range end unlock in cow_file_range()
Add btrfs_free_chunk_map() to free the memory allocated
by btrfs_alloc_chunk_map() if btrfs_add_chunk_map() fails.
Fixes: 7dc66abb5a ("btrfs: use a dedicated data structure for chunk maps")
CC: stable@vger.kernel.org
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Haoxiang Li <haoxiang_li2024@163.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Some filesystems, such as NFS, cifs, ceph, and fuse, do not have
complete control of sequencing on the actual filesystem (e.g. on a
different server) and may find that the inode created for a mkdir
request already exists in the icache and dcache by the time the mkdir
request returns. For example, if the filesystem is mounted twice the
directory could be visible on the other mount before it is on the
original mount, and a pair of name_to_handle_at(), open_by_handle_at()
calls could instantiate the directory inode with an IS_ROOT() dentry
before the first mkdir returns.
This means that the dentry passed to ->mkdir() may not be the one that
is associated with the inode after the ->mkdir() completes. Some
callers need to interact with the inode after the ->mkdir completes and
they currently need to perform a lookup in the (rare) case that the
dentry is no longer hashed.
This lookup-after-mkdir requires that the directory remains locked to
avoid races. Planned future patches to lock the dentry rather than the
directory will mean that this lookup cannot be performed atomically with
the mkdir.
To remove this barrier, this patch changes ->mkdir to return the
resulting dentry if it is different from the one passed in.
Possible returns are:
NULL - the directory was created and no other dentry was used
ERR_PTR() - an error occurred
non-NULL - this other dentry was spliced in
This patch only changes file-systems to return "ERR_PTR(err)" instead of
"err" or equivalent transformations. Subsequent patches will make
further changes to some file-systems to return a correct dentry.
Not all filesystems reliably result in a positive hashed dentry:
- NFS, cifs, hostfs will sometimes need to perform a lookup of
the name to get inode information. Races could result in this
returning something different. Note that this lookup is
non-atomic which is what we are trying to avoid. Placing the
lookup in filesystem code means it only happens when the filesystem
has no other option.
- kernfs and tracefs leave the dentry negative and the ->revalidate
operation ensures that lookup will be called to correctly populate
the dentry. This could be fixed but I don't think it is important
to any of the users of vfs_mkdir() which look at the dentry.
The recommendation to use
d_drop();d_splice_alias()
is ugly but fits with current practice. A planned future patch will
change this.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: NeilBrown <neilb@suse.de>
Link: https://lore.kernel.org/r/20250227013949.536172-2-neilb@suse.de
Signed-off-by: Christian Brauner <brauner@kernel.org>
strncpy() is deprecated for NUL-terminated destination buffers. Use
strscpy() instead and don't zero-initialize the param array.
Link: https://github.com/KSPP/linux/issues/90
Cc: linux-hardening@vger.kernel.org
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Running generic/751 on the for-next branch often results in a hang like
below. They are both stack by locking an extent. This suggests someone
forget to unlock an extent.
INFO: task kworker/u128:1:12 blocked for more than 323 seconds.
Not tainted 6.13.0-BTRFS-ZNS+ #503
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u128:1 state:D stack:0 pid:12 tgid:12 ppid:2 flags:0x00004000
Workqueue: btrfs-fixup btrfs_work_helper [btrfs]
Call Trace:
<TASK>
__schedule+0x534/0xdd0
schedule+0x39/0x140
__lock_extent+0x31b/0x380 [btrfs]
? __pfx_autoremove_wake_function+0x10/0x10
btrfs_writepage_fixup_worker+0xf1/0x3a0 [btrfs]
btrfs_work_helper+0xff/0x480 [btrfs]
? lock_release+0x178/0x2c0
process_one_work+0x1ee/0x570
? srso_return_thunk+0x5/0x5f
worker_thread+0x1d1/0x3b0
? __pfx_worker_thread+0x10/0x10
kthread+0x10b/0x230
? __pfx_kthread+0x10/0x10
ret_from_fork+0x30/0x50
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
INFO: task kworker/u134:0:184 blocked for more than 323 seconds.
Not tainted 6.13.0-BTRFS-ZNS+ #503
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u134:0 state:D stack:0 pid:184 tgid:184 ppid:2 flags:0x00004000
Workqueue: writeback wb_workfn (flush-btrfs-4)
Call Trace:
<TASK>
__schedule+0x534/0xdd0
schedule+0x39/0x140
__lock_extent+0x31b/0x380 [btrfs]
? __pfx_autoremove_wake_function+0x10/0x10
find_lock_delalloc_range+0xdb/0x260 [btrfs]
writepage_delalloc+0x12f/0x500 [btrfs]
? srso_return_thunk+0x5/0x5f
extent_write_cache_pages+0x232/0x840 [btrfs]
btrfs_writepages+0x72/0x130 [btrfs]
do_writepages+0xe7/0x260
? srso_return_thunk+0x5/0x5f
? lock_acquire+0xd2/0x300
? srso_return_thunk+0x5/0x5f
? find_held_lock+0x2b/0x80
? wbc_attach_and_unlock_inode.part.0+0x102/0x250
? wbc_attach_and_unlock_inode.part.0+0x102/0x250
__writeback_single_inode+0x5c/0x4b0
writeback_sb_inodes+0x22d/0x550
__writeback_inodes_wb+0x4c/0xe0
wb_writeback+0x2f6/0x3f0
wb_workfn+0x32a/0x510
process_one_work+0x1ee/0x570
? srso_return_thunk+0x5/0x5f
worker_thread+0x1d1/0x3b0
? __pfx_worker_thread+0x10/0x10
kthread+0x10b/0x230
? __pfx_kthread+0x10/0x10
ret_from_fork+0x30/0x50
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
This happens because we have another success path for the zoned mode. When
there is no active zone available, btrfs_reserve_extent() returns
-EAGAIN. In this case, we have two reactions.
(1) If the given range is never allocated, we can only wait for someone
to finish a zone, so wait on BTRFS_FS_NEED_ZONE_FINISH bit and retry
afterward.
(2) Or, if some allocations are already done, we must bail out and let
the caller to send IOs for the allocation. This is because these IOs
may be necessary to finish a zone.
The commit 06f3642847 ("btrfs: do proper folio cleanup when
cow_file_range() failed") moved the unlock code from the inside of the
loop to the outside. So, previously, the allocated extents are unlocked
just after the allocation and so before returning from the function.
However, they are no longer unlocked on the case (2) above. That caused
the hang issue.
Fix the issue by modifying the 'end' to the end of the allocated
range. Then, we can exit the loop and the same unlock code can properly
handle the case.
Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Fixes: 06f3642847 ("btrfs: do proper folio cleanup when cow_file_range() failed")
CC: stable@vger.kernel.org
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAme95g8ACgkQxWXV+ddt
WDvi3g//V55iBXnPv0Jrs7b95GRskYv8A4vJsZhGtub4PlcEh8S6Q1IoU3qwiKHv
E2THDA/A14qetxh3tSo73+RdS3JHpIH4QKjO54k74gOh45OEUs4Lq8NBAujmpz4b
BMZZnM5iyZipNfbebUa/XxlPLvHg8D2rUqwycS/A0c5BE56HTvVzmKL3RdUfkAvA
uZaJa6FOKfr6ge3ikl/dm+Rl7f+ZymIK4T9XsW3Lt223siYvcLJvWEIL0tk9B1y/
ZUQNqPOCHY0mX/zPC0425LoeH3LWDPyZPCakaY8tiwI20p/sP+hPLBC8WDrJvoam
losu6v8EqkYK9zND/ETVq3d1Y9mzub/soKuM+aDQ/UM0JXz1vI3RYQcpskECR0Gf
ZPq5tv+dSBbMmscvkxnkuNBaTr3IbOhkxaKwOvdoRN9F4HbmhgxTscshaQHklmiG
4qRx2HtW9Zw8ufyLUFUYaRAj45eFDZMQStQMCNSECD8X+fS6CPGUqGFcuXrm+kLL
v6k0cbvh1NOLSchqtfR4rochJFUp5veUNHoYQ7YRy3CqV1yrF7iM1e0G1rvyOQYQ
9tpN93IYwLItRdUjtqyS/q8WOddRTo0LTqh5HDXPnLd3jc/kO7KjHv9dJna7wyhO
MUJmLlpy1dRDHCvTl70oF0Nxe4Ve20n7U2QayF5bMGtCmQnzGL0=
=4+6s
-----END PGP SIGNATURE-----
Merge tag 'for-6.14-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- extent map shrinker fixes:
- fix potential use after free accessing an inode to reach fs_info,
the shrinker could do iput() in the meantime
- skip unnecessary scanning of inodes without extent maps
- do direct iput(), no need for indirection via workqueue
- in block < page mode, fix race when extending i_size in buffered mode
- fix minor memory leak in selftests
- print descriptive error message when seeding device is not found
* tag 'for-6.14-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix data overwriting bug during buffered write when block size < page size
btrfs: output an error message if btrfs failed to find the seed fsid
btrfs: do regular iput instead of delayed iput during extent map shrinking
btrfs: skip inodes without loaded extent maps when shrinking extent maps
btrfs: fix use-after-free on inode when scanning root during em shrinking
btrfs: selftests: fix btrfs_test_delayed_refs() leak of transaction
[BUG]
When running generic/418 with a btrfs whose block size < page size
(subpage cases), it always fails.
And the following minimal reproducer is more than enough to trigger it
reliably:
workload()
{
mkfs.btrfs -s 4k -f $dev > /dev/null
dmesg -C
mount $dev $mnt
$fsstree_dir/src/dio-invalidate-cache -r -b 4096 -n 3 -i 1 -f $mnt/diotest
ret=$?
umount $mnt
stop_trace
if [ $ret -ne 0 ]; then
fail
fi
}
for (( i = 0; i < 1024; i++)); do
echo "=== $i/$runtime ==="
workload
done
[CAUSE]
With extra trace printk added to the following functions:
- btrfs_buffered_write()
* Which folio is touched
* The file offset (start) where the buffered write is at
* How many bytes are copied
* The content of the write (the first 2 bytes)
- submit_one_sector()
* Which folio is touched
* The position inside the folio
* The content of the page cache (the first 2 bytes)
- pagecache_isize_extended()
* The parameters of the function itself
* The parameters of the folio_zero_range()
Which are enough to show the problem:
22.158114: btrfs_buffered_write: folio pos=0 start=0 copied=4096 content=0x0101
22.158161: submit_one_sector: r/i=5/257 folio=0 pos=0 content=0x0101
22.158609: btrfs_buffered_write: folio pos=0 start=4096 copied=4096 content=0x0101
22.158634: btrfs_buffered_write: folio pos=0 start=8192 copied=4096 content=0x0101
22.158650: pagecache_isize_extended: folio=0 from=4096 to=8192 bsize=4096 zero off=4096 len=8192
22.158682: submit_one_sector: r/i=5/257 folio=0 pos=4096 content=0x0000
22.158686: submit_one_sector: r/i=5/257 folio=0 pos=8192 content=0x0101
The tool dio-invalidate-cache will start 3 threads, each doing a buffered
write with 0x01 at offset 0, 4096 and 8192, do a fsync, then do a direct read,
and compare the read buffer with the write buffer.
Note that all 3 btrfs_buffered_write() are writing the correct 0x01 into
the page cache.
But at submit_one_sector(), at file offset 4096, the content is zeroed
out, by pagecache_isize_extended().
The race happens like this:
Thread A is writing into range [4K, 8K).
Thread B is writing into range [8K, 12k).
Thread A | Thread B
-------------------------------------+------------------------------------
btrfs_buffered_write() | btrfs_buffered_write()
|- old_isize = 4K; | |- old_isize = 4096;
|- btrfs_inode_lock() | |
|- write into folio range [4K, 8K) | |
|- pagecache_isize_extended() | |
| extend isize from 4096 to 8192 | |
| no folio_zero_range() called | |
|- btrfs_inode_lock() | |
| |- btrfs_inode_lock()
| |- write into folio range [8K, 12K)
| |- pagecache_isize_extended()
| | calling folio_zero_range(4K, 8K)
| | This is caused by the old_isize is
| | grabbed too early, without any
| | inode lock.
| |- btrfs_inode_unlock()
The @old_isize is grabbed without inode lock, causing race between two
buffered write threads and making pagecache_isize_extended() to zero
range which is still containing cached data.
And this is only affecting subpage btrfs, because for regular blocksize
== page size case, the function pagecache_isize_extended() will do
nothing if the block size >= page size.
[FIX]
Grab the old i_size while holding the inode lock.
This means each buffered write thread will have a stable view of the
old inode size, thus avoid the above race.
CC: stable@vger.kernel.org # 5.15+
Fixes: 5e8b9ef303 ("btrfs: move pos increment and pagecache extension to btrfs_buffered_write")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
If btrfs failed to locate the seed device for whatever reason, mounting
the sprouted device will fail without any meaning error message:
# mkfs.btrfs -f /dev/test/scratch1
# btrfstune -S1 /dev/test/scratch1
# mount /dev/test/scratch1 /mnt/btrfs
# btrfs dev add -f /dev/test/scratch2 /mnt/btrfs
# umount /mnt/btrfs
# btrfs dev scan -u
# btrfs mount /dev/test/scratch2 /mnt/btrfs
mount: /mnt/btrfs: fsconfig system call failed: No such file or directory.
dmesg(1) may have more information after failed mount system call.
# dmesg -t | tail -n6
BTRFS info (device dm-5): first mount of filesystem 64252ded-5953-4868-b962-cea48f7ac4ea
BTRFS info (device dm-5): using crc32c (crc32c-generic) checksum algorithm
BTRFS info (device dm-5): using free-space-tree
BTRFS error (device dm-5): failed to read chunk tree: -2
BTRFS error (device dm-5): open_ctree failed: -2
[CAUSE]
The failure to mount is pretty straight forward, just unable to find the
seed device and its fsid, caused by `btrfs dev scan -u`.
But the lack of any useful info is a problem.
[FIX]
Just add an extra error message in open_seed_devices() to indicate the
error.
Now the error message would look like this:
BTRFS info (device dm-4): first mount of filesystem 7769223d-4db1-4e4c-ac29-0a96f53576ab
BTRFS info (device dm-4): using crc32c (crc32c-generic) checksum algorithm
BTRFS info (device dm-4): using free-space-tree
BTRFS error (device dm-4): failed to find fsid e87c12e6-584b-4e98-8b88-962c33a619ff when attempting to open seed devices
BTRFS error (device dm-4): failed to read chunk tree: -2
BTRFS error (device dm-4): open_ctree failed: -2
Link: https://github.com/kdave/btrfs-progs/issues/959
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The extent map shrinker now runs in the system unbound workqueue and no
longer in kswapd context so it can directly do an iput() on inodes even
if that blocks or needs to acquire any lock (we aren't holding any locks
when requesting the delayed iput from the shrinker). So we don't need to
add a delayed iput, wake up the cleaner and delegate the iput() to the
cleaner, which also adds extra contention on the spinlock that protects
the delayed iputs list.
Reported-by: Ivan Shapovalov <intelfx@intelfx.name>
Tested-by: Ivan Shapovalov <intelfx@intelfx.name>
Link: https://lore.kernel.org/linux-btrfs/0414d690ac5680d0d77dfc930606cdc36e42e12f.camel@intelfx.name/
CC: stable@vger.kernel.org # 6.12+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If there are inodes that don't have any loaded extent maps, we end up
grabbing a reference on them and later adding a delayed iput, which wakes
up the cleaner and makes it do unnecessary work. This is common when for
example the inodes were open only to run stat(2) or all their extent maps
were already released through the folio release callback
(btrfs_release_folio()) or released by a previous run of the shrinker, or
directories which never have extent maps.
Reported-by: Ivan Shapovalov <intelfx@intelfx.name>
Tested-by: Ivan Shapovalov <intelfx@intelfx.name>
Link: https://lore.kernel.org/linux-btrfs/0414d690ac5680d0d77dfc930606cdc36e42e12f.camel@intelfx.name/
CC: stable@vger.kernel.org # 6.13+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At btrfs_scan_root() we are accessing the inode's root (and fs_info) in a
call to btrfs_fs_closing() after we have scheduled the inode for a delayed
iput, and that can result in a use-after-free on the inode in case the
cleaner kthread does the iput before we dereference the inode in the call
to btrfs_fs_closing().
Fix this by using the fs_info stored already in a local variable instead
of doing inode->root->fs_info.
Fixes: 1020443840 ("btrfs: make the extent map shrinker run asynchronously as a work queue job")
CC: stable@vger.kernel.org # 6.13+
Tested-by: Ivan Shapovalov <intelfx@intelfx.name>
Link: https://lore.kernel.org/linux-btrfs/0414d690ac5680d0d77dfc930606cdc36e42e12f.camel@intelfx.name/
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmeuSwwACgkQxWXV+ddt
WDtQ8Q//fsTAu1DLjeVrzMhVNswGwr3PgWzLk3PqWTDEG9UeR/jJYntIPglVyhhP
Mp2E3CYe2rlWwK0K3PITDu179tLrnCvbfKEPwWvyVZw5D0EDjPQYs9/H5ztSE8O4
4i3kv2LjlXHE3h62tjNoeHL4NK1SRJcFeH69XhhIe0ELvTQVarvfJupZwdQQivWg
sDlQXklXxl1kEtHVGnmz6jd09a0vti7xw8MAG6QiIP83Hvt6Ie+NLfTfTCkRIWSK
95mPM+1YhmLQe15sD8xjHyYmH5E0cEXQh1Pvlz6xqQWRvZERG8Pmj+iwFTLaw4iA
JR6sN2/KFgXE9OIGbFqQ+dvm++2hWcnPwW+h6EdOSj0DQkupbJm4VeBK0WQ4YZ+x
Q0OQXPTfGpcjp7KyJrT6EZFq5VxeEfOz4hozhiCSTs+Xpx7Oh/2THL01N/dUMn0C
SNR9E4/Rlq7rWV7euGwicwo/tZZIdCr4ihUGk4jpamlUbIXj+2SrOc4cpQdypmsO
DeYvwzIXnPe8/Eo3rZ5ej0DK7GxfEFyd6v6l0oS6HepvMJ6y6/eiOYteVbGpvhXv
J2M6PLstiZc152VHPApN9+ZlXBeGjyMfxLcsweblpSBBt/57otY6cMhqNuIp0j9B
0zP3KKOwrIJ8tzcwjMSH+2OZsDQ7oc7eiJI08r0IcpCbCBTIeyE=
=0hI+
-----END PGP SIGNATURE-----
Merge tag 'for-6.14-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix stale page cache after race between readahead and direct IO write
- fix hole expansion when writing at an offset beyond EOF, the range
will not be zeroed
- use proper way to calculate offsets in folio ranges
* tag 'for-6.14-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix hole expansion when writing at an offset beyond EOF
btrfs: fix stale page cache after race between readahead and direct IO write
btrfs: fix two misuses of folio_shift()
At btrfs_write_check() if our file's i_size is not sector size aligned and
we have a write that starts at an offset larger than the i_size that falls
within the same page of the i_size, then we end up not zeroing the file
range [i_size, write_offset).
The code is this:
start_pos = round_down(pos, fs_info->sectorsize);
oldsize = i_size_read(inode);
if (start_pos > oldsize) {
/* Expand hole size to cover write data, preventing empty gap */
loff_t end_pos = round_up(pos + count, fs_info->sectorsize);
ret = btrfs_cont_expand(BTRFS_I(inode), oldsize, end_pos);
if (ret)
return ret;
}
So if our file's i_size is 90269 bytes and a write at offset 90365 bytes
comes in, we get 'start_pos' set to 90112 bytes, which is less than the
i_size and therefore we don't zero out the range [90269, 90365) by
calling btrfs_cont_expand().
This is an old bug introduced in commit 9036c10208 ("Btrfs: update hole
handling v2"), from 2008, and the buggy code got moved around over the
years.
Fix this by discarding 'start_pos' and comparing against the write offset
('pos') without any alignment.
This bug was recently exposed by test case generic/363 which tests this
scenario by polluting ranges beyond EOF with an mmap write and than verify
that after a file increases we get zeroes for the range which is supposed
to be a hole and not what we wrote with the previous mmaped write.
We're only seeing this exposed now because generic/363 used to run only
on xfs until last Sunday's fstests update.
The test was failing like this:
$ ./check generic/363
FSTYP -- btrfs
PLATFORM -- Linux/x86_64 debian0 6.13.0-rc7-btrfs-next-185+ #17 SMP PREEMPT_DYNAMIC Mon Feb 3 12:28:46 WET 2025
MKFS_OPTIONS -- /dev/sdc
MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1
generic/363 0s ... [failed, exit status 1]- output mismatch (see /home/fdmanana/git/hub/xfstests/results//generic/363.out.bad)
--- tests/generic/363.out 2025-02-05 15:31:14.013646509 +0000
+++ /home/fdmanana/git/hub/xfstests/results//generic/363.out.bad 2025-02-05 17:25:33.112630781 +0000
@@ -1 +1,46 @@
QA output created by 363
+READ BAD DATA: offset = 0xdcad, size = 0xd921, fname = /home/fdmanana/btrfs-tests/dev/junk
+OFFSET GOOD BAD RANGE
+0x1609d 0x0000 0x3104 0x0
+operation# (mod 256) for the bad data may be 4
+0x1609e 0x0000 0x0472 0x1
+operation# (mod 256) for the bad data may be 4
...
(Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/generic/363.out /home/fdmanana/git/hub/xfstests/results//generic/363.out.bad' to see the entire diff)
Ran: generic/363
Failures: generic/363
Failed 1 of 1 tests
Fixes: 9036c10208 ("Btrfs: update hole handling v2")
CC: stable@vger.kernel.org
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
After commit ac325fc2aa ("btrfs: do not hold the extent lock for entire
read") we can now trigger a race between a task doing a direct IO write
and readahead. When this race is triggered it results in tasks getting
stale data when they attempt do a buffered read (including the task that
did the direct IO write).
This race can be sporadically triggered with test case generic/418, failing
like this:
$ ./check generic/418
FSTYP -- btrfs
PLATFORM -- Linux/x86_64 debian0 6.13.0-rc7-btrfs-next-185+ #17 SMP PREEMPT_DYNAMIC Mon Feb 3 12:28:46 WET 2025
MKFS_OPTIONS -- /dev/sdc
MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1
generic/418 14s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//generic/418.out.bad)
--- tests/generic/418.out 2020-06-10 19:29:03.850519863 +0100
+++ /home/fdmanana/git/hub/xfstests/results//generic/418.out.bad 2025-02-03 15:42:36.974609476 +0000
@@ -1,2 +1,5 @@
QA output created by 418
+cmpbuf: offset 0: Expected: 0x1, got 0x0
+[6:0] FAIL - comparison failed, offset 24576
+diotest -wp -b 4096 -n 8 -i 4 failed at loop 3
Silence is golden
...
(Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/generic/418.out /home/fdmanana/git/hub/xfstests/results//generic/418.out.bad' to see the entire diff)
Ran: generic/418
Failures: generic/418
Failed 1 of 1 tests
The race happens like this:
1) A file has a prealloc extent for the range [16K, 28K);
2) Task A starts a direct IO write against file range [24K, 28K).
At the start of the direct IO write it invalidates the page cache at
__iomap_dio_rw() with kiocb_invalidate_pages() for the 4K page at file
offset 24K;
3) Task A enters btrfs_dio_iomap_begin() and locks the extent range
[24K, 28K);
4) Task B starts a readahead for file range [16K, 28K), entering
btrfs_readahead().
First it attempts to read the page at offset 16K by entering
btrfs_do_readpage(), where it calls get_extent_map(), locks the range
[16K, 20K) and gets the extent map for the range [16K, 28K), caching
it into the 'em_cached' variable declared in the local stack of
btrfs_readahead(), and then unlocks the range [16K, 20K).
Since the extent map has the prealloc flag, at btrfs_do_readpage() we
zero out the page's content and don't submit any bio to read the page
from the extent.
Then it attempts to read the page at offset 20K entering
btrfs_do_readpage() where we reuse the previously cached extent map
(decided by get_extent_map()) since it spans the page's range and
it's still in the inode's extent map tree.
Just like for the previous page, we zero out the page's content since
the extent map has the prealloc flag set.
Then it attempts to read the page at offset 24K entering
btrfs_do_readpage() where we reuse the previously cached extent map
(decided by get_extent_map()) since it spans the page's range and
it's still in the inode's extent map tree.
Just like for the previous pages, we zero out the page's content since
the extent map has the prealloc flag set. Note that we didn't lock the
extent range [24K, 28K), so we didn't synchronize with the ongoing
direct IO write being performed by task A;
5) Task A enters btrfs_create_dio_extent() and creates an ordered extent
for the range [24K, 28K), with the flags BTRFS_ORDERED_DIRECT and
BTRFS_ORDERED_PREALLOC set;
6) Task A unlocks the range [24K, 28K) at btrfs_dio_iomap_begin();
7) The ordered extent enters btrfs_finish_one_ordered() and locks the
range [24K, 28K);
8) Task A enters fs/iomap/direct-io.c:iomap_dio_complete() and it tries
to invalidate the page at offset 24K by calling
kiocb_invalidate_post_direct_write(), resulting in a call chain that
ends up at btrfs_release_folio().
The btrfs_release_folio() call ends up returning false because the range
for the page at file offset 24K is currently locked by the task doing
the ordered extent completion in the previous step (7), so we have:
btrfs_release_folio() ->
__btrfs_release_folio() ->
try_release_extent_mapping() ->
try_release_extent_state()
This last function checking that the range is locked and returning false
and propagating it up to btrfs_release_folio().
So this results in a failure to invalidate the page and
kiocb_invalidate_post_direct_write() triggers this message logged in
dmesg:
Page cache invalidation failure on direct I/O. Possible data corruption due to collision with buffered I/O!
After this we leave the page cache with stale data for the file range
[24K, 28K), filled with zeroes instead of the data written by direct IO
write (all bytes with a 0x01 value), so any task attempting to read with
buffered IO, including the task that did the direct IO write, will get
all bytes in the range with a 0x00 value instead of the written data.
Fix this by locking the range, with btrfs_lock_and_flush_ordered_range(),
at the two callers of btrfs_do_readpage() instead of doing it at
get_extent_map(), just like we did before commit ac325fc2aa ("btrfs: do
not hold the extent lock for entire read"), and unlocking the range after
all the calls to btrfs_do_readpage(). This way we never reuse a cached
extent map without flushing any pending ordered extents from a concurrent
direct IO write.
Fixes: ac325fc2aa ("btrfs: do not hold the extent lock for entire read")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It is meaningless to shift a byte count by folio_shift(). The folio index
is in units of PAGE_SIZE, not folio_size(). We can use folio_contains()
to make this work for arbitrary-order folios, so remove the assertion
that the folios are of order 0.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmejbmkACgkQxWXV+ddt
WDuXuQ/8DETuhqww7hwrDgHyDHRgj/783Oy1+V/jJPQZ1hpjAbSJBU7aKxneAMMc
Gj4ExbDfjd8kRAvsE71/1McJqVBd6CiuBG/k65vjqVS4lM8mEasVefCa4OsOR3iB
gGzQeNbrDzIs3IOg8hM1l1iPDkI/AyOyeysD/qNMdWO4mcKEMwrBYkQJhL6+DzfE
BKVP+NAdK4iv/W/EngAqvcd7mq2RdxeR9nBesHnKTzPCVFf6bBT3b3Qem/ovY6Td
ZNQLjx9GXumDj0jSiyRI5rxjSYOrqSU4JV+1C7ghOwBj2uD2SZxAvghuRSuUTKnN
9/9x1RNO7i+FE3GHzYWShhHkNuuLZspmi2J1neSttQ5Jy/PFJR/tUAED5zY6Nl0X
73GWSzLlmaJ+E77DkTJksBYmRrnwtA6plMKD4umDCHw0lNu/0GCvxtZY1T1ZiKy2
yK37Ja559+k7RPIdKoyHo81A7num4gLeBZTqd9F/XPjU26b57Qnqk1LetgGeT8xk
IZFQtz9HdmvLwwbxWwNvp1ttRf+1dj1lpVnb5n6r0d8Uyta9tgQpDvwhNjluBsEx
AQxK9yUZ5kAXEEbEyDwoOsZ5yjwkaMzqRpQWWappb0jCLm5dADI87odFRaOSlgWS
WoXL6Vbod8G5vaEVwPDl6yuSS+609c7M8ftBlgvcx3XZ5/N6y2M=
=YjCE
-----END PGP SIGNATURE-----
Merge tag 'for-6.14-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- add lockdep annotation for relocation root to fix a splat warning
while merging roots
- fix assertion failure when splitting ordered extent after transaction
abort
- don't print 'qgroup inconsistent' message when rescan process updates
qgroup data sooner than the subvolume deletion process
- fix use-after-free (accessing the error number) when attempting to
join an aborted transaction
- avoid starting new transaction if not necessary when cleaning qgroup
during subvolume drop
* tag 'for-6.14-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: avoid starting new transaction when cleaning qgroup during subvolume drop
btrfs: fix use-after-free when attempting to join an aborted transaction
btrfs: do not output error message if a qgroup has been already cleaned up
btrfs: fix assertion failure when splitting ordered extent after transaction abort
btrfs: fix lockdep splat while merging a relocation root
indivudual patches which are described in their changelogs.
- "Allocate and free frozen pages" from Matthew Wilcox reorganizes the
page allocator so we end up with the ability to allocate and free
zero-refcount pages. So that callers (ie, slab) can avoid a refcount
inc & dec.
- "Support large folios for tmpfs" from Baolin Wang teaches tmpfs to use
large folios other than PMD-sized ones.
- "Fix mm/rodata_test" from Petr Tesarik performs some maintenance and
fixes for this small built-in kernel selftest.
- "mas_anode_descend() related cleanup" from Wei Yang tidies up part of
the mapletree code.
- "mm: fix format issues and param types" from Keren Sun implements a
few minor code cleanups.
- "simplify split calculation" from Wei Yang provides a few fixes and a
test for the mapletree code.
- "mm/vma: make more mmap logic userland testable" from Lorenzo Stoakes
continues the work of moving vma-related code into the (relatively) new
mm/vma.c.
- "mm/page_alloc: gfp flags cleanups for alloc_contig_*()" from David
Hildenbrand cleans up and rationalizes handling of gfp flags in the page
allocator.
- "readahead: Reintroduce fix for improper RA window sizing" from Jan
Kara is a second attempt at fixing a readahead window sizing issue. It
should reduce the amount of unnecessary reading.
- "synchronously scan and reclaim empty user PTE pages" from Qi Zheng
addresses an issue where "huge" amounts of pte pagetables are
accumulated
(https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/).
Qi's series addresses this windup by synchronously freeing PTE memory
within the context of madvise(MADV_DONTNEED).
- "selftest/mm: Remove warnings found by adding compiler flags" from
Muhammad Usama Anjum fixes some build warnings in the selftests code
when optional compiler warnings are enabled.
- "mm: don't use __GFP_HARDWALL when migrating remote pages" from David
Hildenbrand tightens the allocator's observance of __GFP_HARDWALL.
- "pkeys kselftests improvements" from Kevin Brodsky implements various
fixes and cleanups in the MM selftests code, mainly pertaining to the
pkeys tests.
- "mm/damon: add sample modules" from SeongJae Park enhances DAMON to
estimate application working set size.
- "memcg/hugetlb: Rework memcg hugetlb charging" from Joshua Hahn
provides some cleanups to memcg's hugetlb charging logic.
- "mm/swap_cgroup: remove global swap cgroup lock" from Kairui Song
removes the global swap cgroup lock. A speedup of 10% for a tmpfs-based
kernel build was demonstrated.
- "zram: split page type read/write handling" from Sergey Senozhatsky
has several fixes and cleaups for zram in the area of zram_write_page().
A watchdog softlockup warning was eliminated.
- "move pagetable_*_dtor() to __tlb_remove_table()" from Kevin Brodsky
cleans up the pagetable destructor implementations. A rare
use-after-free race is fixed.
- "mm/debug: introduce and use VM_WARN_ON_VMG()" from Lorenzo Stoakes
simplifies and cleans up the debugging code in the VMA merging logic.
- "Account page tables at all levels" from Kevin Brodsky cleans up and
regularizes the pagetable ctor/dtor handling. This results in
improvements in accounting accuracy.
- "mm/damon: replace most damon_callback usages in sysfs with new core
functions" from SeongJae Park cleans up and generalizes DAMON's sysfs
file interface logic.
- "mm/damon: enable page level properties based monitoring" from
SeongJae Park increases the amount of information which is presented in
response to DAMOS actions.
- "mm/damon: remove DAMON debugfs interface" from SeongJae Park removes
DAMON's long-deprecated debugfs interfaces. Thus the migration to sysfs
is completed.
- "mm/hugetlb: Refactor hugetlb allocation resv accounting" from Peter
Xu cleans up and generalizes the hugetlb reservation accounting.
- "mm: alloc_pages_bulk: small API refactor" from Luiz Capitulino
removes a never-used feature of the alloc_pages_bulk() interface.
- "mm/damon: extend DAMOS filters for inclusion" from SeongJae Park
extends DAMOS filters to support not only exclusion (rejecting), but
also inclusion (allowing) behavior.
- "Add zpdesc memory descriptor for zswap.zpool" from Alex Shi
"introduces a new memory descriptor for zswap.zpool that currently
overlaps with struct page for now. This is part of the effort to reduce
the size of struct page and to enable dynamic allocation of memory
descriptors."
- "mm, swap: rework of swap allocator locks" from Kairui Song redoes and
simplifies the swap allocator locking. A speedup of 400% was
demonstrated for one workload. As was a 35% reduction for kernel build
time with swap-on-zram.
- "mm: update mips to use do_mmap(), make mmap_region() internal" from
Lorenzo Stoakes reworks MIPS's use of mmap_region() so that
mmap_region() can be made MM-internal.
- "mm/mglru: performance optimizations" from Yu Zhao fixes a few MGLRU
regressions and otherwise improves MGLRU performance.
- "Docs/mm/damon: add tuning guide and misc updates" from SeongJae Park
updates DAMON documentation.
- "Cleanup for memfd_create()" from Isaac Manjarres does that thing.
- "mm: hugetlb+THP folio and migration cleanups" from David Hildenbrand
provides various cleanups in the areas of hugetlb folios, THP folios and
migration.
- "Uncached buffered IO" from Jens Axboe implements the new
RWF_DONTCACHE flag which provides synchronous dropbehind for pagecache
reading and writing. To permite userspace to address issues with
massive buildup of useless pagecache when reading/writing fast devices.
- "selftests/mm: virtual_address_range: Reduce memory" from Thomas
Weißschuh fixes and optimizes some of the MM selftests.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ5a+cwAKCRDdBJ7gKXxA
jtoyAP9R58oaOKPJuTizEKKXvh/RpMyD6sYcz/uPpnf+cKTZxQEAqfVznfWlw/Lz
uC3KRZYhmd5YrxU4o+qjbzp9XWX/xAE=
=Ib2s
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"The various patchsets are summarized below. Plus of course many
indivudual patches which are described in their changelogs.
- "Allocate and free frozen pages" from Matthew Wilcox reorganizes
the page allocator so we end up with the ability to allocate and
free zero-refcount pages. So that callers (ie, slab) can avoid a
refcount inc & dec
- "Support large folios for tmpfs" from Baolin Wang teaches tmpfs to
use large folios other than PMD-sized ones
- "Fix mm/rodata_test" from Petr Tesarik performs some maintenance
and fixes for this small built-in kernel selftest
- "mas_anode_descend() related cleanup" from Wei Yang tidies up part
of the mapletree code
- "mm: fix format issues and param types" from Keren Sun implements a
few minor code cleanups
- "simplify split calculation" from Wei Yang provides a few fixes and
a test for the mapletree code
- "mm/vma: make more mmap logic userland testable" from Lorenzo
Stoakes continues the work of moving vma-related code into the
(relatively) new mm/vma.c
- "mm/page_alloc: gfp flags cleanups for alloc_contig_*()" from David
Hildenbrand cleans up and rationalizes handling of gfp flags in the
page allocator
- "readahead: Reintroduce fix for improper RA window sizing" from Jan
Kara is a second attempt at fixing a readahead window sizing issue.
It should reduce the amount of unnecessary reading
- "synchronously scan and reclaim empty user PTE pages" from Qi Zheng
addresses an issue where "huge" amounts of pte pagetables are
accumulated:
https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/
Qi's series addresses this windup by synchronously freeing PTE
memory within the context of madvise(MADV_DONTNEED)
- "selftest/mm: Remove warnings found by adding compiler flags" from
Muhammad Usama Anjum fixes some build warnings in the selftests
code when optional compiler warnings are enabled
- "mm: don't use __GFP_HARDWALL when migrating remote pages" from
David Hildenbrand tightens the allocator's observance of
__GFP_HARDWALL
- "pkeys kselftests improvements" from Kevin Brodsky implements
various fixes and cleanups in the MM selftests code, mainly
pertaining to the pkeys tests
- "mm/damon: add sample modules" from SeongJae Park enhances DAMON to
estimate application working set size
- "memcg/hugetlb: Rework memcg hugetlb charging" from Joshua Hahn
provides some cleanups to memcg's hugetlb charging logic
- "mm/swap_cgroup: remove global swap cgroup lock" from Kairui Song
removes the global swap cgroup lock. A speedup of 10% for a
tmpfs-based kernel build was demonstrated
- "zram: split page type read/write handling" from Sergey Senozhatsky
has several fixes and cleaups for zram in the area of
zram_write_page(). A watchdog softlockup warning was eliminated
- "move pagetable_*_dtor() to __tlb_remove_table()" from Kevin
Brodsky cleans up the pagetable destructor implementations. A rare
use-after-free race is fixed
- "mm/debug: introduce and use VM_WARN_ON_VMG()" from Lorenzo Stoakes
simplifies and cleans up the debugging code in the VMA merging
logic
- "Account page tables at all levels" from Kevin Brodsky cleans up
and regularizes the pagetable ctor/dtor handling. This results in
improvements in accounting accuracy
- "mm/damon: replace most damon_callback usages in sysfs with new
core functions" from SeongJae Park cleans up and generalizes
DAMON's sysfs file interface logic
- "mm/damon: enable page level properties based monitoring" from
SeongJae Park increases the amount of information which is
presented in response to DAMOS actions
- "mm/damon: remove DAMON debugfs interface" from SeongJae Park
removes DAMON's long-deprecated debugfs interfaces. Thus the
migration to sysfs is completed
- "mm/hugetlb: Refactor hugetlb allocation resv accounting" from
Peter Xu cleans up and generalizes the hugetlb reservation
accounting
- "mm: alloc_pages_bulk: small API refactor" from Luiz Capitulino
removes a never-used feature of the alloc_pages_bulk() interface
- "mm/damon: extend DAMOS filters for inclusion" from SeongJae Park
extends DAMOS filters to support not only exclusion (rejecting),
but also inclusion (allowing) behavior
- "Add zpdesc memory descriptor for zswap.zpool" from Alex Shi
introduces a new memory descriptor for zswap.zpool that currently
overlaps with struct page for now. This is part of the effort to
reduce the size of struct page and to enable dynamic allocation of
memory descriptors
- "mm, swap: rework of swap allocator locks" from Kairui Song redoes
and simplifies the swap allocator locking. A speedup of 400% was
demonstrated for one workload. As was a 35% reduction for kernel
build time with swap-on-zram
- "mm: update mips to use do_mmap(), make mmap_region() internal"
from Lorenzo Stoakes reworks MIPS's use of mmap_region() so that
mmap_region() can be made MM-internal
- "mm/mglru: performance optimizations" from Yu Zhao fixes a few
MGLRU regressions and otherwise improves MGLRU performance
- "Docs/mm/damon: add tuning guide and misc updates" from SeongJae
Park updates DAMON documentation
- "Cleanup for memfd_create()" from Isaac Manjarres does that thing
- "mm: hugetlb+THP folio and migration cleanups" from David
Hildenbrand provides various cleanups in the areas of hugetlb
folios, THP folios and migration
- "Uncached buffered IO" from Jens Axboe implements the new
RWF_DONTCACHE flag which provides synchronous dropbehind for
pagecache reading and writing. To permite userspace to address
issues with massive buildup of useless pagecache when
reading/writing fast devices
- "selftests/mm: virtual_address_range: Reduce memory" from Thomas
Weißschuh fixes and optimizes some of the MM selftests"
* tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits)
mm/compaction: fix UBSAN shift-out-of-bounds warning
s390/mm: add missing ctor/dtor on page table upgrade
kasan: sw_tags: use str_on_off() helper in kasan_init_sw_tags()
tools: add VM_WARN_ON_VMG definition
mm/damon/core: use str_high_low() helper in damos_wmark_wait_us()
seqlock: add missing parameter documentation for raw_seqcount_try_begin()
mm/page-writeback: consolidate wb_thresh bumping logic into __wb_calc_thresh
mm/page_alloc: remove the incorrect and misleading comment
zram: remove zcomp_stream_put() from write_incompressible_page()
mm: separate move/undo parts from migrate_pages_batch()
mm/kfence: use str_write_read() helper in get_access_type()
selftests/mm/mkdirty: fix memory leak in test_uffdio_copy()
kasan: hw_tags: Use str_on_off() helper in kasan_init_hw_tags()
selftests/mm: virtual_address_range: avoid reading from VM_IO mappings
selftests/mm: vm_util: split up /proc/self/smaps parsing
selftests/mm: virtual_address_range: unmap chunks after validation
selftests/mm: virtual_address_range: mmap() without PROT_WRITE
selftests/memfd/memfd_test: fix possible NULL pointer dereference
mm: add FGP_DONTCACHE folio creation flag
mm: call filemap_fdatawrite_range_kick() after IOCB_DONTCACHE issue
...
Remove highest_bit and lowest_bit. After the HDD allocation path has been
removed, the only purpose of these two fields is to determine whether the
device is full or not, which can instead be determined by checking the
inuse_pages.
Link: https://lkml.kernel.org/r/20250113175732.48099-6-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Barry Song <v-songbaohua@oppo.com>
Cc: Chis Li <chrisl@kernel.org>
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Hugh Dickens <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The previous commit removed the page_list argument from
alloc_pages_bulk_noprof() along with the alloc_pages_bulk_list() function.
Now that only the *_array() flavour of the API remains, we can do the
following renaming (along with the _noprof() ones):
alloc_pages_bulk_array -> alloc_pages_bulk
alloc_pages_bulk_array_mempolicy -> alloc_pages_bulk_mempolicy
alloc_pages_bulk_array_node -> alloc_pages_bulk_node
Link: https://lkml.kernel.org/r/275a3bbc0be20fbe9002297d60045e67ab3d4ada.1734991165.git.luizcap@redhat.com
Signed-off-by: Luiz Capitulino <luizcap@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmePs7oACgkQnJ2qBz9k
QNmHuAf9GkLnY5u1/81xP5V9ukZ4N2yeMW0dydLS5cjWj/St5ELeMAza3jeqtJtD
j36vbnmy2c5pPaGLAK8BJpMXT/R2TkmmKD004zcfqF2S3SgbGzdgO1zMZzq9KJpM
woRKZtLuglDajedsDEBBcKotBhlN2+C/sQlFuL1mX4zitk9ajr0qYUB1+JqOeg5f
qwPsDLT077ADpxd7lVIMcm+OqbduP5KWkBKYHpn7lJcLe1eqVMMzceJroW42zhVG
Dq8Iln26bbU9Wx6FSPFCUcHEzHRHUfXmu07HN9U0X++0QgWjrmBQQLooGFB/bR4a
edBrPpVas6xE4/brjgFX3gOKtv8xYg==
=ewDV
-----END PGP SIGNATURE-----
Merge tag 'fsnotify_hsm_for_v6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fsnotify pre-content notification support from Jan Kara:
"This introduces a new fsnotify event (FS_PRE_ACCESS) that gets
generated before a file contents is accessed.
The event is synchronous so if there is listener for this event, the
kernel waits for reply. On success the execution continues as usual,
on failure we propagate the error to userspace. This allows userspace
to fill in file content on demand from slow storage. The context in
which the events are generated has been picked so that we don't hold
any locks and thus there's no risk of a deadlock for the userspace
handler.
The new pre-content event is available only for users with global
CAP_SYS_ADMIN capability (similarly to other parts of fanotify
functionality) and it is an administrator responsibility to make sure
the userspace event handler doesn't do stupid stuff that can DoS the
system.
Based on your feedback from the last submission, fsnotify code has
been improved and now file->f_mode encodes whether pre-content event
needs to be generated for the file so the fast path when nobody wants
pre-content event for the file just grows the additional file->f_mode
check. As a bonus this also removes the checks whether the old
FS_ACCESS event needs to be generated from the fast path. Also the
place where the event is generated during page fault has been moved so
now filemap_fault() generates the event if and only if there is no
uptodate folio in the page cache.
Also we have dropped FS_PRE_MODIFY event as current real-world users
of the pre-content functionality don't really use it so let's start
with the minimal useful feature set"
* tag 'fsnotify_hsm_for_v6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: (21 commits)
fanotify: Fix crash in fanotify_init(2)
fs: don't block write during exec on pre-content watched files
fs: enable pre-content events on supported file systems
ext4: add pre-content fsnotify hook for DAX faults
btrfs: disable defrag on pre-content watched files
xfs: add pre-content fsnotify hook for DAX faults
fsnotify: generate pre-content permission event on page fault
mm: don't allow huge faults for files with pre content watches
fanotify: disable readahead if we have pre-content watches
fanotify: allow to set errno in FAN_DENY permission response
fanotify: report file range info with pre-content events
fanotify: introduce FAN_PRE_ACCESS permission event
fsnotify: generate pre-content permission event on truncate
fsnotify: pass optional file access range in pre-content event
fsnotify: introduce pre-content permission events
fanotify: reserve event bit of deprecated FAN_DIR_MODIFY
fanotify: rename a misnamed constant
fanotify: don't skip extra event info if no info_mode is set
fsnotify: check if file is actually being watched for pre-content events on open
fsnotify: opt-in for permission events at file open time
...
At btrfs_qgroup_cleanup_dropped_subvolume() all we want to commit the
current transaction in order to have all the qgroup rfer/excl numbers up
to date. However we are using btrfs_start_transaction(), which joins the
current transaction if there is one that is not yet committing, but also
starts a new one if there is none or if the current one is already
committing (its state is >= TRANS_STATE_COMMIT_START). This later case
results in unnecessary IO, wasting time and a pointless rotation of the
backup roots in the super block.
So instead of using btrfs_start_transaction() followed by a
btrfs_commit_transaction(), use btrfs_commit_current_transaction() which
achieves our purpose and avoids starting and committing new transactions.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
There is a bug report that btrfs outputs the following error message:
BTRFS info (device nvme0n1p2): qgroup scan completed (inconsistency flag cleared)
BTRFS warning (device nvme0n1p2): failed to cleanup qgroup 0/1179: -2
[CAUSE]
The error itself is pretty harmless, and the end user should ignore it.
When a subvolume is fully dropped, btrfs will call
btrfs_qgroup_cleanup_dropped_subvolume() to delete the qgroup.
However if a qgroup rescan happened before a subvolume fully dropped,
qgroup for that subvolume will not be re-created, as rescan will only
create new qgroup if there is a BTRFS_ROOT_REF_KEY found.
But before we drop a subvolume, the subvolume is unlinked thus there is no
BTRFS_ROOT_REF_KEY.
In that case, btrfs_remove_qgroup() will fail with -ENOENT and trigger
the above error message.
[FIX]
Just ignore -ENOENT error from btrfs_remove_qgroup() inside
btrfs_qgroup_cleanup_dropped_subvolume().
Reported-by: John Shand <jshand2013@gmail.com>
Link: https://bugzilla.suse.com/show_bug.cgi?id=1236056
Fixes: 839d6ea4f8 ("btrfs: automatically remove the subvolume qgroup")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When COWing a relocation tree path, at relocation.c:replace_path(), we
can trigger a lockdep splat while we are in the btrfs_search_slot() call
against the relocation root. This happens in that callchain at
ctree.c:read_block_for_search() when we happen to find a child extent
buffer already loaded through the fs tree with a lockdep class set to
the fs tree. So when we attempt to lock that extent buffer through a
relocation tree we have to reset the lockdep class to the class for a
relocation tree, since a relocation tree has extent buffers that used
to belong to a fs tree and may currently be already loaded (we swap
extent buffers between the two trees at the end of replace_path()).
However we are missing calls to btrfs_maybe_reset_lockdep_class() to reset
the lockdep class at ctree.c:read_block_for_search() before we read lock
an extent buffer, just like we did for btrfs_search_slot() in commit
b40130b23c ("btrfs: fix lockdep splat with reloc root extent buffers").
So add the missing btrfs_maybe_reset_lockdep_class() calls before the
attempts to read lock an extent buffer at ctree.c:read_block_for_search().
The lockdep splat was reported by syzbot and it looks like this:
======================================================
WARNING: possible circular locking dependency detected
6.13.0-rc5-syzkaller-00163-gab75170520d4 #0 Not tainted
------------------------------------------------------
syz.0.0/5335 is trying to acquire lock:
ffff8880545dbc38 (btrfs-tree-01){++++}-{4:4}, at: btrfs_tree_read_lock_nested+0x2f/0x250 fs/btrfs/locking.c:146
but task is already holding lock:
ffff8880545dba58 (btrfs-treloc-02/1){+.+.}-{4:4}, at: btrfs_tree_lock_nested+0x2f/0x250 fs/btrfs/locking.c:189
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 (btrfs-treloc-02/1){+.+.}-{4:4}:
reacquire_held_locks+0x3eb/0x690 kernel/locking/lockdep.c:5374
__lock_release kernel/locking/lockdep.c:5563 [inline]
lock_release+0x396/0xa30 kernel/locking/lockdep.c:5870
up_write+0x79/0x590 kernel/locking/rwsem.c:1629
btrfs_force_cow_block+0x14b3/0x1fd0 fs/btrfs/ctree.c:660
btrfs_cow_block+0x371/0x830 fs/btrfs/ctree.c:755
btrfs_search_slot+0xc01/0x3180 fs/btrfs/ctree.c:2153
replace_path+0x1243/0x2740 fs/btrfs/relocation.c:1224
merge_reloc_root+0xc46/0x1ad0 fs/btrfs/relocation.c:1692
merge_reloc_roots+0x3b3/0x980 fs/btrfs/relocation.c:1942
relocate_block_group+0xb0a/0xd40 fs/btrfs/relocation.c:3754
btrfs_relocate_block_group+0x77d/0xd90 fs/btrfs/relocation.c:4087
btrfs_relocate_chunk+0x12c/0x3b0 fs/btrfs/volumes.c:3494
__btrfs_balance+0x1b0f/0x26b0 fs/btrfs/volumes.c:4278
btrfs_balance+0xbdc/0x10c0 fs/btrfs/volumes.c:4655
btrfs_ioctl_balance+0x493/0x7c0 fs/btrfs/ioctl.c:3670
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:906 [inline]
__se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #1 (btrfs-tree-01/1){+.+.}-{4:4}:
lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849
down_write_nested+0xa2/0x220 kernel/locking/rwsem.c:1693
btrfs_tree_lock_nested+0x2f/0x250 fs/btrfs/locking.c:189
btrfs_init_new_buffer fs/btrfs/extent-tree.c:5052 [inline]
btrfs_alloc_tree_block+0x41c/0x1440 fs/btrfs/extent-tree.c:5132
btrfs_force_cow_block+0x526/0x1fd0 fs/btrfs/ctree.c:573
btrfs_cow_block+0x371/0x830 fs/btrfs/ctree.c:755
btrfs_search_slot+0xc01/0x3180 fs/btrfs/ctree.c:2153
btrfs_insert_empty_items+0x9c/0x1a0 fs/btrfs/ctree.c:4351
btrfs_insert_empty_item fs/btrfs/ctree.h:688 [inline]
btrfs_insert_inode_ref+0x2bb/0xf80 fs/btrfs/inode-item.c:330
btrfs_rename_exchange fs/btrfs/inode.c:7990 [inline]
btrfs_rename2+0xcb7/0x2b90 fs/btrfs/inode.c:8374
vfs_rename+0xbdb/0xf00 fs/namei.c:5067
do_renameat2+0xd94/0x13f0 fs/namei.c:5224
__do_sys_renameat2 fs/namei.c:5258 [inline]
__se_sys_renameat2 fs/namei.c:5255 [inline]
__x64_sys_renameat2+0xce/0xe0 fs/namei.c:5255
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
-> #0 (btrfs-tree-01){++++}-{4:4}:
check_prev_add kernel/locking/lockdep.c:3161 [inline]
check_prevs_add kernel/locking/lockdep.c:3280 [inline]
validate_chain+0x18ef/0x5920 kernel/locking/lockdep.c:3904
__lock_acquire+0x1397/0x2100 kernel/locking/lockdep.c:5226
lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849
down_read_nested+0xb5/0xa50 kernel/locking/rwsem.c:1649
btrfs_tree_read_lock_nested+0x2f/0x250 fs/btrfs/locking.c:146
btrfs_tree_read_lock fs/btrfs/locking.h:188 [inline]
read_block_for_search+0x718/0xbb0 fs/btrfs/ctree.c:1610
btrfs_search_slot+0x1274/0x3180 fs/btrfs/ctree.c:2237
replace_path+0x1243/0x2740 fs/btrfs/relocation.c:1224
merge_reloc_root+0xc46/0x1ad0 fs/btrfs/relocation.c:1692
merge_reloc_roots+0x3b3/0x980 fs/btrfs/relocation.c:1942
relocate_block_group+0xb0a/0xd40 fs/btrfs/relocation.c:3754
btrfs_relocate_block_group+0x77d/0xd90 fs/btrfs/relocation.c:4087
btrfs_relocate_chunk+0x12c/0x3b0 fs/btrfs/volumes.c:3494
__btrfs_balance+0x1b0f/0x26b0 fs/btrfs/volumes.c:4278
btrfs_balance+0xbdc/0x10c0 fs/btrfs/volumes.c:4655
btrfs_ioctl_balance+0x493/0x7c0 fs/btrfs/ioctl.c:3670
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:906 [inline]
__se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
other info that might help us debug this:
Chain exists of:
btrfs-tree-01 --> btrfs-tree-01/1 --> btrfs-treloc-02/1
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(btrfs-treloc-02/1);
lock(btrfs-tree-01/1);
lock(btrfs-treloc-02/1);
rlock(btrfs-tree-01);
*** DEADLOCK ***
8 locks held by syz.0.0/5335:
#0: ffff88801e3ae420 (sb_writers#13){.+.+}-{0:0}, at: mnt_want_write_file+0x5e/0x200 fs/namespace.c:559
#1: ffff888052c760d0 (&fs_info->reclaim_bgs_lock){+.+.}-{4:4}, at: __btrfs_balance+0x4c2/0x26b0 fs/btrfs/volumes.c:4183
#2: ffff888052c74850 (&fs_info->cleaner_mutex){+.+.}-{4:4}, at: btrfs_relocate_block_group+0x775/0xd90 fs/btrfs/relocation.c:4086
#3: ffff88801e3ae610 (sb_internal#2){.+.+}-{0:0}, at: merge_reloc_root+0xf11/0x1ad0 fs/btrfs/relocation.c:1659
#4: ffff888052c76470 (btrfs_trans_num_writers){++++}-{0:0}, at: join_transaction+0x405/0xda0 fs/btrfs/transaction.c:288
#5: ffff888052c76498 (btrfs_trans_num_extwriters){++++}-{0:0}, at: join_transaction+0x405/0xda0 fs/btrfs/transaction.c:288
#6: ffff8880545db878 (btrfs-tree-01/1){+.+.}-{4:4}, at: btrfs_tree_lock_nested+0x2f/0x250 fs/btrfs/locking.c:189
#7: ffff8880545dba58 (btrfs-treloc-02/1){+.+.}-{4:4}, at: btrfs_tree_lock_nested+0x2f/0x250 fs/btrfs/locking.c:189
stack backtrace:
CPU: 0 UID: 0 PID: 5335 Comm: syz.0.0 Not tainted 6.13.0-rc5-syzkaller-00163-gab75170520d4 #0
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:94 [inline]
dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
print_circular_bug+0x13a/0x1b0 kernel/locking/lockdep.c:2074
check_noncircular+0x36a/0x4a0 kernel/locking/lockdep.c:2206
check_prev_add kernel/locking/lockdep.c:3161 [inline]
check_prevs_add kernel/locking/lockdep.c:3280 [inline]
validate_chain+0x18ef/0x5920 kernel/locking/lockdep.c:3904
__lock_acquire+0x1397/0x2100 kernel/locking/lockdep.c:5226
lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849
down_read_nested+0xb5/0xa50 kernel/locking/rwsem.c:1649
btrfs_tree_read_lock_nested+0x2f/0x250 fs/btrfs/locking.c:146
btrfs_tree_read_lock fs/btrfs/locking.h:188 [inline]
read_block_for_search+0x718/0xbb0 fs/btrfs/ctree.c:1610
btrfs_search_slot+0x1274/0x3180 fs/btrfs/ctree.c:2237
replace_path+0x1243/0x2740 fs/btrfs/relocation.c:1224
merge_reloc_root+0xc46/0x1ad0 fs/btrfs/relocation.c:1692
merge_reloc_roots+0x3b3/0x980 fs/btrfs/relocation.c:1942
relocate_block_group+0xb0a/0xd40 fs/btrfs/relocation.c:3754
btrfs_relocate_block_group+0x77d/0xd90 fs/btrfs/relocation.c:4087
btrfs_relocate_chunk+0x12c/0x3b0 fs/btrfs/volumes.c:3494
__btrfs_balance+0x1b0f/0x26b0 fs/btrfs/volumes.c:4278
btrfs_balance+0xbdc/0x10c0 fs/btrfs/volumes.c:4655
btrfs_ioctl_balance+0x493/0x7c0 fs/btrfs/ioctl.c:3670
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:906 [inline]
__se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f1ac6985d29
Code: ff ff c3 (...)
RSP: 002b:00007f1ac63fe038 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007f1ac6b76160 RCX: 00007f1ac6985d29
RDX: 0000000020000180 RSI: 00000000c4009420 RDI: 0000000000000007
RBP: 00007f1ac6a01b08 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000001 R14: 00007f1ac6b76160 R15: 00007fffda145a88
</TASK>
Reported-by: syzbot+63913e558c084f7f8fdc@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/677b3014.050a0220.3b53b0.0064.GAE@google.com/
Fixes: 99785998ed ("btrfs: reduce lock contention when eb cache miss for btree search")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmeHvVQACgkQxWXV+ddt
WDsJ6w//cPqI8tf3kMxurZcG7clJRIIISotPrC6hm3UDNpJLa7HDaVJ50FAoIhMV
sB4RQNZky4mfB6ypXxmETzV3ZHvP0+oFgRs72Ommi0ZbdnBgxhaUTrDXLKl52o4r
UoeqvRKReEYOesN09rPXYPwytUOkxHU/GjNzv7bC/Tzvq/xKaIN5qMYZwkHtJ8PK
JtCFypfbmDPNDJz37l0BhRya2oMtpcUtxM9uP8RWVuQtaELgjcy56W/+osoyJTy9
FSKaoWUPsDVDufnILlGR8Kub2Z5mcISVqyARUdr/q3j5CDfyTdQvahmUy7sHgUAe
HGh5QBdRJu1QTvdZw+nK4YCaYpK6Nj4liDtO1cwVitde5RXsJrt6kYBLlY/kU2Qr
KODOloM/zVKxULR0ARl11NULZquUsczP6Wxfn+dtyDJ3JGlY9OcuESmorHoUtkMX
75Tj1AtRMNcfZAE2HquL1Oz3bIMcg4btDJsC+9Yp5K11SP12XpOwC42k/9Bx3iBe
Iki0BSuppFqX5MMY3OEWzD1pz2vOGYR8ISD6EIsjpjl2vBeRwydaCCZfuszSC7gl
Y4goSdwFMPVlqllL1h27XUjKVXvttCqqdB6P28MbvZKnFAPlm189BJQZC5cbHAJU
ceBww5PvI9QxnJnFG5iOLcnko6liUWPP9l2c5LLtUsJIi8B5Hu0=
=SXLv
-----END PGP SIGNATURE-----
Merge tag 'for-6.14-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"User visible changes, features:
- rebuilding of the free space tree at mount time is done in more
transactions, fix potential hangs when the transaction thread is
blocked due to large amount of block groups
- more read IO balancing strategies (experimental config), add two
new ways how to select a device for read if the profiles allow that
(all RAID1*), the current default selects the device by pid which
is good on average but less performant for single reader workloads
- select preferred device for all reads (namely for testing)
- round-robin, balance reads across devices relevant for the
requested IO range
- add encoded write ioctl support to io_uring (read was added in
6.12), basis for writing send stream using that instead of
syscalls, non-blocking mode is not yet implemented
- support FS_IOC_READ_VERITY_METADATA, applications can use the
metadata to do their own verification
- pass inode's i_write_hint to bios, for parity with other
filesystems, ioctls F_GET_RW_HINT/F_SET_RW_HINT
Core:
- in zoned mode: allow to directly reclaim a block group by simply
resetting it, then it can be reused and another block group does
not need to be allocated
- super block validation now also does more comprehensive sys array
validation, adding it to the points where superblock is validated
(post-read, pre-write)
- subpage mode fixes:
- fix double accounting of blocks due to some races
- improved or fixed error handling in a few cases (compression,
delalloc)
- raid stripe tree:
- fix various cases with extent range splitting or deleting
- implement hole punching to extent range
- reduce number of stripe tree lookups during bio submission
- more self-tests
- updated self-tests (delayed refs)
- error handling improvements
- cleanups, refactoring
- remove rest of backref caching infrastructure from relocation,
not needed anymore
- error message updates
- remove unnecessary calls when extent buffer was marked dirty
- unused parameter removal
- code moved to new files
Other code changes: add rb_find_add_cached() to the rb-tree API"
* tag 'for-6.14-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (127 commits)
btrfs: selftests: add a selftest for deleting two out of three extents
btrfs: selftests: add test for punching a hole into 3 RAID stripe-extents
btrfs: selftests: add selftest for punching holes into the RAID stripe extents
btrfs: selftests: test RAID stripe-tree deletion spanning two items
btrfs: selftests: don't split RAID extents in half
btrfs: selftests: check for correct return value of failed lookup
btrfs: don't use btrfs_set_item_key_safe on RAID stripe-extents
btrfs: implement hole punching for RAID stripe extents
btrfs: fix deletion of a range spanning parts two RAID stripe extents
btrfs: fix tail delete of RAID stripe-extents
btrfs: fix front delete range calculation for RAID stripe extents
btrfs: assert RAID stripe-extent length is always greater than 0
btrfs: don't try to delete RAID stripe-extents if we don't need to
btrfs: selftests: correct RAID stripe-tree feature flag setting
btrfs: add io_uring interface for encoded writes
btrfs: remove the unused locked_folio parameter from btrfs_cleanup_ordered_extents()
btrfs: add extra error messages for delalloc range related errors
btrfs: subpage: dump the involved bitmap when ASSERT() failed
btrfs: subpage: fix the bitmap dump of the locked flags
btrfs: do proper folio cleanup when run_delalloc_nocow() failed
...
Add a selftest creating three extents and then deleting two out of the
three extents.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Test creating a range of three RAID stripe-extents and then punch a hole
in the middle, deleting all of the middle extents and partially deleting
the "book ends".
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a selftest for punching a hole into a RAID stripe extent. The test
create an 1M extent and punches a 64k bytes long hole at offset of 32k from
the start of the extent.
Afterwards it verifies the start and length of both resulting new extents
"left" and "right" as well as the absence of the hole.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a selftest for RAID stripe-tree deletion with a delete range spanning
two items, so that we're punching a hole into two adjacent RAID stripe
extents truncating the first and "moving" the second to the right.
The following diagram illustrates the operation:
|--- RAID Stripe Extent ---||--- RAID Stripe Extent ---|
|----- keep -----|--- drop ---|----- keep ----|
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The selftests for partially deleting the start or tail of RAID
stripe-extents split these extents in half.
This can hide errors in the calculation, so don't split the RAID
stripe-extents in half but delete the first or last 16K of the 64K
extents.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 5e72aabc1f ("btrfs: return ENODATA in case RST lookup fails")
changed btrfs_get_raid_extent_offset()'s return value to ENODATA in case
the RAID stripe-tree lookup failed.
Adjust the test cases which check for absence of a given range to check
for ENODATA as return value in this case.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If the stripe extent we want to delete starts before the range we want to
delete and ends after the range we want to delete we're punching a
hole in the stripe extent:
|--- RAID Stripe Extent ---|
| keep |--- drop ---| keep |
This means we need to a) truncate the existing item and b)
create a second item for the remaining range.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When a user requests the deletion of a range that spans multiple stripe
extents and btrfs_search_slot() returns us the second RAID stripe extent,
we need to pick the previous item and truncate it, if there's still a
range to delete left, move on to the next item.
The following diagram illustrates the operation:
|--- RAID Stripe Extent ---||--- RAID Stripe Extent ---|
|--- keep ---|--- drop ---|
While at it, comment the trivial case of a whole item delete as well.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Fix tail delete of RAID stripe-extents, if there is a range to be deleted
as well after the tail delete of the extent.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When deleting the front of a RAID stripe-extent the delete code
miscalculates the size on how much to pad the remaining extent part in the
front.
Fix the calculation so we're always having the sizes we expect.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When modifying a RAID stripe-extent, ASSERT() that the length of the new
RAID stripe-extent is always greater than 0.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Even if the RAID stripe-tree is not enabled in the filesystem,
do_free_extent_accounting() still calls into btrfs_delete_raid_extent().
Check if the extent in question is on a block-group that has a profile
which is used by RAID stripe-tree before attempting to delete a stripe
extent. Return early if it doesn't, otherwise we're doing a unnecessary
search.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
RAID stripe-tree is an incompatible feature not a read-only compatible, so
set the incompat flag not a compat_ro one in the selftest code.
Subsequent changes in btrfs_delete_raid_extent() will start checking for
this flag.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Inside function get_canonical_dev_path(), we call d_path() to get the
final device path.
But d_path() can return error, and in that case the next strscpy() call
will trigger an invalid memory access.
Add back the missing error handling for d_path().
Reported-by: Boris Burkov <boris@bur.io>
Fixes: 7e06de7c83 ("btrfs: canonicalize the device path before adding it")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add an io_uring interface for encoded writes, with the same parameters
as the BTRFS_IOC_ENCODED_WRITE ioctl.
As with the encoded reads code, there's a test program for this at
https://github.com/maharmstone/io_uring-encoded, and I'll get this
worked into an fstest.
How io_uring works is that it initially calls btrfs_uring_cmd with the
IO_URING_F_NONBLOCK flag set, and if we return -EAGAIN it tries again in
a kthread with the flag cleared.
Ideally we'd honour this and call try_lock etc., but there's still a lot
of work to be done to create non-blocking versions of all the functions
in our write path. Instead, just validate the input in
btrfs_uring_encoded_write() on the first pass and return -EAGAIN, with a
view to properly optimizing the optimistic path later on.
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function btrfs_cleanup_ordered_extents() is only called in error
handling path, and the last caller with a @locked_folio parameter was
removed to fix a bug in the btrfs_run_delalloc_range() error handling.
There is no need to pass @locked_folio parameter anymore.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All the error handling bugs I hit so far are all -ENOSPC from either:
- cow_file_range()
- run_delalloc_nocow()
- submit_uncompressed_range()
Previously when those functions failed, there was no error message at
all, making the debugging much harder.
So here we introduce extra error messages for:
- cow_file_range()
- run_delalloc_nocow()
- submit_uncompressed_range()
- writepage_delalloc() when btrfs_run_delalloc_range() failed
- extent_writepage() when extent_writepage_io() failed
One example of the new debug error messages is the following one:
run fstests generic/750 at 2024-12-08 12:41:41
BTRFS: device fsid 461b25f5-e240-4543-8deb-e7c2bd01a6d3 devid 1 transid 8 /dev/mapper/test-scratch1 (253:4) scanned by mount (2436600)
BTRFS info (device dm-4): first mount of filesystem 461b25f5-e240-4543-8deb-e7c2bd01a6d3
BTRFS info (device dm-4): using crc32c (crc32c-arm64) checksum algorithm
BTRFS info (device dm-4): forcing free space tree for sector size 4096 with page size 65536
BTRFS info (device dm-4): using free-space-tree
BTRFS warning (device dm-4): read-write for sector size 4096 with page size 65536 is experimental
BTRFS info (device dm-4): checking UUID tree
BTRFS error (device dm-4): cow_file_range failed, root=363 inode=412 start=503808 len=98304: -28
BTRFS error (device dm-4): run_delalloc_nocow failed, root=363 inode=412 start=503808 len=98304: -28
BTRFS error (device dm-4): failed to run delalloc range, root=363 ino=412 folio=458752 submit_bitmap=11-15 start=503808 len=98304: -28
Which shows an error from cow_file_range() which is called inside a
nocow write attempt, along with the extra bitmap from
writepage_delalloc().
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For btrfs_folio_assert_not_dirty() and btrfs_folio_set_lock(), we call
bitmap_test_range_all_zero() to ensure the involved range has no
dirty/lock bit already set.
However with my recent enhanced delalloc range error handling, I was
hitting the ASSERT() inside btrfs_folio_set_lock(), and it turns out
that some error handling path is not properly updating the folio flags.
So add some extra dumping for the ASSERTs to dump the involved bitmap
to help debug.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're dumping the locked bitmap into the @checked_bitmap variable,
printing incorrect values during debug.
Thankfully even during my development I haven't hit a case where I need
to dump the locked bitmap. But for the sake of consistency, fix it by
dupping the locked bitmap into @locked_bitmap variable for output.
Fixes: 75258f20fb ("btrfs: subpage: dump extra subpage bitmaps for debug")
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
With CONFIG_DEBUG_VM set, test case generic/476 has some chance to crash
with the following VM_BUG_ON_FOLIO():
BTRFS error (device dm-3): cow_file_range failed, start 1146880 end 1253375 len 106496 ret -28
BTRFS error (device dm-3): run_delalloc_nocow failed, start 1146880 end 1253375 len 106496 ret -28
page: refcount:4 mapcount:0 mapping:00000000592787cc index:0x12 pfn:0x10664
aops:btrfs_aops [btrfs] ino:101 dentry name(?):"f1774"
flags: 0x2fffff80004028(uptodate|lru|private|node=0|zone=2|lastcpupid=0xfffff)
page dumped because: VM_BUG_ON_FOLIO(!folio_test_locked(folio))
------------[ cut here ]------------
kernel BUG at mm/page-writeback.c:2992!
Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
CPU: 2 UID: 0 PID: 3943513 Comm: kworker/u24:15 Tainted: G OE 6.12.0-rc7-custom+ #87
Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs]
pc : folio_clear_dirty_for_io+0x128/0x258
lr : folio_clear_dirty_for_io+0x128/0x258
Call trace:
folio_clear_dirty_for_io+0x128/0x258
btrfs_folio_clamp_clear_dirty+0x80/0xd0 [btrfs]
__process_folios_contig+0x154/0x268 [btrfs]
extent_clear_unlock_delalloc+0x5c/0x80 [btrfs]
run_delalloc_nocow+0x5f8/0x760 [btrfs]
btrfs_run_delalloc_range+0xa8/0x220 [btrfs]
writepage_delalloc+0x230/0x4c8 [btrfs]
extent_writepage+0xb8/0x358 [btrfs]
extent_write_cache_pages+0x21c/0x4e8 [btrfs]
btrfs_writepages+0x94/0x150 [btrfs]
do_writepages+0x74/0x190
filemap_fdatawrite_wbc+0x88/0xc8
start_delalloc_inodes+0x178/0x3a8 [btrfs]
btrfs_start_delalloc_roots+0x174/0x280 [btrfs]
shrink_delalloc+0x114/0x280 [btrfs]
flush_space+0x250/0x2f8 [btrfs]
btrfs_async_reclaim_data_space+0x180/0x228 [btrfs]
process_one_work+0x164/0x408
worker_thread+0x25c/0x388
kthread+0x100/0x118
ret_from_fork+0x10/0x20
Code: 910a8021 a90363f7 a9046bf9 94012379 (d4210000)
---[ end trace 0000000000000000 ]---
[CAUSE]
The first two lines of extra debug messages show the problem is caused
by the error handling of run_delalloc_nocow().
E.g. we have the following dirtied range (4K blocksize 4K page size):
0 16K 32K
|//////////////////////////////////////|
| Pre-allocated |
And the range [0, 16K) has a preallocated extent.
- Enter run_delalloc_nocow() for range [0, 16K)
Which found range [0, 16K) is preallocated, can do the proper NOCOW
write.
- Enter fallback_to_fow() for range [16K, 32K)
Since the range [16K, 32K) is not backed by preallocated extent, we
have to go COW.
- cow_file_range() failed for range [16K, 32K)
So cow_file_range() will do the clean up by clearing folio dirty,
unlock the folios.
Now the folios in range [16K, 32K) is unlocked.
- Enter extent_clear_unlock_delalloc() from run_delalloc_nocow()
Which is called with PAGE_START_WRITEBACK to start page writeback.
But folios can only be marked writeback when it's properly locked,
thus this triggered the VM_BUG_ON_FOLIO().
Furthermore there is another hidden but common bug that
run_delalloc_nocow() is not clearing the folio dirty flags in its error
handling path.
This is the common bug shared between run_delalloc_nocow() and
cow_file_range().
[FIX]
- Clear folio dirty for range [@start, @cur_offset)
Introduce a helper, cleanup_dirty_folios(), which
will find and lock the folio in the range, clear the dirty flag and
start/end the writeback, with the extra handling for the
@locked_folio.
- Introduce a helper to clear folio dirty, start and end writeback
- Introduce a helper to record the last failed COW range end
This is to trace which range we should skip, to avoid double
unlocking.
- Skip the failed COW range for the error handling
CC: stable@vger.kernel.org
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When testing with COW fixup marked as BUG_ON() (this is involved with the
new pin_user_pages*() change, which should not result new out-of-band
dirty pages), I hit a crash triggered by the BUG_ON() from hitting COW
fixup path.
This BUG_ON() happens just after a failed btrfs_run_delalloc_range():
BTRFS error (device dm-2): failed to run delalloc range, root 348 ino 405 folio 65536 submit_bitmap 6-15 start 90112 len 106496: -28
------------[ cut here ]------------
kernel BUG at fs/btrfs/extent_io.c:1444!
Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
CPU: 0 UID: 0 PID: 434621 Comm: kworker/u24:8 Tainted: G OE 6.12.0-rc7-custom+ #86
Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs]
pc : extent_writepage_io+0x2d4/0x308 [btrfs]
lr : extent_writepage_io+0x2d4/0x308 [btrfs]
Call trace:
extent_writepage_io+0x2d4/0x308 [btrfs]
extent_writepage+0x218/0x330 [btrfs]
extent_write_cache_pages+0x1d4/0x4b0 [btrfs]
btrfs_writepages+0x94/0x150 [btrfs]
do_writepages+0x74/0x190
filemap_fdatawrite_wbc+0x88/0xc8
start_delalloc_inodes+0x180/0x3b0 [btrfs]
btrfs_start_delalloc_roots+0x174/0x280 [btrfs]
shrink_delalloc+0x114/0x280 [btrfs]
flush_space+0x250/0x2f8 [btrfs]
btrfs_async_reclaim_data_space+0x180/0x228 [btrfs]
process_one_work+0x164/0x408
worker_thread+0x25c/0x388
kthread+0x100/0x118
ret_from_fork+0x10/0x20
Code: aa1403e1 9402f3ef aa1403e0 9402f36f (d4210000)
---[ end trace 0000000000000000 ]---
[CAUSE]
That failure is mostly from cow_file_range(), where we can hit -ENOSPC.
Although the -ENOSPC is already a bug related to our space reservation
code, let's just focus on the error handling.
For example, we have the following dirty range [0, 64K) of an inode,
with 4K sector size and 4K page size:
0 16K 32K 48K 64K
|///////////////////////////////////////|
|#######################################|
Where |///| means page are still dirty, and |###| means the extent io
tree has EXTENT_DELALLOC flag.
- Enter extent_writepage() for page 0
- Enter btrfs_run_delalloc_range() for range [0, 64K)
- Enter cow_file_range() for range [0, 64K)
- Function btrfs_reserve_extent() only reserved one 16K extent
So we created extent map and ordered extent for range [0, 16K)
0 16K 32K 48K 64K
|////////|//////////////////////////////|
|<- OE ->|##############################|
And range [0, 16K) has its delalloc flag cleared.
But since we haven't yet submit any bio, involved 4 pages are still
dirty.
- Function btrfs_reserve_extent() returns with -ENOSPC
Now we have to run error cleanup, which will clear all
EXTENT_DELALLOC* flags and clear the dirty flags for the remaining
ranges:
0 16K 32K 48K 64K
|////////| |
| | |
Note that range [0, 16K) still has its pages dirty.
- Some time later, writeback is triggered again for the range [0, 16K)
since the page range still has dirty flags.
- btrfs_run_delalloc_range() will do nothing because there is no
EXTENT_DELALLOC flag.
- extent_writepage_io() finds page 0 has no ordered flag
Which falls into the COW fixup path, triggering the BUG_ON().
Unfortunately this error handling bug dates back to the introduction of
btrfs. Thankfully with the abuse of COW fixup, at least it won't crash
the kernel.
[FIX]
Instead of immediately unlocking the extent and folios, we keep the extent
and folios locked until either erroring out or the whole delalloc range
finished.
When the whole delalloc range finished without error, we just unlock the
whole range with PAGE_SET_ORDERED (and PAGE_UNLOCK for !keep_locked
cases), with EXTENT_DELALLOC and EXTENT_LOCKED cleared.
And the involved folios will be properly submitted, with their dirty
flags cleared during submission.
For the error path, it will be a little more complex:
- The range with ordered extent allocated (range (1))
We only clear the EXTENT_DELALLOC and EXTENT_LOCKED, as the remaining
flags are cleaned up by
btrfs_mark_ordered_io_finished()->btrfs_finish_one_ordered().
For folios we finish the IO (clear dirty, start writeback and
immediately finish the writeback) and unlock the folios.
- The range with reserved extent but no ordered extent (range(2))
- The range we never touched (range(3))
For both range (2) and range(3) the behavior is not changed.
Now even if cow_file_range() failed halfway with some successfully
reserved extents/ordered extents, we will keep all folios clean, so
there will be no future writeback triggered on them.
CC: stable@vger.kernel.org
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
If we failed to compress the range, or cannot reserve a large enough
data extent (e.g. too fragmented free space), we will fall back to
submit_uncompressed_range().
But inside submit_uncompressed_range(), run_delalloc_cow() can also fail
due to -ENOSPC or any other error.
In that case there are 3 bugs in the error handling:
1) Double freeing for the same ordered extent
This can lead to crash due to ordered extent double accounting
2) Start/end writeback without updating the subpage writeback bitmap
3) Unlock the folio without clear the subpage lock bitmap
Both bugs 2) and 3) will crash the kernel if the btrfs block size is
smaller than folio size, as the next time the folio gets writeback/lock
updates, subpage will find the bitmap already have the range set,
triggering an ASSERT().
[CAUSE]
Bug 1) happens in the following call chain:
submit_uncompressed_range()
|- run_delalloc_cow()
| |- cow_file_range()
| |- btrfs_reserve_extent()
| Failed with -ENOSPC or whatever error
|
|- btrfs_clean_up_ordered_extents()
| |- btrfs_mark_ordered_io_finished()
| Which cleans all the ordered extents in the async_extent range.
|
|- btrfs_mark_ordered_io_finished()
Which cleans the folio range.
The finished ordered extents may not be immediately removed from the
ordered io tree, as they are removed inside a work queue.
So the second btrfs_mark_ordered_io_finished() may find the finished but
not-yet-removed ordered extents, and double free them.
Furthermore, the second btrfs_mark_ordered_io_finished() is not subpage
compatible, as it uses fixed folio_pos() with PAGE_SIZE, which can cover
other ordered extents.
Bugs 2) and 3) are more straightforward, btrfs just calls folio_unlock(),
folio_start_writeback() and folio_end_writeback(), other than the helpers
which handle subpage cases.
[FIX]
For bug 1) since the first btrfs_cleanup_ordered_extents() call is
handling the whole range, we should not do the second
btrfs_mark_ordered_io_finished() call.
And for the first btrfs_cleanup_ordered_extents(), we no longer need to
pass the @locked_page parameter, as we are already in the async extent
context, thus will never rely on the error handling inside
btrfs_run_delalloc_range().
So just let the btrfs_clean_up_ordered_extents() handle every folio
equally.
For bug 2) we should not even call
folio_start_writeback()/folio_end_writeback() anymore.
As the error handling protocol, cow_file_range() should clear
dirty flag and start/finish the writeback for the whole range passed in.
For bug 3) just change the folio_unlock() to btrfs_folio_end_lock()
helper.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
If submit_one_sector() failed inside extent_writepage_io() for sector
size < page size cases (e.g. 4K sector size and 64K page size), then
we can hit double ordered extent accounting error.
This should be very rare, as submit_one_sector() only fails when we
failed to grab the extent map, and such extent map should exist inside
the memory and has been pinned.
[CAUSE]
For example we have the following folio layout:
0 4K 32K 48K 60K 64K
|//| |//////| |///|
Where |///| is the dirty range we need to writeback. The 3 different
dirty ranges are submitted for regular COW.
Now we hit the following sequence:
- submit_one_sector() returned 0 for [0, 4K)
- submit_one_sector() returned 0 for [32K, 48K)
- submit_one_sector() returned error for [60K, 64K)
- btrfs_mark_ordered_io_finished() called for the whole folio
This will mark the following ranges as finished:
* [0, 4K)
* [32K, 48K)
Both ranges have their IO already submitted, this cleanup will
lead to double accounting.
* [60K, 64K)
That's the correct cleanup.
The only good news is, this error is only theoretical, as the target
extent map is always pinned, thus we should directly grab it from
memory, other than reading it from the disk.
[FIX]
Instead of calling btrfs_mark_ordered_io_finished() for the whole folio
range, which can touch ranges we should not touch, instead
move the error handling inside extent_writepage_io().
So that we can cleanup exact sectors that ought to be submitted but failed.
This provides much more accurate cleanup, avoiding the double accounting.
CC: stable@vger.kernel.org # 5.15+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When running btrfs with block size (4K) smaller than page size (64K,
aarch64), there is a very high chance to crash the kernel at
generic/750, with the following messages:
(before the call traces, there are 3 extra debug messages added)
BTRFS warning (device dm-3): read-write for sector size 4096 with page size 65536 is experimental
BTRFS info (device dm-3): checking UUID tree
hrtimer: interrupt took 5451385 ns
BTRFS error (device dm-3): cow_file_range failed, root=4957 inode=257 start=1605632 len=69632: -28
BTRFS error (device dm-3): run_delalloc_nocow failed, root=4957 inode=257 start=1605632 len=69632: -28
BTRFS error (device dm-3): failed to run delalloc range, root=4957 ino=257 folio=1572864 submit_bitmap=8-15 start=1605632 len=69632: -28
------------[ cut here ]------------
WARNING: CPU: 2 PID: 3020984 at ordered-data.c:360 can_finish_ordered_extent+0x370/0x3b8 [btrfs]
CPU: 2 UID: 0 PID: 3020984 Comm: kworker/u24:1 Tainted: G OE 6.13.0-rc1-custom+ #89
Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs]
pc : can_finish_ordered_extent+0x370/0x3b8 [btrfs]
lr : can_finish_ordered_extent+0x1ec/0x3b8 [btrfs]
Call trace:
can_finish_ordered_extent+0x370/0x3b8 [btrfs] (P)
can_finish_ordered_extent+0x1ec/0x3b8 [btrfs] (L)
btrfs_mark_ordered_io_finished+0x130/0x2b8 [btrfs]
extent_writepage+0x10c/0x3b8 [btrfs]
extent_write_cache_pages+0x21c/0x4e8 [btrfs]
btrfs_writepages+0x94/0x160 [btrfs]
do_writepages+0x74/0x190
filemap_fdatawrite_wbc+0x74/0xa0
start_delalloc_inodes+0x17c/0x3b0 [btrfs]
btrfs_start_delalloc_roots+0x17c/0x288 [btrfs]
shrink_delalloc+0x11c/0x280 [btrfs]
flush_space+0x288/0x328 [btrfs]
btrfs_async_reclaim_data_space+0x180/0x228 [btrfs]
process_one_work+0x228/0x680
worker_thread+0x1bc/0x360
kthread+0x100/0x118
ret_from_fork+0x10/0x20
---[ end trace 0000000000000000 ]---
BTRFS critical (device dm-3): bad ordered extent accounting, root=4957 ino=257 OE offset=1605632 OE len=16384 to_dec=16384 left=0
BTRFS critical (device dm-3): bad ordered extent accounting, root=4957 ino=257 OE offset=1622016 OE len=12288 to_dec=12288 left=0
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
BTRFS critical (device dm-3): bad ordered extent accounting, root=4957 ino=257 OE offset=1634304 OE len=8192 to_dec=4096 left=0
CPU: 1 UID: 0 PID: 3286940 Comm: kworker/u24:3 Tainted: G W OE 6.13.0-rc1-custom+ #89
Hardware name: QEMU KVM Virtual Machine, BIOS unknown 2/2/2022
Workqueue: btrfs_work_helper [btrfs] (btrfs-endio-write)
pstate: 404000c5 (nZcv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : process_one_work+0x110/0x680
lr : worker_thread+0x1bc/0x360
Call trace:
process_one_work+0x110/0x680 (P)
worker_thread+0x1bc/0x360 (L)
worker_thread+0x1bc/0x360
kthread+0x100/0x118
ret_from_fork+0x10/0x20
Code: f84086a1 f9000fe1 53041c21 b9003361 (f9400661)
---[ end trace 0000000000000000 ]---
Kernel panic - not syncing: Oops: Fatal exception
SMP: stopping secondary CPUs
SMP: failed to stop secondary CPUs 2-3
Dumping ftrace buffer:
(ftrace buffer empty)
Kernel Offset: 0x275bb9540000 from 0xffff800080000000
PHYS_OFFSET: 0xffff8fbba0000000
CPU features: 0x100,00000070,00801250,8201720b
[CAUSE]
The above warning is triggered immediately after the delalloc range
failure, this happens in the following sequence:
- Range [1568K, 1636K) is dirty
1536K 1568K 1600K 1636K 1664K
| |/////////|////////| |
Where 1536K, 1600K and 1664K are page boundaries (64K page size)
- Enter extent_writepage() for page 1536K
- Enter run_delalloc_nocow() with locked page 1536K and range
[1568K, 1636K)
This is due to the inode having preallocated extents.
- Enter cow_file_range() with locked page 1536K and range
[1568K, 1636K)
- btrfs_reserve_extent() only reserved two extents
The main loop of cow_file_range() only reserved two data extents,
Now we have:
1536K 1568K 1600K 1636K 1664K
| |<-->|<--->|/|///////| |
1584K 1596K
Range [1568K, 1596K) has an ordered extent reserved.
- btrfs_reserve_extent() failed inside cow_file_range() for file offset
1596K
This is already a bug in our space reservation code, but for now let's
focus on the error handling path.
Now cow_file_range() returned -ENOSPC.
- btrfs_run_delalloc_range() do error cleanup <<< ROOT CAUSE
Call btrfs_cleanup_ordered_extents() with locked folio 1536K and range
[1568K, 1636K)
Function btrfs_cleanup_ordered_extents() normally needs to skip the
ranges inside the folio, as it will normally be cleaned up by
extent_writepage().
Such split error handling is already problematic in the first place.
What's worse is the folio range skipping itself, which is not taking
subpage cases into consideration at all, it will only skip the range
if the page start >= the range start.
In our case, the page start < the range start, since for subpage cases
we can have delalloc ranges inside the folio but not covering the
folio.
So it doesn't skip the page range at all.
This means all the ordered extents, both [1568K, 1584K) and
[1584K, 1596K) will be marked as IOERR.
And these two ordered extents have no more pending ios, they are marked
finished, and *QUEUED* to be deleted from the io tree.
- extent_writepage() do error cleanup
Call btrfs_mark_ordered_io_finished() for the range [1536K, 1600K).
Although ranges [1568K, 1584K) and [1584K, 1596K) are finished, the
deletion from io tree is async, it may or may not happen at this
time.
If the ranges have not yet been removed, we will do double cleaning on
those ranges, triggering the above ordered extent warnings.
In theory there are other bugs, like the cleanup in extent_writepage()
can cause double accounting on ranges that are submitted asynchronously
(compression for example).
But that's much harder to trigger because normally we do not mix regular
and compression delalloc ranges.
[FIX]
The folio range split is already buggy and not subpage compatible, it
was introduced a long time ago where subpage support was not even considered.
So instead of splitting the ordered extents cleanup into the folio range
and out of folio range, do all the cleanup inside writepage_delalloc().
- Pass @NULL as locked_folio for btrfs_cleanup_ordered_extents() in
btrfs_run_delalloc_range()
- Skip the btrfs_cleanup_ordered_extents() if writepage_delalloc()
failed
So all ordered extents are only cleaned up by
btrfs_run_delalloc_range().
- Handle the ranges that already have ordered extents allocated
If part of the folio already has ordered extent allocated, and
btrfs_run_delalloc_range() failed, we also need to cleanup that range.
Now we have a concentrated error handling for ordered extents during
btrfs_run_delalloc_range().
Fixes: d1051d6ebf ("btrfs: Fix error handling in btrfs_cleanup_ordered_extents")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Rename the macro so it's obvious what it means.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Same pattern in both functions, we really only use index, start_index is
redundant.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are only 2 WAIT_* values left for wait parameter, we can encode
this to the function name if the waiting functionality is split.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Last use was in the readahead code that got removed by f26c923860
("btrfs: remove reada infrastructure").
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Another conversion to folio API, use the folio locking directly instead
of back and forth page <-> folio conversions.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The check function pattern is supposed to return true/false, currently
there's only one error code.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use the folio API, remove page_folio/folio_page conversions.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Continue page to folio updates, sync what the function does with it's
name.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Using the kmem cache freeing directly is clear enough, we don't need to
wrap it. All the users are in the same file.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>