-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmko/DUACgkQxWXV+ddt
WDtyCw//UaFOTX/k72HgA1n2MWfegWbWyD+OGNbGosoZljKOrAe/mRnjXTyF9lyW
8GDzGvJzF4Tkl5lyuGyiequlrO2F3veTpwHo94xnBTOYCeiTpMqTN/e/5SkasBpN
4YlWq7OGYR4hwghRvZpaW7nsmVCKDLIlZVkH77x9Bmvx0NLO24EJlEZusQT4zYew
ntC/i9x3DW0ZxYyfRhFIFvk9JUUdgXfxJ6dNexz0zi3dKUSUIR9hI0J9Nwl++1cF
SgjAzbtO064htWoCvsKykgA6YGbJCZjw8XO8D2eJonkN24VbqSMaY44TPXmCMLVs
ZXw871jV2E/urfWhRNdxv/kJdCFudPk0qXG5ZtfHO4UUwS/nZ+qAig+LHawgAOCJ
9CgWy4zrfiYCqULRuqF1wzWu/z22++zIlZC552VAZd1RQ+JjqJY/aje4xhY5nUF4
n1uVBReZaI9sH3jJOsMWpwLMptbhpH9RZp3QPgqZlUHo6GtPJJmNKfw8KgMAhZ7L
wf7iy6v9yo+7VZ2ACwu2qJ+lZRxbZ0yvCnFatN3O5G1O0kkIrZFUM3MwdKtufZ0u
LHWkGfoaq7zR6E6DhIaxIhiTTXMlOfLTikNKgBUO3NEdrRZwrDhr7K07S25jFxSx
ZCNV6OdSCeziShPqT0ntcwecnJ41/kOcm13732NHF+QgzMK5LrI=
=rO4x
-----END PGP SIGNATURE-----
Merge tag 'for-6.19-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"Features:
- shutdown ioctl support (needs CONFIG_BTRFS_EXPERIMENTAL for now):
- set filesystem state as being shut down (also named going down
in other filesystems), where all active operations return EIO
and this cannot be changed until unmount
- pending operations are attempted to be finished but error
messages may still show up depending on where exactly the
shutdown happened
- scrub (and device replace) vs suspend/hibernate:
- a running scrub will prevent suspend, which can be annoying as
suspend is an immediate request and scrub is not critical
- filesystem freezing before suspend was not sufficient as the
problem was in process freezing
- behaviour change: on suspend scrub and device replace are
cancelled, where scrub can record the last state and continue
from there; the device replace has to be restarted from the
beginning
- zone stats exported in sysfs, from the perspective of the
filesystem this includes active, reclaimable, relocation etc zones
Performance:
- improvements when processing space reservation tickets by
optimizing locking and shrinking critical sections, cumulative
improvements in lockstat numbers show +15%
Notable fixes:
- use vmalloc fallback when allocating bios as high order allocations
can happen with wide checksums (like sha256)
- scrub will always track the last position of progress so it's not
starting from zero after an error
Core:
- under experimental config, checksum calculations are offloaded to
process context, simplifies locking and allows to remove
compression write worker kthread(s):
- speed improvement in direct IO throughput with buffered IO
fallback is +15% when not offloaded but this is more related to
internal crypto subsystem improvements
- this will be probably default in the future removing the sysfs
tunable
- (experimental) block size > page size updates:
- support more operations when not using large folios (encoded
read/write and send)
- raid56
- more preparations for fscrypt support
Other:
- more conversions to auto-cleaned variables
- parameter cleanups and removals
- extended warning fixes
- improved printing of structured values like keys
- lots of other cleanups and refactoring"
* tag 'for-6.19-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (147 commits)
btrfs: remove unnecessary inode key in btrfs_log_all_parents()
btrfs: remove redundant zero/NULL initializations in btrfs_alloc_root()
btrfs: remaining BTRFS_PATH_AUTO_FREE conversions
btrfs: send: do not allocate memory for xattr data when checking it exists
btrfs: send: add unlikely to all unexpected overflow checks
btrfs: reduce arguments to btrfs_del_inode_ref_in_log()
btrfs: remove root argument from btrfs_del_dir_entries_in_log()
btrfs: use test_and_set_bit() in btrfs_delayed_delete_inode_ref()
btrfs: don't search back for dir inode item in INO_LOOKUP_USER
btrfs: don't rewrite ret from inode_permission
btrfs: add orig_logical to btrfs_bio for encryption
btrfs: disable verity on encrypted inodes
btrfs: disable various operations on encrypted inodes
btrfs: remove redundant level reset in btrfs_del_items()
btrfs: simplify leaf traversal after path release in btrfs_next_old_leaf()
btrfs: optimize balance_level() path reference handling
btrfs: factor out root promotion logic into promote_child_to_root()
btrfs: raid56: remove the "_step" infix
btrfs: raid56: enable bs > ps support
btrfs: raid56: prepare finish_parity_scrub() to support bs > ps cases
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaSmOZgAKCRCRxhvAZXjc
opxBAQCjNjr0yTSoaGRM0CJXg79Of3DLIlBdB7TygibTN16WhwEA+VKWoHL5eRjg
PZlwZD4Ei2ymeQYxi+6owTF8G806tAs=
=m/Bt
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.19-rc1.guards' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull superblock lock guard updates from Christian Brauner:
"This starts the work of introducing guards for superblock related
locks.
Introduce super_write_guard for scoped superblock write protection.
This provides a guard-based alternative to the manual sb_start_write()
and sb_end_write() pattern, allowing the compiler to automatically
handle the cleanup"
* tag 'vfs-6.19-rc1.guards' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
xfs: use super write guard in xfs_file_ioctl()
open: use super write guard in do_ftruncate()
btrfs: use super write guard in relocating_repair_kthread()
ext4: use super write guard in write_mmp_block()
btrfs: use super write guard in sb_start_write()
btrfs: use super write guard btrfs_run_defrag_inode()
btrfs: use super write guard in btrfs_reclaim_bgs_work()
fs: add super_write_guard
Many fields of struct btrfs_path are used as booleans but their type is
an unsigned int (of one 1 bit width to save space). Change the type to
bool keeping the :1 suffix so that they combine with the previous u8
fields in order to save space. This makes the code more clear by using
explicit true/false and more in line with the preferred style, preserving
the size of the structure.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When I tried to remove btrfs_bio::fs_info and use btrfs_bio::inode to
grab the fs_info, the header "btrfs_inode.h" is needed to access the
full btrfs_inode structure.
Then btrfs will fail to compile.
[CAUSE]
There is a recursive including chain:
"bio.h" -> "btrfs_inode.h" -> "extent_map.h" -> "compression.h" ->
"bio.h"
That recursive including is causing problems for btrfs.
[ENHANCEMENT]
To reduce the risk of recursive including:
- Remove unnecessary local includes from btrfs headers
Either the included header is pulled in by other headers, or is
completely unnecessary.
- Remove btrfs local includes if the header only requires a pointer
In that case let the implementing C file to pull the required header.
This is especially important for headers like "btrfs_inode.h" which
pulls in a lot of other btrfs headers, thus it's a mine field of
recursive including.
- Remove unnecessary temporary structure definition
Either if we have included the header defining the structure, or
completely unused.
Now including "btrfs_inode.h" inside "bio.h" is completely fine,
although "btrfs_inode.h" still includes "extent_map.h", but that header
only includes "fs.h", no more chain back to "bio.h".
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs defined its own variant of folio_next_pos() called folio_end().
This is an ambiguous name as 'end' might be exclusive or inclusive.
Switch to the new folio_next_pos().
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://patch.msgid.link/20251024170822.1427218-3-willy@infradead.org
Acked-by: David Sterba <dsterba@suse.com>
Cc: Chris Mason <clm@fb.com>
Cc: David Sterba <dsterba@suse.com>
Cc: linux-btrfs@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
The unlikely() annotation is a static prediction hint that compiler may
use to reorder code out of hot path. We use it elsewhere (namely
tree-checker.c) for error branches that almost never happen, where
EIO is one of them.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Annual typo fixing pass. Strangely codespell found only about 30% of
what is in this patch, the rest was done manually using text
spellchecker with a custom dictionary of acceptable terms.
Reviewed-by: Neal Gompa <neal@gompa.dev>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently the defrag ioctl cannot rewrite the extents without
compression. Add a new flag for that, as setting compression to 0 (or
"no compression") means to do no changes to compression so take what is
the current default, like mount options or properties.
The defrag setting overrides mount or properties. The compression
BTRFS_DEFRAG_DONT_COMPRESS is only used for in-memory operations and
does not need to have a fixed value.
Mount with zstd:9, copy test file from /usr/bin/ (about 260KB):
$ mount -o compress=zstd:9 /dev/vda /mnt
$ filefrag -vsb testfile
filefrag: -b needs a blocksize option, assuming 1024-byte blocks.
Filesystem type is: 9123683e
File size of testfile is 297704 (292 blocks of 1024 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 127: 13312.. 13439: 128: encoded
1: 128.. 255: 13364.. 13491: 128: 13440: encoded
2: 256.. 291: 13424.. 13459: 36: 13492: last,encoded,eof
testfile: 3 extents found
$ compsize testfile
Processed 1 file, 3 regular extents (3 refs), 0 inline, 1 fragments.
Type Perc Disk Usage Uncompressed Referenced
TOTAL 42% 124K 292K 292K
zstd 42% 124K 292K 292K
Defrag to uncompressed:
$ btrfs fi defrag --nocomp testfile
$ filefrag -vsb testfile
filefrag: -b needs a blocksize option, assuming 1024-byte blocks.
Filesystem type is: 9123683e
File size of testfile is 297704 (292 blocks of 1024 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 291: 291840.. 292131: 292: last,eof
testfile: 1 extent found
$ compsize testfile
Processed 1 file, 1 regular extents (1 refs), 0 inline, 1 fragments.
Type Perc Disk Usage Uncompressed Referenced
TOTAL 100% 292K 292K 292K
none 100% 292K 292K 292K
Compress again with LZO:
$ btrfs fi defrag -clzo testfile
$ filefrag -vsb testfile
filefrag: -b needs a blocksize option, assuming 1024-byte blocks.
Filesystem type is: 9123683e
File size of testfile is 297704 (292 blocks of 1024 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 127: 13312.. 13439: 128: encoded
1: 128.. 255: 13392.. 13519: 128: 13440: encoded
2: 256.. 291: 13480.. 13515: 36: 13520: last,encoded,eof
testfile: 3 extents found
$ compsize testfile
Processed 1 file, 3 regular extents (3 refs), 0 inline, 1 fragments.
Type Perc Disk Usage Uncompressed Referenced
TOTAL 64% 188K 292K 292K
lzo 64% 188K 292K 292K
Signed-off-by: David Sterba <dsterba@suse.com>
Simplify folio_pos() + folio_size() and use the new helper.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In preparation to use a helper for folio_pos + folio_size, rename the
variables for the locked range so they don't use the 'folio_' prefix. As
the locking ranges take inclusive end of the range (hence the "-1") this
would be confusing as the folio helpers typically use exclusive end of
the range.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use the rb-tree helper so we don't open code the search and insert
code.
Signed-off-by: Yangtao Li <frank.li@vivo.com>
Signed-off-by: Pan Chuang <panchuang@vivo.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of using list_entry() against the list's prev entry, use
list_last_entry(), which removes the need to know the last member is
accessed through the prev list pointer and the naming makes it easier
to reason about what we are doing.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Old code has a lot of int for bool return values, bool is recommended
and done in new code. Convert the trivial cases that do simple 0/false
and 1/true. Functions comment are updated if needed.
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we reject large folios for defrag gracefully, but the
implementation itself is already mostly large folios compatible.
There are several parts of defrag in btrfs:
- Extent map checking
Aka, defrag_collect_targets(), which prepares a list of target ranges
that should be defragged.
This part is completely folio unrelated, thus it doesn't care about
the folio size.
- Target folio preparation
Aka, defrag_prepare_one_folio(), which lock and read (if needed) the
target folio.
Since folio read and lock are already supporting large folios, this
part needs only minor changes.
- Redirty the target range of the folio
This is already done in a way supporting large folios.
So it's pretty straightforward to enable large folios for defrag:
- Do not reject large folios for experimental builds
This affects the large folio check inside defrag_prepare_one_folio().
- Wait for ordered extents of the whole folio in
defrag_prepare_one_folio()
- Lock the whole extent range for all involved folios in
defrag_one_range()
- Allow the folios[] array to be partially empty
Since we can have large folios, folios[] will not always be full.
This affects:
* How to allocate folios in defrag_one_range()
Now we cannot use page index, but use the end position of the folio
as an iterator.
* How to free the folios[] array
If we hit an empty slot, it means we have large folios and already
hit the end of the array.
* How to mark the range dirty
Instead of use page index directly, we have to go through each
folio, and check if the folio covers the defrag target inside
defrag_one_locked_target().
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Rename all the exported functions from extent_map.h that don't have a
'btrfs_' prefix in their names, so that they are consistent with all the
other functions, to make it clear they are btrfs specific functions and
to avoid potential name collisions in the future with functions defined
elsewhere in the kernel.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These functions are exported and don't have a 'btrfs_' prefix in their
names, which goes against coding style conventions. Rename them to have
such prefix, making it clear they are from btrfs and avoiding potential
collisions in the future with functions defined elsewhere outside btrfs.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These functions are exported and don't have a 'btrfs_' prefix in their
names, which goes against coding style conventions. Rename them to have
such prefix, making it clear they are from btrfs and avoiding potential
collisions in the future with functions defined elsewhere outside btrfs.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These functions are exported and don't have a 'btrfs_' prefix in their
names, which goes against coding style conventions. Rename them to have
such prefix, making it clear they are from btrfs and avoiding potential
collisions in the future with functions defined elsewhere outside btrfs.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These functions are exported so they should have a 'btrfs_' prefix by
convention, to make it clear they are btrfs specific and to avoid
collisions with functions from elsewhere in the kernel.
So add a 'btrfs_' prefix to their names to make it clear they are from
btrfs.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is an exported function so it should have a 'btrfs_' prefix by
convention, to make it clear it's btrfs specific and to avoid collisions
with functions from elsewhere in the kernel.
So rename it to btrfs_set_extent_bit().
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These functions are exported so they should have a 'btrfs_' prefix by
convention, to make it clear they are btrfs specific and to avoid
collisions with functions from elsewhere in the kernel. One of them has a
double underscore prefix which is also discouraged.
So remove double underscore prefix where applicable and add a 'btrfs_'
prefix to their name to make it clear they are from btrfs.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These functions are exported so they should have a 'btrfs_' prefix by
convention, to make it clear they are btrfs specific and to avoid
collisions with functions from elsewhere in the kernel. So add a prefix to
their name.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Simplify conditionally reading an rb_entry(), there's the
rb_entry_safe() helper that checks the node pointer for NULL so we don't
have to write it explicitly.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's an internal function and most of the time the callers are doing a lot
of BTRFS_I() calls on the returned VFS inode to get the btrfs inode, so
change the return type to struct btrfs_inode instead.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The zstd and zlib compression types support setting compression levels.
Enhance the defrag interface to specify the levels as well. For zstd the
negative (realtime) levels are also accepted.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Pass a struct btrfs_inode to btrfs_defrag_file() as it's an internal
interface, allowing to remove some use of BTRFS_I.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When running defrag (manual defrag) against a file that has extents that
are contiguous and we already have the respective extent maps loaded and
merged, we end up not defragging the range covered by those contiguous
extents. This happens when we have an extent map that was the result of
merging multiple extent maps for contiguous extents and the length of the
merged extent map is greater than or equals to the defrag threshold
length.
The script below reproduces this scenario:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
mkfs.btrfs -f $DEV
mount $DEV $MNT
# Create a 256K file with 4 extents of 64K each.
xfs_io -f -c "falloc 0 64K" \
-c "pwrite 0 64K" \
-c "falloc 64K 64K" \
-c "pwrite 64K 64K" \
-c "falloc 128K 64K" \
-c "pwrite 128K 64K" \
-c "falloc 192K 64K" \
-c "pwrite 192K 64K" \
$MNT/foo
umount $MNT
echo -n "Initial number of file extent items: "
btrfs inspect-internal dump-tree -t 5 $DEV | grep EXTENT_DATA | wc -l
mount $DEV $MNT
# Read the whole file in order to load and merge extent maps.
cat $MNT/foo > /dev/null
btrfs filesystem defragment -t 128K $MNT/foo
umount $MNT
echo -n "Number of file extent items after defrag with 128K threshold: "
btrfs inspect-internal dump-tree -t 5 $DEV | grep EXTENT_DATA | wc -l
mount $DEV $MNT
# Read the whole file in order to load and merge extent maps.
cat $MNT/foo > /dev/null
btrfs filesystem defragment -t 256K $MNT/foo
umount $MNT
echo -n "Number of file extent items after defrag with 256K threshold: "
btrfs inspect-internal dump-tree -t 5 $DEV | grep EXTENT_DATA | wc -l
Running it:
$ ./test.sh
Initial number of file extent items: 4
Number of file extent items after defrag with 128K threshold: 4
Number of file extent items after defrag with 256K threshold: 4
The 4 extents don't get merged because we have an extent map with a size
of 256K that is the result of merging the individual extent maps for each
of the four 64K extents and at defrag_lookup_extent() we have a value of
zero for the generation threshold ('newer_than' argument) since this is a
manual defrag. As a consequence we don't call defrag_get_extent() to get
an extent map representing a single file extent item in the inode's
subvolume tree, so we end up using the merged extent map at
defrag_collect_targets() and decide not to defrag.
Fix this by updating defrag_lookup_extent() to always discard extent maps
that were merged and call defrag_get_extent() regardless of the minimum
generation threshold ('newer_than' argument).
A test case for fstests will be sent along soon.
CC: stable@vger.kernel.org # 6.1+
Fixes: 199257a78b ("btrfs: defrag: don't use merged extent map for their generation check")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When cleaning up defrag inodes at btrfs_cleanup_defrag_inodes(), called
during remount and unmount, we are freeing every node from the rbtree
that tracks inodes for auto defrag using
rbtree_postorder_for_each_entry_safe(), which doesn't modify the tree
itself. So once we unlock the lock that protects the rbtree, we have a
tree pointing to a root that was freed (and a root pointing to freed
nodes, and their children pointing to other freed nodes, and so on).
This makes further access to the tree result in a use-after-free with
unpredictable results.
Fix this by initializing the rbtree to an empty root after the call to
rbtree_postorder_for_each_entry_safe() and before unlocking.
Fixes: 276940915f ("btrfs: clear defragmented inodes using postorder in btrfs_cleanup_defrag_inodes()")
Reported-by: syzbot+ad7966ca1f5dd8b001b3@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/000000000000f9aad406223eabff@google.com/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Defrag ioctl passes readahead from the file, but autodefrag does not
have a file so the readahead state is allocated when needed.
The autodefrag loop in cleaner thread iterates over inodes so we can
simply provide an on-stack readahead state and will not need to allocate
it in btrfs_defrag_file(). The size is 32 bytes which is acceptable.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's only one caller inode_should_defrag() that passes NULL to
btrfs_add_inode_defrag() so we can drop it an simplify the code.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The potential memory allocation failure is not a fatal error, skipping
autodefrag is fine and the caller inode_should_defrag() does not care
about the errors. Further writes can attempt to add the inode back to
the defragmentation list again.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_cleanup_defrag_inodes() is not called frequently, only in remount
or unmount, but the way it frees the inodes in fs_info->defrag_inodes
is inefficient. Each time it needs to locate first node, remove it,
potentially rebalance tree until it's done. This allows to do a
conditional reschedule.
For cleanups the rbtree_postorder_for_each_entry_safe() iterator is
convenient but we can't reschedule and restart iteration because some of
the tree nodes would be already freed.
The cleanup operation is kmem_cache_free() which will likely take the
fast path for most objects so rescheduling should not be necessary.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function does not follow the pattern where the underscores would be
justified, so rename it.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function does not follow the pattern where the underscores would be
justified, so rename it.
Also update the misleading comment, the passed item is not freed, that's
what the caller does.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function does not follow the pattern where the underscores would be
justified, so rename it.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A comparator function does not change its parameters, make them const.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function does not follow the pattern where the underscores would be
justified, so rename it.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
KCSAN complains about a data race when accessing the last_trans field of a
root:
[ 199.553628] BUG: KCSAN: data-race in btrfs_record_root_in_trans [btrfs] / record_root_in_trans [btrfs]
[ 199.555186] read to 0x000000008801e308 of 8 bytes by task 2812 on cpu 1:
[ 199.555210] btrfs_record_root_in_trans+0x9a/0x128 [btrfs]
[ 199.555999] start_transaction+0x154/0xcd8 [btrfs]
[ 199.556780] btrfs_join_transaction+0x44/0x60 [btrfs]
[ 199.557559] btrfs_dirty_inode+0x9c/0x140 [btrfs]
[ 199.558339] btrfs_update_time+0x8c/0xb0 [btrfs]
[ 199.559123] touch_atime+0x16c/0x1e0
[ 199.559151] pipe_read+0x6a8/0x7d0
[ 199.559179] vfs_read+0x466/0x498
[ 199.559204] ksys_read+0x108/0x150
[ 199.559230] __s390x_sys_read+0x68/0x88
[ 199.559257] do_syscall+0x1c6/0x210
[ 199.559286] __do_syscall+0xc8/0xf0
[ 199.559318] system_call+0x70/0x98
[ 199.559431] write to 0x000000008801e308 of 8 bytes by task 2808 on cpu 0:
[ 199.559464] record_root_in_trans+0x196/0x228 [btrfs]
[ 199.560236] btrfs_record_root_in_trans+0xfe/0x128 [btrfs]
[ 199.561097] start_transaction+0x154/0xcd8 [btrfs]
[ 199.561927] btrfs_join_transaction+0x44/0x60 [btrfs]
[ 199.562700] btrfs_dirty_inode+0x9c/0x140 [btrfs]
[ 199.563493] btrfs_update_time+0x8c/0xb0 [btrfs]
[ 199.564277] file_update_time+0xb8/0xf0
[ 199.564301] pipe_write+0x8ac/0xab8
[ 199.564326] vfs_write+0x33c/0x588
[ 199.564349] ksys_write+0x108/0x150
[ 199.564372] __s390x_sys_write+0x68/0x88
[ 199.564397] do_syscall+0x1c6/0x210
[ 199.564424] __do_syscall+0xc8/0xf0
[ 199.564452] system_call+0x70/0x98
This is because we update and read last_trans concurrently without any
type of synchronization. This should be generally harmless and in the
worst case it can make us do extra locking (btrfs_record_root_in_trans())
trigger some warnings at ctree.c or do extra work during relocation - this
would probably only happen in case of load or store tearing.
So fix this by always reading and updating the field using READ_ONCE()
and WRITE_ONCE(), this silences KCSAN and prevents load and store tearing.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's pointless to pass a super block argument to btrfs_iget() because we
always pass a root and from it we can get the super block through:
root->fs_info->sb
So remove the super block argument.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The member extent_map::block_start can be calculated from
extent_map::disk_bytenr + extent_map::offset for regular extents.
And otherwise just extent_map::disk_bytenr.
And this is already validated by the validate_extent_map(). Now we can
remove the member.
However there is a special case in btrfs_create_dio_extent() where we
for NOCOW/PREALLOC ordered extents cannot directly use the resulting
btrfs_file_extent, as btrfs_split_ordered_extent() cannot handle them
yet.
So for that call site, we pass file_extent->disk_bytenr +
file_extent->num_bytes as disk_bytenr for the ordered extent, and 0 for
offset.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since we have extent_map::offset, the old extent_map::orig_start is just
extent_map::start - extent_map::offset for non-hole/inline extents.
And since the new extent_map::offset is already verified by
validate_extent_map() while the old orig_start is not, let's just remove
the old member from all call sites.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce two new members for extent_map:
- disk_bytenr
- offset
Both are matching the members with the same name inside
btrfs_file_extent_items.
For now this patch only touches those members when:
- Reading btrfs_file_extent_items from disk
- Inserting new holes
- Merging two extent maps
With the new disk_bytenr and disk_num_bytes, doing merging would be a
little more complex, as we have 3 different cases:
* Both extent maps are referring to the same data extents
|<----- data extent A ----->|
|<- em 1 ->|<- em 2 ->|
* Both extent maps are referring to different data extents
|<-- data extent A -->|<-- data extent B -->|
|<- em 1 ->|<- em 2 ->|
* One of the extent maps is referring to a merged and larger data
extent that covers both extent maps
This is not really valid case other than some selftests.
So this test case would be removed.
A new helper merge_ondisk_extents() is introduced to handle the above
valid cases.
To properly assign values for those new members, a new btrfs_file_extent
parameter is introduced to all the involved call sites.
- For NOCOW writes the btrfs_file_extent would be exposed from
can_nocow_file_extent().
- For other writes, the members can be easily calculated
As most of them have 0 offset and utilizing the whole on-disk data
extent.
The exception is encoded write, but thankfully that interface provided
offset directly and all other needed info.
For now, both the old members (block_start/block_len/orig_start) are
co-existing with the new members (disk_bytenr/offset), meanwhile all the
critical code is still using the old members only.
The cleanup will happen later after all the old and new members are
properly validated.
There would be some re-ordering for the assignment of the extent_map
members, now we follow the new ordering:
- start and len
Or file_pos and num_bytes for other structures.
- disk_bytenr and disk_num_bytes
- offset and ram_bytes
- compression
So expect some seemingly unrelated line movement.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A comment from Filipe on one of my previous cleanups brought my
attention to a new helper we have for getting the root id of a root,
which makes it easier to read in the code.
The changes where made with the following Coccinelle semantic patch:
// <smpl>
@@
expression E,E1;
@@
(
E->root_key.objectid = E1
|
- E->root_key.objectid
+ btrfs_root_id(E)
)
// </smpl>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor style fixups ]
Signed-off-by: David Sterba <dsterba@suse.com>
The SLAB_MEM_SPREAD flag used to be implemented in SLAB, which was
removed as of v6.8-rc1, so it became a dead flag since the commit
16a1d96835 ("mm/slab: remove mm/slab.c and slab_def.h"). And the
series[1] went on to mark it obsolete to avoid confusion for users.
Here we can just remove all its users, which has no functional change.
[1] https://lore.kernel.org/all/20240223-slab-cleanup-flags-v2-1-02f1753e8303@suse.cz/
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a convenience helper to get a fs_info from a VFS inode pointer
instead of open coding the chain or using btrfs_sb() that in some cases
does one more pointer hop. This is implemented as a macro (still with
type checking) so we don't need full definitions of struct btrfs_inode,
btrfs_root or btrfs_fs_info.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The BUG_ON verifies a condition that should be guaranteed by the correct
use of the path search (with keep_locks and lowest_level set), an
assertion is the suitable check.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With help of neovim, LSP and clangd we can identify header files that
are not actually needed to be included in the .c files. This is focused
only on removal (with minor fixups), further cleanups are possible but
will require doing the header files properly with forward declarations,
minimized includes and include-what-you-use care.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Remove more hidden calls to compound_head() by using an array of folios
instead of pages. Also neaten the error path in defrag_one_range() by
adjusting the length of the array instead of checking for NULL.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use a folio throughout defrag_prepare_one_page() to remove dozens of
hidden calls to compound_head(). There is no support here for large
folios; indeed, turn the existing check for PageCompound into a check
for large folios.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>