It's pointless to pass a super block argument to btrfs_iget_path() because
we always pass a root and from it we can get the super block through:
root->fs_info->sb
So remove the super block argument.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's pointless to pass a super block argument to btrfs_iget() because we
always pass a root and from it we can get the super block through:
root->fs_info->sb
So remove the super block argument.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When creating and deleting a subvolume, after starting a transaction we
are explicitly calling btrfs_record_root_in_trans() for the root which we
passed to btrfs_start_transaction(). This is pointless because at
transaction.c:start_transaction() we end up doing that call, regardless
of whether we actually start a new transaction or join an existing one,
and if we were not it would mean the root item of that root would not
be updated in the root tree when committing the transaction, leading to
problems easy to spot with fstests for example.
Remove these redundant calls. They were introduced with commit
74e9795812 ("btrfs: qgroup: fix qgroup prealloc rsv leak in subvolume
operations").
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The following 3 parameters can be cleaned up using btrfs_file_extent
structure:
- len
btrfs_file_extent::num_bytes
- orig_block_len
btrfs_file_extent::disk_num_bytes
- ram_bytes
btrfs_file_extent::ram_bytes
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Most parameters of create_io_em() can be replaced by the members with
the same name inside btrfs_file_extent.
Do a direct parameters cleanup here.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All parameters after @filepos of btrfs_alloc_ordered_extent() can be
replaced with btrfs_file_extent structure.
This patch does the cleanup, meanwhile some points to note:
- Move btrfs_file_extent structure to ordered-data.h
The structure is needed by both btrfs_alloc_ordered_extent() and
can_nocow_extent(), but since btrfs_inode.h includes
ordered-data.h, so we need to move the structure to ordered-data.h.
- Move the special handling of NOCOW/PREALLOC into
btrfs_alloc_ordered_extent()
This is to allow btrfs_split_ordered_extent() to properly split them
for DIO.
For now just move the handling into btrfs_alloc_ordered_extent() to
simplify the callers.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The following functions and structures can be simplified using the
btrfs_file_extent structure:
- can_nocow_extent()
No need to return ram_bytes/orig_block_len through the parameter list,
the @file_extent parameter contains all the needed info.
- can_nocow_file_extent_args
The following members are no longer needed:
* disk_bytenr
This one is confusing as it's not really the
btrfs_file_extent_item::disk_bytenr, but where the IO would be,
thus it's file_extent::disk_bytenr + file_extent::offset now.
* num_bytes
Now file_extent::num_bytes.
* extent_offset
Now file_extent::offset.
* disk_num_bytes
Now file_extent::disk_num_bytes.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The member extent_map::block_start can be calculated from
extent_map::disk_bytenr + extent_map::offset for regular extents.
And otherwise just extent_map::disk_bytenr.
And this is already validated by the validate_extent_map(). Now we can
remove the member.
However there is a special case in btrfs_create_dio_extent() where we
for NOCOW/PREALLOC ordered extents cannot directly use the resulting
btrfs_file_extent, as btrfs_split_ordered_extent() cannot handle them
yet.
So for that call site, we pass file_extent->disk_bytenr +
file_extent->num_bytes as disk_bytenr for the ordered extent, and 0 for
offset.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The extent_map::block_len is either extent_map::len (non-compressed
extent) or extent_map::disk_num_bytes (compressed extent).
Since we already have sanity checks to do the cross-checks between the
new and old members, we can drop the old extent_map::block_len now.
For most call sites, they can manually select extent_map::len or
extent_map::disk_num_bytes, since most if not all of them have checked
if the extent is compressed.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since we have extent_map::offset, the old extent_map::orig_start is just
extent_map::start - extent_map::offset for non-hole/inline extents.
And since the new extent_map::offset is already verified by
validate_extent_map() while the old orig_start is not, let's just remove
the old member from all call sites.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce two new members for extent_map:
- disk_bytenr
- offset
Both are matching the members with the same name inside
btrfs_file_extent_items.
For now this patch only touches those members when:
- Reading btrfs_file_extent_items from disk
- Inserting new holes
- Merging two extent maps
With the new disk_bytenr and disk_num_bytes, doing merging would be a
little more complex, as we have 3 different cases:
* Both extent maps are referring to the same data extents
|<----- data extent A ----->|
|<- em 1 ->|<- em 2 ->|
* Both extent maps are referring to different data extents
|<-- data extent A -->|<-- data extent B -->|
|<- em 1 ->|<- em 2 ->|
* One of the extent maps is referring to a merged and larger data
extent that covers both extent maps
This is not really valid case other than some selftests.
So this test case would be removed.
A new helper merge_ondisk_extents() is introduced to handle the above
valid cases.
To properly assign values for those new members, a new btrfs_file_extent
parameter is introduced to all the involved call sites.
- For NOCOW writes the btrfs_file_extent would be exposed from
can_nocow_file_extent().
- For other writes, the members can be easily calculated
As most of them have 0 offset and utilizing the whole on-disk data
extent.
The exception is encoded write, but thankfully that interface provided
offset directly and all other needed info.
For now, both the old members (block_start/block_len/orig_start) are
co-existing with the new members (disk_bytenr/offset), meanwhile all the
critical code is still using the old members only.
The cleanup will happen later after all the old and new members are
properly validated.
There would be some re-ordering for the assignment of the extent_map
members, now we follow the new ordering:
- start and len
Or file_pos and num_bytes for other structures.
- disk_bytenr and disk_num_bytes
- offset and ram_bytes
- compression
So expect some seemingly unrelated line movement.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently function can_nocow_extent() only returns members needed for
extent_map.
However since we will soon change the extent_map structure to be more
like btrfs_file_extent_item, we want to expose the expected file extent
caused by the NOCOW write for future usage.
This introduces a new structure, btrfs_file_extent, to be a more
memory access friendly representation of btrfs_file_extent_item.
And use that structure to expose the expected file extent caused by the
NOCOW write.
For now there is no user of the new structure yet.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This would make it very obvious that the member just matches
btrfs_file_extent_item::disk_num_bytes.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently the core of the fiemap code lives in extent_io.c, which does
not make any sense because it's not related to extent IO at all (and it
was not as well before the big rewrite of fiemap I did some time ago).
The entry point for fiemap, btrfs_fiemap(), lives in inode.c since it's
an inode operation.
Since there's a significant amount of fiemap code, move all of it into a
dedicated file, including its entry point inode.c:btrfs_fiemap().
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When running 'make W=2' there are a few reports where a variable of the
same name is declared in a nested block. In all the cases we can use the
one declared in the parent block, no problematic cases were found.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of passing a (VFS) inode pointer argument, pass a btrfs_inode
instead, as this is generally what we do for internal APIs, making it
more consistent with most of the code base. This will later allow to
help to remove a lot of BTRFS_I() calls in btrfs_sync_file().
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
On 64 bits platforms we don't really need to have a dedicated member (the
objectid field) for the inode's number since we store in the VFS inode's
i_ino member, which is an unsigned long and this type is 64 bits wide on
64 bits platforms. We only need that field in case we are on a 32 bits
platform because the unsigned long type is 32 bits wide on such platforms
See commit 33345d0152 ("Btrfs: Always use 64bit inode number") regarding
this 64/32 bits detail.
The objectid field of struct btrfs_inode is also used to store the ID of
a root for directories that are stubs for unreferenced roots. In such
cases the inode is a directory and has the BTRFS_INODE_ROOT_STUB runtime
flag set.
So in order to reduce the size of btrfs_inode structure on 64 bits
platforms we can remove the objectid member and use the VFS inode's i_ino
member instead whenever we need to get the inode number. In case the inode
is a root stub (BTRFS_INODE_ROOT_STUB set) we can use the member
last_reflink_trans to store the ID of the unreferenced root, since such
inode is a directory and reflinks can't be done against directories.
So remove the objectid fields for 64 bits platforms and alias the
last_reflink_trans field with a name of ref_root_id in a union.
On a release kernel config, this reduces the size of struct btrfs_inode
from 1040 bytes down to 1032 bytes.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently struct btrfs_inode has a key member, named "location", that is
either:
1) The key of the inode's item. In this case the objectid is the number
of the inode;
2) A key stored in a dir entry with a type of BTRFS_ROOT_ITEM_KEY, for
the case where we have a root that is a snapshot of a subvolume that
points to other subvolumes. In this case the objectid is the ID of
a subvolume inside the snapshotted parent subvolume.
The key is only used to lookup the inode item for the first case, while
for the second it's never used since it corresponds to directory stubs
created with new_simple_dir() and which are marked as dummy, so there's
no actual inode item to ever update. In the second case we only check
the key type at btrfs_ino() for 32 bits platforms and its objectid is
only needed for unlink.
Instead of using a key we can do fine with just the objectid, since we
can generate the key whenever we need it having only the objectid, as
in all use cases the type is always BTRFS_INODE_ITEM_KEY and the offset
is always 0.
So use only an objectid instead of a full key. This reduces the size of
struct btrfs_inode from 1048 bytes down to 1040 bytes on a release kernel.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When not using the NO_HOLES feature we always allocate an io tree for an
inode's file_extent_tree. This is wasteful because that io tree is only
used for regular files, so we allocate more memory than needed for inodes
that represent directories or symlinks for example, or for inodes that
correspond to free space inodes.
So improve on this by allocating the io tree only for inodes of regular
files that are not free space inodes.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The index_cnt field of struct btrfs_inode is used only for two purposes:
1) To store the index for the next entry added to a directory;
2) For the data relocation inode to track the logical start address of the
block group currently being relocated.
For the relocation case we use index_cnt because it's not used for
anything else in the relocation use case - we could have used other fields
that are not used by relocation such as defrag_bytes, last_unlink_trans
or last_reflink_trans for example (among others).
Since the csum_bytes field is not used for directories, do the following
changes:
1) Put index_cnt and csum_bytes in a union, and index_cnt is only
initialized when the inode is a directory. The csum_bytes is only
accessed in IO paths for regular files, so we're fine here;
2) Use the defrag_bytes field for relocation, since the data relocation
inode is never used for defrag purposes. And to make the naming better,
alias it to reloc_block_group_start by using a union.
This reduces the size of struct btrfs_inode by 8 bytes in a release
kernel, from 1056 bytes down to 1048 bytes.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we use the spinlock inode_lock from struct btrfs_root to
serialize access to two different data structures:
1) The delayed inodes xarray (struct btrfs_root::delayed_nodes);
2) The inodes xarray (struct btrfs_root::inodes).
Instead of using our own lock, we can use the spinlock that is part of the
xarray implementation, by using the xa_lock() and xa_unlock() APIs and
using the xarray APIs with the double underscore prefix that don't take
the xarray locks and assume the caller is using xa_lock() and xa_unlock().
So remove the spinlock inode_lock from struct btrfs_root and use the
corresponding xarray locks. This brings 2 benefits:
1) We reduce the size of struct btrfs_root, from 1336 bytes down to
1328 bytes on a 64 bits release kernel config;
2) We reduce lock contention by not using anymore the same lock for
changing two different and unrelated xarrays.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Make btrfs_iget_path() simpler and easier to read by avoiding nesting of
if-then-else statements and having an error label to do all the error
handling instead of repeating it a couple times.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When creating a new inode, at btrfs_create_new_inode(), one of the very
last steps is to add the inode to the root's inodes xarray. This often
requires allocating memory which may fail (even though xarrays have a
dedicated kmem_cache which make it less likely to fail), and at that point
we are forced to abort the current transaction (as some, but not all, of
the inode metadata was added to its subvolume btree).
To avoid a transaction abort, preallocate memory for the xarray early at
btrfs_create_new_inode(), so that if we fail we don't need to abort the
transaction and the insertion into the xarray is guaranteed to succeed.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we use a red black tree (rb-tree) to track the currently open
inodes of a root (in struct btrfs_root::inode_tree). This however is not
very efficient when the number of inodes is large since rb-trees are
binary trees. For example for 100K open inodes, the tree has a depth of
17. Besides that, inserting into the tree requires navigating through it
and pulling useless cache lines in the process since the red black tree
nodes are embedded within the btrfs inode - on the other hand, by being
embedded, it requires no extra memory allocations.
We can improve this by using an xarray instead, which is efficient when
indices are densely clustered (such as inode numbers), is more cache
friendly and behaves like a resizable array, with a much better search
and insertion complexity than a red black tree. This only has one small
disadvantage which is that insertion will sometimes require allocating
memory for the xarray - which may fail (not that often since it uses a
kmem_cache) - but on the other hand we can reduce the btrfs inode
structure size by 24 bytes (from 1080 down to 1056 bytes) after removing
the embedded red black tree node, which after the next patches will allow
to reduce the size of the structure to 1024 bytes, meaning we will be able
to store 4 inodes per 4K page instead of 3 inodes.
This change does a straightforward change to use an xarray, and results
in a transaction abort if we can't allocate memory for the xarray when
creating an inode - but the next patch changes things so that we don't
need to abort.
Running the following fs_mark test showed some improvements:
$ cat test.sh
#!/bin/bash
DEV=/dev/nullb0
MNT=/mnt/nullb0
MOUNT_OPTIONS="-o ssd"
FILES=100000
THREADS=$(nproc --all)
echo "performance" | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
mkfs.btrfs -f $DEV
mount $MOUNT_OPTIONS $DEV $MNT
OPTS="-S 0 -L 5 -n $FILES -s 0 -t $THREADS -k"
for ((i = 1; i <= $THREADS; i++)); do
OPTS="$OPTS -d $MNT/d$i"
done
fs_mark $OPTS
umount $MNT
Before this patch:
FSUse% Count Size Files/sec App Overhead
10 1200000 0 92081.6 12505547
16 2400000 0 138222.6 13067072
23 3600000 0 148833.1 13290336
43 4800000 0 97864.7 13931248
53 6000000 0 85597.3 14384313
After this patch:
FSUse% Count Size Files/sec App Overhead
10 1200000 0 93225.1 12571078
16 2400000 0 146720.3 12805007
23 3600000 0 160626.4 13073835
46 4800000 0 116286.2 13802927
53 6000000 0 90087.9 14754892
The test was run with a release kernel config (Debian's default config).
Also capturing the insertion times into the rb tree and into the xarray,
that is measuring the duration of the old function inode_tree_add() and
the duration of the new btrfs_add_inode_to_root() function, gave the
following results (in nanoseconds):
Before this patch, inode_tree_add() execution times:
Count: 5000000
Range: 0.000 - 5536887.000; Mean: 775.674; Median: 729.000; Stddev: 4820.961
Percentiles: 90th: 1015.000; 95th: 1139.000; 99th: 1397.000
0.000 - 7.816: 40 |
7.816 - 37.858: 209 |
37.858 - 170.278: 6059 |
170.278 - 753.961: 2754890 #####################################################
753.961 - 3326.728: 2232312 ###########################################
3326.728 - 14667.018: 4366 |
14667.018 - 64652.943: 852 |
64652.943 - 284981.761: 550 |
284981.761 - 1256150.914: 221 |
1256150.914 - 5536887.000: 7 |
After this patch, btrfs_add_inode_to_root() execution times:
Count: 5000000
Range: 0.000 - 2900652.000; Mean: 272.148; Median: 241.000; Stddev: 2873.369
Percentiles: 90th: 342.000; 95th: 432.000; 99th: 572.000
0.000 - 7.264: 104 |
7.264 - 33.145: 352 |
33.145 - 140.081: 109606 #
140.081 - 581.930: 4840090 #####################################################
581.930 - 2407.590: 43532 |
2407.590 - 9950.979: 2245 |
9950.979 - 41119.278: 514 |
41119.278 - 169902.616: 155 |
169902.616 - 702018.539: 47 |
702018.539 - 2900652.000: 9 |
Average, percentiles, standard deviation, etc, are all much better.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmaG0jUACgkQxWXV+ddt
WDsg4Q//f7xRooomsRaRvIDN4Add17jyZWuhreMhxKKvdtoRzLq/YZUldGJt1XXu
tVpCvKkVvrJt5UY3b1czfA8GTHeOInZSoqyV8ZaxOAcg6cfKrUtLFi1wAKGU6BK/
S9Zz/9lfbjXhG4PnZW7GRFmH0n68p/dmS+kXPVVPhj7CibxuZJtifflK1r6rHw+Y
PDDnMIPqbrdK+xBfR+SEpG+1uIEFvI6SgsAZG6p0mzcbbUINGp7SpU3v2NVuvdIX
tjTlZhz+D2do4dPk3RJ4Z6dpL+VlZnR1F9CZn6SmPcdGWn6mTkywoukJPP62iHYO
b8Xr4j1mKDPcu55OEPWwhYWOvRP4FvltrGB+35xsQoPlKBBKxSMFgNyEkdfvjtKU
qL9qRYJvm8D2OBqdv5BnawKRSouPcR4KPiM9JSGWRcW6vdOiY3Cd1mnAR8EA1LSd
SRYmBlrZIIjOpEeGkEy2MeNKEH9L5B1wAU21VQd+cAlCKBFUmFb1csZ/vCJJ1B34
swedmirNx0lyZTNBGQsDipkxSM0c/0xj/ysG8N41bLvPUM35x0K9xw5ZqBlTXrrg
Lln51s6TeqSzfEvKdDGikBL3dVhT5I6acRUchQU9YSZ3Bqxj3QCTNnOZMZQsCFkp
CMohduRVHiK4LHQTkU++q8bT8xX20kh/1HIJhAjdnDbrlRY4oTU=
=tyXO
-----END PGP SIGNATURE-----
Merge tag 'for-6.10-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix folio refcounting when releasing them (encoded write, dummy
extent buffer)
- fix out of bounds read when checking qgroup inherit data
- fix how configurable chunk size is handled in zoned mode
- in the ref-verify tool, fix uninitialized return value when checking
extent owner ref and simple quota are not enabled
* tag 'for-6.10-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix folio refcount in __alloc_dummy_extent_buffer()
btrfs: fix folio refcount in btrfs_do_encoded_write()
btrfs: fix uninitialized return value in the ref-verify tool
btrfs: always do the basic checks for btrfs_qgroup_inherit structure
btrfs: zoned: fix calc_available_free_space() for zoned mode
The conversion to folios switched __free_page() to __folio_put() in the
error path in btrfs_do_encoded_write().
However, this gets the page refcounting wrong. If we do hit that error
path (I reproduced by modifying btrfs_do_encoded_write to pretend to
always fail in a way that jumps to out_folios and running the fstests
case btrfs/281), then we always hit the following BUG freeing the folio:
BUG: Bad page state in process btrfs pfn:40ab0b
page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x61be5 pfn:0x40ab0b
flags: 0x5ffff0000000000(node=0|zone=2|lastcpupid=0x1ffff)
raw: 05ffff0000000000 0000000000000000 dead000000000122 0000000000000000
raw: 0000000000061be5 0000000000000000 00000001ffffffff 0000000000000000
page dumped because: nonzero _refcount
Call Trace:
<TASK>
dump_stack_lvl+0x3d/0xe0
bad_page+0xea/0xf0
free_unref_page+0x8e1/0x900
? __mem_cgroup_uncharge+0x69/0x90
__folio_put+0xe6/0x190
btrfs_do_encoded_write+0x445/0x780
? current_time+0x25/0xd0
btrfs_do_write_iter+0x2cc/0x4b0
btrfs_ioctl_encoded_write+0x2b6/0x340
It turns out __free_page() decreases the page reference count while
__folio_put() does not. Switch __folio_put() to folio_put() which
decreases the folio reference count first.
Fixes: 400b172b8c ("btrfs: compression: migrate compression/decompression paths to folios")
Tested-by: Ed Tomlinson <edtoml@gmail.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
With 20 threads each walking a dedicated 1000 dirs * 1000 files
directory tree to stat(2) on a 32 core + 24GB ram vm:
before: 3.54s user 892.30s system 1966% cpu 45.549 total
after: 3.28s user 738.66s system 1955% cpu 37.932 total (-16.7%)
Benchmark can be found here: https://people.freebsd.org/~mjg/fstree.tgz
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Link: https://lore.kernel.org/r/20240611173824.535995-3-mjguzik@gmail.com
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmZCE4MACgkQxWXV+ddt
WDudtQ//WjXcHtY3I6NJtDhPsIOG3Qjg9mA0shp73X4djJtZoGCdgL7dq+fTp5lk
Wu6/XY5g+CSttTgwF4eyHgUSJOptKWY0XQDWxX5VR8WCM2qmUZ7SedlrBED9GNDM
rN/3egmc74OGwnqyQq3I/2qYLByXFj66tsvW3UBjLNB8vMHajjw1idj9ujipioHq
ySStPCHkPMwuhEzw9+CTe3W47VUSb5Ug3XDhAZXvxT99oDHn1m+CxKQwcona/IPH
1El8PmZ7JetaT9ZO3DICBICfCyo+2SSy/KXYypXXE+nzNZhbhC0V9N7Uqm1c91C0
aRglsJZCXmHBD4BPLvkls6CqEIvMc7FvcNCqQlrbRT6PlfX91/XaeDq4l3RUcuPn
mGShsdHUiwbPMWYVwqVUKd0IPiktF1R7yigTjYSkEFJTL6HFTrBqV/2fAMUsMfPc
8gyzYMCPQld73WmrnXZQPKvmzO/LvE0gS5cPapokGwoXstq9n3iYd4ypN0wN6sif
1jwy3efNzWXXMYV0WzcihKwFMm2fqp/pl9bXq/zwn2CunfIX4WTsaQ2NmJf81jqF
qFNjlr8S3qO7AvIOs+R2XY9E3VjfzeDADzvjpQy5J/ZYbcHBcxxdYDhg+QGhe5nB
eNmR51oL1pHSjU2M8PxATL8JxKkX2BvX6u64lVojaw4rxUlyFC0=
=MMpE
-----END PGP SIGNATURE-----
Merge tag 'for-6.10-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"This update brings a few minor performance improvements, otherwise
there's a lot of refactoring, cleanups and other sort of not user
visible changes.
Performance improvements:
- inline b-tree locking functions, improvement in metadata-heavy
changes
- relax locking on a range that's being reflinked, allows read
operations to run in parallel
- speed up NOCOW write checks (throughput +9% on a sample test)
- extent locking ranges have been reduced in several places, namely
around delayed ref processing
Core:
- more page to folio conversions:
- relocation
- send
- compression
- inline extent handling
- super block write and wait
- extent_map structure optimizations:
- reduced structure size
- code simplifications
- add shrinker for allocated objects, the numbers can go high and
could exhaust memory on smaller systems (reported) as they may
not get an opportunity to be freed fast enough
- extent locking optimizations:
- reduce locking ranges where it does not seem to be necessary and
are safe due to other means of synchronization
- potential improvements due to lower contention,
allocation/freeing and state management operations of extent
state tracking structures
- delayed ref cleanups and simplifications
- updated trace points
- improved error handling, warnings and assertions
- cleanups and refactoring, unification of error handling paths"
* tag 'for-6.10-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (122 commits)
btrfs: qgroup: fix initialization of auto inherit array
btrfs: count super block write errors in device instead of tracking folio error state
btrfs: use the folio iterator in btrfs_end_super_write()
btrfs: convert super block writes to folio in write_dev_supers()
btrfs: convert super block writes to folio in wait_dev_supers()
bio: Export bio_add_folio_nofail to modules
btrfs: remove duplicate included header from fs.h
btrfs: add a cached state to extent_clear_unlock_delalloc
btrfs: push extent lock down in submit_one_async_extent
btrfs: push lock_extent down in cow_file_range()
btrfs: move can_cow_file_range_inline() outside of the extent lock
btrfs: push lock_extent into cow_file_range_inline
btrfs: push extent lock into cow_file_range
btrfs: push extent lock into run_delalloc_cow
btrfs: remove unlock_extent from run_delalloc_compressed
btrfs: push extent lock down in run_delalloc_nocow
btrfs: adjust while loop condition in run_delalloc_nocow
btrfs: push extent lock into run_delalloc_nocow
btrfs: push the extent lock into btrfs_run_delalloc_range
btrfs: lock extent when doing inline extent in compression
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZj3HuwAKCRCRxhvAZXjc
orYvAQCZOr68uJaEaXAArYTdnMdQ6HIzG+FVlwrqtrhz0BV07wEAqgmtSR9XKh+L
0+DNepg4R8PZOHH371eSSsLNRCUCkAs=
=SVsU
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.10.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"This contains the usual miscellaneous features, cleanups, and fixes
for vfs and individual fses.
Features:
- Free up FMODE_* bits. I've freed up bits 6, 7, 8, and 24. That
means we now have six free FMODE_* bits in total (but bit #6
already got used for FMODE_WRITE_RESTRICTED)
- Add FOP_HUGE_PAGES flag (follow-up to FMODE_* cleanup)
- Add fd_raw cleanup class so we can make use of automatic cleanup
provided by CLASS(fd_raw, f)(fd) for O_PATH fds as well
- Optimize seq_puts()
- Simplify __seq_puts()
- Add new anon_inode_getfile_fmode() api to allow specifying f_mode
instead of open-coding it in multiple places
- Annotate struct file_handle with __counted_by() and use
struct_size()
- Warn in get_file() whether f_count resurrection from zero is
attempted (epoll/drm discussion)
- Folio-sophize aio
- Export the subvolume id in statx() for both btrfs and bcachefs
- Relax linkat(AT_EMPTY_PATH) requirements
- Add F_DUPFD_QUERY fcntl() allowing to compare two file descriptors
for dup*() equality replacing kcmp()
Cleanups:
- Compile out swapfile inode checks when swap isn't enabled
- Use (1 << n) notation for FMODE_* bitshifts for clarity
- Remove redundant variable assignment in fs/direct-io
- Cleanup uses of strncpy in orangefs
- Speed up and cleanup writeback
- Move fsparam_string_empty() helper into header since it's currently
open-coded in multiple places
- Add kernel-doc comments to proc_create_net_data_write()
- Don't needlessly read dentry->d_flags twice
Fixes:
- Fix out-of-range warning in nilfs2
- Fix ecryptfs overflow due to wrong encryption packet size
calculation
- Fix overly long line in xfs file_operations (follow-up to FMODE_*
cleanup)
- Don't raise FOP_BUFFER_{R,W}ASYNC for directories in xfs (follow-up
to FMODE_* cleanup)
- Don't call xfs_file_open from xfs_dir_open (follow-up to FMODE_*
cleanup)
- Fix stable offset api to prevent endless loops
- Fix afs file server rotations
- Prevent xattr node from overflowing the eraseblock in jffs2
- Move fdinfo PTRACE_MODE_READ procfs check into the .permission()
operation instead of .open() operation since this caused userspace
regressions"
* tag 'vfs-6.10.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (39 commits)
afs: Fix fileserver rotation getting stuck
selftests: add F_DUPDFD_QUERY selftests
fcntl: add F_DUPFD_QUERY fcntl()
file: add fd_raw cleanup class
fs: WARN when f_count resurrection is attempted
seq_file: Simplify __seq_puts()
seq_file: Optimize seq_puts()
proc: Move fdinfo PTRACE_MODE_READ check into the inode .permission operation
fs: Create anon_inode_getfile_fmode()
xfs: don't call xfs_file_open from xfs_dir_open
xfs: drop fop_flags for directories
xfs: fix overly long line in the file_operations
shmem: Fix shmem_rename2()
libfs: Add simple_offset_rename() API
libfs: Fix simple_offset_rename_exchange()
jffs2: prevent xattr node from overflowing the eraseblock
vfs, swap: compile out IS_SWAPFILE() on swapless configs
vfs: relax linkat() AT_EMPTY_PATH - aka flink() - requirements
fs/direct-io: remove redundant assignment to variable retval
fs/dcache: Re-use value stored to dentry->d_flags instead of re-reading
...
Now that we have the lock_extent tightly coupled with
extent_clear_unlock_delalloc we can add a cached state to
extent_clear_unlock_delalloc and benefit from skipping the extra lookup
when we're doing cow.
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We don't need to include the time we spend in the allocator under our
extent lock protection, move it after the allocator and make sure we
lock the extent in the error case to ensure we're not clearing these
bits without the extent lock held.
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we've got the extent lock pushed into cow_file_range() we can
push it further down into the allocation loop. This allows us to only
hold the extent lock during the dropping of the extent map range and
inserting the ordered extent.
This makes the error case a little trickier as we'll now have to lock
the range before clearing any of the other extent bits for the range,
but this is the error path so is less performance critical.
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These checks aren't reliant on the extent lock. Move this up into
cow_file_range_inline(), and then update encoded writes to call this
check before calling __cow_file_range_inline(). This will allow us to
skip the extent lock if we're not able to inline the given extent.
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we've pushed the lock_extent() into cow_file_range() we can
push the extent locking into cow_file_range_inline() and move the
lock_extent in cow_file_range() to after we call
cow_file_range_inline().
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that cow_file_range is the only function that is called with the
range locked, push this call into cow_file_range so we can further
narrow the scope.
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is used by zoned but also as the fallback for uncompressed extents
when we fail to compress the ranges. Push the extent lock into
run_dealloc_cow(), and adjust the compression case to take the extent
lock after calling run_delalloc_cow().
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since we immediately unlock the extent range when we enter
run_delalloc_compressed() simply move the lock_extent() down to cover
cow_file_range() and then remove the unlock_extent() from
run_delalloc_compressed.
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
run_delalloc_nocow is a little special because we use the file extents
to see if we can nocow a range. We don't actually need the protection
of the extent lock to look at the file extents at this point however.
We are currently holding the page lock for this range, so we are
protected from anybody who would simultaneously be modifying the file
extent items for this range.
* mmap() - we're holding the page lock.
* buffered writes - we're holding the page lock.
* direct writes - we're holding the page lock and direct IO has to flush
page cache before it's able to continue.
* fallocate() - all callers flush the range and wait on ordered extents
while holding the inode lock and the mmap lock, so we are again saved
by the page lock.
We want to use the extent lock to protect
1) The mapping tree for the given range.
2) The ordered extents for the given range.
3) The io_tree for the given range.
Push the extent lock down to cover these operations. In the
fallback_to_cow() case we simply lock before doing anything and rely on
the cow_file_range() helper to handle it's range properly.
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have the following pattern
while (1) {
if (cur_offset > end)
break;
}
Which is just
while (cur_offset <= end) {
...
}
so adjust the code to be more clear.
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
run_delalloc_nocow is a bit special as it walks through the file extents
for the inode and determines what it can nocow and what it can't. This
is the more complicated area for extent locking, so start with this
function.
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We want to limit the scope of the extent lock to be around operations
that can change in flight. Currently we hold the extent lock through
the entire writepage operation, which isn't really necessary.
We want to protect to make sure nobody has updated DELALLOC. In
find_lock_delalloc_range we must lock the range in order to validate the
contents of our io_tree. However once we've done that we're safe to
unlock the range and continue, as we have the page lock already held for
the range.
We are protected from all operations at this point.
* mmap() - we're holding the page lock, thus are protected.
* buffered writes - again, we're protected because we take the page lock
for the first and last page in our range for buffered writes so we
won't create new delalloc ranges in this area.
* direct IO - we invalidate pagecache before attempting to write a new
area, which requires the page lock, so again are protected once we're
holding the page lock on this range.
Additionally this behavior actually already exists for compressed, we
unlock the range as soon as we start to process the async extents, and
re-lock it during compression. So this is completely safe, and makes
the locking more consistent.
Make this simple by just pushing the extent lock into
btrfs_run_delalloc_range. From there followup patches will push the
lock further down into its users.
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We currently don't lock the extent when we're doing a
cow_file_range_inline() for a compressed extent. This isn't a problem
necessarily, but it's inconsistent with the rest of our usage of
cow_file_range_inline(). This also leads to some extra weird logic
around whether the extent is locked or not. Fix this to lock the extent
before calling cow_file_range_inline() in compression to make it
consistent with the rest of the inline users. In future patches this
will be pushed down into the cow_file_range_inline() helper, so we're
fine with the quick and dirty locking here. This patch exists to make
the behavior change obvious.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We duplicate the extent cleanup for cow_file_range_inline() in the cow
and compressed case. The encoded case doesn't need to do cleanup the
same way, so rename cow_file_range_inline to __cow_file_range_inline and
then make cow_file_range_inline handle the extent cleanup appropriately,
and update the callers.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since 4750af3bbe ("btrfs: prevent extent_clear_unlock_delalloc() to
unlock page not locked by __process_pages_contig()") we have been
unlocking the locked page manually instead of via
extent_clear_unlock_delalloc() because of subpage blocksize support.
However we actually disable inline extent creation for subpage blocksize
support, so this behavior isn't necessary. Remove this code and
comment, if at some point the subpage blocksize code grows support for
inline extents this can be re-evaluated.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we have a lot of duplicated checks of
if (start == 0 && fs_info->sectorsize == PAGE_SIZE)
cow_file_range_inline();
Instead of duplicating this check everywhere, consolidate all of the
inline extent logic into a helper which documents all of the checks and
then use that helper inside of cow_file_range_inline(). With this we
can clean up all of the calls to either unconditionally call
cow_file_range_inline(), or at least reduce the checks we're doing
before we call cow_file_range_inline();
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the cow path we will clone the reloc csums for relocated data
extents, and if there's an error we already have an ordered extent and
rely on the ordered extent finishing to clean everything up.
There's a problem however, we don't mark the ordered extent with an
error, we pretend like everything was just fine. If we were at the end
of our range we won't actually bubble up this error anywhere, and we
could end up inserting an extent that doesn't have csums where it should
have them.
Fix this by adding a helper to mark the ordered extent with an error,
and then use this when we fail to lookup the csums in
btrfs_reloc_clone_csums. Use this helper in the other place where we
use the same pattern while we're here.
This will prevent us from erroneously inserting the extent that doesn't
have the required checksums.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function create_io_em() is called before we submit an IO, to update
the in-memory extent map for the involved range.
This patch changes the following aspects:
- Does not allow BTRFS_ORDERED_NOCOW type
For real NOCOW (excluding NOCOW writes into preallocated ranges)
writes, we never call create_io_em(), as we does not need to update
the extent map at all.
So remove the sanity check allowing BTRFS_ORDERED_NOCOW type.
- Add extra sanity checks
* PREALLOC
- @block_len == len
For uncompressed writes.
* REGULAR
- @block_len == @orig_block_len == @ram_bytes == @len
We're creating a new uncompressed extent, and referring all of it.
- @orig_start == @start
We haven no offset inside the extent.
* COMPRESSED
- valid @compress_type
- @len <= @ram_bytes
This is to co-operate with encoded writes, which can cause a new
file extent referring only part of a uncompressed extent.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently try_release_extent_mapping() as an int return type, but we
use it as a boolean. Its only caller, the release folio callback, also
returns a boolean which corresponds to try_release_extent_mapping()'s
return value. So change its return value type to bool as well as its
helper try_release_extent_state().
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A comment from Filipe on one of my previous cleanups brought my
attention to a new helper we have for getting the root id of a root,
which makes it easier to read in the code.
The changes where made with the following Coccinelle semantic patch:
// <smpl>
@@
expression E,E1;
@@
(
E->root_key.objectid = E1
|
- E->root_key.objectid
+ btrfs_root_id(E)
)
// </smpl>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor style fixups ]
Signed-off-by: David Sterba <dsterba@suse.com>
Currently btrfs_prune_dentries() has open code to find the first inode in
a root with a minimum inode number. Remove that code and make it use the
helper btrfs_find_first_inode() for that task.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Export the relocation private helper find_next_inode() to inode.c, as this
same logic is also used at btrfs_prune_dentries() and will be used by an
upcoming change that adds an extent map shrinker. The next patch will
change btrfs_prune_dentries() to use this helper.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of passing fs_info and extent map tree arguments to
btrfs_add_extent_mapping(), we can pass an inode instead, as extent maps
are always inserted in the extent map tree of an inode, and the fs_info
can be extracted from the inode (inode->root->fs_info). The only exception
is in the self tests where we allocate an extent map tree and then use it
to insert/update/remove extent maps. However the tests can be changed to
use a test inode and then use the inode's extent map tree.
So change btrfs_add_extent_mapping() to have an inode as an argument
instead of a fs_info and an extent map tree. This reduces the number of
parameters and will also be needed for an upcoming change.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The csum_exist_in_range() function is now too trivial and is only used in
one place, so open code it in its single caller.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Before deciding if we can do a NOCOW write into a range, one of the things
we have to do is check if there are checksum items for that range. We do
that through the btrfs_lookup_csums_list() function, which searches for
checksums and adds them to a list supplied by the caller.
But all we need is to check if there is any checksum, we don't need to
look for all of them and collect them into a list, which requires more
search time in the checksums tree, allocating memory for checksums items
to add to the list, copy checksums from a leaf into those list items,
then free that memory, etc. This is all unnecessary overhead, wasting
mostly CPU time, and perhaps some occasional IO if we need to read from
disk any extent buffers.
So change btrfs_lookup_csums_list() to allow to return immediately in
case it finds any checksum, without the need to add it to a list and read
it from a leaf. This is accomplished by allowing a NULL list parameter and
making the function return 1 if it found any checksum, 0 if it didn't
found any, and a negative value in case of an error.
The following test with fio was used to measure performance:
$ cat test.sh
#!/bin/bash
DEV=/dev/nullb0
MNT=/mnt/nullb0
cat <<EOF > /tmp/fio-job.ini
[global]
name=fio-rand-write
filename=$MNT/fio-rand-write
rw=randwrite
bssplit=4k/20:8k/20:16k/20:32k/20:64k/20
direct=1
numjobs=16
fallocate=posix
time_based
runtime=300
[file1]
size=8G
ioengine=io_uring
iodepth=16
EOF
umount $MNT &> /dev/null
mkfs.btrfs -f $DEV
mount -o ssd $DEV $MNT
fio /tmp/fio-job.ini
umount $MNT
The test was run on a release kernel (Debian's default kernel config).
The results before this patch:
WRITE: bw=139MiB/s (146MB/s), 8204KiB/s-9504KiB/s (8401kB/s-9732kB/s), io=17.0GiB (18.3GB), run=125317-125344msec
The results after this patch:
WRITE: bw=153MiB/s (160MB/s), 9241KiB/s-10.0MiB/s (9463kB/s-10.5MB/s), io=17.0GiB (18.3GB), run=114054-114071msec
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All the callers of btrfs_lookup_csums_list() pass a value of 0 as the
"search_commit" parameter. So remove it and make the function behave as
to always search from the regular root.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_page_mkwrite() is a struct vm_operations_struct callback and we
define that structure in file.c. Currently the function is in inode.c and
has to be exported to be used in file.c, which makes no sense because it's
not used anywhere else. So move btrfs_page_mkwrite() from inode.c and into
file.c.
While at it do a few minor style changes:
1) Capitalize the first word of every comment and end each sentence with
punctuation;
2) Avoid splitting some statements into two lines when everything fits in
85 characters or less.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At btrfs_finish_one_ordered() it's pointless to assign 0 to the 'ret'
variable because if it has a non-zero value (error), we have already
jumped to the 'out' label. So remove that redundant assignment.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The mod_start and mod_len fields of struct extent_map were introduced by
commit 4e2f84e63d ("Btrfs: improve fsync by filtering extents that we
want") in order to avoid too low performance when fsyncing a file that
keeps getting extent maps merge, because it resulted in each fsync logging
again csum ranges that were already merged before.
We don't need this anymore as extent maps in the list of modified extents
are never merged with other extent maps and once we log an extent map we
remove it from the list of modified extent maps, so it's never logged
twice.
So remove the mod_start and mod_len fields from struct extent_map and use
instead the start and len fields when logging checksums in the fast fsync
path. This also makes EXTENT_FLAG_FILLING unused so remove it as well.
Running the reproducer from the commit mentioned before, with a larger
number of extents and against a null block device, so that IO is fast
and we can better see any impact from searching checksums items and
logging them, gave the following results from dd:
Before this change:
409600000 bytes (410 MB, 391 MiB) copied, 22.948 s, 17.8 MB/s
After this change:
409600000 bytes (410 MB, 391 MiB) copied, 22.9997 s, 17.8 MB/s
So no changes in throughput.
The test was done in a release kernel (non-debug, Debian's default kernel
config) and its steps are the following:
$ mkfs.btrfs -f /dev/nullb0
$ mount /dev/sdb /mnt
$ dd if=/dev/zero of=/mnt/foobar bs=4k count=100000 oflag=sync
$ umount /mnt
This also reduces the size of struct extent_map from 128 bytes down to 112
bytes, so now we can have 36 extents maps per 4K page instead of 32.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For both compression and decompression paths, we always require a
"struct page **pages" and "unsigned long nr_pages", this involves quite
some part of the btrfs compression paths:
- All the compression entry points
- compressed_bio structure
This affects both compression and decompression.
- async_extent structure
Unfortunately with all those involved parts, there is no good way to
split the conversion into smaller patches while still passing compiling.
So do this in one big conversion in one go.
Please note this is direct page->folio conversion, no change on the page
sized folio requirement yet.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor style fixups ]
Signed-off-by: David Sterba <dsterba@suse.com>
Since insert_inline_extent() now only accepts a single page, it's much
easier to convert it to use folio interfaces.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since our inline extent cannot accept anything larger than a sector,
there is really no need to pass all the compressed pages to
insert_inline_extent().
And just in case, expand the ASSERT()s to make sure we only try inline
with compressed size no larger than sectorsize.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we have two wrappers to allocate and free a page for
compression usage:
- btrfs_alloc_compr_page()
- btrfs_free_compr_page()
The allocator would try to grab a page from the pool, and only allocate
a new page if the pool is empty.
The reclaimer would check if the pool is full, and if not full it would
put the page into the pool.
This patch converts both helpers to use folio interfaces, and allowing
further conversion of compression path to folios.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Unify naming of return value to the preferred way.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Unify naming of return value to the preferred way.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no point in having a static writepages callback in inode.c that
does nothing besides calling extent_writepages from extent_io.c.
So just remove the callback at inode.c and rename extent_writepages()
to btrfs_writepages().
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no point in having a static readahead callback in inode.c that
does nothing besides calling extent_readahead() from extent_io.c.
So just remove the callback at inode.c and rename extent_readahead()
to btrfs_readahead().
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In commit b4ccace878 ("btrfs: refactor submit_compressed_extents()"), if
an async extent compressed but failed to find enough space, we changed
from falling back to an uncompressed write to just failing the write
altogether. The principle was that if there's not enough space to write
the compressed version of the data, there can't possibly be enough space
to write the larger, uncompressed version of the data.
However, this isn't necessarily true: due to fragmentation, there could
be enough discontiguous free blocks to write the uncompressed version,
but not enough contiguous free blocks to write the smaller but
unsplittable compressed version.
This has occurred to an internal workload which relied on write()'s
return value indicating there was space. While rare, it has happened a
few times.
Thus, in order to prevent early ENOSPC, re-add a fallback to
uncompressed writing.
Fixes: b4ccace878 ("btrfs: refactor submit_compressed_extents()")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Co-developed-by: Neal Gompa <neal@gompa.dev>
Signed-off-by: Neal Gompa <neal@gompa.dev>
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, this call site in btrfs_clear_delalloc_extent() only converts
the reservation. We are marking it not delalloc, so I don't think it
makes sense to keep the rsv around. This is a path where we are not
sure to join a transaction, so it leads to incorrect free-ing during
umount.
Helps with the pass rate of generic/269 and generic/475.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Create subvolume, create snapshot and delete subvolume all use
btrfs_subvolume_reserve_metadata() to reserve metadata for the changes
done to the parent subvolume's fs tree, which cannot be mediated in the
normal way via start_transaction. When quota groups (squota or qgroups)
are enabled, this reserves qgroup metadata of type PREALLOC. Once the
operation is associated to a transaction, we convert PREALLOC to
PERTRANS, which gets cleared in bulk at the end of the transaction.
However, the error paths of these three operations were not implementing
this lifecycle correctly. They unconditionally converted the PREALLOC to
PERTRANS in a generic cleanup step regardless of errors or whether the
operation was fully associated to a transaction or not. This resulted in
error paths occasionally converting this rsv to PERTRANS without calling
record_root_in_trans successfully, which meant that unless that root got
recorded in the transaction by some other thread, the end of the
transaction would not free that root's PERTRANS, leaking it. Ultimately,
this resulted in hitting a WARN in CONFIG_BTRFS_DEBUG builds at unmount
for the leaked reservation.
The fix is to ensure that every qgroup PREALLOC reservation observes the
following properties:
1. any failure before record_root_in_trans is called successfully
results in freeing the PREALLOC reservation.
2. after record_root_in_trans, we convert to PERTRANS, and now the
transaction owns freeing the reservation.
This patch enforces those properties on the three operations. Without
it, generic/269 with squotas enabled at mkfs time would fail in ~5-10
runs on my system. With this patch, it ran successfully 1000 times in a
row.
Fixes: e85fde5162 ("btrfs: qgroup: fix qgroup meta rsv leak for subvolume operations")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a new statx field for (sub)volume identifiers, as implemented by
btrfs and bcachefs.
This includes bcachefs support; we'll definitely want btrfs support as
well.
Link: https://lore.kernel.org/linux-fsdevel/2uvhm6gweyl7iyyp2xpfryvcu2g3padagaeqcbiavjyiis6prl@yjm725bizncq/
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Link: https://lore.kernel.org/r/20240308022914.196982-1-kent.overstreet@linux.dev
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
The SLAB_MEM_SPREAD flag used to be implemented in SLAB, which was
removed as of v6.8-rc1, so it became a dead flag since the commit
16a1d96835 ("mm/slab: remove mm/slab.c and slab_def.h"). And the
series[1] went on to mark it obsolete to avoid confusion for users.
Here we can just remove all its users, which has no functional change.
[1] https://lore.kernel.org/all/20240223-slab-cleanup-flags-v2-1-02f1753e8303@suse.cz/
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The helpers btrfs_del_delalloc_inode() and __btrfs_del_delalloc_inode()
don't follow the pattern when the "__" helper does a special case and
are in fact reversed regarding the naming. We can merge them into one as
there's only one place that needs to be open coded.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The purpose of the BUG_ON is not clear. The helper btrfs_grab_root()
could return a NULL in case args->root would be a NULL or if there are
zero references. Then we check if the root pointer stored in the inode
still exists.
The whole call chain is for iget:
btrfs_iget
btrfs_iget_path
btrfs_iget_locked
iget5_locked
btrfs_init_locked_inode
which is called from many contexts where we the root pointer is used and
we can safely assume has enough references.
Signed-off-by: David Sterba <dsterba@suse.com>
The may_destroy_subvol() looks up a root by a key, allowing to do an
inexact search when key->offset is -1. It's never expected to find such
item, as it would break the allowed range of a root id.
Signed-off-by: David Sterba <dsterba@suse.com>
The memory allocation error in add_async_extent() is not handled
properly, return an error and push the BUG_ON to the caller. Handling it
there is not trivial so at least make it visible.
Signed-off-by: David Sterba <dsterba@suse.com>
The "do_list" variable has a rather confusing name, so remove it and
directly use btrfs_is_free_space_inode() instead.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The "do_list" variable is only used once, plus its name/meaning is a bit
confusing, so remove it and directory use btrfs_is_free_space_inode().
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When adding or removing and inode to/from the root's delalloc list,
instead of using a BUG_ON() to validate list emptiness, use ASSERT()
since this is to check logic errors rather than real errors.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The merge and split callbacks for an inode's io tree are supposed to be
called while the io tree's spinlock is being held, so that the given
extent_state records are stable, not modified or freed while the callbacks
are using them. So add lockdep assertions in the callbacks.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When setting and clearing a delalloc range, at btrfs_set_delalloc_extent()
and btrfs_clear_delalloc_extent(), we are adding/removing the inode
to/from the root's list of delalloc inodes while under the protection of
the inode's lock. This however is not needed, we can add and remove the
inode to the root's list without holding the inode's lock because here
we are under the protection of the io tree's lock, reducing the size of
the critical section delimited by the inode's lock. The inode's lock is
used in many other places such as when finishing an ordered extent (when
calling btrfs_update_inode_bytes() or btrfs_delalloc_release_metadata(),
or decreasing the number of outstanding extents) or when reserving space
when doing a buffered or direct IO write (calls to functions from
delalloc-space.c).
So move the inode add/remove operations to the root's list of delalloc
inodes to outside the critical section delimited by the inode's lock.
This also allows us to get rid of the BTRFS_INODE_IN_DELALLOC_LIST flag
since we can rely on the inode's delalloc bytes counter to determine if
the inode is or is not in the list.
The following fio based test, that exercises IO to multiple files in the
same subvolume, was used to test:
$ cat test.sh
#!/bin/bash
DEV=/dev/nullb0
MNT=/mnt/nullb0
MOUNT_OPTIONS="-o ssd"
mkfs.btrfs -f $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
fio --direct=0 --ioengine=sync --thread --directory=$MNT \
--invalidate=1 --group_reporting=1 \
--new_group --rw=randwrite --size=50m --numjobs=200 \
--bs=4k --fsync_on_close=0 --fallocate=none --end_fsync=0 \
--name=foo --filename_format=FioWorkloads.\$jobnum
umount $MNT
The test was run on a non-debug kernel (Debian's default kernel config)
against a 16G null block device.
Result before this patch:
WRITE: bw=81.9MiB/s (85.9MB/s), 81.9MiB/s-81.9MiB/s (85.9MB/s-85.9MB/s), io=9.77GiB (10.5GB), run=122136-122136msec
Result after this patch:
WRITE: bw=86.8MiB/s (91.0MB/s), 86.8MiB/s-86.8MiB/s (91.0MB/s-91.0MB/s), io=9.77GiB (10.5GB), run=115180-115180msec
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function btrfs_add_delalloc_inodes() adds a single inode its root's
list of delalloc inodes, so it doesn't make any sense at all for the
function's name to be plural. Rename it to the singular form
btrfs_add_delalloc_inode() to avoid any confusion.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function requires the delalloc lock of the inode's root to be held,
so assert it's held.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no need to pass a root argument to __btrfs_del_delalloc_inode()
and btrfs_del_delalloc_inode(), we can just pass the inode since the root
is always the root associated to that inode. Some remove the root argument
from these functions.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no need to pass a root argument to btrfs_add_delalloc_inodes(), we
can just pass the inode since the root is always the root associated to
the inode in the context it's called. So remove it and have the single
caller pass only the inode.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a convenience helper to get a fs_info from a VFS inode pointer
instead of open coding the chain or using btrfs_sb() that in some cases
does one more pointer hop. This is implemented as a macro (still with
type checking) so we don't need full definitions of struct btrfs_inode,
btrfs_root or btrfs_fs_info.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add convenience helpers to get a fs_info from a page or folio pointer
instead of open coding the chain or using btrfs_sb() that in some cases
does one more pointer hop. This is implemented as a macro (still with
type checking) so we don't need full definitions of struct page, folio,
btrfs_root and btrfs_fs_info. The latter can't be static inlines as this
would create loop between ctree.h <-> fs.h, or the headers would have to
be restructured.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add convenience helpers to get a struct btrfs_inode from a page or folio
pointer instead of open coding the chain or intermediate BTRFS_I. This
is implemented as a macro (still with type checking) so we don't need
full definitions of struct page or address_space.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We've had numerous attempts to let function unpin_extent_cache() return
void as it only returns 0. There are still error cases to handle so do
that, in addition to the verbose messages. The only caller
btrfs_finish_one_ordered() will now abort the transaction, previously it
let it continue which could lead to further problems.
Signed-off-by: David Sterba <dsterba@suse.com>
With help of neovim, LSP and clangd we can identify header files that
are not actually needed to be included in the .c files. This is focused
only on removal (with minor fixups), further cleanups are possible but
will require doing the header files properly with forward declarations,
minimized includes and include-what-you-use care.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The block size stored in the super block is used by subsystems outside
of btrfs and it's a copy of fs_info::sectorsize. Unify that to always
use our sectorsize, with the exception of mount where we first need to
use fixed values (4K) until we read the super block and can set the
sectorsize.
Replace all uses, in most cases it's fewer pointer indirections.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Convert use of struct page to struct folio inside btrfs_truncate_block().
The only page based function is set_page_extent_mapped(). All other
functions have folio equivalents.
Had to use __filemap_get_folio() because filemap_grab_folio() does not
allow passing allocation mask as a parameter.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
The parameter @pg_offset of btrfs_get_extent() is only utilized for
inlined extent, and we already have an ASSERT() and tree-checker, to
make sure we can only get inline extent at file offset 0.
Any invalid inline extent with non-zero file offset would be rejected by
tree-checker in the first place.
Thus the @pg_offset parameter is not really necessary, just remove it.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmXh1GgACgkQxWXV+ddt
WDtnvA/7BN7BZ6QmwWv9UyxhgSBtzI19AXPi/kBsssnnjNuzXoHFaVHj68lQCCOB
a4YjRxAg7nmwFGHdVDTdnwXgUECzqlVkeX9cXg1ZpJy0IfP9RriGedRlC/93z7aV
pg6DnKMh2FlkibK7yO6kRBR8RYLc5aVIytqHXgUeqbaquuhj2Hh8EpqRo2X0RsoE
wDXlK0qgrU8HyrA3fHdqKYPcm1+cYABGTCwGx65iRffy8vH+KFSAr71G8jOJVEUj
DgNWJCpBjXJNs0dsKrik5oGmvLd3GDBKinNX7R2mAvMAMGWrL+fVVTVTfBS/clUT
FBiVFNYCJuphMcO3Qjs6JIuEez0GuGEeh1m+tQ8W795At1FSiINtE5J7LjmJUl5X
FuUwOUpxco1lTXBLX149Y9kk7AEOaqYxy0XbH4r5bbKyuzQegRGB9/qQX4sSaCt3
3T+Td9PvS+6Jo+CDO0dsYhG/h3bsHeHtHGR6f2CiO/m1zHDnTX9aYVcLMM3hsrMI
8OUEy1jkuKnDZQuZuIWES/3V9FlJL34dR3Cb236Pv/yIH1iujIc27g0qXrC1vzPg
wnUL1wheLQ9IRLedXoiHtX2Y2ZfFQGQDrIKNCJFD+WNPkZYffih5QNTV/mPZmL80
9EbjoVTu+6rygzdD43R1RWvK0kFsY44RKhHreI8SItO+e3/0TAs=
=hMf8
-----END PGP SIGNATURE-----
Merge tag 'for-6.8-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix freeing allocated id for anon dev when snapshot creation fails
- fiemap fixes:
- followup for a recent deadlock fix, ranges that fiemap can access
can still race with ordered extent completion
- make sure fiemap with SYNC flag does not race with writes
* tag 'for-6.8-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix double free of anonymous device after snapshot creation failure
btrfs: ensure fiemap doesn't race with writes when FIEMAP_FLAG_SYNC is given
btrfs: fix race between ordered extent completion and fiemap
When FIEMAP_FLAG_SYNC is given to fiemap the expectation is that that
are no concurrent writes and we get a stable view of the inode's extent
layout.
When the flag is given we flush all IO (and wait for ordered extents to
complete) and then lock the inode in shared mode, however that leaves open
the possibility that a write might happen right after the flushing and
before locking the inode. So fix this by flushing again after locking the
inode - we leave the initial flushing before locking the inode to avoid
holding the lock and blocking other RO operations while waiting for IO
and ordered extents to complete. The second flushing while holding the
inode's lock will most of the time do nothing or very little since the
time window for new writes to have happened is small.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmXMewsACgkQxWXV+ddt
WDtFUBAAkEU/hxB4YsLn2JEdp3wc80w5/qKkPaYHsI2ncvc3RFiG+tqSY7BakMgE
Kkdl8ouNX3p/S62ykIBQTKZnOTk7FgKlClAQtgKn1afexqABsP2mifnh40Dzf7eA
VvEl7chnRT6oeivtQkB+BtgOzaOUp4j/8oAivRN8NKNwTxojV4g9PErKSOWfVQSq
3zlrLJbe6era43SpnexkjZHn4Fy4CN+C7FMm+pT/yKzZi2oBZs9BvNZGhIkdnzcK
MftrY9dSGO3CDD2Kvrz3lEm7ZB83wCpm+GTDN7iJx2y+yeW+aHjshFkJr1ApEZQa
lsWTnj3hk3yHoOPUuLlchw5JcFb/dFZ1Ztdwkunf8nmt5a3O/5Zf+Csgze8c+Iii
MJQKi0B/bNQ7cSEwRt36s75kROBItZmHCZmSBlOpT1LXSDQMJ9lvEnv/fPQdcHHF
WMEmk5O5IoGYv5kx5wIoWv27HKE/bDwH6RjkxEd/n17XP+PcfHY4K0o0CGtfwS8g
hdy9RI9X8dbf3ZPrxtsgQ2T8btWs68A4S6nwcSuY5HK0WNmvRh47eLfCI6S6XGJs
hHkppLcc+WTXOskCA+ABdm9hgeAPZkCSpuQSmC2HBt8gRv8XqO7z4cZ/up2N+tES
ZOJSrJb97nusOcxY0pLexnD6eI3pQxzGMiPONlC1Re8CdjZ0l+4=
=RRGT
-----END PGP SIGNATURE-----
Merge tag 'for-6.8-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few regular fixes and one fix for space reservation regression since
6.7 that users have been reporting:
- fix over-reservation of metadata chunks due to not keeping proper
balance between global block reserve and delayed refs reserve; in
practice this leaves behind empty metadata block groups, the
workaround is to reclaim them by using the '-musage=1' balance
filter
- other space reservation fixes:
- do not delete unused block group if it may be used soon
- do not reserve space for checksums for NOCOW files
- fix extent map assertion failure when writing out free space inode
- reject encoded write if inode has nodatasum flag set
- fix chunk map leak when loading block group zone info"
* tag 'for-6.8-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: don't refill whole delayed refs block reserve when starting transaction
btrfs: zoned: fix chunk map leak when loading block group zone info
btrfs: reject encoded write if inode has nodatasum flag set
btrfs: don't reserve space for checksums when writing to nocow files
btrfs: add new unused block groups to the list of unused block groups
btrfs: do not delete unused block group if it may be used soon
btrfs: add and use helper to check if block group is used
btrfs: don't drop extent_map for free space inode on write error
Currently we allow an encoded write against inodes that have the NODATASUM
flag set, either because they are NOCOW files or they were created while
the filesystem was mounted with "-o nodatasum". This results in having
compressed extents without corresponding checksums, which is a filesystem
inconsistency reported by 'btrfs check'.
For example, running btrfs/281 with MOUNT_OPTIONS="-o nodatacow" triggers
this and 'btrfs check' errors out with:
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space tree
[4/7] checking fs roots
root 256 inode 257 errors 1040, bad file extent, some csum missing
root 256 inode 258 errors 1040, bad file extent, some csum missing
ERROR: errors found in fs roots
(...)
So reject encoded writes if the target inode has NODATASUM set.
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While running the CI for an unrelated change I hit the following panic
with generic/648 on btrfs_holes_spacecache.
assertion failed: block_start != EXTENT_MAP_HOLE, in fs/btrfs/extent_io.c:1385
------------[ cut here ]------------
kernel BUG at fs/btrfs/extent_io.c:1385!
invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 1 PID: 2695096 Comm: fsstress Kdump: loaded Tainted: G W 6.8.0-rc2+ #1
RIP: 0010:__extent_writepage_io.constprop.0+0x4c1/0x5c0
Call Trace:
<TASK>
extent_write_cache_pages+0x2ac/0x8f0
extent_writepages+0x87/0x110
do_writepages+0xd5/0x1f0
filemap_fdatawrite_wbc+0x63/0x90
__filemap_fdatawrite_range+0x5c/0x80
btrfs_fdatawrite_range+0x1f/0x50
btrfs_write_out_cache+0x507/0x560
btrfs_write_dirty_block_groups+0x32a/0x420
commit_cowonly_roots+0x21b/0x290
btrfs_commit_transaction+0x813/0x1360
btrfs_sync_file+0x51a/0x640
__x64_sys_fdatasync+0x52/0x90
do_syscall_64+0x9c/0x190
entry_SYSCALL_64_after_hwframe+0x6e/0x76
This happens because we fail to write out the free space cache in one
instance, come back around and attempt to write it again. However on
the second pass through we go to call btrfs_get_extent() on the inode to
get the extent mapping. Because this is a new block group, and with the
free space inode we always search the commit root to avoid deadlocking
with the tree, we find nothing and return a EXTENT_MAP_HOLE for the
requested range.
This happens because the first time we try to write the space cache out
we hit an error, and on an error we drop the extent mapping. This is
normal for normal files, but the free space cache inode is special. We
always expect the extent map to be correct. Thus the second time
through we end up with a bogus extent map.
Since we're deprecating this feature, the most straightforward way to
fix this is to simply skip dropping the extent map range for this failed
range.
I shortened the test by using error injection to stress the area to make
it easier to reproduce. With this patch in place we no longer panic
with my error injection test.
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmWurp4ACgkQxWXV+ddt
WDsqSg/+OS5/1Cr2W6/3ns2hannEeAzYUeoRDNhNHluHOSufXS52QTckQdiA62BO
iMKGoIxZIn9BQPlvil1hi+jIEt/9qsRt/Qc6oBnzvlto21tJCoS486PJAShu6Sj5
jXKxtR7d6WrJEfk65uzatk1SbRguRKFxSrFlkaOeOHAmWsD54p/BnsZ/pqxPjF8W
LOFvwdhbTw3pzQ873b+hJg16rm4IenAnuazZNmXRdSufgdPEcArv0l7fMr4xTBvO
DBQXoM5GBGVHV2+IsrZiK39p7khz9ej2Ob4rps/x6PduC+GPxGtm6iLy8dZts+hV
D1FOHh3fqWmV2LQIzLNNu9N7sj5sF5dNFRZHSkq4qFNVNQYfvyFg43iJKfUnMY/s
puUm7ElSF3tLC2pRys0m/jDfkykZVFFZzbayfYQn+jRKuUASyXnWqmCKlljkLJD5
ekFXPpor+SQzQso9x0OpAjkSIUmmYFqSvoJCCczPFoo/3EDPv4C6VGOPEQyN6dDH
nBjn7fLXmn4hpdEKia+LU1MhajFis+SUlmjaoTh7UfCCzXDosDOPThRC1Kx0rNlY
t4KON8pMUCK3iGEce+7iOSwEImDDU4B7DUARey/sF0C8cs7jRsX8bf8eFTrEId8M
4C2sLmTw0JJ5n2I2soyTi9fHrGJnJamUlzp/hLrp8JyMzy6qBrs=
=38MW
-----END PGP SIGNATURE-----
Merge tag 'for-6.8-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- zoned mode fixes:
- fix slowdown when writing large file sequentially by looking up
block groups with enough space faster
- locking fixes when activating a zone
- new mount API fixes:
- preserve mount options for a ro/rw mount of the same subvolume
- scrub fixes:
- fix use-after-free in case the chunk length is not aligned to
64K, this does not happen normally but has been reported on
images converted from ext4
- similar alignment check was missing with raid-stripe-tree
- subvolume deletion fixes:
- prevent calling ioctl on already deleted subvolume
- properly track flag tracking a deleted subvolume
- in subpage mode, fix decompression of an inline extent (zlib, lzo,
zstd)
- fix crash when starting writeback on a folio, after integration with
recent MM changes this needs to be started conditionally
- reject unknown flags in defrag ioctl
- error handling, API fixes, minor warning fixes
* tag 'for-6.8-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: scrub: limit RST scrub to chunk boundary
btrfs: scrub: avoid use-after-free when chunk length is not 64K aligned
btrfs: don't unconditionally call folio_start_writeback in subpage
btrfs: use the original mount's mount options for the legacy reconfigure
btrfs: don't warn if discard range is not aligned to sector
btrfs: tree-checker: fix inline ref size in error messages
btrfs: zstd: fix and simplify the inline extent decompression
btrfs: lzo: fix and simplify the inline extent decompression
btrfs: zlib: fix and simplify the inline extent decompression
btrfs: defrag: reject unknown flags of btrfs_ioctl_defrag_range_args
btrfs: avoid copying BTRFS_ROOT_SUBVOL_DEAD flag to snapshot of subvolume being deleted
btrfs: don't abort filesystem when attempting to snapshot deleted subvolume
btrfs: zoned: fix lock ordering in btrfs_zone_activate()
btrfs: fix unbalanced unlock of mapping_tree_lock
btrfs: ref-verify: free ref cache before clearing mount opt
btrfs: fix kvcalloc() arguments order in btrfs_ioctl_send()
btrfs: zoned: optimize hint byte for zoned allocator
btrfs: zoned: factor out prepare_allocation_zoned()
Sweet Tea spotted a race between subvolume deletion and snapshotting
that can result in the root item for the snapshot having the
BTRFS_ROOT_SUBVOL_DEAD flag set. The race is:
Thread 1 | Thread 2
----------------------------------------------|----------
btrfs_delete_subvolume |
btrfs_set_root_flags(BTRFS_ROOT_SUBVOL_DEAD)|
|btrfs_mksubvol
| down_read(subvol_sem)
| create_snapshot
| ...
| create_pending_snapshot
| copy root item from source
down_write(subvol_sem) |
This flag is only checked in send and swap activate, which this would
cause to fail mysteriously.
create_snapshot() now checks the root refs to reject a deleted
subvolume, so we can fix this by locking subvol_sem earlier so that the
BTRFS_ROOT_SUBVOL_DEAD flag and the root refs are updated atomically.
CC: stable@vger.kernel.org # 4.14+
Reported-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmWYTmMACgkQxWXV+ddt
WDvPRg/+KgS5LV3nNC0MguYcTMQxmgeutIgXZIMfeA3v6EnFS7nj8leP4EPc6+bj
JPSkwj4u2vHVwpnTVuEAuJUXnmFY+Qu70nVy6bM2uOHOYTVBQ8zRVK4cErNNLWCp
OekDaADR53RrZ/xprlQ7b7Ph0Ch2uq9OrpH50IcyquEsH1ffkxlqwyrvth4/8dxC
6zgsFHWrbtVKJf0DYoQPpjEPz5tpdQ+xHZwtmf1cNlUgI1objODr/ZTqXtZqTfw4
/GwrtDPbEri53K/qjgr0dDH7pBVqD6PtnbgoHfYkiizZ0G7UkmlaK6rZIurtATJb
Yk/RCqCUp9tPC4yeFSewFMm1Y8Ae3rkUBG7rnYkvMmBspMqyh/kQAWSBimF5yk/y
vFEdFTe9AbdvP19Nw0CqovLzaO6RrOXCL1usnFvCmBgvF5gZAv63ZW1njP3ZoNta
wB8Rs6hxdRkph8Dk7yvYf54uUR+JyKqjHY6egg2qkKTjz0CSf6qQFyFZXpr81m97
gK4WN5SeP/P2ukRbBKKyzZ5IljUxZuVatvJa0tktd7kAbU26WLzofOJ7pX+iqimM
F2G7gKGJZykLY1WPntXBp9Dg97Ras2O5iViQ7ZKwRdOx1yZS5zzTYlIznHBAmXbL
UgXfVnpJH1xFdkvedNTn+Fz9BHNV1K2a2AT7VITj7sxz23z3aJA=
=4sw3
-----END PGP SIGNATURE-----
Merge tag 'for-6.8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"There are no exciting changes for users, it's been mostly API
conversions and some fixes or refactoring.
The mount API conversion is a base for future improvements that would
come with VFS. Metadata processing has been converted to folios, not
yet enabling the large folios but it's one patch away once everything
gets tested enough.
Core changes:
- convert extent buffers to folios:
- direct API conversion where possible
- performance can drop by a few percent on metadata heavy
workloads, the folio sizes are not constant and the calculations
add up in the item helpers
- both regular and subpage modes
- data cannot be converted yet, we need to port that to iomap and
there are some other generic changes required
- convert mount to the new API, should not be user visible:
- options deprecated long time ago have been removed: inode_cache,
recovery
- the new logic that splits mount to two phases slightly changes
timing of device scanning for multi-device filesystems
- LSM options will now work (like for selinux)
- convert delayed nodes radix tree to xarray, preserving the
preload-like logic that still allows to allocate with GFP_NOFS
- more validation of sysfs value of scrub_speed_max
- refactor chunk map structure, reduce size and improve performance
- extent map refactoring, smaller data structures, improved
performance
- reduce size of struct extent_io_tree, embedded in several
structures
- temporary pages used for compression are cached and attached to a
shrinker, this may slightly improve performance
- in zoned mode, remove redirty extent buffer tracking, zeros are
written in case an out-of-order is detected and proper data are
written to the actual write pointer
- cleanups, refactoring, error message improvements, updated tests
- verify and update branch name or tag
- remove unwanted text"
* tag 'for-6.8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (89 commits)
btrfs: pass btrfs_io_geometry into btrfs_max_io_len
btrfs: pass struct btrfs_io_geometry to set_io_stripe
btrfs: open code set_io_stripe for RAID56
btrfs: change block mapping to switch/case in btrfs_map_block
btrfs: factor out block mapping for single profiles
btrfs: factor out block mapping for RAID5/6
btrfs: reduce scope of data_stripes in btrfs_map_block
btrfs: factor out block mapping for RAID10
btrfs: factor out block mapping for DUP profiles
btrfs: factor out RAID1 block mapping
btrfs: factor out block-mapping for RAID0
btrfs: re-introduce struct btrfs_io_geometry
btrfs: factor out helper for single device IO check
btrfs: migrate btrfs_repair_io_failure() to folio interfaces
btrfs: migrate eb_bitmap_offset() to folio interfaces
btrfs: migrate various end io functions to folios
btrfs: migrate subpage code to folio interfaces
btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios
btrfs: don't double put our subpage reference in alloc_extent_buffer
btrfs: cleanup metadata page pointer usage
...